df = session.sql(sql_query).to_pandas() returns all columns as string

I’m converting PyStarburst dataframes into pandas so that my colleagues can work with the data using familiar techniques.

However, the conversion returns all columns as object (string/varchar) instead of the original data types from the SQL schema.

Here is the Python workflow:

```
import trino
from pystarburst import Session

# Connection parameters
db_parameters = {
    "host": "",
    "port": "",
    "http_scheme": "",
    "catalog": "",
    "schema": "",
    "verify": "",
    "auth": trino.auth.BasicAuthentication(
        "usr",
        "pwd"
    )
}

session = Session.builder.configs(db_parameters).create()

sql_query = """
SELECT
    column1, -- varchar
    column2, -- decimal(18,2)
    column3 -- date
FROM schema.table
WHERE column1 LIKE 'A%'
"""

df = session.sql(sql_query).to_pandas() # Convert to pandas as colleagues are familiar with it

df.info(verbose=True)  # All the columns return as "object" making it difficult for them to do ETL in the next step

df = normalize(df) # ETL based on pandas library so colleagues can adjust

df.to_parquet("file_path", index=None) # Results saved for analytical use
```
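Until the conversion preserves types, one workaround is to cast the object columns back to proper dtypes in pandas before the `normalize` step. This is only a sketch against the query above: the `column1`/`column2`/`column3` names are the placeholders from the example, and the sample values are made up, so adjust both to your actual schema:

```python
import pandas as pd

# Hypothetical stand-in for the to_pandas() result: every column arrives as object.
df = pd.DataFrame(
    {
        "column1": ["A1", "A2"],                   # varchar
        "column2": ["10.50", "20.00"],             # decimal(18,2), serialized as strings
        "column3": ["2024-01-01", "2024-01-02"],   # date, serialized as strings
    },
    dtype="object",
)

# Cast the non-string columns back to usable pandas dtypes.
df["column2"] = pd.to_numeric(df["column2"])   # decimals become float64
df["column3"] = pd.to_datetime(df["column3"])  # dates become datetime64[ns]

print(df.dtypes)
```

Casting before `to_parquet` also means the Parquet file stores real numeric and date types instead of strings, which helps whoever reads the file downstream.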

I’m running the following on Starburst Galaxy and don’t see the problem being quite as bad as you do. It seems the varchar columns (and sadly the date column) below become objects, but the integer, bigint, and double columns all come through with the right dtypes for me.

```
# skipping the creation of the session for brevity

ordersPySbDf = session.sql("select * from tpch.tiny.orders")

print('** cols and type from PyStarburst dataframe \n')
for field in ordersPySbDf.schema.fields:
    print(field.name + " , " + str(field.datatype))

pandasDf = ordersPySbDf.to_pandas()

print('\n\n** cols and type from pandas dataframe (one way) \n')
pandasDf.info(verbose=True)

print('\n\n** cols and type from pandas dataframe (nuther way) \n')
print(pandasDf.dtypes)
```

Output:

```
** cols and type from PyStarburst dataframe 

orderkey , type='LongType'
custkey , type='LongType'
orderstatus , size=None type='StringType'
totalprice , type='DoubleType'
orderdate , type='DateType'
orderpriority , size=None type='StringType'
clerk , size=None type='StringType'
shippriority , type='IntegerType'
comment , size=None type='StringType'


** cols and type from pandas dataframe (one way) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   orderkey       15000 non-null  int64  
 1   custkey        15000 non-null  int64  
 2   orderstatus    15000 non-null  object 
 3   totalprice     15000 non-null  float64
 4   orderdate      15000 non-null  object 
 5   orderpriority  15000 non-null  object 
 6   clerk          15000 non-null  object 
 7   shippriority   15000 non-null  int64  
 8   comment        15000 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 1.0+ MB


** cols and type from pandas dataframe (nuther way) 

orderkey           int64
custkey            int64
orderstatus       object
totalprice       float64
orderdate         object
orderpriority     object
clerk             object
shippriority       int64
comment           object
dtype: object
```
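Since the PyStarburst dataframe still carries the original schema, one option is to drive the pandas casts from it instead of hard-coding column names. The loop below is only a sketch: `schema` is simulated here as plain (name, type-string) pairs matching the printout above so the snippet runs standalone; on a real session you would derive those pairs from `ordersPySbDf.schema.fields` instead.

```python
import pandas as pd

# Simulated (column, PyStarburst type) pairs, mirroring the schema printout above.
schema = [
    ("orderkey", "LongType"),
    ("orderstatus", "StringType"),
    ("totalprice", "DoubleType"),
    ("orderdate", "DateType"),
]

# Hypothetical raw to_pandas() result: the date column arrives as object.
pandasDf = pd.DataFrame(
    {
        "orderkey": [1, 2],
        "orderstatus": ["O", "F"],
        "totalprice": [173665.47, 46929.18],
        "orderdate": ["1996-01-02", "1996-12-01"],
    }
)

# Cast any column whose source type says it is a date or timestamp.
for name, dtype in schema:
    if dtype in ("DateType", "TimestampType"):
        pandasDf[name] = pd.to_datetime(pandasDf[name])

print(pandasDf.dtypes)
```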

Can you try to reproduce what I got? If you get something different, can you run it the same way I did? You can open PyStarburst/notebooks/PlayingWithPandas.ipynb in the lestermartin/starburst-dataframes-exploration repo on GitHub and click the ‘Open in Colab’ button at the top.
