I’m converting PyStarburst DataFrames into pandas so that my colleagues can work with them using familiar techniques.
However, the conversion returns every column as object (string/varchar) instead of the original data types from the SQL schema.
Here is the Python workflow:

```python
import trino
from pystarburst import Session
# Connection parameters
db_parameters = {
    "host": "",
    "port": "",
    "http_scheme": "",
    "catalog": "",
    "schema": "",
    "verify": "",
    "auth": trino.auth.BasicAuthentication("usr", "pwd"),
}
session = Session.builder.configs(db_parameters).create()
sql_query = """
SELECT
column1, -- varchar
column2, -- decimal(18,2)
column3 -- date
FROM schema.table
WHERE column1 LIKE 'A%'
"""
df = session.sql(sql_query).to_pandas()  # Convert to pandas as colleagues are familiar with it
df.info(verbose=True)  # Every column comes back as "object", which makes the downstream ETL harder
df = normalize(df)  # Custom ETL built on the pandas library so colleagues can adjust it
df.to_parquet("file_path", index=None)  # Results saved for analytical use
```
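As a stopgap, I can coerce the dtypes by hand after the conversion. This is only a minimal sketch, assuming the three columns from the query above all come back as object; note that `pd.to_numeric` converts to float64, which loses exact precision for a decimal(18,2):

```python
import pandas as pd

df = session.sql(sql_query).to_pandas()

# Hand-written casts based on the SQL schema of the query above
df["column2"] = pd.to_numeric(df["column2"])   # decimal(18,2) -> float64 (lossy for exact decimals)
df["column3"] = pd.to_datetime(df["column3"])  # date -> datetime64[ns]

df.info(verbose=True)  # column2 and column3 now carry numeric/datetime dtypes
```

This works, but I’d rather the original types survive `to_pandas()` so these casts don’t have to be maintained by hand for every query.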
I ran the following on Starburst Galaxy and don’t see the problem quite as severely as you do. The varchar columns (and, sadly, the date field) do become object, but the integer, bigint, and double columns all come through with the correct dtypes for me.
```python
# Skipping the creation of the session for brevity
ordersPySbDf = session.sql("select * from tpch.tiny.orders")

print('** cols and type from PyStarburst dataframe \n')
for field in ordersPySbDf.schema.fields:
    print(field.name + " , " + str(field.datatype))

pandasDf = ordersPySbDf.to_pandas()

print('\n\n** cols and type from pandas dataframe (one way) \n')
pandasDf.info(verbose=True)

print('\n\n** cols and type from pandas dataframe (another way) \n')
print(pandasDf.dtypes)
```
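If you just need the pandas frame to end up typed, one option is to drive the casts from the PyStarburst schema printed above. This is only a sketch under a couple of assumptions: that `field.name` matches the pandas column labels, and that `str(field.datatype)` renders names like "Date" or "Decimal" (match the checks to whatever your loop actually prints):

```python
import pandas as pd

pandasDf = ordersPySbDf.to_pandas()

# Cast the columns that to_pandas() leaves as object, driven by the
# PyStarburst schema; integer/bigint/double already convert cleanly.
for field in ordersPySbDf.schema.fields:
    dt = str(field.datatype).lower()
    if "date" in dt or "timestamp" in dt:
        pandasDf[field.name] = pd.to_datetime(pandasDf[field.name])
    elif "decimal" in dt:
        # float64 is lossy for high-precision decimals, but fine for most ETL
        pandasDf[field.name] = pd.to_numeric(pandasDf[field.name])

print(pandasDf.dtypes)
```

The columns that already convert cleanly are left untouched, so this only patches up the object-typed ones.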