Inferring dtypes in get_as_dataframe
This is an enhancement proposal.
For my use case, it would be nice if gspread-dataframe could try to infer column dtypes when fetching data from a sheet. While individual cells are converted through numericise, their column dtype remains object, and the returned dataframe fails equality checks with the original dataframe.
Motivating example
>>> df = pd.DataFrame({'a': [4,1,2,4],
... 'b': list('abba')},
... index=pd.Index(list('ABCD'), name='our index'))
>>> df
a b
our index
A 4 a
B 1 b
C 2 b
D 4 a
>>> df.dtypes
a int64
b object
dtype: object
>>> ws = ...  # Get a test worksheet here
>>> set_with_dataframe(ws, df, include_index=True, resize=True)
>>> r = get_as_dataframe(ws, index_column_number=1)
>>> r # Looks as expected
a b
our index
A 4 a
B 1 b
C 2 b
D 4 a
>>> r.dtypes # All object dtype
a object
b object
dtype: object
>>> [type(v) for v in r['a']] # correctly converted to int
[int, int, int, int]
>>> df.equals(r) # The equality check fails
False
>>> df['a'].equals(r['a']) # because of the dtype of column 'a'.
False
>>> df['a'] == r['a'] # The values *are* the same, though.
our index
A True
B True
C True
D True
Name: a, dtype: bool
>>> df['b'].equals(r['b']) # str works as expected
True
Suggested solution
I am unsure what the best way to deal with this is, or whether it is a general enough use case to warrant an addition to gspread-dataframe. At any rate, the following code is my initial stab at how dtype inference could be implemented:
import pandas as pd

# Order matters: to_timedelta must come before to_datetime (see note below).
converters = (
    pd.to_numeric,
    pd.to_timedelta,
    pd.to_datetime,
)

def _assign_column_dtypes(df):
    # Try each converter on every remaining object-dtype column;
    # errors='ignore' leaves a column untouched if it cannot be parsed.
    for conv in converters:
        for col in df:
            if df[col].dtype != object:
                continue
            df[col] = conv(df[col], errors='ignore')
    return df
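One caveat with the code above: newer pandas releases deprecate errors='ignore' on these converters. A try/except wrapper (a sketch of mine, with a hypothetical helper name _maybe_convert) gives the same pass-through behavior without the deprecated flag:

```python
import pandas as pd

def _maybe_convert(series, conv):
    # Same pass-through behavior as errors='ignore': return the
    # converted series, or the original untouched if parsing fails.
    try:
        return conv(series)
    except (ValueError, TypeError):
        return series

nums = _maybe_convert(pd.Series(['1', '2', '3']), pd.to_numeric)
text = _maybe_convert(pd.Series(['a', 'b']), pd.to_numeric)
print(nums.dtype)  # int64
print(text.dtype)  # object
```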
It intentionally places timedelta before datetime, as ‘00:03:00’ can be interpreted as either one by pandas. In my use-case, datetimes always include a date, so ‘00:03:00’ would definitely be a timedelta.
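To see the ambiguity, compare how pandas parses the same strings through each converter:

```python
import pandas as pd

s = pd.Series(['00:03:00', '00:04:10'])
as_td = pd.to_timedelta(s)  # durations: 3 min, 4 min 10 s
as_dt = pd.to_datetime(s)   # times of day on today's date
print(as_td.dtype)  # timedelta64[ns]
print(as_dt.dtype)  # datetime64[ns]
```

Running to_timedelta first therefore claims these columns before to_datetime ever sees them.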
Take it for a spin!
# Construct a dataframe where everything is either str or object
n = 10
df = pd.DataFrame({
    'datetime str': pd.date_range('2017-03-15', freq='D',
                                  periods=n).astype(str),
    'timedelta str': pd.timedelta_range('00:03:00', periods=n,
                                        freq='10 s').to_native_types().astype(str),
    'int obj': pd.Series(range(n), dtype=object),
    'int str': [str(i) for i in range(n)],
    'float obj': pd.Series(map(float, range(n)), dtype=object),
    'float str': [str(float(i)) for i in range(n)],
})
print(df)
# datetime str float obj float str int obj int str timedelta str
# 0 2017-03-15 0 0.0 0 0 00:03:00
# 1 2017-03-16 1 1.0 1 1 00:03:10
# 2 2017-03-17 2 2.0 2 2 00:03:20
# 3 2017-03-18 3 3.0 3 3 00:03:30
# 4 2017-03-19 4 4.0 4 4 00:03:40
# 5 2017-03-20 5 5.0 5 5 00:03:50
# 6 2017-03-21 6 6.0 6 6 00:04:00
# 7 2017-03-22 7 7.0 7 7 00:04:10
# 8 2017-03-23 8 8.0 8 8 00:04:20
# 9 2017-03-24 9 9.0 9 9 00:04:30
print(df.dtypes)
# datetime str object
# float obj object
# float str object
# int obj object
# int str object
# timedelta str object
# dtype: object
df = _assign_column_dtypes(df)
print(df)
# datetime str float obj float str int obj int str timedelta str
# 0 2017-03-15 0.0 0.0 0 0 00:03:00
# 1 2017-03-16 1.0 1.0 1 1 00:03:10
# 2 2017-03-17 2.0 2.0 2 2 00:03:20
# 3 2017-03-18 3.0 3.0 3 3 00:03:30
# 4 2017-03-19 4.0 4.0 4 4 00:03:40
# 5 2017-03-20 5.0 5.0 5 5 00:03:50
# 6 2017-03-21 6.0 6.0 6 6 00:04:00
# 7 2017-03-22 7.0 7.0 7 7 00:04:10
# 8 2017-03-23 8.0 8.0 8 8 00:04:20
# 9 2017-03-24 9.0 9.0 9 9 00:04:30
print(df.dtypes)
# datetime str datetime64[ns]
# float obj float64
# float str float64
# int obj int64
# int str int64
# timedelta str timedelta64[ns]
# dtype: object
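For comparison (my own aside, not part of the proposal): pandas' built-in DataFrame.infer_objects handles object columns that already hold Python numbers, but not the string columns, which is why explicit converters are needed here:

```python
import pandas as pd

df = pd.DataFrame({
    'int obj': pd.Series(range(3), dtype=object),  # Python ints
    'int str': [str(i) for i in range(3)],         # strings
})
out = df.infer_objects()
print(out['int obj'].dtype)  # int64
print(out['int str'].dtype)  # object (strings are not parsed)
```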
Issue Analytics
- Created 6 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
@NTAWolf I've opened #2 to represent the switch to TextParser; a PR will happen in the next few days to implement. I'm going to close this issue; let me know if you think it should be re-opened.

OK, I will be adding some tests to exercise the different keyword arguments for TextParser and ensure that the resulting DataFrames are as expected. Then I will plan a major version release. In the meantime, a quick recipe with the current release is below. (It will always evaluate_formulas; to effect evaluate_formulas=False, you will need to build a list of values yourself using cell.input_value.)
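The recipe itself did not survive in this extract. As a rough illustration of the TextParser direction only (my own sketch, not the maintainer's recipe): pandas' pandas.io.parsers.TextParser accepts a list of rows and applies the same dtype inference as read_csv, so column dtypes come out typed rather than object:

```python
from pandas.io.parsers import TextParser

# Rows as you might get from a worksheet's values; the first row
# is treated as the header, the first column as the index.
rows = [
    ['our index', 'a', 'b'],
    ['A', '4', 'a'],
    ['B', '1', 'b'],
]
df = TextParser(rows, header=0, index_col=0).read()
print(df['a'].dtype)  # int64
```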