Inferring dtypes in get_as_dataframe
This is an enhancement proposal.
For my use case, it would be nice if gspread-dataframe could try to infer column dtypes when fetching data from a sheet. While individual cells are converted through numericise, their column dtype remains object, and the returned dataframe fails equality checks with the original dataframe.
Motivating example
>>> df = pd.DataFrame({'a': [4,1,2,4],
... 'b': list('abba')},
... index=pd.Index(list('ABCD'), name='our index'))
>>> df
a b
our index
A 4 a
B 1 b
C 2 b
D 4 a
>>> df.dtypes
a int64
b object
dtype: object
>>> ws = ...  # Get a test worksheet here
>>> set_with_dataframe(ws, df, include_index=True, resize=True)
>>> r = get_as_dataframe(ws, index_column_number=1)
>>> r # Looks as expected
a b
our index
A 4 a
B 1 b
C 2 b
D 4 a
>>> r.dtypes # All object dtype
a object
b object
dtype: object
>>> [type(v) for v in r['a']] # correctly converted to int
[int, int, int, int]
>>> df.equals(r) # The equality check fails
False
>>> df['a'].equals(r['a']) # because of the dtype of column 'a'.
False
>>> df['a'] == r['a'] # The values *are* the same, though.
our index
A True
B True
C True
D True
Name: a, dtype: bool
>>> df['b'].equals(r['b']) # str works as expected
True
Suggested solution
I am unsure what the best way to deal with this is, or whether it is a general enough use case to warrant an addition to gspread-dataframe. At any rate, the following code is my initial stab at how dtype inference could be implemented:
import pandas as pd

# Order matters: to_timedelta must come before to_datetime (see note below).
converters = (
    pd.to_numeric,
    pd.to_timedelta,
    pd.to_datetime,
)

def _assign_column_dtypes(df):
    # Try each converter on every remaining object-dtype column;
    # errors='ignore' leaves a column untouched if it cannot be parsed.
    for conv in converters:
        for col in df:
            if df[col].dtype != object:
                continue
            df[col] = conv(df[col], errors='ignore')
    return df
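One caveat with the code above: newer pandas releases deprecate errors='ignore' on these converters. A try/except wrapper (a sketch of mine, with a hypothetical helper name _maybe_convert) gives the same pass-through behavior without the deprecated flag:

```python
import pandas as pd

def _maybe_convert(series, conv):
    # Same pass-through behavior as errors='ignore': return the
    # converted series, or the original untouched if parsing fails.
    try:
        return conv(series)
    except (ValueError, TypeError):
        return series

nums = _maybe_convert(pd.Series(['1', '2', '3']), pd.to_numeric)
text = _maybe_convert(pd.Series(['a', 'b']), pd.to_numeric)
print(nums.dtype)  # int64
print(text.dtype)  # object
```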
It intentionally places timedelta before datetime, as ‘00:03:00’ can be interpreted as either one by pandas. In my use-case, datetimes always include a date, so ‘00:03:00’ would definitely be a timedelta.
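To see the ambiguity, compare how pandas parses the same strings through each converter:

```python
import pandas as pd

s = pd.Series(['00:03:00', '00:04:10'])
as_td = pd.to_timedelta(s)  # durations: 3 min, 4 min 10 s
as_dt = pd.to_datetime(s)   # times of day on today's date
print(as_td.dtype)  # timedelta64[ns]
print(as_dt.dtype)  # datetime64[ns]
```

Running to_timedelta first therefore claims these columns before to_datetime ever sees them.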
Take it for a spin!
# Construct a dataframe where everything is either str or object
n = 10
df = pd.DataFrame({
    'datetime str': pd.date_range('2017-03-15', freq='D',
                                  periods=n).astype(str),
    'timedelta str': pd.timedelta_range('00:03:00', periods=n,
                                        freq='10 s').to_native_types().astype(str),
    'int obj': pd.Series(range(n), dtype=object),
    'int str': [str(i) for i in range(n)],
    'float obj': pd.Series(map(float, range(n)), dtype=object),
    'float str': [str(float(i)) for i in range(n)],
})
print(df)
# datetime str float obj float str int obj int str timedelta str
# 0 2017-03-15 0 0.0 0 0 00:03:00
# 1 2017-03-16 1 1.0 1 1 00:03:10
# 2 2017-03-17 2 2.0 2 2 00:03:20
# 3 2017-03-18 3 3.0 3 3 00:03:30
# 4 2017-03-19 4 4.0 4 4 00:03:40
# 5 2017-03-20 5 5.0 5 5 00:03:50
# 6 2017-03-21 6 6.0 6 6 00:04:00
# 7 2017-03-22 7 7.0 7 7 00:04:10
# 8 2017-03-23 8 8.0 8 8 00:04:20
# 9 2017-03-24 9 9.0 9 9 00:04:30
print(df.dtypes)
# datetime str object
# float obj object
# float str object
# int obj object
# int str object
# timedelta str object
# dtype: object
df = _assign_column_dtypes(df)
print(df)
# datetime str float obj float str int obj int str timedelta str
# 0 2017-03-15 0.0 0.0 0 0 00:03:00
# 1 2017-03-16 1.0 1.0 1 1 00:03:10
# 2 2017-03-17 2.0 2.0 2 2 00:03:20
# 3 2017-03-18 3.0 3.0 3 3 00:03:30
# 4 2017-03-19 4.0 4.0 4 4 00:03:40
# 5 2017-03-20 5.0 5.0 5 5 00:03:50
# 6 2017-03-21 6.0 6.0 6 6 00:04:00
# 7 2017-03-22 7.0 7.0 7 7 00:04:10
# 8 2017-03-23 8.0 8.0 8 8 00:04:20
# 9 2017-03-24 9.0 9.0 9 9 00:04:30
print(df.dtypes)
# datetime str datetime64[ns]
# float obj float64
# float str float64
# int obj int64
# int str int64
# timedelta str timedelta64[ns]
# dtype: object
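For comparison (my own aside, not part of the proposal): pandas' built-in DataFrame.infer_objects handles object columns that already hold Python numbers, but not the string columns, which is why explicit converters are needed here:

```python
import pandas as pd

df = pd.DataFrame({
    'int obj': pd.Series(range(3), dtype=object),  # Python ints
    'int str': [str(i) for i in range(3)],         # strings
})
out = df.infer_objects()
print(out['int obj'].dtype)  # int64
print(out['int str'].dtype)  # object (strings are not parsed)
```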
Issue Analytics
- Created 6 years ago
- Comments: 9 (9 by maintainers)
Top GitHub Comments
@NTAWolf I've opened #2 to represent the switch to TextParser; a PR will happen in the next few days to implement. I'm going to close this issue; let me know if you think it should be re-opened.

OK, I will be adding some tests to exercise the different keyword arguments for TextParser and ensure that the resulting DataFrames are as expected. Then I will plan a major version release. In the meantime, a quick recipe with the current release is below. (It will always evaluate_formulas; to effect evaluate_formulas=False, you will need to build a list of values yourself using cell.input_value.)
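The recipe itself did not survive in this extract. As a rough illustration of the TextParser direction only (my own sketch, not the maintainer's recipe): pandas' pandas.io.parsers.TextParser accepts a list of rows and applies the same dtype inference as read_csv, so column dtypes come out typed rather than object:

```python
from pandas.io.parsers import TextParser

# Rows as you might get from a worksheet's values; the first row
# is treated as the header, the first column as the index.
rows = [
    ['our index', 'a', 'b'],
    ['A', '4', 'a'],
    ['B', '1', 'b'],
]
df = TextParser(rows, header=0, index_col=0).read()
print(df['a'].dtype)  # int64
```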