
Inferring dtypes in get_as_dataframe


This is an enhancement proposal.

For my use case, it would be nice if gspread-dataframe could try to infer column dtypes when fetching data from a sheet. While individual cells are converted through numericise, the column dtypes remain object, so the returned dataframe fails equality checks with the original dataframe.

Motivating example

>>> df = pd.DataFrame({'a': [4,1,2,4],
...                    'b': list('abba')},
...                    index=pd.Index(list('ABCD'), name='our index'))
>>> df
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> df.dtypes
a     int64
b    object
dtype: object
>>> ws = ...  # Get a test worksheet here
>>> set_with_dataframe(ws, df, include_index=True, resize=True)
>>> r = get_as_dataframe(ws, index_column_number=1)
>>> r  # Looks as expected
           a  b
our index      
A          4  a
B          1  b
C          2  b
D          4  a
>>> r.dtypes  # All object dtype
a    object
b    object
dtype: object
>>> [type(v) for v in r['a']]  # correctly converted to int
[int, int, int, int]
>>> df.equals(r)  # The equality check fails
False
>>> df['a'].equals(r['a'])  # because of the dtype of column 'a'.
False
>>> df['a'] == r['a']  # The values *are* the same, though.
our index
A    True
B    True
C    True
D    True
Name: a, dtype: bool
>>> df['b'].equals(r['b'])  # str works as expected
True

Suggested solution

I am unsure of the best way to deal with this, and whether it is a general enough use case to warrant an addition to gspread-dataframe. At any rate, the following code is my initial stab at how dtype inference could be implemented:

import pandas as pd

converters = (
    pd.to_numeric,
    pd.to_timedelta,
    pd.to_datetime,
)


def _assign_column_dtypes(df):
    # Offer each object-dtyped column to the converters in order;
    # errors='ignore' returns the column unchanged when conversion fails.
    for conv in converters:
        for col in df:
            if df[col].dtype != object:
                continue
            df[col] = conv(df[col], errors='ignore')

    return df

It intentionally places timedelta before datetime, as ‘00:03:00’ can be interpreted as either one by pandas. In my use case, datetimes always include a date, so ‘00:03:00’ would definitely be a timedelta.
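To see the ambiguity concretely, here is a minimal check (not part of the proposal itself) showing that pandas will happily parse the same bare-time strings as either type, so converter order decides which dtype wins:

```python
import pandas as pd

s = pd.Series(['00:03:00', '00:04:10'])

# Tried first in the proposal, so bare times become durations...
print(pd.to_timedelta(s).dtype)  # timedelta64[ns]
# ...whereas to_datetime would read them as times on today's date.
print(pd.to_datetime(s).dtype)   # datetime64[ns]
```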

Take it for a spin!

# Construct a dataframe where everything is either str or object
n = 10
df = pd.DataFrame({
    'datetime str': pd.date_range('2017-03-15', freq='D', periods=n
                                  ).astype(str),
    'timedelta str': pd.timedelta_range('00:03:00', periods=n, freq='10 s'
                                        ).to_native_types().astype(str),
    'int obj': pd.Series(range(n), dtype=object),
    'int str': [str(i) for i in range(n)],
    'float obj': pd.Series(map(float, range(n)), dtype=object),
    'float str': [str(float(i)) for i in range(n)],
})

print(df)
#   datetime str float obj float str int obj int str timedelta str
# 0   2017-03-15         0       0.0       0       0      00:03:00
# 1   2017-03-16         1       1.0       1       1      00:03:10
# 2   2017-03-17         2       2.0       2       2      00:03:20
# 3   2017-03-18         3       3.0       3       3      00:03:30
# 4   2017-03-19         4       4.0       4       4      00:03:40
# 5   2017-03-20         5       5.0       5       5      00:03:50
# 6   2017-03-21         6       6.0       6       6      00:04:00
# 7   2017-03-22         7       7.0       7       7      00:04:10
# 8   2017-03-23         8       8.0       8       8      00:04:20
# 9   2017-03-24         9       9.0       9       9      00:04:30

print(df.dtypes)
# datetime str     object
# float obj        object
# float str        object
# int obj          object
# int str          object
# timedelta str    object
# dtype: object


df = _assign_column_dtypes(df)

print(df)
#   datetime str  float obj  float str  int obj  int str  timedelta str
# 0   2017-03-15        0.0        0.0        0        0       00:03:00
# 1   2017-03-16        1.0        1.0        1        1       00:03:10
# 2   2017-03-17        2.0        2.0        2        2       00:03:20
# 3   2017-03-18        3.0        3.0        3        3       00:03:30
# 4   2017-03-19        4.0        4.0        4        4       00:03:40
# 5   2017-03-20        5.0        5.0        5        5       00:03:50
# 6   2017-03-21        6.0        6.0        6        6       00:04:00
# 7   2017-03-22        7.0        7.0        7        7       00:04:10
# 8   2017-03-23        8.0        8.0        8        8       00:04:20
# 9   2017-03-24        9.0        9.0        9        9       00:04:30

print(df.dtypes)
# datetime str      datetime64[ns]
# float obj                float64
# float str                float64
# int obj                    int64
# int str                    int64
# timedelta str    timedelta64[ns]
# dtype: object

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 9 (9 by maintainers)

Top GitHub Comments

1 reaction
robin900 commented, Mar 28, 2017

@NTAWolf I’ve opened #2 to represent the switch to TextParser; a PR implementing it will follow in the next few days. I’m going to close this issue; let me know if you think it should be re-opened.

0 reactions
robin900 commented, Mar 28, 2017

OK, I will be adding some tests to exercise the different keyword arguments for TextParser and ensure that the resulting DataFrames are as expected. Then I will plan a major version release.

In the meantime, a quick recipe using the current release is below. (It always behaves as if evaluate_formulas=True; to get the effect of evaluate_formulas=False, you will need to build the list of values yourself using cell.input_value.)

from pandas.io.parsers import TextParser

def get_as_dataframe(worksheet, **options):
    return TextParser(worksheet.get_all_values(), **options).read()
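As a rough illustration of why this recipe solves the dtype problem, TextParser runs the same type-inference machinery as read_csv. Below, the rows list is a stand-in for what worksheet.get_all_values() returns (a header row followed by rows of string cells):

```python
from pandas.io.parsers import TextParser

# Stand-in for worksheet.get_all_values(): all cells arrive as strings.
rows = [
    ['a', 'b'],
    ['4', 'x'],
    ['1', 'y'],
]

df = TextParser(rows).read()
print(df.dtypes)  # 'a' is inferred as int64; 'b' stays object
```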
