strings not properly detected despite correct dtype in read_csv

See original GitHub issue

Hello there!

I am working with text data, and I read my data in using

import pandas as pd

full_list = []

for myfile in all_files:
    print("processing " + myfile)
    news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
    full_list.append(news)

data_full = pd.concat(full_list)

As you see, I make sure that my headline variable is a str. However, when I type

collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

I get:

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed =data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

TypeError: sequence item 21: expected string, float found

To fix the problem, I first need to type

data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)

Is that expected? I thought specifying the dtypes in read_csv was the most robust way to get consistent types in the data. I am still using pandas 0.19.2.

Thanks!

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 1, 2017

I don’t think so (at least not by default). There is the keep_default_na=False option to read_csv, if you’d like to disable that.

On Thu, Jun 1, 2017 at 11:41 AM, Olaf notifications@github.com wrote:

I think you found the correct solution!!

That raises the question: is that the expected output? Shouldn't missing values be read in as “” instead of np.nan when the user specifies dtype = str?
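As a rough illustration of the keep_default_na=False option mentioned in the reply, the sketch below reads a small in-memory CSV twice. The CSV text and the variable names (csv_text, default, no_na) are made up for this example; only read_csv, dtype and keep_default_na come from the thread.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for one of the news files;
# the second data row has an empty HEADLINE field.
csv_text = (
    "FULL_TIMESTAMP,HEADLINE\n"
    "2017-06-01 09:00,Markets open higher\n"
    "2017-06-01 09:05,\n"
    "2017-06-01 09:10,Oil slips\n"
)

# Default behaviour: the empty field is read as a missing value,
# even though dtype=str was requested for the column.
default = pd.read_csv(io.StringIO(csv_text), dtype={"HEADLINE": str})
default_has_missing = bool(default["HEADLINE"].isnull().any())

# With keep_default_na=False (and no na_values), the empty field
# is kept as an empty string instead.
no_na = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"HEADLINE": str},
    keep_default_na=False,
)
headlines = no_na["HEADLINE"].tolist()

# Every item is now a str, so the join no longer raises TypeError.
joined = "| ".join(headlines)
print(default_has_missing)
print(joined)
```

Note that keep_default_na=False applies to every column being read, not only HEADLINE, so it may not be what you want if other columns rely on the default missing-value markers.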


jschendel commented, Jun 1, 2017

Do you have any missing values in your ‘HEADLINE’ column? You can check with data_full['HEADLINE'].isnull().any().

Missing values would be read in as np.nan despite the dtype specification, and the type of np.nan is float. Using astype(str) would convert np.nan values to the string 'nan', which would explain why it works after doing so, though I don’t know that it’s working as you intended, as you’d get something like 'string1| nan| string3' from the join.
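If the goal is to leave missing headlines out of the joined string rather than turn them into 'nan', one possible approach is to drop them inside each group before joining. This is only a sketch on made-up data: the small frame and its 'day' values are hypothetical, and dropna is a suggestion that was not proposed in the thread.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one headline is missing, as read_csv would
# produce for an empty field.
data = pd.DataFrame({
    "day": ["d1", "d1", "d1", "d2"],
    "HEADLINE": ["string1", np.nan, "string3", "string4"],
})

# The check suggested above: is anything missing in the column?
has_missing = bool(data["HEADLINE"].isnull().any())

# np.nan is a float, which is what str.join complains about.
nan_is_float = isinstance(np.nan, float)

# Drop missing values within each group, then join what is left.
collapsed = data.groupby("day")["HEADLINE"].agg(
    lambda x: "| ".join(x.dropna())
)
print(collapsed)
```

Here the 'd1' group collapses to 'string1| string3', without a 'nan' entry in the middle.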
