strings not properly detected despite correct dtype in read_csv

Hello there!

I am working with text data, and I read my data in using

import pandas as pd

full_list = []

for myfile in all_files:
    print("processing " + myfile)
    news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
    full_list.append(news)

data_full = pd.concat(full_list)

As you can see, I make sure that the HEADLINE column is read as str. However, when I run

collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

I get:

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed =data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)

  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)

  File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))

TypeError: sequence item 21: expected string, float found

To fix the problem, I first need to run

data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)

Is that expected? I thought specifying the dtypes in read_csv was the most robust way to get consistent types in the data. I am still using pandas 0.19.2.

Thanks!

Issue Analytics

Created 6 years ago
Comments:7 (5 by maintainers)

Top GitHub Comments

TomAugspurger commented, Jun 1, 2017

I don’t think so (at least not by default). There is the keep_default_na=False option to read_csv, if you’d like to disable that.

On Thu, Jun 1, 2017 at 11:41 AM, Olaf notifications@github.com wrote:

I think you found the correct solution!

That raises the question: is that the expected output? Shouldn't missing values be read in as “” instead of np.nan when the user specifies dtype = str?
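
For illustration, the keep_default_na=False option mentioned in this comment could be used roughly as follows. This is only a sketch: the inline CSV text and the small `day` column are made up for the example, since the original files are not shown. Note that the option applies to every column, so values such as "NA" would also be kept as literal strings.

```python
import io

import pandas as pd

# Hypothetical sample data: the second row has an empty HEADLINE field.
csv_text = "day,HEADLINE\n1,first headline\n1,\n2,third headline\n"

# With keep_default_na=False, the empty field is kept as "" rather than NaN.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"HEADLINE": str},
    keep_default_na=False,
)

any_missing = bool(df["HEADLINE"].isnull().any())

# The join no longer fails, because every item is a string.
collapsed = df.groupby("day")["HEADLINE"].agg(lambda x: "| ".join(x))
print(collapsed)
```

The empty headline still takes part in the join as an empty string, so the group for day 1 ends with a trailing separator.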


jschendel commented, Jun 1, 2017

Do you have any missing values in your ‘HEADLINE’ column? You can check with data_full['HEADLINE'].isnull().any().

Missing values would be read in as np.nan despite the dtype specification, and the type of np.nan is float. Using astype(str) would convert np.nan values to the string 'nan', which would explain why it works after doing so, though I don’t know that it’s working as you intended, as you’d get something like 'string1| nan| string3' from the join.
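
The behaviour described here can be reproduced with a small sketch. The inline CSV text and the `day` column are assumptions made up for the example, and dropping the missing rows before joining is just one possible way to avoid both the TypeError and a literal 'nan' in the result, not necessarily what the original poster wanted.

```python
import io

import pandas as pd

# Hypothetical sample data: the second row has an empty HEADLINE field.
csv_text = "day,HEADLINE\n1,first headline\n1,\n2,third headline\n"

# dtype=str does not prevent the empty field from being read as a missing value.
df = pd.read_csv(io.StringIO(csv_text), dtype={"HEADLINE": str})

# The check suggested above: is anything missing in the column?
has_missing = bool(df["HEADLINE"].isnull().any())
print("missing values present:", has_missing)

# Drop rows with a missing headline, then join the remaining strings per day.
collapsed = (
    df.dropna(subset=["HEADLINE"])
    .groupby("day")["HEADLINE"]
    .agg(lambda x: "| ".join(x))
)
print(collapsed)
```

If the missing rows should stay in the output, replacing them with `fillna("")` before the join is an alternative to dropping them.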