strings not properly detected despite correct dtype in read_csv
Hello there!
I am working with text data, and I read my data in using
    full_list = []
    for myfile in all_files:
        print("processing " + myfile)
        news = pd.read_csv(myfile, usecols=['FULL_TIMESTAMP', 'HEADLINE'], dtype={'HEADLINE': str})
        full_list.append(news)
    data_full = pd.concat(full_list)
As you can see, I make sure that my HEADLINE variable is a str. However, when I type
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
I get:
  File "<ipython-input-1-8ce0197f52ac>", line 34, in <module>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2668, in aggregate
    result = self._aggregate_named(func_or_funcs, *args, **kwargs)
  File "C:\Users\me\AppData\Local\Continuum\Anaconda2\lib\site-packages\pandas\core\groupby.py", line 2786, in _aggregate_named
    output = func(group, *args, **kwargs)
  File "<ipython-input-1-8ce0197f52ac>", line 34, in <lambda>
    collapsed = data_full.groupby('day').HEADLINE.agg(lambda x: '| '.join(x))
TypeError: sequence item 21: expected string, float found
To fix the problem, I first need to run
    data_full['HEADLINE'] = data_full['HEADLINE'].astype(str)
Is that expected? I thought specifying the dtypes in read_csv was the most robust way to get consistent types in the data. I am still using pandas 0.19.2.
Thanks!
Issue Analytics
- Created 6 years ago
- Comments:7 (5 by maintainers)
Top GitHub Comments
I don’t think so (at least not by default). There is the keep_default_na=False option to read_csv, if you’d like to disable that.
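For illustration, here is a minimal sketch of what keep_default_na=False changes. The two-row CSV text is made up for the example (only the column names come from the original post), and it assumes a reasonably recent pandas running on Python 3. Note that the option applies to every column in the file, so strings such as "NA" would also be kept as text rather than treated as missing.

```python
import io

import pandas as pd

# Hypothetical CSV content: the second HEADLINE field is empty.
csv_text = (
    "FULL_TIMESTAMP,HEADLINE\n"
    "2017-06-01 09:00,Markets open higher\n"
    "2017-06-01 09:05,\n"
)

# Default behaviour: the empty field becomes a missing value even with dtype=str.
default = pd.read_csv(io.StringIO(csv_text), dtype={"HEADLINE": str})

# With keep_default_na=False the empty field is kept as an empty string.
no_na = pd.read_csv(
    io.StringIO(csv_text), dtype={"HEADLINE": str}, keep_default_na=False
)

print(default["HEADLINE"].isnull().sum())
print(no_na["HEADLINE"].isnull().sum())
```

With this option every value in HEADLINE is a string, so the join in the aggregation should no longer hit a float, although empty headlines would then show up as empty segments in the joined text.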
Do you have any missing values in your ‘HEADLINE’ column? You can check with data_full['HEADLINE'].isnull().any(). Missing values would be read in as np.nan despite the dtype specification, and the type of np.nan is float. Using astype(str) would convert np.nan values to the string 'nan', which would explain why it works after doing so, though I don’t know that it’s working as you intended, as you’d get something like 'string1| nan| string3' from the join.
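If the literal 'nan' in the joined text is not wanted, one possible alternative (a sketch, not something suggested in the thread) is to drop the missing values inside the aggregation instead of casting them to strings. The small data_full frame below is a hypothetical stand-in for the real data; the 'day' values and headlines are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_full, with one missing headline.
data_full = pd.DataFrame(
    {
        "day": ["2017-06-01", "2017-06-01", "2017-06-01", "2017-06-02"],
        "HEADLINE": ["string1", np.nan, "string3", "string4"],
    }
)

# Confirm that the column really contains missing values.
print(data_full["HEADLINE"].isnull().any())

# Drop missing headlines per group before joining, so no float reaches join().
collapsed = data_full.groupby("day")["HEADLINE"].agg(
    lambda x: "| ".join(x.dropna())
)
print(collapsed)
```

Here the first day collapses to 'string1| string3' rather than 'string1| nan| string3'. Whether dropping or keeping a placeholder is right depends on what the missing headlines mean in the data.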