read_csv with wildcards fails on different column orders

Hi, I finally tracked down this issue after a lot of work: if you have two (or more) CSV files with headers, but the column order differs between the files, read_csv will read the second file as if its columns were in the same order as the first file's.

I did not know that the order of these columns would be different, but pd.DataFrame.to_csv() does not guarantee an ordering.

Take, for example, the following two files.

==> mean-2016.12.11.csv <==
d,demo,s,market,qtr_hr,ue,000s,w,np,tarp
2016-12-11,TTLPPL,0,0201,0,4998799.939999982,6.332151999999998,6332.151999999998,5,0.12667344314643691
2016-12-11,TTLPPL,0,0201,1,4998799.939999982,3.9752799999999993,3975.2799999999993,2,0.07952468687914752
2016-12-11,TTLPPL,0,0201,2,4998799.939999982,3.2186719999999993,3218.6719999999996,2,0.0643888941072527

==> mean-2016.12.18.csv <==
d,demo,s,market,qtr_hr,w,ue,np,000s,tarp
2016-12-18,TTLPPL,0,0201,0,1743.1853333333338,4998800.000000004,2,1.7431853333333331,0.034872075964898205
2016-12-18,TTLPPL,0,0201,1,1952.8600000000004,4998800.000000004,3,1.9528599999999998,0.039066575978234735
2016-12-18,TTLPPL,0,0201,2,1654.542666666667,4998800.000000004,2,1.6545426666666665,0.033098797044624005

Reading the second file on its own gives the correct result:

>>> df = dd.read_csv('../data/mean-2016.12.18.csv')
>>> df.compute().ix[0]
d         2016-12-18
demo          TTLPPL
s                  0
market           201
qtr_hr             0
w            1743.19
ue        4.9988e+06
np                 2
000s         1.74319
tarp       0.0348721
Name: 0, dtype: object

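In contrast, reading both files with a wildcard applies the first file's column order to the second file's values: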

>>> df = dd.read_csv('../data/mean-2016.12.1*.csv')
>>> df.compute().ix[0]

   d           demo    s  market  qtr_hr  ue            000s          w         np        tarp
0  2016-12-11  TTLPPL  0  201     0       4.998800e+06  6.332152e+00  6332.152  5.000000  0.126673
0  2016-12-18  TTLPPL  0  201     0       1.743185e+03  4.998800e+06  2.000     1.743185  0.034872

Issue Analytics

  • State: open
  • Created: 7 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

2 reactions
mrocklin commented, Mar 8, 2017

“but if it can’t be fixed it should produce an error (or at least a warning).”

I agree that it should do this. There are literally dozens of cases with malformed CSV where it would be useful to provide informative errors. Unfortunately there is no way to make the directory-of-csv format entirely reliable without looking through all of the data ahead of time. There is a necessary trade-off here between “get going quickly” and “thoroughly check through all of the data for inconsistencies”.

If you have thoughts on how to improve the current internal design of dd.read_csv and the time to implement those changes then contributions would be most welcome.
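Header order, at least, can be compared by reading only the first line of each file. The following is a minimal sketch of such a check using only the Python standard library. It is not part of dask, and `find_header_mismatches` is a hypothetical helper name chosen for illustration.

```python
import csv
from glob import glob


def find_header_mismatches(pattern):
    """Return {path: header} for files whose header row differs from the first file's.

    Hypothetical helper, not a dask API. Only the first row of each file is read.
    """
    paths = sorted(glob(pattern))
    if not paths:
        return {}

    headers = {}
    for path in paths:
        with open(path, newline="") as f:
            # An empty file yields an empty header list.
            headers[path] = next(csv.reader(f), [])

    reference = headers[paths[0]]
    # Comparing lists checks both the column names and their order.
    return {p: h for p, h in headers.items() if h != reference}
```

Calling it with the same pattern that is about to be passed to dd.read_csv, and raising or warning when the result is non-empty, would catch the reordered-columns case from this issue. It would not catch other kinds of malformed CSV.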

2 reactions
stumitchell commented, Mar 8, 2017

Yeah, I got around it like this:

df = dd.concat([dd.read_csv(f) for f in filelist])

But that means I can’t use wildcards anymore 😦

If it can’t be fixed, it should produce an error (or at least a warning).
