read_csv with wildcards fails on different column orders
Hi, I finally tracked down this issue after a lot of work. If you have two (or more) CSV files with headers, but the column order differs between the files, read_csv will read the second file as if its columns were in the same order as the first.
I did not know that the column order would differ between the files, but pd.DataFrame.to_csv() does not guarantee an ordering.
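If you control the writer, one way to sidestep the problem is to pin the column order at write time via the `columns` argument of `to_csv`. A minimal sketch (the frame contents and filename here are made up for illustration):

```python
import pandas as pd

# Toy frame standing in for the real data; the filename is hypothetical.
df = pd.DataFrame({"w": [1743.19], "ue": [4998800.0], "np": [2]})

# Passing an explicit, sorted column list pins the header order, so every
# file written this way comes out with identical columns.
df.to_csv("mean-fixed-order.csv", index=False, columns=sorted(df.columns))
```

Any canonical ordering works; sorting is just an easy one to reproduce across runs.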
Take, for example, the following two files:
==> mean-2016.12.11.csv <==
d,demo,s,market,qtr_hr,ue,000s,w,np,tarp
2016-12-11,TTLPPL,0,0201,0,4998799.939999982,6.332151999999998,6332.151999999998,5,0.12667344314643691
2016-12-11,TTLPPL,0,0201,1,4998799.939999982,3.9752799999999993,3975.2799999999993,2,0.07952468687914752
2016-12-11,TTLPPL,0,0201,2,4998799.939999982,3.2186719999999993,3218.6719999999996,2,0.0643888941072527
==> mean-2016.12.18.csv <==
d,demo,s,market,qtr_hr,w,ue,np,000s,tarp
2016-12-18,TTLPPL,0,0201,0,1743.1853333333338,4998800.000000004,2,1.7431853333333331,0.034872075964898205
2016-12-18,TTLPPL,0,0201,1,1952.8600000000004,4998800.000000004,3,1.9528599999999998,0.039066575978234735
2016-12-18,TTLPPL,0,0201,2,1654.542666666667,4998800.000000004,2,1.6545426666666665,0.033098797044624005
you get the following:
>>> df = dd.read_csv('../data/mean-2016.12.18.csv')
>>> df.compute().ix[0]
d 2016-12-18
demo TTLPPL
s 0
market 201
qtr_hr 0
w 1743.19
ue 4.9988e+06
np 2
000s 1.74319
tarp 0.0348721
Name: 0, dtype: object
In contrast:
>>> df = dd.read_csv('../data/mean-2016.12.1*.csv')
>>> df.compute().ix[0]
d demo s market qtr_hr ue 000s w np tarp
0 2016-12-11 TTLPPL 0 201 0 4.998800e+06 6.332152e+00 6332.152 5.000000 0.126673
0 2016-12-18 TTLPPL 0 201 0 1.743185e+03 4.998800e+06 2.000 1.743185 0.034872
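The mix-up can be reproduced with plain pandas by forcing the first file's header onto the second file's rows, which illustrates the positional misassignment the wildcard read produces (this is a sketch of the symptom, not of dask's actual internals; the data is inlined from the files above):

```python
import io
import pandas as pd

first_header = ["d", "demo", "s", "market", "qtr_hr",
                "ue", "000s", "w", "np", "tarp"]

second_file = io.StringIO(
    "d,demo,s,market,qtr_hr,w,ue,np,000s,tarp\n"
    "2016-12-18,TTLPPL,0,0201,0,1743.19,4998800.0,2,1.7432,0.0349\n"
)

# `header=0` consumes the file's real header row, and `names` replaces it
# with the first file's header, so values are assigned purely by position.
df = pd.read_csv(second_file, names=first_header, header=0)
print(df.loc[0, "ue"])  # -> 1743.19, which is really the `w` value
```

The `ue` and `000s` columns end up holding each other's values, exactly as in the wildcard output shown above.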
Issue Analytics
- Created 7 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
I agree that it should do this. There are literally dozens of cases with malformed CSV where it would be useful to provide informative errors. Unfortunately there is no way to make the directory-of-csv format entirely reliable without looking through all of the data ahead of time. There is a necessary trade-off here between “get going quickly” and “thoroughly check through all of the data for inconsistencies”.
If you have thoughts on how to improve the current internal design of dd.read_csv, and the time to implement those changes, then contributions would be most welcome.

Yeah, I got around it like this:
df = dd.concat([dd.read_csv(f) for f in filelist])
but that means I can’t use wildcards anymore 😦
If it can’t be fixed, it should at least produce an error (or a warning).