ENH: read_html to handle rowspan, colspan
See original GitHub issueCode Sample, a copy-pastable example if possible
import pandas as pd
pd.read_html('https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html')[0]
This has complex table headings:
read_html
output begins with:
Year Retired-worker families Survivor families Disabled-worker families Unnamed: 4_level_0 Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Unnamed: 10_level_0 Unnamed: 11_level_0 \
Worker only Worker and wife?a Non-disabled widow only Widowed mother or father and? Worker only Worker, wife,?b and? Worker and spouse Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
All Men Women 1?child 2?children 3 or more children All Men Women 1?child 2 or more children Unnamed: 11_level_2
0 NaN Number?(thousands) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 1945 416 338 78 181 95 86.00 48.00 24.00 .?.?. .?.?. .?.?.
(row 0 of the output is probably something one would have to manually eliminate)
Problem description
For HTML headings with rowspan and colspan elements, read_html
has undesirable behavior. Basically read_html
packs all heading <th>
elements in any particular row to the left, so any particular column no longer has any association with the <th>
elements that are actually above it in the HTML table.
Ample discussion here about the analogous pandas+Excel test case: https://github.com/pandas-dev/pandas/issues/4679
Relevant web discussions:
- https://stackoverflow.com/questions/23703638/bug-in-pandas-read-html-method
- https://stackoverflow.com/questions/28763891/what-should-i-do-when-tr-has-rowspan
- https://johnricco.github.io/2017/04/04/python-html/ (using his example)
This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.
Expected Output
Each column should be associated with the <th>
elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex
?) or a tuple (presumably if the argument tupleize_cols
is set to True
). Instead, currently, column n is associated with the n th <th>
entry in the table row regardless of the settings of rowspan/colspan.
It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I’d be happy to know how to do it).
Output of pd.show_versions()
pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 36.2.0 Cython: 0.26 numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 5.3.0 sphinx: 1.6.3 patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: None xlsxwriter: None lxml: 3.7.3 bs4: 4.5.3 html5lib: 1.0b10 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None
Issue Analytics
- State:
- Created 6 years ago
- Reactions:2
- Comments:13 (6 by maintainers)
Top GitHub Comments
@jowens - can you open a PR with your WIP code? Easier to answer these type of questions that way.
From #17074:
@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It’s just expecting the parser to throw an error? The output from the test code (where it’s failing) is at the bottom. It’s a pretty weird HTML file.
Now, if I call it with my current in-progress code as
dfs = pd.read_html('computer_sales_page.html', header=[0, 1])
, I see:and if I call it without a header argument (
dfs = pd.read_html('computer_sales_page.html')
), I see:These seem like OK outputs to me. I’m not sure what the original test is supposed to show. I think I’d like to just delete the test if it’s supposed to fail (and no longer fails).