Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

ENH: read_html to handle rowspan, colspan

See original GitHub issue

Code Sample, a copy-pastable example if possible

import pandas as pd
pd.read_html('https://www.ssa.gov/policy/docs/statcomps/supplement/2015/5h.html')[0]

This has complex table headings:

annual_statistical_supplement__2015_-_beneficiary_families_with_oasdi_benefits_in_current-payment_status__5_h_

read_html output begins with:

                                                 Year                            Retired-worker families        Survivor families             Disabled-worker families                                Unnamed: 4_level_0                              Unnamed: 5_level_0 Unnamed: 6_level_0 Unnamed: 7_level_0 Unnamed: 8_level_0 Unnamed: 9_level_0 Unnamed: 10_level_0 Unnamed: 11_level_0  \
                                          Worker only                                  Worker and wife?a  Non-disabled widow only        Widowed mother or father and?                                       Worker only                            Worker, wife,?b and?  Worker and spouse Unnamed: 7_level_1 Unnamed: 8_level_1 Unnamed: 9_level_1 Unnamed: 10_level_1 Unnamed: 11_level_1
                                                  All                                                Men                    Women                              1?child                                        2?children                              3 or more children                All                Men              Women            1?child  2 or more children Unnamed: 11_level_2
0                                                 NaN                                 Number?(thousands)                      NaN                                  NaN                                               NaN                                             NaN                NaN                NaN                NaN                NaN                 NaN                 NaN
1                                                1945                                                416                      338                                   78                                               181                                              95              86.00              48.00              24.00              .?.?.               .?.?.               .?.?.

(row 0 of the output is probably something one would have to manually eliminate)

Problem description

For HTML headings with rowspan and colspan elements, read_html has undesirable behavior. Basically read_html packs all heading <th> elements in any particular row to the left, so any particular column no longer has any association with the <th> elements that are actually above it in the HTML table.

Ample discussion here about the analogous pandas+Excel test case: https://github.com/pandas-dev/pandas/issues/4679

Relevant web discussions:

This may be an issue with the underlying parsers and cannot be solved well in pandas. This appears to be the behavior with both lxml and bs4/html5lib.

Expected Output

Each column should be associated with the <th> elements above it in the table. This might be a multi-row column name (as it is now) (a MultiIndex?) or a tuple (presumably if the argument tupleize_cols is set to True). Instead, currently, column n is associated with the n th <th> entry in the table row regardless of the settings of rowspan/colspan.

It may be this is possible to do properly in current pandas in which case I apologize for filing the issue (but I’d be happy to know how to do it).

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit: None python: 2.7.13.final.0 python-bits: 64 OS: Darwin OS-release: 16.7.0 machine: x86_64 processor: i386 byteorder: little LC_ALL: None LANG: en_US.US-ASCII LOCALE: None.None

pandas: 0.20.3 pytest: None pip: 9.0.1 setuptools: 36.2.0 Cython: 0.26 numpy: 1.13.1 scipy: 0.19.1 xarray: None IPython: 5.3.0 sphinx: 1.6.3 patsy: None dateutil: 2.6.0 pytz: 2017.2 blosc: None bottleneck: 1.2.1 tables: 3.4.2 numexpr: 2.6.2 feather: None matplotlib: 2.0.2 openpyxl: 2.4.7 xlrd: 1.0.0 xlwt: None xlsxwriter: None lxml: 3.7.3 bs4: 4.5.3 html5lib: 1.0b10 sqlalchemy: None pymysql: None psycopg2: None jinja2: 2.9.6 s3fs: None pandas_gbq: None pandas_datareader: None

Issue Analytics

State:
Created 6 years ago
Reactions:2
Comments:13 (6 by maintainers)

Top GitHub Comments

1reaction

chris-b1commented, Jul 26, 2017

@jowens - can you open a PR with your WIP code? Easier to answer these type of questions that way.

0reactions

gfyoungcommented, Jul 26, 2017

From #17074:

@chris-b1 or anyone else, help a brother out? Can you tell me what this test does? It’s just expecting the parser to throw an error? The output from the test code (where it’s failing) is at the bottom. It’s a pretty weird HTML file.

Now, if I call it with my current in-progress code as dfs = pd.read_html('computer_sales_page.html', header=[0, 1]), I see:

Index([         (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1'),
                (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1'),
                     (u'Three months ended April?30', u'2013'),
              u'(u'Three months ended April\xa030', '2013').1',
       (u'Three months ended April?30', u'Unnamed: 4_level_1'),
                     (u'Three months ended April?30', u'2012'),
              u'(u'Three months ended April\xa030', '2012').1',
                (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1'),
                       (u'Six months ended April?30', u'2013'),
                u'(u'Six months ended April\xa030', '2013').1',
        (u'Six months ended April?30', u'Unnamed: 10_level_1'),
                       (u'Six months ended April?30', u'2012'),
                u'(u'Six months ended April\xa030', '2012').1',
              (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1')],
      dtype='object')

and if I call it without a header argument (dfs = pd.read_html('computer_sales_page.html')), I see:

Index([   (u'Unnamed: 0_level_0', u'Unnamed: 0_level_1', u'Unnamed: 0_level_2'),
          (u'Unnamed: 1_level_0', u'Unnamed: 1_level_1', u'Unnamed: 1_level_2'),
                      (u'Three months ended April?30', u'2013', u'In millions'),
                u'(u'Three months ended April\xa030', '2013', 'In millions').1',
        (u'Three months ended April?30', u'Unnamed: 4_level_1', u'In millions'),
                      (u'Three months ended April?30', u'2012', u'In millions'),
                u'(u'Three months ended April\xa030', '2012', 'In millions').1',
                 (u'Unnamed: 7_level_0', u'Unnamed: 7_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2013', u'In millions'),
                  u'(u'Six months ended April\xa030', '2013', 'In millions').1',
         (u'Six months ended April?30', u'Unnamed: 10_level_1', u'In millions'),
                        (u'Six months ended April?30', u'2012', u'In millions'),
                  u'(u'Six months ended April\xa030', '2012', 'In millions').1',
       (u'Unnamed: 13_level_0', u'Unnamed: 13_level_1', u'Unnamed: 13_level_2')],
      dtype='object')

These seem like OK outputs to me. I’m not sure what the original test is supposed to show. I think I’d like to just delete the test if it’s supposed to fail (and no longer fails).

____________________ TestReadHtml.test_computer_sales_page _____________________

self = <pandas.tests.io.test_html.TestReadHtml object at 0x1120aa390>

    def test_computer_sales_page(self):
        data = os.path.join(DATA_PATH, 'computer_sales_page.html')
        with tm.assert_raises_regex(ParserError,
                                    r"Passed header=\[0,1\] are "
                                    r"too many rows for this "
                                    r"multi_index of columns"):
>           self.read_html(data, header=[0, 1])

pandas/tests/io/test_html.py:778:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pandas.util.testing._AssertRaisesContextmanager object at 0x1120aab50>
exc_type = None, exc_value = None, trace_back = None

    def __exit__(self, exc_type, exc_value, trace_back):
        expected = self.exception

        if not exc_type:
            exp_name = getattr(expected, "__name__", str(expected))
>           raise AssertionError("{0} not raised.".format(exp_name))
E           AssertionError: ParserError not raised.

pandas/util/testing.py:2491: AssertionError

Top Results From Across the Web

Table Rowspan And Colspan In HTML Explained (With ...

Both colspan= and rowspan= are attributes of the two table-cell elements, <th> and <td> . They provide the same functionality as “merge cell”...

HTML Table Colspan & Rowspan - W3Schools

To make a cell span over multiple columns, use the colspan attribute: ... Note: The value of the rowspan attribute represents the number...

HTML (DOM) sourced data - DataTables example

HTML (DOM) sourced data. The foundation for DataTables is progressive enhancement, so it is very adept at reading table information directly from the...

A Complete Guide to the Table Element | CSS-Tricks

The element in HTML is used for displaying tabular data. ... that can go on any table cell element ( <th> or <td>...

HTML Extraction Algorithm Based on Property and Data Cell

OPEN ACCESS. HTML Extraction Algorithm Based on Property and. Data Cell. To cite this article: Detty Purnamasari et al 2013 IOP Conf. Ser.:...