.extract() is unable to get data properly from sparse tables

I wrote a small table by hand to reproduce the bug I am facing:

<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
   <thead>
      <tr>
        <th class="">Mar 2008</th>
        <th class="">Mar 2009</th>
        <th class="">Mar 2010</th>
      </tr>
   </thead>
   <tbody>
      <tr>
        <td class="">8,626</td>
        <td class="">8,427</td>
        <td class="">11,525</td>
      </tr>
      <tr>
        <td class="">16,408</td>
        <td class="">19,582</td>
        <td class=""></td>
      </tr>
      <tr>        
        <td class=""></td>
        <td class="">22,574</td>
        <td class="">21,755</td> 
      </tr>
   </tbody>
</table>
</html>

When I run the code below against the HTML above, this is the output I get:

>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']

As you can see, the empty data cells are missing from the output. Every empty value is skipped, which looks like a bug.

Similarly, the code below gives results I did not expect: the number of extracted text values does not match the number of cells in the row.

>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3

Both .getall() and .extract() show the same behavior.

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

ilyazub commented, Feb 16, 2022

@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string for the empty data cells, unlike text(), which skips them.

>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

Full code

from parsel import Selector

html = """<!DOCTYPE html> 
<html lang="en"> 
<table class="manual_table"> 
  <thead> 
    <tr> 
      <th class="">Mar 2008</th> 
      <th class="">Mar 2009</th> 
      <th class="">Mar 2010</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td class="">8,626</td> 
      <td class="">8,427</td> 
      <td class="">11,525</td> 
    </tr> 
    <tr> 
      <td class="">16,408</td> 
      <td class="">19,582</td> 
      <td class=""></td> 
    </tr> 
    <tr>         
      <td class=""></td> 
      <td class="">22,574</td> 
      <td class="">21,755</td>  
    </tr> 
  </tbody> 
</table>
</html>"""

s = Selector(text=html)

rows = s.css(".manual_table tbody tr")

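# Visit every cell so that empty cells still contribute a value
dt = []
for row in rows:
    for data in row.css("td"):
        # get(default='') returns '' when the cell has no text node
        dt.append(data.css("::text").get(default=''))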

print("Loop:", dt)

dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()

print("One-liner:", dt2)

Output

Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

I’m commenting on this old issue because I’ve faced it today.

Gallaecio commented, Jun 1, 2020

Is there a better way to parse a sparse table other than the looping method?

I believe that is the right way to do it with Parsel.
