.extract() is unable to get data properly from sparse tables

I created a small table by hand to reproduce the bug I am facing:

<!DOCTYPE html>
<html lang="en">
<table class="manual_table">
   <thead>
      <tr>
        <th class="">Mar 2008</th>
        <th class="">Mar 2009</th>
        <th class="">Mar 2010</th>
      </tr>
   </thead>
   <tbody>
      <tr>
        <td class="">8,626</td>
        <td class="">8,427</td>
        <td class="">11,525</td>
      </tr>
      <tr>
        <td class="">16,408</td>
        <td class="">19,582</td>
        <td class=""></td>
      </tr>
      <tr>        
        <td class=""></td>
        <td class="">22,574</td>
        <td class="">21,755</td> 
      </tr>
   </tbody>
</table>
</html>

When I run the code below against the HTML above, this is the output I get:

>>> rows = response.css(".manual_table tbody tr")
>>> rows[0].css("td::text").extract()
['8,626', '8,427', '11,525']
>>> rows[1].css("td::text").extract()
['16,408', '19,582']
>>> rows[2].css("td::text").extract()
['22,574', '21,755']

As you can see, the empty data cells are missing from the output. All empty values are ignored, which looks like a bug.

Similarly, the code below gives results I did not expect: the number of extracted text values does not match the number of cells in the row.

>>> len(rows[2].css("td::text").extract())
2
>>> len(rows[2].css("td::text"))
2
>>> len(rows[2].css("td"))
3

Both .getall() and .extract() behave the same way.

Top GitHub Comments

ilyazub commented, Feb 16, 2022

@shubham-MLwiz xpath("normalize-space()").getall() returns an empty string for each empty data cell, unlike text(), which skips those cells entirely.

>>> s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()
['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

Full code

from parsel import Selector

html = """<!DOCTYPE html> 
<html lang="en"> 
<table class="manual_table"> 
  <thead> 
    <tr> 
      <th class="">Mar 2008</th> 
      <th class="">Mar 2009</th> 
      <th class="">Mar 2010</th> 
    </tr> 
  </thead> 
  <tbody> 
    <tr> 
      <td class="">8,626</td> 
      <td class="">8,427</td> 
      <td class="">11,525</td> 
    </tr> 
    <tr> 
      <td class="">16,408</td> 
      <td class="">19,582</td> 
      <td class=""></td> 
    </tr> 
    <tr>         
      <td class=""></td> 
      <td class="">22,574</td> 
      <td class="">21,755</td>  
    </tr> 
  </tbody> 
</table>
</html>"""

s = Selector(text=html)

rows = s.css(".manual_table tbody tr")

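# Query each <td> on its own so that an empty cell yields '' instead of being skipped.
dt = []
for row in rows:
    for data in row.css("td"):
        dt.append(data.css("::text").get(default=''))

print("Loop:", dt)

# normalize-space() is evaluated once per <td> and returns '' for an empty cell.
dt2 = s.css(".manual_table tbody tr td").xpath("normalize-space()").getall()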

print("One-liner:", dt2)

Output

Loop: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']
One-liner: ['8,626', '8,427', '11,525', '16,408', '19,582', '', '', '22,574', '21,755']

I’m commenting on this old issue because I’ve faced it today.

Gallaecio commented, Jun 1, 2020