scrapy not parsing content of tables correctly

Hi everyone,

I’m trying to crawl the syllabus of my university and have encountered a problem: it seems that scrapy isn’t parsing the content of tables at all. I’ve tried both XPath and CSS selectors, and neither works.

Have a look at the following minimal working example (robots.txt has to be ignored for the spider to work, for example by setting ROBOTSTXT_OBEY to False):

import scrapy


class TableTestSpider(scrapy.Spider):
    name = "table-test"

    def start_requests(self):
        urls = [
            "https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166",
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content_div = response.xpath("//div[@class='content_max']/*")

        print("\ntable")
        table = response.xpath("//table")
        print(table)

        print("\n//tbody/tr/th")
        tbody_tr_th = response.xpath("//tbody/tr/th")
        print(tbody_tr_th)

        for div in content_div:
            if div.xpath("h3/a"):
                print("h3/a")

            if div.css("table"):
                print("table")

            if div.xpath("tbody/tr/th"):
                print("tbody tr th")

Output:

2018-09-16 20:52:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166> (referer: None)

table
[<Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>]

//tbody/tr/th
[]
table
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
2018-09-16 20:52:00 [scrapy.core.engine] INFO: Closing spider (finished)

If you run the example, you will notice that scrapy finds the tables via XPath (and CSS selectors), but neither the relative query (tbody/tr/th) nor the absolute one (//tbody/tr/th) matches any of their content. On the other hand, if you open the URL in your browser (https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&raum.rgid=166&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan) and run $x("//tbody/tr/th") in the browser’s console, it returns the table headers (part of the content of a table) without any errors, which suggests an error in the XML parsing algorithm. Also, querying for the h3/a combination works flawlessly in scrapy, which suggests that the problem isn’t related to the depth of the XML DOM tree.

One thing I noticed when looking at the source code of the website is that it makes heavy use of tabs, line breaks, and non-breaking spaces, which caused me some trouble when parsing other elements and may be causing some hiccups here. Another thing I noticed is that the website uses the summary attribute within the table tag, which, according to the W3C (https://www.w3.org/TR/WCAG-TECHS/H73.html#H73-description), is deprecated in HTML 5, though this shouldn’t matter since the doctype indicates an HTML 4.01 document (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">).
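
For the whitespace trouble mentioned above, here is a minimal sketch of one way to clean up extracted text. The helper name normalize_text and the sample string are hypothetical; the only assumption is that the extracted strings mix tabs, line breaks, and non-breaking spaces the way the page source does.

```python
def normalize_text(raw: str) -> str:
    """Collapse every run of whitespace into a single space and trim the ends."""
    # str.split() without arguments splits on any Unicode whitespace, which
    # covers tabs, line breaks, and the non-breaking space (U+00A0).
    return " ".join(raw.split())


# Hypothetical cell text, made up for illustration (not taken from the page).
sample = "\n\t\tMontag\u00a0\u00a010:00\n\t- 12:00\u00a0"
cleaned = normalize_text(sample)
print(cleaned)
```

In a spider, a helper like this could be applied to each string returned by getall() on a text() query before the values are compared or stored.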

I hope to root out the cause with your help.

Best regards, zergar

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
wRAR commented, Sep 17, 2018
In [7]: response.xpath('//tbody')
Out[7]: []

There is no tbody tag in the HTML text so you shouldn’t use it in your queries.
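
The difference between the browser console and scrapy can be sketched without a network request. Browsers add a tbody element to the DOM when the markup leaves it out, while a parser that works on the source text as written does not. The snippet below is only an illustration under stated assumptions: the HTML is a hypothetical, well-formed stand-in for the page, and the standard library's ElementTree stands in for scrapy's selectors so that the example runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the page source: the rows sit directly under
# <table>, and no <tbody> is written in the markup.
SOURCE = """
<html><body>
  <div class="content_max">
    <h3><a href="#">Room A</a></h3>
    <table summary="Overview">
      <tr><th>Time</th><th>Course</th></tr>
      <tr><td>10:00</td><td>Algebra</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(SOURCE)

# A query that goes through tbody finds nothing, because the element only
# exists in a browser's DOM and not in the source text.
with_tbody = root.findall(".//tbody/tr/th")

# Dropping tbody from the path matches the header cells.
without_tbody = root.findall(".//table/tr/th")

# A descendant step works whether or not a tbody is present in the markup.
any_depth = root.findall(".//table//th")

headers = [th.text for th in without_tbody]
print(with_tbody)
print(headers)
```

In the spider from the question, the corresponding change would be to query something like response.xpath("//table//th") instead of "//tbody/tr/th". Whether the header cells on the real page sit directly under tr is an assumption here, which is why the descendant form is the safer choice.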

0 reactions
zergar commented, Sep 17, 2018

@wRAR that’s good to know. Again, thanks for your help.
