scrapy not parsing content of tables correctly

Hi everyone,

I’m trying to crawl the syllabus of my university and have encountered a problem: it seems that scrapy isn’t parsing the content of tables at all. I’ve tried it with XPath as well as CSS selectors, and neither works.

Have a look at the following minimal working example (robots.txt has to be ignored for the spider to work, i.e. ROBOTSTXT_OBEY has to be set to False):

import scrapy


class TableTestSpider(scrapy.Spider):
    name = "table-test"

    def start_requests(self):
        urls = [
            "https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166",
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content_div = response.xpath("//div[@class='content_max']/*")

        print("\ntable")
        table = response.xpath("//table")
        print(table)

        print("\n//tbody/tr/th")
        tbody_tr_th = response.xpath("//tbody/tr/th")
        print(tbody_tr_th)

        for div in content_div:
            if div.xpath("h3/a"):
                print("h3/a")

            if div.css("table"):
                print("table")

            if div.xpath("tbody/tr/th"):
                print("tbody tr th")

Output:

2018-09-16 20:52:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166> (referer: None)

table
[<Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>]

//tbody/tr/th
[]
table
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
2018-09-16 20:52:00 [scrapy.core.engine] INFO: Closing spider (finished)

If you run the example, you will notice that scrapy is able to find the tables via XPath (and CSS selectors), but not their content, neither relative to a table (tbody/tr/th) nor anywhere in the document (//tbody/tr/th). On the other hand, if you open the URL in your browser (https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&raum.rgid=166&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan) and run $x("//tbody/tr/th") in the browser’s console, it returns the table headers (part of the content of a table) without any errors, which suggests an error within the XML parsing algorithm. Also, querying for an h3/a combination works flawlessly in scrapy, which suggests that the problem isn’t related to the depth of the XML DOM tree.
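
One way to test that suspicion is to look at the raw markup rather than the browser's DOM, since the console query runs against the tree the browser built and not against the bytes the server sent. The following is only a minimal sketch using the Python standard library; `TagCounter` and the inline `raw_html` snippet are hypothetical stand-ins (in a spider, the raw text would be `response.text`).

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags exactly as they appear in the raw markup."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tags as written; it does not add implied elements.
        self.counts[tag] += 1


# Hypothetical stand-in for the page source (assumption, not the real page).
raw_html = (
    '<table summary="demo">'
    "<tr><th>Time</th></tr>"
    "<tr><td>10:00</td></tr>"
    "</table>"
)

counter = TagCounter()
counter.feed(raw_html)
counter.close()

# A count of zero for "tbody" means the element is not in the source text.
print(counter.counts["table"], counter.counts["tr"], counter.counts["tbody"])
```

If the count for `tbody` is zero while the browser console still finds `//tbody/tr/th`, the element was added by the browser while building its DOM.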

One thing I’ve noticed when looking at the source code of the website is that it makes heavy use of tabs, line breaks and non-breaking spaces, which caused me some trouble when parsing other elements and may be causing some hiccups here. Another thing I’ve noticed is that the website uses the summary attribute within the table tag, which, according to the W3C (https://www.w3.org/TR/WCAG-TECHS/H73.html#H73-description), is deprecated in HTML 5, though this shouldn’t matter since the doctype indicates an HTML 4.01 document (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">).

I’m hoping to be able to root out the cause with your help.

Best regards, zergar

Issue Analytics

Created 5 years ago
Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction

wRAR commented, Sep 17, 2018

In [7]: response.xpath('//tbody')
Out[7]: []

There is no tbody tag in the HTML text so you shouldn’t use it in your queries.
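
Browsers add an implied tbody element when they build the DOM, which is why the console query succeeds although the element is missing from the source. The effect on the queries can be sketched with the standard library alone. This is an illustration under assumptions, not the spider's actual fix: `RAW_HTML` is a hypothetical, well-formed stand-in for the page, and `xml.etree.ElementTree` with its limited XPath support replaces scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the page's markup (assumption):
# like the real source, <tr> sits directly under <table>, with no <tbody>.
RAW_HTML = """
<div class="content_max">
  <table summary="demo">
    <tr><th>Time</th><th>Room</th></tr>
    <tr><td>10:00</td><td>Room A</td></tr>
  </table>
</div>
"""

root = ET.fromstring(RAW_HTML)

# Query copied from the browser console: it requires a <tbody> that is absent.
with_tbody = root.findall(".//tbody/tr/th")

# Query that does not depend on <tbody> being present.
without_tbody = root.findall(".//table//tr/th")

headers = [th.text for th in without_tbody]
print(len(with_tbody), headers)
```

In the spider from the question, the analogous change would be to query `//table//tr/th` instead of `//tbody/tr/th`, since that expression matches the header cells whether or not a tbody element sits in between.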

0 reactions

zergar commented, Sep 17, 2018

@wRAR that’s good to know. Again, thanks for your help.
