scrapy not parsing content of tables correctly
Hi everyone,
I’m trying to crawl the syllabus of my university and have encountered a problem: it seems that scrapy isn’t parsing the content of tables at all. I’ve tried both XPath and CSS selectors, and neither works.
Have a look at the following minimal working example (robots.txt has to be ignored for the spider to work, e.g. by setting ROBOTSTXT_OBEY = False):
import scrapy


class TableTestSpider(scrapy.Spider):
    name = "table-test"

    def start_requests(self):
        urls = [
            "https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content_div = response.xpath("//div[@class='content_max']/*")
        print("\ntable")
        table = response.xpath("//table")
        print(table)
        print("\n//tbody/tr/th")
        tbody_tr_th = response.xpath("//tbody/tr/th")
        print(tbody_tr_th)
        for div in content_div:
            if div.xpath("h3/a"):
                print("h3/a")
            if div.css("table"):
                print("table")
            if div.xpath("tbody/tr/th"):
                print("tbody tr th")
Output:
2018-09-16 20:52:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan&raum.rgid=166> (referer: None)
table
[<Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table width="100%" cellpadding="0" cell'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>, <Selector xpath='//table' data='<table summary="Übersicht über alle Vera'>]
//tbody/tr/th
[]
table
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
h3/a
table
2018-09-16 20:52:00 [scrapy.core.engine] INFO: Closing spider (finished)
If you run the example you will notice that scrapy is able to find the tables via XPath (and CSS selectors) but not their content, neither relative to each table (tbody/tr/th) nor anywhere in the document (//tbody/tr/th). On the other hand, if you open the URL in your browser (https://lsf.tubit.tu-berlin.de/qisserver/servlet/de.his.servlet.RequestDispatcherServlet?state=wplan&raum.rgid=166&week=40_2018&act=Raum&pool=Raum&show=liste&P.vx=lang&P.subc=plan) and run $x("//tbody/tr/th") in the browser’s console, it returns the table headers (part of the content of a table) without any errors, which suggests an error in the XML parsing algorithm. Also, querying for an h3/a combination works flawlessly in scrapy, so the depth of the XML DOM tree doesn’t seem to be the problem.
One thing I’ve noticed when looking at the source code of the website is that it makes heavy use of tabs, line breaks and non-breaking spaces. These already caused me some trouble when parsing other elements and may be causing hiccups here as well. Another thing I’ve noticed is that the website uses the summary attribute within the table tag, which, according to the W3C (https://www.w3.org/TR/WCAG-TECHS/H73.html#H73-description), is deprecated in HTML 5, though this shouldn’t matter since the doctype indicates an HTML 4.01 document (<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">).
I’m hoping to be able to root out the cause with your help.
Best regards, zergar
Issue Analytics
- Created 5 years ago
- Comments: 6 (2 by maintainers)
Top GitHub Comments
There is no tbody tag in the HTML text, so you shouldn’t use it in your queries.
@wRAR that’s good to know. Again, thanks for your help.