XMLFeedSpider iternodes iterator does not work on XML document with namespace
(Opening the issue so that we track it, although it is already known.)
Sample input document:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
(...)
Symptom:
With Scrapy 1.4.0 (and earlier for sure), using XMLFeedSpider and the default iternodes iterator, nodes using itertag='loc' cannot be found:
(...) site-packages/scrapy/utils/iterators.py", line 31, in xmliter
yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
exceptions.IndexError: list index out of range
Registering namespaces and using itertag='prefix:loc' does not work either.
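To illustrate why the unqualified lookup comes back empty, here is a minimal sketch using the standard library's xml.etree.ElementTree rather than Scrapy's Selector (so it is an analogy, not the code path in scrapy/utils/iterators.py). The sample document is the one above with the elided part dropped and a closing tag added, and the prefix "sm" is an arbitrary name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the sample document's default xmlns.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Sample input from this issue, closed off so that it parses (assumption:
# the elided part contains only further <url> entries).
DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""

root = ET.fromstring(DOC)

# An unqualified name does not match elements that live in a default
# namespace, so this list is empty; indexing it with [0] would raise IndexError.
unqualified = root.findall(".//loc")

# Binding a prefix (here the hypothetical "sm") to the namespace URI
# makes the same lookup succeed.
qualified = root.findall(".//sm:loc", {"sm": SITEMAP_NS})

print(len(unqualified), len(qualified))
```

The empty result of the first lookup is the same kind of miss that ends in the IndexError shown in the traceback, where the first match of '//' + nodename is taken without checking that one exists.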
Recently seen (again) on StackOverflow. Was already discussed on scrapy-users.
There’s a WIP PR #861. The last comments were about moving to an iterparse-based implementation.
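As a rough idea of what an iterparse-based node iterator could look like, here is a sketch built on the standard library's xml.etree.ElementTree.iterparse. It is not the code from PR #861 and not Scrapy's implementation; the function name iter_nodes and its parameters are hypothetical, chosen only for this example.

```python
import io
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Sample input from this issue, closed off so that it parses.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""


def iter_nodes(xml_bytes, namespace, tag):
    """Yield each element named `tag` in `namespace`, one at a time.

    Hypothetical helper: ElementTree reports namespaced tags in the
    "{uri}local" form, so the comparison is done on that qualified name
    instead of on the raw text of the document.
    """
    qualified = "{%s}%s" % (namespace, tag)
    for _event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
        if elem.tag == qualified:
            yield elem
            # Clear the element once the consumer asks for the next one,
            # so read what you need from it before advancing.
            elem.clear()


# Read the text while each element is still populated.
locs = [node.text for node in iter_nodes(SAMPLE, SITEMAP_NS, "loc")]
print(locs)
```

Because the parser resolves prefixes and default namespaces itself, matching on the qualified name avoids the regular-expression handling of tags that the iternodes iterator trips over.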
Top GitHub Comments
@redapple I will be having a go at this
Yeap, this should be fixed now, itertag='prefix:loc' should work.