XMLFeedSpider iternodes iterator does not work on XML document with namespace

(Opening the issue so that we track it, although it is already known.)

Sample input document:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
(...)

Symptom: With Scrapy 1.4.0 (and certainly earlier versions), when using XMLFeedSpider with the default iternodes iterator, nodes selected with itertag='loc' cannot be found,

    (...) site-packages/scrapy/utils/iterators.py", line 31, in xmliter
        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
    exceptions.IndexError: list index out of range

and registering namespaces and using itertag='prefix:loc' does not work either.
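
The namespace behaviour behind the symptom can be reproduced outside Scrapy. The following is only a minimal sketch using the standard library's `xml.etree.ElementTree`, not Scrapy's own `xmliter` code: it assumes the truncated sample above is closed with `</urlset>`, and the prefix `n` is an arbitrary name picked for illustration. It shows that an unqualified lookup for `loc` matches nothing when the document declares a default namespace, while a namespace-qualified lookup finds both nodes.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the sample document in the report.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Assumption: the sample is truncated in the report; it is closed here
# with </urlset> so that it parses.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""

root = ET.fromstring(SAMPLE)

# Unqualified name: every element lives in the default namespace,
# so a plain "loc" lookup matches nothing.
unqualified = root.findall(".//loc")

# Qualified name: "n" is an arbitrary prefix bound to the sitemap namespace.
qualified = root.findall(".//n:loc", {"n": SITEMAP_NS})

print(len(unqualified), len(qualified))
```

An empty result for the unqualified lookup is the same situation in which indexing `[0]` on the selector result raises the `IndexError` shown in the traceback.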

This was recently seen (again) on Stack Overflow and was already discussed on scrapy-users.

There’s a WIP PR #861. The last comments were about moving to an iterparse-based implementation.
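
To illustrate what an iterparse-based node iterator could look like, here is a rough sketch built on the standard library's `xml.etree.ElementTree.iterparse`. It is not the code from PR #861 and not Scrapy's implementation; the helper name `iter_nodes` and its parameters are hypothetical, and the sample document is again assumed to end with `</urlset>`.

```python
import io
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Assumption: truncated sample from the report, closed with </urlset>.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""


def iter_nodes(source, local_name, namespace=None):
    """Yield elements named local_name (hypothetical helper, not a Scrapy API).

    ElementTree stores namespaced tags as "{uri}local", so the wanted
    tag is built in that form when a namespace is given.
    """
    wanted = f"{{{namespace}}}{local_name}" if namespace else local_name
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted:
            yield elem
            # Free the element's content once the consumer asks for the
            # next node; read what you need from it before advancing.
            elem.clear()


urls = [node.text for node in iter_nodes(io.BytesIO(SAMPLE), "loc", SITEMAP_NS)]
no_namespace = list(iter_nodes(io.BytesIO(SAMPLE), "loc"))
print(urls, no_namespace)
```

Because the tag comparison uses the fully qualified name, the namespace is handled by the parser rather than by text matching on the raw document.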

Issue Analytics

  • State: closed
  • Created: 6 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

1 reaction
kigsmtua commented, Sep 15, 2017

@redapple I will be having a go at this

0 reactions
Gallaecio commented, Oct 6, 2020

Yep, this should be fixed now; itertag='prefix:loc' should work.
