
XMLFeedSpider iternodes iterator does not work on XML document with namespace


(Opening the issue so that we track it, although it is already known.)

Sample input document:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
    <url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
    (...)

Symptom: with Scrapy 1.4.0 (and certainly earlier versions), when using XMLFeedSpider with the default iternodes iterator, nodes selected with itertag='loc' cannot be found,

    (...) site-packages/scrapy/utils/iterators.py", line 31, in xmliter
        yield Selector(text=nodetext, type='xml').xpath('//' + nodename)[0]
    exceptions.IndexError: list index out of range

and registering namespaces and using itertag='prefix:loc' does not work either.
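
To illustrate why an unqualified lookup for loc comes back empty (so that indexing [0] raises IndexError), here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not Scrapy's code path; the "sm" prefix is an arbitrary name chosen for this example, and the closing </urlset> tag is an assumption added so the truncated sample parses.

```python
import xml.etree.ElementTree as ET

# Sample document from this issue; the closing tag is assumed, since the
# original sample is truncated with "(...)".
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""

# "sm" is a hypothetical prefix; only the URI has to match the document.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP)

# Every element inherits the default namespace, so the bare name matches nothing.
unqualified = root.findall(".//loc")

# Qualifying the name with a prefix bound to the same URI finds the nodes.
qualified = root.findall(".//sm:loc", NS)

print(len(unqualified), len(qualified))
```

The same reasoning applies to any XPath engine: a bare loc refers to an element in no namespace, while the elements in this sitemap live in the sitemaps.org namespace.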

Recently seen (again) on StackOverflow. Was already discussed on scrapy-users.

There’s a WIP PR, #861. The latest comments there were about moving to an iterparse-based implementation.
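
For readers unfamiliar with the iterparse approach, the following is a rough sketch of the general idea using the standard library's xml.etree.ElementTree.iterparse. It is not the code from the PR or from Scrapy; the function name iter_nodes and its parameters are hypothetical, and the closing </urlset> tag in the sample is an assumption added so the truncated document parses.

```python
import io
import xml.etree.ElementTree as ET

# Sample document from this issue; the closing tag is assumed.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.argos.ie/static/Product/partNumber/2353030.htm</loc></url>
<url><loc>http://www.argos.ie/static/Product/partNumber/2717339.htm</loc></url>
</urlset>"""

NS_URI = "http://www.sitemaps.org/schemas/sitemap/0.9"


def iter_nodes(source, itertag, namespace=None):
    """Yield each element named `itertag` (hypothetical helper).

    ElementTree reports namespaced tags as "{uri}localname", so the wanted
    tag is built in that form when a namespace URI is given.
    """
    wanted = f"{{{namespace}}}{itertag}" if namespace else itertag
    # "end" events fire once an element and its children are fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted:
            yield elem
            # Release the element's content once the consumer is done with it.
            elem.clear()


# Read the text while iterating, because each element is cleared afterwards.
locs = [node.text for node in iter_nodes(io.BytesIO(SITEMAP), "loc", NS_URI)]
print(locs)
```

Because the parser resolves namespaces itself, this style avoids extracting node text with regular expressions and re-parsing it, which is where the namespace declaration gets lost.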


Top GitHub Comments

kigsmtua commented, Sep 15, 2017 (1 reaction)

@redapple I will be having a go at this

Gallaecio commented, Oct 6, 2020 (0 reactions)

Yep, this should be fixed now; itertag='prefix:loc' should work.

Top Results From Across the Web

XMLFeedSpider parsing issue with xml file that 8859-1 encoded

Hello, you are right, the "iternodes" iterator has an issue with namespaces. The problem is in scrapy.utils.iterators.xmliter() which uses regular expressions.

How to extract urls from an xml using scrapy - XMLFeedSpider?

And there's actually a bug in XMLFeedSpider when using (the default) iterator iternodes when the XML document uses a namespace.

Source code for scrapy.spiders.feed

This module implements the XMLFeedSpider which is the recommended ... to parse the file using the 'iternodes' iterator, an 'xml' selector, ...

Parsing an XML document with a default namespace in Scrapy

While writing a new spider for Feeds I stumbled upon the following problem. I wanted to parse an XML feed with a default...
