XML default namespace leads to TypeError: __init__() keywords must be strings

This is a bug with handling valid XML namespaces; soupsieve assumes all namespaces have a prefix:

<prefix:tag xmlns:prefix="...">

but the prefix can be omitted to define a default namespace:

<tag xmlns="...">

meaning that any element without a prefix: prepended to the tag name is in that namespace. See section 6.2 of the XML namespaces 1.1 spec.
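
To see the default-namespace rule in isolation, here is a small sketch using the standard library's xml.etree.ElementTree instead of lxml or BeautifulSoup (a substitution made for illustration only; it is not part of the original report). ElementTree expands each tag to {namespace-uri}localname, which makes the namespace of an unprefixed element visible. The XML declaration is left out of this sample.

```python
import xml.etree.ElementTree as ET

# Same document shape as in the report: one default namespace, one prefixed one.
sample = (
    "<book xmlns='urn:loc.gov:books' xmlns:isbn='urn:ISBN:0-395-36341-6'>"
    "<title>Cheaper by the Dozen</title>"
    "<isbn:number>1568491379</isbn:number>"
    "</book>"
)

root = ET.fromstring(sample)

# ElementTree reports tags as "{namespace-uri}localname".
tags = [root.tag] + [child.tag for child in root]
for tag in tags:
    print(tag)
```

The unprefixed book and title elements land in urn:loc.gov:books, while isbn:number lands in the namespace bound to the isbn prefix.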

During parsing, lxml passes in a default namespace under the None key, e.g. {None: "..."}, and unique keys are accumulated in the soup._namespaces dictionary. soupsieve assumes the dictionary only ever has string keys, so an XML document with a default namespace leads to an exception.
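
The failure itself is plain Python behaviour and can be reproduced without either library. The sketch below uses a hypothetical stand-in class, Namespaces, that accepts keyword arguments, loosely mirroring the ct.Namespaces(**(namespaces)) call in the traceback; it is not soupsieve's actual class. Unpacking a dictionary with ** requires every key to be a string, so a None key fails before __init__ runs.

```python
class Namespaces(dict):
    """Hypothetical stand-in for a mapping built from keyword arguments."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)


def build(namespaces):
    # ** unpacking turns each key into a keyword name, so keys must be str.
    return Namespaces(**namespaces)


# String keys unpack without trouble.
ok = build({"isbn": "urn:ISBN:0-395-36341-6"})

# A None key (how the default namespace arrives from the parser) does not.
error = None
try:
    build({None: "urn:loc.gov:books"})
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The exact wording of the message varies between Python versions, but the exception type is TypeError in each case.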

Test case (using BeautifulSoup 4.7 for convenience):

>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.7.0'
>>> sample = b'''\
... <?xml version="1.1"?>
... <!-- unprefixed element types are from "books" -->
... <book xmlns='urn:loc.gov:books'
...       xmlns:isbn='urn:ISBN:0-395-36341-6'>
...     <title>Cheaper by the Dozen</title>
...     <isbn:number>1568491379</isbn:number>
... </book>
... '''
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup._namespaces
{'xml': 'http://www.w3.org/XML/1998/namespace', None: 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
>>> soup.select_one('title')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1345, in select_one
    value = self.select(selector, namespaces, 1, **kwargs)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1377, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 50, in compile
    namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings

where <title>Cheaper by the Dozen</title> was expected.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 13 (10 by maintainers)

Top GitHub Comments

facelessuser commented, Jan 6, 2019

Here’s the thing: I would argue this is a bug in BeautifulSoup. The SoupSieve documentation explicitly states that the default namespace must be keyed by an empty string: https://facelessuser.github.io/soupsieve/api/#namespaces.

So, if we use a proper namespace key, we see that SoupSieve can handle no prefixes just fine. This usage was considered and planned for in the beginning:

>>> sample = b'''\
... <?xml version="1.1"?>
... <!-- unprefixed element types are from "books" -->
... <book xmlns='urn:loc.gov:books'
...       xmlns:isbn='urn:ISBN:0-395-36341-6'>
...     <title>Cheaper by the Dozen</title>
...     <isbn:number>1568491379</isbn:number>
... </book>
... '''
>>> namespaces = {'xml': 'http://www.w3.org/XML/1998/namespace', '': 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup.select_one('title', namespaces=namespaces)
<title>Cheaper by the Dozen</title>

When soupsieve was proposed, self._namespaces was not a tracked attribute in BeautifulSoup. It was added after my original proposal and unfortunately was not covered by test cases, or this would have been caught and discussed before release. SoupSieve doesn’t check self._namespaces for the prefix key; it is the other way around: BeautifulSoup passes self._namespaces into SoupSieve as a default, but does not follow SoupSieve’s requirements.
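
Until the two libraries agree on the key, one possible workaround is to rewrite the collected dictionary yourself and pass it explicitly. The helper below, normalize_namespaces, is a hypothetical name introduced here for illustration, not part of BeautifulSoup or SoupSieve. It only assumes what is stated above: the parser stores the default namespace under None, and SoupSieve expects it under an empty string.

```python
def normalize_namespaces(namespaces):
    """Return a copy of the mapping with a None key renamed to ''."""
    return {
        ("" if prefix is None else prefix): uri
        for prefix, uri in namespaces.items()
    }


# The dictionary shown in the report's soup._namespaces output.
collected = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    None: "urn:loc.gov:books",
    "isbn": "urn:ISBN:0-395-36341-6",
}

fixed = normalize_namespaces(collected)
print(fixed)
```

The result could then be passed the same way as in the session above, for example soup.select_one('title', namespaces=fixed).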

facelessuser commented, Jan 7, 2019

Thanks for the info!
