XML default namespace leads to TypeError: __init__() keywords must be strings
This is a bug in the handling of valid XML namespaces; soupsieve assumes all namespaces have a prefix:
<prefix:tag xmlns:prefix="...">
but the prefix can be omitted to define a default namespace:
<tag xmlns="...">
meaning that any element without a "prefix:" prepended to the tag name is in that namespace. See section 6.2 of the XML namespaces 1.1 spec.
During parsing, lxml passes in a default namespace under the None key, e.g. {None: "..."}, and unique keys are accumulated in the soup._namespaces dictionary. soupsieve assumes the dictionary only ever has string keys, so an XML document with a default namespace leads to an exception.
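The failure mode can be reproduced without BeautifulSoup or lxml. The sketch below uses a hypothetical stand-in class, NamespacesStandIn, that accepts only keyword arguments. It is not soupsieve's actual implementation, only an illustration of why unpacking a dictionary that has a None key with ** raises a TypeError.

```python
class NamespacesStandIn:
    """Hypothetical stand-in for a class built from keyword arguments."""

    def __init__(self, **kwargs):
        self.mapping = dict(kwargs)


# Same shape as the dictionary described above: the default namespace
# is stored under the None key.
namespaces = {
    'xml': 'http://www.w3.org/XML/1998/namespace',
    None: 'urn:loc.gov:books',
    'isbn': 'urn:ISBN:0-395-36341-6',
}

try:
    NamespacesStandIn(**namespaces)
except TypeError as exc:
    # Python rejects non-string keys when a mapping is unpacked with **.
    error = exc
else:
    error = None

print(type(error).__name__, error)

# With string-only keys the same call succeeds.
string_only = {k: v for k, v in namespaces.items() if k is not None}
ok = NamespacesStandIn(**string_only)
```

The exact wording of the error message varies between Python versions, so the sketch only relies on the exception type.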
Test case (using BeautifulSoup 4.7 for convenience):
>>> from bs4 import BeautifulSoup, __version__
>>> __version__
'4.7.0'
>>> sample = b'''\
... <?xml version="1.1"?>
... <!-- unprefixed element types are from "books" -->
... <book xmlns='urn:loc.gov:books'
... xmlns:isbn='urn:ISBN:0-395-36341-6'>
... <title>Cheaper by the Dozen</title>
... <isbn:number>1568491379</isbn:number>
... </book>
... '''
>>> soup = BeautifulSoup(sample, 'xml')
>>> soup._namespaces
{'xml': 'http://www.w3.org/XML/1998/namespace', None: 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
>>> soup.select_one('title')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1345, in select_one
    value = self.select(selector, namespaces, 1, **kwargs)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1377, in select
    return soupsieve.select(selector, self, namespaces, limit, **kwargs)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
    return compile(select, namespaces, flags).select(tag, limit)
  File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 50, in compile
    namespaces = ct.Namespaces(**(namespaces))
TypeError: __init__() keywords must be strings
where <title>Cheaper by the Dozen</title> was expected.
Issue Analytics
- State:
- Created 5 years ago
- Comments:13 (10 by maintainers)
Top GitHub Comments
Here’s the thing: I would argue this is a bug with BeautifulSoup. The SoupSieve documentation explicitly describes the expected namespace dictionary here: https://facelessuser.github.io/soupsieve/api/#namespaces.
So, if we use a proper namespace key, we see that SoupSieve can handle unprefixed elements just fine. This usage was considered and planned for from the beginning.
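A minimal sketch of what a "proper namespace key" could look like in practice, assuming (per the SoupSieve namespace documentation linked above) that the default namespace is keyed by an empty string rather than None. The helper name normalize_namespaces is hypothetical and is not part of BeautifulSoup or SoupSieve.

```python
def normalize_namespaces(namespaces):
    """Return a copy with the None key (default namespace) replaced by ''.

    Hypothetical helper, not part of BeautifulSoup or SoupSieve.
    If the input has both a None key and a '' key, the later one wins.
    """
    return {('' if prefix is None else prefix): uri
            for prefix, uri in namespaces.items()}


# Dictionary as reported in the test case above.
raw = {
    'xml': 'http://www.w3.org/XML/1998/namespace',
    None: 'urn:loc.gov:books',
    'isbn': 'urn:ISBN:0-395-36341-6',
}

fixed = normalize_namespaces(raw)
print(fixed)

# Possible usage (not run here), assuming select_one accepts a namespaces
# argument as the traceback above suggests:
#   soup.select_one('title', namespaces=normalize_namespaces(soup._namespaces))
```

Passing the namespaces explicitly avoids relying on the soup._namespaces default, which is where the None key comes from.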
When soupsieve was proposed, self._namespaces was not a tracked attribute in BeautifulSoup. This was a feature added after my original proposal, and unfortunately it was not included in test cases, or this would have been caught and discussed before release. SoupSieve doesn’t check self._namespaces for the prefix key; it is the other way around. BeautifulSoup is sending self._namespaces into SoupSieve as a default, but it is not following SoupSieve’s requirements.

Thanks for the info!