InvalidBSON error: 'utf8' codec can't decode bytes

I’ve restored some data to MongoDB using mongorestore with the --objcheck option, as shown below, to ensure that all records contain valid BSON. I’m also running MongoDB v3.2.10, which I believe has objcheck enabled by default to prevent invalid data from entering the database.

mongorestore --objcheck --collection mydata --db mydb mongodump.bson

When I run mongo-connector against the data to load into Elastic, after a short while the process crashes with an exception suggesting it has encountered invalid BSON as shown below.

2016-10-21 07:47:54,252 [CRITICAL] mongo_connector.oplog_manager:782 - Exception during collection dump
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 735, in do_dump
    upsert_all(dm)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 719, in upsert_all
    dm.bulk_upsert(docs_to_dump(namespace), mapped_ns, long_ts)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/util.py", line 32, in wrapped
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 229, in bulk_upsert
    for ok, resp in responses:
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 161, in streaming_bulk
    for bulk_actions in _chunk_actions(actions, chunk_size, max_chunk_bytes, client.transport.serializer):
  File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 55, in _chunk_actions
    for action, data in actions:
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 195, in docs_to_upsert
    for doc in docs:
  File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 680, in docs_to_dump
    for doc in cursor:
  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 1090, in next
    if len(self.__data) or self._refresh():
  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 1032, in _refresh
    self.__max_await_time_ms))
  File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 903, in __send_message
    codec_options=self.__codec_options)
  File "/usr/local/lib/python2.7/dist-packages/pymongo/helpers.py", line 142, in _unpack_response
    "data": bson.decode_all(response[20:], codec_options)}
InvalidBSON: 'utf8' codec can't decode bytes in position 74-76: invalid continuation byte
2016-10-21 07:47:54,253 [ERROR] mongo_connector.oplog_manager:790 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=[u'localhost:27017'], document_class=dict, tz_aware=False, connect=True, replicaset=u'rs0'), u'local'), u'oplog.rs')
2016-10-21 07:47:54,254 [DEBUG] mongo_connector.oplog_manager:278 - OplogThread: Last entry is the one we already processed.  Up to date.  Sleeping.
2016-10-21 07:47:55,071 [ERROR] mongo_connector.connector:310 - MongoConnector: OplogThread <OplogThread(Thread-2, started 140537034176256)> unexpectedly stopped! Shutting down
2016-10-21 07:47:55,071 [INFO] mongo_connector.connector:368 - MongoConnector: Stopping all OplogThreads
2016-10-21 07:47:55,071 [DEBUG] mongo_connector.oplog_manager:457 - OplogThread: exiting due to join call.

I’ve run mongo-connector in verbose debug mode, but I can’t find the source of the problem.

Is it likely that the objcheck flag is not working and is letting invalid data into Mongo, or is the problem elsewhere? Thanks

Maintainer edit: Added python syntax.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

ShaneHarvey commented, Oct 25, 2016 (2 reactions)

It sounds like the --objcheck flag does not validate individual BSON types other than the objects themselves: https://docs.mongodb.com/manual/reference/program/mongorestore/#cmdoption--objcheck

You can try to find the invalid document(s) and remove them. Another option is to work around the issue by setting the unicode_decode_error_handler=ignore codec option in the MongoDB connection string:

$ mongo-connector -m 'mongodb://localhost:27017/?unicode_decode_error_handler=ignore' <your args...>
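
For context on what this option changes: as far as I can tell from the PyMongo documentation, unicode_decode_error_handler takes the same values as the errors argument of Python's bytes.decode ('strict' by default, plus 'ignore', 'replace', and others). The sketch below uses only the standard library (Python 3) and a made-up byte string to show how each handler treats a Latin-1 byte inside text that is supposed to be UTF-8. It does not call PyMongo itself.

```python
# Hypothetical field value: "café latte" encoded as Latin-1 instead of UTF-8.
raw = "caf\u00e9 latte".encode("latin-1")  # b'caf\xe9 latte'

# 'strict' (the default) raises; in the traceback above PyMongo reports
# this kind of failure as InvalidBSON.
try:
    raw.decode("utf-8", errors="strict")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc.reason

ignored = raw.decode("utf-8", errors="ignore")    # the bad byte is dropped
replaced = raw.decode("utf-8", errors="replace")  # the bad byte becomes U+FFFD

print(strict_error)
print(repr(ignored))
print(repr(replaced))
```

Note that 'ignore' discards the offending bytes silently, so the text that reaches Elasticsearch may differ slightly from what is stored in MongoDB.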

johnjjung commented, Oct 24, 2019 (0 reactions)

Confirmed that this fixes all the issues:

from pymongo import MongoClient
MongoClient('mongodb://mongodb:27017', unicode_decode_error_handler='ignore')

Adding that unicode_decode_error_handler='ignore' worked on pymongo 3.9.0!
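
If you would rather locate the affected documents than drop bytes silently, one possible approach (an assumption on my part, not something confirmed in this thread) is to connect with unicode_decode_error_handler='replace' instead, so that undecodable bytes come back as U+FFFD, and then scan the decoded documents for that character. The helper below is a hypothetical, standard-library-only sketch (Python 3) of the scanning step. The name find_replaced_strings and the sample document are made up for illustration.

```python
REPLACEMENT_CHAR = "\ufffd"  # what the 'replace' handler substitutes for bad bytes


def find_replaced_strings(value, path=""):
    """Yield the dotted path of every string value that contains U+FFFD."""
    if isinstance(value, str):
        if REPLACEMENT_CHAR in value:
            yield path
    elif isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            yield from find_replaced_strings(child, child_path)
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            child_path = f"{path}.{index}" if path else str(index)
            yield from find_replaced_strings(child, child_path)


# Hypothetical document, shaped like one decoded with the 'replace' handler.
doc = {
    "_id": 1,
    "name": "caf\ufffd latte",
    "tags": ["ok", "bad\ufffd"],
    "meta": {"note": "fine"},
}

suspect_paths = list(find_replaced_strings(doc))
print(suspect_paths)
```

In practice the documents would come from a collection read through a MongoClient created with unicode_decode_error_handler='replace'. A string can legitimately contain U+FFFD, so treat any hits as candidates to inspect rather than proof of corruption.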
