InvalidBSON error: 'utf8' codec can't decode bytes
I’ve restored some data to MongoDB using mongorestore with the ‘objcheck’ parameter, as shown below, to ensure that all records contain valid BSON. I’m also running MongoDB v3.2.10, which I believe has objcheck enabled by default to prevent invalid data from entering the database.
mongorestore --objcheck --collection mydata --db mydb mongodump.bson
When I run mongo-connector against the data to load it into Elasticsearch, the process crashes after a short while with an exception suggesting it has encountered invalid BSON, as shown below.
2016-10-21 07:47:54,252 [CRITICAL] mongo_connector.oplog_manager:782 - Exception during collection dump
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 735, in do_dump
upsert_all(dm)
File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 719, in upsert_all
dm.bulk_upsert(docs_to_dump(namespace), mapped_ns, long_ts)
File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/util.py", line 32, in wrapped
return f(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 229, in bulk_upsert
for ok, resp in responses:
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 161, in streaming_bulk
for bulk_actions in _chunk_actions(actions, chunk_size, max_chunk_bytes, client.transport.serializer):
File "/usr/local/lib/python2.7/dist-packages/elasticsearch/helpers/__init__.py", line 55, in _chunk_actions
for action, data in actions:
File "/usr/local/lib/python2.7/dist-packages/mongo_connector/doc_managers/elastic2_doc_manager.py", line 195, in docs_to_upsert
for doc in docs:
File "/usr/local/lib/python2.7/dist-packages/mongo_connector-2.5.0.dev0-py2.7.egg/mongo_connector/oplog_manager.py", line 680, in docs_to_dump
for doc in cursor:
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 1090, in next
if len(self.__data) or self._refresh():
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 1032, in _refresh
self.__max_await_time_ms))
File "/usr/local/lib/python2.7/dist-packages/pymongo/cursor.py", line 903, in __send_message
codec_options=self.__codec_options)
File "/usr/local/lib/python2.7/dist-packages/pymongo/helpers.py", line 142, in _unpack_response
"data": bson.decode_all(response[20:], codec_options)}
InvalidBSON: 'utf8' codec can't decode bytes in position 74-76: invalid continuation byte
2016-10-21 07:47:54,253 [ERROR] mongo_connector.oplog_manager:790 - OplogThread: Failed during dump collection cannot recover! Collection(Database(MongoClient(host=[u'localhost:27017'], document_class=dict, tz_aware=False, connect=True, replicaset=u'rs0'), u'local'), u'oplog.rs')
2016-10-21 07:47:54,254 [DEBUG] mongo_connector.oplog_manager:278 - OplogThread: Last entry is the one we already processed. Up to date. Sleeping.
2016-10-21 07:47:55,071 [ERROR] mongo_connector.connector:310 - MongoConnector: OplogThread <OplogThread(Thread-2, started 140537034176256)> unexpectedly stopped! Shutting down
2016-10-21 07:47:55,071 [INFO] mongo_connector.connector:368 - MongoConnector: Stopping all OplogThreads
2016-10-21 07:47:55,071 [DEBUG] mongo_connector.oplog_manager:457 - OplogThread: exiting due to join call.
I’ve run mongo-connector in verbose debug mode, but I can’t find the source of the problem.
Is it likely that the objcheck flag is not working and is allowing invalid data into Mongo, or is the problem elsewhere? Thanks
Maintainer edit: Added python syntax.
Issue Analytics
- Created 7 years ago
- Comments: 6 (2 by maintainers)
It sounds like the --objcheck flag does not validate individual BSON types, only the objects themselves: https://docs.mongodb.com/manual/reference/program/mongorestore/#cmdoption--objcheck

You can try to find the invalid document(s) and remove them. Another option is to use the unicode_decode_error_handler=ignore codec option, set in the MongoDB connection string, to work around this issue.

Confirmed that this fixes all the issues:
Adding that unicode_decode_error_handler='ignore' worked on pymongo 3.9.0!