SQL-based DocumentStores fail when document metadata contains a list
This issue is easily reproducible with FAISSDocumentStore and SQLDocumentStore. When the value of a key in doc.meta is a list, document_store.write_documents fails. This happens when I use TikaConverter to convert a directory of files, some of which have lists in their metadata.
Error message
File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/faiss.py", line 295, in write_documents
super(FAISSDocumentStore, self).write_documents(
File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/sql.py", line 401, in write_documents
self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3209, in delete
result = self.session.execute(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
) = compile_state_cls.orm_pre_session_exec(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1829, in orm_pre_session_exec
session._autoflush()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2257, in _autoflush
util.raise_(e, with_traceback=sys.exc_info()[2])
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2246, in _autoflush
self.flush()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
self._flush(objects)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3523, in _flush
transaction.rollback(_capture_exception=True)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
flush_context.execute()
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
rec.execute(self)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
util.preloaded.orm_persistence.save_obj(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
_emit_insert_statements(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1238, in _emit_insert_statements
result = connection._execute_20(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
return meth(self, args_10style, kwargs_10style, execution_options)
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
return connection._execute_clauseelement(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
ret = self._execute_context(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
self._handle_dbapi_exception(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
util.raise_(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
raise exception
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
self.dialect.do_execute(
File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.InterfaceError) Error binding parameter 2 - probably unsupported type.
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: ('60afe526-9500-420d-8170-0aeff009e205', 'xmpMM:History:InstanceID', ['xmp.iid:f12cf1f4-997f-474e-b7e1-4eb15ac63f4f', 'xmp.iid:d8516db1-2277-c840-b5d8-12ba4225175d'], 'feb937070fb26bb3f3cf9f397a3531a5', 'document')]
(Background on this error at: https://sqlalche.me/e/14/rvf5)
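The root cause is visible in the last two lines of the traceback: the sqlite3 DB-API driver can only bind scalar parameter types (None, int, float, str, bytes), so a Python list is rejected at bind time before any SQL runs. A minimal sketch, independent of Haystack, that reproduces the same driver error (the table and values here only mimic the ones in the traceback; on Python 3.9, as above, the driver raises InterfaceError, while newer Python versions raise ProgrammingError for the same condition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta_document (name TEXT, value TEXT)")

# A scalar value binds fine.
conn.execute("INSERT INTO meta_document VALUES (?, ?)", ("author", "jane"))

# A list does not: the driver rejects it at bind time.
try:
    conn.execute(
        "INSERT INTO meta_document VALUES (?, ?)",
        ("xmpMM:History:InstanceID", ["xmp.iid:a", "xmp.iid:b"]),
    )
    bind_error = None
except (sqlite3.InterfaceError, sqlite3.ProgrammingError) as exc:
    bind_error = exc

print(bind_error)
```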
Expected behavior The current workaround is simply to tell TikaConverter.convert to drop the offending meta field. There should be an option to ignore such metadata fields automatically.
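Until such an option exists, the meta dicts can be sanitized before calling write_documents. A minimal sketch (the helper name flatten_meta and the join policy are my own, not a Haystack API): keep SQL-bindable scalars, join lists of strings into one string so that data isn't silently lost, and drop everything else.

```python
def flatten_meta(meta: dict, sep: str = "\n") -> dict:
    """Return a copy of ``meta`` containing only SQL-bindable values.

    Lists of strings are joined with ``sep``; any other non-scalar
    value is dropped so SQLite parameter binding cannot fail.
    """
    flat = {}
    for key, value in meta.items():
        if value is None or isinstance(value, (str, int, float, bool)):
            flat[key] = value
        elif isinstance(value, list) and all(isinstance(v, str) for v in value):
            flat[key] = sep.join(value)
        # anything else (dicts, mixed lists, ...) is intentionally dropped
    return flat
```

With this in place one could run, e.g., `for doc in docs: doc.meta = flatten_meta(doc.meta)` before `document_store.write_documents(docs)`.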
FAQ Check Yes
System:
- OS: Ubuntu 18.04
- GPU/CPU: Nvidia Titan XP/Intel® Xeon® CPU E5-2697
- Haystack version (commit or version number): ba08fc86f5871555537cc3766889e9cae8c5ad03
- DocumentStore: FAISS
- Reader: TransformersReader
- Retriever: EmbeddingRetriever
Issue Analytics
- State:
- Created a year ago
- Comments: 11 (10 by maintainers)
Top GitHub Comments
@sridhar Indeed, my thinking was that people who use SQLite are commonly running small tests, so more complex metadata processing wouldn't be needed. Given its performance limitations, SQLite is strongly discouraged for any mainstream NLP scenario, i.e. anything beyond tests or small-scale use, where the workload exceeds even a modest profile.
Yes. That's one point: the main idea (filtering metadata) is already in the code. With JSON, it would still be easy to exclude fields in the hierarchy that aren't needed and keep the ones of interest for NLP.
#2809 is somewhat related to this subject. Implementing JSON data type support in the document store would allow storing lots of other information that could be of interest. The user would decide when building their pipeline.
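As an illustration of that idea (not the implementation discussed in #2809), complex meta values can be JSON-encoded into the existing TEXT column, which makes them bindable while keeping the original structure recoverable on read. A sketch with a hypothetical table mirroring the schema in the traceback:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meta_document (name TEXT, value TEXT)")

# The kind of value that currently fails to bind.
meta_value = ["xmp.iid:a", "xmp.iid:b"]

# Encode on write: any JSON-serializable object becomes a plain string.
conn.execute(
    "INSERT INTO meta_document VALUES (?, ?)",
    ("xmpMM:History:InstanceID", json.dumps(meta_value)),
)

# Decode on read: the original structure is recovered.
stored = conn.execute(
    "SELECT value FROM meta_document WHERE name = ?",
    ("xmpMM:History:InstanceID",),
).fetchone()[0]
assert json.loads(stored) == meta_value
```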
I think that, for now, doing a type check as a safety measure would be a good starting point. The error is simply bypassed, at the cost of losing some data.
Maybe a progressive change: first, current users could keep using this document store in such scenarios, knowing they will lose the metadata. Then the document stores could be split (SQLite's capabilities are much more limited than the current mainstream ones), which would technically be a breaking change, since the document store name would change and users would have to update their code. After that, the class could be improved to handle any kind of object in metadata, keeping the current features while adding support for complex objects.
@sjrl I think this issue can be closed, as the problem has actually been solved. In any case, the interesting reflections we made will remain here. What @TuanaCelik reported is addressed in another issue.