SQL based Datastores fail when document metadata has a list

This issue is easily reproducible in FAISSDocumentStore and SQLDocumentStore. When the value of a key in doc.meta is a list, document_store.write_documents fails. This happens when I’m using TikaConverter to convert a directory of files, and some of them have lists in their metadata.

Error message

File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/faiss.py", line 295, in write_documents
    super(FAISSDocumentStore, self).write_documents(
  File "/home/sridhar/devel/doc_intelligence/haystack/haystack/document_stores/sql.py", line 401, in write_documents
    self.session.query(MetaDocumentORM).filter_by(document_id=doc.id).delete()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/query.py", line 3209, in delete
    result = self.session.execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 1660, in execute
    ) = compile_state_cls.orm_pre_session_exec(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1829, in orm_pre_session_exec
    session._autoflush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2257, in _autoflush
    util.raise_(e, with_traceback=sys.exc_info()[2])
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 2246, in _autoflush
    self.flush()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3383, in flush
    self._flush(objects)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3523, in _flush
    transaction.rollback(_capture_exception=True)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/session.py", line 3483, in _flush
    flush_context.execute()
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
    rec.execute(self)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
    util.preloaded.orm_persistence.save_obj(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    _emit_insert_statements(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/orm/persistence.py", line 1238, in _emit_insert_statements
    result = connection._execute_20(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1631, in _execute_20
    return meth(self, args_10style, kwargs_10style, execution_options)
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 332, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1498, in _execute_clauseelement
    ret = self._execute_context(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1862, in _execute_context
    self._handle_dbapi_exception(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2043, in _handle_dbapi_exception
    util.raise_(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1819, in _execute_context
    self.dialect.do_execute(
  File "/home/sridhar/devel/doc_intelligence/env_haystack/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.InterfaceError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.InterfaceError) Error binding parameter 2 - probably unsupported type.
[SQL: INSERT INTO meta_document (id, name, value, document_id, document_index) VALUES (?, ?, ?, ?, ?)]
[parameters: ('60afe526-9500-420d-8170-0aeff009e205', 'xmpMM:History:InstanceID', ['xmp.iid:f12cf1f4-997f-474e-b7e1-4eb15ac63f4f', 'xmp.iid:d8516db1-2277-c840-b5d8-12ba4225175d'], 'feb937070fb26bb3f3cf9f397a3531a5', 'document')]

(Background on this error at: https://sqlalche.me/e/14/rvf5)
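
The failure can be reproduced without Haystack or SQLAlchemy. The following is a minimal sketch using only Python's built-in sqlite3 module; the helper name `try_insert` and the simplified two-column table are assumptions made for illustration, not part of the Haystack code base.

```python
import sqlite3


def try_insert(value):
    """Return True if SQLite accepts `value` as a bound parameter, else False."""
    conn = sqlite3.connect(":memory:")
    try:
        # Simplified stand-in for the meta_document table in the traceback.
        conn.execute("CREATE TABLE meta_document (name TEXT, value TEXT)")
        try:
            conn.execute(
                "INSERT INTO meta_document (name, value) VALUES (?, ?)",
                ("xmpMM:History:InstanceID", value),
            )
        except sqlite3.Error:
            # Python 3.9 raises sqlite3.InterfaceError, as in the traceback;
            # other Python versions may raise a different sqlite3.Error subclass.
            return False
        return True
    finally:
        conn.close()


print(try_insert("xmp.iid:f12cf1f4"))
print(try_insert(["xmp.iid:f12cf1f4", "xmp.iid:d8516db1"]))
```

The string value binds without trouble, while the list is rejected by the driver before any SQL is executed, which matches the "Error binding parameter 2" message above.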

Expected behavior

The workaround is to set the particular meta variable to “ignore” in the TikaConverter.convert function. There should be a knob to ignore such metadata automatically.
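
Until such a knob exists, one option is to filter the metadata before calling write_documents. The sketch below is only an illustration of that idea: `drop_unsupported_meta` and `SCALAR_TYPES` are hypothetical names, and the set of accepted types is an assumption based on what the sqlite3 driver binds natively.

```python
import logging

logger = logging.getLogger(__name__)

# Types that Python's sqlite3 driver can bind natively (assumed safe set).
SCALAR_TYPES = (str, int, float, bool, bytes, type(None))


def drop_unsupported_meta(meta):
    """Return a copy of `meta` without values that are not plain scalars."""
    cleaned = {}
    for key, value in meta.items():
        if isinstance(value, SCALAR_TYPES):
            cleaned[key] = value
        else:
            # Lists, dicts and other containers are skipped, so data is lost.
            logger.warning(
                "Dropping meta key %r with unsupported type %s",
                key,
                type(value).__name__,
            )
    return cleaned


meta = {
    "title": "Report",
    "xmpMM:History:InstanceID": ["xmp.iid:f12cf1f4", "xmp.iid:d8516db1"],
}
print(drop_unsupported_meta(meta))
```

Applying the helper to each document's meta dictionary before writing avoids the binding error, at the cost of silently losing the list-valued fields apart from the log warning.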

FAQ Check Yes

System:

  • OS: Ubuntu 18.04
  • GPU/CPU: Nvidia Titan XP/Intel® Xeon® CPU E5-2697
  • Haystack version (commit or version number): ba08fc86f5871555537cc3766889e9cae8c5ad03
  • DocumentStore: FAISS
  • Reader: TransformersReader
  • Retriever: EmbeddingRetriever

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 11 (10 by maintainers)

Top GitHub Comments

2 reactions
danielbichuetti commented, Jul 21, 2022

@sridhar Indeed, my thinking was that people who use SQLite are usually running small tests, where more complex metadata processing wouldn’t be needed. Given its performance problems, SQLite is strongly discouraged for mainstream NLP scenarios (anything beyond tests or small scale), where the workload exceeds even a modest level.

Some of the metadata generated by these parsers are quite expansive

Yes. On this point, the main idea (filtering metadata) is already in the code. With JSON, it would still be easy to exclude fields in the hierarchy that aren’t needed and keep the ones of interest for NLP.

#2809 is somewhat related to this subject. Implementing JSON data type support in the data store would allow storing a lot of other information that could be of interest. Users would decide what to keep when building their pipelines.
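
As a rough illustration of the JSON idea, list values could be serialized to text before they reach the database and decoded again on read. This is a sketch under assumptions, not what Haystack or #2809 implements: `encode_meta_value`, `decode_meta_value` and the simplified table are hypothetical.

```python
import json
import sqlite3


def encode_meta_value(value):
    """Serialize a JSON-compatible value to text so SQLite can bind it."""
    return json.dumps(value)


def decode_meta_value(text):
    """Restore the original Python value from its JSON text form."""
    return json.loads(text)


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the meta_document table.
conn.execute("CREATE TABLE meta_document (name TEXT, value TEXT)")

history = ["xmp.iid:f12cf1f4", "xmp.iid:d8516db1"]
conn.execute(
    "INSERT INTO meta_document (name, value) VALUES (?, ?)",
    ("xmpMM:History:InstanceID", encode_meta_value(history)),
)

(stored,) = conn.execute(
    "SELECT value FROM meta_document WHERE name = ?",
    ("xmpMM:History:InstanceID",),
).fetchone()
restored = decode_meta_value(stored)
conn.close()
```

With this approach the list survives the round trip, but every reader of the table has to know that values are JSON-encoded, and filtering on individual list items in SQL becomes harder.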

I think that, for now, a type check as a safety measure would be just a starting point: the error is simply bypassed, at the cost of losing some data.

Maybe a progressive change would work. First, current users could keep using this data store in such scenarios, knowing they will lose the metadata. Next, the document stores would be split (SQLite's capabilities are much more limited than those of current mainstream databases), which would technically be a breaking change, since the document store name would change and users would have to update their code. Finally, the class would be improved to handle any kind of object in metadata, keeping the current features while adding support for complex objects.

1 reaction
anakin87 commented, Sep 7, 2022

@sjrl I think that this issue can be closed as the problem has actually been solved. In any case, the interesting reflections we have made will remain here. What @TuanaCelik reported is addressed in another issue.
