Separate workers for parsing and database insertions

Is your feature request related to a problem? Please describe.

Decouple UDF processes from the backend/database session. Right now, when we run UDFRunner.apply_mt(), we create a number of UDF worker processes. Each of these processes owns its own SQLAlchemy Session object and adds and commits to the database at the end of its parsing loop.

Describe the solution you’d like

Make the UDF processes backend-agnostic, e.g. by having a set of separate BackendWorker processes handle the insertion of sentences. One possible way: connect the output_queue of each UDF process to the input of a BackendWorker, which receives lists of Sentence objects and handles the SQLAlchemy commits.

This will not fully decouple UDF from the backend, because the parser returns sqlalchemy-specific Sentence objects, but it could be one step towards that goal.
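
The proposed layout can be sketched roughly as follows. This is a minimal illustration and not Fonduer code: parse_worker, backend_worker, and run_pipeline are hypothetical names, threads and queue.Queue stand in for the UDF processes and their output_queue, and the standard-library sqlite3 module stands in for the SQLAlchemy/PostgreSQL backend. The point it shows is that only the writer ever opens a database connection.

```python
import queue
import sqlite3
import threading

SENTINEL = None  # a parser puts this on the queue when it has no more work


def parse_worker(docs, out_queue):
    """Hypothetical stand-in for a UDF worker: parses only, never touches the DB."""
    for doc_id, text in docs:
        # Toy "parser": split on periods and drop empty fragments.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        out_queue.put((doc_id, sentences))
    out_queue.put(SENTINEL)


def backend_worker(in_queue, db_path, n_parsers):
    """Hypothetical BackendWorker: the only place that owns a DB connection."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sentence "
        "(doc_id TEXT, position INTEGER, text TEXT)"
    )
    finished = 0
    while finished < n_parsers:
        item = in_queue.get()
        if item is SENTINEL:
            finished += 1
            continue
        doc_id, sentences = item
        # One bulk insert and one commit per parsed document.
        conn.executemany(
            "INSERT INTO sentence VALUES (?, ?, ?)",
            [(doc_id, i, s) for i, s in enumerate(sentences)],
        )
        conn.commit()
    conn.close()


def run_pipeline(docs, db_path, n_parsers=2):
    """Fan documents out to parsers and funnel their output into one writer."""
    q = queue.Queue()
    chunks = [docs[i::n_parsers] for i in range(n_parsers)]
    writer = threading.Thread(target=backend_worker, args=(q, db_path, n_parsers))
    parsers = [threading.Thread(target=parse_worker, args=(c, q)) for c in chunks]
    writer.start()
    for p in parsers:
        p.start()
    for p in parsers:
        p.join()
    writer.join()
```

In this sketch the parsers exchange plain tuples with the writer, so they carry no backend-specific types. As noted above, the real parser returns SQLAlchemy-specific Sentence objects, so the actual change would decouple less than this toy version does.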

Additional context

This feature request refers to decoupling of parsing and backend. There’s likely more coupling with the backend later in the processing pipeline.

Issue Analytics

  • State: open
  • Created 5 years ago
  • Reactions: 2
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

HiromuHota commented, Nov 9, 2019 (1 reaction)

We could use (Py)Spark, Dask, etc. for distributed computing but the bottleneck would be the data persistence layer, i.e., PostgreSQL. In other words, as long as we use PostgreSQL, it’ll be the bottleneck and we end up doing ad-hoc performance optimizations here and there.

One idea is to use different appliers for different storage backends: one for in-memory, another for PostgreSQL, yet another for Hive, etc. The snorkel project (not snorkel-extraction) takes this approach for different computing frameworks (LFApplier, DaskLFApplier, SparkLFApplier), but Fonduer has more appliers to take care of (parser, mention_extractor, candidate_extractor, labeler, featurizer), and Fonduer has to worry about the data persistence layer too.

senwu commented, Nov 9, 2019 (0 reactions)

That’s one idea! I think it would be better to modularize so that we can 1) better support distributed computing frameworks from other parties (e.g., PySpark, Dask) and 2) easily extend to other data layers.
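
One way to picture the pluggable-backend idea from the comments above is a small storage interface that the parsing step writes to without knowing what sits behind it. The sketch below is only an assumption about how that could look: SentenceStore, InMemoryStore, SQLiteStore, and apply_parser are hypothetical names that exist in neither Fonduer nor snorkel, and sqlite3 stands in for PostgreSQL or Hive.

```python
import sqlite3
from typing import Iterable, List, Protocol, Tuple

Row = Tuple[str, int, str]  # (doc_id, position, sentence text)


class SentenceStore(Protocol):
    """Hypothetical interface that any storage backend would implement."""

    def add_all(self, rows: Iterable[Row]) -> None: ...

    def count(self) -> int: ...


class InMemoryStore:
    """Keeps rows in a Python list; handy for tests and small runs."""

    def __init__(self) -> None:
        self.rows: List[Row] = []

    def add_all(self, rows: Iterable[Row]) -> None:
        self.rows.extend(rows)

    def count(self) -> int:
        return len(self.rows)


class SQLiteStore:
    """Persists rows through sqlite3 (a stand-in for a real database backend)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS sentence "
            "(doc_id TEXT, position INTEGER, text TEXT)"
        )

    def add_all(self, rows: Iterable[Row]) -> None:
        self.conn.executemany("INSERT INTO sentence VALUES (?, ?, ?)", rows)
        self.conn.commit()

    def count(self) -> int:
        return self.conn.execute("SELECT COUNT(*) FROM sentence").fetchone()[0]


def apply_parser(docs: Iterable[Tuple[str, str]], store: SentenceStore) -> None:
    """Toy parser step: it depends only on the SentenceStore interface."""
    for doc_id, text in docs:
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        store.add_all([(doc_id, i, s) for i, s in enumerate(sentences)])
```

Swapping backends then means passing a different store, e.g. apply_parser(docs, InMemoryStore()) versus apply_parser(docs, SQLiteStore("sentences.db")). The same interface would have to be repeated for the other stages listed above, which is the extra cost HiromuHota points out.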

