
Session, persistence operations and bulk insert

See original GitHub issue

Hi, thank you for sharing this ORM project. I have started playing with it by modifying and extending your introductory demo; hope you like it. I have a few questions:

  1. Is there any in-memory persistence of the objects you create from infi.clickhouse_orm.models? I am referring to something like an ORM session.
  2. If not, do you think this would be a useful feature to add in the future, in terms of atomicity, caching or … ?
  3. What is the best method in infi.clickhouse_orm for inserting massive amounts of data, say a table of 1 million rows with two columns, an integer and a float?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 6 (4 by maintainers)

Top GitHub Comments

1 reaction
emakarov commented, Sep 29, 2018

About p.1: Commits, sessions, transactions and rollbacks in other ORMs are reflections of relational-database behaviour. ClickHouse doesn’t support transactions, so you can’t implement them. As for persistence in memory, I still don’t understand what you mean.

About p.3: ClickHouse has two interfaces, TCP and HTTP, and infi.clickhouse_orm uses the HTTP one. Other Python projects use TCP, for example https://github.com/mymarilyn/clickhouse-driver. clickhouse-client is the native command-line client; it can read CSV directly and insert it into the database in the most efficient way. I don’t think transferring a million rows via HTTP is the best idea.
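The CSV route suggested above can be sketched as follows. The table name demo.points is hypothetical, and the clickhouse-client call is shown commented out because it needs a running ClickHouse server:

```shell
# Generate a sample CSV with an integer column and a float column
# (1,000 rows here; the same approach scales to the 1M rows in the question).
seq 1 1000 | awk '{ printf "%d,%f\n", $1, $1 / 3 }' > data.csv
wc -l < data.csv

# With a ClickHouse server running, the native client streams the file
# directly, avoiding row-by-row HTTP inserts:
#   clickhouse-client --query="INSERT INTO demo.points FORMAT CSV" < data.csv
```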

0 reactions
ishirav commented, Oct 13, 2018

Thank you @emakarov for your input, I agree with everything you wrote. Some additional comments below.

Sessions: I’m aware of the concept of sessions in SQLAlchemy; personally I don’t like it. I modeled clickhouse_orm after Django’s ORM, which does not use sessions. In any case, sessions are mostly relevant for databases that support transactions, unlike ClickHouse.

TCP vs. HTTP: To minimize time to market, I used the most convenient interface when developing this project, which is HTTP. Of course it comes with some overhead, but at least for my use cases this is not significant. Furthermore, if I understand correctly, the TCP interface requires using the Native data format. From the ClickHouse docs:

The most efficient format. Data is written and read by blocks in binary format. For each block, the number of rows, number of columns, column names and types, and parts of columns in this block are recorded one after another. In other words, this format is “columnar” – it doesn’t convert columns to rows. This is the format used in the native interface for interaction between servers, for using the command-line client, and for C++ clients. You can use this format to quickly generate dumps that can only be read by the ClickHouse DBMS. It doesn’t make sense to work with this format yourself.

So it didn’t seem like a great idea to base an external library on a native, binary, internal format that might change without notice.

Overhead: Loading large amounts of data is best done with the CLI, not via the ORM. The ORM adds a lot of overhead, since each line needs to be parsed, converted to Python objects, validated, and then re-encoded as text before being sent to the database. Using any ORM is a tradeoff: you sacrifice performance for convenience, because talking to the database directly is always faster, but usually less convenient than working with Python objects. This tradeoff can make sense when you’re generating new data or processing query results, but not when you just want to copy a million rows from a file into a database table.
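To make that overhead concrete, here is a minimal pure-Python sketch of the parse/convert/validate/re-encode cycle described above. The Point class and the TSV encoding are illustrative stand-ins, not infi.clickhouse_orm’s actual API:

```python
class Point:
    """A toy model with one integer and one float column."""

    def __init__(self, n, x):
        self.n = int(n)    # parse + convert to a Python object
        self.x = float(x)
        if self.n < 0:     # validate
            raise ValueError("n must be non-negative")

    def to_tsv(self):
        return f"{self.n}\t{self.x}"  # re-encode as text for the database


def orm_style_encode(csv_lines):
    """Parse -> object -> validate -> re-encode, once per row.

    This is the per-row work an ORM performs; copying the file
    directly with the CLI skips all of it.
    """
    rows = []
    for line in csv_lines:
        n, x = line.split(",")
        rows.append(Point(n, x).to_tsv())
    return "\n".join(rows)


payload = orm_style_encode(["1,0.5", "2,1.5"])
print(payload)
```

Multiply this per-row work by a million rows and the cost of the convenience becomes clear, which is why the CLI path above is preferable for bulk loads.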

Read more comments on GitHub >

Top Results From Across the Web

Additional Persistence Techniques — SQLAlchemy 2.0 ...
Bulk Operations. Legacy Feature. SQLAlchemy 2.0 has integrated the Session “bulk insert” and “bulk update” capabilities into 2.0 style ...
Read more >
Additional Persistence Techniques
Bulk Operations mode is a new series of operations made available on the Session object for the purpose of invoking INSERT and UPDATE...
Read more >
Batch Insert/Update with Hibernate/JPA - Baeldung
Learn how to use batch inserts and updates using Hibernate/JPA. ... it'll send a separate SQL statement for each insert/update operation:
Read more >
Bulk saving complex objects SQLAlchemy - Stack Overflow
Session.bulk_save_objects() is too low level API for your use case, which is persisting multiple model objects and their relationships.
Read more >
Chapter 13. Batch processing
Batch inserts. When making new objects persistent, you must flush() and then clear() the session regularly, to control the size of the first-level...
Read more >
