Inserting billions of rows
Hi, I wanted to try out peewee for the first time but I hit a snag at inserting ~20 billion rows. With the Python sqlite3 library it works super fast (~1.5 minutes).
With peewee I don't see any progress: the database file doesn't grow, the insert doesn't finish within a reasonable time frame, and there is no other indication of progress.
I don't know if I'm doing something really wrong. The only difference is that I could not create a two-column primary key on Data, so I made a two-column unique index instead.
Any help would be appreciated 😃
```python
import numpy as np
import h5py
from tqdm import tqdm
import peewee as pw

dbpath = 'data.db'  # placeholder path; not defined in the original snippet
db = pw.SqliteDatabase(dbpath)


class WorkUnit(pw.Model):
    name = pw.CharField()

    class Meta:
        database = db


class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()
    x3 = pw.ForeignKeyField(WorkUnit, null=True)
    x4 = pw.FloatField(null=True)
    x5 = pw.FloatField(null=True)
    x6 = pw.BooleanField(null=True)
    x7 = pw.BooleanField(null=True)

    class Meta:
        indexes = (
            (('x1', 'x2'), True),  # True for unique
        )
        database = db


db.connect()


def createTables():
    db.create_tables([WorkUnit, Data])


def initializeDB(dataset):
    data = []
    fields = [Data.x1, Data.x2, Data.x5, Data.x6, Data.x7]
    with h5py.File(dataset, 'r') as infile:
        groups = list(infile)
        for group in tqdm(groups):
            for i, myval in enumerate(np.array(infile[group]['myval'])):
                data.append((group, 'c{0:06d}'.format(i), myval, False, False))
    # one transaction, one INSERT statement carrying every accumulated row
    with db.atomic():
        Data.insert_many(data, fields=fields).execute()
```
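On the two-column primary key mentioned above: peewee does support composite primary keys via CompositeKey in the model's Meta. A minimal sketch, not taken from the issue, assuming the same x1/x2 columns:

```python
import peewee as pw

db = pw.SqliteDatabase('data.db')  # placeholder path


class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()

    class Meta:
        database = db
        # two-column primary key in place of the separate unique index
        primary_key = pw.CompositeKey('x1', 'x2')
```

In SQLite this is enforced as a unique constraint on (x1, x2), so it plays the same role as the unique index in the snippet above.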
Top GitHub Comments
My personal answer will never be to use NoSQL! Peewee can handle this, but it does impose some overhead. Here are the docs with ideas for speeding up bulk inserts:
http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts
Your comment, in which you use a transaction and insert_many(), looks good, but you may be hitting practical limits with SQLite's ability to parameterize the data. Peewee may also be getting in the way by ensuring/validating data types before building up the query. You can always profile the code to see what's going on.
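As one way to act on the profiling suggestion, here is a minimal sketch using the standard-library cProfile module; the initializeDB() name comes from the question, and the dataset path is a placeholder:

```python
import cProfile
import pstats

# Profile the insert path to see where time is spent
# (query building, type coercion, or SQLite itself).
profiler = cProfile.Profile()
profiler.enable()
initializeDB('dataset.h5')  # placeholder dataset path
profiler.disable()

stats = pstats.Stats(profiler).sort_stats('cumulative')
stats.print_stats(20)  # show the top 20 entries by cumulative time
```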
Using executemany() with a sqlite3 cursor is just fine, too. Peewee database objects have a cursor() method which exposes a DB-API cursor, so that approach should work (a sketch follows). Lastly, you can try batching the inserts (also sketched below):
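A minimal sketch of the cursor()/executemany() approach, reusing the data list and Data model from the question; the raw SQL string, table name, and column names are assumptions based on peewee's default naming:

```python
# Insert rows through the DB-API cursor exposed by the peewee database object.
# "data" is peewee's default table name for the Data model; the SQL is assumed.
cursor = db.cursor()
cursor.executemany(
    'INSERT INTO data (x1, x2, x5, x6, x7) VALUES (?, ?, ?, ?, ?)',
    data,
)
db.commit()
```

And a minimal sketch of batching with peewee's chunked() helper, per the bulk-insert docs linked above; the batch size of 100 is an arbitrary choice intended to stay well under SQLite's bound-parameter limit:

```python
from peewee import chunked

# Insert in batches of 100 rows inside a single transaction,
# so each INSERT stays under SQLite's bound-parameter limit.
with db.atomic():
    for batch in chunked(data, 100):
        Data.insert_many(batch, fields=fields).execute()
```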
That’s very strange, since the first one is literally the same as the second. It may have something to do with the way Peewee sets up the connection autocommit… you can try wrapping it in a transaction and seeing if that has any effect:
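A minimal sketch of what that wrapped version could look like, assuming the executemany() call from the previous comment:

```python
# Wrap the raw executemany() call in an explicit peewee transaction
# so connection autocommit behaviour cannot interfere.
with db.atomic():
    cursor = db.cursor()
    cursor.executemany(
        'INSERT INTO data (x1, x2, x5, x6, x7) VALUES (?, ?, ?, ?, ?)',
        data,
    )
```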