
Inserting billions of rows

See original GitHub issue

Hi, I wanted to try out peewee for the first time, but I hit a snag inserting ~20 billion rows. With the Python sqlite3 library it works super fast (~1.5 minutes).

With peewee I see no progress at all: the database file doesn’t grow, and the insert doesn’t finish within any reasonable time frame.

I don’t know if I’m doing something really bad. The only difference is that I could not create a double-column primary key in Data, so I made a double-column unique index instead.

Any help would be appreciated 😃

import os
import numpy as np
import h5py
from tqdm import tqdm
import peewee as pw

dbpath = 'mydata.db'  # placeholder; where dbpath comes from is not shown in the original snippet
db = pw.SqliteDatabase(dbpath)

class WorkUnit(pw.Model):
    name = pw.CharField()

    class Meta:
        database = db

class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()
    x3 = pw.ForeignKeyField(WorkUnit, null=True)
    x4 = pw.FloatField(null=True)
    x5 = pw.FloatField(null=True)
    x6 = pw.BooleanField(null=True)
    x7 = pw.BooleanField(null=True)

    class Meta:
        indexes = (
            (('x1', 'x2'), True),  # True for unique
        )
        database = db

db.connect()

def createTables():
    db.create_tables([WorkUnit, Data])

def initializeDB(dataset):
    data = []
    fields = [Data.x1, Data.x2, Data.x5, Data.x6, Data.x7]
    with h5py.File(dataset, 'r') as infile:
        groups = list(infile)
        for group in tqdm(groups):
            for i, myval in enumerate(np.array(infile[group]['myval'])):
                data.append((group, 'c{0:06d}'.format(i), myval, False, False))
    with db.atomic():
        Data.insert_many(data, fields=fields).execute()
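
As an aside (this is not part of the original report): peewee can declare a two-column primary key directly with CompositeKey, which is what the unique-index workaround above stands in for. A minimal sketch of how Data could declare it:

class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()
    # ... other fields as above ...

    class Meta:
        database = db
        # (x1, x2) together act as the primary key, replacing the unique index.
        primary_key = pw.CompositeKey('x1', 'x2')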

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
coleifer commented, May 28, 2018

My personal answer will never be to use NoSQL! Peewee can handle this, but it does impose some overhead. Here are the docs on ideas for speeding-up bulk inserts:

http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts

Your comment, in which you use a transaction and insert_many(), looks good, but you may be hitting practical limits of SQLite’s ability to parameterize the data. Peewee may also be getting in the way by ensuring/validating data types before building up the query. You can always profile the code to see what’s going on.
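
For example, a quick cProfile pass over the initializeDB() function from the question (sketch only; the profile file name and dataset path below are made up) would show whether the time is going into peewee’s query construction or into SQLite itself:

import cProfile
import pstats

# Profile the bulk-insert function and print the 20 biggest offenders
# by cumulative time.
cProfile.run("initializeDB('mydata.h5')", 'insert_profile')
pstats.Stats('insert_profile').sort_stats('cumulative').print_stats(20)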

Using executemany() with a sqlite3 cursor is just fine, too. Peewee database objects have a cursor() method which exposes a DB-API cursor. So this should work:

db = pw.SqliteDatabase(...)
db.create_tables([WorkUnit, Data])
cursor = db.cursor()

with h5py.File(dataset, 'r') as infile:
    groups = list(infile)
    for group in tqdm(groups):
        rdata = [(group, 'c{0:06d}'.format(i), 0, val, 0, 0)
                 for i, val in enumerate(np.array(infile[group]['myval']))]
        # Name the columns explicitly: the table also has an implicit "id"
        # primary key, and the foreign-key column is stored as x3_id.
        cursor.executemany(
            "INSERT INTO data (x1, x2, x3_id, x4, x5, x6, x7) VALUES (?, ?, NULL, ?, ?, ?, ?)",
            rdata)

Lastly, you can try batching the inserts:

    with h5py.File(dataset, 'r') as infile:
        groups = list(infile)
        for group in tqdm(groups):  # Around 50k iterations
            data = []
            for i, myval in enumerate(np.array(infile[group]['myval'])):
                data.append((group, 'c{0:06d}'.format(i), myval, False, False))

            with db.atomic():
                for i in range(0, len(data), 100):  # INSERT 100 rows at-a-time
                    Data.insert_many(data[i:i+100], fields=fields).execute()
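
A rough rule of thumb for picking the batch size (assuming a default SQLite build; newer versions raise this limit considerably): keep rows-per-batch times fields-per-row under SQLite’s classic 999 bound-variable ceiling.

# With 5 fields per row, anything up to 999 // 5 = 199 rows per INSERT
# stays under the historical SQLITE_MAX_VARIABLE_NUMBER default of 999.
batch_size = 999 // len(fields)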

0 reactions
coleifer commented, May 29, 2018

That’s very strange, since the first one is literally the same as the second. It may have something to do with the way Peewee sets up the connection’s autocommit behavior. You can try wrapping it in a transaction and seeing if that has any effect:

with db.atomic():
    cursor.executemany(...)
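
If it helps, a slightly fuller sketch of that idea (reusing rdata and the INSERT from the earlier executemany() example; untested here): in SQLite’s autocommit mode each INSERT is committed and synced to disk on its own, so wrapping the whole batch in one explicit transaction is usually where the big win comes from.

with db.atomic():  # one explicit transaction instead of per-statement commits
    cursor = db.cursor()
    cursor.executemany(
        "INSERT INTO data (x1, x2, x3_id, x4, x5, x6, x7) VALUES (?, ?, NULL, ?, ?, ?, ?)",
        rdata)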

