
Inserting billions of rows

See original GitHub issue

Hi, I wanted to try out peewee for the first time, but I hit a snag inserting ~20 billion rows. With the Python sqlite3 library it works super fast (~1.5 minutes).

With peewee I see no progress at all: the database file doesn’t grow, and the insert doesn’t finish within any reasonable time frame.

I don’t know if I’m doing something really bad. The only difference is that I could not create a double-column primary key in Data, so I made a double-column unique index instead.

Any help would be appreciated 😃

import os
import numpy as np
import h5py
from tqdm import tqdm
import peewee as pw

dbpath = 'mydata.db'  # placeholder; where dbpath comes from is not shown in the original snippet
db = pw.SqliteDatabase(dbpath)

class WorkUnit(pw.Model):
    name = pw.CharField()

    class Meta:
        database = db

class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()
    x3 = pw.ForeignKeyField(WorkUnit, null=True)
    x4 = pw.FloatField(null=True)
    x5 = pw.FloatField(null=True)
    x6 = pw.BooleanField(null=True)
    x7 = pw.BooleanField(null=True)

    class Meta:
        indexes = (
            (('x1', 'x2'), True),  # True for unique
        )
        database = db

db.connect()

def createTables():
    db.create_tables([WorkUnit, Data])

def initializeDB(dataset):
    data = []
    fields = [Data.x1, Data.x2, Data.x5, Data.x6, Data.x7]
    with h5py.File(dataset, 'r') as infile:
        groups = list(infile)
        for group in tqdm(groups):
            for i, myval in enumerate(np.array(infile[group]['myval'])):
                data.append((group, 'c{0:06d}'.format(i), myval, False, False))
    with db.atomic():
        Data.insert_many(data, fields=fields).execute()
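
As an aside (this is not part of the original report): peewee can declare a two-column primary key directly with CompositeKey, which is what the unique-index workaround above stands in for. A minimal sketch of how Data could declare it:

class Data(pw.Model):
    x1 = pw.CharField()
    x2 = pw.CharField()
    # ... other fields as above ...

    class Meta:
        database = db
        # (x1, x2) together act as the primary key, replacing the unique index.
        primary_key = pw.CompositeKey('x1', 'x2')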

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

1 reaction
coleifer commented, May 28, 2018

My personal answer will never be to use NoSQL! Peewee can handle this, but it does impose some overhead. Here are the docs on ideas for speeding-up bulk inserts:

http://docs.peewee-orm.com/en/latest/peewee/querying.html#bulk-inserts

Your comment, in which you use a transaction and insert_many(), looks good, but you may be hitting practical limits of SQLite’s ability to parameterize the data. Peewee may also be getting in the way by ensuring/validating data types before building up the query. You can always profile the code to see what’s going on.
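
For example, a quick cProfile pass over the initializeDB() function from the question (sketch only; the profile file name and dataset path below are made up) would show whether the time is going into peewee’s query construction or into SQLite itself:

import cProfile
import pstats

# Profile the bulk-insert function and print the 20 biggest offenders
# by cumulative time.
cProfile.run("initializeDB('mydata.h5')", 'insert_profile')
pstats.Stats('insert_profile').sort_stats('cumulative').print_stats(20)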

Using executemany() with a sqlite3 cursor is just fine, too. Peewee database objects have a cursor() method which exposes a DB-API cursor. So this should work:

db = pw.SqliteDatabase(...)
db.create_tables([WorkUnit, Data])
cursor = db.cursor()

with h5py.File(dataset, 'r') as infile:
    groups = list(infile)
    for group in tqdm(groups):
        rdata = [(group, 'c{0:06d}'.format(i), 0, val, 0, 0)
                 for i, val in enumerate(np.array(infile[group]['myval']))]
        # Name the columns explicitly: the table also has an implicit "id"
        # primary key, and the foreign-key column is stored as x3_id.
        cursor.executemany(
            "INSERT INTO data (x1, x2, x3_id, x4, x5, x6, x7) VALUES (?, ?, NULL, ?, ?, ?, ?)",
            rdata)

Lastly, you can try batching the inserts:

    with h5py.File(dataset, 'r') as infile:
        groups = list(infile)
        for group in tqdm(groups):  # Around 50k iterations
            data = []
            for i, myval in enumerate(np.array(infile[group]['myval'])):
                data.append((group, 'c{0:06d}'.format(i), myval, False, False))

            with db.atomic():
                for i in range(0, len(data), 100):  # INSERT 100 rows at-a-time
                    Data.insert_many(data[i:i+100], fields=fields).execute()
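
A rough rule of thumb for picking the batch size (assuming a default SQLite build; newer versions raise this limit considerably): keep rows-per-batch times fields-per-row under SQLite’s classic 999 bound-variable ceiling.

# With 5 fields per row, anything up to 999 // 5 = 199 rows per INSERT
# stays under the historical SQLITE_MAX_VARIABLE_NUMBER default of 999.
batch_size = 999 // len(fields)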

0 reactions
coleifer commented, May 29, 2018

That’s very strange, since the first one is literally the same as the second. It may have something to do with the way Peewee sets up the connection’s autocommit behavior. You can try wrapping it in a transaction and seeing if that has any effect:

with db.atomic():
    cursor.executemany(...)
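
If it helps, a slightly fuller sketch of that idea (reusing rdata and the INSERT from the earlier executemany() example; untested here): in SQLite’s autocommit mode each INSERT is committed and synced to disk on its own, so wrapping the whole batch in one explicit transaction is usually where the big win comes from.

with db.atomic():  # one explicit transaction instead of per-statement commits
    cursor = db.cursor()
    cursor.executemany(
        "INSERT INTO data (x1, x2, x3_id, x4, x5, x6, x7) VALUES (?, ?, NULL, ?, ?, ?, ?)",
        rdata)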

