Cleaning large quantities of data takes far too long...
This is more of a question than an issue per se, but I didn’t know where else to ask, since there isn’t an official forum and there’s no recent activity on the Gitter.
Before I go any further, let me preface this by saying I’m still using the older aldeed:simple-schema@1.5.3 that is built into the collection2 Atmosphere package. Still, after reading the release notes of the newer versions, I have no reason to believe the clean() function will be any faster in this regard.
I have to insert large quantities of data, tens of thousands of rows (documents) with about 50 keys each, into my database. I’ve defined a SimpleSchema for them where most of the fields are just strings, but some are dates and decimal numbers. I run each row of data through schema.clean(row) and then call Match.test(row, schema) to make sure we’ve got all required fields. I assign some other fields prior to insertion, and then perform a bulk insertion by directly accessing the Mongo object.
// start recording time
const valids = [];
_.each(rows, (row) => {
  schema.clean(row); // removing this makes this whole process 100 times faster
  if (!Match.test(row, schema)) return; // may be removed or not, has no significant effect on runtime
  // ... add some fields, these also don't take a significant amount of time
  row.date = today;
  row._id = Random.id();
  valids.push(row);
});
// time taken on 13000 rows = ~100s with cleaning vs. ~1s without ¯\_(ツ)_/¯
MyCollection.rawCollection().insertMany(valids, { ordered: false }, (err, res) => { ... });
The reason I do it like this is that inserting large amounts of data using Meteor’s Mongo.Collection functions can be slow and doesn’t scale very well, unfortunately. I’ve tried both ways, and can say for certain that independently cleaning/validating bulk data before performing a bulk insert directly into the database is always faster than looping through the data and relying on collection2’s insert hook.
However, this is still too slow because schema.clean(row) is just taking too long. When I don’t call clean(), the time for validating and inserting 10000+ rows drops from 100+ seconds to about 1 second, so something is obviously wrong here.
There just has to be a faster way of doing this.
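To confirm which step dominates, per-step timing along the following lines could be used. This is only a minimal sketch: `timeStep`, `totals`, and `standInClean` are hypothetical names introduced here, and `standInClean` merely takes the place of schema.clean(row) so the snippet runs on its own under Node.

```javascript
// Accumulates wall-clock milliseconds per label. `timeStep` and `totals` are
// hypothetical names for this sketch, not part of Meteor or SimpleSchema.
function timeStep(totals, label, fn) {
  const start = process.hrtime();
  const result = fn();
  const [sec, nanosec] = process.hrtime(start); // elapsed time since `start`
  totals[label] = (totals[label] || 0) + sec * 1e3 + nanosec / 1e6;
  return result;
}

// Stand-in for schema.clean(row) so the sketch is self-contained.
const standInClean = (row) => {
  row.name = row.name.trim();
  return row;
};

const totals = {};
const rows = [{ name: ' a ' }, { name: ' b ' }];
rows.forEach((row) => {
  timeStep(totals, 'clean', () => standInClean(row));
  timeStep(totals, 'assign', () => { row.date = new Date(); });
});
console.log(totals); // one accumulated millisecond total per label
```

Swapping the stand-in for the real schema.clean(row) and wrapping Match.test the same way should show where the time goes.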
Issue Analytics
- Created 7 years ago
- Reactions:1
- Comments:9
Top GitHub Comments
There is a chance this has improved in the rewritten SimpleSchema, but there is an even better chance that it’s much worse, at least by default, because one change was that it now clones rather than mutates by default. If you pass mutate: true, though, it will most likely be similar in speed to what you have now.
The slowness is most certainly due to the MongoObject handling, which is necessary mainly because of supporting update modifiers and running autoValues. The quick fix would be to create a much simpler clean function that can be used when we know that the full treatment isn’t necessary.
I’ve been thinking about changing how a lot of this works to make it focus on the 95% cases and ditch support for the 5% that is adding all of the extreme complexity. These changes would also make things much faster.
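As a rough illustration of the "much simpler clean function" idea mentioned above, here is a minimal sketch for flat insert documents only. It is not SimpleSchema's implementation: `simpleClean`, `coerce`, and the `fieldSpec` shape are hypothetical names for this example, and it assumes no nested keys, no update modifiers, and no autoValues. It loosely imitates filtering unknown keys, trimming strings, removing empty strings, and converting basic types.

```javascript
// Hypothetical flat field spec (key -> { type }); NOT SimpleSchema's API.
const fieldSpec = {
  name: { type: String },
  price: { type: Number },
  soldAt: { type: Date },
};

// Converts a value to the declared type when that is clearly possible.
// Unconvertible values are returned unchanged so validation can reject them.
function coerce(value, type) {
  if (type === String) return typeof value === 'string' ? value : String(value);
  if (type === Number) {
    const n = typeof value === 'number' ? value : Number(value);
    return Number.isNaN(n) ? value : n;
  }
  if (type === Date) {
    const d = value instanceof Date ? value : new Date(value);
    return Number.isNaN(d.getTime()) ? value : d;
  }
  return value;
}

// Mutates and returns `row`. Assumes a flat document being inserted.
function simpleClean(row, spec) {
  for (const key of Object.keys(row)) {
    const field = spec[key];
    if (!field) { delete row[key]; continue; }            // drop unknown keys
    let value = row[key];
    if (value === null || value === undefined) continue;  // leave for validation
    if (typeof value === 'string') {
      value = value.trim();                               // trim strings
      if (value === '') { delete row[key]; continue; }    // drop empty strings
    }
    row[key] = coerce(value, field.type);
  }
  return row;
}

const example = simpleClean(
  { name: '  Widget ', price: '19.99', soldAt: '2017-03-01T00:00:00Z', junk: 'x' },
  fieldSpec
);
console.log(example);
```

Because it mutates the row in place and never builds an intermediate object per document, a function like this avoids the per-row overhead described above, at the cost of supporting far fewer cases.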
The Meteor SimpleSchema package is no longer maintained other than critical fixes to keep it running with each latest Meteor release. Closing non-critical issues in this repo. Anyone who cares about this may do one or more of the following: