
Performance issue on import - chatty database communication

See original GitHub issue

Hi,

I am importing CSVs with 10k rows and noticed that the import is very slow - roughly 10 minutes per file. This is against a Django app running locally, connected to a Postgres instance that is also running locally.

Upon further inspection, it looks like a select * from table where id = ? is issued for every row in the imported CSV, which slows down the import. It would be better to do a single select * from table where id in (?, ?, ?, ...) instead to speed up the process.

Thanks
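
As an illustration of the batched lookup being suggested, a minimal sketch in the Django ORM might look like this (Book is a hypothetical model standing in for whatever the resource imports into):

from myapp.models import Book  # hypothetical app and model

def prefetch_by_id(rows):
    """Fetch every existing record for the import in one query instead of one per row."""
    ids = [row['id'] for row in rows if row.get('id')]
    # Single round trip: SELECT ... WHERE id IN (...)
    return {obj.id: obj for obj in Book.objects.filter(id__in=ids)}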

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

10 reactions
franz-see commented, Mar 10, 2019

@andrewgy8 This is my workaround:

class BulkQueryMixin:
    # Cache from the composite id key to the pre-fetched model instance,
    # rebuilt once per import in before_import().
    _ids_to_objects = {}

    def before_import(self, dataset, using_transactions, dry_run, **kwargs):
        if len(dataset):
            # Positions of the import_id_fields columns in the uploaded dataset.
            header_indices = [idx for idx, header in enumerate(dataset.headers)
                              if header in self._meta.import_id_fields]
            # One list of (field, value) pairs per row.
            query_ids = [[(dataset.headers[header_idx], data[header_idx]) for header_idx in header_indices]
                         for data in dataset]
            # Build a single "(... and ...) or (... and ...)" clause covering every row.
            # Note: plain string interpolation, so only use with trusted input.
            where_clause = ' or '.join(
                '(%s)' % ' and '.join(self._to_where_condition(field, value) for field, value in query_id)
                for query_id in query_ids)
            select_query = 'select * from %s where %s' % (self._meta.model._meta.db_table, where_clause)
            # One round trip instead of one SELECT per row.
            pre_queried_objects = self._meta.model.objects.raw(select_query)
            self._ids_to_objects = self._map_out_objects(pre_queried_objects)

    def get_instance(self, instance_loader, row):
        # Look the row up in the pre-fetched cache instead of hitting the database.
        row_id = self._get_id(row)
        return self._ids_to_objects.get(row_id)

    def _map_out_objects(self, values):
        return {self._get_id(value): value for value in values}

    def _get_id(self, value):
        # Works for both row dicts (during import) and model instances (from the raw query).
        return '-'.join(str(self._get_header_value(header_id, value))
                        for header_id in self._meta.import_id_fields)

    def _get_header_value(self, header_id, value):
        header_value = value[header_id] if isinstance(value, dict) else getattr(value, header_id)
        return header_value if header_value else ''

    def _to_where_condition(self, column_name, column_value):
        if column_value:
            return "%s = '%s'" % (column_name, column_value)
        else:
            return '%s is null' % column_name

Then I use that BulkQueryMixin in my Resource to fetch all the data related to the uploaded import in one go, and on every get_instance() I just retrieve from that cached dict. The code is not pretty, but it seems to work and has greatly improved performance on my end.

But aside from that, I had to do a few more things (ranked in terms of performance gain; a sketch tying them together follows the list):

  1. Set skip_unchanged in my Resource - removes the update statements, which otherwise can be one per row
  2. Modify my Django models to override __deepcopy__ (and __copy__) - deep-copying also slows things down substantially
  3. Set report_skipped in my Resource - speeds up rendering of the confirmation page; also, with 10k records uploaded, it makes spotting the differences easier
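
For reference, a minimal sketch of how those pieces could fit together in a django-import-export resource (BookResource and Book are hypothetical names; skip_unchanged and report_skipped are Meta options of Resource):

from import_export import resources
from myapp.models import Book  # hypothetical model

class BookResource(BulkQueryMixin, resources.ModelResource):
    """Mixin first, so its before_import()/get_instance() override the defaults."""

    class Meta:
        model = Book
        import_id_fields = ('id',)
        skip_unchanged = True   # 1. skip rows that did not change, avoiding per-row updates
        report_skipped = False  # 3. don't list every skipped row on the confirmation page
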
2 reactions
franz-see commented, Mar 13, 2019

@andrewgy8 Good point. I’ll try to see if I can change it to use querysets and filters instead.
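
A rough sketch of what that queryset-based lookup could look like, replacing the hand-built SQL in before_import() with Q objects (this is an assumed direction, not code from the thread):

from functools import reduce
from operator import and_, or_

from django.db.models import Q

def build_bulk_filter(query_ids):
    """query_ids: one list of (field, value) pairs per CSV row, as built in before_import().
    Assumes query_ids is non-empty (the mixin already checks len(dataset))."""
    row_filters = []
    for query_id in query_ids:
        conditions = [
            Q(**{field: value}) if value else Q(**{'%s__isnull' % field: True})
            for field, value in query_id
        ]
        row_filters.append(reduce(and_, conditions))
    # Single query: (f1=a AND f2=b) OR (f1=c AND f2=d) OR ...
    return reduce(or_, row_filters)

# Inside before_import(), the raw SQL could then become:
#   pre_queried_objects = self._meta.model.objects.filter(build_bulk_filter(query_ids))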


Top Results From Across the Web

Why is My Database Application so Slow? - Simple Talk
A very common cause of performance problems, in our experience, is running “chatty” applications over high latency networks. A chatty ...

Chatty I/O antipattern - Performance antipatterns for cloud apps
Problem description. Network calls and other I/O operations are inherently slow compared to compute tasks. Each I/O request typically has significant overhead, ...

Urgent advise needed - Software AG Tech Community & Forums
It's been a while since I looked into it, but JDBC network traffic used to be “chatty” in small bursts. This caused performance...

SQL Server Advanced Troubleshooting and Performance ...
most problems present themselves as general performance issues: ... The Protocol Layer handles communication between SQL Server and client.

[ MATLAB + SQL Server ] Slow performance when fetching data
My problem is that the while-loop performs quite slowly. It takes up to 8-10 seconds to fetch the data. One might ask why...
