
Duplicate records when mirroring data to BigQuery

See original GitHub issue
  • Extension name: firestore-bigquery-export
  • Extension version: v0.1.12

The extension was running perfectly fine on v0.1.5, but after upgrading to v0.1.12 it started exporting duplicated records to BigQuery.

The only difference between two duplicate records is the document_id field: one of the records has a null document_id and the other has a non-null document_id.

This is a severe problem, since it has compromised many of my reports.

Weirdly, not all the records get duplicated. Sometimes they do and sometimes they don’t - and when they don’t, there’s only one record: the one with the non-null document_id. Still, this happens very frequently, and I already have thousands of duplicates.

It’s also worth mentioning that all of my extensions have been upgraded almost at the same time and that some of them are working perfectly.
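To gauge the scope of the problem, a query along these lines could surface affected rows. This is a sketch, not an official diagnostic: it assumes the extension’s default `_raw_changelog` table layout (with `document_name`, `document_id`, and `timestamp` columns), and it flags groups that contain at least one NULL `document_id` copy.

```sql
-- Sketch: find changelog entries recorded more than once for the same
-- document and timestamp, where at least one copy has a NULL document_id
-- (the duplicate pattern described above). Grouping by timestamp as well
-- avoids flagging legitimate repeat changes to the same document.
SELECT
  document_name,
  timestamp,
  COUNT(*) AS copies,
  COUNTIF(document_id IS NULL) AS null_id_copies
FROM `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog`
GROUP BY document_name, timestamp
HAVING COUNT(*) > 1 AND COUNTIF(document_id IS NULL) > 0;
```

Run it as a dry read first; if the schema in your deployment differs, the column names will need adjusting.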

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 10 (4 by maintainers)

Top GitHub Comments

1 reaction
raubrey2014 commented on Mar 8, 2021

@dackers86 just as a follow up, yes I backfilled the entries to fix this discrepancy.

For anyone else running into this issue, I backfilled using the query below, but you should be very careful when doing this. The query assumes that the last 20 characters of the document_name column are the document id; there could be cases where that is not a safe assumption.

UPDATE `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog` SET document_id=RIGHT(document_name, 20) WHERE document_id IS NULL;

So for me it looks something like:

UPDATE `ryans_project.firestore.comments_raw_changelog` SET document_id=RIGHT(document_name, 20) WHERE document_id IS NULL;
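As an alternative sketch that avoids the fixed-length assumption: since `document_name` is a full `/`-separated resource path ending in the document id, taking the last path segment yields the id regardless of its length. This is untested against every extension version, so verify on a table copy before running it on live data:

```sql
-- Sketch: derive document_id as the last '/'-separated segment of
-- document_name, instead of assuming a fixed 20-character id.
UPDATE `<your project id>.<your dataset id>.<exporter collection name>_raw_changelog`
SET document_id = ARRAY_REVERSE(SPLIT(document_name, '/'))[OFFSET(0)]
WHERE document_id IS NULL;
```

`SPLIT` and `ARRAY_REVERSE` are standard BigQuery functions, so this stays within native SQL and needs no external tooling.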

Hope this is helpful!

0 reactions
dackers86 commented on Jan 18, 2022

Closing as this appears to be resolved.


Top Results From Across the Web

  • Copy datasets | BigQuery - Google Cloud
    Your project can copy 1,000 tables per run to a destination dataset that is in a different region. For example, if you configure...

  • Google BigQuery There are no primary key or unique ...
    Google told us even we rerun the transfers, there will not be duplicated records. Is that bigquery transfer using the streaming? The duplicated...

  • Tips to Prevent Data Duplication in Google Big Query
    Tips to prevent data duplicates in Google BigQuery. The article presents common remedies using Dataddo ETL for preventing data duplication.

  • Chapter 4. Loading Data into BigQuery - O'Reilly
    Use JSON for small files where human readability is important. Impact of compression and staging via Google Cloud Storage. For formats such as...

  • How to Build a Unique MD5 Row Hash Using SQL in ...
    Using native BigQuery functionality to generate a dynamic, unique row identifier ... This means that your data might contain duplicate rows, ...
