EmrEtlRunner: race condition overwriting Clojure Collector files during staging step

Some raw files are misnamed during the conversion to the CloudFront-like format. Two source files end up with the same target name, so one overwrites the other and a file goes missing.

In the example below, the timestamp in the filename of both files is rendered as 2017-01-13-03 (UTC). A 2017-01-13-04 file is missing.

EmrEtlRunner output:

[Fri Jan 13 06:15:05 UTC 2017] (t3)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz
[Fri Jan 13 06:15:05 UTC 2017] (t5)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz

The timestamp conversion should be:

  • 1484280061 = 13/1/2017 at 5:01:01 CET
  • 1484276462 = 13/1/2017 at 4:01:02 CET

So, when the file is archived, the “2017-01-13-03” file could be either the one from 5:01 or the one from 4:01, and no file is named “2017-01-13-04”.
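To double-check the expected names, here is a minimal sketch that converts the two epoch values from the MOVE lines into an hourly UTC bucket. The `hour_bucket` helper is hypothetical and only mirrors the `YYYY-MM-DD-HH` pattern seen in the target filenames; it is not EmrEtlRunner's actual code.

```python
from datetime import datetime, timezone


def hour_bucket(epoch_seconds: int) -> str:
    """Format a Unix timestamp as YYYY-MM-DD-HH in UTC (hypothetical helper)."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d-%H")


# Epoch suffixes taken from the two source filenames in the log above.
for ts in (1484280061, 1484276462):
    print(ts, "->", hour_bucket(ts))
# 1484280061 -> 2017-01-13-04
# 1484276462 -> 2017-01-13-03
```

The two files should therefore land in different hourly buckets (04 and 03 UTC), which matches the CET times listed above.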


Additional info:

  • We have more than 30,000 files in our raw/in bucket (remaining Elastic Beanstalk logs)
  • I wasn’t able to verify whether the issue comes from Ruby or JRuby (I suspect a bug in JRuby). Bumping to 9.1.6.0 might fix it, but I couldn’t get snowplow-emr-etl-runner and snowplow-storage-loader to run with that version.
  • We hit this issue at least once every two days
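One way to see how often this happens is to scan the EmrEtlRunner output for MOVE lines whose target name appears more than once. This is only a sketch: the regular expression is an assumption based on the two log lines quoted above, and `find_overwrites` is a hypothetical helper, not part of Snowplow.

```python
import re
from collections import defaultdict

# Assumed line shape: "... MOVE <source> -> <target>", as in the output above.
MOVE_LINE = re.compile(r"MOVE\s+(?P<src>\S+)\s+->\s+(?P<dst>\S+)")


def find_overwrites(log_lines):
    """Return {target: [sources]} for targets written by more than one MOVE."""
    sources_by_target = defaultdict(list)
    for line in log_lines:
        match = MOVE_LINE.search(line)
        if match:
            sources_by_target[match.group("dst")].append(match.group("src"))
    return {dst: srcs for dst, srcs in sources_by_target.items() if len(srcs) > 1}


sample = [
    "[Fri Jan 13 06:15:05 UTC 2017] (t3)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz",
    "[Fri Jan 13 06:15:05 UTC 2017] (t5)    MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz",
]

for target, sources in find_overwrites(sample).items():
    print(f"{len(sources)} sources moved to {target}")
```

Run against the two lines from this report, it flags the single `2017-01-13-03` target as receiving both source files.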

Bonus: keeping the original timestamp in the filename could help validate the conversion to the CloudFront format.

Issue Analytics

  • State: closed
  • Created: 7 years ago
  • Comments: 13 (9 by maintainers)

Top GitHub Comments

alexanderdean commented, Aug 17, 2017 (1 reaction)

I disagree. A bug report is immutable: it intrinsically relates to data loss, and that doesn’t change with the fix being in another ticket. Removing the work assignment metadata, by contrast, is fine.

In x months from now, I want to be able to go back and review bugs which relate to data loss. The ticket that resolved the problem is uninteresting in comparison.

BenFradet commented, May 30, 2017 (1 reaction)

@vceron Have you had a chance to update to EmrEtlRunner 0.23 (or 0.24)? We updated JRuby to 9.1.6 in both of those releases.
