EmrEtlRunner: race condition overwriting Clojure Collector files during staging step
Some raw files are misnamed during the CloudFront-like conversion process, which results in missing files.
In the example below, the timestamp in the filename of both files is converted to 2017-01-13-03 (UTC). The 2017-01-13-04 file is missing.
EmrEtlRunner output:
[Fri Jan 13 06:15:05 UTC 2017] (t3) MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz
[Fri Jan 13 06:15:05 UTC 2017] (t5) MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz
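A collision like the one in the two MOVE lines above can be spotted mechanically. The following is a minimal sketch, not part of EmrEtlRunner: it assumes the log format shown here (`(tN) MOVE <source> -> <target>`), and the helper name `find_collisions` is made up for illustration. It groups sources by target and reports every target written more than once.

```python
import re
from collections import defaultdict

# Assumed log shape, taken from the output above: "(tN) MOVE <source> -> <target>"
MOVE_RE = re.compile(r"\(t\d+\) MOVE (?P<src>\S+) -> (?P<dst>\S+)")

def find_collisions(log_lines):
    """Return {target: [sources]} for every target that receives more than one source."""
    targets = defaultdict(list)
    for line in log_lines:
        match = MOVE_RE.search(line)
        if match:
            targets[match.group("dst")].append(match.group("src"))
    return {dst: srcs for dst, srcs in targets.items() if len(srcs) > 1}

log = [
    "[Fri Jan 13 06:15:05 UTC 2017] (t3) MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484280061.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz",
    "[Fri Jan 13 06:15:05 UTC 2017] (t5) MOVE snowplow-collectors-log/e-xrnips2p/i-050292a49/_var_log_tomcat8_rotated_localhost_access_log.txt1484276462.gz -> viadeo-snowplow-processing/processing/var_log_tomcat8_rotated_localhost_access_log.2017-01-13-03.us-east-1.i-050292a49944536d7.txt.gz",
]

collisions = find_collisions(log)
for dst, srcs in collisions.items():
    print(dst)
    for src in srcs:
        print("  <-", src)
```

Running this over a full EmrEtlRunner log would list each overwritten target together with the raw files that were moved onto it.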
The timestamp conversion should be:
- 1484280061 = 13/1/2017 at 5:01:01 CET
- 1484276462 = 13/1/2017 at 4:01:02 CET
So when the files are archived, the “2017-01-13-03” file could be either the 5:01 one or the 4:01 one, and no file is named “2017-01-13-04”.
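The expected names can be checked independently of the runner. This is a small sketch under one assumption: that the target name uses the UTC hour in `YYYY-MM-DD-HH` form, as the 2017-01-13-03 (UTC) example suggests. The function name `hour_bucket` is hypothetical and is not EmrEtlRunner code.

```python
from datetime import datetime, timezone

def hour_bucket(epoch_seconds):
    """Format a Unix timestamp as the UTC hour bucket, e.g. 2017-01-13-03."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d-%H")

# The two epoch values embedded in the raw filenames above
for epoch in (1484280061, 1484276462):
    print(epoch, "->", hour_bucket(epoch))
```

Under that assumption the two raw files map to different hours (04 and 03 UTC, which is 5:01 and 4:01 CET), so they should never share a target name.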
Additional info:
- We have more than 30,000 files in our raw/in bucket (remaining Elastic Beanstalk logs).
- I wasn’t able to verify whether the issue comes from Ruby or JRuby (I suspect a bug in JRuby). The bump to 9.1.6.0 might fix the issue, but I wasn’t able to get snowplow-emr-etl-runner and snowplow-storage-loader running with this version.
- We hit this issue at least once every two days.
Bonus: keeping the original timestamp in the filename could help validate the converted CloudFront format.
Related:
- 61 Pygmy Parrot
- #1398 - EmrEtlRunner: change Clojure Collector log timestamp format to match CloudFront logs (https://github.com/snowplow/snowplow/commit/4b36d14868340821b8665d326376383faf0f41ab)
- #1379 - EmrEtlRunner: append region name to Clojure Collector log files
- #1404 - EmrEtlRunner: append rather than prepend instance names to Clojure Collector log files
Issue Analytics
- State:
- Created 7 years ago
- Comments:13 (9 by maintainers)
Top GitHub Comments
I disagree - a bug report is immutable - it intrinsically relates to data loss, and that doesn’t change with the fix being in another ticket. Removing the work assignment metadata by contrast is fine.
In x months from now, I want to be able to go back and review bugs which relate to data loss. The ticket that resolved the problem is uninteresting in comparison.
@vceron Have you had a chance to update to EmrEtlRunner 0.23 (or 0.24)? We updated JRuby to 9.1.6 in both of those releases.