Issues/Features for Build Shepherd `parse_metadata` Script

These are issues to tackle in order to improve the parse_metadata.py script used by the Nextstrain build shepherds. They are a mix of bugs and features.

I’ve added numbers so that it’s easier to discuss and reference!

High priority

  • 1 More specific variants.txt corrections.
    • Example 1: ‘Gyeonggi’ is both a location and a division. We want to correct division Gyeonggi to division Gyeonggi Province, but only when it’s a division, not when it’s a location. (Should be fixed in #475)
    • Example 2: ‘Bahrein’ is both a country and a division, and both need to be corrected to ‘Bahrain’. Currently only the division annotations are automatically generated. (Should be fixed in #468)
    • Example 3: Annotations are auto-generated to correct ‘unknown’ to (blank). However, if there are both unknown divisions and unknown locations, only the location corrections are generated. (Emma note: I believe these should also only go to blank when the country is ‘USA’, as we don’t do this for other countries - right? May need to be a special case.)
    • Example 4: Allow specifying that a variant should only be replaced if it is from a specified division or country (or other). For example, ‘Brownsville’ should be replaced with ‘Brownsville County’ only when it is in Texas, USA, not everywhere it appears.
  • 2 Auto-add missing places to the auto-generated color_ordering.tsv file. Example: ‘Sangalkam’ is missing from defaults/color_ordering.tsv and is displayed with a warning (‘Sangalkam (only missing in ordering)’), but it is not added to the auto-generated developer_scripts/color_ordering.tsv. This needs to happen before we can switch to using only an auto-generated ordering file. Note: Ensure this happens both when the place is simply missing from color_ordering.tsv and when it’s a ‘false’ division that’s been corrected. See example here
    • Directly modify the script so that places that are just missing from color_ordering.tsv are added automatically (@MoiraZuber)
    • Un-comment the read_exposure call so that places which are ‘exposure places’ are auto-added to color_ordering.tsv
    • Clean up the exposure categories by modifying gisaid_annotations & accepted_exposure_additions so we are working from a clean slate (@emmahodcroft)
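
To make Examples 1 and 4 of item 1 concrete, here is a minimal sketch of a level-aware, context-scoped correction. The rule format, the `RULES` list and the `correct` function are all hypothetical illustrations; this is not the real variants.txt syntax or the code in parse_metadata.py.

```python
# Hypothetical rule format (NOT the real variants.txt syntax): each rule names
# the geographic level it applies to and, optionally, the context it requires.
RULES = [
    {"level": "division", "old": "Gyeonggi", "new": "Gyeonggi Province", "where": {}},
    {"level": "location", "old": "Brownsville", "new": "Brownsville County",
     "where": {"division": "Texas", "country": "USA"}},
]


def correct(record, rules):
    """Return a copy of `record` with every matching rule applied."""
    fixed = dict(record)
    for rule in rules:
        level = rule["level"]
        # The rule only fires on the level it names (division vs. location).
        if record.get(level) != rule["old"]:
            continue
        # ...and only when every required context field matches.
        if all(record.get(key) == value for key, value in rule["where"].items()):
            fixed[level] = rule["new"]
    return fixed


texas = {"country": "USA", "division": "Texas", "location": "Brownsville"}
oregon = {"country": "USA", "division": "Oregon", "location": "Brownsville"}
print(correct(texas, RULES)["location"])
print(correct(oregon, RULES)["location"])
```

With rules shaped like this, a ‘Gyeonggi’ that appears as a location is left alone because the rule is tied to the division level, and a ‘Brownsville’ outside Texas, USA is left alone because its context does not match.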

Medium priority

  • 3 Automatically add new locations needing lat-longs to defaults/lat_longs.tsv and re-sort the file as necessary (in order by location, division, country, etc.)
  • 4 Clarify the role of sequences_exclude.txt - what does it do, how can we use it better?
  • 5 Clarify the role of accepted_exposure_additions.txt and figure out how to get it working optimally and resolve inconsistencies
  • 9 Expand the script to also generate gisaid_annotations.tsv lines for additional-info-changes.txt entries. After reading in this file, the script should generate the new entries needed to add these. Two challenges here: 1) The entries are often ‘free text’, so we may need to experiment with parsing them (but even getting a ‘start’ of the lines needed makes them easier to then ‘multi-column edit’). 2) If places are not in the existing metadata or location hierarchies, we may need a user interface to clarify the division/country/region that should be generated alongside.
  • 11 Fix bug - If there are two variants with the same original name (Ex: Mp -> Occitanie (Europe, France) and Mp -> Northern Mariana Islands (North America, USA)), then only the second one is applied, since the first is overwritten. A warning is shown, but we should handle this better so both are applied.
  • 12 Find a more long-term way to sort countries by lat-long. See Slack threads here and here
  • 13 Include & adapt @MoiraZuber’s script that auto-locates who should be attributed in tweets
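
The overwrite described in item 11 is what happens when corrections are stored in a mapping keyed only by the variant name. One possible fix, sketched below, is to keep every correction for a name and pick the one whose region and country match the record. The data layout and the names `VARIANTS`, `build_lookup` and `resolve` are assumptions made for illustration, not the script’s actual structures.

```python
from collections import defaultdict

# Hypothetical in-memory form of the corrections:
# (variant, corrected name, region, country)
VARIANTS = [
    ("Mp", "Occitanie", "Europe", "France"),
    ("Mp", "Northern Mariana Islands", "North America", "USA"),
]


def build_lookup(variants):
    """Group corrections by variant name so that none is overwritten."""
    lookup = defaultdict(list)
    for variant, corrected, region, country in variants:
        lookup[variant].append((corrected, region, country))
    return lookup


def resolve(lookup, name, region, country):
    """Return the correction matching this region/country, else the name unchanged."""
    for corrected, rule_region, rule_country in lookup.get(name, []):
        if (rule_region, rule_country) == (region, country):
            return corrected
    return name


lookup = build_lookup(VARIANTS)
print(resolve(lookup, "Mp", "Europe", "France"))
print(resolve(lookup, "Mp", "North America", "USA"))
```

Storing a list per variant keeps both ‘Mp’ entries, and the region/country pair disambiguates them at lookup time. A record whose context matches neither entry is returned unchanged rather than guessed at.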

Low priority

  • 6 Convey to people who might want to run this script that geopy is a necessary dependency (Should be fixed in #470)
  • 7 Clear remaining Belgium alerts at start of script & outline best way to handle similar alerts in future
  • 8 Better handle ignoring (but informing the user about) cruise-related missing data: display it more compactly/prettily, and just once. This includes getting rid of the ‘Missing locations: # USA # California’ message, which still appears even if the only missing places were ‘cruise-related’ (see here)
  • 10 Better documentation of the auxiliary files & how the script works
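
For item 6, one way to tell users about the geopy requirement is a check that runs before the script does any work. The sketch below is only an illustration: `REQUIRED`, `missing_dependencies` and `check_or_exit` are hypothetical names, and the only library call relied on is the standard library’s `importlib.util.find_spec`.

```python
import importlib.util
import sys

# Module names the script needs; "geopy" is the one named in item 6.
REQUIRED = ["geopy"]


def missing_dependencies(modules):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def check_or_exit(modules=REQUIRED):
    """Stop with a readable message instead of a traceback deep in the run."""
    missing = missing_dependencies(modules)
    if missing:
        sys.exit("Missing required packages: " + ", ".join(missing)
                 + ". Please install them and re-run.")
```

Calling `check_or_exit()` at the start of the script would fail fast with a clear message when geopy is not installed.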

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 15 (15 by maintainers)

Top GitHub Comments

1 reaction
emmahodcroft commented, Jul 15, 2020

Good catch - I’ll add another to the list. Unfortunately I guess the numbering system is going to get out of order now 🙃

1 reaction
eharkins commented, Jul 14, 2020

Thanks!

I can’t tell - do any of these (e.g. 5) have to do with automating creation of annotations for additional-info-changes.txt that comes into slack with travel history and other things to be annotated?

Edit: This seems like at least a medium priority since when there are many of these “additional info changes”, we could save a lot of time by running them through the script (or a separate script) to verify they are real locations, spelled correctly, and ones we don’t want to ignore, and then generate annotations for them.
