Issues/Features for Build Shepherd `parse_metadata` Script
These are issues that should be tackled in order to improve the use of the `parse_metadata.py` script used by the Nextstrain build shepherds. They are a mix of bugs and features.
I’ve added numbers so that it’s easier to discuss and reference!
High priority
- 1 More specific `variants.txt` corrections.
  - Example 1: ‘Gyeonggi’ is a location and a division. We want to correct `division Gyeonggi` to `division Gyeonggi Province`, but only when it’s a division, not when it’s a location. (Should be fixed in #475)
  - Example 2: ‘Bahrein’ is both a country and a division, and both need to be corrected to ‘Bahrain’. Currently only the `division` annotations are automatically generated. (Should be fixed in #468)
  - Example 3: Annotations are auto-generated to correct ‘unknown’ entries, but for `unknown` divisions and locations, only location corrections are generated. (Emma note: I believe these should also only go to blank when the country is ‘USA’, as we don’t do this for other countries, right? May need to be a special case.)
  - Example 4: Allow specifying that a variant should only be replaced if it is from a specified `division` or `country` (or other). For example, not all instances of ‘Brownsville’ should be replaced with ‘Brownsville County’, only those in Texas, USA.
- 2 Auto-add missing places to the auto-generated `color_ordering.tsv` file.
  Example: ‘Sangalkam’ is missing from the `defaults/color_ordering.tsv` and is displayed with a warning (`Sangalkam (only missing in ordering)`), but it is not added to the auto-generated `developer_scripts/color_ordering.tsv`. This needs to happen before we can switch to just using an auto-generated orderings file.
  Note: Ensure this happens both when the place is simply missing from `color_ordering` and when it’s a ‘false’ division that’s been corrected. See example here
  - Directly modify script so places that are just missing in `color_orderings.tsv` are added automatically (@MoiraZuber)
  - Un-comment `read_exposure` call so we auto-add places which are ‘exposure places’ to the `color_orderings.tsv`
  - Clean up the exposure categories through modifying `gisaid_annotations` & `accepted_exposure_additions` so we are working from a clean slate (@emmahodcroft)
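For item 1 (especially Example 4), one way to express corrections that only apply in a given context is to attach optional conditions to each rule. The sketch below is only an illustration: the `Correction` class, its field names, and `apply_corrections` are hypothetical and are not taken from `parse_metadata.py` or the actual `variants.txt` format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Correction:
    """Hypothetical rule: rewrite `old` to `new` in one metadata field."""
    level: str  # field the rule applies to, e.g. "division" or "location"
    old: str
    new: str
    # Optional context the record must match, e.g. {"country": "USA"}.
    where: Dict[str, str] = field(default_factory=dict)


def apply_corrections(record: Dict[str, str], rules: List[Correction]) -> Dict[str, str]:
    """Return a copy of `record` with every matching rule applied."""
    fixed = dict(record)
    for rule in rules:
        if fixed.get(rule.level) != rule.old:
            continue
        # Context is checked against the original record so that one rule
        # cannot change whether another rule's conditions are met.
        if all(record.get(key) == value for key, value in rule.where.items()):
            fixed[rule.level] = rule.new
    return fixed


rules = [
    # Only the division is renamed; a location called 'Gyeonggi' is untouched.
    Correction("division", "Gyeonggi", "Gyeonggi Province"),
    # Only applies to Brownsville in Texas, USA.
    Correction("location", "Brownsville", "Brownsville County",
               where={"division": "Texas", "country": "USA"}),
]
```

Keeping the level and the context on the rule itself means a name like ‘Gyeonggi’ can be corrected as a division while the same string is left alone as a location.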
Medium priority
- 3 Automatically add new locations needing lat-longs into `defaults/lat_longs.tsv` and re-sort the file as necessary (in order by `location`, `division`, `country`, etc.)
- 4 Clarify the role of `sequences_exclude.txt`: what does it do, and how can we use it better?
- 5 Clarify the role of `accepted_exposure_additions.txt` and figure out how to get it working optimally and resolve inconsistencies
- 9 Expand script to also generate `gisaid_annotations.tsv` lines for `additional-info-changes.txt` entries. After reading in this file, the script should generate new entries to add these. Two challenges here: 1) These are often in ‘free text’, so we may need to experiment with parsing them (but even getting a ‘start’ of the lines needed makes them easier to then ‘multi-column edit’). 2) If places are not in existing `metadata` or `location-hierarchies`, we may need a user interface to clarify the division/country/region that should be generated alongside.
- 11 Fix bug: if there are two variants from the same original name (e.g. `Mp -> Occitanie (Europe, France)` and `Mp -> Northern Mariana Islands (North America, USA)`), then only the second one is applied since the first is overwritten. A warning is shown, but we should handle this better so both are applied.
- 12 Find a more long-term way to sort countries by lat-long. See slack threads here and here
- 13 Include & adapt @MoiraZuber’s script that auto-locates who should be attributed in tweets
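The overwrite described in item 11 is what you would expect if corrections are stored in a plain dictionary keyed only by the variant name (an assumption about the cause, not something confirmed from the script). A possible fix is to keep a list of candidates per variant and pick the one whose context matches. The tuple layout and function names below are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# (variant, corrected name, region, country); layout assumed for illustration.
VariantLine = Tuple[str, str, str, str]


def build_variant_index(lines: List[VariantLine]) -> Dict[str, List[Tuple[str, str, str]]]:
    """Group corrections by variant so same-named variants are all kept."""
    index = defaultdict(list)
    for variant, corrected, region, country in lines:
        index[variant].append((corrected, region, country))
    return dict(index)


def resolve(index, variant: str, region: str, country: str) -> Optional[str]:
    """Return the correction whose region and country match, or None."""
    for corrected, rule_region, rule_country in index.get(variant, []):
        if (rule_region, rule_country) == (region, country):
            return corrected
    return None


lines = [
    ("Mp", "Occitanie", "Europe", "France"),
    ("Mp", "Northern Mariana Islands", "North America", "USA"),
]
index = build_variant_index(lines)
```

With this layout, both `Mp` entries survive loading, and the region and country of the record being corrected decide which one is applied.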
Low priority
- 6 Convey to people who might want to run this script that `geopy` is a necessary dependency (Should be fixed in #470)
- 7 Clear remaining `Belgium` alerts at start of script & outline best way to handle similar alerts in future
- 8 Better handle ignoring (but informing user about) cruise-related missing data (display more compactly/prettily, and just once), including getting rid of `Missing locations: # USA # California`, which still appears even if the only missing places were ‘cruise-related’ (see here)
- 10 Better documentation of the auxiliary files & how the script works
Issue Analytics
- State:
- Created 3 years ago
- Comments:15 (15 by maintainers)
Good catch - I’ll add another to the list. Unfortunately I guess the numbering system is going to get out of order now 🙃
Thanks!
I can’t tell - do any of these (e.g. 5) have to do with automating creation of annotations for `additional-info-changes.txt` that comes into slack with travel history and other things to be annotated?
Edit: This seems like at least a medium priority, since when there are many of these “additional info changes”, we could save a lot of time by running them through the script (or a separate script) to verify they are real locations, spelled correctly, and ones we don’t want to ignore, and then generate annotations for them.