Issues/Features for Build Shepherd `parse_metadata` Script


These issues should be tackled to improve the parse_metadata.py script used by the Nextstrain build shepherds. They are a mix of bugs and features.

I’ve added numbers so that it’s easier to discuss and reference!

High priority

1 More specific variants.txt corrections.
- Example 1: ‘Gyeonggi’ is both a location and a division. We want to correct division Gyeonggi to division Gyeonggi Province, but only when it is a division, not when it is a location. (Should be fixed in #475)
- Example 2: ‘Bahrein’ is both a country and a division; both need to be corrected to ‘Bahrain’. Currently only the division annotations are automatically generated. (Should be fixed in #468)
- Example 3: Annotations are auto-generated to correct ‘unknown’ to (blank). However, if an entry has both an unknown division and an unknown location, only the location corrections are generated. (Emma note: I believe these should also only go to blank when the country is ‘USA’, as we don’t do this for other countries, right? This may need to be a special case.)
- Example 4: Allow specifying that a variant should only be replaced if it is from a specified division or country (or other level). For example, not every ‘Brownsville’ should be replaced with ‘Brownsville County’, only those in Texas, USA.
2 Auto-add missing places to the auto-generated color_ordering.tsv file.
Example: ‘Sangalkam’ is missing from defaults/color_ordering.tsv and is displayed with a warning (Sangalkam (only missing in ordering)), but it is not added to the auto-generated developer_scripts/color_ordering.tsv. This needs to happen before we can switch to using only an auto-generated ordering file.
Note: Ensure this happens both when the place is simply missing from color_ordering.tsv and when it is a ‘false’ division that has been corrected. See example here
- Directly modify the script so places that are just missing from color_ordering.tsv are added automatically (@MoiraZuber)
- Un-comment the read_exposure call so we auto-add places which are ‘exposure places’ to color_ordering.tsv
- Clean up the exposure categories by modifying gisaid_annotations & accepted_exposure_additions so we are working from a clean slate (@emmahodcroft)
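
For item 1 (Example 1 and Example 4 in particular), here is a minimal sketch of context-restricted corrections. It assumes a hypothetical rule format in which each correction names the field it applies to plus optional required values for other fields. `Rule` and `apply_rules` are illustrative names, not functions in parse_metadata.py, and the real variants.txt format may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """Hypothetical correction rule; not the real variants.txt format."""
    level: str  # field to correct: "location", "division" or "country"
    old: str    # value to replace
    new: str    # replacement value
    where: dict = field(default_factory=dict)  # required values of other fields


def apply_rules(record, rules):
    """Return a corrected copy of one metadata record (a plain dict)."""
    fixed = dict(record)
    for rule in rules:
        # Only touch the field the rule names, so a location called
        # 'Gyeonggi' is untouched by a division-level rule.
        if record.get(rule.level) != rule.old:
            continue
        # Conditions are checked against the original, uncorrected record.
        if all(record.get(k) == v for k, v in rule.where.items()):
            fixed[rule.level] = rule.new
    return fixed


rules = [
    Rule("division", "Gyeonggi", "Gyeonggi Province"),
    Rule("location", "Brownsville", "Brownsville County",
         where={"division": "Texas", "country": "USA"}),
]
```

With these rules, a ‘Brownsville’ location outside Texas, USA is left as-is, and only the division field of a Gyeonggi record changes.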

Medium priority

3 Automatically add new locations needing lat-longs into defaults/lat_longs.tsv and re-sort the file as necessary (in order by location, division, country, etc)
4 Clarify the role of sequences_exclude.txt - what does it do, how can we use it better?
5 Clarify the role of accepted_exposure_additions.txt and figure out how to get it working optimally and resolve inconsistencies
9 Expand the script to also generate gisaid_annotations.tsv lines for additional-info-changes.txt entries. After reading in this file, the script should generate new annotation entries for them. Two challenges here: 1) The entries are often in ‘free text’, so we may need to experiment with parsing them (but even getting a ‘start’ of the lines needed makes them easier to then ‘multi-column edit’). 2) If places are not in the existing metadata or location-hierarchies, we may need a user interface to clarify the division/country/region that should be generated alongside.
11 Fix bug: if there are two variants of the same original name (e.g. Mp -> Occitanie (Europe, France) and Mp -> Northern Mariana Islands (North America, USA)), only the second one is applied because the first is overwritten. A warning is shown, but we should handle this better so both are applied.
12 Find a more long-term way to sort countries by lat-long. See Slack threads here and here
13 Include & adapt @MoiraZuber’s script that auto-locates who should be attributed in tweets
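
For item 11, one possible approach is to keep every correction for a given original name instead of letting later ones overwrite earlier ones, and then pick the one whose context matches the record. This is a sketch only: the tuple layout of `rows` and the names `load_variants` and `correct_division` are assumptions for illustration, not the script's actual data structures.

```python
from collections import defaultdict


def load_variants(rows):
    """Group corrections by original name so none is overwritten.

    `rows` is an iterable of (old_name, new_name, region, country) tuples;
    this layout is assumed for illustration.
    """
    variants = defaultdict(list)
    for old, new, region, country in rows:
        variants[old].append({"new": new, "region": region, "country": country})
    return variants


def correct_division(record, variants):
    """Return the correction whose region and country match the record."""
    for cand in variants.get(record["division"], []):
        if (cand["region"], cand["country"]) == (record["region"], record["country"]):
            return cand["new"]
    return record["division"]  # no matching context: leave unchanged


rows = [
    ("Mp", "Occitanie", "Europe", "France"),
    ("Mp", "Northern Mariana Islands", "North America", "USA"),
]
variants = load_variants(rows)
```

Storing a list per name keeps both ‘Mp’ corrections available, so the French and US records each get their own replacement.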

Low priority

6 Convey to people who might want to run this script that geopy is a necessary dependency (Should be fixed in #470)
7 Clear remaining Belgium alerts at start of script & outline best way to handle similar alerts in future
8 Better handle ignoring (but informing the user about) cruise-related missing data: display it more compactly/prettily, and just once. This includes getting rid of the ‘Missing locations: # USA # California’ output, which still appears even if the only missing places were ‘cruise-related’ (see here)
10 Better documentation of the auxiliary files & how the script works
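
For item 6, one way to surface the geopy requirement is a check at the start of the script that exits with a readable message. This is a sketch under the assumption that a startup check is acceptable; `missing_modules` and `check_dependencies` are hypothetical helper names, and the fix in #470 may take a different approach.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the subset of top-level module `names` that cannot be found."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def check_dependencies(names=("geopy",)):
    """Exit with an explanatory message if a required package is absent."""
    missing = missing_modules(names)
    if missing:
        sys.exit("Missing required package(s): " + ", ".join(missing)
                 + ". Try: pip install " + " ".join(missing))
```

Calling `check_dependencies()` before the script's main work would replace a bare import traceback with a message that says which package to install.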

Issue Analytics

Created 3 years ago
Comments: 15 (15 by maintainers)

Top GitHub Comments

emmahodcroft commented, Jul 15, 2020

Good catch - I’ll add another to the list. Unfortunately I guess the numbering system is going to get out of order now 🙃

eharkins commented, Jul 14, 2020

Thanks!

I can’t tell - do any of these (e.g. 5) have to do with automating creation of annotations for additional-info-changes.txt that comes into slack with travel history and other things to be annotated?

Edit: This seems like at least a medium priority since when there are many of these “additional info changes”, we could save a lot of time by running them through the script (or a separate script) to verify they are real locations, spelled correctly, and ones we don’t want to ignore, and then generate annotations for them.