Avoid special characters like '=' in produced filenames
Is your feature request related to a problem? Please describe.
Running pudl_etl for EPA-CEMS produces parquet files named like parquet/epacems/year=2018/state=ID/c582a7e6811d41d79b94998916c7cec5.parquet (when using the prefect branch). These filenames contain '=', which can be problematic when working with remote filesystems.
For example, AWS S3 does allow the '=' character (and a lot of other special characters) in filenames, but generally recommends against using them, since they require special handling that downstream applications may not have.
You can use any UTF-8 character in an object key name. However, using certain characters in key names can cause problems with some applications and protocols. The following guidelines help you maximize compliance with DNS, web-safe characters, XML parsers, and other APIs.
(from https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html)
Similarly, Google Cloud Storage has a list of special chars to avoid: https://cloud.google.com/storage/docs/naming-objects.
So while '=' is valid, avoiding it is recommended (at least by AWS). In practice this did cause a problem with AWS.jl (https://github.com/JuliaCloud/AWS.jl/pull/293). Looking at issues on various AWS SDKs, handling special chars is a common source of problems, so it is best to avoid special chars when possible.
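One place where '=' can need special handling is percent-encoding of the key when it is put into a URL. The sketch below uses only Python's standard `urllib.parse.quote` to show the difference between the current name and an underscore-based name. It is an illustration of the general point, not a description of what went wrong in the AWS.jl pull request; the `safe_key` name is a hypothetical alternative taken from the proposal in this issue.

```python
from urllib.parse import quote, unquote

# Key as currently produced (Hive-style partition directories).
key = "parquet/epacems/year=2018/state=ID/c582a7e6811d41d79b94998916c7cec5.parquet"
# Hypothetical alternative using underscores instead of '='.
safe_key = "parquet/epacems/year_2018/state_ID/c582a7e6811d41d79b94998916c7cec5.parquet"

# quote() leaves letters, digits, "_", ".", "-", "~" and (by default) "/" untouched
# and percent-encodes everything else, including "=".
encoded = quote(key)
print(encoded)

# The underscore variant passes through unchanged, so there is nothing to
# encode or decode consistently between tools.
print(quote(safe_key) == safe_key)
```

A client that encodes the key and a client that does not will disagree about the first name, while both agree about the second.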
Describe the solution you’d like
- Do not use special chars like '=' in filenames, only alphanumeric chars a-z, A-Z, 0-9, and perhaps a limited number of other chars that are “generally safe” for use, e.g. underscore '_'.
  - i.e. no more than what the linked AWS docs list as “Safe characters”
- For example, the dataset above could be epacems/year_2018/state_ID/c582a7e6811d41d79b94998916c7cec5.parquet, or epacems/year2018/stateID/c582a7e6811d41d79b94998916c7cec5.parquet
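To make the proposal concrete, here is a minimal sketch of a check for "safe" names and a conversion from the current names to the underscore form. The allow-list (alphanumerics plus '/', '_', '-', '.') is a deliberately conservative assumption of mine, narrower than the AWS "Safe characters" list, and the function names `is_safe_key` and `hive_to_underscore` are hypothetical, not part of PUDL.

```python
import re

# Assumed conservative allow-list: a-z, A-Z, 0-9, plus "/", "_", "-", ".".
_SAFE_KEY = re.compile(r"[A-Za-z0-9/_.\-]+")


def is_safe_key(key: str) -> bool:
    """Return True if every character in the key is on the allow-list."""
    return _SAFE_KEY.fullmatch(key) is not None


def hive_to_underscore(key: str) -> str:
    """Turn 'year=2018/state=ID' style segments into 'year_2018/state_ID'."""
    return key.replace("=", "_")


current = "epacems/year=2018/state=ID/c582a7e6811d41d79b94998916c7cec5.parquet"
proposed = hive_to_underscore(current)

print(is_safe_key(current))   # False: contains '='
print(proposed)
print(is_safe_key(proposed))  # True
```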
Describe alternatives you’ve considered
- Leave as is; downstream tools have to handle whatever’s given / users have to work around this (e.g. rename the files)
Additional context
So far I’ve been adding special handling to the Julia AWS SDK (https://github.com/JuliaCloud/AWS.jl/pull/285, https://github.com/JuliaCloud/AWS.jl/pull/293) to handle special characters in filenames, since various other energy-related datasets also use non-alphanumeric chars, e.g. some ISOs like to use '(', ')', ':', ' ', '-', but these are still easier to handle than '='.
Where possible I think it is helpful to avoid special characters, since this increases the number of downstream tools that can be used and reduces the burden on those tools to handle special cases (which can be problematic: https://github.com/JuliaCloud/AWS.jl/pull/293#issuecomment-783475801). We also considered renaming the files output by PUDL ETL, but again it would be simplest if this wasn’t needed at all 😃
I took a quick look at the code, but couldn’t see exactly where the /year=year/ type naming comes from, which made me wonder if it was tied into an existing tool (e.g. prefect/partitions), but this is just a guess. Hoping a change to the naming scheme wouldn’t have any knock-on effects 🤞
Top GitHub Comments
I just used the hive partitioning because it was the default when I created the parquet outputs. If this is just as easy to set up and works in a wider variety of environments, then I agree we should use DirectoryPartitioning. Especially since the fact that the partitions are years / states is pretty clear from the values that the partition directories take on.

This comes from using pyarrow’s default HivePartitioning to create names like year=2018/state=ID/x.parquet. The other alternative, DirectoryPartitioning, would create names like 2018/ID/x.parquet. (I think other options, like year_2018/state_ID/, would work, but would require a custom PartitioningFactory.)
Docs: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html
Current writer code: https://github.com/catalyst-cooperative/pudl/blob/a2c1b996ea81015e586e392bb95609da76161cec/src/pudl/convert/epacems_to_parquet.py#L219-L224
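
For comparison, the three naming layouts mentioned above can be sketched in plain Python. This does not call pyarrow and is not the PUDL writer code; `partition_dir` and its `scheme` labels are hypothetical names used only to show what directory names each layout would give for the same partition values.

```python
from typing import Mapping


def partition_dir(parts: Mapping[str, object], scheme: str) -> str:
    """Build the partition directory path for one of three naming layouts.

    "hive"       -> key=value segments (what the files use today)
    "directory"  -> value-only segments
    "underscore" -> key_value segments (would need custom handling)
    """
    if scheme == "hive":
        segments = [f"{key}={value}" for key, value in parts.items()]
    elif scheme == "directory":
        segments = [str(value) for value in parts.values()]
    elif scheme == "underscore":
        segments = [f"{key}_{value}" for key, value in parts.items()]
    else:
        raise ValueError(f"unknown scheme: {scheme!r}")
    return "/".join(segments)


parts = {"year": 2018, "state": "ID"}
for scheme in ("hive", "directory", "underscore"):
    print(scheme, "->", partition_dir(parts, scheme) + "/x.parquet")
```

Only the first layout puts '=' in the path. The value-only layout drops the field names, so a reader has to be told (or infer) that the first level is the year and the second is the state.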