Avoid special characters like '=' in produced filenames


Is your feature request related to a problem? Please describe.

Running pudl_etl for EPA-CEMS produces parquet files named like parquet/epacems/year=2018/state=ID/c582a7e6811d41d79b94998916c7cec5.parquet (when using the prefect branch). These filenames contain '=', which can be problematic when working with remote filesystems.

For example, AWS S3 does allow the '=' character (and a lot of other special characters) in filenames, but generally recommends against using them, since they require special handling that downstream applications may not have.

You can use any UTF-8 character in an object key name. However, using certain characters in key names can cause problems with some applications and protocols. The following guidelines help you maximize compliance with DNS, web-safe characters, XML parsers, and other APIs.

(from https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html)

Similarly, Google Cloud Storage has a list of special chars to avoid: https://cloud.google.com/storage/docs/naming-objects.

So while '=' is valid, avoiding it is recommended (at least by AWS). In practice this did cause a problem with AWS.jl (https://github.com/JuliaCloud/AWS.jl/pull/293). Looking at issues on various AWS SDKs, handling special chars is a common source of bugs, so it is best to avoid special chars when possible.

Describe the solution you’d like

  • Do not use special chars like '=' in filenames, only alphanumeric chars a-z, A-Z, 0-9, and perhaps a limited number of other chars that are “generally safe” for use e.g. underscore '_'.
    • i.e. no more than what the linked AWS docs list as “Safe characters”
  • For example, the dataset above could be
    • epacems/year_2018/state_ID/c582a7e6811d41d79b94998916c7cec5.parquet, or
    • epacems/year2018/stateID/c582a7e6811d41d79b94998916c7cec5.parquet
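
As a rough illustration of the proposal above, the sketch below flags and replaces characters that fall outside a conservative allow-list. The allow-list is an assumption based on the “Safe characters” section of the linked AWS docs (alphanumerics plus / ! - _ . * ' ( )), and the helper names `unsafe_chars` and `sanitize_key` are hypothetical, not part of PUDL or any AWS SDK.

```python
import re

# Assumption: mirrors the "Safe characters" list in the linked AWS S3
# object-key docs (alphanumerics plus / ! - _ . * ' ( )).
_UNSAFE = re.compile(r"[^0-9A-Za-z/!\-_.*'()]")


def unsafe_chars(key):
    """Return the sorted, de-duplicated characters in `key` outside the allow-list."""
    return sorted(set(_UNSAFE.findall(key)))


def sanitize_key(key, replacement="_"):
    """Replace every character outside the allow-list with `replacement`."""
    return _UNSAFE.sub(replacement, key)


key = "epacems/year=2018/state=ID/c582a7e6811d41d79b94998916c7cec5.parquet"
print(unsafe_chars(key))
print(sanitize_key(key))
```

Renaming after the fact like this is the workaround, not the fix; the request here is for the writer to avoid producing such names in the first place.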

Describe alternatives you’ve considered

  • Leave as is; downstream tools have to handle whatever’s given, or users have to work around this (e.g. rename the files)

Additional context

So far I’ve been adding special handling to the Julia AWS SDK (https://github.com/JuliaCloud/AWS.jl/pull/285, https://github.com/JuliaCloud/AWS.jl/pull/293) to handle special characters in filenames, since various other energy-related datasets also use non-alphanumeric chars. For example, some ISOs like to use '(', ')', ':', ' ', '-', but these are still easier to handle than '='.

Where possible I think it is helpful to avoid special characters, since this increases the number of downstream tools that can be used and reduces the burden on those tools to handle special cases (which can be problematic: https://github.com/JuliaCloud/AWS.jl/pull/293#issuecomment-783475801). We also considered renaming the files output by PUDL ETL, but again it would be simplest if this wasn’t needed at all 😃

I took a quick look at the code, but couldn’t see exactly where the /year=year/ type naming comes from, which made me wonder if it was tied into an existing tool (e.g. prefect/partitions), but this is just a guess. Hoping a change to the naming scheme wouldn’t have any knock-on effects 🤞

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Reactions: 1
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
zaneselvans commented, Feb 23, 2021

I just used the hive partitioning because it was the default when I created the parquet outputs. If this is just as easy to set up and works in a wider variety of environments, then I agree we should use DirectoryPartitioning, especially since the fact that the partitions are years / states is pretty clear from the values that the partition directories take on.

1 reaction
karldw commented, Feb 22, 2021

This comes from using pyarrow’s default HivePartitioning to create names like year=2018/state=ID/x.parquet. The other alternative, DirectoryPartitioning, would create names like 2018/ID/x.parquet. (I think other options, like year_2018/state_ID/, would work, but would require a custom PartitioningFactory.)

Docs: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html

Current writer code: https://github.com/catalyst-cooperative/pudl/blob/a2c1b996ea81015e586e392bb95609da76161cec/src/pudl/convert/epacems_to_parquet.py#L219-L224
