allow regexp matching in workspace operations

IMO at least the following operations on METS (both in the CLI and the API) should allow regular expression matching:

  • remove_file: instead of a fixed ID to look for
  • remove_file_group: instead of a fixed ID to look for
  • find_files: both MIME type (e.g. just image/) and ID
  • add_file_group: for both file names and file IDs (possibly with back-references), cf. discussion here

To minimise possible ambiguity between verbatim string and regex interpretation, while still keeping the existing argument/option names (and not introducing a new flavour each time), I recommend either using POSIX Basic Regular Expression syntax (which is perhaps hard to come by in Python) or allowing some kind of extra notation in the input, e.g. a re: prefix.
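
To make the prefix idea concrete, here is a minimal sketch in Python. It is not taken from OCR-D core: the helper name make_matcher and the constant REGEX_PREFIX are hypothetical, and whole-string matching (re.fullmatch) is an assumption about the intended semantics.

```python
import re

REGEX_PREFIX = "re:"  # assumed marker, as suggested in the issue text


def make_matcher(spec):
    """Return a predicate that tests a string against spec.

    Hypothetical helper: a spec starting with REGEX_PREFIX is compiled
    as a regular expression and must match the whole value; any other
    spec is compared verbatim.
    """
    if spec.startswith(REGEX_PREFIX):
        pattern = re.compile(spec[len(REGEX_PREFIX):])
        return lambda value: pattern.fullmatch(value) is not None
    return lambda value: value == spec


# Verbatim lookup keeps working exactly as before.
is_file_1 = make_matcher("FILE_0001")
# Opt-in regex lookup for "any image MIME type".
is_image = make_matcher("re:image/.*")
```

Because the existing argument names stay the same, callers that never use the prefix see no change in behaviour. The cost is that a literal ID beginning with re: can no longer be looked up verbatim under this scheme.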

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

bertsky commented, Jun 8, 2020

Sorry, just saw your response, already had #507 queued as fix.

feature would be self-documented via help strings in ocrd workspace.

Will do.

With the multi-valued semantics (and return-type) of ocrd_mets.remove_file, now we cannot simply delegate from workspace.remove_file

I’ll mirror the behavior in METS with remove_file/remove_one_file.

I did without mirroring…

kba commented, Apr 15, 2020

I agree these make sense and I can implement them. This is related to https://github.com/OCR-D/core/issues/446#issuecomment-590328336 (inefficient find_files) and #448, so I’m implementing it as part of #448. Have a look at https://github.com/kba/ocrd-core/commit/8b6d277640335bf8afa1d815f0cf26f5b9290060, this implements the regex search for find_files with a re: prefix, i.e. you can do mets.find_files(mimetype="re:image/jpe?g") or mets.find_files(ID="re:.*0001.*").

Since this is coupled to the “single-pass find_files changeset”, I still need to do performance testing but do let me know if this is going in the wrong direction.

Not sure about the re: prefix because of possible conflicts. How about ~ or @ which are presumably more rare in digitization data than re:?
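
For illustration only, the prefix question can be kept open by making the marker a parameter. The sketch below is not the code from the linked commit: find_records is a hypothetical stand-in that filters plain dicts rather than METS file elements, and whole-string matching is again an assumption. Patterns are compiled once and the records are then scanned in a single pass.

```python
import re


def find_records(records, prefix="re:", **criteria):
    """Filter dict records by field, with opt-in regex matching.

    Hypothetical stand-in for a find_files-style query: a criterion
    starting with `prefix` is treated as a regular expression that must
    match the whole field value; anything else is compared verbatim.
    """
    matchers = {}
    for field, spec in criteria.items():
        if spec.startswith(prefix):
            # Compile once, before scanning the records.
            matchers[field] = re.compile(spec[len(prefix):]).fullmatch
        else:
            # Bind spec as a default argument to avoid late binding.
            matchers[field] = lambda value, spec=spec: value == spec
    # Single pass: each record is checked against every criterion.
    return [
        record
        for record in records
        if all(match(record.get(field, "")) for field, match in matchers.items())
    ]


files = [
    {"ID": "IMG_0001", "mimetype": "image/jpeg"},
    {"ID": "IMG_0002", "mimetype": "image/png"},
]
jpegs = find_records(files, mimetype="re:image/jpe?g")
images = find_records(files, prefix="~", mimetype="~image/.*")
```

Switching the marker to ~ or @ only changes the default value of prefix, so the choice can be deferred without touching the matching logic.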
