allow regexp matching in workspace operations

IMO at least the following operations on METS (both in the CLI and the API) should support regular expression matching:

remove_file: instead of a fixed ID to look for
remove_file_group: instead of a fixed ID to look for
find_files: both MIME type (e.g. just image/) and ID
add_file_group: for both file names and file IDs (possibly with back-references), cf. discussion here

To minimise possible ambiguity between a verbatim string and a regex interpretation, while still keeping the existing argument/option names (and not introducing a new flavour each time), I recommend either using POSIX Basic Regular Expression syntax (which is perhaps hard to come by in Python) or allowing some kind of extra notation in the input, e.g. a re: prefix.
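To make the prefix idea concrete, here is a minimal sketch of how one argument could carry either a verbatim string or a regex. The helper name make_matcher is hypothetical, and whole-string matching via re.fullmatch is an assumption of this sketch, not something the issue specifies.

```python
import re

# Notation proposed in the issue; other markers would work the same way.
REGEX_PREFIX = "re:"


def make_matcher(spec):
    """Return a predicate that tests a value against spec.

    A spec starting with REGEX_PREFIX is compiled as a regular expression
    and must match the whole value (assumption of this sketch). Any other
    spec is compared verbatim, so existing callers keep their behaviour.
    """
    if spec.startswith(REGEX_PREFIX):
        pattern = re.compile(spec[len(REGEX_PREFIX):])
        return lambda value: pattern.fullmatch(value) is not None
    return lambda value: value == spec


is_image = make_matcher("re:image/.*")
is_literal = make_matcher("image/")

print(is_image("image/png"))    # regex interpretation
print(is_literal("image/png"))  # verbatim interpretation
```

The remaining ambiguity is a verbatim value that itself begins with the prefix, which is the conflict the choice of marker has to minimise.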

Issue Analytics

Created 4 years ago
Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction

bertsky commented, Jun 8, 2020

Sorry, just saw your response, already had #507 queued as fix.

feature would be self-documented via help strings in ocrd workspace.

Will do.

With the multi-valued semantics (and return-type) of ocrd_mets.remove_file, now we cannot simply delegate from workspace.remove_file

I’ll mirror the behavior in METS with remove_file/remove_one_file.

I did without mirroring…
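The delegation problem mentioned above comes from a removal that can hit several files and therefore returns a list. A rough sketch of that split follows, using a hypothetical in-memory class (ToyMets is not the real ocrd_mets API; only the method names remove_file and remove_one_file are taken from the comment).

```python
import re


class ToyMets:
    """Hypothetical in-memory stand-in that only tracks file IDs."""

    def __init__(self, file_ids):
        self.file_ids = list(file_ids)

    def remove_one_file(self, file_id):
        """Remove exactly one file by its verbatim ID and return that ID."""
        self.file_ids.remove(file_id)  # raises ValueError if the ID is unknown
        return file_id

    def remove_file(self, spec):
        """Remove every file matching spec and return the removed IDs as a list."""
        if spec.startswith("re:"):
            pattern = re.compile(spec[3:])
            targets = [i for i in self.file_ids if pattern.fullmatch(i)]
        else:
            targets = [i for i in self.file_ids if i == spec]
        return [self.remove_one_file(i) for i in targets]


mets = ToyMets(["IMG_0001", "IMG_0002", "GT_0001"])
removed = mets.remove_file("re:IMG_.*")
print(removed)
print(mets.file_ids)
```

In this sketch a caller that expects a single result has to call remove_one_file directly or unpack the list itself.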

1 reaction

kba commented, Apr 15, 2020

I agree these make sense and I can implement them. This is related to https://github.com/OCR-D/core/issues/446#issuecomment-590328336 (inefficient find_files) and #448, so I’m implementing it as part of #448. Have a look at https://github.com/kba/ocrd-core/commit/8b6d277640335bf8afa1d815f0cf26f5b9290060, this implements the regex search for find_files with a re: prefix, i.e. you can do mets.find_files(mimetype="re:image/jpe?g") or mets.find_files(ID="re:.*0001.*").
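As an illustration only, the two calls above could behave like the following single-pass filter. FileEntry and this standalone find_files are hypothetical stand-ins, not the code from the linked commit; whole-string matching is assumed because the ID example wraps its pattern in .* on both sides.

```python
import re
from dataclasses import dataclass


@dataclass
class FileEntry:
    """Hypothetical stand-in for a METS file entry."""
    ID: str
    mimetype: str


def _matches(spec, value):
    """Treat a 're:' prefix as a regex, anything else as a verbatim string."""
    if spec.startswith("re:"):
        return re.fullmatch(spec[3:], value) is not None
    return spec == value


def find_files(entries, **criteria):
    """Yield entries satisfying every criterion, visiting each entry once."""
    for entry in entries:
        if all(_matches(spec, getattr(entry, name))
               for name, spec in criteria.items()):
            yield entry


entries = [
    FileEntry("OCR-D-IMG_0001", "image/jpeg"),
    FileEntry("OCR-D-IMG_0002", "image/jpg"),
    FileEntry("OCR-D-GT_0001", "application/vnd.prima.page+xml"),
]

jpegs = [e.ID for e in find_files(entries, mimetype="re:image/jpe?g")]
first_pages = [e.ID for e in find_files(entries, ID="re:.*0001.*")]
print(jpegs)
print(first_pages)
```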

Since this is coupled to the “single-pass find_files changeset”, I still need to do performance testing but do let me know if this is going in the wrong direction.

Not sure about the re: prefix because of possible conflicts. How about ~ or @ which are presumably more rare in digitization data than re:?
