Better structure dataset implementations

The suggestion is to adjust all the datasets (e.g. speechcommands) to follow some of the changes made in tedlium; see the examples below.

  • Add a _parse_filesystem method that extracts a list of “data point identifiers” in a pre-determined order, replacing the generic walk_files, as here and in #791.
  • Move _load_item into the class as a method, as here.
  • Replace class attributes with constructor arguments, e.g. here.
  • Remove non-standard attributes, e.g. here? Or add attributes, e.g. here?
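
The first three points can be sketched roughly as follows. This is only an illustration, not the tedlium code: the class name `ExampleDataset`, the `ext_audio` argument, and the folder-name-as-label layout are assumptions, and `_load_item` returns a path instead of decoding audio so that the sketch runs without torchaudio.

```python
from pathlib import Path
from typing import List, Tuple


class ExampleDataset:
    """Hypothetical dataset; not an actual torchaudio class."""

    def __init__(self, root: str, ext_audio: str = ".wav") -> None:
        # Settings arrive as constructor arguments instead of class attributes.
        self._path = Path(root)
        self._ext_audio = ext_audio
        self._ids = self._parse_filesystem()

    def _parse_filesystem(self) -> List[str]:
        # Collect the "data point identifiers" once, in a deterministic
        # (sorted) order, instead of relying on a generic directory walk.
        return sorted(
            p.relative_to(self._path).with_suffix("").as_posix()
            for p in self._path.rglob(f"*{self._ext_audio}")
        )

    def _load_item(self, fileid: str) -> Tuple[Path, str]:
        # A real dataset would decode the audio here; this sketch only
        # returns the file path and a label taken from the parent folder.
        filepath = self._path / f"{fileid}{self._ext_audio}"
        return filepath, filepath.parent.name

    def __getitem__(self, n: int) -> Tuple[Path, str]:
        return self._load_item(self._ids[n])

    def __len__(self) -> int:
        return len(self._ids)
```

Because the identifiers are computed once and sorted, index `n` refers to the same file on every run, which is the main thing the generic walk did not guarantee.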

Relates to #852, GTZAN #791, tedlium #882. cc @mthrok @cpuhrsch

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 22 (20 by maintainers)

Top GitHub Comments

1 reaction
mthrok commented, Dec 27, 2020

@krishnakalyan3

Thanks for the feedback. Do you have any other thoughts while working on #1127?

  1. removing _ext_audio -> I will replace it with a string.

Do you mean a string “constant”? I think that makes sense, especially for “.wav” formats.

  1. deprecating and removing folder_in_archive -> I am also in favor of this. It can also simplify the dataset module in the future.

Yes. We had to do this with CommonVoice recently, and the resulting code became much simpler.

  1. What about deprecating and removing url?

For url, even though it is rarely used, there are multiple potential scenarios where the source becomes unavailable and the archive is re-hosted somewhere else.

  1. firewalls
  2. the original source becomes unavailable.

Modifying _RELEASE_CONFIGS directly would be a better option

But there are cases in which users do not have admin privileges to modify the installed package, so it’s better if users can provide their configuration from their client code. Rather than a single variable, a custom configuration type that is specific to the dataset might be a possible option.

1 reaction
mthrok commented, Dec 25, 2020

Looking at #1127 (thanks @krishnakalyan3), folder_in_archive should be deprecated and removed from all the Dataset implementations.

It does not behave consistently with the download + extract behavior and, in the first place, a changed directory structure is not something the library should expect or conform to. If a user has changed the directory structure, that’s on the user, and the library should not have to take care of it.
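
For the deprecation step itself, a minimal sketch using the standard `warnings` module might look like this. The class name, the default folder name, and the warning text are assumptions made for illustration, not what #1127 does.

```python
import os
import warnings
from typing import Optional

_DEFAULT_FOLDER = "ExampleCorpus"  # hypothetical name of the extracted archive folder


class DeprecatingDataset:
    """Hypothetical dataset used only to illustrate the deprecation step."""

    def __init__(self, root: str, folder_in_archive: Optional[str] = None) -> None:
        if folder_in_archive is not None:
            # Still honored for now, but callers are told it is going away.
            warnings.warn(
                "folder_in_archive is deprecated and will be removed; "
                "the dataset expects the directory layout produced by extraction.",
                DeprecationWarning,
                stacklevel=2,
            )
        self._path = os.path.join(root, folder_in_archive or _DEFAULT_FOLDER)
```

Once the deprecation period ends, the argument and the warning can be dropped together, leaving only the default layout.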
