
How to convert a dict generator into a Hugging Face dataset

See original GitHub issue

Link

No response

Description

Hey there, I have used seqio to get a well-distributed mixture of samples from multiple datasets. However, the resultant output from seqio is a Python generator of dicts, which I cannot convert back into a Hugging Face dataset.

The generator contains all the samples needed for training the model, but I cannot convert it into a Hugging Face dataset.

The code looks like this:

for ex in seqio_data:
    print(ex["text"])

I need to convert the seqio_data (generator) into a Hugging Face dataset.

The complete seqio code is below:

import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry



def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]


def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name)
    )


@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map"""
    return {**key_map, target_key: x}



dataset_name = 'oscar-corpus/OSCAR-2109'
subset = 'mr'
dataset_params = {"path": dataset_name, "language":subset, "use_auth_token":True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key,
            key_map={"targets": None},
            target_key="targets")],
    output_features={
        "targets": seqio.Feature(
            vocabulary=seqio.PassThroughVocabulary,
            add_eos=False, dtype=tf.string, rank=0)},
    metric_fns=[]
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length=None,
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42
)
for _, ex in zip(range(5), dataset):
     print(ex['targets'].numpy().decode())

Owner

No response

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments:18 (10 by maintainers)

Top GitHub Comments

4 reactions
mariosasko commented, Jun 6, 2022

@lhoestq +1 for the Dataset.from_generator idea.

Having thought about it, let’s avoid adding Dataset.from_iterable to the API since dictionaries are technically iterables (“iterable” is a broad term in Python), and we already provide Dataset.from_dict. And for lists maybe we can add Dataset.from_list, similar to pa.Table.from_pylist. WDYT?

2 reactions
mariosasko commented, Jul 5, 2022

Hi @johann-petrak! You can pass the features directly to ArrowWriter’s initializer, like so: ArrowWriter(..., features=features).

And the reason why I prefer Dataset.from_generator over Dataset.from_iterable is mentioned in one of my previous comments.


