How to convert a dict generator into a Hugging Face dataset
Description
Hey there, I have used seqio to get a well-distributed mixture of samples from multiple datasets. However, the resulting output from seqio is a Python generator of dicts, which I cannot turn back into a Hugging Face dataset.
The generator contains all the samples needed for training the model, but I cannot convert it into a Hugging Face dataset.
The code looks like this:
for ex in seqio_data:
    print(ex["text"])
I need to convert seqio_data (a generator) into a Hugging Face dataset.
The complete seqio code is below:
import functools

import seqio
import tensorflow as tf
import t5.data
from datasets import load_dataset
from t5.data import postprocessors
from t5.data import preprocessors
from t5.evaluation import metrics
from seqio import FunctionDataSource, utils

TaskRegistry = seqio.TaskRegistry


def gen_dataset(split, shuffle=False, seed=None, column="text", dataset_params=None):
    # Load the Hugging Face dataset and yield one value of the text column at a time.
    dataset = load_dataset(**dataset_params)
    if shuffle:
        if seed:
            dataset = dataset.shuffle(seed=seed)
        else:
            dataset = dataset.shuffle()
    while True:
        for item in dataset[str(split)]:
            yield item[column]


def dataset_fn(split, shuffle_files, seed=None, dataset_params=None):
    # Wrap the Python generator in a tf.data.Dataset of scalar strings.
    return tf.data.Dataset.from_generator(
        functools.partial(gen_dataset, split, shuffle_files, seed, dataset_params=dataset_params),
        output_signature=tf.TensorSpec(shape=(), dtype=tf.string, name=dataset_name),
    )


@utils.map_over_dataset
def target_to_key(x, key_map, target_key):
    """Assign the value from the dataset to target_key in key_map."""
    return {**key_map, target_key: x}


dataset_name = 'oscar-corpus/OSCAR-2109'
subset = 'mr'
dataset_params = {"path": dataset_name, "language": subset, "use_auth_token": True}
dataset_shapes = None

TaskRegistry.add(
    "oscar_marathi_corpus",
    source=seqio.FunctionDataSource(
        dataset_fn=functools.partial(dataset_fn, dataset_params=dataset_params),
        splits=("train", "validation"),
        caching_permitted=False,
        num_input_examples=dataset_shapes,
    ),
    preprocessors=[
        functools.partial(
            target_to_key,
            key_map={"targets": None},
            target_key="targets",
        )
    ],
    output_features={"targets": seqio.Feature(vocabulary=seqio.PassThroughVocabulary, add_eos=False, dtype=tf.string, rank=0)},
    metric_fns=[],
)

dataset = seqio.get_mixture_or_task("oscar_marathi_corpus").get_dataset(
    sequence_length=None,
    split="train",
    shuffle=True,
    num_epochs=1,
    shard_info=seqio.ShardInfo(index=0, num_shards=10),
    use_cached=False,
    seed=42,
)

for _, ex in zip(range(5), dataset):
    print(ex['targets'].numpy().decode())
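For what the asker is after, here is a minimal sketch that wraps the seqio output above into a Hugging Face dataset with Dataset.from_generator (available in datasets 2.4+). The hf_generator name, the take(10_000) bound, and the "text" column name are arbitrary choices for illustration; some bound is needed because gen_dataset above repeats indefinitely:

from datasets import Dataset

def hf_generator():
    # The seqio pipeline above repeats forever, so bound the iteration explicitly.
    for ex in dataset.take(10_000).as_numpy_iterator():
        # Scalar tf.string tensors come back as bytes; decode them for the HF dataset.
        yield {"text": ex["targets"].decode("utf-8")}

hf_dataset = Dataset.from_generator(hf_generator)
print(hf_dataset)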
Top Results From Across the Web
- Convert a list of dictionaries to a Hugging Face dataset object: "I think the easiest way would be datasets.Dataset.from_pandas(pd.DataFrame(data=data))."
- How to wrap a generator with HF dataset: "Hi! Right now to do this you have to define your dataset using a dataset script, in which you can define your..."
- Load - Hugging Face: "A dataset without a loading script by default loads all the data into the train split. ... Create a dataset from a Python..."
- Source code for datasets.dataset_dict - Hugging Face: "The transform is set for every dataset in the dataset dictionary. As datasets.Dataset.set_format, this can be reset using datasets..."
- Main classes - Hugging Face: "Create DatasetInfo from the JSON file in dataset_info_dir. ... Convert dict to a pyarrow ... Create a Dataset from a generator."
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
@lhoestq +1 for the Dataset.from_generator idea. Having thought about it, let's avoid adding Dataset.from_iterable to the API, since dictionaries are technically iterables ("iterable" is a broad term in Python) and we already provide Dataset.from_dict. And for lists maybe we can add Dataset.from_list, similar to pa.Table.from_pylist. WDYT?

Hi @johann-petrak! You can pass the features directly to ArrowWriter's initializer, like so: ArrowWriter(..., features=features). And the reason why I prefer Dataset.from_generator over Dataset.from_iterable is mentioned in one of my previous comments.
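To make the distinction in these comments concrete, a short sketch of Dataset.from_dict (a mapping of column name to list of values) versus Dataset.from_list (a list of row dicts, added to datasets in a release after this discussion, mirroring pa.Table.from_pylist); the column names are made up:

from datasets import Dataset

# A dict of columns goes through Dataset.from_dict.
ds_from_dict = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# A list of row dicts goes through Dataset.from_list (newer datasets releases),
# analogous to pyarrow's pa.Table.from_pylist.
ds_from_list = Dataset.from_list([{"text": "a", "label": 0}, {"text": "b", "label": 1}])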