Import labeled dataset


Feature request: import labeled datasets in BIO format, for example:

SOCCER	O
-	O
JAPAN	B-LOC
GET	O
LUCKY	O
WIN	O
,	O
CHINA	B-PER
IN	O
SURPRISE	O
DEFEAT	O
.	O

Nadim	B-PER
Ladki	I-PER

AL-AIN	B-LOC
,	O
United	B-LOC
Arab	I-LOC
Emirates	I-LOC
1996-12-06	O
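
A minimal sketch of how a sample like the one above could be read, assuming one token per line, whitespace between token and tag, and a blank line between sentences. The helper names `parse_bio` and `bio_to_spans` are hypothetical and are not part of the tool discussed in this issue.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, tag) pairs


def parse_bio(text: str) -> List[Sentence]:
    """Split BIO-formatted text into sentences of (token, tag) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        # The tag is the last whitespace-separated field (tab or space).
        token, tag = line.rsplit(None, 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def bio_to_spans(sentence: Sentence) -> List[Tuple[int, int, str]]:
    """Collapse B-/I- tags into (start, end_exclusive, label) token spans."""
    spans: List[Tuple[int, int, str]] = []
    start, label = None, None
    for i, (_, tag) in enumerate(sentence):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # the open span simply grows
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # A stray I- tag with no matching open span starts a new span.
        if tag.startswith("B-") or tag.startswith("I-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(sentence), label))
    return spans
```

Splitting on the last run of whitespace lets the parser accept both tab-separated and space-separated lines. A line with a single field raises `ValueError`, which an importer would probably want to report back to the user.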

Btw, I love your tool, thanks for making it open source.

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Reactions: 4
  • Comments: 30 (12 by maintainers)

Top GitHub Comments

6 reactions
machakux commented, Feb 18, 2019

I would like to be able to import labelled datasets in order to review and correct wrongly labeled data, continue labelling a partially labeled dataset, or add labelled data to an existing project (mostly use cases 1,2).

I think storing documents together with labels might simplify things.

If it is decided to store annotations together with the document, the document model could be something like:


from django.db import models


class Document(models.Model):
    # Project is assumed to be defined elsewhere in the same models module.
    project = models.ForeignKey(Project, related_name='documents', on_delete=models.CASCADE)
    text = models.TextField()
    labels = models.TextField()  # or ManyToManyField() or ArrayField()
    annotations = models.TextField()  # or ManyToManyField() or JSONField()
    seq2seq_annotations = models.TextField()  # or ManyToManyField() or ArrayField()
    metadata = models.TextField(default='{}')  # or JSONField()
    # ...

Django has several third-party packages, like https://github.com/dmkoch/django-jsonfield , which can be used to provide somewhat more flexible data structures. And if you will be using PostgreSQL, Django has built-in fields for JSON, arrays and more, see https://docs.djangoproject.com/en/dev/ref/contrib/postgres/fields/ .
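
To make the "annotations stored with the document" idea concrete, here is a rough sketch in plain Python that turns one BIO-tagged sentence into a text string plus JSON-encoded annotations and metadata, which is the kind of value a `TextField` or JSON field could hold. The function name `build_document` and the keys `start_offset`, `end_offset` and `label` are assumptions made for illustration, not the project's actual schema.

```python
import json
from typing import Dict, List, Optional, Tuple


def build_document(
    tokens_with_tags: List[Tuple[str, str]],
    metadata: Optional[Dict] = None,
) -> Dict[str, str]:
    """Return text plus JSON-encoded annotations and metadata for one sentence."""
    text_parts: List[str] = []
    annotations: List[Dict] = []
    current = None  # the annotation currently being extended, if any
    offset = 0
    for token, tag in tokens_with_tags:
        start, end = offset, offset + len(token)
        if tag.startswith("I-") and current and current["label"] == tag[2:]:
            current["end_offset"] = end  # extend the open span
        elif tag != "O":
            # B- tag (or a stray I- tag) opens a new span.
            current = {"start_offset": start, "end_offset": end, "label": tag[2:]}
            annotations.append(current)
        else:
            current = None
        text_parts.append(token)
        offset = end + 1  # account for the joining space
    return {
        "text": " ".join(text_parts),
        "annotations": json.dumps(annotations),
        "metadata": json.dumps(metadata or {}),
    }
```

Offsets are character positions in the space-joined text, so `text[start_offset:end_offset]` recovers the labeled phrase. Joining tokens with a single space is a simplification, since the original spacing is not preserved in BIO files.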

Assuming the basic functionality does not involve updating existing documents, the import will not need to account for an external_id, although users might be allowed to upload one as metadata for their own future reference.

Allowing users to update existing documents through bulk upload could be limited to the admin interface or a command line interface, as advanced functionality for users who are sure of what they are doing. Here users could be allowed to provide the real id field (the object primary key), which means that if an object with the provided id already exists in the database it will be updated, otherwise a new object will be created.
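
The update-or-create behaviour described above can be sketched without a database. In the snippet below, `store` is a hypothetical in-memory stand-in for the documents table and `upsert_document` is an illustrative name, not an existing function.

```python
from typing import Dict, Optional, Tuple


def upsert_document(store: Dict[int, dict], row: dict) -> Tuple[int, bool]:
    """Update the document if row['id'] exists, otherwise create a new one.

    Returns (id, created).
    """
    doc_id: Optional[int] = row.get("id")
    fields = {key: value for key, value in row.items() if key != "id"}
    if doc_id is not None and doc_id in store:
        store[doc_id].update(fields)  # existing primary key: update in place
        return doc_id, False
    if doc_id is None:
        # No id supplied: allocate the next free one.
        doc_id = max(store, default=0) + 1
    store[doc_id] = fields
    return doc_id, True
```

In a Django project the same pattern is available on the ORM as `QuerySet.update_or_create(defaults=..., **kwargs)`, which also returns an `(object, created)` pair, so an importer would likely call that rather than reimplement the logic.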

If documents and annotations are stored together, it may also become easier to use existing tools like django-import-export, especially for imports via the admin interface.

4 reactions
Hironsan commented, Mar 12, 2019

I thoroughly redesigned the APIs and models and added support for labeled dataset import.

The supported task × format combinations are as follows:

Plain CSV JSON CoNLL
Text Classification ○(single label) X
Sequence Labeling X
Seq2seq X

You can check the detailed format on the upload page:

[Image: screenshot of the upload page]

This feature is not perfect; it is a first step. There are still some bugs and performance problems, so your opinions and feedback are welcome.

Thank you for your feedback and contribution.

