
Annotating an unlabeled set

Hi. Thanks for the great repo. I have a question regarding PET training and annotating an unlabeled set (as mentioned in the paper, examples from D). I assume it would be done using the command in the PET Training and Evaluation section of the repo. However, I am not sure where to put the unlabeled set or where to get the predicted labels. Could you please let me know how to get the predicted labels for the unlabeled set? Thank you.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
chris-aeviator commented, Oct 11, 2020

@timoschick

If I have labels 0 = 'bad' and 1 = 'good', I get an unlabeled_logits.txt whose first row is -1, followed by one row for each row in my unlabeled.csv file.

Is it correct that I then apply softmax to it to get a prediction for the first label, "bad" (corresponding to the first "column" in the logits file), and "good" (the second "column")?

example logits

-1
0.21161096000000001 0.3217776633333334
1.6751958333333334  -1.45424471

EDIT:

Ended up writing a conversion script (since I'm using an Airflow pipeline for the job anyway) that writes a prediction file with probabilities derived from the logits:

import torch
import pandas as pd

logits_file = '/tmp/unlabeled_logits.txt'
results = []
with open(logits_file, 'r') as fh:
    lines = fh.read().splitlines()

# the first line is a header ("-1"), not logits, so skip it
for line in lines[1:]:
    example_logits = torch.tensor([float(x) for x in line.split()])
    # softmax over the label dimension turns the logits into probabilities
    results.append(torch.softmax(example_logits, dim=0).numpy())

df = pd.DataFrame(results)
df.to_csv('/out/predictions.csv')

The output is a probability for my label bad (first column) and good (second column):

0.9937028288841248,0.006297166459262371
1 reaction
timoschick commented, Sep 14, 2020

If your verbalizer uses only the words terrible, bad, okay, good and great, then PET simply ignores the probabilities assigned to all other words. Let’s assume the model’s predictions are (in that order):

horrible # 0.30
awful    # 0.20
terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

PET basically removes all words that are not used by the verbalizer, resulting in the following reduced list:

terrible # 0.20
bad      # 0.10
... 
okay     # 0.02
good     # 0.01
great    # 0.01

So PET would assign the label corresponding to terrible to this example, even if terrible is not the word that the language model would have predicted.
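As a sketch, this filtering step can be written in a few lines of Python. The probabilities below are the illustrative numbers from the comment above, not real model output:

```python
# Hypothetical mask-position probabilities over the vocabulary
# (illustrative values taken from the example above).
vocab_probs = {
    "horrible": 0.30,
    "awful":    0.20,
    "terrible": 0.20,
    "bad":      0.10,
    "okay":     0.02,
    "good":     0.01,
    "great":    0.01,
}

# Words the verbalizer actually maps to labels.
verbalizer = ["terrible", "bad", "okay", "good", "great"]

# Keep only the verbalizer words; all other words are ignored.
reduced = {w: vocab_probs[w] for w in verbalizer}

# The predicted label is the verbalizer word with the highest
# remaining probability -- here "terrible", even though the model's
# overall top prediction was "horrible".
predicted = max(reduced, key=reduced.get)
print(predicted)  # terrible
```

In practice the reduced scores would also be renormalized (e.g. via softmax) to form a probability distribution over labels, but the argmax label is unchanged by that step.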
