Could anyone please show me the whole process to make your own dataset?

I am confused about how to build a dataset for training an ArcFace model from my own directory of face images, which is structured as follows:

.
├── PersonA
│   ├── 001.jpeg
│   └── 002.jpeg
├── PersonB
│   ├── 001.jpeg
│   └── 002.jpeg
└── PersonC
    ├── 001.jpeg
    ├── 002.jpeg
    └── 003.jpeg

3 directories, 7 files

It would help if the steps were given in detail, in the following form:

  1. … …

Much appreciation and thanks!

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Reactions: 8
  • Comments: 18

Top GitHub Comments

68 reactions
Talgin commented, Jan 27, 2020

Hi @YokkaBear, to understand how face datasets are collected and refined, I think you will need to look elsewhere, for example in the papers by the creators of the LFW and YTF datasets. The issues here are mostly about the insightface project itself rather than any particular dataset, because the authors used merged or publicly available datasets (see their paper: https://arxiv.org/pdf/1801.07698.pdf). To clear up some points, though, you can follow these steps for this project:

  1. Collect images of people, with one folder per person (the images can vary in quality, lighting conditions, camera, source, etc.);
  2. Refine the folders: delete or merge duplicate folders (folders containing images of the same person), because duplicates will hurt training accuracy;
  3. After the first two steps you should have one folder with subfolders, structured like the one you provided above;
  4. Align your dataset to an image size of 112×112: for this you can use facenet’s alignment script (https://github.com/davidsandberg/facenet/blob/master/src/align/align_dataset_mtcnn.py) or, if it doesn’t work for you, my script (https://github.com/Talgin/preparing_data/blob/master/align_dataset_mtcnn_v1.py), which is a revised version of the facenet script (some libraries were out of date, so I had to change them);
  5. Then divide your dataset into train, validation, and test sets (we used an 80%/10%/10% ratio);
  6. Create a .lst file for your dataset: you can use our script (https://github.com/Talgin/preparing_data/blob/master/insightface_pairs_gen_v1.py);
  7. Generate the .rec and .idx files using face2rec2.py (insightface/src/data/face2rec2.py);
  8. Generate pairs.txt;
  9. From pairs.txt, generate a .bin file using lfw2pack.py (https://github.com/deepinsight/insightface/blob/master/src/data/lfw2pack.py); the .bin file is needed for validation;
  10. Collect the .rec, .idx, .bin, and property files into one folder and start training.
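
To make steps 3 and 5 more concrete, here is a minimal Python sketch of walking a one-folder-per-person directory and splitting the result 80/10/10. It is not taken from the linked scripts: the function names `list_images` and `split_samples` are hypothetical, and splitting at the image level (rather than by person) is an assumption, since the steps above do not say which was used.

```python
import random
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(root):
    """Return (image_path, label) tuples, with one integer label per person folder.

    Hypothetical helper: labels are assigned in sorted folder order.
    """
    samples = []
    person_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for label, person_dir in enumerate(person_dirs):
        for image_path in sorted(person_dir.iterdir()):
            if image_path.suffix.lower() in IMAGE_SUFFIXES:
                samples.append((str(image_path), label))
    return samples


def split_samples(samples, train=0.8, val=0.1, seed=0):
    """Shuffle and split into (train, val, test); test receives the remainder.

    Assumption: the split is done per image, not per person.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Calling `split_samples(list_images("faces"))` on an aligned dataset folder (the name `faces` is a placeholder) yields three lists that can then be written out in whatever list format the .lst script in step 6 expects.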

You can also read some info on my page (I’ll rewrite/restructure it in 1-2 weeks): https://github.com/Talgin/preparing_data
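
For step 8, the idea behind a pairs file is a set of same-person and different-person image pairs for verification. The sketch below only builds such pairs in memory as `(path_a, path_b, is_same)` tuples; the function name `make_pairs` is hypothetical, and the exact text layout that lfw2pack.py expects in pairs.txt is not covered here, so check that script before writing the file.

```python
import itertools
import random
from collections import defaultdict


def make_pairs(samples, n_pairs, seed=0):
    """Build verification pairs from (image_path, label) tuples.

    Returns up to n_pairs same-person pairs followed by n_pairs
    different-person pairs, each as (path_a, path_b, is_same).
    Hypothetical helper; different-person pairs may repeat.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append(path)

    # Enumerate every same-person pair, then keep a random subset.
    positives = [
        (a, b, True)
        for paths in by_label.values()
        for a, b in itertools.combinations(sorted(paths), 2)
    ]
    rng.shuffle(positives)
    positives = positives[:n_pairs]

    # Draw different-person pairs by sampling two distinct labels.
    labels = sorted(by_label)
    negatives = []
    if len(labels) >= 2:
        for _ in range(n_pairs):
            label_a, label_b = rng.sample(labels, 2)
            negatives.append(
                (rng.choice(by_label[label_a]), rng.choice(by_label[label_b]), False)
            )
    return positives + negatives
```

Enumerating all same-person combinations is quadratic in the number of images per person, which is fine for a small validation set but may need sampling for large folders.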

P.S. The code linked above is a mix of my own publicly available code and code shared on GitHub by other people.

Good luck! 😃
