
[FEATURE] Adding FFHQ dataset

See original GitHub issue

I have the 1024 and 128 scale pngs from the FFHQ dataset. I’d like to upload this as a hub:// dataset so that you can copy it to the activeloop namespace.
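The end-user payoff would presumably be one-line streaming access, along the lines of the following (the namespace path is illustrative, not a confirmed destination):

import hub

# Stream the hosted dataset without downloading it in full first.
ds = hub.load("hub://activeloop/ffhq")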

Currently I am considering how to structure the dataset, and what splits it should be uploaded as.

Below is the schema I have used so far. It includes all of the metadata from the original dataset, including the URLs to the original files, and the pixel_md5 hashes match when looping back over the dataset and recomputing them (see the verification sketch after the schema).

import hub
import numpy as np

ds = hub.empty("./ffhq-1024", overwrite=True)

with ds:
    # Flickr metadata, carried over verbatim from the original dataset.
    ds.create_tensor("metadata/author", htype="text")
    ds.create_tensor("metadata/country", htype="text")
    ds.create_tensor("metadata/date_crawled", htype="text")
    ds.create_tensor("metadata/date_uploaded", htype="text")
    ds.create_tensor("metadata/license", htype="text")
    ds.create_tensor("metadata/license_url", htype="text")
    ds.create_tensor("metadata/photo_title", htype="text")
    ds.create_tensor("metadata/photo_url", htype="text")

    # 1024x1024 aligned images plus provenance and checksum fields.
    ds.create_tensor("images/image", htype="image", sample_compression="png")
    ds.create_tensor("images/face_landmarks", dtype=np.float32)
    ds.create_tensor("images/file_md5", htype="text")
    ds.create_tensor("images/file_path", htype="text")
    ds.create_tensor("images/file_url", htype="text")
    ds.create_tensor("images/file_size", dtype=np.int32)
    ds.create_tensor("images/pixel_md5", htype="text")

    # 128x128 thumbnails, mirroring the images group.
    ds.create_tensor("thumbs/image", htype="image", sample_compression="png")
    ds.create_tensor("thumbs/face_landmarks", dtype=np.float32)
    ds.create_tensor("thumbs/file_md5", htype="text")
    ds.create_tensor("thumbs/file_path", htype="text")
    ds.create_tensor("thumbs/file_url", htype="text")
    ds.create_tensor("thumbs/file_size", dtype=np.int32)
    ds.create_tensor("thumbs/pixel_md5", htype="text")

    # In-the-wild originals: metadata only for now (the pixels are ~900 GB;
    # see the note at the end).
    ds.create_tensor("wilds/face_landmarks", dtype=np.float32)
    ds.create_tensor("wilds/face_rect", dtype=np.float32)
    ds.create_tensor("wilds/file_md5", htype="text")
    ds.create_tensor("wilds/file_path", htype="text")
    ds.create_tensor("wilds/file_url", htype="text")
    ds.create_tensor("wilds/file_size", dtype=np.int32)
    ds.create_tensor("wilds/pixel_md5", htype="text")
    ds.create_tensor("wilds/pixel_size", dtype=np.int32)

Does this structure abide by Hub best practices?

Would it be a good idea to also upload an “ffhq-128” split without the 1024 images, and an “ffhq-meta” split without the 128 images?

For reference, here is the first sample as seen through the TensorFlow integration:

>>> next(ds.tensorflow().as_numpy_iterator())
{
  'metadata/author': array([b'Jeremy Frumkin'], dtype=object), 
  'metadata/country': array([b''], dtype=object), 
  'metadata/date_crawled': array([b'2018-10-10'], dtype=object), 
  'metadata/date_uploaded': array([b'2007-08-16'], dtype=object), 
  'metadata/license': array([b'Attribution-NonCommercial License'], dtype=object), 
  'metadata/license_url': array([b'https://creativecommons.org/licenses/by-nc/2.0/'], dtype=object), 
  'metadata/photo_title': array([b'DSCF0899.JPG'], dtype=object), 
  'metadata/photo_url': array([b'https://www.flickr.com/photos/frumkin/1133484654/'], dtype=object), 
  
  'images/image': array([[[  0, 133, 147], ..., [132, 157, 164]]], dtype=uint8), 
  'images/face_landmarks': array([[131.62, 453.8 ], ..., [521.04, 715.26]], dtype=float32), 
  'images/file_md5': array([b'ddeaeea6ce59569643715759d537fd1b'], dtype=object), 
  'images/file_path': array([b'images1024x1024/00000/00000.png'], dtype=object), 
  'images/file_size': array([1488194], dtype=int32), 
  'images/file_url': array([b'https://drive.google.com/uc?id=1xJYS4u3p0wMmDtvUE13fOkxFaUGBoH42'], dtype=object), 
  'images/pixel_md5': array([b'47238b44dfb87644460cbdcc4607e289'], dtype=object), 
  
  'thumbs/image': array([[[  0, 130, 146], ..., [134, 157, 163]]], dtype=uint8), 
  'thumbs/face_landmarks': array([[ 16.4525 ,  56.725  ], ..., [ 65.13   ,  89.4075 ]], dtype=float32), 
  'thumbs/file_md5': array([b'bd3e40b2ba20f76b55dc282907b89cd1'], dtype=object), 
  'thumbs/file_path': array([b'thumbnails128x128/00000/00000.png'], dtype=object), 
  'thumbs/file_size': array([29050], dtype=int32), 
  'thumbs/file_url': array([b'https://drive.google.com/uc?id=1fUMlLrNuh5NdcnMsOpSJpKcDfYLG6_7E'], dtype=object), 
  'thumbs/pixel_md5': array([b'38d7e93eb9a796d0e65f8c64de8ba161'], dtype=object), 
  
  'wilds/face_landmarks': array([[ 562.5,  697.5], ..., [1060.5,  996.5]], dtype=float32), 
  'wilds/face_rect': array([ 667.,  410., 1438., 1181.], dtype=float32), 
  'wilds/file_md5': array([b'1dc0287e73e485efb0516a80ce9d42b4'], dtype=object), 
  'wilds/file_path': array([b'in-the-wild-images/00000/00000.png'], dtype=object), 
  'wilds/file_size': array([3991569], dtype=int32), 
  'wilds/file_url': array([b'https://drive.google.com/uc?id=1yT9RlvypPefGnREEbuHLE6zDXEQofw-m'], dtype=object), 
  'wilds/pixel_md5': array([b'86b3470c42e33235d76b979161fb2327'], dtype=object), 
  'wilds/pixel_size': array([2016, 1512], dtype=int32)
}

The 900 GB of in-the-wild images, along with the TFRecords pre-resized to each intermediate scale, are proving harder to acquire. But just hosting the 1024-scale images would already be a huge improvement in making the dataset accessible.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 28 (14 by maintainers)

Top GitHub Comments

1 reaction
farizrahman4u commented, Jun 23, 2022

> Hi @JossWhittle. For htype = 'json', you should append data as a dict/list structure. @farizrahman4u Can you please comment on exactly how the serialization works?

@JossWhittle You can append a Python dict/list to json tensors. These are internally dumped to a JSON string (json.dumps(...)) and then encoded to bytes with UTF-8 encoding. We use a custom JSON encoder/decoder to support numpy arrays nested in dicts/lists as well.
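As a rough illustration of that scheme (a minimal sketch of the described pipeline, not Hub's actual encoder):

import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    # Stand-in for the custom encoder described above: serializes numpy
    # arrays (and scalars) nested inside dicts/lists.
    def default(self, obj):
        if isinstance(obj, np.ndarray):
            return {"__ndarray__": obj.tolist(), "dtype": str(obj.dtype)}
        if isinstance(obj, np.generic):
            return obj.item()
        return super().default(obj)

def numpy_object_hook(d):
    # Restore arrays tagged by the encoder above.
    if "__ndarray__" in d:
        return np.array(d["__ndarray__"], dtype=d["dtype"])
    return d

sample = {"author": "Jeremy Frumkin",
          "landmarks": np.zeros((68, 2), dtype=np.float32)}

# dict -> JSON string -> UTF-8 bytes, as described above.
encoded = json.dumps(sample, cls=NumpyJSONEncoder).encode("utf-8")

# bytes -> JSON string -> dict, with the array restored.
decoded = json.loads(encoded.decode("utf-8"), object_hook=numpy_object_hook)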

1 reaction
farizrahman4u commented, Jun 23, 2022

@99991 @JossWhittle We have a PR up for fixing the deepcopy issue. Here is a notebook deepcopying FFHQ: colab. These changes will be in the next release.
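Assuming the Hub 2.x hub.deepcopy and delete_tensor APIs (both assumptions, not confirmed in this thread), publishing and deriving the lighter split discussed above could then look something like:

import hub

# Publish the full dataset (destination path is illustrative).
hub.deepcopy("./ffhq-1024", "hub://activeloop/ffhq")

# One possible route to the "ffhq-128" variant discussed above:
# copy everything, then drop the 1024-scale tensors.
hub.deepcopy("./ffhq-1024", "./ffhq-128")
ds128 = hub.load("./ffhq-128")
with ds128:
    for name in list(ds128.tensors):
        if name.startswith("images/"):   # 1024-scale group
            ds128.delete_tensor(name)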

Read more comments on GitHub >

