[FEATURE] Adding FFHQ dataset
I have the 1024 and 128 scale PNGs from the FFHQ dataset. I'd like to upload this as a hub:// dataset so that you can copy it into the activeloop namespace.
I am currently considering how to structure the dataset and what splits it should be uploaded as.
Below is the schema I have used so far. It carries all of the metadata from the original dataset, including the URLs to the original files, and the pixel_md5 hashes match when looping back over the dataset and recomputing them.
```python
import hub
import numpy as np

ds = hub.empty("./ffhq-1024", overwrite=True)
with ds:
    # Flickr metadata from the original FFHQ release
    ds.create_tensor("metadata/author", htype="text")
    ds.create_tensor("metadata/country", htype="text")
    ds.create_tensor("metadata/date_crawled", htype="text")
    ds.create_tensor("metadata/date_uploaded", htype="text")
    ds.create_tensor("metadata/license", htype="text")
    ds.create_tensor("metadata/license_url", htype="text")
    ds.create_tensor("metadata/photo_title", htype="text")
    ds.create_tensor("metadata/photo_url", htype="text")

    # Aligned 1024x1024 images plus their per-file provenance
    ds.create_tensor("images/image", htype="image", sample_compression="png")
    ds.create_tensor("images/face_landmarks", dtype=np.float32)
    ds.create_tensor("images/file_md5", htype="text")
    ds.create_tensor("images/file_path", htype="text")
    ds.create_tensor("images/file_url", htype="text")
    ds.create_tensor("images/file_size", dtype=np.int32)
    ds.create_tensor("images/pixel_md5", htype="text")

    # 128x128 thumbnails
    ds.create_tensor("thumbs/image", htype="image", sample_compression="png")
    ds.create_tensor("thumbs/face_landmarks", dtype=np.float32)
    ds.create_tensor("thumbs/file_md5", htype="text")
    ds.create_tensor("thumbs/file_path", htype="text")
    ds.create_tensor("thumbs/file_url", htype="text")
    ds.create_tensor("thumbs/file_size", dtype=np.int32)
    ds.create_tensor("thumbs/pixel_md5", htype="text")

    # In-the-wild originals: metadata only, since the image files
    # themselves are not included yet (see below)
    ds.create_tensor("wilds/face_landmarks", dtype=np.float32)
    ds.create_tensor("wilds/face_rect", dtype=np.float32)
    ds.create_tensor("wilds/file_md5", htype="text")
    ds.create_tensor("wilds/file_path", htype="text")
    ds.create_tensor("wilds/file_url", htype="text")
    ds.create_tensor("wilds/file_size", dtype=np.int32)
    ds.create_tensor("wilds/pixel_md5", htype="text")
    ds.create_tensor("wilds/pixel_size", dtype=np.int32)
```
Does this structure abide by Hub best practices?
Would it also be a good idea to upload an "ffhq-128" dataset without the 1024 images, and an "ffhq-meta" dataset without the 128 images as well?
Reading the first sample back through the TensorFlow integration looks like this:

```python
>>> next(ds.tensorflow().as_numpy_iterator())
{
    'metadata/author': array([b'Jeremy Frumkin'], dtype=object),
    'metadata/country': array([b''], dtype=object),
    'metadata/date_crawled': array([b'2018-10-10'], dtype=object),
    'metadata/date_uploaded': array([b'2007-08-16'], dtype=object),
    'metadata/license': array([b'Attribution-NonCommercial License'], dtype=object),
    'metadata/license_url': array([b'https://creativecommons.org/licenses/by-nc/2.0/'], dtype=object),
    'metadata/photo_title': array([b'DSCF0899.JPG'], dtype=object),
    'metadata/photo_url': array([b'https://www.flickr.com/photos/frumkin/1133484654/'], dtype=object),
    'images/image': array([[[ 0, 133, 147], ..., [132, 157, 164]]], dtype=uint8),
    'images/face_landmarks': array([[131.62, 453.8 ], ..., [521.04, 715.26]], dtype=float32),
    'images/file_md5': array([b'ddeaeea6ce59569643715759d537fd1b'], dtype=object),
    'images/file_path': array([b'images1024x1024/00000/00000.png'], dtype=object),
    'images/file_size': array([1488194], dtype=int32),
    'images/file_url': array([b'https://drive.google.com/uc?id=1xJYS4u3p0wMmDtvUE13fOkxFaUGBoH42'], dtype=object),
    'images/pixel_md5': array([b'47238b44dfb87644460cbdcc4607e289'], dtype=object),
    'thumbs/image': array([[[ 0, 130, 146], ..., [134, 157, 163]]], dtype=uint8),
    'thumbs/face_landmarks': array([[ 16.4525 , 56.725 ], ..., [ 65.13 , 89.4075 ]], dtype=float32),
    'thumbs/file_md5': array([b'bd3e40b2ba20f76b55dc282907b89cd1'], dtype=object),
    'thumbs/file_path': array([b'thumbnails128x128/00000/00000.png'], dtype=object),
    'thumbs/file_size': array([29050], dtype=int32),
    'thumbs/file_url': array([b'https://drive.google.com/uc?id=1fUMlLrNuh5NdcnMsOpSJpKcDfYLG6_7E'], dtype=object),
    'thumbs/pixel_md5': array([b'38d7e93eb9a796d0e65f8c64de8ba161'], dtype=object),
    'wilds/face_landmarks': array([[ 562.5, 697.5], ..., [1060.5, 996.5]], dtype=float32),
    'wilds/face_rect': array([ 667., 410., 1438., 1181.], dtype=float32),
    'wilds/file_md5': array([b'1dc0287e73e485efb0516a80ce9d42b4'], dtype=object),
    'wilds/file_path': array([b'in-the-wild-images/00000/00000.png'], dtype=object),
    'wilds/file_size': array([3991569], dtype=int32),
    'wilds/file_url': array([b'https://drive.google.com/uc?id=1yT9RlvypPefGnREEbuHLE6zDXEQofw-m'], dtype=object),
    'wilds/pixel_md5': array([b'86b3470c42e33235d76b979161fb2327'], dtype=object),
    'wilds/pixel_size': array([2016, 1512], dtype=int32)
}
```
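Once copied into the activeloop namespace, pulling the dataset should look roughly like the following sketch (the `hub://activeloop/ffhq-1024` path is an assumption until the dataset is actually published):

```python
import hub

# Hypothetical path, assuming the copy lands under the activeloop org
ds = hub.load("hub://activeloop/ffhq-1024")
print(ds["metadata/photo_title"][0].numpy())  # e.g. 'DSCF0899.JPG'
```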
The 900 GB of in-the-wild images, along with the TFRecords pre-resized to each intermediate scale, are proving harder to acquire. But just hosting the 1024-scale images would already be a huge improvement in making the dataset accessible.
@JossWhittle You can append python dict/list to json tensors. These are internally dumped to a JSON string (`json.dumps(...)`) and then encoded to bytes with `utf-8` encoding. We use a custom json encoder/decoder to support numpy arrays nested in dicts/lists as well.

@99991 @JossWhittle We have a PR up for fixing the deepcopy issue. Here is a notebook deepcopying ffhq: colab. These changes will be in the next release.
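A minimal sketch of that json-tensor behaviour, assuming hub's `json` htype works as described above (the tensor name and sample contents here are illustrative, and the read-back call may differ between hub versions):

```python
import hub
import numpy as np

ds = hub.empty("./json-demo", overwrite=True)
ds.create_tensor("meta", htype="json")

# Nested dicts/lists -- including numpy arrays -- pass through the custom
# JSON encoder described above before being stored as utf-8 encoded bytes.
ds.meta.append({
    "author": "Jeremy Frumkin",
    "face_landmarks": np.zeros((68, 2), dtype=np.float32),
})

print(ds.meta[0].data())  # round-trips back to a python dict
```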