[FEATURE] Adding FFHQ dataset
I have the 1024 and 128 scale PNGs from the FFHQ dataset. I'd like to upload this as a hub:// dataset so that you can copy it into the activeloop namespace.
I am currently considering how to structure the dataset and what splits it should be uploaded as.
Below is the schema I have used so far. It carries all of the metadata from the original dataset, including the URLs to the original files, and the pixel_md5 hashes match when looping back over the dataset and recomputing them.
```python
import hub
import numpy as np

ds = hub.empty("./ffhq-1024", overwrite=True)
with ds:
    # Flickr metadata from the original FFHQ release
    ds.create_tensor("metadata/author", htype="text")
    ds.create_tensor("metadata/country", htype="text")
    ds.create_tensor("metadata/date_crawled", htype="text")
    ds.create_tensor("metadata/date_uploaded", htype="text")
    ds.create_tensor("metadata/license", htype="text")
    ds.create_tensor("metadata/license_url", htype="text")
    ds.create_tensor("metadata/photo_title", htype="text")
    ds.create_tensor("metadata/photo_url", htype="text")

    # Aligned 1024x1024 images plus their per-file provenance
    ds.create_tensor("images/image", htype="image", sample_compression="png")
    ds.create_tensor("images/face_landmarks", dtype=np.float32)
    ds.create_tensor("images/file_md5", htype="text")
    ds.create_tensor("images/file_path", htype="text")
    ds.create_tensor("images/file_url", htype="text")
    ds.create_tensor("images/file_size", dtype=np.int32)
    ds.create_tensor("images/pixel_md5", htype="text")

    # 128x128 thumbnails
    ds.create_tensor("thumbs/image", htype="image", sample_compression="png")
    ds.create_tensor("thumbs/face_landmarks", dtype=np.float32)
    ds.create_tensor("thumbs/file_md5", htype="text")
    ds.create_tensor("thumbs/file_path", htype="text")
    ds.create_tensor("thumbs/file_url", htype="text")
    ds.create_tensor("thumbs/file_size", dtype=np.int32)
    ds.create_tensor("thumbs/pixel_md5", htype="text")

    # In-the-wild originals: metadata only, since the image files
    # themselves are not included yet (see below)
    ds.create_tensor("wilds/face_landmarks", dtype=np.float32)
    ds.create_tensor("wilds/face_rect", dtype=np.float32)
    ds.create_tensor("wilds/file_md5", htype="text")
    ds.create_tensor("wilds/file_path", htype="text")
    ds.create_tensor("wilds/file_url", htype="text")
    ds.create_tensor("wilds/file_size", dtype=np.int32)
    ds.create_tensor("wilds/pixel_md5", htype="text")
    ds.create_tensor("wilds/pixel_size", dtype=np.int32)
```
Does this structure abide by Hub best practices?
Would it also be a good idea to upload an "ffhq-128" dataset without the 1024 images, and an "ffhq-meta" dataset without the 128 images as well?
Reading the first sample back through the TensorFlow integration looks like this:

```python
>>> next(ds.tensorflow().as_numpy_iterator())
{
    'metadata/author': array([b'Jeremy Frumkin'], dtype=object),
    'metadata/country': array([b''], dtype=object),
    'metadata/date_crawled': array([b'2018-10-10'], dtype=object),
    'metadata/date_uploaded': array([b'2007-08-16'], dtype=object),
    'metadata/license': array([b'Attribution-NonCommercial License'], dtype=object),
    'metadata/license_url': array([b'https://creativecommons.org/licenses/by-nc/2.0/'], dtype=object),
    'metadata/photo_title': array([b'DSCF0899.JPG'], dtype=object),
    'metadata/photo_url': array([b'https://www.flickr.com/photos/frumkin/1133484654/'], dtype=object),
    'images/image': array([[[ 0, 133, 147], ..., [132, 157, 164]]], dtype=uint8),
    'images/face_landmarks': array([[131.62, 453.8 ], ..., [521.04, 715.26]], dtype=float32),
    'images/file_md5': array([b'ddeaeea6ce59569643715759d537fd1b'], dtype=object),
    'images/file_path': array([b'images1024x1024/00000/00000.png'], dtype=object),
    'images/file_size': array([1488194], dtype=int32),
    'images/file_url': array([b'https://drive.google.com/uc?id=1xJYS4u3p0wMmDtvUE13fOkxFaUGBoH42'], dtype=object),
    'images/pixel_md5': array([b'47238b44dfb87644460cbdcc4607e289'], dtype=object),
    'thumbs/image': array([[[ 0, 130, 146], ..., [134, 157, 163]]], dtype=uint8),
    'thumbs/face_landmarks': array([[ 16.4525 , 56.725 ], ..., [ 65.13 , 89.4075 ]], dtype=float32),
    'thumbs/file_md5': array([b'bd3e40b2ba20f76b55dc282907b89cd1'], dtype=object),
    'thumbs/file_path': array([b'thumbnails128x128/00000/00000.png'], dtype=object),
    'thumbs/file_size': array([29050], dtype=int32),
    'thumbs/file_url': array([b'https://drive.google.com/uc?id=1fUMlLrNuh5NdcnMsOpSJpKcDfYLG6_7E'], dtype=object),
    'thumbs/pixel_md5': array([b'38d7e93eb9a796d0e65f8c64de8ba161'], dtype=object),
    'wilds/face_landmarks': array([[ 562.5, 697.5], ..., [1060.5, 996.5]], dtype=float32),
    'wilds/face_rect': array([ 667., 410., 1438., 1181.], dtype=float32),
    'wilds/file_md5': array([b'1dc0287e73e485efb0516a80ce9d42b4'], dtype=object),
    'wilds/file_path': array([b'in-the-wild-images/00000/00000.png'], dtype=object),
    'wilds/file_size': array([3991569], dtype=int32),
    'wilds/file_url': array([b'https://drive.google.com/uc?id=1yT9RlvypPefGnREEbuHLE6zDXEQofw-m'], dtype=object),
    'wilds/pixel_md5': array([b'86b3470c42e33235d76b979161fb2327'], dtype=object),
    'wilds/pixel_size': array([2016, 1512], dtype=int32)
}
```
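Once copied into the activeloop namespace, pulling the dataset should look roughly like the following sketch (the `hub://activeloop/ffhq-1024` path is an assumption until the dataset is actually published):

```python
import hub

# Hypothetical path, assuming the copy lands under the activeloop org
ds = hub.load("hub://activeloop/ffhq-1024")
print(ds["metadata/photo_title"][0].numpy())  # e.g. 'DSCF0899.JPG'
```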
The 900 GB of in-the-wild images, along with the TFRecords pre-resized to each intermediate scale, are proving harder to acquire. But just hosting the 1024-scale images would already be a huge improvement in making the dataset accessible.
@JossWhittle You can append python dict/list to json tensors. These are internally dumped to a JSON string (`json.dumps(...)`) and then encoded to bytes with `utf-8` encoding. We use a custom json encoder/decoder to support numpy arrays nested in dicts/lists as well.

@99991 @JossWhittle We have a PR up for fixing the deepcopy issue. Here is a notebook deepcopying ffhq: colab. These changes will be in the next release.
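A minimal sketch of that json-tensor behaviour, assuming hub's `json` htype works as described above (the tensor name and sample contents here are illustrative, and the read-back call may differ between hub versions):

```python
import hub
import numpy as np

ds = hub.empty("./json-demo", overwrite=True)
ds.create_tensor("meta", htype="json")

# Nested dicts/lists -- including numpy arrays -- pass through the custom
# JSON encoder described above before being stored as utf-8 encoded bytes.
ds.meta.append({
    "author": "Jeremy Frumkin",
    "face_landmarks": np.zeros((68, 2), dtype=np.float32),
})

print(ds.meta[0].data())  # round-trips back to a python dict
```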