Anyone able to load TFRecords generated with the spark-to-tf-records connector into TFRS?
Has anyone seen this kind of error when trying to load TFRecords generated from Spark by the spark-tensorflow-connector or LinkedIn's spark-tfrecord library?

Error when deserializing the TFRecords in TF 2.x: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices

I've filed tickets with details in both projects:
- https://github.com/linkedin/spark-tfrecord/issues/19
- https://github.com/tensorflow/ecosystem/issues/178
Really just doing a simple thing, using the small MovieLens dataset:
# Writing with the spark-tensorflow-connector (tensorflow/ecosystem)
movies_df.write.format("tfrecords").mode("overwrite").save(tf_movies_dir)
ratings_df.write.format("tfrecords").mode("overwrite").save(tf_ratings_dir)

# Alternatively, writing with LinkedIn's spark-tfrecord
movies_df.write.format("tfrecord").mode("overwrite").option("recordType", "Example").save(tf_movies_dir)
ratings_df.write.format("tfrecord").mode("overwrite").option("recordType", "Example").save(tf_ratings_dir)
import os

import boto3
import tensorflow as tf

# List the part files written by Spark and load them as raw TFRecord datasets.
s3 = boto3.resource("s3", verify=False)
bucket = s3.Bucket("mybucket")

filenames = []
for object_summary in bucket.objects.filter(
    Prefix="emr/spark_apps/myapp/movielens-100k-conversion/movies-0001/part"
):
    filenames.append(os.path.join("s3://audiomack-master-airflow/", object_summary.key))
movies_dataset = tf.data.TFRecordDataset(filenames)

filenames = []
for object_summary in bucket.objects.filter(
    Prefix="emr/spark_apps/myapp/movielens-100k-conversion/ratings-0001/part"
):
    filenames.append(os.path.join("s3://audiomack-master-airflow/", object_summary.key))
ratings_dataset = tf.data.TFRecordDataset(filenames)
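To see what Spark actually wrote, one record can be decoded by hand. A quick sketch, continuing from the snippet above (nothing connector-specific here):

# Decode one raw record as a tf.train.Example proto to inspect the
# feature names and dtypes Spark wrote.
for raw_record in ratings_dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(example)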

Have you looked at the docs for reading TFRecord files containing tf.train.Examples? It looks like you're skipping the deserialization step (converting the serialized tf.train.Example protos to dictionaries of tensors).

@dgoldenberg-audiomack Yeah, I will do. My first guess was it was how it was being written.
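For reference, the deserialization step mentioned above looks roughly like this. The feature names and dtypes below are only a guess at the MovieLens ratings columns and need to be adjusted to whatever schema Spark actually wrote (printing a decoded tf.train.Example, as above, shows the real one):

# Hypothetical feature spec: adjust names/dtypes to the columns Spark wrote.
feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "movie_id": tf.io.FixedLenFeature([], tf.int64),
    "rating": tf.io.FixedLenFeature([], tf.float32),
}

def parse_example(serialized):
    # Convert one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized, feature_spec)

# Map the parser over the raw dataset so downstream code (e.g. TFRS) sees
# dictionaries of tensors instead of serialized protos.
ratings_dataset = ratings_dataset.map(parse_example)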