Anyone able to load tfrecords into TFRS generated with the spark-to-tf-records connector?

Has anyone seen this kind of error when trying to load TFRecords generated from Spark by the spark-tensorflow-connector or LinkedIn’s Spark-TFRecord library?

Error when deserializing TFRecords in TF 2.x: Only integers, slices (:), ellipsis (...), tf.newaxis (None) and scalar tf.int32/tf.int64 tensors are valid indices

I filed tickets with both projects, with details.

I'm really just doing something simple, using the small MovieLens dataset:

    # Writing with the spark-tensorflow-connector
    movies_df.write.format("tfrecords").mode("overwrite").save(tf_movies_dir)
    ratings_df.write.format("tfrecords").mode("overwrite").save(tf_ratings_dir)

    # Alternatively, writing with Spark-TFRecord
    movies_df.write.format("tfrecord").mode("overwrite").option("recordType", "Example").save(tf_movies_dir)
    ratings_df.write.format("tfrecord").mode("overwrite").option("recordType", "Example").save(tf_ratings_dir)

    # Reading side (Python, TF 2.x)
    import boto3
    import tensorflow as tf

    s3 = boto3.resource("s3", verify=False)
    bucket = s3.Bucket("mybucket")

    # Collect the S3 URIs of the part files Spark wrote for the movies.
    filenames = []
    for object_summary in bucket.objects.filter(
            Prefix="emr/spark_apps/myapp/movielens-100k-conversion/movies-0001/part"
    ):
        filenames.append(f"s3://{bucket.name}/{object_summary.key}")
    movies_dataset = tf.data.TFRecordDataset(filenames)

    # Same for the ratings.
    filenames = []
    for object_summary in bucket.objects.filter(
            Prefix="emr/spark_apps/myapp/movielens-100k-conversion/ratings-0001/part"
    ):
        filenames.append(f"s3://{bucket.name}/{object_summary.key}")
    ratings_dataset = tf.data.TFRecordDataset(filenames)

Issue Analytics

  • State: open
  • Created 3 years ago
  • Comments: 15

Top GitHub Comments

maciejkula commented, Dec 30, 2020 (1 reaction)

Have you looked at the docs for reading TFRecord files containing tf.train.Examples?

It looks like you’re skipping the deserialization step (converting the serialized tf.train.Example protos to dictionaries of tensors).
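A minimal sketch of that deserialization step, assuming the records are `tf.train.Example` protos. The feature names and dtypes in `RATINGS_FEATURES` (`user_id`, `movie_id`, `rating`) are hypothetical and would have to match the columns of the Spark DataFrame; the local sample file only stands in for the part files on S3 so the snippet runs on its own.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical feature spec: names and dtypes are assumptions and must match
# the columns that Spark actually wrote.
RATINGS_FEATURES = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "movie_id": tf.io.FixedLenFeature([], tf.int64),
    "rating": tf.io.FixedLenFeature([], tf.float32),
}


def parse_rating(serialized):
    """Turn one serialized tf.train.Example into a dict of tensors."""
    return tf.io.parse_single_example(serialized, RATINGS_FEATURES)


def write_sample_file(path):
    """Write two Example records locally, standing in for the Spark output."""
    rows = [(1, 10, 4.0), (2, 20, 3.5)]
    with tf.io.TFRecordWriter(path) as writer:
        for user_id, movie_id, rating in rows:
            feature = {
                "user_id": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[user_id])),
                "movie_id": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[movie_id])),
                "rating": tf.train.Feature(
                    float_list=tf.train.FloatList(value=[rating])),
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


sample_path = os.path.join(tempfile.mkdtemp(), "part-00000.tfrecord")
write_sample_file(sample_path)

# TFRecordDataset yields scalar string tensors (the serialized protos).
raw_dataset = tf.data.TFRecordDataset([sample_path])
# After map(), every element is a dict keyed by feature name.
ratings = raw_dataset.map(parse_rating)

parsed_rows = [
    {name: value.numpy() for name, value in row.items()} for row in ratings
]
```

Without the `map` call, each dataset element is a single string tensor, so indexing it by feature name (as a TFRS model typically does) would plausibly raise the "valid indices" error quoted in the question.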

Data-Jack commented, Apr 26, 2021 (0 reactions)

@dgoldenberg-audiomack Yeah, will do. My first guess was that the problem was in how the records were being written.
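One way to check how the records were written (a sketch that is not from the thread) is to decode a single raw record as a `tf.train.Example` and list which value type each feature uses. The helper name `describe_example` and the sample feature names are hypothetical; the sample proto is built in memory so the snippet runs without Spark or S3.

```python
import tensorflow as tf


def describe_example(serialized_bytes):
    """Map each feature name in a serialized Example to its value type."""
    example = tf.train.Example.FromString(serialized_bytes)
    summary = {}
    for name, feature in example.features.feature.items():
        # "kind" is the oneof field of tf.train.Feature:
        # bytes_list, float_list or int64_list.
        summary[name] = feature.WhichOneof("kind")
    return summary


# Hypothetical record, standing in for one written by Spark.
sample = tf.train.Example(features=tf.train.Features(feature={
    "movie_id": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[10])),
    "title": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"Toy Story (1995)"])),
}))

kinds = describe_example(sample.SerializeToString())
```

Against real files, the same helper could be fed `raw.numpy()` for a record taken from `tf.data.TFRecordDataset(filenames).take(1)`; the resulting names and types are what a feature spec for parsing would need to match.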
