[DataPipe] Ensure all DataPipes Meet Testing Requirements


🚀 Feature

We have many tests for existing DataPipes (both in PyTorch Core and TorchData). However, over time, they have become less organized. Moreover, as the testing requirements expand, older DataPipes may not have tests to cover the newly added requirements.

This issue aims to track the status of tests for all DataPipes.

Motivation

We want to ensure that test coverage for all DataPipes is complete, to reduce bugs and unexpected behavior.

Alternative

We should also create testing templates for IterDataPipe and MapDataPipe that can be applied widely.

IterDataPipe Tracker

Legend: X - Done; NA - Not Applicable; Blank - Not Done/Unclear

Test definitions:

  • Functional - unit test to ensure that the DataPipe works properly with various input arguments
  • Reset - the DataPipe can be reset/restarted after being read
  • __len__ - the __len__ method is implemented whenever possible (or explicitly not implemented)
  • Serializable - the DataPipe is serializable
  • Graph (future) - the DataPipe can be traversed as part of a DataPipe graph
  • Snapshot (future) - the DataPipe can be saved/loaded as a checkpoint/snapshot
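The four non-future checks defined above could be bundled into one reusable helper. The following is only a minimal sketch, not the project's actual test code: `check_datapipe` and its `has_len` parameter are hypothetical names, and a plain `range` and a generator stand in for real DataPipes so that the example runs with the standard library alone.

```python
import pickle


def check_datapipe(dp, expected, has_len=True):
    """Run the four checks and return the names of those that failed.

    Hypothetical helper; `dp` is any iterable standing in for an IterDataPipe.
    """
    failures = []

    # Functional: the pipe yields the expected elements.
    if list(dp) != expected:
        failures.append("functional")

    # Reset: after a partial read, iterating again starts from the beginning.
    it = iter(dp)
    next(it, None)
    if list(dp) != expected:
        failures.append("reset")

    # __len__: matches the element count, or raises TypeError when the
    # pipe is expected not to implement it.
    try:
        length_ok = len(dp) == len(expected) and has_len
    except TypeError:
        length_ok = not has_len
    if not length_ok:
        failures.append("__len__")

    # Serializable: a pickle round trip preserves the output.
    try:
        restored = pickle.loads(pickle.dumps(dp))
        if list(restored) != expected:
            failures.append("serializable")
    except Exception:
        failures.append("serializable")

    return failures


# A range is re-iterable, sized, and picklable, so every check passes.
ok_failures = check_datapipe(range(4), [0, 1, 2, 3])

# A generator can be read only once, has no length, and cannot be pickled.
bad_failures = check_datapipe((i for i in range(4)), [0, 1, 2, 3])

print(ok_failures)
print(bad_failures)
```

Returning the failed check names rather than raising on the first failure makes it easy to fill in one tracker row per DataPipe from a single call.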

Name Module Functional Test Reset __len__ Serializable (Picklable) Graph Snapshot
Batcher Core X X X X
Collator Core X X X X
Concater Core X X X X
Demultiplexer Core X X X X
FileLister Core X X X X
FileOpener Core X X X X
Filter Core X X X X
Forker Core X X X X
Grouper Core X X X X
IterableWrapper Core X X X X
Mapper Core X X X X
Multiplexer Core X X X X
RoutedDecoder Core X X X X
Sampler Core X X X X
Shuffler Core X X X X
StreamReader Core X X X X
UnBatcher Core X X X X
Zipper Core X X X X
BucketBatcher Data X X X X
CSVDictParser Data X X X X
CSVParser Data X X X X
Cycler Data X X X X
DataFrameMaker Data X X X X
Decompressor Data X X X X
Enumerator Data X X X X
FlatMapper Data X X X X
FSSpecFileLister Data X X X X
FSSpecFileOpener Data X X X X
FSSpecSaver Data X X X X
GDriveReader Data X X X X
HashChecker Data X X X X
Header Data X X X X
HttpReader Data X X X X
InMemoryCacheHolder Data X X X X
IndexAdder Data X X X X
IoPathFileLister Data X X X X
IoPathFileOpener Data X X X X
IoPathSaver Data X X X X
IterKeyZipper Data X X X X
JsonParser Data X X X X
LineReader Data X X X X
MapKeyZipper Data X X X X
OnDiskCacheHolder Data X X X X
OnlineReader Data X X X X
ParagraphAggregator Data X X X X
ParquetDataFrameLoader Data X X X X
RarArchiveLoader Data X X X X
Rows2Columnar Data X X X X
SampleMultiplexer Data X X X X
Saver Data X X X X
TarArchiveLoader Data X X X X
UnZipper Data X X X X
XzFileLoader Data X X X X
ZipArchiveLoader Data X X X X

MapDataPipe Tracker

Legend: X - Done; NA - Not Applicable; Blank - Not Done/Unclear

Name Module Functional Test __len__ Serializable (Picklable) Graph Snapshot
Batcher Core X X
Concater Core X X
Mapper Core X X X
SequenceWrapper Core X X X
Shuffler Core X X
Zipper Core X X

cc: @ejguan @VitalyFedyunin @NivekT

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 2
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
NivekT commented, Apr 28, 2022

> When we have time, we might need to go over our DataPipes again to identify any missing tests, since a few DataPipes were implemented recently.
>
> Besides, for future reference, we might need to improve our testing framework to something similar to OpInfo in PyTorch Core, so that test coverage runs automatically without us going over each test ourselves.

Agreed that the OpInfo-like way is probably the best. I think our inputs and the necessary setup for each test are a bit all over the place. Having tests split between two repos doesn't help either.

1 reaction
ejguan commented, Apr 28, 2022

When we have time, we might need to go over our DataPipes again to identify any missing tests, since a few DataPipes were implemented recently.

Besides, for future reference, we might need to improve our testing framework to something similar to OpInfo in PyTorch Core, so that test coverage runs automatically without us going over each test ourselves.
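To make the OpInfo-like idea concrete, one possible shape is a registry where each DataPipe is described once and every generic check runs against every entry. This is only an illustrative sketch and not the OpInfo API or any existing torchdata code: `DataPipeInfo`, `REGISTRY`, `check_functional`, `check_len`, and `run_all` are all hypothetical names, and the entries use a `range` and a generator as stand-ins for real DataPipes.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class DataPipeInfo:
    """Hypothetical record describing how to build and check one DataPipe."""
    name: str
    make: Callable[[], Any]  # builds a fresh instance for each check
    expected: List[Any]      # elements the instance should yield
    has_len: bool = True     # whether __len__ is expected to work


# Stand-in entries; real ones would construct core or torchdata DataPipes.
REGISTRY = [
    DataPipeInfo("range_like", lambda: range(3), [0, 1, 2]),
    DataPipeInfo("generator_like", lambda: (i for i in range(3)), [0, 1, 2], has_len=False),
]


def check_functional(info):
    assert list(info.make()) == info.expected


def check_len(info):
    dp = info.make()
    if info.has_len:
        assert len(dp) == len(info.expected)
        return
    try:
        len(dp)
    except TypeError:
        return  # explicitly not implemented, as declared
    raise AssertionError(f"{info.name}: __len__ should not be implemented")


GENERIC_CHECKS = [check_functional, check_len]


def run_all(registry):
    """Run every generic check on every entry; return {(name, check): passed}."""
    results = {}
    for info in registry:
        for check in GENERIC_CHECKS:
            try:
                check(info)
                results[(info.name, check.__name__)] = True
            except AssertionError:
                results[(info.name, check.__name__)] = False
    return results


results = run_all(REGISTRY)
for key, passed in results.items():
    print(key, "ok" if passed else "FAILED")
```

With this layout, adding a new DataPipe means adding one registry entry, and adding a new requirement means adding one generic check, so coverage no longer depends on someone reviewing each test by hand.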


