
Create a guide on how to write tests for nodes and pipelines


Description

Our users have expressed a need to learn how to write tests for their nodes and pipelines. We encourage learning about software-engineering best practices and want to include a guide for this in our documentation.

Possible Implementation

This guide must focus on explaining:

  • How kedro test works
  • How to write a test for a node, using an example
  • How to write a test for a pipeline, using an example

This guide can go in our Development chapter. Remember to follow our guidelines for contributing to the documentation.

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Reactions: 4
  • Comments: 9 (8 by maintainers)

Top GitHub Comments

9 reactions
daBlesr commented, Jun 29, 2022

I’m not able to access those links, but in any case, in my experience writing tests for pipelines comes into its full power when doing operations on Spark DataFrames. Usually, for pipelines, one test with data suffices. Here is a suggestion for what this page could look like, using Spark DataFrames:

Writing Tests for Nodes and Pipelines

In this section we introduce the way you write unit tests and integration tests for nodes and pipelines respectively. Each node in a pipeline should have its own (parametrised) unit tests. Tests for pipelines, on the other hand, should cover a number of sequential nodes: the input of the first node is run through the pipeline and the result is checked against the expected output of the last node. Imagine the following nodes, where we compute the total equity of a store’s inventory:

# pipelines/inventory/nodes.py
from pyspark.sql import DataFrame
from pyspark.sql.functions import col


def process_total_price_per_product(inventory: DataFrame) -> DataFrame:
    # Total price per product = units in store * unit price.
    return (
        inventory
            .withColumn("total_price", col("in_store") * col("price"))
            .select("name", "total_price")
    )


def process_total_equity(total_price_per_product: DataFrame) -> float:
    # Global sum over total_price; first() returns the single result row
    # and [0] takes its only column. Falls back to 0.0 for an empty input,
    # where the sum is NULL/None.
    return (
        total_price_per_product
            .groupBy()
            .sum("total_price")
            .first()[0]
    ) or 0.0

The pipeline:

# pipelines/inventory/pipeline.py
from kedro.pipeline import Pipeline, node, pipeline

from .nodes import process_total_equity, process_total_price_per_product

def create_pipeline() -> Pipeline:
    return pipeline([
        node(process_total_price_per_product, inputs="inventory", outputs="total_price_per_product"),
        node(process_total_equity, inputs="total_price_per_product", outputs="total_equity"),
    ])
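Note that inventory is the pipeline’s only free input and total_equity its final free output; the integration test at the end of this section relies on exactly that.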

Unit Tests

We begin by writing the first unit test, spinning up two sample datasets stored as CSV files: inventory

name,price,in_store
apple,1.0,2
pear,1.5,5
banana,2.0,3

and total_price_per_product

name,total_price
apple,2.0
pear,7.5
banana,6.0
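
Summing the total_price column gives 2.0 + 7.5 + 6.0 = 15.5; this is the total-equity value that the tests below assert.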

Since we need to read these CSV datasets into Spark DataFrames, we construct a SparkSession fixture in the root of our testing package:

# tests/conftest.py
import pytest
from pyspark.sql import SparkSession


@pytest.fixture
def spark() -> SparkSession:
    # A local SparkSession shared by the tests in this package.
    return (
        SparkSession.builder.master("local[*]")
            .appName("local-tests")
            .enableHiveSupport()
            .getOrCreate()
    )
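
A small refinement worth considering (not in the original comment): declare the fixture with @pytest.fixture(scope="session") so the relatively expensive SparkSession is created once per test run rather than once per test; getOrCreate() makes this safe.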

Now, we can create fixtures for loading these datasets:

# tests/pipelines/inventory/conftest.py
import pytest

# Paths to the sample CSV files shown above (adjust to your test-data layout).
test_data_inventory_path = "tests/data/inventory.csv"
test_data_tpp_path = "tests/data/total_price_per_product.csv"

@pytest.fixture
def inventory(spark):
    return spark.read.csv(test_data_inventory_path, header=True, inferSchema=True)

@pytest.fixture
def total_price_per_product(spark):
    return spark.read.csv(test_data_tpp_path, header=True, inferSchema=True)
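
If you’d rather not keep CSV files next to the test suite, a minimal alternative sketch (assuming the same column names as above) builds the inventory DataFrame inline:

# tests/pipelines/inventory/conftest.py (alternative: no CSV files)
import pytest

@pytest.fixture
def inventory(spark):
    # Same rows as the inventory CSV, built directly in memory.
    return spark.createDataFrame(
        [("apple", 1.0, 2), ("pear", 1.5, 5), ("banana", 2.0, 3)],
        ["name", "price", "in_store"],
    )

This keeps the test data visible right next to the test code, at the cost of slightly longer fixtures.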

In test_nodes.py we use the above fixtures to test our Nodes:

# tests/pipelines/inventory/test_nodes.py
# Adjust the import root below to match your project's package name.
from pipelines.inventory.nodes import process_total_equity, process_total_price_per_product


class TestInventoryNodes:

    def test_total_price_per_product(self, spark, inventory, total_price_per_product):
        processed_total_prices = process_total_price_per_product(inventory)

        # The node adds the total_price column and produces the expected rows.
        assert "total_price" in processed_total_prices.columns
        assert processed_total_prices.collect() == total_price_per_product.collect()

    def test_total_equity(self, spark, total_price_per_product):
        total_equity = process_total_equity(total_price_per_product)

        # 2.0 + 7.5 + 6.0 from the sample data.
        assert total_equity == 15.5
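
Since the intro above calls for parametrised unit tests, here is one possible sketch (the rows and expected values are illustrative) that also exercises the or 0.0 fallback of process_total_equity on an empty DataFrame:

# tests/pipelines/inventory/test_nodes.py (parametrised edge cases)
import pytest

@pytest.mark.parametrize("rows, expected", [
    ([], 0.0),                               # empty input: the sum is NULL, so the fallback kicks in
    ([("apple", 2.0)], 2.0),                 # a single product
    ([("apple", 2.0), ("pear", 7.5)], 9.5),  # several products
])
def test_total_equity_parametrised(spark, rows, expected):
    df = spark.createDataFrame(rows, schema="name string, total_price double")
    assert process_total_equity(df) == expected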

Integration Test

As we have unit tests for each individual node in our Inventory pipeline, we would like to create an integration test for the pipeline as a whole. Again, we start by creating a number of fixtures in the root of the testing package:

# tests/conftest.py
import pytest
from kedro.io import DataCatalog
from kedro.runner import SequentialRunner


@pytest.fixture
def seq_runner():
    return SequentialRunner()


@pytest.fixture
def catalog():
    return DataCatalog()

The catalog fixture guarantees that we are working with a clean instance of the DataCatalog in each test. Finally, we can write an integration test for the Inventory pipeline:

# tests/pipelines/inventory/test_pipeline.py
from kedro.io import MemoryDataSet

# Adjust the import root below to match your project's package name.
from pipelines.inventory.pipeline import create_pipeline


class TestInventoryPipeline:

    def test_pipeline(
            self,
            spark,       # SparkSession
            seq_runner,  # fresh SequentialRunner
            catalog,     # fresh, empty DataCatalog
            inventory,   # inventory DataFrame fixture
    ):
        # Load the empty catalog with the pipeline's free input.
        catalog.add("inventory", MemoryDataSet(inventory))

        # Slice the pipeline from its free input to the output under test.
        pipeline = create_pipeline().from_inputs("inventory").to_outputs("total_equity")

        # Run the slice; the runner returns the free (unregistered) outputs.
        output = seq_runner.run(pipeline, catalog)

        assert output["total_equity"] == 15.5

Make sure to keep the from_inputs(..) and to_outputs(..) calls on the pipeline returned by create_pipeline: if you add nodes to the pipeline later, the test still runs exactly this slice, so the integration test keeps working.
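
With these files in place, the whole suite runs under plain pytest. In a Kedro project you could also invoke it via kedro test, the command the issue description asks the guide to explain, which at the time of this issue was essentially a thin wrapper around pytest.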


Let me know what you guys think of this setup 😄

1 reaction
datajoely commented, Jul 25, 2022

This is actively being worked on 😃
