
Setting data path using PipelineParameter within OutputFileDatasetConfig yields unusable path

See original GitHub issue
  • Package Name: azureml-core
  • Package Version: 1.39.0
  • Operating System: Windows 11
  • Python Version: 3.7.12

Describe the bug
When using a PipelineParameter to handle the output data path within OutputFileDatasetConfig, the pipeline job completes successfully but creates an unusable path on the Datastore file system.

To Reproduce

# azureml imports (workspace, dataset_name, today, DATASET_REFERENCE_BY_NAME, and
# create_tabular_dataset_from_datastore are defined elsewhere)
from azureml.core import Datastore, Experiment
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

### Pipeline Parameters ###

# Create pipeline parameter for input dataset name
input_dataset_name_pipeline_param = PipelineParameter(
    name="input_dataset_name",
    default_value=dataset_name
)

# Create dataset output path from pipeline param so it can be changed at runtime
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value=''
)

### Create Inputs and Outputs ###

# Get datastore name, data source path, and its expected schema from mapped values using the colloquial dataset name
datastore_name = DATASET_REFERENCE_BY_NAME[dataset_name].datastore   # returns a Datastore object for the input data
data_path = DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path  # returns path to the input data

# Create dataset object using the datastore name and data source path
input_dataset = create_tabular_dataset_from_datastore(workspace, datastore_name, data_path)  # returns a Dataset object

# Create input tabular 
tabular_ds_consumption = DatasetConsumptionConfig(
    name="input_tabular_dataset", # name to use to access dataset within Run context
    dataset=input_dataset
)

output_datastore = Datastore(workspace, name='datastore_name')

# Create output dataset 
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(output_datastore, data_output_path_pipeline_param)
).as_upload(overwrite=True)


### Create pipeline steps ###

# Pass input dataset into step1
step1 = PythonScriptStep(
    script_name="script.py",  # doesn't matter what this does
    source_directory="src/",
    name="Step 1",
    arguments=["--dataset-name", input_dataset_name_pipeline_param],
    inputs=[tabular_ds_consumption],
    outputs=[output_data]
)

pipeline_definition = Pipeline(workspace, steps=[step1])

# Publish the pipeline, then submit a run against the PublishedPipeline
published_pipeline = pipeline_definition.publish(name="my-pipeline")

experiment = Experiment(workspace, name='my-pipeline-experiment')

experiment.submit(
    published_pipeline,
    continue_on_step_failure=True,
    pipeline_parameters={"input_dataset_name": 'dataset', "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet"}
)

Expected behavior
Expected OutputFileDatasetConfig(name="dataset_output", destination=(output_datastore, data_output_path_pipeline_param)).as_upload(overwrite=True) to upload the data to the path supplied at pipeline run submission through the PipelineParameter input "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet".

Instead, it created an unusable folder under the datastore, and the pipeline still completed successfully. [screenshot of the resulting folder on the datastore omitted]

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Top GitHub Comments

3 reactions
Piranha688 commented on Jul 17, 2022

@chritter, I had opened a separate ticket with Microsoft. Their reply was that this behavior is intentional as the OutputFileDatasetConfig and PipelineParameter are not intended to be used this way.

I was recommended to introduce a separate PythonScriptStep for the purpose of writing output to a dynamic file path.

Hopefully v2 of AML will smooth out some of these issues going forward.
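
For reference, a minimal sketch of what that recommendation could look like, assuming the dynamic path travels as a plain string PipelineParameter and a hypothetical upload_step.py script performs the write itself (the compute target name "cpu-cluster" is also an assumption):

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

workspace = Workspace.from_config()

# Dynamic output path passed as a plain string, not bound to an OutputFileDatasetConfig destination
output_path_param = PipelineParameter(name="output_path", default_value="staging/default")

upload_step = PythonScriptStep(
    script_name="upload_step.py",   # hypothetical script: parses --output-path and uploads there itself
    source_directory="src/",
    name="Upload Output",
    compute_target="cpu-cluster",   # hypothetical compute target name
    arguments=["--output-path", output_path_param],
)

pipeline = Pipeline(workspace, steps=[upload_step])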

1 reaction
caitriggs commented on May 18, 2022

I ended up creating an upload step and using a pipeline parameter to pass in a FileDataset instead of relying on OutputFileDatasetConfig. So it looks like:

build_pipelines.py

# imports for this module (datarefs, create_tabular_dataset, create_file_dataset, and the
# globals referenced below are defined elsewhere in the project)
import logging

from azureml.core import ComputeTarget, Environment, Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep


def raw_data_refresh_pipeline_template(workspace: Workspace, dataset_name: str, output_dir: str):
    global gdcmlopsabm_env, environment, source_dir_for_snapshot

    # Get datastore name, data source path using the colloquial dataset name defined in datarefs.py
    input_datastore_name = datarefs.RAW_DATASET_REFERENCE_BY_NAME[dataset_name].datastore_name
    input_data_path = datarefs.RAW_DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path

    ### Pipeline Parameters ###

    # Create dataset name pipeline param
    dataset_name_pipeline_param = PipelineParameter(
        name="input_raw_dataset_name",
        default_value=dataset_name
    )

    # Create pipeline param for input tabular dataset
    input_tabular_dataset_pipeline_param = PipelineParameter(
        name='input_tabular_dataset', 
        default_value=create_tabular_dataset(workspace, data_path=input_data_path, datastore_name=input_datastore_name)
    )

    # Create pipeline param for input file dataset
    input_file_dataset_pipeline_param = PipelineParameter(
        name='input_file_dataset', 
        default_value=create_file_dataset(workspace, data_path=input_data_path, datastore_name=input_datastore_name)
    )    

    # Create pipeline param for output path
    output_path_pipeline_param = PipelineParameter(
        name="output_dir",
        default_value=output_dir
    )


    ### Create Inputs & Outputs ###

    # Create input tabular & file datasets 
    tabular_ds_consumption = DatasetConsumptionConfig(
        name="input_tabular_dataset", # name to use to access dataset within Run context
        dataset=input_tabular_dataset_pipeline_param
    )
    file_ds_consumption = DatasetConsumptionConfig(
        name="input_file_dataset", # name to use to access dataset within Run context
        dataset=input_file_dataset_pipeline_param
    ).as_mount()

    # Create link from data checker to data uploader
    data_checked = PipelineData(name='data_checked')


    ### Create Run Config ###

    # create pipeline run configuration
    aml_run_config = RunConfiguration()
    # set run configs for pipeline run environment
    aml_run_config.environment = Environment.get(workspace, name="abmv3-datarefresh")
    aml_run_config.target = ComputeTarget(workspace=workspace, name="abmv3-data")


    ### Create pipeline steps ###

    # Pass input dataset into data checker
    data_checker_step = PythonScriptStep(
        script_name="src/steps/data_checker.py",
        name="Data Checker", 
        runconfig=aml_run_config,
        arguments=["--dataset-name", dataset_name_pipeline_param],
        inputs=[tabular_ds_consumption],
        outputs=[data_checked],
        allow_reuse=False,
    )

    data_uploader_step = PythonScriptStep(
        script_name="src/steps/data_uploader.py",
        name="Data Uploader", 
        runconfig=aml_run_config,
        arguments=["--output-dir", output_path_pipeline_param],
        inputs=[file_ds_consumption, data_checked.as_input("data_checked")],
        allow_reuse=False,
    )

    logging.info(f"Snapshot will be created from source dir: {source_dir_for_snapshot}")

    return Pipeline(workspace, default_source_directory=source_dir_for_snapshot, steps=[data_checker_step, data_uploader_step])
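
For context, here is a hedged sketch (not from the original comment) of how the Pipeline returned by this template might be published and attached to a PipelineEndpoint with a default version, which is what run_raw_data_refresh_pipeline() further down submits against; the pipeline and endpoint names are assumptions:

from azureml.core import Workspace
from azureml.pipeline.core import PipelineEndpoint

workspace = Workspace.from_config()
pipeline = raw_data_refresh_pipeline_template(workspace, dataset_name="my_dataset", output_dir="raw_data_validated/my_dataset")

# Publish this build of the pipeline
published = pipeline.publish(name="raw-data-refresh", description="Raw data refresh with dynamic output path")

try:
    # Add the new version to an existing endpoint and make it the default
    endpoint = PipelineEndpoint.get(workspace=workspace, name="raw-data-refresh-endpoint")
    endpoint.add_default(published)
except Exception:
    # First publish: create the endpoint around this PublishedPipeline
    endpoint = PipelineEndpoint.publish(workspace=workspace, name="raw-data-refresh-endpoint",
                                        pipeline=published, description="Raw data refresh endpoint")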

data_uploader.py takes a FileDataset and an output path and uploads the files to that output path in our storage (in our case ADLS Gen2), using azure.storage.filedatalake.DataLakeServiceClient with an azure.identity.ClientSecretCredential created from our Datastore's saved SPN creds.

import argparse
import os
from pathlib import Path

import pyarrow as pa  # used below to open an input stream on each file

# azureml
from azureml.core import Run

# custom modules
import configs.pathrefs as pathrefs
from src.utils.azure_env import AzureEnvironment

"""
Data Uploader takes a path to a directory where files are and uploads to the provided output directory.
Run.get_context().input_datasets["input_file_dataset"]: file path to a directory e.g. "dataset_name/date/"
args.output_dir: file path to the directory on ADLS to upload to e.g. "raw_data_validated/dataset_name/date/"
"""

parser = argparse.ArgumentParser()
parser.add_argument('--output-dir')

args = parser.parse_args()

output_dir = args.output_dir
input_dir = Run.get_context().input_datasets["input_file_dataset"]

# if input_file_dataset only has 1 file in its dir it returns the full path instead of to the parent dir
if os.path.isfile(input_dir):
    input_dir = Path(input_dir).parent.resolve()

print(f"input_file_dataset: {input_dir}")
print(f"output_dir: {output_dir}")

# create ABM-AI ADLS instance and initialize with ADLS config location
gdcmlopsabm_env = AzureEnvironment(adls_config_path=pathrefs.ABMAI_ADLS_CONFIG, amls_config_path=pathrefs.LOCAL_AMLS_CONFIG)

# upload contents of OutputFileDatasetConfig folder to gdcmlopsabm ADLS
# (this is the AzureEnvironment method that gdcmlopsabm_env.upload_to_adls() calls below,
# reproduced here for reference)
def upload_to_adls(self, input_dir: str, output_dir: str):
    '''
    input_dir (str): path to directory where file(s) live; the OutputFileDatasetConfig.as_upload() value if using in AMLS pipeline
    output_dir (str): the path to output directory on ADLS
    '''
    # get ADLS client using AMLS Datastore SPN credentials
    adls_service_client = self.get_adls_client()
    # create the ADLS file system client
    container_client = adls_service_client.get_file_system_client(self.container)

    print(f"Preparing location to upload: {self.storage_name}:{self.container}/{output_dir}")

    # create ADLS directory client
    directory_client = container_client.get_directory_client(output_dir)

    # iterate thru local dir for file(s) in dir and write to ADLS file(s)
    for child_path in Path(input_dir).iterdir(): 

        # get the filename of the input file to use for output_filename
        _, output_filename = os.path.split(child_path)

        # create ADLS file and file client so we can upload content to new empty file
        directory_client.create_file(output_filename)
        file_client = directory_client.get_file_client(output_filename)
        
        print(f"Uploading file content from {child_path}\n")

        # handle which file type to write out
        try:
            file_contents = pa.input_stream(child_path)
        except Exception:
            print(f"{output_filename.split('.')[-1]} is not a supported file type")
            continue

        # upload file contents to output path
        file_client.upload_data(file_contents, overwrite=True)

# gdcmlopsabm_env has some references to datastore and container names to use for the environment (dev vs test vs prod)
gdcmlopsabm_env.upload_to_adls(input_dir, output_dir)

print(f"------Uploaded to {gdcmlopsabm_env.amls_datastore_name}/{output_dir}--------")

So that after we publish the pipeline with some defaults, we can submit a run against the PipelineEndpoint's default PublishedPipeline. We use a dataset name tied to a dataset reference that contains the input Datastore and data path info, plus some more info like the pipeline endpoint name to use:

from pathlib import PurePosixPath

from azureml.core import Experiment
from azureml.pipeline.core import PipelineEndpoint

# RAW_DATASET_REFERENCE_BY_NAME, create_tabular_dataset, create_file_dataset, logging,
# and the globals below are defined elsewhere in the project
def run_raw_data_refresh_pipeline(dataset_name):

    global gdcmlopsabm_env, workspace, today, batch_id

    # get base ADLS folder name
    base_data_dir = gdcmlopsabm_env.raw_data_dir

    # Create dataset output path from dataset name and date
    data_output_path = f"{base_data_dir}/{dataset_name}/{today}"

    # Get datastore name, data source path using the colloquial dataset name defined in datarefs.py
    input_datastore_name = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].datastore_name
    input_data_path = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path
    # Create tabular dataset object for checking the schemas with pandas
    input_tabular_dataset = create_tabular_dataset(workspace, datastore_name=input_datastore_name, data_path=input_data_path)
    # Create file dataset object for uploading raw data directly to avoid converting files from dataframe after Data Checker
    ## include metadata files and not just *.parquet
    input_data_path_dir = PurePosixPath(input_data_path).parent
    input_file_dataset = create_file_dataset(workspace, datastore_name=input_datastore_name, data_path=str(input_data_path_dir))

    # get pipeline endpoint by name
    pipeline_endpoint_name = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].pipeline_endpoint_name
    pipeline_endpoint = PipelineEndpoint.get(workspace=workspace, name=pipeline_endpoint_name)

    # get pipeline endpoint's default PublishedPipeline version
    default_version = pipeline_endpoint.get_default_version()
    default_published = pipeline_endpoint.get_pipeline(default_version).name
    
    logging.info(f"Submitting pipline run against PipelineEndpoint {pipeline_endpoint_name} default version: {default_published}\
        \nat data output path {gdcmlopsabm_env.amls_datastore_name}: {data_output_path}\n")

    # create experiment reference to submit a pipeline run to
    exp = Experiment(workspace, name=f"{pipeline_endpoint_name}-{dataset_name}")

    # Submit pipeline run against PipelineEndpoint's default PublishedPipeline using Experiment so we can add tags
    exp.submit(
        pipeline_endpoint, 
        pipeline_version=default_version,
        pipeline_parameters={"input_raw_dataset_name": dataset_name,
                             "input_tabular_dataset": input_tabular_dataset,
                             "input_file_dataset": input_file_dataset,
                             "output_dir": data_output_path
                            },                      
        tags={"dataset": f'{dataset_name}', "env": f'{gdcmlopsabm_env.environment}', 'pipeline': f'{default_published}', 'batch_id': f'{batch_id}'}
    )