
Setting data path using PipelineParameter within OutputFileDatasetConfig yields unusable path

See original GitHub issue
  • Package Name: azureml-core
  • Package Version: 1.39.0
  • Operating System: Windows 11
  • Python Version: 3.7.12

Describe the bug
When using a PipelineParameter to handle the output data path within OutputFileDatasetConfig, the pipeline job completes successfully but creates an unusable path on the Datastore file system.

To Reproduce

# azureml imports (workspace, dataset_name, today, DATASET_REFERENCE_BY_NAME, and
# create_tabular_dataset_from_datastore are defined elsewhere)
from azureml.core import Datastore, Experiment
from azureml.data import OutputFileDatasetConfig
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

### Pipeline Parameters ###

# Create pipeline parameter for input dataset name
input_dataset_name_pipeline_param = PipelineParameter(
    name="input_dataset_name",
    default_value=dataset_name
)

# Create dataset output path from pipeline param so it can be changed at runtime
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value=''
)

### Create Inputs and Outputs ###

# Get datastore name, data source path, and its expected schema from mapped values using the colloquial dataset name
datastore_name = DATASET_REFERENCE_BY_NAME[dataset_name].datastore   # returns a Datastore object for the input data
data_path = DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path  # returns path to the input data

# Create dataset object using the datastore name and data source path
input_dataset = create_tabular_dataset_from_datastore(workspace, datastore_name, data_path)  # returns a Dataset object

# Create input tabular 
tabular_ds_consumption = DatasetConsumptionConfig(
    name="input_tabular_dataset", # name to use to access dataset within Run context
    dataset=input_dataset
)

output_datastore = Datastore(workspace, name='datastore_name')

# Create output dataset 
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(output_datastore, data_output_path_pipeline_param)
).as_upload(overwrite=True)


### Create pipeline steps ###

# Pass input dataset into step1
step1 = PythonScriptStep(
    script_name="script.py",  # doesn't matter what this does
    source_directory="src/",
    name="Step 1",
    arguments=["--dataset-name", input_dataset_name_pipeline_param],
    inputs=[tabular_ds_consumption],
    outputs=[output_data]
)

pipeline_definition = Pipeline(workspace, steps=[step1])

# Publish the pipeline, then submit a run against the PublishedPipeline
published_pipeline = pipeline_definition.publish(name="my-pipeline")

experiment = Experiment(workspace, name='my-pipeline-experiment')

experiment.submit(
    published_pipeline,
    continue_on_step_failure=True,
    pipeline_parameters={"input_dataset_name": 'dataset', "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet"}
)

Expected behavior
Expected OutputFileDatasetConfig(name="dataset_output", destination=(output_datastore, data_output_path_pipeline_param)).as_upload(overwrite=True) to upload the data to the path supplied at pipeline run submission through the PipelineParameter input "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet".

Instead, it created an unusable folder under the datastore, and the pipeline still completed successfully. [screenshot of the resulting folder on the datastore omitted]

Issue Analytics

  • State: closed
  • Created: a year ago
  • Reactions: 2
  • Comments: 16 (5 by maintainers)

Top GitHub Comments

3 reactions
Piranha688 commented on Jul 17, 2022

@chritter, I had opened a separate ticket with Microsoft. Their reply was that this behavior is intentional as the OutputFileDatasetConfig and PipelineParameter are not intended to be used this way.

I was recommended to introduce a separate PythonScriptStep for the purpose of writing output to a dynamic file path.

Hopefully v2 of AML will smooth out some of these issues going forward.
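
For reference, a minimal sketch of what that recommendation could look like, assuming the dynamic path travels as a plain string PipelineParameter and a hypothetical upload_step.py script performs the write itself (the compute target name "cpu-cluster" is also an assumption):

from azureml.core import Workspace
from azureml.pipeline.core import Pipeline, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

workspace = Workspace.from_config()

# Dynamic output path passed as a plain string, not bound to an OutputFileDatasetConfig destination
output_path_param = PipelineParameter(name="output_path", default_value="staging/default")

upload_step = PythonScriptStep(
    script_name="upload_step.py",   # hypothetical script: parses --output-path and uploads there itself
    source_directory="src/",
    name="Upload Output",
    compute_target="cpu-cluster",   # hypothetical compute target name
    arguments=["--output-path", output_path_param],
)

pipeline = Pipeline(workspace, steps=[upload_step])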

1 reaction
caitriggs commented on May 18, 2022

I ended up creating an upload step and using a pipeline parameter to pass in a FileDataset instead of relying on OutputFileDatasetConfig. So it looks like:

build_pipelines.py

# imports for this module (datarefs, create_tabular_dataset, create_file_dataset, and the
# globals referenced below are defined elsewhere in the project)
import logging

from azureml.core import ComputeTarget, Environment, Workspace
from azureml.core.runconfig import RunConfiguration
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.core import Pipeline, PipelineData, PipelineParameter
from azureml.pipeline.steps import PythonScriptStep


def raw_data_refresh_pipeline_template(workspace: Workspace, dataset_name: str, output_dir: str):
    global gdcmlopsabm_env, environment, source_dir_for_snapshot

    # Get datastore name, data source path using the colloquial dataset name defined in datarefs.py
    input_datastore_name = datarefs.RAW_DATASET_REFERENCE_BY_NAME[dataset_name].datastore_name
    input_data_path = datarefs.RAW_DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path

    ### Pipeline Parameters ###

    # Create dataset name pipeline param
    dataset_name_pipeline_param = PipelineParameter(
        name="input_raw_dataset_name",
        default_value=dataset_name
    )

    # Create pipeline param for input tabular dataset
    input_tabular_dataset_pipeline_param = PipelineParameter(
        name='input_tabular_dataset', 
        default_value=create_tabular_dataset(workspace, data_path=input_data_path, datastore_name=input_datastore_name)
    )

    # Create pipeline param for input file dataset
    input_file_dataset_pipeline_param = PipelineParameter(
        name='input_file_dataset', 
        default_value=create_file_dataset(workspace, data_path=input_data_path, datastore_name=input_datastore_name)
    )    

    # Create pipeline param for output path
    output_path_pipeline_param = PipelineParameter(
        name="output_dir",
        default_value=output_dir
    )


    ### Create Inputs & Outputs ###

    # Create input tabular & file datasets 
    tabular_ds_consumption = DatasetConsumptionConfig(
        name="input_tabular_dataset", # name to use to access dataset within Run context
        dataset=input_tabular_dataset_pipeline_param
    )
    file_ds_consumption = DatasetConsumptionConfig(
        name="input_file_dataset", # name to use to access dataset within Run context
        dataset=input_file_dataset_pipeline_param
    ).as_mount()

    # Create link from data checker to data uploader
    data_checked = PipelineData(name='data_checked')


    ### Create Run Config ###

    # create pipeline run configuration
    aml_run_config = RunConfiguration()
    # set run configs for pipeline run environment
    aml_run_config.environment = Environment.get(workspace, name="abmv3-datarefresh")
    aml_run_config.target = ComputeTarget(workspace=workspace, name="abmv3-data")


    ### Create pipeline steps ###

    # Pass input dataset into data checker
    data_checker_step = PythonScriptStep(
        script_name="src/steps/data_checker.py",
        name="Data Checker", 
        runconfig=aml_run_config,
        arguments=["--dataset-name", dataset_name_pipeline_param],
        inputs=[tabular_ds_consumption],
        outputs=[data_checked],
        allow_reuse=False,
    )

    data_uploader_step = PythonScriptStep(
        script_name="src/steps/data_uploader.py",
        name="Data Uploader", 
        runconfig=aml_run_config,
        arguments=["--output-dir", output_path_pipeline_param],
        inputs=[file_ds_consumption, data_checked.as_input("data_checked")],
        allow_reuse=False,
    )

    logging.info(f"Snapshot will be created from source dir: {source_dir_for_snapshot}")

    return Pipeline(workspace, default_source_directory=source_dir_for_snapshot, steps=[data_checker_step, data_uploader_step])
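
For context, here is a hedged sketch (not from the original comment) of how the Pipeline returned by this template might be published and attached to a PipelineEndpoint with a default version, which is what run_raw_data_refresh_pipeline() further down submits against; the pipeline and endpoint names are assumptions:

from azureml.core import Workspace
from azureml.pipeline.core import PipelineEndpoint

workspace = Workspace.from_config()
pipeline = raw_data_refresh_pipeline_template(workspace, dataset_name="my_dataset", output_dir="raw_data_validated/my_dataset")

# Publish this build of the pipeline
published = pipeline.publish(name="raw-data-refresh", description="Raw data refresh with dynamic output path")

try:
    # Add the new version to an existing endpoint and make it the default
    endpoint = PipelineEndpoint.get(workspace=workspace, name="raw-data-refresh-endpoint")
    endpoint.add_default(published)
except Exception:
    # First publish: create the endpoint around this PublishedPipeline
    endpoint = PipelineEndpoint.publish(workspace=workspace, name="raw-data-refresh-endpoint",
                                        pipeline=published, description="Raw data refresh endpoint")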

data_uploader.py takes a FileDataset and an output path and uploads the files to that output path in our storage (in our case ADLS Gen2), using azure.storage.filedatalake.DataLakeServiceClient with an azure.identity.ClientSecretCredential created from our Datastore's saved SPN creds.

import argparse
import os
from pathlib import Path

import pyarrow as pa  # used below to open an input stream on each file

# azureml
from azureml.core import Run

# custom modules
import configs.pathrefs as pathrefs
from src.utils.azure_env import AzureEnvironment

"""
Data Uploader takes a path to a directory where files are and uploads to the provided output directory.
Run.get_context().input_datasets["input_file_dataset"]: file path to a directory e.g. "dataset_name/date/"
args.output_dir: file path to the directory on ADLS to upload to e.g. "raw_data_validated/dataset_name/date/"
"""

parser = argparse.ArgumentParser()
parser.add_argument('--output-dir')

args = parser.parse_args()

output_dir = args.output_dir
input_dir = Run.get_context().input_datasets["input_file_dataset"]

# if input_file_dataset only has 1 file in its dir it returns the full path instead of to the parent dir
if os.path.isfile(input_dir):
    input_dir = Path(input_dir).parent.resolve()

print(f"input_file_dataset: {input_dir}")
print(f"output_dir: {output_dir}")

# create ABM-AI ADLS instance and initialize with ADLS config location
gdcmlopsabm_env = AzureEnvironment(adls_config_path=pathrefs.ABMAI_ADLS_CONFIG, amls_config_path=pathrefs.LOCAL_AMLS_CONFIG)

# upload contents of OutputFileDatasetConfig folder to gdcmlopsabm ADLS
# (this is the AzureEnvironment method that gdcmlopsabm_env.upload_to_adls() calls below,
# reproduced here for reference)
def upload_to_adls(self, input_dir: str, output_dir: str):
    '''
    input_dir (str): path to directory where file(s) live; the OutputFileDatasetConfig.as_upload() value if using in AMLS pipeline
    output_dir (str): the path to output directory on ADLS
    '''
    # get ADLS client using AMLS Datastore SPN credentials
    adls_service_client = self.get_adls_client()
    # create the ADLS file system client
    container_client = adls_service_client.get_file_system_client(self.container)

    print(f"Preparing location to upload: {self.storage_name}:{self.container}/{output_dir}")

    # create ADLS directory client
    directory_client = container_client.get_directory_client(output_dir)

    # iterate thru local dir for file(s) in dir and write to ADLS file(s)
    for child_path in Path(input_dir).iterdir(): 

        # get the filename of the input file to use for output_filename
        _, output_filename = os.path.split(child_path)

        # create ADLS file and file client so we can upload content to new empty file
        directory_client.create_file(output_filename)
        file_client = directory_client.get_file_client(output_filename)
        
        print(f"Uploading file content from {child_path}\n")

        # handle which file type to write out
        try:
            file_contents = pa.input_stream(child_path)
        except Exception:
            print(f"{output_filename.split('.')[-1]} is not a supported file type")
            continue

        # upload file contents to output path
        file_client.upload_data(file_contents, overwrite=True)

# gdcmlopsabm_env has some references to datastore and container names to use for the environment (dev vs test vs prod)
gdcmlopsabm_env.upload_to_adls(input_dir, output_dir)

print(f"------Uploaded to {gdcmlopsabm_env.amls_datastore_name}/{output_dir}--------")

So that after we publish the pipeline with some defaults, we can submit a run against the PipelineEndpoint's default PublishedPipeline. We use a dataset name tied to a dataset reference that contains the input Datastore and data path info, plus some more info like the pipeline endpoint name to use:

from pathlib import PurePosixPath

from azureml.core import Experiment
from azureml.pipeline.core import PipelineEndpoint

# RAW_DATASET_REFERENCE_BY_NAME, create_tabular_dataset, create_file_dataset, logging,
# and the globals below are defined elsewhere in the project
def run_raw_data_refresh_pipeline(dataset_name):

    global gdcmlopsabm_env, workspace, today, batch_id

    # get base ADLS folder name
    base_data_dir = gdcmlopsabm_env.raw_data_dir

    # Create dataset output path from dataset name and date
    data_output_path = f"{base_data_dir}/{dataset_name}/{today}"

    # Get datastore name, data source path using the colloquial dataset name defined in datarefs.py
    input_datastore_name = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].datastore_name
    input_data_path = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].data_source_path
    # Create tabular dataset object for checking the schemas with pandas
    input_tabular_dataset = create_tabular_dataset(workspace, datastore_name=input_datastore_name, data_path=input_data_path)
    # Create file dataset object for uploading raw data directly to avoid converting files from dataframe after Data Checker
    ## include metadata files and not just *.parquet
    input_data_path_dir = PurePosixPath(input_data_path).parent
    input_file_dataset = create_file_dataset(workspace, datastore_name=input_datastore_name, data_path=str(input_data_path_dir))

    # get pipeline endpoint by name
    pipeline_endpoint_name = RAW_DATASET_REFERENCE_BY_NAME[dataset_name].pipeline_endpoint_name
    pipeline_endpoint = PipelineEndpoint.get(workspace=workspace, name=pipeline_endpoint_name)

    # get pipeline endpoint's default PublishedPipeline version
    default_version = pipeline_endpoint.get_default_version()
    default_published = pipeline_endpoint.get_pipeline(default_version).name
    
    logging.info(f"Submitting pipline run against PipelineEndpoint {pipeline_endpoint_name} default version: {default_published}\
        \nat data output path {gdcmlopsabm_env.amls_datastore_name}: {data_output_path}\n")

    # create experiment reference to submit a pipeline run to
    exp = Experiment(workspace, name=f"{pipeline_endpoint_name}-{dataset_name}")

    # Submit pipeline run against PipelineEndpoint's default PublishedPipeline using Experiment so we can add tags
    exp.submit(
        pipeline_endpoint, 
        pipeline_version=default_version,
        pipeline_parameters={"input_raw_dataset_name": dataset_name,
                             "input_tabular_dataset": input_tabular_dataset,
                             "input_file_dataset": input_file_dataset,
                             "output_dir": data_output_path
                            },                      
        tags={"dataset": f'{dataset_name}', "env": f'{gdcmlopsabm_env.environment}', 'pipeline': f'{default_published}', 'batch_id': f'{batch_id}'}
    )