[Azure ML SDK v2] File is not written to output azureml datastore
- Package Name: azure.ai.ml
- Package Version: latest in Azure ML Notebooks (Standard)
- Operating System: Azure ML Notebooks (Standard)
- Python Version: Azure ML Notebooks (Standard)
Describe the bug
The Azure ML datastore `tfconfigs` has multiple files in its base path. For a pipeline job, the datastore `tfconfigs` is defined as an output to write data to:
```python
from azure.ai.ml import command, Input, Output
from azure.ai.ml.constants import AssetTypes

update_config_component = command(
    name="tf_config_update",
    display_name="Tensorflow configuration file update",
    description="Reads the pipeline configuration file from a specific model (directory), updates it with the params, and saves the new pipeline config file to the output directory",
    inputs=dict(
        config_dir=Input(type="uri_folder"),
        config_directory_name=Input(type="string"),
        images_dir=Input(type="uri_folder"),
        labelmap_path=Input(type="string"),
        fine_tune_checkpoint_type=Input(type="string"),
        fine_tune_checkpoint=Input(type="string"),
        train_record_path=Input(type="string"),
        test_record_path=Input(type="string"),
        num_classes=Input(type="integer"),
        batch_size=Input(type="integer"),
        num_steps=Input(type="integer"),
    ),
    outputs={
        "config_directory_output": Output(
            type=AssetTypes.URI_FOLDER,
            path=f"azureml://subscriptions/{ml_client.subscription_id}/resourcegroups/{ml_client.resource_group_name}/workspaces/{ml_client.workspace_name}/datastores/tfconfigs/paths/",
        )
    },
    # The source folder of the component
    code=update_config_src_dir,
    command="""pwd && ls -la ${{outputs.config_directory_output}} && python update.py \
        --config_dir ${{inputs.config_dir}} \
        --config_directory_name ${{inputs.config_directory_name}} \
        --config_directory_output ${{outputs.config_directory_output}} \
        --images_dir ${{inputs.images_dir}} \
        --labelmap_path ${{inputs.labelmap_path}} \
        --fine_tune_checkpoint_type ${{inputs.fine_tune_checkpoint_type}} \
        --fine_tune_checkpoint ${{inputs.fine_tune_checkpoint}} \
        --train_record_path ${{inputs.train_record_path}} \
        --test_record_path ${{inputs.test_record_path}} \
        --num_classes ${{inputs.num_classes}} \
        --batch_size ${{inputs.batch_size}} \
        --num_steps ${{inputs.num_steps}} \
        """,
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)
```
The output `config_directory_output` is mounted on the compute target during execution as follows:

```
/mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
```
At the beginning of the Python script the output directory is listed:

```python
import os

# args comes from argparse; --config_directory_output is passed by the command above
print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)
```
The directory does not contain any files:

```
Listing path / dir: /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
[]
```
BUG: The Azure ML datastore `tfconfigs` mounted as the output already contains multiple manually uploaded files, yet the mounted directory lists as empty.
At the end of the Python script a config file is written to the mounted output and the directory is listed again:

```python
import os
import re

# Read the existing pipeline config (pipeline_config_path, new_pipeline_config_path,
# and the images_dir_* variables are derived from the CLI arguments above).
with open(pipeline_config_path, "r") as f:
    config = f.read()

with open(new_pipeline_config_path, "w") as f:
    # Set labelmap path
    config = re.sub('label_map_path: ".*?"',
                    'label_map_path: "{}"'.format(images_dir_labelmap_path), config)
    # Set fine_tune_checkpoint_type
    config = re.sub('fine_tune_checkpoint_type: ".*?"',
                    'fine_tune_checkpoint_type: "{}"'.format(args.fine_tune_checkpoint_type), config)
    # Set fine_tune_checkpoint path
    config = re.sub('fine_tune_checkpoint: ".*?"',
                    'fine_tune_checkpoint: "{}"'.format(args.fine_tune_checkpoint), config)
    # Set train tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/train)(.*?")',
                    'input_path: "{}"'.format(images_dir_train_record_path), config)
    # Set test tf-record file path
    config = re.sub('(input_path: ".*?)(PATH_TO_BE_CONFIGURED/val)(.*?")',
                    'input_path: "{}"'.format(images_dir_test_record_path), config)
    # Set number of classes
    config = re.sub('num_classes: [0-9]+',
                    'num_classes: {}'.format(args.num_classes), config)
    # Set batch size
    config = re.sub('batch_size: [0-9]+',
                    'batch_size: {}'.format(args.batch_size), config)
    # Set training steps
    config = re.sub('num_steps: [0-9]+',
                    'num_steps: {}'.format(int(args.num_steps)), config)
    f.write(config)

# List the output directory again
print("Listing path / dir: ", args.config_directory_output)
arr = os.listdir(args.config_directory_output)
print(arr)
```
Listing the mounted output directory now shows:

```
Listing path / dir: /mnt/azureml/cr/j/6a153baacc664cada4060f0b95adbf0e/cap/data-capability/wd/config_directory_output
['ssd_mobilenet_v2_fpnlite_640x640_coco17_tpu-8_steps125000_batch16.config']
```
BUG: The mounted output directory now contains the file, but the newly written file does not appear in the Azure ML datastore when viewed in Azure Storage Explorer / the Azure Portal.
To Reproduce
Steps to reproduce the behavior (a minimal sketch of these steps follows the list):
- Create a new Azure ML datastore backed by a new container in the storage account
- Create a pipeline with a job whose output is the newly created Azure ML datastore
- Write a file to the output in the pipeline job
- Run the pipeline
- Confirm that the file is not created in the Azure ML datastore / Azure Storage blob container
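A minimal repro sketch of those steps, under stated assumptions: the datastore is named `tfconfigs`, a local workspace `config.json` exists, and the compute target name `cpu-cluster` is a placeholder, not from the original report:

```python
# Hypothetical minimal repro: a one-step pipeline that writes a single file
# to an output bound to the "tfconfigs" datastore.
from azure.ai.ml import MLClient, Output, command
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential())

write_step = command(
    command='echo "hello" > ${{outputs.out_dir}}/hello.txt',
    outputs={
        "out_dir": Output(
            type=AssetTypes.URI_FOLDER,
            # Short-form datastore URI; the long subscription/workspace form behaves the same.
            path="azureml://datastores/tfconfigs/paths/",
        )
    },
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)

@pipeline(default_compute="cpu-cluster")  # assumed compute target name
def repro_pipeline():
    write_step()

job = ml_client.jobs.create_or_update(repro_pipeline(), experiment_name="datastore-output-repro")
# After the run completes, hello.txt is expected in the tfconfigs container but is missing.
```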
Expected behavior
Any file written to an output Azure ML datastore in a Python job should be written to the underlying Azure Storage blob container so it can be used later.
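For illustration, a hedged sketch of such later use, assuming the file had landed in the datastore as expected (the environment string is reused from the component above; the downstream step is hypothetical):

```python
from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

# Hypothetical downstream step that reads the configs written by the job above.
consume_step = command(
    command="ls -la ${{inputs.config_dir}}",
    inputs={
        "config_dir": Input(
            type=AssetTypes.URI_FOLDER,
            path="azureml://datastores/tfconfigs/paths/",
        )
    },
    environment="azureml://registries/azureml/environments/AzureML-minimal-ubuntu18.04-py37-cpu-inference/versions/43",
)
```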
Additional context
Using the following tutorial as a reference:
- https://github.com/Azure/azureml-examples/blob/main/sdk/python/assets/data/data.ipynb -> "Reading and writing data in a job"
Top GitHub Comments
Specifying the output path when defining the component will not work; the job still uses the default path:

```
azureml://datastores/${{default_datastore}}/paths/azureml/${{name}}/${{output_name}}/
```

However, specifying the output path when the component is consumed in a pipeline is supported, with code like below; please refer to our sample on this.
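A minimal sketch of that consumption-time binding, assuming `update_config_component` is the component defined above (this is an illustrative reconstruction, not the maintainer's exact sample):

```python
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.dsl import pipeline

@pipeline()
def tf_config_pipeline(
    config_dir, config_directory_name, images_dir, labelmap_path,
    fine_tune_checkpoint_type, fine_tune_checkpoint, train_record_path,
    test_record_path, num_classes, batch_size, num_steps,
):
    step = update_config_component(
        config_dir=config_dir,
        config_directory_name=config_directory_name,
        images_dir=images_dir,
        labelmap_path=labelmap_path,
        fine_tune_checkpoint_type=fine_tune_checkpoint_type,
        fine_tune_checkpoint=fine_tune_checkpoint,
        train_record_path=train_record_path,
        test_record_path=test_record_path,
        num_classes=num_classes,
        batch_size=batch_size,
        num_steps=num_steps,
    )
    # Rebind the output to the custom datastore at consumption time,
    # instead of in the component definition.
    step.outputs.config_directory_output = Output(
        type=AssetTypes.URI_FOLDER,
        mode="rw_mount",
        path="azureml://datastores/tfconfigs/paths/",
    )
    return {"config_directory_output": step.outputs.config_directory_output}
```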
@wangchao1230 What do you think of adding a validation in Output class’s constructor?