Multiple prefixes in GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator

Description

Support passing multiple prefixes to GoogleCloudStorageListOperator and GoogleCloudStorageDeleteOperator operators.

Use case / motivation

I have this folder structure in a GCS bucket.

+-- year={year}
|   +-- month={month}
|       +--day={day}
|           +-- topic={topic1}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic2}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic3}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic4}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic5}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic6}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic7}
|       +--day={day}
|           +-- topic={topic1}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic2}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic3}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic4}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic5}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic6}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           +-- topic={topic7}
|                 +--file 1
|                 +--file 2
|                 +--file 3
|           ....

What I need to achieve is deleting one day of objects. For example, I need to delete the objects in year=2020/month=08/day=19. I can do that easily with gsutil, which lets you delete them via a wildcard (year=2020/month=08/day=19/*), but with the REST APIs you can’t, even if you use a prefix. The reason is that there is no single prefix that returns all the objects inside a folder. I worked around that by using multiple prefixes and listing the objects for each prefix. Unfortunately, I can’t pass more than one prefix to the operators.

Prefixes used

  • year=2020/month=08/day=19/topic={topic1}
  • year=2020/month=08/day=19/topic={topic2}
  • year=2020/month=08/day=19/topic={topic3}
  • year=2020/month=08/day=19/topic={topic4}
  • year=2020/month=08/day=19/topic={topic5}
  • year=2020/month=08/day=19/topic={topic6}
  • year=2020/month=08/day=19/topic={topic7}
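
The workaround described above (one listing per topic prefix, results merged) could be sketched roughly as follows. This is only an illustration: `day_prefixes`, `collect_objects`, `fake_list`, and `FAKE_BUCKET` are hypothetical names that do not come from Airflow or the GCS client, and the in-memory list stands in for a real bucket listing.

```python
from typing import Callable, Iterable, List


def day_prefixes(year: int, month: int, day: int, topics: Iterable[str]) -> List[str]:
    """Build one prefix per topic for a single day partition."""
    base = f"year={year:04d}/month={month:02d}/day={day:02d}"
    return [f"{base}/topic={topic}" for topic in topics]


def collect_objects(
    list_by_prefix: Callable[[str], Iterable[str]],
    prefixes: Iterable[str],
) -> List[str]:
    """Call a single-prefix lister once per prefix and merge the results."""
    names: List[str] = []
    for prefix in prefixes:
        names.extend(list_by_prefix(prefix))
    return names


# Hypothetical stand-in for the bucket contents, for illustration only.
FAKE_BUCKET = [
    "year=2020/month=08/day=19/topic=a/file1",
    "year=2020/month=08/day=19/topic=b/file1",
    "year=2020/month=08/day=20/topic=a/file1",
]


def fake_list(prefix: str) -> List[str]:
    """Return the object names that start with the given prefix."""
    return [name for name in FAKE_BUCKET if name.startswith(prefix)]


prefixes = day_prefixes(2020, 8, 19, ["a", "b"])
to_delete = collect_objects(fake_list, prefixes)
```

In a real DAG, `fake_list` would be replaced by whatever single-prefix listing call is available, and the merged names would be passed on as the objects to delete.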

Issue Analytics

  • State: open
  • Created: 3 years ago
  • Comments: 10 (5 by maintainers)

Top GitHub Comments

eladkal commented, May 2, 2021

> What I’m planning to do is modify the GCSHook.list() method to accept prefixes instead of prefix. How can we do that while keeping backward compatibility? Some existing code will assume the hook accepts a single prefix, so we would need to raise a deprecation warning. Or maybe it is only used internally and I need to refactor the operators that use it?

prefix is a parameter of list_blobs (https://googleapis.dev/python/storage/latest/client.html). Even if you modify the parameter on the hook, in the end you will still be able to utilize only a single prefix per call. You can change prefix to accept Optional[Union[str, List[str]]]; that way the modification is also backward compatible. This has some similarities to the approach suggested in https://github.com/apache/airflow/issues/15001
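
A minimal sketch of that backward-compatible signature, assuming the hook keeps a single-prefix listing call underneath. The names `PrefixArg`, `normalize_prefixes`, and `list_with_prefixes` are hypothetical and are not part of GCSHook; `list_one` stands in for whatever single-prefix call the hook would make.

```python
from typing import Callable, Iterable, List, Optional, Union

# A single prefix, a list of prefixes, or None (no filtering).
PrefixArg = Optional[Union[str, List[str]]]


def normalize_prefixes(prefix: PrefixArg) -> List[Optional[str]]:
    """Turn None, a single string, or a list of strings into a list."""
    if prefix is None:
        return [None]  # one unfiltered listing
    if isinstance(prefix, str):
        return [prefix]  # existing single-prefix callers keep working
    return list(prefix)


def list_with_prefixes(
    list_one: Callable[[Optional[str]], Iterable[str]],
    prefix: PrefixArg = None,
) -> List[str]:
    """Run one single-prefix listing per prefix and concatenate the results."""
    results: List[str] = []
    for single in normalize_prefixes(prefix):
        results.extend(list_one(single))
    return results
```

Because a plain string is still accepted, callers that pass one prefix would not need to change; only the listing loop is new. Overlapping prefixes would return duplicate names in this sketch, so a real implementation might want to deduplicate.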

EmadMokhtar commented, Oct 27, 2021

> @EmadMokhtar are you still working on this issue?

I want to, but I’m having trouble setting up the dev environment for Airflow. I will give it another try in the coming week.

