JSON files from S3 downloading as text files
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
No response
Apache Airflow version
2.3.0 (latest released)
Operating System
Mac OS Mojave 10.14.6
Deployment
Official Apache Airflow Helm Chart
Deployment details
No response
What happened
When I download a JSON file from S3 using the S3Hook:
filename = s3_hook.download_file(bucket_name=self.source_s3_bucket, key=key, local_path="./data")
the file is saved as a text file whose name starts with airflow_temp_.
What you think should happen instead
It would be nice to have them download as JSON files, or at least keep the same filename as in S3. As it stands, additional code is required to go back and read the file as a dictionary (ast.literal_eval), and there is no guarantee that the JSON structure is maintained.
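As a side note on reading the file back: a minimal sketch, assuming the downloaded object really contains JSON text. The standard-library json module parses by content, so the generated file name and missing .json suffix should not matter. The helper name load_json_file and the sample payload are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_json_file(path):
    """Parse a file as JSON based on its contents, ignoring its name."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


# Simulate a download that landed under a generated name with no .json suffix.
# The file name and payload below are illustrative, not taken from the issue.
with tempfile.TemporaryDirectory() as tmp_dir:
    downloaded = Path(tmp_dir) / "airflow_temp_example"
    downloaded.write_text('{"citations": [1, 2, 3]}', encoding="utf-8")
    filing = load_json_file(downloaded)

print(sorted(filing.keys()))
```

Unlike ast.literal_eval, json.load understands JSON literals such as true, false and null, so the structure of the document is preserved.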
How to reproduce
Here s3_conn_id is the Airflow connection and s3_bucket is a bucket on AWS S3. This is the custom operator class:
import logging
import pickle  # needed for pickle.loads in execute()

from airflow.models.baseoperator import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.hooks.S3_hook import S3Hook


class S3SearchFilingsOperator(BaseOperator):
    """
    Queries the Datastore API and uploads the processed info as a csv to the S3 bucket.

    :param source_s3_bucket: Choose source s3 bucket
    :param source_s3_directory: Source s3 directory
    :param s3_conn_id: S3 Connection ID
    :param destination_s3_bucket: S3 Bucket Destination
    """

    @apply_defaults
    def __init__(
            self,
            source_s3_bucket=None,
            source_s3_directory=True,
            s3_conn_id=True,
            destination_s3_bucket=None,
            destination_s3_directory=None,
            search_terms=None,  # None instead of a shared mutable [] default
            *args,
            **kwargs) -> None:
        super().__init__(*args, **kwargs)
        self.source_s3_bucket = source_s3_bucket
        self.source_s3_directory = source_s3_directory
        self.s3_conn_id = s3_conn_id
        self.destination_s3_bucket = destination_s3_bucket
        self.destination_s3_directory = destination_s3_directory
        self.search_terms = search_terms or []

    def execute(self, context):
        """
        Executes the operator.
        """
        s3_hook = S3Hook(self.s3_conn_id)
        keys = s3_hook.list_keys(bucket_name=self.source_s3_bucket)
        for key in keys:
            # download file; the returned path is the airflow_temp_* file
            filename = s3_hook.download_file(bucket_name=self.source_s3_bucket, key=key, local_path="./data")
            logging.info(filename)
            with open(filename, 'rb') as handle:
                filing = handle.read()
            filing = pickle.loads(filing)
            logging.info(filing.keys())
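Until the hook keeps the original name, one possible workaround is to rename the temporary file after the download. This is a sketch only: rename_to_key_basename is a hypothetical helper, not part of Airflow or the amazon provider, and the key and file names in the demo are invented.

```python
import shutil
import tempfile
from pathlib import Path, PurePosixPath


def rename_to_key_basename(downloaded_path, key):
    """Move a downloaded temp file so it carries the S3 key's file name.

    Hypothetical helper for illustration; S3 keys use "/" as separator,
    so the last path component is taken with PurePosixPath.
    """
    source = Path(downloaded_path)
    target = source.with_name(PurePosixPath(key).name)
    shutil.move(str(source), str(target))
    return target


# Demo with a stand-in for the file returned by download_file().
with tempfile.TemporaryDirectory() as tmp_dir:
    temp_file = Path(tmp_dir) / "airflow_temp_abc123"
    temp_file.write_text("{}", encoding="utf-8")
    renamed = rename_to_key_basename(temp_file, "citations/filing_001.json")
    print(renamed.name)
```

In the operator this would be called on the path returned by download_file for each key. Note that two keys sharing the same base name would overwrite each other in a single local directory.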
And this is the dag file:
from keywordSearch.operators.s3_search_filings_operator import S3SearchFilingsOperator
from airflow import DAG
from airflow.utils.dates import days_ago
from datetime import timedelta
# from aws_pull import aws_pull

default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": days_ago(2),
    "email": ["airflow@example.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(seconds=30)
}

with DAG("keyword-search-full-load",
         default_args=default_args,
         description="Syntax Keyword Search",
         max_active_runs=1,
         schedule_interval=None) as dag:
    op3 = S3SearchFilingsOperator(
        task_id="s3_search_filings",
        source_s3_bucket="processed-filings",
        source_s3_directory="citations",
        s3_conn_id="Syntax_S3",
        destination_s3_bucket="keywordsearch",
        destination_s3_directory="results",
        dag=dag
    )
    op3
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project’s Code of Conduct
Issue Analytics
- State:
- Created a year ago
- Comments:6 (2 by maintainers)
Top GitHub Comments
Hi @feluelle . Thanks, will look into creating a PR now.
Hi I just created a PR for this issue!