Data loss while writing Avro file to S3-compatible storage
Hi,

I am converting a CSV file into Avro and writing it to S3-compliant storage. I can see that the schema file (.avsc) is written properly; however, there is data loss when writing the .avro file. Below is a snippet of my code.
## Code
```python
import smart_open
from boto.compat import urlsplit, six
import boto
import boto.s3.connection
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import pandas as pn
import os, sys

FilePath = 's3a://mybucket/vinuthnav/csv/file1.csv'  # path on s3
splitInputDir = urlsplit(FilePath, allow_fragments=False)

inConn = boto.connect_s3(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    port=int(port),
    host=hostname,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# get bucket
inbucket = inConn.get_bucket(splitInputDir.netloc)

# read in the csv file
kr = inbucket.get_key(splitInputDir.path)
with smart_open.smart_open(kr, 'r') as fin:
    xa = pn.read_csv(fin, header=1, error_bad_lines=False).fillna('NA')

rowCount, columnCount = xa.shape  # check if the data frame is empty; if it is, don't write output
if rowCount == 0:
    # do nothing
    print '>> [NOTE] empty file'
else:
    # generate avro schema and data
    dataFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0] + ".avro")
    schemaFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0] + ".avsc")
    kwd = inbucket.get_key(urlsplit(dataFile, allow_fragments=False).path, validate=False)
    schema = gen_schema(xa.columns)
    with smart_open.smart_open(kwd, 'wb') as foutd:
        dictRes = xa.to_dict(orient='records')
        writer = DataFileWriter(foutd, DatumWriter(), schema)
        for ll, row in enumerate(dictRes):
            writer.append(row)
```
OK, it’s the weekend, and I’ve had a look at it. Time to clear things up.
The “data loss” referred to in this ticket does not come from smart_open. It comes from misusing avro. You need to either:

- call `writer.flush()` after appending the last record, or
- call `writer.close()` explicitly.

Both work identically. If you don’t call them, avro keeps some data in its internal buffers, and never writes it to the output file.
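For illustration, here is how the tail of the snippet from the issue could look with an explicit close; this is a sketch that reuses the variable names from the original code (`kwd`, `xa`, `gen_schema`), which are assumed to be defined as above:

```python
# Sketch: same write loop as in the issue, but the DataFileWriter is closed
# explicitly so avro flushes its buffered blocks before smart_open closes
# the underlying S3 object. Names (kwd, xa, gen_schema) come from the issue.
schema = gen_schema(xa.columns)
with smart_open.smart_open(kwd, 'wb') as foutd:
    writer = DataFileWriter(foutd, DatumWriter(), schema)
    try:
        for row in xa.to_dict(orient='records'):
            writer.append(row)
    finally:
        writer.close()  # flushes avro's internal buffers; omitting this truncates the file
```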
However, if you do the above while using smart_open, avro ends up calling BufferedOutputWriter.close twice. The first time succeeds, but the second time fails due to a bug in that method. I’ve added a test and fixed that bug.
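For background, one common way to make a close method survive a second call is to make it idempotent. The sketch below is not the actual smart_open code; the class and attribute names are invented purely to illustrate the pattern:

```python
class IllustrativeBufferedOutput(object):
    """Hypothetical file-like wrapper whose close() is safe to call twice."""

    def __init__(self, upload):
        self._upload = upload      # callable that ships the payload to S3
        self._buffer = []
        self._closed = False

    def write(self, data):
        self._buffer.append(data)

    def flush(self):
        pass                       # nothing to send until close(); present so callers can flush

    def close(self):
        if self._closed:           # e.g. avro's DataFileWriter.close(), then the with-block
            return                 # the second call becomes a no-op instead of raising
        self._upload(b''.join(self._buffer))
        self._closed = True
```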
Strangely, I haven’t been able to reproduce the errors related to the absence of a flush method. @vinuthna91, can you reproduce the error with the new code?
It’s still Thursday. Weekend starts in two days.