Data loss while writing Avro file to S3-compatible storage
Hi,

I am converting a CSV file into Avro and writing it to S3-compliant storage. I can see that the schema file (.avsc) is written properly; however, there is data loss when writing the .avro file. Below is a snippet of my code.
## Code
```python
import smart_open
from boto.compat import urlsplit, six
import boto
import boto.s3.connection
import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter
import pandas as pn
import os, sys

FilePath = 's3a://mybucket/vinuthnav/csv/file1.csv'  # path on s3
splitInputDir = urlsplit(FilePath, allow_fragments=False)

inConn = boto.connect_s3(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    port=int(port),
    host=hostname,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# get bucket
inbucket = inConn.get_bucket(splitInputDir.netloc)

# read in the csv file
kr = inbucket.get_key(splitInputDir.path)
with smart_open.smart_open(kr, 'r') as fin:
    xa = pn.read_csv(fin, header=1, error_bad_lines=False).fillna('NA')

rowCount, columnCount = xa.shape  # check if the data frame is empty; if it is, don't write output
if rowCount == 0:
    # do nothing
    print '>> [NOTE] empty file'
else:
    # generate avro schema and data
    dataFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0] + ".avro")
    schemaFile = os.path.join(os.path.basename(FileName), os.path.splitext(FileName)[0] + ".avsc")
    kwd = inbucket.get_key(urlsplit(dataFile, allow_fragments=False).path, validate=False)
    schema = gen_schema(xa.columns)
    with smart_open.smart_open(kwd, 'wb') as foutd:
        dictRes = xa.to_dict(orient='records')
        writer = DataFileWriter(foutd, DatumWriter(), schema)
        for ll, row in enumerate(dictRes):
            writer.append(row)
```
OK, it’s the weekend, and I’ve had a look at it. Time to clear things up.
The “data loss” referred to in this ticket does not come from smart_open. It comes from misusing avro. You need to either:

- call `writer.flush()` after appending the last record, or
- call `writer.close()` explicitly.

Both work identically. If you don’t call them, avro keeps some data in its internal buffers, and never writes it to the output file.
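For illustration, here is how the tail of the snippet from the issue could look with an explicit close; this is a sketch that reuses the variable names from the original code (`kwd`, `xa`, `gen_schema`), which are assumed to be defined as above:

```python
# Sketch: same write loop as in the issue, but the DataFileWriter is closed
# explicitly so avro flushes its buffered blocks before smart_open closes
# the underlying S3 object. Names (kwd, xa, gen_schema) come from the issue.
schema = gen_schema(xa.columns)
with smart_open.smart_open(kwd, 'wb') as foutd:
    writer = DataFileWriter(foutd, DatumWriter(), schema)
    try:
        for row in xa.to_dict(orient='records'):
            writer.append(row)
    finally:
        writer.close()  # flushes avro's internal buffers; omitting this truncates the file
```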
However, if you do the above while using smart_open, avro ends up calling BufferedOutputWriter.close twice. The first time succeeds, but the second time fails due to a bug in that method. I’ve added a test and fixed that bug.
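For background, one common way to make a close method survive a second call is to make it idempotent. The sketch below is not the actual smart_open code; the class and attribute names are invented purely to illustrate the pattern:

```python
class IllustrativeBufferedOutput(object):
    """Hypothetical file-like wrapper whose close() is safe to call twice."""

    def __init__(self, upload):
        self._upload = upload      # callable that ships the payload to S3
        self._buffer = []
        self._closed = False

    def write(self, data):
        self._buffer.append(data)

    def flush(self):
        pass                       # nothing to send until close(); present so callers can flush

    def close(self):
        if self._closed:           # e.g. avro's DataFileWriter.close(), then the with-block
            return                 # the second call becomes a no-op instead of raising
        self._upload(b''.join(self._buffer))
        self._closed = True
```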
Strangely, I haven’t been able to reproduce the errors related to the absence of a flush method. @vinuthna91, can you reproduce the error with the new code?
It’s still Thursday. Weekend starts in two days.