Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Data loss while writing Avro file to S3-compatible storage

See original GitHub issue

Hi,

I am converting a CSV file into Avro and writing it to S3-compliant storage. I can see that the schema file (.avsc) is written properly; however, there is data loss when writing the .avro file. Below is a snippet of my code:

## Code
import os

import boto
import boto.s3.connection
from boto.compat import urlsplit

import smart_open

import avro.schema
from avro.datafile import DataFileWriter
from avro.io import DatumWriter

import pandas as pn

FilePath = 's3a://mybucket/vinuthnav/csv/file1.csv'  # path on S3

splitInputDir = urlsplit(FilePath, allow_fragments=False)

# access_key_id, secret_access_key, port and hostname are defined elsewhere
inConn = boto.connect_s3(
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    port=int(port),
    host=hostname,
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)

# get the bucket
inbucket = inConn.get_bucket(splitInputDir.netloc)

# read in the csv file
kr = inbucket.get_key(splitInputDir.path)
with smart_open.smart_open(kr, 'r') as fin:
    xa = pn.read_csv(fin, header=1, error_bad_lines=False).fillna('NA')

rowCount, columnCount = xa.shape  # check if the data frame is empty; if it is, don't write output
if rowCount == 0:
    # do nothing
    print '>> [NOTE] empty file'

else:
    # derive the avro data and schema key names from the input csv path
    dataFile = os.path.splitext(FilePath)[0] + ".avro"
    schemaFile = os.path.splitext(FilePath)[0] + ".avsc"  # the schema file itself is written elsewhere

    kwd = inbucket.get_key(urlsplit(dataFile, allow_fragments=False).path, validate=False)
    schema = gen_schema(xa.columns)  # gen_schema (defined elsewhere) builds the avro schema from the columns

    with smart_open.smart_open(kwd, 'wb') as foutd:
        dictRes = xa.to_dict(orient='records')
        writer = DataFileWriter(foutd, DatumWriter(), schema)
        for ll, row in enumerate(dictRes):
            writer.append(row)

Issue Analytics

  • State: closed
  • Created 5 years ago
  • Comments: 78

Top GitHub Comments

3 reactions
mpenkov commented, Aug 4, 2018

OK, it’s the weekend, and I’ve had a look at it. Time to clear things up.

The “data loss” referred to in this ticket does not come from smart_open. It comes from misusing avro. You need to either:

  1. Close the writer using writer.close() (as @vinuthna91 mentioned a few posts ago)
  2. Use the writer as a context manager.

Both work identically. If you don’t call them, avro keeps some data in its internal buffers, and never writes it to the output file.
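
For reference, here is a minimal sketch of both options against the avro Python API, reusing the names from the snippet above (foutd is the smart_open file object, schema the generated schema, dictRes the list of record dicts):

from avro.datafile import DataFileWriter
from avro.io import DatumWriter

# Option 1: close the writer explicitly so avro flushes its internal buffers.
writer = DataFileWriter(foutd, DatumWriter(), schema)
for row in dictRes:
    writer.append(row)
writer.close()

# Option 2 (alternative to option 1): use the writer as a context manager,
# which closes it automatically when the block exits.
with DataFileWriter(foutd, DatumWriter(), schema) as writer:
    for row in dictRes:
        writer.append(row)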

However, if you do the above while using smart_open, avro ends up calling BufferedOutputWriter.close twice. The first time succeeds, but the second time fails due to a bug in that method. I’ve added a test and fixed that bug.
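
To illustrate the kind of fix involved, here is a rough, hypothetical sketch of a buffered writer whose close() is safe to call twice; this is not smart_open's actual implementation, just the general guard pattern:

class BufferedOutputWriter(object):
    # Hypothetical, simplified stand-in -- not smart_open's real class.
    def __init__(self, upload_func):
        self._upload = upload_func
        self._buffer = []
        self._closed = False

    def write(self, data):
        self._buffer.append(data)

    def close(self):
        if self._closed:
            # Guard: a second close() becomes a harmless no-op
            # instead of trying to flush an already-uploaded buffer.
            return
        self._upload(b''.join(self._buffer))
        self._buffer = []
        self._closed = True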

Strangely, I haven’t been able to reproduce the errors related to the absence of a flush method. @vinuthna91, can you reproduce the error with the new code?

3 reactions
mpenkov commented, Aug 2, 2018

OK, I think I see what’s going on here. I’ll look at this again on the weekend.

It’s still Thursday. Weekend starts in two days.

Read more comments on GitHub >

Top Results From Across the Web

Working with Apache Avro files in Amazon S3 - Gary A. Stafford
In this post, we will learn how to preview another popular file format often stored in Amazon S3— Apache Avro™.
Read more >
Loading Avro data from Cloud Storage | BigQuery
When importing multiple Avro files with different Avro schemas, all schemas must be compatible with Avro's schema resolution. When BigQuery detects the schema, …
Read more >
Amazon Simple Storage Service (S3) - AWS
The data stored in S3 One Zone-IA is not resilient to the physical loss of an Availability Zone resulting from disasters, such as...
Read more >
S3 to S3 Avro - Cloudera Documentation
It converts the files into Avro format and writes them to the destination S3 location. You define and store the data processing schema...
Read more >
Query External Data with ORC, Parquet, or Avro Source Files
Autonomous Database makes it easy to access ORC, Parquet, or Avro data stored in object store using external tables. ORC, Parquet, and Avro...
Read more >
