Converting jsonLines file into parquet

I am writing log files in JSON Lines format to an S3 bucket. I was able to load one into a pandas DataFrame with df = wr.s3.read_json(path, lines=True), but when I save it back with wr.s3.to_parquet(df=df, path=path2, dataset=True, mode="append") I receive the error ArrowTypeError: Expected a bytes object, got a 'dict' object

If I modify my code to read one record at a time, for df in wr.s3.read_json(path, lines=True, chunksize=1): wr.s3.to_parquet(df=df, path=path2, dataset=True, mode="append")

the code works, but I get one Parquet file for each log entry. This is very slow and inefficient. How can I read a JSON Lines file and write it to Parquet?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 6 (3 by maintainers)

Top GitHub Comments

1 reaction
edvorkin commented, Jul 14, 2020

@igorborgest after further inspection I discovered that you are right: some fields with the same name have different data types. That is how I got the exception. Thanks for your help.
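To check whether a JSON Lines file has this problem before converting it, something like the following sketch could help. It uses only the standard library, assumes every line is a JSON object, and the name find_mixed_type_fields is made up for illustration. It reports any field whose values show up with more than one Python type; note that it also flags int/float mixes, which may or may not matter for your data.

```python
import json
from collections import defaultdict


def find_mixed_type_fields(lines):
    """Return {field: sorted type names} for fields seen with more than one type.

    Hypothetical helper: assumes each non-empty line is a JSON object.
    """
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            if value is None:
                continue  # nulls do not conflict with any column type
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


# Made-up sample: "detail" is a string in one record and a dict in the other.
sample = [
    '{"level": "info", "detail": "started"}',
    '{"level": "error", "detail": {"code": 500}}',
]
mixed = find_mixed_type_fields(sample)
print(mixed)
```

Any field it reports is a candidate for the kind of type conflict described above.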

0 reactions
igorborgest commented, Jul 14, 2020

"Is there a way to enforce some schema from the catalog and reject non-compliant records?"

Yes there is, using the dtype argument:

dtype (Dict[str, str], optional) – Dictionary of columns names and Athena/Glue types to be casted. Useful when you have columns with undetermined or mixed data types. (e.g. {'col name': 'bigint', 'col2 name': 'int'})
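The dtype argument casts columns; it is not shown here rejecting anything. If the goal is to drop non-compliant records before they reach the DataFrame, one possible pre-filter in plain Python is sketched below. This is not part of awswrangler: EXPECTED and split_by_schema are hypothetical names, and the expected types are invented for the example rather than read from a catalog.

```python
import json

# Hypothetical expected Python type per field (an assumption for this sketch,
# not something loaded from the Glue catalog).
EXPECTED = {"level": str, "status": int}


def split_by_schema(lines, expected=EXPECTED):
    """Split JSON Lines records into (accepted, rejected) by exact field type.

    A record is accepted only if every expected field is present and its
    value has exactly the expected type.
    """
    accepted, rejected = [], []
    for line in lines:
        record = json.loads(line)
        compliant = all(
            type(record.get(name)) is typ for name, typ in expected.items()
        )
        (accepted if compliant else rejected).append(record)
    return accepted, rejected


# Made-up log lines: the second one has a dict where an int is expected.
logs = [
    '{"level": "info", "status": 200}',
    '{"level": "error", "status": {"code": 500}}',
]
accepted, rejected = split_by_schema(logs)
print(len(accepted), "accepted,", len(rejected), "rejected")
```

The accepted records then have one type per field, so they could be loaded into a single DataFrame and written in one call instead of one file per log entry.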
