Converting jsonLines file into parquet
I am writing log files in JSON Lines format into an S3 bucket. I was able to load them into a pandas DataFrame:
df = wr.s3.read_json(path, lines=True)
Then if I save it back with
wr.s3.to_parquet(df=df, path=path2, dataset=True, mode="append")
I receive the error:
ArrowTypeError: Expected a bytes object, got a 'dict' object
If I modify my code to
for df in wr.s3.read_json(path, lines=True, chunksize=1): wr.s3.to_parquet(df=df, path=path2, dataset=True, mode="append")
the code works, but then I get one Parquet file per log entry, which is very slow and inefficient. How can I read a JSON Lines file and write it to Parquet?
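This error typically means one of the JSON fields holds a nested object in some records and a scalar in others, so Arrow cannot infer a single Parquet type for the column. A minimal local sketch (plain pandas only, with made-up log records; the field names are hypothetical) that reproduces the mixed-type column:

```python
import io
import pandas as pd

# Hypothetical log lines: "detail" is a string in one record and a
# nested object (dict) in another -- a common cause of
# "ArrowTypeError: Expected a bytes object, got a 'dict' object"
# when the DataFrame is later written to Parquet.
lines = "\n".join([
    '{"level": "info", "detail": "started"}',
    '{"level": "error", "detail": {"code": 500, "msg": "boom"}}',
])

df = pd.read_json(io.StringIO(lines), lines=True)

# The column ends up as dtype "object" holding mixed Python types,
# which Arrow cannot map onto one Parquet type.
print(df["detail"].map(type).unique())
```

With chunksize=1 each chunk happens to contain only one type, so every individual write succeeds, which is why the slow per-line loop appeared to work.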
Issue Analytics
- Created: 3 years ago
- Comments: 6 (3 by maintainers)
@igorborgest After further inspection I discovered that you are right: some fields with the same name have different data types. That is how I got the exception. Thanks for your help.
Yes there is, using the dtype argument:
dtype (Dict[str, str], optional) – Dictionary of column names and Athena/Glue types to be casted. Useful when you have columns with undetermined or mixed data types. (e.g. {'col name': 'bigint', 'col2 name': 'int'})
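Besides casting at write time with dtype, the mixed column can also be normalized in pandas before the single to_parquet call, for example by serializing every value to a JSON string so Arrow sees one consistent type. A sketch under made-up records (the field name "detail" is hypothetical):

```python
import io
import json
import pandas as pd

# Hypothetical JSON Lines input where "detail" mixes strings and dicts.
lines = "\n".join([
    '{"level": "info", "detail": "started"}',
    '{"level": "error", "detail": {"code": 500, "msg": "boom"}}',
])
df = pd.read_json(io.StringIO(lines), lines=True)

# Serialize every value in the mixed column to a JSON string so the
# column holds a single Python type; a subsequent Parquet write
# (e.g. wr.s3.to_parquet or df.to_parquet) can then map it to a
# string column instead of failing on mixed types.
df["detail"] = df["detail"].map(json.dumps)
print(df["detail"].tolist())
```

This keeps the whole file as one read and one write, avoiding the one-file-per-record pattern from the chunksize=1 workaround.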