gzip-compressed item exports
See original GitHub issueI think compressing exported data can be useful in a lot of cases. It’d be good to have a built-in way for compressed exports in Scrapy. I have this implementation, but probably it makes more sense to handle .jl.gz / csv.gz / … extensions just like .jl / .csv / … instead of creating a storage, I’m not sure:
# -*- coding: utf-8 -*-
import os
import gzip
from zope.interface import Interface, implementer
from w3lib.url import file_uri_to_path
from scrapy.extensions.feedexport import IFeedStorage
@implementer(IFeedStorage)
class GzipFileFeedStorage(object):
"""
Storage which exports data to a gzipped file.
To use it, add
::
FEED_STORAGES = {
'gzip': 'myproject.exports.GzipFileFeedStorage',
}
to settings.py and then run scrapy crawl like this::
scrapy crawl foo -o gzip:/path/to/items.jl
The command above will create ``/path/to/items.jl.gz`` file
(.gz extension is added automatically).
Other export formats are also supported, but it is recommended to use .jl.
If a spider is killed then gz archive may be partially broken.
In this case it user should read the broken archive line-by-line and stop
on gzip decoding errors, discarding the tail. It works OK with .jl exports.
"""
COMPRESS_LEVEL = 4
def __init__(self, uri):
self.path = file_uri_to_path(uri) + ".gz"
def open(self, spider):
dirname = os.path.dirname(self.path)
if dirname and not os.path.exists(dirname):
os.makedirs(dirname, exist_ok=True)
return gzip.open(self.path, 'ab', compresslevel=self.COMPRESS_LEVEL)
def store(self, file):
file.close()
What do you think?
Issue Analytics
- State:
- Created 7 years ago
- Comments:5 (3 by maintainers)
Top Results From Across the Web
Using Compressed Data Transfers
The gzip-compressed and compressed methods are most beneficial to users who transfer large data files or those with slow data connections. However, they...
Read more >Transcoding of gzip-compressed files | Cloud Storage
This page discusses the conversion of files to and from a gzip-compressed state. The page includes an overview of transcoding, best practices for...
Read more >Compression on Bulk Export APIs? - Marketing Nation - Marketo
For Bulk Activity Export - exporting a file it is both stored as plain text and transmitted without gzip compression. This data is...
Read more >Quick Tip: EXPORT TO PARQUET Compression with GZIP, ...
EXPORT TO PARQUET exports a table, columns from a table, or query results to files in the Parquet format. These Parquet files use...
Read more >Oracle export to compressed gzip
All rights reserved. Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.1.0 - Production With the Partitioning, OLAP and Data Mining options
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Would love to see this merged. Would anyone object if we merged @redapple’s comment on Aug 31, 2016 into the codebase? I’ve just added it in https://github.com/scrapy/scrapy/pull/4131
@kmike, @redapple: could we merge?
+1 for supporting gzip. And +1 for making it part of the feed exporter (instead of feed storage). Below are the sample lines that I’m using for a project: