gzip-compressed item exports

I think compressing exported data would be useful in a lot of cases, and it’d be good to have a built-in way to do compressed exports in Scrapy. I have the implementation below, but I’m not sure a separate storage is the right approach: it probably makes more sense to handle .jl.gz / .csv.gz / … extensions just like .jl / .csv / …:

# -*- coding: utf-8 -*-
import os
import gzip

from zope.interface import implementer
from w3lib.url import file_uri_to_path
from scrapy.extensions.feedexport import IFeedStorage


@implementer(IFeedStorage)
class GzipFileFeedStorage(object):
    """
    Storage which exports data to a gzipped file.
    To use it, add

    ::

        FEED_STORAGES = {
            'gzip': 'myproject.exports.GzipFileFeedStorage',
        }

    to settings.py and then run scrapy crawl like this::

        scrapy crawl foo -o gzip:/path/to/items.jl

    The command above will create ``/path/to/items.jl.gz`` file
    (.gz extension is added automatically).

    Other export formats are also supported, but it is recommended to use .jl.
    If a spider is killed, the gz archive may be partially broken.
    In this case the user should read the broken archive line by line and stop
    on gzip decoding errors, discarding the tail. It works OK with .jl exports.
    """
    COMPRESS_LEVEL = 4

    def __init__(self, uri):
        self.path = file_uri_to_path(uri) + ".gz"

    def open(self, spider):
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname, exist_ok=True)
        return gzip.open(self.path, 'ab', compresslevel=self.COMPRESS_LEVEL)

    def store(self, file):
        file.close()

What do you think?
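
The recovery step described in the docstring above (read the archive line by line, stop at the first gzip decoding error, discard the tail) could look roughly like the sketch below. It is only an illustration: the function name `read_partial_jsonlines_gz` and its `path` argument are hypothetical and not part of Scrapy, and it assumes a JSON Lines export.

```python
import gzip
import json
import zlib


def read_partial_jsonlines_gz(path):
    """Yield items from a possibly truncated .jl.gz file.

    Hypothetical helper, not a Scrapy API. Reading stops silently at the
    first gzip decoding error or at the first line that is not valid JSON.
    """
    try:
        with gzip.open(path, "rb") as f:
            for line in f:
                try:
                    yield json.loads(line)
                except ValueError:
                    # An incomplete last line: discard it and stop.
                    return
    except (EOFError, OSError, zlib.error):
        # The archive ended before the end-of-stream marker or is
        # otherwise corrupt: everything read so far has been yielded.
        return
```

Because the storage opens the file in append mode, an archive can hold several gzip members; complete members are read in full and only the damaged tail is dropped.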

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

4 reactions
tianhuil commented, Nov 5, 2019

Would love to see this merged. Would anyone object if we merged @redapple’s comment on Aug 31, 2016 into the codebase? I’ve just added it in https://github.com/scrapy/scrapy/pull/4131

@kmike, @redapple: could we merge?

4 reactions
starrify commented, Mar 2, 2017

+1 for supporting gzip. And +1 for making it part of the feed exporter (instead of feed storage). Below are the sample lines that I’m using for a project:

# coding: utf8

import gzip

from scrapy.exporters import JsonLinesItemExporter


class JsonLinesGzipItemExporter(JsonLinesItemExporter):
    """
    Sample exporter for .jl + .gz format.
    To use it, add
    ::

        FEED_EXPORTERS = {
            'jl.gz': 'myproject.exporters.JsonLinesGzipItemExporter',
        }
        FEED_FORMAT = 'jl.gz'

    to settings.py and then run scrapy crawl like this::

        scrapy crawl foo -o s3://path/to/items.jl.gz

    (if `FEED_FORMAT` is not explicitly specified, you'll need to add
    `-t jl.gz` to the command above)
    """

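    def __init__(self, file, **kwargs):
        # Pass the mode explicitly: without it GzipFile takes the mode from
        # fileobj, and falls back to 'rb' when it cannot determine one.
        gzfile = gzip.GzipFile(fileobj=file, mode='wb')
        super(JsonLinesGzipItemExporter, self).__init__(gzfile, **kwargs)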

    def finish_exporting(self):
        self.file.close()
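
The exporter above relies on a property of `gzip.GzipFile`: closing it writes the gzip trailer but leaves the wrapped file object open, so the underlying file stays usable afterwards. A small standalone sketch of that behavior, with an `io.BytesIO` buffer standing in (as an assumption) for the file object that Scrapy would pass to the exporter:

```python
import gzip
import io
import json

# Stand-in for the file object handed to the exporter (an assumption).
raw = io.BytesIO()

gz = gzip.GzipFile(fileobj=raw, mode="wb")
for item in ({"id": 1}, {"id": 2}):
    gz.write(json.dumps(item).encode("utf-8") + b"\n")

# Closing the GzipFile flushes the compressed data and trailer into `raw`
# without closing `raw` itself.
gz.close()

compressed = raw.getvalue()
lines = gzip.decompress(compressed).splitlines()
```

After `gz.close()`, `raw` is still open and holds a complete gzip stream, which is why `finish_exporting` can close `self.file` (the `GzipFile`) and leave the original file alone.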