[CEP] Form XML Compression in BlobDB

Abstract

Form XML makes up 70% of the BlobDB data on ICDS. In environments with fewer form attachments this number is likely much higher. This CEP proposes to compress the XML data to reduce the storage requirements of the BlobDB.

Motivation

The form XML is saved to the BlobDB uncompressed and, as stated above, makes up at least 70% of the data in the BlobDB. Basic tests show that compression can considerably reduce the storage required.

On a sample of 10000 ICDS forms:

Average file size before compression:            9579 bytes
Average file size after compression (gzip -6):   2031 bytes
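
For reference, the two averages above work out to a ratio of roughly 4.7:1, or about 79% less storage per form. A minimal sketch of the arithmetic (the function name is illustrative and not part of CommCare):

```python
def compression_stats(before_bytes, after_bytes):
    """Return (compression ratio, fraction of storage saved)."""
    ratio = before_bytes / after_bytes
    saved = 1 - after_bytes / before_bytes
    return ratio, saved


# Averages from the sample of 10000 ICDS forms
ratio, saved = compression_stats(9579, 2031)
print(f"ratio {ratio:.2f}:1, saved {saved:.1%}")  # ratio 4.72:1, saved 78.8%
```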

Specification

Since this is an isolated use case the compression and decompression will not be handled by the BlobDB.

This specification allows for the compression of the XML to take place in three different places. The purpose of this is to allow experimentation to determine the best place for the compression to take place based on data from actual usage (see Metrics section below).

Compression

The compression should use gzip since it offers a good compromise between compression ratio and speed. It is also well supported in Python.
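
As a rough illustration of the Python support mentioned above, the standard library `gzip` module can compress and decompress a byte string in one call each. The helper names and the sample XML are made up for this sketch, and level 6 is assumed only because it matches the `gzip -6` measurement; the CEP does not fix a level.

```python
import gzip

GZIP_LEVEL = 6  # assumption: same level as the `gzip -6` sample measurement


def compress_xml(xml_bytes, level=GZIP_LEVEL):
    """Compress raw form XML bytes into gzip format."""
    return gzip.compress(xml_bytes, compresslevel=level)


def decompress_xml(blob):
    """Reverse compress_xml."""
    return gzip.decompress(blob)


# Made-up, highly repetitive XML standing in for a real form submission
sample = b"<?xml version='1.0'?><data>" + b"<question>answer</question>" * 200 + b"</data>"
packed = compress_xml(sample)
assert decompress_xml(packed) == sample
print(len(sample), "->", len(packed), "bytes")
```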

Metrics

The following metrics should be added to allow monitoring of this change both to ensure it is correctly functioning and to provide data to better understand the access patterns.

commcare.form_submission.read_compressed

This metric will give us insight into how many times compressed XML is read and where (in which process) it is read.

Tags:
- source
  - Django view name (get_current_request().resolver_match.view_name)
  - celery task name (get_current_task().name)
  - Django management command name (not sure how to get this)
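
One possible way to resolve the `source` tag is sketched below. It is an assumption-laden illustration, not CommCare code: the function takes the request and task objects as arguments (in practice they would come from `get_current_request()` and `get_current_task()`), and the management command fallback simply inspects `argv`, assuming the process was launched as `manage.py <command>`. The view, task and command names are placeholders.

```python
import sys
from types import SimpleNamespace


def get_source_tag(request=None, task=None, argv=None):
    """Pick a value for the `source` tag from whatever context is available."""
    match = getattr(request, "resolver_match", None)
    if match is not None and getattr(match, "view_name", None):
        return match.view_name
    if task is not None and getattr(task, "name", None):
        return task.name
    argv = sys.argv if argv is None else argv
    # Assumption: management commands are started as `manage.py <command> ...`
    if len(argv) > 1 and argv[0].endswith("manage.py"):
        return argv[1]
    return "unknown"


# Stand-ins for a Django request and a celery task (placeholder names)
view_request = SimpleNamespace(resolver_match=SimpleNamespace(view_name="example_view"))
celery_task = SimpleNamespace(name="example.tasks.process_form")

print(get_source_tag(request=view_request, argv=["gunicorn"]))
print(get_source_tag(task=celery_task, argv=["celery"]))
print(get_source_tag(argv=["manage.py", "example_command"]))
```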

Changes to CommCare

This should be implemented within the BlobDB in order to make it transparent to calling code.

Update the BlobDB Metadata to keep track of compressed data

    compressed = BooleanField(null=True)

Question: should this be indexed?

Update put API to compress data based on the type code

Since we only want to compress form XML at this stage, the decision to compress can be based on the type code. Any call to blobdb.put with a type code of CODES.form_xml (2) should compress the content before saving it to the backend.

In addition to compressing the data, the blob metadata must record that the data was compressed by setting the compressed flag to True:

    def put(self, content, **blob_meta_args):
        # This logic could be moved to the `MetaDB`
        blob_meta_args['compressed'] = blob_meta_args.get('type_code') == CODES.form_xml
        meta = self.metadb.new(**blob_meta_args)
        ...
        if meta.compressed:
            # consider creating a streamed compression here: https://stackoverflow.com/a/31566082/632517
            content = compress(content)
        ...
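
The `compress` call above is left undefined in the CEP. One way it could avoid holding the whole payload in memory is sketched here, assuming `content` is a binary file-like object. This is not the approach from the linked Stack Overflow answer: it spools the compressed bytes to a temporary file and returns that, and `compress_stream` and its parameters are hypothetical names.

```python
import gzip
import io
import shutil
import tempfile


def compress_stream(content, level=6, spool_limit=1024 * 1024):
    """Gzip a binary file-like object in chunks.

    Returns a new file-like object positioned at the start of the
    compressed bytes. Output larger than `spool_limit` spills to disk.
    """
    out = tempfile.SpooledTemporaryFile(max_size=spool_limit)
    # Closing the GzipFile writes the gzip trailer but leaves `out` open
    with gzip.GzipFile(fileobj=out, mode="wb", compresslevel=level) as gz:
        shutil.copyfileobj(content, gz)
    out.seek(0)
    return out


raw = b"<data>" + b"<item>value</item>" * 500 + b"</data>"
compressed = compress_stream(io.BytesIO(raw))
assert gzip.decompress(compressed.read()) == raw
```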

Update get API to support decompressing

Currently the get API only takes the blob key which is used to fetch the blob directly from the backend. In order to support decompression the API will need to be modified to pass in the meta object as well:

    def get(self, type_code, key=None, meta=None):
        if type_code == CODES.form_xml and not meta:
            raise ...
        blobstream = ....
        if meta and meta.compressed:
            return GzipFile(fileobj=blobstream, mode='r')
        else:
            return blobstream

Impact on users

No impact on end users is expected.

Impact on hosting

No impact is expected for hosting CommCare. This change should be completely transparent.

Backwards compatibility

Since we will likely not compress old data (at least not initially) it is a requirement that the code support compressed and uncompressed XML. This requirement will make the change backwards compatible by default.
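
A small sketch of how the read path could serve both kinds of data, assuming the nullable `compressed` flag is left as `None` on rows written before this change. `open_blob` and the `SimpleNamespace` metadata objects are stand-ins for illustration, not the BlobDB API.

```python
import gzip
import io
from types import SimpleNamespace


def open_blob(blobstream, meta=None):
    """Wrap the backend stream in a decompressor only when the metadata says so."""
    # None (legacy rows) and False both mean the bytes are stored uncompressed
    if meta is not None and meta.compressed:
        return gzip.GzipFile(fileobj=blobstream, mode="rb")
    return blobstream


xml = b"<data><case/></data>"
old_meta = SimpleNamespace(compressed=None)  # written before compression existed
new_meta = SimpleNamespace(compressed=True)  # written by the updated put()

# Both reads return the same XML regardless of how it was stored
assert open_blob(io.BytesIO(xml), old_meta).read() == xml
assert open_blob(io.BytesIO(gzip.compress(xml)), new_meta).read() == xml
```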

Release Timeline

End of Q2 2020

Open questions and issues

Should the metadb.content_length be the compressed or uncompressed size?
Should we update the size function?

Top GitHub Comments

millerdev commented, Apr 14, 2020

This is shaping up well. Regarding the get API, it should accept either

- key and type_code, or
- meta alone.

    def get(self, key=None, type_code=None, *, meta=None):
        key = validate_get_args(key, type_code, meta)
        blobstream = ....
        if meta and meta.compressed:
            return GzipFile(fileobj=blobstream, mode='r')
        return blobstream

    def validate_get_args(key, type_code, meta):
        if key is not None or type_code is not None:
            if meta is not None:
                raise ValueError("'key' and 'meta' are mutually exclusive")
            if type_code == CODES.form_xml:
                raise ValueError("form XML must be loaded with 'meta' argument")
            if key is None or type_code is None:
                raise ValueError("'key' must be specified with 'type_code'")
            return key
        if meta is None:
            raise ValueError("'key' and 'type_code' or 'meta' is required")
        return meta.key

Edit: update order of checks in validate_get_args for better error output.

snopoke commented, Mar 18, 2020

I’ve updated the CEP to reflect the comments and discussion.
