[CEP] Form XML Compression in BlobDB
Abstract
Form XML makes up 70% of the BlobDB data on ICDS. In environments with fewer form attachments this number is likely much higher. This CEP proposes to compress the XML data to reduce the storage requirements of the BlobDB.
Motivation
The form XML is saved to the BlobDB uncompressed and, as stated above, it makes up at least 70% of the data in the BlobDB. Basic tests show that compression can considerably reduce storage: in the sample below, gzip reduced the average form size by roughly 79%.
On a sample of 10000 ICDS forms:
| Measurement | Size |
|---|---|
| Average file size before compression | 9579 bytes |
| Average file size after compression (gzip -6) | 2031 bytes |
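The measurement above could be reproduced along the following lines. This is a minimal sketch, not the script used for the ICDS sample: `average_sizes` is a hypothetical helper, and the only figures taken from the CEP are the two averages.

```python
import gzip


def average_sizes(documents, level=6):
    """Return (average raw size, average gzip size) in bytes for a list of byte strings."""
    raw = [len(doc) for doc in documents]
    packed = [len(gzip.compress(doc, compresslevel=level)) for doc in documents]
    return sum(raw) / len(raw), sum(packed) / len(packed)


# Averages reported in the CEP for 10000 ICDS forms.
before, after = 9579, 2031
reduction = 1 - after / before
print(f"{reduction:.1%}")  # prints 78.8%
```

Running `average_sizes` over a representative sample of form XML from another environment would show whether the ICDS ratio holds there as well.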
Specification
Since this is an isolated use case, the compression and decompression will not be handled by the BlobDB.
This specification allows the XML to be compressed in three different places. The purpose of this is to allow experimentation to determine the best place for compression based on data from actual usage (see Metrics section below).
Compression
The compression should use gzip since it offers a good compromise between compression ratio and speed. It is also well supported in Python.
Metrics
The following metrics should be added to allow monitoring of this change, both to ensure it is functioning correctly and to provide data to better understand the access patterns.
- `commcare.form_submission.read_compressed`
  This metric will give us insight into how many times a compressed XML is read and where (which process) it is read.
  Tags:
  - source
    - Django view name (`get_current_request().resolver_match.view_name`)
    - celery task name (`get_current_task().name`)
    - Django management command name (not sure how to get this)
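The lookup order for the `source` tag might be sketched as follows. This is an illustration under stated assumptions: `get_source_tag` is a hypothetical helper, the `request` and `task` arguments stand in for whatever `get_current_request()` and `get_current_task()` return, and the `sys.argv` fallback for management commands is only one possible answer to the open point above, not something the CEP specifies.

```python
import sys
from types import SimpleNamespace


def get_source_tag(request=None, task=None, argv=None):
    """Pick a value for the `source` tag: view name, then task name, then command name."""
    # Django sets request.resolver_match to None until the URL has been resolved.
    resolver_match = getattr(request, "resolver_match", None)
    if resolver_match is not None:
        return resolver_match.view_name
    if task is not None:
        return task.name
    # Assumption: management commands are started as `manage.py <command> ...`.
    argv = sys.argv if argv is None else argv
    if len(argv) > 1 and argv[0].endswith("manage.py"):
        return argv[1]
    return "unknown"


# Stub objects with made-up names, used only to exercise the function.
view_request = SimpleNamespace(resolver_match=SimpleNamespace(view_name="example_view"))
print(get_source_tag(request=view_request))
```

Falling back to a fixed value such as `"unknown"` keeps the number of distinct tag values bounded when none of the three sources applies.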
Changes to CommCare
This should be implemented within the BlobDB in order to make it transparent to calling code.
- Update the BlobDB Metadata to keep track of compressed data

      compressed = BooleanField(null=True)

  Question: should this be indexed?
- Update `put` API to compress data based on the type code

  Since we only want to compress form XML at this stage, the decision to compress can be based on the type code. Any calls to `blobdb.put` with type code of `CODES.form_xml` (2) should compress the content before saving it to the backend.

  In addition to compressing the data, the blob metadata must record that the data was compressed by setting the `compressed` flag to `True`:

      def put(self, content, **blob_meta_args):
          # This logic could be moved to the `MetaDB`
          blob_meta_args['compressed'] = blob_meta_args.get('type_code') == CODES.form_xml
          meta = self.metadb.new(**blob_meta_args)
          ...
          if meta.compressed:
              # consider creating a streamed compression here: https://stackoverflow.com/a/31566082/632517
              content = compress(content)
          ...
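The comment in the `put` snippet suggests streaming the compression rather than compressing the whole payload in memory. One way this might look, using only the standard library, is sketched below. `compress_stream` is a hypothetical name, and the spooled temporary file with its 1 MiB threshold is an assumption made for illustration, not part of the CEP.

```python
import gzip
import io
import shutil
import tempfile


def compress_stream(content, level=6, chunk_size=64 * 1024):
    """Gzip a binary file-like object chunk by chunk and return a rewound file-like object."""
    # Small payloads stay in memory; larger ones spill to disk after 1 MiB.
    out = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)
    # Closing the GzipFile writes the gzip trailer but leaves `out` open.
    with gzip.GzipFile(filename="", fileobj=out, mode="wb", compresslevel=level) as gz:
        shutil.copyfileobj(content, gz, chunk_size)
    out.seek(0)
    return out


xml = b"<data><name>example</name></data>" * 1000
compressed = compress_stream(io.BytesIO(xml))
```

The returned object can be handed to the backend in place of the original `content`, since it supports `read()` like any other binary stream.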
- Update `get` API to support decompressing

  Currently the `get` API only takes the blob `key`, which is used to fetch the blob directly from the backend. In order to support decompression, the API will need to be modified to pass in the `meta` object as well:

      def get(self, type_code, key=None, meta=None):
          if type_code == CODES.form_xml and not meta:
              raise ...
          blobstream = ...
          if meta and meta.compressed:
              return GzipFile(fileobj=blobstream, mode='r')
          else:
              return blobstream
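To make the interplay between `put`, `get` and the `compressed` flag concrete, here is a toy round trip. It is a sketch under stated assumptions: `InMemoryBlobDB`, the `FORM_XML` constant and the `SimpleNamespace` metadata objects are stand-ins invented for this example, not the real BlobDB or MetaDB classes.

```python
import gzip
import io
from types import SimpleNamespace

FORM_XML = 2  # stands in for CODES.form_xml


class InMemoryBlobDB:
    """Toy stand-in for the BlobDB that keeps blobs in a dict."""

    def __init__(self):
        self._blobs = {}

    def put(self, content, key, type_code):
        compressed = type_code == FORM_XML
        data = content.read()
        if compressed:
            data = gzip.compress(data, compresslevel=6)
        self._blobs[key] = data
        # The returned object plays the role of the blob metadata row.
        return SimpleNamespace(key=key, type_code=type_code, compressed=compressed)

    def get(self, type_code, key=None, meta=None):
        if type_code == FORM_XML and not meta:
            raise ValueError("meta is required to read form XML")
        blobstream = io.BytesIO(self._blobs[meta.key if meta else key])
        if meta and meta.compressed:
            return gzip.GzipFile(fileobj=blobstream, mode="rb")
        return blobstream


db = InMemoryBlobDB()
xml = b"<data><name>example</name></data>"
meta = db.put(io.BytesIO(xml), key="new-form", type_code=FORM_XML)

# A blob written before this change: stored uncompressed, flag left as NULL.
db._blobs["old-form"] = xml
legacy_meta = SimpleNamespace(key="old-form", type_code=FORM_XML, compressed=None)
```

Because a `NULL` flag is falsy, metadata rows created before the change are read back unchanged without any migration.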
Impact on users
No impact on end users is expected.
Impact on hosting
No impact is expected for hosting CommCare. This change should be completely transparent.
Backwards compatibility
Since we will likely not compress old data (at least not initially) it is a requirement that the code support compressed and uncompressed XML. This requirement will make the change backwards compatible by default.
Release Timeline
End of Q2 2020
Open questions and issues
- Should the `metadb.content_length` be the compressed or uncompressed size?
- Should we update the size function?
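One point that may inform the `content_length` question: a gzip stream records the uncompressed length (modulo 2**32) in its last four bytes, as defined by the gzip file format (RFC 1952). The sketch below reads that field. `gzip_uncompressed_size` is a hypothetical helper, and it assumes a single-member gzip stream smaller than 4 GiB.

```python
import gzip
import struct


def gzip_uncompressed_size(compressed):
    """Read the ISIZE trailer field: uncompressed length mod 2**32, little-endian."""
    return struct.unpack("<I", compressed[-4:])[0]


xml = b"<data><case/></data>" * 500
packed = gzip.compress(xml, compresslevel=6)
stored_size = len(packed)                         # what the backend actually holds
original_size = gzip_uncompressed_size(packed)    # what readers will receive
```

If `content_length` stored the compressed size, the uncompressed size would still be recoverable this way at the cost of reading the end of the blob. Storing the uncompressed size instead would keep the value consistent with what callers read.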
Top GitHub Comments
This is shaping up well. Regarding the `get` API, it should either accept `key` and `type_code` OR `meta` alone.
Edit: update order of checks in `validate_get_args` for better error output.
I’ve updated the CEP to reflect the comments and discussion.