Implementing a CompressedTextField for Django

Driving to the office today, I had the idea of implementing a Django field type that transparently stores its data in compressed form in the database.

A standard Lorem Ipsum paragraph (shown below) takes 446 bytes as ASCII, but only 282 bytes in the database when stored through my CompressedTextField. That is roughly 63% of the original size. I think the ratio might get even better for longer texts.

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
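The ratio can be checked with Python's standard gzip module. The following is a minimal sketch, assuming Python 3; it is not the code behind the numbers quoted above, and the helper name compression_ratio is made up for illustration. The exact byte count depends on the compression level and the gzip header, so it may differ slightly from the 282 bytes mentioned.

```python
import gzip

LOREM = (
    "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
    "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
    "ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut "
    "aliquip ex ea commodo consequat. Duis aute irure dolor in "
    "reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla "
    "pariatur. Excepteur sint occaecat cupidatat non proident, sunt in "
    "culpa qui officia deserunt mollit anim id est laborum."
)


def compression_ratio(text):
    """Return (raw_size, compressed_size, ratio) for a text (hypothetical helper)."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    return len(raw), len(packed), len(packed) / len(raw)


raw_size, packed_size, ratio = compression_ratio(LOREM)
print(f"{raw_size} bytes raw, {packed_size} bytes compressed ({ratio:.0%})")
```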

The Code

The idea is to subclass a built-in field and extend it a bit. The only change to your model should be to use the custom field instead of a built-in one. Getting and setting the data should be handled entirely in the custom field's code.

The following code can go into your models.py file or (maybe better) into a separate module. For simplicity, I put it into the models.py file that contains the model using this field.

First, let's get the functions we need to compress and uncompress the data. Compression already exists in Django as compress_string, because the gzip middleware uses it. I wrote uncompress_string to reverse the compression. (The code is also available at djangosnippets.org.)

from django.db import models
from django.utils.text import compress_string
from django.db.models import signals
from django.dispatch import dispatcher

def uncompress_string(s):
    '''helper function to reverse django.utils.text.compress_string'''
    import cStringIO, gzip
    try:
        zbuf = cStringIO.StringIO(s)
        zfile = gzip.GzipFile(fileobj=zbuf)
        ret = zfile.read()
        zfile.close()
    except:
        ret = s
    return ret
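For readers on Python 3, where cStringIO no longer exists, the same pair of helpers can be sketched with the standard gzip module alone. This is an assumption-laden stand-in, not the snippet above: the names compress_bytes and uncompress_bytes are hypothetical, the functions work on bytes rather than Python 2 strings, and the bare except is narrowed to the errors gzip raises for invalid input.

```python
import gzip
import zlib


def compress_bytes(data):
    """Gzip-compress a bytes object (hypothetical stand-in for compress_string)."""
    return gzip.compress(data)


def uncompress_bytes(data):
    """Reverse compress_bytes; return the input unchanged if it is not gzip data."""
    try:
        return gzip.decompress(data)
    except (OSError, EOFError, zlib.error):
        # Not valid gzip data, for example a value that was never compressed.
        return data


packed = compress_bytes(b"hello world")
print(uncompress_bytes(packed))         # round trip through gzip
print(uncompress_bytes(b"plain text"))  # falls back to the input
```

The fallback keeps the behaviour of the original helper: data that cannot be decompressed is handed back as it is.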

Next we need to define a model field that extends an already provided field, in this case the TextField.

class CompressedTextField(models.TextField):
    '''
    transparently compress data before hitting the
    db and uncompress after fetching
    '''

    def get_db_prep_save(self, value):
        if value is not None:
            value = compress_string(value)
        return models.TextField.get_db_prep_save(self, value)

    def _get_val_from_obj(self, obj):
        if obj:
            return uncompress_string(getattr(obj, self.attname))
        else:
            return self.get_default()

    def post_init(self, instance=None):
        value = self._get_val_from_obj(instance)
        if value:
            setattr(instance, self.attname, value)

    def contribute_to_class(self, cls, name):
        super(CompressedTextField, self).contribute_to_class(cls, name)
        dispatcher.connect(self.post_init,
                            signal=signals.post_init, sender=cls)

    def get_internal_type(self):
        return "TextField"

    def db_type(self):
        from django.conf import settings
        if settings.DATABASE_ENGINE == 'mysql':
            return 'longblob'
        else:
            raise Exception, '%s currently works only with MySQL'\
             % self.__class__.__name__

A few explanations: The method get_db_prep_save() is the one that handles data compression while saving to the database. It is pretty easy in Django to modify data before writing to the db, so this is everything we need to code.

The other way, modifying data while fetching from the db, is a bit more complicated. The way that works for me(TM) is to override the _get_val_from_obj() method to do the uncompression and then use contribute_to_class() to attach the post_init() method to the model class, so that it is executed after the model is initialized. Essentially this means that when you create a model instance which has a CompressedTextField, Django will create the model instance, fetch the data from the database and emit the post_init signal. At this point my post_init() method kicks in and decompresses the data. While working with the instance we always deal with the decompressed data, and only right before saving to the database will get_db_prep_save() compress it again.
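The round trip described above can be illustrated without Django. The sketch below assumes Python 3 and uses a plain dict as a stand-in for the database table; the names to_db, from_db and fake_table are hypothetical and only mirror the roles of get_db_prep_save() and the post_init hook, they are not Django API.

```python
import gzip

# Hypothetical stand-in for a database table: primary key -> stored value.
fake_table = {}


def to_db(value):
    """Plays the role of get_db_prep_save(): runs right before writing."""
    if value is None:
        return None
    return gzip.compress(value.encode("utf-8"))


def from_db(stored):
    """Plays the role of the post_init hook: runs right after loading."""
    if stored is None:
        return None
    return gzip.decompress(stored).decode("utf-8")


# "Save": the table only ever sees compressed bytes.
fake_table[1] = to_db("a private message")

# "Load": the application only ever sees the decompressed text.
loaded = from_db(fake_table[1])
print(loaded)
```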

The db_type() method is used by the svn version of Django to create database columns with a custom type. If you are using Django 0.96, you have to manually alter your database column to a blob or longblob type. The method is only provided as a starting point and only implements MySQL as a database engine. In fact it's even untested, because I use Django 0.96, so YMMV.

The last step is to include this field in your model. This is simple; here's an example:

class Item(models.Model):
    title = models.CharField(maxlength=200)
    summary = CompressedTextField(blank=True)

Drawbacks

Of course, using the CompressedTextField has drawbacks. The first I can think of is that filter() will not work on this field, because the data in the database is not what Django expects. The same goes for order_by(), but whoever orders by a text field at the SQL level probably has other problems anyway. I don't think this is a big problem, since most people will not filter on text fields.
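Why filtering fails can be shown with a small sketch, assuming Python 3 and a dict of made-up rows in place of a real table. A substring match against the stored bytes, which is roughly what a database-side filter would do, finds nothing, while the same match after decompression does.

```python
import gzip

# Hypothetical stored rows: primary key -> gzip-compressed summary.
stored = {
    1: gzip.compress(b"Lorem ipsum dolor sit amet, consectetur adipisicing elit"),
    2: gzip.compress(b"Ut enim ad minim veniam, quis nostrud exercitation"),
}

needle = b"consectetur"

# What a database-side substring match sees: compressed bytes, so no hit.
db_side_hits = [pk for pk, blob in stored.items() if needle in blob]

# Filtering after decompression works, but every row has to be fetched first.
app_side_hits = [
    pk for pk, blob in stored.items() if needle in gzip.decompress(blob)
]

print(db_side_hits, app_side_hits)
```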

The second drawback is decreased performance due to the overhead of compressing and uncompressing. This should be measured, but at this point it's not important for my application, so I haven't done any tests.
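A first rough measurement could look like the following sketch, assuming Python 3 and the standard timeit module. The helper name measure and the sample text are made up for illustration, and the numbers only cover the gzip calls themselves, not the database or Django overhead.

```python
import gzip
import timeit


def measure(text, repeats=1000):
    """Return average seconds per compress and per decompress call."""
    raw = text.encode("utf-8")
    packed = gzip.compress(raw)
    t_compress = timeit.timeit(lambda: gzip.compress(raw), number=repeats) / repeats
    t_decompress = timeit.timeit(lambda: gzip.decompress(packed), number=repeats) / repeats
    return t_compress, t_decompress


sample = "Lorem ipsum dolor sit amet, consectetur adipisicing elit. " * 8
t_c, t_d = measure(sample)
print(f"compress: {t_c * 1e6:.1f} us, decompress: {t_d * 1e6:.1f} us per call")
```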

Benefits

One benefit of using the CompressedTextField is the space saved at the database level. This might matter if your data gets really big; you have to decide for yourself whether that justifies the compression.

The other and, in my opinion, more interesting point is this: using a CompressedTextField hides sensitive data from the admin's eye. Imagine implementing a private message system where searching the message content is not a requirement. Saving the messages in a compressed state is good not only because you save space, but also because your db-admin will not be able to read the messages while looking into the database for whatever reason. It's not an improvement in security, because the data can easily be decompressed, but this way it's easier for your (not evil) db-admin to respect the privacy of the data: accidental disclosure is no longer an option.
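To make the "not an improvement in security" part concrete, here is a short sketch, assuming Python 3 and a made-up message. Gzip data is recognisable by its two leading magic bytes, and a single standard-library call restores the text without any key.

```python
import gzip

stored = gzip.compress("meet me at noon".encode("utf-8"))

# Gzip data starts with the magic bytes 0x1f 0x8b, so it is easy to spot.
looks_like_gzip = stored[:2] == b"\x1f\x8b"

# One call recovers the plain text; no key or secret is involved.
recovered = gzip.decompress(stored).decode("utf-8")

print(looks_like_gzip, recovered)
```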

Conclusion

This works for me, but is not thoroughly tested. Maybe it's total crap. But I would really like to get some feedback on this piece of code. The code is also available at djangosnippets.org.

