Speed up Django’s collectstatic command with Collectfasta

Django’s collectstatic command (added in Django 1.3 – March 23, 2011) was designed for storage backends where file retrieval was cheap because the files were on your local disk.

Django 1.4 (March 23, 2012) introduced CachedStaticFilesStorage, which adds MD5 hashes to filenames so that multiple versions of a file can stick around while you do a blue/green deployment. It also meant you could put your app in front of a CDN and the filename hashes would ensure that when the file changed, so did the cache key. This meant you didn’t need to worry about invalidating the CDN assets or users’ browser caches.
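To make the cache-busting idea concrete, here is a minimal sketch of content-hashed filenames. It is not Django’s implementation: hashed_name is a hypothetical helper, and placing a 12-character digest before the extension is an assumption made for illustration.

```python
import hashlib
from pathlib import PurePosixPath


def hashed_name(name: str, content: bytes) -> str:
    """Return `name` with a short MD5 digest of `content` placed before the extension.

    Hypothetical helper for illustration only, not Django's API.
    """
    digest = hashlib.md5(content).hexdigest()[:12]
    path = PurePosixPath(name)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


# Same logical file, two different contents: the names differ, so a CDN or
# browser cache treats them as two separate resources.
old = hashed_name("css/site.css", b"body { color: black; }")
new = hashed_name("css/site.css", b"body { color: navy; }")
```

Because the name is derived from the content, an unchanged file keeps its name (and its cached copy), while any edit produces a new name and therefore a new cache key.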

Later on (Django 1.7 – September 2, 2014) we got ManifestStaticFilesStorage, which stores the filenames in a JSON file, assisting with hosting on remote storage like S3.

The original django-storages is even older than collectstatic: the initial commit was back on Jun 12, 2008. Its purpose was to provide a storage backend for AWS S3, which has since taken over the world. It also provides S3ManifestStaticStorage, which is great for static file serving. You don’t even need to set up a static web server to serve the files, because they can come straight from the bucket or CDN.

The big problem with all of this is that running collectstatic on S3-based storage is painfully slow. This is especially true of hashing storage, which uses the post-process hook to modify and re-upload files to update file references (which can then trigger further updates). There used to be a solution to this: Collectfast (released May 2013) was an awesome drop-in replacement for the collectstatic management command which would auto-magically speed things up. Unfortunately, it has been archived and is no longer maintained, with the last release being in 2020. Waiting for collectstatic to run has become tiring.

I’ve spent the past few weekends forking the original Collectfast, trying to get the repo up to date and working again. It has been an interesting challenge, and I’ve finally got it to a state where I am happy with the performance improvements it provides over the Django command and am confident it works. Introducing… Collectfasta, an updated fork of Collectfast that is even faster than before.

What’s new in Collectfasta?

You can now run all tests without connecting to cloud services

One reason Collectfast was archived was that it was difficult to find a new maintainer: most tests, specifically the ‘live tests’, required real Google Cloud Platform (GCP) and AWS credentials to run.

I have now set up popular mocking tools LocalStack and fake-gcs-server to allow these tests to run without any AWS or GCP credentials. This has also opened up a new avenue of testing since you can run these mocks for free: testing for performance on many files rather than just a single file. I’m observing performance improvements of 5x-10x with local mocks, and these improvements are even more significant with remote APIs.

I’ve kept both the live tests and the docker tests running on master for better coverage.

AWS_PRELOAD_METADATA reimplemented

AWS_PRELOAD_METADATA was removed in django-storages 1.10 (2020-08-30), and hard-coding preload_metadata = True was a key performance optimisation that Collectfast made in its boto3 strategy. The reason was straightforward: during collectstatic, the exists method checks whether a file already exists. This is fine when exists is cheap, but for S3Storage, exists makes a HeadObject request to the S3 API every time, for every file.

In contrast, when preload_metadata was working:

  1. it would first call ListObjectsV2 to see what was already there,
  2. store the results in a dict,
  3. then have exists check the dict first, returning True if the key was present and otherwise deferring to the original implementation.

This significantly speeds up subsequent collectstatic runs on the same files, since you’re replacing hundreds of API calls with one.
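The difference in request counts can be sketched without touching AWS. FakeS3 and PreloadedExists below are hypothetical names invented for this illustration, not the django-storages or Collectfasta classes, and the fake client ignores details such as the pagination of real listing calls.

```python
class FakeS3:
    """Hypothetical stand-in for an S3 client that counts API calls."""

    def __init__(self, keys):
        self._keys = set(keys)
        self.calls = 0

    def head_object(self, key):
        # One request per key, like a HeadObject call.
        self.calls += 1
        return key in self._keys

    def list_objects(self):
        # One request for the whole bucket, like a ListObjectsV2 call.
        self.calls += 1
        return list(self._keys)


class PreloadedExists:
    """Answer exists() from a dict filled by a single listing call."""

    def __init__(self, client):
        self.client = client
        self._entries = None

    def exists(self, key):
        if self._entries is None:
            # First use: list everything once and remember it.
            self._entries = {k: True for k in self.client.list_objects()}
        if key in self._entries:
            return True
        # Not in the preloaded dict: fall back to the per-key check.
        return self.client.head_object(key)


keys = [f"static/file{i}.css" for i in range(200)]

naive_client = FakeS3(keys)
naive_results = [naive_client.head_object(k) for k in keys]

preloaded_client = FakeS3(keys)
storage = PreloadedExists(preloaded_client)
preloaded_results = [storage.exists(k) for k in keys]
```

With 200 files already uploaded, the naive client records 200 calls while the preloaded one records a single listing call. A key missing from the dict still costs one extra per-key request.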

Removing this feature from django-storages made sense. It’s not the kind of thing you want people enabling on a web server, because it will cause memory leaks and is not concurrent-safe. However, for a management command like collectstatic, concurrency doesn’t matter.

Re-implementing the functionality was nasty. I wrapped the storage object in my own storage subclass that overrides the key methods and stores the preloaded data, so that it can be kept up to date on save, delete etc. There’s surely a better pattern than what I ended up with, but I was optimising for replicating the removed logic rather than beautiful code. This is ripe for a refactor.

The two-pass strategy

After I got the preload_metadata working again, I found that my code was still pretty slow. The culprit was the multiple post-processing hashing passes that occur when the files reference each other. It confused me a lot because there are comments in ManifestFilesMixin that specifically mention consideration for S3:

            # use the original, local file, not the copied-but-unprocessed
            # file, which might be somewhere far away, like S3
            storage, path = paths[name]
django/contrib/staticfiles/storage.py#L341

Upon further investigation, I discovered the cause was worse than I thought. Staticfiles does an exists check on L358 and then deletes the existing file on L378, which means we need to re-upload it. This happens whenever there are references between the static files. As a result, the system re-uploads these files every time, even with the preload_metadata optimisations. I wanted to find a better way.

I thought of a simple solution: a two-pass strategy. It works by running collectstatic using InMemoryStorage or FileSystemStorage mixed in with ManifestFilesMixin. This means all the post-processing happens locally. Then, for the second pass, we just iterate over the storage used in the first pass and copy the files as-is to S3. It is still quite a bit slower than other strategies, because the first pass has to run every time. But the first pass is quite fast, and on subsequent runs the second pass copies 0 files if they haven’t changed. It also only makes a single ListObjectsV2 call at the start, as we re-use the preload strategy for the second pass.
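The shape of the two passes can be sketched with plain dicts standing in for the local storage and the bucket. This is a simplified illustration, not Collectfasta’s code: first_pass and second_pass are hypothetical names, the "post-processing" here only renames files by content hash (it does not rewrite references between files), and the set of existing remote names stands in for the single listing call.

```python
import hashlib
from pathlib import PurePosixPath


def first_pass(sources: dict) -> dict:
    """Post-process locally: give every file a content-hashed name."""
    processed = {}
    for name, content in sources.items():
        digest = hashlib.md5(content).hexdigest()[:12]
        path = PurePosixPath(name)
        hashed = str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
        processed[hashed] = content
    return processed


def second_pass(processed: dict, remote: dict) -> list:
    """Copy files as-is, skipping names the remote listing already contains."""
    existing = set(remote)  # stands in for one listing call up front
    uploaded = []
    for name, content in processed.items():
        if name not in existing:
            remote[name] = content
            uploaded.append(name)
    return uploaded


remote = {}  # hypothetical bucket
sources = {"css/site.css": b"body{}", "js/app.js": b"console.log(1)"}

first_run = second_pass(first_pass(sources), remote)   # everything is new
second_run = second_pass(first_pass(sources), remote)  # nothing changed

sources["js/app.js"] = b"console.log(2)"
third_run = second_pass(first_pass(sources), remote)   # only the edited file
```

The first pass runs every time, but it never touches the network. The second pass uploads two files on the first run, none on the second, and only the changed file on the third.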

What needs work?

  1. The tests could be refactored to be a bit simpler – as raised in #217
  2. The two-pass strategy only works for AWS – the Google version doesn’t even have a manifest files version in django-storages
  3. I haven’t touched the filesystem strategies at all – but in my experience filesystem storages are usually fast anyway. Potentially they (and the threading vars) could be removed – the main bottleneck I think has always been network requests.
  4. I fought the current Strategy abstraction quite a bit – especially for two-pass – there’s an opportunity to refactor this to something simpler.

PRs are accepted / encouraged – github.com/jasongi/collectfasta
