Speed up Django’s collectstatic command with Collectfasta

Django’s collectstatic command (added in Django 1.3 – March 23, 2011) was designed for storage backends where file retrieval was cheap because it was on your local disk.

In Django 1.4 (March 23, 2012) Django introduced CachedStaticFilesStorage, which appends MD5 hashes to file names so that multiple versions of a file can stick around while you do a blue/green deployment. It also meant you could put your app in front of a CDN and the filename hashes would ensure that when the file changed so did the cache key. This meant you didn’t need to worry about invalidating the CDN assets or users’ browser caches.

Later on (Django 1.7 – September 2, 2014) we got ManifestStaticFilesStorage, which stores the filenames in a JSON file, which helps when hosting on remote storage like S3.

The original django-storages is even older than collectstatic – the initial commit was back on Jun 12, 2008. Its purpose was to provide a storage backend for AWS S3 which has since taken over the world. It also provides S3ManifestStaticStorage which is great for static file serving – you don’t even need to set up a static web server to serve them – they can come straight from the bucket or CDN.

The big problem with all of this is that running collectstatic on S3-based storage is painfully slow. This is especially true of hashing storages, which use the post-process hook to modify and re-upload files to update file references (which can then trigger further updates). There used to be a solution to this – Collectfast (released May 2013) was an awesome drop-in replacement for the collectstatic management command which would auto-magically speed things up. Unfortunately, it has been archived and is no longer maintained – the last release being in 2020. Waiting for collectstatic to run has become tiring.

I’ve spent the past few weekends forking the original Collectfast, trying to get the repo up-to-date and working again. It has been an interesting challenge and I’ve finally got it to a state where I am happy with the performance improvements it provides over the Django command and am confident it works. Introducing… Collectfasta – an updated fork of Collectfast – even faster than before.

What’s new in Collectfasta?

You can now run all tests without connecting to cloud services

One of the reasons Collectfast was archived was because it was difficult to find a new maintainer, as most tests, specifically the ‘live tests’, required real Google Cloud Platform (GCP) and AWS credentials for execution.

I have now set up popular mocking tools LocalStack and fake-gcs-server to allow these tests to run without any AWS or GCP credentials. This has also opened up a new avenue of testing since you can run these mocks for free: testing for performance on many files rather than just a single file. I’m observing performance improvements of 5x-10x with local mocks, and these improvements are even more significant with remote APIs.

I’ve kept both the live tests and the docker tests running on master for better coverage.

AWS_PRELOAD_METADATA reimplemented

AWS_PRELOAD_METADATA was removed in django-storages 1.10 (2020-08-30), and hard-coding preload_metadata = True was a key performance optimisation that Collectfast made in the boto3 strategy. The reason was straightforward: during collectstatic the exists method checks if a file already exists. This is fine when exists is cheap – but for S3Storage, exists does a HeadObject request to the S3 API every time, for every file.

In contrast, when preload_metadata was working:

  1. it would initially call ListObjectsV2 to see what is already there,
  2. store the results in a dict,
  3. then have exists check the dict first, returning True if the key exists – otherwise deferring to the original implementation.

This significantly speeds up subsequent collectstatic runs on the same files, since you’re replacing hundreds of API calls with one.
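
To make the idea concrete, here is a minimal sketch in plain Python. It is not Collectfasta’s or django-storages’ actual code: FakeRemoteStorage, PreloadedStorage and list_all are made-up names, and the fake backend simply counts calls where a real one would make HeadObject or ListObjectsV2 requests (a real listing would also need to page through results).

```python
class FakeRemoteStorage:
    """Hypothetical stand-in for an S3-like backend that counts API calls."""

    def __init__(self, keys):
        self._keys = set(keys)
        self.head_calls = 0
        self.list_calls = 0

    def exists(self, name):
        # Stands in for one HeadObject-style request per check.
        self.head_calls += 1
        return name in self._keys

    def list_all(self):
        # Stands in for a single ListObjectsV2-style listing.
        self.list_calls += 1
        return set(self._keys)


class PreloadedStorage:
    """Wrapper whose exists() consults a preloaded listing first."""

    def __init__(self, inner):
        self.inner = inner
        self._known = None  # filled lazily on the first exists() call

    def exists(self, name):
        if self._known is None:
            self._known = self.inner.list_all()
        if name in self._known:
            return True
        # Not in the listing: defer to the wrapped implementation.
        return self.inner.exists(name)
```

Checking a few hundred names that are already uploaded through PreloadedStorage costs one listing call and no per-file requests; only names missing from the listing fall through to the wrapped exists.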

Removing this feature from django-storages made sense – it’s not the kind of thing you want people enabling on a web server – because it will cause memory leaks and is not concurrent-safe. However, for a management command like collectstatic – concurrency doesn’t matter.

Re-implementing the functionality was nasty – I wrapped the storage object in my own storage subclass, overriding key methods so that the preloaded data is kept up to date on save, delete etc. There’s surely a better pattern than what I ended up with – but I was optimising for replicating the removed logic rather than beautiful code – this is ripe for a refactor.

The two-pass strategy

After I got the preload_metadata working again, I found that my code was still pretty slow. The culprit was the multiple post-processing hashing passes that occur when the files reference each other. It confused me a lot because there are comments in ManifestFilesMixin that specifically mention consideration for S3:

            # use the original, local file, not the copied-but-unprocessed
            # file, which might be somewhere far away, like S3
            storage, path = paths[name]
django/contrib/staticfiles/storage.py#L341

Upon further investigation, I discovered the cause was worse than I thought. Staticfiles does an exists check here on L358 and then deletes the file that exists on L378, which means we need to re-upload it – this happens when there are references between the static files. As a result, the system re-uploads these files every time, even with the preload_metadata optimisations. I wanted to find a better way.

I thought of a simple solution: a two-pass strategy. It works by running collectstatic using the InMemoryStorage or FileSystemStorage mixed in with ManifestFilesMixin. This means all the post-processing happens locally. Then for the second pass, we just iterate over the storage used in the first pass and copy the files as-is to S3. It is still quite a bit slower than other strategies, because the first pass has to run every time. But the first pass is quite fast, and on subsequent runs the second pass copies 0 files if they haven’t changed. It also only does a single ListObjectsV2 call at the start as we re-use the preload strategy for the second pass.
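
The shape of the two passes can be sketched in plain Python. This is an illustration under heavy assumptions rather than the real strategy: dicts stand in for the local storage and the S3 bucket, remote_listing plays the role of the preloaded listing, the function names are made up, and the rewriting of references between files that Django’s post-processing does is left out entirely.

```python
import hashlib
import posixpath


def hashed_name(name, content):
    """Insert a short MD5 digest of the content before the extension."""
    digest = hashlib.md5(content).hexdigest()[:12]
    root, ext = posixpath.splitext(name)
    return f"{root}.{digest}{ext}"


def first_pass(source_files):
    """Pass 1: do the hashing locally. Returns {hashed_name: content}."""
    return {
        hashed_name(name, content): content
        for name, content in source_files.items()
    }


def second_pass(local, remote, remote_listing):
    """Pass 2: copy files as-is, skipping names the remote already has."""
    uploaded = []
    for name, content in local.items():
        if name in remote_listing:
            continue  # unchanged content hashes to the same name
        remote[name] = content  # stands in for an upload
        remote_listing.add(name)
        uploaded.append(name)
    return uploaded
```

Because the hashed name changes whenever the content does, a second run over unchanged files uploads nothing, and editing one file uploads only that file under its new name.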

What needs work?

  1. The tests could be refactored to be a bit simpler – as raised in #217
  2. The two-pass strategy only works for AWS – the Google version doesn’t even have a manifest files version in django-storages
  3. I haven’t touched the filesystem strategies at all – but in my experience filesystem storages are usually fast anyway. Potentially they (and the threading vars) could be removed – the main bottleneck I think has always been network requests.
  4. I fought the current Strategy abstraction quite a bit – especially for two-pass – there’s an opportunity to refactor this to something simpler.

PRs are accepted / encouraged – github.com/jasongi/collectfasta

Superloop vs Aussie Broadband – a 2023 comparison

A few years back, during the 2020 lockdowns, I posted “Moving from Aussie Broadband to Superloop”, which turned out to be one of the most trafficked posts on this site.

A lot has changed since 2020, including me moving from Superloop back to Aussie Broadband for a few months, then back to Superloop. This is what has changed.

The Price
Back in 2020, 100/40 was $109 (Aussie) vs $98 (Superloop) with 6 months at $88. Recently, NBN has reduced the wholesale costs of faster plans to encourage upgrades. Superloop prices have lowered due to this, now with a 6 month introductory period of $75 and ongoing $89 – this $20 – $26 difference is why I switched back and will still recommend them. Here’s a sign up link (referral link). Meanwhile Aussie has only decreased from $109 to $105.

CVC Graphs
Part of the draw of Superloop was that they published their CVC Graphs for all to see that they weren’t being congested. Since merging with Exetel, the CVC Graphs are disappointingly not available on their public website, however they are still accessible and I will continue to archive them while they are. Because there’s no information out there about this, I have no idea if they apply only to legacy Superloop customers or also to Exetel/new Superloop signups. However, the ACCC heavily monitor speeds now, so it seems unlikely they will be able to get away with purchasing less CVC to cover the price drop/discounting.

The Transfer
It’s still just as fast. I was connected within an hour, and luckily Aussie still will pro-rata your month when cancelling (Superloop require 30 days notice).

No Port Blocking – but opt-out CGNAT
Superloop now by default opt you in to CGNAT – which is annoying if you’re a developer, have a home-lab or play video games. However, you can still opt-out by contacting support (they have chat!). There’s still no port blocking.

Fixing the Siemens SINUMERIK 828D when USB drives won’t mount

I came across an interesting problem the other day. All of a sudden, after saving a particular file to USB on the SINUMERIK 828D (a popular CNC machine controller) all USB drives stopped showing up on the HMI.

After looking through the file browser, you could see that /user/sinumerik/mnt/USB had the file in it – even when a USB was not plugged in. This is a common problem in the Linux world. What probably happened is that, for some reason, the HMI saved the file to /user/sinumerik/mnt/USB, and then when plugging in the USB drive it would not mount as the directory already existed with files in it.
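
On a regular Linux machine, the same symptom can be checked for with a few lines of Python. This is a generic sketch and not something that was run on the 828D; stale_mountpoint_entries is a made-up helper name, and the path you pass in depends on your system.

```python
import os


def stale_mountpoint_entries(path):
    """List entries sitting in `path` while nothing is mounted there.

    Returns an empty list if the path is not a directory or if a
    filesystem is currently mounted on it.
    """
    if not os.path.isdir(path) or os.path.ismount(path):
        return []
    return sorted(os.listdir(path))
```

A non-empty result with no drive plugged in means files were written into the mount point directory itself rather than onto a drive.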

The next problem was how to fix it. We couldn’t delete the folder via the HMI because there were inadequate permissions (even as the manufacturer user), so the end solution was to remove the system CF card from the unit, mount it on a Linux computer (as it’s formatted ext4) with a USB CF card reader and then remove the file. Obviously this is risky, and you should always back up before attempting this. Another solution would be to SSH into the machine as root, but this requires it to be networked, and also the ability to get root access, which I couldn’t figure out.

After mounting on a separate computer, we could see that the USB folder existed and had different permissions to the other drive mounts. After deleting the file and putting the CF card back in the 828D, USB drives were able to be mounted again.

Screenshot of 828D filesystem
You can see that the USB folder, unlike the other mounts, is not a link and does not allow non-root users write access

Adding a throttle to the Entity ECU-100 controller on the ALDI Cell Electric bicycle

Back in 2019 I bought an ALDI “special buy” e-bike for $999 – the Cell Ultimo Urban E-bike. It is an OK bike, but due to the way the pedal assist works it can be a bit annoying trying to take off on a high gear (as you need to begin to pedal before it works) or go up a hill on a high gear.

I wanted to fix this by adding a throttle. In some states this may be an illegal modification and will definitely void your warranty so please do your own research.

The Cell bikes use the Entity ECU-100 as their controller. There’s no datasheet or diagram of the available pins for this controller, but researching generic e-bike controller boards (the kind you see on aliexpress/ebay) I found that you wire up a throttle to the “SP” pin.

Remove the two screws on the back of the battery housing where the controller is, then label each of the wires while detaching them (this is important, there are many different connections and you don’t want to wire up the wrong thing when reconnecting them).

Remove the three screws on the side of the controller housing and then the four screws on front and back. You can now see the SP pin.

Three wires soldered to the SP, 5v and GND pins. You should use proper colours for the wire unlike me (Red for 5v, Black for GND, White/coloured for SP)

Solder three wires to the controller board. I drilled an extra hole in the plate the wires come out of, fed them through and crimped them to some bullet (socket) connectors.

I purchased a random thumb throttle from ebay for about $20, snipped the end off and crimped on some bullet plug connectors, ran the wires through the bike tube (you want to find something long and slightly bendy to feed it through).

Now everything is hooked up, you can put it all back together and enjoy your new speedy bike.

How to host Jackbox over Zoom on Mac OSX – with sound!

The Jackbox Party Packs by Jackbox Games have been a lifesaver in the COVID-19 pandemic given it has been very hard to see people in person. Most of the games work great over Zoom, but unfortunately getting the perfect Jackbox setup isn’t the most straightforward task. I have experimented with various setups and I believe I have found the perfect method for a seamless Jackbox experience.

Set Jackbox to windowed mode

Before you begin your game, open up Jackbox and set it to windowed mode. This will allow you to see your friends, control Zoom and play at the same time with a single screen. It’s slightly different for each pack, but most of them have a settings option in the main menu where you can choose the volume and full-screen/windowed. After you have done this, exit Jackbox.

Start your Zoom call

Start your Zoom call, no need to invite anyone yet.

Before you open Jackbox, share your screen with computer sound

It’s very important to do this before you start the Jackbox pack you will be playing. When you share your screen with Share Computer Sound ticked, it creates the “ZoomAudioDevice”, which aggregates your microphone and computer audio into a single device for Zoom to use. It is important to do this before Jackbox starts because Jackbox picks the audio output on startup and keeps it for the entire session. If the ZoomAudioDevice isn’t there, then no matter how many times you try sharing your screen the sound won’t come through.

Start Jackbox

Start up your desired Jackbox Party Pack (make sure you’ve set it to windowed mode).

Stop sharing your screen then share the Jackbox window

You could just play Jackbox sharing your screen as above, but it is a sub-par experience. Your friends will be able to see all your saucy notifications, all your Chrome tabs open in the background etc. What you really want is to just share the Jackbox window. If you stop sharing and then press share screen again, you will now have the option to choose the Jackbox window. Pick it then press share. You could also choose to share a screen portion instead of the window, but this could result in things drawing over the window, although it will make it easier to switch party packs.

NOTE: You’ll see on my screenshots I have unticked Optimise Screen Share for Video Clip. This is because I have found that although this improves the quality of the stream, the trade-off is some extra latency at times, which is much more annoying when playing games with countdown timers. Your mileage may vary.

Whenever you switch a party pack, repeat all the steps

It is important that whenever you start a party pack you do it with your screen sharing on and share computer sound ticked, otherwise your sound will not work. Exiting a party pack will stop the window screen sharing, so you will need to start again from the first step.

Other Methods

I only discovered this method recently. Prior to this I was using a more convoluted setup that involved two laptops and a program called Soundflower, which I will go into more detail on in another post; however, this method is much easier to set up.

I hope this helps, happy Jackboxing!

Moving from Aussie Broadband to Superloop

Update: See the updated 2023 comparison here

After a few years as a happy Aussie Broadband customer, I have decided to move to Superloop. This is a quick summary of my experience.

The Price
Aussie Broadband recently announced they are increasing the price of their 100 Mbps plans by $10 a month. This means that an unlimited 100/40 plan is $98 for Superloop vs $109 for Aussie. Couple it with a referral code/link and you can get it for $88 for 6 months (here is my referral link). Aussie Broadband have never been the cheapest provider, but this new hike is uncompetitive.

Superloop
Superloop has copied the Aussie Broadband playbook as a premium NBN provider. They publish daily CVC graphs, which aren’t as detailed as Aussie Broadband’s but are good enough for you to identify if they are not provisioning enough CVC to your POI. Like Aussie Broadband, there are no lock-in contracts or extra connection fees, which is a great way of knowing that if they do something like increase your plan costs or have a decline in service quality, you can easily move without penalty.

The Transfer
Churning was ridiculously quick. There was no need to cancel my connection with Aussie Broadband, I simply signed up on the Superloop website and within 5 minutes the connection had been swapped over with no noticeable connection interruption. No need to talk to anyone. One annoying thing is that the only payment methods they offer are Credit Card and BPay, so Credit Card is the only automatic way of paying – luckily though there is no surcharge for this.

No More CGNAT or Port Blocking
Something annoying about Aussie Broadband is that when you sign up or move house you need to wait until the connection is active, then contact their support to remove CGNAT (which messes with things like online games) and unblock incoming ports (important if, like me, you do some web development type stuff on your network). I was pleasantly surprised that Superloop hand out Dynamic IPv4 addresses by default and don’t engage in port blocking.

Similarly to my Aussie Broadband CVC Archive, I have started archiving Superloop’s CVC graphs too.

Name list generator

Need a list of names generated? Just load up this page with names as a query param with your comma separated list of names and it shall generate a random list, e.g jasongi.com/2020/05/21/name-list-generator/?names=Jason,Fred,Jane! Perfect for Zoom stand-ups!
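
The page’s own code isn’t shown here, but the idea is small enough to sketch in Python: pull the names query param out of the URL, split it on commas and shuffle. The function name random_name_list is made up for this example, and treating “random list” as a shuffle is an assumption.

```python
import random
from urllib.parse import parse_qs, urlsplit


def random_name_list(url, rng=None):
    """Return the comma-separated `names` query param in random order."""
    rng = rng or random.Random()
    query = parse_qs(urlsplit(url).query)
    raw = query.get("names", [""])[0]
    # Drop empty entries such as a trailing comma or stray spaces.
    names = [name.strip() for name in raw.split(",") if name.strip()]
    rng.shuffle(names)
    return names
```

Passing a seeded random.Random as rng makes the order repeatable, which is handy if everyone in the stand-up should see the same list.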

    100 Toasty Tofu(s) 2018 Edition

    It’s that time of the year again. Last year I made the foray into predicting Triple J’s Hottest 100 and it was fun so this year I’ve given it another go with some key differences. I completely rewrote the script that does the legwork, and decided to go one step further with doing some demographic weighting and analysis.

    The New Script

    Last year I was using Tesseract, one of the leading open source OCR projects. This year I decided to test out some cloud based OCR to see if it was any better. I tried Amazon Rekognition and Google Cloud Vision. After testing both it became clear that Google Cloud Vision is miles ahead of Rekognition in both text detection and paragraph detection, so I went with that. I’ve also hooked up all the data to a Metabase instance, which is great for easily displaying data.

    100 Warm Tunas is now scraping Twitter and Instagram. I considered whether my script should do the same but decided against it for a few reasons. Last year, ZestfullyGreen did a Twitter scraper but it failed to predict the #1 song. This led me to believe that the sample of people on Twitter is not representative of the Hottest 100 voting population and would not improve the prediction, while Instagram has a strong history of accurate predictions.

    The Results

    Without further ado, these are the raw counts.

    Interestingly it wasn’t always like this. If you look at the day by day counts, former bookie favourite This Is America won the first day and it took a few more days for Ocean Alley’s total to catch up. We could be in for a close Hottest 100.

    Looking a little deeper

    Every year, Triple J loves to wheel out the stats on the Hottest 100 while refusing to release counts. This makes seeing how far off previous predictions are difficult. I did some research and found several interesting articles.

    “This year’s Hottest 100 has set a new voting record!” gave us a breakdown by state, gender and age bracket (kinda) of who voted.

    • More women than men voted this year, 51% female compared to 48% male (rounded out by 1% for ‘Other’ and ‘No answer’)
    • New South Wales took the lion’s share of votes (29%), followed by Victoria (23%), backed up by QLD (20%), and in order after that, WA (11%), SA (8%), ACT and TAS (3%), Overseas voters (2%), and NT (1%).
    • The most common age of voters was 21 years old. About half of voters were aged 18-24 and around 80% of voters were under 30.

    “Did guys and gals vote differently in the Hottest 100? Let’s find out” showed us the gender divide in music tastes. “Hottest 100: What songs were most popular with each state and territory?” did the same for states/territories.

    Instagram doesn’t list a location for people or their gender, but I figured gender could be approximated by running people’s names through gender_guesser, a library for Python that uses a name dataset to guess gender. This decreases our sample size as not everyone has their name on Instagram, but is an interesting experiment. Here you can see the differences in votes.

    The divide is clear. Everybody loves Ocean Alley and Gambino, but people with masculine names seem to have an aversion to Wafia and Amy Shark (Could this be why she has never gotten a #1?). Masculine names also enjoy Ruby Fields – Dinosaur more than people with feminine names.

    For location, I used a different approximation. Sometimes people tag their photos with a location, and it’s probable that that location is where they live. So the script tries to find the last tagged location and puts them into that state. It’s not perfect but provides some interesting results.

    When we put this all together, we can produce a weighted prediction of the Hottest 100 based on either gender, state or both.

    This doesn’t affect the top songs but you can see ones with a particular bias (e.g. Mallrat, which is popular with feminine names) shoot up.
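
One common way to do this kind of weighting is to scale each group’s votes by how much of the real voting population that group makes up. The sketch below is an assumption about the general approach, not the script’s actual code; weighted_scores and the input shapes are made up for illustration.

```python
def weighted_scores(votes, population_share):
    """Weight per-group vote counts by each group's population share.

    votes: {song: {group: count}} taken from the scraped sample.
    population_share: {group: fraction of real voters in that group}.
    Returns {song: weighted share of the vote}.
    """
    # Total sampled votes per group, used to turn counts into rates.
    sample_size = {}
    for by_group in votes.values():
        for group, count in by_group.items():
            sample_size[group] = sample_size.get(group, 0) + count

    scores = {}
    for song, by_group in votes.items():
        scores[song] = sum(
            population_share.get(group, 0.0) * count / sample_size[group]
            for group, count in by_group.items()
            if sample_size[group]
        )
    return scores
```

With gender as the group, population_share would come from the published breakdown (51% female, 48% male), so a song that is over-represented among one group in the sample gets pulled back towards what the wider voting population would give it.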

    This year’s Hottest 100 is set to be a close one. If you think you’re better at predicting these things, submit your prediction here and then watch it count down here.

    Aussie Broadband CVC Archive

    Aussie Broadband is a great NBN internet provider. They are the only one that posts daily CVC Utilization graphs, which are the only real way you can see if you’re going to get peak time congestion. After suffering congestion under other ISPs I moved to Aussie and haven’t had any speed slowdown.

    Unfortunately, for whatever reason, they don’t offer historic CVC data; they only display the previous day’s. I’ve kindly started backing it up so that Aussie users can see historic CVC data. All this data belongs to ABB and I take no responsibility for its accuracy. But seriously, they are great and you should switch to them if you can.

    See the archive here.