It’s that time of the year again. Last year I made the foray into predicting Triple J’s Hottest 100, and it was fun, so this year I’ve given it another go with some key differences. I completely rewrote the script that does the legwork, and decided to go one step further with some demographic weighting and analysis.
The New Script
Last year I was using Tesseract, one of the leading open source OCR projects. This year I decided to test out some cloud-based OCR to see if it was any better. I tried Amazon Rekognition and Google Cloud Vision. After testing both it became clear that Google Cloud Vision is miles ahead of Rekognition in both text detection and paragraph detection, so I went with that. I’ve also hooked all the data up to a Metabase instance, which is great for easily displaying it.
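Once the OCR has extracted text from a screenshot, it still has to be matched back to actual songs, and OCR output is rarely exact. A minimal sketch of that matching step using Python’s stdlib difflib — the candidate list here is hypothetical, and the real script’s matching algorithm may well differ:

```python
import difflib

# Hypothetical candidate list; the real script would load the full list of
# eligible songs for the year.
SONGS = [
    "Ocean Alley - Confidence",
    "Childish Gambino - This Is America",
    "Ruby Fields - Dinosaur",
    "Amy Shark - I Said Hi",
]

def match_song(ocr_line, candidates=SONGS, cutoff=0.6):
    """Map a noisy OCR'd line to the closest known song, or None if nothing is close."""
    matches = difflib.get_close_matches(ocr_line, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# OCR noise ("0" for "O", a dropped letter) still resolves to the right song.
print(match_song("0cean Alley - Confidnce"))
# Garbage below the similarity cutoff is discarded rather than miscounted.
print(match_song("zzzzzz"))
```

Tuning the cutoff trades missed votes against false matches; too low and unrelated captions start counting as votes.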
100 Warm Tunas is now scraping Twitter and Instagram. I considered whether my script should do the same but decided against it for a few reasons. Last year, ZestfullyGreen built a Twitter scraper, but it failed to predict the #1 song. This led me to believe that the sample of people on Twitter is not representative of the Hottest 100 voting population and would not improve the prediction, while Instagram has a strong history of accurate predictions.
Without further ado, these are the raw counts.
Interestingly, it wasn’t always like this. If you look at the day-by-day counts, former bookie favourite This Is America won the first day, and it took a few more days for Ocean Alley’s total to catch up. We could be in for a close Hottest 100.
Looking a little deeper
Every year, Triple J loves to wheel out the stats on the Hottest 100 while refusing to release the actual counts. This makes it difficult to see how far off previous predictions were. I did some research and found several interesting articles.
“This year’s Hottest 100 has set a new voting record!” gave us a breakdown by state, gender and age bracket (kinda) of who voted.
- More women than men voted this year, 51% female compared to 48% male (rounded out by 1% for ‘Other’ and ‘No answer’)
- New South Wales took the lion’s share of votes (29%), followed by Victoria (23%), backed up by QLD (20%), and in order after that, WA (11%), SA (8%), ACT and TAS (3%), Overseas voters (2%), and NT (1%).
- The most common age of voters was 21 years old. About half of voters were aged 18-24 and around 80% of voters were under 30.
“Did guys and gals vote differently in the Hottest 100? Let’s find out” showed us the gender divide in music tastes. “Hottest 100: What songs were most popular with each state and territory?” did the same for states and territories.
Instagram doesn’t list a location or gender for people, but I figured gender could be approximated by running people’s names through gender_guesser, a Python library that uses a name dataset to guess gender. This shrinks our sample size, as not everyone has their name on Instagram, but it makes for an interesting experiment. Here you can see the differences in votes.
The divide is clear. Everybody loves Ocean Alley and Gambino, but people with masculine names seem to have an aversion to Wafia and Amy Shark (could this be why she has never had a #1?). Masculine names also enjoy Ruby Fields – Dinosaur more than people with feminine names do.
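The name-based approximation described above can be sketched with a tiny stand-in lookup — the real script uses gender_guesser’s much larger name dataset (which also has finer categories like “mostly_male” and “andy”), and the names and votes below are purely hypothetical:

```python
# Tiny stand-in for gender_guesser's name dataset (hypothetical entries).
NAME_GENDERS = {"jack": "male", "oliver": "male", "ruby": "female", "amy": "female"}

def guess_gender(full_name):
    """Guess gender from the first token of a display name; 'unknown' otherwise."""
    tokens = full_name.strip().split()
    first = tokens[0].lower() if tokens else ""
    return NAME_GENDERS.get(first, "unknown")

# Votes without a recognisable first name drop out of the gendered sample,
# which is why this approximation shrinks the sample size.
votes = [("Jack Smith", "Ocean Alley"), ("Ruby L", "Mallrat"), ("xx_music_xx", "Wafia")]
sample = [(guess_gender(name), song) for name, song in votes
          if guess_gender(name) != "unknown"]
print(sample)  # [('male', 'Ocean Alley'), ('female', 'Mallrat')]
```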
For location, I used a different approximation. Sometimes people tag their photos with a location, and it’s probable that that location is where they live. So the script finds each person’s most recently tagged location and assigns them to that state. It’s not perfect, but it produces some interesting results.
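That location step might look something like the sketch below; the location-to-state table is a hypothetical stand-in for whatever gazetteer the real script uses, which would need to cover suburbs, venues and so on:

```python
# Hypothetical mapping from tagged-location names to Australian states;
# the real script would need a far larger lookup.
LOCATION_TO_STATE = {
    "Sydney": "NSW",
    "Bondi Beach": "NSW",
    "Melbourne": "VIC",
    "Brisbane": "QLD",
}

def state_from_posts(tagged_locations):
    """Assign a voter to the state of their most recently tagged location.

    Assumes `tagged_locations` is ordered oldest-first; unrecognised tags are
    skipped, and None means we can't place the voter at all.
    """
    for loc in reversed(tagged_locations):
        if loc in LOCATION_TO_STATE:
            return LOCATION_TO_STATE[loc]
    return None

print(state_from_posts(["Brisbane", "Bondi Beach"]))  # → NSW (latest recognised tag)
```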
When we put this all together, we can produce a weighted prediction of the Hottest 100 based on gender, state, or both.
This doesn’t affect the top songs, but you can see songs with a particular bias (e.g. Mallrat, which is popular with feminine names) shoot up.
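One standard way to do this weighting is post-stratification: each vote counts for (official share of its demographic group) ÷ (that group’s share of our sample), so under-represented groups are scaled up. A minimal sketch assuming gender-only weighting against Triple J’s published 51/48/1 split — the sample votes are hypothetical:

```python
from collections import Counter

# Official voter shares from Triple J's stats post (gender only, for brevity).
TARGET_SHARE = {"female": 0.51, "male": 0.48, "other": 0.01}

def weighted_counts(votes):
    """Post-stratify a list of (group, song) votes.

    Each vote is weighted by target_share / sample_share for its group;
    votes whose group isn't in TARGET_SHARE are dropped.
    """
    votes = [(g, s) for g, s in votes if g in TARGET_SHARE]
    group_sizes = Counter(g for g, _ in votes)
    total = sum(group_sizes.values())
    weights = {g: TARGET_SHARE[g] / (n / total) for g, n in group_sizes.items()}
    tally = Counter()
    for group, song in votes:
        tally[song] += weights[group]
    return tally

# Hypothetical sample skewed 75% masculine names: the feminine-name vote is
# upweighted (0.51 / 0.25 = 2.04) and Mallrat overtakes a nominally higher count.
sample = [("male", "Ocean Alley")] * 3 + [("female", "Mallrat")]
print(weighted_counts(sample).most_common())
```

This is why a song with a strong bias toward an under-sampled group can jump in the weighted prediction.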
This year’s Hottest 100 is set to be a close one. If you think you’re better at predicting these things, submit your prediction here and then watch it count down here.
So this is just from instagram?
How do you prevent counting the fake votes?
You have a similar number to Warmtunas although that includes Twitter and Direct Messages. Do you know where your extra votes came from?
“hottest1002018”, “hottest100”, “triplejhottest100”, “triplejhottest1002018”, “jjjhottest100”, “jjjhottest1002018”, “hottestonehundred”, “triplejhottestonehundred”, “triplej”, “tripplej”, “tripplejhottest100”, “triplejshottest100”
and also anyone tagging @triplej
It’s a secret 🙂 (revealing this would probably enable people to get around it). The demographic analysis is a good way of verifying that the data hasn’t been skewed.
Probably the extra hashtags and/or differences in the OCR/matching algorithm.
Does your count update hourly?
When will you finish counting and put forward final results?
Will you update ‘top 5 by day’ again as the final day of voting will heavily influence votes?
Just a comment: I’m really liking this page and the thought that has gone into it, perhaps a bit more than Warm Tunas. Good stuff.
It updates hourly. There’s not much point continuing to scrape after voting finishes tomorrow. All the tables/graphs are live-updated.
Thanks for the feedback!
You’ve got the wrong Arctic Monkeys song in your prediction countdown
Oops, I’ve fixed it now