Over the next few days, we’ll be publishing posts from several of our model submitters describing how they built their models and how those models have performed so far.
The first model in this series will be from Pamplemousse. You can read more about this model from its creator below.
Each week’s model has been a variation on a multinomial logistic regression, where the multiclass outcome consists of “winner”, “safe”, and “eliminated”. I predict probabilities for the available queens each week, and my picks are the queen with the highest probability of being eliminated and the queen with the highest probability of winning. I started off very modestly in terms of data and features and have added a little bit each week (this is almost entirely a function of my time/schedule).
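To make the pick rule concrete, here is a minimal sketch of how per-class scores turn into probabilities and then into weekly picks. It is not the code behind the submitted model: the queen names and scores are made up, and the fitting step that would produce those scores is left out.

```python
import math

CLASSES = ("winner", "safe", "eliminated")

def softmax(scores):
    """Convert one queen's raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weekly_picks(class_scores):
    """class_scores maps queen -> [winner, safe, eliminated] linear scores."""
    probs = {q: dict(zip(CLASSES, softmax(s))) for q, s in class_scores.items()}
    # Pick the queen with the highest probability in each class of interest.
    win_pick = max(probs, key=lambda q: probs[q]["winner"])
    out_pick = max(probs, key=lambda q: probs[q]["eliminated"])
    return win_pick, out_pick, probs

# Hypothetical scores, for illustration only.
scores = {
    "Queen A": [1.2, 0.5, -0.8],
    "Queen B": [-0.3, 0.9, 0.1],
    "Queen C": [-1.0, 0.2, 1.1],
}
win_pick, out_pick, probs = weekly_picks(scores)
```

Note that the two picks are made independently, so nothing in this sketch stops the same queen from topping both lists in a given week.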
The first week I used only data from the first episodes of each season, and only considered standardized age, contestant entrance order, and state. I felt bad about throwing away so much data, but I ran out of time and figured that it would be hard to do well in the first week anyway. The algorithm performed well on the historical data (5/9 first episode losers correct, 6/9 first episode winners correct – hello, overfitting!). I actually threw out season 3 because the episode numbering was off and I didn’t realize it. My predictions for the first episode were for Ariel Versace to be eliminated and Silky to win. Too bad it wasn’t about predicting the most drama between two queens….
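For readers unfamiliar with the term, “standardized age” means age rescaled to have mean 0 and standard deviation 1. A small sketch, using made-up ages and assuming the population standard deviation (the sample version would work just as well):

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption, not a detail from the post
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical, nothing to scale
    return [(v - mu) / sigma for v in values]

ages = [22, 25, 28, 31, 34]  # hypothetical contestant ages
z_ages = standardize(ages)
```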
My big thing in Week 2 was using the US Census API to pull in city size information and state region. I don’t think it helps that much, but it was interesting and I wanted to learn how to do it! I spent a bunch of my time on that and ended up only using data from episodes 1 and 2 (I still hadn’t caught the episode numbering issue). Features were standardized age, contestant entrance order, city size (a categorical variable based on quartiles of the empirical distribution), and indicators for state subregion (Pacific is the reference, and I made one for Puerto Rico). Once again the historical performance was pretty decent (8/21 losers and 5/21 winners correctly identified), but the season 11 episode 2 predictions were very poor: I predicted Vanjie to go home (I actively rooted against my algorithm on this one!) and Ra’jah to win.
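The two feature-engineering steps described here, quartile bins for city size and reference-coded indicators for region, can be sketched roughly as follows. The populations and region labels are invented for illustration, the Census API call itself is omitted, and the helper names are hypothetical.

```python
from statistics import quantiles

def quartile_bins(sizes):
    """Label each value Q1..Q4 using quartiles of the observed distribution."""
    q1, q2, q3 = quantiles(sizes, n=4, method="inclusive")

    def label(x):
        if x <= q1:
            return "Q1"
        if x <= q2:
            return "Q2"
        if x <= q3:
            return "Q3"
        return "Q4"

    return [label(x) for x in sizes]

def indicators(values, reference):
    """One 0/1 column per level, leaving out the reference level."""
    levels = sorted(set(values) - {reference})
    return [{f"is_{lvl}": int(v == lvl) for lvl in levels} for v in values]

# Hypothetical city populations and regions.
city_sizes = [10_000, 50_000, 120_000, 400_000,
              900_000, 2_700_000, 3_900_000, 8_400_000]
bins = quartile_bins(city_sizes)

regions = ["pacific", "south_atlantic", "puerto_rico", "pacific"]
region_dummies = indicators(regions, reference="pacific")
```

Leaving out the reference level means a queen from the Pacific region has all region indicators set to 0, so the other coefficients are read relative to Pacific.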
I finally used all of the available episodes as inputs in Week 3. I also fixed the episode numbering issue, which let me build indicators for prior episode performance (winner or bottom), and I added the survey score. The total set of features for Week 3 was standardized age, entrance order, prior episode win, prior episode bottom, survey score, city size, and state region. Historical performance was OK (33 losers, 28 winners correct), and I predicted Vanjie to win and Mercedes Iman Diamond to go home.
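The prior episode indicators depend on episodes being numbered consistently within a season, which is why the numbering fix mattered. A minimal sketch of building them, assuming a hypothetical row layout (`season`, `episode`, `queen`, `outcome`) and hypothetical outcome labels:

```python
def add_prior_episode_flags(rows):
    """Add prior_win / prior_bottom flags based on each queen's previous episode."""
    last_outcome = {}  # (season, queen) -> outcome in her most recent episode
    flagged = []
    for row in sorted(rows, key=lambda r: (r["season"], r["episode"])):
        key = (row["season"], row["queen"])
        prev = last_outcome.get(key)  # None for a queen's first episode
        flagged.append({
            **row,
            "prior_win": int(prev == "winner"),
            "prior_bottom": int(prev == "bottom"),
        })
        last_outcome[key] = row["outcome"]
    return flagged

# Hypothetical rows, deliberately out of order.
rows = [
    {"season": 1, "episode": 2, "queen": "A", "outcome": "safe"},
    {"season": 1, "episode": 1, "queen": "A", "outcome": "winner"},
    {"season": 1, "episode": 1, "queen": "B", "outcome": "bottom"},
    {"season": 1, "episode": 2, "queen": "B", "outcome": "winner"},
]
flagged = add_prior_episode_flags(rows)
```

Because each flag only looks at earlier episodes, the features for a given week never use information from that week's own result.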
In Week 4, the algorithm and data were pretty much the same as in Week 3, except that I added standardized past wins and standardized past bottom placements, and I standardized the survey score. Historical performance was similar to the prior week (39 losers and 25 winners correct), and I predicted A’keria to win and Ra’jah to go home.
In Week 5, the algorithm and data were the same as the prior week, and historical performance was also similar (41 losers and 28 winners correct). I predicted Miss Vanjie to win and Ra’jah to go home.
So far I haven’t done any sort of cross validation, but I would love to try something like the leave-future-out CV that Shira’s blog post demonstrates. I’d also like to explore incorporating the social media data!
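For context, one common reading of leave-future-out CV is to train on everything up to a given episode and evaluate on the next one, so the model is never scored on data that precedes its training set. The sketch below only generates the train/test splits under that assumption; it may differ from what Shira’s post actually does, and the function name is hypothetical.

```python
def leave_future_out_splits(episodes, min_train=1):
    """Yield (train_episodes, test_episode) pairs with training strictly before test."""
    ordered = sorted(set(episodes))
    for i in range(min_train, len(ordered)):
        yield ordered[:i], ordered[i]

# Episode numbers for one hypothetical season.
splits = list(leave_future_out_splits([1, 2, 3, 4]))
```

Each split would then get its own model fit, and the held-out accuracy across splits gives a less optimistic estimate than the in-sample counts reported above.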