
How I made a dumb F1 Vietnamese GP Prediction Algo:

4/5/2020

TL;DR:
  • Automate data scraping of race stats from the official website using the Selenium and BeautifulSoup/lxml packages (the two parsers are similar), to get data from all races in 2018 and 2019
  • Clean data -> min-max normalization -> augment the 44 data points to ~4000 samples using simple mean/median shifts
  • Sliding window transformation to make it look like a supervised training problem
  • Train a shallow CNN+LSTM regressor with very small hidden layer sizes, because the data already has strong correlations => high generalization
  • Save weights at train accuracy of 50% (very low learning rate => barely any weight updates) {lap 10 prediction}, 60% {lap 20, and so on}, 70%, 80%, 90%, >90%
  • Add randomness using weighted regularization of random variables so that the prediction does not converge to the final result in the first pass
  • Add randomized weights (positive or negative) to predictions for Max Verstappen, since he was chosen as the wild card; 60% chance of the weight being positive (because the poor kid was starting from the back of the grid)
  • Et voilà! That's all.

The algo, without the jargon (tried my best to avoid it):

Friday 3:30 am
Location: Somewhere near Boston, USA:

The phone had enough charge, so I could turn over to the other side and watch F1 highlights on Insta, frustrated about the effect of COVID-19 on the F1 2020 calendar. Nothing else about the current situation of the world tormented me more (recession, deaths, rising cases, China re-opening their live markets, toilet pap... oh wait, I'm Indian). I couldn't care less about any of that, but a whole F1 season in jeopardy?! Not having it. Ergo, F1_Predictor.py was born.

Getting Data:
  • Couldn't find a readily available dataset online (frankly, didn't look much; too much effort)
  • Wrote an automated web scraper to log into the official website, navigate to the results page for each race, and extract data from the tables. This is super easy for the F1 official website because the site doesn't have active scraper protection and is all standard HTML (a rough sketch follows).
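To make that concrete, here's a minimal sketch of the scraping step, assuming the results pages render as plain HTML tables. The URL, login handling, and table layout below are illustrative placeholders, not the official site's actual markup:

    # Sketch: Selenium renders the page, BeautifulSoup/lxml parses the tables.
    # Login/navigation steps omitted; URL and table structure are assumptions.
    from selenium import webdriver
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()

    def scrape_race_results(url):
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "lxml")
        table = soup.find("table")                 # results table (plain HTML)
        rows = []
        for tr in table.find_all("tr")[1:]:        # skip the header row
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:
                rows.append(cells)
        return rows

    # Loop this over every 2018/2019 race-results URL and dump the rows to CSV.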
Forming the Dataset: 
  • Cleaned the extracted data (replaced driver names with integers from 0-19, replaced 'DNF' with -1) to make it easier to work with
  • Normalized the data (min-max normalization) for easier calculation of probabilities. In very loose terms, this is required to bring all the data into the same range (0-1) so it's easier for the network to find correlations and learn patterns. These patterns are then used to make the predictions
  • The network won't learn shit with just 44 data points to train on, so I augmented the data to ~4000 samples using a naive mean/median-shift algo (sketched in code right after these steps):
    • The mean of each feature column is calculated for each race
    • The mean is shifted randomly by a factor of 1E-6, and each column is reconstructed to match the same distribution
    • The same is done for the median of each column in each race
    • Frankly, I don't know how effective/correct this is, but it gave kinda realistic results (after implementing my cheats, explained in the Cheats section :P)
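A rough numpy sketch of the normalization plus the mean-shift augmentation; the 1E-6 shift comes from the steps above, while the array shapes and copy count are assumptions:

    import numpy as np

    def min_max_normalize(X):
        # Scale every feature column into [0, 1].
        mins, maxs = X.min(axis=0), X.max(axis=0)
        return (X - mins) / (maxs - mins + 1e-12)

    def augment_race(race, n_copies=90, shift=1e-6):
        # race: (drivers, features) array for one race, already normalized.
        # Nudge each column's mean by a tiny random amount and shift the whole
        # column with it, so the distribution keeps the same shape. The same
        # shift moves the median too.
        copies = []
        for _ in range(n_copies):
            delta = np.random.uniform(-shift, shift, race.shape[1])
            copies.append(race + delta)
        return copies

About 90 copies of each of the 44 races lands near the ~4000 samples mentioned above.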
  • Sliding window transformation to make the problem easier to deal with: factors like a driver's morale carrying over from the previous race are important to model, but difficult to add to the data explicitly. So we apply a sliding window transformation (sketched in code after this list):
    • Say the available data points are: [race 1, race 2, race 3, race 4, race 5, ...]
    • The network's inputs (in loose terms) would be:
      • Input 1: [race 1, race 2, race 3]; Output 1: [race 4]
      • Input 2: [race 2, race 3, race 4]; Output 2: [race 5]
      • ..... and so on
    • This helps the network learn whether races 1, 2, and 3 had any effect on race 4; races 2, 3, and 4 on race 5; and so on.
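In code, the sliding window transform looks something like this (window size 3, matching the example above; the rest is a sketch):

    import numpy as np

    def sliding_window(races, window=3):
        # races: per-race feature matrices in chronological order.
        # Input i = races i..i+window-1, target = race i+window.
        X, y = [], []
        for i in range(len(races) - window):
            X.append(races[i:i + window])
            y.append(races[i + window])
        return np.array(X), np.array(y)

    # X[0] holds races 1-3 and y[0] is race 4, exactly as in the example above.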
The Network: 
  • Supervised training is generally (don't hold me to it) performed for two major tasks: classification and regression. The very basic idea:
    • Classification: a network trained to classify dogs and cats, when given an unseen image, tries to check whether it's closer to a dog or a cat, and gives a class prediction accordingly
    • Regression: given a set of data points, the network tries to fit a curve that most closely represents the data distribution. This curve can then be extrapolated to make future predictions
  • We are not trying to classify anything in our case, but trying to predict future results, so our network could loosely be called a regressor (more accurately: a time-series predictor)
  • Explaining the network without jargon is too much work, so read up on CNNs and LSTMs separately
  • The network itself: 3 Conv layers, feeding into a single LSTM layer, feeding into Dense layers that spit out predictions (a minimal sketch follows)
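A minimal Keras sketch of that stack. Only the 3x Conv -> LSTM -> Dense shape comes from the description above; every layer width and kernel size here is an assumption:

    from tensorflow.keras import layers, models

    def build_model(window=3, n_features=40):
        # 3 Conv layers -> 1 LSTM -> Dense head, kept deliberately small
        # since the data is strongly correlated. All sizes are guesses.
        model = models.Sequential([
            layers.Conv1D(16, 2, padding="same", activation="relu",
                          input_shape=(window, n_features)),
            layers.Conv1D(16, 2, padding="same", activation="relu"),
            layers.Conv1D(16, 2, padding="same", activation="relu"),
            layers.LSTM(32),
            layers.Dense(n_features),   # regression output: the next "race"
        ])
        model.compile(optimizer="adam", loss="mse")
        return model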

Cheats:
  • The existing data has strong correlations (e.g. Hamilton has just won way too many times for the network not to predict straight away that he'd take P1 and stay there for the entire race). That is sadly true, but it's no fun if the network also mirrors the bitter reality. Hence the following cheats:
  • Regularization: adding a random weight to certain parameters in the loss function. Basically, the loss function is like a combined representation of what's going on in the network at each step. The aim of the network is to do whatever it takes to minimize the loss function, and whatever the network did to achieve minimum loss is what we want the network to learn.
    • The problem with this: for the given data there are way too many patterns that the network could learn in a minute, causing it to converge to the final solution extremely fast. It's a given that Hamilton, Max, and Vettel would always be in the top order, and George Russell could never win! But that's no fun! There needs to be some random drama.
    • Regularization generally adds a high weight to unnecessary parameters like noise, to tell the network to stop focusing on the noise and instead minimize the more important terms. Otherwise the network only minimizes that which can be easily minimized.
    • The way I used it, though, essentially adds a weight to random terms in the loss function to make sure they aren't minimized too fast (see the sketch after these bullets).
    • This is why Sainz went all the way up to P4; otherwise the network would've predicted a P6/P7 for him right at the start.
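In code terms, one way to read this cheat is a custom loss that randomly re-weights its terms each step so the obvious patterns can't be minimized too quickly. A sketch, assuming Keras with MSE as the base loss; the weight range is made up:

    import tensorflow as tf

    def drama_loss(y_true, y_pred):
        err = tf.square(y_true - y_pred)
        # Randomly re-weight each error term so the "obvious" patterns
        # (Hamilton -> P1) aren't minimized too fast. Range is a guess.
        w = tf.random.uniform(tf.shape(err), 0.5, 1.5)
        return tf.reduce_mean(w * err)

    # model.compile(optimizer="adam", loss=drama_loss)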
  • Adding a wild card agent: the most interesting races have that one unsuspecting driver who ends up achieving something miraculous. Races that finish exactly the way they started are called, ummm... what's that phrase?! Oh yeah! FUCKING BORING!
    • So I took a poll and Max ended up getting the max votes... cause Max -> Max... NVM.
    • What that basically meant is: whatever prediction Max got was multiplied by a random factor that could be positive or negative. This means he would either be the star of the show or the mega crash of the day, left to a (weighted) coin toss (sketched after these bullets).
    • In the first 10 laps he got a positive weight, causing him to jump 7 positions in just 10 laps (which is not impossible for Max :P)
    • Between laps 10-20 he jumped 4 positions, which was exactly what the network predicted, because the random weight was close to 1
    • Between laps 20-30 the network predicted him to go up 5 places (not too big a stretch for Max, but c'mon), and the negative weight meant he only jumped 2 positions. The same happened till lap 50.
    • Between laps 50-60 the network predicted P4 -> P2, but again a negative weight nullified the prediction and he finished P4
    • If not for the randomized weights, Max would've gone P20 -> P2 in 20 laps. Could've been fun to watch if he actually pulled it off, but let there be some amount of reality in what we're doing :P
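The wild card trick, sketched in code. Only the 60% positive-sign chance comes from the bullets above; the driver index and factor range are hypothetical:

    import numpy as np

    MAX_IDX = 16   # hypothetical integer ID assigned to Verstappen (0-19)

    def apply_wildcard(pred, idx=MAX_IDX, p_positive=0.6):
        # Multiply the wild card's prediction by a random factor whose sign
        # is positive 60% of the time (he started from the back of the grid).
        sign = 1.0 if np.random.rand() < p_positive else -1.0
        factor = sign * np.random.uniform(0.5, 1.5)   # magnitude is a guess
        out = pred.copy()
        out[idx] *= factor
        return out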
Et voilà! We've had our first Vietnamese GP in a while. COVID-19 can go fuck itself.
