F1 AI benchmarks

Any researchers who want the prompts and/or links to the chats can contact me =)

I went through the Formula 1 EOY quiz on Mr V’s Garage’s YouTube channel and decided to put some AIs through the same test. I gave each LLM just one shot at the questions, and every run used the free version of the LLM. Here is how I fared, and how the AIs fared (x indicates incorrect, v indicates correct, p indicates partially correct):

Round 1

Me

x x x x x

Claude (Sonnet 4.5)

v p x v v

Gemini 3 (Fast)

v p x v v

ChatGPT (probably 5.2?)

v x x x x

Interestingly, even though the question asked for non-Ferrari drivers, the LLMs frequently included Hamilton, possibly indicating overfitting.
Claude & Gemini 3 made the same mistake on question 2!

One bonus to consider: though I got all questions wrong, I could tell you that I wasn’t confident about them. The LLMs sometimes confidently stated the wrong answer.

All the LLMs used search.

Round 2

Here, I pasted the question as an image. All LLMs near-flawlessly parsed the image (ChatGPT saw “Rasmussen” as “Rasmuss(en)” for some reason). I gave them the same scoring guidelines (+1 point for every correct name, -1 point for every incorrect name).

Me

I got 3 points with 4 right & 1 wrong:

Hans Joachim
Sam Tingle
Paddy Driver
Jyrki Jarvilehto
Bernie Ecclestone

Claude (Sonnet 4.5)

5 points (6 correct, 1 wrong)

Gemini 3 (Fast)

12 points (12 correct, 0 wrong)

ChatGPT (probably 5.2?)

2 points (10 correct, 8 wrong!!)
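
As a rough sketch, the Round 2 rule (+1 point for every correct name, -1 point for every incorrect name) can be written out in a few lines of Python. The function name round2_score and the results dictionary are made up for illustration; the counts are the ones reported above.

```python
def round2_score(correct: int, wrong: int) -> int:
    """Net score: +1 for every correct name, -1 for every incorrect one."""
    return correct - wrong


# (correct, wrong) counts as reported for Round 2
results = {
    "Me": (4, 1),
    "Claude (Sonnet 4.5)": (6, 1),
    "Gemini 3 (Fast)": (12, 0),
    "ChatGPT": (10, 8),
}

scores = {name: round2_score(c, w) for name, (c, w) in results.items()}

for name, points in scores.items():
    print(f"{name}: {points} points")
```

The penalty is what sinks ChatGPT here: it named nearly as many correct drivers as Gemini, but its wrong guesses cancel most of them out.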

Round 3

Me

I got 0. My answers were 3, 493, 35 °C, $10,000, 6400.

Claude (Sonnet 4.5)

v x x p x

Gemini 3 (Fast)

v x x v v

ChatGPT (probably 5.2?)

v x x x x

Somehow, they all got the same number for the heat hazard, but this differed from the official answer.
I’m not sure how to evaluate Claude’s fourth answer, but I’ve marked it partially correct because it gave the right number and then added a correction. Claude often gives up on questions where it has to do some work (like searching the internet).

Round 3

Me

I just gave myself a zero from this point on. I don’t expect to know any of these.

Claude (Sonnet 4.5)

x x x x x v x

Gemini 3 (Fast)

x p x v v x x

ChatGPT (probably 5.2?)

Paused due to hitting rate limits (if someone from OpenAI wants to give me free credits, I’ll update this)

Rounds 4, 5, 6, 7

Me

Did not attempt

Claude (Sonnet 4.5)

4 points: v v v v x
0 points: x x x x x
4 points: (+4 -0)
1 point: v x x x x

Gemini 3 (Fast)

3 points: v v x x v
4.21 points: x v p v(+7 -1) v
4 points: (+6 -2)
4.8 points: v v v v(+6) x

ChatGPT (probably 5.2?)

0 points: x x x x x
1.86 points: x x v(+5 -2) v x
7 points: (+7 -0)
1 point: v x x x(+2 -1) x

Bonus

Gemini 3 Thinking

2 points: x x x v v
13 points: (+13 -0)
1 point: v x x x x
3 points: v v x x v
5 points: v v v(+7 -0) v v
10 points: (+11 -1)
3.8 points: v x x v(+7 -1) v

Final scores

Gemini 3 Fast wins with 37.01 points! Claude Sonnet 4.5 is second with 20 points, and ChatGPT comes in last with 13.86 points (though it skipped one round due to rate limits).
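
For anyone who wants to check the arithmetic, here is a rough Python sketch that re-adds the totals from the per-round results above. It assumes v = 1, p = 0.5 and x = 0 for the rows that only list marks (the half point for p is an assumption that happens to reproduce the totals), takes the stated point values for every round that lists points, and counts ChatGPT's rate-limited round as 0. The names MARK_POINTS and marks_total are made up for illustration.

```python
# Assumed weights for the mark rows; the half point for "p" is a guess
MARK_POINTS = {"v": 1.0, "p": 0.5, "x": 0.0}


def marks_total(marks: str) -> float:
    """Sum a space-separated row of marks such as 'v p x v v'."""
    return sum(MARK_POINTS[m] for m in marks.split())


# One entry per round, in the order the rounds appear in the post
per_round = {
    "Gemini 3 (Fast)": [
        marks_total("v p x v v"), 12, marks_total("v x x v v"),
        marks_total("x p x v v x x"), 3, 4.21, 4, 4.8,
    ],
    "Claude (Sonnet 4.5)": [
        marks_total("v p x v v"), 5, marks_total("v x x p x"),
        marks_total("x x x x x v x"), 4, 0, 4, 1,
    ],
    "ChatGPT": [
        marks_total("v x x x x"), 2, marks_total("v x x x x"),
        0,  # round paused due to rate limits
        0, 1.86, 7, 1,
    ],
}

# Round to two decimals to hide floating-point noise
totals = {name: round(sum(points), 2) for name, points in per_round.items()}

for name, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {total:.2f}")
```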

If I get access to more models, I’ll update this. Update: Gemini 3 Thinking scored 37.8 given the scoring system chosen (it did really well on a couple of rounds) despite skipping an entire round!
