Battle Of The Bots | Andy Yu · 4th Grade Science Fair 2026

Abstract

Can AI chatbots help me with my math homework?

Question

Which chatbot gets the most right?

Computers are starting to act more like people, but just like people they don't always get things right. I wanted to find out which chatbots give the most correct answers and explain how they got there so I can actually learn from them.

Chatbots tested

ChatGPT, Claude, and Gemini

I tried the same math problems on ChatGPT, Claude, and Gemini, mixing text questions with image questions when the problem was easier to show as a picture.

How I scored

Accuracy, clarity, and consistency

I used a 0-2 scale to score each bot on whether the answer was right, whether a 10-year-old could understand the explanation, and whether it stayed consistent across three separate tries in incognito chats.

Question Bank

The kinds of math problems I asked.

Easy · M1

Arithmetic

What is the value of 235 + 148 − 76?
(A) 297 (B) 307 (C) 317 (D) 327 (E) 337

Answer: (B) 307

Easy · M2

Number sense

Which of the numbers below is closest to 400?
(A) 374 (B) 389 (C) 412 (D) 403 (E) 421

Answer: (D) 403

Medium · M4

Word problem

A baker bakes 6 trays of muffins with 8 muffins on each tray. He sells 19 in the morning and 14 in the afternoon. How many muffins are left?
(A) 13 (B) 15 (C) 17 (D) 19 (E) 21

Answer: (B) 15

Medium · M5

Time

A train departs at 10:20 AM and the journey takes 2 hours and 55 minutes. At what time does it arrive?
(A) 12:55 PM (B) 1:05 PM (C) 1:10 PM (D) 1:15 PM (E) 1:20 PM

Answer: (D) 1:15 PM

Medium · M6

Logic

Five friends each have a different number from 1 to 5. Ben's number is twice Ana's. Cara's number is 1. Dan's number is greater than Ben's. What is Eli's number?
(A) 1 (B) 2 (C) 3 (D) 4 (E) 5

Answer: (C) 3
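
The answer key for the text questions (M1, M2, M4, M5, and M6) can be double-checked with a short script. This is only a sketch of one way to check it and was not part of the original experiment; the function names are made up for illustration.

```python
from datetime import datetime, timedelta
from itertools import permutations


def m1_arithmetic():
    # M1: 235 + 148 - 76
    return 235 + 148 - 76


def m2_closest_to_400():
    # M2: pick the choice with the smallest distance from 400
    choices = [374, 389, 412, 403, 421]
    return min(choices, key=lambda n: abs(n - 400))


def m4_muffins_left():
    # M4: 6 trays of 8 muffins, minus the morning and afternoon sales
    baked = 6 * 8
    return baked - 19 - 14


def m5_arrival_time():
    # M5: the date is arbitrary; only the clock time matters
    depart = datetime(2026, 1, 1, 10, 20)
    arrive = depart + timedelta(hours=2, minutes=55)
    hour12 = arrive.hour % 12 or 12
    suffix = "AM" if arrive.hour < 12 else "PM"
    return f"{hour12}:{arrive.minute:02d} {suffix}"


def m6_eli_numbers():
    # M6: try every way to hand out 1..5 and keep the ones that fit all clues
    found = set()
    for ana, ben, cara, dan, eli in permutations(range(1, 6)):
        if ben == 2 * ana and cara == 1 and dan > ben:
            found.add(eli)
    return found


if __name__ == "__main__":
    print(m1_arithmetic(), m2_closest_to_400(), m4_muffins_left())
    print(m5_arrival_time(), m6_eli_numbers())
```

For M6 the search tries all 120 ways to assign the numbers, so if it returns a single value, the puzzle has exactly one possible answer for Eli.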

Hard · M3, M7–M9 (Image)

Math Kangaroo image problems

The hard questions came from Math Kangaroo and were sent as screenshots: an LCM puzzle (M7, 2019 5-6 #21), a paper-folding question (M8, 1999 5-6 #27), and a tricky card question (M9, 2021 5-6 #30). M3 was also sent as a screenshot, but it was an easy question.

Procedure

How I ran the experiment.

Built a master question list

I grouped my math problems into easy, medium, and hard so I could see how the bots did at each level.

Used text and image problems

Some problems are hard to explain in words. For those I sent a screenshot to the chatbot instead of typing it out.

Asked each question three times

AI bots can give different answers for the same question, so I ran every question three times to see if the answer stayed the same.

Used incognito / temporary chats

Chatbots can remember earlier questions, so I used a fresh chat each time to keep the experiment fair.

Scored accuracy and clarity

2 = correct or very clear, 1 = partly correct or somewhat clear, 0 = wrong or confusing. Getting the right answer was not enough — the explanation had to make sense too.

Scored consistency across 3 trials

2 = all 3 trials correct, 1 = mixed, 0 = mostly wrong. This matters because most people only ask a chatbot once.
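
One way to read the consistency rule as code is sketched below. The cutoffs are an assumption, because the rule does not say exactly where "mixed" ends and "mostly wrong" begins: here "mixed" is taken to mean 2 of 3 trials correct and "mostly wrong" to mean 0 or 1 correct. The function name `consistency_score` is hypothetical, and the real scores may have been judged differently.

```python
def consistency_score(trials):
    """Turn three True/False trial results into a 0-2 consistency score.

    Assumed cutoffs (not stated in the write-up):
      3 correct -> 2, 2 correct -> 1 ("mixed"), 0 or 1 correct -> 0 ("mostly wrong").
    """
    if len(trials) != 3:
        raise ValueError("expected exactly 3 trials")
    correct = sum(1 for t in trials if t)
    if correct == 3:
        return 2
    if correct == 2:
        return 1
    return 0
```

For example, `consistency_score([True, False, True])` gives 1 under these assumed cutoffs.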

Results

The scores across easy, medium, and hard questions.

Overall averages

Model	Accuracy	Clarity	Consistency
ChatGPT	1.61	1.70	1.89
Claude	1.56	1.50	1.67
Gemini	1.70	1.78	1.83
Winner	Gemini	Gemini	ChatGPT
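
The Winner row can be reproduced from the three rows above it by taking the highest score in each column. The snippet below is a small sketch of that step using the numbers from the table; the names `overall` and `winners` are made up for illustration.

```python
# Overall averages copied from the table above.
overall = {
    "ChatGPT": {"Accuracy": 1.61, "Clarity": 1.70, "Consistency": 1.89},
    "Claude": {"Accuracy": 1.56, "Clarity": 1.50, "Consistency": 1.67},
    "Gemini": {"Accuracy": 1.70, "Clarity": 1.78, "Consistency": 1.83},
}


def winners(scores):
    """Return, for each metric, the list of models with the top score."""
    result = {}
    metrics = next(iter(scores.values())).keys()
    for metric in metrics:
        best = max(row[metric] for row in scores.values())
        # A list is used so that ties show up as more than one name.
        result[metric] = [m for m, row in scores.items() if row[metric] == best]
    return result


if __name__ == "__main__":
    for metric, models in winners(overall).items():
        print(metric, "->", " & ".join(models))
```

Returning a list for each metric means a tie would show every tied bot instead of only the first one found.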

Easy questions

Model	Accuracy	Clarity	Consistency
ChatGPT	2.00	2.00	2.00
Claude	1.78	1.17	1.67
Gemini	2.00	2.00	2.00
Winner	ChatGPT & Gemini tied

Medium questions

Model	Accuracy	Clarity	Consistency
ChatGPT	2.00	2.00	2.00
Claude	2.00	2.00	2.00
Gemini	2.00	2.00	2.00
Winner	All three tied — perfect scores

Hard questions

Model	Accuracy	Clarity	Consistency
ChatGPT	0.84	1.11	1.67
Claude	0.89	1.33	1.33
Gemini	1.11	1.33	1.50
Winner	Gemini	Gemini & Claude	ChatGPT

Hard Questions · Trial By Trial

Where the bots actually broke down.

M7 — LCM (image)

Model	T1	T2	T3
ChatGPT	numeric only	✓	numeric only
Claude	✓	✓	✓
Gemini	✓	✓	✓

M8 — paper folding (image)

Model	T1	T2	T3
ChatGPT	✗	✓	✗
Claude	✗	✗	✗
Gemini	✓	✓	✗

M9 — card puzzle (image)

Model	T1	T2	T3
ChatGPT	✗	✗	✗
Claude	✗	✗	✓ (guess)
Gemini	✗	✗	✗

M3 — easy image question

Model	T1	T2	T3
ChatGPT	✓	✓	✓
Claude	✓	✗ (letter)	✗ (letter)
Gemini	✓	✓	✓
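
The four trial tables above can be tallied to see how many check marks each bot earned on the image questions. This is a rough sketch, and it makes two assumptions that the tables do not settle: a cell counts only if it starts with ✓ (so "numeric only" is not counted), and Claude's "✓ (guess)" still counts as a check mark. The names `trials` and `count_checks` are hypothetical.

```python
# Trial results copied from the M7, M8, M9, and M3 tables above.
trials = {
    "M7": {
        "ChatGPT": ["numeric only", "✓", "numeric only"],
        "Claude": ["✓", "✓", "✓"],
        "Gemini": ["✓", "✓", "✓"],
    },
    "M8": {
        "ChatGPT": ["✗", "✓", "✗"],
        "Claude": ["✗", "✗", "✗"],
        "Gemini": ["✓", "✓", "✗"],
    },
    "M9": {
        "ChatGPT": ["✗", "✗", "✗"],
        "Claude": ["✗", "✗", "✓ (guess)"],
        "Gemini": ["✗", "✗", "✗"],
    },
    "M3": {
        "ChatGPT": ["✓", "✓", "✓"],
        "Claude": ["✓", "✗ (letter)", "✗ (letter)"],
        "Gemini": ["✓", "✓", "✓"],
    },
}


def count_checks(table):
    """Count the cells that start with a check mark for each model."""
    totals = {}
    for question in table.values():
        for model, cells in question.items():
            checks = sum(1 for cell in cells if cell.startswith("✓"))
            totals[model] = totals.get(model, 0) + checks
    return totals


if __name__ == "__main__":
    print(count_checks(trials))
```

Under these assumptions Gemini has 8 check marks out of 12 image trials, while ChatGPT and Claude have 5 each.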

Conclusion

What I learned from Battle Of The Bots.

Takeaway 1

All three bots ace easy and medium math

On easy and medium problems the chatbots almost always got the right answer and explained it clearly. The real differences showed up on the hard questions.

Takeaway 2

Gemini was strongest on hard and image problems

Gemini did especially well on geometry, spatial, and image questions. Claude struggled the most with spatial problems and missed an image question even at the easy level.

Takeaway 3

Right answer isn't enough — the explanation matters

Each bot has its own style: ChatGPT walks through multiple steps, Gemini gives the most detailed step-by-step breakdown, and Claude tends to do everything in one step, which is harder to follow. On M9, Claude got the answer right once, but the explanation looked like a lucky guess.

Takeaway 4

Chatbots help, but people still matter

AI bots are great learning tools because they can explain problems in different ways. But for hard problems a real teacher or parent is still important, and I want to understand the steps instead of just copying the answer.

Future

Where I'd like to take this next.

Retest as the bots get smarter

AI bots are changing very fast. My results might already look different in a month, so I'd like to rerun these tests to see how each bot improves.

Add more chatbots

I only tested three. Next time I'd also like to try Grok, DeepSeek, Kimi, and Z.ai to see how they compare.

Learn how AI bots are trained

AI bots need to learn just like humans do. I'm curious about who teaches them, what they learn from, and how they get better over time.

Try other subjects

In the future I'd also like to test AI bots on reading and science, not just math.
