Built a master question list
I grouped my math problems into easy, medium, and hard so I
could see how the bots did at each level.
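A list like this could be kept as a small set of records grouped by difficulty. The sketch below is only an illustration: the field names (`id`, `level`, `text`) and the placeholder entries are assumptions, not the real question list.

```python
from collections import defaultdict

# Placeholder entries only; the real questions are not reproduced here.
QUESTIONS = [
    {"id": "q1", "level": "easy", "text": "placeholder easy problem"},
    {"id": "q2", "level": "medium", "text": "placeholder medium problem"},
    {"id": "q3", "level": "hard", "text": "placeholder hard problem"},
    {"id": "q4", "level": "easy", "text": "another placeholder easy problem"},
]


def group_by_level(questions):
    """Return a dict mapping each difficulty level to its question ids."""
    groups = defaultdict(list)
    for q in questions:
        groups[q["level"]].append(q["id"])
    return dict(groups)


by_level = group_by_level(QUESTIONS)
```

Grouping by level first makes it easy to report results for easy, medium, and hard problems separately.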
Used text and image problems
Some problems are hard to explain in words. For those I sent a
screenshot to the chatbot instead of typing it out.
Asked each question three times
AI bots can give different answers to the same question, so I
ran every question three times to see if the answer stayed the
same.
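However the questions were actually sent, the repeat-and-compare idea can be sketched in a few lines. Everything named here is hypothetical: `ask` stands in for whatever sends a question to a chatbot, and `make_fake_bot` is a stand-in that returns canned replies so the example runs without a real chatbot.

```python
def run_trials(ask, question, n_trials=3):
    """Ask the same question n_trials times and collect the answers."""
    return [ask(question) for _ in range(n_trials)]


def answers_agree(answers):
    """True if every trial gave the same answer, ignoring case and
    surrounding spaces."""
    normalized = {a.strip().lower() for a in answers}
    return len(normalized) == 1


def make_fake_bot(replies):
    """Hypothetical stand-in for a chatbot: returns canned replies in order."""
    it = iter(replies)
    return lambda question: next(it)


stable = run_trials(make_fake_bot(["12", "12 ", "12"]), "placeholder question")
unstable = run_trials(make_fake_bot(["12", "15", "12"]), "placeholder question")
```

A plain string comparison like this only works for short final answers. Full explanations would still need to be read and compared by a person.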
Used incognito / temporary chats
Chatbots can remember earlier questions, so I used a fresh
chat each time to keep the experiment fair.
Scored accuracy and clarity
Each answer got two scores from 0 to 2. Accuracy: 2 = correct,
1 = partly correct, 0 = wrong. Clarity: 2 = very clear,
1 = somewhat clear, 0 = confusing. Getting the right answer was
not enough; the explanation had to make sense too.
Scored consistency across 3 trials
2 = all 3 trials correct, 1 = mixed, 0 = mostly wrong. This
matters because most people only ask a chatbot once, so they
never see whether a second try would have given a different
answer.
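The consistency scale can be written as a small function. This is a sketch under a stated assumption: the scale does not say exactly where "mixed" ends and "mostly wrong" begins, so here 2 of 3 correct counts as mixed and 0 or 1 of 3 correct counts as mostly wrong. The function name `consistency_score` is made up for the example.

```python
def consistency_score(trial_correct):
    """Map three True/False trial results to a 0-2 consistency score."""
    if len(trial_correct) != 3:
        raise ValueError("expected exactly 3 trials")
    n_correct = sum(trial_correct)  # True counts as 1, False as 0
    if n_correct == 3:
        return 2  # all 3 trials correct
    if n_correct == 2:
        return 1  # assumption: "mixed" means 2 of 3 correct
    return 0      # assumption: "mostly wrong" means 0 or 1 correct


all_right = consistency_score([True, True, True])
mixed = consistency_score([True, False, True])
mostly_wrong = consistency_score([False, False, True])
```

If "mixed" was meant to include 1 of 3 correct, only the middle branch would need to change.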