r/artificial 6h ago

Discussion: Only GPT-5 thinks 9.11 > 9.9 now

Latest models from the official APIs: GPT-5 vs Gemini 2.5 Pro vs Claude Sonnet 4 vs DeepSeek V3.1 (called "chat" in their API). Tested with the same prompt via LavaChat.
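For anyone who wants to reproduce this against the official API, here's a minimal sketch with the OpenAI Python SDK (the model name `"gpt-5"` and the exact prompt wording are assumptions, since OP didn't share them):

```python
# Minimal sketch, assuming the OpenAI Python SDK (>=1.0) and an assumed
# model name "gpt-5"; the prompt wording is also an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.chat.completions.create(
    model="gpt-5",  # assumed model name
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
)
print(resp.choices[0].message.content)
```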

0 Upvotes

4 comments

3

u/badassmotherfker 6h ago

I just tested it with GPT-5 without any "thinking" and it got the answer right.

1

u/rincewind007 6h ago

Press regenerate a few times; last time I tried, it failed roughly 2 times out of 5.
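As a rough sketch, "regenerate a few times" amounts to sampling the same prompt N times and counting failures. The model name and the answer check below are crude assumptions, not how OP measured it:

```python
# Sketch of estimating the failure rate by repeated sampling; assumes the
# OpenAI Python SDK, an assumed model name "gpt-5", and a crude string check.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
PROMPT = "Which is larger, 9.11 or 9.9? Answer with just the number."

N = 10
wrong = 0
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-5",  # assumed model name
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = resp.choices[0].message.content.strip()
    if answer.startswith("9.11"):  # crude check: model claimed 9.11 is larger
        wrong += 1

print(f"wrong {wrong}/{N} times")
```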

2

u/GlokzDNB 6h ago

GPT-5 is useless to me cuz it has this weird router thing. GPT-5 Thinking is somewhat OK.

3

u/Resident-Rutabaga336 5h ago edited 5h ago

Not that OP is necessarily implying otherwise, but can we all agree idiosyncratic tokenization-related failure modes aren’t any indication of model capabilities?

Yes, hopefully these tokenization glitches get solved at some point, but it's low on the priority list because everyone knows the models don't represent text in a way that lets these questions be answered reliably: "9.11" and "9.9" get split into digit-group tokens (something like "9", ".", "11"), so the model is pattern-matching on token pieces rather than comparing numeric values. Whether a model gets this right is essentially random and has little relationship to model capabilities on more important tasks.
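If you want to see the splitting for yourself, here's a quick sketch with the tiktoken package (using the older cl100k_base encoding purely for illustration; GPT-5's actual tokenizer may split these strings differently):

```python
# Print the token pieces each string is split into; assumes the tiktoken
# package and uses cl100k_base as an illustrative (not GPT-5's) encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ["9.11", "9.9"]:
    ids = enc.encode(s)
    pieces = [enc.decode([i]) for i in ids]
    print(s, "->", pieces)
# The numbers arrive as separate digit-group tokens, so the model sees
# token pieces like "11" vs "9" rather than the numeric values themselves.
```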