technology·Jun 8, 2026·1 min read

Anthropic's Claude Opus 4.8 Shows Mixed Results in Six-Test Evaluation

Anthropic's new flagship model excelled in math and code but consumed its entire token quota in one prompt.

Anthropic's latest model, Claude Opus 4.8, went through a six-test evaluation that revealed both strengths and weaknesses. The model performed well on a complex math problem and generated a flawless game, but it also exhausted its entire token allowance in a single prompt.

Test Highlights

Aced a challenging math problem, showing strong reasoning capabilities.
Produced a clean, bug-free game from scratch.
Consumed the full token quota in one interaction, limiting further tests.

Observers noted that while Claude Opus 4.8 excels at the tasks it is optimized for, it struggles with token management. The incident highlights the model's tendency to generate verbose outputs, which could hurt usability in long sessions.

Anthropic has not commented on the token consumption issue. The evaluation was conducted by Decrypt using standard benchmark prompts.

#claude #anthropic #ai #testing #tokens #flagship
