{"id":162129,"date":"2025-04-14T22:27:55","date_gmt":"2025-04-14T22:27:55","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/04\/14\/debates-over-ai-benchmarking-have-reached-pokemon-techcrunch\/"},"modified":"2025-04-14T22:27:55","modified_gmt":"2025-04-14T22:27:55","slug":"debates-over-ai-benchmarking-have-reached-pokemon-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/04\/14\/debates-over-ai-benchmarking-have-reached-pokemon-techcrunch\/","title":{"rendered":"Debates over AI benchmarking have reached Pok\u00e9mon | TechCrunch"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">Not even Pok\u00e9mon is safe from AI benchmarking controversy. <\/p>\n<p class=\"wp-block-paragraph\">Last week, a <a rel=\"nofollow\" href=\"https:\/\/x.com\/Jush21e8\/status\/1910293595422413051\" target=\"_blank\">post on X<\/a> went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s flagship Claude model in the original Pok\u00e9mon video game trilogy. 
Reportedly, Gemini had reached Lavender Town in a developer\u2019s Twitch stream; Claude was <a href=\"https:\/\/techcrunch.com\/2025\/02\/24\/anthropic-used-pokemon-to-benchmark-its-newest-ai-model\/\" target=\"_blank\" rel=\"noopener\">stuck at Mount Moon<\/a> as of late February.<\/p>\n<blockquote class=\"wp-block-quote twitter-tweet is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town<\/p>\n<p class=\"wp-block-paragraph\">119 live views only btw, incredibly underrated stream <a rel=\"nofollow\" href=\"https:\/\/t.co\/8AvSovAI4x\" target=\"_blank\">pic.twitter.com\/8AvSovAI4x<\/a><\/p>\n<p class=\"wp-block-paragraph\">\u2014 Jush (@Jush21e8) <a rel=\"nofollow noopener\" href=\"https:\/\/twitter.com\/Jush21e8\/status\/1910293595422413051?ref_src=twsrc%5Etfw\" target=\"_blank\">April 10, 2025<\/a><\/p>\n<\/blockquote>\n<p class=\"wp-block-paragraph\">But what the post failed to mention is that Gemini had an advantage.<\/p>\n<p class=\"wp-block-paragraph\">As <a rel=\"nofollow noopener\" href=\"https:\/\/www.reddit.com\/r\/singularity\/comments\/1jvwqc9\/gemini_plays_pok%C3%A9mon_has_made_it_through_rock\/\" target=\"_blank\">users on Reddit<\/a> pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify \u201ctiles\u201d in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.<\/p>\n<p class=\"wp-block-paragraph\">Now, Pok\u00e9mon is a semi-serious AI benchmark at best \u2014 few would argue it\u2019s a very informative test of a model\u2019s capabilities. 
But it <em>is<\/em> an instructive example of how different implementations of a benchmark can influence the results.<\/p>\n<p class=\"wp-block-paragraph\">For example, Anthropic <a rel=\"nofollow noopener\" href=\"https:\/\/www.anthropic.com\/news\/claude-3-7-sonnet\" target=\"_blank\">reported<\/a> two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model\u2019s coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a \u201ccustom scaffold\u201d that Anthropic developed.<\/p>\n<p class=\"wp-block-paragraph\">More recently, Meta <a href=\"https:\/\/techcrunch.com\/2025\/04\/06\/metas-benchmarks-for-its-new-ai-models-are-a-bit-misleading\/\" target=\"_blank\" rel=\"noopener\">fine-tuned<\/a> a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The <a href=\"https:\/\/techcrunch.com\/2025\/04\/11\/metas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark\/\" target=\"_blank\" rel=\"noopener\">vanilla version<\/a> of the model scores significantly worse on the same evaluation.<\/p>\n<p class=\"wp-block-paragraph\">Given that AI benchmarks \u2014 Pok\u00e9mon included \u2014 are <a href=\"https:\/\/techcrunch.com\/2024\/03\/07\/heres-why-most-ai-benchmarks-tell-us-so-little\/\" target=\"_blank\" rel=\"noopener\">imperfect measures<\/a> to begin with, custom and non-standard implementations threaten to muddy the waters even further. 
That is to say, it doesn\u2019t seem likely that it\u2019ll get any easier to compare models as they\u2019re released.<\/p>\n<\/div>\n<p><script async src=\"\/\/platform.twitter.com\/widgets.js\" charset=\"utf-8\"><\/script><br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/04\/14\/debates-over-ai-benchmarking-have-reached-pokemon\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Not even Pok\u00e9mon is safe from AI benchmarking controversy. Last week, a post on X went viral, claiming that Google\u2019s latest Gemini model surpassed Anthropic\u2019s flagship Claude model in the original Pok\u00e9mon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer\u2019s Twitch stream; Claude was stuck at Mount Moon as of late [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":162130,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-162129","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/162129","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=162129"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/162129\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\
/index.php\/wp-json\/wp\/v2\/media\/162130"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=162129"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=162129"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=162129"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}