{"id":165514,"date":"2025-05-01T00:08:35","date_gmt":"2025-05-01T00:08:35","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/05\/01\/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark-techcrunch\/"},"modified":"2025-05-01T00:08:35","modified_gmt":"2025-05-01T00:08:35","slug":"study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/05\/01\/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark-techcrunch\/","title":{"rendered":"Study accuses LM Arena of helping top AI labs game its benchmark | TechCrunch"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\"><a rel=\"nofollow noopener\" href=\"https:\/\/arxiv.org\/pdf\/2504.20879\" target=\"_blank\">A new paper<\/a> from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.<\/p>\n<p class=\"wp-block-paragraph\">According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform\u2019s leaderboard, though the opportunity was not afforded to every firm, the authors say.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOnly a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others,\u201d said Cohere\u2019s VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. 
\u201cThis is gamification.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Created in 2023 as an academic research project out of UC Berkeley, Chatbot Arena has become a go-to benchmark for AI companies. It works by putting answers from two different AI models side-by-side in a \u201cbattle,\u201d and asking users to choose the best one. It\u2019s not uncommon to see unreleased models competing in the arena under a pseudonym. <\/p>\n<p class=\"wp-block-paragraph\">Votes over time contribute to a model\u2019s score \u2014 and, consequently, its placement on the Chatbot Arena leaderboard. While many commercial actors participate in Chatbot Arena, LM Arena has long maintained that its benchmark is an impartial and fair one.<\/p>\n<p class=\"wp-block-paragraph\">However, that\u2019s not what the paper\u2019s authors say they uncovered.<\/p>\n<p class=\"wp-block-paragraph\">One AI company, Meta, was able to privately test 27 model variants on Chatbot Arena between January and March leading up to the tech giant\u2019s Llama 4 release, the authors allege. 
At launch, Meta only publicly revealed the score of a single model \u2014 a model that happened to rank near the top of the Chatbot Arena leaderboard.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><figcaption class=\"wp-element-caption\"><span class=\"wp-element-caption__text\">A chart pulled from the study. 
(Credit: Singh et al.)<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">In an email to TechCrunch, LM Arena Co-Founder and UC Berkeley Professor Ion Stoica said that the study was full of \u201cinaccuracies\u201d and \u201cquestionable analysis.\u201d<\/p>\n<p class=\"wp-block-paragraph\">\u201cWe are committed to fair, community-driven evaluations, and invite all model providers to submit more models for testing and to improve their performance on human preference,\u201d said LM Arena in a statement provided to TechCrunch.\u00a0\u201cIf a model provider chooses to submit more tests than another model provider, this does not mean the second model provider is treated unfairly.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-supposedly-favored-labs\">Supposedly favored labs<\/h2>\n<p class=\"wp-block-paragraph\">The paper\u2019s authors started conducting their research in November 2024 after learning that some AI companies were possibly being given preferential access to Chatbot Arena. In total, they measured more than 2.8 million Chatbot Arena battles over a five-month stretch.<\/p>\n<p class=\"wp-block-paragraph\">The authors say they found evidence that LM Arena allowed certain AI companies, including Meta, OpenAI, and Google, to collect more data from Chatbot Arena by having their models appear in a higher number of model \u201cbattles.\u201d This increased sampling rate gave these companies an unfair advantage, the authors allege.<\/p>\n<p class=\"wp-block-paragraph\">Using additional data from LM Arena could improve a model\u2019s performance on Arena Hard, another benchmark LM Arena maintains, by 112%. 
However, LM Arena said in a\u00a0<a href=\"https:\/\/x.com\/lmarena_ai\/status\/1917668731481907527\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">post on X<\/a>\u00a0that Arena Hard performance does not directly correlate to Chatbot Arena performance.<\/p>\n<p class=\"wp-block-paragraph\">Hooker said it\u2019s unclear how certain AI companies might\u2019ve received priority access, but that it\u2019s incumbent on LM Arena to increase its transparency regardless.<\/p>\n<p class=\"wp-block-paragraph\">In\u00a0a <a href=\"https:\/\/x.com\/lmarena_ai\/status\/1917668731481907527\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">post on X<\/a>, LM Arena said that several of the claims in the paper don\u2019t reflect reality. The organization pointed to a\u00a0<a href=\"https:\/\/blog.lmarena.ai\/blog\/2025\/two-year-celebration\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">blog post<\/a> it published earlier this week indicating that models from non-major labs appear in more Chatbot Arena battles than the study suggests. <\/p>\n<p class=\"wp-block-paragraph\">One important limitation of the study is that it relied on \u201cself-identification\u201d to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin, and relied on the models\u2019 answers to classify them \u2014 a method that isn\u2019t foolproof. <\/p>\n<p class=\"wp-block-paragraph\">However, Hooker said that when the authors reached out to LM Arena to share their preliminary findings, the organization didn\u2019t dispute them.<\/p>\n<p class=\"wp-block-paragraph\">TechCrunch reached out to Meta, Google, OpenAI, and Amazon \u2014 all of which were mentioned in the study \u2014 for comment. 
None immediately responded.<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-lm-arena-in-hot-water\">LM Arena in hot water<\/h2>\n<p class=\"wp-block-paragraph\">In the paper, the authors call on LM Arena to implement a number of changes aimed at making Chatbot Arena more \u201cfair.\u201d For example, the authors say, LM Arena could set a clear and transparent limit on the number of private tests AI labs can conduct, and publicly disclose scores from these tests.<\/p>\n<p class=\"wp-block-paragraph\">In a\u00a0<a href=\"https:\/\/x.com\/lmarena_ai\/status\/1917668731481907527\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">post on X,<\/a>\u00a0LM Arena rejected these suggestions, claiming it has published information on pre-release testing\u00a0<a href=\"https:\/\/blog.lmarena.ai\/blog\/2024\/policy\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">since March 2024<\/a>. The benchmarking organization also said it \u201cmakes no sense to show scores for pre-release models which are not publicly available,\u201d because the AI community cannot test the models for themselves.<\/p>\n<p class=\"wp-block-paragraph\">The researchers also say LM Arena could adjust Chatbot Arena\u2019s sampling rate to ensure that all models in the arena appear in the same number of battles. LM Arena has been receptive to this recommendation publicly, and indicated that it\u2019ll create a new sampling algorithm.<\/p>\n<p class=\"wp-block-paragraph\">The paper comes weeks after Meta was caught gaming benchmarks in Chatbot Arena around the launch of its above-mentioned Llama 4 models. Meta optimized one of the Llama 4 models for \u201cconversationality,\u201d which helped it achieve an impressive score on Chatbot Arena\u2019s leaderboard. 
But the company never released the optimized model \u2014 and the vanilla version\u00a0<a href=\"https:\/\/techcrunch.com\/2025\/04\/11\/metas-vanilla-maverick-ai-model-ranks-below-rivals-on-a-popular-chat-benchmark\/\" target=\"_blank\" rel=\"noopener\">ended up performing much worse<\/a> on Chatbot Arena.<\/p>\n<p class=\"wp-block-paragraph\">At the time, LM Arena said Meta should have been more transparent in its approach to benchmarking.<\/p>\n<p class=\"wp-block-paragraph\">Earlier this month, LM Arena announced it was <a rel=\"nofollow noopener\" href=\"https:\/\/www.bloomberg.com\/news\/articles\/2025-04-17\/popular-ai-ranking-website-chatbot-arena-is-becoming-a-real-company\" target=\"_blank\">launching a company<\/a>, with plans to raise capital from investors. The study increases scrutiny on private benchmark organizations \u2014 and whether they can be trusted to assess AI models without corporate influence clouding the process.<\/p>\n<p class=\"wp-block-paragraph\"><em>Update on 4\/30\/25 at 9:35pm PT: A previous version of this story included comment from a Google DeepMind engineer who said part of Cohere\u2019s study was inaccurate. The researcher did not dispute that Google sent 10 models to LM Arena for pre-release testing from January to March, as Cohere alleges, but simply noted the company\u2019s open source team, which works on Gemma, only sent one.<\/em><\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/04\/30\/study-accuses-lm-arena-of-helping-top-ai-labs-game-its-benchmark\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals. 
According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":165515,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-165514","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/165514","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=165514"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/165514\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/165515"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=165514"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=165514"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=165514"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}