{"id":157088,"date":"2025-03-20T20:11:10","date_gmt":"2025-03-20T20:11:10","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/03\/20\/a-high-schooler-built-a-website-that-lets-you-challenge-ai-models-to-a-minecraft-build-off-techcrunch\/"},"modified":"2025-03-20T20:11:10","modified_gmt":"2025-03-20T20:11:10","slug":"a-high-schooler-built-a-website-that-lets-you-challenge-ai-models-to-a-minecraft-build-off-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/03\/20\/a-high-schooler-built-a-website-that-lets-you-challenge-ai-models-to-a-minecraft-build-off-techcrunch\/","title":{"rendered":"A high schooler built a website that lets you challenge AI models to a Minecraft build-off | TechCrunch"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">As conventional <a href=\"https:\/\/techcrunch.com\/2024\/03\/07\/heres-why-most-ai-benchmarks-tell-us-so-little\/\" target=\"_blank\" rel=\"noopener\">AI benchmarking<\/a> techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that\u2019s Minecraft, the Microsoft-owned sandbox-building game.<\/p>\n<p class=\"wp-block-paragraph\">The website <a rel=\"nofollow noopener\" href=\"https:\/\/mcbench.ai\/\" target=\"_blank\">Minecraft Benchmark<\/a> (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><figcaption class=\"wp-element-caption\"><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong><a rel=\"nofollow noopener\" href=\"https:\/\/mcbench.ai\/\" target=\"_blank\">Minecraft Benchmark <span class=\"screen-reader-text\">(opens in a new window)<\/span><\/a><\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft isn\u2019t so much the game itself, but the familiarity that people have with it \u2014 after all, it is the <a rel=\"nofollow noopener\" href=\"https:\/\/www.theverge.com\/2023\/10\/15\/23916349\/minecraft-mojang-sold-300-million-copies-live-2023\" target=\"_blank\">best-selling<\/a> video game of all time. Even for people who haven\u2019t played the game, it\u2019s still possible to evaluate which blocky representation of a pineapple is better realized.<\/p>\n<p class=\"wp-block-paragraph\">\u201cMinecraft allows people to see the progress [of AI development] much more easily,\u201d Singh told TechCrunch. \u201cPeople are used to Minecraft, used to the look and the vibe.\u201d<\/p>\n<p class=\"wp-block-paragraph\">MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project\u2019s use of their products to run benchmark prompts, per MC-Bench\u2019s website, but the companies are not otherwise affiliated. <\/p>\n<p class=\"wp-block-paragraph\">\u201cCurrently we are just doing simple builds to reflect on how far we\u2019ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,\u201d Singh said. \u201cGames might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Other games like <a href=\"https:\/\/techcrunch.com\/2025\/02\/25\/anthropics-claude-ai-is-playing-pokemon-on-twitch-slowly\/\" target=\"_blank\" rel=\"noopener\">Pok\u00e9mon Red<\/a>,\u00a0<a href=\"https:\/\/github.com\/OpenGenerativeAI\/llm-colosseum\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Street Fighter<\/a>, and <a href=\"https:\/\/techcrunch.com\/2024\/11\/05\/people-are-using-games-like-pictionary-to-benchmark-ai-now\/\" target=\"_blank\" rel=\"noopener\">Pictionary<\/a> have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is <a href=\"https:\/\/techcrunch.com\/2025\/02\/19\/this-week-in-ai-maybe-we-should-ignore-ai-benchmarks-for-now\/\" target=\"_blank\" rel=\"noopener\">notoriously tricky<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">Researchers often test AI models on <a rel=\"nofollow noopener\" href=\"https:\/\/openai.com\/index\/gpt-4-research\/\" target=\"_blank\">standardized evaluations<\/a>, but many of these tests give AI a home-field advantage. Because of the way they\u2019re trained, models are naturally gifted at certain, narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.<\/p>\n<p class=\"wp-block-paragraph\">Put simply, it\u2019s hard to glean what it means that OpenAI\u2019s GPT-4 can score in the 88th percentile on the LSAT, but cannot discern <a href=\"https:\/\/techcrunch.com\/2024\/08\/27\/why-ai-cant-spell-strawberry\/\" target=\"_blank\" rel=\"noopener\">how many Rs are in the word \u201cstrawberry.\u201d<\/a> Anthropic\u2019s <a rel=\"nofollow noopener\" href=\"https:\/\/www.anthropic.com\/news\/claude-3-7-sonnet\" target=\"_blank\">Claude 3.7 Sonnet<\/a> achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pok\u00e9mon than most five-year-olds. <\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"2352\" height=\"1168\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?w=680\" alt=\"\" class=\"wp-image-2984367\" srcset=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png 2352w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=150,74 150w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=300,149 300w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=768,381 768w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=680,338 680w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=1200,596 1200w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=1280,636 1280w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=430,214 430w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=720,358 720w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=900,447 900w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=800,397 800w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=1536,763 1536w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=2048,1017 2048w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=668,332 668w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=1242,617 1242w, https:\/\/techcrunch.com\/wp-content\/uploads\/2025\/03\/Screenshot-2025-03-20-at-2.22.35PM.png?resize=708,352 708w\" sizes=\"auto, (max-width: 2352px) 100vw, 2352px\"\/><\/figure>\n<p class=\"wp-block-paragraph\">MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like \u201cFrosty the Snowman\u201d or \u201ca charming tropical beach hut on a pristine sandy shore.\u201d<\/p>\n<p class=\"wp-block-paragraph\">But it\u2019s easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal \u2014 and thus the potential to collect more data about which models consistently score better.<\/p>\n<p class=\"wp-block-paragraph\">Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they\u2019re a strong signal, though.<\/p>\n<p class=\"wp-block-paragraph\">\u201cThe current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,\u201d Singh said. \u201cMaybe [MC-Bench] could be useful to companies to know if they\u2019re heading in the right direction.\u201d<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/03\/20\/a-high-schooler-built-a-website-that-lets-you-challenge-ai-models-to-a-minecraft-build-off\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that\u2019s Minecraft, the Microsoft-owned sandbox-building game. The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges to respond [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":157089,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-157088","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/157088","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=157088"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/157088\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/157089"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=157088"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=157088"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=157088"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}