{"id":108352,"date":"2024-06-29T22:30:00","date_gmt":"2024-06-29T22:30:00","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2024\/06\/29\/exclusive-geminis-data-analyzing-abilities-arent-as-good-as-google-claims\/"},"modified":"2024-06-29T22:30:00","modified_gmt":"2024-06-29T22:30:00","slug":"exclusive-geminis-data-analyzing-abilities-arent-as-good-as-google-claims","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2024\/06\/29\/exclusive-geminis-data-analyzing-abilities-arent-as-good-as-google-claims\/","title":{"rendered":"Exclusive: Gemini&#8217;s data-analyzing abilities aren&#8217;t as good as Google claims"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">One of the selling points of Google\u2019s flagship generative AI models, <a href=\"https:\/\/techcrunch.com\/2024\/04\/29\/what-is-google-gemini-ai\/\" target=\"_blank\" rel=\"noopener\">Gemini 1.5 Pro and 1.5 Flash<\/a>, is the amount of data they can supposedly process and analyze. In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their \u201clong context,\u201d like summarizing multiple hundred-page documents or searching across scenes in film footage.<\/p>\n<p class=\"wp-block-paragraph\">But new research suggests that the models aren\u2019t, in fact, very good at those things.<\/p>\n<p class=\"wp-block-paragraph\">Two <a href=\"https:\/\/x.com\/m2saxon\/status\/1805452166171443221?t=RgnCc6gVGz_mOh8Gvm0MgA&amp;s=19\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">separate<\/a> <a href=\"https:\/\/x.com\/mar_kar_\/status\/1805660949023793224?t=CdEkD5cCSbPEBKb16eKdaw&amp;s=19\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">studies<\/a> investigated how well Google\u2019s Gemini models and others make sense out of an enormous amount of data \u2014 think \u201cWar and Peace\u201d-length works. 
Both find that Gemini 1.5 Pro and 1.5 Flash struggle to answer questions about large datasets correctly; in one series of document-based tests, the models gave the right answer only 40% to 50% of the time.<\/p>\n<p class=\"wp-block-paragraph\">\u201cWhile models like Gemini 1.5 Pro can technically process long\u00a0contexts, we have seen many cases indicating that the models don\u2019t actually \u2018understand\u2019 the content,\u201d Marzena Karpinska, a postdoc at UMass Amherst and a co-author on one of the studies, told TechCrunch. <\/p>\n<h2 class=\"wp-block-heading\" id=\"h-gemini-s-context-window-is-lacking\">Gemini\u2019s context window is lacking<\/h2>\n<p class=\"wp-block-paragraph\">A model\u2019s context, or context window, refers to input data (e.g., text) that the model considers before generating output (e.g., additional text). A simple question \u2014 \u201cWho won the 2020 U.S. presidential election?\u201d \u2014 can serve as context, as can a movie script, show or audio clip. And as context windows grow, so does the size of the documents being fit into them.<\/p>\n<p class=\"wp-block-paragraph\">The newest versions of Gemini can take in upward of 2 million tokens as context. (\u201cTokens\u201d are subdivided bits of raw data, like the syllables \u201cfan,\u201d \u201ctas\u201d and \u201ctic\u201d in the word \u201cfantastic.\u201d) That\u2019s equivalent to roughly 1.4 million words, two hours of video or 22 hours of audio \u2014 the largest context of any commercially available model.<\/p>\n<p class=\"wp-block-paragraph\">In a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini\u2019s long-context capabilities. 
One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast \u2014 around 402 pages \u2014 for quotes containing jokes, and then find a scene in the telecast that looked similar to a pencil sketch.<\/p>\n<p class=\"wp-block-paragraph\">VP of research at Google DeepMind Oriol Vinyals, who led the briefing, described the model as \u201cmagical.\u201d <\/p>\n<p class=\"wp-block-paragraph\">\u201c[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,\u201d he said.<\/p>\n<p class=\"wp-block-paragraph\">That might have been an exaggeration.<\/p>\n<p class=\"wp-block-paragraph\">In one of the aforementioned studies benchmarking these capabilities, Karpinska, along with researchers from the Allen Institute for AI and Princeton, asked the models to evaluate true\/false statements about fiction books written in English. The researchers chose recent works so that the models couldn\u2019t \u201ccheat\u201d by relying on foreknowledge, and they peppered the statements with references to specific details and plot points that\u2019d be impossible to comprehend without reading the books in their entirety.<\/p>\n<p class=\"wp-block-paragraph\">Given a statement like \u201cBy using her skills as an Apoth, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona\u2019s wooden chest,\u201d Gemini 1.5 Pro and 1.5 Flash \u2014 having ingested the relevant book \u2014 had to say whether the statement was true or false and explain their reasoning.<\/p>\n<figure class=\"wp-block-image aligncenter size-large\"><figcaption class=\"wp-element-caption\"><strong>Image Credits:<\/strong> UMass Amherst<\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">Tested on one book around 260,000 words (~520 pages) in length, the researchers found that 1.5 Pro answered the true\/false statements correctly 46.7% of the time while Flash answered correctly only 20% of the time. 
That means a coin flip would answer questions about the book significantly more reliably than Google\u2019s latest machine learning models. Averaging all the benchmark results, neither model managed to achieve higher-than-random-chance question-answering accuracy.<\/p>\n<p class=\"wp-block-paragraph\">\u201cWe\u2019ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be solved by retrieving sentence-level evidence,\u201d Karpinska said. \u201cQualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.\u201d<\/p>\n<p class=\"wp-block-paragraph\">The second of the two studies, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to \u201creason over\u201d videos \u2014 that is, search through and answer questions about the content in them.<\/p>\n<p class=\"wp-block-paragraph\">The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in the images (e.g., \u201cWhat cartoon character is on this cake?\u201d). To evaluate the models, they picked one of the images at random and inserted \u201cdistractor\u201d images before and after it to create slideshow-like footage.<\/p>\n<p class=\"wp-block-paragraph\">Flash didn\u2019t perform all that well. In a test that had the model transcribe six handwritten digits from a \u201cslideshow\u201d of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits. 
<\/p>\n<p class=\"wp-block-paragraph\">\u201cOn real question-answering tasks over images, it appears to be particularly hard for all the models we tested,\u201d Michael Saxon, a PhD student at UC Santa Barbara and one of the study\u2019s co-authors, told TechCrunch. \u201cThat small amount of reasoning \u2014 recognizing that a number is in a frame and reading it \u2014 might be what is breaking the model.\u201d<\/p>\n<h2 class=\"wp-block-heading\" id=\"h-google-is-overpromising-with-gemini\">Google is overpromising with Gemini<\/h2>\n<p class=\"wp-block-paragraph\">Neither of the studies has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn\u2019t meant to be as capable as Pro in terms of performance; Google advertises it as a low-cost alternative.<\/p>\n<p class=\"wp-block-paragraph\">Nevertheless, both <a href=\"https:\/\/techcrunch.com\/2024\/02\/15\/we-tested-googles-gemini-chatbot-heres-how-it-performed\/\" target=\"_blank\" rel=\"noopener\">add fuel to the fire<\/a> that Google\u2019s been overpromising \u2014 and under-delivering \u2014 with Gemini <a href=\"https:\/\/techcrunch.com\/2023\/12\/07\/googles-best-gemini-demo-was-faked\/\" target=\"_blank\" rel=\"noopener\">from the beginning<\/a>. None of the models the researchers tested, including OpenAI\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/05\/13\/openais-newest-model-is-gpt-4o\/\" target=\"_blank\" rel=\"noopener\">GPT-4o<\/a> and Anthropic\u2019s <a href=\"https:\/\/techcrunch.com\/2024\/06\/20\/anthropic-claims-its-latest-model-is-best-in-class\/\" target=\"_blank\" rel=\"noopener\">Claude 3.5 Sonnet<\/a>, performed well. 
But Google\u2019s the only model provider that\u2019s given the context window top billing in its advertisements.<\/p>\n<p class=\"wp-block-paragraph\">\u201cThere\u2019s nothing wrong with the simple claim, \u2018Our model can take X number of tokens\u2019 based on the objective technical details,\u201d Saxon said. \u201cBut the question is, what useful thing can you do with it?\u201d<\/p>\n<p class=\"wp-block-paragraph\">Generative AI, broadly speaking, is coming under increased scrutiny as businesses (and investors) grow frustrated with the technology\u2019s limitations.<\/p>\n<p class=\"wp-block-paragraph\">In a\u00a0<a href=\"https:\/\/techcrunch.com\/2024\/01\/11\/generative-ai-enterprise-not-home-run\/\" target=\"_blank\" rel=\"noopener\">pair of recent surveys from<\/a>\u00a0Boston Consulting Group, about half of the respondents \u2014 all C-suite executives \u2014 said that they don\u2019t expect generative AI to bring about substantial productivity gains and that they\u2019re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently\u00a0<a href=\"https:\/\/pitchbook.com\/news\/articles\/generative-ai-seed-funding-drops\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">reported<\/a>\u00a0that, for two consecutive quarters, generative AI dealmaking at the earliest stages has declined, plummeting 76% from its Q3 2023 peak. <\/p>\n<p class=\"wp-block-paragraph\">Faced with meeting-summarizing chatbots that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. 
Google \u2014 which has raced, <a href=\"https:\/\/techcrunch.com\/2024\/02\/23\/embarrassing-and-wrong-google-admits-it-lost-control-of-image-generating-ai\/\" target=\"_blank\" rel=\"noopener\">at times clumsily<\/a>, to catch up to its generative AI rivals \u2014 was desperate to make Gemini\u2019s context one of those differentiators.<\/p>\n<p class=\"wp-block-paragraph\">But the bet was premature, it seems. <\/p>\n<p class=\"wp-block-paragraph\">\u201cWe haven\u2019t settled on a way to really show that \u2018reasoning\u2019 or \u2018understanding\u2019 over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,\u201d Karpinska said. \u201cWithout the knowledge of how long\u00a0context\u00a0processing is implemented \u2014 and companies do not share these details \u2014 it is hard to say how realistic these claims are.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Google didn\u2019t respond to a request for comment. <\/p>\n<p class=\"wp-block-paragraph\">Both Saxon and Karpinska believe the antidotes to hyped-up claims around generative AI are better benchmarks and, in the same vein, greater emphasis on third-party critique. 
Saxon notes that one of the more common tests for long context (liberally cited by Google in its marketing materials), \u201cneedle in the haystack,\u201d only measures a model\u2019s ability to retrieve particular info, like names and numbers, from datasets \u2014 not answer complex questions about that info.<\/p>\n<p class=\"wp-block-paragraph\">\u201cAll scientists and most engineers using these models are essentially in agreement that our existing benchmark culture is broken,\u201d Saxon said, \u201cso it\u2019s important that the public understands to take these giant reports containing numbers like \u2018general intelligence across benchmarks\u2019 with a massive grain of salt.\u201d<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2024\/06\/29\/geminis-data-analyzing-abilities-arent-as-good-as-google-claims\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the selling points of Google\u2019s flagship generative AI models, Gemini 1.5 Pro and 1.5 Flash, is the amount of data they can supposedly process and analyze. 
In press briefings and demos, Google has repeatedly claimed that the models can accomplish previously impossible tasks thanks to their \u201clong context,\u201d like summarizing multiple hundred-page documents [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":108353,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-108352","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/108352","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=108352"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/108352\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/108353"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=108352"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=108352"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=108352"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}