{"id":144015,"date":"2025-01-14T22:26:51","date_gmt":"2025-01-14T22:26:51","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/01\/14\/inside-metas-race-to-beat-openai-we-need-to-learn-how-to-build-frontier-and-win-this-race\/"},"modified":"2025-01-14T22:26:51","modified_gmt":"2025-01-14T22:26:51","slug":"inside-metas-race-to-beat-openai-we-need-to-learn-how-to-build-frontier-and-win-this-race","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/01\/14\/inside-metas-race-to-beat-openai-we-need-to-learn-how-to-build-frontier-and-win-this-race\/","title":{"rendered":"Inside Meta\u2019s race to beat OpenAI: \u201cWe need to learn how to build frontier and win this race\u201d"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">A major copyright lawsuit against Meta has revealed a trove of internal communications about the company\u2019s plans to develop its open-source AI models, Llama, which include discussions about avoiding \u201cmedia coverage suggesting we have used a dataset we know to be pirated.\u201d<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">The messages, which were part of a series of exhibits 
unsealed by a California court, suggest Meta used copyrighted data when training its AI systems and worked to conceal it \u2014 as it raced to beat rivals like OpenAI and Mistral. Portions of <a href=\"https:\/\/www.theguardian.com\/technology\/2025\/jan\/10\/mark-zuckerberg-meta-books-ai-models-sarah-silverman\" target=\"_blank\" rel=\"noopener\">the messages were first revealed<\/a> last week.<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta\u2019s vice president of generative AI, <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.391.10.pdf\" target=\"_blank\" rel=\"noopener\">wrote that the company\u2019s goal<\/a> \u201cneeds to be GPT4,\u201d referring to the large language model OpenAI <a href=\"https:\/\/www.theverge.com\/2023\/3\/14\/23638033\/openai-gpt-4-chatgpt-multimodal-deep-learning\" target=\"_blank\" rel=\"noopener\">announced in March of 2023<\/a>. Meta had \u201cto learn how to build frontier and win this race,\u201d Al-Dahle added. Those plans apparently involved using the <a href=\"https:\/\/www.theguardian.com\/books\/2023\/sep\/15\/four-large-us-publishers-sue-shadow-library-for-alleged-copyright-infringement\" target=\"_blank\" rel=\"noopener\">book piracy site Library Genesis (LibGen)<\/a> to train Meta\u2019s AI systems. 
<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">An <a href=\"https:\/\/storage.courtlistener.com\/recap\/gov.uscourts.cand.415175\/gov.uscourts.cand.415175.391.24.pdf\" target=\"_blank\" rel=\"noopener\">undated email from Meta director of product Sony Theakanath<\/a>, sent to VP of AI research Joelle Pineau, weighed whether to use LibGen internally only, for benchmarks included in a blog post, or to create a model trained on the site. In the email, Theakanath writes that \u201cGenAI has been approved to use LibGen for Llama3&#8230; with a number of agreed upon mitigations\u201d after escalating it to \u201cMZ\u201d \u2014 presumably Meta CEO Mark Zuckerberg. As noted in the email, Theakanath believed \u201cLibgen is essential to meet SOTA [state-of-the-art] numbers,\u201d adding \u201cit is known that OpenAI and Mistral are using the library for their models (through word of mouth).\u201d Mistral and OpenAI haven\u2019t stated whether or not they use LibGen. 
(<em>The Verge<\/em> reached out to both for more information).<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component clear-both block\">\n<div class=\"my-9\">\n<p><figcaption class=\"duet--article--dangerously-set-cms-markup inline text-gray-13 dark:text-gray-e9 [&amp;&gt;a:hover]:text-black [&amp;&gt;a:hover]:shadow-underline-black dark:[&amp;&gt;a:hover]:text-gray-e9 dark:[&amp;&gt;a:hover]:shadow-underline-gray-63 [&amp;&gt;a]:shadow-underline-gray-13 dark:[&amp;&gt;a]:shadow-underline-gray-63\"><em>Meta\u2019s Theakanath writes that LibGen is \u201cessential\u201d to reaching \u201cSOTA numbers across all categories.\u201d<\/em><\/figcaption><cite class=\"duet--article--dangerously-set-cms-markup inline not-italic text-gray-63 dark:text-gray-bd [&amp;&gt;a:hover]:text-gray-63 [&amp;&gt;a:hover]:shadow-underline-black dark:[&amp;&gt;a:hover]:text-gray-bd dark:[&amp;&gt;a:hover]:shadow-underline-gray [&amp;&gt;a]:shadow-underline-gray-63 dark:[&amp;&gt;a]:text-gray-bd dark:[&amp;&gt;a]:shadow-underline-gray\">Screenshot: The Verge<\/cite><\/p>\n<\/div>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">The <a href=\"https:\/\/www.theverge.com\/2023\/7\/9\/23788741\/sarah-silverman-openai-meta-chatgpt-llama-copyright-infringement-chatbots-artificial-intelligence-ai\" target=\"_blank\" rel=\"noopener\">court documents stem from a class action lawsuit<\/a> that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its AI models in violation of intellectual property laws. 
Meta, like other AI companies, has argued that using copyrighted material in training data should constitute legal fair use. <em>The Verge<\/em> reached out to Meta with a request for comment but didn\u2019t immediately hear back.<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">Some of the \u201cmitigations\u201d for using LibGen included stipulations that Meta must \u201cremove data clearly marked as pirated\/stolen,\u201d while avoiding externally citing \u201cthe use of any training data\u201d from the site. Theakanath\u2019s email also said the company would need to \u201cred team\u201d its models \u201cfor bioweapons and CBRNE [Chemical, Biological, Radiological, Nuclear, and Explosives]\u201d risks.<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">The email also went over some of the \u201cpolicy risks\u201d posed by the use of LibGen, including how regulators might respond to media coverage suggesting Meta had used pirated content. \u201cThis may undermine our negotiating position with regulators on these issues,\u201d the email said. 
<a href=\"https:\/\/www.courtlistener.com\/docket\/67569326\/391\/26\/kadrey-v-meta-platforms-inc\/\" target=\"_blank\" rel=\"noopener\">An April 2023 conversation<\/a> between Meta researcher Nikolay Bashlykov and AI team member David Esiobu also showed Bashlykov admitting he\u2019s \u201cnot sure we can use meta\u2019s IPs to load through torrents [of] pirate content.\u201d<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\"><a href=\"https:\/\/www.courtlistener.com\/docket\/67569326\/391\/18\/kadrey-v-meta-platforms-inc\/\" target=\"_blank\" rel=\"noopener\">Other internal documents<\/a> show the measures Meta took to obscure the copyright information in LibGen\u2019s training data. A document titled \u201cobservations on LibGen-SciMag\u201d shows comments left by employees about how to improve the dataset. One suggestion is to \u201cremove more copyright headers and document identifiers,\u201d which includes any lines containing \u201cISBN,\u201d \u201cCopyright,\u201d \u201cAll rights reserved,\u201d or the copyright symbol. 
Other notes mention taking out more metadata \u201cto avoid potential legal complications,\u201d as well as considering whether to remove a paper\u2019s list of authors \u201cto reduce liability.\u201d<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component clear-both block\">\n<div class=\"my-9\">\n<p><figcaption class=\"duet--article--dangerously-set-cms-markup inline text-gray-13 dark:text-gray-e9 [&amp;&gt;a:hover]:text-black [&amp;&gt;a:hover]:shadow-underline-black dark:[&amp;&gt;a:hover]:text-gray-e9 dark:[&amp;&gt;a:hover]:shadow-underline-gray-63 [&amp;&gt;a]:shadow-underline-gray-13 dark:[&amp;&gt;a]:shadow-underline-gray-63\"><em>The document discusses removing \u201ccopyright headers and document identifiers.\u201d<\/em><\/figcaption><cite class=\"duet--article--dangerously-set-cms-markup inline not-italic text-gray-63 dark:text-gray-bd [&amp;&gt;a:hover]:text-gray-63 [&amp;&gt;a:hover]:shadow-underline-black dark:[&amp;&gt;a:hover]:text-gray-bd dark:[&amp;&gt;a:hover]:shadow-underline-gray [&amp;&gt;a]:shadow-underline-gray-63 dark:[&amp;&gt;a]:text-gray-bd dark:[&amp;&gt;a]:shadow-underline-gray\">Screenshot: The Verge<\/cite><\/p>\n<\/div>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">Last April, <em>The New York Times<\/em> <a href=\"https:\/\/www.nytimes.com\/2024\/04\/06\/technology\/tech-giants-harvest-data-artificial-intelligence.html\" target=\"_blank\" rel=\"noopener\">reported<\/a> on the frantic race inside Meta after ChatGPT\u2019s debut, revealing the company had hit a wall: it had used up almost every available English book, article, and poem it could 
find online. Desperate for more data, executives reportedly discussed buying Simon &amp; Schuster outright and considered hiring contractors in Africa to summarize books without permission. <\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">In the report, some executives justified their approach by pointing to OpenAI\u2019s \u201cmarket precedent\u201d of using copyrighted works, while others argued <a href=\"https:\/\/www.reuters.com\/article\/us-google-books-idUSKCN0SA1S020151016\/\" target=\"_blank\" rel=\"noopener\">Google\u2019s 2015 court victory establishing its right to scan books<\/a> could provide legal cover. \u201cThe only thing holding us back from being as good as ChatGPT is literally just data volume,\u201d one executive said in a meeting, per <em>The New York Times<\/em>.<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">It\u2019s been reported that frontier labs like OpenAI and Anthropic have hit a data wall, which means they don\u2019t have sufficient new data to train their large language models. 
Many leaders have denied this. OpenAI CEO Sam Altman <a href=\"https:\/\/x.com\/sama\/status\/1856941766915641580?lang=en\" target=\"_blank\">said plainly<\/a>: \u201cThere is no wall.\u201d OpenAI cofounder Ilya Sutskever, who <a href=\"https:\/\/www.theverge.com\/2024\/5\/14\/24156920\/openai-chief-scientist-ilya-sutskever-leaves\" target=\"_blank\" rel=\"noopener\">left the company last May<\/a> to start a new frontier lab, has been more open about the possibility of a data wall. At <a href=\"https:\/\/www.theverge.com\/2024\/12\/13\/24320811\/what-ilya-sutskever-sees-openai-model-data-training\" target=\"_blank\" rel=\"noopener\">a premier AI conference last month<\/a>, Sutskever said: \u201cWe\u2019ve achieved peak data and there\u2019ll be no more. We have to deal with the data that we have. There\u2019s only one internet.\u201d<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">This data scarcity has led to a whole lot of weird, new ways to get unique data. 
<em>Bloomberg<\/em> <a href=\"https:\/\/www.bloomberg.com\/news\/articles\/2025-01-10\/youtubers-are-selling-their-unused-video-footage-to-ai-companies\" target=\"_blank\" rel=\"noopener\">reported<\/a> that frontier labs like OpenAI and Google have been paying digital content creators between $1 and $4 per minute for their unused video footage through a third party to train LLMs (both of those companies have competing AI video-generation products).<\/p>\n<\/div>\n<div class=\"duet--article--article-body-component\">\n<p class=\"duet--article--dangerously-set-cms-markup duet--article--standard-paragraph mb-20 font-fkroman text-18 leading-160 -tracking-1 selection:bg-franklin-20 dark:text-white dark:selection:bg-blurple [&amp;_a:hover]:shadow-highlight-franklin dark:[&amp;_a:hover]:shadow-highlight-blurple [&amp;_a]:shadow-underline-black dark:[&amp;_a]:shadow-underline-white\">With companies like Meta and OpenAI hoping to grow their AI systems as fast as possible, things are bound to get a bit messy. 
Though <a href=\"https:\/\/www.theverge.com\/2024\/2\/13\/24072131\/sarah-silverman-paul-tremblay-openai-chatgpt-copyright-lawsuit\" target=\"_blank\" rel=\"noopener\">a judge partially dismissed Kadrey and Silverman\u2019s class action<\/a> lawsuit last year, the evidence outlined here could strengthen parts of their case as it moves forward in court.<\/p>\n<\/div>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/www.theverge.com\/2025\/1\/14\/24343692\/meta-lawsuit-copyright-lawsuit-llama-libgen\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>A major copyright lawsuit against Meta has revealed a trove of internal communications about the company\u2019s plans to develop its open-source AI models, Llama, which include discussions about avoiding \u201cmedia coverage suggesting we have used a dataset we know to be pirated.\u201d The messages, which were part of a series of exhibits unsealed by a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":144016,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-144015","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/144015","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=144015"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.ph
p\/wp-json\/wp\/v2\/posts\/144015\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/144016"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=144015"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=144015"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=144015"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}