{"id":75984,"date":"2024-02-14T23:19:04","date_gmt":"2024-02-14T23:19:04","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2024\/02\/14\/largest-text-to-speech-ai-model-yet-shows-emergent-abilities-techcrunch\/"},"modified":"2024-02-14T23:19:04","modified_gmt":"2024-02-14T23:19:04","slug":"largest-text-to-speech-ai-model-yet-shows-emergent-abilities-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2024\/02\/14\/largest-text-to-speech-ai-model-yet-shows-emergent-abilities-techcrunch\/","title":{"rendered":"Largest text-to-speech AI model yet shows &#8217;emergent abilities&#8217; | TechCrunch"},"content":{"rendered":"<p> <br \/>\n<\/p>\n<div>\n<p id=\"speakable-summary\">Researchers at Amazon have trained the largest ever text-to-speech model yet, which they claim exhibits \u201cemergent\u201d qualities improving its ability to speak even complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley.<\/p>\n<p>These models were always going to grow and improve, but the researchers specifically hoped to see the kind of leap in ability that we observed once language models got past a certain size. For reasons unknown to us, once LLMs grow past a certain point, they start being way more robust and versatile, able to perform tasks they weren\u2019t trained to.<\/p>\n<p>That is not to say they are gaining sentience or anything, just that past a certain point their performance on certain conversational AI tasks hockey sticks. The team at Amazon AGI \u2014 no secret what they\u2019re aiming at \u2014 thought the same might happen as text-to-speech models grew as well, and their research suggests this is in fact the case.<\/p>\n<p>The new model is called <a href=\"https:\/\/amazon-ltts-paper.com\/\" target=\"_blank\" rel=\"noopener\">Big Adaptive Streamable TTS with Emergent abilities<\/a>, which they have contorted into the abbreviation BASE TTS. The largest version of the model uses 100,000 hours of public domain speech, 90% of which is in English, the remainder in German, Dutch and Spanish.<\/p>\n<p>At 980 million parameters, BASE-large appears to be the biggest model in this category. They also trained 400M- and 150M-parameter models based on 10,000 and 1,000 hours of audio respectively, for comparison \u2014 the idea being, if one of these models shows emergent behaviors but another doesn\u2019t, you have a range for where those behaviors begin to emerge.<\/p>\n<p>As it turns out, the medium-sized model showed the jump in capability the team was looking for, not necessarily in ordinary speech quality (it is reviewed better but only by a couple points) but in the set of emergent abilities they observed and measured. Here are examples of tricky text <a href=\"https:\/\/arxiv.org\/abs\/2402.08093\" target=\"_blank\" rel=\"noopener\">mentioned in the paper<\/a>:<\/p>\n<ul>\n<li><strong>Compound nouns<\/strong>: The Beckhams decided to rent a charming stone-built quaint countryside holiday cottage.<\/li>\n<li><strong>Emotions<\/strong>: \u201cOh my gosh! Are we really going to the Maldives? That\u2019s unbelievable!\u201d Jennie squealed, bouncing on her toes with uncontained glee.<\/li>\n<li><strong>Foreign words<\/strong>: \u201cMr. Henry, renowned for his mise en place, orchestrated a seven-course meal, each dish a pi\u00e8ce de r\u00e9sistance.<\/li>\n<li><strong>Paralinguistics<\/strong> (i.e. readable non-words): \u201cShh, Lucy, shhh, we mustn\u2019t wake your baby brother,\u201d Tom whispered, as they tiptoed past the nursery.<\/li>\n<li><strong>Punctuations<\/strong>: She received an odd text from her brother: \u2019Emergency @ home; call ASAP! Mom &amp; Dad are worried\u2026#familymatters.\u2019<\/li>\n<li><strong>Questions<\/strong>: But the Brexit question remains: After all the trials and tribulations, will the ministers find the answers in time?<\/li>\n<li><strong>Syntactic complexities<\/strong>: The movie that De Moya who was recently awarded the lifetime achievement award starred in 2022 was a box-office hit, despite the mixed reviews.<\/li>\n<\/ul>\n<p>\u201cThese sentences are designed to contain challenging tasks \u2013 parsing garden-path sentences, placing phrasal stress on long-winded compound nouns, producing emotional or whispered speech, or producing the correct phonemes for foreign words like \u201cqi\u201d or punctuations like \u201c@\u201d \u2013 none of which BASE TTS is explicitly trained to perform,\u201d the authors write.<\/p>\n<p>Such features normally trip up text-to-speech engines, which will mispronounce, skip words, use odd intonation or make some other blunder. BASE TTS still had trouble, but it did far better than its contemporaries \u2014 models like Tortoise and VALL-E.<\/p>\n<p>There are a bunch of examples of these difficult texts being spoken quite naturally by the new model <a href=\"https:\/\/amazon-ltts-paper.com\/\" target=\"_blank\" rel=\"noopener\">at the site they made for it.<\/a> Of course these were chosen by the researchers, so they\u2019re necessarily cherry-picked, but it\u2019s impressive regardless. Here are a couple, if you don\u2019t feel like clicking through:<\/p>\n<p><!--[if lt IE 9]><![endif]--><br \/>\n<audio class=\"wp-audio-shortcode\" id=\"audio-2665528-1\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/shh-its-starting.wav?_=1\"\/><a href=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/shh-its-starting.wav\" target=\"_blank\" rel=\"noopener\">https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/shh-its-starting.wav<\/a><\/audio><\/p>\n<p><audio class=\"wp-audio-shortcode\" id=\"audio-2665528-2\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/how-french.wav?_=2\"\/><a href=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/how-french.wav\" target=\"_blank\" rel=\"noopener\">https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/how-french.wav<\/a><\/audio><\/p>\n<p><audio class=\"wp-audio-shortcode\" id=\"audio-2665528-3\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/wav\" src=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/guiding-moonlight.wav?_=3\"\/><a href=\"https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/guiding-moonlight.wav\" target=\"_blank\" rel=\"noopener\">https:\/\/techcrunch.com\/wp-content\/uploads\/2024\/02\/guiding-moonlight.wav<\/a><\/audio><\/p>\n<p>Because the three BASE TTS models share an architecture, it seems clear that the size of the model and the extent of its training data seem to be the cause of the model\u2019s ability to handle some of the above complexities. Bear in mind this is still an experimental model and process \u2014 not a commercial model or anything. Later research will have to identify the inflection point for emergent ability and how to train and deploy the resulting model efficiently.<\/p>\n<p>Notably, this model is \u201cstreamable,\u201d as the name says \u2014 meaning it doesn\u2019t need to generate whole sentences at once but goes moment by moment at a relatively low bitrate. The team has also attempted to package the speech metadata like emotionality, prosody and so on in a separate, low-bandwidth stream that could accompany vanilla audio.<\/p>\n<p>It seems that text-to-speech models may have a breakout moment in 2024 \u2014 just in time for the election! But there\u2019s no denying the usefulness of this technology, for accessibility in particular. The team does note that it declined to publish the model\u2019s source and other data due to the risk of bad actors taking advantage of it. The cat will get out of that bag eventually, though.<\/p>\n<\/p><\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2024\/02\/14\/largest-text-to-speech-ai-model-yet-shows-emergent-abilities\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Researchers at Amazon have trained the largest ever text-to-speech model yet, which they claim exhibits \u201cemergent\u201d qualities improving its ability to speak even complex sentences naturally. The breakthrough could be what the technology needs to escape the uncanny valley. These models were always going to grow and improve, but the researchers specifically hoped to see [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":75985,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-75984","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/75984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=75984"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/75984\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/75985"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=75984"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=75984"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=75984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}