{"id":147342,"date":"2025-01-31T22:04:46","date_gmt":"2025-01-31T22:04:46","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/01\/31\/mlcommons-and-hugging-face-team-up-to-release-massive-speech-data-set-for-ai-research-techcrunch\/"},"modified":"2025-01-31T22:04:46","modified_gmt":"2025-01-31T22:04:46","slug":"mlcommons-and-hugging-face-team-up-to-release-massive-speech-data-set-for-ai-research-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2025\/01\/31\/mlcommons-and-hugging-face-team-up-to-release-massive-speech-data-set-for-ai-research-techcrunch\/","title":{"rendered":"MLCommons and Hugging Face team up to release massive speech data set for AI research | TechCrunch"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world\u2019s largest collections of public domain voice recordings for AI research.<\/p>\n<p class=\"wp-block-paragraph\">The data set, called <a rel=\"nofollow noopener\" href=\"https:\/\/huggingface.co\/datasets\/MLCommons\/unsupervised_peoples_speech\" target=\"_blank\">Unsupervised People\u2019s Speech<\/a>, contains more than a million hours of audio spanning at least 89 different languages. MLCommons says it was motivated to create it by a desire to support R&amp;D in \u201cvarious areas of speech technology.\u201d<\/p>\n<p class=\"wp-block-paragraph\">\u201cSupporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally,\u201d the organization wrote in a <a rel=\"nofollow noopener\" href=\"https:\/\/mlcommons.org\/2025\/01\/new-unsupervised-peoples-speech\/\" target=\"_blank\">blog post<\/a> Thursday. 
\u201cWe anticipate several avenues for the research community to continue to build and develop, especially in the areas of improving low-resource language speech models, enhanced speech recognition across different accents and dialects, and novel applications in speech synthesis.\u201d<\/p>\n<p class=\"wp-block-paragraph\">It\u2019s an admirable goal, to be sure. But AI data sets like Unsupervised People\u2019s Speech can carry risks for the researchers who choose to use them.<\/p>\n<p class=\"wp-block-paragraph\">Biased data is one of those risks. The recordings in Unsupervised People\u2019s Speech came from Archive.org, the nonprofit perhaps best known for the Wayback Machine web archival tool. Because many of Archive.org\u2019s contributors are English-speaking \u2014 and American \u2014 almost all of the recordings in Unsupervised People\u2019s Speech are in American-accented English, <a rel=\"nofollow noopener\" href=\"https:\/\/huggingface.co\/datasets\/MLCommons\/unsupervised_peoples_speech\" target=\"_blank\">per the readme on the official project page<\/a>.<\/p>\n<p class=\"wp-block-paragraph\">As a result, without careful filtering, AI systems like speech recognition and voice synthesizer models trained on Unsupervised People\u2019s Speech could inherit that same skew toward American-accented English. They might, for example, struggle to transcribe English spoken by a non-native speaker, or have trouble generating synthetic voices in languages other than English.<\/p>\n<p class=\"wp-block-paragraph\">Unsupervised People\u2019s Speech might also contain recordings from people unaware that their voices are being used for AI research purposes \u2014 including commercial applications. While MLCommons says that all recordings in the data set are public domain or available under Creative Commons licenses, there\u2019s the possibility mistakes were made. 
<\/p>\n<p class=\"wp-block-paragraph\"><a rel=\"nofollow noopener\" href=\"https:\/\/news.mit.edu\/2024\/study-large-language-models-datasets-lack-transparency-0830\" target=\"_blank\">According to an MIT analysis<\/a>, hundreds of publicly available AI training data sets lack licensing information and contain errors. Creator advocates including Ed Newton-Rex, the CEO of AI ethics-focused nonprofit Fairly Trained, have made the case that creators shouldn\u2019t be required to \u201copt out\u201d of AI data sets because of the onerous burden opting out imposes on these creators.<\/p>\n<p class=\"wp-block-paragraph\">\u201cMany creators (e.g. Squarespace users) have no meaningful way of opting out,\u201d <a rel=\"nofollow\" href=\"https:\/\/x.com\/ednewtonrex\/status\/1803698394143268899\" target=\"_blank\">Newton-Rex wrote<\/a> in a post on X last June. \u201cFor creators who <em>can<\/em> opt out, there are multiple overlapping opt-out methods, which are (1) incredibly confusing and (2) woefully incomplete in their coverage. Even if a perfect universal opt-out existed, it would be hugely unfair to put the opt-out burden on creators, given that generative AI uses their work to compete with them \u2014 many would simply not realize they could opt out.\u201d<\/p>\n<p class=\"wp-block-paragraph\">MLCommons says that it\u2019s committed to updating, maintaining, and improving the quality of Unsupervised People\u2019s Speech. 
But given the potential flaws, it\u2019d behoove developers to exercise serious caution.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2025\/01\/31\/mlcommons-and-hugging-face-team-up-to-release-massive-speech-data-set-for-ai-research\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>MLCommons, a nonprofit AI safety working group, has teamed up with AI dev platform Hugging Face to release one of the world\u2019s largest collections of public domain voice recordings for AI research. The data set, called Unsupervised People\u2019s Speech, contains more than a million hours of audio spanning at least 89 different languages. MLCommons says [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":147343,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-147342","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/147342","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=147342"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/147342\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/147343"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.co
m\/index.php\/wp-json\/wp\/v2\/media?parent=147342"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=147342"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=147342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}