{"id":243857,"date":"2026-06-02T19:02:21","date_gmt":"2026-06-02T19:02:21","guid":{"rendered":"https:\/\/entertainment.runfyers.com\/index.php\/2026\/06\/02\/new-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions-techcrunch\/"},"modified":"2026-06-02T19:02:21","modified_gmt":"2026-06-02T19:02:21","slug":"new-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions-techcrunch","status":"publish","type":"post","link":"https:\/\/entertainment.runfyers.com\/index.php\/2026\/06\/02\/new-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions-techcrunch\/","title":{"rendered":"New Microsoft tool lets devs spin up AI behavior tests using text descriptions | TechCrunch"},"content":{"rendered":"<div>\n<p id=\"speakable-summary\" class=\"wp-block-paragraph\">AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from <a href=\"https:\/\/www.theregister.com\/software\/2024\/12\/05\/mlcommons-produces-benchmark-of-ai-model-safety\/621835\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">safety<\/a> and compliance to <a href=\"https:\/\/techcrunch.com\/2025\/08\/25\/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit\/\" target=\"_blank\" rel=\"noreferrer noopener\">sycophancy<\/a> and <a href=\"https:\/\/www.anthropic.com\/research\/bloom\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">alignment<\/a>. 
But it appears companies and developers now face a new, specific need: making sure their AI system behaves as intended for their own product or service.<\/p>\n<p class=\"wp-block-paragraph\">In a bid to make that testing process simpler, Microsoft on Tuesday took the wraps off <a href=\"https:\/\/github.com\/responsibleai\/ASSERT\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">ASSERT<\/a>, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing.<\/p>\n<p class=\"wp-block-paragraph\">The open source framework, Microsoft says, makes evaluating application-specific AI behavior easy by using AI to turn high-level, natural-language descriptions of goals, policies, or intended behaviors into thorough, scored tests that developers can investigate.<\/p>\n<p class=\"wp-block-paragraph\">ASSERT takes plain-language descriptions of an AI model\u2019s expected behavior and policies, turns them into a structured set of acceptable and unacceptable behaviors, generates problem scenarios and test cases, runs them against the target system, and scores the results. It can also record the paths the AI system takes, including intermediate actions and tool calls, so developers can inspect where failures happen.<\/p>\n<p class=\"wp-block-paragraph\">Developers can also provide system context, tools, and constraints to further customize what the evaluations cover.<\/p>\n<p class=\"wp-block-paragraph\">For example, a developer could specify that a document research AI agent shouldn\u2019t send emails to people outside the company, that it should share confidential information only with C-level executives, and that it should provide concise summaries that take prior context into account. 
ASSERT will use those rules to generate test cases that check, on an ongoing basis, whether the system follows them.<\/p>\n<figure class=\"wp-block-image size-large is-resized\"><figcaption class=\"wp-element-caption\"><span class=\"wp-block-image__credits\"><strong>Image Credits:<\/strong> Microsoft<\/span><\/figcaption><\/figure>\n<p class=\"wp-block-paragraph\">The framework, according to Microsoft, fills a gap that broader, more general evaluations leave open: cases where an AI model\u2019s intended behavior is shaped by an application or product\u2019s context, policies, and tools.<\/p>\n<p class=\"wp-block-paragraph\">\u201cOne of the things we\u2019ve learned is that evaluations are absolutely critical to making good decisions,\u201d said <a href=\"https:\/\/www.linkedin.com\/in\/slbird\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Sarah Bird<\/a>, chief product officer of Responsible AI at Microsoft. \u201cBecause if you don\u2019t understand the behavior of the AI system, it\u2019s really hard to know if it\u2019s meeting your organization\u2019s bar\u00a0\u2026 What we found is that if you really want to have a trustworthy system, you should evaluate many more dimensions that are application-specific.\u201d<\/p>\n<p class=\"wp-block-paragraph\">Bird said ASSERT can be used to evaluate systems when they\u2019re being built, after deployment, and even for continuous monitoring.<\/p>\n<p class=\"wp-block-paragraph\">The release comes amid a gradual, broader shift in the AI industry. 
As models grow more capable, researchers are focusing on repeatable testing and regression checks, with <a href=\"https:\/\/crfm.stanford.edu\/helm\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">Stanford\u2019s HELM<\/a>, <a href=\"https:\/\/mlcommons.org\/ailuminate\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">MLCommons\u2019 AILuminate<\/a>, and evaluation groups like <a href=\"https:\/\/metr.org\/\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">METR<\/a> rolling out benchmarks to measure how models behave under different conditions.<\/p>\n<\/div>\n<p><br \/>\n<br \/><a href=\"https:\/\/techcrunch.com\/2026\/06\/02\/new-microsoft-tool-lets-devs-spin-up-ai-behavior-tests-using-text-descriptions\/\" target=\"_blank\" rel=\"noopener\">Source link <\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>AI researchers and labs have advanced by leaps and bounds in evaluating AI models for everything from safety and compliance to sycophancy and alignment. But it appears companies and developers are faced with a new, specific need: making sure their AI system behaves as intended for their specific product or service. 
In a bid to [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":243858,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[14],"tags":[],"class_list":{"0":"post-243857","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-tech"},"_links":{"self":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/243857","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/comments?post=243857"}],"version-history":[{"count":0,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/posts\/243857\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media\/243858"}],"wp:attachment":[{"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/media?parent=243857"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/categories?post=243857"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/entertainment.runfyers.com\/index.php\/wp-json\/wp\/v2\/tags?post=243857"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}