
Case study: Securing AI application supply chains

The rapid adoption of AI applications, including agents, orchestrators, and autonomous workflows, represents a significant shift in how software systems are built and operated. Unlike traditional applications, these systems are active participants in execution. They make decisions, invoke tools, and interact with other systems on behalf of users. While this evolution enables new capabilities, it also introduces an expanded and less familiar attack surface.

Security discussions often focus on prompt-level protections, and that focus is justified. However, prompt security addresses only one layer of risk. Equally important is securing the AI application supply chain, including the frameworks, SDKs, and orchestration layers used to build and operate these systems. Vulnerabilities in these components can allow attackers to influence AI behavior, access sensitive resources, or compromise the broader application environment.

The recent disclosure of CVE-2025-68664, known as LangGrinch, in LangChain Core highlights the importance of securing the AI supply chain. This blog uses that real-world vulnerability to illustrate how Microsoft Defender posture management capabilities can help organizations identify and mitigate AI supply chain risks.

Case example: Serialization injection in LangChain (CVE-2025-68664)

A recently disclosed vulnerability in LangChain Core highlights how AI frameworks can become conduits for exploitation when workloads are not properly secured. Tracked as CVE-2025-68664 and commonly referred to as LangGrinch, this flaw exposes risks associated with insecure deserialization in agentic ecosystems that rely heavily on structured metadata exchange.

Vulnerability summary

CVE-2025-68664 is a serialization injection vulnerability affecting the langchain-core Python package. The issue stems from improper handling of internal metadata fields during the serialization and deserialization process. If exploited, an attacker could:

  • Extract secrets such as environment variables without authorization
  • Instantiate unintended classes during object reconstruction
  • Trigger side effects through malicious object initialization

The vulnerability carries a CVSS score of 9.3, highlighting the risks that arise when AI orchestration systems do not adequately separate control signals from user-supplied data.

Understanding the root cause: The lc marker

LangChain relies on a custom serialization format to maintain state across different components of an AI chain. To distinguish between standard data and serialized LangChain objects, the framework uses a reserved key called lc. During deserialization, when the framework encounters a dictionary containing this key, it interprets the content as a trusted object rather than plain user data.

The vulnerability originates in the dumps() and dumpd() functions in affected versions of the langchain-core package. These functions did not properly escape or neutralize the lc key when processing user-controlled dictionaries. As a result, if an attacker is able to inject a dictionary containing the lc key into a data stream that is later serialized and deserialized, the framework may reconstruct a malicious object.

This is a classic example of an injection flaw where data and control signals are not properly separated, allowing untrusted input to influence the execution flow.
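
To make the failure mode concrete, here is a minimal Python sketch of the pattern. It is not LangChain's actual implementation: the naive_load helper and the exact key shapes are illustrative assumptions, but they mirror the lc-marker behavior described above.

import os

# Minimal sketch of the injection pattern (simplified; not LangChain's
# actual code). The deserializer treats any dict carrying the reserved
# "lc" key as a trusted serialized object instead of plain data.
def naive_load(obj):
    if isinstance(obj, dict) and obj.get("lc") == 1:
        if obj.get("type") == "secret":
            # Reconstructing a "secret" reference reads the named
            # environment variable -- this is the leak primitive.
            return os.environ.get(obj["id"][0])
    return obj

# User-controlled data that was serialized without escaping the "lc" key:
untrusted = {"lc": 1, "type": "secret", "id": ["OPENAI_API_KEY"]}
print(naive_load(untrusted))  # returns the secret instead of plain data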

Mitigation and protection guidance

Microsoft recommends that all organizations using LangChain review their deployments and apply the following mitigations immediately.

1. Update LangChain Core

The most effective defense is to upgrade to a patched version of the langchain-core package.

  • For 0.3.x users: Update to version 0.3.81 or later.
  • For 1.x users: Update to version 1.2.5 or later.
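
As a quick local sanity check, the following sketch flags a vulnerable installed version. It assumes Python 3.8+ (for importlib.metadata) and uses a simplified version parse that ignores pre-release tags.

from importlib.metadata import PackageNotFoundError, version

def release_tuple(v: str) -> tuple:
    # Simplified parse: keep the leading numeric segments only
    # (e.g. "0.3.80" -> (0, 3, 80)).
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())

try:
    installed = release_tuple(version("langchain-core"))
except PackageNotFoundError:
    installed = None

if installed is None:
    print("langchain-core is not installed")
elif installed < (0, 3, 81) or (1, 0, 0) <= installed < (1, 2, 5):
    print("vulnerable langchain-core " + ".".join(map(str, installed)) + "; upgrade now")
else:
    print("langchain-core appears to be at a patched version")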

2. Use Cloud Security Explorer to identify any instances of LangChain in your environment

To identify instances of the LangChain package in assets protected by Defender for Cloud, customers can use the Cloud Security Explorer:

*Identification in cloud compute resources requires a Defender CSPM, Defender for Containers, or Defender for Servers plan.

*Identification in code environments requires connecting your code environment to Defender for Cloud. Learn how to set up connectors.

3. Remediate based on Defender for Cloud recommendations across the software development lifecycle: Code, Ship, Runtime


4. Create GitHub issues with runtime context directly from Defender for Cloud, track progress, and use Copilot coding agent for AI-powered automated fixes

Learn more about Defender for Cloud seamless workflows with GitHub to shorten remediation times for security issues.

Microsoft Defender XDR detections 

Microsoft security products provide several layers of defense to help organizations identify and block exploitation attempts related to vulnerable AI software.

Microsoft Defender provides visibility into vulnerable AI workloads through Microsoft Defender Cloud Security Posture Management (Defender CSPM).

Vulnerability Assessment: Defender for Cloud scanners have been updated to identify containers and virtual machines running vulnerable versions of langchain-core. Microsoft is actively working to expand coverage to additional platforms, and this blog will be updated when more information is available.

Hunting queries   

Microsoft Defender XDR

Security teams can use the advanced hunting capabilities in Microsoft Defender XDR to proactively look for indicators of exploitation. A common sign of exploitation is a Python process associated with LangChain attempting to access sensitive environment variables or making unexpected network connections immediately following an LLM interaction.

The following Kusto Query Language (KQL) query can be used to identify devices that are using the vulnerable software:

DeviceTvmSoftwareInventory
| where SoftwareName has "langchain"
    and (
        // 0.3.x releases before the 0.3.81 patch
        (SoftwareVersion startswith "0."
            and parse_version(SoftwareVersion) < parse_version("0.3.81"))
        // 1.x releases before the 1.2.5 patch
        or (SoftwareVersion startswith "1."
            and parse_version(SoftwareVersion) < parse_version("1.2.5"))
    )
| project DeviceId, DeviceName, SoftwareName, SoftwareVersion

This research is provided by Microsoft Defender Security Research with contributions from Tamer Salman, Astar Lev, Yossi Weizman, Hagai Ran Kestenberg, and Shai Yannai.

Learn more  

Review our documentation to learn more about our real-time protection capabilities and see how to enable them within your organization.  

Learn more about securing Copilot Studio agents with Microsoft Defender 

Learn more about protecting your agents in real time during runtime (Preview) with Microsoft Defender for Cloud Apps

Explore how to build and customize agents with Copilot Studio Agent Builder  

Turning threat reports into detection insights with AI

Security teams routinely need to transform unstructured threat knowledge, such as incident narratives, red team breach-path writeups, threat actor profiles, and public reports, into concrete defensive action. The early stages of that work are often the slowest: extracting tactics, techniques, and procedures (TTPs) from long documents, mapping them to a standard taxonomy, and determining which TTPs are already covered by existing detections versus which represent potential gaps.

Complex documents that mix prose, tables, screenshots, links, and code make it easy to miss key details. As a result, manual analysis can take days or even weeks, depending on the scope and telemetry involved.

This post outlines an AI-assisted workflow for detection analysis designed to accelerate detection engineering. The workflow generates a structured initial analysis from common security content, such as incident reports and threat writeups. It extracts candidate TTPs from the content, validates those TTPs, and normalizes them to a consistent format, including alignment with the MITRE ATT&CK framework.

The workflow then performs coverage and gap analysis by comparing the extracted TTPs against an existing detection catalog. It combines similarity search with LLM-based validation to improve accuracy. The goal is to give defenders a high-quality starting point by quickly surfacing likely coverage areas and potential detection gaps.

This approach saves time and allows analysts to focus where they add the most value: validating findings, confirming what telemetry actually captures, and implementing or tuning detections.

Technical details

Figure 1: Overall flow of the analysis

Figure 1 illustrates the overall architecture of the workflow for analyzing threat data. The system accepts multiple content types and processes them through three main stages: TTP extraction, MITRE ATT&CK mapping, and detection coverage analysis.

The workflow ingests artifacts that describe adversary behavior, including documents and web-based content. These artifacts include:

  • Red team reports
  • Threat intelligence (TI) reports
  • Threat actor (TA) profiles

The system supports multiple content formats, allowing teams to process both internal and external reports without manual reformatting.

During ingestion, the system breaks each document into machine-readable segments, such as text blocks, headings, and lists. It retains the original document structure to preserve context. This is important because the location of information, such as whether it appears in an appendix or in key findings, can affect how the data is interpreted. This is especially relevant for long reports that combine narrative text with supporting evidence.
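
As a rough illustration, a retained segment might be modeled as follows; the field names are assumptions for this sketch, not the production schema.

from dataclasses import dataclass

# Illustrative segment record produced during ingestion.
@dataclass
class Segment:
    kind: str          # "text", "heading", "list", "table", or "code"
    content: str       # raw text of the segment
    section_path: str  # heading trail, e.g. "Appendix B > Indicators"
    order: int         # position within the original document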

1) TTP and metadata extraction

The first major technical step extracts candidate TTPs from the ingested content. The workflow identifies technique-like behaviors described in free text and converts them into a structured format for review and downstream mapping.

The system uses specialized large language model (LLM) prompts to extract this information from raw content. In addition to candidate TTPs, the system extracts supporting metadata, including:

  • Relevant cloud stack layers
  • Detection opportunities
  • Telemetry required for detection authoring
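
A structured output contract keeps these fields consistent across runs. The sketch below shows a hypothetical JSON schema for the extraction prompt; the field names are illustrative, not the production schema.

# Hypothetical schema used to constrain the extraction model's output.
TTP_EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "ttps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "behavior": {"type": "string"},
                    "cloud_layer": {"type": "string"},
                    "detection_opportunity": {"type": "string"},
                    "required_telemetry": {"type": "array", "items": {"type": "string"}},
                    "evidence_quote": {"type": "string"},
                },
                "required": ["behavior", "evidence_quote"],
            },
        }
    },
    "required": ["ttps"],
}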

2) MITRE ATT&CK mapping

The system validates MITRE ATT&CK mappings by normalizing extracted behaviors to specific technique identifiers and names. This process highlights areas of uncertainty for review and correction, helping standardize visibility into attack observations and potential protection gaps.

The goal is to map all relevant layers, including tactics, techniques, and sub-techniques, by assigning each extracted TTP to the appropriate level of the MITRE ATT&CK hierarchy. Each TTP is mapped using a single LLM call with Retrieval Augmented Generation (RAG). To maintain accuracy, the system uses a focused, one-at-a-time approach to mapping.
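
The sketch below illustrates that one-at-a-time pattern; retrieve_similar and call_llm are hypothetical stand-ins for the RAG retriever and the model client, not components named by the system.

# Sketch of the per-TTP mapping step.
def map_to_attack(ttp: dict, retrieve_similar, call_llm) -> dict:
    # Ground the model with a short list of candidate ATT&CK techniques.
    candidates = retrieve_similar(ttp["behavior"], k=5)
    prompt = (
        "Map the following behavior to one MITRE ATT&CK technique.\n"
        f"Behavior: {ttp['behavior']}\n"
        "Candidates:\n"
        + "\n".join(f"- {c['id']}: {c['name']}" for c in candidates)
        + "\nReturn JSON with fields: technique_id, technique_name, confidence."
    )
    # One focused call per TTP keeps context small and output stable.
    return call_llm(prompt, response_format="json")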

3) Existing detections mapping and gap analysis

A key workflow step is mapping extracted TTPs against existing detections to determine which behaviors are already covered and where gaps may exist. This allows defenders to assess current coverage and prioritize detection development or tuning efforts.

Figure 2: Detection mapping process

Figure 2 illustrates the end-to-end detection mapping process. This phase includes the following:

  • Vector similarity search: The system uses this to identify potential detection matches for each extracted TTP.
  • LLM-based validation: The system uses this to minimize false positives and to classify each candidate match as “likely covered” or “likely gap.”

The vector similarity search process begins by standardizing all detections, including their metadata and code, during an offline preprocessing step. This information is stored in a relational database and includes details such as titles, descriptions, and MITRE ATT&CK mappings. In federated environments, detections may come from multiple repositories, so this standardization streamlines access during detection mapping. Selected fields are then used to build a vector database, enabling semantic search across detections.

Vector search uses approximate nearest neighbor algorithms and produces a similarity-based confidence score. Because setting effective thresholds for these scores can be challenging, the workflow includes a second validation step using an LLM. This step evaluates whether candidate mappings are valid for a given TTP using a tailored prompt.
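
Put together, the two stages might look like the following sketch, where vector_index and call_llm are assumed placeholders rather than actual components of the system.

# Sketch of the two-stage coverage check for one extracted TTP.
def check_coverage(ttp: dict, vector_index, call_llm, k: int = 10):
    # Stage 1: approximate nearest neighbor search over detection embeddings.
    candidates = vector_index.search(ttp["behavior"], top_k=k)
    covered_by = []
    for det in candidates:
        # Stage 2: LLM validation filters out similarity-only false positives.
        verdict = call_llm(
            "Does this detection cover the behavior? Answer 'covered' or "
            "'not_covered' with a one-line reason.\n"
            f"Behavior: {ttp['behavior']}\n"
            f"Detection: {det['title']} -- {det['description']}"
        )
        if verdict.startswith("covered"):
            covered_by.append(det["id"])
    return ("likely covered", covered_by) if covered_by else ("likely gap", [])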

The final output highlights prioritized detection opportunities and identifies potential gaps. These results are intended as recommendations that defenders should confirm based on their environment and available telemetry. Because the analysis relies on extracted text and metadata, which may be ambiguous, these mappings do not guarantee detection coverage. Organizations should supplement this approach with real-world simulations to further validate the results.

Human-in-the-loop: why validation remains essential

Final confirmation requires human expertise and empirical validation. The workflow identifies promising detection opportunities and potential gaps, but confirmation depends on testing with real telemetry, simulation, and review of detection logic in context.

This boundary is important because coverage in this approach is primarily based on text similarity and metadata alignment. A detection may exist but operate at a different scope, depend on telemetry that is not universally available, or require correlation across multiple data sources. The purpose of the workflow is to reduce time to initial analysis so experts can focus on high-value validation and implementation work.

Practical advice for using AI

Large language models are powerful for accelerating security analysis, but they can be inconsistent across runs, especially when prompts, context, or inputs vary. Output quality depends heavily on the prompt, and overly long prompts may fail to convey intent to the model.

1) Plan for inconsistency and make critical steps deterministic

For high-impact steps, such as TTP extraction or mapping behaviors to a taxonomy, prioritize stability over creativity:

  • Use stronger models for the most critical steps and reserve smaller or cheaper models for tasks like summarization or formatting. Reasoning models are often more effective than non-reasoning models.
  • Use structured outputs, such as JSON schemas, and explicit formatting requirements to reduce variance. Most state-of-the-art models now support structured output.
  • Include a self-critique or answer review step in the model output. Use sequential LLM calls or a multi-turn agentic workflow to ensure a satisfactory result.
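
The self-critique step can be as simple as a second sequential call. Below is a minimal sketch of the draft-then-review pattern; call_llm is a placeholder for any chat-completion client.

def extract_with_review(report: str, call_llm) -> str:
    # First pass: produce a draft extraction.
    draft = call_llm(
        "Extract every attacker TTP from the report below as a JSON list.\n"
        f"Report:\n{report}"
    )
    # Second pass: the model audits its own draft against the source text.
    return call_llm(
        "Review the draft TTP list against the report. Remove items the text "
        "does not support and add any that were missed. Return corrected JSON.\n"
        f"Report:\n{report}\nDraft:\n{draft}"
    )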

2) Insert reviewer checkpoints where mistakes are costly

Even high-performing models can miss details in long or heterogeneous documents. To reduce the risk of omissions or incorrect mappings, add human-in-the-loop reviewer gates:

  • Reviewer checkpoints are especially valuable for final TTP lists and any “coverage vs. gap” conclusions.
  • Treat automated outputs as a first-pass hypothesis. Require expert validation and, if possible, empirical checks before operational decisions.

3) Optimize prompt context for better accuracy

Avoid including too much information in prompts. While modern models have large token windows, excess content can dilute relevance, increase cost, and reduce accuracy.

Best practices:

  • Provide only the minimum necessary context. Focus on the information needed for the current step. Use RAG or staged, multi-step prompts instead of one large prompt.
  • Be specific. Use clear, direct instructions. Vague or open-ended requests often produce unclear results.

4) Build an evaluation loop

Establish an evaluation process for production-quality results:

  • Develop gold datasets and ground-truth samples to track coverage and accuracy over time.
  • Use expert reviews to validate results instead of relying on offline metrics.
  • Use evaluations to identify regressions when prompts, models, or context packaging changes.
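
A minimal regression harness over such a gold set might look like the following sketch; gold_set and run_pipeline are hypothetical stand-ins for your labeled samples and the workflow under test.

def evaluate(gold_set, run_pipeline):
    # Compare predicted technique IDs against hand-labeled ground truth.
    results = []
    for sample in gold_set:
        predicted = {t["technique_id"] for t in run_pipeline(sample["report"])}
        expected = set(sample["technique_ids"])
        hits = len(predicted & expected)
        precision = hits / len(predicted) if predicted else 1.0
        recall = hits / len(expected) if expected else 1.0
        results.append((sample["id"], precision, recall))
    return results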

Where AI accelerates detection and experts validate

Detection engineering is most effective when treated as a continuous loop:

  1. Gather new intelligence
  2. Extract relevant behaviors
  3. Check current coverage
  4. Set validation priorities
  5. Implement improvements

AI can accelerate the early stages of this loop by quickly structuring TTPs and enabling efficient matching against existing detections. This allows defenders to focus on higher-value work, such as validating coverage, investigating areas of uncertainty, and refining detection logic.

In evaluation, the AI-assisted approach to TTP extraction produced results comparable to those of security experts. By combining the speed of AI with expert review and validation, organizations can scale detection coverage analysis more effectively, even during periods of high reporting volume.

This research is provided by Microsoft Defender Security Research with contributions from Fatih Bulut.

References

  1. MITRE ATT&CK Framework: https://attack.mitre.org
  2. Fatih Bulut, Anjali Mangal. “Towards Autonomous Detection Engineering”. Annual Computer Security Applications Conference (ACSAC) 2025. Link: https://www.acsac.org/2025/files/web/acsac25-casestudy-bulut.pdf
