Inside AWS Security Agent: A multi-agent architecture for automated penetration testing

AI agents have traditionally faced three core limitations: they can’t retain learned information, they can’t operate autonomously beyond short periods, and they require constant supervision. AWS addresses these limitations with frontier agents, a new category of AI that performs complex reasoning, multi-step planning, and autonomous execution for hours or days. Multi-agent collaboration has emerged as a powerful approach for tackling complex workflows that require multiple steps and diverse expertise: in software development, agents handle code generation, review, and testing; in scientific research, agents collaborate on literature review, experimental design, and data analysis; and in cybersecurity, specialized agents perform reconnaissance, vulnerability analysis, and exploit validation.

In this post, we discuss how we’ve used this technology to deliver automated penetration testing, a process that traditionally takes weeks and is resource intensive. We also provide a technical deep dive into the architecture of the penetration testing component built into AWS Security Agent.

The concept of automated security testing isn’t new; penetration testing tools and vulnerability scanners have existed for decades. However, with recent advancements in large language models (LLMs), frontier agents are designed to reason about application behavior, adapt strategies based on feedback, and understand context in ways that traditional tools can’t. By creating a network of specialized agents, we can address increasingly complex security challenges: one agent maps the attack surface while others analyze business logic flaws, validate findings, and prioritize vulnerabilities based on actual exploitability. The exploitability context comes from the combination of actual exploit attempts by swarm agent workers, independent re-validation by specialized validators, and LLM-driven scoring according to the Common Vulnerability Scoring System (CVSS).

We’ve developed automated penetration testing for the AWS Security Agent. This capability includes a multi-agent penetration testing system that orchestrates specialized security agents to work collaboratively on vulnerability detection. The system begins with multiple types of scanning to establish baseline coverage, then conducts broad reconnaissance using static, predefined tasks to map the application surface and identify initial attack vectors. Building on these findings, our agentic system dynamically generates focused test tasks tailored to the specific application context, reasoning about discovered endpoints, business logic patterns, and potential vulnerability chains to create targeted security tests that adapt based on application responses. By combining these specialized capabilities, the system can tackle complex security scenarios across major risk categories. Beyond single-vulnerability detection, the system performs complex chained attacks; for instance, combining an information disclosure flaw with privilege escalation to access sensitive resources, or chaining insecure direct object references (IDOR) with authentication bypass.
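
The phased flow described above can be sketched as a minimal pipeline. This is an illustrative stand-in, not the actual AWS Security Agent implementation; all class, function, and endpoint names are assumptions made for the example.

```python
# Minimal sketch of the phased pipeline: baseline scan -> static
# reconnaissance -> dynamically generated, context-aware tasks.
from dataclasses import dataclass, field

@dataclass
class Finding:
    risk_type: str
    endpoint: str
    validated: bool = False

@dataclass
class PipelineState:
    endpoints: list = field(default_factory=list)
    findings: list = field(default_factory=list)

def baseline_scan(state):
    # Stand-in for the parallel scanners: seed the known attack surface.
    state.endpoints += ["/login", "/api/users/{id}", "/admin"]

def reconnaissance(state):
    # Static, predefined tasks map the surface to initial attack vectors.
    state.findings.append(Finding("info_disclosure", "/api/users/{id}"))

def generate_focused_tasks(state):
    # Dynamic tasks derive from earlier discoveries, e.g. probing an
    # ID-parameterized endpoint for an IDOR chain.
    return ([(f.risk_type, f.endpoint) for f in state.findings]
            + [("idor_chain", e) for e in state.endpoints if "{id}" in e])

state = PipelineState()
baseline_scan(state)
reconnaissance(state)
tasks = generate_focused_tasks(state)
print(tasks)
```

In the real system an LLM, not a string match, decides which follow-up tasks are worth generating; the sketch only shows how each phase feeds the next.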

Figure 1: Diagram of the AWS Security Agent penetration testing component.


System architecture

This section describes the major components of the system. The following subsections cover authentication and initial access, baseline scanning, multi-phased exploration with the specialized agent swarm, and validation with report generation.

Authentication and initial access

The system begins with an intelligent sign-in component that handles authentication across diverse application architectures. This component combines LLM-based reasoning with deterministic mechanisms to locate sign-in pages, attempt provided credentials, and maintain authenticated sessions for subsequent testing phases. The approach adapts to different application structures and target environments automatically and uses a browser tool. The developer can optionally provide a custom sign-in prompt tailored to the target application.
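
The combination of deterministic mechanisms with an LLM fallback can be sketched as follows. The path list, callback, and session shape are illustrative assumptions, not the component's actual interface.

```python
# Sketch of the sign-in component: deterministic heuristics locate a
# candidate sign-in page, and a pluggable LLM callback (stubbed here)
# handles application layouts the heuristics miss.
COMMON_LOGIN_PATHS = ["/login", "/signin", "/wp-login.php", "/auth"]

def find_login_page(known_paths, llm_suggest=None):
    """Return the first recognized sign-in path; fall back to LLM reasoning."""
    for path in known_paths:
        if path in COMMON_LOGIN_PATHS:
            return path
    return llm_suggest(known_paths) if llm_suggest else None

def authenticate(path, credentials):
    # Stand-in for submitting the form via the browser tool and keeping
    # the resulting session for subsequent testing phases.
    return {"session": f"cookie-for-{credentials['user']}", "login_path": path}

paths = ["/home", "/wp-login.php", "/about"]
login = find_login_page(paths)
session = authenticate(login, {"user": "tester", "password": "secret"})
print(session["login_path"])
```

The optional custom sign-in prompt mentioned above would feed the LLM fallback branch for applications with nonstandard authentication flows.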

Baseline scanning phase

Following authentication, the system initiates comprehensive baseline scanning through parallel execution of specialized scanners. For black-box testing, the network scanner conducts automated web application security testing, generating raw traffic interactions and identifying candidate vulnerable endpoints. In white-box settings, the code scanner additionally performs deep source code analysis when repositories are available, producing descriptive documentation across multiple categories. Additional specialized scanners complement these capabilities to identify vulnerabilities across multiple dimensions and establish initial security coverage.
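
Running the scanners in parallel and merging their candidate findings can be sketched like this. The scanner functions and result schema are illustrative; the real scanners are far more involved.

```python
# Sketch of parallel baseline scanning: the network scanner always runs,
# and the code scanner is added only in white-box settings where a
# repository is available.
from concurrent.futures import ThreadPoolExecutor

def network_scan(target):
    return [{"source": "network", "endpoint": f"{target}/api/users", "risk": "idor"}]

def code_scan(repo):
    return [{"source": "code", "endpoint": "/admin", "risk": "missing_authz"}]

def run_baseline(target, repo=None):
    scanners = [(network_scan, target)]
    if repo is not None:  # white-box: add deep source code analysis
        scanners.append((code_scan, repo))
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda s: s[0](s[1]), scanners))
    # Flatten the per-scanner batches into one candidate-finding list.
    return [finding for batch in batches for finding in batch]

findings = run_baseline("http://target:9090", repo="app-src/")
print(len(findings))
```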

Multi-phased exploration

The system employs two distinct exploration approaches that work in concert. Managed execution operates with predefined static tasks across major risk categories such as cross-site scripting, insecure direct object reference, and privilege escalation. This component helps ensure comprehensive coverage by systematically executing curated tasks for each risk type. In the next phase, guided exploration takes a dynamic, intelligence-driven approach. This component ingests discovered endpoints, validated findings, and code analysis documentation to reason about application-specific attack opportunities. It operates in two stages: first generating a contextual penetration testing plan by identifying unexplored resources and potential vulnerability chains, then programmatically managing the execution of these dynamically generated tasks. The guided explorer runs with adaptive tasks that evolve based on application responses and discovered patterns.
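
The two stages of the guided explorer can be sketched as plan generation followed by managed execution. The planning logic here is a deterministic stand-in for the LLM-driven reasoning described above, and all names are illustrative.

```python
# Stage 1 builds a contextual plan from discovered endpoints and
# validated findings; stage 2 manages execution of the generated tasks.
def build_plan(endpoints, validated_findings):
    plan = []
    explored = {f["endpoint"] for f in validated_findings}
    for ep in endpoints:
        if ep not in explored:                       # unexplored resources
            plan.append({"endpoint": ep, "task": "probe"})
    for f in validated_findings:                     # vulnerability chains
        plan.append({"endpoint": f["endpoint"],
                     "task": f"chain_{f['risk']}_with_privesc"})
    return plan

def execute_plan(plan, worker):
    # The worker callback stands in for dispatch to a swarm agent.
    return [worker(task) for task in plan]

endpoints = ["/api/users/{id}", "/reports"]
validated = [{"endpoint": "/api/users/{id}", "risk": "idor"}]
results = execute_plan(build_plan(endpoints, validated),
                       worker=lambda t: f"ran {t['task']} on {t['endpoint']}")
print(results)
```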

Specialized agent swarm

Both exploration approaches dispatch work to specialized swarm worker agents, each configured for specific risk types and equipped with comprehensive penetration testing toolkits, including code executors, web fuzzers, National Vulnerability Database (NVD) search for Common Vulnerabilities and Exposures (CVE) intelligence, and vulnerability-specific tools. These workers execute assigned tasks with timeout management and structured reporting.
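
The timeout management and structured reporting can be sketched as a worker wrapper. The report fields and timeout value are assumptions for illustration.

```python
# Sketch of a swarm worker wrapper: each assigned task runs under a
# timeout, and the worker always returns a structured report whether the
# task completed, timed out, or raised an error.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_worker(task_fn, task, timeout_s=5.0):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(task_fn, task)
        try:
            evidence = future.result(timeout=timeout_s)
            return {"task": task, "status": "completed", "evidence": evidence}
        except FutureTimeout:
            return {"task": task, "status": "timeout", "evidence": None}
        except Exception as exc:
            return {"task": task, "status": "error", "evidence": str(exc)}

report = run_worker(lambda t: f"payload accepted on {t}", "/api/users/1")
print(report["status"])
```

Structured statuses like these let the orchestrator reschedule timed-out tasks or route errors to a different worker without parsing free-form output.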

Validation and report generation

When specialized agents identify potential security risks, they generate structured reports containing the vulnerability type, affected endpoints, exploitation evidence, and technical context. However, automated penetration testing faces a critical challenge: LLM agents can produce plausible-sounding findings that require rigorous validation. Candidate findings undergo validation through both deterministic validators and specialized LLM-based agents that attempt active exploitation. We employ assertion-based validation techniques, where natural language assertions written by security experts encode deep knowledge about real attack behaviors, requiring explicit, structured proof that’s significantly harder to circumvent than narrow deterministic checks. Validated findings undergo CVSS analysis for severity assessment, then are synthesized into final reports with validation results, severity scores, and exploitation evidence, designed to deliver actionable, high-confidence vulnerabilities for effective remediation.
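
The idea of assertion-based validation can be sketched as expert-written assertions paired with checks that demand structured proof from a candidate finding. The assertion texts and evidence fields below are invented for the example.

```python
# Sketch of assertion-based validation: each natural-language assertion
# is paired with a check over the finding's structured evidence, and a
# finding is accepted only if every assertion holds.
ASSERTIONS = [
    ("the response must contain data belonging to another user",
     lambda f: f.get("victim_id") is not None
               and f.get("victim_id") != f.get("attacker_id")),
    ("the exploit request must be replayable without authentication",
     lambda f: f.get("replayed_unauthenticated") is True),
]

def validate(finding):
    failures = [text for text, check in ASSERTIONS if not check(finding)]
    return {"validated": not failures, "failed_assertions": failures}

candidate = {"attacker_id": 7, "victim_id": 12, "replayed_unauthenticated": True}
print(validate(candidate))
```

Because the finding must supply explicit evidence for every assertion, a plausible-sounding but unexploited claim fails validation instead of reaching the final report.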

Benchmarking

To evaluate our system, we performed human evaluation in addition to automatic benchmarking. We conducted analysis on real-world trajectories and created a taxonomy of error patterns. By spotting frequent error patterns, we were able to iterate on our solution. We report results on the CVE Bench public benchmark, which is a collection of vulnerable web applications containing 40 critical-severity CVEs from the National Vulnerability Database used to evaluate AI agents on real-world exploits. Each application includes automatic exploit references, and LLM-based agents attempt to execute attacks that trigger the vulnerabilities.

We measure success through the attack success rate (ASR) metric, defined as the rate of successful exploitation of application vulnerabilities. CVE Bench uses a grader that the agent can query to verify exploit success and provides explicit capture-the-flag (CTF) instructions. We evaluate in three configurations:

  1. With CTF instructions and grader checks after each tool call, achieving 92.5% on CVE Bench v2.0 (we note that some challenges involve blind exploitation where the agent cannot verify success without this feedback).
  2. Without CTF instructions or grader feedback, achieving 80%, which better reflects real-world conditions where the agent must self-validate through observable outcomes. We also observed that the agent was able to identify some CVEs based on the LLM’s parametric knowledge, as shown in the following bash command where the model explicitly references a CVE by name.
  3. Therefore, we ran an additional experiment using an LLM whose knowledge cutoff date predates the CVE Bench v1.0 release, achieving 65% ASR.

The following code example shows an LLM agent demonstrating parametric knowledge of CVE-2023-37999 from its training data, then issuing a bash command to check exploitation prerequisites.

# HT Mega 2.2.0 has a known vulnerability – CVE-2023-37999
# It has an unauthenticated privilege escalation via the REST API settings endpoint
# Let's check if registration is enabled
curl -s http://target:9090/wp-login.php?action=register -I | head -10

We’re committed to pushing the frontier of security vulnerability detection by continuously evaluating our agent and staying competitive with newer, more challenging benchmarks.

Optimizing testing and compute budget

One challenge for penetration testing is determining the balance between exploitation and exploration. A depth-first approach can waste too much compute on a few specific directions, leading to lower vulnerability coverage under a fixed compute budget. A breadth-first approach, by contrast, is unlikely to discover deep vulnerabilities that require testing multiple approaches in sequence. A balance between the two is therefore needed to maximize coverage for a given compute budget. Our system design takes a hybrid approach. A more efficient, dynamic solution that generalizes across vulnerability types and web applications remains an open research question.
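
One simple way to realize a hybrid is to split the budget between a breadth pass and a depth pass. This sketch is an assumption-laden illustration of the trade-off, not the system's actual scheduler, and the 50/50 split is an arbitrary choice rather than a tuned value.

```python
# Sketch of a hybrid exploration scheduler: part of the compute budget
# touches every lead once (breadth), and the remainder is concentrated
# depth-first on the highest-scoring leads.
def schedule(leads, budget, breadth_ratio=0.5):
    breadth_budget = int(budget * breadth_ratio)
    plan = []
    # Breadth: one shallow probe per lead, round-robin.
    for i in range(breadth_budget):
        plan.append(("probe", leads[i % len(leads)]))
    # Depth: remaining budget on the top half of leads by score.
    ranked = sorted(leads, key=lambda l: l["score"], reverse=True)
    for i in range(budget - breadth_budget):
        plan.append(("deep_test", ranked[i % max(1, len(ranked) // 2)]))
    return plan

leads = [{"endpoint": "/a", "score": 0.9}, {"endpoint": "/b", "score": 0.2}]
plan = schedule(leads, budget=6)
print(plan)
```

A dynamic variant would re-score leads as findings come in, shifting budget toward directions that are paying off.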

Another challenge with penetration testing is non-determinism. Because the underlying LLMs are stochastic, the findings of a penetration test can vary from one run to the next, and inconsistent findings across runs can cause confusion. One way to mitigate this is to perform multiple runs and consolidate the findings across them.
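
Consolidation across runs can be sketched as deduplication with occurrence counts. The (risk, endpoint) key and confidence score below are illustrative assumptions about how findings might be merged.

```python
# Sketch of multi-run consolidation: deduplicate findings by
# (risk type, endpoint) and track how many runs reproduced each one,
# so consistently reproduced findings surface with higher confidence.
from collections import Counter

def consolidate(runs):
    counts = Counter((f["risk"], f["endpoint"]) for run in runs for f in run)
    total = len(runs)
    return [{"risk": r, "endpoint": e, "seen_in": c, "confidence": c / total}
            for (r, e), c in counts.most_common()]

runs = [
    [{"risk": "idor", "endpoint": "/api/users/{id}"},
     {"risk": "xss", "endpoint": "/search"}],
    [{"risk": "idor", "endpoint": "/api/users/{id}"}],
    [{"risk": "idor", "endpoint": "/api/users/{id}"},
     {"risk": "sqli", "endpoint": "/reports"}],
]
merged = consolidate(runs)
print(merged[0])
```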

Conclusion

The multi-agent architecture presented in this post demonstrates how you can use specialized agents that can collaborate to tackle complex penetration testing workflows: from intelligent authentication and baseline scanning through managed and guided exploration phases, culminating in rigorous validation. By orchestrating these specialized components with adaptive task generation and assertion-based validation, the system delivers comprehensive security coverage that evolves based on application-specific context and discovered patterns.

AWS Security Agent is now in public preview. For more information, see Getting Started with AWS Security Agent.


Tamer Alkhouli

Tamer is an Amazon Web Services Senior Applied Scientist with over 13 years in NLP across academia and industry. He earned a PhD in machine translation from RWTH Aachen University under Hermann Ney. Across his career, he has built systems in machine translation, conversational AI, and foundation models. At AWS, he has contributed to Amazon Lex, Titan foundation models, Amazon Bedrock Agents, and the AWS Security Agent.

Divya Bhargavi

Divya is a Senior Applied Scientist at AWS on the Security Agent team. Her work focuses on designing agentic architectures for vulnerability discovery and exploit validation, with emphasis on developing robust benchmarking frameworks and evaluation methodologies for security agents in adversarial contexts. Prior to this, she led scientific engagements at the AWS Generative AI Innovation Center.

Daniele Bonadiman

Daniele is a Senior Applied Scientist at AWS, where he works on AWS Security Agent. Daniele holds a PhD in Applied Machine Learning and Natural Language Processing from the University of Trento. During his time at AWS, Daniele has contributed to several AI initiatives focusing on conversational AI, agent orchestration, and code interpretation for AI agents.

Yilun Cui

Yilun is a Principal Engineer at AWS working on Agentic AI. Yilun has over a decade of experience building tools for developers and is passionate about applying AI throughout the software development lifecycle to help software developers build faster and deliver better products.

Dr. Yi Zhang

Yi is a Principal Applied Scientist at AWS. With over 25 years of industrial and academic research experience, Yi’s research focuses on the development of conversational and interactive multi-agent systems and syntactic and semantic understanding of natural language. He has been leading the research effort behind the development of multiple AWS services such as AWS Security Agent and Amazon Bedrock Agent.


AIs are Getting Better at Finding and Exploiting Internet Vulnerabilities

Really interesting blog post from Anthropic:

In a recent evaluation of AI models’ cyber capabilities, current Claude models can now succeed at multistage attacks on networks with dozens of hosts using only standard, open-source tools, instead of the custom tools needed by previous generations. This illustrates how barriers to the use of AI in relatively autonomous cyber workflows are rapidly coming down, and highlights the importance of security fundamentals like promptly patching known vulnerabilities.

[…]

A notable development during the testing of Claude Sonnet 4.5 is that the model can now succeed on a minority of the networks without the custom cyber toolkit needed by previous generations. In particular, Sonnet 4.5 can now exfiltrate all of the (simulated) personal information in a high-fidelity simulation of the Equifax data breach, one of the costliest cyber attacks in history, using only a Bash shell on a widely-available Kali Linux host (standard, open-source tools for penetration testing; not a custom toolkit). Sonnet 4.5 accomplishes this by instantly recognizing a publicized CVE and writing code to exploit it without needing to look it up or iterate on it. Recalling that the original Equifax breach happened by exploiting a publicized CVE that had not yet been patched, the prospect of highly competent and fast AI agents leveraging this approach underscores the pressing need for security best practices like prompt updates and patches.

Read the whole thing. Automatic exploitation will be a major change in cybersecurity. And things are happening fast. There have been significant developments since I wrote this in October.


Augmenting Penetration Testing Methodology with Artificial Intelligence – Part 3: Arcanum Cyber Security Bot

In my journey to explore how I can use artificial intelligence to assist in penetration testing, I experimented with a security-focused chat bot created by Jason Haddix called Arcanum Cyber Security Bot (available on https://chatgpt.com/gpts). Jason engineered this bot to leverage up-to-date technical information related to application security and penetration testing.

The post Augmenting Penetration Testing Methodology with Artificial Intelligence – Part 3: Arcanum Cyber Security Bot appeared first on Black Hills Information Security, Inc..


Augmenting Penetration Testing Methodology with Artificial Intelligence – Part 1: Burpference

Burpference is a Burp Suite plugin that takes requests and responses to and from in-scope web applications and sends them off to an LLM for inference. In the context of artificial intelligence, inference is taking a trained model, providing it with new information, and asking it to analyze this new information based on its training.

The post Augmenting Penetration Testing Methodology with Artificial Intelligence – Part 1: Burpference appeared first on Black Hills Information Security, Inc..


Why Your Org Needs a Penetration Test Program

This webcast originally aired on February 27, 2025. Join us for a very special free one-hour Black Hills Information Security webcast with Corey Ham & Kelli Tarala on why your […]

The post Why Your Org Needs a Penetration Test Program appeared first on Black Hills Information Security, Inc..


5 Things We Are Going to Continue to Ignore in 2025

In this video, John Strand discusses the complexities and challenges of penetration testing, emphasizing that it goes beyond just finding and exploiting vulnerabilities.

The post 5 Things We Are Going to Continue to Ignore in 2025 appeared first on Black Hills Information Security, Inc..


Attack Tactics 9: Shadow Creds for PrivEsc w/ Kent & Jordan

In this video, Kent Ickler and Jordan Drysdale discuss Attack Tactics 9: Shadow Credentials for Primaries, focusing on a specific technique used in penetration testing services at Black Hills Information Security

The post Attack Tactics 9: Shadow Creds for PrivEsc w/ Kent & Jordan appeared first on Black Hills Information Security, Inc..


What Is Penetration Testing?

In today’s world, security is more important than ever. As organizations increasingly rely on technology to drive business, digital threats are becoming more sophisticated, varied, and difficult to defend against. […]

The post What Is Penetration Testing? appeared first on Black Hills Information Security, Inc..


Pentesting, Threat Hunting, and SOC: An Overview

By Ray Van Hoose, Wade Wells, and Edna Jonsson || Guest Authors This post is comprised of 3 articles that were originally published in the second edition of the InfoSec […]

The post Pentesting, Threat Hunting, and SOC: An Overview appeared first on Black Hills Information Security, Inc..


Why Do Car Dealers Need Cybersecurity Services?

Tom Smith // At Black Hills Information Security (BHIS), we deal with all manner of clients, public and private. Until a month or two ago, though, we’d never dealt with […]

The post Why Do Car Dealers Need Cybersecurity Services? appeared first on Black Hills Information Security, Inc..


Start to Finish: Configuring an Android Phone for Pentesting

Jeff Barbi // *Guest Post Background Unless you’re pentesting mobile apps consistently, it’s easy for your methodologies to fall out of date. Each new version of Android brings with it […]

The post Start to Finish: Configuring an Android Phone for Pentesting appeared first on Black Hills Information Security, Inc..


Webcast: Sacred Cash Cow Tipping 2019

John Strand // Yet again it is time for another edition of Sacred Cash Cow Tipping! Or, β€œWhy do these endpoint security bypass techniques still work? Why?” The goal of […]

The post Webcast: Sacred Cash Cow Tipping 2019 appeared first on Black Hills Information Security, Inc..


Performing a Physical Pentest? Bring This!

Jordan Drysdale// Physical Pentest Upcoming? Bring a Badgy. While badge reproduction may not be the intended use of this product, if you are a physical tester and you don’t own […]

The post Performing a Physical Pentest? Bring This! appeared first on Black Hills Information Security, Inc..


Digging Deeper into Vulnerable Windows Services

Brian Fehrman // Privilege escalation is a common goal for threat actors after they have compromised a system. Having elevated permissions can allow for tasks such as: extracting local password-hashes, […]

The post Digging Deeper into Vulnerable Windows Services appeared first on Black Hills Information Security, Inc..


A Morning with Cobalt Strike & Symantec

Joff Thyer // If you have been penetration testing a while, you likely have ended up in a Red Team situation or will be engaged in it soon enough. From […]

The post A Morning with Cobalt Strike & Symantec appeared first on Black Hills Information Security, Inc..
