Building an AI-powered defense-in-depth security architecture for serverless microservices

16 February 2026 at 21:10

Enterprise customers face an unprecedented security landscape where sophisticated cyber threats use artificial intelligence to identify vulnerabilities, automate attacks, and evade detection at machine speed. Traditional perimeter-based security models are insufficient when adversaries can analyze millions of attack vectors in seconds and exploit zero-day vulnerabilities before patches are available.

The distributed nature of serverless architectures compounds this challenge—while microservices offer agility and scalability, they significantly expand the attack surface: each API endpoint, function invocation, and data store becomes a potential entry point, and a single misconfigured component can provide attackers the foothold needed for lateral movement. Organizations must simultaneously navigate complex regulatory environments where compliance frameworks like GDPR, HIPAA, PCI-DSS, and SOC 2 demand robust security controls and comprehensive audit trails, while the velocity of software development creates tension between security and innovation. Meeting both demands requires architectures that are comprehensive and automated, enabling secure deployment without sacrificing speed.

The challenge is multifaceted:

  • Expanded attack surface: Multiple entry points across distributed services requiring protection against distributed denial of service (DDoS) attacks, injection vulnerabilities, and unauthorized access
  • Identity and access complexity: Managing authentication and authorization across numerous microservices and service-to-service communications
  • Data protection requirements: Encrypting sensitive data in transit and at rest while securely storing and rotating credentials without compromising performance
  • Compliance and data protection: Meeting regulatory requirements through comprehensive audit trails and continuous monitoring in distributed environments
  • Network isolation challenges: Implementing controlled communication paths without exposing resources to the public internet
  • AI-powered threats: Defending against attackers who use AI to automate reconnaissance, adapt attacks in real-time, and identify vulnerabilities at machine speed

The solution lies in defense-in-depth—a layered security approach where multiple independent controls work together to protect your application.

This article demonstrates how to implement a comprehensive AI-powered defense-in-depth security architecture for serverless microservices on Amazon Web Services (AWS). By layering security controls at each tier of your application, this architecture creates a resilient system with no single point of failure: if one layer is compromised, additional controls help limit the impact and contain the incident. The architecture also incorporates AI and machine learning services throughout, helping organizations respond to AI-powered threats with AI-powered defenses.

Architecture overview: A journey through security layers

Let’s trace a user request from the public internet through our secured serverless architecture, examining each security layer and the AWS services that protect it. This implementation deploys security controls at seven distinct layers, with continuous monitoring and AI-powered threat detection throughout. Each layer provides specific capabilities that work together to create a comprehensive defense-in-depth strategy:

  • Layer 1 blocks malicious traffic before it reaches your application
  • Layer 2 verifies user identity and enforces access policies
  • Layer 3 encrypts communications and manages API access
  • Layer 4 isolates resources in private networks
  • Layer 5 secures compute execution environments
  • Layer 6 protects credentials and sensitive configuration
  • Layer 7 encrypts data at rest and controls data access
  • Continuous monitoring detects threats across layers using AI-powered analysis


Figure 1: Architecture diagram


Layer 1: Edge protection

Before requests reach your application, they traverse the public internet where attackers launch volumetric DDoS attacks, SQL injection, cross-site scripting (XSS), and other web exploits. AWS observed and mitigated thousands of distributed denial of service (DDoS) attacks in 2024, with one exceeding 2.3 terabits per second.

  • DDoS protection: AWS Shield provides managed DDoS protection for applications running on AWS and is enabled for customers at no cost. AWS Shield Advanced offers enhanced detection, continuous access to the AWS DDoS Response Team (DRT), cost protection during attacks, and advanced diagnostics for enterprise applications.
  • Layer 7 protection: AWS WAF protects against Layer 7 attacks through managed rule groups from AWS and AWS Marketplace sellers that cover OWASP Top 10 vulnerabilities including SQL injection, XSS, and remote file inclusion. Rate-based rules automatically block IPs that exceed request thresholds, protecting against application-layer DDoS and brute force attacks. Geo-blocking capabilities restrict access based on geographic location, while Bot Control uses machine learning to identify and block malicious bots while allowing legitimate traffic.
  • AI for security: Amazon GuardDuty uses generative AI to enhance native security services, implementing AI capabilities to improve threat detection, investigation, and response through automated analysis.
  • AI-powered enhancement: Organizations can build autonomous AI security agents using Amazon Bedrock to analyze AWS WAF logs, reason through attack data, and automate incident response. These agents detect novel attack patterns that signature-based systems miss, generate natural language summaries of security incidents, automatically recommend AWS WAF rule updates based on emerging threats, correlate attack indicators across distributed services to identify coordinated campaigns, and trigger appropriate remediation actions based on threat context. This helps enable more proactive threat detection and response capabilities, reducing mean time to detection and response.
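The rate-based blocking described above can be expressed concretely. The following sketch builds a WAFv2 rate-based rule definition in the shape boto3's `wafv2` client accepts; the rule name, request limit, and metric name are illustrative values, not from the article, and attaching the rule to a web ACL (commented out) requires AWS credentials and an existing ACL.

```python
# Sketch: a rate-based AWS WAF rule that blocks IPs exceeding a request
# threshold, expressed as the rule dict boto3's wafv2 client accepts.
# Name, limit, and metric name are illustrative placeholders.

def build_rate_limit_rule(name: str, limit: int, priority: int) -> dict:
    """Return a WAFv2 rate-based rule blocking IPs over `limit`
    requests per 5-minute window."""
    return {
        "Name": name,
        "Priority": priority,
        "Statement": {
            "RateBasedStatement": {
                "Limit": limit,          # requests per 5 minutes per source IP
                "AggregateKeyType": "IP",
            }
        },
        "Action": {"Block": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": f"{name}Metric",
        },
    }

# Attaching it would look roughly like (requires credentials and an ACL):
# import boto3
# wafv2 = boto3.client("wafv2")
# wafv2.update_web_acl(..., Rules=[build_rate_limit_rule("ApiRateLimit", 2000, 1)], ...)
```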

Layer 2: Verifying identity

After requests pass edge protection, you must verify user identity and determine resource access. Traditional username/password authentication is vulnerable to credential stuffing, phishing, and brute force attacks, requiring robust identity management that supports multiple authentication methods and adaptive security responding to risk signals in real time.

Amazon Cognito provides comprehensive identity and access management for web and mobile applications through two components:

  • User pools offer a fully managed user directory handling registration, sign-in, multi-factor authentication (MFA), password policies, social identity provider integration, SAML and OpenID Connect federation for enterprise identity providers, and advanced security features including adaptive authentication and compromised credential detection.
  • Identity pools grant temporary, limited-privilege AWS credentials to users for secure direct access to AWS services without exposing long-term credentials.

Amazon Cognito adaptive authentication uses machine learning to detect suspicious sign-in attempts by analyzing device fingerprinting, IP address reputation, geographic location anomalies, and sign-in velocity patterns, then allows sign-in, requires additional MFA verification, or blocks attempts based on risk assessment. Compromised credential detection automatically checks credentials against databases of compromised passwords and blocks sign-ins using known compromised credentials. MFA supports both SMS-based and time-based one-time password (TOTP) methods, significantly reducing account takeover risk.

For advanced behavioral analysis, organizations can use Amazon Bedrock to analyze patterns across extended timeframes, detecting account takeover attempts through geographic anomalies, device fingerprint changes, access pattern deviations, and time-of-day anomalies.
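Amazon Cognito's risk model is proprietary, but the kind of signal combination it performs can be illustrated. The heuristic below is purely an assumption-labeled sketch of how device, reputation, geography, and velocity signals might map to the allow/MFA/block outcomes described above; the weights and thresholds are invented for illustration.

```python
# Illustrative heuristic only -- Amazon Cognito's adaptive authentication
# model is proprietary. This shows how sign-in risk signals might combine
# into the allow / additional-MFA / block decision described in the text.
# All weights and thresholds are invented for this sketch.

def assess_signin_risk(known_device: bool, ip_reputation_bad: bool,
                       new_country: bool, signins_last_hour: int) -> str:
    """Combine sign-in signals into an action: 'allow', 'mfa', or 'block'."""
    score = 0
    if not known_device:
        score += 2          # unfamiliar device fingerprint
    if ip_reputation_bad:
        score += 3          # source IP has poor reputation
    if new_country:
        score += 2          # geographic anomaly
    if signins_last_hour > 10:
        score += 3          # velocity anomaly (possible credential stuffing)
    if score >= 5:
        return "block"
    if score >= 2:
        return "mfa"
    return "allow"
```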

Layer 3: The application front door

An API gateway serves as your application’s entry point. It must handle request routing, throttling, API key management, and encryption; it also needs to integrate seamlessly with your authentication layer and provide detailed logging for security auditing, all while maintaining high performance and low latency.

  • Amazon API Gateway is a fully managed service for creating, publishing, and securing APIs at scale, providing critical security capabilities including SSL/TLS encryption with AWS Certificate Manager (ACM) to automatically handle certificate provisioning, renewal, and deployment. Request throttling and quota management protects backend services through configurable burst and rate limits with usage quotas per API key or client to prevent abuse, while API key management controls access from partner systems and third-party integrations. Request/response validation uses JSON Schema to validate data before reaching AWS Lambda functions, preventing malformed requests from consuming compute resources while seamless integration with Amazon Cognito validates JSON Web Tokens (JWTs) and enforces authentication requirements before requests reach application logic.
  • GuardDuty provides AI-powered intelligent threat detection by analyzing API invocation patterns and identifying suspicious activity including credential exfiltration using machine learning. For advanced analysis, Amazon Bedrock analyzes API Gateway metrics and Amazon CloudWatch logs to identify unusual HTTP 4XX error spikes (for example, 403 Forbidden) that might indicate scanning or probing attempts, geographic distribution anomalies, endpoint access pattern deviations, time-series anomalies in request volume, or suspicious user agent patterns.
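The 4XX spike detection mentioned above can be sketched as a simple statistical check over metric data pulled from CloudWatch (for example, API Gateway's 4XX error count per period). The z-score threshold here is illustrative; a production setup would tune it or rely on CloudWatch anomaly detection instead.

```python
from statistics import mean, stdev

# Sketch: flag an unusual spike in 4XX error counts (e.g. API Gateway's
# 4XX metric fetched from CloudWatch). The 3-sigma threshold is an
# illustrative choice, not a recommendation from the article.

def is_4xx_spike(history: list[int], current: int, z_threshold: float = 3.0) -> bool:
    """Return True if `current` is more than `z_threshold` standard
    deviations above the historical mean of 4XX counts per period."""
    if len(history) < 2:
        return False
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current > mu
    return (current - mu) / sigma > z_threshold
```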

Layer 4: Network isolation

Application logic and data must be isolated from direct internet access. Network segmentation is designed to limit lateral movement if a security incident occurs, helping to prevent compromised components from easily accessing sensitive resources.

  • Amazon Virtual Private Cloud (Amazon VPC) provides isolated network environments implementing a multi-tier architecture with public subnets for NAT gateways and application load balancers with internet gateway routes, private subnets for Lambda functions and application components accessing the internet through NAT Gateways for outbound connections, and data subnets with the most restrictive access controls. Lambda functions run in private subnets to prevent direct internet access, VPC flow logs capture network traffic for security analysis, security groups provide stateful firewalls following least privilege principles, Network ACLs add stateless subnet-level firewalls with explicit deny rules, and VPC endpoints enable private connectivity to Amazon DynamoDB, AWS Secrets Manager, and Amazon S3 without traffic leaving the AWS network.
  • GuardDuty provides AI-powered network threat detection by continuously monitoring VPC Flow Logs, CloudTrail logs, and DNS logs using machine learning to identify unusual network patterns, unauthorized access attempts, compromised instances, and reconnaissance activity, now including generative AI capabilities for automated analysis and natural language security queries.

Layer 5: Compute security

Lambda functions executing your application code and often requiring access to sensitive resources and credentials must be protected against code injection, unauthorized invocations, and privilege escalation. Additionally, functions must be monitored for unusual behavior that might indicate compromise.

Lambda provides built-in security features including:

  • AWS Identity and Access Management (IAM) execution roles that define precise resource and action access following least privilege principles
  • Resource-based policies that control which services and accounts can invoke functions to prevent unauthorized invocations
  • Environment variable encryption using AWS Key Management Service (AWS KMS) for variables at rest (sensitive data should instead use Secrets Manager)
  • Function isolation designed so that each execution runs in isolated environments, preventing cross-invocation data access
  • VPC integration enabling functions to benefit from network isolation and security group controls
  • Runtime security with automatically patched and updated managed runtimes
  • Code signing with AWS Signer digitally signing deployment packages for code integrity and cryptographic verification against unauthorized modifications
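The least-privilege execution role mentioned in the list above can be made concrete. This sketch assembles an IAM policy document scoped to a single DynamoDB table and a single log group; the ARN parameters are placeholders you would supply, and the action list assumes a function that only reads, writes, and queries one table.

```python
import json

# Sketch: a least-privilege Lambda execution role policy granting only
# specific DynamoDB actions on one table plus log writing to one log
# group. Table and log group ARNs are caller-supplied placeholders.

def least_privilege_policy(table_arn: str, log_group_arn: str) -> str:
    """Return an IAM policy document (JSON string) scoped to one table
    and one log group, instead of wildcard actions and resources."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {   # only the DynamoDB actions the function actually needs
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:PutItem", "dynamodb:Query"],
                "Resource": table_arn,
            },
            {   # write logs to a single log group, not logs:* on *
                "Effect": "Allow",
                "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
                "Resource": f"{log_group_arn}:*",
            },
        ],
    })
```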

AI-powered code security: Amazon CodeGuru Security combines machine learning and automated reasoning to identify vulnerabilities including OWASP Top 10 and CWE Top 25 issues, log injection, secrets, and insecure AWS API usage. Using deep semantic analysis trained on millions of lines of Amazon code, it employs rule mining and supervised ML models combining logistic regression and neural networks for high true-positive rates.

Vulnerability management: Amazon Inspector provides automated vulnerability management, continuously scanning Lambda functions for software vulnerabilities and network exposure, using machine learning to prioritize findings and provide detailed remediation guidance.

Layer 6: Protecting credentials

Applications require access to sensitive credentials including database passwords, API keys, and encryption keys. Hardcoding secrets in code or storing them in environment variables creates security vulnerabilities, requiring secure storage, regular rotation, authorized-only access, and comprehensive auditing for compliance.

  • Secrets Manager protects access to applications, services, and IT resources without managing hardware security modules (HSMs). It provides centralized secret storage for database credentials, API keys, and OAuth tokens in an encrypted repository using AWS KMS encryption at rest.
  • Automatic secret rotation configures rotation for database credentials, automatically updating both the secret store and target database without application downtime.
  • Fine-grained access control uses IAM policies to control which users and services access specific secrets, implementing least-privilege access.
  • Audit trails log secret access in AWS CloudTrail for compliance and security investigations. VPC endpoint support is designed so that secret retrieval traffic doesn’t leave the AWS network.
  • Lambda integration enables functions to retrieve secrets programmatically at runtime, designed so that secrets aren’t stored in code or configuration files and can be rotated without redeployment.
  • GuardDuty provides AI-powered monitoring, detecting anomalous behavior patterns that could indicate credential compromise or unauthorized access.
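The runtime retrieval pattern from the Lambda integration bullet can be sketched as follows. The client is injected so the caching logic is testable; in a real function you would pass `boto3.client("secretsmanager")`, whose `get_secret_value` call returns a `SecretString` field. The secret name and TTL are illustrative.

```python
import json
import time

# Sketch: retrieve a secret at runtime with a short in-memory cache so
# warm Lambda invocations don't call Secrets Manager on every request.
# The client is injected for testability; pass
# boto3.client("secretsmanager") in a real function. TTL is illustrative.

_cache: dict[str, tuple[float, dict]] = {}

def get_secret(client, secret_id: str, ttl: float = 300.0) -> dict:
    """Return the secret as a dict, caching it for `ttl` seconds."""
    now = time.monotonic()
    hit = _cache.get(secret_id)
    if hit and now - hit[0] < ttl:
        return hit[1]                       # still fresh: skip the API call
    resp = client.get_secret_value(SecretId=secret_id)
    value = json.loads(resp["SecretString"])
    _cache[secret_id] = (now, value)
    return value
```

Because the secret is fetched at runtime rather than baked into code or environment variables, rotation in Secrets Manager takes effect without redeploying the function once the cache entry expires.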

Layer 7: Data protection

The data layer stores sensitive business information and customer data requiring protection both at rest and in transit. Data must be encrypted, access tightly controlled, and operations audited, while maintaining resilience against availability attacks and high performance.

Amazon DynamoDB is a fully managed NoSQL database providing built-in security features including:

  • Encryption at rest (using AWS-owned, AWS managed, or customer managed KMS keys)
  • Encryption in transit (TLS 1.2 or higher)
  • Fine-grained access control through IAM policies with item-level and attribute-level permissions
  • VPC endpoints for private connectivity
  • Point-in-Time Recovery for continuous backups
  • Streams for audit trails
  • Backup and disaster recovery capabilities
  • Global Tables for multi-AWS Region, multi-active replication designed to provide high availability and low-latency global access

GuardDuty and Amazon Bedrock provide AI-powered data protection:

  • GuardDuty monitors DynamoDB API activity through CloudTrail logs using machine learning to detect anomalous data access patterns including unusual query volumes, access from unexpected geographic locations, and data exfiltration attempts.
  • Amazon Bedrock analyzes DynamoDB Streams and CloudTrail logs to identify suspicious access patterns, correlate anomalies across multiple tables and time periods, generate natural language summaries of data access incidents for security teams, and recommend access control policy adjustments based on actual usage patterns versus configured permissions. This helps transform data protection from reactive monitoring to proactive threat hunting that can detect compromised credentials and insider threats.
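The "actual usage versus configured permissions" comparison described above reduces to a set difference once the granted actions (from IAM) and observed actions (from CloudTrail) are collected. This sketch shows only that comparison step; gathering the two sets from IAM and CloudTrail is out of scope, and the action names in the test are illustrative.

```python
# Sketch: compare the IAM actions a role is granted against the actions
# actually observed in CloudTrail, surfacing unused grants (candidates
# for removal) and attempted-but-denied actions (worth investigating).
# Collecting the two input sets from IAM/CloudTrail is not shown.

def audit_access(granted: set[str], observed: set[str]) -> dict:
    """Return unused grants and denied attempts for a role."""
    return {
        "unused": sorted(granted - observed),          # tighten the policy
        "denied_attempts": sorted(observed - granted), # investigate these
    }
```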

Continuous monitoring

Even with comprehensive security controls at every layer, continuous monitoring is essential to detect threats that bypass defenses. Security requires ongoing real-time visibility, intelligent threat detection, and rapid response capabilities rather than one-time implementation.

  • GuardDuty protects your AWS accounts, workloads, and data with intelligent threat detection.
  • CloudWatch provides comprehensive monitoring and observability, collecting metrics, monitoring log files, setting alarms, and automatically reacting to AWS resource changes.
  • CloudTrail provides governance, compliance, and operational auditing by logging all API calls in your AWS account, creating comprehensive audit trails for security analysis and compliance reporting.
  • AI-powered enhancement with Amazon Bedrock provides automated threat analysis; generating natural language summaries of GuardDuty findings and CloudWatch logs, pattern recognition identifying coordinated attacks across multiple security signals, incident response recommendations based on your architecture and compliance requirements, security posture assessment with improvement recommendations, and automated response through Lambda and Amazon EventBridge that isolates compromised resources, revokes suspicious credentials, or notifies security teams through Amazon SNS when threats are detected.
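The automated response flow above (EventBridge invoking Lambda on a GuardDuty finding) can be sketched as a severity-routed handler. GuardDuty findings delivered through EventBridge carry the finding under `detail`, including a numeric `severity`; the thresholds, and the stubbed `isolate`/`notify` callables, are illustrative assumptions rather than a prescribed runbook.

```python
# Sketch: a Lambda handler EventBridge invokes on GuardDuty findings,
# routing by severity. GuardDuty's EventBridge events carry the finding
# under "detail" with a numeric "severity". The thresholds and the
# stubbed isolate/notify callables are illustrative, not a runbook.

def handle_finding(event: dict, notify=print, isolate=lambda rid: None) -> str:
    """Return the action taken: 'isolate', 'notify', or 'log'."""
    detail = event.get("detail", {})
    severity = detail.get("severity", 0)
    finding_type = detail.get("type", "unknown")
    if severity >= 7.0:
        # high severity: contain first, then page the security team
        resource_id = (detail.get("resource", {})
                             .get("instanceDetails", {})
                             .get("instanceId", ""))
        isolate(resource_id)
        notify(f"ISOLATED {resource_id}: {finding_type}")
        return "isolate"
    if severity >= 4.0:
        notify(f"Investigate: {finding_type} (severity {severity})")
        return "notify"
    return "log"    # low severity: record only
```

In practice `isolate` might swap a quarantine security group onto the instance and `notify` might publish to an Amazon SNS topic.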

Conclusion

Securing serverless microservices presents significant challenges, but as demonstrated, using AWS services alongside AI-powered capabilities creates a resilient defense-in-depth architecture that protects against current and emerging threats. Security and agility, in other words, are not mutually exclusive.

Security is an ongoing process—continuously monitor your environment, regularly review security controls, stay informed about emerging threats and best practices, and treat security as a fundamental architectural principle rather than an afterthought.

Further reading

If you have feedback about this blog post, submit it in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty, or Security Hub forums, or contact AWS Support.

Roger Nem
Roger is an Enterprise Technical Account Manager (TAM) supporting Healthcare & Life Science customers at Amazon Web Services (AWS). As a Security Technical Field community specialist, he helps enterprise customers design secure cloud architectures aligned with industry best practices. Beyond his professional pursuits, Roger finds joy in quality time with family and friends, nurturing his passion for music, and exploring new destinations through travel.

Explore scaling options for AWS Directory Service for Microsoft Active Directory

30 January 2026 at 20:51

You can use AWS Directory Service for Microsoft Active Directory as your primary Active Directory Forest for hosting your users’ identities. Your IT teams can continue using existing skills and applications while your organization benefits from the enhanced security, reliability, and scalability of AWS managed services. You can also run AWS Managed Microsoft AD as a resource forest. In this configuration, AWS Managed Microsoft AD serves supported AWS services while users’ identities remain under exclusive control of your organization on a self-managed Active Directory. As your organization grows and scales, so will your AWS Managed Microsoft AD deployments.

In this post, you’ll learn how to use Amazon CloudWatch dashboards to monitor key performance metrics of your AWS Managed Microsoft AD deployment to track and analyze a directory’s performance over time. You can then use that information to determine when and how best to scale directory services for optimal performance.

Scaling your Active Directory

When you deploy AWS Managed Microsoft AD, the service initially creates two domain controller instances in two separate subnets of the same virtual private cloud (VPC). This architecture economically provides resiliency and high availability with a minimal set of resources. This initial configuration enables every feature that AWS Managed Microsoft AD offers. As your organization grows, its workflows will become larger and more complex, requiring that you scale your directories accordingly. AWS Managed Microsoft AD simplifies and secures the scaling process with minimal administrative effort. When it’s time to scale a directory, AWS Managed Microsoft AD offers two options: scale-up or scale-out.

Understanding scale-up and scale-out

Scale-up—also called upgrading your AWS Managed Microsoft AD—means changing the edition of an AWS Managed Microsoft AD from Standard to Enterprise. Enterprise Edition delivers larger domain controller instances, with higher compute capacity and larger storage for Active Directory objects. When a directory scales up, it retains the same number of domain controller instances that it previously had with larger quotas. Instances are replaced one at a time to minimize disruptions to production workflows.

A few features offered by the service are a better fit for the size and compute power of Enterprise Edition AWS Managed Microsoft AD and so are only available in Enterprise Edition. Consider scaling-up your directory if you encounter any of the following scenarios:

  • You plan to replicate your directory across multiple AWS Regions. Multi-Region replication is only available in Enterprise Edition.
  • The number of Active Directory objects in the directory will exceed the recommended threshold of 30,000 objects for Standard Edition. Enterprise Edition can accommodate up to 500,000 directory objects.
  • You plan to share your directory with more than 25 other AWS accounts. The default directory sharing quota is 25 accounts for Standard Edition and 500 for Enterprise Edition.

Important: Scaling up a directory from Standard to Enterprise is a one-way operation that cannot be reverted, and Enterprise Edition is billed at a higher hourly price.

Scale-out means deploying additional domain controllers for your AWS Managed Microsoft AD. You can scale out both Standard and Enterprise directories, and you can scale out different Regions independently; you don’t need to scale every Region to the same number of domain controller instances. When scale-out takes place, additional domain controller instances with the same compute resources and storage capacity as existing ones are launched in the same subnets.

Because some operations cannot be reverted, it’s important to understand the impact of each scaling operation. It’s preferable to scale out the number of domain controllers first, because you can revert that change if necessary. Consider scaling up first only if you need a feature that’s only available in Enterprise Edition.

Making an informed decision using CloudWatch

Since December 2021, AWS Managed Microsoft AD has provided directory metrics in Amazon CloudWatch to help optimize scaling decisions. CloudWatch metrics are time-ordered sets of data points about a system’s performance indicators that you can use to monitor and analyze performance over time. Metrics are stored as a time series, and each data point has an associated timestamp. By using CloudWatch, you can create alarms based on metrics and visualize and analyze metrics to derive new insights.

To understand the performance of a directory over time, define the key performance metrics based on your workload when you create the directory. Record the initial values of those metrics to create a performance baseline. Periodically revisit and compare data points for the same metrics to understand trends and use of resources over time. Based on the information provided by the performance baseline and periodic follow-ups, you can decide when to scale your directory and what scaling method to use. This process is depicted in Figure 1.

Figure 1: Decision-making process for scaling an Active Directory implementation

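The baseline-and-compare step of that process is simple enough to sketch directly: record a metric's baseline when the directory is created, then flag follow-up readings that have drifted beyond a tolerance worth a scaling review. The 25% tolerance is an illustrative default, not a recommendation from the article.

```python
# Sketch of the baseline-and-compare process: record a metric's baseline
# value, then flag when a follow-up reading has grown beyond a tolerance
# that suggests planning a scaling action. The 25% default is illustrative.

def needs_scaling_review(baseline: float, current: float,
                         tolerance: float = 0.25) -> bool:
    """Return True when `current` exceeds `baseline` by more than `tolerance`."""
    if baseline <= 0:
        return current > 0
    return (current - baseline) / baseline > tolerance
```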

Depending on the characteristics of your workload, you might face different resource constraints in your directory system. From an infrastructure perspective, the more commonly demanded resources are:

  • Network Interface: Current Bandwidth
  • Processor: % Processor Time
  • LogicalDisk: % Free Space

From an Active Directory perspective, consider metrics such as:

  • NTDS: LDAP Searches/sec
  • NTDS: ATQ Estimated Queue Delay

The following is an example decision matrix based on which resource is constrained:

  • % Processor Time: scale out
  • I/O Database Reads Average Latency: scale out
  • Committed Bytes in Use: scale out
  • % Free Space: scale up

For example, you can create a CloudWatch alarm that will trigger when Processor: % Processor Time is over 80% for more than 5 minutes. If this alarm triggers often, it could be a signal that domain controller instances are struggling to service the regular volume of user authentication requests. In such a scenario, you might consider scaling out with an additional domain controller to maintain the service’s SLA. Conversely, if LogicalDisk: % Free Space drops below 10% and trends downward, you might consider scaling up to Enterprise Edition, because it provides larger capacity for directory objects.
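That processor alarm might be created with boto3's `put_metric_alarm`. The sketch below only assembles the parameters; the directory ID and domain controller IP are placeholders, and the `AWS/DirectoryService` namespace, dimension names, and exact metric name are assumptions you should verify against the metrics visible in your own CloudWatch console.

```python
# Sketch: parameters for the % Processor Time alarm described above
# (over 80% for 5 minutes), shaped for boto3's put_metric_alarm.
# The namespace, dimension names, and metric name are assumptions --
# verify them against the directory metrics in your CloudWatch console.

def processor_alarm_params(directory_id: str, dc_ip: str) -> dict:
    return {
        "AlarmName": f"{directory_id}-processor-time",
        "Namespace": "AWS/DirectoryService",          # verify in your account
        "MetricName": "Processor % Processor Time",   # verify exact name
        "Dimensions": [
            {"Name": "Directory ID", "Value": directory_id},
            {"Name": "Domain Controller IP", "Value": dc_ip},
        ],
        "Statistic": "Average",
        "Period": 300,              # one 5-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 80.0,
        "ComparisonOperator": "GreaterThanThreshold",
    }

# cloudwatch = boto3.client("cloudwatch")
# cloudwatch.put_metric_alarm(**processor_alarm_params("d-1234567890", "10.0.1.5"))
```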

To facilitate tracking and analyzing performance of AWS Managed Microsoft AD over time, you can use Amazon CloudWatch to create a custom dashboard including relevant metrics.

Prerequisites

Before you get started, make sure that you have the following prerequisites in place:

Create a CloudWatch dashboard

With the prerequisites in place, you’re ready to create a CloudWatch dashboard to track directory service metrics. For more information, see Getting started with CloudWatch automatic dashboards.

To create a dashboard:

  1. Open the AWS Management Console for CloudWatch.
  2. In the navigation pane, choose Dashboards, and then choose Create dashboard.
  3. In the Create new dashboard dialog box, enter a name for the dashboard and then choose Create dashboard.
  4. When the Add widget window appears:
    1. Under Data sources types, select CloudWatch.
    2. Under Data type, select Metrics.
    3. Under Widget type, select Line.
    4. Choose Next.
  5. In the Add metric graph window, choose DirectoryService and then select Processor as the Metric category and % Processor Time under Metric name. Select each instance of the metric, represented as the Domain Controller IP, for one Directory ID.
  6. Choose Create widget.

    Note: If there are multiple directories in the same Region, all instances (domain controller IPs) will be available for selection. To help ensure effective monitoring and alarms, create a separate dashboard for each directory.

  7. Choose the plus sign (+) at the top of the window to add more widgets. Repeat steps 4–6 to add additional widgets for other relevant metrics. In this example the metric categories and names added are:
    • Processor: % Processor Time
    • LogicalDisk: % Free Space
    • Memory: Committed Bytes in Use
    • Database: I/O Database Reads Average Latency
    • Network Interface: Current Bandwidth
    • DNS: Recursive Queries/Sec
  8. After adding the desired metrics, choose Save.

Figure 2: CloudWatch dashboard showing directory services metrics

(Optional) Create an alarm in CloudWatch

Now that you have a dashboard where you can view metrics, consider setting up CloudWatch alarms to alert you when a metric reaches or goes beyond a specified threshold. For more information, see Create a CloudWatch alarm based on a static threshold and Adding an alarm to a CloudWatch dashboard.

The following are recommended thresholds to monitor when determining the need to scale an AWS Managed Microsoft AD. These are general recommendations based on standard use cases. You might have to adjust these thresholds to make the best scaling decisions for your organization.

  • Processor: % Processor Time: Monitor CPU utilization to understand computational demands on your domain controllers. Set CloudWatch alarms at 80% for a period of 5 minutes. Sustained high values indicate potential sizing issues that might require scaling out your directory.
  • LogicalDisk: % Free Space: Maintain at least 25% free space on volumes containing Active Directory data for optimal performance. Set CloudWatch alarms to trigger when free space drops below 20%. Low disk space can severely impact directory operations and require implementing cleanup procedures or scaling up the directory.
  • Network Interface: Current Bandwidth: Average network utilization should be kept below 50% of available bandwidth during peak operations for optimal directory responsiveness. Set CloudWatch alarms at 70% utilization to allow room for spikes in activity. Consistently high values suggest network constraints that might require scaling out your directory.
  • Memory: Committed Bytes in Use: Monitor memory commitment levels to help ensure that your domain controllers have sufficient memory resources for Active Directory operations. This metric tracks the amount of virtual memory that has been committed, indicating the total memory load on your domain controllers. Set CloudWatch alarms at 80% of the commit limit. Sustained high values can lead to excessive paging, significantly degrading directory performance and potentially causing authentication delays.
  • Database: I/O Database Reads Average Latency: Maintain average read latencies below 25 milliseconds. Set CloudWatch alarms at a threshold of 50 milliseconds. If read latencies are consistently elevated, consider scaling-out your directory.
  • DNS: Recursive Queries/sec: Given the tight integration of Active Directory with DNS, monitor this metric for stability and predictable patterns. Use CloudWatch anomaly detection rather than fixed thresholds to identify unexpected behaviors that could indicate DNS configuration issues or potential security concerns.
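The recommended static thresholds above can drive alarm creation from a single table, so all domain controllers get consistent alarms. This sketch mirrors the values in the list (the DNS metric is omitted because it uses anomaly detection rather than a fixed threshold); calling `put_metric_alarm` with each dict is left commented out, and the metric name strings are simplified labels rather than verified CloudWatch metric names.

```python
# Sketch: drive alarm creation from the recommended thresholds above.
# Values mirror the list in this section; the DNS metric is omitted
# because it uses anomaly detection, not a static threshold. Metric
# name strings are simplified labels -- verify exact names in CloudWatch.

RECOMMENDED_THRESHOLDS = {
    "Processor % Processor Time": (80.0, "GreaterThanThreshold"),
    "LogicalDisk % Free Space": (20.0, "LessThanThreshold"),
    "Network Interface Current Bandwidth utilization": (70.0, "GreaterThanThreshold"),
    "Memory % Committed Bytes in Use": (80.0, "GreaterThanThreshold"),
    "Database I/O Database Reads Average Latency": (50.0, "GreaterThanThreshold"),
}

def build_alarms(directory_id: str) -> list[dict]:
    """Assemble put_metric_alarm parameter dicts for one directory."""
    return [
        {
            "AlarmName": f"{directory_id}-{metric}",
            "MetricName": metric,
            "Threshold": threshold,
            "ComparisonOperator": operator,
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 1,
        }
        for metric, (threshold, operator) in RECOMMENDED_THRESHOLDS.items()
    ]

# cloudwatch = boto3.client("cloudwatch")
# for params in build_alarms("d-1234567890"):
#     cloudwatch.put_metric_alarm(**params)
```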

Post-scaling considerations

Different resources across your architecture might contain references to the IP addresses of your AWS Managed Microsoft AD domain controllers. After a scale-out operation deploys additional domain controller instances, update those references to maintain the full functionality of your workloads. References to the directory’s IP addresses can be found in (but might not be limited to) the following:

  • Firewall rules that allow traffic to and from the IP addresses of domain controller instances
  • Route 53 Resolver endpoint rules and DNS conditional forwarders that forward queries to the directory instances
  • CloudWatch dashboards that display metric data about the directory to include dimensions for the new IP addresses

Clean up resources

In this post, you created components that generate costs. Clean up these resources when no longer required to avoid additional charges.

  • Remove the added domain controllers’ IP addresses from firewall rules, Resolver endpoint rules, and DNS conditional forwarders.
  • Delete the custom CloudWatch dashboards you don’t plan to keep.
  • Scale back existing directories to the previous number of domain controller instances.

Conclusion

In this post, you learned how to monitor directory performance metrics using Amazon CloudWatch. By combining performance baselines, monitoring, and planning, you can make informed decisions about when and how to scale a directory safely and efficiently. By scaling directories in a timely manner, you can optimize efficiency and reduce the risk of outages by having a right-sized directory service to support your organization’s workloads.

Scale out your directory when your Active Directory-aware workflows have grown over time and the solution requires additional domain controller instances to maintain the service SLA. Scale up your directory when you require a feature that’s only available in Enterprise Edition AWS Managed Microsoft AD, such as multi-Region replication or additional storage to accommodate Active Directory objects. By using the flexible scaling capabilities and independent Regional expansion, you can optimize costs while maintaining appropriate service levels.

To learn more about AWS Managed Microsoft AD optimization and monitoring with Amazon CloudWatch, see:

Nahuel Benavidez
Nahuel is a Sr. CSE at AWS, specializing in AWS Directory Service, Microsoft Technologies, and SQL Server. He enjoys teaming with customers to discover exciting ways to explore AWS services. Nahuel loves to spoil his niece and goddaughters above all else. Also, Dungeons and Dragons (before it was popular), CrossFit, hiking, trekking, and sharing a pint with friends but "just one."

How to get started with security response automation on AWS

29 January 2026 at 20:44

December 2, 2019: Original publication date of this post.


At AWS, we encourage you to use automation. Not just to deploy your workloads and configure services, but to also help you quickly detect and respond to security events within your AWS environments. In addition to increasing the speed of detection and response, automation also helps you scale your security operations as your workloads in AWS increase and scale as well. For these reasons, security automation is a key principle outlined in the Well-Architected Framework, the AWS Cloud Adoption Framework, and the AWS Security Incident Response Guide.

Security response automation is a broad topic that spans many areas. The goal of this blog post is to introduce you to core concepts and help you get started with implementing automated security response mechanisms within your AWS environments. This post includes common patterns that customers often use, implementation considerations, and an example solution. Additionally, we will share resources AWS has produced in the form of the Automated Security Response GitHub repo, which includes ready-to-deploy scripts for common scenarios.

What is security response automation?

Security response automation is a planned and programmed action taken to achieve a desired state for an application or resource based on a condition or event. When you implement security response automation, you should adopt an approach that draws from existing security frameworks. Frameworks are published materials consisting of standards, guidelines, and best practices that help organizations manage cybersecurity-related risk. Using frameworks helps you achieve consistency and scalability and enables you to focus more on the strategic aspects of your security program. You should work with compliance professionals within your organization to understand any specific compliance or security frameworks that are also relevant for your AWS environment.

Our example solution is based on the NIST Cybersecurity Framework (CSF), which is designed to help organizations assess and improve their ability to help prevent, detect, and respond to security events. According to the CSF, “cybersecurity incident response” supports your ability to contain the impact of potential cybersecurity events.

Although automation is not a CSF requirement, automating responses to events enables you to create repeatable, predictable approaches to monitoring and responding to threats. When we build automation around events that we know should not occur, it gives us an advantage over a malicious actor because the automation is able to respond within minutes or even seconds compared to an on-call support engineer.

The five main steps in the CSF are identify, protect, detect, respond, and recover. We’ve expanded the detect and respond steps to include automation and investigation activities.

Figure 1: The five steps in the CSF


The following definitions for each step in the diagram above are based on the CSF but have been adapted for our example in this blog post. Although we will focus on the detect, automate and respond steps, it’s important to understand the entire process flow.

  • Identify: Identify and understand the resources, applications, and data within your AWS environment.
  • Protect: Develop and implement appropriate controls and safeguards to facilitate the delivery of services.
  • Detect: Develop and implement appropriate activities to identify the occurrence of a cybersecurity event. This step includes the implementation of monitoring capabilities which will be discussed further in the next section.
  • Automate: Develop and implement planned, programmed actions that will achieve a desired state for an application or resource based on a condition or event.
  • Investigate: Perform a systematic examination of the security event to establish the root cause.
  • Respond: Develop and implement appropriate activities to take automated or manual actions regarding a detected security event.
  • Recover: Develop and implement appropriate activities to maintain plans for resilience and to restore capabilities or services that were impaired due to a security event.

Security response automation on AWS

AWS CloudTrail and AWS Config continuously log details regarding users and other identity principals, the resources they interacted with, and configuration changes they might have made in your AWS account. We are able to combine these logs with Amazon EventBridge, which gives us a single service to trigger automations based on events. You can use this information to automatically detect resource changes and to react to deviations from your desired state.

Figure 2: Automated remediation flow


As shown in the diagram above, an automated remediation flow on AWS has three stages:

  1. Monitor: Your automated monitoring tools collect information about resources and applications running in your AWS environment. For example, they might collect AWS CloudTrail information about activities performed in your AWS account, usage metrics from your Amazon EC2 instances, or flow log information about the traffic going to and from network interfaces in your Amazon Virtual Private Cloud (VPC).
  2. Detect: When a monitoring tool detects a predefined condition—such as a breached threshold, anomalous activity, or configuration deviation—it raises a flag within the system. A triggering condition might be an anomalous activity detected by Amazon GuardDuty, a resource out of compliance with an AWS Config rule, or a high rate of blocked requests on an Amazon VPC security group or AWS Web Application Firewall (AWS WAF) web access control list (web ACL).
  3. Respond: When a condition is flagged, an automated response is triggered that performs an action you’ve predefined—something intended to remediate or mitigate the flagged condition.

Examples of automated response actions include modifying a VPC security group, patching an Amazon EC2 instance, rotating credentials, or adding an entry to an IP set in AWS WAF that is part of a web ACL rule to block suspicious clients that triggered a monitoring threshold.

You can use the event-driven flow described above to achieve a variety of automated response patterns with varying degrees of complexity. Your response pattern could be as simple as invoking a single AWS Lambda function, or it could be a complex series of AWS Step Functions tasks with advanced logic. In this blog post, we’ll use two simple Lambda functions in our example solution.
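A single-function response can be as small as the following sketch. The event keys ("resource", "dry_run") and the notify-only/remediate split are illustrative assumptions, not the event shape used by the sample solution later in this post:

```python
def lambda_handler(event, context):
    """Minimal skeleton for a single-function automated response.

    A real handler would parse the triggering service's event format and
    call the relevant AWS API via boto3 to restore the desired state."""
    resource = event.get("resource", "unknown")
    # Respect a notify-only mode so the automation can run in observation
    # mode before you trust it to remediate.
    action = "notify-only" if event.get("dry_run") else "remediate"
    return {"resource": resource, "action": action}
```

Starting with a skeleton like this keeps the event-matching logic in EventBridge and the remediation logic in the function, which is the split the rest of this post follows.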

How to define your response automation

Now that we’ve introduced the concept of security response automation, you can start thinking about the security requirements within your environment that you’d like to enforce through automation. These design requirements might come from general best practices you’d like to follow, or they might be specific controls from compliance frameworks relevant to your business.

Many customers start with the runbooks they already use as part of their incident response lifecycle. Simple runbooks, like responding to an exfiltrated credential, can be quickly mapped to automation, especially if the runbook calls for disabling the credential and notifying on-call personnel. Automation can be resource driven as well: an event such as the creation of a new Amazon VPC might trigger your automation to immediately deploy your company’s standard configuration for VPC flow log collection.
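The exfiltrated-credential runbook can be expressed as a small, reviewable plan before wiring it to an event. This is a sketch: the function and topic ARN parameter are hypothetical, and executing the plan would require boto3 clients with matching IAM permissions.

```python
def exfiltrated_key_plan(user_name, access_key_id, topic_arn):
    """Map the runbook (disable the credential, notify on-call) to the
    concrete API calls an automation would make."""
    return [
        # Deactivate rather than delete, so investigators can still inspect it.
        ("iam.update_access_key",
         {"UserName": user_name, "AccessKeyId": access_key_id,
          "Status": "Inactive"}),
        ("sns.publish",
         {"TopicArn": topic_arn,
          "Message": f"Deactivated access key {access_key_id} "
                     f"for user {user_name}"}),
    ]
```

Reviewing the planned calls as data, before attaching them to an EventBridge rule, is an easy way to get security and operations teams to agree on what the automation will do.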

Your objectives should be quantitative, not qualitative. Here are some examples of quantitative objectives:

  • Remote administrative network access to servers should be limited.
  • Server storage volumes should be encrypted.
  • AWS console logins should be protected by multi-factor authentication.
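Each of these quantitative objectives maps to a detective control. As one illustration, the three objectives above correspond to AWS-managed Config rules; the following is a sketch of a CloudFormation fragment that enables them (verify the rule identifiers against the AWS Config managed rules reference before deploying):

```yaml
Resources:
  RestrictedSshRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: restricted-ssh
      Source:
        Owner: AWS
        SourceIdentifier: INCOMING_SSH_DISABLED
  EncryptedVolumesRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: encrypted-volumes
      Source:
        Owner: AWS
        SourceIdentifier: ENCRYPTED_VOLUMES
  ConsoleMfaRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: mfa-enabled-for-iam-console-access
      Source:
        Owner: AWS
        SourceIdentifier: MFA_ENABLED_FOR_IAM_CONSOLE_ACCESS
```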

As an optional step, you can expand these objectives into user stories that define the conditions and remediation actions when there is an event. User stories are informal descriptions that briefly document a feature within a software system. User stories may be global and span across multiple applications or they may be specific to a single application.

For example:

“Remote administrative network access to servers should have limited access from internal trusted networks only. Remote access ports include SSH TCP port 22 and RDP TCP port 3389. If remote access ports are detected within the environment and they are accessible to outside resources, they should be automatically closed and the owner will be notified.”

Once you’ve completed your user story, you can determine how to use automated remediation to help achieve these objectives in your AWS environment. User stories should be stored in a location that provides versioning support and can reference the associated automation code.
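The detection half of this user story can be sketched as a pure function over the IpPermissions structure that ec2.describe_security_groups returns. This is detection only; the revoke call and the owner notification from the user story are left to the caller.

```python
ADMIN_PORTS = {22, 3389}  # SSH and RDP, per the user story


def open_admin_rules(ip_permissions):
    """Return the security group rules that expose a remote admin port
    to the internet (0.0.0.0/0)."""
    flagged = []
    for perm in ip_permissions:
        from_port, to_port = perm.get("FromPort"), perm.get("ToPort")
        if from_port is None:
            continue  # e.g. rules for all protocols carry no port range
        if not any(from_port <= p <= to_port for p in ADMIN_PORTS):
            continue
        for rng in perm.get("IpRanges", []):
            if rng.get("CidrIp") == "0.0.0.0/0":
                flagged.append(perm)
                break
    return flagged
```

Because the function takes plain data, you can unit test the policy decision without an AWS account, then feed it real describe_security_groups output inside a Lambda function.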

You should carefully consider the effect of your remediation mechanisms in order to help prevent unintended impact on your resources and applications. Remediation actions such as instance termination, credential revocation, and security group modification can adversely affect application availability. Depending on the level of risk that’s acceptable to your organization, your automated mechanism might only send a notification, which is then manually investigated prior to remediation. Once you’ve identified an automated remediation mechanism, you can build out the required components and test them in a non-production environment.

Sample response automation walkthrough

In the following section, we’ll walk you through an automated remediation for a simulated event that indicates potential unauthorized activity—the unintended disabling of CloudTrail logging. Outside parties might want to disable logging to avoid detection and the recording of their unauthorized activity. Our response is to re-enable the CloudTrail logging and immediately notify the security contact. Here’s the user story for this scenario:

“CloudTrail logging should be enabled for all AWS accounts and regions. If CloudTrail logging is disabled, it will automatically be enabled and the security operations team will be notified.”

A note about the sample response automation below as it references Amazon EventBridge: EventBridge was formerly known as Amazon CloudWatch Events. If you see other documentation referring to Amazon CloudWatch Events, you can now find that configuration via the Amazon EventBridge console page.

Additionally, we will be looking at this scenario through the lens of an account that has a stand-alone CloudTrail configuration. While this is an acceptable configuration, AWS recommends using AWS Organizations, which allows you to configure an organizational CloudTrail. Organizational trails cannot be modified by member accounts, so logging data cannot be removed or tampered with.

In order to use our sample remediation, you will need to enable Amazon GuardDuty and AWS Security Hub in the AWS Region you have selected. Both of these services include a 30-day trial at no additional cost. See the AWS Security Hub pricing page and the Amazon GuardDuty pricing page for additional details.

Important: You’ll use AWS CloudTrail to test the sample remediation. Running more than one CloudTrail trail in your AWS account will result in charges based on the number of events processed while the trail is running. Charges for additional copies of management events recorded in a Region are applied based on the published pricing plan. To minimize the charges, follow the clean-up steps that we provide later in this post to remove the sample automation and delete the trail.

Deploy the sample response automation

In this section, we’ll show you how to deploy and test the CloudTrail logging remediation sample. Amazon GuardDuty generates the finding Stealth:IAMUser/CloudTrailLoggingDisabled when CloudTrail logging is disabled, and AWS Security Hub collects findings from GuardDuty using the standardized finding format mentioned earlier. We recommend that you deploy this sample into a non-production AWS account.

Select the Launch Stack button below to deploy a CloudFormation template with an automation sample in the us-east-1 Region. You can also download the template and implement it in another Region. The template consists of an Amazon EventBridge rule, an AWS Lambda function, and the IAM permissions necessary for both components to execute. It takes several minutes for the CloudFormation stack build to complete.

Select the Launch Stack button to launch the template

  1. In the CloudFormation console, choose the Select Template form, and then select Next.
  2. On the Specify Details page, provide the email address for a security contact. For the purpose of this walkthrough, it should be an email address that you have access to. Then select Next.
  3. On the Options page, accept the defaults, then select Next.
  4. On the Review page, confirm the details, then select Create.
  5. While the stack is being created, check the inbox of the email address that you provided in step 2. Look for an email message with the subject AWS Notification – Subscription Confirmation. Select the link in the body of the email to confirm your subscription to the Amazon Simple Notification Service (Amazon SNS) topic. You should see a success message like the one shown in Figure 3:

    Figure 3: SNS subscription confirmation


  6. Return to the CloudFormation console. After the Status field for the CloudFormation stack changes to CREATE_COMPLETE (as shown in Figure 4), the solution is implemented and is ready for testing.

    Figure 4: CREATE_COMPLETE status


Test the sample automation

You’re now ready to test the automated response by creating a test trail in CloudTrail, then trying to stop it.

  1. From the AWS Management Console, choose Services > CloudTrail.
  2. Select Trails, then select Create Trail.
  3. On the Create Trail form:
    1. Enter a value for Trail name and for AWS KMS alias, as shown in Figure 5.
    2. For Storage location, create a new S3 bucket or choose an existing one. For our testing, we create a new S3 bucket.

      Figure 5: Create a CloudTrail trail


    3. On the next page, under Management events, select Write-only (to minimize event volume).

      Figure 6: Create a CloudTrail trail


  4. On the Trails page of the CloudTrail console, verify that the new trail has started. You should see the status as logging, as shown in Figure 7.

    Figure 7: Verify new trail has started


  5. You’re now ready to act like an unauthorized user trying to cover their tracks. Stop the logging for the trail that you just created:
    1. Select the new trail name to display its configuration page.
    2. In the top-right corner, choose the Stop logging button.
    3. When prompted with a warning dialog box, select Stop logging.
    4. Verify that the logging has stopped by confirming that the Start logging button now appears in the top right, as shown in Figure 8.

      Figure 8: Verify logging switch is off


    You have now simulated a security event by disabling logging for one of the trails in the CloudTrail service. Within the next few seconds, the near real-time automated response will detect the stopped trail, restart it, and send an email notification. You can refresh the Trails page of the CloudTrail console and verify that the Stop logging button appears again in the top right corner, indicating that logging has resumed.

    Within the next several minutes, the investigatory automated response will also begin. GuardDuty will detect the action that stopped the trail and enrich the data about the source of unexpected behavior. Security Hub will then ingest that information and optionally correlate with other security events.

    By following the steps below, you can monitor Security Hub for the generation of a finding of the type TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled:

  6. In the AWS Management Console, choose Services > Security Hub.
    1. In the left pane, select Findings.
    2. Select the Add filters field, then select Type.
    3. Select EQUALS, paste TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled into the field, then select Apply.
    4. Refresh your browser periodically until the finding is generated.

    Figure 9: Monitor Security Hub for your finding


  7. Select the title of the finding to review details. When you’re ready, you can choose to archive the finding by selecting the Archive link. Alternatively, you can select a custom action to continue with the response. Custom actions are one of the ways that you can integrate Security Hub with custom partner solutions.

Now that you’ve completed your review of the finding, let’s dig into the components of automation.

How the sample automation works

This example incorporates two automated responses: a near real-time workflow and an investigatory workflow. The near real-time workflow provides a rapid response to an individual event, in this case the stopping of a trail. The goal is to restore the trail to a functioning state and alert security responders as quickly as possible. The investigatory workflow still includes a response to provide defense in depth and uses services that support a more in-depth investigation of the incident.

Figure 10: Sample automation workflow


In the near real-time workflow, Amazon EventBridge monitors for the undesired activity.

When a trail is stopped, AWS CloudTrail publishes an event on the EventBridge bus. An EventBridge rule detects the trail-stopping event and invokes a Lambda function to respond to the event by restarting the trail and notifying the security contact via an Amazon Simple Notification Service (SNS) topic.
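As a sketch of what that rule can match, a pattern that targets the trail-stopping API call directly (rather than the Security Hub finding used by the sample template later in this post) would follow the standard "AWS API Call via CloudTrail" event shape:

```json
{
  "source": ["aws.cloudtrail"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["cloudtrail.amazonaws.com"],
    "eventName": ["StopLogging"]
  }
}
```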

In the investigative workflow, CloudTrail logs are monitored for undesired activities. For example, if a trail is stopped, there will be a corresponding log record. GuardDuty detects this activity and retrieves additional data points regarding the source IP that executed the API call. Two common examples of those additional data points in GuardDuty findings include whether the API call came from an IP address on a threat list, or whether it came from a network not commonly used in your AWS account. An AWS Lambda function responds by restarting the trail and notifying the security contact. The finding is imported into AWS Security Hub, where it’s aggregated with other findings for analyst viewing. Using EventBridge, you can configure Security Hub to export the finding to partner security orchestration tools, SIEM (security information and event management) systems, and ticketing systems for investigation.

AWS Security Hub imports findings from AWS security services such as GuardDuty, Amazon Macie and Amazon Inspector, plus from third-party product integrations you’ve enabled. Findings are provided to Security Hub in AWS Security Finding Format (ASFF), which minimizes the need for data conversion. Security Hub correlates these findings to help you identify related security events and determine a root cause. Security Hub also publishes its findings to Amazon EventBridge to enable further processing by other AWS services such as AWS Lambda. You can also create custom actions using Security Hub. Custom actions are useful for security analysts working with the Security Hub console who want to send a specific finding, or a small set of findings, to a response or a remediation workflow.
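For orientation, a heavily trimmed ASFF finding looks like the following sketch; the field names follow the ASFF specification, while the values here are illustrative placeholders:

```json
{
  "SchemaVersion": "2018-10-08",
  "Id": "example-finding-id",
  "ProductArn": "arn:aws:securityhub:us-east-1::product/aws/guardduty",
  "GeneratorId": "example-generator-id",
  "AwsAccountId": "111122223333",
  "Types": ["TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"],
  "CreatedAt": "2019-12-02T00:00:00Z",
  "UpdatedAt": "2019-12-02T00:00:00Z",
  "Severity": {"Label": "MEDIUM"},
  "Title": "CloudTrail logging was disabled",
  "Description": "The StopLogging API was invoked on a trail.",
  "Resources": [{"Type": "AwsCloudTrailTrail", "Id": "example-trail-arn"}]
}
```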

Deeper look into how the “Respond” phase works

Amazon EventBridge and AWS Lambda work together to respond to a security finding.

Amazon EventBridge is a service that provides real-time access to changes in data in AWS services, your own applications, and Software-as-a-Service (SaaS) applications without writing code. In this example, EventBridge identifies a Security Hub finding that requires action and invokes a Lambda function that performs remediation. As shown in Figure 11, the Lambda function both notifies the security operator via SNS and restarts the stopped CloudTrail.

Figure 11: Sample “respond” workflow


To set this response up, we looked for an event to indicate that a trail had stopped or was disabled. We knew that the GuardDuty finding Stealth:IAMUser/CloudTrailLoggingDisabled is raised when CloudTrail logging is disabled. Therefore, we configured the default event bus to look for this event.

You can learn more regarding the available GuardDuty findings in the user guide.

How the code works

When Security Hub publishes a finding to EventBridge, it includes full details of the finding as discovered by GuardDuty. The finding is published in JSON format. If you review the details of the sample finding, note that it has several fields helping you identify the specific events that you’re looking for. Here are some of the relevant details:

{
   …
   "source": "aws.securityhub",
   …
   "detail": {
      "findings": [{
         …
         "Types": [
            "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"
         ],
         …
      }]
   }
}

You can build an event pattern using these fields, which an EventBridge filtering rule can then use to identify events and to invoke the remediation Lambda function. Below is a snippet from the CloudFormation template we provided earlier that defines that event pattern for the EventBridge filtering rule:

# pattern matches the nested JSON format of a specific Security Hub finding
      EventPattern:
        source:
          - aws.securityhub
        detail-type:
          - "Security Hub Findings - Imported"
        detail:
          findings:
            Types:
              - "TTPs/Defense Evasion/Stealth:IAMUser-CloudTrailLoggingDisabled"

Once the rule is in place, EventBridge continuously monitors the event bus for events with this pattern.

When EventBridge finds a match, it invokes the remediating Lambda function and passes the full details of the event to the function. The Lambda function then parses the JSON fields in the event so that it can act as shown in this Python code snippet:

# extract trail ARN by parsing the incoming Security Hub finding (in JSON format)
trailARN = event['detail']['findings'][0]['ProductFields']['action/awsApiCallAction/affectedResources/AWS::CloudTrail::Trail']   

# description contains useful details to be sent to security operations
description = event['detail']['findings'][0]['Description']

The code also issues a notification to security operators so they can review the findings and insights in Security Hub and other services to better understand the incident and to decide whether further manual actions are warranted. Here’s the code snippet that uses SNS to send out a note to security operators:

#Sending the notification that the AWS CloudTrail has been disabled.
snspublish = snsclient.publish(
	TargetArn = snsARN,
	Message="Automatically restarting CloudTrail logging.  Event description: \"%s\" " %description
	)

While notifications to human operators are important, the Lambda function will not wait to take action. It immediately remediates the condition by restarting the stopped trail in CloudTrail. Here’s a code snippet that restarts the trail to reenable logging:

import boto3
from botocore.exceptions import ClientError

try:
	client = boto3.client('cloudtrail')
	enablelogging = client.start_logging(Name=trailARN)
	logger.debug("Response on enable CloudTrail logging - %s" % enablelogging)
except ClientError as e:
	logger.error("An error occurred: %s" % e)

After the trail has been restarted, API activity is once again logged and can be audited.

This can help provide relevant data for the remaining steps in the incident response process. The data is especially important for the post-incident phase, when your team analyzes lessons learned to help prevent future incidents. You can also use this phase to identify additional steps to automate in your incident response.

How to enable a custom action and build your own automated response

Unlike the notification automation you set up earlier, you may not want to fully automate responses to findings. To set up automation that you can manually trigger for specific findings, you can use custom actions. A custom action is a Security Hub mechanism for sending selected findings to EventBridge, where they can be matched by an EventBridge rule. The rule defines a specific action to take when a finding associated with the custom action ID is received. Custom actions can be used, for example, to send a specific finding, or a small set of findings, to a response or remediation workflow. You can create up to 50 custom actions.

In this section, we will walk you through how to create a custom action in Security Hub that triggers an EventBridge rule to execute a Lambda function for the same security finding related to disabled CloudTrail logging.

Create a Custom Action in Security Hub

  1. Open Security Hub. In the left navigation pane, under Management, open the Custom actions page.
  2. Choose Create custom action.
  3. Enter an Action Name, Action Description, and Action ID that are representative of an action that you are implementing—for example Enable CloudTrail Logging.
  4. Choose Create custom action.
  5. Copy the custom action ARN that was generated. You will need it in the next steps.

Create Amazon EventBridge Rule to capture the Custom Action

In this section, you will define an EventBridge rule that will match events (findings) coming from Security Hub which were forwarded by the custom action you defined above.

  1. Navigate to the Amazon EventBridge console.
  2. On the right side, choose Create rule.
  3. On the Define rule detail page, give your rule a name and description that represents the rule’s purpose (for example, the same name and description that you used for the custom action). Then choose Next.
  4. Security Hub findings are sent as events to the AWS default event bus. In the Define pattern section, you can identify filters to take a specific action when matched events appear. For the Build event pattern step, leave the Event source set to AWS events or EventBridge partner events.
  5. Scroll down to Event pattern. Under Event source, leave it set to AWS Services, and under AWS Service, select Security Hub.
  6. For the Event Type, choose Security Hub Findings – Custom Action.
  7. Then select Specific custom action ARN(s) and enter the ARN for the custom action that you created earlier.
  8. Notice that as you selected these options, the event pattern on the right was updating. Choose Next.
  9. On the Select target(s) step, from the Select a target dropdown, select Lambda function. Then, from the Function dropdown, select SecurityAutoremediation-CloudTrailStartLoggingLamb-xxxx. This Lambda function was created as part of the CloudFormation template.
  10. Choose Next.
  11. For the Configure tags step, choose Next.
  12. For the Review and create step, choose Create rule.

Trigger the automation

Because GuardDuty and Security Hub have been enabled, when AWS CloudTrail logging is disabled you should see a security finding generated by Amazon GuardDuty and collected in AWS Security Hub.

  1. Navigate to the Security Hub Findings page.
  2. In the top corner, from the Actions dropdown menu, select the Enable CloudTrail Logging custom action.
  3. Verify the CloudTrail configuration by accessing the AWS CloudTrail dashboard.
  4. Confirm that the trail status displays as Logging, which indicates the successful execution of the remediation Lambda function triggered by the EventBridge rule through the custom action.

How AWS helps customers get started

Many customers see the task of building automated remediation as daunting, and many operations teams might not have the skills or staffing to develop automation scripts. Because many incident response scenarios can be mapped to findings in AWS security services, AWS has built tools that respond to those findings and are quickly adaptable to your environment.

Automated Security Response (ASR) on AWS is a solution that enables AWS Security Hub customers to remediate findings with a single click using sets of predefined response and remediation actions called Playbooks. The remediations are implemented as AWS Systems Manager automation documents. The solution includes remediations for issues such as unused access keys, open security groups, weak account password policies, VPC flow logging configurations, and public S3 buckets. Remediations can also be configured to trigger automatically when findings appear in AWS Security Hub.

The solution includes the playbook remediations for some of the security controls defined as part of the following standards:

  • AWS Foundational Security Best Practices (FSBP) v1.0.0
  • Center for Internet Security (CIS) AWS Foundations Benchmark v1.2.0
  • Center for Internet Security (CIS) AWS Foundations Benchmark v1.4.0
  • Center for Internet Security (CIS) AWS Foundations Benchmark v3.0.0
  • Payment Card Industry (PCI) Data Security Standard (DSS) v3.2.1
  • National Institute of Standards and Technology (NIST) Special Publication 800-53 Revision 5

A Playbook called Security Control is included, which supports AWS Security Hub’s Consolidated Control Findings feature.

Figure 12: Architecture of the Automated Security Solution


Additionally, the library includes instructions in the Implementation Guide on how to create new automations in an existing Playbook.

You can use and deploy this library into your accounts at no additional cost; however, there are costs associated with the services that it consumes.

Clean up

After you’ve completed the sample security response automation, we recommend that you remove the resources created in this walkthrough example from your account in order to minimize the charges associated with the trail in CloudTrail and data stored in S3.

Important: Deleting resources in your account can negatively impact the applications running in your AWS account. Verify that applications and AWS account security do not depend on the resources you’re about to delete.

Here are the clean-up steps:

Summary

You’ve learned the basic concepts and considerations behind security response automation on AWS and how to use Amazon EventBridge, Amazon GuardDuty, and AWS Security Hub to automatically re-enable AWS CloudTrail when it becomes disabled unexpectedly. Additionally, you learned about the AWS Automated Security Response library and how it can help you rapidly get started with automations through Security Hub. As a next step, you may want to start building your own custom response automations and dive deeper into the AWS Security Incident Response Guide, the NIST Cybersecurity Framework (CSF), or the AWS Cloud Adoption Framework (CAF) Security Perspective. You can explore additional automatic remediation solutions in the AWS Solutions Library. You can find the code used in this example on GitHub.

If you have feedback about this blog post, submit it in the Comments section below. If you have questions about using this solution, start a thread in the EventBridge, GuardDuty, or Security Hub forums, or contact AWS Support.

Exploring common centralized and decentralized approaches to secrets management

23 January 2026 at 20:15

One of the most common questions about secrets management strategies on Amazon Web Services (AWS) is whether an organization should centralize its secrets. Though this question is often focused on whether secrets should be centrally stored, there are four aspects of centralizing the secrets management process that need to be considered: creation, storage, rotation, and monitoring. In this post, we discuss the advantages and tradeoffs of centralizing or decentralizing each of these aspects of secrets management.

Centralized creation of secrets

When deciding whether to centralize secrets creation, you should consider how you already deploy infrastructure in the cloud. Modern DevOps practices have driven some organizations toward developer portals and internal developer platforms that use golden paths for infrastructure deployment. With these tools, developers can deploy infrastructure in a self-service model through infrastructure as code (IaC) while adhering to organizational standards.

A central function, such as a platform engineering team, maintains these golden paths. Services that can be used to define and maintain golden paths include AWS Service Catalog and popular open source projects such as Backstage.io. Using this approach, developers can focus on application code while platform engineers focus on infrastructure deployment, security controls, and developer tooling. An example of a golden path might be a templatized implementation for a microservice that writes to a database.

For example, a golden path could define that a service or application must be built using the AWS Cloud Development Kit (AWS CDK), running on Amazon Elastic Container Service (Amazon ECS), and use AWS Secrets Manager to retrieve database credentials. The platform team could also build checks to help ensure that the secret’s resource policy only allows access to the role being used by the microservice and is encrypted with a customer managed key. This pattern abstracts deployments away from developers and facilitates resource deployment across accounts. This is one example of a centralized creation pattern, shown in Figure 1.

Figure 1: Architecture diagram highlighting the developer portal deployment pattern for centralized creation
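The resource-policy check described above can be sketched in a few lines. The following Python snippet is a hypothetical illustration (the role ARN and helper name are not from the golden-path tooling itself); it builds a secret resource policy that limits `GetSecretValue` to the microservice’s role:

```python
import json

def build_secret_resource_policy(role_arn: str) -> str:
    """Return a resource policy that limits reads to one workload role."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowWorkloadRole",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
            },
            {
                # Explicitly deny every other principal, a common
                # belt-and-suspenders pattern for sensitive secrets.
                "Sid": "DenyOtherPrincipals",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "secretsmanager:GetSecretValue",
                "Resource": "*",
                "Condition": {"StringNotEquals": {"aws:PrincipalArn": role_arn}},
            },
        ],
    }
    return json.dumps(policy, indent=2)

# Hypothetical ECS task role for the microservice
policy_json = build_secret_resource_policy(
    "arn:aws:iam::111122223333:role/orders-service-task-role"
)
```

A golden path could generate this policy at deployment time and attach it to the secret, so developers never write resource policies by hand.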

The advantages of this approach are:

  • Consistent naming, tagging, and access control: When secrets are created centrally, you can enforce a standard naming convention based on the account, workload, service, or data classification. This simplifies implementing scalable patterns like attribute-based access control (ABAC).
  • Least privilege checks in CI/CD pipelines: When you create secrets within the confines of IaC pipelines, you can use APIs such as the AWS IAM Access Analyzer check-no-new-access API. Deployment pipelines can be templatized, so individual teams can take advantage of organizational standards while still owning deployment pipelines.
  • Create mechanisms for collaboration between platform engineering and security teams: Often, the shift towards golden paths and service catalogs is driven by a desire for a better developer experience and reduced operational overhead. A byproduct of this move is that security teams can partner with platform engineering teams to build security by default into these paths.
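As a sketch of the first advantage, a central creation function might derive secret names and tags from workload metadata so that ABAC policies can match on them. The convention below is hypothetical, not an AWS standard:

```python
# Hypothetical convention: <environment>/<workload>/<service>/<purpose>
def secret_name(environment: str, workload: str, service: str, purpose: str) -> str:
    return f"{environment}/{workload}/{service}/{purpose}"

def secret_tags(environment: str, workload: str, classification: str) -> list:
    # An ABAC policy could match these tags with aws:ResourceTag conditions
    return [
        {"Key": "environment", "Value": environment},
        {"Key": "workload", "Value": workload},
        {"Key": "data-classification", "Value": classification},
    ]

name = secret_name("prod", "payments", "orders-api", "db-credentials")
tags = secret_tags("prod", "payments", "internal")
```

Because every secret follows the same shape, a single IAM policy with an `aws:ResourceTag/workload` condition can scale across hundreds of teams.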

The tradeoffs of this approach are:

  • It takes time and effort to make this shift. You might not have the resources to invest in full-time platform engineering or DevOps teams. To centrally provision software and infrastructure like this, you must maintain libraries of golden paths that are appropriate for the use cases of your organization. Depending on the size of your organization, this might not be feasible.
  • Golden paths must keep up with the features of the services they support: If you’re using this pattern and a service you rely on releases a new feature, your developers must wait for the feature to be added to the affected golden paths.

If you want to learn more about the internal developer platform pattern, check out the re:Invent 2024 talk Elevating the developer experience with Backstage on AWS.

Decentralized creation of secrets

In a decentralized model, application teams own the IaC templates and deployment mechanisms in their own accounts. Here, each team is operating independently, which can make it more difficult to enforce standards as code. We’ll refer to this pattern, shown in Figure 2, as a decentralized creation pattern.

Figure 2: Decentralized creation of secrets

The advantages of this approach are:

  • Speed: Developers can move quickly and have more autonomy because they own the creation process. Individual teams don’t have a dependency on a central function.
  • Flexibility: You can still use features such as the IAM Access Analyzer check-no-new-access API, but it’s up to each team to implement this in their pipeline.

The tradeoffs of this approach are:

  • Lack of standardization: It can become more difficult to enforce naming and tagging conventions, because it’s not templatized and applied through central creation mechanisms. Access controls and resource policies might not be consistent across teams.
  • Developer attention: Developers must manage more of the underlying infrastructure and deployment pipelines.

Centralized storage of secrets

Some customers choose to store their secrets in a central account, and others choose to store secrets in the accounts in which their workloads live. Figure 3 shows the architecture for centralized storage of secrets.

Figure 3: Centralized storage of secrets

The advantages of centralizing the storage of secrets are:

  • Simplified monitoring and observability: Keeping secrets in a single account, controlled by a centralized team, can simplify monitoring them.

Some tradeoffs of centralizing the storage of secrets are:

  • Additional operational overhead: When sharing secrets across accounts, you must configure resource policies on each secret that is shared.
  • Additional cost of AWS KMS customer managed keys: You must use AWS Key Management Service (AWS KMS) customer managed keys when sharing secrets across accounts. While this gives you an additional layer of access control over secret access, it increases cost under AWS KMS pricing. It also adds another policy that needs to be created and maintained.
  • High concentration of sensitive data: Having secrets in a central account can increase the number of resources affected in the event of inadvertent access or misconfiguration.
  • Account quotas: Before deciding on a centralized secrets account, review the AWS service quotas to verify that you won’t reach them in your production environment.
  • Service managed secrets: When services such as Amazon Relational Database Service (Amazon RDS) or Amazon Redshift manage secrets on your behalf, these secrets are placed in the same account as the resource with which the secret is associated. To maintain a centralized storage of secrets while using service managed secrets, the resources would also have to be centralized.
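To make the first two tradeoffs concrete: sharing a centrally stored secret requires a resource policy on each secret, and the secret must be encrypted with a customer managed key whose key policy also grants the consumer kms:Decrypt. A minimal sketch of the resource-policy side, with hypothetical account and role names:

```python
def cross_account_secret_policy(consumer_role_arn: str) -> dict:
    """Resource policy granting a role in another account read access.
    The consuming account still needs a matching IAM identity policy,
    and the KMS key policy must separately allow the consumer to decrypt."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": consumer_role_arn},
                "Action": [
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                ],
                "Resource": "*",
            }
        ],
    }

# Hypothetical workload role in a consuming account
policy = cross_account_secret_policy("arn:aws:iam::444455556666:role/app-role")
```

Multiply this by every consuming workload and you can see where the operational overhead of the centralized pattern comes from.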

Though there are advantages to centralizing secrets for monitoring and observability, many customers already rely on services such as AWS Security Hub, IAM Access Analyzer, AWS Config, and Amazon CloudWatch for cross-account observability. These services make it easier to create centralized views of secrets in a multi-account environment.

Decentralized storage of secrets

In a decentralized approach to storage, shown in Figure 4, secrets live in the same accounts as the workloads that need access to them.

Figure 4: Decentralized storage of secrets

The advantages of decentralizing the storage of secrets are:

  • Account boundaries and logical segmentation: Account boundaries provide a natural segmentation between workloads in AWS. When operating in a distributed multi-account environment, you cannot access secrets from another account by default, and all cross-account access must be allowed by both a resource policy in the source account and an IAM policy in the destination account. You can use resource control policies to prevent the sharing of secrets across accounts.
  • AWS KMS key choice: If your secrets aren’t shared across accounts, then you have the choice to use AWS KMS customer managed keys or AWS managed keys to encrypt your secrets.
  • Delegate permissions management to application owners: When secrets are stored in accounts with the applications that need to consume them, application owners define fine-grained permissions in secrets resource policies.

There are a few tradeoffs to consider for this architecture:

  • Auditing and monitoring require cross-account deployments: Tools that are used to monitor the compliance and status of secrets need to operate across multiple accounts and present information in a single place. This is simplified by AWS native tools, which are described later in this post.
  • Automated remediation workflows: You can have detective controls in place to alert on any misconfiguration or security risks related to your secrets. For example, you can surface an alert when a secret is shared outside of your organizational boundary through a resource policy. These workflows can be more complex in a multi-account environment. However, we have samples that can help, such as the Automated Security Response on AWS solution.
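The detective control in the second tradeoff can be approximated locally. The following simplified sketch (the account IDs are hypothetical, and a production check would use IAM Access Analyzer rather than hand-rolled parsing) flags Allow statements whose principals sit outside the organization:

```python
ORG_ACCOUNT_IDS = {"111122223333", "444455556666"}  # hypothetical org accounts

def external_principals(resource_policy: dict, org_accounts: set) -> list:
    """Return principal ARNs in Allow statements whose account ID
    is outside the organization -- a simplified detective check."""
    findings = []
    for stmt in resource_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        arns = principal.get("AWS", []) if isinstance(principal, dict) else []
        if isinstance(arns, str):
            arns = [arns]
        for arn in arns:
            # ARN format: arn:aws:iam::<account-id>:role/<name>
            account_id = arn.split(":")[4] if arn.count(":") >= 5 else ""
            if account_id not in org_accounts:
                findings.append(arn)
    return findings

policy = {
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": "arn:aws:iam::999999999999:role/unknown"},
        "Action": "secretsmanager:GetSecretValue",
    }]
}
alerts = external_principals(policy, ORG_ACCOUNT_IDS)
```

A remediation workflow could raise an alert, or remove the offending statement, whenever this check returns findings.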

Centralized rotation

Like the creation and storage of secrets, organizations take different approaches to centralizing the lifecycle management and rotation of secrets.

When you centralize lifecycle management, as shown in Figure 5, a central team manages and owns AWS Lambda functions for rotation. The advantages of centralizing the lifecycle management of secrets are:

  • Developers can reuse rotation functions: In this pattern, a centralized team maintains a common library of rotation functions for different use cases. An example of this can be seen in this AWS re:Inforce session. Using this method, application teams don’t have to build their own custom rotation functions and can benefit from common architectural decisions regarding databases and third-party software as a service (SaaS) applications.
  • Logging: When storing and accessing rotation function logs, the centralized pattern can simplify managing logs from a single place.
Figure 5: Centralized rotation of secrets

There are some tradeoffs in centralizing the lifecycle management and rotation of secrets:

  • Additional cross-account access scenarios: When centralizing lifecycle management, the Lambda functions in central accounts require permissions to create, update, delete, and read secrets in the application accounts. This increases the operational overhead required to enable secret rotation.
  • Service quotas: When you centralize a function at scale, service quotas can come into play. Check the Lambda service quotas to verify that you won’t hit quotas in your production environments.

Decentralized rotation

Decentralizing the lifecycle management of secrets is a more common choice, where the rotation functions live in the same account as the associated secret, as shown in Figure 6.

Figure 6: Decentralized rotation of secrets

The advantages of decentralizing the lifecycle management of secrets are:

  • Templatization and customization: Developers can reuse rotation templates, but adapt the functions to meet their needs and use cases.
  • No cross-account access: Decentralized rotation of secrets happens all in one account and doesn’t require cross-account access.

The primary tradeoff of decentralizing rotation is that you will need to provide either centralized or federated access to logs for rotation functions in different accounts. By default, Lambda automatically captures logs for all function invocations and sends them to CloudWatch Logs. CloudWatch Logs offers a few different ways that you can centralize your logs, with the tradeoffs of each described in the documentation.

Centralized auditing and monitoring of secrets

Regardless of the model chosen for creation, storage, and rotation of secrets, centralize the compliance and auditing aspect when operating in a multi-account environment. You can use AWS Security Hub CSPM through its integration with AWS Organizations to centralize:

In this scenario, shown in Figure 7, centralized functions get visibility across the organization and individual teams can view their posture at an account level with no need to look at the state of the entire organization.

Use AWS CloudTrail organizational trails to send all API calls for Secrets Manager to a centralized delegated admin account.

Figure 7: Centralized monitoring and auditing
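Once an organizational trail delivers events to the central account, Secrets Manager activity can be isolated by filtering on the event source. A toy sketch with hand-written records (real records arrive as CloudTrail log files in S3, but use the same field names):

```python
# Hypothetical CloudTrail records delivered by an organizational trail
records = [
    {"eventSource": "secretsmanager.amazonaws.com",
     "eventName": "GetSecretValue",
     "recipientAccountId": "111122223333"},
    {"eventSource": "ec2.amazonaws.com",
     "eventName": "RunInstances",
     "recipientAccountId": "111122223333"},
]

# Keep only Secrets Manager API activity for auditing
secrets_events = [r for r in records
                  if r["eventSource"] == "secretsmanager.amazonaws.com"]
```

In practice you would run the same filter in Amazon Athena or CloudWatch Logs Insights across the aggregated trail rather than in application code.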

Decentralized auditing and monitoring of secrets

For organizations that don’t require centralized auditing and monitoring of secrets, you can configure access so that individual teams determine which logs are collected, which alerts are enabled, and which checks are in place for their secrets. The advantages of this approach are:

  • Flexibility: Development teams have the freedom to choose what monitoring, auditing, and logging tools work best for them.
  • Reduced dependencies: Development teams don’t have to rely on centralized functions for alerting and monitoring capabilities.

The tradeoffs of this approach are:

  • Operational overhead: This can create redundant work for teams looking to accomplish similar goals.
  • Difficulty aggregating logs in cross-account investigations: If logs, alerts, and monitoring capabilities are decentralized, it can increase the difficulty of investigating events that affect multiple accounts.

Putting it all together

Most organizations choose a combination of these approaches to meet their needs. An example is a financial services company that has a central security team, operates across hundreds of AWS accounts, and has hundreds of applications that are isolated at the account level. This customer could:

  • Centralize the creation process, enforcing organizational standards for naming, tagging, and access control
  • Decentralize storage of secrets, using the AWS account as a natural boundary for access and storing the secret in the account where the workload is operating, delegating control to application owners
  • Decentralize lifecycle management so that application owners can manage their own rotation functions
  • Centralize auditing, using tools like AWS Config, Security Hub, and IAM Access Analyzer to give the central security team insight into the posture of their secrets while letting application owners retain control

Conclusion

In this post, we’ve examined the architectural decisions organizations face when implementing secrets management on AWS: creation, storage, rotation, and monitoring. Each approach—whether centralized or decentralized—offers distinct advantages and tradeoffs that should align with your organization’s security requirements, operational model, and scale. The important points include:

  • Choose your secrets management architecture based on your organization’s specific requirements and capabilities. There’s no one solution that will fit every situation.
  • Use automation and IaC to enforce consistent security controls, regardless of your approach.
  • Implement comprehensive monitoring and auditing capabilities through AWS services to maintain visibility across your environment.

Resources

To learn more about AWS Secrets Manager, check out some of these resources:

Brendan Paul
Brendan is a Senior Security Solutions Architect at AWS and has been at AWS for more than 6 years. He spends most of his time at work helping customers solve problems in the data protection and workload identity domains. Outside of work, he’s pursuing his master’s degree in data science from UC Berkeley.
Eduardo Patroncinio
Eduardo is a distinguished Principal Solutions Architect on the AWS Strategic Accounts team, bringing unparalleled expertise to the forefront of cloud technology. With an impressive career spanning more than 25 years, Eduardo has been a driving force in designing and delivering innovative customer solutions within the dynamic realms of cloud and service management.

How to customize your response to layer 7 DDoS attacks using AWS WAF Anti-DDoS AMR

10 December 2025 at 05:41

Over the first half of this year, AWS WAF introduced new application-layer protections to address the growing trend of short-lived, high-throughput Layer 7 (L7) distributed denial of service (DDoS) attacks. These protections are provided through the AWS WAF Anti-DDoS AWS Managed Rules (Anti-DDoS AMR) rule group. While the default configuration is effective for most workloads, you might want to tailor the response to match your application’s risk tolerance.

In this post, you’ll learn how the Anti-DDoS AMR works, and how you can customize its behavior using labels and additional AWS WAF rules. You’ll walk through three practical scenarios, each demonstrating a different customization technique.

How the Anti-DDoS AMR works at a high level

The Anti-DDoS AMR establishes a baseline of your traffic and uses it to detect anomalies within seconds. As shown in Figure 1, when the Anti-DDoS AMR detects a DDoS attack, it adds the event-detected label to all incoming requests, and the ddos-request label, along with a confidence-based label such as high-suspicion-ddos-request, to requests that are suspected of contributing to the attack. In AWS WAF, a label is metadata added to a request by a rule when the rule matches the request. After being added, a label is available to subsequent rules, which can use it to enrich their evaluation logic. The Anti-DDoS AMR uses the added labels to mitigate the DDoS attack.

Figure 1 – Anti-DDoS AMR process flow

Default mitigations are based on a combination of Block and JavaScript Challenge actions. The Challenge action can only be handled properly by a client that’s expecting HTML content. For this reason, you need to exclude the paths of non-challengeable requests (such as API fetches) in the Anti-DDoS AMR configuration. The Anti-DDoS AMR applies the challengeable-request label to requests that don’t match the configured challenge exclusions. By default, the following mitigation rules are evaluated in order:

  • ChallengeAllDuringEvent, which is equivalent to the following logic: IF event-detected AND challengeable-request THEN challenge.
  • ChallengeDDoSRequests, which is equivalent to the following logic: IF (high-suspicion-ddos-request OR medium-suspicion-ddos-request OR low-suspicion-ddos-request) AND challengeable-request THEN challenge. Its sensitivity can be changed to match your needs, such as challenging only medium- and high-suspicion DDoS requests.
  • DDoSRequests, which is equivalent to the following logic: IF high-suspicion-ddos-request THEN block. Its sensitivity can be changed to match your needs, such as blocking medium-suspicion in addition to high-suspicion DDoS requests.
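The evaluation order above can be illustrated with a small simulation. This is a simplified model of the logic described in this section, not the actual rule engine:

```python
def default_mitigation(labels: set) -> str:
    """Approximate the default Anti-DDoS AMR rule order described above,
    given the labels already applied to a request (confidence labels
    shortened to their suffixes for readability)."""
    suspicion = {"high-suspicion-ddos-request",
                 "medium-suspicion-ddos-request",
                 "low-suspicion-ddos-request"}
    if "event-detected" in labels and "challengeable-request" in labels:
        return "challenge"  # ChallengeAllDuringEvent
    if labels & suspicion and "challengeable-request" in labels:
        return "challenge"  # ChallengeDDoSRequests
    if "high-suspicion-ddos-request" in labels:
        return "block"      # DDoSRequests
    return "allow"

# A non-challengeable request (such as an API fetch) still gets blocked
# when it carries the high-suspicion label.
action = default_mitigation({"event-detected", "high-suspicion-ddos-request"})
```

Tracing requests through this model makes it easier to predict what your customizations will change before you touch the real web ACL.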

Customizing your response to layer 7 DDoS attacks

This customization can be done using two different approaches. In the first approach, you configure the Anti-DDoS AMR to take the action you want, then you add subsequent rules to further harden your response under certain conditions. In the second approach, you change some or all the rules of the Anti-DDoS AMR to count mode, then create additional rules that define your response to DDoS attacks.

In both approaches, the subsequent rules are configured using conditions you define, combined with conditions based on labels applied to requests by the Anti-DDoS AMR. The following section includes three examples of customizing your response to DDoS attacks. The first two examples are based on the first approach, while the last one is based on the second approach.

Example 1: More sensitive mitigation outside of core countries

Let’s suppose that your main business is conducted in two main countries, the UAE and KSA. You are happy with the default behavior of the Anti-DDoS AMR in these countries, but you want to block more aggressively outside of these countries. You can implement this using the following rules:

  • Anti-DDoS AMR with default configurations
  • A custom rule that blocks if the following conditions are met: Request is initiated from outside of UAE or KSA AND request has high-suspicion-ddos-request or medium-suspicion-ddos-request labels

Configuration

After adding your Anti-DDoS AMR with default configuration, create a subsequent custom rule with the following JSON definition.

Note: You need to use the AWS WAF JSON rule editor or infrastructure-as-code (IaC) tools (such as AWS CloudFormation or Terraform) to define this rule. The current AWS WAF console doesn’t allow creating rules with multiple AND/OR logic nesting.

{
    "Action": {
        "Block": {}
    },
    "Name": "more-sensitive-ddos-mitigation-outside-of-core-countries",
    "Priority": 1,
    "Statement": {
        "AndStatement": {
            "Statements": [
                {
                    "NotStatement": {
                        "Statement": {
                            "GeoMatchStatement": {
                                "CountryCodes": [
                                    "AE",
                                    "SA"
                                ]
                            }
                        }
                    }
                },
                {
                    "OrStatement": {
                        "Statements": [
                            {
                                "LabelMatchStatement": {
                                    "Key": "awswaf:managed:aws:anti-ddos:medium-suspicion-ddos-request",
                                    "Scope": "LABEL"
                                }
                            },
                            {
                                "LabelMatchStatement": {
                                    "Key": "awswaf:managed:aws:anti-ddos:high-suspicion-ddos-request",
                                    "Scope": "LABEL"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    },
    "VisibilityConfig": {
        "CloudWatchMetricsEnabled": true,
        "MetricName": "more-sensitive-ddos-mitigation-outside-of-core-countries",
        "SampledRequestsEnabled": true
    }
}

Similarly, during an attack, you can more aggressively mitigate requests from unusual sources, such as requests labeled by the Anonymous IP managed rule group as coming from web hosting and cloud providers.

Example 2: Lower rate-limiting thresholds during DDoS attacks

Suppose that your application has sensitive URLs that are compute heavy. To protect the availability of your application, you have applied a rate-limiting rule to these URLs, configured with a threshold of 100 requests over a 2-minute window. You can harden this response during a DDoS attack by applying a more aggressive threshold. You can implement this using the following rules:

  1. An Anti-DDoS AMR with default configurations
  2. A rate-limiting rule, scoped to sensitive URLs, configured with a threshold of 100 requests over a 2-minute window
  3. A rate-limiting rule, scoped to sensitive URLs and to the event-detected label, configured with a threshold of 10 requests over a 10-minute window

Configuration

After adding your Anti-DDoS AMR with default configuration, and your rate-limit rule for sensitive URLs, create a subsequent new rate limiting rule with the following JSON definition.

{
    "Action": {
        "Block": {}
    },
    "Name": "ip-rate-limit-10-10mins-under-ddos",
    "Priority": 2,
    "Statement": {
        "RateBasedStatement": {
            "AggregateKeyType": "IP",
            "EvaluationWindowSec": 600,
            "Limit": 10,
            "ScopeDownStatement": {
                "AndStatement": {
                    "Statements": [
                        {
                            "ByteMatchStatement": {
                                "FieldToMatch": {
                                    "UriPath": {}
                                },
                                "PositionalConstraint": "EXACTLY",
                                "SearchString": "/sensitive-url",
                                "TextTransformations": [
                                    {
                                        "Priority": 0,
                                        "Type": "LOWERCASE"
                                    }
                                ]
                            }
                        },
                        {
                            "LabelMatchStatement": {
                                "Key": "awswaf:managed:aws:anti-ddos:event-detected",
                                "Scope": "LABEL"
                            }
                        }
                    ]
                }
            }
        }
    },
    "VisibilityConfig": {
        "CloudWatchMetricsEnabled": true,
        "MetricName": "ip-rate-limit-10-10mins-under-ddos",
        "SampledRequestsEnabled": true
    }
}

Example 3: Adaptive response according to your application scalability

Suppose that you are operating a legacy application that can safely scale to a certain threshold of traffic volume, after which it degrades. If the total traffic volume, including the DDoS traffic, is below this threshold, you decide not to challenge all requests during a DDoS attack, to avoid impacting user experience. In this scenario, you rely only on the default block action for high-suspicion DDoS requests. If the total traffic volume exceeds the threshold your application can safely process, you use the equivalent of the Anti-DDoS AMR’s default ChallengeDDoSRequests mitigation. You can implement this using the following rules:

  1. An Anti-DDoS AMR with ChallengeAllDuringEvent and ChallengeDDoSRequests rules configured in count mode.
  2. A rate limiting rule that counts your traffic and is configured with a threshold corresponding to your application’s capacity to normally process traffic. As its action, it only counts requests and applies a custom label—for example, CapacityExceeded—when its threshold is met.
  3. A rule that mimics ChallengeDDoSRequests but only when the CapacityExceeded label is present: Challenge if ddos-request, CapacityExceeded, and challengeable-request labels are present

Configuration

First, update your Anti-DDoS AMR by changing Challenge actions to Count actions.

Figure 2 – Updated Anti-DDoS AMR rules in example 3

Then create the rate limit capacity-exceeded-detection rule in count mode, using the following JSON definition:

{
    "Action": {
        "Count": {}
    },
    "Name": "capacity-exceeded-detection",
    "Priority": 2,
    "RuleLabels": [
        {
            "Name": "mycompany:capacityexceeded"
        }
    ],
    "Statement": {
        "RateBasedStatement": {
            "AggregateKeyType": "CONSTANT",
            "EvaluationWindowSec": 120,
            "Limit": 10000,
            "ScopeDownStatement": {
                "NotStatement": {
                    "Statement": {
                        "LabelMatchStatement": {
                            "Scope": "LABEL",
                            "Key": "non-existing-label-to-count-all-requests"
                        }
                    }
                }
            }
        }
    },
    "VisibilityConfig": {
        "CloudWatchMetricsEnabled": true,
        "MetricName": "capacity-exceeded-detection",
        "SampledRequestsEnabled": true
    }
}

Finally, create the challenge-if-ddos-and-capacity-exceeded challenge rule using the following JSON definition:

{
    "Action": {
        "Challenge": {}
    },
    "Name": "challenge-if-ddos-and-capacity-exceeded",
    "Priority": 3,
    "Statement": {
        "AndStatement": {
            "Statements": [
                {
                    "LabelMatchStatement": {
                        "Key": "mycompany:capacityexceeded",
                        "Scope": "LABEL"
                    }
                },
                {
                    "LabelMatchStatement": {
                        "Key": "awswaf:managed:aws:anti-ddos:ddos-request",
                        "Scope": "LABEL"
                    }
                },
                {
                    "LabelMatchStatement": {
                        "Key": "awswaf:managed:aws:anti-ddos:challengeable-request",
                        "Scope": "LABEL"
                    }
                }
            ]
        }
    },
    "VisibilityConfig": {
        "CloudWatchMetricsEnabled": true,
        "MetricName": "challenge-if-ddos-and-capacity-exceeded",
        "SampledRequestsEnabled": true
    }
}

Conclusion

By combining the built-in protections of the Anti-DDoS AMR with custom logic, you can adapt your defenses to match your unique risk profile, traffic patterns, and application scalability. The examples in this post illustrate how you can fine-tune sensitivity, enforce stronger mitigations under specific conditions, and even build adaptive defenses that respond dynamically to your system’s capacity.

You can use the dynamic labeling system in AWS WAF to implement customization granularly. You can also use AWS WAF labels to exclude costly logging of DDoS attack traffic.

If you have feedback about this post, submit comments in the Comments section below.

Achraf Souk

Achraf is a Principal Solutions Architect at AWS with more than 15 years of experience in cloud, security, and networking. He works closely with customers across industries to design resilient, fast, and secure web applications. A frequent writer and speaker, he enjoys simplifying deeply technical topics for a wider audience. Achraf has a track record in building and scaling technical organizations.

How to use the Secrets Store CSI Driver provider Amazon EKS add-on with Secrets Manager

26 November 2025 at 19:54

In this post, we introduce the AWS provider for the Secrets Store CSI Driver, a new AWS Secrets Manager add-on for Amazon Elastic Kubernetes Service (Amazon EKS) that you can use to fetch secrets from Secrets Manager and parameters from AWS Systems Manager Parameter Store and mount them as files in Kubernetes pods. The add-on is straightforward to install and configure, works on Amazon Elastic Compute Cloud (Amazon EC2) instances and hybrid nodes, and includes the latest security updates and bugfixes. It provides a secure and reliable way to retrieve your secrets in Kubernetes workloads.

The AWS provider for the Secrets Store CSI Driver is an open source Kubernetes DaemonSet.

Amazon EKS add-ons provide installation and management of a curated set of add-ons for EKS clusters. You can use these add-ons to help ensure that your EKS clusters are secure and stable and reduce the number of steps required to install, configure, and update add-ons.

Secrets Manager helps you manage, retrieve, and rotate database credentials, application credentials, OAuth tokens, API keys, and other secrets throughout their lifecycles. By using Secrets Manager to store credentials, you can avoid using hard-coded credentials in application source code, helping to avoid unintended or inadvertent access.

New EKS add-on: AWS provider for the Secrets Store CSI Driver

We recommend installing the provider as an Amazon EKS add-on instead of the legacy installation methods (Helm, kubectl) to reduce the amount of time it takes to install and configure the provider. The add-on can be installed in several ways: using eksctl—which you will use in this post—the AWS Management Console, the Amazon EKS API, AWS CloudFormation, or the AWS Command Line Interface (AWS CLI).

Security considerations

The open source Secrets Store CSI Driver maintained by the Kubernetes community enables mounting secrets as files in Kubernetes clusters. The AWS provider relies on the CSI driver and mounts secrets as files in your EKS clusters. Security best practice recommends caching secrets in memory where possible. If you prefer the native Kubernetes experience, follow the steps in this post; if you prefer to cache secrets in memory, we recommend using the AWS Secrets Manager Agent.

IAM principals require Secrets Manager permissions to get and describe secrets. If using Systems Manager Parameter Store, principals also require Parameter Store permissions to get parameters. Resource policies on secrets serve as another access control mechanism, and AWS principals must be explicitly granted permissions to access individual secrets if they’re accessing secrets from a different AWS account (see Access AWS Secrets Manager secrets from a different account). The Amazon EKS add-on provides security features including support for using FIPS endpoints. AWS provides a managed IAM policy, AWSSecretsManagerClientReadOnlyAccess, which we recommend using with the EKS add-on.

Solution walkthrough

In the following sections, you’ll create an EKS cluster, create a test secret in Secrets Manager, install the Amazon EKS add-on, and use it to retrieve the test secret and mount it as a file in your cluster.

Prerequisites

  1. AWS credentials configured in your environment to allow AWS API calls, including access to Secrets Manager
  2. AWS CLI v2 or higher
  3. Your preferred AWS Region configured in your environment. Use the following command to set it:
    aws configure set default.region <preferred_region>
    
  4. The kubectl and eksctl command-line tools
  5. A Kubernetes deployment file hosted in the GitHub repo for the provider

With the prerequisites in place, you’re ready to run the commands in the following steps in your terminal:

Create an EKS cluster

  1. Create a shell variable in your terminal with the name of your cluster:
    CLUSTER_NAME="my-test-cluster"
    
  2. Create an EKS cluster:
    eksctl create cluster --name $CLUSTER_NAME
    

eksctl will automatically use a recent version of Kubernetes and create the resources needed for the cluster to function. This command typically takes about 15 minutes to finish setting up the cluster.

Create a test secret

Create a secret named addon_secret in Secrets Manager:

aws secretsmanager create-secret \
  --name addon_secret \
  --secret-string "super secret!"

Set up the Secrets Store CSI Driver provider EKS add-on

Install the Amazon EKS add-on:

eksctl create addon \
  --cluster $CLUSTER_NAME \
  --name aws-secrets-store-csi-driver-provider

Create an IAM role

Create an AWS Identity and Access Management (IAM) role that the EKS Pod Identity service principal can assume, and save its ARN in a shell variable (replace <region> with the AWS Region configured in your environment):

ROLE_ARN=$(aws iam create-role \
  --region <region> \
  --role-name nginx-deployment-role \
  --query Role.Arn \
  --output text \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "pods.eks.amazonaws.com"
            },
            "Action": [
                "sts:AssumeRole",
                "sts:TagSession"
            ]
        }
    ]
}')

Attach a managed policy to the IAM role

Note: AWS provides a managed policy for client-side consumption of secrets through Secrets Manager: AWSSecretsManagerClientReadOnlyAccess. This policy grants access to get and describe secrets for the secrets in your account. If you want to further follow the principle of least privilege, create a custom policy scoped down to only the secrets you want to retrieve.
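For example, a scoped-down policy for the test secret in this walkthrough might look like the following sketch. The Region, account ID, and the wildcard over the random suffix that Secrets Manager appends to secret ARNs are placeholders you would replace with your own values:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret"
            ],
            "Resource": "arn:aws:secretsmanager:<region>:<account-id>:secret:addon_secret-*"
        }
    ]
}
```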

Attach the managed policy to the IAM role that you just created:

aws iam attach-role-policy \
  --role-name nginx-deployment-role \
  --policy-arn arn:aws:iam::aws:policy/AWSSecretsManagerClientReadOnlyAccess

Set up the EKS Pod Identity Agent

Note: The add-on provides two methods of authentication: IAM roles for service accounts (IRSA) and EKS Pod Identity. In this solution, you’ll use EKS Pod Identity.

  1. After you’ve installed the add-on in your cluster, install the EKS Pod Identity Agent add-on for authentication:
    eksctl create addon \
      --cluster $CLUSTER_NAME \
      --name eks-pod-identity-agent
    
  2. Create an EKS Pod Identity association for the cluster:
    eksctl create podidentityassociation \
        --cluster $CLUSTER_NAME \
        --namespace default \
        --region <region> \
        --service-account-name nginx-pod-identity-deployment-sa \
        --role-arn $ROLE_ARN \
        --create-service-account true
    

Set up your SecretProviderClass

The SecretProviderClass is a YAML file that defines which secrets and parameters to mount as files in your cluster.

  1. Create a minimal SecretProviderClass called spc.yaml for the test secret with the following content:
    apiVersion: secrets-store.csi.x-k8s.io/v1
    kind: SecretProviderClass
    metadata:
      name: nginx-pod-identity-deployment-aws-secrets
    spec:
      provider: aws
      parameters:
        objects: |
          - objectName: "addon_secret"
            objectType: "secretsmanager"
        usePodIdentity: "true"
    
  2. Deploy your SecretProviderClass (make sure you’re in the same directory as the spc.yaml you just created):
    kubectl apply -f spc.yaml
    

To learn more about the SecretProviderClass, see the GitHub readme for the provider.
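As a sketch of a more complete configuration, a single SecretProviderClass can mount both Secrets Manager secrets and Parameter Store parameters. In the following example, the parameter name my_parameter is hypothetical; the addon_secret entry matches the test secret created earlier:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: example-multi-source-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "addon_secret"
        objectType: "secretsmanager"
      - objectName: "my_parameter"
        objectType: "ssmparameter"
    usePodIdentity: "true"
```

Each object is mounted as its own file under the volume's mount path, so a pod referencing this class would see both addon_secret and my_parameter as files.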

Deploy your pod to your EKS cluster

For brevity, we’ve omitted the content of the Kubernetes deployment file here. Instead, use the example deployment file for Pod Identity from the GitHub repository for the provider to deploy your pod:

kubectl apply -f https://raw.githubusercontent.com/aws/secrets-store-csi-driver-provider-aws/main/examples/ExampleDeployment-PodIdentity.yaml

This will mount addon_secret at /mnt/secrets-store in your cluster.

Retrieve your secret

  1. Print the value of addon_secret to confirm that the secret was mounted successfully:
    kubectl exec -it $(kubectl get pods | awk '/nginx-pod-identity-deployment/{print $1}' | head -1) -- cat /mnt/secrets-store/addon_secret
    
  2. You should see the following output:
    super secret!
    

You’ve successfully fetched your test secret from Secrets Manager using the new Amazon EKS add-on and mounted it as a file in your Kubernetes cluster.

Clean up

Run the following commands to clean up the resources that you created in this tutorial:

aws secretsmanager delete-secret \
  --secret-id addon_secret \
  --force-delete-without-recovery

aws iam delete-role --role-name nginx-deployment-role

eksctl delete cluster $CLUSTER_NAME

Conclusion

In this post, you learned how to use the new Amazon EKS add-on for the AWS Secrets Store CSI Driver provider to securely retrieve your secrets and parameters and mount them as files in your Kubernetes clusters. The new EKS add-on provides benefits such as the latest security patches and bug fixes, tighter integration with Amazon EKS, and reduced time to install and configure the AWS Secrets Store CSI Driver provider. The add-on is validated by EKS to work with EC2 instances and hybrid nodes.

Further reading

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Angad Misra

Angad is a Software Engineer on the AWS Secrets Manager team. When he isn’t building secure, reliable, and scalable software from first principles, he enjoys a good latte, live music, playing guitar, exploring the great outdoors, cooking, and lazing around with his cat, Freyja.

Introducing guidelines for network scanning

25 November 2025 at 19:11

Amazon Web Services (AWS) is introducing guidelines for network scanning of customer workloads. By following these guidelines, conforming scanners will collect more accurate data, minimize abuse reports, and help improve the security of the internet for everyone.

Network scanning is a practice in modern IT environments that can be used for either legitimate security needs or abused for malicious activity. On the legitimate side, organizations conduct network scans to maintain accurate inventories of their assets, verify security configurations, and identify potential vulnerabilities or outdated software versions that require attention. Security teams, system administrators, and authorized third-party security researchers use scanning in their standard toolkit for collecting security posture data. However, scanning is also performed by threat actors attempting to enumerate systems, discover weaknesses, or gather intelligence for attacks. Distinguishing between legitimate scanning activity and potentially harmful reconnaissance is a constant challenge for security operations.

When software vulnerabilities are found through scanning a given system, it’s particularly important that the scanner is well-intentioned. If a software vulnerability is discovered and attacked by a threat actor, it could allow unauthorized access to an organization’s IT systems. Organizations must effectively manage their software vulnerabilities to protect themselves from ransomware, data theft, operational issues, and regulatory penalties. At the same time, the scale of known vulnerabilities is growing rapidly, at a rate of 21% per year for the past 10 years as reported in the NIST National Vulnerability Database.

With these factors at play, network scanners need to scan and manage the collected security data with care. There are a variety of parties interested in security data, and each group uses the data differently. If security data is discovered and abused by threat actors, then system compromises, ransomware, and denial of service can create disruption and costs for system owners. With the exponential growth of data centers and connected software workloads providing critical services across energy, manufacturing, healthcare, government, education, finance, and transportation sectors, the impact of security data in the wrong hands can have significant real-world consequences.

Multiple parties

Multiple parties have vested interests in security data, including at least the following groups:

  • Organizations want to understand their asset inventories and patch vulnerabilities quickly to protect their assets.
  • Program auditors want evidence that organizations have robust controls in place to manage their infrastructure.
  • Cyber insurance providers want risk evaluations of organizational security posture.
  • Investors performing due diligence want to understand the cyber risk profile of an organization.
  • Security researchers want to identify risks and notify organizations to take action.
  • Threat actors want to exploit unpatched vulnerabilities and weaknesses for unauthorized access.

The sensitive nature of security data creates a complex ecosystem of competing interests, where an organization must maintain different levels of data access for different parties.

Motivation for the guidelines

We’ve described both the legitimate and malicious uses of network scanning, and the different parties that have an interest in the resulting data. We’re introducing these guidelines because we need to protect our networks and our customers; and telling the difference between these parties is challenging. There’s no single standard for the identification of network scanners on the internet. As such, system owners and defenders often don’t know who is scanning their systems. Each system owner is independently responsible for managing identification of these different parties. Network scanners might use unique methods to identify themselves, such as reverse DNS, custom user agents, or dedicated network ranges. In the case of malicious actors, they might attempt to evade identification altogether. This degree of identity variance makes it difficult for system owners to know the motivation of parties performing network scanning.

To address this challenge, we’re introducing behavioral guidelines for network scanning. AWS seeks to provide network security for every customer; our goal is to screen out abusive scanning that doesn’t meet these guidelines. Parties that perform broad network scanning can follow these guidelines to receive more reliable data from AWS IP space, and organizations running on AWS gain a higher degree of assurance in their risk management.

When network scanning is managed according to these guidelines, it helps system owners strengthen their defenses and improve visibility across their digital ecosystem. For example, Amazon Inspector can detect software vulnerabilities and prioritize remediation efforts while conforming to these guidelines. Similarly, partners in AWS Marketplace use these guidelines to collect internet-wide signals and help organizations understand and manage cyber risk.

“When organizations have clear, data-driven visibility into their own security posture and that of their third parties, they can make faster, smarter decisions to reduce cyber risk across the ecosystem.” – Dave Casion, CTO, Bitsight

Of course, security works better together, so AWS customers can report abusive scanning to our Trust & Safety Center as type Network Activity > Port Scanning and Intrusion Attempts. Each report helps improve the collective protection against malicious use of security data.

The guidelines

To help ensure that legitimate network scanners can clearly differentiate themselves from threat actors, AWS offers the following guidance for scanning customer workloads. This guidance on network scanning complements the policies on penetration testing and vulnerability reporting. AWS reserves the right to limit or block traffic that appears non-compliant with these guidelines. A conforming scanner adheres to the following practices:

Observational

  • Perform no actions that attempt to create, modify, or delete resources or data on discovered endpoints.
  • Respect the integrity of targeted systems. Scans cause no degradation to system function and cause no change in system configuration.
  • Examples of non-mutating scanning include:
    • Initiating and completing a TCP handshake
    • Retrieving the banner from an SSH service

Identifiable

  • Provide transparency by publishing sources of scanning activity.
  • Implement a verifiable process for confirming the authenticity of scanning activities.
  • Examples of identifiable scanning include:
    • Supporting reverse DNS lookups to one of your organization’s public DNS zones for scanning IPs.
    • Publishing scanning IP ranges, organized by types of requests (such as service existence or vulnerability checks).
    • For HTTP scanning, including meaningful content in user agent strings (such as names from your public DNS zones or a URL for opting out).
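For instance, a scanner’s HTTP requests might carry a user agent along these lines; the product name, hostname, and opt-out URL are placeholders:

```
User-Agent: ExampleScanner/1.0 (+https://scanner.example.com/about; opt-out: https://scanner.example.com/opt-out)
```

A defender inspecting logs can then resolve the advertised hostname and compare it against the scanner’s published IP ranges to verify authenticity.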

Cooperative

  • Limit scan rates to minimize impact on target systems.
  • Provide an opt-out mechanism for verified resource owners to request cessation of scanning activity.
  • Honor opt-out requests within a reasonable response period.
  • Examples of cooperative scanning include:
    • Limit scanning to one service transaction per second per destination service.
    • Respect site settings as expressed in robots.txt, security.txt, and other such industry standards for expressing site owner intent.
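For example, a site owner could express in robots.txt that a particular scanner should not crawl the site; the scanner name below is a placeholder:

```
# robots.txt — sketch of site owner intent for a hypothetical scanner
User-agent: ExampleScanner
Disallow: /
```

A conforming scanner would check for this file before scanning HTTP endpoints and honor the Disallow directive as an opt-out signal.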

Confidential

  • Maintain secure infrastructure and data handling practices as reflected by industry-standard certifications such as SOC 2.
  • Ensure no unauthenticated or unauthorized access to collected scan data.
  • Implement user identification and verification processes.

See the full guidance on AWS.

What’s next?

As more network scanners follow this guidance, system owners will benefit from reduced risk to their confidentiality, integrity, and availability. Legitimate network scanners will send a clear signal of their intent and improve the quality of the visibility they provide. With the constantly changing state of networking, we expect this guidance to evolve along with technical controls over time. We look forward to input from customers, system owners, network scanners, and others to continue improving security posture across AWS and the internet.

If you have feedback about this post, submit comments in the Comments section below or contact AWS Support.

Stephen Goodman

As a senior manager for Amazon active defense, Stephen leads data-driven programs to protect AWS customers and the internet from threat actors.
