AgentOps: Operationalize agentic AI at scale with Amazon Bedrock AgentCore

When you build agentic AI solutions, you face unique operational challenges. Agents make unpredictable decisions, costs spiral unexpectedly, and debugging non-deterministic failures seems impossible. Agentic AI applications don’t just execute predetermined workflows. They reason, adapt, and make autonomous decisions, and DevOps practices need to be adapted. That’s where AgentOps comes in, the operational discipline for deploying, managing, and continuously improving AI agents in production.

The first part of our blog series introduced how to operationalize generative AI workloads. In this post, we show how to accelerate the path to production for agentic AI workloads, check the quality of your agents and tools, and drive agentic AI adoption in your organization by implementing AgentOps with Amazon Bedrock AgentCore. We discuss best practices from real-world implementations across four pillars: governance and security, build and operations, evaluation, and observability. We also show how AWS services, people, and processes come together into a reference architecture that you can adapt for your organization.

Note that this post focuses on operations and not agent design. The implementation examples use Amazon Bedrock AgentCore and supporting AWS services, but the principles discussed apply broadly. The reference architecture is a starting point: your organization’s requirements will determine how you adapt it.

This post covers best practices and real-world learnings for each of the four AgentOps pillars: governance and security, build and operations, evaluation, and observability.

Amazon Bedrock AgentCore offers components that you can use independently or together to implement these pillars. It is AWS’s Agentic AI platform for building, deploying, and operating effective agents securely at scale. AgentCore works with any open source framework and any large language model (LLM) and you can transition from local development to production without managing infrastructure.

Like other software solutions, agents follow a development lifecycle from idea to production, and that progression never truly ends. Agents require continuous operational attention and improvements across every stage. Below, we’ve mapped out how agentic AI impacts each stage of your DevOps pipeline: Plan, Develop, Build, Test, Deploy & Release, Maintain and Monitor.

The pillars apply irrespective of where you are in the lifecycle. From a responsible AI perspective, you need systematic risk management throughout. “The Agentic AI Security Scoping Matrix: A framework for securing autonomous AI systems” can help identify and manage risks.

The following reference architecture shows how the pillars, lifecycle, people, processes, and AWS services connect. Let’s go through it step-by-step.

Production deployment and operations

Now let’s go through each pillar in more detail.

In agentic systems, a single user request can spread across hierarchical chains or trigger collaborative swarms where multiple agents act on the user’s behalf. Each interaction between user and agent needs to be tightly controlled. When Agent A calls Agent B, it can be ambiguous which agent is authorized to perform which actions, and this ambiguity compounds in deeper call chains. If a user with limited permissions triggers an agent, the agent must inherit those restrictions. You need strict governance around who can access the agents, which data, tools, and APIs the agents can access, who can authorize these permissions, and what happens when issues arise.
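One way to reason about permission inheritance is as a set intersection along the call chain. The following is a minimal sketch, not an AgentCore API: the `Principal` class, the action names, and the helper functions are hypothetical and only illustrate that each hop can narrow, but never widen, what the original user is allowed to do.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """A user or agent and the actions it may perform (hypothetical model)."""
    name: str
    allowed_actions: frozenset


def effective_permissions(user, agent_chain):
    """Return the actions allowed for this user through this chain of agents."""
    allowed = user.allowed_actions
    for agent in agent_chain:
        # Each hop can only narrow the permission set, never widen it.
        allowed = allowed & agent.allowed_actions
    return allowed


def is_authorized(user, agent_chain, action):
    """True only if the user and every agent in the chain allow the action."""
    return action in effective_permissions(user, agent_chain)


# Example: a read-only user calls Agent A, which delegates to Agent B.
viewer = Principal("viewer", frozenset({"claims:read"}))
agent_a = Principal("agent-a", frozenset({"claims:read", "claims:write"}))
agent_b = Principal(
    "agent-b", frozenset({"claims:read", "claims:write", "payments:issue"})
)
```

In this example, `is_authorized(viewer, [agent_a, agent_b], "claims:write")` is false even though both agents hold that permission, because the user who triggered the chain does not.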

The following diagram shows the security decisions to be made at each step when an agent handles a request. A user’s input flows through an environment, into the agent, which uses tools and memory to generate outputs. The application verifies the user’s identity, whether they are allowed to invoke the agent, and whether the agent can access the requested context, memory, and tools with the specific parameters. It also validates that inputs are safe and that the agent is authorised to return the specific outputs.

To achieve a layered security approach that helps agents operate within well-defined boundaries while maintaining auditability, consider the following dimensions.

AgentOps is an extension of GenAIOps, the same way MLOps is an extension of DevOps. If you followed Part 1: GenAIOps, the same design principles apply to AgentOps. You should follow a multi-account strategy for organizational isolation and Service Control Policies (SCPs) to set security guardrails across accounts.

The following reference diagram shows the multi-account AWS architecture:

Accounts and resources are deployed and managed using Infrastructure as Code (IaC).

When using Amazon Bedrock, you control which models the applications have access to using SCPs and IAM identity-based policies. Your agents can use these models directly or via a generative AI gateway such as LiteLLM. With a gateway, you centralize access control and simplify governance implementation across multiple model providers while providing a unified API interface for rate limiting per user or agent, token budgeting, cost tracking and budget enforcement, model routing based on security policies, and centralized audit trails for compliance. AWS has published guidance on how to deploy a generative AI gateway. We initially placed the gateway in shared services for simplicity, but found it harder to attribute costs to individual agents and moved it to application accounts.
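To make per-agent token budgeting concrete, here is a minimal in-memory sketch of the check a gateway could run before forwarding a model call. It is an illustration under assumptions, not LiteLLM's or any gateway's actual interface: the `TokenBudget` class, its method names, and the agent IDs are hypothetical, and a real gateway would persist usage and reset it per billing period.

```python
from collections import defaultdict


class TokenBudget:
    """Tracks token usage per agent against a fixed limit (hypothetical)."""

    def __init__(self, limits, default_limit=0):
        self._limits = dict(limits)      # agent_id -> maximum tokens
        self._default = default_limit    # limit for agents not listed
        self._used = defaultdict(int)    # agent_id -> tokens consumed so far

    def remaining(self, agent_id):
        limit = self._limits.get(agent_id, self._default)
        return limit - self._used[agent_id]

    def try_consume(self, agent_id, tokens):
        """Record usage and return True, or return False if over budget."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if tokens > self.remaining(agent_id):
            return False  # the caller should reject or queue the request
        self._used[agent_id] += tokens
        return True


budget = TokenBudget({"claims-agent": 1000})
first_call_ok = budget.try_consume("claims-agent", 600)
second_call_ok = budget.try_consume("claims-agent", 500)  # would exceed 1000
```

Keeping usage keyed by agent is also what makes cost attribution straightforward, which is the reason given above for moving the gateway into application accounts.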

You can use AWS Identity and Access Management (IAM) for fine-grained access control. Additionally, with AgentCore Identity you manage authentication and authorization across your agents, with fine-grained access controls and cross-agent authentication protocols that maintain security boundaries as requests propagate through your system. For more information refer to Amazon Bedrock AgentCore Identity: Securing agentic AI at scale. AWS CloudTrail can be used for comprehensive audit logging and forensic analysis.

Data flows through multiple touchpoints: user inputs (text, attachments), agent instructions, outputs, accessed data sources, and memory operations, each presenting potential security risks. Configure Amazon Bedrock Guardrails to evaluate user prompts and model responses against your safety policies and to protect against threats like inadvertent PII disclosure. For detailed set-up instructions on implementing guardrails and integrating them with a generative AI gateway, refer to Safeguard generative AI applications with Amazon Bedrock Guardrails.

In addition to the above, keep evaluation datasets (with a few hundred examples) under version control and systematically track changes to documents and generated embeddings within RAG knowledge bases to support evaluation and auditing requirements.

In agentic applications, data represents underlying facts, documents, and structured information agents query (knowledge bases, databases, APIs) accessed through retrieval mechanisms like RAG, governed through traditional access controls. On the other hand, memory is the agent’s working context (what it retains about conversations, user preferences, and interaction patterns). It is dynamic and conversational, evolving with each interaction.

With AgentCore Memory you get short-term memory and long-term memory with built-in and custom strategies for memory extraction. You can also override extraction logic or implement self-managed strategies for specialised requirements. Namespaces, which are defined at creation time as part of the strategy configuration in long-term memory, organise memory by actor, session, or strategy. They provide the structure that helps personalisation and shared learning across users. AgentCore Memory scopes data to individual aggregates at actor level. When agents need to learn cross-user patterns, memory can aggregate at higher application-wide levels. In a multi-account deployment pattern, each account (dev, pre-prod, prod) has its own AgentCore Memory resources that teams deploy and manage alongside their applications. This deployment pattern helps with security isolation, independent scaling, alignment with data residency requirements, and cost allocation per application.
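The following sketch shows how namespaces can separate actor-level memory from application-wide memory. The template shapes, placeholder names, and helper functions are assumptions made for illustration. They are not the exact AgentCore Memory namespace syntax, which you should take from the service documentation.

```python
# Hypothetical namespace templates, ordered from narrowest to widest scope.
ACTOR_SCOPED = "/strategies/{strategy_id}/actors/{actor_id}"
SESSION_SCOPED = "/strategies/{strategy_id}/actors/{actor_id}/sessions/{session_id}"
APP_SCOPED = "/strategies/{strategy_id}/shared"


def resolve_namespace(template, **values):
    """Fill a namespace template, rejecting values that could escape the scope."""
    for key, value in values.items():
        if not value or "/" in value:
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    try:
        return template.format(**values)
    except KeyError as missing:
        raise ValueError(f"template needs a value for {missing}") from None


def is_within(scope, namespace):
    """True if `namespace` equals `scope` or sits beneath it."""
    scope_prefix = scope.rstrip("/") + "/"
    return (namespace.rstrip("/") + "/").startswith(scope_prefix)


alice = resolve_namespace(ACTOR_SCOPED, strategy_id="prefs", actor_id="alice")
bob = resolve_namespace(ACTOR_SCOPED, strategy_id="prefs", actor_id="bob")
alice_session = resolve_namespace(
    SESSION_SCOPED, strategy_id="prefs", actor_id="alice", session_id="s1"
)
shared = resolve_namespace(APP_SCOPED, strategy_id="prefs")
```

Comparing on a trailing slash avoids a subtle bug where a scope for actor `al` would otherwise match records for actor `alice`.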

Applications can access multiple memory resources. The following diagram illustrates this approach, showing how two applications, a fraud and a claims application, access risk signals and policy details from their dedicated resources, and user details from a shared memory resource. You can control which memory resources and information they have access to with IAM policies.

Agents call tools on behalf of users but not every user should trigger every tool with every parameter. You can use AgentCore Gateway to govern tools and transform APIs, Lambda functions, and services into MCP-compatible tools accessible through a single, secure endpoint. It works with AgentCore Identity to manage both inbound authentication (verifying agent identity) and outbound authentication (connecting to tools via OAuth, token refresh, credential storage) so that agents do not handle credentials directly. You define which runtimes, tools, and backends the agent can reach, enforced by IAM/resource policies and workload identities. Identity establishes the perimeter and answers “who you are” and “what infrastructure you can access”.

Policy in AgentCore intercepts tool requests routed via Gateway, evaluating requests against deterministic policies expressed in Cedar, AWS’s open-sourced policy language, before allowing tool access. It answers, “are you allowed to do this right now” (e.g. can you open a claim exceeding 1 million?). You get auditable enforcement across agent interactions, reducing the risk of policy bypass through agent manipulation.
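To illustrate what a deterministic check before tool access looks like, here is a small Python sketch of the claim-amount example. It is not Cedar and not the AgentCore Policy API. The rule functions, role names, and the `ToolRequest` shape are hypothetical. The evaluation order mirrors Cedar's semantics, where requests are denied by default and any matching forbid rule overrides a permit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolRequest:
    """A tool call intercepted before execution (hypothetical shape)."""
    principal_role: str
    tool: str
    params: dict = field(default_factory=dict)


def permit_adjusters_to_open_claims(request):
    return request.tool == "open_claim" and request.principal_role == "adjuster"


def forbid_claims_over_one_million(request):
    return request.tool == "open_claim" and request.params.get("amount", 0) > 1_000_000


def evaluate(request, permits, forbids):
    """Default deny; a matching forbid rule always wins over a permit rule."""
    if any(rule(request) for rule in forbids):
        return "DENY"
    if any(rule(request) for rule in permits):
        return "ALLOW"
    return "DENY"


PERMITS = [permit_adjusters_to_open_claims]
FORBIDS = [forbid_claims_over_one_million]

small_claim = ToolRequest("adjuster", "open_claim", {"amount": 500_000})
large_claim = ToolRequest("adjuster", "open_claim", {"amount": 2_000_000})
viewer_claim = ToolRequest("viewer", "open_claim", {"amount": 100})
```

Because the decision depends only on the request and the rules, not on the model's reasoning, a prompt that manipulates the agent cannot change the outcome.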

The following diagram shows how the above services work together. The same pattern applies to each one of the dev, pre-prod and prod accounts.

To see how these governance and identity patterns work at enterprise scale, see how Swisscom builds agentic AI for customer support and sales with Amazon Bedrock AgentCore.

With governance boundaries in place, you need consistent mechanisms to build, version, and deploy agents, tools, and memory configurations across environments.

Agents depend on infrastructure, resources, tools, and models that can change independently. You need operational discipline so that an update to a tool owned by another team does not break your agent, and a memory misconfiguration does not inadvertently disclose user context. Treat every component as a versioned, deployable artifact with its own repository:

With this separation, you have independent versioning, testing, and deployment while maintaining clear ownership and change tracking. The following diagram illustrates this approach:

Environments and CI/CD Pipelines for agentic applications

Developers working in the dev account clone the repositories with the seed code. As they develop their application and agents, they modify: (1) the agent code, including the shared modules, and the IaC templates that provision AgentCore resources for agents; (2) vector stores, using services such as Amazon OpenSearch Serverless, Amazon S3 Vectors for embeddings, or Amazon Aurora with pgvector; (3) data ingestion pipelines; (4) application code that connects to the agent; and (5) automated evaluation pipelines. Merging their changes triggers a CI/CD pipeline. This deploys IaC templates to the pre-prod account and packages application and agent code as container images pushed to Amazon Elastic Container Registry (ECR) in the shared services or pipeline account.

In pre-prod, automated tests run across seven dimensions: integration, performance, UAT, regression, security, agentic AI, and responsible AI. Agentic AI evaluation is the most complex of these. It includes authentication flows, user context propagation, authorization validation for tool access, and agent-specific quality metrics. For example, validating that a user’s identity and permissions propagate correctly across a multi-agent chain may require building custom test setups that simulate requests flowing across multiple agents and check the propagated identity at each step.

You create agents in your agent repository. You containerize your agent implementation, store the container image in ECR, and deploy it to AgentCore Runtime connecting to AgentCore Identity, AgentCore Memory resources, and account-level and shared AgentCore Gateway. When you are ready to merge the changes, the CI/CD pipeline packages the agent as a container image, pushes it to ECR, and deploys to AgentCore Runtime in pre-prod.

The pipeline also registers or updates the agent’s metadata as a structured record in AWS Agent Registry in AgentCore. With AWS Agent Registry, you have a centralized place to discover, share, and reuse agents, MCP servers, tools, and agent skills across your organization. It supports automatic metadata ingestion from MCP and A2A endpoints and tracks records through an approval workflow (draft → pending → approved) before they become discoverable organization-wide. You can invoke agents directly, via A2A, or as targets behind an MCP server. In pre-prod, the pipeline runs automated tests before promotion to production. The IaC templates in the infrastructure repository define the Runtime, Memory resources, and IAM roles for consistent infrastructure across environments.

Each AgentCore Runtime maintains immutable versions automatically. You can create endpoint aliases (like DEV, PREPROD, and PROD) that point to specific versions, to implement independent promotion, instant rollback, and version management within your deployment workflow.
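The relationship between immutable versions and movable aliases can be sketched in a few lines. This is an in-memory model for illustration only. The `AgentRuntime` class, its methods, and the image URIs are hypothetical and are not the AgentCore Runtime API.

```python
class AgentRuntime:
    """Immutable versions plus named aliases that point at them (hypothetical)."""

    def __init__(self):
        self._versions = []        # index i holds the image for version i + 1
        self._alias_history = {}   # alias -> list of versions it has pointed to

    def publish(self, image_uri):
        """Append a new immutable version and return its number."""
        self._versions.append(image_uri)
        return len(self._versions)

    def point(self, alias, version):
        """Promote: move an alias to an existing version."""
        if not 1 <= version <= len(self._versions):
            raise ValueError(f"unknown version {version}")
        self._alias_history.setdefault(alias, []).append(version)

    def current_version(self, alias):
        return self._alias_history[alias][-1]

    def resolve(self, alias):
        """Return the image that requests to this alias are served from."""
        return self._versions[self.current_version(alias) - 1]

    def rollback(self, alias):
        """Move an alias back to the version it pointed to previously."""
        history = self._alias_history[alias]
        if len(history) < 2:
            raise ValueError(f"{alias} has no earlier version to roll back to")
        history.pop()
        return history[-1]


runtime = AgentRuntime()
v1 = runtime.publish("example-registry/claims-agent:a1")
v2 = runtime.publish("example-registry/claims-agent:b2")
runtime.point("PROD", v1)
runtime.point("PREPROD", v2)
runtime.point("PROD", v2)                 # promote v2 to production
rolled_back_to = runtime.rollback("PROD")  # instant rollback to v1
```

Because versions are never modified, a rollback only moves a pointer. Nothing has to be rebuilt, and `PREPROD` keeps serving v2 independently.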

Tag every agent with owner, cost center, and use-case ID, and use AWS CloudTrail for audit trails.

Agents invoke tools directly (for built-in capabilities), via AgentCore Gateway endpoints in the same account, or via shared AgentCore Gateway endpoints in the shared services account, which point to approved org-wide tools. Each tool exposed via the AgentCore Gateway has its own lifecycle and CI/CD pipeline to get deployed to pre-prod.

To register your tool with AgentCore Gateway, define a manifest in your tool repository specifying the Gateway the tool belongs to (shared or application-specific), auth method, requested prefix, and compliance metadata. On merge, the CI/CD pipeline injects the endpoint from environment-specific config, validates the manifest and the tool’s prefix, and registers the tool as a target. For that, it calls CreateGatewayTarget and SynchronizeGatewayTargets, using templates from the infrastructure repository. This way, you can enforce consistent tool names and use IAM policies to restrict direct tool access so that tools are reachable only through the Gateway. Application teams control what gets registered, and the platform team controls where and how.
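The manifest validation step in the pipeline could look like the following sketch. The manifest field names, the allowed gateway values, and the kebab-case prefix rule are assumptions for illustration. Your platform team defines the real schema, and the actual registration call is left out.

```python
import re

# Hypothetical manifest schema agreed between platform and application teams.
REQUIRED_FIELDS = ("name", "gateway", "auth_method", "prefix", "compliance")
ALLOWED_GATEWAYS = {"shared", "application"}
PREFIX_PATTERN = re.compile(r"^[a-z][a-z0-9]*(?:-[a-z0-9]+)*$")


def validate_manifest(manifest, taken_prefixes):
    """Return a list of problems; an empty list means the manifest is valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in manifest]
    if errors:
        return errors  # further checks need every field to be present

    if manifest["gateway"] not in ALLOWED_GATEWAYS:
        errors.append(f"gateway must be one of {sorted(ALLOWED_GATEWAYS)}")

    prefix = manifest["prefix"]
    if not PREFIX_PATTERN.match(prefix):
        errors.append(f"prefix {prefix!r} must be lowercase kebab-case")
    elif prefix in taken_prefixes:
        errors.append(f"prefix {prefix!r} is already registered")
    return errors


manifest = {
    "name": "policy-lookup",
    "gateway": "shared",
    "auth_method": "oauth",
    "prefix": "claims-policy",
    "compliance": {"data_classification": "internal"},
}
problems = validate_manifest(manifest, taken_prefixes={"fraud-signals"})
```

Failing the pipeline on any returned problem keeps naming collisions and incomplete compliance metadata out of the Gateway before a target is ever created.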

Treat memory configuration like other deployable artifacts that are versioned, tested, and promoted through your CI/CD pipeline. Version control memory resources, TTL configurations, extraction strategies, and namespace structures and deploy them through your CI/CD pipeline for identical behaviour across environments with no manual configuration or drift. Apply automated testing to validate memory persistence, LTM extraction quality, namespace isolation, and cross-session retrieval before promotion to production.

To see how these build and operational patterns work at enterprise scale, see how Allianz designed AIOps at enterprise scale with Amazon Bedrock AgentCore.

Reliable pipelines get your agents to production. Structured, multi-level evaluation catches any issues.

Agents can fail in ways that are not immediately obvious. A wrong tool selection, a missed context, or a hallucinated response can be hard to detect. Structured evaluation across multiple levels (tool, conversation turn, session outcome, and system) helps prevent these failures from reaching production. The evaluation lifecycle steps are:

If you followed Part 1: GenAIOps, your evaluation foundation remains relevant, but agentic applications introduce additional requirements: you still need to evaluate LLMs but now also a chain of decisions, tool invocations, and memory retrievals that compound across a conversation.

In agentic workflows, evaluation occurs at four distinct levels per agent: tool, conversation turn, session, and system.

First, evaluate the tool itself. For deterministic tools like APIs, this can include unit tests to verify expected behavior, and performance metrics such as latency and timeouts. For LLM-backed tools like RAG, evaluate model performance metrics using human-in-the-loop or LLM-as-a-Judge. Example metrics include correctness, helpfulness, relevance, harmfulness, and style/tone. For data retrieved from knowledge bases, evaluate retrieval quality, chunk relevance, and freshness. Check how to build strong data foundations to be successful. For multimodal tools (audio-to-audio, image generation, video creation), evaluate modality-specific quality metrics (image fidelity, audio clarity, video coherence), cross-modal consistency (does generated content align with text instructions?), safety and content policy compliance, and generation latency.

Second, evaluate the agent’s use of the tool. Verify that the agent reasons and plans correctly, selects the appropriate tool for a task and extracts the relevant parameters accurately from user queries. Key metrics include tool selection accuracy, parameter extraction accuracy, and tool response latency and error rates.
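Tool selection accuracy and parameter extraction accuracy can be computed from a labeled set of test cases. The sketch below assumes a simple record shape with `expected_*` and `actual_*` keys. That shape and the sample data are hypothetical. In practice you would populate the actual values from your agent's traces.

```python
def tool_selection_accuracy(cases):
    """Fraction of cases where the agent picked the expected tool."""
    if not cases:
        raise ValueError("no evaluation cases")
    correct = sum(1 for case in cases if case["actual_tool"] == case["expected_tool"])
    return correct / len(cases)


def parameter_extraction_accuracy(cases):
    """Fraction of expected parameters extracted correctly.

    Only cases with the correct tool are scored, so that a wrong tool choice
    is not counted a second time as a parameter error.
    """
    matched = total = 0
    for case in cases:
        if case["actual_tool"] != case["expected_tool"]:
            continue
        for key, expected in case["expected_params"].items():
            total += 1
            if case["actual_params"].get(key) == expected:
                matched += 1
    return matched / total if total else 0.0


CASES = [
    {"expected_tool": "get_policy", "expected_params": {"policy_id": "P-1", "year": 2024},
     "actual_tool": "get_policy", "actual_params": {"policy_id": "P-1", "year": 2024}},
    {"expected_tool": "book_trip", "expected_params": {"city": "Paris", "days": 3},
     "actual_tool": "book_trip", "actual_params": {"city": "Paris", "days": 5}},
    {"expected_tool": "open_claim", "expected_params": {"amount": 100},
     "actual_tool": "get_policy", "actual_params": {"policy_id": "P-2"}},
]

selection = tool_selection_accuracy(CASES)         # 2 of 3 tools correct
extraction = parameter_extraction_accuracy(CASES)  # 3 of 4 parameters correct
```

Scoring the two metrics separately tells you whether a failure comes from planning (wrong tool) or from understanding the query (right tool, wrong arguments).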

At this level, you evaluate a single turn of conversation (one input-output pair) to identify specific problematic responses and quality issues. Example metrics include correctness (is the information factually accurate?), helpfulness (how useful is this specific response?), faithfulness (is the response grounded in the provided context?), response relevance (does it address the user’s query?), conciseness, coherence, instruction following, refusal, harmfulness, and stereotyping. Multi-agent systems add further metrics, such as agent orchestration accuracy (can the orchestrator correctly route requests to the appropriate agents and coordinate handoffs between them?), the quality of information exchange between agents, and agent collaboration on shared tasks.

This level examines whether the agent achieved the user’s goal across the full conversation. A correct individual response does not guarantee a successful outcome. Key metrics include task completion rate, goal accuracy, conversation efficiency, and memory consistency.

At this level, you evaluate production-readiness and operational performance factors. Example metrics include end-to-end latency, time-to-first-token, throughput, tool call error rates, loop detection, and cost per completed task. You may also have custom success metrics that reflect your use case or business requirements, for example domain-specific requirements or compliance with regulatory rules or branding guidelines.
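Two of these system-level metrics, loop detection and cost per completed task, are easy to compute from session records. The sketch below uses one simple definition of a loop, the same tool called with identical arguments more than a set number of times in a session. That definition, the threshold, and the function names are assumptions, and other definitions may suit your agents better.

```python
import json
from collections import Counter


def has_tool_loop(tool_calls, max_repeats=3):
    """Flag a session where an identical tool call repeats too often.

    tool_calls is a list of (tool_name, arguments_dict) in call order.
    """
    counts = Counter(
        # Serialize arguments with sorted keys so equal dicts compare equal.
        (name, json.dumps(args, sort_keys=True))
        for name, args in tool_calls
    )
    return any(count > max_repeats for count in counts.values())


def cost_per_completed_task(total_cost, completed_tasks):
    """Total spend divided by successful tasks, or None if none completed."""
    if completed_tasks == 0:
        return None
    return total_cost / completed_tasks


looping_session = [("search", {"q": "policy 42"})] * 4
healthy_session = [("search", {"q": "policy 42"}), ("search", {"q": "policy 43"})]
```

Dividing by completed tasks rather than by total requests means failed or abandoned sessions raise the metric, which is usually what you want to see.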

In addition to the four levels, agentic evaluation runs in two modes that serve different needs:

On-demand evaluation runs against specific spans, traces, or sessions during development and as a quality gate before every release. You provide reference inputs alongside session spans as the gold standard to compare results against. Targeted testing includes turn-by-turn debugging, component validation, and CI/CD integration. Pre-deployment testing includes stability validation, turn-level metrics, and component monitoring. This immediate feedback drives an iterative loop to refine models, prompts, tools, and logic.

Online evaluation continuously monitors live production traffic with configurable sampling rates, from low-volume sampling to full traffic coverage. It samples conversation quality, turn-level metrics, and component monitoring in production sessions, during A/B testing, and during full rollout. Continuous outputs feed into Amazon CloudWatch dashboards for ongoing monitoring.
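A configurable sampling rate can be implemented so that the decision is stable per session. The sketch below hashes the session ID into a bucket between 0 and 1 and compares it with the rate. This is one common technique, shown as an assumption. It is not a description of how AgentCore Evaluations samples traffic internally, and the function name is hypothetical.

```python
import hashlib


def should_evaluate(session_id, sampling_rate):
    """Decide deterministically whether a session is sampled for evaluation.

    The same session ID always gets the same answer, so every turn of a
    sampled conversation is evaluated together.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sampling_rate


sampled = sum(should_evaluate(f"session-{i}", 0.1) for i in range(10_000))
```

Raising the rate from 0.1 to 0.5 keeps every session that was already sampled and adds new ones, which makes before-and-after comparisons easier during a rollout.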

The following image shows this workflow:

In local development and the development account you run on-demand evaluation for rapid iteration. In pre-production, on-demand evaluation becomes a pipeline gate. The build does not promote to production until evaluation passes. In production, online evaluation takes over, continuously sampling live traffic and alerting you when quality drops. You should detect quality issues before your users do. When evaluation detects a quality drop, results feed directly into your human review queue or trigger an automated rollback through your CI/CD pipeline.
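The pre-production pipeline gate can be as simple as comparing evaluation scores with minimum thresholds and returning a non-zero exit code on failure. The metric names and threshold values below are hypothetical, and how you obtain the scores depends on your evaluation tooling.

```python
def failed_checks(scores, thresholds):
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for metric, minimum in sorted(thresholds.items()):
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: no score reported")
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} is below {minimum:.2f}")
    return failures


def gate_exit_code(scores, thresholds):
    """0 lets the pipeline promote the build; 1 blocks promotion."""
    return 1 if failed_checks(scores, thresholds) else 0


# Hypothetical thresholds agreed with the business owner of the agent.
THRESHOLDS = {"correctness": 0.90, "tool_selection_accuracy": 0.95}

passing = {"correctness": 0.93, "tool_selection_accuracy": 0.97}
regressed = {"correctness": 0.85, "tool_selection_accuracy": 0.97}
```

In a CI/CD job you would pass the result to `sys.exit`, so that a regression such as `regressed` stops the promotion step. Treating a missing score as a failure prevents a broken evaluation run from silently passing the gate.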

In AWS, with Amazon Bedrock Evaluations you get LLM-as-a-judge capabilities and access to a team of human workers for evaluating model performance and effectiveness of Amazon Bedrock models and knowledge bases. With AgentCore Evaluations you have online and on-demand evaluation for your agents, while Strands Evaluation gives you a framework for evaluating tools and Amazon Augmented AI (A2I) brings human review into the loop. Check Generative AI Atlas: Evaluating Agentic Framework Use Cases for additional information on how to evaluate agents.

For performance metrics, you can use Amazon CloudWatch to extract logs and metrics and developer tools such as pytest and JUnit to run unit tests on APIs.

Evaluation tells you whether your agent works at release time; observability tells you whether it keeps working, and why it stops.

Observability is where the AgentOps cycle completes. The telemetry it produces feeds back into governance decisions, informs the next evaluation cycle, and shapes how you build and deploy in the next iteration. You need visibility across four distinct layers for your production agents:

These are the types of data you should be capturing:

Execution tracing: every step, tool call, and LLM interaction

Check that your support teams, stakeholders, and domain experts have access to the relevant dashboards to be able to act on these metrics, for instance using IAM identity-based policies.

There are three layers for observability and monitoring: instrumentation (OpenTelemetry SDK, either embedded directly or via framework-native support), a collection and processing layer (AWS Distro for OpenTelemetry Collector, or ADOT), and an analysis backend (for example Amazon CloudWatch). Agent frameworks like the Strands SDK include built-in OpenTelemetry (OTEL) instrumentation. For Python-based agents, you can bootstrap auto-instrumentation using the opentelemetry-instrument command. Telemetry is exported via OpenTelemetry Protocol (OTLP) to an ADOT collector, which handles sampling, filtering, batching, and routing. In development you can export directly to a backend, but in production we recommend using the Collector as an intermediate layer.

When your architecture spans multiple agents on different frameworks, OpenTelemetry’s W3C Trace Context propagation passes a shared trace ID across every agent and service, giving you the complete execution path in one view. For requests that share a logical session but span separate traces, you can use OpenTelemetry Baggage to propagate session IDs across service boundaries. For the backend, we’ve seen two approaches: AgentCore Observability with its dashboards powered by Amazon CloudWatch, or third-party tools via OpenTelemetry.

Approach 1: Using AgentCore Observability in Amazon CloudWatch

With Amazon CloudWatch, you get two dashboards for agentic workloads.

The CloudWatch model invocation dashboard covers Bedrock model metrics including latency, token counts, throttles, and error counts, with additional filters for timing patterns, tool usage, and knowledge lookups.

The Bedrock AgentCore Observability dashboard gives you a comprehensive view of agent workflows (traces, cost, latency, tokens, and custom metadata) with IAM access controls, PII redaction, and trace summaries for troubleshooting. It is powered by CloudWatch Transaction Search which converts spans to semantic convention format and stores them as structured logs in the aws/spans log group, making every span searchable and analysable. CloudWatch Application Signals correlates generative AI application telemetry with underlying infrastructure metrics for unified end-to-end troubleshooting.

AgentCore Runtime automatically configures required log groups, IAM permissions, and OTEL environment variables, so applications only need to add the OpenTelemetry SDK as a dependency. AgentCore also emits service metrics to CloudWatch for its managed resources, including Memory, Gateway, built-in tools, Identity, and Policy. For example, you get real-time visibility into memory operations, including invocations, latency, system errors, user errors, throttles, and record numbers for events and memory. In multi-account deployments you build and manage centralized dashboards in the monitoring account, reconstructing the views that exist natively in individual accounts.

Approach 2: Using third-party observability tools via OpenTelemetry

Because AgentCore Runtime exports telemetry via standard OpenTelemetry protocols, it integrates with third-party observability solutions, such as LangFuse, that specialize in agent-centric telemetry. You can use such tools in two ways:

Self-managed third-party deployment: Deploy the observability tool in a shared AWS account or VPC, exposed via a secure TLS endpoint. Agents on AgentCore Runtime in other accounts export OTEL traces and metrics directly using OTLP over HTTPS. Connect accounts via Transit Gateway and secure traffic with credential rotation, API keys, and network access policies. Data governance, retention, and privacy controls remain under your direct management.

Third-party SaaS: Agents send OTEL data to a managed cloud endpoint (e.g., LangFuse Cloud, Arize Cloud). Authorisation uses vendor API keys, and traffic flows over the public internet or via VPC endpoints depending on the tool. This enables fast onboarding and operational scaling, but telemetry data leaves your AWS environment.

The telemetry from observability feeds back into agent design, operational improvement decisions, and evaluation refinements, closing the AgentOps loop.

Building production-grade agentic AI is hard. Agents make autonomous decisions, call external tools, and collaborate in ways that are difficult to anticipate and harder to debug. In this post, we have shared the practices we have seen work in production across the four pillars: governance and security, build and operations, evaluation, and observability.

We encourage you to start applying these practices in your projects and share your experiences. Start with Pillar 1 (governance and security) by implementing multi-account isolation, then progress to CI/CD for agents, add evaluation gates, and finish with observability. Check out the AgentCore documentation to get started.
