From PDFs to insights: Architecting an intelligent document processing pipeline with AWS generative AI services

Organizations process millions of documents daily, from insurance claims and invoices to legal contracts and medical records. While traditional optical character recognition (OCR) solutions extract text, they can’t understand context, relationships, or meaning embedded within complex documents. This limitation creates bottlenecks that require manual intervention, increasing processing time and costs while introducing potential errors.

Amazon Bedrock Data Automation (BDA) provides a unified API experience for extracting meaningful insights from multimodal content, including documents, images, videos, and audio files. Unlike traditional solutions that focus on text extraction, BDA understands document context, validates extracted data, and provides confidence scores for accuracy. BDA processes documents through a pipeline that automates complex tasks including document classification, extraction, normalization, and validation. When a document is submitted, BDA automatically splits it along logical boundaries, classifies each section into appropriate document types, and matches them to the correct processing blueprints. This intelligent routing removes the need for manual document sorting and orchestration of multiple AI models. The service supports a wide range of file formats, with support for up to 3,000 pages and 500 MB per API request, making it suitable for processing diverse document types at scale.

This post outlines the development of a cost-effective and scalable intelligent document processing pipeline on AWS, powered by Amazon Bedrock and its features. BDA is a managed service within Amazon Bedrock that automates the extraction of insights from documents. We demonstrate how BDA extracts and analyzes document content, while Strands Agents hosted on Amazon Bedrock AgentCore Runtime coordinate specialized processing tasks, and Amazon Bedrock Knowledge Bases enable contextual understanding across multiple documents. By combining these capabilities within a unified architecture, organizations can transform their document processing workflows with minimal development effort.

Our intelligent document processing pipeline combines generative AI with orchestrated workflows to automatically extract content, analyze visual elements such as plots, graphs, and charts, and derive insights from documents while maintaining context and relationships across multiple data sources. The solution processes documents through four integrated layers:

The input processing layer forms the foundation of this solution. This layer manages the initial reception and routing of incoming documents. Document upload triggers processing workflows when documents arrive in designated Amazon Simple Storage Service (Amazon S3) buckets, supporting formats that include native PDFs and scanned documents (in PDF).
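The upload trigger described above can be sketched as a small handler that reads an Amazon S3 event notification and starts one workflow execution per uploaded document. This is a minimal sketch rather than the solution's actual code: the function names, the `.pdf` suffix filter, and the shape of the execution input are assumptions for illustration. The Step Functions client is passed in so the logic can be exercised without AWS access; in a deployment it would typically be `boto3.client("stepfunctions")`.

```python
import json
from urllib.parse import unquote_plus

# Assumption: only PDF uploads start a workflow in this sketch.
SUPPORTED_SUFFIXES = (".pdf",)


def extract_documents(event):
    """Return (bucket, key) pairs for supported documents in an S3 event."""
    documents = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.lower().endswith(SUPPORTED_SUFFIXES):
            documents.append((bucket, key))
    return documents


def start_workflows(event, sfn_client, state_machine_arn):
    """Start one state machine execution per supported document."""
    execution_arns = []
    for bucket, key in extract_documents(event):
        response = sfn_client.start_execution(
            stateMachineArn=state_machine_arn,
            # Hypothetical input shape; the real workflow may expect more fields.
            input=json.dumps({"bucket": bucket, "key": key}),
        )
        execution_arns.append(response["executionArn"])
    return execution_arns
```

Starting one execution per document keeps failures isolated, so a single malformed upload does not block the rest of a batch.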

BDA serves as the core extraction engine in the input processing layer, handling document splitting, classification, and content extraction through a unified API. AWS Step Functions orchestrates the workflow to maximize the capabilities of BDA in the Extraction and Storage Layer, providing operational visibility and control throughout the process. Here’s the detailed orchestration flow:

This orchestration approach provides a highly scalable serverless pipeline for automated document analysis with appropriate branching logic and exception management throughout each processing stage.
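To make the branching logic and exception management concrete, the following sketch builds an Amazon States Language definition as a Python dictionary. It is an illustrative assumption, not the state machine shipped with this solution: the state names, the two Lambda function ARNs, and the normalized `$.status` values (`SUCCEEDED`, `FAILED`) are hypothetical. The pattern shown is a common one for asynchronous jobs: start the job, wait, poll the status, and branch on the result.

```python
import json


def build_definition(start_job_arn, check_status_arn, poll_seconds=30):
    """Build a start / wait / poll / branch state machine definition."""
    retry = [{
        "ErrorEquals": ["States.TaskFailed"],
        "IntervalSeconds": 5,
        "MaxAttempts": 3,
        "BackoffRate": 2.0,
    }]
    # Any error left after retries is routed to a single failure state.
    catch = [{"ErrorEquals": ["States.ALL"], "Next": "ExtractionFailed"}]
    return {
        "Comment": "Illustrative polling workflow for an asynchronous extraction job",
        "StartAt": "StartExtraction",
        "States": {
            "StartExtraction": {
                "Type": "Task", "Resource": start_job_arn,
                "Retry": retry, "Catch": catch, "Next": "WaitForJob",
            },
            "WaitForJob": {
                "Type": "Wait", "Seconds": poll_seconds, "Next": "CheckJobStatus",
            },
            "CheckJobStatus": {
                "Type": "Task", "Resource": check_status_arn,
                "Retry": retry, "Catch": catch, "Next": "JobFinished?",
            },
            "JobFinished?": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": "$.status", "StringEquals": "SUCCEEDED",
                     "Next": "ExtractionSucceeded"},
                    {"Variable": "$.status", "StringEquals": "FAILED",
                     "Next": "ExtractionFailed"},
                ],
                # Any other status means the job is still running.
                "Default": "WaitForJob",
            },
            "ExtractionSucceeded": {"Type": "Succeed"},
            "ExtractionFailed": {
                "Type": "Fail", "Error": "ExtractionFailed",
                "Cause": "The extraction job failed or raised an unhandled error",
            },
        },
    }


def transition_targets(definition):
    """Collect every state name referenced by StartAt, Next, Default, Choices, or Catch."""
    targets = {definition["StartAt"]}
    for state in definition["States"].values():
        for key in ("Next", "Default"):
            if key in state:
                targets.add(state[key])
        for rule in state.get("Choices", []) + state.get("Catch", []):
            targets.add(rule["Next"])
    return targets
```

The dictionary can be serialized with `json.dumps` and supplied as the definition when creating the state machine. The `transition_targets` helper is a quick sanity check that every transition points at a state that exists.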

This layer is central to this solution, where BDA serves as the core engine for transforming raw content into structured, actionable data. We provide more details in the following section.

Amazon Bedrock Data Automation serves as the primary processing engine, offering two flexible output options:

Visual analysis processing uses the capabilities of BDA to extract insights from plots, diagrams, charts, and visual elements that traditional optical character recognition (OCR) solutions can’t interpret. BDA provides image crops as part of the output when doing figure captioning, and it also generates detailed textual descriptions and structured data from these visual elements that are included in the downstream workflow. For example, when BDA processes a chart, it produces:

All document formats in downstream processing: Every supported document format (PDF, PNG, JPG, TIFF, DOC, DOCX) is processed through the unified API. The extracted content from BDA, including visual element descriptions, can then be manually configured for indexing and vectorization in Amazon Bedrock Knowledge Bases to enable semantic search across diverse document types. Note that BDA also has a built-in integration with Knowledge Bases where it can serve as a parser during document ingestion into a knowledge base, using BDA standard output (no blueprints required). This downstream workflow receives structured JSON outputs from BDA containing all extracted information, enabling consistent processing regardless of the original file format.
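As a rough illustration of the manual indexing path, the sketch below flattens a BDA-style JSON result into text records that a downstream step could vectorize. The field names used here (`elements`, `type`, `representation`, `page_indices`) are assumptions about the output shape made for this example, so verify them against the actual BDA output before relying on them.

```python
def to_index_records(bda_output, source_uri):
    """Turn extracted elements into text records with source metadata."""
    records = []
    for element in bda_output.get("elements", []):
        representation = element.get("representation") or {}
        text = (representation.get("text") or "").strip()
        if not text:
            # Skip elements with no textual representation.
            continue
        records.append({
            "text": text,
            "metadata": {
                "source": source_uri,
                "element_type": element.get("type", "UNKNOWN"),
                "pages": element.get("page_indices", []),
            },
        })
    return records
```

Keeping the element type and page numbers in the metadata lets a retrieved chunk be traced back to the chart, table, or paragraph it came from, regardless of the original file format.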

Data extraction from documents includes:

This layer is the cognitive engine of this solution. Amazon Bedrock Knowledge Bases must be configured to work with Amazon OpenSearch Serverless to transform raw content into actionable insights through semantic search and Retrieval Augmented Generation (RAG) capabilities. The following section provides more details.

Amazon Bedrock Knowledge Bases with Amazon OpenSearch Serverless enables semantic search and RAG workflows by:

Amazon Bedrock FMs analyze visual content including chart and graph interpretation, document layout understanding, and cross-modal relationship detection between text and visual components.
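For the semantic search part of this layer, a retrieval call against a knowledge base might look like the sketch below. It uses the `retrieve` operation of the `bedrock-agent-runtime` client; the helper name, the `min_score` filter, and the returned dictionary shape are assumptions for illustration. The client is injected so the function can be tested without AWS access; in practice it would be created with `boto3.client("bedrock-agent-runtime")`.

```python
def semantic_search(client, knowledge_base_id, query, top_k=5, min_score=0.0):
    """Query a knowledge base and return the matching passages with scores."""
    response = client.retrieve(
        knowledgeBaseId=knowledge_base_id,
        retrievalQuery={"text": query},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    hits = []
    for result in response.get("retrievalResults", []):
        score = result.get("score", 0.0)
        if score < min_score:
            # Hypothetical relevance cutoff; tune or remove for your corpus.
            continue
        location = result.get("location", {}).get("s3Location", {})
        hits.append({
            "text": result["content"]["text"],
            "score": score,
            "source": location.get("uri"),
        })
    return hits
```

The returned passages can then be placed into a foundation model prompt to produce a grounded answer, which is the retrieval half of a RAG workflow.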

This layer organizes the intelligence of this solution. Strands Agents on Amazon Bedrock AgentCore Runtime manage the overall processing workflow by routing requests to the appropriate specialized agents based on request type and coordinating cross-agent communication for complex document analysis.
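The routing idea in this layer can be illustrated with a small, framework-free sketch. It deliberately does not use the Strands Agents SDK or AgentCore APIs; the `Coordinator` class, the request types, and the three stand-in agents are hypothetical and exist only to show how a coordinator might dispatch a request to a specialized handler and fall back to a general one.

```python
from typing import Callable, Dict

# An "agent" here is just a function from a request payload to a response.
Agent = Callable[[str], str]


class Coordinator:
    """Dispatch requests to a specialized agent based on request type."""

    def __init__(self, fallback: Agent) -> None:
        self._agents: Dict[str, Agent] = {}
        self._fallback = fallback

    def register(self, request_type: str, agent: Agent) -> None:
        self._agents[request_type] = agent

    def handle(self, request_type: str, payload: str) -> str:
        # Unknown request types go to the fallback agent instead of failing.
        agent = self._agents.get(request_type, self._fallback)
        return agent(payload)


def financial_agent(payload: str) -> str:
    return f"financial analysis of {payload}"


def visual_agent(payload: str) -> str:
    return f"chart interpretation of {payload}"


def general_agent(payload: str) -> str:
    return f"general summary of {payload}"


coordinator = Coordinator(fallback=general_agent)
coordinator.register("financial", financial_agent)
coordinator.register("visual", visual_agent)
```

In a real deployment each handler would wrap an agent with its own tools and prompts, and the request type would usually be inferred from the query rather than supplied by the caller.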

Specialized task agents handle specific document processing functions:

The processing pipeline employs an event-driven approach to document processing, integrating multiple specialized layers into a cohesive workflow. It follows a logical progression where each step builds upon the previous one. This begins with document upload, triggering Amazon S3 events that initiate state machines, and proceeding through multi-modal processing that extracts meaning from diverse content types. The pipeline continues with agent coordination that directs processing based on document characteristics, followed by knowledge base indexing for intelligent retrieval. This methodical flow culminates in the generation and integration of insights with business systems, creating a comprehensive processing journey from raw documents to actionable intelligence.

AWS Step Functions orchestrates the document processing pipeline, handling document classification, multi-modal extraction, data validation, and knowledge base integration.

The user-facing layer provides intelligent query processing through natural language interaction with the processed document corpus, coordination agent supervision of specialized agents, and the smart distribution of queries to the right processing agents.

A commercial real estate investment firm receives over 200 property evaluation reports monthly. These reports contain:

The analyst accesses this solution and uploads the documents to it.

This implementation shows how our generative AI services can transform real estate investment analysis through document processing capabilities by doing the following:

Document classification: The system automatically identifies document types, extracts property metadata (including address and square footage), and routes different document sections to the appropriate processing agents.

Natural language queries: Investment professionals process information using natural language queries, such as “Show me properties with projected IRR above 12% and debt coverage ratios over 1.25” or “Compare NOI growth projections with actual market performance for similar assets.”
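Once metrics have been extracted into structured records, the first example query above reduces to a simple filter. The sketch below shows that final step only; the `PropertyMetrics` fields and the `screen` function are hypothetical names, and translating the natural language question into these thresholds is assumed to happen upstream in the agent layer.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class PropertyMetrics:
    address: str
    projected_irr: float  # expressed as a fraction, so 12% is 0.12
    dscr: float           # debt service coverage ratio


def screen(properties: Iterable[PropertyMetrics],
           min_irr: float, min_dscr: float) -> List[PropertyMetrics]:
    """Keep properties strictly above both thresholds."""
    return [
        p for p in properties
        if p.projected_irr > min_irr and p.dscr > min_dscr
    ]
```

For the query in the text, the call would be `screen(properties, min_irr=0.12, min_dscr=1.25)`.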

Processing time per property reduced from 3–4 hours to 15–20 minutes for initial screening. Automated extraction removes manual transcription errors while cross-document validation identifies inconsistencies. The firm can process significantly more opportunities and identify attractive investments that might otherwise be overlooked.

Scalability validation: This solution has been tested at scale, successfully processing over 50,000 PDF documents concurrently through the BDA pipeline. The solution maintained high accuracy across diverse document types including contracts, financial reports, and technical specifications while processing at scale. The serverless architecture with AWS Step Functions and asynchronous BDA processing enabled this massive parallel processing capability without performance degradation, demonstrating the solution’s readiness for enterprise-scale document processing workloads.

The complete AWS Cloud Development Kit (AWS CDK) implementation provisions the entire architecture with infrastructure as code (IaC) principles. The deployment creates four main stack components aligned with our architecture layers and includes environment-specific configurations for development, staging, and production environments.

Before implementing this solution, ensure that you have:

The complete CDK implementation is available in our public GitHub repository: Intelligent Document Processing with Bedrock Agents.

To deploy this solution, run the following command:

The following are thoughtful approaches to managing operational expenses while maintaining the effectiveness of this solution’s processing.

Route documents to appropriate processing levels based on complexity. Simple text documents use basic extraction, while complex documents with tables and images employ more advanced processing techniques.
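A complexity-based routing rule can be as small as the sketch below. The `DocumentProfile` fields and the two tier names are assumptions for illustration, and the rule treats a document as complex if it contains either tables or images; how those features are detected before extraction is outside the scope of this example.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BASIC = "basic"
    ADVANCED = "advanced"


@dataclass(frozen=True)
class DocumentProfile:
    has_tables: bool
    has_images: bool


def choose_tier(profile: DocumentProfile) -> Tier:
    """Send documents with tables or images to the more capable tier."""
    if profile.has_tables or profile.has_images:
        return Tier.ADVANCED
    return Tier.BASIC
```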

Combine multiple documents into a single Amazon Bedrock Data Automation request where appropriate to improve costs while respecting service limits.
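One way to plan such batches is a greedy pass that fills each request until the next document would exceed a limit. The sketch below uses the per-request limits stated earlier in this post (3,000 pages and 500 MB); interpreting 500 MB as 500 × 1024 × 1024 bytes is an assumption, as is the `(name, pages, size_bytes)` input shape. It only groups documents; how a group is actually combined and submitted is not shown.

```python
MAX_PAGES = 3000
MAX_BYTES = 500 * 1024 * 1024  # assumption: 500 MB counted in binary megabytes


def plan_batches(documents, max_pages=MAX_PAGES, max_bytes=MAX_BYTES):
    """Group (name, pages, size_bytes) tuples into batches within the limits."""
    batches, current = [], []
    pages = size = 0
    for name, doc_pages, doc_bytes in documents:
        if doc_pages > max_pages or doc_bytes > max_bytes:
            raise ValueError(f"{name} exceeds the per-request limits on its own")
        # Close the current batch if adding this document would overflow it.
        if current and (pages + doc_pages > max_pages or size + doc_bytes > max_bytes):
            batches.append(current)
            current, pages, size = [], 0, 0
        current.append(name)
        pages += doc_pages
        size += doc_bytes
    if current:
        batches.append(current)
    return batches
```

The greedy pass preserves input order and is not guaranteed to produce the minimum number of batches; sorting documents by size first is a common refinement when order does not matter.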

Implement Amazon S3 lifecycle policies to automatically transition processed documents to lower-cost storage tiers based on access patterns.
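As an illustration, the sketch below builds a lifecycle configuration in the shape expected by the Amazon S3 `put_bucket_lifecycle_configuration` operation. The `processed/` prefix, the rule ID, and the day counts are assumptions chosen for the example; the 30-day minimum before moving to S3 Standard-IA reflects an S3 constraint, and the 30-day gap before the next transition is a conservative choice.

```python
def lifecycle_rules(prefix="processed/", ia_after_days=30, glacier_after_days=90):
    """Build a lifecycle configuration that tiers processed documents over time."""
    if ia_after_days < 30:
        raise ValueError("objects must age at least 30 days before moving to STANDARD_IA")
    if glacier_after_days < ia_after_days + 30:
        raise ValueError("keep at least 30 days between the two transitions")
    return {
        "Rules": [{
            "ID": "tier-processed-documents",  # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            ],
        }]
    }
```

The result can be passed as the `LifecycleConfiguration` argument together with a `Bucket` name. Fixed day counts are only one option; when access patterns are unpredictable, the S3 Intelligent-Tiering storage class may be a better fit.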

The architecture incorporates enterprise-grade security through AWS KMS keys for encryption of documents and processing results, AWS PrivateLink connectivity for secure API access within VPC boundaries, and IAM roles with least-privilege access principles across all components.
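To show what least-privilege access can look like for a single processing component, the sketch below builds an IAM policy document that allows reading source documents, writing results, and using one AWS KMS key. The function name, statement IDs, and the split into separate input and output buckets are assumptions for illustration; the policies in the actual deployment may differ.

```python
def processing_role_policy(input_bucket, output_bucket, kms_key_arn):
    """Build an IAM policy limited to the actions a processing step needs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadSourceDocuments",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{input_bucket}/*",
            },
            {
                "Sid": "WriteProcessingResults",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{output_bucket}/*",
            },
            {
                # Needed to read and write objects encrypted with the key.
                "Sid": "UseDocumentKey",
                "Effect": "Allow",
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            },
        ],
    }
```

Scoping each statement to a specific bucket and key, rather than using wildcards, limits what a compromised component could reach.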

To avoid ongoing charges, delete the resources created by this solution:

To delete all the resources created, run this command:

    # Cleanup deployment
    ./cleanup.sh --profile default --environment UAT

Organizations can use Amazon Bedrock Data Automation, combined with an agent-based coordination architecture, to transform document processing from a traditional cost center into a strategic business asset. By automatically extracting and analyzing visual plots, graphs, and charts, and deriving insights from documents while maintaining context and relationships across data sources, organizations can unlock value previously trapped in unstructured content.

The multilayered architecture provides a foundation for scalable, cost-effective document processing that adapts to varying workloads while maintaining high accuracy. The visual analysis capabilities capture valuable insights embedded in charts, graphs, and images and make them available for business intelligence and decision-making.

Start with a focused proof of concept that targets your most common document types and visual analysis requirements. Then, expand the solution as you gain experience with the services and understand your specific accuracy and performance requirements.

To learn more about Amazon Bedrock Data Automation, visit the Amazon Bedrock Data Automation documentation. For hands-on experience with intelligent document processing, explore the IDP workshop on GitHub. The complete CDK implementation code for this architecture is available in the AWS Samples repository with deployment instructions and configuration examples.
