Architecting Agentic Workflows for Automated Issue Resolution: A Cross-Cloud Comparison

The Problem Worth Solving

A Jira ticket arrives. Someone triages it, assigns it, a developer reads the context, finds the relevant code, writes a fix, opens a PR, waits for CI, responds to review comments. Days pass.

The bottleneck isn't intelligence — it's coordination overhead and context-switching. An agent that can ingest the ticket, retrieve the relevant code, generate a fix, open a PR, and surface it for human review can compress that cycle dramatically for a well-scoped class of issues.

The engineering question isn't whether this is possible. It's which infrastructure makes it reliable, observable, and cost-efficient at scale. The "Big Three" cloud providers each offer a distinct approach, and the tradeoffs are meaningful.

Infrastructure Breakdown

AWS: Bedrock + Step Functions

AWS's answer to agentic orchestration is Bedrock Agents paired with Step Functions for stateful workflow management.

Bedrock Agents provides a managed runtime for tool-calling agents. You define an action group (a set of tools backed by Lambda functions), attach a knowledge base (S3 + OpenSearch for RAG), and Bedrock handles the ReAct loop, retry logic, and session state. The model roster includes Claude, Titan, Llama, and Mistral variants — giving you flexibility without managing model infrastructure.

Step Functions handles the outer orchestration: the webhook receipt, pre-processing steps, agent invocation, post-processing, and notification routing. It provides a visual state machine, built-in retry/catch semantics, and native integration with the AWS service graph (SQS, SNS, EventBridge, CodeCommit).
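
As a rough sketch of the retry/catch semantics described above, the following builds an Amazon States Language definition as a Python dictionary. The state names, the Lambda ARNs, and the retry numbers are placeholders chosen for illustration, not values from a real deployment.

```python
import json

# Hypothetical Lambda ARNs; replace with real functions in your account.
PREPROCESS_ARN = "arn:aws:lambda:us-east-1:123456789012:function:preprocess-ticket"
INVOKE_AGENT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:invoke-agent"
NOTIFY_ARN = "arn:aws:lambda:us-east-1:123456789012:function:notify-humans"


def build_state_machine() -> dict:
    """Return an Amazon States Language definition for the outer workflow."""
    return {
        "Comment": "Jira-to-PR outer orchestration (illustrative sketch)",
        "StartAt": "PreprocessTicket",
        "States": {
            "PreprocessTicket": {
                "Type": "Task",
                "Resource": PREPROCESS_ARN,
                "Next": "InvokeAgent",
            },
            "InvokeAgent": {
                "Type": "Task",
                "Resource": INVOKE_AGENT_ARN,
                # Retry failed task attempts with exponential backoff.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 5,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # Anything still failing after retries is routed to a human.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "EscalateToHuman"}],
                "Next": "NotifySuccess",
            },
            "NotifySuccess": {"Type": "Task", "Resource": NOTIFY_ARN, "End": True},
            "EscalateToHuman": {"Type": "Task", "Resource": NOTIFY_ARN, "End": True},
        },
    }


# JSON form, suitable for pasting into a state machine definition.
definition_json = json.dumps(build_state_machine(), indent=2)
```

Even this four-state flow runs to a few dozen lines, which gives a feel for how the definitions grow once conditional branches are added.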

The combination works well for teams already deep in the AWS ecosystem. The IAM permission model, CloudWatch observability, and Lambda-based tool execution all integrate cleanly. The weak points are cold-start latency on Lambda-backed tools and the verbosity of Step Functions state machine definitions for complex conditional flows.

Best fit for: Teams with existing AWS infrastructure, compliance requirements favoring managed services, and workloads that benefit from Lambda's event-driven execution model.

GCP: Vertex AI + Firebase/Cloud Run

Google's approach centers on Vertex AI Agent Builder with Gemini 1.5 Pro as the reasoning backbone, hosted on Cloud Run for low-latency, containerized execution.

Vertex AI provides a grounding/RAG layer with native integration to Google Cloud Storage, BigQuery, and the broader data platform. The Grounding with Google Search feature is a meaningful differentiator for agents that need live external context. Gemini 1.5 Pro's 1M token context window reduces the need for aggressive chunking in RAG pipelines — large codebases can be fed more directly.

Cloud Run handles agent hosting with fast cold starts relative to Lambda, strong container-native deployment patterns, and straightforward auto-scaling. Firebase adds real-time state sync useful for HITL workflows where a human reviewer needs to see agent progress and approve/reject mid-task.

The GCP stack shines for teams with Python-heavy ML workflows or existing BigQuery data infrastructure. The Vertex AI SDK is well-designed for programmatic agent construction. The weaker points are a narrower ecosystem for non-GCP tooling and some immaturity in the agent-specific managed services compared to AWS.

Best fit for: Data-heavy organizations, teams whose issue-tracking workflow already runs through Google Workspace integrations, and workloads with very large context requirements.

Azure: Semantic Kernel / AutoGen + Azure OpenAI + Logic Apps

Microsoft's agentic stack offers the most flexibility in agent framework choice, with Azure OpenAI providing model access and Logic Apps handling workflow orchestration.

Semantic Kernel is Microsoft's open-source SDK for building AI orchestration pipelines. It has strong plugin/tool abstractions and deep .NET and Python support. AutoGen, which originated at Microsoft Research, goes further: it enables multi-agent conversations where agents can negotiate, critique each other's outputs, and collaboratively solve problems. For complex issue resolution tasks that benefit from a "write-then-review" agent pair, AutoGen's conversation model is architecturally natural.

Logic Apps handles workflow coordination in a similar role to Step Functions — webhook receipt, conditional branching, Azure DevOps/GitHub integration, Teams notifications for HITL approvals. The no-code/low-code designer makes it accessible to non-engineering stakeholders defining approval workflows.

Azure OpenAI provides GPT-4o and o1 model access with data residency guarantees and private endpoint support — often a requirement in enterprise environments with data governance constraints.

The Azure stack's advantage is enterprise integration: Microsoft Entra ID (formerly Azure Active Directory) for identity, Teams for HITL communication, and Azure DevOps for code repositories and CI/CD. The gap is that Semantic Kernel and AutoGen are still maturing relative to LangChain's ecosystem depth.

Best fit for: Enterprise organizations with Microsoft-centric toolchains, compliance-heavy industries, and teams that want multi-agent conversation patterns.

Alternative: LangGraph and CrewAI

For teams willing to own more infrastructure, LangGraph and CrewAI offer capabilities that managed services don't yet match.

LangGraph models agent workflows as directed graphs with explicit state. This makes complex conditional flows — retry on failure, branch on confidence score, loop until verification passes — first-class constructs rather than workarounds. It's particularly strong for agents that need deterministic state recovery after partial failures, a real gap in managed runtimes that treat agents as black boxes.
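
To make "explicit state" concrete, here is a minimal plain-Python sketch of the idea. It deliberately does not use LangGraph's API: `run_graph`, the node functions, and the in-memory checkpoint dictionary are all hypothetical stand-ins for what a graph framework and a durable store would provide.

```python
from typing import Callable, Dict, Optional

State = dict
Node = Callable[[State], State]


def generate(state: State) -> State:
    """Stand-in for a code generation step; just counts attempts."""
    return {**state, "attempts": state["attempts"] + 1}


def verify(state: State) -> State:
    """Stand-in for a real test run: passes from the second attempt on."""
    return {**state, "passed": state["attempts"] >= 2}


def route_after_verify(state: State) -> Optional[str]:
    """Conditional edge: loop until verification passes or attempts run out."""
    if state["passed"] or state["attempts"] >= state["max_attempts"]:
        return None  # terminal
    return "generate"


NODES: Dict[str, Node] = {"generate": generate, "verify": verify}
EDGES: Dict[str, Callable[[State], Optional[str]]] = {
    "generate": lambda s: "verify",
    "verify": route_after_verify,
}


def run_graph(start: Optional[str], state: State, checkpoints: dict) -> State:
    """Run nodes until a terminal edge, checkpointing after every step."""
    current = start
    while current is not None:
        state = NODES[current](state)
        next_node = EDGES[current](state)
        # Record the state and where to resume; a real system would persist this.
        checkpoints["latest"] = {"next": next_node, "state": dict(state)}
        current = next_node
    return state
```

After a partial failure, a run could be resumed with `run_graph(saved["next"], saved["state"], checkpoints)`, where `saved` is the last stored checkpoint. That is the kind of recovery that is hard to get from a runtime that hides its state.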

CrewAI specializes in multi-agent role assignment. You define agents by role (Researcher, Coder, Reviewer, Tester), give them toolsets appropriate to their function, and let them collaborate on tasks. For Jira-to-PR workflows this maps cleanly: an Analysis agent reads the ticket and codebase, a Coding agent generates the fix, a Review agent critiques it, and a Testing agent validates it before the PR is opened.

Both frameworks deploy on any cloud or on-premises — useful for organizations with data residency requirements or existing Kubernetes clusters. The cost is operational ownership of the runtime, observability stack, and model access layer.

Best fit for: Teams with platform engineering capacity, complex multi-agent coordination requirements, or constraints that prevent use of managed cloud AI services.

Cloud-Native Agent Services Comparison

| Capability | AWS Bedrock Agents | GCP Vertex AI Agents | Azure (SK/AutoGen + AOAI) | LangGraph / CrewAI |
|---|---|---|---|---|
| Managed runtime | Yes (Bedrock Agents) | Yes (Agent Builder) | Partial (Logic Apps + SDK) | No (self-hosted) |
| Model choice | Claude, Llama, Mistral, Titan | Gemini 1.5 Pro/Flash, Llama | GPT-4o, o1, Phi | Any (model-agnostic) |
| RAG / knowledge base | Native (OpenSearch + S3) | Native (GCS + BigQuery) | Azure AI Search | DIY (Chroma, Weaviate, etc.) |
| Context window (tokens) | Up to 200K (Claude 3) | Up to 1M (Gemini 1.5 Pro) | Up to 128K (GPT-4o) | Model-dependent |
| Workflow orchestration | Step Functions | Cloud Workflows / Eventarc | Logic Apps | LangGraph state machine |
| Multi-agent support | Basic (nested agents) | Basic (agent chaining) | Strong (AutoGen) | Strong (CrewAI roles) |
| HITL approval flows | Manual via SNS/SQS | Firebase Realtime / Pub/Sub | Logic Apps + Teams | Custom webhook |
| Observability | CloudWatch + Bedrock traces | Cloud Logging + Vertex traces | Azure Monitor + App Insights | LangSmith / custom |
| Enterprise auth | IAM + Cognito | IAM + Workload Identity | Azure AD / Entra ID | Custom |
| Cold start latency | Medium (Lambda-backed tools) | Low (Cloud Run) | Low (Azure Functions Flex) | Depends on deployment |
| Data residency | Region-scoped | Region-scoped | Region-scoped + VNet | Full control |

The Workflow Pattern: Jira to PR

The agentic flow for automated issue resolution follows a consistent pattern across all platforms, with infrastructure-specific implementations at each stage.

Sequence Diagram: Jira-to-PR Agentic Flow

Jira                Webhook/Event        Orchestrator         Agent Runtime        Code Repo
 |                       |                    |                    |                   |
 |-- Ticket Created ----->|                    |                    |                   |
 |                       |-- Trigger event --->|                    |                   |
 |                       |                    |-- Start session --->|                   |
 |                       |                    |                    |                   |
 |                       |          [Step 1: Ingest]               |                   |
 |                       |                    |-- Fetch ticket ---->|                   |
 |                       |                    |<- Ticket details ---|                   |
 |                       |                    |                    |                   |
 |                       |          [Step 2: Contextualize]        |                   |
 |                       |                    |-- RAG query ------->|                   |
 |                       |                    |<- Relevant code ----|                   |
 |                       |                    |-- Fetch docs ------>|                   |
 |                       |                    |<- Documentation ----|                   |
 |                       |                    |                    |                   |
 |                       |          [Step 3: Analyze]              |                   |
 |                       |                    |-- Analyze + plan -->|                   |
 |                       |                    |<- Fix plan ---------|                   |
 |                       |                    |                    |                   |
 |                       |          [Step 4: Generate]             |                   |
 |                       |                    |-- Generate code --->|                   |
 |                       |                    |<- Code patch -------|                   |
 |                       |                    |                    |                   |
 |                       |          [Step 5: Validate]             |                   |
 |                       |                    |-- Run tests ------->|                   |
 |                       |                    |<- Test results -----|                   |
 |                       |                    |                    |                   |
 |                       |   [HITL Checkpoint: human approval]     |                   |
 |                       |<-- Approval request -|                   |                   |
 |                       |-- Approved ---------->|                  |                   |
 |                       |                    |                    |                   |
 |                       |          [Step 6: Submit]               |                   |
 |                       |                    |-- Create branch ----|------------------>|
 |                       |                    |-- Commit patch -----|------------------>|
 |                       |                    |-- Open PR ----------|------------------>|
 |                       |                    |                    |                   |
 |<-- PR link comment ---|                    |                    |                   |

Stage 1: Ingestion

A webhook from Jira fires when a ticket is created or transitions to a state that triggers automated handling (e.g., "Ready for Agent"). The payload contains ticket ID, description, acceptance criteria, priority, and labels.
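
A minimal sketch of the ingestion check might look like the following. It assumes a payload shaped roughly like Jira's issue webhooks (`webhookEvent`, `issue.key`, `issue.fields`) and a plain-text description; the custom field ID for acceptance criteria is hypothetical, since custom fields differ per Jira instance.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical custom field ID; look up the real one in your Jira instance.
ACCEPTANCE_CRITERIA_FIELD = "customfield_10042"
TRIGGER_STATUS = "Ready for Agent"


@dataclass
class TicketEvent:
    key: str
    summary: str
    description: str
    priority: str
    labels: List[str] = field(default_factory=list)
    acceptance_criteria: str = ""


def parse_webhook(payload: dict) -> Optional[TicketEvent]:
    """Return a TicketEvent if the payload should trigger the agent, else None."""
    issue = payload.get("issue") or {}
    fields = issue.get("fields") or {}
    created = payload.get("webhookEvent") == "jira:issue_created"
    status = (fields.get("status") or {}).get("name")
    # Trigger on creation, or on any event where the ticket is in the trigger status.
    if not issue.get("key") or not (created or status == TRIGGER_STATUS):
        return None
    return TicketEvent(
        key=issue["key"],
        summary=fields.get("summary") or "",
        description=fields.get("description") or "",
        priority=(fields.get("priority") or {}).get("name", "Unknown"),
        labels=list(fields.get("labels") or []),
        acceptance_criteria=fields.get(ACCEPTANCE_CRITERIA_FIELD) or "",
    )
```

Returning `None` for anything that doesn't match keeps unrelated webhook traffic out of the orchestrator without raising errors.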

The orchestration layer (Step Functions, Logic Apps, or a LangGraph entry node) validates the payload, enriches it with historical ticket context from a metadata store, and initializes an agent session.

A filtering step here is worth the overhead: not all tickets are suited for autonomous resolution. Routing logic based on ticket type, component labels, or a classification agent prevents the workflow from wasting cycles on tickets that genuinely need human judgment from the start.
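
The simplest version of that routing is a handful of rules. In the sketch below, the eligible issue types, component names, and human-only labels are made-up examples; a classification agent could replace or supplement the rule checks.

```python
from typing import Iterable

# Hypothetical routing rules; tune these per team.
AGENT_ELIGIBLE_TYPES = {"Bug", "Task"}
AGENT_ELIGIBLE_COMPONENTS = {"billing-api", "docs"}
HUMAN_ONLY_LABELS = {"security", "needs-design", "incident"}


def route_ticket(issue_type: str, components: Iterable[str], labels: Iterable[str]) -> str:
    """Return "agent" or "human" for a ticket using simple rule checks."""
    if HUMAN_ONLY_LABELS & set(labels):
        return "human"  # explicit opt-out labels always win
    if issue_type not in AGENT_ELIGIBLE_TYPES:
        return "human"
    if not AGENT_ELIGIBLE_COMPONENTS & set(components):
        return "human"  # agent only handles components it has been vetted on
    return "agent"
```

Defaulting to "human" whenever a rule doesn't clearly match keeps the failure mode conservative.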

Stage 2: Contextualization

This is where RAG earns its keep. The agent queries the knowledge base with the ticket description as the semantic search input, retrieving:

Relevant source files and functions from the codebase vector store
Related past tickets and their resolutions
Architecture documentation for the affected component
Test patterns for the relevant code area

Chunking strategy matters here. File-level chunks preserve context but are large and expensive. Function-level chunks are more precise but miss cross-function dependencies. A hybrid approach (function-level chunks with a few lines of surrounding context, plus file-level metadata attached to each chunk) tends to work well in practice.
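
For Python sources, the hybrid approach can be sketched with the standard `ast` module. The chunk dictionary layout and the `context_lines` parameter are illustrative choices; other languages would need their own parser.

```python
import ast
from typing import Dict, List


def chunk_python_source(path: str, source: str, context_lines: int = 2) -> List[Dict]:
    """Split a module into function-level chunks carrying file-level metadata."""
    tree = ast.parse(source)
    lines = source.splitlines()
    # File-level metadata attached to every chunk: top-level imports and docstring.
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    module_doc = ast.get_docstring(tree) or ""
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Widen the slice by a few lines so nearby comments and decorators survive.
            start = max(node.lineno - 1 - context_lines, 0)
            end = min(node.end_lineno + context_lines, len(lines))
            chunks.append({
                "path": path,
                "name": node.name,
                "start_line": start + 1,
                "end_line": end,
                "text": "\n".join(lines[start:end]),
                "file_imports": imports,
                "module_doc": module_doc,
            })
    return chunks
```

Carrying the import list on every chunk gives the model a hint about cross-module dependencies that a bare function body would hide.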

The agent builds a working context that fits the model's effective reasoning window, prioritizing the most semantically similar retrievals and trimming lower-ranked results.
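
One simple way to do that trimming is a greedy fill by similarity score. The four-characters-per-token estimate below is a crude assumption; a real pipeline would use the target model's tokenizer.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Crude estimate (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def build_context(retrievals: List[Tuple[float, str]], budget_tokens: int) -> List[str]:
    """Keep the highest-scoring retrievals that fit within the token budget."""
    selected: List[str] = []
    used = 0
    for _score, text in sorted(retrievals, key=lambda r: r[0], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # too big; a smaller, lower-ranked chunk may still fit
        selected.append(text)
        used += cost
    return selected
```

Skipping oversized chunks instead of stopping at the first miss is a design choice: it favors filling the window over strict rank order.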

Stage 3: Analysis and Planning

The agent uses the enriched context to produce a structured fix plan: which files to modify, what logic to change, and why. This planning step, surfaced as structured output (JSON), is checkpointable — you can log it, review it, and use it to reconstruct reasoning if the later steps fail.
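
A structured plan is only checkpointable if its shape is enforced. The sketch below assumes a hypothetical schema (`ticket_key`, `summary`, and a list of per-file changes); the field names are illustrative, not a standard.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class FileChange:
    path: str
    change: str     # what logic to change
    rationale: str  # why, tied back to the ticket


@dataclass
class FixPlan:
    ticket_key: str
    summary: str
    changes: List[FileChange]


def parse_fix_plan(raw: str) -> FixPlan:
    """Parse model output into a FixPlan, raising ValueError on bad structure."""
    try:
        data = json.loads(raw)
        changes = [FileChange(**c) for c in data["changes"]]
        plan = FixPlan(data["ticket_key"], data["summary"], changes)
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"invalid fix plan: {exc}") from exc
    if not plan.changes:
        raise ValueError("invalid fix plan: no file changes listed")
    return plan


def checkpoint(plan: FixPlan) -> str:
    """Serialize the plan so it can be logged and replayed later."""
    return json.dumps(asdict(plan), sort_keys=True)
```

Rejecting malformed plans at this boundary means later stages never have to guess at missing fields.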

For multi-agent setups (CrewAI, AutoGen), this is where an Analysis agent hands off to a Coding agent. The plan becomes the inter-agent communication artifact.

Stage 4: Code Generation

The coding step takes the plan and generates actual diffs against the current codebase. This is the most model-capability-dependent step. Stronger models (Claude Sonnet, GPT-4o, Gemini 1.5 Pro) produce more idiomatic, context-aware code. Weaker models may generate syntactically valid but architecturally inconsistent changes.

Key tool design choices at this step:

File read tools should return targeted context, not full files, to avoid bloating the prompt.
Diff generation is preferable to full file rewrite — smaller, reviewable, less likely to inadvertently change unrelated logic.
The agent should justify each change in the commit message, grounding it in the ticket requirements.
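
As one way to turn a proposed file version into a reviewable diff, a tool can use Python's standard `difflib`. The function name `make_patch` and the `a/` and `b/` path prefixes are conventions assumed here.

```python
import difflib


def make_patch(path: str, original: str, modified: str) -> str:
    """Return a unified diff between two versions of one file."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        modified.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


before = "def total(items):\n    return sum(items)\n"
after = "def total(items):\n    return sum(i.price for i in items)\n"
patch = make_patch("cart.py", before, after)
```

An empty result means the model proposed no effective change, which is a cheap signal to catch before running validation.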

Stage 5: Validation

Before any PR is opened, the agent runs validation: static analysis, unit tests for the modified files, and if available, integration tests for the affected component. Test failures return to the code generation step with the error output as additional context — a contained retry loop.

This step requires careful bounding. Set a maximum retry count. If the agent can't produce passing tests within the bound, it should surface the partial work with a clear failure summary rather than loop indefinitely.
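
The bounded loop can be sketched as follows. `generate` and `validate` are hypothetical callables standing in for the model call and the test runner; the default of three attempts is an arbitrary example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ValidationOutcome:
    passed: bool
    patch: Optional[str]
    attempts: int
    failures: List[str] = field(default_factory=list)


def generate_until_valid(
    generate: Callable[[List[str]], str],
    validate: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> ValidationOutcome:
    """Regenerate a patch with prior errors as context, up to max_attempts."""
    failures: List[str] = []
    patch: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        patch = generate(failures)  # earlier error output is fed back in
        ok, output = validate(patch)
        if ok:
            return ValidationOutcome(True, patch, attempt, failures)
        failures.append(output)
    # Bound reached: surface the last patch and every failure for a human.
    return ValidationOutcome(False, patch, max_attempts, failures)
```

The failed outcome carries the last patch and all error outputs, which is the raw material for the failure summary a human will read.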

Stage 6: PR Submission

A passing validation triggers branch creation, commit, and PR open via the repository API (GitHub, GitLab, Azure DevOps). The PR description is generated from the fix plan and validation results — structured, traceable, and linked back to the originating Jira ticket.
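
For GitHub, the PR-open step could be sketched like this. It builds (but does not send) a request to the REST endpoint for creating a pull request; the description layout, the branch naming, and all argument values are assumptions, and GitLab or Azure DevOps would need different endpoints and payloads.

```python
import json
import urllib.request


def build_pr_request(
    owner: str,
    repo: str,
    token: str,
    ticket_key: str,
    branch: str,
    plan_summary: str,
    test_summary: str,
    base: str = "main",
) -> urllib.request.Request:
    """Build a POST request that opens a pull request on GitHub."""
    first_line = (plan_summary.splitlines() or ["Automated fix"])[0]
    body = (
        f"Resolves {ticket_key}\n\n"
        f"## Fix plan\n{plan_summary}\n\n"
        f"## Validation\n{test_summary}\n"
    )
    payload = {
        "title": f"[{ticket_key}] {first_line}",
        "head": branch,  # branch holding the agent's commit
        "base": base,    # branch the PR targets
        "body": body,
    }
    return urllib.request.Request(
        url=f"https://api.github.com/repos/{owner}/{repo}/pulls",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; keeping construction separate from sending makes the payload easy to log and test.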

The agent adds a comment to the Jira ticket with the PR link and a summary of what was changed and why. The ticket transitions to a "Review" state, returning the work to the human queue — now with a proposed solution attached rather than a blank slate.

The Human-in-the-Loop (HITL) Evolution

The temptation when deploying agentic workflows is to maximize automation and minimize human touchpoints. This is the wrong optimization target, at least initially.

The right question isn't "how do we remove humans from the loop?" It's "where does human judgment create the most value, and how do we route work there efficiently?"

The Shift-Left Philosophy

"Shift-left" in traditional software quality means catching defects earlier — before they reach production, before they're expensive to fix. Applied to agentic oversight, it means moving human judgment upstream: into the definition of what the agent should do, rather than into the correction of what the agent did wrong.

This is a meaningful reorientation of the human role:

Before agentic workflows: The developer reads the ticket, understands the system, writes the code, and reviews their own work before submitting.

After agentic workflows: The developer defines what "correct" looks like (acceptance criteria, test coverage requirements, architectural constraints), reviews the agent's proposed solution against that definition, and approves or redirects.

The cognitive work shifts from execution to specification and judgment. That's a higher-leverage activity, not a lesser one.

Defining the New Human Role

The Architect: Defining system boundaries, data flow constraints, and the "Definition of Done" that the agent optimizes toward. The quality of agent output depends heavily on the quality of the specification it receives. An architect who writes precise acceptance criteria and maintains a well-structured knowledge base is the highest-leverage contributor in an agentic engineering team.

The Reviewer: The human review checkpoint isn't a rubber stamp; it's the quality gate. Reviewers evaluate agent-generated PRs for correctness, security implications, architectural consistency, and edge cases the agent may have missed. Over time, patterns in reviewer feedback become a feedback signal for improving the agent's prompt templates and RAG retrieval.

The Security Auditor: Agentic code generation can introduce vulnerabilities the agent doesn't recognize. Security review of agent-generated code should be a defined step, not an afterthought. SQL injection, insecure deserialization, overly permissive IAM changes — these are real risks when agents operate across codebases and infrastructure at scale.

The Agent Mentor: This is the role with the most long-term leverage. The mentor reviews which tickets the agent handles well versus poorly, identifies systematic gaps in the knowledge base, refines prompt templates based on failure patterns, and adjusts routing logic to better match ticket types to agent capabilities. The agent improves as its human mentors invest in improving it.

Where HITL Checkpoints Belong

Not every step needs a human gate. Over-gating defeats the purpose. Under-gating creates risk. A practical starting framework:

| Checkpoint | Human involvement | Trigger |
|---|---|---|
| Ticket routing | Auto (with monitoring) | Ticket type / component label |
| Fix plan review | Optional (async) | Confidence score below threshold |
| PR approval | Required | Always, for first 90 days |
| Security-sensitive changes | Required | IAM, auth, crypto, PII-adjacent code |
| Production deployment | Required | Always |
| Agent failure escalation | Required | Max retries exceeded |

Requiring PR approval on every change for the first 90 days is deliberate. Trust in a new agentic system is earned through a track record, not assumed. Once the agent's error rates are well characterized (how often reviewers reject its PRs, and how often approved PRs later turn out to be wrong), the approval gate can be relaxed for low-risk ticket categories.
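
The checkpoint framework above can be encoded as a small policy function. The path markers, the 0.7 confidence threshold, and the gate names are hypothetical; production deployment approval is left out because it sits outside the PR stage.

```python
from datetime import date
from typing import Iterable, List

# Hypothetical path fragments that mark security-sensitive code.
SENSITIVE_MARKERS = ("iam/", "auth/", "crypto/", "pii/")
PROBATION_DAYS = 90


def required_gates(
    changed_paths: Iterable[str],
    plan_confidence: float,
    retries_exhausted: bool,
    launch_date: date,
    today: date,
    low_risk_category: bool = False,
    confidence_threshold: float = 0.7,
) -> List[str]:
    """Return the human checkpoints a proposed change must pass."""
    gates: List[str] = []
    if plan_confidence < confidence_threshold:
        gates.append("fix_plan_review")
    # PR approval is unconditional during probation, then relaxed for low-risk work.
    in_probation = (today - launch_date).days < PROBATION_DAYS
    if in_probation or not low_risk_category:
        gates.append("pr_approval")
    if any(marker in path for path in changed_paths for marker in SENSITIVE_MARKERS):
        gates.append("security_review")
    if retries_exhausted:
        gates.append("failure_escalation")
    return gates
```

Keeping the policy in one function makes it easy to review and version, and to loosen deliberately once the track record supports it.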

Mentoring the Agent

The RAG knowledge base is not a static artifact. It degrades as codebases evolve and improves as it's actively maintained. Human investment in the vector store — ensuring documentation is current, removing outdated snippets, adding architecture decision records — pays direct dividends in agent output quality.

Prompt template refinement is similarly iterative. The prompts that define how the agent analyzes tickets, generates plans, and writes code are engineering artifacts that deserve version control, testing, and systematic improvement. Teams that treat prompt templates as throwaway configuration will struggle with inconsistent agent behavior. Teams that manage them like production code will see reliable, improvable performance.

Choosing Your Stack

The right infrastructure choice depends less on technical capability (all four approaches can implement the workflow above) and more on organizational fit:

AWS Bedrock + Step Functions if you're already AWS-native and want minimal operational overhead.
GCP Vertex AI + Cloud Run if you have large-context requirements or existing Google data infrastructure.
Azure OpenAI + Logic Apps + Semantic Kernel/AutoGen if you're enterprise Microsoft and need multi-agent conversation patterns or data residency guarantees.
LangGraph or CrewAI if you need maximum control, have platform engineering capacity, and want framework flexibility across clouds.

Start with a narrow ticket category — a specific bug type or component — rather than broad deployment. Measure time-to-PR, PR acceptance rate, and human review burden per ticket. Let the data drive where to expand and where to tighten the human oversight gates.
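
A minimal sketch of those three measurements, assuming a hypothetical `TicketRun` record per ticket (the field names and the use of merge status as a proxy for acceptance are illustrative choices):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import List, Optional


@dataclass
class TicketRun:
    ticket_created: datetime
    pr_opened: Optional[datetime]  # None if the agent never opened a PR
    pr_merged: bool
    review_minutes: float          # human time spent reviewing


def summarize(runs: List[TicketRun]) -> dict:
    """Compute time-to-PR, PR acceptance rate, and review burden per ticket."""
    with_pr = [r for r in runs if r.pr_opened is not None]
    hours = [(r.pr_opened - r.ticket_created).total_seconds() / 3600 for r in with_pr]
    return {
        "median_hours_to_pr": median(hours) if hours else None,
        "pr_acceptance_rate": (
            sum(r.pr_merged for r in with_pr) / len(with_pr) if with_pr else None
        ),
        "avg_review_minutes": (
            sum(r.review_minutes for r in runs) / len(runs) if runs else None
        ),
    }
```

Tracking these per ticket category, not just in aggregate, is what shows where the agent has earned a looser gate.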

The teams that get the most value from agentic issue resolution are the ones that invest as heavily in the human workflow changes as in the infrastructure. The agent is half the system. The human practices around it are the other half.

The shift from coder to architect-and-reviewer isn't a demotion. It's the work that scales — the part where human judgment is genuinely irreplaceable.