Guide - GenAI Integration in System Workflows: Key Lessons from a Real-World Use Case

Type: Guide
Category: Anthropic, Amazon Bedrock, Claude 3 Haiku, AWS Serverless Services

Generative AI models have undergone rapid advancements in the past year, and many of us have already experienced their benefits—often as end users consuming AI-generated content, or as developers building apps that display model outputs directly to users.

In these cases where humans see the results, it's easy to filter out irrelevant or unhelpful responses. We can correct them or generate new ones. Many applications also track feedback (like "Good Response" or "Bad Response") or let users regenerate the output with tweaks. And because we pay per request or token, we naturally control our usage based on our budget and needs.

However, as more companies and developers connect these models into automated workflows—where outputs flow directly into other systems rather than being shown to humans—the requirements become more challenging.

This blog post focuses on one such scenario where model outputs are used as an intermediate step in a multi-service process. We’ll look at important factors such as rate limits, output consistency, reliability, how to handle retries and failures, and keeping costs under control.

Use case: Processing audio files for contextual feedback analysis

Tech Stack: Amazon Transcribe, AWS Lambda, Step Functions, Aurora Serverless, Amazon Bedrock, API Gateway, Cognito, and other serverless services.

GenAI Model: Anthropic Claude 3 Haiku v1

We selected the model for this use case by comparing factors like output quality, consistency, latency, and cost. After evaluating different options, we found Anthropic’s models to be among the best overall; we pay by consumption, and Claude 3 Haiku offered the strongest balance between price and performance. Over time, we refined prompts, ran multiple evaluations, and made incremental adjustments to optimize the setup.

Figure: CloudWatch metrics showing processing during the rate limit and after the limit was restored.

Understanding the Shift: Human vs System Consumption

When Humans Consume AI Output

  • Automatic Quality Control: We (humans) quickly understand the usefulness and correctness of the AI-generated content. If the output is subpar, we can discard it or request an immediate re-run. Our interactions with such systems (ChatGPT, Gemini, Perplexity, or Claude) are usually iterative.
  • Cost Awareness: When we pay by consumption (e.g., per token or per API call), we naturally limit usage to stay within budget.
  • Flexible Tolerance: We can adapt to slight deviations in formatting or clarity, as we can easily parse meaning or manually fix minor errors.

When Systems Consume AI Output

  • Strict Quality Requirements: Downstream services expect data in a specific format or structure. Any unexpected format or missing fields can break the pipeline.
  • Automated Scalability: The system can fire multiple requests in quick succession without human interaction, potentially leading to very high usage volumes and costs.
  • No On-the-Fly Corrections: When failures occur, we are not always in the loop to intervene and correct the output in real time.

Rate Limits: Designing for External Constraints

When making calls to an AI model—whether hosted in-house or by a third-party provider—there are always limits or quotas. Our particular use case integrated Anthropic’s Claude 3 Haiku model via Amazon Bedrock, where we discovered that our quota had been reduced to just 20 calls per minute, far below our operational requirements.

  • Account for Rate Limit Variability: We discovered that even if a service initially supports a higher rate (e.g., 1,000 calls per minute), it can be throttled down unexpectedly. This could be due to subscription tiers, internal policy changes, or temporary adjustments by the service provider.

  • Output Token Limits: One important detail often overlooked is the maximum output token limit for certain GenAI models. In our case, Anthropic’s Claude 3 Haiku model enforces a fixed output token limit of 4,096 tokens. By default, our SDK calls were set to a maximum of 1,000 output tokens, which was sufficient for small or medium outputs but occasionally caused issues when the expected output approached or exceeded 1,000 tokens. (Claude 3.5 Haiku raises the limit to 8,192 output tokens.)

    • Unexpected Truncation: If the model attempts to generate more tokens than the specified maximum, the response is cut off. This can lead to malformed or incomplete JSON and, in turn, break the flow.
    • Difficult Debugging: Initially, it wasn’t obvious that we were hitting a token limit. We saw inconsistent responses and incomplete data, which triggered retry mechanisms but didn’t clearly indicate why the output was failing validation.
  • Design with Flexibility: Relying on a fixed assumption about your rate limit can result in unexpected failures. In our case, adjusting our data processing pipeline to match the dynamically available rate saved us from immediate breakdown. We designed our workflow to:

    • Queue Requests: Buffer requests when nearing the rate limit.
    • Scale Down: Reduce concurrency (processing batch size) if throttling errors become evident. A minimal pacing and backoff sketch appears below.
  • Engage Support Early: When limits are unexpectedly lowered, it can significantly hamper processing. Reaching out to the service provider’s support team early helps expedite resolution, but also plan for the possibility that support may take days.

In our case, it took the AWS support team several days to restore the limit, and many other customers report similar issues.
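The sketch below illustrates the kind of request pacing and throttling backoff described above: a simplified, single-worker version in Python. The model ID string, the quota constant, and the helper name are assumptions for illustration, not our exact production code.

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

bedrock = boto3.client("bedrock-runtime")

MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # assumed model ID
CALLS_PER_MINUTE = 20  # the quota we were throttled down to


def invoke_with_rate_budget(prompts, max_attempts=3):
    """Invoke the model sequentially, pacing calls to stay under the per-minute
    quota and backing off when Bedrock throttles us anyway."""
    min_interval = 60.0 / CALLS_PER_MINUTE
    results = []
    for prompt in prompts:
        body = json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 4096,  # raised from our old 1,000 default to avoid truncation
            "messages": [{"role": "user", "content": prompt}],
        })
        for attempt in range(max_attempts):
            try:
                response = bedrock.invoke_model(modelId=MODEL_ID, body=body)
                results.append(json.loads(response["body"].read()))
                time.sleep(min_interval)  # pace the loop so we stay under the budget
                break
            except ClientError as err:
                if err.response["Error"]["Code"] != "ThrottlingException":
                    raise
                # Throttled despite pacing: wait progressively longer, then retry this prompt.
                time.sleep(min_interval * (attempt + 1))
    return results
```

In the real pipeline this kind of pacing lives inside the Step Functions workflow rather than a single loop, but the idea is the same: buffer work and slow down rather than fail when the quota shrinks.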

Consistency and Reliability of AI-Generated Output

Our downstream services required complex JSON outputs from the model. The pipeline would fail whenever the structure or content of the returned JSON deviated from the expected schema.

  • Schema Validation: Implement strict validation checks immediately after receiving AI output. If the JSON is malformed or required fields are missing, handle it gracefully before proceeding. This helps localize errors and prevents malformed data from propagating.

  • Controlled Prompts: AI output consistency is heavily influenced by the prompt. We refined our prompts in several ways (a prompt sketch follows this list):

    • Specifying Output Format: For example, explicitly stating the JSON structure required (keys, data types, etc.).
    • Including Examples: Providing a valid example of the JSON output to guide the model.
    • Adding Constraints: Using system or developer messages (if the model supports them) to enforce that the output must be in valid JSON form.
  • Monitoring & Logging: Detailed logs of both the input prompts and the AI-generated outputs are vital. By analyzing these logs, you can spot patterns in the model’s inconsistent behavior and fine-tune prompts or retry logic.
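As one way of combining those techniques in a single Bedrock request, here is a hedged sketch. The schema fields, system prompt wording, and one-shot example are invented for illustration and are not our production prompt.

```python
import json

# Invented schema and example, purely for illustration.
SYSTEM_PROMPT = (
    "You analyze customer feedback transcripts. Respond with ONLY a valid JSON "
    'object, no prose and no markdown fences, with exactly these keys: "sentiment" '
    '(one of "positive", "neutral", "negative"), "topics" (array of strings), '
    '"summary" (string).'
)

ONE_SHOT_EXAMPLE = {
    "sentiment": "negative",
    "topics": ["billing", "wait time"],
    "summary": "Customer was double-charged and waited 40 minutes for support.",
}


def build_request(transcript: str) -> str:
    """Assemble the request body with format instructions, constraints, and an example."""
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4096,
        "system": SYSTEM_PROMPT,  # system message carries the format constraints
        "messages": [
            {
                "role": "user",
                "content": (
                    "Example of the expected output:\n"
                    + json.dumps(ONE_SHOT_EXAMPLE)
                    + "\n\nNow analyze this transcript:\n"
                    + transcript
                ),
            }
        ],
    })
```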

Even if your selected model can handle complex tasks, treat its output with caution. Validate its structure before letting it flow further into your pipeline.
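A minimal sketch of that validation step, using the jsonschema package and an invented example schema (our real schema is considerably richer):

```python
import json

from jsonschema import ValidationError, validate

# Example schema only; the real pipeline expects a much richer structure.
FEEDBACK_SCHEMA = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
        "topics": {"type": "array", "items": {"type": "string"}},
        "summary": {"type": "string"},
    },
    "required": ["sentiment", "topics", "summary"],
    "additionalProperties": False,
}


def parse_and_validate(model_text: str) -> dict:
    """Parse the model's text output and reject anything off-schema before it
    reaches downstream services."""
    try:
        payload = json.loads(model_text)
        validate(instance=payload, schema=FEEDBACK_SCHEMA)
        return payload
    except (json.JSONDecodeError, ValidationError) as err:
        # Surface a structured error so the retry logic can decide what to do next.
        raise ValueError(f"Model output failed validation: {err}") from err
```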

We still see a failure rate of around 1-3% due to inconsistent output: the model occasionally wraps the JSON data in unexpected explanatory text, stray quotes, or newline characters.


We apply a text-cleaning step to strip this unwanted data from the output. This has reduced the failure/retry rate, because in many cases the underlying JSON is actually in the expected format.
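A rough sketch of that cleaning step: keep only the outermost JSON object, discarding any surrounding commentary or markdown fences, and hand anything still unparseable back to the retry logic. This is one possible implementation, not our exact code.

```python
import json


def extract_json(raw: str):
    """Strip surrounding commentary and stray characters by keeping only the
    substring between the first '{' and the last '}', then attempt to parse it."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1 or end < start:
        return None  # No JSON object at all; let the retry / DLQ logic take over.
    candidate = raw[start : end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # Still malformed after cleaning.
```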

Retries and Failures: Building Resilience

In a human-centric use case, errors can be manually retried or corrected by the end user. But in an automated system, you must design a retry mechanism that accounts for various failure modes, such as:

  • Throttling: Rate-limit errors that require a delay before retrying.
  • Malformed Outputs: JSON structures that do not pass validation.
  • Transient Errors: Network timeouts or temporary service unavailability.

Adaptive Retry Strategy: Use an exponential backoff for throttling errors to prevent flooding the API. For malformed responses, however, immediate retry with a slightly modified prompt can sometimes resolve the issue.
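Sketched below is one way to express that policy in Python: exponential backoff with jitter for throttling, an immediate re-prompt for malformed output. The exception classes, delays, and re-prompt text are placeholders; your client will raise its own error types.

```python
import random
import time


class ThrottledError(Exception):
    """Placeholder: raised when the provider returns a rate-limit error."""


class MalformedOutputError(Exception):
    """Placeholder: raised when the response fails JSON/schema validation."""


MAX_ATTEMPTS = 4


def call_with_retries(invoke, prompt):
    """invoke(prompt) is any callable wrapping the model call that raises the
    placeholder exceptions above."""
    attempt_prompt = prompt
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            return invoke(attempt_prompt)
        except ThrottledError:
            # Exponential backoff with jitter so parallel workers do not retry in lockstep.
            time.sleep((2 ** attempt) + random.uniform(0, 1))
        except MalformedOutputError:
            # Retry immediately, but tighten the instructions instead of waiting.
            attempt_prompt = prompt + "\n\nReturn ONLY valid JSON matching the schema. No extra text."
    raise RuntimeError("Retries exhausted; route this request to the DLQ for review.")
```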

Cost vs. Reliability Trade-Off: Each retry incurs additional costs. Configure upper limits on retries to avoid ballooning expenses, especially if you’re dealing with large payloads. After the maximum retries are exhausted, the pipeline should:

  • Mark the Request as Failed: Log detailed error information for diagnostics.
  • Offload to a DLQ (Dead Letter Queue): Reprocess later or route to a manual review if necessary.

Grading Failures: Not all failures are identical. Some errors may be fixable with a simple retry, while others require deeper understanding (e.g., a re-prompt or additional data input). Prepare your retry strategies accordingly.
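A minimal sketch of the offload step, assuming an SQS queue acts as the DLQ. The queue URL, payload fields, and the simple retryable/needs-review grading are illustrative choices, not our exact schema.

```python
import json

import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/genai-feedback-dlq"  # placeholder URL


def send_to_dlq(request_id: str, prompt: str, last_error: str, retryable: bool):
    """Park an exhausted request so it can be replayed later or reviewed manually."""
    sqs.send_message(
        QueueUrl=DLQ_URL,
        MessageBody=json.dumps({
            "request_id": request_id,
            "prompt": prompt,
            "last_error": last_error,
            # Simple failure grading: retryable errors can be replayed automatically,
            # everything else is routed to manual review.
            "failure_grade": "retryable" if retryable else "needs_review",
        }),
    )
```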

Managing Cost

When we use GenAI models interactively, usage typically remains moderate. In an automated workflow, volume can skyrocket—especially if you’re processing large data sets or have many parallel requests. Moreover, retries multiply your total cost.

  • Token Accounting: If your billing is token-based, pay close attention to the size of both the input prompt and the expected output. Even moderate changes in prompt structure can greatly affect token usage. A usage-logging sketch follows this list.

  • Budget Thresholds and Alerts: Implement real-time monitoring of GenAI service costs. Set alerts that notify you if you’re approaching budget thresholds. This allows you to scale back or pause non-critical workflows to avoid unexpected bills.

  • Failure Loop: Repeated retries for a problematic request can waste tokens and money. In some scenarios, it may be more cost-effective to escalate the failure to human review. A “partial automation” approach can strike a balance between cost control and throughput.
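To make the token accounting concrete, the sketch below reads the usage block returned by Anthropic's Messages API on Bedrock and publishes an estimated-cost metric to CloudWatch that a budget alarm could watch. The namespace, workflow dimension, and per-token rates are illustrative assumptions; always check the current Bedrock price list.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Illustrative per-1K-token rates for Claude 3 Haiku; verify against current pricing.
INPUT_RATE_PER_1K = 0.00025
OUTPUT_RATE_PER_1K = 0.00125


def record_usage(response_body: dict, workflow: str = "feedback-analysis") -> float:
    """Log token usage from a model response and publish an estimated-cost metric."""
    usage = response_body.get("usage", {})
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)
    estimated_cost = (
        (input_tokens / 1000) * INPUT_RATE_PER_1K
        + (output_tokens / 1000) * OUTPUT_RATE_PER_1K
    )
    cloudwatch.put_metric_data(
        Namespace="GenAIWorkflows",  # custom namespace; name it to suit your account
        MetricData=[
            {
                "MetricName": "EstimatedModelCostUSD",
                "Dimensions": [{"Name": "Workflow", "Value": workflow}],
                "Value": estimated_cost,
                "Unit": "None",
            }
        ],
    )
    return estimated_cost
```

A CloudWatch alarm on this metric (or on AWS Budgets) then gives you the threshold alerts described above, so non-critical workflows can be paused before costs balloon.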

Provisioned throughput is an alternative for workloads that need guaranteed, consistent inference capacity. It comes at a very steep price and with commitment terms.

Provisioned throughput for Anthropic Claude 3 Haiku with a 200k context length costs around $58,000 for a one-month commitment.

Looking Ahead: More AI, More Integration

As new generations of GenAI models evolve, more application use cases will emerge where the AI’s role is deeply embedded in larger automated systems, rather than simply producing content for humans.

By proactively designing around constraints like rate limits, enforcing output validation, implementing robust retries, and carefully tracking costs, you can mitigate common pitfalls and achieve reliable, scalable AI-driven workflows.
