👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Where AI Still Fails in Real-World Software Development
Where AI coding tools still fail: architecture, edge cases, and domain nuance.
January 21, 2026 · Waqas Ahmad
Introduction
AI coding tools still fail in predictable ways: architecture decisions, rare edge cases, security-sensitive code, consistency across a codebase, and domain nuance. This article maps where AI still fails in real-world software development—architecture and design, edge cases and correctness, security, consistency and style, domain and business rules—and what to do (review, tests, ownership) so you can use AI where it helps and verify where it does not. For architects and tech leads, targeting AI at safe tasks (boilerplate, scaffolding) and applying review and tests where it fails (architecture, security, edge cases, business rules) keeps risk bounded.
When this applies: Teams or developers using AI coding tools who want a clear picture of where current tools tend to fail so they can set boundaries and review accordingly.
When it doesn’t: Teams that don’t use AI or that only want tool recommendations. This article is about failure modes, not tools.
Scale: Any team size; the failure modes (architecture, edge cases, security, consistency, domain) are structural.
Constraints: Mitigation requires review, tests, and explicit boundaries; without them, reliance on AI in weak areas is riskier.
Non-goals: This article doesn’t argue that AI is useless; it states where it tends to fail and how to mitigate.
Why “where AI fails” matters
Knowing where AI fails lets you target it at tasks it handles well (boilerplate, patterns, explanations) and avoid trusting it for high-stakes areas (architecture, security, rare bugs) without verification. That reduces debt and risk—see Trade-Offs of Relying on AI for Code Generation and What AI IDEs Get Right — and What They Get Wrong. Without this map, teams often over-trust AI in architecture or security and under-review in edge cases—leading to rework and incidents that could have been caught by review and tests.
How to use this map: For each area (architecture, edge cases, security, consistency, domain), the sections below summarise how AI often fails and what to do. Review, tests, and ownership are the common mitigations; linters and style guides help consistency; domain experts and tests help business rules. Use this as a checklist when reviewing AI-generated code.
Architecture and design
AI tends to optimise locally: it suggests code that works in the narrow context you gave, not necessarily code that fits your Clean Architecture, microservices, or scale. It can break dependency rules (e.g. a use case importing infrastructure), ignore existing patterns (e.g. a new service that does not follow your repository or dependency injection style), and over-engineer (e.g. unnecessary abstraction) or under-engineer (e.g. logic in the wrong layer). Ownership: Humans own architecture; use AI for implementation within agreed boundaries. See What AI IDEs Get Right — and What They Get Wrong. Concrete failure: “Add a new API for orders” may produce a controller that calls the database directly instead of going through a service and repository—review must enforce layering.
Edge cases and correctness
AI is trained on common patterns; rare inputs, boundary conditions, nulls, and concurrency are where it often fails. Generated code can look right and fail in production. What to do: Unit and integration tests, review with edge cases in mind, and never assume “AI wrote it” means “it’s correct.” See How AI Is Changing Code Review and Testing.
Security
AI can suggest insecure defaults (e.g. weak crypto, SQL concatenation, hardcoded secrets), wrong threat models, or outdated advice. Never rely on AI alone for auth, secrets, injection, or compliance. What to do: Security review, OWASP and Securing APIs, and treat AI output as untrusted in security-sensitive paths.
Consistency and style
Across many files, AI can drift in naming, structure, and patterns. One file may look like your style; the next may not. What to do: Linters, formatters, style guides, and human review to keep consistency. Technical leadership can set norms and templates.
Domain and business rules
AI does not know your business rules, regulations, or domain nuance. It can guess and get it wrong (e.g. rounding, dates, eligibility). What to do: Domain experts and tests that encode rules; use AI for scaffolding, not authoritative business logic. See Domain-Driven Design for clarity on boundaries.
Architecture: A team asked AI to “add a new feature” and received a monolithic handler that mixed HTTP, business logic, and DB access. Fix: Define layers and prompt within them (e.g. “add a use case for X in the application layer”); review for dependency direction.
Edge cases: Generated code passed unit tests for the happy path but failed when input was null or empty. Fix: Add edge-case tests; review with null and boundaries in mind.
Security: AI suggested string concatenation for SQL and hardcoded config. Fix: Never trust AI for auth, secrets, or injection; use OWASP and security review.
Consistency: Multi-file generation produced different naming and error-handling style. Fix: Linters, formatters, human review.
Domain: AI guessed a business rule (e.g. rounding) and got it wrong. Fix: Domain experts and tests that encode rules; use AI for scaffolding only.
Code-level examples: what failure looks like in real code
Below are exact prompts, full bad AI output (what the model returns in theory), what goes wrong at build or runtime, and full correct code so you see concrete issues at code level—not just description. Use them when reviewing or training the team.
Example 1: Architecture — prompt vs full bad vs full good
Exact prompt: “Add an endpoint to fetch a user by ID and return their profile.”
What you get in theory (bad AI output): Controller injects DbContext, does mapping and logic in the controller, and violates Clean Architecture.
// BAD: Controller with infra dependency and logic in wrong place
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;

namespace MyApp.Api.Controllers
{
    public class UserController : ControllerBase
    {
        private readonly AppDbContext _db;
        public UserController(AppDbContext db) => _db = db;

        [HttpGet("{id}")]
        public async Task<IActionResult> Get(int id)
        {
            var user = await _db.Users.FindAsync(id);
            if (user == null) return NotFound();
            var profile = new UserProfileDto
            {
                FullName = user.FirstName + " " + user.LastName,
                Email = user.Email,
                CreatedAt = user.CreatedAt
            };
            return Ok(profile);
        }
    }
}
What goes wrong at code level: Controller depends on infrastructure (DbContext); mapping and concatenation logic live in the API layer; hard to unit-test and to change persistence or profile shape. Result in theory: Rework when you introduce a use case layer; tests require a real or mocked DbContext at the controller.
Correct approach (full good code): Controller only delegates; use case and repository live in their layers.
// GOOD: Controller delegates; no DbContext; use case uses IUserRepository
namespace MyApp.Api.Controllers
{
    public class UserController : ControllerBase
    {
        private readonly IGetUserProfileUseCase _getUserProfileUseCase;
        public UserController(IGetUserProfileUseCase getUserProfileUseCase) =>
            _getUserProfileUseCase = getUserProfileUseCase;

        [HttpGet("{id}")]
        public async Task<IActionResult> Get(int id)
        {
            var result = await _getUserProfileUseCase.Execute(id);
            return result.Match<IActionResult>(Ok, _ => NotFound());
        }
    }
}

// Application layer: IGetUserProfileUseCase.Execute(id) uses IUserRepository, maps to UserProfileDto
// Infrastructure: IUserRepository implemented with DbContext in a separate project
Example 2: Security — SQL injection and hardcoded secret
Exact prompt: “Write code to search users by email and log to the database.”
What you get in theory (bad AI output): User input concatenated into SQL; hardcoded connection or secret.
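A minimal sketch of what that bad output can look like. The class and member names here are illustrative, and the query-building is separated into its own method only to make the flaw easy to see; the generated code would then execute the string via SqlCommand.

```csharp
// BAD (illustrative sketch): user input concatenated into SQL, secret in source
public static class UserSearchSql
{
    // BAD: connection string (with password) hardcoded and committed to source control
    public const string ConnStr =
        "Server=prod-sql;Database=App;User Id=app;Password=SuperSecret123;";

    // BAD: email is concatenated straight into the SQL text -> injection
    public static string BuildSql(string email) =>
        "SELECT * FROM Users WHERE Email = '" + email + "'";
}
```

With the payload from the next paragraph, the built string contains attacker-controlled SQL, which is exactly why the query must be parameterised instead.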
What goes wrong at code level: Input email = "'; DROP TABLE Users; --" executes arbitrary SQL; secret in source control and compliance breach. Result in theory: Security incident and data loss.
Correct approach (full good code): Parameterised query; configuration for connection.
// GOOD: Parameterised; no secrets in code
public class UserSearchService
{
    private readonly IConfiguration _config;
    public UserSearchService(IConfiguration config) => _config = config;

    public async Task<User?> FindByEmail(string email)
    {
        var connStr = _config.GetConnectionString("Default");
        await using var conn = new SqlConnection(connStr);
        await conn.OpenAsync();
        await using var cmd = new SqlCommand("SELECT * FROM Users WHERE Email = @Email", conn);
        cmd.Parameters.AddWithValue("@Email", email);
        await using var reader = await cmd.ExecuteReaderAsync();
        return await reader.ReadAsync() ? MapUser(reader) : null;
        // MapUser maps the data reader to a User entity (defined elsewhere)
    }
}
Example 3: Edge cases — null and boundaries
Exact prompt: “Implement a method that returns the first line item amount for an order.”
What you get in theory (bad AI output): No null or empty check; IndexOutOfRangeException or NullReferenceException in production.
// BAD: Crashes when order or LineItems null/empty
public decimal GetFirstAmount(Order order)
{
    return order.LineItems[0].Amount;
}
What goes wrong at code level: order null → NullReferenceException; LineItems null or empty → NullReferenceException or IndexOutOfRangeException. Result in theory: Runtime crash when orders are empty or partially loaded.
Correct approach (full good code): Explicit null and empty handling; documented behaviour.
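A sketch of that good version, assuming minimal Order and LineItem types and a nullable return to make “no first item” an explicit, documented outcome (your domain may prefer 0m or an exception instead):

```csharp
using System.Collections.Generic;

// Minimal types assumed for the sketch
public record LineItem(decimal Amount);
public class Order { public List<LineItem>? LineItems { get; set; } }

public static class OrderPricing
{
    // GOOD: explicit null and empty handling; returns null when there is no first item
    public static decimal? GetFirstAmount(Order? order)
    {
        if (order?.LineItems == null || order.LineItems.Count == 0)
            return null;
        return order.LineItems[0].Amount;
    }
}
```

The nullable return forces every caller to decide what “no line items” means, instead of discovering it as a crash in production.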
Example 4: Consistency — naming and signatures across files
Exact prompt (file A): “Add a method to get order by ID.” Exact prompt (file B): “Add method to fetch order.”
What you get in theory (bad AI output): Different names and signatures across files; inconsistent with existing repo.
// File A (OrderService.cs): GetOrderByIdAsync
public async Task<Order> GetOrderByIdAsync(int id)
{
    return await _repo.FindById(id);
}

// File B (PaymentService.cs): FetchOrder, sync, different return
public Order FetchOrder(int id)
{
    return _orderRepo.GetById(id); // blocking; different naming
}
What goes wrong at code level: Two conventions in one codebase; callers mix GetOrderByIdAsync and FetchOrder; async vs sync deadlocks or confusion. Result in theory: Review churn and maintainability debt.
Correct approach (full good code): Single convention; align with style guide and existing repos.
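One way the aligned version can look, assuming a shared IOrderRepository contract and the repo-wide async-with-Async-suffix convention (the types here are minimal stand-ins):

```csharp
using System.Threading.Tasks;

public class Order { public int Id { get; set; } }

// One repository contract, one naming convention (async + Async suffix)
public interface IOrderRepository
{
    Task<Order?> GetByIdAsync(int id);
}

public class OrderService
{
    private readonly IOrderRepository _orders;
    public OrderService(IOrderRepository orders) => _orders = orders;

    public Task<Order?> GetOrderByIdAsync(int id) => _orders.GetByIdAsync(id);
}

public class PaymentService
{
    private readonly IOrderRepository _orders;
    public PaymentService(IOrderRepository orders) => _orders = orders;

    // Same name, same async shape as OrderService — no sync FetchOrder variant
    public Task<Order?> GetOrderByIdAsync(int id) => _orders.GetByIdAsync(id);
}
```

Because both services share the interface and the name, callers cannot mix a blocking and a non-blocking path for the same operation.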
Example 5: Domain — business rules and rounding
What goes wrong at code level: Product says “discount above 100” but the generated code uses >= (so exactly 100 gets a discount); currency rounding can differ (banker’s vs half-up); with no tests, regressions go unnoticed. Result in theory: Wrong discount in production and disputes.
// GOOD: Rule explicit; rounding specified; tests encode rule
public decimal GetDiscount(Order order)
{
    if (order.Total <= 100m) return 0m;
    var discount = order.Total * 0.10m;
    return Math.Round(discount, 2, MidpointRounding.AwayFromZero);
}
// Unit tests: GetDiscount_WhenTotalIs100_Returns0; GetDiscount_WhenTotalIs100_01_Returns10_00; etc.
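The test names sketched in the comment above can be made concrete. A framework-free sketch (plain assertions; swap in xUnit or NUnit in a real suite — and note the article's GetDiscount takes an Order, while this sketch inlines the same rule on a decimal total for self-containment):

```csharp
using System;

public static class DiscountRule
{
    // Same rule as the article's GetDiscount: strictly above 100 earns 10%,
    // rounded to 2 decimals away from zero
    public static decimal GetDiscount(decimal total) =>
        total <= 100m ? 0m : Math.Round(total * 0.10m, 2, MidpointRounding.AwayFromZero);
}

public static class DiscountRuleTests
{
    public static void Run()
    {
        Check(DiscountRule.GetDiscount(100.00m) == 0m);     // boundary: exactly 100 -> no discount
        Check(DiscountRule.GetDiscount(100.01m) == 10.00m); // 10.001 rounds to 10.00
        Check(DiscountRule.GetDiscount(200.00m) == 20.00m);
    }

    static void Check(bool ok) { if (!ok) throw new Exception("discount rule violated"); }
}
```

The boundary case (exactly 100) is the one AI-generated code gets wrong with >=, so it belongs in the suite by name.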
Takeaway: Use these exact prompts and full bad/good pairs as review checklists. Architecture = layers and dependencies; security = parameterised queries, no secrets in code; edge cases = null, empty, boundaries; consistency = naming and async convention; domain = rules and rounding encoded in tests. See What to do: review, tests, ownership and Mitigation checklist.
Architecture failures in depth
Local optima. AI suggests code that works in the narrow context (e.g. “add an endpoint”) but violates Clean Architecture (e.g. controller calling repository directly, or use case importing infrastructure). Why: Models are trained on snippets and common patterns, not your architecture doc or dependency rules. Mitigation: Humans own architecture; document layers and boundaries; prompt within bounds (“add a use case for X”); review for dependency direction and layer placement. See What AI IDEs Get Right and Wrong.
Over- and under-engineering. AI may add unnecessary abstraction (e.g. a factory for one implementation) or put logic in the wrong place (e.g. SQL in a controller). Mitigation: Review for fit with actual scale and team conventions; use SOLID and design patterns as guidance, not blind acceptance of AI suggestions.
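The “factory for one implementation” case above, sketched side by side (names are illustrative; the flat 20% rate is an assumption for the sketch):

```csharp
// BAD (illustrative): interface + factory for a single implementation that never varies
public interface ITaxCalculator { decimal Calculate(decimal amount); }
public class DefaultTaxCalculator : ITaxCalculator
{
    public decimal Calculate(decimal amount) => amount * 0.20m;
}
public class TaxCalculatorFactory
{
    public ITaxCalculator Create() => new DefaultTaxCalculator();
}

// GOOD: a plain class is enough until a second implementation actually exists
public class TaxCalculator
{
    public decimal Calculate(decimal amount) => amount * 0.20m;
}
```

Both versions compute the same result; the first just adds two indirections that every reader and test must now traverse for no benefit.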
Edge-case and correctness failures in depth
Rare inputs and boundaries. Null, empty list, zero, negative values, max length—AI often generates happy-path code and misses these. Off-by-one and boundary bugs are common in generated loops and conditionals. Mitigation: Unit and integration tests that cover edge cases; review with null and boundaries in mind; never assume “AI wrote it” means “it’s correct.” See How AI Is Changing Code Review and Testing.
Concurrency. Races, deadlocks, thread safety—AI rarely models these correctly. Generated code may lack synchronisation or use the wrong primitives. Mitigation: Human design and review for concurrency-sensitive paths; tests (e.g. stress, concurrency tests) where appropriate; avoid relying on AI for locking or async coordination without verification.
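A minimal sketch of the kind of bug meant here: a non-atomic increment next to its atomic fix (the counter class is illustrative):

```csharp
using System.Threading;

public class RequestCounter
{
    private int _count;

    // BAD: ++ is a read-modify-write; concurrent callers can lose increments
    public void IncrementUnsafe() => _count++;

    // GOOD: atomic increment survives concurrent callers
    public void Increment() => Interlocked.Increment(ref _count);

    public int Count => Volatile.Read(ref _count);
}
```

Under a Parallel.For with many iterations, IncrementUnsafe typically undercounts while Increment stays exact — which is why a review must check the primitive, not just whether the code compiles.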
Consistency and style failures in depth
Cross-file drift. Naming (e.g. async suffix, DTO vs Model), error handling (exceptions vs Result), and structure (where logic lives) can differ when many files are generated or edited by AI. Why: No global view of the repo; each suggestion is local. Mitigation: Linters, formatters, documented style guides; human review for architectural and cross-file consistency; codebase-aware tools where possible. See What Developers Want From AI (Consistency) and Impact on Code Quality.
By stack and language
Strong training data (e.g. JavaScript/TypeScript, C#/.NET, Python, React): AI often produces plausible and consistent code; the failure modes (architecture, security, edge cases) still apply—review and tests remain essential. Niche or legacy (e.g. COBOL, internal DSLs, old frameworks): Less training data; AI may suggest generic or wrong patterns. Use AI for explanation and scaffolding; human ownership for logic and integration. Polyglot repos: Context can mix languages and confuse the model; limit AI to single-language or bounded areas where possible. See What AI IDEs Get Right and Wrong (By stack and language).
When to involve security or domain experts
Security. Involve security or app-sec review for auth, secrets, injection-prone code, PII handling, and compliance-sensitive paths. Do not rely on AI suggestions or AI review tools as sufficient for sign-off. Domain. Involve domain experts when business rules, eligibility, rounding, dates, or regulations are encoded in code; tests that encode rules catch misunderstandings; AI for scaffolding only. See Securing APIs, OWASP, and Domain-Driven Design.
Key terms
Local optima: AI optimises for the narrow context (e.g. one file, one prompt) and may violate system-wide constraints (architecture, consistency).
Consistency drift: Style or patterns differ across files when AI generates or edits without a global view of the repo.
Mitigation checklist: before, during, and after
Before using AI. Document layers and patterns (e.g. Clean Architecture, SOLID) so review has a baseline. Define scope (e.g. no AI for auth, payment, PII). Enable linters and formatters so mechanical drift is caught.
During use. Review all AI-generated code for architecture, security, edge cases, and consistency. Expand AI-scaffolded tests for edge cases and business rules. Reject or refactor output that violates standards. Require explanation for opaque or critical code.
After merge. Measure defect rate and time to change; refactor hotspots when debt appears. Revisit norms when signals worsen. See Impact on Code Quality (Checklist, When to tighten standards).
Trusting AI for “one big change”: Letting AI design or refactor across many files often produces inconsistent or broken results. Break work into bounded tasks; review each step—see How Developers Are Integrating AI.
Missing edge cases in generated tests: AI tends to generate happy path tests; nulls, boundaries, and concurrency are often missed. Add and maintain edge-case tests yourself—see Testing strategies.
Security “suggestions” that are wrong: AI can suggest insecure defaults (e.g. weak crypto, string concatenation for SQL). Never rely on AI alone for auth, secrets, or injection—see OWASP and Securing APIs.
Best practices and pitfalls
Do:
Use AI for repetition and scaffolding; review and test everything; own architecture and security.
Break work into bounded tasks (one layer, one responsibility); use Clean Architecture and SOLID so generated code stays in the right place.
Encode business rules and edge cases in tests; use domain experts for nuance—see Domain-Driven Design.
Do not:
Trust AI for architecture, security, or business rules without verification.
AI still fails at architecture, edge cases, security depth, consistency, and domain nuance—mitigate with review, tests, ownership, and clear boundaries so you use AI where it helps and verify where it does not. Assuming AI output is correct for security or business rules leads to wrong or inconsistent outcomes; defining where AI is allowed and where human-only applies reduces risk. Next, list the failure modes that apply to your stack (architecture, security, edge cases, consistency), then set explicit boundaries (e.g. AI only within this layer) and make review mandatory for generated code in those areas.
AI still fails at architecture, edge cases, security, consistency, and domain nuance.
Mitigate with review, tests, ownership, and clear boundaries; use AI where it helps, verify where it does not.
The article states where current AI coding tools tend to fail (architecture, edge cases, security depth, consistency, domain nuance) based on how they work: they optimise locally, they don’t have full system or threat-model context, and they can suggest outdated or wrong patterns. That’s a constraint of the technology, not a verdict on whether to use it. The stance is: use AI where it helps; verify and own where it doesn’t.
Trade-Offs & Failure Modes
Relying on AI for areas where it typically fails (e.g. architecture, security) without human verification increases the chance of wrong or inconsistent outcomes. Adding more review and tests mitigates but doesn’t remove the risk. Failure modes: assuming AI output is correct for security or business rules; using AI for cross-file or system-wide consistency without human oversight.
What Most Guides Miss
Many guides list “what AI can do” and underplay “where it fails and why.” The failure modes are predictable from how the tools work (local optimisation, no full context). Another gap: mitigation is review, tests, and clear boundaries—not “better prompts” alone. Boundaries (e.g. “AI only within this layer”) need to be explicit.
Decision Framework
If the task is local, repetitive, or well-scoped → AI often helps; still review the result.
If the task is architecture, security, or cross-cutting consistency → Don’t rely on AI alone; humans must own and verify.
For edge cases and business rules → Assume AI can miss them; tests and review are the safety net.
For any use → Define where AI is allowed and where human-only applies.
Key Takeaways
AI tends to fail at architecture, edge cases, security depth, consistency, and domain nuance; the causes are structural (local optimisation, lack of full context).
Mitigate with review, tests, ownership, and clear boundaries; use AI where it helps, verify where it doesn’t.
Don’t assume AI output is correct for security or business-critical code without human verification.
When I Would Use This Again — and When I Wouldn’t
Use this framing when a team wants a clear picture of where AI coding tools fall short so they can set boundaries and review accordingly. Don’t use it to argue that AI is useless; the point is to match use to what the tools can and can’t reliably do.
Frequently Asked Questions
Where does AI fail most in software development?
Most often: architecture (local optima), edge cases (rare inputs, boundaries), security (wrong or outdated advice), consistency (style drift), and domain (business rules). See the article sections above.
Can AI do architecture and design?
Not reliably. AI tends to optimise locally and can break Clean Architecture or team patterns. Humans should own architecture; use AI for implementation within agreed boundaries.
Is AI-generated code secure?
Not by default. AI can suggest insecure patterns. Never rely on AI alone for auth, secrets, or injection. Use security review and Securing APIs, OWASP.
Can AI help with legacy code?
Yes, for explanation and scaffolding (e.g. tests, wrappers). Risks are higher: wrong assumptions, breaking callers. Use small steps and review—see Developers integrating AI.
Why does AI suggest insecure code sometimes?
Models are trained on public code, which often contains insecure patterns. They do not “know” your threat model. Always review security-sensitive paths; use Securing APIs and OWASP.
How do I know when to trust AI output?
Trust for repetitive, pattern-based tasks in bounded scope (one file, one layer). Verify for architecture, security, edge cases, and business rules. See What to do: review, tests, ownership above.
Does AI fail less in some languages or stacks?
Strong training data (e.g. JavaScript/TypeScript, C#/.NET, Python) often yields more plausible code; the failure modes (architecture, security, edge cases, consistency) still apply. Niche or legacy stacks fail more (wrong or outdated patterns)—use AI for explanation and scaffolding; human ownership for logic. See By stack and language.
When should we involve security or domain experts?
Security: For auth, secrets, injection, PII, compliance—always human security review; never AI-only. Domain: For business rules, eligibility, rounding, dates—domain experts and tests that encode rules; AI for scaffolding only. See When to involve security or domain experts.
Related Guides & Resources
Explore the matching guide, related services, and more articles.