👋 Hi, I'm Waqas — a Software Architect and Technical Consultant specializing in .NET, Azure, microservices, and API-first system design.
I help companies build reliable, maintainable, and high-performance backend platforms that scale.
Where AI Still Fails in Real-World Software Development
Where AI coding tools still fail: architecture, edge cases, and domain nuance.
January 21, 2026 · Waqas Ahmad
Introduction
AI coding tools still fail in predictable ways: architecture decisions, rare edge cases, security-sensitive code, consistency across a codebase, and domain nuance. This article maps where AI still fails in real-world software development—architecture and design, edge cases and correctness, security, consistency and style, domain and business rules—and what to do (review, tests, ownership) so you can use AI where it helps and verify where it does not. For architects and tech leads, targeting AI at safe tasks (boilerplate, scaffolding) and applying review and tests where it fails (architecture, security, edge cases, business rules) keeps risk bounded.
When this applies: Teams or developers using AI coding tools who want a clear picture of where current tools tend to fail so they can set boundaries and review accordingly.
When it doesn’t: Teams that don’t use AI or that only want tool recommendations. This article is about failure modes, not tools.
Scale: Any team size; the failure modes (architecture, edge cases, security, consistency, domain) are structural.
Constraints: Mitigation requires review, tests, and explicit boundaries; without them, reliance on AI in weak areas is riskier.
Non-goals: This article doesn’t argue that AI is useless; it states where it tends to fail and how to mitigate.
Why “where AI fails” matters
Knowing where AI fails lets you target it at tasks it handles well (boilerplate, patterns, explanations) and avoid trusting it for high-stakes areas (architecture, security, rare bugs) without verification. That reduces debt and risk—see Trade-Offs of Relying on AI for Code Generation and What AI IDEs Get Right — and What They Get Wrong. Without this map, teams often over-trust AI in architecture or security and under-review in edge cases—leading to rework and incidents that could have been caught by review and tests.
How to use this map: For each area (architecture, edge cases, security, consistency, domain), the sections below summarise how AI often fails and what to do. Review, tests, and ownership are the common mitigations; linters and style guides help consistency; domain experts and tests help business rules. Use this as a checklist when reviewing AI-generated code.
Architecture and design
AI tends to optimise locally: it suggests code that works in the narrow context you gave, not necessarily code that fits your Clean Architecture, microservices, or scale. It can break dependency rules (e.g. a use case importing infrastructure), ignore existing patterns (e.g. a new service that does not follow your repository or dependency injection style), and over-engineer (e.g. unnecessary abstraction) or under-engineer (e.g. logic in the wrong layer). Ownership: Humans own architecture; use AI for implementation within agreed boundaries. See What AI IDEs Get Right — and What They Get Wrong. Concrete failure: “Add a new API for orders” may produce a controller that calls the database directly instead of going through a service and repository—review must enforce layering.
Edge cases and correctness
AI is trained on common patterns; rare inputs, boundary conditions, nulls, and concurrency are where it often fails. Generated code can look right and fail in production. What to do: Unit and integration tests, review with edge cases in mind, and never assume “AI wrote it” means “it’s correct.” See How AI Is Changing Code Review and Testing.
Security
AI can suggest insecure defaults (e.g. weak crypto, SQL concatenation, hardcoded secrets), wrong threat models, or outdated advice. Never rely on AI alone for auth, secrets, injection, or compliance. What to do: Security review, OWASP and Securing APIs, and treat AI output as untrusted in security-sensitive paths.
Consistency and style
Across many files, AI can drift in naming, structure, and patterns. One file may look like your style; the next may not. What to do: Linters, formatters, style guides, and human review to keep consistency. Technical leadership can set norms and templates.
Domain and business rules
AI does not know your business rules, regulations, or domain nuance. It can guess and get it wrong (e.g. rounding, dates, eligibility). What to do: Domain experts and tests that encode rules; use AI for scaffolding, not authoritative business logic. See Domain-Driven Design for clarity on boundaries.
Architecture: A team asked AI to “add a new feature” and received a monolithic handler that mixed HTTP, business logic, and DB access. Fix: Define layers and prompt within them (e.g. “add a use case for X in the application layer”); review for dependency direction.
Edge cases: Generated code passed unit tests for the happy path but failed when input was null or empty. Fix: Add edge-case tests; review with null and boundaries in mind.
Security: AI suggested string concatenation for SQL and hardcoded config. Fix: Never trust AI for auth, secrets, or injection; use OWASP and security review.
Consistency: Multi-file generation produced different naming and error-handling style. Fix: Linters, formatters, human review.
Domain: AI guessed a business rule (e.g. rounding) and got it wrong. Fix: Domain experts and tests that encode rules; use AI for scaffolding only.
Code-level examples: what failure looks like in real code
Below are exact prompts, full bad AI output (what the model returns in theory), what goes wrong at build or runtime, and full correct code so you see concrete issues at code level—not just description. Use them when reviewing or training the team.
Example 1: Architecture — prompt vs full bad vs full good
Exact prompt: “Add an endpoint to fetch a user by ID and return their profile.”
What you get in theory (bad AI output): Controller injects DbContext, does mapping and logic in the controller, and violates Clean Architecture.
// BAD: Controller with infra dependency and logic in wrong place
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;

namespace MyApp.Api.Controllers
{
    public class UserController : ControllerBase
    {
        private readonly AppDbContext _db;
        public UserController(AppDbContext db) => _db = db;

        [HttpGet("{id}")]
        public async Task<IActionResult> Get(int id)
        {
            var user = await _db.Users.FindAsync(id);
            if (user == null) return NotFound();
            var profile = new UserProfileDto
            {
                FullName = user.FirstName + " " + user.LastName,
                Email = user.Email,
                CreatedAt = user.CreatedAt
            };
            return Ok(profile);
        }
    }
}
What goes wrong at code level: Controller depends on infrastructure (DbContext); mapping and concatenation logic live in the API layer; hard to unit-test and to change persistence or profile shape. Result in theory: Rework when you introduce a use case layer; tests require a real or mocked DbContext at the controller.
Correct approach (full good code): Controller only delegates; use case and repository live in their layers.
// GOOD: Controller delegates; no DbContext; use case uses IUserRepository
namespace MyApp.Api.Controllers
{
    public class UserController : ControllerBase
    {
        private readonly IGetUserProfileUseCase _getUserProfileUseCase;
        public UserController(IGetUserProfileUseCase getUserProfileUseCase) =>
            _getUserProfileUseCase = getUserProfileUseCase;

        [HttpGet("{id}")]
        public async Task<IActionResult> Get(int id)
        {
            var result = await _getUserProfileUseCase.Execute(id);
            return result.Match<IActionResult>(Ok, _ => NotFound());
        }
    }
}

// Application layer: IGetUserProfileUseCase.Execute(id) uses IUserRepository, maps to UserProfileDto
// Infrastructure: IUserRepository implemented with DbContext in a separate project
Example 2: Security — SQL injection and hardcoded secret
Exact prompt: “Write code to search users by email and log to the database.”
What you get in theory (bad AI output): User input concatenated into SQL; hardcoded connection or secret.
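A minimal sketch of what that bad output can look like. The class and member names here are illustrative, and the query-building is separated into its own method only to make the flaw easy to see; the generated code would then execute the string via SqlCommand.

```csharp
// BAD (illustrative sketch): user input concatenated into SQL, secret in source
public static class UserSearchSql
{
    // BAD: connection string (with password) hardcoded and committed to source control
    public const string ConnStr =
        "Server=prod-sql;Database=App;User Id=app;Password=SuperSecret123;";

    // BAD: email is concatenated straight into the SQL text -> injection
    public static string BuildSql(string email) =>
        "SELECT * FROM Users WHERE Email = '" + email + "'";
}
```

With the payload from the next paragraph, the built string contains attacker-controlled SQL, which is exactly why the query must be parameterised instead.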
What goes wrong at code level: Input email = "'; DROP TABLE Users; --" executes arbitrary SQL; secret in source control and compliance breach. Result in theory: Security incident and data loss.
Correct approach (full good code): Parameterised query; configuration for connection.
// GOOD: Parameterised; no secrets in code
public class UserSearchService
{
    private readonly IConfiguration _config;
    public UserSearchService(IConfiguration config) => _config = config;

    public async Task<User?> FindByEmail(string email)
    {
        var connStr = _config.GetConnectionString("Default");
        await using var conn = new SqlConnection(connStr);
        await conn.OpenAsync();
        await using var cmd = new SqlCommand("SELECT * FROM Users WHERE Email = @Email", conn);
        cmd.Parameters.AddWithValue("@Email", email);
        await using var reader = await cmd.ExecuteReaderAsync();
        return await reader.ReadAsync() ? MapUser(reader) : null;
        // MapUser maps the data reader to a User entity (defined elsewhere)
    }
}
Example 3: Edge cases — null and boundaries
Exact prompt: “Implement a method that returns the first line item amount for an order.”
What you get in theory (bad AI output): No null or empty check; IndexOutOfRangeException or NullReferenceException in production.
// BAD: Crashes when order or LineItems null/empty
public decimal GetFirstAmount(Order order)
{
    return order.LineItems[0].Amount;
}
What goes wrong at code level: order null → NullReferenceException; LineItems null or empty → NullReferenceException or IndexOutOfRangeException. Result in theory: Runtime crash when orders are empty or partially loaded.
Correct approach (full good code): Explicit null and empty handling; documented behaviour.
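A sketch of that good version, assuming minimal Order and LineItem types and a nullable return to make “no first item” an explicit, documented outcome (your domain may prefer 0m or an exception instead):

```csharp
using System.Collections.Generic;

// Minimal types assumed for the sketch
public record LineItem(decimal Amount);
public class Order { public List<LineItem>? LineItems { get; set; } }

public static class OrderPricing
{
    // GOOD: explicit null and empty handling; returns null when there is no first item
    public static decimal? GetFirstAmount(Order? order)
    {
        if (order?.LineItems == null || order.LineItems.Count == 0)
            return null;
        return order.LineItems[0].Amount;
    }
}
```

The nullable return forces every caller to decide what “no line items” means, instead of discovering it as a crash in production.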
Example 4: Consistency — naming and signatures across files
Exact prompt (file A): “Add a method to get order by ID.” Exact prompt (file B): “Add method to fetch order.”
What you get in theory (bad AI output): Different names and signatures across files; inconsistent with existing repo.
// File A (OrderService.cs): GetOrderByIdAsync
public async Task<Order> GetOrderByIdAsync(int id)
{
    return await _repo.FindById(id);
}

// File B (PaymentService.cs): FetchOrder, sync, different return
public Order FetchOrder(int id)
{
    return _orderRepo.GetById(id); // blocking; different naming
}
What goes wrong at code level: Two conventions in one codebase; callers mix GetOrderByIdAsync and FetchOrder; async vs sync deadlocks or confusion. Result in theory: Review churn and maintainability debt.
Correct approach (full good code): Single convention; align with style guide and existing repos.
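One way the aligned version can look, assuming a shared IOrderRepository contract and the repo-wide async-with-Async-suffix convention (the types here are minimal stand-ins):

```csharp
using System.Threading.Tasks;

public class Order { public int Id { get; set; } }

// One repository contract, one naming convention (async + Async suffix)
public interface IOrderRepository
{
    Task<Order?> GetByIdAsync(int id);
}

public class OrderService
{
    private readonly IOrderRepository _orders;
    public OrderService(IOrderRepository orders) => _orders = orders;

    public Task<Order?> GetOrderByIdAsync(int id) => _orders.GetByIdAsync(id);
}

public class PaymentService
{
    private readonly IOrderRepository _orders;
    public PaymentService(IOrderRepository orders) => _orders = orders;

    // Same name, same async shape as OrderService — no sync FetchOrder variant
    public Task<Order?> GetOrderByIdAsync(int id) => _orders.GetByIdAsync(id);
}
```

Because both services share the interface and the name, callers cannot mix a blocking and a non-blocking path for the same operation.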
Example 5: Domain — business rules and rounding
What goes wrong at code level: Product says “discount above 100” but the generated code uses >= (so exactly 100 gets a discount); currency rounding can differ (banker’s vs half-up); with no tests, regressions go unnoticed. Result in theory: Wrong discount in production and disputes.
// GOOD: Rule explicit; rounding specified; tests encode rule
public decimal GetDiscount(Order order)
{
    if (order.Total <= 100m) return 0m;
    var discount = order.Total * 0.10m;
    return Math.Round(discount, 2, MidpointRounding.AwayFromZero);
}
// Unit tests: GetDiscount_WhenTotalIs100_Returns0; GetDiscount_WhenTotalIs100_01_Returns10_00; etc.
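The test names sketched in the comment above can be made concrete. A framework-free sketch (plain assertions; swap in xUnit or NUnit in a real suite — and note the article's GetDiscount takes an Order, while this sketch inlines the same rule on a decimal total for self-containment):

```csharp
using System;

public static class DiscountRule
{
    // Same rule as the article's GetDiscount: strictly above 100 earns 10%,
    // rounded to 2 decimals away from zero
    public static decimal GetDiscount(decimal total) =>
        total <= 100m ? 0m : Math.Round(total * 0.10m, 2, MidpointRounding.AwayFromZero);
}

public static class DiscountRuleTests
{
    public static void Run()
    {
        Check(DiscountRule.GetDiscount(100.00m) == 0m);     // boundary: exactly 100 -> no discount
        Check(DiscountRule.GetDiscount(100.01m) == 10.00m); // 10.001 rounds to 10.00
        Check(DiscountRule.GetDiscount(200.00m) == 20.00m);
    }

    static void Check(bool ok) { if (!ok) throw new Exception("discount rule violated"); }
}
```

The boundary case (exactly 100) is the one AI-generated code gets wrong with >=, so it belongs in the suite by name.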
Takeaway: Use these exact prompts and full bad/good pairs as review checklists. Architecture = layers and dependencies; security = parameterised queries, no secrets in code; edge cases = null, empty, boundaries; consistency = naming and async convention; domain = rules and rounding encoded in tests. See What to do: review, tests, ownership and Mitigation checklist.
Architecture failures in depth
Local optima. AI suggests code that works in the narrow context (e.g. “add an endpoint”) but violates Clean Architecture (e.g. controller calling repository directly, or use case importing infrastructure). Why: Models are trained on snippets and common patterns, not your architecture doc or dependency rules. Mitigation: Humans own architecture; document layers and boundaries; prompt within bounds (“add a use case for X”); review for dependency direction and layer placement. See What AI IDEs Get Right and Wrong.
Over- and under-engineering. AI may add unnecessary abstraction (e.g. a factory for one implementation) or put logic in the wrong place (e.g. SQL in a controller). Mitigation: Review for fit with actual scale and team conventions; use SOLID and design patterns as guidance, not blind acceptance of AI suggestions.
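The “factory for one implementation” case above, sketched side by side (names are illustrative; the flat 20% rate is an assumption for the sketch):

```csharp
// BAD (illustrative): interface + factory for a single implementation that never varies
public interface ITaxCalculator { decimal Calculate(decimal amount); }
public class DefaultTaxCalculator : ITaxCalculator
{
    public decimal Calculate(decimal amount) => amount * 0.20m;
}
public class TaxCalculatorFactory
{
    public ITaxCalculator Create() => new DefaultTaxCalculator();
}

// GOOD: a plain class is enough until a second implementation actually exists
public class TaxCalculator
{
    public decimal Calculate(decimal amount) => amount * 0.20m;
}
```

Both versions compute the same result; the first just adds two indirections that every reader and test must now traverse for no benefit.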
Edge-case and correctness failures in depth
Rare inputs and boundaries. Null, empty list, zero, negative values, max length—AI often generates happy-path code and misses these. Off-by-one and boundary bugs are common in generated loops and conditionals. Mitigation: Unit and integration tests that cover edge cases; review with null and boundaries in mind; never assume “AI wrote it” means “it’s correct.” See How AI Is Changing Code Review and Testing.
Concurrency. Races, deadlocks, thread safety—AI rarely models these correctly. Generated code may lack synchronisation or use the wrong primitives. Mitigation: Human design and review for concurrency-sensitive paths; tests (e.g. stress, concurrency tests) where appropriate; avoid relying on AI for locking or async coordination without verification.
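A minimal sketch of the kind of bug meant here: a non-atomic increment next to its atomic fix (the counter class is illustrative):

```csharp
using System.Threading;

public class RequestCounter
{
    private int _count;

    // BAD: ++ is a read-modify-write; concurrent callers can lose increments
    public void IncrementUnsafe() => _count++;

    // GOOD: atomic increment survives concurrent callers
    public void Increment() => Interlocked.Increment(ref _count);

    public int Count => Volatile.Read(ref _count);
}
```

Under a Parallel.For with many iterations, IncrementUnsafe typically undercounts while Increment stays exact — which is why a review must check the primitive, not just whether the code compiles.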
Consistency and style failures in depth
Cross-file drift. Naming (e.g. async suffix, DTO vs Model), error handling (exceptions vs Result), and structure (where logic lives) can differ when many files are generated or edited by AI. Why: No global view of the repo; each suggestion is local. Mitigation: Linters, formatters, documented style guides; human review for architectural and cross-file consistency; codebase-aware tools where possible. See What Developers Want From AI (Consistency) and Impact on Code Quality.
By stack and language
Strong training data (e.g. JavaScript/TypeScript, C#/.NET, Python, React): AI often produces plausible and consistent code; the failure modes (architecture, security, edge cases) still apply—review and tests remain essential. Niche or legacy (e.g. COBOL, internal DSLs, old frameworks): Less training data; AI may suggest generic or wrong patterns. Use AI for explanation and scaffolding; human ownership for logic and integration. Polyglot repos: Context can mix languages and confuse the model; limit AI to single-language or bounded areas where possible. See What AI IDEs Get Right and Wrong (By stack and language).
When to involve security or domain experts
Security. Involve security or app-sec review for auth, secrets, injection-prone code, PII handling, and compliance-sensitive paths. Do not rely on AI suggestions or AI review tools as sufficient for sign-off. Domain. Involve domain experts when business rules, eligibility, rounding, dates, or regulations are encoded in code; tests that encode rules catch misunderstandings; AI for scaffolding only. See Securing APIs, OWASP, and Domain-Driven Design.
Key terms
Local optima: AI optimises for the narrow context (e.g. one file, one prompt) and may violate system-wide constraints (architecture, consistency).
Consistency drift: Style or patterns differ across files when AI generates or edits without a global view of the repo.
Mitigation checklist: before, during, and after
Before using AI. Document layers and patterns (e.g. Clean Architecture, SOLID) so review has a baseline. Define scope (e.g. no AI for auth, payment, PII). Enable linters and formatters so mechanical drift is caught.
During use. Review all AI-generated code for architecture, security, edge cases, and consistency. Expand AI-scaffolded tests for edge cases and business rules. Reject or refactor output that violates standards. Require explanation for opaque or critical code.
After merge. Measure defect rate and time to change; refactor hotspots when debt appears. Revisit norms when signals worsen. See Impact on Code Quality (Checklist, When to tighten standards).
Trusting AI for “one big change”: Letting AI design or refactor across many files often produces inconsistent or broken results. Break work into bounded tasks; review each step—see How Developers Are Integrating AI.
Missing edge cases in generated tests: AI tends to generate happy path tests; nulls, boundaries, and concurrency are often missed. Add and maintain edge-case tests yourself—see Testing strategies.
Security “suggestions” that are wrong: AI can suggest insecure defaults (e.g. weak crypto, string concatenation for SQL). Never rely on AI alone for auth, secrets, or injection—see OWASP and Securing APIs.
Best practices and pitfalls
Do:
Use AI for repetition and scaffolding; review and test everything; own architecture and security.
Break work into bounded tasks (one layer, one responsibility); use Clean Architecture and SOLID so generated code stays in the right place.
Encode business rules and edge cases in tests; use domain experts for nuance—see Domain-Driven Design.
Do not:
Trust AI for architecture, security, or business rules without verification.
AI still fails at architecture, edge cases, security depth, consistency, and domain nuance—mitigate with review, tests, ownership, and clear boundaries so you use AI where it helps and verify where it does not. Assuming AI output is correct for security or business rules leads to wrong or inconsistent outcomes; defining where AI is allowed and where human-only applies reduces risk. Next, list the failure modes that apply to your stack (architecture, security, edge cases, consistency), then set explicit boundaries (e.g. AI only within this layer) and make review mandatory for generated code in those areas.
AI still fails at architecture, edge cases, security, consistency, and domain nuance.
Mitigate with review, tests, ownership, and clear boundaries; use AI where it helps, verify where it does not.
The article states where current AI coding tools tend to fail (architecture, edge cases, security depth, consistency, domain nuance) based on how they work: they optimise locally, they don’t have full system or threat-model context, and they can suggest outdated or wrong patterns. That’s a constraint of the technology, not a verdict on whether to use it. The stance is: use AI where it helps; verify and own where it doesn’t.
Trade-Offs & Failure Modes
Relying on AI for areas where it typically fails (e.g. architecture, security) without human verification increases the chance of wrong or inconsistent outcomes. Adding more review and tests mitigates but doesn’t remove the risk. Failure modes: assuming AI output is correct for security or business rules; using AI for cross-file or system-wide consistency without human oversight.
What Most Guides Miss
Many guides list “what AI can do” and underplay “where it fails and why.” The failure modes are predictable from how the tools work (local optimisation, no full context). Another gap: mitigation is review, tests, and clear boundaries—not “better prompts” alone. Boundaries (e.g. “AI only within this layer”) need to be explicit.
Decision Framework
If the task is local, repetitive, or well-scoped → AI often helps; still review the result.
If the task is architecture, security, or cross-cutting consistency → Don’t rely on AI alone; humans must own and verify.
For edge cases and business rules → Assume AI can miss them; tests and review are the safety net.
For any use → Define where AI is allowed and where human-only applies.
Key Takeaways
AI tends to fail at architecture, edge cases, security depth, consistency, and domain nuance; the causes are structural (local optimisation, lack of full context).
Mitigate with review, tests, ownership, and clear boundaries; use AI where it helps, verify where it doesn’t.
Don’t assume AI output is correct for security or business-critical code without human verification.
When I Would Use This Again — and When I Wouldn’t
Use this framing when a team wants a clear picture of where AI coding tools fall short so they can set boundaries and review accordingly. Don’t use it to argue that AI is useless; the point is to match use to what the tools can and can’t reliably do.
Frequently Asked Questions
Where does AI fail most in software development?
Most often: architecture (local optima), edge cases (rare inputs, boundaries), security (wrong or outdated advice), consistency (style drift), and domain (business rules). See the article sections above.
Can AI do architecture and design?
Not reliably. AI tends to optimise locally and can break Clean Architecture or team patterns. Humans should own architecture; use AI for implementation within agreed boundaries.
Is AI-generated code secure?
Not by default. AI can suggest insecure patterns. Never rely on AI alone for auth, secrets, or injection. Use security review and Securing APIs, OWASP.
Can AI help with legacy code?
Yes, for explanation and scaffolding (e.g. tests, wrappers). Risks are higher: wrong assumptions, breaking callers. Use small steps and review—see Developers integrating AI.
Why does AI suggest insecure code sometimes?
Models are trained on public code, which often contains insecure patterns. They do not “know” your threat model. Always review security-sensitive paths; use Securing APIs and OWASP.
How do I know when to trust AI output?
Trust for repetitive, pattern-based tasks in bounded scope (one file, one layer). Verify for architecture, security, edge cases, and business rules. See What to do: review, tests, ownership above.
Does AI fail less in some languages or stacks?
Strong training data (e.g. JavaScript/TypeScript, C#/.NET, Python) often yields more plausible code; the failure modes (architecture, security, edge cases, consistency) still apply. Niche or legacy stacks fail more (wrong or outdated patterns)—use AI for explanation and scaffolding; human ownership for logic. See By stack and language.
When should we involve security or domain experts?
Security: For auth, secrets, injection, PII, compliance—always human security review; never AI-only. Domain: For business rules, eligibility, rounding, dates—domain experts and tests that encode rules; AI for scaffolding only. See When to involve security or domain experts.
Related Guides & Resources
Explore the matching guide, related services, and more articles.