
Quality Gates and QA for AI Generated Content

  • Writer: Mark Waldron
  • Mar 19
  • 5 min read


The Problem

So I had hundreds of content files across multiple languages for whentotravel.com, generated by the process I described in the previous post. The obvious question hit me fairly quickly: how do I know if any of this is actually good?


I could read every file. All of them. In multiple languages, most of which I don't speak. That was clearly not happening. But publishing without review felt like a bad idea. I needed something in the middle — a way to flag problems at scale so I could focus my attention where it mattered rather than reading every word of every file.


What Was Going Wrong

The content was structurally correct. The Zod schema validation in the generation script took care of that. But structural correctness and quality are different things. After manually reading through a sample, I started spotting patterns. They fell into three categories.


Register violations. German content occasionally slipped from the formal Sie to the informal du. Japanese content mixed plain form (da/dearu) with polite form (desu/masu) within the same file. This might sound minor, but in some cultures inconsistent formality reads as either rude or confused. You wouldn't switch between "you" and "thou" halfway through an English article. Same principle.


Tone mismatches. I had defined tone guidance for each locale — French should be "elegant and cultivated but not stuffy", German should be "factual and precise". Some files nailed it. Others didn't. The French content sometimes read like a Wikipedia article. The German content was occasionally too flowery. When you can't read the language fluently yourself, these things are hard to catch manually.


AI writing patterns. Despite the banned words list in the generation prompts, some content still had that unmistakable AI flavour. Formulaic transitions ("Whether you're... or..."), hedging filler ("it's worth noting that"), repetitive sentence structures, emotional inflation ("an unforgettable experience you won't want to miss"). Readers can smell this stuff. I can smell it.


Three Checks

I built three separate content quality checks, each targeting one of these categories. I implemented them as custom Claude Code skills: reusable workflows defined in Markdown that you can invoke with a slash command. Think of them as purpose-built tools for the AI assistant.
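If you haven't seen one, a Claude Code command file really is just a Markdown prompt. A stripped-down sketch of what a check-content command could contain (the path, wording, and steps here are illustrative, not my actual skill file):

```markdown
<!-- .claude/commands/check-content.md (illustrative sketch) -->
Run the content quality gate on the file given in $ARGUMENTS.

1. Load the locale rules (register, tone, avoid-words) from the config.
2. Check register consistency; report offending lines in a table.
3. Rate tone against the locale guidance: pass / warn / fail, with quotes.
4. Scan for AI writing patterns (formulaic transitions, hedging filler).
5. Output one overall severity: PASS, WARN-MINOR, WARN-MAJOR, or FAIL.
```

The point is that the "tool" is plain instructions the assistant follows, which makes it cheap to iterate on.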


Register check loads the formality rules for the target locale from a config file and scans the content for violations. This one is fairly binary — either you used Sie consistently or you didn't. The output is a pass/fail with a table showing offending lines and the expected form.
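The skill itself asks the model to judge usage in context, but the easy cases of the register check could be caught with nothing more than a word list. A minimal sketch for German formal address, with an illustrative (and deliberately incomplete) pattern:

```typescript
// Rule-based core of a register check for German. Catching informal
// address is the easy direction: du/dich/dir/dein should not appear
// in Sie-register prose. The word list here is illustrative only; a
// real check needs context (e.g. "sie" also means "she"/"they").
const INFORMAL_DE = /\b(du|dich|dir|dein\w*)\b/gi;

interface Violation {
  line: number;
  match: string;
  expected: string;
}

function checkRegister(text: string): Violation[] {
  const violations: Violation[] = [];
  text.split("\n").forEach((line, i) => {
    for (const m of line.matchAll(INFORMAL_DE)) {
      violations.push({ line: i + 1, match: m[0], expected: "Sie form" });
    }
  });
  return violations;
}

const sample = "Besuchen Sie Berlin im Winter.\nDu wirst die Museen lieben.";
console.log(checkRegister(sample));
// flags "Du" on line 2
```

The pass/fail table in the real report is essentially this list of violations, rendered per file.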


Tone check is more subjective and was harder to get right. It evaluates the prose against the tone guidance and rates it pass, warn, or fail with quoted passages and suggestions. The boundary between "factual and precise" and "dry and boring" isn't always clear. I had to iterate on this one.


AI pattern detection scans for the structural fingerprints of AI writing. Formulaic transitions, generic padding, unnatural parallelism, emotional inflation, Wikipedia voice. This catches the most issues and is also the most meta thing I've built — using AI to detect AI writing. It works better than you'd expect. The model is quite good at recognising its own bad habits.
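The skill does this holistically via the model, but the crudest tells could also be caught with a phrase list. A sketch, with example patterns of my own choosing rather than the actual banned-words config:

```typescript
// Rule-based layer of AI-pattern detection. A phrase list only catches
// the most formulaic tells; the LLM pass handles structure and rhythm.
const AI_TELLS: { label: string; pattern: RegExp }[] = [
  { label: "formulaic transition", pattern: /\bwhether you're .+ or\b/i },
  { label: "hedging filler", pattern: /\bit'?s worth noting that\b/i },
  { label: "emotional inflation", pattern: /\bunforgettable experience\b/i },
];

function findAiTells(text: string): string[] {
  return AI_TELLS.filter(t => t.pattern.test(text)).map(t => t.label);
}

const sample =
  "Whether you're a foodie or a history buff, Kyoto offers an " +
  "unforgettable experience you won't want to miss.";
console.log(findAiTells(sample));
// ["formulaic transition", "emotional inflation"]
```

A static list like this drifts out of date quickly, which is one reason handing the judgement to the model works better in practice.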


I wrapped all three into a single /check-content skill. Run it against a file and you get a unified report with a severity: PASS, WARN-MINOR (probably fine to publish), WARN-MAJOR (needs attention), or FAIL (regenerate it).
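The severity levels are from the report; how the three check results fold into one overall severity is my assumption, but "worst result wins" is the natural rule:

```typescript
// Folding individual check results into one report severity.
// The four levels come from the /check-content report; the
// worst-wins merge rule is an assumption about how they combine.
type Severity = "PASS" | "WARN-MINOR" | "WARN-MAJOR" | "FAIL";

const ORDER: Severity[] = ["PASS", "WARN-MINOR", "WARN-MAJOR", "FAIL"];

function overall(results: Severity[]): Severity {
  return results.reduce<Severity>(
    (worst, r) => (ORDER.indexOf(r) > ORDER.indexOf(worst) ? r : worst),
    "PASS",
  );
}

console.log(overall(["PASS", "WARN-MINOR", "PASS"])); // "WARN-MINOR"
console.log(overall(["PASS", "FAIL", "WARN-MAJOR"])); // "FAIL"
```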


Claude Code running the /check-content quality gate skill

The Feedback Loop

The real value isn't in any single check. It's in the loop.


Generate content. Run the checks. Fix the issues. Regenerate if needed. The --pass2-only flag from the generation script means I can regenerate just the prose while keeping the structured data (ratings, festivals) intact. So when the tone check flags a file, I don't have to start from scratch. Just rerun the writing pass.
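The loop itself is simple enough to sketch. `regenerateProse` and `runChecks` below are hypothetical stubs standing in for the generation script (its pass-2-only mode) and the /check-content skill, not my actual API:

```typescript
// The generate/check/fix loop. regenerateProse rewrites only the prose
// (the structured data -- ratings, festivals -- is untouched); runChecks
// is the quality gate. Both are hypothetical stand-ins.
type Verdict = "PASS" | "WARN-MINOR" | "WARN-MAJOR" | "FAIL";

function reviewLoop(
  file: string,
  regenerateProse: (file: string) => void,
  runChecks: (file: string) => Verdict,
  maxAttempts = 3,
): Verdict {
  let verdict = runChecks(file);
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    if (verdict !== "WARN-MAJOR" && verdict !== "FAIL") break;
    regenerateProse(file); // rerun the writing pass only
    verdict = runChecks(file);
  }
  return verdict; // still failing? queue for human review
}
```

Capping the attempts matters: a file that fails repeatedly usually has a prompt or config problem, and no amount of regeneration will fix that.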


This cycle — generate, check, fix, regenerate — is essentially what a human editor does, but faster and more consistent across hundreds of files. It doesn't replace human judgement. I still review the English files myself. And iterating on those, seeing the checks catch real problems and the rewrites actually fix them, built my confidence that the same process was working for the languages I can't read. It means I'm reviewing a handful of flagged files instead of the full set. That's a much better use of my time.


Lessons

A few things I took away from this:


If you generate content with AI, you need AI-assisted QA too. The generator and the reviewer are doing fundamentally different jobs. The generator is creative and expansive. The reviewer is critical and precise. Splitting these roles — rather than hoping the generator gets everything right first time — produces better outcomes with less manual effort.


Define your quality criteria before you build the checks. I was lucky that the locale guidance (register, tone, avoid words) already existed in my config files because I needed it for the generation prompts. Having clear, machine-readable rules made the quality checks much easier to build. If I'd tried to build the checks first without defining what "good" looks like, I'd still be going in circles.


This applies beyond content. Any time you're using AI to produce output at scale — code, data, reports, whatever — think about what your quality gate looks like. How will you know if the output is wrong? If the answer is "someone will notice eventually", you probably need a check.


I was also a little surprised at how much this changed my confidence in the generated content. Before the checks I was nervous about publishing. After, I felt much more comfortable. Not because the checks are perfect — they're not — but because I know the obvious problems have been caught. That psychological shift matters when you're deciding whether to hit publish.


More to do

I have a nice automated process for creating content now. I am happy with the result, and I have good feedback (human and AI) on the tone of voice and the overall low level of AI sloppiness. The quality is high, though noticeably less so for the Korean translations, probably because AI models have far more reference data for English and related languages. The prompts are working well, but because of the order in which I approached this, some assumptions are baked in about the kinds of rules that exist in certain languages. I went from English to Germanic and Romance languages to Thai, roughly in that order, and the assumptions I started with are beginning to show as I move into less familiar language structures. I will need to step back and fold some of my learnings in before I attempt the next set of languages on my target list: Simplified Chinese, Traditional Chinese, and Cantonese.


What I am starting to see, though, is the outline of something bigger. Right now this is a manually triggered pipeline: I decide which locations to write about, kick off the generation, run the quality checks, review the flagged ones, and publish. But most of those steps don't actually need me. The live site already has search analytics. It would not be a huge leap to have a pipeline that spots locations people are searching for that don't have content yet, generates it, runs the quality gates, and publishes anything that passes cleanly. The ones that don't pass get queued for a human to look at. A content pipeline that grows the site based on actual demand, with me only stepping in when something needs fixing. I am not there yet, but the pieces are falling into place.


