
Automated QA for Non-Deterministic AI: Testing What You Cannot Predict

Large Language Models will, given enough runs, produce every possible failure: missing sections, leaked tokens, broken formatting, invented terminology, and hallucinated data. SAGE doesn't hope for the best—it runs every AI-generated reading through an 11-gate automated validation pipeline before it ever reaches a user. This is how we test what we cannot predict.

1. The Core Problem: AI Output Is Structurally Unreliable

Every SAGE reading—whether a 7-section Daily snapshot or an 8-section Premium deep-dive—follows a rigid structural contract. Each section has a defined purpose: the mystical opening, the court analysis, the verdict banner, the elemental field, the closing guidance table. This structure is not decorative; it is the interface between the AI engine and the user experience.

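The problem is fundamental: Large Language Models are probabilistic systems being asked to produce deterministic output. No amount of prompt engineering can guarantee structural compliance over thousands of readings across 10 languages. The LLM will, eventually, produce each of the failures named above: missing sections, leaked tokens, broken formatting, invented terminology, and hallucinated data.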

These are not theoretical risks. Every one of these failures has occurred in production and was caught by the pipeline described in this article.

⚠️ The Fundamental Tension

Traditional software testing assumes deterministic output: given input X, expect output Y. With generative AI, the same input can produce different output on every run, so you cannot assert on the exact text. You must instead define structural invariants: properties that must hold true regardless of the specific prose the LLM generates.
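
To make the idea of a structural invariant concrete, here is a minimal sketch in Python. The `## N. Title` header format and the function name `check_invariants` are assumptions made for illustration; the article does not show SAGE's actual section markers.

```python
import re

def check_invariants(reading: str, expected_sections: int) -> list[str]:
    """Return the violated invariants; an empty list means the reading passes."""
    violations = []
    # Assumption: each section starts with a line such as "## 1. Title".
    headers = re.findall(r"^## \d+\. .+$", reading, flags=re.MULTILINE)
    if len(headers) != expected_sections:
        violations.append(
            f"expected {expected_sections} sections, found {len(headers)}"
        )
    # Bold markers must come in pairs, whatever the prose between them says.
    if reading.count("**") % 2 != 0:
        violations.append("unbalanced bold markers")
    return violations

# Two differently worded outputs for the same input satisfy the same invariants.
run_a = "## 1. Opening\nThe winds favor you.\n## 2. Court\n**Judge** is steady."
run_b = "## 1. Opening\nA quiet day ahead.\n## 2. Court\nThe **Judge** stands firm."
assert run_a != run_b
assert check_invariants(run_a, 2) == []
assert check_invariants(run_b, 2) == []
```

The test never compares prose to an expected string. It only asserts properties that every valid reading must share.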

2. The 11-Gate Evaluation Pipeline

SAGE's QA pipeline is a fully automated evaluation suite that runs after every AI-generated reading. It is not a spot-check or a sampling strategy—it is a mandatory gate that every reading must pass before delivery. The pipeline validates five dimensions of reading integrity: structural completeness, semantic depth, data fidelity, encoding purity, and formatting hygiene.

11 Validation Gates
10 Languages Covered
2 Traditions (Western & Ramal)
40 Evaluation Paths

| Gate | Check Name | What It Validates |
|---|---|---|
| 1 | Section Presence | All 7 (Daily) or 8 (Premium) structural sections exist in the output |
| 2 | Narrative Depth | Each section exceeds minimum character thresholds; rejects empty or platitude-only sections |
| 3 | Mandatory Labels | Mode-aware subheader verification: Premium Section 8 closing table, Daily Section 3 guidance bullets |
| 4 | Token Leakage Guard | Detects leaked internal ZRX_*_XZR placeholder tokens that were never meant for user eyes |
| 5 | Duplicate Label Guard | Ensures no structural label (e.g., "Strategic Alignment", "Vibrational Key") appears more than once |
| 6 | Verdict Banner Integrity | Validates the verdict/timing/confidence banner contains all three mandatory metrics and isn't contaminated with prose |
| 7 | Data Fidelity | Verifies that the court figures (Houses 13–15) calculated by the chart engine are actually referenced in the AI-generated text |
| 8 | Protected Term Integrity | Confirms that sacred Roman-script terms (figure names, Ramal vocabulary) survive translation into non-Latin scripts |
| 9 | Markdown Hygiene | Catches unbalanced bold markers, empty bold tags, and orphaned formatting artifacts that break rendering |
| 10 | Purushartha Table Integrity | Ensures the spiritual pillars table has populated Graha and Gati columns, not blank dashes |
| 11 | Section 3 Directness | Detects and rejects technical summary bloat in sections that should contain spiritual narrative prose |

Each gate produces a binary PASS or FAIL result. A single failure in any gate rejects the entire reading. There is no "partial pass." This is the engineering equivalent of a cryptographic checksum: the output is either structurally valid or it is not delivered.
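
The all-or-nothing rule can be sketched as a small gate runner. This is an illustration under assumptions, not SAGE's implementation: `GateResult`, `GATES`, and `evaluate` are hypothetical names, and only two of the eleven gates are shown.

```python
import re
from typing import Callable, NamedTuple

class GateResult(NamedTuple):
    gate: str
    passed: bool
    detail: str = ""

def token_leakage(reading: str) -> GateResult:
    # Gate 4: any surviving ZRX_*_XZR placeholder is a failure.
    leaks = re.findall(r"ZRX_[A-Za-z0-9_-]+?_XZR", reading)
    return GateResult("Token Leakage Guard", not leaks, ", ".join(leaks))

def markdown_hygiene(reading: str) -> GateResult:
    # Gate 9 (simplified): splitting on "**" yields an odd number of parts
    # when the markers are balanced; odd-indexed parts are the bold spans.
    parts = reading.split("**")
    if len(parts) % 2 == 0:
        return GateResult("Markdown Hygiene", False, "unbalanced bold markers")
    if any(not span.strip() for span in parts[1::2]):
        return GateResult("Markdown Hygiene", False, "empty bold tag")
    return GateResult("Markdown Hygiene", True)

GATES: list[Callable[[str], GateResult]] = [token_leakage, markdown_hygiene]

def evaluate(reading: str) -> tuple[bool, list[GateResult]]:
    """Run every gate; the reading is deliverable only if all gates pass."""
    results = [gate(reading) for gate in GATES]
    return all(result.passed for result in results), results

ok, results = evaluate("The **Judge is ZRX_Tariq_XZR today.")
failed = [result.gate for result in results if not result.passed]
assert not ok
assert failed == ["Token Leakage Guard", "Markdown Hygiene"]
```

Running every gate rather than stopping at the first failure keeps the rejection binary while still recording which checks failed, which is useful for diagnosing a bad generation.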

3. Deep Dive: The ZRX Canary Token System

One of SAGE's most distinctive QA techniques is borrowed from cybersecurity: canary tokens.

When SAGE constructs the data payload that is sent to the LLM, it does not pass raw geomantic figure names directly. Instead, it wraps every figure reference in an obfuscated placeholder pattern:

// Internal payload (never seen by users)
"house_1": "ZRX_Tariq_XZR"
"house_13": "ZRX_Humrah_XZR"
"house_15": "ZRX_Qabz-el-Kharij_XZR"

These ZRX_*_XZR tokens serve two purposes:

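# Gate 4: Token Leakage Guard
import re

# Figure names can contain hyphens (e.g. Qabz-el-Kharij), so "-" must be in the class.
zrx_leaks = re.findall(r"ZRX_[A-Za-z0-9_-]+_XZR", reading_text)
if zrx_leaks:
  # reject() is the pipeline's own failure handler (not shown here)
  reject("Token leakage detected", leaked=zrx_leaks)
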
💡 The Cybersecurity Parallel

This is the oracle equivalent of a canary token in intrusion detection. Security engineers plant marker files or credentials that should never be touched; if those markers are accessed or turn up somewhere they do not belong, it signals a breach. SAGE plants obfuscated markers in AI payloads; if those markers appear in user-facing text, it signals a pipeline failure. The principle is identical: inject a signal that should never be visible, then monitor for its escape.
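
A canary is only useful if the detector is known to fire. The sketch below plants a token and checks that the guard catches it, including a hyphenated figure name. The helpers `wrap_figures` and `find_leaks` and the sample sentences are hypothetical and used only for illustration.

```python
import re

# The hyphen is part of the class because names such as Qabz-el-Kharij use it.
TOKEN_RE = re.compile(r"ZRX_[A-Za-z0-9_-]+?_XZR")

def wrap_figures(houses: dict[str, str]) -> dict[str, str]:
    """Wrap each figure name in the ZRX_*_XZR placeholder pattern."""
    return {house: f"ZRX_{name}_XZR" for house, name in houses.items()}

def find_leaks(reading_text: str) -> list[str]:
    """Return every placeholder token that survived into user-facing text."""
    return TOKEN_RE.findall(reading_text)

payload = wrap_figures({"house_1": "Tariq", "house_15": "Qabz-el-Kharij"})

# Simulated model outputs (hypothetical): one clean, one with an escaped token.
clean = "Tariq opens the way, and Qabz-el-Kharij closes the court."
leaky = f"Tariq opens the way, and {payload['house_15']} closes the court."

assert find_leaks(clean) == []
assert find_leaks(leaky) == ["ZRX_Qabz-el-Kharij_XZR"]
```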

4. Deep Dive: Multilingual Verdict Banner Validation

Every SAGE reading contains a verdict banner—a structured block that communicates three mandatory metrics to the user: the verdict itself (favorable/unfavorable/mixed), the timing window (e.g., 2–4 weeks), and a confidence percentage. This banner must be present and correctly formatted in all 10 supported languages.

The challenge: these keywords have no shared lexical root across Arabic, Hindi, Chinese, Japanese, Russian, and the Latin-script European languages. The pipeline cannot use a single regex. Instead, it maintains localized keyword arrays for each metric dimension:

# Verdict detection across 10 languages
verdict_keywords = ["verdict", "veredito", "urteil", "निर्णय", "حكم", "裁定", "判定", "вердикт"]

# Timing detection
timing_keywords  = ["timing", "tiempo", "temps", "zeit", "समय", "التوقيت", "时间", "время", "⏳"]

# Confidence detection
confidence_keywords = ["confidence", "confianza", "confiance", "विश्वास", "الثقة", "置信度", "信頼度", "🎯"]

The pipeline requires all three keyword dimensions to be present. But presence alone is not enough. Gate 6 also enforces a banner length ceiling of 600 characters. Why? Because when the LLM hallucinates, it often does so by injecting narrative prose into what should be a tightly structured data block. A 1,200-character "verdict banner" is almost certainly contaminated with essay-style content that belongs in a narrative section.

The pipeline also counts bullet points within the banner. In Daily mode, the banner should contain exactly 3 structured bullets. In Premium mode, 3–4. If the bullet count exceeds the limit, it triggers a hallucinated label detection failure—the LLM has invented structural elements that don't exist in the reading schema.
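
The three checks described above (keyword presence, the 600-character ceiling, and the bullet limit) can be combined in one function. This is a sketch under assumptions: `check_banner`, the failure messages, and the bullet prefixes are hypothetical, and the keyword lists are the excerpts quoted in this article rather than complete per-language sets.

```python
METRIC_KEYWORDS = {
    "verdict": ["verdict", "veredito", "urteil", "निर्णय", "حكم", "裁定", "判定", "вердикт"],
    "timing": ["timing", "tiempo", "temps", "zeit", "समय", "التوقيت", "时间", "время", "⏳"],
    "confidence": ["confidence", "confianza", "confiance", "विश्वास", "الثقة", "置信度", "信頼度", "🎯"],
}
MAX_BANNER_CHARS = 600
MAX_BULLETS = {"daily": 3, "premium": 4}
BULLET_PREFIXES = ("- ", "* ", "• ")  # assumption: how bullets are rendered

def check_banner(banner: str, mode: str) -> list[str]:
    """Return Gate 6 failures for a banner; an empty list means PASS."""
    failures = []
    text = banner.casefold()
    for metric, keywords in METRIC_KEYWORDS.items():
        if not any(keyword in text for keyword in keywords):
            failures.append(f"missing {metric} metric")
    if len(banner) > MAX_BANNER_CHARS:
        failures.append("banner too long")
    bullets = [
        line for line in banner.splitlines()
        if line.lstrip().startswith(BULLET_PREFIXES)
    ]
    if len(bullets) > MAX_BULLETS[mode]:
        failures.append("too many bullets")
    return failures

english = "- Verdict: Favorable\n- Timing: 2–4 weeks\n- Confidence: 78%"
japanese = "- 判定: 吉\n- ⏳ 2〜4週間\n- 🎯 78%"
padded = english + "\n- Destiny: written in the stars"

assert check_banner(english, "daily") == []
assert check_banner(japanese, "daily") == []
assert check_banner(padded, "daily") == ["too many bullets"]
assert check_banner(padded, "premium") == []
```

Substring matching on a case-folded banner keeps the check independent of punctuation and surrounding markup, at the cost of occasionally matching a keyword inside a longer word.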

🔬 Why Emoji Keywords?

Notice that the keyword arrays include emoji like ⏳ and 🎯. This is intentional. When the LLM generates non-Latin-script readings (Arabic, Chinese, Japanese), it sometimes uses emoji as universal anchors for structured data. The pipeline treats these as valid signals for banner detection, ensuring no reading is falsely rejected because the LLM chose iconographic shorthand over a translated keyword.

5. Deep Dive: The Mizan — Mathematical Chart Validation

Before the AI generates a single word, SAGE validates the mathematical integrity of the geomantic chart itself. This is the Mizan (Arabic: "balance")—a two-layer validation system that ensures the 16-house chart is internally consistent.

Layer 1: Structural Mizan (Binary Proof)

A geomantic chart is not random. The 16 houses are derived from 4 mother figures through a deterministic process:

The Structural Mizan checks every single derivation. If House 9 does not equal the binary sum of Houses 1 and 2, the chart is mathematically broken—and no amount of AI prose can salvage it. The reading is rejected at the chart level, before the LLM is ever invoked.

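# XOR-Parity Rule (universal across all traditions)
# Each row holds 1 or 2 dots: equal rows sum to even (2), unequal rows to odd (1).
def add_figures(figure_a, figure_b):
  # Return a list, not a generator, so the result can be compared with ==
  return [2 if row_a == row_b else 1 for row_a, row_b in zip(figure_a, figure_b)]

# Niece Check: H9 must equal H1 ⊕ H2 (pattern is indexed by house number)
expected_h9 = add_figures(pattern[1], pattern[2])
assert list(pattern[9]) == expected_h9, "Chart integrity violation"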
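
Extending the niece check to the whole chart might look like the sketch below. It assumes the standard Western shield-chart derivation (daughters by transposing the mothers, each later house as the sum of two earlier ones, and House 16 as House 15 plus House 1); the article does not confirm that SAGE uses exactly this layout, and the names `build_chart` and `structural_mizan` are hypothetical.

```python
Figure = tuple[int, ...]  # four rows, each holding 1 (odd) or 2 (even) dots

# Assumed derivation table: house -> the two houses whose sum produces it.
SUMS = {
    9: (1, 2), 10: (3, 4), 11: (5, 6), 12: (7, 8),  # nieces
    13: (9, 10), 14: (11, 12),                      # witnesses
    15: (13, 14),                                   # judge
    16: (15, 1),
}

def add_figures(a: Figure, b: Figure) -> Figure:
    return tuple(2 if row_a == row_b else 1 for row_a, row_b in zip(a, b))

def build_chart(mothers: list[Figure]) -> dict[int, Figure]:
    """Derive houses 5-16 from the four mothers (houses 1-4)."""
    chart = {i + 1: mother for i, mother in enumerate(mothers)}
    # Daughters (H5-H8): daughter k takes row k of each mother, in order.
    for k in range(4):
        chart[5 + k] = tuple(mother[k] for mother in mothers)
    for house, (left, right) in SUMS.items():
        chart[house] = add_figures(chart[left], chart[right])
    return chart

def structural_mizan(chart: dict[int, Figure]) -> list[int]:
    """Return the houses whose figures do not match their derivation."""
    expected = build_chart([chart[i] for i in range(1, 5)])
    return [house for house in range(5, 17) if chart[house] != expected[house]]

mothers = [(1, 1, 1, 1), (2, 2, 2, 2), (1, 2, 1, 2), (2, 1, 1, 2)]
chart = build_chart(mothers)
assert structural_mizan(chart) == []

chart[9] = (2, 2, 2, 2)  # corrupt one niece
assert structural_mizan(chart) == [9]
```

Re-deriving the chart from the mothers and comparing house by house reports exactly which figure is wrong, which is more useful in a rejection log than a bare assertion failure.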

Layer 2: Interpretive Mizan (Quality Scoring)

Beyond mathematical correctness, the Interpretive Mizan scores the interpretive reliability of the chart on a 0–100 scale. Three factors contribute:

This score is surfaced to the user as part of the reading metadata. A "Strong" Mizan (80+) tells the user the chart itself is a reliable foundation for interpretation. A "Weak" Mizan (30–49) signals that the question may need to be re-approached.

💡 Why This Matters

The Mizan is SAGE's quality signal before the AI is invoked. Most AI products can only validate output. SAGE validates the input as well—ensuring the AI is working from a mathematically sound foundation. This is analogous to validating a database schema before running queries against it.

6. The Zero-Trust AI Philosophy

The 11-gate pipeline, the ZRX canary tokens, the Mizan, and the multilingual keyword arrays all share a single engineering philosophy: Zero-Trust AI.

In cybersecurity, "zero trust" means you never assume a network actor is legitimate—you verify every request, every time. SAGE applies this same principle to generative AI:

This is not pessimism—it is engineering realism. Probabilistic systems will, given sufficient volume, produce every possible failure mode. The only reliable strategy is to assume failure and engineer deterministic verification around it.

📐 The Scale

SAGE supports 10 languages (English, Hindi, Spanish, French, German, Russian, Arabic, Chinese, Japanese, Portuguese), 2 traditions (Western Geomancy and Indian Ramal Shastra), and 2 reading modes (Daily and Premium). That's 40 distinct evaluation paths—each with its own localized headers, subheaders, keyword arrays, depth thresholds, and structural expectations. Every single path runs through the full 11-gate pipeline.
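
The count of 40 is the Cartesian product of the three dimensions. A short sketch, with hypothetical constant names, enumerates the paths so that a test suite could iterate over them:

```python
from itertools import product

LANGUAGES = [
    "English", "Hindi", "Spanish", "French", "German",
    "Russian", "Arabic", "Chinese", "Japanese", "Portuguese",
]
TRADITIONS = ["Western Geomancy", "Indian Ramal Shastra"]
MODES = ["Daily", "Premium"]

# One (language, tradition, mode) tuple per evaluation path.
EVALUATION_PATHS = list(product(LANGUAGES, TRADITIONS, MODES))

assert len(EVALUATION_PATHS) == 10 * 2 * 2 == 40
```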

The result: when a SAGE reading reaches your screen, it has been mathematically validated at the chart level, structurally validated at the content level, and encoding-validated at the script level. It is not a best-effort output from a language model. It is a verified artifact of a deterministic validation process applied to a non-deterministic generation engine.

That is what it means to test what you cannot predict.
