AI Evaluation Lab

Your AI Agents,
Evaluated Before
You Deploy Them.

Inspector is the evaluation infrastructure that proves your AI pipeline is faithful to source documents, resilient against adversarial inputs, and compliant with regulatory axioms before a single production decision is made.

The Gap

Deploying AI Without Evaluation Is Guessing

Every AI deployment carries invisible failure modes. Hallucinations look like correct answers. Faithfulness failures are silent. Adversarial inputs expose systems that passed every unit test. In regulated industries, these gaps become liability.

RAG systems hallucinate. Without faithfulness testing, you cannot distinguish claims grounded in your documents from claims invented by the model.

Adversarial inputs (contradictory context, poisoned instructions, injected data) break AI systems in ways that standard testing frameworks never catch.

Compliance gaps in AI extraction are invisible until a regulator finds them. By then the decision has already been made and recorded.

Evaluation done manually in spreadsheets is not audit-ready evidence. You need structured, repeatable, timestamped results you can hand to an examiner.

Inspector runs three automated evaluation suites against your AI pipeline on every build. Every claim is tested. Every axiom is validated. Every session produces an audit-ready report.

This is the difference between assuming your AI works and proving it.

The Lab

Three Suites. Every Failure Mode Covered.

Inspector targets the three categories of AI failure that matter most in regulated deployments: extraction accuracy, retrieval faithfulness, and adversarial resilience.

01
test_data_extraction.py

Extraction Suite

Validates structured data extraction against ground-truth expectations. Tests that the AI extracts the correct values, in the correct format, without hallucinating fields that aren't in the source document. Covers complex nested structures from raw regulatory filings.

No hallucination Field accuracy Format validation
02
test_rag_performance.py

RAG Performance Suite

Audits the end-to-end retrieval-augmented generation pipeline using Ragas and DeepEval metrics: Faithfulness, Answer Relevancy, Contextual Precision, and Contextual Recall. Every claim in every response is traced back to a retrieved context chunk.

Faithfulness Answer Relevancy Contextual Recall
03
test_poison_detection.py

Adversarial Suite

Confirms that the AI flags and refuses contradictory or poisoned contexts. Tests injection resistance, contradictory source detection, context confusion patterns, and prompt override attempts. A system that passes standard tests may still fail this one.

Poison detection Contradiction flagging Injection resistance

DeepEval / Ragas Metrics Used

Metric Suite What It Checks
FaithfulnessMetric RAG Performance All claims in the output are grounded in retrieved context. No hallucinated facts.
AnswerRelevancyMetric RAG Performance The response actually answers the question asked, not a related one.
ContextualRecallMetric RAG Performance Retrieved context contains the information needed to produce the expected answer.
ContextualPrecisionMetric RAG Performance Retrieved nodes are ranked by relevance. The most relevant chunks appear first.
HallucinationMetric Extraction Output does not contain information absent from the source document.
Poison Detection (custom) Adversarial Contradictory context is identified and flagged before it influences a decision.
Regulatory Compliance

Axiom Validation Built Into the Evaluation Pipeline

Inspector ships with structured compliance axioms for regulated domains. The California auto insurance module validates three mandatory regulatory requirements: rating factor rank order, factor weight variance, and rate impact reconciliation.

Each axiom is a Pydantic model with hard enforcement. Violations raise immediately with the exact regulatory citation, not a log warning that gets lost in test output.

The same pattern extends to any jurisdiction or regulatory framework. New domains plug in as strategies with their own anchor constants and validation models.

California CCR 2632.8 + Prop 103

Rating factors must follow a mandatory weight hierarchy. The model either satisfies the axiom or it raises a Rank Order Violation. There is no partial pass.

strategies/serff_ca/models.py
# Three enforced regulatory axioms

class RankOrderAxiom:
  # CCR 2632.8: Safety Record must outweigh
  # Annual Mileage, which must outweigh
  # Driving Experience
  # Violation -> raises ValueError("Rank Order Violation")

class VarianceAxiom:
  # CCR 2632.8: No adjacent factor may vary
  # by more than 25% in weight
  # Violation -> raises ValueError("Variance Violation")

class RateReconciliationAxiom:
  # Prop 103: Stated rate impact must match
  # calculated impact within tolerance
  # Violation -> raises ValueError("Reconciliation Fault")

# Prohibited factors (hard-stops):
# gender, credit_score, education,
# occupation, income, price_optimization

Prior Approval Threshold

7%

Rate impact threshold

triggers

PARK

Human review required

Privacy Architecture

Files Loaded Into RAM. Never Written to Disk.

Inspector's MemoryHandler ingests every submitted document into a BytesIO buffer. Raw file content never touches the filesystem. When the context manager exits, the buffer is explicitly truncated and closed before the reference is released.

stream_to_memory() Loads binary content into BytesIO. Buffer.name set for PDF parser compatibility.
truncate(0) + seek(0) Explicit shred on context exit. Content overwritten before GC can collect it.
capture_safe_metadata() Only filename, size_bytes, and status persist. No document content in the audit record.
Audit Packet The only artifact that persists. Contains evaluation results and non-sensitive metadata.
memory_handler.py
@contextmanager
def stream_to_memory(
  file_content: bytes,
  filename: str
):
  buffer = io.BytesIO(file_content)
  buffer.name = filename  # PDF parser compat

  try:
    logger.info(
      f"Ingested '{filename}' into RAM."
    )
    yield buffer
  finally:
    # Explicit shred before close
    buffer.truncate(0)
    buffer.seek(0)
    buffer.close()
    logger.info(
      f"Buffer for '{filename}' shredded."
    )

What persists after evaluation

filename PERSISTS
size_bytes PERSISTS
evaluation results (pass/fail, scores) PERSISTS
raw document content SHREDDED
extracted text SHREDDED
PII from source documents SHREDDED
Evidence-Grade Output

Audit-Ready Reports. Every Session.

Inspector generates structured reports after every evaluation run. PDF for examiners. HTML for dashboards. JSON for pipelines. All three, automatically.

PDF

.pdf

Regulatory Evidence

Premium formatted report generated by Jinja2 + xhtml2pdf. Contains the trust score, per-axiom status table, violation details, and evaluation metadata. Designed to be handed directly to a compliance examiner.

Trust Score (0.0 - 1.0)
Per-axiom PASS / FAIL table
Violation descriptions
Evaluation timestamp + config
HTML

.html

Dashboard Integration

The same Jinja2 template rendered for browser display. Viewable directly in the Streamlit audit dashboard or embedded in an internal compliance portal. Styled to match Novus Forge brand standards.

Interactive score visualization
Expandable axiom detail
Linked to source session
Streamlit dashboard ready
JSON

.json

Pipeline Integration

Machine-readable evaluation results. Structured for consumption by Axis, CI/CD pipelines, or any downstream system that needs to gate on evaluation results before deployment.

Structured result objects
company_name, trust_score
results[] with axiom_id + status
CI/CD gate compatible

Run a full evaluation session

# Run all three suites, generate all three report formats
python evals/run_lab.py

# Reports auto-saved to evals/results/:
#   2026-04-20_session.pdf
#   2026-04-20_session.html
#   2026-04-20_session.json

# Or run a single suite
python -m pytest evals/test_rag_performance.py -v

# CLI audit endpoint
python workspace/cli.py --file filing.pdf \
  --endpoint http://localhost:8005/api/v1/audit/serff-ca
Deployment

Standalone Evaluation or Integrated Quality Gate

Inspector runs independently against any AI pipeline, or connects to Axis as a pre-deployment validation layer before manifests go live.

Standalone Evaluation Lab

Run Inspector as a self-contained evaluation environment. The Streamlit dashboard provides a visual interface for session management and report review. The FastAPI server exposes evaluation endpoints for programmatic access. No Novus Forge dependency required.

Streamlit audit dashboard on port 8501
FastAPI evaluation server on port 8005
CLI client for batch filing submissions
pytest-based suites integrate with any CI/CD pipeline
# Start the evaluation dashboard
streamlit run workspace/runner_app.py

# Start the API server
python src/inspector/server.py

Integrated with Axis

Use Inspector as a pre-deployment validation gate for Axis manifests. Before a new agent manifest goes live, Inspector evaluates it against all three suites and the compliance axioms for the target domain. Only manifests that pass are promoted.

Validate manifest behavior before hot-reload deploys it
Compliance axiom checks gated per domain and jurisdiction
Evaluation results linked to manifest version for audit trail
AxisNotificationStub ready for full Axis integration
Technical Specs

Built on Industry-Standard Evaluation Frameworks

Inspector uses the same evaluation libraries used by the research community, packaged into a production-ready audit workflow.

DeepEval

Evaluation Framework

AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric, BiasMetric, and 10+ others. LLM-as-a-judge architecture.

Ragas

RAG Metrics

Faithfulness, Answer Relevancy, Contextual Precision, Contextual Recall. Purpose-built for retrieval-augmented generation evaluation.

FastAPI

Evaluation API

REST endpoints for filing submission and evaluation retrieval. SERFF California compliance endpoint included.

Streamlit

Audit Dashboard

Visual session management, report browser, per-axiom result cards, and direct PDF download from the evaluation history.

Jinja2 + xhtml2pdf

Report Generation

Premium PDF and HTML report rendering. Templates aligned with Novus Forge brand standards. Auto-persisted after every session.

Pydantic

Axiom Enforcement

Regulatory axioms modeled as Pydantic validators. Violations raise immediately with the exact regulatory citation and constraint description.

pytest

Test Runner

All three evaluation suites run via pytest. CI/CD integration via exit codes. 8 tests, all passing. run_lab.py automates the full session.

Python

Runtime

Virtual environment managed per project. Memory-safe BytesIO handling enforced via context managers.

Prove Your AI Pipeline Works Before It Goes Live

Inspector runs the evaluations your team doesn't have time to write manually and produces the audit evidence your compliance team actually needs. Works standalone or as part of the Novus Forge platform.