Your AI Agents, Evaluated Before You Deploy Them.
Inspector is the evaluation infrastructure that proves your AI pipeline is faithful to source documents, resilient against adversarial inputs, and compliant with regulatory axioms before a single production decision is made.
Deploying AI Without Evaluation Is Guessing
Every AI deployment carries invisible failure modes. Hallucinations look like correct answers. Faithfulness failures are silent. Adversarial inputs expose systems that passed every unit test. In regulated industries, these gaps become liability.
RAG systems hallucinate. Without faithfulness testing, you cannot distinguish claims grounded in your documents from claims invented by the model.
Adversarial inputs (contradictory context, poisoned instructions, injected data) break AI systems in ways that standard testing frameworks never catch.
Compliance gaps in AI extraction are invisible until a regulator finds them. By then the decision has already been made and recorded.
Evaluation done manually in spreadsheets is not audit-ready evidence. You need structured, repeatable, timestamped results you can hand to an examiner.
Inspector runs three automated evaluation suites against your AI pipeline on every build. Every claim is tested. Every axiom is validated. Every session produces an audit-ready report.
This is the difference between assuming your AI works and proving it.
Three Suites. Every Failure Mode Covered.
Inspector targets the three categories of AI failure that matter most in regulated deployments: extraction accuracy, retrieval faithfulness, and adversarial resilience.
Extraction Suite
Validates structured data extraction against ground-truth expectations. Tests that the AI extracts the correct values, in the correct format, without hallucinating fields that aren't in the source document. Covers complex nested structures from raw regulatory filings.
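As an illustration of the kind of check this suite describes, here is a minimal sketch that compares a nested extraction result against ground truth. It is not Inspector's implementation: `compare_extraction`, the field names, and the sample values are all hypothetical.

```python
from typing import Any


def compare_extraction(expected: dict[str, Any], extracted: dict[str, Any]) -> dict[str, list[str]]:
    """Compare extraction output against ground truth.

    Returns three lists of dotted field paths: values that differ, fields the
    extractor missed, and fields it produced that the ground truth lacks.
    """
    def flatten(node: Any, prefix: str = "") -> dict[str, Any]:
        # Walk nested dicts so "filing.state" style paths can be compared directly.
        if isinstance(node, dict):
            flat: dict[str, Any] = {}
            for key, value in node.items():
                flat.update(flatten(value, f"{prefix}{key}."))
            return flat
        return {prefix.rstrip("."): node}

    want, got = flatten(expected), flatten(extracted)
    return {
        "mismatched": sorted(k for k in want.keys() & got.keys() if want[k] != got[k]),
        "missing": sorted(want.keys() - got.keys()),
        "unexpected": sorted(got.keys() - want.keys()),
    }


# Hypothetical ground truth and extractor output for one filing.
expected = {"filing": {"state": "CA", "rate_impact_pct": 6.9}}
extracted = {"filing": {"state": "CA", "rate_impact_pct": 9.6, "carrier_rating": "A+"}}
report = compare_extraction(expected, extracted)
```

In this sketch, a field listed under `unexpected` is the simplest signal of a hallucinated field: the extractor returned something the ground truth never contained.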
RAG Performance Suite
Audits the end-to-end retrieval-augmented generation pipeline using Ragas and DeepEval metrics: Faithfulness, Answer Relevancy, Contextual Precision, and Contextual Recall. Every claim in every response is traced back to a retrieved context chunk.
Adversarial Suite
Confirms that the AI flags and refuses contradictory or poisoned contexts. Tests injection resistance, contradictory source detection, context confusion patterns, and prompt override attempts. A system that passes standard tests may still fail this one.
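To make contradictory source detection concrete, the sketch below flags fields for which retrieved sources assert different values. It covers only that one pattern, assumes facts have already been pulled out of the context as (source, field, value) triples, and uses hypothetical names throughout; Inspector's own detector may work differently.

```python
from collections import defaultdict


def find_contradictions(facts: list[tuple[str, str, str]]) -> dict[str, dict[str, list[str]]]:
    """Group (source, field, value) facts and return fields with conflicting values.

    The result maps each contested field to {normalized value: [sources asserting it]}.
    """
    by_field: dict[str, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for source, field, value in facts:
        # Normalize case and whitespace so "CA" and "ca" count as the same value.
        by_field[field][value.strip().lower()].append(source)
    # A field is contested when more than one distinct value is asserted for it.
    return {
        field: dict(values)
        for field, values in by_field.items()
        if len(values) > 1
    }


# Hypothetical facts taken from two retrieved documents.
facts = [
    ("filing.pdf#p3", "effective_date", "2026-01-01"),
    ("memo.pdf#p1", "effective_date", "2026-03-01"),
    ("filing.pdf#p4", "state", "CA"),
    ("memo.pdf#p2", "state", "ca"),
]
conflicts = find_contradictions(facts)
```

A pipeline built this way could refuse to answer, or route to review, whenever `conflicts` is non-empty, rather than letting the model silently pick one source.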
DeepEval / Ragas Metrics Used
| Metric | Suite | What It Checks |
|---|---|---|
| FaithfulnessMetric | RAG Performance | All claims in the output are grounded in retrieved context. No hallucinated facts. |
| AnswerRelevancyMetric | RAG Performance | The response actually answers the question asked, not a related one. |
| ContextualRecallMetric | RAG Performance | Retrieved context contains the information needed to produce the expected answer. |
| ContextualPrecisionMetric | RAG Performance | Retrieved nodes are ranked by relevance. The most relevant chunks appear first. |
| HallucinationMetric | Extraction | Output does not contain information absent from the source document. |
| Poison Detection (custom) | Adversarial | Contradictory context is identified and flagged before it influences a decision. |
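To show what "the most relevant chunks appear first" means numerically, here is a small sketch of a rank-weighted precision score in the style of average precision. DeepEval decides relevance with an LLM judge; this sketch swaps that for hand-labeled booleans so it runs without any model or API key, and the function name is hypothetical.

```python
def contextual_precision(relevance: list[bool]) -> float:
    """Rank-weighted precision over retrieved chunks, in retrieval order.

    relevance[k] is True when the chunk at rank k+1 is relevant to the query.
    Returns 0.0 when nothing relevant was retrieved.
    """
    relevant_so_far = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            # Precision at this rank only counts when the chunk itself is relevant.
            weighted_sum += relevant_so_far / rank
    return weighted_sum / relevant_so_far if relevant_so_far else 0.0


# Same two relevant chunks, ranked first versus ranked last.
good_ranking = contextual_precision([True, True, False])
poor_ranking = contextual_precision([False, True, True])
```

Both rankings retrieve the same two relevant chunks, but the second scores lower because an irrelevant chunk sits at rank one.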
Axiom Validation Built Into the Evaluation Pipeline
Inspector ships with structured compliance axioms for regulated domains. The California auto insurance module validates three mandatory regulatory requirements: rating factor rank order, factor weight variance, and rate impact reconciliation.
Each axiom is a Pydantic model with hard enforcement. A violation raises an exception immediately, carrying the exact regulatory citation, instead of emitting a log warning that gets lost in test output.
The same pattern extends to any jurisdiction or regulatory framework. New domains plug in as strategies with their own anchor constants and validation models.
California CCR 2632.8 + Prop 103
Rating factors must follow a mandatory weight hierarchy. The model either satisfies the axiom or it raises a Rank Order Violation. There is no partial pass.
# Three enforced regulatory axioms

class RankOrderAxiom:
    # CCR 2632.8: Safety Record must outweigh
    # Annual Mileage, which must outweigh
    # Driving Experience
    # Violation -> raises ValueError("Rank Order Violation")
    ...

class VarianceAxiom:
    # CCR 2632.8: No adjacent factor may vary
    # by more than 25% in weight
    # Violation -> raises ValueError("Variance Violation")
    ...

class RateReconciliationAxiom:
    # Prop 103: Stated rate impact must match
    # calculated impact within tolerance
    # Violation -> raises ValueError("Reconciliation Fault")
    ...

# Prohibited factors (hard-stops):
# gender, credit_score, education,
# occupation, income, price_optimization
Prior Approval Threshold
7%
Rate impact threshold
PARK
Human review required
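The fail-fast pattern behind these axioms can be sketched with standard-library dataclasses. This is an assumption-laden stand-in: Inspector models its axioms with Pydantic, whereas the sketch uses `dataclasses` so it runs without dependencies. The class names, the `tolerance` default, and the helper functions are hypothetical, and the variance axiom is left out because the source does not define how the 25% limit is measured.

```python
from dataclasses import dataclass

PROHIBITED_FACTORS = frozenset(
    {"gender", "credit_score", "education", "occupation", "income", "price_optimization"}
)


@dataclass(frozen=True)
class RankOrderCheck:
    """Weights for the three mandatory factors; validated on construction."""

    safety_record: float
    annual_mileage: float
    driving_experience: float

    def __post_init__(self) -> None:
        if not (self.safety_record > self.annual_mileage > self.driving_experience):
            raise ValueError("Rank Order Violation (CCR 2632.8)")


@dataclass(frozen=True)
class RateReconciliationCheck:
    """Stated vs. calculated rate impact, in percentage points."""

    stated_impact: float
    calculated_impact: float
    tolerance: float = 0.1  # assumed value; the source does not state a tolerance

    def __post_init__(self) -> None:
        if abs(self.stated_impact - self.calculated_impact) > self.tolerance:
            raise ValueError("Reconciliation Fault (Prop 103)")


def reject_prohibited(factors: list[str]) -> None:
    """Hard-stop when any prohibited factor name appears in the rating plan."""
    found = sorted(PROHIBITED_FACTORS.intersection(factors))
    if found:
        raise ValueError(f"Prohibited factor(s): {', '.join(found)}")


def violation_of(check) -> str:
    """Run a zero-argument check and return its violation message, or ''."""
    try:
        check()
    except ValueError as exc:
        return str(exc)
    return ""


ok = violation_of(lambda: RankOrderCheck(0.5, 0.3, 0.2))
bad_rank = violation_of(lambda: RankOrderCheck(0.3, 0.5, 0.2))
bad_recon = violation_of(lambda: RateReconciliationCheck(6.9, 8.4))
bad_factor = violation_of(lambda: reject_prohibited(["annual_mileage", "credit_score"]))
```

Validating in the constructor means an invalid rating plan can never exist as an object, which is what makes a partial pass impossible.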
Files Loaded Into RAM. Never Written to Disk.
Inspector's MemoryHandler ingests every submitted document into a BytesIO buffer. Raw file content never touches the filesystem. When the context manager exits, the buffer is explicitly truncated and closed before the reference is released.
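To see the buffer lifecycle end to end, the sketch below defines a condensed stand-in for the context manager this section describes (logging omitted) and exercises it with sample bytes. The sample content and the variable names are hypothetical.

```python
import io
from contextlib import contextmanager


@contextmanager
def stream_to_memory(file_content: bytes, filename: str):
    """Condensed stand-in: hold the file in RAM, then clear and close it."""
    buffer = io.BytesIO(file_content)
    buffer.name = filename  # some parsers expect a .name attribute
    try:
        yield buffer
    finally:
        buffer.truncate(0)
        buffer.close()


with stream_to_memory(b"%PDF-1.7 sample bytes", "filing.pdf") as buf:
    header = buf.read(8)  # work with the buffer only inside the block
    name_seen = buf.name

closed_after_exit = buf.closed

try:
    buf.read()
    read_after_exit = "allowed"
except ValueError:
    # Closed BytesIO objects reject further I/O.
    read_after_exit = "rejected"
```

Anything copied out inside the block, such as `header` here, is an independent `bytes` object that outlives the buffer, so callers decide what is retained.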
import io
import logging
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def stream_to_memory(
    file_content: bytes,
    filename: str
):
    # Hold the submitted file in RAM only
    buffer = io.BytesIO(file_content)
    buffer.name = filename  # PDF parser compat
    try:
        logger.info(
            f"Ingested '{filename}' into RAM."
        )
        yield buffer
    finally:
        # Explicit shred before close
        buffer.truncate(0)
        buffer.seek(0)
        buffer.close()
        logger.info(
            f"Buffer for '{filename}' shredded."
        )
What persists after evaluation
Audit-Ready Reports. Every Session.
Inspector generates structured reports after every evaluation run. PDF for examiners. HTML for dashboards. JSON for pipelines. All three, automatically.
.pdf
Regulatory Evidence
Premium formatted report generated by Jinja2 + xhtml2pdf. Contains the trust score, per-axiom status table, violation details, and evaluation metadata. Designed to be handed directly to a compliance examiner.
.html
Dashboard Integration
The same Jinja2 template rendered for browser display. Viewable directly in the Streamlit audit dashboard or embedded in an internal compliance portal. Styled to match Novus Forge brand standards.
.json
Pipeline Integration
Machine-readable evaluation results. Structured for consumption by Axis, CI/CD pipelines, or any downstream system that needs to gate on evaluation results before deployment.
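A downstream gate on the JSON report might look like the following sketch. The report schema is an assumption: the keys `trust_score`, `axioms`, `name`, and `status`, and the `min_trust_score` default, are hypothetical and should be replaced with whatever the real report contains.

```python
import json


def gate(report_json: str, min_trust_score: float = 0.9) -> int:
    """Return a process exit code: 0 to allow deployment, 1 to block it."""
    report = json.loads(report_json)
    failed_axioms = [
        axiom["name"]
        for axiom in report.get("axioms", [])
        if axiom.get("status") != "pass"
    ]
    # A missing score is treated as 0.0 so an incomplete report blocks by default.
    score_ok = report.get("trust_score", 0.0) >= min_trust_score
    return 0 if score_ok and not failed_axioms else 1


# Hypothetical reports: identical scores, one failed axiom.
passing = json.dumps({
    "trust_score": 0.97,
    "axioms": [{"name": "rank_order", "status": "pass"}],
})
failing = json.dumps({
    "trust_score": 0.97,
    "axioms": [{"name": "rank_order", "status": "fail"}],
})
exit_ok = gate(passing)
exit_blocked = gate(failing)
```

In a CI job, the return value would typically be passed to `sys.exit`, so a failed axiom stops the pipeline even when the overall score looks healthy.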
Run a full evaluation session
# Run all three suites, generate all three report formats
python evals/run_lab.py

# Reports auto-saved to evals/results/:
#   2026-04-20_session.pdf
#   2026-04-20_session.html
#   2026-04-20_session.json

# Or run a single suite
python -m pytest evals/test_rag_performance.py -v

# CLI audit endpoint
python workspace/cli.py --file filing.pdf \
    --endpoint http://localhost:8005/api/v1/audit/serff-ca
Standalone Evaluation or Integrated Quality Gate
Inspector runs independently against any AI pipeline, or connects to Axis as a pre-deployment validation layer before manifests go live.
Standalone Evaluation Lab
Run Inspector as a self-contained evaluation environment. The Streamlit dashboard provides a visual interface for session management and report review. The FastAPI server exposes evaluation endpoints for programmatic access. No Novus Forge dependency required.
# Start the evaluation dashboard
streamlit run workspace/runner_app.py

# Start the API server
python src/inspector/server.py
Integrated with Axis
Use Inspector as a pre-deployment validation gate for Axis manifests. Before a new agent manifest goes live, Inspector evaluates it against all three suites and the compliance axioms for the target domain. Only manifests that pass are promoted.
Built on Industry-Standard Evaluation Frameworks
Inspector uses the same evaluation libraries used by the research community, packaged into a production-ready audit workflow.
DeepEval
Evaluation Framework
AnswerRelevancyMetric, FaithfulnessMetric, HallucinationMetric, BiasMetric, and 10+ others. LLM-as-a-judge architecture.
Ragas
RAG Metrics
Faithfulness, Answer Relevancy, Context Precision, Context Recall. Purpose-built for retrieval-augmented generation evaluation.
FastAPI
Evaluation API
REST endpoints for filing submission and evaluation retrieval. SERFF California compliance endpoint included.
Streamlit
Audit Dashboard
Visual session management, report browser, per-axiom result cards, and direct PDF download from the evaluation history.
Jinja2 + xhtml2pdf
Report Generation
Premium PDF and HTML report rendering. Templates aligned with Novus Forge brand standards. Auto-persisted after every session.
Pydantic
Axiom Enforcement
Regulatory axioms modeled as Pydantic validators. Violations raise immediately with the exact regulatory citation and constraint description.
pytest
Test Runner
All three evaluation suites run via pytest. CI/CD integration via exit codes. 8 tests, all passing. run_lab.py automates the full session.
Python
Runtime
Virtual environment managed per project. Memory-safe BytesIO handling enforced via context managers.
Prove Your AI Pipeline Works Before It Goes Live
Inspector runs the evaluations your team doesn't have time to write manually and produces the audit evidence your compliance team actually needs. Works standalone or as part of the Novus Forge platform.