What we can demonstrate today

Every result below includes what it demonstrates, the benchmark used, and how the experiment was set up. For proprietary methods, we include a methodology note explaining the approach. All results are from benchmarks, not from production customer traffic.

Structural redundancy is real and measurable

Redundancy across architectures

MCG's core claim is that neural networks contain structurally redundant layers that can be identified and removed. The results below demonstrate that the pattern holds across fundamentally different architectures, from small CNNs to large language models, and that three stable layer archetypes emerge consistently.

Quality: architecture-specific benchmarks · 10+ architectures verified
Architecture         Parameters   FLOP reduction   Quality vs. baseline
CNN (ResNet-18)      11M          55%              Preserved
CNN (WideResNet)     36M          47%              Preserved
Vision Transformer   86M          78%              Preserved
LLM (1.1B)           1.1B         48%              Improved
LLM (3B)             3B           51%              Preserved
LLM (7B)             7B           48%              Improved
LLM (8B)             8B           40%              Preserved
LLM (14B)            14B          35%              Improved
Methodology note: FLOP reduction is measured during MCG's proprietary gate-training analysis phase. FLOP reduction and inference wall-clock speedup are different metrics. Quality is measured using each architecture's standard benchmark. The gate-training method is proprietary.
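The gate-training analysis itself is proprietary, but the general idea of scoring layer redundancy with learnable gates can be sketched. The snippet below is an illustrative assumption, not MCG's method: it wraps each residual block in a trainable gate, applies L1 pressure during fine-tuning, and treats blocks whose gates collapse toward zero as removal candidates. All names (GatedBlock, gate_penalty, sparsity_weight) are hypothetical.

```python
# Hedged sketch: one generic way to score layer redundancy with learnable gates.
# MCG's actual gate-training method is proprietary; nothing here is that method.
import torch
import torch.nn as nn

class GatedBlock(nn.Module):
    """Wraps an existing residual block with a trainable scalar gate."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block
        self.gate_logit = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 to start

    def forward(self, x):
        g = torch.sigmoid(self.gate_logit)
        return x + g * self.block(x)  # gate scales the block's residual contribution

def gate_penalty(model: nn.Module, sparsity_weight: float = 1e-3):
    """L1 pressure on gates; blocks whose gates collapse toward 0 are removal candidates."""
    gates = [torch.sigmoid(m.gate_logit) for m in model.modules() if isinstance(m, GatedBlock)]
    return sparsity_weight * torch.stack(gates).sum()

# During fine-tuning, add gate_penalty(model) to the task loss, then drop blocks
# whose trained gates fall below a chosen threshold (e.g. 0.05).
```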

72B layer removal

This demonstrates that MCG scales to production-sized models and that some layers are actively harmful: removing them improves both quality and speed simultaneously.

Benchmark: MMLU (public) · 80-layer model, two independent seeds

Eight layers removed from a 72-billion parameter model. Quality improved by 4%. Wall-clock inference speedup: 13-20%. A second independent run confirmed the result with different layers identified and the same outcome. Output model is in standard open-source format.
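As a rough illustration of the mechanics (not the analysis that selects which layers to remove), the sketch below drops a set of decoder layers from a Llama-style Hugging Face checkpoint and re-saves it in standard format. The attribute path model.model.layers and the placeholder indices are assumptions.

```python
# Hedged sketch: dropping decoder layers from a Llama-style checkpoint and
# re-saving it in standard Hugging Face format. Which layers to drop is the
# output of MCG's proprietary analysis; the indices below are placeholders.
import torch.nn as nn
from transformers import AutoModelForCausalLM

def remove_layers(model_name: str, layers_to_drop: set[int], out_dir: str):
    model = AutoModelForCausalLM.from_pretrained(model_name)
    # Assumes a Llama-style layout where decoder blocks live in model.model.layers.
    kept = [blk for i, blk in enumerate(model.model.layers) if i not in layers_to_drop]
    model.model.layers = nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    model.save_pretrained(out_dir)  # standard open-source format, loadable as usual

# remove_layers("meta-llama/Llama-2-7b-hf", {12, 17}, "./pruned-model")
```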

7B harmful layer discovery

This demonstrates that MCG finds not only redundant layers but layers that actively harm performance. Removing a single harmful layer improved model quality, and removing a second layer alongside it provided additional speedup with further quality improvement.

Quality: perplexity (standard metric) · 7B-parameter model

Layer 12 identified as harmful: removal improved quality by 2.35%. Removing layers 12 and 17 together yielded 1.058x speedup with 3.13% quality improvement. This was a model with no near-zero layers, where conventional analysis would find nothing to remove.
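One naive way to surface a harmful layer is a brute-force ablation scan: skip one layer at a time and compare perplexity against the unmodified model. MCG's discovery method is proprietary and is not this scan; the sketch below only illustrates the measurement and assumes a Llama-style layer layout.

```python
# Hedged sketch: a brute-force ablation scan that skips one decoder layer at a
# time and compares perplexity against the unmodified model. Not MCG's method;
# assumes model.model.layers exists and that decoder layers return a tuple
# whose first element is the hidden states (transformers-version dependent).
import math
import torch

@torch.no_grad()
def perplexity(model, input_ids):
    loss = model(input_ids, labels=input_ids).loss
    return math.exp(loss.item())

@torch.no_grad()
def ablation_scan(model, input_ids):
    model.eval()
    model.config.use_cache = False          # keep the per-layer return shape simple
    baseline = perplexity(model, input_ids)
    deltas = {}
    for i, layer in enumerate(model.model.layers):
        original_forward = layer.forward
        # Identity skip: pass hidden states through unchanged.
        layer.forward = lambda hidden_states, *args, **kwargs: (hidden_states,)
        deltas[i] = perplexity(model, input_ids) - baseline  # negative delta = layer was hurting quality
        layer.forward = original_forward
    return deltas
```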

Downstream task retention

Layer removal must preserve performance across diverse tasks, not just one metric. This shows quality retention across three independent public benchmarks covering knowledge, common sense, and science reasoning.

Benchmarks: MMLU, HellaSwag, ARC (all public) · 8B model, 16 layers removed

After removing 16 layers, the pruned model retained 94.9% (MMLU), 98.1% (HellaSwag), and 95.3% (ARC) of the unmodified dense model's scores.
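For clarity, the retention arithmetic is simply each pruned-model score divided by the dense model's score. The absolute scores in the sketch below are invented for illustration; only the ratios mirror the reported retention.

```python
# Hedged sketch of the retention arithmetic: each pruned-model score is reported
# as a percentage of the unmodified (dense) model's score.
# The absolute scores below are illustrative, not the measured values.
dense = {"MMLU": 0.650, "HellaSwag": 0.800, "ARC": 0.550}
pruned = {"MMLU": 0.617, "HellaSwag": 0.785, "ARC": 0.524}
retention = {task: 100 * pruned[task] / dense[task] for task in dense}
print(retention)  # approx. {'MMLU': 94.9, 'HellaSwag': 98.1, 'ARC': 95.3}
```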

Response-quality measurement enables intelligent routing

Confidence-based routing

TTU's core claim is that measuring quality on the response (not the question) enables significant cost reduction without meaningful quality loss. This validates that approach on a standardized benchmark.

Benchmark: MMLU (public) · N = 1,000 queries, leading models

At the optimal threshold: up to 99.8% quality retained, up to 51% cost reduction. Routing overhead: negligible (six orders of magnitude below typical API latency). Six verified scenarios across different query types.

Methodology note: The quality estimation method is proprietary. MMLU is multiple-choice, which is well-suited for measuring routing accuracy but differs from open-ended queries. Cost reduction on actual workloads depends on the proportion of easy vs. complex queries in the specific use case. All results are from benchmarks, not from production customer traffic.
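A minimal sketch of the routing pattern, assuming a placeholder score_response() in place of TTU's proprietary quality estimator: answer with the cheap model, score the response itself, and escalate only when the score falls below a threshold.

```python
# Hedged sketch of confidence-based routing on the response, not the question.
# score_response() is a placeholder for any response-level quality signal;
# TTU's actual estimator is proprietary.
from dataclasses import dataclass
from typing import Callable

@dataclass
class RoutingResult:
    answer: str
    escalated: bool

def route(query: str,
          cheap_model: Callable[[str], str],
          expensive_model: Callable[[str], str],
          score_response: Callable[[str, str], float],
          threshold: float = 0.8) -> RoutingResult:
    draft = cheap_model(query)
    if score_response(query, draft) >= threshold:
        return RoutingResult(answer=draft, escalated=False)   # keep the cheap answer
    return RoutingResult(answer=expensive_model(query), escalated=True)
```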

Consistency routing

This demonstrates that intelligent routing can exceed single-model quality, not just match it at lower cost. Taking a majority vote over multiple independent generations catches errors that any single generation would make.

Benchmark: public math reasoning benchmark · N = 500

Consistency routing (K=3, majority vote): 105% quality compared to always using the most expensive model, with approximately 40% cost reduction.
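The consistency pattern can be sketched as follows, with K=3 matching the reported configuration; the models and the escalation rule are illustrative assumptions rather than the production implementation.

```python
# Hedged sketch of consistency routing: sample K generations from the cheaper
# model and accept the majority answer when one exists; otherwise escalate.
from collections import Counter
from typing import Callable

def consistency_route(query: str,
                      cheap_model: Callable[[str], str],
                      expensive_model: Callable[[str], str],
                      k: int = 3) -> str:
    samples = [cheap_model(query) for _ in range(k)]
    answer, votes = Counter(samples).most_common(1)[0]
    if votes > k // 2:                 # strict majority agrees: trust the cheap model
        return answer
    return expensive_model(query)      # no consensus: escalate to the expensive model
```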

Cascade routing

Progressive escalation across three model tiers achieves a better quality-cost tradeoff than binary small/large routing.

Benchmark-verified, 5 configuration tests

Three-tier cascade routing verified with progressive escalation. Achieves higher quality than binary routing at comparable cost reduction.
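A minimal sketch of three-tier cascading, assuming an illustrative score_response() and per-tier thresholds (neither taken from the verified configurations): each tier answers only if the previous tier's response did not clear its quality bar.

```python
# Hedged sketch of cascade routing with progressive escalation across tiers.
# Tier models, thresholds, and score_response() are illustrative assumptions.
from typing import Callable, Sequence

def cascade_route(query: str,
                  tiers: Sequence[Callable[[str], str]],    # ordered cheapest to most expensive
                  thresholds: Sequence[float],              # acceptance bar for each non-final tier
                  score_response: Callable[[str, str], float]) -> str:
    for model, bar in zip(tiers[:-1], thresholds):
        answer = model(query)
        if score_response(query, answer) >= bar:
            return answer              # good enough, stop escalating
    return tiers[-1](query)            # final tier always answers
```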

Deterministic verification with full reproducibility

Deterministic rule engine

This demonstrates that deterministic, byte-identical AI verification is achievable in practice, not just in theory. Every verification produces an identical result every time, with a cryptographic audit trail.

248 of 248 tests passing · Healthcare example domain

9 structured safety rules for clinical scenarios covering cross-reactivity errors, dosage verification, drug interaction risks, clinical currency, and more. Zero false positives on critical test cases. Byte-identical reproducibility verified across independent runs. Structured data integration for 5 clinical resource types. HTTP API with three production endpoints.

Methodology note: The safety rules are hand-authored for healthcare scenarios to demonstrate the system's capabilities. The test suite is proprietary. Healthcare was chosen as the example domain because it most clearly demonstrates the value of deterministic verification. The architecture is vertical-agnostic. The adaptive system (automatic rule discovery, graduated activation) is under development.
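The rule set and engine are proprietary, but the determinism-plus-audit-trail pattern can be sketched: pure rule functions over structured input, and a certificate computed as a hash of canonical JSON so identical inputs always produce byte-identical output. The single dosage rule below is invented for illustration.

```python
# Hedged sketch of a deterministic rule check with a hashable audit record.
# The actual rule set and engine are proprietary; this dosage rule is invented.
import hashlib
import json

def check_max_dose(response: dict, rule: dict) -> dict:
    """Deterministic rule: flag a dosage above the configured maximum."""
    violated = response["dose_mg"] > rule["max_dose_mg"]
    return {"rule_id": rule["id"], "violated": violated}

def certificate(response: dict, findings: list[dict]) -> str:
    """Canonical JSON in, SHA-256 out: identical inputs always yield an identical certificate."""
    payload = json.dumps({"response": response, "findings": findings}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

rule = {"id": "dosage-max", "max_dose_mg": 40}
response = {"drug": "example", "dose_mg": 80}
findings = [check_max_dose(response, rule)]
print(findings, certificate(response, findings))  # byte-identical on every run
```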

Verification speed

For production deployment, verification must add negligible latency. This confirms the deterministic approach is fast enough for every response in real time.

Timing measurement on deterministic engine

Average verification time: approximately 7ms per AI response, including rule evaluation, result logging, and certificate generation. Negligible compared to typical AI response times. Deterministic execution means no variance in timing.
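A sketch of how such a per-response latency figure is typically measured, where verify() is a stand-in for the full pipeline; this is not the project's benchmarking harness.

```python
# Hedged sketch: average wall-clock latency per verification over repeated runs.
import time

def mean_latency_ms(verify, response, runs: int = 1000) -> float:
    start = time.perf_counter()
    for _ in range(runs):
        verify(response)               # rule evaluation, logging, certificate generation
    return (time.perf_counter() - start) * 1000 / runs
```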

Underlying mathematics validated externally

The mathematical techniques underlying CoF Audit's verification approach have been validated against published state-of-the-art methods on public benchmarks, demonstrating significant advantages in fabrication detection.

Public benchmarks, multiple model families · Compared against peer-reviewed published methods

On fabrication detection (models producing ungrounded content): 0.87-0.99 AUROC across multiple model families using model-internal signals. On identical test prompts, the approach outperformed the leading peer-reviewed method (published in Nature 2024) by a significant margin, with non-overlapping confidence intervals, at 5x faster inference.

Methodology note: Fabrication detection measures when models generate content not grounded in input or reality. This is distinct from misconception detection (where models genuinely believe incorrect information), on which all current methods, including ours, perform near chance level. The strongest results require access to open-source model internals. Output-level signals that work on any model via standard API are validated but show more modest performance.
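For readers unfamiliar with the metric, AUROC is computed from per-response detector scores against grounded/fabricated labels. The sketch below uses invented scores purely to show the computation, not the reported results, and the scoring method itself is proprietary.

```python
# Hedged sketch of the AUROC computation for a fabrication detector.
# Labels and scores are invented for illustration only.
from sklearn.metrics import roc_auc_score

labels = [0, 0, 1, 1, 0, 1]               # 1 = fabricated (ungrounded) response
scores = [0.1, 0.3, 0.8, 0.9, 0.2, 0.7]   # detector's fabrication score per response
print(roc_auc_score(labels, scores))       # 1.0 on this toy data; 0.87-0.99 reported on real benchmarks
```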

Questions about our results?

We're happy to walk through any result in detail.