What we can demonstrate today
Every result below includes what it demonstrates, the benchmark used, and how the experiment was set up. For proprietary methods, we include a methodology note explaining the approach. All results are from benchmarks, not from production customer traffic.
Structural redundancy is real and measurable
Redundancy across architectures
MCG's core claim is that neural networks contain structurally redundant layers that can be identified and removed. The results below demonstrate that the pattern holds across fundamentally different architectures, from small CNNs to large language models, and that three stable layer archetypes emerge consistently.
| Architecture | Parameters | FLOP reduction | Quality vs. baseline |
|---|---|---|---|
| CNN (ResNet-18) | 11M | 55% | Preserved |
| CNN (WideResNet) | 36M | 47% | Preserved |
| Vision Transformer | 86M | 78% | Preserved |
| LLM (1.1B) | 1.1B | 48% | Improved |
| LLM (3B) | 3B | 51% | Preserved |
| LLM (7B) | 7B | 48% | Improved |
| LLM (8B) | 8B | 40% | Preserved |
| LLM (14B) | 14B | 35% | Improved |
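The underlying idea can be sketched in a few lines. The criterion below, input/output cosine similarity flagging near-identity layers, is a generic stand-in (the actual MCG criterion is proprietary), and the activations are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def redundant_layers(layer_io, threshold=0.99):
    """Flag layers whose output is nearly identical to their input.

    layer_io: list of (input_vector, output_vector) pairs, one per layer.
    A very high input/output similarity suggests the layer acts as a
    near-identity map and is a candidate for removal.
    """
    return [i for i, (x, y) in enumerate(layer_io)
            if cosine_similarity(x, y) >= threshold]

# Invented activations: layer 1 barely changes its input.
io = [
    ([1.0, 0.0], [0.2, 0.9]),    # layer 0: rotates the representation
    ([0.2, 0.9], [0.21, 0.9]),   # layer 1: near-identity -> redundant
    ([0.21, 0.9], [0.9, -0.3]),  # layer 2: substantial transformation
]
print(redundant_layers(io))  # -> [1]
```

In practice the probe would run over real hidden states on a calibration set; the point here is only that redundancy is a measurable, per-layer property.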
72B layer removal
This demonstrates that MCG scales to production-sized models and that some layers are actively harmful: removing them improves both quality and speed simultaneously.
Eight layers removed from a 72-billion-parameter model. Quality improved by 4%; wall-clock inference speedup was 13-20%. A second independent run confirmed the result: different layers were identified, with the same outcome. The output model is in a standard open-source format.
7B harmful layer discovery
This demonstrates that MCG finds not only redundant layers but layers that actively harm performance. Removing a single harmful layer improved model quality, and removing a second layer alongside it provided additional speedup with further quality improvement.
Layer 12 identified as harmful: removal improved quality by 2.35%. Removing layers 12 and 17 together yielded 1.058x speedup with 3.13% quality improvement. This was a model with no near-zero layers, where conventional analysis would find nothing to remove.
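The logic of a harmful-layer sweep can be sketched as a black-box ablation loop: remove one layer at a time, re-evaluate, and flag any layer whose removal improves quality. The evaluation function and per-layer contributions below are toy stand-ins, not MCG's actual procedure:

```python
def harmful_layers(evaluate, num_layers):
    """Ablation sweep: a layer is 'harmful' if removing it improves quality.

    evaluate(removed) -> quality score for the model with the given set of
    layer indices removed. Treated as a black box; in practice each call
    is a benchmark run.
    """
    baseline = evaluate(set())
    return [i for i in range(num_layers) if evaluate({i}) > baseline]

# Toy quality model: layer 2 subtracts quality, every other layer adds it.
contribution = [0.05, 0.03, -0.02, 0.04]
def toy_eval(removed):
    return 0.70 + sum(c for i, c in enumerate(contribution) if i not in removed)

print(harmful_layers(toy_eval, len(contribution)))  # -> [2]
```

Note that a harmful layer in this sense need not have near-zero weights, which is why magnitude-based analysis can miss it entirely.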
Downstream task retention
Layer removal must preserve performance across diverse tasks, not just one metric. This shows quality retention across three independent public benchmarks covering knowledge, common sense, and science reasoning.
After removing 16 layers, the model retains 94.9% (MMLU), 98.1% (HellaSwag), and 95.3% (ARC) of the unmodified dense model's scores.
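For clarity, the retention metric is simply the pruned model's raw benchmark score divided by the dense model's. The raw scores below are hypothetical, chosen only to reproduce the reported 94.9% figure:

```python
def retention(pruned, dense):
    """Score retention as a percentage of the unmodified (dense) model."""
    return round(100 * pruned / dense, 1)

# Hypothetical raw MMLU accuracies (not the actual measured values).
print(retention(pruned=0.664, dense=0.700))  # -> 94.9
```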
Response-quality measurement enables intelligent routing
Confidence-based routing
TTU's core claim is that measuring quality on the response (not the question) enables significant cost reduction without meaningful quality loss. This validates that approach on a standardized benchmark.
At the optimal threshold: up to 99.8% quality retained, up to 51% cost reduction. Routing overhead: negligible (six orders of magnitude below typical API latency). Six verified scenarios across different query types.
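The routing pattern can be sketched as follows. Everything here is a stand-in: the models are toy functions and the scorer is invented; TTU's actual response-quality measure is proprietary. The structural point is that the score is computed on the generated response, and escalation happens only when it falls below a threshold:

```python
def confidence_route(prompt, cheap_model, expensive_model, score, threshold):
    """Generate with the cheap model first, score the *response* (not the
    question), and escalate to the expensive model only when the score
    falls below the threshold."""
    draft = cheap_model(prompt)
    if score(draft) >= threshold:
        return draft, "cheap"
    return expensive_model(prompt), "expensive"

# Toy stand-ins: pretend short responses are high quality.
cheap = lambda p: p.upper()
expensive = lambda p: p.title()
score = lambda response: 0.9 if len(response) < 10 else 0.3

print(confidence_route("hi there", cheap, expensive, score, 0.8))
# -> ('HI THREE'.replace('THREE', 'THERE'), 'cheap') i.e. ('HI THERE', 'cheap')
print(confidence_route("a much longer prompt", cheap, expensive, score, 0.8))
# -> ('A Much Longer Prompt', 'expensive')
```

Because the check runs on an already-generated string, its overhead is tiny relative to the generation itself, which is consistent with the negligible routing latency reported above.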
Consistency routing
This demonstrates that intelligent routing can exceed single-model quality, not just match it at lower cost. Multiple independent generations combined with a majority vote catch errors that any single generation might make.
Consistency routing (K=3, majority vote): 105% quality compared to always using the most expensive model, with approximately 40% cost reduction.
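The K=3 majority-vote mechanism can be sketched directly; the toy model below (which answers correctly two times out of three) is an assumption for illustration:

```python
from collections import Counter

def consistency_route(prompt, model, k=3):
    """Sample K independent generations and return the majority answer.

    Agreement across samples is the quality signal; a failed majority
    can be used to trigger escalation to a stronger model.
    """
    answers = [model(prompt) for _ in range(k)]
    answer, votes = Counter(answers).most_common(1)[0]
    agreed = votes > k // 2
    return answer, agreed

# Toy model: answers "42" on two of three samples (invented behavior).
replies = iter(["42", "41", "42"])
model = lambda prompt: next(replies)
print(consistency_route("q", model))  # -> ('42', True)
```

A single unlucky sample ("41") is outvoted, which is the mechanism behind exceeding single-model quality.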
Cascade routing
Progressive escalation across three model tiers achieves a better quality-cost tradeoff than binary small/large routing.
Three-tier cascade routing verified with progressive escalation. Achieves higher quality than binary routing at comparable cost reduction.
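The escalation loop generalizes naturally to three tiers. As before, the tier models and scorer below are invented stand-ins; the sketch shows only the control flow:

```python
def cascade_route(prompt, tiers, score, threshold=0.8):
    """Progressive escalation: try tiers from cheapest to most expensive,
    accepting the first response whose quality score clears the threshold.
    The final tier's answer is returned unconditionally."""
    for name, model in tiers[:-1]:
        response = model(prompt)
        if score(response) >= threshold:
            return response, name
    name, model = tiers[-1]
    return model(prompt), name

# Toy tiers and scorer: the small tier's answer scores too low, the
# medium tier's clears the threshold, so the large tier is never called.
tiers = [("small", lambda p: "s"),
         ("medium", lambda p: "m"),
         ("large", lambda p: "l")]
score = lambda r: {"s": 0.5, "m": 0.9}.get(r, 1.0)

print(cascade_route("query", tiers, score))  # -> ('m', 'medium')
```

Compared to binary routing, the middle tier absorbs queries that are too hard for the small model but do not need the large one, which is where the improved quality-cost tradeoff comes from.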
Deterministic verification with full reproducibility
Deterministic rule engine
This demonstrates that deterministic, byte-identical AI verification is achievable in practice, not just in theory. Every verification produces an identical result every time, with a cryptographic audit trail.
Nine structured safety rules for clinical scenarios covering cross-reactivity errors, dosage verification, drug interaction risks, clinical currency, and more. Zero false positives on critical test cases. Byte-identical reproducibility verified across independent runs. Structured data integration for five clinical resource types. HTTP API with three production endpoints.
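The determinism-plus-audit-trail pattern can be sketched as pure rule evaluation over canonicalized input, with a hash as the certificate. The two rules below are illustrative stand-ins, not the actual CoF Audit rule set:

```python
import hashlib
import json

RULES = [
    # (rule id, predicate) -- invented examples of structured safety rules.
    ("max-dose", lambda r: r.get("dose_mg", 0) <= 4000),
    ("no-known-interaction", lambda r: not r.get("interacts", False)),
]

def verify(response):
    """Deterministically evaluate every rule and emit a certificate.

    The certificate is a SHA-256 hash over the canonicalized input and
    rule results (sorted keys, fixed separators), so the same input
    always yields a byte-identical audit record.
    """
    results = {rule_id: pred(response) for rule_id, pred in RULES}
    record = json.dumps({"input": response, "results": results},
                        sort_keys=True, separators=(",", ":"))
    return results, hashlib.sha256(record.encode()).hexdigest()

res, cert = verify({"dose_mg": 500, "interacts": False})
print(res)  # -> {'max-dose': True, 'no-known-interaction': True}
# Re-running on the same input reproduces the certificate byte-for-byte.
assert verify({"dose_mg": 500, "interacts": False})[1] == cert
```

No randomness, no model inference, and no ordering ambiguity in the serialized record: those three properties are what make byte-identical reproducibility achievable.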
Verification speed
For production deployment, verification must add negligible latency. This confirms the deterministic approach is fast enough for every response in real time.
Average verification time: approximately 7ms per AI response, including rule evaluation, result logging, and certificate generation. Negligible compared to typical AI response times. Deterministic execution means no variance in timing.
Underlying mathematics validated externally
The mathematical techniques underlying CoF Audit's verification approach have been validated against published state-of-the-art methods on public benchmarks, demonstrating significant advantages in fabrication detection.
On fabrication detection (models producing ungrounded content): in-model performance of 0.87-0.99 AUROC across multiple model families. On identical test prompts, it outperformed the leading peer-reviewed method (published in Nature, 2024) by a significant margin, with non-overlapping confidence intervals and 5x faster inference.
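For readers unfamiliar with the metric: AUROC is the probability that a randomly chosen positive example (here, a fabricated response) receives a higher detector score than a randomly chosen negative one, so 0.5 is chance and 1.0 is perfect separation. A minimal pairwise computation, with invented detector scores:

```python
def auroc(scores, labels):
    """AUROC via pairwise comparison: the probability that a random
    positive (label 1) scores higher than a random negative (label 0),
    counting ties as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented scores: fabricated responses (label 1) all score higher.
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # -> 1.0
```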
Questions about our results?
We're happy to walk through any result in detail.