ZETAPHI benchmark statements are scoped to custom models, specific test regimes,
matched comparisons, and explicit claim boundaries. Public benchmark material expands
only when the underlying receipts are ready to stand on their own.
Argoverse 2: Spatial Translation Robustness
Global Origin Offset: [+0, +0] (0 m)
Live Validation MSE: 12.55
Argoverse 2 Motion Forecasting: Equal-Parameter Parity Match
*Evaluated on raw continuous trajectory coordinates. No agent-centric geometric normalization or handcrafted spatial embeddings were applied to either model to isolate raw architectural induction capability.*
| Architecture | Parameters | Validation MSE | Epoch 20 Stability | Peak VRAM | Batch-1 Latency |
| --- | --- | --- | --- | --- | --- |
| Dense Transformer (Baseline) | 497,820 | 4,168,184.45 | Failed (Diverged) | ~12.2 GiB | ~4.5000 ms |
| ZetaPhi 4W (Ours) | 496,396 | 12.55 | Stable | 2.2 GiB | 0.0019 ms |
| ZetaPhi 8W (Ours) | 496,396 | 13.86 | Stable | 2.2 GiB | 0.0019 ms |
Why the Transformer Explodes on Raw Spatial Data
Standard sequence-mixing architectures compute token similarity as the dot product of Queries and Keys ($Q K^T$). When fed raw, continuous map coordinates (e.g., $X=1500, Y=-800$), this mechanism has a structural weakness: the dot products scale quadratically with the absolute magnitude of the global coordinates.
As vehicles travel farther from the origin, the attention logits grow with them. The Softmax saturates toward a one-hot distribution, gradient flow collapses, and training diverges, which shows up as scattered trajectory predictions.
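The saturation effect can be reproduced in a few lines of Python. This is a simplified sketch, assuming raw (x, y) coordinates are used directly as queries and keys with no learned projections; the trajectory values and the function names `softmax` and `attention_weights` are illustrative and not taken from the benchmark code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights for one query (no learned projections)."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(logits)


# Illustrative trajectory: four points, 0.1 m apart in x (assumed values).
local = [(0.0, 0.0), (0.1, 0.05), (0.2, 0.1), (0.3, 0.15)]

# The same trajectory shifted to the raw map position used in the text.
offset = (1500.0, -800.0)
shifted = [(x + offset[0], y + offset[1]) for x, y in local]

w_local = attention_weights(local[-1], local)       # stays close to uniform
w_global = attention_weights(shifted[-1], shifted)  # collapses toward one-hot

print("max weight near origin:", round(max(w_local), 3))
print("max weight at offset:  ", round(max(w_global), 3))
```

The geometry of the two trajectories is identical; only the origin offset differs. Near the origin the weights stay spread across all four points, while at the offset position almost all of the weight lands on a single key.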
The ZetaPhi Advantage: O(N) Linear Scaling
ZetaPhi replaces quadratic dot-product cross-attention with a continuous-time architecture that scales linearly in sequence length. Each step applies an O(1) stateful temporal integration update, so a sequence of N steps costs O(N) in total, and absolute coordinate magnitudes are never multiplied against one another.
In the benchmark above, this gives ZetaPhi translation invariance without any coordinate normalization. Gradient flow stayed stable over unbounded continuous features, and peak VRAM was a fraction of the baseline's (2.2 GiB vs ~12.2 GiB) because the state held in memory is O(1) in sequence length.
NUPLAN CLOSED-LOOP SIMULATION
O(1) Inference Latency at 1000-Agent Scale
In continuous closed-loop robotics environments, sequence-mixing architectures face a strict hardware barrier: concatenating historical state for every micro-adjustment causes an $O(N^2)$ compute and memory explosion. We isolated this trap by benchmarking ZetaPhi's O(1) stateful temporal integration against a Dense Transformer across 1,000 simultaneous agents.
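The growth pattern described above can be made concrete with a small sketch. It compares the bookkeeping of a history-concatenating model with a fixed-size recurrent state. `DecayState` is a generic leaky integrator used only as a stand-in; it is an assumption for illustration and not ZetaPhi's actual update rule, and `d_model = 8` is an arbitrary choice.

```python
def kv_cache_floats(ticks, d_model):
    """Floats held after `ticks` steps when every past key and value is kept."""
    return 2 * ticks * d_model


def attention_score_ops(ticks):
    """Total query-key comparisons when each new tick attends to all history so far."""
    return ticks * (ticks + 1) // 2


class DecayState:
    """Generic leaky integrator: a fixed-size state updated once per tick."""

    def __init__(self, d_model, decay=0.9):
        self.state = [0.0] * d_model
        self.decay = decay

    def step(self, x):
        # New state depends only on the previous state and the current input.
        self.state = [
            self.decay * s + (1.0 - self.decay) * xi
            for s, xi in zip(self.state, x)
        ]
        return self.state


d_model = 8
integrator = DecayState(d_model)
for _ in range(1000):
    integrator.step([1.0] * d_model)

print("recurrent state floats:", len(integrator.state))        # 8
print("KV cache floats:       ", kv_cache_floats(1000, d_model))  # 16000
print("score comparisons:     ", attention_score_ops(1000))       # 500500
```

The recurrent state stays at `d_model` floats regardless of how many ticks have elapsed, while the cached history grows linearly and the cumulative comparison count grows quadratically.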
1000-Agent Closed-Loop Trajectory Parity Match
*Evaluated on 1,000 simultaneous agents over 1,000 physics ticks. 500k parameter budget. Both models normalized for relative kinematics.*
| Architecture | Validation MSE | Peak VRAM | Tick 1000 Latency | Safety Standard |
| --- | --- | --- | --- | --- |
| Dense Transformer | 0.0143 | 4,430 MB | 226.93 ms | Violates 100ms Deadline |
| ZetaPhi Spectrum 64W (Ours) | 0.0137 | 44.6 MB | 0.0095 ms | Continuous Real-Time |
The Conclusion: The Dense Transformer hits a latency wall as its Key-Value history grows: at tick 1000 it takes 226.93 ms per step, violating the 100 ms control deadline and causing simulated collisions. ZetaPhi reaches comparable trajectory accuracy (0.0137 vs 0.0143 validation MSE) while executing in 9.5 microseconds via its fused C++ kernel, which supports the case for linearly scaling continuous-time architectures in edge robotics.
EUROC MAV — KINEMATICS & IMU SENSOR FUSION
ZetaPhi W=8 Spectrum resolves the Position vs. Rotation Pareto limit.
The Bottom Line: On continuous multi-axis drone telemetry, standard attention splits accuracy between position and rotation. ZetaPhi's W=8 Spectrum model established a new Pareto optimal frontier, achieving strict parameter-matched dominance over the Dense Transformer.
Tracking both high-frequency vibration and long-term trajectory integration simultaneously starves standard architectures. Providing 8 distinct temporal witnesses spanning a logarithmic frequency range optimally captures the physical realities of multi-rotor kinematics.
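One way to picture "8 witnesses spanning a logarithmic frequency range" is a geometric grid of centre frequencies. The sketch below only illustrates that spacing; the band limits (0.1 Hz to 100 Hz) and the helper name `log_spaced_frequencies` are assumptions, and the source does not state how ZetaPhi actually assigns frequencies to witnesses.

```python
def log_spaced_frequencies(f_min, f_max, count):
    """Return `count` frequencies spaced by a constant ratio from f_min to f_max."""
    if count < 2 or f_min <= 0 or f_max <= f_min:
        raise ValueError("need count >= 2 and 0 < f_min < f_max")
    ratio = (f_max / f_min) ** (1.0 / (count - 1))
    return [f_min * ratio ** i for i in range(count)]


# Hypothetical band: slow trajectory drift (0.1 Hz) up to rotor vibration (100 Hz).
freqs = log_spaced_frequencies(0.1, 100.0, 8)
for i, f in enumerate(freqs):
    print(f"witness {i}: {f:8.3f} Hz")
```

With a constant ratio between neighbours, each witness covers the same fraction of a decade, so slow drift and fast vibration receive equal resolution on a log scale.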
CLAIM BOUNDARY
Optimal width is bounded.
The "Best vs Best" topology sweep showed that W=16 began to fragment channel capacity, degrading performance back to Transformer parity. Within this sweep, W=8 was the best structural width for 6-DoF inertial prediction.
ROBOMIMIC V1 — CONTINUOUS ROBOTIC CONTROL
Wide geometric topologies natively map to complex multi-actuator telemetry.
The Bottom Line: Learning human teleoperation requires modeling complex dependencies across gripper actuations and joint velocities. A wide ZetaPhi topology (W=32) reduced error monotonically and outperformed the Dense Transformer under exact parameter parity.
Spectrum mechanism dominates clean RF, but reveals a rotational boundary.
The Bottom Line: We relaxed the W=4 constraint to W=8, allowing ZetaPhi to perfectly match the high-frequency transitions of complex modulations (QAM/PSK), beating the Transformer on Clean accuracy (60.68% vs 57.98%). We then subjected both models to a rigorous RF Hardware Fault Crucible.
Impulse Hampel Filter: 54.34% vs 48.02% Transformer
IQ Imbalance Phase Shift: 39.61% vs 54.72% Transformer
WHAT IT MEANS
Robustness to real-world deployment faults.
ZetaPhi natively outperforms the Transformer across missing packets, noisy burst channels, and CFO phase drift. It acts as an inherently stable, low-latency edge classifier under most standard transmission failures.
CLAIM BOUNDARY
Uncalibrated Quadrature vulnerability.
Without cross-witness dense layer-norms, ZetaPhi's orthogonal state tracking is highly vulnerable to extreme rotational phase shifts (IQ Imbalance). This provides a firm, honest physical boundary for Edge deployments: raw IQ streams must be pre-calibrated for phase imbalance before ZetaPhi ingestion.
COMPUTATIONAL GENOMICS — 131,072 SEQUENCE LENGTH
ZetaPhi natively survives context scaling where dense attention critically fails.
The Bottom Line: We pushed sequence modeling to the limits of a single 24GB RTX 4090, targeting a 131,072 base-pair context length (approaching Enformer/Basenji scale) for epigenetic track prediction. The Dense Transformer natively hit a fatal Out-of-Memory (OOM) wall. ZetaPhi successfully completed the 131k training loop.
ZetaPhi W=8 Training: SUCCESS (Batch size 2, 24GB VRAM)
Dense Transformer Training: FATAL OOM (Quadratic Graph Materialization)
WHAT IT MEANS
True O(1) stateful backpropagation.
Dense attention requires an O(N²) memory footprint to materialize the attention map for backpropagation. By utilizing PyTorch gradient checkpointing over ZetaPhi's custom C++ O(N) temporal loop, we bypassed naive graph caching. ZetaPhi's memory footprint is bounded strictly by its hidden state dimension, unlocking massive enterprise-scale sequence modeling on consumer-grade hardware.
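The size of the quadratic term is easy to check by hand. The sketch below estimates the memory needed to hold one full attention score matrix at a 131,072-token context. It assumes a single head, batch size 1, and a fully materialized N x N matrix; the function name `attention_map_bytes` is illustrative, and real implementations may fuse or tile this computation.

```python
GIB = 1024 ** 3


def attention_map_bytes(seq_len, bytes_per_element=4, heads=1, batch=1):
    """Bytes needed to store a dense seq_len x seq_len attention score matrix."""
    return batch * heads * seq_len * seq_len * bytes_per_element


n = 131_072
fp32_gib = attention_map_bytes(n, bytes_per_element=4) / GIB
fp16_gib = attention_map_bytes(n, bytes_per_element=2) / GIB

print(f"fp32 score matrix: {fp32_gib} GiB")  # 64.0 GiB
print(f"fp16 score matrix: {fp16_gib} GiB")  # 32.0 GiB
```

Under these assumptions a single score matrix already exceeds the 24 GB card before activations, multiple heads, or the batch size of 2 are counted.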
FI-2010 LIMIT ORDER BOOK — HIGH-FREQUENCY TRADING
ZetaPhi hits 0.12ms tick-to-trade latency via fully fused O(1) state generation.
The Bottom Line: High-Frequency Trading demands strictly reactive, single-tick inference (Batch 1, Seq 1) over dense 144-feature market depth arrays. Under PyTorch compilation (reduce-overhead), ZetaPhi's recurrent state successfully fused into a single kernel, achieving a flat 0.12ms inference latency compared to the Transformer's ~69ms KV-cache sync overhead.
ZetaPhi W=8 Tick Latency: 0.1239 ms (O(1) Stateful Fusion)
Transformer Tick Latency: 69.70 ms (Attention Graph Breaks)
WHAT IT MEANS
Microsecond-scale exchange boundaries.
Because ZetaPhi requires no dynamic sequence reallocation or KV-cache updates, its entire predictive loop reduces to pure vector math. This unlocks deep sequence models for microsecond-scale trading algorithms previously restricted to linear regressions or shallow decision trees.
Hybrid ZetaPhi matches Dense Transformer semantics with zero parameter starvation.
The Bottom Line: We executed a strict Best-vs-Best semantic evaluation on the TinyStories dataset. The control was a 4-Layer Dense Transformer (29.46M params). The experimental lane was a Hybrid ZetaPhi architecture consisting of 1 Layer of Local Exact Attention + 3 Layers of ZetaPhi Spectrum (28.69M params). ZetaPhi operated under strict starvation rules, using ~770k fewer parameters than the baseline.
Hybrid ZetaPhi (W=8): 2.5991 loss at Step 1500 (28.69M Params)
Dense Transformer (H=8): 2.7521 loss at Step 1500 (29.46M Params)
WHAT IT MEANS
Local associative lookup + Infinite macro context.
ZetaPhi is exceptionally strong at long-range structural modeling but can struggle with exact token-level associative lookups (e.g., retrieving specific names or exact short-range grammatical rules). By pairing a single layer of local sliding-window attention with an infinite-context O(N) ZetaPhi stack, we successfully matched and exceeded the Dense Transformer's semantic loss curve without requiring an O(N²) global footprint.
2026 PHYSICAL-SIGNAL BENCHMARK SERIES
Parameter-matched, multi-seed comparisons across four sensor domains.
The current benchmark series evaluates the ZetaPhi architecture against parameter-matched GRU, temporal-CNN,
and Transformer baselines on continuous physical signal streams: human-activity recognition (inertial sensors),
radio-frequency modulation classification, turbofan remaining-useful-life prognostics, and RNA structure
prediction. Every comparison holds parameter budget, optimizer, schedule, and data splits constant; model
selection uses validation only, and test sets are read once per final model. Results below report mean ± std
across seeds. Architecture variants (A/B/C) differ only by internal non-trainable settings — zero parameter
delta and zero measured latency delta between variants.
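For readers reproducing the "mean ± std across seeds" format, a minimal sketch follows. The per-seed accuracies are placeholder values chosen for easy arithmetic, not benchmark results, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say whether sample or population std is reported.

```python
import statistics


def summarize_seeds(scores, digits=2):
    """Format per-seed scores as 'mean ± std' using the sample standard deviation."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to report a spread")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


# Placeholder accuracies for three seeds (illustrative only).
summary = summarize_seeds([70.0, 71.0, 72.0])
print(summary)  # 71.00 ± 1.00
```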
RADIOML 2016.10a — RF MODULATION CLASSIFICATION
Parameter-matched comparison on 220,000 radio signals, 11 modulation classes.
The Bottom Line: ZetaPhi variant C outperforms the parameter-matched Transformer by +1.75
points and leads every architecture in the high-SNR band (90.4% at +16 dB). The temporal CNN holds the
overall clean lead at this short 128-sample window — reported here because honest baselines matter.
| Model | Params | Test Acc (3 seeds) | Batch-1 Latency (p50) | Corruption Retention |
| --- | --- | --- | --- | --- |
| Temporal CNN | 522,587 | 61.38 ± 0.17 | 0.426 ms | 0.875 |
| ZetaPhi variant C | 541,995 | 60.63 ± 0.14 | 0.452 ms | 0.795 |
| Transformer | 547,275 | 58.88 ± 0.33 | 0.388 ms | 0.816 |
| GRU | 524,587 | 58.15 ± 0.17 | 1.281 ms | 0.853 |
| ZetaPhi variant A | 541,995 | 56.65 ± 0.11 | 0.472 ms | 0.655 |
WHAT IT MEANS
Internal configuration alone moves accuracy and robustness
Variant C versus variant A is +3.98 points of clean accuracy and +0.14 of corruption retention from
zero-parameter internal settings — the dominant axis of the architecture, confirmed in a third domain.
Variant C also beats the Transformer on 7 of 10 corruption cells and wins the sample-clock-error cell
outright over every baseline.
CLAIM BOUNDARY
Honest scope, including where we lose
The temporal CNN leads overall at this 128-sample window length, and slowly varying multiplicative
distortions (carrier-frequency drift, IQ imbalance) remain the architecture's weakest corruption family.
A 1024-sample long-context study on RadioML 2018.01A is in progress, where sequence-length scaling
becomes the dominant cost factor.
The Bottom Line: ZetaPhi variant C posts the best clean accuracy on the board
(88.76 vs the Transformer's 87.62, 5 seeds) and, behind a standard embedded driver filter, holds its
full clean accuracy under sensor spike bursts — a regime where the Transformer loses 30+ points.
| Condition | Transformer (611k params) | ZetaPhi variant C (542k params) |
| --- | --- | --- |
| Clean (test, 5 seeds) | 87.62 ± 0.37 | 88.76 ± 1.17 |
| Spike bursts (raw) | 17.20 | 69.14 |
| Spike bursts + standard Hampel filter | 55.93 | 88.77 (= own clean) |
| 20% packet loss + forward-fill | 87.40 | 88.55 |
| Calibration drift (honest negative) | 77.10 | 72.21 |
WHAT IT MEANS
Graceful degradation behind real driver stacks
Behind the same standard embedded filter, variant C under spike bursts matches its own clean accuracy and
exceeds the Transformer's clean accuracy. For deployed sensor systems, behavior under faults is the
operative metric, and that is where this architecture differentiates.
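For context on the "standard embedded filter", here is a minimal sketch of a textbook Hampel filter: each sample is compared with the median of a sliding window and replaced when it lies more than a few scaled median absolute deviations away. The window size, threshold, and sample signal are assumptions for illustration; the filter settings used in the study are not given here.

```python
import statistics


def hampel_filter(samples, half_window=3, n_sigmas=3.0):
    """Replace outliers with the local median (classic Hampel identifier)."""
    k = 1.4826  # scales the MAD to a standard deviation for Gaussian data
    filtered = list(samples)
    for i in range(len(samples)):
        lo = max(0, i - half_window)
        hi = min(len(samples), i + half_window + 1)
        window = samples[lo:hi]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        # Note: if mad is 0, any sample that differs from the median is replaced.
        if abs(samples[i] - med) > n_sigmas * k * mad:
            filtered[i] = med
    return filtered


# Illustrative sensor trace with one spike at index 4.
signal = [1.0, 1.1, 0.9, 1.0, 50.0, 1.05, 0.95, 1.0, 1.1]
cleaned = hampel_filter(signal)
print(cleaned)
```

On this trace only the spike is replaced; the surrounding samples pass through unchanged, which is why such a filter can restore clean-signal behavior under spike bursts.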
CLAIM BOUNDARY
A lead, with negatives stated
The clean lead over the Transformer (+1.14) is smaller than ZetaPhi's own seed-to-seed spread (±1.17), so it
still needs statistical confirmation and is not a closed case. Raw zero-injection and sustained calibration
drift favor the Transformer; both results are reported in the underlying study rather than omitted.
NASA C-MAPSS FD001 — TURBOFAN PROGNOSTICS
Remaining-useful-life regression on dynamic flight trajectories (FD004)
The Bottom Line: At a strict 500k-parameter parity, the new ZetaPhi Gated Spectrum architecture addresses the non-stationary calibration drift problem. By dynamically shutting the mean-field gate during dual-fault modes, ZetaPhi achieves lower RMSE than the O(N²) Transformer at both standard (seq 50) and extreme (seq 150) histories on NASA's most demanding telemetry dataset.
| Sequence Length | Attention (500k) | ZetaPhi Gated Spectrum (500k) |
| --- | --- | --- |
| 50 | 24.87 RMSE | 21.61 RMSE |
| 150 | 44.37 RMSE | 38.37 RMSE |
WHAT IT MEANS
Dynamically severing poisoned anchors
Under massive non-stationary operating conditions (altitude/Mach shifts) combined with dual fault modes, naive return-to-mean equations drift wildly. The Gated Spectrum topology learns to instantly sever its anchor rope when it detects complex failure, allowing it to accurately trace the end-of-life dive independently while standard O(N²) attention breaks down.
CLAIM BOUNDARY
Extrapolation stability and scaling
At extreme extrapolation horizons, ZetaPhi's continuous stream architecture is more stable than standard attention models. Its batch-1 execution latency also stays flat at 0.26 ms out to sequences of 4096 tokens, where Attention costs 10x the time and 8x the memory, consistent with the
behavior observed in the sequence-scaling work elsewhere on this page.
CLAIM BOUNDARY
Short-history regimes favor the baselines
At 30–50-cycle histories — the common deployment regime for this dataset — ZetaPhi loses cleanly to all
three baselines, and one long-history seed showed instability (reflected in the ±5.18). Both facts are
stated in the underlying card.
KAGGLE RIBONANZA — RNA STRUCTURE PREDICTION
Hidden-test evaluation against a dense-attention control, scored by Kaggle.
The Bottom Line: A ZetaPhi sequence layer, swapped in as a drop-in replacement for the
self-attention stage of an otherwise identical pipeline, outperformed the dense-Transformer control on
Kaggle's hidden test data on both the public and private leaderboards (error metric, lower is better).
| Model | Public Leaderboard | Private Leaderboard |
| --- | --- | --- |
| ZetaPhi (attention stage replaced) | 0.18567 | 0.18299 |
| Dense Transformer control | 0.20657 | 0.20686 |
WHAT IT MEANS
Hidden-test evidence on structured biological sequences
Hidden-test leaderboard scoring removes test-set tuning as an explanation: neither model ever saw the
evaluation data. The architecture's strongest results continue to come from structured, long-range-dependency
domains such as molecular sequence data.
CLAIM BOUNDARY
One disclosed confound
The ZetaPhi entry carried roughly 37% more parameters than the control in this pairing. A parameter-matched
rematch is on the roadmap; until then this result is reported as strong but not parameter-controlled.
PG-19 LONG-CONTEXT SEMANTICS
Breaking the Context Barrier: 1 Million Tokens with ZetaPhi.
Traditional transformer architectures face a hard scaling wall: attention memory grows quadratically with the amount of context they process. In our benchmark, a standard Dense Transformer completely exhausted 16GB of VRAM and crashed (CUDA Out of Memory) at just 64,000 tokens.
The ZetaPhi O(1) Architecture: Using native O(1) stateful temporal integration and linear scaling constraints, ZetaPhi processed an unbroken stream of 1,032,192 real semantic tokens from PG-19 with a perfectly flat memory footprint of just 83.3 MB, completely bypassing the memory bottlenecks of dense attention.
DEEP CONTEXT ABSORPTION
Quality Increases with Scale
A common issue with extending sequence length in linear models is the loss of narrative tension: the model "survives" the context but forgets the plot, causing perplexity to degrade. ZetaPhi showed the opposite. As context scaled toward a million tokens, the model's perplexity decreased, dropping from ~150 to a low of 67.81 at the 950,000-token mark. This indicates that the model uses deep context to better capture narrative structure.
CLAIM BOUNDARY
Task-bounded mechanism validation
This is a strictly bounded architectural comparison on identical parameters. It demonstrates that the O(N) scaling mechanism generalizes to deep semantic text without capacity starvation, but it does not represent a claim of universal language-quality parity with massive scale commercial LLMs.
STATEFUL EDGE INFERENCE
O(1) Generation Latency and Flat VRAM Footprint.
Autoregressive generation was tested up to 1,032,192 tokens on a single 24GB consumer GPU. Using a stateful CUDA kernel, ZetaPhi maintains a flat generation step-time of ~13.4 milliseconds natively in registers regardless of sequence depth.
The Bottom Line: ZetaPhi's recurrent state successfully processed over 1,000,000 tokens while maintaining constant memory bounds and sub-millisecond per-token step latency (batched).
ARTIFACT BASIS
Strict parameter parity and compiled edge receipts
Lanes held at near parameter parity (within 1%): Dense Transformer (501,914 mixer params) vs ZetaPhi Spectrum (505,648 mixer params).
Dataset: PG-19 tokenized via GPT-2. Evaluated on test sequences from 128 to 4,096 tokens.
Generation latency measured via torch.utils.cpp_extension.load_inline using single-step stateful C++ kernels to bypass PyTorch graph overhead.
WAYMO AUTONOMOUS TRAJECTORY TRACKING
Massive Multi-Agent Tracking at Edge Speeds.
To evaluate spatial reasoning and temporal tracking capabilities, ZetaPhi was tested against the real-world Waymo Open Motion Dataset. The task required structurally predicting the dynamic physical trajectories of 1,000 simultaneous agents (vehicles, pedestrians, cyclists).
The Bottom Line: Traditional dense attention struggles with the massive sequence lengths required for 1,000 concurrent agents, resulting in 226,000 µs latency per step. By utilizing native translation invariance and O(N) linear scaling, ZetaPhi accurately tracked the agents with a robust Mean Squared Error (MSE) of 0.477, while completing stateful inference in just 530.2 µs natively in C++. That is a 426x speedup over the dense baseline, operating comfortably within real-time edge computing constraints.
Accuracy: ZetaPhi achieved a stable 0.477 MSE across 3,840 validated scenarios.
Latency Measured: Dense Transformer baseline (226.0 ms) vs ZetaPhi compiled CUDA extension (0.53 ms).
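The quoted speedup follows directly from the two latencies. A small check, using only the figures stated above (the helper name `speedup` is illustrative):

```python
def speedup(baseline_us, candidate_us):
    """Ratio of baseline latency to candidate latency (both in microseconds)."""
    if candidate_us <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_us / candidate_us


dense_us = 226_000.0   # Dense Transformer: 226.0 ms per step
zetaphi_us = 530.2     # ZetaPhi: 530.2 microseconds per step

ratio = speedup(dense_us, zetaphi_us)
print(f"{ratio:.0f}x")  # 426x
```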
TINYSTORIES FULL-DATA SEMANTIC RUN
Matched 1-epoch causal-LM comparison under shared controls.
The matched causal language modeling runs show that the discrete relational architecture can learn
meaningful TinyStories language structure under the same full-corpus 1-epoch training budget used for
the dense control and the lower-witness comparison lane.
In this updated semantic lane, the 16-Witness TCR run completed the full corpus and achieved the
strongest validation result in the matched setup, outperforming both the dense Transformer control and
the 2-Witness TCR baseline. This is bounded semantic-learning evidence under shared controls, not a
general pretrained-LLM replacement claim.
| Lane | Lineage / Notes | Final Val Loss | Final Val PPL | Train Steps | Elapsed |
| --- | --- | --- | --- | --- | --- |
| 16-Witness TCR | Best validated result in this exact 1-epoch full-data setup | 1.5555 | 4.7373 | 264,965 / 264,965 | 2h 52m |
| Dense Transformer | Strong dense attention control under the same full-data budget | 1.7656 | 5.8453 | 264,965 / 264,965 | 48m |
| 2-Witness TCR | Minimal witness circular-reader baseline under the same matched setup | 1.8128 | 6.1274 | 264,965 / 264,965 | 38m |
WHAT IT MEANS
Best semantic result in the matched TinyStories lane
On this bounded full-data TinyStories pass, 16-Witness TCR led decisively, beating both the dense
Transformer control and the smaller 2-Witness TCR baseline.
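The loss and perplexity figures for this lane are linked by PPL = exp(loss), assuming the validation loss is a mean cross-entropy in nats. The sketch below recomputes perplexity from the reported losses; small differences in the last digits are expected because the losses are rounded to four decimals.

```python
import math

# Final validation losses as reported for the three lanes.
val_losses = {
    "16-Witness TCR": 1.5555,
    "Dense Transformer": 1.7656,
    "2-Witness TCR": 1.8128,
}

# Perplexity is the exponential of the mean cross-entropy (natural log base).
perplexities = {lane: math.exp(loss) for lane, loss in val_losses.items()}

for lane, ppl in perplexities.items():
    print(f"{lane}: loss {val_losses[lane]:.4f} -> PPL {ppl:.3f}")
```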
CLAIM BOUNDARY
Still task-bounded and evidence-scoped
This section should be read as task-specific, receipt-backed semantic evidence only.
It does not imply universal model superiority, pretrained parity, or broad language-quality claims.
Controls were shared across lanes, but parameter count was not equalized across witness configurations
in this early run; a strictly parameter-matched semantic comparison is on the public roadmap below.
ARTIFACT BASIS
Three matched full-data runs with explicit receipt anchors
All three lanes completed 264,965 / 264,965 steps.
Controls held constant: TinyStories full train split, GPT-2 tokenizer, context length 128, batch size 8, d_model 128, 2 layers, lr 3e-4, 1 epoch.
Context-survival and throughput boundary evidence.
The Bottom Line: In this forward-only ultralong scaling artifact,
Dense failed first, 16-Witness TCR completed through 524,288 tokens before OOM at 1,048,576,
the earlier TCR adapter lane completed through 1,048,576, and Toroidal extended one
full boundary higher to 2,097,152 tokens.
This section is compute/efficiency evidence only. It should not be read as semantic-quality evidence.
Once dense fails, later rows establish survival boundaries rather than full-range speed parity.
| Lane | Largest Completed Context | Next Failure Boundary | Throughput at Largest Completed | Claim Boundary |
| --- | --- | --- | --- | --- |
| Dense Transformer | No completed ultralong row | OOM at 32,768 | N/A | Failure boundary only, not a quality claim |
| 16-Witness TCR | 524,288 tokens | OOM at 1,048,576 | 104,046 tokens/s | Efficiency / compute / context-survival evidence only |
| TCR Adapter | 1,048,576 tokens | OOM at 2,097,152 | 1,573,723 tokens/s | Efficiency / compute / context-survival evidence only |
| Toroidal Adapter | 2,097,152 tokens | OOM at 4,194,304 | 1,677,532 tokens/s | Efficiency / compute / context-survival evidence only |
WHAT IT MEANS
Long-context reach is materially extended
In this harness, the toroidal-family lanes extend feasible context far beyond dense attention.
The new 16-Witness TCR row adds a heavier witness-family point on that curve: better semantic quality in the
matched TinyStories lane came with a lower ultralong survival boundary than the lighter TCR adapter lane.
That matters for understanding the quality-vs-endurance tradeoff, even though it does not by itself establish semantic quality.
CLAIM BOUNDARY
Systems evidence, not language-quality evidence
This artifact is explicitly forward-only and compute-oriented. It should be interpreted as
survival/throughput evidence, not as perplexity, benchmark-score, or universal capability proof.
ARTIFACT BASIS
Ultralong survival boundary snapshot
Dense OOM at 32,768.
16-Witness TCR completed through 524,288 and OOM’d at 1,048,576.
TCR completed through 1,048,576 and OOM’d at 2,097,152.
Toroidal completed through 2,097,152 and OOM’d at 4,194,304.
The Bottom Line: Our toroidal CIFAR references outperform the matched dense Transformer baseline,
but strong CNN baselines still lead this benchmark in absolute accuracy.
| Model | Notes | Epochs | Eval Acc | Params | Peak VRAM (MB) | Acc / GB VRAM | Acc / M Params |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WRN-28-10 | Strong CNN baseline | 100 | 0.8138 | 36,536,884 | 2630.0 | 0.3169 | 0.0223 |
| ResNet-18 | Standard CNN baseline | 100 | 0.7896 | 11,220,132 | 711.3 | 1.1367 | 0.0704 |
| Two-Witness Exp18 | Heavier toroidal experimental branch | 100 | 0.6933 | 1,264,885 | 891.7 | 0.7961 | 0.5481 |
| Exp5 Single-Lattice | Main toroidal reference branch | 100 | 0.6920 | 749,869 | 556.9 | 1.2724 | 0.9228 |
| Dense Transformer | Standard attention baseline | 100 | 0.6337 | 700,773 | 385.8 | 1.6820 | 0.9043 |
WHAT IT MEANS
Better than dense attention, not better than top CNNs
On this benchmark, the toroidal references clear the matched dense Transformer baseline,
but they do not beat the strongest CNN baselines in raw accuracy.
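The two efficiency columns can be recomputed from the accuracy, parameter, and VRAM columns. The sketch below assumes 1 GB = 1024 MB, which is the convention that reproduces the reported values; the function names are illustrative.

```python
# (eval accuracy, parameter count, peak VRAM in MB) as reported for each model.
rows = {
    "WRN-28-10": (0.8138, 36_536_884, 2630.0),
    "ResNet-18": (0.7896, 11_220_132, 711.3),
    "Two-Witness Exp18": (0.6933, 1_264_885, 891.7),
    "Exp5 Single-Lattice": (0.6920, 749_869, 556.9),
    "Dense Transformer": (0.6337, 700_773, 385.8),
}


def acc_per_gb(acc, vram_mb):
    """Accuracy divided by peak VRAM in GB (assuming 1024 MB per GB)."""
    return acc / (vram_mb / 1024.0)


def acc_per_million_params(acc, params):
    """Accuracy divided by parameter count in millions."""
    return acc / (params / 1_000_000)


derived = {
    name: (acc_per_gb(acc, vram), acc_per_million_params(acc, params))
    for name, (acc, params, vram) in rows.items()
}

for name, (per_gb, per_m) in derived.items():
    print(f"{name}: {per_gb:.4f} acc/GB, {per_m:.4f} acc/M params")
```

Recomputed values agree with the reported columns to about three decimal places; the residual differences are consistent with rounding of the published accuracies.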
CLAIM BOUNDARY
Calibration evidence, not a universal image-model claim
These rows are benchmark-specific reference points only. They are included as honest calibration,
not as a broad model-family victory claim.
PUBLIC BENCHMARK ROADMAP
Next artifact-backed releases
RadioML 2018.01A long-context study: 1024-sample windows, parameter-matched baselines, accuracy and compute-cost curves versus sequence length (in progress).
Parameter-matched semantic lane: TinyStories and PG-19 perplexity comparisons under strict parameter parity with training-cost receipts.
Needle-in-a-Haystack / Passkey Retrieval: exact key-retrieval accuracy across long contexts with matched baselines.
Long-context robotics sensor streams: visual-inertial and multi-rate sensor fusion with matched baselines.
VALIDATION MEAN SQUARED ERROR (MSE) ON 65,536 ZETA-ZERO GAPS
(Lower MSE = Higher Precision and Stronger Geometric Resonance)
DENSE TRANSFORMER: 0.287
2-WITNESS: 0.229
2-WITNESS (ASYM): 0.194
8-WITNESS: 0.167
* Note: A standard Dense Transformer matrix blurs the sequence, while the 8-Witness Toroidal architecture reduces error by ~42%.
Why this benchmark
The spacings between consecutive Riemann zeta zeros form one of the most structured numerical sequences
available: rigid, aperiodic, and governed by deep long-range correlations. That makes them a demanding
stress test for sequence architectures — there is no local shortcut, and a model only improves by
capturing genuine long-range structure. On this task, dense attention hits a clear performance floor.
The ZetaPhi architecture distributes relational processing across multiple structurally distinct
internal pathways and reconciles their outputs hierarchically, rather than resolving all pairwise
interactions in a single dense matrix. On this dataset, that approach reduced validation error
monotonically as internal configuration strength increased — with the 8-witness configuration cutting
the dense Transformer's error by roughly 42%.
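The "roughly 42%" figure is the relative reduction in validation MSE against the dense baseline. A short check using the reported values (the helper name `relative_reduction` is illustrative):

```python
def relative_reduction(baseline, candidate):
    """Fraction by which `candidate` error is lower than `baseline` error."""
    return (baseline - candidate) / baseline


dense_mse = 0.287
witness_mse = {
    "2-Witness": 0.229,
    "2-Witness (Asym)": 0.194,
    "8-Witness": 0.167,
}

reductions = {
    name: relative_reduction(dense_mse, mse) for name, mse in witness_mse.items()
}

for name, frac in reductions.items():
    print(f"{name}: {frac:.1%} lower MSE than dense")
```

The same calculation shows the intermediate configurations falling between the dense baseline and the 8-witness result, matching the monotonic trend described above.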
Scope of the claim
These results come from a frozen, multi-seed validation protocol on 65,536 zeta-zero gaps. They are
evidence that the architecture captures long-range numerical structure more effectively than a matched
dense-attention baseline on this task — consistent with the pattern across the benchmark series, where
the architecture's advantages concentrate in structured, long-range-dependency domains. They are not a
claim of universal superiority, and the sequence-mixing layer's linear scaling in sequence length is
reported separately in the scaling section above.