Tracelight -- Intelligence Profile -- The Caliper Lab

AI-native Excel add-in for financial modelling. Built for investment banking, management consulting, and asset management teams who need formula generation, error detection, and workflow automation without leaving their spreadsheet environment.

Financial Modelling AI -- Excel Add-in -- Formula Generation -- Error Detection -- SOC 2 Type 2 -- Model-Agnostic

Rich coverage

Q1 2026 -- Run #3
847 tasks -- CaliperFin-v2

Frontier update: GPT-5.5 (April 2026) released with material improvements to structured data reasoning. Baseline recalculation in progress. Updated gap scores will be published within 14 days. Current scores reflect the GPT-5.4 baseline (March 2026).

Capability Assessment -- Independent -- Q1 2026

Tracelight sits at the top of its category on the tasks that matter most for its target buyers. The more important question for investors and enterprise teams is what the frontier means for that position over the next 12 to 18 months.

Where the product leads

On formula generation and error detection -- the two tasks Tracelight is built around -- the product performs within 3 points of the GPT-5.4 frontier baseline. That is a materially smaller gap than the category median of 9.4 points. The product's structural edge is its spreadsheet encoding layer, which allows frontier LLMs to interpret cell relationships, formula logic, and formatting conventions that general-purpose interfaces cannot process natively.

Error detection accuracy of 94.2%, alongside a measured latency advantage that supports the vendor's published 24x speed claim -- the strongest independently verifiable claim in their marketing.
74% practitioner acceptance rate, above the 58% category average, with a declining verification rate signalling growing user confidence.
Category rank: 2nd of 8 products on the Lab's weighted benchmark.

The frontier question

The frontier is improving at 4.2 points per quarter on structured data reasoning tasks. At that rate, GPT-5.5 will approach parity with Tracelight's L1--L2 performance within two to three quarters. The product's model-agnostic architecture -- it switches underlying LLMs as the market moves -- partially offsets this risk. The durable question is whether the spreadsheet encoding layer continues to provide a meaningful edge as frontier models improve at native structured data interpretation.

L1--L2 gap (formula and extraction tasks): 0.9 to 2.4 points. Projected to compress toward parity by Q3--Q4 2026.
L4--L5 gap (cross-sheet reasoning, assumption-setting): 14.1 to 19.1 points. Not closing at current trajectory.

Decision implication

For enterprise buyers in banking and consulting, the relevant question is not whether AI can generate formulas but whether it can do so with the precision, transparency, and auditability that professional financial models require. On that more specific question, Tracelight's performance and its practitioner signal are both above category average. Buyers considering deployment in the next 12 months are buying into a position that is currently strong but will require active monitoring as the frontier evolves.

What the data does not yet cover

Multi-workbook operations with external data references have not been benchmarked -- relevant for consolidation models.
The 60% time-saving claim is sourced from vendor testing and has not been verified against a controlled user study. Panel data shows a 44% median reduction.
Panel signal on the consulting segment is based on 14 practitioners. One additional cycle required for statistical stability.

Benchmark Scorecard vs. GPT-5.4 baseline -- 847 tasks

Task (level)                                       Tracelight   Frontier (GPT-5.4)   Gap
Formula generation from natural language (L1)      91.4         93.8                 -2.4
Error detection -- logical correctness (L2)        94.2         95.1                 -0.9
Scenario and sensitivity build (L3)                82.7         89.4                 -6.7
Cross-sheet model restructuring (L4)               67.3         81.4                 -14.1
Analytical judgment and assumption-setting (L5)    54.1         73.2                 -19.1
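The gap figures in the scorecard above are simply the Tracelight score minus the frontier score at each level. The short Python sketch below reproduces them from the published numbers. The `SCORECARD` structure and the helper names are illustrative, not part of the Lab's tooling. The sketch also prints an unweighted mean for reference; the Lab's weighting scheme is not stated in this profile, so those means are not expected to match the published weighted averages.

```python
# Scores copied from the scorecard: (task, level, Tracelight, frontier GPT-5.4).
SCORECARD = [
    ("Formula generation from natural language", "L1", 91.4, 93.8),
    ("Error detection -- logical correctness", "L2", 94.2, 95.1),
    ("Scenario and sensitivity build", "L3", 82.7, 89.4),
    ("Cross-sheet model restructuring", "L4", 67.3, 81.4),
    ("Analytical judgment and assumption-setting", "L5", 54.1, 73.2),
]


def gap(product: float, frontier: float) -> float:
    """Gap in points; negative when the product trails the frontier."""
    return round(product - frontier, 1)


def unweighted_mean(values: list[float]) -> float:
    """Plain arithmetic mean (an assumption: the Lab's actual weights are not published here)."""
    return round(sum(values) / len(values), 1)


gaps = {level: gap(product, frontier) for _, level, product, frontier in SCORECARD}
tracelight_mean = unweighted_mean([product for _, _, product, _ in SCORECARD])
frontier_mean = unweighted_mean([frontier for _, _, _, frontier in SCORECARD])

for level, value in gaps.items():
    print(f"{level}: {value:+.1f}")
print(f"Unweighted mean -- Tracelight {tracelight_mean}, frontier {frontier_mean}")
```

The per-level gaps widen steadily from L1 to L5, which is the pattern the assessment describes: close to the frontier on structured tasks, further behind on reasoning-heavy ones.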

Vendor Claim Verification -- Source: tracelight.ai

"24x faster at error detection than any other tool"

Verified. Average error detection latency of 0.8 seconds against 19.2 seconds for the next fastest tool in the benchmark set. Accuracy also leads at 94.2%. The strongest independently verifiable claim in the vendor's published materials.

"3x faster and more accurate than alternatives in testing"

Partial. Speed advantage holds on L1--L2 tasks. Accuracy advantage narrows on L3 and above -- the gap closes as task complexity increases. Consistent with a product optimised for structured extractive work rather than complex reasoning.

"Saving teams more than 60% of their time in Excel"

Not independently tested. Sourced from Tracelight's own testing. Panel signal from 42 practitioners shows a median reported time reduction of 44% on matched tasks. A controlled user study would be required for independent verification of the headline figure.
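The arithmetic behind two of these verdicts can be checked directly from the figures quoted above. The sketch below is a minimal Python illustration: the function name `speedup` is hypothetical, and the inputs are only the latency and time-saving figures stated in this profile.

```python
def speedup(baseline_seconds: float, product_seconds: float) -> float:
    """How many times faster the product is than the baseline tool."""
    if product_seconds <= 0:
        raise ValueError("product latency must be positive")
    return baseline_seconds / product_seconds


# Latencies quoted in the first verdict: next fastest tool vs. Tracelight.
ratio = speedup(baseline_seconds=19.2, product_seconds=0.8)

# Time-saving claim: vendor headline vs. panel median, in percentage points.
vendor_claim_pct = 60
panel_median_pct = 44
shortfall_pp = vendor_claim_pct - panel_median_pct

print(f"Speed ratio: {ratio:.1f}x")
print(f"Panel median trails the vendor headline by {shortfall_pp} pp")
```

The latency ratio matches the 24x figure in the vendor's claim, while the panel's self-reported time saving sits 16 percentage points below the headline 60%.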

Frontier intelligence

Current frontier (GPT-5.4): 85.4 -- weighted avg, financial modelling task battery

Frontier velocity: +4.2 pts / qtr -- structured data reasoning, accelerating

L1--L2 time to parity: 2 to 3 qtrs -- at current velocity, Q3 to Q4 2026

GPT-5.5 (April 2026) may accelerate this projection. Recalibrated baseline scores will be published within 14 days of this update.

Practitioner signal -- n=42 -- finance and consulting

Output acceptance rate: 74% (+8pp)

Verify before use: 58% (-5pp)

Workflow abandonment: 7% (flat)

Trust trajectory: Building

Top correction type: Formula reference edits

74% acceptance is above the 58% category average. Declining verification rate signals growing confidence in production environments.

Score trajectory -- Tracelight weighted avg score (bar chart, Q3 2025 to Q1 2026; higher bar = stronger performance vs. frontier)

Q3 2025: 71.4

Q1 2026: 76.8

Methodology

Dataset: CaliperFin-v2 -- 847 tasks

Baseline: GPT-5.4 (Mar 2026)

Scoring (L1--L2): Formula equivalence + F1

Scoring (L3--L5): LLM-as-judge + expert review

Ground truth: Expert-constructed -- kappa 0.87

Run date: 18 March 2026
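For readers unfamiliar with the two statistics named in the methodology, the Python sketch below shows how an F1 score and an agreement kappa are conventionally computed. It assumes the kappa reported here is Cohen's kappa between two expert annotators, which the profile does not specify, and all counts and labels in the example are hypothetical rather than drawn from CaliperFin-v2.

```python
from collections import Counter


def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical example: 8 errors correctly flagged, 2 false alarms, 2 missed.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=2)

# Hypothetical example: two experts labelling four reference answers.
expert_a = ["ok", "ok", "err", "err"]
expert_b = ["ok", "ok", "err", "ok"]
example_kappa = cohens_kappa(expert_a, expert_b)

print(f"F1: {example_f1:.2f}")
print(f"Kappa: {example_kappa:.2f}")
```

On this scale, a kappa of 0.87 indicates agreement well beyond what chance alone would produce, which is why it is quoted as support for the expert-constructed ground truth.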

Representative profile for discussion -- all scores and findings are illustrative, based on the Lab's published methodology applied to Tracelight's publicly stated capabilities. Full benchmark data will be published upon completion of the formal evaluation programme. thecaliperlab.com