Reasoning in language models, and stochastic Fermi estimation

Current language models struggle with common-sense reasoning, factuality, and composing information from multiple sources.

Giving these models a "scratchpad" to generate and combine intermediate inferences could help mitigate these issues.

Designing Fermi reasoning

Questions are built using structured templates combined with publicly available datasets, such as data from the Bureau of Labor Statistics (BLS).

To investigate how models use different Fermi reasoning patterns, we can vary several parameters:

include_conversions: false — each contextual step is either a "knowledge" step or a "conversion" step; this setting drops the conversion steps.
partial_knowledge: true — include only one of the "knowledge" steps.
round_magnitude: true — numbers in the reasoning steps are rounded to their largest order of magnitude.
add_noise: true — numbers in the reasoning steps are made incorrect.
placeholder_vals: true — same context, but numbers are replaced by strings like "var1", "var2", etc.
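
As a rough illustration of how these settings could transform a reasoning context, here is a minimal Python sketch. It is not the project's actual generation code: the Step class, build_context, and round_to_leading_digit are hypothetical names, and several behaviors are assumptions (rounding keeps only the leading digit, noise multiplies by a random factor between 0.5 and 2, partial_knowledge keeps the first knowledge step and leaves conversions alone).

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One line of reasoning context (hypothetical structure)."""
    kind: str      # "knowledge" or "conversion"
    template: str  # sentence with a "{value}" slot
    value: float


def round_to_leading_digit(x):
    """Keep only the leading digit, e.g. 1,806,950 -> 2,000,000 (assumed reading)."""
    if x == 0:
        return 0
    magnitude = 10 ** math.floor(math.log10(abs(x)))
    return round(x / magnitude) * magnitude


def format_value(value):
    """Show whole numbers without decimals, everything else with two."""
    return f"{value:,.0f}" if float(value).is_integer() else f"{value:,.2f}"


def build_context(steps, *, include_conversions=True, partial_knowledge=False,
                  round_magnitude=False, add_noise=False,
                  placeholder_vals=False, rng=None):
    rng = rng or random.Random(0)
    if not include_conversions:
        steps = [s for s in steps if s.kind != "conversion"]
    if partial_knowledge:
        # Assumption: keep the first knowledge step, leave conversions untouched.
        kept, seen_knowledge = [], False
        for s in steps:
            if s.kind == "knowledge":
                if seen_knowledge:
                    continue
                seen_knowledge = True
            kept.append(s)
        steps = kept
    lines = []
    for i, s in enumerate(steps, start=1):
        value = s.value
        if add_noise:
            value *= rng.uniform(0.5, 2.0)  # assumed noise range
        if round_magnitude:
            value = round_to_leading_digit(value)
        shown = f"var{i}" if placeholder_vals else format_value(value)
        lines.append(s.template.format(value=shown))
    return lines


steps = [
    Step("knowledge", "There are {value} miles between Nashville and New York.", 759.50),
    Step("knowledge", "A polar bear is {value} meters long.", 2.70),
    Step("conversion", "There are {value} meters in an inch.", 0.0254),
    Step("conversion", "There are {value} inches in a mile.", 63360),
]

for line in build_context(steps, include_conversions=False):
    print(line)
```

Calling build_context(steps, placeholder_vals=True) on the same steps would instead emit sentences such as "There are var1 miles between Nashville and New York."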

Two examples are shown below:

Question

How many plumbers are there in Oregon?

Reasoning context

There are 1,806,950 workers in Oregon.

0.30% of American workers are plumbers.

An example question grounded in BLS data.
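
The arithmetic this context is meant to support can be sketched in a few lines of Python. The function name estimate_plumbers is hypothetical, and the estimate assumes that the national share of plumbers also holds in Oregon, which is the usual Fermi-style simplification rather than a stated fact.

```python
def estimate_plumbers(workers_in_state, plumber_share_pct):
    """Scale the state's workforce by the national share of plumbers."""
    return workers_in_state * plumber_share_pct / 100.0


estimate = estimate_plumbers(1_806_950, 0.30)
print(f"{estimate:,.0f} plumbers")
```

With the numbers above, the estimate comes out at roughly 5,400 plumbers.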

Question

How many polar bears can you line up between Nashville and New York?

Reasoning context

There are 759.50 miles between Nashville and New York.

A polar bear is 2.70 meters long.

There are 0.03 meters in an inch.

There are 63,360 inches in a mile.

Some questions need multi-hop reasoning across several conversions.
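
To make the multi-hop chain concrete, the following Python sketch converts miles to inches to meters and then divides by the bear's length. The function name bears_between is hypothetical. The context shows 0.03 meters per inch, which looks like the exact value 0.0254 rounded to two decimals (an assumption about how the value was formatted), so the sketch computes both variants.

```python
METERS_PER_INCH_EXACT = 0.0254  # exact by definition of the inch


def bears_between(miles, bear_length_m, meters_per_inch, inches_per_mile=63_360):
    """Chain the conversions: miles -> inches -> meters -> bear lengths."""
    distance_m = miles * inches_per_mile * meters_per_inch
    return distance_m / bear_length_m


as_shown = bears_between(759.50, 2.70, 0.03)
unrounded = bears_between(759.50, 2.70, METERS_PER_INCH_EXACT)
print(f"with 0.03 m/inch:   {as_shown:,.0f} bears")
print(f"with 0.0254 m/inch: {unrounded:,.0f} bears")
```

The two answers are roughly 535,000 and 453,000 bears, so rounding a single conversion factor shifts the estimate by about 18% while leaving the order of magnitude unchanged.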

We can also assess how two different reasoning patterns for the same question affect accuracy.

Question

How many hours would it take to bike from New York to Philadelphia?

Reasoning context V1

There are 80.53 miles between New York and Philadelphia.

A person can bike at 10 miles per hour.

Reasoning context V2

It takes 19.05 hours to bike from New York to Boston.

Starting from New York, it is 2.4 times as far to Boston as to Philadelphia.

V1 gives the distance and speed directly; V2 forces indirect reasoning (via the New York–Boston route).
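
The two routes to the answer can be written out as a short Python sketch. The function names are hypothetical, and V2 additionally assumes the same biking speed on both routes.

```python
def hours_direct(distance_miles, speed_mph):
    """V1: time = distance / speed."""
    return distance_miles / speed_mph


def hours_via_reference(reference_hours, distance_ratio):
    """V2: scale a known trip time by the ratio of the two distances."""
    return reference_hours / distance_ratio


v1 = hours_direct(80.53, 10)
v2 = hours_via_reference(19.05, 2.4)
print(f"V1: {v1:.2f} hours, V2: {v2:.2f} hours")
```

Both contexts lead to an answer of about 8 hours, so any accuracy difference between them would reflect the reasoning pattern rather than the target value.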

Early results with T5 baselines

3-answer task; dashed line is the 0.33 chance level.

The larger t5-base can benefit from additional reasoning context, whereas t5-small struggles.

Interestingly, settings where the reasoning context is deliberately flawed, such as "Add noise" (multiplying numbers by random values) or placeholder variables, sometimes perform even better than providing the correct context.

The "no conversions" setting also performed best, suggesting that too many reasoning steps might anchor the model in the wrong context.