Reasoning in language models, and stochastic Fermi estimation
Current language models struggle with common-sense reasoning, factuality, and composing information from multiple sources.
Giving these models a "scratchpad" to generate and combine intermediate inferences could help mitigate these issues.
Designing Fermi reasoning
Questions are built using structured templates combined with publicly available datasets, such as data from the Bureau of Labor Statistics (BLS).
To investigate how models use different Fermi reasoning patterns, we can vary several parameters:
include_conversions: false - each contextual step is either a "knowledge" step or a "conversion" step; this setting drops the conversion steps.
partial_knowledge: true - include only one of the "knowledge" steps.
round_magnitude: true - numbers in the reasoning steps are rounded to their largest order of magnitude.
add_noise: true - numbers in the reasoning steps are made incorrect.
placeholder_vals: true - same context, but numbers are replaced by strings like "var1", "var2", etc.
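As a rough illustration of how these settings could transform a reasoning context, here is a minimal sketch. Everything in it is hypothetical: the `Step` class, the `build_context` function, the noise range, and the choice to keep the first "knowledge" step (plus any conversions) under `partial_knowledge` are assumptions made for the example, not the project's actual code. It also assumes that "rounded to their largest order of magnitude" means rounding to one significant figure.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Step:
    kind: str       # "knowledge" or "conversion"
    template: str   # sentence with a {value} slot
    value: float


def round_to_magnitude(x: float) -> float:
    """Round to one significant figure (assumed reading of round_magnitude)."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, -exponent)


def build_context(steps, include_conversions=True, partial_knowledge=False,
                  round_magnitude=False, add_noise=False,
                  placeholder_vals=False, seed=0):
    """Return the reasoning-context sentences under the given settings."""
    rng = random.Random(seed)
    if not include_conversions:
        steps = [s for s in steps if s.kind != "conversion"]
    if partial_knowledge:
        # Assumption: keep the first knowledge step and leave conversions alone.
        first = next((s for s in steps if s.kind == "knowledge"), None)
        steps = [s for s in steps if s.kind != "knowledge" or s is first]
    lines = []
    for i, step in enumerate(steps, start=1):
        value = step.value
        if add_noise:
            value *= rng.uniform(0.1, 10.0)  # assumed noise range
        if round_magnitude:
            value = round_to_magnitude(value)
        shown = f"var{i}" if placeholder_vals else f"{value:,.2f}"
        lines.append(step.template.format(value=shown))
    return lines


# Steps taken from the polar bear example in this post.
steps = [
    Step("knowledge", "There are {value} miles between Nashville and New York.", 759.50),
    Step("knowledge", "A polar bear is {value} meters long.", 2.70),
    Step("conversion", "There are {value} meters in an inch.", 0.03),
    Step("conversion", "There are {value} inches in a mile.", 63360.0),
]

for line in build_context(steps, include_conversions=False, round_magnitude=True):
    print(line)
```

Keeping each setting as an independent flag makes it easy to generate one variant of the context per setting and compare them on the same question.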
Two examples are shown below:
How many plumbers are there in Oregon?
Reasoning context
There are 1,806,950 workers in Oregon.
0.30% of American workers are plumbers.
How many polar bears can you line up between Nashville and New York?
Reasoning context
There are 759.50 miles between Nashville and New York.
A polar bear is 2.70 meters long.
There are 0.03 meters in an inch.
There are 63,360 inches in a mile.
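To make the intended reasoning explicit, the two contexts above can be combined as follows. This is only a sketch of the arithmetic a model would need to carry out; the function names are made up for illustration, and the plumber estimate assumes the national share of plumbers also applies to Oregon. Note that the context gives 0.03 meters per inch, a rounded form of the exact 0.0254, so the polar bear estimate comes out about 18% higher than it would with the exact conversion.

```python
def plumbers_in_state(state_workers: float, plumber_share_pct: float) -> float:
    """Knowledge step x knowledge step: workers times the share of plumbers."""
    return state_workers * plumber_share_pct / 100


def polar_bears_between(miles: float, bear_length_m: float,
                        meters_per_inch: float, inches_per_mile: float) -> float:
    """Convert the distance to meters, then divide by the length of one bear."""
    distance_m = miles * inches_per_mile * meters_per_inch
    return distance_m / bear_length_m


# Values copied from the two reasoning contexts above.
plumbers = plumbers_in_state(1_806_950, 0.30)
bears = polar_bears_between(759.50, 2.70, 0.03, 63_360)

print(f"Plumbers in Oregon: about {plumbers:,.0f}")
print(f"Polar bears between Nashville and New York: about {bears:,.0f}")
```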
We can also assess how two different reasoning patterns affect accuracy.
How many hours would it take to bike from New York to Philadelphia?
Reasoning context V1
There are 80.53 miles between New York and Philadelphia.
A person can bike at 10 miles per hour.
Reasoning context V2
It takes 19.05 hours to bike from New York to Boston.
Starting from New York, it is 2.4 times as far to Boston as to Philadelphia.
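The two contexts lead to the same quantity by different routes: V1 divides a distance by a speed, while V2 scales a known travel time by a distance ratio. A small sketch of both calculations, with hypothetical function names and the assumption that the biking speed is the same on both routes:

```python
def hours_direct(distance_miles: float, speed_mph: float) -> float:
    """V1: time = distance / speed."""
    return distance_miles / speed_mph


def hours_by_ratio(reference_hours: float, distance_ratio: float) -> float:
    """V2: scale a known trip time by how much farther that trip is."""
    return reference_hours / distance_ratio


v1 = hours_direct(80.53, 10)      # New York to Philadelphia at 10 mph
v2 = hours_by_ratio(19.05, 2.4)   # Boston trip is 2.4 times as far

print(f"V1 estimate: {v1:.2f} hours")
print(f"V2 estimate: {v2:.2f} hours")
```

Both patterns give roughly 8 hours, so any difference in model accuracy between V1 and V2 would reflect the form of the reasoning context more than the underlying numbers.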
Early results with T5 baselines
The larger t5-base can benefit from additional reasoning context, whereas t5-small struggles.
Interestingly, settings where the reasoning context is deliberately flawed, such as "Add noise" (multiplying numbers by random values) or placeholder variables, sometimes perform even better than providing the correct context.
The "no conversions" setting performed best overall, suggesting that too many reasoning steps might anchor the model in the wrong context.