Revisiting bottlenecks in automated AI research

Simon Williamson has his pelican test for LLMs. I have one for reproducing my old NeurIPS creativity workshop paper on evolving 3D shapes, which I also mused about here.

A genetic algorithm evolves shapes, each scored against CLIP.

I like this test personally, for a few reasons:

The original code is short (~500 lines).
It requires multiple capabilities: coding, agentic tool use, multimodal.
The most "interesting" part of the paper imo is the overall concept itself.

Two 3D shapes evolved against 'elephant' or 'record player', at different viewing angles.

I first started testing this while developing coding capabilties for PaLM 2. That was a bit ambitious then, and not very successful.

Models today are a lot better. In particular, there have been significant improvements in:

instruction following: I can now provide a minimal prompt asking to "re-implement this paper", rather than providing detailed, paragraph-level instructions.
tool use abilities: I can just point to the Arxiv link.
long-horizon coding abilities: I can let the program run and try to debug itself for sometime in the background, while I do my real work.

However, they still fail for the following reasons:

Downloading and using CLIP: surprisingly hard (or not, depending on your mental model).
General coding capabilities, and iterative self-debugging: e.g. rabbit-holing down incorrect theads, or hallucinating/making incorrect assumptions to circumvent a failure.
Assessing the outputs visually.
Ideation and taste: this is harder to test, but I haven't yet seen models come up with simple, neat (imo) ideas like this.

Excluding the last point, which may not matter significantly in the grand scheme of auto-research, all these should be solved within the year.

Of course, real auto-research tests a bevy of other capabilities, e.g. performance engineering, truly long-horizon tasks, and continually learning over multiple experiments.