Revisiting bottlenecks in automated AI research

Simon Willison has his pelican test for LLMs. I have one too: reproducing my old NeurIPS creativity workshop paper on evolving 3D shapes, which I also mused about here.

A genetic algorithm evolves shapes, each scored against CLIP.
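
For readers who haven't seen this kind of loop, here is a rough sketch of the idea, not the paper's actual code. Everything in it is an assumption for illustration: the genome is just a list of floats standing in for shape parameters, and `toy_score` is a hypothetical stand-in for the CLIP score (the real setup would render each shape and compare the image to a text prompt). The selection, crossover, and mutation choices are generic ones, picked for brevity.

```python
import random

# Hypothetical target, used only by the stand-in scorer below.
TARGET = [0.5, -0.25, 0.0, 0.75, -0.5, 0.25, 1.0, -1.0]


def toy_score(genome):
    """Stand-in for a CLIP score (assumption): higher is better.

    A real scorer would render the shape and compare the image
    embedding to a text embedding; here we use negative squared
    distance to a fixed target so the sketch runs anywhere.
    """
    return -sum((g - t) ** 2 for g, t in zip(genome, TARGET))


def evolve(score, genome_len=8, pop_size=32, generations=60,
           mutation_scale=0.1, elite=4, seed=0):
    """Minimal elitist genetic algorithm over real-valued genomes.

    Returns the best genome found and the best score seen at the
    start of each generation.
    """
    rng = random.Random(seed)
    population = [[rng.uniform(-1, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        # Rank the population by score, best first.
        ranked = sorted(population, key=score, reverse=True)
        history.append(score(ranked[0]))
        # Keep the top `elite` genomes unchanged.
        parents = ranked[:elite]
        children = []
        while len(children) < pop_size - elite:
            a, b = rng.sample(parents, 2)
            # Uniform crossover, then Gaussian mutation on every gene.
            child = [rng.choice(pair) + rng.gauss(0, mutation_scale)
                     for pair in zip(a, b)]
            children.append(child)
        population = parents + children
    best = max(population, key=score)
    return best, history


if __name__ == "__main__":
    best, history = evolve(toy_score)
    print("first-generation best:", history[0])
    print("final best:", toy_score(best))
```

Because the elites are carried over untouched, the best score never gets worse from one generation to the next. Swapping `toy_score` for a real render-and-score function is where most of the actual work (and most of the model failures described below) would sit.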

I like this test personally, for a few reasons:

  1. The original code is short (~500 lines).
  2. It requires multiple capabilities: coding, agentic tool use, multimodal.
  3. The most "interesting" part of the paper imo is the overall concept itself.

Two 3D shapes evolved against 'elephant' or 'record player', at different viewing angles.

I first started testing this while developing coding capabilities for PaLM 2. That was a bit ambitious then, and not very successful.

Models today are a lot better. In particular, there have been significant improvements in:

  • instruction following: I can now provide a minimal prompt asking to "re-implement this paper", rather than providing detailed, paragraph-level instructions.
  • tool use abilities: I can just point to the arXiv link.
  • long-horizon coding abilities: I can let the program run and try to debug itself for some time in the background, while I do my real work.

However, they still fail for the following reasons:

  1. Downloading and using CLIP: surprisingly hard (or not, depending on your mental model).
  2. General coding capabilities, and iterative self-debugging: e.g. rabbit-holing down incorrect threads, or hallucinating/making incorrect assumptions to circumvent a failure.
  3. Assessing the outputs visually.
  4. Ideation and taste: this is harder to test, but I haven't yet seen models come up with simple, neat (imo) ideas like this.

Excluding the last point, which may not matter significantly in the grand scheme of auto-research, all of these should be solved within the year.

Of course, real auto-research tests a bevy of other capabilities, e.g. performance engineering, truly long-horizon tasks, and continual learning over multiple experiments.