Preparing for superintelligent AI tutors

I recently completed a project on whether modern interpretability methods can improve human-AI collaboration. TLDR: they aren't very useful for that.

At the start of the project, we came up with a simple ontology of what model explanations might be useful for:

  1. Human-AI collaboration: improving performance on a given task
  2. Model debugging: providing insight into training runs or individual predictions
  3. Model certification: approving a model's overall behavior according to criteria X and threshold y
  4. AI pedagogy: using the model as a tutor on tasks where it outperforms humans

I'm particularly interested in the fourth use case right now. There are some interesting open questions here: how to translate neuralese, how to produce human-understandable concepts, and which kinds of explanations are most effective.

The current challenge is finding suitable tasks where the condition of AI outperforming humans actually holds. There are a few obvious candidates, such as chess, where engines and chess-playing models already outperform the best humans. In a different direction, we're also testing models trained on MaRVL, a vision-and-language dataset for multicultural reasoning.

This use case will become more relevant as AI continues to improve and becomes superintelligent in various domains.