Joint attention plays an important role in many interactive and collaborative domains. In learning and teaching, joint attention ensures shared context between two people. I was particularly interested in this phenomenon in the context of word learning, but found that we're still incredibly far away from successfully sharing context between humans and robots. My dissertation work focused on two phenomena: 1) visual context sharing through referencing and 2) joint attention in dynamic environments. The dissertation can be found here if you're interested in the direct findings.

Every joint attention system should probably have these two fundamental qualities:

  1. The robot should be able to direct the attention of the human
  2. The human should be able to direct the attention of the robot
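These two requirements could be sketched as a minimal interface. This is purely illustrative; the class and method names are my own invention, not from my dissertation or any robot framework:

```python
# Hypothetical sketch of the two joint-attention requirements.
from abc import ABC, abstractmethod

class JointAttentionSystem(ABC):
    @abstractmethod
    def direct_human_attention(self, target) -> None:
        """Requirement 1: use gaze and deictic gesture to draw the
        human's attention toward `target`."""

    @abstractmethod
    def follow_human_attention(self):
        """Requirement 2: track the human's gaze and gesture and
        return the inferred target of their attention."""
```

Any concrete system then has to fill in both directions of the loop, which is where the real difficulty lives.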

However, the devil is in the details. As it turns out, human attention isn’t easily directed. There are a few known ways a robot can direct attention: 1) gaze following and 2) deixis. In gaze following, the human notices the direction of the robot’s eye gaze and redirects their own gaze to its presumed target. In deixis, the human follows a robotic gesture that is meant to be deictic, such as pointing. I found these gestures to have the most impact when directing human attention. To make matters worse, how human attention is allocated is incredibly complex and is still one of the greatest mysteries in cognitive science. Whoops. Directing the human is not deterministic: issuing a cue does not guarantee that the human will do what the robot asks. There are other common challenges as well, like distinguishing the human’s covert attention from their overt attention.

And if you can believe it, we also need the robot to respond to the human’s gestures and gaze. So far, we’ve only kept the human on the context that the robot wants them to attend to; can you imagine having to recognize and track gaze and gesture from a human? What a complex computer vision problem! I dug into a lot of work by John Tsotsos, who has modeled synthetic visual attention in great depth. I found his research really formative and inspired in its exploration of the visual system. I began modeling a similar system while, in parallel, looking into human-to-human referencing behavior. Around this time, I concluded that the “saliency mask” of the robot’s attention is probably the most appropriate metric [1,2] for measuring shared context.
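One simple way to score shared context between two saliency masks is their overlap, e.g. intersection-over-union. This is a sketch of that idea under my own assumptions (boolean pixel masks, IoU as the score), not the exact metric from the dissertation:

```python
import numpy as np

def shared_context_score(robot_mask: np.ndarray, human_mask: np.ndarray) -> float:
    """Intersection-over-union of two boolean attention (saliency) masks.

    1.0 means the robot and human are attending to exactly the same
    region; 0.0 means their attention does not overlap at all.
    """
    robot = robot_mask.astype(bool)
    human = human_mask.astype(bool)
    union = np.logical_or(robot, human).sum()
    if union == 0:
        return 0.0  # neither party is attending to anything
    return float(np.logical_and(robot, human).sum() / union)
```

A score like this gives the interaction loop something concrete to optimize: the robot gestures or gazes until the overlap rises above some threshold.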

In my dissertation, I noted that referencing gestures tend to be both distal (far away from the target) and proximal (very close to the target). When I gave participants a target that was very precise (like the edge of a shape, or a part such as the tail of a dog), I saw gesture behavior that would “trace” the edge, or gesture distally and then subsequently proximally, as if the participant were adeptly moving down a hierarchy of parts.
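The distal/proximal distinction can be operationalized very simply as fingertip-to-target distance. A minimal sketch, where the 15 cm threshold is an illustrative assumption of mine rather than a value from the dissertation:

```python
import math

def classify_reference(fingertip_xyz, target_xyz, proximal_threshold_m=0.15):
    """Label a pointing gesture as 'proximal' or 'distal' by the Euclidean
    distance from the fingertip to the target, both in meters in a shared
    world frame. The threshold is a hypothetical choice for illustration."""
    distance = math.dist(fingertip_xyz, target_xyz)
    return "proximal" if distance <= proximal_threshold_m else "distal"
```

Logging this label over the course of an interaction would let you see the distal-then-proximal refinement pattern described above.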

Frustratingly, I had no unifying approach to the robot’s visual attention system. After about two years, I had gotten more realistic and begun to work on very proximal referencing of whole objects within a dynamic scene. I found that I could track gaze and gesture in much more intimate interactions this way. Furthermore, sharing a target in “world-space” was easier when the pupil camera and a gesture tracker like the LeapMotion could be co-located. This would allow the robot to respond to the human. However, with a large robot like Maddox, I also needed to move the arm of the robot in a tight space to point very precisely at targets on the shared table. I developed a custom algorithm for this gesturing behavior. This led me to future work in gesture synthesis using neural networks.
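Once gaze and gesture are calibrated into one world frame, resolving the referenced object can be as simple as picking the scene object closest to the pointing or gaze ray. A sketch under my own assumptions (a single ray, objects as labeled 3D points); this is not the custom algorithm from the dissertation:

```python
import numpy as np

def resolve_target(origin, direction, objects):
    """Return the label of the object nearest to a gaze/pointing ray.

    origin, direction: 3-vectors in a shared world frame (e.g. fused from
    a co-located eye tracker and hand tracker). objects: dict mapping a
    label to its 3D position. Points behind the origin are clamped to it.
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)
    best_label, best_dist = None, float("inf")
    for label, pos in objects.items():
        v = np.asarray(pos, dtype=float) - o
        t = max(float(v.dot(d)), 0.0)        # projection onto the ray
        dist = float(np.linalg.norm(v - t * d))  # perpendicular distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

In a real system you would also gate on the distance (no object near the ray means no reference) and fuse the gaze ray with the pointing ray before resolving.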

Having beaten a dead horse for about 2-3 years, I have since moved on from the problem. But my honest opinion is that we’re still incredibly far away from even this basic social mechanism being implemented well in everyday robots.

Embodied learning, in which the robot is denied a keyboard and mouse and must learn from the social environment, is, as many in the machine learning community would argue, a great mystery. Hopefully, long-term investment will provide a substrate in which we can research and develop this off-the-shelf technology for applications in social learning, word learning, and exophoric reference resolution in human-robot dialog.


  1. DePalma, N. (2017). Bidirectional gaze guiding and indexing in human-robot interaction through a situated robotic architecture (PhD thesis). Massachusetts Institute of Technology, Cambridge, MA.
  2. DePalma, N., & Breazeal, C. (2016). Towards learning through robotic interaction alone: the joint guided search task. In Proceedings of the Conference on Artificial Intelligence and Simulation of Behaviour (AISB).