A crucial step in language evolution was probably joint attention, with gaze alternating between vocalizing individuals and an object. This triadic interaction likely formed a foundation for the labeling of objects. We have argued that vocalizations used as “social glue” (flexible, low-intensity, low-arousal vocalizations given, for example, during grooming or while keeping in contact with the group) are a probable source of raw material for the first labels. It is critical that these vocalizations be socially directed through gaze contact. We longitudinally investigated directed gaze during vocalizations in low-arousal interactions during the first year of life in three bonobo mother-infant pairs and compared them with nine human mother-infant pairs. We found that bonobo infants directed their gaze to a conspecific during vocalizations only 8% of the time, whereas human infants did so 44% of the time.