This study introduces a neural network that models social interactions observed in a video corpus. The corpus consists of naturalistic recordings of social interactions among children and with their environment, annotated multimodally with features such as gestures. We explore how this corpus can support modelling by training our model on a portion of the annotated data and then using the trained model to predict novel interaction sequences. We evaluate the model by comparing its automatically generated sequences against an unseen portion of the corpus. Initial results show strong similarities between the generated interactions and those observed in the corpus.