The ability to understand systems is important for individuals working in many scientific domains. Prior research has demonstrated that direct-manipulation animation (DMA) effectively helps learners understand dynamic systems. However, the roles of learners' characteristics and of scaffolding in their performance remain unclear. The current study examines how instruction and spatial ability modulate learners' executive attention when they use DMAs to learn about physical systems. To demonstrate comprehension and reasoning ability, participants were asked to explain and predict the outcomes of what-if scenarios. Their eye movements were recorded during their interaction with the DMAs. Participants' eye-gaze patterns and learning performance revealed that interactivity by itself did not lead to understanding of these systems. Instead, the combination of attending to relevant information and actively manipulating the animation facilitated retention and reasoning. Spatial ability, however, did not significantly predict performance. Overall, the findings suggest that learning with interactive animation involves a cognizant interchange between bottom-up visuo-motor support and top-down cognitive control.