Humans demonstrate remarkable abilities to predict physical events in complex scenes. Two classes of models for physical scene understanding have recently been proposed: ``Intuitive Physics Engines'' (IPEs), which posit that people make predictions by running approximate probabilistic simulations in causal mental models similar to physics engines, and memory-based models such as convolutional neural networks (CNNs), which make judgments by analogy to stored experiences of previously encountered scenes and outcomes. Here we report four experiments that rigorously compare simulation-based and CNN-based models, with both approaches concretely instantiated as algorithms that take raw images as input and produce physical judgments as output. Both approaches can achieve super-human accuracy and can quantitatively predict human judgments to a similar degree, but only the simulation-based models generalize to novel situations in the ways that people do and are qualitatively consistent with the systematic perceptual illusions and judgment asymmetries that people show.