It has been claimed that a genuinely abstract number representation exists and can represent the numerosity of any set of discrete elements, irrespective of whether they are presented in the visual or the auditory modality. To test whether adults can compare large numerosities cross-modally as accurately as intra-modally, we measured Weber fractions and points of subjective equality for numerical discrimination in visual, auditory, and cross-modal conditions, using a carefully controlled experimental procedure. The results showed distinct differences in performance between the visual and auditory conditions: numerical discrimination of auditory sequences was more precise than that of visual sequences. Moreover, performance on cross-modal trials differed among participants, although in all cases it was worse than in the auditory condition, and the number of visual stimuli was overestimated. Taken together, our findings imply that numerical discrimination of auditory and visual stimuli involves modality-specific processes, suggesting that numerical representation may be a complex process comprising multiple stages.