When parents assess a kids' app, they notice the visuals first and the background music second. The small, responsive sounds go largely unnoticed: the plop when something lands, the splash when water runs, the soft chop of a knife on a board. They feel like decoration. They're doing more than that.
The sounds a child triggers through their own actions are doing something different from the sounds that play in the background. They're teaching cause and effect. And the research on how this works is surprisingly specific.
The feedback loop you don't notice
When a toddler taps a screen and hears a sound, their brain registers something important: I did that. This is called contingent feedback, and it's one of the strongest mechanisms for early learning.
Kirkorian, Choi, and Pempek (2016) tested this directly. Toddlers aged 24 to 36 months learned new words significantly better from touchscreen content that responded to their actions compared to identical content that played passively (Kirkorian, 2016). The contingent group's scores were comparable to live, in-person teaching. The timing matters too. Goldstein, King, and West (2003) found that responses within roughly one to two seconds of the child's action are processed as "I caused that" (Goldstein, 2003). Responses that come later aren't registered the same way.
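For anyone building this, the finding translates into a latency budget: the sound has to land within a second or two of the tap, which in practice means never fetching or decoding audio at tap time. Here's a minimal sketch using the browser's Web Audio API; the asset path and element id are made up, but the pattern, decoding up front and playing a fresh one-shot source on pointerdown, is the standard one:

```ts
// A minimal sketch of low-latency tap feedback with the Web Audio API.
// The asset path and element id are hypothetical; the point is decoding
// the sound ahead of time so playback on tap is effectively instant.

const ctx = new AudioContext();
let splashBuffer: AudioBuffer | null = null;

async function preload(): Promise<void> {
  const response = await fetch("/sounds/splash.mp3"); // hypothetical asset
  splashBuffer = await ctx.decodeAudioData(await response.arrayBuffer());
}

function playSplash(): void {
  if (!splashBuffer) return;
  void ctx.resume(); // browsers keep audio suspended until a user gesture
  // Buffer sources are one-shot and cheap: create a fresh node per tap.
  const source = ctx.createBufferSource();
  source.buffer = splashBuffer;
  source.connect(ctx.destination);
  source.start(); // fires within milliseconds, well inside the 1-2 s window
}

void preload();
document.getElementById("tap-area")?.addEventListener("pointerdown", playSplash);
```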
Begus, Gliga, and Southgate (2014) showed a related effect: information delivered after an infant's own initiated action was retained better than identical information delivered unprompted (Begus, 2014). The child's agency is part of the encoding. They remember what they made happen.
The sound needs to match the action
Not all audio feedback is equal. Russo-Johnson and colleagues (2017) studied 2- to 5-year-olds using a word-learning app and found that relevant interactive feedback, where tapping produced audio semantically connected to the content, improved learning (Russo-Johnson, 2017). Irrelevant feedback, where the sound had no connection to what the child was doing, actually distracted from the task.
This aligns with Hirsh-Pasek and colleagues' (2015) influential framework for educational app design, which identifies four pillars: active, engaged, meaningful, and socially interactive (Hirsh-Pasek, 2015). Audio that connects to the child's action supports the "active" pillar. Generic reward sounds that fire regardless of context undermine the "engaged" pillar by pulling attention away from what the child is actually doing.
The practical implication: a splashing sound when a child washes fruit under a tap is more useful than a generic chime. It connects the action to a real-world concept. A chop sound when they cut something on a board reinforces what a knife does. These aren't just pleasant sounds. They're tiny bridges to the real world.
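If you're wiring this up in code, the relevant-versus-irrelevant distinction reduces to a lookup: each action gets audio that depicts its real-world consequence, never one generic chime for everything. A sketch in TypeScript, with invented action names and sound ids:

```ts
// Illustrative only: the action names and sound ids are invented.
type Action = "washFruit" | "chopVegetable" | "placeBowl";

// Each action maps to audio that depicts its real-world consequence,
// not a single generic reward chime fired regardless of context.
const feedbackSound: Record<Action, string> = {
  washFruit: "splash",    // water running over the fruit
  chopVegetable: "chop",  // knife on board
  placeBowl: "soft-thud", // object set down
};

function onAction(action: Action): void {
  playSound(feedbackSound[action]);
}

function playSound(id: string): void {
  console.log(`playing ${id}`); // in a real app: start a preloaded buffer, as in the earlier sketch
}
```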
Three senses are better than one
There's a separate line of evidence about why sound specifically matters alongside touch and vision. Shams and Seitz (2008) reviewed the research on multisensory learning and found that combining auditory, visual, and tactile information produces more robust encoding than any single modality, even when you only test one sense afterwards (Shams, 2008). The brain stores multisensory experiences differently.
Jordan and Baker (2011) tested this with 3- and 4-year-olds on a number matching task. Kids who received audio-visual information together performed significantly better than those who got visual information alone (Jordan, 2011). The effect is consistent with what Bahrick and Lickliter (2000) call "intersensory redundancy": young children preferentially attend to and learn from information that arrives through multiple senses at the same time.
A touchscreen game already provides two senses: vision and touch. Adding a well-matched sound effect on each interaction adds the third. That's the difference between a single-modality memory trace and a richer, multisensory one.
It doesn't need to sound real
One question we get is whether synthesised sounds are as effective as recorded ones. The short answer: it doesn't seem to matter. Gaver (1993) distinguished between two modes of listening: "musical listening" (attending to the sound's acoustic properties) and "everyday listening" (attending to what caused the sound) (Gaver, 1993). Young children primarily engage in everyday listening. They associate a sound with the action that produced it, not with how faithfully it reproduces a real-world recording.
Plass and Kaplan (2016) found that what matters is the emotional quality of the sound: warm, pleasant tones create a favourable state that supports cognitive processing (Plass, 2016). A synthesised plop that feels soft and satisfying works as well as a recorded one. Possibly better, because synthesised sounds can be tuned precisely for warmth and volume without the background noise and compression artefacts that come with field recordings.
This connects to what Trainor and Heinmiller (1998) found about young children's preferences: they gravitate towards consonant, harmonically simple sounds and away from dissonant or spectrally complex ones (Trainor, 1998). Clear, warm tones over harsh or noisy ones. Simple over complicated. The same principle that applies to visual pacing applies to audio.
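For the curious, here's roughly what a synthesised plop looks like in Web Audio terms: a pure sine (no harsh overtones) with a quick downward pitch sweep and a soft decay. Every number here is an assumption to be tuned by ear, not a recipe from the research:

```ts
// A sketch of a synthesised "plop": a pure sine with a fast downward
// pitch sweep and a short decay envelope. All values are tuning guesses.
function playPlop(ctx: AudioContext): void {
  const osc = ctx.createOscillator();
  const gain = ctx.createGain();
  const now = ctx.currentTime;

  osc.type = "sine"; // harmonically simple: no harsh overtones
  osc.frequency.setValueAtTime(420, now); // starting pitch in Hz
  osc.frequency.exponentialRampToValueAtTime(140, now + 0.12); // the "plop" drop

  gain.gain.setValueAtTime(0.4, now); // gentle, not startling
  gain.gain.exponentialRampToValueAtTime(0.001, now + 0.18); // soft fade-out

  osc.connect(gain).connect(ctx.destination);
  osc.start(now);
  osc.stop(now + 0.2);
}
```

Called from inside a tap handler, that's the whole effect: about 200 milliseconds of sound, tuned for warmth rather than realism.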
What to listen for
Next time your toddler is playing a game, watch their face when they tap something and hear a response. That moment of recognition is visible. It builds agency. It teaches them the screen is something they control.
The sounds don't need to be loud or complex. They need to be immediate, warm, and connected to what the child just did. A gentle splash for water. A soft thud for placing something down. A quiet sizzle for cooking. Each one says: you did this, and it mattered.
Our companion post on background music covers the ambient side of audio design. But if you had to choose between a beautiful soundtrack and responsive sound effects, the research would nudge you towards the effects. The sounds your kid triggers are the sounds that teach.
Sources
- Kirkorian, H.L., Choi, K., & Pempek, T.A. (2016). Toddlers' word learning from contingent and non-contingent video on touch screens. Child Development, 87(2), 405-413. https://doi.org/10.1111/cdev.12508
- Goldstein, M.H., King, A.P., & West, M.J. (2003). Social interaction shapes babbling: testing parallels between birdsong and speech. Proceedings of the National Academy of Sciences, 100(13), 8030-8035. https://doi.org/10.1073/pnas.1332441100
- Begus, K., Gliga, T., & Southgate, V. (2014). Infants learn what they want to learn: responding to infant pointing leads to superior learning. PLoS ONE, 9(10), e108817. https://doi.org/10.1371/journal.pone.0108817
- Russo-Johnson, C., Troseth, G., Duncan, C., & Mesghina, A. (2017). All tapped out: touchscreen interactivity and young children's word learning. Frontiers in Psychology, 8, 578. https://doi.org/10.3389/fpsyg.2017.00578
- Hirsh-Pasek, K., Zosh, J.M., Golinkoff, R.M., Gray, J.H., Robb, M.B., & Kaufman, J. (2015). Putting education in 'educational' apps: lessons from the science of learning. Psychological Science in the Public Interest, 16(1), 3-34. https://doi.org/10.1177/1529100615569721
- Shams, L., & Seitz, A.R. (2008). Benefits of multisensory learning. Trends in Cognitive Sciences, 12(11), 411-417. https://doi.org/10.1016/j.tics.2008.07.006
- Jordan, K.E., & Baker, J. (2011). Multisensory information boosts numerical matching abilities in young children. Developmental Science, 14(2), 205-213. https://doi.org/10.1111/j.1467-7687.2010.00966.x
- Bahrick, L.E., & Lickliter, R. (2000). Intersensory redundancy guides attentional selectivity and perceptual learning in infancy. Developmental Psychology, 36(2), 190-201. https://doi.org/10.1037/0012-1649.36.2.190
- Gaver, W.W. (1993). What in the world do we hear? An ecological approach to auditory event perception. Ecological Psychology, 5(1), 1-29. https://doi.org/10.1207/s15326969eco0501_1
- Plass, J.L., & Kaplan, U. (2016). Emotional design in digital media for learning. In S.Y. Tettegah & M. Gartmeier (Eds.), Emotions, Technology, Design, and Learning, 131-161. https://doi.org/10.1016/B978-0-12-801856-9.00007-4
- Trainor, L.J., & Heinmiller, B.M. (1998). The development of evaluative responses to music: infants prefer to listen to consonance over dissonance. Infant Behavior and Development, 21(1), 77-88. https://doi.org/10.1016/S0163-6383(98)90055-8