Human language, as well as birdsong, relies on the ability to arrange vocal elements in new sequences. However, little is known about the ontogenetic origin of this capacity. Here we track the development of vocal combinatorial capacity in three species of vocal learners, combining an experimental approach in zebra finches (Taeniopygia guttata) with an analysis of the natural development of vocal transitions in Bengalese finches (Lonchura striata domestica) and pre-lingual human infants. We find a common, stepwise pattern of acquiring vocal transitions across species. In our first study, juvenile zebra finches were trained to perform one song, and then the training target was altered, prompting the birds to swap syllable order or insert a new syllable into a string. All birds solved these permutation tasks in a series of steps, gradually approximating the target sequence by acquiring new pairwise syllable transitions, sometimes too slowly to accomplish the task fully. Similarly, in the more complex songs of Bengalese finches, branching points and bidirectional transitions in song syntax were acquired in a stepwise fashion, starting from a more restrictive set of vocal transitions. The babbling of pre-lingual human infants showed a similar pattern: instead of a single developmental shift from reduplicated to variegated babbling (that is, from repetitive to diverse sequences), we observed multiple shifts, where each new syllable type slowly acquired a diversity of pairwise transitions, asynchronously over development. Collectively, these results point to a common generative process that is conserved across species, suggesting that the long-noted gap between perceptual and motor combinatorial capabilities in human infants may arise partly from the challenges in constructing new pairwise vocal transitions.