The most critical attribute of human language is its unbounded combinatorial nature: smaller elements can be combined into larger structures on the basis of a grammatical system, resulting in a hierarchy of linguistic units, such as words, phrases and sentences. Mentally parsing and representing such structures, however, poses challenges for speech comprehension. In speech, hierarchical linguistic structures do not have boundaries that are clearly defined by acoustic cues and must therefore be internally and incrementally constructed during comprehension. We found that, during listening to connected speech, cortical activity at different timescales concurrently tracked the time course of abstract linguistic structures at different hierarchical levels, such as words, phrases and sentences. Notably, the neural tracking of hierarchical linguistic structures was dissociated from the encoding of acoustic cues and from the predictability of incoming words. Our results indicate that a hierarchy of neural processing timescales underlies grammar-based internal construction of hierarchical linguistic structure.