Although it doesn't buffer complete frames, it still needs a buffer and introduces some latency, doesn't it?
I doubt it.
The TMDS data is essentially packet-switched, with packets being defined for video, audio, auxiliary data, etc. (the I2C traffic, called DDC, actually travels on its own pair of wires rather than in the TMDS stream). As a packet comes in, its data will be de-serialised, sent over the decoder's parallel output bus as video R,G,B + extra control lines, and immediately re-serialised into a valid packet. The only latency introduced is that of the SERDES implementation, and we're talking a handful of pixel clocks here, i.e. nanoseconds to a few tens of nanoseconds.
I mean, if you're being literal, then yes, there's *some* latency, inasmuch as there's some latency when the signal travels over a wire as well, but for all practical purposes there ought to be zero latency. We normally define latency as relative *to* something, and since the entire signal is being transformed (without buffering), there's nothing for any delay to be relative to. If you prefer, you could imagine it as introducing a delay akin to extending the wire by another 1m of cable: sure, that adds a few more nanoseconds to the delay, but since the entire signal has that delay, it's not "latent". Also, a delay of a few nanoseconds generally isn't important.
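For a sense of scale, here's a back-of-envelope sketch in Python. The 0.7 velocity factor and the 148.5 MHz pixel clock (the 1080p60 figure) are my own assumptions for illustration, not measurements of any particular cable or chip; both results come out in the low nanoseconds.

```python
# Back-of-envelope delay estimates. All inputs are assumed, not measured.
C = 299_792_458.0          # speed of light in vacuum, m/s
VELOCITY_FACTOR = 0.7      # assumed signal speed in the cable, as a fraction of C
PIXEL_CLOCK_HZ = 148.5e6   # assumed pixel clock (1080p60)


def cable_delay_ns(length_m: float, velocity_factor: float = VELOCITY_FACTOR) -> float:
    """Propagation delay through length_m metres of cable, in nanoseconds."""
    return length_m / (C * velocity_factor) * 1e9


def pixel_period_ns(pixel_clock_hz: float = PIXEL_CLOCK_HZ) -> float:
    """Duration of one pixel clock (one 10-bit TMDS character), in nanoseconds."""
    return 1e9 / pixel_clock_hz


if __name__ == "__main__":
    print(f"1 m of cable:    {cable_delay_ns(1.0):.2f} ns")
    print(f"one pixel clock: {pixel_period_ns():.2f} ns")
```

Compare either figure with the 16.7 ms that a single frame lasts at 60Hz, and it's clear why this kind of delay doesn't matter for video.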
There's a lot more bandwidth on the signalling bus than is used by any of the video resolutions that we're transmitting here, so we don't delay the signal by adding audio; the audio packets will just be slotted into spare capacity. The HDMI spec actually goes into this, and defines 'data island' periods where audio (for example) can be interspersed within the signal, typically within the horizontal blanking interval (the part of each horizontal scan-line that is not displayed on the screen). There are also other data islands in the vertical blanking at the top of the screen, as well as control periods that identify what kind of period comes next (amongst other things). The HSYNC and VSYNC info is carried through the data islands as well.
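To show how much spare room that leaves, here's a rough sketch using the standard CEA-861 timing for 1920x1080 @ 60Hz (2200 x 1125 total, my choice of example format). It only counts blanking clocks; not every one of them is usable for packets, since some go to control periods, preambles and guard bands.

```python
# CEA-861 timing for 1920x1080 @ 60 Hz, chosen here as an example format.
H_ACTIVE, H_TOTAL = 1920, 2200   # pixels per line: displayed / including blanking
V_ACTIVE, V_TOTAL = 1080, 1125   # lines per frame: displayed / including blanking


def blanking_clocks_per_line() -> int:
    """Pixel clocks in each scan-line that carry no displayed pixel."""
    return H_TOTAL - H_ACTIVE


def blanking_clocks_per_frame() -> int:
    """Pixel clocks in each frame that carry no displayed pixel."""
    return H_TOTAL * V_TOTAL - H_ACTIVE * V_ACTIVE


def blanking_fraction() -> float:
    """Share of all pixel clocks that fall in blanking."""
    return blanking_clocks_per_frame() / (H_TOTAL * V_TOTAL)


if __name__ == "__main__":
    print(blanking_clocks_per_line(), "blanking clocks per line")
    print(blanking_clocks_per_frame(), "blanking clocks per frame")
    print(f"{blanking_fraction():.1%} of the link is blanking")
```

So roughly a sixth of the pixel clocks in this format never carry picture data, which is the slack the data islands live in.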
FWIW, video is the acknowledged "master stream" on an HDMI link, with audio generally only being timed to within its associated video frame, so the audio will trickle in as the video frame progresses down the screen. There is sufficient slack ahead of the video frame that the first scanline's audio can be transmitted then, and enough space per scanline for another dollop of audio.
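How big is each "dollop"? A quick sketch, again assuming 1080p60 timing (1125 total lines per frame, my example figure), shows how many audio samples per channel have to go out per scan-line on average to keep up.

```python
from fractions import Fraction

V_TOTAL = 1125    # total lines per frame including blanking (1080p60, assumed)
FRAME_RATE = 60   # frames per second (assumed)


def samples_per_line(sample_rate_hz: int,
                     v_total: int = V_TOTAL,
                     frame_rate: int = FRAME_RATE) -> Fraction:
    """Average audio samples (per channel) that must be sent per scan-line."""
    return Fraction(sample_rate_hz, v_total * frame_rate)


def samples_per_frame(sample_rate_hz: int, frame_rate: int = FRAME_RATE) -> Fraction:
    """Audio samples (per channel) that accompany one video frame."""
    return Fraction(sample_rate_hz, frame_rate)


if __name__ == "__main__":
    for rate in (48_000, 192_000):
        print(f"{rate} Hz: {float(samples_per_line(rate)):.3f} samples/line, "
              f"{float(samples_per_frame(rate)):.0f} samples/frame")
```

At 48kHz that's less than one sample per line, and even at 192kHz it's under three, so a small packet in each line's blanking is plenty.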
Overall, HDMI can carry a massive amount of audio (IIRC it's 8 x 192kHz 24-bit channels, way more than we would ever need), and even at the lower bit-rate (LBR) it can carry 1920x1080 24-bit video @ 60Hz. Then you've got HBR (high bit-rate), which can go much further. The carrying-capacity of a single HDMI link is at least 165 million pixels/second, with later versions of the spec increasing that...
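To see how those numbers stack up against each other, here's a small arithmetic sketch. The 1920x1080 @ 60Hz format, its 148.5 MHz pixel clock, and the 24-bit audio sample size are my assumptions for the example; the 165 million pixels/second figure is the one quoted above.

```python
LINK_PIXEL_RATE = 165_000_000   # pixels/second, the baseline link figure quoted above
PIXEL_CLOCK_1080P60 = 148_500_000   # pixel clock incl. blanking (assumed format)


def active_pixel_rate(width: int, height: int, fps: int) -> int:
    """Displayed pixels per second, ignoring blanking."""
    return width * height * fps


def audio_bit_rate(channels: int, sample_rate_hz: int, bits_per_sample: int) -> int:
    """Raw PCM audio payload in bits per second."""
    return channels * sample_rate_hz * bits_per_sample


def video_bit_rate(pixel_rate: int, bits_per_pixel: int = 24) -> int:
    """Video payload in bits per second for a given pixel rate."""
    return pixel_rate * bits_per_pixel


if __name__ == "__main__":
    video = video_bit_rate(LINK_PIXEL_RATE)
    audio = audio_bit_rate(8, 192_000, 24)
    print("active pixels/s at 1080p60:", active_pixel_rate(1920, 1080, 60))
    print("1080p60 fits in the link:", PIXEL_CLOCK_1080P60 <= LINK_PIXEL_RATE)
    print(f"max audio is {audio / video:.1%} of the video payload rate")
```

Even the maximum audio load works out to about 1% of what the link moves as video, so it disappears into the blanking without trouble.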