Modern diarization pipelines typically consist of four stages: (1) voice activity detection (find segments with speech vs. silence), (2) speaker embedding extraction (convert each speech segment into a vector that represents the speaker's voice characteristics, using models like ECAPA-TDNN), (3) clustering (group segments with similar embeddings — same speaker), and (4) optionally, resegmentation (refine boundaries using the clustered speaker models). The pipeline produces timestamps labeled with speaker IDs.
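The clustering stage (3) can be sketched with a toy example. This is a minimal, NumPy-only average-linkage agglomerative clustering over cosine distances, with synthetic 8-dimensional "embeddings" standing in for real ECAPA-TDNN vectors (which are typically 192-dimensional); the threshold value and the toy data are illustrative assumptions, not values from any production system.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cluster_embeddings(embeddings, threshold=0.5):
    """Naive average-linkage agglomerative clustering by cosine distance.
    Repeatedly merges the two closest clusters until no pair of clusters
    is closer than `threshold`; returns one integer label per segment."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.mean([cosine_distance(embeddings[a], embeddings[b])
                             for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are distinct speakers
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    labels = [0] * len(embeddings)
    for k, cluster in enumerate(clusters):
        for idx in cluster:
            labels[idx] = k
    return labels

# Toy "embeddings": two speakers pointing in distinct directions, plus noise
rng = np.random.default_rng(0)
spk_a = rng.normal(0, 0.05, (3, 8)) + np.array([1.0] * 4 + [0.0] * 4)
spk_b = rng.normal(0, 0.05, (3, 8)) + np.array([0.0] * 4 + [1.0] * 4)
# Interleave segments as they might occur in a conversation: A, B, A, B, ...
segments = np.vstack([spk_a[0], spk_b[0], spk_a[1], spk_b[1], spk_a[2], spk_b[2]])
labels = cluster_embeddings(segments)
print(labels)
```

Real systems use more scalable variants (spectral clustering, or agglomerative clustering with a tuned stopping criterion), but the core idea is the same: segments whose embeddings are close belong to the same speaker.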
Newer systems like Pyannote, NVIDIA NeMo, and WhisperX perform diarization end-to-end or integrate tightly with speech recognition. WhisperX combines Whisper transcription with word-level timestamps and speaker diarization, producing speaker-attributed transcripts in one pipeline. End-to-end approaches can also handle overlapping speech better than modular pipelines, whose clustering stage assigns each segment to a single speaker.
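The alignment step such an integrated pipeline performs — attaching a speaker label to each transcribed word — can be illustrated with a simplified sketch: assign each word to the diarization segment that overlaps it most in time. The function name and the dictionary shapes here are illustrative assumptions, not the actual WhisperX API.

```python
def assign_speakers(words, diar_segments):
    """Assign each word (with start/end times in seconds) to the speaker
    whose diarization segment overlaps it the most. Words with no
    overlapping segment keep speaker=None. Simplified illustration."""
    attributed = []
    for word in words:
        best_speaker, best_overlap = None, 0.0
        for seg in diar_segments:
            # Length of the intersection of the two time intervals
            overlap = min(word["end"], seg["end"]) - max(word["start"], seg["start"])
            if overlap > best_overlap:
                best_speaker, best_overlap = seg["speaker"], overlap
        attributed.append({**word, "speaker": best_speaker})
    return attributed

# Hypothetical ASR output (word-level timestamps) and diarization output
words = [
    {"word": "hello", "start": 0.1, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.8},
    {"word": "hi",    "start": 1.1, "end": 1.3},
]
diar_segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 2.0},
]
result = assign_speakers(words, diar_segments)
```

Word-level timestamps matter here: with only segment-level timestamps, a word near a speaker-change boundary can easily be attributed to the wrong speaker.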
Hard cases: overlapping speech (two people talking simultaneously), short speaker turns (brief interjections), similar-sounding speakers (family members), varying recording conditions (one speaker on a phone, another in the room), and determining the number of speakers (often unknown in advance). State-of-the-art systems achieve ~5–10% Diarization Error Rate on benchmark datasets, but error rates can be substantially higher in challenging real-world conditions.
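Diarization Error Rate sums three error types — missed speech, false-alarm speech, and speaker confusion — divided by total reference speech time. The sketch below approximates it frame by frame; as a simplification it ignores overlapping speech, forgiveness collars, and the optimal speaker mapping that the standard metric computes, so it assumes hypothesis labels are already matched to reference labels.

```python
def label_at(segments, t):
    """Return the speaker active at time t, or None for silence.
    `segments` is a list of (start, end, speaker) tuples (non-overlapping)."""
    for start, end, speaker in segments:
        if start <= t < end:
            return speaker
    return None

def diarization_error_rate(reference, hypothesis, frame=0.01):
    """Frame-based approximation of DER:
    (missed speech + false alarm + speaker confusion) / total reference speech.
    Simplified: no overlap handling, no collar, labels assumed pre-matched."""
    total = max(end for _, end, _ in reference + hypothesis)
    miss = false_alarm = confusion = speech = 0
    for i in range(int(round(total / frame))):
        t = i * frame
        ref = label_at(reference, t)
        hyp = label_at(hypothesis, t)
        if ref is not None:
            speech += 1
            if hyp is None:
                miss += 1          # reference speech, hypothesis silence
            elif hyp != ref:
                confusion += 1     # both speech, wrong speaker
        elif hyp is not None:
            false_alarm += 1       # hypothesis speech, reference silence
    return (miss + false_alarm + confusion) / speech

# Hypothetical example: the boundary is placed 0.1 s late,
# so 0.1 s of speaker B's speech is attributed to A.
reference  = [(0.0, 1.0, "A"), (1.0, 2.0, "B")]
hypothesis = [(0.0, 1.1, "A"), (1.1, 2.0, "B")]
der = diarization_error_rate(reference, hypothesis)
```

Here 0.1 s of error over 2.0 s of speech gives a DER of 5% — which shows how even small boundary errors accumulate quickly in conversations with frequent speaker turns.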