As mentioned in the previous blog post, the ATC professionals responsible for the manual checks and transcriptions were not happy with the quality of the automatic segmentation of the recordings in the first run. The problems were particularly burdensome for the NATS data, where the traffic is much denser than in the Isavia recordings and the controllers sometimes do not have time to draw breath between two aircraft calls.
Whereas the first version of the automatic splitting (whose output is manually checked in BUT's web-based tool SpokenData) relied on voice activity detection to produce the elementary segments, the new system employs speaker diarization to create them. The initial version sometimes generated long segments containing several dialogue exchanges between pilots and air traffic controllers, because there was no sufficient pause between the utterances. Speaker diarization detects the change of speaker and produces correct segments in these situations. Furthermore, the speaker labelling (deciding whether the speaker is a pilot or an air traffic controller) is now based on a stochastic approach that takes advantage of clustering of the diarized segments.
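To make the idea concrete, here is a minimal sketch of diarization-based segmentation followed by two-way clustering for speaker labelling. It uses the open-source pyannote.audio toolkit purely as a stand-in; the model names, the embedding step, and the final pilot/controller heuristic are illustrative assumptions, not the project's actual pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from pyannote.audio import Pipeline, Inference

AUDIO = "atc_recording.wav"  # hypothetical input recording

# 1) Diarization: cut the audio at speaker changes, so back-to-back
#    pilot/controller turns land in separate segments even when there
#    is no pause between them (unlike plain voice activity detection).
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline(AUDIO)
segments = [turn for turn, _, _ in diarization.itertracks(yield_label=True)]

# 2) Embed every diarized segment and cluster the embeddings into two groups.
embed = Inference("pyannote/embedding", window="whole")
X = np.vstack([embed.crop(AUDIO, seg) for seg in segments])
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# 3) Map clusters to roles. A real system would rely on acoustic cues
#    (e.g. the controller's cleaner ground-side channel); here we simply
#    assume the cluster with the most total speech is the controller.
totals = [sum(s.duration for s, c in zip(segments, clusters) if c == k)
          for k in (0, 1)]
atco = int(np.argmax(totals))
for seg, c in zip(segments, clusters):
    print(f"{seg.start:7.2f}-{seg.end:7.2f}  {'ATCO' if c == atco else 'pilot'}")
```

The key difference from the original system is step 1: the segment boundaries come from speaker-change points rather than from silence, so two overlapping-in-time or back-to-back turns are no longer merged into one long segment.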
The above-mentioned improvements should make the manual correction of the automatic speaker segmentation and labelling much easier and should significantly reduce the time controllers have to spend before they can move on to the speech transcription. A comparison with the previous method on the data from the first run of the manual checking shows that the need for manual splitting could be reduced by 58% for NATS recordings and by 23% for Isavia recordings (the initial version was already relatively accurate on the less dense Isavia data). The speaker labelling corrections could be reduced by 38% for NATS and 22% for Isavia. Let us see how this is reflected in the total time spent on listening to the recordings and improving their segmentation in the second run of the manual corrections.
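For clarity, the percentages above can be read as the relative drop in the number of manual correction operations needed between the two versions. A toy sketch of this bookkeeping follows; the counts are invented for illustration, as the project's exact way of counting correction operations is an assumption here:

```python
def reduction(old_edits: int, new_edits: int) -> float:
    """Relative drop in required manual corrections, in percent."""
    return 100.0 * (old_edits - new_edits) / old_edits

# e.g. if the first run needed 1000 manual split operations on NATS data
# and replaying the new segmenter would leave 420, the reduction is 58%.
print(f"{reduction(1000, 420):.0f}%")  # -> 58%
```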