Interspeech is one of the biggest conferences on speech research. In September 2019, industry leaders attended this four-day event to present their work. Here are my main takeaways from the conference.
The first day was very interesting. Graz was picked as this year's location because Austria is a meeting point for three language families (Romance, Slavic and Germanic). The keynote presenter, ISCA medalist Keiichi Tokuda, shared great insights into the field of statistical speech synthesis.
One of my main takeaways was the progress in emotion production for speech synthesis, the field concerned with making a computer voice sound more natural. A hidden Markov model can be used to produce these voices. We were shown an example of a computer-generated singing voice, which sounded incredibly natural.
Zero resource ASR
The session about zero resource ASR concentrated on dialects and endangered languages. For the latter, ASR can be combined with a speech synthesiser to teach these languages to those interested in keeping them alive. This works by using parts of another language to train a model for a new (zero resource) language.
During the session, we were shown a caption dictionary for images, created using two languages. For this to work, there has to be some similarity between them. This could be very interesting for robotics, as a robot would be able to describe what it sees in its own language.
This is also possible by combining an acoustic model (how words are pronounced) from a well-resourced language with a language model (which word sequences are likely) of a zero resource language. For this, a high similarity between the two languages is required. Afrikaans and Dutch, for example: use the acoustic model from Dutch and the language model from Afrikaans. This principle could help design an ASR system for new languages and also makes it adaptable to dialects.
The far-field ASR presentation concentrated on voice assistants. During the session, I saw examples of how speech is often affected by noise. Combined with reverberation and echo, this degrades the audio signal, and the effect is more pronounced in the far field than in the near field. A mixture of techniques was used, including deep learning models and beamformers. Notably, a beamformer can run on-device and doesn't need training, which is a tremendous advantage.
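To illustrate why a beamformer needs no training, here is a minimal delay-and-sum beamformer in NumPy. It is only a sketch: the delays are assumed known, whereas in practice they come from the array geometry or are estimated.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Align each microphone channel by its known delay (in samples) and average."""
    out = np.zeros(signals.shape[1])
    for sig, d in zip(signals, delays):
        out += np.roll(sig, -d)  # undo the arrival delay
    return out / len(signals)

# Simulate one source reaching a second microphone 5 samples later.
fs = 16000
t = np.arange(1024) / fs
source = np.sin(2 * np.pi * 440 * t)
mics = np.stack([source, np.roll(source, 5)])

enhanced = delay_and_sum(mics, delays=[0, 5])
# After alignment the two copies add coherently, reinforcing the source,
# while uncorrelated noise from different mics would partially cancel.
```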
Source separation is quite similar to far-field ASR, but it deals with different issues: in many cases there's no reverberation or echo, but there are competing speakers (often described as the cocktail party effect). Most of the techniques presented to overcome this involved machine learning. At CTS, we could use this for applications where multiple speakers talk simultaneously. Combined with a Dialogflow agent, this would enable us to automatically transcribe a conversation in a meeting room.
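The usual starting point for the cocktail party effect is the linear mixture model: each microphone hears a weighted sum of the sources. A minimal sketch with an invented mixing matrix is below; note that real separation systems (ICA, neural models) have to estimate the unmixing matrix blindly from the mixtures alone.

```python
import numpy as np

# Two "speakers" (toy signals) recorded by two microphones.
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 5 * t)           # speaker 1
s2 = np.sign(np.sin(2 * np.pi * 3 * t))  # speaker 2
sources = np.stack([s1, s2])

# Each mic hears a different mixture: x = A @ s (the cocktail party model).
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
mixtures = A @ sources

# With a known mixing matrix, separation is just inversion. Blind methods
# must recover an unmixing matrix without ever seeing A.
recovered = np.linalg.inv(A) @ mixtures
```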
There was also a session on mental health and how it relates to speech. Speech carries all kinds of features that express emotion, and these can be used to describe a person's mental state.
Predicting depression was a hot topic during the presentation. It was interesting to learn how technology can help recognise someone's mental state based on their talking speed and posture. According to psychology, a person's posture can be used to determine their affective dimension, which means posture recognition can improve affect recognition in speech.
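A toy way to picture that improvement is late fusion: average the per-class scores of a speech model and a posture model, so one modality can resolve ambiguity in the other. All numbers below are invented for illustration.

```python
# Hypothetical per-class scores from two separate models.
speech_scores = {"happy": 0.45, "sad": 0.40, "neutral": 0.15}
posture_scores = {"happy": 0.20, "sad": 0.65, "neutral": 0.15}

def fuse(a, b, weight=0.5):
    """Weighted average of two score dictionaries over the same labels."""
    return {label: weight * a[label] + (1 - weight) * b[label] for label in a}

fused = fuse(speech_scores, posture_scores)
print(max(fused, key=fused.get))  # "sad": posture resolves the near-tie in speech
```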
When using rigid models, the recognition rate can be influenced by accents and issues with the vocal tract. On day three of the conference, I attended a session on the vocal tract and how it can be modelled in 3D using fluid dynamics. The model showed the effect of hoarseness and lesions on the vocal cords. To verify this, high-speed endoscopy was used to track the movement of the vocal folds, which vibrate surprisingly quickly. This information can be used to build a model that can deal with small variations in the voice caused by, for example, a cold or vocal tract issues.
Speech and audio classification
Speech and audio classification concentrates on the events happening in the background of a recording. These can be cars driving past, clocks ticking or rain falling on a roof. Knowing what these sounds are helps with noise reduction (suppressing specific frequencies or events) as well as identifying where a recording took place (a cafe, street or greenhouse).
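As a deliberately simple sketch of event detection (real classifiers use spectral features and learned models, not a plain energy threshold), a loud event in a quiet recording can be spotted by frame energy alone:

```python
import numpy as np

def detect_events(audio, frame_len=512, threshold=0.1):
    """Return indices of frames whose mean energy exceeds a threshold."""
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.mean(frames ** 2, axis=1)
    return np.where(energy > threshold)[0]

# Quiet background noise with a short loud tone burst in the middle.
rng = np.random.default_rng(0)
audio = 0.01 * rng.standard_normal(8192)
audio[3000:3500] += np.sin(2 * np.pi * 880 * np.arange(500) / 16000)

print(detect_events(audio))  # frame(s) where the burst dominates
```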
This track was very interesting. Surveillance of critical systems would be an interesting application for this tech: the classifier could detect a fire or breaking glass and send people to the relevant location. The gammatonegram was especially inspiring to see. This transformation retains more spectral information in the lower frequency regions, which could be used for the removal of reverberation.
The focus of this session was solving the cocktail party effect, where multiple people talk, often at the same time. Determining the location of the speaker simplifies the challenge, since only a small area has to be taken into account; moreover, it's very unlikely that two speakers are in exactly the same location. Beamforming can be used for target speaker extraction, as well as for tracking speakers as they move across the room. The latter is great for surveillance purposes, where you can place cameras with microphones and combine the idea with gait analysis.
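Locating a speaker with a microphone pair starts from the time difference of arrival between the channels, which can be estimated with a plain cross-correlation. A minimal NumPy sketch, using a synthetic signal with a known delay:

```python
import numpy as np

def estimate_delay(mic_a, mic_b):
    """Estimate the sample delay of mic_b relative to mic_a via cross-correlation."""
    corr = np.correlate(mic_b, mic_a, mode="full")
    return np.argmax(corr) - (len(mic_a) - 1)

# Simulate the same source reaching a second microphone 7 samples later.
rng = np.random.default_rng(1)
source = rng.standard_normal(2048)
delayed = np.roll(source, 7)

print(estimate_delay(source, delayed))  # recovers the 7-sample delay
```

Given the delay and the microphone spacing, simple geometry narrows down the direction the sound came from, which is exactly what makes localisation useful before separation.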
The step from speaker separation to speaker recognition is a small one. Speaker recognition can verify that someone is who they claim to be. It was great to learn how this information can be kept secure with encryption requiring two keys: no one can decrypt the information unless both keys are present and the owner of the dataset (or voice) has given consent.
Another interesting technique that was shown is enhancing speech in noisy conditions. This is very useful for voice assistants: filtering out background noise improves speech recognition.
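One classic approach to this is spectral subtraction: estimate the noise magnitude spectrum, subtract it from the noisy spectrum, and keep the noisy phase. The sketch below works on a single frame and conveniently uses the exact noise as its estimate; real enhancers process overlapping windows and estimate the noise from speech-free segments.

```python
import numpy as np

def spectral_subtract(noisy, noise_estimate):
    """Subtract an estimated noise magnitude spectrum from a noisy signal."""
    spec = np.fft.rfft(noisy)
    noise_mag = np.abs(np.fft.rfft(noise_estimate))
    clean_mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
    # Keep the noisy phase, replace only the magnitude.
    return np.fft.irfft(clean_mag * np.exp(1j * np.angle(spec)), n=len(noisy))

fs = 16000
t = np.arange(1024) / fs
speech = np.sin(2 * np.pi * 300 * t)        # stand-in for a voice
noise = 0.3 * np.sin(2 * np.pi * 4000 * t)  # steady background tone
enhanced = spectral_subtract(speech + noise, noise)
# The 4 kHz tone is removed while the 300 Hz "speech" is preserved.
```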
Natural language interfacing
The final day of the conference opened with a keynote on sound and voice applications. Natural language interfacing lets you control an interface with your voice as the main input. During this session, I learned about the sequence-to-sequence approach, where a sentence is processed as a whole rather than being chopped into bits.
Another solution to this problem is to use so-called sketches that describe similar questions and drill down to the right information. These are then used as input for a sequence-to-sequence model. Additionally, the sentence can be translated into a different language and then back to the original one to ensure both share a similar structure, solving the phrasing issue.
Audio signal characterisation
This session mainly covered scene classification for those interested in learning where a piece of audio was recorded. Audio event classification can be a good technique to identify what's happening in a recording apart from speech. This could be applied to surveillance situations, where unusual audio events might trigger an alarm.
Representation learning and emotion recognition
The last session of the conference dealt with two topics that were of interest to me: representation learning and emotion recognition. I'd like to learn more about the latter, and combining it with representation learning makes it interesting to look at algorithms and see how they can be applied to other challenges.
I learned a great deal during the conference and look forward to putting these insights to good use for CTS projects as well as for my final thesis.
Interspeech 2020 will be held in Shanghai, China. Watch this space for my thoughts on next year’s conference.