Scope of the Special Session

Over the last several years, there has been significant interest in articulatory-to-acoustic conversion, a research field often referred to as “Silent Speech Interfaces” (SSI). The main idea is to record soundless articulatory movement and automatically generate speech from the movement information while the subject produces no sound. For this conversion task, electromagnetic articulography (EMA), ultrasound tongue imaging (UTI), permanent magnetic articulography (PMA), surface electromyography (sEMG) or multimodal approaches are typically used, and these methods may also be combined with a simple video recording of the lip movements. Current SSI systems follow either 1) the “direct synthesis” principle, where speech is generated directly from the articulatory data without an intermediate step, or 2) the “recognition-followed-by-synthesis” principle, where the content is first recognized from the articulatory data and speech is then synthesized from the resulting text. Compared to the approach based on Silent Speech Recognition (SSR), direct synthesis has a much smaller delay between articulation and speech generation, which enables conversational use and research on human-in-the-loop scenarios, and it has fewer potential sources of error than the SSR + synthesis approach.

Such an SSI system can be highly useful for the speech-impaired (e.g. after laryngectomy) and in scenarios where regular speech is not feasible but information still has to be transmitted from the speaker (e.g. extremely noisy environments). Although there have been numerous research studies in this field in the last decade, practically working applications still seem far away. The main challenges are the following:

  1. Session dependency: One source of variance is the possible misalignment of the recording equipment. For example, for UTI recordings, the probe-fixing headset has to be mounted on the speaker before use, and in practice it is impossible to mount it in exactly the same position as before. As a result, the recorded ultrasound video is misaligned relative to a video recorded in a previous session, so such recordings are not directly comparable. Similar issues occur with other articulatory equipment.
  2. Speaker dependency: Although there are many research results on generating intelligible speech or recognizing content from EMA, UTI, PMA, sEMG, lip video and multimodal data, all of the studies were conducted on relatively small databases, typically with one or just a small number of speakers. However, all articulatory tracking devices are highly sensitive to the speaker, so the development of novel methods for normalization, alignment, model adaptation and speaker adaptation would be highly important.
  3. Mapping from articulatory movement to prosody: Most approaches in SSI concentrate on predicting just the spectral features. The reason is that while there is a direct relation between articulatory movement and the spectral content of speech, the F0 value depends on the vocal fold vibration, which is only indirectly related to the movement of the tongue and face or the opening of the lips. Only a few studies have attempted to predict the voicing feature and the F0 curve from articulatory data, yet articulatory-to-F0 prediction would be key to practical applications.

Within this special session, we call for recent results in articulatory data processing for silent speech synthesis and recognition, in particular work related to multi-speaker and multi-session modeling and to prosody generation. Processing the articulatory signals requires knowledge from fields that are further away from speech processing, e.g. biosignal-related 2D/3D image processing, multi-dimensional audio signal processing and audio-visual synchronization. Therefore, we invite cross-fertilization from other fields.

Papers and Presentation Form

Topics of interest for this special session include, but are not limited to:

Paper submissions must conform to the format defined in the Interspeech paper preparation guidelines and detailed in the Author’s Kit, which can be found on the Interspeech web site. When submitting the paper in the Interspeech electronic paper submission system, please indicate that the paper should be included in the Special Session on Speaker Adaptation and Prosody Modeling in Silent Speech Interfaces. All submissions will take part in the normal paper review process.

We encourage you to bring demonstrations of working systems to present along with your paper. If you would like to do so, please inform the special session chairs via e-mail by July 1.

Important Dates

