In recent years, there has been significant interest in articulatory-to-acoustic conversion, a research field often referred to as “Silent Speech Interfaces” (SSI). The main idea is to record soundless articulatory movement and automatically generate speech from the movement information while the subject is not producing any sound. For this automatic conversion task, electromagnetic articulography (EMA), ultrasound tongue imaging (UTI), permanent magnet articulography (PMA), surface electromyography (sEMG) or multimodal approaches are typically used, and these methods may also be combined with a simple video recording of the lip movements. Current SSI systems follow one of two principles: 1) “direct synthesis”, where speech is generated directly from the articulatory data without an intermediate step, or 2) “recognition-followed-by-synthesis”, where the content is first recognized from the articulatory data and speech is then synthesized from the resulting text. Compared to the approach based on Silent Speech Recognition (SSR), direct synthesis has a much smaller delay between articulation and speech generation, which enables conversational use and potential research on human-in-the-loop scenarios, and it has fewer potential sources of error than the SSR + synthesis approach.
Such an SSI system can be highly useful for people with speech impairments (e.g. after laryngectomy), and in scenarios where regular speech is not feasible but information must still be transmitted from the speaker (e.g. extremely noisy environments). Although numerous research studies have been carried out in this field in the last decade, practically working applications still seem to be far away. The main challenges are the following:
Within this special session, we call for recent results on articulatory data processing for silent speech synthesis and recognition, in particular work related to multi-speaker and multi-session scenarios and to prosody generation. Processing the articulatory signals requires knowledge from fields further away from speech processing, e.g. biosignal-related 2D/3D image processing, multi-dimensional audio signal processing, and audio-visual synchronization. Therefore, we invite cross-fertilization from other fields. We encourage you to bring demonstrations of working systems to present along with your paper.
Topics of interest for this special session include, but are not limited to:
Paper submissions must conform to the format defined in the Interspeech paper preparation guidelines and detailed in the Author’s Kit, which can be found on the Interspeech web site. When submitting the paper in the Interspeech electronic paper submission system, please indicate that the paper should be included in the Special Session on Speaker Adaptation and Prosody Modeling in Silent Speech Interfaces. All submissions will take part in the normal paper review process.
If you would like to present a demonstration of a working system along with your paper, please inform the special session chairs via e-mail by July 1.