Name: Joseph Razik
Joseph Razik graduated in both mathematics and computer science from Nancy University in France. He then began research in speech recognition and obtained an M.S. degree in computer science from the same university. In 2007, he received a Ph.D. in computer science under the direction of Prof. Jean-Paul Haton; his thesis concerned the definition of confidence measures for automatic speech recognition. He is currently a postdoctoral fellow, supervised by Gérard Chollet, in the Signal and Image Processing department of TELECOM ParisTech.
His research interests include:
· Speech/music segmentation,
· Automatic dialogue,
· Speech and speaker recognition,
· Voice transformation and conversion.
P. Perrot, M. Morel, J. Razik, G. Chollet. Vocal Forgery in Forensic Sciences. e-Forensics 2009.
J. Razik, O. Mella, D. Fohr, JP. Haton. Frame-Synchronous and Local Confidence Measures for on-the-fly Automatic Speech Recognition. Interspeech 2008.
Title of Project: Voice conversion: a toy, a threat or a forensic tool?
Voice conversion is being developed for a growing number of applications, such as entertainment and speech synthesis. The same development, however, makes it a real threat in the hands of criminals and perhaps an efficient tool for investigators. The aim of voice conversion is to automatically transform a source (impostor) speaker's voice so that it sounds like a target (client) speaker's voice. In a criminal case (from malicious calls to terrorism), it is inconvenient to rely on a professional impersonator to imitate the voice of a target; a system able to do so automatically is far more attractive. Different methods exist in the literature. The aim of this presentation is to review the possibilities, to propose a comparative evaluation of three specific methods based on a client voice extracted from the Internet, and to open a perspective on the reversibility of voice disguise for investigators.
Nowadays, it is easy to collect speech or video material of someone from the Internet, especially for well-known people such as politicians. For our study, we collected an address by the French president from the Internet and trained a conversion function to imitate his voice. Fortunately, the converted signal is not of high quality: it has low intelligibility, sounds unnatural, and contains many artifacts and much noise. However, according to three different measures (spectral distortion, likelihood ratio, and a perceptual test), the converted voice is closer to the target than the original source voice is, and it can deceive automatic speaker verification systems.
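As a rough illustration of the first of these measures, the sketch below computes a mean log-spectral distortion (in dB) between time-aligned sequences of power spectra and checks whether a converted voice lies closer to the target than the source does. The function name, the toy spectra, and the choice of the RMS log-spectral distance are assumptions made for illustration only; the abstract does not state which distortion definition or acoustic features were used in the study.

```python
import math


def log_spectral_distortion(spec_a, spec_b, eps=1e-10):
    """Mean RMS log-spectral distance (dB) between two aligned spectrum sequences.

    spec_a, spec_b: lists of frames, each frame a list of power-spectrum values.
    The two sequences are assumed to be already time-aligned (hypothetical setup).
    """
    if len(spec_a) != len(spec_b) or not spec_a:
        raise ValueError("sequences must be non-empty and have the same number of frames")
    total = 0.0
    for frame_a, frame_b in zip(spec_a, spec_b):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same number of frequency bins")
        # Squared difference of the two spectra on a dB scale, per frequency bin.
        squared = [
            (10.0 * math.log10((a + eps) / (b + eps))) ** 2
            for a, b in zip(frame_a, frame_b)
        ]
        total += math.sqrt(sum(squared) / len(squared))
    return total / len(spec_a)


# Toy one-frame, three-bin spectra (made-up values, not data from the study).
target = [[1.0, 4.0, 2.0]]
source = [[4.0, 1.0, 2.0]]
converted = [[1.5, 3.0, 2.0]]

d_source = log_spectral_distortion(source, target)
d_converted = log_spectral_distortion(converted, target)

# A conversion moves the voice towards the target if its distortion is lower.
print(d_converted < d_source)
```

A lower distortion for the converted voice than for the source voice is what "closer to the target" means under this particular measure; the likelihood ratio and the perceptual test would each need their own setup.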
Although this automatic voice conversion technique is a potential threat, it is also a potential forensic tool for inverting voice disguise. At the current state of knowledge, the goal would not be to perform speaker recognition on the "cleaned" voice, but to provide better intelligibility or clues as to whether the voice is genuine.