Thursday July 26, 2007
Dave's Bit Bucket
Dave Walker's jottings - mostly pertaining to security

I watched a fascinating documentary on television the other night, which not only revealed some interesting information on an infamous historical figure, but also raised the curtain on a technology I'd expect to see much greater deployment of in the near future. Please bear with me, and I'll describe the whole thing in context.

Between 1941 and 1944, Eva Braun - who had received some training in cine camera work and cinematography - shot a large amount of silent colour footage of the life that she and Adolf Hitler shared at the Berghof, the near-equally infamous guests they entertained, and so on. This footage was discovered in 1945 by the OSS, who were looking for any evidence that would be useful to the prosecution in the Nuremberg trials. As the films were silent, they were considered to be of no evidential value and have remained, for decades, a simple historical curiosity.

A profoundly deaf computer scientist - who is, coincidentally, German - determined some years ago that computers could potentially use image analysis to lip-read, using the techniques that he employs day to day. He developed image analysis software which can, with a high degree of accuracy, map mouth, jaw and throat movements from captured video onto a computer-modelled head, and thus to phonemes. The software - which is currently optimised for German - can make such a mapping not only when the captured footage presents a subject face-on to the camera, but also when the subject's face is turned anything up to 120 degrees from face-on. Thus, with a little training (the details of which were unfortunately glossed over, but which appeared to mostly involve identifying which area of the captured video represented a speaker's mouth), the software was able to translate the lip movements of the filmed subjects into speech components.
Using the vocal talent of a number of German actors - one of whom was judged to impersonate Hitler particularly closely, based on the one covert audio recording of Hitler ever made, which is also the only recording of him in conversation rather than giving a public speech - a vocal track was made to accompany the Berghof footage, and the two were dubbed together. This was shown in the documentary, with English subtitles.

Now, consider the ramifications of automated lip reading (ALR) as applied in other contexts. If facial recognition software is able to identify not only a face but also its components, such that it could pass details of mouth location to ALR software, it could become possible to reconstruct speech (and potentially and eventually, text) from high-resolution CCTV footage. Given enough computing power, this could potentially be done for every face in a crowd.

Of course, caveats apply. German is a particularly clearly-enunciated language where every syllable is sounded, has a relatively small number of consistent pronunciation rules, and does not have variations in meaning associated with tonality. Tonal languages such as Thai, where a given short phoneme sequence spoken with one tone can mean something entirely different from the same sequence spoken with another, would most likely still require a skilled human interpreter to give meaning to the sound that ALR output would have, unless ALR is expected to eventually be able to interpret such things based on analysis of apparent constrictions in the footage of the speaker's throat. Nonetheless, it's interesting...

Update: The technology on show was developed by Frank Hubner; I've also found a paper on automated head-recognition algorithms, designed specifically to facilitate mouth area identification for automated lip reading, here. Seems there are more folk working on this than initially meets the eye - it's a space to watch.

(2007-07-26 07:53:19.0)
Posted by Gilles Gravier on July 27, 2007 at 05:18 AM GMT #