Science

Aggression detector for CCTV

A smart add-on for CCTV cameras recognizes stress and aggression. Its developer, Dr. Iulia Lefter, says it could help surveillance operators decide which cameras to watch first.


It all starts when someone who is already late for an important meeting reports at the desk. But instead of being helped, the visitor gets ignored or misunderstood. As tension rises, so do the voices. Gestures get sharper and faster. Next, someone may throw something as innocent as water, yet provoke the other party to escalate. For humans, it’s pretty clear that things are getting out of hand. But how do you teach a computer to read the subtleties of human, or primate, conduct?




That is the subject of Lefter’s PhD research at the EEMCS faculty, which was supported by the Netherlands Defence Academy, TNO and the TU. Lefter, who did her BSc at the Transilvania University of Brasov in Romania, graduated with honours from the TU in 2009 on the automatic recognition of emotions based on speech.




Dr. Gertjan Burghouts from TNO, who is on her doctoral committee, specializes in automatic video detection of human behaviour and real-time classification of activities as innocent or threatening. You can get a flavor of his work on TNO’s IntelligentImaging YouTube page.





 

Mind the gap


“What is unique about our work”, says Lefter, “is that we created an intermediate level of understanding.” She refers to the ‘semantic gap’ that exists between, say, the pitch and loudness of a voice (low-level information) and the message it conveys (high-level information). The same applies to tracking someone in a video image (low level) and understanding that person’s behavior (high level).




Interaction between humans is a complex dual-channel affair: there is an exchange of speech and of body language, both of which have content (called ‘semantics’) and a tone (called ‘prosody’). “There’s a big difference between pointing like this”, Lefter explains, gradually moving her finger towards me. “Or like this”, she adds, now stabbing her finger. Metaphorically speaking, what researchers call ‘prosody’ is the tone that makes the music.




In the ‘intermediate level’ the system evaluates meaning and tone of both speech and gestures. From these, stress and aggression levels are derived. Mind you, the system cannot judge by itself – it was trained to agree as closely as possible with people who rated the training videos.




Lefter stresses that people need both image and sound to evaluate stressful situations correctly. Confronted with only one channel of the training videos, people tagged more than half of the situations as ‘low aggression’. When both sight and sound were presented, only a third of the situations were still classified as low aggression. In other words: with incomplete information, people tend to underestimate aggression levels or misinterpret behaviour.


The current smart stress sensor agrees with human raters in about 70 percent of cases. The incidents it typically misses are those that arise from a social context: someone putting his feet up in the train, or a mother with a baby on her arm to whom no one offers a seat. “Most people would understand why stress could evolve from that, because they know the context.” Computers obviously don’t. Yet.




The multimodal surveillance that Lefter developed could be used in a CCTV surveillance room to prioritize the screens on which something seems to be happening. Operators should nonetheless stay vigilant, says Lefter, because smart as it may be, the automatic surveillance is not watertight.
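To make the idea of prioritizing screens concrete, here is a minimal sketch in Python. It is not Lefter’s system: the `CameraFeed` class, the `aggression_score` field (assumed to be a value between 0 and 1 produced by some detector), the `prioritize_feeds` function and the threshold of 0.5 are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CameraFeed:
    """A camera with a hypothetical aggression score in [0, 1]."""
    camera_id: str
    aggression_score: float


def prioritize_feeds(feeds, threshold=0.5):
    """Return feeds scoring at or above the threshold, highest score first.

    The threshold is an illustrative assumption, not a value from the research.
    """
    flagged = [f for f in feeds if f.aggression_score >= threshold]
    return sorted(flagged, key=lambda f: f.aggression_score, reverse=True)


# Made-up example scores for three cameras.
feeds = [
    CameraFeed("desk", 0.82),
    CameraFeed("hall", 0.10),
    CameraFeed("platform", 0.55),
]
ranked = prioritize_feeds(feeds)
for feed in ranked:
    print(feed.camera_id, feed.aggression_score)
```

In this sketch the operator would see the desk camera first and the platform camera second, while the quiet hall camera stays in the background. Since the detector can miss incidents, such a ranking would only suggest where to look first.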




→ Iulia Lefter, Multimodal Surveillance, Behavior Analysis for Recognizing Stress and Aggression, 12 March 2014, PhD supervisors Prof. Catholijn Jonker and Prof. Leon Rothkrantz (EEMCS faculty at TU and the Netherlands Defence Academy).

Editor: Editorial staff

Do you have a question or comment about this article?

delta@tudelft.nl
