The availability of low-cost, portable imagers and new embedded computing platforms makes video surveillance possible in new environments. However, situations in which a portable, embedded video surveillance system is most useful (e.g., monitoring outdoor and/or busy scenes) also pose the greatest challenges. Real-world scenes are characterized by changing illumination and shadows, multimodal features (such as rippling waves and rustling leaves), and frequent, multilevel occlusions. To extract foreground in these dynamic visual environments, adaptive multimodal background models are frequently used that maintain historical scene information to improve accuracy. These methods are problematic in real-time embedded environments where limited computation and storage restrict the amount of historical data that can be processed and stored.

We have developed a new adaptive technique, multimodal mean (MM), which balances accuracy, performance, and efficiency to meet embedded system requirements. This algorithm delivers comparable accuracy of the best alternative (Mixture of Gaussians) with a 6X improvement in execution time and an 18% reduction in required storage on an eBox-2300 Thin Client VESA PC running Windows Embedded CE 6.0.
