Hiu Chung (Martin) Law:
Feature Saliency in Unsupervised Learning
Clustering is a common unsupervised learning technique for discovering the structure of a set of multi-dimensional data. While many clustering algorithms exist, the important issue of feature selection, that is, which attributes of the data the clustering algorithm should use, is rarely touched upon. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and no obvious criteria to guide the search. Another important yet not satisfactorily solved problem for partitional clustering is determining the correct number of clusters. In this talk I shall describe an algorithm that solves these two problems simultaneously. Instead of making a hard decision to select different features, feature saliencies are estimated. Instead of requiring the user to specify the number of clusters, this number is estimated directly from the data. An Expectation-Maximization (EM) algorithm using the Minimum Message Length (MML) criterion is derived to this end. Both synthetic and real data are used to demonstrate the potential of the proposed algorithm.
Write-up and presentation files are available.
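As a toy illustration of the model-selection half of the problem above, the sketch below picks the number of clusters by penalizing a hard-assignment Gaussian mixture log-likelihood with a BIC-style term, a crude stand-in for the MML criterion used in the talk. The function names (`fit_kmeans`, `choose_k`) and all parameter values are illustrative, not from the talk, and feature saliency estimation is omitted entirely.

```python
import numpy as np

def fit_kmeans(X, k, rng, iters=50):
    # Plain Lloyd's algorithm with random initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

def penalized_score(X, k, rng):
    # Hard-assignment Gaussian mixture log-likelihood plus a
    # BIC-style penalty (a crude surrogate for an MML criterion).
    n, d = X.shape
    centers, labels = fit_kmeans(X, k, rng)
    loglik = 0.0
    for j in range(k):
        pts = X[labels == j]
        nj = len(pts)
        if nj == 0:
            continue
        var = max(((pts - centers[j]) ** 2).sum() / (nj * d), 1e-9)
        loglik += nj * np.log(nj / n)                        # mixing weights
        loglik += -0.5 * nj * d * (np.log(2 * np.pi * var) + 1.0)
    n_params = k * (d + 2) - 1   # centers, variances, mixing proportions
    return -loglik + 0.5 * n_params * np.log(n)

def choose_k(X, k_max=5, seed=0):
    # Evaluate each candidate k and keep the lowest penalized score.
    rng = np.random.default_rng(seed)
    scores = {k: penalized_score(X, k, rng) for k in range(1, k_max + 1)}
    return min(scores, key=scores.get)
```

On two well-separated Gaussian blobs, the penalized score is minimized at k = 2; a full treatment would estimate saliencies and k jointly inside EM.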
Energy Minimization for Fingerprints and Dental Images
We will introduce the basic theoretical ideas from calculus regarding energy minimization, and discuss some practical aspects pertaining to computer vision. In particular, a few concrete case studies will be discussed: fitting of a curve to the flow field of a fingerprint, fitting of a curve to an image using the intensity, and template matching in dental images.
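A minimal example of the energy-minimization idea in discrete form: fitting a smooth curve to noisy 1-D data by minimizing a quadratic energy with a data-fidelity term and a smoothness term. The function names and the value of `lam` are illustrative; the fingerprint and dental applications in the talk use richer, image-derived energies.

```python
import numpy as np

def smooth_curve(d, lam=5.0):
    """Minimize E(x) = sum_i (x_i - d_i)^2 + lam * sum_i (x_{i+1} - x_i)^2.

    Setting the gradient to zero gives the linear system
    (I + lam * L) x = d, where L is the second-difference matrix,
    so the global minimum is found by one linear solve.
    """
    n = len(d)
    L = np.zeros((n, n))
    for i in range(n - 1):
        # each term (x_{i+1} - x_i)^2 adds this 2x2 stencil
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    return np.linalg.solve(np.eye(n) + lam * L, np.asarray(d, float))

def energy(x, d, lam=5.0):
    # data-fidelity term plus smoothness term
    return ((x - d) ** 2).sum() + lam * (np.diff(x) ** 2).sum()
```

Because the energy is a convex quadratic, the solve returns the global minimizer: it always has energy no greater than the raw data and is strictly smoother.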
Online Development of Cognitive Behaviors by a Robot - A Case Study using Auditory and Visual Sensing
Audition and vision are two major sensory modalities for humans to sense and understand the world. Although significant progress has been made in automatic speech recognition and visual object recognition, the field still faces tremendous difficulties. Motivated by the autonomous development process of humans, we are interested in building a robot that automatically develops its auditory and visual cognitive and behavioral skills through real-time interactions with the environment, which typically involves humans.
In this talk, I will discuss three basic techniques that have been developed and implemented to resolve the technical challenges for a developmental robot: (1) a fast incremental principal component analysis algorithm, the candid covariance-free incremental principal component analysis (CCIPCA) algorithm; (2) a customized version of hierarchical discriminant regression (HDR) for long temporal contexts; (3) a developmental architecture and algorithm that integrate multimodal sensing, action-imposed learning, reinforcement learning, and communicative learning. Based upon the above three basic techniques, we have designed and implemented a prototype robot that learns cognitive behaviors from simple to complex. The architecture and experimental results will be discussed to show how the robot does (1) grounded speech learning, (2) speech-directed procedure learning, and (3) simple semantics learning.
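The CCIPCA update for the first principal component can be sketched as follows. This follows the published update rule as I understand it, with the amnesic parameter set to zero for simplicity; the function name is mine and the samples are assumed to be zero-mean (higher components would be obtained by deflating each sample against the earlier estimates).

```python
import numpy as np

def ccipca_first_component(samples, amnesic=0.0):
    """Track the first principal eigenvector of a stream of
    (assumed zero-mean) samples with the CCIPCA update rule."""
    v = None
    for n, u in enumerate(samples, start=1):
        u = np.asarray(u, float)
        if v is None:
            v = u.copy()                    # initialize with the first sample
            continue
        w_old = (n - 1.0 - amnesic) / n     # weight on the old estimate
        w_new = (1.0 + amnesic) / n         # weight on the new sample
        # project the sample onto the current direction, then blend
        v = w_old * v + w_new * (u @ v) / np.linalg.norm(v) * u
    return v / np.linalg.norm(v)
```

The estimate converges toward the leading eigenvector without ever forming the covariance matrix, which is what makes the method attractive for high-dimensional streaming sensor data.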
Dr. Suthep Madarasmi:
Image Search using Deformable Contours
We present the use of deformable templates for image retrieval, where a template line-drawing sketch can be detected in the target image irrespective of its position, size, rotation, and smooth deformation. First, potential template positions are found in the target image using a novel, modified version of the Generalized Hough Transform and Watershed Segmentation, irrespective of position, scale, and orientation. Each candidate position is then used to find a match by allowing the template to undergo a grid deformation transformation. The deformed template contour is matched with the target by measuring 1) the smoothness of the matching vector between edge elements of the target and transformed template and 2) the similarity in contour tangent direction at each edge element. The deformation parameters are updated via a relaxation technique using the Gibbs Sampler on the energy cost function to find the best match between the deformed template and the target image. To avoid getting stuck in a local minimum, we incorporate a novel coarse-and-fine model for contour matching into the energy functional. The matches that meet the preset cost criteria are reported as the results of the image search, as shown in the experiments. Other applications of this contour matching approach will also be presented.
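A stripped-down, translation-only sketch of the Generalized Hough Transform voting step may help fix ideas; the version in the talk additionally handles scale, orientation, and deformation. The function name, shapes, and point sets are illustrative.

```python
import numpy as np

def ght_translation(template_pts, target_pts, shape):
    """Translation-only Generalized Hough Transform.

    Each target edge point, paired with each template offset from the
    template's reference point, votes for a candidate reference
    location; the accumulator peak is the best placement.
    """
    template_pts = np.asarray(template_pts, float)
    target_pts = np.asarray(target_pts, float)
    ref = template_pts.mean(axis=0)          # template reference point
    offsets = template_pts - ref             # R-table collapsed to offsets
    acc = np.zeros(shape, dtype=int)
    for p in target_pts:
        for r in offsets:
            y, x = np.round(p - r).astype(int)
            if 0 <= y < shape[0] and 0 <= x < shape[1]:
                acc[y, x] += 1
    peak = np.unravel_index(acc.argmax(), acc.shape)
    return (int(peak[0]), int(peak[1])), acc
```

When the target contains an exact translated copy of the template, every template point votes for the same cell, so the accumulator peak equals the number of template points at the shifted reference location.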
A Face-to-Face Communication System using Augmented Reality
The overall aim of the system is to allow single or multiple users to enter a room-sized display and use a broadband telecommunication link to engage in face-to-face interaction with other remote users in a 3D augmented reality environment. The challenging task here is to capture a face without occluding the user's field of view. We propose a system that captures the two side views of the face simultaneously and generates the frontal view. Once the frontal view is generated, it is texture-mapped onto a 3D head model that is displayed at the remote site, and the data is sent through a high-bandwidth channel such as Internet2. The main advantage of the system is its ability to produce stable 3D, stereoscopic, video-based images of all the remote participants, whose faces are captured without obstructing the field of view. This system can be used in the fields of augmented reality and other mobile applications.
Deriving Behavior Repertoires from Human Motion Capture
The process of providing control for rigid-body articulations, such as a character in a video game or a physically driven humanoid robot, can be a difficult and tedious task. The goal of such an effort is to produce motion for the character that is expressive of some desired behavior. In order to produce such motion, however, the rotational degrees-of-freedom (DOFs) defining the posture of the character must be set to acceptable angular values for each moment during the motion. A skilled programmer or animator can manually craft controllers that produce quality motion, but at the expense of vast amounts of time, effort, and training. Motion capture techniques have been used in animation and robotics to quickly produce character control on par with manually crafted control. Unfortunately, control produced by motion capture is still only useful for a single motion. Such motion is difficult to modify for user constraints such as "reach higher".
An alternative is to have a repertoire of basis behaviors (or skills) that can be used as a foundation for representing motion at a more intuitive level. Each behavior is parameterized to produce a wide variety of motions expressing the same underlying "meaning". Behaviors can be sequenced and/or superimposed to produce new and more complex motion control. Furthermore, basis behaviors provide a motion vocabulary that allows for classification and synthesis of relatively arbitrary motion and, consequently, provides a substrate for imitation. Typically, sets of basis behaviors are developed manually. However, human decision-making errors can lead to the design of basis behavior sets having undesirable properties or scalability problems.
In an effort to avoid such problems, I present a methodology for automatically deriving sets of basis behaviors. In my approach, motion capture data are collected of a human performing a large variety of "representative/typical" motions. The underlying structure of these data is estimated using nonlinear spatio-temporal dimension reduction. Parameterized primitive and meta-level behaviors are then derived from this underlying structure.
Indexing and Retrieval of On-line Documents
The existing literature and the ongoing work in the field of indexing and retrieval of on-line documents will be discussed in the talk.
3D Model-Based Self-Evolving Affine Subspace for Face Recognition
A robust and useful face recognition system should be able to recognize a face in the presence of facial variations due to different illumination conditions, head poses and facial expressions. However, these facial variations for a subject are not sufficiently captured in the small number of face images usually acquired for training an appearance-based face recognition system. We present a face recognition system that aligns a 3D generic face model onto a frontal face image in order to synthesize these facial variations for augmenting the training set for face recognition. A number of synthetic face images of a subject are then generated by imposing changes in head pose, illumination, and facial expression on the aligned face model. These synthesized images, evolved from the given image, are used to construct an affine subspace called a self-evolving affine subspace (SEAS) to represent the subject. Training and test subjects, given their frontal views, are all represented in the same way by using the SEAS. Face recognition is achieved by minimizing the distance between the SEAS of a test subject and that of each subject in the training database. Since the SEAS is generated by each image independently with the help of a 3D generic face model, the training cost for a newly added image is O(1). In our experiments, only a single sample image is available for each subject for training. Preliminary experimental results show that the proposed system is promising for improving the performance of appearance-based face recognition.
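A building block of the matching step above is the distance from a sample to an affine subspace. The sketch below shows only that primitive (the talk compares subspace to subspace); the function name and the toy basis are illustrative, and the projection is solved as linear least squares.

```python
import numpy as np

def affine_subspace_distance(x, mean, basis):
    """Euclidean distance from a probe vector x to the affine
    subspace {mean + basis @ a : a in R^k}.

    basis is a (d, k) matrix whose columns span the variation
    directions (e.g., obtained from the synthesized images of one
    subject); the projection is a linear least-squares problem.
    """
    coeffs, *_ = np.linalg.lstsq(basis, np.asarray(x, float) - mean,
                                 rcond=None)
    return float(np.linalg.norm(x - mean - basis @ coeffs))
```

A probe lying inside the subspace has distance zero, while a probe displaced orthogonally to the basis columns is penalized by exactly that displacement, which is the behavior the recognition step relies on.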
Learning User-specific Parameters in a Multibiometric System
Biometric authentication systems that use a single biometric trait have to contend with noisy data, restricted degrees of freedom, failure-to-enroll problems, spoof attacks, and unacceptable error rates. Multibiometric authentication systems, which use multiple traits of an individual for authentication, alleviate some of these problems while improving verification performance. We demonstrate that the performance of multibiometric systems can be further improved by learning user-specific parameters. Two types of parameters are considered here: (i) thresholds that are used to decide whether a matching score indicates a genuine user or an impostor, and (ii) weights that indicate the importance of the matching scores output by each biometric trait. User-specific thresholds are computed using the cumulative histogram of impostor matching scores corresponding to each user. The user-specific weights associated with each biometric are estimated by searching for the set of weights that minimizes the total verification error. The tests were conducted on a database of 50 users who provided fingerprint, face, and hand geometry data, with 10 of these users providing data over a period of two months.
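The two parameter types above can be sketched in a few lines. This is a simplified reading of the approach, not the authors' implementation: the threshold comes from a quantile of each user's impostor-score distribution, and the weights come from a grid search over a two-matcher convex combination; all names and the higher-is-better score convention are assumptions.

```python
import numpy as np

def user_threshold(impostor_scores, target_far=0.01):
    """Per-user threshold from the cumulative distribution of that
    user's impostor scores: accept at most a target_far fraction of
    impostor attempts (higher score = better match assumed)."""
    return float(np.quantile(impostor_scores, 1.0 - target_far))

def best_weights(genuine, impostor, steps=101):
    """Grid-search fusion weights (w, 1 - w) over two matchers,
    minimizing total error (false accepts + false rejects) at the
    best threshold for each candidate w.

    genuine, impostor: (n, 2) arrays of per-matcher scores.
    """
    best_w, best_err = 0.0, np.inf
    for w in np.linspace(0.0, 1.0, steps):
        g = w * genuine[:, 0] + (1 - w) * genuine[:, 1]
        im = w * impostor[:, 0] + (1 - w) * impostor[:, 1]
        for t in np.concatenate([g, im]):   # candidate thresholds
            err = (im >= t).sum() + (g < t).sum()
            if err < best_err:
                best_w, best_err = w, err
    return best_w, int(best_err)
```

When one matcher separates genuine and impostor scores perfectly and the other is pure noise, the search drives the weight toward the informative matcher and the total error to zero.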
Dr. J.K. Aggarwal:
Using Structure, Color and Texture for Content-based Image Retrieval
Images are an essential component of modern data systems. In critical applications ranging from surveillance to medicine, efficient query systems are needed to quickly locate images with particular properties within large collections. Content-based image retrieval systems analyze image features to identify image content. Color and texture are two of the features traditionally used to approach this challenging problem.
At The University of Texas at Austin, we have found structure, derived by perceptual grouping, to be a valuable tool in our quest for better content-based image retrieval. Structure adds significantly to the efficiency of the image retrieval. This presentation focuses on deriving structure via perceptual grouping, and its use in image classification and retrieval. This use of structure does not entail the segmentation of the image. Our analysis shows that structure, color and texture form an excellent feature set for image retrieval. A hands-on comparison of results using color, texture and structure to retrieve images containing both natural and manmade objects will be demonstrated. Our system, which is available on the web, incorporates relevance feedback from the user to further refine the search. Future uses of our system in surveillance and video summarization will also be discussed.
Dr. Aude Oliva:
Recognition of Semantic Properties of Real World Scenes
Interpreting scenes from the real world involves a step of recognizing the semantic category of the scene (e.g., a view of an object, a forest, a field, a street, etc.). Based on this assumption, we propose a computational procedure that determines the probable semantic category to which a scene image belongs (environmental scenes and objects in context), bypassing processing dedicated to object segmentation and intermediary visual processes (edges, surfaces). We first estimate the spatial layout and spatial envelope of a scene image, using a set of perceptual dimensions (mean depth, degree of openness, expansion, roughness, etc.) previously determined by experimental studies. We show that these elementary scene properties may be estimated using spectral and coarsely localized second-order statistical information. We then demonstrate that computing these elementary properties allows the computational model to automatically build a linguistic representation of a scene image (e.g., a perspective view at 100 meters of a large busy urban space) that is meaningful enough to determine the probable semantic category of the image (e.g., a street with buildings). The scene representation can be used to prime the presence/absence of objects in an image and to predict their locations before exploring the image. In this scheme, visual scene information can be available early in the visual processing chain, providing an efficient shortcut for object detection and recognition.
Work done in collaboration with Antonio Torralba, MIT.
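The "spectral and coarsely localized" information mentioned above can be illustrated with a very crude feature: a block-averaged log power spectrum. This is only a sketch of the flavor of such features, not the model from the talk; the function name and block size are mine.

```python
import numpy as np

def coarse_spectral_signature(img, blocks=4):
    """Coarsely binned log power spectrum of an image: a crude
    stand-in for the spectral features used to estimate
    spatial-envelope properties such as openness or expansion."""
    img = np.asarray(img, float)
    # centered 2-D spectrum of the mean-removed image
    f = np.fft.fftshift(np.fft.fft2(img - img.mean()))
    power = np.log1p(np.abs(f) ** 2)
    h, w = power.shape
    # average the spectrum over a coarse blocks x blocks grid
    power = power[: h // blocks * blocks, : w // blocks * blocks]
    return power.reshape(blocks, h // blocks,
                         blocks, w // blocks).mean(axis=(1, 3))
```

An image of vertical stripes, for example, concentrates its spectral energy along the horizontal-frequency axis, so the block row containing the zero vertical frequency dominates the signature; such coarse orientation statistics are the kind of cue that distinguishes, say, streets from open fields.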
The Control Architecture for a Developmental Robot
In a classical control system, the controller drives the system's output to adhere, as closely as possible, to a given set-point, or to track a given trajectory, in the face of process uncertainty and uncontrollable external disturbances. Most control theory and practice focuses on these problems, where it is assumed that the set-point or reference trajectory is generated by some global environment model. This assumption is one reason why this type of control is least related to development, since children simply do not have such models, nor do they understand the mathematics. In this talk, we will present a control architecture that avoids using trajectories. In addition, the architecture differs from a classical controller in the following respects: (1) no action trajectories are known at programming time; (2) no minimization of tracking error is adopted; (3) context is used to model automatic sequential motion; (4) multi-modal, high-dimensional sensory input (e.g., vision, audition, and touch) is integrated into the controller. Also, based on this architecture, a low-level open-loop force controller for Dav's mobile base has been designed and will be presented.