Multiple Instance Learning for Behavioral Coding (IEEE 2017).


We propose a computational methodology for automatically estimating human behavioral patterns using the multiple instance learning (MIL) paradigm. We describe the incremental diverse density algorithm, a particular formulation of multiple instance learning, and discuss its suitability for behavioral coding. We use a rich multimodal corpus comprising chronically distressed married couples having problem-solving discussions as a case study to experimentally evaluate our approach. In the multiple instance learning framework, we treat each discussion as a collection of short-term behavioral expressions which are manifested in the acoustic, lexical, and visual channels. We experimentally demonstrate that this approach successfully learns representations that carry relevant information about the behavioral coding task. Furthermore, we employ this methodology to gain novel insights into human behavioral data, such as the local versus global nature of behavioral constructs as well as the level of ambiguity in the expression of behaviors through each respective modality. Finally, we assess the success of each modality for behavioral classification and compare schemes for multimodal fusion within the proposed framework.


Human behavior is inherently multimodal and complex, characterized by heterogeneity and variability in its patterning. This presents unique challenges and opportunities for signal processing and machine learning researchers to contribute to the behavioral sciences. These contributions can most readily be evaluated by relating them to measures that are already established and relevant to a given application domain, e.g., study of distressed relationships. A common method for evaluating human behavior is manual behavioral coding, which seeks to create standardized measures for characterizing observed behaviors along dimensions of interest, e.g., affect, engagement, withdrawal [1]. These measures are often applied in a holistic, summative fashion. That is, expert annotators will observe subjects in situations that elicit expressions of particular behaviors of interest and then provide their judgements on the degree that these behavioral constructs are exhibited in the overall session of observation. While this method can provide valuable insights into how these behaviors relate to outcomes, e.g., relationship success, it does not provide insight into which particular expressions contribute the most to the assigned behavioral codes.

Behavioral Signal Processing:

This work is part of the emerging field of behavioral signal processing (BSP) [9]. BSP is the development and application of signal processing tools for aiding behavioral sciences research and translation, notably in the mental and behavioral health domains. Engineering approaches have the potential to offer fine-grained, data-centric insights that are otherwise inaccessible to clinicians and researchers working in the behavioral sciences. A common approach in BSP is to develop signal-derived representations and use these representations with appropriate pattern recognition methods to correlate with or predict desired behavioral codes [10]. Affective computing is an exemplary domain in behavior analysis facilitated by the use of signal processing and machine learning [11]. There have been numerous studies focusing on automatically deriving multimodal representations and performing classification experiments with emotional data [12], [13]. More recently, it has been of interest to identify prototypical expressions of emotions so as to deal with inherent ambiguity that arises from the large variability of expressions between (and within) subjects [14], [15]. Advances in computational research on human emotions provide much inspiration for computational approaches to understanding higher-level human behaviors. In this work, we focus on evaluating the proposed behavioral signal processing approaches in the couples therapy domain. Couples therapy interactions represent a rich domain in which many high-level human behaviors are elicited from the subjects and are used to help guide the course and evaluate the effectiveness of therapy. This domain has been the subject of several recent BSP studies, including studies seeking to develop representations of the acoustic, lexical, and visual signals as they relate to target behavioral codes [10], [16]–[18], as well as studies that focus on the interaction between subjects within the sessions [19], [20].


Multiple instance learning is a machine learning framework in which labeled bags contain many instances. Each instance is represented by a feature vector and thus bags are collections of feature vectors that share a single label. The task then is to determine the label to assign to the bag without having specific information as to how the instances of that bag correspond to its assigned label. MIL was introduced by Dietterich et al. for drug activity prediction [21]. The paradigm has since been applied to several machine learning tasks including: natural scene classification [22]; image categorization [23], [24]; and text classification [25]. While the MIL framework has most often been applied to object recognition tasks for images, more recently it has been applied to human generated signals, such as speech, gestural, and linguistic data. In these data the objects being recognized are prototypical displays of a labeled action or behavior of interest. Ali and Shah applied the MIL framework to human action recognition [26]. Their target labels were clearly defined physical actions such as bending or hand waving. The MIL framework has also been applied to more abstract human expressions such as affect and behavior. Schuller and Rigoll proposed using a bag of speech frames framework for recognizing speakers’ level of interest [27]. We proposed the application of MIL to couples’ problem-solving discussion data using acoustic features [28], lexical features [29], and audio-visual fusion [30]. This paper integrates components of these works and extends the framework through recent developments and analysis.
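The diverse density formulation underlying this framework can be sketched compactly. The noisy-OR score of Maron and Lozano-Pérez is high for a candidate concept point that lies close to at least one instance in every positive bag while staying far from every instance in the negative bags. The function name, the Gaussian-style similarity, and the fixed scale parameter below are illustrative choices, not the exact configuration used in this work:

```python
import math

def diverse_density(t, pos_bags, neg_bags, scale=1.0):
    """Noisy-OR diverse density of a candidate concept point t.

    Bags are lists of instances; each instance is a list of feature values.
    The score is high when t is near at least one instance in every positive
    bag and far from all instances in every negative bag.
    """
    def match(inst):
        # Gaussian-like probability that an instance matches the concept t
        d2 = sum((a - b) ** 2 for a, b in zip(inst, t))
        return math.exp(-scale * d2)

    score = 1.0
    for bag in pos_bags:           # noisy-OR: at least one instance matches
        miss = 1.0
        for inst in bag:
            miss *= 1.0 - match(inst)
        score *= 1.0 - miss
    for bag in neg_bags:           # no negative instance may match
        for inst in bag:
            score *= 1.0 - match(inst)
    return score
```

In practice the score is maximized over the feature space; a standard heuristic is to evaluate it at (or start gradient ascent from) every instance of the positive bags and keep the maximizer.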


Researchers have considered many sources of information for modeling human behavior. For this work we focus on three major modalities: lexical, audio, and visual. These features will help capture what is said, how it is said, and the subjects’ associated movements. Subsequently, we will discuss methods of fusing information from these sources in order to benefit from the complementary insights they provide. Each modality also presents unique modeling challenges, which must be considered for fusion.
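As a minimal illustration of score-level (late) fusion, per-modality posterior scores can be combined as a reliability-weighted average. The modality names, the weights, and the specific combination rule below are hypothetical placeholders rather than the fusion schemes evaluated later in this work:

```python
def late_fusion(modality_scores, weights):
    """Score-level fusion: weighted average of per-modality posteriors.

    modality_scores: dict mapping modality name -> probability that a
    session is rated 'high' on a behavioral code.
    weights: dict of nonnegative reliabilities, e.g., derived from each
    modality's validation accuracy.
    """
    total = sum(weights[m] for m in modality_scores)
    return sum(weights[m] * s for m, s in modality_scores.items()) / total
```

With equal weights the rule reduces to a plain average; unequal weights let a more reliable channel (e.g., lexical) dominate a noisier one.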


In order to evaluate the efficacy of our proposed methodology, we conduct three experiments. In the first, we compare the predictive accuracy of our proposed methodology with selecting instances for prediction at random. This is essentially what occurs in applications of "thin slices" in the behavioral sciences literature. In the second experiment, we use the incremental diverse density algorithm to estimate multiple concepts for each of the behavioral codes. In the last experiment, we perform multimodal fusion to determine if combining information channels leads to higher predictive accuracy.
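The multiple-concept idea behind the second experiment can be sketched greedily: maximize diverse density over candidate instances, remove the positive bags that the learned concept already explains, and repeat on the remainder. The candidate search over bag instances, the match threshold, and the helper names below are assumptions made for illustration, not the paper's exact optimization procedure:

```python
import math

def dd(t, pos_bags, neg_bags, scale=1.0):
    """Noisy-OR diverse density of a candidate concept point t."""
    def match(inst):
        return math.exp(-scale * sum((a - b) ** 2 for a, b in zip(inst, t)))
    score = 1.0
    for bag in pos_bags:
        miss = 1.0
        for inst in bag:
            miss *= 1.0 - match(inst)
        score *= 1.0 - miss
    for bag in neg_bags:
        for inst in bag:
            score *= 1.0 - match(inst)
    return score

def incremental_dd(pos_bags, neg_bags, n_concepts=2, explain_thresh=0.5):
    """Greedy multi-concept sketch: pick the highest-DD candidate among the
    remaining positive-bag instances, discard the positive bags that the
    concept already explains, then repeat on the remainder."""
    concepts, remaining = [], list(pos_bags)
    for _ in range(n_concepts):
        if not remaining:
            break
        cands = [inst for bag in remaining for inst in bag]
        best = max(cands, key=lambda t: dd(t, remaining, neg_bags))
        concepts.append(best)

        def explained(bag, c=best):
            # a bag is explained if some instance closely matches concept c
            return any(
                math.exp(-sum((a - b) ** 2 for a, b in zip(inst, c))) > explain_thresh
                for inst in bag
            )
        remaining = [bag for bag in remaining if not explained(bag)]
    return concepts
```

If the positive bags cluster around two distinct prototypical expressions, this loop recovers one concept per cluster; if a single concept explains all positive bags, the loop stops early, which is one signal of low ambiguity in that channel.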


This framework allows for experimentation which reveals the local/global nature of behaviors as expressed through human communicative channels (e.g., speech). With respect to speech, we found that negative affect behaviors (blame and negative affect) are more locally displayed than behaviors conveying positive affect (acceptance of other and positive affect). Furthermore, we used the incremental diverse density algorithm to learn multiple behavioral concepts. This methodology allows for estimating the level of ambiguity presented via a particular channel. For example, we found that with respect to speech, learning additional concepts for the negative behaviors did not provide additional information about the behavior, meaning there is little ambiguity in the expression of these behaviors. However, the opposite is true for expression of positive behaviors in vocal expressions, meaning there is much more ambiguity of expression in this case. Additionally, we found that this relation is reversed in the lexical channel: i.e., there is more ambiguity in the expression of negative behaviors and less in positive behaviors, which is likely due to the close relation of these two modes of expression. We also used this methodology with multimodal fusion and found that in certain cases combining information from multiple channels provided increased classification accuracy over using that of only a single modality.