Social Attractiveness in Dialogs

Antje Schweitzer, Natalie Lewandowski, Daniel Duran
Institute for Natural Language Processing, University of Stuttgart, Germany
firstname.lastname@ims.uni-stuttgart.de

Abstract

This study investigates how acoustic and lexical properties of spontaneous speech in dialogs affect perceived social attractiveness in terms of speaker likeability, friendliness, competence, and self-confidence. We analyze a database of longer spontaneous dialogs between German female speakers and the mutual ratings that dialog partners assigned to one another after every conversation. Thus the ratings reflect long-term impressions based on dialog behavior. Using linear mixed models, we investigate both classical acoustic-prosodic and lexical parameters as well as parameters that capture the degree of speakers’ adaptation, or “convergence”, of these parameters to each other. Specifically we find that likeability is correlated with the speaker’s lexical convergence as well as with her convergence in f0 peak height. Friendliness is significantly related to variation in intensity. For competence, the proportion of positive words in the dialog, variation in shimmer, and overall phonetic convergence are significant correlates. Self-confidence finally is related to several prosodic, phonetic, and lexical adaptation parameters. In some cases, the effect depends on whether interlocutors also had eye contact during their conversation. Taken together, these findings provide evidence that in addition to classical parameters, convergence parameters play an important role in the mutual perception of social attractiveness.

Index Terms: social attractiveness, convergence, spontaneous speech

1. Introduction

A considerable body of literature has investigated acoustic correlates of extralinguistic factors such as emotion and personality (see e.g. [1, 2] for overviews).
In recent years, voice attractiveness and pleasantness have also gained interest, and they have been hypothesized to depend on similar parameters [3]. In this field the focus is often on cross-gender perception of voice attractiveness, and the notion of attractiveness in these cases is then to some extent biased toward sexual attractiveness (see for instance [4, 5, 6, 7, 8]). Other research on voice attractiveness, or pleasantness, is in the context of dialog systems aiming to provide pleasant synthetic voices [9, 10]. In any case, research in the domain of vocal attractiveness or pleasantness typically makes use of short, often read, sometimes synthetic, stimuli, which are rated by independent listeners outside communication situations (e.g. [5, 11, 12, 13, 10, 6, 14, 8, 15]); [3] refer to this as a “passive rating scenario”. One of the very few exceptions is [7], who investigate mutual ratings of participants of a speed dating game.

Among the parameters that have been suggested to be related to voice attractiveness and pleasantness, many are related to f0. For instance, [4] finds that closely spaced low-frequency harmonics correlate with women’s perception of male attractiveness; [5] show that feminine women prefer lower-pitched male voices; in the speed-dating study [7], women perceived as friendly exhibit higher maximum pitch and greater pitch variance; [8] find that men prefer higher pitch in women, while women prefer lower pitch in men; and [11], looking at male voices, also find that low pitch is preferred, and additionally report an effect of f0 variance in interaction with f0 mean, where low variance is preferred for voices with medium mean f0, while mid and high variance is preferred for either low or high mean f0 voices.

The finding that f0 or pitch plays an important role is challenged by [15], who find that low F1 is rated as more attractive for male voices, while there is no significant effect of f0.
They claim that voices that are thought to reflect greater body height in men are preferred, and since low F1 as an indicator of vocal tract size is a more reliable estimator of body height than f0, this explains the null result regarding f0. In addition, [15] find breathier female voices, and male voices with shorter durations, to be more attractive. They also report a possibly sociophonetic effect, namely the preference of a low F2 in /u/ for female voices, which they interpret as an indication that the /u/ fronting that is typical for female Californian speakers is evaluated positively by raters.

Other studies corroborate the relevance of vocal tract length as derived from formants; for instance, [5] find a preference for more dispersed formants in female voices, and [13] a preference for less dispersed formants in male voices when evaluated by females; however, the latter also find that dispersion is not predictive of dominance ratings of male voices by men. The preference for breathier female voices found by [15] had previously been found by [8], interestingly along with a preference for wide formant dispersion in female voices, contradicting [13] and in contrast to [15], who obtained this finding for men but not for women. A few studies report results for intensity, e.g. more varied intensity in men who are perceived as friendly [7]. That particular study also finds effects of lexical parameters and laughter; for instance, friendly men and women are reported to use more turn-medial or turn-final laughter. Another study [16] reports an effect of speech rate (higher likeability scores for decreased speech rate), but only for women. That study also finds that skewness correlates with likeability for women, while the “speaker’s formant” (increased intensity in the long-term average spectrum between 3 and 4 kHz) is found in the voices of likeable men.
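
As a brief illustration of the formant dispersion measure referred to above: one common definition in the phonetics literature is the mean spacing between adjacent formant frequencies, with smaller values usually read as a cue to a longer vocal tract. The following Python sketch assumes that definition; it is not necessarily the exact measure used in each of the cited studies, and the function name and the formant values are hypothetical.

```python
from typing import Sequence


def formant_dispersion(formants_hz: Sequence[float]) -> float:
    """Mean spacing (in Hz) between adjacent formant frequencies.

    Assumed definition: the average of F(i+1) - F(i) over all adjacent
    formant pairs, which equals (F_max - F_min) / (n - 1).
    """
    if len(formants_hz) < 2:
        raise ValueError("need at least two formant frequencies")
    ordered = sorted(formants_hz)
    gaps = [upper - lower for lower, upper in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical F1..F4 values (Hz) for two illustrative voices.
narrow = formant_dispersion([500.0, 1500.0, 2500.0, 3500.0])
wide = formant_dispersion([600.0, 1800.0, 3000.0, 4200.0])
```

Under this definition, a statement such as "a preference for more dispersed formants" corresponds to a preference for voices with a larger value of this measure (here, `wide` rather than `narrow`).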

Another line of research is interested in the question of whether it is possible, and with what accuracy, to tell if voices will be perceived as attractive or pleasant, given a very large set of (usually acoustic-phonetic) parameters [10, 9, 3]. These studies typically investigate many more parameters; for instance, [10] use more than 4000 features derived from 60 low-level cepstral, auditory spectral, energy-related, voice-related, or f0-related parameters extracted with openSMILE [17]; [9] use 310 features comprising energy-related features as well as f0, formants, cepstral features, voice quality, and articulation speed. However, while these studies to some extent also investigate which features are central for the prediction, they usually do not

Copyright 2017 ISCA. INTERSPEECH 2017, August 20–24, 2017, Stockholm, Sweden. http://dx.doi.org/10.21437/Interspeech.2017-833