INFLUENCE OF ACOUSTIC LOW-LEVEL DESCRIPTORS IN THE DETECTION OF
CLINICAL DEPRESSION IN ADOLESCENTS
Lu-Shih Alex Low, Namunu C. Maddage, Margaret Lech, Lisa Sheeber†, Nicholas Allen††
School of Electrical and Computer Engineering, RMIT University, Melbourne 3001, Australia
†Oregon Research Institute, 1715 Franklin Boulevard, Eugene, Oregon 97403
††ORYGEN Research Centre and Department of Psychology, University of Melbourne, Melbourne 3010, Australia
lushih.low@student.rmit.edu.au, {namunu.maddage, margaret.lech}@rmit.edu.au, lsheeber@ori.org, nba@unimelb.edu.au
ABSTRACT
In this paper, we report the influence on classification accuracy,
in the analysis of speech from a clinical dataset, of adding acoustic
low-level descriptors (LLDs) belonging to prosodic features (i.e.
pitch, formants, energy, jitter, shimmer) and spectral features (i.e.
spectral flux, centroid, entropy and roll-off), along with their delta
(Δ) and delta-delta (Δ-Δ) coefficients, to two baseline features:
Mel-frequency cepstral coefficients (MFCC) and the Teager energy
critical-band based autocorrelation envelope (TEO-CB-Auto-Env).
Extracted LLDs that displayed an increase in accuracy after being
added to these baseline features were finally modeled together
using Gaussian mixture models and tested. A clinical data set of
speech from 139 adolescents, including 68 (49 girls and 19 boys)
diagnosed as clinically depressed, was used in the classification
experiments. For male subjects, the combination (TEO-CB-Auto-
Env + Δ + Δ-Δ) + F0 + (LogE + Δ + Δ-Δ) + (Shimmer + Δ) +
Spectral Flux + Spectral Roll-off gave the highest classification
rate of 77.82%, while for female subjects, TEO-CB-Auto-Env
alone gave an accuracy of 74.74%.
Index Terms— Clinical depression, prosodic feature, spectral
feature, acoustic features, Gaussian Mixture Model
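The per-class Gaussian mixture modeling and testing described above can be sketched as follows. This is a minimal illustration only, with synthetic stand-in features, diagonal covariances, two mixture components, and a plain EM fit; none of these choices is taken from the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_gmm(X, k=2, iters=50):
    """Fit a diagonal-covariance GMM to X (frames x dims) with plain EM."""
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]          # init means from data
    var = np.ones((k, d)) * X.var(axis=0)            # init per-dim variances
    w = np.full(k, 1.0 / k)                          # init mixture weights
    for _ in range(iters):
        # E-step: responsibilities from per-component log densities
        logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                        + np.log(2 * np.pi * var)).sum(-1)
                + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return w, mu, var

def loglik(X, model):
    """Per-frame log-likelihood under a fitted GMM (log-sum-exp over components)."""
    w, mu, var = model
    logp = (-0.5 * (((X[:, None] - mu) ** 2) / var
                    + np.log(2 * np.pi * var)).sum(-1)
            + np.log(w))
    m = logp.max(axis=1, keepdims=True)
    return m.squeeze(1) + np.log(np.exp(logp - m).sum(axis=1))

# One GMM per class; a frame sequence is assigned to the class whose
# model gives the larger total log-likelihood.
dep = rng.normal(0.0, 1.0, (500, 4))   # stand-in "depressed" features
ctl = rng.normal(3.0, 1.0, (500, 4))   # stand-in "control" features
m_dep, m_ctl = fit_gmm(dep), fit_gmm(ctl)
test = rng.normal(3.0, 1.0, (50, 4))
pred = "control" if loglik(test, m_ctl).sum() > loglik(test, m_dep).sum() else "depressed"
```

In practice the frame-level feature vectors (baseline features plus selected LLDs) play the role of the synthetic arrays here.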
1. INTRODUCTION
Understanding the causes of clinical depression has long been a
complex and challenging task, particularly in the field of
psychology, due to the many potential psychological variables.
With advances in high-speed computing, psychologists have in
recent years sought to collaborate with other disciplines to better
understand the psychological factors relating to the development
of clinical depression. The common aim is to share expertise
across fields and thereby assist psychologists in making additional
contributions towards the prevention and treatment of clinical
depression. The different research fields include: 1) cognitive
neuroscience which introduces neuro-imaging techniques such as
functional magnetic resonance imaging (fMRI) to monitor the
cognitive patterns relating to the activity of the brain, 2) facial
recognition which analyses the facial expressions displayed from
various emotions and 3) speech and language processing which
objectively analyses the vocal patterns of human speech. This
paper focuses solely on the latter; it has been well documented
that the speech of depressed individuals is slow, uniform,
monotonous and expressionless, with the person fearful of
expressing himself or herself [11], [12]. Consequently, our study
takes an objective look at a subject’s speaking behavior and vocal
characteristics to identify any differences in acoustic speech
measures between depressed and control subjects. According to
[6], there is strong evidence that most suicides are linked to
depressive disorders and
symptomatology. Depressive disorders are associated with a range
of psychosocial impairments and comorbid symptomatology which
includes varying degrees of psychomotor retardation (slowness) or
agitation. Although statistics show that the number of suicides in
Australia has decreased in recent years following the peaks of
1997-1998, suicide remains a leading cause of death (ranked 15th
in 2006), exceeding the number of deaths from transport accidents
and making it a prominent public health concern [1]. Studies show
that almost half of all suicides occur in the 25-49 years age group.
By focusing attention on intervention in depression at a young age,
it may therefore be easier to treat and arrest the problem before it
is too late; the broader conclusion is that late-life depression is a
chronic or recurring disorder which, when it goes unrecognized,
may have devastating effects. There have been several studies
published on various
methods in objectively analyzing vocal parameters as possible cues
to clinical depression. The most commonly used speech processing
techniques in the recognition of emotions and clinical depression
in the literature are related to prosody (i.e. pitch, jitter, energy,
pause time and speaking rate), as well as spectral features (i.e.
formants) and cepstral features (i.e. Mel-frequency cepstral
coefficients). Prosodic information, which has the closest relation
to the expressiveness of speech, has been widely studied in this
field. A number of studies have consistently shown that increases
in speech rate and loudness, as well as decreases in pause duration
during clinical interviews, are key discriminators of mood
improvement over the course of therapy [3], [7]. Fundamental
frequency (F0), the most widely studied parameter, has also shown
a strong correlation with depression.
Unfortunately, the generalizability of some of these findings still
remains unclear as reported results vary from one investigator to
another.
The purpose of this study was to examine the influence
(increase or decrease) on classification accuracy of adding
acoustic low-level descriptors representing prosodic and spectral
features to our set of baseline features: 1) Mel-frequency cepstral
coefficients (MFCC) and 2) the Teager energy operator critical-
band based autocorrelation envelope (TEO-CB-Auto-Env). These
two methods were chosen as baselines because MFCCs have been
widely used in speech content analysis and are known to be robust
acoustic features, while the TEO-CB-Auto-Env method has
performed reliably well in emotional stress classification [10].
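The Δ (velocity) and Δ-Δ (acceleration) coefficients appended to the baseline features can be obtained with the standard regression formula over neighboring frames. The sketch below is an illustration only, assuming a regression window of N = 2 frames and random stand-in values for a 13-dimensional MFCC stream.

```python
import numpy as np

def delta(features, N=2):
    """First-order regression (delta) coefficients of a (frames x dims)
    feature matrix: d_t = sum_{n=1..N} n*(c_{t+n} - c_{t-n}) / (2*sum_n n^2),
    with edge frames replicated for padding."""
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = np.zeros(features.shape, dtype=float)
    for t in range(features.shape[0]):
        for n in range(1, N + 1):
            out[t] += n * (padded[t + N + n] - padded[t + N - n])
    return out / denom

# Stand-in 13-dimensional MFCC stream (100 frames); appending delta and
# delta-delta coefficients yields 39-dimensional observation vectors.
mfcc = np.random.default_rng(0).standard_normal((100, 13))
feat = np.hstack([mfcc, delta(mfcc), delta(delta(mfcc))])
print(feat.shape)  # (100, 39)
```

Applying `delta` once gives the Δ stream and applying it to that result gives the Δ-Δ stream; the same operator can be applied to any of the scalar LLD trajectories (e.g. shimmer) by treating them as single-column matrices.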
Finally, we use a different combination of baseline features and
978-1-4244-4296-6/10/$25.00 ©2010 IEEE ICASSP 2010