SPEECH SUMMARIZATION OF LONG SPOKEN DOCUMENT:
IMPROVING MEMORY EFFICIENCY OF SPEECH/TEXT ENCODERS
Takatomo Kano¹, Atsunori Ogawa¹, Marc Delcroix¹, Roshan Sharma²,
Kohei Matsuura¹, and Shinji Watanabe²
¹NTT Corporation, Japan, ²Carnegie Mellon University, Pittsburgh, USA
ABSTRACT
Speech summarization requires processing speech sequences that are
several minutes long to exploit the whole context of a spoken document.
A conventional approach is a cascade of automatic speech recognition
(ASR) and text summarization (TS). However, cascade systems are
sensitive to ASR errors; moreover, they cannot be optimized for the
input speech or exploit para-linguistic information. Recently, there
has been an increased
interest in end-to-end (E2E) approaches optimized to output sum-
maries directly from speech. Such systems can thus mitigate the
ASR errors of cascade approaches. However, E2E speech summa-
rization requires massive computational resources because it needs
to encode long speech sequences. We propose a speech summa-
rization system that enables E2E summarization from 100 seconds,
which is the limit of the conventional method, to up to 10 minutes
(i.e., the duration of typical instructional videos on YouTube). How-
ever, the modeling capability of this model for minute-long speech
sequences is weaker than the conventional approach. We thus ex-
ploit auxiliary text information from ASR transcriptions to improve
the modeling capabilities. The resultant system consists of a dual
speech/text encoder-decoder-based summarization system. Experiments
on the How2 dataset show that the proposed system improves METEOR
scores by up to 2.7 points by fully exploiting the long spoken
documents.
Index Terms— end-to-end modeling, long spoken document,
memory efficient encoders, dual speech/text encoder
1. INTRODUCTION
Speech summarization is a technology that generates an abstractive
summary from a lengthy spoken document, such as an instructional
video. Unlike other speech processing tasks, such as automatic
speech recognition (ASR) or speech translation, generating an ac-
curate summary requires access to the entire content. Since speech
signals are very long sequences, the speech summarization task is
particularly challenging as it requires handling hundreds to tens of
thousands of input frames to process spoken documents that are several
minutes long.
A conventional approach to tackle this problem consists of first
performing ASR and then text summarization (TS) [1–3]. ASR can
transcribe an entire spoken document by performing utterance-wise
recognition. TS can operate on the entire transcribed spoken document
because the text sequence of an utterance is much shorter than the
corresponding speech. Such a cascade approach provides a modular system
where we can optimize the components for each sub-task on ded-
icated datasets. It can exploit the entire context of the document
to generate summaries [4, 5]. However, the cascade system cannot
optimize the summary for the input speech or utilize prosodic
information. Moreover, ASR systems inevitably introduce errors that
Table 1. Summarization performance (ROUGE-L score [13]) on the
How2 dataset [14] of a Transformer TS with truncated input text.
Input length   10%    30%    60%    90%    100%
ROUGE-L        10.0   18.9   45.3   50.8   51.2
can affect the quality of the summaries [1, 2, 6].
Recently, an end-to-end (E2E) speech summarization system [7] that
directly generates a summary from speech has been proposed as an
alternative to cascade systems. An E2E system consists of a speech
encoder module that extracts embeddings from the speech signal and a
decoder module that generates a summary from the speech embeddings
in an autoregressive manner. The system is based on
the Transformer architecture [8], which has become the standard
for many speech and language processing tasks. E2E systems offer
the possibility of optimizing the whole system for the summarization
task, eliminating the intermediate ASR step and thus avoiding the
impact of ASR errors. Moreover, an E2E system can use speech features
such as pitch and power to determine the important points of the
spoken document. However, E2E
speech summarization needs to encode very lengthy speech signals
because the system must process all utterances simultaneously, as in
the TS model. This long speech encoding requires massive computa-
tional resources and memory, especially during training. The Trans-
former performs better than alternatives that use, e.g., long short-
term memory networks [9] and can process the entire sequence at
once using self-attention; however, self-attention memory usage and
computation increase quadratically with the sequence length [8, 10–12].
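To illustrate this quadratic growth, the following back-of-the-envelope sketch estimates the memory needed just to store one layer's self-attention score matrices. The frame rate, head count, and element size are hypothetical values for illustration, not figures from this paper:

```python
def attention_score_memory(seq_len: int, num_heads: int = 8, bytes_per_elem: int = 4) -> int:
    """Bytes needed to store the (seq_len x seq_len) attention score
    matrix of every head in one self-attention layer: memory grows
    quadratically with seq_len."""
    return num_heads * seq_len * seq_len * bytes_per_elem

# Hypothetical feature rate of 100 acoustic frames per second.
FRAMES_PER_SECOND = 100

for seconds in (100, 600):  # 100 s (prior limit) vs. 10 minutes
    frames = seconds * FRAMES_PER_SECOND
    gb = attention_score_memory(frames) / 1e9
    print(f"{seconds:4d} s -> {frames:6d} frames -> {gb:7.1f} GB per layer")
```

Under these assumptions, a 6x longer input costs 36x the score-matrix memory (roughly 3.2 GB versus 115.2 GB per layer), which is why restricted-attention variants are attractive for long inputs.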
In a previous work [7], E2E speech summarization was achieved using
the Longformer [12] architecture for the speech encoder, which
limits the range of the self-attention to reduce the amount of
computation. However, it also prevents the self-attention module from
considering a long context, which may be an issue for summarization.
Consequently, the approach has only been applied to data truncated to
at most 100 seconds to avoid out-of-memory (OOM) issues during
training [7]. Restricting the input of the speech summarization
can have a significant impact on performance. For example, Table 1
shows the Recall-Oriented Understudy for Gisting Evaluation with
Longest Common Subsequence (ROUGE-L) score of TS when truncating the
input text. We observe a clear performance drop when the TS system
sees only part of the input document (e.g., 60% or less). This
preliminary
experiment reveals the importance of developing speech summariza-
tion systems exploiting long spoken documents.
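For reference, ROUGE-L scores a hypothesis summary against a reference via their longest common subsequence (LCS). The following is a minimal sentence-level sketch of the F-measure variant, assuming simple whitespace tokenization; the actual evaluation would use the standard ROUGE toolkit [13]:

```python
def lcs_length(ref: list, hyp: list) -> int:
    """Length of the longest common subsequence of two token lists,
    computed with the standard dynamic-programming table."""
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == h else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(ref)][len(hyp)]

def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """Sentence-level ROUGE-L F1 from LCS-based recall and precision."""
    ref, hyp = reference.split(), hypothesis.split()
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(hyp)
    return 2 * precision * recall / (precision + recall)
```

Because the LCS is computed over the whole reference, a hypothesis generated from a truncated input can only match the portion of the reference its input covered, which is consistent with the score drop in Table 1.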
In this paper, we extend the E2E speech summarization system of [7]
to handle longer spoken documents. First, we investigate memory- and
computationally-efficient
ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10095019