SPEECH SUMMARIZATION OF LONG SPOKEN DOCUMENT: IMPROVING MEMORY EFFICIENCY OF SPEECH/TEXT ENCODERS

Takatomo Kano 1, Atsunori Ogawa 1, Marc Delcroix 1, Roshan Sharma 2, Kohei Matsuura 1, and Shinji Watanabe 2

1 NTT Corporation, Japan, 2 Carnegie Mellon University, Pittsburgh, USA

ABSTRACT

Speech summarization requires processing several-minute-long speech sequences to exploit the whole context of a spoken document. A conventional approach is a cascade of automatic speech recognition (ASR) and text summarization (TS). However, cascade systems are sensitive to ASR errors. Moreover, a cascade system cannot be optimized for the input speech or utilize para-linguistic information. Recently, there has been increased interest in end-to-end (E2E) approaches optimized to output summaries directly from speech. Such systems can thus mitigate the ASR errors of cascade approaches. However, E2E speech summarization requires massive computational resources because it needs to encode long speech sequences. We propose a speech summarization system that extends the speech duration E2E summarization can handle from 100 seconds, which is the limit of the conventional method, to up to 10 minutes (i.e., the duration of typical instructional videos on YouTube). However, the modeling capability of this model for minute-long speech sequences is weaker than that of the conventional approach. We thus exploit auxiliary text information from ASR transcriptions to improve the modeling capability. The resulting system is a summarization system based on a dual speech/text encoder and a decoder. We perform experiments on the How2 dataset, showing that the proposed system improves METEOR scores by up to 2.7 points by fully exploiting the long spoken documents.

Index Terms— end-to-end modeling, long spoken document, memory-efficient encoders, dual speech/text encoder

1. INTRODUCTION

Speech summarization is a technology that generates an abstractive summary from a lengthy spoken document, such as an instructional video. Unlike other speech processing tasks, such as automatic speech recognition (ASR) or speech translation, generating an accurate summary requires access to the entire content. Since speech signals are very long sequences, the speech summarization task is particularly challenging as it requires handling hundreds to tens of thousands of input frames to process several-minute-long spoken documents.

A conventional approach to tackle this problem consists of first performing ASR and then text summarization (TS) [1-3]. ASR can transcribe an entire spoken document by performing utterance-wise recognition. TS can operate on the entire transcribed spoken document because the text sequence of an utterance is much shorter than the speech. Such a cascade approach provides a modular system in which we can optimize the components for each sub-task on dedicated datasets. It can exploit the entire context of the document to generate summaries [4, 5]. However, the cascade system cannot optimize the summary for the input speech or utilize prosodic information. Moreover, ASR systems inevitably introduce errors that can affect the quality of the summaries [1, 2, 6].

Table 1. Summarization performance (ROUGE-L score [13]) on the How2 dataset [14] of a Transformer TS with truncated input text.

    Truncation    10%    30%    60%    90%    100%
    ROUGE-L      10.0   18.9   45.3   50.8   51.2

Recently, an end-to-end (E2E) speech summarization system [7] that directly generates a summary from speech has been proposed as an alternative to cascade systems. An E2E system consists of a speech encoder module that extracts embeddings from the speech signal and a decoder module that generates a summary from the speech embeddings in an autoregressive manner. The system is based on the Transformer architecture [8], which has become the standard for many speech and language processing tasks.
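The truncation experiment behind Table 1 can be made concrete with a minimal sketch. ROUGE-L scores a candidate summary against a reference via their longest common subsequence (LCS). The function names, tokenization, and beta value below are illustrative assumptions, not the evaluation code used for the paper's experiments.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference, beta=1.2):
    """ROUGE-L F-measure from LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return (1 + beta ** 2) * prec * rec / (rec + beta ** 2 * prec)

def truncate(text, fraction):
    """Keep only the first `fraction` of the document's tokens,
    mimicking the input truncation of Table 1."""
    toks = text.split()
    return " ".join(toks[:max(1, int(len(toks) * fraction))])
```

A Table 1-style experiment would then summarize `truncate(doc, 0.1)`, `truncate(doc, 0.3)`, etc., and compare `rouge_l_f1` of each output against the reference summary.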
E2E systems offer the possibility of optimizing the whole system for the summarization task, eliminating the intermediate ASR step and therefore avoiding the impact of ASR errors. Moreover, an E2E system can use speech features such as pitch and power to determine the important points of the spoken document. However, E2E speech summarization needs to encode very long speech signals because the system must process all utterances simultaneously, as in the TS model. This long speech encoding requires massive computational resources and memory, especially during training. The Transformer performs better than alternatives that use, e.g., long short-term memory networks [9], and can process the entire sequence at once using self-attention; however, the memory usage and computation of self-attention increase quadratically with the length of the sequence [8, 10-12].

In a previous work [7], E2E speech summarization was achieved using the Longformer [12] architecture for the speech encoder, which limits the range of the self-attention to reduce the amount of computation. However, it also impedes considering a long context in the self-attention module, which may be an issue for summarization. Consequently, the approach has only been applied to data truncated to 100 seconds to avoid out-of-memory (OOM) issues during training [7]. Restricting the input of the speech summarization system can have a significant impact on performance. For example, Table 1 shows the Recall-Oriented Understudy for Gisting Evaluation - Longest common subsequence (ROUGE-L) score of TS when truncating the input text. We observe a clear performance drop when the TS system sees only part of the input document (e.g., 60% or less). This preliminary experiment reveals the importance of developing speech summarization systems that exploit long spoken documents.
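The quadratic cost of full self-attention, and the saving from restricting its range as in the Longformer-style encoder discussed above, can be illustrated by counting score-matrix entries. The window size and frame-rate figures below are illustrative assumptions, not the paper's actual configuration.

```python
def full_attention_entries(n):
    """Full self-attention: every query attends to every key (n x n scores)."""
    return n * n

def windowed_attention_entries(n, w):
    """Each query attends only to keys within +/- w positions, a
    sliding-window pattern similar in spirit to Longformer's local attention."""
    return sum(min(n, i + w + 1) - max(0, i - w) for i in range(n))
```

For example, assuming a typical 10 ms frame shift (100 frames per second), a 10-minute recording yields about 60,000 frames before any subsampling: full attention would require 3.6e9 score entries, while a window of w = 128 needs only about 1.5e7, i.e., the cost grows linearly rather than quadratically with sequence length.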
In this paper, we propose extending the E2E speech summarization system proposed in [7] to handle longer spoken documents. First, we investigate memory- and computationally-efficient

ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | 978-1-7281-6327-7/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICASSP49357.2023.10095019