A Cross-modal Approach for Karaoke Artifacts Correction

Wei-Qi Yan¹ and Mohan S Kankanhalli²

¹ University of California, Irvine, CA 92697-3425
² National University of Singapore, Singapore 117543

Abstract. Karaoke singing is a popular form of entertainment in several parts of the world. Since this genre of performance attracts amateurs, the singing often has artifacts related to scale, tempo, and synchrony. We have developed an approach to correct these artifacts using cross-modal multimedia streams information. We first perform adaptive sampling on the user's rendition and then use the original singer's rendition as well as the video caption highlighting information in order to correct the pitch, tempo and the loudness. A method of analogies has been employed to perform this correction. The basic idea is to manipulate the user's rendition in a manner to make it as similar as possible to the original singing. A pre-processing step of noise removal due to feedback and huffing also helps improve the quality of the user's audio. The results are described in the paper which shows the effectiveness of this multimedia approach.

Key Words: Adaptive Sampling, Artifacts Handling, Karaoke.

1 Introduction

A multimedia environment $\Pi(t)$ usually consists of a multiplicity of correlated data streams $\Pi_i(t)$ with $n \ge 2$:

$$\Pi(t) = \{\Pi_i(t),\ i = 0, 1, 2, \cdots, n;\ t \in (0, +\infty)\} \tag{1}$$

The correlations $R$ among them can be expressed as:

$$R = \{\sim\ :\ \Pi_p(t) \sim \Pi_q(t),\ 0 \le p, q \le n\} \tag{2}$$

Karaoke ("missing orchestra" in Japanese) is an example of such a multimedia environment. This Japanese entertainment form features a live singer with pre-recorded accompaniment. The user sings into a microphone while the stored music is played simultaneously. Karaoke is tremendously popular in eastern Asia as an avenue for recreation and entertainment. Some of the distinctive characteristics of karaoke are:

– It encourages artistry in which users try to emulate the original singer in terms of timbre and expression;