Int J Comput Vis
DOI 10.1007/s11263-010-0389-8

Identifying Join Candidates in the Cairo Genizah

Lior Wolf · Rotem Littman · Naama Mayer · Tanya German · Nachum Dershowitz · Roni Shweka · Yaacov Choueka

Received: 25 January 2010 / Accepted: 17 September 2010
© Springer Science+Business Media, LLC 2010

Abstract A join is a set of manuscript-fragments that are known to originate from the same original work. The Cairo Genizah is a collection containing approximately 350,000 fragments of mainly Jewish texts discovered in the late 19th century. The fragments are today spread out in libraries and private collections worldwide, and there is an ongoing effort to document and catalogue all extant fragments. The task of finding joins is currently conducted manually by experts, and presumably only a small fraction of the existing joins have been discovered. In this work, we study the problem of automatically finding candidate joins, so as to streamline the task. The proposed method is based on a combination of local descriptors and learning techniques. To evaluate the performance of various join-finding methods, without relying on the availability of human experts, we construct a benchmark dataset that is modeled on the Labeled Faces in the Wild benchmark for face recognition. Using this benchmark, we evaluate several alternative image representations and learning techniques. Finally, a set of newly-discovered join-candidates have been identified using our method and validated by a human expert.

Keywords Cairo Genizah · Document analysis · Similarity learning

L. Wolf · R. Littman · N. Mayer · T. German · N. Dershowitz
The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
e-mail: wolf@cs.tau.ac.il

R. Shweka · Y. Choueka
The Friedberg Genizah Project, Jerusalem, Israel

1 Introduction

Written text is one of the best sources for understanding historical life.
The Cairo Genizah is a unique source of preserved Middle Eastern texts, collected between the 11th and the 19th centuries. These texts are a mix of religious Jewish manuscripts with a smaller proportion of secular texts. To make the study of the Genizah more efficient, there is an acute demand to group the fragments and reconstruct the original manuscripts. Throughout the years, scholars have devoted a great deal of time to manually identifying such groups, referred to as joins, often visiting numerous libraries.

Manual classification is currently the gold standard for finding joins. However, it is not scalable and cannot be applied to the entire corpus. We suggest automatically identifying candidate joins to be verified by human experts. To this end, we employ modern image-recognition tools such as local descriptors, bag-of-features representations and discriminative metric learning techniques. These techniques are adapted to the problem at hand by applying suitable preprocessing and task-specific key-point selection techniques. Where appropriate, we use suitable generic methods.

We validate our methods in two ways. The first is to construct a benchmark for the evaluation of algorithms that compare the images of two leaves. Algorithms are evaluated based on their ability to determine whether two leaves are a join or not. The second is to create a short list of the most likely newly discovered join candidates, according to our algorithm's metric, and send it to a human expert for validation.

The main contributions of this work are as follows:

1. The design of an algorithmic framework for finding join-