Int J Comput Vis
DOI 10.1007/s11263-010-0389-8
Identifying Join Candidates in the Cairo Genizah
Lior Wolf · Rotem Littman · Naama Mayer ·
Tanya German · Nachum Dershowitz · Roni Shweka ·
Yaacov Choueka
Received: 25 January 2010 / Accepted: 17 September 2010
© Springer Science+Business Media, LLC 2010
Abstract A join is a set of manuscript-fragments that are
known to originate from the same original work. The Cairo
Genizah is a collection containing approximately 350,000
fragments of mainly Jewish texts discovered in the late 19th
century. The fragments are today spread out in libraries and
private collections worldwide, and there is an ongoing effort
to document and catalogue all extant fragments. The task
of finding joins is currently conducted manually by experts,
and presumably only a small fraction of the existing joins
have been discovered. In this work, we study the problem
of automatically finding candidate joins, so as to streamline
the task. The proposed method is based on a combination of
local descriptors and learning techniques. To evaluate the
performance of various join-finding methods without relying
on the availability of human experts, we construct a
benchmark dataset that is modeled on the Labeled Faces in
the Wild benchmark for face recognition. Using this benchmark,
we evaluate several alternative image representations
and learning techniques. Finally, a set of newly discovered
join candidates has been identified using our method and
validated by a human expert.
Keywords Cairo Genizah · Document analysis · Similarity
learning
L. Wolf · R. Littman · N. Mayer · T. German · N. Dershowitz
The Blavatnik School of Computer Science, Tel Aviv University,
Tel Aviv, Israel
e-mail: wolf@cs.tau.ac.il
R. Shweka · Y. Choueka
The Friedberg Genizah Project, Jerusalem, Israel
1 Introduction
Written text is one of the best sources for understanding
historical life. The Cairo Genizah is a unique source of
preserved Middle Eastern texts, collected between the 11th
and the 19th centuries. These texts are a mix of religious
Jewish manuscripts with a smaller proportion of secular
texts. To make the study of the Genizah more efficient,
there is an acute need to group the fragments and reconstruct
the original manuscripts. Throughout the years, scholars
have devoted a great deal of time to manually identifying
such groups, referred to as joins, often visiting numerous
libraries.
Manual classification is currently the gold standard for
finding joins. However, it is not scalable and cannot be
applied to the entire corpus. We suggest automatically
identifying candidate joins to be verified by human experts. To
this end, we employ modern image-recognition tools such
as local descriptors, bag-of-features representations and
discriminative metric learning techniques. These techniques are
adapted to the problem at hand through suitable preprocessing
and task-specific key-point selection techniques. Where
appropriate, we use suitable generic
methods.
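To make the bag-of-features idea concrete, the following is a minimal sketch, not the implementation used in this work. It assumes that local descriptors have already been extracted from a fragment image and that a codebook of representative descriptors is available; the two-dimensional descriptors, the three-entry codebook and the function name bag_of_features are all illustrative assumptions.

```python
from math import dist


def bag_of_features(descriptors, codebook):
    """Return an L1-normalised histogram of nearest-codeword assignments.

    Both arguments are sequences of equal-length numeric tuples. The
    Euclidean nearest-neighbour assignment is an assumption made for
    illustration, not a detail taken from the paper.
    """
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        # Index of the codeword closest to this descriptor.
        nearest = min(range(len(codebook)),
                      key=lambda k: dist(descriptor, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(codebook)
    return [c / total for c in counts]


# Toy data (hypothetical): three codewords, four descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
descriptors = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (3.7, 0.2)]
histogram = bag_of_features(descriptors, codebook)
print(histogram)
```

Each fragment image is thus summarised by a fixed-length vector, regardless of how many key-points it contains, which is what allows two leaves to be compared by a single similarity score.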
We validate our methods in two ways. The first is to construct
a benchmark for the evaluation of algorithms that are
able to compare the images of two leaves. Algorithms are
evaluated based on their ability to determine whether two
leaves are a join or not. The second is to create a short list
of the most likely newly discovered join candidates, according
to our algorithm’s metric, and send it to a human expert for
validation.
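The pair-based benchmark evaluation can be illustrated with a short sketch. It assumes that some method has already assigned a similarity score to each pair of leaves and that each pair carries a join/non-join label. The scores, the labels and the simple rule of choosing the accuracy-maximising threshold on training pairs are assumptions made for illustration, not the protocol of this work.

```python
def accuracy(scores, labels, threshold):
    """Fraction of pairs whose thresholded score matches its label."""
    correct = sum((s >= threshold) == y for s, y in zip(scores, labels))
    return correct / len(labels)


def best_threshold(scores, labels):
    """Pick the similarity threshold maximising accuracy on training pairs."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = accuracy(scores, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Hypothetical similarity scores; True marks a pair that is a join.
train_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.22]
train_labels = [True, False, True, False, True, False]
test_scores = [0.83, 0.35, 0.69, 0.71]
test_labels = [True, False, False, True]

# Fit the threshold on training pairs only, then score held-out pairs.
threshold = best_threshold(train_scores, train_labels)
test_accuracy = accuracy(test_scores, test_labels, threshold)
print(threshold, test_accuracy)
```

Selecting the threshold on training pairs and reporting accuracy on held-out pairs keeps the comparison between alternative methods fair, since no method is tuned on the pairs it is scored on.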
The main contributions of this work are as follows:
1. The design of an algorithmic framework for finding join-