ARTICLES
DOI: 10.1038/s41562-017-0186-2
© 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.
¹ Department of General Psychology and Padova Neuroscience Center, University of Padova, via Venezia 8, Padova 35131, Italy.
² Laboratoire de Psychologie Cognitive - UMR7290, Centre National de la Recherche Scientifique, Aix-Marseille Université, 3, place Victor Hugo, Marseille 13331 CEDEX 3, France.
³ Institute of Cognitive Sciences and Technologies (ISTC), National Research Council (CNR), Via Martiri della Libertà 2, Padova 35137, Italy.
⁴ IRCCS San Camillo Hospital Foundation, via Alberoni 70, Venice-Lido 30126, Italy.
*e-mail: marco.zorzi@unipd.it
Visual perception of symbols like letters and digits constitutes the front end of much more complex cognitive functions, such as reading and mathematics. Written symbols are culture specific, which implies that the mapping between visual form and symbol identity is often arbitrary: even within the same script, our visual system must tune to fine-grained visual details (for example, to discriminate between I and J) but also neglect significant variability in the visual appearance of the same symbol (for example, versus Φ). This ability appears even more remarkable considering that reading is a recent cultural invention, with a history of fewer than 6,000 years⁹. This implies that evolutionary mechanisms could not have shaped the human visual system specifically to support reading, which must be acquired through education. Nevertheless, despite the large variability in writing systems, cross-cultural studies have shown that written symbols are always processed by the same cortical circuits¹⁰. One explanation for the universal neurocognitive bases of a cultural invention like reading is that it partially ‘invades’ evolutionarily older brain circuits, which are recycled during development to support a novel function that is in some way related to their original one⁷. Indeed, although learning to read requires extensive training and interaction with many other sources of information (for example, phonological and semantic), orthographic processing can be performed to some extent even by non-human primates¹¹, which must necessarily rely on purely visual information¹². This suggests that cortical visual circuits that evolved for generic object and scene recognition might serve as a starting point for learning to recognize written symbols, and might be partially reorganized as a result of reading acquisition¹³. In turn, visual symbols are likely to have been culturally selected to match the type of geometric structures found in natural scenes⁸.
From a computational perspective, the processing of complex visual information requires hierarchical organization¹⁴,¹⁵, where neurons in the early levels extract simple features over local regions of the visual field that are successively combined into more complex features covering larger portions of the visual scene. Accordingly, visual processing can be conceived as a series of non-linear transformations over the sensory input to build more abstract, internal representations that are invariant to irrelevant changes in visual appearance¹⁶. This hierarchical, multilayer architecture seems well suited to also supporting orthographic processing¹⁷. At the letter level, basic visual features such as edges and curvatures might be combined into simple geometrical shapes and letter fragments, thereby allowing recognition through component features¹,¹⁷,¹⁸. Explicit teaching and contextual information might then lead to even more abstract letter identities¹⁹, whose positional information can be used to encode graphemes and bigrams, up to high-level representations of entire words¹²,²⁰.
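The hierarchical scheme described above can be illustrated with a minimal sketch. It is not the model presented in this article: the one-dimensional input, the fixed edge-like kernel, the rectification and the max-pooling are all assumptions chosen for brevity, and the names `stage` and `receptive_field` are hypothetical. The sketch only shows how repeated local filtering, a non-linearity and pooling yield units that respond to progressively larger portions of the input.

```python
import numpy as np

def stage(x, kernel, pool=2):
    """One processing stage: local filter, non-linearity, spatial pooling."""
    # Local linear filter over neighbouring units ("valid" keeps full overlaps only).
    filtered = np.convolve(x, kernel, mode="valid")
    # Non-linearity (assumed here: simple rectification).
    rectified = np.maximum(filtered, 0.0)
    # Max-pooling over non-overlapping groups gives tolerance to small shifts.
    n = (rectified.size // pool) * pool
    return rectified[:n].reshape(-1, pool).max(axis=1)

def receptive_field(n_stages, kernel_size=3, pool=2):
    """Number of input units that influence one unit after n_stages stages."""
    rf, stride = 1, 1
    for _ in range(n_stages):
        rf += (kernel_size - 1) * stride  # filtering widens the field
        rf += (pool - 1) * stride         # pooling widens it further
        stride *= pool                    # and coarsens the sampling grid
    return rf

rng = np.random.default_rng(0)
signal = rng.normal(size=64)            # placeholder input, not real image data
edge_kernel = np.array([1.0, 0.0, -1.0])  # hand-picked, purely illustrative

h1 = stage(signal, edge_kernel)
h2 = stage(h1, edge_kernel)
h3 = stage(h2, edge_kernel)
print(h1.size, h2.size, h3.size)
print([receptive_field(n) for n in (1, 2, 3)])
```

With these settings each added stage has fewer units, and each unit depends on a wider stretch of the input, which is the sense in which later levels cover larger portions of the visual scene.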
It should be noted that the acquisition of literacy seems to profoundly reshape early levels of visual processing¹³, and a full account of orthographic development should also consider the important role of top-down processing in visual word recognition²¹. Nevertheless, the encoding of individual letters seems to be a prerequisite to create word-level representations²², as also assumed in computational models of reading development²³,²⁴. Moreover, recent neuroimaging evidence suggests that a ‘letter form area’²⁵ can be distinguished, both at the spatial and temporal dynamic levels, from the classic ‘visual word form area’¹⁰ in the ventral occipitotemporal cortex. However, despite enormous progress in dissecting the functional organization of orthographic processing using neuroimaging techniques, the leading computational model of letter perception is based on hand-coded features²⁶,²⁷ and does not explain how high-level representations can be acquired through learning. Other models either represent letters in a localistic fashion (that is, there
Letter perception emerges from unsupervised deep learning and recycling of natural image features
Alberto Testolin¹, Ivilin Stoianov²,³ and Marco Zorzi¹,⁴*
The use of written symbols is a major achievement of human cultural evolution. However, how abstract letter representations might be learned from vision is still an unsolved problem¹,². Here, we present a large-scale computational model of letter recognition based on deep neural networks³,⁴, which develops a hierarchy of increasingly more complex internal representations in a completely unsupervised way by fitting a probabilistic, generative model to the visual input⁵,⁶. In line with the hypothesis that learning written symbols partially recycles pre-existing neuronal circuits for object recognition⁷, earlier processing levels in the model exploit domain-general visual features learned from natural images, while domain-specific features emerge in upstream neurons following exposure to printed letters. We show that these high-level representations can be easily mapped to letter identities even for noise-degraded images, producing accurate simulations of a broad range of empirical findings on letter perception in human observers. Our model shows that by reusing natural visual primitives, learning written symbols only requires limited, domain-specific tuning, supporting the hypothesis that their shape has been culturally selected to match the statistical structure of natural environments⁸.
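The pipeline summarized above (unsupervised learning on natural images, further unsupervised learning on letters, then a simple mapping to letter identities) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the text here does not name the learning algorithm, so the sketch assumes restricted Boltzmann machines trained with one-step contrastive divergence, a least-squares linear read-out, tiny layer sizes, and random binary arrays standing in for natural image patches and printed letters. The names `RBM`, `make_toy_letters` and `run_demo` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class RBM:
    """Restricted Boltzmann machine trained with one-step contrastive divergence."""
    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.1):
        # Positive phase: hidden activity driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one reconstruction step of the generative model.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))  # reconstruction error

def make_toy_letters(rng, n_per_class=100, n_pixels=36, flip=0.05):
    """Two random binary prototypes with pixel-flip noise (placeholder 'letters')."""
    prototypes = (rng.random((2, n_pixels)) < 0.5).astype(float)
    labels = np.repeat([0, 1], n_per_class)
    noise = (rng.random((labels.size, n_pixels)) < flip).astype(float)
    return np.abs(prototypes[labels] - noise), labels

def run_demo(seed=0, epochs=50):
    rng = np.random.default_rng(seed)
    # Placeholder for natural image patches (assumption: random binary data).
    natural = (rng.random((200, 36)) < 0.5).astype(float)
    letters, labels = make_toy_letters(rng)

    # Layer 1: domain-general features, learned without labels, then frozen.
    layer1 = RBM(36, 24, rng)
    for _ in range(epochs):
        layer1.cd1_step(natural)

    # Layer 2: trained on layer-1 activations (probabilities used as data).
    h1 = layer1.hidden_probs(letters)
    layer2 = RBM(24, 12, rng)
    for _ in range(epochs):
        layer2.cd1_step(h1)
    features = layer2.hidden_probs(h1)

    # Linear read-out from top-level features to identities (least squares).
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    targets = np.eye(2)[labels]
    readout, *_ = np.linalg.lstsq(X, targets, rcond=None)
    predicted = (X @ readout).argmax(axis=1)
    return features, predicted, labels

features, predicted, labels = run_demo()
print(features.shape, float((predicted == labels).mean()))
```

Labels are used only for the final read-out; both layers are fitted without them, which is the property the sketch is meant to highlight. The printed accuracy on this synthetic data says nothing about the results reported in the article.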
NATURE HUMAN BEHAVIOUR | www.nature.com/nathumbehav