Social area analysis, data mining, and GIS

Seth E. Spielman a, Jean-Claude Thill b,*

a Department of Geography, SUNY-Buffalo, Buffalo, NY, USA
b Department of Geography and Earth Sciences, University of North Carolina – Charlotte, 9201 University City Blvd, Charlotte, NC 28223, USA

* Corresponding author. Tel.: +1 704 687 5909; fax: +1 704 687 5966.
E-mail addresses: ses27@buffalo.edu (S.E. Spielman), jfthill@uncc.edu (J.-C. Thill).

Received 17 July 2007; received in revised form 16 November 2007; accepted 19 November 2007

Computers, Environment and Urban Systems 32 (2008) 110–122
Available online at www.sciencedirect.com
www.elsevier.com/locate/compenvurbsys
doi:10.1016/j.compenvurbsys.2007.11.004
0198-9715/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

Abstract

There is a long cartographic tradition of describing cities through a focus on the characteristics of their residents. A review of the history of this type of urban social analysis highlights some persistent challenges. In this paper, existing geodemographic approaches are extended through coupling the Kohonen Self-Organizing Map algorithm (SOM), a data-mining technique, with geographic information systems (GIS). This approach allows the construction of linked maps of social (attribute) and geographic space. This novel type of geodemographic classification allows ad hoc hierarchical groupings and exploration of the relationship between social similarity and geographic proximity. It allows one to filter complex demographic datasets and is capable of highlighting general social patterns while retaining the fundamental social fingerprints of a city. A dataset describing 79 attributes of the 2217 census tracts in New York City is analyzed to illustrate the technique. Pairs of social and geographic maps are formally compared using simple pattern metrics. Our analysis of New York City calls into question some assumptions about the functional form of spatial relationships that underlie many modeling and statistical techniques.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Self-Organizing Maps; Geodemographics; New York City; Data mining; GIS

1. Introduction

In gearing up for the first United States decennial census in 1790, James Madison argued that the census should be “extended so as to embrace some other objects besides the bare enumeration of the inhabitants; it would enable them to adapt the public measures to the particular circumstances of the community” (Kurland & Lerner, 1987, p. 139). Madison’s idea, that knowing something about the characteristics of local populations improves local governance, is accepted as a basic premise in planning, politics, and policy analysis. However, how one understands the particular circumstances of a community is a methodological question that has been evolving for over a century.

Madison’s proposal to extend the census to include the occupations of inhabitants was rejected by the United States Senate in 1790. In a letter to Jefferson, Madison reflected that his plan was “thrown out by the Senate as a waste of trouble and supplying materials for idle people to make a book” (Cohen, 1981, p. 47). Unlike in Madison’s day, data about cities and the people who live in them are now abundant; in fact, data are so abundant and complex that integrating available information into the public planning processes is often difficult. The first census asked five questions; the long form of the questionnaire for the 2000 decennial census of population was 10 pages long and included over 50 questions. Many municipalities now maintain detailed datasets describing crime, traffic, school performance, the built environment, and many other facets of urban life. The volume of data currently available to planners is excellent fodder for urban scholars. Yet, it remains a challenge to communicate the complexity of the urban social landscape in an engaging and efficient manner.

In addition to a dramatic increase in the volume of information, new forms of analysis that emphasize an exploratory approach and are based on computational