REAL ESTATE APPRAISAL IN BRAZIL

A PREPRINT

Thiago Marzagão
Observatory of Public Spending
Government of Brazil
Brasília-DF, Brazil
thiago.marzagao@cgu.gov.br

Rodrigo Ferreira
Observatory of Public Spending
Government of Brazil
Brasília-DF, Brazil
rodrigo.p.ferreira@cgu.gov.br

Leonardo Sales
Observatory of Public Spending
Government of Brazil
Brasília-DF, Brazil
leonardo.sales@cgu.gov.br

July 15, 2019

ABSTRACT

Brazilian banks commonly use linear regression to appraise real estate: they regress price on features like area, location, etc., and use the resulting model to estimate the market value of the target property. But Brazilian banks do not test the predictive performance of those models, which for all we know are no better than random guesses. That introduces huge inefficiencies in the real estate market. Here we propose a machine learning approach to the problem. We scrape real estate data from 15 thousand online listings and use it to fit a boosted trees model. The resulting model has a median absolute error of 8.16%. We provide all data and source code.

Keywords: real estate, hedonic pricing, market behavior.
JEL codes: R30, L110, D40.

Problem

How do we know the market value of real estate? The Brazilian Association of Technical Standards (ABNT) advises the use of econometric models for the valuation of urban property (NBR 14653-2, "Appraisal of urban real estate"). Many appraisers follow that recommendation. They find real estate similar to the target property (say, other residential apartments in the same city), collect data on those properties, and regress price on features like area, location, number of bedrooms, and the like. The appraiser then uses the estimated model to find the market value of the target property.
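The workflow just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the comparables are hypothetical numbers invented for the example, the only feature is area, and the helper name `fit_line` is ours rather than anything prescribed by the ABNT standard. A real appraisal would use several features.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical comparables: (area in square meters, asking price in BRL).
comparables = [
    (50, 310_000),
    (62, 365_000),
    (75, 455_000),
    (88, 510_000),
    (100, 590_000),
    (120, 700_000),
]
areas = [area for area, _ in comparables]
prices = [price for _, price in comparables]

# Step 1: regress price on the feature(s) of the comparable properties.
intercept, slope = fit_line(areas, prices)

# Step 2: plug the target property's features into the estimated model.
target_area = 80  # hypothetical target property
appraised_value = intercept + slope * target_area

print(f"estimated price per extra square meter: {slope:,.0f}")
print(f"appraised value of the target property: {appraised_value:,.0f}")
```

Note that every comparable is used to fit the line, so nothing in this procedure tells the appraiser how far off the appraised value is likely to be.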
The ABNT guidelines tell the appraiser to check the estimated model for linearity, heteroskedasticity, autocorrelation, multicollinearity, normality of residuals, presence of outliers, and for the statistical significance of each coefficient and of the model as a whole. The guidelines also say to check the model fit by observing the R². If no serious problems are found, and if the R² is not considered too low (the guidelines do not specify a threshold), the work is done.

That approach is flawed. All of the samples are used to fit the regression line. No samples are left out to test the performance of the model. Hence we cannot know how good or bad the model is. The model may have an unacceptably high mean or median error. For all we know, the models created today by Brazilian appraisers are no better than random guesses.

In other words, the current approach is an econometric solution to a machine learning problem. In real estate appraisals we are not interested in the effect of swimming pools on house prices. We are interested in finding the market value of an individual house. We do not care what the coefficient of "has swimming pool" is or whether it is statistically significant.
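To make the point about held-out samples concrete, the sketch below fits a one-feature regression on part of the data and measures its error on listings it never saw. Everything here is an assumption made for illustration: the listings are synthetic (generated with a fixed seed, not the scraped data), the 80/20 split is arbitrary, and the names `fit_line` and `median_ape` are ours. We also assume that the error of interest is the median absolute percentage error.

```python
import random
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def median_ape(actual, predicted):
    """Median absolute percentage error, as a fraction (0.10 means 10%)."""
    return median(abs(p - a) / a for a, p in zip(actual, predicted))


# Synthetic listings (area, price); a stand-in for real data, not the paper's.
rng = random.Random(0)
listings = []
for _ in range(200):
    area = rng.uniform(40, 200)
    price = (50_000 + 5_000 * area) * rng.lognormvariate(0, 0.2)
    listings.append((area, price))

# Hold out 20% of the listings; the model never sees them during fitting.
rng.shuffle(listings)
cut = len(listings) * 4 // 5
train, test = listings[:cut], listings[cut:]

intercept, slope = fit_line([a for a, _ in train], [p for _, p in train])


def predict(area):
    return intercept + slope * area


# In-sample error is what a fit-only workflow can see; held-out error is
# the estimate of how the model performs on a property it was not fit on.
train_error = median_ape([p for _, p in train], [predict(a) for a, _ in train])
test_error = median_ape([p for _, p in test], [predict(a) for a, _ in test])

print(f"in-sample median APE: {train_error:.1%}")
print(f"held-out median APE:  {test_error:.1%}")
```

The held-out figure is the one that answers the appraiser's question, since the target property is by definition not among the samples used to fit the model.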