Color and Size Image Dataset Normalization Protocol for Natural Image Classification: a Case Study in Tomato Crop Pathologies

Juan F. Molina ∗, Rodrigo Gil †, Carlos Bojacá †, Gloria Díaz ‡ and Hugo Franco §

Abstract—In computer vision research, the construction of image datasets is a critical process, given the need for robust experimentation frameworks that ensure the quality and validity of the conclusions and performance measurements of each particular study. Experimental datasets must therefore optimize their statistical, visual and computational properties through an adequate selection of representative and useful visual data, according to the specific research question being addressed. This paper proposes a dataset construction protocol for ad hoc acquired images in a particular Machine Learning application: tomato crop health assessment.

Keywords—Computer vision, image retrieval, natural images, image datasets.

I. INTRODUCTION

Dataset construction has proved to be a nontrivial task [1], due to the potential bias, incompleteness and even isolation of the resulting collection. In this context, it is necessary to take into account both high level (database–wide) and low level (image–wide) properties that are desirable from the experimental point of view. Recent works have approached this kind of standardization by establishing requirements for set size (scale), semantic hierarchy, labeling accuracy and diversity (variability) as criteria for choosing (and, sometimes, conditioning) groups of items to be included in the dataset [2]. Detectability of objects and features is also an important criterion within the selection process, since it is directly related to the actual presence of visual and geometrical features in each individual item (image). However, imposing certain properties on the images considered for inclusion could “close” the dataset and thus constrain its alternative usages and expansions.
Indeed, several authors (e.g. Torralba et al. [1]) criticize the usage of the best-known image datasets (e.g. Corel, Sun9, Pascal, Caltech, ImageNet, among others) because of the implicit bias of pursuing better classification algorithm performance within a “closed world”. Reducing this kind of bias is a complex problem, and the strategy to address it should be tightly adjusted to the properties of the regions and objects of interest in the scope of each particular Computer Vision application. This involves an implicit specialization of each dataset, depending on the Machine Learning tasks it will support.

∗ Student of Computer Engineering at the Universidad Central (Colombia)
† Centro de Biosistemas, Universidad Jorge Tadeo Lozano (Colombia)
‡ Comp. Eng., Universidad Antonio Nariño (Colombia)
§ Comp. Eng., Universidad Central (Colombia) - hfrancot@ucentral.edu.co

In particular, even though most state–of–the–art representation models, such as bag of features, have color invariant properties, certain visual features that are relevant in the problem context could be enhanced to ease visual information characterization, i.e. feature descriptor design. In cases where the information variations related to relevant features are too slight, invariant model representations could be insufficient (in terms of classification, clustering and retrieval performance), since key details could be hidden from, or lost to, certain descriptors. These detectability issues are also related to correct image acquisition and resolution properties, and thus involve dataset disk size trade–offs, according to the minimum image resolution needed for an appropriate acquisition of objects and regions of interest.
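The trade–off between minimum image resolution and dataset disk size mentioned above can be illustrated with a short Python sketch. It is only a back-of-the-envelope estimate under stated assumptions, not part of the proposed protocol: the function names, the requirement that the finest relevant detail spans a fixed number of pixels, the uncompressed 3-bytes-per-pixel RGB storage and every numeric value are hypothetical and are not taken from the case study.

```python
import math


def min_image_side(scene_side_mm: float, detail_mm: float,
                   pixels_per_detail: int) -> int:
    """Smallest pixel count along one image side such that a detail of
    size `detail_mm` spans at least `pixels_per_detail` pixels.

    All three quantities are illustrative assumptions."""
    if scene_side_mm <= 0 or detail_mm <= 0 or pixels_per_detail <= 0:
        raise ValueError("all arguments must be positive")
    return math.ceil(scene_side_mm / detail_mm * pixels_per_detail)


def raw_dataset_bytes(width: int, height: int, n_images: int,
                      bytes_per_pixel: int = 3) -> int:
    """Uncompressed size of `n_images` images of width x height pixels,
    assuming 8-bit RGB (3 bytes per pixel) by default."""
    return width * height * bytes_per_pixel * n_images


if __name__ == "__main__":
    # Hypothetical scene: a 200 mm x 150 mm field of view in which the
    # smallest relevant detail measures 1 mm and must cover 4 pixels.
    width = min_image_side(200, 1, 4)
    height = min_image_side(150, 1, 4)
    size = raw_dataset_bytes(width, height, n_images=1000)
    print(f"{width} x {height} pixels, {size / 1e9:.2f} GB for 1000 images")
```

Because the uncompressed size grows with the product of width and height, halving the resolution along both axes divides the storage by four, which is why the minimum resolution that still preserves the features of interest is worth estimating before acquisition.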
Based on these considerations, this exploratory work presents a first approach to a consistent protocol for constructing a proper dataset to support Computer Vision applications in specialized fields by standardizing a) controllable acquisition settings and conditions, b) global color space and c) individual image size. The protocol and its performance are evaluated in the context of a particular case study: detection of anomalies and diseases in agricultural crops (tomato plantations).

978-1-4799-1121-9/13/$31.00 © 2013 IEEE

II. DATASET CONSTRUCTION CRITERIA

A. Feature–based Acquisition

While several protocols for building image datasets can be found in the computer vision and machine learning literature [2], [3], they focus on collecting specific feature classes and object types, so image acquisition either is not treated as part of the dataset construction process or is neglected. Given the particular goals of the proposed dataset (and its potential usage for supporting image classification and retrieval tasks), it is necessary to apply acquisition quality criteria to the dataset items, such as illumination, tonality, resolution, focus, object proportions and regularity, acquisition noise, information loss due to image representation, and feature distribution across the image dataset. This also implies a consistent preprocessing of every image, so that the best possible version of each item is obtained before it is included in the final collection.

However, most of these specifications are difficult to control in natural, outdoor environments. Since the global image quality must be sufficient to validate the assertions and inferences obtained through dataset usage, a set of acquisition guidelines has to be