Classifying food items by image using Convolutional Neural Networks

Derek Farren
Stanford University
dfarren@stanford.edu

Abstract

Grocery item image classification is a well-researched problem. However, until the recent announcements from Amazon regarding the "Just Walk Out" technology used in Amazon Go, most Computer Vision techniques used in state-of-the-art research did not involve neural networks. Recently, a research group from the University of Freiburg released the most complete openly available grocery item image dataset. They also developed the most accurate classification model for that dataset using convolutional neural networks. In this research I propose a model to classify the Freiburg Groceries Dataset that is more accurate than the state of the art.

1. Introduction

Amazon Go's recent announcement has brought attention to grocery image detection in Computer Vision. The shopping experience, according to Amazon, is made possible by the same types of technologies used in self-driving cars: computer vision, sensor fusion, and deep learning. With "Just Walk Out" technology, users can enter the store with the Amazon Go app, shop for products, and walk out of the store without lines or checkout. The technology automatically detects when products are taken from or returned to shelves and keeps track of them in a virtual cart. When the shopping is finished, users leave the store and their Amazon account is charged shortly thereafter.

At the heart of this technology is a Computer Vision model that classifies grocery items from images captured by a video camera. This work proposes a model to accomplish such a task. The proposed model is more accurate than the state of the art [1].

This work also proposes a greedy algorithm that improves the network performance by changing its architecture.
This algorithm is called Guided Pruning, and it proved to be very helpful in situations where the function mapping network architecture to network accuracy has large convex regions.

2. Related Work

A fair amount of work has been done using Computer Vision on grocery datasets.

A real-time product detection system from video is presented in [2]. An approach for matching database images against an input image is shown in [3], which uses scale-invariant feature transform (SIFT) [4] vectors in an efficient manner. Another study focuses on logo detection in natural scenes by spatial pyramid mining [5]. In [6], the authors apply planogram extraction based on image processing, using a combination of several detectors that includes SIFT matching and optical character recognition.

However, because most grocery image datasets are privately owned, little progress was made in this area until last year, when, soon after Amazon announced its Amazon Go stores, a new dataset was released. The Freiburg Groceries Dataset [1] consists of 5,000 256x256 RGB images covering 25 different classes of groceries, with at least 97 images per class. The authors collected all images from real-world settings at different stores and apartments. In contrast to existing grocery datasets, this dataset includes a large variety of perspectives, lighting conditions, and degrees of clutter. Overall, the images contain thousands of different object instances. Examples for each class can be seen in Figure 2. This dataset is currently the state-of-the-art benchmark for grocery Computer Vision testing. The authors also proposed a classifier on this dataset: they re-trained the CaffeNet architecture and achieved a mean accuracy of 78.9%.

The present work also proposes Guided Pruning, a greedy algorithm that improves the network performance by manipulating some hyperparameters.
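
The text so far describes Guided Pruning only as a greedy algorithm that changes the network architecture, so the following is merely a generic sketch of a greedy architecture search and not the procedure used in this work. It assumes that an architecture can be represented as an ordered tuple of layer names and that a scoring function returns a validation accuracy for any such tuple. The names `greedy_prune` and `toy_accuracy`, the layer names, and the toy scoring rule are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Assumption: an architecture is an ordered tuple of layer names.
Architecture = Tuple[str, ...]


def greedy_prune(
    layers: Sequence[str],
    evaluate: Callable[[Architecture], float],
) -> Tuple[Architecture, float]:
    """Greedily drop one layer at a time while the score strictly improves."""
    current = tuple(layers)
    best_score = evaluate(current)
    while len(current) > 1:
        # Build every architecture obtained by removing exactly one layer.
        candidates = [current[:i] + current[i + 1:] for i in range(len(current))]
        scored = [(evaluate(c), c) for c in candidates]
        top_score, top_arch = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no single removal helps, so the greedy search stops
        current, best_score = top_arch, top_score
    return current, best_score


def toy_accuracy(arch: Architecture) -> float:
    """Hypothetical stand-in for validation accuracy; it peaks at 3 layers."""
    return 1.0 - 0.1 * abs(len(arch) - 3)


pruned, score = greedy_prune(
    ["conv1", "conv2", "conv3", "fc6", "fc7"], toy_accuracy
)
```

In a real setting `evaluate` would train or fine-tune the candidate network and measure held-out accuracy, which makes each greedy step expensive. A greedy search of this kind can only be expected to reach a good architecture when the accuracy surface is well behaved around the starting point, which is consistent with the convexity remark above.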
Hyperparameter optimization is an important research topic in machine learning and is widely used in practice [9][10][11][12]. Despite their success, these methods are still limited in that they only search models from a fixed-length space. In other words, it is difficult to ask them to generate a variable-length configuration that specifies the structure and connectivity of a network. In practice, these methods often work