A Combination of Simple Models by Forward Predictor Selection for Job Recommendation

Dávid Zibriczky
Department of Telecommunications and Media Informatics
Budapest University of Technology and Economics
Budapest, Hungary
david.zibriczky@gmail.com

ABSTRACT
This paper introduces a solution for the RecSys Challenge 2016. The principle of the proposed technique is to define various models capturing the specificity of the dataset and then to find the optimal combination of these models for different user categories. The approach follows a practical way of fine-tuning recommender algorithms, highlighting their components, training time, and prediction time. Based on forward predictor selection, it can be shown that item-neighbor methods and the recommendation of already shown or already interacted items have great potential for improving offline accuracy. The best combination consists of 11 predictor instances and achieved third place, with a leaderboard score of 665,592 and a final score of 2,005,263.

CCS Concepts
• Information systems → Recommender systems

1. INTRODUCTION
In 2016, the RecSys Challenge invited competing solutions for job recommendation in the social network of XING. For the competition, XING provided an anonymized dataset about the users, items, recommendations, and corresponding user interactions in their system. The goal of the challenge was to predict which job postings the users would actually click on. The accuracy of a solution is measured by a combination of precision, recall, and hit ratio.

2. THE DATASET
The training data consists of four sets: (1) impressions, lists of items that were shown to the users by XING; (2) interactions, positive or negative feedback on job posting items; and (3-4) the user and item catalogs, which contain the metadata about the users and job postings. The possible metadata values are described on the website of the challenge. The test set contains a list of 150K target user ids.
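The forward predictor selection mentioned in the abstract can be illustrated with a minimal sketch: greedily add, one at a time, the predictor whose inclusion most improves a validation metric, and stop when no candidate improves the combination. The function names, the unweighted score blending, and the toy precision@k metric below are illustrative assumptions, not the paper's exact procedure or the challenge's full evaluation formula.

```python
def evaluate(scores, relevant, k=30):
    """Toy stand-in metric: precision@k of the blended ranking."""
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1 for item in top if item in relevant) / k

def blend(selected, predictors):
    """Sum the (unweighted) per-item scores of the selected predictors."""
    combined = {}
    for name in selected:
        for item, s in predictors[name].items():
            combined[item] = combined.get(item, 0.0) + s
    return combined

def forward_selection(predictors, relevant):
    """Greedily pick predictors while the validation score improves.

    `predictors` maps a predictor name to its {item: score} output.
    """
    selected, best_score = [], 0.0
    remaining = set(predictors)
    while remaining:
        # Evaluate each remaining candidate added to the current set.
        name, score = max(
            ((n, evaluate(blend(selected + [n], predictors), relevant))
             for n in remaining),
            key=lambda t: t[1])
        if score <= best_score:
            break  # no candidate improves the combination any further
        selected.append(name)
        remaining.remove(name)
        best_score = score
    return selected, best_score
```

In practice each iteration would be evaluated on a held-out validation split, and the loop naturally excludes predictors (like the "noise" candidate in a test) that never improve the combined score.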
The goal of the challenge was to provide top-30 lists of job postings that the target users would likely click on in the following seven days, independently of time and context.

RecSys Challenge ’16, September 15, 2016, Boston, MA, USA
© 2016 ACM. ISBN 978-1-4503-4801-0/16/09. DOI: http://dx.doi.org/10.1145/2987538.2987548

2.1 Preprocessing
The original data is transformed into two formats:

1. Events: (time, user_id, item_id, type, value). As the exact time of an impression is unknown, it is approximated with the Unix time of 12 p.m. on the Thursday of the given year-week. Each (time, user_id, item_id) triple is transformed into a new record, counting its occurrences in value and setting its event type to 5. The interactions set already fits this format.

2. User/item metadata: (id, key1, key2, ..., keyN). The user and item metadata sets are provided in the desired format, therefore no transformation is required for them. However, a significant number of duplicates are removed from the user catalog, and all unknown values are set to an empty value.

Based on the analysis of the item metadata, the country and region of the items are not consistent with their geo-location. Using the longitude and latitude metadata, each job posting is therefore assigned to the closest city with a population of at least 1,000.¹
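The impression-to-event transformation above can be sketched as follows. The input row layout (year, week, user id, list of item ids) and the function names are assumptions for illustration; the key points from the text are the "Thursday noon of the year-week" time approximation and the per-triple occurrence count stored in value (event type 5 marks impressions).

```python
from collections import Counter
from datetime import datetime, timezone

def week_to_unix(year, week):
    """Unix time of 12 p.m. (UTC assumed) on Thursday of an ISO year-week."""
    thursday = datetime.fromisocalendar(year, week, 4).replace(
        hour=12, tzinfo=timezone.utc)
    return int(thursday.timestamp())

def impressions_to_events(impressions):
    """Collapse hypothetical (year, week, user_id, item_ids) impression rows
    into (time, user_id, item_id, type, value) event records, where value
    counts how often each (time, user, item) triple occurs and type is 5."""
    counts = Counter()
    for year, week, user_id, item_ids in impressions:
        t = week_to_unix(year, week)
        for item_id in item_ids:
            counts[(t, user_id, item_id)] += 1
    return [(t, u, i, 5, v) for (t, u, i), v in counts.items()]
```

For example, a single week-1-of-2016 impression row showing item 10 twice and item 11 once to user 7 yields two event records with the same approximated timestamp (2016-01-07 12:00 UTC) and values 2 and 1.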
The metadata of the items is then enriched via these cities with continent, country, region, city name, and population.

2.2 Statistics
Table 1 summarizes statistics about the dataset, considering the number of events, unique users, and unique items. The results show that 95% of the events are impressions, which cover approximately four times more users than the interaction events. This implies that roughly three quarters of the users are inactive. The statistics also highlight that half of the users who have impressions are missing from the catalog. Another important observation is that the average number of interactions per item is around 8, which is considered low.

Data source         | #events     | #users    | #items
--------------------|-------------|-----------|----------
E(1): Click         | 7,183,038   | 769,396   | 998,424
E(2): Bookmark      | 206,191     | 59,063    | 142,908
E(3): Reply         | 422,026     | 107,463   | 190,099
E(4): Delete        | 1,015,423   | 44,595    | 215,844
E(5): Impression    | 201,872,093 | 2,755,167 | 846,814
E(1-5): All events  | 210,698,771 | 2,792,405 | 1,257,422
Catalog             | -           | 1,367,057 | 1,358,098
Catalog and E(5)    | -           | 1,292,662 | 843,589
Catalog and E(1-4)  | -           | 784,687   | 1,025,582
Catalog or E(1-5)   | -           | 2,829,563 | 1,362,890

Table 1: Statistics about the data sources.

¹ https://www.maxmind.com/en/free-world-cities-database
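The nearest-city assignment described in Section 2.1 can be sketched as below. This is a minimal illustration under assumed names and a simple dict-based city record; a real implementation over roughly 1.3M job postings and a world-cities catalog would use a spatial index rather than a linear scan.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))

def closest_city(lat, lon, cities, min_population=1000):
    """Return the closest city record whose population meets the threshold.

    `cities` is an assumed list of dicts with at least lat/lon/population,
    e.g. one row per entry of a free world-cities database.
    """
    eligible = (c for c in cities if c["population"] >= min_population)
    return min(eligible,
               key=lambda c: haversine_km(lat, lon, c["lat"], c["lon"]))
```

The returned city record then supplies the continent, country, region, city name, and population fields used to enrich the item metadata.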