Why approximate when you can get the exact? Optimal Targeted Viral Marketing at Scale

Xiang Li, J. David Smith
CISE Department, University of Florida, Gainesville, FL, 32611
Email: {xixiang, jdsmith}@cise.ufl.edu

Thang N. Dinh
CS Department, Virginia Commonwealth University, Richmond, VA 23284
Email: tndinh@vcu.edu

My T. Thai
CISE Department, University of Florida, Gainesville, FL, 32611
Email: mythai@ufl.edu

Abstract: One of the most central problems in viral marketing is Influence Maximization (IM), which asks for a set of $k$ seed users who can influence the maximum number of users in online social networks. Unfortunately, all existing algorithms for IM, including the state-of-the-art SSA and IMM, have an approximation ratio of $(1 - 1/e - \epsilon)$. Recently, a generalization of IM has been introduced: Cost-aware Targeted Viral Marketing (CTVM), which asks for the most cost-effective users to influence the most relevant users. The current best algorithm for CTVM has an approximation ratio of $(1 - 1/\sqrt{e} - \epsilon)$. In this paper, we study the CTVM problem with the aim of solving it optimally. We first show that exactly solving CTVM with traditional two-stage stochastic programming is infeasible because it does not scale. We then propose an almost exact algorithm, TIPTOP, which has an approximation ratio of $(1 - \epsilon)$. This result significantly improves on the current best solutions to both IM and CTVM. At the heart of TIPTOP lies an innovative technique that reduces the number of samples as much as possible. This allows us to exactly solve CTVM on a much smaller space of generated samples using Integer Programming. While obtaining an almost exact solution, TIPTOP is very scalable, running on billion-scale networks such as Twitter in under three hours. Furthermore, TIPTOP gives researchers a tool to benchmark their solutions against the optimal one in large-scale networks, which is currently not available.
Keywords: Viral Marketing; Influence Maximization; Algorithms; Online Social Networks; Optimization

I. INTRODUCTION

With billions of active users, Online Social Networks (OSNs) have become one of the most effective platforms for marketing and advertising. Through "word-of-mouth" exchanges, so-called viral marketing, influence and product adoption can spread from a few key users to billions of users in the network. To identify those key users, a great amount of work has been devoted to the Influence Maximization (IM) problem, which asks for a set of $k$ seed users that maximizes the expected number of influenced nodes. Taking into account both an arbitrary cost for selecting a node and an arbitrary benefit for influencing a node, a generalized problem, Cost-aware Targeted Viral Marketing (CTVM), has been introduced recently [1]. Given a budget $B$, the goal is to find a seed set $S$ with total cost at most $B$ that maximizes the expected total benefit over the influenced nodes.

Solutions to both IM and CTVM suffer from either scalability issues or limited performance guarantees. Since IM is NP-hard, Kempe et al. [2] introduced the first $(1 - 1/e - \epsilon)$-approximation algorithm for any $\epsilon > 0$. However, it is not scalable. This work has motivated a vast amount of work on IM in the past decade [3]–[10] (and references therein), focusing on solving the scalability issue. The most recent notable results are the IMM [5] and SSA [6] algorithms, both of which can run on large-scale networks with the same performance guarantee of $(1 - 1/e - \epsilon)$. Meanwhile, BCT was introduced to solve CTVM with a ratio of $(1 - 1/\sqrt{e} - \epsilon)$ [1].

Despite the large body of work mentioned above, none of it attempts to solve IM or CTVM exactly. The lack of such a solution makes it impossible to evaluate the practical performance of existing algorithms on real-world datasets.
That is: how far from optimal are the solutions given by existing algorithms on billion-scale OSNs, beyond what their theoretical performance guarantees tell us? This question has remained unanswered until now.

Unfortunately, obtaining exact solutions to CTVM (and thus to IM) is very challenging, as the running time is no longer polynomial. Due to the nature of the problem, stochastic programming (SP) is a viable approach for optimization under uncertainty when the probability distribution governing the data is given [11]. However, traditional SP-based solutions to various NP-hard problems work only for small networks with a few hundred nodes [11]. Thus directly applying existing techniques to IM and CTVM is not suitable, as OSNs consist of millions of users and billions of edges. Furthermore, the theory developed to assess solution quality, such as that in [11], [12], and the references therein, provides only approximate confidence intervals. Therefore, sufficiently large samples are needed to justify the quality assessment. This requires us to develop novel stochastic programming techniques to exactly solve CTVM on large-scale networks.

In this paper, we provide the very first (almost) exact solutions to CTVM. To tackle the above challenges, we develop two innovative techniques:
1) Reduce the number of samples as much as possible so that the SP can be solved in a very short time. This requires us to tightly bound the number of samples needed to generate a candidate solution.
2) Develop a novel computational method to assess the solution quality with just enough samples, where the quality requirement is given a priori.

These results break with tradition and cross a barrier in stochastic programming theory, where solving SP on large-scale networks had been thought impractical.
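To make the sample-based idea behind technique 1) concrete, the following Python sketch solves a toy CTVM instance exactly on a fixed set of samples. It is only an illustration under stated assumptions, not TIPTOP or any algorithm from this paper: the diffusion model (each edge is kept independently with its probability, in the style of the independent cascade model) is assumed here and is not specified in this section, brute-force enumeration stands in for Integer Programming, and all function and variable names are hypothetical.

```python
import itertools
import random


def sample_live_edges(edges, rng):
    """Draw one realization: keep each edge (u, v, p) with probability p.
    ASSUMPTION: independent-cascade-style diffusion, chosen for illustration."""
    return [(u, v) for (u, v, p) in edges if rng.random() < p]


def reachable(seeds, live_edges):
    """Return the nodes reachable from the seeds along the kept edges."""
    adj = {}
    for u, v in live_edges:
        adj.setdefault(u, []).append(v)
    seen, stack = set(seeds), list(seeds)
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def sample_average_benefit(seeds, realizations, benefit):
    """Average, over the sampled realizations, of the total benefit influenced."""
    total = sum(sum(benefit[v] for v in reachable(seeds, r)) for r in realizations)
    return total / len(realizations)


def best_seed_set(nodes, cost, benefit, budget, realizations):
    """Exact optimum on the samples by enumeration (stand-in for an IP solver).
    Exponential in the number of nodes, so only usable on toy instances."""
    best, best_val = (), 0.0
    for k in range(1, len(nodes) + 1):
        for seeds in itertools.combinations(nodes, k):
            if sum(cost[u] for u in seeds) > budget:
                continue  # violates the budget constraint
            val = sample_average_benefit(seeds, realizations, benefit)
            if val > best_val:
                best, best_val = seeds, val
    return set(best), best_val


# Toy instance (hypothetical data); probabilities of 1.0 make it deterministic.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("d", "c", 1.0)]
cost = {"a": 2, "b": 1, "c": 1, "d": 1}
benefit = {"a": 1, "b": 1, "c": 5, "d": 2}
rng = random.Random(0)
realizations = [sample_live_edges(edges, rng) for _ in range(10)]
seeds, value = best_seed_set(nodes, cost, benefit, 2, realizations)
print(sorted(seeds), value)  # ['b', 'd'] 8.0
```

With budget 2, the two cheap seeds b and d together reach more benefit than the single expensive seed a, which is the kind of cost and benefit trade-off that CTVM captures. The enumeration cost grows exponentially with the number of nodes and the evaluation cost grows with the number of samples, which is why keeping the sample set small matters.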
Our contributions are summarized as follows:
• Design an (almost) exact solution to CTVM using two-stage stochastic programming (T-EXACT), which utilizes the sample average approximation method to reduce the number of realizations. Using T-EXACT, we illustrate that traditional stochastic programming techniques suffer badly from scalability issues.

arXiv:1701.08462v1 [cs.SI] 30 Jan 2017