arXiv:1904.11316v2 [cs.LG] 26 Apr 2019

Stability and Optimization Error of Stochastic Gradient Descent for Pairwise Learning

Wei Shen (16482530@life.hkbu.edu.hk)
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong

Zhenhuan Yang (zyang6@albany.edu)
Department of Mathematics and Statistics, State University of New York at Albany, Albany, USA

Yiming Ying (yying@albany.edu)
Department of Mathematics and Statistics, State University of New York at Albany, Albany, USA

Xiaoming Yuan (xmyuan@hku.hk)
Department of Mathematics, The University of Hong Kong, Hong Kong

Abstract

In this paper we study the stability of stochastic gradient descent (SGD) and its trade-off with optimization error in the pairwise learning setting. Pairwise learning refers to learning tasks whose loss function depends on pairs of instances; notable examples include bipartite ranking, metric learning, area under the ROC curve (AUC) maximization, and the minimum error entropy (MEE) principle. Our contribution is twofold. Firstly, we establish stability results for SGD for pairwise learning in the convex, strongly convex, and non-convex settings, from which generalization bounds can be naturally derived. Secondly, we establish the trade-off between the stability and the optimization error of SGD algorithms for pairwise learning. This is achieved by lower-bounding the sum of stability and optimization error by the minimax statistical error over a prescribed class of pairwise loss functions. From this fundamental trade-off, we obtain lower bounds on the optimization error of SGD algorithms and on the excess expected risk over a class of pairwise losses. In addition, we illustrate our stability results with specific examples from AUC maximization, metric learning, and MEE.

Keywords: Stability; Generalization; Optimization Error; Stochastic Gradient Descent; Pairwise Learning; Minimax Statistical Error

1. Introduction

This paper concerns pairwise learning, which usually involves a pairwise loss function, i.e., a loss function that depends on a pair of examples and can be expressed as ℓ(h, (x, y), (x′, y′)) for a hypothesis function h : X → ℝ. This is in contrast to pointwise learning in standard classification and regression, which typically involves a univariate loss function ℓ(h, x, y). Several important learning tasks can be viewed as
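To make the pointwise/pairwise distinction concrete, here is a minimal sketch (not code from the paper) of one pairwise loss named above: the hinge surrogate commonly used for AUC maximization, which penalizes a scorer h whenever it fails to rank a positive instance above a negative one by a margin. The linear scorer and the pair-orientation convention are illustrative assumptions.

```python
import numpy as np

def pairwise_hinge_loss(h, example, example_prime):
    """Hinge surrogate on a pair ((x, y), (x', y')) with labels in {+1, -1}.

    Nonzero only for opposite-label pairs; measures how badly h ranks the
    positive instance relative to the negative one.
    """
    (x, y), (xp, yp) = example, example_prime
    if y == yp:                 # the AUC surrogate is defined on opposite-label pairs
        return 0.0
    if y < yp:                  # orient the pair so x carries the positive label
        x, xp = xp, x
    return max(0.0, 1.0 - (h(x) - h(xp)))

# Illustrative hypothesis: a linear scorer h(x) = <w, x>
w = np.array([1.0, -0.5])
h = lambda x: float(w @ x)

# A well-ranked pair (positive scored well above negative) incurs zero loss.
loss = pairwise_hinge_loss(h, (np.array([2.0, 0.0]), +1),
                              (np.array([0.0, 2.0]), -1))
```

In contrast, a pointwise loss such as the logistic or squared loss would score each `(x, y)` on its own; it is this coupling between two examples that drives the stability analysis in this paper.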