Neural Architecture Search Using Stable Rank of Convolutional Layers

Kengo Machida 1, Kuniaki Uto 1, Koichi Shinoda 1, Taiji Suzuki 2,3
1 Tokyo Institute of Technology, Japan
2 Graduate School of Information Science and Technology, The University of Tokyo, Japan
3 Center for Advanced Intelligence Project, RIKEN, Japan
machida@ks.c.titech.ac.jp, uto@ks.c.titech.ac.jp, shinoda@c.titech.ac.jp, taiji@mist.i.u-tokyo.ac.jp

Abstract

In Neural Architecture Search (NAS), Differentiable ARchiTecture Search (DARTS) has recently attracted much attention due to its high efficiency. It defines an over-parameterized network with mixed edges, each of which represents all operator candidates, and jointly optimizes the weights of the network and its architecture in an alternating way. However, this process prefers a model whose weights converge faster than the others, and such a model with the fastest convergence often leads to overfitting. Accordingly, the resulting model is not always well generalized. To overcome this problem, we propose Minimum Stable Rank DARTS (MSR-DARTS), which aims to find a model with the best generalization error by replacing the architecture optimization with a selection process based on the minimum stable rank criterion. Specifically, each convolution operator is represented by a matrix, and our method chooses the one whose stable rank is the smallest. We evaluate MSR-DARTS on the CIFAR-10 and ImageNet datasets. It achieves an error rate of 2.92% with only 1.7M parameters within 0.5 GPU-days on CIFAR-10, and a top-1 error rate of 24.0% on ImageNet. MSR-DARTS directly optimizes an ImageNet model in only 2.6 GPU-days, whereas it is often impractical for existing NAS methods to search directly on a large-scale dataset such as ImageNet, so a proxy dataset such as CIFAR-10 is typically used.
1 Introduction

Neural Architecture Search (NAS) seeks to design neural network structures automatically and has already been successful on many tasks (Ahn, Kang, and Sohn 2018; Liu et al. 2019; Pham et al. 2018). In NAS, all possible architectures are defined by a search space, which consists of network topologies and operator sets, and a search strategy is used to obtain a better architecture efficiently on the defined search space. As a recent trend in search-space design, a small component of a network called a cell is defined as the optimization target to reduce search cost. For the search strategy, Reinforcement Learning (RL) (Zoph and Le 2017; Zoph et al. 2018; Pham et al. 2018) and Evolutionary Algorithms (EA) (Liu et al. 2018b; Tang, Golbabaee, and Davies 2017; Real et al. 2019) are widely used.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, DARTS (Liu, Simonyan, and Yang 2019) and its derivations (Xie et al. 2019; Chen et al. 2019; Xu et al. 2019; Liang et al. 2019) proposed differentiable approaches that relax the search space to be continuous and thus enable the direct application of gradient-based optimization. These methods are efficient in search cost since they skip the evaluation of each sampled architecture, which is required in RL and EA. The cell defined in these works is a Directed Acyclic Graph (DAG) with multiple nodes, each of which is a latent representation (e.g., a feature map in convolutional networks), and each directed edge is associated with an operator. These works explicitly introduce architecture parameters as learnable parameters in addition to the weight parameters of the over-parameterized network: each edge in the DAG is a mixed edge that includes all candidate operators in the operator set, and each operator is weighted by an architecture parameter.
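As a concrete illustration of a mixed edge, the following minimal numpy sketch weights each candidate operator by a softmax over architecture parameters (the candidate operators here are hypothetical stand-ins, not the paper's actual operator set):

```python
import numpy as np

def softmax(a):
    # Numerically stable softmax over architecture parameters.
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical candidate operators acting on a feature vector x.
ops = [
    lambda x: x,                 # identity (skip connection)
    lambda x: np.zeros_like(x),  # "zero" (no connection) operator
    lambda x: 2.0 * x,           # stand-in for a parameterized op, e.g. a convolution
]

def mixed_edge(x, alpha):
    """DARTS-style mixed edge: softmax-weighted sum over all candidate ops."""
    w = softmax(alpha)
    return sum(wi * op(x) for wi, op in zip(w, ops))

x = np.array([1.0, -2.0, 3.0])
alpha = np.array([0.1, 0.0, 0.5])  # learnable architecture parameters, one per op
y = mixed_edge(x, alpha)
```

In DARTS, `alpha` is trained jointly with the operator weights; as one component of `alpha` dominates, the mixed edge approaches the corresponding single operator.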
An architecture parameter indicates how suitable its operator is within a mixed edge. The architecture parameters are jointly trained with the weight parameters in an alternating way. However, it has been reported that this optimization process tends to produce a fast-converging architecture, which is not always the optimal solution in terms of accuracy (Shu, Wang, and Cai 2020).

We propose a new pipeline named Minimum Stable Rank Differentiable ARchiTecture Search (MSR-DARTS) to solve this problem. In this method, the optimization of the learnable architecture parameters is replaced with a selection process based on the stable rank criterion, so that only the weight parameters of the neural network are trained during the architecture search. The discrete architecture is derived by assuming that only certain convolutional operators (e.g., separable convolutions and dilated convolutions with different kernel sizes) are included in our operator set, in which each convolutional operator is regarded as a matrix. Then we utilize the stable rank (numerical rank) of each convolution to derive a discrete architecture. Specifically, in each mixed edge, the operator with the lowest stable rank is selected. Architecture search based on the stable rank is appropriate considering that the low-rankness of a matrix is related to the generalization ability of neural networks. Several studies (Arora et al. 2018; Suzuki, Abe, and Nishimura 2020) reported that a neural network with lower stable rank operators achieves higher generalization ability, where the stable rank is often used instead of the rank because the former properly captures

arXiv:2009.09209v1 [cs.CV] 19 Sep 2020
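The selection criterion above can be sketched as follows. The stable rank of a matrix W is ||W||_F^2 / ||W||_2^2, i.e., the sum of squared singular values divided by the largest squared singular value. This is a minimal numpy sketch; the flattening convention and candidate kernel shapes are assumptions for illustration, not the paper's exact implementation:

```python
import numpy as np

def stable_rank(W):
    """Stable (numerical) rank: ||W||_F^2 / ||W||_2^2 = sum(s_i^2) / max(s_i)^2."""
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return (s ** 2).sum() / s[0] ** 2

# A convolution with C_out output channels, C_in input channels, and a k x k
# kernel can be flattened into a (C_out, C_in * k * k) matrix.
rng = np.random.default_rng(0)
conv_kernels = [rng.standard_normal((8, 4, k, k)) for k in (3, 5)]  # hypothetical candidates
mats = [w.reshape(w.shape[0], -1) for w in conv_kernels]

# Selection rule: in each mixed edge, pick the candidate with the smallest stable rank.
ranks = [stable_rank(m) for m in mats]
best = int(np.argmin(ranks))
```

Unlike the exact rank, the stable rank varies smoothly with the singular values, so it is robust to many small but nonzero singular values.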