Statistical Inference for Data Integration

Xi Yang, Katherine A. Hoadley, Jan Hannig, J. S. Marron

The University of North Carolina at Chapel Hill

Abstract

In the age of big data, data integration is a critical step, especially in the understanding of how diverse data types work together and work separately. Among the data integration methods, the Angle-Based Joint and Individual Variation Explained (AJIVE) is particularly attractive because it not only studies joint behavior but also individual behavior. Typically, scores indicate relationships between data objects. The drivers of those relationships are determined by the loadings. A fundamental question is which loadings are statistically significant. A useful approach for assessing this is the jackstraw method. In this paper, we develop jackstraw for the loadings of the AJIVE data analysis. This provides statistical inference about the drivers in both joint and individual feature spaces.

Keywords: Data integration; AJIVE; jackstraw; statistically significant.

1. Introduction

Many modern data sets, such as genomic data, involve multiple data types, which are measured on a common set of experimental units. For example, this is an important issue in modern cancer research studied in Section 3, where two important and complementary data types are gene expression and copy number measurements. A useful approach is to organize such different types of data into separate blocks. The Angle-Based Joint and Individual Variation Explained (AJIVE) method [1] assumes a common type of connection among these data blocks and studies the ways that they vary together as well as separately. Thus, AJIVE untangles joint and individual variation and gives unique insights into the structure of these data sets. An important open problem is statistical inference on the AJIVE loadings to determine which are significant drivers of the analysis.
Preprint submitted to Journal of Multivariate Analysis, September 28, 2021. arXiv:2109.12272v1 [stat.AP] 25 Sep 2021

Figure 1 uses a simple Gaussian simulation to illustrate how AJIVE gives modes of variation that are the basis of the inference developed in this paper. This toy example is constructed to clearly