Data Leakage Prevention for Secure Cross-Domain Information Exchange

Kyrre Wahl Kongsgård, Nils Agne Nordbotten, Federico Mancini, Raymond Haakseth and Paal E. Engelstad

Abstract—Cross-domain information exchange is an increasingly important capability for conducting efficient and secure operations, both within coalitions and within single nations. A data guard is a common cross-domain sharing solution that inspects and validates that the security labels of exported data objects are such that they can be released according to policy. While we see that guard solutions can be implemented with high assurance, we find that obtaining an equivalent level of assurance in the correctness of the security labels easily becomes a hard problem in practical scenarios. Thus, a weakness of the guard-based solution is that there is often limited assurance in the correctness of the security labels. To mitigate this, guards make use of content checkers such as dirty word lists as a means for detecting mislabeled data.

To improve the overall security of such cross-domain solutions we investigate more advanced content checkers based on the use of machine learning. Instead of relying on manually specified dirty word lists, we can build data-driven methods that automatically infer the words associated with classified content. However, care must be taken when constructing and deploying these methods as naive implementations are vulnerable to manipulation attacks. In order to provide a better context for performing classification, we monitor the incoming information flow and use the audit trail to construct controlled environments. The usefulness of said deployment scheme is demonstrated using a real collection of classified and unclassified documents.

I. INTRODUCTION

The need for efficient information exchange within national armed forces, coalitions, and between military and civilian entities has received significant attention in recent years.
This need is in strong contrast with the traditional approach to securing classified military systems, where isolation of security domains and information systems has been the default approach. Thus, concepts such as NATO's Information Exchange Gateways (IEGs), and similar initiatives within the nations, have been established to enable cross-domain information exchange in a secure manner.

These cross-domain solutions are required to perform various security controls (e.g., information flow control, antivirus, and access control) to ensure that the interconnection does not compromise confidentiality, integrity, or availability. In addition, non-security-specific requirements, such as what type of information needs to be exchanged (e.g., friendly force identification, chat, or documents) and protocol-specific details, may also impact security and what type of security controls are required. A key challenge, particularly when interconnecting domains at different classification levels, is to ensure sufficient assurance in the information flow control so that classified data is not leaked.

Solutions for collaboration and information sharing across security domains may generally be categorized as transfer solutions or access solutions. A transfer solution enables the transfer of information from one domain to another, while an access solution provides a user access to services and/or information within another domain without logically transferring the information from that domain. In the latter case the access solution itself may be viewed as an extension of the domain to be accessed, imposing the domain separation requirements on the access solution (e.g., a thin client connected by a secure tunnel). Transfer solutions may be further categorized based on their ability to provide one-way or two-way transfer.
For example, one-way data diodes are frequently used when information needs to be moved from a lower classified domain to a higher classified domain, while two-way information exchange may be enabled using a security filter or guard. We here use the term guard to refer to solutions that base their release decisions (at least partly) on security labels, although a guard may otherwise perform checks similar to those of a security filter (e.g., ensuring that data objects conform to some predefined format). Assuming that security labels are correct, a guard may provide stronger security than a security filter alone, since a security filter can typically be bypassed by anyone who knows the allowed message format. This may to some extent be mitigated by having the security filter authenticate senders, but the use of security labels nevertheless provides an additional layer of security and also applies better to content whose sensitivity typically cannot be determined from its format or type, such as documents, emails, or chat messages.

Before a user or service can initiate a request to export a data object, the data object must first be assigned a security label. This label is cryptographically bound to the data object. While the integrity of the data object and security label as such is cryptographically protected during transfer and storage, it is much more difficult to ensure that the correct security label is attached in the first place. For instance, if a RESTRICTED document is labelled as UNCLASSIFIED, it may result in it being released to an unclassified environment (i.e., leaked). Such mislabelling may be due to human or technical errors, or to users or malware trying to bypass security controls. While the use of high assurance operating systems and applications may significantly reduce the risk of technical errors and malware, the use of commodity general-purpose operating systems and applications is often mandated for practical and economic reasons.
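To make the release decision concrete, the following is a minimal sketch of a label-based guard with a dirty word check. It is an illustration under stated assumptions, not the guard described in this paper: the HMAC-based binding, the single linear ordering in `LEVELS`, and the names `bind_label` and `guard_release` are all hypothetical, and a real guard would use a richer policy and a standardized binding mechanism.

```python
import hashlib
import hmac

# Hypothetical linear ordering of classifications (an assumption for this sketch).
LEVELS = {"UNCLASSIFIED": 0, "RESTRICTED": 1, "CONFIDENTIAL": 2, "SECRET": 3}


def bind_label(key: bytes, label: str, content: bytes) -> bytes:
    """Bind a label to content with HMAC-SHA256 (assumed binding mechanism)."""
    encoded = label.encode("utf-8")
    # Length-prefix the label so label and content cannot be confused.
    msg = len(encoded).to_bytes(4, "big") + encoded + content
    return hmac.new(key, msg, hashlib.sha256).digest()


def guard_release(key, label, content, tag, target_level, dirty_words):
    """Return (released, reason) for an export request to a target domain."""
    if label not in LEVELS:
        return False, "unknown label"
    # 1. The label must be the one that was bound to this exact content.
    if not hmac.compare_digest(bind_label(key, label, content), tag):
        return False, "label binding invalid"
    # 2. Policy: the label must not exceed the level of the target domain.
    if LEVELS[label] > LEVELS[target_level]:
        return False, "label above target domain level"
    # 3. Content check: a dirty word hit suggests the object is mislabelled.
    text = content.decode("utf-8", errors="replace").lower()
    hits = sorted(w for w in dirty_words if w.lower() in text)
    if hits:
        return False, "dirty words found: " + ", ".join(hits)
    return True, "released"


key = b"demo-key"
doc = b"Convoy schedule for operation bluebird"
tag = bind_label(key, "UNCLASSIFIED", doc)
print(guard_release(key, "UNCLASSIFIED", doc, tag, "UNCLASSIFIED", {"bluebird"}))
```

In this example the binding verifies and the label satisfies the policy, so only the dirty word check stops the release. A mislabelled document that avoids every listed word would pass all three steps, which is the limitation of manually specified lists noted in the abstract.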
This lack of assurance in end-user systems may in some cases be mitigated by labelling data objects based on origin, where a potentially high assurance intermediary mechanism (e.g., gateway) labels all data from a given origin (e.g., computer or network) with a given classification (e.g., RESTRICTED). However, this