Data Leakage Prevention for Secure Cross-Domain Information Exchange

Kyrre Wahl Kongsgård, Nils Agne Nordbotten, Federico Mancini, Raymond Haakseth and Paal E. Engelstad

Abstract—Cross-domain information exchange is an increasingly important capability for conducting efficient and secure operations, both within coalitions and within single nations. A data guard is a common cross-domain sharing solution that inspects and validates that the security labels of exported data objects are such that they can be released according to policy. While we see that guard solutions can be implemented with high assurance, we find that obtaining an equivalent level of assurance in the correctness of the security labels easily becomes a hard problem in practical scenarios. Thus, a weakness of the guard-based solution is that there is often limited assurance in the correctness of the security labels. To mitigate this, guards make use of content checkers such as dirty word lists as a means for detecting mislabeled data.

To improve the overall security of such cross-domain solutions we investigate more advanced content checkers based on the use of machine learning. Instead of relying on manually specified dirty word lists, we can build data-driven methods that automatically infer the words associated with classified content. However, care must be taken when constructing and deploying these methods as naive implementations are vulnerable to manipulation attacks. In order to provide a better context for performing classification, we monitor the incoming information flow and use the audit trail to construct controlled environments. The usefulness of said deployment scheme is demonstrated using a real collection of classified and unclassified documents.

I. INTRODUCTION

The need for efficient information exchange within national armed forces, coalitions, and between military and civilian entities has received significant attention in recent years.
This need is in strong contrast with the traditional approach to securing classified military systems, where isolation of security domains and information systems has been the default approach. Thus, concepts such as NATO’s Information Exchange Gateways (IEGs), and similar initiatives within the nations, have been established to enable cross-domain information exchange in a secure manner.

These cross-domain solutions are required to perform various security controls (e.g., information flow control, antivirus, and access control) to ensure that the interconnection does not compromise confidentiality, integrity, or availability. In addition, non-security-specific requirements, such as what type of information needs to be exchanged (e.g., friendly force identification, chat, or documents) and protocol-specific details, may also impact security and what type of security controls are required. A key challenge, particularly when interconnecting domains at different classification levels, is to ensure sufficient assurance in the information flow control so that classified data is not leaked.

Solutions for collaboration and information sharing across security domains may generally be categorized as transfer solutions or access solutions. A transfer solution enables the transfer of information from one domain to another, while an access solution provides a user with access to services and/or information within another domain without logically transferring the information from that domain. In the latter case the access solution itself may be viewed as an extension of the domain to be accessed, imposing the domain separation requirements on the access solution (e.g., a thin client connected by a secure tunnel). Transfer solutions may be further categorized based on their ability to provide one-way or two-way transfer.
For example, one-way data diodes are frequently used when information needs to be moved from a lower classified domain to a higher classified domain, while two-way information exchange may be enabled using a security filter or guard. We here use the term guard to refer to solutions that base their release decisions (at least partly) on security labels, although a guard may otherwise perform checks similar to those of a security filter (e.g., ensuring that data objects conform to some predefined format). Assuming that security labels are correct, a guard may provide stronger security than a security filter alone, as a security filter may typically be bypassed by anyone knowing the allowed message format. This may to some extent be mitigated by having the security filter authenticate senders, but the use of security labels nevertheless provides an additional layer of security and also applies better to content whose sensitivity typically cannot be determined by its format or type, such as documents, emails, or chat messages.

Before a user or service can initiate a request to export a data object, the object must first be assigned a security label. This label is cryptographically bound to the data object. While the integrity of the data object and security label as such is cryptographically protected during transfer and storage, it is much more difficult to ensure that the correct security label is attached in the first place. For instance, if a RESTRICTED document is labelled as UNCLASSIFIED, it may be released to an unclassified environment (i.e., leaked). Such mislabelling may be due to human or technical errors, or to users or malware trying to bypass security controls. While the use of high-assurance operating systems and applications may significantly reduce the risk of technical errors and malware, the use of commodity general-purpose operating systems and applications is often mandated for practical and economic reasons.
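The guard checks described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the HMAC is only a stand-in for whatever cryptographic binding a real deployment uses, and the names `LEVELS`, `bind_label`, and `guard_release`, the ordering of levels, and the example dirty word are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical ordering of classification levels, lowest first.
LEVELS = ["UNCLASSIFIED", "RESTRICTED", "CONFIDENTIAL", "SECRET"]


def bind_label(key: bytes, data: bytes, label: str) -> bytes:
    """Return a MAC over label and data (a stand-in for the real binding)."""
    # Length-prefix the label so different (label, data) pairs cannot
    # produce the same MAC input.
    encoded = label.encode("utf-8")
    msg = len(encoded).to_bytes(4, "big") + encoded + data
    return hmac.new(key, msg, hashlib.sha256).digest()


def guard_release(key, data, label, binding, max_level, dirty_words):
    """Decide whether a labelled object may be released; returns (ok, reason).

    max_level is the highest level the receiving domain may hold and is
    assumed to be one of LEVELS.
    """
    # 1. The label must still be bound to exactly this data object.
    if not hmac.compare_digest(bind_label(key, data, label), binding):
        return False, "binding invalid"
    # 2. The label must be releasable to the receiving domain.
    if label not in LEVELS or LEVELS.index(label) > LEVELS.index(max_level):
        return False, "label exceeds policy"
    # 3. Content check: a simple dirty word list to catch mislabelled data.
    text = data.decode("utf-8", errors="replace").lower()
    hits = sorted(w for w in dirty_words if w.lower() in text)
    if hits:
        return False, "dirty words: " + ", ".join(hits)
    return True, "released"
```

In this sketch, a valid binding only shows that the label has not changed since it was attached. A mislabelled object passes the first two checks and is stopped only if the content check happens to flag it, which is why the assurance of the labelling step matters.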
This lack of assurance in end-user systems may in some cases be mitigated by labelling data objects based on origin, where a potentially high-assurance intermediary mechanism (e.g., a gateway) labels all data from a given origin (e.g., a computer or network) with a given classification (e.g., RESTRICTED). However, this