978-1-4244-7445-5/10/$26.00 ©2010 IEEE

Text Mining of Personal Communication
Understanding the Technical and Privacy Related Challenges

Håkan Jonsson
Corporate Technology Office, Sony Ericsson, Lund, Sweden
hakan1.jonsson@sonyericsson.com

Pierre Nugues, Christofer Bach, Johan Gunnarsson
Computer Science, Lund University, Lund, Sweden
pierre.nugues@cs.lth.se, buffyin@gmail.com, johan.gunnarsson@gmail.com

Abstract: This paper reports on work on a new service that applies text mining to SMS data: SMSTrends. The service extracts trends in the form of keywords from SMS messages sent and received by ad hoc location-based communities of users. Trends are then presented to the user using a phone widget, which is regularly updated to show the latest trends. This allows the user to see what the user community is texting about, and makes her aware of what is going on in this community. Privacy considerations of the service are governed by user expectations and regulations. Brenner and Wang [1] discussed mining of personal communication in operator bit pipes. We expand on this by looking deeper into privacy and regulatory aspects through the specific example of SMSTrends. In particular, we introduce the use of adaptive location granularity selection.

Keywords: text mining; messaging; location; context awareness; collective awareness; privacy

I. INTRODUCTION

Personal communication such as SMS is considered highly private. This, combined with privacy and data protection regulations, makes it very hard to develop services and applications, or to do research, that require a priori access to large amounts of SMS messages. Examples of such services are text prediction engines and marketing analytics on SMS.

A. Background

The work on the SMSTrends service was started as a research project to extract named entities from SMS messages (SMS).
When we discovered the problems of finding or collecting a relevant corpus of SMS messages to carry out the project, the corpus collection became a topic in itself: Under what conditions are users ready to give others access to their SMS? As SMS messages are private data exchanged between two parties, a classical approach to corpus collection (automatic gathering from machine-readable documents or transcriptions from printed sources) is not applicable. A first naïve request to our colleagues to hand us their SMS messages for the sake of science failed miserably. We started the SMSTrends application in an attempt to offer them a benefit in return for sharing their SMS data. After a small group of users had tried it (about a third of the people asked), few wanted to continue using it unless they could mark messages as secret, to make sure those messages were not used by the service. After this feature was introduced, a small group continued to use the service. However, the user group is still too small to draw any conclusions regarding the end user value of the service compared to the cost of the user information, and further studies with larger groups are needed.

B. The Service

The service extracts trends in the form of keywords from SMS messages sent and received by the users of the application. Trends are then presented to a user using a phone widget, which is regularly updated to show the latest trends. This allows her to see what a user community is texting about, and makes her aware of what is going on in this community.

Figure 1. SMSTrends widget screenshot
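
The idea of extracting keyword trends while leaving out messages marked as secret can be illustrated with a minimal sketch. It is not the SMSTrends implementation, which this section does not specify. It assumes a simple document-frequency ranking, and the `Message` type, its `secret` flag, the `STOP_WORDS` list and the `top_keywords` function are all hypothetical names introduced for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Hypothetical, deliberately tiny stop-word list (assumption, not from the paper).
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "at", "for", "i", "in", "is",
    "it", "me", "my", "of", "on", "the", "to", "u", "you",
})


@dataclass(frozen=True)
class Message:
    """One SMS; `secret` is a hypothetical flag set by the user."""
    text: str
    secret: bool = False


def top_keywords(messages, n=5):
    """Return up to n keywords ranked by the number of messages containing them.

    Messages marked as secret are skipped entirely, so their words never
    contribute to the counts.
    """
    counts = Counter()
    for msg in messages:
        if msg.secret:
            continue
        # Count each word at most once per message (document frequency).
        tokens = set(re.findall(r"[a-zåäö]+", msg.text.lower()))
        counts.update(tokens - STOP_WORDS)
    # Sort by descending count, then alphabetically, for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]
```

Calling `top_keywords(messages, n=5)` on the pooled messages of a community would give the list a widget could display. Counting each word once per message is one possible design choice: it keeps a single repetitive message from dominating the trend list.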