Archiving and Categorization of E-Mail Systems

Nicholas A. Wagner and Ian H. MacGregor


With most adults in the western world now transitioned into e-mail use, an untold amount of textual information is available on personal interests and preferences. The vast majority of such e-mail, whether private or job-related, belongs to corporations. Whether the e-mail account in question is set up through a person’s employer or through commercial agents such as Yahoo or Hotmail, usage contracts generally identify the e-mail as belonging to the corporation. Even deleting e-mail from an account provides no guarantee that the e-mail is actually removed from the storage source.

While it has always been clear that e-mail information represents a potential privacy concern, analysis of the information has not been commercially viable, thereby providing a sense of privacy to the end-user. While e-mail systems have provided reasonable (if often inadequate) security to protect the privacy of individuals, the corporate owners of the e-mails have access to all messages sent and received. Given the availability of a commercially viable method for extracting consumer marketing data, it is likely just a matter of time before corporations attempt to capitalize on the available information.

The main focus of this paper was to determine whether automatic analysis of e-mails is becoming a viable option for corporations. Using a new text analysis tool named Latent Categorization Method (LCM) developed by Larsen and Monarchi (2004), the study aimed to collect the ‘sent e-mail’ of volunteers (sent e-mail was used to get a true measure of interest rather than any e-mail received, which could be spam) and attempt to create personas for marketing purposes. The personas were based on characteristics that were determined by the most frequently occurring latent semantic topics in the volunteers’ outgoing electronic mail.

The analysis tool, LCM, had previously been used to categorize the topics occurring in academic journal abstracts. The process begins by creating an ASCII text file of all of the data, which in this case is a collection of e-mail message bodies provided by volunteers. Each message is tagged with a short code identifying its author, so that the interpreted data can be matched back to that author. Next, a parser removes all words that have little or no relevance to building a persona, such as ‘has,’ ‘go,’ ‘you,’ ‘as,’ and many more. Proper names are also removed, using a database of items for the parser to discard.
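The parsing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LCM parser itself: the stop-word list, the proper-name list, and the function names (tokenize, filter_tokens, preprocess) are all hypothetical, since the paper does not publish its actual lists or code.

```python
import re

# Hypothetical lists for illustration; the study's real stop-word and
# proper-name databases are not given in the text.
STOP_WORDS = {"has", "go", "you", "as", "the", "a", "to", "is", "for", "and", "of"}
PROPER_NAMES = {"alice", "bob"}


def tokenize(body):
    """Lower-case a message body and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", body.lower())


def filter_tokens(tokens):
    """Drop stop words and proper names, keeping the remaining tokens in order."""
    return [t for t in tokens if t not in STOP_WORDS and t not in PROPER_NAMES]


def preprocess(messages):
    """Group filtered tokens by author.

    messages: iterable of (author_id, body) pairs.
    Returns a dict mapping each author_id to a list of token lists,
    one token list per message, so results can be matched back to the author.
    """
    by_author = {}
    for author_id, body in messages:
        by_author.setdefault(author_id, []).append(filter_tokens(tokenize(body)))
    return by_author
```

For example, preprocess([("P1", "Alice, the paper is due as you know")]) keeps only the tokens paper, due, and know for author P1.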

Another database is created to group words with similar meanings together. For example, if an e-mail spoke of the sport of running, all variations of the word (running, ran, etc.) would be pointed to the root stem, run. The words that make it through the parser and the stemming functions are then analyzed, and visual representations are created. The next steps include weighting of terms, decomposition of matrices, and clustering of texts (e-mails). It should be noted that LCM does not simply count words, but rather examines latent semantic relationships between sets of co-occurring words.
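As a rough sketch of the stemming lookup and the term-weighting step, the following Python maps word variants to a root stem and then weights each stem per message. The stem table is hypothetical, and TF-IDF is an assumption chosen for illustration: the text says only that terms are weighted and does not name the scheme LCM uses. The matrix decomposition and clustering steps that follow in LCM are not shown here.

```python
import math
from collections import Counter

# Hypothetical stem table; the study's actual stemming database is not given.
STEM_TABLE = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "papers": "paper",
    "recitation": "recit",
    "recitations": "recit",
}


def stem(token):
    """Point a token at its root stem, or return it unchanged if unlisted."""
    return STEM_TABLE.get(token, token)


def tfidf(documents):
    """Weight stems by TF-IDF (an assumed scheme, not confirmed for LCM).

    documents: list of token lists, one per e-mail.
    Returns a list of {stem: weight} dicts, one per e-mail, where
    weight = raw count in the e-mail * log(number of e-mails / e-mails containing the stem).
    """
    stemmed = [[stem(t) for t in doc] for doc in documents]
    n_docs = len(stemmed)
    doc_freq = Counter()
    for doc in stemmed:
        doc_freq.update(set(doc))  # count each stem once per e-mail
    weighted = []
    for doc in stemmed:
        counts = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return weighted
```

Under this scheme a stem that appears in every e-mail receives a weight of zero, which is one simple way of discounting terms that do not distinguish one message from another.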

The volunteers who provided e-mails for the analysis were mostly teaching assistants, and therefore their e-mail tended to focus on academically related topics. Frequently occurring stems were ‘recit’ (for recitation), ‘paper,’ and ‘career.’ Misspelled and colloquial (‘spoken’) words could have been matched more precisely, so that they were stemmed rather than discarded from the data set. A larger and more diverse volunteer base would also have shown what other kinds of topics come up. Despite these limitations, the results were interesting and informative.

The results yielded predictable profile conclusions given what was known about the participating volunteers, which allowed the accuracy of the results to be verified. Upon analysis, remarkable similarities between each individual and the corresponding profile were seen, especially when the limited scope of e-mails and the small volunteer base were taken into account. Considering that all volunteers were students, 87.5 percent of whom were teaching assistants, and that all e-mails were collected from university e-mail accounts, the results were very much in line with typical correspondence between teachers and students or on subjects relating to class work. An example is volunteer 1 (labeled P1). The results indicated that the most frequently occurring semantic topics were paper, assign, recit (recitation), class, and busi (business), among others. The topic ‘paper’ alone accounts for a little over 10% of the most important topics. Thus it is safe to deduce that this volunteer discusses class papers relating to business and teaches a recitation. In a consumer e-mail situation, such topics might be of more commercial importance; if a topic relating to complaints about vacuuming and cleaning appears, the individual would likely be a perfect recipient of marketing for the Roomba Robotic vacuum product. More details are discussed in the paper, including the possibility of matching like personalities.
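The share of a single topic in a profile, such as the figure quoted for ‘paper’ above, is a simple ratio of that topic's weight to the total weight of the topics considered. The sketch below shows that arithmetic and a ranking of the top topics. The function names and the example weights are made up for illustration and are not the study's data.

```python
def topic_shares(topic_weights):
    """Convert raw topic weights into fractions of the total weight."""
    total = sum(topic_weights.values())
    if total == 0:
        return {topic: 0.0 for topic in topic_weights}
    return {topic: weight / total for topic, weight in topic_weights.items()}


def top_topics(topic_weights, n=5):
    """Return the n topics with the largest shares as (topic, share) pairs."""
    shares = topic_shares(topic_weights)
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up weights for a hypothetical volunteer, NOT results from the study.
example_weights = {"paper": 3, "assign": 2, "recit": 1}
```

With these made-up weights, ‘paper’ holds 3 of 6 units of weight, so top_topics(example_weights, 2) ranks it first with a share of one half, followed by ‘assign’ with one third.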

With technological advances in data processing, storage, and interpretation, such as the LCM approach coupled to an e-mail system, there are numerous ethical implications. The authors relate these to the Individual versus Community paradigm, as illustrated in Rushworth M. Kidder’s book, How Good People Make Tough Choices: Resolving the Dilemmas of Ethical Living (1995). However, the authors in this case prefer the term Company versus Community, since the issues at hand deal with an organization that provides e-mail services. The possible monetary gains for the employer in using LCM with its e-mail systems can only be realized after an extensive evaluation of its commitment to service and of the commitments to privacy (if any) in its terms of service agreement.

What if there are ambiguous statements, a privacy policy has been omitted, or e-mail categorization practices do not correspond to the company’s commitment to service? The company would be in a situation that would require the analysis of whether categorization would intrude on their users’ privacy, and whether they are willing to set the example for other e-mail vendors. Would they be willing to accept the responsibility of contributing to the reinforcing loop of keeping up with competitors who also analyze their e-mail systems?

In conclusion, the study shows that mass analysis of e-mail for the purpose of creating personas is no longer a future scenario, but currently plausible, probable, and commercially viable. Not only does such analysis promise to be available to corporations, but the profiles created using such text analysis tools are likely to be of a quality surpassing all other consumer profile techniques currently available. This conclusion takes on added validity when network analysis of e-mail addresses is added to the mix of available information, further developing maps of profile interactions. To head off the potential meltdown of privacy protection represented by the approaches described in this paper, privacy advocates must act in an expeditious manner.


Kidder, Rushworth M. (1995). “How Good People Make Tough Choices: Resolving the Dilemmas of Ethical Living.” New York: William Morrow and Company.

Larsen, Kai R. and Monarchi, David E. (2004). “A Mathematical Approach to Categorizations and Labeling of Qualitative Data: The Latent Categorization Method.” Sociological Methodology, 20(1), pp. 349-400.