Suppose we wished to create an intelligent machine, and the web was the chosen source of information. More specifically, suppose we relied on Wikipedia as the resource from which this intelligent machine would derive its knowledge base. Any act of Wikipedia vandalism would then affect the knowledge base of this intelligent system, and the system might develop confidence in entirely incorrect information. Ethically, should we develop a machine which can craft its own knowledge base without reference to the veracity of the material it considers? If we did, what kinds of “beliefs” might such a machine come to hold? How do we address the veracity of such materials so that such a learning machine might distinguish between truth and lie, and what kinds of conclusions might it derive about our world as a consequence? If we are trying to construct an ethical machine, how appropriate can its ethical outcomes be considered in the presence of deceptive data? And, finally, how much of the Web is deceptive?
In this paper, we will investigate the nature and, importantly, the detectability of deception, both at large and in relation to the web. Deception appears to be increasingly prevalent in society, whether deliberate, accidental, or simply ill-informed. Examples of deception are readily available, from individuals deceiving potential partners on dating websites, to surveys which make headlines about “Coffee causing Hallucinations” with no medical evidence and very little scientific rigour, to companies which collapsed due to deceptive financial practices (e.g. Enron, WorldCom), and segments of the financial industry allegedly misrepresenting risk in order to derive substantial profits. We envisage a Web Filter which could be used equally well as an assistive service for human readers, and as a mechanism within a system that learns from the web.
So-called Deception Theory, and the possibility of modelling human deception processes, is of interest to experts in different subject fields for differing reasons and with different foci. Most research has been directed towards human physical reactions in relation to co-located (face-to-face, synchronous) deception, largely considering non-verbal cues involving body language, eye movements, vocal pitch and so on, and how to detect deceptions on the basis of such cues. Such research is interesting for sociologists in terms of how deception is created, for criminologists in trying to differentiate the deceptive from the truthful, and for computer vision researchers in identifying such cues automatically across participants within captured video. Alternative communication media, in which participants are distributed, communications are asynchronous, and cues can only be captured from the artefact of the communication (the verbal), require entirely different lines of expertise and investigation.
Lexical and grammatical analysis is the typical approach to recognizing deception in verbal communication, and such approaches may be suitable for identifying deception on the web. It is assumed that deception leads to identifiable, yet unconscious, lexical selection and the forming of certain grammatical structures (Toma and Hancock, 2010), and these may act as generally useful cues for deceptive writing. From numerous researchers (Burgoon et al., 2003; Pennebaker et al., 2003; Newman et al., 2003; Zhou et al., 2003), we find that such cues can be divided into four principal groups:
1. Use of more negative emotion words
2. Use of distancing strategies, i.e. fewer self-references
3. Use of larger proportions of rare and/or long words
4. Use of more emotion words
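As a rough illustration of how the four cue groups listed above might be measured, the following minimal sketch computes each one as a proportion of the words in a text. It is not the method of any of the cited studies: the word lists, the seven-letter threshold for a "long" word, and the names `cue_rates` and `tokenize` are all assumptions made for this example, and a real system would rely on validated dictionaries.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration only).
NEGATIVE_EMOTION = {"hate", "afraid", "worthless", "sad", "angry"}
POSITIVE_EMOTION = {"happy", "love", "great", "glad"}
SELF_REFERENCES = {"i", "me", "my", "mine", "myself"}
LONG_WORD_MIN_LENGTH = 7  # assumed threshold for a "long" word


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def cue_rates(text):
    """Return each cue as a proportion of all words in the text."""
    words = tokenize(text)
    total = len(words)
    names = ("negative_emotion", "self_reference", "long_word", "emotion")
    if total == 0:
        return {name: 0.0 for name in names}

    def rate(predicate):
        return sum(1 for w in words if predicate(w)) / total

    return {
        "negative_emotion": rate(lambda w: w in NEGATIVE_EMOTION),
        "self_reference": rate(lambda w: w in SELF_REFERENCES),
        "long_word": rate(lambda w: len(w) >= LONG_WORD_MIN_LENGTH),
        "emotion": rate(lambda w: w in NEGATIVE_EMOTION or w in POSITIVE_EMOTION),
    }
```

Calling `cue_rates` on a text returns a dictionary of four proportions. On their own these values say nothing about deception; they would need to be compared against reference values for the relevant kind of text.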
To demonstrate that deception detection might be possible, Pennebaker and colleagues developed a text analysis program called Linguistic Inquiry and Word Count (LIWC), which analyzes texts against an internal dictionary (Pennebaker, Francis, & Booth, 2001; Pennebaker, Booth, & Francis, 2007). Each word in the text can belong to one or more of LIWC’s 70 dimensions, which include general text measures (e.g. word count), psychological indicators (e.g. emotions), and semantically-related words (e.g. temporally and spatially related words). We submitted the BBC’s “’Visions link’ to coffee intake” article, alluded to earlier, to the free online version of LIWC. Results, included below, show an absence of self-references, few positive emotions, and a large proportion of “big words”. Of course, such an analysis is inconclusive, as such features may also be true of scientific articles or textbooks, and LIWC leaves the interpretation up to us.
Our aim is to create a system that can identify deceptive texts and also explain which are the most deceptive sentences. Such a system could act as an effective Web Filter for both human and machine use. We assume that such a system should be geared towards peaks of deception, which may occur in texts that are otherwise not deceptive and may therefore be lost in the aggregate. We must also account for the systematic variations that exist across different text genres in order to control for them, and ascertain the threshold values for the various factors which give us appropriate confidence in our identification.
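The point about peaks being lost in the aggregate can be made concrete with a small sketch that scores each sentence separately and ranks the sentences by score. This is only an assumption about how such a system might be organized, not a description of the system itself: the cue list `CUE_WORDS`, the density score, and the function names are hypothetical stand-ins for whatever scoring model is eventually used.

```python
import re

# Hypothetical cue list standing in for a real scoring model (an assumption).
CUE_WORDS = {"never", "honestly", "nobody", "hate"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def cue_density(text):
    """Proportion of words in the text that appear in CUE_WORDS."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in CUE_WORDS) / len(words)


def rank_sentences(text):
    """Return (score, sentence) pairs, highest-scoring sentence first."""
    scored = [(cue_density(s), s) for s in split_sentences(text)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sample = (
    "The report covers four quarters of sales data. "
    "Honestly, nobody ever asked. "
    "Figures are listed in the appendix."
)
whole_text_score = cue_density(sample)       # score over the whole text
top_score, top_sentence = rank_sentences(sample)[0]  # the sentence-level peak
```

In this sample the second sentence scores far higher than the text as a whole, so a threshold applied only to the whole-text score could miss it. Genre-specific thresholds, as discussed above, would still be needed before any such score could be interpreted.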
The full paper will initially review the literature relating to deception in general, and distinguish between deceptions and lies. In the process, we will offer up some interesting – and occasionally amusing – examples of deception. We will then focus on text-based deception, and we will discuss initial experiments geared towards the development of the system mentioned above. One of these experiments may even demonstrate how a paper supposedly geared towards deception detection, whose conclusions fail to fit its aim, was never likely to achieve that aim. We will also make mention of one or two other interesting examples of academic deception and/or lies.
Toma, C.L. and Hancock, J.T. (2010). Reading between the Lines: Linguistic Cues to Deception in Online Dating Profiles. Proceedings of the ACM conference on Computer-Supported Cooperative Work (CSCW 2010).
Burgoon, J.K., Blair, J.P., Qin, T., and Nunamaker, J.F. (2003). Detecting deception through linguistic analysis. Intelligence and Security Informatics, 2665.
Pennebaker, J.W., Booth, R.J., and Francis, M.E. (2007). “Linguistic Inquiry and Word Count: LIWC 2007”. Austin, TX: LIWC (www.liwc.net).
Pennebaker, J.W., Francis, M.E., and Booth, R.J. (2001). “Linguistic Inquiry and Word Count: LIWC 2001”. Mahwah, NJ: Erlbaum Publishers.