Kimppa, K. K.
Donor = the person to whom the voice belongs
Producer = the producer of the speech synthesis software
User = the party using the software for synthesizing speech (of the donor).
A research group at the University of Turku is developing speech synthesis software which can synthesize the voice of anyone based on a set of rules. The project has raised certain ethically relevant questions: Who, if anyone, owns a person's voice? What kinds of rights, if any, can anyone have to a voice? The development of this and other similar software is active, and these questions are bound to arise in the near future, if they have not already.
In this paper we aim to look at the problem from an ethical perspective. The aim is to show that the rights of a person to their voice, the rights of someone using such software and the rights (typically copyrights and patents) of the producer of the software seem to create tensions for the use of such software, tensions which need to be resolved. We assume software capable of synthesizing a voice which is indistinguishable from the actual voice of a person, whether judged by listening to the synthesizer produce sentences or even by technical means.
Speech synthesis systems have been available for decades, and several ways to produce synthesized speech have emerged. The study at hand concerns an older method of creating artificial speech, rule-based speech synthesis. Rule-based speech synthesis may be considered a truly synthetic way to produce speech, since it does not make use of any samples of natural human speech as most other synthesis methods do. On the other hand, rule-based speech synthesis is commonly considered to be the most challenging way to produce high-quality synthetic speech. However, our own latest studies have shown promising results in overcoming the unnatural quality of rule-based artificial speech. In fact, it may be difficult to tell a natural speaker apart from a synthetic speech sample. Other synthesis methods, such as HMM-based synthesizers, also achieve a high level of naturalness.
The following diagram demonstrates the steps included in the speech synthesis. Transcription translates an orthographic text into a phonetic alphabet; after the transcription, the written text is in the form of a pronunciation. The transcription, together with the setting of durations and fundamental frequency, is often referred to as the syntagmatic part of creating synthetic speech. The intonation is generated in the syntagmatic steps. The phoneme rules that follow are acoustic models for each speech sound. The synthetic speech is produced and played back after the acoustic model of each sound has been generated. The audio signal is generated using a Klatt-type signal generator (Klatt 1980). The system design is described in greater detail in Saarni et al. (2006).
Diagram 1. Steps in generating rule-based synthetic speech
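As an illustration only, the steps above can be sketched in Python as follows. This is not the implementation described in Saarni et al. (2006): the letter-to-phoneme table, the duration values, the falling pitch contour and the formant targets are hypothetical placeholders chosen for the example, all function and variable names are our own, and the final Klatt-type signal generation is omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy tables (assumptions for illustration, not measured data).
LETTER_TO_PHONEME = {"a": "A", "i": "i", "k": "k", "s": "s", "t": "t"}
VOWEL_FORMANTS_HZ = {"A": (700.0, 1100.0), "i": (300.0, 2300.0)}


@dataclass
class Segment:
    phoneme: str
    duration_ms: float
    f0_hz: float
    formants_hz: Optional[Tuple[float, float]]  # None for consonants here


def transcribe(text: str) -> List[str]:
    """Syntagmatic step: orthographic text to phoneme symbols."""
    return [LETTER_TO_PHONEME[ch] for ch in text.lower() if ch in LETTER_TO_PHONEME]


def set_durations(phonemes: List[str]) -> List[float]:
    """Syntagmatic step: one duration per phoneme (vowels longer, by assumption)."""
    return [120.0 if p in VOWEL_FORMANTS_HZ else 70.0 for p in phonemes]


def set_f0(count: int, start_hz: float = 120.0, end_hz: float = 90.0) -> List[float]:
    """Syntagmatic step: a linearly falling fundamental frequency contour."""
    if count <= 1:
        return [start_hz] * count
    step = (end_hz - start_hz) / (count - 1)
    return [start_hz + i * step for i in range(count)]


def apply_phoneme_rules(
    phonemes: List[str], durations: List[float], f0: List[float]
) -> List[Segment]:
    """Phoneme rules: attach a (toy) acoustic model to each speech sound."""
    return [
        Segment(p, d, f, VOWEL_FORMANTS_HZ.get(p))
        for p, d, f in zip(phonemes, durations, f0)
    ]


def plan_utterance(text: str) -> List[Segment]:
    """Run the steps in order; a signal generator would consume the result."""
    phonemes = transcribe(text)
    return apply_phoneme_rules(phonemes, set_durations(phonemes), set_f0(len(phonemes)))


if __name__ == "__main__":
    for segment in plan_utterance("kissa"):
        print(segment)
```

The point of the sketch is that no recorded sample of the donor's speech enters the pipeline at any step: everything that characterizes the voice is carried by rule parameters, which is what makes the question of who has rights to those parameters relevant.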
Ethical questions raised
The main question seems to be whether a person has a natural right to their voice. Does a person have an innate, unquestionable right to control any use of their voice? Some exceptions to this have apparently been made: imitators, for example, often use the voices of political figures in their shows, frequently in a sarcastic context. However, the context in which the voice is used is then identifiable as a show. Applications such as answering machine messages recorded by an imitator to imitate celebrity voices are closer to the problem presented. Do the persons whose voices are being used have a right to control the use of their voice in these kinds of applications?
The second apparent question is whether the speech of a person can be misused. If, for example, Osama Bin Laden were in fact dead (no video of him independently confirmed to be current has been seen for quite some time now), synthesis software could easily enough be used to propagate the same ideas he has been known to support. The authenticity of some recordings which are claimed to be made by him has actually been questioned.
With these questions in mind, does the producer have some ethical responsibilities, and if so, what are they? Should the output of the synthesizer be watermarked in some way to distinguish it from the original speech of the person? Is the responsibility the producer's or the user's? Should the donor be automatically compensated for the use of their voice? Should they be able to control how and for what purposes their voice is used? Should the system resemble the copyright system, in which the main argument is the right to exclude others from using the 'work', here the voice of the person in question? If so, how is this to be controlled?
In this paper we will consider the questions from the perspective of the rights of the person whose voice is being used, and from the perspective of the rights (if any) of the producer and the user of the software. We will also look into the duties such software undoubtedly imposes on its users and the possible duties which fall to the producer. Finally, we will look into the consequences, in both a utilitarian and an economic sense, that such software would and will introduce.
Howley et al. (2002) consider privacy enhancing technologies specifically, but in a similar manner system design personnel need to be aware of the possible uses of speech simulation software and the consequences of those uses (for stakeholder theory, see e.g. Bowern et al. (2004)). The software should not be released before a study of its effects has been done. In a competitive environment, the pressures to release early, without considering all the implications of the release, are considerable (Powers, 2002). However, the consequences of releasing an unfinished product that does not guard against abuse of the software can also be considerable.
If ownership of even such minor immaterial objects as items in a computer game (see e.g. Reynolds 2002, or Kimppa and Bissett 2005) can be considered to be a major issue, surely ownership of one's voice and its (perfect) imitation is a central question for oneself.
Many of the issues that arise with the use of someone's voice are similar to those that arise with the use of a picture of someone, although some ethical issues are also specific and novel to voice synthesis. Questions such as the following are similar to those arising with pictures: Who owns the product of the voice synthesizer? What role does consent play in the use of someone's perfectly imitated voice? What moral rights must be considered? In this paper we will look at these issues amongst others and compare them to the ones presented for image usage (see e.g. Weckert and Adeney, 1994 or Evans and Mahoney 2004).
Preliminarily, it would seem evident that misuses of such software are possible, if not even probable. Thus, the design of the software should already minimize the possibility of such misuse, whether the concern is the natural rights of the donor, the duties of users and producers toward the donor, or just the plain consequences of the use of such software.
Bowern, M., McDonald, C. and Weckert, J. (2004), Stakeholder theory in practice: Building better software systems. Ethicomp 2004, University of the Aegean, Syros, Greece, 14 to 16 April 2004, pp. 157–169.
Evans, Jill, and Mahoney, John (2004), Ethical and Legal Aspects of Using Digital Images of People: Impact on Learning and Teaching. Ethicomp 2004, University of the Aegean, Syros, Greece, 14 to 16 April 2004, pp. 289–297.
Howley, Richard, Rogerson, Simon, Fairweather, N. B. and Pratchett, Lawrence (2002), The Role of Information Systems Personnel in the Provision for Privacy and Data Protection in Organisations and Within Information Systems. Ethicomp 2002, Universidade Lusíada, Lisbon, Portugal, 13-15 November 2002, pp. 169–180.
Kimppa, K. K. and Bissett, A. (2005), Is cheating in network computer games a question worth raising? CEPE 2005, July 17-19, Enschede, The Netherlands, pp. 259–267.
Klatt, D. H. (1980), Software for a Cascade/Parallel Formant Synthesizer. Journal of the Acoustical Society of America 67, pp. 971–995.
Powers, Thomas M. (2002), Responsibility in Software Engineering: Uncovering an Ethical Model. Ethicomp 2002, Universidade Lusíada, Lisbon, Portugal, 13-15 November 2002, pp. 247–257.
Reynolds, Ren (2002), Intellectual Property Rights in Community Based Video Games. Ethicomp 2002, Universidade Lusíada, Lisbon, Portugal, 13-15 November 2002, pp. 455–470.
Saarni, T., Paakkulainen, J., Mäkilä, T., Hakokari, J., Aaltonen, O., Isoaho, J. and Salakoski, T. (2006), Implementing a Rule-based Speech Synthesizer on a Mobile Platform. T. Salakoski et al. (Eds.): FinTAL 2006, LNAI 4139, pp. 349–355.
Weckert, John and Adeney, Douglas (1994), Ethics in Electronic Image Manipulation. Ethics in the computer age, Gatlinburg, Tennessee, United States, pp. 113–114.