Search engines and the problem of transparency

Dag Elgesem


Search engines are becoming increasingly important as mechanisms for getting access to information on the web. They have thus acquired considerable power, and some authors have even suggested that Google and the other big players in the search engine market function as the new gatekeepers to the web. Several authors, including Introna and Nissenbaum in an important, recent paper, have argued that for this reason the details of the algorithms of the search engines should be made public knowledge. The problem, crudely put, is that we do not know whether the search engines are biased or not. Because we all use and trust these programs, it is argued, the details of the algorithms have to be revealed. In the paper, arguments for a strong policy of openness on this issue will be constructed, using ideas from Kant about freedom of information as a precondition for the use of all other rights and freedoms.

There are serious problems with such a position, however. One problem is that if all the details of the algorithms of the search engines were revealed, this would invite massive attempts by webmasters all over the world to manipulate the results. The consequence would be that the search engines would be even more biased. The dilemma, then, is that a right to information could actually make people worse off in terms of information.

In order to find a way out of this ethical dilemma, a model for the estimation of the credibility of information is invoked. It is suggested that the sophisticated Bayesian analysis of the credibility of testimonies, developed by the legal theorist Richard D. Friedman, can also be used in the analysis of the credibility of search results. In the paper I will argue that the model developed by Friedman can be used as the basis for the ethical evaluation of search engines.

It is possible here only to indicate the approach briefly. The question asked with respect to a testimony in court is: How likely is it that the event X in fact happened, given the testimony by witness w that X happened? Friedman develops a practical way to apply a Bayesian calculation of the probability that the testimony is true, given knowledge of the probability of various sources of error in the testimony. The user's question has a similar structure to the one concerning the credibility of testimonies: How likely is it that a set X of web pages is the most relevant answer to my search term, given that the search engine comes up with this set X? In the paper it is argued that the same kind of analysis as the one suggested by Friedman, in terms of conditional probabilities, can be applied to the search engine case.
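The structure of the user's question can be made concrete with Bayes' theorem. The following is only a minimal sketch, not Friedman's own procedure: the function name, the parameter names, and any numbers passed to it are hypothetical and chosen for illustration.

```python
def posterior_most_relevant(prior: float,
                            p_presented_if_relevant: float,
                            p_presented_if_not: float) -> float:
    """Return P(X is the most relevant set | the engine presents X).

    All three inputs are assumed, illustrative quantities:
      prior                   -- P(X is the most relevant set) before searching
      p_presented_if_relevant -- P(engine presents X | X is most relevant)
      p_presented_if_not      -- P(engine presents X | X is not most relevant),
                                 i.e. the combined weight of the error sources
    """
    for p in (prior, p_presented_if_relevant, p_presented_if_not):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")

    # Bayes' theorem: posterior = likelihood * prior / evidence
    numerator = p_presented_if_relevant * prior
    evidence = numerator + p_presented_if_not * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("the engine never presents X under these assumptions")
    return numerator / evidence
```

On this sketch, the more likely it is that the engine presents X even when X is not the most relevant set, the lower the posterior, which matches the intuition that more sources of error mean less credible results.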

To illustrate, consider the model below (adapted from Friedman, p. 713):

[Figure: path diagram, not reproduced here. Box labels: Search space; Most relevant set of hits; Indexed as most relevant; Ranked as most relevant; Set presented as most relevant; Not the most relevant set; Not the best ranking; Not the most relevant hits. Labels for sources of error: Error in indexing; Bias in ranking; Paid hits.]

Start by considering the box to the upper right in the figure above, which will be presented in detail in the paper. This represents the result of the search as it is presented to the user. The user is interested in knowing along which path this result was produced. The claim made by the search engine is that this is the most relevant set of pages given the search term entered by the user. If this claim is true, the result was produced along the uppermost path. Whether this claim is true or not is important for the user's evaluation of the credibility of the information included in the set (the first 10 hits, say). But the claim is not necessarily true. It is possible, for example, that the indexing of the pages gives rise to a biased classification. Another possibility is that there are problems with the ranking algorithm and that less relevant pages are ranked as more relevant. A further source of error would be that the ranking is manipulated in the presentation of the results, for example if it is possible to pay for a high ranking among the hits. In these cases the result is produced along one of the other paths indicated in the diagram. The higher the probability that the result is produced along one of the lower paths, the less credible is the information, given the user's informational needs.
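The reasoning about paths can be sketched numerically. The code below is an illustration under stated assumptions, not the model as it appears in the paper: it assumes that each path leading to the presented set has a known joint probability, and the path names and the helper functions are hypothetical.

```python
from typing import Dict


def path_posteriors(path_weights: Dict[str, float]) -> Dict[str, float]:
    """Normalise the joint probabilities of all paths that end in the
    presented result set, giving P(path | this set was presented)."""
    if any(w < 0.0 for w in path_weights.values()):
        raise ValueError("path weights must be non-negative")
    total = sum(path_weights.values())
    if total <= 0.0:
        raise ValueError("at least one path must have positive weight")
    return {name: w / total for name, w in path_weights.items()}


def credibility(path_weights: Dict[str, float],
                truthful_path: str = "uppermost") -> float:
    """Probability that the result was produced along the uppermost path,
    i.e. that the presented set really is the most relevant one."""
    return path_posteriors(path_weights)[truthful_path]


# Hypothetical weights, for illustration only.
example_weights = {
    "uppermost": 0.6,        # most relevant set, correctly indexed and ranked
    "indexing_error": 0.1,   # error in indexing
    "ranking_bias": 0.2,     # bias in ranking
    "paid_hits": 0.1,        # paid placement in the presentation
}
example_credibility = credibility(example_weights)
```

Raising the weight of any lower path while the others stay fixed lowers the value returned by `credibility`, which is the relationship stated in the last sentence above.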

In the paper it will be argued that this model gives a better basis for the ethical evaluation of search engine policies from the standpoint of the user. The user is interested in a search engine where there is as little bias as possible. It is of course impossible to eradicate the bias completely. For example, Google's PageRank algorithm is to some extent affected by the power law of the structure of the linking on the web (even though this is contested). But we can all agree that we want to reduce bias and strive to increase the probability that the claim that the result set is the most relevant is correct. The question, then, is whether a strong policy of openness concerning the details of the algorithm contributes to this. It will be argued that it probably does not. Moreover, publishing the details of the algorithm will not better enable the average user to determine the probability that the result set is optimally relevant.