Estimating Probabilities for Effective Data Fusion

David Lillis, Lusheng Zhang, Fergus Toolan, Rem W. Collier, David Leonard and John Dunnion

In Proceedings of the 33rd Annual ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347--354, Geneva, Switzerland, 2010. ACM.


Data Fusion is the combination of a number of independent search results, relating to the same document collection, into a single result to be presented to the user. A number of probabilistic data fusion models have been shown to be effective in empirical studies. These typically attempt to estimate the probability that particular documents will be relevant, based on training data. However, little attempt has been made to gauge how the accuracy of these estimates affects fusion performance. The focus of this paper is twofold: firstly, that accurate estimation of the probability of relevance results in effective data fusion; and secondly, that an effective approximation of this probability can be made based on less training data than has previously been employed. This is based on the observation that the distribution of relevant documents follows a similar pattern in most high-quality result sets. Curve fitting suggests that this can be modelled by a simple function that is less complex than other models that have been proposed. The use of existing IR evaluation metrics is proposed as a substitute for probability calculations. Mean Average Precision is used to demonstrate the effectiveness of this approach, with evaluation results demonstrating competitive performance when compared with related algorithms that have more onerous training requirements.
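To make the abstract's idea concrete, the sketch below shows how Mean Average Precision can be computed from training queries and then used to weight a simple rank-based linear fusion. This is an illustrative sketch only, not the paper's exact algorithm: the function names, the `weight / rank` scoring scheme (a CombSUM-style combination), and the data shapes are all assumptions made for the example.

```python
def average_precision(ranking, relevant):
    """Average Precision for one ranked result list:
    the mean of the precision values at each rank where
    a relevant document appears, over all relevant documents."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, judgments):
    """MAP for one input system over a set of training queries.
    `runs` maps query id -> ranked list; `judgments` maps
    query id -> set of relevant document ids."""
    return sum(average_precision(runs[q], judgments[q]) for q in runs) / len(runs)

def fuse(result_sets, weights):
    """Illustrative linear fusion: each input system contributes
    weight / rank for every document it returns, and documents are
    re-ranked by their summed scores. Here the weight would be the
    system's training-set MAP, standing in for an estimated
    probability of relevance."""
    scores = {}
    for ranking, weight in zip(result_sets, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / rank
    return sorted(scores, key=scores.get, reverse=True)
```

For example, two systems returning `["a", "b"]` and `["b", "a"]` with weights 1.0 and 0.5 fuse to `["a", "b"]`, since the higher-quality system's ordering dominates.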