A Comparison of Offline Evaluations, Online Evaluations, and User Studies


  • Ways to evaluate a recommender system
    • offline evaluation
    • online evaluation
    • user studies
  • Results from offline evaluations sometimes contradict results from online evaluations and user studies
    • Human factors
    • imperfection of offline datasets
  • In practice, offline evaluations are probably not suitable to evaluate recommender systems
    • Offline
      • measure the ::accuracy:: based on ground-truth
      • Precision at n (P@n)
      • MRR
      • nDCG
      • F-Measure
      • Novelty or serendipity - Beyond accuracy - evaluating recommender systems by coverage and serendipity
        1. Coverage

        The coverage of a recommender is a measure of the domain of items over which the system can make recommendations.
        Two concepts:

        The percentage of the items for which the sys...
    • 3 types of offline datasets
      • Explicit ground-truth
      • Inferred ground-truth
      • Expert ground-truth
  • Online evaluations
    • click-through rate (most favorable metric)
    • link-through rate correlated less well with user satisfaction
    • cite-through rate correlated less well with user satisfaction
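
The offline accuracy metrics listed above (precision at n, F-measure, MRR, nDCG) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes binary relevance (an item is either in the ground truth or not), a ranked list without duplicates, and hypothetical item IDs such as "p1". The function names are my own.

```python
import math

def precision_at_n(recommended, relevant, n):
    """Share of the top-n recommendations that appear in the ground truth."""
    if n <= 0:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / n

def recall_at_n(recommended, relevant, n):
    """Share of the ground-truth items found in the top-n recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in relevant)
    return hits / len(relevant)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0 if none is recommended."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (recommended, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(rec, rel) for rec, rel in runs) / len(runs)

def ndcg_at_n(recommended, relevant, n):
    """Binary-relevance nDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:n], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: five recommended papers, three in the ground truth.
recommended = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p2", "p5", "p9"}

p = precision_at_n(recommended, relevant, 5)   # 2 hits out of 5
r = recall_at_n(recommended, relevant, 5)      # 2 of 3 ground-truth items
print(p, r, f_measure(p, r))
print(mean_reciprocal_rank([(recommended, relevant)]))  # first hit at rank 2
print(ndcg_at_n(recommended, relevant, 5))
```

In this example the first relevant paper sits at rank 2, so the reciprocal rank is 0.5, and nDCG comes out at roughly 0.48 because both hits sit below the ideal positions.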

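For the online side, click-through rate can be sketched as the ratio of clicked to delivered recommendations, computed per recommendation approach. The log format, the approach names "A" and "B", and the numbers below are hypothetical and only illustrate the calculation; link-through and cite-through rate are left out because these notes do not define them.

```python
from collections import defaultdict

def ctr_by_approach(log):
    """Click-through rate per approach.

    log: iterable of (approach_name, was_clicked) pairs, one entry per
    delivered recommendation (a hypothetical log format).
    """
    delivered = defaultdict(int)
    clicked = defaultdict(int)
    for approach, was_clicked in log:
        delivered[approach] += 1
        if was_clicked:
            clicked[approach] += 1
    # Every approach in `delivered` has at least one entry, so no division by zero.
    return {approach: clicked[approach] / delivered[approach] for approach in delivered}

# Hypothetical log: four delivered recommendations per approach.
log = [
    ("A", True), ("A", False), ("A", False), ("A", False),
    ("B", True), ("B", True), ("B", False), ("B", False),
]
print(ctr_by_approach(log))  # A: 1 of 4 clicked, B: 2 of 4 clicked
```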

J. Beel and S. Langer, “A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems,” in Research and Advanced Technology for Digital Libraries, vol. 9316, S. Kapidakis, C. Mazurek, and M. Werla, Eds. Cham: Springer International Publishing, 2015, pp. 153–168. doi: 10.1007/978-3-319-24592-8_12.
