A Comparison of Offline Evaluations, Online Evaluations, and User Studies

2022/01/18


  • Three ways to evaluate a recommender system
    • offline evaluation
    • online evaluation
    • user studies
  • Results from offline evaluations sometimes contradict those from online evaluations and user studies, due to
    • Human factors
    • Imperfections of offline datasets
  • Are offline evaluations suitable for evaluating recommender systems?
    • Offline
      • Measure the ::accuracy:: of recommendations against a ground truth, using metrics such as the following (sketched in Python after this outline)
      • Precision@n
      • MRR (Mean Reciprocal Rank)
      • nDCG (normalized Discounted Cumulative Gain)
      • F-Measure
      • Novelty or serendipity, cf. "Beyond accuracy: evaluating recommender systems by coverage and serendipity" (Ge et al., 2010)
        1. Coverage

        The coverage of a recommender is a measure of the domain of items over which the system can make recommendations
        Two concepts (sketched in Python after this outline):

        Prediction coverage, the percentage of the items for which the system is able to generate a recommendation, and catalog coverage, the percentage of the catalog that is ever actually recommended to users
    • 3 types of offline datasets
      • Explicit ground-truth
      • Inferred ground-truth
      • Expert ground-truth
  • Online evaluations
    • Click-through rate (CTR), the most commonly used metric (all three rates sketched in Python after this outline)
    • Link-through rate and cite-through rate did not correlate less well with user satisfaction than CTR
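
The accuracy metrics above are standard, so here is a minimal, self-contained sketch of how they are typically computed against an explicit ground truth. The function names and toy item IDs are my own, not from the paper:

```python
import math

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommendations that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0 if none found)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_n(ranked, relevant, n):
    """nDCG with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:n], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 5 recommended papers, ground truth marks two of them relevant.
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p2", "p5"}
p = precision_at_n(ranked, relevant, 5)              # 0.4
r = len(relevant & set(ranked[:5])) / len(relevant)  # recall@5 = 1.0
print(p, mrr(ranked, relevant), ndcg_at_n(ranked, relevant, 5), f_measure(p, r))
```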
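
Likewise, a small sketch of the two coverage concepts from the "Beyond accuracy" paper, prediction coverage and catalog coverage. This is my own illustration of the definitions, with invented data:

```python
def prediction_coverage(recommendable, catalog):
    """Prediction coverage: share of the catalog the system can recommend at all."""
    return len(recommendable) / len(catalog)

def catalog_coverage(recommendation_lists, catalog):
    """Catalog coverage: share of the catalog actually appearing in served recommendations."""
    recommended = set().union(*recommendation_lists) if recommendation_lists else set()
    return len(recommended & set(catalog)) / len(catalog)

catalog = {"p1", "p2", "p3", "p4", "p5"}
print(prediction_coverage({"p1", "p2", "p3"}, catalog))         # 0.6
print(catalog_coverage([["p1", "p2"], ["p2", "p3"]], catalog))  # 0.6
```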
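
Finally, a sketch of how the three online metrics relate: each is just events per displayed recommendation, with progressively stronger (and rarer) signals of satisfaction from click to download to citation. All counts below are invented for illustration:

```python
def through_rate(events, shown):
    """Generic x-through rate: events per displayed recommendation."""
    return events / shown if shown else 0.0

shown = 10_000     # recommendations displayed
clicks = 620       # user clicked the recommendation          -> click-through rate
downloads = 140    # user downloaded/opened the linked paper  -> link-through rate
citations = 12     # user eventually cited the paper          -> cite-through rate

print(f"CTR:  {through_rate(clicks, shown):.2%}")
print(f"LTR:  {through_rate(downloads, shown):.2%}")
print(f"CiTR: {through_rate(citations, shown):.2%}")
```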

Reference

J. Beel and S. Langer, “A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems,” in Research and Advanced Technology for Digital Libraries, vol. 9316, S. Kapidakis, C. Mazurek, and M. Werla, Eds. Cham: Springer International Publishing, 2015, pp. 153–168. doi: 10.1007/978-3-319-24592-8_12.

#recommender-system