A Comparative Analysis of Offline and Online Evaluations and Discussion of Research Paper Recommender System Evaluation


  • Offline evaluations contradicted online evaluations
  • CTR and MAP never contradicted each other
    • MAP computed per user could still differ
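The two metrics compared in the paper can be sketched as follows; this is a minimal illustration of how CTR and (M)AP are typically computed, with hypothetical function names and data, not code from the paper:

```python
# CTR: fraction of displayed recommendations that were clicked.
def ctr(clicks, impressions):
    return clicks / impressions

# Average precision of one ranked recommendation list
# against a set of relevant documents.
def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

# MAP: mean of average precision over several queries/users.
def mean_average_precision(runs):
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Note that MAP averaged over all queries (as in a pooled offline dataset) and MAP averaged per user can differ, which is the caveat raised in the notes above.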

Research Questions

  • Why do offline evaluations only sometimes accurately predict performance in real-world systems?
    1. Human factors
      1. Wait too long to receive recommendations
      2. Presentation is unappealing
      3. Labeling of recommendations is suboptimal, e.g., they are labeled as commercial
      4. Older users tend to be more satisfied with recommendations than younger users
      5. Unregistered users are more concerned about privacy
    2. Imperfection of offline datasets
      1. Containing only a fraction of all relevant documents
  • Is it possible to identify situations in which offline evaluations have predictive power?
  • Is it problematic that offline evaluations do not (always) have predictive power?


J. Beel, M. Genzmehr, S. Langer, A. Nürnberger, and B. Gipp, “A comparative analysis of offline and online evaluations and discussion of research paper recommender system evaluation,” in Proceedings of the International Workshop on Reproducibility and Replication in Recommender Systems Evaluation - RepSys ’13, Hong Kong, China, 2013, pp. 7–14. doi: 10.1145/2532508.2532511.