A Comparison of Offline Evaluations, Online Evaluations, and User Studies

2022/01/18


  • Three ways to evaluate a recommender system
    • offline evaluation
    • online evaluation
    • user studies
  • Results from offline evaluations sometimes contradict those from online evaluations and user studies, due to
    • Human factors
    • Imperfections of offline datasets
  • Are offline evaluations suitable for evaluating recommender systems?
    • Offline
      • Measure the ::accuracy:: of recommendations against a ground truth, using metrics such as the following (sketched in Python after this outline)
      • Precision@n
      • MRR (Mean Reciprocal Rank)
      • nDCG (normalized Discounted Cumulative Gain)
      • F-Measure
      • Novelty or serendipity, cf. "Beyond accuracy: evaluating recommender systems by coverage and serendipity" (Ge et al., 2010)
        1. Coverage

        The coverage of a recommender is a measure of the domain of items over which the system can make recommendations
        Two concepts (sketched in Python after this outline):

        Prediction coverage, the percentage of the items for which the system is able to generate a recommendation, and catalog coverage, the percentage of the catalog that is ever actually recommended to users
    • 3 types of offline datasets
      • Explicit ground-truth
      • Inferred ground-truth
      • Expert ground-truth
  • Online evaluations
    • Click-through rate (CTR), the most commonly used metric (all three rates sketched in Python after this outline)
    • Link-through rate and cite-through rate did not correlate less well with user satisfaction than CTR
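
The accuracy metrics above are standard, so here is a minimal, self-contained sketch of how they are typically computed against an explicit ground truth. The function names and toy item IDs are my own, not from the paper:

```python
import math

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommendations that are relevant."""
    return sum(1 for item in ranked[:n] if item in relevant) / n

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0 if none found)."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_n(ranked, relevant, n):
    """nDCG with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(ranked[:n], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), n)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def f_measure(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy example: 5 recommended papers, ground truth marks two of them relevant.
ranked = ["p1", "p2", "p3", "p4", "p5"]
relevant = {"p2", "p5"}
p = precision_at_n(ranked, relevant, 5)              # 0.4
r = len(relevant & set(ranked[:5])) / len(relevant)  # recall@5 = 1.0
print(p, mrr(ranked, relevant), ndcg_at_n(ranked, relevant, 5), f_measure(p, r))
```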
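
Likewise, a small sketch of the two coverage concepts from the "Beyond accuracy" paper, prediction coverage and catalog coverage. This is my own illustration of the definitions, with invented data:

```python
def prediction_coverage(recommendable, catalog):
    """Prediction coverage: share of the catalog the system can recommend at all."""
    return len(recommendable) / len(catalog)

def catalog_coverage(recommendation_lists, catalog):
    """Catalog coverage: share of the catalog actually appearing in served recommendations."""
    recommended = set().union(*recommendation_lists) if recommendation_lists else set()
    return len(recommended & set(catalog)) / len(catalog)

catalog = {"p1", "p2", "p3", "p4", "p5"}
print(prediction_coverage({"p1", "p2", "p3"}, catalog))         # 0.6
print(catalog_coverage([["p1", "p2"], ["p2", "p3"]], catalog))  # 0.6
```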
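
Finally, a sketch of how the three online metrics relate: each is just events per displayed recommendation, with progressively stronger (and rarer) signals of satisfaction from click to download to citation. All counts below are invented for illustration:

```python
def through_rate(events, shown):
    """Generic x-through rate: events per displayed recommendation."""
    return events / shown if shown else 0.0

shown = 10_000     # recommendations displayed
clicks = 620       # user clicked the recommendation          -> click-through rate
downloads = 140    # user downloaded/opened the linked paper  -> link-through rate
citations = 12     # user eventually cited the paper          -> cite-through rate

print(f"CTR:  {through_rate(clicks, shown):.2%}")
print(f"LTR:  {through_rate(downloads, shown):.2%}")
print(f"CiTR: {through_rate(citations, shown):.2%}")
```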

Reference

J. Beel and S. Langer, “A Comparison of Offline Evaluations, Online Evaluations, and User Studies in the Context of Research-Paper Recommender Systems,” in Research and Advanced Technology for Digital Libraries, vol. 9316, S. Kapidakis, C. Mazurek, and M. Werla, Eds. Cham: Springer International Publishing, 2015, pp. 153–168. doi: 10.1007/978-3-319-24592-8_12.

#recommender-system