Non-intrusive load monitoring (NILM) is a popular approach for estimating appliance-level electricity consumption from the aggregate consumption data of households. Assessing whether NILM algorithms are suitable for use in real scenarios is, however, still cumbersome, mainly because no standardized evaluation procedure for NILM algorithms exists and because the availability of comprehensive electricity consumption data sets on which to run such a procedure is still limited. This paper contributes to solving this problem by: (1) outlining the key dimensions of the design space of NILM algorithms; (2) presenting a novel, comprehensive data set for evaluating the performance of NILM algorithms; (3) describing the design and implementation of a framework that significantly eases the evaluation of NILM algorithms across different data sets and parameter configurations; and (4) demonstrating the use of the presented framework and data set through an extensive performance evaluation of four selected NILM algorithms. Both the presented data set and the evaluation framework are made publicly available.