Evaluation 1: The Semantic Web community is still a long way from defining standard evaluation benchmarks. The UAM group, in collaboration with KMi, has created a novel "startup" benchmark to evaluate semantic search systems, in which PowerAqua is evaluated as part of an IR system. The benchmark takes the TREC 9 and TREC 2001 (http://trec.nist.gov/) standards as a starting point, because these provide an independently produced set of queries and document judgments. The underlying IR collection comprises 10 GB of Web documents known as the TREC WT10G collection, 100 queries corresponding to real user log requests, and the list of document judgments for each query. These judgments allow the quality of the information retrieval techniques to be calculated using standard precision and recall metrics. The aims behind selecting an IR collection are twofold: on the one hand, to evaluate the query results retrieved by querying the Semantic Web; on the other hand, to evaluate the advantage of using semantic information for document retrieval in terms of precision and recall.
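
As an illustration of how the document judgments feed the precision and recall metrics mentioned above, the following is a minimal sketch for a single query. The function name and the document identifiers are hypothetical and are not taken from the benchmark itself; the sketch only assumes that each query comes with a list of retrieved documents and a list of documents judged relevant.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids returned by the system (hypothetical ids)
    relevant:  document ids judged relevant for the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against empty result lists or queries with no relevant documents.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Toy data for illustration only, not taken from the TREC collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Averaging these per-query values over all 100 queries would give one overall figure per system, although the exact aggregation used in the benchmark is not stated here.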

Evaluation 2: The evaluation of PowerAqua as a standalone system focuses on its capability to answer queries by relying on information provided by multiple ontologies. As such, the evaluation primarily assesses the mapping capabilities of the system (i.e., its ability to map a user query into ontological triples in real time) rather than its linguistic coverage or its merging and ranking components, which in this version are still quite limited.

Evaluation 3: The evaluation of PowerAqua's merging and ranking capabilities on queries that can only be answered by combining multiple facts from the same or different ontologies.

Evaluation 4: We describe informal experiments carried out to investigate whether the answers PowerAqua obtains from DBpedia and various other ontologies can be used to perform query expansion and improve the precision of Yahoo web searches over the first 10 results. We also investigate which ranking mechanism over the semantic results returned by PowerAqua is most efficient at automatically eliciting the most accurate results on the Web.
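
Precision over the first 10 results can be sketched as follows. This is only an assumed formulation (the usual precision-at-k measure); the function name, the result identifiers and the relevance set are made up for illustration and do not reproduce the experiments or their outcomes.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    top_k = ranked_results[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc in top_k if doc in relevant)
    # Standard precision-at-k divides by k even if fewer results came back.
    return hits / k


# Toy data: hypothetical result ids for a plain query and an expanded query.
relevant = {"u2", "u5", "u9", "u11", "u12"}
baseline = ["u1", "u2", "u3", "u4", "u5", "u6", "u7", "u8", "u9", "u10"]
expanded = ["u2", "u11", "u5", "u3", "u12", "u9", "u4", "u6", "u7", "u8"]

baseline_p10 = precision_at_k(baseline, relevant)
expanded_p10 = precision_at_k(expanded, relevant)
print(f"baseline P@10={baseline_p10:.1f} expanded P@10={expanded_p10:.1f}")
```

Comparing the two values per query is one simple way to check whether expansion helps; different ranking mechanisms could be compared in the same manner by swapping in the result list each one produces.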

Evaluation 5: Informal experiments with ad hoc queries to measure the performance of PowerAqua as an interface to Watson only.