How does the World Health Organization rate compared to Medical Students for Choice (a reproductive health rights group)? How do the Centers for Disease Control and Prevention score across a range of review services? As commentators begin to question the value of much of the information currently available on the Internet, how helpful are stars, badges or seals of approval (SOAPs) in identifying quality resources? Is a "Cool site of the moment", an accolade that literally changes every moment, of any real use? It is arguable that the more "awards" become available, the more debased the concept may start to appear.
The OMNI Advisory Group for Evaluation Criteria has been examining a range of services that seek to provide access to selected health and biomedical networked information resources. In comparing those services, and especially in comparing the criteria they use for evaluating materials, our overall aim has been to establish how effective such services may be in facilitating access to quality materials available across the Internet. We have identified a number of differences between the review services as they have emerged over the last two years, as well as differences in their approaches to resource evaluation.
For example, services such as Excite Web Reviews, Lycos Top 5% Sites and the Magellan Internet Guide offer brief and informal reviews of sites, and cover a wide range of subject disciplines. By contrast, the eLib ANR projects (e.g. ADAM, EEVL, OMNI and SOSIG) offer subject-specific gateways to selected and evaluated resources. Descriptions of sites are accessible through a searchable database, and inclusion in the database is indicative in its own right of the quality of a site. A more detailed comparison of a wider range of services can be found in an earlier article.
Informal services such as Excite Web Reviews and Lycos Top 5% Sites often cover large numbers of sites, but offer minimal evaluative detail. It is such services, in particular, that have tended to adopt numerical or star rating systems. Some services award a badge that can be displayed to indicate a site's quality, and most offer a searchable database of reviewed resources. Using reviews of the Centers for Disease Control and Prevention (CDC) site as an example, it is possible to examine the usefulness and value of such rating schemes, and especially their effectiveness in offering the user a guide to the quality of a site 'at a glance'.
Magellan Internet Guide
Popular among such services is the use of a star rating system. Indicatively, the Magellan Internet Guide allocates 1 to 10 points in three areas:
This results in an overall rating of one to four stars:
28 to 30 points = ****
22 to 27 points = ***
13 to 21 points = **
1 to 12 points = *
The CDC site has been awarded only a three-star Magellan rating, even though the site is reviewed as 'arguably the pre-eminent authority on the control and prevention of disease, injury, and disability in the United States'. (The user has the option of linking to a more detailed and informative review.) Inevitably, if the reviewers rely on criteria such as these, evaluation will be biased towards the 'hotness', 'hipness' or 'coolness' of sites, rather than quality of content. The scores are indiscriminately cumulated to generate the star rating, and the user is not in fact informed of the different scores for each of the three criteria. The more detailed review does not differentiate between the three possible scores, or explain why the site did not receive four stars. As Rettig suggests, 'given the fuzziness of the criteria used to generate those stars, they shed little light on the usefulness and value of the 45,000 reviewed and rated sites among the four million in the Magellan directory'.
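The cumulation described above can be sketched in a few lines of Python. This is only an illustration of the published point bands, not Magellan's own software: the function name and the two score profiles are hypothetical, and the sketch assumes that the three area scores are simply added together.

```python
def magellan_stars(scores):
    """Convert three 1-10 area scores into a 1-4 star rating via their total."""
    if len(scores) != 3 or any(not 1 <= s <= 10 for s in scores):
        raise ValueError("expected three scores, each between 1 and 10")
    total = sum(scores)
    # Point bands as published: 28-30, 22-27, 13-21, 12 or fewer.
    if total >= 28:
        return 4
    if total >= 22:
        return 3
    if total >= 13:
        return 2
    return 1


# Hypothetical score profiles, NOT Magellan's actual figures for any site.
strong_but_uneven = (10, 10, 5)  # total 25
even_across_areas = (8, 9, 8)    # total 25
print(magellan_stars(strong_but_uneven), magellan_stars(even_across_areas))
```

Both profiles total 25 points and so fall in the same three-star band: once the scores are summed, the user cannot tell a site that is outstanding in two areas and weak in the third from one that is moderately good throughout.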
Excite Web Reviews
Excite editors rate reviewed sites with between 1 and 4 LEPs. LEPs are illustrations of Excite's mascot: the little Excite person. According to Excite: "a 4-LEP rating is the highest, denoting a fantastic site. A 3-LEP rating indicates a very good site. A 2-LEP rating points you to an average or somewhat dull site, and a 1-LEP rating indicates that the site leaves a lot to be desired". In this scheme, the Centers for Disease Control and Prevention site attracts 4 LEPs.
Excite Web Reviews presents the rated sites in order 'from best to worst' within broad subject topics or subtopics, in effect ranking resources relative to each other. Interesting juxtapositions result. In the Health & Medicine > Substance Abuse > Drugs category, for instance, the US National Institute on Drug Abuse (of the National Institutes of Health), scoring only 2 LEPs, ranks below The Truth About Drugs, which is awarded 3 LEPs and is produced by Narconon®, according to Excite 'an organization based on the teachings of Scientology founder L. Ron Hubbard'.
Lycos Top 5% Sites
Lycos Top 5% Sites reviewers award a 0 to 50 rating for resources in three categories (content, presentation and experience). For the CDC site, a brief display is provided: 'Centers for Disease Control - Government health research site - Content: 41 Presentation: 26 Experience: 37'. The user can follow up the numerical scores by linking to a more detailed review, but no information is available about the criteria used to determine the scores, and no details are given in the review itself regarding how the ratings are assigned.
Like Excite Web Reviews, Lycos Top 5% Sites allows search results to be displayed according to the scores awarded to sites in any of the three categories (as well as alphabetically): broad subject categories or search results can be browsed in order of the scores sites have received for their content alone. The discrepancies of the rating system are once again clearly revealed as, for instance, the World Health Organization (at 36/50) is awarded the same score as MedWorld (a 'Forum / help for Stanford med students'), and features below Medical Students for Choice ('Reproductive health rights group') (at 39/50) in the same broad category of Professional Medicine.
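The ordering by content score can be reproduced as a short Python sketch. The three content scores are those quoted above; the list structure and the function name are assumptions made for illustration, and nothing is known about how Lycos itself breaks ties.

```python
# Content scores (out of 50) as quoted in the text for the
# Professional Medicine category.
sites = [
    ("World Health Organization", 36),
    ("MedWorld", 36),
    ("Medical Students for Choice", 39),
]


def rank_by_content(entries):
    """Return entries ordered from highest to lowest content score."""
    # sorted() is stable, so tied sites keep their input order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


ranked_names = [name for name, _ in rank_by_content(sites)]
print(ranked_names)
```

A ranking of this kind carries no information beyond the number itself, so the World Health Organization and MedWorld are indistinguishable to a user browsing by content score.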
The Six Senses site also uses a scoring system, and reviews healthcare and medical resources using six criteria: content, aesthetics, interactivity, innovation, freshness, and character. A score from 1 to 6 is awarded in each category, and the total of these scores represents a site's overall rating. The Six Senses Seal Of Approval is awarded to any site that scores 24 or above, which can then display the Seal. The user can follow up this score by examining the scores allocated in each area by three different reviewers. Brief annotations are offered, though sometimes it would appear that little care is taken in their construction, and they offer little insight into the quality or coverage of the CDC site itself.
Despite its subject specialisation, the focus of the Six Senses evaluation is very much on the superficial aspects of sites, as indicated by the criteria used. Incredibly, and despite a consistently high score for content awarded to the CDC site by all three reviewers, the CDC site is awarded 21 out of 36, and does not receive the Six Senses Seal of Approval; possibly because 'it could better seize the opportunity to engage visitors with innovative features' and, although 'unlimited in potential, the CDC has an opportunity to aesthetically make this site a more pleasing stop as well'.
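The effect of weighting all six criteria equally can be illustrated with a minimal Python sketch. The criteria, the 1 to 6 scale and the threshold of 24 come from the description above; the function names and both score breakdowns are hypothetical (the CDC's actual per-criterion scores are not reproduced here), and the sketch treats a single set of six scores because the way the three reviewers' scores are combined is not stated.

```python
CRITERIA = ("content", "aesthetics", "interactivity",
            "innovation", "freshness", "character")
SEAL_THRESHOLD = 24  # out of a maximum of 6 criteria x 6 points = 36


def six_senses_total(scores):
    """Sum one 1-6 score per criterion into an overall rating out of 36."""
    if set(scores) != set(CRITERIA):
        raise ValueError("expected exactly one score per criterion")
    if any(not 1 <= value <= 6 for value in scores.values()):
        raise ValueError("each score must be between 1 and 6")
    return sum(scores.values())


def earns_seal(scores):
    """True if the total reaches the Seal of Approval threshold."""
    return six_senses_total(scores) >= SEAL_THRESHOLD


# Hypothetical breakdowns, for illustration only.
authoritative_but_plain = dict.fromkeys(CRITERIA, 3) | {"content": 6}
stylish_but_thin = dict.fromkeys(CRITERIA, 5) | {"content": 2}

print(six_senses_total(authoritative_but_plain), earns_seal(authoritative_but_plain))
print(six_senses_total(stylish_but_thin), earns_seal(stylish_but_thin))
```

Because content contributes only one sixth of the total, a site with the maximum content score and middling marks elsewhere reaches 21 and misses the Seal, while a site with weak content and strong presentation reaches 27 and earns it.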
Medical Matrix focuses on clinical medical resources of interest primarily to US physicians and health workers. Star ratings of 1 to 5 are occasionally awarded, and the CDC site attracts 3 stars. However, simple inclusion in Medical Matrix is meant to be indicative of the quality of a resource, as even the absence of a star rating indicates 'specialized knowledge with suitable clinical content'. The basis of the distinction between none, one or many stars remains unclear (for instance, how can one usefully distinguish between a site awarded four stars as an 'outstanding site across all categories and a premier web page for the discipline' and one awarded five stars as 'an award winning site for Medical Internet'?), and there is no indication whether all sites included in Medical Matrix have in fact been rated.
Interestingly, the Medical Matrix submission form encourages resource evaluation beyond superficial aspects of a site, and incorporates differential weighting across the evaluation criteria: content related criteria (peer review and application) attract a greater number of points than media, feel and ease of access, so that sites that score highly on content alone are not penalised for lacking in appearance. It is unclear how this point rating system influences the star rating that accompanies the annotations, if at all, since selection is reviewed by an editorial board, drawn from the American Medical Informatics Association's Internet Working Group.
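In assessing the use of numerical and star rating schemes by sites that review health and biomedical resources, the Advisory Group have yet to identify examples of any that:
- demonstrate internal consistency
- are defined robustly enough to ensure reproducibility (avoiding subjectivity)
- relate closely to a service's stated review or evaluation criteria
- are supported in accompanying reviews or annotations
- scale effortlessly with increasingly large numbers of resources
- scale convincingly across potentially variable audience interests and contexts of information seeking
- apply satisfactorily across a full range of resource types or collections of resources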
In addition, numerical and star rating schemes appear to be used as a measure of a site's 'coolness', rather than as an indicator of the quality of a site's content, despite research suggesting the importance of quality of content to users themselves.
If numerical and star rating systems can vary so widely in their implementation and interpretation, as we have seen here, it remains questionable whether they can succeed in usefully guiding the user to the selection of high quality resources 'at a glance'. It is perhaps not coincidental that eLib ANR projects have so far been cautious in this regard. Work from the European DESIRE project has produced a working model for subject based information gateways (SBIGs). While the project has surveyed the criteria used by a range of review sites, no recommendations are made with regard to the use of rating systems as compared to other evaluative systems.
As a result of ongoing work in this area, the Advisory Group has identified examples of 'good practice' and recommended these for consideration and adoption by the OMNI Project. These include features such as the inclusion of the date on which a resource was described or reviewed.
The US Communications Decency Act sought to restrict publishing of content on the Internet. One filtering mechanism which was developed in response to this is PICS (Platform for Internet Content Selection). While PICS will control neither the publication nor the distribution of information, it may instead offer individuals or organisations the option of filtering out or filtering in selected views of networked information. One vision of how this may be achieved has already been articulated by Chris Armstrong of the Centre for Information Quality Management. Through the likely adoption of PICS compliance by the regular Internet browsers (PICS is already supported by Microsoft's Internet Explorer), it may in future become possible to configure browsers to access resources that satisfy trusted quality ratings.
In the meantime, however tempting it may be to line up the badges awarded to your website, it may be worth asking whether they are worth the pixels they occupy on your screen, or whether more formal criteria for evaluating information resources, as adopted by the emerging eLib SBIGs, remain a more helpful way to judge the quality of WWW resources.
Sam Saunders, J.P.Saunders@leeds.ac.uk, replies on 21-May-1997:
While this article is sensibly sceptical of the populist award systems, and cautious about PICS, it leaves the very difficult question towards the end, and then avoids taking a position. What is quality? It does implicitly suggest, through its championship of the CDC site, that there are quality criteria with which all its readers would concur. But it only hints in the last section what they might be, or how they might be derived.
I would suggest that even within a fairly well specified community, criteria would be hard to agree, especially if they were to be applied "objectively". Relatively trivial criteria, such as numbers of words, level of reading difficulty, date of last revision, and errors in spelling and HTML could be applied with some consistency. These are descriptive, however, not evaluative. Beyond that, quality becomes more contentious and its arbiters start to take on heavier responsibilities. As a mere guardian of a web site, I would be very wary of awarding stars or writing general evaluations of material created in a field in which I was no longer active.
The simplest reason why this is so is that quality cannot reside in the item itself. Quality is tied to purpose, and purpose is negotiable as between the creator and the user. The suggestion that it can be derived from association with a third party makes the problem more, not less intractable.
If I were to agree that "inclusion in the (SBIG) database is indicative in its own right of the quality of a site" it would have to be because my own purposes were closely matched by those of the site's reviewers. As things are at present I will develop a relationship with a subject-based gateway, and gradually learn about its idiosyncrasies. It will become very useful to me. But I cannot rely on it, simply because the quality (or qualities) I am looking for are as much a consequence of my own purpose as of the item's inherent properties or the creator's intentions.
Dr. Gary Malet, Medical Informatics Fellow, Oregon Health Sciences University, Co-Chair of AMIA's Internet Working Group and involved with "MEDICAL MATRIX", firstname.lastname@example.org, replied on 13 July 1997:
The author of the critique of Internet resource rating systems provides the following summary: "however tempting it may be to line up the badges awarded to your website, it may be worth asking whether they are worth the pixels they occupy". I would like to take this opportunity to defend and provide some of the background for the rating system that has been attached to the Medical Matrix Internet Clinical Medicine Resources Guide.
I would assert, contrary to what is reported in the article, that the point rating system is indeed "an objective and reproducible approach to rating sites". The review criteria for Medical Matrix are explicitly stated at http://www.medmatrix.org/info/sitesurvey.html. Any medical Internet resource can be evaluated using this template with an eye to the resource's utility for point of care clinical application. We have found that the approach can be extended to any number of contributors. A clinician editorial board has provided focus for this effort.
"The basis of the distinction between none, one or many stars remains unclear (for instance, how can one usefully distinguish between a site awarded four stars as an 'outstanding site across all categories and a premier web page for the discipline' and one awarded five stars as 'an award winning site for Medical Internet'?)"
An example of a discipline is a medical specialty. It is expected that medical specialists will appreciate a pointer to the premier resources in their field. An example of a five star site is the Merck Manual. (A search based on star rank will be instituted in the future.)
"and there is no indication whether all sites included in Medical Matrix have in fact been rated."
They have all been rated.
"Interestingly, the Medical Matrix .... incorporates differential weighting across the evaluation criteria: content related criteria (peer review and application) attract a greater number of points than media, feel and ease of access, so that sites that score highly on content alone are not penalised for lacking in appearance. It is unclear how this point rating system influences the star rating that accompanies the annotations,...."
The point rating system is an objective and reproducible approach to rating sites. The point rating system is the foundation for the stars that are assigned.
This may not have been made clear to the authors of the critique. We will make this clearer in our presentation of the resource.
"In assessing the use of numerical and star rating schemes by sites that review health and biomedical resources, the Advisory Group have yet to identify examples of any that:
- demonstrate internal consistency"
Our rating system is explicit. It is applied to each resource.
- "have yet to identify examples of any that are defined robustly enough to ensure reproducibility (avoiding subjectivity)"
Our criteria are explicit. We have used the same template for our editorial board and contributors. Enlisting a panel of esteemed analysts eliminates subjectivity.
- "have yet to identify examples of any that relate closely to a service's stated review or evaluation criteria "
Our rankings are based on our stated criteria.
- "have yet to identify examples of any that are supported in accompanying reviews or annotations "
It is not clear why the annotation should repeat the ranking.
- "have yet to identify examples of any that scale effortlessly with increasingly large numbers of resources."
Our methodology of having our contributors complete an evaluation template and having periodic peer review seems efficient to us.
- "have yet to identify examples of any that scale convincingly across potentially variable audience interests and contexts of information seeking"
Our resource is targeted to clinical practitioners. We have announced this. The markers that would seem to overlap for health professionals and consumers might be quality of science, extent of peer review, date, etc. However, it seems more effective to distinguish a professional from consumer oriented database.
- "have yet to identify examples of any that apply satisfactorily across a full range of resource types or collections of resources "
It has been our approach to rank entries in comparison to other entries of the same resource type.
"If numerical and star rating systems can vary so widely in their implementation and interpretation, as we have seen here, it remains questionable whether they can succeed in usefully guiding the user to the selection of high quality resources 'at a glance'."
Rating systems are fairly ubiquitous. We all select restaurants, movies, etc. using them. The systems are inherently imperfect because the criteria and values of the cataloguer and user differ. However, the precise definition of what is clinically relevant information has allowed the site selection and rankings within Medical Matrix to be fairly obvious. Hopefully they are intuitively understood by clinicians. We have surveyed our users and have found this to be true.
While it is clear that additional metadata descriptions such as date authored, impact factor, or even statistical significance could be applied, it is not clear that confining the indexer to a standard template for resource descriptions would prove economic. In my opinion the greater need is to subcategorize resources along MeSH trees.
In conclusion, I fully appreciate the value of the academic exchange in this journal. However, I would appeal to readers of this forum to embrace the Internet medium and take advantage of the capabilities that it offers. An email to contributors to Medical Matrix is an easy way to learn about its methodology. There is a great potential to improve patient care by cataloguing Internet medical resources. It would appear to me that supportive, collaborative, and cross discipline efforts would help to accomplish this goal.
Tel: 01509 222356
Fax: 01509 223993
Address: Loughborough, LE11 3TU
Royal Free Hospital School of Medicine
Tel: 0171 830 2585
Fax: 0171 794 3534
Address: Rowland Hill Street, London, NW3 2PF
Department of Information and Library Studies,
University of Wales, Aberystwyth
Tel: 01970 622146
Fax: 01970 622190
Address: Llanbadarn Campus, Aberystwyth, SY23 3AS