Planet SOSIG: A Spring-clean for SOSIG: A Systematic Approach to Collection Management
The SOSIG collection
The core of the SOSIG service, the Internet Catalogue, now holds over 21,000 structured metadata records describing Internet resources relevant to social science teaching, learning and research. Established in 1994, SOSIG is one of the longest-running subject gateways in Europe. Our section editors have been seeking out, evaluating and describing social science Internet resources, developing the collection so that it now covers 17 top-level subject headings with over 1000 sub-sections. Given the dynamic nature of the Internet, and the Web in particular, collection development is a major task. Collection management (i.e. weeding out broken links, checking and updating records) at this scale can also be something of a challenge.
The SOSIG core team, based at ILRT in Bristol, devotes considerable resource to removing or revising records with broken links (human checks based on reports from an automated weekly link-checking programme). Section editors, based in universities and research organisations around the UK, also consider durability and reliability of resources as part of the extensive quality criteria for inclusion in the Catalogue. They regularly check records and update them: however, the human input required to do this on a systematic and comprehensive scale would be beyond current resources. SOSIG has therefore recently embarked on a major 'spring cleaning' exercise that it is hoped will address this issue and keep the records current. We describe below the method, and outcomes to date.
There are several reasons why such collection management activity is important. User feedback indicates that currency of the resource descriptions is one of the most appreciated features of the SOSIG service. SOSIG and other RDN hubs are promoted on the basis of the quality of their records: offering out-of-date descriptions and other details is likely to frustrate users and, in the long term, be detrimental to their perceptions, and therefore use, of the service. Recent changes in data protection legislation also emphasise the obligation to check that authors/owners are aware of and happy with the inclusion of their resources in SOSIG. Checking with resource owners also appears to have incidental public relations benefits and is helping to develop the collection by identifying new resources from information publishers and providers.
How did we go about our spring-clean?
Each of the metadata records for the 21,000 resources catalogued in SOSIG contains a field for 'administrative email' - the contact email address of the person or organisation responsible for the site. We adapted an existing Perl script (developed in ILRT for another project), which allowed a tailored email to be sent to each of these addresses. The message includes the URL of the SOSIG record(s) associated with the admin email. Recipients are informed that their resources are included in SOSIG and are asked to check the SOSIG record for their resource (via an embedded link in the message) and supply corrections if necessary. They are also invited to propose new resources for addition to the Catalogue. The script adds details of records processed to a copy of the database: the next time the script runs, it checks all unmarked records and sends the message to the next 2000 of these. Once all 21,000 records have been processed, the script will run on a weekly basis to ensure that newly added records are also notified to their owners for checking.
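The batching step described above can be sketched roughly as follows. This is an illustration in Python, not the Perl script actually used. The record fields (`id`, `admin_email`, `notified`) and the function name `next_batch` are assumptions made for the example, since the article does not describe the database schema.

```python
BATCH_SIZE = 2000  # batch size quoted in the article


def next_batch(records, batch_size=BATCH_SIZE):
    """Return the next unmarked records in identifier order and mark them.

    `records` is a list of dicts with the hypothetical keys 'id',
    'admin_email' and 'notified'; these names are assumptions for this sketch.
    """
    # Consider only records that no earlier run has marked, lowest id first.
    unmarked = sorted(
        (r for r in records if not r["notified"]),
        key=lambda r: r["id"],
    )
    batch = unmarked[:batch_size]
    for record in batch:
        record["notified"] = True  # so the next run skips this record
    return batch
```

Calling `next_batch` repeatedly on the same list works through the collection one batch at a time. Once every record is marked, a later run picks up only newly added records, which mirrors the weekly run planned for after the initial pass.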
Phasing the process
We first considered a mass, simultaneous mailout covering all 21,000 records. The script sends one message per minute to avoid swamping the servers. However, we had no idea of the level of response likely to be generated and wanted to avoid swamping ourselves! We therefore decided to phase the process, running the script against batches of 2000 records on a roughly monthly basis, in numerical order of unique record identifiers. The process was run for the first time at the end of July 2002 and, on the basis of low-numbered identifiers, included records of resources first catalogued in SOSIG's early days. A second batch of 2000 records was processed in the last month. Whilst Phil Cross oversaw the technical monitoring of the process, Emma Place and Dave Boyd have handled the personal responses, either dealing with change requests or passing on suggestions for additional resources to Section Editors responsible for specific subject areas on SOSIG.
Some interim results
A range of responses
To date we have received 239 personal responses (approximately 5%) from email recipients. A further 1023 automated 'bounced' responses were received. Those of us who are regular and long-term users of the Web are well aware of the fairly constant evolution of Web resource content and features. The SOSIG spring clean exercise also highlights the extent of change in personnel associated with Web resources. As mentioned above, of the emails sent relating to the first 4000 records, over a quarter 'bounced' back. Although a very small proportion of these were automated 'out of office' replies, most were returned because the address was no longer in use.
The majority of the personal responses requested only one change: to the administrative email address recorded for their resource. Many had stopped using personal email addresses and had turned to generic site or service addresses. Others reported that they were no longer responsible for the resource. As the first batches included older records, it will be interesting to see whether the proportion of bounced and changed emails reduces over time, or whether people are really more volatile than the resources.
We have to assume that the remaining 69% of email recipients have no cause for complaint or change requests. In fact, we were very pleased at the overwhelmingly positive response the exercise has generated so far. Many simply confirmed that their records were correct and they were pleased to be included. Others noted minor corrections to descriptions, URLs and, as mentioned, admin email addresses. Many also took the time to recommend new resources for addition to the Catalogue. Only one or two concerns were raised about the inclusion of certain data in the records, although there were several queries which highlighted changes needed to the email message for the second and subsequent batches.
One of these arose as a result of the de-duplication process, which only operates within each batch of 2000 records. Where the same admin email address is included in records excluded from that batch, the de-duplication process ignores it. Some recipients therefore asked why we had apparently included only some of their resources, when they are actually on SOSIG, just not in that particular set of records. The text of the message will therefore change for the third batch to make this clear.
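A minimal sketch of de-duplication within a single batch may make this limitation clearer. It is written in Python for illustration only. The field names `admin_email` and `record_url`, and the case-insensitive comparison of addresses, are assumptions rather than details of the actual script.

```python
from collections import defaultdict


def group_by_admin_email(batch):
    """Map each admin email in one batch to the record URLs listed under it.

    Only the records passed in are examined, so records in other batches
    that share the same address are not included in the grouping.
    """
    grouped = defaultdict(list)
    for record in batch:
        # Normalising case and whitespace is an assumption of this sketch.
        key = record["admin_email"].strip().lower()
        grouped[key].append(record["record_url"])
    return dict(grouped)
```

Each key would then receive one message listing its record URLs. Because the grouping sees only one batch, an owner whose records fall in several batches is told about only some of them each time, which is the behaviour recipients queried.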
Only one major issue was raised, that of deep-linking. It seems that this is a problem for one organisation, and raises questions about the changing nature of the Web - or perhaps some companies' difficulty in engaging with its original principles. Time will tell whether this is an issue for other organisations: to date it has been raised only once.
Handling the responses
Spring-cleaning in domestic settings always involves considerable effort, and the SOSIG spring clean is no exception. Emma and then Dave have spent about a week, full-time, dealing with the personal responses received after each batch of 2000 records was processed. The first batch of messages all had the same subject line, so it was impossible to distinguish between responses appearing in the shared mailbox used for replies. In the second batch of 2000, the subject line includes the domain of the admin email address, which makes handling the responses much easier.
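The subject-line change could look something like the sketch below, given in Python for illustration. The base subject text and the bracketed format are hypothetical, as the article does not give the wording of the message.

```python
def subject_for(admin_email, base="Your resource on SOSIG"):
    """Build a subject line that carries the domain of the admin address.

    The default `base` text and the "[domain]" suffix are assumptions for
    this sketch, not the wording used in the real mailout.
    """
    _, separator, domain = admin_email.strip().rpartition("@")
    if not separator or not domain:
        # No usable domain: fall back to the plain subject line.
        return base
    return f"{base} [{domain.lower()}]"
```

With the domain in the subject, replies arriving in a shared mailbox can be sorted or searched by the organisation they came from.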
Bounced messages create the most work, because detective skills are then necessary to check resources 'by hand' and search for a replacement admin email address to which the message can then be forwarded. Minor corrections take little time, but the recommendation of new resources leads to initiation of our usual evaluation and cataloguing processes which can be lengthy, depending on the nature and scale of the resource.
We realised that timing of the process could have been better: initiating it in the middle of Summer holiday season is likely to have resulted in more out-of-office replies than might be expected at other times. It will be interesting to monitor this as the processing progresses over the next few months, to see whether this is actually the case.
Although time-consuming, the spring clean is still a more efficient way of cleaning the data than each Section Editor having to trawl through every single record and its associated resource. Here we are relying on resource owners to notify us of incorrect data as well as new resources: they are the ones who know their resources best, and are best-placed to identify problems and changes.
The spring-clean appears to have sent out - and generated - some very positive messages for SOSIG and resource providers. The range of responses has also been very interesting, with very large organisations - publishers, government departments in the UK and abroad, for example - responding just as positively as small units and individuals. Messages have come from all over the world and most respondents seem genuinely pleased to be included, indicating that SOSIG is held in some esteem. The House of Commons, Channel4 and the Office of the President of Burkina Faso all took the time to respond. Just a small selection of the positive comments received is reproduced below:
"Many thanks for your email and the courtesy of letting us know about your listing of our Web site" (Women in London)
"I appreciate the listing and hope that your viewers will find the site to be both educational and enjoyable" (Ralph Frerichs, UCLA)
"Thank you very much, that's most encouraging. We will try to maintain our contributions in accordance with the confidence you have shown us" (Michael Pye, Internet Journal of Religion)
"Thank you for linking to our site. You have a very interesting Web page and it's great that you've compartmentalized all the links so well … keep up the good work" (Oula Ingero, Virtual Finland)
"Thank you for the inclusion of "EuroPortal" in your SOSIG database. I have also placed the SOSIG logo on the main menu page" (Gerhard Kenk, CrossWater Systems)
There has clearly been a benefit so far in public relations, raising awareness of SOSIG and also establishing more direct contact with resource providers, who may remember to advise us of new resources or changes in the future. Several confirmed that they had added a reciprocal link to SOSIG to their sites, which not only disseminates SOSIG's name more widely, but also offers a potential new route to the service to an audience that might not otherwise have been aware of it.
As the processing continues, we shall monitor and analyse responses and expect to publish fuller details in the future. Meanwhile, we hope SOSIG users will benefit from the much "cleaner" and more current data in our Internet resource catalogue.
- Lesly Huxley, Emma Place, David Boyd, Phil Cross
- 8-10 Berkeley Square
- Bristol BS8 1HH