Web Magazine for Information Professionals

Moving Ariadne: Migrating and Enriching Content with Drupal

Thom Bunting explains some of the technology behind the migration of Ariadne (including more than 1600 articles from its back issues archive) onto a Drupal content management platform.

Tools and strategies for content management are a perennial topic in Ariadne. With more than one hundred articles touching on content management system (CMS) technologies or techniques since this online magazine commenced publication in 1996, Ariadne attests to continuing interest in this topic. Authors have discussed it within various contexts, from intranets to repositories and Web 2.0, with some notable surges in references to 'content management' between 2000 and 2005 (see Figure 1 below). Although levels of discussion show no sustained upward trend, it is clear that over recent years Ariadne authors have taken note of and written about content management tools and techniques on a regular basis.

In the light of this long-established interest, it is noteworthy that Ariadne itself migrated into a content management system only recently. Although the formatting of its articles did change a few times since 1996, Ariadne remained 'hand-coded' for more than fifteen years.  None of its articles had been migrated into a database-driven content management system until March 2012, when issue 68 was published.  

As mentioned in the editorial introduction to that first issue, launching the new content management arrangements, and as discussed in some more detail below (see 'Technical challenges in content migration'), the considerable size of Ariadne's archive of back issues was daunting. With more than 1600 articles in hand-coded 'flat'-html formats, the process of migration itself required careful planning to result in a seamless, graceful transition into an entirely new content management arrangement. Over time, as the total number of articles published within this online magazine steadily expanded, the sheer size of the Ariadne corpus had made it both increasingly rich in content and increasingly challenging to convert retrospectively into a database-driven CMS.

In looking back over the recent process of migrating Ariadne onto a CMS platform, this article discusses some tools and techniques used to prepare content for transfer, testing, and then re-launch.  After explaining some of the background to and objectives of this work, this article focuses on key features of content management supported by Drupal. 

Figure 1: Ariadne timeline of references to content management

Requirements Analysis: Planning the Way Forward

Based on surveys of readers and authors conducted in late 2010, the Ariadne management team analysed the range of feedback, drew up sets of re-development requirements, and then considered the options available.

The following table provides an overview of key findings regarding the range of enhanced functionality and features considered:

Overview of findings derived from survey responses
enhanced functionality or feature | interest recorded in surveys
browsing by keywords | 73.4% of respondents
updated look and feel | 62.3% of respondents
browsing by title | 50.0% of respondents
enhanced use of search engine | 48.0% of respondents
improved display for portable devices | 34.0% of respondents
more summative information on articles | 32.1% of respondents
improved navigability from article level | 32.1% of respondents
improved social media options | 29.5% of respondents
browsing by author | 28.0% of respondents
improved RSS feeds | 27.0% of respondents

In addition to these findings derived from surveys, the management team also recognised the need for some other functionalities to support monitoring of Ariadne's on-going engagement with various domains and institutions across the UK and beyond.

Additional features to support monitoring of engagement
identification of author domains (higher education, further education, research, commercial, etc) | to support analysis of Ariadne connections and reach across various sectors
identification of authors by organisation | to support analysis of Ariadne connections and reach in UK and worldwide

Taking into account the key findings derived from survey questions as well as the additional functionality identified as useful in monitoring UK and worldwide engagement, the Ariadne management team drew up sets of re-development requirements and considered how to proceed. Migration into a content management system represented the obvious way forward, as it became clear that Ariadne's previous tradition of 'hand-coded' production (dating from the early days of the Web) had little chance of coping gracefully with the new sets of requirements.

In a review of CMS options available, it also became clear that  Drupal [1] was well positioned as a content management system (or, emphasising its highly modular and extensible design, content management framework  [2] ) to supply required functionality and features.

Rationale for Drupal's Content Management Framework

Since its beginnings as an open source CMS project in 2001 (with a strong boost in momentum around 2005), Drupal has been well known for its adaptable, modular architecture for content management.

Drupal has drawn attention from many quarters including global corporate, government, and academic institutions attracted by the benefits of using an open source code base across a wide range of use cases. In an extensive series of 15 in-depth articles published on the IBM developerWorks Web site from 2006 to 2007 [3], for example, senior engineers at IBM favourably reviewed the strengths of its programming model (including clean 'separation of content from presentation' [4] and robust 'extensibility' [5]). Though much has changed since the IBM developerWorks articles explained the benefits of Drupal, the fundamental conclusions remain as true now as then:

Drupal has held up well. When we needed new functions, we could usually find an existing module within the contributions. If not, we were able to quickly build our own custom module to extend the functions of our system. This extensibility, found in many open source CMSs, is critical for addressing new problems as they arise.

Perhaps the most important factor in deciding to use Drupal is the very active and large community of developers engaged with its technical development, on-going testing, and collaborative support. As the overall architecture of Drupal is highly extensible, developers produce a wide range of modules that provide new functions as required. Over time, large development houses and major projects [6] as well as many dedicated community programmers [7] have contributed to Drupal so many useful (and sometimes increasingly small or abstracted) building blocks of functionality that it is very rarely necessary for anyone to build 'custom' modules. Drupal's very active community of developers provides and manages 'core' as well as 'contrib' modules in a way that minimises the need for extra programming effort.

Although it is true that Drupal offers developers (as well as technically capable users) many options for customising features and functionality through code, most use cases are easily accommodated without any need to touch code. As explained on Wikipedia [8], Drupal is amenable both to developers and to ordinary users: "Although Drupal offers a sophisticated programming interface for developers, no programming skills are required for basic website installation and administration". Particularly important for everyone using Drupal to manage content, the collaborative, methodically managed code base provided by the active development community can be managed administratively. For developers and ordinary users alike, code management arrangements coordinated via the official Drupal site can be managed easily and reliably. Whereas technical users have many powerful options to manage module installations, configurations, and updates very efficiently via the command line (using 'drush'), non-technical users can use Web-based interfaces to manage module configurations and receive update notifications.

Over the years, the scope of engagement with Drupal across many sectors has increased.  As the official drupal.org site explains, a broad range of groups 'from local businesses to global organizations' use Drupal to manage content.  In addition to the examples of major publishers, universities, and corporates using Drupal, it is noteworthy that a range of government-funded initiatives for public engagement (such as www.whitehouse.gov) and data-sharing (data.gov and data.gov.uk ) as well as very large publishing sites (such as economist.com ) are using Drupal to run remarkably high-performance, data-rich Web sites. [9]

As reliable access to Ariadne content remains the highest priority, the fact that Drupal has been used for years by major publishers contributed to the decision to consider it as a suitable platform.  Although it is unlikely that Ariadne will become as popular as some of the Internet's busiest publishing sites, it is reassuring to know that 'a single mid range server' running Drupal can with reasonable tuning scale up to very high loads of 3.4 million page views per day [10]. 
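To put the cited figure in perspective, a rough calculation converts daily page views into an average rate per second. This is only a sketch under simplifying assumptions: it treats traffic as spread evenly over 24 hours (real traffic peaks and troughs), and it counts page views rather than the individual HTTP requests each page may trigger. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_page_views_per_second(page_views_per_day):
    """Average rate, assuming views are spread evenly across the day."""
    return page_views_per_day / SECONDS_PER_DAY


rate = average_page_views_per_second(3_400_000)
print(round(rate, 1))  # 39.4
```

In other words, the load quoted in [10] averages out at roughly 40 page views every second, with peak periods presumably well above that.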

Technical Challenges in Content Migration

For consistency and coherence, the Ariadne management team decided that all articles originally published in the previous issues were to be migrated onto the new Drupal-based platform. In planning the migration of content into a more modern 'look and feel', with enhanced functionality, the decision was taken to keep intact the substance of articles 'as is'.[11] The focus in this process has been on maintaining the historical integrity of more than 1600 articles in the Ariadne back issues archive, as these previous articles represent a substantial body of work reflecting developments in thinking and working since the commencement of publication in the relatively early days of the Web.

As the previous articles had all been 'hand-coded', with some variations of HTML related to both evolutions in standards and (occasionally) human inconsistencies, one of the first challenges was to retrieve from the 'flat-HTML' corpus the core content of articles that were to be migrated into the Drupal database structure. To accomplish this, a Perl script was created to scan all previous articles and extract the key elements required: article titles, authors, article summaries (used as preview / lead-in text), and the main article content. In this extraction process, it was necessary to transform programmatically some aspects of the HTML into required formats (for example, by changing relative paths to absolute paths for site-wide consistency) and to download images into consistently structured directories so that all Ariadne images could be managed in a consistent way.
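The extraction step described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and is not the script actually used for Ariadne. It shows only two of the tasks mentioned: pulling out the article title and rewriting relative link and image paths as absolute URLs. The base URL, class name, and function name are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical location of one archived article, used to resolve relative paths.
BASE_URL = "http://www.example.org/issue10/sample/"


class ArticleExtractor(HTMLParser):
    """Collect the <title> text plus absolute versions of href and src values."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.in_title = False
        self.title = ""
        self.links = []   # absolute URLs taken from <a href="...">
        self.images = []  # absolute URLs taken from <img src="...">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "a" and attrs.get("href"):
            self.links.append(urljoin(self.base_url, attrs["href"]))
        elif tag == "img" and attrs.get("src"):
            self.images.append(urljoin(self.base_url, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def extract(html, base_url=BASE_URL):
    """Return the title and the absolute link and image URLs found in one page."""
    parser = ArticleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "links": parser.links,
        "images": parser.images,
    }
```

A real script for a hand-coded corpus would also need rules for authors, summaries, and the main body, and would have to tolerate the markup variations that accumulated over fifteen years of publication.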

Once this key content from the previous articles had been extracted from the 'flat-HTML' corpus, the next challenge was to import this content into the MySQL database to be used by Drupal. [12] To achieve this we used the Feeds module, which proved very useful in mapping specific elements of content to the relevant database tables and fields. After setting up and testing the required mappings (via the Feeds Importers interface), it was possible to import all the previous articles from the Ariadne archive into the Drupal database within a few minutes. [13]
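As an illustration of how extracted content might be handed over to an importer, the sketch below serialises article records to CSV. It assumes a CSV-based import; this article does not state which Feeds parser or intermediate format was actually used. The column names are hypothetical and would need to be mapped by hand to node fields in the importer's settings.

```python
import csv
import io

# Illustrative column names only; each would be mapped to a Drupal field by hand.
FIELDS = ["guid", "title", "authors", "summary", "body"]


def records_to_csv(records):
    """Serialise extracted article records to CSV text, one row per article."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        # Join multiple authors with a separator that a later step can split on.
        row = dict(record, authors="|".join(record["authors"]))
        writer.writerow(row)
    return buffer.getvalue()
```

Quoting every field keeps embedded commas, quotation marks, and line breaks in article bodies from being misread as column or row boundaries.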

To check that all content had imported correctly and was displaying as expected in the new Drupal format, the Ariadne production team scanned through migrated content.  To make this scanning process efficient, administrative views were set up to produce all-in-one overviews of all articles published in each issue so that it was easier to spot and correct any inconsistencies remaining from automated parsing of hand-coded articles.  As part of this post-migration review process, it was relatively easy to make a range of improvements in areas such as tweaking placement of images and strengthening editorial consistency in titles across more than 1600 articles. Even though scanning by human beings was, of course, much slower than the programmatic execution of content migration that had preceded it, achievement of more thorough consistency across the entire Ariadne back issue archive has prepared the way for opening up this Web magazine's full range of content. [14]
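The per-issue overviews were built with Drupal's Views module, but the kind of check they supported can be sketched independently of Drupal. The example below is an analogy under assumed field names ('issue', 'title', 'authors', 'summary'); it groups problem articles by issue so that a human reviewer knows where to look first.

```python
from collections import defaultdict


def find_inconsistencies(articles):
    """Group articles with obvious problems by issue number for manual review."""
    problems = defaultdict(list)
    for article in articles:
        found = []
        if not article.get("summary", "").strip():
            found.append("missing summary")
        if not article.get("authors"):
            found.append("missing authors")
        title = article.get("title", "")
        if title != title.strip() or "  " in title:
            found.append("untidy whitespace in title")
        if found:
            problems[article["issue"]].append((title.strip(), found))
    return dict(problems)
```

Checks like these only narrow the search; judgements about image placement or editorial consistency in titles still need a human reader.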

Extensible Toolkit: Using Drupal 'contrib' Options

Once the previous articles from the Ariadne back issue archive had been imported, some key Drupal 'contrib' modules were essential to meeting the set of requirements gathered in the early planning stage of re-development work.  Those familiar with Drupal will not be surprised that Views, Taxonomy, and Content Construction Kit (CCK) [15] modules proved useful in providing essential functionality, as these modules are very widely used in many Drupal sites.  Below is an overview of how Drupal module and theme options usefully supported required functionalities and features.

 

Usage of Drupal 'core' and 'contrib'
functionality or feature | core module | contrib modules
browsing by keywords | Taxonomy | Views
updated look and feel | - | 0 Point Theme, CCK
browsing by title | Taxonomy | Views
enhanced use of search engine | Search | Search 404
improved display for portable devices | - | 0 Point Theme
more summative information on articles | - | CCK
improved navigability from article level | Taxonomy | Views
improved social media options | - | ShareThis
browsing by author | Taxonomy | Views
improved RSS feeds | - | Views, Feed Path Publisher
identification of author domains | Taxonomy | Views
identification of authors by organisation | Taxonomy | Views

As content migrated into the new platform, several other Drupal 'contrib' modules also proved immediately useful in providing features and functionality related to points raised by surveys during the preliminary requirements-gathering process.  Below is an overview of some of these further modules.

Further usage of Drupal 'contrib'
functionality or feature | contrib modules
auto-generated article table of contents | Table of Contents
bibliographic citation examples for articles | Views, CCK
automated management of images | Views, ImageField, ImageCache
printer-friendly versions of articles | Print
auto-generated markers for external links | External Link
managed feedback from readers | Webform
calls for authors, forthcoming updates | Views, CCK

By using the 'contrib' modules listed above (in combination with some 'core' as well as other utility modules), the new Ariadne easily supported the range of enhancements summarised in the preceding table.

Some further enhancements of the new Ariadne platform were supported by another key 'contrib' module: Data. In conjunction with the Views module, this 'contrib' Data module supported the requirement for monitoring data about author backgrounds, and it provided a basis for extending considerably the range of metadata that can be managed and presented regarding author activities and other factors related to publications. It is worth noting how this relatively new 'contrib' Data module significantly extends Drupal's traditionally strong support for running data-rich Web sites. [16] Whilst it is by design geared for technically adept users (as it assumes understanding of database table structures, database queries, and data modelling generally), the Data module provides many options for managing directly within Drupal key sets of data that are relevant to a Web site. In view of the growing trend towards using Drupal as a front-end to data-rich Web sites, it is likely that the Data module is well positioned to help extend this further.
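The kind of table structure and query that the Data module assumes familiarity with can be sketched outside Drupal. The example below uses Python's built-in sqlite3 module with a hypothetical 'author' table; the actual schema used for Ariadne is not described in this article. It counts authors per domain, which is the sort of figure needed for monitoring reach across sectors.

```python
import sqlite3


def count_authors_by_domain(rows):
    """Load (name, organisation, domain) rows and count authors per domain."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    conn.execute("CREATE TABLE author (name TEXT, organisation TEXT, domain TEXT)")
    conn.executemany("INSERT INTO author VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT domain, COUNT(*) FROM author "
        "GROUP BY domain ORDER BY COUNT(*) DESC, domain"
    ).fetchall()
    conn.close()
    return result
```

Grouping by the 'organisation' column instead would give the corresponding breakdown by institution.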

Complementary Tools: Other Technologies and Drupal

The process of migrating Ariadne confirmed that Drupal also integrates flexibly with a range of other technologies. Given that the Drupal project has its origins in mainstream LAMP developments [17], it quite naturally lends itself to working compatibly with many other technologies that share its open source development philosophy. In addition to open source technologies, however, Drupal can also integrate with proprietary ones. [18]

The process of working on the migration of Ariadne content demonstrated that 'contrib' modules can seamlessly facilitate integration with a range of data interchange standards (JSON, XML, RDF) and programming libraries (JQuery, GD graphics, etc). Although this integration with other technologies is by no means unique to it, Drupal does have a fundamentally adaptable and extensible approach to integrating other technologies into its content management framework. [19]

In addition, Ariadne re-development also demonstrated how Drupal lends itself to visualisation of data sets.  As the Drupal community is actively interested in visualisation of data using recently developed libraries (such as TileMill mapping and Highcharts graphing libraries), quite a bit of work in this area is in progress. [20]  Given that Drupal could manage rich data sets related to usage of keywords in articles published over the more than 15 years of publication, after the Ariadne content migration it proved relatively easy to produce via the recently developed Highcharts libraries relevant sets of visualisations showing timelines, bar charts, scatter plots, etc.
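A timeline of this kind amounts to a small configuration object handed to the charting library. The sketch below builds such an object in Python from a list of publication years. The option names used (chart, title, xAxis, yAxis, series) are standard Highcharts options, but the function name and input format are assumptions, and the way Ariadne actually passes data from Drupal to Highcharts is not shown here.

```python
import json
from collections import Counter


def keyword_timeline_options(keyword, publication_years):
    """Build a Highcharts-style options dict counting articles per year.

    publication_years must be a non-empty list of integer years.
    """
    counts = Counter(publication_years)
    # Include years with no articles so the timeline has no gaps.
    years = list(range(min(counts), max(counts) + 1))
    return {
        "chart": {"type": "line"},
        "title": {"text": f"Articles referring to '{keyword}'"},
        "xAxis": {"categories": [str(year) for year in years]},
        "yAxis": {"title": {"text": "Number of articles"}},
        "series": [{"name": keyword, "data": [counts[year] for year in years]}],
    }


options = keyword_timeline_options("content management", [1996, 1996, 1998])
print(json.dumps(options["series"][0]["data"]))  # [2, 0, 1]
```

Serialised with json.dumps, the resulting dictionary can be embedded in a page and passed to the charting library on the client side.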

Of course Drupal as a web-based application is also amenable to testing and tuning procedures using a range of commonly available tools. In preparing for re-launch, the new Drupal platform for Ariadne was tested repeatedly with a range of both open source (shell scripts using wget) and proprietary (ProxySniffer) [21] load-testing tools after carefully tuning the Drupal application and its platform using best-practice Drupal and LAMP-standard procedures. [22] Methodical tuning, testing, and final optimisation were facilitated by many freely available technologies relevant to checking the performance of a Drupal content management platform.
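The idea behind a simple scripted load test can be sketched as follows. This is a Python stand-in for the wget-based shell scripts mentioned above, not a reproduction of them: it requests a list of URLs concurrently and summarises the response times. The function names and the default concurrency are illustrative.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one page and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urlopen(url, timeout=30) as response:
        response.read()
    return time.perf_counter() - start


def run_load_test(urls, concurrency=5, fetch_fn=fetch):
    """Request every URL using a pool of worker threads and summarise timings.

    urls must be non-empty; fetch_fn can be replaced for dry runs.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(fetch_fn, urls))
    return {
        "requests": len(timings),
        "mean_seconds": sum(timings) / len(timings),
        "max_seconds": max(timings),
    }
```

A test of this kind should only be pointed at a server one is responsible for, ideally a staging copy, and results are meaningful only when compared before and after a specific tuning change.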

Conclusion

Based on the experience of transferring, testing, and re-launching on a new Drupal content management platform more than 1600 articles from the Ariadne back issue archive, it is fair to say that Drupal technology has proven more than equal to the challenging tasks involved.

In discussing some of the planning and technical work related to this migration process, this article has attempted to provide information that may be of some use and relevance to two types of readers: 1) those interested in Drupal as a content management system; 2) those interested in Ariadne as a compendium of content about Web technologies. Now that you as a reader have made it this far along in this article, it's likely that you are in one of these two groups. If by this point you find yourself a bit more firmly situated in both these camps, then this article may well have served its purpose.

References

  1. Wikipedia provides a useful overview of Drupal as a 'free and open-source content management system (CMS) and content management framework (CMF) written in PHP and distributed under the GNU General Public License'. See: http://en.wikipedia.org/wiki/Drupal
  2. The notion of 'content management framework' puts emphasis on the reusability of components. As indicated by the Wikipedia article on Drupal, the focus of most Drupal 'core' modules (and, increasingly, many 'contrib' modules) is on providing reusable bits of functionality that can be extended or altered in a highly adaptable way suiting the requirements of particular use cases. Discussions around work in progress on Drupal 8 (currently scheduled for release summer 2013) make it clear that Drupal core is tending to become a content management framework (rather than attempting to produce a finished product); for more information, see http://drupal.org/node/908022
  3. IBM DeveloperWorks in 2006 and 2007 published a series of articles focusing on Drupal (and its relationships with other technologies).  For an overview of these articles, see:  http://www.ibm.com/developerworks/ibm/osource/implement.html
  4. 'Using open source software to design, develop, and deploy a collaborative Web site, Part 5: Getting started with Drupal' http://www.ibm.com/developerworks/ibm/library/i-osource5/
  5. 'Using open source software to design, develop, and deploy a collaborative Web site, Part 15: Lessons learned' http://www.ibm.com/developerworks/ibm/library/i-osource15/
  6. Some notable examples of commercial development houses and major projects contributing module code freely to the Drupal community include: Development Seed's release of the 'Feeds' module; Cyrve's release of powerful enhancements to the 'Migrate' module; and www.whitehouse.gov's on-going releases of modular functionality such as the 'NodeEmbed' module.
  7. It is also important to appreciate that much Drupal code has been contributed by a broad community of dedicated developers who produce key modules without the resources of major development houses or substantial funding. Typical of a globally distributed open source programming model, Drupal has been developed by a combination of both highly experienced and highly creative contributors working collaboratively. For example, a very young programmer created the 'Drush make' module, which automates many tedious tasks of site construction. After being immediately well received as an 'absolute game changer' (see http://boldium.com/blog/drush-make-comes-to-installation-profiles ), it has been further developed by some of Drupal's most senior programmers and integrated directly into 'drush' (the fundamental Drupal site management system).
  8. See the Wikipedia article on Drupal (http://en.wikipedia.org/wiki/Drupal ) for an overview of both basic features and functionalities, some critiques, and noteworthy examples of Drupal implementations on websites across publishing, education, and many other domains.
  9. Further noteworthy examples of sites using Drupal can be found in the Wikipedia article (http://en.wikipedia.org/wiki/Drupal ) and on the official Drupal site (http://drupal.org/about ) as well as on many 'showcases' such as this site belonging to Drupal's original creator Dries Buytaert: http://buytaert.net/tag/drupal-sites/ .
  10. See Khalid Baheyeldin, "How we did it: 3.4 million page views per day, 92 M per month, one server!", DrupalCon Chicago 2011,  http://chicago2011.drupal.org/sessions/how-we-did-it-3-4-million-page-views-day-92-m-month-one-server
  11. As noted in the editorial introduction to Issue 68 (see http://www.ariadne.ac.uk/issue68/editorial1 ), this decision to keep the main content of each article 'as is' following migration to a new platform helps to protect historically interesting online resources.  Even though each article is presented with a more user-friendly and consistent look and feel, and each article is connected far more deeply and richly with other articles in Ariadne related by keyword themes, the preservation of each article's main content remained a key objective of the migration work.  Consequently, word-for-word each article in the new format is comparable with its corresponding version in the previous format; only the surrounding navigational links are different.
  12. Although most Drupal sites typically use MySQL databases, Drupal does provide a database abstraction layer and (depending upon version and configuration) can support a broad range of databases including: MariaDB, PostgreSQL, SQLite, Microsoft SQL Server, MongoDB.
  13. In addition to migrations of content managed via the Feeds module (see http://drupal.org/project/feeds/ ),  even more powerful options are available for migrating content via the Migrate module ( http://drupal.org/project/migrate/ ).  Whereas the Feeds module proved sufficient and highly effective in moving all Ariadne content into the new Drupal platform, in cases where content to be migrated is highly dynamic or huge, then the Migrate module could be a more suitable solution as this module includes sophisticated features such as database-to-database mappings, incremental migrations, and roll-backs (with operations executable via 'drush' command-line). 
  14. As cited in the editorial introduction to Issue 68, former Ariadne editor John Kirriemuir particularly welcomes the fact that after migration to the new platform "the magazine has one universal style throughout". Whereas the previous 'hand-coded' production style of Ariadne meant that it would have been very challenging to effect a universal change in presentation style, adjustments in 'look and feel' were trivial once the content had migrated into Drupal: simply by changing theme templates (easily done via the standard Drupal configuration options) or by tweaking CSS files, the display of all content can be transformed.
  15. It's noteworthy that parts of 'contrib' module Content Construction Kit (CCK) have been moved into 'core' as of Drupal version 7.  In practice, this addition of some CCK functionality to 'core' Drupal code is a sign that CCK functionality has been accepted as essential to every site: by enabling site administrators to extend the range of Drupal database fields,  CCK does further extend the flexibility of Drupal in accommodating a vast range of content management requirements. It is also worth noting, however, that only some parts of CCK have moved into 'core' as of Drupal 7: there is still a strong 'small core' movement within the Drupal community, which in principle is trying to reduce the size of 'core' modules as much as possible and (conversely) to increase the range of 'contrib' modules that can build on a very basic, abstracted functionality provided by a 'small core' set of code. 
  16. As noted above, Drupal has often been used as a front-end to data-rich sites such as data.gov and data.gov.uk, yet recently there seems to be an emerging trend in combining Drupal with large, sometimes highly dynamic data sets and code that produces interactive visualisations. For many examples of interesting work in this area, see http://developmentseed.org/projects.
  17. For an overview of the LAMP (Linux, Apache, MySQL, Perl / PHP / Python) development stack see: http://en.wikipedia.org/wiki/LAMP_(software_bundle) . Whereas the P in the LAMP acronym originally stood for Perl, it is now often interpreted as standing for PHP or (less frequently) Python as all of these scripting languages can be applied in similar ways depending upon a developer's personal coding preference.
  18. From the relatively early case studies of Drupal published by IBM developerWorks, to the most recent work undertaken to make Drupal 7 compatible with Microsoft SQL Server, it is clear that Drupal's architecture makes it amenable to a broad range of applications both open source and proprietary.
  19. The relatively new Libraries API module (http://drupal.org/project/libraries/ ), for instance, positions itself as "common denominator for all Drupal modules/profiles/themes that integrate with external libraries".  As such, this module provides a mechanism for bringing into Drupal various libraries (such as JQuery) in a uniform way, so that  libraries can be easily upgraded without complications and all modules can reliably use a specified library without creating unexpected conflicts (due to multiple, incompatible versions being installed, etc).
  20. For overviews of development work on 'Creating Beautiful Maps for Drupal with TileMill' see: http://developmentseed.org/blog/2011/mar/08/creating-beautiful-maps-drupal-tilemill/.  For more information on the Drupal 'contrib' module in alpha stages of development for direct integration of Highcharts, see:  http://drupal.org/project/highcharts .  
  21. For an overview of ProxySniffer 'Professional Web Load and Stress testing Tool' see: http://www.proxy-sniffer.com/
  22. For an in-depth explanation of best-practice procedures for Drupal performance optimisation and server tuning, see the 1.5 hour video recording of conference workshop presentation by Nate Haug: http://drupalize.me/videos/overview-performance-scalability

Author Details

Dr. Thom Bunting
Observatory and Innovation Zone Project Manager / Web Manager
Innovation Support Centre
UKOLN
University of Bath
Bath BA2 7AY

Email: t.bunting@ukoln.ac.uk
Web site: http://www.ukoln.ac.uk/ukoln/staff/t.bunting/
