A Possible Approach to Importing Static Content Into Drupal.

November 18, 2007

The Motivation

The bread and butter of my freelance development work often involves converting old, static Web sites of small organizations–small businesses and non-profits–to a more sustainable content management system.

These sites were usually started by amateur, would-be graphic designers who worked for free or for very little compensation. They tend to be students or more traditional, print-oriented graphic designers looking for their first real-world “HTML coding” experience. They were retained by organizations that couldn’t make a significant financial commitment to their Web site, either because they didn’t believe the site would be important to them, or because they didn’t believe they had the money to hire a professional to build and maintain it.

In these situations, my work involves not only migrating the content and hosting, but also updating the site design, reorganizing and cleaning up the content, and empowering staff members by educating them not only in the use of the content management system but also in good Web design and information architecture practices. I usually also try to educate the organization’s management on ways to use their new Web site as an effective marketing tool and a strategic part of their business.

You can imagine the state these sites are in when I get them. I’m talking circa-1998 Microsoft FrontPage (or present-day Dreamweaver) garbage, or worse. The graphic design hurts your eyes, the HTML code is dirty, and the organization of the content is dizzying. I don’t think this is shocking news to most readers.

Some clients are smart enough to recognize that they have a problem with their site, they trust me to do the job they hired me to do, and they are willing to make a commitment of time and energy to learn to manage their site effectively. In these cases, they will see a CMS as an opportunity to start fresh, re-envision their Web site and rebuild it from the ground up. This frees me to do what I do best and allows the organization to make a much needed investment in their Web site content.

On the other end of the spectrum is the customer that refuses to be educated in best practices, either because they ultimately don’t trust my judgment, or they think they know what’s best for their site. They want everything preserved–their ugly layout, confusing structure and horrible content–they just want to use a CMS to feel a little hipper. In my first year, when I was just trying to establish myself, I swallowed my pride and worked for these clients. Now, I have a better qualification process, and I don’t waste my time with clients like this. They want something for nothing, they don’t trust me, and ultimately, they don’t deserve me. They are simply much more trouble than they are worth.

Of course, most clients fall in the middle. They are willing to listen to most of my advice, but might insist on certain things like an illogical structure to the navigation, or a hideous header graphic. I’ve never believed that my role is to “save the client from themselves”–it’s their site, and after stating my case, I need to implement what they want whether I agree with it or not. (I’ve heard of companies that will reverse changes clients make to sites if they don’t agree with what the client has done, without even consulting the client first!) Or, the client may be willing to remake their site quite dramatically, but they need to make changes more slowly and incrementally, and want to wait until the site is driven from a CMS to begin transforming it. That approach can make sense too.

In either of these cases, the first problem I often need to address is just migrating existing site content. This may be because the clients don’t feel confident in their ability to be trained to use the CMS right away, or it may be that they just assume that I will initially do this work. I guess it’s like when I go to a shoe store expecting to buy shoe laces; after condescending laughter, I’m told that shoe stores don’t do that sort of thing: “you don’t expect to buy gas from a car dealer, do you?”

More and more, my “go to” CMS is Drupal. I appreciate its conceptual design, and it is easy to deploy. It’s a mature product with equally mature modules that allow me to address a wide range of customer needs. Obviously, like any good CMS, Drupal is great at building new sites and maintaining existing ones, but when it comes to the largely mind-numbing task of shoving in lots of existing static content, Drupal can seem agonizingly slow, as does any other mature CMS I have used in that situation.

So, what are my options for trying to automate this task?

The Approach

My first thought was that what I’ve described above must sound very familiar to others working in this space, so there must be some great ideas out there for making this kind of content migration easier. And since Drupal has many great modules, perhaps there is one designed for just this occasion.

And of course, there is: Import_HTML. My first reaction was that this appeared to be a very well thought out, clever approach to the problem, and that opinion hasn’t changed, despite the fact that I recently tried it on a particular site I mirrored offline using wget, and I didn’t get the results I was hoping for.

I suspect that no matter how well Import_HTML is implemented, its effectiveness is likely to be limited because the problem it’s trying to solve is just too big and arbitrarily complex. Because the sites I want to import are not managed by a CMS, or even constructed by someone skilled, each page is a unique mess all its own. I’m skeptical that any automated tool can deal with these situations effectively enough to make its use worthwhile.

(That said, I would encourage anyone using Drupal and facing this situation to give Import_HTML a chance, as it is a very nice module that you may find more helpful in your situation than I did in mine. If your experiences and your assessment of Import_HTML’s broader usefulness differ from mine, I would love to hear more about it. I didn’t invest a lot of time in getting Import_HTML to work, so it would be extremely unfair to conclude that what I am saying about it is anything like an informed “review” of the module.)

Even if Import_HTML worked well in some cases, I’m not sure it would be helpful to have to fall back on a manual process in the cases in which it didn’t work. I started thinking that I would prefer to have a single, consistent approach that worked in all cases, even if it only automated a part of the process. After some thinking, I concluded that a great deal of the typing and clicking involved in importing static content manually into Drupal centers on creating a new page, typing its title and its URL path, and configuring it in the menu structure, including its parent and weight. It also happens that this is a much smaller, more specifically defined set of tasks to attempt to automate, and I believe I have a promising start on doing just that.

Imagine a typical tree-style navigation on a Web site. It provides all the information that I described in the previous paragraph. So what if we envision this structure beforehand, quickly type it into a plain text file, and then use that file to automate the tasks described? Let’s take a small example of what that might look like:

home
products
- content management
- customer relationship management
services
- web hosting
- custom programming
about us
contact us

It looks like this file could be parsed, and if we know how to work with Drupal programmatically, we could easily get a leg up on migrating the static content. After some research in the Drupal forums, I created my first pass at making this happen:

#!/usr/bin/php
<?php
error_reporting(E_ERROR);
// check the command line args, provide help
if ($argc != 1) { // the script takes no arguments, so any argument (--help, -help, -h, -?) shows the help
?>

This is a command line PHP script.

Use it as indicated to create the structure of a site, complete with placeholder
pages, custom paths and proper placement in the menu structure without the
tedium of the Drupal GUI. Then the content of each page can be customized.

Usage: <?php echo $argv[0]; ?>

When run from the root of the Drupal install, it looks for a file in that
directory called structure.import with the following format:

  <url path> | <menu label> | <page title>
  - <url path> | <menu label> | <page title>
  -- <url path> | <menu label> | <page title>

The lines in this file are in order, optionally with one or more '-' characters
at the beginning to indicate placement in the hierarchy.

Only the first one or two elements need to be given on each line; the remaining
fields will be deduced. Capitalization is normalized and whitespace trimmed.

With --help, -help, -h, or -? options, you can get this help.

<?php
  exit; // stop after printing the help so the import below never runs by accident
}
// load in necessary Drupal classes, database connection information
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// file format configuration
$levelDelim = '-';
$elementDelim = '|';

// track levels
// for each level, map parent ids to current weight for an item on that parent
// 1=navigation, 2=primary links
$levels[] = array(1,0);

$import = "structure.import";
if (file_exists($import)) {
  $lines = file($import);

  foreach ($lines as $line_num => $line) {
    if (trim($line) != '') {
      $elements = explode($elementDelim, $line);
      $level = strspn(ltrim($elements[0]), $levelDelim) + 1; // count only the leading delimiters

      $path = $elements[0];
      $path = str_replace($levelDelim, '', $path);
      $path = trim($path);
      $path = strtolower($path);

      $label=ucwords($path);
      $title=$label;
      if (isset($elements[1])) {
        $label = trim($elements[1]);
      }
      if (substr_count($path, ' ')) {
        $path = str_replace(' ', '_', $path);
      }
      if (isset($elements[2])) {
        $title = trim($elements[2]);
      }

      // create the page
      $node = new StdClass();
      $node->uid = 1;
      $node->type = 'page';
      $node->status = 1; // published
      $node->promote = 0; // don't promote to front page
      $node->path = $path; // ?q=path
      $node->format=3; // full HTML
      $node->title = $title;
      $node->body = ''; // add later
      node_save($node);

      $parentLevel = $level-1;
      $parentLevelInfo =& $levels[$parentLevel];

      // create the menu item
      $menuItem = array();
      $menuItem['pid'] = $parentLevelInfo[0];
      $parentLevelInfo[1]++;
      $menuItem['weight']=$parentLevelInfo[1];
      $menuItem['path']='node/' . $node->nid;
      $menuItem['title']=$label;
      $menuItem['type']=118; // see includes/menu.inc
      menu_save_item($menuItem);

      $levels[$level] = array($menuItem['mid'],0);

    }
  }

} else {
  echo "\n\nNo import file: $import found.\n";
}
?>

This is obviously pretty rough around the edges, but I think it’s a promising start, and it definitely automates a lot of tedious clicking in the Drupal content management interface. It will parse the structured text file I presented above and create a basic site structure with placeholder page nodes.

It’s designed to run on the command line in the root of a Drupal site, and looks for a structured text file following the above conventions in the same directory, called “structure.import”. It fires up the Drupal machinery, just as it would if Drupal were receiving a request through the Web, and programmatically creates page nodes and configures the menuing system. Clearly, the Drupal folks expected that users would want to interact with the system programmatically.
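To preview what an import file would produce before running anything against a live database, the level and weight bookkeeping can be exercised on its own. The following is a minimal standalone sketch, not part of the script above: the function name preview_structure and the shape of the array it returns are hypothetical, and it only handles the simple one-field lines from the first example. Unlike the script, it rejects a line that skips a level instead of attaching it to a stale parent.

```php
<?php
// Hypothetical dry-run helper: reports parent and weight for each line
// without calling any Drupal function.
function preview_structure(array $lines, $levelDelim = '-')
{
    $parents = array(0 => '(root)'); // most recent item seen at each depth
    $weights = array(0 => 0);        // last weight handed out under each parent
    $plan = array();

    foreach ($lines as $line) {
        $raw = trim($line);
        if ($raw === '') {
            continue; // blank lines are ignored, as in the script
        }
        $depth = strspn($raw, $levelDelim); // leading delimiters only
        if (!isset($parents[$depth])) {
            throw new InvalidArgumentException("No parent for line: $raw");
        }
        $label = ucwords(strtolower(trim(substr($raw, $depth))));

        $weights[$depth]++;
        $plan[] = array(
            'label'  => $label,
            'parent' => $parents[$depth],
            'weight' => $weights[$depth],
        );

        // This item becomes the parent for the next deeper level;
        // anything deeper than that is forgotten.
        $parents = array_slice($parents, 0, $depth + 1, true);
        $weights = array_slice($weights, 0, $depth + 1, true);
        $parents[$depth + 1] = $label;
        $weights[$depth + 1] = 0;
    }
    return $plan;
}

// Usage: print one line per planned menu item.
$outline = array('home', 'products', '- content management', 'services', '- web hosting');
foreach (preview_structure($outline) as $item) {
    echo $item['parent'], ' > ', $item['label'], ' (weight ', $item['weight'], ")\n";
}
```

Keeping this logic free of Drupal calls makes it easy to check a long structure file for mistakes before any nodes are created.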

In most cases, the text file I presented is all you would need to create. But as I said, some customers want their site content to be migrated faithfully, at least at first, and one thing I see time and again is that links don’t match the page titles they link to–which for me is a cardinal usability sin. So, I allowed the script to account for situations like this by allowing you to specify different menu labels and page node titles. And, if you only have a few pages that do this, you only have to specify the ones that are different. Also, the program attempts to be very tolerant of things like spacing issues and capitalization. So, you can have a pretty sloppy file that should still work as expected, even with a mess like this:

home
Products
-content management
-customer relationship management
services
- web hosting
- custom programming
about us | About Us Label
  contact us | Contact Us Label | Contact Us Title
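The defaulting rules are easier to see in isolation. Here is a minimal sketch of the per-line parsing, written as a standalone function that does not touch Drupal. The name parse_structure_line and the returned array keys are hypothetical; the rules mirror the script (the title defaults to the capitalized path, not to the menu label), with the added assumption that an empty field falls back to its default.

```php
<?php
// Hypothetical helper: splits one line of structure.import into its parts.
function parse_structure_line($line, $levelDelim = '-', $elementDelim = '|')
{
    $elements = array_map('trim', explode($elementDelim, $line));

    // Leading delimiters give the depth; the rest of the first field is the name.
    $depth = strspn($elements[0], $levelDelim);
    $name = strtolower(trim(substr($elements[0], $depth)));

    // Both label and title default to the capitalized name.
    $label = ucwords($name);
    $title = $label;
    if (isset($elements[1]) && $elements[1] !== '') {
        $label = $elements[1]; // explicit menu label
    }
    if (isset($elements[2]) && $elements[2] !== '') {
        $title = $elements[2]; // explicit page title
    }

    return array(
        'depth' => $depth,
        'path'  => str_replace(' ', '_', $name),
        'label' => $label,
        'title' => $title,
    );
}

// Usage: the sloppiest line from the example above.
print_r(parse_structure_line('  contact us | Contact Us Label | Contact Us Title'));
```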

The only oddity I have seen so far is that when I initially run it, it appears not to create the menu entries until I actually go to the menu administration area, then they suddenly appear. Obviously, this is likely some caching issue that I could also probably control programmatically, if I knew better what I was doing.
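One candidate for handling this programmatically is Drupal's own menu_rebuild() function, called once after the import loop finishes. This is an untested assumption rather than a confirmed fix: menu_rebuild() is a Drupal core function, but whether it clears up the delayed menu entries seen here has not been verified. The wrapper name refresh_menu_after_import is hypothetical, and the function_exists() guard only keeps the sketch harmless when run outside a bootstrapped Drupal.

```php
<?php
// Hypothetical wrapper around Drupal's menu_rebuild().
// Returns true if a rebuild was triggered, false when Drupal is not loaded.
function refresh_menu_after_import()
{
    if (!function_exists('menu_rebuild')) {
        return false; // not running inside a bootstrapped Drupal
    }
    menu_rebuild(); // assumption: this refreshes the cached menu structure
    return true;
}

// Usage: call once, after the foreach loop that saves the menu items.
refresh_menu_after_import();
```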

Other enhancements are probably screaming out at you. Obviously we can also programmatically specify the body of the page node, so perhaps we could come up with a semi-automated way of doing that too, perhaps even programmatically running Tidy on the page body before assigning it to the node. Other suggestions are welcome.
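As a rough illustration of the Tidy idea, the sketch below cleans an HTML fragment with PHP's tidy extension before it would be assigned to the node body. The function name clean_body_html and the chosen Tidy options are assumptions, and the sketch simply returns the input unchanged when the extension is not installed. Pulling the real content region out of a legacy page is a separate problem that this does not address.

```php
<?php
// Hypothetical helper: repairs an HTML fragment with the tidy extension.
function clean_body_html($html)
{
    if (!function_exists('tidy_repair_string')) {
        return $html; // tidy extension not available; leave the markup as-is
    }
    $config = array(
        'show-body-only' => true, // return only the body contents, no <html> wrapper
        'output-xhtml'   => true, // emit well-formed XHTML
        'wrap'           => 0,    // do not re-wrap long lines
    );
    $clean = tidy_repair_string($html, $config, 'utf8');
    return $clean === false ? $html : $clean;
}

// Usage: clean a fragment with an unclosed tag before saving it.
$body = clean_body_html('<p>Hello <b>world</p>');
// In the import script this could become: $node->body = $body;
```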

Conclusion

As I said, the situation I have described must sound familiar to many, and there are probably many developers with a lot more experience with Drupal than I have, who have given this a lot of thought and perhaps reached very different conclusions. I’d be interested to hear people’s opinions on the potential of this approach and any descriptions of alternate solutions to the challenge of importing static Web pages into Drupal.

Even if this approach turns out to be a bad idea, if nothing else, I think it’s a solid example of how to manipulate Drupal content programmatically. Maybe that will be the greatest value of presenting this code.

7 Comments »

2008-01-10 18:15:39

[...] a previous post, “A Possible Approach to Importing Static Content Into Drupal”, I talked about trying to find a quick and dirty way to populate Drupal, in that case migrating [...]

 
Comment by diamon
2008-02-05 19:18:32

Umm… reasonable article nice design.. kill the auto search term highlighting or at least add some rules like don’t highlight the letter a or other obvious words (the, and, I) makes the site highly unusable….

 
2008-06-23 17:01:30

[...] continue to find useful the script I posted some time ago that creates a basic site structure in Drupal. Its not uncommon that a customer will [...]

 
Comment by Dale Reagan
2008-11-08 15:49:42

Greetings!
I’ve been looking for this type of discussion and am surprised that there is not too much of it. Your tool does appear to be quite useful (I need to ‘import’ ~300 pages with a very simple structure.)

I saved your code to: import.drupal.php and get an error.

php -l import.drupal.php
PHP Parse error: syntax error, unexpected ‘,’ in import.drupal.php on line 57
Errors parsing import.drupal.php

Is line 57 correct?

$path = str_replace($levelDelim, “, $path);

:)
Dale

Comment by admin
2008-11-09 07:48:18

Dale,

That line should be right as posted. Make sure that, in cutting and pasting from the blog post, “smart quotes” didn’t get pasted, and that the two single quotes didn’t get translated into one double quote.

Also, if you are using Drupal 6, check out this update: http://www.stonemind.net/blog/2008/06/23/drupal-6-site-structure-script/ .

 
 
Comment by sandrar
2009-09-10 10:40:39

Hi! I was surfing and found your blog post… nice! I love your blog. :) Cheers! Sandra. R.

 
2011-03-04 00:39:19

Got a few tips that are really applicable to our project. Thanks.

 