This is a step-by-step guide to using Drupal to take over an existing site completely, with as little fuss as possible.

Prepare the site

For this tutorial, it doesn't matter whether you are importing an old site into a working Drupal host or installing Drupal over an existing filesystem. As long as there are no direct file conflicts, either will work the same. If there are conflicts, they can probably be sorted out easily. As always, it's best to run this on a demo site or on a very well backed-up one. Some of the steps may cause lossage, though I haven't had any. I won't cover the actual Drupal setup. Instead I'll start with a working, clean site and fetch static stuff into it, but in a way that emulates an in-place replace.
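Since "very well backed up" is doing a lot of work in that paragraph, here is a minimal sketch of a file-level snapshot. The function name `backup_site` and its arguments are made up for this illustration. It only covers files on disk, so the Drupal database would still need its own dump.

```shell
#!/bin/sh
# Sketch: snapshot a site directory into a timestamped tarball before importing.
# backup_site and its arguments are hypothetical names, not part of Drupal or import_html.
backup_site() {
    site_dir=$1      # directory to back up, e.g. /var/www/drupal
    backup_dir=$2    # where the archive goes; keep it outside site_dir
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$backup_dir/site-backup-$stamp.tar.gz"

    mkdir -p "$backup_dir" || return 1
    # -C makes the paths inside the archive relative to the site directory.
    tar -czf "$archive" -C "$site_dir" . || return 1
    # Print the archive path so the caller can note it down.
    echo "$archive"
}
```

Something like `backup_site /var/www/drupal "$HOME/backups"` would then leave a tarball to fall back on if the import eats something it shouldn't.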

For this illustration, I start with a working Drupal install in /var/www/drupal. I'm operating as a user in the same permissions group as the webserver, and I have a umask of 0002, so my files are shared with it.
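If you're not sure whether your own umask leaves new files group-writable, a quick check along these lines may help. It's only a sketch, and `umask_allows_group_write` is a name invented here.

```shell
#!/bin/sh
# Sketch: succeed when a umask leaves the group write bit on for new files.
# umask_allows_group_write is a hypothetical helper name.
umask_allows_group_write() {
    # Use the given mask, or the current shell's umask when none is passed.
    mask=${1:-$(umask)}
    # The 020 bit of the mask strips group write, so it must be clear.
    # The leading 0 forces the value to be read as octal.
    [ $(( 0$mask & 020 )) -eq 0 ]
}
```

Running `umask_allows_group_write && echo "group can write"` checks the current shell, and `id -nG` lists the groups you belong to so you can compare them with the webserver's group.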

Enable importHTML module

Move into the site and make a modules directory. Modules go under 'sites', not the webroot.

  $ cd /var/www/drupal/sites/default/
  $ mkdir modules

Get a checkout of import_html, or download and unzip the tarball as usual.

  $ cvs -d login
  Logging in to
  CVS password: anonymous
  $ cvs -d co -d import_html contributions/modules/import_html
  cvs checkout: Updating import_html
  U import_html/README.txt
  U import_html/import_html.module
  U import_html/import_html_help.htm
  ...
  U import_html/coders_php_library/xhtml_tidy.conf
  U import_html/coders_php_library/
  cvs checkout: Updating import_html/coders_php_library/bin

Visit your website (mine's called monster), log in, go to the modules page, enable import_html and save. I also disable comments, enable path, and turn on clean URLs; these are pretty basic setups.

I'm assuming you are logged in as site admin. This is a major thing to be doing to a site, and it needs pretty much every permission available.

Check the preferences

There's a whole heap of info in the help at /admin/help/import_html. Reading is always good, but hopefully this tutorial can skip bits.

Visit /admin/settings/import_html. If you don't see any warning messages at all, life is good for you.

Due to my configuration, the HTMLTidy extension is not available. But we can try to get it. Press the big button (and then press back) or follow the instructions.

Scan through the rest of the available options to see what's available. All the defaults are a safe start, but we are doing a total conversion, so we will make two changes. It's normally set up to import files from elsewhere, but we will import in place so the file storage paths will not change, and the URLs won't either. So clear the Extra File Storage Path and Import Site Prefix.

Check your default document (.htm vs .html) is correct for your source structure.
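If you're not sure which default document your source uses, counting the candidates is a quick way to find out. This is a rough sketch, and `count_default_docs` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count index.htm vs index.html files under a source tree.
# count_default_docs is a hypothetical helper name.
count_default_docs() {
    root=$1
    # wc may pad its output with spaces on some systems, so strip them.
    htm=$(find "$root" -type f -name 'index.htm' | wc -l | tr -d ' ')
    html=$(find "$root" -type f -name 'index.html' | wc -l | tr -d ' ')
    echo "index.htm: $htm index.html: $html"
}
```

Whichever name turns up in most directories is probably the one to enter in the settings.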

And save.

We're almost ready to try importing, but importing what?

Prepare the content

If you are starting from existing content, skip this step.

I'm going to rip a small (tiny, really) site from a few years back. It has only one tricky bit (the front page), but we won't worry about it. As you can see, it's pretty static and pretty old-school, but it has a simple structure that illustrates a site well. You may already have a local copy of your site, but I'll use a cool (again, old-school) tool to start us off.

  $ cd /var/www
  $ wget -k -m -X/photos -r ...

And we get a nice local copy, just like that! I excluded the /photos gallery 'coz it's big and not needed here.

You can probably browse those files, and see what you got, but right now we are going to merge.

Normally I'd import files from this location, but to demonstrate an in-place replacement, I'll merge the files first.

Cross fingers, and

  $ mv* /var/www/drupal/

Yowza. Now we have two sites in one. If you were starting with an existing site already, thanks for joining us.
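If crossing fingers isn't your style, a slightly more cautious merge could look like the sketch below. It lists the direct file conflicts first and refuses to merge if there are any. The names `list_conflicts` and `merge_site` are invented for this example, and it copies rather than moves, which is my substitution so the mirrored copy stays around as a fallback.

```shell
#!/bin/sh
# Sketch: list files that exist in both trees, and merge only when there are none.
# list_conflicts, merge_site and their arguments are hypothetical names.
list_conflicts() {
    src=$1
    dest=$2
    # Walk the source tree and report every relative path already taken in dest.
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        [ -e "$dest/$rel" ] && echo "${rel#./}"
    done
    return 0
}

merge_site() {
    src=$1
    dest=$2
    conflicts=$(list_conflicts "$src" "$dest")
    if [ -n "$conflicts" ]; then
        echo "Refusing to merge, these files would be overwritten:" >&2
        echo "$conflicts" >&2
        return 1
    fi
    # Copy the contents of src (not src itself) into dest, keeping subdirectories.
    cp -R "$src"/. "$dest"/
}
```

With the mirrored site in, say, a directory called old-site (a made-up name), `merge_site /var/www/old-site /var/www/drupal` would either merge cleanly or tell you which files need sorting out by hand first.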

A bit more groundwork

To show a couple more features, create a taxonomy term for these pages to get tagged with. Naturally I'll call it 'rockbar', inside a vocabulary called 'Subjects'. I'll also make a placeholder menu item called 'rockbar' for the hierarchy to be built under.

Finally, hit the import page

Back in Drupal Admin, visit /admin/import_html. I've cheated a little because I know these pages are structured well enough to work - the actual text is surrounded by a semantic <div id="content"> tag. This helps, but is not required. Anyway, I'll skip the demo test.
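If you don't already know how consistent your own pages are, one rough way to check is to list the files that lack the wrapper. This sketch assumes the wrapper is written exactly as `id="content"`, and `pages_without_content_div` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list HTML files that do not contain the id="content" wrapper.
# pages_without_content_div is a hypothetical helper name.
pages_without_content_div() {
    root=$1
    # grep -L prints the names of files with no matching line.
    find "$root" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec grep -L 'id="content"' {} +
}
```

An empty result suggests every page has the wrapper. Anything listed is a page where the import has less to go on, so those are the ones worth running through the demo test.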

Our source files are this site itself. Yes this will seem to create some overlap, but we can handle it. This exact method is the most unstable way of updating a site imaginable, but can work! So set the "Site Root" as required (/var/www/drupal/), tick publish the results automatically, and proceed.

Sorry, we get to see all of the Drupal guts here too, but you could have filtered those out with the exclusions in the settings. Anyway, we can selectively choose the content to import. Do that with the sections you want. Selecting all pages is all we need to do just now.

You can import bit by bit and it'll probably keep up with you. In this situation, importing an htm file will DELETE IT once it has been parsed successfully. This is intentional and tidy, but you did back up, right?

On 'go' you'll probably get a lot of messages. Some may be garbled. You'll see that the images are acknowledged and the HTML is parsed.


Unset some redirects

A menu structure should be built, and some pages should be available. But not all. What's up? Well, the Drupal .htaccess mod_rewrite rules look at the filesystem first, before handing any request to Drupal. That's fine for resources and files, but they do so for directories too, which makes little sense to me. Since the old site's directories still exist on disk, requests for those paths never reach Drupal. Anyway, open up your editor on the Drupal .htaccess and comment out the -d line.

  # Rewrite current-style URLs of the form 'index.php?q=x'.
  RewriteCond %{REQUEST_FILENAME} !-f
  ### RewriteCond %{REQUEST_FILENAME} !-d
  RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
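If you'd rather script that edit than open an editor, something like the sketch below should do it. It assumes the line reads exactly as in the snippet above, and `disable_dir_check` is a name invented for this example.

```shell
#!/bin/sh
# Sketch: comment out the "!-d" RewriteCond in a Drupal .htaccess, keeping a backup copy.
# disable_dir_check is a hypothetical helper name.
disable_dir_check() {
    htaccess=$1
    # Keep the untouched file next to it as .htaccess.orig.
    cp "$htaccess" "$htaccess.orig" || return 1
    # Prefix the matching line with "### " and preserve its indentation.
    sed 's/^\([[:space:]]*\)\(RewriteCond %{REQUEST_FILENAME} !-d\)/\1### \2/' \
        "$htaccess.orig" > "$htaccess"
}
```

Run it once, as in `disable_dir_check /var/www/drupal/.htaccess`. A second run would overwrite the .orig backup with the already-edited file, so move the backup somewhere safe if you care about it.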

Hey, where's my formatting?

Did you check the "Input Format" in the settings page? If it's "Filtered HTML" then chances are a bunch of legacy tags aren't making it to the screen. They are still there though! Recommended setting for imports is 'Unfiltered HTML'.

No really, where's my formatting?

We've left behind a lot of the old formatting info. HTMLTidy has been run aggressively on the sources to remove inline styles, and any CSS includes that used to be in the headers were not brought forward. That's sorta the point of this process. We've now got a squeaky-clean set of content that you can apply new Drupal themes to.