The Scary World of Data Migration… From HTML Pages to Drupal 7

I was recently faced with a project where the client had a "database" of items to be brought into the brand new Drupal 7 site that we were preparing for him. Come to find out, this "database" was actually about 2,500 HTML documents. To extract the data from these HTML docs, I needed three things:

  • Tidy to clean up the HTML
  • QueryPath to extract the text from the HTML
  • Some custom PHP to bring the records into Drupal

Tidy up

I needed to use Tidy because the HTML was a bit inconsistent, and some tags were not closed properly, which made things a little problematic for QueryPath. Tidy works great, and it was really easy to use. There are many options in the manual, but I chose to use the -clean, -indent and -modify options. Let's take a look at what happens with these options: