The Scary World of Data Migration… From HTML Pages to Drupal 7
I was recently faced with a project where the client had a "database" of items to bring into the brand new Drupal 7 site that we were preparing for him. Come to find out, this "database" was actually about 2,500 HTML documents. In order to extract the data from these HTML docs, I needed three things:
- Tidy to clean up the HTML
- QueryPath to extract the text from the HTML
- Some custom PHP to bring the records into Drupal
Tidy up
I needed to use Tidy because the HTML was a bit inconsistent and some tags were not closed properly, which made things a little problematic for QueryPath. Tidy works great, and it was really easy to use. There are many options in the manual, but I chose to use the -clean, -indent and -modify options. Let's take a look at what happens with these options:
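For reference, a batch run with those three options could be sketched roughly like this. The directory names (SRC_DIR, WORK_DIR) and the tidy_all helper are made up for illustration and are not from the actual project. The -quiet flag is an extra I added to cut down on Tidy's chatter. Because -modify overwrites the file it is given, the sketch works on copies rather than the originals.

```shell
#!/bin/sh
# Sketch only: SRC_DIR and WORK_DIR are hypothetical paths, adjust to your layout.
SRC_DIR="${SRC_DIR:-./html-export}"
WORK_DIR="${WORK_DIR:-./html-tidied}"

tidy_all() {
    # -modify rewrites its input in place, so run Tidy on copies.
    mkdir -p "$WORK_DIR" || return 1
    for f in "$SRC_DIR"/*.html; do
        [ -f "$f" ] || continue            # the glob matched nothing
        cp "$f" "$WORK_DIR/" || return 1
        copy="$WORK_DIR/$(basename "$f")"
        status=0
        # -quiet (my addition) only suppresses Tidy's informational output.
        tidy -quiet -clean -indent -modify "$copy" || status=$?
        # Tidy exits 0 when clean, 1 on warnings, 2 on errors.
        if [ "$status" -ge 2 ]; then
            echo "tidy could not process $copy (exit $status)" >&2
        fi
    done
}

if command -v tidy >/dev/null 2>&1; then
    tidy_all
else
    echo "tidy is not installed" >&2
fi
```

Files that Tidy flags with errors are reported on stderr and left as plain copies, so they can be fixed by hand before the extraction step.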