Node Import Part 0: the prequel

Ethan's picture

I thought I'd share some lessons learned in the node creation wild as a follow-up to Tom's recent post surveying the options available to prospective content importers who need to get a slew of nodes into Drupal from some far off place such as the land of flat HTML, custom CMSes or even the fabled Drupal 4.6.

For a very exciting upcoming site we had to import a few hundred old blog posts from a Drupal 4.6 install, along with 30 authors and around 2000 taxonomy associations. While these numbers are nowhere near as big as content migration figures can crawl, the complexity of the data structure was multiplied by the specific requirements of the various node importing schemes available in Drupal. This multiplicative factor yielded a somewhat dizzying array of alternatives that had to be evaluated, with the vast majority of them leading down paths that involved exporting nodes, importing with old legacy data attached then iteratively importing the author and taxonomy data one piece at a time using custom scripts and translation tables.

Scary.

While the Macro module definitely made the job a great deal easier -- mainly because it was able to handle authorship, taxonomy and creation timestamps all as part of the node creation process -- one of the real gems I uncovered was a MySQL function I wish I'd found years ago: GROUP_CONCAT.

GROUP_CONCAT concatenates all records returned by a single field query, breaking up the results according to the GROUP BY clause, if one is present. As an example, this means that this statement:

SELECT GROUP_CONCAT(t_d.name) AS tags, t_n.nid FROM term_data as t_d JOIN term_node AS t_n ON t_n.tid = t_d.tid WHERE t_d.vid =2 GROUP BY t_n.nid

will return a two column result set, in the first column will be a comma-delimited listing of all taxonomy terms from the vocabulary with vid=2 and in the second column we'll have the nid of the post which has all these tags. In the past, this step would require my writing some intermediary script to reduce the fully-normalized result in which each term was on a separate row.

The real advantage of doing this de-normalization in pure SQL comes when we use this expression as a sub-select to construct a single table with all the information needed to construct full drupal_execute form value arrays for each node. This can be done with SQL that looks something like this:

SELECT n.nid, a.dst, n.title, n.created, n.body, tags.tags, u.name, u.uid, u.mail FROM node as n JOIN
(SELECT GROUP_CONCAT(t_d.name) AS tags, t_n.nid FROM term_data as t_d JOIN term_node AS t_n ON t_n.tid = t_d.tid WHERE t_d.vid =2 GROUP BY t_n.nid) as tags ON tags.nid = n.nid
JOIN users as u on u.uid = n.uid
JOIN url_alias as a ON a.src = CONCAT('node/',n.nid);

This statement makes use of the second column in our GROUP_CONCAT query to join those results to the node data via a subselect. What comes out of this query can then be processed by a PHP script and used to generate the drupal_execute calls or Macro PHP code needed to migrate the nodes into the new database, taxonomy, users and all.

Reply

The content of this field is kept private and will not be shown publicly.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <blockcode>
  • Lines and paragraphs break automatically.
  • You may post block code using <blockcode [type="language"]>...</blockcode> tags. You may also post inline code using <code [type="language"]>...</code> tags.

More information about formatting options

Captcha
Are you a robot? We usually like robots, but not in our comments.
Copy the characters (respecting upper/lower case) from the image.