Large custom WordPress content migrations

WordPress natively handles a number of different content migrations from various platforms. However, these will not work for all organizations either due to custom requirements for data transformation or because the data is being migrated from a non-supported or proprietary system. In these instances, writing a custom script is unavoidable. Furthermore, as WordPress becomes widely accepted as an enterprise content management system, large organizations will continue to move to the platform. This means larger data sets and there are key performance considerations that must be made when developing a migration script.

wp_update_term_count

It is likely that any content migration involves attaching taxonomy terms to posts. When adding terms to a post, WordPress recalculates and caches the number of terms associated to a post using the function wp_update_term_count. For content migrations less than 50,000 posts, this may not post a significant performance issue. However, for larger migrations this is a major consideration. This function will run every time when calling wp_set_object_terms or wp_set_post_terms. As the size of the WordPress database increases, the execution time will increase from fractions of a second to a few seconds. Although this poses a barely noticeable inconvenience when an editor is saving a post, when spread out over the course of a large migration, the amount of time to process each post will continue to increase and could make your migration script run for days instead of hours. Furthermore, during the actual migration process, the post counts for each term are likely of no importance and updating them is nothing more than a roadblock to performance.

The best strategy to deal with this is simply not to execute the portions of this function that recalculate the post counts. There is no WordPress hook to do this programmatically, so this can be done by copying/pasting the existing code from the core into your own modified function or by commenting out the code in the core. For the purposes of this post, we’ll use the latter strategy. The offending code is actually in a helper function called wp_update_term_count_now in wp-includes.taxonomy.php. You should comment out lines 2515 to 2536 (as of WordPress 3.5) as follows:

function wp_update_term_count_now( $terms, $taxonomy ) {
/*global $wpdb;
$terms = array_map('intval', $terms);
$taxonomy = get_taxonomy($taxonomy);
if ( !empty($taxonomy->update_count_callback) ) {
call_user_func($taxonomy->update_count_callback, $terms, $taxonomy);
} else {
$object_types = (array) $taxonomy->object_type;
foreach ( $object_types as &$object_type ) {
if ( 0 === strpos( $object_type, 'attachment:' ) )
list( $object_type ) = explode( ':', $object_type );
}
if ( $object_types == array_filter( $object_types, 'post_type_exists' ) ) {
// Only post types are attached to this taxonomy
_update_post_term_count( $terms, $taxonomy );
} else {
// Default count updater
_update_generic_term_count( $terms, $taxonomy );
}
}*/
clean_term_cache($terms, '', false);
return true;
}

This will prevent the counts from being updated every time terms are attached to a post and make your migration script run with a constant speed for each record inserted.

After your migration script is complete, simply uncomment the code above and create a secondary script that iterates through all terms in all necessary taxonomies and calls wp_update_term_count_now for each. This function takes the taxonomy name and an array of term IDs as parameters. This way you only have to pay the price once for calculating the term counts. Here is a simple code fragment I wrote to accomplish this:

$taxonomies = get_taxonomies();
foreach( $taxonomies as $taxonomy ) {
$args = array(
'hide_empty' => 0,
'fields' => 'ids'
);
$terms = get_terms( $taxonomy, $args );
if( is_array( $terms ) && !empty( $terms ) ) wp_update_term_count_now( $terms, $taxonomy );
}

Caching

Inevitably, there will be a necessity to perform lookups in the WordPress database during a content migration. Whether it’s taxonomy term IDs, to check if a post has already been inserted, or some other piece of metadata, you will very likely need to query the WordPress database. WordPress functions are generally quite optimized for this purpose but there is generally still a small price to pay for hitting the database. However, if you are using WP_Query or wpdb to accomplish this task, the price could be quite high. There are a few different caching methods you can use based on what is required for your script.

Transients

This can be alleviated through use of the the Transients API and ensuring transients are set to store in memcache. (If you haven’t worked with WordPress and memcache before, the WordPress Memcached Object Cache plugin is a good place to start.) Any time your migration script queries the database for data that is required for migrating a post, if this data is also going to be likely useful to future posts, store it in a transient with a reasonable expiration time. I like to use a day (60*60*24) because if your migration script is running longer than that, you probably need further optimizations. The code should be set up such that if the transient lookup fails, it then calls the original function or query and then stores that result in the transient.

Also keep in mind that WordPress transient keys cannot be longer than 45 characters. Sometimes, the key may involve something like a URL that looks up a WordPress post ID and cannot be reasonably expected to always be less than 45 characters, especially when combined with your namespace for the key. In this instance, I like to use the php md5 function on the value to create a unique value that is always exactly 32 characters in length and then combine that with my namespace for the variable. Here is a complete example using all of the above:

$transient_key = "namespace_" . md5( $variable );
if( ( $some_value = get_transient( $transient_key ) ) === false ) {
$some_value = some_wordpress_function( $variable );
set_transient( $transient_key, $some_value, 60*60*24 );
}

 

wp_options

Transients and memcache are useful for values that can be easily repopulated from the database if the transient has expired or if you need to re-run your script. However, you may run into scenarios where you absolutely need the data to persist and cannot rely on transients. Examples of this would be queries where the speed is simply too slow to even risk using the non-cached method of retrieving the data or if you are running a script that needs to make multiple passes over the data, such as for a content import followed by linking those imported posts with the new IDs. In those instances, you absolutely need to have the cached values available at all times and cannot risk it being cleared from transients.

In these instances, I prefer to use wp_options to store the data. Wp_options provides a nice, flat table indexed by option_name that allows you to quickly store and retrieve data even on a huge database. There are two key points to using this method:

  1. Always ensure that you use add_option and not update_options when setting the value. This is because the autoload parameter is not available for update_option. You should always set this to false because you absolutely do not need or want to load all of these values automatically when bootstrapping WordPress.
  2. When your migration is complete, ensure that you have a separate cleanup script to remove all of this unnecessary data from wp_options. (If you’ve disabled wp_update_term_count as described above, you could consider combining that into one script.)

Themes

It is always important to consider whether or not you need to load the theme when running the migration. If you have custom post types, custom taxonomies, and so forth it will likely be inevitable. However, if it is not for any reason, then disable loading themes for further performance enhancement by including this line in your script before including wp-load.php to bootstrap the WordPress environment.

define( 'WP_USE_THEMES', FALSE ); 

Hooks

It is also very important to consider what functions are hooked to what actions either in the plugins or the active theme in your environment. This is especially true for hooks like ‘save_post’ or ‘set_object_terms’ that are likely going to be run with every post you are migrating. Carefully consider what code is being executed and comment out or use remove_action in your migration script to disable those that are unnecessary during the migration.

Large File Handling

If you are migrating a large data set, then it is likely you will be provided large export files. In the case of XML format, this means it is impossible to parse the entire file on load. However, this does not mean you need to sacrifice the convenience of something like SimpleXMLElement. What I’ve done with large files is use the very basic fgets to stream in the file line by line and a simple string comparison function to find the start and end elements for a single record and then feed that into SimpleXMLElement for further processing. This allows for the best of both worlds with maximum performance. Here is an example:

$import_xml = fopen( $import_file, "r" );
// Iterate over the lines of the file until we have constructed a complete "asset" element
// Then feed that element into SimpleXMLElement for ease of processing
// This allows us to use the convenience of SimpleXMLElement on a huge document
$current_asset = "";
$processing_asset = false;
while ( ($import_line = fgets( $import_xml ) ) !== false ) {
// Check if we've reached the end of the import file
// If so, exit
if( strpos( $import_line, "" ) !== false && strpos( $import_line, "" ) >= 0 ) break;
if( !$processing_asset && strpos( $import_line, "<asset" ) !== false && strpos( $import_line, "<asset" ) >= 0 ) {
// We've found the start of an asset element. 
// Clear the previous asset and begin capturing the entire asset for processing.
$current_asset = $import_line;
$processing_asset = true;
}
if ( $processing_asset && strpos( $import_line, "<asset" ) === false ) {
// We're assembling the text for a single asset element
$current_asset .= $import_line;
if( strpos( $import_line, "" ) !== false && strpos( $import_line, "" ) >= 0 ) {
// We've completed the assembly of a single asset element
// Create a SimpleXMLElement for ease of parsing each individual node
$asset = new SimpleXMLElement( $current_asset );
// Pass the SimpleXMLElement object to the function for processing and insertion of related links
asset_processing_function( $asset );
// Terminate processing of this asset
$processing_asset = false;
}
}
}
// Close the stream and finish up with the import
fclose( $import_xml );

Script settings

Any long running script is likely to run up against either memory usage or execution time limitations. You should carefully consider the settings for both in WordPress and PHP respectively.

If your PHP installation allows for it, change the execution time in your script to something you feel is sufficient for its completion. For example:

if( !ini_get('safe_mode') ) ini_set( 'max_execution_time', 100000 );

For WordPress, make sure the memory limit is sufficient for your script. If you don’t want to update wp_config.php, then just reset it for the duration of your script like:

define( 'WP_MAX_MEMORY_LIMIT', '4096M' );

If you find that neither are sufficient, then it likely means your script probably needs to run in batches. Consider splitting your large import file into smaller files or enabling your script to run in batches of records and then using a shell script to wrap the call to each batch.

Plugin functions

If you are using third party or even custom developed plugins, you may find that the functions that are hooked to core WordPress filters and actions (see wp_update_term_count above) do not necessarily perform ideally under the scenario of a large content migration. If this is the case, consider disabling these functions and writing (or copying/editing) stripped down/customized versions specifically for your script.

Run from the command line

Finally, it goes without saying that any large content migration should never be run from the browser. Migration scripts should always be run from the command line. I like to create a “cmd” directory in the root of my WordPress install for this purpose.

Make sure you are outputting log entries at key points in the migration to track your progress. Some key things to track:

  1. Start/end times: store the start time of your script so you can output the script duration at the end. Otherwise, there is no way to reliably determine if you are making any performance improvements as you code.
  2. Memory usage: If you’re concerned about memory usage, call the PHP function memory_get_usage periodically to verify, such as after every post is inserted. This will help inform decisions such as if you need to batch the data.

If you’ve never written a WordPress command line script before, here is a simple shell you can use to get going:

// Load the WordPress environment
require_once( '/path/to/wp-load.php' );
exit( main() );
function main() {
// Do some migration stuff here
}

Happy migrating!