Building a Contentful Custom Data Migration to WordPress

As part of our Special Projects team here at Automattic, we help interesting people and organizations have a great experience with WordPress. Sometimes that involves collaborating to build out new solutions to meet our partners’ needs; other times we help them migrate existing custom systems over to WordPress.

I’ve recently had the opportunity to work on a custom migration from Contentful, which I learned will often need custom handling, due to the individualized nature of how customers design the structures in their Contentful spaces and the format of their export.

To aid future efforts of this nature, I’ve tried to sum up in this guide the advice that may prove most useful.

Before you begin: ALWAYS make sure you have WP_IMPORTING defined as true!

This is something we’ve seen trip up imports in the past, especially as it can be a bit of uncommon knowledge that folks never realized existed in the first place. Forgetting to define this constant manually can result in publish notifications, pingbacks, subscription emails, and dozens of other incidents. Keep in mind that this isn’t a panacea; some third-party plugins may not check for the constant being defined, but most do.
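
Defining it takes just a few lines at the very top of the import script, before any content is touched:

// Let core and (most) plugins know an import is underway, so they can
// skip notifications, pingbacks, and similar side effects.
if ( ! defined( 'WP_IMPORTING' ) ) {
	define( 'WP_IMPORTING', true );
}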

First, get the export file

For the example I’m working with here, we’re looking at migrating from Contentful to WordPress. Contentful has an established export tool provided in their command-line interface, as well as a library for building custom exports. As it’s fairly well documented, I’m not going to delve far into extracting the data, merely acknowledge that the tooling is already there.
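
For reference, a basic export (assuming the Contentful CLI is installed and authenticated; check Contentful’s documentation for the current flags) looks something like this:

~/contentful-export $ contentful space export --space-id <SPACE_ID> --environment-id master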

Next, understand your data

As noted previously, Contentful’s system leans heavily toward configuration over convention, so similar data structures created in two different spaces could have fields named or structured entirely differently. So let’s dig into what’s in the export, and determine what it should map to.

For the different items going into WordPress, should they go in as posts and pages, or as a custom post type? If the latter, should it be an established type with a format described in a commonly used plugin, or a unique one we’re declaring just for this site? If each item has properties, should they be stored in post meta, or in a taxonomy — either categories or tags, or something distinct with its own unique properties? What should go into the post content? How are these properties likely to be used in WordPress? Are they strictly descriptive, to be used on the single post page, or would it be useful to have WordPress generate archive pages for each value (which would make them a taxonomy)?

To answer these questions, the first thing to do is build a script that iterates over each item in the export and generates some context on what the data itself looks like. It could look something like this, which was the (slightly simplified) start of parsing through a Contentful export:

<?php
$blob = file_get_contents( __DIR__ . '/contentful-export.json' );
$json = json_decode( $blob );

// Distilling down to the bits we actually care about...
$entries = $json->entries;
$content_types = array_map(
	function( $entry ) {
		return $entry->sys->contentType->sys->id;
	},
	$entries
);

// Now we've got $content_types, which is an array of all the entries' content types.
// To get a more intelligible aggregate, let's generate a summary:
$frequency = array_count_values( $content_types );

// Sort them in descending order, then print out.
arsort( $frequency );

print_r( $frequency );

Due to how Contentful’s data model structures entries and their attributes, you need to step down multiple levels to get to the actual values we’re looking for, hence all the additional ->sys steps. The output of this for a site with an active blog sharing recipes, plus some cookbooks and tools for sale, could look something like this:

~/contentful-export $ php ./summarize.php
Array
(
    [meal] => 478
    [ingredient] => 382
    [blog] => 523
    [productPage] => 28
    [author] => 12
)

So, if we wanted to pull all the data across, we might store each recipe in a new Meals post type, with each ingredient as a taxonomy attached to that post type — enabling users to search for all recipes that include butternut squash, for example. Each blog post would map to a native WordPress blog post, and product pages could perhaps map to WooCommerce products. Authors could vary depending on how they’re used: if each post only ever has a single author, they could be set up as users in WordPress with post attribution pointing to them, or stored as postmeta if each guest author only has a single post. If two or more authors are assigned to some items, however, we’d want to evaluate a plugin to provide the interface for managing that model — there are a few well-established options.
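
As a rough sketch, the registrations for that model could look like the following (slugs, labels, and arguments here are illustrative, not prescriptive):

add_action( 'init', function() {
	// A new post type to hold each imported recipe.
	register_post_type(
		'meal',
		array(
			'label'        => 'Meals',
			'public'       => true,
			'has_archive'  => true,
			'show_in_rest' => true,
			'supports'     => array( 'title', 'editor', 'thumbnail', 'custom-fields' ),
		)
	);

	// A taxonomy so users can browse all meals containing a given ingredient.
	register_taxonomy(
		'ingredient',
		'meal',
		array(
			'label'        => 'Ingredients',
			'public'       => true, // Gives each ingredient its own archive page.
			'show_in_rest' => true,
		)
	);
} );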

To understand what bits of data there are on blog posts, for example, we could begin pulling aggregations as we did before, something like so:

<?php

$blob = file_get_contents( __DIR__ . '/contentful-export.json' );
$json = json_decode( $blob );

// Filter all our entries down to only the `blog` contentType.
$blogs = array_filter(
	$json->entries,
	function( $entry ) {
		return 'blog' === $entry->sys->contentType->sys->id;
	}
);

// Of the blog entries, get only the custom fields declared for each.
$allmeta = array_column( $blogs, 'fields' );

// The fields are stored as an object, not an array, so we need to convert to an array first.
$allmeta = array_map( 'get_object_vars', $allmeta );

// Throw away the values -- for the moment, we're only trying to see what fields there are.
$keysonly = array_map( 'array_keys', $allmeta );

// Lump the keys of all entries into one giant array.
$unifiedmeta = array_merge( ...$keysonly );

// As we'd done before, let's get a look at the frequency of each custom field's usage.
$frequency = array_count_values( $unifiedmeta );
arsort( $frequency );
print_r( $frequency );

And the output this time should give us an idea of the assorted fields each blog post contains:

~/contentful-export $ php list-blog-fields.php
Array
(
    [blogTitle] => 523
    [slug] => 523
    [categories] => 523
    [blogHeroImage] => 523
    [author] => 522
    [publishDate] => 489
    [blogBody] => 476
    [similarArticles] => 208
    [furtherReading] => 172
)

Progress! So, many of these fields will map easily across to blog posts — but some may need a bit of formatting help. Each field will also have a language key or keys underneath it, for the storage of translations. Here’s a truncated example of a few schemas you might see:

{
  "blogBody": {
    "en-US": "## This is a blog post!\n\nIt's stored in __markdown__, which is pretty neat."
  },
  "publishDate": {
    "en-US": "2021-07-28T00:00-08:00"
  },
  "categories": {
    "en-US": [
      "Opinions",
      "Example Category"
    ]
  },
  "blogHeroImage": {
    "en-US": [
      {
        "url": "http://res.cloudinary.com/path/image/upload/f_auto/q_auto/v1641266700/s3/contentful/image-filename.jpg",
        "tags": [
        ],
        "type": "upload",
        "bytes": 1161876,
        "width": 1900,
        "format": "jpg",
        "height": 1267,
        "version": 1641266700,
        "duration": null,
        "metadata": {
        },
        "public_id": "s3/contentful/image-filename",
        "created_at": "2022-01-04T03:25:00Z",
        "secure_url": "https://res.cloudinary.com/path/image/upload/f_auto/q_auto/v1641266700/s3/contentful/image-filename.jpg",
        "original_url": "http://res.cloudinary.com/path/image/upload/v1641266700/s3/contentful/image-filename.jpg",
        "resource_type": "image",
        "raw_transformation": "f_auto/q_auto",
        "original_secure_url": "https://res.cloudinary.com/path/image/upload/v1641266700/s3/contentful/image-filename.jpg"
      }
    ]
  },
  "author": {
    "en-US": {
      "sys": {
        "type": "Link",
        "linkType": "Entry",
        "id": "ZWFzaWx5LiBCYXNlNjQgZW"
      }
    }
  }
}

Taking a look at the blogBody field, for example, we might notice that the content is stored internally as Markdown, so we’ll need to load a processor to convert it to HTML for storage in WordPress.
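
As one way to handle that (assuming the script can pull in a Markdown library via Composer; Parsedown is used here purely as an example):

// Convert the blogBody Markdown to HTML before inserting the post.
// Parsedown is illustrative; any Markdown processor will do.
require_once __DIR__ . '/vendor/autoload.php';

$parsedown = new Parsedown();
$html      = $parsedown->text( $entry->fields->blogBody->{'en-US'} );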

Other fields, like author, may contain references — instead of data — or arrays of references, if they store multiple values. The key listed here under $entry->fields->author->{en-US}->sys->id would map to an item with the $entry->sys->contentType->sys->id of author and an $entry->sys->id of ZWFzaWx5LiBCYXNlNjQgZW. Some may contain complex structures describing items such as media files — you can generally get a good feel for what’s contained therein by previewing the export, but a full write-up is available in Contentful’s developer docs.
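
To resolve those references during the import, it helps to index every entry by its ID up front, along these lines (variable names are illustrative):

// Index every entry by its sys->id, so reference fields can be resolved
// directly instead of searching the whole export each time.
$entries_by_id = array();
foreach ( $json->entries as $entry ) {
	$entries_by_id[ $entry->sys->id ] = $entry;
}

// Later, resolving a blog post's author reference:
$author_id    = $blog_entry->fields->author->{'en-US'}->sys->id;
$author_entry = $entries_by_id[ $author_id ] ?? null;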

Now that we have some idea of the data we’re working with, let’s register the post meta that we’ll be importing the individual fields into, if they aren’t going into a pre-existing slot. It’ll look something like:

add_action( 'init', function() {
	register_post_meta(
		'post',
		'furtherReading',
		array(
			'type'         => 'string',
			'description'  => __( 'A list of books and articles for further edification.' ),
			'single'       => true,
			'default'      => null,
			'show_in_rest' => true,
		)
	);

	// More custom post types, custom taxonomies, custom meta declarations ...
} );

Then, write the actual import code

We’ve written some code to aggregate data about the import file, and prepared a place for it to be stored; now it’s time to actually iterate through. There are a couple of important things worth reviewing here as we go:

Some essentials and safety rails for the process

Again, remember to define WP_IMPORTING!

When importing entries, consider whether it’s beneficial to leave some artifacts behind. This could be as simple as json_encode()-ing the raw import data and storing it in a post meta entry on the imported post, so that if an error in the import is discovered later, it’s easy to access the raw data again to understand where it originated and what went wrong.
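
A sketch of what that could look like, run right after a successful insert (the meta key is hypothetical; pick whatever fits your conventions):

// Stash the raw source entry on the imported post for later forensics.
// wp_slash() matters here: update_post_meta() unslashes its input, which
// would otherwise strip the backslashes that JSON encoding adds.
update_post_meta( $post_id, '_contentful_raw_entry', wp_slash( wp_json_encode( $entry ) ) );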

Understanding that things can go wrong when the import eventually runs, it’s important to handle failures and timeouts gracefully. If an import stops partway through, ensuring that re-running it won’t create the same piece of content a second time is a relatively simple check, but one worth making to avoid having to correct for duplicates in the heat of the moment.
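
One simple approach to making re-runs safe: record the source entry’s ID in post meta as you import, and check for it inside the loop before inserting (again, the meta key is hypothetical):

// Skip entries that already produced a post on an earlier run.
$existing = get_posts(
	array(
		'post_type'      => 'post',
		'post_status'    => 'any',
		'meta_key'       => '_contentful_entry_id',
		'meta_value'     => $entry->sys->id,
		'fields'         => 'ids',
		'posts_per_page' => 1,
	)
);

if ( ! empty( $existing ) ) {
	continue; // Already imported; nothing to do for this entry.
}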

We would strongly suggest including the ability to run what we’ve internally referred to as the rule of 1-2-10. When you believe it’s time to run the import, first try importing a single entry. If that succeeds, then do two entries. Finally, if those both have been checked and verified to have imported well without causing any ill effects, run it again with ten entries. Only after verifying it still operates well with no surprises should you move on to attempting the full data set. When you run the ten-item set, check how long the import takes via time to get a rough feel for how long the full run might last. Including the ability to specify the number of items to import, either via a command-line argument or some other means, would be tremendously useful here.
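
If the script is being run directly with php, as in the earlier examples, that can be as simple as reading an argument (defaulting to one entry, in keeping with the rule above):

// Usage: php import.php [count]
$limit     = isset( $argv[1] ) ? max( 1, (int) $argv[1] ) : 1;
$to_import = array_slice( $entries, 0, $limit );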

On the topic of functionality to include: when possible, include a dry-run option that outputs to a log file what would otherwise be inserted into your website, so you can safely test the import and ensure the data looks good before modifying production databases.
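
In practice, that can be a flag guarding every write; the $dry_run variable here is illustrative:

if ( $dry_run ) {
	// Log what would happen, without touching the database.
	error_log( 'DRY RUN: would insert post: ' . $postarr['post_title'] );
} else {
	$post_id = wp_insert_post( $postarr, true ); // true: return WP_Error on failure.
}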

It may be worth importing the posts with a draft status instead of publish, so that they can all go live at once — especially if it’s a slow import or there are a significant number of them. If entries are imported this way, ensure that the WP_IMPORTING constant is defined when they’re published as well, to avoid any subscription or notification emails keying off the post status change. This also gives you an opportunity to spot-check them before they’re visible to site users.
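
Pulling the pieces together, the central insert might look something like this sketch (the field mappings follow the blog fields we found earlier, and are assumptions about this particular export):

// Insert as a draft, to be published in bulk after spot-checking.
$postarr = array(
	'post_type'    => 'post',
	'post_status'  => 'draft',
	'post_title'   => $entry->fields->blogTitle->{'en-US'},
	'post_name'    => $entry->fields->slug->{'en-US'},
	'post_content' => $html, // The Markdown converted earlier.
	// Contentful's ISO 8601 dates need converting to WordPress's format;
	// timezone handling is simplified here.
	'post_date'    => gmdate( 'Y-m-d H:i:s', strtotime( $entry->fields->publishDate->{'en-US'} ) ),
);

$post_id = wp_insert_post( $postarr, true );
if ( is_wp_error( $post_id ) ) {
	error_log( 'Failed to import ' . $entry->sys->id . ': ' . $post_id->get_error_message() );
}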

Consider whether it’s worth installing a plugin to log or block all outgoing emails as a safety measure, or whether that would be overly detrimental to site functionality. Logging could result in significantly more database writes if many emails were to go out, causing a performance hit; blocking emails might interfere with the typical operation of the site for visitors, for example if transactional emails were suddenly silenced.

Performance Optimization of Running the Import:

If running in a WordPress.com VIP Go related ecosystem, and you’re calling get_post() hundreds of times, consider running vip_inmemory_cleanup() periodically to clear out data you may no longer need from local object caches.

If you’re making a significant number of taxonomy modifications, on WordPress.com VIP consider also looking into using start_bulk_operation() and end_bulk_operation() — or for self-hosted sites, consider whether it would be worth utilizing wp_defer_term_counting() directly to let you either put off some updates until the end of your import, or batch them every few hundred operations during.
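
For the self-hosted case, the deferral brackets the import loop:

// Defer term count updates for the duration of the import...
wp_defer_term_counting( true );

foreach ( $entries as $entry ) {
	// ... insert posts and assign terms ...
}

// ...then flush: passing false performs the deferred recounts.
wp_defer_term_counting( false );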

How long are you expecting the import to run for? Are you running the import in an environment where you may want to consider running a set_time_limit() call on each iteration to ensure it doesn’t time out?
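
Each call resets PHP’s execution clock, so one per iteration keeps a long overall run from hitting max_execution_time:

foreach ( $entries as $entry ) {
	set_time_limit( 30 ); // Allow 30 more seconds for this entry.
	// ... import the entry ...
}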

If importing to an active production site, consider tossing in a periodic sleep() or usleep() call to avoid overloading your production database. Maybe keep a $lastsleep = time() that triggers a three-second sleep whenever it’s been five seconds since the last one, and then resets $lastsleep.
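
That throttle only takes a few lines:

$lastsleep = time();

foreach ( $entries as $entry ) {
	// ... import the entry ...

	// After roughly 5 seconds of work, pause for 3 to let the database breathe.
	if ( ( time() - $lastsleep ) >= 5 ) {
		sleep( 3 );
		$lastsleep = time();
	}
}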

After the import runs, consider if there are any caches — object caches, permalink caches, page caches, or the like — that you may need to flush. Try to be considerate about what caches you’re invalidating, as rebuilding them can impact site performance.

User Experience of Running the Import:

Always be clear to the user about what there is to do, and how the task is progressing. This could be as simple as echoing Preparing to import 1,234 entries, and on each iteration printing Item 1 of 1,234 (0.1% complete) or Processing Item {$import_post_id}. There are few things as frustrating as staring at a long-running process and wondering whether it’s stuck.
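
For example:

$total = count( $entries );

foreach ( $entries as $i => $entry ) {
	printf(
		"Item %d of %s (%.1f%% complete)\n",
		$i + 1,
		number_format( $total ),
		( ( $i + 1 ) / $total ) * 100
	);
	// ... import the entry ...
}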

Have your output messages be easy to skim. This could entail including emoji in the messages to grab attention, so perhaps a ⁉️ or ‼️ or 🟢 🟡 ⏰ 📧. Also, consider indenting lines after the first for a given entry, and using color codes for terminal output.

Consider how you’re handling media. Do you want to sideload media via the import process, or handle it afterwards with a tool like our wordpress-importer-fixers? If the media items are pulled down during the import, make sure to flag them in a way that ensures they won’t get downloaded multiple times on subsequent runs!
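
If you do sideload during the import, here’s a sketch of that flagging (the _source_url meta key is an assumption; note that media_sideload_image() needs some admin includes loaded when run outside wp-admin):

// media_sideload_image() lives in admin-land; load its dependencies first.
require_once ABSPATH . 'wp-admin/includes/media.php';
require_once ABSPATH . 'wp-admin/includes/file.php';
require_once ABSPATH . 'wp-admin/includes/image.php';

// Only download the file if a prior run hasn't already grabbed it.
$existing = get_posts(
	array(
		'post_type'      => 'attachment',
		'post_status'    => 'inherit',
		'meta_key'       => '_source_url', // Hypothetical dedupe flag.
		'meta_value'     => $image_url,
		'fields'         => 'ids',
		'posts_per_page' => 1,
	)
);

if ( empty( $existing ) ) {
	$attachment_id = media_sideload_image( $image_url, $post_id, null, 'id' );
	if ( ! is_wp_error( $attachment_id ) ) {
		update_post_meta( $attachment_id, '_source_url', $image_url );
	}
}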

Last, running the import

After running the import successfully on a local or development environment, it’s time to plan how you’re intending to run it in production.

When possible, for an active site with frequent users, it’s generally best to pick a time of day to run the script where user traffic will be at a minimum. If something goes wrong, it’ll be easier to fix when things aren’t getting slammed.

Next, take a database backup before running any imports. Make sure you know how to quickly roll back to the backup you just took, as it’s not fun to figure that out in the heat of the moment when tensions are running high. Also, consider implementing any additional logging needed so that you don’t lose user data if something goes wrong and you need to restore from the backup! An e-commerce site losing customer orders or comments that came in after the database export was taken, because you had to run a restore, is not a good experience.

Depending on how you’ve written the import script, go ahead and run it first with whatever mechanism you’ve included for a dry run! Ideally, you can run it via WP-CLI with wp eval-file /path/to/import.php or the like, if you have full SSH access to the server. If you’re operating on a shared host without SSH access, you may need to consider another method, such as triggering it over HTTP and manually including wp-load.php. Assuming you’re running it via SSH, consider piping the output to tee to save it to a file, or running it inside screen so that the import can continue in the background if your SSH session drops.
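
For example, running inside a screen session while logging a copy of the output (the log filename is illustrative):

~/example-site $ screen -S import
~/example-site $ wp eval-file /path/to/import.php | tee import-run.log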

After the dry run looks good, run the import with a single post from the import file as noted above, and then check your work. Did it import as expected? Were any emails or notifications accidentally fired off to users? It’s far better to catch this when a single post is imported, rather than after scaling up to all of them. Then give it a shot with two entries, and then ten. If everything looks right, run the full import, and keep an eye on it as it goes.

As an aside, if the data to be imported requires human intervention or review, or is a particularly small set, sometimes it’s more efficient to present the content so you can insert or review it manually via wp-admin. Some plugins and post types may store data in ways that are difficult to parse and access, so what would ordinarily take four hours to code can sometimes be done manually in one. As appealing as it is to engineer automated solutions, for one-off tasks it’s sometimes more efficient to put in the work manually.

The import ran, now what?

Congratulations! Time to check your work.

Look over the data you imported, and spot-check it. Do the post counts match what you were expecting to find post-import? Find a good place to store any logs that you generated of the import for review later if needed.

If the site you’re migrating from is still online, consider whether the WordPress Importer Fixers WP-CLI command may help with pulling down and handling media imports from the prior site. It can often handle the URLs embedded in your content for you, rather than you having to leverage media_sideload_image() in your custom import script — but it may not find the media if the images or other files you’d like to load into WordPress’s Media Library aren’t in the new post content.

If you’re importing the content from a prior version of the site to a new version that’s going to live at the same domain, will the permalinks load the right content? If not, you may need to set up some redirects to forward inbound links from the old addresses to the new.

Any additional small-scale cleanup you may need to do can likely now be done without the original data file, by iterating through posts already in your database, keying off the post meta value where you stored the original post’s raw import data. If you’re certain that data will never be needed again, it should be trivial to truncate or delete the import meta with a single query — but perhaps export it first, in case it’s later needed for reference.
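
When that day comes, the cleanup is a single call (using the hypothetical meta key from earlier):

// Remove the raw-import snapshots from every post in one query.
delete_post_meta_by_key( '_contentful_raw_entry' );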