Creating a sanitized Drupal database backup

People often want to create a backup copy of their site database and give it to someone else to create an environment similar to the live environment for testing or development. However, doing so exposes all of your site data being leaked if that backup copy is ever placed on a CD that gets lost or a harddrive which is not destroyed at the end of its life or a laptop which is stolen.

One solution is for developers to encrypt their work environments (like using Virtualbox and an encrypted disk image on Mac). However, another simple solution exists that will work regardless of the strength of security on the development environment: munge the data as the backup is created.

Sanitizing a Drupal database export

The basic idea of this strategy is five easy steps:

  1. Export your database
  2. Import it into a new temporary place
  3. Overwrite sensitive information with dummy data or delete it
  4. Export from the temporary place
  5. Share this new export
  6. Delete the export created in step 1
-- CAUTION: DO NOT RUN THIS ON DATABASE WHERE YOU NEED TO KEEP THE DATA!!!

-- Munge emails.
UPDATE users SET mail = CONCAT(name, '@localhost'), init = CONCAT(name, '@localhost'), pass = MD5(CONCAT('MILDSECRET', name));
UPDATE comments SET mail = CONCAT(name, '@localhost');
UPDATE directory SET mail = CONCAT(name, '@localhost');
UPDATE authmap SET authname = CONCAT(aid, '@localhost');
UPDATE client SET mail = CONCAT(cid, '@localhost');
-- Only important if you use project and project_issue module
UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';
UPDATE projects SET mail = CONCAT("empty", '@localhost');
-- Only important if you use simplenews module
UPDATE simplenews_subscriptions SET mail = CONCAT(snid, '@localhost');

-- Clear out old webform entries which likely include e-mails
UPDATE webform_submitted_data set data='scrubbed';

-- Get rid of irrelevant data it contains IP addresses and bulks up the database
TRUNCATE accesslog;
TRUNCATE access;
TRUNCATE cache;
TRUNCATE cache_filter;
TRUNCATE cache_menu;
TRUNCATE cache_page;
TRUNCATE devel_queries;
TRUNCATE devel_times;
TRUNCATE flood;
TRUNCATE history;
TRUNCATE search_dataset;
TRUNCATE search_index;
TRUNCATE search_total;
TRUNCATE sessions;
TRUNCATE watchdog;

-- Alter sensitive entries in the Variable table
update variable set value = 's:4:"fake";' where name = 'smtp_password';

The above script is a slightly modified version of a fairly specific script that works for one specific database: drupal.org. However, you can see the techniques used here and imagine how to adapt them for other sites.

Modify the script to protect data for your specific site

One important step that I take when adapting this to a new site is to make my export from step 4 above and then do a search through the text of the export for common e-mail providers like yahoo or gmail to see if any more addresses exist. If you find one then you figure out which column they are in and use a similar "munge" process on the column updating it to null.

Take special care with some modules like twitter that may store account credentials (twitter usernames and passwords) in a database table.

And take extra special care with modules like smtp or mailhandler which store credentials for important email accounts in the variable table or other locations in the database.

Once you've got this script tuned for your site you may need to update it as you add new modules or upgrade existing modules.

And that's it - now you can share copies of your database with greater confidence that your site won't be the source of a data leak.

Edit: Updated to include webform example based on jbrauer's similar blog post.

Use Drush sql-sanitize

The Drush command line tool offered a tool called sql-sanitize that performs sanitization of a database and can be extended with hooks.

Consider this example code snippet:

<?php
/**
* Implements of hook_drush_sql_sync_sanitize().
*/
function modulename_drush_sql_sync_sanitize($source) {
 
$query = "UPDATE content_type_profile SET field_primary_email_email = CONCAT(SUBSTRING_INDEX(field_primary_email_email,'@', 1), '@example.com');";
 
$query .= "UPDATE content_type_profile SET field_work_phone_value = SUBSTRING(FLOOR(RAND() * 1000000000000) FROM 1 FOR 12) WHERE field_work_phone_value IS NOT NULL;";
 
drush_sql_register_post_sync_op('my-sanitize-id', dt('Reset email addresses in profiles/user table and phone numbers'), $query);
}
?>

That code implements the drush hook to sanitize data. Essentially the hook returns a long string of complete queries that can be executed including the traling semicolon.

Hopefully this feature can be added back to drush in the future.