Max Pesch
Software Developer at Brille24.de and passionate Open Source contributor.

How To Merge Two Sulu Instances Together

Adding a new Sulu instance to your infrastructure is rather straightforward, but how about removing one? Merging the structure and data of one CMS instance into another can lead to unforeseen obstacles. This article is an experience report from a Sulu customer, with valuable insights on making such a merge as smooth as possible.

Background

Brille24 is an online shop selling glasses in multiple European countries. In 2020, we modernized our tech stack and started using Sulu for content management. Our biggest shop is the German one; it gets far more traffic than the other shops, which warranted an entirely different architecture. We therefore decided to split the infrastructure into two parts: one Sulu instance for the German shop and another one for all the other international shops. Both instances used the same code base, with the main difference being a more elaborate and powerful Varnish caching setup for the German shop.

Recently, however, the opportunity arose to merge the Varnish configurations and infrastructure, and the reasons for keeping the instances separate became less and less convincing. On top of that, we had been running into issues with the two-system approach, including:

  • Content could not be copied between the German and international shops with Sulu's copy blocks feature
  • Content migrations had to run on four systems (two instances times two environments, staging and production)
  • Higher server costs at AWS
  • No shared media or caches

Info: Just to toss around some numbers to give a sense of the size of the migration we are talking about: in total, we had 7,000 pages across both systems, and each database dump was around 1 GB.

Getting started

The first step was spinning up a fresh instance that would become the “merged” system, for testing out the merging scripts. Getting the merged system's code to work was straightforward, as the two originating systems already shared the same code base. A few environment variables differed, but we merged them quickly, and some weren't even used anymore. (Best time to remove them.) The only things that really needed our attention were the media storage in S3 and the database. To avoid naming conflicts, we wanted sub-folders in the media storage: an image from the German shop would be moved from image.png to brille24/image.png, while the same image from the international shops would reside under int/image.png. However, this also meant changing the paths stored in the me_file_versions table in the database. We'll come back to that later.

The database

The database was the biggest headache. Sadly, there aren't many tools that help with merging databases that share the same structure (that is, the same tables and columns) but hold different data. Most of them didn't fit the use case, and some straight-up crashed under the load. In the end, writing a script that made the migration repeatable sounded like the best bet. As part of the analysis, we identified three kinds of tables in Sulu:

  1. The static tables: These tables don't need to be merged, as they contain the same contents and structure in both systems. (For example, co_address_types.)
  2. Tables with UUIDs as their primary key: These tables are also easy to work with. You can concatenate both tables, and they just work. (For example, ta_tasks from Sulu's automation bundle.)
  3. Tables using auto-incrementing IDs: These are the worst and cause the most hassle. Thankfully, there is a “nice” solution for them as well: take one database as is, and before importing the second database, add an offset to its IDs. For us, just to be sure, we took 1,000,000, so the IDs of the international system became 1,000,001, 1,000,002, and so on (see the sketch after this list).
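
To make this concrete, here is a minimal sketch of applying such an offset in a temporary copy of the international database. The table and foreign-key column names are illustrative assumptions, not Sulu's complete schema:

```php
<?php
// Sketch: shift all auto-increment IDs in a temporary copy of the
// international database by a fixed offset before re-exporting it.
$pdo = new PDO('mysql:host=localhost;dbname=int_tmp', 'user', 'password');

$offset = 1000000;

// Disable foreign key checks so primary keys and the columns that
// reference them can be shifted independently of each other.
$pdo->exec('SET FOREIGN_KEY_CHECKS = 0');

// Each entry: table => columns holding the auto-increment ID or a
// reference to it. This mapping is illustrative; derive it from your
// actual schema.
$idColumns = [
    'me_media'         => ['id'],
    'me_file_versions' => ['id', 'idFiles'],
    'phpcr_nodes'      => ['id'],
];

foreach ($idColumns as $table => $columns) {
    foreach ($columns as $column) {
        // The offset is larger than any existing ID, so shifted values
        // cannot collide with rows that have not been updated yet.
        $pdo->exec(sprintf('UPDATE %s SET %s = %s + %d', $table, $column, $column, $offset));
    }
}

$pdo->exec('SET FOREIGN_KEY_CHECKS = 1');
```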

That sounds easy. Just export both databases, increment the IDs of one of them, and be done with it, right?

Yes, but how do you increment IDs in a dumped database? Two options come to mind: dumping the SQL file and doing some regex magic on it, or importing the exported tables into a temporary database, modifying them there, and exporting the relevant tables again. The second one was easier. However, remember the file path updates for S3 in the me_file_versions table? They could have been handled the same way, but since Sulu stores the path not in a dedicated column but inside a JSON object, we went for modifying the export with regex this time.
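
For illustration, the regex pass over the dump could look roughly like this. The "fileName" key is an assumption about how the storage options JSON is serialized in your dump, so verify the pattern against your own data first:

```php
<?php
// Sketch: prefix the stored file names in the German dump with the new
// "brille24/" sub-folder. The dump is processed line by line so the
// ~1 GB file never has to fit into memory at once.
$in  = fopen('german_dump.sql', 'rb');
$out = fopen('german_dump_prefixed.sql', 'wb');

while (($line = fgets($in)) !== false) {
    // Assumes storage options like {"fileName":"image.png", ...};
    // adjust the key name and escaping to what your dump actually contains.
    fwrite($out, preg_replace('/("fileName"\s*:\s*")/', '$1brille24/', $line));
}

fclose($in);
fclose($out);
```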

The last issue we faced, at least in terms of the database, was: “What if we have a unique index on a column, but the merged table isn't unique anymore?” An example of this is the phpcr_nodes table, as we have homepages for all webspaces in both systems. The path /cmf/brille24/contents existed in both databases, so simply concatenating them would have ended up with the same page twice, violating the unique index. In this case, INSERT INTO ... ON DUPLICATE KEY UPDATE is your friend. With it, we were able to overwrite the duplicate contents from the base system (the German shop) with the content from the other system (the international shops).
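
A stripped-down version of that import could look like the following. The database names are made up, and updating only the props column is a simplification; in practice, more columns may need to be carried over:

```php
<?php
// Sketch: import the (already offset) international phpcr_nodes rows into
// the merged database. Rows whose unique path already exists, such as the
// webspace homepages, are updated with the international content instead
// of failing the whole import.
$pdo = new PDO('mysql:host=localhost;dbname=merged', 'user', 'password');

// VALUES() refers to the incoming row; it is deprecated in newer MySQL
// versions in favor of row aliases, but still works.
$pdo->exec(
    'INSERT INTO merged.phpcr_nodes
     SELECT * FROM int_tmp.phpcr_nodes
     ON DUPLICATE KEY UPDATE props = VALUES(props)'
);
```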

Complications with PHPCR

Speaking of the phpcr_nodes table (we are using the Doctrine PHPCR integration): to grossly simplify, it contains all the content, from webspaces to snippets. For minor changes like setting the same default snippet for every webspace, you can use the Doctrine PHPCR shell. But there was still the Sulu image ID problem (the “ID + 1,000,000” database migration): we had to go through all the pages and update the image IDs in the content. Neither regex nor SQL statements could easily handle the XML stored in the database, so we wrote a console command to fix it. The crux was that using the Sulu structure converter to read the structures from the database would have taken a lot of time and slowed down iteration. So we built a custom implementation that interacts with PHPCR directly.

How does this work? To get a bit technical: you load all the nodes you want to migrate; in our case, all nodes that were not under the /cmf/brille24/ prefix. Then you fetch all the properties for each node, which returns a list of every property the page has configured. Working with that flat list is a little difficult, so we first parse it into a tree shape, which allows us to look only at blocks that contain images. Then we extract the image data (a JSON string containing the ID and the image position), update the ID, and save the updated property back to the PHPCR node. This is basically what Sulu recommends for migrations.
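
For reference, here is a heavily simplified sketch of the command's core using the plain PHPCR API. The JCR-SQL2 query and the way image properties are detected (a JSON string with an "id" key) are assumptions to adapt; our real command first parsed the property list into a block tree as described above:

```php
<?php
// Simplified sketch: walk all non-German nodes via the plain PHPCR API
// and shift the media IDs stored in their properties.
use PHPCR\Query\QueryInterface;
use PHPCR\SessionInterface;

function migrateImageIds(SessionInterface $session, int $offset): void
{
    // Select every node outside the German webspace, which keeps its IDs.
    $query = $session->getWorkspace()->getQueryManager()->createQuery(
        "SELECT * FROM [nt:unstructured]
         WHERE ISDESCENDANTNODE('/cmf') AND NOT ISDESCENDANTNODE('/cmf/brille24')",
        QueryInterface::JCR_SQL2
    );

    foreach ($query->execute()->getNodes() as $node) {
        foreach ($node->getProperties() as $property) {
            $value = $property->getValue();
            if (!is_string($value)) {
                continue;
            }

            // Image data is stored as a JSON string containing the media
            // ID and the image position; anything else is skipped.
            $data = json_decode($value, true);
            if (!is_array($data) || !isset($data['id'])) {
                continue;
            }

            $data['id'] += $offset; // apply the "+1,000,000" migration
            $property->setValue(json_encode($data));
        }
    }

    $session->save();
}
```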

Conclusion

Sometimes you make decisions that you regret later on and that become expensive to fix. Some of the lessons we learned:

  • Don't use auto-increment IDs: UUIDs are easier to handle when merging content.
  • PHPCR & Doctrine are difficult to deal with when doing mass operations on huge documents. (We should have used Jackalope.)
  • Automation is key to iterating quickly.

But in hindsight, we are happy that we did it. The sooner you simplify processes and systems, the easier things get in the long run.