Data Synchronization

I haven’t been blogging much.  I suffer from what we’ve termed the “Great American Post Syndrome” (GAPS).  That is, every blog post needs to be the best piece of technical literature ever written, so you actually never post anything.

So, the result is few posts, but a large drafts folder.  Looking through it, judging by volume, I must think HTML5 and offline data synchronization are REALLY important topics.

Lately we’ve been working with a client on HTML5 (yes, its a scandal).  A lot of that work has to do with applying mobile lessons to HTML5 development.  Just because you’re building a “web app” doesn’t mean you suddenly don’t have to worry about being offline.  Some day we might chat about how to do this in HTML5, but since syncing data needs to happen in both contexts, and the concepts are pretty similar, we’ll start there.

This will be a multi-part series.  One way to avoid GAPS is to write little bits and try not to take any particular one too seriously.

BTW, I swear GAPS is not a Backronym. I didn’t even notice it till I wrote out the letters.

Definition

Synchronizing data is a big topic.  It means different things to different people, and is obviously framed by projects they’ve worked on.  In our cases, it generally means one or more data entities in database tables.  Data may be entered locally, pushed remotely to a server, and mixed and pulled back with data from other users that could conflict.  There are probably relationships between the tables, and specific business logic around entities.  Summary, “synchronizing” is not as simple as copying data up and back to the server. It may also include data outside of the database, like image file uploads.

On top of that, you often don’t control the server API. Basically, it ranges from “difficult” to “captain, we’ve lost the hydraulics”.

Changes to data should be possible offline.  If you’re not supporting that, stop reading. You aren’t synchronizing data. You’re writing a web app. If your users are on mobile devices, they will curse you often, but supporting offline data is difficult and time consuming. You decide.

The basic workflow is:

  1. Update data from the server (periodically)
  2. Add/Edit/Delete data locally
  3. Push these changes to a central server

It seems like something that should be pretty simple, and sometimes it is. If you only have one or two tables, and if your user’s data can only be edited by you, or if data is only rarely changed, this can usually be handled with some manual code.  When your app gets significantly more complex, you’ll find syncing becomes difficult, but not only that.  You’ll have weird issues that are very difficult to debug, because they don’t come up in testing and are pretty much impossible to reproduce. You should implement robust error and crash report, but that’s the topic of another post. I am STUNNED by how many app projects we jump on that don’t have crash reporting. If your method of error reporting is one-star reviews, somebody needs to throw you an intervention (hmm. Reality show? App Rescue? I think I could yell at people).

Methods of Synchronization

There are probably several more ways that you could manage your syncing, but I think there are basically two methods: state-based (SB) and command-based (CB).

SB is the most obvious method.  When its time to sync with the server, you crack open the db and look at the tables to figure out what needs syncing.  This requires, at a minimum, a “dirty” flag column. That might be simply that, but often its an “updated” column with a timestamp.

This “works”, and is the most obvious solution.  Query the tables, push data. Done. However, it can get tricky when you have multiple tables, and when those tables relate to each other.

Imagine an app with a Company table and a Product table.  You can’t create a product without a company, so when you run your sync, you need to make sure you do things in order. That’s easy enough. Just query by the “updated” column.

Now, imagine you create a company, then add a product, but then update the company.  If you simply use “updated”, your sync will fail.  Now you need to keep an “added” and “updated” field, or you need to send the company info on each product add, or something more painful.

OK, that sucks.  Now the “business” wants to have a “main product” field added to the company table.  Whoops.  Circular relations.  Then you need to throw in the other 10-20 tables that a “mature” app inexplicably seems to grow, and deal with their data and temporal relations.

Also, maybe you just want to edit the name of the company, and somebody else edits the address.  Since nobody is keeping tabs on what data was edited, the sync will fail, but you can’t shake the idea that if you’d add a little more info to your table you could do this properly. So now there’s a “nameUpdatedAt” field and an “addressUpdatedAt” field.  Seems reasonable.

Then a new developer starts and over the next few weeks you swear you hear her saying “what is this garbage?” under her breath, over and over.

If you’re remotely awake and reading this, you’ll probably guess that I think CB is usually the way to go, but before we get into the details, I’ll say that simpler data is probably better handled by SB.  With one or two simple tables, and a fairly infrequent update schedule, or a small number of users, state-based is obvious to code, and has a “safer” feeling. The dirty flag doesn’t get cleared until you get an affirmative from the server. If you’re debugging your app and you want to see a snapshot of current status, simply query the db. Simplicity is valuable, so long as it actually is as simple as you want it to be (that’s another syndrome that needs a name. Coding things to be as simple as you want them to be, rather than accepting their actual complexity and dealing with it. Todo: make a backronym from CRAPS or something similar).

Command based is what I’d call a “pulling the sock inside out” approach. If we had some magic computer science complexity evaluation device, properly syncing data with either method would probably look exactly the same, stripped down to its core operations. Only the way its done is different (and much easier to comprehend).

CB works as follows.  Rather than look at the state to figure out what you need to do, you record “what you need to do”, as commands. When you add a company, you push the “AddCompany” command into a queue. When you add a product, the “AddProduct” command.  Update the company? Shocker. Push the “UpdateCompany” command.

In its simplest form, these commands get played back in the order they were added.  The part about this being an inside-out sock is that with SB, you’re generating a list of commands from the data state.  With CB, you’re simply recording those commands when the actions happen.  At the end of the day, what gets pushed over the wire is roughly equivalent.

The reason why this is better is because you don’t really need to deal with the complexities of your data relationships, or with temporal issues (by “temporal” I mean “this needs to happen before that”. Its pretty rare that you can say “temporal” with a straight face in a non sci-fi context, so enjoy it. You’re welcome).  Also, if you wanted fine grained operations, you can split out updates into different commands.  You could have an “UpdateCompanyName” command, for example.  As with most things in life, the 80/20 rule applies here.  Maybe you put the extra effort in for one or two main tables, and do course grained updates on the rest.

You can also write commands for stuff that isn’t kept in the DB. Image uploads, again, are the common example.

The benefit isn’t free. You need to implement a rock-solid queueing system, that includes storing commands, dealing with errors, etc. If you have a bug in your queue code, you could lose updates.  SB has the advantage of safety, or at least the impression of safety.  You don’t clear the dirty flag until you get an affirmative response from the server.

Fortunetely, the parts of a CB queue that need to be rock-solid are pretty common across implementations, and we’ve built one.

https://github.com/touchlab/Superbus

Its overly complex right now. We’ve had a few iterations, which has resulted in several options for storing command data. I may trim all but the safest, which is SQLite. File based is OK, but with SQLite, you can include command data updates inside of your data transactions, so command and entity state changes are atomic.  Properly implemented, this should ensure the state of the command queue and your own data remain in lockstep.  Of course, I pretty much guarantee you’ll have logic issues when implementing your sync queue. Especially if its your first time.

But if it was easy, everybody would do it.

Later in the series, I’ll cover a basic implementation and talk about common pitfalls.