Making updates safe for parallel processing

I’ve seen a couple of cases recently where we were modifying a document in MarkLogic and the process was restricted to a single thread because of concerns about overwriting data. Let’s look at an example.

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  // Read the current document and compute the change from it.
  let docObj = cts.doc(uri).toObject();
  docObj.meta.foo = updateFoo(docObj.meta.foo, someArgs);

  // Write the document back, keeping its collections and permissions.
  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

Pretty straightforward. But what if there are multiple processes that end up calling this function in parallel? It’s not hard to imagine a case where process 1 and process 2 both read the document around the same time and get to work on the processing. Suppose process 1 gets to the xdmp.documentInsert first. It asks MarkLogic for a write lock on the URI, and gets it. Meanwhile, process 2 reaches the xdmp.documentInsert and requests a write lock for that same URI. MarkLogic sees that process 1 already has a lock, so process 2 blocks while waiting for the lock.

Process 1 does its document insert. When this request finishes, the change is committed, the transaction is complete, and the lock is released. Process 2 now gets the lock and proceeds to make its own update to the document.

The problem is that process 2 doesn’t know about process 1’s change. Process 2 read the document early, before that change was committed, and prepared its own update based on that data. When process 2 commits its changes, it will overwrite process 1’s changes. (This is known as the Lost Update problem.)
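The interleaving described above can be reproduced outside MarkLogic. The following is a minimal sketch in plain JavaScript, not MarkLogic code: the `store` object and the `readDoc` and `writeDoc` helpers are hypothetical stand-ins for the database, and the two "processes" are simply interleaved by hand.

```javascript
// Plain in-memory stand-in for the database (an assumption, not MarkLogic).
const store = { '/some/content.json': { meta: { foo: [] } } };

// Reading returns a private copy, like the data a process works from.
function readDoc(uri) {
  return JSON.parse(JSON.stringify(store[uri]));
}

// Writing replaces the stored document wholesale, like a document insert.
function writeDoc(uri, doc) {
  store[uri] = doc;
}

const uri = '/some/content.json';

// Both processes read before either one writes.
const copy1 = readDoc(uri);
const copy2 = readDoc(uri);

copy1.meta.foo.push('from process 1');
writeDoc(uri, copy1); // process 1 commits

copy2.meta.foo.push('from process 2');
writeDoc(uri, copy2); // process 2 commits, based on its stale copy

console.log(store[uri].meta.foo); // [ 'from process 2' ]
```

Process 1's entry is gone because process 2 built its version from a copy taken before process 1 committed.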

The solution is to ensure that the document won’t change between the time a process reads it and the time the process updates it. Let’s revise our function:

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  // Take the write lock before reading, so the document can't change underneath us.
  xdmp.lockForUpdate(uri);
  let docObj = cts.doc(uri).toObject();
  docObj.meta.foo = updateFoo(docObj.meta.foo, someArgs);

  // Write the document back, keeping its collections and permissions.
  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

The xdmp.lockForUpdate function tells MarkLogic to get a write lock right away, blocking if another process already has the lock. The lock will automatically be released at the end of the transaction, just as if the lock were obtained implicitly by calling xdmp.documentInsert.

The placement of xdmp.lockForUpdate is important. Note that it comes before the call to cts.doc, where the process reads the document.

Let’s revisit process 1 and process 2. Each comes into the function and sets the uri variable. They get to the lock request. Suppose process 1 gets the lock first. Process 2 will then block and wait for the lock.

Process 1 goes on to calculate its update and insert the revised document into the database. It finishes, commits the transaction, and releases the lock (all automatically).

Process 2 is now able to get the lock on uri. Only now does it read the document. Because this is an update request, it will see the latest version of the document, including process 1’s changes. Process 2 calculates its update based on the latest version of the document and inserts the new version.
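To make the ordering concrete, here is a small simulation in plain JavaScript (again, not MarkLogic code). It assumes a hypothetical in-memory `db` object and a `locks` set in place of the database, and models each process as a generator so that a simple scheduler can interleave them. The lock is taken before the read, as in the revised function.

```javascript
// In-memory stand-ins (assumptions, not MarkLogic): one document and a set of locked URIs.
const uri = '/some/content.json';
const db = { [uri]: { meta: { foo: 0 } } };
const locks = new Set();

// Each process is a generator; every `yield` is a point where the
// scheduler below may switch to the other process.
function* makeUpdates(amount) {
  while (locks.has(uri)) yield;   // block until the write lock is free
  locks.add(uri);                 // take the lock BEFORE reading

  const docObj = JSON.parse(JSON.stringify(db[uri])); // read a private copy
  yield;                          // time passes while the update is computed
  docObj.meta.foo += amount;

  db[uri] = docObj;               // insert the revised document
  locks.delete(uri);              // commit: the lock is released
}

// Round-robin scheduler: advance each unfinished process one step per round.
function runInterleaved(procs) {
  let active = procs.slice();
  while (active.length > 0) {
    active = active.filter(p => !p.next().done);
  }
}

runInterleaved([makeUpdates(1), makeUpdates(10)]);
console.log(db[uri].meta.foo); // 11: both updates survive
```

The second process spends its first rounds waiting on the lock, so its read happens only after the first process has written and released.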

This change ensures that the updates happen safely. We can also look at performance. The xdmp.lockForUpdate call must happen before the document read, but any processing that does not depend on that data should happen before the lock request, so that the lock is held for as short a time as possible. For example, for one client we need to read some data from SharePoint and update the document with it. The interaction with SharePoint includes a network call and other processing; it takes a little time but does not depend on the data in the target document. We do that first, then request the lock.

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  // Slow work that doesn't depend on the document happens before the lock.
  let someData = getUpdates(someArgs);
  xdmp.lockForUpdate(uri);
  let docObj = cts.doc(uri).toObject();
  // push modifies the array in place and returns the new length, so don't reassign.
  docObj.content.data.push(someData);

  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

This allows process 1 and process 2 to perform the SharePoint work in parallel before getting to the lock request.
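The effect of lock placement on throughput can be sketched with a toy model in plain JavaScript. This is not MarkLogic code and not a benchmark: the `locks` set, the generator-based processes, and the tick counts are all assumptions chosen for illustration. Each `yield` stands for one unit of time, and the scheduler counts how many rounds it takes for two processes to finish.

```javascript
// Toy model (assumption, not MarkLogic): a set of currently locked URIs.
const locks = new Set();
const uri = '/some/content.json';

// One process: `slowTicks` units of external work plus one unit for the
// read-modify-insert step. `lockEarly` controls where the lock is taken.
function* makeUpdates(lockEarly, slowTicks) {
  if (lockEarly) {
    while (locks.has(uri)) yield; // wait for the lock
    locks.add(uri);
  }
  for (let i = 0; i < slowTicks; i++) yield; // slow external call
  if (!lockEarly) {
    while (locks.has(uri)) yield; // wait for the lock
    locks.add(uri);
  }
  yield;                          // read, modify, insert
  locks.delete(uri);              // commit releases the lock
}

// Advance every unfinished process one step per round; return the round count.
function countRounds(procs) {
  let rounds = 0;
  let active = procs.slice();
  while (active.length > 0) {
    rounds++;
    active = active.filter(p => !p.next().done);
  }
  return rounds;
}

const early = countRounds([makeUpdates(true, 3), makeUpdates(true, 3)]);
const late = countRounds([makeUpdates(false, 3), makeUpdates(false, 3)]);
console.log({ early, late });
```

In this model, taking the lock late finishes in fewer rounds, because the slow work of the two processes overlaps and only the short read-modify-insert step is serialized.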

A simple change can prevent lost updates, ensuring data safety and (in some cases) allowing greater performance by making parallel processing safe.
