String interpolation in Apache NiFi

In one of my recent posts, I talked about ExecuteScriptMarkLogic, a handy processor for getting Apache NiFi to talk to Progress MarkLogic. I’d like to share a couple of little “gotchas” that we’ve run into. I’m using ExecuteScriptMarkLogic to illustrate the point, but it applies to any processor where we embed JavaScript code.

In NiFi’s expression language, we use the following syntax to refer to a flowfile attribute value: "${myAttribute}". This lets us use attribute values in the code we run in MarkLogic. Here’s an example:

'use strict';
const myLib = require('/lib/myLib.sjs');

myLib.myFunction('${myAttribute}');

There are a couple things to note here.

Direct Replacement

With the code above, the ${myAttribute} will be replaced by whatever value is found in the corresponding attribute of the flowfile — if any. Note that I surrounded the reference with quotation marks. It’s easy to think of attribute values as strings, but the value will simply replace the reference without adding anything. If the attribute value is “foo” and I left the quotes out of my code, I’d end up with myLib.myFunction(foo). That’s invalid. The quotes ensure that the value will be presented as a string. If there is no value, we’ll get an empty string.
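
To make that concrete, here is roughly what MarkLogic receives after NiFi’s substitution, assuming the attribute value is foo:

// with the quotes in place, NiFi hands MarkLogic:
myLib.myFunction('foo');

// without the quotes, it would hand over:
myLib.myFunction(foo);   // invalid: foo isn't a defined variable

// and if the attribute has no value, the quoted version becomes:
myLib.myFunction('');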

Of course, it might be that I want a number, not a string. Does that mean I can just leave off the quotes and be done? You could, but you’re taking a chance. If there’s no value, you aren’t getting a zero, you’re just getting … nothing. Consider myFunction(${myAttribute}). If myAttribute has no value, this code will be evaluated as myFunction(). That may not be what you want. Worst case, your code might return a wrong answer instead of failing. If a missing or non-numeric value is something you want to handle, you can do this:

'use strict';
const myLib = require('/lib/myLib.sjs');

const myValue = parseInt('${myAttribute}', 10);
if (!isNaN(myValue)) {
  // do the processing
} else {
  // do alternative processing or error logging, perhaps including throw
}

String Interpolation

Another fun little gotcha — NiFi attributes aren’t the only place this syntax gets used. JavaScript has a string interpolation feature:

const myValue = 5;
const count = `1, 2, ... ${myValue}! (Three, sir!)`;

At the end of this code, count will have a string value of “1, 2, … 5! (Three, sir!)”. However, if you run this code in ExecuteScriptMarkLogic, NiFi gets to it first. When it sees the ${myValue}, it will do a direct replacement with the value of the myValue attribute. Since this happens before JavaScript evaluates the code, JavaScript doesn’t have the chance to do the string interpolation. The resulting string will depend on the value of that flowfile attribute, which may not even exist.

Happily, there’s a simple workaround — don’t use string interpolation in this context:

const myValue = 5; 
const count = '1, 2, ... ' + myValue + '! (Three, sir!)';

This way NiFi never sees the ${} notation and you get the expected result.

Conclusion

Using NiFi flowfile attribute values to drive processing in MarkLogic is a great example of NiFi’s data orchestration. As developers, we need to be conscious of exactly what we’re asking the software to do. Hopefully these tips help you avoid some subtle errors and get you one step closer to production!

Apache NiFi and Progress MarkLogic

For years, I’ve used Apache NiFi as a data orchestration tool. Using NiFi’s built-in scheduler, we pull data from upstream sources, send it to Progress MarkLogic, and trigger MarkLogic to take certain actions on that data. We also use NiFi to ask MarkLogic for information that is ready to process and take action based on that. This post looks at best practices for that communication.

The 4V Nexus search application has a few examples of this. 4V Nexus allows content administrators to add content sources (SharePoint, Confluence, Google Drive, Windows File Shares, etc), specify how often they should be ingested, and what credentials to use when connecting. Sometimes a content admin may want to remove a content source. Removing the source from the list tracked in MarkLogic is simple, but if we have brought in a large number of documents from that source, we may not be able to remove them all in a single transaction. Apache NiFi helps with this — ask MarkLogic for a list of affected documents, process them, then report back to MarkLogic when all content has been cleaned up. At that last stage, we tell MarkLogic to remove the source from the list.

That process has a few points where NiFi is talking to MarkLogic. There are multiple ways to do this. Let’s take a look at some of the tradeoffs.

ExecuteScriptMarkLogic

The simplest way for NiFi to talk to MarkLogic is to use the ExecuteScriptMarkLogic processor. With this approach, we put code into the processor itself, which is then run in MarkLogic as an eval. This makes for pretty easy testing — if you don’t quite get the result you were hoping for, you change the code and send another flow file through it. This allows for rapid interaction, similar to working with Query Console.

This works well for exploratory coding, but has some downsides when it comes to lasting code in your application.

The code itself lives in NiFi. You’ll likely commit this code to the NiFi Registry, but it is disconnected from the rest of your application’s code (likely in git). More significantly, it’s hard to test this code. The only way you can test it is by running flow files through the processor with the various inputs you’d like to test.

My preferred approach when using ExecuteScriptMarkLogic is to push the code into a library. The code that remains in NiFi then looks something like this:

const myLib = require('/lib/myLib.sjs');
myLib.callMyFunction('#{someParam}', '${flowFileAttribute}');

The processor’s job is simply to convert what NiFi knows into parameters to the function. (This approach is good for APIs, too.)

Now that the code is in a library, I can make these same types of calls in Query Console. Even better — I can write unit tests, setting up a variety of scenarios and prevent regressions when making changes.
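
As a sketch of what that can look like — the library path, function name, role of the parameters, and assertions below are placeholders, and the test follows the usual marklogic-unit-test conventions:

// /lib/myLib.sjs — deployed to MarkLogic, versioned with the rest of the application
'use strict';

function myFunction(someParam, attributeValue) {
  // the real work goes here; return something testable
  return someParam + ':' + attributeValue;
}

module.exports = { myFunction };

// /test/suites/my-lib/basic.sjs — a marklogic-unit-test test module
'use strict';
const test = require('/test/test-helper.xqy');
const myLib = require('/lib/myLib.sjs');

const assertions = [];
assertions.push(test.assertEqual('a:b', myLib.myFunction('a', 'b')));
assertions.push(test.assertEqual('a:', myLib.myFunction('a', '')));  // simulate a missing flowfile attribute
assertions;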

There’s also the option to use ExecuteScriptMarkLogic’s module path setting, which likewise runs code that has been deployed to MarkLogic. I like it better, but it still carries the security consideration discussed below.

CallRestExtensionMarkLogic

While really convenient, ExecuteScriptMarkLogic isn’t the only way to call MarkLogic. The CallRestExtensionMarkLogic processor lets you call a MarkLogic REST API extension. At some level, this is pretty similar to the code above that imports a library and calls a function. There is one big difference to be conscious of. With this approach, the user that NiFi uses to talk to MarkLogic will need REST-related privileges, such as http://marklogic.com/xdmp/privileges/rest-reader and http://marklogic.com/xdmp/privileges/rest-writer. On the other hand, ExecuteScriptMarkLogic uses the /v1/eval endpoint, which requires privileges like http://marklogic.com/xdmp/privileges/xdmp-eval, privileges that have pretty broad scope.
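
For reference, a REST extension that wraps the same library might look something like this minimal sketch (the extension name, parameter names, and library path are placeholders):

// resource extension module, installed as e.g. /v1/resources/my-extension
'use strict';
const myLib = require('/lib/myLib.sjs');

function get(context, params) {
  context.outputTypes = ['application/json'];
  // map HTTP inputs to function parameters, just like the eval-based version above
  return { result: myLib.myFunction(params.someParam, params.flowFileAttribute) };
}

exports.GET = get;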

It’s worth noting that you should use CallRestExtensionMarkLogic, not the older ExtensionCallMarkLogic. Several improvements in the newer processor led to deprecating the latter, the most significant being much better error handling.

Conclusion

Both of these processors have their place. The privileges needed for ExecuteScriptMarkLogic are typically locked down to a role intended for NiFi’s use, but this should be considered for your application. Make sure the bulk of your code is set up to be properly version controlled and — most importantly — well tested!

Updating document quality

A little-used Progress MarkLogic feature (from what I’ve seen) is the ability to change a document’s quality. Lowering this value will make it drop in search results, while increasing it will make it more prominent.

For one client, we’re allowing subject matter experts to lower the quality of documents that aren’t very helpful. They get a button that is hidden from regular users. Clicking the button sends a request to the middle tier, which in turn uses Progress MarkLogic’s REST API to change the quality of the target document. This can be accomplished with a PUT request to /v1/documents.

There was only one thing that threw us off for a moment: we thought we could handle this through the URL parameters, but with no message body, we got a 400 error. Here’s what to do instead, presented as a curl command:

curl --location \
  --request PUT \
  'http://localhost:8010/v1/documents?uri=/content/1234.json&category=quality' \
  --header 'Content-Type: application/json' \
  --data-raw '{ "quality": 10 }'

I’m leaving authentication as an exercise for the reader, but this shows specifying the needed information in the body. Hope that helps someone!

Testing Custom Progress MarkLogic APIs

We can use the marklogic-unit-test framework to test custom APIs hosted in Progress MarkLogic. Doing so is more of an integration test than a unit test, allowing us to ensure that HTTP inputs are correctly mapped to function parameters and that the API call works all the way through.

So why don’t we only do this? A couple reasons.

First, HTTP calls are slower than library calls. Second, when an API-level test fails, there may be a lot of code that was run. “Why did it fail?” can be a pretty involved question.

The nice thing about unit tests is that you can be very focused on a specific piece of functionality. The nice thing about integration tests is making sure those pieces work together. There’s room for both.

Authentication

Suppose we do want to build some API-level tests. We can do that using Progress MarkLogic’s xdmp.httpGet, xdmp.httpPost, and other functions. That brings us to an important question — how do we authenticate those calls?

Progress MarkLogic has a great feature for this: Secure Credentials. Secure Credentials are a way to let application code make use of a set of credentials without recording a password in the modules database. For testing, we want to create a user that has the required roles for the test, but we want to do so in a secure way.

For a test suite, we can decide on the name of a Secure Credential that we’ll use for that suite. In suiteSetup.sjs, we can create a new user with a UUID in the name and use a UUID for the password. This user is granted the roles needed to call the API.

We then create a secure credential with the previously determined name for that username, targeting %%mlHost%% and the test port that we’re using. Now we can run our tests. Our test code not only doesn’t have the password embedded, it doesn’t even know what it is — it was randomly selected while creating the user and credential and then forgotten.
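
A rough sketch of the suiteSetup piece might look like this — the suite naming, role name, and comments are placeholders, and the sec functions have to run against the Security database:

// suiteSetup.sjs — create a throwaway test user; the password is random and never recorded
'use strict';
const sec = require('/MarkLogic/security.xqy');

const suiteUser = 'api-test-user-' + sem.uuidString();
const password = sem.uuidString();   // generated here, used once below, then forgotten

xdmp.invokeFunction(
  function createTestUser() {
    // grant only the roles the API under test actually needs
    sec.createUser(suiteUser, 'Temporary user for API-level tests', password,
      ['my-api-role'], null, null);
    // next step: create the Secure Credential for this user/password with
    // sec.createCredential, targeting %%mlHost%% and the test app server port
  },
  { database: xdmp.securityDatabase(), update: 'true' }
);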

After our tests, the suiteTeardown can remove the credential. We can get the name of the user from the credential, so we can remove the user at the same time.

This approach allows us to build tests using HTTP calls in a secure way. As always, you should only grant this user the roles needed to call the API. This helps ensure that your security setup is valid as well as your code.

Nulls and the Empty Sequence

We recently came across a neat little gotcha that I thought was worth sharing. I’ve written before about how JSON Nodes and JS objects look the same but act differently in Progress MarkLogic; this is similar but I’m looking at null and the empty Sequence.

Let’s create a really simple document to play with:

'use strict';
declareUpdate();
xdmp.documentInsert(
  "/content/book1.json",
  {
    "book": {
      "title": "A book title",
      "subtitle": "A subtitle"
    }
  }
)

And a template to create a very simple view:

'use strict';

const tde = require("/MarkLogic/tde.xqy");

let template = xdmp.toJSON({
  "template": {
    "context": "/book",
    "rows": [
      {
        "schemaName": "fourV",
        "viewName": "book",
        "columns": [
          {
            "name": "title",
            "scalarType": "string",
            "val": "title"
          },
          {
            "name": "stuff",
            "scalarType": "string",
            "val": "foo",
            "nullable": true
          }
        ]
      }
    ]
  }
});

tde.templateInsert("/tde/book.json", template)

You’ll see that the stuff column looks for the "foo" property, which doesn’t exist. That’s okay, the column is nullable. Now we can do a simple Optic query:

'use strict';

const op = require("/MarkLogic/optic");

op.fromView("fourV", "book")
  .result()

That gives us a Sequence of 1 item that looks like this:

{
  "fourV.book.title": "A book title", 
  "fourV.book.stuff": null
}

As expected we get the title and a null value. Simple as that, right? Well, there’s a wrinkle. That null turns out not to be a null:

'use strict';

const op = require("/MarkLogic/optic");

op.fromView("fourV", "book")
  .result()
  .toArray()[0]["fourV.book.stuff"] === null

That query returns false. Although it gets serialized as null, the actual data structure is an empty Sequence (this is how it gets represented in the index). Instead of comparing to null, we use fn.empty or fn.exists to determine whether there is an interesting value there.

'use strict';

const op = require("/MarkLogic/optic");
fn.empty(op.fromView("fourV", "book")
  .result()
  .toArray()[0]["fourV.book.stuff"])

Just for fun, what if you really wanted to use the === operator for your comparison instead of fn.empty/exists? You can create an empty Sequence object using Sequence.from.

'use strict';

const op = require("/MarkLogic/optic");
const EMPTY_SEQUENCE = Sequence.from([]);
op.fromView("fourV", "book")
  .result()
  .toArray()[0]["fourV.book.stuff"] === EMPTY_SEQUENCE

The key takeaway is to remain aware of the data type you’re working with. Every now and then its serialization may fool you, but now you’ve got another trick to figure it out.

Apply temporal to an existing document

MarkLogic’s temporal feature allows an out-of-the-box way to preserve copies of a document when it gets updated. You can read much more in the Temporal Developer’s Guide, but I had a need to look at a particular question recently — how do I make a non-temporal document temporal?

First, let’s think about what makes a document temporal. A temporal document 1) is in a temporal collection, 2) has timestamps that indicate its lifetime, and 3) has a collection named for its URI. The latest collection also indicates the current version of a temporal document. We can create a temporal document using the temporal.documentInsert function (from the /MarkLogic/temporal.xqy module), which includes a parameter to specify the temporal collection. The timestamps are managed automatically. MarkLogic’s Data Hub Framework can also write temporal documents (using temporal.documentInsert under the hood), which I explored in the dhf-temporal project on Github.

So what happens if you try to do a temporal.documentInsert with a URI that already points to a non-temporal document? MarkLogic will throw a TEMPORAL-NOTINCOLLECTION error. We could brute force it by deleting the document and then doing a temporal insert, but that has to be done in two separate transactions to avoid XDMP-CONFLICTINGUPDATE. That works, but there is a non-zero risk that the delete will succeed, but the insert won’t.

The alternative is to update in place, applying temporal aspects to the non-temporal document. Here’s an example:

'use strict';
declareUpdate();
let uri = "/claims/claim3.json";
xdmp.documentAddCollections(
  uri, 
  [uri, "claim/temporal", "latest"]
);
xdmp.documentPutMetadata(
  uri, 
  {
    "claim-system-start": fn.currentDateTime(),
    "claim-system-end": "9999-12-31T11:59:59Z",
    "temporalDocURI": uri
  })

Our target URI is a non-temporal document that we want to add to the claim/temporal temporal collection. As such, we add three collections. The collection named for the URI lets MarkLogic associate copies of this document together. The latest collection tells MarkLogic which is the most recent. Since there’s only one copy of this document, we want it to be the latest.

We also set the metadata. Note that the names used need to match up with the temporal axis values used in conjunction with the temporal collection.

We set the end time to “9999-12-31T11:59:59Z”, which is “the end of time”. We can set the start time to whatever value we choose. Note that you might want to set this to some time in the past to allow for more interesting temporal queries.

After running this, the document is now a temporal document. Running temporal.documentInsert on it will behave as expected, archiving this version and creating a new one.

To apply this change to an existing database, you can deploy the temporal collection & axis, then run a CORB job to add the non-temporal data to the temporal collection. If you have an on-going ingest process that may overwrite some of the relevant documents, consider pausing that process between the deployment and completion of the CORB job to avoid transition errors.

What does it mean to be a MarkLogic DBA?

The responsibilities of a DBA are different for MarkLogic than for a traditional relational database. While the line of responsibility among a DBA, the development team, and system administrators will be drawn differently at every organization, here are some guidelines you can use.

Included

The activities in this section will generally be handled by a MarkLogic DBA.

Monitor Logs. Review logs for error messages. If errors appear in an application’s app server logs, report them to the appropriate development team. If they appear in the cluster-wide ErrorLog.txt, address them. Logs are often pulled into another system for easier monitoring.

Manage User Accounts. Create new user accounts with appropriate roles. Periodically review them to remove those no longer needed.

Manage Backups and Restores. The development team can and should include scheduled backups in their database configuration. The DBA can advise them on the configuration (based on Recovery Point Objective) and determine the directories for those backups. The DBA can perform a manual backup or restore when called for.

Upgrade MarkLogic. The DBA will perform upgrades of the MarkLogic cluster. Part of this task is communicating with the development teams of any applications deployed to the cluster; those teams must do their own testing to ensure their applications will work with the new version.

Manage SSL Certs. Secure access to MarkLogic’s application servers is a key requirement for many applications. SSL certificates need to be renewed periodically. Keep track of when they expire and renew them before the deadline.

Assign Prefixes and Port Numbers. If multiple applications are deployed to a MarkLogic cluster, each will need a distinct name (used in app servers, databases, forests, etc.) as well as a set of ports. The DBA can work with the development team(s) to ensure uniqueness.

Manage failovers. The DBA can work with network administrators to manage failing over to a DR site during an outage. The DBA can also monitor failover forests to ensure one server in a cluster doesn’t get overloaded.

Monitor forest capacity. Observe forest capacity and make recommendations based on MarkLogic’s guidelines.

Not Included

Some activities handled by DBAs in relational systems are better managed by the development team when using MarkLogic.

Schema changes. Unlike a relational database, there is no fixed schema to modify. If an application’s data changes form, the development team will need to address this.

Performance tuning. A knowledgeable MarkLogic DBA can observe and advise the development team, but generally the developers will modify indexes and queries. The DBA may monitor resource usage and alert the development team about resource limitations or work with system administrators to increase hardware.

Might Be Included

Manage the Security Database. MarkLogic states that the Security database is intended to be shared among applications deployed to a MarkLogic cluster. Due to this sharing, there is a possibility for collisions to happen if good naming conventions are not determined and enforced. A DBA can play a role in this, or it might be handled by a cross-application architect.

YMMV

As noted, the assigned duties will vary to some extent in each organization. What responsibilities do your MarkLogic DBAs take on?

MarkLogic index data types

MarkLogic offers several types of indexes: the Universal Index, range indexes, and the triples index. These indexes provide fast access to your content and can be configured to work with specific data types. MarkLogic will even do some type conversions for you.

Universal Index

Let’s insert a couple documents. Note the difference between the updated properties (“T” versus no “T”) and the types of the someNumber property.

'use strict';
declareUpdate();
xdmp.documentInsert(
  "/content/doc1.json",
  {
    "updated": "2022-07-13T00:00:00",
    "someNumber": 1
  }
)
xdmp.documentInsert(
  "/content/doc2.json",
  {
    "updated": "2022-07-12 00:00:00",
    "someNumber": "2"
  }
)

The Universal Index will store each of these values, along with the structure, as they are provided to MarkLogic. We can query those as soon as the transaction completes. To do so, we need to query for the specific value of the right type: cts.jsonPropertyValueQuery("someNumber", 1) will find doc1.json, but cts.jsonPropertyValueQuery("someNumber", "1") will not.
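
A quick way to see that in Query Console — a sketch against the two sample documents above:

// the number 1 matches doc1.json, where someNumber was stored as a number
cts.search(cts.jsonPropertyValueQuery("someNumber", 1));

// the string "1" matches neither document
cts.search(cts.jsonPropertyValueQuery("someNumber", "1"));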

Range Indexes

Let’s set up 2 range indexes:

  • On the “updated” property with type “dateTime”
  • On the “someNumber” property with type “int”

I remember that at some point in the past, doc2.json would have been rejected, because a valid dateTime has to have a “T” between the date and the time. (In other words, xs.dateTime("2022-07-12 00:00:00") would fail.) MarkLogic changed that at some point; our sample data values, both with and without the “T”, can be passed to the xs.dateTime constructor successfully. If we ask MarkLogic for the values in the range index, we’ll see both dateTimes (with the “T”):

cts.values(cts.jsonPropertyReference("updated"))
=>
2022-07-12T00:00:00 2022-07-13T00:00:00

Likewise, we can do an inequality query whether our input has the “T” or not:

cts.search(
  cts.jsonPropertyRangeQuery(
    "updated", 
    ">=", 
    xs.dateTime("2022-07-12 00:00:00")
  )
)

Triples Index

The triples index, which powers both triples and views, also does this conversion. Let’s add a template:

'use strict';
 const tde = require("/MarkLogic/tde.xqy");
 const typeTemplate = xdmp.toJSON(
   {
     "template": {
       "context": "/",
       "directories": ["/content/"],
       "rows": [
         {
           "schemaName": "test",
           "viewName": "types",
           "columns": [
             {
               "name": "updated",
               "scalarType": "dateTime",
               "val": "updated",
               "invalidValues":"reject"
             },
             {
               "name": "someNumber",
               "scalarType": "int",
               "val": "someNumber",
               "invalidValues":"reject"
             }
           ]
         }
       ]
     }
   }
 );
 tde.templateInsert(
   "/test/typeTemplate.json" ,
   typeTemplate,
   xdmp.defaultPermissions(),
   ["TDE"]
 )

Now we can do a simple query and see that the values have been converted to their target types:

select * from test.types

test.types.updated    test.types.someNumber
2022-07-13T00:00:00   1
2022-07-12T00:00:00   2

Note that our template doesn’t have any code to explicitly convert the values; MarkLogic just does it for us.

Impact

I find this implicit conversion especially helpful for xs.dateTime. Relational databases often use the format without the “T” in the middle. When ingesting data from such sources (or accepting queries from consumers that expect that format), the ingest process would need to add the “T” in order to match the expected format if the implicit conversion didn’t happen.

The key thing is to remember that the value in the document (and in the Universal Index) hasn’t changed — MarkLogic stores whatever is provided. If you have a property where the source doesn’t reliably provide the same type, remember that your value queries will need to match both type and value (as in the case for the someNumber property above).
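
For example, to find documents by someNumber when the sources haven’t been consistent about the type, the value query needs to cover both forms. A sketch:

// matches doc2.json (which stored the string "2"), and would also match
// any document that stored the number 2
cts.search(
  cts.orQuery([
    cts.jsonPropertyValueQuery("someNumber", 2),
    cts.jsonPropertyValueQuery("someNumber", "2")
  ])
);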

Scoping queries in the Optic API

Every now and then I write an Optic query that has parts that look redundant. In the example below, assume the 4V.Sample view has a “keyCode” field and that the JSON docs the view is built from have a “keyProp” property. My goal is to gather some information from a view for a specific set of values. The set of values is itself the result of a calculation.

const op = require('/MarkLogic/optic');

// Get an array of objects that include a "code" property.
// These are the interesting ones to use in our query.
const myKeys = buildKeys();
// Make an array of just the code values
let myKeyCodes = myKeys.map(item => item.code);
let result = op
  .fromLiterals(myKeys)
  .joinLeftOuter(
    op
      .fromView("4V", "Sample")
      .where(cts.jsonPropertyValueQuery("keyProp", myKeyCodes)),
    op.on(op.col("code"), op.col("keyCode"))
  )
  .limit(100)
  .result()
  .toArray();

let response = {
  time: xdmp.elapsedTime(),
  result
};

response

Notice the .where clause. This is a scoping query that narrows down the possibilities to consider in the rest of the query. Of course, with the joinLeftOuter, it’s redundant — we’ll get the same results with or without that .where, because we’re only going to produce one row for each row on the left (the literals).

So why bother? Performance.

I ran this query multiple times with and without the .where clause on a data set with about 750,000 rows in the 4V.Sample view. Without the clause, it took just short of 3 seconds to run. With the clause, it was about 0.1 seconds. Big difference!

Binding multiple values

With MarkLogic’s SPARQL queries, we can bind a value to constrain the query. Using this capability, we can gather information about something of interest. But what if I want to query against multiple values?

Let’s start with some sample data.

'use strict';
 declareUpdate();
 const sem = require("/MarkLogic/semantics.xqy");
 const myPred = sem.iri("myPredicate");
 sem.rdfInsert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
   .map(num => sem.triple(sem.iri(`target${num}`), myPred, num))
 )

After running this in Query Console, I have 10 triples. Now suppose I want to find the objects where the subject is one of target1, target2, or target3. I have a couple choices. With MarkLogic, I can go after one of these values with a simple bind (the second parameter in sem.sparql):

sem.sparql(
  `
    select ?obj
    where {
      ?target ?pred ?obj
    }
  `,
  { target: sem.iri("target1") }
)

To be complete in thinking about my options, I could brute-force it and just run multiple SPARQL queries, one for each of my targets. That’s pretty inefficient.

I could also use a FILTER.

sem.sparql(
  `
    select ?obj
    where {
      ?target ?pred ?obj
      FILTER (?target IN (<target1>, <target2>, <target3>))
    }
  `
)

This is effective, but I’ve learned it’s not very efficient (but better than multiple queries). Another option is to bind the values we’re looking for using sem.sparql's second parameter:

'use strict';
sem.sparql(
  `
    select ?obj
    where {
      ?target ?pred ?obj
    }`,
  {
    target: [sem.iri("target1"), sem.iri("target2"), sem.iri("target3")]
  }
)

I loaded up my database with 100,000 triples for a quick test (no other data, no other load). Both ran in a matter of milliseconds, but using the bind approach ran in about a quarter of the time that the FILTER approach took.

If you’re looking to run a SPARQL query and you want to use multiple values, binding an array is the preferred way. Interesting to note, however — you can’t do that in a SPARQL update! Stay tuned for the next post where I’ll cover that.

SPARQL update with multiple targets

In my last post, I talked about using the bindings parameter of MarkLogic’s sem.sparql function to look for multiple values in a SPARQL query. It turns out that approach doesn’t work for SPARQL Update.

I’ll use the same sample data as my previous post:

'use strict';
declareUpdate();
const sem = require("/MarkLogic/semantics.xqy");
const myPred = sem.iri("myPredicate");
sem.rdfInsert([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
  .map(num => sem.triple(sem.iri(`target${num}`), myPred, num))
)

Now suppose that I want to delete some of those values. I’ll go after the triples with subjects target1, target2, and target3. Let’s try the same approach that works for queries and bind these values as a parameter to the query:

'use strict';
const myPred = sem.iri("myPredicate");
const targets = [sem.iri("target1"), sem.iri("target2"), sem.iri("target3") ];
sem.sparqlUpdate(
  `
    delete {
      ?target <myPredicate> ?obj
    }
    where {
      ?target <myPredicate> ?obj
    }
  `,
  {
    target: targets
  }
)

When I run this, I get an error: XDMP-BADRDFVAL. Uh oh, a bad RDF value. I expect the where clause here works the same as a SPARQL query, but ?target and ?obj get bound and used in the delete clause. That clause doesn’t like the array. Alright, how can we do this instead?

I’ve found that using SPARQL’s VALUES clause works.

'use strict';
const myPred = sem.iri("myPredicate");
const targets = [sem.iri("target1"), sem.iri("target2"), sem.iri("target3") ];
sem.sparqlUpdate(
  `
    delete {
      ?target <myPredicate> ?obj
    }
    where {
      ?target <myPredicate> ?obj
      VALUES ?target { ${targets.map(iri => '<' + iri + '>').join(' ')} }
    }
  `
 )

Let’s take a closer look at that VALUES line. The SPARQL is wrapped in ` (backtick), so we can do string interpolation. I’m wrapping the individual values in '<' and '>' so that they’ll be seen as IRIs. What the interpreter will see is this:

delete {
  ?target <myPredicate> ?obj
}
where {
  ?target <myPredicate> ?obj
  VALUES ?target { <target1> <target2> <target3> }
}

This allows me to hit multiple targets with my delete in the same request.

There is one thing I don’t like about this approach: when we’re using the binding parameter, the actual query string that we’re passing in remains the same for different calls. MarkLogic can interpret the query, build a plan, and cache it in anticipation of the query being called again (like any parameterized query). With this approach, the query itself changes every time it gets called, so caching doesn’t help us. All the same, it works.

Found a better way? Let me know in the comments!

TDE Template – Unknown Table

MarkLogic’s Data Hub Central offers an easy way to create entities. As a bonus, it automatically creates a TDE Template for you, making your entity accessible through SQL or Optic API queries. However, we ran into an error with this process recently.

After creating an entity and processing some sample data, a member of the team tried to run a simple “select *” on the view associated with the entity. That failed with a “SQL-TABLENOTFOUND -- Unknown table” error. A little investigation revealed the problem. Let’s look at an example.

Like many developers, I have a rubber duck on my desk. It seems appropriate to use that duck to illustrate the problem.

I fired up a new Data Hub example and used Hub Central to create a RubberDuck entity. It has two properties, size and description, both strings. Looking in my data-hub-final-SCHEMAS database, I see my RubberDuck entity (/entities/RubberDuck.entity.json) and a TDE template (/tde/RubberDuck-1.0.0.tdex). I threw in some sample data:

DuckSize,DuckDescription
small,pirate
medium,dinosaur

With just a little more work, I had a couple RubberDuck entities in my final database. Great! Now I can run a nice simple query (in Query Console as the admin user):

select * from RubberDuck.RubberDuck

Hopes for a quick exploration are dashed:

SQL-TABLENOTFOUND: amped-qconsole:qconsole-sql($query, ()) -- Unknown table: Table 'RubberDuck.RubberDuck' not found

Explanation

This turns out to be a security configuration problem. But wait — I ran that query as my admin user, shouldn’t that work? As it turns out, no. There are some limits on admin users and this is one.

Digging deeper, here are the permissions for the template document (as shown by Query Console):

Role                           R  U  I  E  N
data-hub-entity-model-writer   x
data-hub-common                x
tde-admin                      x  x
tde-view                       x

(R = read, U = update, I = insert, E = execute, N = node-update)

If a TDE template document has permissions, then those permissions are extended to the triples and rows built by the template.

Resolution

To solve this, you need to query with a user that has a role with read permissions. For applications, this error is a sign that the user being configured is missing an important role. Odds are, you’re seeing this in development while working with your local admin user. How do I know that?

Looking at the users & roles that make up the Data Hub security model, we can see that data-hub-developer inherits the tde-admin role, while data-hub-operator inherits the tde-view role. The intent is that users who interact with Data Hub data will have at least one of these roles, which will allow it to read data generated by the TDE templates. Make sure users and roles for your application have these roles assigned appropriately.

If you’re working locally and connecting to Query Console using your admin user, you can give that user the data-hub-developer role to avoid this error.
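
If you go that route, here is a quick sketch of granting the role — run it in Query Console against the Security database, substituting your own user name for the placeholder below:

'use strict';
declareUpdate();
const sec = require('/MarkLogic/security.xqy');

// give the local development user read access to the DHF-generated views
sec.userAddRoles('my-local-admin', ['data-hub-developer']);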

Populating an array from numbered fields

As we move data from other sources (often relational databases) into MarkLogic JSON, we have the opportunity to change the way data are represented. I’ve seen a pattern a few times where the incoming source repeats a field, appending a number. I’ll illustrate with a simplified example:

name,pet_name_1,pet_name_2
David,Merry,Pippin
Alice,Spot,
Bob,Fluffy,George

When I see something like this, I represent the numbered fields as an Array. My entity definition would have two properties in this case: “Name” (a string) and “PetNames” (an array of strings).

MarkLogic’s Hub Central makes it easy to define an entity and set up a mapping to populate it. When defining the mapping, we can provide a simple XPath expression for each property. For the “Name” property, we can just give “name” as the expression. To populate the array, we have multiple properties we want to pull from. We can do this very simply:
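
For our sample data, the expression looks something like this:

pet_name_1, pet_name_2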

We list the property names that we want to put in the array, separating them with commas. After running the mapping step, we get this (showing just the instance portion of the entity):

{
  "PetOwner": {
    "name": "David",
    "petNames": [
      "Merry",
      "Pippin"
    ]
  }
}

Looks good! Instead of a sparse table with a fixed number of pet names, we have an array that can hold an arbitrary number of them. There is something that’s not quite right though. In our sample data, Alice has just one pet. After running the load step, which converts from CSV to JSON, here’s the instance part of the document in staging:

{ 
  "instance": {
    "name": "Alice", 
    "pet_name_1": "Spot", 
    "pet_name_2": ""
  }
}

Notice that although we don’t have a value for “pet_name_2”, the JSON property still exists with an empty string. This is a reasonable interpretation of CSV to JSON, but let’s see what we end up with for our final entity:

{
  "PetOwner": {
    "name": "Alice",
    "petNames": [
      "Spot",
      ""
    ]
  }
}

We have an empty string in the array, which isn’t ideal. We have a couple ways to address this. We could modify the load step to eliminate the empty property, or we could modify the mapping step to keep the empty string out of the final entity. I like the second idea — we’ll keep the raw data with minimal changes and focus on creating the final entity the way we want.

A simple fix for this problem can be added right in the mapping XPath expression:
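
For our sample data, that looks something like this:

(pet_name_1, pet_name_2)[. != ""]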

The [. != ""] in the XPath selects for values that are not empty strings, leading to a cleaner final entity:

{
  "PetOwner": {
    "name": "Alice",
    "petNames": [
      "Spot"
    ]
  }
}

SPARQL Update and locks

I just learned something the hard way, so I thought I’d share.

The tl;dr is that sem.sparqlUpdate runs in a separate transaction by default, which means you need to be careful about document locks. (If your response is “well, duh”, then you may not need the rest of this post. If you’ve ever had a sem.sparqlUpdate request time out when it should return quickly, read on.)

Quick refresher: all requests in MarkLogic run as either a query or an update. When a request runs as a query, it runs at a particular timestamp. Thanks to the magic of MVCC, this means that the request does not need to acquire read locks on the documents it gets data from.

An update, on the other hand, will grab read and write locks. I’ll borrow from MarkLogic’s documentation here:

Read-locks block for write-locks and write-locks block for both read- and write-locks. An update has to obtain a read-lock before reading a document and a write-lock before changing (adding, deleting, modifying) a document. Lock acquisition is ordered, first-come first-served, and locks are released automatically at the end of the update request.

MARKLOGIC CONCEPTS GUIDE; DATA MANAGEMENT CHAPTER

So when an update reads a document, it gets a write lock, which prevents any other update from changing the document during the first request. A read lock may be promoted to a write lock if a request first reads and then updates a document:

declareUpdate();
let doc = cts.doc('/test/dave.json'); // read lock on uri
let docObj = doc.toObject();
docObj.updatedBy = "me";
xdmp.nodeReplace(doc, docObj); // write lock on uri

So far, so good. Here’s what I ran into: I have a set of triples where the subject is a URL, the predicate is a “seen on” IRI, and the object is a timestamp. I want to delete all of the triples with this predicate that have anything other than the latest timestamp. I broke that into two pieces: 1) find that latest timestamp; 2) delete any triples that have a different timestamp. I’m doing this in a single JavaScript request.

declareUpdate();
let maxDTS = fn.head(sem.sparql(
  `
    select (MAX(?dts) as ?maxdts) 
    where { 
      GRAPH <mygraph> { 
        ?url <http://4VServices.com/seenOn> ?dts 
      }
    }
  `)).maxdts;

sem.sparqlUpdate(
  `
    WITH <http://4VServices.com/blog>
    DELETE { ?url <http://4VServices.com/blog/seenOn> ?dts . }
    WHERE {
      ?url <http://4VServices.com/blog/seenOn> ?dts .
      FILTER (?dts != ?recentDTS)
    }
  `,
  { "recentDTS": maxDTS }
)

First, if someone sees a way to write better SPARQL here and do it all in one request, let me know.

When I ran this code, I was rewarded with a spinner that sat until the request timed out. No good. Why would that be? The initial query will get read locks on the documents that hold the triples, but the update will surely promote those to write locks and do the deletes. Right?

Nope.

I found the key piece of information in the sem.sparqlUpdate documentation.

“isolation=ISOLATION_LEVEL”

ISOLATION_LEVEL can be different-transaction or same-statement. Default is different-transaction….

The sem.sparqlUpdate call runs (by default) in a different transaction. Since I was running my parent request as an update, it got read locks for the documents that hold the triples. The update attempt then tried to get write locks for the same documents, but couldn’t because the parent request hadn’t finished yet, so it hadn’t released those read locks.

Solution

The solution in my case was to remove the declareUpdate() from the parent request. The parent then ran as a query, which doesn’t take any locks, so the sem.sparqlUpdate was able to get the write locks it needed. I could also have had the update run with the isolation=same-statement option, which allows the read locks to be promoted to write locks.
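
For the second option, the isolation setting goes in sem.sparqlUpdate’s options parameter. A minimal sketch, reusing maxDTS from the example above:

declareUpdate();
sem.sparqlUpdate(
  `
    WITH <http://4VServices.com/blog>
    DELETE { ?url <http://4VServices.com/blog/seenOn> ?dts . }
    WHERE {
      ?url <http://4VServices.com/blog/seenOn> ?dts .
      FILTER (?dts != ?recentDTS)
    }
  `,
  { "recentDTS": maxDTS },
  ["isolation=same-statement"]   // share the calling update's transaction and its locks
);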

Making updates safe for parallel processing

I’ve seen a couple cases recently where we were making modifications to a document in MarkLogic and the process was restricted to a single thread due to concerns about overwriting data. Let’s look at an example.

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  let docObj = cts.doc(uri).toObject();
  docObj.meta.foo = updateFoo(docObj.meta.foo, someArgs);

  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

Pretty straightforward. But what if there are multiple processes that end up calling this function in parallel? It’s not hard to imagine a case where process 1 and process 2 both read the document around the same time and get to work on the processing. Suppose process 1 gets to the xdmp.documentInsert first. It asks MarkLogic for a write lock on the URI, and gets it. Meanwhile, process 2 reaches the xdmp.documentInsert and requests a write lock for that same URI. MarkLogic sees that process 1 already has a lock, so process 2 blocks while waiting for the lock.

Process 1 does its document insert. When this request finishes, the change is committed, the transaction is complete, and the lock is released. Process 2 now gets the lock and proceeds to make its own update to the document.

The problem is that process 2 doesn’t know about process 1’s change. Process 2 read the document early, before that change was committed, and prepared its own update based on that data. When process 2 commits its changes, it will overwrite process 1’s changes. (This is known as the Lost Update problem.)

The solution is to ensure that the document won’t change between the time a process reads it and the time the process updates it. Let’s revise our function:

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  xdmp.lockForUpdate(uri);
  let docObj = cts.doc(uri).toObject();
  docObj.meta.foo = updateFoo(docObj.meta.foo, someArgs);

  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

The xdmp.lockForUpdate function tells MarkLogic to get a write lock right away, blocking if another process already has the lock. The lock will automatically be released at the end of the transaction, just as if the lock were obtained implicitly by calling xdmp.documentInsert.

The placement of xdmp.lockForUpdate is important. Note that it comes before the call to cts.doc, where the process reads the document.

Let’s revisit process 1 and process 2. Each comes into the function and sets the uri variable. They get to the lock request. Suppose process 1 gets the lock first. Process 2 will then block and wait for the lock.

Process 1 goes on to calculate its update and insert the revised document into the database. It finishes, commits the transaction, and releases the lock (all automatically).

Process 2 is now able to get the lock on uri. Only now does it read the document. Because this is an update request, it will see the latest version of the document — including process 1’s changes. Process 2 calculates its update based on the latest version of the document and inserts the new version.

This change ensures that the updates happen safely. We can also look at performance. The xdmp.lockForUpdate must happen before the document read, but any processing that is not dependent on that data should happen first. For example, for one client we need to read some data from SharePoint and update the document with it. The interaction with SharePoint includes a network call and other processing — it takes a little time, but does not depend on the data in the target document. We do that first, then request the lock.

function makeUpdates(someArgs) {
  let uri = '/some/content.json';
  let someData = getUpdates(someArgs);
  xdmp.lockForUpdate(uri);
  let docObj = cts.doc(uri).toObject();
  docObj.content.data.push(someData);

  xdmp.documentInsert(
    uri,
    docObj,
    {
      "collections": xdmp.documentGetCollections(uri),
      "permissions": xdmp.documentGetPermissions(uri)
    }
  );
}

This allows process 1 and process 2 to perform the SharePoint work in parallel before getting to the lock request.

A simple change can prevent lost updates, ensuring data safety and (in some cases) allowing greater performance by making parallel processing safe.
