• Working with JSON in XQuery

    MarkLogic supports XQuery and JavaScript as native languages. XQuery is a very natural way to work with XML, just as JavaScript is a very natural way to work with JSON. However, sometimes it’s useful to cross over and work with JSON using XQuery or vice versa. This post has some tips on using XQuery to work with JSON in memory.

    JSON Nodes

    The first thing to know is that JSON that we get from the database comes in an immutable form. Specifically, that’s a JSON Node. We can directly construct JSON Nodes using constructors, like this:

    object-node {
      "a": 1,
      "b": 2
    }

    If all you want to do is retrieve a node from the database, or a part of one, working with nodes is great. One of the cool things is that we can apply XPath to nodes:

    let $obj := object-node {
      "a": 1,
      "b": 2
    }
    return $obj/a

    This makes it easy to select a portion of a JSON document; we might do that when we pull original JSON content out of an envelope document, for instance. We can also use this to pull data that is deep in a node structure, using a path like “/envelope/instance/TopProperty/lowerProperty”.

    Changing

    If we want to edit a JSON structure, we need a different approach. There are two ways we can go about it: recursive descent over JSON nodes, or by converting to objects. I’ll give recursive descent its own post later; for now, I’ll talk about objects.

    A json:object is a mutable key-value structure (this is the same as a map:map, except that the ordering of keys in a map:map is undefined and its default serialization is JSON rather than XML). It can be hierarchical, because the values can themselves be objects. This should sound familiar: I could give that same description for a JavaScript object. If fact, when we want to represent JSON nodes in a mutable way, the way to do it is to convert them to the json:object structure. Then we can use map:put() or map:with to change an existing value or add a new one, and map:delete() to delete an existing value. When we’re done, we can convert back to JSON to update the database or send on to a client.

    let $obj := object-node {
      "a": 1,
      "b": 2
    }
    let $map := xdmp:from-json($obj)
    let $_ := map:put($map, "c", 3)
    return xdmp:to-json($map)

    Note that while objects are great for manipulating data, we can’t apply an XPath like we can with a JSON node.

    What Am I Looking At?

    Query Console cheerfully presents JSON data such that it looks like JSON, regardless of whether you’re looking at JSON nodes or json:objects. This can make it hard to know what exactly you’re looking at. Knowing the format of your JSON data tells you how to interact with it. You can identify an item of each representation using instance of tests.

    declare function local:report($item)
    {
      "object node: " || $item instance of object-node() ||
      "; json:object: " || $item instance of element(json:object) ||
      "; map:entry: " || $item instance of map:map
    };
    
    let $node := object-node { "foo": "bar" }
    let $obj := json:object() => map:with("foo", "bar") 
    let $json-obj := {$obj}/node()
    let $map := map:new(map:entry("foo", "bar"))
    let $to-json := xdmp:to-json($obj)/node() (: xdmp:to-json returns a document node :)
    let $to-json-map := xdmp:to-json($map)/node()
    let $from-json := xdmp:from-json($node)
    return (
      "node:        " || local:report($node),
      "json-obj:    " || local:report($json-obj),
      "map:         " || local:report($map),
      "obj:         " || local:report($obj),
      "to-json:     " || local:report($to-json),
      "to-json-map: " || local:report($to-json-map),
      "from-json:   " || local:report($from-json),
      "fn:data:     " || local:report(fn:data($node))
    )

    Results:

    node:        object node: true; json:object: false; map:entry: false
    json-obj:    object node: false; json:object: true; map:entry: false
    map:         object node: false; json:object: false; map:entry: true
    obj:         object node: false; json:object: false; map:entry: true
    to-json:     object node: true; json:object: false; map:entry: false
    to-json-map: object node: true; json:object: false; map:entry: false
    from-json:   object node: false; json:object: false; map:entry: true
    fn:data:     object node: false; json:object: false; map:entry: true

    Constructing

    There are two ways of building JSON Nodes. We can use the JSON node constructors directly, or we can build up maps and pass the result to xdmp:to-json(). Here’s the constructor version:

    object-node {
      "a": 1,
      "b": 2,
      "c": array-node {
        object-node { "fname": "Harrison", "lname": "Ford" },
        object-node { "fname": "Mark", "lname": "Hamill" },
        object-node { "fname": "Carrie", "lname": "Fisher" },
        object-node { "fname": "Natalie", "lname": "Portman" }
      }
    }

    And here’s the method using maps:

    xdmp:to-json(
      json:object() 
      => map:with("a", 1)
      => map:with("b", 2)
      => map:with("c", 
          json:array() =>
            json:array-with((
              json:object() => map:with("fname", "Harrison") => map:with("lname", "Ford"),
              json:object() => map:with("fname", "Mark") => map:with("lname", "Hamill"),
              json:object() => map:with("fname", "Carrie") => map:with("lname", "Fisher"),
              json:object() => map:with("fname", "Natalie") => map:with("lname", "Portman")
            ))
          )
    )

    Personally, I find the direct constructor approach more natural. An important consideration is what you’re going to do with the structures once you have them. If you plan to modify them, go with the objects, do whatever modification you need, and then pass the finished product to xdmp:to-json — don’t convert back and forth. Likewise, if you’re constructing JSON to return to a client and won’t be persisting it in the database, stick with objects; you’ll find it runs faster.

    Wrap Up

    While XQuery/XML and JavaScript/JSON are good pairings, you can easily work with either structure from either language. Your can figure out from your data which structure is a better fit for your data. Your language choice can be driven by the needs and knowledge of the development team. 

  • Searching for one result
    MarkLogic is both a database and a search engine. Sometimes, you know you only want one result from your search. What’s a good way to do that? Using MarkLogic’s Server-side JavaScript, here’s one way to do it:
    const query = ...;
    // foreshadowing -- there's a better way!
    cts.search(query).toArray().slice(0, 1)
    We get the results of our search, change the results from a Sequence to an Array, and take the first one. Simple — but not the way we should do this. How do we know? We ask MarkLogic, of course. Instead of the above, let’s evaluate the following in Query Console (note: if your query will hit a large number of documents, maybe don’t run this in Query Console).
    const query = ...;
    cts.search(query).toArray().slice(0, 1);
    xdmp.queryMeters()
    Looking at the queryMeters output, we see some interesting things, including cache hits and misses. Toward the end of the response, there’s a JSON property called “documents”, which is an array of objects identifying the URIs that the query loaded. As written, that array will include all results from the search, not just the one that we want to get. We want to tell MarkLogic to just return the first one. For some MarkLogic functions, you can specify an option “limit=1”. The cts.search() function doesn’t have that option. cts.search returns a Sequence, a data structure that is similar to an array in some ways, but not quite the same. In XQuery, there are a few ways we could wrap a call to cts:search() to tell MarkLogic that we only wanted a certain number of results:
    cts:search(fn:doc(), $query)[1 to 10]
    fn:subsequence(cts:search(fn:doc(), $query), 1, 10)
    MarkLogic sees what parts of the Sequence we want and optimizes to only return those. Calling xdmp:query-meters() confirms that. Okay, that’s cool and all, but we’re working with JavaScript. Can we do the same thing? Yes, yes we can.
    fn.subsequence(cts.search(query), 1, 10)
    xdmp.queryMeters() tells us that we’re only bringing back the ten documents (remember that a Sequence starts at index 1, while an Array starts at index 0). Knowing that Sequence optimizations influence our searches, we’ve got a real simple way to get just that first result:
    const query = ...;
    fn.head(cts.search(query))
    Again, we turn to xdmp.queryMeters() and find just one item in the documents array. Just what we need!
  • MarkLogic Universal and Range Indexes

    I recently saw a dramatic improvement in a MarkLogic query with a one-line code change. That’s always a good thing to find, so I thought I’d share. First, an illustration of the impact:

    In NiFi, FlowFiles are pieces of data that a processor needs to work on. In this case, NiFi is calling a MarkLogic query for each FlowFile. The metric shown in the graph is FlowFiles processed by NiFi over 5 minute intervals. There are peaks around 16,000, but mostly lower, which wasn’t the throughput I needed. A one-line change and throughput went to about 80,000 over 5 minutes. The pace was still accelerating when it ran out of data to process, so I’m not sure how how it would have gone. So what did I change?

    MarkLogic provides a number of indexes to improve query performance. Among them are the Universal Index (with many options) and Range Indexes.
    Two types of queries look like they do very similar things, but they rely on these different indexes: cts:element-value-query() and cts:element-range-query() (with the “=” operator).

    Here’s the original version of the function:

    (: Given a URI, check whether there is a corresponding Item. 
     : If there is, delete it. 
     :)
    declare function lib:delete-replaced-item($uri as xs:string)
    {
      let $item-uri :=
        cts:uris(
          (),
          ("limit=1", "score-zero"),
          cts:and-query((
              cts:collection-query("item"),
              cts:element-value-query(
                xs:QName("es:id"), 
                lib:id-from-uri($uri)
              )
          ))
        )
      return
        if (fn:exists($item-uri)) then
          xdmp:spawn-function(
            function() { xdmp:document-delete($item-uri) }
          )
        else ()
    };

    Note the cts:element-value-query. This query uses the Universal Index, which captures every term, along with the XML or JSON structure. This makes for rapid lookups of which documents have a particular term.

    The cts:element-value-query looks for documents that have the provided input as the entire contents of the target element. In certain cases, this works great. However, in the example above, the return value from lib:id-from-uri($uri) looks something like “a~b~c”. The problem is the “~” characters. As MarkLogic tokenizes the content, it sees “a~b~c” as the sequence of tokens (“a”, “b”, “c”). We can see this using the xdmp:plan function on cts:element-value-query(xs:QName("id"), "a~b~c"). The results include a final-plan element:

    
      
        
          12776805441528511383
          element(id,value("a","b","c"))
        
      
    

    This query needs to look for not just one value, but three, in the correct order. Let’s compare that with the plan using cts:element-range-query(xs:QName("id"), "=", "a~b~c"):

    
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      
        
          208346549518586783
          element(id)
          a~b~c
          a~b~c
        
      
    

    Here, MarkLogic is working with an upper and lower bound, which are the same value. To find results, MarkLogic will use the “id” range index, do a seek on the list of values to find the appropriate entry, and any matching URIs are found.

    MarkLogic provides many different types of indexes. Knowing the right one to use for your query can make a huge difference in the performance.