Processing upstream deletes

We can load data into a data hub from a variety of upstream data sources. But what if documents get deleted upstream and we want to delete them from the hub, too? In the ideal case we're using a Change Data Capture (CDC) system that notifies us of those deletions, but sometimes we don't have that. How can we detect what's been deleted if we don't get notified?

MarkLogic’s ability to store RDF triples comes in handy here. When we load data, we can record a triple for each record we see, noting the timestamp of that load. (We’ll record this triple for every record, even though we can skip processing for documents that haven’t been modified since the previous load.) The next time we load, we’ll record the triples again. This has much lower impact than updating each document’s metadata during the load: that approach would touch every document, whereas the timestamps for many records can be recorded together in a single document using managed triples.
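The bookkeeping can be sketched in plain Python. The `ex:` IRIs and the `ex:seenAt` predicate here are hypothetical stand-ins for whatever vocabulary you choose; in MarkLogic the triples would be written as managed triples rather than appended to a list.

```python
# Hypothetical sketch: record one (record, seenAt, timestamp) triple per
# upstream record observed in a load. Predicate and IRIs are illustrative.
SEEN_AT = "ex:seenAt"

def record_load(triples, record_ids, load_ts):
    """Append a 'seen at this load' triple for every record in the batch."""
    for rid in record_ids:
        triples.append((rid, SEEN_AT, load_ts))

triples = []
record_load(triples, ["ex:rec1", "ex:rec2"], "2024-01-01T00:00:00Z")
# In the next load, ex:rec2 has disappeared upstream:
record_load(triples, ["ex:rec1"], "2024-02-01T00:00:00Z")
```

The point is that each load only appends a handful of triples, rather than updating every document it touches.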

We now have two sets of triples that reflect the upstream content at two timestamps. At this point, we can run a SPARQL query to find the records that were present at the earlier timestamp but aren’t there at the new one. Any records that match are no longer around upstream and can be deleted from the data hub.
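The logic of that query (in SPARQL it would typically be expressed with `FILTER NOT EXISTS`) amounts to a set difference, sketched here in plain Python with hypothetical `ex:` names:

```python
# Illustrative data: ex:rec2 was seen in the first load but not the second.
SEEN_AT = "ex:seenAt"
triples = [
    ("ex:rec1", SEEN_AT, "2024-01-01T00:00:00Z"),
    ("ex:rec2", SEEN_AT, "2024-01-01T00:00:00Z"),
    ("ex:rec1", SEEN_AT, "2024-02-01T00:00:00Z"),
]

def deleted_since(triples, prev_ts, curr_ts):
    """Records seen at prev_ts but absent at curr_ts: deletion candidates."""
    prev = {s for (s, _, ts) in triples if ts == prev_ts}
    curr = {s for (s, _, ts) in triples if ts == curr_ts}
    return prev - curr

to_delete = deleted_since(triples, "2024-01-01T00:00:00Z",
                          "2024-02-01T00:00:00Z")
```

Here `to_delete` contains only `ex:rec2`, the record that vanished between the two loads.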

We’ll want to clean up these triples over time. When we’re ready to start a new data load, we can keep the latest triples and delete any before them. We run the data load, leaving us with triples from the current load and from the immediate prior batch, simplifying our SPARQL query and avoiding excess data accumulation.
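Continuing the sketch above, the pre-load cleanup step can be modeled as keeping only the triples from the most recent load (ISO-8601 timestamps compare correctly as strings):

```python
# Illustrative pruning step, run before starting a new load: keep only the
# triples recorded at the latest load timestamp, discarding older batches.
def prune_old_loads(triples):
    latest = max(ts for (_, _, ts) in triples)
    return [t for t in triples if t[2] == latest]

triples = [
    ("ex:rec1", "ex:seenAt", "2024-01-01T00:00:00Z"),
    ("ex:rec2", "ex:seenAt", "2024-01-01T00:00:00Z"),
    ("ex:rec1", "ex:seenAt", "2024-02-01T00:00:00Z"),
]
triples = prune_old_loads(triples)
```

After pruning, only the February triples remain, so the next load's diff compares exactly two batches.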

The best option is for our upstream sources to inform us when data gets deleted, but when that option’s not available, MarkLogic’s ability to use triples for metadata tracking comes in very handy.
