We’re used to doing reviews for source code. NiFi flows look different, but when they’re part of your application, it’s just as useful to review them before committing. Here’s what I look for when I’m reviewing a flow:
Named processors. Wherever it would make things clearer, change the name of the processor to reflect what it’s doing. Instead of “RouteOnAttribute”, I’d rather see a name that shows me the basis for routing. I often phrase this as a question, like “Needs conversion?” or “File too big?”. If I’m using an ExecuteScript processor, name it for what it’s doing (“Create Preview”).
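For example, a RouteOnAttribute processor named “File too big?” might route on a property like this (the property name and 10 MB threshold here are hypothetical; fileSize is a standard flow file attribute and gt is a NiFi Expression Language function):

```
Routing Strategy: Route to Property name
too.big:          ${fileSize:gt(10485760)}
```

Flow files matching the expression leave through the too.big relationship; everything else goes to unmatched.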
Add comments when helpful. NiFi processors have a tab where you can add a comment. Once added, you get a little black triangle to hover over, making it easy to see the comment. Use this to explain things that aren’t clear: why it’s this way, not just what it’s doing.
Use labels for general documentation. Labels are background boxes with text. One use for them is to provide documentation that applies to multiple processors. This could be explaining how to use a flow or describing how a section of the flow works.
Spacing. Processors should be far enough apart that you can see the arrows. If you can’t see the arrowheads, it takes longer to understand the direction of the flow.
Ordering. I like to arrange processors so that the flow starts in the upper left, proceeds to the right, then down, snaking back and forth across the canvas. For flows with few branches this works well, showing a lot on the screen while staying zoomed in close enough to see clearly. For a flow with more branches, I’ll go with a straight vertical happy-path flow, with branches going out to the side and then proceeding downward. A flow under review doesn’t have to follow one of those patterns exactly, but it should be clear where it starts, and it should generally move in a consistent direction.
Parameters or variables. If something will be different across environments, it should be specified in a parameter context or variable. Parameter contexts are generally preferred, particularly for sensitive information like passwords.
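As a sketch, a processor or controller service property can reference parameters with the #{...} syntax (the parameter names here are hypothetical):

```
Database Connection URL: jdbc:postgresql://#{db.host}:#{db.port}/#{db.name}
Password:                #{db.password}
```

Marking db.password as a sensitive parameter keeps its value out of exported flow definitions.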
Color-coding external dependencies. By using the “change color” feature, I can see at a glance where the flow depends on external systems. Whether it’s a database or remote API, things of the same type get the same color.
Process groups. Process groups are the subroutines of NiFi. During code review, if you saw a 500-line function, you’d probably suggest breaking it into several smaller functions, each with a cohesive purpose. Likewise, use process groups for cohesive sections of a flow. Giving the group a good name and description makes the parent flow easier to understand.
Starting/stopping. When you’re ready to start a flow, the easiest way is to right-click on the canvas and click Start, rather than starting processors individually. If you find yourself thinking “I want to start the processors on the left side of this group, but not the ones on the right”, that’s a code smell (flow smell?). Consider setting up process groups so that you can start and stop each one as a unit.
Error handling. For processors with a non-trivial chance of failure, the flow should handle those cases. The trickiest judgment call is which relationships you can safely auto-terminate. The other big question is what to do with failures when they happen. Sometimes a retry makes sense (a network outage may cause an HTTP request to fail); sometimes you need to send an email notification.
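One common shape for the retry case uses the standard RetryFlowFile processor; a sketch (the processor choices and retry count are illustrative, not prescriptive):

```
InvokeHTTP ── failure ──▶ RetryFlowFile (Maximum Retries: 3)
                            ├── retry ────────────▶ back to InvokeHTTP
                            └── retries_exceeded ──▶ PutEmail (alert the team)
```

RetryFlowFile counts attempts in a flow file attribute, so the loop terminates instead of retrying forever.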
Concurrent tasks. NiFi’s default number of concurrent tasks is 1. I usually make changes to this only after I’ve deployed a flow to an environment that has a good amount of data — I’d rather respond to a real case than guess about where the bottlenecks are. In a review, if I see an unusually high number of concurrent tasks, I’ll ask some questions about that. (In particular, if the concurrent task count is close to or higher than the maximum timer driven thread count, that’s a flag for a closer look.)
Timing problems. Will something go wrong if flow files are being processed at the end of a flow while others are still being processed at the beginning? If the flow has timing dependencies, it probably needs the Wait/Notify pattern.
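For the common split-then-reassemble case, Wait and Notify can key off the standard fragment attributes that Split processors write (a sketch; the Distributed Map Cache configuration is omitted):

```
Wait:
  Release Signal Identifier: ${fragment.identifier}
  Target Signal Count:       ${fragment.count}
Notify:
  Release Signal Identifier: ${fragment.identifier}
```

Each split notifies as it finishes, and the waiting flow file is released only once the signal count reaches the number of fragments.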
Queue sizes. Every active flow file consumes some resources. Depending on the resources available, I sometimes limit the size of the queues to prevent too many flow files from being in a flow at a time. (This depends on how the flow files come into the flow, of course.) The default queue size is 10,000; I often lower it to 1,000 – 2,500, especially after a SplitJson or similar processor. The balance is to ensure enough flow files are available that processors can do their work without consuming too much memory.
Attributes &amp; content. Attributes are intended for small amounts of data compared with content: attributes are held in memory, while content lives on disk. During a review, if I see a large amount of data being put into attributes, I’ll start asking questions.
Use out-of-the-box processors. The ExecuteScript processor is a great way to handle custom functionality, but wherever possible, I prefer to see out-of-the-box processors. Using ExecuteScript to do what we can do with a standard processor means longer development times and more maintenance.
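For example, a script that builds a preview filename can usually be replaced by an UpdateAttribute processor with an Expression Language property (the attribute name and suffix here are hypothetical):

```
preview.filename: ${filename:substringBeforeLast('.')}_preview.json
```

Anyone who knows NiFi can read that at a glance, with no scripting language knowledge required.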
Since most of my company’s work with NiFi involves sending data to MarkLogic, I have some MarkLogic-specific items I check for as well.
Duplicate strategy. A PutMarkLogic processor gives you some options for handling duplicate URIs within a batch. You can ignore them if you know duplicates will never happen (for instance, if the URI is based on a UUID). If you don’t know that for sure, select one of the other options.
Minimize scripts. The ExecuteScriptMarkLogic processor is really helpful, but it can also become a crutch. If you find yourself writing substantial amounts of code in one of these, consider moving the code to a library. That moves version control from NiFi Registry to git (a much more mature tool) and makes the code easier to test as well.
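In MarkLogic server-side JavaScript, that can mean the ExecuteScriptMarkLogic body shrinks to a module call, with the real logic deployed to the modules database and versioned in git (the module path and function name here are hypothetical):

```
// The logic lives in /lib/preview.sjs, deployed and unit-tested outside NiFi
const preview = require('/lib/preview.sjs');
preview.create(uri);
```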