Alteryx: how to control the order of execution
For the past 6 months I’ve been using Alteryx at a global investment bank, working alongside the bank’s analytics team. This is their first implementation of Alteryx. We have been assisting them in setting up the framework for their analytics, and then working on the initial projects. I’ve found the speed of development incredibly fast, particularly with respect to the tools automatically assuming the most useful configuration.
That said, there is one piece of functionality that I often need and that is not immediately available: controlling the order of execution. Alteryx modules are data-driven rather than code-driven, so tools generally execute in the order that they receive rows of data at their inputs, and this usually produces the correct result (1). However, this is not always the case, and a specific problem I regularly come across is ensuring that a process only inputs data after it has finished outputting related data. Based on my experience, I will outline four methods for controlling the order of execution, along with an assessment of the effectiveness of each. These are:
1) Using various standard tools to purposefully control the flow of data. (2)
2) Making use of the “Block Until Done” tool. (1)
3) Adding an “After Run, Run Command...” event. (3)
4) Using the “ModuleRunner” macro, developed by Alteryx engineer Adam Riley. (4)
Each of these methods has its own advantages and disadvantages, but the ModuleRunner is particularly worth highlighting: although it is not part of the main Alteryx release, it provides a significant gain in functionality.
1) Controlling the flow of data
Creative use of the standard tools can solve several problems in this area, and I use this approach for archiving files. A solution of this sort typically involves a Join or Append Fields tool and a Summarize or Count tool. In the example below I am reading from two separate sheets in an Excel workbook, and I only want to move the file into the archive folder once I have finished reading from both sheets.
Example use case:
Several input tools must complete before archiving a file.
This method relies on the tool you wish to execute first having an output connection, and there are several situations where this is not the case. For example, if you need to write data out from Alteryx and then read it in again, this method gives you no way of ensuring that the output tool has finished executing before you try to read the data back. That can be achieved by...
2) Using the “Block Until Done” tool
This tool can be very powerful, but the most useful way to use it for controlling execution may look confusing at first sight: it passes records down one output stream at a time, moving on to the next output only after all records have been passed down the preceding one. Therefore, to output data and then read it back in, the output tool must be attached to the first output and a Dynamic Input tool to the second.
Whilst it may appear impossible to ensure that a tool with no input executes second, almost any tool can be converted to have an input by wrapping it in a batch macro. Batch macros are incredibly useful and I will write a post about their use very soon, so keep checking in.
Example use cases:
- Output to a table with an automatically generated key, then read in the data, including the key
- Output a configuration file, to be used by a subsequent Run Command tool
- Output to a "management" table, to show that a process has started, then continue the process
Limitations:
- When other data streams are joined to the stream that should execute first, they can prevent that part of the workflow from completing first. I suggest that, after a Block Until Done tool, the branch you wish to execute first should contain only a single data stream
- There is no way to change the behaviour of the module on success or failure. The only control is the “Cancel Running Workflow on Error” checkbox on the properties window’s “Runtime” tab, which does not help when a different workflow should be run on error. These problems can be solved by...
3) Adding a “Run Command…” event
In the properties window, on the “Events” tab, it is possible to configure the module to run another module after the first module completes (or before it runs, or only on success or failure). This provides a guaranteed order of execution and can handle both success and failure of the first module.
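To make this concrete, below is a sketch of the kind of batch script such an event could call. The paths are hypothetical, and it assumes that AlteryxEngineCmd.exe accepts a module path and returns a non-zero exit code when the run errors; check the behaviour on your own installation before relying on it:

```shell
REM Hypothetical script for an "After Run, Run Command..." event.
REM Assumes AlteryxEngineCmd.exe takes a module path and exits non-zero on error.
AlteryxEngineCmd.exe "C:\Workflows\MainProcess.yxmd"
IF ERRORLEVEL 1 (
    REM The main module failed - run an error-handling module instead
    AlteryxEngineCmd.exe "C:\Workflows\OnFailure.yxmd"
) ELSE (
    AlteryxEngineCmd.exe "C:\Workflows\OnSuccess.yxmd"
)
```

Note that this only shifts the hardcoding problem into a script, which is one reason the limitations below matter in practice.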
Example use cases:
Run different modules on success or failure of the first module
Limitations:
- A major strength of Alteryx is the clarity of the visual workflow (when used with good practice in mind!). This method takes away from that and leaves no visual oversight of the process. Many of our processes involve at least five modules, and I would not enjoy checking these settings for each module before I could understand the overall purpose of the process
- These commands are hardcoded, so deploying modules to different environments could become a headache
- This functionality is only available with the server licence for Alteryx; without it, AlteryxEngineCmd.exe cannot be run from the command line (5)
All of these problems are solved by...
4) Using ModuleRunner by Adam Riley
Adam Riley publishes a set of unofficial macros, and of these, the ModuleRunner and ConditionalModuleRunner provide the functionality I was looking for. They are simple to use: provide the path of the module to run and ModuleRunner will run it, sending a row to the S output if it succeeds and to the F output if it fails. The ConditionalModuleRunner also has an input, and will only execute if it receives a row of data. You can therefore chain many modules to execute in sequence, dependent on success and failure, and the resulting "control" module provides an excellent overview of your process.
I mentioned that using the ModuleRunner means it is not necessary to hardcode the location of the modules; achieving that requires creating a batch macro (they are very useful!). As I mentioned before, please do check back for a future post on batch macros.
Another benefit of the ModuleRunner is that the output datastreams contain the execution log of the underlying module. Adam also includes a “Log Parser” macro to make this output more straightforward to read.
Limitations:
- It is not possible to pass data between the modules within Alteryx; any data required by a module must be read from an external source
- This is not an official Alteryx tool, and it does rely on an executable written by Adam
I would recommend using methods 1) and 2) where possible, and 4) in other cases. I have personally found the ModuleRunner very useful: my colleague Sandeep and I have built further macros around it that interact with our database to record the progress of all our processes, read the locations of modules from a table, and record the logs of each process.