
Multi-Server Mode (Horizontal Scalability)

Multi-server mode description

If you run several Workflow Engine instances connected to the same database, use the multi-server mode. This mode is only available with the Ultimate license. The Workflow Engine instances know nothing about each other; they only share a common database (or database schema). All instances are equal and none of them is special, so a single working instance is enough to keep the cluster healthy. Workflow Engine does not provide load balancing tools; the balancer must be external, so choose one that you are able to configure.

Cluster

Configuring Workflow Engine

Initializing WorkflowRuntime in the multi-server mode differs slightly from the single-server mode. The code below shows the differences.

var workflowRuntime = new WorkflowRuntime("Unique Runtime Identifier")
...
.AsMultiServer();

First of all, the runtime identifier must be specified in the multi-server mode. It can be any string, as shown in the code above, or you can pass an object that implements the IRuntimeIdProvider interface to the constructor.

public interface IRuntimeIdProvider
{
    string GetId();
}

The GetId() method should return the runtime identifier. It is highly recommended that the runtime identifier remain constant; if it changes, nothing breaks, but the system works less efficiently. The Workflow Engine package includes DefaultRuntimeIdProvider: pass its constructor the path to a file for the runtime identifier, and the provider generates a unique identifier and saves it in that file for later use.

The multi-server mode is enabled by calling the .AsMultiServer() method; called without arguments, it uses the default settings listed below. You can also pass a multi-server settings provider, that is, an object implementing IMultiServerSettingsSource.

public interface IServerSettingsSource<T> where T : new()
{
    T GetSettings();
    void SaveSettings(T settings);
}

public interface IMultiServerSettingsSource : IServerSettingsSource<MultiServerSettings> { }

GetSettings returns the settings, and SaveSettings saves them. It is critical that all Workflow Runtime instances working with the same database use the same settings. The Workflow Engine package includes DefaultMultiServerSettingsSource, which stores the settings in the global parameters table. Note that you must call SaveSettings yourself when you want to change the settings applied to each server in the cluster.

Thus, you can proceed to configure Workflow Runtime as follows:

var workflowRuntime = new WorkflowRuntime(new DefaultRuntimeIdProvider("path to file with RuntimeId"));
...
workflowRuntime.AsMultiServer(new DefaultMultiServerSettingsSource(workflowRuntime));
caution

You must call workflowRuntime.Start() to use the Workflow Runtime API in the multi-server mode; this call starts the timers and enables the runtime API.

When shutting down the server, it is highly recommended to call workflowRuntime.Shutdown() or workflowRuntime.ShutdownAsync(), which correctly disables the runtime API and stops its timers. To start the server again, call workflowRuntime.Start() once more. You may skip the shutdown call, but the next start will then trigger an unnecessary recovery procedure, as described below.

Multi-server Settings

The settings of the multi-server mode can be changed using DefaultMultiServerSettingsSource in Workflow Engine. The following settings are available:

  • int TimerInterval = 1000 - the system timer interval to handle process timers, in milliseconds.
  • int TimerMaxSequentialFailCount = 5 - if an unhandled exception occurs during the system timer execution, and repeats consecutively the number of times specified by this setting, the timer gets disabled.
  • int ExecuteTimersBatchSize = 100 - the system timer handles process timers in batches; it is the size of such a batch.
  • int ServiceTimerInterval = 60000 - the system service timer interval, in milliseconds. This timer starts recovery processes after a failure. See the details below.
  • int ServiceTimerMaxSequentialFailCount = 5 - if an unhandled exception occurs during the system service timer execution, and repeats consecutively the number of times specified by this setting, the service timer gets disabled.
  • ProtectionIntervalInPercents { get; set; } = 0.2 - the protection interval, which keeps the system timers of different servers from interfering with each other. Despite the name, the value is a fraction of TimerInterval (0.2 means 20%). See the timers diagram below.
  • AliveSignalInterval = 1000 - the interval to report that a server is still running, in milliseconds.
  • int NumberOfSkippedIntervalsToSupposeDeath { get; set; } = 60 - the number of consecutive missed alive signals after which a server is declared Terminated and its data becomes subject to the recovery procedure.
  • int RestorePauseInterval { get; set; } = 1000 - if a server starts and finds out that another server has begun to restore its data, it waits for the recovery to finish, pausing for this interval (in milliseconds) between checks.
  • int RestoreWaitAttemptCount { get; set; } = 60 - the maximum number of attempts to wait for that recovery. If the actual number exceeds the maximum, the server shuts down.
  • int StartStatusCheckPauseInterval { get; set; } = 1000 - if a server starts and finds out that another server has begun to change its state, it waits for the new state, pausing for this interval (in milliseconds) between checks.
  • int StartStatusCheckAttemptCount { get; set; } = 10 - the maximum number of attempts to wait for that new state.
  • int ObtainTimerValueInterval { get; set; } = 1000 - servers are integrated into the timers diagram to process the timers in turn. If a server cannot get the time value for its system timer to start by the schedule, then it will wait during the indicated interval in milliseconds.
  • int ObtainTimerValueAttemptCount { get; set; } = 60 - the number of attempts to get the next time value for the system timer to start. After exceeding the maximum number of attempts, the server will shut down.
  • int MaxDegreeOfParallelismMultiplier { get; set; } = 1 - each server tries to process timers and recovery processes in parallel; the degree of parallelism is defined as the number of available processor cores multiplied by this value.
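Several of these settings work in pairs, an interval and a count, so their product gives a rough time budget. The sketch below is plain arithmetic on the default values listed above and does not call the Workflow Engine API; the local names are illustrative only, and the products are approximate upper bounds, assuming each attempt waits exactly one interval.

```csharp
using System;

// Default values copied from the settings list above (intervals in milliseconds).
const int AliveSignalInterval = 1000;
const int NumberOfSkippedIntervalsToSupposeDeath = 60;
const int RestorePauseInterval = 1000;
const int RestoreWaitAttemptCount = 60;
const int StartStatusCheckPauseInterval = 1000;
const int StartStatusCheckAttemptCount = 10;
const int MaxDegreeOfParallelismMultiplier = 1;

// Silence after which a server stops being considered alive.
TimeSpan deathTimeout = TimeSpan.FromMilliseconds(
    AliveSignalInterval * NumberOfSkippedIntervalsToSupposeDeath);

// Roughly how long a starting server waits for another server to finish restoring its data.
TimeSpan maxRestoreWait = TimeSpan.FromMilliseconds(
    RestorePauseInterval * RestoreWaitAttemptCount);

// Roughly how long a starting server waits for its state to settle.
TimeSpan maxStartStatusWait = TimeSpan.FromMilliseconds(
    StartStatusCheckPauseInterval * StartStatusCheckAttemptCount);

// Degree of parallelism: available processor cores multiplied by the setting.
int degreeOfParallelism = Environment.ProcessorCount * MaxDegreeOfParallelismMultiplier;

Console.WriteLine($"Death timeout: {deathTimeout.TotalSeconds} s");          // 60 s
Console.WriteLine($"Max restore wait: {maxRestoreWait.TotalSeconds} s");     // 60 s
Console.WriteLine($"Max start wait: {maxStartStatusWait.TotalSeconds} s");   // 10 s
Console.WriteLine($"Degree of parallelism: {degreeOfParallelism}");
```

With the defaults, a silent server is presumed dead after about a minute, so lowering AliveSignalInterval or NumberOfSkippedIntervalsToSupposeDeath shortens failure detection at the cost of more frequent signals or a higher chance of false alarms.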

States of Runtimes

The states of all Workflow Runtimes, both currently running and previously run, are stored in the WorkflowRuntime table. Use the call below to obtain the runtime information in Workflow Engine:

var runtimeInfo = workflowRuntime.PersistenceProvider.GetWorkflowRuntimeModel("Runtime Id");

The Workflow Runtime can have one of the following states:

  • Dead - the runtime was correctly turned off and is not working at the moment. No recovery is required.
  • Terminated - the runtime was turned off, and the recovery procedure is required.
  • SelfRestore - the runtime has restarted and is undergoing the self-restore procedure.
  • Restore - the runtime data is being restored by another server; the runtime itself is not working.
  • Alive - the runtime is alive if the current time does not exceed the value LastAliveSignal + AliveSignalInterval * NumberOfSkippedIntervalsToSupposeDeath, that is, the runtime has lately shown signs of life. If this condition is not met, then the runtime is declared Terminated.
  • Single - a special state for the single-server runtime, similar to Alive.
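The Alive condition above can be written as a small check. This is a minimal sketch of the formula as stated in the text, not the library's implementation; IsConsideredAlive and its parameter names are hypothetical.

```csharp
using System;

// Hypothetical helper: a runtime counts as alive while the current time does not exceed
// LastAliveSignal + AliveSignalInterval * NumberOfSkippedIntervalsToSupposeDeath.
static bool IsConsideredAlive(
    DateTime currentTimeUtc,
    DateTime lastAliveSignalUtc,
    int aliveSignalIntervalMs,
    int numberOfSkippedIntervalsToSupposeDeath)
{
    DateTime deadline = lastAliveSignalUtc.AddMilliseconds(
        (double)aliveSignalIntervalMs * numberOfSkippedIntervalsToSupposeDeath);
    return currentTimeUtc <= deadline;
}

// With the defaults (1000 ms * 60), the deadline is 60 seconds after the last signal.
var lastSignal = new DateTime(2024, 1, 1, 12, 0, 0, DateTimeKind.Utc);
bool after30Seconds = IsConsideredAlive(lastSignal.AddSeconds(30), lastSignal, 1000, 60);
bool after61Seconds = IsConsideredAlive(lastSignal.AddSeconds(61), lastSignal, 1000, 60);

Console.WriteLine(after30Seconds); // True
Console.WriteLine(after61Seconds); // False
```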

Features of Timers

The timers system for the multi-server mode is designed with the following features:

  • it guarantees that no timer will start before its appointed time, provided that at least one server is running.
  • a timer may fire later than its appointed time, but the system tries to minimize the delay.
  • if all servers have shut down, but then at least one turns on, all the timers will be processed.
  • the servers process the timers in turn, in order to minimize the performance drop caused by competition between the servers, and to keep the system alive even if several servers have failed.

The diagram below shows how timers work in the multi-server mode when an additional server turns on and when a server suddenly turns off.

Diagram

Inside each server, the following occurs. When it is time to start processing the timers, the server selects a batch of timers whose size is less than or equal to ExecuteTimersBatchSize, and processes them in parallel. After processing the batch, the server checks whether it can process more, that is, whether any time is left before another server starts processing the timers. In doing so, the server tries not to use the interval TimerInterval * ProtectionIntervalInPercents, treating it as an untouchable reserve. Thus, the server tries to process as many timers as possible during the interval TimerInterval * (1 - ProtectionIntervalInPercents), one batch of at most ExecuteTimersBatchSize timers at a time. After the server stops processing the timers, it determines the next time the system timer should fire and sets the system timer for that time.

TimerInterval
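To make the two intervals concrete, the following sketch evaluates both expressions for the default settings. It is plain arithmetic and does not use the Workflow Engine API; the function names are hypothetical.

```csharp
using System;

// Time a server may spend processing timers in one cycle:
// TimerInterval * (1 - ProtectionIntervalInPercents).
static double ProcessingWindowMs(int timerIntervalMs, double protectionInterval) =>
    timerIntervalMs * (1 - protectionInterval);

// Reserve the server tries to leave untouched:
// TimerInterval * ProtectionIntervalInPercents.
static double ProtectedReserveMs(int timerIntervalMs, double protectionInterval) =>
    timerIntervalMs * protectionInterval;

// Defaults from the settings list: TimerInterval = 1000, ProtectionIntervalInPercents = 0.2.
double window = ProcessingWindowMs(1000, 0.2);
double reserve = ProtectedReserveMs(1000, 0.2);

Console.WriteLine($"Processing window: {window:F0} ms");   // 800 ms
Console.WriteLine($"Protected reserve: {reserve:F0} ms");  // 200 ms
```

With the defaults, each server has about 800 ms per cycle to work through batches and leaves about 200 ms unused before the next server's turn.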

Process Association with Workflow Runtime

To understand the recovery procedure, you need to know how a process is associated with a runtime: if a runtime (or server instance) turns off, the processes that belong to that particular instance must be restored. The rule is very simple.

  • If a process is inactive - Idled or Finalized - it is not associated with any runtime.
  • If a process is in the Running status, then it belongs to the runtime (or the server instance) that has set this status.

Service Timer and Recovery Procedure

The system service timer works according to the same diagram as the system timer, but uses the ServiceTimerInterval setting. If a server stops incorrectly, some processes may remain stuck in the Running status (the statuses of processes are described here). The Running status prevents any further manipulation of the process, so it must be reset. The service timer starts this reset procedure for the runtimes that require tidying up.

Workflow Runtimes are selected for recovery in the following order of priority:

  • a runtime that was being restored by this server before, but the recovery was interrupted for some reason.
  • a runtime with the Single state.
  • a runtime with the Alive state but without any signs of life during the interval equal to AliveSignalInterval * NumberOfSkippedIntervalsToSupposeDeath, or in the Terminated state. The runtime that has remained silent the longest of all is the first runtime to be processed.
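The selection order above can be sketched as a sort over runtime records. This is an illustrative model only, not the engine's code: the tuple fields, the RestorerId notion, the string state names, and RecoveryPriority are all assumptions made for the example.

```csharp
using System;
using System.Linq;

// Hypothetical helper: returns 1..3 for the priorities listed above, or 0 if no recovery is needed.
static int RecoveryPriority(
    (string Id, string State, string RestorerId, DateTime LastAliveSignal) runtime,
    string thisServerId, DateTime currentTime, TimeSpan silenceLimit)
{
    // 1. A recovery this server started earlier but did not finish.
    if (runtime.State == "Restore" && runtime.RestorerId == thisServerId) return 1;
    // 2. A runtime in the Single state.
    if (runtime.State == "Single") return 2;
    // 3. Terminated, or Alive but silent for longer than
    //    AliveSignalInterval * NumberOfSkippedIntervalsToSupposeDeath.
    bool silentTooLong = currentTime - runtime.LastAliveSignal > silenceLimit;
    if (runtime.State == "Terminated" || (runtime.State == "Alive" && silentTooLong)) return 3;
    return 0;
}

var now = new DateTime(2024, 1, 1, 12, 0, 0, DateTimeKind.Utc);
var deathTimeout = TimeSpan.FromSeconds(60); // 1000 ms * 60 with the default settings

var runtimes = new[]
{
    (Id: "A", State: "Alive",      RestorerId: "",   LastAliveSignal: now.AddSeconds(-5)),
    (Id: "B", State: "Terminated", RestorerId: "",   LastAliveSignal: now.AddMinutes(-10)),
    (Id: "C", State: "Alive",      RestorerId: "",   LastAliveSignal: now.AddMinutes(-30)),
    (Id: "D", State: "Restore",    RestorerId: "me", LastAliveSignal: now.AddMinutes(-2)),
    (Id: "E", State: "Single",     RestorerId: "",   LastAliveSignal: now.AddMinutes(-1)),
};

// Lower priority number first; within a priority, the longest-silent runtime first.
var queue = runtimes
    .Select(r => (Runtime: r, Priority: RecoveryPriority(r, "me", now, deathTimeout)))
    .Where(x => x.Priority > 0)
    .OrderBy(x => x.Priority)
    .ThenBy(x => x.Runtime.LastAliveSignal)
    .Select(x => x.Runtime.Id)
    .ToArray();

Console.WriteLine(string.Join(", ", queue)); // D, E, C, B
```

Runtime A is skipped because it reported recently; C precedes B because it has been silent longer.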

The recovery procedure is also started by the server itself at startup if it detects that it was incorrectly stopped.

The recovery procedure and its customization are described here.