Participant Node Health

The participant exposes health status information in several ways, which may be inspected manually when troubleshooting or integrated into larger monitoring and orchestration systems.

Using gRPC Health Service for Load Balancing and Orchestration

The Participant Node provides a grpc.health.v1.Health service, implementing the gRPC Health Checking Protocol.

Kubernetes containers can be configured to use this for readiness or liveness probes, e.g.

readinessProbe:
  grpc:
    port: <port>

By default the port is the one used for the Ledger API.

Likewise, gRPC clients and NGINX can be configured to watch the health service for traffic management and load balancing.

You can manually check the health of a Participant with a command-line tool such as grpcurl, e.g. (using the Participant’s actual address):

$ grpcurl -plaintext <host>:<port> grpc.health.v1.Health/Check
{
  "status": "SERVING"
}

Calling Check will respond with SERVING if the Participant is currently ready and available to serve requests.

Calling Watch will perform a streaming health check. The server will immediately send the current health of the Participant, and then send a new message whenever the health changes.
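
For example, grpcurl can follow the stream; it prints the current status immediately and then a new message on every transition (illustrative output):

$ grpcurl -plaintext <host>:<port> grpc.health.v1.Health/Watch
{
  "status": "SERVING"
}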

When multiple Participant replicas are configured, passive nodes return NOT_SERVING.

In practice, the health of the Participant is composed of the health of the components it depends on. You can query these individually by name, by making a request with the service field set to the name of the component. An empty or unset service field returns the aggregate health of all components. An unknown name will result in a gRPC NOT_FOUND error.
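
For example, to check a single component such as the sequencer-client (component names match those shown in the health.status output later in this section; the response shown is illustrative):

$ grpcurl -plaintext -d '{"service": "sequencer-client"}' <host>:<port> grpc.health.v1.Health/Check
{
  "status": "SERVING"
}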

Checking Health via HTTP

Health checking can also be done via HTTP, which is useful for frameworks that don’t support the gRPC Health Checking Protocol. Setting monitoring.http-health-server.port=<port> in the configuration for your node will expose health information at the URL http://<host>:<port>/health.

Here the important information is reported via the HTTP response status code.

  • A status of 200 is equivalent to SERVING from the gRPC Health Service.

  • A status of 503 is equivalent to NOT_SERVING.

  • A status of 500 means the check failed for any other reason.
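
For a quick manual check, you can print just the status code with curl, e.g.:

$ curl -s -o /dev/null -w '%{http_code}\n' http://<host>:<port>/health
200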

Kubernetes can also use this endpoint for readiness probes:

readinessProbe:
  httpGet:
    port: <port>
    path: /health

Inspection of General Health Status

General information about the Participant Node, including unhealthy synchronizers and dependencies and whether the node is currently Active, can be displayed in the Canton console by invoking the health.status command on the node.

@ participant1.health.status
res1: NodeStatus[ParticipantStatus] = Participant id: PAR::participant1::12201ff69b1d24edbf0ee2028a304ea702ee8536790dab1a31e7136e6d90ff6d473c
Uptime: 2.352928s
Ports:
    ledger: 30104
    admin: 30105
Connected synchronizers: None
Unhealthy synchronizers: None
Active: true
Components:
    memory_storage : Ok()
    connected-synchronizer : Not Initialized
    sync-ephemeral-state : Not Initialized
    sequencer-client : Not Initialized
    acs-commitment-processor : Not Initialized
Version: 3.3.0-SNAPSHOT
Supported protocol version(s): 33

The Admin API of the Participant Node provides programmatic access to this data in a structured form, via ParticipantStatusService’s ParticipantStatus call.
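
As a sketch, assuming gRPC server reflection is enabled on the Admin API, you can discover the fully qualified service name (which depends on the Canton version) and invoke the call with grpcurl:

$ grpcurl -plaintext <host>:<admin-port> list | grep -i ParticipantStatusService
$ grpcurl -plaintext <host>:<admin-port> <fully-qualified-service-name>/ParticipantStatus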

The Canton console can also provide information about all connected nodes, including remotely connected ones, by invoking health.status at the top level.

@ health.status
res2: CantonStatus = Status for Sequencer 'sequencer1':
Sequencer id: da::1220a82692abc55c0367abefc4bdbc23df25688230430ddfeef5759845f26d5cc29c
Synchronizer id: da::1220a82692abc55c0367abefc4bdbc23df25688230430ddfeef5759845f26d5cc29c
Uptime: 4.697653s
Ports:
    public: 30109
    admin: 30110
Connected participants:
    PAR::participant2::1220a4d7463b...
    PAR::participant1::12201ff69b1d...
Connected mediators:
    MED::mediator1::122009299340...
Sequencer: SequencerHealthStatus(active = true)
details-extra: None
Components:
    memory_storage : Ok()
    sequencer : Ok()
Accepts admin changes: true
Version: 3.3.0-SNAPSHOT
Protocol version: 33

Status for Mediator 'mediator1':
Node uid: mediator1::12200929934059da3e012af672ee8a5d26a7e4b3e5084920be298f791f7619843c78
Synchronizer id: da::1220a82692abc55c0367abefc4bdbc23df25688230430ddfeef5759845f26d5cc29c
Uptime: 4.619599s
Ports:
    admin: 30108
Active: true
Components:
    memory_storage : Ok()
    sequencer-client : Ok()
Version: 3.3.0-SNAPSHOT
Protocol version: 33

Status for Participant 'participant1':
Participant id: PAR::participant1::12201ff69b1d24edbf0ee2028a304ea702ee8536790dab1a31e7136e6d90ff6d473c
Uptime: 6.910377s
Ports:
    ledger: 30104
    admin: 30105
Connected synchronizers:
    da::1220a82692ab...
Unhealthy synchronizers: None
Active: true
Components:
    memory_storage : Ok()
    connected-synchronizer : Ok()
    sync-ephemeral-state : Ok()
    sequencer-client : Ok()
    acs-commitment-processor : Ok()
Version: 3.3.0-SNAPSHOT
Supported protocol version(s): 33

Status for Participant 'participant2':
Participant id: PAR::participant2::1220a4d7463bd34b2ba3704401b48ab41d8f88cdcbe512fc1ef071aad97fef106161
Uptime: 6.422954s
Ports:
    ledger: 30106
    admin: 30107
Connected synchronizers:
    da::1220a82692ab...
Unhealthy synchronizers: None
Active: true
Components:
    memory_storage : Ok()
    connected-synchronizer : Ok()
    sync-ephemeral-state : Ok()
    sequencer-client : Ok()
    acs-commitment-processor : Ok()
Version: 3.3.0-SNAPSHOT
Supported protocol version(s): 33

Generating a Node Health Dump for Troubleshooting

When interacting with support or attempting to troubleshoot an issue, it is often necessary to capture a snapshot of relevant execution state. Canton implements a facility that gathers key system information and bundles it into a ZIP file.

This will contain:

  • The configuration you are using, with all sensitive data stripped from it (no passwords).

  • An extract of the log file. Sensitive data is not written to the log files.

  • A current snapshot of Canton metrics.

  • A stacktrace for each running thread.

These health dumps can be triggered from the canton console with health.dump(), which returns the path to the resulting ZIP file.

@ health.dump()
..

If the console is configured to access remote nodes, their state will be included too. You can obtain the data for just a specific node by targeting it when running the command, e.g. remoteParticipant1.health.dump().

When packaging large amounts of data, increase the default timeout of the dump command:

@ health.dump(timeout = 2.minutes)
..

Health dumps can also be gathered over gRPC from the Admin API of the Participant Node, via the StatusService’s HealthDump call, which streams back the bytes of the produced ZIP file.
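
The service can be located the same way as above (again assuming reflection is enabled on the Admin API); note that grpcurl renders the streamed bytes as base64-encoded JSON, so the console command is usually the more convenient way to obtain the ZIP file:

$ grpcurl -plaintext <host>:<admin-port> list | grep -i StatusService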

Monitoring for Slow or Stuck Tasks

Some operations can report when they are slow if you enable

canton.monitoring.logging.log-slow-futures = yes

If a task is taking longer than expected, a log line such as <task name> has not completed after <duration> will be emitted periodically until it completes. This feature is disabled by default to reduce overhead.
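
With this enabled, you can search the node’s log for such warnings, e.g. (assuming the default log location, log/canton.log):

$ grep "has not completed after" log/canton.log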

Canton also provides a facility that periodically tests whether new tasks can be scheduled in a timely manner, enabled via the configuration

canton.monitoring.deadlock-detection.enabled = yes

If a problem is detected, a log line containing Task runner <name> is stuck or overloaded for <duration> will be emitted. This may indicate that resources such as CPU are overloaded, that the execution context is too small, or that too many tasks are otherwise stuck. If the issue resolves itself, a subsequent log message containing Task runner <name> is just overloaded, but operating correctly. Task got executed in the meantime will be emitted.
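
The deadlock detector’s warnings can be found in the same way, e.g.:

$ grep "is stuck or overloaded" log/canton.log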

Disabling Restart on Fatal Failures

Processes should be run under a process supervisor, such as systemd or Kubernetes, which can monitor them and restart them as needed. By default, the Participant Node process will exit in the event of a fatal failure.

If you wish to disable this behaviour, set

canton.parameters.exit-on-fatal-failures = no

which will cause the node to stay alive and report itself as unhealthy in such cases.