Recover

PQS is designed to operate as a long-running process which uses these principles to enhance availability:

  • Redundancy involves running multiple instances of PQS in parallel to ensure that the system remains available even if one instance fails.

  • Retry involves healing from transient and recoverable failures without shutting down the process or requiring operator intervention.

  • Recovery entails reconciling the current state of the ledger with already exported data in the datastore after a cold start, and continuing from the latest checkpoint.

High availability

Multiple isolated instances of PQS can be instantiated without any cross-dependency, which allows for an active-active high availability clustering model. Note that different instances might not be at the same offset due to differing processing rates and general network non-determinism. PQS’ SQL API provides capabilities to deal with this ‘eventual consistency’ model and to ensure that readers have at least ‘repeatable read’ consistency. See validate_offset_exists() in Offset management for more details.
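As a sketch of that ‘repeatable read’ pattern, an application can validate an offset it has already observed before issuing a query against whichever PQS database the load balancer routed it to. The exact call shape of validate_offset_exists() and the active() view below are assumptions for illustration; consult Offset management for the authoritative API:

```python
# Hypothetical sketch: pin a read to an offset the application has already
# seen, so the query fails fast if this PQS instance lags behind that offset.
# The signatures of validate_offset_exists() and active() are assumptions.

def pinned_read(query: str, offset: str) -> str:
    """Prepend an offset-existence check to a PQS SQL API query."""
    return (
        f"SELECT validate_offset_exists('{offset}');\n"  # errors if not ingested yet
        f"{query}"
    )

sql = pinned_read(
    "SELECT * FROM active('My.Module:MyTemplate');",  # illustrative template name
    "000000000000001a",                               # illustrative offset
)
```

If the check fails, the application can retry against the load balancer, which may route it to an instance that has ingested past the required offset.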

---
title: High Availability Deployment
---
flowchart LR
    Participant --Ledger API--> PQS1[PQS<br>Process]
    Participant --Ledger API--> PQS2[PQS<br>Process]
    PQS1 --JDBC--> Database1[PQS<br>Database]
    PQS2 --JDBC--> Database2[PQS<br>Database]
    Database1 <--JDBC--> LoadBalancer[Load<br>Balancer]
    Database2 <--JDBC--> LoadBalancer
    LoadBalancer <--JDBC--> App((App<br>Cluster))
    style PQS1 stroke-width:4px
    style PQS2 stroke-width:4px
    style Database1 stroke-width:4px
    style Database2 stroke-width:4px

Retries

PQS’ pipeline command is a unidirectional streaming process that heavily relies on the availability of its source and target dependencies. When PQS encounters an error that is designated as recoverable, it attempts to recover by restarting its internal engine:

  • gRPC [1] (white-listed; retries if):

    • CANCELLED

    • DEADLINE_EXCEEDED

    • NOT_FOUND

    • PERMISSION_DENIED

    • RESOURCE_EXHAUSTED

    • FAILED_PRECONDITION

    • ABORTED

    • INTERNAL

    • UNAVAILABLE

    • DATA_LOSS

    • UNAUTHENTICATED

  • JDBC [2] (black-listed; retries unless):

    • INVALID_PARAMETER_TYPE

    • PROTOCOL_VIOLATION

    • NOT_IMPLEMENTED

    • INVALID_PARAMETER_VALUE

    • SYNTAX_ERROR

    • UNDEFINED_COLUMN

    • UNDEFINED_OBJECT

    • UNDEFINED_TABLE

    • UNDEFINED_FUNCTION

    • NUMERIC_CONSTANT_OUT_OF_RANGE

    • NUMERIC_VALUE_OUT_OF_RANGE

    • DATA_TYPE_MISMATCH

    • INVALID_NAME

    • CANNOT_COERCE

    • UNEXPECTED_ERROR

  • Daml packages (retries if an unknown Daml package is referenced by an incoming transaction; see Dynamic Daml package reload below)
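The white-list/black-list classification above can be sketched as a simple predicate; the code names mirror the lists in this section:

```python
# Sketch of the recoverable-error classification described above:
# gRPC codes are white-listed (retry only if listed), while JDBC states are
# black-listed (retry unless listed).

GRPC_RETRYABLE = {
    "CANCELLED", "DEADLINE_EXCEEDED", "NOT_FOUND", "PERMISSION_DENIED",
    "RESOURCE_EXHAUSTED", "FAILED_PRECONDITION", "ABORTED", "INTERNAL",
    "UNAVAILABLE", "DATA_LOSS", "UNAUTHENTICATED",
}

JDBC_FATAL = {
    "INVALID_PARAMETER_TYPE", "PROTOCOL_VIOLATION", "NOT_IMPLEMENTED",
    "INVALID_PARAMETER_VALUE", "SYNTAX_ERROR", "UNDEFINED_COLUMN",
    "UNDEFINED_OBJECT", "UNDEFINED_TABLE", "UNDEFINED_FUNCTION",
    "NUMERIC_CONSTANT_OUT_OF_RANGE", "NUMERIC_VALUE_OUT_OF_RANGE",
    "DATA_TYPE_MISMATCH", "INVALID_NAME", "CANNOT_COERCE", "UNEXPECTED_ERROR",
}

def is_recoverable(source: str, code: str) -> bool:
    """Decide whether an error warrants a retry, per the lists above."""
    if source == "grpc":
        return code in GRPC_RETRYABLE    # white-list: retry only if listed
    if source == "jdbc":
        return code not in JDBC_FATAL    # black-list: retry unless listed
    return False
```

Note the asymmetry: a novel, unlisted JDBC failure (for example, a dropped connection) is retried, whereas a novel gRPC code is not.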

Configuration

The following configuration options are available to control the retry behavior of PQS:

--retry-backoff-base string      Base time (ISO 8601) for backoff retry strategy (default: PT1S)
--retry-backoff-cap string       Max duration (ISO 8601) between attempts (default: PT1M)
--retry-backoff-factor double    Factor for backoff retry strategy (default: 2.0)
--retry-counter-attempts int     Max attempts before giving up (optional)
--retry-counter-reset string     Reset retry counters after period (ISO 8601) of stability (default: PT10M)
--retry-counter-duration string  Time limit (ISO 8601) before giving up (optional)

The --retry-backoff-* settings control the periodicity of retries and the maximum duration between attempts.
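The resulting schedule is a capped exponential backoff. The sketch below uses plain seconds in place of ISO 8601 durations (PT1S = 1 second, PT1M = 60 seconds) and assumes the documented defaults:

```python
# Sketch of the exponential backoff implied by the --retry-backoff-* flags.
# base=PT1S, factor=2.0, cap=PT1M are the defaults from the option listing.

def backoff_schedule(base: float = 1.0, factor: float = 2.0,
                     cap: float = 60.0, attempts: int = 8) -> list[float]:
    """Delay (seconds) before each retry attempt, doubling up to the cap."""
    return [min(cap, base * factor ** n) for n in range(attempts)]

# With the defaults: 1, 2, 4, 8, 16, 32 seconds, then capped at 60.
```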

The --retry-counter-attempts and --retry-counter-duration settings control the maximum instability tolerance before PQS shuts down.

The --retry-counter-reset setting controls the period of stability after which all retry counters are reset.

Logging

While PQS recovers, the following log messages are emitted to indicate the progress of the recovery:

12:52:26.753 I [zio-fiber-257] com.digitalasset.scribe.appversion.package:14 scribe, version: UNSPECIFIED  application=scribe
12:52:16.725 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 1, unstable for 0 seconds. Remaining attempts: 42. Remaining time: 10 minutes. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
    Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
        Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
            Suppressed: java.net.ConnectException: Connection refused application=scribe
12:52:29.007 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 2, unstable for 12 seconds. Remaining attempts: 41. Remaining time: 9 minutes 47 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
    Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
        Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
            Suppressed: java.net.ConnectException: Connection refused application=scribe
12:52:51.237 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 3, unstable for 34 seconds. Remaining attempts: 40. Remaining time: 9 minutes 25 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
    Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
        Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
            Suppressed: java.net.ConnectException: Connection refused application=scribe
12:53:33.473 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 4, unstable for 1 minute 16 seconds. Remaining attempts: 39. Remaining time: 8 minutes 43 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
    Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
        Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
            Suppressed: java.net.ConnectException: Connection refused application=scribe
12:54:36.328 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable JDBC exception. Attempt 5, unstable for 2 minutes 19 seconds. Remaining attempts: 38. Remaining time: 7 minutes 40 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable JDBC exception.
    Suppressed: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
        Suppressed: java.net.ConnectException: Connection refused application=scribe

Metrics

The following metrics are available to monitor the stability of PQS’ dependencies. See Application metrics for more details on general observability:

# TYPE app_restarts_total counter
# HELP app_restarts_total Number of total app restarts due to recoverable errors
app_restarts_total{exception="Recoverable GRPC exception."} 5.0
app_restarts_total{exception="Recoverable JDBC exception."} 1.0
app_restarts_total{exception="Recoverable: unknown Daml package detected."} 1.0

# TYPE grpc_up gauge
# HELP grpc_up Grpc channel is up
grpc_up 1.0

# TYPE jdbc_conn_pool_up gauge
# HELP jdbc_conn_pool_up JDBC connection pool is up
jdbc_conn_pool_up 1.0
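A monitor can pull restart counts out of a raw metrics scrape like the one above. The parser below is a deliberately minimal sketch that assumes only the single exception label shown; real deployments would query Prometheus instead:

```python
# Sketch: extract app_restarts_total samples from a Prometheus text scrape,
# keyed by their exception label, e.g. to alert on accumulating restarts.

def restart_counts(exposition: str) -> dict[str, float]:
    counts = {}
    for line in exposition.splitlines():
        if line.startswith("app_restarts_total{"):
            labels, value = line.rsplit("} ", 1)        # split labels from value
            exc = labels.split('exception="', 1)[1].rstrip('"')
            counts[exc] = float(value)
    return counts

sample = (
    'app_restarts_total{exception="Recoverable GRPC exception."} 5.0\n'
    'app_restarts_total{exception="Recoverable JDBC exception."} 1.0\n'
)
counts = restart_counts(sample)
```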

Retry counters reset

If PQS encounters network unavailability, it increments its retry counters with each attempt. These counters are reset only after a period of stability, as defined by --retry-counter-reset. During prolonged periods of intermittent failures that alternate with brief periods of normal operation, PQS therefore maintains a cautious stance on the stability of the overall system.

For example, with the setting --retry-counter-reset PT5M, the following timeline illustrates how the retry counters behave:

time -->       1:00            5:00               10:00
                v               v                   v
operation:  ====xx=x====x=======x========================
                ^               ^                   ^
                A               B                   C

x - a failure causing retry happens
= - operating normally

In the timeline above, intermittent failures start at point A, and each retry attempt contributes to the increase of the overall backoff schedule. Consequently, each subsequent retry allows more time for the system to recover. This schedule does not reset to its initial values until after the configured period of stability is reached following the last failure (point B), such as after operating without any failures for 5 minutes (point C).
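The reset behavior above can be sketched as a small simulation; the function below is illustrative, not PQS internals:

```python
# Sketch of retry-counter reset: the counter grows with each failure and
# resets to zero only after `reset_after` seconds pass without a failure
# (the --retry-counter-reset stability window, PT5M = 300 seconds here).

def retry_counter(failure_times: list[float], reset_after: float = 300.0) -> int:
    """Counter value after processing failures at the given times (seconds)."""
    counter, last_failure = 0, None
    for t in failure_times:
        if last_failure is not None and t - last_failure >= reset_after:
            counter = 0                    # stability window elapsed: reset
        counter += 1
        last_failure = t
    return counter

# Failures at 60s, 70s and 90s keep ratcheting the counter up; a failure at
# 1000s comes after a >5 minute quiet spell, so it counts as attempt 1 again.
```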

Dynamic Daml package reload

Deploying new Daml packages (DARs) to the Participant Node does not require restarting PQS. When PQS encounters an unknown package while processing a received event, it fetches the missing packages from the Participant Node.

This applies when PQS receives a transaction that references a Daml package that was not present when the current pipeline session started. PQS temporarily pauses ingestion while it reloads its internal schema:

  1. Fetches the updated set of packages from the Participant Node

  2. Parses the new package and rebuilds internal type mappings

  3. Registers new templates and choices in the datastore

  4. Resumes ingestion from the last committed checkpoint

No data is lost during this process. The transaction that triggered the reload is reprocessed after the schema update completes.

The following log message indicates that a reload was triggered:

12:34:56.789 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable: unknown Daml package detected. Attempt 1, unstable for 0 seconds. Remaining attempts: 42. Remaining time: 10 minutes. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable: unknown Daml package detected.
    Suppressed: com.digitalasset.zio.daml.ledgerapi.UnknownDamlPackageException: No package for abc123:My.Module:MyTemplate was seen on initialization. Retrying to discover new packages.

The key fragments to search for in logs:

  • Recoverable: unknown Daml package detected: indicates that PQS identified a new package and is reloading.

  • No package for <packageId>:<moduleName>:<entityName> was seen on initialization. Retrying to discover new packages.: identifies the specific template from the unrecognized package that triggered the reload.
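A log-based alert can watch for these fragments. The sketch below is a minimal filter; the marker strings are taken verbatim from this section:

```python
# Sketch: scan PQS log lines for the package-reload fragments listed above,
# e.g. to feed a log-based alerting rule.

RELOAD_MARKERS = (
    "Recoverable: unknown Daml package detected",
    "was seen on initialization. Retrying to discover new packages",
)

def reload_events(log_lines: list[str]) -> list[str]:
    """Return only the lines that indicate a dynamic package reload."""
    return [line for line in log_lines
            if any(marker in line for marker in RELOAD_MARKERS)]
```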

Reload duration depends on the total number of packages on the ledger and their complexity. For ledgers with many packages (100+), the reload may take tens of seconds while PQS re-parses all packages to rebuild its type information. During this time, ingestion is paused and downstream queries continue to serve previously committed data.

The app_restarts_total counter (see Application metrics) is incremented with label exception="Recoverable: unknown Daml package detected." each time a reload occurs. You can use this metric to track the frequency of package reloads and set up alerts if reloads happen more often than expected.

Note

Dynamic package reload applies to Daml package deployments only. Other configuration changes such as adding parties, templates, or interfaces to filters still require a restart. See Configure.

Exit codes

PQS terminates with the following exit codes:

  • 0: Normal termination

  • 1: Termination due to unrecoverable error or all retry attempts for recoverable errors have been exhausted

Ledger streaming & recovery

On (re-)start, PQS determines the last saved checkpoint and continues incremental processing from that point onward. PQS can start and finish at prescribed ledger offsets, specified via command-line arguments.

In many scenarios, --pipeline-ledger-start Oldest --pipeline-ledger-stop Never is the most appropriate configuration: it populates all available history on first run and caters for resumption and recovery on subsequent runs.

Start offset meanings:

  • Genesis: Commence from the first offset of the ledger, failing if it is not available.

  • Oldest: Resume processing, or start from the oldest available offset of the ledger (if the datastore is empty).

  • Latest: Resume processing, or start from the latest available offset of the ledger (if the datastore is empty).

  • <offset>: Offset from which to start processing, terminating if it does not match the state of the datastore.
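The start-offset semantics can be sketched as a resolver; the function name, parameters, and return values below are illustrative, not PQS internals:

```python
# Sketch of the start-offset resolution described above. Offsets are plain
# strings here; `checkpoint` is the last offset saved in the datastore.

def resolve_start(value: str, datastore_empty: bool, checkpoint: str,
                  oldest: str, latest: str) -> str:
    """Return the offset ingestion starts from, per the table above."""
    if value in ("Oldest", "Latest") and not datastore_empty:
        return checkpoint   # resume incremental processing from the checkpoint
    if value == "Oldest":
        return oldest       # oldest offset still available on the ledger
    if value == "Latest":
        return latest
    if value == "Genesis":
        return oldest       # first ledger offset; PQS fails if already pruned
    return value            # explicit <offset>; must match the datastore state
```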

Stop offset meanings:

  • Latest: Process until reaching the latest available offset of the ledger, then terminate.

  • Never: Keep processing and never terminate.

  • <offset>: Process until reaching this offset, then terminate.

Caution

If the ledger has been pruned beyond the offset specified in --pipeline-ledger-start, PQS fails to start. For more details see History slicing.