Recover¶
PQS is designed to operate as a long-running process which uses the following principles to enhance availability:
- Redundancy: running multiple instances of PQS in parallel ensures that the system remains available even if one instance fails.
- Retry: healing from transient and recoverable failures without shutting down the process or requiring operator intervention.
- Recovery: reconciling the current state of the ledger with data already exported to the datastore after a cold start, and continuing from the latest checkpoint.
High availability¶
Multiple isolated instances of PQS can be instantiated without any cross-dependency, which allows for an active-active high availability clustering model. Note that different instances might not be at the same offset due to different processing rates and general network non-determinism. PQS’ SQL API provides capabilities to deal with this ‘eventual consistency’ model and to ensure that readers have at least ‘repeatable read’ consistency. See validate_offset_exists() in Offset management for more details.
---
title: High Availability Deployment
---
flowchart LR
    Participant --Ledger API--> PQS1[PQS<br>Process]
    Participant --Ledger API--> PQS2[PQS<br>Process]
    PQS1 --JDBC--> Database1[PQS<br>Database]
    PQS2 --JDBC--> Database2[PQS<br>Database]
    Database1 <--JDBC--> LoadBalancer[Load<br>Balancer]
    Database2 <--JDBC--> LoadBalancer[Load<br>Balancer]
    LoadBalancer <--JDBC--> App((App<br>Cluster))
    style PQS1 stroke-width:4px
    style PQS2 stroke-width:4px
    style Database1 stroke-width:4px
    style Database2 stroke-width:4px
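A reader connecting through the load balancer can use the SQL API to guard against querying an instance that has not yet caught up to an offset the application has already observed. The following is a minimal sketch, assuming the validate_offset_exists() function described in Offset management; the offset placeholder and the active() read function are illustrative assumptions rather than a prescribed pattern:

-- Minimal sketch: abort if this PQS instance has not yet ingested the
-- offset the application last observed (placeholder values below are
-- illustrative, not part of the documented API).
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT validate_offset_exists('<last-observed-offset>');
SELECT * FROM active('MyModule:MyTemplate');
COMMIT;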
Retries¶
PQS’ pipeline command is a unidirectional streaming process that relies heavily on the availability of its source (the Ledger API) and target (the PostgreSQL datastore) dependencies. When PQS encounters an error, it attempts to recover by restarting its internal engine, provided the error is designated as recoverable:
gRPC [1] (white-listed; PQS retries only if the status is one of the following):
CANCELLED
DEADLINE_EXCEEDED
NOT_FOUND
PERMISSION_DENIED
RESOURCE_EXHAUSTED
FAILED_PRECONDITION
ABORTED
INTERNAL
UNAVAILABLE
DATA_LOSS
UNAUTHENTICATED
JDBC [2] (black-listed; PQS retries unless the error is one of the following):
INVALID_PARAMETER_TYPE
PROTOCOL_VIOLATION
NOT_IMPLEMENTED
INVALID_PARAMETER_VALUE
SYNTAX_ERROR
UNDEFINED_COLUMN
UNDEFINED_OBJECT
UNDEFINED_TABLE
UNDEFINED_FUNCTION
NUMERIC_CONSTANT_OUT_OF_RANGE
NUMERIC_VALUE_OUT_OF_RANGE
DATA_TYPE_MISMATCH
INVALID_NAME
CANNOT_COERCE
UNEXPECTED_ERROR
Configuration¶
The following configuration options are available to control the retry behavior of PQS:
--retry-backoff-base string Base time (ISO 8601) for backoff retry strategy (default: PT1S)
--retry-backoff-cap string Max duration (ISO 8601) between attempts (default: PT1M)
--retry-backoff-factor double Factor for backoff retry strategy (default: 2.0)
--retry-counter-attempts int Max attempts before giving up (optional)
--retry-counter-reset string Reset retry counters after period (ISO 8601) of stability (default: PT10M)
--retry-counter-duration string Time limit (ISO 8601) before giving up (optional)
The --retry-backoff-* settings control the periodicity of retries and the maximum duration between attempts.
The --retry-counter-attempts and --retry-counter-duration settings control the maximum instability tolerance before shutting down.
The --retry-counter-reset setting controls the period of stability after which the retry counters are reset across the board.
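For example, with the defaults above (base PT1S, factor 2.0, cap PT1M), and assuming a conventional exponential backoff in which each wait is the previous one multiplied by the factor, the successive waits between attempts would be roughly 1s, 2s, 4s, 8s, 16s, 32s, and then remain capped at 1 minute. Retries continue on that schedule until --retry-counter-attempts or --retry-counter-duration (if configured) is exhausted, or until a --retry-counter-reset period of stability clears the counters.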
Logging¶
While PQS recovers, the following log messages are emitted to indicate the progress of the recovery:
12:52:26.753 I [zio-fiber-257] com.digitalasset.scribe.appversion.package:14 scribe, version: UNSPECIFIED application=scribe
12:52:16.725 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 1, unstable for 0 seconds. Remaining attempts: 42. Remaining time: 10 minutes. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
Suppressed: java.net.ConnectException: Connection refused application=scribe
12:52:29.007 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 2, unstable for 12 seconds. Remaining attempts: 41. Remaining time: 9 minutes 47 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
Suppressed: java.net.ConnectException: Connection refused application=scribe
12:52:51.237 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 3, unstable for 34 seconds. Remaining attempts: 40. Remaining time: 9 minutes 25 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
Suppressed: java.net.ConnectException: Connection refused application=scribe
12:53:33.473 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable GRPC exception. Attempt 4, unstable for 1 minute 16 seconds. Remaining attempts: 39. Remaining time: 8 minutes 43 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable GRPC exception.
Suppressed: io.grpc.StatusException: UNAVAILABLE: io exception
Suppressed: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/[0:0:0:0:0:0:0:1]:6865
Suppressed: java.net.ConnectException: Connection refused application=scribe
12:54:36.328 I [zio-fiber-0] com.digitalasset.scribe.pipeline.Retry.retryRecoverable:48 Recoverable JDBC exception. Attempt 5, unstable for 2 minutes 19 seconds. Remaining attempts: 38. Remaining time: 7 minutes 40 seconds. Exception in thread "zio-fiber-" java.lang.Throwable: Recoverable JDBC exception.
Suppressed: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.
Suppressed: java.net.ConnectException: Connection refused application=scribe
Metrics¶
The following metrics are available to monitor the stability of PQS’ dependencies. See Application metrics for more details on general observability:
## TYPE app_restarts_total counter
## HELP app_restarts_total Number of total app restarts due to recoverable errors
app_restarts_total{exception="Recoverable GRPC exception."} 5.0
## TYPE grpc_up gauge
## HELP grpc_up Grpc channel is up
grpc_up{} 1.0
## TYPE jdbc_conn_pool_up gauge
## HELP jdbc_conn_pool_up JDBC connection pool is up
jdbc_conn_pool_up{} 1.0
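As an illustration, these metrics lend themselves to simple alerting. The PromQL expressions below are a sketch only, assuming the metrics are scraped by a Prometheus-compatible system; the window and threshold are illustrative:

# Ledger API channel or JDBC connection pool is down
grpc_up == 0
jdbc_conn_pool_up == 0
# Pipeline is restarting repeatedly due to recoverable errors
increase(app_restarts_total[10m]) > 3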
Retry counters reset¶
If PQS encounters network unavailability, it starts incrementing retry counters with each attempt. These counters are reset only after a period of stability, as defined by --retry-counter-reset. As such, during prolonged periods of intermittent failures that alternate with brief periods of normal operation, PQS maintains a cautious stance on assumptions regarding the stability of the overall system. For example, with the setting --retry-counter-reset PT5M, the following timeline illustrates how the retry counters behave:
time --> 1:00 5:00 10:00
v v v
operation: ====xx=x====x=======x========================
^ ^ ^
A B C
x - a failure causing retry happens
= - operating normally
In the timeline above, intermittent failures start at point A, and each retry attempt contributes to the increase of the overall backoff schedule. Consequently, each subsequent retry allows more time for the system to recover. This schedule does not reset to its initial values until after the configured period of stability is reached following the last failure (point B), such as after operating without any failures for 5 minutes (point C).
Exit codes¶
PQS terminates with the following exit codes:
0: Normal termination
1: Termination due to an unrecoverable error, or because all retry attempts for recoverable errors have been exhausted
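A supervising script or init system can use these exit codes to decide whether to restart PQS or to page an operator instead. The following shell sketch is illustrative only; the scribe.jar entry point and its arguments are assumptions, not part of the documented interface:

#!/usr/bin/env bash
# Run PQS; the exact invocation below is an assumption for illustration.
./scribe.jar pipeline ledger postgres-document "$@"
status=$?
if [ "$status" -eq 0 ]; then
  echo "PQS terminated normally (for example, it reached its stop offset)."
else
  # Exit code 1: unrecoverable error, or the retry budget was exhausted.
  echo "PQS gave up; investigate before restarting." >&2
  exit "$status"
fi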
Ledger streaming & recovery¶
On (re-)start, PQS determines the last saved checkpoint and continues incremental processing from that point onward. PQS is able to start and finish at prescribed ledger offsets, specified via arguments.
In many scenarios, --pipeline-ledger-start Oldest --pipeline-ledger-stop Never is the most appropriate configuration: it performs the initial population of all available history and also caters for resumption/recovery processing.
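As a sketch, a typical long-running deployment could therefore be launched as follows; the scribe.jar entry point and the postgres-document target are assumptions based on a standard PQS installation, while the --pipeline-ledger-* and --retry-* flags are those documented in this section:

./scribe.jar pipeline ledger postgres-document \
  --pipeline-ledger-start Oldest \
  --pipeline-ledger-stop Never \
  --retry-backoff-cap PT2M \
  --retry-counter-reset PT10M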
Start offset meanings:

Value | Meaning
---|---
Genesis | Commence from the first offset of the ledger, failing if not available.
Oldest | Resume processing, or start from the oldest available offset of the ledger (if the datastore is empty).
Latest | Resume processing, or start from the latest available offset of the ledger (if the datastore is empty).
<offset> | Offset from which to start processing, terminating if it does not match the state of the datastore.
Stop offset meanings:

Value | Meaning
---|---
Latest | Process until reaching the latest available offset of the ledger, then terminate.
Never | Keep processing and never terminate.
<offset> | Process until reaching this offset, then terminate.
Caution
If the ledger has been pruned beyond the offset specified in --pipeline-ledger-start, PQS fails to start. For more details, see History slicing.