Troubleshoot

When things go wrong, it is important to be able to diagnose the problem quickly and effectively. This section aims to get an operator up to speed with the most common troubleshooting techniques and tools available in PQS. You might want to refer to it for ideas when devising your own troubleshooting procedures.

PQS application (pipeline process) quick facts:

  • exports ledger events into a queryable data store

  • does not send ledger commands

  • is stateless

  • is restart friendly (restarts are fast in the absence of migrations, Daml model changes, etc.)

  • is tolerant of unavailable dependencies (through a retry loop)

  • uses only one Ledger API stream connection (flat transactions or transaction trees) after initialisation

  • uses a pool of connections to Postgres (16 by default)

  • can be secured with TLS on both connections

  • uses the OpenTelemetry Agent to export its observability signals

  • can export a diagnostics archive (with metrics and thread dumps collected over time)

Look at the exit codes of the PQS process or of the Docker/Kubernetes container orchestrator (a sketch for retrieving them follows this list):

  • 137 indicates the process was killed externally with SIGKILL (-9), for example by the container orchestrator or the kernel OOM killer, see also

  • a non-zero exit code might indicate invalid startup conditions, which are treated as non-recoverable errors. Causes might include:

    • misspelled startup parameter names or values
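
As a minimal sketch, the following retrieves a stopped container's exit code with docker inspect and flags the SIGKILL case; the container name pqs-example is a placeholder, not part of the PQS distribution.

```python
import json
import subprocess

CONTAINER = "pqs-example"  # placeholder container name - substitute your deployment's name

# Ask Docker for the recorded state of the (possibly stopped) container.
result = subprocess.run(
    ["docker", "inspect", "--format", "{{json .State}}", CONTAINER],
    capture_output=True, text=True, check=True,
)
state = json.loads(result.stdout)
exit_code = state["ExitCode"]

if exit_code == 137:
    # 128 + 9: terminated externally with SIGKILL, e.g. by the orchestrator
    # or the kernel OOM killer - check memory limits and orchestrator events.
    print(f"{CONTAINER} was killed with SIGKILL (exit code 137)")
elif exit_code != 0:
    # Possibly invalid startup conditions - check startup parameters and logs.
    print(f"{CONTAINER} exited with non-zero code {exit_code}")
else:
    print(f"{CONTAINER} exited normally")
```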

Look at the logs for activity indicators (a log-filtering sketch follows this list):

  • ledger keep-alives are present

  • the watermark advances in the presence of expected ledger traffic, see also here and here

  • the retry loop reports recoverable errors (both upstream and downstream); examine the message for an indication of the underlying cause

  • in case of non-recoverable errors, keep in mind that the last visible stack trace does not necessarily represent the true root cause - explore the events that preceded it by requesting a larger slice of logs before the termination
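
The exact wording of log lines varies between PQS versions, so the keywords and the log path below are assumptions to adapt; the sketch merely counts lines that mention the indicators above.

```python
from collections import Counter

LOG_FILE = "pqs.log"  # placeholder path - e.g. the output of `kubectl logs` redirected to a file

# Keywords are assumptions based on the indicators above; adjust them to match
# the wording your PQS version actually emits.
KEYWORDS = ("keep-alive", "watermark", "retry")

counts = Counter()
with open(LOG_FILE, encoding="utf-8") as log:
    for line in log:
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                counts[keyword] += 1

for keyword in KEYWORDS:
    print(f"lines mentioning '{keyword}': {counts[keyword]}")
```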

Look at the metrics for a detailed breakdown of PQS internals (a query sketch follows this list):

  • correlate the transaction (pipeline_events_total{type="transaction"}) and watermark (watermark_ix) throughput metrics to identify whether any slowdowns are present in the PQS pipeline

  • get an idea of the latency introduced by the PQS pipeline - see here

  • get an idea of contract churn by template (which correlates with the write activity of PQS) - see here
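
Assuming the PQS metrics are scraped by a Prometheus server (the address below is a placeholder), a sketch like the following correlates the two throughput signals via the Prometheus HTTP API; treat the metric types as assumptions and adjust the queries to your setup.

```python
import requests

PROMETHEUS = "http://localhost:9090"  # placeholder address of the Prometheus server scraping PQS


def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first sample's value (0.0 if empty)."""
    response = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    response.raise_for_status()
    results = response.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


# Per-second transaction ingestion rate vs. per-second watermark progression over 5 minutes.
tx_rate = instant_query('rate(pipeline_events_total{type="transaction"}[5m])')
watermark_rate = instant_query("deriv(watermark_ix[5m])")  # deriv() since the watermark is a gauge-like index

print(f"transactions/s: {tx_rate:.2f}, watermark progression/s: {watermark_rate:.2f}")
if tx_rate > 0 and watermark_rate <= 0:
    print("events are flowing but the watermark is not advancing - inspect the PQS pipeline")
```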

Look into the database to get familiar with the Daml model footprint (a sizing query sketch follows below):

  • get an idea of data volumes in terms of the Daml structure - see here and here
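
For an approximate footprint, standard PostgreSQL statistics views can be queried directly; the connection settings below are placeholders, and the Daml-level per-template breakdown is covered by the pages linked above.

```python
import psycopg2

# Placeholder connection settings - substitute your PQS datastore's coordinates.
conn = psycopg2.connect(host="localhost", dbname="pqs", user="postgres", password="postgres")

# Row estimates and on-disk size per table, which approximates how much space
# the ingested ledger data occupies in the datastore.
QUERY = """
    SELECT relname,
           n_live_tup,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 20;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for table, rows, size in cur.fetchall():
        print(f"{table}: ~{rows} rows, {size}")
conn.close()
```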

Look into database statistics for resource utilisation; a query sketch follows.
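
A sketch along the same lines, using the standard pg_stat_database and pg_stat_activity views (connection settings are again placeholders):

```python
import psycopg2

# Placeholder connection settings for the PQS datastore.
conn = psycopg2.connect(host="localhost", dbname="pqs", user="postgres", password="postgres")

with conn, conn.cursor() as cur:
    # Cache hit ratio and transaction counts for the database PQS writes to.
    cur.execute("""
        SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio,
               xact_commit,
               xact_rollback
        FROM pg_stat_database
        WHERE datname = current_database();
    """)
    print("cache hit ratio, commits, rollbacks:", cur.fetchone())

    # How many connections (PQS pools 16 by default) are in each state.
    cur.execute("""
        SELECT state, count(*)
        FROM pg_stat_activity
        WHERE datname = current_database()
        GROUP BY state;
    """)
    for state, count in cur.fetchall():
        print(f"connections in state {state}: {count}")
conn.close()
```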

Try correlating representative metrics between PQS & Canton (if available).

To escalate issues to Digital Asset’s support team, provide forensics by collecting a diagnostics dump close to the time of the incident and attaching the resulting archive to the support ticket.

Runbooks