Troubleshoot

When things go wrong, it is important to be able to diagnose the problem quickly and effectively. This section aims to get an operator up to speed with the most common troubleshooting techniques and tools available in PQS. You might want to refer to it for ideas when devising your own troubleshooting procedures.

PQS application (pipeline process) quick facts:

  • exports ledger events into a queryable data store

  • does not send ledger commands

  • is stateless

  • is restart friendly (restarts are fast in the absence of migrations, Daml model changes, etc.)

  • is tolerant of unavailable dependencies (through a retry loop)

  • uses only one Ledger API stream connection (flat transactions or transaction trees) after initialisation

  • uses a pool of connections to Postgres (16 by default)

  • can be secured with TLS on both connections

  • uses the OpenTelemetry Agent to export its observability signals

  • can export a diagnostics archive (with metrics and thread dumps collected over time)

Look at the exit codes of the PQS process or of the Docker/Kubernetes container orchestrator (a sketch for retrieving them follows this list):

  • 137 indicates the process was killed externally with SIGKILL (-9), for example by the container orchestrator or the kernel OOM killer, see also

  • a non-zero exit code might indicate invalid startup conditions, which are treated as non-recoverable errors. Causes might include:

    • misspelled startup parameter names or values
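
As a minimal sketch, the following retrieves a stopped container's exit code with docker inspect and flags the SIGKILL case; the container name pqs-example is a placeholder, not part of the PQS distribution.

```python
import json
import subprocess

CONTAINER = "pqs-example"  # placeholder container name - substitute your deployment's name

# Ask Docker for the recorded state of the (possibly stopped) container.
result = subprocess.run(
    ["docker", "inspect", "--format", "{{json .State}}", CONTAINER],
    capture_output=True, text=True, check=True,
)
state = json.loads(result.stdout)
exit_code = state["ExitCode"]

if exit_code == 137:
    # 128 + 9: terminated externally with SIGKILL, e.g. by the orchestrator
    # or the kernel OOM killer - check memory limits and orchestrator events.
    print(f"{CONTAINER} was killed with SIGKILL (exit code 137)")
elif exit_code != 0:
    # Possibly invalid startup conditions - check startup parameters and logs.
    print(f"{CONTAINER} exited with non-zero code {exit_code}")
else:
    print(f"{CONTAINER} exited normally")
```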

Look at the logs for activity indicators (a log-filtering sketch follows this list):

  • ledger keep-alives are present

  • the watermark advances in the presence of expected ledger traffic, see also here and here

  • the retry loop reports recoverable errors (both upstream and downstream); examine the message for an indication of the underlying cause

  • in case of non-recoverable errors, keep in mind that the last visible stack trace does not necessarily represent the true root cause - explore the events that preceded it by requesting a larger slice of logs before the termination
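
The exact wording of log lines varies between PQS versions, so the keywords and the log path below are assumptions to adapt; the sketch merely counts lines that mention the indicators above.

```python
from collections import Counter

LOG_FILE = "pqs.log"  # placeholder path - e.g. the output of `kubectl logs` redirected to a file

# Keywords are assumptions based on the indicators above; adjust them to match
# the wording your PQS version actually emits.
KEYWORDS = ("keep-alive", "watermark", "retry")

counts = Counter()
with open(LOG_FILE, encoding="utf-8") as log:
    for line in log:
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                counts[keyword] += 1

for keyword in KEYWORDS:
    print(f"lines mentioning '{keyword}': {counts[keyword]}")
```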

Look at the metrics for a detailed breakdown of PQS internals (a query sketch follows this list):

  • correlate the transaction (pipeline_events_total{type="transaction"}) and watermark (watermark_ix) throughput metrics to identify whether any slowdowns are present in the PQS pipeline

  • get an idea of the latency introduced by the PQS pipeline - see here

  • get an idea of contract churn by template (which correlates with the write activity of PQS) - see here
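
Assuming the PQS metrics are scraped by a Prometheus server (the address below is a placeholder), a sketch like the following correlates the two throughput signals via the Prometheus HTTP API; treat the metric types as assumptions and adjust the queries to your setup.

```python
import requests

PROMETHEUS = "http://localhost:9090"  # placeholder address of the Prometheus server scraping PQS


def instant_query(promql: str) -> float:
    """Run an instant PromQL query and return the first sample's value (0.0 if empty)."""
    response = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": promql})
    response.raise_for_status()
    results = response.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


# Per-second transaction ingestion rate vs. per-second watermark progression over 5 minutes.
tx_rate = instant_query('rate(pipeline_events_total{type="transaction"}[5m])')
watermark_rate = instant_query("deriv(watermark_ix[5m])")  # deriv() since the watermark is a gauge-like index

print(f"transactions/s: {tx_rate:.2f}, watermark progression/s: {watermark_rate:.2f}")
if tx_rate > 0 and watermark_rate <= 0:
    print("events are flowing but the watermark is not advancing - inspect the PQS pipeline")
```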

Look into the database to get familiar with the Daml model footprint (a sizing query sketch follows below):

  • get an idea of data volumes in terms of the Daml structure - see here and here
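
For an approximate footprint, standard PostgreSQL statistics views can be queried directly; the connection settings below are placeholders, and the Daml-level per-template breakdown is covered by the pages linked above.

```python
import psycopg2

# Placeholder connection settings - substitute your PQS datastore's coordinates.
conn = psycopg2.connect(host="localhost", dbname="pqs", user="postgres", password="postgres")

# Row estimates and on-disk size per table, which approximates how much space
# the ingested ledger data occupies in the datastore.
QUERY = """
    SELECT relname,
           n_live_tup,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM pg_stat_user_tables
    ORDER BY pg_total_relation_size(relid) DESC
    LIMIT 20;
"""

with conn, conn.cursor() as cur:
    cur.execute(QUERY)
    for table, rows, size in cur.fetchall():
        print(f"{table}: ~{rows} rows, {size}")
conn.close()
```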

Look into database statistics for resource utilisation; a query sketch follows.
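
A sketch along the same lines, using the standard pg_stat_database and pg_stat_activity views (connection settings are again placeholders):

```python
import psycopg2

# Placeholder connection settings for the PQS datastore.
conn = psycopg2.connect(host="localhost", dbname="pqs", user="postgres", password="postgres")

with conn, conn.cursor() as cur:
    # Cache hit ratio and transaction counts for the database PQS writes to.
    cur.execute("""
        SELECT blks_hit::float / NULLIF(blks_hit + blks_read, 0) AS cache_hit_ratio,
               xact_commit,
               xact_rollback
        FROM pg_stat_database
        WHERE datname = current_database();
    """)
    print("cache hit ratio, commits, rollbacks:", cur.fetchone())

    # How many connections (PQS pools 16 by default) are in each state.
    cur.execute("""
        SELECT state, count(*)
        FROM pg_stat_activity
        WHERE datname = current_database()
        GROUP BY state;
    """)
    for state, count in cur.fetchall():
        print(f"connections in state {state}: {count}")
conn.close()
```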

Try correlating representative metrics between PQS & Canton (if available).

To escalate issues to Digital Asset’s support team, provide forensics by collecting a diagnostics dump close to the time of the incident and attaching the resulting archive to the support ticket.

Runbooks