Logstash High Availability

This document describes how to operate the IAP RASP reporting Logstash pipeline in a high-availability configuration and how to recover from outages.

Two concerns are addressed separately:

  • Data resilience — ensuring no events are lost during Elasticsearch outages or Logstash restarts.
  • Solution resilience — eliminating Logstash as a single point of failure.

Operational runbooks for common scenarios are included at the end of this document.


Data Resilience

Persistent Queue

Logstash is configured with queue.type: persisted and queue.max_bytes: 1gb (see logstash/logstash.yml). The persistent queue sits between the JDBC input and the Elasticsearch output and provides durability across restarts and short outages.

When Elasticsearch becomes unreachable:

  1. The Elasticsearch output plugin retries failed deliveries with exponential back-off. Events accumulate in the persistent queue on disk.
  2. The JDBC input continues polling PostgreSQL and enqueueing new events until the queue reaches queue.max_bytes.
  3. Once the queue is full, the JDBC input blocks. No new events are read from PostgreSQL. Logstash does not drop events and does not crash.
  4. The tracking metadata file (last_run_metadata_path) is only updated after the Elasticsearch output confirms delivery. A Logstash restart during a full-queue scenario will replay from the last confirmed position — no data is lost.

Once Elasticsearch becomes reachable again, the output drains the persistent queue, the JDBC input unblocks, and the pipeline catches up automatically without manual intervention.
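
The retry behaviour in step 1 can be pictured with a short sketch. It assumes a doubling back-off capped at a maximum interval. The 2 s initial and 64 s maximum intervals are illustrative assumptions (compare them with retry_initial_interval and retry_max_interval in your Elasticsearch output configuration), and backoff_schedule is a hypothetical helper, not part of Logstash.

```python
def backoff_schedule(initial_s=2.0, max_s=64.0, attempts=8):
    """Return the wait in seconds before each retry: doubling, capped at max_s.

    initial_s and max_s are assumed values for illustration only.
    """
    delays = []
    delay = initial_s
    for _ in range(attempts):
        delays.append(min(delay, max_s))
        delay *= 2
    return delays


if __name__ == "__main__":
    schedule = backoff_schedule()
    print("waits between retries (s):", schedule)
    print("total time spent waiting (s):", sum(schedule))
```

Once the cap is reached, retries continue at a fixed interval, so a long outage produces a steady trickle of connection attempts rather than a growing gap.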

PostgreSQL Impact During Outages

The JDBC input maintains a persistent connection to PostgreSQL. During an Elasticsearch outage:

  • Each poll executes a read-only SELECT … WHERE id > :sql_last_value ORDER BY id ASC. No locks are held between polls.
  • When the queue is full and the input blocks, no new queries are issued. The connection remains open but idle.

There are no lock contention or connection exhaustion concerns at typical event rates.

Queue Sizing

The default queue.max_bytes: 1gb is appropriate for most deployments. To size for a specific outage window:

queue_bytes = avg_event_rate_per_sec × avg_event_size_bytes × outage_window_seconds × safety_factor

safety_factor accounts for event rate spikes, per-event queue metadata overhead, and estimation uncertainty. The recommended sizes in the table below correspond to a value of 2.

Scenario           Rate     Avg size   Window   Factor   Recommended size
Very low traffic   1 /min   512 B      24 h     2        ~1.5 MB
Low traffic        10 /s    512 B      24 h     2        ~900 MB
Medium traffic     100 /s   512 B      24 h     2        ~9 GB
High traffic       500 /s   512 B      48 h     2        ~88 GB

For medium or high traffic, increase queue.max_bytes in logstash.yml and ensure the Docker volume has sufficient capacity.
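
The sizing formula can be checked with a few lines of Python. This is a minimal sketch: queue_bytes is a hypothetical helper, and the safety factor of 2 is an assumption inferred from the sizes in the table, not a Logstash setting.

```python
def queue_bytes(rate_per_sec, avg_event_bytes, outage_window_s, safety_factor):
    """Apply the sizing formula; every input is an operator-supplied estimate."""
    return rate_per_sec * avg_event_bytes * outage_window_s * safety_factor


HOUR = 3600
ASSUMED_SAFETY_FACTOR = 2  # assumption inferred from the table above

# (events per second, average event size in bytes, outage window in seconds)
scenarios = {
    "low": (10, 512, 24 * HOUR),
    "medium": (100, 512, 24 * HOUR),
}

for name, (rate, size, window) in scenarios.items():
    needed = queue_bytes(rate, size, window, ASSUMED_SAFETY_FACTOR)
    print(f"{name}: {needed / 1e6:.0f} MB")
```

Round the result up to a convenient value before setting queue.max_bytes, and leave headroom on the volume for the other files in the Logstash data directory.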

Recovery After Long Outages

Estimating catch-up duration

When Elasticsearch recovers after a multi-day outage, Logstash drains the persistent queue and then processes events that accumulated in PostgreSQL while the input was blocked.

Approximate catch-up time:

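catch_up_seconds ≈ backlog_events / sustained_throughput_events_per_sec

Sustained throughput is bounded by what Elasticsearch can index. pipeline.batch.size and pipeline.workers determine how much of that capacity Logstash can actually use.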

Typical values for a single-node Elasticsearch:

Parameter             Default         Notes
pipeline.batch.size   125             Events per batch
pipeline.workers      2               Parallel output workers
ES write throughput   ~500–2 000/s    Depends on hardware and index complexity

At 1 000 docs/s sustained throughput, a backlog of 10 million events clears in approximately 2.8 hours.
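
The worked figure above can be reproduced with a small calculation. This is a sketch only: catch_up_hours is a hypothetical helper, and the optional arrival_rate_per_s parameter is an added assumption that models new events arriving while the backlog drains.

```python
def catch_up_hours(backlog_events, write_rate_per_s, arrival_rate_per_s=0.0):
    """Estimate the hours needed to clear a backlog at a sustained write rate.

    arrival_rate_per_s is an assumed refinement: events that keep arriving
    during catch-up reduce the net drain rate.
    """
    net_rate = write_rate_per_s - arrival_rate_per_s
    if net_rate <= 0:
        raise ValueError("write rate must exceed arrival rate or the backlog never clears")
    return backlog_events / net_rate / 3600


if __name__ == "__main__":
    # 10 million events at 1 000 docs/s, no new traffic during catch-up
    print(f"{catch_up_hours(10_000_000, 1000):.1f} h")
    # same backlog while 100 new events/s keep arriving
    print(f"{catch_up_hours(10_000_000, 1000, 100):.2f} h")
```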

Tuning for faster recovery

If the backlog is large and time-to-recovery is critical, temporarily increase batch size and workers:

# logstash.yml — increase for catch-up, then revert to defaults
pipeline.workers: 4
pipeline.batch.size: 500
pipeline.batch.delay: 5

Restart Logstash after editing. Monitor Elasticsearch CPU and indexing queue depth — reduce the values if Elasticsearch shows pressure (GET /_cat/thread_pool/write?v).

Monitoring pipeline lag

# Events currently queued (trends toward 0 as the pipeline catches up)
curl http://localhost:9600/_node/stats/pipelines | \
  jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'

# Latest PostgreSQL row ID — difference from metadata value is remaining backlog
docker exec iap-postgres psql -U inappprotection -c \
  "SELECT MAX(id) FROM iap_app_event;"
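
The two checks above can also be combined in a script. The sketch below assumes the stats response has the shape used by the jq filter (pipelines.<name>.queue.events_count) and that metadata files contain a single line such as "--- 0". The helper names and the sample values are hypothetical.

```python
import json
import urllib.request

STATS_URL = "http://localhost:9600/_node/stats/pipelines"


def fetch_stats(url=STATS_URL, timeout_s=5.0):
    """Fetch pipeline stats from the Logstash monitoring API."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


def queue_depths(stats):
    """Map each pipeline name to the number of events currently queued."""
    return {
        name: pipeline.get("queue", {}).get("events_count", 0)
        for name, pipeline in stats.get("pipelines", {}).items()
    }


def parse_last_value(metadata_text):
    """Parse a metadata file body such as '--- 12345' into an integer."""
    return int(metadata_text.split()[-1])


def remaining_rows(db_max_id, metadata_text):
    """Rows in PostgreSQL that the JDBC input has not read yet."""
    return max(db_max_id - parse_last_value(metadata_text), 0)


# Sample data for illustration only, not real measurements.
SAMPLE_STATS = {
    "pipelines": {
        "rasp-events": {"queue": {"events_count": 1200}},
        "rasp-flags-set": {"queue": {"events_count": 0}},
    }
}

if __name__ == "__main__":
    print(queue_depths(SAMPLE_STATS))
    print(remaining_rows(50_000, "--- 42000\n"))
```

Total lag is the queued events plus the remaining rows. Both should trend toward 0 as the pipeline catches up.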

Dead-Letter Queue

The dead-letter queue (dead_letter_queue.enable: true) is not used in this pipeline.

All writes use a stable document_id derived from the PostgreSQL row ID, eliminating permanent indexing failures. The source data in PostgreSQL is never deleted — any event can be replayed by resetting the metadata file. The dead-letter queue would add operational overhead without providing meaningful benefit.


High Availability Deployment

Logstash has no built-in clustering. HA is achieved at the infrastructure level. The primary challenge is tracking state: each pipeline stores last_run_value in a local metadata file. The recommended approach uses a shared volume so that a standby instance can resume exactly where the active instance left off.

Active/Passive with Shared Volume

Mount the logstash-data volume from shared network storage (NFS, Azure Files, AWS EFS). Run two Logstash containers, with only one active at a time. The standby container is kept stopped and is started, manually or by an orchestrator, only when the active instance fails.

┌─────────────────────┐     shared NFS volume      ┌─────────────────────┐
│  Logstash (active)  │ ──────────────────────────▶ │  Logstash (standby) │
│  reads/writes .meta │◀──────────────────────────  │  stopped            │
└─────────────────────┘                             └─────────────────────┘

Characteristics:

  • No application changes required.
  • The standby resumes from exactly where the active left off — no gaps or duplicates.
  • Simple to operate.
  • Requires shared network storage. Manual (or orchestrator-driven) failover; recovery time depends on detection speed.

Setup:

  1. Provision a shared volume (Azure Files share, NFS mount, or AWS EFS).
  2. Mount the volume at /usr/share/logstash/data on both containers.
  3. Start only the active container. Configure your orchestrator (Docker Swarm, Kubernetes) to restart it on failure and to start the standby only if the active container remains unhealthy.
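
Where no orchestrator is available, the failover rule in step 3 can be approximated with a small watchdog. This is a sketch under stated assumptions: the health URL, the container name iap-logstash-standby, the threshold of three consecutive failures, and all function names are hypothetical. It does not stop the active container, so confirm the active instance is really down before the standby starts writing to the shared data directory.

```python
import subprocess
import time
import urllib.error
import urllib.request


def is_healthy(url, timeout_s=3.0):
    """Return True if the Logstash monitoring API answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def should_fail_over(history, threshold=3):
    """True when the most recent `threshold` health checks all failed."""
    return len(history) >= threshold and not any(history[-threshold:])


def watch(active_url, standby_container, interval_s=10, threshold=3):
    """Poll the active instance and start the standby after repeated failures."""
    history = []
    while True:
        history.append(is_healthy(active_url))
        history = history[-threshold:]  # keep only the checks that matter
        if should_fail_over(history, threshold):
            # Hypothetical container name; requires access to the Docker CLI.
            subprocess.run(["docker", "start", standby_container], check=True)
            return
        time.sleep(interval_s)
```

A call such as watch("http://active-host:9600/", "iap-logstash-standby") would run until failover is triggered. Requiring several consecutive failures avoids starting the standby on a single missed probe.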

Operational Runbooks

The following runbooks are written for operations teams unfamiliar with Logstash internals. Each runbook is self-contained.


Runbook 1 — Planned Elasticsearch Maintenance

Scope: Scheduled ES maintenance (upgrades, index management, node replacement).

Before maintenance

  1. Check that the queue is empty before stopping Logstash:
    curl http://localhost:9600/_node/stats/pipelines | \
      jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'
    

    All values should be 0. If they are not, wait for the pipeline to drain before proceeding.

  2. Stop Logstash gracefully:
    docker stop iap-logstash
    

    Stopping Logstash flushes in-memory state to the persistent queue on disk. No data is lost.

During maintenance

  1. Perform the Elasticsearch maintenance as planned.

  2. Verify Elasticsearch is healthy before restarting Logstash:

    curl -u <user>:<password> "https://<es-host>/_cluster/health?pretty"
    # status should be "green" or "yellow" — not "red"
    

After maintenance

  1. Start Logstash:
    docker start iap-logstash
    
  2. Monitor catch-up until the queue reaches 0:
    watch -n 5 'curl -s http://localhost:9600/_node/stats/pipelines | \
      jq ".pipelines | to_entries[] | {pipeline: .key, queue: .value.queue.events_count}"'
    
  3. Verify that indices are receiving new documents:
    curl -u <user>:<password> \
      "https://<es-host>/_cat/indices/iap-rasp-*?v&h=index,docs.count,store.size"
    

Runbook 2 — Unplanned Elasticsearch Outage

Scope: Elasticsearch becomes unreachable unexpectedly (network failure, node crash, certificate expiry).

Detection

Logstash logs repeated errors such as:

[ERROR] elasticsearch output: connection refused / SSL handshake failed / ...

Check container logs:

docker logs iap-logstash --tail 50 | grep -i error

Kibana dashboards that stop updating are another indicator that the pipeline has stalled.

During the outage

No action is required. Logstash retries delivery automatically with exponential back-off. The persistent queue absorbs incoming events up to queue.max_bytes (default: 1 GB). Once the queue is full, the JDBC input pauses. Events are not lost — the source data remains intact in PostgreSQL.

After Elasticsearch recovers

  1. Verify Elasticsearch is reachable:
    curl -u <user>:<password> "https://<es-host>/_cluster/health?pretty"
    
  2. Logstash resumes delivery automatically. Monitor catch-up:
    watch -n 10 'curl -s http://localhost:9600/_node/stats/pipelines | \
      jq ".pipelines | to_entries[] | {pipeline: .key, queue: .value.queue.events_count}"'
    
  3. If Logstash appears stuck (queue not draining after 5+ minutes with Elasticsearch healthy), restart it:
    docker restart iap-logstash
    
  4. Once the queue reaches 0, verify document counts are increasing in Elasticsearch (see Runbook 1, After maintenance, step 3).

Runbook 3 — Pipeline Stuck / Not Advancing

Scope: Logstash is running but the pipeline is not processing new events — Elasticsearch indices show no new documents despite data being written to PostgreSQL.

Step 1 — Verify Logstash is running

docker ps | grep iap-logstash
docker logs iap-logstash --tail 100 | grep -E "(ERROR|WARN|Pipeline started)"

Step 2 — Check queue depth

curl -s http://localhost:9600/_node/stats/pipelines | \
  jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'

Result                             Action
Queue > 0 and growing              Elasticsearch output is failing; see Runbook 2
Queue = 0, no new docs in ES       JDBC input is not reading new rows; proceed to Step 3
Queue = 0, docs increasing in ES   Pipeline is healthy; Kibana may need a refresh

Step 3 — Inspect metadata files

docker exec iap-logstash cat /usr/share/logstash/data/rasp-events.metadata
docker exec iap-logstash cat /usr/share/logstash/data/rasp-flags-set.metadata
docker exec iap-logstash cat /usr/share/logstash/data/rasp-flags-cleared.metadata

Compare with the latest row IDs in PostgreSQL:

docker exec iap-postgres psql -U inappprotection -c \
  "SELECT
     (SELECT MAX(id) FROM iap_app_event)                                 AS max_event_id,
     (SELECT MAX(id) FROM iap_app_flag_history WHERE flag_status = 'SET')     AS max_flag_set_id,
     (SELECT MAX(id) FROM iap_app_flag_history WHERE flag_status = 'CLEARED') AS max_flag_cleared_id;"

Metadata vs. DB max                         Action
Metadata = DB max                           Pipeline is up to date; no new data to ship
Metadata far ahead of DB max                Metadata is corrupted; proceed to Step 4
Metadata far behind DB max, ES is healthy   Pipeline should be processing; proceed to Step 5
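
The comparison in Step 3 can be expressed as a small decision function. This is a sketch only: triage is a hypothetical helper, and the tolerance parameter (how many rows count as "far") is an assumption to be chosen per deployment.

```python
def triage(metadata_value, db_max_id, tolerance=0):
    """Classify one pipeline by comparing its metadata value with MAX(id).

    tolerance is an assumed allowance for normal polling lag, in rows.
    db_max_id may be None when the table is empty; it is treated as 0.
    """
    db_max = db_max_id or 0
    if metadata_value > db_max + tolerance:
        return "metadata corrupted: go to Step 4"
    if metadata_value < db_max - tolerance:
        return "pipeline behind: go to Step 5 if Elasticsearch is healthy"
    return "up to date"


if __name__ == "__main__":
    print(triage(1000, 1000))
    print(triage(900_000, 1000))
    print(triage(1000, 900_000, tolerance=500))
```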

Step 4 — Reset metadata (if corrupted)

Warning: Resetting to 0 causes Logstash to reprocess all historical data. This is safe, because idempotent writes mean no Elasticsearch duplicates, but it may take significant time for large datasets. To replay from a specific point, replace 0 with the desired PostgreSQL row ID. A running pipeline may overwrite the file on its next poll, so proceed to Step 5 immediately after the reset.

docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-events.metadata'
docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-flags-set.metadata'
docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-flags-cleared.metadata'

Step 5 — Restart Logstash

docker restart iap-logstash
docker logs iap-logstash --tail 50 | grep -E "(ERROR|Pipeline started)"

All three pipelines should log Pipeline started within 30 seconds. If errors persist, escalate to the development team with the full log output.


Alternative HA Approaches

The following approaches are documented for completeness. They are not the recommended deployment model for this pipeline.

External Tracking in PostgreSQL

Replace the local metadata file with a PostgreSQL-backed tracking table. A custom JDBC input reads last_value from the database instead of from a local file, making state portable across any number of instances without requiring shared storage.

-- Tracking table (create once)
CREATE TABLE logstash_tracking (
    pipeline_id   TEXT PRIMARY KEY,
    last_value    BIGINT NOT NULL DEFAULT 0,
    updated_at    TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO logstash_tracking (pipeline_id) VALUES
    ('rasp-events'), ('rasp-flags-set'), ('rasp-flags-cleared');

When to consider: Cloud-native deployments (Kubernetes, Azure Container Apps) where ephemeral containers cannot rely on persistent shared volumes.

Trade-offs: Requires custom pipeline configuration not supported out of the box by the standard JDBC input use_column_value mechanism. Adds a PostgreSQL table and write permissions for the Logstash user.
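
The tracking-table idea can be illustrated with a short sketch. To keep it runnable without a database server, the example uses Python's built-in sqlite3 as a stand-in for PostgreSQL, so the column types differ slightly from the DDL above. The helper names are hypothetical, and the rule that last_value only moves forward is an added assumption, not something the standard JDBC input provides.

```python
import sqlite3

# sqlite3 stand-in for the PostgreSQL tracking table shown above.
SCHEMA = """
CREATE TABLE logstash_tracking (
    pipeline_id TEXT PRIMARY KEY,
    last_value  INTEGER NOT NULL DEFAULT 0,
    updated_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO logstash_tracking (pipeline_id) VALUES
    ('rasp-events'), ('rasp-flags-set'), ('rasp-flags-cleared');
"""


def init(conn):
    """Create the tracking table and seed one row per pipeline."""
    conn.executescript(SCHEMA)


def read_last_value(conn, pipeline_id):
    """Return the stored position for a pipeline."""
    row = conn.execute(
        "SELECT last_value FROM logstash_tracking WHERE pipeline_id = ?",
        (pipeline_id,),
    ).fetchone()
    if row is None:
        raise KeyError(pipeline_id)
    return row[0]


def advance(conn, pipeline_id, new_value):
    """Move the position forward only; return True if the row was updated."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "UPDATE logstash_tracking "
            "SET last_value = ?, updated_at = CURRENT_TIMESTAMP "
            "WHERE pipeline_id = ? AND last_value < ?",
            (new_value, pipeline_id, new_value),
        )
    return cur.rowcount == 1


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init(conn)
    advance(conn, "rasp-events", 500)
    print(read_last_value(conn, "rasp-events"))
```

Guarding the update with last_value < new_value means a stale instance cannot move the position backwards, which matters once more than one instance can write to the table.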

Centralised Queue (Kafka / Redis)

Introduce a message broker between the JDBC input and the Elasticsearch output:

PostgreSQL → Logstash (producer) → Kafka/Redis → Logstash (consumer, N instances) → Elasticsearch

Not suitable for this pipeline. The IAP reporting event rate is low (tens to hundreds of events per second) and idempotent Elasticsearch writes already provide data safety. The added complexity of a broker is not justified.

Last updated on May 25, 2026 (07:48)