Logstash High Availability
This document describes how to operate the IAP RASP reporting Logstash pipeline in a high-availability configuration and how to recover from outages.
Two concerns are addressed separately:
- Data resilience — ensuring no events are lost during Elasticsearch outages or Logstash restarts.
- Solution resilience — eliminating Logstash as a single point of failure.
Operational runbooks for common scenarios are included at the end of this document.
Data Resilience
Persistent Queue
Logstash is configured with queue.type: persisted and queue.max_bytes: 1gb
(see logstash/logstash.yml). The persistent queue sits between the JDBC input and the
Elasticsearch output and provides durability across restarts and short outages.
When Elasticsearch becomes unreachable:
- The Elasticsearch output plugin retries failed deliveries with exponential back-off. Events accumulate in the persistent queue on disk.
- The JDBC input continues polling PostgreSQL and enqueueing new events until the queue reaches queue.max_bytes.
- Once the queue is full, the JDBC input blocks. No new events are read from PostgreSQL. Logstash does not drop events and does not crash.
- The tracking metadata file (last_run_metadata_path) is only updated after the Elasticsearch output confirms delivery. A Logstash restart during a full-queue scenario will replay from the last confirmed position, so no data is lost.
Once Elasticsearch becomes reachable again, the output drains the persistent queue, the JDBC input unblocks, and the pipeline catches up automatically without manual intervention.
PostgreSQL Impact During Outages
The JDBC input maintains a persistent connection to PostgreSQL. During an Elasticsearch outage:
- Each poll executes a read-only SELECT … WHERE id > :last_value ORDER BY id ASC. No locks are held between polls.
- When the queue is full and the input blocks, no new queries are issued. The connection remains open but idle.
There are no lock contention or connection exhaustion concerns at typical event rates.
Queue Sizing
The default queue.max_bytes: 1gb is appropriate for most deployments. To size for a specific
outage window:
queue_bytes = avg_event_rate_per_sec × avg_event_size_bytes × outage_window_seconds × safety_factor
safety_factor accounts for event rate spikes, per-event queue metadata overhead, and estimation
uncertainty. A value of 2× is recommended.
| Scenario | Rate | Avg size | Window | Factor | Recommended size |
|---|---|---|---|---|---|
| Very low traffic | 1 /min | 512 B | 24 h | 2× | ~1.5 MB |
| Low traffic | 10 /s | 512 B | 24 h | 2× | ~900 MB |
| Medium traffic | 100 /s | 512 B | 24 h | 2× | ~9 GB |
| High traffic | 500 /s | 512 B | 48 h | 2× | ~88 GB |
For medium or high traffic, increase queue.max_bytes in logstash.yml and ensure the Docker
volume has sufficient capacity.
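As a rough illustration, the sizing formula can be evaluated with shell integer arithmetic. The function name queue_bytes is hypothetical, and whole-number inputs are assumed; a rate below one event per second would need a tool with floating-point support.

```shell
# Hypothetical helper: evaluates the sizing formula above.
# Arguments: events per second, average event size in bytes,
#            outage window in seconds, safety factor (all whole numbers).
queue_bytes() {
    echo $(( $1 * $2 * $3 * $4 ))
}

# Low-traffic row: 10 events/s, 512 B, 24 h window, factor 2
queue_bytes 10 512 86400 2    # prints 884736000 (about 900 MB)
```

The result is in bytes; round it up to a convenient value before setting queue.max_bytes.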
Recovery After Long Outages
Estimating catch-up duration
When Elasticsearch recovers after a multi-day outage, Logstash drains the persistent queue and then processes events that accumulated in PostgreSQL while the input was blocked.
Approximate catch-up time:
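catch_up_seconds ≈ backlog_events / sustained_es_write_throughput_per_sec
Sustained throughput is bounded by Elasticsearch indexing capacity; pipeline.batch.size and pipeline.workers determine how close Logstash gets to that limit.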
Typical values for a single-node Elasticsearch:
| Parameter | Default | Notes |
|---|---|---|
| pipeline.batch.size | 125 | Events per batch |
| pipeline.workers | 2 | Parallel output workers |
| ES write throughput | ~500–2 000/s | Depends on hardware and index complexity |
At 1 000 docs/s sustained throughput, a backlog of 10 million events clears in approximately 2.8 hours.
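The worked figure above can be reproduced with a small sketch. The function name catch_up_seconds is hypothetical, and the throughput is assumed to be the sustained end-to-end indexing rate in documents per second.

```shell
# Hypothetical helper: seconds needed to clear a backlog at a sustained
# indexing rate, rounded up to the next whole second.
# Arguments: backlog in events, sustained throughput in docs per second.
catch_up_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

secs=$(catch_up_seconds 10000000 1000)
echo "$secs s"                                            # prints 10000 s
echo "$(( secs / 3600 )) h $(( secs % 3600 / 60 )) min"   # prints 2 h 46 min
```

The estimate ignores rows that keep arriving in PostgreSQL during catch-up, so treat it as a lower bound.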
Tuning for faster recovery
If the backlog is large and time-to-recovery is critical, temporarily increase batch size and workers:
# logstash.yml — increase for catch-up, then revert to defaults
pipeline.workers: 4
pipeline.batch.size: 500
pipeline.batch.delay: 5
Restart Logstash after editing. Monitor Elasticsearch CPU and indexing queue depth — reduce the
values if Elasticsearch shows pressure (GET /_cat/thread_pool/write?v).
Monitoring pipeline lag
# Events currently queued (trends toward 0 as the pipeline catches up)
curl http://localhost:9600/_node/stats/pipelines | \
jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'
# Latest PostgreSQL row ID — difference from metadata value is remaining backlog
docker exec iap-postgres psql -U inappprotection -c \
"SELECT MAX(id) FROM iap_app_event;"
Dead-Letter Queue
The dead-letter queue (dead_letter_queue.enable: true) is not used in this pipeline.
All writes use a stable document_id derived from the PostgreSQL row ID, eliminating
permanent indexing failures. The source data in PostgreSQL is never deleted — any event can
be replayed by resetting the metadata file. The dead-letter queue would add operational overhead
without providing meaningful benefit.
High Availability Deployment
Logstash has no built-in clustering. HA is achieved at the infrastructure level. The primary
challenge is tracking state: each pipeline stores last_run_value in a local metadata file.
The recommended approach uses a shared volume so that a standby instance can resume exactly where
the active instance left off.
Active/Passive with Shared Volume
Mount the logstash-data volume from shared network storage (NFS, Azure Files, AWS EFS). Run
two Logstash containers; only one is active at a time. The standby container is kept stopped and
started — manually or by an orchestrator — when the active instance fails.
┌─────────────────────┐      shared NFS volume      ┌─────────────────────┐
│ Logstash (active)   │ ──────────────────────────▶ │ Logstash (standby)  │
│ reads/writes .meta  │ ◀────────────────────────── │ stopped             │
└─────────────────────┘                             └─────────────────────┘
Characteristics:
- No application changes required.
- The standby resumes from exactly where the active left off — no gaps or duplicates.
- Simple to operate.
- Requires shared network storage.
- Failover is manual or orchestrator-driven; recovery time depends on detection speed.
Setup:
- Provision a shared volume (Azure Files share, NFS mount, or AWS EFS).
- Mount the volume at /usr/share/logstash/data on both containers.
- Start only the active container. Configure your orchestrator (Docker Swarm, Kubernetes) to restart it on failure and to start the standby only if the active container remains unhealthy.
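Where no orchestrator is available, the "start the standby only if the active container remains unhealthy" rule could be approximated by a watchdog that counts consecutive failed health checks. The sketch below shows only the decision logic; the function name failover_action, the action labels, and the threshold of 3 are all hypothetical.

```shell
# Hypothetical decision helper for a watchdog loop.
# Arguments: result of the latest health check ("healthy" or anything else),
#            number of consecutive failures seen before this check,
#            failures required before the standby is started.
# Prints the action to take followed by the updated failure count.
failover_action() {
    if [ "$1" = "healthy" ]; then
        echo "none 0"
    elif [ $(( $2 + 1 )) -ge "$3" ]; then
        echo "start-standby $(( $2 + 1 ))"
    else
        echo "wait $(( $2 + 1 ))"
    fi
}

failover_action healthy 2 3      # prints: none 0
failover_action unhealthy 1 3    # prints: wait 2
failover_action unhealthy 2 3    # prints: start-standby 3
```

A loop built around this would need to stop the active container before starting the standby, since both instances use the same data directory and should not run against it at the same time.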
Operational Runbooks
The following runbooks are written for operations teams unfamiliar with Logstash internals. Each runbook is self-contained.
Runbook 1 — Planned Elasticsearch Maintenance
Scope: Scheduled ES maintenance (upgrades, index management, node replacement).
Before maintenance
- Check that the queue is empty before stopping Logstash:
  curl http://localhost:9600/_node/stats/pipelines | \
  jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'
  All values should be 0. If they are not, wait for the pipeline to drain before proceeding.
- Stop Logstash gracefully:
  docker stop iap-logstash
  Stopping Logstash flushes in-memory state to the persistent queue on disk. No data is lost.
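The drain check in the first step could be scripted so that Logstash is stopped only once every pipeline reports an empty queue. This is a minimal sketch: the function name all_queues_empty is hypothetical, and the wiring in the comments assumes jq is installed on the host.

```shell
# Hypothetical helper: succeeds only when every line on stdin is "0".
# An empty input is treated as a failure so that a failed API call is not
# mistaken for an empty queue.
all_queues_empty() {
    seen=0
    while read -r count; do
        seen=1
        [ "$count" = "0" ] || return 1
    done
    [ "$seen" -eq 1 ]
}

# Example wiring (jq -r prints one bare number per pipeline; a pipeline
# without queue statistics prints "null" and counts as not empty):
#   until curl -s http://localhost:9600/_node/stats/pipelines \
#       | jq -r '.pipelines[].queue.events_count' | all_queues_empty; do
#       sleep 5
#   done
#   docker stop iap-logstash
```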
During maintenance
- Perform the Elasticsearch maintenance as planned.
- Verify Elasticsearch is healthy before restarting Logstash:
  curl -u <user>:<password> "https://<es-host>/_cluster/health?pretty"
  # status should be "green" or "yellow", not "red"
After maintenance
- Start Logstash:
  docker start iap-logstash
- Monitor catch-up until the queue reaches 0:
  watch -n 5 'curl -s http://localhost:9600/_node/stats/pipelines | \
  jq ".pipelines | to_entries[] | {pipeline: .key, queue: .value.queue.events_count}"'
- Verify that indices are receiving new documents:
  curl -u <user>:<password> \
  "https://<es-host>/_cat/indices/iap-rasp-*?v&h=index,docs.count,store.size"
Runbook 2 — Unplanned Elasticsearch Outage
Scope: Elasticsearch becomes unreachable unexpectedly (network failure, node crash, certificate expiry).
Detection
Logstash logs repeated errors such as:
[ERROR] elasticsearch output: connection refused / SSL handshake failed / ...
Check container logs:
docker logs iap-logstash --tail 50 | grep -i error
Kibana dashboards that stop updating are another indicator that the pipeline has stalled.
During the outage
No action is required. Logstash retries delivery automatically with exponential back-off. The
persistent queue absorbs incoming events up to queue.max_bytes (default: 1 GB). Once the queue
is full, the JDBC input pauses. Events are not lost — the source data remains intact in PostgreSQL.
After Elasticsearch recovers
- Verify Elasticsearch is reachable:
  curl -u <user>:<password> "https://<es-host>/_cluster/health?pretty"
- Logstash resumes delivery automatically. Monitor catch-up:
  watch -n 10 'curl -s http://localhost:9600/_node/stats/pipelines | \
  jq ".pipelines | to_entries[] | {pipeline: .key, queue: .value.queue.events_count}"'
- If Logstash appears stuck (queue not draining after 5+ minutes with Elasticsearch healthy), restart it:
  docker restart iap-logstash
- Once the queue reaches 0, verify document counts are increasing in Elasticsearch (see the final step of Runbook 1).
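The "stuck" condition above can be expressed as a comparison of two queue samples taken at least five minutes apart. The sketch below is one possible encoding; the function name restart_indicated is hypothetical, and "healthy" is taken to mean a cluster status of green or yellow, as in Runbook 1.

```shell
# Hypothetical helper encoding the restart rule above.
# Arguments: Elasticsearch cluster status, queued events at the first sample,
#            queued events at a second sample taken at least 5 minutes later.
# Succeeds (exit 0) when a restart is indicated.
restart_indicated() {
    case "$1" in
        green|yellow) ;;          # Elasticsearch is healthy, keep checking
        *) return 1 ;;            # outage still in progress, do not restart
    esac
    # Stuck means the queue is non-empty and has not shrunk.
    [ "$3" -gt 0 ] && [ "$3" -ge "$2" ]
}

if restart_indicated green 5000 5000; then
    echo "restart"                # prints: restart
fi
```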
Runbook 3 — Pipeline Stuck / Not Advancing
Scope: Logstash is running but the pipeline is not processing new events — Elasticsearch indices show no new documents despite data being written to PostgreSQL.
Step 1 — Verify Logstash is running
docker ps | grep iap-logstash
docker logs iap-logstash --tail 100 | grep -E "(ERROR|WARN|Pipeline started)"
Step 2 — Check queue depth
curl -s http://localhost:9600/_node/stats/pipelines | \
jq '.pipelines | to_entries[] | {pipeline: .key, queue_events: .value.queue.events_count}'
| Result | Action |
|---|---|
| Queue > 0 and growing | Elasticsearch output is failing — see Runbook 2 |
| Queue = 0, no new docs in ES | JDBC input is not reading new rows — proceed to Step 3 |
| Queue = 0, docs are increasing in ES | Pipeline is healthy; Kibana may need a refresh |
Step 3 — Inspect metadata files
docker exec iap-logstash cat /usr/share/logstash/data/rasp-events.metadata
docker exec iap-logstash cat /usr/share/logstash/data/rasp-flags-set.metadata
docker exec iap-logstash cat /usr/share/logstash/data/rasp-flags-cleared.metadata
Compare with the latest row IDs in PostgreSQL:
docker exec iap-postgres psql -U inappprotection -c \
"SELECT
(SELECT MAX(id) FROM iap_app_event) AS max_event_id,
(SELECT MAX(id) FROM iap_app_flag_history WHERE flag_status = 'SET') AS max_flag_set_id,
(SELECT MAX(id) FROM iap_app_flag_history WHERE flag_status = 'CLEARED') AS max_flag_cleared_id;"
| Metadata vs. DB max | Action |
|---|---|
| Metadata = DB max | Pipeline is up to date — no new data to ship |
| Metadata far ahead of DB max | Metadata is corrupted — proceed to Step 4 |
| Metadata far behind DB max, ES is healthy | Pipeline should be processing — proceed to Step 5 |
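The comparison in this step can be sketched as two small helpers. Both function names are hypothetical. The table speaks of "far ahead" and "far behind"; as a simplifying assumption, this sketch treats any lead as corruption and uses an optional slack argument for the lag that is acceptable in normal operation.

```shell
# Hypothetical helpers for comparing tracking state with PostgreSQL.

# Read a metadata file from stdin and print its numeric value.
# The file holds a single YAML document line such as "--- 12345".
last_value() {
    sed -n 's/^--- *\([0-9][0-9]*\) *$/\1/p'
}

# Map one pipeline's state onto the decision table above.
# Arguments: metadata value, MAX(id) from PostgreSQL, and an optional
# slack (rows the pipeline may lag in normal operation; default 0).
metadata_action() {
    meta=$1
    dbmax=$2
    slack=${3:-0}
    if [ "$meta" -gt "$dbmax" ]; then
        echo "reset-metadata"        # Step 4
    elif [ $(( dbmax - meta )) -le "$slack" ]; then
        echo "up-to-date"
    else
        echo "restart-logstash"      # Step 5, if Elasticsearch is healthy
    fi
}

# Example wiring for the events pipeline (-t -A make psql print the bare value):
#   meta=$(docker exec iap-logstash cat /usr/share/logstash/data/rasp-events.metadata | last_value)
#   dbmax=$(docker exec iap-postgres psql -U inappprotection -t -A -c "SELECT MAX(id) FROM iap_app_event;")
#   metadata_action "$meta" "$dbmax" 1000
```

The difference between the two values is also the remaining backlog in rows for that pipeline.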
Step 4 — Reset metadata (if corrupted)
Warning: Resetting to 0 causes Logstash to reprocess all historical data. This is safe (idempotent writes mean no Elasticsearch duplicates) but may take significant time for large datasets. To replay from a specific point, replace 0 with the desired PostgreSQL row ID.
docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-events.metadata'
docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-flags-set.metadata'
docker exec iap-logstash sh -c 'echo "--- 0" > /usr/share/logstash/data/rasp-flags-cleared.metadata'
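When replaying from a specific row ID rather than 0, it may help to validate the value before it is written, since a malformed metadata file is the very condition this step repairs. A minimal sketch, with metadata_line as a hypothetical name and 12345 as a placeholder ID:

```shell
# Hypothetical helper: prints the metadata file content for a given row ID,
# rejecting anything that is not a non-negative whole number.
metadata_line() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid row ID: $1" >&2; return 1 ;;
    esac
    printf '%s %s\n' '---' "$1"
}

metadata_line 12345    # prints: --- 12345

# Example wiring (the file is only written if validation succeeds):
#   line=$(metadata_line 12345) &&
#       docker exec iap-logstash sh -c "echo '$line' > /usr/share/logstash/data/rasp-events.metadata"
```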
Step 5 — Restart Logstash
docker restart iap-logstash
docker logs iap-logstash --tail 50 | grep -E "(ERROR|Pipeline started)"
All three pipelines should log Pipeline started within 30 seconds. If errors persist, escalate
to the development team with the full log output.
Alternative HA Approaches
The following approaches are documented for completeness. They are not the recommended deployment model for this pipeline.
External Tracking in PostgreSQL
Replace the local metadata file with a PostgreSQL-backed tracking table. A custom JDBC input
reads last_value from the database instead of from a local file, making state portable across
any number of instances without requiring shared storage.
-- Tracking table (create once)
CREATE TABLE logstash_tracking (
pipeline_id TEXT PRIMARY KEY,
last_value BIGINT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
INSERT INTO logstash_tracking (pipeline_id) VALUES
('rasp-events'), ('rasp-flags-set'), ('rasp-flags-cleared');
When to consider: Cloud-native deployments (Kubernetes, Azure Container Apps) where ephemeral containers cannot rely on persistent shared volumes.
Trade-offs: Requires custom pipeline configuration not supported out of the box by the standard
JDBC input use_column_value mechanism. Adds a PostgreSQL table and write permissions for the
Logstash user.
Centralised Queue (Kafka / Redis)
Introduce a message broker between the JDBC input and the Elasticsearch output:
PostgreSQL → Logstash (producer) → Kafka/Redis → Logstash (consumer, N instances) → Elasticsearch
Not suitable for this pipeline. The IAP reporting event rate is low (tens to hundreds of events per second) and idempotent Elasticsearch writes already provide data safety. The added complexity of a broker is not justified.