After years of building and operating Elasticsearch deployments at Wells Fargo (45 clusters, ~2,000 hosts) and now at APS (enterprise SIEM architecture), I’ve accumulated a set of hard-won lessons about what actually works at scale. This post covers the key architectural decisions and operational patterns that separate a resilient enterprise SIEM from one that breaks under production load.
Start With the Data Problem
The first mistake most teams make is thinking about Elasticsearch before thinking about their data. Before you spin up a single node, answer these questions:
- How many events per second at peak?
- What is the retention requirement per data tier?
- What are the query patterns? (Point lookups vs. aggregations vs. full-text search)
- What compliance obligations govern data residency and access?
Your answers determine your cluster topology, index lifecycle management (ILM) policy, and hardware sizing. At Wells Fargo, we ingested hundreds of thousands of events per second across multiple data centers. That required dedicated ingest, hot, warm, cold, and frozen tiers—not the single-tier setup that gets you started in a proof of concept.
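To make the sizing step concrete, here is a minimal back-of-the-envelope sketch. The 500-byte average event size, the 7-day retention, and the single replica are illustrative assumptions, not figures from any deployment described here, and treating peak events per second as a sustained rate gives an upper bound rather than a forecast.

```python
def daily_ingest_gb(events_per_second: float, avg_event_bytes: float) -> float:
    """Raw daily volume in GB (10^9 bytes), before replicas and index overhead."""
    seconds_per_day = 86_400
    return events_per_second * avg_event_bytes * seconds_per_day / 1e9


def tier_storage_gb(daily_gb: float, retention_days: int, replicas: int = 1) -> float:
    """Storage for one tier: daily volume x retention x (primary + replica copies)."""
    return daily_gb * retention_days * (1 + replicas)


if __name__ == "__main__":
    # Assumed inputs: 50,000 events/s at an average of 500 bytes per event.
    daily = daily_ingest_gb(events_per_second=50_000, avg_event_bytes=500)
    print(f"daily ingest: {daily:.0f} GB")
    print(f"hot tier, 7 days, 1 replica: {tier_storage_gb(daily, 7):.0f} GB")
```

Under those assumptions the stream is roughly 2.2 TB per day, which is already enough to rule out a single-tier cluster.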
Index Design Matters More Than You Think
Hot-warm-cold tiering only works if your index design supports it. The patterns I rely on:
Time-based indices with ILM rollover. A single index per data source per day (or per rollover threshold) gives you predictable shard sizes and clean ILM transitions. Avoid the temptation to put multiple data sources in one index: it forces them to share one mapping and one retention policy, which creates operational headaches later.
Shard sizing. Aim for 20–50 GB per shard in the hot tier. Under-sharded indices create hot spots; over-sharded indices create coordination overhead. For a 100 GB/day ingest stream written to daily indices, that’s roughly 2–5 primary shards per index, or 60–150 primary shards across a 30-day hot retention window.
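The shard arithmetic can be sketched as a small helper. The function name and the rounding choices are mine: round up for the fewest shards (so none exceeds 50 GB) and round down for the most (so none drops below 20 GB).

```python
import math


def primary_shard_range(index_size_gb: float,
                        min_shard_gb: float = 20,
                        max_shard_gb: float = 50) -> tuple[int, int]:
    """Return (fewest, most) primary shards that keep each shard in the target band."""
    # Fewest shards: round up so no shard exceeds max_shard_gb.
    fewest = max(1, math.ceil(index_size_gb / max_shard_gb))
    # Most shards: round down so no shard falls below min_shard_gb.
    most = max(fewest, math.floor(index_size_gb / min_shard_gb))
    return fewest, most


if __name__ == "__main__":
    print(primary_shard_range(100))  # one day of a 100 GB/day stream
```

For a 100 GB daily index this returns (2, 5). An index smaller than 20 GB always gets a single primary shard.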
Mapping discipline. Lock down your dynamic mapping. Unbounded dynamic mapping in a SIEM context will eventually explode your field count (a mapping explosion) and cause node instability. Use dynamic: false on the top-level mapping and explicitly map every field you actually query. Unmapped fields are still kept in _source, but they are not indexed and cannot be searched or aggregated.
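As a rough illustration, a locked-down mapping might look like the sketch below, built as a Python dict and serialized to the JSON you would send as the mappings section of a create-index request. The field names are ECS-style examples chosen for illustration, not a recommended schema.

```python
import json


def strict_siem_mapping() -> dict:
    """Mappings body with dynamic mapping disabled at the top level."""
    return {
        "mappings": {
            # New, unmapped fields stay in _source but are not indexed.
            "dynamic": False,
            "properties": {
                "@timestamp": {"type": "date"},
                "event": {"properties": {"action": {"type": "keyword"}}},
                "source": {"properties": {"ip": {"type": "ip"}}},
                "user": {"properties": {"name": {"type": "keyword"}}},
                "message": {"type": "text"},
            },
        }
    }


if __name__ == "__main__":
    print(json.dumps(strict_siem_mapping(), indent=2))
```

Every field an analyst queries or aggregates on has to appear under properties; anything else is stored but invisible to search.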
Kafka as a Buffer
One of the most important architectural decisions I’ve made is putting Kafka between log sources and Elasticsearch. This gives you:
- Backpressure handling. When Elasticsearch is under pressure (re-indexing, shard recovery, cluster rebalancing), Kafka holds the data instead of dropping it.
- Replay capability. If you need to re-ingest data with a corrected mapping or enrichment pipeline, you replay from Kafka rather than trying to recover from the source.
- Multi-consumer patterns. Security analytics, compliance archiving, and real-time alerting can all consume from the same Kafka topic with independent consumer groups.
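The three properties above all come from one mechanism: an append-only log with an independent read offset per consumer group. The toy model below is an in-memory stand-in written for illustration only. It is not a Kafka client, and the class and method names are hypothetical.

```python
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a single partition: an append-only log plus per-group offsets."""

    def __init__(self):
        self._log = []
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def produce(self, event):
        self._log.append(event)

    def poll(self, group, max_events=100):
        """Return the next batch for this group and advance only its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_events]
        self._offsets[group] = start + len(batch)
        return batch

    def seek(self, group, offset):
        """Rewind (or fast-forward) one group, e.g. to replay after a mapping fix."""
        self._offsets[group] = offset

    def lag(self, group):
        """Events written to the log but not yet read by this group."""
        return len(self._log) - self._offsets[group]


if __name__ == "__main__":
    topic = ToyTopic()
    for i in range(5):
        topic.produce({"id": i})
    topic.poll("alerting")            # alerting keeps up
    print(topic.lag("es-indexer"))    # the indexer is paused; events wait in the log
    topic.poll("es-indexer")          # the indexer catches up later
    topic.seek("es-indexer", 0)       # replay everything for re-ingestion
    print(len(topic.poll("es-indexer")))
```

A slow or paused indexer only increases its own lag, and a replay is just an offset reset for one group. Real Kafka adds partitions, retention limits, and committed offsets on top of this idea, so replay is only possible within the topic's retention window.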
The ILM Configuration I Wish I’d Used From Day One
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": { "max_primary_shard_size": "40gb", "max_age": "1d" },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 },
          "freeze": {}
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
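The shrink + forcemerge in warm is critical for read performance on older data. Once an index stops receiving writes, collapse it to a single shard and force-merge it to a single segment. Query performance on warm and cold data improves dramatically. One caveat: the freeze action is deprecated as of Elasticsearch 7.14 and does nothing in 8.x, so drop it from the cold phase on current versions.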
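To sanity-check the timeline, the sketch below maps an index's age to the phase this policy would place it in. It only mirrors the min_age thresholds from the policy. The helper name is mine, and it ignores how long ILM actions such as shrink take to run. With rollover in use, ILM measures min_age from the rollover time, not from index creation.

```python
# (phase, min_age in days) taken from the policy above, in order.
PHASE_MIN_AGE_DAYS = [("hot", 0), ("warm", 3), ("cold", 30), ("delete", 365)]


def phase_for_age(days_since_rollover: float) -> str:
    """Return the latest phase whose min_age the index has reached."""
    current = "hot"
    for phase, min_age in PHASE_MIN_AGE_DAYS:
        if days_since_rollover >= min_age:
            current = phase
    return current


if __name__ == "__main__":
    for age in (0, 2, 3, 29, 30, 364, 365):
        print(age, phase_for_age(age))
```

Under this policy an index spends at most about a day as the write index plus three days after rollover in hot, then 27 days in warm and 335 in cold before deletion.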
Operational Lessons
Automate a snapshot before every change. Index recoveries after a botched mapping change are painful. Before any schema migration or cluster upgrade, verify your snapshot repository is working and take a snapshot.
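A pre-change hook can be as small as two HTTP calls: verify the repository, then take a snapshot and wait for it to finish. The sketch below only builds the requests against the snapshot REST endpoints. The base URL, the repository name siem_backups, and the snapshot label are placeholders, and authentication and TLS are left out.

```python
import urllib.request
from datetime import datetime, timezone


def snapshot_requests(base_url: str, repository: str, label: str):
    """Build (verify, create) requests; the caller sends them with urllib.request.urlopen."""
    # Snapshot names must be lowercase, so normalize the label.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    snapshot_name = f"{label.lower()}-{stamp}"
    verify = urllib.request.Request(
        f"{base_url}/_snapshot/{repository}/_verify", method="POST"
    )
    create = urllib.request.Request(
        f"{base_url}/_snapshot/{repository}/{snapshot_name}?wait_for_completion=true",
        method="PUT",
    )
    return verify, create


if __name__ == "__main__":
    verify, create = snapshot_requests("http://localhost:9200", "siem_backups", "pre-upgrade")
    for req in (verify, create):
        print(req.get_method(), req.full_url)
```

Run the verify call first and abort the change if it fails; a snapshot request against a broken repository is the failure you want to catch before the migration, not after.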
Monitor your JVM heap. Keep sustained heap usage below 75%. Above 80%, you’ll start seeing GC pauses that cascade into shard timeout errors. For SIEM workloads, nodes with 64 GB of RAM work well, with the heap capped at roughly 30 GB so the JVM keeps using compressed object pointers. Watch your field data cache, too: heavy aggregations on keyword fields will blow your heap.
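A simple check against the 75% line can be sketched as follows. It takes the parsed JSON from GET _nodes/stats/jvm and reads heap_used_percent for each node. The sample payload is fabricated for the example, and a single reading is a spot check, so alerting on sustained usage would need several samples over time.

```python
def nodes_over_heap_threshold(node_stats: dict, threshold_percent: int = 75) -> dict:
    """Map node name -> heap_used_percent for nodes at or above the threshold."""
    over = {}
    for node in node_stats.get("nodes", {}).values():
        used = node["jvm"]["mem"]["heap_used_percent"]
        if used >= threshold_percent:
            over[node["name"]] = used
    return over


if __name__ == "__main__":
    # Made-up sample in the shape of a _nodes/stats/jvm response.
    sample = {
        "nodes": {
            "a1": {"name": "hot-01", "jvm": {"mem": {"heap_used_percent": 62}}},
            "b2": {"name": "hot-02", "jvm": {"mem": {"heap_used_percent": 83}}},
        }
    }
    print(nodes_over_heap_threshold(sample))
```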
Don’t use cross-cluster search as a crutch. CCS is useful but it adds latency and operational complexity. Where possible, replicate critical indices to a local cluster rather than querying across clusters for time-sensitive SIEM queries.
Test your disaster recovery. At Wells Fargo, we ran quarterly DR drills where we actually failed over to the warm standby cluster and ran SIEM operations from it. Running the drill once taught us more than six months of architectural review.
The Platform vs. Product Debate
One last thing: decide early whether you’re building a SIEM platform (infrastructure that your security team builds product on top of) or a SIEM product (a turnkey solution). In my experience, enterprise security teams need a platform—they have proprietary detection logic, custom data sources, and regulatory requirements that off-the-shelf SIEM products can’t accommodate without extensive customization anyway. Own the platform and give the security team the keys.
If you’re planning an Elasticsearch SIEM deployment and want to talk through architecture, data modeling, or operational runbooks, get in touch. I’ve seen most of the failure modes—and I can help you avoid them.