Hi, I'm Ivan Saakov, Engineering Manager at the inDrive Security Operations Center.
In this article, I share my experience and the architecture behind migrating Splunk Enterprise from a traditional on-premises bare-metal cluster with local disks to AWS using SmartStore technology. SmartStore enables the use of S3-compatible storage for warm Splunk buckets while keeping them fully searchable.
To keep this article focused, I intentionally omit the basic Splunk Enterprise installation and configuration steps and instead concentrate on the migration-specific settings that matter most.
The key achievement of this approach is zero downtime:
- Ingestion: The data stream never stops.
- Search: Search remains continuously available.
- SOC: Alerting continues to operate without degradation.
- User Experience: Users do not notice the switchover.
Splunk Support generally recommends stopping data ingestion for this type of migration. I deliberately deviated from that guidance. In a mission-critical Security Operations Center (SOC) environment, stopping ingestion or alerting is simply unacceptable.
1. Terminology and Architecture
Definitions and abbreviations:
- CM (Cluster Manager): The manager node for the IDX cluster.
- Compute: Amazon EC2 instances.
- DS (Deployment Server): A server for managing endpoint configurations
- HF (Heavy Forwarder): A dedicated node for ingesting, parsing, and routing data from specific sources such as APIs, databases, and syslog.
- IDX (Indexer): A node responsible for indexing and storing data.
- Local Storage (NVMe): Used strictly as cache, including hot buckets and cached warm data.
- Non-SmartStore (Source): The original bare-metal Splunk cluster. Local storage for hot, warm, and cold data with classic replication.
- Remote Storage (S3): The source of truth. Stores all warm buckets. Provides hardware independence and data durability through native AWS replication across Availability Zones (Multi-AZ).
- SH (Search Head): A node responsible for search and aggregation.
- SHC (Search Head Cluster): A highly available cluster of search nodes.
- SmartStore (Target): The target architecture in AWS.
2. SmartStore Logic
SmartStore changes the storage paradigm:
- Remote Storage (S3): A single storage layer. Regardless of the cluster Replication Factor, only one unique copy of each bucket is stored in S3. This delivers massive savings on storage costs.
- Indexer: Functions as a compute node, while its local disks are used exclusively as a search cache.
The Cache Manager, which runs on the indexer nodes, is responsible for intelligently managing the data lifecycle on fast local NVMe disks. Its behavior is based on two mechanisms:
- Eviction: Selectively clears the cache when space runs low. The algorithm understands file types: heavy data such as TSIDX files and raw journal archives are evicted first, while lightweight service files such as bloom filters stay on disk longer to keep searches fast.
- Fetch (Rehydration): During a search, the Cache Manager does not download an entire bucket from S3. It transparently fetches only the bucket components needed for the specific query, whether it is metadata, TSIDX, or journal data. This allows commands such as tstats to complete without downloading heavy raw logs at all. In addition, the Lookahead mechanism heuristically prefetches data to offset network latency.
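As an illustration, a query like the one below is typically answered from TSIDX data alone, so the Cache Manager has no reason to fetch the raw journal from S3 (the index name is made up):

```
| tstats count where index=network_fw by sourcetype
```

A raw-event search over the same index, by contrast, forces the journal slices to be downloaded into the local cache first.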
3. Hardware Sizing (AWS)
For SmartStore, the balance between CPU and local cache performance is critical. Using EBS volumes in AWS is possible, but in practice it is usually more expensive when the IOPS and throughput requirements are comparable.
Instance choice: i3en family
- Target: i3en.6xlarge (Storage Optimized).
- Throughput: 150 GB/day/peer, according to Splunk recommendations for this number of CPU cores and RAM.
- Storage: 15 TB NVMe Instance Store (RAID 0). The very large cache allows hot data to remain local, minimizing latency and reducing S3 request costs. Because S3 already holds a copy of the data, there is no need to mirror local disks with RAID 1 or 10.
- Network: 25 Gbps. This is critical for fast rehydration from S3 during heavy searches.
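As a sanity check on this sizing, the cache-retention math is simple. The numbers below are illustrative and ignore compression, replication, and TSIDX overhead:

```shell
# Back-of-envelope cache retention per indexer (illustrative numbers).
cache_gb=15000          # ~15 TB usable NVMe cache (RAID 0)
ingest_gb_per_day=150   # Splunk-recommended daily load for this instance size
echo "$(( cache_gb / ingest_gb_per_day ))"   # -> 100 days held in local cache
```

With these assumptions, roughly 100 days of data stay on local NVMe before eviction kicks in, which keeps the vast majority of searches off S3.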
Important nuance: Ephemeral storage
- i3en disks are Instance Store volumes, physically attached to the host.
- Reboot: Data is preserved.
- Stop / Terminate: 100 percent of the data on the local NVMe disks is destroyed completely and irreversibly. This is a hardware characteristic of AWS: when an EC2 instance is stopped, the disks are physically detached and cryptographically wiped.
Mitigation:
- Enable Termination Protection.
- Enable Stop Protection (EC2 Console -> Actions -> Instance Settings).
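If you manage instances from the AWS CLI, both protections can also be set there (the instance ID is a placeholder):

```
# Block accidental 'terminate' API calls
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-termination
# Block accidental 'stop' API calls, which would wipe the instance-store NVMe
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-stop
```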
4. Amazon S3 Configuration (Production Hardening)
5. Security and IAM: Hybrid Access
We required access to a single S3 bucket from both the Source and Target environments. The following set of S3 permissions proved sufficient:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAndLocation",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::splunk-bucket"
    },
    {
      "Sid": "ObjectRW",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::splunk-bucket/*"
    }
  ]
}
Access implementation
- Target: Use an IAM Role (Instance Profile). Never store access keys on EC2.
  - Validation: sudo -u splunk aws s3 ls s3://splunk-bucket
- Source: Use an IAM User.
  - Create a service user.
  - Generate an access_key and secret_key.
  - Specify them in indexes.conf.
  - After the migration is complete, revoke the keys.
6. Traffic Balancing (AWS ALB)
Proper ALB configuration is critical for stable ingestion through HTTP Event Collector (HEC) and for the Web UI.
Global Listener Settings
- Connection Idle Timeout: 290 sec.
- For HEC, the important setting is busyKeepAliveIdleTimeout=300 in the [http] stanza of inputs.conf. The ALB timeout must be strictly lower so that the load balancer closes the connection first. Otherwise, the client will receive a 502 Bad Gateway.
- HTTP/2: Enabled. However, HTTP/2 must also be enabled on the client side for senders that push traffic to the ALB. In practice, most HEC senders still use HTTP/1.1.
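On the indexer side, the matching keep-alive setting lives in the [http] stanza of inputs.conf. A minimal sketch, with values mirroring the timeouts above:

```
[http]
disabled = 0
busyKeepAliveIdleTimeout = 300
```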
Target Groups
A. HEC (IDX)
- Health Check: /services/collector/health verifies that the HEC endpoint is alive and serving requests at the application level, which is better than a simple TCP check.
- Stickiness: ON (cookie-based) if useACK=true. OFF if useACK=false.
B. Web GUI (SHC)
- Stickiness: ON (duration-based). The Web UI is stateful: both the user session and search job artifacts are bound to a specific SH node. Therefore, stickiness must be enabled in the ALB, otherwise users will bounce between SH nodes and encounter errors or re-logins.
7. Infrastructure Setup: Multisite Cluster
To achieve a zero-downtime migration, I used a temporary architecture in which two Splunk clusters run in parallel, each managed by its own Cluster Manager:
- Source CM: currently manages the existing bare-metal cluster.
- Target CM: manages the new multisite IDX cluster in AWS.
8. Site Architecture (Multisite)
In a SmartStore + Multisite configuration, it is critical to assign roles to sites correctly.
Architecture:
- site1: AWS Availability Zone A (IDX)
- site2: AWS Availability Zone B (IDX)
- site0: SHC
Why Site 0?
If you place SH in site1, Splunk automatically enables Search Affinity, attempting to read data from local peers. In SmartStore, that is counterproductive: the local cache may be empty while the required bucket resides in S3. Forcing site affinity interferes with Cache Manager logic and increases latency. Placing the SHC in site0 disables that behavior and allows the SHC to request data from any available peer. At the same time, each SHC node can still be placed in its own AWS Availability Zone without any issue.
Stage 1. Configure the Target CM (AWS)
At this stage, we establish connectivity for the multisite IDX cluster. Cluster Manager initialization command:
/opt/splunk/bin/splunk edit cluster-config \
  -mode manager \
  -multisite true \
  -site site1 \
  -available_sites site1,site2 \
  -site_replication_factor origin:1,total:2 \
  -site_search_factor origin:1,total:2 \
  -replication_factor 2 \
  -search_factor 2 \
  -cluster_label idx-aws-smartstore \
  -secret 'ClusterSecretKey'
server.conf:
[clustering]
mode = manager
multisite = true
available_sites = site1,site2
cluster_label = idx-aws-smartstore
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
constrain_singlesite_buckets = false
Notes:
- available_sites = site1,site2: only data-bearing sites are listed; site0 is intentionally excluded.
- origin:1,total:2: guarantees Availability Zone fault tolerance - one local copy in the AZ where the bucket was created and a second copy in another zone.
- constrain_singlesite_buckets = false: critical for SmartStore and for historical data migration - it allows old buckets to replicate without strict site affinity.
Target License Manager (LM)
The License Manager is a single source of truth and can be co-located with the CM. I recommend pointing both the Target and Source clusters at the new LM to avoid License Violations while the two environments run in parallel.
Note that all cluster components except the indexers - CM, SH, HF, and DS - should send _internal and _audit logs to the target AWS cluster via outputs.conf.
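A minimal outputs.conf sketch for those components (the group name and hostnames are illustrative):

```
[tcpout]
defaultGroup = aws_idx

[tcpout:aws_idx]
server = aws-idx1.example.com:9997, aws-idx2.example.com:9997
```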
Stage 2. Configure the Target IDX
Initialize peers in different AZs, that is, different sites.
For the node in AZ-A (site1):
/opt/splunk/bin/splunk edit cluster-config -mode peer -manager_uri https://aws-splunk-cm:8089 -multisite true -site site1 -secret 'ClusterSecretKey'
For the node in AZ-B (site2):
/opt/splunk/bin/splunk edit cluster-config -mode peer -manager_uri https://aws-splunk-cm:8089 -multisite true -site site2 -secret 'ClusterSecretKey'
server.conf:
[imds]
imds_version = v2
[cachemanager]
eviction_policy = lruk
# Keep ~100 GB of local disk free (value is in MB)
eviction_padding = 102400
It is important to configure incoming data streams. New indexers do not listen on any ports by default. Configure:
- 9997 for Splunk-to-Splunk (S2S).
- 8088 for HEC.

How you deploy these settings depends on how you manage IDX configuration.
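A minimal inputs.conf sketch for the indexers, assuming both listeners are needed (the HEC token is a placeholder):

```
# Splunk-to-Splunk receiving port for forwarders
[splunktcp://9997]
disabled = 0

# HTTP Event Collector endpoint behind the ALB
[http]
disabled = 0
port = 8088

# Example HEC token stanza (placeholder token)
[http://alb_hec]
token = 00000000-0000-0000-0000-000000000000
disabled = 0
```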
Integration Layer setup
Target HF: the following items must be migrated:
- Apps ($SPLUNK_HOME/etc/apps/).
- JDBC drivers.
- Checkpoints ($SPLUNK_HOME/var/lib/splunk/modinputs).
- Verify network connectivity to data sources.
- Keep all inputs on the Target HF disabled.
Target DS: the following items must be migrated:
- Deployment Apps ($SPLUNK_HOME/etc/deployment-apps/).
- Server Classes ($SPLUNK_HOME/etc/system/local/serverclass.conf).
- Verify permissions and consistency.
- Switching to the Target DS will be done later via DNS.
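The later DNS switchover works because endpoints address the DS by hostname in deploymentclient.conf; once the record points at the Target DS, clients re-register on their own. A sketch (the hostname is illustrative):

```
[target-broker:deploymentServer]
targetUri = splunk-ds.example.com:8089
```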
Stage 3. Build a Hybrid SHC
At this stage, we temporarily expand the Source SHC so that it can work with both the Source and Target IDX clusters at the same time.
This approach allows us to:
- automatically synchronize Knowledge Objects, such as manually created saved searches,
- preserve the KV Store state,
- transition users and alerting with zero downtime.
Hybrid SHC strategy:
- Configure the Source SHC to work with two CMs by using Multi-Cluster Search.
- Prepare the new SH nodes in AWS by retrieving the configuration bundle from the old Deployer.
- Add the new SH nodes to the existing SHC and shared quorum.
- After stabilization, move configuration management to the Target Deployer in AWS.
- After the migration is complete, remove the Source SH nodes from the SHC one by one.
Configure Multi-Cluster Search on the Source SHC
On every Source SH node, edit server.conf and replace the old [clustering] section with the following:
[clustering]
mode = searchhead
manager_uri = clustermanager:multi, clustermanager:single
[clustermanager:multi]
multisite = true
site = site0
manager_uri = https://aws-splunk-cm:8089
[clustermanager:single]
manager_uri = https://old-splunk-cm:8089
site = site0 is critical for SmartStore + Multisite. SH nodes must not participate in site affinity. Each SH node will query both IDX clusters in parallel.
Verify that the SH can see the new IDX nodes in AWS:
index=_internal | dedup splunk_server | table splunk_server
Initialize the Target SH
Important: During initialization, point to the old Deployer URL so the new SH nodes immediately retrieve the current application bundle.
/opt/splunk/bin/splunk init shcluster-config -mgmt_uri https://aws-splunk-search:8089 -replication_port 9200 -conf_deploy_fetch_url https://old-splunk-cm:8089 -secret 'OLDClusterSecretKey'
In server.conf, before restart, specify the same multi-cluster search configuration used on the Source SH nodes:
[clustering]
mode = searchhead
manager_uri = clustermanager:multi, clustermanager:single
[clustermanager:multi]
multisite = true
site = site0
manager_uri = https://aws-splunk-cm:8089
[clustermanager:single]
manager_uri = https://old-splunk-cm:8089
Join into a Single SHC
Join the Target SH to the existing Source SHC:
/opt/splunk/bin/splunk add shcluster-member -current_member_uri https://old-splunk-search:8089
Zero-downtime mechanics: what happens under the hood
During SHC consolidation, several independent mechanisms operate in parallel:
- Baseline Configuration Sync: The new SH node registers with the old Deployer and receives the full bundle before joining quorum. The Target SH nodes are fully compatible with the Source SH nodes.
- Dynamic Knowledge Object Replication: The Source SHC Captain initiates online replication of Knowledge Objects. Raft provides strict consistency.
- State Synchronization (KV Store): The SHC automatically extends the MongoDB replica set. This is asynchronous, but consistent.
- Multi-Cluster Search Connections: When a search starts, the SH sends the query to both IDX clusters in parallel. The actual data source remains transparent to end users and alerting.
Configuration Management Switchover (Deployer Switchover)
After SHC synchronization succeeds, configuration management can be fully moved to the Target Deployer across all SHC nodes. Edit server.conf on every SH:
[shclustering]
conf_deploy_fetch_url = https://aws-splunk-cm:8089
After that, execute a rolling restart of the SHC to apply the settings (for example, /opt/splunk/bin/splunk rolling-restart shcluster-members from the Captain).
Stage 4. Migrate Data to S3 (Push to Cloud)
At this stage, we begin migrating existing warm and cold index buckets from the old indexer cluster to AWS S3, which will then serve as the remote store for the Target cluster.
Migration strategy
I recommend migrating in stages. Start with one non-critical index such as test_index. Verify that bucket upload to S3 succeeds and that no errors are present. Then gradually add the remaining indexes, in batches or all at once, depending on channel throughput and system load.
indexes.conf on the Source IDX cluster:
[volume:remote_store]
storageType = remote
path = s3://splunk-bucket
remote.s3.region = eu-central-1
remote.s3.endpoint = https://s3.eu-central-1.amazonaws.com
remote.s3.encryption = sse-s3
remote.s3.supports_versioning = false
remote.s3.access_key = XXX
remote.s3.secret_key = XXX
[test_index]
remotePath = volume:remote_store/test_index
The selected indexes begin uploading existing warm and cold buckets to S3 in the background. The process is asynchronous and does not interrupt ingestion. New hot buckets continue to be written locally.
Migration validation
- Open the AWS Console in S3 and verify that prefixes named after the indexes appear. Alternatively, check the built-in dashboards in Monitoring Console -> Indexing -> SmartStore -> Instance.
- Migration Progress should move toward 100 percent.
- Upload Queue should decrease.
- Upload/Download Failures must remain at 0. Any errors here usually indicate network or IAM permission issues.
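Upload problems also leave traces in splunkd logs. A search along these lines helps narrow down affected hosts (exact component names can vary between Splunk versions):

```
index=_internal sourcetype=splunkd component=CacheManager log_level=ERROR
| stats count by host
```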
Force hot bucket rollover
To close active write files and turn all hot buckets into warm buckets so they can be uploaded to S3, use:
/opt/splunk/bin/splunk _internal call /data/indexes/*/roll-hot-buckets -auth admin:password
Do not move on to the next stage until Upload Queue reaches 0.
Stage 5. Attach the Target IDX Cluster
Apply the following configuration to the Target cluster in indexes.conf:
[default]
repFactor = auto
bucketMerging = true
homePath = /splunk_cache/$_index_name/db
coldPath = /splunk_cache/$_index_name/colddb
thawedPath = /splunk_cache/$_index_name/thaweddb
remotePath = volume:remote_store/$_index_name
[volume:remote_store]
storageType = remote
path = s3://splunk-bucket
remote.s3.region = eu-central-1
remote.s3.endpoint = https://s3.eu-central-1.amazonaws.com
remote.s3.encryption = sse-s3
remote.s3.supports_versioning = false
[test_index]
The Target SmartStore indexers ingest the configuration, connect to S3, discover the uploaded bucket metadata, begin serving searches over the data in S3 through the Cache Manager, and, after ingestion is switched over, start writing their own local hot buckets.
Cutover to the Target SmartStore Cluster
- DNS and ALB switch: Redirect all incoming traffic to the new infrastructure. Verify that the HEC layer is healthy: curl -k https://aws-splunk-hec.sec/services/collector/health.
- Activate Modular Inputs on the Target HF: enable the inputs and disable them on the Source HF. The checkpoints will resume correctly, and Splunk will continue collecting data from the point where it stopped.
Stage 6. Finalize SHC Configuration (Post-Migration Cleanup)
After a successful cutover, the Target IDX cluster handles ingestion and search, and the Source IDX cluster no longer participates in search.
Post-Migration Final State: remove the [clustermanager:multi] and [clustermanager:single] sections. Leave only a direct reference to the Target CM.
server.conf:
[general]
site = site0
serverName = aws-splunk-search
[license]
manager_uri = https://aws-splunk-cm:8089
[replication_port://9200]
[shclustering]
conf_deploy_fetch_url = https://aws-splunk-cm:8089
mgmt_uri = https://aws-splunk-search:8089
replication_factor = 3
shcluster_label = shc_aws_prod
[clustering]
mode = searchhead
multisite = true
manager_uri = https://aws-splunk-cm:8089
Remove Source SH nodes from the SHC
Run the removal command from any active SHC member, preferably the Captain, specifying the URI of the old server being removed, then stop that server:
/opt/splunk/bin/splunk remove shcluster-member -mgmt_uri https://old-splunk-search:8089
After this stage:
- The SHC is fully hosted in AWS.
- Only the Target Cluster Manager is in use.
- Multi-cluster search is disabled.
- The Source SH nodes are fully decommissioned.
Conclusion
The migration is complete. The Source IDX cluster is effectively no longer used. The Target IDX cluster operates in production mode, the data has been migrated and now resides in AWS S3, and the SHC has been fully moved to AWS. The old servers can be decommissioned permanently.
The key zero-downtime condition was achieved:
- Ingestion never stopped.
- Search and scheduled searches remained available.
- Alerting and SOC correlation were never interrupted.
- KV Store and Knowledge Objects were migrated without manual copying.
- The detection pipeline never stopped.
Happy Splunking!
