Hi, I'm Ivan Saakov, Engineering Manager at the inDrive Security Operations Center.
In this article, I share my experience and the architecture behind migrating Splunk Enterprise from a traditional on-premises bare-metal cluster with local disks to AWS using SmartStore technology. SmartStore enables the use of S3-compatible storage for warm Splunk buckets while keeping them fully searchable.
To keep this article focused, I intentionally omit the basic Splunk Enterprise installation and configuration steps and instead concentrate on the migration-specific settings that matter most.
The key achievement of this approach is zero downtime:
- Ingestion: The data stream never stops.
- Search: Search remains continuously available.
- SOC: Alerting continues to operate without degradation.
- User Experience: Users do not notice the switchover.
Splunk Support generally recommends stopping data ingestion for this type of migration. I deliberately deviated from that guidance. In a mission-critical Security Operations Center (SOC) environment, stopping ingestion or alerting is simply unacceptable.
1. Terminology and Architecture
Definitions and abbreviations:
- CM (Cluster Manager): The manager node for the IDX cluster.
- Compute: Amazon EC2 instances.
- DS (Deployment Server): A server for managing endpoint configurations
- HF (Heavy Forwarder): A dedicated node for ingesting, parsing, and routing data from specific sources such as APIs, databases, and syslog.
- IDX (Indexer): A node responsible for indexing and storing data.
- Local Storage (NVMe): Used strictly as cache, including hot buckets and cached warm data.
- Non-SmartStore (Source): The original bare-metal Splunk cluster. Local storage for hot, warm, and cold data with classic replication.
- Remote Storage (S3): The source of truth. Stores all warm buckets. Provides hardware independence and data durability through native AWS replication across Availability Zones (Multi-AZ).
- SH (Search Head): A node responsible for search and aggregation.
- SHC (Search Head Cluster): A highly available cluster of search nodes.
- SmartStore (Target): The target architecture in AWS.
2. SmartStore Logic
SmartStore changes the storage paradigm:
- Remote Storage (S3): A single storage layer. Regardless of the cluster Replication Factor, only one unique copy of each bucket is stored in S3. This delivers massive savings on storage costs.
- Indexer: Functions as a compute node, while its local disks are used exclusively as a search cache.
The Cache Manager, which runs on the indexer nodes, is responsible for intelligently managing the data lifecycle on fast local NVMe disks. Its behavior is based on two mechanisms:
- Eviction: Selectively clears the cache when space runs low. The algorithm understands file types: heavy data such as TSIDX files and raw journal archives are evicted first, while lightweight service files such as bloom filters stay on disk longer to keep searches fast.
- Fetch (Rehydration): During a search, the Cache Manager does not download an entire bucket from S3. It transparently fetches only the bucket components needed for the specific query, whether it is metadata, TSIDX, or journal data. This allows commands such as tstats to complete without downloading heavy raw logs at all. In addition, the Lookahead mechanism heuristically prefetches data to offset network latency.
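As an illustration, a query like the one below is typically answered from TSIDX data alone, so the Cache Manager has no reason to fetch the raw journal from S3 (the index name is made up):

```
| tstats count where index=network_fw by sourcetype
```

A raw-event search over the same index, by contrast, forces the journal slices to be downloaded into the local cache first.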
3. Hardware Sizing (AWS)
For SmartStore, the balance between CPU and local cache performance is critical. Using EBS volumes in AWS is possible, but in practice it is usually more expensive when the IOPS and throughput requirements are comparable.
Instance choice: i3en family
- Target: i3en.6xlarge (Storage Optimized).
- Throughput: 150 GB/day/peer, according to Splunk recommendations for this number of CPU cores and RAM.
- Storage: 15 TB NVMe Instance Store (RAID 0). The very large cache allows hot data to remain local, minimizing latency and reducing S3 request costs. Because S3 already holds a copy of the data, there is no need to mirror local disks with RAID 1 or 10.
- Network: 25 Gbps. This is critical for fast rehydration from S3 during heavy searches.
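As a sanity check on this sizing, the cache-retention math is simple. The numbers below are illustrative and ignore compression, replication, and TSIDX overhead:

```shell
# Back-of-envelope cache retention per indexer (illustrative numbers).
cache_gb=15000          # ~15 TB usable NVMe cache (RAID 0)
ingest_gb_per_day=150   # Splunk-recommended daily load for this instance size
echo "$(( cache_gb / ingest_gb_per_day ))"   # -> 100 days held in local cache
```

With these assumptions, roughly 100 days of data stay on local NVMe before eviction kicks in, which keeps the vast majority of searches off S3.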
Important nuance: Ephemeral storage
- i3en disks are Instance Store volumes, physically attached to the host.
- Reboot: Data is preserved.
- Stop / Terminate: 100 percent of the data on the local NVMe disks is destroyed completely and irreversibly. This is a hardware characteristic of AWS: when an EC2 instance is stopped, the disks are physically detached and cryptographically wiped.
Mitigation:
- Enable Termination Protection.
- Enable Stop Protection (EC2 Console -> Actions -> Instance Settings).
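If you manage instances from the AWS CLI, both protections can also be set there (the instance ID is a placeholder):

```
# Block accidental 'terminate' API calls
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-termination
# Block accidental 'stop' API calls, which would wipe the instance-store NVMe
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --disable-api-stop
```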
4. Amazon S3 Configuration (Production Hardening)
5. Security and IAM: Hybrid Access
We required access to a single S3 bucket from both the Source and Target environments. The following set of S3 permissions proved sufficient:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAndLocation",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": "arn:aws:s3:::splunk-bucket"
    },
    {
      "Sid": "ObjectRW",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload"
      ],
      "Resource": "arn:aws:s3:::splunk-bucket/*"
    }
  ]
}
Access implementation
- Target: Use an IAM Role (Instance Profile). Never store access keys on EC2.
  - Validation: sudo -u splunk aws s3 ls s3://splunk-bucket
- Source: Use an IAM User.
  - Create a service user.
  - Generate an access_key and secret_key.
  - Specify them in indexes.conf.
  - After the migration is complete, revoke the keys.
6. Traffic Balancing (AWS ALB)
Proper ALB configuration is critical for stable ingestion through HTTP Event Collector (HEC) and for the Web UI.
Global Listener Settings
- Connection Idle Timeout: 290 sec.
- For HEC, the important setting is busyKeepAliveIdleTimeout=300 in the [http] stanza of inputs.conf. The ALB timeout must be strictly lower so that the load balancer closes the connection first. Otherwise, the client will receive a 502 Bad Gateway.
- HTTP/2: Enabled. However, HTTP/2 must also be enabled on the client side for senders that push traffic to the ALB. In practice, most HEC senders still use HTTP/1.1.
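On the indexer side, the matching keep-alive setting lives in the [http] stanza of inputs.conf. A minimal sketch, with values mirroring the timeouts above:

```
[http]
disabled = 0
busyKeepAliveIdleTimeout = 300
```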
Target Groups
A. HEC (IDX)
- Health Check: /services/collector/health verifies that the HEC endpoint is alive and serving requests at the application level, which is better than a simple TCP check.
- Stickiness: ON (cookie-based) if useACK=true. OFF if useACK=false.
B. Web GUI (SHC)
- Stickiness: ON (duration-based). The Web UI is stateful: both the user session and search job artifacts are bound to a specific SH node. Therefore, stickiness must be enabled in the ALB, otherwise users will bounce between SH nodes and encounter errors or re-logins.
7. Infrastructure Setup: Multisite Cluster
To achieve a zero-downtime migration, I used a temporary architecture in which two Splunk clusters run in parallel, each managed by its own Cluster Manager:
- Source CM: currently manages the existing bare-metal cluster.
- Target CM: manages the new multisite IDX cluster in AWS.
8. Site Architecture (Multisite)
In a SmartStore + Multisite configuration, it is critical to assign roles to sites correctly.
Architecture:
- site1: AWS Availability Zone A (IDX)
- site2: AWS Availability Zone B (IDX)
- site0: SHC
Why Site 0?
If you place SH in site1, Splunk automatically enables Search Affinity, attempting to read data from local peers. In SmartStore, that is counterproductive: the local cache may be empty while the required bucket resides in S3. Forcing site affinity interferes with Cache Manager logic and increases latency. Placing the SHC in site0 disables that behavior and allows the SHC to request data from any available peer. At the same time, each SHC node can still be placed in its own AWS Availability Zone without any issue.
Stage 1. Configure the Target CM (AWS)
At this stage, we establish connectivity for the multisite IDX cluster. Cluster Manager initialization command:
/opt/splunk/bin/splunk edit cluster-config \
  -mode manager \
  -multisite true \
  -site site1 \
  -available_sites site1,site2 \
  -site_replication_factor origin:1,total:2 \
  -site_search_factor origin:1,total:2 \
  -replication_factor 2 \
  -search_factor 2 \
  -cluster_label idx-aws-smartstore \
  -secret 'ClusterSecretKey'
server.conf:
[clustering]
mode = manager
multisite = true
available_sites = site1,site2
cluster_label = idx-aws-smartstore
site_replication_factor = origin:1,total:2
site_search_factor = origin:1,total:2
constrain_singlesite_buckets = false
Notes:
- available_sites = site1,site2: only data-bearing sites are listed; site0 is intentionally excluded.
- origin:1,total:2: guarantees Availability Zone fault tolerance - one local copy in the AZ where the bucket was created and a second copy in another zone.
- constrain_singlesite_buckets = false: critical for SmartStore and for historical data migration - it allows old buckets to replicate without strict site affinity.
Target License Manager (LM)
The License Manager is a single source of truth and can be co-located with the CM. I recommend pointing both the Target and Source clusters at the new LM to avoid License Violations while the two environments run in parallel.
Note that all cluster components except the indexers - CM, SH, HF, and DS - should send _internal and _audit logs to the target AWS cluster via outputs.conf.
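A minimal outputs.conf sketch for those components (the group name and hostnames are illustrative):

```
[tcpout]
defaultGroup = aws_idx

[tcpout:aws_idx]
server = aws-idx1.example.com:9997, aws-idx2.example.com:9997
```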
Stage 2. Configure the Target IDX
Initialize peers in different AZs, that is, different sites.
For the node in AZ-A (site1):
/opt/splunk/bin/splunk edit cluster-config -mode peer -manager_uri https://aws-splunk-cm:8089 -multisite true -site site1 -secret 'ClusterSecretKey'
For the node in AZ-B (site2):
/opt/splunk/bin/splunk edit cluster-config -mode peer -manager_uri https://aws-splunk-cm:8089 -multisite true -site site2 -secret 'ClusterSecretKey'
server.conf:
[imds]
imds_version = v2
[cachemanager]
eviction_policy = lruk
# Keep ~100 GB of local disk free (value is in MB)
eviction_padding = 102400
It is important to configure incoming data streams. New indexers do not listen on any ports by default. Configure:
- 9997 for Splunk-to-Splunk (S2S).
- 8088 for HEC.

How you deploy these settings depends on how you manage IDX configuration.
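A minimal inputs.conf sketch for the indexers, assuming both listeners are needed (the HEC token is a placeholder):

```
# Splunk-to-Splunk receiving port for forwarders
[splunktcp://9997]
disabled = 0

# HTTP Event Collector endpoint behind the ALB
[http]
disabled = 0
port = 8088

# Example HEC token stanza (placeholder token)
[http://alb_hec]
token = 00000000-0000-0000-0000-000000000000
disabled = 0
```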
Integration Layer setup
Target HF: the following items must be migrated:
- Apps ($SPLUNK_HOME/etc/apps/).
- JDBC drivers.
- Checkpoints ($SPLUNK_HOME/var/lib/splunk/modinputs).
- Verify network connectivity to data sources.
- Keep all inputs on the Target HF disabled.
Target DS: the following items must be migrated:
- Deployment Apps ($SPLUNK_HOME/etc/deployment-apps/).
- Server Classes ($SPLUNK_HOME/etc/system/local/serverclass.conf).
- Verify permissions and consistency.
- Switching to the Target DS will be done later via DNS.
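The later DNS switchover works because endpoints address the DS by hostname in deploymentclient.conf; once the record points at the Target DS, clients re-register on their own. A sketch (the hostname is illustrative):

```
[target-broker:deploymentServer]
targetUri = splunk-ds.example.com:8089
```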
Stage 3. Build a Hybrid SHC
At this stage, we temporarily expand the Source SHC so that it can work with both the Source and Target IDX clusters at the same time.
This approach allows us to:
- automatically synchronize Knowledge Objects, such as manually created saved searches,
- preserve the KV Store state,
- transition users and alerting with zero downtime.
Hybrid SHC strategy:
- Configure the Source SHC to work with two CMs by using Multi-Cluster Search.
- Prepare the new SH nodes in AWS by retrieving the configuration bundle from the old Deployer.
- Add the new SH nodes to the existing SHC and shared quorum.
- After stabilization, move configuration management to the Target Deployer in AWS.
- After the migration is complete, remove the Source SH nodes from the SHC one by one.
Configure Multi-Cluster Search on the Source SHC
On every Source SH node, edit server.conf and replace the old [clustering] section with the following:
[clustering]
mode = searchhead
manager_uri = clustermanager:multi, clustermanager:single
[clustermanager:multi]
multisite = true
site = site0
manager_uri = https://aws-splunk-cm:8089
[clustermanager:single]
manager_uri = https://old-splunk-cm:8089
site = site0 is critical for SmartStore + Multisite. SH nodes must not participate in site affinity. Each SH node will query both IDX clusters in parallel.
Verify that the SH can see the new IDX nodes in AWS:
index=_internal | dedup splunk_server | table splunk_server
Initialize the Target SH
Important: During initialization, point to the old Deployer URL so the new SH nodes immediately retrieve the current application bundle.
/opt/splunk/bin/splunk init shcluster-config -mgmt_uri https://aws-splunk-search:8089 -replication_port 9200 -conf_deploy_fetch_url https://old-splunk-cm:8089 -secret 'OLDClusterSecretKey'
In server.conf, before restart, specify the same multi-cluster search configuration used on the Source SH nodes:
[clustering]
mode = searchhead
manager_uri = clustermanager:multi, clustermanager:single
[clustermanager:multi]
multisite = true
site = site0
manager_uri = https://aws-splunk-cm:8089
[clustermanager:single]
manager_uri = https://old-splunk-cm:8089
Join into a Single SHC
Join the Target SH to the existing Source SHC:
/opt/splunk/bin/splunk add shcluster-member -current_member_uri https://old-splunk-search:8089
Zero-downtime mechanics: what happens under the hood
During SHC consolidation, several independent mechanisms operate in parallel:
- Baseline Configuration Sync: The new SH node registers with the old Deployer and receives the full bundle before joining quorum. The Target SH nodes are fully compatible with the Source SH nodes.
- Dynamic Knowledge Object Replication: The Source SHC Captain initiates online replication of Knowledge Objects. Raft provides strict consistency.
- State Synchronization (KV Store): The SHC automatically extends the MongoDB replica set. This is asynchronous, but consistent.
- Multi-Cluster Search Connections: When a search starts, the SH sends the query to both IDX clusters in parallel. The actual data source remains transparent to end users and alerting.
Configuration Management Switchover (Deployer Switchover)
After SHC synchronization succeeds, configuration management can be fully moved to the Target Deployer across all SHC nodes. Edit server.conf on every SH:
[shclustering]
conf_deploy_fetch_url = https://aws-splunk-cm:8089
After that, execute a rolling restart of the SHC to apply the settings (for example, /opt/splunk/bin/splunk rolling-restart shcluster-members from the Captain).
Stage 4. Migrate Data to S3 (Push to Cloud)
At this stage, we begin migrating existing warm and cold index buckets from the old indexer cluster to AWS S3, which will then serve as the remote store for the Target cluster.
Migration strategy
I recommend migrating in stages. Start with one non-critical index such as test_index. Verify that bucket upload to S3 succeeds and that no errors are present. Then gradually add the remaining indexes, in batches or all at once, depending on channel throughput and system load.
indexes.conf on the Source IDX cluster:
[volume:remote_store]
storageType = remote
path = s3://splunk-bucket
remote.s3.region = eu-central-1
remote.s3.endpoint = https://s3.eu-central-1.amazonaws.com
remote.s3.encryption = sse-s3
remote.s3.supports_versioning = false
remote.s3.access_key = XXX
remote.s3.secret_key = XXX
[test_index]
remotePath = volume:remote_store/test_index
The selected indexes begin uploading existing warm and cold buckets to S3 in the background. The process is asynchronous and does not interrupt ingestion. New hot buckets continue to be written locally.
Migration validation
- Open the AWS Console in S3 and verify that prefixes named after the indexes appear. Alternatively, check the built-in dashboards in Monitoring Console -> Indexing -> SmartStore -> Instance.
- Migration Progress should move toward 100 percent.
- Upload Queue should decrease.
- Upload/Download Failures must remain at 0. Any errors here usually indicate network or IAM permission issues.
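Upload problems also leave traces in splunkd logs. A search along these lines helps narrow down affected hosts (exact component names can vary between Splunk versions):

```
index=_internal sourcetype=splunkd component=CacheManager log_level=ERROR
| stats count by host
```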
Force hot bucket rollover
To close active write files and turn all hot buckets into warm buckets so they can be uploaded to S3, use:
/opt/splunk/bin/splunk _internal call /data/indexes/*/roll-hot-buckets -auth admin:password
Do not move on to the next stage until Upload Queue reaches 0.
Stage 5. Attach the Target IDX Cluster
Apply the following configuration to the Target cluster in indexes.conf:
[default]
repFactor = auto
bucketMerging = true
homePath = /splunk_cache/$_index_name/db
coldPath = /splunk_cache/$_index_name/colddb
thawedPath = /splunk_cache/$_index_name/thaweddb
remotePath = volume:remote_store/$_index_name
[volume:remote_store]
storageType = remote
path = s3://splunk-bucket
remote.s3.region = eu-central-1
remote.s3.endpoint = https://s3.eu-central-1.amazonaws.com
remote.s3.encryption = sse-s3
remote.s3.supports_versioning = false
[test_index]
The Target SmartStore indexers ingest the configuration, connect to S3, discover the uploaded bucket metadata, begin serving searches over the data in S3 through the Cache Manager, and, after ingestion is switched over, start writing their own local hot buckets.
Cutover to the Target SmartStore Cluster
- DNS and ALB switch: Redirect all incoming traffic to the new infrastructure. Verify that the HEC layer is healthy: curl -k https://aws-splunk-hec.sec/services/collector/health.
- Activate Modular Inputs on the Target HF: enable the inputs and disable them on the Source HF. The checkpoints will resume correctly, and Splunk will continue collecting data from the point where it stopped.
Stage 6. Finalize SHC Configuration (Post-Migration Cleanup)
After a successful cutover, the Target IDX cluster handles ingestion and search, and the Source IDX cluster no longer participates in search.
Post-Migration Final State: remove the [clustermanager:multi] and [clustermanager:single] sections. Leave only a direct reference to the Target CM.
server.conf:
[general]
site = site0
serverName = aws-splunk-search
[license]
manager_uri = https://aws-splunk-cm:8089
[replication_port://9200]
[shclustering]
conf_deploy_fetch_url = https://aws-splunk-cm:8089
mgmt_uri = https://aws-splunk-search:8089
replication_factor = 3
shcluster_label = shc_aws_prod
[clustering]
mode = searchhead
multisite = true
manager_uri = https://aws-splunk-cm:8089
Remove Source SH nodes from the SHC
Run the removal command from any active SHC member, preferably the Captain, specifying the URI of the old server being removed, then stop that server:
/opt/splunk/bin/splunk remove shcluster-member -mgmt_uri https://old-splunk-search:8089
After this stage:
- The SHC is fully hosted in AWS.
- Only the Target Cluster Manager is in use.
- Multi-cluster search is disabled.
- The Source SH nodes are fully decommissioned.
Conclusion
The migration is complete. The Source IDX cluster is effectively no longer used. The Target IDX cluster operates in production mode, the data has been migrated and now resides in AWS S3, and the SHC has been fully moved to AWS. The old servers can be decommissioned permanently.
The key zero-downtime condition was achieved:
- Ingestion never stopped.
- Search and scheduled searches remained available.
- Alerting and SOC correlation were never interrupted.
- KV Store and Knowledge Objects were migrated without manual copying.
- The detection pipeline never stopped.
Happy Splunking!
