There is a requirement to use Spark Operator in a K8s cluster to run Spark jobs. The official image contains many vulnerabilities, a large share of them coming from the bundled Hadoop libraries, so let's build our own Spark Operator image. For the build we need two images: a Spark image as the base image and a Golang image to compile Spark Operator itself.

Spark image

The Spark image is built without Hadoop, from the "bin-without-hadoop" distribution of the required Spark version:

```
RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
    && tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
    && mv spark-3.5.1-bin-without-hadoop /opt/spark \
    && rm spark-3.5.1-bin-without-hadoop.tgz
```

Spark-operator image

For the Spark Operator image we will also need several Hadoop libraries, because the operator runs spark-submit. The example below is the FIPS variant of the build; it differs from a regular build only in the build and run options:

- for the Go build, GOEXPERIMENT=boringcrypto is set;
- for running spark-submit, the Java truststore password option used together with Bouncy Castle is passed: -Djavax.net.ssl.trustStorePassword=password (a sketch of preparing such a truststore follows the Dockerfile).

You can build the image without the FIPS changes.

To make spark-submit work, we add the following Hadoop libraries during the build:

- hadoop-client-runtime
- hadoop-client-api
- slf4j-api

entrypoint.sh is taken from the official Kubeflow repository: https://github.com/kubeflow/spark-operator/blob/master/entrypoint.sh

Example Dockerfile for building Spark Operator:

```
ARG SPARK_IMAGE=spark-3.5.1-bin-without-hadoop
ARG GOLANG_IMAGE=golang-1.21
ARG ECR_URL
ARG SPARK_OPERATOR_VERSION=1.3.1
ARG HADOOP_VERSION_DEFAULT=3.4.0
ARG HADOOP_TMP_HOME="/opt/hadoop"
ARG TARGETARCH=amd64

# Prepare spark-operator build
FROM ${GOLANG_IMAGE} AS builder
WORKDIR /app/spark-operator
ARG SPARK_OPERATOR_VERSION
RUN curl -Ls https://github.com/kubeflow/spark-operator/archive/refs/tags/spark-operator-chart-${SPARK_OPERATOR_VERSION}.tar.gz \
    | tar -xz --strip-components 1 -C /app/spark-operator
RUN GOTOOLCHAIN=go1.22.3 go mod download

# Build (GOEXPERIMENT=boringcrypto for the FIPS variant)
ARG TARGETARCH
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} GO111MODULE=on \
    GOTOOLCHAIN=go1.22.3 GOEXPERIMENT=boringcrypto \
    go build -a -o /app/spark-operator/spark-operator main.go

# Install Hadoop jars
ARG HADOOP_VERSION_DEFAULT
ARG HADOOP_TMP_HOME
RUN mkdir -p ${HADOOP_TMP_HOME}
RUN curl -Ls https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION_DEFAULT}/hadoop-${HADOOP_VERSION_DEFAULT}.tar.gz \
    | tar -xz --strip-components 1 -C ${HADOOP_TMP_HOME}

# Prepare spark-operator image
FROM ${ECR_URL}:${SPARK_IMAGE}
WORKDIR /opt/spark-operator
USER root

ENV SPARK_HOME="/opt/spark"
ENV JAVA_HOME="/opt/jdk-11.0.21"
ENV SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
ENV PATH=${PATH}:${SPARK_HOME}/bin:${JAVA_HOME}/bin

RUN yum update -y && \
    yum install --setopt=tsflags=nodocs -y openssl && \
    yum clean all

ARG HADOOP_TMP_HOME
COPY --from=builder ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-runtime-*.jar \
    ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-api-*.jar \
    ${HADOOP_TMP_HOME}/share/hadoop/common/lib/slf4j-api-*.jar \
    /opt/spark/jars/
COPY --from=builder /app/spark-operator/spark-operator /opt/spark-operator/
COPY --from=builder /app/spark-operator/hack/gencerts.sh /usr/bin/
COPY entrypoint.sh /opt/spark-operator/
RUN chmod a+x /opt/spark-operator/entrypoint.sh

ENTRYPOINT ["/opt/spark-operator/entrypoint.sh"]
```
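The trustStorePassword option above assumes a truststore that the Bouncy Castle FIPS provider can open. A hypothetical sketch of creating one with keytool, where the CA file, keystore path, jar path, and password are placeholders and the bc-fips jar is assumed to be added to the image separately:

```
keytool -importcert -alias internal-ca -file ca.pem \
    -keystore /opt/spark/conf/truststore.bcfks \
    -storetype BCFKS \
    -providerclass org.bouncycastle.jcajce.provider.BouncyCastleFipsProvider \
    -providerpath /opt/spark/jars/bc-fips.jar \
    -storepass password -noprompt
```

The truststore path and type would then be passed to spark-submit alongside the password, e.g. -Djavax.net.ssl.trustStore and -Djavax.net.ssl.trustStoreType=BCFKS in SPARK_SUBMIT_OPTS.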
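To check that the resulting binary actually carries BoringCrypto, build the image and inspect the operator binary for BoringCrypto symbols. A minimal sketch, assuming the image is tagged spark-operator-fips:1.3.1 (registry, tag, and container name are placeholders); note that upstream Go links BoringCrypto only into cgo-enabled linux/amd64 and linux/arm64 builds, so this check is worth running rather than trusting the flag:

```
docker build --build-arg ECR_URL=<registry> -t spark-operator-fips:1.3.1 .

# Extract the compiled operator binary from the image
docker create --name so-tmp spark-operator-fips:1.3.1
docker cp so-tmp:/opt/spark-operator/spark-operator ./spark-operator
docker rm so-tmp

# A non-zero count means BoringCrypto symbols are present
go tool nm ./spark-operator | grep -ci boringcrypto
```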
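A quick smoke test that the Hadoop client jars ended up where spark-submit needs them (same placeholder tag as above):

```
# On a "without Hadoop" Spark build this fails with NoClassDefFoundError
# if the hadoop-client-* / slf4j-api jars are missing from /opt/spark/jars
docker run --rm --entrypoint /opt/spark/bin/spark-submit \
    spark-operator-fips:1.3.1 --version
```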
Conclusion

After the build we are still left with a few vulnerabilities in the hadoop-client-runtime library:

- org.apache.avro:avro (hadoop-client-runtime-3.4.0.jar) – CVE-2023-39410
- org.apache.commons:commons-compress – CVE-2024-25710, CVE-2024-26308

spark-submit cannot run without this library, so these findings remain, but the huge remaining share of vulnerabilities goes away together with the core Hadoop libraries.
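To confirm exactly which findings remain, rescan the resulting image; a sketch assuming Trivy as the scanner (any image scanner will do, and the tag is a placeholder):

```
# The Java dependency report should now be limited to the avro and
# commons-compress findings inside hadoop-client-runtime
trivy image spark-operator-fips:1.3.1
```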