There is a requirement to use Spark Operator in a K8s cluster to run a Spark job. The official image contains many vulnerabilities, including some that come from the bundled Hadoop libraries, so let's build our own Spark Operator image. To do that, we need a Spark image as the base image and a Golang image to build Spark Operator itself.

Spark image

Build a Spark image without Hadoop, pinned to a specific version of Spark:

RUN curl -L https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-without-hadoop.tgz -o spark-3.5.1-bin-without-hadoop.tgz \
    && tar -xvzf spark-3.5.1-bin-without-hadoop.tgz \
    && mv spark-3.5.1-bin-without-hadoop /opt/spark \
    && rm spark-3.5.1-bin-without-hadoop.tgz

Spark-operator image

Next we build the Spark Operator image. It needs several Hadoop libraries in order to run submit commands. The example given here is the FIPS variant of the build, which differs from a regular build in the build and run commands:

For the Go build, the parameter GOEXPERIMENT=boringcrypto is used.
For running spark-submit, the Java parameter -Djavax.net.ssl.trustStorePassword=password is used for Bouncy Castle.

You can also build the image without the FIPS changes.

To run spark-submit, we add the following Hadoop libraries during the build process:

hadoop-client-runtime
hadoop-client-api
slf4j-api

entrypoint.sh is taken from the official Kubeflow repository: https://github.com/kubeflow/spark-operator/blob/master/entrypoint.sh

Example Dockerfile for building Spark Operator

ARG SPARK_IMAGE=spark-3.5.1-bin-without-hadoop
# Repository of the Spark base image used in the final FROM; it has no default, so pass it with --build-arg
ARG ECR_URL
ARG GOLANG_IMAGE=golang-1.21
ARG SPARK_OPERATOR_VERSION=1.3.1
ARG HADOOP_VERSION_DEFAULT=3.4.0
ARG HADOOP_TMP_HOME="/opt/hadoop"
ARG TARGETARCH=amd64

# Prepare spark-operator build
FROM ${GOLANG_IMAGE} AS builder
WORKDIR /app/spark-operator

ARG SPARK_OPERATOR_VERSION
RUN curl -Ls https://github.com/kubeflow/spark-operator/archive/refs/tags/spark-operator-chart-${SPARK_OPERATOR_VERSION}.tar.gz | tar -xz --strip-components 1 -C /app/spark-operator

RUN GOTOOLCHAIN=go1.22.3 go mod download

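# Build
# Note: Go's BoringCrypto backend needs cgo; with CGO_ENABLED=0 the build
# succeeds but falls back to the standard Go crypto packages.
ARG TARGETARCH
RUN CGO_ENABLED=0 GOOS=linux GOARCH=${TARGETARCH} GO111MODULE=on GOTOOLCHAIN=go1.22.3 GOEXPERIMENT=boringcrypto go build -a -o /app/spark-operator/spark-operator main.go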

# Install Hadoop jars
ARG HADOOP_VERSION_DEFAULT
ARG HADOOP_TMP_HOME
RUN mkdir -p ${HADOOP_TMP_HOME}
RUN curl -Ls https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION_DEFAULT}/hadoop-${HADOOP_VERSION_DEFAULT}.tar.gz | tar -xz --strip-components 1 -C ${HADOOP_TMP_HOME}

# Prepare spark-operator image
FROM ${ECR_URL}:${SPARK_IMAGE}
WORKDIR /opt/spark-operator
USER root

# Define SPARK_HOME and JAVA_HOME before they are used in PATH
ENV SPARK_HOME="/opt/spark"
ENV JAVA_HOME="/opt/jdk-11.0.21"
ENV SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} -Djavax.net.ssl.trustStorePassword=password"
# No trailing colon: an empty PATH entry would add the current directory to PATH
ENV PATH="${JAVA_HOME}/bin:${PATH}:${SPARK_HOME}/bin"

RUN yum update -y && \
    yum install --setopt=tsflags=nodocs -y openssl && \
    yum clean all

ARG HADOOP_TMP_HOME
COPY --from=builder ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-runtime-*.jar ${HADOOP_TMP_HOME}/share/hadoop/client/hadoop-client-api-*.jar ${HADOOP_TMP_HOME}/share/hadoop/common/lib/slf4j-api-*.jar /opt/spark/jars/

COPY --from=builder /app/spark-operator/spark-operator /opt/spark-operator/
COPY --from=builder /app/spark-operator/hack/gencerts.sh /usr/bin/

COPY entrypoint.sh /opt/spark-operator/
RUN chmod a+x /opt/spark-operator/entrypoint.sh
ENTRYPOINT ["/opt/spark-operator/entrypoint.sh"]

Conclusion

After the build, several vulnerabilities remain in the Hadoop library hadoop-client-runtime:

org.apache.avro:avro (hadoop-client-runtime-3.4.0.jar) – CVE-2023-39410
org.apache.commons:commons-compress – CVE-2024-25710, CVE-2024-26308

We cannot drop this library, because spark-submit does not run without it, but the large majority of the other vulnerabilities are removed along with the main Hadoop libraries.
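
Since the result depends on exactly three jars ending up in the Spark jars directory, it can be useful to check for them after the build. The following is a minimal sketch, not part of the original build: the function name check_hadoop_jars is hypothetical, and it assumes the jars are matched by the same name prefixes as the COPY globs in the Dockerfile.

```shell
#!/bin/sh
# Hypothetical helper (not from the original article): report which of the
# three jars copied by the Dockerfile are absent from a jars directory.
check_hadoop_jars() {
    jars_dir="$1"
    status=0

    # Same name prefixes as the COPY globs in the final build stage.
    for prefix in hadoop-client-runtime hadoop-client-api slf4j-api; do
        found=0
        for f in "$jars_dir"/"$prefix"-*.jar; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing: $prefix"
            status=1
        fi
    done

    # Non-zero exit status if at least one jar is missing.
    return "$status"
}
```

Inside the built image this could be called as check_hadoop_jars /opt/spark/jars; it prints nothing and returns 0 when all three jars are present.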

The code in this story is for educational purposes. The readers are solely responsible for whatever they build with it.

Building a Custom Docker Image for K8s Spark Operator to Fix Vulnerabilities
