This tutorial shows how the Alibaba Cloud Container team runs PyTorch on HDFS using Alluxio in a Kubernetes environment. The original Chinese article was published on Alibaba Cloud's engineering blog, then translated and published on Alluxio's Engineering Blog.

Google's TensorFlow and Facebook's PyTorch are two Deep Learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it due to its ease of use. By default, PyTorch does not support Deep Learning model training directly on data in HDFS, which creates challenges for users who store their data sets in HDFS. These users need to either export the HDFS data at the start of each training job or modify the source code of PyTorch to support reading from HDFS. Neither approach is ideal, because both require additional manual work that may introduce uncertainties to the training job. To avoid this problem, we chose to use Alluxio to access HDFS through a POSIX FileSystem interface. This approach greatly improved development efficiency at Alibaba Cloud. This article demonstrates how this was achieved within a Kubernetes environment.

Prepare HDFS 2.7.2 environment

For this tutorial, we used a Helm Chart to install HDFS to mock an existing HDFS cluster.

1. Install the Helm chart of Hadoop 2.7.2

```
git clone https://github.com/cheyang/kubernetes-HDFS.git

kubectl label nodes cn-huhehaote.192.168.0.117 hdfs-namenode-selector=hdfs-namenode-0

#helm install -f values.yaml hdfs charts/hdfs-k8s
helm dependency build charts/hdfs-k8s
helm install hdfs charts/hdfs-k8s \
   --set tags.ha=false \
   --set tags.simple=true \
   --set global.namenodeHAEnabled=false \
   --set hdfs-simple-namenode-k8s.nodeSelector.hdfs-namenode-selector=hdfs-namenode-0
```

2. Check the status of the Helm chart

```
kubectl get all -l release=hdfs
```

3. Access HDFS from the client pod

```
# kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
root@hdfs-client-f5bc448dd-rc28d:/# hdfs dfsadmin -report
Configured Capacity: 422481862656 (393.47 GB)
Present Capacity: 355748564992 (331.32 GB)
DFS Remaining: 355748515840 (331.32 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0

-------------------------------------------------
Live datanodes (2):

Name: 172.31.136.180:50010 (172-31-136-180.node-exporter.arms-prom.svc.cluster.local)
Hostname: iZj6c7rzs9xaeczn47omzcZ
Decommission Status : Normal
Configured Capacity: 211240931328 (196.73 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 32051716096 (29.85 GB)
DFS Remaining: 179189190656 (166.88 GB)
DFS Used%: 0.00%
DFS Remaining%: 84.83%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Xceivers: 1
Last contact: Tue Mar 31 16:48:52 UTC 2020
...
```

4. Check the HDFS client configuration

```
[root@iZj6c61fdnjcrcrc2sevsfZ kubernetes-HDFS]# kubectl exec -it hdfs-client-f5bc448dd-rc28d bash
root@hdfs-client-f5bc448dd-rc28d:/# cat /etc/hadoop-custom-conf
cat: /etc/hadoop-custom-conf: Is a directory
root@hdfs-client-f5bc448dd-rc28d:/# cd /etc/hadoop-custom-conf
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# ls
core-site.xml  hdfs-site.xml
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# cat core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020</value>
  </property>
</configuration>
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# cat hdfs-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/hadoop/dfs/data/0</value>
  </property>
</configuration>
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop --version
Error: No command named `--version' was found. Perhaps you meant `hadoop -version'
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop -version
Error: No command named `-version' was found. Perhaps you meant `hadoop version'
root@hdfs-client-f5bc448dd-rc28d:/etc/hadoop-custom-conf# hadoop version
Hadoop 2.7.2
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r b165c4fe8a74265c792ce23f546c64604acf0e41
Compiled by jenkins on 2016-01-26T00:08Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
```

5. Try out basic HDFS file operations

```
# hdfs dfs -ls /
Found 1 items
drwxr-xr-x   - root supergroup          0 2020-03-31 16:51 /test
# hdfs dfs -mkdir /mytest
# hdfs dfs -copyFromLocal /etc/hadoop/hadoop-env.cmd /test/
# hdfs dfs -ls /test
Found 2 items
-rw-r--r--   3 root supergroup       3670 2020-04-20 08:51 /test/hadoop-env.cmd
```

6. Download the data and upload it to HDFS

```
mkdir -p /data/MNIST/raw/
cd /data/MNIST/raw/

wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/train-images-idx3-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/train-labels-idx1-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/t10k-images-idx3-ubyte.gz
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/mnist/t10k-labels-idx1-ubyte.gz

hdfs dfs -mkdir -p /data/MNIST/raw
hdfs dfs -copyFromLocal *.gz /data/MNIST/raw
```
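Before moving on, it is worth confirming that the archives actually landed in HDFS; a quick check from the HDFS client pod:

```
# Should list the four MNIST .gz archives uploaded above.
hdfs dfs -ls /data/MNIST/raw
```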
Deploy Alluxio

1. First label the designated nodes, which can be one or more:

```
kubectl label nodes cn-huhehaote.192.168.0.117 dataset=mnist
```

2. Create config.yaml, in which you need to configure the node selector to target the labeled nodes:

```
cat << EOF > config.yaml
image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio
imageTag: "2.2.0-SNAPSHOT-b2c7e50"
nodeSelector:
  dataset: mnist
properties:
  alluxio.fuse.debug.enabled: "false"
  alluxio.user.file.writetype.default: MUST_CACHE
  alluxio.master.journal.folder: /journal
  alluxio.master.journal.type: UFS
  alluxio.master.mount.table.root.ufs: "hdfs://hdfs-namenode-0.hdfs-namenode.default.svc.cluster.local:8020"
worker:
  jvmOptions: " -Xmx4G "
master:
  jvmOptions: " -Xmx4G "
tieredstore:
  levels:
  - alias: MEM
    level: 0
    quota: 20GB
    type: hostPath
    path: /dev/shm
    high: 0.99
    low: 0.8
fuse:
  image: registry.cn-huhehaote.aliyuncs.com/alluxio/alluxio-fuse
  imageTag: "2.2.0-SNAPSHOT-b2c7e50"
  jvmOptions: " -Xmx4G -Xms4G "
  args:
    - fuse
    - --fuse-opts=direct_io
EOF
```

A few things to note:

- The HDFS version needs to be specified when the Alluxio image is compiled; this example uses a container image built to support HDFS 2.7.2.
- alluxio.master.mount.table.root.ufs is used to specify the HDFS address.
- quota is used to specify the upper limit of the cache.

For specific configuration, please refer to https://docs.alluxio.io/os/user/stable/en/deploy/Running-Alluxio-On-Kubernetes.html

3. Deploy Alluxio

```
wget http://kubeflow.oss-cn-beijing.aliyuncs.com/alluxio-0.12.0.tgz
tar -xvf alluxio-0.12.0.tgz
helm install alluxio -f config.yaml alluxio
```

4. Check the status of Alluxio and wait until all components are ready

```
# helm get manifest alluxio | kubectl get -f -
NAME                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                                   AGE
service/alluxio-master   ClusterIP   None         <none>        19998/TCP,19999/TCP,20001/TCP,20002/TCP   14h

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/alluxio-fuse     4         4         4       4            4           <none>          14h

NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/alluxio-worker   4         4         4       4            4           <none>          14h

NAME                              READY   AGE
statefulset.apps/alluxio-master   1/1     14h
```
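Once the workers and FUSE daemons are running, it is worth confirming that the HDFS data is visible through the FUSE mount point before building the training image. A quick sanity check, assuming the chart's default mount path /alluxio-fuse and a shell on one of the nodes labeled dataset=mnist:

```
# The path mirrors the HDFS namespace because HDFS is mounted as the root UFS.
# You should see the four MNIST archives uploaded earlier.
ls /alluxio-fuse/data/MNIST/raw
```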
Prepare PyTorch container image

1. Create a Dockerfile

```
mkdir pytorch-mnist
cd pytorch-mnist
vim Dockerfile
```

Populate the Dockerfile with the following content:

```
FROM pytorch/pytorch:1.4-cuda10.1-cudnn7-devel

ADD mnist.py /

CMD ["python", "/mnist.py"]
```
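The Dockerfile simply layers the training script (created in the next step) onto the official PyTorch base image. While iterating on the script, you can also skip rebuilding and bind-mount it into the base image directly; a minimal sketch, assuming Docker on the local machine, mnist.py in the current directory, and network access for the dataset download:

```
# CPU-only, single-epoch smoke test of the script without building an image.
docker run --rm -v "$(pwd)/mnist.py:/mnist.py" \
    pytorch/pytorch:1.4-cuda10.1-cudnn7-devel \
    python /mnist.py --epochs 1 --no-cuda
```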
2. Create a PyTorch python file called mnist.py

```
cd pytorch-mnist
vim mnist.py
```

Populate the python file with the following content:

```
# -*- coding: utf-8 -*-
# @Author: cheyang
# @Date:   2020-04-18 22:41:12
# @Last Modified by:   cheyang
# @Last Modified time: 2020-04-18 22:44:06
from __future__ import print_function
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.optim.lr_scheduler import StepLR


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output


def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


def test(model, device, test_loader):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            test_loss += F.nll_loss(output, target, reduction='sum').item()
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)

    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))


def main():
    # Training settings
    parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=14, metavar='N',
                        help='number of epochs to train (default: 14)')
    parser.add_argument('--lr', type=float, default=1.0, metavar='LR',
                        help='learning rate (default: 1.0)')
    parser.add_argument('--gamma', type=float, default=0.7, metavar='M',
                        help='Learning rate step gamma (default: 0.7)')
    parser.add_argument('--no-cuda', action='store_true', default=False,
                        help='disables CUDA training')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=10, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--save-model', action='store_true', default=False,
                        help='For Saving the current Model')
    args = parser.parse_args()
    use_cuda = not args.no_cuda and torch.cuda.is_available()

    torch.manual_seed(args.seed)

    device = torch.device("cuda" if use_cuda else "cpu")

    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.batch_size, shuffle=True, **kwargs)
    test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                       ])),
        batch_size=args.test_batch_size, shuffle=True, **kwargs)

    model = Net().to(device)
    optimizer = optim.Adadelta(model.parameters(), lr=args.lr)

    scheduler = StepLR(optimizer, step_size=1, gamma=args.gamma)
    for epoch in range(1, args.epochs + 1):
        train(args, model, device, train_loader, optimizer, epoch)
        test(model, device, test_loader)
        scheduler.step()

    if args.save_model:
        torch.save(model.state_dict(), "mnist_cnn.pt")


if __name__ == '__main__':
    main()
```
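Note that the script loads MNIST from the relative path '../data'. In the pytorch/pytorch base image the working directory is /workspace (an assumption worth verifying for your image), so '../data' resolves to /data, which is exactly where the Alluxio FUSE path will be mapped when the job is submitted below. A quick way to confirm the resolution:

```
# Prints where '../data' resolves inside the container; expect /data
# if the image's working directory is /workspace.
docker run --rm pytorch/pytorch:1.4-cuda10.1-cudnn7-devel \
    python -c "import os; print(os.path.abspath('../data'))"
```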
3. Build the image

Build the custom image in the same directory. The target container image in this example is registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel:

```
docker build -t \
    registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel .
```

4. Push the built image registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel to an image repository. This example uses a repository created in the China East 1 region, for users in the Greater China area (Alibaba Cloud); you can refer to Alibaba Cloud's documentation on basic image repository operations.

Submit PyTorch training tasks

1. Install arena

```
$ wget http://kubeflow.oss-cn-beijing.aliyuncs.com/arena-installer-0.3.3-332fcde-linux-amd64.tar.gz
$ tar -xvf arena-installer-0.3.3-332fcde-linux-amd64.tar.gz
$ cd arena-installer/
$ ./install.sh
$ yum install bash-completion -y
$ echo "source <(arena completion bash)" >> ~/.bashrc
$ chmod u+x ~/.bashrc
```

2. Use arena to submit the training task; remember to set the selector to dataset=mnist

```
arena submit tf \
    --name=alluxio-pytorch \
    --selector=dataset=mnist \
    --data-dir=/alluxio-fuse/data:/data \
    --gpus=1 \
    --image=registry.cn-shanghai.aliyuncs.com/tensorflow-samples/mnist:pytorch-1.4-cuda10.1-cudnn7-devel \
    "python /mnist.py"
```

3. View the training log through arena

```
# arena logs --tail=20 alluxio-pytorch
Train Epoch: 12 [49280/60000 (82%)]     Loss: 0.021669
Train Epoch: 12 [49920/60000 (83%)]     Loss: 0.008180
Train Epoch: 12 [50560/60000 (84%)]     Loss: 0.009288
Train Epoch: 12 [51200/60000 (85%)]     Loss: 0.035657
Train Epoch: 12 [51840/60000 (86%)]     Loss: 0.006190
Train Epoch: 12 [52480/60000 (87%)]     Loss: 0.007776
Train Epoch: 12 [53120/60000 (88%)]     Loss: 0.001990
Train Epoch: 12 [53760/60000 (90%)]     Loss: 0.003609
Train Epoch: 12 [54400/60000 (91%)]     Loss: 0.001943
Train Epoch: 12 [55040/60000 (92%)]     Loss: 0.078825
Train Epoch: 12 [55680/60000 (93%)]     Loss: 0.000925
Train Epoch: 12 [56320/60000 (94%)]     Loss: 0.018071
Train Epoch: 12 [56960/60000 (95%)]     Loss: 0.031451
Train Epoch: 12 [57600/60000 (96%)]     Loss: 0.031353
Train Epoch: 12 [58240/60000 (97%)]     Loss: 0.075761
Train Epoch: 12 [58880/60000 (98%)]     Loss: 0.003975
Train Epoch: 12 [59520/60000 (99%)]     Loss: 0.085389

Test set: Average loss: 0.0256, Accuracy: 9921/10000 (99%)
```
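Besides the logs, you can check the job's overall state with the arena CLI installed earlier; `arena get` prints the job's status and its pods (a quick check, based on arena 0.3.3 as installed above):

```
arena get alluxio-pytorch
```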
Summary

Previously, running a PyTorch program against data stored in HDFS required users to modify PyTorch adapter code in order to access the data. Using Alluxio, we were able to quickly develop and train models without any additional work to modify PyTorch code or manually copy HDFS data. This approach is further simplified by setting up the entire environment within Kubernetes.