Efficient workflow and reproducibility are crucially important components of every machine learning project. They enable you to:

- rapidly iterate over new models and compare different approaches faster;
- promote confidence in the results and transparency;
- save time and resources;
- and they serve as the foundation of this template.

This reasonable technology stack for deep learning prototyping, built around PyTorch Lightning and Hydra, provides a comprehensive and seamless solution, allowing you to effortlessly explore different tasks across a variety of hardware accelerators such as CPUs, multi-GPUs, and TPUs. Furthermore, it includes a curated collection of best practices and extensive documentation for greater clarity and comprehension.

This template can be used as is for some basic tasks like Classification, Segmentation, or Metric Learning, or be easily extended for any other task due to its high-level modularity and scalable structure.

As a baseline, I have used the gorgeous Lightning Hydra Template, reshaped and polished it, and implemented more features that can improve the overall efficiency of the workflow and reproducibility.

## Table of contents

- Main technologies
- Project structure
- Workflow - how it works
  - Basic workflow
  - LightningDataModule
  - LightningModule
  - Training loop
  - Evaluation and prediction loops
  - Callbacks
  - Logs
- Data
- Hyperparameters search
- Docker
- Tests
- Continuous integration

## Main technologies

- **PyTorch Lightning** - a lightweight deep learning framework / PyTorch wrapper for professional AI researchers and machine learning engineers who need maximal flexibility without sacrificing performance at scale.
- **Hydra** - a framework that simplifies configuring complex applications. The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.

## Project structure

The machine learning project structure may differ depending on the specific requirements and goals of the project, as well as the tools and frameworks being used. However, a typical directory structure of a machine learning project includes:

- `src/`
- `data/`
- `logs/`
- `tests/`
- some additional directories, like `notebooks/`, `docs/`, etc.
In this particular case, the directory structure looks like this:

```
├── configs                  <- Hydra configuration files
│   ├── callbacks            <- Callbacks configs
│   ├── datamodule           <- Datamodule configs
│   ├── debug                <- Debugging configs
│   ├── experiment           <- Experiment configs
│   ├── extras               <- Extra utilities configs
│   ├── hparams_search       <- Hyperparameter search configs
│   ├── hydra                <- Hydra settings configs
│   ├── local                <- Local configs
│   ├── logger               <- Logger configs
│   ├── module               <- Module configs
│   ├── paths                <- Project paths configs
│   ├── trainer              <- Trainer configs
│   │
│   ├── eval.yaml            <- Main config for evaluation
│   └── train.yaml           <- Main config for training
│
├── data                     <- Project data
├── logs                     <- Generated logs
├── notebooks                <- Jupyter notebooks
├── scripts                  <- Shell scripts
│
├── src                      <- Source code
│   ├── callbacks            <- Additional callbacks
│   ├── datamodules          <- Lightning datamodules
│   ├── modules              <- Lightning modules
│   ├── utils                <- Utility scripts
│   │
│   ├── eval.py              <- Run evaluation
│   └── train.py             <- Run training
│
├── tests                    <- Tests of any kind
│
├── .dockerignore            <- List of files ignored by docker
├── .gitattributes           <- List of attributes to pathnames
├── .gitignore               <- List of files ignored by git
├── .pre-commit-config.yaml  <- Configuration of pre-commit hooks
├── Dockerfile               <- Dockerfile
├── Makefile                 <- Makefile
├── pyproject.toml           <- Config for testing and linting
├── requirements.txt         <- Python dependencies
├── setup.py                 <- Setup file
└── README.md
```

## Workflow - how it works

Before starting a project, you should consider the following aspects to ensure the reproducibility of results:

- Docker image
- Freezing python package versions
- Git
- Data version control

Many data version control tools currently provide not just data versioning, but a lot of highly useful side features, like a Model Registry or Experiment Tracking:

- DVC
- Neptune
- Your own solution, or others...

Experiment tracking tools:

- Weights & Biases
- Neptune
- DVC
- Comet
- MLFlow
- TensorBoard
- Or just CSV files...

### Basic workflow

This template could be used as is for some basic tasks like Classification, Segmentation, or Metric Learning, but if you need to do something more complex, here is a general workflow:

1. Write your PyTorch Lightning module (see examples in `src/modules/single_module.py`)
2. Write your PyTorch Lightning datamodule (see examples in `src/datamodules/datamodules.py`)
3. Fill up your configs, in particular create experiment configs
4. Run experiments:
   - Run training with a chosen experiment config:
     `python src/train.py experiment=experiment_name.yaml`
   - Use hyperparameter search, for example by the Optuna Sweeper via Hydra multirun mode:
     `python src/train.py -m hparams_search=mnist_optuna`
   - Execute runs with some config parameters set manually:
     `python src/train.py -m logger=csv module.optimizer.weight_decay=0.0,0.00001,0.0001`
5. Run evaluation with different checkpoints, or run prediction on a custom dataset for additional analysis

The template contains an example with MNIST classification, which, by the way, is also used for the tests. If you run `python src/train.py`, you will get something like the terminal screen shown in the template documentation.

### LightningDataModule

At the start, you need to create a PyTorch Dataset for your task. It has to implement the `__getitem__` and `__len__` methods. You may be able to use the datasets already implemented in the template as is, or easily modify them. See more details in the PyTorch documentation.

Also, it could be useful to see the Data section about how data can be saved for training and evaluation.
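For illustration, here is a minimal sketch of such a dataset; the class name, constructor arguments and sample layout are hypothetical and not part of the template:

```python
from typing import Any, Callable, List, Optional, Tuple

from torch.utils.data import Dataset


class ImageClassificationDataset(Dataset):
    """Hypothetical minimal dataset; names and structure are illustrative only."""

    def __init__(self, samples: List[Tuple[Any, int]], transform: Optional[Callable] = None) -> None:
        # samples: e.g. a list of (image, label) pairs prepared beforehand
        self.samples = samples
        self.transform = transform

    def __len__(self) -> int:
        # total number of samples in the dataset
        return len(self.samples)

    def __getitem__(self, index: int) -> Tuple[Any, int]:
        # load a single sample and apply optional transforms
        image, label = self.samples[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, label
```

A dataset like this can then be wrapped by a DataModule, as described next.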
Then, you need to create a DataModule using the PyTorch Lightning `LightningDataModule` API. By default, the API has the following methods:

- `prepare_data` (optional): perform data operations on CPU via a single process, like loading and preprocessing data, etc.
- `setup` (optional): perform data operations on every GPU, like train/val/test splits, creating datasets, etc.
- `train_dataloader`: used to generate the training dataloader(s)
- `val_dataloader`: used to generate the validation dataloader(s)
- `test_dataloader`: used to generate the test dataloader(s)
- `predict_dataloader` (optional): used to generate the prediction dataloader(s)

See examples of datamodule configs in the `configs/datamodule` folder. An overview of the LightningDataModule API is shown in the template documentation.

By default, the template contains the following DataModules:

- `SingleDataModule`, in which `train_dataloader`, `val_dataloader` and `test_dataloader` return a single DataLoader, and `predict_dataloader` returns a list of DataLoaders
- `MultipleDataModule`, in which `train_dataloader` returns a dict of DataLoaders, and `val_dataloader`, `test_dataloader` and `predict_dataloader` return a list of DataLoaders

In the template, DataModules have a `_get_dataset_` method to simplify dataset instantiation.

### LightningModule

Next, you need to create a LightningModule using the PyTorch Lightning `LightningModule` API. The minimum API has the following methods:

- `forward`: use for inference only (separate from `training_step`)
- `training_step`: the complete training loop
- `validation_step`: the complete validation loop
- `test_step`: the complete test loop
- `predict_step`: the complete prediction loop
- `configure_optimizers`: define optimizers and LR schedulers

Also, you can override optional methods for each step to perform additional logic:

- `training_step_end`: training step end operations
- `training_epoch_end`: training epoch end operations
- `validation_step_end`: validation step end operations
- `validation_epoch_end`: validation epoch end operations
- `test_step_end`: test step end operations
- `test_epoch_end`: test epoch end operations

The LightningModule API methods and their calling order are shown in the template documentation.

In the template, the LightningModule has a `model_step` method to factor out repeated operations, like the `forward` or `loss` calculation, which are required in `training_step`, `validation_step` and `test_step`.

#### Metrics

The template offers the following Metrics API:

- `main` metric: the main metric, which is also used by callbacks and trackers like `model_checkpoint`, `early_stopping` or `scheduler.monitor`.
- `valid_best` metric: used for tracking the best validation metric. Usually, it can be `MaxMetric` or `MinMetric`.
- `additional` metrics: some additional metrics.

Each metric config should contain the `_target_` key with the metric class name and any other parameters required by the metric. The template allows you to use any metrics, for example from `torchmetrics` or implemented by yourself. See more details about the `torchmetrics` API, the implemented Metrics API, and the metrics config as a part of the network configs in the `configs/module/network` folder.

Metric config example:

```yaml
metrics:
  main:
    _target_: "torchmetrics.Accuracy"
    task: "binary"
  valid_best:
    _target_: "torchmetrics.MaxMetric"
  additional:
    AUROC:
      _target_: "torchmetrics.AUROC"
      task: "binary"
```
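As an illustration only (not necessarily the template's exact wiring), a metric config like the one above can be turned into metric objects with Hydra's `instantiate`:

```python
from hydra.utils import instantiate
from omegaconf import OmegaConf

# Illustrative only: build metric objects from a config shaped like the example above.
cfg = OmegaConf.create(
    {
        "metrics": {
            "main": {"_target_": "torchmetrics.Accuracy", "task": "binary"},
            "valid_best": {"_target_": "torchmetrics.MaxMetric"},
            "additional": {"AUROC": {"_target_": "torchmetrics.AUROC", "task": "binary"}},
        }
    }
)

main_metric = instantiate(cfg.metrics.main)        # torchmetrics.Accuracy(task="binary")
valid_best = instantiate(cfg.metrics.valid_best)   # torchmetrics.MaxMetric()
additional = {name: instantiate(m) for name, m in cfg.metrics.additional.items()}
```

The resulting objects are regular `torchmetrics` metrics that can be updated and computed inside the LightningModule steps.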
#### Loss

The template suggests the following Losses API:

- The loss config should contain the `_target_` key with the loss class name and any other required parameters.
- A parameter containing the string `weight` in its name will be wrapped by `torch.tensor` and cast to `torch.float` before being passed to the loss, due to the requirements of most losses.

The template allows you to use any losses, for example from the PyTorch Losses API or implemented by yourself. See more details about the implemented losses and the loss config as a part of the network configs in the `configs/module/network` folder.

Loss config examples:

```yaml
loss:
  _target_: "torch.nn.CrossEntropyLoss"
```

```yaml
loss:
  _target_: "torch.nn.BCEWithLogitsLoss"
  pos_weight: [0.25]
```

```yaml
loss:
  _target_: "src.modules.losses.VicRegLoss"
  sim_loss_weight: 25.0
  var_loss_weight: 25.0
  cov_loss_weight: 1.0
```

Also, the template includes a few manually implemented losses:

- `VicRegLoss`: an example for self-supervised learning
- `FocalLoss`: use for extremely imbalanced tasks
- `AngularPenaltySMLoss`: use for the Metric Learning approach

#### Model

The template offers the following Model API. A model config should contain:

- `_target_`: key with the model class name
- `model_name`: model name
- `model_repo` (optional): model repository
- Other parameters required by the model

By default, a model can be loaded from:

- `torchvision.models`, with `model_name` set up as `torchvision.models/<model-name>`, for example `torchvision.models/mobilenet_v3_large`
- `segmentation_models_pytorch`, with `model_name` set up as `segmentation_models_pytorch/<model-name>`, for example `segmentation_models_pytorch/Unet`
- `timm`, with `model_name` set up as `timm/<model-name>`, for example `timm/mobilenetv3_100`
- `torch.hub`, with `model_name` set up as `torch.hub/<model-name>` and `model_repo`, for example `model_name="torch.hub/resnet18"` and `model_repo="pytorch/vision"`

See more details about the implemented Model API and the model config as a part of the network configs in the `configs/module/network` folder.

Model config example:

```yaml
model:
  _target_: "src.modules.models.classification.Classifier"
  model_name: "torchvision.models/mobilenet_v3_large"
  model_repo: null
  weights: "IMAGENET1K_V2"
  num_classes: 1
```

#### Implemented LightningModules

By default, the template comes with the following LightningModules:

- `SingleLitModule` contains LightningModules for a few tasks, like the common, self-supervised learning and metric learning approaches, which require a single DataLoader on each step
- `MultipleLitModule` contains LightningModules which require multiple DataLoaders on each step

See examples of module configs in the `configs/module` folder. A LightningModule config example:

```yaml
_target_: src.modules.single_module.MNISTLitModule

defaults:
  - _self_
  - network: mnist.yaml

optimizer:
  _target_: torch.optim.Adam
  lr: 0.001
  weight_decay: 0.0

scheduler:
  scheduler:
    _target_: torch.optim.lr_scheduler.ReduceLROnPlateau
    mode: "max"
    factor: 0.1
    min_lr: 1.0e-9
    patience: 10
    verbose: True
  extras:
    monitor: ${replace:"__metric__/valid"}
    interval: "epoch"
    frequency: 1

logging:
  on_step: False
  on_epoch: True
  sync_dist: False
  prog_bar: True
```

### Training loop

The training loop in the template consists of the following stages:

- LightningDataModule instantiating
- LightningModule instantiating
- Callbacks instantiating
- Loggers instantiating
- Plugins instantiating
- Trainer instantiating
- Hyperparameters and metadata logging
- Training the model
- Testing the best model

See more details in the training loop implementation and `configs/train.yaml`.
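To make these stages more concrete, here is a simplified, hypothetical sketch of a Hydra-driven training entrypoint; the template's actual `src/train.py` additionally instantiates callbacks, loggers and plugins, and logs hyperparameters and metadata:

```python
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig
from pytorch_lightning import LightningDataModule, LightningModule, Trainer


@hydra.main(version_base="1.3", config_path="../configs", config_name="train")
def main(cfg: DictConfig) -> None:
    # instantiate the main pipeline objects from the composed config
    datamodule: LightningDataModule = instantiate(cfg.datamodule)
    module: LightningModule = instantiate(cfg.module)
    trainer: Trainer = instantiate(cfg.trainer)

    # train, then test the best checkpoint
    trainer.fit(model=module, datamodule=datamodule)
    trainer.test(model=module, datamodule=datamodule, ckpt_path="best")


if __name__ == "__main__":
    main()
```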
### Evaluation and prediction loops

The evaluation loop in the template consists of the following stages:

- LightningDataModule instantiating
- LightningModule instantiating
- Loggers instantiating
- Trainer instantiating
- Hyperparameters and metadata logging
- Evaluating the model or predicting

See more details in the evaluation loop implementation and `configs/eval.yaml`.

The template contains the following Prediction API:

- Set `predict: True` in `configs/eval.yaml` to turn on prediction mode.
- The DataModule could contain multiple predict datasets:

```yaml
datasets:
  predict:
    dataset1:
      _target_: src.datamodules.datasets.ClassificationDataset
      json_path: ${paths.data_dir}/predict/data1.json
    dataset2:
      _target_: src.datamodules.datasets.ClassificationDataset
      json_path: ${paths.data_dir}/predict/data2.json
```

- PyTorch Lightning returns a list of batch predictions when `LightningDataModule.predict_dataloader()` returns a single dataloader, and a list of lists of batch predictions when it returns multiple dataloaders.
- Predictions are logged to the `{cfg.paths.output_dir}/predictions/` folder.
- If there are multiple predict dataloaders, predictions will be saved with a `_<dataloader_idx>` postfix. It isn't possible to use dataset names, because PyTorch Lightning doesn't allow returning a dict of dataloaders from the `LightningDataModule.predict_dataloader()` method.
- There are two possible built-in output formats: `csv` and `json`. The `json` format is used by default, but it might be more effective to use the `csv` format for a large number of predictions. It may help to avoid RAM overflow, because `csv` allows writing row by row and doesn't require keeping the whole dict in RAM, as in the case of `json`. To change the output format, set the `predictions_saving_params.output_format` variable in the `configs/extra/default.yaml` config file.
- If you need some custom output format, for instance `parquet`, you can easily modify the `src.utils.saving_utils.save_predictions()` method.

See more details about the Prediction API and `predict_step` in the LightningModule.

### Callbacks

PyTorch Lightning has a lot of built-in callbacks, which can be used just by adding them to the callbacks config, thanks to Hydra. See examples in the `configs/callbacks` folder.

By default, the template contains a few of them:

- Model Checkpoint
- Early Stopping
- Model Summary
- Rich Progress Bar

However, there is an additional callback, `LightProgressBar`, which might be a more elegant and useful alternative to `RichProgressBar`.
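If the built-in callbacks are not enough, a custom callback can be implemented under `src/callbacks` and referenced from the callbacks config via `_target_`, just like the built-in ones. A hypothetical minimal example (class name and behavior are illustrative, not part of the template):

```python
import pytorch_lightning as pl


class PrintValidationMetrics(pl.Callback):
    """Hypothetical callback that prints logged metrics after each validation epoch."""

    def on_validation_epoch_end(self, trainer: pl.Trainer, pl_module: pl.LightningModule) -> None:
        # trainer.callback_metrics holds the scalar metrics logged via self.log(...)
        metrics = {name: float(value) for name, value in trainer.callback_metrics.items()}
        print(f"Epoch {trainer.current_epoch}: {metrics}")
```

Such a class could then be referenced from a callbacks config entry, e.g. with `_target_: src.callbacks.print_metrics.PrintValidationMetrics` (the module path here is illustrative).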
### Logs

Hydra creates a new output directory in `logs/` for every executed run.

Furthermore, the template saves additional metadata for better reproducibility and debugging, including:

- pip logs
- git logs
- environment logs: CPU, GPU (`nvidia-smi`)
- a full copy of the `src/` and `configs/` directories

Default logging structure:

```
├── logs
│   ├── task_name
│   │   ├── runs                     <- Logs generated by runs
│   │   │   ├── YYYY-MM-DD_HH-MM-SS  <- Datetime of the run
│   │   │   │   ├── .hydra           <- Hydra logs
│   │   │   │   ├── csv              <- Csv logs
│   │   │   │   ├── wandb            <- Weights & Biases logs
│   │   │   │   ├── checkpoints      <- Training checkpoints
│   │   │   │   ├── metadata         <- Metadata
│   │   │   │   │   ├── pip.log      <- Pip logs
│   │   │   │   │   ├── git.log      <- Git logs
│   │   │   │   │   ├── env.log      <- Environment logs
│   │   │   │   │   ├── src          <- Full copy of `src/`
│   │   │   │   │   └── configs      <- Full copy of `configs/`
│   │   │   │   └── ...              <- Any other saved files
│   │   │   └── ...
│   │   │
│   │   └── multiruns                <- Logs generated by multiruns
│   │       ├── YYYY-MM-DD_HH-MM-SS  <- Datetime of the multirun
│   │       │   ├── 1                <- Multirun job number
│   │       │   ├── 2
│   │       │   └── ...
│   │       └── ...
│   │
│   └── debugs                       <- Logs generated during debug
│       └── ...
```

## Data

Usually, images or any other data files are just stored on disk in folders. It is a simple and convenient way. However, there are other methods, and one of them is the Hierarchical Data Format, HDF5 (h5py). There are a few reasons why it might be more beneficial to store images in HDF5 files instead of plain folders:

- **Efficient storage**: the data format is designed specifically for storing large amounts of data. It is particularly well-suited for storing arrays of data, like images, and can compress the data to reduce the overall size of the file. The important thing about compression in HDF5 files is that objects are compressed independently and only the objects you need get decompressed on output. This is clearly more efficient than compressing the entire file and having to decompress the whole file to read it.
- **Fast access**: HDF5 allows you to access the data stored in the file using indexing, just like you would with a NumPy array. This makes it easy and fast to retrieve the data you need, which can be especially important when you are working with large datasets.
- **Easy to use**: HDF5 is easy to use and integrates well with other tools commonly used in machine learning, such as NumPy and PyTorch. This means you can use HDF5 to store your data and then load it into your training code without any additional preprocessing.
- **Self-describing**: it is possible to add information that helps users and tools know what is in the file: what the variables are, what their types are, what tools collected and wrote them, etc. The tool you are working on can read this metadata. Attributes in an HDF5 file can be attached to any object in the file, not just at the file level.

This template contains a tool which can be used to easily create and read HDF5 files.

To create an HDF5 file:

```python
from src.datamodules.components.h5_file import H5PyFile

H5PyFile().create(
    filename="/path/to/dataset_train_set_v1.h5",
    content=["/path/to/image_0.png", "/path/to/image_1.png", ...],
    # each content item loads as np.fromfile(filepath, dtype=np.uint8)
)
```

To read an HDF5 file in the wild:

```python
import matplotlib.pyplot as plt
from src.datamodules.components.h5_file import H5PyFile

h5py_file = H5PyFile(filename="/path/to/dataset_train_set_v1.h5")
image = h5py_file[0]
plt.imshow(image)
```

To read an HDF5 file in `Dataset.__getitem__`:

```python
def __getitem__(self, index: int) -> Any:
    key = self.keys[index]      # get the image key, e.g. path
    data_file = self.data_file
    source = data_file[key]     # get the image
    image = io.BytesIO(source)  # read the image
    ...
```

## Hyperparameters search

Hydra provides out-of-the-box hyperparameter sweepers: Optuna, Nevergrad or Ax.

You may define a hyperparameter search by adding a new config file to `configs/hparams_search`. See the example hyperparameter search config. With this method, there is no need to add extra code; everything is specified in a single configuration file. The only requirement is to return the optimized metric value from the launch file (see the sketch below).

Execute it with:

```bash
python src/train.py -m hparams_search=mnist_optuna
```

The `optimization_results.yaml` will be available under the `logs/task_name/multirun` folder.
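As a sketch of that requirement (the helper and config key names are illustrative, not the template's exact API), the launch file simply returns the metric value that the sweeper should optimize:

```python
import hydra
from omegaconf import DictConfig


def run_training(cfg: DictConfig) -> dict:
    """Stub standing in for the template's actual training routine."""
    ...  # instantiate datamodule/module/trainer, fit the model, collect logged metrics
    return {"Accuracy/valid": 0.0}  # placeholder dict of logged metrics


@hydra.main(version_base="1.3", config_path="../configs", config_name="train")
def main(cfg: DictConfig) -> float:
    metric_dict = run_training(cfg)
    # return the value the sweeper optimizes; the config key name is illustrative
    return metric_dict[cfg.get("optimized_metric", "Accuracy/valid")]


if __name__ == "__main__":
    main()
```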
## Docker

Docker is an essential part of environment reproducibility. It makes it possible to package a machine learning pipeline and its dependencies into a single container that can be easily deployed and run in any environment. This is particularly useful because it helps to ensure that the code runs consistently, regardless of the environment in which it is deployed.

The Docker image could require some additional packages depending on which device is used for running. For example, running on a cluster with NVIDIA GPUs requires the CUDA Toolkit from NVIDIA. The CUDA Toolkit provides everything you need to develop GPU-accelerated applications, including GPU-accelerated libraries, a compiler, development tools and the CUDA runtime.

In general, there are many ways to set this up, but to simplify the process you can use:

- The official NVIDIA Docker Images Hub, where it is easy to find images with any combination of OS, CUDA, etc. See a possible structure of the `Dockerfile` here.
- Miniconda for GPU environments.

Moreover, it can be advantageous to use:

- Additional docker container runtime options for managing resource constraints, like `--cpuset-cpus`, `--gpus`, etc.
- NVTOP - a (h)top-like task monitor for AMD, Intel and NVIDIA GPUs.

Here is an example of running a container based on the proposed `Dockerfile` and `.dockerignore`:

```bash
set -o errexit
export DOCKER_BUILDKIT=1
export PROGRESS_NO_TRUNC=1

docker build --tag <project-name> \
    --build-arg OS_VERSION="22.04" \
    --build-arg CUDA_VERSION="11.7.0" \
    --build-arg PYTHON_VERSION="3.10" \
    --build-arg USER_ID=$(id -u) \
    --build-arg GROUP_ID=$(id -g) \
    --build-arg NAME="<your-name>" \
    --build-arg WORKDIR_PATH=$(pwd) .

docker run \
    --name <task-name> \
    --rm \
    -u $(id -u):$(id -g) \
    -v $(pwd):$(pwd):rw \
    --gpus '"device=0,1,3,4"' \
    --cpuset-cpus "0-47" \
    -it \
    --entrypoint /bin/bash \
    <project-name>:latest
```

## Tests

Tests are an important aspect of software development in general, and especially in machine learning, where it can be much more difficult to understand whether code is working correctly without testing. Consequently, the template contains some generic tests implemented with `pytest`.

For this purpose MNIST is used. It is a small dataset, so it is possible to run all tests on CPU. However, it is easy to implement tests for your own dataset if required.

As a baseline, the tests cover:

- Main module configs instantiation by Hydra
- DataModule
- Losses loading
- Metrics loading
- Models loading and utils
- Training on 1% of the MNIST dataset, for example:
  - running 1 train, val and test step
  - running 1 epoch, saving a checkpoint and resuming for a second epoch
  - running 2 epochs with DDP simulated on CPU
- Evaluating and predicting
- Hyperparameters optimization
- Custom progress bar functionality
- Utils

All these tests are designed to verify that the main pipeline modules and utils are executable and work as expected. However, sometimes this may not be enough to ensure that the code is working correctly, especially in the case of more complex pipelines and models.

For running:

```bash
# run all tests
pytest

# run tests from specific file
pytest tests/test_train.py

# run tests from specific test
pytest tests/test_train.py::test_train_ddp_sim

# run all tests except the ones marked as slow
pytest -k "not slow"
```

## Continuous integration

The template contains a few initial CI workflows via the GitHub Actions platform. It makes it easy to automate and streamline development workflows, which can help to save time and effort, increase efficiency, and improve the overall quality of the code. In particular, it includes:

- `.github/workflows/test.yaml`: running all tests from `tests/` with `pytest` on Linux, Mac and Windows platforms
- `.github/workflows/code-quality-main.yaml`: running pre-commits on the main branch for all files
- `.github/workflows/code-quality-pr.yaml`: running pre-commits on pull requests for modified files only

Note: You need to enable GitHub Actions in the settings of your repository. See more about GitHub Actions for CI.

In the case of using GitLab, it is easy to set up GitLab CI based on the GitHub Actions workflows. There it is managed by the `.gitlab-ci.yml` file. See more here.

Also published here.