I started using PyTorch to train my models back in early 2018, with the 0.3.1 release. I got hooked by the Pythonic feel, ease of use and flexibility. It was just so much easier to do things in PyTorch than in TensorFlow or Theano. But something I missed was the Keras-like high-level interface to PyTorch, and there was not much out there back then.

Fast-forward to 2020, and we have 6 high-level training APIs in the PyTorch Ecosystem:

- Skorch
- Catalyst
- Fastai
- PyTorch Ignite
- PyTorch Lightning
- TorchBearer

But which one should you choose? What are the pros and cons of using each one?

I thought: who can explain the differences between those libraries better than the authors themselves? I picked up my proverbial phone and asked them to write an article with me. They all agreed, and this is how this post was created!

So, I've asked the authors to talk about the following aspects of their libraries:

- Philosophy of the project
- API structure
- The learning curve for new users
- Built-in features (what you get out-of-the-box)
- Extension capabilities (simplicity of integration in research)
- Reproducibility
- Distributed training
- Productionalization
- Popularity

… and they really did answer thoroughly 🙂

Skorch

The philosophy behind skorch development can be summarized as follows:

- follow the sklearn API
- don't hide PyTorch
- don't reinvent the wheel
- be hackable

These principles laid out the design space within which we operate. Regarding the scikit-learn API, it presents itself, most obviously, in how you train and predict:

from skorch import NeuralNetClassifier

net = NeuralNetClassifier(...)
net.fit(X_train, y_train)
net.predict(X_test)

Because skorch is using this simple and well-established API, everyone should be able to start using it very quickly.

But the sklearn integration goes deeper than calling "fit" and "predict". You can seamlessly integrate your skorch model within sklearn `Pipeline`s, use sklearn's numerous metrics (no need to re-implement F1, R², etc.), and use it with GridSearchCV.

When it comes to parameter sweeps: you can use any other hyperparameter search strategy as long as there is a sklearn-compatible implementation.

We are especially proud that you can search on almost any hyper-parameter without additional work. For example, if your module has an initialization parameter called num_units, you can grid search that parameter right away.

Here is a list of things you can grid search out-of-the-box:

- any parameter on your Module (number of units and layers, nonlinearity, dropout rate, …)
- optimizer (learning rate, momentum, …)
- criterion
- DataLoader (batch size, shuffling, …)
- callbacks (any parameter, even on your custom callbacks)

This is how it looks in code:

from sklearn.model_selection import GridSearchCV

params = {
    'lr': [0.01, 0.02],
    'max_epochs': [10, 20],
    'module__num_units': [10, 20],
    'optimizer__momentum': [0.6, 0.9, 0.95],
    'iterator_train__shuffle': [True, False],
    'callbacks__mycallback__someparam': [1, 2, 3],
}
net = NeuralNetClassifier(...)
gs = GridSearchCV(net, params, cv=3, scoring='accuracy')
gs.fit(X, y)
print(gs.best_score_, gs.best_params_)

As far as I'm aware, no other framework provides this flexibility. On top of that, by using the dask parallel backend, you can distribute the hyper-parameter search across your cluster without too much hassle.

Using the mature sklearn API, skorch users can avoid the boilerplate code that is typically seen when writing train loops, validation loops, and hyper-parameter search in pure PyTorch.
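To give an idea of what distributing the search can look like, here is a minimal sketch using joblib's dask backend. It assumes a dask.distributed cluster is available and simply re-uses the gs object from the snippet above; the Client setup shown is standard joblib/dask usage, not a skorch-specific API.

# a minimal sketch of distributing the grid search with dask, assuming
# dask.distributed is installed; `gs`, `X`, `y` come from the snippet above
import joblib
from dask.distributed import Client

client = Client()  # starts (or connects to) a local dask cluster

with joblib.parallel_backend("dask"):
    gs.fit(X, y)

print(gs.best_score_, gs.best_params_)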
From the PyTorch side, we decided not to hide the backend behind an abstraction layer, as is the case in Keras, for example. Instead, we expose numerous components known from PyTorch. As a user, you can use PyTorch's Dataset (think torchvision, including TTA), DataLoader, and learning rate schedulers. Most importantly, you can use PyTorch Modules with almost no restrictions.

We thus made a conscious effort to re-use as many existing features from sklearn and PyTorch as possible instead of re-inventing the wheel. This makes skorch easy to use on top of your existing codebase, or to remove it after your initial experimentation phase without any lock-in effect. For instance, you can replace the neural net with any sklearn model, or you can extract the PyTorch module and use it without skorch.

On top of re-using existing features, we added some of our own. Most notably, skorch works with many common data types out-of-the-box. On top of Datasets, you can use:

- numpy arrays,
- torch tensors,
- pandas DataFrames,
- Python dictionaries holding heterogeneous data,
- external/custom datasets like ImageFolder from torchvision.

We've put extra effort into making these work well with sklearn.

If this is not enough to satisfy your customization needs, we took pains to facilitate implementing your own callbacks or your own model trainers. Our documentation contains examples of how to implement custom callbacks and custom trainers, modifying every possible behavior right down to the training step.

The philosophy of not re-inventing the wheel should make skorch easy to learn for anyone who is familiar with sklearn and PyTorch. And since we designed skorch around customization and flexibility, it shouldn't be too hard to master. To learn more about skorch, check out these examples and notebooks.

Skorch is geared towards, and used in, production. We addressed some common issues regarding productionalization, specifically:

- we make sure to be backward compatible and to give a sufficiently long deprecation period where necessary,
- you can train on GPU and serve on CPU,
- you can pickle a whole sklearn Pipeline containing the skorch model for later re-use (a sketch follows at the end of this section),
- we provide a helper function to turn your training code into a command line script that exposes all your model parameters, including their documentation, as command line arguments, with just three lines of extra code.

That being said, I have implemented, or know people who have implemented, more research-y stuff, like GANs and numerous types of semi-supervised learning techniques. This does require more profound knowledge of skorch, though, so you might have to dig deeper in the docs or ask us for pointers on GitHub. I personally haven't come across anyone using skorch with reinforcement learning, but I would like to hear what experience people had with that.

Since our initial release of skorch in the summer of 2017, the project has matured a lot and an active community has grown around it. In a typical week, a handful of issues are opened on GitHub or a question is asked on Stack Overflow. We answer most questions within a day, and if there is a good feature request or bug report, we try to guide the reporter towards implementing it themselves. This way, we have had more than 20 contributors over the project's lifetime, with 3 of them being regulars, which means the project's health is not dependent on a single person.
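As a rough illustration of the Pipeline and pickling points above, here is a hedged sketch; the scaler choice and file name are arbitrary illustrations, and net, X_train, y_train refer to the earlier skorch snippet.

# sketch: wrap the skorch classifier in an sklearn Pipeline and pickle the
# whole thing for later re-use; StandardScaler and the file name are just
# illustrative choices
import pickle
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("net", net),
])
pipe.fit(X_train, y_train)

with open("skorch_pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)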
The big difference between skorch and some other higher-level frameworks, say fastai, is that skorch doesn't come "batteries-included". That means it's up to the user to implement their own modules or to use the modules of one of the many existing collections (say, torchvision). Skorch provides the skeleton, but you have to bring the meat.

When to use skorch:

- gain the sklearn API and all associated benefits like hyper-parameter search,
- most PyTorch workflows just work,
- avoid boilerplate, standardize code,
- use some of the many utilities discussed above.

When not to use skorch:

- super custom PyTorch code, possibly reinforcement learning,
- backend-agnostic code (switching between PyTorch, TensorFlow, …),
- there is no need at all for the sklearn API,
- avoid a very slight performance overhead.

Catalyst

Philosophy

The idea behind Catalyst is quite simple:

- collect all the technical, dev-heavy, Deep Learning stuff in a framework,
- make it easy to re-use boring day-to-day components,
- focus on research and hypothesis testing in our projects.

To make that happen we looked at a typical Deep Learning project, which usually has the following structure:

for stage in stages:
    for epoch in epochs:
        for dataloader in dataloaders:
            for batch in dataloader:
                handle(batch)

If you think about it, most of the time all you need to do is specify the handle method for the new model and how batches of data should be fed to that model. Why, then, is so much of our time spent implementing pipelines and debugging training loops rather than developing something new or testing a hypothesis?

We realized that it is possible to separate the engineering from the research so that we can invest our engineering time once in a high-quality, reusable backbone and use it across all the projects.

That is how Catalyst was born: an Open Source PyTorch framework that allows you to write compact but full-featured pipelines, abstracts engineering boilerplate away, and lets you focus on the main part of your project.

Our mission at Catalyst.Team is to use our software engineering and deep learning expertise to standardize workflows and enable cross-domain communication between deep learning and reinforcement learning researchers. We believe that reduced development friction and free flow of ideas will lead to future breakthroughs in DL, and such an R&D Ecosystem will help make that happen.

The learning curve

Catalyst can be easily adopted by both DL newcomers and seasoned experts thanks to two APIs:

- Notebook API, which was developed with a focus on easy experimentation and Jupyter Notebooks usage — to start your path into reproducible DL research.
- Config API, which mostly focuses on scalability and a CLI interface — to bring the power of DL/RL even to large clusters.

When it comes to the PyTorch user experience, we really want to keep it as simple as possible. You define your loaders, model, criterion, optimizer, and scheduler as you usually would:

import torch

# data
loaders = {"train": ..., "valid": ...}

# model, criterion, optimizer
model = Net()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

and you pass those PyTorch objects to the Catalyst Runner:

from catalyst.dl import SupervisedRunner

# experiment setup
logdir = "./logdir"
num_epochs = 42

# model runner
runner = SupervisedRunner()

# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=loaders,
    logdir=logdir,
    num_epochs=num_epochs,
    verbose=True,
)
This is how we feel deep learning code should look like: engineering clearly decoupled from deep learning, with almost no boilerplate.

To get started with both APIs you can follow our tutorials and pipelines, or if you don't want to choose, just check out the most common ones: classification and segmentation.

Design and Architecture

The most interesting part about the Notebook and Config API is that they use the same "backend" logic — the Experiment, Runner, State and Callback abstractions, which are the core features of Catalyst:

- Experiment: an abstraction that contains information about the experiment — a model, a criterion, an optimizer, a scheduler, and their hyperparameters. It also contains information about the data and transformations used. In general, the Experiment knows what you would like to run.
- Runner: a class that knows how to run an experiment. It contains all the logic of how to run the experiment, stages (another distinctive feature of Catalyst), epochs and batches.
- State: some intermediate storage between Experiment and Runner that saves the current state of the Experiment — model, criterion, optimizer, schedulers, metrics, loggers, loaders, etc.
- Callback: a powerful abstraction that lets you customize your experiment run logic. To give users maximum flexibility and extensibility, we allow callback execution anywhere in the training loop:

on_stage_start
on_epoch_start
on_loader_start
on_batch_start
on_batch_end
on_epoch_end
on_stage_end
on_exception
# ...

By implementing these methods you can make any additional logic possible.

As a result, you can implement any Deep Learning pipeline in a few lines of code (and after the Catalyst.RL 2.0 release — any Reinforcement Learning pipeline), combining it from available primitives (thanks to the community, their number is growing every day). Everything else (Models, Criterions, Optimizers, Schedulers) are pure PyTorch primitives. Catalyst does not create any wrappers or abstractions on top, but rather makes it easy to reuse those building blocks between different frameworks and domains.

Extension capabilities / Simplicity of integration in research

Thanks to the flexible framework design and the Callbacks mechanism, Catalyst is easily extendable for a large number of DL-based projects. You can check out our Catalyst-powered repositories on the awesome-catalyst-list. If you are interested in Reinforcement Learning — there are a large number of RL-based repos and competition solutions also. To compare Catalyst.RL with other RL frameworks you could check out the Open Source RL list.

Other built-in features (what you get out of the box)

Knowing that you can extend it easily gives comfort, but there are also a ton of features that you get out-of-the-box. Some of them include:

- Based on a flexible callback system, Catalyst has easily integrated common Deep Learning best practices, such as gradient accumulation, gradient clipping, weight decay correction, top-K best checkpoints saving, tensorboard integration, and many other useful day-to-day deep learning utils.
- Thanks to our contributors and contrib modules, Catalyst has access to all recent SOTA features, like AdamW, OneCycle, SWA, Ranger, LookAhead, and many other research developments.
- Moreover, we integrate with popular libraries like Nvidia apex, Albumentations, SMP, transformers, wandb, and neptune.ai just out of the box to make your research more user-friendly. Thanks to such integrations, Catalyst has full support for test-time augmentations, mixed precision, and distributed training.
For industry needs, we also have framework-wise support for PyTorch tracing, which makes putting models in production easier (a generic tracing sketch follows below). Furthermore, we deploy predefined Catalyst-based docker images with each release for easier integration. Finally, we support additional solutions for both model serving — Reaction (industry-oriented) — and experiment monitoring — Alchemy (research-oriented).

Everything is integrated into the library and covered by CI tests (we have a dedicated gpu-server for that). And thanks to Catalyst scripts, you can schedule a large number of experiments and run them in parallel over all available GPUs from the command line (check catalyst-parallel-run for more info).

Reproducibility

We've put a lot of work into making the experiments that you run with Catalyst reproducible. Thanks to library-wise determinism, Catalyst-based experiments are reproducible not only between runs on one server, but also between several runs over different servers and different hardware parts (with docker encapsulation, of course). See the experiments here if interested.

Moreover, Reinforcement Learning experiments are also reproducibility-oriented (as far as RL can be reproducible). For example, with synchronous experiment runs you can achieve very close performance, thanks to determinism in sampled trajectories. This is notoriously hard, and as far as I am aware, Catalyst has the most reproducible RL pipelines out there.

To achieve this new level of reproducibility in DL and RL we had to create several additional features:

- Full source code dumping: thanks to the Experiment, Runner and Callback abstractions, it's quite easy to save these primitives for further usage.
- Catalyst source code dumping: with this feature, even when working with the dev version of Catalyst, you can always reproduce experiment results.
- Environment versioning: Catalyst dumps pip and conda package versions (which can later be used to define your docker images).
- Finally, Catalyst supports several monitoring tools, like Alchemy, Neptune.ai, and Wandb, to store all your experiment metrics and additional info for better research progress tracking and reproducibility.

Thanks to those library-wise solutions, you can be sure that the pipelines you implement in Catalyst are reproducible, with all the experiment logs and checkpoints saved for future reference.

Productionalization

Now that we know how Catalyst helps with deep learning research, we can talk about deploying trained models to production.

As was already mentioned, Catalyst supports model tracing out-of-the-box. It lets you convert PyTorch models (that use Python code) to TorchScript models (that have everything integrated). TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency.

Additionally, to help Catalyst users deploy their pipelines into production systems, Catalyst.Team has a Docker Hub with pre-built Catalyst-based images (including fp16 support).

Moreover, to help researchers bring their ideas into production and real-world applications, we've created Catalyst.Ecosystem:

- Reaction: our own PyTorch Serving solution with sync/async API, batch mode support, queues, and all other typical backends that you would expect from a well-designed production system.
- Alchemy: our monitoring tools for experiment tracking, model comparison, and research results sharing.
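To make the tracing idea above concrete, here is a generic PyTorch sketch using plain torch.jit (not a Catalyst-specific API); the model variable and the input shape are placeholder assumptions.

# generic TorchScript tracing sketch; `model` is any trained torch.nn.Module
# and the example input shape is a placeholder
import torch

example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model.eval(), example_input)
traced_model.save("traced_model.pt")

# the saved program can later be loaded without the original Python code,
# e.g. from libtorch, or simply:
loaded = torch.jit.load("traced_model.pt")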
Popularity

Since the first pypi release 12 months ago, Catalyst has gained ~1.5k stars on GitHub and over 100k downloads. We are proud to be part of such an Open Source Ecosystem and extremely grateful to all our users and contributors for constant support and feedback.

One of the online communities that was especially helpful was ods.ai: one of the largest slack channels for Data Scientists and Machine Learning practitioners in the world (40k+ users). Without their ideas and feedback, Catalyst wouldn't be where it is today.

Special thanks to our early adopters, Bac Nguyen Xuan, Eugene Khvedchenya, Alex Gaziev, and the contributors that make it all worth it.

Acknowledgments: Since the beginning of the development of Catalyst, a lot of people have influenced it in a lot of different ways. As a token of my appreciation, I want to express personal thanks and a HUGE THANK YOU to:

- Roman Tezikov for great Catalyst tutorials
- Eugene Kachan for many Config API improvements and pipelines
- David Kuryakin for ReAction design
- Aleksey Grinchuk and Valentin Khrulkov for many RL algorithms implemented together
- Alex Gaziev for a bunch of Config API improvements
- Andrey Zharkov and Artem Zolkin for the Catalyst.GAN initiative
- Yury Kashnitsky for the Catalyst.NLP movement
- Evgeny Semyonov for the creation of MLComp
- Eugene Khvedchenya for the Pytorch-toolbelt library, and Nguyen Xuan Bac and Andrey Lukyanenko for many Kaggle Catalyst-based solutions
- Vsevolod Poletaev for the Experiment idea and PoC
- Aleksandr Belskikh for the Callbacks-based system inspiration
- Artur Kuzin for the multi-stage pipelines support requirement
- Vladimir Iglovikov for countless pieces of useful advice
- and Ivan Stepanenko for the awesome Catalyst.Ecosystem design

Thanks to all that support, Catalyst has become a part of the Kaggle docker image, was added to the PyTorch Ecosystem, and now we are developing our own DL R&D Ecosystem to accelerate your research and production needs. To read more about Catalyst.Ecosystem, please check our vision and the project manifesto.

Finally, we are always happy to help our Catalyst.Friends: companies/startups/research labs who are already using Catalyst or are considering using it for their next project.

Thanks for reading, and… Break the cycle — use Catalyst!

When to use Catalyst:

- To have a flexible and reusable codebase without boilerplate.
- You want to share your expertise with other researchers from different Deep Learning areas.
- To boost your research speed with Catalyst.Ecosystem.

When not to use Catalyst:

- You have only started your deep learning path — in that case low-level PyTorch is a great introduction.
- You want to create very specific, custom pipelines with a bunch of irreproducible tricks 🙂

Fastai

Note: What follows is about version 2 of fastai, which will be released in July 2020. You can preview it here and it is documented here. If you read this post after it has been released, it will be in the main repository and will be documented there.

Fastai is a deep learning library which provides:

- practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains,
- researchers with low-level components that can be mixed and matched to build new things.

It aims to do both things without substantial compromises in ease of use, flexibility, or performance. This is possible thanks to a carefully layered architecture. It expresses common underlying patterns of many deep learning and data processing techniques in terms of decoupled abstractions.
What is important is that these abstractions can be expressed clearly and concisely, which makes fastai approachable and rapidly productive, but also deeply hackable and configurable.

A high-level API offers customizable models with sensible defaults, and is built on top of a hierarchy of lower-level building blocks. This article covers a representative subset of the features of the library. For details, see our fastai paper, the API, and the documentation.

When talking about the fastai API, one needs to distinguish the High and the Middle/Low-level API. We will talk about both in the following sections.

High-level API

The high-level API is very useful to beginners and practitioners who are mainly interested in applying pre-existing deep learning methods. It offers concise APIs for the main application areas: vision, text, tabular, time-series analysis, recommendation (collaborative filtering).

These APIs choose intelligent default values and behaviors based on all available information. For instance, fastai provides a Learner class which brings together architecture, optimizer, and data, and automatically chooses an appropriate loss function where possible. To give another example, generally a training set should be shuffled and a validation set should not be shuffled; fastai provides a single Dataloaders class which automatically constructs validation and training data loaders with these details already handled.

To see those "clear and concise code" principles in action, let's fine-tune an imagenet model on the Oxford IIT Pets dataset and achieve close to state-of-the-art accuracy within a couple of minutes of training on a single GPU:

from fastai.vision.all import *

path = untar_data(URLs.PETS)
dls = ImageDataloaders.from_name_re(
    path=path, bs=64,
    fnames=get_image_files(path/"images"),
    pat=r'/([^/]+)_\d+.jpg$',
    item_tfms=RandomResizedCrop(450, min_scale=0.75),
    batch_tfms=[*aug_transforms(size=224, max_warp=0.),
                Normalize.from_stats(*imagenet_stats)])
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(4)

These are all of the lines of code necessary for this task. This is not an excerpt. Each line of code does one important task, allowing the user to focus on what they need to do, rather than minor details:

from fastai.vision.all import *

imports all the necessary pieces from the library. It's important to note that the library has been designed carefully to avoid these styles of imports cluttering the namespace.

path = untar_data(URLs.PETS)

downloads a standard dataset from the fast.ai datasets collection (if not previously downloaded) to a configurable location, extracts it (if not previously extracted), and returns a pathlib.Path object with the extracted location.

dls = ImageDataloaders.from_name_re(
    path=path, bs=64,
    fnames=get_image_files(path/"images"),
    pat=r'/([^/]+)_\d+.jpg$',
    item_tfms=RandomResizedCrop(450, min_scale=0.75),
    batch_tfms=[*aug_transforms(size=224, max_warp=0.),
                Normalize.from_stats(*imagenet_stats)])

sets up the Dataloaders. Note the separation of item level and batch level transforms:

- item transforms are applied to individual images on the CPU,
- batch transforms are applied to a mini-batch on the GPU (if available).

aug_transforms() selects a set of data augmentations. As always in fastai, a default that works well across a variety of vision datasets is chosen, but it can be fully customized if needed.
learn = cnn_learner(dls, resnet34, metrics=error_rate)

creates a Learner, which combines an optimizer, a model, and the data to train on. Each application (vision, text, tabular) has a customized function that creates a Learner, which automatically handles whatever details it can for the user. For instance, in this image classification problem, it will:

- download an ImageNet-pretrained model, if not already available,
- remove the classification head of the model,
- replace it with a head appropriate for this particular dataset,
- set appropriate optimizer, weight decay, learning rate, and so forth.

learn.fine_tune(4)

fine-tunes the model. In this case, it is using the 1-cycle policy, which is a recent best practice for training deep learning models but is not widely available in other libraries. A lot of things happen under the hood in .fine_tune():

- annealing both the learning rates and the momentums,
- printing metrics on the validation set,
- displaying results in an HTML or console table,
- recording losses and metrics after every batch, and so forth.

A GPU will be used if one is available. It will first train the head for one epoch while the body of the model is frozen, then fine-tune for as many epochs as given (here 4) using discriminative learning rates.

One of the strengths of the fastai library is how consistent the API is across applications. For example, fine-tuning a pretrained model on the IMDB dataset (a text classification task) using ULMFiT can be done in 6 lines of code:

from fastai2.text.all import *

path = untar_data(URLs.IMDB)
dls = TextDataloaders.from_folder(path, valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(4, 1e-2)

Users get a very similar experience in other domains, like tabular, time series or recommendation systems. Once a Learner has been trained, you can explore the results with the command learn.show_results(). How those results are presented depends on the application: in vision you get labeled pictures, in text you get a dataframe summarizing samples, targets and predictions.

Another important high-level API component is the data block API, which is an expressive API for data loading. It is the first attempt we are aware of to systematically define all of the steps necessary to prepare data for a deep learning model, and to give users a mix-and-match recipe book for combining these pieces (which we refer to as data blocks).

Here is an example of how to use the data block API to get the MNIST dataset ready for modeling:

mnist = DataBlock(
    blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
    get_items=get_image_files,
    splitter=GrandparentSplitter(),
    get_y=parent_label)
dls = mnist.databunch(untar_data(URLs.MNIST_TINY), batch_tfms=Normalize)

Mid and low-level API

In the previous section, you saw how you can get a lot done quickly with the high-level API, which has a ton of out-of-the-box functionalities. However, there are situations when you need to tweak things or extend what is already there. This is where the middle and low-level APIs come into the picture:

- the mid-level API provides the core deep learning and data-processing methods for each of these applications,
- the low-level API provides a library of optimized primitives and functional and object-oriented foundations, which allows the mid-level to be developed and customized.
The training loop can be customized using the Learner's novel two-way callback system. It allows gradients, data, losses, control flow, and anything else to be read and changed at any point during training.

There is a rich history of using callbacks to allow for customization of numeric software, and today nearly all modern deep learning libraries provide this functionality. However, fastai's callback system is the first that we are aware of that supports the design principles necessary for complete two-way callbacks, which give users full flexibility:

- Every callback should be able to access every piece of information available at that stage in the training loop, including hyper-parameters, losses, gradients, input and target data, and so forth; this information should be available at every single point during training.
- Every callback should be able to modify all these pieces of information, at any time before they are used.

All the tweaks of the training loop (different schedulers, mixed-precision training, reporting on TensorBoard, wandb, neptune or equivalent, MixUp, oversampling strategies, distributed training, GAN training…) are implemented in callbacks that the end-user can mix and match with their own, making it easier to experiment with things and do ablation studies. Convenience methods are there to add those callbacks for the user, making training in mixed precision as easy as saying

learn = learn.to_fp16()

or training in a distributed environment as easy as

learn = learn.to_distributed()

fastai also provides a new, generic optimizer abstraction that allows recent optimization techniques, like LAMB, RAdam or AdamW, to be implemented in a few lines of code. It is made possible by refactoring optimizer abstractions into two basic pieces:

- stats, which track and aggregate statistics such as gradient moving averages,
- steppers, which combine stats and hyper-parameters to "step" the weights using some function.

This foundation has allowed us to write most of fastai's optimizers in 2–3 lines of code, while in other popular libraries that would take you 50+. There are many other mid-tier and low-level APIs that make it easy for researchers and developers to build new methods on top of a fast and flexible foundation.

The library is already in wide use in research, industry, and teaching. We have used it to create a complete, and very popular, deep learning course: Practical deep learning for coders (the first video of the last iteration has 256k views). The repository has 16.9k stars and is used in more than 2,000 projects at the time of writing. The community is very active on the fast.ai forum, be it to clarify points of the course that are unclear, help with debugging, or team up to tackle a new deep learning project.

When to use fastai

The goal is to have something easy enough for beginners but flexible enough for researchers/practitioners.

When not to use fastai

The only thing I can think of is that you wouldn't use fastai to serve in production a model you trained in a different framework, since we don't deal with that aspect.

PyTorch Ignite

PyTorch Ignite is a high-level library that helps with training neural networks in PyTorch. Since its beginning in 2018, our goal has been to:

"make the common things easy and the hard things possible".

Why use Ignite?

Ignite's high level of abstraction assumes little about the type of model or multiple models that the user is training. We only require the user to define the closure to be run in the training and optional validation loop. It gives users a lot of flexibility and allows them to use Ignite in tasks such as co-training multiple models (i.e. GANs) or tracking multiple losses and metrics in the training loop.
Ignite concepts and API

There are a few core objects in Ignite's API that you need to learn:

- Engine: the essence of the library,
- Events & Handlers: interaction with the Engine (e.g. early stopping, checkpoints, logging),
- Metrics: out-of-the-box metrics for various tasks.

We will present some basics to understand the main ideas, but feel free to dig deeper into the examples in the repository.

Engine

It simply loops over provided data, executes a processing function and returns a result. A Trainer is an Engine with the model's weights update as the processing function:

from ignite.engine import Engine

def update_model(trainer, batch):
    model.train()
    optimizer.zero_grad()
    x, y = prepare_batch(batch)
    y_pred = model(x)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    return loss.item()

trainer = Engine(update_model)
trainer.run(data, max_epochs=100)

An Evaluator (an object to validate the model) is an Engine with on-line metric computation logic as the processing function:

from ignite.engine import Engine

total_loss = []

def compute_metrics(_, batch):
    x, y = batch
    model.eval()
    with torch.no_grad():
        y_pred = model(x)
        loss = criterion(y_pred, y)
        total_loss.append(loss.item())
    return loss.item()

evaluator = Engine(compute_metrics)
evaluator.run(data, max_epochs=1)
print(f"Loss: {torch.tensor(total_loss).mean()}")

This code can silently train a model and compute the total loss. In the next section we will see how to make the training and validation more user-friendly.

Events & Handlers

In order to improve the flexibility of the Engine and allow users to interact at each step of the run, we introduced events and handlers. The idea is that users can execute custom code inside of the training loop as an event handler, similar to callbacks in other libraries.

fire_event(Events.STARTED)
while epoch < max_epochs:
    fire_event(Events.EPOCH_STARTED)
    # run once on data
    for batch in data:
        fire_event(Events.ITERATION_STARTED)
        output = process_function(batch)
        fire_event(Events.ITERATION_COMPLETED)
    fire_event(Events.EPOCH_COMPLETED)
fire_event(Events.COMPLETED)

At each fire_event call, all its event handlers are executed. For example, users may want to set up some run-dependent variables at the beginning of training (Events.STARTED) and update the learning rate on each iteration (Events.ITERATION_COMPLETED). With Ignite the code will look like this:

train_loader = …
model = …
optimizer = …
criterion = ...
lr_scheduler = …

def process_function(engine, batch):
    # … user function to update model weights

trainer = Engine(process_function)

@trainer.on(Events.STARTED)
def setup_logging_folder(_):
    # create a folder for the run
    # set up some run dependent variables

@trainer.on(Events.ITERATION_COMPLETED)
def update_lr(engine):
    lr_scheduler.step()

trainer.run(train_loader, max_epochs=50)
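Putting the pieces above together, a common pattern is to attach the evaluator to the trainer so that validation runs at the end of every epoch. This sketch only re-uses objects already defined in the snippets above (trainer, evaluator, val_loader); the metrics dictionary is populated once metric objects are attached, as shown in the Metrics section below.

# sketch: run validation at the end of every training epoch and print the
# metrics collected by the evaluator
from ignite.engine import Events

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    evaluator.run(val_loader)
    print(f"Epoch {engine.state.epoch}: {evaluator.state.metrics}")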
The cool thing with handlers (vs "callback" interfaces) is that a handler can be any function with the correct signature (we only require the first argument to be the engine), e.g. a lambda, a simple function, a class method, etc. We do not require you to inherit from an interface and override its (possibly abstract) methods.

trainer.add_event_handler(
    Events.STARTED, lambda engine: print("Start training"))

# attach handler with args, kwargs
mydata = [1, 2, 3, 4]

def on_training_ended(engine, data):
    print("Training is ended. mydata={}".format(data))

trainer.add_event_handler(
    Events.COMPLETED, on_training_ended, mydata)

Built-in events filtering

There are cases when users would like to execute the code periodically/once or with a custom rule, like: run the validation every 5 epochs, store a checkpoint every 1000 iterations, change a variable on the 20th epoch, log gradients on the first 10 iterations, etc.

Ignite provides such flexibility to separate "the code to execute" from the logic of "when to execute the code". For example, to run the validation every 5 epochs, it is simply coded:

@trainer.on(Events.EPOCH_COMPLETED(every=5))
def run_validation(_):
    # run validation

Similarly, to change some training variable once on the 20th epoch:

@trainer.on(Events.EPOCH_STARTED(once=20))
def change_training_variable(_):
    # ...

More generally, users can provide their own events filtering function:

def first_x_iters(_, event):
    if event < 10:
        return True
    return False

@trainer.on(Events.ITERATION_COMPLETED(event_filter=first_x_iters))
def log_gradients(_):
    # …

Out-of-the-box handlers

Ignite provides a list of handlers and metrics to simplify the user's code:

- Checkpoint: to save training checkpoints (composed of trainer, model(s), optimizer(s), lr scheduler(s), etc), and to save the best models (by validation score),
- EarlyStopping: stops the training if no progress is made (by validation score),
- TerminateOnNan: stops the training if NaN is encountered,
- Optimizer Parameters Scheduling: concatenate, add a warm-up, set up linear or cosine annealing, linear piecewise scheduling of any optimizer parameter (lr, momentum, betas, …),
- Logging to common platforms: TensorBoard, Visdom, MLflow, Polyaxon or Neptune (batch losses, metrics, GPU mem/utilization, optimizer parameters and more).

Metrics

Ignite also provides a list of out-of-the-box metrics for various tasks: Precision, Recall, Accuracy, Confusion Matrix, IoU, etc., and ~20 regression metrics.

For example, below we compute accuracy on the validation dataset:

from ignite.metrics import Accuracy

def compute_predictions(_, batch):
    # …
    return y_pred, y_true

evaluator = Engine(compute_predictions)
metric = Accuracy()
metric.attach(evaluator, "val_accuracy")
evaluator.run(val_loader)
# > evaluator.state.metrics["val_accuracy"] = 0.98765

Go here and here to see the full list of available metrics.

Ignite metrics have this cool property that users can compose their own metric by using basic arithmetical operations or torch methods:

precision = Precision(average=False)
recall = Recall(average=False)
F1_per_class = (precision * recall * 2 / (precision + recall))
F1_mean = F1_per_class.mean()  # torch mean method
F1_mean.attach(engine, "F1")

Library structure

The library is composed of two main modules:

- The Core module contains bases like Engine, metrics, and some essential handlers. It has PyTorch as its only dependency.
- The Contrib module may depend on other libraries (e.g. scikit-learn, tensorboardX, visdom, tqdm, etc) and can potentially have backward-compatibility-breaking changes between versions.

Both modules are largely covered by unit tests.
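As a quick illustration of the out-of-the-box handlers listed above, here is a sketch wiring EarlyStopping to a validation score. It assumes the evaluator exposes a "val_accuracy" metric as in the Accuracy example; treat the details as illustrative rather than definitive.

# sketch: stop training when the validation accuracy stops improving;
# assumes `evaluator` fills state.metrics["val_accuracy"] as shown above
from ignite.engine import Events
from ignite.handlers import EarlyStopping

def score_function(engine):
    return engine.state.metrics["val_accuracy"]

early_stopping = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
evaluator.add_event_handler(Events.COMPLETED, early_stopping)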
Extension capabilities / Simplicity of integration in research

We believe that our event/handler system is rather flexible and gives people the ability to interact with every part of the training process. Because of that, we've seen Ignite being used to train GANs (we provide two basic examples to train DCGAN and CycleGAN) or Reinforcement Learning models.

According to GitHub's "Used by", Ignite was used by researchers for their papers:

- BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning, github
- A Model to Search for Synthesizable Molecules, github
- Localised Generative Flows, github
- Unsupervised Spatiotemporal Data Inpainting, github
- Extracting T Cell Function and Differentiation Characteristics from the Biomedical Literature, github

Because of those (and other research projects) we strongly believe that Ignite gives you enough flexibility to do deep learning research.

Integrations with other libraries/frameworks

Ignite plays nicely with other libraries or frameworks if their features do not overlap. Some cool integrations that we have include:

- hyperparameter tuning with Ax (Ignite example),
- hyperparameter tuning with Optuna (Optuna example),
- logging to TensorBoard, Visdom, MLflow, Polyaxon, Neptune (Ignite's code), Chainer UI (Chainer's code),
- training with mixed precision using Nvidia Apex (Ignite's examples).

Reproducibility

We've put a lot of effort into making Ignite training reproducible:

- Ignite's Engine automatically handles the random states and, when possible, forces the data loaders to provide the same data samples on different runs;
- Ignite integrates with experiment tracking systems like MLflow, Polyaxon, Neptune. This helps to keep track of software, parameter, and data dependencies of ML experiments;
- We provide several examples and "references" (inspired by torchvision) of reproducible training on vision tasks (e.g. classification on CIFAR10 and ImageNet, and segmentation on Pascal VOC12).

Distributed training

Distributed training is also supported by Ignite, but we leave it up to the user to set up its type of parallelism: model or data. For example, in a data-distributed configuration, users are required to correctly set up the distributed process group, wrap the model, use a distributed sampler, etc. Ignite handles metrics computation: reduction of the value across all processes. We provide several examples (e.g. distributed CIFAR10) to display how to use Ignite in a distributed configuration.

Popularity

At the moment of writing, Ignite had about 2.5k stars and, according to GitHub's "Used by" feature, is used by 205 repositories. Some honorable mentions are:

- State-of-the-Art Conversational AI with Transfer Learning by HuggingFace
- Tutorial on Transfer Learning in NLP held at NAACL 2019 by HuggingFace

Thomas Wolf from HuggingFace also left some awesome feedback for the library in one of his blog articles (Thanks, Thomas!):

"Using the awesome PyTorch ignite framework and the new API for Automatic Mixed Precision (FP16/32) provided by NVIDIA's apex, we were able to distill our +3k lines of competition code in less than 250 lines of training code with distributed and FP16 options!"

- Deep-Reinforcement-Learning-Hands-On-Second-Edition by Max Lapan. This is a book on Deep Reinforcement Learning wherein the second-edition examples are made with Ignite.
- Project MONAI: AI Toolkit for Healthcare Imaging. This project, primarily focused on healthcare research to develop DL models for medical imaging, uses Ignite for end-to-end training.
For other use-cases, please take a look at Ignite's github page and its "Used by".

When to use Ignite

- Remove boilerplate and standardize your code using highly customizable modules of Ignite's API.
- When you require factorized code but don't want to sacrifice flexibility to support your complicated training strategies.
- Use the rich array of utilities like metrics, handlers, and loggers available to evaluate/debug your model with ease.

When not to use Ignite

- When there is super custom PyTorch code where Ignite's API is overhead.
- When you are completely satisfied by the pure PyTorch API or another high-level library.

Thank you for reading! Pytorch-Ignite is presented to you with love by the PyTorch community!

PyTorch Lightning

Philosophy

PyTorch Lightning is a very lightweight wrapper on PyTorch which is more like a coding standard than a framework. The format allows you to get rid of a ton of boilerplate code while keeping it easy to follow.

The use of hooks, standard across every part of the training, means you can override any part of the internal functionality down to how the backward pass is done — it is extremely flexible.

The result is a framework that gives researchers, students, and production teams the ultimate flexibility to try crazy ideas without having to learn yet another framework, while automating away all the engineering details.

Lightning has two additional, more ambitious motivations: reproducibility of research and democratization of best practices in the deep learning community.

Notable features

- Train on CPU, GPU or TPUs without changing your code!
- The only library to support TPU training (Trainer(num_tpu_cores=8))
- Trivial multi-node training
- Trivial multi-GPU training
- Trivial 16-bit precision support
- Built-in performance profiler (Trainer(profile=True))
- Tons of integrations with libraries like tensorboard, comet.ml, neptune.ai, etc… (Trainer(logger=NeptuneLogger(...)))

Team

Lightning has 90+ contributors and a core team of 8 contributors who make sure the project moves forward lightning fast.

Documentation

The Lightning documentation is extremely thorough yet simple and easy to use.

API

At the core, Lightning has an API that centers around two objects, the Trainer and the LightningModule. The Trainer abstracts away all the engineering details and the LightningModule captures all the science/research code. This decoupling makes the research code more readable and allows it to run on arbitrary hardware.

LightningModule

All the research logic goes into the LightningModule. For example, in a cancer detection system, this part would handle the main things like the object detection model, data loaders for medical images, etc.

It groups the core ingredients you need to build a deep learning system:

- The computations (init, forward).
- What happens in the training loop (training_step).
- What happens in the validation loop (validation_step).
- What happens in the testing loop (test_step).
- The optimizer(s) to use (configure_optimizers).
- The data to use (train, test, val dataloaders).

Let's take a look at the example from the docs and unpack what is happening there.
# additional imports needed by the example
import os

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import MNIST

import pytorch_lightning as pl


class MNISTExample(pl.LightningModule):

    def __init__(self):
        super(MNISTExample, self).__init__()
        # not the best model...
        self.l1 = torch.nn.Linear(28 * 28, 10)

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_idx):
        # REQUIRED
        x, y = batch
        y_hat = self.forward(x)
        loss = F.cross_entropy(y_hat, y)
        tensorboard_logs = {'train_loss': loss}
        return {'loss': loss, 'log': tensorboard_logs}

    def validation_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'val_loss': F.cross_entropy(y_hat, y)}

    def validation_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        tensorboard_logs = {'val_loss': avg_loss}
        return {'avg_val_loss': avg_loss, 'log': tensorboard_logs}

    def test_step(self, batch, batch_idx):
        # OPTIONAL
        x, y = batch
        y_hat = self.forward(x)
        return {'test_loss': F.cross_entropy(y_hat, y)}

    def test_end(self, outputs):
        # OPTIONAL
        avg_loss = torch.stack([x['test_loss'] for x in outputs]).mean()
        tensorboard_logs = {'test_loss': avg_loss}
        return {'avg_test_loss': avg_loss, 'log': tensorboard_logs}

    def configure_optimizers(self):
        # REQUIRED
        # can return multiple optimizers and learning_rate schedulers
        # (LBFGS is automatically supported, no need for a closure function)
        return torch.optim.Adam(self.parameters(), lr=0.02)

    @pl.data_loader
    def train_dataloader(self):
        # REQUIRED
        return DataLoader(
            MNIST(os.getcwd(), train=True, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def val_dataloader(self):
        # OPTIONAL
        return DataLoader(
            MNIST(os.getcwd(), train=True, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

    @pl.data_loader
    def test_dataloader(self):
        # OPTIONAL
        return DataLoader(
            MNIST(os.getcwd(), train=False, download=True,
                  transform=transforms.ToTensor()), batch_size=32)

As you can see, the LightningModule builds on top of pure PyTorch code and simply organizes it in nine methods:

- __init__(): Defines our model or multiple models, and initializes the weights.
- forward(): You can think of it as your standard PyTorch forward method, but with additional flexibility to define what you want to happen at the prediction/inference level.
- training_step(): Defines what happens in the training loop. It combines a forward pass, loss calculation, and any other logic you want to execute during training.
- validation_step(): Defines what happens in the validation loop. For example, you can calculate loss or accuracy for each batch and store them in the logs.
- validation_end(): Everything that you want to happen after the validation loop ends. For example, you may want to calculate the average loss or accuracy over validation batches.
- test_step(): What you want to happen to each batch at inference time. You can put your Test Time Augmentation logic or other things here.
- test_end(): Similarly to validation_end, you can use it to aggregate the batch results calculated during test_step.
- configure_optimizers(): Initializes an optimizer or multiple optimizers.
- train/val/test_dataloader(): Returns your PyTorch DataLoaders for the train, validation, and test sets.

Since every PyTorch Lightning system needs to implement those methods, it is really easy to see exactly what is happening in the research. For example, to understand what a paper is doing, all you have to do is look at the training_step of the LightningModule!
This readability, and a close mapping between the core research concepts and the implementation, lies at the core of Lightning.

Trainer

This is where the engineering part of deep learning happens. In the cancer detection system, this might mean how many GPUs you use, when you save checkpoints, when you stop training, etc… These are details that make up a lot of the "secret sauce" of getting deep learning to work and are standard best practices across deep learning projects (i.e. not hugely relevant to cancer detection itself).

Notice that the LightningModule has nothing about GPUs or 16-bit precision or early stopping or logging or anything like that. All of that is automatically handled by the trainer.

from pytorch_lightning import Trainer

model = MNISTExample()

# most basic trainer, uses good defaults
trainer = Trainer()
trainer.fit(model)

That's all it takes to train this model! The trainer handles everything for you, including:

- Early stopping
- Automatic logging to Tensorboard (or comet, mlflow, neptune, etc…)
- Auto checkpointing
- And more (we'll talk about that in the next sections)

All of this is free out of the box!

The learning curve

Since the LightningModule is simply reorganizing pure PyTorch objects and everything is "out in the open", it is trivial to refactor your PyTorch code to the Lightning format. For more information about making the switch from pure PyTorch to Lightning read this article.

Built-in features (what you get out of the box)

Lightning gives a ton of advanced features out-of-the-box. For instance, it takes a one-liner to use things like:

- Multi-GPU training: Trainer(gpus=8)
- TPU training: Trainer(num_tpu_cores=8)
- Multi-node training: Trainer(gpus=8, num_nodes=8, distributed_backend='ddp')
- Gradient clipping: Trainer(gradient_clip_val=2.0)
- Accumulated gradients: Trainer(accumulate_grad_batches=12)
- 16-bit precision: Trainer(use_amp=True)
- Truncated back-propagation through time: Trainer(truncated_bptt_steps=3)

and a lot more. If you would like to see the full list of free-magic features go here.

Extension capabilities / Simplicity of integration in research

Having a bunch of in-built functionalities is great, but for researchers it's crucial to not have to learn yet another library, and to directly control key parts of research, such as data-processing, without having other abstractions operate on those.

This interface should be thought of as a system, not as a model. The system might have multiple models (GANs, seq-2-seq, etc…) or just one model, such as this simple MNIST example. This flexible format allows for the most freedom in training and validating. Thus researchers are free to try as many crazy things as they want, and ONLY have to worry about the LightningModule.

But maybe you need even MORE flexibility. In this case, you can do things like:

- Change how the backward step is done.
- Change how 16-bit is initialized.
- Add your own way of doing distributed training.
- Add learning rate schedulers.
- Use multiple optimizers.
- Change the frequency of optimizer updates.

And many, many more things. Under the hood, everything in Lightning is implemented as hooks that can be overridden by the user. This makes EVERY single aspect of training highly configurable — which is exactly the flexibility a research or production team needs.

But wait, you say… this is too simple for your use case? No worries, Lightning was designed, while doing research at NYU and Facebook AI Research for my PhD, to be as flexible as possible for researchers. Here are some examples:
Override this hook: your own backward pass use_amp: amp.scale_loss(loss, optimizer) scaled_loss: scaled_loss.backward() : loss.backward() : def backward (self, use_amp, loss, optimizer) if with as else Need ? Override this hook: your own amp init model, optimizers = amp.initialize( model, optimizers, opt_level=amp_level, ) model, optimizers : def configure_apex (self, amp, model, optimizers, amp_level) return Want to go as deep as adding ? Override these two hooks: your own DDP implementation model = LightningDistributedDataParallel( model, device_ids=device_ids, find_unused_parameters= ) model : default_port = os.environ[ ] default_port = default_port[ :] default_port = int(default_port) + Exception e: default_port = : default_port = os.environ[ ] Exception: os.environ[ ] = str(default_port) : root_node = os.environ[ ].split( )[ ] Exception: root_node = root_node = self.trainer.resolve_root_node_address(root_node) os.environ[ ] = root_node dist.init_process_group( , rank=self.proc_rank, world_size=self.world_size ) : def configure_ddp (self, model, device_ids) # Lightning DDP simply routes to test_step, val_step, etc... True return : def init_ddp_connection (self) # use slurm job id for the port number # guarantees unique ports across jobs from same grid search try # use the last 4 numbers in the job id as the id 'SLURM_JOB_ID' -4 # all ports should be in the 10k+ range 15000 except as 12910 # if user gave a port number, use that one instead try 'MASTER_PORT' except 'MASTER_PORT' # figure out the root node addr try 'SLURM_NODELIST' ' ' 0 except '127.0.0.2' 'MASTER_ADDR' 'nccl' There are 10s of hooks like these and we add more as researchers request them. The bottom line is that working with the bleeding-edge AI research. Lightning is trivial to use for a new user and infinitely extensible if you’re a researcher or production team Readability and moving towards Reproducibility As I mentioned, Lightning was created with a second more ambitious broad motivation: Reproducibility. While true reproducibility requires standard code, standard seeds, standard hardware, etc… Lightning contributes to reproducible research in two ways: to , standardize the format of the ML code so that the approach can be tested in different systems. decouple the engineering from the science The result is an expressive, powerful API for doing research. If every research project and paper was implemented using the LightningModule template, it would be very easy to find out what’s going on (but perhaps not easy to understand haha) Distributed training Lightning makes multi-GPU or even multi-GPU multi-node training trivial. For instance, if you want to train the above example on multiple GPUs just add the following flags to the trainer: trainer = Trainer(gpus= , distributed_backend= ) trainer.fit(model) 4 'dp' Using the above flags will run this model on 4 GPUs. 
If you want to run on, say, 16 GPUs, where you have 4 machines each with 4 GPUs, change the trainer flags to this:

trainer = Trainer(gpus=4, nb_gpu_nodes=4, distributed_backend='ddp')
trainer.fit(model)

And submit the following SLURM job:

#!/bin/bash -l

# SLURM SUBMIT SCRIPT
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=0
#SBATCH --time=0-02:00:00

# activate conda env
source activate $1

# -------------------------
# debugging flags (optional)
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# on your cluster you might need these:
# set the network interface
# export NCCL_SOCKET_IFNAME=^docker0,lo

# might need the latest cuda
# module load NCCL/2.4.7-1-cuda.10.0
# -------------------------

# run script from above
srun python3 mnist_example.py

This is crazy simple considering how much happens under the hood. For more information about distributed training with PyTorch Lightning, read this article about "How To Train A GAN On 128 GPUs Using PyTorch".

Productionalization

Lightning models can be easily deployed because they're still simple PyTorch models under the hood. This means we can leverage all the engineering advancements from the PyTorch community on supporting deployment.

Popularity

PyTorch Lightning has over 3800 stars on Github and has recently hit 110k downloads. More importantly, the community is growing rapidly, with over 90 contributors, many from the top AI labs in the world, adding new features daily. You can talk to us on Github or Slack.

When to use PyTorch Lightning

Lightning is made for professional researchers and production teams working on cutting-edge research. It's great when you know what you need to do. This focus means it adds advanced features for people looking to test/build things very quickly without getting bogged down in the details.

When not to use PyTorch Lightning

For newcomers, we recommend they build a simple MNIST system from scratch using pure PyTorch. This will show them how to set up a training loop, etc. Once they understand how that works and how the forward/backward passes work, they can move into Lightning. Although Lightning is made for professional researchers and data scientists, newcomers can still benefit.

Torchbearer

Our part of the blog will be a little different from the others because torchbearer is coming to an end (sort of). In particular, we are joining the PyTorch-Lightning team. The move came about from a meeting with William Falcon at NeurIPS 2019, and was recently announced on the PyTorch blog.

So, instead of trying to sell you torchbearer, we thought we should write about what we did well, what we did wrong, and why we are moving to Lightning.

What we did well

- The lib got pretty popular and got to 500+ stars on GitHub, which was far more than we had ever imagined.
- We became a part of the PyTorch ecosystem. It was an important experience for us that allowed us to feel like a valued part of a wider community.
- We've built a comprehensive set of built-in callbacks and metrics. This was one of our key successes; a lot of powerful outcomes can be achieved in a single line of code with torchbearer.
- An important feature of torchbearer that enables extreme flexibility is the state object. This is a mutable dictionary that houses all of the variables that are in use by the core training loop. By editing these variables in callbacks at different points in the loop, most highly complex outcomes can be achieved.
- It was always important to us that torchbearer had good documentation.
Torchbearer

Our part of the blog will be a little different from the others because torchbearer is coming to an end (sort of). In particular, we are joining the PyTorch-Lightning team. The move came about from a meeting with William Falcon at NeurIPS 2019, and was recently announced on the PyTorch blog.

So, instead of trying to sell you torchbearer, we thought we should write about what we did well, what we did wrong, and why we are moving to Lightning.

What we did well

The lib got pretty popular and got to 500+ stars on GitHub, which was far more than we had ever imagined.

We became a part of the PyTorch ecosystem. It was an important experience for us that allowed us to feel like a valued part of a wider community.

We've built a comprehensive set of built-in callbacks and metrics. This was one of our key successes; a lot of powerful outcomes can be achieved in a single line of code with torchbearer.

An important feature of torchbearer that enables extreme flexibility is the state object. This is a mutable dictionary that houses all of the variables that are in use by the core training loop. By editing these variables in callbacks at different points in the loop, most highly complex outcomes can be achieved.

It was always important to us that torchbearer had good documentation. We focused on example-led docs that can be executed in your browser with Google Colab. The example library has been a success, giving quick information on the more powerful use cases of torchbearer.

A final thing to note is that torchbearer has been used by both of us over the past two years for our PhD research. We count this as a success because we have almost never had to change the torchbearer API in order to prototype our ideas, even the ridiculous ones!

What we did wrong

The state object, which makes this library so flexible, is also problematic. The ability to access any part of the library from any other lends itself towards abuse in the same way that global variables do. In particular, once more than one object is acting on state, determining how and when a particular variable in the state object was changed is challenging. Additionally, for state to be effective you need to know what each variable is and in which callbacks you can access it, so the learning curve is steep.

By its nature, torchbearer does not lend itself to distributed training, or, to some extent, even to low precision training. Since every part of state is available at all times, how do you chunk this and distribute it across devices? PyTorch can deal with this in some way, in that torchbearer can be used when distributed, but it is unclear exactly what is happening to state at these times.

Changing the core training loop was non-trivial. Torchbearer offers a way to completely write your own core loop, but you then have to manually write in callback points to keep all the built-in torchbearer functionality working. Coupling this with a lower standard of documentation compared to other aspects of the library, custom loops were overly complicated and likely completely unknown to most users.

Managing an open-source project while working on our PhDs ended up being more difficult than expected. As a result, some parts of the library were thoroughly tested and stable (since they were important for our PhD work), while others were under-developed and buggy.

During our initial growth, we decided to dramatically change the core API. This significantly improved torchbearer, but also meant a lot of effort moving from one version to the next. It felt justified as we were still pre-1.0.0 stable release, but it certainly contributed to some users choosing other libraries.

Why we are joining PyTorch Lightning

The first key reason for our willingness to move to Lightning is its popularity. With Lightning, we become part of the fastest-growing PyTorch training library, one that has already eclipsed many of its competitors.

The second key reason for our move, and a key part of the success of Lightning, is that it was built from the ground up to support distributed training and low precision, both challenging to implement in torchbearer. These practical considerations, made in the early stages of Lightning's development, are invaluable to the modern deep learning practitioner and would be challenging to retro-fit in torchbearer.

In addition, at Lightning we will be part of a larger team of core developers. This will enable us to ensure greater stability and to support a broader range of use cases than is possible with just the two developers we have now.

Ultimately, we have always believed that the best way to move things forward would be to join efforts with another library. This is our chance to do that and help Lightning become the best training library for PyTorch.
(Subjective) Comparison and Final Thoughts

At this point, I want to give a huge THANK YOU to all the authors! Wow, this is a lot of first-hand info and I hope it will make it easier to choose the library that works for you.

As I was working on this article with them and looking closer at what their libraries have to offer (and creating some Pull Requests), I gained my own personal perspective that I want to share with you here.

Skorch

If you want the sklearn-like API, then Skorch is your lib. It is well tested and documented. It actually gives more flexibility than what I had anticipated before working on this article, which was a nice surprise. That said, the focus of this lib is not cutting-edge research but rather production applications. I feel that it really delivers on its promise and does exactly what it was built to do. I really respect tools/libs like that.

Fastai

Fastai has been a great choice for people getting into deep learning for a long time. It can get you state-of-the-art results in 10 lines of almost magical code. But there is another side to the library, perhaps lesser-known, that lets you access lower-level APIs and create custom building blocks that give researchers and practitioners the flexibility to implement very complex systems. Maybe it was the uber-popular fastai deep learning course that created a false image of this library in my mind, but I will definitely take it for a spin in the future, especially with the recent v2 pre-release.

PyTorch Ignite

Ignite is an interesting animal. With its engine, event and handler API, which is (for my personal taste) a bit exotic, you can do pretty much whatever you want. It has a ton of features out-of-the-box and I definitely understand why many researchers use it in their daily work. It took me a moment to get familiar with the framework, but you just need to stop thinking in "callback terms" and you'll be fine. That said, the API doesn't speak to me as clearly as some other libs. You should check it out though, as it may be a great choice for you.

Catalyst

Before looking into Catalyst, I thought it was a heavy(ish) framework for creating deep learning pipelines. Now my view is completely different. It decouples engineering stuff from research in a beautiful way: pure PyTorch objects go into a trainer that deals with the training. It is very flexible and has a separate module that deals with Reinforcement Learning. It also gives you a lot of features out-of-the-box when it comes to reproducibility and serving models in production. And those multistage pipelines I told you about? You can easily create them with minimal overhead. Overall, I think it is a great project and a lot of people out there could benefit from using it.

PyTorch Lightning

Lightning also wants to separate science from engineering, and I think it does a great job at that. There are just a ton of in-built features that make it even more appealing. But something that makes this library a bit different is that it enables reproducibility by making deep learning research implementations readable. It is really easy to follow the logic inside of the LightningModule, where the training step (among other things) is not abstracted away. I think communicating research projects in this way can be extremely effective. It is getting very popular very quickly, and with the authors of Torchbearer joining the core developer team, I think this project has a bright future in front of it, maybe even a Lightning-bright one 🙂

So which one should you choose?
As always, it depends, but I think you now have enough information to make a good decision!

This article was originally posted on the Neptune blog. If you liked it, you may like it there :)

You can also find me tweeting @Neptune_ai or posting on LinkedIn about ML and Data Science stuff.