Global warming is on the rise because atmospheric levels of carbon dioxide, methane, and nitrous oxide are higher than they have been in the past. Data scientists, data engineers, and cloud experts have all come forward to build a more sustainable environment by following best practices in Machine Learning.
Machine Learning models harm the environment when they consume substantial computational resources and energy while being trained for thousands of hours on specialized hardware accelerators in data centers.
The global average temperature has risen steadily over the last three decades (from 1980), as illustrated in the figure below. All major meteorological agencies show similar trends, which has prompted environmentalists, geologists, and technology experts from different domains to come forward and set standards for controlling the temperature rise.
Global average temperature anomaly from 1880 to 2012, compared to the 1951-1980 long-term average. Source: NASA Earth Observatory.
Research on curbing the energy expenditure of ML models has led to “state-of-the-art” approaches that differ from conventional Machine Learning by training in a decentralized way. In centralized ML, a server handles all training tasks. In Federated Learning, individual devices train on their own local data and send the updated model to the cloud/server, which aggregates the models from the different devices and pushes the updated global model back to them.
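The aggregation step can be sketched in a few lines. This is a minimal illustration that assumes a FedAvg-style rule, where each client's parameters are weighted by its number of local samples; the function name and the flat list-of-floats model representation are hypothetical simplifications, not a specific library's API.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Aggregate client models as a sample-weighted average (FedAvg-style).

    client_weights: one flat parameter vector per client (hypothetical format).
    client_sizes:   number of local training samples held by each client.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need at least one client and one size per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total number of samples must be positive")

    aggregated = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for j, w in enumerate(weights):
            # Each client contributes in proportion to its share of the data.
            aggregated[j] += w * size / total
    return aggregated
```

A server would call this once per round on the models it received and then broadcast the result back to the devices.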
As Federated Learning (FL) has gradually advanced, its importance for sustainability has become clear, particularly when rechargeable devices can harvest energy from the ambient environment, which saves energy costs in both wireless and edge networks.
Federated Learning (FL) settings can be either cross-silo or cross-device. In a cross-silo scenario, clients are generally few, highly available during all rounds, and likely to have similar data distributions for training, e.g., hospitals. This scenario is the natural use case for Independent and Identically Distributed (IID) data.
The second use case is a cross-device system, which will likely encompass thousands of clients with very different data distributions (non-IID), each participating in just a few rounds, e.g., training next-word prediction models on mobile devices.
FL therefore works with two different partition schemes: a uniform partition (IID), where each client holds approximately the same proportion of each class as the original dataset, and a heterogeneous partition (non-IID), where each client holds an unbalanced and different proportion of each class.
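The two partition schemes can be illustrated with a small sketch. It assumes integer class labels and uses two simple, hypothetical strategies: dealing each class round-robin across clients for the IID case, and cutting a label-sorted index list into contiguous shards for the non-IID case. Real benchmarks often use other schemes, so treat this only as an illustration.

```python
import random
from collections import defaultdict
from typing import List, Sequence


def iid_partition(labels: Sequence[int], num_clients: int,
                  seed: int = 0) -> List[List[int]]:
    """Give every client roughly the same share of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients: List[List[int]] = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the samples of this class round-robin across clients.
        for k, idx in enumerate(indices):
            clients[k % num_clients].append(idx)
    return clients


def non_iid_partition(labels: Sequence[int],
                      num_clients: int) -> List[List[int]]:
    """Give every client a contiguous, label-sorted shard (few classes each)."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    n = len(order)
    return [order[c * n // num_clients:(c + 1) * n // num_clients]
            for c in range(num_clients)]
```

With eight samples split evenly between two classes and two clients, the IID partition gives each client two samples of each class, while the non-IID partition gives each client a single class.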
In addition to handling different data distributions, it has been possible to lay out an analytical Carbon Footprint Model for FL, which provides a first-of-its-kind quantitative method for estimating CO2e emissions. The model supports a detailed study of emissions resulting from both training on hardware and communication between servers and clients, and it gives a solid foundation for a roadmap toward environmentally friendly federated learning.
Moreover, the FL setup enables researchers to conduct carbon sensitivity analysis on real FL hardware under different settings, strategies, and tasks. These studies and experiments indicate that CO2e emissions depend on a wide range of hyper-parameters, that emissions from communication between clients and servers can range from 0.4% to more than 95% of total emissions, and that efficient strategies can reduce CO2e emissions by up to 60%.
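To make the split between training and communication emissions concrete, here is a deliberately simplified sketch. It is not the analytical model from the research described above: it only assumes that emissions equal energy consumed multiplied by the carbon intensity of the electricity, and all function and parameter names are hypothetical.

```python
from typing import Dict


def co2e_grams(energy_kwh: float, intensity_g_per_kwh: float) -> float:
    """Emissions = energy consumed x carbon intensity of the electricity."""
    return energy_kwh * intensity_g_per_kwh


def fl_emissions(train_kwh: float, comm_kwh: float,
                 intensity_g_per_kwh: float) -> Dict[str, float]:
    """Split an FL run's emissions into training and communication parts."""
    training = co2e_grams(train_kwh, intensity_g_per_kwh)
    communication = co2e_grams(comm_kwh, intensity_g_per_kwh)
    total = training + communication
    return {
        "training_g": training,
        "communication_g": communication,
        "total_g": total,
        # Share of total emissions caused by client-server communication.
        "communication_share": communication / total if total else 0.0,
    }
```

Tracking the `communication_share` value across setups is one way to see how strongly a given configuration is dominated by communication rather than computation.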
FL will continue to have a long-lasting impact on total CO2e emissions. That impact can be improved further by accounting for a sustainable physical location, the relevant deep learning task, model architecture, FL aggregation strategy, and hardware efficiency.
One of the most important factors to consider in FL is quantifying carbon emissions. Because research has already demonstrated that proper design of the FL setup decreases these emissions, the amount of CO2e released serves as a crucial metric for FL deployment.
FL is known to converge in fewer rounds when the number of local epochs is increased. However, this does not guarantee a smaller overall energy consumption.
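A short back-of-the-envelope calculation shows why fewer rounds need not mean less energy. The cost model below is an assumption made for illustration (each participating client pays a fixed energy cost per local epoch plus a fixed communication cost per round), and every number in it is hypothetical.

```python
def total_energy_wh(rounds: int, clients_per_round: int, local_epochs: int,
                    wh_per_epoch: float, wh_comm_per_round: float) -> float:
    """Total energy under a simple, assumed per-round cost model."""
    per_client_per_round = local_epochs * wh_per_epoch + wh_comm_per_round
    return rounds * clients_per_round * per_client_per_round


# Hypothetical setup A: 1 local epoch, converges after 100 rounds.
few_epochs = total_energy_wh(rounds=100, clients_per_round=10, local_epochs=1,
                             wh_per_epoch=2.0, wh_comm_per_round=1.0)

# Hypothetical setup B: 5 local epochs, converges after only 40 rounds.
many_epochs = total_energy_wh(rounds=40, clients_per_round=10, local_epochs=5,
                              wh_per_epoch=2.0, wh_comm_per_round=1.0)
```

In this made-up example, setup B needs 60% fewer rounds but consumes more energy overall (4400 Wh versus 3000 Wh), because the extra local computation outweighs the communication it saves.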
The figure below illustrates how Federated Learning can have a long-lasting impact on the environment through efficient algorithms that reduce device-to-server communication on the one hand, and through advanced hardware with better processing capabilities and greater transparency on energy consumption on the other.
In centralized systems, cooling in data centers accounts for up to 40% of the total energy consumed; FL does not need or use this parameter. On the other hand, FL can use the Power Usage Effectiveness (PUE) ratio.
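For readers unfamiliar with PUE, it is the ratio of the total energy a facility consumes to the energy consumed by its IT equipment alone, so a value of 1.0 would mean no overhead at all. The sketch below computes it; the helper names are hypothetical, and the worked example assumes for simplicity that cooling is the only overhead.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def facility_energy_kwh(it_equipment_kwh: float, pue_ratio: float) -> float:
    """Scale measured IT energy up to the whole facility using a PUE ratio."""
    return it_equipment_kwh * pue_ratio


# Assumed example: cooling takes 40 of every 100 kWh and IT takes the other 60.
example_pue = pue(total_facility_kwh=100.0, it_equipment_kwh=60.0)
```

Under that assumption the PUE is about 1.67, meaning each kWh of computation costs roughly 1.67 kWh at the facility level.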
There are different initiatives to compensate for carbon emissions, either through carbon offsetting or through the purchase of Renewable Energy Credits (RECs, in the USA) or Tradable Green Certificates (TGCs, in the EU). Carbon offsetting compensates for polluting actions by investing in environment-friendly projects, such as renewable energy or massive tree planting (Anderson, 2012).
Devices can also rely on renewable energy sources to generate their own energy. This can be handled in primarily two ways, each of which determines how devices send updates to the central server in an FL setup.
In the first use case, as illustrated in the figure below, clients are opportunistic about using their energy during the training process, which degrades performance. Some of the main characteristics of this process are:
In the second use case, as illustrated in the figure below, clients are pessimistic about using their energy and wait for the slowest client to have enough energy before starting the training process. As a result, the process may be slow, but it gives better performance. Some of the main characteristics of this process are:
Instead of strictly adhering to either of these principles, training can use an optimal client scheduling process. Client selection in conventional federated learning algorithms rests primarily on the assumption that all clients are available to participate in training if chosen. Clients may drop out, but dropouts are assumed to occur uniformly at random, which does not bias the training.
Client selection in FL has mostly focused on maximizing the convergence rate or reducing the communication overhead of training. In contrast, the optimal scheduling process manages energy consumption: instead of participating in every round, selected clients join the training through a stochastic process based on their energy profile. This scheduling process ensures convergence by keeping the number of clients participating in each global round constant.
A unique characteristic of this scheduling is to allow clients to perform local training at each global round, but the global model is updated by using only the local updates from the clients that were originally scheduled at that global round.
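The idea of energy-aware stochastic scheduling can be sketched as follows. This is an illustration under stated assumptions, not the scheduling algorithm from the research itself: each client is assumed to receive enough energy for one update once every `period` rounds, and within each such window it contributes in one round chosen uniformly at random. The function name and the `energy_periods` mapping are hypothetical.

```python
import random
from typing import Dict, List


def schedule_clients(energy_periods: Dict[str, int], num_rounds: int,
                     seed: int = 0) -> List[List[str]]:
    """Return, for each global round, the clients scheduled to contribute.

    energy_periods maps a client id to its assumed energy renewal period,
    i.e. the client can afford one update every `period` rounds.
    """
    rng = random.Random(seed)
    schedule: List[List[str]] = [[] for _ in range(num_rounds)]
    for client, period in energy_periods.items():
        if period < 1:
            raise ValueError("energy period must be at least 1 round")
        # Walk over consecutive windows of `period` rounds.
        for start in range(0, num_rounds, period):
            end = min(start + period, num_rounds)
            # Spend this window's energy in one randomly chosen round.
            schedule[rng.randrange(start, end)].append(client)
    return schedule
```

Because each client spreads its participation randomly over its own window, the expected number of contributors per round stays the same from round to round, rather than spiking whenever energy happens to arrive.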
From the perspective of accuracy versus number of rounds (in other words, the time needed for model convergence), the optimal scheduling (Algorithm 1) combined with FedAvg achieves better accuracy, while Benchmark 1 holds a fairly steady accuracy across different numbers of global rounds. In contrast, the accuracy of Benchmark 2 increases as the number of global rounds grows.
We should also compare the CO2e emissions (expressed in grams; lower is better) of centralized learning and FL at the point where each reaches its target accuracy under different setups.
Benchmark 1: Each client participates in training as soon as it has enough energy and then waits until the next energy arrival.
Benchmark 2: The global model is updated only when all clients have received energy, i.e., the server waits until all clients have energy available before initiating a global update.
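The two benchmarks can be contrasted with a small simulation. The sketch makes several assumptions that are not part of the definitions above: energy arrives periodically (client `c` receives energy at rounds 0, `p`, 2`p`, ... for its period `p`), one arrival pays for exactly one update, and a client can hold that energy until it is used. All names are hypothetical.

```python
from typing import Dict, List


def benchmark1_schedule(periods: Dict[str, int],
                        num_rounds: int) -> List[List[str]]:
    """Each client trains in exactly the rounds where its energy arrives."""
    return [[c for c, p in periods.items() if t % p == 0]
            for t in range(num_rounds)]


def benchmark2_schedule(periods: Dict[str, int],
                        num_rounds: int) -> List[List[str]]:
    """The server updates only when every client has energy available.

    Under the assumptions above, that happens whenever the slowest client
    (largest period) receives energy; all clients join those rounds.
    """
    slowest = max(periods.values())
    everyone = list(periods)
    return [list(everyone) if t % slowest == 0 else []
            for t in range(num_rounds)]
```

With periods of 1 and 2 rounds, the first schedule has the fast client contribute in every round (so it dominates the updates), while the second produces a global update only every other round but always with both clients.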
We need to quantify cloud sustainability in a Federated Learning (FL) environment. In addition to promoting a sustainable supply chain, we need to put strong emphasis on smarter, more efficient enterprise data centers, where we can measure carbon-free energy scores along with Power Usage Effectiveness. The reference figure below gives definitions of some key metrics as used by the Google Cloud Platform.
Google CFE%: This is the average percentage of carbon-free energy consumed by a user application in a particular location on an hourly basis, taking into account the investments Google has made in renewable energy in that location. This means that in addition to the carbon-free energy already supplied by the grid, Google has added renewable energy generation in that location to work toward its 24/7 carbon-free energy objective.
Grid carbon intensity (gCO2eq/kWh): This metric indicates the average lifecycle gross emissions per unit of energy from the grid. This metric should be used to compare the regions in terms of the carbon intensity of their electricity from the local grid. For regions that are similar in CFE%, this will indicate the relative emissions for when your workload is not running on carbon-free energy.
Google Cloud net carbon emissions (Scope 2 market-based): Google invests in enough renewable energy and carbon offsets to neutralize the global operational carbon emissions footprint of Google Cloud per the GHG protocol under the Scope 2 market-based methodology.
Unique Cloud Strategies
As Google Cloud Platform is actively working to increase the CFE% of each Google Cloud region, deploying solutions in regions with a higher percentage of carbon-free energy increases the sustainability of those solutions. Some recommendations for cloud AI specialists and architects are:
Pick a lower-carbon region for your new applications: Building and running new applications in the region with the highest CFE% available to the application.
Run batch jobs in a lower carbon region: Planning batch workloads by picking the region with the highest CFE% in order to maximize the carbon-free energy supplying the job.
Set an organizational policy for lower carbon regions: Allowing usage of resources and services in certain regions while restricting their access and usage in other regions.
Efficient use of services: Increasing the efficiency of cloud apps so that they use less energy (and consequently emit less carbon) by relying more on serverless products such as Cloud Run and Cloud Functions. These services scale up and down automatically with the workload and conserve as much energy as possible. Right-sizing VMs also plays an important role in conserving energy.
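The region-selection strategies above can be expressed as a simple ranking. This sketch prefers the highest CFE% and, when CFE% is tied, the lowest grid carbon intensity; an optional allow-list stands in for an organizational policy. The function name and the data layout are hypothetical, and a real deployment would read current per-region figures from the cloud provider rather than hard-coding them.

```python
from typing import Dict, Iterable, Optional, Tuple

# region name -> (CFE percent, grid carbon intensity in gCO2eq/kWh)
RegionMetrics = Dict[str, Tuple[float, float]]


def pick_lowest_carbon_region(metrics: RegionMetrics,
                              allowed: Optional[Iterable[str]] = None) -> str:
    """Pick the region with the highest CFE%, breaking ties by grid intensity."""
    if allowed is None:
        candidates = dict(metrics)
    else:
        # Keep only regions permitted by the (hypothetical) organizational policy.
        candidates = {r: metrics[r] for r in allowed if r in metrics}
    if not candidates:
        raise ValueError("no candidate regions to choose from")
    # Sort key: highest CFE% first, then lowest grid carbon intensity.
    return min(candidates, key=lambda r: (-candidates[r][0], candidates[r][1]))
```

For example, with placeholder values `{"region-a": (90, 100), "region-b": (90, 50), "region-c": (40, 10)}`, the function picks `region-b`; restricting the allow-list to `region-a` and `region-c` yields `region-a`.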
The figure below illustrates how different variants of Federated Learning (FedAvg, FedAdam, and FedAdaGrad), each using a different adaptive ML optimizer, can work together with the right cloud optimizations to reduce carbon emissions.
Cloud & ML Optimizations in Federated Learning
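As a rough illustration of how the adaptive variants differ from plain averaging, the sketch below treats the average client change as a pseudo-gradient and applies an Adam-style or AdaGrad-style update on the server. It follows the general shape of adaptive federated optimization as commonly described, but the function name, default constants, and flat-list model format are assumptions chosen for readability.

```python
import math
from typing import List, Sequence, Tuple


def server_adaptive_step(global_w: Sequence[float],
                         client_ws: Sequence[Sequence[float]],
                         m: Sequence[float], v: Sequence[float],
                         variant: str = "adam", lr: float = 0.1,
                         beta1: float = 0.9, beta2: float = 0.99,
                         tau: float = 1e-3
                         ) -> Tuple[List[float], List[float], List[float]]:
    """One server round; returns (new weights, new momentum, new variance)."""
    n = len(client_ws)
    # Pseudo-gradient: average change proposed by the clients.
    delta = [sum(cw[j] - global_w[j] for cw in client_ws) / n
             for j in range(len(global_w))]
    new_m = [beta1 * mj + (1 - beta1) * d for mj, d in zip(m, delta)]
    if variant == "adagrad":
        # AdaGrad-style: accumulate squared pseudo-gradients.
        new_v = [vj + d * d for vj, d in zip(v, delta)]
    elif variant == "adam":
        # Adam-style: exponential moving average of squared pseudo-gradients.
        new_v = [beta2 * vj + (1 - beta2) * d * d for vj, d in zip(v, delta)]
    else:
        raise ValueError("variant must be 'adam' or 'adagrad'")
    new_w = [w + lr * mj / (math.sqrt(vj) + tau)
             for w, mj, vj in zip(global_w, new_m, new_v)]
    return new_w, new_m, new_v
```

The server keeps `m` and `v` between rounds (starting from zeros in this sketch) and passes them back in on the next call.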
Given current research results and the potential of Federated Learning (FL), there are focused research directions for building sustainable federated and distributed learning schemes for large-scale networks. In large-scale deployments, millions of devices jointly train machine learning models over large volumes of data. One of these directions is formalizing the fundamental performance limits of distributed training under stochastic and unknown energy arrival processes. Research will also continue on model quantization and compression techniques that can adapt to resource and energy arrival patterns, and on characterizing the relationship between energy renewal processes and training performance.
The end goal of deploying a simple and scalable federated learning strategy with provable convergence guarantees can be met with Sustainable Federated Learning. In Sustainable FL, devices can rely on intermittent energy availability. A framework of this kind can significantly improve training performance compared to energy-agnostic benchmarks.