AI is the future and there's lots of money to be made from it. But organisations keep making the news over AI governance failings, such as Microsoft's chatbot that turned racist and Google Photos labelling African-Americans as gorillas. We're seeing a growth of ethics and governance councils, but with mixed success - Google shut theirs down. Why is good governance proving so difficult? Does there have to be a trade-off between good governance and innovation?

The quick answer is that running AI with governance poses tough technical challenges. The solutions fall under the emerging field of MLOps... but it is an emerging field. For the detailed answer, read on.

The AI Rush

AI is taking off. The market for AI has been forecast to grow from $9.5bn in 2018 to $118.6bn by 2025. Naturally there is a race to get to the opportunities first.

However, making money from machine learning right now is not easy. Up to 87% of machine learning projects never go live. Some projects fail because the idea doesn't work or the data isn't available. Even after those hurdles, the journey to running the model live isn't easy.

Getting data, preparing it, keeping it fresh and managing changes to it are all challenging. There can also be demanding hardware needs. The data flowing through the model and the quality of predictions can require monitoring. All this makes for a challenging AI operations landscape.

One approach is the "move fast and break things" way - just deal with the issues as they come. This is especially risky with machine learning, as things can break in a whole range of ways. Let's look at some of them.

Challenges for Running AI in Prod

Machine learning models are only as good as the data they are trained on. Machine learning takes patterns from known data and reapplies the extracted patterns to new data. If the known data is not representative of the new data then the model will give bad predictions. So a major challenge is getting data that is representative.

The level of quality required of predictions varies with the use-case. If the predictions are going to be followed up by a human (e.g. flagging transactions as potential fraud risks) then a poor prediction might not be too serious. If a self-driving car mistakes a stop sign for a speed limit sign then that could be very serious. In some cases even one bad prediction could be intolerable.

Outliers

Sometimes a model can perform well overall but occasional predictions go astray. This is likely to happen if there are particular data points that stand out from the others: outlier data points.

Outliers don't fit the pattern of the rest of the data, and because the model is built to be general it will predict on them as though they were within the main pattern. If you can detect outliers then you may be able to plan for them and send them down a different process. Naturally it takes time to get a process in place, so it's a risk you want to mitigate in advance rather than try to react to.

Concept Drift

You may train your model on good, representative data and it might perform well in live. Then it might mysteriously start to perform worse and worse. This can happen if the relationships in the data change in the real world. For example, if you're predicting which fashion items are likely to sell and the season changes, you'll find your model suggesting swimwear in winter. To avoid losing money you'd need to scramble to get fresh data to train your model on.
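As a rough illustration of how drift like this can be caught before it costs money, here is a minimal sketch (my own, not from any particular tool) that compares the distribution of a live feature against the training distribution using a two-sample Kolmogorov-Smirnov test from SciPy. The feature, window sizes and alert threshold are all hypothetical.

```python
# Minimal concept-drift check: compare a live feature's distribution
# against the training distribution with a two-sample KS test.
# Feature, window size and alert threshold are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def drift_alert(train_values: np.ndarray, live_values: np.ndarray,
                p_threshold: float = 0.01) -> bool:
    """Return True if live data looks significantly different from training data."""
    statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold

# Example: daily sales figures used to train a fashion-demand model (synthetic data)
rng = np.random.default_rng(42)
train_sales = rng.normal(loc=100, scale=10, size=5000)   # summer training window
live_sales = rng.normal(loc=60, scale=15, size=500)      # the season has changed

if drift_alert(train_sales, live_sales):
    print("Drift detected: retrain the model on fresh data")
```

In practice you would run a check like this per feature on a schedule and wire the alert into whatever monitoring the rest of the system uses.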
Bias

There are data points that might correlate highly with certain outcomes but which we shouldn't use for ethical reasons. For example, we wouldn't find it acceptable to use race in automated recommendations for parole.

An apparent model bias situation hit the news at the end of 2019 with AppleCard offering David Heinemeier Hansson a much higher credit limit than his wife, despite his wife having a higher credit score. Apple co-founder Steve Wozniak was among those to report the same.

The operators of AppleCard have stated that they didn't use gender as a data point. However, gender could enter indirectly. For example, if occupation is used, this could indirectly bias for gender in some cases, as certain occupations are dominated by one gender (e.g. primary school teachers tend to be female).

The AppleCard case has attracted regulatory attention and highlights that bias represents a legal risk as well as a reputational risk. New York's Department of Financial Services stated that an "algorithm that intentionally or not results in discriminatory treatment of women or any other protected class violates New York law."

Privacy

The Facebook and Cambridge Analytica scandal has brought a lot of attention to privacy issues. A key part of that was the sharing of data without adequate consent from the person the data concerned. There can be risks in this space for machine learning too.

A machine learning model is likely to predict similarly to the nearest data points in the training data. Say you're predicting voting and somebody asks for predictions for retired female voters in a given district. One might not expect that to reveal much about who was surveyed for the training data - but it might if there's only a handful of retired female voters in that district. It might then be possible for somebody so inclined to work backwards and figure out things about your data that you didn't know you were revealing.

Range of Risks

So we're seeing a range of risks. Poor data might lead to poor predictions that cost money. Or poor predictions might be outright dangerous. Both happened with the failed Watson for Oncology project, which had $62M of funding and was shut down after making too many unsafe predictions.

Even if the training data is representative for a wide range of cases, the real world might throw up unanticipated kinds of data. This happened with Apple's face recognition software that was tricked by a mask and the self-driving car that wasn't able to recognise jaywalkers.

Machine learning models can be briefly successful and then get tripped up when the real-world data changes (concept drift). But models can also be too responsive to new data. Microsoft's famous chatbot was learning continuously from its interactions online and the result was that it became racist.

It's also possible to get into hot water by overlooking the ethical dimensions of using particular data points such as gender or race. An example of this was Amazon Rekognition being shown by the ACLU to be far more likely to mistake a member of US Congress for a criminal (when matched against criminal mugshots) if they were a person of colour.

With increasingly high-profile incidents hitting the news, organisations are concerned to ensure that AI adoption doesn't come at an unacceptable cost.
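One way to catch the kind of indirect bias described above before a model goes live is a simple disparity check on its outputs. The sketch below is my own illustration rather than anything from the article: it compares favourable-outcome rates across a protected attribute such as gender, with made-up data and a rule-of-thumb threshold.

```python
# Minimal fairness smoke test: compare favourable-outcome rates across groups.
# A large gap doesn't prove discrimination, but it flags the model for review.
# Column names, data and the 0.8 "four-fifths" threshold are illustrative.
import pandas as pd

def approval_rate_ratio(df: pd.DataFrame, group_col: str, outcome_col: str) -> float:
    """Ratio of the lowest group's approval rate to the highest group's."""
    rates = df.groupby(group_col)[outcome_col].mean()
    return rates.min() / rates.max()

predictions = pd.DataFrame({
    "gender":   ["F", "F", "F", "M", "M", "M", "M", "F"],
    "approved": [0,    0,   1,   1,   1,   0,   1,   0],
})

ratio = approval_rate_ratio(predictions, "gender", "approved")
print(f"approval rate ratio: {ratio:.2f}")
if ratio < 0.8:   # commonly cited four-fifths rule of thumb
    print("Potential disparate impact - investigate before going live")
```

A check like this belongs alongside accuracy metrics in whatever evaluation is run before each release, so that disparities surface internally rather than in the news.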
Striving for Good Governance

Regulation

Government and regulators are also concerned. Facebook has even gone so far as to call on government to provide more regulation. We are in the early days of forming regulation, as regulators have concerns about what effects greater regulation might have, especially on innovation. But there are some notable attempts emerging.

The European Union's GDPR legislation states that "the data subject shall have the right not to be subject to a decision based solely on automated processing, including profiling, which produces legal effects concerning him or her or similarly significantly affects him or her." It adds that data subjects have a right to "meaningful information about the logic involved" and to "the significance and the envisaged consequences" of automated decision-making.

The general thrust is that if AI is used to make a decision that impacts me then I should be able to challenge that decision and get a meaningful explanation. If concern is raised about the process then the organisation might be challenged on whether the process introduced unnecessary risk.

Guidelines have been issued by the US Federal Reserve for the financial industry on the use of quantitative models (SR 11-7 - pdf). These also expect organisations not to take unnecessary risk and to be able to justify their design decisions. Furthermore, they call for active monitoring and evaluation of models. The management of risk involved in the design and use of a model should be able to stand up to challenge by an informed party, and the institution is expected to have processes in place to ensure that the risk trade-offs of any design are justifiable.

Internal AI Ethics Bodies

Part of the AI governance picture now is organisations setting up councils to oversee guidelines on the appropriate use of AI in their organisation. A key problem these councils face is how to close the gap between high-level aspirations like transparency and fairness and actionable advice.

Closing the Gap

So how to close the gap between governance aspirations and data science practice? And can it be done in a way that doesn't compromise on innovation?

Lessons of DevOps

It is possible to go faster and with better governance. This is one of the lessons of the rise of DevOps. Automating processes allows them to be executed faster, more predictably and with more reliable tracking. There may be an upfront cost to automating deployment and monitoring processes, but the cost is paid back by the benefits of being able to make iterative improvements more often, gathering more feedback and responding to it more quickly.

DevOps for Machine Learning is Special

Unfortunately, more than just existing DevOps practices are needed in order to close the gap between AI governance aspirations and data science practice.

Reproducibility

If something goes wrong with a model then you might naturally want to be able to go back to how it was trained (that particular version of it), make a tweak and train it again. If your process is to stand up to challenge by a potential regulator or auditor then you'll need to think about whether they'll ask for this. Being able to go back to a particular version and rebuild it from source is pretty typical for mainstream DevOps, but it's not an easy thing to achieve for machine learning. There can be a lot of data involved, it can change often and it can be transformed as it goes through the process.
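To make "go back to that exact training run" concrete, here is a minimal sketch of my own (dedicated experiment trackers and data-versioning tools do this far more thoroughly) that records the ingredients of a run - the code version, a hash of the training data, parameters and metrics - as a JSON manifest alongside the saved model. File names and fields are illustrative.

```python
# Minimal reproducibility manifest: capture enough metadata alongside a saved
# model to rebuild or audit the training run later. Names are illustrative.
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path

def sha256_of_file(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def git_commit() -> str:
    # Which version of the .py files produced this model
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip()

def write_manifest(model_path: str, data_path: str, params: dict, metrics: dict) -> None:
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the run happened
        "code_version": git_commit(),
        "training_data_sha256": sha256_of_file(data_path),
        "model_sha256": sha256_of_file(model_path),
        "params": params,     # e.g. hyperparameters
        "metrics": metrics,   # e.g. accuracy on a hold-out set
    }
    Path(model_path + ".manifest.json").write_text(json.dumps(manifest, indent=2))

# Example usage after a training run:
# write_manifest("model.pkl", "train.csv", {"max_depth": 5}, {"accuracy": 0.92})
```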
An example machine learning pipeline might have several stages before generating a model: source files (.py) that prepare the data and train on it, and finally a serialised model artefact (a .pkl file).

For reproducibility you would want to be able to go back to a particular training run that resulted in a particular model (the .pkl) and find all the versions of the source files (the .py files) and the versions of the data, as well as who/what initiated the run, when it ran and any metrics (e.g. accuracy).

Monitoring

For full reproducibility you would also need to know exactly what request was made and what the prediction was. So this data needs to be stored and made retrievable. Full logging of predictions is useful because it gives you:

An audit trail of requests that can be individually investigated.
Fresh data that can be used for training the next iteration of the model.
A stream of data that can be monitored against benchmarks.

Monitoring of the data stream could be used to mitigate the risk of concept drift. This would involve watching whether live data is close enough to the training data for predictions to be trustworthy. Alerts could be set up for if the data strays too far.

Deployments

Deploying a new iteration of a model to take over all live traffic in one go might be risky unless you're entirely sure that live traffic matches the training data. Given the risks, you might want to tentatively deploy new versions of models to run against just a portion of live data before scaling them up to take over from the old version.

Explainability

We've seen that GDPR asks that data subjects have a right to "meaningful information about the logic involved." Providing explanations for machine learning decisions can be challenging. Machine learning takes patterns from data and reapplies them, but the patterns are not always simple lines that can be visualized. There are a range of types of technique and the kind of explanations that are achievable can vary.

There's an example of the use of one such technique in Seldon's Alibi explainer library. In that example income is predicted from US census data, which includes features such as age, marital status, gender and occupation. A technique called 'anchors' is used to reveal which feature values were most relevant to a particular prediction. The technique scans the dataset and finds patterns in the predictions. A particular low-income prediction is chosen and the technique reveals that a separated female would be classified as low-income in 95% of cases, meaning that other features (such as occupation or age) have very little relevance for separated females.

The income classifier example bears a striking resemblance to the incident with AppleCard offering David Heinemeier Hansson a much higher credit limit than his wife. Building explainability into the design would've allowed it to be determined quickly whether gender was the reason for the bias.

There are also explanation techniques applicable to text classification or image predictions. Another Alibi example shows how particular sections of an image can be determined to be especially relevant to how it was classified: there, the nose region of the picture was especially important for determining that the ImageNet picture was of a Persian cat.
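For a flavour of what the anchors technique looks like in code, here is a rough sketch along the lines of Alibi's income-classification example. The data is synthetic, and the exact function and attribute names may differ between Alibi versions, so treat it as illustrative rather than a copy of the library's tutorial.

```python
# Sketch of an anchor explanation for a tabular classifier using Seldon's Alibi.
# Data is synthetic and stands in for the census features in the Alibi example;
# API details may vary between Alibi versions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from alibi.explainers import AnchorTabular

feature_names = ["age", "marital_status", "gender", "occupation"]
rng = np.random.default_rng(0)
X_train = rng.integers(0, 10, size=(1000, 4)).astype(float)
y_train = (X_train[:, 0] + X_train[:, 3] > 10).astype(int)  # synthetic "income" label

clf = RandomForestClassifier().fit(X_train, y_train)

# The explainer needs a prediction function and the feature names,
# plus the training data to learn how to perturb instances.
explainer = AnchorTabular(clf.predict, feature_names=feature_names)
explainer.fit(X_train)

# Explain one prediction: which feature values "anchor" this outcome?
explanation = explainer.explain(X_train[0], threshold=0.95)
print("Anchor:", explanation.anchor)        # e.g. rules such as "occupation <= 2.00"
print("Precision:", explanation.precision)  # how often the anchor implies this prediction
```

The output is a human-readable rule plus a precision figure, which is exactly the kind of "meaningful information about the logic involved" that a data subject or an auditor might ask for.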
Rise of MLOps

We can now see that there's a range of challenges to tackle for good AI governance. Many organisations are choosing whether to put together an MLOps infrastructure in-house to tackle these challenges or to adopt a platform.

ML Platforms

ML Platforms come in a range of flavours. Some are part of a cloud provider offering, such as AWS SageMaker or AzureML. Others are an offering in themselves, such as Databricks' MLflow. The level of automation involved can vary. For example, Google's AutoML and DataRobot aim to enable models to be produced with minimal machine learning expertise. Some platforms are closed source, some are open (such as the Kubernetes-oriented Kubeflow). Some are more oriented towards particular stages of machine learning (e.g. training vs deployment), and the extent of support can vary when it comes to aiding reproducibility, monitoring or explainability. Organisations have to consider their use-cases and priorities when evaluating platforms.

Tool Landscape

Rather than choosing an existing platform, organisations or teams can choose to assemble their own. There is not yet a 'canonical stack' of obvious choices. Instead there's a range of choices for each part of the pipeline, and a particular assemblage of choices might not have been designed to work together. Teams have to evaluate tools for how well they fit their needs and also how well they fit their other tool choices.

The Kubeflow platform represents an interesting position here as it aims to bring together best-of-breed open source tools and ensure they work well together. Seldon, where I am an Engineer, specialises in open source tools for deployment and governance. We partner with Kubeflow on tools such as KFServing. We also offer Seldon Deploy, an enterprise product that brings open source tools together to get the best out of them and speed up project delivery. Seldon Deploy can be used either standalone or together with other platforms to add an extra layer of visibility and governance.

Industry Initiatives

The Linux Foundation's LF AI project aims to foster open source collaboration on machine learning tools. It has a great visualization of the tool landscape, which is large and growing.

The Institute for Ethical AI is a volunteer-led research centre with over 150 expert members and has put a particular focus on providing practical tools and showing how to follow its principles. The Institute has created an AI explainability library, a framework to evaluate machine learning suppliers and listings of MLOps tools.

The Practical AI Ethics Foundation was recently formed by Dan Jeffries of Pachyderm (a company specialising in data pipelines, versioning and lineage). The foundation has put an emphasis on closing the gap between AI governance aspirations and realisation. It has emphasised auditing and explainability, as well as suggesting the formation of AI QA/response teams.

This is just some of what's going on in this fast-moving space.

Move Fast with Stable Infrastructure

Part of the future for machine learning is to follow Facebook's "move fast with stable infrastructure" motto. MLOps infrastructure will enable organisations to go faster and with greater governance. Right now MLOps is a space of innovation. Projects looking to leverage MLOps have to first learn how to navigate the space and take advantage of it in a way that best suits their situation and aims.

(Title image from https://medium.com/swlh/move-fast-and-break-things-is-not-dead-8260b0718d90, attributed to Facebook.)