What do I think data ethics is?

https://techcrunch.com/2016/11/12/data-ethics-the-new-competitive-advantage/

Background

I have spent quite a bit of time reading all sorts of ideas about ethics, data, artificial intelligence, algorithms, automation and so on, but I have yet to work out exactly what I think data ethics is all about. And I don’t yet have a definitive idea.

I came to this area gradually while I was doing a Udacity Deep Learning Nanodegree. ‘Deep learning’ is a form of machine learning that has multiple layers of computational ‘neurons’ connected by weights. Through a process of learning, the model adjusts these weights, which can number in the thousands or more. Over time the model ‘learns’ to map inputs to outputs more accurately. It’s pretty amazing! And if you look at arXiv you can watch the constant (and overwhelming) progress in this area each day!
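To make the idea of ‘adjusting weights’ concrete, here is a minimal sketch in plain Python. It is nowhere near a deep network: it is a single ‘neuron’ with two weights, and the target rule, learning rate and function names are all assumptions made up for illustration. Still, the loop of predict, measure the error, nudge the weights is the same basic idea.

```python
# Toy example: one 'neuron' with two weights learns the rule y = 2*x1 - 3*x2.
# The rule, the learning rate and all names are illustrative assumptions.
import random


def train(samples, learning_rate=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    # Start from small random weights; the model knows nothing yet.
    weights = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
    for _ in range(epochs):
        for inputs, target in samples:
            prediction = sum(w * x for w, x in zip(weights, inputs))
            error = prediction - target
            # The gradient of 0.5 * error**2 with respect to each weight
            # is error * x, so step each weight a little way against it.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, inputs)]
    return weights


# Nine example inputs, each labelled with the 'correct' output.
samples = [((x1, x2), 2 * x1 - 3 * x2)
           for x1 in (-1, 0, 1) for x2 in (-1, 0, 1)]
learned = train(samples)
print(learned)  # the two weights should end up very close to 2 and -3
```

Nothing in the loop is told that the answer is 2 and -3; the weights drift there because each small adjustment reduces the error on the examples.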

Neural networks are not simple things to understand, nor are they simple to program. And it was while figuring out how to create these models that I realised I had no idea why the model was able to turn an image into a correct label. I understood the process it used to get a better representation of the input (backpropagation, which uses the error in the output to adjust the weights), and it made sense. However, if you asked me how or why a certain aspect of an image was the most relevant to a particular label, I would have to shrug my shoulders.

That said, there are some techniques for visualising what is going on inside these deep learning models, and they do help; but I still don’t think it is enough, particularly when one of these models is used to influence an outcome in someone’s life.
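One family of such techniques asks how much the output moves when each input is nudged slightly. The sketch below is a toy version of that idea, not any particular library’s method: the four-‘pixel’ model, its fixed weights and the function names are all hypothetical.

```python
# Toy sensitivity check: nudge each input and see how much the output moves.
# The model, its weights and the 'pixels' are illustrative assumptions.
import math


def toy_model(pixels, weights=(0.1, -2.0, 0.0, 1.5)):
    # A weighted sum squashed into the range 0..1, standing in for a real model.
    score = sum(w * p for w, p in zip(weights, pixels))
    return 1 / (1 + math.exp(-score))


def sensitivity(model, pixels, step=1e-4):
    # Finite-difference estimate of how the output changes per unit
    # change in each input, one input at a time.
    base = model(pixels)
    result = []
    for i in range(len(pixels)):
        nudged = list(pixels)
        nudged[i] += step
        result.append((model(nudged) - base) / step)
    return result


pixels = [0.5, 0.2, 0.9, 0.4]
scores = sensitivity(toy_model, pixels)
# Index of the input whose change moves the output the most.
most_influential = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(scores, most_influential)
```

This tells you which inputs the output is most sensitive to, which is useful, but it still doesn’t tell you why the model weighs them that way.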

I asked endless questions on the course forum but got answers that weren’t very satisfying. And as the course moved on to more esoteric models like recurrent neural networks and adversarial networks, I had to take time out and consider what I was actually looking at. I wanted a narrative to explain to me what these models were all about — why are we doing this? what problems do they solve? what is the problem?

My next logical jump was to consider the implications of not just deep neural networks, but the whole field of artificial intelligence in all its forms. Now, I am not buying into the idea of conscious machines or anything … I am only talking about ‘intelligence’ that allows a computer model to identify patterns in data on a scale no human can match. That’s all! The discussions around autonomous robots, while interesting, are not really my concern.

So, I stopped trying to understand the technicalities of these neural networks (i.e. the Python code, or the TensorFlow API) and started looking at the social implications.

The Data Pipeline

I think of data as a living thing. It is just out there collected in massive databases and stored or exposed through APIs. And in many cases it is just collected as a by-product of people using various online platforms. It is metadata, it is transactional data, and it lacks context. Then someone — or an organisation — collates the data and transforms it in a way that makes it suitable to use in a deep learning model. Essentially, the problem needs to be ‘numberfied’ in a way that can be operated on through weights and activation functions. The model then spits out yet more data — which now has some context. This data informs decisions, leads to some other automated process, or could even be fed back through the model to improve its learning.
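As a rough illustration of what ‘numberfying’ can look like, the sketch below turns one made-up usage record into a list of numbers and passes it through a single weighted sum and activation function. The field names, categories and weights are all hypothetical; a real pipeline would have far more of each.

```python
# Toy 'numberfication': turn a record into numbers a model can operate on.
# Field names, categories and weights are illustrative assumptions.
import math

DEVICES = ["desktop", "mobile", "tablet"]  # hypothetical categories


def numberfy(record):
    # A category becomes a one-hot list: 1.0 in its own slot, 0.0 elsewhere.
    one_hot = [1.0 if record["device"] == d else 0.0 for d in DEVICES]
    # A raw quantity is rescaled, here from minutes to hours.
    hours_online = record["minutes_online"] / 60.0
    return one_hot + [hours_online]


def neuron(features, weights, bias):
    # Weighted sum of the inputs followed by a sigmoid activation function.
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-total))


record = {"device": "mobile", "minutes_online": 30}
features = numberfy(record)
output = neuron(features, weights=[0.2, -0.4, 0.1, 0.8], bias=0.0)
print(features, output)
```

Every choice here, such as which categories exist or how time is scaled, is made by a person before the model ever sees the data.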

However, there are many human interventions required in this journey from input through to output and then to business decision. A person (and in the following I also mean ‘people’) needs to decide how a business problem can be converted into a form suitable for the model; a person needs to run code to transform the data in a systematic way to feed into the model; a person needs to ‘train’ the model; then a person needs to fit the decision-making model within a broader business context; and a person needs to take action on the output of the model. At any of these steps an unintended error or bias can creep in.

Essentially, there are a lot of places to stand back and be concerned about what is actually going on. And we certainly can’t take for granted that the model is producing reliable or accurate outputs. We can hope the model and all the processes above were enacted with care … but how can we know for sure?

Checks and Balances

With all this in mind, I think my understanding of data ethics is that we need a structured approach to collecting, using, and making decisions with data. I think we need ways to check and double-check our assumptions along the way, and a process that keeps our values engaged rather than simply delegating our responsibility to an algorithm. We need to be accountable.

As the field is developing so rapidly, the range of implications of these models can only grow. When we consider all the places where large amounts of data are collected and ‘serious’ decisions need to be made (good examples are immigration, healthcare, and education), we definitely want the advantages of computers, but not at the expense of understanding what is going on. It isn’t safe simply to assume a model’s output is correct … and it is certainly optimistic to believe we can construct an all-embracing model that can deal with the messiness of reality.

I want to be involved in the development of standards and best practice in this area. I don’t want things to take on a life of their own, with an inertia born of misinformation or lack of understanding. But these are all ideas that are still taking shape for me.