By Javier Valez, Ph.D.

Overview

We’re excited to announce that we will be starting a new series of posts on Veracity here at the Forge.AI blog. The series will explore quite a few aspects, including:

- Machine learning models for veracity
- Moral and ethical aspects of modeling veracity
- Experimental design and processes for creating machine learning models
- System design
- Validation

This series is being written in tandem with the development of our system and models at Forge.AI and, as such, it is more of a journey than a destination. I will explore the rationale behind our choices and structure in order to illuminate hidden assumptions that may be baked into our models. I will also include my thoughts on being a Modeler (as opposed to other viewpoints and stances one can take when designing, building, or using machine learning).

Why Veracity?

At Forge.AI we consume many types of textual data as input including, but not limited to: news articles, research and analytical reports, tweets, marketing materials, shareholder presentations, SEC filings, and transcripts. We are working to turn this “soup” of data into coherent, succinct, and salient nuggets we call Events. These Events are consumed by our customers to make decisions, take actions, and reason about concepts relevant to their business. Before we can transform our input data into such Events we need to consider one of the most fundamental principles of computational systems, including machine learning systems: the concept of input quality. This principle is most succinctly represented as: Garbage In, Garbage Out.

Well, easy enough. We just won’t give our systems any bad inputs! Done. End-of-Post. Yet this post is still going… So, what did we miss? To continue our metaphor, what people sometimes leave out is that one man’s trash is another man’s treasure. Whether so-called bad inputs should be ignored or treated as informative is a matter of perspective and use-case.
This is especially true for us at Forge.AI since we do not, and cannot, know every single use-case for every single possible consumer of our Events. Knowing the quality of the input is key to being able to decide what to do with it: whether to not use the input at all, or to treat it as highly informative and useful in a particular system.

One such quality of textual data that is often of interest is whether the meaning of the text is actually true or not. For numerical data we might measure quality with a signal-to-noise ratio, but textual data quality is not as easily quantified. So, we turn to veracity in hopes of being able to understand and reason about the quality of textual input.

In addition to our own machine learning algorithms here at Forge.AI, we must also be cognizant of our customers’ machine learning systems. Do our customers’ systems handle “bad” quality data well? Is it useful for consumers of our Events data to know about things that are perhaps low-veracity? Some algorithms can leverage low-veracity Events such as tabloid hearsay or FUD campaigns; for others, ignorance is bliss. Our Events are designed to be inputs to machine learning systems far and wide; systems which themselves have to reason about their input data quality. Veracity is not just of interest to our internal machine learning algorithms, it is paramount for all consumers of our Events. Our models of data quality not only allow us to reason about *our* machine learning systems, they allow our consumers to reason about *their* machine learning systems.

A Definition

Our initial definition for veracity comes from the Merriam-Webster dictionary: conformity with truth or fact. Even though this definition uses only five words we can already see the complexity of trying to model veracity. To start, we have a choice in how to interpret conformity: is this a boolean property or are there gradations of conformity?
We are a client-facing company; therefore the results of our model of veracity (and our chosen definition of veracity) will be used by our clients to reason about their very own machine learning systems. At Forge.AI we pursue simplicity when building machine learning models, both to improve our own reasoning and because it lets the users of our model results reason more efficiently about whether and how to use those results. For these reasons we will treat conformity as being all-or-nothing.

Rewriting the definition, we now have: Veracity = “Does [a blob of text] fully conform to truth or facts? [ Yes | No ].” Notice that I have now made veracity a boolean property and I have further focused our models to work on blobs of text. It is important to narrow the scope of the model early on so that we can:

- Construct a valid and useful problem statement
- Create hypotheses and design experiments
- Gather relevant data

Let’s tackle the last two salient words: truth and fact. Wow. These words are loaded with potential meanings, nuances, politics, and philosophical pitfalls; entire fields are devoted to these two words. We need a base definition that is widely adopted from a trusted source, so we turn to one of the most reputable dictionaries of the English language, the Oxford English Dictionary:

Truth: That which is true or in accordance with fact or reality
Fact: The truth about events as opposed to interpretation

Seems clear enough, even with the circular definition; veracity is a boolean property that represents whether a blob of text contains things, or Events, that are true according to reality. This definition is scoped to what the Event(s) represents. For example, one can have a “true” Event that asserts “Javier *said* I cannot write” and a second, false, Event that incorrectly asserts “Javier cannot write”.
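To make the all-or-nothing reading concrete, here is a minimal sketch of an Event carrying a boolean veracity judgment scoped to its own assertion. The `Event` type and its fields are hypothetical illustrations for this post, not Forge.AI’s actual schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """A succinct, salient nugget extracted from a blob of text (hypothetical schema)."""
    assertion: str            # what the Event claims
    veracity: Optional[bool]  # all-or-nothing: True/False once judged, None while unknown

# Veracity is scoped to what the Event itself asserts, not to claims nested inside it:
e1 = Event("Javier said 'I cannot write'", veracity=True)   # he really did say it
e2 = Event("Javier cannot write", veracity=False)           # the nested claim is false
```

Note that the first Event can be true while the second is false, exactly because veracity attaches to the assertion as a whole.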
Knowledge, Infinities, and Truth

We now have a definition of what we want to model, but no model yet. How do we begin to model veracity, as defined in the previous section? It seems like any model will have to ascertain whether something, call it X, is true. How many things can be true? How do we represent facts? How do we query something, call it Y, for whether X is true or not? There will never be a Y that can store “all true things”, or, at least, practically never. We can have local regions of knowledge stored in Knowledge Bases, and such objects would allow us to query for whether X was true or not according to the Knowledge Base, but having a global, sufficient Knowledge Base seems like an impossible task. It is not even clear whether the number of truths or facts is finite or infinite. Additionally, what is known changes over time: non-stationarity will get you every time!

Of course, we can always assume certain things about truth. We could assume that truth is finite, or we could go down the non-parametric Bayesian route and assume truth is infinite but only a finite representation is needed at any point in time. We can also assume that truth and facts change slowly over time, perhaps not at all, in order to ignore the non-stationarity or make the growth rate manageable. How far do these assumptions take us? We can reduce the problem of veracity down to the following: given a blob of text representing X, is X contained in our Knowledge Base Y? Unreservedly glossing over how we get X from the blob of text (where X is something we can then query our Knowledge Base Y about), we have now turned our model of veracity into a query of a Knowledge Base.
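Glossing over extraction in the same way, the reduction can be sketched as a plain membership query. Everything below (the `extract_claims` helper, the set-backed Knowledge Base) is a hypothetical stand-in for illustration, not an actual Forge.AI component:

```python
from typing import Iterable, Set

# A toy Knowledge Base: under the (very strong) caveats discussed in the post,
# it holds all pertinent truths and no falsehoods.
KnowledgeBase = Set[str]

def extract_claims(blob: str) -> Iterable[str]:
    """Hypothetical stand-in for claim extraction; here, one claim per sentence."""
    return [s.strip() for s in blob.split(".") if s.strip()]

def veracity(blob: str, kb: KnowledgeBase) -> bool:
    """Boolean veracity: does every claim X in the blob appear in Y (the KB)?"""
    return all(x in kb for x in extract_claims(blob))

kb = {"water is wet", "the sky is blue"}
assert veracity("water is wet. the sky is blue.", kb)
assert not veracity("water is wet. pigs fly.", kb)
```

The sketch also makes the fragility visible: the result is only as good as the Knowledge Base, which is exactly the problem the next paragraphs run into.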
The caveats:

- Our Knowledge Base must contain all (finite, current, pertinent) truths
- Our Knowledge Base contains no falsehoods

So, even with two very strong assumptions regarding finiteness of truth and stationarity, we still have to somehow construct a Knowledge Base that contains all truths and nothing but the truth. A back-of-the-envelope calculation is enough to see the sheer magnitude, and impracticality, of storing all possible facts: there are 7.6 billion people on earth (source: worldometers) and each makes at least one decision daily which gives rise to at least one fact … and this does not include facts about things that are not people!

This line of thought has given me pause. Let us say, for now, that we will not have a Knowledge Base big enough to contain all truth. Can we, perhaps, only apply veracity towards subjects that we care about? How, then, do we define what we care about? Is it based on some form of utility function, perhaps applied to a future state? However we define it, we can try to build a knowledge base whose facts and truth cover a certain region or domain of knowledge well. We can quantify what well means, as well as cover. In fact, this idea of how one builds a knowledge base is already being discussed here at Forge.AI.

There will invariably come a time, though, when we must apply a model of veracity to something which is not in our Knowledge Base; either because it is a brand new thing or because our Knowledge Base is just incomplete. What recourse, then? Do we throw up our hands and say that anything not in our Knowledge Base is, by construction and definition, unreasonable (unable-to-be-reasoned-about)? What then of consumers who are on the bleeding edge, whose needs require reasoning over new and unexplored regimes? The real question, then, is: can we make a model of veracity that does not depend on our, or any, Knowledge Base?*1

Veracity Devoid of Fact

Eschewing a Knowledge Base, what is left?
Let us consider why we want a model for veracity in the first place. I can come up with the following reasons:

1. We want to know when someone or something is lying in order to “count lies”. It does not matter what the lie is about, or why, just that we see a lie
2. We want to know if we can trust something
3. We want to know if someone or something is lying (or not) because we want to know why they are lying (or telling the truth): was it simply a mistake, or is there some other game afoot?

Option #1 seems like we really do need a Knowledge Base. This option is all about whether the text is truthful or not and is a direct application of our definition of veracity. I can see, for example, academics wanting to study the statistics of lies, or the dynamics of lies, being interested in this option. I do not see a way to create a model of veracity for option #1 without a Knowledge Base; it is a valid problem but not one we will consider from now on.

Option #2 is all about trust. Now, trust certainly has a similar feel to facts and truth, but it is not exactly the same thing. You may be able to trust the content of a blob of text by trusting the source of that text. In fact, it is very reasonable to not know the truth behind a piece of text and yet to trust the source and therefore learn what the text says as a fact. Here I see several ways that we can try to model a quantity much like veracity but whose goal is actually trust:

- Reputation
- Expertise (global)
- Expertise (specific)
- Community Membership/”Points”
- Multiple Sources and Corroboration
- Past Experience
- Intention of Source

Option #3 is all about the intention of the source for the target audience. Notice that the last bullet point for the previous option is also the intention of the source. Coincidence? I think not!

Intention: An Aim or Plan

The section heading comes straight out of the definition of intention from the Oxford English Dictionary. Intention is all about the aim or the goal of a particular piece of text. Is the textual piece trying to persuade?
Is the piece trying to inform as objectively as possible? Is the piece trying to get itself re-syndicated somehow (sensational, clickbait writing and/or headlines)? Are there, possibly hidden, agendas for the textual information? If so, what are the agendas? Are there utility feedback loops in the environment which can inform the intention of a piece of writing? For example:

- Web ads + click revenue = clickbait
- Academic publication + grant likelihood = doomed to succeed

One striking property of the examples on intention above is that none of them revolve around truth. In fact, they are all relevant and interesting questions regardless of the factual content (or lack thereof) of a piece of text. This looks like a promising direction since intention, it seems, does not require a Knowledge Base.

Stepping Back and Stepping Forward: The Modeler’s Dance

Okay, let’s step back and see where we have arrived. We started with the idea of veracity and a solid and clear definition: conformity to truth or facts. However, when we began digging for a model of veracity we stumbled upon Knowledge Bases and the seemingly impossible task of ever having “the right Knowledge Base containing all truth and nothing but the truth”. So, what does a modeler do? We started dissecting the reasons why we would want a model of veracity. It turns out that for two of the three reasons we quickly came up with, the idea of intention was paramount. And intention has nothing to do with truth, or Knowledge Bases, so we can sidestep that whole mess altogether.

Such back-and-forth reasoning I call the modeler’s dance because it is an unavoidable process when creating models. Modelers are not, strictly, scientists in the sense of a search for truth; a model may be an approximation of reality, or it may be a useful tool to compute something. As such, modelers are not explicitly tied to the truth of the universe but sometimes are like engineers and create tools that are useful for specific functions.
Now, you may think that modeling anything but truth will always be worse than if you modeled truth itself. If that thought crossed your mind, I ask you this: have you ever found a self-deception to be useful, perhaps even necessary, during some part of your life (e.g. when dealing with fears or anxieties, self-deception is a useful ego defense)? We do not have the freedom to wait until we have fully understood reality to live in it (we are alive right now!). Similarly, a modeler cannot fail to consider that the model needs to be used, the questions need to be asked and answered, even if the truth of the matter is as yet unknown to all of humanity (or even just to the modeler). There is always a part of a modeling process where we need to determine both what we believe to be truth, and what we believe we need to model; the two are not always the same thing.

Dance of the Modeler 1: Intention and Syntax

Our journey so far has taken us to a new goal: creating a model of intention. We want this model to explicitly distance itself from Knowledge Bases, facts, and truth. Further, we would like this model to apply as generally as possible to pieces of text. How do we do this? We turn to the somewhat contentious ideas, authored by James E. Pennebaker in “The Secret Life Of Pronouns”, that the use and structure of language carries information about the emotions and self-concepts of the author of said language. As an example, people who obsess about the past use high rates of past-tense verbs. Pennebaker’s ideas are alluring because they tie together language and the creator of that language, so that analyzing the language reveals something about the creator, such as their intentions. In order to be general to types of text, is there something in the syntax that can help infer the intention of the piece of text? According to Pennebaker: Yes. “Functional word” (pronouns, articles, auxiliary verbs) usage may give insight into the speaker, the audience, or both.
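Pennebaker’s functional-word observation suggests a concrete feature family: the rate at which a text uses pronouns, articles, and auxiliary verbs. A minimal sketch, with deliberately tiny word lists (real lexicons are far larger):

```python
import re

# Tiny illustrative function-word lists; real lexicons are much more complete.
FUNCTION_WORDS = {
    "pronoun": {"i", "you", "he", "she", "it", "we", "they", "me", "him", "her"},
    "article": {"a", "an", "the"},
    "auxiliary": {"is", "are", "was", "were", "be", "have", "has", "had", "do", "did"},
}

def function_word_rates(text: str) -> dict:
    """Per-category function-word usage as a fraction of all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens) or 1
    return {cat: sum(t in words for t in tokens) / n
            for cat, words in FUNCTION_WORDS.items()}

rates = function_word_rates("The model is ready. We have tested it.")
```

Features of this kind depend only on closed-class words, not on what the text is about, which is exactly the property we want for a content-agnostic model.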
We will go a step further and try to use the syntax structures themselves (not the actual word choice) as features with which to build a model of intention. An example and teaser of how syntax differs by intention is the following: the average number of adverbs used in survey questions (designed by statisticians to be unbiased) differs from that in reading comprehension questions (where presumably statisticians were not employed to create unbiased questions).*2

While syntax seems to align with intention, we must be extremely cautious not to end up creating a model for voice. Voice, sometimes erroneously referred to as the writing style of a piece, is not fully understood. Different communities and cultures may have norms determining the voice that is “proper” or “good” for a particular piece of text. Human readers are often swayed by the voice of a piece of text, especially if the voice goes against the norms of the reader’s community. While voice may be an interesting part of a piece of text, it is not the intention of the text. Our wish is for a model of intention, regardless of voice. Why? Because as modelers, we are responsible for the decisions, actions, and possible misuses of our models. We want to be very careful to make a model that does not unfairly bias towards/against:

- Race
- Gender
- Age
- Culture

For us, we will use the following definition: voice is all stylistic elements that are based on the individual speaker/writer and are inherent to the speaker/writer; voice encompasses intrinsic properties of the speaker/writer devoid of intention. Style, on the other hand, is a choice made by a writer and has an inherent intentionality behind it; I may choose to write in a persuasive style, or an analytical style. For those further interested in voice, Dell Hymes’ “Ethnography, Linguistics, Narrative Inequality: Toward an Understanding of Voice” is an interesting read.
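The adverb-rate teaser above can be mocked up in a few lines. As a crude stand-in for real syntactic parsing, this sketch counts “-ly” tokens as adverbs; a production model would use a proper part-of-speech tagger, and both example question sets here are invented for illustration:

```python
import re

def adverb_rate(sentences):
    """Fraction of tokens ending in '-ly' (a rough adverb heuristic, not real parsing)."""
    tokens = [t for s in sentences for t in re.findall(r"[A-Za-z]+", s.lower())]
    return sum(t.endswith("ly") for t in tokens) / max(len(tokens), 1)

# Invented examples: neutrally worded survey questions vs. leading comprehension questions.
survey = ["How often do you exercise?", "Which option do you prefer?"]
comprehension = ["Why did the author clearly and strongly object?",
                 "How quickly did the obviously angry crowd react?"]

# The syntactic feature separates the two intents without looking at truth at all.
assert adverb_rate(comprehension) > adverb_rate(survey)
```

The point of the sketch is the shape of the signal, not the heuristic: intention shows up in how questions are phrased, independent of their factual content.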
Now that we have traded one seemingly impossible modeling task (veracity) for another just as seemingly impossible modeling task (intention), look forward to the next step towards a model of intention using syntax in a future post.

Cooldown

I started this post talking about the importance of knowing the “quality” (by which I meant veracity) of data being given to machine learning algorithms, not just here at Forge.AI but also any consumers of our Events data stream. We danced, we thought, and we have even shifted our original goal from a purely truth-based veracity model to one whose use-case of trust is explicit from the start: a model of intention. We begin by using syntax, the underlying structure of language itself, as a way to generalize our model into one useful over many types of texts. But there are dangers that we must explicitly guard against; dangers of bias against culture or race, voice or gender. In this way, a Modeler is always willing to create the future, reality be damned! As always, my hope is that our models perform better than a human audience. Is this an attainable goal? Perhaps; but it is our goal nonetheless.

Footnotes:

*1 Going into this exercise, I had a strong leaning towards one particular outcome. However, it’s important to understand when your feelings are tied to your intuition. It is even more important to understand that intuition is sometimes wrong. The process of creating a machine learning model is sometimes touted as an art or a creative process. That may be true for some but, in my view, models are rational and the process for creating a machine learning model should also be rational. The best place to live is right at the edge of intuition, and a little beyond!

*2 In case you were wondering, I did just throw in the concept of intention without any real definition.
As it turns out, intention is a confusing concept, and one which deserves its own post (or two or three as we progress down our journey). For now, let us say that intention boils down to one of a small set of future changes to the reader a writer may have hoped for, including staples such as: deception, persuasion, information transfer, objective query-and-response, and emotional modification.

Note: This post was originally published on our blog: https://www.forge.ai/blog/veracity-models-methods-and-morals