I've launched a new Twitter bot that tells the stories of all of the very good dogs who live in New York City.
The data the bot uses originates from a ~90k row CSV of dog licensing data obtained from the city of New York via a FOIA request. I was alerted to the existence of this data by Parker Higgins, who was messing around with it and had converted it to a more useful format.
The moment I got my hands on the data, I knew what I had to do:
Now that the bot is live, I wanted to share some info about the process of making it.
Ok, this is pretty banal, but there are a thousand ways to build and deploy a Twitter bot, and sometimes it's interesting to see how others do it. So here goes, very quickly:
Good Dogs of NYC is a Ruby application that uses Active Record to talk to a PostgreSQL database and the Twitter gem to tweet. It's deployed on Heroku as an app with no dynos. I use the free Heroku Scheduler add-on to run an executable script that does the tweeting. It runs every 10 minutes and tweets 1 out of every 8 times it runs.
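If you're curious what that looks like in practice, here's a rough sketch of the kind of script the scheduler runs. To be clear, this is illustrative rather than the bot's actual code: the file name, the Dog model, the column names, and the environment variable names are all stand-ins.

#!/usr/bin/env ruby
# bin/tweet (illustrative name). Heroku Scheduler runs this every 10 minutes;
# the bot only actually tweets on roughly one out of every eight runs.

require "active_record"
require "twitter"

exit unless rand(8).zero?   # skip 7 out of 8 runs

ActiveRecord::Base.establish_connection(ENV.fetch("DATABASE_URL"))

class Dog < ActiveRecord::Base; end   # backed by the cleaned licensing data

client = Twitter::REST::Client.new do |config|
  config.consumer_key        = ENV.fetch("TWITTER_CONSUMER_KEY")
  config.consumer_secret     = ENV.fetch("TWITTER_CONSUMER_SECRET")
  config.access_token        = ENV.fetch("TWITTER_ACCESS_TOKEN")
  config.access_token_secret = ENV.fetch("TWITTER_ACCESS_TOKEN_SECRET")
end

dog = Dog.order(Arel.sql("RANDOM()")).first   # pick a random good dog
client.update("Hi, my name is #{dog.name}.")  # the real bot builds this text with Wordz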
Ok that was pretty boring. Here's another good dog to cleanse your palate:
As always with this kind of project, a bunch of labor went into cleaning and massaging the data into a usable form. The most time-consuming part was locations. The data provided each dog's ZIP code, but I wanted to refer to NYC neighborhoods. So I set out to find a mapping of NYC ZIP codes to neighborhoods. Easy, right?
Nope. The only ZIP-to-neighborhood mapping available online is provided by the NYC Department of Health, but it isn't specific enough with its neighborhoods ("Southern Brooklyn" doesn't cut it), and it was missing a bunch of codes that appeared in my data set.
So I took a closer look at the ZIP codes in the data, and discovered why no one had created this mapping before:
So, I improvised. I settled on 182 codes that covered more than 99% of my doggie friends. For each of these, I searched for the ZIP on Google Maps. Then (and this is the real 1337 haxor part) I squinted at the screen and chose the neighborhood that, according to Google, looked like it mapped best to the ZIP code's region.
Here's the final result of that tedious work. (If you want to use this data, please keep in mind that some arbitrary choices went into it because of (4) above.)
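If you did want to wire a mapping like that into your own code, it's really just a lookup table. Something along these lines would do (the file name and column headers here are made up for illustration):

require "csv"

# Load the ZIP-to-neighborhood table once, then look up each dog's neighborhood.
ZIP_TO_NEIGHBORHOOD = CSV.read("zip_neighborhoods.csv", headers: true)
                         .map { |row| [row["zip"], row["neighborhood"]] }
                         .to_h

ZIP_TO_NEIGHBORHOOD["11215"]   # => e.g. "Park Slope"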
I've long been interested in programmatically generating prose descriptions of data. This is something I worked on in a pretty naive way in my first tech job, and it's still a big business (for at least one firm).
This interest is one reason I like Twitter bots. Each bot I've created is in the text generation genre: The Meaning of Life, Important Animal (now defunct), and Father John Botsy. Each time I built a bot, I did just enough work to generate the text for that bot. This time, I wanted to take a more abstract, reusable approach.
The result is the Wordz gem. Wordz is a lightweight generative text library. You declare the structure of the text you want to generate (a "grammar") and optionally specify some objects that hold the text bits you want to compose. Here's a small fragment of Good Dogs' grammar as JSON:
{"<root>": [["<body>", "<tag>|0.1", "<sign_off>"]],"<body>": [["<salutation_with_name>", "Hi.|0.1", "<short_sentences_body>", "Hi.|0.1"],["<long_sentence_body>", "<salutation_with_name>", "Hi.|0.1"]],"<salutation_with_name>": [["Hi, my name is", "#dog#name#", "."],["<bark>|0.2", "I am", "#dog#name#", "."],["Oh hi. It is", "#dog#name#", "here|0.3", "."]],
.... etc etc etc
}
The angle-bracketed elements here, like <body>, refer to other keys in the grammar, and enable you to build up complex structures from simple pieces. Method-call elements such as #dog#name# pass the message (in this case name) to the specified object (in this case the one you've called dog). For Good Dogs, this is a record of a good dog pulled from the database. Other elements are string literals. For every type of element, you can optionally append a probability, e.g. Hi.|0.1. This means: add "Hi." to the text 10% of the time, and do nothing the rest of the time. Wordz recursively evaluates your grammar using the objects you input, and assembles it into a list of phrases.
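To make the mechanics concrete, here's a toy re-implementation of that evaluation scheme. This is not Wordz's actual code or API, just a sketch that follows the description above: angle-bracketed rule references recurse, #object#method# elements send a message to an object you pass in, plain strings pass through, and a trailing |p probability decides whether an element shows up at all.

require "json"

GRAMMAR = JSON.parse(<<~JSON)
  {
    "<root>": [["<salutation_with_name>", "Hi.|0.1"]],
    "<salutation_with_name>": [["Hi, my name is", "#dog#name#", "."]]
  }
JSON

def evaluate(element, grammar, objects)
  text, probability = element.split("|")
  return [] if probability && rand >= probability.to_f   # e.g. keep "Hi.|0.1" only 10% of the time

  case text
  when /\A<.+>\z/                   # rule reference: pick one of its productions and recurse
    grammar.fetch(text).sample.flat_map { |el| evaluate(el, grammar, objects) }
  when /\A#(\w+)#(\w+)#\z/          # method call: send the message to the named object
    [objects.fetch(Regexp.last_match(1).to_sym).public_send(Regexp.last_match(2)).to_s]
  else                              # plain string literal
    [text]
  end
end

Dog = Struct.new(:name)
phrases = evaluate("<root>", GRAMMAR, { dog: Dog.new("Waffles") })
# => ["Hi, my name is", "Waffles", "."] most of the time, with a trailing "Hi." about 10% of the time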
Once Wordz has a list of phrases, a post-processing step joins the text together into sentences with proper spacing. Most of the time we want to join everything together with spaces in between, but there are exceptions, such as ["I am", "#dog#name#", "!"], which we want to render as "I am dog name!" and not "I am dog name !".
There's also a facility for string literals that have special effects in post-processing. Right now the only example of this is the string $INDEF_ARTICLE, which gets replaced in post-processing by either "a" or "an" depending on the phrase that follows it.
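Again purely as a sketch (not the gem's actual implementation), the joining step plus that article fix-up could look roughly like this:

PUNCTUATION = %w[. , ! ?].freeze

# Join phrases with spaces, but attach punctuation directly to the preceding text,
# and resolve the $INDEF_ARTICLE placeholder by peeking at the phrase that follows it.
def join_phrases(phrases)
  resolved = phrases.each_with_index.map do |phrase, i|
    next phrase unless phrase == "$INDEF_ARTICLE"
    phrases[i + 1].to_s.match?(/\A[aeiou]/i) ? "an" : "a"
  end

  resolved.reduce("") do |sentence, phrase|
    separator = (sentence.empty? || PUNCTUATION.include?(phrase)) ? "" : " "
    sentence + separator + phrase
  end
end

join_phrases(["I am", "Waffles", "!"])                  # => "I am Waffles!"
join_phrases(["I am", "$INDEF_ARTICLE", "owl", "."])    # => "I am an owl."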
Planned future work on Wordz:
Ok, thanks for reading. Here's a list of links to the work described above!