by Liza Daly
In July 2016 I was brought on to the Creative Commons team to research and build a proof-of-concept for CC Search: a front door to the universe of openly licensed content. As part of this engagement, I would need to recommend a technical approach, ingest millions of works, and build a robust working prototype that could serve to generate public discussion and point the direction for future enhancements and, ultimately, a fully realized vision.
CC estimates that there are 1.1 billion works in the Commons, inclusive of both CC-licensed material and the public domain. Of that, half is thought to be images, supplied by artists who have chosen CC licenses for their work and by cultural institutions that have chosen to release public domain materials without restrictions. The remaining 500 million works comprise other creative media such as video and music, as well as, increasingly, CC-licensed research, educational material, and wholly new kinds of works such as 3D modeling data.
The term of my engagement was seven months, including scoping, research, product build, and public launch. CC is a small organization with limited resources. I received generous assistance from Rob Myers, CC’s developer and technical lead, and strategic input from many in the organization, but otherwise this was to be a development team of one.
Despite that constraint, the project would not be a realistic proof of concept if it did not adequately reflect the scale of the overall ambition. That meant we would need to operate on enough content to mirror the breadth and diversity of the Commons, while at the same time avoiding unnecessary overhead that would impair our ability to iterate quickly.
We settled on a goal to represent 1% of the known Commons, but had a choice about which 1% to select:
Images comprise half of the total Commons yet represent a diverse set of material — we would still need to answer what it meant to search (for example) both contemporary photographs and images of museum holdings. The uncertain future of Flickr, which alone holds more than 350 million openly licensed images, adds urgency. Finally, images lend themselves naturally to exploration via a web interface; this would be less true for some other media types. For these reasons, we decided that the first public iteration of CC Search would be made up of approximately 10 million images, and only images.
Having chosen images, I looked at potential content partners with large open collections and APIs available for discovery and acquisition. Our short list included:
Provider (est. number of open works):
New York Public Library: 180,000
Metropolitan Museum of Art: 200,000
Rijksmuseum: 500,000
500px: 800,000
Internet Archive: 1,000,000
DPLA: 3,000,000
Europeana: 5,000,000
Wikimedia Commons: 35,000,000
Flickr: 380,000,000
(These numbers are only first-order approximations, and there’s plenty of overlap: Europeana includes Rijksmuseum and many of the public domain works that are part of Flickr Commons; many photographers post to both 500px and Flickr.)
There were several criteria for consideration:
The last is critically important to this project. While we cannot make legal assertions about material not under our control, as an official Creative Commons project, we should endeavor to ensure that what we present for re-use is, indeed, openly licensed. (This guideline led us to reject other commercial photography providers which seem to repost or recycle copyrighted material without much oversight.)
We also needed to consider the relative representation of each provider:
Clearly, a “representative sample” would have to be disproportionately sourced from Flickr, but to support our goal of presenting a diversity of material, we’d also need to ensure that we designed the interface to surface non-photographs or allow users to drill down into specific providers.
Unfortunately, we had to defer a few providers we might’ve liked to include:
In the end, we selected these providers as representing the best “bang for the buck” as well as hewing closely to the overall representation of the Commons:
500px: 50,000
Flickr: 9,000,000
New York Public Library: 168,000
Rijksmuseum: 59,000
Metropolitan Museum of Art: 200,000
(Exact figures may vary from time to time.)
To surface the non-Flickr works, we developed filtering tools that let you drill down either by general work type — photographs or cultural works — or by specific provider(s). This is an imperfect distinction — there are cultural works on Flickr and photographs held at museums — but it was the simplest way to slice across the problem.
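To make that concrete, the mapping from work type to providers can be as simple as a small lookup table. The provider keys and groupings below are illustrative only, not the exact production configuration:

```python
# A rough sketch of mapping "work type" filters onto providers.
# Provider keys and groupings here are illustrative, not the production config.
WORK_TYPES = {
    "photographs": {"flickr", "500px"},
    "cultural": {"nypl", "met", "rijksmuseum"},
}

def providers_for(selected_work_types):
    """Expand a list of work-type labels into the set of providers to filter on."""
    providers = set()
    for work_type in selected_work_types:
        providers |= WORK_TYPES.get(work_type, set())
    return providers

# providers_for(["cultural"]) -> {"nypl", "met", "rijksmuseum"}
```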
CC Search is meant to make material more discoverable regardless of where it is hosted. For this reason (and for obvious cost-saving objectives), we decided to host only image metadata — title, creator name, any known tags or descriptions — and link directly to the provider for image display and download. A consequence of this is that CC Search only includes images that are currently available on the web; CC is not collecting or archiving any images itself.
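To give a sense of what "metadata only" means in practice, a stored record might look roughly like this (field names are illustrative rather than the actual schema); note that there is no image payload, only URLs pointing back to the provider:

```python
# A rough sketch of a metadata-only record. Field names are illustrative,
# not the actual CC Search schema. There is no image payload here, only
# URLs that point back to the provider for display and download.
image_record = {
    "title": "Great Horned Owl",
    "creator": "A. Photographer",                # hypothetical creator name
    "tags": ["owl", "bird", "wildlife"],
    "license": "CC BY 2.0",
    "provider": "flickr",
    "foreign_landing_url": "https://example.org/photos/12345",      # provider's page for the work
    "thumbnail_url": "https://example.org/photos/12345_thumb.jpg",  # thumbnail served by the provider
}
```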
As the overall architecture is expected to evolve (and grow considerably!), it was natural to choose a cloud hosting infrastructure. I did a cost analysis of both Google Cloud and AWS, and the two netted out to roughly the same cost. I went with AWS primarily due to its larger suite of offerings and the likelihood that future developers would be more familiar with it than with Google’s cloud. At the start of the project I estimated a total hosting cost of $1,000/mo, well within the existing project budget. (After the prototype was built and we had a better understanding of the scale and components needed, I commissioned another round of estimates prepared by an experienced IT administrator. Those estimates came in closer to $1,400/mo, which I consider a relatively benign growth in scope and still within budget.)
My role in CC Search is temporary: I am writing code to either be thrown away or evolved by other engineers. While this project is meant to be a prototype, we all know that the MVP tends to become production code. I would be delighted if the entire codebase were eventually replaced, having served its purpose in promoting the initiative and surfacing use cases, but I needed to operate under the assumption that this code would live on, potentially for years. For these reasons, I operated under these self-imposed guidelines:
A long-term goal of this project is to facilitate not only search and discovery, but also reuse and “gratitude.” A frequent complaint about open licenses in general — both for creative works and software code — is that contributing to the commons can be a thankless task. There are always more consumers than contributors, and there’s no open web equivalent to a Facebook “like.”
The nature of the reuse cycle can suggest a blockchain architecture: if I create photo A, and you modify it as photo B, there is an ordered relationship between the two that it would be nice to record and surface. While most acts of reuse are non-commercial, that doesn’t have to be the case, and a blockchain-based system could also serve to record a licensing transaction in a distributed way.
Having said that, we opted not to pursue a blockchain-based approach for the first iteration, for a few reasons:
CC’s existing web services are written in a variety of languages — Ruby, Python, and PHP — and any integrations would happen at the web service layer, so I chose to work in Python solely due to my own familiarity with it. As the prototype evolved, we decided the opportunity for an engaging front door to the Commons lay in curation and personalization. I chose Django as the web framework because of its dedicated maintenance team and frequent patch releases.
CC Search is, of course, largely about search. I chose Elasticsearch over Solr and other search engine options primarily due to the availability of AWS’s Elasticsearch-as-a-service. CC Search is not, at this time, a particularly sophisticated search application: image metadata is relatively simple, and when dealing with a heterogeneous content set from many different providers, one tends toward a lowest-common-denominator approach — our search can only be as rich as our weakest data source. There is much to be improved here.
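As a sketch of what a query looks like at this level of sophistication, a search is essentially a full-text match over a handful of metadata fields, optionally filtered by provider. The index and field names below are assumptions, not the production mapping:

```python
# A minimal sketch of a metadata search against Elasticsearch, assuming an
# index named "image" with "title", "tags", "creator", and "provider" fields.
# Index and field names are illustrative, not the production mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # in production this points at the AWS Elasticsearch endpoint

query = {
    "query": {
        "bool": {
            # Full-text match over the few fields every provider supplies
            "must": {
                "multi_match": {
                    "query": "owl",
                    "fields": ["title", "tags", "creator"],
                }
            },
            # Optional drill-down by provider, as in the filtering tools above
            "filter": [{"terms": {"provider": ["flickr", "rijksmuseum"]}}],
        }
    }
}

results = es.search(index="image", body=query)
for hit in results["hits"]["hits"]:
    print(hit["_source"]["title"])
```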
Lastly, I chose Postgres (via AWS’s RDS database service) over MySQL due to its support for JSON and Array datatypes. MongoDB and other NoSQL solutions are attractive for their fluidity and scalability, but they were never strong contenders — I felt we already had many of the benefits of a document database via Elasticsearch, and regardless, Django loses much of its value without a relational database backend. With the ability to store and query arbitrary JSON, I felt Postgres struck a nice balance between structure and flexibility.
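Here is a rough sketch of how that plays out in a Django model: tags can live in a Postgres array, and anything provider-specific can be tucked into a JSON column. Field names are illustrative, not the actual schema:

```python
# A rough sketch of an image-metadata model on Postgres via Django (1.x-era
# imports; in newer Django, JSONField lives in django.db.models). Field names
# are illustrative, not the actual CC Search schema.
from django.contrib.postgres.fields import ArrayField, JSONField
from django.db import models

class Image(models.Model):
    title = models.CharField(max_length=1000)
    creator = models.CharField(max_length=500, blank=True)
    license = models.CharField(max_length=50)
    provider = models.CharField(max_length=80, db_index=True)
    foreign_landing_url = models.URLField(max_length=1000)  # provider's page for the work
    tags = ArrayField(models.CharField(max_length=255), blank=True, default=list)  # Postgres array
    meta_data = JSONField(blank=True, null=True)  # arbitrary provider-specific fields
```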
CC Search does have front-end components, but they are relatively limited. Because CC Search targets recent browsers, I decided to forgo any particular framework — even jQuery — though I did use ES6 syntax for its conciseness and clarity. This requires some tooling and build steps that may violate my “obvious over obscure” mandate, but I would argue that ES6 is a better foundation for future JS development than the alternatives. I set up JS-level unit tests as well, though they are less comprehensive than the Python suite.
All product developers need to be asking themselves what they are doing to minimize privacy issues, as well as looking hard at how their applications could be misused. I had excellent support and feedback from the CC team on the subject of collecting the minimum amount of user data: enough to actually operate a site that includes personalization, but no more. It’s a joy to work with an organization that cares deeply about user safety and privacy. We may not have gotten it perfectly right, but reviewing the final product to identify potential privacy and abuse vectors was a no-brainer.
There was much we wanted to do that we’ve deferred for now, or will revisit based on user feedback:
We want to hear your feedback and suggestions! Please send ideas and comments via our Feedback form, and report bugs and issues to our public issue tracker.