Authors:
(1) Arcangelo Massari, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {[email protected]};
(2) Fabio Mariani, Institute of Philosophy and Sciences of Art, Leuphana University, Lüneburg, Germany {[email protected]};
(3) Ivan Heibi, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {[email protected]};
(4) Silvio Peroni, Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy and Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna, Bologna, Italy {[email protected]};
(5) David Shotton, Oxford e-Research Centre, University of Oxford, Oxford, United Kingdom {[email protected]}.
This article has detailed the methodology used to develop OpenCitations Meta, a database that stores and delivers bibliographic metadata for all publications involved in the OpenCitations Indexes. The process involves two main phases: (1) an automatic curation step that deduplicates entities, corrects errors, and enriches information, and (2) a conversion of the data to RDF that also records changes and provenance in RDF.
Information about new publications is continuously being added to Crossref, DataCite, and PubMed, and we will develop procedures to ingest these new metadata into OpenCitations Meta in a regular and timely manner. Furthermore, work is already underway to ingest bibliographic metadata from the Japan Link Center and the OpenAIRE Research Graph, and other sources will be included as our human and computational resources permit. OpenCitations Meta will thus continue to grow.
OpenCitations Meta has three major benefits. First, the use of OMIDs (OpenCitations Meta Identifiers) for all stored entities enables OpenCitations Meta to act as a mapping hub for publications that may have more than one external PID (for example, a journal article described in Crossref with a DOI (Digital Object Identifier) and the same publication described in PubMed with a PMID (PubMed Identifier)), while also making it possible to characterise citations involving resources that lack any external PID. Second, and as a consequence, OpenCitations Meta allows citations in the OpenCitations Indexes to be described as OMID-to-OMID, which disambiguates citations between documents identified under different schemes, e.g. represented as DOI-to-DOI on Crossref and PMID-to-PMID on PubMed. Third, OpenCitations Meta speeds up search operations that retrieve metadata on publications involved in the citations stored in the OpenCitations Citation Indexes, since these metadata are now kept in-house rather than being retrieved by on-the-fly API calls to external resources.
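The mapping-hub idea can be illustrated with a minimal Python sketch. It is not the OpenCitations Meta implementation: the `IdentifierHub` class, the `internal:` identifier prefix, and every DOI and PMID value below are hypothetical placeholders chosen for illustration. The sketch only shows how assigning one internal identifier per publication lets the same citation, recorded once as DOI-to-DOI and once as PMID-to-PMID, collapse into a single internal-to-internal pair.

```python
from itertools import count


class IdentifierHub:
    """Toy mapping hub: external PIDs -> one internal identifier per publication."""

    def __init__(self):
        self._pid_to_internal = {}
        self._counter = count(1)

    def register(self, *pids):
        """Register PIDs that denote the same publication and return its internal id.

        If one of the PIDs is already known, its internal id is reused.
        Calling with no PIDs mints an id for a resource lacking any external PID.
        """
        known = {self._pid_to_internal[p] for p in pids if p in self._pid_to_internal}
        if len(known) > 1:
            # Two previously separate entities turn out to be the same one;
            # a real system would merge them, this sketch just refuses.
            raise ValueError("PIDs already point at different entities")
        internal = known.pop() if known else f"internal:{next(self._counter)}"
        for p in pids:
            self._pid_to_internal[p] = internal
        return internal

    def resolve(self, pid):
        """Return the internal id for a known external PID."""
        return self._pid_to_internal[pid]


hub = IdentifierHub()
a = hub.register("doi:10.1234/a", "pmid:111")  # same article, two schemes
b = hub.register("doi:10.1234/b", "pmid:222")
c = hub.register()                             # resource with no external PID

# The same citation as two different sources might report it (made-up data).
crossref_citations = [("doi:10.1234/a", "doi:10.1234/b")]
pubmed_citations = [("pmid:111", "pmid:222")]

# Normalising both to internal ids removes the duplicate.
unified = {
    (hub.resolve(citing), hub.resolve(cited))
    for citing, cited in crossref_citations + pubmed_citations
}
```

After normalisation, `unified` contains a single pair, and the resource registered without any external PID still has an identifier that citations can point to.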
Future challenges will be to develop a disambiguation system for people lacking an ORCID identifier, to improve the quality of the existing metadata, to enhance search operations and storage efficiency, and to add metadata fields for abstracts, funder IDs, funding information, and institutional identifiers, populating them wherever these metadata are available from our sources.
Finally, an interface will be implemented and made available to trusted domain experts to permit direct real-time manual curation of metadata held by OpenCitations Meta. Such a system will track changes and provenance, will preserve the delta between different versions of each entity, and will retain information such as the agent responsible for the change, the primary source, and the date. In this way, we will strive to make OpenCitations Meta not only comprehensive but also an accurate and fully open and reusable source of bibliographic metadata to which members of the scholarly community can directly contribute.
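The kind of record such a curation system would keep can be sketched in plain Python. This is only an analogue under stated assumptions, not the RDF-based mechanism used by OpenCitations Meta: the `Snapshot` and `CuratedEntity` names, the tuple representation of statements, and the agent and source values are all hypothetical. Each change stores the responsible agent, the primary source, the date, and the delta (statements added and removed), so an earlier version can be rebuilt by undoing deltas.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    """Provenance for one change: who, from which source, when, and the delta."""
    agent: str
    primary_source: str
    generated_at: datetime
    added: frozenset
    removed: frozenset


class CuratedEntity:
    """An entity described by a set of (property, value) statements."""

    def __init__(self, statements):
        self.statements = set(statements)
        self.history = []

    def update(self, new_statements, agent, primary_source, when=None):
        """Replace the statements and record only the difference."""
        new = set(new_statements)
        snapshot = Snapshot(
            agent=agent,
            primary_source=primary_source,
            generated_at=when or datetime.now(timezone.utc),
            added=frozenset(new - self.statements),
            removed=frozenset(self.statements - new),
        )
        self.history.append(snapshot)
        self.statements = new
        return snapshot

    def state_before(self, index):
        """Rebuild the state preceding history[index] by undoing deltas newest-first."""
        state = set(self.statements)
        for snapshot in reversed(self.history[index:]):
            state = (state - snapshot.added) | snapshot.removed
        return state


original = {("title", "Open citatons"), ("year", "2020")}
entity = CuratedEntity(original)

# A hypothetical curator fixes a typo in the title.
fix = entity.update(
    {("title", "Open citations"), ("year", "2020")},
    agent="curator:example",
    primary_source="https://example.org/source",
)
restored = entity.state_before(0)
```

Storing deltas rather than full copies keeps each snapshot small while still allowing any earlier version to be restored; here `restored` equals the original statement set.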
This work has been partially funded by the European Union’s Horizon 2020 Research and Innovation Program under grant agreement No 101017452 (OpenAIRE-Nexus Project).
This paper is available on arXiv under a CC 4.0 DEED license.