
Lost Your Source Code? Unlock the Power of Software Heritage Archive

by Max Kalik, July 7th, 2023

Too Long; Didn't Read

Software Heritage (SWH) is a universal software archive. Most public repositories and source code from GitHub, GitLab, Bitbucket, and so on are already there. Save your source code often and automatically using SWH tools. By leveraging SWH, individuals in the software development industry can promote software persistence and enhance the integrity of source code references in their articles, research papers, and other documentation.


Recently I took on the role of an ambassador for Software Heritage, a remarkable universal source code archive. I was surprised to discover that many of my colleagues and fellow software developers were unaware of SWH’s existence. After visiting the SWH website, they weren’t quite sure what it is about or why it is important. I decided to write this article for them and for anyone who doesn’t yet know about source code archiving and the importance of software persistence. Let’s get started.


Software Heritage provides a service for archiving and referencing historical and contemporary software — with a focus on human-readable source code.


This is how the Wikipedia article about SWH describes it. The description is concise, but it is still not entirely clear what problem SWH is solving. Let me show you an example of the problem so that the SWH initiative is easier to understand.

Problem

If you are a researcher or tech writer (like me), this example may be familiar to you. Imagine that some time ago you wrote an article that referenced other articles and some source code. The reference could be just a web link to GitHub, GitLab, or another place. The problem is that you can’t guarantee that the link will always exist or that the source code behind it won’t change.


This means it would be great to have a place where you, as a researcher, can reference source code in your articles and know for certain that the link won’t break and the source code will stay the same as when you used it in the article.


A similar problem was encountered by Roberto Di Cosmo, the president of Software Heritage. He discovered that the reference to the source code in one of his scientific articles was broken. This experience led Roberto to the idea of creating persistent storage for all public software, gathered from open-source hosting platforms and other publicly accessible places where source code lives. An archive like this can provide robust references to source code: the links won’t break, and the historical data won’t be changed or deleted.


In this article, I’m going to show you how to use the Software Heritage service. I will walk you through the process of archiving and referencing code. I split this small tutorial into two parts, one for each role that is likely to use the archive: a software engineer who archives code and a researcher who references it.

What the Archive Looks Like

I’m a typical Software Engineer and, of course, I have a GitHub account with a bunch of public repositories that could potentially be archived. When I checked the SWH archive, I was surprised to discover that all my public repositories were already there. You can check yours, and most likely you will see the same.


Archived projects

So, this means the SWH team implemented some kind of web crawler that reads public repositories and harvests them into the global archive automatically, without your participation.


However, your public source code might not be archived yet. In that case you can always archive it manually. So, let’s do it.

How to archive code

Let’s consider an example: RollingNumbers, an open-source library that I created some time ago. Let’s archive it together. Based on the SWH tutorial, I need to prepare this public repository first. For that I need (ideally) 3 files there: README.md, AUTHORS, and LICENSE (or LICENSES).


Optionally, we can add another file called codemeta.json, which you can generate using the CodeMeta generator. The README file should be familiar to you. You can generate the LICENSE file using GitHub. An example of an AUTHORS file can be found in the Google Open Source documentation. Keep in mind that AUTHORS and LICENSE should be plain text files.
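As a quick pre-flight check before archiving, the file list above can be verified with a few lines of Python. This is only a sketch: the function name `missing_files` is my own, and the file names are simply the ones mentioned in this article.

```python
from pathlib import Path

# File names taken from the text above; LICENSES is the accepted alternative
# to LICENSE. Each tuple lists interchangeable names for one recommended file.
RECOMMENDED = [("README.md",), ("AUTHORS",), ("LICENSE", "LICENSES")]


def missing_files(repo_dir):
    """Return the recommended files that are absent from repo_dir."""
    root = Path(repo_dir)
    missing = []
    for alternatives in RECOMMENDED:
        # One existing alternative is enough to satisfy the entry.
        if not any((root / name).exists() for name in alternatives):
            missing.append(" or ".join(alternatives))
    return missing
```

Calling `missing_files(".")` from the root of a local clone returns an empty list when all three recommended files are present.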


RollingNumbers GitHub repository


Now it’s ready to be archived. Of the 3 available options, we are going to use manual saving via the URL form. But here is an important thing to know: you shouldn’t use the URL of your repository’s web page. It should be the URL you use to clone the project with git clone.


In the form, I select git as the type and enter the URL of the RollingNumbers git repository. After clicking Submit, the request is scheduled along with plenty of other pending requests.
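The same request can, as far as I understand the SWH Web API, also be submitted programmatically with a POST to a "save origin" endpoint. The sketch below only builds that endpoint URL and sends nothing. The endpoint layout is my reading of the public API documentation and should be treated as an assumption, and the function name is hypothetical. The repository URL is the origin that appears in the RollingNumbers permalink later in this article.

```python
# Assumed root of the SWH Web API; check the official API docs before relying on it.
API_ROOT = "https://archive.softwareheritage.org/api/1"


def save_request_endpoint(origin_url, visit_type="git"):
    """Build the URL a save request for origin_url would be POSTed to.

    Assumption: the origin URL is embedded in the path as-is, followed by
    a trailing slash.
    """
    return f"{API_ROOT}/origin/save/{visit_type}/url/{origin_url}/"


endpoint = save_request_endpoint("https://github.com/maxkalik/RollingNumbers")
```

Sending an HTTP POST to `endpoint` (for example with curl) should schedule the same kind of request as the web form. I left the network call out so the snippet stays side-effect free.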


After saving, we can check what the archive of the repository looks like. Visually it resembles a typical GitHub repository page, with the rendered README below.


SWH archive


This is how we can archive our repositories in the SWH archive. I believe most of your public source code is already there, but if not, I recommend saving the missing repositories using the archiving method described above. This way we all contribute to software history.


You might say that your source code is not important enough to archive, or that it’s still WIP and not ready to be saved. But SWH is not a place for only ready-to-use software or “dead” code. SWH archives everything, regardless of whether your code is ready or not.

How to reference code

Now let’s play the role of a researcher who is going to reference source code in an article. For instance, I’m going to use references right in this article, and my target will be the RollingNumbers source code we just archived.


First, we need to find this repository in the archive using Search.


The search field accepts a SWHID (don’t worry about it for now) or just a string, such as the name of the repository. Once you have found a repository, you will see the Permalinks button on the right-hand side of the page. Clicking it lets you copy either the identifier or the permalink.


The term permalink is self-explanatory. Click the RollingNumbers permalink and you will be taken to the archived project directory.


SWH permalink builder


If you want a reference to a particular file of the project, open that file and get its permalink in the same way, using the Permalinks button.


You can also get a reference to a code fragment. To do this, click the first line number of the snippet and then Shift-click the last line number.


SWH permalink builder

SWHID

What makes references to a source code archive permanent? How can researchers guarantee that the references in their articles are stable? To understand this, we need to look at the term SWHID, which stands for Software Heritage Identifier.


Here is an example of a permalink to a code snippet from the DigitLayer.swift file in my RollingNumbers archive (split across lines here for readability):


https://archive.softwareheritage.org/
swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e;
origin=https://github.com/maxkalik/RollingNumbers;
visit=swh:1:snp:fd32172c9a4434090c7fba7edb24f88b9b9f4fed;
anchor=swh:1:rev:2e2f802a88ac1ee5d5e39f88e392a6dd4f2a6fd0;
path=/Sources/RollingNumbers/DigitLayer.swift;
lines=11-29


A SWHID is an identifier that points to an archived object: a file, a directory, a commit, and so on. Let’s strip all the additional parameters from the permalink above and take a look at the core identifier:

swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e


There are 4 base components:

<prefix>:<schema_version>:<object_type>:<object_id>


Skip the first two components and take a look at the third part: object_type. This is the type of archived object that the identifier refers to. There are several object types:

  • cnt (contents): a particular file; combined with the lines qualifier, it can point at a code snippet;
  • snp (snapshots): the full state of a repository at the time of a visit, including all branches and tags;
  • rel (releases): a release, such as a tagged version;
  • rev (revisions): a single commit from the repository;
  • dir (directories): a source tree directory as it was at a specific point, for example the root of the repository at a given commit.

SWH permalink builder

You can select a type right from the permalink builder in the SWH archive.
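To make the four components concrete, here is a minimal sketch of a parser for core identifiers. The names `CoreSWHID` and `parse_core_swhid` are hypothetical, and the pattern assumes schema version 1 with a 40-character lowercase hexadecimal object_id, as in the example above.

```python
import re
from typing import NamedTuple

# The five object types listed above.
OBJECT_TYPES = {
    "cnt": "content",
    "dir": "directory",
    "rev": "revision",
    "rel": "release",
    "snp": "snapshot",
}

# Assumption: schema version 1 and a 40-character hexadecimal object_id.
CORE_RE = re.compile(r"swh:1:(cnt|dir|rev|rel|snp):([0-9a-f]{40})")


class CoreSWHID(NamedTuple):
    prefix: str
    schema_version: int
    object_type: str
    object_id: str


def parse_core_swhid(text):
    """Split a core SWHID into its four components, or raise ValueError."""
    match = CORE_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid core SWHID: {text!r}")
    return CoreSWHID("swh", 1, match.group(1), match.group(2))


swhid = parse_core_swhid("swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e")
kind = OBJECT_TYPES[swhid.object_type]  # "content"
```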

Another important part of a SWHID is its context parameters (called qualifiers in the documentation):

origin=https://github.com/maxkalik/RollingNumbers;
visit=swh:1:snp:fd32172c9a4434090c7fba7edb24f88b9b9f4fed;
anchor=swh:1:rev:2e2f802a88ac1ee5d5e39f88e392a6dd4f2a6fd0;
path=/Sources/RollingNumbers/DigitLayer.swift;
lines=11-29

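These qualifiers add context to the core identifier: origin is the URL the code was archived from, visit is the snapshot of that origin, anchor is the revision (commit) in which the path is resolved, path is the location of the file (DigitLayer.swift), and lines is the range of lines of the code snippet.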
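The structure of a qualified SWHID is simple enough to take apart by hand. The sketch below splits the example from this article into its core identifier and a dictionary of qualifiers. The helper names are my own, and the code assumes the qualifiers are separated by semicolons exactly as shown above.

```python
# The permalink body from this article, joined back into a single string.
EXAMPLE = (
    "swh:1:cnt:ba62f3e9e8ad0a1026fd8a39a4654a14cb385b4e;"
    "origin=https://github.com/maxkalik/RollingNumbers;"
    "visit=swh:1:snp:fd32172c9a4434090c7fba7edb24f88b9b9f4fed;"
    "anchor=swh:1:rev:2e2f802a88ac1ee5d5e39f88e392a6dd4f2a6fd0;"
    "path=/Sources/RollingNumbers/DigitLayer.swift;"
    "lines=11-29"
)


def split_swhid(qualified):
    """Return (core_identifier, qualifiers_dict) for a qualified SWHID."""
    core, *parts = qualified.strip().split(";")
    qualifiers = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, _, value = part.partition("=")
        qualifiers[key] = value
    return core, qualifiers


def parse_lines(value):
    """Turn a lines qualifier such as '11-29' (or '11') into (first, last)."""
    start, _, end = value.partition("-")
    return int(start), int(end or start)


def permalink(qualified):
    """Prefix a SWHID with the archive host, as in the permalink above."""
    return "https://archive.softwareheritage.org/" + qualified


core, qualifiers = split_swhid(EXAMPLE)
first_line, last_line = parse_lines(qualifiers["lines"])
```

Here `core` is the same identifier discussed above, and `qualifiers["path"]` points at DigitLayer.swift.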


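In simple words, a SWHID makes source code references permanent. The object_id is a cryptographic hash computed from the archived object itself, so any change to the code would produce a different identifier. That is why you can be sure the code behind a given SWHID won’t change.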
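To illustrate why a content identifier cannot silently point at different code, here is a small sketch that computes a cnt identifier from raw bytes. It relies on my understanding that SWHID content identifiers reuse Git's blob hashing (SHA-1 over a short header plus the file bytes). The function name `content_swhid` is hypothetical.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Compute a swh:1:cnt identifier the way Git hashes a blob.

    Assumption: the object_id is SHA-1 over b"blob <size>\\0" followed by
    the raw file bytes.
    """
    header = b"blob %d\0" % len(data)
    return "swh:1:cnt:" + hashlib.sha1(header + data).hexdigest()


original = content_swhid(b"let digits = 3\n")
modified = content_swhid(b"let digits = 4\n")
# A one-character change yields a completely different identifier.
```

Because the identifier is derived from the bytes, anyone who downloads the file can recompute it and confirm that nothing was altered.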


You can find the full explanation in the SWHID documentation.

Conclusion

In conclusion, Software Heritage (SWH) serves as a universal source code archive and provides a crucial service for archiving and referencing historical and contemporary software. By offering a persistent storage solution, SWH addresses the problem of broken or changed references to source code, ensuring that researchers, tech writers, and other users can confidently reference and access source code without concerns about link stability or code alterations.


Through manual or automated archiving methods, software engineers can contribute to the software history by preserving their public repositories in the SWH archive. Researchers can utilize SWH’s search functionality and SWHIDs to obtain permanent and reliable references to specific files, code snippets, commits, snapshots, releases, and more. By leveraging SWH, individuals in the software development industry can promote software persistence and enhance the integrity of source code references in their articles, research papers, and other documentation.
