Sometimes it is interesting to ponder the relationship between seemingly independent technologies, as doing so can help us understand each of them better. This article is that kind of piece: a reflection on the similarities between distributed file systems, blockchain, and artificial intelligence.
Distributed file systems like IPFS or Ethereum Swarm are content-addressable storage systems. Both are built on a distributed database in which content is accessed by its hash: when we want to retrieve a file, we submit a 32-byte hash, and the system returns the content. This identifier is derived from the content itself and is unique to it, which means that every piece of content in the world can be assigned a unique 32-byte number.
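To make this concrete, here is a minimal sketch in Python of how a content address can be derived from the data itself. Real systems add their own wrapping (IPFS encodes the hash in a multihash-based CID, Swarm uses its own chunked hashing scheme), so this captures only the core idea:

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive a 32-byte (256-bit) identifier from the content itself.

    Real systems wrap this differently (IPFS uses multihash CIDs, Swarm
    uses chunked hashing), but the core idea is the same: the address
    is a fixed-size hash of the data.
    """
    return hashlib.sha256(data).hexdigest()

print(content_address(b"Hello, distributed world!"))
# Changing even one byte of the content yields a completely different address.
print(content_address(b"Hello, distributed world?"))
```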
Think about how fascinating this is: for every video, song, text, book, or program that has ever been created or ever will be, there is a unique 32-byte identifier. If this were not the case, IPFS and Swarm could not function, because a collision would mean a single identifier pointing to two different pieces of content in the system. Yet no one has ever found such a collision in practice.
Another example of the effectiveness of hashing is the Ethereum blockchain. Ethereum is a giant virtual computer: users send instructions (transactions) to it, miners (who serve as the computer's processors) execute them, and the execution modifies the computer's state.
The state of this computer is stored in a structure called a Merkle Patricia trie, which is associative storage similar to IPFS, except that the database is not spread across the network: every miner keeps a full personal copy of it. When a block is created on the blockchain, only a 32-byte hash of the virtual computer's state is stored in it; this is called the state root.
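As a rough illustration, here is a toy Merkle-root computation in Python. It is heavily simplified: Ethereum actually uses a Merkle Patricia trie with Keccak-256 and RLP encoding, while this sketch just pairs leaves with SHA-256, but it shows how an arbitrarily large state folds into a single 32-byte root:

```python
import hashlib

def h(data: bytes) -> bytes:
    # Stand-in hash; Ethereum itself uses Keccak-256, not SHA-256.
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold any number of leaves into a single 32-byte root."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:          # duplicate the last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# A toy "state": account -> balance entries, folded into one root.
state = [b"alice:100", b"bob:42", b"carol:7"]
print(merkle_root(state).hex())  # changing any entry changes this root
```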
This huge computer runs thousands of programs (smart contracts) and stores a massive amount of data, yet every possible state can be associated with a 32-byte number. Without this, the Ethereum network would not function.
Although 32 bytes might not seem like much, it appears we can condense almost everything in the world into an identifier of this size. And this is where artificial intelligence comes into play: what the hash is to the crypto world, the embedding is to the world of AI.
An embedding maps a piece of information onto a vector. For easier understanding, we can imagine these vectors as points in a multi-dimensional space, but in reality an embedding is just a list of numbers, exactly as with a hash. Even a 32-byte hash can be imagined as a point in a 32-dimensional space. In this sense, embedding is a kind of compression, similar to hashing. There is, however, a major difference between the two. Hash algorithms map contents to effectively random points in the space, while for embeddings it is essential that similar contents land close to each other. For example, in a facial recognition system, images of the same person must map to vectors that are close to each other, because the system uses this distance to decide whether two images show the same person.
This is why hash algorithms can use a relatively simple, fixed calculation, while an embedding is computed by a complex model that is typically learned through training.
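A minimal sketch of this distance-based comparison, in Python with NumPy. The vectors here are invented for illustration; in a real system they would come from a trained face-embedding model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity of two embedding vectors: 1.0 = same direction, 0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up embeddings; a real system would get these from a trained model.
person_a_photo1 = np.array([0.91, 0.10, 0.33])
person_a_photo2 = np.array([0.88, 0.14, 0.30])   # same person, different photo
person_b_photo  = np.array([0.05, 0.97, 0.21])   # a different person

print(cosine_similarity(person_a_photo1, person_a_photo2))  # close to 1.0
print(cosine_similarity(person_a_photo1, person_b_photo))   # much smaller
```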
Now that we understand the role of hash algorithms in the crypto world and have seen that 32 bytes can contain almost everything, we can better understand how LLMs (large language models, the foundation of systems like ChatGPT) work.
When we ask ChatGPT a question, a vector is created from our input, much as a hash is created from content. But while a hash points to a piece of content in a distributed file system, in ChatGPT this vector points to the word that best continues the text. ChatGPT does nothing more than predict the next word for the given input, then predict the next word for the resulting text, and so on. (In reality, the process is a bit more complex: it predicts tokens rather than whole words, and not just a single token but a probability distribution over likely tokens, from which it samples.)
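A heavily simplified sketch of this generation loop in Python. Here `predict_distribution` is a purely hypothetical stand-in for a real trained model, implemented as a tiny lookup table so the loop actually runs:

```python
import random

def predict_distribution(tokens: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for a trained language model: given the text so
    far, return a probability for each possible next token."""
    # A real model computes this from an embedding of the whole context.
    table = {
        ("The", "cat"): {"sat": 0.6, "ran": 0.3, "flew": 0.1},
        ("cat", "sat"): {"on": 0.8, "quietly.": 0.2},
        ("sat", "on"): {"the": 0.9, "a": 0.1},
        ("on", "the"): {"mat.": 0.7, "roof.": 0.3},
    }
    return table.get(tuple(tokens[-2:]), {"<end>": 1.0})

def generate(prompt: list[str], max_tokens: int = 10) -> list[str]:
    tokens = list(prompt)
    for _ in range(max_tokens):
        dist = predict_distribution(tokens)
        # Sample the next token from the predicted distribution.
        next_token = random.choices(list(dist), weights=list(dist.values()))[0]
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens

print(" ".join(generate(["The", "cat"])))
```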
ChatGPT is able to work effectively because it has been trained on a massive text corpus, so it has, in effect, a plausible continuation ready for almost any question. In this sense, we could compare it to a huge traditional lexicon. What makes it unique is the continuity of its vectors: even if we ask a question that has never been asked before, the vector created from it will lie very close to the vectors of known questions, so the response will likely still be accurate.
In a distributed file system, we retrieve specific content with a concrete hash, while in a large language model we search for the tokens that best match the embedding vector (which acts as a kind of hash). In both cases we are talking about a vast database, but in the first case it holds exact values under exact keys, while in the second both the keys and the values live in a continuous space.
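The contrast can be captured in a few lines of Python; the stored embedding "keys" below are invented for illustration:

```python
import hashlib
import numpy as np

# Exact lookup, as in a content-addressable file system:
# the key must match exactly, or nothing is found.
content = b"<file contents>"
key = hashlib.sha256(content).hexdigest()
store = {key: content}
print(store.get(key))        # exact key: hit
print(store.get("0" * 64))   # wrong key: miss (None), no "almost" matches

# Nearest-neighbour lookup, as in an embedding space:
# the closest key wins, even if no key matches exactly.
keys = np.array([[0.9, 0.1], [0.1, 0.9]])    # invented embedding "keys"
values = ["answer about cats", "answer about finance"]
query = np.array([0.8, 0.2])                 # a query never seen before
nearest = int(np.argmin(np.linalg.norm(keys - query, axis=1)))
print(values[nearest])                       # closest key: "answer about cats"
```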
It is interesting to consider what kinds of systems could emerge from the fusion of these two approaches. I can imagine, for example, a distributed vector database similar to IPFS, where everyone publishes their knowledge together with its embedding vectors. If I am looking for the answer to a question, the lookup in the distributed hash table (DHT) is done with embedding vectors instead of content hashes, and I receive the data relevant to my question. With the help of a language model, I can then construct a coherent response on my end. Instead of distributed file storage, the goal of such a system would be distributed knowledge storage.
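A sketch of that retrieve-then-respond flow. Everything here is hypothetical: the toy `embed` function is a bag-of-words counter standing in for a trained embedding model, and the `published` list stands in for entries looked up over the network:

```python
import numpy as np

VOCAB = ["merkle", "tree", "hash", "embedding", "vector", "cat", "dog"]

def embed(text: str) -> np.ndarray:
    """Toy embedding: word counts over a tiny vocabulary. A real network
    would use a trained model; this only makes similar texts land close."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# Each participant publishes (embedding, knowledge) pairs into the network.
published = [
    (embed("merkle tree hash"), "Notes on how a Merkle tree folds hashes."),
    (embed("embedding vector"), "Notes on embedding vectors and similarity."),
    (embed("cat dog"), "Notes on pets."),
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Return the published entries whose embeddings lie closest to the query."""
    q = embed(question)
    dists = [np.linalg.norm(vec - q) for vec, _ in published]
    return [published[i][1] for i in np.argsort(dists)[:k]]

# The retrieved knowledge would then be handed to a language model,
# which composes a coherent answer on the asker's side.
print(retrieve("how does a hash tree work"))
```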
Such a system could be the basis of a distributed social network in which individuals do not form connections: there are no friendships or followings. Instead, everyone posts important information on their own wall, and if someone needs specific information, they describe the desired content and the system gathers it from the network based on embedding vectors.
In this way, members of the network can create a large collective mind, a collective superintelligence. Perhaps this is what the future of social networks could look like.