Unbundling Vectors of Centralization in Web3
A year ago I illustrated the Web3 stack as I understood it at the time. More recently, I published Multicoin’s crypto mega theses, detailing the investment thesis for Web3. As I highlighted, one of the key implications of Web3 is that data ownership and application logic will be unbundled.
In this essay, I’ll explain the specific problems inherent in this unbundling, and how we’ve been thinking about making investments in the Web3 stack.
Where Are The Databases and Data?
For practical purposes, the vast majority of modern applications can be thought of as a UX on top of a database. While there are some exceptions (e.g., video streaming and video games), this is generally true. Virtually every major consumer service—Facebook, Reddit, Evernote, Twitter, Google Docs, Gmail, iMessage, etc—can be simplified to a UX on top of a database.
In the Web2 model, application providers store and manage user data so that users don’t have to. Moreover, Web2 application providers are always online because they’re running servers 24/7, whereas users frequently go offline (subways, poor connectivity, battery challenges, etc). In the Web3 model, there is no centralized application provider, and so the entire data ownership paradigm needs to change.
This begs a couple of questions:
- Where does a user - let’s call her Alice - store her data (assuming she isn’t maintaining her own server)?
- And how does the sender of content send content to Alice if Alice is offline?
Naturally, the answer must be: store the content somewhere that’s always online and accessible, and make sure that when Alice comes back online Alice knows where to find the content addressed to her. This paradigm encapsulates both P2P applications like messaging and traditional database applications like newspapers, social media, and note-taking apps.
To execute on this, there are a few mechanical challenges:
- Someone who is not Alice needs to know to store the content and an index of that content so that Alice can find and download it later.
- Alice needs to know where to find that index.
- With that index, Alice needs to actually find and download the underlying data itself.
- Whoever is storing the data should not be able to read the content (if private), or censor it.
By solving these problems, data ownership and application logic can be unbundled, enabling Web3 to prosper.
Before exploring how modern Web3 entrepreneurs are solving these problems, it’s worth considering how others have tried to solve these problems in the past.
Prior Attempts At Decentralizing The Internet
There have been a handful of teams—including but not limited to Zeronet, Freenet, and Scuttlebutt—that have tried to, as they would describe it, “decentralize the Internet.” They tried to do this before the modern crypto era as we know it today. Most of these efforts were focused on supporting narrow use cases; for example, censorship-resistant messaging and message boards targeted at users in countries with authoritarian regimes.
If you’re curious, I recommend trying each of these systems. You’ll find that the UXs leave much to be desired. Although there are many UX problems with these systems, the biggest problem by far is speed. Everything is just painfully slow.
What gives? Why are they so slow?
Because they’re all logically decentralized.
These systems adopted some variation of the following architecture. I’ll describe their architectures in the context of an encrypted, P2P messaging app:
These systems are based on the idea that if someone sends a message to Alice while she’s offline, they’ll send it to Bob instead, and that Bob will store the message on Alice’s behalf.
When Alice comes back online, she will ask Bob if she missed anything while she was offline (index of messages).
Unfortunately, Alice has no guarantee that 1) Bob is online now, 2) that he was online for the duration she was offline, and 3) that Bob will actually have the full index of messages she missed while she was offline. To remedy this, Bob can ask his peers if they are aware of any messages that were addressed to Alice. However, those peers may be or have been offline too, and they may also have incomplete information.
In this paradigm, it’s simply not possible to guarantee message delivery as it’s not clear where messages should be delivered and who should store the index of messages. This creates compounding problems when the recipient comes back online as the recipient doesn’t know where to find the list of messages addressed addressed to her, or the data messages reference.
Scuttlebutt, which is focused on building a P2P social network, attempts to solve this by adopting a Facebook-like double-opt in friend system. That is, once Alice and Bob become friends, they share their friend lists with one another so that Bob can index and store content published by Alice’s friends on Alice’s behalf. This requires that Alice notify all of her friends that Bob is her proxy and vice versa. Then, when Alice’s friends publish an update while Alice is offline, Alice’s friends can send that update to Bob, who can host it for Alice.
Zeronet and Freenet, which are more generalized than just P2P social networking, use a similar model, except there is not a double-opt in friend model. This adds quite a few complexities to the system, and makes things even slower. Unlike the Scuttlebutt model, in which friends agree to help each other for defined information pathways, users of Freenet and Zeronet have to randomly ping other users and ask them about what information they know about. This is the crux of why these systems are so slow.
Let’s say that somehow, Alice finally pieces together the index of everything she missed while she was offline. That is, she knows that Carol sent her a photo, and that Dave is storing the photo at “dave.com/alicepic1.png”. If Dave is offline, how can Alice access the photo?
These problems are non-trivial. Decentralizing the Internet is hard.
Logical And Architectural (De)Centralization
The root cause of all the problems described above is the lack of logically centralized storage and indexing. What is logically centralized storage? To answer that, it helps to understand the three vectors of decentralization in distributed systems:
- the number of computers in the system
- the number of people who can exert influence on the system
- the number of interfaces by which external agents interact with the system
For a more robust explanation of these concepts, I recommend this essay by Vitalik Buterin.
Web2 monopolies solved all of the problems outlined in the previous section because they rely on logically centralized storage. That is, when Alice comes back online, she just asks the centralized web service, which maintains centralized storage, what messages she missed since she was last online. The web service queries a database that it controls containing all user messages, and returns the correct messages.
The problem with this model is that these Web2 systems bundle all forms of centralization: They’re logically centralized, politically centralized, and (other than for scaling purposes) architecturally centralized.
So are there storage systems that are logically centralized, but architecturally and politically decentralized?
Fortunately, the answer is a resounding yes: interplanetary filesystem (IPFS) for contract-based storage, and Arweave for permanent storage (contract-based storage: store X bytes of data for Y period of time with Z retrievability guarantees. AWS, GCP, Azure, Filecoin, and Sia are all contract-based storage systems).
What exactly does it mean for the system to be logically centralized but architecturally and politically decentralized? The best way to understand this is to consider how a computer retrieves basic files from another server on the web today (location-based addressing), and then compare that to the IPFS/Arweave approach (content-based-addressing).
In the Web2 architecture, if Alice wants to download a picture from a server, Alice will go to a URL that looks something like: website.com/image.png. What exactly happens when Alice tries to go to that URL?
Using DNS, Alice knows where to find the server at website.com, and she will ask the server for the image it’s hosting located on its local filesystem at “/image.png.” Assuming the server wants to cooperate, it will check its directory for /image.png and return that file if it exists.
Note how fragile this system is: If the file moves, is changed, the server is busy, or the server is uncooperative for any reason, the request will fail.
This is the foundation on which the web is built today.
In a content-based addressing system like the one used in IPFS and Arweave, the URL that Alice visits is something like this: QmTkzDwWqPbnAh5YiV5VwcTLnGdwSNsNTn2aDxdXBFca7D.
Although it’s not human-readable, it’s deterministically generated from the content that it’s derived from. That is, there is only a single piece of content in the universe which, when hashed, will produce that exact string. The magic of IPFS and Arweave is that they handle all of the complexity that allows a computer to resolve QmTkzDwW... into this webpage.
(If you’d like to learn more about IPFS works, this 6-part series is an excellent starting point).
The content on the IPFS and Arweave networks is stored across many machines. Regardless of how many machines the content is stored on or where those machines are located in the world, these protocols resolve addresses like QmTkzDwW... regardless of where the actual content is stored.
That’s the magic of content-based addressing. It exposes a single logical interface - the content-based address - that will always resolve correctly regardless of where the underlying data is stored across a vast network of computers that are architecturally and politically decentralized.
Of the four major technical challenges outlined at the beginning of this essay, content-based addressing solves #1, #3, and #4 (storing the content, making the content available for download, and ensuring that the host cannot read private information). But what about #2: knowing where to look for the data?
While IPFS and Arweave act as a logically-centralized but architecturally and politically-decentralized filesystems, these systems are not databases. That is, there is no way to query them and ask “please show me all of the messages sent from Bob to Alice between dates X and Y.”
Fortunately, there are a few ways to solve this problem.
One approach is to store the index of messages of on blockchains directly. Blockchains themselves are logically centralized but architecturally and politically decentralized databases. Using a decentralized service like The Graph or centralized service like dFuse, Alice can query indices stored on blockchains. The blockchain doesn’t store the underlying data, but rather just a hash of the data. That hash is just a pointer to the content stored in IPFS or Arweave. Both the Graph and dFuse are live today, and many applications have adopted this model of storing hashes on chain that point to data stored in content-addressed systems.
A second approach is to leverage Textile. Textile has built a unique technology that they call Threads, which act as a private, encrypted, personal database on top of IPFS. Because this database is built on IPFS, it’s logically centralized but architecturally and politically decentralized. As a logically centralized database, senders and receivers know where to send and read information from. Moreover, Textile recently launched Cafes, enabling users to establish a server to host their Threads (rather than hosting Threads locally). The next step for Textile is to build an economic layer to incentivize validators to host cafes for other users, which is analogous to how Filecoin is the economic layer for IPFS.
A third approach is to leverage OrbitDB. OrbitDB is similar to Textile’s Threads, except that OrbitDB is designed primarily for public data (e.g. for building decentralized Twitter), whereas Textile’s Threads natively integrate encryption and key management for private information (e.g. P2P messaging). Like Textile, OrbitDB is available today, and the OrbitDB team is working on an economic layer on top of the underlying technology.
Lastly, there are a number of teams who are building what are effectively traditional databases with different vectors of decentralization: Fluence's constructions will operate in permissionless settings with BFT guarantees, while Bluzelle is building a two-tiered system with a politically centralized set of master nodes, and an architecturally decentralized set of replica nodes.
We are skeptical of the idea of adding a BFT layer on top of traditional databases given the work that smart contract teams are doing to solve the data availability problem at scale, such as Solana’s Replicators. Instead, we have opted to make investments in “crypto-native” databases like Textile, and at the developer API layer in the form of The Graph and dFuse.
With the protocols and services described above - IPFS, Filecoin, Arweave, The Graph, dFuse, Textile, and OrbitDB - there is a clear path by which Web3 can come to fruition. All of these services exist today, albeit not quite in production-ready and web-scale form with battle-tested crypto-economics. Nevertheless, there are known solutions to the most important problem - exposing a single logically centralized interface for politically and architecturally decentralized systems.
Is there anything left?
Now that we have solutions for logically centralized but architecturally decentralized storage, indexing, and retrieval, we can afford to think about higher-level logic. Some examples:
- How does Alice manage multiple identities? For example, Alice may not want to use the same public key across Facebook/Google/Snapchat/Reddit. And what if she wants to manage those identities in a single interface without linking them publicly?
- Given that Alice wants to send Bob private information but store it on IPFS or Arweave - which are by definition public systems - they need to leverage perfect forward secrecy (PFS) handshakes. How do they setup PFS in an asynchronous way and manage all of the associated keys?
- Given that traditional encryption schemes are only intended for two parties to communicate, how can the system support private communications for large groups of people such as message boards or large chat groups?
- How does the system enable common UX patterns such as group discovery, user data recovery, content removal, etc?
While these are distinct technical challenges, I broadly bucket all of these as “higher-level logic” problems.
Textile’s Threads address precisely these kinds of problems. In many ways, one can think of Textile as iCloud for IPFS. While this analogy isn’t perfect, it generally works: In the same way that iCloud abstracts cross-device syncing and data back up for applications (providing both a better user and developer experience), Textile provides all of the higher order logical tools on top of IPFS to make application development seamless for developers while ensuring seamless cross-device syncing and backup for users on IPFS.
The Web3 ecosystem is incredibly diverse on many dimensions, on the types of problems being solved, locations of the teams, economic models they’re employing, and more. That the Web3 stack is coming together despite the fact that there isn’t a single logically centralized entity coordinating the whole thing is remarkable. However, this also means there is a lot of entropy in the system, and as such it’s difficult to understand the higher-level themes. In this essay, I distilled that as follows:
The greatest challenge in the transition from Web2 to Web3 is the transition from systems in which all three vectors of centralization - logical, architectural, and political - have been bundled to systems that are logically centralized but politically and architecturally decentralized.
If you’re building core infrastructure or apps on the Web3 stack, please reach out or DM me on Twitter. We are looking to make more Web3 investments. We believe Web3 is going to be a paradigm shift that unlocks trillions of dollars of value over the next decade, and we’re looking to back the best entrepreneurs building foundational Web3 infrastructure.
Disclosures: Multicoin Capital has invested in Solana, The Graph, Textile, Arweave, and Dfuse. Multicoin Capital abides by a “No Trade Policy” for the assets listed in this report for 72 hours (“No Trade Period”) following its public release. No officer, director or employee shall purchase or sell any of the aforementioned assets during the No Trade Period.