Editor's note: This post was co-authored by Tushar Jain and Kyle Samani.
A blockchain is a database with unique trust-minimization properties. Like all databases, it supports two kinds of operations: reads and writes.
Most of the discourse to date around scaling blockchains is about writes. This is often measured as transactions per second (TPS). For example, Ethereum supports 15-30 TPS, Binance Smart Chain supports up to 160 TPS, and Solana supports up to 50,000 TPS. Investors have put billions of dollars into scaling blockchain writes.
Demand for block space has grown exponentially, and we expect that demand for reading data on blockchains will scale even faster. Nearly every major application on the internet is a database application of some form. And in most database applications, the ratio of reads to writes is between 100:1 and 10,000:1. Why is the ratio so skewed? If you have 10,000 followers on Instagram, and you post a single photo, and 10% of your followers open Instagram and see the photo, then that single write (uploading the photo) results in 1,000 reads. If you have 10,000 people trading an asset and you make a single trade on a decentralized exchange, then those 10,000 people must read that trade and update the price of that asset. Writes are constrained by the scalability of Layer 1 blockchains. As Layer 2 solutions like optimistic- and zk-rollups come online, and as high-throughput networks like Solana take off, the number of writes will explode, leading to an exponential increase in read demand for the reasons stated above.
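As a back-of-the-envelope check on the two examples above, the read amplification can be written out directly. This is only an illustrative sketch: the function name and its parameters are hypothetical, and real applications will have far messier read patterns.

```python
def reads_per_write(audience: int, view_rate: float) -> int:
    """Reads triggered by one write when a fraction `view_rate` of `audience` views it."""
    return round(audience * view_rate)


# Instagram example: 10,000 followers, 10% of whom see the photo.
instagram_reads = reads_per_write(10_000, 0.10)

# DEX example: all 10,000 traders must read the single trade.
dex_reads = reads_per_write(10_000, 1.0)

print(f"Instagram read:write ratio = {instagram_reads}:1")
print(f"DEX read:write ratio = {dex_reads}:1")
```

Both results fall inside the 100:1 to 10,000:1 range quoted above.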
Scaling reads will be the next big scaling challenge for the blockchain industry.
Most database applications have a specific data structure, tailored to how the system is expected to query its data. For example, let’s consider a chat app like Telegram. You could imagine that the entire Telegram system is one giant table, with four column headers: user ID, thread ID, timestamp, and message content. While this could theoretically work, given how many users Telegram has (500M+) and how many messages are sent per day, you can see how this would degrade performance. Every single time any user clicks on a thread, the resulting query hits the same table as every other query. This is a really big problem. Clearly it would be better to localize queries so that they are not all hitting the same table.
You can imagine an alternate structure. Let’s suppose each thread is stored in a separate table with three column headers: user ID, timestamp, and message content. When a user clicks on a thread in the Telegram UI, the system knows which table is storing all of the messages in that thread, and queries that table asking for new messages. This structure localizes the queries, resulting in dramatically increased performance.
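The difference between the two layouts can be sketched with plain Python containers standing in for tables. Everything here (the sample messages and the function names) is hypothetical and only meant to show how much data each query has to touch; a production database would more likely reach the same goal with indexes or partitioning.

```python
from collections import defaultdict

# Hypothetical rows of the "one giant table":
# (user_id, thread_id, timestamp, message_content)
MESSAGES = [
    ("alice", "t1", 1, "hi"),
    ("bob", "t2", 2, "hello"),
    ("carol", "t1", 3, "hey alice"),
]


def query_single_table(table, thread_id):
    """Layout 1: every query scans every row of the shared table."""
    rows_scanned = len(table)
    result = [(u, ts, c) for (u, t, ts, c) in table if t == thread_id]
    return result, rows_scanned


def build_thread_tables(table):
    """Layout 2: one table per thread, with three columns each."""
    tables = defaultdict(list)
    for user_id, thread_id, ts, content in table:
        tables[thread_id].append((user_id, ts, content))
    return tables


def query_thread_table(tables, thread_id):
    """Layout 2: only the rows of the requested thread are touched."""
    rows = tables.get(thread_id, [])
    return list(rows), len(rows)


single_result, single_scanned = query_single_table(MESSAGES, "t1")
thread_tables = build_thread_tables(MESSAGES)
local_result, local_scanned = query_thread_table(thread_tables, "t1")
```

Both layouts return the same messages for thread `t1`, but the per-thread layout touches only that thread's rows.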
The problem with scaling reads in a blockchain is that a blockchain, by definition, does not prescribe the transaction format. A blockchain is just a series of transactions. Anyone can submit any transaction at any time, and any transaction can have any format and be arbitrarily complex.
All of the transactions on a blockchain go into a single table. In light of the Telegram example above, you can see how this creates a serious performance problem. In the case of a blockchain, this problem is even more pronounced because there isn’t a single type of application with a specific data structure. There are thousands of applications, each of which has unique data structures and different use cases.
The standard Ethereum client, Go Ethereum (GETH), has some basic querying capabilities. For example, you can ask GETH, “How much ETH does this address have?” And the data structure of the Ethereum Merkle trie makes it easy for GETH to answer that question.
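For a sense of what that basic query looks like, here is a minimal sketch that builds an `eth_getBalance` JSON-RPC request and decodes the hex-encoded wei value such a call returns. It does not contact a node; the helper names and the placeholder address are assumptions made for illustration.

```python
import json

# Placeholder address used purely for illustration.
EXAMPLE_ADDRESS = "0x0000000000000000000000000000000000000000"


def balance_request(address: str, block: str = "latest", request_id: int = 1) -> str:
    """Build the JSON-RPC body for an eth_getBalance call."""
    payload = {
        "jsonrpc": "2.0",
        "method": "eth_getBalance",
        "params": [address, block],
        "id": request_id,
    }
    return json.dumps(payload)


def wei_hex_to_eth(result_hex: str) -> float:
    """Convert the hex-encoded wei balance in a response to ETH."""
    return int(result_hex, 16) / 10**18


body = balance_request(EXAMPLE_ADDRESS)
# A node would answer with something like {"result": "0xde0b6b3a7640000", ...}
one_eth = wei_hex_to_eth("0xde0b6b3a7640000")
```

The request names only an address and a block, which is why the node can answer it without knowing anything about the applications built on top of it.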
But now let’s consider a more complex question like, “How much TVL is in these 50 Uniswap pools, and what is the total TVL across all of them?” Answering this question requires an understanding of what a Uniswap pool is and the price of the assets in those pools.
By default, GETH cannot answer that question, because GETH itself doesn’t even know what a Uniswap pool is.
So in summary, there are three distinct but related problems here:
- Scaling reads
- Understanding data structures to know what needs to be queried
- Offering censorship-resistant query results
Querying blockchain data is challenging for developers, and to solve these problems there needs to be a robust and scalable solution. The Graph, a decentralized indexing protocol, is the leading project working on these problems, and, if successful, it is poised to become the “Google of Web3.”
Multicoin Capital invested in The Graph back in 2018 with the thesis that the query layer would grow to become one of the most important layers of the Web3 stack. This essay revisits The Graph because the project is at a critical turning point, both in its history and the industry at large. The Graph’s hosted service, which has grown to become one of the most popular products in crypto, has served its purpose. Now, The Graph is migrating to the decentralized network, and, as it advances this transition, each of the three problems above is beginning to be solved, creating the opportunity for fully decentralized applications.
Enabling decentralization is one thing, but scaling it is another. The Graph has seen over 20x growth in the last year and processed over 25 billion queries in May 2021 alone. The Graph Foundation recently made two major grants to help The Graph scale as it tackles the read problems. The first grant was to StreamingFast, a team of engineers who will bring their expertise to help radically improve indexing performance. The second grant was to Figment, a team that will help make it easier for anyone to spin up a Graph node and increase the supply on the network.
As the first generation of subgraphs migrates from the hosted service to the decentralized protocol, the complete vision of The Graph is starting to come into view and network effects are driving rapid growth.
Infinitely Scaling Reads
The foundation of all crypto networks is the encoding of incentives (and disincentives) into software in order to facilitate large-scale collaboration among distrusting parties without a centralized, coordinating entity. This is what makes these systems trust-minimized.
In the abstract, this framing presents a novel way to scale reads infinitely: design a crypto-economic game to incentivize rational, economically driven actors to perform reads for people who are querying that data, thereby allowing the supply side of the network to self-organize to fulfill the demand.
Traditionally, companies built centralized services and scaled them out the old-fashioned way: by hiring lots of programmers and dev-ops teams and managing a bunch of servers. They’d spend millions of man-hours architecting and re-architecting the system to optimize performance and cost.
The Graph is a protocol that:
1. incentivizes independent, rational actors to store and index subsets of a massive dataset (all of the data on the supported blockchains)
2. helps users of this service figure out which actors are storing each subset
3. ensures that these query providers are returning valid responses (not returning false results)
4. facilitates payment
What happens if indexers (people who run the queries) are unable to service all the demand?
In the case where demand outstrips supply, market participants (both existing indexers and outsiders) will observe this in real time by monitoring payment flows on the blockchain. Those who have excess resources (or who can easily acquire resources) will download and run The Graph software, register themselves on The Graph’s smart contract to become discoverable, index in-demand datasets, and begin processing queries for users. This entire cycle can play out over the course of a few minutes or hours, and can be 100% automated.
Stated more simply: as demand for the query services grows, rational, economically motivated supply will self-organize to fulfill that demand. Kyle articulated this thesis in general terms a couple of years ago.
Scaling Data Structure
There are two technical ways to solve the data structure problem:
1) Ask GETH for a list of all Uniswap pools and all of the relevant transactions (e.g. deposits, withdrawals, and trades), and then calculate the TVL in a program outside of GETH. Anytime someone asks the question, repeat the calculations to get an up-to-date number.
2) Define a separate data structure, such that every time a new transaction is added to the end of the giant table that is the list of Ethereum transactions, the system detects whether this transaction is related to increasing or decreasing the size of a Uniswap pool and, if so, updates the appropriate additional data fields in a database that lives outside of GETH.
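The two approaches can be contrasted in a few lines. The sketch below is a deliberate simplification with hypothetical names and data: events are assumed to be already decoded into (pool, kind, USD value) tuples, and only deposits and withdrawals move TVL, whereas a real indexer would also have to handle trades and asset prices.

```python
from collections import defaultdict

# Hypothetical decoded events: (pool_id, kind, usd_value)
EVENTS = [
    ("pool_a", "deposit", 100.0),
    ("pool_b", "deposit", 50.0),
    ("pool_a", "withdrawal", 30.0),
    ("pool_c", "transfer", 999.0),  # unrelated to pool size, ignored
]


def tvl_by_replay(events, pools):
    """Approach 1: rescan the full history every time the question is asked."""
    total = 0.0
    for pool_id, kind, usd in events:
        if pool_id not in pools:
            continue
        if kind == "deposit":
            total += usd
        elif kind == "withdrawal":
            total -= usd
    return total


class TvlIndex:
    """Approach 2: keep per-pool totals up to date as transactions arrive."""

    def __init__(self):
        self.tvl = defaultdict(float)

    def handle(self, pool_id, kind, usd):
        """Called once for every new event appended to the chain."""
        if kind == "deposit":
            self.tvl[pool_id] += usd
        elif kind == "withdrawal":
            self.tvl[pool_id] -= usd

    def total(self, pools):
        """Answering a query is a lookup per pool, not a rescan of history."""
        return sum(self.tvl.get(p, 0.0) for p in pools)


index = TvlIndex()
for event in EVENTS:
    index.handle(*event)

wanted = {"pool_a", "pool_b"}
replayed_total = tvl_by_replay(EVENTS, wanted)
indexed_total = index.total(wanted)
```

Both approaches give the same answer, but the first pays the cost of the full history on every query, while the second pays a small cost once per transaction and makes each query cheap.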
The Graph provides a framework for specifying the data structures that enable solution #2 (these specifications are called subgraph manifests), a database indexing service based on these manifest files, and a real-time query system that operates across a decentralized network of nodes.
Moreover, The Graph exposes queries to developers using GraphQL, a query language developed at Facebook and open sourced in 2015. GraphQL is now widely considered the gold standard for query interfaces for developers because it is so easy to use. Here is a bit more technical info on why GraphQL is awesome.
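To make the TVL question from earlier concrete, here is a sketch of what such a query might look like from a developer's side. The entity and field names (`pools`, `totalValueLockedUSD`), the `id_in` filter, and the sample response are assumptions for illustration; the actual names depend on how a given subgraph defines its schema. The code only builds the request body and sums a response, and does not call any endpoint.

```python
import json

# Hypothetical GraphQL query; field names depend on the subgraph's schema.
TVL_QUERY = """
query PoolTvl($ids: [ID!]!) {
  pools(where: { id_in: $ids }) {
    id
    totalValueLockedUSD
  }
}
"""


def build_request_body(pool_ids):
    """Build the JSON body conventionally POSTed to a GraphQL endpoint."""
    return json.dumps({"query": TVL_QUERY, "variables": {"ids": pool_ids}})


def total_tvl(response):
    """Sum per-pool TVL from a response shaped like the query above."""
    pools = response["data"]["pools"]
    return sum(float(p["totalValueLockedUSD"]) for p in pools)


# Made-up response used only to exercise the helper.
SAMPLE_RESPONSE = {
    "data": {
        "pools": [
            {"id": "0xaaa", "totalValueLockedUSD": "100.5"},
            {"id": "0xbbb", "totalValueLockedUSD": "200.25"},
        ]
    }
}

request_body = build_request_body(["0xaaa", "0xbbb"])
combined_tvl = total_tvl(SAMPLE_RESPONSE)
```

The developer asks for exactly the fields needed in a single request, and the work of understanding what a pool is has already been done by the subgraph.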
There are other solutions that attempt to solve the same problem. We segment the market as follows:
Pocket Network is intellectually interesting because it’s the only project in the lower right-hand corner. However, The Graph’s success has demonstrated that developers prefer to build using GraphQL and subgraphs, rather than RPC calls to GETH nodes, because it is much faster and easier.
Scaling Censorship Resistance & Security
The Graph’s largest competitor is Infura. But Infura is a simplistic service. It is literally a load balancer in front of thousands of GETH nodes. It doesn’t even attempt to provide higher-level abstractions (“query-optimized” per the parlance in the image above). All of the other competitors are a small fraction of Infura’s size.
Infura serves as a centralized chokepoint for the dApps that use it. In order to offer censorship-resistant services, a dApp must eliminate any centralized control over it. If an attacker can use Infura to shut down users’ ability to use a dApp, that is a major threat vector. The Graph is a global network of independent people running infrastructure to offer query results and therefore can offer censorship resistance.
If a Graph indexer returns a false response, it can be heavily penalized. So long as anyone (either the person who requested the query or a third-party fisherman) detects that an indexer produced an invalid result, that person can report the invalid response to the blockchain, and the blockchain itself will be the final arbiter of truth. If the indexer lied, the blockchain will slash the indexer’s bond (posted in GRT) and reward the person who reported the malicious behavior.
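The incentive logic described above can be sketched as a toy dispute resolver. This is not The Graph's actual contract logic: the class, the function, and especially the `slash_fraction` and `reporter_share` values are made-up assumptions chosen only to show the shape of the mechanism.

```python
from dataclasses import dataclass


@dataclass
class Indexer:
    """An indexer with a bond posted in GRT (hypothetical model)."""
    stake_grt: float


def resolve_dispute(
    indexer: Indexer,
    claimed_result: str,
    correct_result: str,
    slash_fraction: float = 0.25,  # illustrative value, not a protocol parameter
    reporter_share: float = 0.5,   # illustrative value, not a protocol parameter
) -> float:
    """Slash a lying indexer and return the reporter's reward in GRT.

    `correct_result` stands in for the outcome of on-chain arbitration.
    """
    if claimed_result == correct_result:
        # The indexer was honest: no slashing and no reward.
        return 0.0
    slashed = indexer.stake_grt * slash_fraction
    indexer.stake_grt -= slashed
    return slashed * reporter_share


honest = Indexer(stake_grt=1000.0)
liar = Indexer(stake_grt=1000.0)

honest_reward = resolve_dispute(honest, "tvl=120", "tvl=120")
liar_reward = resolve_dispute(liar, "tvl=999", "tvl=120")
```

Under these assumed parameters, lying costs the indexer a quarter of its bond, while honest indexers are unaffected by disputes raised against them.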
This means that the largest query ecosystem with the largest value staked by indexers will be the most secure query system. This return to scale produces a feedback loop as follows:
Why decentralize? Decentralization enables a few things:
- Truly serverless applications that are censorship resistant and that cannot be shut down
- Moving data closer to the “edge” to reduce latency and increase performance
- A far more efficient way to scale from millions of queries per day to trillions (yes, that is a ‘T’) of queries per day
When people talk about decentralization, they usually talk about it just in the context of censorship resistance (e.g. Bitcoin). The primary mechanism to achieve this is just sheer redundancy.
In the case of The Graph, decentralization leads to both censorship resistance and scale. Anyone anywhere in the world can run a Graph server, and they can see in real time what query demand is on a per-subgraph basis. So enterprising developers/computer nerds all over the world can see if a particular subgraph is underserved, and can quickly spin up a Graph node and begin serving queries and generating income. This is the key insight that makes The Graph infinitely scalable, and also leads to censorship resistance.
That is the power of decentralization.
Disclosure: Multicoin has established, maintains and enforces written policies and procedures reasonably designed to identify and effectively manage conflicts of interest related to its investment activities. Multicoin Capital abides by a “No Trade Policy” for the assets listed in this report for 3 days (“No Trade Period”) following its public release. Multicoin Capital owns GRT.