# 6.824 Lecture 6 - Fault Tolerance with Raft - Part 1

• Systems we’ve seen so far have been single-master, and so have a SPOF
• People wanted to build replicated systems before the first consensus papers came out (~1990); they attempted to do it by:
• Building networks that would not fail (expensive)
• Waiting for human/operator intervention when something went wrong
• $2f + 1$ servers can withstand $f$ failures
• Any two majorities overlap in at least one server
• 1990: Paxos & Viewstamped Replication (VSR) papers were published; took ~15 years for these ideas to be used in production
• Raft is closer to VSR than Paxos
• Raft integration with (colocated) application code:
• Application receives a request, and forwards it to the Raft leader
• Raft replicas commit this request to their log
• Each replica notifies its own application instance
• Each instance of the application then applies the request
• The application instance that received the request responds to the client
• Followers acknowledge before the write log entries to disk.
• If disk writes are slow enough that they start lagging, memory usage grows
• If this happens for long enough, the server will run out of memory and crash
• Raft doesn’t include a safeguard against this
• “Committed” state is not persisted
• Say all nodes crash and they come back up, a leader is first elected

• That leader then sends out AppendEntries heartbeats

• And then picks a commitIndex following this rule:

If there exists an $N$ such that $N > \text{commitIndex}$, a majority of $\text{matchIndex[i]} ≥ N$, and $\text{log[N].term} == \text{currentTerm}$: set $\text{commitIndex} = N$ (§5.3, §5.4).

• And then propagates it using subsequent AppendEntries calls

• It’s possible to build a consensus system without the notion of a leader