http://www.jianshu.com/p/d10310683bbb
状态机
一个节点处于下面的3种状态之一:
Leader:负责接收客户端的请求,将日志复制到其他节点并告知其他节点何时应用这些日志是安全的。
Candidate:Leader选举过程中的临时角色。
Follower:负责响应来自Leader或者Candidate的请求。
raft状态机
Term
Raft 算法将时间划分成为任意不同长度的任期(term)。任期用连续的数字进行表示。每一个任期的开始都是一次选举(election),一个或多个候选人会试图成为领导人。如果一个候选人赢得了选举,它就会在该任期的剩余时间担任领导人。在某些情况下,选票会被瓜分,有可能没有选出领导人,那么,将会开始另一个任期,并且立刻开始下一次选举。Raft 算法保证在给定的一个任期最多只有一个领导人。
Leader选举
谈到共识算法,绕不过去的问题就是leader选举。Leader选举就是投票,然后得到大多数支持的就当选。核心的思想见下图:
Leader选举所发送的请求及处理方式
候选者投出来的选票,包括以下内容:
- 我是谁(candidateId),我想竞选选哪一任期的leader(term)
- 我的最新日志是什么(lastLogIndex、lastLogTerm)
接收者如果发现term < currentTerm,当然是果断拒绝。那么如果候选者的日志至少跟接收者一样新,并且接收者还没有投过任何一个term这个任期的选票,那么接收者就回复同意。
Safety
共识协议的一个重要的问题是如何来证明协议的正确性。
Raft保证上述特性总是成立,从而保证正确性
Election Safety
对于一个特定的term,最多只有一个leader被选中。这个通过以下的条件来保证:
- 对于特定的term,只能投一张选票。currentTerm和votedFor需要持久化记录的。
- leader必须得到大部分节点的确认。
因为leader必须得到大多数节点的确认,我们可以用反证法证明Election safety。假设某个term里选出了两个不同的leader,那么这两个leader一定都得到了大多数节点的确认,那么一定有一个节点,它对于这个term,投出了两张赞成票。这跟上面提到的第1点是矛盾的。
Log Matching
这条特性,说的是,如果两个节点上的某个日志项(log entry),拥有同一个ID和term,那么这两个节点的这个日志项之前(包括这个日志项)的所有日志项全是一样的。Raft通过保证以下两点来保证log matching成立:
- If two entries in different logs have the same index and term, then they store the same command.
- If two entries in different logs have the same index and term, then the logs are identical in all preceding
entries.
第1点容易保证,因为有Election safety,所以term一样意味着leader是同一个。ID也一样意味同一个leader发出的同一条日志项,所以内容一定是一样的。
第2点怎么保证呢?当leader向follower发送AppendEntries RPC的时候,leader除了带上当前日志项的term、ID,还会把当前日志项的前一项的term、ID也一起带上。follower只有检查到leader发过来的前一项的term、ID和follower当前的term、ID能对应上的时候,才接收请求,否则就拒绝。
然后我们就可以用数学归纳法证明Log matching:
- 假设leader和follower的某一项Log entry,满足log matching,我们很容易证明如果AppendEntries RPC成功返回后,leader和follower的这个log entry的下一项log entry,也满足log matching。
- 由于leader和follower的起始状态都是一样的,并且满足log matching。
综上,log matching得证。
Leader Completeness
Leader completeness说的是,如果一个log entry被某个Leader committed了,那么这个log entry在所有之后的leader的日志里都一定存在。论文里用反证法证明,我直接摘抄下来了:
We assume that the Leader Completeness Property does not hold, then we prove a contradiction. Suppose the leader for term T (leaderT) commits a log entry from its term, but that log entry is not stored by the leader of some future term. Consider the smallest term U > T whose leader (leaderU) does not store the entry.
The committed entry must have been absent from leaderU’s log at the time of its election (leaders never
delete or overwrite entries).leaderT replicated the entry on a majority of the clus�ter, and leaderU received votes from a majority of the cluster. Thus, at least one server (“the voter”) both accepted the entry from leaderT and voted forleaderU, as shown in Figure 9. The voter is key to reaching a contradiction.
The voter must have accepted the committed entry from leaderT before voting for leaderU; otherwise it
would have rejected the AppendEntries request from leaderT (its current term would have been higher than
T).The voter still stored the entry when it voted for leaderU, since every intervening leader contained the
entry (by assumption), leaders never remove entries, and followers only remove entries if they conflict with the leader.The voter granted its vote to leaderU, so leaderU’s log must have been as up-to-date as the voter’s. This
leads to one of two contradictions.First, if the voter and leaderU shared the same last log term, then leaderU’s log must have been at least
as long as the voter’s, so its log contained every entry in the voter’s log. This is a contradiction, since the
voter contained the committed entry and leaderU was assumed not to.Otherwise, leaderU’s last log term must have been larger than the voter’s. Moreover, it was larger than
T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier
leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.This completes the contradiction. Thus, the leaders of all terms greater than T must contain all entries
from term T that are committed in term T.The Log Matching Property guarantees that future leaders will also contain entries that are committed
indirectly, such as index 2 in Figure 8(d)
其实我觉得第6点是有点问题的。第6点说的是,如果voter和leaderU的最后一条日志有同样的term,那么leaderU的日志就一定至少和voter一样长,这个理解起来怪怪的,至少我暂时是没有理解。而实际上我们把第6点忽略,把第7点扩展一下,就可以用第7点把第6点给涵盖了。把第7点改成以下就可以:
- Otherwise, leaderU’s last log term must have been larger than or equal to the voter’s. Moreover, it was larger than or equal to T, since the voter’s last log term was at least T (it contains the committed entry from term T). The earlier
leader that created leaderU’s last log entry must have contained the committed entry in its log (by assumption). Then, by the Log Matching Property, leaderU’s log must also contain the committed entry, which is a contradiction.
State Machine Safety Property
这个特性说的是,一旦一个log entry被apply,那么这个位置就不可能有别的log entry被apply。这样才能保证状态机是绝对一致的。
证明:
Given the Leader Completeness Property, we can prove the State Machine Safety Property from Figure 3, which states that if a server has applied a log entry at a given index to its state machine, no other server will ever apply a different log entry for the same index. At the time a server applies a log entry to its state machine, its log must be identical to the leader’s log up through that entry and the entry must be committed. Now consider the lowest term in which any server applies a given log index; the Log Completeness Property guarantees that the leaders for all higher terms will store that same log entry, so servers that apply the index in later terms will apply the same value. Thus, the State Machine Safety Property holds.
成员变更
成员变更为什么一定要两阶段
如果成员变更是直接从旧配置更新到新配置,由于配置的更新不可能在所有节点上同时发生,那么可能存在某一时刻,同时存在两个没有交集的多数集,造成同一个term有两个不同的leader,如下图所示:
形成两个没有交集的多数集
新配置什么时候开始起作用
一旦leader把Cold,new写入Log开始,leader就要用Cold,new指定的集群开始决定多数节点,而不是等到commit才开始。
等到Cold,new被commit之后,就可以开始提交最终的Cnew。一旦Cnew被commit之后,配置更新就算完成了。如果节点不在Cnew指定的集群里,那么节点就应该退出了。
etcd raft的成员变更
如果我们可以保证Cold和Cnew的多数集一定有交集,那么就可以直接写入新配置,而不用两阶段(不需要两次commit,commit一次就OK)。通过限制成员变更每次只能增删一个成员,我们可以保证这一点。这样可以一定程度简化设计。
线性一致性的保证
3种方法:
- 所有请求(包括读请求)都走正常的流程,这样显然可以保证线性一致性。但是这样的问题是代价太大。
- leader接收到请求后,先记下当前的commit ID,然后广播消息以确认当前自己的leader地位没有变化,然后等待apply ID大于等于刚才记下的commit ID,这样就可以保证线性一致性。
- 在方案2的基础上,不发送消息去确认自己的leader地位,只要自己的leader还没有过期,就认为自己还是leader。这个方案的风险是,如果机器时间不是很准确,依赖时间的控制可能不是特别靠谱。