All of lore.kernel.org
* Question about ceph paxos implementation
@ 2017-11-27 13:14 Kang Wang
  2017-11-27 14:27 ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Kang Wang @ 2017-11-27 13:14 UTC (permalink / raw)
  To: ceph-devel

Hi,

I have been reading the Ceph Paxos code recently, and I have a question about a scenario that, in my opinion, may violate consistency.

Assume we have five monitor nodes m1, m2, m3, m4, m5, where each earlier node has a higher rank than the later ones.

Consider the following situation:

1. m1 is the leader, and all nodes start with the same last_committed. m1 then proposes a new value '2', which is accepted by m1 and m3:
m1:   1 2
m2:   1
m3:   1 2
m4:   1
m5:   1	

2. Unfortunately, both m1 and m3 go down, and m2 becomes leader without any knowledge of that proposal; it proposes a new value '3':
m1:   1 2  down 
m2:   1 3
m3:   1 2  down
m4:   1
m5:   1	

3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
m1:   1 2
m2:   1 3  down
m3:   1 2
m4:   1 2
m5:   1	

4. Before the commit message reaches the others, m1 and m3 go down again, so value '2' is only committed on m1. m2 then becomes leader once more:
m1:   1 2  down
m2:   1 3
m3:   1 2  down
m4:   1 2
m5:   1

5. Leader m2 sees the uncommitted value '2' but discards it by comparing uncommitted_pn in handle_last, so it commits value '3' with the quorum m2, m4, m5:
m1:   1 2  down
m2:   1 3
m3:   1 2  down
m4:   1 3
m5:   1 3

Now we see that value '2' was committed but soon lost. Am I right about this?
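For reference, the selection rule this scenario hinges on can be sketched as follows. This is a loose illustrative model, not Ceph's actual code; the names pick_uncommitted and peer_states are made up:

```python
# Loose sketch of the collect/"last" phase rule: among the uncommitted
# values reported by the quorum, the new leader keeps only the one
# carried under the highest pn.  Illustrative only, not Ceph's code.

def pick_uncommitted(peer_states):
    """peer_states: iterable of (uncommitted_pn, value) pairs or None."""
    candidates = [s for s in peer_states if s is not None]
    if not candidates:
        return None
    # The value accepted under the highest proposal number wins.
    return max(candidates)[1]

# The concern in step 5: if '2' still carried the pn of round 1 while
# '3' carried the pn of round 2, the leader would keep '3' and the
# already-committed '2' would be lost.
print(pick_uncommitted([(1, '2'), (2, '3'), None]))  # '3'
```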


Thanks
WANG KANG

 


* Re: Question about ceph paxos implementation
  2017-11-27 13:14 Question about ceph paxos implementation Kang Wang
@ 2017-11-27 14:27 ` Sage Weil
  2017-11-28  3:20   ` Kang Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2017-11-27 14:27 UTC (permalink / raw)
  To: Kang Wang; +Cc: ceph-devel


On Mon, 27 Nov 2017, Kang Wang wrote:
> Hi,
> 
> I have been reading the Ceph Paxos code recently, and I have a question about a scenario that, in my opinion, may violate consistency.
> 
> Assume we have five monitor nodes m1, m2, m3, m4, m5, where each earlier node has a higher rank than the later ones.
> 
> Consider the following situation:
> 
> 1. m1 is the leader, and all nodes start with the same last_committed. m1 then proposes a new value '2', which is accepted by m1 and m3:
> m1:   1 2
> m2:   1
> m3:   1 2
> m4:   1
> m5:   1	
> 
> 2. Unfortunately, both m1 and m3 go down, and m2 becomes leader without any knowledge of that proposal; it proposes a new value '3':
> m1:   1 2  down 
> m2:   1 3
> m3:   1 2  down
> m4:   1
> m5:   1	
> 
> 3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
> m1:   1 2
> m2:   1 3  down
> m3:   1 2
> m4:   1 2
> m5:   1	
> 
> 4. Before the commit message reaches the others, m1 and m3 go down again, so value '2' is only committed on m1. m2 then becomes leader once more:
> m1:   1 2  down
> m2:   1 3
> m3:   1 2  down
> m4:   1 2
> m5:   1
> 
> 5. Leader m2 sees the uncommitted value '2' but discards it by comparing uncommitted_pn in handle_last, so it commits value '3' with the quorum m2, m4, m5:
> m1:   1 2  down
> m2:   1 3
> m3:   1 2  down
> m4:   1 3
> m5:   1 3

This is what the last->uncommitted_pn value is for.  I believe this 
prevents us from using 2's pn (and uncommitted value) because 3's pn is 
larger.  Can you verify?

Thanks!
sage


> 
> Now we see that value '2' was committed but soon lost. Am I right about this?
> 
> 
> Thanks
> WANG KANG
> 
>  --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 


* Re: Question about ceph paxos implementation
  2017-11-27 14:27 ` Sage Weil
@ 2017-11-28  3:20   ` Kang Wang
  2017-11-28 13:06     ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Kang Wang @ 2017-11-28  3:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Do you mean that value '2' wouldn't be used at the 3rd step?

3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
m1:   1 2
m2:   1 3  down
m3:   1 2
m4:   1 2
m5:   1	


But I assumed that m2 goes down before it can send the MMonPaxos::OP_BEGIN message to the others, so the new leader m1 has no chance to learn that a newer uncommitted value '3' exists.



Thanks
WANG KANG

> On 27 Nov 2017, at 10:27 PM, Sage Weil <sage@newdream.net> wrote:
> 
> On Mon, 27 Nov 2017, Kang Wang wrote:
>> Hi,
>> 
>> I have been reading the Ceph Paxos code recently, and I have a question about a scenario that, in my opinion, may violate consistency.
>> 
>> Assume we have five monitor nodes m1, m2, m3, m4, m5, where each earlier node has a higher rank than the later ones.
>> 
>> Consider the following situation:
>> 
>> 1. m1 is the leader, and all nodes start with the same last_committed. m1 then proposes a new value '2', which is accepted by m1 and m3:
>> m1:   1 2
>> m2:   1
>> m3:   1 2
>> m4:   1
>> m5:   1	
>> 
>> 2. Unfortunately, both m1 and m3 go down, and m2 becomes leader without any knowledge of that proposal; it proposes a new value '3':
>> m1:   1 2  down 
>> m2:   1 3
>> m3:   1 2  down
>> m4:   1
>> m5:   1	
>> 
>> 3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
>> m1:   1 2
>> m2:   1 3  down
>> m3:   1 2
>> m4:   1 2
>> m5:   1	
>> 
>> 4. Before the commit message reaches the others, m1 and m3 go down again, so value '2' is only committed on m1. m2 then becomes leader once more:
>> m1:   1 2  down
>> m2:   1 3
>> m3:   1 2  down
>> m4:   1 2
>> m5:   1
>> 
>> 5. Leader m2 sees the uncommitted value '2' but discards it by comparing uncommitted_pn in handle_last, so it commits value '3' with the quorum m2, m4, m5:
>> m1:   1 2  down
>> m2:   1 3
>> m3:   1 2  down
>> m4:   1 3
>> m5:   1 3
> 
> This is what the last->uncommitted_pn value is for.  I believe this 
> prevents us from using 2's pn (and uncommitted value) because 3's pn is 
> larger.  Can you verify?
> 
> Thanks!
> sage
> 
> 
>> 
>> Now we see that value '2' was committed but soon lost. Am I right about this?
>> 
>> 
>> Thanks
>> WANG KANG
>> 



* Re: Question about ceph paxos implementation
  2017-11-28  3:20   ` Kang Wang
@ 2017-11-28 13:06     ` Sage Weil
  2017-11-29  6:54       ` Kang Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2017-11-28 13:06 UTC (permalink / raw)
  To: Kang Wang; +Cc: ceph-devel


Sorry, I misread the scenario before.  I think what will actually happen 
is that in step 3, when "2" is recovered, it will be re-proposed, so 
let's call it "2'" (it will have the round-3 pn associated with it).  
That means that in step 4, "3" wouldn't be re-proposed or committed, 
because m4 has "2'" with a higher pn; "2'" would be re-proposed instead.
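Loosely sketched (made-up names, not the actual Paxos class API):

```python
# Illustrative sketch: when a new leader recovers an uncommitted value,
# it does not re-accept it under the old pn; it re-proposes the value
# under its own, newer pn.  Not Ceph's actual code.

def repropose(leader_pn, uncommitted):
    value, old_pn = uncommitted
    assert leader_pn > old_pn      # a new leader always has a newer pn
    return (value, leader_pn)      # "2" becomes "2'" carrying the new pn

# Step 3: m1 recovers and re-proposes '2' under its own round's pn.
reproposed = repropose(125, ('2', 123))

# Step 4/5: m2's own uncommitted '3' carries an older pn (124), so the
# comparison now prefers '2' and '3' is the value that gets dropped.
best = max([reproposed, ('3', 124)], key=lambda v: v[1])
print(best)  # ('2', 125)
```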

You might rewrite the scenario with parens for uncommitted values and 
something like (n, pn=123) so that the proposal number is indicated.  
Seeing uncommitted vs. committed values and the pn will make the 
sequence clearer!

sage



On Tue, 28 Nov 2017, Kang Wang wrote:

> Do you mean that value '2' wouldn't be used at the 3rd step?
> 
> 3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
> m1:   1 2
> m2:   1 3  down
> m3:   1 2
> m4:   1 2
> m5:   1	
> 
> 
> But I assumed that m2 goes down before it can send the MMonPaxos::OP_BEGIN message to the others, so the new leader m1 has no chance to learn that a newer uncommitted value '3' exists.
> 
> 
> 
> Thanks
> WANG KANG
> 
> > On 27 Nov 2017, at 10:27 PM, Sage Weil <sage@newdream.net> wrote:
> > 
> > On Mon, 27 Nov 2017, Kang Wang wrote:
> >> Hi,
> >> 
> >> I have been reading the Ceph Paxos code recently, and I have a question about a scenario that, in my opinion, may violate consistency.
> >> 
> >> Assume we have five monitor nodes m1, m2, m3, m4, m5, where each earlier node has a higher rank than the later ones.
> >> 
> >> Consider the following situation:
> >> 
> >> 1. m1 is the leader, and all nodes start with the same last_committed. m1 then proposes a new value '2', which is accepted by m1 and m3:
> >> m1:   1 2
> >> m2:   1
> >> m3:   1 2
> >> m4:   1
> >> m5:   1	
> >> 
> >> 2. Unfortunately, both m1 and m3 go down, and m2 becomes leader without any knowledge of that proposal; it proposes a new value '3':
> >> m1:   1 2  down 
> >> m2:   1 3
> >> m3:   1 2  down
> >> m4:   1
> >> m5:   1	
> >> 
> >> 3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
> >> m1:   1 2
> >> m2:   1 3  down
> >> m3:   1 2
> >> m4:   1 2
> >> m5:   1	
> >> 
> >> 4. Before the commit message reaches the others, m1 and m3 go down again, so value '2' is only committed on m1. m2 then becomes leader once more:
> >> m1:   1 2  down
> >> m2:   1 3
> >> m3:   1 2  down
> >> m4:   1 2
> >> m5:   1
> >> 
> >> 5. Leader m2 sees the uncommitted value '2' but discards it by comparing uncommitted_pn in handle_last, so it commits value '3' with the quorum m2, m4, m5:
> >> m1:   1 2  down
> >> m2:   1 3
> >> m3:   1 2  down
> >> m4:   1 3
> >> m5:   1 3
> > 
> > This is what the last->uncommitted_pn value is for.  I believe this 
> > prevents us from using 2's pn (and uncommitted value) because 3's pn is 
> > larger.  Can you verify?
> > 
> > Thanks!
> > sage
> > 
> > 
> >> 
> >> Now we see that value '2' was committed but soon lost. Am I right about this?
> >> 
> >> 
> >> Thanks
> >> WANG KANG
> >> 


* Re: Question about ceph paxos implementation
  2017-11-28 13:06     ` Sage Weil
@ 2017-11-29  6:54       ` Kang Wang
  0 siblings, 0 replies; 5+ messages in thread
From: Kang Wang @ 2017-11-29  6:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

You are right, Sage.

I had lost sight of the fact that the uncommitted_pn is replaced by the new accepted_pn
when the uncommitted_value is re-proposed.

Thanks again. I have rewritten the scenario below for anyone who may have the same question.
For simplicity, I use an increasing integer to indicate the pn,
brackets for committed values, and parens for uncommitted ones.


1. m1 is the leader, and all nodes start with the same last_committed. m1 then proposes a new value '2', which is accepted by m1 and m3:
m1:   [1, pn=123]   (2, pn=123)
m2:   [1, pn=123] 
m3:   [1, pn=123]   (2, pn=123)
m4:   [1, pn=123] 
m5:   [1, pn=123]  	

2. Unfortunately, both m1 and m3 go down, and m2 becomes leader without any knowledge of that proposal; it proposes a new value '3':
m1:   [1, pn=123]   (2, pn=123)  down 
m2:   [1, pn=123]   (3, pn=124)
m3:   [1, pn=123]   (2, pn=123)  down
m4:   [1, pn=123] 
m5:   [1, pn=123]	

3. m2 then goes down before sending anything to the others; m1 and m3 recover and commit value '2' with the quorum m1, m3, m4:
m1:   [1, pn=123]   [2, pn=125] 
m2:   [1, pn=123]   (3, pn=124)  down
m3:   [1, pn=123]   (2, pn=125) 
m4:   [1, pn=123]   (2, pn=125) 
m5:   [1, pn=123]

4. Before the commit message reaches the others, m1 and m3 go down again, so value '2' is only committed on m1. m2 then becomes leader once more:
m1:   [1, pn=123]   [2, pn=125]   down
m2:   [1, pn=123]   (3, pn=124)
m3:   [1, pn=123]   (2, pn=125)  down
m4:   [1, pn=123]   (2, pn=125) 
m5:   [1, pn=123]

5. Leader m2 sees the uncommitted value '2' with a larger pn than '3', so it re-proposes value '2' with a newer pn:
m1:   [1, pn=123]   [2, pn=125]   down
m2:   [1, pn=123]   [2, pn=126]
m3:   [1, pn=123]   (2, pn=125)  down
m4:   [1, pn=123]   [2, pn=126] 
m5:   [1, pn=123]   [2, pn=126]
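The corrected sequence can also be replayed end-to-end with a small toy model (a simplification with made-up names, not the Ceph implementation):

```python
# Toy replay of the corrected scenario: each monitor holds a list of
# committed values plus at most one uncommitted (value, pn) pair.
# Illustrative model only, not Ceph's actual recovery code.

mons = {m: {'committed': ['1'], 'uncommitted': None}
        for m in ('m1', 'm2', 'm3', 'm4', 'm5')}

def accept(names, value, pn):
    for m in names:
        mons[m]['uncommitted'] = (value, pn)

def commit(names):
    for m in names:
        value, _ = mons[m]['uncommitted']
        mons[m]['committed'].append(value)
        mons[m]['uncommitted'] = None

def recover(quorum, pn):
    # The new leader adopts the quorum's uncommitted value carrying the
    # highest pn (if any) and re-proposes it under its own newer pn.
    pending = [mons[m]['uncommitted'] for m in quorum
               if mons[m]['uncommitted'] is not None]
    if pending:
        value, _ = max(pending, key=lambda v: v[1])
        accept(quorum, value, pn)

accept(['m1', 'm3'], '2', 123)        # step 1: m1 proposes '2'
accept(['m2'], '3', 124)              # step 2: m2 proposes '3' (m1, m3 down)
recover(['m1', 'm3', 'm4'], 125)      # step 3: m1 re-proposes '2' (m2 down)
commit(['m1'])                        # step 4: only m1 commits before going down
recover(['m2', 'm4', 'm5'], 126)      # step 5: '2' (pn 125) beats '3' (pn 124)
commit(['m2', 'm4', 'm5'])

print(mons['m2']['committed'])  # ['1', '2'] -- the committed value survives
```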





end of thread, other threads:[~2017-11-29  6:54 UTC | newest]

Thread overview: 5+ messages
2017-11-27 13:14 Question about ceph paxos implementation Kang Wang
2017-11-27 14:27 ` Sage Weil
2017-11-28  3:20   ` Kang Wang
2017-11-28 13:06     ` Sage Weil
2017-11-29  6:54       ` Kang Wang
