* About the blueprint OSD: Transactions
@ 2015-03-03  9:32 Li Wang
  2015-03-03 22:52 ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Li Wang @ 2015-03-03  9:32 UTC (permalink / raw)
  To: Sage Weil, Josh Durgin, pmcgarry; +Cc: ceph-devel

Hi Sage,
   We are pretty interested in the multi-object transaction support; we
think it is potentially very useful. We have read your implementation
description and summarized it below; please check whether our
understanding is correct:

1 client selects a master and sends the full txn to the master
2 master holds the txn in memory, sends PREPAREs to slaves
3 slaves persist the PREPARE on the side and send PREPARE_ACK;
   if there is a compare-then-write operation and the comparison
   fails, the slave sends PREPARE_FAIL instead
4 master collects all PREPARE_ACKs, applies the txn,
   and marks the txn COMMITTING; if a PREPARE_FAIL is received,
   the master sends the slaves ROLL_BACK, and the slaves discard
   the prepared txn
5 once persisted, master sends COMMITs to slaves
6 master replies to client COMMITTED, so the client can proceed
   with other operations except reading the committed data
7 slaves get COMMIT, apply it, and reply with COMMIT_ACK
8 master collects COMMIT_ACKs and replies to client FINISHED, so the
   client can read the data
9 master closes out the txn record
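
To make sure we read the flow correctly, here is a rough C++-style
sketch of the master side as we understand it. All names and types
below are invented for illustration and are not taken from the Ceph
tree; messaging is replaced by direct calls and OSD/PG failures are
ignored.

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical message/reply types for the flow above (names invented).
enum class SlaveReply { PREPARE_ACK, PREPARE_FAIL, COMMIT_ACK };
enum class ClientReply { COMMITTED, FINISHED, ROLLED_BACK };

// Stand-in for a slave OSD; real messaging is replaced by direct calls.
struct Slave {
  SlaveReply prepare(const std::string&) { return SlaveReply::PREPARE_ACK; }  // step 3
  void roll_back(const std::string&) {}                                       // step 4 (failure)
  SlaveReply commit(const std::string&) { return SlaveReply::COMMIT_ACK; }    // steps 5/7
};

// Steps 2-9 as seen from the master; one txn part per slave.
ClientReply run_master(const std::string& txn_id,
                       const std::vector<std::string>& parts,
                       std::vector<Slave>& slaves) {
  for (std::size_t i = 0; i < slaves.size(); ++i) {              // steps 2/3: PREPAREs out
    if (slaves[i].prepare(parts[i]) == SlaveReply::PREPARE_FAIL) {
      for (auto& s : slaves) s.roll_back(txn_id);                // step 4: ROLL_BACK on failure
      return ClientReply::ROLLED_BACK;
    }
  }
  // step 4: all PREPARE_ACKs in; apply locally and persist the COMMITTING mark
  // steps 5/6: send COMMITs and (concurrently) reply COMMITTED to the client
  std::cout << txn_id << ": COMMITTING persisted, COMMITTED sent to client\n";
  for (auto& s : slaves) s.commit(txn_id);                       // steps 5/7: COMMIT/COMMIT_ACK
  return ClientReply::FINISHED;                                  // steps 8/9: reply FINISHED, close out
}

int main() {
  std::vector<Slave> slaves(2);
  std::vector<std::string> parts = {"write B=1", "write C=1"};
  run_master("txn-1", parts, slaves);
}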

We think this shows how to implement a single transaction itself;
however, it does not yet take into account concurrent transactions:
how to enforce ordering and atomicity among distributed transactions,
and how to do locking and deadlock avoidance. It seems there is
some further design work to do.

We are wondering if you can move this blueprint discussion to a
UTC+8-friendly time, so that we can take part.

Cheers,
Li Wang



* Re: About the blueprint OSD: Transactions
  2015-03-03  9:32 About the blueprint OSD: Transactions Li Wang
@ 2015-03-03 22:52 ` Sage Weil
  2015-03-03 23:03   ` Patrick McGarry
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-03-03 22:52 UTC (permalink / raw)
  To: Li Wang; +Cc: Josh Durgin, pmcgarry, ceph-devel

On Tue, 3 Mar 2015, Li Wang wrote:
> Hi Sage,
>   We are pretty interested in the multi-object transaction support; we
> think it is potentially very useful. We have read your implementation
> description and summarized it below; please check whether our
> understanding is correct:
> 
> 1 client selects a master and sends the full txn to the master
> 2 master holds the txn in memory, sends PREPAREs to slaves
> 3 slaves persist the PREPARE on the side and send PREPARE_ACK;
>   if there is a compare-then-write operation and the comparison
>   fails, the slave sends PREPARE_FAIL instead
> 4 master collects all PREPARE_ACKs, applies the txn,
>   and marks the txn COMMITTING; if a PREPARE_FAIL is received,
>   the master sends the slaves ROLL_BACK, and the slaves discard
>   the prepared txn
> 5 once persisted, master sends COMMITs to slaves
> 6 master replies to client COMMITTED, so the client can proceed
>   with other operations except reading the committed data
> 7 slaves get COMMIT, apply it, and reply with COMMIT_ACK
> 8 master collects COMMIT_ACKs and replies to client FINISHED, so the
>   client can read the data
> 9 master closes out the txn record

Yep!   Plus the failure path handling...

> We think this shows how to implement a single transaction itself; however,
> it does not yet take into account concurrent transactions: how to enforce
> ordering and atomicity among distributed transactions, and how to do
> locking and deadlock avoidance. It seems there is some further design
> work to do.

Yeah.  I think it would be nice if we can define a few simple flags 
indicating whether the masters and/or slaves are readable during the 
prepared-but-uncommitted phase, as there are different requirements for 
different users.
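
For instance, something as simple as this would do as a starting
point; the flag names below are just placeholders I'm making up here,
not anything that exists today:

#include <cstdint>

// Placeholder flags (invented names) for per-transaction read visibility
// while the affected objects are prepared but not yet committed.
enum TxnVisibilityFlags : std::uint32_t {
  TXN_READ_MASTER_WHILE_PREPARED = 1u << 0,  // master object stays readable
  TXN_READ_SLAVES_WHILE_PREPARED = 1u << 1,  // slave objects stay readable
  // A user that can tolerate reading pre-txn data sets both bits; one that
  // needs reads to block until the txn resolves sets neither.
};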

And we need to pick a (simple!) deadlock avoidance approach.  Maybe a 
simple EAGAIN is enough and leave it to the clients to be consistent about 
which object to choose as the master.
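
e.g. the client-side rule could be as simple as this sketch (an
invented helper, not an existing API):

#include <algorithm>
#include <string>
#include <vector>

// Deterministic client-side choice: every client building a txn over the
// same set of objects picks the same master, so retries after EAGAIN
// converge on one choice.  Assumes object_names is non-empty.
std::string pick_master(const std::vector<std::string>& object_names) {
  return *std::min_element(object_names.begin(), object_names.end());
}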

> We are wondering if you can move this blueprint discussion to a
> UTC+8-friendly time, so that we can take part.

I think Patrick is moving it!

Thanks-
sage


* Re: About the blueprint OSD: Transactions
  2015-03-03 22:52 ` Sage Weil
@ 2015-03-03 23:03   ` Patrick McGarry
  0 siblings, 0 replies; 6+ messages in thread
From: Patrick McGarry @ 2015-03-03 23:03 UTC (permalink / raw)
  To: Sage Weil; +Cc: Li Wang, Josh Durgin, ceph-devel

Yep, I bumped the OSD: Transactions discussion to the end of the day.
Let me know if you see anything else that looks amiss (including my
timezone math!). Thanks.


On Tue, Mar 3, 2015 at 5:52 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 3 Mar 2015, Li Wang wrote:
>> Hi Sage,
>>   We are pretty interested in the multi-object transaction support; we
>> think it is potentially very useful. We have read your implementation
>> description and summarized it below; please check whether our
>> understanding is correct:
>>
>> 1 client selects a master and sends the full txn to the master
>> 2 master holds the txn in memory, sends PREPAREs to slaves
>> 3 slaves persist the PREPARE on the side and send PREPARE_ACK;
>>   if there is a compare-then-write operation and the comparison
>>   fails, the slave sends PREPARE_FAIL instead
>> 4 master collects all PREPARE_ACKs, applies the txn,
>>   and marks the txn COMMITTING; if a PREPARE_FAIL is received,
>>   the master sends the slaves ROLL_BACK, and the slaves discard
>>   the prepared txn
>> 5 once persisted, master sends COMMITs to slaves
>> 6 master replies to client COMMITTED, so the client can proceed
>>   with other operations except reading the committed data
>> 7 slaves get COMMIT, apply it, and reply with COMMIT_ACK
>> 8 master collects COMMIT_ACKs and replies to client FINISHED, so the
>>   client can read the data
>> 9 master closes out the txn record
>
> Yep!   Plus the failure path handling...
>
>> We think this shows how to implement a single transaction itself; however,
>> it does not yet take into account concurrent transactions: how to enforce
>> ordering and atomicity among distributed transactions, and how to do
>> locking and deadlock avoidance. It seems there is some further design
>> work to do.
>
> Yeah.  I think it would be nice if we can define a few simple flags
> indicating whether the masters and/or slaves are readable during the
> prepared-but-uncommitted phase, as there are different requirements for
> different users.
>
> And we need to pick a (simple!) deadlock avoidance approach.  Maybe a
> simple EAGAIN is enough and leave it to the clients to be consistent about
> which object to choose as the master.
>
>> We are wondering if you can move this blueprint discussion to a
>> UTC+8-friendly time, so that we can take part.
>
> I think Patrick is moving it!
>
> Thanks-
> sage



-- 

Best Regards,

Patrick McGarry
Director Ceph Community || Red Hat
http://ceph.com  ||  http://community.redhat.com
@scuttlemonkey || @ceph


* Re: About the blueprint OSD: Transactions
  2015-03-05  7:38     ` Sage Weil
@ 2015-03-10  2:09       ` Li Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Li Wang @ 2015-03-10  2:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, pmcgarry, ceph-devel, Samuel Just

The atomicity semantics of a transaction must not be violated. Suppose
there are two concurrent transactions: T1 (Transaction 1) writes a set
of objects {A, B, C}, and T2 touches {B, C, D}, where each object is
on a different OSD, and A and D are selected as the respective
masters. For simplicity, suppose T1 writes 1 to each of its objects,
while T2 writes 2. Then only two results are legal: either A=B=C=1 or
B=C=D=2; it must never happen that B=1 and C=2, or vice versa. Suppose
OSD_B receives the PREPAREs in the order (T1, T2), while OSD_C
receives them in the order (T2, T1). This can happen because T1 and T2
are managed by different masters. The operation sequence is as
follows:

1. OSD_B receives the PREPARE for T1 and does the preparation
2. OSD_C receives the PREPARE for T2 and does the preparation
3. OSD_B receives the PREPARE for T2, finds an in-flight transaction
on B, and waits for T1 to finish
4. OSD_C receives the PREPARE for T1, finds an in-flight transaction
on C, and waits for T2 to finish

Obviously this results in a deadlock. So when an in-flight transaction
already writes the same object, the later PREPARE must not wait. It
cannot simply be accepted either, otherwise atomicity may be violated:
in the example above, if the two OSDs accept the PREPAREs in steps 3
and 4, the final result after both transactions finish is B=2, C=1.
Note that forcing the master to be the lowest-sorting object name does
not seem to fix this either, e.g. when A and D happen to sort lower
than B and C (so they are still the chosen masters under that rule).
So it seems the only option is to give up and retry in such a case.
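
To spell the cycle out, here is a toy C++ illustration of the wait-for
relation in the example above (purely illustrative, invented names):

#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
  // Step 3: at OSD_B, T2 finds T1 in flight on B and would wait for it.
  // Step 4: at OSD_C, T1 finds T2 in flight on C and would wait for it.
  std::map<std::string, std::string> waits_for = {{"T2", "T1"}, {"T1", "T2"}};

  // Walk the wait-for chain from T1; revisiting a txn means a cycle,
  // i.e. neither transaction can ever make progress.
  std::set<std::string> seen;
  std::string cur = "T1";
  while (waits_for.count(cur)) {
    if (!seen.insert(cur).second) {
      std::cout << "deadlock: wait-for cycle back to " << cur << "\n";
      return 0;
    }
    cur = waits_for.at(cur);
  }
  std::cout << "no cycle\n";
}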

Please check whether the following works:

(1) The client computes the PG that the programmer-suggested master
object belongs to, retrieves the primary OSD of that PG (called the
master), and sends the full transaction to it
(2) The master holds the transaction in memory and sends PREPARE to
the slaves
(3) Each slave checks whether there is at least one in-flight
transaction on the same object; if so, it replies EAGAIN to the
master, otherwise it replies PREPARE_ACK
(4) The master collects the PREPARE_ACKs and sends COMMIT to the
slaves. If an EAGAIN is received, the master replies EAGAIN to the
client, sends ROLL_BACK to any prepared slaves, discards the
transaction, and expects the client to resend the transaction with a
newer id
(5) Each slave performs all the read-and-comparison operations and
replies EFAIL if any operation fails. If all succeed, the slave
commits its transaction part into the journal of the PG metadata and
replies COMMIT_ACK to the master
(6) The master collects the COMMIT_ACKs, replies COMMITTED to the
client, and sends APPLY to the slaves. If an EFAIL is received from a
slave, the master replies EFAIL to the client, sends ROLL_BACK to the
slaves, and discards the transaction
(7) Each slave applies the transaction from the journal or PG metadata
to the actual objects and replies APPLY_ACK to the master
(8) The master collects the APPLY_ACKs, replies APPLIED to the client,
and closes out the transaction

Note this does not describe the persist operation on the master side,
because for the PREPARE, COMMIT and APPLY steps the master acts
exactly like a slave. For example, in step (3) the master also checks
for a conflicting in-flight transaction.
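
For concreteness, here is a rough sketch of the slave-side check in
step (3). All names are invented for illustration; nothing below is
taken from the code base:

#include <cstdint>
#include <map>
#include <set>
#include <string>

enum class PrepareResult { PREPARE_ACK, EAGAIN };

// The part of a transaction addressed to one PG (invented names).
struct TxnPart {
  std::uint64_t txn_id;
  std::set<std::string> objects;  // objects this part writes
};

// Per-PG state on a slave (and, per the note above, on the master too).
class SlaveState {
  std::map<std::uint64_t, TxnPart> in_flight;  // prepared but not yet applied
public:
  PrepareResult handle_prepare(const TxnPart& part) {
    for (const auto& entry : in_flight)
      for (const auto& obj : part.objects)
        if (entry.second.objects.count(obj))
          return PrepareResult::EAGAIN;   // conflict: give up, client retries
    in_flight[part.txn_id] = part;        // keep it until COMMIT or ROLL_BACK
    return PrepareResult::PREPARE_ACK;
  }
  void handle_rollback(std::uint64_t txn_id) { in_flight.erase(txn_id); }
};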

Cheers,
Li Wang


On 2015/3/5 15:49, Sage Weil wrote:
> On Thu, 5 Mar 2015, Li Wang wrote:
>> On 2015/3/5 8:56, Sage Weil wrote:
>>> On Wed, 4 Mar 2015, Li Wang wrote:
>>>> Hi Sage, Please take a look if the below works,
>>>> [...]
>>>
>>> I think this works.  A few notes:
>>>
>>> 1- I don't think there's a need to persist the txn on the master until the
>>> slaves reply with PREPARE_ACK.
>>
>> I think the txn must be persisted on the master side at the very
>> start: once the master has sent the message to the slaves, there must
>> be a mechanism for the ROLL_BACK message to be resent to the slaves if
>> the master goes down. Only a small part of the transaction, rather
>> than all of its information, may need to be persisted, though.
>
> I think we can still skip it because it's not about durability (master and
> slave are both PGs that are replicated), just about coordination.  If the
> master re-peers, the slaves will ask whether to roll forward or back and
> the (new) master will respond with ROLLBACK or COMMIT.
>
> If you missed the CDS session it should be posted on youtube shortly... we
> discussed both possibilities.  We think the main difference is that in
> your case you have to do a double write (prepare + commit on master) but
> that hides the commit latency since you can reply when you get the
> PREPARE_ACKs.  In my proposal, you only write once on the master, but you
> have to wait for the PREPAREs, and then write the COMMIT, and then reply
> to the clients, which will have a higher total latency.
>
>>> 2- This is basically optimistic concurrency with backoff if
>>> possible deadlock is detected.  I think we can do the same thing in the
>>> proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
>>> pending or if a client txn is received and there is a pending PREPARE.  In
>>> the latter case, it seems like we should block and wait...
>>>
>>
>> Yes. We can divide the process into two steps. The first step is
>> PREPARE, used only for deadlock avoidance, and it involves only
>> in-memory operations on the slave side: the master sends PREPARE to
>> the slaves, and each slave checks whether there is a pending
>> transaction in memory; if so it replies EAGAIN to the master,
>> otherwise it replies PREPARE_ACK, which makes deadlock avoidance
>> extremely fast. The master collects all the PREPARE_ACKs and sends
>> COMMIT to the slaves; the slaves then commit their transaction part
>> to the PG metadata and reply COMMIT_ACK to the master.
>
> Backing off if any affected object has another in-flight transaction is
> sufficient but also conservative since we'll fail/retry transactions that
> actually could have completed w/o deadlocking.  The alternative is to
> leave it to the client to only propose transactions that won't conflict.
> The latter is certainly an easier first version to implement :) but it may
> also be that it's all that we want.  Solving the deadlock avoidance in the
> general case sucks.  :(
>
> Maybe a simple backoff like you propose is a decent middle ground... I
> suspect, though, that a large portion of transactions in the real world
> will be A+B, A+C, A+D, etc where they are non-deadlocking but do overlap
> (e.g. on an index or metadata object).
>
> sage
>


* Re: About the blueprint OSD: Transactions
  2015-03-05  3:54   ` Li Wang
@ 2015-03-05  7:38     ` Sage Weil
  2015-03-10  2:09       ` Li Wang
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2015-03-05  7:38 UTC (permalink / raw)
  To: Li Wang; +Cc: Josh Durgin, pmcgarry, ceph-devel

On Thu, 5 Mar 2015, Li Wang wrote:
> On 2015/3/5 8:56, Sage Weil wrote:
> > On Wed, 4 Mar 2015, Li Wang wrote:
> > > Hi Sage, Please take a look if the below works,
> > > [...]
> > 
> > I think this works.  A few notes:
> > 
> > 1- I don't think there's a need to persist the txn on the master until the
> > slaves reply with PREPARE_ACK.
> 
> I think the txn must be persisted on the master side at the very
> start: once the master has sent the message to the slaves, there must
> be a mechanism for the ROLL_BACK message to be resent to the slaves if
> the master goes down. Only a small part of the transaction, rather
> than all of its information, may need to be persisted, though.

I think we can still skip it because it's not about durability (master and 
slave are both PGs that are replicated), just about coordination.  If the 
master re-peers, the slaves will ask whether to roll forward or back and 
the (new) master will respond with ROLLBACK or COMMIT.
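
Roughly something like this (invented names, just to pin the idea
down; not an existing interface):

#include <cstdint>
#include <map>

// Sketch only: what a slave asks after the master PG re-peers, for a txn
// it has prepared but not yet heard COMMIT or ROLL_BACK for.
enum class MasterAnswer { COMMIT, ROLLBACK };

struct MasterTxnState {
  std::map<std::uint64_t, bool> committing;  // txn_id -> reached COMMITTING?

  // The (possibly new) master answers from its own replicated/persisted
  // state: roll forward if the txn was marked COMMITTING, else roll back.
  MasterAnswer query(std::uint64_t txn_id) const {
    auto it = committing.find(txn_id);
    return (it != committing.end() && it->second) ? MasterAnswer::COMMIT
                                                  : MasterAnswer::ROLLBACK;
  }
};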

If you missed the CDS session it should be posted on youtube shortly... we 
discussed both possibilities.  We think the main difference is that in 
your case you have to do a double write (prepare + commit on master) but 
that hides the commit latency since you can reply when you get the 
PREPARE_ACKs.  In my proposal, you only write once on the master, but you 
have to wait for the PREPAREs, and then write the COMMIT, and then reply 
to the clients, which will have a higher total latency.

> > 2- This is basically optimistic concurrency with backoff if
> > possible deadlock is detected.  I think we can do the same thing in the
> > proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
> > pending or if a client txn is received and there is a pending PREPARE.  In
> > the latter case, it seems like we should block and wait...
> > 
> 
> Yes. We can divide the process into two steps. The first step is
> PREPARE, used only for deadlock avoidance, and it involves only
> in-memory operations on the slave side: the master sends PREPARE to
> the slaves, and each slave checks whether there is a pending
> transaction in memory; if so it replies EAGAIN to the master,
> otherwise it replies PREPARE_ACK, which makes deadlock avoidance
> extremely fast. The master collects all the PREPARE_ACKs and sends
> COMMIT to the slaves; the slaves then commit their transaction part
> to the PG metadata and reply COMMIT_ACK to the master.

Backing off if any affected object has another in-flight transaction is 
sufficient but also conservative since we'll fail/retry transactions that 
actually could have completed w/o deadlocking.  The alternative is to 
leave it to the client to only propose transactions that won't conflict.  
The latter is certainly an easier first version to implement :) but it may 
also be that it's all that we want.  Solving the deadlock avoidance in the 
general case sucks.  :(

Maybe a simple backoff like you propose is a decent middle ground... I 
suspect, though, that a large portion of transactions in the real world 
will be A+B, A+C, A+D, etc where they are non-deadlocking but do overlap 
(e.g. on an index or metadata object).

sage


* Re: About the blueprint OSD: Transactions
  2015-03-05  0:55 ` Sage Weil
@ 2015-03-05  3:54   ` Li Wang
  2015-03-05  7:38     ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Li Wang @ 2015-03-05  3:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: Josh Durgin, pmcgarry, ceph-devel



On 2015/3/5 8:56, Sage Weil wrote:
> On Wed, 4 Mar 2015, Li Wang wrote:
>> Hi Sage, Please take a look if the below works,
>> [...]
>
> I think this works.  A few notes:
>
> 1- I don't think there's a need to persist the txn on the master until the
> slaves reply with PREPARE_ACK.

I think the txn must be persisted on the master side at the very
start: once the master has sent the message to the slaves, there must
be a mechanism for the ROLL_BACK message to be resent to the slaves if
the master goes down. Only a small part of the transaction, rather
than all of its information, may need to be persisted, though.

>
> 2- This is basically optimistic concurrency with backoff if
> possible deadlock is detected.  I think we can do the same thing in the
> proposal in the blueprint if a PREPARE sees that a txn (in-memory) is
> pending or if a client txn is received and there is a pending PREPARE.  In
> the latter case, it seems like we should block and wait...
>

Yes. We can divide the process into two steps. The first step is
PREPARE, used only for deadlock avoidance, and it involves only
in-memory operations on the slave side: the master sends PREPARE to
the slaves, and each slave checks whether there is a pending
transaction in memory; if so it replies EAGAIN to the master,
otherwise it replies PREPARE_ACK, which makes deadlock avoidance
extremely fast. The master collects all the PREPARE_ACKs and sends
COMMIT to the slaves; the slaves then commit their transaction part to
the PG metadata and reply COMMIT_ACK to the master.

> 3- In either scheme, we can do full deadlock avoidance if we force
> the master to be the lowest-sorting object name, or something like that.
> But I think that will have a performance impact since there is likely a
> best choice for master depending on the transaction itself... like a txn
> that writes 4MB to an object and inserts a pointer in another object;
> clearly the 4MB piece should be the master so that it is only written once
> and doesn't cross the network.
>
> sage
>
>
>
>> 1 Client calculates the PG that the programmer-suggested master
>>   object belongs to, retrieves the primary OSD of that PG (called the
>>   master), and sends the full transaction to it
>> 2 Master persists the whole transaction in the corresponding PG
>>   metadata
>> 3 Master parses the transaction to obtain the set of slave OSDs,
>>   which are the primary OSDs of the other PGs the transaction refers
>>   to, and sends PREPARE, together with the part of the transaction to
>>   be done on each individual PG, to the corresponding slave OSD
>> 4 Each slave OSD checks whether there is a PREPARED-BUT-UNCOMMITTED
>>   transaction in its PG metadata such that the two transactions share
>>   at least one write operation on the same object; if so, the slave
>>   OSD gives up preparing and replies PREPARE_AGAIN. Otherwise, it
>>   performs all the read-and-comparison operations in its received
>>   transaction part and replies PREPARE_FAIL if any of them fails. If
>>   all succeed, it persists its transaction part in its PG metadata
>>   and replies PREPARE_ACK
>> 5 Master collects all PREPARE_ACKs and replies PREPARED to the
>>   client. If a PREPARE_FAIL is received, the master replies ERROR to
>>   the client and sends the slaves ROLL_BACK; the slaves discard their
>>   prepared transaction part, if any, and reply ROLL_BACK_ACK. The
>>   master collects all ROLL_BACK_ACKs and discards the transaction. If
>>   a PREPARE_AGAIN is received, the process is the same as for
>>   PREPARE_FAIL except that the master replies EAGAIN to the client
>> 6 Master sends the slaves COMMIT
>> 7 Slaves get COMMIT, commit their individual transaction part, and
>>   reply COMMIT_ACK
>> 8 Master collects all COMMIT_ACKs and replies COMMITTED to the client
>> 9 Master closes out the transaction record
>>
>> It seems to work without deadlocking in the normal case; however,
>> there are still many kinds of errors it needs to take into account,
>> such as PG changes, OSDs going down, etc., doesn't it?
>>
>> Cheers,
>> Li Wang
>
>
>

