* How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
@ 2017-06-05  4:05 xxhdx1985126
  2017-06-05 12:00 ` Jason Dillaman
  0 siblings, 1 reply; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-05  4:05 UTC (permalink / raw)
  To: ceph-devel

Hi, everyone.


Recently, I've been reading the source code of rbd-mirror. I wonder how rbd-mirror preserves the order of WRITE operations that finished on the primary cluster. As far as I can understand the code, rbd-mirror fetches I/O operations from the journal on the primary cluster and replays them on the slave cluster without checking whether there are already any I/O operations targeting the same object that have been issued to the slave cluster but not yet finished. Since concurrent operations may finish in a different order than that in which they arrived at the OSD, the order in which the WRITE operations finish on the slave cluster may differ from that on the primary cluster. For example: on the primary cluster, there are two WRITE operations targeting the same object A which are, in the order they finish on the primary cluster, "WRITE A.off data1" and "WRITE A.off data2"; when they are replayed on the slave cluster, the order may be "WRITE A.off data2" and then "WRITE A.off data1", which means that the result of the two operations on the primary cluster is A.off=data2 while, on the slave cluster, the result is A.off=data1.
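
To make the concern concrete, here is a minimal sketch (plain Python with
made-up names, not Ceph code) of two replayed writes that are in flight at
the same time and can therefore complete in either order:

    import random, threading, time

    store = {}

    def write(obj, off, data, delay):
        # Simulate a WRITE whose completion time depends on OSD load.
        time.sleep(delay)
        store[(obj, off)] = data

    # Both events are dispatched before either completes, like a naive replay.
    t1 = threading.Thread(target=write, args=("A", 0, "data1", random.random()))
    t2 = threading.Thread(target=write, args=("A", 0, "data2", random.random()))
    t1.start(); t2.start()
    t1.join(); t2.join()

    # If nothing orders the two writes, either data1 or data2 may survive.
    print(store[("A", 0)])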


Is this possible?


 


* Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-05  4:05 How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster xxhdx1985126
@ 2017-06-05 12:00 ` Jason Dillaman
  2017-06-05 15:49   ` xxhdx1985126
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Dillaman @ 2017-06-05 12:00 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: ceph-devel

The order of writes from a single client connection to a single OSD is
guaranteed. The rbd-mirror journal replay process handles one event at
a time and does not start processing the next event until the IO has
been started in-flight with librados. Therefore, even though the
replay process allows 50 - 100 IO requests to be in-flight, those IOs
are actually well-ordered in terms of updates to a single object.

Of course, it would be incorrect behavior for the client application /
VM to issue such IO requests without waiting for the completion
callback of the first update before issuing the second.
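
A rough sketch of that replay pattern (illustrative Python only; submit_aio
and the handle API are hypothetical stand-ins, not librbd/librados calls):

    def replay(journal_events, submit_aio, max_inflight=100):
        inflight = []
        for event in journal_events:
            # Submission is strictly sequential: the next event is not
            # examined until this IO has been started (queued, in order,
            # on the single client connection).
            inflight.append(submit_aio(event))
            if len(inflight) >= max_inflight:
                inflight.pop(0).wait()   # simple backpressure
        for handle in inflight:
            handle.wait()                # drain remaining completions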

On Mon, Jun 5, 2017 at 12:05 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
> Hi, everyone.
>
>
> Recently, I've been reading the source code of rbd-mirror. I wonder how
> rbd-mirror preserves the order of WRITE operations that finished on the
> primary cluster. As far as I can understand the code, rbd-mirror fetches
> I/O operations from the journal on the primary cluster and replays them
> on the slave cluster without checking whether there are already any I/O
> operations targeting the same object that have been issued to the slave
> cluster but not yet finished. Since concurrent operations may finish in
> a different order than that in which they arrived at the OSD, the order
> in which the WRITE operations finish on the slave cluster may differ
> from that on the primary cluster. For example: on the primary cluster,
> there are two WRITE operations targeting the same object A which are, in
> the order they finish on the primary cluster, "WRITE A.off data1" and
> "WRITE A.off data2"; when they are replayed on the slave cluster, the
> order may be "WRITE A.off data2" and then "WRITE A.off data1", which
> means that the result of the two operations on the primary cluster is
> A.off=data2 while, on the slave cluster, the result is A.off=data1.
>
>
> Is this possible?
>
>
>



-- 
Jason


* Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-05 12:00 ` Jason Dillaman
@ 2017-06-05 15:49   ` xxhdx1985126
  2017-06-05 16:21     ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-05 15:49 UTC (permalink / raw)
  To: jdillama; +Cc: ceph-devel


Uh, sorry, I don't quite follow you. According to my understanding of the source code of the OSD and our experiment previously mentioned in "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html", there exists the following scenario where the actual finishing order of WRITEs that target the same object is not the same as the order in which they arrived at the OSD, which, I think, could be a hint that the order of writes from a single client connection to a single OSD is not guaranteed:

      Say three writes targeting the same object A arrive at an OSD in the order: "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first write, "WRITE A.off 1", acquires the object context lock of object A and is put into a transaction queue to go through the "journaling + file system write" procedure. Before it finishes, a thread of OSD::osd_op_tp retrieves the second write and attempts to process it, during which it finds that the object context lock of A is held by a previous WRITE, so it puts the second write into A's rwstate::waiters queue. Only when the first write has finished on all replica OSDs is the second write put back into OSD::shardedop_wq to be processed again. Now suppose that, after the second write is put into the rwstate::waiters queue and the first write has finished on all replica OSDs (releasing A's object context lock), but before the second write is put back into OSD::shardedop_wq, the third write is retrieved by an OSD worker thread. It would get processed, since no previous operation is holding A's object context lock. In that case, the actual finishing order of the three writes is "WRITE A.off 1", "WRITE A.off 3", "WRITE A.off 2", which is different from the order in which they arrived.
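
To spell that interleaving out, here is a toy sketch (hypothetical Python;
it only mimics the queue movements described above, not the real OSD code):

    # Toy model of the claimed interleaving.
    object_lock_holder = None            # object context lock of object A
    waiters = []                         # rwstate::waiters
    work_queue = ["w1", "w2", "w3"]      # OSD::shardedop_wq, arrival order
    finished = []

    def process(op):
        global object_lock_holder
        if object_lock_holder is None:
            object_lock_holder = op      # proceeds to journaling + fs write
        else:
            waiters.append(op)           # parked behind the lock holder

    process(work_queue.pop(0))           # w1 takes the lock
    process(work_queue.pop(0))           # w2 parked in rwstate::waiters

    # w1 finishes on all replicas: release the lock, requeue the waiter...
    finished.append(object_lock_holder); object_lock_holder = None
    requeued = waiters.pop(0)            # w2 on its way back to shardedop_wq

    # ...but if w3 is picked up before w2 is actually requeued:
    process(work_queue.pop(0))           # w3 finds the lock free and runs
    finished.append(object_lock_holder); object_lock_holder = None
    process(requeued)                    # w2 finally runs
    finished.append(object_lock_holder)

    print(finished)                      # ['w1', 'w3', 'w2'] -- out of order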

In https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html, we showed our experiment result, which matches the above scenario exactly.

However, the Ceph version on which we did the experiment, and whose source code we read, was Hammer 0.94.5. I don't know whether the scenario above still exists in later versions.

Am I right about this? Or am I missing anything? Please help me; I'm really confused right now. Thank you.

At 2017-06-05 20:00:06, "Jason Dillaman" <jdillama@redhat.com> wrote:
>The order of writes from a single client connection to a single OSD is
>guaranteed. The rbd-mirror journal replay process handles one event at
>a time and does not start processing the next event until the IO has
>been started in-flight with librados. Therefore, even though the
>replay process allows 50 - 100 IO requests to be in-flight, those IOs
>are actually well-ordered in terms of updates to a single object.
>
>Of course, it would be incorrect behavior for the client application /
>VM to issue such IO requests without waiting for the completion
>callback of the first update before issuing the second.


* Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-05 15:49   ` xxhdx1985126
@ 2017-06-05 16:21     ` Sage Weil
       [not found]       ` <428d3e79.3d2.15c7a56a64f.Coremail.xxhdx1985126@163.com>
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2017-06-05 16:21 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: jdillama, ceph-devel


On Mon, 5 Jun 2017, xxhdx1985126 wrote:
> 
> Uh, sorry, I don't quite follow you. According to my understanding of 
> the source code of the OSD and our experiment previously mentioned in 
> "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html", 
> there exists the following scenario where the actual finishing order of 
> WRITEs that target the same object is not the same as the order in which 
> they arrived at the OSD, which, I think, could be a hint that the order 
> of writes from a single client connection to a single OSD is not 
> guaranteed:

If so, it is a bug that should be fixed in the OSD.  rbd-mirror relying on 
OSD ordering to be correct is totally fine--lots of other stuff does too.

>       Say three writes targeting the same object A arrive at an OSD in 
> the order: "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first 
> write, "WRITE A.off 1", acquires the object context lock of object A and 
> is put into a transaction queue to go through the "journaling + file 
> system write" procedure. Before it finishes, a thread of OSD::osd_op_tp 
> retrieves the second write and attempts to process it, during which it 
> finds that the object context lock of A is held by a previous WRITE, so 
> it puts the second write into A's rwstate::waiters queue. Only when the 
> first write has finished on all replica OSDs is the second write put 
> back into OSD::shardedop_wq to be processed again. Now suppose that, 
> after the second write is put into the rwstate::waiters queue and the 
> first write has finished on all replica OSDs (releasing A's object 
> context lock), but before the second write is put back into 
> OSD::shardedop_wq, the third write is retrieved by an OSD worker thread. 
> It would get processed, since no previous operation is holding A's 
> object context lock. In that case, the actual finishing order of the 
> three writes is "WRITE A.off 1", "WRITE A.off 3", "WRITE A.off 2", which 
> is different from the order in which they arrived.

This should not happen.  (If it happened in the past, it was a bug, but I 
would expect it is fixed in the latest hammer point release, and in jewel 
and master.)  The requeueing is done under the PG::lock so that it 
preserves ordering.  A fair bit of code and a *lot* of testing goes into 
ensuring that this is true.  If you've seen this recently, then a 
reproducer or log (and tracker ticket) would be welcome!  When we see any 
ordering errors in QA we take them very seriously and fix them quickly.

You might be interested in the osd_debug_op_order config option, which we 
enable in qa, which asserts if it sees ops from a client arrive out of 
order.  The ceph_test_rados workload generator that we use for much of the 
rados qa suite also fails if it sees out of order operations.
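
For illustration only, the invariant that kind of check enforces can be
sketched like this (made-up names, not the actual OSD implementation):

    class OpOrderChecker:
        """Ops from each client connection must arrive with monotonically
        increasing sequence numbers."""
        def __init__(self):
            self.last_seq = {}   # client id -> last seen sequence number

        def on_op_arrival(self, client, seq):
            prev = self.last_seq.get(client, 0)
            assert seq > prev, \
                f"out-of-order op from {client}: seq {seq} after {prev}"
            self.last_seq[client] = seq

    checker = OpOrderChecker()
    checker.on_op_arrival("client.4711", 1)
    checker.on_op_arrival("client.4711", 2)    # in order: fine
    # checker.on_op_arrival("client.4711", 2)  # would trip the assertion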

sage



* Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
       [not found]       ` <428d3e79.3d2.15c7a56a64f.Coremail.xxhdx1985126@163.com>
@ 2017-06-05 22:22         ` Sage Weil
  2017-06-05 22:50           ` xxhdx1985126
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2017-06-05 22:22 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: jdillama, ceph-devel


On Tue, 6 Jun 2017, xxhdx1985126 wrote:
> Thanks for your reply:-)
> 
> The requeueing is protected by PG::lock; however, once the write request
> is added to the transaction queue, it's left to the journaling thread and
> filestore thread to do the actual write. The OSD's worker thread just
> releases the PG::lock and tries to retrieve the next req in the OSD's
> work queue, which gives later reqs the opportunity to go before previous
> reqs. This did happen in our experiment.

FileStore should also strictly order the requests via the OpSequencer.
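
The effect of such a sequencer can be sketched as follows (a Python
simplification, not the actual FileStore code): transactions queued on the
same sequencer apply strictly FIFO even though callers never wait for
earlier ones.

    import queue, threading

    class OpSequencer:
        """All transactions queued on one sequencer apply in FIFO order."""
        def __init__(self):
            self.q = queue.Queue()
            threading.Thread(target=self._apply_loop, daemon=True).start()

        def queue_transaction(self, txn):
            self.q.put(txn)          # returns immediately

        def _apply_loop(self):
            while True:
                txn = self.q.get()   # one at a time, in submission order
                txn()                # do the actual write
                self.q.task_done()

    seq = OpSequencer()
    for i in (1, 2, 3):
        seq.queue_transaction(lambda i=i: print("apply write", i))
    seq.q.join()                     # always prints 1, 2, 3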

> However, since this experiment was done several months ago, I'll upload
> the log if I can find it, or I'll try to reproduce it.

Okay, thanks!

sage



* Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-05 22:22         ` Sage Weil
@ 2017-06-05 22:50           ` xxhdx1985126
  2017-06-06  3:04             ` xxhdx1985126
       [not found]             ` <47d3f4bc.3a4b.15c7b3fceb0.Coremail.xxhdx1985126@163.com>
  0 siblings, 2 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-05 22:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdillama, ceph-devel


Thanks for your reply:-)

The requeueing is protected by PG::lock; however, once the write request is added to the transaction queue, it's left to the journaling thread and filestore thread to do the actual write. The OSD's worker thread just releases the PG::lock and tries to retrieve the next req in the OSD's work queue, which gives later reqs the opportunity to go before previous reqs. This did happen in our experiment.

However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.

At 2017-06-06 06:22:36, "Sage Weil" <sweil@redhat.com> wrote:
>On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>> Thanks for your reply:-)
>> 
>> The requeueing is protected by PG::lock; however, once the write request
>> is added to the transaction queue, it's left to the journaling thread and
>> filestore thread to do the actual write. The OSD's worker thread just
>> releases the PG::lock and tries to retrieve the next req in the OSD's
>> work queue, which gives later reqs the opportunity to go before previous
>> reqs. This did happen in our experiment.
>
>FileStore should also strictly order the requests via the OpSequencer.
>
>> However, since this experiment was done several months ago, I'll upload
>> the log if I can find it, or I'll try to reproduce it.
>
>Okay, thanks!
>
>sage


* Re:Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-05 22:50           ` xxhdx1985126
@ 2017-06-06  3:04             ` xxhdx1985126
  2017-06-06  3:13               ` Haomai Wang
       [not found]             ` <47d3f4bc.3a4b.15c7b3fceb0.Coremail.xxhdx1985126@163.com>
  1 sibling, 1 reply; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-06  3:04 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdillama, ceph-devel


I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I included part of my log.


At 2017-06-06 06:50:49, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>
>Thanks for your reply:-)
>
>The requeueing is protected by PG::lock; however, once the write request is added to the transaction queue, it's left to the journaling thread and filestore thread to do the actual write. The OSD's worker thread just releases the PG::lock and tries to retrieve the next req in the OSD's work queue, which gives later reqs the opportunity to go before previous reqs. This did happen in our experiment.
>
>However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.


* Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  3:04             ` xxhdx1985126
@ 2017-06-06  3:13               ` Haomai Wang
  2017-06-06  4:05                 ` xxhdx1985126
  0 siblings, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2017-06-06  3:13 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: Sage Weil, jdillama, ceph-devel

On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>
> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I included part of my log.
>

What's your use case?
I noticed your first sentence: "Recently, in our test, we found a
strange phenomenon: a READ req from client A that arrived later than a
WRITE req from client B is finished earlier than that WRITE req."

Does that mean the two requests are from two clients? If so, it's expected.
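
In other words, the ordering guarantee is per client connection; between
two connections there is no defined order. A tiny sketch (hypothetical
Python, not Ceph code):

    import random, threading, time

    log = []

    def client(name, op):
        # Each connection orders only its own ops; nothing orders
        # client A's READ against client B's WRITE.
        time.sleep(random.random() / 100)
        log.append((name, op))

    a = threading.Thread(target=client, args=("client.A", "READ"))
    b = threading.Thread(target=client, args=("client.B", "WRITE"))
    b.start(); a.start()
    a.join(); b.join()
    print(log)   # either order is possible, and both are acceptable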



* Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  3:13               ` Haomai Wang
@ 2017-06-06  4:05                 ` xxhdx1985126
  2017-06-06  7:34                   ` xxhdx1985126
  2017-06-06  7:40                   ` xxhdx1985126
  0 siblings, 2 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-06  4:05 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, jdillama, ceph-devel


Yes, they are from two clients, but it looks to me that the OSD treats requests from different clients uniformly. So I think, in the rbd-mirror scenario, requests from the same rbd-mirror instance could also get disordered.


At 2017-06-06 11:13:48, "Haomai Wang" <haomai@xsky.com> wrote:
>On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>>
>> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I included part of my log.
>>
>
>What's your use case?
>I noticed your first sentence: "Recently, in our test, we found a
>strange phenomenon: a READ req from client A that arrived later than a
>WRITE req from client B is finished earlier than that WRITE req."
>
>Does that mean the two requests are from two clients? If so, it's expected.


* Re:Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  4:05                 ` xxhdx1985126
@ 2017-06-06  7:34                   ` xxhdx1985126
  2017-06-06  7:40                   ` xxhdx1985126
  1 sibling, 0 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-06  7:34 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, jdillama, ceph-devel


Uh, sorry, my fault. The OSD does treat requests from different clients separately.

At 2017-06-06 12:05:28, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>
>Yes, they are from two clients, but it looks to me that the OSD treats requests from different clients uniformly. So I think, in the rbd-mirror scenario, requests from the same rbd-mirror instance could also get disordered.
>
>
>At 2017-06-06 11:13:48, "Haomai Wang" <haomai@xsky.com> wrote:
>>On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>>>
>>> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I submitted part of my log.
>>>
>>
>>what's your use case?
>>I noticed your first sentence "Recently, in our test, we found a
>>strange phenomenon: a READ req from client A that arrived later than a
>>WRITE req from client B is finished ealier than that WRITE req.".
>>
>>Is it means the two requests from two client? if so, it's expected.
>>
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re:Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  4:05                 ` xxhdx1985126
  2017-06-06  7:34                   ` xxhdx1985126
@ 2017-06-06  7:40                   ` xxhdx1985126
  2017-06-06  7:49                     ` Haomai Wang
  1 sibling, 1 reply; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-06  7:40 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, jdillama, ceph-devel


Uh, sorry, my fault. OSD does treat requests from different clients separately.

By the way, I have a further question. Say there is a write operation WRITE(objA.off=3, objB.off=4). If, for some reason (a primary-cluster OSD failure, for example), the "objB" part of the operation cannot be replicated to the slave cluster, can the "objA" part, and the non-objB parts of subsequent writes that involve objB, still be replicated?

At 2017-06-06 12:05:28, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>
>Yes, it is from two clients, but it looks to me that OSD treats requests from different clients uniformly. So I think, in the scenario of rbd-mirror, requests from the same rbd-mirror instance can also get disordered.
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  7:40                   ` xxhdx1985126
@ 2017-06-06  7:49                     ` Haomai Wang
  2017-06-06  8:07                       ` xxhdx1985126
  0 siblings, 1 reply; 18+ messages in thread
From: Haomai Wang @ 2017-06-06  7:49 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: Sage Weil, jdillama, ceph-devel

On Tue, Jun 6, 2017 at 3:40 PM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>
> Uh, sorry, my fault. OSD does treat requests from different clients separately.
>
> By the way, I have a further question. Say there is a write operation WRITE(objA.off=3, objB.off=4). If, for some reason (a primary-cluster OSD failure, for example), the "objB" part of the operation cannot be replicated to the slave cluster, can the "objA" part, and the non-objB parts of subsequent writes that involve objB, still be replicated?

librados only supports object-level transactions
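
A minimal sketch of what object-level means here, assuming the librados
C++ API (the object names and offsets are just the ones from the question
above): an ObjectWriteOperation batches its mutations atomically, but only
against a single object, so a logical WRITE(objA, objB) has to be issued
as two independent operations that can succeed or fail separately.

#include <rados/librados.hpp>

int write_two_objects(librados::IoCtx& ioctx,
                      const librados::bufferlist& data1,
                      const librados::bufferlist& data2) {
  // Atomic with respect to objA only.
  librados::ObjectWriteOperation op_a;
  op_a.write(3 /* offset */, data1);
  int r = ioctx.operate("objA", &op_a);
  if (r < 0)
    return r;

  // A separate transaction entirely: objB can fail (for example if its
  // primary OSD is down) even though objA already succeeded.
  librados::ObjectWriteOperation op_b;
  op_b.write(4 /* offset */, data2);
  return ioctx.operate("objB", &op_b);
}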

[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re:Re: Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  7:49                     ` Haomai Wang
@ 2017-06-06  8:07                       ` xxhdx1985126
  2017-06-06 11:20                         ` Jason Dillaman
  0 siblings, 1 reply; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-06  8:07 UTC (permalink / raw)
  To: Haomai Wang; +Cc: Sage Weil, jdillama, ceph-devel


So, if that case happens, rbd-mirror will go on replicating subsequent writes, is this right?

At 2017-06-06 15:49:07, "Haomai Wang" <haomai@xsky.com> wrote:
>On Tue, Jun 6, 2017 at 3:40 PM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>>
>> Uh, sorry, my fault. OSD does treat requests from different clients separately.
>>
>> By the way, I have a further question. Say there is a write operation WRITE(objA.off=3, objB.off=4). If, for some reason (a primary-cluster OSD failure, for example), the "objB" part of the operation cannot be replicated to the slave cluster, can the "objA" part, and the non-objB parts of subsequent writes that involve objB, still be replicated?
>
>librados only supports object-level transactions
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Re: Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06  8:07                       ` xxhdx1985126
@ 2017-06-06 11:20                         ` Jason Dillaman
  2017-06-14  2:03                           ` xxhdx1985126
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Dillaman @ 2017-06-06 11:20 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: Haomai Wang, Sage Weil, ceph-devel

On Tue, Jun 6, 2017 at 4:07 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
> So, if that case happens, rbd-mirror will go on replicating subsequent writes, is this right?

Yes -- until it reaches a flush event and is stalled on the
uncommitted IO from the down OSD (or until it reaches the hard-stop of
100 in-flight, uncommitted IOs).
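
A hedged sketch of that gating logic (hypothetical names, not the actual
rbd-mirror code): replay keeps issuing events without waiting for commits,
but never holds more than 100 uncommitted IOs, and a flush event drains
everything before replay continues.

#include <condition_variable>
#include <mutex>

class InFlightGate {
  std::mutex m;
  std::condition_variable cv;
  size_t in_flight = 0;
  static constexpr size_t kMaxInFlight = 100;  // the hard-stop

public:
  void start_io() {                 // call before issuing an event's IO
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return in_flight < kMaxInFlight; });
    ++in_flight;
  }
  void io_committed() {             // call from the IO commit callback
    std::lock_guard<std::mutex> l(m);
    --in_flight;
    cv.notify_all();
  }
  void flush_event() {              // a flush stalls until all prior IO commits
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return in_flight == 0; });
  }
};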


-- 
Jason

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re:Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
       [not found]             ` <47d3f4bc.3a4b.15c7b3fceb0.Coremail.xxhdx1985126@163.com>
@ 2017-06-06 14:01               ` Sage Weil
  2017-06-07 15:36                 ` xxhdx1985126
  2017-06-07 15:42                 ` xxhdx1985126
  0 siblings, 2 replies; 18+ messages in thread
From: Sage Weil @ 2017-06-06 14:01 UTC (permalink / raw)
  To: xxhdx1985126; +Cc: jdillama, ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 8616 bytes --]

On Tue, 6 Jun 2017, xxhdx1985126 wrote:
> I submitted an issue about three months
> ago: http://tracker.ceph.com/issues/19252

Ah, right.  Reads and writes may reorder by default.  You can ensure a 
read is ordered as a write by adding the RWORDERED flag to the op.  The 
OSD will then order it as a write and you'll get the behavior it sounds 
like you're after.
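
A minimal sketch of the flag in use, assuming the librados C++ API, where
OPERATION_ORDER_READS_WRITES is (as far as I know) the client-side spelling
of RWORDERED; the read then queues behind prior writes to the same object
instead of racing past them:

#include <rados/librados.hpp>

int ordered_read(librados::IoCtx& ioctx, const std::string& oid,
                 librados::bufferlist* out) {
  librados::ObjectReadOperation op;
  op.read(0 /* off */, 4096 /* len */, out, nullptr);

  librados::AioCompletion* c = librados::Rados::aio_create_completion();
  // The flag applies to the whole operation, ordering it like a write.
  int r = ioctx.aio_operate(oid, c, &op,
                            librados::OPERATION_ORDER_READS_WRITES, nullptr);
  if (r == 0) {
    c->wait_for_complete();
    r = c->get_return_value();
  }
  c->release();
  return r;
}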

I don't think this has any implications for rbd-mirror because writes are 
still strictly ordered, and that is what is mirrored.  I haven't thought 
about it too deeply though so maybe I'm missing something?

sage


[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re:Re:Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06 14:01               ` Sage Weil
@ 2017-06-07 15:36                 ` xxhdx1985126
  2017-06-07 15:42                 ` xxhdx1985126
  1 sibling, 0 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-07 15:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdillama, ceph-devel

I think it's just as Haomai said: writes from the same client are strictly ordered, because in OSD::ShardedOp_Wq requests are stored in association with their source. Writes from different sources, on the other hand, may be reordered, as they are not put into the same queue when they are "requeued".
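
A hypothetical illustration of that property (not the actual OSD code):
one FIFO per source, so two requests from the same client can never pass
each other, while requests from different clients may be served in any
interleaving.

#include <deque>
#include <map>
#include <string>

struct Op {
  std::string source;  // client/connection identity
  int seq;             // arrival order within that source
};

class ShardQueue {
  std::map<std::string, std::deque<Op>> per_source;  // one FIFO per client

public:
  void enqueue(const Op& op) { per_source[op.source].push_back(op); }

  // Requeueing (for example after waiting on an object lock) goes back to
  // the front of the *same* source queue, so intra-source order holds.
  void requeue_front(const Op& op) { per_source[op.source].push_front(op); }

  // Which source is served next depends only on the scan order here;
  // nothing orders one source's ops against another's.
  bool dequeue(Op* out) {
    for (auto& kv : per_source) {
      if (!kv.second.empty()) {
        *out = kv.second.front();
        kv.second.pop_front();
        return true;
      }
    }
    return false;
  }
};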


At 2017-06-06 22:01:56, "Sage Weil" <sweil@redhat.com> wrote:
>On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>> I submitted an issue about three months
>> ago: http://tracker.ceph.com/issues/19252
>
>Ah, right.  Reads and writes may reorder by default.  You can ensure a 
>read is ordered as a write by adding the RWORDERED flag to the op.  The 
>OSD will then order it as a write and you'll get the behavior it sounds 
>like you're after.
>
>I don't think this has any implications for rbd-mirror because writes are 
>still strictly ordered, and that is what is mirrored.  I haven't thought 
>about it too deeply though so maybe I'm missing something?
>
>sage
[...]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re:Re:Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06 14:01               ` Sage Weil
  2017-06-07 15:36                 ` xxhdx1985126
@ 2017-06-07 15:42                 ` xxhdx1985126
  1 sibling, 0 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-07 15:42 UTC (permalink / raw)
  To: Sage Weil; +Cc: jdillama, ceph-devel

I think it's just like Haomai said: writes from the same client are ordered strictly, because in OSD::ShardedOpWQ requests are stored associated with their sources, while writes from different sources may be out of order, as they are not put into the same queue when they are "requeued".
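
Roughly, I picture it like the sketch below (hypothetical types and names, purely illustrative -- not the actual OSD::ShardedOpWQ code): a FIFO per source keeps ops from one client strictly ordered, while ops from different sources can interleave freely.

// Hypothetical sketch of per-source ordering in a sharded op queue.
// Same source -> same FIFO, so two ops from one client can never
// swap; ops from different sources live in different queues.
#include <deque>
#include <map>
#include <mutex>
#include <string>

struct Op { std::string source; int seq; };

class PerSourceQueueSketch {
  std::mutex lock;
  std::map<std::string, std::deque<Op>> per_source;  // keyed by client
public:
  void enqueue(const Op& op) {
    std::lock_guard<std::mutex> g(lock);
    per_source[op.source].push_back(op);   // arrival order kept per key
  }
  bool dequeue(const std::string& source, Op* out) {
    std::lock_guard<std::mutex> g(lock);
    auto it = per_source.find(source);
    if (it == per_source.end() || it->second.empty())
      return false;
    *out = it->second.front();             // oldest op from this source
    it->second.pop_front();
    return true;
  }
};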

Thank you all:-)

At 2017-06-06 22:01:56, "Sage Weil" <sweil@redhat.com> wrote:
>On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252
>
>Ah, right.  Reads and writes may reorder by default.  You can ensure a 
>read is ordered as a write by adding the RWORDERED flag to the op.  The 
>OSD will then order it as a write and you'll get the behavior it sounds 
>like you're after.
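
A minimal sketch of what that looks like through librados (assumed flag and overload names as found in librados.hpp -- verify against your release):

// Sketch: a read ordered like a write via librados (assumed API names;
// OPERATION_ORDER_READS_WRITES is taken here to map to RWORDERED).
#include <rados/librados.hpp>
#include <string>

void ordered_read(librados::IoCtx& ioctx, const std::string& oid)
{
  librados::bufferlist bl;
  librados::ObjectReadOperation op;
  int rval = 0;
  op.read(0, 4096, &bl, &rval);   // read 4 KiB at offset 0 into bl

  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  // The flags argument asks the OSD to queue this read through the
  // write pipeline instead of letting it race past in-flight writes.
  ioctx.aio_operate(oid, c, &op,
                    librados::OPERATION_ORDER_READS_WRITES, nullptr);
  c->wait_for_complete();
  c->release();
}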
>
>I don't think this has any implications for rbd-mirror because writes are 
>still strictly ordered, and that is what is mirrored.  I haven't thought 
>about it too deeply though so maybe I'm missing something?
>
>sage
>
>
>> 
>> At 2017-06-06 06:50:49, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>> >
>> >Thanks for your reply:-)
>> >
>> >The requeueing is protected by PG::lock. However, when the write request is
>> >added to the transaction queue, it's left for the journaling thread and
>> >filestore thread to do the actual write; the OSD's worker thread just
>> >releases the PG::lock and tries to retrieve the next req in the OSD's work
>> >queue, which gives the opportunity for later reqs to go before previous
>> >reqs. This did happen in our experiment.
>> >
>> >However, since this experiment was done several months ago, I'll upload the
>> >log if I can find it, or I'll try to reproduce it.
>> >
>> >At 2017-06-06 06:22:36, "Sage Weil" <sweil@redhat.com> wrote:
>> >>On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>> >>> Thanks for your reply:-)
>> >>> 
>> >>> The requeueing is protected by PG::lock. However, when the write request is
>> >>> added to the transaction queue, it's left for the journaling thread and
>> >>> filestore thread to do the actual write; the OSD's worker thread just
>> >>> releases the PG::lock and tries to retrieve the next req in the OSD's work
>> >>> queue, which gives the opportunity for later reqs to go before previous
>> >>> reqs. This did happen in our experiment.
>> >>
>> >>FileStore should also strictly order the requests via the OpSequencer.
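
(For readers unfamiliar with it, the sequencer idea in miniature -- a hypothetical class, not FileStore's actual OpSequencer: everything submitted through one sequencer applies strictly in submission order, however many backend threads exist.)

// Toy op sequencer: submissions drain FIFO, one at a time.
#include <deque>
#include <functional>
#include <mutex>

class OpSequencerSketch {
  std::mutex lock;
  std::deque<std::function<void()>> queue;  // pending applies, FIFO
  bool busy = false;                        // someone is draining
public:
  void submit(std::function<void()> apply) {
    std::unique_lock<std::mutex> l(lock);
    queue.push_back(std::move(apply));
    if (busy) return;          // an earlier op is still applying
    busy = true;
    while (!queue.empty()) {   // drain strictly in submission order
      auto fn = std::move(queue.front());
      queue.pop_front();
      l.unlock();
      fn();                    // do the actual write outside the lock
      l.lock();
    }
    busy = false;
  }
};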
>> >>
>> >>> However, since this experiment was done several months ago, I'll upload the
>> >>> log if I can find it, or I'll try to reproduce it.
>> >>
>> >>Okay, thanks!
>> >>
>> >>sage
>> >>
>> >>
>> >>> 
>> >>> At 2017-06-06 00:21:34, "Sage Weil" <sweil@redhat.com> wrote:
>> >>> >On Mon, 5 Jun 2017, xxhdx1985126 wrote:
>> >>> >> 
>> >>> >> Uh, sorry, I don't quite follow you. According to my understanding of
>> >>> >> the source code of the OSD and our experiment previously mentioned in
>> >>> >> "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html",
>> >>> >> there exists the following scenario where the actual finishing order of
>> >>> >> the WRITEs that target the same object is not the same as the order
>> >>> >> they arrived at the OSD, which, I think, could be a hint that the order
>> >>> >> of writes from a single client connection to a single OSD is not
>> >>> >> guaranteed:
>> >>> >
>> >>> >If so, it is a bug that should be fixed in the OSD.  rbd-mirror relying on
>> >>> >OSD ordering to be correct is totally fine--lots of other stuff does too.
>> >>> >
>> >>> >>       Say three writes targeting the same object A arrived at an OSD in
>> >>> >> the order: "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first
>> >>> >> write, "WRITE A.off 1", acquires the objectcontext lock of object A, and
>> >>> >> is put into a transaction queue to go through the "journaling + file
>> >>> >> system write" procedure. Before it's finished, a thread of
>> >>> >> OSD::osd_op_tp retrieves the second write and attempts to process it,
>> >>> >> during which it finds that the objectcontext lock of A is held by a
>> >>> >> previous WRITE and puts the second write into A's rwstate::waiters
>> >>> >> queue. It's only when the first write is finished on all replica OSDs
>> >>> >> that the second write is put back into OSD::shardedop_wq to be processed
>> >>> >> again in the future. If, after the second write is put into the
>> >>> >> rwstate::waiters queue and the first write is finished on all replica
>> >>> >> OSDs, in which case the first write releases A's objectcontext lock, but
>> >>> >> before the second write is put back into OSD::shardedop_wq, the third
>> >>> >> write is retrieved by the OSD's worker thread, it would get processed as
>> >>> >> no previous operation is holding A's objectcontext lock, in which case
>> >>> >> the actual finishing order of the three writes is "WRITE A.off 1",
>> >>> >> "WRITE A.off 3", "WRITE A.off 2", which is different from the order they
>> >>> >> arrived.
>> >>> >
>> >>> >This should not happen.  (If it happened in the past, it was a bug, but I
>> >>> >would expect it is fixed in the latest hammer point release, and in jewel
>> >>> >and master.)  The requeueing is done under the PG::lock so that requeueing
>> >>> >preserves ordering.  A fair bit of code and a *lot* of testing goes into
>> >>> >ensuring that this is true.  If you've seen this recently, then a
>> >>> >reproducer or log (and tracker ticket) would be welcome!  When we see any
>> >>> >ordering errors in QA we take them very seriously and fix them quickly.
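
A toy model of the requeueing guarantee (hypothetical types; the real code paths differ): waiters released from the object lock go back at the front of the PG's queue, under the PG lock, so a newer op can never be dequeued ahead of an older requeued one.

// Toy model: requeue blocked ops at the FRONT, under the PG lock.
#include <deque>
#include <mutex>

struct OpRef { int seq; };

class PGQueueSketch {
  std::mutex pg_lock;                 // stands in for PG::lock
  std::deque<OpRef> queue;
public:
  void enqueue(OpRef op) {            // new arrivals go to the back
    std::lock_guard<std::mutex> g(pg_lock);
    queue.push_back(op);
  }
  void requeue_blocked(std::deque<OpRef>& waiters) {
    // waiters is FIFO (front = oldest); walking it back-to-front and
    // pushing each at the head leaves the oldest waiter first overall.
    std::lock_guard<std::mutex> g(pg_lock);
    while (!waiters.empty()) {
      queue.push_front(waiters.back());
      waiters.pop_back();
    }
  }
};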
>> >>> >
>> >>> >You might be interested in the osd_debug_op_order config option, which we
>> >>> >enable in qa, which asserts if it sees ops from a client arrive out of
>> >>> >order.  The ceph_test_rados workload generator that we use for much of the
>> >>> >rados qa suite also fails if it sees out of order operations.
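
(For anyone who wants to try it, enabling that assertion is a plain config toggle -- option spelling as of the Hammer-era defaults, so verify against your release:)

[osd]
# assert in the OSD if ops from a single client are seen out of order
osd debug op order = true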
>> >>> >
>> >>> >sage
>> >>> >
>> >>> >> 
>> >>> >> In https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html,
>> >>> >> we showed our experiment result, which is exactly what the above scenario
>> >>> >> describes.
>> >>> >> 
>> >>> >> However, the Ceph version on which we did the experiment, and whose source
>> >>> >> code we read, was Hammer, 0.94.5. I don't know whether the scenario above
>> >>> >> may still exist in later versions.
>> >>> >> 
>> >>> >> Am I right about this? Or am I missing anything? Please help me, I'm
>> >>> >> really confused right now. Thank you.
>> >>> >> 
>> >>> >> At 2017-06-05 20:00:06, "Jason Dillaman" <jdillama@redhat.com> wrote:
>> >>> >> >The order of writes from a single client connection to a single OSD is
>> >>> >> >guaranteed. The rbd-mirror journal replay process handles one event at
>> >>> >> >a time and does not start processing the next event until the IO has
>> >>> >> >been started in-flight with librados. Therefore, even though the
>> >>> >> >replay process allows 50 - 100 IO requests to be in-flight, those IOs
>> >>> >> >are actually well-ordered in terms of updates to a single object.
>> >>> >> >
>> >>> >> >Of course, such an IO request from the client application / VM would
>> >>> >> >be incorrect behavior if they didn't wait for the completion callback
>> >>> >> >before issuing the second update.
>> >>> >> >
>> >>> >> >On Mon, Jun 5, 2017 at 12:05 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>> >>> >> >> Hi, everyone.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Recently, I've been reading the source code of rbd-mirror. I wonder
>> >>> >> >> how rbd-mirror preserves the order of WRITE operations that finished
>> >>> >> >> on the primary cluster. As far as I can understand the code,
>> >>> >> >> rbd-mirror fetches I/O operations from the journal on the primary
>> >>> >> >> cluster, and replays them on the slave cluster without checking
>> >>> >> >> whether any I/O operations targeting the same object have already
>> >>> >> >> been issued to the slave cluster and not yet finished. Since
>> >>> >> >> concurrent operations may finish in a different order than that in
>> >>> >> >> which they arrived at the OSD, the order in which the WRITE
>> >>> >> >> operations finish on the slave cluster may be different from that on
>> >>> >> >> the primary cluster. For example: on the primary cluster, there are
>> >>> >> >> two WRITE operations targeting the same object A which are, in the
>> >>> >> >> order they finish on the primary cluster, "WRITE A.off data1" and
>> >>> >> >> "WRITE A.off data2"; while when they are replayed on the slave
>> >>> >> >> cluster, the order may be "WRITE A.off data2" and "WRITE A.off
>> >>> >> >> data1", which means that the result of the two operations on the
>> >>> >> >> primary cluster is A.off=data2 while, on the slave cluster, the
>> >>> >> >> result is A.off=data1.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> Is this possible?
>> >>> >> >>
>> >>> >> >>
>> >>> >> >>
>> >>> >> >
>> >>> >> >
>> >>> >> >
>> >>> >> >-- 
>> >>> >> >Jason
>> >>> 
>> >>> 
>> >>>  
>> >>> 
>> >>> 
>> >>> 
>> 
>> 
>>  
>> 
>> 
>> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Re: Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
  2017-06-06 11:20                         ` Jason Dillaman
@ 2017-06-14  2:03                           ` xxhdx1985126
  0 siblings, 0 replies; 18+ messages in thread
From: xxhdx1985126 @ 2017-06-14  2:03 UTC (permalink / raw)
  To: jdillama; +Cc: Haomai Wang, Sage Weil, ceph-devel


Hi, thanks for your reply.  Now I have another question. Given that all the writes are replicated in order, why is the flush event replicated? As far as I know, all a flush does is make sure previous writes are committed; it doesn't modify the data, which makes it seem to me that replicating it or not shouldn't make any difference.

Sent from my Mi phone
On Jason Dillaman <jdillama@redhat.com>, Jun 6, 2017 19:20 wrote:

On Tue, Jun 6, 2017 at 4:07 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:

> So, if that case happens, rbd-mirror will go on replicating subsequent writes, is this right?

Yes -- until it reaches a flush event and is stalled on the
uncommitted IO from the down OSD (or until it reaches the hard-stop of
100 in-flight, uncommitted IOs).
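
In other words, the replay pipelining is bounded roughly like the sketch below (names invented for illustration; not the actual rbd-mirror Replay code): writes are started strictly in order, a flush event drains everything in flight, and a hard cap bounds uncommitted IOs.

// Hypothetical replay throttle: ordered starts, flush barrier, hard cap.
#include <condition_variable>
#include <mutex>

class ReplayThrottleSketch {
  std::mutex lock;
  std::condition_variable cond;
  int in_flight = 0;
  static constexpr int kMaxInFlight = 100;  // the hard-stop mentioned above
public:
  void start_io() {                 // called per replayed write event
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return in_flight < kMaxInFlight; });
    ++in_flight;                    // IO handed to librados, in order
  }
  void finish_io() {                // librados completion callback
    std::lock_guard<std::mutex> g(lock);
    --in_flight;
    cond.notify_all();
  }
  void flush_event() {              // a replicated flush: drain everything
    std::unique_lock<std::mutex> l(lock);
    cond.wait(l, [this] { return in_flight == 0; });
  }
};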

-- 
Jason




^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2017-06-14  2:04 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-05  4:05 How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster xxhdx1985126
2017-06-05 12:00 ` Jason Dillaman
2017-06-05 15:49   ` xxhdx1985126
2017-06-05 16:21     ` Sage Weil
     [not found]       ` <428d3e79.3d2.15c7a56a64f.Coremail.xxhdx1985126@163.com>
2017-06-05 22:22         ` Sage Weil
2017-06-05 22:50           ` xxhdx1985126
2017-06-06  3:04             ` xxhdx1985126
2017-06-06  3:13               ` Haomai Wang
2017-06-06  4:05                 ` xxhdx1985126
2017-06-06  7:34                   ` xxhdx1985126
2017-06-06  7:40                   ` xxhdx1985126
2017-06-06  7:49                     ` Haomai Wang
2017-06-06  8:07                       ` xxhdx1985126
2017-06-06 11:20                         ` Jason Dillaman
2017-06-14  2:03                           ` xxhdx1985126
     [not found]             ` <47d3f4bc.3a4b.15c7b3fceb0.Coremail.xxhdx1985126@163.com>
2017-06-06 14:01               ` Sage Weil
2017-06-07 15:36                 ` xxhdx1985126
2017-06-07 15:42                 ` xxhdx1985126

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.