From: Haomai Wang <haomai@xsky.com>
To: xxhdx1985126 <xxhdx1985126@163.com>
Cc: Sage Weil <sweil@redhat.com>,
	jdillama@redhat.com,
	"ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: Re: Re:Re: Re:Re:Re:Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
Date: Tue, 6 Jun 2017 15:49:07 +0800	[thread overview]
Message-ID: <CACJqLyaKv3TvkFopog0mkGCyDqOaC-9ZcgPnPgYMvNRiW3DWCg@mail.gmail.com> (raw)
In-Reply-To: <762fe21a.9068.15c7c58f41b.Coremail.xxhdx1985126@163.com>

On Tue, Jun 6, 2017 at 3:40 PM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>
> Uh, sorry, my fault. The OSD does treat requests from different clients separately.
>
> By the way, I have a further question. Say there is a write operation WRITE(objA.off=3, objB.off=4). If, for some reason (an OSD failure in the primary cluster, for example), the "objB" part of the operation cannot be replicated to the slave cluster, can the "objA" part, and the non-objB parts of subsequent writes that involve objB, still be replicated?

librados only supports object-level transactions.
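
To make that concrete, here is a rough sketch through the C++ librados API
(object names, offsets and data are just the ones from your example; error
handling trimmed): an ObjectWriteOperation is atomic only for the single
object it is applied to, so the objA and objB parts have to be issued as two
independent operations that can succeed or fail separately.

    // Sketch only: each ObjectWriteOperation is an atomic transaction on
    // exactly one object, so an update spanning objA and objB must be
    // split into two independent operations.
    #include <rados/librados.hpp>

    int write_pair(librados::IoCtx& ioctx)
    {
      librados::bufferlist bl_a, bl_b;
      bl_a.append("data-for-A");
      bl_b.append("data-for-B");

      librados::ObjectWriteOperation op_a;
      op_a.write(3, bl_a);                   // the objA.off=3 part
      int r = ioctx.operate("objA", &op_a);  // atomic for objA only
      if (r < 0)
        return r;

      librados::ObjectWriteOperation op_b;
      op_b.write(4, bl_b);                   // the objB.off=4 part
      return ioctx.operate("objB", &op_b);   // independent of the objA write
    }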

>
> At 2017-06-06 12:05:28, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>>
>>Yes, they are from two clients, but it looks to me that the OSD treats requests from different clients uniformly. So I think, in the rbd-mirror scenario, requests from the same rbd-mirror instance can also become disordered.
>>
>>
>>At 2017-06-06 11:13:48, "Haomai Wang" <haomai@xsky.com> wrote:
>>>On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>>>>
>>>> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I included part of my log.
>>>>
>>>
>>>what's your use case?
>>>I noticed your first sentence "Recently, in our test, we found a
>>>strange phenomenon: a READ req from client A that arrived later than a
>>>WRITE req from client B is finished earlier than that WRITE req.".
>>>
>>>Does this mean the two requests are from two clients? If so, it's expected.
>>>
>>>>
>>>> At 2017-06-06 06:50:49, "xxhdx1985126" <xxhdx1985126@163.com> wrote:
>>>>>
>>>>>Thanks for your reply:-)
>>>>>
>>>>>The requeueing is protected by PG::lock. However, once the write request is added to the transaction queue, it is left to the journaling thread and the filestore thread to do the actual write; the OSD's worker thread just releases the PG::lock and tries to retrieve the next request from the OSD's work queue, which gives later requests the opportunity to go before earlier ones. This did happen in our experiment.
>>>>>
>>>>>However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.
>>>>>
>>>>>At 2017-06-06 06:22:36, "Sage Weil" <sweil@redhat.com> wrote:
>>>>>>On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>>>>>>> Thanks for your reply:-)
>>>>>>>
>>>>>>> The requeueing is protected by PG::lock. However, once the write request is
>>>>>>> added to the transaction queue, it is left to the journaling thread and the
>>>>>>> filestore thread to do the actual write; the OSD's worker thread just
>>>>>>> releases the PG::lock and tries to retrieve the next request from the OSD's
>>>>>>> work queue, which gives later requests the opportunity to go before earlier
>>>>>>> ones. This did happen in our experiment.
>>>>>>
>>>>>>FileStore should also strictly order the requests via the OpSequencer.
>>>>>>
>>>>>>> However, since this experiment was done several months ago, I'll upload the
>>>>>>> log if I can find it, or I'll try to reproduce it.
>>>>>>
>>>>>>Okay, thanks!
>>>>>>
>>>>>>sage
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> At 2017-06-06 00:21:34, "Sage Weil" <sweil@redhat.com> wrote:
>>>>>>> >On Mon, 5 Jun 2017, xxhdx1985126 wrote:
>>>>>>> >>
>>>>>>> >> Uh, sorry, I don't quite follow you. According to my understanding of
>>>>>>> >> the OSD source code, and the experiment we previously mentioned in
>>>>>>> >> "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html",
>>>>>>> >> there exists the following scenario in which the actual finishing order
>>>>>>> >> of WRITEs that target the same object is not the same as the order in
>>>>>>> >> which they arrived at the OSD, which, I think, could be a hint that the
>>>>>>> >> order of writes from a single client connection to a single OSD is not
>>>>>>> >> guaranteed:
>>>>>>> >
>>>>>>> >If so, it is a bug that should be fixed in the OSD.  rbd-mirror relying on
>>>>>>> >OSD ordering to be correct is totally fine--lots of other stuff does too.
>>>>>>> >
>>>>>>> >>       Say three writes targeting the same object A arrive at an OSD in
>>>>>>> >> the order "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first
>>>>>>> >> write, "WRITE A.off 1", acquires the objectcontext lock of object A and
>>>>>>> >> is put into a transaction queue to go through the "journaling + file
>>>>>>> >> system write" procedure. Before it is finished, a thread of
>>>>>>> >> OSD::osd_op_tp retrieves the second write and attempts to process it,
>>>>>>> >> finds that the objectcontext lock of A is held by a previous WRITE, and
>>>>>>> >> puts the second write into A's rwstate::waiters queue. Only when the
>>>>>>> >> first write has finished on all replica OSDs is the second write put back
>>>>>>> >> into OSD::shardedop_wq to be processed again later. If, after the second
>>>>>>> >> write has been put into the rwstate::waiters queue and the first write
>>>>>>> >> has finished on all replica OSDs (releasing A's objectcontext lock), but
>>>>>>> >> before the second write is put back into OSD::shardedop_wq, the third
>>>>>>> >> write is retrieved by an OSD worker thread, it gets processed because no
>>>>>>> >> previous operation is holding A's objectcontext lock. In that case, the
>>>>>>> >> actual finishing order of the three writes is "WRITE A.off 1", "WRITE
>>>>>>> >> A.off 3", "WRITE A.off 2", which is different from the order in which
>>>>>>> >> they arrived.
>>>>>>> >
>>>>>>> >This should not happen.  (If it happened in the past, it was a bug, but I
>>>>>>> >would expect it is fixed in the latest hammer point release, and in jewel
>>>>>>> >and master.)  The requeueing is done under the PG::lock so that requeueing
>>>>>>> >preserves ordering.  A fair bit of code and a *lot* of testing goes into
>>>>>>> >ensuring that this is true.  If you've seen this recently, then a
>>>>>>> >reproducer or log (and tracker ticket) would be welcome!  When we see any
>>>>>>> >ordering errors in QA we take them very seriously and fix them quickly.
>>>>>>> >
>>>>>>> >You might be interested in the osd_debug_op_order config option, which we
>>>>>>> >enable in QA; it makes the OSD assert if it sees ops from a client arrive
>>>>>>> >out of order.  The ceph_test_rados workload generator that we use for much
>>>>>>> >of the rados qa suite also fails if it sees out of order operations.
>>>>>>> >
>>>>>>> >sage
>>>>>>> >
>>>>>>> >>
>>>>>>> >> In https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html,
>>>>>>> >> we showed our experiment result, which is exactly the scenario described
>>>>>>> >> above.
>>>>>>> >>
>>>>>>> >> However, the Ceph version on which we did the experiment, and whose source
>>>>>>> >> code we read, was Hammer, 0.94.5. I don't know whether the scenario above
>>>>>>> >> may still exist in later versions.
>>>>>>> >>
>>>>>>> >> Am I right about this? Or am I missing anything? Please help me, I'm really
>>>>>>> >> confused right now. Thank you.
>>>>>>> >>
>>>>>>> >> At 2017-06-05 20:00:06, "Jason Dillaman" <jdillama@redhat.com> wrote:
>>>>>>> >> >The order of writes from a single client connection to a single OSD is
>>>>>>> >> >guaranteed. The rbd-mirror journal replay process handles one event at
>>>>>>> >> >a time and does not start processing the next event until the IO has
>>>>>>> >> >been started in-flight with librados. Therefore, even though the
>>>>>>> >> >replay process allows 50 - 100 IO requests to be in-flight, those IOs
>>>>>>> >> >are actually well-ordered in terms of updates to a single object.
>>>>>>> >> >
>>>>>>> >> >Of course, it would be incorrect behavior for the client application / VM
>>>>>>> >> >to issue such IO requests without waiting for the completion callback
>>>>>>> >> >before issuing the second update.
>>>>>>> >> >
>>>>>>> >> >On Mon, Jun 5, 2017 at 12:05 AM, xxhdx1985126 <xxhdx1985126@163.com> wrote:
>>>>>>> >> >> Hi, everyone.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Recently, I've been reading the source code of rbd-mirror. I wonder how
>>>>>>> >> >> rbd-mirror preserves the order of WRITE operations that finished on the
>>>>>>> >> >> primary cluster. As far as I can understand the code, rbd-mirror fetches
>>>>>>> >> >> I/O operations from the journal on the primary cluster and replays them
>>>>>>> >> >> on the slave cluster without checking whether any I/O operation targeting
>>>>>>> >> >> the same object has already been issued to the slave cluster and not yet
>>>>>>> >> >> finished. Since concurrent operations may finish in a different order
>>>>>>> >> >> than that in which they arrived at the OSD, the order in which the WRITE
>>>>>>> >> >> operations finish on the slave cluster may be different from that on the
>>>>>>> >> >> primary cluster. For example: on the primary cluster, there are two WRITE
>>>>>>> >> >> operations targeting the same object A which are, in the order they
>>>>>>> >> >> finish on the primary cluster, "WRITE A.off data1" and "WRITE A.off
>>>>>>> >> >> data2"; while when they are replayed on the slave cluster, the order may
>>>>>>> >> >> be "WRITE A.off data2" and "WRITE A.off data1", which means that the
>>>>>>> >> >> result of the two operations on the primary cluster is A.off=data2 while,
>>>>>>> >> >> on the slave cluster, the result is A.off=data1.
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >> Is this possible?
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >>
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >
>>>>>>> >> >--
>>>>>>> >> >Jason
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>
>>
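
As a footnote to the completion-callback point Jason makes in the quoted
thread above: a client that needs two updates to the same object to apply
in order has to wait for the first completion before issuing the second
write. A minimal sketch with the C++ librados API (object name, offset and
data are illustrative; error handling trimmed):

    #include <rados/librados.hpp>

    int ordered_updates(librados::IoCtx& ioctx)
    {
      librados::bufferlist bl1, bl2;
      bl1.append("data1");
      bl2.append("data2");

      // First update: issue asynchronously, then wait for the completion
      // callback before touching the same offset again.
      librados::AioCompletion *c = librados::Rados::aio_create_completion();
      int r = ioctx.aio_write("A", c, bl1, bl1.length(), 0);  // WRITE A.off data1
      if (r == 0) {
        c->wait_for_complete();
        r = c->get_return_value();
      }
      c->release();
      if (r < 0)
        return r;

      // Second update: only issued once the first is known to be applied.
      return ioctx.write("A", bl2, bl2.length(), 0);          // WRITE A.off data2
    }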

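Similarly, the osd_debug_op_order check Sage mentions above can be enabled on
a test cluster to catch this class of problem; assuming the usual ceph.conf
syntax, something along the lines of:

    [osd]
      # make the OSD assert if ops from a client arrive out of order
      osd debug op order = true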
Thread overview: 18+ messages
2017-06-05  4:05 How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster xxhdx1985126
2017-06-05 12:00 ` Jason Dillaman
2017-06-05 15:49   ` xxhdx1985126
2017-06-05 16:21     ` Sage Weil
     [not found]       ` <428d3e79.3d2.15c7a56a64f.Coremail.xxhdx1985126@163.com>
2017-06-05 22:22         ` Sage Weil
2017-06-05 22:50           ` xxhdx1985126
2017-06-06  3:04             ` xxhdx1985126
2017-06-06  3:13               ` Haomai Wang
2017-06-06  4:05                 ` xxhdx1985126
2017-06-06  7:34                   ` xxhdx1985126
2017-06-06  7:40                   ` xxhdx1985126
2017-06-06  7:49                     ` Haomai Wang [this message]
2017-06-06  8:07                       ` Re:Re: " xxhdx1985126
2017-06-06 11:20                         ` Jason Dillaman
2017-06-14  2:03                           ` xxhdx1985126
     [not found]             ` <47d3f4bc.3a4b.15c7b3fceb0.Coremail.xxhdx1985126@163.com>
2017-06-06 14:01               ` Sage Weil
2017-06-07 15:36                 ` xxhdx1985126
2017-06-07 15:42                 ` xxhdx1985126
