From: Haomai Wang
Subject: Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
Date: Tue, 6 Jun 2017 15:49:07 +0800
To: xxhdx1985126
Cc: Sage Weil, jdillama@redhat.com, "ceph-devel@vger.kernel.org"

On Tue, Jun 6, 2017 at 3:40 PM, xxhdx1985126 wrote:
>
> Uh, sorry, my fault. The OSD does treat requests from different clients separately.
>
> By the way, I have a further question. Say there is a write operation WRITE(objA.off=3, objB.off=4). If, for some reason (a failure of the primary cluster's OSD, for example), the "objB" part of the operation cannot be replicated to the slave cluster, can the "objA" part, and the non-objB parts of subsequent writes that involve objB, still be replicated?

librados only supports object-level transactions.
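To make that point concrete, here is a rough sketch in plain librados (not code taken from rbd-mirror or the OSD; the helper name is made up, and the object names and offsets are just the placeholders from your example): everything batched into a single ObjectWriteOperation is applied atomically, but only to that one object, and there is no transaction spanning objA and objB, so the two updates succeed or fail independently.

#include <rados/librados.hpp>

// Rough sketch: what looks like one logical write at the application level
// is two independent librados ops, because librados transactions never span
// more than one object.  (Return codes are ignored in this sketch.)
void write_two_objects(librados::IoCtx& ioctx,
                       librados::bufferlist data_a,
                       librados::bufferlist data_b)
{
  // Every mutation batched into a single ObjectWriteOperation is applied
  // atomically, but only to this one object.
  librados::ObjectWriteOperation op_a;
  op_a.write(3, data_a);                // objA, offset 3
  ioctx.operate("objA", &op_a);

  // The objB update is a separate op.  If it fails (for example because of
  // an OSD failure), objA has already been modified; there is no
  // cross-object rollback.
  librados::ObjectWriteOperation op_b;
  op_b.write(4, data_b);                // objB, offset 4
  ioctx.operate("objB", &op_b);
}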
> At 2017-06-06 12:05:28, "xxhdx1985126" wrote:
>>
>> Yes, it is from two clients, but it looks to me that the OSD treats requests from different clients uniformly. So I think that, in the rbd-mirror scenario, requests from the same rbd-mirror instance can also get disordered.
>>
>> At 2017-06-06 11:13:48, "Haomai Wang" wrote:
>>> On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 wrote:
>>>>
>>>> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I submitted part of my log.
>>>>
>>> What's your use case?
>>> I noticed your first sentence: "Recently, in our test, we found a strange phenomenon: a READ req from client A that arrived later than a WRITE req from client B is finished earlier than that WRITE req."
>>>
>>> Does that mean the two requests come from two clients? If so, it's expected.
>>>
>>>> At 2017-06-06 06:50:49, "xxhdx1985126" wrote:
>>>>>
>>>>> Thanks for your reply :-)
>>>>>
>>>>> The requeueing is protected by PG::lock. However, when a write request is added to the transaction queue, the actual write is left to the journaling thread and the filestore thread; the OSD's worker thread just releases the PG::lock and tries to retrieve the next req from the OSD's work queue, which gives later reqs the opportunity to go before earlier ones. This did happen in our experiment.
>>>>>
>>>>> However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.
>>>>>
>>>>> At 2017-06-06 06:22:36, "Sage Weil" wrote:
>>>>>> On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>>>>>>> Thanks for your reply :-)
>>>>>>>
>>>>>>> The requeueing is protected by PG::lock. However, when a write request is added to the transaction queue, the actual write is left to the journaling thread and the filestore thread; the OSD's worker thread just releases the PG::lock and tries to retrieve the next req from the OSD's work queue, which gives later reqs the opportunity to go before earlier ones. This did happen in our experiment.
>>>>>>
>>>>>> FileStore should also strictly order the requests via the OpSequencer.
>>>>>>
>>>>>>> However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.
>>>>>>
>>>>>> Okay, thanks!
>>>>>>
>>>>>> sage
>>>>>>
>>>>>>> At 2017-06-06 00:21:34, "Sage Weil" wrote:
>>>>>>> >On Mon, 5 Jun 2017, xxhdx1985126 wrote:
>>>>>>> >>
>>>>>>> >> Uh, sorry, I don't quite follow you. According to my understanding of the OSD source code, and to our experiment previously mentioned in "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html", there exists the following scenario in which the actual finishing order of WRITEs targeting the same object is not the same as the order in which they arrived at the OSD, which, I think, could be a hint that the order of writes from a single client connection to a single OSD is not guaranteed:
>>>>>>> >
>>>>>>> >If so, it is a bug that should be fixed in the OSD. rbd-mirror relying on OSD ordering to be correct is totally fine--lots of other stuff does too.
>>>>>>> >
>>>>>>> >> Say three writes targeting the same object A arrive at an OSD in the order "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first write, "WRITE A.off 1", acquires the object-context lock of object A and is put into a transaction queue to go through the "journaling + file system write" procedure. Before it is finished, a thread of OSD::osd_op_tp retrieves the second write and attempts to process it, finds that the object-context lock of A is held by a previous WRITE, and puts the second write into A's rwstate::waiters queue. Only when the first write is finished on all replica OSDs is the second write put back into OSD::shardedop_wq to be processed again later. If, after the second write has been put into the rwstate::waiters queue and the first write has finished on all replica OSDs (so that the first write releases A's object-context lock), but before the second write is put back into OSD::shardedop_wq, the third write is retrieved by an OSD worker thread, it gets processed because no previous operation is holding A's object-context lock. In that case, the actual finishing order of the three writes is "WRITE A.off 1", "WRITE A.off 3", "WRITE A.off 2", which is different from the order in which they arrived.
>>>>>>> >
>>>>>>> >This should not happen. (If it happened in the past, it was a bug, but I would expect it is fixed in the latest hammer point release, and in jewel and master.) The requeueing is done under the PG::lock so that requeueing preserves ordering. A fair bit of code and a *lot* of testing goes into ensuring that this is true. If you've seen this recently, then a reproducer or log (and tracker ticket) would be welcome! When we see any ordering errors in QA we take them very seriously and fix them quickly.
>>>>>>> >
>>>>>>> >You might be interested in the osd_debug_op_order config option, which we enable in qa and which asserts if it sees ops from a client arrive out of order. The ceph_test_rados workload generator that we use for much of the rados qa suite also fails if it sees out-of-order operations.
>>>>>>> >
>>>>>>> >sage
>>>>>>> >
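Side note for anyone who wants to check this on their own cluster: assuming the option spelling Sage mentions above (osd_debug_op_order) and that your build still carries it, enabling the assertion is just a boolean in ceph.conf on the OSDs, for example:

[osd]
    osd_debug_op_order = true

Running something like ceph_test_rados against a pool is then the kind of workload that would trip it if client op ordering were ever violated.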
>>>>>>> >>
>>>>>>> >> In https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html, we showed our experiment result, which is exactly as the above scenario shows.
>>>>>>> >>
>>>>>>> >> However, the Ceph version on which we did the experiment, and whose source code we read, was Hammer, 0.94.5. I don't know whether the scenario above may still exist in later versions.
>>>>>>> >>
>>>>>>> >> Am I right about this? Or am I missing anything? Please help me, I'm really confused right now. Thank you.
>>>>>>> >>
>>>>>>> >> At 2017-06-05 20:00:06, "Jason Dillaman" wrote:
>>>>>>> >> >The order of writes from a single client connection to a single OSD is guaranteed. The rbd-mirror journal replay process handles one event at a time and does not start processing the next event until the IO has been started in-flight with librados. Therefore, even though the replay process allows 50 - 100 IO requests to be in-flight, those IOs are actually well-ordered in terms of updates to a single object.
>>>>>>> >> >
>>>>>>> >> >Of course, such an IO request from the client application / VM would be incorrect behavior if they didn't wait for the completion callback before issuing the second update.
>>>>>>> >> >
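To illustrate Jason's last point with a rough sketch (plain librados rather than the rbd data path; the object name, offsets and helper name are all made up here): an application that cares which of two overlapping writes wins has to wait for the first completion before issuing the second. If it fires both without waiting, it has already given up any claim on the final ordering.

#include <rados/librados.hpp>

// Minimal sketch of "wait for the completion callback before issuing the
// second update".  Not rbd-mirror code; plain librados against one object.
int dependent_updates(librados::IoCtx& ioctx,
                      librados::bufferlist first,
                      librados::bufferlist second)
{
  // Issue the first update to object "A" asynchronously.
  librados::AioCompletion *c = librados::Rados::aio_create_completion();
  int r = ioctx.aio_write("A", c, first, first.length(), 0 /* offset */);
  if (r < 0) {
    c->release();
    return r;
  }

  // The application must not issue a dependent write to the same region
  // until this one has completed; that is the point Jason is making.
  c->wait_for_complete();
  r = c->get_return_value();
  c->release();
  if (r < 0)
    return r;

  // Only now is the second, overlapping update issued, so "second" is the
  // content that survives.
  return ioctx.write("A", second, second.length(), 0 /* offset */);
}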
>>>>>>> >> >On Mon, Jun 5, 2017 at 12:05 AM, xxhdx1985126 wrote:
>>>>>>> >> >> Hi, everyone.
>>>>>>> >> >>
>>>>>>> >> >> Recently, I've been reading the source code of rbd-mirror. I wonder how rbd-mirror preserves the order of WRITE operations that finished on the primary cluster. As far as I can understand the code, rbd-mirror fetches I/O operations from the journal on the primary cluster and replays them on the slave cluster without checking whether any I/O operation targeting the same object has already been issued to the slave cluster and not yet finished. Since concurrent operations may finish in a different order than the one in which they arrived at the OSD, the order in which the WRITE operations finish on the slave cluster may be different from that on the primary cluster. For example: on the primary cluster, there are two WRITE operations targeting the same object A which finish, in this order, as "WRITE A.off data1" and then "WRITE A.off data2"; when they are replayed on the slave cluster, the order may be "WRITE A.off data2" and then "WRITE A.off data1", which means that the result of the two operations on the primary cluster is A.off=data2 while, on the slave cluster, the result is A.off=data1.
>>>>>>> >> >>
>>>>>>> >> >> Is this possible?
>>>>>>> >> >
>>>>>>> >> >--
>>>>>>> >> >Jason
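Coming back to the original question at the bottom of the thread: the answer above is that the replay on the slave cannot swap "WRITE A.off data1" and "WRITE A.off data2", because journal events are dispatched strictly in order and writes issued in order over one librados client connection are applied in order to any single object by the OSD. A rough sketch of that shape follows (this is not the actual rbd-mirror replay code; the Event struct and all names are invented for illustration):

#include <cstdint>
#include <deque>
#include <string>
#include <utility>
#include <rados/librados.hpp>

// Hypothetical journal event: one write to one object.
struct Event {
  std::string oid;
  uint64_t off;
  librados::bufferlist data;
};

// Dispatch events strictly in journal order.  Each IO is *started* before
// the next event is even looked at; we do not wait for completions, but
// because every write goes out on the same librados client connection, the
// OSD applies writes to any single object in the order they were issued.
void replay_in_order(librados::IoCtx& ioctx, std::deque<Event>& events)
{
  while (!events.empty()) {
    Event e = std::move(events.front());
    events.pop_front();

    librados::ObjectWriteOperation op;
    op.write(e.off, e.data);

    librados::AioCompletion *c = librados::Rados::aio_create_completion();
    ioctx.aio_operate(e.oid, c, &op);  // started in flight, not waited on
    c->release();                      // drop our ref; error handling and
                                       // completion tracking elided here
  }
}

Completions can still come back interleaved across different objects, which is why 50 - 100 IOs can be in flight at once without breaking per-object ordering on the slave.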