From: Haomai Wang
Subject: Re: How does rbd-mirror preserve the order of WRITE operations that finished on the primary cluster
Date: Tue, 6 Jun 2017 11:13:48 +0800
References: <70c5c552.66e6.15c766dce68.Coremail.xxhdx1985126@163.com> <75e21ee8.f334.15c78f23929.Coremail.xxhdx1985126@163.com> <428d3e79.3d2.15c7a56a64f.Coremail.xxhdx1985126@163.com> <448ceff0.46ce.15c7b5c0b16.Coremail.xxhdx1985126@163.com>
In-Reply-To: <448ceff0.46ce.15c7b5c0b16.Coremail.xxhdx1985126@163.com>
To: xxhdx1985126
Cc: Sage Weil, jdillama@redhat.com, ceph-devel@vger.kernel.org

On Tue, Jun 6, 2017 at 11:04 AM, xxhdx1985126 wrote:
>
> I submitted an issue about three months ago: http://tracker.ceph.com/issues/19252, in which I submitted part of my log.
>

What's your use case? I noticed your first sentence: "Recently, in our test, we found a strange phenomenon: a READ req from client A that arrived later than a WRITE req from client B finished earlier than that WRITE req." Does this mean the two requests came from two different clients? If so, that is expected.

>
> At 2017-06-06 06:50:49, "xxhdx1985126" wrote:
>>
>> Thanks for your reply :-)
>>
>> The requeueing is protected by PG::lock. However, once the write request is added to the transaction queue, the actual write is left to the journaling thread and the filestore thread; the OSD's worker thread just releases the PG::lock and tries to retrieve the next req from the OSD's work queue, which gives later reqs the opportunity to go before earlier ones. This did happen in our experiment.
>>
>> However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.
>>
>> At 2017-06-06 06:22:36, "Sage Weil" wrote:
>>> On Tue, 6 Jun 2017, xxhdx1985126 wrote:
>>>> Thanks for your reply :-)
>>>>
>>>> The requeueing is protected by PG::lock. However, once the write request is added to the transaction queue, the actual write is left to the journaling thread and the filestore thread; the OSD's worker thread just releases the PG::lock and tries to retrieve the next req from the OSD's work queue, which gives later reqs the opportunity to go before earlier ones. This did happen in our experiment.
>>>
>>> FileStore should also strictly order the requests via the OpSequencer.
>>>
>>>> However, since this experiment was done several months ago, I'll upload the log if I can find it, or I'll try to reproduce it.
>>>
>>> Okay, thanks!
>>>
>>> sage
>>>
>>>> At 2017-06-06 00:21:34, "Sage Weil" wrote:
>>>> > On Mon, 5 Jun 2017, xxhdx1985126 wrote:
>>>> >>
>>>> >> Uh, sorry, I don't quite follow you.
>>>> >> According to my understanding of the source code of the OSD, and of our experiment previously mentioned in "https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html", there exists the following scenario in which the actual finishing order of WRITEs that target the same object is not the same as the order in which they arrived at the OSD, which, I think, could be a hint that the order of writes from a single client connection to a single OSD is not guaranteed:
>>>> >
>>>> > If so, it is a bug that should be fixed in the OSD. rbd-mirror relying on OSD ordering to be correct is totally fine--lots of other stuff does too.
>>>> >
>>>> >> Say three writes targeting the same object A arrive at an OSD in the order "WRITE A.off 1", "WRITE A.off 2", "WRITE A.off 3". The first write, "WRITE A.off 1", acquires the object context lock of object A and is put into a transaction queue to go through the "journaling + file system write" procedure. Before it is finished, a thread of OSD::osd_op_tp retrieves the second write and attempts to process it, during which it finds that the object context lock of A is held by a previous WRITE and puts the second write into A's rwstate::waiters queue. Only when the first write has finished on all replica OSDs is the second write put back into OSD::shardedop_wq to be processed again later. If, after the second write is put into the rwstate::waiters queue and the first write finishes on all replica OSDs (at which point the first write releases A's object context lock), but before the second write is put back into OSD::shardedop_wq, the third write is retrieved by the OSD's worker thread, then the third write gets processed because no previous operation is holding A's object context lock. In that case, the actual finishing order of the three writes is "WRITE A.off 1", "WRITE A.off 3", "WRITE A.off 2", which is different from the order in which they arrived.
>>>> >
>>>> > This should not happen. (If it happened in the past, it was a bug, but I would expect it is fixed in the latest hammer point release, and in jewel and master.) The requeueing is done under the PG::lock so that requeueing preserves ordering. A fair bit of code and a *lot* of testing goes into ensuring that this is true. If you've seen this recently, then a reproducer or log (and tracker ticket) would be welcome! When we see any ordering errors in QA we take them very seriously and fix them quickly.
>>>> >
>>>> > You might be interested in the osd_debug_op_order config option, which we enable in qa, and which asserts if it sees ops from a client arrive out of order. The ceph_test_rados workload generator that we use for much of the rados qa suite also fails if it sees out-of-order operations.
>>>> >
>>>> > sage
>>>> >
>>>> >> In https://www.mail-archive.com/ceph-users@lists.ceph.com/msg36178.html, we showed our experiment result, which is exactly as the above scenario shows.
>>>> >>
>>>> >> However, the Ceph version on which we did the experiment, and whose source code we read, was Hammer, 0.94.5. I don't know whether the scenario above may still exist in later versions.
>>>> >>
>>>> >> Am I right about this? Or am I missing anything? Please help me, I'm really confused right now. Thank you.
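
(A side note on the osd_debug_op_order option Sage mentions above: it can be enabled cluster-wide from ceph.conf. A minimal sketch -- the [osd] section placement is my assumption; the option name is as given above:

    [osd]
        # assert if ops from a single client are observed arriving out of order
        osd debug op order = true

With this set, an affected OSD asserts instead of silently reordering, which makes it much easier to capture a reproducer log for the tracker ticket.)
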
>>>> >>
>>>> >> At 2017-06-05 20:00:06, "Jason Dillaman" wrote:
>>>> >> > The order of writes from a single client connection to a single OSD is guaranteed. The rbd-mirror journal replay process handles one event at a time and does not start processing the next event until the IO has been started in-flight with librados. Therefore, even though the replay process allows 50 - 100 IO requests to be in-flight, those IOs are actually well-ordered in terms of updates to a single object.
>>>> >> >
>>>> >> > Of course, such an IO request from the client application / VM would be incorrect behavior if it didn't wait for the completion callback before issuing the second update.
>>>> >> >
>>>> >> > On Mon, Jun 5, 2017 at 12:05 AM, xxhdx1985126 wrote:
>>>> >> >> Hi, everyone.
>>>> >> >>
>>>> >> >> Recently, I've been reading the source code of rbd-mirror. I wonder how rbd-mirror preserves the order of WRITE operations that finished on the primary cluster. As far as I can understand the code, rbd-mirror fetches I/O operations from the journal on the primary cluster and replays them on the slave cluster without checking whether any I/O operation targeting the same object has already been issued to the slave cluster and not yet finished. Since concurrent operations may finish in a different order than that in which they arrived at the OSD, the order in which the WRITE operations finish on the slave cluster may be different from that on the primary cluster. For example: on the primary cluster there are two WRITE operations targeting the same object A which are, in the order they finish on the primary cluster, "WRITE A.off data1" and "WRITE A.off data2"; when they are replayed on the slave cluster, the order may be "WRITE A.off data2" and then "WRITE A.off data1", which means that the result of the two operations on the primary cluster is A.off=data2 while, on the slave cluster, the result is A.off=data1.
>>>> >> >>
>>>> >> >> Is this possible?
>>>> >> >
>>>> >> > --
>>>> >> > Jason
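
To make the last two points concrete (ordering is only guaranteed within one client session, so a client, or rbd-mirror's replayer, has to wait for the first write's completion before issuing a dependent update to the same object), here is a minimal librados C++ sketch. It is only an illustration of that rule, not code from rbd-mirror; the pool name, object name, and payloads are made up, and error handling is reduced to early returns.

    // build: g++ -std=c++11 ordered_writes.cc -lrados -o ordered_writes
    #include <rados/librados.hpp>

    int main() {
      librados::Rados cluster;
      cluster.init("admin");            // connect as client.admin (assumption)
      cluster.conf_read_file(nullptr);  // read the default ceph.conf
      if (cluster.connect() < 0) return 1;

      librados::IoCtx io;
      if (cluster.ioctx_create("rbd", io) < 0) return 1;  // "rbd" pool: assumption

      librados::bufferlist bl1, bl2;
      bl1.append("data1");
      bl2.append("data2");

      // First update: issue it asynchronously, then wait for its completion
      // before doing anything that depends on it.
      librados::AioCompletion *c = librados::Rados::aio_create_completion();
      io.aio_write("object_A", c, bl1, bl1.length(), 0);
      c->wait_for_complete();           // do not send the dependent write before this
      int r = c->get_return_value();
      c->release();
      if (r < 0) return 1;

      // Second update to the same offset: only now is it safe to issue it,
      // so the final content is deterministically "data2".
      io.write("object_A", bl2, bl2.length(), 0);

      io.close();
      cluster.shutdown();
      return 0;
    }

If the two writes were instead issued from two different client sessions with no such wait in between, nothing in RADOS promises which one lands last, which is exactly the "expected" case mentioned at the top of this mail.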