From: Asaf Levy
Date: Sun, 19 Mar 2023 16:34:27 +0200
Subject: Re: Question about potential data consistency issues when writes failed in mdadm raid1
To: Geoff Back
Cc: John Stoffel, Ronnie Lazar, linux-raid@vger.kernel.org

I understand. As you said, I meant consistent under the definition that
data must never go back in time, regardless of the failure type.

Thanks,
Asaf

On Sun, Mar 19, 2023 at 2:45 PM Geoff Back wrote:
>
> Hi Asaf,
>
> All disk subsystems are inherently consistent in the normal meaning of
> the term under normal circumstances.
>
> An application that requires your specific definition of consistency
> across catastrophic failure cases in the disk subsystem needs to use an
> application-appropriate method of ensuring writes are flushed before
> reading. Writing with O_DIRECT is one method and, depending on the
> application's exact requirements, may be sufficient on its own. In other
> application domains, flushing and some form of out-of-band signalling or
> locking is better. It really depends on the application.
>
> Regards,
>
> Geoff.
>
> Geoff Back
> What if we're all just characters in someone's nightmares?
>
> On 19/03/2023 11:31, Asaf Levy wrote:
> > Thank you for the clarification.
> >
> > To make sure I fully understand:
> > an application that requires consistency should use O_DIRECT and
> > enforce an R/W lock on top of the mirrored device?
> >
> > Asaf
> >
> > On Sun, Mar 19, 2023 at 11:55 AM Geoff Back wrote:
> >> Hi Asaf,
> >>
> >> Yes, in principle there are all sorts of cases where you can perform a
> >> read of newly written data that is not yet on the underlying disk and
> >> hence the possibility of reading the old data following recovery from an
> >> intervening catastrophic event (such as a crash). This is a fundamental
> >> characteristic of write caching and applies with any storage device and
> >> any write operation where something crashes before the write is complete
> >> - you can get this with a single disk or SSD without having RAID in the
> >> mix at all.
> >>
> >> The correct and only way to guarantee that you can never have your
> >> "consistency issue" is to flush the write through to the underlying
> >> devices before reading. If you explicitly flush the write operation
> >> (which will block until all writes are complete on A, B, M) and the
> >> flush completes successfully, then all reads will be of the new data and
> >> there is no consistency issue.
> >>
> >> Your scenario describes a concern for the higher-level code, not in the
> >> storage system. If your application needs to be absolutely certain that
> >> even after a crash you cannot end up reading old data having previously
> >> read new data, then it is the responsibility of the application to flush
> >> the writes to the storage before executing the read. You would also
> >> need to ensure that the application cannot read from the data between
> >> write and flush; there are several different ways to achieve that
> >> (O_DIRECT may be helpful). Alternatively, you might want to look at
> >> using something other than the disk for your data interchange between
> >> processes.
> >>
> >> Regards,
> >>
> >> Geoff.
> >>
> >> Geoff Back
> >> What if we're all just characters in someone's nightmares?
> >>
> >> On 19/03/2023 09:13, Asaf Levy wrote:
> >>> Hi John,
> >>>
> >>> Thank you for your quick response; I'll try to elaborate further.
> >>> What we are trying to understand is whether there is a potential race
> >>> between reads and writes when mirroring 2 devices.
> >>> This is unrelated to the fact that the write was not acked.
> >>>
> >>> The scenario is: let's assume we have a reader R and a writer W and 2
> >>> MD devices A and B. A and B are managed under a device M which is
> >>> configured to use A and B as mirrors (RAID 1). Currently, we have some
> >>> data on A and B, let's call it V1.
> >>>
> >>> W issues a write (V2) to the managed device M
> >>> The driver sends the write both to A and B at the same time.
> >>> The write to device A (V2) completes
> >>> R issues a read to M which directs it to A and returns the result (V2).
> >>> Now the driver and device A fail at the same time, before the write
> >>> ever gets to device B.
> >>>
> >>> When the driver recovers, all it is left with is device B, so future
> >>> reads will return older data (V1) than the data that was returned to
> >>> R.
> >>>
> >>> Thanks,
> >>> Asaf
> >>>
> >>> On Fri, Mar 17, 2023 at 10:58 PM John Stoffel wrote:
> >>>>>>>>> "Ronnie" == Ronnie Lazar writes:
> >>>>> I'm trying to understand how mdadm protects against inconsistent data
> >>>>> read in the face of failures that occur while writing to a device that
> >>>>> has raid1.
> >>>> You need to give a better test case, with examples.
> >>>>
> >>>>> Here is the scenario: I have set up raid1 that has 2 mirrors. First
> >>>>> one is on local storage and the second is on remote storage. The
> >>>>> remote storage mirror is configured with write-mostly.
> >>>> Configuration details? And what is the remote device?
> >>>>
> >>>>> We have parallel jobs: one writing to an area on the device and the
> >>>>> other reading from that area.
> >>>> So you create /dev/md9 and are writing/reading from it, then the
> >>>> system crashes and you lose the local half of the mirror, right?
> >>>>
> >>>>> The write operation writes the data to the first mirror, and at that
> >>>>> point the read operation reads the new data from the first mirror.
> >>>> So how is your write succeeding if it's not written to both halves of
> >>>> the MD device? You need to give more details and maybe even some
> >>>> example code showing what you're doing here.
> >>>>
> >>>>> Now, before data has been written to the second (remote) mirror, a
> >>>>> failure has occurred which caused the first machine to fail. When
> >>>>> the machine comes up, the data is recovered from the second, remote,
> >>>>> mirror.
> >>>> Ah... some more details. It sounds like you have a system A which is
> >>>> writing to a SITE local remote device as well as a REMOTE site device
> >>>> in the MD mirror, is this correct?
> >>>>
> >>>> Are these iSCSI devices? FibreChannel? NBD devices? More details
> >>>> please.
> >>>>
> >>>>> Now when reading from this area, the users will receive the older
> >>>>> value, even though in the first read they got the newer value that
> >>>>> was written.
> >>>>> Does mdadm protect against this inconsistency?
> >>>> It shouldn't be returning success on the write until both sides of the
> >>>> mirror are updated. But we can't really tell until you give more
> >>>> details and an example.
> >>>>
> >>>> I assume you're not building a RAID1 device and then writing to the
> >>>> individual devices behind its back or something silly like that,
> >>>> right?
> >>>>
> >>>> John
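
For reference, below is a minimal C sketch of the pattern Geoff describes:
write, then flush through to the mirror legs, and only then let the reader
proceed. The device path (/dev/md0), the 4096-byte block size, and the
fdatasync()-based flush are illustrative assumptions, not taken from the
thread; error handling is reduced to aborting.

#define _GNU_SOURCE              /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096  /* assumed; must be a multiple of the device's logical block size for O_DIRECT */

int main(void)
{
    /* O_DIRECT bypasses the page cache, but data may still sit in a
     * volatile drive cache until a flush is issued. */
    int fd = open("/dev/md0", O_RDWR | O_DIRECT);   /* placeholder device path */
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {  /* O_DIRECT needs an aligned buffer */
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, BLOCK);
    memcpy(buf, "V2", 2);                           /* the new version from the scenario */

    /* Write V2 at offset 0. */
    if (pwrite(fd, buf, BLOCK, 0) != (ssize_t)BLOCK) { perror("pwrite"); return 1; }

    /* Flush: the md layer is expected to push this down to the member
     * devices (Geoff's "all writes are complete on A, B, M"), so only
     * after it returns successfully should the reader be told that V2
     * is stable. */
    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }

    /* ... signal or unlock the reader here (lock release, message, etc.) ... */

    free(buf);
    close(fd);
    return 0;
}

The same ordering works without O_DIRECT by writing through the page cache
and calling fdatasync() or fsync() before releasing the reader; O_DIRECT
mainly narrows the window in which another process could read the
not-yet-flushed data out of the page cache, which is the point Geoff makes
about preventing reads between write and flush.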