* Question about potential data consistency issues when writes failed in mdadm raid1
@ 2023-03-15 18:06 Ronnie Lazar
  2023-03-17 20:58 ` John Stoffel
  0 siblings, 1 reply; 8+ messages in thread
From: Ronnie Lazar @ 2023-03-15 18:06 UTC (permalink / raw)
  To: linux-raid; +Cc: Asaf Levy

Hi,

I'm trying to understand how mdadm protects against inconsistent data
read in the face of failures that occur while writing to a device that
has raid1.
Here is the scenario:
I have set up raid1 that has 2 mirrors. First one is on local storage
and the second is on remote storage.  The remote storage mirror is
configured with write-mostly.
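
For reference, the array is created along these lines (the device names
are placeholders, not the real local and remote block devices):

  mdadm --create /dev/mdX --level=1 --raid-devices=2 \
        /dev/<local-disk> --write-mostly /dev/<remote-disk>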

We have two parallel jobs: one writing to an area on the device and the
other reading from that area.
The write operation writes the data to the first mirror, and at that
point the read operation reads the new data from the first mirror.
Now, before the data has been written to the second (remote) mirror, a
failure occurs which causes the first machine to fail. When the
machine comes up, the data is recovered from the second (remote)
mirror.

Now, when reading from this area, users will receive the older
value, even though in the first read they got the newer value that
was written.

Does mdadm protect against this inconsistency?

Regards,
Ronnie Lazar


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-15 18:06 Question about potential data consistency issues when writes failed in mdadm raid1 Ronnie Lazar
@ 2023-03-17 20:58 ` John Stoffel
  2023-03-19  9:13   ` Asaf Levy
  0 siblings, 1 reply; 8+ messages in thread
From: John Stoffel @ 2023-03-17 20:58 UTC (permalink / raw)
  To: Ronnie Lazar; +Cc: linux-raid, Asaf Levy

>>>>> "Ronnie" == Ronnie Lazar <ronnie.lazar@vastdata.com> writes:

> I'm trying to understand how mdadm protects against inconsistent data
> read in the face of failures that occur while writing to a device that
> has raid1.

You need to give a better test case, with examples. 

> Here is the scenario: I have set up raid1 that has 2 mirrors. First
> one is on local storage and the second is on remote storage.  The
> remote storage mirror is configured with write-mostly.

Configuration details?  And what is the remote device?  

> We have two parallel jobs: one writing to an area on the device and the
> other reading from that area.

So you create /dev/md9 and are writing/reading from it, then the
system crashes and you lose the local half of the mirror, right?

> The write operation writes the data to the first mirror, and at that
> point the read operation reads the new data from the first mirror.

So how is your write succeeding if it's not written to both halves of
the MD device?  You need to give more details and maybe even some
example code showing what you're doing here. 

> Now, before the data has been written to the second (remote) mirror, a
> failure occurs which causes the first machine to fail. When
> the machine comes up, the data is recovered from the second (remote)
> mirror.

Ah... some more details.  It sounds like you have a system A which is
writing to a SITE-local device as well as a REMOTE-site device in the
MD mirror, is this correct?

Are these iSCSI devices?  FibreChannel?  NBD devices?  More details
please.

> Now, when reading from this area, users will receive the older
> value, even though in the first read they got the newer value that
> was written.

> Does mdadm protect against this inconsistency?

It shouldn't be returning success on the write until both sides of the
mirror are updated.  But we can't really tell until you give more
details and an example.

I assume you're not building a RAID1 device and then writing to the
individual devices behind its back or something silly like that,
right?

John



* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-17 20:58 ` John Stoffel
@ 2023-03-19  9:13   ` Asaf Levy
  2023-03-19  9:55     ` Geoff Back
  0 siblings, 1 reply; 8+ messages in thread
From: Asaf Levy @ 2023-03-19  9:13 UTC (permalink / raw)
  To: John Stoffel; +Cc: Ronnie Lazar, linux-raid

Hi John,

Thank you for your quick response; I'll try to elaborate further.
What we are trying to understand is whether there is a potential race
between reads and writes when mirroring 2 devices.
This is unrelated to the fact that the write was not acked.

The scenario is: let's assume we have a reader R and a writer W, and 2
MD devices A and B. A and B are managed under a device M, which is
configured to use A and B as mirrors (RAID 1). Currently, we have some
data on A and B; let's call it V1.

W issues a write (V2) to the managed device M.
The driver sends the write to both A and B at the same time.
The write to device A (V2) completes.
R issues a read to M, which directs it to A and returns the result (V2).
Now the driver and device A fail at the same time, before the write
ever gets to device B.

When the driver recovers, all it is left with is device B, so future
reads will return older data (V1) than the data that was returned to
R.

Thanks,
Asaf


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-19  9:13   ` Asaf Levy
@ 2023-03-19  9:55     ` Geoff Back
  2023-03-19 11:31       ` Asaf Levy
  0 siblings, 1 reply; 8+ messages in thread
From: Geoff Back @ 2023-03-19  9:55 UTC (permalink / raw)
  To: Asaf Levy, John Stoffel; +Cc: Ronnie Lazar, linux-raid

Hi Asaf,

Yes, in principle there are all sorts of cases where you can perform a
read of newly written data that is not yet on the underlying disk, and
hence the possibility of reading the old data following recovery from an
intervening catastrophic event (such as a crash).  This is a fundamental
characteristic of write caching and applies to any storage device and
any write operation where something crashes before the write is complete
- you can get this with a single disk or SSD without having RAID in the
mix at all.

The correct and only way to guarantee that you can never have your
"consistency issue" is to flush the write through to the underlying
devices before reading.  If you explicitly flush the write operation
(which will block until all writes are complete on A, B, M) and the
flush completes successfully, then all reads will be of the new data and
there is no consistency issue.
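
To make that concrete, here is a minimal sketch of the writer side.  It
is purely illustrative: the /dev/md9 path, the offset and the 4 KiB
write size are assumptions, not something taken from your setup.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    memset(buf, 'A', sizeof(buf));            /* "V2", the new data */

    int fd = open("/dev/md9", O_WRONLY);      /* hypothetical RAID1 device */
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
        perror("pwrite");
        return 1;
    }

    /* Block until the data and a cache flush have reached the members.
     * Only after this succeeds should the reader be told the new data
     * exists; before that, a crash may legitimately roll back to V1. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    /* ...signal the reader out of band (pipe, socket, lock, etc.)... */
    return close(fd);
}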

Your scenario describes a concern in the higher-level code, not in the
storage system.  If your application needs to be absolutely certain that
even after a crash you cannot end up reading old data having previously
read new data, then it is the responsibility of the application to flush
the writes to the storage before executing the read.  You would also
need to ensure that the application cannot read the data between
write and flush; there are several different ways to achieve that
(O_DIRECT may be helpful).  Alternatively, you might want to look at
using something other than the disk for your data interchange between
processes.
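
If it helps, an O_DIRECT variant of the same idea might look roughly
like this.  Again it is only a sketch; the device path and the
4096-byte block size are assumptions, and note the buffer/offset/length
alignment that O_DIRECT requires.  O_SYNC is added so the write is also
durable before pwrite() returns.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 4096;        /* assumed logical block size */
    void *buf = NULL;

    if (posix_memalign(&buf, blksz, blksz) != 0)   /* aligned buffer */
        return 1;
    memset(buf, 'A', blksz);

    int fd = open("/dev/md9", O_WRONLY | O_DIRECT | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    if (pwrite(fd, buf, blksz, 0) != (ssize_t)blksz) {
        perror("pwrite");
        return 1;
    }

    close(fd);
    free(buf);
    return 0;
}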

Regards,

Geoff.

Geoff Back
What if we're all just characters in someone's nightmares?


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-19  9:55     ` Geoff Back
@ 2023-03-19 11:31       ` Asaf Levy
  2023-03-19 12:45         ` Geoff Back
  2023-03-20 13:52         ` John Stoffel
  0 siblings, 2 replies; 8+ messages in thread
From: Asaf Levy @ 2023-03-19 11:31 UTC (permalink / raw)
  To: Geoff Back; +Cc: John Stoffel, Ronnie Lazar, linux-raid

Thank you for the clarification.

To make sure I fully understand:
An application that requires consistency should use O_DIRECT and
enforce an R/W lock on top of the mirrored device?

Asaf


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-19 11:31       ` Asaf Levy
@ 2023-03-19 12:45         ` Geoff Back
  2023-03-19 14:34           ` Asaf Levy
  2023-03-20 13:52         ` John Stoffel
  1 sibling, 1 reply; 8+ messages in thread
From: Geoff Back @ 2023-03-19 12:45 UTC (permalink / raw)
  To: Asaf Levy; +Cc: John Stoffel, Ronnie Lazar, linux-raid

Hi Asaf,

All disk subsystems are inherently consistent in the normal meaning of
the term under normal circumstances.

An application that requires your specific definition of consistency
across catastrophic failure cases in the disk subsystem needs to use an
application-appropriate method of ensuring writes are flushed before
reading.  Writing with O_DIRECT is one method and, depending on the
application's exact requirements, may be sufficient on its own.  In other
application domains, flushing plus some form of out-of-band signalling or
locking is better.  It really depends on the application.

Regards,

Geoff.

Geoff Back
What if we're all just characters in someone's nightmares?


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-19 12:45         ` Geoff Back
@ 2023-03-19 14:34           ` Asaf Levy
  0 siblings, 0 replies; 8+ messages in thread
From: Asaf Levy @ 2023-03-19 14:34 UTC (permalink / raw)
  To: Geoff Back; +Cc: John Stoffel, Ronnie Lazar, linux-raid

I understand.
As you said, I meant consistent under the definition that the data must
never go back in time, regardless of the failure type.

Thanks,
Asaf


* Re: Question about potential data consistency issues when writes failed in mdadm raid1
  2023-03-19 11:31       ` Asaf Levy
  2023-03-19 12:45         ` Geoff Back
@ 2023-03-20 13:52         ` John Stoffel
  1 sibling, 0 replies; 8+ messages in thread
From: John Stoffel @ 2023-03-20 13:52 UTC (permalink / raw)
  To: Asaf Levy; +Cc: Geoff Back, John Stoffel, Ronnie Lazar, linux-raid

>>>>> "Asaf" == Asaf Levy <asaf@vastdata.com> writes:

> Thank you for the clarification.
> To make sure I fully understand.

> An application that requires consistency should use O_DIRECT and
> enforce an R/W lock on top of the mirrored device?

No, it's more complicated than that.  Remember, you also have a
filesystem on there, and different filesystems have different
semantics for write(2) system calls.  But in POSIX, writes should only
return when the data is on the disk, or return an error which you need
to handle properly. 

So a write to a RAID device (doesn't matter which type) should only
return when the data is written to all members of the RAID group.  If
it doesn't, it's really quite broken.  

The other way to really make sure your data is written properly is to
call syncfs() after you do a write which you want to confirm is on the
disks.
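
Purely as a sketch, something along these lines; the mount point and
file name are made up, and fsync(fd) would be the narrower per-file
alternative to flushing the whole filesystem:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/mnt/md9/data.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, "V2", 2) != 2) { perror("write"); return 1; }

    /* Flush all dirty data and metadata of the filesystem backing fd. */
    if (syncfs(fd) != 0) { perror("syncfs"); return 1; }

    return close(fd);
}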

But again, you really haven't given enough details on what you are
trying to do here, and what problem you are trying to solve.  

If you have a local SSD, and then you have a remote NBD (Network Block
Device) running across the country at the end of an 80 millisecond link,
and you're using RAID1 on those two devices, then your write
performance will be limited by the remote device.  Writing to the MD
device holding those two devices will have to wait until the remote
device acknowledges the write back, which will take time.  

John


