* Is this error fixable or do I need to rebuild the drive?
@ 2022-03-04 23:33 Jan Kanis
  2022-03-04 23:39 ` Qu Wenruo
  2022-03-07  0:42 ` Damien Le Moal
  0 siblings, 2 replies; 13+ messages in thread
From: Jan Kanis @ 2022-03-04 23:33 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a btrfs filesystem with two disks in raid 1. Each btrfs device
sits on top of a LUKS encrypted volume, which consists of a raw drive
partition on a SMR hard disk, though I don't think that's relevant.
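
For anyone curious about the exact layout: such a stack is typically
built along these lines (device and mapper names are placeholders, not
my literal commands):

# unlock both LUKS volumes
cryptsetup open /dev/sda2 cryptA
cryptsetup open /dev/sdb2 cryptB
# create the btrfs raid1 (data and metadata) on top of the mappings
mkfs.btrfs -m raid1 -d raid1 /dev/mapper/cryptA /dev/mapper/cryptB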

One of the drives failed, the sata link appears to have died, if I'm
interpreting the system logs right. As it's a raid 1 the system kept
running and I didn't notice the dead drive until some time later,
during which I kept using the filesystem.
Something wasn't behaving right, so I decided to reboot. After the
reboot the btrfs filesystem didn't come up and one of the drives was
dead. I was able to mount from the remaining device with
degraded/read-only, all data seemed to be there.
I took out the dead drive and put it into another system for
examination. After some fiddling the drive came up again, so it wasn't
permanently dead after all. I was able to mount it degraded/read-only.
It looked good except for missing the latest changes I made to some
files I was working with, so it was a bit out of date. A btrfs scrub
showed no corruptions.
I put the drive back in the original system, thinking that btrfs would
either refuse to mount it or fix it from the other copy. The
filesystem automatically mounted rw without a 'degraded' option, and
the filesystem could be used again. The logs showed some "parent
transid verify failed" errors, which I assumed would be corrected from
the other copy. Attempting to mount only the drive that had failed
with degraded/read-only now no longer worked.

It's now some days later, the filesystem is still working, but I'm
also still getting "parent transid verify failed" errors in the logs,
and "read error corrected". So by now I'm thinking that btrfs
apparently does not fix this error by itself. What's happening here,
and why isn't btrfs fixing it, it has two copies of everything?
What's the best way to fix it manually? Rebalance the data? scrub it?
delete, wipe and re-add the device that failed so the mirror can be
rebuilt?

best,
Jan


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-04 23:33 Is this error fixable or do I need to rebuild the drive? Jan Kanis
@ 2022-03-04 23:39 ` Qu Wenruo
  2022-03-05  3:11   ` Remi Gauvin
  2022-03-07  0:42 ` Damien Le Moal
  1 sibling, 1 reply; 13+ messages in thread
From: Qu Wenruo @ 2022-03-04 23:39 UTC (permalink / raw)
  To: Jan Kanis, linux-btrfs



On 2022/3/5 07:33, Jan Kanis wrote:
> Hi,
>
> I have a btrfs filesystem with two disks in raid 1. Each btrfs device
> sits on top of a LUKS encrypted volume, which consists of a raw drive
> partition on a SMR hard disk, though I don't think that's relevant.
>
> One of the drives failed, the sata link appears to have died, if I'm
> interpreting the system logs right. As it's a raid 1 the system kept
> running and I didn't notice the dead drive until some time later,
> during which I kept using the filesystem.
> Something wasn't behaving right, so I decided to reboot. After the
> reboot the btrfs filesystem didn't come up and one of the drives was
> dead. I was able to mount from the remaining device with
> degraded/read-only, all data seemed to be there.

One thing to keep in mind: degraded/read-only is always the safe choice.

Btrfs is not yet good at handling split-brain cases (meaning each device
gets mounted degraded on its own, and new data is written to both devices).

So it's good that you didn't write to the devices while they were separated.

> I took out the dead drive and put it into another system for
> examination. After some fiddling the drive came up again, so it wasn't
> permanently dead after all. I was able to mount it degraded/read-only.
> It looked good except for missing the latest changes I made to some
> files I was working with, so it was a bit out of date. A btrfs scrub
> showed no corruptions.
> I put the drive back in the original system, thinking that btrfs would
> either refuse to mount it or fix it from the other copy. The
> filesystem automatically mounted rw without a 'degraded' option, and
> the filesystem could be used again. The logs showed some "parent
> transid verify failed" errors, which I assumed would be corrected from
> the other copy. Attempting to mount only the drive that had failed
> with degraded/read-only now no longer worked.
>
> It's now some days later, the filesystem is still working, but I'm
> also still getting "parent transid verify failed" errors in the logs,
> and "read error corrected".

That's because your workload has not yet CoWed all of the metadata.

Thus you will still sometimes hit out-of-sync data from the old device.

> So by now I'm thinking that btrfs
> apparently does not fix this error by itself. What's happening here,
> and why isn't btrfs fixing it, it has two copies of everything?
> What's the best way to fix it manually? Rebalance the data? scrub it?

Scrub it would be the correct thing to do.

On a read-write mount, scrub will try to read all data/metadata and
check it against the checksums.
Any mismatch will cause btrfs to rewrite the affected data from a good copy.

Thus it should solve the problem.
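
For reference, a typical run looks something like this (the mount point
is just an example):

# start a scrub in the background on the mounted filesystem
btrfs scrub start /mnt
# watch progress and the count of corrected errors
btrfs scrub status /mnt
# per-device error counters, before and after
btrfs device stats /mnt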

We may want to do an automatic scrub for out-of-sync devices in the near
future.

Thanks,
Qu

> delete, wipe and re-add the device that failed so the mirror can be
> rebuilt?
>
> best,
> Jan


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-04 23:39 ` Qu Wenruo
@ 2022-03-05  3:11   ` Remi Gauvin
  2022-03-05  6:47     ` Qu Wenruo
  0 siblings, 1 reply; 13+ messages in thread
From: Remi Gauvin @ 2022-03-05  3:11 UTC (permalink / raw)
  To: Qu Wenruo, Jan Kanis, linux-btrfs

On 2022-03-04 6:39 p.m., Qu Wenruo wrote:

>  So by now I'm thinking that btrfs
>> apparently does not fix this error by itself. What's happening here,
>> and why isn't btrfs fixing it, it has two copies of everything?
>> What's the best way to fix it manually? Rebalance the data? scrub it?
> 
> Scrub it would be the correct thing to do.
> 

Correct me if I'm wrong, the statistical math is a little above my head.

Since the failed drive was disconnected for some time while the
filesystem was read write, there is potentially hundreds of thousands of
sectors with incorrect data.  That will not only make scrub slow, but
due to CRC collision, has a 'significant' chance of leaving some data on
the failed drive corrupt.

If I understand this correctly, the safest way to fix this filesystem
without unnecessary chance of corrupt data is to do a dev replace of the
failed drive to a hot spare with the -r switch, so it is only reading
from the drive with the most consistent data.  This strategy requires a
3rd drive, at least temporarily.

So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
drive that had taken a vacation....

And /dev/sdc1 is a new hot spare

btrfs replace start -r /dev/sdb1 /dev/sdc1

(On some kernel versions you might have to reboot for the replace
operation to finish.)  Once /dev/sdb1 is completely removed, if you
wanted to use it again, you could:

btrfs replace start /dev/sdc1 /dev/sdb1
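
To keep an eye on it while it runs, something like this should work
(the mount point is a placeholder):

# progress of the running replace
btrfs replace status /mnt
# afterwards, confirm which devices are now in the filesystem
btrfs filesystem show /mnt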



* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05  3:11   ` Remi Gauvin
@ 2022-03-05  6:47     ` Qu Wenruo
  2022-03-05 10:33       ` Jan Kanis
  2022-03-06  4:06       ` Zygo Blaxell
  0 siblings, 2 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-03-05  6:47 UTC (permalink / raw)
  To: Remi Gauvin, Jan Kanis, linux-btrfs



On 2022/3/5 11:11, Remi Gauvin wrote:
> On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
>
>>   So by now I'm thinking that btrfs
>>> apparently does not fix this error by itself. What's happening here,
>>> and why isn't btrfs fixing it, it has two copies of everything?
>>> What's the best way to fix it manually? Rebalance the data? scrub it?
>>
>> Scrub it would be the correct thing to do.
>>
>
> Correct me if I'm wrong, the statistical math is a little above my head.
>
> Since the failed drive was disconnected for some time while the
> filesystem was read write, there is potentially hundreds of thousands of
> sectors with incorrect data.

Mostly correct.

>  That will not only make scrub slow, but
> due to CRC collision, has a 'significant' chance of leaving some data on
> the failed drive corrupt.

I doubt it; 2^32 is not a small number, not to mention your data may
not be that random.

Thus I'm not that concerned about hash collisions.

>
> If I understand this correctly, the safest way to fix this filesystem
> without unnecessary chance of corrupt data is to do a dev replace of the
> failed drive to a hot spare with the -r switch, so it is only reading
> from the drive with the most consistent data.  This strategy requires a
> 3rd drive, at least temporarily.

That would also be a solution.

Even better, you don't need a third device: just wipe the out-of-date
device and then replace the missing device with that freshly wiped one.

But please note that if your good device has any data corruption, there
is then no chance to recover that data from the out-of-date device.

Thus I prefer a scrub, as it still has a (maybe low) chance to recover
such data.

But if you have already scrubbed the good device, mounted degraded
without the bad one, and no corruption was reported, then you are fine
to go ahead with replace.
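
For reference, that route would look roughly like this (device names
and the mount point are placeholders; take the devid of the missing
device from 'btrfs filesystem show'):

# wipe the btrfs signature from the stale device so it is not scanned
wipefs -a /dev/mapper/stale
# mount only the remaining good device, degraded and writable
mount -o degraded /dev/mapper/good /mnt
# rebuild the mirror onto the wiped device; here 2 is the missing devid
btrfs replace start 2 /dev/mapper/stale /mnt
btrfs replace status /mnt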

Thanks,
Qu

>
> So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
> drive that had taken a vacation....
>
> And /dev/sdc1 is a new hot spare
>
> btrfs replace start -r /dev/sdb1 /dev/sdc1
>
> (On some kernel versions you might have to reboot for the replace
> operation to finish.  But once /dev/sdb1 is completely removed, if you
> wanted to use it again, you could
>
> btrfs replace start /dev/sdc1 /dev/sdb1
>


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05  6:47     ` Qu Wenruo
@ 2022-03-05 10:33       ` Jan Kanis
  2022-03-05 10:42         ` Jan Kanis
  2022-03-05 11:06         ` Qu Wenruo
  2022-03-06  4:06       ` Zygo Blaxell
  1 sibling, 2 replies; 13+ messages in thread
From: Jan Kanis @ 2022-03-05 10:33 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Remi Gauvin, linux-btrfs

I wrote about 40 GB of data to the filesystem before noticing that one
device had failed, but that data is now no longer needed so I don't
care a lot if I might have a corrupted block in there. The filesystem
is used mainly for backups and storage of large files, there's no
operating system automatically updating things on that drive, so I'm
quite sure of what changes are on there.

What surprises me is that I'm getting checksum failures at all now. I
scrubbed both devices independently when I had taken one of them out
of the system, and both passed without errors. The checksum failures
only started happening when I added the out of date device back into
the array. Does btrfs assume that both devices are in sync in such a
case, and thus that a checksum from device 1 is also valid for the
equivalent block on device 2?

The statistics:
The chance of one block matching its checksum by chance is 2**-32. 40
GB is 1 million blocks. The chance of not having any spurious checksum
matches is then (1-2**-32)**1e6, which is 0.999767. That's not as high
as I was expecting but still a very good chance.

> not to mention your data may not be that random.
I think there are many cases where the data is pretty random, which is
when it is compressed. The data on this drive is largely media files
or other compressed files, which are pretty random. The only case I
can think of where you would have large amounts of uncompressed data
on your disk is if you're running a database on it.

I'll see what happens with a scrub.

Thanks for the help, Jan


On Sat, 5 Mar 2022 at 07:47, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2022/3/5 11:11, Remi Gauvin wrote:
> > On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
> >
> >>   So by now I'm thinking that btrfs
> >>> apparently does not fix this error by itself. What's happening here,
> >>> and why isn't btrfs fixing it, it has two copies of everything?
> >>> What's the best way to fix it manually? Rebalance the data? scrub it?
> >>
> >> Scrub it would be the correct thing to do.
> >>
> >
> > Correct me if I'm wrong, the statistical math is a little above my head.
> >
> > Since the failed drive was disconnected for some time while the
> > filesystem was read write, there is potentially hundreds of thousands of
> > sectors with incorrect data.
>
> Mostly correct.
>
> >  That will not only make scrub slow, but
> > due to CRC collision, has a 'significant' chance of leaving some data on
> > the failed drive corrupt.
>
> I doubt, 2^32 is not a small number, not to mention your data may not be
>   that random.
>
> Thus I'm not that concerned about hash conflicts.
>
> >
> > If I understand this correctly, the safest way to fix this filesystem
> > without unnecessary chance of corrupt data is to do a dev replace of the
> > failed drive to a hot spare with the -r switch, so it is only reading
> > from the drive with the most consistent data.  This strategy requires a
> > 3rd drive, at least temporarily.
>
> That also would be a solution.
>
> And better, you don't need to bother a third device, just wipe the
> out-of-data device, and replace missing device with that new one.
>
> But please note that, if your good device has any data corruption, there
> is no chance to recover that data using the out-of-date device.
>
> Thus I prefer a scrub, as it still has a chance (maybe low) to recover.
>
> But if you have already scrub the good device, mounted degradely without
> the bad one, and no corruption reported, then you are fine to go ahead
> with replace.
>
> Thanks,
> Qu
>
> >
> > So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
> > drive that had taken a vacation....
> >
> > And /dev/sdc1 is a new hot spare
> >
> > btrfs replace start -r /dev/sdb1 /dev/sdc1
> >
> > (On some kernel versions you might have to reboot for the replace
> > operation to finish.  But once /dev/sdb1 is completely removed, if you
> > wanted to use it again, you could
> >
> > btrfs replace start /dev/sdc1 /dev/sdb1
> >


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05 10:33       ` Jan Kanis
@ 2022-03-05 10:42         ` Jan Kanis
  2022-03-05 11:21           ` Qu Wenruo
  2022-03-05 11:06         ` Qu Wenruo
  1 sibling, 1 reply; 13+ messages in thread
From: Jan Kanis @ 2022-03-05 10:42 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Remi Gauvin, linux-btrfs

Correction on the statistics: 40 GB is 10 million blocks. Chance of no
spurious checksum matches is then (1-2**-32)**1e7 = 0.99767. The risk
starts to become significant when writing a few terabytes.
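
For anyone who wants to check the arithmetic, a quick one-liner
(assuming 4 KiB blocks and fully random data):

python3 -c "print((1 - 2**-32) ** (40e9 / 4096))"
# prints roughly 0.9977, i.e. about a 0.2% chance of at least one
# spurious checksum match in 40 GB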

On Sat, 5 Mar 2022 at 11:33, Jan Kanis <jan.code@jankanis.nl> wrote:
>
> I wrote about 40 GB of data to the filesystem before noticing that one
> device had failed, but that data is now no longer needed so I don't
> care a lot if I might have a corrupted block in there. The filesystem
> is used mainly for backups and storage of large files, there's no
> operating system automatically updating things on that drive, so I'm
> quite sure of what changes are on there.
>
> What surprises me is that I'm getting checksum failures at all now. I
> scrubbed both devices independently when I had taken one of them out
> of the system, and both passed without errors. The checksum failures
> only started happening when I added the out of date device back into
> the array. Does btrfs assume that both devices are in sync in such a
> case, and thus that a checksum from device 1 is also valid for the
> equivalent block on device 2?
>
> The statistics:
> The chance of one block matching its checksum by chance is 2**-32. 40
> GB is 1 million blocks. The chance of not having any spurious checksum
> matches is then (1-2**-32)**1e6, which is 0.999767. That's not as high
> as I was expecting but still a very good chance.
>
> > not to mention your data may not be that random.
> I think there are many cases where the data is pretty random, which is
> when it is compressed. The data on this drive is largely media files
> or other compressed files, which are pretty random. The only case I
> can think of where you would have large amounts of uncompressed data
> on your disk is if you're running a database on it.
>
> I'll see what happens with a scrub.
>
> Thanks for the help, Jan
>
>
> On Sat, 5 Mar 2022 at 07:47, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
> >
> >
> >
> > On 2022/3/5 11:11, Remi Gauvin wrote:
> > > On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
> > >
> > >>   So by now I'm thinking that btrfs
> > >>> apparently does not fix this error by itself. What's happening here,
> > >>> and why isn't btrfs fixing it, it has two copies of everything?
> > >>> What's the best way to fix it manually? Rebalance the data? scrub it?
> > >>
> > >> Scrub it would be the correct thing to do.
> > >>
> > >
> > > Correct me if I'm wrong, the statistical math is a little above my head.
> > >
> > > Since the failed drive was disconnected for some time while the
> > > filesystem was read write, there is potentially hundreds of thousands of
> > > sectors with incorrect data.
> >
> > Mostly correct.
> >
> > >  That will not only make scrub slow, but
> > > due to CRC collision, has a 'significant' chance of leaving some data on
> > > the failed drive corrupt.
> >
> > I doubt, 2^32 is not a small number, not to mention your data may not be
> >   that random.
> >
> > Thus I'm not that concerned about hash conflicts.
> >
> > >
> > > If I understand this correctly, the safest way to fix this filesystem
> > > without unnecessary chance of corrupt data is to do a dev replace of the
> > > failed drive to a hot spare with the -r switch, so it is only reading
> > > from the drive with the most consistent data.  This strategy requires a
> > > 3rd drive, at least temporarily.
> >
> > That also would be a solution.
> >
> > And better, you don't need to bother a third device, just wipe the
> > out-of-data device, and replace missing device with that new one.
> >
> > But please note that, if your good device has any data corruption, there
> > is no chance to recover that data using the out-of-date device.
> >
> > Thus I prefer a scrub, as it still has a chance (maybe low) to recover.
> >
> > But if you have already scrub the good device, mounted degradely without
> > the bad one, and no corruption reported, then you are fine to go ahead
> > with replace.
> >
> > Thanks,
> > Qu
> >
> > >
> > > So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
> > > drive that had taken a vacation....
> > >
> > > And /dev/sdc1 is a new hot spare
> > >
> > > btrfs replace start -r /dev/sdb1 /dev/sdc1
> > >
> > > (On some kernel versions you might have to reboot for the replace
> > > operation to finish.  But once /dev/sdb1 is completely removed, if you
> > > wanted to use it again, you could
> > >
> > > btrfs replace start /dev/sdc1 /dev/sdb1
> > >


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05 10:33       ` Jan Kanis
  2022-03-05 10:42         ` Jan Kanis
@ 2022-03-05 11:06         ` Qu Wenruo
  1 sibling, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-03-05 11:06 UTC (permalink / raw)
  To: Jan Kanis; +Cc: Remi Gauvin, linux-btrfs



On 2022/3/5 18:33, Jan Kanis wrote:
> I wrote about 40 GB of data to the filesystem before noticing that one
> device had failed, but that data is now no longer needed so I don't
> care a lot if I might have a corrupted block in there. The filesystem
> is used mainly for backups and storage of large files, there's no
> operating system automatically updating things on that drive, so I'm
> quite sure of what changes are on there.
>
> What surprises me is that I'm getting checksum failures at all now. I
> scrubbed both devices independently when I had taken one of them out
> of the system, and both passed without errors. The checksum failures
> only started happening when I added the out of date device back into
> the array. Does btrfs assume that both devices are in sync in such a
> case, and thus that a checksum from device 1 is also valid for the
> equivalent block on device 2?

This is the split-brain case.

Each device has its own version of the metadata, and each version
passes its checksum check at its own generation.

But when the devices are mixed again, btrfs uses the latest tree root,
so the older metadata is considered corrupted, and so is the older data.

That's why scrubbing each device independently makes no sense for such
a split-brain case.
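
If you want to see this for yourself, each device's superblock records
its own generation, which you can dump per device (the device path is a
placeholder):

# the 'generation' field shows how far each device's metadata got
btrfs inspect-internal dump-super /dev/mapper/cryptA | grep -w generation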


Thanks,
Qu

>
> The statistics:
> The chance of one block matching its checksum by chance is 2**-32. 40
> GB is 1 million blocks. The chance of not having any spurious checksum
> matches is then (1-2**-32)**1e6, which is 0.999767. That's not as high
> as I was expecting but still a very good chance.
>
>> not to mention your data may not be that random.
> I think there are many cases where the data is pretty random, which is
> when it is compressed. The data on this drive is largely media files
> or other compressed files, which are pretty random. The only case I
> can think of where you would have large amounts of uncompressed data
> on your disk is if you're running a database on it.
>
> I'll see what happens with a scrub.
>
> Thanks for the help, Jan
>
>
> On Sat, 5 Mar 2022 at 07:47, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>
>>
>>
>> On 2022/3/5 11:11, Remi Gauvin wrote:
>>> On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
>>>
>>>>    So by now I'm thinking that btrfs
>>>>> apparently does not fix this error by itself. What's happening here,
>>>>> and why isn't btrfs fixing it, it has two copies of everything?
>>>>> What's the best way to fix it manually? Rebalance the data? scrub it?
>>>>
>>>> Scrub it would be the correct thing to do.
>>>>
>>>
>>> Correct me if I'm wrong, the statistical math is a little above my head.
>>>
>>> Since the failed drive was disconnected for some time while the
>>> filesystem was read write, there is potentially hundreds of thousands of
>>> sectors with incorrect data.
>>
>> Mostly correct.
>>
>>>   That will not only make scrub slow, but
>>> due to CRC collision, has a 'significant' chance of leaving some data on
>>> the failed drive corrupt.
>>
>> I doubt, 2^32 is not a small number, not to mention your data may not be
>>    that random.
>>
>> Thus I'm not that concerned about hash conflicts.
>>
>>>
>>> If I understand this correctly, the safest way to fix this filesystem
>>> without unnecessary chance of corrupt data is to do a dev replace of the
>>> failed drive to a hot spare with the -r switch, so it is only reading
>>> from the drive with the most consistent data.  This strategy requires a
>>> 3rd drive, at least temporarily.
>>
>> That also would be a solution.
>>
>> And better, you don't need to bother a third device, just wipe the
>> out-of-data device, and replace missing device with that new one.
>>
>> But please note that, if your good device has any data corruption, there
>> is no chance to recover that data using the out-of-date device.
>>
>> Thus I prefer a scrub, as it still has a chance (maybe low) to recover.
>>
>> But if you have already scrub the good device, mounted degradely without
>> the bad one, and no corruption reported, then you are fine to go ahead
>> with replace.
>>
>> Thanks,
>> Qu
>>
>>>
>>> So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
>>> drive that had taken a vacation....
>>>
>>> And /dev/sdc1 is a new hot spare
>>>
>>> btrfs replace start -r /dev/sdb1 /dev/sdc1
>>>
>>> (On some kernel versions you might have to reboot for the replace
>>> operation to finish.  But once /dev/sdb1 is completely removed, if you
>>> wanted to use it again, you could
>>>
>>> btrfs replace start /dev/sdc1 /dev/sdb1
>>>


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05 10:42         ` Jan Kanis
@ 2022-03-05 11:21           ` Qu Wenruo
  0 siblings, 0 replies; 13+ messages in thread
From: Qu Wenruo @ 2022-03-05 11:21 UTC (permalink / raw)
  To: Jan Kanis; +Cc: Remi Gauvin, linux-btrfs



On 2022/3/5 18:42, Jan Kanis wrote:
> Correction on the statistics: 40 GB is 10 million blocks. Chance of no
> spurious checksum matches is then (1-2**-32)**1e7 = 0.99767. The risk
> starts to become significant when writing a few terabytes.

That's why we have support for SHA256.

Furthermore, we're going to support offline (i.e. unmounted) conversion
to other checksum algorithms using btrfs-progs in the near future.

So please look forward to that feature.
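
Until then, the checksum algorithm can already be chosen at mkfs time
if you ever rebuild the filesystem from scratch, e.g. (device names are
placeholders):

# create a raid1 filesystem using sha256 checksums instead of crc32c
mkfs.btrfs --csum sha256 -m raid1 -d raid1 /dev/mapper/cryptA /dev/mapper/cryptB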

Thanks,
Qu
>
> On Sat, 5 Mar 2022 at 11:33, Jan Kanis <jan.code@jankanis.nl> wrote:
>>
>> I wrote about 40 GB of data to the filesystem before noticing that one
>> device had failed, but that data is now no longer needed so I don't
>> care a lot if I might have a corrupted block in there. The filesystem
>> is used mainly for backups and storage of large files, there's no
>> operating system automatically updating things on that drive, so I'm
>> quite sure of what changes are on there.
>>
>> What surprises me is that I'm getting checksum failures at all now. I
>> scrubbed both devices independently when I had taken one of them out
>> of the system, and both passed without errors. The checksum failures
>> only started happening when I added the out of date device back into
>> the array. Does btrfs assume that both devices are in sync in such a
>> case, and thus that a checksum from device 1 is also valid for the
>> equivalent block on device 2?
>>
>> The statistics:
>> The chance of one block matching its checksum by chance is 2**-32. 40
>> GB is 1 million blocks. The chance of not having any spurious checksum
>> matches is then (1-2**-32)**1e6, which is 0.999767. That's not as high
>> as I was expecting but still a very good chance.
>>
>>> not to mention your data may not be that random.
>> I think there are many cases where the data is pretty random, which is
>> when it is compressed. The data on this drive is largely media files
>> or other compressed files, which are pretty random. The only case I
>> can think of where you would have large amounts of uncompressed data
>> on your disk is if you're running a database on it.
>>
>> I'll see what happens with a scrub.
>>
>> Thanks for the help, Jan
>>
>>
>> On Sat, 5 Mar 2022 at 07:47, Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>>>
>>>
>>>
>>> On 2022/3/5 11:11, Remi Gauvin wrote:
>>>> On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
>>>>
>>>>>    So by now I'm thinking that btrfs
>>>>>> apparently does not fix this error by itself. What's happening here,
>>>>>> and why isn't btrfs fixing it, it has two copies of everything?
>>>>>> What's the best way to fix it manually? Rebalance the data? scrub it?
>>>>>
>>>>> Scrub it would be the correct thing to do.
>>>>>
>>>>
>>>> Correct me if I'm wrong, the statistical math is a little above my head.
>>>>
>>>> Since the failed drive was disconnected for some time while the
>>>> filesystem was read write, there is potentially hundreds of thousands of
>>>> sectors with incorrect data.
>>>
>>> Mostly correct.
>>>
>>>>   That will not only make scrub slow, but
>>>> due to CRC collision, has a 'significant' chance of leaving some data on
>>>> the failed drive corrupt.
>>>
>>> I doubt, 2^32 is not a small number, not to mention your data may not be
>>>    that random.
>>>
>>> Thus I'm not that concerned about hash conflicts.
>>>
>>>>
>>>> If I understand this correctly, the safest way to fix this filesystem
>>>> without unnecessary chance of corrupt data is to do a dev replace of the
>>>> failed drive to a hot spare with the -r switch, so it is only reading
>>>> from the drive with the most consistent data.  This strategy requires a
>>>> 3rd drive, at least temporarily.
>>>
>>> That also would be a solution.
>>>
>>> And better, you don't need to bother a third device, just wipe the
>>> out-of-data device, and replace missing device with that new one.
>>>
>>> But please note that, if your good device has any data corruption, there
>>> is no chance to recover that data using the out-of-date device.
>>>
>>> Thus I prefer a scrub, as it still has a chance (maybe low) to recover.
>>>
>>> But if you have already scrub the good device, mounted degradely without
>>> the bad one, and no corruption reported, then you are fine to go ahead
>>> with replace.
>>>
>>> Thanks,
>>> Qu
>>>
>>>>
>>>> So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
>>>> drive that had taken a vacation....
>>>>
>>>> And /dev/sdc1 is a new hot spare
>>>>
>>>> btrfs replace start -r /dev/sdb1 /dev/sdc1
>>>>
>>>> (On some kernel versions you might have to reboot for the replace
>>>> operation to finish.  But once /dev/sdb1 is completely removed, if you
>>>> wanted to use it again, you could
>>>>
>>>> btrfs replace start /dev/sdc1 /dev/sdb1
>>>>


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-05  6:47     ` Qu Wenruo
  2022-03-05 10:33       ` Jan Kanis
@ 2022-03-06  4:06       ` Zygo Blaxell
  2022-03-06 11:45         ` Remi Gauvin
  1 sibling, 1 reply; 13+ messages in thread
From: Zygo Blaxell @ 2022-03-06  4:06 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Remi Gauvin, Jan Kanis, linux-btrfs

On Sat, Mar 05, 2022 at 02:47:31PM +0800, Qu Wenruo wrote:
> 
> 
> On 2022/3/5 11:11, Remi Gauvin wrote:
> > On 2022-03-04 6:39 p.m., Qu Wenruo wrote:
> > 
> > >   So by now I'm thinking that btrfs
> > > > apparently does not fix this error by itself. What's happening here,
> > > > and why isn't btrfs fixing it, it has two copies of everything?
> > > > What's the best way to fix it manually? Rebalance the data? scrub it?
> > > 
> > > Scrub it would be the correct thing to do.
> > > 
> > 
> > Correct me if I'm wrong, the statistical math is a little above my head.
> > 
> > Since the failed drive was disconnected for some time while the
> > filesystem was read write, there is potentially hundreds of thousands of
> > sectors with incorrect data.
> 
> Mostly correct.
> 
> >  That will not only make scrub slow, but
> > due to CRC collision, has a 'significant' chance of leaving some data on
> > the failed drive corrupt.
> 
> I doubt, 2^32 is not a small number, not to mention your data may not be
>  that random.
> 
> Thus I'm not that concerned about hash conflicts.

It becomes an issue after about 16 TB of unique data has been
written--then the collision probability approaches 100%.  Not much of
a problem yet, but individual disks already passed 16 TB a while ago.
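
A rough back-of-the-envelope check of that figure, assuming 4 KiB
blocks and fully random data:

python3 -c "import math; n = 16e12 / 4096; print(1 - math.exp(-n / 2**32))"
# roughly 0.6, i.e. ~60% chance of at least one collision by 16 TB,
# climbing toward certainty well beyond that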

> > If I understand this correctly, the safest way to fix this filesystem
> > without unnecessary chance of corrupt data is to do a dev replace of the
> > failed drive to a hot spare with the -r switch, so it is only reading
> > from the drive with the most consistent data.  This strategy requires a
> > 3rd drive, at least temporarily.
> 
> That also would be a solution.
> 
> And better, you don't need to bother a third device, just wipe the
> out-of-data device, and replace missing device with that new one.
> 
> But please note that, if your good device has any data corruption, there
> is no chance to recover that data using the out-of-date device.
> 
> Thus I prefer a scrub, as it still has a chance (maybe low) to recover.

Ideally, 'btrfs replace' would be able to replace a device with itself,
i.e. remove the restriction that the replacing device and the replaced
device can't be the same device.  This is one of the more important
management features that mdadm has which btrfs is still missing.

This form of replace should read from other disks if possible (like -r),
otherwise don't write anything on the replaced disk since we'd just be
reading the data that is already there and rewriting it in the same block.

The "run a scrub" approach doesn't work with nodatacow files, so they
end up corrupted because there's no way to tell btrfs that one drive is
definitely missing some writes and its content should not be trusted.
"Replace with same device" enables btrfs to know that nodatacow blocks on
the device are not trustworthy and should be reconstructed from mirrors,
without losing the ability to recover any other data should another device
in the filesystem fail during the replacement operation.
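
A rough way to spot the files that would be affected (the mount point
is a placeholder; 'C' is the No_COW attribute):

find /mnt -type f -exec lsattr {} + 2>/dev/null | awk '$1 ~ /C/'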

If it's easier to implement, some way to force an online btrfs device
offline with a mounted btrfs would be sufficient in the short term
(i.e. transition from all drives online to degraded mode with a
device removed from the filesystem).  The wipefs approach requires a
device-specific sysfs delete operation which makes it unusable with
block devices that don't provide one.
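
For reference, the device-specific sysfs delete mentioned above looks
like this on SCSI/SATA disks (the device name is just an example):

# force-detach the whole disk from the running system
echo 1 > /sys/block/sdb/device/delete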

> But if you have already scrub the good device, mounted degradely without
> the bad one, and no corruption reported, then you are fine to go ahead
> with replace.
> 
> Thanks,
> Qu
> 
> > 
> > So, if /dev/sda1 is the drive that was always good, and /dev/sdb1 is the
> > drive that had taken a vacation....
> > 
> > And /dev/sdc1 is a new hot spare
> > 
> > btrfs replace start -r /dev/sdb1 /dev/sdc1
> > 
> > (On some kernel versions you might have to reboot for the replace
> > operation to finish.  But once /dev/sdb1 is completely removed, if you
> > wanted to use it again, you could
> > 
> > btrfs replace start /dev/sdc1 /dev/sdb1
> > 


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-06  4:06       ` Zygo Blaxell
@ 2022-03-06 11:45         ` Remi Gauvin
  2022-03-06 23:41           ` Zygo Blaxell
  0 siblings, 1 reply; 13+ messages in thread
From: Remi Gauvin @ 2022-03-06 11:45 UTC (permalink / raw)
  To: linux-btrfs

On 2022-03-05 11:06 p.m., Zygo Blaxell wrote:

> Ideally, 'btrfs replace' would be able to replace a device with itself,
> i.e. remove the restriction that the replacing device and the replaced
> device can't be the same device.  This is one of the more important
> management features that mdadm has which btrfs is still missing.
> 
> This form of replace should read from other disks if possible (like -r),
> otherwise don't write anything on the replaced disk since we'd just be
> reading the data that is already there and rewriting it in the same block.
> 

Ideally, this command would first read and compare the mirrored data,
so that only stale blocks are rewritten. That way it could be used with
solid-state media without consuming write cycles unnecessarily.




* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-06 11:45         ` Remi Gauvin
@ 2022-03-06 23:41           ` Zygo Blaxell
  0 siblings, 0 replies; 13+ messages in thread
From: Zygo Blaxell @ 2022-03-06 23:41 UTC (permalink / raw)
  To: Remi Gauvin; +Cc: linux-btrfs

On Sun, Mar 06, 2022 at 06:45:16AM -0500, Remi Gauvin wrote:
> On 2022-03-05 11:06 p.m., Zygo Blaxell wrote:
> 
> > Ideally, 'btrfs replace' would be able to replace a device with itself,
> > i.e. remove the restriction that the replacing device and the replaced
> > device can't be the same device.  This is one of the more important
> > management features that mdadm has which btrfs is still missing.
> > 
> > This form of replace should read from other disks if possible (like -r),
> > otherwise don't write anything on the replaced disk since we'd just be
> > reading the data that is already there and rewriting it in the same block.
> > 
> 
> Ideally, this command would first read and compare the mirrored data,
> so that only stale blocks are rewritten. That way it could be used with
> solid-state media without consuming write cycles unnecessarily.

Good point, but it would require interleaved read and write cycles,
which would be bad for spinning disks if there are large areas with
differences.

Maybe this can be solved by having an "SSD mode" and an "HDD mode",
which would minimize writes or write everything blindly, respectively.


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-04 23:33 Is this error fixable or do I need to rebuild the drive? Jan Kanis
  2022-03-04 23:39 ` Qu Wenruo
@ 2022-03-07  0:42 ` Damien Le Moal
  2022-03-09 13:46   ` Jan Kanis
  1 sibling, 1 reply; 13+ messages in thread
From: Damien Le Moal @ 2022-03-07  0:42 UTC (permalink / raw)
  To: Jan Kanis, linux-btrfs

On 3/5/22 08:33, Jan Kanis wrote:
> Hi,
> 
> I have a btrfs filesystem with two disks in raid 1. Each btrfs device
> sits on top of a LUKS encrypted volume, which consists of a raw drive
> partition on a SMR hard disk, though I don't think that's relevant.

Hu... SMR disks do not support partitions... And last time I checked,
cryptsetup did not support LUKS formatting of SMR drives (dm-crypt does
support SMR, that is not the issue). Care to better explain your setup?

> 
> One of the drives failed, the sata link appears to have died, if I'm
> interpreting the system logs right. As it's a raid 1 the system kept
> running and I didn't notice the dead drive until some time later,
> during which I kept using the filesystem.
> Something wasn't behaving right, so I decided to reboot. After the
> reboot the btrfs filesystem didn't come up and one of the drives was
> dead. I was able to mount from the remaining device with
> degraded/read-only, all data seemed to be there.
> I took out the dead drive and put it into another system for
> examination. After some fiddling the drive came up again, so it wasn't
> permanently dead after all. I was able to mount it degraded/read-only.
> It looked good except for missing the latest changes I made to some
> files I was working with, so it was a bit out of date. A btrfs scrub
> showed no corruptions.
> I put the drive back in the original system, thinking that btrfs would
> either refuse to mount it or fix it from the other copy. The
> filesystem automatically mounted rw without a 'degraded' option, and
> the filesystem could be used again. The logs showed some "parent
> transid verify failed" errors, which I assumed would be corrected from
> the other copy. Attempting to mount only the drive that had failed
> with degraded/read-only now no longer worked.
> 
> It's now some days later, the filesystem is still working, but I'm
> also still getting "parent transid verify failed" errors in the logs,
> and "read error corrected". So by now I'm thinking that btrfs
> apparently does not fix this error by itself. What's happening here,
> and why isn't btrfs fixing it, it has two copies of everything?
> What's the best way to fix it manually? Rebalance the data? scrub it?
> delete, wipe and re-add the device that failed so the mirror can be
> rebuilt?
> 
> best,
> Jan


-- 
Damien Le Moal
Western Digital Research


* Re: Is this error fixable or do I need to rebuild the drive?
  2022-03-07  0:42 ` Damien Le Moal
@ 2022-03-09 13:46   ` Jan Kanis
  0 siblings, 0 replies; 13+ messages in thread
From: Jan Kanis @ 2022-03-09 13:46 UTC (permalink / raw)
  To: Damien Le Moal, Qu Wenruo, Remi Gauvin; +Cc: linux-btrfs

On Sat, 5 Mar 2022 at 11:42, Jan Kanis <jan.code@jankanis.nl> wrote:
>
> Correction on the statistics: 40 GB is 10 million blocks. Chance of no
> spurious checksum matches is then (1-2**-32)**1e7 = 0.99767. The risk
> starts to become significant when writing a few terabytes.

The scrub looks like it worked. I had some 10 million errors, all
correctable, so it looks like my assumptions for the calculation were
correct. Of course I don't know if there were any spurious checksum
matches.


On Mon, 7 Mar 2022 at 01:42, Damien Le Moal
<damien.lemoal@opensource.wdc.com> wrote:
> Hu... SMR disks do not support partitions... And last time I checked,
> cryptsetup did not support LUKS formatting of SMR drives (dm-crypt does
> support SMR, that is not the issue). Care to better explain your setup?

Sure. By SMR drive I meant a regular hard disk (in fact a pair of
Seagate Barracudas) that uses SMR internally but presents a standard
SATA interface.

