* Reading files with bad data checksum
@ 2021-01-10 11:52 David Woodhouse
  2021-01-10 12:08 ` Forza
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: David Woodhouse @ 2021-01-10 11:52 UTC (permalink / raw)
  To: linux-btrfs


I migrated a system that was hosting virtual machines with qemu to
btrfs.

Using it without disabling copy-on-write was a mistake, of course, and
it became horribly fragmented and slow.

So I tried copying it to a new file... but it has actual *errors* too,
which I think are because it was using the 'directsync' caching mode
for block I/O in qemu.

https://bugzilla.redhat.com/show_bug.cgi?id=1204569#c12

I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433

What I see is that *both* disks of the RAID-1 have data which is
consistent, and does not match the checksum that btrfs expects:

[ 6827.513630] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1
[ 6827.517448] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8286, gen 0
[ 6827.527281] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 2
[ 6827.530817] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9115, gen 0

It looks like an O_DIRECT bug where the data *do* get updated without
updating the checksum. Which is kind of the worst of both worlds, I
suppose, since I also did get the appalling performance of COW and
fragmentation.

In the short term, all I want to do is make a copy of the file, using
the data which are on the disk, regardless of the fact that btrfs
thinks the checksum doesn't match. Is there a way I can turn off
*checking* of the checksum for that specific file (or file descriptor)?

Or is the only way to do it with something like FIBMAP, reading the
offending blocks directly from the underlying disk and then writing
them into the appropriate offset in (a copy of) the file? A plan which
is slightly complicated by the fact that of course btrfs doesn't
support FIBMAP.
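
If that's the route, I imagine something like the following, as an
untested sketch (paths, devices and offsets are made up; FIEMAP does
work on btrfs even though FIBMAP doesn't):

# map file offsets to addresses in the btrfs logical address space
filefrag -v disk.img
# Note: the "physical_offset" filefrag prints on btrfs is a btrfs
# *logical* address, not a raw device offset; btrfs-map-logical (from
# btrfs-progs) can resolve it to the mirror devices and dump the bytes:
btrfs-map-logical -l "$LOGICAL" -b 4096 -o block.bin /dev/sda3
# then splice the rescued block back into a copy of the image at the
# corresponding file offset (counted in 4KiB blocks):
dd if=block.bin of=disk-copy.img bs=4096 seek="$FILE_BLOCK" conv=notrunc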

What's the best way to recover the data?



* Re: Reading files with bad data checksum
  2021-01-10 11:52 Reading files with bad data checksum David Woodhouse
@ 2021-01-10 12:08 ` Forza
  2021-01-10 12:36   ` David Woodhouse
  2021-01-10 22:45 ` Chris Murphy
  2021-01-11  8:23 ` Nikolay Borisov
  2 siblings, 1 reply; 6+ messages in thread
From: Forza @ 2021-01-10 12:08 UTC (permalink / raw)
  To: David Woodhouse, linux-btrfs



On 2021-01-10 12:52, David Woodhouse wrote:
> I migrated a system that was hosting virtual machines with qemu to
> btrfs.
> 
> Using it without disabling copy-on-write was a mistake, of course, and
> it became horribly fragmented and slow.
> 
> So I tried copying it to a new file... but it has actual *errors* too,
> which I think are because it was using the 'directsync' caching mode
> for block I/O in qemu.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1204569#c12
> 
> I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433
> 
> What I see is that *both* disks of the RAID-1 have data which is
> consistent, and does not match the checksum that btrfs expects:
> 
> [ 6827.513630] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1
> [ 6827.517448] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8286, gen 0
> [ 6827.527281] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 2
> [ 6827.530817] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9115, gen 0
> 
> It looks like an O_DIRECT bug where the data *do* get updated without
> updating the checksum. Which is kind of the worst of both worlds, I
> suppose, since I also did get the appalling performance of COW and
> fragmentation.

With O_DIRECT, Btrfs shouldn't do checksums or compression. This is
one of the open issues with Direct I/O at the moment, but it should
not cause those dmesg errors. I believe it should show up in scrub as
no_csum:

# btrfs scrub status -R /mnt/6TB/
UUID:             fe0a1142-51ab-4181-b635-adbf9f4ea6e6
Scrub started:    Sun Nov 22 13:11:20 2020
Status:           finished
Duration:         9:37:39
         data_extents_scrubbed: 164773032
         tree_extents_scrubbed: 1113696
         data_bytes_scrubbed: 10570715316224
         tree_bytes_scrubbed: 18246795264
         read_errors: 0
         csum_errors: 0
         verify_errors: 0
         no_csum: 3120
         csum_discards: 0
         super_errors: 0
         malloc_errors: 0
         uncorrectable_errors: 0
         unverified_errors: 0
         corrected_errors: 0
         last_physical: 5823976701952


> 
> In the short term, all I want to do is make a copy of the file, using
> the data which are on the disk, regardless of the fact that btrfs
> thinks the checksum doesn't match. Is there a way I can turn off
> *checking* of the checksum for that specific file (or file descriptor)?
> 
> Or is the only way to do it with something like FIBMAP, reading the
> offending blocks directly from the underlying disk and then writing
> them into the appropriate offset in (a copy of) the file? A plan which
> is slightly complicated by the fact that of course btrfs doesn't
> support FIBMAP.
> 
> What's the best way to recover the data?
> 

You can use GNU ddrescue to copy files. It can skip the offending blocks 
and replace the bad data with zeroes. Not sure how well qemu will handle 
that though.
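
For example (paths made up; with a fresh output file, the areas
ddrescue can't read will read back as zeroes):

ddrescue -b 4096 vm.img vm-rescued.img vm.map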


I did some tests with qemu to try to avoid O_DIRECT. This worked, and
it also re-enabled compression and csums. The trick is to emulate an
NVMe device with a block size larger than 4KiB; I've tried 8192 and
16384. Although it works, it may also be really slow; I have not done
any benchmarks yet.


# qemu-system-x86_64 -D qemu.log \
    -name fedora \
    -enable-kvm -machine q35 -device intel-iommu \
    -smp cores=4 -m 3072 \
    -drive format=raw,file=disk2.img,cache=writeback,aio=io_uring,if=none,id=drv0 \
    -device nvme,drive=drv0,serial=1234,physical_block_size=8192,logical_block_size=8192,write-cache=on \
    -display vnc=192.168.0.1:0,to=100 \
    -net nic,model=virtio,macaddr=00:00:00:00:00:01 -net tap,ifname=qemu0

Just a warning about cache=writeback: I have not checked how safe it
is with regard to crashes and power loss.



* Re: Reading files with bad data checksum
  2021-01-10 12:08 ` Forza
@ 2021-01-10 12:36   ` David Woodhouse
  2021-01-11 18:56     ` Goffredo Baroncelli
  0 siblings, 1 reply; 6+ messages in thread
From: David Woodhouse @ 2021-01-10 12:36 UTC (permalink / raw)
  To: Forza, linux-btrfs


On Sun, 2021-01-10 at 13:08 +0100, Forza wrote:
> 
> On 2021-01-10 12:52, David Woodhouse wrote:
> > I migrated a system that was hosting virtual machines with qemu to
> > btrfs.
> > 
> > Using it without disabling copy-on-write was a mistake, of course, and
> > it became horribly fragmented and slow.
> > 
> > So I tried copying it to a new file... but it has actual *errors* too,
> > which I think are because it was using the 'directsync' caching mode
> > for block I/O in qemu.
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=1204569#c12
> > 
> > I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433
> > 
> > What I see is that *both* disks of the RAID-1 have data which is
> > consistent, and does not match the checksum that btrfs expects:
> > 
> > [ 6827.513630] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1
> > [ 6827.517448] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8286, gen 0
> > [ 6827.527281] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 2
> > [ 6827.530817] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9115, gen 0
> > 
> > It looks like an O_DIRECT bug where the data *do* get updated without
> > updating the checksum. Which is kind of the worst of both worlds, I
> > suppose, since I also did get the appalling performance of COW and
> > fragmentation.
> 
> With O_DIRECT, Btrfs shouldn't do checksums or compression. This is
> one of the open issues with Direct I/O at the moment, but it should
> not cause those dmesg errors. I believe it should show up in scrub as
> no_csum:

It showed up as errors. There appears to be a btrfs bug there, but
since I suspect it'll be easy to reproduce, I'm more focused on
recovery right now.

> > In the short term, all I want to do is make a copy of the file, using
> > the data which are on the disk, regardless of the fact that btrfs
> > thinks the checksum doesn't match. Is there a way I can turn off
> > *checking* of the checksum for that specific file (or file descriptor)?
> > 
> > Or is the only way to do it with something like FIBMAP, reading the
> > offending blocks directly from the underlying disk and then writing
> > them into the appropriate offset in (a copy of) the file? A plan which
> > is slightly complicated by the fact that of course btrfs doesn't
> > support FIBMAP.
> > 
> > What's the best way to recover the data?
> > 
> 
> You can use GNU ddrescue to copy files. It can skip the offending blocks 
> and replace the bad data with zeroes. Not sure how well qemu will handle 
> that though.

Right. I've already copied the image with dd conv=sync,noerror to a new
one with the +C flag. It passes 'qemu-img check', and in fact the guest
is running just fine with it. I was expecting it to stop with
catastrophic file system errors but I can't see anything wrong at all.
I'm just paranoid that eventually I'll find out that the blocks belong
to some file(s) I actually want, and I'd like to recover them.
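
For reference, roughly what I ran (paths made up; the +C flag has to
be set on the empty destination file before any data is written for
nodatacow to take effect):

touch vm-copy.img
chattr +C vm-copy.img
dd if=vm.img of=vm-copy.img bs=4096 conv=sync,noerror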

Right now I have a horribly fragmented image file with these 'errors'
cluttering up my file system and making backups of the host go
extremely slow. I'd like to get those blocks back so I can make a clean
copy of the image, and keep it around for reference in case I later
*do* discover that I need the contents of those blocks.




* Re: Reading files with bad data checksum
  2021-01-10 11:52 Reading files with bad data checksum David Woodhouse
  2021-01-10 12:08 ` Forza
@ 2021-01-10 22:45 ` Chris Murphy
  2021-01-11  8:23 ` Nikolay Borisov
  2 siblings, 0 replies; 6+ messages in thread
From: Chris Murphy @ 2021-01-10 22:45 UTC (permalink / raw)
  To: David Woodhouse; +Cc: Btrfs BTRFS

On Sun, Jan 10, 2021 at 4:54 AM David Woodhouse <dwmw2@infradead.org> wrote:
>
> I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433
>
> What I see is that *both* disks of the RAID-1 have data which is
> consistent, and does not match the checksum that btrfs expects:

Yeah, either use nodatacow (chattr +C) or don't use O_DIRECT until
there's a proper fix.
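
E.g. set it on the images directory so newly created files inherit it
(path is an example):

chattr +C /var/lib/libvirt/images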

> What's the best way to recover the data?

I'd say kernel 5.11's new "mount -o ro,rescue=ignoredatacsums"
feature. You can copy the file out normally, no special tools.
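
Something like this (device and paths are examples):

mount -o ro,rescue=ignoredatacsums /dev/sda3 /mnt/rescue
cp /mnt/rescue/images/vm.img /backup/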

The alternative is 'btrfs restore'.


-- 
Chris Murphy


* Re: Reading files with bad data checksum
  2021-01-10 11:52 Reading files with bad data checksum David Woodhouse
  2021-01-10 12:08 ` Forza
  2021-01-10 22:45 ` Chris Murphy
@ 2021-01-11  8:23 ` Nikolay Borisov
  2 siblings, 0 replies; 6+ messages in thread
From: Nikolay Borisov @ 2021-01-11  8:23 UTC (permalink / raw)
  To: David Woodhouse, linux-btrfs



On 10.01.21 13:52, David Woodhouse wrote:
> I migrated a system that was hosting virtual machines with qemu to
> btrfs.
> 
> Using it without disabling copy-on-write was a mistake, of course, and
> it became horribly fragmented and slow.
> 
> So I tried copying it to a new file... but it has actual *errors* too,
> which I think are because it was using the 'directsync' caching mode
> for block I/O in qemu.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1204569#c12
> 
> I filed https://bugzilla.redhat.com/show_bug.cgi?id=1914433
> 
> What I see is that *both* disks of the RAID-1 have data which is
> consistent, and does not match the checksum that btrfs expects:
> 
> [ 6827.513630] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 1
> [ 6827.517448] BTRFS error (device sda3): bdev /dev/sdb3 errs: wr 0, rd 0, flush 0, corrupt 8286, gen 0
> [ 6827.527281] BTRFS warning (device sda3): csum failed root 5 ino 24387997 off 2935152640 csum 0x81529887 expected csum 0xb0093af0 mirror 2
> [ 6827.530817] BTRFS error (device sda3): bdev /dev/sda3 errs: wr 0, rd 0, flush 0, corrupt 9115, gen 0
> 
> It looks like an O_DIRECT bug where the data *do* get updated without
> updating the checksum. Which is kind of the worst of both worlds, I
> suppose, since I also did get the appalling performance of COW and
> fragmentation.
> 
> In the short term, all I want to do is make a copy of the file, using
> the data which are on the disk, regardless of the fact that btrfs
> thinks the checksum doesn't match. Is there a way I can turn off
> *checking* of the checksum for that specific file (or file descriptor)?
> 
> Or is the only way to do it with something like FIBMAP, reading the
> offending blocks directly from the underlying disk and then writing
> them into the appropriate offset in (a copy of) the file? A plan which
> is slightly complicated by the fact that of course btrfs doesn't
> support FIBMAP.
> 
> What's the best way to recover the data?

I think you've hit this peculiarity of btrfs:

https://linux-btrfs.vger.kernel.narkive.com/mR7V3G37/qemu-disk-images-on-btrfs-suffer-checksum-errors

> 


* Re: Reading files with bad data checksum
  2021-01-10 12:36   ` David Woodhouse
@ 2021-01-11 18:56     ` Goffredo Baroncelli
  0 siblings, 0 replies; 6+ messages in thread
From: Goffredo Baroncelli @ 2021-01-11 18:56 UTC (permalink / raw)
  To: David Woodhouse, Forza, linux-btrfs

On 1/10/21 1:36 PM, David Woodhouse wrote:
> On Sun, 2021-01-10 at 13:08 +0100, Forza wrote:
[...]
> 
> It showed up as errors. There appears to be a btrfs bug there, but since

Yes, it is an old btrfs bug, and qemu is not to blame.

https://lore.kernel.org/linux-btrfs/cf8a733f-2c9d-7ffe-e865-4c13d99dfb60@libero.it/

In my email there is code to reproduce it.

Basically, it is difficult to keep the checksum in sync with the data when
O_DIRECT is used. Even OpenZFS has problems with it. The OpenZFS solution is
to lie about O_DIRECT: it accepts the flag but doesn't honor it.

I think that BTRFS should do the same: when csums are enabled, O_DIRECT
shouldn't be honored (either return an error or behave like ZFS).


> I suspect it'll be easy to reproduce, I'm more focused on recovery right
> now.
> 
>>> In the short term, all I want to do is make a copy of the file, using
>>> the data which are on the disk, regardless of the fact that btrfs
>>> thinks the checksum doesn't match. Is there a way I can turn off
>>> *checking* of the checksum for that specific file (or file descriptor)?
>>>
>>> Or is the only way to do it with something like FIBMAP, reading the
>>> offending blocks directly from the underlying disk and then writing
>>> them into the appropriate offset in (a copy of) the file? A plan which
>>> is slightly complicated by the fact that of course btrfs doesn't
>>> support FIBMAP.
>>>
>>> What's the best way to recover the data?
>>>
>>
>> You can use GNU ddrescue to copy files. It can skip the offending blocks
>> and replace the bad data with zeroes. Not sure how well qemu will handle
>> that though.
> 
> Right. I've already copied the image with dd conv=sync,noerror to a new
> one with the +C flag. It passes 'qemu-img check', and in fact the guest
> is running just fine with it. I was expecting it to stop with
> catastrophic file system errors but I can't see anything wrong at all.
> I'm just paranoid that eventually I'll find out that the blocks belong
> to some file(s) I actually want, and I'd like to recover them.
> 
> Right now I have a horribly fragmented image file with these 'errors'
> cluttering up my file system and making backups of the host go
> extremely slow. I'd like to get those blocks back so I can make a clean
> copy of the image, and keep it around for reference in case I later
> *do* discover that I need the contents of those blocks.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

