* On the issue of direct I/O and csum warnings
From: Jonas Aaberg @ 2021-07-23 14:55 UTC
  To: linux-btrfs


Hi,

I use btrfs on dm-crypt. About two months ago, I started to get
warnings and errors like these on my laptop:

--
BTRFS warning (device dm-0): csum failed root 257 ino 1068852 off
25690112 csum 0xa27faf9a expected csum 0x4c266278 mirror 1
BTRFS error (device dm-0): bdev /dev/mapper/disk0 errs: wr 0, rd 0,
flush 0, corrupt 349, gen 0
--

I went and bought a new NVMe disk because I'm rather fond of my data,
even though most of it is backed up.

A week later, I started to get the same kind of warning/error message
on my new NVMe disk. After half a day of memtest86 found no memory
errors, I gave up on my otherwise stable laptop and started to use an
old laptop that I had been too lazy to sell, while looking out for a
decent pre-owned newer laptop.

Now I'm just about to install and move over to a newly bought laptop,
and today my old laptop started to show the same warnings/errors.
My old laptop does not share a single part with the laptop on which I
previously got the "checksum failure" warnings. Therefore I have a hard
time believing that I've hit the same hardware failure twice.

Then I found:
<https://btrfs.wiki.kernel.org/index.php/Gotchas> and "Direct I/O and
CRCs".

Which I believe is what I've run into. On both computers, one of the
affected files is a log file from syncthing.

Some people might have been quite pissed off at having bought a new
NVMe disk and another laptop in vain, but I'm relieved that I think
I've found the root cause.
I've used btrfs for about ten years, and together with the "btrfs"
tool I find it a very pleasant experience.

I have just one humble request: please do something about this
checksum error message. Just add a printk with a link to
<https://btrfs.wiki.kernel.org/index.php/Gotchas> and the issue of
"Direct I/O and CRCs".

Maybe updating the wiki with:
`find <mountpoint> -inum <ino-number-from-warning-message>`
would be helpful as well.
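
As a rough illustration of what that `find` invocation does, here is a
minimal Python sketch that walks a mount point and reports the paths
whose inode number matches. The function name `find_by_inode` is made
up for this example. It assumes that btrfs inode numbers are only
unique within one subvolume, so the walk stays on the starting `st_dev`
(similar to `find -xdev`) and should be started at the mount point of
the subvolume named by `root` in the warning.

```python
import os


def find_by_inode(mountpoint, inode):
    """Return all file paths below mountpoint whose inode number matches.

    Hypothetical helper, roughly `find <mountpoint> -xdev -inum <inode>`
    restricted to non-directory entries.
    """
    root_dev = os.lstat(mountpoint).st_dev
    matches = []
    for dirpath, dirnames, filenames in os.walk(mountpoint):
        # Prune directories on another device (other mounts or subvolumes),
        # because the same inode number can exist there for a different file.
        kept = []
        for d in dirnames:
            try:
                if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev:
                    kept.append(d)
            except OSError:
                pass  # directory vanished while walking
        dirnames[:] = kept

        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished while walking
            if st.st_ino == inode and st.st_dev == root_dev:
                matches.append(path)
    return matches
```

A file with several hard links shows up once per link, so the result
is a list rather than a single path.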

Thanks.

Best regards,
 Jonas Aaberg


* Re: On the issue of direct I/O and csum warnings
From: Martin Raiber @ 2021-07-23 18:45 UTC
  To: Jonas Aaberg, linux-btrfs

On 23.07.2021 16:55 Jonas Aaberg wrote:
> Hi,
>
> I use btrfs on dm-crypt. About two months ago, I started to get
> warnings and errors like these on my laptop:
>
> --
> BTRFS warning (device dm-0): csum failed root 257 ino 1068852 off
> 25690112 csum 0xa27faf9a expected csum 0x4c266278 mirror 1
> BTRFS error (device dm-0): bdev /dev/mapper/disk0 errs: wr 0, rd 0,
> flush 0, corrupt 349, gen 0
> --
>
> I went and bought a new NVMe disk because I'm rather fond of my data,
> even though most of it is backed up.
>
> A week later, I started to get the same kind of warning/error message
> on my new NVMe disk. After half a day of memtest86 found no memory
> errors, I gave up on my otherwise stable laptop and started to use an
> old laptop that I had been too lazy to sell, while looking out for a
> decent pre-owned newer laptop.
>
> Now I'm just about to install and move over to a newly bought laptop,
> and today my old laptop started to show the same warnings/errors.
> My old laptop does not share a single part with the laptop on which I
> previously got the "checksum failure" warnings. Therefore I have a hard
> time believing that I've hit the same hardware failure twice.
>
> Then I found:
> <https://btrfs.wiki.kernel.org/index.php/Gotchas> and "Direct I/O and
> CRCs".
>
> Which I believe is what I've run into. On both computers, one of the
> affected files is a log file from syncthing.

I wouldn't be certain that this is the direct I/O csum issue. Are you sure syncthing writes its logs via direct I/O? That would be bad, e.g. because direct I/O disables btrfs compression and log files compress really well. So I'd suggest reporting additional information such as the kernel version (and whether it is a vanilla kernel), how your btrfs is set up (e.g. metadata RAID1), etc.

> I have just one humble request: please do something about this
> checksum error message. Just add a printk with a link to
> <https://btrfs.wiki.kernel.org/index.php/Gotchas> and the issue of
> "Direct I/O and CRCs".
The problem is that nothing can be done without impacting performance, and direct I/O is used for performance. IMO it should be disabled by default (i.e. btrfs would just pretend to do direct I/O, like ZFSOnLinux) and be possible to enable via a mount option.
>
> Maybe updating the wiki with:
> `find <mountpoint> -inum <ino-number-from-warning-message>`
> would be helpful as well.

btrfs inspect-internal inode-resolve <ino-number-from-warning-message> <fs>

is faster.
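
To tie the warning text to that command, the following Python sketch
pulls `root`, `ino` and `off` out of a csum warning and builds the
matching `inode-resolve` argument list. It only assembles the command
and does not run it. The helper names and the regular expression are
made up for this example, the pattern is based solely on the one
message quoted in this thread, and `/home` in the usage note is merely
an example path; the path has to lie inside the subvolume that `root`
refers to.

```python
import re

# Pattern derived from the single warning quoted in this thread; other
# kernel versions may word the message differently (assumption).
CSUM_RE = re.compile(
    r"csum failed root (?P<root>-?\d+) ino (?P<ino>\d+) off (?P<off>\d+)"
)


def parse_csum_warning(message):
    """Extract root, ino and off as integers, or return None if no match."""
    # Mail clients wrap long lines, so collapse all whitespace first.
    flat = " ".join(message.split())
    match = CSUM_RE.search(flat)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


def inode_resolve_command(ino, path):
    """Build the argument list for `btrfs inspect-internal inode-resolve`."""
    return ["btrfs", "inspect-internal", "inode-resolve", str(ino), path]
```

For the warning above, `parse_csum_warning(...)["ino"]` is 1068852, and
`inode_resolve_command(1068852, "/home")` gives a list that can be
passed to `subprocess.run` (typically as root).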



* Re: On the issue of direct I/O and csum warnings
From: Jonas Aaberg @ 2021-07-24  6:30 UTC
  To: Martin Raiber; +Cc: linux-btrfs

On Fri, 23 Jul 2021 18:45:40 +0000
Martin Raiber <martin@urbackup.org> wrote:

> On 23.07.2021 16:55 Jonas Aaberg wrote:
> > Hi,
> >
> > I use btrfs on dm-crypt. About two months ago, I started to get
> > warnings and errors like these on my laptop:
> >
> > --
> > BTRFS warning (device dm-0): csum failed root 257 ino 1068852 off
> > 25690112 csum 0xa27faf9a expected csum 0x4c266278 mirror 1
> > BTRFS error (device dm-0): bdev /dev/mapper/disk0 errs: wr 0, rd 0,
> > flush 0, corrupt 349, gen 0
> > --
> >
> > I went and bought a new NVMe disk because I'm rather fond of my data,
> > even though most of it is backed up.
> >
> > A week later, I started to get the same kind of warning/error message
> > on my new NVMe disk. After half a day of memtest86 found no memory
> > errors, I gave up on my otherwise stable laptop and started to use an
> > old laptop that I had been too lazy to sell, while looking out for a
> > decent pre-owned newer laptop.
> >
> > Now I'm just about to install and move over to a newly bought laptop,
> > and today my old laptop started to show the same warnings/errors.
> > My old laptop does not share a single part with the laptop on which I
> > previously got the "checksum failure" warnings. Therefore I have a
> > hard time believing that I've hit the same hardware failure twice.
> >
> > Then I found:
> > <https://btrfs.wiki.kernel.org/index.php/Gotchas> and "Direct I/O
> > and CRCs".
> >
> > Which I believe is what I've run into. On both computers, one of the
> > affected files is a log file from syncthing.
>
> I wouldn't be certain that this is the direct I/O csum issue. Are you
> sure syncthing writes its logs via direct I/O? That would be bad, e.g.
> because direct I/O disables btrfs compression and log files compress
> really well. So I'd suggest reporting additional information such as
> the kernel version (and whether it is a vanilla kernel), how your
> btrfs is set up (e.g. metadata RAID1), etc.

No, I've not checked syncthing and its dependencies, but I'll do that.
Just to be sure we're talking about the same thing: "direct" means
the O_DIRECT flag passed to the open() syscall?

I use Arch Linux with their stock "linux-lts" kernel, which has been
on 5.10 since winter/spring. I'm sure that the last two checksum errors
occurred on 5.10.x, but I'm unsure about exactly which version. The
computer currently runs 5.10.52, but I only noticed the checksum error
after a system update and a restart, so it probably occurred on an
earlier kernel version in the 5.10 range.

Regarding mount options:

/dev/mapper/disk0 on / type btrfs
(rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=256,subvol=/__current/ROOT)
/dev/mapper/disk0 on /home type btrfs
(rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=257,subvol=/__current/home)
/dev/mapper/disk0 on /var/log type btrfs
(rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=258,subvol=/__current/log)

No RAID. Just btrfs on top of dm-crypt.

The file with the faulty checksum is in subvol=/__current/home:
/home//jonas/.config/syncthing/index-v0.14.0.db/007197.log

If I recall correctly, I corrected the checksum errors on the first
NVMe disk where they occurred. The second NVMe disk has been left as it
was when the errors occurred, and the error is still present on my SSD.
So I can maybe get some history if needed.

Is there any more information that you would like to have?

> 
> > I have just one humble request: please do something about this
> > checksum error message. Just add a printk with a link to
> > <https://btrfs.wiki.kernel.org/index.php/Gotchas> and the issue of
> > "Direct I/O and CRCs".
> The problem is that nothing can be done without impacting performance,
> and direct I/O is used for performance.
Understood. I was talking about making the print less alarming.

> IMO it should be disabled by
> default (i.e. btrfs would just pretend to do direct I/O, like
> ZFSOnLinux) and be possible to enable via a mount option.
Sounds like a good idea.

> >
> > Maybe updating the wiki with:
> > `find <mountpoint> -inum <ino-number-from-warning-message>`
> > would be helpful as well.
> 
> btrfs inspect-internal inode-resolve
> <ino-number-from-warning-message> <fs>
> 
> is faster.
Thanks!

BR,
 Jonas Aaberg




* Re: On the issue of direct I/O and csum warnings
From: Martin Raiber @ 2021-07-24  9:44 UTC
  To: Jonas Aaberg; +Cc: linux-btrfs

On 24.07.2021 08:30 Jonas Aaberg wrote:
> On Fri, 23 Jul 2021 18:45:40 +0000
> Martin Raiber <martin@urbackup.org> wrote:
>
>> On 23.07.2021 16:55 Jonas Aaberg wrote:
>>> Hi,
>>>
>>> I use btrfs on dm-crypt. About two months ago, I started to get
>>> warnings and errors like these on my laptop:
>>>
>>> --
>>> BTRFS warning (device dm-0): csum failed root 257 ino 1068852 off
>>> 25690112 csum 0xa27faf9a expected csum 0x4c266278 mirror 1
>>> BTRFS error (device dm-0): bdev /dev/mapper/disk0 errs: wr 0, rd 0,
>>> flush 0, corrupt 349, gen 0
>>> --
>>>
>>> I went and bought a new NVMe disk because I'm rather fond of my data,
>>> even though most of it is backed up.
>>>
>>> A week later, I started to get the same kind of warning/error message
>>> on my new NVMe disk. After half a day of memtest86 found no memory
>>> errors, I gave up on my otherwise stable laptop and started to use an
>>> old laptop that I had been too lazy to sell, while looking out for a
>>> decent pre-owned newer laptop.
>>>
>>> Now I'm just about to install and move over to a newly bought laptop,
>>> and today my old laptop started to show the same warnings/errors.
>>> My old laptop does not share a single part with the laptop on which I
>>> previously got the "checksum failure" warnings. Therefore I have a
>>> hard time believing that I've hit the same hardware failure twice.
>>>
>>> Then I found:
>>> <https://btrfs.wiki.kernel.org/index.php/Gotchas> and "Direct I/O
>>> and CRCs".
>>>
>>> Which I believe is what I've run into. On both computers, one of the
>>> affected files is a log file from syncthing.
>> I wouldn't be certain that this is the direct I/O csum issue. Are you
>> sure syncthing writes its logs via direct I/O? That would be bad, e.g.
>> because direct I/O disables btrfs compression and log files compress
>> really well. So I'd suggest reporting additional information such as
>> the kernel version (and whether it is a vanilla kernel), how your
>> btrfs is set up (e.g. metadata RAID1), etc.
> No, I've not checked syncthing and its dependencies, but I'll do that.
> Just to be sure we're talking about the same thing: "direct" means
> the O_DIRECT flag passed to the open() syscall?
Yes.
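
One possible way to check a running process for this, sketched in
Python: on Linux, /proc/<pid>/fdinfo/<fd> contains a "flags:" line with
the open flags in octal, which can be tested against O_DIRECT. The
function names are made up for this example, and the sketch only sees
descriptors that are open at the moment it runs, so a file that is
opened, written and closed quickly would be missed (tracing the
open()/openat() calls with strace would be the alternative).

```python
import os

# The numeric value of O_DIRECT depends on the architecture. Fall back
# to the x86 value if this Python build does not expose it (assumption).
O_DIRECT = getattr(os, "O_DIRECT", 0o40000)


def fdinfo_has_o_direct(fdinfo_text):
    """Return True if the fdinfo text has O_DIRECT set in its flags line."""
    for line in fdinfo_text.splitlines():
        if line.startswith("flags:"):
            flags = int(line.split()[1], 8)  # the kernel prints flags in octal
            return bool(flags & O_DIRECT)
    return False


def o_direct_fds(pid):
    """List (fd, target) pairs that the process currently has open with O_DIRECT."""
    base = f"/proc/{pid}/fdinfo"
    hits = []
    for fd in os.listdir(base):
        try:
            with open(os.path.join(base, fd)) as f:
                text = f.read()
            if fdinfo_has_o_direct(text):
                hits.append((int(fd), os.readlink(f"/proc/{pid}/fd/{fd}")))
        except OSError:
            continue  # descriptor was closed while we were looking
    return hits
```

Calling `o_direct_fds(<pid of syncthing>)` as the same user or as root
would return an empty list if none of the currently open files use
O_DIRECT.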
>
> I use Arch Linux with their stock "linux-lts" kernel, which has been
> on 5.10 since winter/spring. I'm sure that the last two checksum errors
> occurred on 5.10.x, but I'm unsure about exactly which version. The
> computer currently runs 5.10.52, but I only noticed the checksum error
> after a system update and a restart, so it probably occurred on an
> earlier kernel version in the 5.10 range.
>
> Regarding mount options:
>
> /dev/mapper/disk0 on / type btrfs
> (rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=256,subvol=/__current/ROOT)
> /dev/mapper/disk0 on /home type btrfs
> (rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=257,subvol=/__current/home)
> /dev/mapper/disk0 on /var/log type btrfs
> (rw,relatime,compress=zstd:3,ssd,space_cache,autodefrag,subvolid=258,subvol=/__current/log)
>
> No RAID. Just btrfs on top of dm-crypt.
>
> The file with the faulty checksum is in subvol=/__current/home:
> /home//jonas/.config/syncthing/index-v0.14.0.db/007197.log

That looks like a leveldb log file. I looked at rocksdb, which has options to use O_DIRECT, but syncthing uses https://github.com/syndtr/goleveldb and I can see no hint of O_DIRECT being used there...

>
> If I recall correctly, I corrected the checksum errors on the first
> NVMe disk where they occurred. The second NVMe disk has been left as it
> was when the errors occurred, and the error is still present on my SSD.
> So I can maybe get some history if needed.
>
> Is there any more information that you would like to have?
>
>>> I have just one humble request: please do something about this
>>> checksum error message. Just add a printk with a link to
>>> <https://btrfs.wiki.kernel.org/index.php/Gotchas> and the issue of
>>> "Direct I/O and CRCs".
>> The problem is that nothing can be done without impacting performance,
>> and direct I/O is used for performance.
> Understood. I was talking about making the print less alarming.
It also can't really distinguish (without impacting performance) the case where the buffer changed between write-out and checksumming from the case where the data changed on disk.
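
A toy simulation of that point, in Python: the checksum is computed
from the buffer, the application then modifies the buffer, and only
afterwards does the data reach the "disk". On a later read the stored
data no longer matches the stored checksum, which looks exactly like
on-disk corruption. This is only a model, not btrfs code: the function
names are made up, the ordering of the two steps is one possible
interleaving, and `zlib.crc32` stands in for the crc32c checksum that
btrfs uses by default.

```python
import zlib


def write_with_race(buf, mutate):
    """Model a direct I/O write whose buffer changes while in flight.

    buf is a bytearray owned by the application; mutate is a callback
    that plays the role of the application touching the buffer.
    """
    stored_csum = zlib.crc32(bytes(buf))  # checksum computed from the buffer
    mutate(buf)                           # application changes the buffer
    stored_data = bytes(buf)              # device reads the buffer afterwards
    return stored_data, stored_csum


def verify(stored_data, stored_csum):
    """Read-time check: does the data still match its stored checksum?"""
    return zlib.crc32(stored_data) == stored_csum


def flip_first_bit(data):
    """Model real corruption of the stored data (bit rot)."""
    return bytes([data[0] ^ 1]) + data[1:]
```

Both a racing write (a `mutate` callback that changes `buf`) and a
clean write followed by `flip_first_bit` make `verify` return False,
and nothing in the stored data or checksum says which of the two
happened.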
>
>> IMO it should be disabled by
>> default (i.e. btrfs would just pretend to do direct I/O, like
>> ZFSOnLinux) and be possible to enable via a mount option.
> Sounds like a good idea.
>
>>> Maybe updating the wiki with:
>>> `find <mountpoint> -inum <ino-number-from-warning-message>`
>>> would be helpful as well.
>> btrfs inspect-internal inode-resolve
>> <ino-number-from-warning-message> <fs>
>>
>> is faster.
> Thanks!
>
> BR,
>  Jonas Aaberg
>
>

