* Reporting and monitoring storage events (blog)
From: Chris Murphy @ 2017-04-19 17:39 UTC (permalink / raw)
To: Btrfs BTRFS
http://www-rhstorage.rhcloud.com/blog/vpodzime/reporting-and-monitoring-storage-events
I think the most useful part of this would be standardized messaging.
For the exact same defect state on disk (data corruption), I get two
different formatted messages depending on whether it's found passively
by reading the file, or with a scrub.
(this is 2x disk raid 1)
read file:
[256914.773712] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.774594] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.775892] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)
scrub volume:
[257313.636610] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode
257, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
[257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1
errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[257313.637737] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/VG-b1
Reading means there's a warning, scrubbing means there's an error? So
even the log level is different for the same problem?
And then there's the ambiguous "read error corrected" vs "fixed up error".
The second one is clearer that the fix is pushed to a device ("fixed
error on device") rather than being just an in-memory correction. But
still, they're different messages for the same problem and the same
auto-healing.
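
To illustrate what standardized messaging would buy, here is a minimal sketch that maps both message forms onto one event record. The regular expressions are written only against the two samples above (with the mail's line wraps removed) and may not match other kernel versions; `normalize` and the field names in the returned dict are hypothetical, not an existing interface.

```python
import re

# Patterns derived from the two sample messages in this mail only;
# the exact wording is an assumption and may vary between kernels.
READ_FIX = re.compile(
    r"BTRFS info \(device (?P<fs>[^)]+)\): read error corrected: "
    r"ino (?P<ino>\d+) off (?P<off>\d+) "
    r"\(dev (?P<dev>\S+) sector (?P<sector>\d+)\)"
)
SCRUB_ERR = re.compile(
    r"BTRFS warning \(device (?P<fs>[^)]+)\): checksum error at logical "
    r"(?P<logical>\d+) on dev (?P<dev>[^,]+), sector (?P<sector>\d+), "
    r"root (?P<root>\d+), inode (?P<ino>\d+), offset (?P<off>\d+)"
)


def normalize(line):
    """Return a common event dict for either message form, or None."""
    for found_by, pattern in (("read", READ_FIX), ("scrub", SCRUB_ERR)):
        m = pattern.search(line)
        if m:
            return {
                "kind": "csum_error",      # hypothetical event name
                "found_by": found_by,
                "fs": m["fs"],
                "dev": m["dev"],
                "sector": int(m["sector"]),
                "inode": int(m["ino"]),
                "offset": int(m["off"]),
            }
    return None


read_line = ("[256914.775892] BTRFS info (device dm-6): read error corrected: "
             "ino 257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)")
scrub_line = ("[257313.636610] BTRFS warning (device dm-6): checksum error at "
              "logical 1103626240 on dev /dev/mapper/VG-b1, sector 2155520, "
              "root 5, inode 257, offset 0, length 4096, links 1 "
              "(path: openSUSE-Tumbleweed-NET-x86_64-Current.iso)")

read_event = normalize(read_line)
scrub_event = normalize(scrub_line)
```

Apart from `found_by`, the two records come out identical, which is the point: the same on-disk defect currently needs two separate parsers.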
--
Chris Murphy
* Re: Reporting and monitoring storage events (blog)
From: Austin S. Hemmelgarn @ 2017-04-20 12:27 UTC (permalink / raw)
To: Chris Murphy, Btrfs BTRFS
On 2017-04-19 13:39, Chris Murphy wrote:
> http://www-rhstorage.rhcloud.com/blog/vpodzime/reporting-and-monitoring-storage-events
>
> I think the most useful part of this would be standardized messaging.
> For the exact same defect state on disk (data corruption), I get two
> different formatted messages depending on whether it's found passively
> by reading the file, or with a scrub.
In addition to that, adding an event channel back to userspace like
dmeventd and mdadm use for their monitoring would be extremely useful.
Logging is useful for postmortem analysis, but monitoring logs to get
event notifications is error-prone, potentially racy, and introduces
unnecessary delays in handling.
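
As a rough illustration of why log scraping is fragile, here is a sketch that parses records in the `/dev/kmsg` format ("prio,seq,timestamp_usec,flags;message") and reports sequence-number gaps, which is how a reader that fell behind can tell it missed messages. The sample records are made up for the example: the sequence numbers are hypothetical and the gap is simulated. `parse_kmsg_record` and `find_gaps` are illustrative names, not part of any existing tool.

```python
def parse_kmsg_record(record):
    """Split one /dev/kmsg record into (level, seq, message).

    Header layout: "<prio>,<seq>,<timestamp_usec>,<flags>[,...]"
    where prio = facility * 8 + level.
    """
    header, _, message = record.partition(";")
    fields = header.split(",")
    prio, seq = int(fields[0]), int(fields[1])
    # Continuation lines (key=value pairs) follow the first newline.
    return prio & 7, seq, message.split("\n", 1)[0]


def find_gaps(records):
    """Yield (expected_seq, got_seq) wherever records were missed."""
    expected = None
    for record in records:
        _, seq, _ = parse_kmsg_record(record)
        if expected is not None and seq != expected:
            yield expected, seq
        expected = seq + 1


# Hypothetical records: sequence numbers are invented and record 1003
# is deliberately missing to simulate a reader that fell behind.
sample = [
    "4,1001,256914773712,-;BTRFS warning (device dm-6): csum failed "
    "ino 257 off 0 csum 3734069121 expected csum 1334657141",
    "4,1002,256914774594,-;BTRFS warning (device dm-6): csum failed "
    "ino 257 off 0 csum 3734069121 expected csum 1334657141",
    "6,1004,256914775892,-;BTRFS info (device dm-6): read error corrected: "
    "ino 257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)",
]

gaps = list(find_gaps(sample))
```

A monitor built this way can at best notice that it lost something; it still cannot recover what the lost message said, which is where a dedicated event channel would help.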
>
> (this is 2x disk raid 1)
>
> read file:
> [256914.773712] BTRFS warning (device dm-6): csum failed ino 257 off 0
> csum 3734069121 expected csum 1334657141
> [256914.774594] BTRFS warning (device dm-6): csum failed ino 257 off 0
> csum 3734069121 expected csum 1334657141
> [256914.775892] BTRFS info (device dm-6): read error corrected: ino
> 257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)
>
> scrub volume:
>
>
> [257313.636610] BTRFS warning (device dm-6): checksum error at logical
> 1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode
> 257, offset 0, length 4096, links 1 (path:
> openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> [257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1
> errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [257313.637737] BTRFS error (device dm-6): fixed up error at logical
> 1103626240 on dev /dev/mapper/VG-b1
>
>
> Reading means there's a warning, scrubbing means there's an error? So
> even the log level is different for the same problem?
What's more confusing is that:
* A checksum failure on read is a warning, but the correction of that
error is an info message. These should be at the same log level so that
either both show up or neither does; seeing only the checksum failure or
only the correction is potentially confusing.
* The scrub message that provides most of the useful info (the checksum
error) is a warning, but the messages about correcting it and
incrementing the error counters are errors.
So not only are things inconsistent across the type of correction,
they're also internally inconsistent.
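
To make the effect concrete, here is a small sketch that applies a log-level threshold to the five messages quoted above. The level numbers are the standard kernel ones (lower is more severe); the `MESSAGES` table and the `visible` helper are just illustration, with the message text shortened.

```python
# Kernel log levels: a lower number means more severe.
# "error" is how BTRFS labels the level the kernel calls KERN_ERR.
LEVELS = {"emerg": 0, "alert": 1, "crit": 2, "error": 3,
          "warning": 4, "notice": 5, "info": 6, "debug": 7}

# (source, level, shortened text) for the messages quoted in this thread.
MESSAGES = [
    ("read", "warning", "csum failed ino 257 off 0"),
    ("read", "info", "read error corrected: ino 257 off 0"),
    ("scrub", "warning", "checksum error at logical 1103626240"),
    ("scrub", "error", "errs: wr 0, rd 0, flush 0, corrupt 1, gen 0"),
    ("scrub", "error", "fixed up error at logical 1103626240"),
]


def visible(source, threshold):
    """Messages from `source` that pass a filter set at `threshold`."""
    limit = LEVELS[threshold]
    return [text for src, level, text in MESSAGES
            if src == source and LEVELS[level] <= limit]


read_at_warning = visible("read", "warning")
scrub_at_error = visible("scrub", "error")
```

With the filter at "warning", the read path shows the failure but hides the correction. With it at "error", the scrub path shows the counters and the fix but hides the one message that says where the problem was.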
>
> And then the ambiguous "read error corrected" vs "fixed up error" -
> the second one is more clear that the fix is pushed to a device "fixed
> error on device" rather than just an in memory correction. But still,
> they're different messages for the same problem and the auto healing.
Of the two, I personally prefer the scrub messages by a pretty
significant margin. They give you info about the inode, the location of
the error, the path in the FS, and even the location on-disk itself
while additionally logging the values of the cumulative error counters
and telling you that the error was corrected. If we were to update that
to include what triggered detecting the error, that would cover pretty
much everything needed for a reasonable e-mail notification.
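
As a sketch of what such a notification could look like, the following pulls the fields out of the three scrub messages quoted above and builds a message with Python's standard `email.message.EmailMessage`. The regular expressions are based only on those samples, the `trigger` argument stands in for the "what triggered detection" field that does not exist in the log today, and `build_notification` is a hypothetical helper.

```python
import re
from email.message import EmailMessage

# Patterns based only on the scrub samples quoted in this thread.
DETAIL = re.compile(
    r"checksum error at logical (?P<logical>\d+) on dev (?P<dev>[^,]+), "
    r"sector (?P<sector>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>.+)\)$"
)
COUNTERS = re.compile(
    r"errs: wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def build_notification(lines, trigger):
    """Build an EmailMessage from scrub log lines, or None if no detail line.

    `trigger` is supplied by the caller because the kernel messages
    do not currently say what detected the error.
    """
    detail = counters = None
    fixed = False
    for line in lines:
        detail = DETAIL.search(line) or detail
        counters = COUNTERS.search(line) or counters
        fixed = fixed or "fixed up error" in line
    if detail is None:
        return None

    state = "corrected" if fixed else "NOT corrected"
    body = [
        f"Detected by: {trigger}",
        f"Device: {detail['dev']} (sector {detail['sector']})",
        f"Logical address: {detail['logical']}",
        f"Root {detail['root']}, inode {detail['inode']}, "
        f"offset {detail['offset']}, length {detail['length']}",
        f"Path: {detail['path']}",
    ]
    if counters:
        body.append("Cumulative counters: " + ", ".join(
            f"{name}={value}" for name, value in counters.groupdict().items()))
    body.append(f"Status: {state}")

    msg = EmailMessage()
    msg["Subject"] = f"btrfs: checksum error on {detail['dev']} ({state})"
    msg.set_content("\n".join(body))
    return msg


scrub_lines = [
    "[257313.636610] BTRFS warning (device dm-6): checksum error at logical "
    "1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode 257, "
    "offset 0, length 4096, links 1 "
    "(path: openSUSE-Tumbleweed-NET-x86_64-Current.iso)",
    "[257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1 "
    "errs: wr 0, rd 0, flush 0, corrupt 1, gen 0",
    "[257313.637737] BTRFS error (device dm-6): fixed up error at logical "
    "1103626240 on dev /dev/mapper/VG-b1",
]

notification = build_notification(scrub_lines, trigger="scrub")
```

The sketch only builds the message; addressing and sending are left out, and having to stitch three separate log lines together is itself an argument for a single structured event.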