* Reporting and monitoring storage events (blog)
@ 2017-04-19 17:39 Chris Murphy
  2017-04-20 12:27 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 2+ messages in thread
From: Chris Murphy @ 2017-04-19 17:39 UTC (permalink / raw)
  To: Btrfs BTRFS

http://www-rhstorage.rhcloud.com/blog/vpodzime/reporting-and-monitoring-storage-events

I think the most useful part of this would be standardized messaging.
For the exact same defect state on disk (data corruption), I get two
different formatted messages depending on whether it's found passively
by reading the file, or with a scrub.

(this is 2x disk raid 1)

read file:
[256914.773712] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.774594] BTRFS warning (device dm-6): csum failed ino 257 off 0
csum 3734069121 expected csum 1334657141
[256914.775892] BTRFS info (device dm-6): read error corrected: ino
257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)

scrub volume:


[257313.636610] BTRFS warning (device dm-6): checksum error at logical
1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode
257, offset 0, length 4096, links 1 (path:
openSUSE-Tumbleweed-NET-x86_64-Current.iso)
[257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1
errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[257313.637737] BTRFS error (device dm-6): fixed up error at logical
1103626240 on dev /dev/mapper/VG-b1


Reading means there's a warning, scrubbing means there's an error? So
even the log level is different for the same problem?

And then the ambiguous "read error corrected" vs "fixed up error" -
the second one is more clear that the fix is pushed to a device "fixed
error on device" rather than just an in memory correction. But still,
they're different messages for the same problem and the auto healing.
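
As a rough illustration of what standardized messaging could look like from the consumer's side, the sketch below maps the read-path and scrub-path lines quoted above onto one common record. The `normalize` function, the `PATTERNS` table and the field names are hypothetical, the regular expressions are written only against the samples in this mail, and the log lines are assumed to have been unwrapped onto a single line each.

```python
import re

# Common prefix: "[timestamp] BTRFS <level> (device <fs>): <body>"
PREFIX = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\] BTRFS (?P<level>\w+) "
    r"\(device (?P<fs>[^)]+)\): (?P<body>.*)$"
)

# (kind, source, pattern) for each message body seen in this thread.
PATTERNS = [
    ("detected", "read", re.compile(
        r"^csum failed ino (?P<inode>\d+) off (?P<offset>\d+) ")),
    ("corrected", "read", re.compile(
        r"^read error corrected: ino (?P<inode>\d+) off (?P<offset>\d+) "
        r"\(dev (?P<dev>\S+) sector (?P<sector>\d+)\)")),
    ("detected", "scrub", re.compile(
        r"^checksum error at logical \d+ on dev (?P<dev>[^,]+), "
        r"sector (?P<sector>\d+), root \d+, inode (?P<inode>\d+), "
        r"offset (?P<offset>\d+),")),
    ("corrected", "scrub", re.compile(
        r"^fixed up error at logical \d+ on dev (?P<dev>\S+)")),
]


def normalize(line):
    """Return a common event dict for a known BTRFS line, else None."""
    m = PREFIX.match(line)
    if m is None:
        return None
    for kind, source, pattern in PATTERNS:
        hit = pattern.match(m["body"])
        if hit:
            event = {"kind": kind, "source": source,
                     "level": m["level"], "fs": m["fs"]}
            # Numeric fields become ints; device paths stay strings.
            for key, value in hit.groupdict().items():
                event[key] = int(value) if value.isdigit() else value
            return event
    return None


read_fix = normalize(
    "[256914.775892] BTRFS info (device dm-6): read error corrected: "
    "ino 257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)")
scrub_fix = normalize(
    "[257313.637737] BTRFS error (device dm-6): fixed up error at "
    "logical 1103626240 on dev /dev/mapper/VG-b1")
```

Both results come back with kind "corrected" and the same device, yet one carries level "info" and the other "error", and only the read-path record includes the inode and sector. A consumer has to paper over that today.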


-- 
Chris Murphy


* Re: Reporting and monitoring storage events (blog)
  2017-04-19 17:39 Reporting and monitoring storage events (blog) Chris Murphy
@ 2017-04-20 12:27 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 2+ messages in thread
From: Austin S. Hemmelgarn @ 2017-04-20 12:27 UTC (permalink / raw)
  To: Chris Murphy, Btrfs BTRFS

On 2017-04-19 13:39, Chris Murphy wrote:
> http://www-rhstorage.rhcloud.com/blog/vpodzime/reporting-and-monitoring-storage-events
>
> I think the most useful part of this would be standardized messaging.
> For the exact same defect state on disk (data corruption), I get two
> different formatted messages depending on whether it's found passively
> by reading the file, or with a scrub.
In addition to that, adding an event channel back to userspace like 
dmeventd and mdadm use for their monitoring would be extremely useful. 
Logging is useful for postmortem analysis, but monitoring logs to get 
event notifications is error-prone, potentially racy, and introduces 
unnecessary delays in handling.

>
> (this is 2x disk raid 1)
>
> read file:
> [256914.773712] BTRFS warning (device dm-6): csum failed ino 257 off 0
> csum 3734069121 expected csum 1334657141
> [256914.774594] BTRFS warning (device dm-6): csum failed ino 257 off 0
> csum 3734069121 expected csum 1334657141
> [256914.775892] BTRFS info (device dm-6): read error corrected: ino
> 257 off 0 (dev /dev/mapper/VG-b1 sector 2155520)
>
> scrub volume:
>
>
> [257313.636610] BTRFS warning (device dm-6): checksum error at logical
> 1103626240 on dev /dev/mapper/VG-b1, sector 2155520, root 5, inode
> 257, offset 0, length 4096, links 1 (path:
> openSUSE-Tumbleweed-NET-x86_64-Current.iso)
> [257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1
> errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [257313.637737] BTRFS error (device dm-6): fixed up error at logical
> 1103626240 on dev /dev/mapper/VG-b1
>
>
> Reading means there's a warning, scrubbing means there's an error? So
> even the log level is different for the same problem?
What's more confusing is that:
* Checksum failure on read is a warning, but correction of that error is 
an info message (These should be the same log level so that they either 
both show up, or neither shows up.  Having just the checksum failure or 
the error correction display is potentially confusing).
* The message from a scrub that provides most of the useful info is a 
warning (even though it reports a checksum error), but the messages about 
correcting it and incrementing the error counters are logged as errors.

So, not only are things inconsistent across the type of correction, but 
they're internally inconsistent.
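
To make the effect concrete, here is a small sketch of which of the quoted messages survive a given severity cutoff. It assumes the usual kernel numbering (err = 3, warning = 4, info = 6, lower is more severe) and that the "error", "warning" and "info" labels in the BTRFS prefix map to those levels. The `visible` helper and the shortened message descriptions are hypothetical.

```python
# Kernel log severities: lower number means more severe.
LEVELS = {"error": 3, "warning": 4, "info": 6}

# (path, level, short description) for the messages quoted in this thread.
MESSAGES = [
    ("read", "warning", "csum failed"),
    ("read", "info", "read error corrected"),
    ("scrub", "warning", "checksum error at logical"),
    ("scrub", "error", "bdev error counters"),
    ("scrub", "error", "fixed up error at logical"),
]


def visible(messages, max_level):
    """Messages still shown when only severities <= max_level pass."""
    return [(path, text) for path, level, text in messages
            if LEVELS[level] <= max_level]


warnings_and_up = visible(MESSAGES, LEVELS["warning"])
errors_only = visible(MESSAGES, LEVELS["error"])
```

With the cutoff at warning, the read path shows the checksum failure but never the correction. With the cutoff at error, the read path shows nothing at all, and the scrub path shows the counters and the fix but not the one message that says which file was affected.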
>
> And then the ambiguous "read error corrected" vs "fixed up error" -
> the second one is more clear that the fix is pushed to a device "fixed
> error on device" rather than just an in memory correction. But still,
> they're different messages for the same problem and the auto healing.
Of the two, I personally prefer the scrub messages by a pretty 
significant margin.  They give you info about the inode, the location of 
the error, the path in the FS, and even the location on-disk itself 
while additionally logging the values of the cumulative error counters 
and telling you that the error was corrected.  If we were to update that 
to include what triggered detecting the error, that would cover pretty 
much everything needed for a reasonable e-mail notification.
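
A minimal sketch of that idea follows: it pulls the cumulative counters out of the scrub "bdev ... errs:" line quoted above and combines them with the other fields into a single notification string that also names the trigger. `parse_counters`, `notification` and the output wording are hypothetical, and the regular expression is written only against the sample line in this thread.

```python
import re

# Matches "bdev <dev> errs: wr N, rd N, flush N, corrupt N, gen N".
COUNTERS = re.compile(
    r"bdev (?P<dev>\S+) errs: wr (?P<wr>\d+), rd (?P<rd>\d+), "
    r"flush (?P<flush>\d+), corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_counters(line):
    """Return (device, {counter: value}) or None if the line differs."""
    m = COUNTERS.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    dev = fields.pop("dev")
    return dev, {name: int(value) for name, value in fields.items()}


def notification(trigger, path, dev, sector, counters, corrected):
    """Build a one-line summary that includes what triggered detection."""
    status = "corrected" if corrected else "NOT corrected"
    totals = ", ".join(f"{name} {count}" for name, count in counters.items())
    return (f"btrfs checksum error found by {trigger}: {path} "
            f"on {dev} sector {sector} ({status}); "
            f"device error counters: {totals}")


dev, counters = parse_counters(
    "[257313.636865] BTRFS error (device dm-6): bdev /dev/mapper/VG-b1 "
    "errs: wr 0, rd 0, flush 0, corrupt 1, gen 0")
message = notification(
    "scrub", "openSUSE-Tumbleweed-NET-x86_64-Current.iso",
    dev, 2155520, counters, corrected=True)
```

The same `notification` call would work for the read path if its messages carried the path and counters, which is the gap described above.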

