* Is this normal? Should I use scrub?
@ 2015-04-01 15:11 Andy Smith
  2015-04-01 15:42 ` Hugo Mills
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Smith @ 2015-04-01 15:11 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I have a 6 device RAID-1 filesystem:

$ sudo btrfs fi df /srv/tank
Data, RAID1: total=1.24TiB, used=1.24TiB
System, RAID1: total=32.00MiB, used=184.00KiB
Metadata, RAID1: total=3.00GiB, used=1.65GiB
unknown, single: total=512.00MiB, used=0.00
$ sudo btrfs fi sh /srv/tank
Label: 'tank'  uuid: 472ee2b3-4dc3-4fc1-80bc-5ba967069ceb
        Total devices 6 FS bytes used 1.24TiB
        devid    2 size 1.82TiB used 384.03GiB path /dev/sdh
        devid    3 size 1.82TiB used 383.00GiB path /dev/sdg
        devid    4 size 1.82TiB used 384.00GiB path /dev/sdf
        devid    5 size 2.73TiB used 1.13TiB path /dev/sdk
        devid    6 size 1.82TiB used 121.00GiB path /dev/sdj
        devid    7 size 2.73TiB used 116.00GiB path /dev/sde

Btrfs v3.14.2

All of these devices are in an external eSATA enclosure.

A few days ago (I believe) something went wrong with the enclosure
hardware and the SCSI bus kept getting reset over and over. At one
point three of the six devices were kicked out and the filesystem
was left running (read-only) on three devices.

Through some trial and error I determined that the enclosure was
taking exception to one of the devices, and by removing it I was
able to get things up and running with five devices, writeable,
mounted in degraded mode. /dev/sdk is the device that was kept out
of the filesystem.

I do not believe there is anything wrong with /dev/sdk, as I put it
in another system and was able to read it entirely, run long SMART
self-tests on it, and so on.

I wasn't able to prove it was a hardware problem without taking the
enclosure out of service, as it's the only enclosure I had. So
that's a task for later.

I have now got a new enclosure and put this system back together
with all six devices. I was not expecting this filesystem to mount
without assistance on boot because of /dev/sdk being "stale"
compared to the other devices. I suppose this incorrect view is a
holdover from my experience with mdadm.

Anyway, I booted it and /srv/tank was mounted automatically with all
six devices.  I got a bunch of these messages as soon as it was
mounted:

    http://pastie.org/private/2ghahjwtzlcm6hwp66hkg

There's lots more of it, but it's all like that. That paste is from
the end of the log and there haven't been any more such messages
since, so that's about 20 minutes (the times are in GMT).

Is that normal output indicating that btrfs is repairing the
"staleness" of sdk from the other copy?

I seem to be able to use the filesystem and a cursory inspection
isn't turning up anything that I can't read or that seems
corrupted. I will now run checksums against my last good backup.
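A comparison like that can be done with per-file checksum lists. This is a
minimal sketch, not the poster's actual procedure; the two directories here
are created as throwaway stand-ins, and in practice you would point `live`
and `backup` at the real mount points (e.g. /srv/tank and the backup tree):

```shell
# Sketch: compare per-file SHA-256 sums between a live tree and a backup
# copy. The directories below are temporary stand-ins for illustration;
# substitute the real filesystem and backup paths in practice.
live=$(mktemp -d)    # stand-in for e.g. /srv/tank
backup=$(mktemp -d)  # stand-in for e.g. the backup tree
echo "example data" > "$live/file"
cp "$live/file" "$backup/file"

# Sorted checksum lists keyed by relative path, so the diff lines up.
(cd "$live"   && find . -type f -print0 | sort -z | xargs -0 sha256sum) > /tmp/live.sums
(cd "$backup" && find . -type f -print0 | sort -z | xargs -0 sha256sum) > /tmp/backup.sums

# Any differing or missing files appear in the diff output.
diff /tmp/live.sums /tmp/backup.sums && echo "no differences found"
```

Sorting both lists the same way means a plain diff lines up file-for-file,
so additions, deletions, and content changes all show up directly.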

Should I run a scrub as well?

Cheers,
Andy

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Is this normal? Should I use scrub?
  2015-04-01 15:11 Is this normal? Should I use scrub? Andy Smith
@ 2015-04-01 15:42 ` Hugo Mills
  2015-04-02  9:58   ` Andy Smith
  0 siblings, 1 reply; 4+ messages in thread
From: Hugo Mills @ 2015-04-01 15:42 UTC (permalink / raw)
  To: Andy Smith; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 1843 bytes --]

   Hi, Andy,

On Wed, Apr 01, 2015 at 03:11:14PM +0000, Andy Smith wrote:
> I have a 6 device RAID-1 filesystem:

[snip tale of a filesystem with out-of-date data on one copy of the RAID]

> I have now got a new enclosure and put this system back together
> with all six devices. I was not expecting this filesystem to mount
> without assistance on boot because of /dev/sdk being "stale"
> compared to the other devices. I suppose this incorrect view is a
> holdover from my experience with mdadm.
> 
> Anyway, I booted it and /srv/tank was mounted automatically with all
> six devices.  I got a bunch of these messages as soon as it was
> mounted:
> 
>     http://pastie.org/private/2ghahjwtzlcm6hwp66hkg
> 
> There's lots more of it, but it's all like that. That paste is from
> the end of the log and there haven't been any more such messages
> since, so that's about 20 minutes (the times are in GMT).
> 
> Is that normal output indicating that btrfs is repairing the
> "staleness" of sdk from the other copy?

   Yes, exactly. That output you pasted looks pretty much exactly like
what I'd expect to see in the situation described above. You might
also expect to see some checksum errors corrected in the data, as well
as the metadata messages you're getting.

> I seem to be able to use the filesystem and a cursory inspection
> isn't turning up anything that I can't read or that seems
> corrupted. I will now run checksums against my last good backup.
> 
> Should I run a scrub as well?

   Yes. The output you've had so far will be just the pieces that the
FS has tried to read, and where, as a result, it's been able to detect
the out-of-date data. A scrub will check and fix everything.

   Hugo.

-- 
Hugo Mills             | My karma has run over my dogma.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


* Re: Is this normal? Should I use scrub?
  2015-04-01 15:42 ` Hugo Mills
@ 2015-04-02  9:58   ` Andy Smith
  2015-04-02 10:29     ` Hugo Mills
  0 siblings, 1 reply; 4+ messages in thread
From: Andy Smith @ 2015-04-02  9:58 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs

Hi Hugo,

Thanks for your help.

On Wed, Apr 01, 2015 at 03:42:02PM +0000, Hugo Mills wrote:
> On Wed, Apr 01, 2015 at 03:11:14PM +0000, Andy Smith wrote:
> > Should I run a scrub as well?
> 
>    Yes. The output you've had so far will be just the pieces that the
> FS has tried to read, and where, as a result, it's been able to detect
> the out-of-date data. A scrub will check and fix everything.

Thanks, things seem to be fine now. :)

What's the difference between "verify" and "csum" here?

scrub status for 472ee2b3-4dc3-4fc1-80bc-5ba967069ceb
scrub device /dev/sdh (id 2) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 14642 seconds
        total bytes scrubbed: 383.42GiB with 0 errors
scrub device /dev/sdg (id 3) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 14504 seconds
        total bytes scrubbed: 382.62GiB with 0 errors
scrub device /dev/sdf (id 4) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 14436 seconds
        total bytes scrubbed: 383.00GiB with 0 errors
scrub device /dev/sdk (id 5) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 21156 seconds
        total bytes scrubbed: 1.13TiB with 14530 errors
        error details: verify=10909 csum=3621
        corrected errors: 14530, uncorrectable errors: 0, unverified errors: 0
scrub device /dev/sdj (id 6) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 5693 seconds
        total bytes scrubbed: 119.42GiB with 0 errors
scrub device /dev/sde (id 7) history
        scrub started at Wed Apr  1 20:05:58 2015 and finished after 5282 seconds
        total bytes scrubbed: 114.45GiB with 0 errors
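(As a quick sanity check on the figures above, the two error categories
should account for the whole corrected total. A trivial shell check over
the numbers from the status output, inlined here, confirms they do:)

```shell
# Verify that the per-category error details from the scrub status above
# add up to the reported corrected total. All three figures are taken
# directly from the /dev/sdk (id 5) entry in the output.
verify=10909
csum=3621
total=14530
if [ $((verify + csum)) -eq "$total" ]; then
    echo "error categories account for all $total corrected errors"
fi
```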

Cheers,
Andy


* Re: Is this normal? Should I use scrub?
  2015-04-02  9:58   ` Andy Smith
@ 2015-04-02 10:29     ` Hugo Mills
  0 siblings, 0 replies; 4+ messages in thread
From: Hugo Mills @ 2015-04-02 10:29 UTC (permalink / raw)
  To: Andy Smith; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2975 bytes --]

On Thu, Apr 02, 2015 at 09:58:39AM +0000, Andy Smith wrote:
> Hi Hugo,
> 
> Thanks for your help.

   Makes a change from you answering my questions. :)

> On Wed, Apr 01, 2015 at 03:42:02PM +0000, Hugo Mills wrote:
> > On Wed, Apr 01, 2015 at 03:11:14PM +0000, Andy Smith wrote:
> > > Should I run a scrub as well?
> > 
> >    Yes. The output you've had so far will be just the pieces that the
> > FS has tried to read, and where, as a result, it's been able to detect
> > the out-of-date data. A scrub will check and fix everything.
> 
> Thanks, things seem to be fine now. :)
> 
> What's the difference between "verify" and "csum" here?

   verify would be where the internal consistency checks for metadata
failed. That might be, for example, where it's detected that a tree
node has a newer transaction ID (effectively a monotonic timestamp)
than its parent. This should never happen, so the parent is probably
out of date. If there's another copy of the metadata that doesn't have
the same problem, it can be used to repair the obviously-wrong copy.
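(A toy model of that transaction-ID comparison, with invented numbers:
generations only ever increase, so of two copies of the same tree node,
the one with the lower generation must be the stale one. The generation
values and device roles below are hypothetical; on a real system, newer
btrfs-progs can show the superblock generation via
`btrfs inspect-internal dump-super <device>`.)

```shell
# Toy model of the generation (transid) check: generations are a
# monotonically increasing counter, so the copy with the lower value is
# out of date. Both values here are invented for illustration only.
gen_copy_a=120345   # hypothetical generation on the stale device
gen_copy_b=120987   # hypothetical generation on the surviving mirror

if [ "$gen_copy_a" -lt "$gen_copy_b" ]; then
    echo "copy A is stale (generation $gen_copy_a < $gen_copy_b); repair from copy B"
fi
```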

   csum is where the checksum validation failed -- this would be, for
example, where some data was modified on one copy and left unchanged
on the older copy, but the metadata for both copies was updated. In
that case, the data on the out-of-date drive wouldn't match the
checksum, and needs to be updated from the good copy.

   Hugo.

> scrub status for 472ee2b3-4dc3-4fc1-80bc-5ba967069ceb
> scrub device /dev/sdh (id 2) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 14642 seconds
>         total bytes scrubbed: 383.42GiB with 0 errors
> scrub device /dev/sdg (id 3) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 14504 seconds
>         total bytes scrubbed: 382.62GiB with 0 errors
> scrub device /dev/sdf (id 4) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 14436 seconds
>         total bytes scrubbed: 383.00GiB with 0 errors
> scrub device /dev/sdk (id 5) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 21156 seconds
>         total bytes scrubbed: 1.13TiB with 14530 errors
>         error details: verify=10909 csum=3621
>         corrected errors: 14530, uncorrectable errors: 0, unverified errors: 0
> scrub device /dev/sdj (id 6) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 5693 seconds
>         total bytes scrubbed: 119.42GiB with 0 errors
> scrub device /dev/sde (id 7) history
>         scrub started at Wed Apr  1 20:05:58 2015 and finished after 5282 seconds
>         total bytes scrubbed: 114.45GiB with 0 errors
> 
> Cheers,
> Andy

-- 
Hugo Mills             | Debugging is like hitting yourself in the head with a
hugo@... carfax.org.uk | hammer: it feels so good when you find the bug, and
http://carfax.org.uk/  | you're allowed to stop debugging.
PGP: 65E74AC0          |                                        PotatoEngineer

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]


