* Scrub aborts on newer kernels
@ 2016-05-26 17:55 Tyson Whitehead
  2016-05-27 18:12 ` Chris Murphy
  2016-06-17 22:00 ` Chris Murphy
  0 siblings, 2 replies; 9+ messages in thread
From: Tyson Whitehead @ 2016-05-26 17:55 UTC (permalink / raw)
  To: linux-btrfs

Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.

If I boot back into an older kernel (4.1 or 4.3, not sure about 4.2) then it runs to completion without any issues.

Steps to reproduce:

1 - make a raid1 system
2 - run with only one disk for a while to introduce inconsistency
3 - add the other disk back and run btrfs scrub

The newer kernels get partway through the scrub and then die.  For example, with 4.6:

# btrfs scrub status -dR /
scrub status for 61267e7b-e8e3-43e1-99f3-40cb2b004a6a
scrub device /dev/sda3 (id 1) history
        scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:23
        data_extents_scrubbed: 256140
        tree_extents_scrubbed: 35016
        data_bytes_scrubbed: 14865694720
        tree_bytes_scrubbed: 573702144
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 2032
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 16004874240
scrub device /dev/sdb3 (id 2) history
        scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:35
        data_extents_scrubbed: 256139
        tree_extents_scrubbed: 35016
        data_bytes_scrubbed: 14865690624
        tree_bytes_scrubbed: 573702144
        read_errors: 0
        csum_errors: 205
        verify_errors: 24
        no_csum: 2032
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 229
        last_physical: 15984951296

The kernel logs show nothing other than the standard "no csum found for inode ..." and "parent transid verify failed ..." messages

Then I booted back into 4.3 and reran the scrub:

# btrfs scrub start -BdR /
scrub device /dev/sda3 (id 1) done
        scrub started at Thu May 26 11:43:00 2016 and finished after 00:56:25
        data_extents_scrubbed: 6939254
        tree_extents_scrubbed: 68269
        data_bytes_scrubbed: 426809974784
        tree_bytes_scrubbed: 1118519296
        read_errors: 0
        csum_errors: 0
        verify_errors: 0
        no_csum: 62895
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 0
        last_physical: 482390048768
scrub device /dev/sdb3 (id 2) done
        scrub started at Thu May 26 11:43:00 2016 and finished after 00:58:41
        data_extents_scrubbed: 6939240
        tree_extents_scrubbed: 68118
        data_bytes_scrubbed: 426809335808
        tree_bytes_scrubbed: 1116045312
        read_errors: 0
        csum_errors: 1051510
        verify_errors: 0
        no_csum: 62767
        csum_discards: 0
        super_errors: 0
        malloc_errors: 0
        uncorrectable_errors: 0
        unverified_errors: 0
        corrected_errors: 1051510
        last_physical: 482390048768
WARNING: errors detected during scrubbing, corrected

Cheers!  -Tyson

PS:  This is with version 4.4 of btrfs-progs and Debian kernel releases 4.1, 4.3, 4.4, 4.5, and 4.6.

-- 
 Tyson Whitehead
 HPC Programming Specialist
 Compute Canada (SHARCNET)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Scrub aborts on newer kernels
  2016-05-26 17:55 Scrub aborts on newer kernels Tyson Whitehead
@ 2016-05-27 18:12 ` Chris Murphy
  2016-06-17 14:45   ` Tyson Whitehead
  2016-06-17 22:00 ` Chris Murphy
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2016-05-27 18:12 UTC (permalink / raw)
  To: Tyson Whitehead; +Cc: Btrfs BTRFS

On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.

I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.

I think the bigger issue is the lack of information why a scrub is aborted.



-- 
Chris Murphy


* Re: Scrub aborts on newer kernels
  2016-05-27 18:12 ` Chris Murphy
@ 2016-06-17 14:45   ` Tyson Whitehead
  2016-06-17 22:18     ` Chris Murphy
  0 siblings, 1 reply; 9+ messages in thread
From: Tyson Whitehead @ 2016-06-17 14:45 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On May 27, 2016 12:12:54 PM Chris Murphy wrote:
> On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> > Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.
> 
> I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.
> 
> I think the bigger issue is the lack of information why a scrub is aborted.

Thanks for checking into this Chris.

Any advice on how to get some more information out of the scrub process?

Cheers!  -Tyson


* Re: Scrub aborts on newer kernels
  2016-05-26 17:55 Scrub aborts on newer kernels Tyson Whitehead
  2016-05-27 18:12 ` Chris Murphy
@ 2016-06-17 22:00 ` Chris Murphy
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2016-06-17 22:00 UTC (permalink / raw)
  To: Tyson Whitehead; +Cc: Btrfs BTRFS

Back to the original email...



On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.
>
> If I boot back into an older kernel (4.1 or 4.3, not sure about 4.2) then it runs to completion without any issues.
>
> Steps to reproduce:
>
> 1 - make a raid1 system
> 2 - run with only one disk for a while to introduce inconsistency
> 3 - add the other disk back and run btrfs scrub
>
> The newer kernels will get part way through the scrub and then die.  For example, with 4.6
>
> # btrfs scrub status -dR /
> scrub status for 61267e7b-e8e3-43e1-99f3-40cb2b004a6a
> scrub device /dev/sda3 (id 1) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:23
>         data_extents_scrubbed: 256140
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865694720
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 16004874240
> scrub device /dev/sdb3 (id 2) history
>         scrub started at Thu May 26 10:59:31 2016 and was aborted after 00:02:35
>         data_extents_scrubbed: 256139
>         tree_extents_scrubbed: 35016
>         data_bytes_scrubbed: 14865690624
>         tree_bytes_scrubbed: 573702144
>         read_errors: 0
>         csum_errors: 205
>         verify_errors: 24
>         no_csum: 2032
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 229
>         last_physical: 15984951296

no_csum is not unusual, as files are often given the +C file attribute
(nodatacow, set with chattr). For example, this is now the default with
newer versions of systemd for systemd-journald logs.

But this 2nd device has verify_errors and csum_errors, which together
add up to the same value as corrected_errors, before the abort. I
think that's odd. It's a lot of errors.

Also odd is that the abort doesn't happen at exactly the same time for
both devices; maybe that's explained by it taking 12 seconds for the
corrections to happen on the 2nd device? But 229 4KiB blocks being
corrected wouldn't take 12 seconds... for any reason I can think of.
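
The arithmetic in the two paragraphs above can be checked mechanically.
The following is a minimal Python sketch, assuming the "name: value"
layout of the `btrfs scrub status -dR` output quoted above. The helper
names parse_counters and to_seconds are hypothetical and are not part of
btrfs-progs.

```python
import re

# Counters copied from the quoted 4.6 scrub status for /dev/sdb3 (id 2).
SDB3_ABORTED = """\
        csum_errors: 205
        verify_errors: 24
        corrected_errors: 229
"""


def parse_counters(text):
    """Parse 'name: integer' lines into a dict (hypothetical helper)."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def to_seconds(hms):
    """Convert an 'HH:MM:SS' duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


stats = parse_counters(SDB3_ABORTED)
detected = stats["csum_errors"] + stats["verify_errors"]
print("detected:", detected, "corrected:", stats["corrected_errors"])

# Abort times quoted above: 00:02:23 for sda3, 00:02:35 for sdb3.
gap = to_seconds("00:02:35") - to_seconds("00:02:23")
print("abort gap in seconds:", gap)
```

Feeding the full status text for a device to parse_counters should work
the same way, since lines that are not plain integer counters are
skipped.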



> The kernel logs show nothing other than the standard "no csum found for inode ..." and "parent transid verify failed ..." messages

Maybe include a btrfs check for the volume, using btrfs progs 4.4.1 or 4.5.3.

>
> Then booting back into 4.3 and rerunning the scrub.
>
> # btrfs scrub start -BdR /
> scrub device /dev/sda3 (id 1) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:56:25
>         data_extents_scrubbed: 6939254
>         tree_extents_scrubbed: 68269
>         data_bytes_scrubbed: 426809974784
>         tree_bytes_scrubbed: 1118519296
>         read_errors: 0
>         csum_errors: 0
>         verify_errors: 0
>         no_csum: 62895
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 0
>         last_physical: 482390048768
> scrub device /dev/sdb3 (id 2) done
>         scrub started at Thu May 26 11:43:00 2016 and finished after 00:58:41
>         data_extents_scrubbed: 6939240
>         tree_extents_scrubbed: 68118
>         data_bytes_scrubbed: 426809335808
>         tree_bytes_scrubbed: 1116045312
>         read_errors: 0
>         csum_errors: 1051510
>         verify_errors: 0
>         no_csum: 62767
>         csum_discards: 0
>         super_errors: 0
>         malloc_errors: 0
>         uncorrectable_errors: 0
>         unverified_errors: 0
>         corrected_errors: 1051510
>         last_physical: 482390048768
> WARNING: errors detected during scrubbing, corrected


OK, and now it's over one million corrections for a single device; the
other one isn't affected.

I know btrfs dev stats are cumulative, I forget if scrub stats are.
If they are, that's a bit confusing. But in any case, lifetime or one
time, a million corrections is crazy unless this is intentional,
trying to hammer on Btrfs's self-healing abilities. Good test. Not a
good in-production behavior though.

So I think there are two problems. First, why are there so many
errors in the first place? Second, why is fixing them causing an abort
with new kernels? You might have found a bug/regression that isn't
being caught with testing if the test volumes don't have some unknown
minimum number of csum errors. See what I'm getting at?
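
To put the two runs side by side, here is a rough Python sketch that
uses only numbers quoted above. It assumes that each csum error
corresponds to one 4KiB block (the block size mentioned earlier in this
message, not something the scrub output states) and that last_physical
is a fair proxy for how far a scrub got on a device.

```python
# Values copied from the quoted scrub output for /dev/sdb3 (id 2).
ABORTED_LAST_PHYSICAL = 15984951296    # kernel 4.6, aborted after 00:02:35
COMPLETE_LAST_PHYSICAL = 482390048768  # kernel 4.3, finished after 00:58:41
COMPLETE_CSUM_ERRORS = 1051510

# Assumption: one csum error per 4 KiB block.
BLOCK_SIZE = 4096

progress = ABORTED_LAST_PHYSICAL / COMPLETE_LAST_PHYSICAL
repaired_bytes = COMPLETE_CSUM_ERRORS * BLOCK_SIZE

print(f"aborted scrub reached about {progress:.1%} of the device")
print(f"repaired about {repaired_bytes / 2**30:.1f} GiB under that assumption")
```

Under those assumptions the 4.6 run stopped after only a few percent of
the device, so most of the stale blocks had not been reached when it
aborted.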



-- 
Chris Murphy


* Re: Scrub aborts on newer kernels
  2016-06-17 14:45   ` Tyson Whitehead
@ 2016-06-17 22:18     ` Chris Murphy
  2016-06-17 22:20       ` Chris Murphy
  2016-06-20  9:22       ` Tyson Whitehead
  0 siblings, 2 replies; 9+ messages in thread
From: Chris Murphy @ 2016-06-17 22:18 UTC (permalink / raw)
  To: Tyson Whitehead; +Cc: Chris Murphy, Btrfs BTRFS

On Fri, Jun 17, 2016 at 8:45 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> On May 27, 2016 12:12:54 PM Chris Murphy wrote:
>> On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
>> > Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.
>>
>> I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.
>>
>> I think the bigger issue is the lack of information why a scrub is aborted.
>
> Thanks for checking into this Chris.
>
> Any advice on how to get some more information out of the scrub process?

So I can't help you out with why scrub fails. That sounds like a bug,
not least because there's no information about why it's failing. So I
suggest filing a bug with what you know to date, and including this
thread for reference. The main point is that there *are* errors (a lot
of them, apparently) needing fixing up, but somehow certain kernel
versions are aborting and there's no information why.

Next you need to find out why this one device has so many errors. Put
the output from smartctl -x /dev/sdb somewhere, maybe even attach it
to the bug report since it's somewhat related. Either that device is
simply unreliable, or you've got a bad cable connection.

You need to include an entire dmesg, or even better, attach both dmesg
and 'journalctl -b -o short-monotonic' to the bug report. While there
may be no btrfs messages indicating what's going on, there might be usb
or libata messages indicating hardware problems that are instigating
this.



-- 
Chris Murphy


* Re: Scrub aborts on newer kernels
  2016-06-17 22:18     ` Chris Murphy
@ 2016-06-17 22:20       ` Chris Murphy
  2016-06-20  9:22       ` Tyson Whitehead
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2016-06-17 22:20 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Tyson Whitehead, Btrfs BTRFS

 'journalctl -b -o short-monotonic'

-b is for the current boot. So either reproduce the problem and use
-b, or you can actually track down the boot when the scrubbing failed
with journalctl --list-boots and use -b -1, -b -2, and so on, to get
the journal for prior boots.
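
The journalctl options above can be wrapped in a small script. This is
a minimal Python sketch, not a tested diagnostic tool: journalctl_argv,
hardware_lines and read_journal are hypothetical helper names, and the
pattern is only an assumption about which messages are worth a first
look (ataN port messages, usb, btrfs).

```python
import re
import subprocess

# Assumption: tokens that suggest storage or bus trouble in kernel messages.
PATTERN = re.compile(r"\b(ata\d+|usb|btrfs)\b", re.IGNORECASE)


def journalctl_argv(boot_offset=0):
    """Build the journalctl command for the current boot (0) or a prior one (-1, -2, ...)."""
    argv = ["journalctl", "--no-pager", "-o", "short-monotonic", "-b"]
    if boot_offset:
        argv.append(str(boot_offset))
    return argv


def hardware_lines(journal_text, pattern=PATTERN):
    """Return the journal lines that match the pattern."""
    return [line for line in journal_text.splitlines() if pattern.search(line)]


def read_journal(boot_offset=0):
    """Run journalctl and return its output; needs a host with systemd."""
    result = subprocess.run(
        journalctl_argv(boot_offset), capture_output=True, text=True, check=True
    )
    return result.stdout
```

For example, hardware_lines(read_journal(-1)) would filter the journal
of the previous boot. The full, unfiltered journal is still what
belongs in a bug report.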


-- 
Chris Murphy


* Re: Scrub aborts on newer kernels
  2016-06-17 22:18     ` Chris Murphy
  2016-06-17 22:20       ` Chris Murphy
@ 2016-06-20  9:22       ` Tyson Whitehead
  2016-06-20 18:05         ` Chris Murphy
  2016-06-20 18:06         ` Chris Murphy
  1 sibling, 2 replies; 9+ messages in thread
From: Tyson Whitehead @ 2016-06-20  9:22 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On 17/06/16 06:18 PM, Chris Murphy wrote:
> On Fri, Jun 17, 2016 at 8:45 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
>> On May 27, 2016 12:12:54 PM Chris Murphy wrote:
>>> On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
>>>> Under the last several kernel versions (4.6, and I believe 4.4 and 4.5), btrfs scrub aborts before completing.
>>>
>>> I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.
>>>
>>> I think the bigger issue is the lack of information why a scrub is aborted.
>>
>> Thanks for checking into this Chris.
>>
>> Any advice on how to get some more information out of the scrub process?
>
> Next you need to find out why this one device has so many errors. Put the output from smartctl -x /dev/sdb somewhere, maybe even attach it to the bug report since it's somewhat related. Either that device is simply unreliable, or you've got a bad cable connection.

The device is okay.  The errors were caused by me running it for a period with only one of the devices present.

In more detail:  my desktop has a BTRFS RAID 1 setup.  I needed access to it on the road, so I just shut the desktop down, grabbed one of the drives, and used it in my laptop in degraded mode.

When I got back I recombined them (the one I left in my office was never booted by itself) and started a scrub.  That is when I discovered scrub aborted on newer kernels but completed okay on older ones.

Cheers!  -Tyson


* Re: Scrub aborts on newer kernels
  2016-06-20  9:22       ` Tyson Whitehead
@ 2016-06-20 18:05         ` Chris Murphy
  2016-06-20 18:06         ` Chris Murphy
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2016-06-20 18:05 UTC (permalink / raw)
  To: Tyson Whitehead; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Jun 20, 2016 at 3:22 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> The device is okay.  The errors were caused by me running it for a period
> with only one of the devices present.
>
> In more detail:  my desktop has a BTRFS RAID 1 setup.  I needed access to
> it on the road, so I just shut the desktop down, grabbed one of the drives,
> and used it in my laptop in degraded mode.
>
> When I got back I recombined them (the one I left in my office was never
> booted by itself) and started a scrub.  That is when I discovered scrub
> aborted on newer kernels but completed okay on older ones.

That's troubling indeed.

OK, so what about a balance with one of the kernels that aborts scrub?
Does the balance abort, or does it complete? And if it completes, does
a subsequent scrub work?

-- 
Chris Murphy


* Re: Scrub aborts on newer kernels
  2016-06-20  9:22       ` Tyson Whitehead
  2016-06-20 18:05         ` Chris Murphy
@ 2016-06-20 18:06         ` Chris Murphy
  1 sibling, 0 replies; 9+ messages in thread
From: Chris Murphy @ 2016-06-20 18:06 UTC (permalink / raw)
  To: Tyson Whitehead; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, Jun 20, 2016 at 3:22 AM, Tyson Whitehead <twhitehead@gmail.com> wrote:
> The device is okay.  The errors were caused by me running it for a period
> with only one of the devices present.
>
> In more detail:  my desktop has a BTRFS RAID 1 setup.  I needed access to
> it on the road, so I just shut the desktop down, grabbed one of the drives,
> and used it in my laptop in degraded mode.
>
> When I got back I recombined them (the one I left in my office was never
> booted by itself) and started a scrub.  That is when I discovered scrub
> aborted on newer kernels but completed okay on older ones.
>
> Cheers!  -Tyson

Are you absolutely positively certain that only one of the two devices
was ever written to while mounted degraded? As in you're 100% certain,
not 99% or less certain?


-- 
Chris Murphy

