* btrfs check --check-data-csum malfunctioning?
From: Werner Braun @ 2017-04-18 12:41 UTC (permalink / raw)
  To: linux-btrfs

Hi,

I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs scrub

running btrfs check --check-data-csum returns no errors on the disk

running btrfs scrub on the disk finds tons of errors

I could clear the disk and send it to anyone interested ;-)


-- 
Werner Braun
+49 178 145 8768

O³ Software GmbH & Co. KG * Franziusallee 73 * 24148 Kiel * Germany
Registered office: Kiel * Commercial register: Amtsgericht Kiel HRA 6418 KI
General partner: O³ Software GmbH
Commercial register: Amtsgericht Kiel HRB 10335 KI * Managing director: Werner Braun


* Re: btrfs check --check-data-csum malfunctioning?
From: Hugo Mills @ 2017-04-18 13:15 UTC (permalink / raw)
  To: Werner Braun; +Cc: linux-btrfs


On Tue, Apr 18, 2017 at 02:41:49PM +0200, Werner Braun wrote:
> Hi,
> 
> I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs scrub
> 
> running btrfs check --check-data-csum returns no errors on the disk
> 
> running btrfs scrub on the disk finds tons of errors
> 
> I could clear the disk and send it to anyone interested ;-)

   Do you have anything likely to be writing with O_DIRECT during the
scrub? Specifically, databases and VMs. Possibly some kinds of
torrent/distributed downloads.

   Hugo.

-- 
Hugo Mills             | In event of Last Trump, please form an orderly queue
hugo@... carfax.org.uk | and await judgement.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |     Unofficial notice in Cambridge University Library


* Re: btrfs check --check-data-csum malfunctioning?
From: Werner Braun @ 2017-04-18 13:35 UTC (permalink / raw)
  To: Hugo Mills, linux-btrfs



On 18.04.2017 15:15, Hugo Mills wrote:
> On Tue, Apr 18, 2017 at 02:41:49PM +0200, Werner Braun wrote:
>> Hi,
>>
>> I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs scrub
>>
>> running btrfs check --check-data-csum returns no errors on the disk
>>
>> running btrfs scrub on the disk finds tons of errors
>>
>> I could clear the disk and send it to anyone interested ;-)
>
>    Do you have anything likely to be writing with O_DIRECT during the
> scrub? Specifically, databases and VMs. Possibly some kinds of
> torrent/distributed downloads.

No, I used it as a temporary disk for btrfs send/receive.
btrfs send -v -f /mnt/tmp/out <subvol>
wrote 1.5 TB of data.

While running
btrfs receive -v -f /mnt/tmp/out /path/to/dest

read errors occurred.

I called

btrfs check --check-data-csum

which returned no errors.

Then I mounted the disk and called

btrfs scrub start /path/to/dest

No other read or write operations were running on the disk at that time.




* Re: btrfs check --check-data-csum malfunctioning?
From: Duncan @ 2017-04-18 20:21 UTC (permalink / raw)
  To: linux-btrfs

Werner Braun posted on Tue, 18 Apr 2017 14:41:49 +0200 as excerpted:

> Hi,
> 
> I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs
> scrub
> 
> running btrfs check --check-data-csum returns no errors on the disk
> 
> running btrfs scrub on the disk finds tons of errors

A dev could confirm this, but AFAIK from the comments on the patches I've 
seen going by, btrfs check --check-data-csum only checks that there's a 
valid copy available; that is, it stops as soon as it finds a valid copy, 
and if that's the first one, it won't check the second that's available 
in dup or raid1/10 mode.

A scrub of the full filesystem (as opposed to a single device, for those 
with more than one) will however check both copies and fix the second 
copy from the first, if necessary.

Of course this only applies to blocks that /have/ a second copy, that is, 
those in chunks that are raid1, raid10, or dup.   On a default single 
device btrfs, this will be metadata chunks in dup mode, not data chunks 
in single mode.
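
To make the difference concrete, here is a toy model of the two behaviours described above. It is only a sketch under stated assumptions: the function names check_first_good and scrub_all are hypothetical, the "mirrors" are plain Python byte strings rather than real devices, and zlib.crc32 stands in for the crc32c that btrfs uses by default. It is not the btrfs-progs or kernel code.

```python
import zlib


def csum(block: bytes) -> int:
    # Stand-in checksum: plain CRC-32 from zlib, not btrfs's crc32c.
    return zlib.crc32(block)


def check_first_good(copies, expected):
    """Model of the check behaviour as described: stop at the first valid copy."""
    for index, block in enumerate(copies):
        if csum(block) == expected:
            return index  # later copies are never read
    return None


def scrub_all(copies, expected):
    """Model of the scrub behaviour as described: verify every copy, repair bad ones."""
    good = next((b for b in copies if csum(b) == expected), None)
    bad = [i for i, b in enumerate(copies) if csum(b) != expected]
    if good is not None:
        for i in bad:
            copies[i] = good  # rewrite the bad copy from a good one
    return bad


data = b"A" * 4096
expected = csum(data)
mirrors = [data, b"A" * 4095 + b"B"]  # the second copy is corrupted

first_valid = check_first_good(mirrors, expected)  # finds copy 0, reports nothing
bad_copies = scrub_all(mirrors, expected)          # finds and repairs copy 1
```

In this model the check-style pass sees a clean first copy and never looks at the second, while the scrub-style pass reports the second copy as bad and overwrites it from the first. With only one copy in the list (the single profile), neither pass has anything to repair from.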

I think the low-memory-mode check may work differently than normal mode 
check in this regard, and as I said, this is based on the comments on 
patches going by, so it's possible the newest versions have changed this, 
but I'm not sure.

Meanwhile, it's worth noting that btrfs scrub calls in-kernel code to do 
the scrub, while check does everything in userspace, and the comments in 
the patches suggest the code has diverged, so it's not entirely 
surprising that the results differ.

Of course as Hugo mentions, scrub is done with the filesystem mounted, as 
well, and it's possible dio bypasses the normal buffered-write locking 
that prevents blocks from changing out from under scrub as it's doing its 
thing, allowing current direct-write access to screw things up in-
flight.  Check's userspace access is done with the filesystem unmounted, 
so that shouldn't be possible there, unless something's writing directly 
to the device itself, not thru the filesystem.
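
One way such a mismatch could arise with direct I/O is that the application changes its buffer after the checksum has been computed but before the data reaches the disk. The following is a deterministic toy model of that ordering, not kernel code: write_with_csum and verify are hypothetical names, and zlib.crc32 is again only a stand-in for crc32c.

```python
import zlib


def write_with_csum(buffer: bytearray, mutate_in_flight: bool):
    """Toy write path: checksum the buffer first, then 'write' it."""
    stored_csum = zlib.crc32(bytes(buffer))  # checksum taken at submit time
    if mutate_in_flight:
        buffer[0] ^= 0xFF  # the application touches its buffer mid-write
    on_disk = bytes(buffer)  # what actually lands on the device
    return on_disk, stored_csum


def verify(on_disk: bytes, stored_csum: int) -> bool:
    """What a later scrub-style pass would do: recompute and compare."""
    return zlib.crc32(on_disk) == stored_csum


stable = verify(*write_with_csum(bytearray(b"x" * 4096), mutate_in_flight=False))
raced = verify(*write_with_csum(bytearray(b"x" * 4096), mutate_in_flight=True))
```

Here stable verifies and raced does not, even though no hardware fault is involved in either case.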

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: btrfs check --check-data-csum malfunctioning?
From: Qu Wenruo @ 2017-04-19  0:40 UTC (permalink / raw)
  To: Werner Braun, linux-btrfs



At 04/18/2017 08:41 PM, Werner Braun wrote:
> Hi,
> 
> I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs scrub
> 
> running btrfs check --check-data-csum returns no errors on the disk
> 
> running btrfs scrub on the disk finds tons of errors
> 
> I could clear the disk and send it to anyone interested ;-)
> 
> 
That's because --check-data-csum will only check the first copy of data 
if the first copy is good.

I've submitted an offline scrub patchset to address this, which is a 
btrfs-progs equivalent of the kernel scrub.

https://github.com/adam900710/btrfs-progs/tree/offline_scrub

Thanks,
Qu




* Re: btrfs check --check-data-csum malfunctioning?
From: Henk Slager @ 2017-04-19  9:44 UTC (permalink / raw)
  To: Werner Braun; +Cc: linux-btrfs

> At 04/18/2017 08:41 PM, Werner Braun wrote:
>>
>> Hi,
>>
>> I have a WD WD40EZRX with strange behaviour of btrfs check vs. btrfs scrub
>>
>> running btrfs check --check-data-csum returns no errors on the disk
>>
>> running btrfs scrub on the disk finds tons of errors
>>
>> I could clear the disk and send it to anyone interested ;-)
>>
>>
> That's because --check-data-csum will only check the first copy of data if
> the first copy is good.

So is the conclusion that all the csum errors are in the metadata?
What is the profile of the fs? (Not dup for metadata, I assume?)

I also have a WD40EZRX, and the fs on it is also almost exclusively a
btrfs receive target; it now has csum errors (just 5) for the second
time. An extended self-test at 16K hours shows no problem, and I am not
fully sure whether this is a magnetic media error or something else.


* Re: btrfs check --check-data-csum malfunctioning?
From: Henk Slager @ 2017-05-24 18:56 UTC (permalink / raw)
  To: linux-btrfs; +Cc: Werner Braun

On Wed, Apr 19, 2017 at 11:44 AM, Henk Slager <eye1tm@gmail.com> wrote:

> I also have a WD40EZRX, and the fs on it is also almost exclusively a
> btrfs receive target; it now has csum errors (just 5) for the second
> time. An extended self-test at 16K hours shows no problem, and I am not
> fully sure whether this is a magnetic media error or something else.

I have now located the 20K (sequential) of bad csums in a 4G file and
its physical chunk address. I then read that 1G chunk to a file and
wrote it back to the same disk location. There were no I/O errors in
dmesg, so my assumption is that the 20K bad spot has been replaced by
good spares. Or it was a btrfs or LUKS fault, or just a spurious random
write due to some SW/HW glitch.
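
The read-and-write-back step can be sketched as below. This is only an illustration under assumptions: rewrite_range is a hypothetical helper, the demo runs on a scratch file rather than a block device, and the offset and length are made-up numbers. On a real disk the path would be the device node and the offset the physical address; a file on a copy-on-write filesystem would normally be written to a new location instead. The helper rewrites whatever it reads, so it does not correct data or checksums.

```python
import os
import tempfile


def rewrite_range(path, offset, length, block_size=4096):
    """Read each block in [offset, offset + length) and write it back in place."""
    rewritten = 0
    with open(path, "r+b") as handle:
        end = offset + length
        position = offset
        while position < end:
            handle.seek(position)
            chunk = handle.read(min(block_size, end - position))
            if not chunk:
                break  # reached end of file or device
            handle.seek(position)
            handle.write(chunk)
            rewritten += len(chunk)
            position += len(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # push the rewritten blocks to the device
    return rewritten


# Demo on a 64 KiB scratch file: rewrite a 20 KiB span starting at 8 KiB.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(os.urandom(64 * 1024))
    scratch_path = scratch.name

with open(scratch_path, "rb") as f:
    before = f.read()

rewritten = rewrite_range(scratch_path, 8192, 20 * 1024)

with open(scratch_path, "rb") as f:
    after = f.read()

os.remove(scratch_path)
```

The contents are unchanged by design; the only intended effect is that the drive sees fresh writes to those sectors.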

As a way of locking the bad area in place, I did a cp --reflink of the
4G file to the root of the fs and did a read/write-back of the 20K spot
in the 4G file in the send-source fs. So now, after another differential
receive, I remove all but the latest snapshot. The 5 csum errors will
then stay where they are as long as I don't balance. Then, just before I
do a btrfs replace (if I decide to), I delete the 4G file and make sure
the cleaner has finished, so that replace will not fail on the 5 bad
csums.

The fs on the WD40EZRX is just another clone/backup, but with quite a
complex subvolume tree. The above actions plus replace are more fun, and
a faster way of cloning again, than recreating the tree with rsync etc.
I have done similar things in the past, when csum errors were clearly
due to btrfs bugs but with good HDDs.
