* file corruptions, 2nd half of 512b block
@ 2018-03-22 15:02 Chris Dunlop
  2018-03-22 18:03 ` Brian Foster
  0 siblings, 1 reply; 11+ messages in thread
From: Chris Dunlop @ 2018-03-22 15:02 UTC (permalink / raw)
  To: linux-xfs

Hi,

I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.

System configuration details below.

For those cases where the corrupt file can be regenerated from other
data and the new file compared to the corrupt file (15 files in all),
the corruptions are invariably in the 2nd 256b half of a 512b sector,
part way through the file. That's pretty odd! Perhaps some kind of
buffer tail problem?

Are there any known issues that might cause this?

------
Late addendum... this is the same system, but different FS, where I
experienced this:

https://www.spinics.net/lists/linux-xfs/msg14876.html

To my vast surprise I see the box is still on the same kernel without
the patch per that message. (I must have been sleep deprived, I would
have sworn that it was upgraded.) Is this possibly the same underlying 
problem?
------

Further details...

The corruptions are being flagged by a mismatched md5. The file
generator calculates the md5 of the data as it's being generated (i.e.
before it hits storage), and saves the md5 in a separate file alongside
the data file. The corruptions are being found by comparing the
previously calculated md5 with the current file contents. The xfs sits
on raid6 which checks clean, so it seems like it's not a hdd problem.
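
For illustration, the sidecar arrangement amounts to roughly the
following sketch (file names and the generator command are made up):

# at generation time - the md5 is computed before the data reaches storage:
generate_data | tee bigfile.dat | md5sum | awk '{print $1}' > bigfile.md5
# later verification against the current file contents:
[ "$(md5sum < bigfile.dat | awk '{print $1}')" = "$(cat bigfile.md5)" ] \
    || echo "checksum mismatch: bigfile.dat"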

The box was upgraded from 3.18.25 to 4.9.76 on 2018-01-15. There's a
good chance this was when the corruptions started as the earliest
confirmed corruption is in a file generated 2018-02-04, and there may
have been another on 2018-01-23. However it's also possible there were
earlier (and maybe even much earlier) corruptions which weren't being
picked up.

A scan through the commit log between 4.9.76 and current stable (.88)
for xfs bits doesn't show anything that stands out as relevant, at least
to my eyes. I've also looked between 4.9 and current HEAD, but there
are of course a /lot/ of xfs updates there and I'm afraid it's way too
easy for me to miss any relevant changes.
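
(E.g., in a linux-stable checkout, that scan is roughly:

$ git log --oneline v4.9.76..v4.9.88 -- fs/xfs
)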

The file generator either runs remotely, and the data (and md5) arrives
via FTP, or runs locally, and the data (and md5) is written via NFS. The
corruptions have occurred in both cases.

These files are generally in the 10s of GB range, with a few at 1-3GB,
and a few in the low 100s of GB. All but one of the corrupt files have a
single 256b corruption, with the other having two separate corruptions
(each in the 2nd half of a 512b sector).

Overall we've received ~33k files since the o/s change, and have
identified about 34 corrupt files amongst those. Unfortunately some
parts of the generator aren't deterministic so we can't compare corrupt
files with regenerated files in all cases - per above, we've been able
to compare 15 of these files, with no discernible pattern other than the
corruption always occurring in the 2nd 256b of a 512b block.

Using xfs_bmap to investigate the corrupt files and where the corrupt
data sits, and digging down further into the LV, PV, md and hdd levels,
there's no consistency or discernible pattern of placement of the
corruptions at any level: ag, md, hdd.

Eyeballing the corrupted blocks and matching good blocks doesn't show
any obvious pattern. The files themselves contain compressed data so
it's all highly random at the block level, and the corruptions
themselves similarly look like random bytes.

The corrupt blocks are not a copy of other data in the file within the
surrounding 256k of the corrupt block.

----------------------------------------------------------------------
System configuration
----------------------------------------------------------------------

linux-4.9.76
xfsprogs 4.10
CPU: 2 x E5620 (16 cores total)
192G RAM

# grep bigfs /etc/mtab
/dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
# xfs_info /bigfs
meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=65929101312, imaxpct=5
         =                       sunit=128    swidth=1152 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.

The raids all check clean.

The XFS has been expanded a number of times.

----------------------------------------------------------------------
Explicit example...
----------------------------------------------------------------------

2018-03-04 21:40:44 data + md5 files written
2018-03-04 22:43:33 checksum mismatch detected

file size: 31232491008 bytes

The file is moved to "badfile", and the file regenerated from source
data as "goodfile".

"cmp -l badfile goodfile" shows there are 256 bytes differing, in the
2nd half of (512b) block 53906431.
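
One way to derive those numbers from cmp's output, as a rough sketch
(cmp -l reports 1-based byte offsets):

$ cmp -l badfile goodfile | head -1 | awk '{
      off = $1 - 1                           # convert to 0-based
      print "512b sector:    " int(off / 512)
      print "byte in sector: " off % 512     # >= 256 means 2nd half
  }'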

$ dd if=badfile bs=512 skip=53906431 count=1 | od -t x2
0000000 4579 df86 376e a4dd 22d6 0a6a 845d c6c3
0000020 78c2 56b1 6344 e371 8ed3 f16e 691b b329
0000040 cee2 ab84 bfb5 f9f3 1a3c 23b8 33d1 e70c
0000060 8135 9dbb aaf8 be26 fea7 8446 bd39 6b28
0000100 7895 3f84 c07d 95a3 c79b 11e3 28cb dcdd
0000120 5e75 b945 cd8e 46c6 53b8 a0f2 dad3 a68b
0000140 5361 b5b4 09c9 8264 bf18 ede5 4177 0a5c
0000160 ddc7 4927 6b24 80c9 8f4c 76ac 1ae3 1df9
0000200 b477 3be0 c60a 9355 53e0 925f 4b8d 162c
0000220 2431 788f 4024 16ae 226e 51c4 6b85 392d
0000240 5283 a918 b97a c85c 7b34 e341 7689 0468
0000260 a4f1 a94a 0798 e5e3 435a 5ee4 3ab4 af1c
0000300 426a e484 7d2e 4e37 f2ef 95b3 fcf5 8fc8
0000320 a9d2 e50d 61ae 76bd 5ad9 6d00 67c3 3fcc
0000340 a610 7edd fe05 46bf 78c1 c70b 1829 11b7
0000360 9a34 c496 5161 c546 43cd 7eb8 ff70 473a
x
0000400 11d2 8c94 00ed c9cc d299 5fcf 38ee 5358
0000420 6da3 f8fd 8495 906e cf6f 3c12 94d7 a236
0000440 4150 98ce 22a6 68b0 f6f3 b2e0 f857 0719
0000460 58c9 abbf 059f 1092 c122 7592 a95e c736
0000500 aca4 4bd6 2ce0 1d4e 6097 9054 6f25 519c
0000520 187b 2598 8c1d 33ba 49fa 9cb6 e55c 779d
0000540 347f e1f2 8c6d fc06 5398 d675 ae49 4206
0000560 e343 7e08 b24a ed18 504b 4f28 5479 d492
0000600 1a88 fe80 6d19 0982 629a e06b e24a c78e
0000620 c2a9 370d f249 41ab 103b 0256 d0b2 b545
0000640 736d 430f c8a4 cf19 e5fb 5378 5889 7f3a
0000660 0dee e401 abcf 1d0d 5af2 5abe 0cbb 07a5
0000700 79ee 75d0 1bb7 68ee 5566 c057 45f9 a8ca
0000720 ee5d 3d86 b557 8d11 92cc 9b21 d421 fe81
0000740 8657 ffd6 e20d 01be 4e02 6049 540e b7f7
0000760 dfd4 4a0b 2a60 978c a6b1 2a8a 3e98 bcc5
0001000

$ dd if=goodfile bs=512 skip=53906431 count=1 | od -t x2
0000000 4579 df86 376e a4dd 22d6 0a6a 845d c6c3
0000020 78c2 56b1 6344 e371 8ed3 f16e 691b b329
0000040 cee2 ab84 bfb5 f9f3 1a3c 23b8 33d1 e70c
0000060 8135 9dbb aaf8 be26 fea7 8446 bd39 6b28
0000100 7895 3f84 c07d 95a3 c79b 11e3 28cb dcdd
0000120 5e75 b945 cd8e 46c6 53b8 a0f2 dad3 a68b
0000140 5361 b5b4 09c9 8264 bf18 ede5 4177 0a5c
0000160 ddc7 4927 6b24 80c9 8f4c 76ac 1ae3 1df9
0000200 b477 3be0 c60a 9355 53e0 925f 4b8d 162c
0000220 2431 788f 4024 16ae 226e 51c4 6b85 392d
0000240 5283 a918 b97a c85c 7b34 e341 7689 0468
0000260 a4f1 a94a 0798 e5e3 435a 5ee4 3ab4 af1c
0000300 426a e484 7d2e 4e37 f2ef 95b3 fcf5 8fc8
0000320 a9d2 e50d 61ae 76bd 5ad9 6d00 67c3 3fcc
0000340 a610 7edd fe05 46bf 78c1 c70b 1829 11b7
0000360 9a34 c496 5161 c546 43cd 7eb8 ff70 473a
x
0000400 3bf1 6176 7e4b f1ce 1e3c b747 4b16 8406
0000420 1e48 d38f ad9d edf0 11c6 fa63 6a7f b973
0000440 c90b 6745 be94 8090 d547 3c78 a8c9 ea94
0000460 498d 3115 cc88 8fb7 4f1d 8c1e f947 64d2
0000500 278f 2899 d2f1 d22f fcf0 7523 e3c7 a66e
0000520 a269 cac4 ae3d e551 1339 4d14 c0aa 52bc
0000540 b320 e0ed 46a7 bb93 1397 574c 1ed5 278f
0000560 8487 48d8 e24b 8882 9eef f64c 4c9a d916
0000600 d391 ddf8 4e13 4572 58e4 abcc 6f48 9c7e
0000620 4dda 2aa6 c8f2 4ac8 7002 a33b db8d fd00
0000640 3f4c 1cd1 89cf fa98 5692 b426 5b53 5e7e
0000660 7129 cf5f e3c8 fcf1 b378 1e31 de4f a0d7
0000700 9276 532d 3885 3bb1 93ca 87b8 2804 7d0b
0000720 68ec bc9b 624a 7249 3788 4d20 d5ac ecf6
0000740 2122 bbb8 dc49 2759 27b9 03a8 7ffa 5b6a
0000760 7ad1 a846 d795 6cfe bc1e c014 442a a93d
0001000

$ xfs_bmap -v badfile
badfile:
 EXT: FILE-OFFSET           BLOCK-RANGE                 AG AG-OFFSET                   TOTAL FLAGS
   0: [0..31743]:           281349379072..281349410815 131 (29155328..29187071)        31744 000011
   1: [31744..64511]:       281351100416..281351133183 131 (30876672..30909439)        32768 000011
   2: [64512..130047]:      281383613440..281383678975 131 (63389696..63455231)        65536 000011
   3: [130048..523263]:     281479251968..281479645183 131 (159028224..159421439)     393216 000011
   4: [523264..1047551]:    281513342976..281513867263 131 (193119232..193643519)     524288 000011
   5: [1047552..2096127]:   281627355136..281628403711 131 (307131392..308179967)    1048576 000011
   6: [2096128..5421055]:   281882829824..281886154751 131 (562606080..565931007)    3324928 000011
   7: [5421056..8386943]:   281904449536..281907415423 131 (584225792..587191679)    2965888 000111
   8: [8386944..8388543]:   281970693120..281970694719 131 (650469376..650470975)       1600 000111
   9: [8388544..8585215]:   281974888448..281975085119 131 (654664704..654861375)     196672 000111
  10: [8585216..9371647]:   281977619456..281978405887 131 (657395712..658182143)     786432 000011
  11: [9371648..12517375]:  281970695168..281973840895 131 (650471424..653617151)    3145728 000011
  12: [12517376..16465919]: 282179899392..282183847935 131 (859675648..863624191)    3948544 000011
  13: [16465920..20660223]: 282295112704..282299307007 131 (974888960..979083263)    4194304 000011
  14: [20660224..29048831]: 282533269504..282541658111 131 (1213045760..1221434367)  8388608 000010
  15: [29048832..45826039]: 286146131968..286162909175 133 (530942976..547720183)   16777208 000111
  16: [45826040..58243047]: 289315926016..289328343023 134 (1553254400..1565671407) 12417008 000111
  17: [58243048..61000959]: 294169719808..294172477719 136 (2112082944..2114840855)  2757912 000111

I.e. the corruption (in 512b sector 53906431) occurs part way through
extent 16, and not on an ag boundary.

Just to make sure we're not hitting some other boundary on the
underlying infrastructure, which might hint the problem could be there,
let's see where the file sector lies...

From extent 16, the actual corrupt sector offset within the lv device
underneath xfs is:

289315926016 + (53906431 - 45826040) == 289324006407

Then we can look at the devices underneath the lv:

# lvs --units s -o lv_name,seg_start,seg_size,devices
  LV    Start         SSize         Devices
  bigfs            0S 105486999552S /dev/md0(0)
  bigfs 105486999552S 105487007744S /dev/md4(0)
  bigfs 210974007296S 105487007744S /dev/md9(0)
  bigfs 316461015040S  35160866816S /dev/md1(0)
  bigfs 351621881856S 105487007744S /dev/md5(0)
  bigfs 457108889600S  70323920896S /dev/md3(0)

Comparing our corrupt sector lv offset with the start sector of each md
device, we can see the corrupt sector is within /dev/md9 and not at a
boundary. The corrupt sector offset within the lv data on md9 is given
by:

289324006407 - 210974007296 == 78349999111

The lv data itself is offset within /dev/md9 and the offset can be seen
by:

# pvs --unit s -o pv_name,pe_start
  PV         1st PE
  /dev/md0     9216S
  /dev/md1     9216S
  /dev/md3     9216S
  /dev/md4     9216S
  /dev/md5     9216S
  /dev/md9     9216S

...so the lv data starts at sector 9216 of the md, which means the
corrupt sector is at this offset within /dev/md9:

9216 + 78349999111 == 78350008327

Confirm the calculations are correct by comparing the corrupt sector
from the file with the calculated sector on the md device:

# {
  dd if=badfile of=/tmp/foo.1 bs=512 skip=53906431 count=1
  dd if=/dev/md9 of=/tmp/foo.2 bs=512 skip=78350008327 count=1
  cmp /tmp/foo.{1,2} && echo "got it" || echo "try again"
}
got it
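
Pulling the arithmetic together in one place (shell, numbers as per this
example):

$ file_sector=53906431       # corrupt 512b sector within the file
$ ext_foff=45826040          # extent 16 FILE-OFFSET start
$ ext_lvoff=289315926016     # extent 16 BLOCK-RANGE start (LV sectors)
$ seg_start=210974007296     # md9 segment start within the LV (lvs)
$ pe_start=9216              # LV data offset within md9 (pvs)
$ echo $(( ext_lvoff + (file_sector - ext_foff) - seg_start + pe_start ))
78350008327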

----------------------------------------------------------------------


I'd appreciate some pointers towards tracking down what's going on - or
even better, which version of linux I should upgrade to to make the
problem disappear!

Cheers,

Chris.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 15:02 file corruptions, 2nd half of 512b block Chris Dunlop
@ 2018-03-22 18:03 ` Brian Foster
  2018-03-22 23:04   ` Dave Chinner
  2018-03-27 22:33   ` Chris Dunlop
  0 siblings, 2 replies; 11+ messages in thread
From: Brian Foster @ 2018-03-22 18:03 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> Hi,
> 
> I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.
> 
> System configuration details below.
> 
> For those cases where the corrupt file can be regenerated from other
> data and the new file compared to the corrupt file (15 files in all),
> the corruptions are invariably in the 2nd 256b half of a 512b sector,
> part way through the file. That's pretty odd! Perhaps some kind of
> buffer tail problem?
> 
> Are there any known issues that might cause this?
> 

Nothing that I can think of. A quick look through the writeback changes
shows this[1] commit, but I'd expect any corruption in that case to
manifest as page size (4k) rather than at 256b granularity.

[1] 40214d128e ("xfs: trim writepage mapping to within eof")

> ------
> Late addendum... this is the same system, but different FS, where I
> experienced this:
> 
> https://www.spinics.net/lists/linux-xfs/msg14876.html
> 
> To my vast surprise I see the box is still on the same kernel without
> the patch per that message. (I must have been sleep deprived, I would
> have sworn that it was upgraded.) Is this possibly the same underlying
> problem?
> ------
> 
> Further details...
> 
> The corruptions are being flagged by a mismatched md5. The file
> generator calculates the md5 of the data as it's being generated (i.e.
> before it hits storage), and saves the md5 in a separate file alongside
> the data file. The corruptions are being found by comparing the
> previously calculated md5 with the current file contents. The xfs sits
> on raid6 which checks clean, so it seems like it's not a hdd problem.
> 
> The box was upgraded from 3.18.25 to 4.9.76 on 2018-01-15. There's a
> good chance this was when the corruptions started as the earliest
> confirmed corruption is in a file generated 2018-02-04, and there may
> have been another on 2018-01-23. However it's also possible there were
> earlier (and maybe even much earlier) corruptions which weren't being
> picked up.
> 
> A scan through the commit log between 4.9.76 and current stable (.88)
> for xfs bits doesn't show anything that stands out as relevant, at least
> to my eyes. I've also looked between 4.9 and current HEAD, but there
> are of course a /lot/ of xfs updates there and I'm afraid it's way too
> easy for me to miss any relevant changes.
> 
> The file generator either runs remotely, and the data (and md5) arrives
> via FTP, or runs locally, and the data (and md5) is written via NFS. The
> corruptions have occurred in both cases.
> 
> These files are generally in the 10s of GB range, with a few at 1-3GB,
> and a few in the low 100s of GB. All but one of the corrupt files have a
> single 256b corruption, with the other having two separate corruptions
> (each in the 2nd half of a 512b sector).
> 
> Overall we've received ~33k files since the o/s change, and have
> identified about 34 corrupt files amongst those. Unfortunately some
> parts of the generator aren't deterministic so we can't compare corrupt
> files with regenerated files in all cases - per above, we've been able
> to compare 15 of these files, with no discernible pattern other than the
> corruption always occurring in the 2nd 256b of a 512b block.
> 
> Using xfs_bmap to investigate the corrupt files and where the corrupt
> data sits, and digging down further into the LV, PV, md and hdd levels,
> there's no consistency or discernible pattern of placement of the
> corruptions at any level: ag, md, hdd.
> 
> Eyeballing the corrupted blocks and matching good blocks doesn't show
> any obvious pattern. The files themselves contain compressed data so
> it's all highly random at the block level, and the corruptions
> themselves similarly look like random bytes.
> 
> The corrupt blocks are not a copy of other data in the file within the
> surrounding 256k of the corrupt block.
> 

So you obviously have a fairly large/complex storage configuration. I
think you have to assume that this corruption could be introduced pretty
much anywhere in the stack (network, mm, fs, block layer, md) until it
can be narrowed down.

> ----------------------------------------------------------------------
> System configuration
> ----------------------------------------------------------------------
> 
> linux-4.9.76
> xfsprogs 4.10
> CPU: 2 x E5620 (16 cores total)
> 192G RAM
> 
> # grep bigfs /etc/mtab
> /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> # xfs_info /bigfs
> meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
>         =                       sectsz=4096  attr=2, projid32bit=1
>         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
>         =                       reflink=0
> data     =                       bsize=4096   blocks=65929101312, imaxpct=5
>         =                       sunit=128    swidth=1152 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal               bsize=4096   blocks=521728, version=2
>         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
> 
> The raids all check clean.
> 
> The XFS has been expanded a number of times.
> 
> ----------------------------------------------------------------------
> Explicit example...
> ----------------------------------------------------------------------
> 
> 2018-03-04 21:40:44 data + md5 files written
> 2018-03-04 22:43:33 checksum mismatch detected
> 

Seems like the corruption is detected fairly soon after creation. How
often are these files explicitly checked/read? I also assume the files
aren't ever modified..?

FWIW, the patterns that you have shown so far do seem to suggest
something higher level than a physical storage problem. Otherwise, I'd
expect these instances wouldn't always necessarily land in file data.
Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
problems?

OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
blocks. I suppose that could suggest some kind of memory/cache
corruption as opposed to a bad page/extent state or something of that
nature.

Hmm, I guess the only productive thing I can think of right now is to
see if you can try and detect the problem as soon as possible. For e.g.,
it sounds like this is a closed system. If so, could you follow up every
file creation with an immediate md5 verification (perhaps followed by an
fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
pagecache)? Perhaps others might have further ideas..
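
For reference, a minimal sketch of that double check, using vmtouch's
evict option as the fadvise(DONTNEED) wrapper (the path is hypothetical):

f=/bigfs/path/to/newfile
md5sum "$f" > /tmp/md5.cached        # first check, likely from page cache
vmtouch -e "$f"                      # evict cached pages (fadvise DONTNEED)
md5sum "$f" > /tmp/md5.disk          # second check, forced back to disk
cmp -s /tmp/md5.cached /tmp/md5.disk || echo "cache vs disk mismatch: $f"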

> file size: 31232491008 bytes
> 
> The file is moved to "badfile", and the file regenerated from source
> data as "goodfile".
> 
> "cmp -l badfile goodfile" shows there are 256 bytes differing, in the
> 2nd half of (512b) block 53906431.
> 

FWIW, that's the last (512b) sector of the associated (4k) page. Does
that happen to be consistent across whatever other instances you have a
record of?

Brian

> $ dd if=badfile bs=512 skip=53906431 count=1 | od -t x2
> 0000000 4579 df86 376e a4dd 22d6 0a6a 845d c6c3
> 0000020 78c2 56b1 6344 e371 8ed3 f16e 691b b329
> 0000040 cee2 ab84 bfb5 f9f3 1a3c 23b8 33d1 e70c
> 0000060 8135 9dbb aaf8 be26 fea7 8446 bd39 6b28
> 0000100 7895 3f84 c07d 95a3 c79b 11e3 28cb dcdd
> 0000120 5e75 b945 cd8e 46c6 53b8 a0f2 dad3 a68b
> 0000140 5361 b5b4 09c9 8264 bf18 ede5 4177 0a5c
> 0000160 ddc7 4927 6b24 80c9 8f4c 76ac 1ae3 1df9
> 0000200 b477 3be0 c60a 9355 53e0 925f 4b8d 162c
> 0000220 2431 788f 4024 16ae 226e 51c4 6b85 392d
> 0000240 5283 a918 b97a c85c 7b34 e341 7689 0468
> 0000260 a4f1 a94a 0798 e5e3 435a 5ee4 3ab4 af1c
> 0000300 426a e484 7d2e 4e37 f2ef 95b3 fcf5 8fc8
> 0000320 a9d2 e50d 61ae 76bd 5ad9 6d00 67c3 3fcc
> 0000340 a610 7edd fe05 46bf 78c1 c70b 1829 11b7
> 0000360 9a34 c496 5161 c546 43cd 7eb8 ff70 473a
> x
> 0000400 11d2 8c94 00ed c9cc d299 5fcf 38ee 5358
> 0000420 6da3 f8fd 8495 906e cf6f 3c12 94d7 a236
> 0000440 4150 98ce 22a6 68b0 f6f3 b2e0 f857 0719
> 0000460 58c9 abbf 059f 1092 c122 7592 a95e c736
> 0000500 aca4 4bd6 2ce0 1d4e 6097 9054 6f25 519c
> 0000520 187b 2598 8c1d 33ba 49fa 9cb6 e55c 779d
> 0000540 347f e1f2 8c6d fc06 5398 d675 ae49 4206
> 0000560 e343 7e08 b24a ed18 504b 4f28 5479 d492
> 0000600 1a88 fe80 6d19 0982 629a e06b e24a c78e
> 0000620 c2a9 370d f249 41ab 103b 0256 d0b2 b545
> 0000640 736d 430f c8a4 cf19 e5fb 5378 5889 7f3a
> 0000660 0dee e401 abcf 1d0d 5af2 5abe 0cbb 07a5
> 0000700 79ee 75d0 1bb7 68ee 5566 c057 45f9 a8ca
> 0000720 ee5d 3d86 b557 8d11 92cc 9b21 d421 fe81
> 0000740 8657 ffd6 e20d 01be 4e02 6049 540e b7f7
> 0000760 dfd4 4a0b 2a60 978c a6b1 2a8a 3e98 bcc5
> 0001000
> 
> $ dd if=goodfile bs=512 skip=53906431 count=1 | od -t x2
> 0000000 4579 df86 376e a4dd 22d6 0a6a 845d c6c3
> 0000020 78c2 56b1 6344 e371 8ed3 f16e 691b b329
> 0000040 cee2 ab84 bfb5 f9f3 1a3c 23b8 33d1 e70c
> 0000060 8135 9dbb aaf8 be26 fea7 8446 bd39 6b28
> 0000100 7895 3f84 c07d 95a3 c79b 11e3 28cb dcdd
> 0000120 5e75 b945 cd8e 46c6 53b8 a0f2 dad3 a68b
> 0000140 5361 b5b4 09c9 8264 bf18 ede5 4177 0a5c
> 0000160 ddc7 4927 6b24 80c9 8f4c 76ac 1ae3 1df9
> 0000200 b477 3be0 c60a 9355 53e0 925f 4b8d 162c
> 0000220 2431 788f 4024 16ae 226e 51c4 6b85 392d
> 0000240 5283 a918 b97a c85c 7b34 e341 7689 0468
> 0000260 a4f1 a94a 0798 e5e3 435a 5ee4 3ab4 af1c
> 0000300 426a e484 7d2e 4e37 f2ef 95b3 fcf5 8fc8
> 0000320 a9d2 e50d 61ae 76bd 5ad9 6d00 67c3 3fcc
> 0000340 a610 7edd fe05 46bf 78c1 c70b 1829 11b7
> 0000360 9a34 c496 5161 c546 43cd 7eb8 ff70 473a
> x
> 0000400 3bf1 6176 7e4b f1ce 1e3c b747 4b16 8406
> 0000420 1e48 d38f ad9d edf0 11c6 fa63 6a7f b973
> 0000440 c90b 6745 be94 8090 d547 3c78 a8c9 ea94
> 0000460 498d 3115 cc88 8fb7 4f1d 8c1e f947 64d2
> 0000500 278f 2899 d2f1 d22f fcf0 7523 e3c7 a66e
> 0000520 a269 cac4 ae3d e551 1339 4d14 c0aa 52bc
> 0000540 b320 e0ed 46a7 bb93 1397 574c 1ed5 278f
> 0000560 8487 48d8 e24b 8882 9eef f64c 4c9a d916
> 0000600 d391 ddf8 4e13 4572 58e4 abcc 6f48 9c7e
> 0000620 4dda 2aa6 c8f2 4ac8 7002 a33b db8d fd00
> 0000640 3f4c 1cd1 89cf fa98 5692 b426 5b53 5e7e
> 0000660 7129 cf5f e3c8 fcf1 b378 1e31 de4f a0d7
> 0000700 9276 532d 3885 3bb1 93ca 87b8 2804 7d0b
> 0000720 68ec bc9b 624a 7249 3788 4d20 d5ac ecf6
> 0000740 2122 bbb8 dc49 2759 27b9 03a8 7ffa 5b6a
> 0000760 7ad1 a846 d795 6cfe bc1e c014 442a a93d
> 0001000
> 
> $ xfs_bmap -v badfile
> badfile:
> EXT: FILE-OFFSET           BLOCK-RANGE                 AG AG-OFFSET                   TOTAL FLAGS
>   0: [0..31743]:           281349379072..281349410815 131 (29155328..29187071)        31744 000011
>   1: [31744..64511]:       281351100416..281351133183 131 (30876672..30909439)        32768 000011
>   2: [64512..130047]:      281383613440..281383678975 131 (63389696..63455231)        65536 000011
>   3: [130048..523263]:     281479251968..281479645183 131 (159028224..159421439)     393216 000011
>   4: [523264..1047551]:    281513342976..281513867263 131 (193119232..193643519)     524288 000011
>   5: [1047552..2096127]:   281627355136..281628403711 131 (307131392..308179967)    1048576 000011
>   6: [2096128..5421055]:   281882829824..281886154751 131 (562606080..565931007)    3324928 000011
>   7: [5421056..8386943]:   281904449536..281907415423 131 (584225792..587191679)    2965888 000111
>   8: [8386944..8388543]:   281970693120..281970694719 131 (650469376..650470975)       1600 000111
>   9: [8388544..8585215]:   281974888448..281975085119 131 (654664704..654861375)     196672 000111
>  10: [8585216..9371647]:   281977619456..281978405887 131 (657395712..658182143)     786432 000011
>  11: [9371648..12517375]:  281970695168..281973840895 131 (650471424..653617151)    3145728 000011
>  12: [12517376..16465919]: 282179899392..282183847935 131 (859675648..863624191)    3948544 000011
>  13: [16465920..20660223]: 282295112704..282299307007 131 (974888960..979083263)    4194304 000011
>  14: [20660224..29048831]: 282533269504..282541658111 131 (1213045760..1221434367)  8388608 000010
>  15: [29048832..45826039]: 286146131968..286162909175 133 (530942976..547720183)   16777208 000111
>  16: [45826040..58243047]: 289315926016..289328343023 134 (1553254400..1565671407) 12417008 000111
>  17: [58243048..61000959]: 294169719808..294172477719 136 (2112082944..2114840855)  2757912 000111
> 
> I.e. the corruption (in 512b sector 53906431) occurs part way through
> extent 16, and not on an ag boundary.
> 
> Just to make sure we're not hitting some other boundary on the
> underlying infrastructure, which might hint the problem could be there,
> let's see where the file sector lies...
> 
> From extent 16, the actual corrupt sector offset within the lv device
> underneath xfs is:
> 
> 289315926016 + (53906431 - 45826040) == 289324006407
> 
> Then we can look at the devices underneath the lv:
> 
> # lvs --units s -o lv_name,seg_start,seg_size,devices
>  LV    Start         SSize         Devices
>  bigfs            0S 105486999552S /dev/md0(0)
>  bigfs 105486999552S 105487007744S /dev/md4(0)
>  bigfs 210974007296S 105487007744S /dev/md9(0)
>  bigfs 316461015040S  35160866816S /dev/md1(0)
>  bigfs 351621881856S 105487007744S /dev/md5(0)
>  bigfs 457108889600S  70323920896S /dev/md3(0)
> 
> Comparing our corrupt sector lv offset with the start sector of each md
> device, we can see the corrupt sector is within /dev/md9 and not at a
> boundary. The corrupt sector offset within the lv data on md9 is given
> by:
> 
> 289324006407 - 210974007296 == 78349999111
> 
> The lv data itself is offset within /dev/md9 and the offset can be seen
> by:
> 
> # pvs --unit s -o pv_name,pe_start
>  PV         1st PE
>  /dev/md0     9216S
>  /dev/md1     9216S
>  /dev/md3     9216S
>  /dev/md4     9216S
>  /dev/md5     9216S
>  /dev/md9     9216S
> 
> ...so the lv data starts at sector 9216 of the md, which means the
> corrupt sector is at this offset within /dev/md9:
> 
> 9216 + 78349999111 == 78350008327
> 
> Confirm the calculations are correct by comparing the corrupt sector
> from the file with the calculated sector on the md device:
> 
> # {
>  dd if=badfile of=/tmp/foo.1 bs=512 skip=53906431 count=1
>  dd if=/dev/md9 of=/tmp/foo.2 bs=512 skip=78350008327 count=1
>  cmp /tmp/foo.{1,2} && echo "got it" || echo "try again"
> }
> got it
> 
> ----------------------------------------------------------------------
> 
> 
> I'd appreciate some pointers towards tracking down what's going on - or
> even better, which version of linux I should upgrade to to make the
> problem disappear!
> 
> Cheers,
> 
> Chris.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 18:03 ` Brian Foster
@ 2018-03-22 23:04   ` Dave Chinner
  2018-03-22 23:26     ` Darrick J. Wong
  2018-03-28 15:20     ` Chris Dunlop
  2018-03-27 22:33   ` Chris Dunlop
  1 sibling, 2 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-22 23:04 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chris Dunlop, linux-xfs

On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > Eyeballing the corrupted blocks and matching good blocks doesn't show
> > any obvious pattern. The files themselves contain compressed data so
> > it's all highly random at the block level, and the corruptions
> > themselves similarly look like random bytes.
> > 
> > The corrupt blocks are not a copy of other data in the file within the
> > surrounding 256k of the corrupt block.
> > 
> 
> So you obviously have a fairly large/complex storage configuration. I
> think you have to assume that this corruption could be introduced pretty
> much anywhere in the stack (network, mm, fs, block layer, md) until it
> can be narrowed down.
> 
> > ----------------------------------------------------------------------
> > System configuration
> > ----------------------------------------------------------------------
> > 
> > linux-4.9.76
> > xfsprogs 4.10
> > CPU: 2 x E5620 (16 cores total)
> > 192G RAM
> > 
> > # grep bigfs /etc/mtab
> > /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> > # xfs_info /bigfs
> > meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
> >         =                       sectsz=4096  attr=2, projid32bit=1
> >         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> >         =                       reflink=0
> > data     =                       bsize=4096   blocks=65929101312, imaxpct=5
> >         =                       sunit=128    swidth=1152 blks
> > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > log      =internal               bsize=4096   blocks=521728, version=2
> >         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.

Are these all on the one raid controller? i.e. what's the physical
layout of all these disks?

> > The raids all check clean.
> > 
> > The XFS has been expanded a number of times.
> > 
> > ----------------------------------------------------------------------
> > Explicit example...
> > ----------------------------------------------------------------------
> > 
> > 2018-03-04 21:40:44 data + md5 files written
> > 2018-03-04 22:43:33 checksum mismatch detected
> 
> Seems like the corruption is detected fairly soon after creation. How
> often are these files explicitly checked/read? I also assume the files
> aren't ever modified..?
> 
> FWIW, the patterns that you have shown so far do seem to suggest
> something higher level than a physical storage problem. Otherwise, I'd
> expect these instances wouldn't always necessarily land in file data.
> Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> problems?
> 
> OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> blocks. I suppose that could suggest some kind of memory/cache
> corruption as opposed to a bad page/extent state or something of that
> nature.

Especially with the data write mechanisms being used - e.g. NFS
won't be doing partial sector reads and writes for data transfer -
it'll all be done in blocks much larger than the filesystem block
size (e.g. 1MB IOs).

> Hmm, I guess the only productive thing I can think of right now is to
> see if you can try and detect the problem as soon as possible. For e.g.,
> it sounds like this is a closed system. If so, could you follow up every
> file creation with an immediate md5 verification (perhaps followed by an
> fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> pagecache)? Perhaps others might have further ideas..

Basically, the only steps now are a methodical, layer by layer
checking of the IO path to isolate where the corruption is being
introduced. First you need a somewhat reliable reproducer that can
be used for debugging.

Write patterned files (e.g. encode a file id, file offset and 16 bit
cksum in every 8 byte chunk) and then verify them. When you get a
corruption, the corrupted data will tell you where the corruption
came from. It'll either be silent bit flips, some other files' data,
or it will be stale data. See if the corruption pattern is
consistent. See if the locations correlate to a single disk, a
single raid controller, a single backplane, etc. i.e. try to find
some pattern to the corruption.
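
A quick stand-in for that kind of tool (not the 8-byte binary layout
described above, but 32-byte ASCII records carrying a file id, record
offset and a cheap 16-bit checksum; the function names are made up):

genfile() {     # usage: genfile <fileid> <nrecords>  > patterned-file
    awk -v id="$1" -v n="$2" 'BEGIN {
        for (i = 0; i < n; i++) {
            sum = (id * 31 + i) % 65536                 # 16-bit "cksum"
            printf "%08x %010d %05d      \n", id, i * 32, sum
        }
    }'
}

checkfile() {   # usage: checkfile <fileid>  < patterned-file
    awk -v id="$1" '{
        sum = (id * 31 + (NR - 1)) % 65536
        want = sprintf("%08x %010d %05d      ", id, (NR - 1) * 32, sum)
        if ($0 != want)
            printf "bad record at byte offset %d\n", (NR - 1) * 32
    }'
}

# e.g. generate a ~32MB patterned file and verify it later:
#   genfile 1 1000000 > /bigfs/pattern.0001
#   checkfile 1 < /bigfs/pattern.0001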

Unfortunately, I can't find the repository for the data checking
tools that were developed years ago for doing exactly this sort of
testing (genstream+checkstream) online anymore - they seem to
have disappeared from the internet. (*) Shouldn't be too hard to
write a quick tool to do this, though.

Also worth testing is whether the same corruption occurs when you
use direct IO to write and read the files. That would rule out a
large chunk of the filesystem and OS code as the cause of the
corruption.
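
E.g. something like (paths hypothetical; the test file's size should be a
multiple of the 4k sector size, which O_DIRECT requires):

$ dd if=pattern.0001 of=/bigfs/pattern.direct bs=1M oflag=direct
$ dd if=/bigfs/pattern.direct bs=1M iflag=direct | md5sum
$ md5sum pattern.0001        # the two hashes should match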

(*) Google is completely useless for searching for historic things,
mailing lists and/or code these days. Searching google now reminds
of the bad old days of AltaVista - "never finds what I'm looking
for"....

> > file size: 31232491008 bytes
> > 
> > The file is moved to "badfile", and the file regenerated from source
> > data as "goodfile".

What does "regenerated from source" mean?

Does that mean a new file is created, compressed and then copied
across? Or is it just the original file being copied again?

> > From extent 16, the actual corrupt sector offset within the lv device
> > underneath xfs is:
> > 
> > 289315926016 + (53906431 - 45826040) == 289324006407
> > 
> > Then we can look at the devices underneath the lv:
> > 
> > # lvs --units s -o lv_name,seg_start,seg_size,devices
> >  LV    Start         SSize         Devices
> >  bigfs            0S 105486999552S /dev/md0(0)
> >  bigfs 105486999552S 105487007744S /dev/md4(0)
> >  bigfs 210974007296S 105487007744S /dev/md9(0)
> >  bigfs 316461015040S  35160866816S /dev/md1(0)
> >  bigfs 351621881856S 105487007744S /dev/md5(0)
> >  bigfs 457108889600S  70323920896S /dev/md3(0)
> > 
> > Comparing our corrupt sector lv offset with the start sector of each md
> > device, we can see the corrupt sector is within /dev/md9 and not at a
> > boundary. The corrupt sector offset within the lv data on md9 is given
> > by:

Does the problem always occur on /dev/md9?

If so, does the location correlate to a single disk in /dev/md9?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 23:04   ` Dave Chinner
@ 2018-03-22 23:26     ` Darrick J. Wong
  2018-03-22 23:49       ` Dave Chinner
  2018-03-28 15:20     ` Chris Dunlop
  1 sibling, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2018-03-22 23:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Chris Dunlop, linux-xfs

On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> > On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > > Eyeballing the corrupted blocks and matching good blocks doesn't show
> > > any obvious pattern. The files themselves contain compressed data so
> > > it's all highly random at the block level, and the corruptions
> > > themselves similarly look like random bytes.
> > > 
> > > The corrupt blocks are not a copy of other data in the file within the
> > > surrounding 256k of the corrupt block.
> > > 
> > 
> > So you obviously have a fairly large/complex storage configuration. I
> > think you have to assume that this corruption could be introduced pretty
> > much anywhere in the stack (network, mm, fs, block layer, md) until it
> > can be narrowed down.
> > 
> > > ----------------------------------------------------------------------
> > > System configuration
> > > ----------------------------------------------------------------------
> > > 
> > > linux-4.9.76
> > > xfsprogs 4.10
> > > CPU: 2 x E5620 (16 cores total)
> > > 192G RAM
> > > 
> > > # grep bigfs /etc/mtab
> > > /dev/mapper/vg00-bigfs /bigfs xfs rw,noatime,attr2,inode64,logbsize=256k,sunit=1024,swidth=9216,noquota 0 0
> > > # xfs_info /bigfs
> > > meta-data=/dev/mapper/vg00-bigfs isize=512    agcount=246, agsize=268435328 blks
> > >         =                       sectsz=4096  attr=2, projid32bit=1
> > >         =                       crc=1        finobt=1 spinodes=0 rmapbt=0
> > >         =                       reflink=0
> > > data     =                       bsize=4096   blocks=65929101312, imaxpct=5
> > >         =                       sunit=128    swidth=1152 blks
> > > naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> > > log      =internal               bsize=4096   blocks=521728, version=2
> > >         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > > 
> > > XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
> 
> Are these all on the one raid controller? i.e. what's the physical
> layout of all these disks?
> 
> > > The raids all check clean.
> > > 
> > > The XFS has been expanded a number of times.
> > > 
> > > ----------------------------------------------------------------------
> > > Explicit example...
> > > ----------------------------------------------------------------------
> > > 
> > > 2018-03-04 21:40:44 data + md5 files written
> > > 2018-03-04 22:43:33 checksum mismatch detected
> > 
> > Seems like the corruption is detected fairly soon after creation. How
> > often are these files explicitly checked/read? I also assume the files
> > aren't ever modified..?
> > 
> > FWIW, the patterns that you have shown so far do seem to suggest
> > something higher level than a physical storage problem. Otherwise, I'd
> > expect these instances wouldn't always necessarily land in file data.
> > Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> > problems?
> > 
> > OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> > blocks. I suppose that could suggest some kind of memory/cache
> > corruption as opposed to a bad page/extent state or something of that
> > nature.
> 
> Especially with the data write mechanisms being used - e.g. NFS
> won't be doing partial sector reads and writes for data transfer -
> it'll all be done in blocks much larger than the filesystem block
> size (e.g. 1MB IOs).
> 
> > Hmm, I guess the only productive thing I can think of right now is to
> > see if you can try and detect the problem as soon as possible. For e.g.,
> > it sounds like this is a closed system. If so, could you follow up every
> > file creation with an immediate md5 verification (perhaps followed by an
> > fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> > pagecache)? Perhaps others might have further ideas..
> 
> Basically, the only steps now are a methodical, layer by layer
> checking of the IO path to isolate where the corruption is being
> introduced. First you need a somewhat reliable reproducer that can
> be used for debugging.
> 
> Write patterned files (e.g. encode a file id, file offset and 16 bit
> cksum in every 8 byte chunk) and then verify them. When you get a
> corruption, the corrupted data will tell you where the corruption
> came from. It'll either be silent bit flips, some other files' data,
> or it will be stale data. See if the corruption pattern is
> consistent. See if the locations correlate to a single disk, a
> single raid controller, a single backplane, etc. i.e. try to find
> some pattern to the corruption.
> 
> Unfortunately, I can't find the repository for the data checking
> tools that were developed years ago for doing exactly this sort of
> testing (genstream+checkstream) online anymore - they seem to
> have disappeared from the internet. (*) Shouldn't be too hard to
> write a quick tool to do this, though.

https://sourceforge.net/projects/checkstream/ ?

--D

> Also worth testing is whether the same corruption occurs when you
> use direct IO to write and read the files. That would rule out a
> large chunk of the filesystem and OS code as the cause of the
> corruption.
> 
> (*) Google is completely useless for searching for historic things,
> mailing lists and/or code these days. Searching google now reminds me
> of the bad old days of AltaVista - "never finds what I'm looking
> for"....
> 
> > > file size: 31232491008 bytes
> > > 
> > > The file is moved to "badfile", and the file regenerated from source
> > > data as "goodfile".
> 
> What does "regenerated from source" mean?
> 
> Does that mean a new file is created, compressed and then copied
> across? Or is it just the original file being copied again?
> 
> > > From extent 16, the actual corrupt sector offset within the lv device
> > > underneath xfs is:
> > > 
> > > 289315926016 + (53906431 - 45826040) == 289324006407
> > > 
> > > Then we can look at the devices underneath the lv:
> > > 
> > > # lvs --units s -o lv_name,seg_start,seg_size,devices
> > >  LV    Start         SSize         Devices
> > >  bigfs            0S 105486999552S /dev/md0(0)
> > >  bigfs 105486999552S 105487007744S /dev/md4(0)
> > >  bigfs 210974007296S 105487007744S /dev/md9(0)
> > >  bigfs 316461015040S  35160866816S /dev/md1(0)
> > >  bigfs 351621881856S 105487007744S /dev/md5(0)
> > >  bigfs 457108889600S  70323920896S /dev/md3(0)
> > > 
> > > Comparing our corrupt sector lv offset with the start sector of each md
> > > device, we can see the corrupt sector is within /dev/md9 and not at a
> > > boundary. The corrupt sector offset within the lv data on md9 is given
> > > by:
> 
> Does the problem always occur on /dev/md9?
> 
> If so, does the location correlate to a single disk in /dev/md9?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 23:26     ` Darrick J. Wong
@ 2018-03-22 23:49       ` Dave Chinner
  0 siblings, 0 replies; 11+ messages in thread
From: Dave Chinner @ 2018-03-22 23:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, Chris Dunlop, linux-xfs

On Thu, Mar 22, 2018 at 04:26:29PM -0700, Darrick J. Wong wrote:
> On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> > Unfortunately, I can't find the repository for the data checking
> > tools that were developed years ago for doing exactly this sort of
> > testing (genstream+checkstream) online anymore - they seem to
> > have disappeared from the internet. (*) Shouldn't be too hard to
> > write a quick tool to do this, though.
> 
> https://sourceforge.net/projects/checkstream/ ?

Thanks, Darrick!

FWIW, did I mention that google is totally useless as a search
engine these days?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 18:03 ` Brian Foster
  2018-03-22 23:04   ` Dave Chinner
@ 2018-03-27 22:33   ` Chris Dunlop
  2018-03-28 18:09     ` Brian Foster
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Dunlop @ 2018-03-27 22:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
>> Hi,
>>
>> I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.
>>
>> System configuration details below.
>>
>> For those cases where the corrupt file can be regenerated from other
>> data and the new file compared to the corrupt file (15 files in all),
>> the corruptions are invariably in the 2nd 256b half of a 512b sector,
>> part way through the file. That's pretty odd! Perhaps some kind of
>> buffer tail problem?
>>
>> Are there any known issues that might cause this?
>
> Nothing that I can think of. A quick look through the writeback changes
> shows this[1] commit, but I'd expect any corruption in that case to
> manifest as page size (4k) rather than at 256b granularity.
>
> [1] 40214d128e ("xfs: trim writepage mapping to within eof")

Looks like that issue can occur if the file is closed, then reopened and 
appended to. That's possible with the files written via ftp (the ftp 
upload allows for continuation of partial files), but not the files 
written via NFS - if they're incomplete they're removed and started from 
scratch.

> So you obviously have a fairly large/complex storage configuration. I
> think you have to assume that this corruption could be introduced pretty
> much anywhere in the stack (network, mm, fs, block layer, md) until it
> can be narrowed down.

Yup.

Per below I'm seeing a good checksum a bit after arrival and a bad 
checksum some time later, so it looks like it's not the network.

>> 2018-03-04 21:40:44 data + md5 files written
>> 2018-03-04 22:43:33 checksum mismatch detected
>
> Seems like the corruption is detected fairly soon after creation. How
> often are these files explicitly checked/read? I also assume the files
> aren't ever modified..?

Correct, the files aren't ever (deliberately) modified.

The files are generally checked once, some time (minutes to hours) after 
landing. After the first check I've been (perhaps foolishly) relying on 
raid6 scrubs to keep the data intact.

The files may be read a few times more over the course of a month, then 
they're either removed or just sit there quietly for months to years. 

> FWIW, the patterns that you have shown so far do seem to suggest
> something higher level than a physical storage problem. Otherwise, I'd
> expect these instances wouldn't always necessarily land in file data.
> Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> problems?

I haven't tried xfs_repair yet. At 181T used, and a high but as yet 
unknown number of dirs and files, I imagine it will take quite a while, 
and the filesystem shouldn't really be unavailable for more than a low 
number of hours. I can use an LVM snapshot to do the 'xfs_repair -n', 
but I need to add enough spare capacity to hold the amount of data that 
arrives (at 0.5-1TB/day) during the life of the check / snapshot. That 
might take a bit of fiddling because the system is getting short on 
drive bays.

Is it possible to work out approximately how long the check might take?
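
(For what it's worth, the snapshot route would look roughly like the
following - the snapshot name and COW size are made up, and modulo any
complaints xfs_repair has about a snapshot of a live filesystem:)

# lvcreate -s -n bigfs_check -L 2T vg00/bigfs
# xfs_repair -n /dev/vg00/bigfs_check
# lvremove vg00/bigfs_check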

> OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> blocks. I suppose that could suggest some kind of memory/cache
> corruption as opposed to a bad page/extent state or something of that
> nature.

I should have mentioned in the system summary: it's ECC RAM, with no 
EDAC errors coming up. So it shouldn't be memory corruption due to a bad 
stick or whatever. But, yes, there can be other causes.

> Hmm, I guess the only productive thing I can think of right now is to
> see if you can try and detect the problem as soon as possible. For e.g.,
> it sounds like this is a closed system. If so, could you follow up every
> file creation with an immediate md5 verification (perhaps followed by an
> fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> pagecache)? Perhaps others might have further ideas..

The check runs "soon" after file arrival (usually minutes), but not 
immediately. I could potentially alter the ftp receiver to calculate the 
md5 as the file data is received and cross check with the md5 file at 
the end, but doing same with the files that arrive via NFS would be 
difficult.

The great majority of the corruptions have been in the files arriving 
via NFS - possibly because those files tend to be much larger so random 
corruptions simply hit them more, but also I guess possibly because NFS 
is more susceptible to whatever is causing the problem.

I have a number of instances where it definitely looks like the file has 
made it to the filesystem (but not necessarily disk) and checked ok, 
only to later fail the md5 check, e.g.:

2018-03-12 07:36:56 created
2018-03-12 07:50:05 check ok
2018-03-26 19:02:14 check bad

2018-03-13 08:13:10 created
2018-03-13 08:36:56 check ok
2018-03-26 14:58:39 check bad

2018-03-13 21:06:34 created
2018-03-13 21:11:18 check ok
2018-03-26 19:24:24 check bad

I've now (subsequent to those instances above) updated to your 
suggestion: do the check first (without DONTNEED), then if the file had 
pages in the vm before the first check (seen using 'vmtouch' Resident 
Pages), use DONTNEED (via 'vmtouch -e') and do the check again.

I haven't yet seen any corrupt files with this new scheme (it's now been 
in place for only 24 hours).

I've not played with vmtouch before so I'm not sure what's normal, but 
there seems to be some odd behaviour. Most of the time, 'vmtouch -e' 
clears the file from buffers immediately, but sometimes it leaves a 
single page resident, even in the face of repeated calls. I understand 
that fadvise(DONTNEED) is advisory (and of course there's always a 
chance something else can bring file pages back into vm), so I had it in 
a loop:

check_pages_buffered
checksum
if pages_were_buffered
  fadvise(DONTNEED)
  whilst pages_buffered
    fadvise(DONTNEED)
    sleep 2
  done
  checksum
fi
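
Concretely, that loop is roughly the following, with vmtouch doing both
the residency check and the eviction (the file argument is hypothetical):

f="$1"
resident() {
    vmtouch "$f" | awk '/Resident Pages:/ {split($3, a, "/"); print a[1]}'
}
before=$(resident)
md5sum "$f"
if [ "$before" -gt 0 ]; then
    vmtouch -e "$f"
    while [ "$(resident)" -gt 0 ]; do
        vmtouch -e "$f"
        sleep 2
    done
    md5sum "$f"
fi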

I had a case where that loop was running for 2.5 hours before self 
terminating, in the absence of anything else touching the file (that I 
could find), and another case where it continued for 1.5 hours before I 
killed it. It seems a single page can persist in memory (I don't know if 
it's the same page) for *hours* even with many, many fadvise(DONTNEED) 
calls. In testing, I was finally able to clear that file from vm using:

  echo 3 > /proc/sys/vm/drop_caches

...but that's a wee bit heavy to use to clear single pages so I'm now 
breaking the loop if pages_buffered <= 1.

Any idea what that impressively persistent page is about?

>> "cmp -l badfile goodfile" shows there are 256 bytes differing, in the
>> 2nd half of (512b) block 53906431.
>
> FWIW, that's the last (512b) sector of the associated (4k) page. Does
> that happen to be consistent across whatever other instances you have a
> record of?

Huh, I should have noticed that! Yes, all corruptions are the last 256b 
of a 4k page. And in fact all are the last 256b in the first 4k page of 
an 8k block. That's odd as well!

FYI, these are the 256b offsets I'm now working with (there have 
been a few more since I started):

310799
876559
1400335
1676815
3516271
4243471
4919311
6267919
10212879
11520527
11842175
16179215
18018367
22609935
45314111
51365903
60588047
69212175
82352143
107812863
165136351
227067839
527947775
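
(A quick sanity check of the alignment claim above, feeding those offsets
from a file - 'offsets.txt' is a made-up name; since 4KB is 16 x 256b,
"last 256b of a 4k page" means offset % 16 == 15:)

$ awk '{ print $1, ($1 % 16 == 15 ? "last 256b of 4k page" : "other") }' offsets.txt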

Thanks for your time!

Chris

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: file corruptions, 2nd half of 512b block
  2018-03-22 23:04   ` Dave Chinner
  2018-03-22 23:26     ` Darrick J. Wong
@ 2018-03-28 15:20     ` Chris Dunlop
  2018-03-28 22:27       ` Dave Chinner
  1 sibling, 1 reply; 11+ messages in thread
From: Chris Dunlop @ 2018-03-28 15:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
>> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
>>> Eyeballing the corrupted blocks and matching good blocks doesn't show
>>> any obvious pattern. The files themselves contain compressed data so
>>> it's all highly random at the block level, and the corruptions
>>> themselves similarly look like random bytes.
>>>
>>> The corrupt blocks are not a copy of other data in the file within the
>>> surrounding 256k of the corrupt block.

>>> XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
>
> Are these all on the one raid controller? i.e. what's the physical
> layout of all these disks?

Yep, one controller. Physical layout:

c0 LSI 9211-8i (SAS2008)
 |
 + SAS expander w/ SATA HDD x 12
 |   + SAS expander w/ SATA HDD x 24
 |       + SAS expander w/ SATA HDD x 24
 |
 + SAS expander w/ SATA HDD x 24
     + SAS expander w/ SATA HDD x 24

>> OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
>> blocks. I suppose that could suggest some kind of memory/cache
>> corruption as opposed to a bad page/extent state or something of that
>> nature.
>
> Especially with the data write mechanisms being used - e.g. NFS
> won't be doing partial sector reads and writes for data transfer -
> it'll all be done in blocks much larger than the filesystem block
> size (e.g. 1MB IOs).

Yep, that's one of the reasons the 256b corruptions are so odd.

> Basically, the only steps now are a methodical, layer by layer
> checking of the IO path to isolate where the corruption is being
> introduced. First you need a somewhat reliable reproducer that can
> be used for debugging.

The "reliable reproducer" at this point seems to be to simply let the 
box keep doing its thing - at least being able to detect the problem 
and having the luxury of being able to repair the damage or re-get files 
from remote means we're not looking at irretrievable data loss.

But I'll take a look at that checkstream stuff you've mentioned to see 
if I can get a more methodical reproducer.

> Write patterned files (e.g. encode a file id, file offset and 16 bit
> cksum in every 8 byte chunk) and then verify them. When you get a
> corruption, the corrupted data will tell you where the corruption
> came from. It'll either be silent bit flips, some other files' data,
> or it will be stale data. See if the corruption pattern is
> consistent. See if the locations correlate to a single disk, a
> single raid controller, a single backplane, etc. i.e. try to find
> some pattern to the corruption.

Generating an md5 checksum for every 256b block in each of the corrupted 
files reveals that, a significant proportion of the time (16 out of 23 
corruptions, in 20 files), the corrupt 256b block is apparently a copy 
of a block from a whole number of 4KB blocks prior in the file (the 
"source" block). In fact, 14 of those source blocks are a whole number 
of 128KB blocks prior to the corrupt block, and 11 of the 16 source blocks 
are a whole number of 1MB blocks prior to the corrupt blocks.

As previously noted by Brian, all the corruptions are the last 256b in a 
4KB block (but not the last 256b in the first 4KB block of an 8KB block 
as I later erroneously claimed). That also means that all the "source" 
blocks are also the last 256b in a 4KB block.

Those nice round numbers seem highly suspicious, but I don't know 
what they might be telling me.

In table form, with 'source' being the 256b offset to the apparent 
source block, i.e. the block with the same contents in the same file as 
the corrupt block (or '-' where the source block wasn't found), 
'corrupt' being the 256b offset to the corrupt block, and the remaining 
columns showing the whole number of 4KB, 128KB or 1MB blocks between the 
'source' and 'corrupt' blocks (or n/w where it's not a whole number):

file       source    corrupt    4KB  128KB   1MB
------   --------   -------- ------  -----  ----
file01    4222991    4243471   1280     40     5
file01   57753615   57794575   2560     80    10
file02          -   18018367      -      -     -
file03     249359     310799   3840    120    15
file04    6208015    6267919   3744    117   n/w
file05  226989503  227067839   4896    153   n/w
file06          -   22609935      -      -     -
file07   10151439   10212879   3840    120    15
file08   16097295   16179215   5120    160    20
file08   20273167   20355087   5120    160    20
file09          -    1676815      -      -     -
file10          -   82352143      -      -     -
file11   69171215   69212175   2560     80    10
file12    4716671    4919311  12665    n/w   n/w
file13  165115871  165136351   1280     40     5
file14    1338895    1400335   3840    120    15
file15          -  107812863      -      -     -
file16          -    3516271      -      -     -
file17   11499535   11520527   1312     41   n/w
file17          -   11842175      -      -     -
file18     815119     876559   3840    120    15
file19   45234191   45314111   4995    n/w   n/w
file20   51324943   51365903   2560     80    10
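
(The last three columns are just the source-to-corrupt distance in bytes 
divided by 4KB, 128KB and 1MB. E.g. for the first file01 row, in Python:

  src, cor = 4222991, 4243471            # 256b offsets from the table
  delta = (cor - src) * 256              # distance in bytes
  for size, name in ((4096, "4KB"), (128 * 1024, "128KB"), (1024 * 1024, "1MB")):
      q, r = divmod(delta, size)
      print(name, q if r == 0 else "n/w")

which prints 4KB 1280, 128KB 40, 1MB 5.)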

Corruption counts per AG (count, then AG number):

      1 10
      1 15
      1 31
      1 52
      1 54
      1 74
      1 82
      1 83
      1 115
      1 116
      1 134
      1 168
      1 174
      1 187
      1 188
      1 190
      2 37
      2 93
      3 80

Corruption counts per md:

      0 /dev/md1
      0 /dev/md3
      3 /dev/md9
      5 /dev/md0
      5 /dev/md5
     10 /dev/md4

I don't know what's going on with md4 - maybe it just has more free 
space, so that's where files tend to get written, and thus where the 
corruptions tend to show up? Similarly, md1 and md3 may have almost no 
free space, so they're not receiving files and hence not showing corruptions.

But I'm guessing about that free space: I don't know how to work out how 
much free space there is on a particular md (as part of an LV). Any 
hints - e.g. look at something in the AGs, then somehow work out which 
AGs are landing on which mds?

Corruption counts per disk:

      1 md5:sdav
      1 md4:sdbc
      1 md0:sdbg
      1 md4:sdbm
      1 md9:sdcp
      1 md0:sdd 
      1 md0:sds 
      2 md5:sdaw
      2 md4:sdbf
      2 md0:sdcc
      2 md5:sdce
      2 md9:sdcj
      2 md4:sde 
      4 md4:sdbd

At first glance that looks like a random distribution. Although, with 66 
disks in total under the fs, that sdbd is a *bit* suspicious.

md to disks, with corruption counts after the colons:

md0:5 : sdbg:1 sdbh   sdcd sdcc:2  sds:1  sdh   sdbj    sdy    sdt  sdr  sdd:1
md1   :  sdc    sdj    sdg  sdi    sdo    sdx   sdax   sdaz   sdba sdbb  sdn
md3   : sdby   sdbl   sdbo sdbz   sdbp   sdbq   sdbs   sdbt   sdbr sdbi sdbx
md4:10: sdbn   sdbm:1 sdbv sdbc:1 sdbu   sdbf:2 sdbd:4  sde:2  sdk  sdw  sdf
md5:5 : sdce:2 sdaq   sdar sdas   sdat   sdau   sdav:1 sdao   sdcx sdcn sdaw:2
md9:3 : sdcg   sdcj:2 sdck sdcl   sdcm   sdco   sdcp:1 sdcq   sdcr sdcs sdcv

Physical layout of disks with corruptions:

/sys/devices/pci0000:00/0000:00:05.0/0000:03:00.0/host0/...
  port-0:0/exp-0:0/port-0:0:0/exp-0:1/port-0:1:23/sdav
  port-0:0/exp-0:0/port-0:0:4/sde
  port-0:0/exp-0:0/port-0:0:5/sdd
  port-0:0/exp-0:0/port-0:0:19/sds
  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:0/exp-0:4/port-0:4:3/sdcj
  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:0/exp-0:4/port-0:4:9/sdcp
  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:5/sdbm
  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:21/sdcc
  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:23/sdce
  port-0:1/exp-0:2/port-0:2:1/sdaw
  port-0:1/exp-0:2/port-0:2:7/sdbc
  port-0:1/exp-0:2/port-0:2:8/sdbd
  port-0:1/exp-0:2/port-0:2:10/sdbf
  port-0:1/exp-0:2/port-0:2:11/sdbg

I.e. corrupt blocks appear on disks attached to every expander in the 
system.

Whilst that hardware side of things is interesting, and that md4 could 
bear some more investigation, as previously suggested, and now with more 
evidence (older files checked clean), it's looking like this issue 
really started with the upgrade from v3.18.25 to v4.9.76 on 2018-01-15. 
I.e. less likely to be hardware related - unless the new kernel is 
stressing the hardware in new exciting ways.

I'm also wondering whether I should just try v4.14.latest, and see if 
the problem goes away (there's always hope!). But that would leave a 
lingering bad taste that maybe there's something not quite right in 
v4.9.whatever land. Not everyone has checksums that can tell them their 
data is going just /slightly/ out of whack...

> Unfortunately, I can't find the repository for the data checking
> tools that were developed years ago for doing exactly this sort of
> testing (genstream+checkstream) online anymore - they seem to
> have disappeared from the internet. (*) Shouldn't be too hard to
> write a quick tool to do this, though.
>
> Also worth testing is whether the same corruption occurs when you
> use direct IO to write and read the files. That would rule out a
> large chunk of the filesystem and OS code as the cause of the
> corruption.

Looks like the checkstream stuff can do O_DIRECT.
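
In the meantime, here's a rough Python sketch of the sort of "quick tool" 
described above (not genstream/checkstream itself, and the 8-byte chunk 
layout is just one way to do it): each 8-byte chunk carries a 16-bit file 
id, a 32-bit chunk index and a 16-bit checksum of the other six bytes, so 
corrupt data identifies itself as bit flips, stale data, or some other 
file's/offset's data:

  import struct, sys, zlib

  def chunk(file_id, n):
      head = struct.pack("<HI", file_id, n)      # 16-bit file id + 32-bit chunk index
      return head + struct.pack("<H", zlib.crc32(head) & 0xffff)

  def write(path, file_id, nchunks):
      with open(path, "wb") as f:
          for n in range(nchunks):
              f.write(chunk(file_id, n))

  def verify(path, file_id):
      with open(path, "rb") as f:
          n = 0
          while (c := f.read(8)):
              if c != chunk(file_id, n):
                  fid, idx = struct.unpack("<HI", c[:6])
                  print("bad chunk at byte offset %d: claims file %d, chunk %d"
                        % (n * 8, fid, idx))
              n += 1

  if __name__ == "__main__":
      write(sys.argv[1], 1, 16 << 20)            # ~128MB of patterned data
      verify(sys.argv[1], 1)

(A 32-bit chunk index only covers ~32GB at 8 bytes per chunk, so it would 
need widening for the larger files here, but it shows the idea.)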

>>> The file is moved to "badfile", and the file regenerated from source
>>> data as "goodfile".
>
> What does "regenerated from source" mean?
>
> Does that mean a new file is created, compressed and then copied
> across? Or is it just the original file being copied again?

New file recreated from source data using the same method used to create 
the original (now corrupt) file.

>>> Comparing our corrupt sector lv offset with the start sector of each md
>>> device, we can see the corrupt sector is within /dev/md9 and not at a
>>> boundary. The corrupt sector offset within the lv data on md9 is given
>>> by:
>
> Does the problem always occur on /dev/md9?
>
> If so, does the location correlate to a single disk in /dev/md9?

No, per above, corruptions occur in various mds (and various disks 
within mds), and the disks are attached to differing points in the 
physical hierarchy.

> Cheers,
> 
> Dave.


Amazing stuff on that COW work for XFS by the way - new tricks for old 
dogs indeed!


Cheers,

Chris


* Re: file corruptions, 2nd half of 512b block
  2018-03-27 22:33   ` Chris Dunlop
@ 2018-03-28 18:09     ` Brian Foster
  2018-03-29  0:15       ` Chris Dunlop
  0 siblings, 1 reply; 11+ messages in thread
From: Brian Foster @ 2018-03-28 18:09 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Wed, Mar 28, 2018 at 09:33:10AM +1100, Chris Dunlop wrote:
> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> > On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> > > Hi,
> > > 
> > > I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.
> > > 
> > > System configuration details below.
> > > 
> > > For those cases where the corrupt file can be regenerated from other
> > > data and the new file compared to the corrupt file (15 files in all),
> > > the corruptions are invariably in the 2nd 256b half of a 512b sector,
> > > part way through the file. That's pretty odd! Perhaps some kind of
> > > buffer tail problem?
> > > 
> > > Are there any known issues that might cause this?
> > 
> > Nothing that I can think of. A quick look through the writeback changes
> > shows this[1] commit, but I'd expect any corruption in that case to
> > manifest as page size (4k) rather than at 256b granularity.
> > 
> > [1] 40214d128e ("xfs: trim writepage mapping to within eof")
> 
> Looks like that issue can occur if the file is closed, then reopened and
> appended to. That's possible with the files written via ftp (the ftp upload
> allows for continuation of partial files), but not the files written via NFS
> - if they're incomplete they're removed and started from scratch.
> 
> > So you obviously have a fairly large/complex storage configuration. I
> > think you have to assume that this corruption could be introduced pretty
> > much anywhere in the stack (network, mm, fs, block layer, md) until it
> > can be narrowed down.
> 
> Yup.
> 
> Per below I'm seeing a good checksum a bit after arrival and bad checksum
> some time later, so looks like it's not network.
> 
> > > 2018-03-04 21:40:44 data + md5 files written
> > > 2018-03-04 22:43:33 checksum mismatch detected
> > 
> > Seems like the corruption is detected fairly soon after creation. How
> > often are these files explicitly checked/read? I also assume the files
> > aren't ever modified..?
> 
> Correct, the files aren't ever (deliberately) modified.
> 
> The files are generally checked once, some time (minutes to hours) after
> landing. After the first check I've been (perhaps foolishly) relying on
> raid6 scrubs to keep the data intact.
> 
> The files may be read a few times more over the course of a month, then
> they're either removed or just sit there quietly for months to years.
> 
> > FWIW, the patterns that you have shown so far do seem to suggest
> > something higher level than a physical storage problem. Otherwise, I'd
> > expect these instances wouldn't always necessarily land in file data.
> > Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
> > problems?
> 
> I haven't tried xfs_repair yet. At 181T used, and a high but as-yet-unknown
> number of dirs and files, I imagine it will take quite a while, and the
> filesystem shouldn't really be unavailable for more than a low number of
> hours. I can use an LVM snapshot to do the 'xfs_repair -n', but I need to
> add enough spare capacity to hold the amount of data that arrives (at
> 0.5-1TB/day) during the life of the check / snapshot. That might take a bit of
> fiddling because the system is getting short on drive bays.
> 
> Is it possible to work out approximately how long the check might take?
> 

It will probably depend more on the amount of metadata than the size of
the fs. That said, it's not critical if downtime is an issue. It's more
something to check when convenient just to be sure there aren't other
issues in play.

> > OTOH, a 256b corruption seems quite unusual for a filesystem with 4k
> > blocks. I suppose that could suggest some kind of memory/cache
> > corruption as opposed to a bad page/extent state or something of that
> > nature.
> 
> I should have mentioned in the system summary: it's ECC RAM, with no EDAC
> errors coming up. So it shouldn't be memory corruption due to a bad stick or
> whatever. But, yes, there can be other causes.
> 
> > Hmm, I guess the only productive thing I can think of right now is to
> > see if you can try and detect the problem as soon as possible. For e.g.,
> > it sounds like this is a closed system. If so, could you follow up every
> > file creation with an immediate md5 verification (perhaps followed by an
> > fadvise(DONTNEED) and another md5 check to try and catch an inconsistent
> > pagecache)? Perhaps others might have further ideas..
> 
> The check runs "soon" after file arrival (usually minutes), but not
> immediately. I could potentially alter the ftp receiver to calculate the md5
> as the file data is received and cross check with the md5 file at the end,
> but doing same with the files that arrive via NFS would be difficult.
> 
> The great majority of the corruptions have been in the files arriving via
> NFS - possibly because those files tend to be much larger so random
> corruptions simply hit them more, but also I guess possibly because NFS is
> more susceptible to whatever is causing the problem.
> 
> I have a number of instances where it definitely looks like the file has
> made it to the filesystem (but not necessarily disk) and checked ok, only to
> later fail the md5 check, e.g.:
> 
> 2018-03-12 07:36:56 created
> 2018-03-12 07:50:05 check ok
> 2018-03-26 19:02:14 check bad
> 
> 2018-03-13 08:13:10 created
> 2018-03-13 08:36:56 check ok
> 2018-03-26 14:58:39 check bad
> 
> 2018-03-13 21:06:34 created
> 2018-03-13 21:11:18 check ok
> 2018-03-26 19:24:24 check bad
> 

How much is known about possible events related to the file between the
time the check passes and when the md5 goes bad? For example, do we know
for certain nothing read or otherwise acted on the file in that time?

If so, it certainly seems like the difference between check ok and check
bad could be due to cache effects.

> I've now (subsequent to those instances above) updated to your suggestion:
> do the check first (without DONTNEED), then if the file had pages in the vm
> before the first check (seen using 'vmtouch' Resident Pages), use DONTNEED
> (via 'vmtouch -e') and do the check again.
> 
> I haven't yet seen any corrupt files with this new scheme (it's now been in
> place for only 24 hours).
> 
> I've not played with vmtouch before so I'm not sure what's normal, but there
> seems to be some odd behaviour. Most of the time, 'vmtouch -e' clears the
> file from buffers immediately, but sometimes it leaves a single page
> resident, even in the face of repeated calls. I understand that
> fadvise(DONTNEED) is advisory (and of course there's always a chance
> something else can bring file pages back into vm), so I had it in a loop:
> 
> # roughly, as shell (vmtouch -e issues the fadvise(DONTNEED); assume the
> # data file is $f and its md5 sits alongside as $f.md5):
> resident() { vmtouch "$f" | awk '/Resident Pages:/ { split($3,a,"/"); print a[1] }'; }
> before=$(resident)
> md5sum -c "$f.md5"
> if [ "$before" -gt 0 ]; then
>   vmtouch -e "$f"
>   while [ "$(resident)" -gt 0 ]; do
>     vmtouch -e "$f"
>     sleep 2
>   done
>   md5sum -c "$f.md5"
> fi
> 
> I had a case where that loop was running for 2.5 hours before self
> terminating, in the absence of anything else touching the file (that I could
> find), and another case where it continued for 1.5 hours before I killed it.
> It seems a single page can persist in memory (I don't know if it's the same
> page) for *hours* even with many, many fadvise(DONTNEED) calls. In testing,
> I was finally able to clear that file from vm using:
> 
>  echo 3 > /proc/sys/vm/drop_caches
> 
> ...but that's a wee bit heavy to use to clear single pages so I'm now
> breaking the loop if pages_buffered <= 1.
> 
> Any idea what that impressively persistent page is about?
> 

Hm, not sure. I see that behavior on one file that was recently cached
in my dev tree. A local copy of the same file shows the same thing. If I
copy to a separate fs on another vm (with a newer kernel), I don't see
that behavior. I'm not sure off hand what the difference is, perhaps it
has something to do with the kernel. But this is all debug logic so I
wouldn't worry too much about doing excessive numbers of loops and
whatnot unless this behavior proves to be somehow relevant to the
problem.

FWIW, 'vmtouch -v' shows a little table of which pages are actually
present in the file. In my test, the tail page is the one that persists.
More importantly, it might be useful to use 'vmtouch -v' in your checks
above. That way we actually have a record of whether the particular
corrupted page was cached between a 'check ok' -> 'check bad'
transition.

> > > "cmp -l badfile goodfile" shows there are 256 bytes differing, in the
> > > 2nd half of (512b) block 53906431.
> > 
> > FWIW, that's the last (512b) sector of the associated (4k) page. Does
> > that happen to be consistent across whatever other instances you have a
> > record of?
> 
> Huh, I should have noticed that! Yes, all corruptions are the last 256b of a
> 4k page. And in fact all are the last 256b in the first 4k page of an 8k
> block. That's odd as well!
> 

Ok, that's potentially interesting. But what exactly do you mean by an
8k block? This is a 4k block filesystem, correct? Are you just saying
that the pages that contain the corruption all happen to be at 8k
aligned offsets?

Brian

> FYI, these are the 256b offsets now I'm now working with (there have been a
> few more since I started):
> 
> 310799
> 876559
> 1400335
> 1676815
> 3516271
> 4243471
> 4919311
> 6267919
> 10212879
> 11520527
> 11842175
> 16179215
> 18018367
> 22609935
> 45314111
> 51365903
> 60588047
> 69212175
> 82352143
> 107812863
> 165136351
> 227067839
> 527947775
> 
> Thanks for your time!
> 
> Chris


* Re: file corruptions, 2nd half of 512b block
  2018-03-28 15:20     ` Chris Dunlop
@ 2018-03-28 22:27       ` Dave Chinner
  2018-03-29  1:09         ` Chris Dunlop
  0 siblings, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-03-28 22:27 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Brian Foster, linux-xfs

On Thu, Mar 29, 2018 at 02:20:00AM +1100, Chris Dunlop wrote:
> On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
> >On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
> >>On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
> >>>Eyeballing the corrupted blocks and matching good blocks doesn't show
> >>>any obvious pattern. The files themselves contain compressed data so
> >>>it's all highly random at the block level, and the corruptions
> >>>themselves similarly look like random bytes.
> >>>
> >>>The corrupt blocks are not a copy of other data in the file within the
> >>>surrounding 256k of the corrupt block.
> 
> >>>XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
> >
> >Are these all on the one raid controller? i.e. what's the physical
> >layout of all these disks?
> 
> Yep, one controller. Physical layout:
> 
> c0 LSI 9211-8i (SAS2008)
> |
> + SAS expander w/ SATA HDD x 12
> |   + SAS expander w/ SATA HDD x 24
> |       + SAS expander w/ SATA HDD x 24
> |
> + SAS expander w/ SATA HDD x 24
>     + SAS expander w/ SATA HDD x 24

Ok, that's good to know. I've seen misdirected writes in a past life
because a controller had a firmware bug when it hit its maximum CTQ
depth of 2048 (controller max, not per-lun max) and the 2049th
queued write got written to a random lun on the controller. That
causes random, unpredictable data corruptions in a similar manner to
what you are seeing.

So don't rule out a hardware problem yet.

> >Basically, the only steps now are a methodical, layer by layer
> >checking of the IO path to isolate where the corruption is being
> >introduced. First you need a somewhat reliable reproducer that can
> >be used for debugging.
> 
> The "reliable reproducer" at this point seems to be to simply let
> the box keep doing its thing - at least being able to detect the
> problem and having the luxury of being able to repair the damage or
> re-get files from remote means we're not looking at irretrievable
> data loss.

Yup, same as what I saw in the past - check if it's controller load
related.

> But I'll take a look at that checkstream stuff you've mentioned to
> see if I can get a more methodical reproducer.

genstream/checkstream was what we used back then :P

> >Write patterned files (e.g. encode a file id, file offset and 16 bit
> >cksum in every 8 byte chunk) and then verify them. When you get a
> >corruption, the corrupted data will tell you where the corruption
> >came from. It'll either be silent bit flips, some other files' data,
> >or it will be stale data. See if the corruption pattern is
> >consistent. See if the locations correlate to a single disk, a
> >single raid controller, a single backplane, etc. i.e. try to find
> >some pattern to the corruption.
> 
> Generating an md5 checksum for every 256b block in each of the
> corrupted files reveals that, a significant proportion of the time
> (16 out of 23 corruptions, in 20 files), the corrupt 256b block is
> apparently a copy of a block from a whole number of 4KB blocks prior
> in the file (the "source" block). In fact, 14 of those source blocks
> are a whole number of 128KB blocks prior to the corrupt block, and 11 of
> the 16 source blocks are a whole number of 1MB blocks prior to the
> corrupt blocks.
> 
> As previously noted by Brian, all the corruptions are the last 256b
> in a 4KB block (but not the last 256b in the first 4KB block of an
> 8KB block as I later erronously claimed). That also means that all
> the "source" blocks are also the last 256b in a 4KB block.
> 
> Those nice round numbers seem highly suspicious, but I don't
> know what they might be telling me.
> In table form, with 'source' being the 256b offset to the apparent
> source block, i.e. the block with the same contents in the same file
> as the corrupt block (or '-' where the source block wasn't found),
> 'corrupt' being the 256b offset to the corrupt block, and the
> remaining columns showing the whole number of 4KB, 128KB or 1MB
> blocks between the 'source' and 'corrupt' blocks (or n/w where it's
> not a whole number):
> 
> file       source    corrupt    4KB  128KB   1MB
> ------   --------   -------- ------  -----  ----
> file01    4222991    4243471   1280     40     5
> file01   57753615   57794575   2560     80    10
> file02          -   18018367      -      -     -
> file03     249359     310799   3840    120    15
> file04    6208015    6267919   3744    117   n/w
> file05  226989503  227067839   4896    153   n/w
> file06          -   22609935      -      -     -
> file07   10151439   10212879   3840    120    15
> file08   16097295   16179215   5120    160    20
> file08   20273167   20355087   5120    160    20
> file09          -    1676815      -      -     -
> file10          -   82352143      -      -     -
> file11   69171215   69212175   2560     80    10
> file12    4716671    4919311  12665    n/w   n/w
> file13  165115871  165136351   1280     40     5
> file14    1338895    1400335   3840    120    15
> file15          -  107812863      -      -     -
> file16          -    3516271      -      -     -
> file17   11499535   11520527   1312     41   n/w
> file17          -   11842175      -      -     -
> file18     815119     876559   3840    120    15
> file19   45234191   45314111   4995    n/w   n/w
> file20   51324943   51365903   2560     80    10

that recurrent 1280/40/5 factoring is highly suspicious. Also, the
distance between the source and the corruption location:

file01 20480		= 2^14 + 2^12
file01 40960		= 2^15 + 2^13
file03 61440		= 2^15 + 2^14 + 2^13 + 2^12
file04 59904		= 2^15 + 2^14 + 2^13 + 2^11 + 2^9
file05 78336
file07 61440
file08 81920
file08 81920
file11 40960
file12 202640
file13 20480
file14 61440
file17 20992
file18 61440
file19 79920
file20 40960
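
For reference, those decompositions are just the set bits of each 
distance (in 256b units, as per the table); a quick Python check, which 
also fills in the ones left blank above:

  for dist in (20480, 40960, 61440, 59904, 78336, 81920, 202640, 20992, 79920):
      bits = [b for b in range(dist.bit_length()) if dist >> b & 1]
      print(dist, "=", " + ".join("2^%d" % b for b in reversed(bits)))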


Such a round number for the offset to be wrong by makes me wonder.
These all have the smell of LBA bit errors (i.e. misdirected writes)
but the partial sector overwrite has me a bit baffled. The bit
pattern points at hardware, but the partial sector overwrite
shouldn't ever occur at the hardware level.

This bit offset pattern smells of a hardware problem at the
controller level, but I'm not yet convinced that it is. We still
need to rule out filesystem level issues.

[...]

> md to disks, with corruption counts after the colons:
> 
> md0:5 : sdbg:1 sdbh   sdcd sdcc:2  sds:1  sdh   sdbj    sdy    sdt  sdr  sdd:1
> md1   :  sdc    sdj    sdg  sdi    sdo    sdx   sdax   sdaz   sdba sdbb  sdn
> md3   : sdby   sdbl   sdbo sdbz   sdbp   sdbq   sdbs   sdbt   sdbr sdbi sdbx
> md4:10: sdbn   sdbm:1 sdbv sdbc:1 sdbu   sdbf:2 sdbd:4  sde:2  sdk  sdw  sdf
> md5:5 : sdce:2 sdaq   sdar sdas   sdat   sdau   sdav:1 sdao   sdcx sdcn sdaw:2
> md9:3 : sdcg   sdcj:2 sdck sdcl   sdcm   sdco   sdcp:1 sdcq   sdcr sdcs sdcv
> 
> Physical layout of disks with corruptions:
> 
> /sys/devices/pci0000:00/0000:00:05.0/0000:03:00.0/host0/...
>  port-0:0/exp-0:0/port-0:0:0/exp-0:1/port-0:1:23/sdav
>  port-0:0/exp-0:0/port-0:0:4/sde
>  port-0:0/exp-0:0/port-0:0:5/sdd
>  port-0:0/exp-0:0/port-0:0:19/sds
>  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:0/exp-0:4/port-0:4:3/sdcj
>  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:0/exp-0:4/port-0:4:9/sdcp
>  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:5/sdbm
>  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:21/sdcc
>  port-0:1/exp-0:2/port-0:2:0/exp-0:3/port-0:3:23/sdce
>  port-0:1/exp-0:2/port-0:2:1/sdaw
>  port-0:1/exp-0:2/port-0:2:7/sdbc
>  port-0:1/exp-0:2/port-0:2:8/sdbd
>  port-0:1/exp-0:2/port-0:2:10/sdbf
>  port-0:1/exp-0:2/port-0:2:11/sdbg
> 
> I.e. corrupt blocks appear on disks attached to every expander in
> the system.

Ok, thank you for digging deep enough to demonstrate the corruption
spread. It certainly helps on my end to know it's spread across
several disks and expanders and isn't obviously a single bad
disk, cable or expander. It doesn't rule out the controller as the
problem, though.

> Whilst that hardware side of things is interesting, and that md4
> could bear some more investigation, as previously suggested, and now
> with more evidence (older files checked clean), it's looking like
> this issue really started with the upgrade from v3.18.25 to v4.9.76
> on 2018-01-15. I.e. less likely to be hardware related - unless the
> new kernel is stressing the hardware in new exciting ways.

Right, it's entirely possible that the new kernel is doing something
the old kernel didn't, like loading it up with more concurrent IO
across more disks. Do you have the latest firmware on the
controller?

The next steps are to validate the data is getting through each
layer of the OS intact. This really needs a more predictable test
case - can you reproduce and detect this corruption using
genstream/checkstream?

If so, the first step is to move to direct IO to rule out a page
cache related data corruption. If direct IO still shows the
corruption, we need to rule out things like file extension and
zeroing causing issues. e.g. preallocate the entire files, then
write via direct IO. If that still generates corruption then we need
to add code into the bottom of the filesystem IO path to validate
the data being sent by the filesystem is not corrupt.
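
Something like this rough Python sketch (path and sizes are only 
placeholders) covers the preallocate-then-direct-IO step; offsets, 
lengths and the buffer are all suitably aligned, as O_DIRECT requires:

  import mmap, os

  path, size, bs = "/xfs/testfile", 1 << 30, 1 << 20
  fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_DIRECT, 0o644)
  os.posix_fallocate(fd, 0, size)        # preallocate the entire file up front
  buf = mmap.mmap(-1, bs)                # anonymous mapping, hence page-aligned
  for off in range(0, size, bs):
      buf[:] = os.urandom(bs)            # or a patterned payload
      if os.pwrite(fd, buf, off) != bs:
          raise IOError("short write at %d" % off)
  os.close(fd)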

If we get that far with correct write data, but still get
corruptions on read, it's not a filesystem created data corruption.
Let's see if we can get to that point first...

> I'm also wondering whether I should just try v4.14.latest, and see
> if the problem goes away (there's always hope!). But that would
> leave a lingering bad taste that maybe there's something not quite
> right in v4.9.whatever land. Not everyone has checksums that can
> tell them their data is going just /slightly/ out of whack...

Yeah, though if there was a general problem I'd have expected to
hear about it from several sources by now. What you are doing is not
a one-off sort of workload....

> Amazing stuff on that COW work for XFS by the way - new tricks for
> old dogs indeed!

Thank Darrick for all the COW work, not me :P

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: file corruptions, 2nd half of 512b block
  2018-03-28 18:09     ` Brian Foster
@ 2018-03-29  0:15       ` Chris Dunlop
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Dunlop @ 2018-03-29  0:15 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Mar 28, 2018 at 02:09:16PM -0400, Brian Foster wrote:
> On Wed, Mar 28, 2018 at 09:33:10AM +1100, Chris Dunlop wrote:
>> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
>>> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
>>>> Hi,
>>>>
>>>> I'm experiencing 256-byte corruptions in files on XFS on 4.9.76.

>>> FWIW, the patterns that you have shown so far do seem to suggest
>>> something higher level than a physical storage problem. Otherwise, I'd
>>> expect these instances wouldn't always necessarily land in file data.
>>> Have you run 'xfs_repair -n' on the fs to confirm there aren't any other
>>> problems?
>>
>> I haven't tried xfs_repair yet. At 181T used, and a high but as-yet-unknown
>> number of dirs and files, I imagine it will take quite a while, and the
>> filesystem shouldn't really be unavailable for more than a low number of
>> hours. I can use an LVM snapshot to do the 'xfs_repair -n', but I need to
>> add enough spare capacity to hold the amount of data that arrives (at
>> 0.5-1TB/day) during the life of the check / snapshot. That might take a bit of
>> fiddling because the system is getting short on drive bays.
>>
>> Is it possible to work out approximately how long the check might take?
>
> It will probably depend more on the amount of metadata than the size of
> the fs. That said, it's not critical if downtime is an issue. It's more
> something to check when convenient just to be sure there aren't other
> issues in play.

It's not looking too good in terms of how much metadata: I've had 
"dircnt" (https://github.com/ChristopherSchultz/fast-file-count) running 
for over 24 hours now and it's still going... (unfortunately it doesn't 
allow for SIGUSR1 to report current stats a la dd). I guess a simple 
directory scan like that is going to be significantly quicker than the 
'xfs_repair -n' - unless 'xfs_repair' uses optimisations not available 
to a simple directory scan?
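
If dircnt never finishes, a rough Python stand-in is easy enough - e.g. 
counting via os.walk and, dd-style, reporting progress so far on SIGUSR1 
- though it won't be any faster than dircnt itself:

  import os, signal, sys

  dirs = files = 0
  signal.signal(signal.SIGUSR1,
                lambda sig, frame: print(dirs, "dirs,", files, "files so far",
                                         file=sys.stderr))
  for _root, dnames, fnames in os.walk(sys.argv[1]):
      dirs += len(dnames)
      files += len(fnames)
  print(dirs, "dirs,", files, "files")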

>> I have a number of instances where it definitely looks like the file has
>> made it to the filesystem (but not necessarily disk) and checked ok, only to
>> later fail the md5 check, e.g.:
>>
>> 2018-03-12 07:36:56 created
>> 2018-03-12 07:50:05 check ok
>> 2018-03-26 19:02:14 check bad
>>
>> 2018-03-13 08:13:10 created
>> 2018-03-13 08:36:56 check ok
>> 2018-03-26 14:58:39 check bad
>>
>> 2018-03-13 21:06:34 created
>> 2018-03-13 21:11:18 check ok
>> 2018-03-26 19:24:24 check bad
>
> How much is known about possible events related to the file between the
> time the check passes and when the md5 goes bad? For example, do we know
> for certain nothing read or otherwise acted on the file in that time?
>
> If so, it certainly seems like the difference between check ok and check
> bad could be due to cache effects.

At least some of the files were read between the ok and bad checks. In 
at least one case the reader complained about a decompression error - in 
fact that was what started me looking into this in detail.

>>                             ... Most of the time, 'vmtouch -e' clears the
>> file from buffers immediately, but sometimes it leaves a single page
>> resident, even in the face of repeated calls. ...
>>
>> Any idea what that impressively persistent page is about?
>
> Hm, not sure. I see that behavior on one file that was recently cached
> in my dev tree. A local copy of the same file shows the same thing. If I
> copy to a separate fs on another vm (with a newer kernel), I don't see
> that behavior. I'm not sure off hand what the difference is, perhaps it
> has something to do with the kernel. But this is all debug logic so I
> wouldn't worry too much about doing excessive numbers of loops and
> whatnot unless this behavior proves to be somehow relevant to the
> problem.
>
> FWIW, 'vmtouch -v' shows a little table of which pages are actually
> present in the file. In my test, the tail page is the one that persists.
> More importantly, it might be useful to use 'vmtouch -v' in your checks
> above. That way we actually have a record of whether the particular
> corrupted page was cached between a 'check ok' -> 'check bad'
> transition.

Tks, I'll add that to the check script.

>>>> "cmp -l badfile goodfile" shows there are 256 bytes differing, in the
>>>> 2nd half of (512b) block 53906431.
>>>
>>> FWIW, that's the last (512b) sector of the associated (4k) page. Does
>>> that happen to be consistent across whatever other instances you have a
>>> record of?
>>
>> Huh, I should have noticed that! Yes, all corruptions are the last 256b of a
>> 4k page. And in fact all are the last 256b in the first 4k page of an 8k
>> block. That's odd as well!
>
> Ok, that's potentially interesting. But what exactly do you mean by an
> 8k block? This is a 4k block filesystem, correct? Are you just saying
> that the pages that contain the corruption all happen to be at 8k
> aligned offsets?

Yes, I meant 8k aligned offsets. But it turns out I was wrong, they're 
not consistently placed within 8k aligned offsets - sorry for the false 
alarm. See also the file/source/corrupt table in email to Dave.

> Brian

Chris


* Re: file corruptions, 2nd half of 512b block
  2018-03-28 22:27       ` Dave Chinner
@ 2018-03-29  1:09         ` Chris Dunlop
  0 siblings, 0 replies; 11+ messages in thread
From: Chris Dunlop @ 2018-03-29  1:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Thu, Mar 29, 2018 at 09:27:54AM +1100, Dave Chinner wrote:
> On Thu, Mar 29, 2018 at 02:20:00AM +1100, Chris Dunlop wrote:
>> On Fri, Mar 23, 2018 at 10:04:50AM +1100, Dave Chinner wrote:
>>> On Thu, Mar 22, 2018 at 02:03:28PM -0400, Brian Foster wrote:
>>>> On Fri, Mar 23, 2018 at 02:02:26AM +1100, Chris Dunlop wrote:
>>>>> XFS on LVM on 6 x PVs, each PV is md raid-6, each with 11 x hdd.
>>>
>>> Are these all on the one raid controller? i.e. what's the physical
>>> layout of all these disks?
>>
>> Yep, one controller. Physical layout:
>>
>> c0 LSI 9211-8i (SAS2008)
>> |
>> + SAS expander w/ SATA HDD x 12
>> |   + SAS expander w/ SATA HDD x 24
>> |       + SAS expander w/ SATA HDD x 24
>> |
>> + SAS expander w/ SATA HDD x 24
>>     + SAS expander w/ SATA HDD x 24
>
> Ok, that's good to know. I've seen misdirected writes in a past life
> because a controller had a firmware bug when it hit its maximum CTQ
> depth of 2048 (controller max, not per-lun max) and the 2049th
> queued write got written to a random lun on the controller. That
> causes random, unpredictable data corruptions in a similar manner to
> what you are seeing.

Ouch!

> So don't rule out a hardware problem yet.

OK. I'm not sure which of hardware or kernel I'd prefer it to be 
at this point!

>> Whilst that hardware side of things is interesting, and that md4
>> could bear some more investigation, as previously suggested, and now
>> with more evidence (older files checked clean), it's looking like
>> this issue really started with the upgrade from v3.18.25 to v4.9.76
>> on 2018-01-15. I.e. less likely to be hardware related - unless the
>> new kernel is stressing the hardware in new exciting ways.
>
> Right, that's entirely possible the new kernel is doing something
> the old kernel didn't, like loading it up with more concurrent IO
> across more disks. Do you have the latest firmware on the
> controller?

Not quite: it's on 19.00.00.00; it looks like the latest is 20.00.06.00 or 
20.00.07.00, depending on where you look.

I can't find a comprehensive set of release notes. Sigh.

We originally held off going to 20 because there were reports of 
problems, but it looks like they've since been resolved in the minor 
updates. Unfortunately we won't be able to update the BIOS in the next 
week or so.

> The next steps are to validate the data is getting through each
> layer of the OS intact. This really needs a more predictable test
> case - can you reproduce and detect this corruption using
> genstream/checkstream?
>
> If so, the first step is to move to direct IO to rule out a page
> cache related data corruption. If direct IO still shows the
> corruption, we need to rule out things like file extension and
> zeroing causing issues. e.g. preallocate the entire files, then
> write via direct IO. If that still generates corruption then we need
> to add code into the bottom of the filesystem IO path to validate
> the data being sent by the filesystem is not corrupt.
>
> If we get that far with correct write data, but still get
> corruptions on read, it's not a filesystem created data corruption.
> Let's see if we can get to that point first...

I'll see what I can do - and/or I'll try v4.14.latest: even if that
makes the problem go away, that will tell us ...something, right?!

Cheers,

Chris


end of thread, other threads:[~2018-03-29  1:09 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2018-03-22 15:02 file corruptions, 2nd half of 512b block Chris Dunlop
2018-03-22 18:03 ` Brian Foster
2018-03-22 23:04   ` Dave Chinner
2018-03-22 23:26     ` Darrick J. Wong
2018-03-22 23:49       ` Dave Chinner
2018-03-28 15:20     ` Chris Dunlop
2018-03-28 22:27       ` Dave Chinner
2018-03-29  1:09         ` Chris Dunlop
2018-03-27 22:33   ` Chris Dunlop
2018-03-28 18:09     ` Brian Foster
2018-03-29  0:15       ` Chris Dunlop
