* Recently-formatted XFS filesystems reporting negative used space
@ 2018-07-10 13:43 Filippo Giunchedi
  2018-07-10 21:39 ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Filippo Giunchedi @ 2018-07-10 13:43 UTC (permalink / raw)
  To: linux-xfs

Hello,
a little background: at the Wikimedia Foundation we are running a
30-host OpenStack Swift cluster to host user media uploads; each host
has 12 spinning disks formatted individually with XFS.

Some of the recently-formatted filesystems have started reporting
negative usage upon reaching around 70% utilization, though other
filesystems on the same host keep reporting usage as expected:

/dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
/dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
/dev/sdc1       3.7T  3.0T  670G  83% /srv/swift-storage/sdc1
/dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
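
Presumably df is computing the used column as size minus free, so with
free bogusly reported at ~17T on a 3.7T filesystem the used figure
comes out around -13T.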

We have experienced this bug only on the last four machines to be put
in service, which were formatted with xfsprogs 4.9.0+nmu1 from Debian
Stretch. The remaining hosts were formatted in the past with xfsprogs
3.2.1 or older.
We also have a standby cluster in another datacenter with a similar
configuration, whose hosts have received write traffic only and no
read traffic; the standby cluster hasn't experienced the bug and all
of its filesystems report the correct usage.
As far as I can tell, the difference in the xfsprogs version used for
formatting means the defaults have changed (e.g. crc is enabled on the
affected filesystems). Have you seen this issue before, and do you
know how to fix it?

I would love to help debug this issue; we've been detailing the work
done so far at https://phabricator.wikimedia.org/T199198

thanks in advance!
Filippo


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-10 13:43 Recently-formatted XFS filesystems reporting negative used space Filippo Giunchedi
@ 2018-07-10 21:39 ` Eric Sandeen
  2018-07-10 22:40   ` Dave Chinner
  2018-07-11  8:31   ` Filippo Giunchedi
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Sandeen @ 2018-07-10 21:39 UTC (permalink / raw)
  To: Filippo Giunchedi, linux-xfs

On 7/10/18 8:43 AM, Filippo Giunchedi wrote:
> Hello,
> a little background: at the Wikimedia Foundation we are running a
> 30-host OpenStack Swift cluster to host user media uploads; each host
> has 12 spinning disks formatted individually with XFS.
>
> Some of the recently-formatted filesystems have started reporting
> negative usage upon reaching around 70% utilization, though other
> filesystems on the same host keep reporting usage as expected:
> 
> /dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
> /dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
> /dev/sdc1       3.7T  3.0T  670G  83% /srv/swift-storage/sdc1
> /dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
> 
> We have experienced this bug only on the last four machines to be put
> in service, which were formatted with xfsprogs 4.9.0+nmu1 from Debian
> Stretch. The remaining hosts were formatted in the past with xfsprogs
> 3.2.1 or older.
> We also have a standby cluster in another datacenter with a similar
> configuration, whose hosts have received write traffic only and no
> read traffic; the standby cluster hasn't experienced the bug and all
> of its filesystems report the correct usage.
> As far as I can tell, the difference in the xfsprogs version used for
> formatting means the defaults have changed (e.g. crc is enabled on the
> affected filesystems). Have you seen this issue before, and do you
> know how to fix it?
>
> I would love to help debug this issue; we've been detailing the work
> done so far at https://phabricator.wikimedia.org/T199198

What kernel are the problematic nodes running?

From your repair output:

root@ms-be1040:~# xfs_repair -n /dev/sde1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 4461713825, counted 166746529
        - found root inode chunk

that sb_fdblocks really is ~17T which indicates the problem
really is on disk.
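
(Assuming the default 4k block size, 4461713825 blocks * 4096 bytes is
roughly 16.6 TiB, which matches the bogus ~17T avail in your df
output.)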

4461713825
100001001111100000101100110100001
166746529
     1001111100000101100110100001
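
(In decimal: 4461713825 - 166746529 = 4294967296 = 2^32, i.e. the two
values differ in exactly one bit.)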

you have a bit flipped in the problematic value... but you're running
with CRCs so it seems unlikely to have been some sort of bit-rot (that,
and the fact that you're hitting the same problem on multiple nodes).

Soooo not sure what to say right now other than "your bad value has an
extra bit set for some reason."

-Eric


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-10 21:39 ` Eric Sandeen
@ 2018-07-10 22:40   ` Dave Chinner
  2018-07-13 17:44     ` Bill O'Donnell
  2018-07-11  8:31   ` Filippo Giunchedi
  1 sibling, 1 reply; 11+ messages in thread
From: Dave Chinner @ 2018-07-10 22:40 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Filippo Giunchedi, linux-xfs

On Tue, Jul 10, 2018 at 04:39:26PM -0500, Eric Sandeen wrote:
> On 7/10/18 8:43 AM, Filippo Giunchedi wrote:
> > Hello,
> > a little background: at the Wikimedia Foundation we are running a
> > 30-host OpenStack Swift cluster to host user media uploads; each host
> > has 12 spinning disks formatted individually with XFS.
> >
> > Some of the recently-formatted filesystems have started reporting
> > negative usage upon reaching around 70% utilization, though other
> > filesystems on the same host keep reporting usage as expected:
> > 
> > /dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
> > /dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
> > /dev/sdc1       3.7T  3.0T  670G  83% /srv/swift-storage/sdc1
> > /dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
> > 
> > We have experienced this bug only on the last four machines to be put
> > in service, which were formatted with xfsprogs 4.9.0+nmu1 from Debian
> > Stretch. The remaining hosts were formatted in the past with xfsprogs
> > 3.2.1 or older.
> > We also have a standby cluster in another datacenter with a similar
> > configuration, whose hosts have received write traffic only and no
> > read traffic; the standby cluster hasn't experienced the bug and all
> > of its filesystems report the correct usage.
> > As far as I can tell, the difference in the xfsprogs version used for
> > formatting means the defaults have changed (e.g. crc is enabled on the
> > affected filesystems). Have you seen this issue before, and do you
> > know how to fix it?
> >
> > I would love to help debug this issue; we've been detailing the work
> > done so far at https://phabricator.wikimedia.org/T199198
> 
> What kernel are the problematic nodes running?
> 
> From your repair output:
> 
> root@ms-be1040:~# xfs_repair -n /dev/sde1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
> sb_fdblocks 4461713825, counted 166746529
>         - found root inode chunk
> 
> that sb_fdblocks really is ~17T which indicates the problem
> really is on disk.
> 
> 4461713825
> 100001001111100000101100110100001
> 166746529
>      1001111100000101100110100001
> 
> you have a bit flipped in the problematic value... but you're running
> with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> and the fact that you're hitting the same problem on multiple nodes).
> 
> Soooo not sure what to say right now other than "your bad value has an
> extra bit set for some reason."

Looks like the superblock verifier doesn't bounds check free block
or free/used inode counts.  Perhaps we should be checking this in
the verifier so in-memory corruption like this never makes it to
disk?
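
Something along these lines, perhaps (a minimal userspace sketch of the
invariants such a check could enforce; the field names match the
on-disk superblock, but the sample geometry is made up):

#include <stdint.h>
#include <stdio.h>

/* subset of the on-disk superblock counters involved */
struct sb_counters {
        uint64_t sb_dblocks;    /* data blocks in the filesystem */
        uint64_t sb_fdblocks;   /* free data blocks */
        uint64_t sb_icount;     /* allocated inodes */
        uint64_t sb_ifree;      /* free inodes */
};

/* bounds a verifier could enforce before the sb is written back */
static int sb_counters_sane(const struct sb_counters *sb)
{
        if (sb->sb_fdblocks > sb->sb_dblocks)
                return 0;       /* more free blocks than the fs has */
        if (sb->sb_ifree > sb->sb_icount)
                return 0;       /* more free inodes than allocated */
        return 1;
}

int main(void)
{
        struct sb_counters sb = {
                .sb_dblocks  = 976754176,       /* made-up ~3.7T geometry */
                .sb_fdblocks = 4461713825,      /* corrupt value from above */
                .sb_icount   = 64,
                .sb_ifree    = 61,
        };
        printf("sane: %d\n", sb_counters_sane(&sb));    /* prints: sane: 0 */
        return 0;
}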

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-10 21:39 ` Eric Sandeen
  2018-07-10 22:40   ` Dave Chinner
@ 2018-07-11  8:31   ` Filippo Giunchedi
  2018-07-16  9:29     ` Filippo Giunchedi
  1 sibling, 1 reply; 11+ messages in thread
From: Filippo Giunchedi @ 2018-07-11  8:31 UTC (permalink / raw)
  To: sandeen; +Cc: linux-xfs

Thanks Eric and Dave!

On Tue, Jul 10, 2018 at 11:39 PM Eric Sandeen <sandeen@sandeen.net> wrote:
>
> On 7/10/18 8:43 AM, Filippo Giunchedi wrote:
> > Hello,
> > a little background: at the Wikimedia Foundation we are running a
> > 30-host OpenStack Swift cluster to host user media uploads; each host
> > has 12 spinning disks formatted individually with XFS.
> >
> > Some of the recently-formatted filesystems have started reporting
> > negative usage upon reaching around 70% utilization, though other
> > filesystems on the same host keep reporting usage as expected:
> >
> > /dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
> > /dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
> > /dev/sdc1       3.7T  3.0T  670G  83% /srv/swift-storage/sdc1
> > /dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
> >
> > We have experienced this bug only on the last four machines to be put
> > in service, which were formatted with xfsprogs 4.9.0+nmu1 from Debian
> > Stretch. The remaining hosts were formatted in the past with xfsprogs
> > 3.2.1 or older.
> > We also have a standby cluster in another datacenter with a similar
> > configuration, whose hosts have received write traffic only and no
> > read traffic; the standby cluster hasn't experienced the bug and all
> > of its filesystems report the correct usage.
> > As far as I can tell, the difference in the xfsprogs version used for
> > formatting means the defaults have changed (e.g. crc is enabled on the
> > affected filesystems). Have you seen this issue before, and do you
> > know how to fix it?
> >
> > I would love to help debug this issue; we've been detailing the work
> > done so far at https://phabricator.wikimedia.org/T199198
>
> What kernel are the problematic nodes running?

All nodes are running 4.9.82-1+deb9u3

> From your repair output:
>
> root@ms-be1040:~# xfs_repair -n /dev/sde1
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
> sb_fdblocks 4461713825, counted 166746529
>         - found root inode chunk
>
> that sb_fdblocks really is ~17T which indicates the problem
> really is on disk.
>
> 4461713825
> 100001001111100000101100110100001
> 166746529
>      1001111100000101100110100001
>
> you have a bit flipped in the problematic value... but you're running
> with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> and the fact that you're hitting the same problem on multiple nodes).

Ouch, indeed we've seen this problem on multiple nodes; said hosts
belong to the same (and latest) shipment from the OEM. We'll run
hardware diagnostics on these hosts and on the others we've received
at another datacenter (which haven't shown issues so far, but don't
serve reads either).

thanks for your help!

Filippo


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-10 22:40   ` Dave Chinner
@ 2018-07-13 17:44     ` Bill O'Donnell
  0 siblings, 0 replies; 11+ messages in thread
From: Bill O'Donnell @ 2018-07-13 17:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eric Sandeen, Filippo Giunchedi, linux-xfs

On Wed, Jul 11, 2018 at 08:40:26AM +1000, Dave Chinner wrote:
> On Tue, Jul 10, 2018 at 04:39:26PM -0500, Eric Sandeen wrote:
> > On 7/10/18 8:43 AM, Filippo Giunchedi wrote:
> > > Hello,
> > > a little background: at the Wikimedia Foundation we are running a
> > > 30-host OpenStack Swift cluster to host user media uploads; each host
> > > has 12 spinning disks formatted individually with XFS.
> > >
> > > Some of the recently-formatted filesystems have started reporting
> > > negative usage upon reaching around 70% utilization, though other
> > > filesystems on the same host keep reporting usage as expected:
> > > 
> > > /dev/sdn1       3.7T  -14T   17T    - /srv/swift-storage/sdn1
> > > /dev/sdh1       3.7T  -13T   17T    - /srv/swift-storage/sdh1
> > > /dev/sdc1       3.7T  3.0T  670G  83% /srv/swift-storage/sdc1
> > > /dev/sdk1       3.7T  3.1T  643G  83% /srv/swift-storage/sdk1
> > > 
> > > We have experienced this bug only on the last four machines to be put
> > > in service, which were formatted with xfsprogs 4.9.0+nmu1 from Debian
> > > Stretch. The remaining hosts were formatted in the past with xfsprogs
> > > 3.2.1 or older.
> > > We also have a standby cluster in another datacenter with a similar
> > > configuration, whose hosts have received write traffic only and no
> > > read traffic; the standby cluster hasn't experienced the bug and all
> > > of its filesystems report the correct usage.
> > > As far as I can tell, the difference in the xfsprogs version used for
> > > formatting means the defaults have changed (e.g. crc is enabled on the
> > > affected filesystems). Have you seen this issue before, and do you
> > > know how to fix it?
> > >
> > > I would love to help debug this issue; we've been detailing the work
> > > done so far at https://phabricator.wikimedia.org/T199198
> > 
> > What kernel are the problematic nodes running?
> > 
> > From your repair output:
> > 
> > root@ms-be1040:~# xfs_repair -n /dev/sde1
> > Phase 1 - find and verify superblock...
> > Phase 2 - using internal log
> >         - zero log...
> >         - scan filesystem freespace and inode maps...
> > sb_fdblocks 4461713825, counted 166746529
> >         - found root inode chunk
> > 
> > that sb_fdblocks really is ~17T which indicates the problem
> > really is on disk.
> > 
> > 4461713825
> > 100001001111100000101100110100001
> > 166746529
> >      1001111100000101100110100001
> > 
> > you have a bit flipped in the problematic value... but you're running
> > with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> > and the fact that you're hitting the same problem on multiple nodes).
> > 
> > Soooo not sure what to say right now other than "your bad value has an
> > extra bit set for some reason."
> 
> Looks like the superblock verifier doesn't bounds check free block
> or free/used inode counts.  Perhaps we should be checking this in
> the verifier so in-memory corruption like this never makes it to
> disk?

A proposed patch and discussion thread are on the list:
https://www.spinics.net/lists/linux-xfs/msg20645.html

Thanks-
Bill


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-11  8:31   ` Filippo Giunchedi
@ 2018-07-16  9:29     ` Filippo Giunchedi
  2018-07-17  9:26       ` Carlos Maiolino
  0 siblings, 1 reply; 11+ messages in thread
From: Filippo Giunchedi @ 2018-07-16  9:29 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Wed, Jul 11, 2018 at 10:31 AM Filippo Giunchedi
<fgiunchedi@wikimedia.org> wrote:
> > that sb_fdblocks really is ~17T which indicates the problem
> > really is on disk.
> >
> > 4461713825
> > 100001001111100000101100110100001
> > 166746529
> >      1001111100000101100110100001
> >
> > you have a bit flipped in the problematic value... but you're running
> > with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> > and the fact that you're hitting the same problem on multiple nodes).
>
> Ouch, indeed we've seen this problem on multiple nodes; said hosts
> belong to the same (and latest) shipment from the OEM. We'll run
> hardware diagnostics on these hosts and on the others we've received
> at another datacenter (which haven't shown issues so far, but don't
> serve reads either).

Update on this: we've run hw diagnostics and couldn't find anything
wrong. xfs_repair does fix the issue, so we'll be going ahead with
that. Is there anything we can do to help debugging in case this
happens again?
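
For the record, the repair procedure we're using per filesystem is
roughly (sdX1 standing in for the affected disk):

# umount /srv/swift-storage/sdX1
# xfs_repair /dev/sdX1
# mount /dev/sdX1 /srv/swift-storage/sdX1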

thanks a lot!
Filippo


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-16  9:29     ` Filippo Giunchedi
@ 2018-07-17  9:26       ` Carlos Maiolino
  2018-07-20 10:20         ` Filippo Giunchedi
  0 siblings, 1 reply; 11+ messages in thread
From: Carlos Maiolino @ 2018-07-17  9:26 UTC (permalink / raw)
  To: Filippo Giunchedi; +Cc: linux-xfs

On Mon, Jul 16, 2018 at 11:29:51AM +0200, Filippo Giunchedi wrote:
> On Wed, Jul 11, 2018 at 10:31 AM Filippo Giunchedi
> <fgiunchedi@wikimedia.org> wrote:
> > > that sb_fdblocks really is ~17T which indicates the problem
> > > really is on disk.
> > >
> > > 4461713825
> > > 100001001111100000101100110100001
> > > 166746529
> > >      1001111100000101100110100001
> > >
> > > you have a bit flipped in the problematic value... but you're running
> > > with CRCs so it seems unlikely to have been some sort of bit-rot (that,
> > > and the fact that you're hitting the same problem on multiple nodes).
> >
> > Ouch, indeed we've seen this problem on multiple nodes; said hosts
> > belong to the same (and latest) shipment from the OEM. We'll run
> > hardware diagnostics on these hosts and on the others we've received
> > at another datacenter (which haven't shown issues so far, but don't
> > serve reads either).
> 
> Update on this: we've run hw diagnostics and couldn't find anything
> wrong. xfs_repair does fix the issue, so we'll be going ahead with
> that. Is there anything we can do to help debugging in case this
> happens again?
> 

There is a patch being discussed on the list to help catch these bit
corruptions before they reach the disk, but bear in mind we can only
improve the validation of our metadata. Nothing actually prevents these
bit flips from occurring in your data as well, in which case you would
actually be writing corrupted data into your files.

Cheers

> thanks a lot!
> Filippo

-- 
Carlos


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-17  9:26       ` Carlos Maiolino
@ 2018-07-20 10:20         ` Filippo Giunchedi
  2018-07-22  0:03           ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Filippo Giunchedi @ 2018-07-20 10:20 UTC (permalink / raw)
  To: linux-xfs

On Tue, Jul 17, 2018 at 11:26 AM Carlos Maiolino <cmaiolino@redhat.com> wrote:
> > > Ouch, indeed we've seen this problem on multiple nodes; said hosts
> > > belong to the same (and latest) shipment from the OEM. We'll run
> > > hardware diagnostics on these hosts and on the others we've received
> > > at another datacenter (which haven't shown issues so far, but don't
> > > serve reads either).
> >
> > Update on this: we've run hw diagnostics and couldn't find anything
> > wrong. xfs_repair does fix the issue, so we'll be going ahead with
> > that. Is there anything we can do to help debugging in case this
> > happens again?
> >
>
> There is a patch being discussed on the list to help catch these bit
> corruptions before they reach the disk, but bear in mind we can only
> improve the validation of our metadata. Nothing actually prevents
> these bit flips from occurring in your data as well, in which case you
> would actually be writing corrupted data into your files.

We've found no other cases of bit flips or corruption in the metadata
or the data itself, though.
To recap what we've seen, hardware bit flipping seems extremely
unlikely: the same type of sb_fdblocks corruption has appeared on
four different hosts, affecting at most one third of the XFS
filesystems per host. Also, the corruption always looks the same,
namely the 33rd bit (2^32) flipped, which also seems suspicious.

HTH,
Filippo


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-20 10:20         ` Filippo Giunchedi
@ 2018-07-22  0:03           ` Eric Sandeen
  2018-07-30 10:02             ` Filippo Giunchedi
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2018-07-22  0:03 UTC (permalink / raw)
  To: Filippo Giunchedi, linux-xfs

On 7/20/18 3:20 AM, Filippo Giunchedi wrote:
> On Tue, Jul 17, 2018 at 11:26 AM Carlos Maiolino <cmaiolino@redhat.com> wrote:
>>>> Ouch, indeed we've seen this problem on multiple nodes; said hosts
>>>> belong to the same (and latest) shipment from the OEM. We'll run
>>>> hardware diagnostics on these hosts and on the others we've received
>>>> at another datacenter (which haven't shown issues so far, but don't
>>>> serve reads either).
>>>
>>> Update on this: we've run hw diagnostics and couldn't find anything
>>> wrong. xfs_repair does fix the issue, so we'll be going ahead with
>>> that. Is there anything we can do to help debugging in case this
>>> happens again?
>>>
>>
>> There is a patch being discussed on the list to help catch these bit
>> corruptions before they reach the disk, but bear in mind we can only
>> improve the validation of our metadata. Nothing actually prevents
>> these bit flips from occurring in your data as well, in which case
>> you would actually be writing corrupted data into your files.
> 
> We've found no other cases of bit flips or corruption in the metadata
> or the data itself, though.
> To recap what we've seen, hardware bit flipping seems extremely
> unlikely: the same type of sb_fdblocks corruption has appeared on
> four different hosts, affecting at most one third of the XFS
> filesystems per host. Also, the corruption always looks the same,
> namely the 33rd bit (2^32) flipped, which also seems suspicious.

Running a debug kernel with memory poisoning, KASAN, or something similar might
help catch it if it's a stray memory write of some sort...
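
(For example a kernel built with CONFIG_KASAN=y, or page poisoning via
CONFIG_PAGE_POISONING plus the page_poison=1 boot parameter;
slub_debug=P is another relatively cheap option.)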

-Eric


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-22  0:03           ` Eric Sandeen
@ 2018-07-30 10:02             ` Filippo Giunchedi
  2018-07-30 23:42               ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Filippo Giunchedi @ 2018-07-30 10:02 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-xfs

On Sun, Jul 22, 2018 at 2:03 AM Eric Sandeen <sandeen@sandeen.net> wrote:
> On 7/20/18 3:20 AM, Filippo Giunchedi wrote:
> > To recap what we've seen, hardware bit flipping seems extremely
> > unlikely: the same type of sb_fdblocks corruption has appeared on
> > four different hosts, affecting at most one third of the XFS
> > filesystems per host. Also, the corruption always looks the same,
> > namely the 33rd bit (2^32) flipped, which also seems suspicious.
>
> Running a debug kernel with memory poisoning, KASAN, or something similar might
> help catch it if it's a stray memory write of some sort...

Thanks! BTW we've experienced this again on an FS at around 77% usage,
and xfs_repair reports the same single-bit difference (4515987426 -
221020130 = 2^32; output below). We'll enable memory poisoning on said
host and see if other filesystems on that host experience the same.
I see in the patch thread it has been mentioned that this particular
condition will be checked and fixed/prevented in 4.19, though the root
cause isn't known (?)

Thanks again!
Filippo

# xfs_repair -n /dev/sdc1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 4515987426, counted 221020130
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1


* Re: Recently-formatted XFS filesystems reporting negative used space
  2018-07-30 10:02             ` Filippo Giunchedi
@ 2018-07-30 23:42               ` Eric Sandeen
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2018-07-30 23:42 UTC (permalink / raw)
  To: Filippo Giunchedi; +Cc: linux-xfs

On 7/30/18 5:02 AM, Filippo Giunchedi wrote:
> On Sun, Jul 22, 2018 at 2:03 AM Eric Sandeen <sandeen@sandeen.net> wrote:
>> On 7/20/18 3:20 AM, Filippo Giunchedi wrote:
>>> To recap what we've seen, hardware bit flipping seems extremely
>>> unlikely: the same type of sb_fdblocks corruption has appeared on
>>> four different hosts, affecting at most one third of the XFS
>>> filesystems per host. Also, the corruption always looks the same,
>>> namely the 33rd bit (2^32) flipped, which also seems suspicious.
>>
>> Running a debug kernel with memory poisoning, KASAN, or something similar might
>> help catch it if it's a stray memory write of some sort...
> 
> Thanks! BTW we've experienced this again on an FS at around 77% usage,
> and xfs_repair reports the same single-bit difference (4515987426 -
> 221020130 = 2^32; output below). We'll enable memory poisoning on said
> host and see if other filesystems on that host experience the same.
> I see in the patch thread it has been mentioned that this particular
> condition will be checked and fixed/prevented in 4.19, though the root
> cause isn't known (?)

Yeah, the validation should happen in 4.19, but no idea what the root
cause is.  Catching it at write time may offer some clues, I hope,
if the debug kernel doesn't help.

-Eric
