* Re: XFS bug???
       [not found] <87y7lrmnra.wl%peterc@chubb.wattle.id.au>
@ 2007-03-21  1:17 ` Nathan Scott
  2007-03-21  2:24   ` David Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Nathan Scott @ 2007-03-21  1:17 UTC (permalink / raw)
  To: Peter Chubb; +Cc: xfs

On Wed, 2007-03-21 at 11:41 +1100, Peter Chubb wrote:
> Hi Nathan,
> 	Our main backup machine is showing XFS errors.  Any ideas how
> to fix?  

Hi Peter,

I don't have as much time to spend on XFS as I used to, so better to
contact the list at SGI (CC'd).  What kernel version are you running
there?  Looks like a corrupt directory - was this machine exposed to
the 2.6.17 corruption issue perhaps?

cheers.

> Mar 20 08:12:29 bitburger kernel: Filesystem "dm-0": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c.  Caller 0xf91d5d60
> Mar 20 08:12:29 bitburger kernel:  [pg0+953910291/1069454336] xfs_trans_cancel+0x103/0x140 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953945440/1069454336] xfs_create+0x3d0/0x780 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953945440/1069454336] xfs_create+0x3d0/0x780 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+954000638/1069454336] xfs_vn_mknod+0x3ae/0x4b0 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953977031/1069454336] xfs_buf_free+0x47/0xc0 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953722066/1069454336] xfs_da_state_free+0x52/0x70 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953759517/1069454336] xfs_dir2_node_lookup+0x9d/0xd0 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [pg0+953724888/1069454336] xfs_dir_lookup+0x138/0x160 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [__link_path_walk+3778/3808] __link_path_walk+0xec2/0xee0
> Mar 20 08:12:29 bitburger kernel:  [_atomic_dec_and_lock+43/80] _atomic_dec_and_lock+0x2b/0x50
> Mar 20 08:12:29 bitburger kernel:  [mntput_no_expire+35/208] mntput_no_expire+0x23/0xd0
> Mar 20 08:12:29 bitburger kernel:  [pg0+953918840/1069454336] xfs_dir_lookup_int+0x48/0x130 [xfs]
> Mar 20 08:12:29 bitburger kernel:  [permission+211/272] permission+0xd3/0x110
> Mar 20 08:12:29 bitburger kernel:  [vfs_create+210/384] vfs_create+0xd2/0x180
> Mar 20 08:12:29 bitburger kernel:  [open_namei+1806/1904] open_namei+0x70e/0x770
> Mar 20 08:12:29 bitburger kernel:  [netif_receive_skb+651/880] netif_receive_skb+0x28b/0x370
> Mar 20 08:12:29 bitburger kernel:  [do_filp_open+64/96] do_filp_open+0x40/0x60
> Mar 20 08:12:29 bitburger kernel:  [get_unused_fd+168/240] get_unused_fd+0xa8/0xf0
> Mar 20 08:12:29 bitburger kernel:  [do_sys_open+87/240] do_sys_open+0x57/0xf0
> Mar 20 08:12:29 bitburger kernel:  [sys_open+39/48] sys_open+0x27/0x30
> Mar 20 08:12:29 bitburger kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
> Mar 20 08:12:29 bitburger kernel: Filesystem "dm-0": Corruption of in-memory data detected.  Shutting down filesystem: dm-0
> Mar 20 08:12:29 bitburger kernel: Please umount the filesystem, and rectify the problem(s)
> 
> 
> --
> Dr Peter Chubb  http://www.gelato.unsw.edu.au  peterc AT gelato.unsw.edu.au
> http://www.ertos.nicta.com.au           ERTOS within National ICT Australia
-- 
Nathan

* Re: XFS bug???
  2007-03-21  1:17 ` XFS bug??? Nathan Scott
@ 2007-03-21  2:24   ` David Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: David Chinner @ 2007-03-21  2:24 UTC (permalink / raw)
  To: Nathan Scott; +Cc: Peter Chubb, xfs

On Wed, Mar 21, 2007 at 12:17:01PM +1100, Nathan Scott wrote:
> On Wed, 2007-03-21 at 11:41 +1100, Peter Chubb wrote:
> > Hi Nathan,
> > 	Our main backup machine is showing XFS errors.  Any ideas how
> > to fix?  
> 
> Hi Peter,
> 
> I don't have as much time to spend on XFS as I used to, so better to
> contact the list at SGI (CC'd).  What kernel version are you running
> there?  Looks like a corrupt directory - was this machine exposed to
> the 2.6.17 corruption issue perhaps?
> >
> > Mar 20 08:12:29 bitburger kernel: Filesystem "dm-0": XFS internal
> > error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c.  Caller 0xf91d5d60

Oh, yet another report of this. Peter, can you run with the patch
posted here:

http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg105975.html

If you trip over the problem again, it will tell us a bit more
about what triggered the error that caused the shutdown.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

* Re: XFS bug?
       [not found]       ` <C28A1C2E-423B-48BC-8953-735B85CDFE08@flyingcircus.io>
@ 2016-12-07  6:14         ` Dave Chinner
  0 siblings, 0 replies; 7+ messages in thread
From: Dave Chinner @ 2016-12-07  6:14 UTC (permalink / raw)
  To: Christian Theune; +Cc: linux-xfs

On Tue, Dec 06, 2016 at 08:25:52PM +0100, Christian Theune wrote:
> Hi,
> 
> here’s a follow-up. We found that one host is still running 4.1 and is awaiting its reboot to activate 4.4.
> 
> Is there a safe way forward that you would recommend? I would guess something like:
> 
> - shut down clients
> - unmount filesystems

- run xfs_repair from xfsprogs >4.5.0
- boot into 4.4 kernel.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS bug?
  2016-12-01 11:56   ` Christian Theune
@ 2016-12-01 20:15     ` Dave Chinner
       [not found]       ` <C28A1C2E-423B-48BC-8953-735B85CDFE08@flyingcircus.io>
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2016-12-01 20:15 UTC (permalink / raw)
  To: Christian Theune; +Cc: linux-xfs

On Thu, Dec 01, 2016 at 12:56:08PM +0100, Christian Theune wrote:
> Hi Dave,
> 
> (I hope I pick up the right quoting style.)
> 
> > On 1 Dec 2016, at 12:03, Dave Chinner <david@fromorbit.com> wrote:
> > 
> > Hi Christian - thanks for persevering and getting this report to
> > the list. :P
> 
> Thanks for nudging me. :)
> 
> Not sure whether it’s possible or makes sense to answer this question at the moment: should we feel safe running 4.4 right now?

Yes, I think this is a result of the kernel upgrade, not a problem
with the individual kernels. I strongly suggest that you run
xfsprogs 4.5.0 or more recent with the 4.4 kernel because it has
the matching AGFL packing fix in it.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: XFS bug?
  2016-12-01 11:03 ` Dave Chinner
@ 2016-12-01 11:56   ` Christian Theune
  2016-12-01 20:15     ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Theune @ 2016-12-01 11:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Hi Dave,

(I hope I pick up the right quoting style.) (I also hope I don’t double post - I had to adjust my mailer settings to avoid accidentally sending multipart/html to the list.)

> On 1 Dec 2016, at 12:03, Dave Chinner <david@fromorbit.com> wrote:
> 
> Hi Christian - thanks for persevering and getting this report to
> the list. :P

Thanks for nudging me. :)

Not sure whether it’s possible or makes sense to answer this question at the moment: should we feel safe running 4.4 right now?

Christian

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick


* Re: XFS bug?
  2016-11-30 13:07 XFS bug? Christian Theune
@ 2016-12-01 11:03 ` Dave Chinner
  2016-12-01 11:56   ` Christian Theune
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Chinner @ 2016-12-01 11:03 UTC (permalink / raw)
  To: Christian Theune; +Cc: linux-xfs

On Wed, Nov 30, 2016 at 02:07:39PM +0100, Christian Theune wrote:
> Hi there,
> 
> we’re running a Ceph cluster which had a very rough outage not
> long ago[1].
> 
> When updating our previous kernels from 4.1.16 (Gentoo) to 4.4.27
> (Gentoo) we encountered the following problem in our production
> environment (but not in staging or development):

Hi Christian - thanks for persevering and getting this report to
the list. :P

> 
> - Properly shut down and reboot the machine running Ceph OSDs on XFS w/ kernel 4.1.16.
> - Boot with 4.4.27, let the machine mount the FSs and start OSDs
> - Have everything run 20-30 minutes
> - Ceph OSDs start crashing. Kernel shows messages attached in kern.log

Which shouldn't happen. I'm pretty sure it's the AGFL packing change
that has caused the problem here, but I'm still paging all that
back into memory and clearing out all the other little things I need
to before digging back into this. I have a couple of ideas about how
this could occur:

> An interesting error we saw during repair was this (I can’t remember or reconstruct whether this was on the 4.1 or 4.4 kernel):
> 
> bad agbno 4294967295 in agfl, agno 12
> freeblk count 7 != flcount 6 in ag 12
> sb_fdblocks 82969993, counted 82969994

Because this:

> Note that the agbno is 2**32-1 repeatedly

is NULLAGBNO, which is what the AGFL is initialised to by mkfs, and
indicates we're accessing a slot that hasn't been filled correctly.
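
(As a quick standalone illustration - not the actual XFS definition -
an AG block number is a 32-bit value and the null sentinel is simply
all ones, which is why repair prints it as 4294967295:)

#include <stdio.h>
#include <inttypes.h>

int main(void)
{
	/* hypothetical stand-in for the on-disk null-slot sentinel */
	uint32_t nullagbno = (uint32_t)-1;	/* all-ones 32-bit value */

	printf("%" PRIu32 "\n", nullagbno);	/* 4294967295 == 2^32 - 1 */
	return 0;
}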

> Also interesting: the broken filesystems and xfs_repair behaved
> completely differently depending on whether they were accessed from
> a 4.1 or a 4.4 kernel; hence the pattern of first running xfs_repair
> on 4.1 and then again on 4.4.

Yup, I'd expect that given that xfs_repair has the same AGFL packing
issue and what it ends up with is dependent on whether the packing
matches the kernel being run or not...
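
(To make that concrete, here's a minimal sketch - a simplified,
assumed header layout for illustration, not the real struct xfs_agfl -
showing how an unpacked 64-bit field changes the header size, and with
it any slot arithmetic derived from that size. It's also why pairing
the 4.4 kernel with xfsprogs 4.5.0 or later matters: both carry the
matching packed layout.)

#include <stdio.h>
#include <stdint.h>

/* Simplified AGFL-style header: an array of 32-bit free-list slots
 * follows a header containing a 64-bit LSN field. */
struct agfl_unpacked {
	uint32_t magic;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn;		/* 8-byte field, natural alignment */
	uint32_t crc;
	uint32_t bno[];		/* free-list slots */
};

struct agfl_packed {
	uint32_t magic;
	uint32_t seqno;
	uint8_t  uuid[16];
	uint64_t lsn;
	uint32_t crc;
	uint32_t bno[];
} __attribute__((packed));

int main(void)
{
	/* Typically 40 vs 36 bytes on 64-bit ABIs: code deriving slot
	 * offsets or counts from the header size disagrees with code
	 * built against the other layout - exactly the kind of flcount
	 * mismatch and stray NULLAGBNO slot reported in this thread. */
	printf("unpacked: %zu bytes\n", sizeof(struct agfl_unpacked));
	printf("packed:   %zu bytes\n", sizeof(struct agfl_packed));
	return 0;
}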

> This looks similar to [2] and may be related to the already fixed
> bug referenced by Dave in [3], but in our case there was no 32/64
> bit migration involved.

That was the initial discovery vector, but looking into this again I
suspect the issue is that the packing change shifts the slot indexing.
I do have a
patchset where I started trying to fix all this up automatically,
and so I need to go back to that and sort out where I was up to and
see if I was addressing this index offset problem at all. This is
where I previously got up to:

https://www.spinics.net/lists/linux-xfs/msg00445.html

More tomorrow once I've dug in further...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

* XFS bug?
@ 2016-11-30 13:07 Christian Theune
  2016-12-01 11:03 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Theune @ 2016-11-30 13:07 UTC (permalink / raw)
  To: linux-xfs

Hi there,

we’re running a Ceph cluster which had a very rough outage not long ago[1].

When updating our previous kernels from 4.1.16 (Gentoo) to 4.4.27 (Gentoo) we encountered the following problem in our production environment (but not in staging or development):

- Properly shut down and reboot the machine running Ceph OSDs on XFS w/ kernel 4.1.16.
- Boot with 4.4.27, let the machine mount the FSs and start OSDs
- Have everything run 20-30 minutes
- Ceph OSDs start crashing. Kernel shows messages attached in kern.log
- Panic. Breathe.
- The RAID controllers (LSI) did not exhibit any sign of disk problems at all.
- Trying to interact with the crashed FSs, e.g. through xfs_repair, caused infinitely hanging syscalls. A clean reboot was no longer possible at that point.

After some experimentation the way to clean things up with negligible residual harm was:

- reboot into 4.1 kernel
- run xfs_repair, force the journal to be cleaned with -L (in some instances)
- ensure a second xfs_repair run comes up clean, as well as after a mount/umount cycle
- reboot into 4.4 kernel
- run xfs_repair again, ensure it eventually becomes clean, and stays that way after mount/unmount as well as a reboot cycle

An interesting error we saw during repair was this (I can’t remember or reconstruct whether this was on the 4.1 or 4.4 kernel):

bad agbno 4294967295 in agfl, agno 12
freeblk count 7 != flcount 6 in ag 12
sb_fdblocks 82969993, counted 82969994

bad agbno 4294967295 in agfl, agno 13
freeblk count 7 != flcount 6 in ag 13
sb_fdblocks 98156324, counted 98156325

Note that the agbno is 2**32-1 repeatedly and that sb_fdblocks is off by one. I personally don’t have enough XFS internals knowledge, but to me this smells “interesting”.

Also interesting: the broken filesystems and xfs_repair behaved completely differently depending on whether they were accessed from a 4.1 or a 4.4 kernel; hence the pattern of first running xfs_repair on 4.1 and then again on 4.4.


[-- Attachment #2: kern.log.gz --]
[-- Type: application/x-gzip, Size: 49378 bytes --]

This looks similar to [2] and may be related to the already fixed bug referenced by Dave in [3], but in our case there was no 32/64 bit migration involved.

I’d love it if someone could check whether this is a new bug - I reviewed all kernel logs since our old kernel but could not find anything that I can pinpoint to our situation.

Unfortunately, my notes aren’t as complete as I would have liked them to be; let me know if you need anything specific and I’ll do my best to dig it up.

Cheers and thanks in advance,
Christian

[1] http://status.flyingcircus.io/incidents/h37gk5v81nz5
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1576599
[3] https://plus.google.com/u/0/+FlorianHaas/posts/LNYMKQF7rgU

-- 
Christian Theune · ct@flyingcircus.io · +49 345 219401 0
Flying Circus Internet Operations GmbH · http://flyingcircus.io
Forsterstraße 29 · 06112 Halle (Saale) · Deutschland
HR Stendal HRB 21169 · Geschäftsführer: Christian Theune, Christian Zagrodnick

