* panic on 4.20 server exporting xfs filesystem
From: J. Bruce Fields @ 2015-03-03 22:10
To: Christoph Hellwig
Cc: linux-nfs, xfs

I'm getting mysterious crashes on a server exporting an xfs filesystem.

Strangely, I've reproduced this on

    93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs"

but haven't yet managed to reproduce on either of its parents
(24a52e412ef2 or 781355c6e5ae). That might just be chance; I'll try
again.

I can also try a serial console or something, but for now I'm not
getting a lot of information about the crashes.

The filesystem in question isn't on a block device available to the
client, but I'm still seeing occasional GETLAYOUT and GETDEVICEINFO
calls; I suppose the client's getting that far, finding no devices it
can use, and giving up?

Sorry for the incomplete report, I'll pass along more when I have it.

--b.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs
* Re: panic on 4.20 server exporting xfs filesystem
From: Dave Chinner @ 2015-03-03 22:44
To: J. Bruce Fields
Cc: linux-nfs, Christoph Hellwig, xfs

On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote:
> I'm getting mysterious crashes on a server exporting an xfs filesystem.
>
> Strangely, I've reproduced this on
>
>     93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs"
>
> but haven't yet managed to reproduce on either of its parents
> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try
> again.

I think you'll find that the bug is only triggered after that XFS
merge because it's what enabled block layout support in the server,
i.e. nfsd4_setup_layout_type() is now setting the export type to
LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
its export ops.

> I can also try a serial console or something, but for now I'm not
> getting a lot of information about the crashes.

Really need a stack trace - even a photo of the screen with the
stack trace on it will do for starters. ;)

> The filesystem in question isn't on a block device available to the
> client, but I'm still seeing occasional GETLAYOUT and GETDEVICEINFO
> calls; I suppose the client's getting that far, finding no devices it
> can use, and giving up?

I can't see anything in the XFS code that would obviously cause a
problem - it's completely unaware of the visibility of the underlying
block device to the pNFS clients, and of the error handling paths that
pNFS server/clients might take when the block device is not visible at
the client side...

> Sorry for the incomplete report, I'll pass along more when I have it.

No worries, good to have an early heads up. :)

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
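The gating Dave describes - the server only advertises the block layout once the filesystem fills in the pNFS export operations - can be sketched roughly as follows. This is an illustrative Python model, not the kernel's actual nfsd4_setup_layout_type(); the op names mirror the XFS pNFS export hooks, but the structure and values are invented for the sketch:

```python
# Sketch of the selection logic: the layout type an export advertises
# depends on whether the underlying filesystem supplies the pNFS
# export operations. Illustration only - not the real nfsd code.

LAYOUT_NONE = 0
LAYOUT_BLOCK_VOLUME = 3  # value chosen for illustration

class ExportOps:
    """Toy stand-in for a filesystem's export_operations table."""
    def __init__(self, get_uuid=None, map_blocks=None, commit_blocks=None):
        self.get_uuid = get_uuid
        self.map_blocks = map_blocks
        self.commit_blocks = commit_blocks

def setup_layout_type(ops):
    """Enable the block layout only if all pNFS hooks are present."""
    if ops.get_uuid and ops.map_blocks and ops.commit_blocks:
        return LAYOUT_BLOCK_VOLUME
    return LAYOUT_NONE

# Before the XFS merge: no hooks, so the server offers no layout and
# the nfsd block-layout paths stay dead code.
assert setup_layout_type(ExportOps()) == LAYOUT_NONE

# After the merge: XFS fills in the hooks and the layout code goes live.
xfs_ops = ExportOps(get_uuid=lambda: b"uuid",
                    map_blocks=lambda: None,
                    commit_blocks=lambda: None)
assert setup_layout_type(xfs_ops) == LAYOUT_BLOCK_VOLUME
```

The practical consequence is the one Dave points out: the merge itself contains little new server logic, but it is what made the previously unreachable block-layout paths reachable, so a bisect will land on it even if the bug is elsewhere.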
* Re: panic on 4.20 server exporting xfs filesystem
From: J. Bruce Fields @ 2015-03-04 2:08
To: Dave Chinner
Cc: linux-nfs, Christoph Hellwig, xfs

On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote:
> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote:
> > I'm getting mysterious crashes on a server exporting an xfs filesystem.
> >
> > Strangely, I've reproduced this on
> >
> >     93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs"
> >
> > but haven't yet managed to reproduce on either of its parents
> > (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try
> > again.
>
> I think you'll find that the bug is only triggered after that XFS
> merge because it's what enabled block layout support in the server,
> i.e. nfsd4_setup_layout_type() is now setting the export type to
> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
> its export ops.

Doh--after all the discussion I didn't actually pay attention to what
happened in the end. OK, I see, you're right, it's all more-or-less
dead code till that merge.

Christoph's code was passing all my tests before that, so maybe we
broke something in the merge process.

Alternatively, it could be because I've added more tests--I'll rerun my
current tests on his original branch...

> > I can also try a serial console or something, but for now I'm not
> > getting a lot of information about the crashes.
>
> Really need a stack trace - even a photo of the screen with the
> stack trace on it will do for starters. ;)

Yeah, I'll let you know when I have something.

By the way, Christoph, an unrelated question: how are the devices found,
and what are the chances of a client writing to the wrong block device?

(E.g., if they're addressed based on a uuid that doesn't change on
cloning the block device, and if the client had access to another device
with an identical copy of the filesystem, could it end up writing to
that instead?)

--b.
* Re: panic on 4.20 server exporting xfs filesystem
From: Dave Chinner @ 2015-03-04 4:41
To: J. Bruce Fields
Cc: linux-nfs, Christoph Hellwig, xfs

On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote:
> By the way, Christoph, an unrelated question: how are the devices found,
> and what are the chances of a client writing to the wrong block device?
>
> (E.g., if they're addressed based on a uuid that doesn't change on
> cloning the block device, and if the client had access to another device
> with an identical copy of the filesystem, could it end up writing to
> that instead?)

As I understand it, nothing will prevent this - if you don't change
the UUID on the filesystem when you clone it, then the UUID will
still match and writes can be directed to any block device with a
matching offset/UUID pair.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
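The hazard Dave describes is easy to model: block layout device discovery matches devices by an on-disk signature, and a byte-for-byte clone carries the same signature. Here is a toy Python sketch of that ambiguity; the scanning logic, device names, and dictionary layout are invented for illustration and are not the real client-side mapping code:

```python
# Toy model of signature-based device discovery: the client scans its
# block devices and picks the first one whose signature (here reduced
# to the filesystem UUID) matches what the server handed out. With a
# cloned device the signature is no longer unique, so I/O can land on
# the wrong copy. Illustration only.

def find_device(devices, uuid):
    """Return the path of the first device whose signature matches.
    A naive first-match scan has nothing to disambiguate clones."""
    for dev in devices:
        if dev["uuid"] == uuid:
            return dev["path"]
    return None

original = {"path": "/dev/sdb", "uuid": "1234-abcd"}
clone = {"path": "/dev/sdc", "uuid": "1234-abcd"}  # dd copy: same UUID

# Depending on scan order, the clone can shadow the original.
assert find_device([clone, original], "1234-abcd") == "/dev/sdc"
assert find_device([original, clone], "1234-abcd") == "/dev/sdb"
```

This is why regenerating the filesystem UUID when cloning matters here: with identical signatures, nothing on the client side can tell the two devices apart.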
* Re: panic on 4.20 server exporting xfs filesystem
From: Christoph Hellwig @ 2015-03-05 13:19
To: Dave Chinner
Cc: J. Bruce Fields, linux-nfs, Christoph Hellwig, xfs

On Wed, Mar 04, 2015 at 03:41:41PM +1100, Dave Chinner wrote:
> As I understand it, nothing will prevent this - if you don't change
> the UUID on the filesystem when you clone it, then the UUID will
> still match and writes can be directed to any block device with a
> matching offset/UUID pair.

Unfortunately that's the case indeed. The whole discovery part of
the block layout protocol is fundamentally flawed, as is the recall
part. This is my attempt to fix it, but I have no idea how to proceed
from posting my draft to the IETF to actually making it a standard:

http://tools.ietf.org/html/draft-hellwig-nfsv4-scsi-layout-00
* Re: panic on 4.20 server exporting xfs filesystem
From: J. Bruce Fields @ 2015-03-05 15:21
To: Christoph Hellwig
Cc: linux-nfs, xfs

On Thu, Mar 05, 2015 at 02:19:01PM +0100, Christoph Hellwig wrote:
> On Wed, Mar 04, 2015 at 03:41:41PM +1100, Dave Chinner wrote:
> > As I understand it, nothing will prevent this - if you don't change
> > the UUID on the filesystem when you clone it, then the UUID will
> > still match and writes can be directed to any block device with a
> > matching offset/UUID pair.
>
> Unfortunately that's the case indeed. The whole discovery part of
> the block layout protocol is fundamentally flawed, as is the recall
> part. This is my attempt to fix it, but I have no idea how to proceed
> from posting my draft to the IETF to actually making it a standard:
>
> http://tools.ietf.org/html/draft-hellwig-nfsv4-scsi-layout-00

Keep asking, I guess. I'll try to give it a review too.

This may be another reason we can't keep it on by default, if all it
takes is some confusion with snapshots for something disastrous to
happen on upgrade to a block pnfs-supporting kernel.

Though maybe that's really a client bug, as it should probably be
getting an OK from someone before using a device. Arguably the
"systemctl start nfs-blkmap" or equivalent is that, but something more
explicit might be better.

--b.
* Re: panic on 4.20 server exporting xfs filesystem
From: Tom Haynes @ 2015-03-08 13:08
To: Christoph Hellwig
Cc: J. Bruce Fields, linux-nfs, xfs

> On Mar 5, 2015, at 5:19 AM, Christoph Hellwig <hch@lst.de> wrote:
>
>> On Wed, Mar 04, 2015 at 03:41:41PM +1100, Dave Chinner wrote:
>> As I understand it, nothing will prevent this - if you don't change
>> the UUID on the filesystem when you clone it, then the UUID will
>> still match and writes can be directed to any block device with a
>> matching offset/UUID pair.
>
> Unfortunately that's the case indeed. The whole discovery part of
> the block layout protocol is fundamentally flawed, as is the recall
> part. This is my attempt to fix it, but I have no idea how to proceed
> from posting my draft to the IETF to actually making it a standard:
>
> http://tools.ietf.org/html/draft-hellwig-nfsv4-scsi-layout-00

Let's be sure to talk about this at LSF...
* Re: panic on 4.20 server exporting xfs filesystem
From: J. Bruce Fields @ 2015-03-04 15:54
To: Dave Chinner
Cc: linux-nfs, Christoph Hellwig, xfs

On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote:
> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote:
> > I think you'll find that the bug is only triggered after that XFS
> > merge because it's what enabled block layout support in the server,
> > i.e. nfsd4_setup_layout_type() is now setting the export type to
> > LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
> > its export ops.
>
> Doh--after all the discussion I didn't actually pay attention to what
> happened in the end. OK, I see, you're right, it's all more-or-less
> dead code till that merge.
>
> Christoph's code was passing all my tests before that, so maybe we
> broke something in the merge process.
>
> Alternatively, it could be because I've added more tests--I'll rerun my
> current tests on his original branch...

The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look
very informative. I'm running xfstests over NFSv4.1 with client and
server running the same kernel; the filesystem in question is xfs, but
isn't otherwise available to the client (so the client shouldn't be
doing pnfs).

--b.

BUG: unable to handle kernel paging request at 00000000757d4900
IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0
PGD 0
Thread overran stack, or stack corrupted
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc
CPU: 1 PID: 18130 Comm: kworker/1:0 Not tainted 3.19.0-rc4-00205-gcd4b02e #79
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
Workqueue: rpciod rpc_async_schedule [sunrpc]
task: ffff880030639710 ti: ffff88001e698000 task.ti: ffff88001e698000
RIP: 0010:[<ffffffff810b59af>]  [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0
RSP: 0018:ffff88007f903e08  EFLAGS: 00010092
RAX: 000000000000d4e8 RBX: 000000001e698038 RCX: 000000001e698038
RDX: ffffffff822377c0 RSI: 0000000000000003 RDI: ffff880030639f78
RBP: ffff88007f903e38 R08: 0000000000000000 R09: 0000000000000001
R10: 000000000000001b R11: ffffffff82238fc0 R12: 00000000003b4c1b
R13: ffff880030639710 R14: ffff880030639710 R15: 0000001536dbb554
FS:  0000000000000000(0000) GS:ffff88007f900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000757d4900 CR3: 000000006d8e2000 CR4: 00000000000406e0
Stack:
 ffffffff810b5955 0000000000000000 ffff88007f903e98 ffff880030639778
 ffff88007f913698 00000000003b4c1b ffff88007f903e78 ffffffff810a47d0
 ffff88007f903e78 ffff880030639778 ffff88007f913698 ffff88007f913600
Call Trace:
 <IRQ>
 [<ffffffff810b5955>] ? cpuacct_charge+0x5/0xa0
 [<ffffffff810a47d0>] update_curr+0xd0/0x190
 [<ffffffff810a767f>] task_tick_fair+0x1df/0x4f0
 [<ffffffff8109e147>] scheduler_tick+0x57/0xd0
 [<ffffffff810d7e11>] update_process_times+0x51/0x60
 [<ffffffff810e43df>] tick_periodic+0x2f/0xc0
 [<ffffffff8165b517>] ? debug_smp_processor_id+0x17/0x20
 [<ffffffff810e4599>] tick_handle_periodic+0x29/0x70
 [<ffffffff81033e6a>] local_apic_timer_interrupt+0x3a/0x70
 [<ffffffff81a8fb81>] smp_apic_timer_interrupt+0x41/0x60
 [<ffffffff81a8df1f>] apic_timer_interrupt+0x6f/0x80
 <EOI>
Code: 31 c9 45 31 c0 31 f6 48 c7 c7 c0 8f 23 82 e8 a9 71 00 00 49 8b 85 c0 0f 00 00 48 63 cb 48 8b 50 58 0f 1f 00 48 8b 82 d0 00 00 00 <48> 03 04 cd 40 47 31 82 4c 01 20 48 8b 52 48 48 85 d2 75 e5 48
RIP  [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0
 RSP <ffff88007f903e08>
CR2: 00000000757d4900
---[ end trace fa7901843d14b3ab ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
---[ end Kernel panic - not syncing: Fatal exception in interrupt
* Re: panic on 4.20 server exporting xfs filesystem
From: Dave Chinner @ 2015-03-04 22:09
To: J. Bruce Fields
Cc: linux-nfs, Christoph Hellwig, xfs

On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote:
> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote:
> > On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote:
> > > I think you'll find that the bug is only triggered after that XFS
> > > merge because it's what enabled block layout support in the server,
> > > i.e. nfsd4_setup_layout_type() is now setting the export type to
> > > LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
> > > its export ops.
> >
> > Doh--after all the discussion I didn't actually pay attention to what
> > happened in the end. OK, I see, you're right, it's all more-or-less
> > dead code till that merge.
> >
> > Christoph's code was passing all my tests before that, so maybe we
> > broke something in the merge process.
> >
> > Alternatively, it could be because I've added more tests--I'll rerun my
> > current tests on his original branch...
>
> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look
> very informative. I'm running xfstests over NFSv4.1 with client and
> server running the same kernel; the filesystem in question is xfs, but
> isn't otherwise available to the client (so the client shouldn't be
> doing pnfs).
>
> --b.
>
> BUG: unable to handle kernel paging request at 00000000757d4900
> IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0
> PGD 0
> Thread overran stack, or stack corrupted

Hmmmm. That is not at all informative, especially as it's only dumped
the interrupt stack and not the stack of the task that it detected as
overrun or corrupted.

Can you turn on all the stack overrun debug options? Maybe even turn
on the stack tracer to get an idea of whether we are recursing deeply
somewhere we shouldn't be?

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-04 22:09 ` Dave Chinner @ 2015-03-04 22:27 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-04 22:27 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-nfs, Christoph Hellwig, xfs On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote: > On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote: > > On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote: > > > On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote: > > > > On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > > > > > I'm getting mysterious crashes on a server exporting an xfs filesystem. > > > > > > > > > > Strangely, I've reproduced this on > > > > > > > > > > 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs > > > > > > > > > > but haven't yet managed to reproduce on either of its parents > > > > > (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try > > > > > again. > > > > > > > > I think you'll find that the bug is only triggered after that XFS > > > > merge because it's what enabled block layout support in the server, > > > > i.e. nfsd4_setup_layout_type() is now setting the export type to > > > > LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to > > > > it's export ops. > > > > > > Doh--after all the discussion I didn't actually pay attention to what > > > happened in the end. OK, I see, you're right, it's all more-or-less > > > dead code till that merge. > > > > > > Christoph's code was passing all my tests before that, so maybe we > > > broke something in the merge process. > > > > > > Alternatively, it could be because I've added more tests--I'll rerun my > > > current tests on his original branch.... > > > > The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look > > very informative. 
I'm running xfstests over NFSv4.1 with client and > > server running the same kernel, the filesystem in question is xfs, but > > isn't otherwise available to the client (so the client shouldn't be > > doing pnfs). > > > > --b. > > > > BUG: unable to handle kernel paging request at 00000000757d4900 > > IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0 > > PGD 0 > > Thread overran stack, or stack corrupted > > Hmmmm. That is not at all informative, especially as it's only > dumped the interrupt stack and not the stack or the task that it > has detected as overrun or corrupted. > > Can you turn on all the stack overrun debug options? Maybe even > turn on the stack tracer to get an idea of whether we are recursing > deeply somewhere we shouldn't be? Digging around under "Kernel hacking".... I already have DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try turning on the latter. (Will I be able to get information out of it before the panic?) I guess I'll also try SCHED_STACK_END_CHECK. Anything else I'm missing? --b. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
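The debug options named above are all real Kconfig symbols under "Kernel hacking". As a reference sketch (not taken from the thread), a fragment enabling the full set being discussed might look like:

```
# Stack-debugging options under "Kernel hacking"
CONFIG_DEBUG_STACK_USAGE=y      # track worst-case stack usage per task
CONFIG_DEBUG_STACKOVERFLOW=y    # warn when the interrupt stack runs low
CONFIG_STACK_TRACER=y           # ftrace-based worst-case stack recorder
CONFIG_SCHED_STACK_END_CHECK=y  # check the stack-end magic on every schedule
```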
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-04 22:27 ` J. Bruce Fields @ 2015-03-04 22:45 ` Dave Chinner -1 siblings, 0 replies; 69+ messages in thread From: Dave Chinner @ 2015-03-04 22:45 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Christoph Hellwig, xfs On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote: > On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote: > > On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote: > > > On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote: > > > > On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote: > > > > > On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > > > > > > I'm getting mysterious crashes on a server exporting an xfs filesystem. > > > > > > > > > > > > Strangely, I've reproduced this on > > > > > > > > > > > > 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs > > > > > > > > > > > > but haven't yet managed to reproduce on either of its parents > > > > > > (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try > > > > > > again. > > > > > > > > > > I think you'll find that the bug is only triggered after that XFS > > > > > merge because it's what enabled block layout support in the server, > > > > > i.e. nfsd4_setup_layout_type() is now setting the export type to > > > > > LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to > > > > > it's export ops. > > > > > > > > Doh--after all the discussion I didn't actually pay attention to what > > > > happened in the end. OK, I see, you're right, it's all more-or-less > > > > dead code till that merge. > > > > > > > > Christoph's code was passing all my tests before that, so maybe we > > > > broke something in the merge process. > > > > > > > > Alternatively, it could be because I've added more tests--I'll rerun my > > > > current tests on his original branch.... 
> > > > > > The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look > > > very informative. I'm running xfstests over NFSv4.1 with client and > > > server running the same kernel, the filesystem in question is xfs, but > > > isn't otherwise available to the client (so the client shouldn't be > > > doing pnfs). > > > > > > --b. > > > > > > BUG: unable to handle kernel paging request at 00000000757d4900 > > > IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0 > > > PGD 0 > > > Thread overran stack, or stack corrupted > > > > Hmmmm. That is not at all informative, especially as it's only > > dumped the interrupt stack and not the stack or the task that it > > has detected as overrun or corrupted. > > > > Can you turn on all the stack overrun debug options? Maybe even > > turn on the stack tracer to get an idea of whether we are recursing > > deeply somewhere we shouldn't be? > > Digging around under "Kernel hacking".... I already have > DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try > turning on the latter. (Will I be able to get information out of it > before the panic?) just keep taking samples of the worst case stack usage as the test runs. If there's anything unusual before the failure then it will show up, otherwise I'm not sure how to track this down... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
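The sampling Dave suggests can be done from userspace while the test runs, using the stack tracer's debugfs knobs (a sketch assuming CONFIG_STACK_TRACER=y and debugfs mounted at /sys/kernel/debug; run as root):

```
# Enable the stack tracer
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# Sample periodically while xfstests runs: the deepest stack seen so
# far, and the call chain that produced it
cat /sys/kernel/debug/tracing/stack_max_size
cat /sys/kernel/debug/tracing/stack_trace

# Reset the recorded maximum between test groups, if desired
echo 0 > /sys/kernel/debug/tracing/stack_max_size
```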
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-04 22:45 ` Dave Chinner @ 2015-03-04 22:49 ` Eric Sandeen -1 siblings, 0 replies; 69+ messages in thread From: Eric Sandeen @ 2015-03-04 22:49 UTC (permalink / raw) To: Dave Chinner, J. Bruce Fields; +Cc: linux-nfs, Christoph Hellwig, xfs On 3/4/15 4:45 PM, Dave Chinner wrote: > On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote: >> On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote: >>> On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote: >>>> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote: >>>>> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote: >>>>>> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: >>>>>>> I'm getting mysterious crashes on a server exporting an xfs filesystem. >>>>>>> >>>>>>> Strangely, I've reproduced this on >>>>>>> >>>>>>> 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs >>>>>>> >>>>>>> but haven't yet managed to reproduce on either of its parents >>>>>>> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try >>>>>>> again. >>>>>> >>>>>> I think you'll find that the bug is only triggered after that XFS >>>>>> merge because it's what enabled block layout support in the server, >>>>>> i.e. nfsd4_setup_layout_type() is now setting the export type to >>>>>> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to >>>>>> it's export ops. >>>>> >>>>> Doh--after all the discussion I didn't actually pay attention to what >>>>> happened in the end. OK, I see, you're right, it's all more-or-less >>>>> dead code till that merge. >>>>> >>>>> Christoph's code was passing all my tests before that, so maybe we >>>>> broke something in the merge process. >>>>> >>>>> Alternatively, it could be because I've added more tests--I'll rerun my >>>>> current tests on his original branch.... 
>>>> >>>> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look >>>> very informative. I'm running xfstests over NFSv4.1 with client and >>>> server running the same kernel, the filesystem in question is xfs, but >>>> isn't otherwise available to the client (so the client shouldn't be >>>> doing pnfs). >>>> >>>> --b. >>>> >>>> BUG: unable to handle kernel paging request at 00000000757d4900 >>>> IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0 >>>> PGD 0 >>>> Thread overran stack, or stack corrupted >>> >>> Hmmmm. That is not at all informative, especially as it's only >>> dumped the interrupt stack and not the stack or the task that it >>> has detected as overrun or corrupted. >>> >>> Can you turn on all the stack overrun debug options? Maybe even >>> turn on the stack tracer to get an idea of whether we are recursing >>> deeply somewhere we shouldn't be? >> >> Digging around under "Kernel hacking".... I already have >> DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try >> turning on the latter. (Will I be able to get information out of it >> before the panic?) > > just keep taking samples of the worst case stack usage as the test > runs. If there's anything unusual before the failure then it will > show up, otherwise I'm not sure how to track this down... I think it should print "maximum stack depth" messages whenever a stack reaches a new max excursion... > Cheers, > > Dave. > _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-04 22:49 ` Eric Sandeen @ 2015-03-04 22:56 ` Dave Chinner -1 siblings, 0 replies; 69+ messages in thread From: Dave Chinner @ 2015-03-04 22:56 UTC (permalink / raw) To: Eric Sandeen; +Cc: J. Bruce Fields, linux-nfs, Christoph Hellwig, xfs On Wed, Mar 04, 2015 at 04:49:09PM -0600, Eric Sandeen wrote: > On 3/4/15 4:45 PM, Dave Chinner wrote: > > On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote: > >> On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote: > >>> On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote: > >>>> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote: > >>>>> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote: > >>>>>> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > >>>>>>> I'm getting mysterious crashes on a server exporting an xfs filesystem. > >>>>>>> > >>>>>>> Strangely, I've reproduced this on > >>>>>>> > >>>>>>> 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs > >>>>>>> > >>>>>>> but haven't yet managed to reproduce on either of its parents > >>>>>>> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try > >>>>>>> again. > >>>>>> > >>>>>> I think you'll find that the bug is only triggered after that XFS > >>>>>> merge because it's what enabled block layout support in the server, > >>>>>> i.e. nfsd4_setup_layout_type() is now setting the export type to > >>>>>> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to > >>>>>> it's export ops. > >>>>> > >>>>> Doh--after all the discussion I didn't actually pay attention to what > >>>>> happened in the end. OK, I see, you're right, it's all more-or-less > >>>>> dead code till that merge. > >>>>> > >>>>> Christoph's code was passing all my tests before that, so maybe we > >>>>> broke something in the merge process. 
> >>>>> > >>>>> Alternatively, it could be because I've added more tests--I'll rerun my > >>>>> current tests on his original branch.... > >>>> > >>>> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look > >>>> very informative. I'm running xfstests over NFSv4.1 with client and > >>>> server running the same kernel, the filesystem in question is xfs, but > >>>> isn't otherwise available to the client (so the client shouldn't be > >>>> doing pnfs). > >>>> > >>>> --b. > >>>> > >>>> BUG: unable to handle kernel paging request at 00000000757d4900 > >>>> IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0 > >>>> PGD 0 > >>>> Thread overran stack, or stack corrupted > >>> > >>> Hmmmm. That is not at all informative, especially as it's only > >>> dumped the interrupt stack and not the stack or the task that it > >>> has detected as overrun or corrupted. > >>> > >>> Can you turn on all the stack overrun debug options? Maybe even > >>> turn on the stack tracer to get an idea of whether we are recursing > >>> deeply somewhere we shouldn't be? > >> > >> Digging around under "Kernel hacking".... I already have > >> DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try > >> turning on the latter. (Will I be able to get information out of it > >> before the panic?) > > > > just keep taking samples of the worst case stack usage as the test > > runs. If there's anything unusual before the failure then it will > > show up, otherwise I'm not sure how to track this down... > > I think it should print "maximum stack depth" messages whenever a stack > reaches a new max excursion... That gets printed only when the process exits, IIRC. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-04 22:56 ` Dave Chinner @ 2015-03-05 4:08 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-05 4:08 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-nfs, Eric Sandeen, Christoph Hellwig, xfs On Thu, Mar 05, 2015 at 09:56:23AM +1100, Dave Chinner wrote: > On Wed, Mar 04, 2015 at 04:49:09PM -0600, Eric Sandeen wrote: > > On 3/4/15 4:45 PM, Dave Chinner wrote: > > > On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote: > > >> On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote: > > >>> On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote: > > >>>> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote: > > >>>>> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote: > > >>>>>> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > > >>>>>>> I'm getting mysterious crashes on a server exporting an xfs filesystem. > > >>>>>>> > > >>>>>>> Strangely, I've reproduced this on > > >>>>>>> > > >>>>>>> 93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs > > >>>>>>> > > >>>>>>> but haven't yet managed to reproduce on either of its parents > > >>>>>>> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try > > >>>>>>> again. > > >>>>>> > > >>>>>> I think you'll find that the bug is only triggered after that XFS > > >>>>>> merge because it's what enabled block layout support in the server, > > >>>>>> i.e. nfsd4_setup_layout_type() is now setting the export type to > > >>>>>> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to > > >>>>>> it's export ops. > > >>>>> > > >>>>> Doh--after all the discussion I didn't actually pay attention to what > > >>>>> happened in the end. OK, I see, you're right, it's all more-or-less > > >>>>> dead code till that merge. 
> > >>>>> > > >>>>> Christoph's code was passing all my tests before that, so maybe we > > >>>>> broke something in the merge process. > > >>>>> > > >>>>> Alternatively, it could be because I've added more tests--I'll rerun my > > >>>>> current tests on his original branch.... > > >>>> > > >>>> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look > > >>>> very informative. I'm running xfstests over NFSv4.1 with client and > > >>>> server running the same kernel, the filesystem in question is xfs, but > > >>>> isn't otherwise available to the client (so the client shouldn't be > > >>>> doing pnfs). > > >>>> > > >>>> --b. > > >>>> > > >>>> BUG: unable to handle kernel paging request at 00000000757d4900 > > >>>> IP: [<ffffffff810b59af>] cpuacct_charge+0x5f/0xa0 > > >>>> PGD 0 > > >>>> Thread overran stack, or stack corrupted > > >>> > > >>> Hmmmm. That is not at all informative, especially as it's only > > >>> dumped the interrupt stack and not the stack or the task that it > > >>> has detected as overrun or corrupted. > > >>> > > >>> Can you turn on all the stack overrun debug options? Maybe even > > >>> turn on the stack tracer to get an idea of whether we are recursing > > >>> deeply somewhere we shouldn't be? > > >> > > >> Digging around under "Kernel hacking".... I already have > > >> DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try > > >> turning on the latter. (Will I be able to get information out of it > > >> before the panic?) > > > > > > just keep taking samples of the worst case stack usage as the test > > > runs. If there's anything unusual before the failure then it will > > > show up, otherwise I'm not sure how to track this down... > > > > I think it should print "maximum stack depth" messages whenever a stack > > reaches a new max excursion... > > That gets printed only when the process exits, IIRC. Ah-hah: static void nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) { ... 
nfsd4_cb_layout_fail(ls); That'd do it! Haven't tried to figure out why exactly that's getting called, and why only rarely. Some intermittent problem with the callback path, I guess. Anyway, I think that solves most of the mystery.... --b.
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 4:08 ` J. Bruce Fields @ 2015-03-05 13:17 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-05 13:17 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Eric Sandeen, linux-nfs, Christoph Hellwig, xfs On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > Ah-hah: > > static void > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > { > ... > nfsd4_cb_layout_fail(ls); > > That'd do it! > > Haven't tried to figure out why exactly that's getting called, and why > only rarely. Some intermittent problem with the callback path, I guess. > > Anyway, I think that solves most of the mystery.... Ooops, that was a nasty git merge error in the last rebase, see the fix below. But I really wonder if we need to make the usage of pnfs explicit after all, otherwise we'll always hand out layouts on any XFS-exported filesystems, which can't be used and will eventually need to be recalled. --- From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001 From: Christoph Hellwig <hch@lst.de> Date: Thu, 5 Mar 2015 06:12:29 -0700 Subject: nfsd: don't recursively call nfsd4_cb_layout_fail Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS layout recalls"), we recursively call nfsd4_cb_layout_fail from itself, leading to stack overflows. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/nfsd/nfs4layouts.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c index 3c1bfa1..1028a06 100644 --- a/fs/nfsd/nfs4layouts.c +++ b/fs/nfsd/nfs4layouts.c @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str)); - nfsd4_cb_layout_fail(ls); - printk(KERN_WARNING "nfsd: client %s failed to respond to layout recall. 
" " Fencing..\n", addr_str); -- 1.9.1 _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 13:17 ` Christoph Hellwig @ 2015-03-05 15:01 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-05 15:01 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Eric Sandeen, linux-nfs, xfs On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > Ah-hah: > > > > static void > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > { > > ... > > nfsd4_cb_layout_fail(ls); > > > > That'd do it! > > > > Haven't tried to figure out why exactly that's getting called, and why > > only rarely. Some intermittent problem with the callback path, I guess. > > > > Anyway, I think that solves most of the mystery.... > > Ooops, that was a nasty git merge error in the last rebase, see the fix > below. Thanks! > But I really wonder if we need to make the usage of pnfs explicit > after all, othterwise we'll always hand out layouts on any XFS-exported > filesystems, which can't be used and will eventually need to be recalled. Yeah, maybe. We could check how many GETLAYOUTs we're actually seeing on these tests. In theory the client could quit asking, or only ask every n seconds, if the layouts it gets are all turning out to be useless. --b. > > --- > >From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <hch@lst.de> > Date: Thu, 5 Mar 2015 06:12:29 -0700 > Subject: nfsd: don't recursively call nfsd4_cb_layout_fail > > Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS > layout recalls"), we recursivelt call nfsd4_cb_layout_fail from itself, > leading to stack overflows. 
> > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/nfsd/nfs4layouts.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c > index 3c1bfa1..1028a06 100644 > --- a/fs/nfsd/nfs4layouts.c > +++ b/fs/nfsd/nfs4layouts.c > @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str)); > > - nfsd4_cb_layout_fail(ls); > - > printk(KERN_WARNING > "nfsd: client %s failed to respond to layout recall. " > " Fencing..\n", addr_str); > -- > 1.9.1
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 15:01 ` J. Bruce Fields @ 2015-03-05 17:02 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-05 17:02 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Eric Sandeen, linux-nfs, xfs On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > Ah-hah: > > > > > > static void > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > { > > > ... > > > nfsd4_cb_layout_fail(ls); > > > > > > That'd do it! > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > Anyway, I think that solves most of the mystery.... > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > below. > > Thanks! And with that fix things look good. I'm still curious why the callbacks are failing. It's also logging "nfsd: client 192.168.122.32 failed to respond to layout recall". I may not get a chance to debug for another week or two. --b.
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 17:02 ` J. Bruce Fields @ 2015-03-05 20:47 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-05 20:47 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Eric Sandeen, linux-nfs, xfs On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote: > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > > Ah-hah: > > > > > > > > static void > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > > { > > > > ... > > > > nfsd4_cb_layout_fail(ls); > > > > > > > > That'd do it! > > > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > > > Anyway, I think that solves most of the mystery.... > > > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > > below. > > > > Thanks! > > And with that fix things look good. > > I'm still curious why the callbacks are failling. It's also logging > "nfsd: client 192.168.122.32 failed to respond to layout recall". I spoke too soon, I'm still not getting through my usual test run--the most recent run is hanging in generic/247 with the following in the server logs. But I probably still won't get a chance to look at this any closer till after vault. --b. nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. nfsd: fence failed for client 192.168.122.32: -2! nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. nfsd: fence failed for client 192.168.122.32: -2! 
receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 kswapd0: page allocation failure: order:0, mode:0x120 CPU: 0 PID: 580 Comm: kswapd0 Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014 0000000000000000 ffff88007f803bc8 ffffffff81a92597 0000000080000101 0000000000000120 ffff88007f803c58 ffffffff81155322 ffffffff822e9448 0000000000000000 0000000000000000 0000000000000001 00000000ffffffff Call Trace: <IRQ> [<ffffffff81a92597>] dump_stack+0x4f/0x7b [<ffffffff81155322>] warn_alloc_failed+0xe2/0x130 [<ffffffff81158503>] __alloc_pages_nodemask+0x5f3/0x860 [<ffffffff8195b2ab>] skb_page_frag_refill+0xab/0xd0 [<ffffffff817b5fba>] try_fill_recv+0x38a/0x7d0 [<ffffffff817b6af2>] virtnet_receive+0x6f2/0x860 [<ffffffff817b6c81>] virtnet_poll+0x21/0x90 [<ffffffff8197344a>] net_rx_action+0x1aa/0x320 [<ffffffff810796f2>] __do_softirq+0xf2/0x370 [<ffffffff81079ad5>] irq_exit+0x65/0x70 [<ffffffff81a9e124>] do_IRQ+0x64/0x100 [<ffffffff811d56f2>] ? free_buffer_head+0x22/0x80 [<ffffffff81a9c16f>] common_interrupt+0x6f/0x6f <EOI> [<ffffffff81092224>] ? kernel_text_address+0x64/0x70 [<ffffffff81197ff7>] ? kmem_cache_free+0xb7/0x230 [<ffffffff811d56f2>] free_buffer_head+0x22/0x80 [<ffffffff811d5b19>] try_to_free_buffers+0x79/0xc0 [<ffffffff81394b1b>] xfs_vm_releasepage+0x6b/0x160 [<ffffffff81184dc0>] ? __page_check_address+0x1f0/0x1f0 [<ffffffff8114fcc5>] try_to_release_page+0x35/0x60 [<ffffffff81162e02>] shrink_page_list+0x9c2/0xba0 [<ffffffff810bba1d>] ? trace_hardirqs_on_caller+0x12d/0x1d0 [<ffffffff811635df>] shrink_inactive_list+0x23f/0x520 [<ffffffff81164155>] shrink_lruvec+0x595/0x6d0 [<ffffffff81161a39>] ? shrink_slab.part.58.constprop.67+0x299/0x410 [<ffffffff811642dd>] shrink_zone+0x4d/0xa0 [<ffffffff81164c01>] kswapd+0x471/0xa30 [<ffffffff81164790>] ? try_to_free_pages+0x460/0x460 [<ffffffff81093a2f>] kthread+0xef/0x110 [<ffffffff81093940>] ? 
kthread_create_on_node+0x220/0x220 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220 Mem-Info: DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 CPU 1: hi: 0, btch: 1 usd: 0 DMA32 per-cpu: CPU 0: hi: 186, btch: 31 usd: 0 CPU 1: hi: 186, btch: 31 usd: 64 active_anon:7053 inactive_anon:2435 isolated_anon:0 active_file:88743 inactive_file:89505 isolated_file:32 unevictable:0 dirty:9786 writeback:0 unstable:0 free:3571 slab_reclaimable:227807 slab_unreclaimable:75772 mapped:21010 shmem:380 pagetables:1567 bounce:0 free_cma:0 DMA free:7788kB min:44kB low:52kB high:64kB active_anon:204kB inactive_anon:260kB active_file:696kB inactive_file:1016kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:0kB dirty:32kB writeback:0kB mapped:924kB shmem:36kB slab_reclaimable:1832kB slab_unreclaimable:1632kB kernel_stack:48kB pagetables:80kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no lowmem_reserve[]: 0 1943 1943 1943 DMA32 free:6188kB min:5616kB low:7020kB high:8424kB active_anon:28008kB inactive_anon:9480kB active_file:354360kB inactive_file:357004kB unevictable:0kB isolated(anon):0kB isolated(file):128kB present:2080640kB managed:1990060kB mlocked:0kB dirty:39112kB writeback:0kB mapped:83116kB shmem:1484kB slab_reclaimable:909396kB slab_unreclaimable:301708kB kernel_stack:2608kB pagetables:6188kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
no lowmem_reserve[]: 0 0 0 0 DMA: 21*4kB (UEM) 11*8kB (UM) 10*16kB (UEM) 3*32kB (UM) 1*64kB (E) 1*128kB (E) 2*256kB (UM) 1*512kB (M) 2*1024kB (UM) 2*2048kB (ER) 0*4096kB = 7788kB DMA32: 276*4kB (UM) 30*8kB (UM) 83*16kB (UM) 2*32kB (MR) 2*64kB (M) 2*128kB (R) 2*256kB (R) 1*512kB (R) 0*1024kB 1*2048kB (R) 0*4096kB = 6192kB Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB 178684 total pagecache pages 1 pages in swap cache Swap cache stats: add 14, delete 13, find 0/0 Free swap = 839620kB Total swap = 839676kB 524158 pages RAM 0 pages HighMem/MovableOnly 22666 pages reserved nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. nfsd: fence failed for client 192.168.122.32: -2! receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf INFO: task nfsd:17653 blocked for more than 120 seconds. Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. nfsd D ffff8800753a7848 11720 17653 2 0x00000000 ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580 ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848 ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50 Call Trace: [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 [<ffffffff81a95737>] schedule+0x37/0x90 [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30 [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400 [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 [<ffffffff8119f0b2>] do_readv_writev+0xe2/0x280 [<ffffffff813a0f90>] ? xfs_file_buffered_aio_write.isra.9+0x2a0/0x2a0 [<ffffffff813a0f90>] ? 
xfs_file_buffered_aio_write.isra.9+0x2a0/0x2a0 [<ffffffff810bc15d>] ? __lock_acquire+0x37d/0x2040 [<ffffffff810be759>] ? lock_release_non_nested+0xa9/0x340 [<ffffffff81a9aaad>] ? _raw_spin_unlock_irqrestore+0x5d/0x80 [<ffffffff810bba1d>] ? trace_hardirqs_on_caller+0x12d/0x1d0 [<ffffffff810bbacd>] ? trace_hardirqs_on+0xd/0x10 [<ffffffff8119f2d9>] vfs_writev+0x39/0x50 [<ffffffffa00ab861>] nfsd_vfs_write.isra.11+0xa1/0x350 [nfsd] [<ffffffffa00aecee>] nfsd_write+0x8e/0x100 [nfsd] [<ffffffffa00b95b5>] ? gen_boot_verifier+0x5/0xc0 [nfsd] [<ffffffffa00b97f5>] nfsd4_write+0x185/0x1e0 [nfsd] [<ffffffffa00bbe37>] nfsd4_proc_compound+0x3c7/0x6f0 [nfsd] [<ffffffffa00a7463>] nfsd_dispatch+0xc3/0x220 [nfsd] [<ffffffffa001314f>] svc_process_common+0x43f/0x650 [sunrpc] [<ffffffffa00134a3>] svc_process+0x143/0x260 [sunrpc] [<ffffffffa00a6cc7>] nfsd+0x167/0x1e0 [nfsd] [<ffffffffa00a6b65>] ? nfsd+0x5/0x1e0 [nfsd] [<ffffffffa00a6b60>] ? nfsd_destroy+0xe0/0xe0 [nfsd] [<ffffffff81093a2f>] kthread+0xef/0x110 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0 [<ffffffff81093940>] ? kthread_create_on_node+0x220/0x220 2 locks held by nfsd/17653: #0: (sb_writers#11){.+.+.+}, at: [<ffffffff8119f208>] do_readv_writev+0x238/0x280 #1: (&sb->s_type->i_mutex_key#15){+.+.+.}, at: [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 INFO: task nfsd:17655 blocked for more than 120 seconds. ...
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 20:47 ` J. Bruce Fields @ 2015-03-05 20:59 ` Dave Chinner -1 siblings, 0 replies; 69+ messages in thread From: Dave Chinner @ 2015-03-05 20:59 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Eric Sandeen, Christoph Hellwig, xfs On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote: > > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > > > Ah-hah: > > > > > > > > > > static void > > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > > > { > > > > > ... > > > > > nfsd4_cb_layout_fail(ls); > > > > > > > > > > That'd do it! > > > > > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > > > > > Anyway, I think that solves most of the mystery.... > > > > > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > > > below. > > > > > > Thanks! > > > > And with that fix things look good. > > > > I'm still curious why the callbacks are failling. It's also logging > > "nfsd: client 192.168.122.32 failed to respond to layout recall". > > I spoke too soon, I'm still not getting through my usual test run--the most > recent run is hanging in generic/247 with the following in the server logs. > > But I probably still won't get a chance to look at this any closer till after > vault. > > --b. > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 > kswapd0: page allocation failure: order:0, mode:0x120 [snip network driver memory allocation failure] > active_anon:7053 inactive_anon:2435 isolated_anon:0 > active_file:88743 inactive_file:89505 isolated_file:32 > unevictable:0 dirty:9786 writeback:0 unstable:0 > free:3571 slab_reclaimable:227807 slab_unreclaimable:75772 > mapped:21010 shmem:380 pagetables:1567 bounce:0 > free_cma:0 Looks like there should be heaps of reclaimable memory... > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. So there's a layout recall pending... > nfsd: fence failed for client 192.168.122.32: -2! > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf > INFO: task nfsd:17653 blocked for more than 120 seconds. > Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > nfsd D ffff8800753a7848 11720 17653 2 0x00000000 > ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580 > ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848 > ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50 > Call Trace: > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff81a95737>] schedule+0x37/0x90 > [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30 > [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400 > [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 > [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 and the nfsd got hung up on the inode mutex during a write. Which means there's some other process blocked holding the i_mutex. 
sysrq-w and sysrq-t is probably going to tell us more here. I suspect we'll have another write stuck in break_layout()..... Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-05 20:59 ` Dave Chinner 0 siblings, 0 replies; 69+ messages in thread From: Dave Chinner @ 2015-03-05 20:59 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Christoph Hellwig, Eric Sandeen, linux-nfs, xfs On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote: > > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > > > Ah-hah: > > > > > > > > > > static void > > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > > > { > > > > > ... > > > > > nfsd4_cb_layout_fail(ls); > > > > > > > > > > That'd do it! > > > > > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > > > > > Anyway, I think that solves most of the mystery.... > > > > > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > > > below. > > > > > > Thanks! > > > > And with that fix things look good. > > > > I'm still curious why the callbacks are failling. It's also logging > > "nfsd: client 192.168.122.32 failed to respond to layout recall". > > I spoke too soon, I'm still not getting through my usual test run--the most > recent run is hanging in generic/247 with the following in the server logs. > > But I probably still won't get a chance to look at this any closer till after > vault. > > --b. > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! 
> receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 > kswapd0: page allocation failure: order:0, mode:0x120 [snip network driver memory allocation failure] > active_anon:7053 inactive_anon:2435 isolated_anon:0 > active_file:88743 inactive_file:89505 isolated_file:32 > unevictable:0 dirty:9786 writeback:0 unstable:0 > free:3571 slab_reclaimable:227807 slab_unreclaimable:75772 > mapped:21010 shmem:380 pagetables:1567 bounce:0 > free_cma:0 Looks like there should be heaps of reclaimable memory... > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. So there's a layout recall pending... > nfsd: fence failed for client 192.168.122.32: -2! > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf > INFO: task nfsd:17653 blocked for more than 120 seconds. > Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > nfsd D ffff8800753a7848 11720 17653 2 0x00000000 > ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580 > ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848 > ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50 > Call Trace: > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff81a95737>] schedule+0x37/0x90 > [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30 > [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400 > [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 > [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 and the nfsd got hung up on the inode mutex during a write. Which means there's some other process blocked holding the i_mutex. 
sysrq-w and sysrq-t is probably going to tell us more here. I suspect we'll have another write stuck in break_layout()..... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 20:59 ` Dave Chinner @ 2015-03-06 20:47 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-06 20:47 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-nfs, Eric Sandeen, Christoph Hellwig, xfs On Fri, Mar 06, 2015 at 07:59:22AM +1100, Dave Chinner wrote: > On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > > On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote: > > > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > > > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > > > > Ah-hah: > > > > > > > > > > > > static void > > > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > > > > { > > > > > > ... > > > > > > nfsd4_cb_layout_fail(ls); > > > > > > > > > > > > That'd do it! > > > > > > > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > > > > > > > Anyway, I think that solves most of the mystery.... > > > > > > > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > > > > below. > > > > > > > > Thanks! > > > > > > And with that fix things look good. > > > > > > I'm still curious why the callbacks are failling. It's also logging > > > "nfsd: client 192.168.122.32 failed to respond to layout recall". > > > > I spoke too soon, I'm still not getting through my usual test run--the most > > recent run is hanging in generic/247 with the following in the server logs. > > > > But I probably still won't get a chance to look at this any closer till after > > vault. > > > > --b. > > > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > nfsd: fence failed for client 192.168.122.32: -2! 
> > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > nfsd: fence failed for client 192.168.122.32: -2! > > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 > > kswapd0: page allocation failure: order:0, mode:0x120 > > [snip network driver memory allocation failure] > > > active_anon:7053 inactive_anon:2435 isolated_anon:0 > > active_file:88743 inactive_file:89505 isolated_file:32 > > unevictable:0 dirty:9786 writeback:0 unstable:0 > > free:3571 slab_reclaimable:227807 slab_unreclaimable:75772 > > mapped:21010 shmem:380 pagetables:1567 bounce:0 > > free_cma:0 > > Looks like there should be heaps of reclaimable memory... > > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > So there's a layout recall pending... > > > nfsd: fence failed for client 192.168.122.32: -2! > > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf > > INFO: task nfsd:17653 blocked for more than 120 seconds. > > Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > nfsd D ffff8800753a7848 11720 17653 2 0x00000000 > > ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580 > > ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848 > > ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50 > > Call Trace: > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff81a95737>] schedule+0x37/0x90 > > [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30 > > [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400 > > [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > > [<ffffffff8119d9f0>] ? 
new_sync_read+0xb0/0xb0 > > [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 > > [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 > > and the nfsd got hung up on the inode mutex during a write. > > Which means there's some other process blocked holding the i_mutex. > sysrq-w and sysrq-t is probably going to tell us more here. > > I suspect we'll have another write stuck in break_layout()..... Yup! There's a bunch of threads similarly stuck in write, and then: # cat /proc/17765/stack [<ffffffff811edb18>] __break_lease+0x278/0x510 [<ffffffff813cd4d4>] xfs_break_layouts+0x94/0xf0 [<ffffffff813a0903>] xfs_file_aio_write_checks+0x53/0x100 [<ffffffff813a0d7b>] xfs_file_buffered_aio_write.isra.9+0x8b/0x2a0 [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 [<ffffffff8119f0b2>] do_readv_writev+0xe2/0x280 [<ffffffff8119f2d9>] vfs_writev+0x39/0x50 [<ffffffffa00ab861>] nfsd_vfs_write.isra.11+0xa1/0x350 [nfsd] [<ffffffffa00aecee>] nfsd_write+0x8e/0x100 [nfsd] [<ffffffffa00b97f5>] nfsd4_write+0x185/0x1e0 [nfsd] [<ffffffffa00bbe37>] nfsd4_proc_compound+0x3c7/0x6f0 [nfsd] [<ffffffffa00a7463>] nfsd_dispatch+0xc3/0x220 [nfsd] [<ffffffffa001314f>] svc_process_common+0x43f/0x650 [sunrpc] [<ffffffffa00134a3>] svc_process+0x143/0x260 [sunrpc] [<ffffffffa00a6cc7>] nfsd+0x167/0x1e0 [nfsd] [<ffffffff81093a2f>] kthread+0xef/0x110 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff I'm worried by the blocking break_lease in xfs_break_layouts. There's a circular dependency: blocking in break_lease ties up an nfsd thread, but we'll need an nfsd thread to process the client's layout return. But in the worst case I'd still expect that to be cleaned up if the client doesn't return the layout within a lease period (20 seconds on my server). In addition to fencing the client, surely we should be forcibly revoking the layout at that point? -b. 
_______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-06 20:47 ` J. Bruce Fields 0 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-06 20:47 UTC (permalink / raw) To: Dave Chinner; +Cc: Christoph Hellwig, Eric Sandeen, linux-nfs, xfs On Fri, Mar 06, 2015 at 07:59:22AM +1100, Dave Chinner wrote: > On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > > On Thu, Mar 05, 2015 at 12:02:17PM -0500, J. Bruce Fields wrote: > > > On Thu, Mar 05, 2015 at 10:01:38AM -0500, J. Bruce Fields wrote: > > > > On Thu, Mar 05, 2015 at 02:17:31PM +0100, Christoph Hellwig wrote: > > > > > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: > > > > > > Ah-hah: > > > > > > > > > > > > static void > > > > > > nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > > > > > { > > > > > > ... > > > > > > nfsd4_cb_layout_fail(ls); > > > > > > > > > > > > That'd do it! > > > > > > > > > > > > Haven't tried to figure out why exactly that's getting called, and why > > > > > > only rarely. Some intermittent problem with the callback path, I guess. > > > > > > > > > > > > Anyway, I think that solves most of the mystery.... > > > > > > > > > > Ooops, that was a nasty git merge error in the last rebase, see the fix > > > > > below. > > > > > > > > Thanks! > > > > > > And with that fix things look good. > > > > > > I'm still curious why the callbacks are failling. It's also logging > > > "nfsd: client 192.168.122.32 failed to respond to layout recall". > > > > I spoke too soon, I'm still not getting through my usual test run--the most > > recent run is hanging in generic/247 with the following in the server logs. > > > > But I probably still won't get a chance to look at this any closer till after > > vault. > > > > --b. > > > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > nfsd: fence failed for client 192.168.122.32: -2! > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. 
> > nfsd: fence failed for client 192.168.122.32: -2! > > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 > > kswapd0: page allocation failure: order:0, mode:0x120 > > [snip network driver memory allocation failure] > > > active_anon:7053 inactive_anon:2435 isolated_anon:0 > > active_file:88743 inactive_file:89505 isolated_file:32 > > unevictable:0 dirty:9786 writeback:0 unstable:0 > > free:3571 slab_reclaimable:227807 slab_unreclaimable:75772 > > mapped:21010 shmem:380 pagetables:1567 bounce:0 > > free_cma:0 > > Looks like there should be heaps of reclaimable memory... > > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > So there's a layout recall pending... > > > nfsd: fence failed for client 192.168.122.32: -2! > > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff880051dfc000 xid 8ff02aaf > > INFO: task nfsd:17653 blocked for more than 120 seconds. > > Not tainted 4.0.0-rc2-09922-g26cbcc7 #89 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > nfsd D ffff8800753a7848 11720 17653 2 0x00000000 > > ffff8800753a7848 0000000000000001 0000000000000001 ffffffff82210580 > > ffff88004e9bcb50 0000000000000006 ffffffff8119d9f0 ffff8800753a7848 > > ffff8800753a7fd8 ffff88002e5e3d70 0000000000000246 ffff88004e9bcb50 > > Call Trace: > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff81a95737>] schedule+0x37/0x90 > > [<ffffffff81a95ac8>] schedule_preempt_disabled+0x18/0x30 > > [<ffffffff81a97756>] mutex_lock_nested+0x156/0x400 > > [<ffffffff813a0d5a>] ? xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > > [<ffffffff8119d9f0>] ? new_sync_read+0xb0/0xb0 > > [<ffffffff813a0d5a>] xfs_file_buffered_aio_write.isra.9+0x6a/0x2a0 > > [<ffffffff8119d9f0>] ? 
new_sync_read+0xb0/0xb0 > > [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 > > [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 > > and the nfsd got hung up on the inode mutex during a write. > > Which means there's some other process blocked holding the i_mutex. > sysrq-w and sysrq-t is probably going to tell us more here. > > I suspect we'll have another write stuck in break_layout()..... Yup! There's a bunch of threads similarly stuck in write, and then: # cat /proc/17765/stack [<ffffffff811edb18>] __break_lease+0x278/0x510 [<ffffffff813cd4d4>] xfs_break_layouts+0x94/0xf0 [<ffffffff813a0903>] xfs_file_aio_write_checks+0x53/0x100 [<ffffffff813a0d7b>] xfs_file_buffered_aio_write.isra.9+0x8b/0x2a0 [<ffffffff813a1016>] xfs_file_write_iter+0x86/0x130 [<ffffffff8119db05>] do_iter_readv_writev+0x65/0xa0 [<ffffffff8119f0b2>] do_readv_writev+0xe2/0x280 [<ffffffff8119f2d9>] vfs_writev+0x39/0x50 [<ffffffffa00ab861>] nfsd_vfs_write.isra.11+0xa1/0x350 [nfsd] [<ffffffffa00aecee>] nfsd_write+0x8e/0x100 [nfsd] [<ffffffffa00b97f5>] nfsd4_write+0x185/0x1e0 [nfsd] [<ffffffffa00bbe37>] nfsd4_proc_compound+0x3c7/0x6f0 [nfsd] [<ffffffffa00a7463>] nfsd_dispatch+0xc3/0x220 [nfsd] [<ffffffffa001314f>] svc_process_common+0x43f/0x650 [sunrpc] [<ffffffffa00134a3>] svc_process+0x143/0x260 [sunrpc] [<ffffffffa00a6cc7>] nfsd+0x167/0x1e0 [nfsd] [<ffffffff81093a2f>] kthread+0xef/0x110 [<ffffffff81a9b5ac>] ret_from_fork+0x7c/0xb0 [<ffffffffffffffff>] 0xffffffffffffffff I'm worried by the blocking break_lease in xfs_break_layouts. There's a circular dependency: blocking in break_lease ties up an nfsd thread, but we'll need an nfsd thread to process the client's layout return. But in the worst case I'd still expect that to be cleaned up if the client doesn't return the layout within a lease period (20 seconds on my server). In addition to fencing the client, surely we should be forcibly revoking the layout at that point? -b. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-06 20:47 ` J. Bruce Fields @ 2015-03-19 17:27 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-19 17:27 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Eric Sandeen, linux-nfs, xfs FYI, I've now managed to reproduce the issue. I haven't had time to dig deeper, but it smells a lot like a callback path issue. Can you send the recursion fix to Linus ASAP and I'll send you a patch to turn the nopnfs option into a pnfs one, so we're at least doing fine for 4.0 in any case. I hope to debug this and may even have a real fix soon, but so far I don't know how long it will take. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-19 17:27 ` Christoph Hellwig 0 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-19 17:27 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Dave Chinner, Eric Sandeen, linux-nfs, xfs FYI, I've now managed to reproduce the issue. I haven't had time to dig deeper, but it smells a lot like a callback path issue. Can you send the recursion fix to Linus ASAP and I'll send you a patch to turn the nopnfs option into a pnfs one, so we're at least doing fine for 4.0 in any case. I hope to debug this and may even have a real fix soon, but so far I don't know how long it will take. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-19 17:27 ` Christoph Hellwig @ 2015-03-19 18:47 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-19 18:47 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Eric Sandeen, linux-nfs, xfs On Thu, Mar 19, 2015 at 06:27:31PM +0100, Christoph Hellwig wrote: > FYI, I've now managed to reproduce the issue. I haven't had time > to dig deeper, but it smells a lot like a callback path issue. > > Can you send the recursion fix to Linus ASAP and I'll send you a patch > to turn the nopnfs option into a pnfs one, so we're at least doing > fine for 4.0 in any case. Sure, sounds good. Also, there's the problem that when this is turned on a client can end up doing unnecessary LAYOUTGET. Do we have a plan for that? Possibilities: - Just depend on export flags: but some clients may have direct access and some not. If the clients with direct access or all easily identifiable by IP subnet, maybe it's not a big deal. Still, seems like an administrative hassle. - Do nothing, assume the client can deal with this with some kind of heuristics, and/or that the GETLAYOUT calls can be made very cheap. Not sure if that's true. - Use something like GETDEVLICELIST so the client can figure out in one go whether any layouts on a given filesystem will work. I forget what the problems with GETDEVICELIST were. > I hope to debug this and may even have > a real fix soon, but so far I don't know how long it will take. OK, thanks. --b. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-19 18:47 ` J. Bruce Fields 0 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-19 18:47 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Chinner, Eric Sandeen, linux-nfs, xfs On Thu, Mar 19, 2015 at 06:27:31PM +0100, Christoph Hellwig wrote: > FYI, I've now managed to reproduce the issue. I haven't had time > to dig deeper, but it smells a lot like a callback path issue. > > Can you send the recursion fix to Linus ASAP and I'll send you a patch > to turn the nopnfs option into a pnfs one, so we're at least doing > fine for 4.0 in any case. Sure, sounds good. Also, there's the problem that when this is turned on a client can end up doing unnecessary LAYOUTGET. Do we have a plan for that? Possibilities: - Just depend on export flags: but some clients may have direct access and some not. If the clients with direct access or all easily identifiable by IP subnet, maybe it's not a big deal. Still, seems like an administrative hassle. - Do nothing, assume the client can deal with this with some kind of heuristics, and/or that the GETLAYOUT calls can be made very cheap. Not sure if that's true. - Use something like GETDEVLICELIST so the client can figure out in one go whether any layouts on a given filesystem will work. I forget what the problems with GETDEVICELIST were. > I hope to debug this and may even have > a real fix soon, but so far I don't know how long it will take. OK, thanks. --b. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-19 18:47 ` J. Bruce Fields @ 2015-03-20 6:49 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-20 6:49 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Eric Sandeen, linux-nfs, Christoph Hellwig, xfs On Thu, Mar 19, 2015 at 02:47:14PM -0400, J. Bruce Fields wrote: > Also, there's the problem that when this is turned on a client can end > up doing unnecessary LAYOUTGET. Do we have a plan for that? > > Possibilities: > > - Just depend on export flags: but some clients may have direct > access and some not. If the clients with direct access or all > easily identifiable by IP subnet, maybe it's not a big deal. > Still, seems like an administrative hassle. We defintively want this to avoid getting into problems. > > - Do nothing, assume the client can deal with this with some > kind of heuristics, and/or that the GETLAYOUT calls can be > made very cheap. Not sure if that's true. The calls itself are cheap, the cliet processing of them isn't. I think we should just stop issueing GETLAYOUT calls on the client side if we keep errors again and again. One option might be to add negative device id cache entries, similar to how negative dentries work in the dcache. > - Use something like GETDEVLICELIST so the client can figure out > in one go whether any layouts on a given filesystem will work. > I forget what the problems with GETDEVICELIST were. The way the device IDs rules are written in NFS it is inherently racy. If I could go back 10 years in time I'd rewrite device ids to be stateids bound to a fsid, and a lot of things could be fixed up neatly that way.. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-20 6:49 ` Christoph Hellwig 0 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-20 6:49 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Dave Chinner, Eric Sandeen, linux-nfs, xfs On Thu, Mar 19, 2015 at 02:47:14PM -0400, J. Bruce Fields wrote: > Also, there's the problem that when this is turned on a client can end > up doing unnecessary LAYOUTGET. Do we have a plan for that? > > Possibilities: > > - Just depend on export flags: but some clients may have direct > access and some not. If the clients with direct access or all > easily identifiable by IP subnet, maybe it's not a big deal. > Still, seems like an administrative hassle. We defintively want this to avoid getting into problems. > > - Do nothing, assume the client can deal with this with some > kind of heuristics, and/or that the GETLAYOUT calls can be > made very cheap. Not sure if that's true. The calls itself are cheap, the cliet processing of them isn't. I think we should just stop issueing GETLAYOUT calls on the client side if we keep errors again and again. One option might be to add negative device id cache entries, similar to how negative dentries work in the dcache. > - Use something like GETDEVLICELIST so the client can figure out > in one go whether any layouts on a given filesystem will work. > I forget what the problems with GETDEVICELIST were. The way the device IDs rules are written in NFS it is inherently racy. If I could go back 10 years in time I'd rewrite device ids to be stateids bound to a fsid, and a lot of things could be fixed up neatly that way.. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 20:47 ` J. Bruce Fields @ 2015-03-08 15:30 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-08 15:30 UTC (permalink / raw) To: J. Bruce Fields; +Cc: Eric Sandeen, linux-nfs, Christoph Hellwig, xfs On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! There is no userspace elper to do the fencing, so unfortunately this is expecvted. > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 Now this looks like some issue with the low-level callback path. I've never seen tis before, but from looking at receive_cb_reply this happens if xprt_lookup_rqst can't find a rpc_rqst structured for the xid. Looks like we might be corrupting the request list / xid allocation somewhere? I can prepare a patch for you to aid with xid tracing if you want. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem @ 2015-03-08 15:30 ` Christoph Hellwig 0 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-08 15:30 UTC (permalink / raw) To: J. Bruce Fields Cc: Christoph Hellwig, Dave Chinner, Eric Sandeen, linux-nfs, xfs On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > nfsd: fence failed for client 192.168.122.32: -2! There is no userspace elper to do the fencing, so unfortunately this is expecvted. > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 Now this looks like some issue with the low-level callback path. I've never seen tis before, but from looking at receive_cb_reply this happens if xprt_lookup_rqst can't find a rpc_rqst structured for the xid. Looks like we might be corrupting the request list / xid allocation somewhere? I can prepare a patch for you to aid with xid tracing if you want. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-08 15:30 ` Christoph Hellwig @ 2015-03-09 19:45 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-09 19:45 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Eric Sandeen, linux-nfs, xfs On Sun, Mar 08, 2015 at 04:30:56PM +0100, Christoph Hellwig wrote: > On Thu, Mar 05, 2015 at 03:47:49PM -0500, J. Bruce Fields wrote: > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > nfsd: fence failed for client 192.168.122.32: -2! > > nfsd: client 192.168.122.32 failed to respond to layout recall. Fencing.. > > nfsd: fence failed for client 192.168.122.32: -2! > > There is no userspace helper to do the fencing, so unfortunately this > is expected. > > > receive_cb_reply: Got unrecognized reply: calldir 0x1 xpt_bc_xprt ffff88005639a000 xid c21abd62 > > Now this looks like some issue with the low-level callback path. I've never > seen this before, but from looking at receive_cb_reply this happens if > xprt_lookup_rqst can't find an rpc_rqst structure for the xid. Looks like > we might be corrupting the request list / xid allocation somewhere? > > I can prepare a patch for you to aid with xid tracing if you want. I'll take a look when I get back. But before that I'd like to understand why the layout seems to be left here blocking writes forever, instead of getting cleaned up after a lease period with no layout return. --b. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-05 13:17 ` Christoph Hellwig @ 2015-03-20 4:06 ` Kinglong Mee -1 siblings, 0 replies; 69+ messages in thread From: Kinglong Mee @ 2015-03-20 4:06 UTC (permalink / raw) To: Christoph Hellwig Cc: J. Bruce Fields, Eric Sandeen, Linux NFS Mailing List, xfs On Thu, Mar 5, 2015 at 9:17 PM, Christoph Hellwig <hch@lst.de> wrote: > On Wed, Mar 04, 2015 at 11:08:49PM -0500, J. Bruce Fields wrote: >> Ah-hah: >> >> static void >> nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) >> { >> ... >> nfsd4_cb_layout_fail(ls); >> >> That'd do it! >> >> Haven't tried to figure out why exactly that's getting called, and why >> only rarely. Some intermittent problem with the callback path, I guess. >> >> Anyway, I think that solves most of the mystery.... > > Ooops, that was a nasty git merge error in the last rebase, see the fix > below. But I really wonder if we need to make the usage of pnfs explicit > after all, otherwise we'll always hand out layouts on any XFS-exported > filesystems, which can't be used and will eventually need to be recalled. > > --- > From ad592590cce9f7441c3cd21d030f3a986d8759d7 Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig <hch@lst.de> > Date: Thu, 5 Mar 2015 06:12:29 -0700 > Subject: nfsd: don't recursively call nfsd4_cb_layout_fail > > Due to a merge error when creating c5c707f9 ("nfsd: implement pNFS > layout recalls"), we recursively call nfsd4_cb_layout_fail from itself, > leading to stack overflows.
> > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/nfsd/nfs4layouts.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c > index 3c1bfa1..1028a06 100644 > --- a/fs/nfsd/nfs4layouts.c > +++ b/fs/nfsd/nfs4layouts.c > @@ -587,8 +587,6 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) > > rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str)); > > - nfsd4_cb_layout_fail(ls); > - Maybe you want to add "trace_layout_recall_fail(&ls->ls_stid.sc_stateid);" here? I think the following is better: @@ -587,7 +587,7 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str)); - nfsd4_cb_layout_fail(ls); + trace_layout_recall_fail(&ls->ls_stid.sc_stateid); printk(KERN_WARNING "nfsd: client %s failed to respond to layout recall. " thanks, Kinglong Mee ^ permalink raw reply [flat|nested] 69+ messages in thread
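[Aside: the bug class fixed by the patch above is an unconditional self-call — every layout-recall failure re-entered nfsd4_cb_layout_fail until the kernel stack overflowed. The toy function below simulates that with a depth counter in place of a real stack overflow; it is an illustration, not nfsd code.]

```c
/* Global call-depth counter standing in for stack consumption. */
static int depth;

/* Mimics the buggy function: it calls itself with no base case other
 * than the artificial depth guard that substitutes for blowing the
 * real (fixed-size) kernel stack. */
static void buggy_cb_layout_fail(void)
{
    if (++depth > 1000)     /* stand-in for stack exhaustion */
        return;
    buggy_cb_layout_fail();
}
```

With a real kernel stack there is no guard, so the recursion runs until the stack is exhausted and the machine panics — hence the one-line fix replacing the self-call with the intended trace call.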
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-20 4:06 ` Kinglong Mee @ 2015-03-20 6:50 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-20 6:50 UTC (permalink / raw) To: Kinglong Mee Cc: Linux NFS Mailing List, Eric Sandeen, xfs, J. Bruce Fields, Christoph Hellwig On Fri, Mar 20, 2015 at 12:06:18PM +0800, Kinglong Mee wrote: > Maybe you want to add "trace_layout_recall_fail(&ls->ls_stid.sc_stateid);" here? > I think the following is better, It is! That's indeed what I had in my old trees before the big rebase. Can you write up a small changelog and add a signoff? Thanks a lot! ^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH] NFSD: Fix infinite loop in nfsd4_cb_layout_fail() 2015-03-20 6:50 ` Christoph Hellwig @ 2015-03-20 7:56 ` Kinglong Mee -1 siblings, 0 replies; 69+ messages in thread From: Kinglong Mee @ 2015-03-20 7:56 UTC (permalink / raw) To: Christoph Hellwig, J. Bruce Fields Cc: Eric Sandeen, Linux NFS Mailing List, xfs Fix a typo introduced by commit 31ef83dc05 ("nfsd: add trace events") that causes an infinite loop when a layout recall callback fails. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Kinglong Mee <kinglongmee@gmail.com> --- fs/nfsd/nfs4layouts.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfsd/nfs4layouts.c b/fs/nfsd/nfs4layouts.c index 3c1bfa1..0a616b5 100644 --- a/fs/nfsd/nfs4layouts.c +++ b/fs/nfsd/nfs4layouts.c @@ -587,7 +587,7 @@ nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls) rpc_ntop((struct sockaddr *)&clp->cl_addr, addr_str, sizeof(addr_str)); - nfsd4_cb_layout_fail(ls); + trace_layout_recall_fail(&ls->ls_stid.sc_stateid); printk(KERN_WARNING "nfsd: client %s failed to respond to layout recall. " -- 2.3.3 ^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-03 22:10 ` J. Bruce Fields @ 2015-03-15 12:58 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-15 12:58 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Christoph Hellwig, xfs On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > I'm getting mysterious crashes on a server exporting an xfs filesystem. Can you share the setup used to reproduce the various issues in this thread? I've been trying to get a baseline, but even without CONFIG_NFSD_PNFS set I start to run into other problems early on, so I suspect the reproducer might be something else? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-15 12:58 ` Christoph Hellwig @ 2015-03-16 14:27 ` J. Bruce Fields -1 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-16 14:27 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-nfs, xfs On Sun, Mar 15, 2015 at 01:58:21PM +0100, Christoph Hellwig wrote: > On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote: > > I'm getting mysterious crashes on a server exporting an xfs filesystem. > > Can you share the setup used to reproduce the various issues in this thread? Server and client are both kvm guests. I'm running xfstests with a bunch of known-to-fail tests excluded: cat >/etc/xfsqa.config <<EOF TEST_DIR=/mnt TEST_DEV=$server:/exports/xfs2 NFS_MOUNT_OPTIONS="-overs=4.1" EOF ./check -nfs -g auto -E ~/xfstests-skip where xfstests-skip is: generic/003 generic/004 generic/008 generic/009 generic/010 generic/012 generic/015 generic/016 generic/018 generic/020 generic/021 generic/022 generic/024 generic/025 generic/026 generic/027 generic/030 generic/034 generic/036 generic/038 generic/039 generic/040 generic/041 generic/070 generic/076 generic/077 generic/079 generic/083 generic/093 generic/097 generic/099 generic/112 generic/113 generic/123 generic/125 generic/128 generic/192 generic/193 generic/198 generic/204 generic/207 generic/208 generic/209 generic/210 generic/211 generic/212 generic/213 generic/214 generic/219 generic/223 generic/224 generic/226 generic/228 generic/230 generic/231 generic/232 generic/233 generic/234 generic/235 generic/237 generic/239 generic/240 generic/241 generic/255 generic/256 generic/260 generic/269 generic/270 generic/273 generic/274 generic/275 generic/280 generic/288 generic/299 generic/300 generic/311 generic/312 generic/314 generic/315 generic/316 generic/317 generic/318 generic/319 generic/320 generic/321 generic/322 generic/323 generic/324 generic/325 shared/006 shared/032 shared/051 shared/272 shared/289 shared/298 Kernel is
26cbcc77df, from git://linux-nfs.org/~bfields/linux-topics.git for-4.0-incoming (just upstream 6587457b4b3 plus your fix for layout_fail recursion). I have another exported xfs filesystem on this server that's on a device shared with the client, but this filesystem isn't. Config has basically all the NFS stuff turned on (including NFSD_PNFS). Gory details in an ugly pile of shell scripts at git://linux-nfs.org/~bfields/testd.git. > I've been trying to get a baseline, but even without CONFIG_NFSD_PNFS set > I start to run into other problems early on, so I suspect the reproducer might > be something else? What other problems? --b. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-16 14:27 ` J. Bruce Fields @ 2015-03-17 10:30 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-17 10:30 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Christoph Hellwig, xfs On Mon, Mar 16, 2015 at 10:27:21AM -0400, J. Bruce Fields wrote: > > I've been trying to get a baseline, but even without CONFIG_NFSD_PNFS set > > I start to run into other problems early on, so I suspect the reproducer might > > be something else? > > What other problems? See the "nfsd use after free in 4.0-rc" thread on linux-nfs. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-16 14:27 ` J. Bruce Fields @ 2015-03-18 10:50 ` Christoph Hellwig -1 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-18 10:50 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs, Christoph Hellwig, xfs On Mon, Mar 16, 2015 at 10:27:21AM -0400, J. Bruce Fields wrote: > NFS_MOUNT_OPTIONS="-overs=4.1" > EOF > ./check -nfs -g auto -E ~/xfstests-skip > > where xfstests-skip is: Seems like you're excluding a lot of tests. Either way, both generic/011 and generic/013 cause the bug I reported earlier, so at least on 4.0-rc I don't really get close to running into these pnfs issues. Do you have a commit you're running on that's before the nfs client merge and somewhat stable? ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-03 22:10 ` J. Bruce Fields ` (2 preceding siblings ...) (?) @ 2015-03-27 10:41 ` Christoph Hellwig 2015-03-27 14:50 ` Jeff Layton ` (2 more replies) -1 siblings, 3 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-27 10:41 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs FYI, a small update on tracking down the recall issue: this seems to be very much something in the callback channel on the server. When tracing the client, all the recalls it gets are handled fine, but we do get errors back in the layout recall ->done handler, which most of the time but not always are local Linux errnos and not nfs error numbers, indicating something went wrong, probably in the RPC code. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-27 10:41 ` Christoph Hellwig @ 2015-03-27 14:50 ` Jeff Layton 2015-03-30 16:44 ` Christoph Hellwig 2015-03-27 15:13 ` J. Bruce Fields 2015-04-26 16:19 ` Christoph Hellwig 2 siblings, 1 reply; 69+ messages in thread From: Jeff Layton @ 2015-03-27 14:50 UTC (permalink / raw) To: Christoph Hellwig; +Cc: J. Bruce Fields, linux-nfs On Fri, 27 Mar 2015 11:41:35 +0100 Christoph Hellwig <hch@lst.de> wrote: > FYI, a small update on tracking down the recall issue: this seems to > be very much something in the callback channel on the server. When tracing > the client, all the recalls it gets are handled fine, but we do get > errors back in the layout recall ->done handler, which most of the time > but not always are local Linux errnos and not nfs error numbers, indicating > something went wrong, probably in the RPC code. Taking a quick look, the ->done routines look a little suspicious to me anyway. AFAICT, the pc_decode routines for the callbacks always return a Linux errno, not a nfsstat4, and that's what should end up in tk_status. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-27 14:50 ` Jeff Layton @ 2015-03-30 16:44 ` Christoph Hellwig 0 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-03-30 16:44 UTC (permalink / raw) To: Jeff Layton; +Cc: Christoph Hellwig, J. Bruce Fields, linux-nfs On Fri, Mar 27, 2015 at 10:50:29AM -0400, Jeff Layton wrote: > > Taking a quick look, the ->done routines look a little suspicious to me > anyway. AFAICT, the pc_decode routines for the callbacks always return > a Linux errno, not a nfsstat4, and that's what should end up in > tk_status. Only for errors where nfs_cb_stat_to_errno knows how to translate them, which aren't those we're worried about here. However this partial translation looks really suspicious to me; I really wish we could have a Linux tk_status, and then an NFS-specific way to transfer the errno. I'd also love to have a new __bitwise type so that we can ensure NFS errnos are used properly. ^ permalink raw reply [flat|nested] 69+ messages in thread
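[Aside: the "partial translation" concern above can be illustrated with a toy mapping. Known NFS4 status codes are converted to local errnos while unknown ones pass through, so the same return value can hold either namespace. The constants and function name below are hypothetical stand-ins, not the kernel's nfs_cb_stat_to_errno table.]

```c
#include <errno.h>

/* Demo constants standing in for on-the-wire nfsstat4 values. */
enum {
    DEMO_NFS4_OK            = 0,
    DEMO_NFS4ERR_BADHANDLE  = 10001,
    DEMO_NFS4ERR_DELAY      = 10008,
};

/* Partial translation: only codes listed in the switch become Linux
 * errnos; anything else is passed through negated, still an NFS status.
 * A caller inspecting the result cannot tell which namespace it is in. */
static int demo_stat_to_errno(int stat)
{
    switch (stat) {
    case DEMO_NFS4_OK:
        return 0;
    case DEMO_NFS4ERR_BADHANDLE:
        return -EBADF;      /* translated: local errno */
    default:
        return -stat;       /* untranslated: NFS status leaks through */
    }
}
```

The mixed namespaces are exactly why a `__bitwise` annotation (as suggested above) would help: sparse could then flag any place an NFS status is compared against or stored as a plain errno.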
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-27 10:41 ` Christoph Hellwig 2015-03-27 14:50 ` Jeff Layton @ 2015-03-27 15:13 ` J. Bruce Fields 2015-04-26 16:19 ` Christoph Hellwig 2 siblings, 0 replies; 69+ messages in thread From: J. Bruce Fields @ 2015-03-27 15:13 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-nfs On Fri, Mar 27, 2015 at 11:41:35AM +0100, Christoph Hellwig wrote: > FYI, I small update on tracking down the recall issue: this seems to > be very much something in the callback channel on the server. When tracing > the client all the recalls it gets they are handled fine, but we do get > error back in the layout recall ->done handler, which most of the time > but not always are local Linux errnos and not nfs error numbers, indicating > something went wrong, probably in the RPC code. Do you have a patch to switch the default, as well? Were you just planning to replace NFSEXP_NOPNFS by a NFSEXP_PNFS? As long as we're still in -rc and nfs-utils doesn't know about the new flag, I can't see why that would be a problem. --b. ^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: panic on 4.20 server exporting xfs filesystem 2015-03-27 10:41 ` Christoph Hellwig 2015-03-27 14:50 ` Jeff Layton 2015-03-27 15:13 ` J. Bruce Fields @ 2015-04-26 16:19 ` Christoph Hellwig 2 siblings, 0 replies; 69+ messages in thread From: Christoph Hellwig @ 2015-04-26 16:19 UTC (permalink / raw) To: J. Bruce Fields; +Cc: linux-nfs On Fri, Mar 27, 2015 at 11:41:35AM +0100, Christoph Hellwig wrote: > FYI, a small update on tracking down the recall issue: this seems to > be very much something in the callback channel on the server. When tracing > the client, all the recalls it gets are handled fine, but we do get > errors back in the layout recall ->done handler, which most of the time > but not always are local Linux errnos and not nfs error numbers, indicating > something went wrong, probably in the RPC code. I think I've tracked down the major issue here (I think there are some more hiding in the backchannel error handling as well): - the Linux NFS server completely ignores the limits the client specifies for the backchannel in CREATE_SESSION, most importantly the ca_maxrequests value. Thus it will happily send lots of callback requests that can overflow the clients callback slot table. - even worse the Linux client has a callback slot table with just a single entry, so this is pretty easy to trigger. I can try to dive into this, but it might make sense if someone more familiar with the sessions implementation could look into this issue. ^ permalink raw reply [flat|nested] 69+ messages in thread
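[Aside: the missing check described above amounts to respecting the backchannel slot count the client advertised in CREATE_SESSION before issuing a callback. The sketch below uses hypothetical names, not nfsd's real session structures; it only illustrates the accounting that would prevent overflowing the client's slot table.]

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-session backchannel accounting. */
struct demo_backchannel {
    uint32_t ca_maxrequests;    /* slot count advertised by the client */
    uint32_t inflight;          /* callbacks currently outstanding */
};

/* A callback may only be sent while a client slot is free. */
static bool demo_cb_slot_available(const struct demo_backchannel *bc)
{
    return bc->inflight < bc->ca_maxrequests;
}

/* Returns false instead of sending when the client's table is full;
 * a real implementation would queue or defer the callback here. */
static bool demo_send_callback(struct demo_backchannel *bc)
{
    if (!demo_cb_slot_available(bc))
        return false;           /* would overflow the client's slot table */
    bc->inflight++;
    return true;
}
```

With the single-entry client slot table mentioned above (ca_maxrequests of 1), a second concurrent callback is already one too many, which matches how easily the overflow triggers.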