From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from relay.sgi.com (relay3.corp.sgi.com [198.149.34.15])
	by oss.sgi.com (Postfix) with ESMTP id EF7A87F37
	for ; Wed, 4 Mar 2015 22:08:55 -0600 (CST)
Received: from cuda.sgi.com (cuda2.sgi.com [192.48.176.25])
	by relay3.corp.sgi.com (Postfix) with ESMTP id 6B934AC001
	for ; Wed, 4 Mar 2015 20:08:52 -0800 (PST)
Received: from fieldses.org (fieldses.org [173.255.197.46])
	by cuda.sgi.com with ESMTP id M95DjAZUjW0YHQ4s
	for ; Wed, 04 Mar 2015 20:08:50 -0800 (PST)
Date: Wed, 4 Mar 2015 23:08:49 -0500
From: "J. Bruce Fields"
Subject: Re: panic on 4.20 server exporting xfs filesystem
Message-ID: <20150305040849.GJ1627@fieldses.org>
References: <20150303221033.GB19439@fieldses.org>
	<20150303224456.GV4251@dastard>
	<20150304020826.GD19439@fieldses.org>
	<20150304155421.GE1627@fieldses.org>
	<20150304220900.GX18360@dastard>
	<20150304222709.GI1627@fieldses.org>
	<20150304224557.GY4251@dastard>
	<54F78BE5.1020608@sandeen.net>
	<20150304225623.GZ4251@dastard>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20150304225623.GZ4251@dastard>
List-Id: XFS Filesystem from SGI
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: xfs-bounces@oss.sgi.com
Sender: xfs-bounces@oss.sgi.com
To: Dave Chinner
Cc: linux-nfs@vger.kernel.org, Eric Sandeen , Christoph Hellwig , xfs@oss.sgi.com

On Thu, Mar 05, 2015 at 09:56:23AM +1100, Dave Chinner wrote:
> On Wed, Mar 04, 2015 at 04:49:09PM -0600, Eric Sandeen wrote:
> > On 3/4/15 4:45 PM, Dave Chinner wrote:
> > > On Wed, Mar 04, 2015 at 05:27:09PM -0500, J. Bruce Fields wrote:
> > >> On Thu, Mar 05, 2015 at 09:09:00AM +1100, Dave Chinner wrote:
> > >>> On Wed, Mar 04, 2015 at 10:54:21AM -0500, J. Bruce Fields wrote:
> > >>>> On Tue, Mar 03, 2015 at 09:08:26PM -0500, J. Bruce Fields wrote:
> > >>>>> On Wed, Mar 04, 2015 at 09:44:56AM +1100, Dave Chinner wrote:
> > >>>>>> On Tue, Mar 03, 2015 at 05:10:33PM -0500, J. Bruce Fields wrote:
> > >>>>>>> I'm getting mysterious crashes on a server exporting an xfs filesystem.
> > >>>>>>>
> > >>>>>>> Strangely, I've reproduced this on
> > >>>>>>>
> > >>>>>>> 	93aaa830fc17 "Merge tag 'xfs-pnfs-for-linus-3.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs
> > >>>>>>>
> > >>>>>>> but haven't yet managed to reproduce on either of its parents
> > >>>>>>> (24a52e412ef2 or 781355c6e5ae). That might just be chance, I'll try
> > >>>>>>> again.
> > >>>>>>
> > >>>>>> I think you'll find that the bug is only triggered after that XFS
> > >>>>>> merge because it's what enabled block layout support in the server,
> > >>>>>> i.e. nfsd4_setup_layout_type() is now setting the export type to
> > >>>>>> LAYOUT_BLOCK_VOLUME because XFS has added the necessary functions to
> > >>>>>> it's export ops.
> > >>>>>
> > >>>>> Doh--after all the discussion I didn't actually pay attention to what
> > >>>>> happened in the end. OK, I see, you're right, it's all more-or-less
> > >>>>> dead code till that merge.
> > >>>>>
> > >>>>> Christoph's code was passing all my tests before that, so maybe we
> > >>>>> broke something in the merge process.
> > >>>>>
> > >>>>> Alternatively, it could be because I've added more tests--I'll rerun my
> > >>>>> current tests on his original branch....
> > >>>>
> > >>>> The below is on Christoph's pnfsd-for-3.20-4 (at cd4b02e). Doesn't look
> > >>>> very informative. I'm running xfstests over NFSv4.1 with client and
> > >>>> server running the same kernel, the filesystem in question is xfs, but
> > >>>> isn't otherwise available to the client (so the client shouldn't be
> > >>>> doing pnfs).
> > >>>>
> > >>>> --b.
> > >>>>
> > >>>> BUG: unable to handle kernel paging request at 00000000757d4900
> > >>>> IP: [] cpuacct_charge+0x5f/0xa0
> > >>>> PGD 0
> > >>>> Thread overran stack, or stack corrupted
> > >>>
> > >>> Hmmmm. That is not at all informative, especially as it's only
> > >>> dumped the interrupt stack and not the stack or the task that it
> > >>> has detected as overrun or corrupted.
> > >>>
> > >>> Can you turn on all the stack overrun debug options? Maybe even
> > >>> turn on the stack tracer to get an idea of whether we are recursing
> > >>> deeply somewhere we shouldn't be?
> > >>
> > >> Digging around under "Kernel hacking".... I already have
> > >> DEBUG_STACK_USAGE, DEBUG_STACKOVERFLOW, and STACK_TRACER, and I can try
> > >> turning on the latter. (Will I be able to get information out of it
> > >> before the panic?)
> > >
> > > just keep taking samples of the worst case stack usage as the test
> > > runs. If there's anything unusual before the failure then it will
> > > show up, otherwise I'm not sure how to track this down...
> >
> > I think it should print "maximum stack depth" messages whenever a stack
> > reaches a new max excursion...
>
> That gets printed only when the process exits, IIRC.

Ah-hah:

	static void
	nfsd4_cb_layout_fail(struct nfs4_layout_stateid *ls)
	{
		...
		nfsd4_cb_layout_fail(ls);

That'd do it!

Haven't tried to figure out why exactly that's getting called, and why
only rarely. Some intermittent problem with the callback path, I guess.

Anyway, I think that solves most of the mystery....

--b.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs