Date: Wed, 1 Nov 2017 16:07:01 +1100
From: Dave Chinner
To: Cong Wang
Cc: Dave Chinner, darrick.wong@oracle.com, linux-xfs@vger.kernel.org, LKML,
    Christoph Hellwig, Al Viro
Subject: Re: xfs: list corruption in xfs_setup_inode()
Message-ID: <20171101050701.GP5858@dastard>
References: <20171031003358.GD5858@dastard> <20171101030536.GN5858@dastard>

On Tue, Oct 31, 2017 at 09:43:03PM -0700, Cong Wang wrote:
> On Tue, Oct 31, 2017 at 8:05 PM, Dave Chinner wrote:
> > On Tue, Oct 31, 2017 at 06:51:08PM -0700, Cong Wang wrote:
> >> >> Please let me know if I can provide any other information.
> >> >
> >> > How do you reproduce the problem?
> >>
> >> The warning is reported via ABRT email, we don't know what was
> >> happening at the time of crash.
> >
> > Which makes it even harder to track down. Perhaps you should
> > configure the box to crashdump on such a failure and then we
> > can do some post-failure forensic analysis...
>
> Yeah.
>
> We are trying to make kdump working, but even if kdump works
> we still can't turn on panic_on_warn since this is production
> machine.

Hmmm. OK, maybe you could leave tracing of the xfs_iget* trace points
running and check the tail of the trace log for unusual events around
the time of the next crash, e.g. xfs_iget_reclaim_fail events. That
might point us to a potential interaction we can look at more closely.
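The tracing suggestion above could be scripted roughly as follows. This is a minimal Python sketch, assuming tracefs is mounted at /sys/kernel/tracing or /sys/kernel/debug/tracing, that the events appear under events/xfs/xfs_iget*/, and that trace lines use the default ftrace text format. The helper names (find_tracefs, enable_iget_events, read_tail, count_events, unusual_events) and the set of "unusual" events (xfs_iget_reclaim_fail from the mail, plus xfs_iget_skip as a guess) are hypothetical, not part of any existing tool.

```python
import collections
import glob
import os
import re

# Assumption: tracefs is mounted at one of these paths; which one exists
# depends on the kernel version and distro setup.
TRACEFS_CANDIDATES = ("/sys/kernel/tracing", "/sys/kernel/debug/tracing")

# Hypothetical "unusual" set: xfs_iget_reclaim_fail is the example from the
# mail; xfs_iget_skip is an illustrative guess, not a known-bad signal.
UNUSUAL_EVENTS = frozenset({"xfs_iget_reclaim_fail", "xfs_iget_skip"})

# Extracts the event name from a default-format ftrace text line, e.g.
#   "  cp-4242  [003] ....  1234.567890: xfs_iget_hit: dev 8:1 ino 0x42"
EVENT_RE = re.compile(r"\s(xfs_iget_\w+):")


def find_tracefs():
    """Return the first candidate directory that contains a "trace" file."""
    for path in TRACEFS_CANDIDATES:
        if os.path.isfile(os.path.join(path, "trace")):
            return path
    raise FileNotFoundError("tracefs does not appear to be mounted")


def enable_iget_events(tracefs):
    """Write "1" to every events/xfs/xfs_iget*/enable file (needs root)."""
    pattern = os.path.join(tracefs, "events", "xfs", "xfs_iget*", "enable")
    paths = sorted(glob.glob(pattern))
    for path in paths:
        with open(path, "w") as f:
            f.write("1")
    return paths


def read_tail(tracefs, n=5000):
    """Return the last n lines of the non-consuming "trace" file."""
    with open(os.path.join(tracefs, "trace")) as f:
        # A deque with maxlen keeps only the tail instead of every line.
        return [line.rstrip("\n") for line in collections.deque(f, maxlen=n)]


def count_events(lines):
    """Count xfs_iget* events by name; header/comment lines are skipped."""
    return collections.Counter(
        m.group(1) for m in map(EVENT_RE.search, lines) if m
    )


def unusual_events(lines):
    """Yield (event_name, line) for each line whose event is in UNUSUAL_EVENTS."""
    for line in lines:
        m = EVENT_RE.search(line)
        if m and m.group(1) in UNUSUAL_EVENTS:
            yield m.group(1), line.rstrip("\n")
```

Usage under those assumptions: run enable_iget_events(find_tracefs()) once as root, then after the next ABRT warning pass read_tail(find_tracefs()) to count_events() and unusual_events(). Reading the "trace" file does not consume the buffer (unlike "trace_pipe"), but the per-CPU ring buffer (sized via buffer_size_kb in tracefs) is finite, and xfs_iget_hit fires on inode cache hits, so on a busy machine the tail may cover only a short window.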
I'd also suggest enabling slab poisoning, as that will catch other
lifecycle problems, such as use-after-free, that could be causing
list corruption.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com