From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S934610AbdKAVfJ (ORCPT ); Wed, 1 Nov 2017 17:35:09 -0400
Received: from ipmail06.adl6.internode.on.net ([150.101.137.145]:42573
        "EHLO ipmail06.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK)
        by vger.kernel.org with ESMTP id S934328AbdKAVfG (ORCPT );
        Wed, 1 Nov 2017 17:35:06 -0400
Date: Thu, 2 Nov 2017 08:32:30 +1100
From: Dave Chinner
To: Cong Wang
Cc: Dave Chinner , darrick.wong@oracle.com, linux-xfs@vger.kernel.org,
        LKML , Christoph Hellwig , Al Viro
Subject: Re: xfs: list corruption in xfs_setup_inode()
Message-ID: <20171101213230.GR5858@dastard>
References: <20171031003358.GD5858@dastard>
        <20171101030536.GN5858@dastard>
        <20171101050701.GP5858@dastard>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20171101050701.GP5858@dastard>
User-Agent: Mutt/1.5.21 (2010-09-15)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Nov 01, 2017 at 04:07:01PM +1100, Dave Chinner wrote:
> On Tue, Oct 31, 2017 at 09:43:03PM -0700, Cong Wang wrote:
> > On Tue, Oct 31, 2017 at 8:05 PM, Dave Chinner wrote:
> > > On Tue, Oct 31, 2017 at 06:51:08PM -0700, Cong Wang wrote:
> > >> >> Please let me know if I can provide any other information.
> > >> >
> > >> > How do you reproduce the problem?
> > >>
> > >> The warning is reported via ABRT email, we don't know what was
> > >> happening at the time of crash.
> > >
> > > Which makes it even harder to track down. Perhaps you should
> > > configure the box to crashdump on such a failure and then we
> > > can do some post-failure forensic analysis...
> >
> > Yeah.
> >
> > We are trying to make kdump working, but even if kdump works
> > we still can't turn on panic_on_warn since this is production
> > machine.
>
> Hmmm. Ok, maybe you could leave a trace of the xfs_iget* trace
> points running and check the log tail for unusual events around the
> time of the next crash. e.g. xfs_iget_reclaim_fail events. That
> might point us to a potential interaction we can look at more
> closely. I'd also suggest slab poisoning as well, as that will
> catch other lifecycle problems that could be causing list
> corruptions such as use-after-free.

FWIW, I note that you are reporting another memory
corruption/use-after-free related crash in the pipe_inode_info
structure on these same machines. I'd suggest that you start with
the premise that this list corruption has the same root cause...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
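
The "check the log tail for unusual events around the time of the next
crash" step quoted above could be scripted roughly as follows. This is
only a sketch, not part of the original mail. It assumes the trace was
saved as ftrace text output, where each line carries a
"<seconds>.<usec>: <event>:" field, and that the crash timestamp is
known on the same clock. The function name events_near, the window
size, and the sample lines are all hypothetical, and the sample lines
have their event details elided.

```python
import re

# Assumption: ftrace text lines contain "<seconds>.<usec>: <event>:".
LINE_RE = re.compile(r"\s(\d+\.\d+):\s+(\w+):")


def events_near(lines, crash_ts, window=5.0, prefix="xfs_iget"):
    """Return (timestamp, event) pairs for events whose name starts with
    `prefix` and that fall in the `window` seconds up to `crash_ts`."""
    hits = []
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a trace event line (header, blank, ...)
        ts, event = float(m.group(1)), m.group(2)
        if event.startswith(prefix) and crash_ts - window <= ts <= crash_ts:
            hits.append((ts, event))
    return hits


# Hypothetical sample lines, made up for illustration only.
SAMPLE = [
    "# tracer: nop\n",
    "  task-100 [000] ....  90.000000: xfs_iget_hit: <details>\n",
    "  task-101 [001] ....  97.500000: xfs_iget_reclaim_fail: <details>\n",
    "  task-102 [001] ....  98.000000: sched_switch: <details>\n",
    "  task-100 [000] ....  99.000000: xfs_iget_hit: <details>\n",
    "  task-101 [001] .... 101.000000: xfs_iget_reclaim_fail: <details>\n",
]

if __name__ == "__main__":
    # Print xfs_iget* events seen in the 5 seconds before a crash at t=100.
    for ts, event in events_near(SAMPLE, crash_ts=100.0):
        print(f"{ts:.6f} {event}")
```

Matching on the event name prefix rather than parsing the event
details keeps the sketch independent of the exact per-event output,
which is not shown anywhere in this thread.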