Date: Tue, 24 Sep 2013 12:14:46 -0500
From: Ben Myers
Subject: Re: [PATCH 5/5] xfs: log recovery lsn ordering needs uuid check
Message-ID: <20130924171446.GG1935@sgi.com>
References: <1380002476-18839-1-git-send-email-david@fromorbit.com> <1380002476-18839-6-git-send-email-david@fromorbit.com>
In-Reply-To: <1380002476-18839-6-git-send-email-david@fromorbit.com>
To: Dave Chinner
Cc: xfs@oss.sgi.com

On Tue, Sep 24, 2013 at 04:01:16PM +1000, Dave Chinner wrote:
> From: Dave Chinner
>
> After a fair number of xfstests runs, xfs/182 started to fail
> regularly with a corrupted directory - a directory read verifier was
> failing after recovery because it found a block with a XARM magic
> number (remote attribute block) rather than a directory data block.
>
> The first time I saw this repeated failure I did /something/ and the
> problem went away, so I was never able to find the underlying
> problem. Test xfs/182 failed again today, and I found the root
> cause before I did /something else/ that made it go away.
>
> Tracing indicated that the block in question was being correctly
> logged, the log was being flushed by sync, but the buffer was not
> being written back before the shutdown occurred. Tracing also
> indicated that log recovery was also reading the block, but then
> never writing it before log recovery invalidated the cache,
> indicating that it was not modified by log recovery.
>
> More detailed analysis of the corpse indicated that the filesystem
> had a uuid of "a4131074-1872-4cac-9323-2229adbcb886" but the XARM
> block had a uuid of "8f32f043-c3c9-e7f8-f947-4e7f989c05d3", which
> indicated it was a block from an older filesystem. The reason that
> log recovery didn't replay it was that the LSN in the XARM block was
> larger than the LSN of the transaction being replayed, and so the
> block was not overwritten by log recovery.
>
> Hence, log recovery can't blindly trust the magic number and LSN in
> the block - it must verify that it belongs to the filesystem being
> recovered before using the LSN. i.e. if the UUIDs don't match, we
> need to unconditionally recovery the change held in the log.
                          recover

> This patch was first tested on a block device that was repeatedly
> causing xfs/182 to fail with the same failure on the same block with
> the same directory read corruption signature (i.e. XARM block). It
> did not fail, and hasn't failed since.
>
> Signed-off-by: Dave Chinner

Looks good to me.

Reviewed-by: Ben Myers
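
For readers following the thread, the rule the patch description arrives
at (check the UUID first, and only then compare LSNs) can be sketched in
plain C. This is a minimal illustration, not the code from the patch: the
struct layout, the names sketch_block_hdr and sketch_must_replay, and the
choice to skip replay when the two LSNs are equal are all assumptions made
for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified block header; not the real XFS on-disk layout. */
struct sketch_block_hdr {
    uint32_t magic;     /* block type magic number */
    uint64_t lsn;       /* LSN of the last change written to this block */
    uint8_t  uuid[16];  /* UUID of the filesystem that wrote the block */
};

/*
 * Decide whether a logged change must be replayed over the block on disk.
 * fs_uuid is the UUID of the filesystem being recovered; txn_lsn is the
 * LSN of the transaction currently being replayed.
 */
static bool sketch_must_replay(const struct sketch_block_hdr *blk,
                               const uint8_t fs_uuid[16],
                               uint64_t txn_lsn)
{
    /* Block left over from another filesystem: its LSN is meaningless
     * here, so the logged change is replayed unconditionally. */
    if (memcmp(blk->uuid, fs_uuid, 16) != 0)
        return true;

    /* Same filesystem: replay only if the block on disk is older than
     * the logged change (equal LSNs are treated as up to date, which is
     * an assumption of this sketch). */
    return blk->lsn < txn_lsn;
}
```

With the block from the failure described above, the UUID mismatch alone
forces a replay, whatever LSN the stale XARM block happens to carry.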