Re: 5.3-rc1 regression with XFS log recovery

From: Dave Chinner <david@fromorbit.com>
To: "hch@lst.de" <hch@lst.de>
Cc: "Verma, Vishal L" <vishal.l.verma@intel.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"darrick.wong@oracle.com" <darrick.wong@oracle.com>,
	Ming Lei <ming.lei@redhat.com>,
	linux-block@vger.kernel.org
Subject: Re: 5.3-rc1 regression with XFS log recovery
Date: Tue, 20 Aug 2019 17:44:36 +1000	[thread overview]
Message-ID: <20190820074436.GD1119@dread.disaster.area> (raw)
In-Reply-To: <20190820055320.GB27501@lst.de>

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> > 
> > This still doesn't make sense to me, because the pmem and brd code
> > have no aligment limitations in their make_request code - they can
> > handle byte adressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> > 
> > Digging a little furhter, I note that both brd and pmem use
> > identical mechanisms to marshall data in and out of bios, so they
> > are likely to have the same issue.
> > 
> > So, brd_make_request() does:
> > 
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> > 
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> > 
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec(). An
> > unaligned 4kB io traces out as:
> > 
> >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > 
> > i.e. page		offset	len	sector
> > 00000000a77f0146	768	3328	0x7d0048
> > 000000006ceca91e	0	768	0x7d004e
> > 
> > You should be able to guess what the problems are from this.
> > 
> > Both pmem and brd are _sector_ based. We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> > 
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
> 
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.  CcingMing as he wrote much of this code.  I'll
> also dig into fixing it, but I just arrived in Japan and might be a
> little jetlagged.
> 
> > There's nothing quite like being repeatedly bitten by the same
> > misalignment bug because there's no validation in the infrastructure
> > that could catch it immediately and throw a useful warning/error
> > message.
> 
> The xen block driver doesn't use bio_for_each_segment, so it isn't
> exactly the same but a very related issue. 

Both stem from the fact that nothing in the block layer validates
memory alignment constraints. Xenblk, pmem and brd all return 511 to
queue_dma_alignment(), and all break when passed memory that isn't
aligned to 512 bytes.  There aren't even debug options we can turn
on that would tell use this is happening. Instead, we start with
data corruption and have to work backwards to find the needle in the
haystack from there. EIO and a WARN_ONCE would be a massive
improvement here....

> Maybe until we sort
> all this mess out we just need to depend on !SLUB_DEBUG for XFS?

SLUB_DEBUG=y by itself doesn't cause problems - I run that
all the time because otherwise there's no /proc/slabinfo with slub.

I used KASAN to get the above alignment behaviour - it's
SLUB_DEBUG_ON=y that perturbs the alignment, and I think the same
thing can happen with SLAB_DEBUG=y, so there's several dependencies
we'd have to add here.

Is there any way we can pass kmalloc a "use aligned heap" GFP flag
so that it allocates from the -rcl slabs to guarantee alignment
rather than the standard slabs that change alignment with config
options?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com