From: Dave Chinner <david@fromorbit.com>
To: "hch@lst.de" <hch@lst.de>
Cc: "Verma, Vishal L" <vishal.l.verma@intel.com>,
	"linux-xfs@vger.kernel.org" <linux-xfs@vger.kernel.org>,
	"Williams, Dan J" <dan.j.williams@intel.com>,
	"darrick.wong@oracle.com" <darrick.wong@oracle.com>
Subject: Re: 5.3-rc1 regression with XFS log recovery
Date: Mon, 19 Aug 2019 10:08:31 +1000	[thread overview]
Message-ID: <20190819000831.GX6129@dread.disaster.area> (raw)
In-Reply-To: <20190818173426.GA32311@lst.de>

On Sun, Aug 18, 2019 at 07:34:26PM +0200, hch@lst.de wrote:
> On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote:
> > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > > > The kernel log shows the following when the mount fails:
> > > 
> > > Is it always that same message?  I'll see if I can reproduce it,
> > > but I won't have that much memory to spare to create fake pmem,
> > > hope this also works with a single device and/or less memory..
> > 
> > I've reproduced a similar ASSERT with a small pmem device, so I hope
> > I can debug the issue locally now.
> 
> So I can also reproduce the same issue with the ramdisk driver, but not
> with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> loop).

How did you reproduce it? I tried with an identical pmem config to
what Vishal was using and couldn't hit it...

> Which made me wonder if there is some issue about the memory
> passed in, and indeed just switching to plain vmalloc vs the XFS
> kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
> the issue for me.  I don't really understand why yet, maybe I need to
> dig out alignment testing patches.
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..918ad3b884a7 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return vmalloc(BBTOB(nbblks));
>  }

So it fails with contiguous memory being added to the bio (kmalloc
returns contiguous memory), but not with individual, non-contiguous
pages (what vmalloc returns)?

Ok, so here's the trace of all the kmem_alloc_large() calls when
mounting the filesystem directly after mkfs:

https://lore.kernel.org/linux-xfs/20190722233452.31183-1-david@fromorbit.com/

 mount-1818  [001]   691.648570: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648570: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648575: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648575: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648578: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648578: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648581: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648581: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648583: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648583: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648586: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648586: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648588: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648589: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648591: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648592: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649456: kmem_alloc_large:     size 4096 flags 0x8 caller xlog_find_zeroed
 mount-1818  [001]   691.649456: kmem_alloc:           size 4096 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649496: kmem_alloc_large:     size 12288 flags 0x8 caller xlog_find_verify_cycle
 mount-1818  [001]   691.649496: kmem_alloc:           size 12288 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649502: kmem_alloc_large:     size 8192 flags 0x8 caller xlog_find_verify_log_record
 mount-1818  [001]   691.649502: kmem_alloc:           size 8192 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649506: kmem_alloc_large:     size 4096 flags 0x8 caller xlog_find_tail
 mount-1818  [001]   691.649506: kmem_alloc:           size 4096 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649516: kmem_alloc_large:     size 4198400 flags 0x8 caller xlog_write_log_records
 mount-1818  [001]   691.649517: kmem_alloc:           size 4198400 flags 0x8 caller kmem_alloc_large
                                                       ^^^^^^^^^^^^
 mount-1818  [001]   691.652152: kmem_alloc_large:     size 0 flags 0x11 caller xfs_alloc_rsum_cache
 mount-1818  [001]   691.652152: kmem_alloc:           size 0 flags 0x11 caller kmem_alloc_large
umount-1826  [000]   691.670308: kmem_alloc_large:     size 512 flags 0x5 caller xfs_log_commit_cil
umount-1826  [000]   691.670309: kmem_alloc:           size 512 flags 0x5 caller kmem_alloc_large

Ok, there's a >4MB allocation from:

xlog_find_tail
  xlog_clear_stale_blocks
    xlog_write_log_records
      xlog_alloc_buffer
	kmem_alloc_large()

If this memory allocation fails, we make the allocation smaller
in xlog_write_log_records() and try again, over and over, until it
either succeeds or the size isn't large enough to hold a single buffer.
The above trace shows the allocation succeeding as a single
contiguous buffer (no retries), and the IOs succeed and all is good.
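
For context, that fallback in xlog_write_log_records() looks roughly
like this (a paraphrased sketch of the 5.3-era code, not a verbatim
quote):

	int	sectbb = log->l_sectBBsize;
	int	bufblks;
	char	*buffer;

	/*
	 * 'blocks' is the number of basic blocks we need to write.
	 * Start with a power-of-two buffer covering the whole range...
	 */
	bufblks = 1 << ffs(blocks);
	while (bufblks > log->l_logBBsize)
		bufblks >>= 1;

	/* ...and halve it on every allocation failure, giving up only
	 * once it drops below a single log sector. */
	while (!(buffer = xlog_alloc_buffer(log, bufblks))) {
		bufblks >>= 1;
		if (bufblks < sectbb)
			return -ENOMEM;
	}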

But Vishal wasn't even getting that far - the "totally zeroed log"
message comes from xlog_find_head() after xlog_find_zeroed() fails.

xlog_find_zeroed() does a single sector allocation (4kB) and uses
that over and over again for the binary search. Then it calls
xlog_find_verify_cycle(), which is passed a search range based on
the last block it found that was zero and the size of the log write
window (2MB). If the last block found is less than the log write
window, it trims the range back to the last block. We see this in
the trace above - a mkfs'd filesystem should have an unmount record
in it and nothing else, so at most 2 sectors should be written. We
see a 12kB - 3 sector - allocation in the trace above, but it could
be up to 2MB in size.

Finally, there's a 2 sector allocation by
xlog_find_verify_log_record(), which it does to check whether it
needs to back up the head to an earlier block in the log write
window because the log writes hit the disk out of order. That's
another "up to 2MB" allocation, but in the trace above it's only
8kB.

The issue, I think, is that only the big allocation in
xlog_clear_stale_blocks() has somewhat robust ENOMEM error handling.
All of these other allocations will return failure if the buffer
allocation fails, and KM_MAYFAIL doesn't try very hard to ensure the
allocation will succeed.

This change:

> -     return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +     return vmalloc(BBTOB(nbblks));

demonstrates the difference - it changes the code from an allocation
that is allowed to fail to an allocation that will never fail. i.e.
the KM_MAYFAIL context is also passed to the __vmalloc() call inside
kmem_alloc_large(), hence it can fail easily, too, whereas vmalloc()
tries much, much harder and so is, in comparison, a NOFAIL context
(GFP_KERNEL).
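
For reference, kmem_alloc_large() at this point in time does roughly
the following (paraphrased from memory, not an exact quote of
fs/xfs/kmem.c - the NOFS handling is omitted):

	void *
	kmem_alloc_large(size_t size, xfs_km_flags_t flags)
	{
		void	*ptr;

		/* try a physically contiguous allocation first */
		ptr = kmem_alloc(size, flags | KM_MAYFAIL);
		if (ptr)
			return ptr;

		/*
		 * Fall back to vmalloc, but with gfp flags derived from
		 * the same KM_* flags - KM_MAYFAIL maps to
		 * __GFP_RETRY_MAYFAIL, so this fallback can fail too,
		 * unlike a bare vmalloc(size) which uses plain GFP_KERNEL.
		 */
		return __vmalloc(size, kmem_flags_convert(flags),
				 PAGE_KERNEL);
	}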

I'd suggest that the fix is to pass the allocation context into
xlog_alloc_buffer(), using KM_MAYFAIL only for the allocations that
have ENOMEM fallback handling...
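
Something like this (a hypothetical, untested sketch of that
suggestion - the names follow the 5.3-era code, it's not a patch
proposal):

	static char *
	xlog_alloc_buffer(
		struct xlog	*log,
		int		nbblks,
		xfs_km_flags_t	flags)	/* caller's allocation context */
	{
		/* block count verification as before, elided */

		if (nbblks > 1 && log->l_sectBBsize > 1)
			nbblks += log->l_sectBBsize;
		nbblks = round_up(nbblks, log->l_sectBBsize);
		return kmem_alloc_large(BBTOB(nbblks), flags);
	}

xlog_write_log_records(), which can shrink the buffer and retry,
would pass KM_MAYFAIL, while xlog_find_zeroed() and friends, which
have no ENOMEM fallback, would pass 0 and get the much harder-trying
GFP_KERNEL behaviour from the vmalloc fallback.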

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 31+ messages
2019-08-16 20:59 5.3-rc1 regression with XFS log recovery Verma, Vishal L
2019-08-18  7:11 ` hch
2019-08-18  7:41   ` hch
2019-08-18 17:34     ` hch
2019-08-19  0:08       ` Dave Chinner [this message]
2019-08-19  3:49         ` hch
2019-08-19  4:11           ` hch
2019-08-19  4:22             ` Dave Chinner
2019-08-19  4:29               ` hch
2019-08-19  4:40                 ` hch
2019-08-19  5:31                   ` Dave Chinner
2019-08-20  6:14                     ` hch
2019-08-20  4:41                   ` Dave Chinner
2019-08-20  5:53                     ` hch
2019-08-20  7:44                       ` Dave Chinner
2019-08-20  8:13                       ` Ming Lei
2019-08-20  9:24                         ` Ming Lei
2019-08-20 16:30                           ` Verma, Vishal L
2019-08-20 21:44                           ` Dave Chinner
2019-08-20 22:08                             ` Verma, Vishal L
2019-08-20 23:53                               ` Dave Chinner
2019-08-21  2:19                               ` Ming Lei
2019-08-21  1:56                             ` Ming Lei
2019-08-19  4:15           ` Dave Chinner
2019-08-19 17:19       ` Verma, Vishal L
2019-08-21  0:26       ` Dave Chinner
2019-08-21  0:44         ` hch
2019-08-21  1:08           ` Dave Chinner
2019-08-21  1:56             ` Verma, Vishal L
2019-08-21  6:15               ` Dave Chinner
2019-08-26 17:32       ` Verma, Vishal L
