* 5.3-rc1 regression with XFS log recovery
@ 2019-08-16 20:59 Verma, Vishal L
  2019-08-18  7:11 ` hch
  0 siblings, 1 reply; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-16 20:59 UTC (permalink / raw)
  To: linux-xfs; +Cc: Williams, Dan J, hch, david, darrick.wong

Hi all,

When running the 'ndctl' unit tests against 5.3-rc kernels, I noticed a
frequent failure of the 'mmap.sh' test [1][2].

[1]: https://github.com/pmem/ndctl/blob/master/test/mmap.sh
[2]: https://github.com/pmem/ndctl/blob/master/test/mmap.c

But in trying to pare down the test further, I found that I can simply
reproduce the problem by:

  mkfs.xfs -f /dev/pmem0
  mount /dev/pmem0 /mnt/mem

Where 'pmem0' is a legacy pmem namespace created from memory reserved
using the memmap= command line option. (Specifically, I have this:
memmap=3G!6G,3G!9G )

The above mkfs/mount steps don't reproduce the problem 100% of the
time, but it does happen on my qemu-based setup over 75% of the time.

The kernel log shows the following when the mount fails:

   [Aug16 14:41] XFS (pmem0): Mounting V5 Filesystem
   [  +0.001856] XFS (pmem0): totally zeroed log
   [  +0.402616] XFS (pmem0): Internal error xlog_clear_stale_blocks(2) at line 1715 of file fs/xfs/xfs_log_recover.c.  Caller xlog_find_tail+0x230/0x340 [xfs]
   [  +0.001741] CPU: 7 PID: 1771 Comm: mount Tainted: G           O      5.2.0-rc4+ #112
   [  +0.000976] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
   [  +0.001516] Call Trace:
   [  +0.000351]  dump_stack+0x85/0xc0
   [  +0.000452]  xlog_clear_stale_blocks+0x16d/0x180 [xfs]
   [  +0.000665]  xlog_find_tail+0x230/0x340 [xfs]
   [  +0.000581]  xlog_recover+0x2b/0x160 [xfs]
   [  +0.000554]  xfs_log_mount+0x280/0x2a0 [xfs]
   [  +0.000561]  xfs_mountfs+0x415/0x860 [xfs]
   [  +0.000533]  ? xfs_mru_cache_create+0x18b/0x1f0 [xfs]
   [  +0.000665]  xfs_fs_fill_super+0x4b0/0x700 [xfs]
   [  +0.000638]  ? xfs_test_remount_options+0x60/0x60 [xfs]
   [  +0.000710]  mount_bdev+0x17f/0x1b0
   [  +0.000442]  legacy_get_tree+0x30/0x50
   [  +0.000467]  vfs_get_tree+0x28/0xf0
   [  +0.000436]  do_mount+0x2d4/0xa00
   [  +0.000411]  ? memdup_user+0x3e/0x70
   [  +0.000455]  ksys_mount+0xba/0xd0
   [  +0.000420]  __x64_sys_mount+0x21/0x30
   [  +0.000473]  do_syscall_64+0x60/0x240
   [  +0.000460]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
   [  +0.000655] RIP: 0033:0x7f730fec91be
   [  +0.000506] Code: 48 8b 0d cd 1c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9a 1c 0c 00 f7 d8 64 89 01 48
   [  +0.002305] RSP: 002b:00007ffdadbdb178 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
   [  +0.000922] RAX: ffffffffffffffda RBX: 000055b8f9db8a40 RCX: 00007f730fec91be
   [  +0.000875] RDX: 000055b8f9dbfdb0 RSI: 000055b8f9dbb930 RDI: 000055b8f9db8c20
   [  +0.000917] RBP: 00007f731007f1a4 R08: 0000000000000000 R09: 000055b8f9dc01f0
   [  +0.000942] R10: 00000000c0ed0000 R11: 0000000000000246 R12: 0000000000000000
   [  +0.000878] R13: 00000000c0ed0000 R14: 000055b8f9db8c20 R15: 000055b8f9dbfdb0
   [  +0.000915] XFS (pmem0): failed to locate log tail
   [  +0.000622] XFS (pmem0): log mount/recovery failed: error -117
   [  +0.012560] XFS (pmem0): log mount failed


A bisect pointed to this commit:

commit 6ad5b3255b9e3d6d94154738aacd5119bf9c8f6e (HEAD -> bisect-bad, refs/bisect/bad)
Author: Christoph Hellwig <hch@lst.de>
Date:   Fri Jun 28 19:27:26 2019 -0700

    xfs: use bios directly to read and write the log recovery buffers
    
    The xfs_buf structure is basically used as a glorified container for
    a memory allocation in the log recovery code.  Replace it with a
    call to kmem_alloc_large and a simple abstraction to read into or
    write from it synchronously using chained bios.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
    Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>

Full bisect log follows at the end.

I saw [3], but I can still easily hit the failure after manually
applying that patch on the above commit.

[3]: https://lore.kernel.org/linux-xfs/20190709152352.27465-1-hch@lst.de/

Any thoughts on what might be happening? I'd be happy to test out
theories/patches.

Thanks,
	-Vishal


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-16 20:59 5.3-rc1 regression with XFS log recovery Verma, Vishal L
@ 2019-08-18  7:11 ` hch
  2019-08-18  7:41   ` hch
  0 siblings, 1 reply; 31+ messages in thread
From: hch @ 2019-08-18  7:11 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong

On Fri, Aug 16, 2019 at 08:59:44PM +0000, Verma, Vishal L wrote:
> Hi all,
> 
> When running the 'ndctl' unit tests against 5.3-rc kernels, I noticed a
> frequent failure of the 'mmap.sh' test [1][2].
> 
> [1]: https://github.com/pmem/ndctl/blob/master/test/mmap.sh
> [2]: https://github.com/pmem/ndctl/blob/master/test/mmap.c
> 
> But in trying to pare down the test further, I found that I can simply
> reproduce the problem by:
> 
>   mkfs.xfs -f /dev/pmem0
>   mount /dev/pmem0 /mnt/mem
> 
> Where 'pmem0' is a legacy pmem namespace created from memory reserved
> using the memmap= command line option. (Specifically, I have this:
> memmap=3G!6G,3G!9G )
> 
> The above mkfs/mount steps don't reproduce the problem 100% of the
> time, but it does happen on my qemu-based setup over 75% of the time.
> 
> The kernel log shows the following when the mount fails:

Is it always that same message?  I'll see if I can reproduce it,
but I won't have that much memory to spare to create fake pmem,
hope this also works with a single device and/or less memory..

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18  7:11 ` hch
@ 2019-08-18  7:41   ` hch
  2019-08-18 17:34     ` hch
  0 siblings, 1 reply; 31+ messages in thread
From: hch @ 2019-08-18  7:41 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong

On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > The kernel log shows the following when the mount fails:
> 
> Is it always that same message?  I'll see if I can reproduce it,
> but I won't have that much memory to spare to create fake pmem,
> hope this also works with a single device and/or less memory..

I've reproduced a similar ASSERT with a small pmem device, so I hope
I can debug the issue locally now.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18  7:41   ` hch
@ 2019-08-18 17:34     ` hch
  2019-08-19  0:08       ` Dave Chinner
                         ` (3 more replies)
  0 siblings, 4 replies; 31+ messages in thread
From: hch @ 2019-08-18 17:34 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong

On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote:
> On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > > The kernel log shows the following when the mount fails:
> > 
> > Is it always that same message?  I'll see if I can reproduce it,
> > but I won't have that much memory to spare to create fake pmem,
> > hope this also works with a single device and/or less memory..
> 
> I've reproduced a similar ASSERT with a small pmem device, so I hope
> I can debug the issue locally now.

So I can also reproduce the same issue with the ramdisk driver, but not
with any other 4k sector size device (nvmet, scsi target, scsi_debug,
loop).  Which made me wonder if there is some issue about the memory
passed in, and indeed just switching to plain vmalloc vs the XFS
kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
the issue for me.  I don't really understand why yet, maybe I need to
dig out alignment testing patches.

diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 13d1d3e95b88..918ad3b884a7 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -125,7 +125,7 @@ xlog_alloc_buffer(
 	if (nbblks > 1 && log->l_sectBBsize > 1)
 		nbblks += log->l_sectBBsize;
 	nbblks = round_up(nbblks, log->l_sectBBsize);
-	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
+	return vmalloc(BBTOB(nbblks));
 }
 
 /*
@@ -416,7 +416,7 @@ xlog_find_verify_cycle(
 	*new_blk = -1;
 
 out:
-	kmem_free(buffer);
+	vfree(buffer);
 	return error;
 }
 
@@ -527,7 +527,7 @@ xlog_find_verify_log_record(
 		*last_blk = i;
 
 out:
-	kmem_free(buffer);
+	vfree(buffer);
 	return error;
 }
 
@@ -781,7 +781,7 @@ xlog_find_head(
 			goto out_free_buffer;
 	}
 
-	kmem_free(buffer);
+	vfree(buffer);
 	if (head_blk == log_bbnum)
 		*return_head_blk = 0;
 	else
@@ -795,7 +795,7 @@ xlog_find_head(
 	return 0;
 
 out_free_buffer:
-	kmem_free(buffer);
+	vfree(buffer);
 	if (error)
 		xfs_warn(log->l_mp, "failed to find log head");
 	return error;
@@ -1049,7 +1049,7 @@ xlog_verify_tail(
 		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
 			 orig_tail, *tail_blk);
 out:
-	kmem_free(buffer);
+	vfree(buffer);
 	return error;
 }
 
@@ -1096,7 +1096,7 @@ xlog_verify_head(
 	error = xlog_rseek_logrec_hdr(log, *head_blk, *tail_blk,
 				      XLOG_MAX_ICLOGS, tmp_buffer,
 				      &tmp_rhead_blk, &tmp_rhead, &tmp_wrapped);
-	kmem_free(tmp_buffer);
+	vfree(tmp_buffer);
 	if (error < 0)
 		return error;
 
@@ -1429,7 +1429,7 @@ xlog_find_tail(
 		error = xlog_clear_stale_blocks(log, tail_lsn);
 
 done:
-	kmem_free(buffer);
+	vfree(buffer);
 
 	if (error)
 		xfs_warn(log->l_mp, "failed to locate log tail");
@@ -1477,7 +1477,7 @@ xlog_find_zeroed(
 	first_cycle = xlog_get_cycle(offset);
 	if (first_cycle == 0) {		/* completely zeroed log */
 		*blk_no = 0;
-		kmem_free(buffer);
+		vfree(buffer);
 		return 1;
 	}
 
@@ -1488,7 +1488,7 @@ xlog_find_zeroed(
 
 	last_cycle = xlog_get_cycle(offset);
 	if (last_cycle != 0) {		/* log completely written to */
-		kmem_free(buffer);
+		vfree(buffer);
 		return 0;
 	}
 
@@ -1535,7 +1535,7 @@ xlog_find_zeroed(
 
 	*blk_no = last_blk;
 out_free_buffer:
-	kmem_free(buffer);
+	vfree(buffer);
 	if (error)
 		return error;
 	return 1;
@@ -1647,7 +1647,7 @@ xlog_write_log_records(
 	}
 
 out_free_buffer:
-	kmem_free(buffer);
+	vfree(buffer);
 	return error;
 }
 
@@ -5291,7 +5291,7 @@ xlog_do_recovery_pass(
 			hblks = h_size / XLOG_HEADER_CYCLE_SIZE;
 			if (h_size % XLOG_HEADER_CYCLE_SIZE)
 				hblks++;
-			kmem_free(hbp);
+			vfree(hbp);
 			hbp = xlog_alloc_buffer(log, hblks);
 		} else {
 			hblks = 1;
@@ -5307,7 +5307,7 @@ xlog_do_recovery_pass(
 		return -ENOMEM;
 	dbp = xlog_alloc_buffer(log, BTOBB(h_size));
 	if (!dbp) {
-		kmem_free(hbp);
+		vfree(hbp);
 		return -ENOMEM;
 	}
 
@@ -5468,9 +5468,9 @@ xlog_do_recovery_pass(
 	}
 
  bread_err2:
-	kmem_free(dbp);
+	vfree(dbp);
  bread_err1:
-	kmem_free(hbp);
+	vfree(hbp);
 
 	/*
 	 * Submit buffers that have been added from the last record processed,

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18 17:34     ` hch
@ 2019-08-19  0:08       ` Dave Chinner
  2019-08-19  3:49         ` hch
  2019-08-19 17:19       ` Verma, Vishal L
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-19  0:08 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Sun, Aug 18, 2019 at 07:34:26PM +0200, hch@lst.de wrote:
> On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote:
> > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > > > The kernel log shows the following when the mount fails:
> > > 
> > > Is it always that same message?  I'll see if I can reproduce it,
> > > but I won't have that much memory to spare to create fake pmem,
> > > hope this also works with a single device and/or less memory..
> > 
> > I've reproduced a similar ASSERT with a small pmem device, so I hope
> > I can debug the issue locally now.
> 
> So I can also reproduce the same issue with the ramdisk driver, but not
> with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> loop).

How did you reproduce it? I tried with identical pmem config to
what Vishal was using and couldn't hit it...

> Which made me wonder if there is some issue about the memory
> passed in, and indeed just switching to plain vmalloc vs the XFS
> kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
> the issue for me.  I don't really understand why yet, maybe I need to
> dig out alignment testing patches.
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..918ad3b884a7 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return vmalloc(BBTOB(nbblks));
>  }

So it fails with contiguous memory being added to the bio (kmalloc
returns contiguous memory), but not with individual, non-contiguous
pages (what vmalloc returns)?

Ok, so here's the trace of all the kmem_alloc_large() calls when
mounting the filesystem directly after mkfs:

https://lore.kernel.org/linux-xfs/20190722233452.31183-1-david@fromorbit.com/

 mount-1818  [001]   691.648570: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648570: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648575: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648575: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648578: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648578: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648581: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648581: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648583: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648583: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648586: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648586: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648588: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648589: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.648591: kmem_alloc_large:     size 262144 flags 0x8 caller xlog_alloc_log
 mount-1818  [001]   691.648592: kmem_alloc:           size 262144 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649456: kmem_alloc_large:     size 4096 flags 0x8 caller xlog_find_zeroed
 mount-1818  [001]   691.649456: kmem_alloc:           size 4096 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649496: kmem_alloc_large:     size 12288 flags 0x8 caller xlog_find_verify_cycle
 mount-1818  [001]   691.649496: kmem_alloc:           size 12288 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649502: kmem_alloc_large:     size 8192 flags 0x8 caller xlog_find_verify_log_record
 mount-1818  [001]   691.649502: kmem_alloc:           size 8192 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649506: kmem_alloc_large:     size 4096 flags 0x8 caller xlog_find_tail
 mount-1818  [001]   691.649506: kmem_alloc:           size 4096 flags 0x8 caller kmem_alloc_large
 mount-1818  [001]   691.649516: kmem_alloc_large:     size 4198400 flags 0x8 caller xlog_write_log_records
 mount-1818  [001]   691.649517: kmem_alloc:           size 4198400 flags 0x8 caller kmem_alloc_large
                                                       ^^^^^^^^^^^^
 mount-1818  [001]   691.652152: kmem_alloc_large:     size 0 flags 0x11 caller xfs_alloc_rsum_cache
 mount-1818  [001]   691.652152: kmem_alloc:           size 0 flags 0x11 caller kmem_alloc_large
umount-1826  [000]   691.670308: kmem_alloc_large:     size 512 flags 0x5 caller xfs_log_commit_cil
umount-1826  [000]   691.670309: kmem_alloc:           size 512 flags 0x5 caller kmem_alloc_large

Ok, there's a >4MB allocation from:

xlog_find_tail
  xlog_clear_stale_blocks
    xlog_write_log_records
      xlog_alloc_buffer
	kmem_alloc_large()

If this memory allocation fails, we make the allocation smaller
in xlog_write_log_records() and try again, over and over again until
it succeeds or the size isn't large enough to hold a single buffer.
The above trace shows the allocation succeeding as a single
contiguous buffer (no retries), and the IOs succeed and all is good.
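
For reference, the fallback loop in xlog_write_log_records() is
essentially this (paraphrased, not the exact 5.3-rc source):

	/* halve the buffer size until the allocation succeeds */
	bufblks = 1 << ffs(blocks);
	while (bufblks > log->l_logBBsize)
		bufblks >>= 1;
	while (!(buffer = xlog_alloc_buffer(log, bufblks))) {
		bufblks >>= 1;
		if (bufblks < sectbb)
			return -ENOMEM;
	}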

But Vishal wasn't even getting that far - the "totally zeroed log"
comes from xlog_find_head after xlog_find_zeroed() fails.

xlog_find_zeroed() does a single sector allocation (4kB) and uses
that over and over again for the binary search. Then it calls
xlog_find_verify_cycle() which is passed a search range based
on the last block it found that was zero and the size of
the log write window (2MB). If the last block found is less than
the log write window, it trims it back to the last block. We see
this in the trace above - a mkfs'd filesystem should have an unmount
record in it and nothing else, so at most 2 sectors should be
written. We see a 12kB - 3 sector - allocation in the trace above,
but it could be up to 2MB in size.

Finally, there's a 2 sector allocation by
xlog_find_verify_log_record(), which it does to check whether it
needs to back up the head to an earlier block in the log write
window because the log writes hit the disk out of order. That's
another "up to 2MB" allocation, but in the trace above it's only
8kB.

The issue, I think, is that only the big allocation in
xlog_clear_stale_blocks() has somewhat robust ENOMEM error handling.
All of these other allocations will return failure if the buffer
allocation fails, and KM_MAYFAIL doesn't try very hard to ensure the
allocation will succeed.

This change:

> -     return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +     return vmalloc(BBTOB(nbblks));

demonstrates the difference - it changes the code from an allocation
that is allowed to fail to an allocation that will never fail. i.e.
KM_MAYFAIL context is also passed to the __vmalloc() call inside
kmem_alloc_large() hence it can fail easily, too, whereas vmalloc()
tries much, much harder and so is, in comparison, a NOFAIL context
(GFP_KERNEL).

I'd suggest that the fix is to pass the allocation context into
xlog_alloc_buffer(), using KM_MAYFAIL for only the allocations that
have ENOMEM fallback handling...
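
Something like this untested sketch, where the flags argument and the
two call sites shown are purely illustrative:

static char *
xlog_alloc_buffer(
	struct xlog	*log,
	int		nbblks,
	xfs_km_flags_t	flags)	/* 0, or KM_MAYFAIL if the caller
				   handles ENOMEM */
{
	if (nbblks > 1 && log->l_sectBBsize > 1)
		nbblks += log->l_sectBBsize;
	nbblks = round_up(nbblks, log->l_sectBBsize);
	return kmem_alloc_large(BBTOB(nbblks), flags);
}

/* caller with an ENOMEM fallback, e.g. xlog_write_log_records(): */
buffer = xlog_alloc_buffer(log, bufblks, KM_MAYFAIL);

/* caller that just returns the error, e.g. xlog_find_zeroed(): */
buffer = xlog_alloc_buffer(log, 1, 0);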

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  0:08       ` Dave Chinner
@ 2019-08-19  3:49         ` hch
  2019-08-19  4:11           ` hch
  2019-08-19  4:15           ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: hch @ 2019-08-19  3:49 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

[-- Attachment #1: Type: text/plain, Size: 598 bytes --]

On Mon, Aug 19, 2019 at 10:08:31AM +1000, Dave Chinner wrote:
> > So I can also reproduce the same issue with the ramdisk driver, but not
> > with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> > loop).
> 
> How did you reproduce it? I tried with identical pmem config to
> what Vishal was using and couldn't hit it...

Just mkfs and mount the pmem device.

The fake pmem was created by the following kernel command line snippet:
"memmap=2000M!1024M,2000M!4096M" and my kernel config is attached.  I
suspect you'll need CONFIG_SLUB to create the woes, but let me confirm
that..

[-- Attachment #2: .config --]
[-- Type: application/x-config, Size: 140796 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  3:49         ` hch
@ 2019-08-19  4:11           ` hch
  2019-08-19  4:22             ` Dave Chinner
  2019-08-19  4:15           ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: hch @ 2019-08-19  4:11 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote:
> The fake pmem was created by the following kernel command line snippet:
> "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached.  I
> suspect you'll need CONFIG_SLUB to create the woes, but let me confirm
> that..

Yep, doesn't reproduce with CONFIG_SLAB.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  3:49         ` hch
  2019-08-19  4:11           ` hch
@ 2019-08-19  4:15           ` Dave Chinner
  1 sibling, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2019-08-19  4:15 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote:
> On Mon, Aug 19, 2019 at 10:08:31AM +1000, Dave Chinner wrote:
> > > So I can also reproduce the same issue with the ramdisk driver, but not
> > > with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> > > loop).
> > 
> > How did you reproduce it? I tried with identical pmem config to
> > what Vishal was using and couldn't hit it...
> 
> Just mkfs and mount the pmem device.

Ran that in a loop overnight, no failures.

> The fake pmem was created by the following kernel command line snippet:
> "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached.  I

Very similar to what I'm using, just 2x8GB pmem instead of the
smaller 1/4GB pair you are using. I even tried manually making it
smaller w/ -d size= and changing log sizes, etc. I tried squeezing
down memory to only 256MB unused (via mlock) but that didn't seem to
change anything, either...

> suspect you'll need CONFIG_SLUB to create the woes, but let me confirm
> that..

Yup, that's what I have configured in my kernel, too.

It's still a mystery to me why I'm not able to reproduce it
easily. If it was just pmem, then I should have found it back when I
tripped over the chaining bug, because the two VMs I was running
testing on at the time were an iscsi test rig and a pmem test rig.
The iscsi one tripped over the chaining bug, the pmem one has had no
problems....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  4:11           ` hch
@ 2019-08-19  4:22             ` Dave Chinner
  2019-08-19  4:29               ` hch
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-19  4:22 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 06:11:32AM +0200, hch@lst.de wrote:
> On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote:
> > The fake pmem was created by the following kernel command line snippet:
> > "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached.  I
> > suspect you'll need CONFIG_SLUB to create the woes, but let me confirm
> > that..
> 
> Yep, doesn't reproduce with CONFIG_SLAB.

That implies a kmalloc heap issue.

Oh, is memory poisoning or something that modifies the alignment of
slabs turned on?

i.e. 4k/8k allocations from the kmalloc heap slabs might not be
appropriately aligned for IO, similar to the problems we have with
the xen blk driver?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  4:22             ` Dave Chinner
@ 2019-08-19  4:29               ` hch
  2019-08-19  4:40                 ` hch
  0 siblings, 1 reply; 31+ messages in thread
From: hch @ 2019-08-19  4:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote:
> That implies a kmalloc heap issue.
> 
> Oh, is memory poisoning or something that modifies the alignment of
> slabs turned on?
> 
> i.e. 4k/8k allocations from the kmalloc heap slabs might not be
> appropriately aligned for IO, similar to the problems we have with
> the xen blk driver?

That is what I suspect, and as you can see in the attached config I
usually run with slab debugging on.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  4:29               ` hch
@ 2019-08-19  4:40                 ` hch
  2019-08-19  5:31                   ` Dave Chinner
  2019-08-20  4:41                   ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: hch @ 2019-08-19  4:40 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote:
> On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote:
> > That implies a kmalloc heap issue.
> > 
> > Oh, is memory poisoning or something that modifies the alignment of
> > slabs turned on?
> > 
> > i.e. 4k/8k allocations from the kmalloc heap slabs might not be
> > appropriately aligned for IO, similar to the problems we have with
> > the xen blk driver?
> 
> That is what I suspect, and as you can see in the attached config I
> usually run with slab debugging on.

Yep, looks like an unaligned allocation:

root@testvm:~# mount /dev/pmem1 /mnt/
[   62.346660] XFS (pmem1): Mounting V5 Filesystem
[   62.347960] unaligned allocation, offset = 680
[   62.349019] unaligned allocation, offset = 680
[   62.349872] unaligned allocation, offset = 680
[   62.350703] XFS (pmem1): totally zeroed log
[   62.351443] unaligned allocation, offset = 680
[   62.452203] unaligned allocation, offset = 344
[   62.528964] XFS: Assertion failed: head_blk != tail_blk, file: fs/xfs/xfs_log_recover.c, line: 5236
[   62.529879] ------------[ cut here ]------------
[   62.530334] kernel BUG at fs/xfs/xfs_message.c:102!
[   62.530824] invalid opcode: 0000 [#1] SMP PTI

With the following debug patch.  Based on that I think I'll just
formally submit the vmalloc switch as we're at -rc5, and then we
can restart the unaligned slub allocation drama..


diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 13d1d3e95b88..6a098d35931a 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -97,6 +97,9 @@ xlog_alloc_buffer(
 	struct xlog	*log,
 	int		nbblks)
 {
+	void		*ret;
+	unsigned long	offset;
+
 	/*
 	 * Pass log block 0 since we don't have an addr yet, buffer will be
 	 * verified on read.
@@ -125,7 +128,14 @@ xlog_alloc_buffer(
 	if (nbblks > 1 && log->l_sectBBsize > 1)
 		nbblks += log->l_sectBBsize;
 	nbblks = round_up(nbblks, log->l_sectBBsize);
-	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
+	ret = kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
+	if (!ret)
+		return NULL;
+	offset = offset_in_page(ret);
+	if (offset % 512)
+		printk("unaligned allocation, offset = %lu\n",
+			offset);
+	return ret;
 }
 
 /*



^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  4:40                 ` hch
@ 2019-08-19  5:31                   ` Dave Chinner
  2019-08-20  6:14                     ` hch
  2019-08-20  4:41                   ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-19  5:31 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 06:40:12AM +0200, hch@lst.de wrote:
> On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote:
> > On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote:
> > > That implies a kmalloc heap issue.
> > > 
> > > Oh, is memory poisoning or something that modifies the alignment of
> > > slabs turned on?
> > > 
> > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be
> > > appropriately aligned for IO, similar to the problems we have with
> > > the xen blk driver?
> > 
> > That is what I suspect, and as you can see in the attached config I
> > usually run with slab debugging on.
> 
> Yep, looks like an unaligned allocation:

*nod*

> With the following debug patch.  Based on that I think I'll just
> formally submit the vmalloc switch as we're at -rc5, and then we
> can restart the unaligned slub allocation drama..

But I'm not so sure that's all there is to it. I turned KASAN on and
it works for the first few mkfs/mount cycles, then a mount basically
pegs a CPU and it's spending most of its time inside KASAN
accounting code like this:

.....
  save_stack+0x19/0x80
  __kasan_kmalloc.constprop.10+0xc1/0xd0
  kmem_cache_alloc+0xd2/0x220
  mempool_alloc+0xda/0x230
  bio_alloc_bioset+0x12d/0x2d0
  xfs_rw_bdev+0x53/0x290
  xlog_do_io+0xd1/0x1c0
  xlog_bread+0x23/0x70
  xlog_rseek_logrec_hdr+0x207/0x2a0

After a couple of minutes, the mount fails an ASSERT:

XFS: Assertion failed: head_blk != tail_blk, file: fs/xfs/xfs_log_recover.c, line: 5236

And moments after KASAN spews this:

 BUG: KASAN: use-after-free in rwsem_down_read_slowpath+0x685/0x8f0
 Read of size 4 at addr ffff88806f152778 by task systemd-udevd/2231
 
 CPU: 4 PID: 2231 Comm: systemd-udevd Tainted: G      D           5.3.0-rc5-dgc+ #1506
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
 Call Trace:
  dump_stack+0x7c/0xc0
  print_address_description+0x6c/0x322
  ? rwsem_down_read_slowpath+0x685/0x8f0
  __kasan_report.cold.6+0x1c/0x3e
  ? rwsem_down_read_slowpath+0x685/0x8f0
  ? rwsem_down_read_slowpath+0x685/0x8f0
  kasan_report+0xe/0x12
  rwsem_down_read_slowpath+0x685/0x8f0
  ? unwind_get_return_address_ptr+0x50/0x50
  ? unwind_next_frame+0x6d6/0x8a0
  ? __down_timeout+0x1c0/0x1c0
  ? unwind_next_frame+0x6d6/0x8a0
  ? _raw_spin_lock+0x87/0xe0
  ? _raw_spin_lock+0x87/0xe0
  ? __cpuidle_text_end+0x5/0x5
  ? set_init_blocksize+0xe0/0xe0
  ? preempt_count_sub+0x43/0x50
  ? __might_sleep+0x31/0xd0
  ? set_init_blocksize+0xe0/0xe0
  ? ___might_sleep+0xc8/0xe0
  down_read+0x18d/0x1a0
  ? refcount_sub_and_test_checked+0xaf/0x150
  ? rwsem_down_read_slowpath+0x8f0/0x8f0
  ? _raw_spin_lock+0x87/0xe0
  __get_super.part.12+0xf8/0x130
  fsync_bdev+0xf/0x60
  invalidate_partition+0x1e/0x40
  rescan_partitions+0x8a/0x420
  blkdev_reread_part+0x1e/0x30
  blkdev_ioctl+0xb0b/0xe60
  ? __blkdev_driver_ioctl+0x80/0x80
  ? __bpf_prog_run64+0xc0/0xc0
  ? stack_trace_save+0x8a/0xb0
  ? save_stack+0x4d/0x80
  ? __seccomp_filter+0x133/0xa10
  ? preempt_count_sub+0x43/0x50
  block_ioctl+0x6d/0x80
  do_vfs_ioctl+0x134/0x9c0
  ? ioctl_preallocate+0x140/0x140
  ? selinux_file_ioctl+0x2b7/0x360
  ? selinux_capable+0x20/0x20
  ? syscall_trace_enter+0x233/0x540
  ksys_ioctl+0x60/0x90
  __x64_sys_ioctl+0x3d/0x50
  do_syscall_64+0x70/0x230
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
 RIP: 0033:0x7fade328a427
 Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 8
 RSP: 002b:00007ffdc4755928 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007fade328a427
 RDX: 0000000000000000 RSI: 000000000000125f RDI: 000000000000000e
 RBP: 0000000000000000 R08: 0000559597306140 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 000055959736dbc0
 R13: 0000000000000000 R14: 00007ffdc47569c8 R15: 000055959730dac0

 Allocated by task 4739:
  save_stack+0x19/0x80
  __kasan_kmalloc.constprop.10+0xc1/0xd0
  kmem_cache_alloc_node+0xf3/0x240
  copy_process+0x1f91/0x2f20
  _do_fork+0xe0/0x530
  __x64_sys_clone+0x10e/0x160
  do_syscall_64+0x70/0x230
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 Freed by task 0:
  save_stack+0x19/0x80
  __kasan_slab_free+0x12e/0x180
  kmem_cache_free+0x84/0x2c0
  rcu_core+0x35f/0x810
  __do_softirq+0x15f/0x476

 The buggy address belongs to the object at ffff888237048000
  which belongs to the cache task_struct of size 9792
 The buggy address is located 56 bytes inside of
  9792-byte region [ffff888237048000, ffff88823704a640)
 The buggy address belongs to the page:
 page:ffffea0008dc1200 refcount:1 mapcount:0 mapping:ffff888078a91800 index:0x0 compound_mapcount: 0
 flags: 0x57ffffc0010200(slab|head)
 raw: 0057ffffc0010200 dead000000000100 dead000000000122 ffff888078a91800
 raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff888237047f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff888237047f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff888237048000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                         ^
  ffff888237048080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff888237048100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ==================================================================

Oh, hell, it's an rwsem that is referencing a freed task.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18 17:34     ` hch
  2019-08-19  0:08       ` Dave Chinner
@ 2019-08-19 17:19       ` Verma, Vishal L
  2019-08-21  0:26       ` Dave Chinner
  2019-08-26 17:32       ` Verma, Vishal L
  3 siblings, 0 replies; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-19 17:19 UTC (permalink / raw)
  To: hch; +Cc: Williams, Dan J, linux-xfs, david, darrick.wong

On Sun, 2019-08-18 at 19:34 +0200, hch@lst.de wrote:
> 
> So I can also reproduce the same issue with the ramdisk driver, but not
> with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> loop).  Which made me wonder if there is some issue about the memory
> passed in, and indeed just switching to plain vmalloc vs the XFS
> kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
> the issue for me.  I don't really understand why yet, maybe I need to
> dig out alignment testing patches.

With the patch below, I can't reproduce the failure - if this is going
into 5.3-rc, feel free to add:

Tested-by: Vishal Verma <vishal.l.verma@intel.com>

> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..918ad3b884a7 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return vmalloc(BBTOB(nbblks));
>  }
>  
>  /*
> @@ -416,7 +416,7 @@ xlog_find_verify_cycle(
>  	*new_blk = -1;
>  
>  out:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	return error;
>  }
>  
> @@ -527,7 +527,7 @@ xlog_find_verify_log_record(
>  		*last_blk = i;
>  
>  out:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	return error;
>  }
>  
> @@ -781,7 +781,7 @@ xlog_find_head(
>  			goto out_free_buffer;
>  	}
>  
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	if (head_blk == log_bbnum)
>  		*return_head_blk = 0;
>  	else
> @@ -795,7 +795,7 @@ xlog_find_head(
>  	return 0;
>  
>  out_free_buffer:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	if (error)
>  		xfs_warn(log->l_mp, "failed to find log head");
>  	return error;
> @@ -1049,7 +1049,7 @@ xlog_verify_tail(
>  		"Tail block (0x%llx) overwrite detected. Updated to 0x%llx",
>  			 orig_tail, *tail_blk);
>  out:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	return error;
>  }
>  
> @@ -1096,7 +1096,7 @@ xlog_verify_head(
>  	error = xlog_rseek_logrec_hdr(log, *head_blk, *tail_blk,
>  				      XLOG_MAX_ICLOGS, tmp_buffer,
>  				      &tmp_rhead_blk, &tmp_rhead, &tmp_wrapped);
> -	kmem_free(tmp_buffer);
> +	vfree(tmp_buffer);
>  	if (error < 0)
>  		return error;
>  
> @@ -1429,7 +1429,7 @@ xlog_find_tail(
>  		error = xlog_clear_stale_blocks(log, tail_lsn);
>  
>  done:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  
>  	if (error)
>  		xfs_warn(log->l_mp, "failed to locate log tail");
> @@ -1477,7 +1477,7 @@ xlog_find_zeroed(
>  	first_cycle = xlog_get_cycle(offset);
>  	if (first_cycle == 0) {		/* completely zeroed log */
>  		*blk_no = 0;
> -		kmem_free(buffer);
> +		vfree(buffer);
>  		return 1;
>  	}
>  
> @@ -1488,7 +1488,7 @@ xlog_find_zeroed(
>  
>  	last_cycle = xlog_get_cycle(offset);
>  	if (last_cycle != 0) {		/* log completely written to */
> -		kmem_free(buffer);
> +		vfree(buffer);
>  		return 0;
>  	}
>  
> @@ -1535,7 +1535,7 @@ xlog_find_zeroed(
>  
>  	*blk_no = last_blk;
>  out_free_buffer:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	if (error)
>  		return error;
>  	return 1;
> @@ -1647,7 +1647,7 @@ xlog_write_log_records(
>  	}
>  
>  out_free_buffer:
> -	kmem_free(buffer);
> +	vfree(buffer);
>  	return error;
>  }
>  
> @@ -5291,7 +5291,7 @@ xlog_do_recovery_pass(
>  			hblks = h_size / XLOG_HEADER_CYCLE_SIZE;
>  			if (h_size % XLOG_HEADER_CYCLE_SIZE)
>  				hblks++;
> -			kmem_free(hbp);
> +			vfree(hbp);
>  			hbp = xlog_alloc_buffer(log, hblks);
>  		} else {
>  			hblks = 1;
> @@ -5307,7 +5307,7 @@ xlog_do_recovery_pass(
>  		return -ENOMEM;
>  	dbp = xlog_alloc_buffer(log, BTOBB(h_size));
>  	if (!dbp) {
> -		kmem_free(hbp);
> +		vfree(hbp);
>  		return -ENOMEM;
>  	}
>  
> @@ -5468,9 +5468,9 @@ xlog_do_recovery_pass(
>  	}
>  
>   bread_err2:
> -	kmem_free(dbp);
> +	vfree(dbp);
>   bread_err1:
> -	kmem_free(hbp);
> +	vfree(hbp);
>  
>  	/*
>  	 * Submit buffers that have been added from the last record processed,


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  4:40                 ` hch
  2019-08-19  5:31                   ` Dave Chinner
@ 2019-08-20  4:41                   ` Dave Chinner
  2019-08-20  5:53                     ` hch
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-20  4:41 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 06:40:12AM +0200, hch@lst.de wrote:
> On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote:
> > On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote:
> > > That implies a kmalloc heap issue.
> > > 
> > > Oh, is memory poisoning or something that modifies the alignment of
> > > slabs turned on?
> > > 
> > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be
> > > appropriately aligned for IO, similar to the problems we have with
> > > the xen blk driver?
> > 
> > That is what I suspect, and as you can see in the attached config I
> > usually run with slab debugging on.
> 
> Yep, looks like an unaligned allocation:
> 
> root@testvm:~# mount /dev/pmem1 /mnt/
> [   62.346660] XFS (pmem1): Mounting V5 Filesystem
> [   62.347960] unaligned allocation, offset = 680
> [   62.349019] unaligned allocation, offset = 680
> [   62.349872] unaligned allocation, offset = 680
> [   62.350703] XFS (pmem1): totally zeroed log
> [   62.351443] unaligned allocation, offset = 680
> [   62.452203] unaligned allocation, offset = 344
> [   62.528964] XFS: Assertion failed: head_blk != tail_blk, file: fs/xfs/xfs_log_recover.c, line: 5236
> [   62.529879] ------------[ cut here ]------------
> [   62.530334] kernel BUG at fs/xfs/xfs_message.c:102!
> [   62.530824] invalid opcode: 0000 [#1] SMP PTI
> 
> With the following debug patch.  Based on that I think I'll just
> formally submit the vmalloc switch as we're at -rc5, and then we
> can restart the unaligned slub allocation drama..

This still doesn't make sense to me, because the pmem and brd code
have no alignment limitations in their make_request code - they can
handle byte addressing and should not have any problem at all with
8 byte aligned memory in bios.

Digging a little further, I note that both brd and pmem use
identical mechanisms to marshall data in and out of bios, so they
are likely to have the same issue.

So, brd_make_request() does:

        bio_for_each_segment(bvec, bio, iter) {
                unsigned int len = bvec.bv_len;
                int err;

                err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
                                  bio_op(bio), sector);
                if (err)
                        goto io_error;
                sector += len >> SECTOR_SHIFT;
        }

So, the code behind bio_for_each_segment() splits multi-page bvecs
into individual pages, which are passed to brd_do_bvec(). An
unaligned 4kB io traces out as:

 [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
 [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e

i.e. page		offset	len	sector
00000000a77f0146	768	3328	0x7d0048
000000006ceca91e	0	768	0x7d004e

You should be able to guess what the problems are from this.

Both pmem and brd are _sector_ based. We've done a partial sector
copy on the first bvec, then the second bvec has started the copy
from the wrong offset into the sector we've done a partial copy
from.

IOWs, no error is reported when the bvec buffer isn't sector
aligned, no error is reported when the length of data to copy was
not a multiple of sector size, and no error was reported when we
copied the same partial sector twice.
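
Spelling out the arithmetic (assuming brd_do_bvec() copies bv_len
bytes from bv_page + bv_offset to byte 0 of the sector it is passed):

  bvec 1: 3328 bytes at 0x7d0048 = 6 full sectors plus 256 bytes
          spilling into sector 0x7d004e
  then:   sector += 3328 >> 9 = 6, silently dropping the half sector
  bvec 2: 768 bytes written at byte 0 of 0x7d004e, clobbering the
          256 bytes bvec 1 just put there

So buffer bytes 3328-4095 land 256 bytes earlier than they should,
bytes 3072-3327 are lost, and the last 256 bytes of the 8-sector
target range are never written at all.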

There's nothing quite like being repeatedly bitten by the same
misalignment bug because there's no validation in the infrastructure
that could catch it immediately and throw a useful warning/error
message.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  4:41                   ` Dave Chinner
@ 2019-08-20  5:53                     ` hch
  2019-08-20  7:44                       ` Dave Chinner
  2019-08-20  8:13                       ` Ming Lei
  0 siblings, 2 replies; 31+ messages in thread
From: hch @ 2019-08-20  5:53 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong,
	Ming Lei, linux-block

On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > With the following debug patch.  Based on that I think I'll just
> > formally submit the vmalloc switch as we're at -rc5, and then we
> > can restart the unaligned slub allocation drama..
> 
> This still doesn't make sense to me, because the pmem and brd code
> have no alignment limitations in their make_request code - they can
> handle byte addressing and should not have any problem at all with
> 8 byte aligned memory in bios.
> 
> Digging a little further, I note that both brd and pmem use
> identical mechanisms to marshall data in and out of bios, so they
> are likely to have the same issue.
> 
> So, brd_make_request() does:
> 
>         bio_for_each_segment(bvec, bio, iter) {
>                 unsigned int len = bvec.bv_len;
>                 int err;
> 
>                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
>                                   bio_op(bio), sector);
>                 if (err)
>                         goto io_error;
>                 sector += len >> SECTOR_SHIFT;
>         }
> 
> So, the code behind bio_for_each_segment() splits multi-page bvecs
> into individual pages, which are passed to brd_do_bvec(). An
> unaligned 4kB io traces out as:
> 
>  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
>  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> 
> i.e. page		offset	len	sector
> 00000000a77f0146	768	3328	0x7d0048
> 000000006ceca91e	0	768	0x7d004e
> 
> You should be able to guess what the problems are from this.
> 
> Both pmem and brd are _sector_ based. We've done a partial sector
> copy on the first bvec, then the second bvec has started the copy
> from the wrong offset into the sector we've done a partial copy
> from.
> 
> IOWs, no error is reported when the bvec buffer isn't sector
> aligned, no error is reported when the length of data to copy was
> not a multiple of sector size, and no error was reported when we
> copied the same partial sector twice.

Yes.  I think bio_for_each_segment is buggy here, as it should not
blindly split by pages.  Ccing Ming as he wrote much of this code.  I'll
also dig into fixing it, but I just arrived in Japan and might be a
little jetlagged.

> There's nothing quite like being repeatedly bitten by the same
> misalignment bug because there's no validation in the infrastructure
> that could catch it immediately and throw a useful warning/error
> message.

The xen block driver doesn't use bio_for_each_segment, so it isn't
exactly the same but a very related issue.  Maybe until we sort
all this mess out we just need to depend on !SLUB_DEBUG for XFS?

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-19  5:31                   ` Dave Chinner
@ 2019-08-20  6:14                     ` hch
  0 siblings, 0 replies; 31+ messages in thread
From: hch @ 2019-08-20  6:14 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Mon, Aug 19, 2019 at 03:31:22PM +1000, Dave Chinner wrote:
> But I'm not so sure that's all there is to it. I turned KASAN on and
> it works for the first few mkfs/mount cycles, then a mount basically
> pegs a CPU and it's spending most of its time inside KASAN
> accounting code like this:

Did that KASAN splat lead anywhere?  I've done some KASAN runs now and
I've not seen anything yet after running mkfs/mount/umount in a loop.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53                     ` hch
@ 2019-08-20  7:44                       ` Dave Chinner
  2019-08-20  8:13                       ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2019-08-20  7:44 UTC (permalink / raw)
  To: hch
  Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong,
	Ming Lei, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> > 
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> > 
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshall data in and out of bios, so they
> > are likely to have the same issue.
> > 
> > So, brd_make_request() does:
> > 
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> > 
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> > 
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec(). An
> > unaligned 4kB io traces out as:
> > 
> >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > 
> > i.e. page		offset	len	sector
> > 00000000a77f0146	768	3328	0x7d0048
> > 000000006ceca91e	0	768	0x7d004e
> > 
> > You should be able to guess what the problems are from this.
> > 
> > Both pmem and brd are _sector_ based. We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> > 
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
> 
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.  Ccing Ming as he wrote much of this code.  I'll
> also dig into fixing it, but I just arrived in Japan and might be a
> little jetlagged.
> 
> > There's nothing quite like being repeatedly bitten by the same
> > misalignment bug because there's no validation in the infrastructure
> > that could catch it immediately and throw a useful warning/error
> > message.
> 
> The xen block driver doesn't use bio_for_each_segment, so it isn't
> exactly the same but a very related issue. 

Both stem from the fact that nothing in the block layer validates
memory alignment constraints. Xenblk, pmem and brd all return 511 to
queue_dma_alignment(), and all break when passed memory that isn't
aligned to 512 bytes.  There aren't even debug options we can turn
on that would tell us this is happening. Instead, we start with
data corruption and have to work backwards to find the needle in the
haystack from there. EIO and a WARN_ONCE would be a massive
improvement here....
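
i.e. something like this hypothetical helper (sketch only - nothing
like it exists in the block layer today):

static bool bio_segments_aligned(struct bio *bio)
{
	unsigned long align = queue_dma_alignment(bio->bi_disk->queue);
	struct bvec_iter iter;
	struct bio_vec bv;

	bio_for_each_segment(bv, bio, iter)
		if ((bv.bv_offset | bv.bv_len) & align)
			return false;
	return true;
}

/* in the submission path, roughly: */
if (!bio_segments_aligned(bio)) {
	WARN_ONCE(1, "misaligned bio segment\n");
	status = BLK_STS_IOERR;
	goto end_io;
}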

> Maybe until we sort
> all this mess out we just need to depend on !SLUB_DEBUG for XFS?

SLUB_DEBUG=y by itself doesn't cause problems - I run that
all the time because otherwise there's no /proc/slabinfo with slub.

I used KASAN to get the above alignment behaviour - it's
SLUB_DEBUG_ON=y that perturbs the alignment, and I think the same
thing can happen with SLAB_DEBUG=y, so there's several dependencies
we'd have to add here.

Is there any way we can pass kmalloc a "use aligned heap" GFP flag
so that it allocates from the -rcl slabs to guarantee alignment
rather than the standard slabs that change alignment with config
options?
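
Failing that, a dedicated cache with an explicit align argument would
give the same guarantee, at the cost of a fixed object size. A sketch,
with a made-up cache name:

	/* 4k objects, 512 byte alignment regardless of slab debug options */
	cache = kmem_cache_create("xlog_sector_bufs", 4096, 512, 0, NULL);
	buf = kmem_cache_alloc(cache, GFP_KERNEL);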

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  5:53                     ` hch
  2019-08-20  7:44                       ` Dave Chinner
@ 2019-08-20  8:13                       ` Ming Lei
  2019-08-20  9:24                         ` Ming Lei
  1 sibling, 1 reply; 31+ messages in thread
From: Ming Lei @ 2019-08-20  8:13 UTC (permalink / raw)
  To: hch
  Cc: Dave Chinner, Verma, Vishal L, linux-xfs, Williams, Dan J,
	darrick.wong, linux-block

On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > With the following debug patch.  Based on that I think I'll just
> > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > can restart the unaligned slub allocation drama..
> > 
> > This still doesn't make sense to me, because the pmem and brd code
> > have no alignment limitations in their make_request code - they can
> > handle byte addressing and should not have any problem at all with
> > 8 byte aligned memory in bios.
> > 
> > Digging a little further, I note that both brd and pmem use
> > identical mechanisms to marshall data in and out of bios, so they
> > are likely to have the same issue.
> > 
> > So, brd_make_request() does:
> > 
> >         bio_for_each_segment(bvec, bio, iter) {
> >                 unsigned int len = bvec.bv_len;
> >                 int err;
> > 
> >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> >                                   bio_op(bio), sector);
> >                 if (err)
> >                         goto io_error;
> >                 sector += len >> SECTOR_SHIFT;
> >         }
> > 
> > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > into individual pages, which are passed to brd_do_bvec(). An
> > unaligned 4kB io traces out as:
> > 
> >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > 
> > i.e. page		offset	len	sector
> > 00000000a77f0146	768	3328	0x7d0048
> > 000000006ceca91e	0	768	0x7d004e
> > 
> > You should be able to guess what the problems are from this.

The problem should be that an offset of '768' is passed to bio_add_page().

It should be one slub buffer used for block IO - this looks like an old
unsolved problem.

> > 
> > Both pmem and brd are _sector_ based. We've done a partial sector
> > copy on the first bvec, then the second bvec has started the copy
> > from the wrong offset into the sector we've done a partial copy
> > from.
> > 
> > IOWs, no error is reported when the bvec buffer isn't sector
> > aligned, no error is reported when the length of data to copy was
> > not a multiple of sector size, and no error was reported when we
> > copied the same partial sector twice.
> 
> Yes.  I think bio_for_each_segment is buggy here, as it should not
> blindly split by pages.

bio_for_each_segment() just keeps the original interface as it was
before multi-page bvecs were introduced.
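
For reference, the two iterators side by side (usage sketch, the
handlers are made-up names):

	struct bio_vec bv;
	struct bvec_iter iter;

	/* yields single-page chunks: a multi-page bvec is split up */
	bio_for_each_segment(bv, bio, iter)
		handle_page_chunk(&bv);

	/* yields whole, possibly multi-page bvecs as stored in the bio */
	bio_for_each_bvec(bv, bio, iter)
		handle_full_bvec(&bv);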


Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  8:13                       ` Ming Lei
@ 2019-08-20  9:24                         ` Ming Lei
  2019-08-20 16:30                           ` Verma, Vishal L
  2019-08-20 21:44                           ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: Ming Lei @ 2019-08-20  9:24 UTC (permalink / raw)
  To: hch, Verma, Vishal L
  Cc: Dave Chinner, linux-xfs, Williams, Dan J, darrick.wong, linux-block

On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > With the following debug patch.  Based on that I think I'll just
> > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > can restart the unaligned slub allocation drama..
> > > 
> > > This still doesn't make sense to me, because the pmem and brd code
> > > have no alignment limitations in their make_request code - they can
> > > handle byte addressing and should not have any problem at all with
> > > 8 byte aligned memory in bios.
> > > 
> > > Digging a little further, I note that both brd and pmem use
> > > identical mechanisms to marshall data in and out of bios, so they
> > > are likely to have the same issue.
> > > 
> > > So, brd_make_request() does:
> > > 
> > >         bio_for_each_segment(bvec, bio, iter) {
> > >                 unsigned int len = bvec.bv_len;
> > >                 int err;
> > > 
> > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > >                                   bio_op(bio), sector);
> > >                 if (err)
> > >                         goto io_error;
> > >                 sector += len >> SECTOR_SHIFT;
> > >         }
> > > 
> > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > into individual pages, which are passed to brd_do_bvec(). An
> > > unaligned 4kB io traces out as:
> > > 
> > >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > 
> > > i.e. page		offset	len	sector
> > > 00000000a77f0146	768	3328	0x7d0048
> > > 000000006ceca91e	0	768	0x7d004e
> > > 
> > > You should be able to guess what the problems are from this.
> 
> The problem should be that an offset of '768' is passed to bio_add_page().

It can be quite hard to deal with a non-512-aligned sector buffer, since
one sector buffer may cross two pages. So far, one workaround I have
thought of is to not merge such an IO buffer into one bvec.

Verma, could you try the following patch?

diff --git a/block/bio.c b/block/bio.c
index 24a496f5d2e2..49deab2ac8c4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
 	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
 		return false;
 
+	if (off & 511)
+		return false;
+
 	if (bio->bi_vcnt > 0) {
 		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
 

Thanks,
Ming

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24                         ` Ming Lei
@ 2019-08-20 16:30                           ` Verma, Vishal L
  2019-08-20 21:44                           ` Dave Chinner
  1 sibling, 0 replies; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-20 16:30 UTC (permalink / raw)
  To: hch, ming.lei
  Cc: Williams, Dan J, linux-xfs, david, darrick.wong, linux-block

On Tue, 2019-08-20 at 17:24 +0800, Ming Lei wrote:
> 
> It can be quite hard to deal with a non-512-aligned sector buffer, since
> one sector buffer may cross two pages. So far, one workaround I have
> thought of is to not merge such an IO buffer into one bvec.
> 
> Verma, could you try the following patch?

Hi Ming,

I can hit the same failure with this patch.
Full thread, in case you haven't already seen it:
https://lore.kernel.org/linux-xfs/e49a6a3a244db055995769eb844c281f93e50ab9.camel@intel.com/

> 
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct
> page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;
> +
>  	if (bio->bi_vcnt > 0) {
>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>  
> 
> Thanks,
> Ming


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20  9:24                         ` Ming Lei
  2019-08-20 16:30                           ` Verma, Vishal L
@ 2019-08-20 21:44                           ` Dave Chinner
  2019-08-20 22:08                             ` Verma, Vishal L
  2019-08-21  1:56                             ` Ming Lei
  1 sibling, 2 replies; 31+ messages in thread
From: Dave Chinner @ 2019-08-20 21:44 UTC (permalink / raw)
  To: Ming Lei
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong,
	linux-block

On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > With the following debug patch.  Based on that I think I'll just
> > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > can restart the unaligned slub allocation drama..
> > > > 
> > > > This still doesn't make sense to me, because the pmem and brd code
> > > > have no alignment limitations in their make_request code - they can
> > > > handle byte addressing and should not have any problem at all with
> > > > 8 byte aligned memory in bios.
> > > > 
> > > > Digging a little further, I note that both brd and pmem use
> > > > identical mechanisms to marshall data in and out of bios, so they
> > > > are likely to have the same issue.
> > > > 
> > > > So, brd_make_request() does:
> > > > 
> > > >         bio_for_each_segment(bvec, bio, iter) {
> > > >                 unsigned int len = bvec.bv_len;
> > > >                 int err;
> > > > 
> > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > >                                   bio_op(bio), sector);
> > > >                 if (err)
> > > >                         goto io_error;
> > > >                 sector += len >> SECTOR_SHIFT;
> > > >         }
> > > > 
> > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > into individual pages, which are passed to brd_do_bvec(). An
> > > > unaligned 4kB io traces out as:
> > > > 
> > > >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > > 
> > > > i.e. page		offset	len	sector
> > > > 00000000a77f0146	768	3328	0x7d0048
> > > > 000000006ceca91e	0	768	0x7d004e
> > > > 
> > > > You should be able to guess what the problems are from this.
> > 
> > The problem should be that an offset of '768' is passed to bio_add_page().
> 
> It can be quite hard to deal with a non-512-aligned sector buffer, since
> one sector buffer may cross two pages. So far, one workaround I have
> thought of is to not merge such an IO buffer into one bvec.
> 
> Verma, could you try the following patch?
> 
> diff --git a/block/bio.c b/block/bio.c
> index 24a496f5d2e2..49deab2ac8c4 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
>  		return false;
>  
> +	if (off & 511)
> +		return false;

What does this achieve? It only prevents the unaligned segment from
being merged; it doesn't prevent it from being added to a new bvec.

However, the case here is that:

> > > > i.e. page		offset	len	sector
> > > > 00000000a77f0146	768	3328	0x7d0048
> > > > 000000006ceca91e	0	768	0x7d004e

The second page added to the bvec is actually offset aligned. Hence
the check would do nothing on the first page because the bvec array
is empty (so it goes into a new bvec anyway), and the check on the
second page would do nothing and it would merge with the first because
the offset is aligned correctly. In both cases, the length of the
segment is not aligned, so that needs to be checked, too.

IOWs, I think the check needs to be in bio_add_page, it needs to
check both the offset and length for alignment, and it needs to grab
the alignment from queue_dma_alignment(), not use a hard coded value
of 511.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44                           ` Dave Chinner
@ 2019-08-20 22:08                             ` Verma, Vishal L
  2019-08-20 23:53                               ` Dave Chinner
  2019-08-21  2:19                               ` Ming Lei
  2019-08-21  1:56                             ` Ming Lei
  1 sibling, 2 replies; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-20 22:08 UTC (permalink / raw)
  To: david, ming.lei
  Cc: hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong

On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> 
> However, the case here is that:
> 
> > > > > i.e. page		offset	len	sector
> > > > > 00000000a77f0146	768	3328	0x7d0048
> > > > > 000000006ceca91e	0	768	0x7d004e
> 
> The second page added to the bvec is actually offset aligned. Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first because
> the offset is aligned correctly. In both cases, the length of the
> segment is not aligned, so that needs to be checked, too.
> 
> IOWs, I think the check needs to be in bio_add_page, it needs to
> check both the offset and length for alignment, and it needs to grab
> the alignment from queue_dma_alignment(), not use a hard coded value
> of 511.
> 
So something like this?

diff --git a/block/bio.c b/block/bio.c
index 299a0e7651ec..80f449d23e5a 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
 int bio_add_page(struct bio *bio, struct page *page,
                 unsigned int len, unsigned int offset)
 {
+       struct request_queue *q = bio->bi_disk->queue;
        bool same_page = false;
 
+       if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
+               return 0;
+
        if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
                if (bio_full(bio, len))
                        return 0;

I tried this, but the 'mount' just hangs - which looks like it might be
due to xfs_rw_bdev() doing:

  while (bio_add_page(bio, page, len, off) != len) {
  	...


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08                             ` Verma, Vishal L
@ 2019-08-20 23:53                               ` Dave Chinner
  2019-08-21  2:19                               ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2019-08-20 23:53 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: ming.lei, hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> > 
> > However, the case here is that:
> > 
> > > > > > i.e. page		offset	len	sector
> > > > > > 00000000a77f0146	768	3328	0x7d0048
> > > > > > 000000006ceca91e	0	768	0x7d004e
> > 
> > The second page added to the bvec is actually offset aligned. Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first because
> > the offset is aligned correctly. In both cases, the length of the
> > segment is not aligned, so that needs to be checked, too.
> > 
> > IOWs, I think the check needs to be in bio_add_page, it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard coded value
> > of 511.
> > 
> So something like this?
> 
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>                  unsigned int len, unsigned int offset)
>  {
> +       struct request_queue *q = bio->bi_disk->queue;
>         bool same_page = false;
>  
> +       if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +               return 0;
> +
>         if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
>                 if (bio_full(bio, len))
>                         return 0;
> 
> I tried this, but the 'mount' just hangs - which looks like it might be
> due to xfs_rw_bdev() doing:
> 
>   while (bio_add_page(bio, page, len, off) != len) {

That's the return of zero that causes the loop to make no progress.
i.e. a return of 0 means "won't fit in bio, allocate a new bio
and try again". It's not an error return, so always returning zero
will eventually chew up all your memory allocating bios it
doesn't use, because submit_bio() doesn't return errors on chained
bios until the final bio in the chain is completed.

Add a bio_add_page_checked() function that does exactly the same
thing as bio_add_page(), but add the

	if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q)))
		return -EIO;

to it and change the xfs code to:

	while ((ret = bio_add_page_checked(bio, page, len, off)) != len) {
		if (ret < 0) {
			/*
			 * submit the bio to wait on the rest of the
			 * chain to complete, then return an error.
			 * This is a really shitty failure on write, as we
			 * will have just done a partial write and
			 * effectively corrupted something on disk.
			 */
			submit_bio_wait(bio);
			return ret;
		}
		....
	}

We probably should change all the XFS calls to bio_add_page to
bio_add_page_checked() while we are at it, because we have the
same alignment problem through xfs_buf.c and, potentially, on iclogs
via xfs_log.c as iclogs are allocated with kmem_alloc_large(), too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18 17:34     ` hch
  2019-08-19  0:08       ` Dave Chinner
  2019-08-19 17:19       ` Verma, Vishal L
@ 2019-08-21  0:26       ` Dave Chinner
  2019-08-21  0:44         ` hch
  2019-08-26 17:32       ` Verma, Vishal L
  3 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-21  0:26 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Sun, Aug 18, 2019 at 07:34:26PM +0200, hch@lst.de wrote:
> On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote:
> > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > > > The kernel log shows the following when the mount fails:
> > > 
> > > Is it always that same message?  I'll see if I can reproduce it,
> > > but I won't have that much memory to spare to create fake pmem,
> > > hope this also works with a single device and/or less memory..
> > 
> > I've reproduced a similar ASSERT with a small pmem device, so I hope
> > I can debug the issue locally now.
> 
> So I can also reproduce the same issue with the ramdisk driver, but not
> with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> loop).  Which made me wonder if there is some issue about the memory
> passed in, and indeed just switching to plain vmalloc vs the XFS
> kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
> the issue for me.  I don't really understand why yet, maybe I need to
> dig out alignment testing patches.
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..918ad3b884a7 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return vmalloc(BBTOB(nbblks));
>  }

After thinking on this for a bit, I suspect the better thing to do
here is add a KM_ALIGNED flag to the allocation, so if the internal
kmem_alloc() returns an unaligned pointer we free it and fall
through to vmalloc() to get a properly aligned pointer....

That way none of the other interfaces have to change, and we can
then use kmem_alloc_large() everywhere we allocate buffers for IO.
And we don't need new infrastructure just to support these debug
configurations, either.

Actually, kmem_alloc_io() might be a better idea - keep the aligned
flag internal to the kmem code. Seems like a pretty simple solution
to the entire problem we have here...
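
A minimal sketch of that idea (purely a hypothetical illustration, not
a tested patch; the hard-coded 511 mask stands in for whatever
alignment the IO path actually needs, and kmem_alloc()/kmem_free()/
KM_MAYFAIL are the existing XFS kmem wrappers):

	void *
	kmem_alloc_io(size_t size, xfs_km_flags_t flags)
	{
		void	*ptr;

		ptr = kmem_alloc(size, flags | KM_MAYFAIL);
		if (ptr) {
			if (!((unsigned long)ptr & 511))
				return ptr;
			/* slub debug handed back a misaligned object */
			kmem_free(ptr);
		}
		return vmalloc(size);	/* page aligned by construction */
	}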

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-21  0:26       ` Dave Chinner
@ 2019-08-21  0:44         ` hch
  2019-08-21  1:08           ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: hch @ 2019-08-21  0:44 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote:
> After thinking on this for a bit, I suspect the better thing to do
> here is add a KM_ALIGNED flag to the allocation, so if the internal
> kmem_alloc() returns an unaligned pointer we free it and fall
> through to vmalloc() to get a properly aligned pointer....
> 
> That way none of the other interfaces have to change, and we can
> then use kmem_alloc_large() everywhere we allocate buffers for IO.
> And we don't need new infrastructure just to support these debug
> configurations, either.
> 
> Actually, kmem_alloc_io() might be a better idea - keep the aligned
> flag internal to the kmem code. Seems like a pretty simple solution
> to the entire problem we have here...

The interface sounds ok.  The simple try and fallback implementation
OTOH means we always do two allocations if slub debugging is enabled,
which is a little nasty.  I guess that's the best we can do for 5.3,
and then we can figure out a way to avoid it later.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-21  0:44         ` hch
@ 2019-08-21  1:08           ` Dave Chinner
  2019-08-21  1:56             ` Verma, Vishal L
  0 siblings, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2019-08-21  1:08 UTC (permalink / raw)
  To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong

On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote:
> On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote:
> > After thinking on this for a bit, I suspect the better thing to do
> > here is add a KM_ALIGNED flag to the allocation, so if the internal
> > kmem_alloc() returns an unaligned pointer we free it and fall
> > through to vmalloc() to get a properly aligned pointer....
> > 
> > That way none of the other interfaces have to change, and we can
> > then use kmem_alloc_large() everywhere we allocate buffers for IO.
> > And we don't need new infrastructure just to support these debug
> > configurations, either.
> > 
> > Actually, kmem_alloc_io() might be a better idea - keep the aligned
> > flag internal to the kmem code. Seems like a pretty simple solution
> > to the entire problem we have here...
> 
> The interface sounds ok.  The simple try and fallback implementation
> OTOH means we always do two allocations if slub debugging is enabled,
> which is a little nasty.

Some creative ifdefery could avoid that, but quite frankly it's only
necessary for limited allocation contexts and so the
overhead/interactions would largely be unnoticeable compared to the
wider impact of memory debugging...
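
One shape that ifdefery could take (again a hypothetical sketch; the
CONFIG symbols are real, but a boot-time slub_debug= would still slip
past a compile-time check like this):

	/*
	 * Skip the kmalloc-then-check dance entirely when a debug
	 * option that can misalign slab objects is built in.
	 */
	#if defined(CONFIG_SLUB_DEBUG_ON) || defined(CONFIG_KASAN)
	# define KMEM_MAY_MISALIGN	1
	#else
	# define KMEM_MAY_MISALIGN	0
	#endif

	void *
	kmem_alloc_io(size_t size, xfs_km_flags_t flags)
	{
		if (KMEM_MAY_MISALIGN)
			return vmalloc(size);	/* always page aligned */
		return kmem_alloc(size, flags);
	}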

> I guess that's the best we can do for 5.3,
> and then we can figure out a way to avoid it later.

Yeah, it also has the advantage of documenting all the places we
need to be careful of allocation alignment, and it would allow us to
simply plug in whatever future infrastructure comes along that
guarantees allocation alignment without changing any other XFS
code...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 21:44                           ` Dave Chinner
  2019-08-20 22:08                             ` Verma, Vishal L
@ 2019-08-21  1:56                             ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Ming Lei @ 2019-08-21  1:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong,
	linux-block

On Wed, Aug 21, 2019 at 07:44:09AM +1000, Dave Chinner wrote:
> On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote:
> > On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote:
> > > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote:
> > > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote:
> > > > > > With the following debug patch.  Based on that I think I'll just
> > > > > > formally submit the vmalloc switch as we're at -rc5, and then we
> > > > > > can restart the unaligned slub allocation drama..
> > > > > 
> > > > > This still doesn't make sense to me, because the pmem and brd code
> > > > > have no alignment limitations in their make_request code - they can
> > > > > handle byte addressing and should not have any problem at all with
> > > > > 8 byte aligned memory in bios.
> > > > > 
> > > > > Digging a little further, I note that both brd and pmem use
> > > > > identical mechanisms to marshall data in and out of bios, so they
> > > > > are likely to have the same issue.
> > > > > 
> > > > > So, brd_make_request() does:
> > > > > 
> > > > >         bio_for_each_segment(bvec, bio, iter) {
> > > > >                 unsigned int len = bvec.bv_len;
> > > > >                 int err;
> > > > > 
> > > > >                 err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset,
> > > > >                                   bio_op(bio), sector);
> > > > >                 if (err)
> > > > >                         goto io_error;
> > > > >                 sector += len >> SECTOR_SHIFT;
> > > > >         }
> > > > > 
> > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs
> > > > > into individual pages, which are passed to brd_do_bvec(). An
> > > > > unaligned 4kB io traces out as:
> > > > > 
> > > > >  [  121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048
> > > > >  [  121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e
> > > > > 
> > > > > i.e. page		offset	len	sector
> > > > > 00000000a77f0146	768	3328	0x7d0048
> > > > > 000000006ceca91e	0	768	0x7d004e
> > > > > 
> > > > > You should be able to guess what the problems are from this.
> > > 
> > > The problem should be that an offset of '768' is passed to bio_add_page().
> > 
> > It can be quite hard to deal with a non-512-aligned sector buffer, since
> > one sector buffer may cross two pages. So far, one workaround I have
> > thought of is to not merge such an IO buffer into one bvec.
> > 
> > Verma, could you try the following patch?
> > 
> > diff --git a/block/bio.c b/block/bio.c
> > index 24a496f5d2e2..49deab2ac8c4 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
> >  	if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED)))
> >  		return false;
> >  
> > +	if (off & 511)
> > +		return false;
> 
> What does this achieve? It only prevents the unaligned segment from
> being merged; it doesn't prevent it from being added to a new bvec.

The current issue is that the block layer won't handle such a case: as I
mentioned, one sector buffer may cross two pages, and then it can't be
split successfully. So the workaround simply puts it into a new bvec,
just as before multi-page bvecs were enabled.

> 
> However, the case here is that:
> 
> > > > > i.e. page		offset	len	sector
> > > > > 00000000a77f0146	768	3328	0x7d0048
> > > > > 000000006ceca91e	0	768	0x7d004e
> 
> The second page added to the bvec is actually offset aligned. Hence
> the check would do nothing on the first page because the bvec array
> is empty (so it goes into a new bvec anyway), and the check on the
> second page would do nothing and it would merge with the first because
> the offset is aligned correctly. In both cases, the length of the
> segment is not aligned, so that needs to be checked, too.

What the patch changes is the bvec stored in the bio; the bvec above is
actually built in flight.

So if the 1st page added to the bio is (768, 512), then
bio_for_each_segment() will finally generate the 1st page as (768, 512),
and everything will be fine.

> 
> IOWs, I think the check needs to be in bio_add_page, it needs to
> check both the offset and length for alignment, and it needs to grab

The length has to be 512-byte aligned; otherwise it is simply a bug in the fs.

> the alignment from queue_dma_alignment(), not use a hard coded value
> of 511.

Now the policy for bio_add_page() is not to check any queue limit,
given that we don't know the actual limit for the final IO; the queue
may not even be set when bio_add_page() is called.

If the upper layer doesn't pass a slub buffer that is > PAGE_SIZE, the
block layer may handle it well without the 512 alignment check.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-21  1:08           ` Dave Chinner
@ 2019-08-21  1:56             ` Verma, Vishal L
  2019-08-21  6:15               ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-21  1:56 UTC (permalink / raw)
  To: hch, david; +Cc: Williams, Dan J, linux-xfs, darrick.wong

On Wed, 2019-08-21 at 11:08 +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote:
> > On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote:
> > > After thinking on this for a bit, I suspect the better thing to do
> > > here is add a KM_ALIGNED flag to the allocation, so if the
> > > internal
> > > kmem_alloc() returns an unaligned pointer we free it and fall
> > > through to vmalloc() to get a properly aligned pointer....
> > > 
> > > That way none of the other interfaces have to change, and we can
> > > then use kmem_alloc_large() everywhere we allocate buffers for IO.
> > > And we don't need new infrastructure just to support these debug
> > > configurations, either.
> > > 
> > > Actually, kmem_alloc_io() might be a better idea - keep the
> > > aligned
> > > flag internal to the kmem code. Seems like a pretty simple
> > > solution
> > > to the entire problem we have here...
> > 
> > The interface sounds ok.  The simple try and fallback implementation
> > OTOH means we always do two allocations if slub debugging is
> > enabled, which is a little nasty.
> 
> Some creative ifdefery could avoid that, but quite frankly it's only
> necessary for limited allocation contexts and so the
> overhead/interactions would largely be unnoticeable compared to the
> wider impact of memory debugging...
> 
> > I guess that's the best we can do for 5.3,
> > and then we can figure out a way to avoid it later.
> 
> Yeah, it also has the advantage of documenting all the places we
> need to be careful of allocation alignment, and it would allow us to
> simply plug in whatever future infrastructure comes along that
> guarantees allocation alignment without changing any other XFS
> code...

Just to clarify, this precludes the changes to bio_add_page() you
suggested earlier, right?



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-20 22:08                             ` Verma, Vishal L
  2019-08-20 23:53                               ` Dave Chinner
@ 2019-08-21  2:19                               ` Ming Lei
  1 sibling, 0 replies; 31+ messages in thread
From: Ming Lei @ 2019-08-21  2:19 UTC (permalink / raw)
  To: Verma, Vishal L
  Cc: david, hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong

On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote:
> > 
> > However, the case here is that:
> > 
> > > > > > i.e. page		offset	len	sector
> > > > > > 00000000a77f0146	768	3328	0x7d0048
> > > > > > 000000006ceca91e	0	768	0x7d004e
> > 
> > The second page added to the bvec is actually offset aligned. Hence
> > the check would do nothing on the first page because the bvec array
> > is empty (so it goes into a new bvec anyway), and the check on the
> > second page would do nothing and it would merge with the first because
> > the offset is aligned correctly. In both cases, the length of the
> > segment is not aligned, so that needs to be checked, too.
> > 
> > IOWs, I think the check needs to be in bio_add_page, it needs to
> > check both the offset and length for alignment, and it needs to grab
> > the alignment from queue_dma_alignment(), not use a hard coded value
> > of 511.
> > 
> So something like this?
> 
> diff --git a/block/bio.c b/block/bio.c
> index 299a0e7651ec..80f449d23e5a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page);
>  int bio_add_page(struct bio *bio, struct page *page,
>                  unsigned int len, unsigned int offset)
>  {
> +       struct request_queue *q = bio->bi_disk->queue;
>         bool same_page = false;
>  
> +       if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q))
> +               return 0;

bio->bi_disk or bio->bi_disk->queue may not be available in bio_add_page().

And 'len' has to be aligned to the logical block size of the disk on
which the fs is mounted, which is information the fs obviously has.

So the following code is really buggy:

xfs_rw_bdev():

+               struct page     *page = kmem_to_page(data);
+               unsigned int    off = offset_in_page(data);
+               unsigned int    len = min_t(unsigned, left, PAGE_SIZE - off);
+
+               while (bio_add_page(bio, page, len, off) != len) {
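
A minimal sketch of the fs-side check being described, as it might sit
at the top of xfs_rw_bdev() (hypothetical illustration; bdev, data and
count are the function's existing parameters, and
bdev_logical_block_size() is the real helper):

	unsigned int	lbs = bdev_logical_block_size(bdev);

	/* refuse to build a bio the driver may not be able to split */
	if (!IS_ALIGNED((unsigned long)data, lbs) || !IS_ALIGNED(count, lbs))
		return -EIO;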


Thanks,
Ming

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-21  1:56             ` Verma, Vishal L
@ 2019-08-21  6:15               ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2019-08-21  6:15 UTC (permalink / raw)
  To: Verma, Vishal L; +Cc: hch, Williams, Dan J, linux-xfs, darrick.wong

On Wed, Aug 21, 2019 at 01:56:33AM +0000, Verma, Vishal L wrote:
> On Wed, 2019-08-21 at 11:08 +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote:
> > > On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote:
> > > > After thinking on this for a bit, I suspect the better thing to do
> > > > here is add a KM_ALIGNED flag to the allocation, so if the
> > > > internal
> > > > kmem_alloc() returns an unaligned pointer we free it and fall
> > > > through to vmalloc() to get a properly aligned pointer....
> > > > 
> > > > That way none of the other interfaces have to change, and we can
> > > > then use kmem_alloc_large() everywhere we allocate buffers for IO.
> > > > And we don't need new infrastructure just to support these debug
> > > > configurations, either.
> > > > 
> > > > Actually, kmem_alloc_io() might be a better idea - keep the
> > > > aligned
> > > > flag internal to the kmem code. Seems like a pretty simple
> > > > solution
> > > > to the entire problem we have here...
> > > 
> > > The interface sounds ok.  The simple try and fallback implementation
> > > OTOH means we always do two allocations if slub debugging is
> > > enabled, which is a little nasty.
> > 
> > Some creative ifdefery could avoid that, but quite frankly it's only
> > necessary for limited allocation contexts and so the
> > overhead/interactions would largely be unnoticeable compared to the
> > wider impact of memory debugging...
> > 
> > > I guess that's the best we can do for 5.3,
> > > and then we can figure out a way to avoid it later.
> > 
> > Yeah, it also has the advantage of documenting all the places we
> > need to be careful of allocation alignment, and it would allow us to
> > simply plug in whatever future infrastructure comes along that
> > guarantees allocation alignment without changing any other XFS
> > code...
> 
> Just to clarify, this precludes the changes to bio_add_page() you
> suggested earlier, right?

Nope, I most certainly want the block layer to catch this, but
given the total lack of understanding from Ming, I've given up
on trying to convince them that validation is both possible and
necessary.

The patch below basically copies bio_add_page() into xfs as
xfs_bio_add_page() and uses block layer helpers to check alignment.
It's the same helper blk_rq_map_kern() uses to detect and bounce
unaligned user buffers...

With that, the simple KASAN-enabled reproducer:

# mkfs.xfs /dev/ram0; mount /dev/ram0 /mnt/test
[   88.260091] XFS (ram0): Mounting V5 Filesystem
[   88.314453] WARNING: CPU: 0 PID: 4681 at fs/xfs/xfs_bio_io.c:24 xfs_bio_add_page+0x134/0x1d3
[   88.316484] CPU: 0 PID: 4681 Comm: mount Not tainted 5.3.0-rc5-dgc+ #1516
[   88.318115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[   88.320110] RIP: 0010:xfs_bio_add_page+0x134/0x1d3
....
[   88.340471] Call Trace:
[   88.343011]  xfs_rw_bdev.cold.5+0xf4/0x193
[   88.345224]  xlog_do_io+0xd1/0x1c0
[   88.346050]  xlog_bread+0x23/0x70
[   88.346857]  xlog_find_verify_log_record+0xc8/0x330
[   88.349171]  xlog_find_zeroed+0x224/0x330
[   88.356712]  xlog_find_head+0xbe/0x640
[   88.375967]  xlog_find_tail+0xc6/0x500
[   88.388098]  xlog_recover+0x80/0x2a0
[   88.392020]  xfs_log_mount+0x3a5/0x3d0
[   88.392931]  xfs_mountfs+0x753/0xe40
[   88.403201]  xfs_fs_fill_super+0x5c1/0x9d0
[   88.405320]  mount_bdev+0x1be/0x200
[   88.407178]  legacy_get_tree+0x6e/0xb0
[   88.408089]  vfs_get_tree+0x41/0x160
[   88.408955]  do_mount+0xa48/0xcf0
[   88.416384]  ksys_mount+0xb6/0xd0
[   88.417191]  __x64_sys_mount+0x62/0x70
[   88.418101]  do_syscall_64+0x70/0x230
[   88.418991]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   88.420208] RIP: 0033:0x7f1817194fea
......
[   88.436923] XFS (ram0): log recovery read I/O error at daddr 0x0 len 8 error -5
[   88.438707] XFS (ram0): empty log check failed
[   88.439794] XFS (ram0): log mount/recovery failed: error -5
[   88.441435] XFS (ram0): log mount failed
mount: /mnt/test: can't read superblock on /dev/ram0.
umount: /mnt/test: not mounted.
#

throws a warning and an IO error. No data corruption, no weird
log recovery failures, etc. This is easy to detect, and so easy to
debug now. Shame we can't do this in the block layer code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: alignment check bio buffers

From: Dave Chinner <dchinner@redhat.com>

Because apparently this is impossible to do in the block layer.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++---
 fs/xfs/xfs_buf.c    |  7 ++++---
 fs/xfs/xfs_linux.h  |  3 +++
 fs/xfs/xfs_log.c    |  6 +++++-
 4 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c
index e2148f2d5d6b..fbaea643c000 100644
--- a/fs/xfs/xfs_bio_io.c
+++ b/fs/xfs/xfs_bio_io.c
@@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count)
 	return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES);
 }
 
+int
+xfs_bio_add_page(
+	struct bio	*bio,
+	struct page	*page,
+	unsigned int	len,
+	unsigned int	offset)
+{
+	struct request_queue	*q = bio->bi_disk->queue;
+	bool		same_page = false;
+
+	if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset)))
+		return -EIO;
+
+	if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) {
+		if (bio_full(bio, len))
+			return 0;
+		__bio_add_page(bio, page, len, offset);
+	}
+	return len;
+}
+
 int
 xfs_rw_bdev(
 	struct block_device	*bdev,
@@ -20,7 +41,7 @@ xfs_rw_bdev(
 {
 	unsigned int		is_vmalloc = is_vmalloc_addr(data);
 	unsigned int		left = count;
-	int			error;
+	int			error, ret = 0;
 	struct bio		*bio;
 
 	if (is_vmalloc && op == REQ_OP_WRITE)
@@ -36,9 +57,12 @@ xfs_rw_bdev(
 		unsigned int	off = offset_in_page(data);
 		unsigned int	len = min_t(unsigned, left, PAGE_SIZE - off);
 
-		while (bio_add_page(bio, page, len, off) != len) {
+		while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) {
 			struct bio	*prev = bio;
 
+			if (ret < 0)
+				goto submit;
+
 			bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left));
 			bio_copy_dev(bio, prev);
 			bio->bi_iter.bi_sector = bio_end_sector(prev);
@@ -48,14 +72,16 @@ xfs_rw_bdev(
 			submit_bio(prev);
 		}
 
+		ret = 0;
 		data += len;
 		left -= len;
 	} while (left > 0);
 
+submit:
 	error = submit_bio_wait(bio);
 	bio_put(bio);
 
 	if (is_vmalloc && op == REQ_OP_READ)
 		invalidate_kernel_vmap_range(data, count);
-	return error;
+	return ret ? ret : error;
 }
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index dd3aa64098a8..f2e6669f60b0 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1261,6 +1261,7 @@ xfs_buf_ioapply_map(
 	sector_t	sector =  bp->b_maps[map].bm_bn;
 	int		size;
 	int		offset;
+	int		rbytes = 0;
 
 	/* skip the pages in the buffer before the start offset */
 	page_index = 0;
@@ -1290,12 +1291,12 @@ xfs_buf_ioapply_map(
 	bio_set_op_attrs(bio, op, op_flags);
 
 	for (; size && nr_pages; nr_pages--, page_index++) {
-		int	rbytes, nbytes = PAGE_SIZE - offset;
+		int	nbytes = PAGE_SIZE - offset;
 
 		if (nbytes > size)
 			nbytes = size;
 
-		rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes,
+		rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes,
 				      offset);
 		if (rbytes < nbytes)
 			break;
@@ -1306,7 +1307,7 @@ xfs_buf_ioapply_map(
 		total_nr_pages--;
 	}
 
-	if (likely(bio->bi_iter.bi_size)) {
+	if (likely(bio->bi_iter.bi_size && rbytes >= 0)) {
 		if (xfs_buf_is_vmapped(bp)) {
 			flush_kernel_vmap_range(bp->b_addr,
 						xfs_buf_vmap_len(bp));
diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h
index ca15105681ca..e71c7bf3e714 100644
--- a/fs/xfs/xfs_linux.h
+++ b/fs/xfs/xfs_linux.h
@@ -145,6 +145,9 @@ static inline void delay(long ticks)
 	schedule_timeout_uninterruptible(ticks);
 }
 
+int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len,
+			unsigned int offset);
+
 /*
  * XFS wrapper structure for sysfs support. It depends on external data
  * structures and is embedded in various internal data structures to implement
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index acc5a824bb9e..c2f1ff57fb6d 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1673,8 +1673,12 @@ xlog_map_iclog_data(
 		struct page	*page = kmem_to_page(data);
 		unsigned int	off = offset_in_page(data);
 		size_t		len = min_t(size_t, count, PAGE_SIZE - off);
+		int		ret;
 
-		WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len);
+		ret = xfs_bio_add_page(bio, page, len, off);
+		WARN_ON_ONCE(ret != len);
+		if (ret < 0)
+			break;
 
 		data += len;
 		count -= len;

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: 5.3-rc1 regression with XFS log recovery
  2019-08-18 17:34     ` hch
                         ` (2 preceding siblings ...)
  2019-08-21  0:26       ` Dave Chinner
@ 2019-08-26 17:32       ` Verma, Vishal L
  3 siblings, 0 replies; 31+ messages in thread
From: Verma, Vishal L @ 2019-08-26 17:32 UTC (permalink / raw)
  To: hch; +Cc: Williams, Dan J, linux-xfs, david, darrick.wong

On Sun, 2019-08-18 at 19:34 +0200, hch@lst.de wrote:
> On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote:
> > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote:
> > > > The kernel log shows the following when the mount fails:
> > > 
> > > Is it always that same message?  I'll see if I can reproduce it,
> > > but I won't have that much memory to spare to create fake pmem,
> > > hope this also works with a single device and/or less memory..
> > 
> > I've reproduced a similar ASSERT with a small pmem device, so I hope
> > I can debug the issue locally now.
> 
> So I can also reproduce the same issue with the ramdisk driver, but not
> with any other 4k sector size device (nvmet, scsi target, scsi_debug,
> loop).  Which made me wonder if there is some issue about the memory
> passed in, and indeed just switching to plain vmalloc vs the XFS
> kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes
> the issue for me.  I don't really understand why yet, maybe I need to
> dig out alignment testing patches.
> 
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 13d1d3e95b88..918ad3b884a7 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -125,7 +125,7 @@ xlog_alloc_buffer(
>  	if (nbblks > 1 && log->l_sectBBsize > 1)
>  		nbblks += log->l_sectBBsize;
>  	nbblks = round_up(nbblks, log->l_sectBBsize);
> -	return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL);
> +	return vmalloc(BBTOB(nbblks));

Is the plan to use this vmalloc fix for 5.3? Is there anything I can
help with to make progress on this?

	-Vishal

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2019-08-26 17:32 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-16 20:59 5.3-rc1 regression with XFS log recovery Verma, Vishal L
2019-08-18  7:11 ` hch
2019-08-18  7:41   ` hch
2019-08-18 17:34     ` hch
2019-08-19  0:08       ` Dave Chinner
2019-08-19  3:49         ` hch
2019-08-19  4:11           ` hch
2019-08-19  4:22             ` Dave Chinner
2019-08-19  4:29               ` hch
2019-08-19  4:40                 ` hch
2019-08-19  5:31                   ` Dave Chinner
2019-08-20  6:14                     ` hch
2019-08-20  4:41                   ` Dave Chinner
2019-08-20  5:53                     ` hch
2019-08-20  7:44                       ` Dave Chinner
2019-08-20  8:13                       ` Ming Lei
2019-08-20  9:24                         ` Ming Lei
2019-08-20 16:30                           ` Verma, Vishal L
2019-08-20 21:44                           ` Dave Chinner
2019-08-20 22:08                             ` Verma, Vishal L
2019-08-20 23:53                               ` Dave Chinner
2019-08-21  2:19                               ` Ming Lei
2019-08-21  1:56                             ` Ming Lei
2019-08-19  4:15           ` Dave Chinner
2019-08-19 17:19       ` Verma, Vishal L
2019-08-21  0:26       ` Dave Chinner
2019-08-21  0:44         ` hch
2019-08-21  1:08           ` Dave Chinner
2019-08-21  1:56             ` Verma, Vishal L
2019-08-21  6:15               ` Dave Chinner
2019-08-26 17:32       ` Verma, Vishal L
