* 5.3-rc1 regression with XFS log recovery @ 2019-08-16 20:59 Verma, Vishal L 2019-08-18 7:11 ` hch 0 siblings, 1 reply; 31+ messages in thread From: Verma, Vishal L @ 2019-08-16 20:59 UTC (permalink / raw) To: linux-xfs; +Cc: Williams, Dan J, hch, david, darrick.wong Hi all, When running the 'ndctl' unit tests against 5.3-rc kernels, I noticed a frequent failure of the 'mmap.sh' test [1][2]. [1]: https://github.com/pmem/ndctl/blob/master/test/mmap.sh [2]: https://github.com/pmem/ndctl/blob/master/test/mmap.c But in trying to pare down the test further, I found that I can simply reproduce the problem by: mkfs.xfs -f /dev/pmem0 mount /dev/pmem0 /mnt/mem Where 'pmem0' is a legacy pmem namespace from reserved memory using the memmap= command line option. (Specifically, I have this: memmap=3G!6G,3G!9G ) The above mkfs/mount steps don't reproduce the problem 100% of the time, but it does happen on my qemu based setup over 75% of the time. The kernel log shows the following when the mount fails: [Aug16 14:41] XFS (pmem0): Mounting V5 Filesystem [ +0.001856] XFS (pmem0): totally zeroed log [ +0.402616] XFS (pmem0): Internal error xlog_clear_stale_blocks(2) at line 1715 of file fs/xfs/xfs_log_recover.c. Caller xlog_find_tail+0x230/0x340 [xfs] [ +0.001741] CPU: 7 PID: 1771 Comm: mount Tainted: G O 5.2.0-rc4+ #112 [ +0.000976] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014 [ +0.001516] Call Trace: [ +0.000351] dump_stack+0x85/0xc0 [ +0.000452] xlog_clear_stale_blocks+0x16d/0x180 [xfs] [ +0.000665] xlog_find_tail+0x230/0x340 [xfs] [ +0.000581] xlog_recover+0x2b/0x160 [xfs] [ +0.000554] xfs_log_mount+0x280/0x2a0 [xfs] [ +0.000561] xfs_mountfs+0x415/0x860 [xfs] [ +0.000533] ? xfs_mru_cache_create+0x18b/0x1f0 [xfs] [ +0.000665] xfs_fs_fill_super+0x4b0/0x700 [xfs] [ +0.000638] ?
xfs_test_remount_options+0x60/0x60 [xfs] [ +0.000710] mount_bdev+0x17f/0x1b0 [ +0.000442] legacy_get_tree+0x30/0x50 [ +0.000467] vfs_get_tree+0x28/0xf0 [ +0.000436] do_mount+0x2d4/0xa00 [ +0.000411] ? memdup_user+0x3e/0x70 [ +0.000455] ksys_mount+0xba/0xd0 [ +0.000420] __x64_sys_mount+0x21/0x30 [ +0.000473] do_syscall_64+0x60/0x240 [ +0.000460] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ +0.000655] RIP: 0033:0x7f730fec91be [ +0.000506] Code: 48 8b 0d cd 1c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 9a 1c 0c 00 f7 d8 64 89 01 48 [ +0.002305] RSP: 002b:00007ffdadbdb178 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [ +0.000922] RAX: ffffffffffffffda RBX: 000055b8f9db8a40 RCX: 00007f730fec91be [ +0.000875] RDX: 000055b8f9dbfdb0 RSI: 000055b8f9dbb930 RDI: 000055b8f9db8c20 [ +0.000917] RBP: 00007f731007f1a4 R08: 0000000000000000 R09: 000055b8f9dc01f0 [ +0.000942] R10: 00000000c0ed0000 R11: 0000000000000246 R12: 0000000000000000 [ +0.000878] R13: 00000000c0ed0000 R14: 000055b8f9db8c20 R15: 000055b8f9dbfdb0 [ +0.000915] XFS (pmem0): failed to locate log tail [ +0.000622] XFS (pmem0): log mount/recovery failed: error -117 [ +0.012560] XFS (pmem0): log mount failed A bisect pointed to this commit: commit 6ad5b3255b9e3d6d94154738aacd5119bf9c8f6e (HEAD -> bisect-bad, refs/bisect/bad) Author: Christoph Hellwig <hch@lst.de> Date: Fri Jun 28 19:27:26 2019 -0700 xfs: use bios directly to read and write the log recovery buffers The xfs_buf structure is basically used as a glorified container for a memory allocation in the log recovery code. Replace it with a call to kmem_alloc_large and a simple abstraction to read into or write from it synchronously using chained bios. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Full bisect log follows at the end. 
I saw [3], but I can still easily hit the failure after manually applying that patch on the above commit. [3]: https://lore.kernel.org/linux-xfs/20190709152352.27465-1-hch@lst.de/ Any thoughts on what might be happening? I'd be happy to test out theories/patches. Thanks, -Vishal ^ permalink raw reply [flat|nested] 31+ messages in thread
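For context on the reproducer above: the memmap=nn!ss boot parameter (documented in the kernel's admin-guide kernel-parameters list) marks the region from ss to ss+nn as simulated persistent memory, which the kernel then surfaces as a legacy pmem namespace. The parameter Vishal quotes therefore carves out two 3G regions:

```
# memmap=<size>!<start>: treat <size> of RAM at <start> as persistent
# memory (e820 type 12); each region typically appears as /dev/pmemN
memmap=3G!6G,3G!9G
```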
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-16 20:59 5.3-rc1 regression with XFS log recovery Verma, Vishal L @ 2019-08-18 7:11 ` hch 2019-08-18 7:41 ` hch 0 siblings, 1 reply; 31+ messages in thread From: hch @ 2019-08-18 7:11 UTC (permalink / raw) To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong On Fri, Aug 16, 2019 at 08:59:44PM +0000, Verma, Vishal L wrote: > Hi all, > > When running the 'ndctl' unit tests against 5.3-rc kernels, I noticed a > frequent failure of the 'mmap.sh' test [1][2]. > > [1]: https://github.com/pmem/ndctl/blob/master/test/mmap.sh > [2]: https://github.com/pmem/ndctl/blob/master/test/mmap.c > > But in trying to pare down the test further, I found that I can simply > reproduce the problem by: > > mkfs.xfs -f /dev/pmem0 > mount /dev/pmem0 /mnt/mem > > Where 'pmem0' is a legacy pmem namespace from reserved memory using the > memmap= command line option. (Specifically, I have this: > memmap=3G!6G,3G!9G ) > > The above mkfs/mount steps don't reproduce the problem a 100% of the > time, but it does happen on my qemu based setup over 75% of the times. > > The kernel log shows the following when the mount fails: Is it always that same message? I'll see if I can reproduce it, but I won't have that much memory to spare to create fake pmem, hope this also works with a single device and/or less memory.. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 7:11 ` hch @ 2019-08-18 7:41 ` hch 2019-08-18 17:34 ` hch 0 siblings, 1 reply; 31+ messages in thread From: hch @ 2019-08-18 7:41 UTC (permalink / raw) To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote: > > The kernel log shows the following when the mount fails: > > Is it always that same message? I'll see if I can reproduce it, > but I won't have that much memory to spare to create fake pmem, > hope this also works with a single device and/or less memory.. I've reproduced a similar ASSERT with a small pmem device, so I hope I can debug the issue locally now. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 7:41 ` hch @ 2019-08-18 17:34 ` hch 2019-08-19 0:08 ` Dave Chinner ` (3 more replies) 0 siblings, 4 replies; 31+ messages in thread From: hch @ 2019-08-18 17:34 UTC (permalink / raw) To: Verma, Vishal L; +Cc: linux-xfs, Williams, Dan J, hch, david, darrick.wong On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote: > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote: > > > The kernel log shows the following when the mount fails: > > > > Is it always that same message? I'll see if I can reproduce it, > > but I won't have that much memory to spare to create fake pmem, > > hope this also works with a single device and/or less memory.. > > I've reproduced a similar ASSERT with a small pmem device, so I hope > I can debug the issue locally now. So I can also reproduce the same issue with the ramdisk driver, but not with any other 4k sector size device (nvmet, scsi target, scsi_debug, loop). Which made me wonder if there is some issue about the memory passed in, and indeed just switching to plain vmalloc vs the XFS kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes the issue for me. I don't really understand why yet, maybe I need to dig out alignment testing patches. 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 13d1d3e95b88..918ad3b884a7 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -125,7 +125,7 @@ xlog_alloc_buffer( if (nbblks > 1 && log->l_sectBBsize > 1) nbblks += log->l_sectBBsize; nbblks = round_up(nbblks, log->l_sectBBsize); - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); + return vmalloc(BBTOB(nbblks)); } /* @@ -416,7 +416,7 @@ xlog_find_verify_cycle( *new_blk = -1; out: - kmem_free(buffer); + vfree(buffer); return error; } @@ -527,7 +527,7 @@ xlog_find_verify_log_record( *last_blk = i; out: - kmem_free(buffer); + vfree(buffer); return error; } @@ -781,7 +781,7 @@ xlog_find_head( goto out_free_buffer; } - kmem_free(buffer); + vfree(buffer); if (head_blk == log_bbnum) *return_head_blk = 0; else @@ -795,7 +795,7 @@ xlog_find_head( return 0; out_free_buffer: - kmem_free(buffer); + vfree(buffer); if (error) xfs_warn(log->l_mp, "failed to find log head"); return error; @@ -1049,7 +1049,7 @@ xlog_verify_tail( "Tail block (0x%llx) overwrite detected. 
Updated to 0x%llx", orig_tail, *tail_blk); out: - kmem_free(buffer); + vfree(buffer); return error; } @@ -1096,7 +1096,7 @@ xlog_verify_head( error = xlog_rseek_logrec_hdr(log, *head_blk, *tail_blk, XLOG_MAX_ICLOGS, tmp_buffer, &tmp_rhead_blk, &tmp_rhead, &tmp_wrapped); - kmem_free(tmp_buffer); + vfree(tmp_buffer); if (error < 0) return error; @@ -1429,7 +1429,7 @@ xlog_find_tail( error = xlog_clear_stale_blocks(log, tail_lsn); done: - kmem_free(buffer); + vfree(buffer); if (error) xfs_warn(log->l_mp, "failed to locate log tail"); @@ -1477,7 +1477,7 @@ xlog_find_zeroed( first_cycle = xlog_get_cycle(offset); if (first_cycle == 0) { /* completely zeroed log */ *blk_no = 0; - kmem_free(buffer); + vfree(buffer); return 1; } @@ -1488,7 +1488,7 @@ xlog_find_zeroed( last_cycle = xlog_get_cycle(offset); if (last_cycle != 0) { /* log completely written to */ - kmem_free(buffer); + vfree(buffer); return 0; } @@ -1535,7 +1535,7 @@ xlog_find_zeroed( *blk_no = last_blk; out_free_buffer: - kmem_free(buffer); + vfree(buffer); if (error) return error; return 1; @@ -1647,7 +1647,7 @@ xlog_write_log_records( } out_free_buffer: - kmem_free(buffer); + vfree(buffer); return error; } @@ -5291,7 +5291,7 @@ xlog_do_recovery_pass( hblks = h_size / XLOG_HEADER_CYCLE_SIZE; if (h_size % XLOG_HEADER_CYCLE_SIZE) hblks++; - kmem_free(hbp); + vfree(hbp); hbp = xlog_alloc_buffer(log, hblks); } else { hblks = 1; @@ -5307,7 +5307,7 @@ xlog_do_recovery_pass( return -ENOMEM; dbp = xlog_alloc_buffer(log, BTOBB(h_size)); if (!dbp) { - kmem_free(hbp); + vfree(hbp); return -ENOMEM; } @@ -5468,9 +5468,9 @@ xlog_do_recovery_pass( } bread_err2: - kmem_free(dbp); + vfree(dbp); bread_err1: - kmem_free(hbp); + vfree(hbp); /* * Submit buffers that have been added from the last record processed, ^ permalink raw reply related [flat|nested] 31+ messages in thread
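One way to see why the patch above could matter (and what the thread later confirms): vmalloc() always returns page-aligned memory, and a page boundary is necessarily a sector boundary, while a kmalloc-backed buffer can start at an arbitrary offset once slab debugging pads allocations. A small userspace sketch of the invariant that sector-based log I/O relies on — the names here are illustrative, not kernel API:

```c
#include <stdint.h>

#define SECTOR_SIZE     512u
#define PAGE_SIZE_LOCAL 4096u

/* A buffer used for sector-granularity block I/O must start on a
 * sector boundary. */
int sector_aligned(uintptr_t addr)
{
    return addr % SECTOR_SIZE == 0;
}

/* Page-aligned memory (what vmalloc() hands back) is always sector
 * aligned, because the page size is a whole multiple of the sector
 * size; an allocation shifted into the page by a slab debug red zone
 * generally is not. */
int page_aligned_implies_sector_aligned(void)
{
    return PAGE_SIZE_LOCAL % SECTOR_SIZE == 0;
}
```

For example, a pointer 680 bytes into a page — the offset a debug patch later in the thread reports — fails the check, since 680 is not a multiple of 512.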
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 17:34 ` hch @ 2019-08-19 0:08 ` Dave Chinner 2019-08-19 3:49 ` hch 2019-08-19 17:19 ` Verma, Vishal L ` (2 subsequent siblings) 3 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-19 0:08 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Sun, Aug 18, 2019 at 07:34:26PM +0200, hch@lst.de wrote: > On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote: > > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote: > > > > The kernel log shows the following when the mount fails: > > > > > > Is it always that same message? I'll see if I can reproduce it, > > > but I won't have that much memory to spare to create fake pmem, > > > hope this also works with a single device and/or less memory.. > > > > I've reproduced a similar ASSERT with a small pmem device, so I hope > > I can debug the issue locally now. > > So I can also reproduce the same issue with the ramdisk driver, but not > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > loop). How did you reproduce it? I tried with identical pmem config to what Vishal was using and couldn't hit it... > Which made me wonder if there is some issue about the memory > passed in, and indeed just switching to plain vmalloc vs the XFS > kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes > the issue for me. I don't really understand why yet, maybe I need to > dig out alignment testing patches. 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 13d1d3e95b88..918ad3b884a7 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -125,7 +125,7 @@ xlog_alloc_buffer( > if (nbblks > 1 && log->l_sectBBsize > 1) > nbblks += log->l_sectBBsize; > nbblks = round_up(nbblks, log->l_sectBBsize); > - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); > + return vmalloc(BBTOB(nbblks)); > } So it fails with contiguous memory being added to the bio (kmalloc returns contiguous memory), but not with individual, non-contiguous pages (what vmalloc returns)? Ok, so here's the trace of all the kmem_alloc_large() calls when mounting the filesystem directly after mkfs: https://lore.kernel.org/linux-xfs/20190722233452.31183-1-david@fromorbit.com/ mount-1818 [001] 691.648570: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648570: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648575: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648575: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648578: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648578: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648581: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648581: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648583: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648583: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648586: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648586: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648588: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648589: kmem_alloc: size 262144
flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.648591: kmem_alloc_large: size 262144 flags 0x8 caller xlog_alloc_log mount-1818 [001] 691.648592: kmem_alloc: size 262144 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.649456: kmem_alloc_large: size 4096 flags 0x8 caller xlog_find_zeroed mount-1818 [001] 691.649456: kmem_alloc: size 4096 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.649496: kmem_alloc_large: size 12288 flags 0x8 caller xlog_find_verify_cycle mount-1818 [001] 691.649496: kmem_alloc: size 12288 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.649502: kmem_alloc_large: size 8192 flags 0x8 caller xlog_find_verify_log_record mount-1818 [001] 691.649502: kmem_alloc: size 8192 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.649506: kmem_alloc_large: size 4096 flags 0x8 caller xlog_find_tail mount-1818 [001] 691.649506: kmem_alloc: size 4096 flags 0x8 caller kmem_alloc_large mount-1818 [001] 691.649516: kmem_alloc_large: size 4198400 flags 0x8 caller xlog_write_log_records mount-1818 [001] 691.649517: kmem_alloc: size 4198400 flags 0x8 caller kmem_alloc_large ^^^^^^^^^^^^ mount-1818 [001] 691.652152: kmem_alloc_large: size 0 flags 0x11 caller xfs_alloc_rsum_cache mount-1818 [001] 691.652152: kmem_alloc: size 0 flags 0x11 caller kmem_alloc_large umount-1826 [000] 691.670308: kmem_alloc_large: size 512 flags 0x5 caller xfs_log_commit_cil umount-1826 [000] 691.670309: kmem_alloc: size 512 flags 0x5 caller kmem_alloc_large Ok, there's a >4MB allocation from: xlog_find_tail xlog_clear_stale_blocks xlog_write_log_records xlog_alloc_buffer kmem_alloc_large() If this memory allocation fails, we make the allocation smaller in xlog_write_log_records() and try again, over and over again until it succeeds or the size isn't large enough to hold a single buffer. The above trace shows the allocation succeeding as a single contiguous buffer (no retries), and the IOs succeed and all is good. 
But Vishal wasn't even getting that far - the "totally zeroed log" comes from xlog_find_head after xlog_find_zeroed() fails. xlog_find_zeroed() does a single sector allocation (4kB) and uses that over and over again for the binary search. Then it calls xlog_find_verify_cycle() which is passed a search range based on the last block it found that was zero and the size of the log write window (2MB). If the last block found is less than the log write window, it trims it back to the last block. We see this in the trace above - a mkfs'd filesystem should have an unmount record in it and nothing else, so at most 2 sectors should be written. We see a 12kB - 3 sector - allocation in the trace above, but it could be up to 2MB in size. Finally, there's a 2 sector allocation by xlog_find_verify_log_record(), which it does to check whether it needs to back up the head to an earlier block in the log write window because the log writes hit the disk out of order. That's another "up to 2MB" allocation, but in the trace above it's only 8kB. The issue, I think, is that only the big allocation in xlog_clear_stale_blocks() has somewhat robust ENOMEM error handling. All of these other allocations will return failure if the buffer allocation fails, and KM_MAYFAIL doesn't try very hard to ensure the allocation will succeed. This change: > - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); > + return vmalloc(BBTOB(nbblks)); demonstrates the difference - it changes the code from an allocation that is allowed to fail to an allocation that will never fail. i.e. KM_MAYFAIL context is also passed to the __vmalloc() call inside kmem_alloc_large() hence it can fail easily, too, whereas vmalloc() tries much, much harder and so is, in comparison, a NOFAIL context (GFP_KERNEL). I'd suggest that the fix is to pass the allocation context into xlog_alloc_buffer(), using KM_MAYFAIL for only the allocations that have ENOMEM fallback handling... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
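The "somewhat robust ENOMEM error handling" Dave points to is a halve-and-retry pattern: if the large allocation fails, shrink the request and try again until a floor is reached. A rough userspace sketch of that shape — illustrative only, not the kernel code:

```c
#include <stdlib.h>

/* Try to allocate 'want' bytes, halving the request on failure, down
 * to a floor of 'min' bytes (a single buffer's worth in the log
 * recovery case).  On success, *got reports the size obtained; only
 * when even 'min' bytes cannot be had does this return NULL. */
void *alloc_with_fallback(size_t want, size_t min, size_t *got)
{
    while (want >= min) {
        void *p = malloc(want); /* stand-in for kmem_alloc_large(KM_MAYFAIL) */
        if (p) {
            *got = want;
            return p;
        }
        want /= 2; /* shrink and retry */
    }
    return NULL;
}
```

Only a caller structured like this can tolerate a KM_MAYFAIL allocation failing; the other xlog_alloc_buffer() callers bail out with an error on the first failure, which is why passing the allocation context down, as suggested above, would let only the fallback-capable caller opt into KM_MAYFAIL.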
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 0:08 ` Dave Chinner @ 2019-08-19 3:49 ` hch 2019-08-19 4:11 ` hch 2019-08-19 4:15 ` Dave Chinner 0 siblings, 2 replies; 31+ messages in thread From: hch @ 2019-08-19 3:49 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong [-- Attachment #1: Type: text/plain, Size: 598 bytes --] On Mon, Aug 19, 2019 at 10:08:31AM +1000, Dave Chinner wrote: > > So I can also reproduce the same issue with the ramdisk driver, but not > > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > > loop). > > How did you reproduce it? I tried with indentical pmem config to > what Vishal was using and couldn't hit it... Just mkfs and mount the pmem device. The fake pmem was created by the following kernel command line snippet: "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached. I suspect you'll need CONFIG_SLUB to create the woes, but let me confirm that.. [-- Attachment #2: .config --] [-- Type: application/x-config, Size: 140796 bytes --] ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 3:49 ` hch @ 2019-08-19 4:11 ` hch 2019-08-19 4:22 ` Dave Chinner 2019-08-19 4:15 ` Dave Chinner 1 sibling, 1 reply; 31+ messages in thread From: hch @ 2019-08-19 4:11 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote: > The fake pmem was created by the following kernel command line snippet: > "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached. I > suspect you'll need CONFIG_SLUB to create the woes, but let me confirm > that.. Yep, doesn't reproduce with CONFIG_SLAB. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 4:11 ` hch @ 2019-08-19 4:22 ` Dave Chinner 2019-08-19 4:29 ` hch 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-19 4:22 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 06:11:32AM +0200, hch@lst.de wrote: > On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote: > > The fake pmem was create by the following kernel command line sniplet: > > "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached. I > > suspect you'll need CONFIG_SLUB to create the woes, but let me confirm > > that.. > > Yep, doesn't reproduce with CONFIG_SLAB. That implies a kmalloc heap issue. Oh, is memory poisoning or something that modifies the alignment of slabs turned on? i.e. 4k/8k allocations from the kmalloc heap slabs might not be appropriately aligned for IO, similar to the problems we have with the xen blk driver? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 4:22 ` Dave Chinner @ 2019-08-19 4:29 ` hch 2019-08-19 4:40 ` hch 0 siblings, 1 reply; 31+ messages in thread From: hch @ 2019-08-19 4:29 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote: > That implies a kmalloc heap issue. > > Oh, is memory poisoning or something that modifies the alignment of > slabs turned on? > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be > appropriately aligned for IO, similar to the problems we have with > the xen blk driver? That is what I suspect, and as you can see in the attached config I usually run with slab debugging on. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 4:29 ` hch @ 2019-08-19 4:40 ` hch 2019-08-19 5:31 ` Dave Chinner 2019-08-20 4:41 ` Dave Chinner 0 siblings, 2 replies; 31+ messages in thread From: hch @ 2019-08-19 4:40 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote: > On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote: > > That implies a kmalloc heap issue. > > > > Oh, is memory poisoning or something that modifies the alignment of > > slabs turned on? > > > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be > > appropriately aligned for IO, similar to the problems we have with > > the xen blk driver? > > That is what I suspect, and as you can see in the attached config I > usually run with slab debuggig on. Yep, looks like an unaligned allocation: root@testvm:~# mount /dev/pmem1 /mnt/ [ 62.346660] XFS (pmem1): Mounting V5 Filesystem [ 62.347960] unaligned allocation, offset = 680 [ 62.349019] unaligned allocation, offset = 680 [ 62.349872] unaligned allocation, offset = 680 [ 62.350703] XFS (pmem1): totally zeroed log [ 62.351443] unaligned allocation, offset = 680 [ 62.452203] unaligned allocation, offset = 344 [ 62.528964] XFS: Assertion failed: head_blk != tail_blk, file: fs/xfs/xfs_lo6 [ 62.529879] ------------[ cut here ]------------ [ 62.530334] kernel BUG at fs/xfs/xfs_message.c:102! [ 62.530824] invalid opcode: 0000 [#1] SMP PTI With the following debug patch. Based on that I think I'll just formally submit the vmalloc switch as we're at -rc5, and then we can restart the unaligned slub allocation drama.. 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 13d1d3e95b88..6a098d35931a 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -97,6 +97,9 @@ xlog_alloc_buffer( struct xlog *log, int nbblks) { + void *ret; + unsigned long offset; + /* * Pass log block 0 since we don't have an addr yet, buffer will be * verified on read. @@ -125,7 +128,14 @@ xlog_alloc_buffer( if (nbblks > 1 && log->l_sectBBsize > 1) nbblks += log->l_sectBBsize; nbblks = round_up(nbblks, log->l_sectBBsize); - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); + ret = kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); + if (!ret) + return NULL; + offset = offset_in_page(ret); + if (offset % 512) + printk("unaligned allocation, offset = %lu\n", + offset); + return ret; } /* ^ permalink raw reply related [flat|nested] 31+ messages in thread
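The check in the debug patch above boils down to: take the allocation's offset within its page (offset_in_page() in the kernel) and test whether it is a multiple of 512. A userspace rendering of the same test — the helper names here are illustrative, and the constants assume 4k pages:

```c
#include <stdint.h>

/* Offset of an address within its 4k page, as offset_in_page() would
 * report it. */
unsigned long offset_in_page_4k(uintptr_t addr)
{
    return addr & 4095u;
}

/* Non-zero when the buffer does not start on a 512-byte sector
 * boundary -- the condition that makes the debug patch print
 * "unaligned allocation". */
unsigned long sector_misalignment(uintptr_t addr)
{
    return offset_in_page_4k(addr) % 512;
}
```

The offsets in the log above check out: 680 % 512 = 168 and 344 % 512 = 344, so both allocations trip the warning.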
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 4:40 ` hch @ 2019-08-19 5:31 ` Dave Chinner 2019-08-20 6:14 ` hch 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-19 5:31 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 06:40:12AM +0200, hch@lst.de wrote: > On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote: > > On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote: > > > That implies a kmalloc heap issue. > > > > > > Oh, is memory poisoning or something that modifies the alignment of > > > slabs turned on? > > > > > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be > > > appropriately aligned for IO, similar to the problems we have with > > > the xen blk driver? > > > > That is what I suspect, and as you can see in the attached config I > > usually run with slab debuggig on. > > Yep, looks like an unaligned allocation: *nod* > With the following debug patch. Based on that I think I'll just > formally submit the vmalloc switch as we're at -rc5, and then we > can restart the unaligned slub allocation drama.. But I'm not so sure that's all there is to it. I turned KASAN on and it works for the first few mkfs/mount cycles, then a mount basically pegs a CPU and it's spending most of its time inside KASAN accounting code like this: ..... 
save_stack+0x19/0x80 __kasan_kmalloc.constprop.10+0xc1/0xd0 kmem_cache_alloc+0xd2/0x220 mempool_alloc+0xda/0x230 bio_alloc_bioset+0x12d/0x2d0 xfs_rw_bdev+0x53/0x290 xlog_do_io+0xd1/0x1c0 xlog_bread+0x23/0x70 xlog_rseek_logrec_hdr+0x207/0x2a0 After a couple of minutes, the mount fails an ASSERT: XFS: Assertion failed: head_blk != tail_blk, file: fs/xfs/xfs_log_recover.c, line: 5236 And moments after KASAN spews this: BUG: KASAN: use-after-free in rwsem_down_read_slowpath+0x685/0x8f0 Read of size 4 at addr ffff88806f152778 by task systemd-udevd/2231 CPU: 4 PID: 2231 Comm: systemd-udevd Tainted: G D 5.3.0-rc5-dgc+ #1506 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0x7c/0xc0 print_address_description+0x6c/0x322 ? rwsem_down_read_slowpath+0x685/0x8f0 __kasan_report.cold.6+0x1c/0x3e ? rwsem_down_read_slowpath+0x685/0x8f0 ? rwsem_down_read_slowpath+0x685/0x8f0 kasan_report+0xe/0x12 rwsem_down_read_slowpath+0x685/0x8f0 ? unwind_get_return_address_ptr+0x50/0x50 ? unwind_next_frame+0x6d6/0x8a0 ? __down_timeout+0x1c0/0x1c0 ? unwind_next_frame+0x6d6/0x8a0 ? _raw_spin_lock+0x87/0xe0 ? _raw_spin_lock+0x87/0xe0 ? __cpuidle_text_end+0x5/0x5 ? set_init_blocksize+0xe0/0xe0 ? preempt_count_sub+0x43/0x50 ? __might_sleep+0x31/0xd0 ? set_init_blocksize+0xe0/0xe0 ? ___might_sleep+0xc8/0xe0 down_read+0x18d/0x1a0 ? refcount_sub_and_test_checked+0xaf/0x150 ? rwsem_down_read_slowpath+0x8f0/0x8f0 ? _raw_spin_lock+0x87/0xe0 __get_super.part.12+0xf8/0x130 fsync_bdev+0xf/0x60 invalidate_partition+0x1e/0x40 rescan_partitions+0x8a/0x420 blkdev_reread_part+0x1e/0x30 blkdev_ioctl+0xb0b/0xe60 ? __blkdev_driver_ioctl+0x80/0x80 ? __bpf_prog_run64+0xc0/0xc0 ? stack_trace_save+0x8a/0xb0 ? save_stack+0x4d/0x80 ? __seccomp_filter+0x133/0xa10 ? preempt_count_sub+0x43/0x50 block_ioctl+0x6d/0x80 do_vfs_ioctl+0x134/0x9c0 ? ioctl_preallocate+0x140/0x140 ? selinux_file_ioctl+0x2b7/0x360 ? selinux_capable+0x20/0x20 ? 
syscall_trace_enter+0x233/0x540 ksys_ioctl+0x60/0x90 __x64_sys_ioctl+0x3d/0x50 do_syscall_64+0x70/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fade328a427 Code: 00 00 90 48 8b 05 69 aa 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 39 aa 0c 00 8 RSP: 002b:00007ffdc4755928 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 000000000000000e RCX: 00007fade328a427 RDX: 0000000000000000 RSI: 000000000000125f RDI: 000000000000000e RBP: 0000000000000000 R08: 0000559597306140 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 000055959736dbc0 R13: 0000000000000000 R14: 00007ffdc47569c8 R15: 000055959730dac0 Allocated by task 4739: save_stack+0x19/0x80 __kasan_kmalloc.constprop.10+0xc1/0xd0 kmem_cache_alloc_node+0xf3/0x240 copy_process+0x1f91/0x2f20 _do_fork+0xe0/0x530 __x64_sys_clone+0x10e/0x160 do_syscall_64+0x70/0x230 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 0: save_stack+0x19/0x80 __kasan_slab_free+0x12e/0x180 kmem_cache_free+0x84/0x2c0 rcu_core+0x35f/0x810 __do_softirq+0x15f/0x476 The buggy address belongs to the object at ffff888237048000 which belongs to the cache task_struct of size 9792 The buggy address is located 56 bytes inside of 9792-byte region [ffff888237048000, ffff88823704a640) The buggy address belongs to the page: page:ffffea0008dc1200 refcount:1 mapcount:0 mapping:ffff888078a91800 index:0x0 compound_mapcount: 0 flags: 0x57ffffc0010200(slab|head) raw: 0057ffffc0010200 dead000000000100 dead000000000122 ffff888078a91800 raw: 0000000000000000 0000000000030003 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888237047f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888237047f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888237048000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888237048080: fb fb 
fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888237048100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Oh, hell, it's an rwsem that is referencing a free task. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 5:31 ` Dave Chinner @ 2019-08-20 6:14 ` hch 0 siblings, 0 replies; 31+ messages in thread From: hch @ 2019-08-20 6:14 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 03:31:22PM +1000, Dave Chinner wrote: > But I'm not so sure that's all there is to it. I turned KASAN on and > it works for the first few mkfs/mount cycles, then a mount basically > pegs a CPU and it's spending most of it's time inside KASAN > accouting code like this: Did that KASAN splat lead anywhere? I've done some KASAN runs now and I've not seen anything yet after running mkfs/mount/umount in a loop. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 4:40 ` hch 2019-08-19 5:31 ` Dave Chinner @ 2019-08-20 4:41 ` Dave Chinner 2019-08-20 5:53 ` hch 1 sibling, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-20 4:41 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 06:40:12AM +0200, hch@lst.de wrote: > On Mon, Aug 19, 2019 at 06:29:05AM +0200, hch@lst.de wrote: > > On Mon, Aug 19, 2019 at 02:22:59PM +1000, Dave Chinner wrote: > > > That implies a kmalloc heap issue. > > > > > > Oh, is memory poisoning or something that modifies the alignment of > > > slabs turned on? > > > > > > i.e. 4k/8k allocations from the kmalloc heap slabs might not be > > > appropriately aligned for IO, similar to the problems we have with > > > the xen blk driver? > > > > That is what I suspect, and as you can see in the attached config I > > usually run with slab debuggig on. > > Yep, looks like an unaligned allocation: > > root@testvm:~# mount /dev/pmem1 /mnt/ > [ 62.346660] XFS (pmem1): Mounting V5 Filesystem > [ 62.347960] unaligned allocation, offset = 680 > [ 62.349019] unaligned allocation, offset = 680 > [ 62.349872] unaligned allocation, offset = 680 > [ 62.350703] XFS (pmem1): totally zeroed log > [ 62.351443] unaligned allocation, offset = 680 > [ 62.452203] unaligned allocation, offset = 344 > [ 62.528964] XFS: Assertion failed: head_blk != tail_blk, file: > fs/xfs/xfs_lo6 > [ 62.529879] ------------[ cut here ]------------ > [ 62.530334] kernel BUG at fs/xfs/xfs_message.c:102! > [ 62.530824] invalid opcode: 0000 [#1] SMP PTI > > With the following debug patch. Based on that I think I'll just > formally submit the vmalloc switch as we're at -rc5, and then we > can restart the unaligned slub allocation drama.. 
This still doesn't make sense to me, because the pmem and brd code have no alignment limitations in their make_request code - they can handle byte addressing and should not have any problem at all with 8 byte aligned memory in bios. Digging a little further, I note that both brd and pmem use identical mechanisms to marshall data in and out of bios, so they are likely to have the same issue. So, brd_make_request() does: bio_for_each_segment(bvec, bio, iter) { unsigned int len = bvec.bv_len; int err; err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, bio_op(bio), sector); if (err) goto io_error; sector += len >> SECTOR_SHIFT; } So, the code behind bio_for_each_segment() splits multi-page bvecs into individual pages, which are passed to brd_do_bvec(). An unaligned 4kB io traces out as: [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e i.e. page offset len sector 00000000a77f0146 768 3328 0x7d0048 000000006ceca91e 0 768 0x7d004e You should be able to guess what the problems are from this. Both pmem and brd are _sector_ based. We've done a partial sector copy on the first bvec, then the second bvec has started the copy from the wrong offset into the sector we've done a partial copy from. IOWs, no error is reported when the bvec buffer isn't sector aligned, no error is reported when the length of data to copy was not a multiple of sector size, and no error was reported when we copied the same partial sector twice. There's nothing quite like being repeatedly bitten by the same misalignment bug because there's no validation in the infrastructure that could catch it immediately and throw a useful warning/error message. Cheers, Dave. -- Dave Chinner david@fromorbit.com
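The trace above can be reproduced with a few lines of standalone C modelling how bio_for_each_segment() splits a multi-page bvec at page boundaries and how brd_make_request() then advances the sector counter. This is only an illustrative model with invented names, not kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE    4096u
#define MODEL_SECTOR_SHIFT 9

struct seg { unsigned int off, len; };

/* Split a buffer starting at page offset 'off' with total length 'len'
 * into per-page segments, the way bio_for_each_segment() walks a
 * multi-page bvec. Returns the number of segments produced. */
static size_t split_by_page(unsigned int off, unsigned int len,
                            struct seg *out, size_t max)
{
    size_t n = 0;
    while (len && n < max) {
        unsigned int chunk = MODEL_PAGE_SIZE - off;
        if (chunk > len)
            chunk = len;
        out[n].off = off;
        out[n].len = chunk;
        n++;
        len -= chunk;
        off = 0;        /* subsequent pages start at offset 0 */
    }
    return n;
}

/* Sector advance as done by brd_make_request():
 * sector += len >> SECTOR_SHIFT -- the sub-sector remainder is lost. */
static unsigned int sectors_advanced(unsigned int len)
{
    return len >> MODEL_SECTOR_SHIFT;
}
```

For the 4kB buffer at page offset 768, the first segment is 3328 bytes but advances the sector counter by only 6 sectors (3072 bytes), so the second segment restarts at the beginning of the next sector and the trailing 256 bytes of the first segment get overwritten from the wrong source offset.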
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 4:41 ` Dave Chinner @ 2019-08-20 5:53 ` hch 2019-08-20 7:44 ` Dave Chinner 2019-08-20 8:13 ` Ming Lei 0 siblings, 2 replies; 31+ messages in thread From: hch @ 2019-08-20 5:53 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong, Ming Lei, linux-block On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > With the following debug patch. Based on that I think I'll just > > formally submit the vmalloc switch as we're at -rc5, and then we > > can restart the unaligned slub allocation drama.. > > This still doesn't make sense to me, because the pmem and brd code > have no aligment limitations in their make_request code - they can > handle byte adressing and should not have any problem at all with > 8 byte aligned memory in bios. > > Digging a little furhter, I note that both brd and pmem use > identical mechanisms to marshall data in and out of bios, so they > are likely to have the same issue. > > So, brd_make_request() does: > > bio_for_each_segment(bvec, bio, iter) { > unsigned int len = bvec.bv_len; > int err; > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > bio_op(bio), sector); > if (err) > goto io_error; > sector += len >> SECTOR_SHIFT; > } > > So, the code behind bio_for_each_segment() splits multi-page bvecs > into individual pages, which are passed to brd_do_bvec(). An > unaligned 4kB io traces out as: > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > i.e. page offset len sector > 00000000a77f0146 768 3328 0x7d0048 > 000000006ceca91e 0 768 0x7d004e > > You should be able to guess what the problems are from this. > > Both pmem and brd are _sector_ based. We've done a partial sector > copy on the first bvec, then the second bvec has started the copy > from the wrong offset into the sector we've done a partial copy > from. 
> > IOWs, no error is reported when the bvec buffer isn't sector > aligned, no error is reported when the length of data to copy was > not a multiple of sector size, and no error was reported when we > copied the same partial sector twice. Yes. I think bio_for_each_segment is buggy here, as it should not blindly split by pages. Ccing Ming as he wrote much of this code. I'll also dig into fixing it, but I just arrived in Japan and might be a little jetlagged. > There's nothing quite like being repeatedly bitten by the same > misalignment bug because there's no validation in the infrastructure > that could catch it immediately and throw a useful warning/error > message. The xen block driver doesn't use bio_for_each_segment, so it isn't exactly the same but a very related issue. Maybe until we sort all this mess out we just need to depend on !SLUB_DEBUG for XFS?
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 5:53 ` hch @ 2019-08-20 7:44 ` Dave Chinner 2019-08-20 8:13 ` Ming Lei 1 sibling, 0 replies; 31+ messages in thread From: Dave Chinner @ 2019-08-20 7:44 UTC (permalink / raw) To: hch Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong, Ming Lei, linux-block On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote: > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > > With the following debug patch. Based on that I think I'll just > > > formally submit the vmalloc switch as we're at -rc5, and then we > > > can restart the unaligned slub allocation drama.. > > > > This still doesn't make sense to me, because the pmem and brd code > > have no aligment limitations in their make_request code - they can > > handle byte adressing and should not have any problem at all with > > 8 byte aligned memory in bios. > > > > Digging a little furhter, I note that both brd and pmem use > > identical mechanisms to marshall data in and out of bios, so they > > are likely to have the same issue. > > > > So, brd_make_request() does: > > > > bio_for_each_segment(bvec, bio, iter) { > > unsigned int len = bvec.bv_len; > > int err; > > > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > > bio_op(bio), sector); > > if (err) > > goto io_error; > > sector += len >> SECTOR_SHIFT; > > } > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs > > into individual pages, which are passed to brd_do_bvec(). An > > unaligned 4kB io traces out as: > > > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > > > i.e. page offset len sector > > 00000000a77f0146 768 3328 0x7d0048 > > 000000006ceca91e 0 768 0x7d004e > > > > You should be able to guess what the problems are from this. > > > > Both pmem and brd are _sector_ based. 
We've done a partial sector > > copy on the first bvec, then the second bvec has started the copy > > from the wrong offset into the sector we've done a partial copy > > from. > > > > IOWs, no error is reported when the bvec buffer isn't sector > > aligned, no error is reported when the length of data to copy was > > not a multiple of sector size, and no error was reported when we > > copied the same partial sector twice. > > Yes. I think bio_for_each_segment is buggy here, as it should not > blindly split by pages. Ccing Ming as he wrote much of this code. I'll > also dig into fixing it, but I just arrived in Japan and might be a > little jetlagged. > > > There's nothing quite like being repeatedly bitten by the same > > misalignment bug because there's no validation in the infrastructure > > that could catch it immediately and throw a useful warning/error > > message. > > The xen block driver doesn't use bio_for_each_segment, so it isn't > exactly the same but a very related issue. Both stem from the fact that nothing in the block layer validates memory alignment constraints. Xenblk, pmem and brd all return 511 to queue_dma_alignment(), and all break when passed memory that isn't aligned to 512 bytes. There aren't even debug options we can turn on that would tell us this is happening. Instead, we start with data corruption and have to work backwards to find the needle in the haystack from there. EIO and a WARN_ONCE would be a massive improvement here.... > Maybe until we sort > all this mess out we just need to depend on !SLUB_DEBUG for XFS? SLUB_DEBUG=y by itself doesn't cause problems - I run that all the time because otherwise there's no /proc/slabinfo with slub. I used KASAN to get the above alignment behaviour - it's SLUB_DEBUG_ON=y that perturbs the alignment, and I think the same thing can happen with SLAB_DEBUG=y, so there's several dependencies we'd have to add here.
Is there any way we can pass kmalloc a "use aligned heap" GFP flag so that it allocates from the -rcl slabs to guarantee alignment rather than the standard slabs that change alignment with config options? Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 5:53 ` hch 2019-08-20 7:44 ` Dave Chinner @ 2019-08-20 8:13 ` Ming Lei 2019-08-20 9:24 ` Ming Lei 1 sibling, 1 reply; 31+ messages in thread From: Ming Lei @ 2019-08-20 8:13 UTC (permalink / raw) To: hch Cc: Dave Chinner, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong, linux-block On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote: > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > > With the following debug patch. Based on that I think I'll just > > > formally submit the vmalloc switch as we're at -rc5, and then we > > > can restart the unaligned slub allocation drama.. > > > > This still doesn't make sense to me, because the pmem and brd code > > have no aligment limitations in their make_request code - they can > > handle byte adressing and should not have any problem at all with > > 8 byte aligned memory in bios. > > > > Digging a little furhter, I note that both brd and pmem use > > identical mechanisms to marshall data in and out of bios, so they > > are likely to have the same issue. > > > > So, brd_make_request() does: > > > > bio_for_each_segment(bvec, bio, iter) { > > unsigned int len = bvec.bv_len; > > int err; > > > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > > bio_op(bio), sector); > > if (err) > > goto io_error; > > sector += len >> SECTOR_SHIFT; > > } > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs > > into individual pages, which are passed to brd_do_bvec(). An > > unaligned 4kB io traces out as: > > > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > > > i.e. page offset len sector > > 00000000a77f0146 768 3328 0x7d0048 > > 000000006ceca91e 0 768 0x7d004e > > > > You should be able to guess what the problems are from this. The problem should be that offset of '768' is passed to bio_add_page(). 
It seems a SLUB buffer is being used for block IO here; this looks like an old, unsolved problem. > > > > Both pmem and brd are _sector_ based. We've done a partial sector > > copy on the first bvec, then the second bvec has started the copy > > from the wrong offset into the sector we've done a partial copy > > from. > > > > IOWs, no error is reported when the bvec buffer isn't sector > > aligned, no error is reported when the length of data to copy was > > not a multiple of sector size, and no error was reported when we > > copied the same partial sector twice. > > Yes. I think bio_for_each_segment is buggy here, as it should not > blindly split by pages. bio_for_each_segment() just keeps the original interface behaviour from before multi-page bvecs were introduced. Thanks, Ming
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 8:13 ` Ming Lei @ 2019-08-20 9:24 ` Ming Lei 2019-08-20 16:30 ` Verma, Vishal L 2019-08-20 21:44 ` Dave Chinner 0 siblings, 2 replies; 31+ messages in thread From: Ming Lei @ 2019-08-20 9:24 UTC (permalink / raw) To: hch, Verma, Vishal L Cc: Dave Chinner, linux-xfs, Williams, Dan J, darrick.wong, linux-block On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote: > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote: > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > > > With the following debug patch. Based on that I think I'll just > > > > formally submit the vmalloc switch as we're at -rc5, and then we > > > > can restart the unaligned slub allocation drama.. > > > > > > This still doesn't make sense to me, because the pmem and brd code > > > have no aligment limitations in their make_request code - they can > > > handle byte adressing and should not have any problem at all with > > > 8 byte aligned memory in bios. > > > > > > Digging a little furhter, I note that both brd and pmem use > > > identical mechanisms to marshall data in and out of bios, so they > > > are likely to have the same issue. > > > > > > So, brd_make_request() does: > > > > > > bio_for_each_segment(bvec, bio, iter) { > > > unsigned int len = bvec.bv_len; > > > int err; > > > > > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > > > bio_op(bio), sector); > > > if (err) > > > goto io_error; > > > sector += len >> SECTOR_SHIFT; > > > } > > > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs > > > into individual pages, which are passed to brd_do_bvec(). An > > > unaligned 4kB io traces out as: > > > > > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > > > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > > > > > i.e. 
page offset len sector > > > 00000000a77f0146 768 3328 0x7d0048 > > > 000000006ceca91e 0 768 0x7d004e > > > > > > You should be able to guess what the problems are from this. > > The problem should be that offset of '768' is passed to bio_add_page(). It can be quite hard to deal with a non-512-aligned sector buffer, since one sector buffer may cross two pages. So far, one workaround I have thought of is to not merge such an IO buffer into one bvec. Verma, could you try the following patch? diff --git a/block/bio.c b/block/bio.c index 24a496f5d2e2..49deab2ac8c4 100644 --- a/block/bio.c +++ b/block/bio.c @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) return false; + if (off & 511) + return false; + if (bio->bi_vcnt > 0) { struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; Thanks, Ming
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 9:24 ` Ming Lei @ 2019-08-20 16:30 ` Verma, Vishal L 2019-08-20 21:44 ` Dave Chinner 1 sibling, 0 replies; 31+ messages in thread From: Verma, Vishal L @ 2019-08-20 16:30 UTC (permalink / raw) To: hch, ming.lei Cc: Williams, Dan J, linux-xfs, david, darrick.wong, linux-block On Tue, 2019-08-20 at 17:24 +0800, Ming Lei wrote: > > It can be quite hard to deal with non-512 aligned sector buffer, since > one sector buffer may cross two pages, so far one workaround I thought > of is to not merge such IO buffer into one bvec. > > Verma, could you try the following patch? Hi Ming, I can hit the same failure with this patch. Full thread, in case you haven't already seen it: https://lore.kernel.org/linux-xfs/e49a6a3a244db055995769eb844c281f93e50ab9.camel@intel.com/ > > diff --git a/block/bio.c b/block/bio.c > index 24a496f5d2e2..49deab2ac8c4 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct > page *page, > if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) > return false; > > + if (off & 511) > + return false; > + > if (bio->bi_vcnt > 0) { > struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1]; > > > Thanks, > Ming
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 9:24 ` Ming Lei 2019-08-20 16:30 ` Verma, Vishal L @ 2019-08-20 21:44 ` Dave Chinner 2019-08-20 22:08 ` Verma, Vishal L 2019-08-21 1:56 ` Ming Lei 1 sibling, 2 replies; 31+ messages in thread From: Dave Chinner @ 2019-08-20 21:44 UTC (permalink / raw) To: Ming Lei Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong, linux-block On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote: > On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote: > > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote: > > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > > > > With the following debug patch. Based on that I think I'll just > > > > > formally submit the vmalloc switch as we're at -rc5, and then we > > > > > can restart the unaligned slub allocation drama.. > > > > > > > > This still doesn't make sense to me, because the pmem and brd code > > > > have no aligment limitations in their make_request code - they can > > > > handle byte adressing and should not have any problem at all with > > > > 8 byte aligned memory in bios. > > > > > > > > Digging a little furhter, I note that both brd and pmem use > > > > identical mechanisms to marshall data in and out of bios, so they > > > > are likely to have the same issue. > > > > > > > > So, brd_make_request() does: > > > > > > > > bio_for_each_segment(bvec, bio, iter) { > > > > unsigned int len = bvec.bv_len; > > > > int err; > > > > > > > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > > > > bio_op(bio), sector); > > > > if (err) > > > > goto io_error; > > > > sector += len >> SECTOR_SHIFT; > > > > } > > > > > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs > > > > into individual pages, which are passed to brd_do_bvec(). 
An > > > > unaligned 4kB io traces out as: > > > > > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > > > > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > > > > > > > i.e. page offset len sector > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > 000000006ceca91e 0 768 0x7d004e > > > > > > > > You should be able to guess what the problems are from this. > > > > The problem should be that offset of '768' is passed to bio_add_page(). > > It can be quite hard to deal with non-512 aligned sector buffer, since > one sector buffer may cross two pages, so far one workaround I thought > of is to not merge such IO buffer into one bvec. > > Verma, could you try the following patch? > > diff --git a/block/bio.c b/block/bio.c > index 24a496f5d2e2..49deab2ac8c4 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, > if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) > return false; > > + if (off & 511) > + return false; What does this achieve? It only prevents the unaligned segment from being merged, it doesn't prevent it from being added to a new bvec. However, the case here is that: > > > > i.e. page offset len sector > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > 000000006ceca91e 0 768 0x7d004e The second page added to the bvec is actually offset aligned. Hence the check would do nothing on the first page because the bvec array is empty (so it goes into a new bvec anyway), and the check on the second page would do nothing and it would merge with the first because the offset is aligned correctly. In both cases, the length of the segment is not aligned, so that needs to be checked, too. IOWs, I think the check needs to be in bio_add_page, it needs to check both the offset and length for alignment, and it needs to grab the alignment from queue_dma_alignment(), not use a hard coded value of 511. Cheers, Dave.
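The objection can be checked numerically against the traced segments: the offset-only test from the proposed patch passes the second segment (offset 0), yet both segments fail once length is checked as well. A standalone sketch, with invented helper names and 511 standing in for queue_dma_alignment():

```c
#include <assert.h>
#include <stdbool.h>

/* The proposed merge-time check: reject only when the page offset is
 * not sector aligned. */
static bool merge_check_offset_only(unsigned int off)
{
    return (off & 511) == 0;
}

/* The check argued for above: both offset and length must honour the
 * queue's DMA alignment mask (511 for a 512-byte-sector device). */
static bool seg_aligned(unsigned int off, unsigned int len,
                        unsigned int dma_align_mask)
{
    return ((off | len) & dma_align_mask) == 0;
}
```

Running it over the traced segments (768, 3328) and (0, 768) shows the second segment slipping past the offset-only check while both fail the full check.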
-- Dave Chinner david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 21:44 ` Dave Chinner @ 2019-08-20 22:08 ` Verma, Vishal L 2019-08-20 23:53 ` Dave Chinner 2019-08-21 2:19 ` Ming Lei 2019-08-21 1:56 ` Ming Lei 1 sibling, 2 replies; 31+ messages in thread From: Verma, Vishal L @ 2019-08-20 22:08 UTC (permalink / raw) To: david, ming.lei Cc: hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote: > > However, the case here is that: > > > > > > i.e. page offset len sector > > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > > 000000006ceca91e 0 768 0x7d004e > > The second page added to the bvec is actually offset alignedr. Hence > the check would do nothing on the first page because the bvec array > is empty (so goes into a new bvec anyway), and the check on the > second page would do nothing an it would merge with first because > the offset is aligned correctly. In both cases, the length of the > segment is not aligned, so that needs to be checked, too. > > IOWs, I think the check needs to be in bio_add_page, it needs to > check both the offset and length for alignment, and it needs to grab > the alignment from queue_dma_alignment(), not use a hard coded value > of 511. > So something like this? diff --git a/block/bio.c b/block/bio.c index 299a0e7651ec..80f449d23e5a 100644 --- a/block/bio.c +++ b/block/bio.c @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page); int bio_add_page(struct bio *bio, struct page *page, unsigned int len, unsigned int offset) { + struct request_queue *q = bio->bi_disk->queue; bool same_page = false; + if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q)) + return 0; + if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) { if (bio_full(bio, len)) return 0; I tried this, but the 'mount' just hangs - which looks like it might be due to xfs_rw_bdev() doing: while (bio_add_page(bio, page, len, off) != len) { ... 
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 22:08 ` Verma, Vishal L @ 2019-08-20 23:53 ` Dave Chinner 2019-08-21 2:19 ` Ming Lei 1 sibling, 0 replies; 31+ messages in thread From: Dave Chinner @ 2019-08-20 23:53 UTC (permalink / raw) To: Verma, Vishal L Cc: ming.lei, hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote: > On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote: > > > > However, the case here is that: > > > > > > > > i.e. page offset len sector > > > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > > > 000000006ceca91e 0 768 0x7d004e > > > > The second page added to the bvec is actually offset alignedr. Hence > > the check would do nothing on the first page because the bvec array > > is empty (so goes into a new bvec anyway), and the check on the > > second page would do nothing an it would merge with first because > > the offset is aligned correctly. In both cases, the length of the > > segment is not aligned, so that needs to be checked, too. > > > > IOWs, I think the check needs to be in bio_add_page, it needs to > > check both the offset and length for alignment, and it needs to grab > > the alignment from queue_dma_alignment(), not use a hard coded value > > of 511. > > > So something like this? 
> > diff --git a/block/bio.c b/block/bio.c > index 299a0e7651ec..80f449d23e5a 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page); > int bio_add_page(struct bio *bio, struct page *page, > unsigned int len, unsigned int offset) > { > + struct request_queue *q = bio->bi_disk->queue; > bool same_page = false; > > + if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q)) > + return 0; > + > if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) { > if (bio_full(bio, len)) > return 0; > > I tried this, but the 'mount' just hangs - which looks like it might be > due to xfs_rw_bdev() doing: > > while (bio_add_page(bio, page, len, off) != len) { That's the return of zero that causes the loop to make no progress. i.e. a return of 0 means "won't fit in bio, allocate a new bio and try again". It's not an error return, so always returning zero will eventually chew up all your memory allocating bios it doesn't use, because submit_bio() doesn't return errors on chained bios until the final bio in the chain is completed. Add a bio_add_page_checked() function that does exactly the same thing as bio_add_page(), but add the if (WARN_ON_ONCE((offset | len) & queue_dma_alignment(q))) return -EIO; to it and change the xfs code to: while ((ret = bio_add_page_checked(bio, page, len, off)) != len) { if (ret < 0) { /* * submit the bio to wait on the rest of the * chain to complete, then return an error. * This is a really shitty failure on write, as we * will have just done a partial write and * effectively corrupted something on disk. */ submit_bio_wait(bio); return ret; } .... } We probably should change all the XFS calls to bio_add_page to bio_add_page_checked() while we are at it, because we have the same alignment problem through xfs_buf.c and, potentially, on iclogs via xfs_log.c as iclogs are allocated with kmem_alloc_large(), too. Cheers, Dave.
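The two return conventions are the crux here: bio_add_page() returning 0 means "retry with a new bio", so a permanently rejected segment loops forever, whereas a negative return lets the caller fail the I/O cleanly. A toy model of the proposed bio_add_page_checked() semantics (hypothetical names; not the real kernel API):

```c
#include <assert.h>
#include <errno.h>

/* Toy model of the proposed bio_add_page_checked(): reject a misaligned
 * segment with -EIO instead of returning 0, which the caller would read
 * as "bio full, allocate a new bio and retry" and therefore spin
 * forever. Names and the hardcoded mask are illustrative only. */
static int bio_add_page_checked_model(unsigned int len, unsigned int off,
                                      unsigned int dma_align_mask)
{
    if ((off | len) & dma_align_mask)
        return -EIO;        /* caller can now abort the I/O */
    return (int)len;        /* "all of len was added" */
}
```

With this convention the traced segments produce an immediate error instead of an endless retry loop.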
-- Dave Chinner david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 22:08 ` Verma, Vishal L 2019-08-20 23:53 ` Dave Chinner @ 2019-08-21 2:19 ` Ming Lei 1 sibling, 0 replies; 31+ messages in thread From: Ming Lei @ 2019-08-21 2:19 UTC (permalink / raw) To: Verma, Vishal L Cc: david, hch, linux-xfs, Williams, Dan J, linux-block, darrick.wong On Tue, Aug 20, 2019 at 10:08:38PM +0000, Verma, Vishal L wrote: > On Wed, 2019-08-21 at 07:44 +1000, Dave Chinner wrote: > > > > However, the case here is that: > > > > > > > > i.e. page offset len sector > > > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > > > 000000006ceca91e 0 768 0x7d004e > > > > The second page added to the bvec is actually offset alignedr. Hence > > the check would do nothing on the first page because the bvec array > > is empty (so goes into a new bvec anyway), and the check on the > > second page would do nothing an it would merge with first because > > the offset is aligned correctly. In both cases, the length of the > > segment is not aligned, so that needs to be checked, too. > > > > IOWs, I think the check needs to be in bio_add_page, it needs to > > check both the offset and length for alignment, and it needs to grab > > the alignment from queue_dma_alignment(), not use a hard coded value > > of 511. > > > So something like this? > > diff --git a/block/bio.c b/block/bio.c > index 299a0e7651ec..80f449d23e5a 100644 > --- a/block/bio.c > +++ b/block/bio.c > @@ -822,8 +822,12 @@ EXPORT_SYMBOL_GPL(__bio_add_page); > int bio_add_page(struct bio *bio, struct page *page, > unsigned int len, unsigned int offset) > { > + struct request_queue *q = bio->bi_disk->queue; > bool same_page = false; > > + if (offset & queue_dma_alignment(q) || len & queue_dma_alignment(q)) > + return 0; bio->bi_disk or bio->bi_disk->queue may not be available in bio_add_page(). And 'len' has to be aligned with logical block size of the disk on which fs is mounted, which info is obviously available for fs. 
So the following code is really buggy: xfs_rw_bdev(): + struct page *page = kmem_to_page(data); + unsigned int off = offset_in_page(data); + unsigned int len = min_t(unsigned, left, PAGE_SIZE - off); + + while (bio_add_page(bio, page, len, off) != len) { Thanks, Ming
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-20 21:44 ` Dave Chinner 2019-08-20 22:08 ` Verma, Vishal L @ 2019-08-21 1:56 ` Ming Lei 1 sibling, 0 replies; 31+ messages in thread From: Ming Lei @ 2019-08-21 1:56 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong, linux-block On Wed, Aug 21, 2019 at 07:44:09AM +1000, Dave Chinner wrote: > On Tue, Aug 20, 2019 at 05:24:25PM +0800, Ming Lei wrote: > > On Tue, Aug 20, 2019 at 04:13:26PM +0800, Ming Lei wrote: > > > On Tue, Aug 20, 2019 at 07:53:20AM +0200, hch@lst.de wrote: > > > > On Tue, Aug 20, 2019 at 02:41:35PM +1000, Dave Chinner wrote: > > > > > > With the following debug patch. Based on that I think I'll just > > > > > > formally submit the vmalloc switch as we're at -rc5, and then we > > > > > > can restart the unaligned slub allocation drama.. > > > > > > > > > > This still doesn't make sense to me, because the pmem and brd code > > > > > have no aligment limitations in their make_request code - they can > > > > > handle byte adressing and should not have any problem at all with > > > > > 8 byte aligned memory in bios. > > > > > > > > > > Digging a little furhter, I note that both brd and pmem use > > > > > identical mechanisms to marshall data in and out of bios, so they > > > > > are likely to have the same issue. > > > > > > > > > > So, brd_make_request() does: > > > > > > > > > > bio_for_each_segment(bvec, bio, iter) { > > > > > unsigned int len = bvec.bv_len; > > > > > int err; > > > > > > > > > > err = brd_do_bvec(brd, bvec.bv_page, len, bvec.bv_offset, > > > > > bio_op(bio), sector); > > > > > if (err) > > > > > goto io_error; > > > > > sector += len >> SECTOR_SHIFT; > > > > > } > > > > > > > > > > So, the code behind bio_for_each_segment() splits multi-page bvecs > > > > > into individual pages, which are passed to brd_do_bvec(). 
An > > > > > unaligned 4kB io traces out as: > > > > > > > > > > [ 121.295550] p,o,l,s 00000000a77f0146,768,3328,0x7d0048 > > > > > [ 121.297635] p,o,l,s 000000006ceca91e,0,768,0x7d004e > > > > > > > > > > i.e. page offset len sector > > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > > 000000006ceca91e 0 768 0x7d004e > > > > > > > > > > You should be able to guess what the problems are from this. > > > > > > The problem should be that offset of '768' is passed to bio_add_page(). > > > > It can be quite hard to deal with non-512 aligned sector buffer, since > > one sector buffer may cross two pages, so far one workaround I thought > > of is to not merge such IO buffer into one bvec. > > > > Verma, could you try the following patch? > > > > diff --git a/block/bio.c b/block/bio.c > > index 24a496f5d2e2..49deab2ac8c4 100644 > > --- a/block/bio.c > > +++ b/block/bio.c > > @@ -769,6 +769,9 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page, > > if (WARN_ON_ONCE(bio_flagged(bio, BIO_CLONED))) > > return false; > > > > + if (off & 511) > > + return false; > > What does this acheive? It only prevents the unaligned segment from > being merged, it doesn't prevent it from being added to a new bvec. The current issue is that the block layer won't handle such a case. As I mentioned, one sector buffer may cross two pages, so it can't be split successfully; it is simply put into one new bvec, just as before multi-page bvecs were enabled. > > However, the case here is that: > > > > > > i.e. page offset len sector > > > > > 00000000a77f0146 768 3328 0x7d0048 > > > > > 000000006ceca91e 0 768 0x7d004e > > The second page added to the bvec is actually offset alignedr. Hence > the check would do nothing on the first page because the bvec array > is empty (so goes into a new bvec anyway), and the check on the > second page would do nothing an it would merge with first because > the offset is aligned correctly.
In both cases, the length of the > segment is not aligned, so that needs to be checked, too. What the patch changes is the bvec stored in the bio; the bvec above is actually built in flight. So if the 1st page added to the bio is (768, 512), then finally bio_for_each_segment() will generate the 1st page as (768, 512), and everything will be fine. > > IOWs, I think the check needs to be in bio_add_page, it needs to > check both the offset and length for alignment, and it needs to grab The length has to be 512 aligned, otherwise it is simply a bug in the fs. > the alignment from queue_dma_alignment(), not use a hard coded value > of 511. Now the policy for bio_add_page() is not to check any queue limit, given we don't know the actual limit for the final IO; the queue may not even be set when bio_add_page() is called. If the upper layer doesn't pass a SLUB buffer which is > PAGE_SIZE, the block layer may handle it well without the 512 alignment check. Thanks, Ming
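The point that one sector of a non-512-aligned buffer can straddle a page boundary is visible in the traced numbers: for a buffer starting at page offset 768, the seventh sector (index 6) begins at page offset 3840 and ends at 4351, crossing into the next page; its first-page half is exactly the 256-byte tail seen in the trace. A standalone sketch (invented helper, for illustration only):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096u

/* For a buffer starting at page offset 'off', does sector index 'k'
 * (512 bytes each) straddle a page boundary? This is what makes
 * splitting such a bvec per page unsafe for sector-based drivers. */
static bool sector_straddles_page(unsigned int off, unsigned int k)
{
    unsigned int start = off + k * 512;
    unsigned int end = start + 511;          /* last byte of the sector */
    return (start / MODEL_PAGE_SIZE) != (end / MODEL_PAGE_SIZE);
}
```

A 512-aligned buffer never produces a straddling sector, which is why the alignment requirement makes the per-page split safe again.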
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-19 3:49 ` hch 2019-08-19 4:11 ` hch @ 2019-08-19 4:15 ` Dave Chinner 1 sibling, 0 replies; 31+ messages in thread From: Dave Chinner @ 2019-08-19 4:15 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Mon, Aug 19, 2019 at 05:49:48AM +0200, hch@lst.de wrote: > On Mon, Aug 19, 2019 at 10:08:31AM +1000, Dave Chinner wrote: > > > So I can also reproduce the same issue with the ramdisk driver, but not > > > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > > > loop). > > > > How did you reproduce it? I tried with an identical pmem config to > > what Vishal was using and couldn't hit it... > > Just mkfs and mount the pmem device. Ran that in a loop overnight, no failures. > The fake pmem was created by the following kernel command line snippet: > "memmap=2000M!1024M,2000M!4096M" and my kernel config is attached. I Very similar to what I'm using, just 2x8GB pmem instead of the smaller 1/4GB pair you are using. I even tried manually making it smaller w/ -d size= and changing log sizes, etc. I tried squeezing down memory to only 256MB unused (via mlock) but that didn't seem to change anything, either... > suspect you'll need CONFIG_SLUB to create the woes, but let me confirm > that.. Yup, that's what I have configured in my kernel, too. It's still a mystery to me why I'm not able to reproduce it easily. If it was just pmem, then I should have found it back when I tripped over the chaining bug, because the two VMs I was running testing on at the time were an iscsi test rig and a pmem test rig. The iscsi one tripped over the chaining bug, the pmem one has had no problems.... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 17:34 ` hch 2019-08-19 0:08 ` Dave Chinner @ 2019-08-19 17:19 ` Verma, Vishal L 2019-08-21 0:26 ` Dave Chinner 2019-08-26 17:32 ` Verma, Vishal L 3 siblings, 0 replies; 31+ messages in thread From: Verma, Vishal L @ 2019-08-19 17:19 UTC (permalink / raw) To: hch; +Cc: Williams, Dan J, linux-xfs, david, darrick.wong On Sun, 2019-08-18 at 19:34 +0200, hch@lst.de wrote: > > So I can also reproduce the same issue with the ramdisk driver, but not > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > loop). Which made me wonder if there is some issue about the memory > passed in, and indeed just switching to plain vmalloc vs the XFS > kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes > the issue for me. I don't really understand why yet, maybe I need to > dig out alignment testing patches. With the patch below, I can't reproduce the failure - if this is going into 5.3-rc, feel free to add: Tested-by: Vishal Verma <vishal.l.verma@intel.com> > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 13d1d3e95b88..918ad3b884a7 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -125,7 +125,7 @@ xlog_alloc_buffer( > if (nbblks > 1 && log->l_sectBBsize > 1) > nbblks += log->l_sectBBsize; > nbblks = round_up(nbblks, log->l_sectBBsize); > - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); > + return vmalloc(BBTOB(nbblks)); > } > > /* > @@ -416,7 +416,7 @@ xlog_find_verify_cycle( > *new_blk = -1; > > out: > - kmem_free(buffer); > + vfree(buffer); > return error; > } > > @@ -527,7 +527,7 @@ xlog_find_verify_log_record( > *last_blk = i; > > out: > - kmem_free(buffer); > + vfree(buffer); > return error; > } > > @@ -781,7 +781,7 @@ xlog_find_head( > goto out_free_buffer; > } > > - kmem_free(buffer); > + vfree(buffer); > if (head_blk == log_bbnum) > *return_head_blk = 0; > else > @@ -795,7 +795,7 @@ xlog_find_head( > return 0; > > 
out_free_buffer: > - kmem_free(buffer); > + vfree(buffer); > if (error) > xfs_warn(log->l_mp, "failed to find log head"); > return error; > @@ -1049,7 +1049,7 @@ xlog_verify_tail( > "Tail block (0x%llx) overwrite detected. Updated to 0x%llx", > orig_tail, *tail_blk); > out: > - kmem_free(buffer); > + vfree(buffer); > return error; > } > > @@ -1096,7 +1096,7 @@ xlog_verify_head( > error = xlog_rseek_logrec_hdr(log, *head_blk, *tail_blk, > XLOG_MAX_ICLOGS, tmp_buffer, > &tmp_rhead_blk, &tmp_rhead, &tmp_wrapped); > - kmem_free(tmp_buffer); > + vfree(tmp_buffer); > if (error < 0) > return error; > > @@ -1429,7 +1429,7 @@ xlog_find_tail( > error = xlog_clear_stale_blocks(log, tail_lsn); > > done: > - kmem_free(buffer); > + vfree(buffer); > > if (error) > xfs_warn(log->l_mp, "failed to locate log tail"); > @@ -1477,7 +1477,7 @@ xlog_find_zeroed( > first_cycle = xlog_get_cycle(offset); > if (first_cycle == 0) { /* completely zeroed log */ > *blk_no = 0; > - kmem_free(buffer); > + vfree(buffer); > return 1; > } > > @@ -1488,7 +1488,7 @@ xlog_find_zeroed( > > last_cycle = xlog_get_cycle(offset); > if (last_cycle != 0) { /* log completely written to */ > - kmem_free(buffer); > + vfree(buffer); > return 0; > } > > @@ -1535,7 +1535,7 @@ xlog_find_zeroed( > > *blk_no = last_blk; > out_free_buffer: > - kmem_free(buffer); > + vfree(buffer); > if (error) > return error; > return 1; > @@ -1647,7 +1647,7 @@ xlog_write_log_records( > } > > out_free_buffer: > - kmem_free(buffer); > + vfree(buffer); > return error; > } > > @@ -5291,7 +5291,7 @@ xlog_do_recovery_pass( > hblks = h_size / XLOG_HEADER_CYCLE_SIZE; > if (h_size % XLOG_HEADER_CYCLE_SIZE) > hblks++; > - kmem_free(hbp); > + vfree(hbp); > hbp = xlog_alloc_buffer(log, hblks); > } else { > hblks = 1; > @@ -5307,7 +5307,7 @@ xlog_do_recovery_pass( > return -ENOMEM; > dbp = xlog_alloc_buffer(log, BTOBB(h_size)); > if (!dbp) { > - kmem_free(hbp); > + vfree(hbp); > return -ENOMEM; > } > > @@ -5468,9 +5468,9 @@ xlog_do_recovery_pass( > 
} > > bread_err2: > - kmem_free(dbp); > + vfree(dbp); > bread_err1: > - kmem_free(hbp); > + vfree(hbp); > > /* > * Submit buffers that have been added from the last record processed, ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 17:34 ` hch 2019-08-19 0:08 ` Dave Chinner 2019-08-19 17:19 ` Verma, Vishal L @ 2019-08-21 0:26 ` Dave Chinner 2019-08-21 0:44 ` hch 2019-08-26 17:32 ` Verma, Vishal L 3 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-21 0:26 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Sun, Aug 18, 2019 at 07:34:26PM +0200, hch@lst.de wrote: > On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote: > > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote: > > > > The kernel log shows the following when the mount fails: > > > > > > Is it always that same message? I'll see if I can reproduce it, > > > but I won't have that much memory to spare to create fake pmem, > > > hope this also works with a single device and/or less memory.. > > > > I've reproduced a similar ASSERT with a small pmem device, so I hope > > I can debug the issue locally now. > > So I can also reproduce the same issue with the ramdisk driver, but not > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > loop). Which made me wonder if there is some issue about the memory > passed in, and indeed just switching to plain vmalloc vs the XFS > kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes > the issue for me. I don't really understand why yet, maybe I need to > dig out alignment testing patches. 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 13d1d3e95b88..918ad3b884a7 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -125,7 +125,7 @@ xlog_alloc_buffer( > if (nbblks > 1 && log->l_sectBBsize > 1) > nbblks += log->l_sectBBsize; > nbblks = round_up(nbblks, log->l_sectBBsize); > - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); > + return vmalloc(BBTOB(nbblks)); > } After thinking on this for a bit, I suspect the better thing to do here is add a KM_ALIGNED flag to the allocation, so if the internal kmem_alloc() returns an unaligned pointer we free it and fall through to vmalloc() to get a properly aligned pointer.... That way none of the other interfaces have to change, and we can then use kmem_alloc_large() everywhere we allocate buffers for IO. And we don't need new infrastructure just to support these debug configurations, either. Actually, kmem_alloc_io() might be a better idea - keep the aligned flag internal to the kmem code. Seems like a pretty simple solution to the entire problem we have here... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-21 0:26 ` Dave Chinner @ 2019-08-21 0:44 ` hch 2019-08-21 1:08 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: hch @ 2019-08-21 0:44 UTC (permalink / raw) To: Dave Chinner Cc: hch, Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote: > After thinking on this for a bit, I suspect the better thing to do > here is add a KM_ALIGNED flag to the allocation, so if the internal > kmem_alloc() returns an unaligned pointer we free it and fall > through to vmalloc() to get a properly aligned pointer.... > > That way none of the other interfaces have to change, and we can > then use kmem_alloc_large() everywhere we allocate buffers for IO. > And we don't need new infrastructure just to support these debug > configurations, either. > > Actually, kmem_alloc_io() might be a better idea - keep the aligned > flag internal to the kmem code. Seems like a pretty simple solution > to the entire problem we have here... The interface sounds ok. The simple try-and-fallback implementation OTOH means we always do two allocations if slub debugging is enabled, which is a little nasty. I guess it's the best we can do for 5.3; we can then figure out a way to avoid it later.
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-21 0:44 ` hch @ 2019-08-21 1:08 ` Dave Chinner 2019-08-21 1:56 ` Verma, Vishal L 0 siblings, 1 reply; 31+ messages in thread From: Dave Chinner @ 2019-08-21 1:08 UTC (permalink / raw) To: hch; +Cc: Verma, Vishal L, linux-xfs, Williams, Dan J, darrick.wong On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote: > On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote: > > After thinking on this for a bit, I suspect the better thing to do > > here is add a KM_ALIGNED flag to the allocation, so if the internal > > kmem_alloc() returns an unaligned pointer we free it and fall > > through to vmalloc() to get a properly aligned pointer.... > > > > That way none of the other interfaces have to change, and we can > > then use kmem_alloc_large() everywhere we allocate buffers for IO. > > And we don't need new infrastructure just to support these debug > > configurations, either. > > > > Actually, kmem_alloc_io() might be a better idea - keep the aligned > > flag internal to the kmem code. Seems like a pretty simple solution > > to the entire problem we have here... > > The interface sounds ok. The simple try and fallback implementation > OTOH means we always do two allocations if slub debugging is enabled, > which is a little nasty. Some creative ifdefery could avoid that, but quite frankly it's only necessary for limited allocation contexts and so the overhead/interactions would largely be unnoticeable compared to the wider impact of memory debugging... > I guess the best we can do for 5.3 and > then figure out a way to avoid for later. Yeah, it also has the advantage of documenting all the places we need to be careful of allocation alignment, and it would allow us to simply plug in whatever future infrastructure comes along that guarantees allocation alignment without changing any other XFS code... Cheers, Dave. -- Dave Chinner david@fromorbit.com
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-21 1:08 ` Dave Chinner @ 2019-08-21 1:56 ` Verma, Vishal L 2019-08-21 6:15 ` Dave Chinner 0 siblings, 1 reply; 31+ messages in thread From: Verma, Vishal L @ 2019-08-21 1:56 UTC (permalink / raw) To: hch, david; +Cc: Williams, Dan J, linux-xfs, darrick.wong On Wed, 2019-08-21 at 11:08 +1000, Dave Chinner wrote: > On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote: > > On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote: > > > After thinking on this for a bit, I suspect the better thing to do > > > here is add a KM_ALIGNED flag to the allocation, so if the > > > internal > > > kmem_alloc() returns an unaligned pointer we free it and fall > > > through to vmalloc() to get a properly aligned pointer.... > > > > > > That way none of the other interfaces have to change, and we can > > > then use kmem_alloc_large() everywhere we allocate buffers for IO. > > > And we don't need new infrastructure just to support these debug > > > configurations, either. > > > > > > Actually, kmem_alloc_io() might be a better idea - keep the > > > aligned > > > flag internal to the kmem code. Seems like a pretty simple > > > solution > > > to the entire problem we have here... > > > > The interface sounds ok. The simple try and fallback implementation > > OTOH means we always do two allocations іf slub debugging is > > enabled, > > which is a little lasty. > > Some creative ifdefery could avoid that, but quite frankly it's only > necessary for limited allocation contexts and so the > overhead/interactions would largely be unnoticable compared to the > wider impact of memory debugging... > > > I guess the best we can do for 5.3 and > > then figure out a way to avoid for later. 
> > Yeah, it also has the advantage of documenting all the places we > need to be careful of allocation alignment, and it would allow us to > simply plug in whatever future infrastructure comes along that > guarantees allocation alignment without changing any other XFS > code... Just to clarify, this precludes the changes to bio_add_page() you suggested earlier, right? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-21 1:56 ` Verma, Vishal L @ 2019-08-21 6:15 ` Dave Chinner 0 siblings, 0 replies; 31+ messages in thread From: Dave Chinner @ 2019-08-21 6:15 UTC (permalink / raw) To: Verma, Vishal L; +Cc: hch, Williams, Dan J, linux-xfs, darrick.wong On Wed, Aug 21, 2019 at 01:56:33AM +0000, Verma, Vishal L wrote: > On Wed, 2019-08-21 at 11:08 +1000, Dave Chinner wrote: > > On Wed, Aug 21, 2019 at 02:44:13AM +0200, hch@lst.de wrote: > > > On Wed, Aug 21, 2019 at 10:26:43AM +1000, Dave Chinner wrote: > > > > After thinking on this for a bit, I suspect the better thing to do > > > > here is add a KM_ALIGNED flag to the allocation, so if the > > > > internal > > > > kmem_alloc() returns an unaligned pointer we free it and fall > > > > through to vmalloc() to get a properly aligned pointer.... > > > > > > > > That way none of the other interfaces have to change, and we can > > > > then use kmem_alloc_large() everywhere we allocate buffers for IO. > > > > And we don't need new infrastructure just to support these debug > > > > configurations, either. > > > > > > > > Actually, kmem_alloc_io() might be a better idea - keep the > > > > aligned > > > > flag internal to the kmem code. Seems like a pretty simple > > > > solution > > > > to the entire problem we have here... > > > > > > The interface sounds ok. The simple try and fallback implementation > > > OTOH means we always do two allocations іf slub debugging is > > > enabled, > > > which is a little lasty. > > > > Some creative ifdefery could avoid that, but quite frankly it's only > > necessary for limited allocation contexts and so the > > overhead/interactions would largely be unnoticable compared to the > > wider impact of memory debugging... > > > > > I guess the best we can do for 5.3 and > > > then figure out a way to avoid for later. 
> > > > Yeah, it also has the advantage of documenting all the places we > > need to be careful of allocation alignment, and it would allow us to > > simply plug in whatever future infrastructure comes along that > > guarantees allocation alignment without changing any other XFS > > code... > > Just to clarify, this precludes the changes to bio_add_page() you > suggested earlier, right? Nope, I most certainly want the block layer to catch this, but given the total lack of understanding from Ming, I've given up on trying to convince them that validation is both possible and necessary. The patch below basically copies bio_add_page() into xfs as xfs_bio_add_page() and uses block layer helpers to check alignment. It's the same helper blk_rq_map_kern() uses to detect and bounce unaligned user buffers... With that, the simple KASAN enabled reproducer: # mkfs.xfs /dev/ram0; mount /dev/ram0 /mnt/test [ 88.260091] XFS (ram0): Mounting V5 Filesystem [ 88.314453] WARNING: CPU: 0 PID: 4681 at fs/xfs/xfs_bio_io.c:24 xfs_bio_add_page+0x134/0x1d3 [ 88.316484] CPU: 0 PID: 4681 Comm: mount Not tainted 5.3.0-rc5-dgc+ #1516 [ 88.318115] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 [ 88.320110] RIP: 0010:xfs_bio_add_page+0x134/0x1d3 .... 
[ 88.340471] Call Trace: [ 88.343011] xfs_rw_bdev.cold.5+0xf4/0x193 [ 88.345224] xlog_do_io+0xd1/0x1c0 [ 88.346050] xlog_bread+0x23/0x70 [ 88.346857] xlog_find_verify_log_record+0xc8/0x330 [ 88.349171] xlog_find_zeroed+0x224/0x330 [ 88.356712] xlog_find_head+0xbe/0x640 [ 88.375967] xlog_find_tail+0xc6/0x500 [ 88.388098] xlog_recover+0x80/0x2a0 [ 88.392020] xfs_log_mount+0x3a5/0x3d0 [ 88.392931] xfs_mountfs+0x753/0xe40 [ 88.403201] xfs_fs_fill_super+0x5c1/0x9d0 [ 88.405320] mount_bdev+0x1be/0x200 [ 88.407178] legacy_get_tree+0x6e/0xb0 [ 88.408089] vfs_get_tree+0x41/0x160 [ 88.408955] do_mount+0xa48/0xcf0 [ 88.416384] ksys_mount+0xb6/0xd0 [ 88.417191] __x64_sys_mount+0x62/0x70 [ 88.418101] do_syscall_64+0x70/0x230 [ 88.418991] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 88.420208] RIP: 0033:0x7f1817194fea ...... [ 88.436923] XFS (ram0): log recovery read I/O error at daddr 0x0 len 8 error -5 [ 88.438707] XFS (ram0): empty log check failed [ 88.439794] XFS (ram0): log mount/recovery failed: error -5 [ 88.441435] XFS (ram0): log mount failed mount: /mnt/test: can't read superblock on /dev/ram0. umount: /mnt/test: not mounted. # throws a warning and an IO error. No data corruption, no weird log recovery failures, etc. This is easy to detect, and so easy to debug now. Shame we can't do this in the block layer code. Cheers, Dave. -- Dave Chinner david@fromorbit.com xfs: alignment check bio buffers From: Dave Chinner <dchinner@redhat.com> Because apparently this is impossible to do in the block layer. 
Signed-off-by: Dave Chinner <dchinner@redhat.com> --- fs/xfs/xfs_bio_io.c | 32 +++++++++++++++++++++++++++++--- fs/xfs/xfs_buf.c | 7 ++++--- fs/xfs/xfs_linux.h | 3 +++ fs/xfs/xfs_log.c | 6 +++++- 4 files changed, 41 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_bio_io.c b/fs/xfs/xfs_bio_io.c index e2148f2d5d6b..fbaea643c000 100644 --- a/fs/xfs/xfs_bio_io.c +++ b/fs/xfs/xfs_bio_io.c @@ -9,6 +9,27 @@ static inline unsigned int bio_max_vecs(unsigned int count) return min_t(unsigned, howmany(count, PAGE_SIZE), BIO_MAX_PAGES); } +int +xfs_bio_add_page( + struct bio *bio, + struct page *page, + unsigned int len, + unsigned int offset) +{ + struct request_queue *q = bio->bi_disk->queue; + bool same_page = false; + + if (WARN_ON_ONCE(!blk_rq_aligned(q, len, offset))) + return -EIO; + + if (!__bio_try_merge_page(bio, page, len, offset, &same_page)) { + if (bio_full(bio, len)) + return 0; + __bio_add_page(bio, page, len, offset); + } + return len; +} + int xfs_rw_bdev( struct block_device *bdev, @@ -20,7 +41,7 @@ xfs_rw_bdev( { unsigned int is_vmalloc = is_vmalloc_addr(data); unsigned int left = count; - int error; + int error, ret = 0; struct bio *bio; if (is_vmalloc && op == REQ_OP_WRITE) @@ -36,9 +57,12 @@ xfs_rw_bdev( unsigned int off = offset_in_page(data); unsigned int len = min_t(unsigned, left, PAGE_SIZE - off); - while (bio_add_page(bio, page, len, off) != len) { + while ((ret = xfs_bio_add_page(bio, page, len, off)) != len) { struct bio *prev = bio; + if (ret < 0) + goto submit; + bio = bio_alloc(GFP_KERNEL, bio_max_vecs(left)); bio_copy_dev(bio, prev); bio->bi_iter.bi_sector = bio_end_sector(prev); @@ -48,14 +72,16 @@ xfs_rw_bdev( submit_bio(prev); } + ret = 0; data += len; left -= len; } while (left > 0); +submit: error = submit_bio_wait(bio); bio_put(bio); if (is_vmalloc && op == REQ_OP_READ) invalidate_kernel_vmap_range(data, count); - return error; + return ret ? 
ret : error; } diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index dd3aa64098a8..f2e6669f60b0 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -1261,6 +1261,7 @@ xfs_buf_ioapply_map( sector_t sector = bp->b_maps[map].bm_bn; int size; int offset; + int rbytes = 0; /* skip the pages in the buffer before the start offset */ page_index = 0; @@ -1290,12 +1291,12 @@ xfs_buf_ioapply_map( bio_set_op_attrs(bio, op, op_flags); for (; size && nr_pages; nr_pages--, page_index++) { - int rbytes, nbytes = PAGE_SIZE - offset; + int nbytes = PAGE_SIZE - offset; if (nbytes > size) nbytes = size; - rbytes = bio_add_page(bio, bp->b_pages[page_index], nbytes, + rbytes = xfs_bio_add_page(bio, bp->b_pages[page_index], nbytes, offset); if (rbytes < nbytes) break; @@ -1306,7 +1307,7 @@ xfs_buf_ioapply_map( total_nr_pages--; } - if (likely(bio->bi_iter.bi_size)) { + if (likely(bio->bi_iter.bi_size && rbytes >= 0)) { if (xfs_buf_is_vmapped(bp)) { flush_kernel_vmap_range(bp->b_addr, xfs_buf_vmap_len(bp)); diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index ca15105681ca..e71c7bf3e714 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -145,6 +145,9 @@ static inline void delay(long ticks) schedule_timeout_uninterruptible(ticks); } +int xfs_bio_add_page(struct bio *bio, struct page *page, unsigned int len, + unsigned int offset); + /* * XFS wrapper structure for sysfs support. 
It depends on external data * structures and is embedded in various internal data structures to implement diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index acc5a824bb9e..c2f1ff57fb6d 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1673,8 +1673,12 @@ xlog_map_iclog_data( struct page *page = kmem_to_page(data); unsigned int off = offset_in_page(data); size_t len = min_t(size_t, count, PAGE_SIZE - off); + int ret; - WARN_ON_ONCE(bio_add_page(bio, page, len, off) != len); + ret = xfs_bio_add_page(bio, page, len, off); + WARN_ON_ONCE(ret != len); + if (ret < 0) + break; data += len; count -= len; ^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: 5.3-rc1 regression with XFS log recovery 2019-08-18 17:34 ` hch ` (2 preceding siblings ...) 2019-08-21 0:26 ` Dave Chinner @ 2019-08-26 17:32 ` Verma, Vishal L 3 siblings, 0 replies; 31+ messages in thread From: Verma, Vishal L @ 2019-08-26 17:32 UTC (permalink / raw) To: hch; +Cc: Williams, Dan J, linux-xfs, david, darrick.wong On Sun, 2019-08-18 at 19:34 +0200, hch@lst.de wrote: > On Sun, Aug 18, 2019 at 09:41:40AM +0200, hch@lst.de wrote: > > On Sun, Aug 18, 2019 at 09:11:28AM +0200, hch@lst.de wrote: > > > > The kernel log shows the following when the mount fails: > > > > > > Is it always that same message? I'll see if I can reproduce it, > > > but I won't have that much memory to spare to create fake pmem, > > > hope this also works with a single device and/or less memory.. > > > > I've reproduced a similar ASSERT with a small pmem device, so I hope > > I can debug the issue locally now. > > So I can also reproduce the same issue with the ramdisk driver, but not > with any other 4k sector size device (nvmet, scsi target, scsi_debug, > loop). Which made me wonder if there is some issue about the memory > passed in, and indeed just switching to plain vmalloc vs the XFS > kmem_alloc_large wrapper that either uses kmalloc or vmalloc fixes > the issue for me. I don't really understand why yet, maybe I need to > dig out alignment testing patches. > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > index 13d1d3e95b88..918ad3b884a7 100644 > --- a/fs/xfs/xfs_log_recover.c > +++ b/fs/xfs/xfs_log_recover.c > @@ -125,7 +125,7 @@ xlog_alloc_buffer( > if (nbblks > 1 && log->l_sectBBsize > 1) > nbblks += log->l_sectBBsize; > nbblks = round_up(nbblks, log->l_sectBBsize); > - return kmem_alloc_large(BBTOB(nbblks), KM_MAYFAIL); > + return vmalloc(BBTOB(nbblks)); Is the plan to use this vmalloc fix for 5.3? Is there anything I can help with to make progress on this? -Vishal ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2019-08-26 17:32 UTC | newest] Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2019-08-16 20:59 5.3-rc1 regression with XFS log recovery Verma, Vishal L 2019-08-18 7:11 ` hch 2019-08-18 7:41 ` hch 2019-08-18 17:34 ` hch 2019-08-19 0:08 ` Dave Chinner 2019-08-19 3:49 ` hch 2019-08-19 4:11 ` hch 2019-08-19 4:22 ` Dave Chinner 2019-08-19 4:29 ` hch 2019-08-19 4:40 ` hch 2019-08-19 5:31 ` Dave Chinner 2019-08-20 6:14 ` hch 2019-08-20 4:41 ` Dave Chinner 2019-08-20 5:53 ` hch 2019-08-20 7:44 ` Dave Chinner 2019-08-20 8:13 ` Ming Lei 2019-08-20 9:24 ` Ming Lei 2019-08-20 16:30 ` Verma, Vishal L 2019-08-20 21:44 ` Dave Chinner 2019-08-20 22:08 ` Verma, Vishal L 2019-08-20 23:53 ` Dave Chinner 2019-08-21 2:19 ` Ming Lei 2019-08-21 1:56 ` Ming Lei 2019-08-19 4:15 ` Dave Chinner 2019-08-19 17:19 ` Verma, Vishal L 2019-08-21 0:26 ` Dave Chinner 2019-08-21 0:44 ` hch 2019-08-21 1:08 ` Dave Chinner 2019-08-21 1:56 ` Verma, Vishal L 2019-08-21 6:15 ` Dave Chinner 2019-08-26 17:32 ` Verma, Vishal L