* [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch @ 2018-07-10 19:10 Ross Zwisler 2018-07-10 19:10 ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler ` (3 more replies) 0 siblings, 4 replies; 24+ messages in thread From: Ross Zwisler @ 2018-07-10 19:10 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm, Jeff Moyer, linux-ext4, Lukas Czerner Changes since v3: * Added an ext4_break_layouts() call to ext4_insert_range() to ensure that the {ext4,xfs}_break_layouts() calls have the same meaning. (Dave, Darrick and Jan) --- This series from Dan: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html added synchronization between DAX dma and truncate/hole-punch in XFS. This short series adds analogous support to ext4. I've added calls to ext4_break_layouts() everywhere that ext4 removes blocks from an inode's map. The timings in XFS are such that it's difficult to hit this race. Dan was able to show the race by manually introducing delays in the direct I/O path. For ext4, though, it's trivial to hit this race, and a hit will result in a trigger of this WARN_ON_ONCE() in dax_disassociate_entry(): WARN_ON_ONCE(trunc && page_ref_count(page) > 1); I've made an xfstest which tests all the paths where we now call ext4_break_layouts(). Each of the four paths easily hits this race many times in my test setup with the xfstest. You can find that test here: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html With these patches applied, I've still seen occasional hits of the above WARN_ON_ONCE(), which tells me that we still have some work to do. I'll continue looking at these rarer hits. 
Ross Zwisler (2): dax: dax_layout_busy_page() warn on !exceptional ext4: handle layout changes to pinned DAX mappings fs/dax.c | 10 +++++++++- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 17 +++++++++++++++++ fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/truncate.h | 4 ++++ 5 files changed, 77 insertions(+), 1 deletion(-) -- 2.14.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler @ 2018-07-10 19:10 ` Ross Zwisler 2018-08-10 19:52 ` Eric Sandeen 2018-07-10 19:10 ` [PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler ` (2 subsequent siblings) 3 siblings, 1 reply; 24+ messages in thread From: Ross Zwisler @ 2018-07-10 19:10 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm, Jeff Moyer, linux-ext4, Lukas Czerner Inodes using DAX should only ever have exceptional entries in their page caches. Make this clear by warning if the iteration in dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a comment for the pagevec_release() call which only deals with struct page pointers. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> --- fs/dax.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index 641192808bb6..897b51e41d8f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -566,7 +566,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping) if (index >= end) break; - if (!radix_tree_exceptional_entry(pvec_ent)) + if (WARN_ON_ONCE( + !radix_tree_exceptional_entry(pvec_ent))) continue; xa_lock_irq(&mapping->i_pages); @@ -578,6 +579,13 @@ struct page *dax_layout_busy_page(struct address_space *mapping) if (page) break; } + + /* + * We don't expect normal struct page entries to exist in our + * tree, but we keep these pagevec calls so that this code is + * consistent with the common pattern for handling pagevecs + * throughout the kernel. 
+ */ pagevec_remove_exceptionals(&pvec); pagevec_release(&pvec); index++; -- 2.14.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-07-10 19:10 ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler @ 2018-08-10 19:52 ` Eric Sandeen 2018-08-10 20:33 ` Theodore Y. Ts'o 0 siblings, 1 reply; 24+ messages in thread From: Eric Sandeen @ 2018-08-10 19:52 UTC (permalink / raw) To: Ross Zwisler, Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm, Jeff Moyer, linux-ext4, Lukas Czerner On 7/10/18 2:10 PM, Ross Zwisler wrote: > Inodes using DAX should only ever have exceptional entries in their page > caches. Make this clear by warning if the iteration in > dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a > comment for the pagevec_release() call which only deals with struct page > pointers. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > Reviewed-by: Jan Kara <jack@suse.cz> Hi Ted, hadn't seen feedback from you on this by the time it gathered reviews, is this something you plan to merge for realz? (I see it's on your dev branch now, just not sure of its permanence at this point.) Thanks, -Eric > --- > fs/dax.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/fs/dax.c b/fs/dax.c > index 641192808bb6..897b51e41d8f 100644 > --- a/fs/dax.c > +++ b/fs/dax.c > @@ -566,7 +566,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping) > if (index >= end) > break; > > - if (!radix_tree_exceptional_entry(pvec_ent)) > + if (WARN_ON_ONCE( > + !radix_tree_exceptional_entry(pvec_ent))) > continue; > > xa_lock_irq(&mapping->i_pages); > @@ -578,6 +579,13 @@ struct page *dax_layout_busy_page(struct address_space *mapping) > if (page) > break; > } > + > + /* > + * We don't expect normal struct page entries to exist in our > + * tree, but we keep these pagevec calls so that this code is > + * consistent with the common pattern for handling pagevecs > + * throughout the kernel. 
> + */ > pagevec_remove_exceptionals(&pvec); > pagevec_release(&pvec); > index++; > _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-10 19:52 ` Eric Sandeen @ 2018-08-10 20:33 ` Theodore Y. Ts'o 2018-08-11 2:10 ` Theodore Y. Ts'o 0 siblings, 1 reply; 24+ messages in thread From: Theodore Y. Ts'o @ 2018-08-10 20:33 UTC (permalink / raw) To: sandeen Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Fri, Aug 10, 2018 at 02:52:53PM -0500, Eric Sandeen wrote: > > Hi Ted, hadn't seen feedback from you on this by the time it gathered reviews, > is this something you plan to merge for realz? (I see it's on your dev > branch now, just not sure of its permanence at this point.) Yes, the dev branch is pretty much locked down since I assume the merge window is opening over the weekend. The reason why I've been silent is because I haven't had a chance to do a test run with DAX, because up until recently I wasn't able to run a DAX regression test. (That's because of the CONFIG_KASAN incompatibility with CONFIG_ZONE_DEVICE that caused the kernel to instantly blow up on boot if I tried to enable emulated /dev/pmem devices.) I just kicked off a DAX test ("gce-xfstests -c dax -g auto") with CONFIG_KASAN disabled, and I expect it shouldn't show up anything concerning. So assuming nothing surprising pops up, yes it should be merged at the next merge window. - Ted _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-10 20:33 ` Theodore Y. Ts'o @ 2018-08-11 2:10 ` Theodore Y. Ts'o 2018-08-13 10:12 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Theodore Y. Ts'o @ 2018-08-11 2:10 UTC (permalink / raw) To: sandeen Cc: Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Fri, Aug 10, 2018 at 04:33:49PM -0400, Theodore Y. Ts'o wrote: > I just kicked off a DAX test ("gce-xfstests -c dax -g auto") with > CONFIG_KASAN disabled, and I expect it shouldn't show up anything > concerning. So assuming nothing surprising pops up, yes it should be > merged at the next merge window. ... and here are the results. The first is 4.17, and the second is the ext4 git tree: ext4/dax: 488 tests, 4 failures, 97 skipped, 2647 seconds Failures: ext4/033 generic/344 generic/491 generic/503 ext4/dax: 488 tests, 3 failures, 97 skipped, 2637 seconds Failures: generic/081 generic/344 generic/388 The generic/388 failure is a known flake (shutdown stress test). The generic/081 regression appears to be a device-mapper issue: generic/081 [22:06:33][ 15.079661] run fstests generic/081 at 2018-08-10 22:06:33 [ 15.795745] device-mapper: ioctl: can't change device type (old=4 vs new=1) after initial table load. [failed, exit status 1] [22:06:36]- output mismatch (see /results/ext4/results-dax/generic/081.out.bad) --- tests/generic/081.out 2018-08-09 18:00:42.000000000 -0400 +++ /results/ext4/results-dax/generic/081.out.bad 2018-08-10 22:06:36.440005460 -0400 @@ -1,2 +1,4 @@ QA output created by 081 Silence is golden +Failed to create snapshot +(see /results/ext4/results-dax/generic/081.full for details) ... 
(Run 'diff -u tests/generic/081.out /results/ext4/results-dax/generic/081.out.bad' to see the entire diff) The generic/344 failure seems to be caused by a WARNING triggered in the nvdimm code: generic/344 [22:06:36][ 18.126280] run fstests generic/344 at 2018-08-10 22:06:36 [ 18.303113] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [ 18.456988] EXT4-fs (pmem1): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [ 97.375912] WARNING: CPU: 2 PID: 1712 at /usr/projects/linux/ext4/mm/memory.c:1801 insert_pfn+0x15a/0x170 [ 97.377261] CPU: 2 PID: 1712 Comm: holetest Not tainted 4.18.0-rc4-xfstests-00039-g863c37fcb14f #497 [ 97.378486] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 [ 97.379516] RIP: 0010:insert_pfn+0x15a/0x170 [ 97.380064] Code: 19 1b 01 eb dd 48 85 d2 74 07 48 23 1d 3f 19 1b 01 48 09 df 48 89 f8 0f 1f 40 00 48 b9 00 02 00 00 00 00 00 04 48 09 c1 eb c8 <0f> 0b e9 5e ff ff ff bb f4 ff ff ff e9 5e ff ff ff e8 80 7b ec ff [ 97.382469] RSP: 0000:ffffacfd0457fc60 EFLAGS: 00010206 [ 97.383123] RAX: 0000000000179e3b RBX: 00000000fffffff0 RCX: 0000000000000002 [ 97.384062] RDX: 000fffffffffffff RSI: 8f5c28f5c28f5c29 RDI: 8000000179e3b225 [ 97.385134] RBP: ffff923761654558 R08: 0000000000000001 R09: 0000000000000001 [ 97.386213] R10: ffff92376f415cc0 R11: 0000000000000001 R12: ffff92377e880cc0 [ 97.387264] R13: 00007fab98aab000 R14: 00000000003e860d R15: ffff923761560000 [ 97.388327] FS: 00007fab9049c700(0000) GS:ffff92377f200000(0000) knlGS:0000000000000000 [ 97.389514] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 97.390367] CR2: 00007fab98aabc00 CR3: 00000006a165a004 CR4: 0000000000360ee0 [ 97.391432] Call Trace: [ 97.391798] __vm_insert_mixed+0x7e/0xc0 [ 97.392376] vmf_insert_mixed_mkwrite+0xf/0x30 [ 97.393048] dax_iomap_pte_fault+0xb8b/0xe40 [ 97.393691] ext4_dax_huge_fault+0x145/0x200 [ 97.394268] do_wp_page+0x175/0x5b0 [ 97.394710] __handle_mm_fault+0x587/0xbb0 [ 97.395228] 
__do_page_fault+0x20c/0x490 [ 97.395729] ? async_page_fault+0x8/0x30 [ 97.396251] async_page_fault+0x1e/0x30 [ 97.396719] RIP: 0033:0x5598144275ea [ 97.397187] Code: 0f 85 8a 00 00 00 31 d2 48 85 db 4b 8d 04 34 7e 1f 0f 1f 80 00 00 00 00 48 89 d1 48 83 c2 01 48 0f af 0d 71 1b 20 00 48 39 d3 <48> 89 2c 08 75 e8 8b 0d 36 1b 20 00 31 c0 85 c9 74 0a 8b 15 2e 1b [ 97.399752] RSP: 002b:00007fab9049bf20 EFLAGS: 00010212 [ 97.400541] RAX: 00007fab90c9ec00 RBX: 0000000000010000 RCX: 0000000007e0d000 [ 97.401603] RDX: 0000000000007e0e RSI: 0000000000000000 RDI: 0000000000000001 [ 97.402673] RBP: 00007fab9049c700 R08: 00007fab9049c700 R09: 00007fab9049c700 [ 97.403755] R10: 00007fab9049c9d0 R11: 0000000000000202 R12: 00007fab90c9e000 [ 97.404851] R13: 00007ffc4608c9e0 R14: 0000000000000c00 R15: 000055981608e250 [ 97.405892] irq event stamp: 1111968 [ 97.406460] hardirqs last enabled at (1111967): [<ffffffffa36c4409>] _raw_spin_unlock_irq+0x29/0x40 [ 97.407826] hardirqs last disabled at (1111968): [<ffffffffa380118f>] error_entry+0x7f/0x100 [ 97.409080] softirqs last enabled at (1111390): [<ffffffffa3a00319>] __do_softirq+0x319/0x4d9 [ 97.410363] softirqs last disabled at (1111383): [<ffffffffa2e8c9e1>] irq_exit+0xc1/0xd0 [ 97.411400] ---[ end trace 69669a34a73c1a49 ]--- [ 117.726077] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [ 117.727671] EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity,dax [ 117.796623] EXT4-fs (pmem1): DAX enabled. Warning: EXPERIMENTAL, use at your own risk [ 117.798546] EXT4-fs (pmem1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity,dax _check_dmesg: something found in dmesg (see /results/ext4/results-dax/generic/344.dmesg) - Ted _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-11 2:10 ` Theodore Y. Ts'o @ 2018-08-13 10:12 ` Jan Kara 2018-08-13 12:46 ` Theodore Y. Ts'o ` (2 more replies) 0 siblings, 3 replies; 24+ messages in thread From: Jan Kara @ 2018-08-13 10:12 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: sandeen, Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner, Christoph Hellwig, Lukas Czerner, linux-ext4 On Fri 10-08-18 22:10:53, Theodore Y. Ts'o wrote: > On Fri, Aug 10, 2018 at 04:33:49PM -0400, Theodore Y. Ts'o wrote: > > I just kicked off a DAX test ("gce-xfstests -c dax -g auto") with > > CONFIG_KASAN disabled, and I expect it shouldn't show up anything > > concerning. So assuming nothing surprising pops up, yes it should be > > merged at the next merge window. > > ... and here are the results. The first is 4.17, and the second is > the ext4 git tree: > > ext4/dax: 488 tests, 4 failures, 97 skipped, 2647 seconds > Failures: ext4/033 generic/344 generic/491 generic/503 > > ext4/dax: 488 tests, 3 failures, 97 skipped, 2637 seconds > Failures: generic/081 generic/344 generic/388 > > The generic/388 failure is a known flake (shutdown stress test). > > The generic/081 regression appears to be a device-mapper issue: > > generic/081 [22:06:33][ 15.079661] run fstests generic/081 at 2018-08-10 22:06:33 > [ 15.795745] device-mapper: ioctl: can't change device type (old=4 vs new=1) after initial table load. > [failed, exit status 1] [22:06:36]- output mismatch (see /results/ext4/results-dax/generic/081.out.bad) > --- tests/generic/081.out 2018-08-09 18:00:42.000000000 -0400 > +++ /results/ext4/results-dax/generic/081.out.bad 2018-08-10 22:06:36.440005460 -0400 > @@ -1,2 +1,4 @@ > QA output created by 081 > Silence is golden > +Failed to create snapshot > +(see /results/ext4/results-dax/generic/081.full for details) > ... 
> (Run 'diff -u tests/generic/081.out /results/ext4/results-dax/generic/081.out.bad' to see the entire diff) I'll see if this reproduces for me. Doesn't seem to be related to the DAX patches you carry though. > The generic/344 failure seems to be caused by a WARNING triggered in > the nvdimm code: OK, apparently this is nothing new for you as generic/344 fails for you even with 4.17. But it should not :). I'll try to see if I can reproduce this in my test setup during more test runs (I don't remember seeing it during occasional runs I do) and debug it further. Honza > generic/344 [22:06:36][ 18.126280] run fstests generic/344 at 2018-08-10 22:06:36 > [ 18.303113] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk > [ 18.456988] EXT4-fs (pmem1): DAX enabled. Warning: EXPERIMENTAL, use at your own risk > [ 97.375912] WARNING: CPU: 2 PID: 1712 at /usr/projects/linux/ext4/mm/memory.c:1801 insert_pfn+0x15a/0x170 > [ 97.377261] CPU: 2 PID: 1712 Comm: holetest Not tainted 4.18.0-rc4-xfstests-00039-g863c37fcb14f #497 > [ 97.378486] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014 > [ 97.379516] RIP: 0010:insert_pfn+0x15a/0x170 > [ 97.380064] Code: 19 1b 01 eb dd 48 85 d2 74 07 48 23 1d 3f 19 1b 01 48 09 df 48 89 f8 0f 1f 40 00 48 b9 00 02 00 00 00 00 00 04 48 09 c1 eb c8 <0f> 0b e9 5e ff ff ff bb f4 ff ff ff e9 5e ff ff ff e8 80 7b ec ff > [ 97.382469] RSP: 0000:ffffacfd0457fc60 EFLAGS: 00010206 > [ 97.383123] RAX: 0000000000179e3b RBX: 00000000fffffff0 RCX: 0000000000000002 > [ 97.384062] RDX: 000fffffffffffff RSI: 8f5c28f5c28f5c29 RDI: 8000000179e3b225 > [ 97.385134] RBP: ffff923761654558 R08: 0000000000000001 R09: 0000000000000001 > [ 97.386213] R10: ffff92376f415cc0 R11: 0000000000000001 R12: ffff92377e880cc0 > [ 97.387264] R13: 00007fab98aab000 R14: 00000000003e860d R15: ffff923761560000 > [ 97.388327] FS: 00007fab9049c700(0000) GS:ffff92377f200000(0000) knlGS:0000000000000000 > [ 97.389514] CS: 0010 DS: 0000 ES: 
0000 CR0: 0000000080050033 > [ 97.390367] CR2: 00007fab98aabc00 CR3: 00000006a165a004 CR4: 0000000000360ee0 > [ 97.391432] Call Trace: > [ 97.391798] __vm_insert_mixed+0x7e/0xc0 > [ 97.392376] vmf_insert_mixed_mkwrite+0xf/0x30 > [ 97.393048] dax_iomap_pte_fault+0xb8b/0xe40 > [ 97.393691] ext4_dax_huge_fault+0x145/0x200 > [ 97.394268] do_wp_page+0x175/0x5b0 > [ 97.394710] __handle_mm_fault+0x587/0xbb0 > [ 97.395228] __do_page_fault+0x20c/0x490 > [ 97.395729] ? async_page_fault+0x8/0x30 > [ 97.396251] async_page_fault+0x1e/0x30 > [ 97.396719] RIP: 0033:0x5598144275ea > [ 97.397187] Code: 0f 85 8a 00 00 00 31 d2 48 85 db 4b 8d 04 34 7e 1f 0f 1f 80 00 00 00 00 48 89 d1 48 83 c2 01 48 0f af 0d 71 1b 20 00 48 39 d3 <48> 89 2c 08 75 e8 8b 0d 36 1b 20 00 31 c0 85 c9 74 0a 8b 15 2e 1b > [ 97.399752] RSP: 002b:00007fab9049bf20 EFLAGS: 00010212 > [ 97.400541] RAX: 00007fab90c9ec00 RBX: 0000000000010000 RCX: 0000000007e0d000 > [ 97.401603] RDX: 0000000000007e0e RSI: 0000000000000000 RDI: 0000000000000001 > [ 97.402673] RBP: 00007fab9049c700 R08: 00007fab9049c700 R09: 00007fab9049c700 > [ 97.403755] R10: 00007fab9049c9d0 R11: 0000000000000202 R12: 00007fab90c9e000 > [ 97.404851] R13: 00007ffc4608c9e0 R14: 0000000000000c00 R15: 000055981608e250 > [ 97.405892] irq event stamp: 1111968 > [ 97.406460] hardirqs last enabled at (1111967): [<ffffffffa36c4409>] _raw_spin_unlock_irq+0x29/0x40 > [ 97.407826] hardirqs last disabled at (1111968): [<ffffffffa380118f>] error_entry+0x7f/0x100 > [ 97.409080] softirqs last enabled at (1111390): [<ffffffffa3a00319>] __do_softirq+0x319/0x4d9 > [ 97.410363] softirqs last disabled at (1111383): [<ffffffffa2e8c9e1>] irq_exit+0xc1/0xd0 > [ 97.411400] ---[ end trace 69669a34a73c1a49 ]--- > [ 117.726077] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk > [ 117.727671] EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity,dax > [ 117.796623] EXT4-fs (pmem1): DAX enabled. 
Warning: EXPERIMENTAL, use at your own risk > [ 117.798546] EXT4-fs (pmem1): mounted filesystem with ordered data mode. Opts: acl,user_xattr,block_validity,dax > _check_dmesg: something found in dmesg (see /results/ext4/results-dax/generic/344.dmesg) > > - Ted -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-13 10:12 ` Jan Kara @ 2018-08-13 12:46 ` Theodore Y. Ts'o 2018-08-24 15:44 ` Jan Kara 2018-08-27 16:09 ` Jan Kara 2 siblings, 0 replies; 24+ messages in thread From: Theodore Y. Ts'o @ 2018-08-13 12:46 UTC (permalink / raw) To: Jan Kara Cc: sandeen, Darrick J. Wong, linux-nvdimm, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Mon, Aug 13, 2018 at 12:12:52PM +0200, Jan Kara wrote: > > The generic/081 regression appears to be a device-mapper issue... > > I'll see if this reproduces for me. Doesn't seem to be related to the DAX > patches you carry though. It does seem to be a DAX-specific failure though. > > The generic/344 failure seems to be caused by a WARNING triggered in > > the nvdimm code: > > OK, apparently this is nothing new for you as generic/344 fails for you > even with 4.17. But it should not :). I'll try to see if I can reproduce > this in my test setup during more test runs (I don't remember seeing it > during occasional runs I do) and debug it further. Thanks! In case it wasn't clear, I wasn't planning on letting these failures prevent the patches from going upstream. As you say, the generic/081 failure looks unrelated to ext4, and the generic/344 isn't a regression. - Ted _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-13 10:12 ` Jan Kara 2018-08-13 12:46 ` Theodore Y. Ts'o @ 2018-08-24 15:44 ` Jan Kara 2018-08-27 16:09 ` Jan Kara 2 siblings, 0 replies; 24+ messages in thread From: Jan Kara @ 2018-08-24 15:44 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: sandeen, Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner, Christoph Hellwig, Lukas Czerner, linux-ext4 On Mon 13-08-18 12:12:52, Jan Kara wrote: > On Fri 10-08-18 22:10:53, Theodore Y. Ts'o wrote: > > The generic/344 failure seems to be caused by a WARNING triggered in > > the nvdimm code: > > OK, apparently this is nothing new for you as generic/344 fails for you > even with 4.17. But it should not :). I'll try to see if I can reproduce > this in my test setup during more test runs (I don't remember seeing it > during occasional runs I do) and debug it further. I could reproduce this relatively easily but it took me quite a while to figure out how to best fix this. I'll send the fix shortly (mm: Fix warning in insert_pfn()). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional 2018-08-13 10:12 ` Jan Kara 2018-08-13 12:46 ` Theodore Y. Ts'o 2018-08-24 15:44 ` Jan Kara @ 2018-08-27 16:09 ` Jan Kara 2 siblings, 0 replies; 24+ messages in thread From: Jan Kara @ 2018-08-27 16:09 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: sandeen, Jan Kara, Darrick J. Wong, linux-nvdimm, Dave Chinner, Christoph Hellwig, Lukas Czerner, linux-ext4 On Mon 13-08-18 12:12:52, Jan Kara wrote: > On Fri 10-08-18 22:10:53, Theodore Y. Ts'o wrote: > > On Fri, Aug 10, 2018 at 04:33:49PM -0400, Theodore Y. Ts'o wrote: > > > I just kicked off a DAX test ("gce-xfstests -c dax -g auto") with > > > CONFIG_KASAN disabled, and I expect it shouldn't show up anything > > > concerning. So assuming nothing surprising pops up, yes it should be > > > merged at the next merge window. > > > > ... and here are the results. The first is 4.17, and the second is > > the ext4 git tree: > > > > ext4/dax: 488 tests, 4 failures, 97 skipped, 2647 seconds > > Failures: ext4/033 generic/344 generic/491 generic/503 > > > > ext4/dax: 488 tests, 3 failures, 97 skipped, 2637 seconds > > Failures: generic/081 generic/344 generic/388 > > > > The generic/388 failure is a known flake (shutdown stress test). > > > > The generic/081 regression appears to be a device-mapper issue: > > > > generic/081 [22:06:33][ 15.079661] run fstests generic/081 at 2018-08-10 22:06:33 > > [ 15.795745] device-mapper: ioctl: can't change device type (old=4 vs new=1) after initial table load. > > [failed, exit status 1] [22:06:36]- output mismatch (see /results/ext4/results-dax/generic/081.out.bad) > > --- tests/generic/081.out 2018-08-09 18:00:42.000000000 -0400 > > +++ /results/ext4/results-dax/generic/081.out.bad 2018-08-10 22:06:36.440005460 -0400 > > @@ -1,2 +1,4 @@ > > QA output created by 081 > > Silence is golden > > +Failed to create snapshot > > +(see /results/ext4/results-dax/generic/081.full for details) > > ... 
> > (Run 'diff -u tests/generic/081.out /results/ext4/results-dax/generic/081.out.bad' to see the entire diff) > > I'll see if this reproduces for me. Doesn't seem to be related to the DAX > patches you carry though. This is caused by commit dbc626597c39b "dm: prevent DAX mounts if not supported". I've started a discussion with the relevant developers about how to fix this. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* [PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler 2018-07-10 19:10 ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler @ 2018-07-10 19:10 ` Ross Zwisler 2018-07-11 8:17 ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara 2018-07-31 19:44 ` Ross Zwisler 3 siblings, 0 replies; 24+ messages in thread From: Ross Zwisler @ 2018-07-10 19:10 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm, Jeff Moyer, linux-ext4, Lukas Czerner Follow the lead of xfs_break_dax_layouts() and add synchronization between operations in ext4 which remove blocks from an inode (hole punch, truncate down, etc.) and pages which are pinned due to DAX DMA operations. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Lukas Czerner <lczerner@redhat.com> --- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 17 +++++++++++++++++ fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/truncate.h | 4 ++++ 4 files changed, 68 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0b127853c584..34bccd64d83d 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); extern int ext4_inode_attach_jinode(struct inode *inode); extern int ext4_can_truncate(struct inode *inode); extern int ext4_truncate(struct inode *); +extern int ext4_break_layouts(struct inode *); extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); extern void ext4_set_inode_flags(struct inode *); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 0057fe3f248d..b8161e6b88d1 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4820,6 +4820,13 @@ static long 
ext4_zero_range(struct file *file, loff_t offset, * released from page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) { + up_write(&EXT4_I(inode)->i_mmap_sem); + goto out_mutex; + } + ret = ext4_update_disksize_before_punch(inode, offset, len); if (ret) { up_write(&EXT4_I(inode)->i_mmap_sem); @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_mmap; + /* * Need to round down offset to be aligned with page size boundary * for page size > block size. @@ -5641,6 +5653,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_mmap; + /* * Need to round down to align start offset to page size boundary * for page size > block size. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 2ea07efbe016..fadb8ecacb1e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, return 0; } +static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) +{ + *did_unlock = true; + up_write(&ei->i_mmap_sem); + schedule(); + down_write(&ei->i_mmap_sem); +} + +int ext4_break_layouts(struct inode *inode) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + struct page *page; + bool retry; + int error; + + if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) + return -EINVAL; + + do { + retry = false; + page = dax_layout_busy_page(inode->i_mapping); + if (!page) + return 0; + + error = ___wait_var_event(&page->_refcount, + atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + ext4_wait_dax_page(ei, &retry)); + } while (error == 0 && retry); + + return error; +} + /* * ext4_punch_hole: punches a hole in a file by releasing the blocks * 
associated with the given offset and length @@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_dio; + first_block_offset = round_up(offset, sb->s_blocksize); last_block_offset = round_down((offset + length), sb->s_blocksize) - 1; @@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) ext4_wait_for_tail_page_commit(inode); } down_write(&EXT4_I(inode)->i_mmap_sem); + + rc = ext4_break_layouts(inode); + if (rc) { + up_write(&EXT4_I(inode)->i_mmap_sem); + error = rc; + goto err_out; + } + /* * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h index 0cb13badf473..bcbe3668c1d4 100644 --- a/fs/ext4/truncate.h +++ b/fs/ext4/truncate.h @@ -11,6 +11,10 @@ */ static inline void ext4_truncate_failed_write(struct inode *inode) { + /* + * We don't need to call ext4_break_layouts() because the blocks we + * are truncating were never visible to userspace. + */ down_write(&EXT4_I(inode)->i_mmap_sem); truncate_inode_pages(inode->i_mapping, inode->i_size); ext4_truncate(inode); -- 2.14.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler 2018-07-10 19:10 ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler 2018-07-10 19:10 ` [PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler @ 2018-07-11 8:17 ` Jan Kara 2018-07-11 15:41 ` Ross Zwisler 2018-07-25 22:28 ` Ross Zwisler 2018-07-31 19:44 ` Ross Zwisler 3 siblings, 2 replies; 24+ messages in thread From: Jan Kara @ 2018-07-11 8:17 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > Changes since v3: > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > that the {ext4,xfs}_break_layouts() calls have the same meaning. > (Dave, Darrick and Jan) How about the occasional WARN_ON_ONCE() hits you mention below? Were you able to hunt them down? Honza > --- > > This series from Dan: > > https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html > > added synchronization between DAX dma and truncate/hole-punch in XFS. > This short series adds analogous support to ext4. > > I've added calls to ext4_break_layouts() everywhere that ext4 removes > blocks from an inode's map. > > The timings in XFS are such that it's difficult to hit this race. Dan > was able to show the race by manually introducing delays in the direct > I/O path. > > For ext4, though, it's trivial to hit this race, and a hit will result in > a trigger of this WARN_ON_ONCE() in dax_disassociate_entry(): > > WARN_ON_ONCE(trunc && page_ref_count(page) > 1); > > I've made an xfstest which tests all the paths where we now call > ext4_break_layouts(). Each of the four paths easily hits this race many > times in my test setup with the xfstest. 
You can find that test here: > > https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html > > With these patches applied, I've still seen occasional hits of the above > WARN_ON_ONCE(), which tells me that we still have some work to do. I'll > continue looking at these more rare hits. > > Ross Zwisler (2): > dax: dax_layout_busy_page() warn on !exceptional > ext4: handle layout changes to pinned DAX mappings > > fs/dax.c | 10 +++++++++- > fs/ext4/ext4.h | 1 + > fs/ext4/extents.c | 17 +++++++++++++++++ > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext4/truncate.h | 4 ++++ > 5 files changed, 77 insertions(+), 1 deletion(-) > > -- > 2.14.4 > -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-11 8:17 ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara @ 2018-07-11 15:41 ` Ross Zwisler 2018-07-25 22:28 ` Ross Zwisler 1 sibling, 0 replies; 24+ messages in thread From: Ross Zwisler @ 2018-07-11 15:41 UTC (permalink / raw) To: Jan Kara Cc: linux-nvdimm, Darrick J. Wong, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: > On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > > Changes since v3: > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > > that the {ext4,xfs}_break_layouts() calls have the same meaning. > > (Dave, Darrick and Jan) > > How about the occasional WARN_ON_ONCE you mention below. Were you able to > hunt them down? That's next on my list of things to look at. :) _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-11 8:17 ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara 2018-07-11 15:41 ` Ross Zwisler @ 2018-07-25 22:28 ` Ross Zwisler 2018-07-27 16:28 ` Ross Zwisler 1 sibling, 1 reply; 24+ messages in thread From: Ross Zwisler @ 2018-07-25 22:28 UTC (permalink / raw) To: Jan Kara Cc: linux-nvdimm, Darrick J. Wong, Dave Chinner, Lukas Czerner, linux-ext4, Christoph Hellwig On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: > On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > > Changes since v3: > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > > that the {ext4,xfs}_break_layouts() calls have the same meaning. > > (Dave, Darrick and Jan) > > How about the occasional WARN_ON_ONCE you mention below. Were you able to > hunt them down? The root cause of this issue is that while the ei->i_mmap_sem provides synchronization between ext4_break_layouts() and page faults, it doesn't synchronize us with the direct I/O path. This exact same issue exists in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK. This allows the direct I/O path to do I/O and raise & lower page->_refcount while we're executing a truncate/hole punch. This leads to us trying to free a page with an elevated refcount. Here's one instance of the race: CPU 0 CPU 1 ----- ----- ext4_punch_hole() ext4_break_layouts() # all pages have refcount=1 ext4_direct_IO() ... lots of layers ... follow_page_pte() get_page() # elevates refcount truncate_pagecache_range() ... a few layers ... dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE() A similar race occurs when the refcount is being dropped while we're running ext4_break_layouts(), and this is the one that my test was actually hitting: CPU 0 CPU 1 ----- ----- ext4_direct_IO() ... lots of layers ...
follow_page_pte() get_page() # elevates refcount of page X ext4_punch_hole() ext4_break_layouts() # two pages, X & Y, have refcount == 2 __wait_var_event() # called for page X __put_devmap_managed_page() # drops refcount of X to 1 # __wait_var_event() checks X's refcount in "if (condition)", and breaks. # We never actually called ext4_wait_dax_page(), so 'retry' in # ext4_break_layouts() is still false. Exit do/while loop in # ext4_break_layouts, never attempting to wait on page Y which still has an # elevated refcount of 2. truncate_pagecache_range() ... a few layers ... dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE() This second race can be fixed with the patch at the end of this mail, which I think should go in, unless there is a benefit to the current retry scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()? With this patch applied I've been able to run my unit test through thousands of iterations, where it used to fail consistently within 10 or so. Even so, I wonder if the real solution is to add synchronization between the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how this should be handled? --- >8 --- From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001 From: Ross Zwisler <zwisler@kernel.org> Date: Wed, 25 Jul 2018 16:16:05 -0600 Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts() If the refcount of a page is lowered between the time that it is returned by dax_layout_busy_page() and when the refcount is again checked in ext4_break_layouts() => ___wait_var_event(), the waiting function ext4_wait_dax_page() will never be called. This means that ext4_break_layouts() will still have 'retry' set to false, so we'll stop looping and never check the refcount of other pages in this inode. Instead, always continue looping as long as dax_layout_busy_page() gives us a page which it found with an elevated refcount. 
Note that this works around the race exposed by my unit test, but I think that there is another race that needs to be addressed, probably with additional synchronization added between direct I/O and {ext4,xfs}_break_layouts(). Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- fs/ext4/inode.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 27e9643bc13b..fedb88104bbf 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, return 0; } -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) +static void ext4_wait_dax_page(struct ext4_inode_info *ei) { - *did_unlock = true; up_write(&ei->i_mmap_sem); schedule(); down_write(&ei->i_mmap_sem); @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode) { struct ext4_inode_info *ei = EXT4_I(inode); struct page *page; - bool retry; int error; if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) return -EINVAL; do { - retry = false; page = dax_layout_busy_page(inode->i_mapping); if (!page) return 0; @@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode) error = ___wait_var_event(&page->_refcount, atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, 0, 0, - ext4_wait_dax_page(ei, &retry)); - } while (error == 0 && retry); + ext4_wait_dax_page(ei)); + } while (error == 0); return error; } -- 2.14.4 _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply related [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-25 22:28 ` Ross Zwisler @ 2018-07-27 16:28 ` Ross Zwisler 2018-08-06 3:55 ` Dave Chinner 2018-08-07 8:45 ` Jan Kara 0 siblings, 2 replies; 24+ messages in thread From: Ross Zwisler @ 2018-07-27 16:28 UTC (permalink / raw) To: Jan Kara, linux-nvdimm, darrick.wong, Dave Chinner, lczerner, linux-ext4, Christoph Hellwig, linux-fsdevel, linux-xfs + fsdevel and the xfs list. On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: > > On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > > > Changes since v3: > > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > > > that the {ext4,xfs}_break_layouts() calls have the same meaning. > > > (Dave, Darrick and Jan) > > > > How about the occasional WARN_ON_ONCE you mention below. Were you able to > > hunt them down? > > The root cause of this issue is that while the ei->i_mmap_sem provides > synchronization between ext4_break_layouts() and page faults, it doesn't > provide synchronize us with the direct I/O path. This exact same issue exists > in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK. > > This allows the direct I/O path to do I/O and raise & lower page->_refcount > while we're executing a truncate/hole punch. This leads to us trying to free > a page with an elevated refcount. > > Here's one instance of the race: > > CPU 0 CPU 1 > ----- ----- > ext4_punch_hole() > ext4_break_layouts() # all pages have refcount=1 > > ext4_direct_IO() > ... lots of layers ... > follow_page_pte() > get_page() # elevates refcount > > truncate_pagecache_range() > ... a few layers ... 
> dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE() > > A similar race occurs when the refcount is being dropped while we're running > ext4_break_layouts(), and this is the one that my test was actually hitting: > > CPU 0 CPU 1 > ----- ----- > ext4_direct_IO() > ... lots of layers ... > follow_page_pte() > get_page() > # elevates refcount of page X > ext4_punch_hole() > ext4_break_layouts() # two pages, X & Y, have refcount == 2 > __wait_var_event() # called for page X > > __put_devmap_managed_page() > # drops refcount of X to 1 > > # __wait_var_events() checks X's refcount in "if (condition)", and breaks. > # We never actually called ext4_wait_dax_page(), so 'retry' in > # ext4_break_layouts() is still false. Exit do/while loop in > # ext4_break_layouts, never attempting to wait on page Y which still has an > # elevated refcount of 2. > > truncate_pagecache_range() > ... a few layers ... > dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE() > > This second race can be fixed with the patch at the end of this function, > which I think should go in, unless there is a benfit to the current retry > scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()? > With this patch applied I've been able to run my unit test through > thousands of iterations, where it used to failed consistently within 10 or > so. > > Even so, I wonder if the real solution is to add synchronization between > the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how > this should be handled? 
> > --- >8 --- > > From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001 > From: Ross Zwisler <zwisler@kernel.org> > Date: Wed, 25 Jul 2018 16:16:05 -0600 > Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts() > > If the refcount of a page is lowered between the time that it is returned > by dax_busy_page() and when the refcount is again checked in > ext4_break_layouts() => ___wait_var_event(), the waiting function > ext4_wait_dax_page() will never be called. This means that > ext4_break_layouts() will still have 'retry' set to false, so we'll stop > looping and never check the refcount of other pages in this inode. > > Instead, always continue looping as long as dax_layout_busy_page() gives us > a page which it found with an elevated refcount. > > Note that this works around the race exposed by my unit test, but I think > that there is another race that needs to be addressed, probably with > additional synchronization added between direct I/O and > {ext4,xfs}_break_layouts(). 
> > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > fs/ext4/inode.c | 9 +++------ > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 27e9643bc13b..fedb88104bbf 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, > return 0; > } > > -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) > +static void ext4_wait_dax_page(struct ext4_inode_info *ei) > { > - *did_unlock = true; > up_write(&ei->i_mmap_sem); > schedule(); > down_write(&ei->i_mmap_sem); > @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode) > { > struct ext4_inode_info *ei = EXT4_I(inode); > struct page *page; > - bool retry; > int error; > > if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) > return -EINVAL; > > do { > - retry = false; > page = dax_layout_busy_page(inode->i_mapping); > if (!page) > return 0; > @@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode) > error = ___wait_var_event(&page->_refcount, > atomic_read(&page->_refcount) == 1, > TASK_INTERRUPTIBLE, 0, 0, > - ext4_wait_dax_page(ei, &retry)); > - } while (error == 0 && retry); > + ext4_wait_dax_page(ei)); > + } while (error == 0); > > return error; > } _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-27 16:28 ` Ross Zwisler @ 2018-08-06 3:55 ` Dave Chinner 2018-08-06 15:49 ` Christoph Hellwig 2018-08-07 8:45 ` Jan Kara 1 sibling, 1 reply; 24+ messages in thread From: Dave Chinner @ 2018-08-06 3:55 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, darrick.wong, linux-nvdimm, linux-xfs, linux-fsdevel, lczerner, linux-ext4, Christoph Hellwig On Fri, Jul 27, 2018 at 10:28:51AM -0600, Ross Zwisler wrote: > + fsdevel and the xfs list. > > On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: > > > On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > > > > Changes since v3: > > > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > > > > that the {ext4,xfs}_break_layouts() calls have the same meaning. > > > > (Dave, Darrick and Jan) > > > > > > How about the occasional WARN_ON_ONCE you mention below. Were you able to > > > hunt them down? > > > > The root cause of this issue is that while the ei->i_mmap_sem provides > > synchronization between ext4_break_layouts() and page faults, it doesn't > > provide synchronize us with the direct I/O path. This exact same issue exists > > in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK. That's not correct for XFS. Truncate/holepunch in XFS hold *both* the MMAPLOCK and the IOLOCK in exclusive mode. Holding the MMAPLOCK serialises against page faults, holding the IOLOCK serialises against buffered IO and Direct IO submission. The inode_dio_wait() call that truncate/holepunch does after gaining the IOLOCK ensures that we wait for all direct IO in progress to complete (e.g. AIO) before the truncate/holepunch goes ahead. e.g: STATIC long xfs_file_fallocate( struct file *file, int mode, loff_t offset, loff_t len) { ..... uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL; ..... 
xfs_ilock(ip, iolock); error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP); ..... if (mode & FALLOC_FL_PUNCH_HOLE) { error = xfs_free_file_space(ip, offset, len); and then xfs_free_file_space calls xfs_flush_unmap_range() which waits for direct IO to complete, flushes all the dirty cached pages and then invalidates the page cache before doing any extent manipulation. There cannot be any IO or page faults in progress after this point..... truncate (through xfs_vn_setattr()) essentially does the same thing, except that it's called with the IOLOCK already held exclusively (remember the IOLOCK is the vfs inode->i_rwsem). > > This allows the direct I/O path to do I/O and raise & lower page->_refcount > > while we're executing a truncate/hole punch. This leads to us trying to free > > a page with an elevated refcount. I don't see how this is possible in XFS - maybe I'm missing something, but "direct IO submission during truncate" is not something that should ever be happening in XFS, DAX or not. Cheers, Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-08-06 3:55 ` Dave Chinner @ 2018-08-06 15:49 ` Christoph Hellwig 2018-08-06 22:25 ` Dave Chinner 0 siblings, 1 reply; 24+ messages in thread From: Christoph Hellwig @ 2018-08-06 15:49 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-nvdimm, darrick.wong, linux-xfs, Ross Zwisler, linux-fsdevel, lczerner, linux-ext4, Christoph Hellwig > > > This allows the direct I/O path to do I/O and raise & lower page->_refcount > > > while we're executing a truncate/hole punch. This leads to us trying to free > > > a page with an elevated refcount. > > I don't see how this is possible in XFS - maybe I'm missing > something, but "direct IO submission during truncate" is not > something that should ever be happening in XFS, DAX or not. The pages involved in a direct I/O are not that of the file that the direct I/O read/write syscalls are called on, but those of the memory regions the direct I/O read/write syscalls operate on. Those pages could be file backed and undergo a truncate at the same time. _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-08-06 15:49 ` Christoph Hellwig @ 2018-08-06 22:25 ` Dave Chinner 0 siblings, 0 replies; 24+ messages in thread From: Dave Chinner @ 2018-08-06 22:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, linux-nvdimm, darrick.wong, linux-xfs, Ross Zwisler, linux-fsdevel, lczerner, linux-ext4 On Mon, Aug 06, 2018 at 05:49:43PM +0200, Christoph Hellwig wrote: > > > > This allows the direct I/O path to do I/O and raise & lower page->_refcount > > > > while we're executing a truncate/hole punch. This leads to us trying to free > > > > a page with an elevated refcount. > > > > I don't see how this is possible in XFS - maybe I'm missing > > something, but "direct IO submission during truncate" is not > > something that should ever be happening in XFS, DAX or not. > > The pages involved in a direct I/O are not that of the file that > the direct I/O read/write syscalls are called on, but those of the > memory regions the direct I/O read/write syscalls operate on. > Those pages could be file backed and undergo a truncate at the > same time. So let me get this straight. First, mmap() file A, then fault it all in, then use the mmapped range of file A as the user buffer for direct IO to file B, then concurrently truncate file A down so the destination buffer for the file B dio will be beyond EOF and so we need to invalidate it. But waiting for gup references in truncate can race with other new page references via gup because gup does not serialise access to the file backed pages in any way? i.e. we hold no fs locks at all on file A when gup takes page references during direct IO to file B unless we have to fault in the page. this doesn't seem like a problem that the filesystem can solve, but it does indicate to me a potential solution. i.e. we take the MMAPLOCK during page faults, and so we can use that to serialise gup against the invalidation in progress on file A. i.e. 
it would seem to me that gup needs to refault file-backed pages rather than just blindly take a reference to them so that it triggers serialisation of the page references against in-progress invalidation operations. Thoughts? -Dave. -- Dave Chinner david@fromorbit.com _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-07-27 16:28 ` Ross Zwisler 2018-08-06 3:55 ` Dave Chinner @ 2018-08-07 8:45 ` Jan Kara 2018-09-10 22:18 ` Eric Sandeen 1 sibling, 1 reply; 24+ messages in thread From: Jan Kara @ 2018-08-07 8:45 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, darrick.wong, linux-nvdimm, Dave Chinner, linux-xfs, linux-fsdevel, lczerner, linux-ext4, Christoph Hellwig On Fri 27-07-18 10:28:51, Ross Zwisler wrote: > + fsdevel and the xfs list. > > On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: > > > On Tue 10-07-18 13:10:29, Ross Zwisler wrote: > > > > Changes since v3: > > > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure > > > > that the {ext4,xfs}_break_layouts() calls have the same meaning. > > > > (Dave, Darrick and Jan) > > > > > > How about the occasional WARN_ON_ONCE you mention below. Were you able to > > > hunt them down? > > > > The root cause of this issue is that while the ei->i_mmap_sem provides > > synchronization between ext4_break_layouts() and page faults, it doesn't > > provide synchronize us with the direct I/O path. This exact same issue exists > > in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK. > > > > This allows the direct I/O path to do I/O and raise & lower page->_refcount > > while we're executing a truncate/hole punch. This leads to us trying to free > > a page with an elevated refcount. > > > > Here's one instance of the race: > > > > CPU 0 CPU 1 > > ----- ----- > > ext4_punch_hole() > > ext4_break_layouts() # all pages have refcount=1 > > > > ext4_direct_IO() > > ... lots of layers ... > > follow_page_pte() > > get_page() # elevates refcount > > > > truncate_pagecache_range() > > ... a few layers ... > > dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE() > > So this is a very different race from the one below. 
And it should be impossible to happen. This race is exactly the reason why dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race. > > A similar race occurs when the refcount is being dropped while we're running > > ext4_break_layouts(), and this is the one that my test was actually hitting: > > > > CPU 0 CPU 1 > > ----- ----- > > ext4_direct_IO() > > ... lots of layers ... > > follow_page_pte() > > get_page() > > # elevates refcount of page X > > ext4_punch_hole() > > ext4_break_layouts() # two pages, X & Y, have refcount == 2 > > __wait_var_event() # called for page X > > > > __put_devmap_managed_page() > > # drops refcount of X to 1 > > > > # __wait_var_events() checks X's refcount in "if (condition)", and breaks. > > # We never actually called ext4_wait_dax_page(), so 'retry' in > > # ext4_break_layouts() is still false. Exit do/while loop in > > # ext4_break_layouts, never attempting to wait on page Y which still has an > > # elevated refcount of 2. > > > > truncate_pagecache_range() > > ... a few layers ... > > dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE() > > > > This second race can be fixed with the patch at the end of this function, > > which I think should go in, unless there is a benfit to the current retry > > scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()? > > With this patch applied I've been able to run my unit test through > > thousands of iterations, where it used to failed consistently within 10 or > > so. > > > > Even so, I wonder if the real solution is to add synchronization between > > the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how > > this should be handled? 
> > > > --- >8 --- > > > > From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001 > > From: Ross Zwisler <zwisler@kernel.org> > > Date: Wed, 25 Jul 2018 16:16:05 -0600 > > Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts() > > > > If the refcount of a page is lowered between the time that it is returned > > by dax_busy_page() and when the refcount is again checked in > > ext4_break_layouts() => ___wait_var_event(), the waiting function > > ext4_wait_dax_page() will never be called. This means that > > ext4_break_layouts() will still have 'retry' set to false, so we'll stop > > looping and never check the refcount of other pages in this inode. > > > > Instead, always continue looping as long as dax_layout_busy_page() gives us > > a page which it found with an elevated refcount. > > > > Note that this works around the race exposed by my unit test, but I think > > that there is another race that needs to be addressed, probably with > > additional synchronization added between direct I/O and > > {ext4,xfs}_break_layouts(). > > > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> OK, this is a good catch and the patch looks good. You can add: Reviewed-by: Jan Kara <jack@suse.cz> Also please post this fix officially to Ted to include it in his tree (I can see that he has all your other patches queued for the merge window). AFAICS XFS implementation of xfs_break_dax_layouts() has the same problem. Can you please fix it? Thanks for working on this! 
Honza > > --- > > fs/ext4/inode.c | 9 +++------ > > 1 file changed, 3 insertions(+), 6 deletions(-) > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 27e9643bc13b..fedb88104bbf 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, > > return 0; > > } > > > > -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) > > +static void ext4_wait_dax_page(struct ext4_inode_info *ei) > > { > > - *did_unlock = true; > > up_write(&ei->i_mmap_sem); > > schedule(); > > down_write(&ei->i_mmap_sem); > > @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode) > > { > > struct ext4_inode_info *ei = EXT4_I(inode); > > struct page *page; > > - bool retry; > > int error; > > > > if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) > > return -EINVAL; > > > > do { > > - retry = false; > > page = dax_layout_busy_page(inode->i_mapping); > > if (!page) > > return 0; > > @@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode) > > error = ___wait_var_event(&page->_refcount, > > atomic_read(&page->_refcount) == 1, > > TASK_INTERRUPTIBLE, 0, 0, > > - ext4_wait_dax_page(ei, &retry)); > > - } while (error == 0 && retry); > > + ext4_wait_dax_page(ei)); > > + } while (error == 0); > > > > return error; > > } -- Jan Kara <jack@suse.com> SUSE Labs, CR _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch 2018-08-07 8:45 ` Jan Kara @ 2018-09-10 22:18 ` Eric Sandeen 2018-09-11 15:14 ` Jan Kara 0 siblings, 1 reply; 24+ messages in thread From: Eric Sandeen @ 2018-09-10 22:18 UTC (permalink / raw) To: Jan Kara, Ross Zwisler Cc: Theodore Ts'o, darrick.wong, linux-nvdimm, linux-fsdevel, lczerner, linux-ext4 On 8/7/18 3:45 AM, Jan Kara wrote: > On Fri 27-07-18 10:28:51, Ross Zwisler wrote: >> + fsdevel and the xfs list. >> >> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler >> <ross.zwisler@linux.intel.com> wrote: >>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote: >>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote: >>>>> Changes since v3: >>>>> * Added an ext4_break_layouts() call to ext4_insert_range() to ensure >>>>> that the {ext4,xfs}_break_layouts() calls have the same meaning. >>>>> (Dave, Darrick and Jan) >>>> >>>> How about the occasional WARN_ON_ONCE you mention below. Were you able to >>>> hunt them down? >>> >>> The root cause of this issue is that while the ei->i_mmap_sem provides >>> synchronization between ext4_break_layouts() and page faults, it doesn't >>> provide synchronize us with the direct I/O path. This exact same issue exists >>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK. >>> >>> This allows the direct I/O path to do I/O and raise & lower page->_refcount >>> while we're executing a truncate/hole punch. This leads to us trying to free >>> a page with an elevated refcount. >>> >>> Here's one instance of the race: >>> >>> CPU 0 CPU 1 >>> ----- ----- >>> ext4_punch_hole() >>> ext4_break_layouts() # all pages have refcount=1 >>> >>> ext4_direct_IO() >>> ... lots of layers ... >>> follow_page_pte() >>> get_page() # elevates refcount >>> >>> truncate_pagecache_range() >>> ... a few layers ... >>> dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE() >>> > > So this is a very different race from the one below. 
And it should be > impossible to happen. This race is exactly the reason why > dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault > which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race. > >>> A similar race occurs when the refcount is being dropped while we're running >>> ext4_break_layouts(), and this is the one that my test was actually hitting: >>> >>> CPU 0 CPU 1 >>> ----- ----- >>> ext4_direct_IO() >>> ... lots of layers ... >>> follow_page_pte() >>> get_page() >>> # elevates refcount of page X >>> ext4_punch_hole() >>> ext4_break_layouts() # two pages, X & Y, have refcount == 2 >>> __wait_var_event() # called for page X >>> >>> __put_devmap_managed_page() >>> # drops refcount of X to 1 >>> >>> # __wait_var_events() checks X's refcount in "if (condition)", and breaks. >>> # We never actually called ext4_wait_dax_page(), so 'retry' in >>> # ext4_break_layouts() is still false. Exit do/while loop in >>> # ext4_break_layouts, never attempting to wait on page Y which still has an >>> # elevated refcount of 2. >>> >>> truncate_pagecache_range() >>> ... a few layers ... >>> dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE() >>> >>> This second race can be fixed with the patch at the end of this function, >>> which I think should go in, unless there is a benfit to the current retry >>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()? >>> With this patch applied I've been able to run my unit test through >>> thousands of iterations, where it used to failed consistently within 10 or >>> so. >>> >>> Even so, I wonder if the real solution is to add synchronization between >>> the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how >>> this should be handled? 
>>> >>> --- >8 --- >>> >>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001 >>> From: Ross Zwisler <zwisler@kernel.org> >>> Date: Wed, 25 Jul 2018 16:16:05 -0600 >>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts() >>> >>> If the refcount of a page is lowered between the time that it is returned >>> by dax_busy_page() and when the refcount is again checked in >>> ext4_break_layouts() => ___wait_var_event(), the waiting function >>> ext4_wait_dax_page() will never be called. This means that >>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop >>> looping and never check the refcount of other pages in this inode. >>> >>> Instead, always continue looping as long as dax_layout_busy_page() gives us >>> a page which it found with an elevated refcount. >>> >>> Note that this works around the race exposed by my unit test, but I think >>> that there is another race that needs to be addressed, probably with >>> additional synchronization added between direct I/O and >>> {ext4,xfs}_break_layouts(). >>> >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > > OK, this is a good catch and the patch looks good. You can add: > > Reviewed-by: Jan Kara <jack@suse.cz> > > Also please post this fix officially to Ted to include it in his tree (I > can see that he has all your other patches queued for the merge window). Did these ever get on Ted's radar? I don't see it upstream yet. Thanks, -Eric _______________________________________________ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm ^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
  2018-09-10 22:18 ` Eric Sandeen
@ 2018-09-11 15:14   ` Jan Kara
  2018-09-11 15:20     ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2018-09-11 15:14 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-nvdimm, darrick.wong, Theodore Ts'o,
	Ross Zwisler, linux-fsdevel, lczerner, linux-ext4

On Mon 10-09-18 17:18:49, Eric Sandeen wrote:
> On 8/7/18 3:45 AM, Jan Kara wrote:
> > On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
> >> + fsdevel and the xfs list.
> >>
> >> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
> >> <ross.zwisler@linux.intel.com> wrote:
> >>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> >>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> >>>>> Changes since v3:
> >>>>>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> >>>>>    that the {ext4,xfs}_break_layouts() calls have the same meaning.
> >>>>>    (Dave, Darrick and Jan)
> >>>>
> >>>> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> >>>> hunt them down?
> >>>
> >>> The root cause of this issue is that while the ei->i_mmap_sem provides
> >>> synchronization between ext4_break_layouts() and page faults, it doesn't
> >>> synchronize us with the direct I/O path. This exact same issue exists
> >>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
> >>>
> >>> This allows the direct I/O path to do I/O and raise & lower page->_refcount
> >>> while we're executing a truncate/hole punch. This leads to us trying to free
> >>> a page with an elevated refcount.
> >>>
> >>> Here's one instance of the race:
> >>>
> >>> CPU 0                                   CPU 1
> >>> -----                                   -----
> >>> ext4_punch_hole()
> >>>     ext4_break_layouts() # all pages have refcount=1
> >>>
> >>>                                         ext4_direct_IO()
> >>>                                             ... lots of layers ...
> >>>                                             follow_page_pte()
> >>>                                                 get_page() # elevates refcount
> >>>
> >>> truncate_pagecache_range()
> >>>     ... a few layers ...
> >>>     dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
> >>>
> >
> > So this is a very different race from the one below. And it should be
> > impossible to happen. This race is exactly the reason why
> > dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault
> > which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race.
> >
> >>> A similar race occurs when the refcount is being dropped while we're running
> >>> ext4_break_layouts(), and this is the one that my test was actually hitting:
> >>>
> >>> CPU 0                                   CPU 1
> >>> -----                                   -----
> >>> ext4_direct_IO()
> >>>     ... lots of layers ...
> >>>     follow_page_pte()
> >>>         get_page()
> >>>         # elevates refcount of page X
> >>>                                         ext4_punch_hole()
> >>>                                             ext4_break_layouts() # two pages, X & Y, have refcount == 2
> >>>                                                 __wait_var_event() # called for page X
> >>>
> >>> __put_devmap_managed_page()
> >>> # drops refcount of X to 1
> >>>
> >>> # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
> >>> # We never actually called ext4_wait_dax_page(), so 'retry' in
> >>> # ext4_break_layouts() is still false. Exit do/while loop in
> >>> # ext4_break_layouts, never attempting to wait on page Y which still has an
> >>> # elevated refcount of 2.
> >>>
> >>>                                         truncate_pagecache_range()
> >>>                                             ... a few layers ...
> >>>                                             dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
> >>>
> >>> This second race can be fixed with the patch at the end of this mail,
> >>> which I think should go in, unless there is a benefit to the current retry
> >>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> >>> With this patch applied I've been able to run my unit test through
> >>> thousands of iterations, where it used to fail consistently within 10 or
> >>> so.
> >>>
> >>> Even so, I wonder if the real solution is to add synchronization between
> >>> the direct I/O path and {ext4,xfs}_break_layouts()?
Other ideas on how
> >>> this should be handled?
> >>>
> >>> --- >8 ---
> >>>
> >>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> >>> From: Ross Zwisler <zwisler@kernel.org>
> >>> Date: Wed, 25 Jul 2018 16:16:05 -0600
> >>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
> >>>
> >>> If the refcount of a page is lowered between the time that it is returned
> >>> by dax_busy_page() and when the refcount is again checked in
> >>> ext4_break_layouts() => ___wait_var_event(), the waiting function
> >>> ext4_wait_dax_page() will never be called. This means that
> >>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> >>> looping and never check the refcount of other pages in this inode.
> >>>
> >>> Instead, always continue looping as long as dax_layout_busy_page() gives us
> >>> a page which it found with an elevated refcount.
> >>>
> >>> Note that this works around the race exposed by my unit test, but I think
> >>> that there is another race that needs to be addressed, probably with
> >>> additional synchronization added between direct I/O and
> >>> {ext4,xfs}_break_layouts().
> >>>
> >>> Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
> >
> > OK, this is a good catch and the patch looks good. You can add:
> >
> > Reviewed-by: Jan Kara <jack@suse.cz>
> >
> > Also please post this fix officially to Ted to include it in his tree (I
> > can see that he has all your other patches queued for the merge window).
>
> Did these ever get on Ted's radar?  I don't see it upstream yet.

Hum, it seems Ted never picked this patch up. I guess I'll gather the two
fixes you pointed out and resend them to Ted.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
  2018-09-11 15:14 ` Jan Kara
@ 2018-09-11 15:20   ` Jan Kara
  2018-09-11 17:28     ` Theodore Y. Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2018-09-11 15:20 UTC (permalink / raw)
  To: Eric Sandeen
  Cc: Jan Kara, linux-nvdimm, darrick.wong, Theodore Ts'o,
	Ross Zwisler, linux-fsdevel, lczerner, linux-ext4

On Tue 11-09-18 17:14:21, Jan Kara wrote:
> On Mon 10-09-18 17:18:49, Eric Sandeen wrote:
> > On 8/7/18 3:45 AM, Jan Kara wrote:
> > >
> > > OK, this is a good catch and the patch looks good. You can add:
> > >
> > > Reviewed-by: Jan Kara <jack@suse.cz>
> > >
> > > Also please post this fix officially to Ted to include it in his tree (I
> > > can see that he has all your other patches queued for the merge window).
> >
> > Did these ever get on Ted's radar?  I don't see it upstream yet.
>
> Hum, it seems Ted never picked this patch up. I guess I'll gather the two
> fixes you pointed out and resend them to Ted.

Actually only one patch when looking into it now.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
  2018-09-11 15:20 ` Jan Kara
@ 2018-09-11 17:28   ` Theodore Y. Ts'o
  2018-09-11 18:21     ` Eric Sandeen
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Y. Ts'o @ 2018-09-11 17:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-nvdimm, darrick.wong, Eric Sandeen, Ross Zwisler,
	linux-fsdevel, lczerner, linux-ext4

On Tue, Sep 11, 2018 at 05:20:24PM +0200, Jan Kara wrote:
> > Hum, it seems Ted never picked this patch up. I guess I'll gather the two
> > fixes you pointed out and resend them to Ted.
>
> Actually only one patch when looking into it now.

I believe both are in the ext4 git tree.

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
  2018-09-11 17:28 ` Theodore Y. Ts'o
@ 2018-09-11 18:21   ` Eric Sandeen
  0 siblings, 0 replies; 24+ messages in thread
From: Eric Sandeen @ 2018-09-11 18:21 UTC (permalink / raw)
  To: Theodore Y. Ts'o, Jan Kara
  Cc: linux-nvdimm, darrick.wong, Eric Sandeen, Ross Zwisler,
	linux-fsdevel, lczerner, linux-ext4

On 9/11/18 12:28 PM, Theodore Y. Ts'o wrote:
> On Tue, Sep 11, 2018 at 05:20:24PM +0200, Jan Kara wrote:
>>> Hum, it seems Ted never picked this patch up. I guess I'll gather the two
>>> fixes you pointed out and resend them to Ted.
>>
>> Actually only one patch when looking into it now.
>
> I believe both are in the ext4 git tree.
>
> - Ted

Ok, right:

dax: dax_layout_busy_page() warn on !exceptional
https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=cdbf8897cb09b7baf2b8a7e78051a35a872b01d5

ext4: handle layout changes to pinned DAX mappings
https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=430657b6be896db57d974375cc499ca212c7f01d

I think the one I meant to ask about was:

[PATCH v2 1/2] ext4: Close race between direct IO and ext4_break_layouts()
https://patchwork.kernel.org/patch/10560393/

Sorry for the confusion.

-Eric

^ permalink raw reply	[flat|nested] 24+ messages in thread
* Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch
  2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler
                   ` (2 preceding siblings ...)
  2018-07-11  8:17 ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara
@ 2018-07-31 19:44 ` Ross Zwisler
  3 siblings, 0 replies; 24+ messages in thread
From: Ross Zwisler @ 2018-07-31 19:44 UTC (permalink / raw)
  To: Ross Zwisler
  Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, Dave Chinner,
	Lukas Czerner, linux-ext4, Christoph Hellwig

On Tue, Jul 10, 2018 at 01:10:29PM -0600, Ross Zwisler wrote:
> Changes since v3:
>  * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
>    that the {ext4,xfs}_break_layouts() calls have the same meaning.
>    (Dave, Darrick and Jan)
>
> ---
>
> This series from Dan:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
>
> added synchronization between DAX dma and truncate/hole-punch in XFS.
> This short series adds analogous support to ext4.
>
> I've added calls to ext4_break_layouts() everywhere that ext4 removes
> blocks from an inode's map.
>
> The timings in XFS are such that it's difficult to hit this race. Dan
> was able to show the race by manually introducing delays in the direct
> I/O path.
>
> For ext4, though, it's trivial to hit this race, and a hit will result in
> a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():
>
> WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>
> I've made an xfstest which tests all the paths where we now call
> ext4_break_layouts(). Each of the four paths easily hits this race many
> times in my test setup with the xfstest. You can find that test here:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
>
> With these patches applied, I've still seen occasional hits of the above
> WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
> continue looking at these more rare hits.

One last ping on this - do we want to merge this for v4.19?
I've tracked down the more rare warnings, and have reported the race I'm
seeing here:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/017205.html

AFAICT the race exists equally for XFS and ext4, and I'm not sure how to
solve it easily. Essentially it seems like we need to synchronize not just
page faults but calls to get_page() with truncate/hole punch operations,
else we can have the reference count for a given DAX page going up and
down while we are in the middle of a truncate. I'm not sure if this is
even feasible.

The equivalent code for this series already exists in XFS, so taking the
series now gets ext4 and XFS on the same footing moving forward, so I
guess I'm in favor of merging it now, though I can see the argument that
it's not a complete solution.

Thoughts?

^ permalink raw reply	[flat|nested] 24+ messages in thread
end of thread, other threads:[~2018-09-11 18:21 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-10 19:10 [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler
2018-07-10 19:10 ` [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler
2018-08-10 19:52   ` Eric Sandeen
2018-08-10 20:33     ` Theodore Y. Ts'o
2018-08-11  2:10       ` Theodore Y. Ts'o
2018-08-13 10:12         ` Jan Kara
2018-08-13 12:46           ` Theodore Y. Ts'o
2018-08-24 15:44             ` Jan Kara
2018-08-27 16:09               ` Jan Kara
2018-07-10 19:10 ` [PATCH v4 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler
2018-07-11  8:17 ` [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch Jan Kara
2018-07-11 15:41   ` Ross Zwisler
2018-07-25 22:28     ` Ross Zwisler
2018-07-27 16:28       ` Ross Zwisler
2018-08-06  3:55         ` Dave Chinner
2018-08-06 15:49           ` Christoph Hellwig
2018-08-06 22:25             ` Dave Chinner
2018-08-07  8:45         ` Jan Kara
2018-09-10 22:18           ` Eric Sandeen
2018-09-11 15:14             ` Jan Kara
2018-09-11 15:20               ` Jan Kara
2018-09-11 17:28                 ` Theodore Y. Ts'o
2018-09-11 18:21                   ` Eric Sandeen
2018-07-31 19:44 ` Ross Zwisler