* find_get_entries_tag regression bisected
@ 2019-02-16  2:08 Dan Williams

From: Dan Williams
To: Matthew Wilcox
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

Hi Willy,

Piotr reports the following crash can be triggered on latest mainline:

EXT4-fs (pmem5): recovery complete
EXT4-fs (pmem5): mounted filesystem with ordered data mode. Opts: dax
------------[ cut here ]------------
kernel BUG at mm/pgtable-generic.c:127!
invalid opcode: 0000 [#1] SMP PTI
CPU: 28 PID: 1193 Comm: a.out Tainted: G W OE 4.19.0-rc5+ #2907
[..]
RIP: 0010:pmdp_huge_clear_flush+0x1e/0x80
[..]
Call Trace:
 dax_writeback_mapping_range+0x473/0x8a0
 ? print_shortest_lock_dependencies+0x40/0x40
 ? jbd2_journal_stop+0xef/0x470
 ? ext4_fill_super+0x3071/0x3110
 ? __lock_is_held+0x4f/0x90
 ? __lock_is_held+0x4f/0x90
 ext4_dax_writepages+0xed/0x2f0
 do_writepages+0x41/0xe0
 __filemap_fdatawrite_range+0xbe/0xf0
 file_write_and_wait_range+0x4c/0xa0
 ext4_sync_file+0xa6/0x4f0

I bisected this regression to commit c1901cd33cf4 "page cache: Convert
find_get_entries_tag to XArray". I suspect another case of pte vs pmd
confusion.

Below is the small reproducer from Piotr. It triggers in a qemu
environment with emulated pmem, and only with ext4 as far as I can see,
but I did not dig too deep. Let me know if anything jumps out to you;
otherwise I'll take a deeper look in the coming days.
#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <assert.h>

#define MB (1ULL << 20)

int main(int argc, char *argv[])
{
	int ret;
	int fd;
	off_t size = 2 * MB;
	char *path = argv[1];

	fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0666);
	assert(fd > 0);

	ret = ftruncate(fd, size);
	assert(ret == 0);

	char *addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	assert(addr != MAP_FAILED);

	memset((char*)addr, '0', 1);

	ret = msync(addr + 4096, 1, MS_SYNC);
	assert(ret == 0);

	close(fd);
	return 0;
}
* Re: find_get_entries_tag regression bisected
@ 2019-02-16 15:35 Matthew Wilcox

From: Matthew Wilcox
To: Dan Williams
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Fri, Feb 15, 2019 at 06:08:15PM -0800, Dan Williams wrote:
> Piotr reports the following crash can be triggered on latest mainline:
>
> [..]
>
> I bisected this regression to commit c1901cd33cf4 "page cache: Convert
> find_get_entries_tag to XArray". I suspect another case of pte vs pmd
> confusion.
>
> Below is the small reproducer from Piotr. It triggers in a qemu
> environment with emulated pmem, and only with ext4 as far as I can see,
> but I did not dig too deep. Let me know if anything jumps out to you;
> otherwise I'll take a deeper look in the coming days.

I think this is another case of the XArray and radix tree iterators
having different behaviour with multi-slot entries. The radix tree
iterator would rewind the index to the head index, while the XArray
iterator leaves it unchanged.
While the regression was introduced with the change to
find_get_entries_tag(), DAX now uses the xas_for_each_marked() iterator
directly, so changing the behaviour of find_get_entries_tag() would be
pointless. I think the simplest way to fix this is for DAX to set the
index appropriately. Maybe something like this?

diff --git a/fs/dax.c b/fs/dax.c
index 6959837cc465..43e3aad31885 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -843,7 +843,7 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		struct address_space *mapping, void *entry)
 {
-	unsigned long pfn;
+	unsigned long pfn, index;
 	long ret = 0;
 	size_t size;
 
@@ -894,16 +894,17 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	xas_unlock_irq(xas);
 
 	/*
-	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
-	 * in the middle of a PMD, the 'index' we are given will be aligned to
-	 * the start index of the PMD, as will the pfn we pull from 'entry'.
+	 * If dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given needs to be
+	 * aligned to the start index of the PMD.
 	 * This allows us to flush for PMD_SIZE and not have to worry about
 	 * partial PMD writebacks.
 	 */
 	pfn = dax_to_pfn(entry);
 	size = PAGE_SIZE << dax_entry_order(entry);
+	index = xas->xa_index &~ ((1UL << dax_entry_order(entry)) - 1);
 
-	dax_entry_mkclean(mapping, xas->xa_index, pfn);
+	dax_entry_mkclean(mapping, index, pfn);
 	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There

Another way to fix this would be to mask the address in
dax_entry_mkclean(), but I think this is cleaner.
* Re: find_get_entries_tag regression bisected
@ 2019-02-16 17:29 Matthew Wilcox

From: Matthew Wilcox
To: Dan Williams
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Sat, Feb 16, 2019 at 07:35:11AM -0800, Matthew Wilcox wrote:
> Another way to fix this would be to mask the address in
> dax_entry_mkclean(), but I think this is cleaner.

That's clearly rubbish, dax_entry_mkclean() can't possibly mask the
address. It might be mis-aligned in another process. But ... if it's
misaligned in another process, dax_entry_mkclean() will only clean the
first PTE associated with the PMD; it won't clean the whole thing.
I think we need something like this:

(I'll have to split it apart to give us something to backport)

diff --git a/fs/dax.c b/fs/dax.c
index 6959837cc465..09680aa0481f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -768,7 +768,7 @@ unsigned long pgoff_address(pgoff_t pgoff, struct vm_area_struct *vma)
 
 /* Walk all mappings of a given index of a file and writeprotect them */
 static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
-		unsigned long pfn)
+		pgoff_t end, unsigned long pfn)
 {
 	struct vm_area_struct *vma;
 	pte_t pte, *ptep = NULL;
@@ -776,7 +776,7 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 	spinlock_t *ptl;
 
 	i_mmap_lock_read(mapping);
-	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, index) {
+	vma_interval_tree_foreach(vma, &mapping->i_mmap, index, end) {
 		struct mmu_notifier_range range;
 		unsigned long address;
 
@@ -843,9 +843,9 @@ static void dax_entry_mkclean(struct address_space *mapping, pgoff_t index,
 static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 		struct address_space *mapping, void *entry)
 {
-	unsigned long pfn;
+	unsigned long pfn, index;
 	long ret = 0;
-	size_t size;
+	unsigned long count;
 
 	/*
 	 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -894,17 +894,18 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	xas_unlock_irq(xas);
 
 	/*
-	 * Even if dax_writeback_mapping_range() was given a wbc->range_start
-	 * in the middle of a PMD, the 'index' we are given will be aligned to
-	 * the start index of the PMD, as will the pfn we pull from 'entry'.
+	 * If dax_writeback_mapping_range() was given a wbc->range_start
+	 * in the middle of a PMD, the 'index' we are given needs to be
+	 * aligned to the start index of the PMD.
 	 * This allows us to flush for PMD_SIZE and not have to worry about
 	 * partial PMD writebacks.
 	 */
 	pfn = dax_to_pfn(entry);
-	size = PAGE_SIZE << dax_entry_order(entry);
+	count = 1UL << dax_entry_order(entry);
+	index = xas->xa_index &~ (count - 1);
 
-	dax_entry_mkclean(mapping, xas->xa_index, pfn);
-	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), size);
+	dax_entry_mkclean(mapping, index, index + count - 1, pfn);
+	dax_flush(dax_dev, page_address(pfn_to_page(pfn)), count * PAGE_SIZE);
 	/*
 	 * After we have flushed the cache, we can clear the dirty tag. There
 	 * cannot be new dirty data in the pfn after the flush has completed as
@@ -917,8 +918,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
 	xas_clear_mark(xas, PAGECACHE_TAG_DIRTY);
 	dax_wake_entry(xas, entry, false);
 
-	trace_dax_writeback_one(mapping->host, xas->xa_index,
-			size >> PAGE_SHIFT);
+	trace_dax_writeback_one(mapping->host, xas->xa_index, count);
 	return ret;
 
 put_unlocked:
* Re: find_get_entries_tag regression bisected
@ 2019-02-16 21:11 Matthew Wilcox

From: Matthew Wilcox
To: Dan Williams
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Sat, Feb 16, 2019 at 09:29:48AM -0800, Matthew Wilcox wrote:
> On Sat, Feb 16, 2019 at 07:35:11AM -0800, Matthew Wilcox wrote:
> > Another way to fix this would be to mask the address in
> > dax_entry_mkclean(), but I think this is cleaner.
>
> That's clearly rubbish, dax_entry_mkclean() can't possibly mask the
> address. It might be mis-aligned in another process. But ... if it's
> misaligned in another process, dax_entry_mkclean() will only clean the
> first PTE associated with the PMD; it won't clean the whole thing.
> I think we need something like this:

Nope, this isn't enough. It's _necessary_ to find the processes that
have part of this PMD page mapped, but not the start of it. But it's
not _sufficient_ because it'll still only mkclean the first PTE. So we
need a loop. I'm feeling a bit over my head here. I may have a go at a
fuller fix, but if someone else wants to have a go at it, be my guest!

(feeling massively unqualified for the task at hand ;-)
* Re: find_get_entries_tag regression bisected
@ 2019-02-26  5:03 Dan Williams

From: Dan Williams
To: Matthew Wilcox
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Sat, Feb 16, 2019 at 1:11 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Feb 16, 2019 at 09:29:48AM -0800, Matthew Wilcox wrote:
> > [..]
>
> Nope, this isn't enough. It's _necessary_ to find the processes that
> have part of this PMD page mapped, but not the start of it. But it's
> not _sufficient_ because it'll still only mkclean the first PTE. So we
> need a loop. I'm feeling a bit over my head here. I may have a go at a
> fuller fix, but if someone else wants to have a go at it, be my guest!

Nothing comes to mind outside of pseudo-reverting this conversion by
introducing a way to get back to the old semantics. If you don't see a
path forward, we mere XArray mortals stand no chance.
* Re: find_get_entries_tag regression bisected
@ 2019-02-26 12:08 Matthew Wilcox

From: Matthew Wilcox
To: Dan Williams
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Mon, Feb 25, 2019 at 09:03:00PM -0800, Dan Williams wrote:
> On Sat, Feb 16, 2019 at 1:11 PM Matthew Wilcox <willy@infradead.org> wrote:
> > [..]
>
> Nothing comes to mind outside of pseudo-reverting this conversion by
> introducing a way to get back to the old semantics. If you don't see a
> path forward, us mere Xarray-mortals stand no chance.

This is a pre-existing bug. Fixing the regression is easy.
* Re: find_get_entries_tag regression bisected
@ 2019-02-26 13:16 Matthew Wilcox

From: Matthew Wilcox
To: Dan Williams
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Tue, Feb 26, 2019 at 04:08:21AM -0800, Matthew Wilcox wrote:
> On Mon, Feb 25, 2019 at 09:03:00PM -0800, Dan Williams wrote:
> > [..]
> >
> > Nothing comes to mind outside of pseudo-reverting this conversion by
> > introducing a way to get back to the old semantics. If you don't see a
> > path forward, us mere Xarray-mortals stand no chance.
>
> This is a pre-existing bug. Fixing the regression is easy.

To be clear: I sent the regression fix a week ago, in
20190216153511.GM12668@bombadil.infradead.org. I haven't tested it;
I don't have a suitable setup right now.
The pre-existing bug is that if you have two tasks with the same address
range mapped, but one has it mapped with a PMD and the other has it
mapped with PTEs (maybe due to mapping only a subset of the addresses),
dax_entry_mkclean() won't clean the PTEs, because
vma_interval_tree_foreach() won't find the VMA.
* Re: find_get_entries_tag regression bisected
@ 2019-02-27 18:16 Dan Williams

From: Dan Williams
To: Matthew Wilcox
Cc: linux-nvdimm, linux-fsdevel, Linux Kernel Mailing List, Balcer, Piotr

On Sat, Feb 16, 2019 at 9:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> On Sat, Feb 16, 2019 at 07:35:11AM -0800, Matthew Wilcox wrote:
> > Another way to fix this would be to mask the address in
> > dax_entry_mkclean(), but I think this is cleaner.
>
> That's clearly rubbish, dax_entry_mkclean() can't possibly mask the
> address. It might be mis-aligned in another process. But ... if it's
> misaligned in another process, dax_entry_mkclean() will only clean the
> first PTE associated with the PMD; it won't clean the whole thing.
> I think we need something like this:
>
> (I'll have to split it apart to give us something to backport)

Looks good to me, care to send a formal patch?
Tested-by: Dan Williams <dan.j.williams@intel.com>

> [.. patch from the 2019-02-16 17:29 message quoted in full ..]
end of thread, other threads: [~2019-02-27 18:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-16  2:08 find_get_entries_tag regression bisected Dan Williams
2019-02-16 15:35 ` Matthew Wilcox
2019-02-16 17:29 ` Matthew Wilcox
2019-02-16 21:11 ` Matthew Wilcox
2019-02-26  5:03 ` Dan Williams
2019-02-26 12:08 ` Matthew Wilcox
2019-02-26 13:16 ` Matthew Wilcox
2019-02-27 18:16 ` Dan Williams