nvdimm.lists.linux.dev archive mirror
* fsdax memory error handling regression
@ 2018-11-06  3:44 Williams, Dan J
  2018-11-06 14:48 ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Williams, Dan J @ 2018-11-06  3:44 UTC (permalink / raw)
  To: willy; +Cc: linux-fsdevel, linux-nvdimm

Hi Willy,

I'm seeing the following warning with v4.20-rc1 and the "dax.sh" test
from the ndctl repository:

[   69.962873] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[   69.969522] EXT4-fs (pmem0): mounted filesystem with ordered data mode. Opts: dax
[   70.028571] Injecting memory failure for pfn 0x208900 at process virtual address 0x7efe87b00000
[   70.032384] Memory failure: 0x208900: Killing dax-pmd:7066 due to hardware memory corruption
[   70.034420] Memory failure: 0x208900: recovery action for dax page: Recovered
[   70.038878] WARNING: CPU: 37 PID: 7066 at fs/dax.c:464 dax_insert_entry+0x30b/0x330
[   70.040675] Modules linked in: ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) ip6table_mangle(E) ip6table_raw(E) ip6table_security(E) iptable_mangle(E) iptable_raw(E) iptable_security(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ebtable_filter(E) ebtables(E) ip6table_filter(E) ip6_tables(E) crct10dif_pclmul(E) crc32_pclmul(E) dax_pmem(OE) crc32c_intel(E) device_dax(OE) ghash_clmulni_intel(E) nd_pmem(OE) nd_btt(OE) serio_raw(E) nd_e820(OE) nfit(OE) libnvdimm(OE) nfit_test_iomap(OE)
[   70.049936] CPU: 37 PID: 7066 Comm: dax-pmd Tainted: G           OE     4.19.0-rc5+ #2589
[   70.051726] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
[   70.055215] RIP: 0010:dax_insert_entry+0x30b/0x330
[   70.056769] Code: 84 b7 fe ff ff 48 81 e6 00 00 e0 ff e9 b2 fe ff ff 48 8b 3c 24 48 89 ee 31 d2 e8 10 eb ff ff 49 8b 7d 00 31 f6 e9 99 fe ff ff <0f> 0b e9 f8 fe ff ff 0f 0b e9 e2 fd ff ff e8 82 f1 f4 ff e9 9c fe
[   70.062086] RSP: 0000:ffffc900086bfb20 EFLAGS: 00010082
[   70.063726] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea0008220000
[   70.065755] RDX: 0000000000000000 RSI: 0000000000208800 RDI: 0000000000208800
[   70.067784] RBP: ffff880327870bb0 R08: 0000000000208801 R09: 0000000000208a00
[   70.069813] R10: 0000000000208801 R11: 0000000000000001 R12: ffff880327870bb8
[   70.071837] R13: 0000000000000000 R14: 0000000004110003 R15: 0000000000000009
[   70.073867] FS:  00007efe8859d540(0000) GS:ffff88033ea80000(0000) knlGS:0000000000000000
[   70.076547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.078294] CR2: 00007efe87a00000 CR3: 0000000334564003 CR4: 0000000000160ee0
[   70.080326] Call Trace:
[   70.081404]  ? dax_iomap_pfn+0xb4/0x100
[   70.082770]  dax_iomap_pte_fault+0x648/0xd60
[   70.084222]  dax_iomap_fault+0x230/0xba0
[   70.085596]  ? lock_acquire+0x9e/0x1a0
[   70.086940]  ? ext4_dax_huge_fault+0x5e/0x200
[   70.088406]  ext4_dax_huge_fault+0x78/0x200
[   70.089840]  ? up_read+0x1c/0x70
[   70.091071]  __do_fault+0x1f/0x136
[   70.092344]  __handle_mm_fault+0xd2b/0x11c0
[   70.093790]  handle_mm_fault+0x198/0x3a0
[   70.095166]  __do_page_fault+0x279/0x510
[   70.096546]  do_page_fault+0x32/0x200
[   70.097884]  ? async_page_fault+0x8/0x30
[   70.099256]  async_page_fault+0x1e/0x30

I tried to get this test going on -next before the merge window, but
-next was not bootable for me. Bisection points to:

    9f32d221301c dax: Convert dax_lock_mapping_entry to XArray

At first glance I think we need the old "always retry if we slept"
behavior. Otherwise this failure seems similar to the issue fixed by
Ross' change to always retry on any potential collision:

    b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()

I'll take a closer look tomorrow to see if that guess is plausible.

* Re: fsdax memory error handling regression
  2018-11-06  3:44 fsdax memory error handling regression Williams, Dan J
@ 2018-11-06 14:48 ` Matthew Wilcox
  2018-11-07  6:01   ` Williams, Dan J
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2018-11-06 14:48 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: linux-fsdevel, linux-nvdimm

On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> Hi Willy,
> 
> I'm seeing the following warning with v4.20-rc1 and the "dax.sh" test
> from the ndctl repository:

I'll try to run this myself later today.

> I tried to get this test going on -next before the merge window, but
> -next was not bootable for me. Bisection points to:
> 
>     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> 
> At first glance I think we need the old "always retry if we slept"
> behavior. Otherwise this failure seems similar to the issue fixed by
> Ross' change to always retry on any potential collision:
> 
>     b1f382178d15 ext4: close race between direct IO and ext4_break_layouts()
> 
> I'll take a closer look tomorrow to see if that guess is plausible.

If your thought is correct, then this should be all that's needed:

diff --git a/fs/dax.c b/fs/dax.c
index 616e36ea6aaa..529ac9d7c10a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page)
 		entry = xas_load(&xas);
 		if (dax_is_locked(entry)) {
 			entry = get_unlocked_entry(&xas);
-			/* Did the page move while we slept? */
-			if (dax_to_pfn(entry) != page_to_pfn(page)) {
-				xas_unlock_irq(&xas);
-				continue;
-			}
+			xas_unlock_irq(&xas);
+			continue;
 		}
 		dax_lock_entry(&xas, entry);
 		xas_unlock_irq(&xas);

I don't quite understand how we'd find a PFN for this page in the tree
after the page has had page->mapping removed.  However, the more I look
at this path, the more I don't like it -- it doesn't handle returning
NULL explicitly, nor does it handle the situation where a PMD is split
to form multiple PTEs explicitly, it just kind of relies on those bit
patterns not matching.

So I kind of like the "just retry without doing anything clever" situation
that the above patch takes us to.

* Re: fsdax memory error handling regression
  2018-11-06 14:48 ` Matthew Wilcox
@ 2018-11-07  6:01   ` Williams, Dan J
  2018-11-09 19:54     ` Dan Williams
  2018-11-10  8:29     ` Matthew Wilcox
  0 siblings, 2 replies; 8+ messages in thread
From: Williams, Dan J @ 2018-11-07  6:01 UTC (permalink / raw)
  To: willy; +Cc: linux-fsdevel, jack, linux-nvdimm

On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > Hi Willy,
> > 
> > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > test
> > from the ndctl repository:
> 
> I'll try to run this myself later today.
> 
> > I tried to get this test going on -next before the merge window,
> > but
> > -next was not bootable for me. Bisection points to:
> > 
> >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > 
> > At first glance I think we need the old "always retry if we slept"
> > behavior. Otherwise this failure seems similar to the issue fixed
> > by
> > Ross' change to always retry on any potential collision:
> > 
> >     b1f382178d15 ext4: close race between direct IO and
> > ext4_break_layouts()
> > 
> > I'll take a closer look tomorrow to see if that guess is plausible.
> 
> If your thought is correct, then this should be all that's needed:
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 616e36ea6aaa..529ac9d7c10a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page)
>  		entry = xas_load(&xas);
>  		if (dax_is_locked(entry)) {
>  			entry = get_unlocked_entry(&xas);
> -			/* Did the page move while we slept? */
> -			if (dax_to_pfn(entry) != page_to_pfn(page)) {
> -				xas_unlock_irq(&xas);
> -				continue;
> -			}
> +			xas_unlock_irq(&xas);
> +			continue;
>  		}
>  		dax_lock_entry(&xas, entry);
>  		xas_unlock_irq(&xas);

No, that doesn't work.

> 
> I don't quite understand how we'd find a PFN for this page in the
> tree
> after the page has had page->mapping removed.  However, the more I
> look
> at this path, the more I don't like it -- it doesn't handle returning
> NULL explicitly, nor does it handle the situation where a PMD is
> split
> to form multiple PTEs explicitly, it just kind of relies on those bit
> patterns not matching.
> 
> So I kind of like the "just retry without doing anything clever"
> situation
> that the above patch takes us to.

I've been hacking at this today and am starting to lean towards
"revert" over "fix" for the amount of changes needed to get this back
on its feet. I've been able to get the test passing again with the
below changes directly on top of commit 9f32d221301c "dax: Convert
dax_lock_mapping_entry to XArray". That said, I have thus far been
unable to rebase this patch on top of v4.20-rc1 and yield a functional
result.

My concerns are:
- I can't determine if dax_unlock_entry() wants an unlocked entry
parameter, or locked. The dax_insert_pfn_mkwrite() and
dax_unlock_mapping_entry() usages seem to disagree.

- The multi-order use case of Xarray is a mystery to me. It seems to
want to know the order of entries a-priori with a choice to use
XA_STATE_ORDER() vs XA_STATE(). This falls over in
dax_unlock_mapping_entry() and other places where the only source of
the entry's order is dax_is_pmd_entry(), i.e. the
Xarray itself. PageHead() does not work for DAX pages because
PageHead() is only established by the page allocator and DAX pages
never participate in the page allocator.

- The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
for inode lifetime synchronization, not just for walking the radix.
That lock needs to be dropped before sleeping, and if we slept the
inode may no longer exist.

- I could not see how the pattern:
	entry = xas_load(&xas);
	if (dax_is_locked(entry)) {
		entry = get_unlocked_entry(&xas);
...was safe given that get_unlocked_entry() turns around and does
validation that the entry is !xa_is_internal() and !NULL.

- The usage of internal entries in grab_mapping_entry() seems to need
auditing. Previously we would compare the entry size against
@size_flag, but now if the index hits a multi-order entry in
get_unlocked_entry(), afaics it could be internal and we need to convert
it to the actual entry before aborting... at least to match the v4.19
behavior.

This all seems to merit a rethink of the dax integration of Xarray.

diff --git a/fs/dax.c b/fs/dax.c
index fc2745ca3308..2b3bd4a4cc48 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -99,17 +99,6 @@ static void *dax_make_locked(unsigned long pfn, unsigned long flags)
 			DAX_LOCKED);
 }
 
-static void *dax_make_entry(pfn_t pfn, unsigned long flags)
-{
-	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
-}
-
-static void *dax_make_page_entry(struct page *page)
-{
-	pfn_t pfn = page_to_pfn_t(page);
-	return dax_make_entry(pfn, PageHead(page) ? DAX_PMD : 0);
-}
-
 static bool dax_is_locked(void *entry)
 {
 	return xa_to_value(entry) & DAX_LOCKED;
@@ -225,7 +214,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_entry(struct xa_state *xas)
+static void *__get_unlocked_entry(struct xa_state *xas, bool (*wait_fn)(void))
 {
 	void *entry;
 	struct wait_exceptional_entry_queue ewait;
@@ -235,6 +224,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
 	ewait.wait.func = wake_exceptional_entry_func;
 
 	for (;;) {
+		bool revalidate;
+
 		entry = xas_load(xas);
 		if (!entry || xa_is_internal(entry) ||
 				WARN_ON_ONCE(!xa_is_value(entry)) ||
@@ -247,12 +238,29 @@ static void *get_unlocked_entry(struct xa_state *xas)
 					  TASK_UNINTERRUPTIBLE);
 		xas_unlock_irq(xas);
 		xas_reset(xas);
-		schedule();
+		revalidate = wait_fn();
 		finish_wait(wq, &ewait.wait);
 		xas_lock_irq(xas);
+		if (revalidate)
+			return ERR_PTR(-EAGAIN);
 	}
 }
 
+static bool entry_wait(void)
+{
+	schedule();
+	/*
+	 * Never return an ERR_PTR() from __get_unlocked_entry(), just
+	 * keep looping.
+	 */
+	return false;
+}
+
+static void *get_unlocked_entry(struct xa_state *xas)
+{
+	return __get_unlocked_entry(xas, entry_wait);
+}
+
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
@@ -366,16 +374,6 @@ static void *__get_unlocked_mapping_entry(struct address_space *mapping,
 	}
 }
 
-static bool entry_wait(void)
-{
-	schedule();
-	/*
-	 * Never return an ERR_PTR() from
-	 * __get_unlocked_mapping_entry(), just keep looping.
-	 */
-	return false;
-}
-
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 		pgoff_t index, void ***slotp)
 {
@@ -498,16 +496,33 @@ static struct page *dax_busy_page(void *entry)
 	return NULL;
 }
 
+
+static bool entry_wait_revalidate(void)
+{
+	rcu_read_unlock();
+	schedule();
+	rcu_read_lock();
+
+	/*
+	 * Tell __get_unlocked_entry() to take a break, we need to
+	 * revalidate page->mapping after dropping locks
+	 */
+	return true;
+}
+
 bool dax_lock_mapping_entry(struct page *page)
 {
 	XA_STATE(xas, NULL, 0);
+	bool did_lock = false;
 	void *entry;
 
+	/* hold rcu lock to coordinate with inode end-of-life */
+	rcu_read_lock();
 	for (;;) {
 		struct address_space *mapping = READ_ONCE(page->mapping);
 
 		if (!dax_mapping(mapping))
-			return false;
+			break;
 
 		/*
 		 * In the device-dax case there's no need to lock, a
@@ -516,9 +531,12 @@ bool dax_lock_mapping_entry(struct page *page)
 		 * otherwise we would not have a valid pfn_to_page()
 		 * translation.
 		 */
-		if (S_ISCHR(mapping->host->i_mode))
-			return true;
+		if (S_ISCHR(mapping->host->i_mode)) {
+			did_lock = true;
+			break;
+		}
 
+		xas_reset(&xas);
 		xas.xa = &mapping->i_pages;
 		xas_lock_irq(&xas);
 		if (mapping != page->mapping) {
@@ -526,30 +544,33 @@ bool dax_lock_mapping_entry(struct page *page)
 			continue;
 		}
 		xas_set(&xas, page->index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			entry = get_unlocked_entry(&xas);
-			/* Did the page move while we slept? */
-			if (dax_to_pfn(entry) != page_to_pfn(page)) {
-				xas_unlock_irq(&xas);
-				continue;
-			}
+		entry = __get_unlocked_entry(&xas, entry_wait_revalidate);
+		if (!entry) {
+			xas_unlock_irq(&xas);
+			break;
+		} else if (IS_ERR(entry)) {
+			xas_unlock_irq(&xas);
+			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+			continue;
 		}
 		dax_lock_entry(&xas, entry);
+		did_lock = true;
 		xas_unlock_irq(&xas);
-		return true;
+		break;
 	}
+	rcu_read_unlock();
+
+	return did_lock;
 }
 
 void dax_unlock_mapping_entry(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
 
 	if (S_ISCHR(mapping->host->i_mode))
 		return;
 
-	dax_unlock_entry(&xas, dax_make_page_entry(page));
+	unlock_mapping_entry(mapping, page->index);
 }
 
 /*


* Re: fsdax memory error handling regression
  2018-11-07  6:01   ` Williams, Dan J
@ 2018-11-09 19:54     ` Dan Williams
  2018-11-10  8:29     ` Matthew Wilcox
  1 sibling, 0 replies; 8+ messages in thread
From: Dan Williams @ 2018-11-09 19:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Tue, Nov 6, 2018 at 10:01 PM Williams, Dan J
<dan.j.williams@intel.com> wrote:
>
> On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> > On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > > Hi Willy,
> > >
> > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > > test
> > > from the ndctl repository:
> >
> > I'll try to run this myself later today.
> >
> > > I tried to get this test going on -next before the merge window,
> > > but
> > > -next was not bootable for me. Bisection points to:
> > >
> > >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > >
> > > At first glance I think we need the old "always retry if we slept"
> > > behavior. Otherwise this failure seems similar to the issue fixed
> > > by
> > > Ross' change to always retry on any potential collision:
> > >
> > >     b1f382178d15 ext4: close race between direct IO and
> > > ext4_break_layouts()
> > >
> > > I'll take a closer look tomorrow to see if that guess is plausible.
> >
> > If your thought is correct, then this should be all that's needed:
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 616e36ea6aaa..529ac9d7c10a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page)
> >               entry = xas_load(&xas);
> >               if (dax_is_locked(entry)) {
> >                       entry = get_unlocked_entry(&xas);
> > -                     /* Did the page move while we slept? */
> > -                     if (dax_to_pfn(entry) != page_to_pfn(page)) {
> > -                             xas_unlock_irq(&xas);
> > -                             continue;
> > -                     }
> > +                     xas_unlock_irq(&xas);
> > +                     continue;
> >               }
> >               dax_lock_entry(&xas, entry);
> >               xas_unlock_irq(&xas);
>
> No, that doesn't work.
>
> >
> > I don't quite understand how we'd find a PFN for this page in the
> > tree
> > after the page has had page->mapping removed.  However, the more I
> > look
> > at this path, the more I don't like it -- it doesn't handle returning
> > NULL explicitly, nor does it handle the situation where a PMD is
> > split
> > to form multiple PTEs explicitly, it just kind of relies on those bit
> > patterns not matching.
> >
> > So I kind of like the "just retry without doing anything clever"
> > situation
> > that the above patch takes us to.
>
> I've been hacking at this today and am starting to lean towards
> "revert" over "fix" for the amount of changes needed to get this back
> on its feet. I've been able to get the test passing again with the
> below changes directly on top of commit 9f32d221301c "dax: Convert
> dax_lock_mapping_entry to XArray". That said, I have thus far been
> unable to rebase this patch on top of v4.20-rc1 and yield a functional
> result.


Willy? Thoughts? I'm holding off on a revert of the fsdax conversion
pending whether you see a way to address the concerns incrementally.


> My concerns are:
> - I can't determine if dax_unlock_entry() wants an unlocked entry
> parameter, or locked. The dax_insert_pfn_mkwrite() and
> dax_unlock_mapping_entry() usages seem to disagree.
>
> - The multi-order use case of Xarray is a mystery to me. It seems to
> want to know the order of entries a-priori with a choice to use
> XA_STATE_ORDER() vs XA_STATE(). This falls over in
> dax_unlock_mapping_entry() and other places where the only source of
> the entry's order is dax_is_pmd_entry(), i.e. the
> Xarray itself. PageHead() does not work for DAX pages because
> PageHead() is only established by the page allocator and DAX pages
> never participate in the page allocator.
>
> - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
> for inode lifetime synchronization, not just for walking the radix.
> That lock needs to be dropped before sleeping, and if we slept the
> inode may no longer exist.
>
> - I could not see how the pattern:
>         entry = xas_load(&xas);
>         if (dax_is_locked(entry)) {
>                 entry = get_unlocked_entry(&xas);
> ...was safe given that get_unlocked_entry() turns around and does
> validation that the entry is !xa_is_internal() and !NULL.
>
> - The usage of internal entries in grab_mapping_entry() seems to need
> auditing. Previously we would compare the entry size against
> @size_flag, but now if the index hits a multi-order entry in
> get_unlocked_entry(), afaics it could be internal and we need to convert
> it to the actual entry before aborting... at least to match the v4.19
> behavior.
>
> This all seems to merit a rethink of the dax integration of Xarray.
>

* Re: fsdax memory error handling regression
  2018-11-07  6:01   ` Williams, Dan J
  2018-11-09 19:54     ` Dan Williams
@ 2018-11-10  8:29     ` Matthew Wilcox
  2018-11-10 17:08       ` Dan Williams
  1 sibling, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2018-11-10  8:29 UTC (permalink / raw)
  To: Williams, Dan J; +Cc: linux-fsdevel, jack, linux-nvdimm

On Wed, Nov 07, 2018 at 06:01:19AM +0000, Williams, Dan J wrote:
> On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> > On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > > Hi Willy,
> > > 
> > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > > test
> > > from the ndctl repository:
> > 
> > I'll try to run this myself later today.
> > 
> > > I tried to get this test going on -next before the merge window,
> > > but
> > > -next was not bootable for me. Bisection points to:
> > > 
> > >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > > 
> > > At first glance I think we need the old "always retry if we slept"
> > > behavior. Otherwise this failure seems similar to the issue fixed
> > > by
> > > Ross' change to always retry on any potential collision:
> > > 
> > >     b1f382178d15 ext4: close race between direct IO and
> > > ext4_break_layouts()
> > > 
> > > I'll take a closer look tomorrow to see if that guess is plausible.
> > 
> > I don't quite understand how we'd find a PFN for this page in the
> > tree
> > after the page has had page->mapping removed.  However, the more I
> > look
> > at this path, the more I don't like it -- it doesn't handle returning
> > NULL explicitly, nor does it handle the situation where a PMD is
> > split
> > to form multiple PTEs explicitly, it just kind of relies on those bit
> > patterns not matching.
> > 
> > So I kind of like the "just retry without doing anything clever"
> > situation
> > that the above patch takes us to.
> 
> I've been hacking at this today and am starting to lean towards
> "revert" over "fix" for the amount of changes needed to get this back
> on its feet. I've been able to get the test passing again with the
> below changes directly on top of commit 9f32d221301c "dax: Convert
> dax_lock_mapping_entry to XArray". That said, I have thus far been
> unable to rebase this patch on top of v4.20-rc1 and yield a functional
> result.

I think it's a little premature to go for "revert".  Sure, if it's
not fixed in three-four weeks, but we don't normally jump straight to
"revert" at -rc1.

> My concerns are:
> - I can't determine if dax_unlock_entry() wants an unlocked entry
> parameter, or locked. The dax_insert_pfn_mkwrite() and
> dax_unlock_mapping_entry() usages seem to disagree.

That is fair.  I did document it in the changelog:

    dax: Convert dax_insert_pfn_mkwrite to XArray
    
    Add some XArray-based helper functions to replace the radix tree based
    metaphors currently in use.  The biggest change is that converted code
    doesn't see its own lock bit; get_unlocked_entry() always returns an
    entry with the lock bit clear.  So we don't have to mess around loading
    the current entry and clearing the lock bit; we can just store the
    unlocked entry that we already have.

but I should have written that in code too:

@@ -255,6 +255,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
 {
        void *old;
 
+       BUG_ON(dax_is_locked(entry));
        xas_reset(xas);
        xas_lock_irq(xas);
        old = xas_store(xas, entry);


I've added a commit to my tree with that.

> - The multi-order use case of Xarray is a mystery to me. It seems to
> want to know the order of entries a-priori with a choice to use
> XA_STATE_ORDER() vs XA_STATE(). This falls over in
> dax_unlock_mapping_entry() and other places where the only source of
> the entry's order is dax_is_pmd_entry(), i.e. the
> Xarray itself. PageHead() does not work for DAX pages because
> PageHead() is only established by the page allocator and DAX pages
> never participate in the page allocator.

I didn't know that you weren't using PageHead.  That wasn't well-documented.

There's xas_set_order() for dynamically setting the order of an entry.
However, for this specific instance, we already have an entry in the tree
which is of the correct order, so just using XA_STATE is sufficient, as
xas_store() does not punch a small entry into a large entry but rather
overwrites the canonical entry with the new entry's value, leaving it
the same size, unless the new entry is specified to be larger in size.

The problem, then, is that the PMD bit isn't being set in the entry.
We could simply do a xas_load() and copy the PMD bit over.  Is there
really no way to tell from the struct page whether it's in use as a
huge page?  That seems like a mistake.
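
To sketch what I mean by copying the PMD bit over (untested, reusing
the DAX_PMD flag and the dax_is_pmd_entry() helper from fs/dax.c):

	/* inherit the size bit from the entry already in the tree */
	void *old = xas_load(&xas);

	if (dax_is_pmd_entry(old))
		entry = xa_mk_value(xa_to_value(entry) | DAX_PMD);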

> - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
> for inode lifetime synchronization, not just for walking the radix.
> That lock needs to be dropped before sleeping, and if we slept the
> inode may no longer exist.

That _really_ wasn't documented but should be easy to fix.

> - I could not see how the pattern:
> 	entry = xas_load(&xas);
> 	if (dax_is_locked(entry)) {
> 		entry = get_unlocked_entry(&xas);
> ...was safe given that get_unlocked_entry() turns around and does
> validation that the entry is !xa_is_internal() and !NULL.

Oh you're saying that entry might be NULL in dax_lock_mapping_entry()?
It can't be an internal entry there because that won't happen while
holding the xa_lock and looking for an order-0 entry.  dax_is_locked()
will return false for a NULL entry, so I don't see a problem here.
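
(To spell that out: the dax_is_locked() in fs/dax.c is just

	static bool dax_is_locked(void *entry)
	{
		return xa_to_value(entry) & DAX_LOCKED;
	}

and xa_to_value(NULL) is 0, so a NULL entry reads as unlocked.)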

> - The usage of internal entries in grab_mapping_entry() seems to need
> auditing. Previously we would compare the entry size against
> @size_flag, but now if the index hits a multi-order entry in
> get_unlocked_entry(), afaics it could be internal and we need to convert
> it to the actual entry before aborting... at least to match the v4.19
> behavior.

If we get an internal entry in this case, we know we were looking up
a PMD entry and found a PTE entry.

* Re: fsdax memory error handling regression
  2018-11-10  8:29     ` Matthew Wilcox
@ 2018-11-10 17:08       ` Dan Williams
  2018-11-13 14:25         ` Matthew Wilcox
  0 siblings, 1 reply; 8+ messages in thread
From: Dan Williams @ 2018-11-10 17:08 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Sat, Nov 10, 2018 at 12:29 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, Nov 07, 2018 at 06:01:19AM +0000, Williams, Dan J wrote:
> > On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> > > On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > > > Hi Willy,
> > > >
> > > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > > > test
> > > > from the ndctl repository:
> > >
> > > I'll try to run this myself later today.
> > >
> > > > I tried to get this test going on -next before the merge window,
> > > > but
> > > > -next was not bootable for me. Bisection points to:
> > > >
> > > >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > > >
> > > > At first glance I think we need the old "always retry if we slept"
> > > > behavior. Otherwise this failure seems similar to the issue fixed
> > > > by
> > > > Ross' change to always retry on any potential collision:
> > > >
> > > >     b1f382178d15 ext4: close race between direct IO and
> > > > ext4_break_layouts()
> > > >
> > > > I'll take a closer look tomorrow to see if that guess is plausible.
> > >
> > > I don't quite understand how we'd find a PFN for this page in the
> > > tree
> > > after the page has had page->mapping removed.  However, the more I
> > > look
> > > at this path, the more I don't like it -- it doesn't handle returning
> > > NULL explicitly, nor does it handle the situation where a PMD is
> > > split
> > > to form multiple PTEs explicitly, it just kind of relies on those bit
> > > patterns not matching.
> > >
> > > So I kind of like the "just retry without doing anything clever"
> > > situation
> > > that the above patch takes us to.
> >
> > I've been hacking at this today and am starting to lean towards
> > "revert" over "fix" for the amount of changes needed to get this back
> > on its feet. I've been able to get the test passing again with the
> > below changes directly on top of commit 9f32d221301c "dax: Convert
> > dax_lock_mapping_entry to XArray". That said, I have thus far been
> > unable to rebase this patch on top of v4.20-rc1 and yield a functional
> > result.
>
> I think it's a little premature to go for "revert".  Sure, if it's
> not fixed in three-four weeks, but we don't normally jump straight to
> "revert" at -rc1.

Thanks for circling back to take a look at this.

>
> > My concerns are:
> > - I can't determine if dax_unlock_entry() wants an unlocked entry
> > parameter, or locked. The dax_insert_pfn_mkwrite() and
> > dax_unlock_mapping_entry() usages seem to disagree.
>
> That is fair.  I did document it in the changelog:
>
>     dax: Convert dax_insert_pfn_mkwrite to XArray
>
>     Add some XArray-based helper functions to replace the radix tree based
>     metaphors currently in use.  The biggest change is that converted code
>     doesn't see its own lock bit; get_unlocked_entry() always returns an
>     entry with the lock bit clear.  So we don't have to mess around loading
>     the current entry and clearing the lock bit; we can just store the
>     unlocked entry that we already have.
>
> but I should have written that in code too:

Ok.

>
> @@ -255,6 +255,7 @@ static void dax_unlock_entry(struct xa_state *xas, void *entry)
>  {
>         void *old;
>
> +       BUG_ON(dax_is_locked(entry));
>         xas_reset(xas);
>         xas_lock_irq(xas);
>         old = xas_store(xas, entry);
>
>
> I've added a commit to my tree with that.

WARN_ON_ONCE()?

> > - The multi-order use case of Xarray is a mystery to me. It seems to
> > want to know the order of entries a-priori with a choice to use
> > XA_STATE_ORDER() vs XA_STATE(). This falls over in
> > dax_unlock_mapping_entry() and other places where the only source of
> > the entry's order is dax_is_pmd_entry(), i.e. the
> > Xarray itself. PageHead() does not work for DAX pages because
> > PageHead() is only established by the page allocator and DAX pages
> > never participate in the page allocator.
>
> I didn't know that you weren't using PageHead.  That wasn't well-documented.

Where would you have looked for that comment?

> There's xas_set_order() for dynamically setting the order of an entry.
> However, for this specific instance, we already have an entry in the tree
> which is of the correct order, so just using XA_STATE is sufficient, as
> xas_store() does not punch a small entry into a large entry but rather
> overwrites the canonical entry with the new entry's value, leaving it
> the same size, unless the new entry is specified to be larger in size.
>
> The problem, then, is that the PMD bit isn't being set in the entry.
> We could simply do a xas_load() and copy the PMD bit over.  Is there
> really no way to tell from the struct page whether it's in use as a
> huge page?  That seems like a mistake.

DAX pages have always been just enough struct page to make the DAX use
case stop crashing on fork, DMA, etc. I think as DAX developers we've
had more than a few discussions about where i_pages data is in use vs
struct page. The current breakdown of surprises that I know of are:

page->lru: unavailable

compound_page / PageHead: not set, only pte entries can reliably
identify the mapping size across both filesystem-dax and device-dax

page dirty tracking: i_pages for filesystem-dax, no such thing for device_dax

page->index: not set until 4.19

page->mapping: not set until 4.19, needed custom aops

...it's fair to say we need a document. We've always needed one. This
shifting state of DAX with respect to i_pages tracking has been a saga
for a few years now.

> > - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
> > for inode lifetime synchronization, not just for walking the radix.
> > That lock needs to be dropped before sleeping, and if we slept the
> > inode may no longer exist.
>
> That _really_ wasn't documented but should be easy to fix.

Fair, I added a comment in my proposed fix patch for this. It came up
in review with Jan, but yes it never made it to a code comment. That
said the conversion patch commit message is silent on why it thinks
it's safe to delete the lock. I can't seem to find any record of "dax:
Convert dax_lock_mapping_entry to XArray" ever being sent to a mailing
list, or cc'd to the usual DAX suspects. Certainly there's no
non-author sign-offs on the commit. I only saw it coming from the
collisions it caused in -next as I tried to get the 4.19 state of the
code stabilized, but obviously never had a chance to review it as we
were bug hunting 4.19 late into the -rcs.

> > - I could not see how the pattern:
> >       entry = xas_load(&xas);
> >       if (dax_is_locked(entry)) {
> >               entry = get_unlocked_entry(&xas);
> > ...was safe given that get_unlocked_entry() turns around and does
> > validation that the entry is !xa_is_internal() and !NULL.
>
> Oh you're saying that entry might be NULL in dax_lock_mapping_entry()?
> It can't be an internal entry there because that won't happen while
> holding the xa_lock and looking for an order-0 entry.  dax_is_locked()
> will return false for a NULL entry, so I don't see a problem here.

This is the problem: we don't know ahead of time that we're looking
for an order-0 entry. For the specific case of a memory failure in the
middle of a huge page, the implementation takes
dax_lock_mapping_entry() with the expectation that any lock on a
sub-page locks the entire range in i_pages and *then* walks the ptes
to see the effective mapping size. If Xarray needs to know ahead of
time that the user wants the multi-order entry then we need to defer
this Xarray conversion until we figure out PageHead / compound_pages()
for DAX-pages.
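
For reference, the consumer here is the memory_failure_dev_pagemap()
path in mm/memory-failure.c, which does roughly (details elided):

	/* @page may be any sub-page of a PMD-mapped range */
	if (!dax_lock_mapping_entry(page))
		goto out;
	/* ...only now walk the page tables for the mapping size... */
	dax_unlock_mapping_entry(page);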

> > - The usage of internal entries in grab_mapping_entry() seems to need
> > auditing. Previously we would compare the entry size against
> > @size_flag, but now if the index hits a multi-order entry in
> > get_unlocked_entry(), afaics it could be internal and we need to convert
> > it to the actual entry before aborting... at least to match the v4.19
> > behavior.
>
> If we get an internal entry in this case, we know we were looking up
> a PMD entry and found a PTE entry.

Oh, so I may have my understanding of internal entries backwards? I.e.
I thought they were returned if you have an order-0 xas and passed
xas_load() an unaligned index, but the entry is multi-order. You're
saying they are only returned when we have a multi-order xas and
xas_load() finds an order-0 entry at the unaligned index. So
"internal" isn't Xarray private state it's an order-0 entry when the
user wanted multi-order?

* Re: fsdax memory error handling regression
  2018-11-10 17:08       ` Dan Williams
@ 2018-11-13 14:25         ` Matthew Wilcox
  2018-11-29  6:09           ` Dan Williams
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2018-11-13 14:25 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Sat, Nov 10, 2018 at 09:08:10AM -0800, Dan Williams wrote:
> On Sat, Nov 10, 2018 at 12:29 AM Matthew Wilcox <willy@infradead.org> wrote:
> > On Wed, Nov 07, 2018 at 06:01:19AM +0000, Williams, Dan J wrote:
> > > On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> > > > On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > > > > Hi Willy,
> > > > >
> > > > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > > > > test
> > > > > from the ndctl repository:
> > > >
> > > > I'll try to run this myself later today.
> > > >
> > > > > I tried to get this test going on -next before the merge window,
> > > > > but
> > > > > -next was not bootable for me. Bisection points to:
> > > > >
> > > > >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > > > >
> > > > > At first glance I think we need the old "always retry if we slept"
> > > > > behavior. Otherwise this failure seems similar to the issue fixed
> > > > > by
> > > > > Ross' change to always retry on any potential collision:
> > > > >
> > > > >     b1f382178d15 ext4: close race between direct IO and
> > > > > ext4_break_layouts()
> > > > >
> > > > > I'll take a closer look tomorrow to see if that guess is plausible.
> > > >
> > > > I don't quite understand how we'd find a PFN for this page in the
> > > > tree
> > > > after the page has had page->mapping removed.  However, the more I
> > > > look
> > > > at this path, the more I don't like it -- it doesn't handle returning
> > > > NULL explicitly, nor does it handle the situation where a PMD is
> > > > split
> > > > to form multiple PTEs explicitly, it just kind of relies on those bit
> > > > patterns not matching.
> > > >
> > > > So I kind of like the "just retry without doing anything clever"
> > > > situation
> > > > that the above patch takes us to.
> > >
> > > I've been hacking at this today and am starting to lean towards
> > > "revert" over "fix" for the amount of changes needed to get this back
> > > on its feet. I've been able to get the test passing again with the
> > > below changes directly on top of commit 9f32d221301c "dax: Convert
> > > dax_lock_mapping_entry to XArray". That said, I have thus far been
> > > unable to rebase this patch on top of v4.20-rc1 and yield a functional
> > > result.
> >
> > I think it's a little premature to go for "revert".  Sure, if it's
> > not fixed in three-four weeks, but we don't normally jump straight to
> > "revert" at -rc1.
> 
> Thanks for circling back to take a look at this.

Thanks for the reminder email -- I somehow didn't see the email that
you sent on Wednesday.

> > +       BUG_ON(dax_is_locked(entry));
> 
> WARN_ON_ONCE()?

I don't think that's a good idea.  If you try to 'unlock' by storing a
locked entry, it's quite simply a bug.  Everything which tries to access
the entry from here on will sleep.  I think it's actually better to force
a reboot at this point than try to continue.

> > > - The multi-order use case of Xarray is a mystery to me. It seems to
> > > want to know the order of entries a-priori with a choice to use
> > > XA_STATE_ORDER() vs XA_STATE(). This falls over in
> > > dax_unlock_mapping_entry() and other places where the only source of
> > > the entry's order is dax_is_pmd_entry(), i.e. the
> > > Xarray itself. PageHead() does not work for DAX pages because
> > > PageHead() is only established by the page allocator and DAX pages
> > > never participate in the page allocator.
> >
> > I didn't know that you weren't using PageHead.  That wasn't well-documented.
> 
> Where would you have looked for that comment?

Good point.  I don't know.

> > There's xas_set_order() for dynamically setting the order of an entry.
> > However, for this specific instance, we already have an entry in the tree
> > which is of the correct order, so just using XA_STATE is sufficient, as
> > xas_store() does not punch a small entry into a large entry but rather
> > overwrites the canonical entry with the new entry's value, leaving it
> > the same size, unless the new entry is specified to be larger in size.
> >
> > The problem, then, is that the PMD bit isn't being set in the entry.
> > We could simply do a xas_load() and copy the PMD bit over.  Is there
> > really no way to tell from the struct page whether it's in use as a
> > huge page?  That seems like a mistake.
> 
> DAX pages have always been just enough struct page to make the DAX use
> case stop crashing on fork, DMA, etc. I think as DAX developers we've
> had more than a few discussions about where i_pages data is in use vs
> struct page. The current breakdown of surprises that I know of are:
> 
> page->lru: unavailable
> 
> compound_page / PageHead: not set, only pte entries can reliably
> identify the mapping size across both filesystem-dax and device-dax
> 
> page dirty tracking: i_pages for filesystem-dax, no such thing for device_dax
> 
> page->index: not set until 4.19
> 
> page->mapping: not set until 4.19, needed custom aops
> 
> ...it's fair to say we need a document. We've always needed one. This
> shifting state of DAX with respect to i_pages tracking has been a saga
> for a few years now.

I can't allow you to take too much blame here; struct page itself has been
woefully undocumented for too long.  I hope I improved the situation with
97b4a67198 and the other patches in that series.

> > > - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
> > > for inode lifetime synchronization, not just for walking the radix.
> > > That lock needs to be dropped before sleeping, and if we slept the
> > > inode may no longer exist.
> >
> > That _really_ wasn't documented but should be easy to fix.
> 
> Fair, I added a comment in my proposed fix patch for this. It came up
> in review with Jan, but yes it never made it to a code comment. That
> said the conversion patch commit message is silent on why it thinks
> it's safe to delete the lock.

I thought it was safe to delete the lock because the rcu_read_lock()
was protecting the radix tree.  It's a pretty unusual locking pattern
to have inodes going away while there are still pages in the page cache.
I probably need to dig out the conversation between you & Jan on this
topic.

> I can't seem to find any record of "dax:
> Convert dax_lock_mapping_entry to XArray" ever being sent to a mailing
> list, or cc'd to the usual DAX suspects. Certainly there's no
> non-author sign-offs on the commit. I only saw it coming from the
> collisions it caused in -next as I tried to get the 4.19 state of the
> code stabilized, but obviously never had a chance to review it as we
> were bug hunting 4.19 late into the -rcs.

I thought I sent it out; possibly I messed that up.  I found it very
hard to get any Reviewed-by/Acked-by lines on any of the XArray work.
I sent out 14 revisions and only got nine review/ack tags on the
seventy-odd patches.

It's rather unfortunate; I know Ross spent a lot of time and effort
testing the DAX conversion, but he never sent me a Tested-by or
Reviewed-by for it.

> > > - I could not see how the pattern:
> > >       entry = xas_load(&xas);
> > >       if (dax_is_locked(entry)) {
> > >               entry = get_unlocked_entry(&xas);
> > > ...was safe given that get_unlocked_entry() turns around and does
> > > validation that the entry is !xa_is_internal() and !NULL.
> >
> > Oh you're saying that entry might be NULL in dax_lock_mapping_entry()?
> > It can't be an internal entry there because that won't happen while
> > holding the xa_lock and looking for an order-0 entry.  dax_is_locked()
> > will return false for a NULL entry, so I don't see a problem here.
> 
> This is the problem: we don't know ahead of time that we're looking
> for an order-0 entry. For the specific case of a memory failure in the
> middle of a huge page, the implementation takes
> dax_lock_mapping_entry() with the expectation that any lock on a
> sub-page locks the entire range in i_pages and *then* walks the ptes
> to see the effective mapping size. If Xarray needs to know ahead of
> time that the user wants the multi-order entry then we need to defer
> this Xarray conversion until we figure out PageHead / compound_pages()
> for DAX-pages.

I haven't done a good job of explaining; let me try again.

When we call xas_load() with an order-0 xa_state, we always get an entry
that's actually in the array.  It might be a PMD entry or a PTE entry,
but it's always something in the array.  When we use a PMD-order xa_state
and there's a PTE entry, we don't bother walking down to the PTE level
of the tree, we just return a node pointer to indicate there's something
here, and it's not what you're looking for.

These semantics are what I thought DAX wanted, since DAX is basically
the only user of multiorder entries today.
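
In code, the difference looks roughly like this (untested sketch, with
PMD_ORDER as fs/dax.c defines it):

	/* order-0 lookup: returns whatever covers the index, PTE or PMD */
	XA_STATE(xas, &mapping->i_pages, index);
	void *entry = xas_load(&xas);

	/* PMD-order lookup: a conflicting PTE entry comes back as an
	 * internal (node) pointer, not as the entry itself */
	XA_STATE_ORDER(pmd_xas, &mapping->i_pages, index, PMD_ORDER);
	entry = xas_load(&pmd_xas);
	if (xa_is_internal(entry))
		; /* something is here, but not at the order we asked for */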

> > > - The usage of internal entries in grab_mapping_entry() seems to need
> > > auditing. Previously we would compare the entry size against
> > > @size_flag, but now if the index hits a multi-order entry in
> > > get_unlocked_entry(), afaics it could be internal and we need to convert
> > > it to the actual entry before aborting... at least to match the v4.19
> > > behavior.
> >
> > If we get an internal entry in this case, we know we were looking up
> > a PMD entry and found a PTE entry.
> 
> Oh, so I may have my understanding of internal entries backwards? I.e.
> I thought they were returned if you have an order-0 xas and passed
> xas_load() an unaligned index, but the entry is multi-order. You're
> saying they are only returned when we have a multi-order xas and
> xas_load() finds an order-0 entry at the unaligned index. So
> "internal" isn't Xarray private state it's an order-0 entry when the
> user wanted multi-order?

This sounds much more like what I just re-described above.  When you say
an unaligned index, I suspect you mean something like having a PMD entry
and specifying an index which is not PMD-aligned?  That always returns
the PMD entry, just like the radix tree used to.

The internal entry _is_ XArray private state, it's just being returned
as an indicator that "the entry you asked for isn't here".

But now that I read the code over, I realise that using xas_load() in
get_unlocked_entry() is wrong.  Consider an XArray with a PTE entry at
index 1023 while a huge page fault attempts to load a PMD entry at index
512.  That's going to return NULL, which will cause grab_mapping_entry()
to put a locked empty entry into the tree, erasing the PTE entry from
the tree -- even if it's locked.

get_unlocked_entry() should be using xas_find_conflict() instead of
xas_load().  That will never return an internal entry, and will just be
generally easier to deal with.
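
I.e. in get_unlocked_entry(), something like (untested):

-		entry = xas_load(xas);
+		entry = xas_find_conflict(xas);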

I'm going to suggest at the unconference kickoff this morning that we do
a session on the XArray.  You & I certainly need to talk in person about
what I've done, and I think it could be useful for others to be present.


* Re: fsdax memory error handling regression
  2018-11-13 14:25         ` Matthew Wilcox
@ 2018-11-29  6:09           ` Dan Williams
  0 siblings, 0 replies; 8+ messages in thread
From: Dan Williams @ 2018-11-29  6:09 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: linux-fsdevel, Jan Kara, linux-nvdimm

On Tue, Nov 13, 2018 at 6:25 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Nov 10, 2018 at 09:08:10AM -0800, Dan Williams wrote:
> > On Sat, Nov 10, 2018 at 12:29 AM Matthew Wilcox <willy@infradead.org> wrote:
[..]
> > > If we get an internal entry in this case, we know we were looking up
> > > a PMD entry and found a PTE entry.
> >
> > Oh, so I may have my understanding of internal entries backwards? I.e.
> > I thought they were returned if you have an order-0 xas and passed
> > xas_load() an unaligned index, but the entry is multi-order. You're
> > saying they are only returned when we have a multi-order xas and
> > xas_load() finds an order-0 entry at the unaligned index. So
> > "internal" isn't Xarray private state it's an order-0 entry when the
> > user wanted multi-order?
>
> This sounds much more like what I just re-described above.  When you say
> an unaligned index, I suspect you mean something like having a PMD entry
> and specifying an index which is not PMD-aligned?  That always returns
> the PMD entry, just like the radix tree used to.

...ok, so I think I may have evidence to the contrary, or something
else is going wrong in the API. At the very least we're inching
toward the root cause. If I modify dax_to_pfn() like so:

diff --git a/fs/dax.c b/fs/dax.c
index 3f592dc18d67..7fd3529fe859 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -88,9 +88,17 @@ fs_initcall(init_dax_wait_table);
 #define DAX_ZERO_PAGE  (1UL << 2)
 #define DAX_EMPTY      (1UL << 3)

-static unsigned long dax_to_pfn(void *entry)
+static unsigned long dax_is_pmd_entry(void *entry)
 {
-       return xa_to_value(entry) >> DAX_SHIFT;
+       return xa_to_value(entry) & DAX_PMD;
+}
+
+static noinline unsigned long dax_to_pfn(void *entry)
+{
+       unsigned long val = xa_to_value(entry) >> DAX_SHIFT;
+
+       WARN_ON_ONCE(dax_is_pmd_entry(entry) && val & ((1UL << PMD_ORDER) - 1));
+       return val;
 }

...it triggers. The same change on top of 4.19 does not. So somehow we
are able to look up a PMD entry, but the value of the entry is
PTE-aligned. This is the precursor to the original failure: ext4
tries to invalidate the page affected by the memory failure, but the
resulting dax_disassociate_entry() starts its for_each_mapped_pfn() at
an unaligned pfn, and likely spills over into unrelated pages. Later
dax_insert_entry() sees ->mapping still set for the first few pfns
relative to the original PMD entry when it should have been set to
NULL.
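
For reference, the iterator in question is roughly (from fs/dax.c):

	#define for_each_mapped_pfn(entry, pfn) \
		for (pfn = dax_to_pfn(entry); \
				pfn < dax_end_pfn(entry); pfn++)

where dax_end_pfn() is dax_to_pfn() plus the entry size in pages, so a
PMD entry carrying a PTE-aligned pfn starts the walk mid-hugepage and
covers 512 pages from that unaligned start, spilling past the original
mapping.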

Do you see anything in the __dax_invalidate_entry() path that would lead
to this, or do I need to peel the onion a bit more?
