linux-nvdimm.lists.01.org archive mirror
From: "Williams, Dan J" <dan.j.williams@intel.com>
To: "willy@infradead.org" <willy@infradead.org>
Cc: "linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jack@suse.cz" <jack@suse.cz>,
	"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>
Subject: Re: fsdax memory error handling regression
Date: Wed, 7 Nov 2018 06:01:19 +0000	[thread overview]
Message-ID: <35429d15f7dfd2c551b9d61512abe2e04376d2a0.camel@intel.com> (raw)
In-Reply-To: <20181106144848.GE3074@bombadil.infradead.org>

On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> On Tue, Nov 06, 2018 at 03:44:47AM +0000, Williams, Dan J wrote:
> > Hi Willy,
> > 
> > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > test
> > from the ndctl repository:
> 
> I'll try to run this myself later today.
> 
> > I tried to get this test going on -next before the merge window,
> > but
> > -next was not bootable for me. Bisection points to:
> > 
> >     9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > 
> > At first glance I think we need the old "always retry if we slept"
> > behavior. Otherwise this failure seems similar to the issue fixed
> > by
> > Ross' change to always retry on any potential collision:
> > 
> >     b1f382178d15 ext4: close race between direct IO and
> > ext4_break_layouts()
> > 
> > I'll take a closer look tomorrow to see if that guess is plausible.
> 
> If your thought is correct, then this should be all that's needed:
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 616e36ea6aaa..529ac9d7c10a 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page)
>  		entry = xas_load(&xas);
>  		if (dax_is_locked(entry)) {
>  			entry = get_unlocked_entry(&xas);
> -			/* Did the page move while we slept? */
> -			if (dax_to_pfn(entry) != page_to_pfn(page)) {
> -				xas_unlock_irq(&xas);
> -				continue;
> -			}
> +			xas_unlock_irq(&xas);
> +			continue;
>  		}
>  		dax_lock_entry(&xas, entry);
>  		xas_unlock_irq(&xas);

No, that doesn't work.

> 
> I don't quite understand how we'd find a PFN for this page in the
> tree
> after the page has had page->mapping removed.  However, the more I
> look
> at this path, the more I don't like it -- it doesn't handle returning
> NULL explicitly, nor does it handle the situation where a PMD is
> split
> to form multiple PTEs explicitly, it just kind of relies on those bit
> patterns not matching.
> 
> So I kind of like the "just retry without doing anything clever"
> situation
> that the above patch takes us to.

I've been hacking at this today and am starting to lean towards
"revert" over "fix", given the number of changes needed to get this
back on its feet. I've been able to get the test passing again with
the changes below, applied directly on top of commit 9f32d221301c
("dax: Convert dax_lock_mapping_entry to XArray"). That said, I have
so far been unable to rebase this patch on top of v4.20-rc1 and get a
functional result.

My concerns are:
- I can't tell whether dax_unlock_entry() wants a locked or an
unlocked entry parameter; the dax_insert_pfn_mkwrite() and
dax_unlock_mapping_entry() call sites seem to disagree.

- The multi-order use case of the XArray is a mystery to me. It seems
to want to know the order of an entry a priori, with a choice between
XA_STATE_ORDER() and XA_STATE(). That falls over in
dax_unlock_mapping_entry() and other places where the only way to
determine the order of an entry is dax_is_pmd_entry(), i.e. from the
XArray itself. PageHead() does not work for DAX pages because
PageHead() is only established by the page allocator, and DAX pages
never participate in the page allocator.

- The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
for inode lifetime synchronization, not just for walking the radix.
That lock needs to be dropped before sleeping, and if we slept the
inode may no longer exist.

- I could not see how the pattern:
	entry = xas_load(&xas);
	if (dax_is_locked(entry)) {
		entry = get_unlocked_entry(&xas);
...was safe, given that get_unlocked_entry() turns around and
validates that the entry is neither NULL nor xa_is_internal().

- The usage of internal entries in grab_mapping_entry() seems to need
auditing. Previously we would compare the entry size against
@size_flag, but now if the index hits a multi-order entry in
get_unlocked_entry(), afaics it could be internal, and we need to
convert it to the actual entry before aborting... at least to match
the v4.19 behavior.

This all seems to merit a rethink of the DAX integration with the
XArray.

diff --git a/fs/dax.c b/fs/dax.c
index fc2745ca3308..2b3bd4a4cc48 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -99,17 +99,6 @@ static void *dax_make_locked(unsigned long pfn, unsigned long flags)
 			DAX_LOCKED);
 }
 
-static void *dax_make_entry(pfn_t pfn, unsigned long flags)
-{
-	return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
-}
-
-static void *dax_make_page_entry(struct page *page)
-{
-	pfn_t pfn = page_to_pfn_t(page);
-	return dax_make_entry(pfn, PageHead(page) ? DAX_PMD : 0);
-}
-
 static bool dax_is_locked(void *entry)
 {
 	return xa_to_value(entry) & DAX_LOCKED;
@@ -225,7 +214,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_entry(struct xa_state *xas)
+static void *__get_unlocked_entry(struct xa_state *xas, bool (*wait_fn)(void))
 {
 	void *entry;
 	struct wait_exceptional_entry_queue ewait;
@@ -235,6 +224,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
 	ewait.wait.func = wake_exceptional_entry_func;
 
 	for (;;) {
+		bool revalidate;
+
 		entry = xas_load(xas);
 		if (!entry || xa_is_internal(entry) ||
 				WARN_ON_ONCE(!xa_is_value(entry)) ||
@@ -247,12 +238,29 @@ static void *get_unlocked_entry(struct xa_state *xas)
 					  TASK_UNINTERRUPTIBLE);
 		xas_unlock_irq(xas);
 		xas_reset(xas);
-		schedule();
+		revalidate = wait_fn();
 		finish_wait(wq, &ewait.wait);
 		xas_lock_irq(xas);
+		if (revalidate)
+			return ERR_PTR(-EAGAIN);
 	}
 }
 
+static bool entry_wait(void)
+{
+	schedule();
+	/*
+	 * Never return an ERR_PTR() from __get_unlocked_entry(), just
+	 * keep looping.
+	 */
+	return false;
+}
+
+static void *get_unlocked_entry(struct xa_state *xas)
+{
+	return __get_unlocked_entry(xas, entry_wait);
+}
+
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
@@ -366,16 +374,6 @@ static void *__get_unlocked_mapping_entry(struct address_space *mapping,
 	}
 }
 
-static bool entry_wait(void)
-{
-	schedule();
-	/*
-	 * Never return an ERR_PTR() from
-	 * __get_unlocked_mapping_entry(), just keep looping.
-	 */
-	return false;
-}
-
 static void *get_unlocked_mapping_entry(struct address_space *mapping,
 		pgoff_t index, void ***slotp)
 {
@@ -498,16 +496,33 @@ static struct page *dax_busy_page(void *entry)
 	return NULL;
 }
 
+
+static bool entry_wait_revalidate(void)
+{
+	rcu_read_unlock();
+	schedule();
+	rcu_read_lock();
+
+	/*
+	 * Tell __get_unlocked_entry() to take a break, we need to
+	 * revalidate page->mapping after dropping locks
+	 */
+	return true;
+}
+
 bool dax_lock_mapping_entry(struct page *page)
 {
 	XA_STATE(xas, NULL, 0);
+	bool did_lock = false;
 	void *entry;
 
+	/* hold rcu lock to coordinate with inode end-of-life */
+	rcu_read_lock();
 	for (;;) {
 		struct address_space *mapping = READ_ONCE(page->mapping);
 
 		if (!dax_mapping(mapping))
-			return false;
+			break;
 
 		/*
 		 * In the device-dax case there's no need to lock, a
@@ -516,9 +531,12 @@ bool dax_lock_mapping_entry(struct page *page)
 		 * otherwise we would not have a valid pfn_to_page()
 		 * translation.
 		 */
-		if (S_ISCHR(mapping->host->i_mode))
-			return true;
+		if (S_ISCHR(mapping->host->i_mode)) {
+			did_lock = true;
+			break;
+		}
 
+		xas_reset(&xas);
 		xas.xa = &mapping->i_pages;
 		xas_lock_irq(&xas);
 		if (mapping != page->mapping) {
@@ -526,30 +544,33 @@ bool dax_lock_mapping_entry(struct page *page)
 			continue;
 		}
 		xas_set(&xas, page->index);
-		entry = xas_load(&xas);
-		if (dax_is_locked(entry)) {
-			entry = get_unlocked_entry(&xas);
-			/* Did the page move while we slept? */
-			if (dax_to_pfn(entry) != page_to_pfn(page)) {
-				xas_unlock_irq(&xas);
-				continue;
-			}
+		entry = __get_unlocked_entry(&xas, entry_wait_revalidate);
+		if (!entry) {
+			xas_unlock_irq(&xas);
+			break;
+		} else if (IS_ERR(entry)) {
+			xas_unlock_irq(&xas);
+			WARN_ON_ONCE(PTR_ERR(entry) != -EAGAIN);
+			continue;
 		}
 		dax_lock_entry(&xas, entry);
+		did_lock = true;
 		xas_unlock_irq(&xas);
-		return true;
+		break;
 	}
+	rcu_read_unlock();
+
+	return did_lock;
 }
 
 void dax_unlock_mapping_entry(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
-	XA_STATE(xas, &mapping->i_pages, page->index);
 
 	if (S_ISCHR(mapping->host->i_mode))
 		return;
 
-	dax_unlock_entry(&xas, dax_make_page_entry(page));
+	unlock_mapping_entry(mapping, page->index);
 }
 
 /*


Thread overview: 8+ messages
2018-11-06  3:44 fsdax memory error handling regression Williams, Dan J
2018-11-06 14:48 ` Matthew Wilcox
2018-11-07  6:01   ` Williams, Dan J [this message]
2018-11-09 19:54     ` Dan Williams
2018-11-10  8:29     ` Matthew Wilcox
2018-11-10 17:08       ` Dan Williams
2018-11-13 14:25         ` Matthew Wilcox
2018-11-29  6:09           ` Dan Williams
