From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Andrew Morton <akpm@linux-foundation.org>, linux-kernel@vger.kernel.org
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>,
"Darrick J. Wong" <darrick.wong@oracle.com>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christoph Hellwig <hch@lst.de>,
Dan Williams <dan.j.williams@intel.com>, Jan Kara <jack@suse.cz>,
Matthew Wilcox <mawilcox@microsoft.com>,
linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org
Subject: [PATCH] dax: fix radix tree insertion race
Date: Thu, 6 Apr 2017 15:29:44 -0600
Message-ID: <20170406212944.2866-1-ross.zwisler@linux.intel.com> (raw)
While running generic/340 in my test setup I hit the following race. It can
happen with kernels that support FS DAX PMDs, so v4.10 thru v4.11-rc5.
Thread 1 Thread 2
-------- --------
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
'entry' is NULL, can't call lock_slot()
spin_unlock_irq()
radix_tree_preload()
dax_iomap_pmd_fault()
grab_mapping_entry()
spin_lock_irq()
get_unlocked_mapping_entry()
...
lock_slot()
spin_unlock_irq()
dax_pmd_insert_mapping()
<inserts a PMD mapping>
spin_lock_irq()
__radix_tree_insert() fails with -EEXIST
<fall back to 4k fault, and die horribly
when inserting a 4k entry where a PMD exists>
The issue is that we have to drop mapping->tree_lock while calling
radix_tree_preload(), but since we didn't have a radix tree entry to lock
(unlike in the pmd_downgrade case) we have no protection against Thread 2
coming along and inserting a PMD at the same index. For 4k entries we
handled this with a special-case response to -EEXIST coming from the
__radix_tree_insert(), but this doesn't save us for PMDs because the
-EEXIST case can also mean that we collided with a 4k entry in the radix
tree at a different index, but one that is covered by our PMD range.
So, correctly handle both the 4k and 2M collision cases by explicitly
re-checking the radix tree for an entry at our index once we reacquire
mapping->tree_lock.
This patch has made it through a clean xfstests run with the current
v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
loop. It used to fail within the first 10 iterations.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: <stable@vger.kernel.org> [4.10+]
---
fs/dax.c | 35 ++++++++++++++++++++++-------------
1 file changed, 22 insertions(+), 13 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index de622d4..85abd74 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -373,6 +373,22 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
spin_lock_irq(&mapping->tree_lock);
+ if (!entry) {
+ /*
+ * We needed to drop the page_tree lock while calling
+ * radix_tree_preload() and we didn't have an entry to
+ * lock. See if another thread inserted an entry at
+ * our index during this time.
+ */
+ entry = __radix_tree_lookup(&mapping->page_tree, index,
+ NULL, &slot);
+ if (entry) {
+ radix_tree_preload_end();
+ spin_unlock_irq(&mapping->tree_lock);
+ goto restart;
+ }
+ }
+
if (pmd_downgrade) {
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
@@ -388,19 +404,12 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
if (err) {
spin_unlock_irq(&mapping->tree_lock);
/*
- * Someone already created the entry? This is a
- * normal failure when inserting PMDs in a range
- * that already contains PTEs. In that case we want
- * to return -EEXIST immediately.
- */
- if (err == -EEXIST && !(size_flag & RADIX_DAX_PMD))
- goto restart;
- /*
- * Our insertion of a DAX PMD entry failed, most
- * likely because it collided with a PTE sized entry
- * at a different index in the PMD range. We haven't
- * inserted anything into the radix tree and have no
- * waiters to wake.
+ * Our insertion of a DAX entry failed, most likely
+ * because we were inserting a PMD entry and it
+ * collided with a PTE sized entry at a different
+ * index in the PMD range. We haven't inserted
+ * anything into the radix tree and have no waiters to
+ * wake.
*/
return ERR_PTR(err);
}
--
2.9.3
next reply index
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-04-06 21:29 Ross Zwisler [this message]
2017-04-10 13:41 ` Jan Kara
2017-04-10 20:34 ` Ross Zwisler
2017-04-11 8:29 ` Jan Kara
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170406212944.2866-1-ross.zwisler@linux.intel.com \
--to=ross.zwisler@linux.intel.com \
--cc=akpm@linux-foundation.org \
--cc=dan.j.williams@intel.com \
--cc=darrick.wong@oracle.com \
--cc=hch@lst.de \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvdimm@lists.01.org \
--cc=mawilcox@microsoft.com \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Linux-Fsdevel Archive on lore.kernel.org
Archives are clonable:
git clone --mirror https://lore.kernel.org/linux-fsdevel/0 linux-fsdevel/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 linux-fsdevel linux-fsdevel/ https://lore.kernel.org/linux-fsdevel \
linux-fsdevel@vger.kernel.org
public-inbox-index linux-fsdevel
Example config snippet for mirrors
Newsgroup available over NNTP:
nntp://nntp.lore.kernel.org/org.kernel.vger.linux-fsdevel
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git