From: Dan Williams <dan.j.williams@intel.com>
To: Ross Zwisler <ross.zwisler@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-nvdimm <linux-nvdimm@lists.01.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Linux MM <linux-mm@kvack.org>, Jan Kara <jack@suse.cz>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: [PATCH v4 10/12] filesystem-dax: Introduce dax_lock_page()
Date: Wed, 4 Jul 2018 08:11:14 -0700
Message-ID: <CAPcyv4i9kK6e6aWVJWyhdW+WKGVuujmBZXyCQvQ7s-JpAdY_9Q@mail.gmail.com>
In-Reply-To: <20180612181542.GB28436@linux.intel.com>

On Tue, Jun 12, 2018 at 11:15 AM, Ross Zwisler
<ross.zwisler@linux.intel.com> wrote:
> On Fri, Jun 08, 2018 at 04:51:14PM -0700, Dan Williams wrote:
>> In preparation for implementing support for memory poison (media error)
>> handling via dax mappings, implement a lock_page() equivalent. Poison
>> error handling requires rmap and needs guarantees that the page->mapping
>> association is maintained / valid (inode not freed) for the duration of
>> the lookup.
>>
>> In the device-dax case it is sufficient to simply hold a dev_pagemap
>> reference. In the filesystem-dax case we need to use the entry lock.
>>
>> Export the entry lock via dax_lock_page() that uses rcu_read_lock() to
>> protect against the inode being freed, and revalidates the page->mapping
>> association under xa_lock().
>>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c            |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++
>>  include/linux/dax.h |   15 ++++++++++
>>  2 files changed, 91 insertions(+)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index cccf6cad1a7a..b7e71b108fcf 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -361,6 +361,82 @@ static void dax_disassociate_entry(void *entry, struct address_space *mapping,
>>       }
>>  }
>>
>> +struct page *dax_lock_page(unsigned long pfn)
>> +{
>> +     pgoff_t index;
>> +     struct inode *inode;
>> +     wait_queue_head_t *wq;
>> +     void *entry = NULL, **slot;
>> +     struct address_space *mapping;
>> +     struct wait_exceptional_entry_queue ewait;
>> +     struct page *ret = NULL, *page = pfn_to_page(pfn);
>> +
>> +     rcu_read_lock();
>> +     for (;;) {
>> +             mapping = READ_ONCE(page->mapping);
>
> Why the READ_ONCE()?

We're potentially racing inode teardown, so the READ_ONCE() prevents
the compiler from re-reading page->mapping and getting inconsistent
answers between the NULL check and the later uses.

>
>> +
>> +             if (!mapping || !IS_DAX(mapping->host))
>
> Might read better using the dax_mapping() helper.

Sure.

>
> Also, forgive my ignorance, but this implies that dev dax has page->mapping
> set up and that that inode will have IS_DAX set, right?  This will let us get
> past this point for device DAX, and we'll bail out at the S_ISCHR() check?

Yes.

>
>> +                     break;
>> +
>> +             /*
>> +              * In the device-dax case there's no need to lock, a
>> +              * struct dev_pagemap pin is sufficient to keep the
>> +              * inode alive.
>> +              */
>> +             inode = mapping->host;
>> +             if (S_ISCHR(inode->i_mode)) {
>> +                     ret = page;
>
> 'ret' isn't actually used for anything in this function, we just
> unconditionally return 'page'.
>

Yes, bug.

>> +                     break;
>> +             }
>> +
>> +             xa_lock_irq(&mapping->i_pages);
>> +             if (mapping != page->mapping) {
>> +                     xa_unlock_irq(&mapping->i_pages);
>> +                     continue;
>> +             }
>> +             index = page->index;
>> +
>> +             init_wait(&ewait.wait);
>> +             ewait.wait.func = wake_exceptional_entry_func;
>> +
>> +             entry = __radix_tree_lookup(&mapping->i_pages, index, NULL,
>> +                             &slot);
>> +             if (!entry ||
>
> So if we do a lookup and there is no entry in the tree, we won't add an empty
> entry and lock it, we'll just return with no entry in the tree and nothing
> locked.
>
> Then, when we call dax_unlock_page(), we'll eventually hit a WARN_ON_ONCE() in
> dax_unlock_mapping_entry() when we see entry is 0.  And, in that gap we've got
> nothing locked so page faults could have happened, etc... (which would mean
> that instead of WARN_ON_ONCE() for an empty entry, we'd hit it instead for an
> unlocked entry).
>
> Is that okay?  Or do we need to insert a locked empty entry here?

No, the intent was to return NULL and fail the lock, but I messed up
and unconditionally returned the page.
