From: Jan Kara <jack@suse.cz>
To: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Jan Kara <jack@suse.cz>,
	linux-nvdimm@lists.01.org, NeilBrown <neilb@suse.com>,
	"Wilcox, Matthew R" <matthew.r.wilcox@intel.com>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 08/10] dax: New fault locking
Date: Thu, 31 Mar 2016 18:27:30 +0200	[thread overview]
Message-ID: <20160331162730.GB10434@quack.suse.cz> (raw)
In-Reply-To: <20160329215732.GA21264@linux.intel.com>

Thanks for the review, Ross! I have implemented your comments except where
I state otherwise below.

On Tue 29-03-16 15:57:32, Ross Zwisler wrote:
> On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> > +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > +
> > +	init_wait(&wait.wait);
> > +	wait.wait.func = wake_exceptional_entry_func;
> > +	wait.key.root = &mapping->page_tree;
> > +	wait.key.index = index;
> > +
> > +	for (;;) {
> > +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> > +					  &slot);
> > +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> > +		    !slot_locked(slot)) {
> > +			if (slotp)
> > +				*slotp = slot;
> > +			return ret;
> > +		}
> > +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
> 
> Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
> zombie?

Well, do you want to deal with signal handling all the way up the call
chain? The wait should be pretty short given the nature of pmem, so I
didn't see much point in bothering with signal handling...
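
Just for illustration, a killable variant of the wait loop would look
roughly like the sketch below. This is not what the patch does -- it would
return with tree_lock dropped, so every caller of the lookup would also
have to grow an error path:

	prepare_to_wait_exclusive(wq, &wait.wait, TASK_KILLABLE);
	spin_unlock_irq(&mapping->tree_lock);
	schedule();
	finish_wait(wq, &wait.wait);
	/* Hypothetical: bail out instead of retrying the lookup */
	if (fatal_signal_pending(current))
		return ERR_PTR(-EINTR);
	spin_lock_irq(&mapping->tree_lock);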

> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		schedule();
> > +		finish_wait(wq, &wait.wait);
> > +		spin_lock_irq(&mapping->tree_lock);
> > +	}
> > +}
> > +
> > +/*
> > + * Find radix tree entry at given index. If it points to a page, return with
> > + * the page locked. If it points to the exceptional entry, return with the
> > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > + * create empty exceptional entry for the index and return with it locked.
> > + *
> > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > + * persistent memory the benefit is doubtful. We can add that later if we can
> > + * show it helps.
> > + */
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > +	void *ret, **slot;
> > +
> > +restart:
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> > +	/* No entry for given index? Make sure radix tree is big enough. */
> > +	if (!ret) {
> > +		int err;
> > +
> > +		spin_unlock_irq(&mapping->tree_lock);
> > +		err = radix_tree_preload(
> > +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> 
> What is the benefit to preloading the radix tree?  It looks like we have
> to drop the mapping->tree_lock, deal with an error, regrab the lock and
> then deal with a possible collision with an entry that was inserted while
> we didn't hold the lock.
> 
> Can we just try and insert it, then if it fails with -ENOMEM we just do
> our normal error path, dropping the tree_lock and returning the error?

If we don't preload, the allocations will happen with GFP_ATOMIC. That
should be avoided where possible, since atomic allocations are pretty
restricted. So basically all of the pagecache code first allocates the
nodes it may need before acquiring locks and then uses those nodes later,
and I have mirrored that behavior. Note that we take the hit of dropping
the lock only when we really need to allocate a new radix tree node, so
about once per 64 new entries. So it is not too bad.
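
To make the pattern concrete, the usual preload sequence looks roughly
like this (a sketch with error handling trimmed, using the same gfp mask
as above):

	err = radix_tree_preload(mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
	if (err)
		return ERR_PTR(err);
	spin_lock_irq(&mapping->tree_lock);
	/* The insert consumes the preallocated per-cpu nodes, no GFP_ATOMIC */
	err = radix_tree_insert(&mapping->page_tree, index, entry);
	spin_unlock_irq(&mapping->tree_lock);
	radix_tree_preload_end();	/* re-enables preemption */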

> > -	WARN_ON_ONCE(pmd_entry && !dirty);
> >  	if (dirty)
> >  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >  
> > -	spin_lock_irq(&mapping->tree_lock);
> > -
> > -	entry = radix_tree_lookup(page_tree, pmd_index);
> > -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> > -		index = pmd_index;
> > -		goto dirty;
> > +	/* Replacing hole page with block mapping? */
> > +	if (!radix_tree_exceptional_entry(entry)) {
> > +		hole_fill = true;
> > +		error = radix_tree_preload(gfp_mask);
> > +		if (error)
> > +			return ERR_PTR(error);
> >  	}
> >  
> > -	entry = radix_tree_lookup(page_tree, index);
> > -	if (entry) {
> > -		type = RADIX_DAX_TYPE(entry);
> > -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> > -					type != RADIX_DAX_PMD)) {
> > -			error = -EIO;
> > +	spin_lock_irq(&mapping->tree_lock);
> > +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> > +		       DAX_ENTRY_LOCK);
> > +	if (hole_fill) {
> > +		__delete_from_page_cache(entry, NULL);
> > +		error = radix_tree_insert(page_tree, index, ret);
> > +		if (error) {
> > +			ret = ERR_PTR(error);
> >  			goto unlock;
> >  		}
> > +		mapping->nrexceptional++;
> > +	} else {
> > +		void **slot;
> > +		void *ret2;
> >  
> > -		if (!pmd_entry || type == RADIX_DAX_PMD)
> > -			goto dirty;
> > -
> > -		/*
> > -		 * We only insert dirty PMD entries into the radix tree.  This
> > -		 * means we don't need to worry about removing a dirty PTE
> > -		 * entry and inserting a clean PMD entry, thus reducing the
> > -		 * range we would flush with a follow-up fsync/msync call.
> > -		 */
> > -		radix_tree_delete(&mapping->page_tree, index);
> > -		mapping->nrexceptional--;
> > -	}
> > -
> > -	if (sector == NO_SECTOR) {
> > -		/*
> > -		 * This can happen during correct operation if our pfn_mkwrite
> > -		 * fault raced against a hole punch operation.  If this
> > -		 * happens the pte that was hole punched will have been
> > -		 * unmapped and the radix tree entry will have been removed by
> > -		 * the time we are called, but the call will still happen.  We
> > -		 * will return all the way up to wp_pfn_shared(), where the
> > -		 * pte_same() check will fail, eventually causing page fault
> > -		 * to be retried by the CPU.
> > -		 */
> > -		goto unlock;
> > +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);
> 
> You don't need ret2.  You can just compare 'entry' with '*slot' - see
> dax_writeback_one() for an example.

Hum, but if we want to do this cleanly (and get all the lockdep
verification), we should use

radix_tree_deref_slot_protected(slot, &mapping->tree_lock)

instead of *slot. And at that point my fingers hurt so much that I just
created a new variable for caching the result ;). BTW, this has prompted
me to also fix lock_slot(), unlock_slot(), and slot_locked() to use the
proper RCU primitives for modifying slot contents.
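
E.g. a lock_slot() built on those primitives would look roughly like the
sketch below (using the DAX_ENTRY_LOCK flag from this patch; the exact
shape in my tree may differ):

	static inline void *lock_slot(struct address_space *mapping,
				      void **slot)
	{
		unsigned long entry = (unsigned long)
			radix_tree_deref_slot_protected(slot,
							&mapping->tree_lock);

		entry |= DAX_ENTRY_LOCK;
		/* replace_slot publishes the value with proper RCU barriers */
		radix_tree_replace_slot(slot, (void *)entry);
		return (void *)entry;
	}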

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR