From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, "Wilcox,
	Matthew R" <matthew.r.wilcox@intel.com>,
	Ross Zwisler <ross.zwisler@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-nvdimm@lists.01.org, NeilBrown <neilb@suse.com>
Subject: Re: [PATCH 08/10] dax: New fault locking
Date: Tue, 29 Mar 2016 15:57:32 -0600	[thread overview]
Message-ID: <20160329215732.GA21264@linux.intel.com> (raw)
In-Reply-To: <1458566575-28063-9-git-send-email-jack@suse.cz>

On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> Currently DAX page fault locking is racy.
> 
> CPU0 (write fault)		CPU1 (read fault)
> 
> __dax_fault()			__dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
> 				  get_block(inode, block, &bh, 0)
> 				    -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
> 				  if (!buffer_mapped(&bh))
> 				    if (vmf->flags & FAULT_FLAG_WRITE) {
> 				    } else {
> 				      dax_load_hole();
> 				    }
>   dax_insert_mapping()
> 
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
> 
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
> 
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

I've got lots of little comments, but from my point of view this is looking
pretty good.  I agree with the choice to put this in dax.c as opposed to
radix-tree.c or somewhere similar - it all seems very DAX-specific for now.

> ---
>  fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
>  include/linux/dax.h |   1 +
>  mm/truncate.c       |  62 ++++---
>  3 files changed, 396 insertions(+), 167 deletions(-)
<>
> +static inline int slot_locked(void **v)
> +{
> +	unsigned long l = *(unsigned long *)v;
> +	return l & DAX_ENTRY_LOCK;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= DAX_ENTRY_LOCK);
> +}
> +
> +static inline void *unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
> +}

For the above three helpers I think we could do with better parameter and
variable naming so it's clearer what's going on.  s/v/slot/ and s/l/entry/ ?

Also, for many of these new functions we need to be holding
mapping->tree_lock - can we quickly document that with comments?
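
Something like this is what I had in mind (untested, and 'slot'/'entry' are
just my suggested names):

	/* The caller must hold mapping->tree_lock. */
	static inline int slot_locked(void **slot)
	{
		unsigned long entry = *(unsigned long *)slot;

		return entry & DAX_ENTRY_LOCK;
	}

with lock_slot() and unlock_slot() getting the same treatment.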

> +/*
> + * Lookup entry in radix tree, wait for it to become unlocked if it is
> + * exceptional entry and return.
> + *
> + * The function must be called with mapping->tree_lock held.
> + */
> +static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
> +					   pgoff_t index, void ***slotp)
> +{
> +	void *ret, **slot;
> +	struct wait_exceptional_entry_queue wait;

This should probably be named 'ewait' to be consistent with
wake_exceptional_entry_func(), and so that we have distinct, consistent
names for our struct wait_exceptional_entry_queue and wait_queue_t
variables.

> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_exceptional_entry_func;
> +	wait.key.root = &mapping->page_tree;
> +	wait.key.index = index;
> +
> +	for (;;) {
> +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +					  &slot);
> +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		    !slot_locked(slot)) {
> +			if (slotp)
> +				*slotp = slot;
> +			return ret;
> +		}
> +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);

Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
task stuck in D state?

> +		spin_unlock_irq(&mapping->tree_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wait);
> +		spin_lock_irq(&mapping->tree_lock);
> +	}
> +}
> +
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

What is the benefit of preloading the radix tree?  It looks like we have to
drop the mapping->tree_lock, deal with an error, re-grab the lock and then
deal with a possible collision with an entry that was inserted while we
didn't hold the lock.

Can we just try the insert, and if it fails with -ENOMEM take our normal
error path, dropping the tree_lock and returning the error?
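
i.e. something like this (untested sketch):

	spin_lock_irq(&mapping->tree_lock);
	ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
	err = radix_tree_insert(&mapping->page_tree, index, ret);
	if (err) {
		spin_unlock_irq(&mapping->tree_lock);
		/* -ENOMEM now just takes the normal error path */
		if (err == -EEXIST)
			goto restart;
		return ERR_PTR(err);
	}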

> +		if (err)
> +			return ERR_PTR(err);
> +		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
> +		spin_lock_irq(&mapping->tree_lock);
> +		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		radix_tree_preload_end();
> +		if (err) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			/* Someone already created the entry? */
> +			if (err == -EEXIST)
> +				goto restart;
> +			return ERR_PTR(err);
> +		}
> +		/* Good, we have inserted empty locked entry into the tree. */
> +		mapping->nrexceptional++;
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return ret;
> +	}
> +	/* Normal page in radix tree? */
> +	if (!radix_tree_exceptional_entry(ret)) {
> +		struct page *page = ret;
> +
> +		page_cache_get(page);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		lock_page(page);
> +		/* Page got truncated? Retry... */
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto restart;
> +		}
> +		return page;
> +	}
> +	ret = lock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	if (WARN_ON_ONCE(!slot_locked(slot))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}

It may be worth combining these two WARN_ON_ONCE() error cases for brevity,
since they are both insanity conditions.
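
e.g. (untested):

	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
			 !slot_locked(slot))) {
		spin_unlock_irq(&mapping->tree_lock);
		return;
	}

The short-circuit keeps us from calling slot_locked() on a failed lookup.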

> +	unlock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	if (waitqueue_active(wq)) {
> +		struct exceptional_entry_key key;
> +
> +		key.root = &mapping->page_tree;
> +		key.index = index;
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}

The above if() block is repeated 3 times in the next few functions with small
variations (the third argument to __wake_up()).  Perhaps it should be pulled
out into a helper?
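
Maybe something like this (untested, and the name is just a placeholder):

	static void wake_mapping_entry_waiter(struct address_space *mapping,
					      pgoff_t index, int nr_wake)
	{
		wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);

		if (waitqueue_active(wq)) {
			struct exceptional_entry_key key;

			key.root = &mapping->page_tree;
			key.index = index;
			__wake_up(wq, TASK_NORMAL, nr_wake, &key);
		}
	}

with each caller passing in the wake count it needs.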

> +static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
> +			       void *entry, sector_t sector, bool dirty,
> +			       gfp_t gfp_mask)

This argument list is getting pretty long, and our one caller pulls most of
these values out of the VMF anyway.  Perhaps we could just pass in the VMF
and extract the bits ourselves?
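
i.e. just a sketch of the signature change I'm thinking of:

	static void *dax_mapping_entry(struct address_space *mapping,
				       void *entry, sector_t sector,
				       struct vm_fault *vmf)
	{
		pgoff_t index = vmf->pgoff;
		bool dirty = vmf->flags & FAULT_FLAG_WRITE;
		gfp_t gfp_mask = vmf->gfp_mask & ~__GFP_HIGHMEM;

		/* ... rest of the function body unchanged ... */
	}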

>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	pgoff_t pmd_index = DAX_PMD_INDEX(index);
> -	int type, error = 0;
> -	void *entry;
> +	int error = 0;
> +	bool hole_fill = false;
> +	void *ret;

Just a nit, but I find the use of 'ret' a bit confusing, since it's not a
return value we got from anywhere - it's an entry that we set up, insert and
then return to our caller.  We use 'error' to capture return values from the
calls this function makes.  Maybe this would be clearer as "new_entry" or
something similar?

> -	WARN_ON_ONCE(pmd_entry && !dirty);
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	spin_lock_irq(&mapping->tree_lock);
> -
> -	entry = radix_tree_lookup(page_tree, pmd_index);
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> -		index = pmd_index;
> -		goto dirty;
> +	/* Replacing hole page with block mapping? */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		hole_fill = true;
> +		error = radix_tree_preload(gfp_mask);
> +		if (error)
> +			return ERR_PTR(error);
>  	}
>  
> -	entry = radix_tree_lookup(page_tree, index);
> -	if (entry) {
> -		type = RADIX_DAX_TYPE(entry);
> -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> -					type != RADIX_DAX_PMD)) {
> -			error = -EIO;
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> +		       DAX_ENTRY_LOCK);
> +	if (hole_fill) {
> +		__delete_from_page_cache(entry, NULL);
> +		error = radix_tree_insert(page_tree, index, ret);
> +		if (error) {
> +			ret = ERR_PTR(error);
>  			goto unlock;
>  		}
> +		mapping->nrexceptional++;
> +	} else {
> +		void **slot;
> +		void *ret2;
>  
> -		if (!pmd_entry || type == RADIX_DAX_PMD)
> -			goto dirty;
> -
> -		/*
> -		 * We only insert dirty PMD entries into the radix tree.  This
> -		 * means we don't need to worry about removing a dirty PTE
> -		 * entry and inserting a clean PMD entry, thus reducing the
> -		 * range we would flush with a follow-up fsync/msync call.
> -		 */
> -		radix_tree_delete(&mapping->page_tree, index);
> -		mapping->nrexceptional--;
> -	}
> -
> -	if (sector == NO_SECTOR) {
> -		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> -		 * fault raced against a hole punch operation.  If this
> -		 * happens the pte that was hole punched will have been
> -		 * unmapped and the radix tree entry will have been removed by
> -		 * the time we are called, but the call will still happen.  We
> -		 * will return all the way up to wp_pfn_shared(), where the
> -		 * pte_same() check will fail, eventually causing page fault
> -		 * to be retried by the CPU.
> -		 */
> -		goto unlock;
> +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);

You don't need ret2.  You can just compare 'entry' with '*slot' - see
dax_writeback_one() for an example.
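
i.e. (untested):

		void **slot;

		__radix_tree_lookup(page_tree, index, NULL, &slot);
		WARN_ON_ONCE(*slot != entry);
		radix_tree_replace_slot(slot, ret);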

> +		WARN_ON_ONCE(ret2 != entry);
> +		radix_tree_replace_slot(slot, ret);
>  	}
> -
> -	error = radix_tree_insert(page_tree, index,
> -			RADIX_DAX_ENTRY(sector, pmd_entry));
> -	if (error)
> -		goto unlock;
> -
> -	mapping->nrexceptional++;
> - dirty:
>  	if (dirty)
>  		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
>   unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return error;
> +	if (hole_fill)
> +		radix_tree_preload_end();
> +	return ret;
>  }
>  
>  static int dax_writeback_one(struct block_device *bdev,
> @@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>  
> -static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +static int dax_insert_mapping(struct address_space *mapping,
> +			struct buffer_head *bh, void *entry,
>  			struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	struct address_space *mapping = inode->i_mapping;
>  	struct block_device *bdev = bh->b_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = to_sector(bh, inode),
> +		.sector = to_sector(bh, mapping->host),
>  		.size = bh->b_size,
>  	};
>  	int error;
> +	void *ret;
>  
>  	i_mmap_lock_read(mapping);
>  
> @@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>  
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> -	if (error)
> +	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
> +			        vmf->flags & FAULT_FLAG_WRITE,
> +			        vmf->gfp_mask & ~__GFP_HIGHMEM);

The spacing before the parameters to dax_mapping_entry() is messed up & makes
checkpatch grumpy:

ERROR: code indent should use tabs where possible
#488: FILE: fs/dax.c:812:
+^I^I^I        vmf->flags & FAULT_FLAG_WRITE,$

ERROR: code indent should use tabs where possible
#489: FILE: fs/dax.c:813:
+^I^I^I        vmf->gfp_mask & ~__GFP_HIGHMEM);$

There are a few other checkpatch warnings as well that should probably be
addressed.

> +	if (IS_ERR(ret)) {
> +		error = PTR_ERR(ret);
>  		goto out;
> +	}
> +	/* Have we replaced hole page? Unmap and free it. */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +				    PAGE_CACHE_SIZE, 0);
> +		unlock_page(entry);
> +		page_cache_release(entry);
> +	}
> +	entry = ret;
>  
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> -
>   out:
>  	i_mmap_unlock_read(mapping);
> -
> +	put_locked_mapping_entry(mapping, vmf->pgoff, entry);

Hmm... this entry was locked by our parent (__dax_fault()), and is released
by our parent in the error cases that go through 'unlock_entry:'.  For
symmetry it's probably better to move this call up to our parent as well.