From: Ross Zwisler <ross.zwisler@linux.intel.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-fsdevel@vger.kernel.org, "Wilcox, Matthew R" <matthew.r.wilcox@intel.com>, Ross Zwisler <ross.zwisler@linux.intel.com>, Dan Williams <dan.j.williams@intel.com>, linux-nvdimm@lists.01.org, NeilBrown <neilb@suse.com>
Subject: Re: [PATCH 08/10] dax: New fault locking
Date: Tue, 29 Mar 2016 15:57:32 -0600
Message-ID: <20160329215732.GA21264@linux.intel.com>
In-Reply-To: <1458566575-28063-9-git-send-email-jack@suse.cz>

On Mon, Mar 21, 2016 at 02:22:53PM +0100, Jan Kara wrote:
> Currently DAX page fault locking is racy.
>
> CPU0 (write fault)                     CPU1 (read fault)
>
> __dax_fault()                          __dax_fault()
>   get_block(inode, block, &bh, 0) -> not mapped
>                                          get_block(inode, block, &bh, 0)
>                                            -> not mapped
>   if (!buffer_mapped(&bh))
>     if (vmf->flags & FAULT_FLAG_WRITE)
>       get_block(inode, block, &bh, 1) -> allocates blocks
>   if (page) -> no
>                                          if (!buffer_mapped(&bh))
>                                            if (vmf->flags & FAULT_FLAG_WRITE) {
>                                            } else {
>                                              dax_load_hole();
>                                            }
>   dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
>
> Signed-off-by: Jan Kara <jack@suse.cz>

I've got lots of little comments, but from my point of view this seems like
it is looking pretty good.

I agree with the choice to put this in dax.c as opposed to radix-tree.c or
something - this seems very DAX specific for now.
> ---
>  fs/dax.c            | 500 ++++++++++++++++++++++++++++++++++++++--------------
>  include/linux/dax.h |   1 +
>  mm/truncate.c       |  62 ++++---
>  3 files changed, 396 insertions(+), 167 deletions(-)

<>

> +static inline int slot_locked(void **v)
> +{
> +	unsigned long l = *(unsigned long *)v;
> +	return l & DAX_ENTRY_LOCK;
> +}
> +
> +static inline void *lock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l |= DAX_ENTRY_LOCK);
> +}
> +
> +static inline void *unlock_slot(void **v)
> +{
> +	unsigned long *l = (unsigned long *)v;
> +	return (void*)(*l &= ~(unsigned long)DAX_ENTRY_LOCK);
> +}

For the above three helpers I think we could do with better parameter and
variable naming so it's clearer what's going on.  s/v/slot/ and s/l/entry/ ?

Also, for many of these new functions we need to be holding
mapping->tree_lock - can we quickly document that with comments?

> +/*
> + * Lookup entry in radix tree, wait for it to become unlocked if it is
> + * exceptional entry and return.
> + *
> + * The function must be called with mapping->tree_lock held.
> + */
> +static void *lookup_unlocked_mapping_entry(struct address_space *mapping,
> +					   pgoff_t index, void ***slotp)
> +{
> +	void *ret, **slot;
> +	struct wait_exceptional_entry_queue wait;

This should probably be named 'ewait' to be consistent with
wake_exceptional_entry_func(), and so we have a different and consistent
naming between our struct wait_exceptional_entry_queue and wait_queue_t
variables.
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	init_wait(&wait.wait);
> +	wait.wait.func = wake_exceptional_entry_func;
> +	wait.key.root = &mapping->page_tree;
> +	wait.key.index = index;
> +
> +	for (;;) {
> +		ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
> +					  &slot);
> +		if (!ret || !radix_tree_exceptional_entry(ret) ||
> +		    !slot_locked(slot)) {
> +			if (slotp)
> +				*slotp = slot;
> +			return ret;
> +		}
> +		prepare_to_wait_exclusive(wq, &wait.wait, TASK_UNINTERRUPTIBLE);

Should we make this TASK_INTERRUPTIBLE so we don't end up with an unkillable
zombie?

> +		spin_unlock_irq(&mapping->tree_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wait);
> +		spin_lock_irq(&mapping->tree_lock);
> +	}
> +}
> +
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +
> +restart:
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = lookup_unlocked_mapping_entry(mapping, index, &slot);
> +	/* No entry for given index? Make sure radix tree is big enough. */
> +	if (!ret) {
> +		int err;
> +
> +		spin_unlock_irq(&mapping->tree_lock);
> +		err = radix_tree_preload(
> +				mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);

What is the benefit to preloading the radix tree?  It looks like we have to
drop the mapping->tree_lock, deal with an error, regrab the lock and then
deal with a possible collision with an entry that was inserted while we
didn't hold the lock.
Can we just try and insert it, then if it fails with -ENOMEM we just do our
normal error path, dropping the tree_lock and returning the error?

> +		if (err)
> +			return ERR_PTR(err);
> +		ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | DAX_ENTRY_LOCK);
> +		spin_lock_irq(&mapping->tree_lock);
> +		err = radix_tree_insert(&mapping->page_tree, index, ret);
> +		radix_tree_preload_end();
> +		if (err) {
> +			spin_unlock_irq(&mapping->tree_lock);
> +			/* Someone already created the entry? */
> +			if (err == -EEXIST)
> +				goto restart;
> +			return ERR_PTR(err);
> +		}
> +		/* Good, we have inserted empty locked entry into the tree. */
> +		mapping->nrexceptional++;
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return ret;
> +	}
> +	/* Normal page in radix tree? */
> +	if (!radix_tree_exceptional_entry(ret)) {
> +		struct page *page = ret;
> +
> +		page_cache_get(page);
> +		spin_unlock_irq(&mapping->tree_lock);
> +		lock_page(page);
> +		/* Page got truncated? Retry... */
> +		if (unlikely(page->mapping != mapping)) {
> +			unlock_page(page);
> +			page_cache_release(page);
> +			goto restart;
> +		}
> +		return page;
> +	}
> +	ret = lock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	return ret;
> +}
> +
> +static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> +	void *ret, **slot;
> +	wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> +
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
> +	if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}
> +	if (WARN_ON_ONCE(!slot_locked(slot))) {
> +		spin_unlock_irq(&mapping->tree_lock);
> +		return;
> +	}

It may be worth combining these two WARN_ON_ONCE() error cases for brevity,
since they are both insanity conditions.
> +	unlock_slot(slot);
> +	spin_unlock_irq(&mapping->tree_lock);
> +	if (waitqueue_active(wq)) {
> +		struct exceptional_entry_key key;
> +
> +		key.root = &mapping->page_tree;
> +		key.index = index;
> +		__wake_up(wq, TASK_NORMAL, 1, &key);
> +	}

The above if() block is repeated 3 times in the next few functions with small
variations (the third argument to __wake_up()).  Perhaps it should be pulled
out into a helper?

> +static void *dax_mapping_entry(struct address_space *mapping, pgoff_t index,
> +		void *entry, sector_t sector, bool dirty,
> +		gfp_t gfp_mask)

This argument list is getting pretty long, and our one caller gets lots of
these guys out of the VMF.  Perhaps we could just pass in the VMF and extract
the bits ourselves?

>  {
>  	struct radix_tree_root *page_tree = &mapping->page_tree;
> -	pgoff_t pmd_index = DAX_PMD_INDEX(index);
> -	int type, error = 0;
> -	void *entry;
> +	int error = 0;
> +	bool hole_fill = false;
> +	void *ret;

Just a nit, but I find the use of 'ret' a bit confusing, since it's not a
return value that we got from anywhere, it's an entry that we set up, insert
and then return to our caller.  We use 'error' to capture return values from
calls this function makes.  Maybe this would be clearer as "new_entry" or
something?

> -	WARN_ON_ONCE(pmd_entry && !dirty);
>  	if (dirty)
>  		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> -	spin_lock_irq(&mapping->tree_lock);
> -
> -	entry = radix_tree_lookup(page_tree, pmd_index);
> -	if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
> -		index = pmd_index;
> -		goto dirty;
> +	/* Replacing hole page with block mapping? */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		hole_fill = true;
> +		error = radix_tree_preload(gfp_mask);
> +		if (error)
> +			return ERR_PTR(error);
>  	}
>
> -	entry = radix_tree_lookup(page_tree, index);
> -	if (entry) {
> -		type = RADIX_DAX_TYPE(entry);
> -		if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
> -				 type != RADIX_DAX_PMD)) {
> -			error = -EIO;
> +	spin_lock_irq(&mapping->tree_lock);
> +	ret = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
> +		       DAX_ENTRY_LOCK);
> +	if (hole_fill) {
> +		__delete_from_page_cache(entry, NULL);
> +		error = radix_tree_insert(page_tree, index, ret);
> +		if (error) {
> +			ret = ERR_PTR(error);
>  			goto unlock;
>  		}
> +		mapping->nrexceptional++;
> +	} else {
> +		void **slot;
> +		void *ret2;
>
> -		if (!pmd_entry || type == RADIX_DAX_PMD)
> -			goto dirty;
> -
> -		/*
> -		 * We only insert dirty PMD entries into the radix tree. This
> -		 * means we don't need to worry about removing a dirty PTE
> -		 * entry and inserting a clean PMD entry, thus reducing the
> -		 * range we would flush with a follow-up fsync/msync call.
> -		 */
> -		radix_tree_delete(&mapping->page_tree, index);
> -		mapping->nrexceptional--;
> -	}
> -
> -	if (sector == NO_SECTOR) {
> -		/*
> -		 * This can happen during correct operation if our pfn_mkwrite
> -		 * fault raced against a hole punch operation. If this
> -		 * happens the pte that was hole punched will have been
> -		 * unmapped and the radix tree entry will have been removed by
> -		 * the time we are called, but the call will still happen. We
> -		 * will return all the way up to wp_pfn_shared(), where the
> -		 * pte_same() check will fail, eventually causing page fault
> -		 * to be retried by the CPU.
> -		 */
> -		goto unlock;
> +		ret2 = __radix_tree_lookup(page_tree, index, NULL, &slot);

You don't need ret2.  You can just compare 'entry' with '*slot' - see
dax_writeback_one() for an example.
> +		WARN_ON_ONCE(ret2 != entry);
> +		radix_tree_replace_slot(slot, ret);
>  	}
> -
> -	error = radix_tree_insert(page_tree, index,
> -			RADIX_DAX_ENTRY(sector, pmd_entry));
> -	if (error)
> -		goto unlock;
> -
> -	mapping->nrexceptional++;
> - dirty:
>  	if (dirty)
>  		radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
>  unlock:
>  	spin_unlock_irq(&mapping->tree_lock);
> -	return error;
> +	if (hole_fill)
> +		radix_tree_preload_end();
> +	return ret;
>  }
>
>  static int dax_writeback_one(struct block_device *bdev,
> @@ -542,17 +782,18 @@ int dax_writeback_mapping_range(struct address_space *mapping,
>  }
>  EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
>
> -static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
> +static int dax_insert_mapping(struct address_space *mapping,
> +		struct buffer_head *bh, void *entry,
>  		struct vm_area_struct *vma, struct vm_fault *vmf)
>  {
>  	unsigned long vaddr = (unsigned long)vmf->virtual_address;
> -	struct address_space *mapping = inode->i_mapping;
>  	struct block_device *bdev = bh->b_bdev;
>  	struct blk_dax_ctl dax = {
> -		.sector = to_sector(bh, inode),
> +		.sector = to_sector(bh, mapping->host),
>  		.size = bh->b_size,
>  	};
>  	int error;
> +	void *ret;
>
>  	i_mmap_lock_read(mapping);
>
> @@ -562,16 +803,26 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
>  	}
>  	dax_unmap_atomic(bdev, &dax);
>
> -	error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
> -			vmf->flags & FAULT_FLAG_WRITE);
> -	if (error)
> +	ret = dax_mapping_entry(mapping, vmf->pgoff, entry, dax.sector,
> +			 vmf->flags & FAULT_FLAG_WRITE,
> +			 vmf->gfp_mask & ~__GFP_HIGHMEM);

The spacing before the parameters to dax_mapping_entry() is messed up & makes
checkpatch grumpy:

ERROR: code indent should use tabs where possible
#488: FILE: fs/dax.c:812:
+^I^I^I vmf->flags & FAULT_FLAG_WRITE,$

ERROR: code indent should use tabs where possible
#489: FILE: fs/dax.c:813:
+^I^I^I vmf->gfp_mask & ~__GFP_HIGHMEM);$

There are a few other checkpatch warnings as well that should probably be
addressed.

> +	if (IS_ERR(ret)) {
> +		error = PTR_ERR(ret);
>  		goto out;
> +	}
> +	/* Have we replaced hole page? Unmap and free it. */
> +	if (!radix_tree_exceptional_entry(entry)) {
> +		unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
> +				PAGE_CACHE_SIZE, 0);
> +		unlock_page(entry);
> +		page_cache_release(entry);
> +	}
> +	entry = ret;
>
>  	error = vm_insert_mixed(vma, vaddr, dax.pfn);
> -
>  out:
>  	i_mmap_unlock_read(mapping);
> -
> +	put_locked_mapping_entry(mapping, vmf->pgoff, entry);

Hmm....this entry was locked by our parent (__dax_fault()), and is released
by our parent in error cases that go through 'unlock_entry:'.  For symmetry
it's probably better to move this call up to our parent as well.

_______________________________________________
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm