Subject: Re: [PATCHv2 08/28] mm: postpone page table allocation until do_set_pte()
From: Dave Hansen
To: "Kirill A. Shutemov"
Cc: "Kirill A. Shutemov", Hugh Dickins, Andrea Arcangeli, Andrew Morton,
 Vlastimil Babka, Christoph Lameter, Naoya Horiguchi, Jerome Marchand,
 Yang Shi, Sasha Levin, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Tue, 16 Feb 2016 09:17:17 -0800
Message-ID: <56C3599D.3060106@intel.com>
In-Reply-To: <20160216142657.GA16364@node.shutemov.name>
References: <1455200516-132137-1-git-send-email-kirill.shutemov@linux.intel.com>
 <1455200516-132137-9-git-send-email-kirill.shutemov@linux.intel.com>
 <56BE1A09.6000007@intel.com>
 <20160216142657.GA16364@node.shutemov.name>

On 02/16/2016 06:26 AM, Kirill A. Shutemov wrote:
> On Fri, Feb 12, 2016 at 09:44:41AM -0800, Dave Hansen wrote:
>> On 02/11/2016 06:21 AM, Kirill A. Shutemov wrote:
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index ca99c0ecf52e..172f4d8e798d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -265,6 +265,7 @@ struct fault_env {
>>> 	pmd_t *pmd;
>>> 	pte_t *pte;
>>> 	spinlock_t *ptl;
>>> +	pgtable_t prealloc_pte;
>>> };
>>
>> If we're going to do this fault_env thing, we need some heavy-duty
>> comments on what the different fields do and what they mean.  We don't
>> want to get into a situation where we're doing
>>
>> 	void fault_foo(struct fault_env *fe);..
>>
>> and then we change the internals of fault_foo() to manipulate a
>> different set of fe->* variables, or change assumptions, then have
>> callers randomly break.
>>
>> One _nice_ part of passing all the arguments explicitly is that it
>> makes you go visit all the call sites and think about how the
>> conventions change.
>>
>> It just makes me nervous.
>>
>> The semantics of having both a ->pte and ->pmd need to be very clearly
>> spelled out too, please.
>
> I've updated this to:
>
> /*
>  * Page fault context: passes through the page fault handler instead of an
>  * endless list of function arguments.
>  */
> struct fault_env {
> 	struct vm_area_struct *vma;	/* Target VMA */
> 	unsigned long address;		/* Faulting virtual address */
> 	unsigned int flags;		/* FAULT_FLAG_xxx flags */
> 	pmd_t *pmd;			/* Pointer to pmd entry matching
> 					 * the 'address'
> 					 */

Is this just for huge PMDs, or does it also cover normal PMDs pointing
to PTE pages?  Is it populated every time we're at or below the PMD
during a fault?  Is it always valid?

> 	pte_t *pte;			/* Pointer to pte entry matching
> 					 * the 'address'. NULL if the page
> 					 * table hasn't been allocated.
> 					 */

What's the relationship between pmd and pte?  Can both be set at the
same time, etc...?

> 	spinlock_t *ptl;		/* Page table lock.
> 					 * Protects pte page table if 'pte'
> 					 * is not NULL, otherwise pmd.
> 					 */

Are there any rules for callers when a callee puts a value in here?

> 	pgtable_t prealloc_pte;		/* Pre-allocated pte page table.
> 					 * vm_ops->map_pages() calls
> 					 * do_set_pte() from atomic context.
> 					 * do_fault_around() pre-allocates
> 					 * page table to avoid allocation from
> 					 * atomic context.
> 					 */
> };

Who's responsible for freeing this, and when?

>>>  /*
>>> @@ -559,7 +560,8 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
>>>  	return pte;
>>>  }
>>>
>>> -void do_set_pte(struct fault_env *fe, struct page *page);
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +		struct page *page);
>>>  #endif
>>
>> I think do_set_pte() might be due for a new name if it's going to be
>> doing allocations internally.
>
> Any suggestions?

alloc_set_pte() is probably fine.  Just make it clear early in some
comments that the allocation is conditional.

>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 28b3875969a8..ba8150d6dc33 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -2146,11 +2146,6 @@ void filemap_map_pages(struct fault_env *fe,
>>>  			start_pgoff) {
>>>  		if (iter.index > end_pgoff)
>>>  			break;
>>> -		fe->pte += iter.index - last_pgoff;
>>> -		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> -		last_pgoff = iter.index;
>>> -		if (!pte_none(*fe->pte))
>>> -			goto next;
>>>  repeat:
>>>  		page = radix_tree_deref_slot(slot);
>>>  		if (unlikely(!page))
>>> @@ -2187,7 +2182,17 @@ repeat:
>>>
>>>  		if (file->f_ra.mmap_miss > 0)
>>>  			file->f_ra.mmap_miss--;
>>> -		do_set_pte(fe, page);
>>> +
>>> +		fe->address += (iter.index - last_pgoff) << PAGE_SHIFT;
>>> +		if (fe->pte)
>>> +			fe->pte += iter.index - last_pgoff;
>>> +		last_pgoff = iter.index;
>>> +		if (do_set_pte(fe, NULL, page)) {
>>> +			/* failed to setup page table: giving up */
>>> +			if (!fe->pte)
>>> +				break;
>>> +			goto unlock;
>>> +		}
>>
>> What's the failure here, though?
>
> At this point in the patchset it never fails: allocation failure is not
> possible because we pre-allocate the page table for faultaround.
>
> Later, after do_set_pmd() is introduced, a huge page can be mapped here,
> by us or under us.
>
> I'll update the comment.

So why check the return value of do_set_pte()?  Why can it return
nonzero?

>> This also throws away the spiffy new error code that comes back from
>> do_set_pte().  Is that OK?
>
> Yes. We will try harder in do_read_fault(), with all the proper locks and
> error handling, once the faultaround code has failed to solve the page
> fault.

OK, I hope the new comment addresses this.

>>> +	/*
>>> +	 * Use __pte_alloc instead of pte_alloc_map, because we can't
>>> +	 * run pte_offset_map on the pmd, if an huge pmd could
>>> +	 * materialize from under us from a different thread.
>>> +	 */
>>
>> This comment is a little bit funky.  Maybe:
>>
>> "Use __pte_alloc() instead of pte_alloc_map().  We can't run
>> pte_offset_map() on pmds where a huge pmd might be created (from a
>> different thread)."
>>
>> Could you also talk a bit about where it _is_ safe to call
>> pte_alloc_map()?
>
> That comment was just moved from __handle_mm_fault().
>
> Would this be okay:
>
> /*
>  * Use __pte_alloc() instead of pte_alloc_map().  We can't run
>  * pte_offset_map() on pmds where a huge pmd might be created (from
>  * a different thread).
>  *
>  * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
>  * parallel threads are excluded by other means.
>  */

Yep, that looks good.  Just make sure it's clear that down_write(mmap_sem)
is not what's held in *this* context; the fault path only holds mmap_sem
for read.
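While we're on the subject of comments: for the renamed alloc_set_pte()
I'd like the conditional allocation and the unlock rule spelled out right
at the top of the function.  Something along these lines, just a sketch,
the exact wording is obviously yours to pick:

	/*
	 * Sets up a new PTE entry for the given page and adds a reverse
	 * page mapping.  If fe->pte is NULL on entry, the pte page table
	 * is allocated here (or taken from fe->prealloc_pte), then mapped
	 * and locked.
	 *
	 * Returns 0 on success, VM_FAULT_* otherwise.  Either way, the
	 * caller must pte_unmap_unlock() fe->pte if, and only if, fe->pte
	 * is non-NULL on return.
	 */

That way nobody has to reverse-engineer the convention from the callers.
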
>>> +	if (unlikely(pmd_none(*fe->pmd) &&
>>> +			__pte_alloc(vma->vm_mm, vma, fe->pmd, fe->address)))
>>> +		return VM_FAULT_OOM;
>>
>> Should we just move this pmd_none() check into __pte_alloc()?  You do
>> this same-style check at least twice.
>
> We have it there. The check here is speculative, to avoid taking the ptl.
>
>>> +	/* If an huge pmd materialized from under us just retry later */
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return 0;
>>
>> Nit: please stop sprinkling unlikely() everywhere.  Is there some
>> concrete benefit to doing it here?  I really doubt the compiler needs
>> help putting the code for "return 0" out-of-line.
>>
>> Why is it important to abort here?  Is this a small-page-only path?
>
> This unlikely() was moved from __handle_mm_fault(). I didn't put much
> consideration into it.
>
>>> +static int pte_alloc_one_map(struct fault_env *fe)
>>> +{
>>> +	struct vm_area_struct *vma = fe->vma;
>>> +
>>> +	if (!pmd_none(*fe->pmd))
>>> +		goto map_pte;
>>
>> So the calling convention here is...?  It looks like this has to be
>> called with fe->pmd == pmd_none().  If not, we assume it's pointing to a
>> pte page.  This can never be called on a huge pmd.  Right?
>
> It's not under the ptl, so the pmd can be filled under us. There's a huge
> pmd check in the 'map_pte' goto path.
>
>>> +	if (fe->prealloc_pte) {
>>> +		smp_wmb(); /* See comment in __pte_alloc() */
>>
>> Are we trying to make *this* cpu's write visible, or to see the write
>> from __pte_alloc()?  It seems like we're trying to see the write.  Isn't
>> smp_rmb() what we want for that?
>
> See 362a61ad6119.
>
> I think a more logical place for it would be do_fault_around(), just
> after pte_alloc_one().
>
>>> +		fe->ptl = pmd_lock(vma->vm_mm, fe->pmd);
>>> +		if (unlikely(!pmd_none(*fe->pmd))) {
>>> +			spin_unlock(fe->ptl);
>>> +			goto map_pte;
>>> +		}
>>
>> Should we just make pmd_none() likely()?  That seems like it would save
>> about 20MB of unlikely()'s in the source.
>
> Heh.
>
>>> +		atomic_long_inc(&vma->vm_mm->nr_ptes);
>>> +		pmd_populate(vma->vm_mm, fe->pmd, fe->prealloc_pte);
>>> +		spin_unlock(fe->ptl);
>>> +		fe->prealloc_pte = 0;
>>> +	} else if (unlikely(__pte_alloc(vma->vm_mm, vma, fe->pmd,
>>> +					fe->address))) {
>>> +		return VM_FAULT_OOM;
>>> +	}
>>> +map_pte:
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return VM_FAULT_NOPAGE;
>>
>> I think I need a refresher on the locking rules.  pte_offset_map*() is
>> unsafe to call on a huge pmd.  What in this context makes it impossible
>> for the pmd to get promoted after the check?
>
> Do you mean what stops the pte page table from being collapsed into a
> huge pmd? The answer is mmap_sem. The collapse code takes the lock for
> write to be able to retract the page table.
>
>>> +	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> +			&fe->ptl);
>>> +	return 0;
>>> +}
>>> +
>>>  /**
>>>   * do_set_pte - setup new PTE entry for given page and add reverse page mapping.
>>>   *
>>>   * @fe: fault environment
>>> + * @memcg: memcg to charge page (only for private mappings)
>>>   * @page: page to map
>>>   *
>>> - * Caller must hold page table lock relevant for @fe->pte.
>>
>> That's a bit screwy now because fe->pte might not exist.  Right?  I
>
> [ you're commenting on a deleted line ]
>
> Right.
>
>> thought the ptl was derived from the physical address of the pte page.
>> How can we have a lock for a physical address that doesn't exist yet?
>
> If fe->pte is NULL, pte_alloc_one_map() takes care of allocating, mapping
> and locking the page table.
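OK.  So the rule for callers boils down to the same pattern everywhere,
something like what do_read_fault() ends up doing later in the patch:

	ret |= do_set_pte(fe, NULL, fault_page);
	if (fe->pte)
		pte_unmap_unlock(fe->pte, fe->ptl);

i.e. unlock if and only if a pte actually got mapped, regardless of the
return value.  That matches the comment just below, so as long as that
stays prominent I'm fine with it.
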
>>> + * Caller must take care of unlocking fe->ptl, if fe->pte is non-NULL on return.
>>>   *
>>>   * Target users are page handler itself and implementations of
>>>   * vm_ops->map_pages.
>>>   */
>>> -void do_set_pte(struct fault_env *fe, struct page *page)
>>> +int do_set_pte(struct fault_env *fe, struct mem_cgroup *memcg,
>>> +		struct page *page)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>>  	bool write = fe->flags & FAULT_FLAG_WRITE;
>>>  	pte_t entry;
>>>
>>> +	if (!fe->pte) {
>>> +		int ret = pte_alloc_one_map(fe);
>>> +		if (ret)
>>> +			return ret;
>>> +	}
>>> +
>>> +	if (!pte_none(*fe->pte))
>>> +		return VM_FAULT_NOPAGE;
>>
>> Oh, you've got to add another pte_none() check because you're deferring
>> the acquisition of the ptl lock?
>
> Yes, we need to re-check once the ptl is taken.
>
>>>  	flush_icache_page(vma, page);
>>>  	entry = mk_pte(page, vma->vm_page_prot);
>>>  	if (write)
>>> @@ -2811,6 +2864,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
>>>  	if (write && !(vma->vm_flags & VM_SHARED)) {
>>>  		inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>>>  		page_add_new_anon_rmap(page, vma, fe->address, false);
>>> +		mem_cgroup_commit_charge(page, memcg, false, false);
>>> +		lru_cache_add_active_or_unevictable(page, vma);
>>>  	} else {
>>>  		inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
>>>  		page_add_file_rmap(page);
>>> @@ -2819,6 +2874,8 @@ void do_set_pte(struct fault_env *fe, struct page *page)
>>>
>>>  	/* no need to invalidate: a not-present page won't be cached */
>>>  	update_mmu_cache(vma, fe->address, fe->pte);
>>> +
>>> +	return 0;
>>>  }
>>>
>>>  static unsigned long fault_around_bytes __read_mostly =
>>> @@ -2885,19 +2942,17 @@ late_initcall(fault_around_debugfs);
>>>   * fault_around_pages() value (and therefore to page order).  This way it's
>>>   * easier to guarantee that we don't cross page table boundaries.
>>>   */
>>> -static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>>> +static int do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>>>  {
>>> -	unsigned long address = fe->address, start_addr, nr_pages, mask;
>>> -	pte_t *pte = fe->pte;
>>> +	unsigned long address = fe->address, nr_pages, mask;
>>>  	pgoff_t end_pgoff;
>>> -	int off;
>>> +	int off, ret = 0;
>>>
>>>  	nr_pages = READ_ONCE(fault_around_bytes) >> PAGE_SHIFT;
>>>  	mask = ~(nr_pages * PAGE_SIZE - 1) & PAGE_MASK;
>>>
>>> -	start_addr = max(fe->address & mask, fe->vma->vm_start);
>>> -	off = ((fe->address - start_addr) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>>> -	fe->pte -= off;
>>> +	fe->address = max(address & mask, fe->vma->vm_start);
>>> +	off = ((address - fe->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
>>>  	start_pgoff -= off;
>>
>> Considering what's in this patch already, I think I'd leave the trivial
>> local variable replacement for another patch.
>
> fe->address is not a local variable: it gets passed into ->map_pages().
>
>>>  	/*
>>> @@ -2905,30 +2960,33 @@ static void do_fault_around(struct fault_env *fe, pgoff_t start_pgoff)
>>>  	 * or fault_around_pages() from start_pgoff, depending what is nearest.
>>>  	 */
>>>  	end_pgoff = start_pgoff -
>>> -		((start_addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
>>> +		((fe->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
>>>  		PTRS_PER_PTE - 1;
>>>  	end_pgoff = min3(end_pgoff, vma_pages(fe->vma) + fe->vma->vm_pgoff - 1,
>>>  			start_pgoff + nr_pages - 1);
>>>
>>> -	/* Check if it makes any sense to call ->map_pages */
>>> -	fe->address = start_addr;
>>> -	while (!pte_none(*fe->pte)) {
>>> -		if (++start_pgoff > end_pgoff)
>>> -			goto out;
>>> -		fe->address += PAGE_SIZE;
>>> -		if (fe->address >= fe->vma->vm_end)
>>> -			goto out;
>>> -		fe->pte++;
>>> +	if (pmd_none(*fe->pmd))
>>> +		fe->prealloc_pte = pte_alloc_one(fe->vma->vm_mm, fe->address);
>>> +	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +	if (fe->prealloc_pte) {
>>> +		pte_free(fe->vma->vm_mm, fe->prealloc_pte);
>>> +		fe->prealloc_pte = 0;
>>>  	}
>>> +	if (!fe->pte)
>>> +		goto out;
>>
>> What does !fe->pte *mean* here?  No pte page?  No pte present?  Huge pte
>> present?
>
> Huge pmd is mapped.
>
> Comment added.
>
>>> -	fe->vma->vm_ops->map_pages(fe, start_pgoff, end_pgoff);
>>> +	/* check if the page fault is solved */
>>> +	fe->pte -= (fe->address >> PAGE_SHIFT) - (address >> PAGE_SHIFT);
>>> +	if (!pte_none(*fe->pte))
>>> +		ret = VM_FAULT_NOPAGE;
>>> +	pte_unmap_unlock(fe->pte, fe->ptl);
>>>  out:
>>> -	/* restore fault_env */
>>> -	fe->pte = pte;
>>>  	fe->address = address;
>>> +	fe->pte = NULL;
>>> +	return ret;
>>>  }
>>>
>>> -static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>> +static int do_read_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>>  	struct page *fault_page;
>>> @@ -2940,33 +2998,25 @@ static int do_read_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>>  	 * something).
>>>  	 */
>>>  	if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
>>> -		fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -				&fe->ptl);
>>> -		do_fault_around(fe, pgoff);
>>> -		if (!pte_same(*fe->pte, orig_pte))
>>> -			goto unlock_out;
>>> -		pte_unmap_unlock(fe->pte, fe->ptl);
>>> +		ret = do_fault_around(fe, pgoff);
>>> +		if (ret)
>>> +			return ret;
>>>  	}
>>>
>>>  	ret = __do_fault(fe, pgoff, NULL, &fault_page);
>>>  	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>>  		return ret;
>>>
>>> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address, &fe->ptl);
>>> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +	ret |= do_set_pte(fe, NULL, fault_page);
>>> +	if (fe->pte)
>>>  		pte_unmap_unlock(fe->pte, fe->ptl);
>>> -		unlock_page(fault_page);
>>> -		page_cache_release(fault_page);
>>> -		return ret;
>>> -	}
>>> -	do_set_pte(fe, fault_page);
>>>  	unlock_page(fault_page);
>>> -unlock_out:
>>> -	pte_unmap_unlock(fe->pte, fe->ptl);
>>> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>> +		page_cache_release(fault_page);
>>>  	return ret;
>>>  }
>>>
>>> -static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>> +static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>>  	struct page *fault_page, *new_page;
>>> @@ -2994,26 +3044,9 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>>  	copy_user_highpage(new_page, fault_page, fe->address, vma);
>>>  	__SetPageUptodate(new_page);
>>>
>>> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -			&fe->ptl);
>>> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +	ret |= do_set_pte(fe, memcg, new_page);
>>> +	if (fe->pte)
>>>  		pte_unmap_unlock(fe->pte, fe->ptl);
>>> -		if (fault_page) {
>>> -			unlock_page(fault_page);
>>> -			page_cache_release(fault_page);
>>> -		} else {
>>> -			/*
>>> -			 * The fault handler has no page to lock, so it holds
>>> -			 * i_mmap_lock for read to protect against truncate.
>>> -			 */
>>> -			i_mmap_unlock_read(vma->vm_file->f_mapping);
>>> -		}
>>> -		goto uncharge_out;
>>> -	}
>>> -	do_set_pte(fe, new_page);
>>> -	mem_cgroup_commit_charge(new_page, memcg, false, false);
>>> -	lru_cache_add_active_or_unevictable(new_page, vma);
>>> -	pte_unmap_unlock(fe->pte, fe->ptl);
>>>  	if (fault_page) {
>>>  		unlock_page(fault_page);
>>>  		page_cache_release(fault_page);
>>> @@ -3024,6 +3057,8 @@ static int do_cow_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>>  		 */
>>>  		i_mmap_unlock_read(vma->vm_file->f_mapping);
>>>  	}
>>> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
>>> +		goto uncharge_out;
>>>  	return ret;
>>>  uncharge_out:
>>>  	mem_cgroup_cancel_charge(new_page, memcg, false);
>>> @@ -3031,7 +3066,7 @@ uncharge_out:
>>>  	return ret;
>>>  }
>>>
>>> -static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>> +static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>>  	struct page *fault_page;
>>> @@ -3057,16 +3092,15 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>>  		}
>>>  	}
>>>
>>> -	fe->pte = pte_offset_map_lock(vma->vm_mm, fe->pmd, fe->address,
>>> -			&fe->ptl);
>>> -	if (unlikely(!pte_same(*fe->pte, orig_pte))) {
>>> +	ret |= do_set_pte(fe, NULL, fault_page);
>>> +	if (fe->pte)
>>>  		pte_unmap_unlock(fe->pte, fe->ptl);
>>> +	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
>>> +					VM_FAULT_RETRY))) {
>>>  		unlock_page(fault_page);
>>>  		page_cache_release(fault_page);
>>>  		return ret;
>>>  	}
>>> -	do_set_pte(fe, fault_page);
>>> -	pte_unmap_unlock(fe->pte, fe->ptl);
>>>
>>>  	if (set_page_dirty(fault_page))
>>>  		dirtied = 1;
>>> @@ -3098,21 +3132,19 @@ static int do_shared_fault(struct fault_env *fe, pgoff_t pgoff, pte_t orig_pte)
>>>   * The mmap_sem may have been released depending on flags and our
>>>   * return value.  See filemap_fault() and __lock_page_or_retry().
>>>   */
>>> -static int do_fault(struct fault_env *fe, pte_t orig_pte)
>>> +static int do_fault(struct fault_env *fe)
>>>  {
>>>  	struct vm_area_struct *vma = fe->vma;
>>> -	pgoff_t pgoff = (((fe->address & PAGE_MASK)
>>> -			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>>> +	pgoff_t pgoff = linear_page_index(vma, fe->address);
>>
>> Looks like another trivial cleanup.
>
> Okay, I'll move it into a separate patch.
>
>>> -	pte_unmap(fe->pte);
>>>  	/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
>>>  	if (!vma->vm_ops->fault)
>>>  		return VM_FAULT_SIGBUS;
>>>  	if (!(fe->flags & FAULT_FLAG_WRITE))
>>> -		return do_read_fault(fe, pgoff, orig_pte);
>>> +		return do_read_fault(fe, pgoff);
>>>  	if (!(vma->vm_flags & VM_SHARED))
>>> -		return do_cow_fault(fe, pgoff, orig_pte);
>>> -	return do_shared_fault(fe, pgoff, orig_pte);
>>> +		return do_cow_fault(fe, pgoff);
>>> +	return do_shared_fault(fe, pgoff);
>>>  }
>>>
>>>  static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
>>> @@ -3252,37 +3284,62 @@ static int wp_huge_pmd(struct fault_env *fe, pmd_t orig_pmd)
>>>   * with external mmu caches can use to update those (ie the Sparc or
>>>   * PowerPC hashed page tables that act as extended TLBs).
>>>   *
>>> - * We enter with non-exclusive mmap_sem (to exclude vma changes,
>>> - * but allow concurrent faults), and pte mapped but not yet locked.
>>> - * We return with pte unmapped and unlocked.
>>> + * We enter with non-exclusive mmap_sem (to exclude vma changes, but allow
>>> + * concurrent faults).
>>>   *
>>> - * The mmap_sem may have been released depending on flags and our
>>> - * return value.  See filemap_fault() and __lock_page_or_retry().
>>> + * The mmap_sem may have been released depending on flags and our return value.
>>> + * See filemap_fault() and __lock_page_or_retry().
>>>   */
>>>  static int handle_pte_fault(struct fault_env *fe)
>>>  {
>>>  	pte_t entry;
>>>
>>> +	/* If an huge pmd materialized from under us just retry later */
>>> +	if (unlikely(pmd_trans_huge(*fe->pmd)))
>>> +		return 0;
>>> +
>>> +	if (unlikely(pmd_none(*fe->pmd))) {
>>> +		/*
>>> +		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>> +		 * want to allocate huge page, and if we expose page table
>>> +		 * for an instant, it will be difficult to retract from
>>> +		 * concurrent faults and from rmap lookups.
>>> +		 */
>>> +	} else {
>>> +		/*
>>> +		 * A regular pmd is established and it can't morph into a huge
>>> +		 * pmd from under us anymore at this point because we hold the
>>> +		 * mmap_sem read mode and khugepaged takes it in write mode.
>>> +		 * So now it's safe to run pte_offset_map().
>>> +		 */
>>> +		fe->pte = pte_offset_map(fe->pmd, fe->address);
>>> +
>>> +		entry = *fe->pte;
>>> +		barrier();
>>
>> Barrier because....?
>>
>>> +		if (pte_none(entry)) {
>>> +			pte_unmap(fe->pte);
>>> +			fe->pte = NULL;
>>> +		}
>>> +	}
>>> +
>>>  	/*
>>>  	 * some architectures can have larger ptes than wordsize,
>>>  	 * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
>>>  	 * so READ_ONCE or ACCESS_ONCE cannot guarantee atomic accesses.
>>> -	 * The code below just needs a consistent view for the ifs and
>>> +	 * The code above just needs a consistent view for the ifs and
>>>  	 * we later double check anyway with the ptl lock held.  So here
>>>  	 * a barrier will do.
>>>  	 */
>>
>> Looks like the barrier got moved, but not the comment.
>
> Moved.
>
>> Man, that's a lot of code.
>
> Yeah. I don't see a sensible way to split it. :-/
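
Fair enough.  One last note on the barrier() above: with the comment
moved, the shape I'd expect to end up with is roughly this (paraphrasing
your code, not tested):

	fe->pte = pte_offset_map(fe->pmd, fe->address);

	/*
	 * Some architectures can have larger ptes than wordsize, e.g.
	 * ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y, so
	 * READ_ONCE()/ACCESS_ONCE() cannot guarantee atomic accesses.
	 * We only need a consistent-enough view for the pte_none() check
	 * below; everything is re-checked later with the ptl held, so a
	 * plain barrier() will do.
	 */
	entry = *fe->pte;
	barrier();

	if (pte_none(entry)) {
		pte_unmap(fe->pte);
		fe->pte = NULL;
	}

i.e. the comment sits right on top of the read it is justifying.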