linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Ning Qu <quning@gmail.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH 16/24] huge tmpfs: fix problems from premature exposure of pagetable
Date: Wed, 1 Jul 2015 13:53:34 +0300	[thread overview]
Message-ID: <20150701105334.GA18721@node.dhcp.inet.fi> (raw)
In-Reply-To: <alpine.LSU.2.11.1502202015090.14414@eggly.anvils>

On Fri, Feb 20, 2015 at 08:16:32PM -0800, Hugh Dickins wrote:
> Andrea wrote a very interesting comment on THP in mm/memory.c,
> just before the end of __handle_mm_fault():
> 
>  * A regular pmd is established and it can't morph into a huge pmd
>  * from under us anymore at this point because we hold the mmap_sem
>  * read mode and khugepaged takes it in write mode. So now it's
>  * safe to run pte_offset_map().
> 
> This comment hints at several difficulties, which anon THP solved
> for itself with mmap_sem and anon_vma lock, but which huge tmpfs
> may need to solve differently.
> 
> The reference to pte_offset_map() above: I believe that's a hint
> that on a 32-bit machine, the pagetables might need to come from
> kernel-mapped memory, but a huge pmd pointing to user memory beyond
> that limit could be racily substituted, causing undefined behavior
> in the architecture-dependent pte_offset_map().
> 
> That itself is not a problem on x86_64, but there's plenty more:
> how about those places which use pte_offset_map_lock() - if that
> spinlock is in the struct page of a pagetable, which has been
> deposited and might be withdrawn and freed at any moment (being
> on a list unattached to the allocating pmd in the case of x86),
> taking the spinlock might corrupt someone else's struct page.
> 
> Because THP has departed from the earlier rules (when pagetable
> was only freed under exclusive mmap_sem, or at exit_mmap, after
> removing all affected vmas from the rmap list): zap_huge_pmd()
> does pte_free() even when serving MADV_DONTNEED under down_read
> of mmap_sem.
> 
> And what of the "entry = *pte" at the start of handle_pte_fault(),
> getting the entry used in pte_same(,orig_pte) tests to validate all
> fault handling?  If that entry can itself be junk picked out of some
> freed and reused pagetable, it's hard to estimate the consequences.
> 
> We need to consider the safety of concurrent faults, and the
> safety of rmap lookups, and the safety of miscellaneous operations
> such as smaps_pte_range() for reading /proc/<pid>/smaps.
> 
> I set out to make safe the places which descend pgd,pud,pmd,pte,
> using more careful access techniques like mm_find_pmd(); but with
> pte_offset_map() being architecture-defined, it's too big a job to
> tighten it up all over.
> 
> Instead, approach from the opposite direction: just do not expose
> a pagetable in an empty *pmd, until vm_ops->fault has had a chance
> to ask for a huge pmd there.  This is a much easier change to make,
> and we are lucky that all the driver faults appear to be using
> interfaces (like vm_insert_page() and remap_pfn_range()) which
> automatically do the pte_alloc() if it was not already done.
> 
> But we must not get stuck refaulting: need FAULT_FLAG_MAY_HUGE for
> __do_fault() to tell shmem_fault() to try for huge only when *pmd is
> empty (could instead add pmd to vmf and let shmem work that out for
> itself, but probably better to hide pmd from vm_ops->faults).
> 
> Without a pagetable to hold the pte_none() entry found in a newly
> allocated pagetable, handle_pte_fault() would like to provide a static
> none entry for later orig_pte checks.  But architectures have never had
> to provide that definition before; and although almost all use zeroes
> for an empty pagetable, a few do not - nios2, s390, um, xtensa.
> 
> Never mind, forget about pte_same(,orig_pte), the three __do_fault()
> callers can follow do_anonymous_page()'s example, and just use a
> pte_none() check instead - supplemented by a pte_file pte_to_pgoff
> check until the day VM_NONLINEAR is removed.
> 
> do_fault_around() presents one last problem: it wants pagetable to
> have been allocated, but was being called by do_read_fault() before
> __do_fault().  But I see no disadvantage to moving it after,
> allowing huge pmd to be chosent first.

One disadvantage is addtional radix-tree lookup for page cache hot case.
IIRC, the difference was small, but measurable back when I implemented
faultaround.

Have you considered pushing page table allocation even futher -- into
do_set_pte()?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2015-07-01 10:53 UTC|newest]

Thread overview: 38+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-02-21  3:49 [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache Hugh Dickins
2015-02-21  3:51 ` [PATCH 01/24] mm: update_lru_size warn and reset bad lru_size Hugh Dickins
2015-02-23  9:30   ` Kirill A. Shutemov
2015-03-23  2:44     ` Hugh Dickins
2015-02-21  3:54 ` [PATCH 02/24] mm: update_lru_size do the __mod_zone_page_state Hugh Dickins
2015-02-21  3:56 ` [PATCH 03/24] mm: use __SetPageSwapBacked and don't ClearPageSwapBacked Hugh Dickins
2015-02-25 10:53   ` Mel Gorman
2015-03-23  3:01     ` Hugh Dickins
2015-02-21  3:58 ` [PATCH 04/24] mm: make page migration's newpage handling more robust Hugh Dickins
2015-02-21  4:00 ` [PATCH 05/24] tmpfs: preliminary minor tidyups Hugh Dickins
2015-02-21  4:01 ` [PATCH 06/24] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m Hugh Dickins
2015-02-21  4:03 ` [PATCH 07/24] huge tmpfs: include shmem freeholes in available memory counts Hugh Dickins
2015-02-21  4:05 ` [PATCH 08/24] huge tmpfs: prepare huge=N mount option and /proc/sys/vm/shmem_huge Hugh Dickins
2015-02-21  4:06 ` [PATCH 09/24] huge tmpfs: try to allocate huge pages, split into a team Hugh Dickins
2015-02-21  4:07 ` [PATCH 10/24] huge tmpfs: avoid team pages in a few places Hugh Dickins
2015-02-21  4:09 ` [PATCH 11/24] huge tmpfs: shrinker to migrate and free underused holes Hugh Dickins
2015-03-19 16:56   ` Konstantin Khlebnikov
2015-03-23  4:40     ` Hugh Dickins
2015-03-23 12:50       ` Kirill A. Shutemov
2015-03-23 13:50         ` Kirill A. Shutemov
2015-03-24 12:57       ` Kirill A. Shutemov
2015-03-25  0:41         ` Hugh Dickins
2015-02-21  4:11 ` [PATCH 12/24] huge tmpfs: get_unmapped_area align and fault supply huge page Hugh Dickins
2015-02-21  4:12 ` [PATCH 13/24] huge tmpfs: extend get_user_pages_fast to shmem pmd Hugh Dickins
2015-02-21  4:13 ` [PATCH 14/24] huge tmpfs: extend vma_adjust_trans_huge " Hugh Dickins
2015-02-21  4:15 ` [PATCH 15/24] huge tmpfs: rework page_referenced_one and try_to_unmap_one Hugh Dickins
2015-02-21  4:16 ` [PATCH 16/24] huge tmpfs: fix problems from premature exposure of pagetable Hugh Dickins
2015-07-01 10:53   ` Kirill A. Shutemov [this message]
2015-02-21  4:18 ` [PATCH 17/24] huge tmpfs: map shmem by huge page pmd or by page team ptes Hugh Dickins
2015-02-21  4:20 ` [PATCH 18/24] huge tmpfs: mmap_sem is unlocked when truncation splits huge pmd Hugh Dickins
2015-02-21  4:22 ` [PATCH 19/24] huge tmpfs: disband split huge pmds on race or memory failure Hugh Dickins
2015-02-21  4:23 ` [PATCH 20/24] huge tmpfs: use Unevictable lru with variable hpage_nr_pages() Hugh Dickins
2015-02-21  4:25 ` [PATCH 21/24] huge tmpfs: fix Mlocked meminfo, tracking huge and unhuge mlocks Hugh Dickins
2015-02-21  4:27 ` [PATCH 22/24] huge tmpfs: fix Mapped meminfo, tracking huge and unhuge mappings Hugh Dickins
2015-02-21  4:29 ` [PATCH 23/24] kvm: plumb return of hva when resolving page fault Hugh Dickins
2015-02-21  4:31 ` [PATCH 24/24] kvm: teach kvm to map page teams as huge pages Hugh Dickins
2015-02-23 13:48 ` [PATCH 00/24] huge tmpfs: an alternative approach to THPageCache Kirill A. Shutemov
2015-03-23  2:25   ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150701105334.GA18721@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=hughd@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=quning@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).