linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: John Hubbard <jhubbard@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Linux-MM <linux-mm@kvack.org>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Yu Zhao <yuzhao@google.com>,  Andy Lutomirski <luto@kernel.org>,
	Peter Xu <peterx@redhat.com>, Pavel Emelyanov <xemul@openvz.org>,
	 Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	 Minchan Kim <minchan@kernel.org>, Will Deacon <will@kernel.org>,
	 Peter Zijlstra <peterz@infradead.org>,
	Hugh Dickins <hughd@google.com>,
	 "Kirill A. Shutemov" <kirill@shutemov.name>,
	Matthew Wilcox <willy@infradead.org>,
	 Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	Kees Cook <keescook@chromium.org>,
	 Leon Romanovsky <leonro@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jan Kara <jack@suse.cz>,
	 Kirill Tkhai <ktkhai@virtuozzo.com>,
	Nadav Amit <nadav.amit@gmail.com>,  Jens Axboe <axboe@kernel.dk>
Subject: Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse
Date: Mon, 11 Jan 2021 14:18:13 -0800	[thread overview]
Message-ID: <CAHk-=wje9r3fREBdZcOu=NihGczBtkqkhXRPDhY-ZkNVv=thiQ@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wipa-9wEuWHBjourmXAVHdeqDa59UxW6ZJ_Oqg6-Dwvdw@mail.gmail.com>

On Mon, Jan 11, 2021 at 11:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Sun, Jan 10, 2021 at 11:27 PM John Hubbard <jhubbard@nvidia.com> wrote:
> > IMHO, a lot of the bits in page _refcount are still being wasted (even
> > after GUP_PIN_COUNTING_BIAS overloading), because it's unlikely that
> > there are many callers of gup/pup per page.
>
> It may be unlikely under real loads.
>
> But we've actually had overflow issues on this because rather than
> real loads you can do attack loads (ie "lots of processes, lots of
> pipe file descriptors, lots of vmsplice() operations on the same
> page".
>
> We had to literally add that conditional "try_get_page()" that
> protects against overflow..

Actually, what I think might be a better model is to actually
strengthen the rules even more, and get rid of GUP_PIN_COUNTING_BIAS
entirely.

What we could do is just make a few clear rules explicit (most of
which we already basically hold to). Starting from that basic

 (a) Anonymous pages are made writable (ie COW) only when they have a
page_count() of 1

That very simple rule then automatically results in the corollary

 (b) a writable page in a COW mapping always starts out reachable
_only_ from the page tables

and now we could have a couple of really simple new rules:

 (c) we never ever make a writable page in a COW mapping read-only
_unless_ it has a page_count() of 1

 (d) we never create a swap cache page out of a writable COW mapping page

Now, if you combine these rules, the whole need for the
GUP_PIN_COUNTING_BIAS basically goes away.

Why? Because we know that the _only_ thing that can elevate the
refcount of a writable COW page is GUP - we'll just make sure nothing
else touches it.

The whole "optimistic page references throigh page cache" etc are
complete non-issues, because the whole point is that we already know
it's not a page cache page. There is simply no other way to reach that
page than through GUP.

Ergo: any writable pte in a COW mapping that has a page with a
page_count() > 1 is pinned by definition, and thus our

   page_maybe_dma_pinned(page)

could remove that "maybe" part, and simply check for

    page_count(page) > 1

(although the rule would be that this is only valid for a cow_mapping
pte, and only while holding the page table lock! So maybe it would be
good to pass in the vma and have an assert for that lock too).

And the thing is, none of the above rules are complicated.

The only new one would be the requirement that you cannot add a page
to the swap cache unless it is read-only in the page tables. That may
be the biggest hurdle here.

The way we handle swap cache is that we *first* add it to the swap
cache, and then we do a "try_to_unmap()" on it. So we currently don't
actually try to walk the page tables until we have already done that
swap cache thing.

But I do think that the only major problem spot is that
shrink_page_list() -> add_to_swap() case, and I think we could make
add_to_swap() just do the rmap walk and turn it read-only first.

(And it's worth pointing out that I'm only talking about regular
non-huge pages above, the rules for splitting hugepages may impact
those cases differently, I didn't really even try to think about those
cases).

But thatadd_to_swap() case might make it too painful. It _would_
simplify our rules wrt anonymous mappings enormously, though.

               Linus


  reply	other threads:[~2021-01-11 22:25 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-10  0:44 [PATCH 0/1] mm: restore full accuracy in COW page reuse Andrea Arcangeli
2021-01-10  0:44 ` [PATCH 1/1] " Andrea Arcangeli
2021-01-10  2:54   ` Andrea Arcangeli
2021-01-11 14:11     ` Kirill A. Shutemov
2021-01-10  0:55 ` [PATCH 0/1] " Linus Torvalds
2021-01-10  1:19   ` Linus Torvalds
2021-01-10  1:37     ` Linus Torvalds
2021-01-10  3:24       ` Andrea Arcangeli
2021-01-10  2:51     ` Andrea Arcangeli
2021-01-10  3:51       ` Linus Torvalds
2021-01-10 19:30         ` Linus Torvalds
2021-01-11  1:18           ` Jason Gunthorpe
2021-01-11  7:26           ` John Hubbard
2021-01-11 12:42             ` Matthew Wilcox
2021-01-11 16:05             ` Jason Gunthorpe
2021-01-11 16:15               ` Michal Hocko
2021-01-11 19:19             ` Linus Torvalds
2021-01-11 22:18               ` Linus Torvalds [this message]
2021-01-12 17:07                 ` Andy Lutomirski
2021-01-12 23:51                 ` Jerome Glisse
2021-01-13  2:16                 ` Matthew Wilcox
2021-01-13  2:43                   ` Linus Torvalds
2021-01-13  3:31                   ` Linus Torvalds
2021-01-13  8:52                     ` David Hildenbrand
2021-01-13  8:57                       ` David Hildenbrand
2021-01-13 12:32                     ` Kirill A. Shutemov
2021-01-13 12:55                       ` Matthew Wilcox
2021-01-13 19:54                         ` Linus Torvalds
2021-01-13 23:54           ` Peter Xu
2021-01-11 15:52       ` Jason Gunthorpe
2021-01-15  8:59 ` David Hildenbrand
2021-01-15 18:37   ` Jason Gunthorpe
2021-01-15 19:46     ` David Hildenbrand
2021-01-15 19:53       ` Jason Gunthorpe
2021-01-16  3:40       ` John Hubbard
2021-01-16 11:42         ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHk-=wje9r3fREBdZcOu=NihGczBtkqkhXRPDhY-ZkNVv=thiQ@mail.gmail.com' \
    --to=torvalds@linux-foundation.org \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=keescook@chromium.org \
    --cc=kirill@shutemov.name \
    --cc=ktkhai@virtuozzo.com \
    --cc=leonro@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=minchan@kernel.org \
    --cc=nadav.amit@gmail.com \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xemul@openvz.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).