All of lore.kernel.org
 help / color / mirror / Atom feed
From: John Hubbard <jhubbard@nvidia.com>
To: Nadav Amit <namit@vmware.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	David Rientjes <rientjes@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Mike Rapoport <rppt@linux.ibm.com>,
	Yang Shi <shy828301@gmail.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Matthew Wilcox <willy@infradead.org>,
	Vlastimil Babka <vbabka@suse.cz>, Jann Horn <jannh@google.com>,
	Michal Hocko <mhocko@kernel.org>, Rik van Riel <riel@surriel.com>,
	Roman Gushchin <guro@fb.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Peter Xu <peterx@redhat.com>, Donald Dutile <ddutile@redhat.com>,
	Christoph Hellwig <hch@lst.de>, Oleg Nesterov <oleg@redhat.com>,
	Jan Kara <jack@suse.cz>, Linux-MM <linux-mm@kvack.org>,
	"open list:KERNEL SELFTEST FRAMEWORK" 
	<linux-kselftest@vger.kernel.org>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)
Date: Sun, 19 Dec 2021 00:01:59 -0800	[thread overview]
Message-ID: <341cf567-468e-b5cf-6813-b26c49e851b5@nvidia.com> (raw)
In-Reply-To: <5A7D771C-FF95-465E-95F6-CD249FE28381@vmware.com>

On 12/18/21 22:02, Nadav Amit wrote:
> 
>> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> (I have only ever seen the kernel side of uffd, not the actual user
>> side, so I'm not sure about the use patterns).
> 
> I use it in a very fine granularity, and I suspect QEMU and CRIU do so
> too.
> 
>>
>> That said, your suggestion of a shadow sw page table bit thing would
>> also work. And it would solve some problems we have in core areas
>> (notably "page_special()" which right now has that
>> ARCH_HAS_PTE_SPECIAL thing).
>>
>> It would make it really easy to have that "this page table entry is
>> pinned" flag too.
> 
> I found my old messy code for the software-PTE thing.
> 
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]
> 
> My rationale was that:
> 
> 1. It does not bound you to have the same size for PTE and “extra-PTE”
> 2. The PMD-page struct is anyhow hot (since you acquired the PTL)
> 3. Allocating “extra-PTE” dynamically does not require to rewire the
>     page-tables, which requires a TLB flush.
> 
> I think there is a place to hold a pointer in the PMD-page-struct
> (_pt_pad_1, we just need to keep the lowest bit clear so the kernel
> won’t mistaken it to be a compound page).
> 
> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue. Keeping a pin-count (which requires
> internal API changes for unpin_user_page() and friends?) or having
> “was ever pinned” sticky bit? And then changing
> page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
> would break the COW eagerly?
> 
> Anyhow, I can clean it up and send (although it is rather simple
> and I ignored many thing, such as THP, remap, etc), but I am not
> sure I have the time now to fully address the COW problem. I will
> wait for Monday for David’s response.
> 

Hi Nadav,

A couple of thoughts about this part of the design:

a) The PMD-page-struct approach won't help as much, because (assuming
that we're using it in an attempt to get a true, perfect pin count), you
are combining the pin counts of a PMD's worth of pages. OTOH...maybe
that actually *is* OK, assuming you don't overflow--except that you can
only answer the "is it dma-pinned?" question at a PMD level. That's a
contradiction of your stated desire above to have very granular control.

Also, because of not having bit 0 available in page._pt_pad_1, I think
the count would have to be implemented as adding and subtracting 2,
instead of 1 (in order to keep the value even), further reducing the
counter range.

b) If, instead, you used your older 2-adjacent pages approach, then
Linus' comment makes more sense here: we could use the additional struct
page to hold an exact pin count, per page. That way, you can actually
build a wrapper function such as:

     page_really_is_dma_pinned()

...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().

Furthermore, this direction is extensible and supports solving other "I
am out of space in struct page" problems, at the cost of more pages, of
course.

As an aside, I find it instructive that we're talking about this
approach, instead of extending struct page. The lesson I'm taking away
is: allocating more space for some cases (2x pages) is better than
having *all* struct pages be larger than they are now.

Anyway, the pin count implementation would look somewhat like the
existing hpage_pincount, which similarly has ample space for a separate,
exact pin count. In other words, this sort of thing (mostly-pseudo
code):


diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..646761388025 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page)
  	return PageCompound(page) && compound_order(page) > 1;
  }

+static inline bool shadow_page_pincount_available(struct page *page)
+{
+	/*
+	 * TODO: Nadav: connect this up with the shadow page table
+	 * implementation, and return an appropriate answer.
+	 */
+
+	return false; // hardcoded for now, for compile testing
+}
+
  static inline int head_compound_pincount(struct page *head)
  {
  	return atomic_read(compound_pincount_ptr(head));
@@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page)
  	return head_compound_pincount(page);
  }

+static inline int shadow_page_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page);
+
+	return atomic_read(shadow_page_pincount_ptr(page));
+}
+
  static inline void set_compound_order(struct page *page, unsigned int order)
  {
  	page[1].compound_order = order;
@@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page)
  	if (hpage_pincount_available(page))
  		return compound_pincount(page) > 0;

+	if (shadow_page_pincount_available(page))
+		return shadow_page_pincount(page) > 0;
+
  	/*
  	 * page_ref_count() is signed. If that refcount overflows, then
  	 * page_ref_count() returns a negative value, and callers will avoid

c) The "was it ever pinned" sticky bit is not a workable concept, at the
struct page level. A counter is required, in order to allow pages to
live out their normal lives to their fullest potential. The only time we
even temporarily got away with this kind of stickiness was at a higher
level, and only per-process, not per-physical-page. Processes come and
go, but the struct pages are more or less forever, so once you mark one
sticky like this, it's out of play.

thanks,
-- 
John Hubbard
NVIDIA

  reply	other threads:[~2021-12-19  8:02 UTC|newest]

Thread overview: 137+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-12-17 11:30 [PATCH v1 00/11] mm: COW fixes part 1: fix the COW security issue for THP and hugetlb David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 01/11] seqlock: provide lockdep-free raw_seqcount_t variant David Hildenbrand
2021-12-17 17:02   ` Nadav Amit
2021-12-17 17:29     ` David Hildenbrand
2021-12-17 17:49       ` David Hildenbrand
2021-12-17 18:01         ` Nadav Amit
2021-12-17 21:28   ` Thomas Gleixner
2021-12-17 22:02     ` David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 02/11] mm: thp: consolidate mapcount logic on THP split David Hildenbrand
2021-12-17 19:06   ` Yang Shi
2021-12-18 14:24   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 03/11] mm: simplify hugetlb and file-THP handling in __page_mapcount() David Hildenbrand
2021-12-17 17:16   ` Nadav Amit
2021-12-17 17:30     ` David Hildenbrand
2021-12-17 18:06   ` Mike Kravetz
2021-12-17 18:11     ` David Hildenbrand
2021-12-17 19:07   ` Yang Shi
2021-12-18 14:31   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 04/11] mm: thp: simlify total_mapcount() David Hildenbrand
2021-12-17 19:12   ` Yang Shi
2021-12-18 14:35   ` Kirill A. Shutemov
2021-12-17 11:30 ` [PATCH v1 05/11] mm: thp: allow for reading the THP mapcount atomically via a raw_seqlock_t David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) David Hildenbrand
2021-12-17 19:04   ` Linus Torvalds
2021-12-17 19:22     ` Linus Torvalds
2021-12-17 20:17       ` David Hildenbrand
2021-12-17 20:36         ` Linus Torvalds
2021-12-17 20:39           ` Linus Torvalds
2021-12-17 20:43             ` Linus Torvalds
2021-12-17 20:42           ` David Hildenbrand
2021-12-17 20:45             ` Linus Torvalds
2021-12-18 22:52               ` Kirill A. Shutemov
2021-12-18 23:05                 ` Linus Torvalds
2021-12-17 20:47           ` Jason Gunthorpe
2021-12-17 20:56             ` Linus Torvalds
2021-12-17 21:17               ` David Hildenbrand
2021-12-17 21:04             ` David Hildenbrand
2021-12-18  0:50               ` Jason Gunthorpe
2021-12-17 21:15             ` Nadav Amit
2021-12-17 21:20               ` David Hildenbrand
2021-12-18  0:50               ` Jason Gunthorpe
2021-12-18  1:53               ` Linus Torvalds
2021-12-18  2:17                 ` Linus Torvalds
2021-12-18  2:42                   ` Linus Torvalds
2021-12-18  3:36                     ` Linus Torvalds
2021-12-18 10:06                     ` David Hildenbrand
2021-12-18  3:05                 ` Jason Gunthorpe
2021-12-18  3:30                   ` Nadav Amit
2021-12-18  3:38                     ` Linus Torvalds
2021-12-18 18:42                       ` Jason Gunthorpe
2021-12-18 18:49                         ` David Hildenbrand
2021-12-18 21:48                         ` Nadav Amit
2021-12-18 22:53                           ` Linus Torvalds
2021-12-19  0:19                             ` Nadav Amit
2021-12-19  0:35                               ` Linus Torvalds
2021-12-19  6:02                                 ` Nadav Amit
2021-12-19  8:01                                   ` John Hubbard [this message]
2021-12-19 11:30                                     ` Matthew Wilcox
2021-12-19 17:27                                   ` Linus Torvalds
2021-12-19 17:44                                     ` David Hildenbrand
2021-12-19 17:44                                     ` Linus Torvalds
2021-12-19 17:59                                       ` David Hildenbrand
2021-12-19 21:12                                         ` Matthew Wilcox
2021-12-19 21:27                                           ` Linus Torvalds
2021-12-19 21:47                                             ` Matthew Wilcox
2021-12-19 21:53                                               ` Linus Torvalds
2021-12-19 22:02                                                 ` Matthew Wilcox
2021-12-19 22:12                                                   ` Linus Torvalds
2021-12-19 22:26                                                     ` Matthew Wilcox
2021-12-20 18:37                                           ` Matthew Wilcox
2021-12-20 18:52                                             ` Matthew Wilcox
2021-12-20 19:38                                               ` Linus Torvalds
2021-12-20 19:15                                             ` Linus Torvalds
2021-12-20 21:02                                               ` Matthew Wilcox
2021-12-20 21:27                                                 ` Linus Torvalds
2021-12-21  1:03                                         ` Jason Gunthorpe
2021-12-21  3:29                                           ` Matthew Wilcox
2021-12-21  8:58                                           ` David Hildenbrand
2021-12-21 14:28                                             ` Jason Gunthorpe
2021-12-21 15:19                                               ` David Hildenbrand
2021-12-21 23:54                                                 ` Jason Gunthorpe
2021-12-21 17:05                                             ` Linus Torvalds
2021-12-21 17:40                                               ` David Hildenbrand
2021-12-21 18:00                                                 ` Linus Torvalds
2021-12-21 18:28                                                   ` David Hildenbrand
2021-12-21 21:11                                                     ` John Hubbard
2021-12-21 18:07                                                 ` Jan Kara
2021-12-21 18:30                                                   ` Linus Torvalds
2021-12-21 18:51                                                     ` David Hildenbrand
2021-12-21 18:58                                                       ` Linus Torvalds
2021-12-21 21:16                                                     ` John Hubbard
2021-12-21 19:07                                                 ` Jason Gunthorpe
2021-12-22  8:51                                                   ` David Hildenbrand
2021-12-22  9:58                                                     ` David Hildenbrand
2021-12-22 12:41                                                       ` Jan Kara
2021-12-22 13:09                                                         ` David Hildenbrand
2021-12-22 14:42                                                           ` Jan Kara
2021-12-22 14:48                                                             ` David Hildenbrand
2021-12-22 16:08                                                               ` Jan Kara
2021-12-22 16:44                                                                 ` Matthew Wilcox
2021-12-22 18:40                                                                 ` Linus Torvalds
2021-12-23 12:54                                                                   ` Jan Kara
2021-12-23 17:18                                                                     ` Linus Torvalds
2021-12-23  0:21                                                           ` Matthew Wilcox
2021-12-24  2:53                                                             ` Jason Gunthorpe
2021-12-24  4:53                                                               ` Matthew Wilcox
2022-01-04  0:33                                                                 ` Jason Gunthorpe
2021-12-21 23:59                                                 ` Jason Gunthorpe
2021-12-22  8:30                                                   ` David Hildenbrand
2021-12-22 12:44                                                   ` Jan Kara
2021-12-17 20:45     ` David Hildenbrand
2021-12-17 20:51       ` Linus Torvalds
2021-12-17 20:55         ` David Hildenbrand
2021-12-17 21:36           ` Linus Torvalds
2021-12-17 21:47             ` David Hildenbrand
2021-12-17 21:50               ` Linus Torvalds
2021-12-17 22:29                 ` David Hildenbrand
2021-12-17 22:58                   ` Linus Torvalds
2021-12-17 23:29                     ` David Hildenbrand
2021-12-17 23:53                       ` Nadav Amit
2021-12-18  4:02                         ` Linus Torvalds
2021-12-18  4:52                           ` Nadav Amit
2021-12-18  5:03                             ` Matthew Wilcox
2021-12-18  5:23                               ` Nadav Amit
2021-12-18 18:37                               ` Linus Torvalds
2021-12-17 22:18               ` Linus Torvalds
2021-12-17 22:43                 ` David Hildenbrand
2021-12-17 23:20                   ` Linus Torvalds
2021-12-18  9:57                     ` David Hildenbrand
2021-12-18 19:21                       ` Linus Torvalds
2021-12-18 19:52                         ` Linus Torvalds
2021-12-19  8:43                           ` David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 07/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (!hugetlb) David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 08/11] mm: hugetlb: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 09/11] mm: gup: trigger unsharing via FAULT_FLAG_UNSHARE when required (hugetlb) David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 10/11] mm: thp: introduce and use page_trans_huge_anon_shared() David Hildenbrand
2021-12-17 11:30 ` [PATCH v1 11/11] selftests/vm: add tests for the known COW security issues David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=341cf567-468e-b5cf-6813-b26c49e851b5@nvidia.com \
    --to=jhubbard@nvidia.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=ddutile@redhat.com \
    --cc=guro@fb.com \
    --cc=hch@lst.de \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgg@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=mike.kravetz@oracle.com \
    --cc=namit@vmware.com \
    --cc=oleg@redhat.com \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rientjes@google.com \
    --cc=rppt@linux.ibm.com \
    --cc=shakeelb@google.com \
    --cc=shy828301@gmail.com \
    --cc=torvalds@linux-foundation.org \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.