All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Jerome Glisse <jglisse@redhat.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Matthew Wilcox <willy@infradead.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Nadav Amit <nadav.amit@gmail.com>
Subject: Re: [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs
Date: Tue, 9 Feb 2021 17:00:56 -0500	[thread overview]
Message-ID: <20210209220056.GD103365@xz-x1> (raw)
In-Reply-To: <201f2636-1193-2cc1-ccee-a91243f14666@oracle.com>

On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of this series?
> >>>
> >>> It would be great to know it, especially major objection, before move on to an
> >>> non-rfc version.
> >>
> >> My apologies for not looking at this sooner.  Even now, I have only taken
> >> a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> >> hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
> >> set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
> >> as here and in Axel's uffd minor fault code.
> > 
> > Interesting to know that we'll reach and need something common from different
> > directions, especially when they all mostly happen at the same time. :)
> > 
> > Is there a real "BUG" that you mentioned?  I'd be glad to read about it if
> > there is a link or something.
> > 
> 
> Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added.  A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@oracle.com
> I am close to sending out a RFC, but keep getting distracted.

Thanks.  Indeed I see no reason to not have hugetlb supported for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.

> 
> >> No objections to the overall approach based on my quick look.
> > 
> > Thanks for having a look.
> > 
> > So for hugetlb one major thing is indeed about the pmd sharing part, which
> > seems that we've got very good consensus on.
> 
> Yes
> 
> > The other thing that I'd love to get some comment would be a shared topic with
> > shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> > to record wr-protect information even if the pgtable entries were flushed.
> > That comes from a fundamental difference between anonymous and file-backed
> > memory in that anonymous pages keep all info in the pgtable entry, but
> > file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> > long as page cache is there.
> 
> Sorry, but I can not recall this difference for hugetlb pages.  What operations
> lead to flushing of pagetable entries?  It would need to be something other
> than unmap as it seems we want to lose the information in unmap IIUC.

For hugetlbfs I know two cases.

One is exactly huge pmd sharing as mentioned above, where we'll drop the
pgtable entries for a specific process but the page cache will still exist.

The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() called
before remove_inode_hugepages().  For uffd-wp, there will be a very small
window that a wr-protected huge page can be written before the page is finally
dropped in remove_inode_hugepages() but after pgtable entry flushed.  In some
apps that could cause data loss.

> 
> > I goes to look at soft-dirty then regarding this issue, and there's actually a
> > paragraph about it:
> > 
> >         While in most cases tracking memory changes by #PF-s is more than enough
> >         there is still a scenario when we can lose soft dirty bits -- a task
> >         unmaps a previously mapped memory region and then maps a new one at
> >         exactly the same place. When unmap is called, the kernel internally
> >         clears PTE values including soft dirty bits. To notify user space
> >         application about such memory region renewal the kernel always marks
> >         new memory regions (and expanded regions) as soft dirty.
> > 
> > I feel like it just means soft-dirty currently allows false positives: we could
> > have set the soft dirty bit even if the page is clean.  And that's what this
> > series wanted to avoid: it used the new concept called "swap special pte" to
> > persistent that information even for file-backed memory.  That all goes for
> > avoiding those false positives.
> 
> Yes, I have seen this with soft dirty.  It really does not seem right.  When
> you first create a mapping, even before faulting in anything the vma is marked
> VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.

Right that seems not optimal.  It is understandable since dirty info is indeed
tolerant to false positives, so soft-dirty avoided this issue as uffd-wp wanted
to solve in this series.  It would be great to know if current approach in this
series would work for us to remove those false positives.

> 
> To be honest, I am not sure you want to try and carry per-process/per-mapping
> wp information in the file.

What this series does is trying to persist that information in pgtable entries,
rather than in the file (or page cache).  Frankly I can't say whether that's
optimal either, so I'm always open to any comment.  So far I think it's a valid
solution, but it could always be possible that I missed something important.

> In the comment about soft dirty above, it seems
> reasonable that unmapping would clear all soft dirty information.  Also,
> unmapping would clear any uffd state/information.

Right, unmap should always means "dropping all information in the ptes".  It's
in below patch that we tried to treat it differently:

https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2

A quick summary of above patch: only if we unmap or truncate the hugetlbfs
file, would we call hugetlb_vmdelete_list() with ZAP_FLAG_DROP_FILE_UFFD_WP
(which means we'll drop all the information, including uffd-wp bit).

Thanks,

-- 
Peter Xu


  reply	other threads:[~2021-02-10  0:50 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-01-15 17:08 [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
2021-01-15 17:08 ` [PATCH RFC 01/30] mm/thp: Simplify copying of huge zero page pmd when fork Peter Xu
2021-01-15 17:08 ` [PATCH RFC 02/30] mm/userfaultfd: Fix uffd-wp special cases for fork() Peter Xu
2021-01-15 17:08 ` [PATCH RFC 03/30] mm/userfaultfd: Fix a few thp pmd missing uffd-wp bit Peter Xu
2021-01-15 17:08 ` [PATCH RFC 04/30] shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2021-01-15 17:08 ` [PATCH RFC 05/30] mm: Clear vmf->pte after pte_unmap_same() returns Peter Xu
2021-01-15 17:08 ` [PATCH RFC 06/30] mm/userfaultfd: Introduce special pte for unmapped file-backed mem Peter Xu
2021-01-15 17:08 ` [PATCH RFC 07/30] mm/swap: Introduce the idea of special swap ptes Peter Xu
2021-01-18 19:40   ` Jason Gunthorpe
2021-01-19 14:24     ` Peter Xu
2021-01-15 17:08 ` [PATCH RFC 08/30] shmem/userfaultfd: Handle uffd-wp special pte in page fault handler Peter Xu
2021-01-15 19:51   ` kernel test robot
2021-01-15 21:01   ` kernel test robot
2021-01-15 17:08 ` [PATCH RFC 09/30] mm: Drop first_index/last_index in zap_details Peter Xu
2021-01-15 17:08 ` [PATCH RFC 10/30] mm: Introduce zap_details.zap_flags Peter Xu
2021-01-15 17:08 ` [PATCH RFC 11/30] mm: Introduce ZAP_FLAG_SKIP_SWAP Peter Xu
2021-01-15 17:08 ` [PATCH RFC 12/30] mm: Pass zap_flags into unmap_mapping_pages() Peter Xu
2021-01-15 17:08 ` [PATCH RFC 13/30] shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed Peter Xu
2021-01-15 17:08 ` [PATCH RFC 14/30] shmem/userfaultfd: Allow wr-protect none pte for file-backed mem Peter Xu
2021-01-15 17:08 ` [PATCH RFC 15/30] shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps Peter Xu
2021-01-15 17:08 ` [PATCH RFC 16/30] shmem/userfaultfd: Handle the left-overed special swap ptes Peter Xu
2021-01-15 17:08 ` [PATCH RFC 17/30] shmem/userfaultfd: Pass over uffd-wp special swap pte when fork() Peter Xu
2021-01-15 17:08 ` [PATCH RFC 18/30] hugetlb/userfaultfd: Hook page faults for uffd write protection Peter Xu
2021-01-15 17:08 ` [PATCH RFC 19/30] hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP Peter Xu
2021-01-15 17:08 ` [PATCH RFC 20/30] hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT Peter Xu
2021-01-15 17:08 ` [PATCH RFC 21/30] hugetlb: Pass vma into huge_pte_alloc() Peter Xu
2021-01-28 22:59   ` Axel Rasmussen
2021-01-28 22:59     ` Axel Rasmussen
2021-01-29 22:31     ` Peter Xu
2021-01-30  8:08       ` Axel Rasmussen
2021-01-30  8:08         ` Axel Rasmussen
2021-01-15 17:08 ` [PATCH RFC 22/30] hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled Peter Xu
2021-01-16 10:02   ` kernel test robot
2021-01-15 17:09 ` [PATCH RFC 23/30] mm/hugetlb: Introduce huge version of special swap pte helpers Peter Xu
2021-01-15 17:09 ` [PATCH RFC 24/30] mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h Peter Xu
2021-01-15 17:09 ` [PATCH RFC 25/30] hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp Peter Xu
2021-01-15 23:05   ` kernel test robot
2021-01-15 17:09 ` [PATCH RFC 26/30] hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler Peter Xu
2021-01-15 17:09 ` [PATCH RFC 27/30] hugetlb/userfaultfd: Allow wr-protect none ptes Peter Xu
2021-01-15 17:09 ` [PATCH RFC 28/30] hugetlb/userfaultfd: Only drop uffd-wp special pte if required Peter Xu
2021-01-15 17:09 ` [PATCH RFC 29/30] userfaultfd: Enable write protection for shmem & hugetlbfs Peter Xu
2021-01-15 17:12 ` [PATCH RFC 30/30] userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs Peter Xu
2021-01-29 22:49 ` [PATCH RFC 00/30] userfaultfd-wp: Support shmem and hugetlbfs Peter Xu
2021-02-05 21:53   ` Mike Kravetz
2021-02-06  2:36     ` Peter Xu
2021-02-09 19:29       ` Mike Kravetz
2021-02-09 22:00         ` Peter Xu [this message]
2021-02-05 22:21   ` Hugh Dickins
2021-02-05 22:21     ` Hugh Dickins
2021-02-06  2:47     ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210209220056.GD103365@xz-x1 \
    --to=peterx@redhat.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=hughd@google.com \
    --cc=jglisse@redhat.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=nadav.amit@gmail.com \
    --cc=rppt@linux.vnet.ibm.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.