[RFC,00/30] userfaultfd-wp: Support shmem and hugetlbfs

Message ID 20210115170907.24498-1-peterx@redhat.com

Message

Peter Xu Jan. 15, 2021, 5:08 p.m. UTC
This is an RFC series to support userfaultfd write protection (uffd-wp) on
shmem and hugetlbfs.

PS. Note that there is a known TLB issue [0] with uffd-wp/soft-dirty in
general, and Nadav is working on it.  It may or may not directly affect
shmem/hugetlbfs, since there is normally no COW on shared mappings.  Private
shmem mappings could hit it, but that is a separate problem to solve in
general; this RFC is mainly meant to find out whether there is any objection to
the concept of uffd-wp on shmem/hugetlbfs.

The whole series can also be found online [1].

The major comment I'd like to get is on the new idea of the special swap pte.
It comes from suggestions by both Hugh and Andrea, and I appreciate those
discussions a lot.

In short, it's a new type of pte that did not exist before.  It is used in
file-backed memory to persist information across ptes being erased (the page
cache could still exist, for example, so on the next page fault we can map the
page from the page cache again with that specific information when necessary).

I'm copy-pasting part of the commit message from the patch "mm/swap: Introduce
the idea of special swap ptes", where uffd-wp becomes its first user:

    We used to have special swap entries, like migration entries, hw-poison
    entries, device private entries, etc.

    Those "special swap entries" are still swap entries first of all, and their
    types are decided by swp_type(entry).

    This patch introduces another idea called "special swap ptes".

    It's very easy to confuse this with "special swap entries", but a special
    swap pte should never contain a swap entry at all.  That is, it's illegal to
    call pte_to_swp_entry() upon a special swap pte.

    Make the uffd-wp special pte the first special swap pte.

    Before this patch, is_swap_pte()==true means one of the below:

       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.

       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.

    After this patch, is_swap_pte()==true means one of the below, where case (b) is
    added:

     (a) The pte contains a swap entry.

       (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
             example, when an anonymous page got swapped out.

       (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
             example, a migration entry, a hw-poison entry, etc.

     (b) The pte does not contain a swap entry at all (so it cannot be passed
         into pte_to_swp_entry()).  For example, uffd-wp special swap pte.
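
A minimal userspace sketch of the classification above may help keep the three
cases apart.  It assumes a made-up pte layout; PTE_UFFD_WP_SPECIAL,
is_swap_special_pte(), mk_swap_pte(), classify_pte() and the toy value of
MAX_SWAPFILES are hypothetical names chosen for illustration, not the
encoding or helpers used by the patch.  Only the shape of the checks is meant
to carry over: test for the special swap pte before decoding a swap entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Toy pte layout for illustration only (not any real architecture):
 *   bit 0       present
 *   bit 1       "special swap pte" marker, meaningful only when !present
 *   bits 2..6   swap type
 *   bits 7..63  swap offset
 * Note that (type 0, offset 0) would encode to 0, i.e. a none pte, in this toy.
 */
typedef uint64_t pte_t;
typedef struct { unsigned type; uint64_t offset; } swp_entry_t;

#define PTE_PRESENT          ((pte_t)1 << 0)
#define PTE_UFFD_WP_SPECIAL  ((pte_t)1 << 1)  /* hypothetical encoding */
#define SWP_TYPE_SHIFT       2
#define SWP_TYPE_MASK        0x1fu
#define SWP_OFFSET_SHIFT     7
#define MAX_SWAPFILES        28u  /* toy value: types >= this are special entries */

enum pte_kind {
    KIND_NONE,
    KIND_PRESENT,
    KIND_SWAP_NORMAL,         /* case (a.1) */
    KIND_SWAP_SPECIAL_ENTRY,  /* case (a.2) */
    KIND_SWAP_SPECIAL_PTE,    /* case (b)   */
};

static inline bool pte_none(pte_t pte)    { return pte == 0; }
static inline bool pte_present(pte_t pte) { return (pte & PTE_PRESENT) != 0; }

/* Same shape as the kernel helper: anything neither none nor present. */
static inline bool is_swap_pte(pte_t pte)
{
    return !pte_none(pte) && !pte_present(pte);
}

static inline bool is_swap_special_pte(pte_t pte)
{
    return pte == PTE_UFFD_WP_SPECIAL;
}

static inline pte_t mk_swap_pte(unsigned type, uint64_t offset)
{
    return ((pte_t)(type & SWP_TYPE_MASK) << SWP_TYPE_SHIFT) |
           ((pte_t)offset << SWP_OFFSET_SHIFT);
}

/* Decoding a special swap pte is illegal; the assert models that rule. */
static inline swp_entry_t pte_to_swp_entry(pte_t pte)
{
    assert(is_swap_pte(pte) && !is_swap_special_pte(pte));
    swp_entry_t entry = {
        (unsigned)((pte >> SWP_TYPE_SHIFT) & SWP_TYPE_MASK),
        pte >> SWP_OFFSET_SHIFT,
    };
    return entry;
}

static inline bool non_swap_entry(swp_entry_t entry)
{
    return entry.type >= MAX_SWAPFILES;
}

static inline enum pte_kind classify_pte(pte_t pte)
{
    if (pte_none(pte))
        return KIND_NONE;
    if (pte_present(pte))
        return KIND_PRESENT;
    /* Must come first: case (b) holds no swap entry to decode. */
    if (is_swap_special_pte(pte))
        return KIND_SWAP_SPECIAL_PTE;
    return non_swap_entry(pte_to_swp_entry(pte)) ? KIND_SWAP_SPECIAL_ENTRY
                                                 : KIND_SWAP_NORMAL;
}
```

In this toy, classify_pte(mk_swap_pte(0, 42)) lands in case (a.1),
classify_pte(mk_swap_pte(MAX_SWAPFILES, 42)) in case (a.2), and
classify_pte(PTE_UFFD_WP_SPECIAL) in case (b), while is_swap_pte() is true for
all three.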

Hugetlbfs needs something similar because it's also file-backed.  I directly
reused the same special pte there, though the shmem and hugetlb changes to
support this new pte differ, since the two don't share much of their code paths.

Patch layout
============

Part (1): some fixes for issues I noticed while working on this; feel free to
skip them for now, because I think they're corner cases and unrelated to the
major change:

  mm/thp: Simplify copying of huge zero page pmd when fork
  mm/userfaultfd: Fix uffd-wp special cases for fork()
  mm/userfaultfd: Fix a few thp pmd missing uffd-wp bit

Part (2): Shmem support; this is where the special swap pte is introduced.
Some zap rework is needed in the process:

  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()

Part (3): Hugetlb support.  We need to disable huge pmd sharing for uffd-wp
because the two are not compatible, just like with uffd minor mode.  The rest
are the changes required to teach hugetlbfs to understand the special swap pte
introduced with the uffd-wp change:

  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  hugetlb: Pass vma into huge_pte_alloc()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Introduce huge version of special swap pte helpers
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required

Part (4): Enable both features in code and test

  userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs
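
For orientation, here is a rough userspace sketch of what using the feature
could look like.  It assumes the existing uffd-wp interface for anonymous
memory (UFFDIO_REGISTER with UFFDIO_REGISTER_MODE_WP, then
UFFDIO_WRITEPROTECT) applies unchanged to a shmem mapping; the series also
touches the uapi header, so an extra feature flag may be involved, and none is
assumed here.  The helper name wp_protect_shmem() is made up.  On a kernel
without shmem uffd-wp support the register step is expected to fail, and the
helper simply reports the error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of shared anonymous memory (which is
 * shmem-backed), register it for uffd write protection and write-protect the
 * whole range.  Returns 0 on success or a negative errno value from the first
 * step that failed.  The memory is not touched afterwards, so no fault is
 * ever raised here.
 */
static inline int wp_protect_shmem(size_t len)
{
    struct uffdio_api api = { .api = UFFD_API };
    struct uffdio_register reg = { .mode = UFFDIO_REGISTER_MODE_WP };
    struct uffdio_writeprotect wp = { .mode = UFFDIO_WRITEPROTECT_MODE_WP };
    void *addr;
    int uffd;
    int ret = 0;

    assert(len > 0);

    uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -errno;

    if (ioctl(uffd, UFFDIO_API, &api) < 0) {
        ret = -errno;
        goto out_close;
    }

    addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED) {
        ret = -errno;
        goto out_close;
    }

    reg.range.start = (unsigned long)addr;
    reg.range.len = len;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0) {
        /* Expected on kernels that only support uffd-wp on anonymous memory. */
        ret = -errno;
        goto out_unmap;
    }

    wp.range.start = (unsigned long)addr;
    wp.range.len = len;
    if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) < 0)
        ret = -errno;

out_unmap:
    munmap(addr, len);
out_close:
    close(uffd);
    return ret;
}
```

A caller would pass a page-aligned length, for example
wp_protect_shmem((size_t)sysconf(_SC_PAGESIZE)), and treat a negative return
as "not supported or not permitted on this system".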

Tests
=====

I've tested it with the userfaultfd kselftest program, and also with
umapsort [2], which should be even stricter.  No complicated mm setup has been
tested yet besides page swapping in/out; in any case we will need more tests
when it becomes non-RFC.

If anyone would like to try umapsort, you need to use an extremely hacked
version of the umap library [3], because by default umap only supports
anonymous memory.  So to test it, build [3] first and then [2].

Any comments would be very welcome.  Thanks,

[0] https://lore.kernel.org/lkml/20201225092529.3228466-1-namit@vmware.com/
[1] https://github.com/xzpeter/linux/tree/uffd-wp-shmem-hugetlbfs
[2] https://github.com/LLNL/umap-apps
[3] https://github.com/xzpeter/umap/tree/peter-shmem-hugetlbfs

Peter Xu (30):
  mm/thp: Simplify copying of huge zero page pmd when fork
  mm/userfaultfd: Fix uffd-wp special cases for fork()
  mm/userfaultfd: Fix a few thp pmd missing uffd-wp bit
  shmem/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  mm: Clear vmf->pte after pte_unmap_same() returns
  mm/userfaultfd: Introduce special pte for unmapped file-backed mem
  mm/swap: Introduce the idea of special swap ptes
  shmem/userfaultfd: Handle uffd-wp special pte in page fault handler
  mm: Drop first_index/last_index in zap_details
  mm: Introduce zap_details.zap_flags
  mm: Introduce ZAP_FLAG_SKIP_SWAP
  mm: Pass zap_flags into unmap_mapping_pages()
  shmem/userfaultfd: Persist uffd-wp bit across zapping for file-backed
  shmem/userfaultfd: Allow wr-protect none pte for file-backed mem
  shmem/userfaultfd: Allows file-back mem to be uffd wr-protected on
    thps
  shmem/userfaultfd: Handle the left-overed special swap ptes
  shmem/userfaultfd: Pass over uffd-wp special swap pte when fork()
  hugetlb/userfaultfd: Hook page faults for uffd write protection
  hugetlb/userfaultfd: Take care of UFFDIO_COPY_MODE_WP
  hugetlb/userfaultfd: Handle UFFDIO_WRITEPROTECT
  hugetlb: Pass vma into huge_pte_alloc()
  hugetlb/userfaultfd: Forbid huge pmd sharing when uffd enabled
  mm/hugetlb: Introduce huge version of special swap pte helpers
  mm/hugetlb: Move flush_hugetlb_tlb_range() into hugetlb.h
  hugetlb/userfaultfd: Unshare all pmds for hugetlbfs when register wp
  hugetlb/userfaultfd: Handle uffd-wp special pte in hugetlb pf handler
  hugetlb/userfaultfd: Allow wr-protect none ptes
  hugetlb/userfaultfd: Only drop uffd-wp special pte if required
  userfaultfd: Enable write protection for shmem & hugetlbfs
  userfaultfd/selftests: Enable uffd-wp for shmem/hugetlbfs

 arch/arm64/mm/hugetlbpage.c              |   5 +-
 arch/ia64/mm/hugetlbpage.c               |   3 +-
 arch/mips/mm/hugetlbpage.c               |   4 +-
 arch/parisc/mm/hugetlbpage.c             |   2 +-
 arch/powerpc/mm/hugetlbpage.c            |   3 +-
 arch/s390/mm/hugetlbpage.c               |   2 +-
 arch/sh/mm/hugetlbpage.c                 |   2 +-
 arch/sparc/mm/hugetlbpage.c              |   2 +-
 arch/x86/include/asm/pgtable.h           |  28 +++
 fs/dax.c                                 |  10 +-
 fs/hugetlbfs/inode.c                     |  15 +-
 fs/proc/task_mmu.c                       |  14 +-
 fs/userfaultfd.c                         |  80 +++++--
 include/asm-generic/hugetlb.h            |  10 +
 include/asm-generic/pgtable_uffd.h       |   3 +
 include/linux/huge_mm.h                  |   3 +-
 include/linux/hugetlb.h                  |  47 +++-
 include/linux/mm.h                       |  50 +++-
 include/linux/mm_inline.h                |  43 ++++
 include/linux/mmu_notifier.h             |   1 +
 include/linux/shmem_fs.h                 |   5 +-
 include/linux/swapops.h                  |  41 +++-
 include/linux/userfaultfd_k.h            |  37 +++
 include/uapi/linux/userfaultfd.h         |   3 +-
 mm/huge_memory.c                         |  36 ++-
 mm/hugetlb.c                             | 174 +++++++++++---
 mm/khugepaged.c                          |  14 +-
 mm/memcontrol.c                          |   2 +-
 mm/memory.c                              | 277 ++++++++++++++++++-----
 mm/migrate.c                             |   2 +-
 mm/mprotect.c                            |  63 +++++-
 mm/mremap.c                              |   2 +-
 mm/page_vma_mapped.c                     |   6 +-
 mm/rmap.c                                |   8 +
 mm/shmem.c                               |  39 +++-
 mm/truncate.c                            |  17 +-
 mm/userfaultfd.c                         |  37 +--
 tools/testing/selftests/vm/userfaultfd.c |  14 +-
 38 files changed, 881 insertions(+), 223 deletions(-)

Comments

Peter Xu Jan. 29, 2021, 10:49 p.m. UTC | #1
On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> 
> PS. Note that there's a known issue [0] with tlb against uffd-wp/soft-dirty in
> general and Nadav is working on it.  It may or may not directly affect
> shmem/hugetlbfs since there're no COW on shared mappings normally.  Private
> shmem could hit, but still that's another problem to solve in general, and this
> RFC is majorly to see whether there's any objection on the concept of the idea
> specific to uffd-wp on shmem/hugetlbfs.
> 
> The whole series can also be found online [1].
> 
> The major comment I'd like to get is on the new idea of swap special pte.  That
> comes from suggestions from both Hugh and Andrea and I appreciated a lot for
> those discussions.
> 
> In short, it's a new type of pte that doesn't exist in the past, while used in
> file-backed memories to persist information across ptes being erased (but the
> page cache could still exist, for example, so in the next page fault we can
> reload the page cache with that specific information when necessary).
> 
> I'm copy-pasting some commit message from the patch "mm/swap: Introduce the
> idea of special swap ptes", where uffd-wp becomes the first user of it:
> 
>     We used to have special swap entries, like migration entries, hw-poison
>     entries, device private entries, etc.
> 
>     Those "special swap entries" reside in the range that they need to be at least
>     swap entries first, and their types are decided by swp_type(entry).
> 
>     This patch introduces another idea called "special swap ptes".
> 
>     It's very easy to get confused against "special swap entries", but a speical
>     swap pte should never contain a swap entry at all.  It means, it's illegal to
>     call pte_to_swp_entry() upon a special swap pte.
> 
>     Make the uffd-wp special pte to be the first special swap pte.
> 
>     Before this patch, is_swap_pte()==true means one of the below:
> 
>        (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
>              example, when an anonymous page got swapped out.
> 
>        (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
>              example, a migration entry, a hw-poison entry, etc.
> 
>     After this patch, is_swap_pte()==true means one of the below, where case (b) is
>     added:
> 
>      (a) The pte contains a swap entry.
> 
>        (a.1) The pte has a normal swap entry (non_swap_entry()==false).  For
>              example, when an anonymous page got swapped out.
> 
>        (a.2) The pte has a special swap entry (non_swap_entry()==true).  For
>              example, a migration entry, a hw-poison entry, etc.
> 
>      (b) The pte does not contain a swap entry at all (so it cannot be passed
>          into pte_to_swp_entry()).  For example, uffd-wp special swap pte.
> 
> Hugetlbfs needs similar thing because it's also file-backed.  I directly reused
> the same special pte there, though the shmem/hugetlb change on supporting this
> new pte is different since they don't share code path a lot.

Hugh & Mike,

Would either of you have comments/concerns on the high-level design of this
series?

It would be great to know them, especially any major objection, before moving
on to a non-RFC version.

Thanks,
Mike Kravetz Feb. 5, 2021, 9:53 p.m. UTC | #2
On 1/29/21 2:49 PM, Peter Xu wrote:
> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
...
> Huge & Mike,
> 
> Would any of you have comment/concerns on the high-level design of this series?
> 
> It would be great to know it, especially major objection, before move on to an
> non-rfc version.

My apologies for not looking at this sooner.  Even now, I have only taken
a very brief look at the hugetlbfs patches.

Coincidentally, I am working on the 'BUG' that soft dirty does not work for
hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
as here and in Axel's uffd minor fault code.

No objections to the overall approach based on my quick look.

I'll try to take a closer look at the areas where efforts overlap.
Hugh Dickins Feb. 5, 2021, 10:21 p.m. UTC | #3
On Fri, 29 Jan 2021, Peter Xu wrote:
> 
> Huge & Mike,
> 
> Would any of you have comment/concerns on the high-level design of this series?
> 
> It would be great to know it, especially major objection, before move on to an
> non-rfc version.

Seeing Mike's update prompts me to speak up: I have been looking, and
will continue to look through it - will report when done; but find I've
been making very little forward progress from one day to the next.

It is very confusing, inevitably; but you have done an *outstanding*
job on acknowledging the confusion, and commenting it in great detail.

Hugh
Peter Xu Feb. 6, 2021, 2:36 a.m. UTC | #4
On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> On 1/29/21 2:49 PM, Peter Xu wrote:
> > On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> ...
> > Huge & Mike,
> > 
> > Would any of you have comment/concerns on the high-level design of this series?
> > 
> > It would be great to know it, especially major objection, before move on to an
> > non-rfc version.
> 
> My apologies for not looking at this sooner.  Even now, I have only taken
> a very brief look at the hugetlbfs patches.
> 
> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
> set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
> as here and in Axel's uffd minor fault code.

Interesting to know that we reach, and need, something common from different
directions, especially when it all happens at about the same time. :)

Is there a real "BUG" behind what you mentioned?  I'd be glad to read about it
if there is a link or something.

> 
> No objections to the overall approach based on my quick look.

Thanks for having a look.

So for hugetlb, one major thing is indeed the pmd sharing part, on which we
seem to have very good consensus.

The other thing that I'd love to get some comments on is a topic shared with
shmem: for a file-backed memory type, uffd-wp needs a consolidated way to
record wr-protect information even if the pgtable entries were flushed.  That
comes from a fundamental difference between anonymous and file-backed memory:
anonymous pages keep all info in the pgtable entry, but file-backed memory does
not, e.g., pgtable entries can be dropped at any time as long as the page cache
is there.

I then went to look at how soft-dirty handles this issue, and there's actually
a paragraph about it in the soft-dirty documentation:

        While in most cases tracking memory changes by #PF-s is more than enough
        there is still a scenario when we can lose soft dirty bits -- a task
        unmaps a previously mapped memory region and then maps a new one at
        exactly the same place. When unmap is called, the kernel internally
        clears PTE values including soft dirty bits. To notify user space
        application about such memory region renewal the kernel always marks
        new memory regions (and expanded regions) as soft dirty.

I feel like it just means soft-dirty currently allows false positives: we could
have the soft dirty bit set even if the page is clean.  And that's what this
series wants to avoid: it uses the new concept called "special swap pte" to
persist that information even for file-backed memory.  It is all about avoiding
those false positives.

> 
> I'll try to take a closer look at the areas where efforts overlap.

I dumped the above in the hope that it helps the reviews a little, but if it
doesn't, I totally agree we can focus on the overlapping part.  And I'd be more
than glad to read your work too, so I can understand more about what you're
doing on the hugetlb soft dirty bug; I do feel uffd-wp serves goals similar to
soft-dirty's, so we could share a lot of common knowledge there. :)

Thanks again!
Peter Xu Feb. 6, 2021, 2:47 a.m. UTC | #5
On Fri, Feb 05, 2021 at 02:21:47PM -0800, Hugh Dickins wrote:
> On Fri, 29 Jan 2021, Peter Xu wrote:
> > 
> > Huge & Mike,
> > 
> > Would any of you have comment/concerns on the high-level design of this series?
> > 
> > It would be great to know it, especially major objection, before move on to an
> > non-rfc version.
> 
> Seeing Mike's update prompts me to speak up: I have been looking, and
> will continue to look through it - will report when done; but find I've
> been making very little forward progress from one day to the next.
> 
> It is very confusing, inevitably; but you have done an *outstanding*
> job on acknowledging the confusion, and commenting it in great detail.

I'm honored to receive such an evaluation, thanks Hugh!

As a quick summary - what I did in this series is mostly what you suggested:
using swp_type==1 && swp_offset==0 as a special pte, so the swap code can trap
it.  The only difference is that "swp_type==1 && swp_offset==0" still uses
valid swp_entry address space, so I introduced the "special swap pte" idea
hoping to make it clearer, which is also based on Andrea's suggestion.  I hope
I didn't make it even worse. :)

It's just that I don't want this to be an idea that "only works for uffd-wp".
What I'm wondering is whether we can provide a common way to keep some records
in pgtable entries that point to file-backed memory.  Say, currently for
file-backed memory we can only have either a valid pte (RO or RW) or a none
pte.  So maybe we could provide a way to start using the rest of the pte
address space that we haven't used yet.

Please take your time reviewing the series.  Any future comments would be very
welcome.

Thanks,
Mike Kravetz Feb. 9, 2021, 7:29 p.m. UTC | #6
On 2/5/21 6:36 PM, Peter Xu wrote:
> On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
>> On 1/29/21 2:49 PM, Peter Xu wrote:
>>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
>>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
>> ...
>>> Huge & Mike,
>>>
>>> Would any of you have comment/concerns on the high-level design of this series?
>>>
>>> It would be great to know it, especially major objection, before move on to an
>>> non-rfc version.
>>
>> My apologies for not looking at this sooner.  Even now, I have only taken
>> a very brief look at the hugetlbfs patches.
>>
>> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
>> hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
>> set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
>> as here and in Axel's uffd minor fault code.
> 
> Interesting to know that we'll reach and need something common from different
> directions, especially when they all mostly happen at the same time. :)
> 
> Is there a real "BUG" that you mentioned?  I'd be glad to read about it if
> there is a link or something.
> 

Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
hugetlb was mostly overlooked when soft dirty support was added.  A thread
mostly from me is at:
lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@oracle.com
I am close to sending out an RFC, but keep getting distracted.

>> No objections to the overall approach based on my quick look.
> 
> Thanks for having a look.
> 
> So for hugetlb one major thing is indeed about the pmd sharing part, which
> seems that we've got very good consensus on.

Yes

> The other thing that I'd love to get some comment would be a shared topic with
> shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> to record wr-protect information even if the pgtable entries were flushed.
> That comes from a fundamental difference between anonymous and file-backed
> memory in that anonymous pages keep all info in the pgtable entry, but
> file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> long as page cache is there.

Sorry, but I can not recall this difference for hugetlb pages.  What operations
lead to flushing of pagetable entries?  It would need to be something other
than unmap as it seems we want to lose the information in unmap IIUC.

> I goes to look at soft-dirty then regarding this issue, and there's actually a
> paragraph about it:
> 
>         While in most cases tracking memory changes by #PF-s is more than enough
>         there is still a scenario when we can lose soft dirty bits -- a task
>         unmaps a previously mapped memory region and then maps a new one at
>         exactly the same place. When unmap is called, the kernel internally
>         clears PTE values including soft dirty bits. To notify user space
>         application about such memory region renewal the kernel always marks
>         new memory regions (and expanded regions) as soft dirty.
> 
> I feel like it just means soft-dirty currently allows false positives: we could
> have set the soft dirty bit even if the page is clean.  And that's what this
> series wanted to avoid: it used the new concept called "swap special pte" to
> persistent that information even for file-backed memory.  That all goes for
> avoiding those false positives.

Yes, I have seen this with soft dirty.  It really does not seem right.  When
you first create a mapping, even before faulting in anything the vma is marked
VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.
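
The false positive described here can be modelled in a few lines of userspace
C.  This is only a toy: struct toy_vma and the toy_*() helpers are made-up
names, and the VM_SOFTDIRTY value is arbitrary.  The one behaviour it borrows
from the kernel is that a page is reported soft-dirty if either its pte bit or
the VMA-level VM_SOFTDIRTY flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define VM_SOFTDIRTY  0x1u  /* toy value, not the kernel's */
#define TOY_NPAGES    4

struct toy_vma {
    unsigned flags;
    bool pte_soft_dirty[TOY_NPAGES];
};

/* A new (or re-created) mapping starts with the VMA flag set and no ptes. */
static inline void toy_mmap(struct toy_vma *vma)
{
    vma->flags = VM_SOFTDIRTY;
    for (size_t i = 0; i < TOY_NPAGES; i++)
        vma->pte_soft_dirty[i] = false;
}

/* Models starting a new tracking round: clear the VMA flag and all pte bits. */
static inline void toy_clear_refs(struct toy_vma *vma)
{
    vma->flags &= ~VM_SOFTDIRTY;
    for (size_t i = 0; i < TOY_NPAGES; i++)
        vma->pte_soft_dirty[i] = false;
}

/* A write fault marks only the touched page. */
static inline void toy_write(struct toy_vma *vma, size_t idx)
{
    assert(idx < TOY_NPAGES);
    vma->pte_soft_dirty[idx] = true;
}

/* What userspace sees: the VMA flag makes every page look dirty. */
static inline bool toy_reported_dirty(const struct toy_vma *vma, size_t idx)
{
    assert(idx < TOY_NPAGES);
    return (vma->flags & VM_SOFTDIRTY) || vma->pte_soft_dirty[idx];
}
```

Right after toy_mmap() every page is reported dirty although nothing was
written; only after toy_clear_refs() does the report follow actual writes, and
a fresh toy_mmap() at the same place brings the false positives back.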

To be honest, I am not sure you want to try and carry per-process/per-mapping
wp information in the file.  In the comment about soft dirty above, it seems
reasonable that unmapping would clear all soft dirty information.  Also,
unmapping would clear any uffd state/information.
Peter Xu Feb. 9, 2021, 10 p.m. UTC | #7
On Tue, Feb 09, 2021 at 11:29:56AM -0800, Mike Kravetz wrote:
> On 2/5/21 6:36 PM, Peter Xu wrote:
> > On Fri, Feb 05, 2021 at 01:53:34PM -0800, Mike Kravetz wrote:
> >> On 1/29/21 2:49 PM, Peter Xu wrote:
> >>> On Fri, Jan 15, 2021 at 12:08:37PM -0500, Peter Xu wrote:
> >>>> This is a RFC series to support userfaultfd upon shmem and hugetlbfs.
> >> ...
> >>> Huge & Mike,
> >>>
> >>> Would any of you have comment/concerns on the high-level design of this series?
> >>>
> >>> It would be great to know it, especially major objection, before move on to an
> >>> non-rfc version.
> >>
> >> My apologies for not looking at this sooner.  Even now, I have only taken
> >> a very brief look at the hugetlbfs patches.
> >>
> >> Coincidentally, I am working on the 'BUG' that soft dirty does not work for
> >> hugetlbfs.  As you can imagine, there is some overlap in handling of wp ptes
> >> set for soft dirty.  In addition, pmd sharing must be disabled for soft dirty
> >> as here and in Axel's uffd minor fault code.
> > 
> > Interesting to know that we'll reach and need something common from different
> > directions, especially when they all mostly happen at the same time. :)
> > 
> > Is there a real "BUG" that you mentioned?  I'd be glad to read about it if
> > there is a link or something.
> > 
> 
> Sorry, I was referring to a bugzilla bug not a BUG().  Bottom line is that
> hugetlb was mostly overlooked when soft dirty support was added.  A thread
> mostly from me is at:
> lore.kernel.org/r/999775bf-4204-2bec-7c3d-72d81b4fce30@oracle.com
> I am close to sending out a RFC, but keep getting distracted.

Thanks.  Indeed I see no reason to not have hugetlb supported for soft dirty.
Tracking 1G huge pages could be too coarse and heavy, but 2M at least still
seems reasonable.

> 
> >> No objections to the overall approach based on my quick look.
> > 
> > Thanks for having a look.
> > 
> > So for hugetlb one major thing is indeed about the pmd sharing part, which
> > seems that we've got very good consensus on.
> 
> Yes
> 
> > The other thing that I'd love to get some comment would be a shared topic with
> > shmem in that: for a file-backed memory type, uffd-wp needs a consolidated way
> > to record wr-protect information even if the pgtable entries were flushed.
> > That comes from a fundamental difference between anonymous and file-backed
> > memory in that anonymous pages keep all info in the pgtable entry, but
> > file-backed memory is not, e.g., pgtable entries can be dropped at any time as
> > long as page cache is there.
> 
> Sorry, but I can not recall this difference for hugetlb pages.  What operations
> lead to flushing of pagetable entries?  It would need to be something other
> than unmap as it seems we want to lose the information in unmap IIUC.

For hugetlbfs I know of two cases.

One is exactly the huge pmd sharing mentioned above, where we'll drop the
pgtable entries for a specific process but the page cache will still exist.

The other one is hugetlbfs_punch_hole(), where hugetlb_vmdelete_list() is
called before remove_inode_hugepages().  For uffd-wp, there will be a very
small window in which a wr-protected huge page can be written: after the
pgtable entry is flushed but before the page is finally dropped in
remove_inode_hugepages().  In some apps that could cause data loss.

> 
> > I goes to look at soft-dirty then regarding this issue, and there's actually a
> > paragraph about it:
> > 
> >         While in most cases tracking memory changes by #PF-s is more than enough
> >         there is still a scenario when we can lose soft dirty bits -- a task
> >         unmaps a previously mapped memory region and then maps a new one at
> >         exactly the same place. When unmap is called, the kernel internally
> >         clears PTE values including soft dirty bits. To notify user space
> >         application about such memory region renewal the kernel always marks
> >         new memory regions (and expanded regions) as soft dirty.
> > 
> > I feel like it just means soft-dirty currently allows false positives: we could
> > have set the soft dirty bit even if the page is clean.  And that's what this
> > series wanted to avoid: it used the new concept called "swap special pte" to
> > persistent that information even for file-backed memory.  That all goes for
> > avoiding those false positives.
> 
> Yes, I have seen this with soft dirty.  It really does not seem right.  When
> you first create a mapping, even before faulting in anything the vma is marked
> VM_SOFTDIRTY and from the user's perspective all addresses/pages appear dirty.

Right, that doesn't seem optimal.  It is understandable, since dirty info is
indeed tolerant of false positives, so soft-dirty could sidestep the issue that
uffd-wp tries to solve in this series.  It would be great to know whether the
approach in this series could also work for removing those false positives.

> 
> To be honest, I am not sure you want to try and carry per-process/per-mapping
> wp information in the file.

What this series does is try to persist that information in pgtable entries,
rather than in the file (or page cache).  Frankly I can't say whether that's
optimal either, so I'm always open to comments.  So far I think it's a valid
solution, but it's always possible that I missed something important.

> In the comment about soft dirty above, it seems
> reasonable that unmapping would clear all soft dirty information.  Also,
> unmapping would clear any uffd state/information.

Right, unmap should always mean "dropping all information in the ptes".  The
patch below is where we treat it differently:

https://github.com/xzpeter/linux/commit/e958e9ee8d33e9a6602f93cdbe24a0c3614ab5e2

A quick summary of the above patch: only when we unmap or truncate the
hugetlbfs file do we call hugetlb_vmdelete_list() with
ZAP_FLAG_DROP_FILE_UFFD_WP (which means we drop all the information, including
the uffd-wp bit).
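
To make the two zap behaviours concrete, here is a small userspace model.  It
is a sketch under stated assumptions, not the patch itself: the bit layout,
PTE_UFFD_WP_SPECIAL, zap_file_pte() and fault_in_file_pte() are invented for
illustration, and only the flag name ZAP_FLAG_DROP_FILE_UFFD_WP comes from the
series (its value here is arbitrary).  A zap without the flag leaves a marker
behind so that the next fault restores write protection; a zap with the flag
forgets everything.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pte_t;

/* Toy bits, not a real architecture's layout. */
#define PTE_PRESENT          ((pte_t)1 << 0)
#define PTE_WRITE            ((pte_t)1 << 1)
#define PTE_UFFD_WP          ((pte_t)1 << 2)
#define PTE_UFFD_WP_SPECIAL  ((pte_t)1 << 3)  /* non-present marker, hypothetical */

#define ZAP_FLAG_DROP_FILE_UFFD_WP  0x1u  /* name from the series, toy value */

/* Zap one file-backed pte; the page itself stays in the page cache. */
static inline void zap_file_pte(pte_t *ptep, unsigned zap_flags)
{
    bool wp = ((*ptep & PTE_PRESENT) && (*ptep & PTE_UFFD_WP)) ||
              *ptep == PTE_UFFD_WP_SPECIAL;

    if (wp && !(zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP))
        *ptep = PTE_UFFD_WP_SPECIAL;  /* persist the wr-protect information */
    else
        *ptep = 0;                    /* none pte: all information dropped */
}

/* Fault the page back in from the page cache. */
static inline void fault_in_file_pte(pte_t *ptep)
{
    bool wp = (*ptep == PTE_UFFD_WP_SPECIAL);

    assert(!(*ptep & PTE_PRESENT));  /* faults only happen on non-present ptes */
    *ptep = PTE_PRESENT | (wp ? PTE_UFFD_WP : PTE_WRITE);
}
```

In this model a wr-protected pte that is zapped with zap_flags == 0 and then
faulted in again is still wr-protected, whereas after a zap with
ZAP_FLAG_DROP_FILE_UFFD_WP the next fault maps the page writable.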

Thanks,