linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jann Horn <jannh@google.com>
To: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	"Zach O'Keefe" <zokeefe@google.com>,
	 linux-kernel@vger.kernel.org,
	David Hildenbrand <david@redhat.com>,
	 Yang Shi <shy828301@gmail.com>
Subject: Re: [PATCH] mm/khugepaged: Fix ->anon_vma race
Date: Fri, 13 Jan 2023 20:28:59 +0100	[thread overview]
Message-ID: <CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com> (raw)
In-Reply-To: <20230112085649.gvriasb2t5xwmxkm@box.shutemov.name>

[-- Attachment #1: Type: text/plain, Size: 21973 bytes --]

On Thu, Jan 12, 2023 at 9:56 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Wed, Jan 11, 2023 at 02:33:51PM +0100, Jann Horn wrote:
> > If an ->anon_vma is attached to the VMA, collapse_and_free_pmd() requires
> > it to be locked. retract_page_tables() bails out if an ->anon_vma is
> > attached, but does this check before holding the mmap lock (as the comment
> > above the check explains).
> >
> > If we racily merge an existing ->anon_vma (shared with a child process)
> > from a neighboring VMA, subsequent rmap traversals on pages belonging to
> > the child will be able to see the page tables that we are concurrently
> > removing while assuming that nothing else can access them.
> >
> > Repeat the ->anon_vma check once we hold the mmap lock to ensure that there
> > really is no concurrent page table access.
> >
> > Reported-by: Zach O'Keefe <zokeefe@google.com>
> > Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Jann Horn <jannh@google.com>
> > ---
> > zokeefe@ pointed out to me that the current code (after my last round of patches)
> > can hit a lockdep assert by racing, and after staring at it a bit I've
> > convinced myself that this is a real, preexisting bug.
> > (I haven't written a reproducer for it though. One way to hit it might be
> > something along the lines of:
> >
> >  - set up a process A with a private-file-mapping VMA V1
> >  - let A fork() to create process B, thereby copying V1 in A to V1' in B
> >  - let B extend the end of V1'
> >  - let B put some anon pages into the extended part of V1'
>
> At this point V1' gets it's own ->anon_vma, not connected to V1, right?

Yes, true, it doesn't work in a child process because
reusable_anon_vma() refuses to reuse the anon_vma of an adjacent VMA
if that VMA is attached to more than one VMA. So to actually make it
work, you have to rearrange things a bit such that this happens in the
parent process, so that prepare_anon_vma() will reuse a neighboring
anon_vma via reusable_anon_vma().

Sorry, my description of the race was slightly off. See the attached
code for how to actually do it. Basically the trick is to use mremap()
to take one private file VMA with an anon_vma and turn it into two
private file VMAs ("mapping1" and "mapping2" in my code) with the same
anon_vma that map the same file range. Put an anonymous page into
mapping1. Then you can replace the first 2 MiB of "mapping2" with
another VMA that is not mergeable into mapping2, is
anon_vma-compatible with mapping2, has no anon_vma, and looks
THP-collapsible. And then you can, while khugepaged is in the race
window, attach the new VMA to the existing anon_vma and trigger an
rmap walk on the anonymous page.

> >  - let A map a new private-file-mapping VMA V2 directly behind V1, without
> >    an anon_vma
> > [race begins here]
> >   - in A's thread 1: begin retract_page_tables() on V2, run through first
> >     ->anon_vma check
> >   - in A's thread 2: run __anon_vma_prepare() on V2 and ensure that it
> >     merges the anon_vma of V1 (which implies V1 and V2 must be mapping the
> >     same file at compatible offsets)
> >   - in B: trigger rmap traversal on anon page in V1'
>
> I don't follow the race. rmap on V1' will not reach V1.
>
> >  mm/khugepaged.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5cb401aa2b9d..0bfed37f3a3b 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1644,7 +1644,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >                * has higher cost too. It would also probably require locking
> >                * the anon_vma.
> >                */
> > -             if (vma->anon_vma) {
> > +             if (READ_ONCE(vma->anon_vma)) {
> >                       result = SCAN_PAGE_ANON;
> >                       goto next;
> >               }
>
> This makes perfect sense. At least for readability. But I think
> false-negative should not lead to bad results.
>
> > @@ -1672,6 +1672,18 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> >               result = SCAN_PTE_MAPPED_HUGEPAGE;
> >               if ((cc->is_khugepaged || is_target) &&
> >                   mmap_write_trylock(mm)) {
> > +                     /*
> > +                      * Re-check whether we have an ->anon_vma, because
> > +                      * collapse_and_free_pmd() requires that either no
> > +                      * ->anon_vma exists or the anon_vma is locked.
> > +                      * We already checked ->anon_vma above, but that check
> > +                      * is racy because ->anon_vma can be populated under the
> > +                      * mmap lock in read mode.
> > +                      */
> > +                     if (vma->anon_vma) {
> > +                             result = SCAN_PAGE_ANON;
> > +                             goto unlock_next;
> > +                     }
>
> This is totally wrong direction. Or I don't understand the race.
>
> At this point we already paid nearly all price of of pagetable retraction.
> I don't see any correctness reason to stop here, except for the assert.

This condition can only be hit with a narrow race, since vma->anon_vma
was NULL a moment ago. So this branch will only run very rarely, which
means it is unlikely to have an effect on performance.

> I think lockdep assert in collapse_and_free_pmd() is wrong and has to be
> dropped.

No, that lockdep assert has to be there. Page table traversal is
allowed under any one of the mmap lock, the anon_vma lock (if the VMA
is associated with an anon_vma), and the mapping lock (if the VMA is
associated with a mapping); and so to be able to remove page tables,
we must hold all three of them. Even if we couldn't figure out a way
to actually exploit a lack of locking here, we still shouldn't be
skipping locking because of that. Unless we have clearly defined API
contracts such as "you are allowed to traverse page tables if you hold
one of these three locks", it is impossible to maintain code or to
make it robust and secure. Even if you could show that the lack of
locking can't lead to a crash in the current kernel, that still
wouldn't be a good reason to skip the locking. Without something at
least resembling an API contract, an innocent-looking change in
another part of the kernel can turn things in a completely different
part of the kernel into dangerous bugs, and it is impossible to change
anything while feeling confident that you haven't broken something
halfway across the kernel. (Or at least that's what it feels like to
me when I try to edit kernel code.)

Anyway, here are proper instructions for reproducing the bug.
First build the kernel with the attached patch
HACK-thp-anonvma-logging-and-delay.diff, which adds a bunch of debug
logging and injects extra timing delays to make the race much easier
to hit.
Then build with CONFIG_KASAN=y and
CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y and lockdep, and run that
kernel.
Run "echo force > /sys/kernel/mm/transparent_hugepage/shmem_enabled"
to enable shmem THP.
Compile the attached reproducer thp_newbug.c ("gcc -o thp_newbug thp_newbug.c").
Let the reproducer run (it'll take a few seconds to run because it has
to wait for khugepaged and stuff). Then you should see a lockdep
assert failure followed by an ASAN splat due to page table UAF in
dmesg:


[ 1076.064348][  T116] khugepaged: retract_page_tables:
retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80
[ 1076.066331][  T116] khugepaged: retract_page_tables: running on vma
at 0x1fe000 in mm=ffff88817b4acc80
[ 1076.068464][  T116] khugepaged: retract_page_tables: about to call
collapse_and_free_pmd()
[ 1076.071078][  T116] khugepaged: retract_page_tables: done with
collapse_and_free_pmd()
[ 1076.072360][  T116] khugepaged: retract_page_tables: sleeping
before mmap lock...
[ 1076.135883][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x23ff000
[ 1076.147212][  T797] map_pte:   ... begin delay...
[ 1079.074924][  T116] khugepaged: retract_page_tables: done sleeping
[ 1079.075874][  T116] khugepaged: retract_page_tables: about to call
collapse_and_free_pmd()
[ 1079.076964][  T116] ------------[ cut here ]------------
[ 1079.077663][  T116] WARNING: CPU: 14 PID: 116 at
mm/khugepaged.c:1406 collapse_and_free_pmd+0x364/0x420
[ 1079.078940][  T116] CPU: 14 PID: 116 Comm: khugepaged Not tainted
6.2.0-rc3-00060-gc757fc92a3f7-dirty #257
[ 1079.080191][  T116] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1079.081462][  T116] RIP: 0010:collapse_and_free_pmd+0x364/0x420
[ 1079.082290][  T116] Code: ae 7e e6 03 85 d2 0f 84 b3 fd ff ff 4c 89
ff e8 52 86 fe ff 49 8b 3f 31 f6 48 83 c7 78 e8 34 56 e3 01 85 c0 0f
85 95 fd ff ff <0f> 0b e9 8e fd ff ff e8 d0 13 e8 ff ba ff ff ff ff be
26 00 00 00
[ 1079.084735][  T116] RSP: 0018:ffffc900009b79b0 EFLAGS: 00010246
[ 1079.085527][  T116] RAX: 0000000000000000 RBX: ffff88817b4acc80
RCX: 0000000000000001
[ 1079.089956][  T116] RDX: 0000000000000000 RSI: ffffffffa5cab020
RDI: ffffffffa5e382e0
[ 1079.090967][  T116] RBP: 1ffff92000136f36 R08: 0000000000000000
R09: ffff88817b4acd6f
[ 1079.091942][  T116] R10: ffffed102f6959ad R11: 0000000000000001
R12: ffff888117b0a3e8
[ 1079.092953][  T116] R13: 0000000002400000 R14: ffff888171ad1090
R15: ffff8881149b1100
[ 1079.094066][  T116] FS:  0000000000000000(0000)
GS:ffff8881f7100000(0000) knlGS:0000000000000000
[ 1079.095387][  T116] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1079.096379][  T116] CR2: 000055c6faaf9578 CR3: 0000000111738005
CR4: 0000000000370ee0
[ 1079.097695][  T116] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[ 1079.098909][  T116] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[ 1079.100117][  T116] Call Trace:
[ 1079.100668][  T116]  <TASK>
[...]
[ 1079.106003][  T116]  collapse_file.cold+0x3a4/0x400
[...]
[ 1079.110117][  T116]  hpage_collapse_scan_file+0x72a/0xad0
[ 1079.110856][  T116]  ? lock_release+0x22e/0x460
[ 1079.111483][  T116]  ? __pfx_hpage_collapse_scan_file+0x10/0x10
[ 1079.112289][  T116]  khugepaged+0x66b/0xbf0
[...]
[ 1079.118911][  T116]  kthread+0x17e/0x1b0
[ 1079.119440][  T116]  ? __pfx_kthread+0x10/0x10
[ 1079.120012][  T116]  ret_from_fork+0x2c/0x50
[ 1079.120593][  T116]  </TASK>
[ 1079.120971][  T116] irq event stamp: 12259
[ 1079.121520][  T116] hardirqs last  enabled at (12271):
[<ffffffffa388204e>] __up_console_sem+0x5e/0x70
[ 1079.122735][  T116] hardirqs last disabled at (12282):
[<ffffffffa3882033>] __up_console_sem+0x43/0x70
[ 1079.123950][  T116] softirqs last  enabled at (12018):
[<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140
[ 1079.125151][  T116] softirqs last disabled at (12013):
[<ffffffffa37cd108>] __irq_exit_rcu+0xd8/0x140
[ 1079.126342][  T116] ---[ end trace 0000000000000000 ]---
[ 1079.140447][  T116] khugepaged: retract_page_tables: done with
collapse_and_free_pmd()
[ 1079.143002][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.146490][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4a9980 because anon_vma
[ 1079.159601][  T116] khugepaged: retract_page_tables:
retract_page_tables(), cc->is_khugepaged=1, target_mm=ffff88817b4acc80
[ 1079.161186][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.162648][  T116] khugepaged: retract_page_tables: skipping vma
at 0x1400000 in mm=ffff88817b4a9980 because anon_vma
[ 1079.164000][  T116] khugepaged: retract_page_tables: skipping vma
at 0x2600000 in mm=ffff88817b4acc80 because anon_vma
[ 1079.649601][  T797] map_pte:   ... end delay
[ 1079.650358][  T797]
==================================================================
[ 1079.651257][  T797] BUG: KASAN: use-after-free in
page_vma_mapped_walk+0x5f3/0xdf0
[ 1079.652120][  T797] Read of size 8 at addr ffff888171e54000 by task TEST1/797
[ 1079.652915][  T797]
[ 1079.653179][  T797] CPU: 13 PID: 797 Comm: TEST1 Tainted: G
W          6.2.0-rc3-00060-gc757fc92a3f7-dirty #257
[ 1079.654410][  T797] Hardware name: QEMU Standard PC (i440FX + PIIX,
1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 1079.655534][  T797] Call Trace:
[ 1079.655907][  T797]  <TASK>
[ 1079.656268][  T797]  dump_stack_lvl+0x5b/0x77
[ 1079.656812][  T797]  print_report+0x168/0x458
[...]
[ 1079.659003][  T797]  kasan_report+0xbb/0xf0
[...]
[ 1079.662829][  T797]  page_vma_mapped_walk+0x5f3/0xdf0
[ 1079.663474][  T797]  folio_referenced_one+0x1b3/0x3e0
[...]
[ 1079.667670][  T797]  rmap_walk_anon+0x137/0x2b0
[ 1079.668252][  T797]  folio_referenced+0x298/0x2c0
[...]
[ 1079.673519][  T797]  shrink_folio_list+0xd30/0x1660
[...]
[ 1079.679030][  T797]  reclaim_folio_list.constprop.0+0xcf/0x200
[...]
[ 1079.681772][  T797]  reclaim_pages+0x154/0x1c0
[...]
[ 1079.683727][  T797]  madvise_cold_or_pageout_pte_range+0xf64/0x1900
[...]
[ 1079.686125][  T797]  walk_pgd_range+0x68a/0xcb0
[...]
[ 1079.687393][  T797]  __walk_page_range+0x25d/0x260
[...]
[ 1079.689327][  T797]  walk_page_range+0x253/0x2b0
[...]
[ 1079.691922][  T797]  madvise_pageout+0x1ab/0x2a0
[...]
[ 1079.694355][  T797]  do_madvise.part.0+0x3a3/0x1370
[...]
[ 1079.701143][  T797]  __x64_sys_madvise+0x7b/0xa0
[ 1079.701719][  T797]  do_syscall_64+0x3a/0x90
[ 1079.702207][  T797]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 1079.702863][  T797] RIP: 0033:0x7f18ae2dfd07
[ 1079.703352][  T797] Code: ff ff ff ff c3 48 8b 15 87 61 0c 00 f7 d8
64 89 02 b8 ff ff ff ff eb bc 66 2e 0f 1f 84 00 00 00 00 00 90 b8 1c
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 59 61 0c 00 f7 d8
64 89 01 48
[ 1079.705469][  T797] RSP: 002b:00007fffaaf31198 EFLAGS: 00000206
ORIG_RAX: 000000000000001c
[ 1079.706382][  T797] RAX: ffffffffffffffda RBX: 0000000000000000
RCX: 00007f18ae2dfd07
[ 1079.707244][  T797] RDX: 0000000000000015 RSI: 0000000000001000
RDI: 0000000001400000
[ 1079.710207][  T797] RBP: 00007fffaaf322c0 R08: 00007f18ae3a8800
R09: 00007f18ae3ad540
[ 1079.711092][  T797] R10: 00007f18ae2d6274 R11: 0000000000000206
R12: 00005577fac45220
[ 1079.712021][  T797] R13: 00007fffaaf323a0 R14: 0000000000000000
R15: 0000000000000000
[ 1079.712883][  T797]  </TASK>
[ 1079.713219][  T797]
[ 1079.713495][  T797] The buggy address belongs to the physical page:
[ 1079.714273][  T797] page:000000009614ae4b refcount:0 mapcount:0
mapping:0000000000000000 index:0x0 pfn:0x171e54
[ 1079.715595][  T797] flags: 0x8000000000000000(zone=2)
[ 1079.716275][  T797] raw: 8000000000000000 ffffea0004562e48
ffff8881f713d810 0000000000000000
[ 1079.717389][  T797] raw: 0000000000000000 ffff8881183f50c0
00000000ffffffff 0000000000000000
[ 1079.718508][  T797] page dumped because: kasan: bad access detected
[ 1079.719358][  T797] page_owner tracks the page as freed
[ 1079.720053][  T797] page last allocated via order 0, migratetype
Unmovable, gfp_mask 0x400dc0(GFP_KERNEL_ACCOUNT|__GFP_ZERO), pid 796,
tgid 796 (thp_newbug), ts 1065542415232, free_ts 1079140429026
[ 1079.722276][  T797]  get_page_from_freelist+0xb0a/0x1300
[ 1079.722872][  T797]  __alloc_pages+0x17e/0x3c0
[ 1079.723376][  T797]  pte_alloc_one+0x18/0xe0
[ 1079.723885][  T797]  __handle_mm_fault+0x15fa/0x1660
[ 1079.724474][  T797]  handle_mm_fault+0x18c/0x4e0
[ 1079.724995][  T797]  do_user_addr_fault+0x232/0x7f0
[ 1079.725551][  T797]  exc_page_fault+0x5d/0xe0
[ 1079.726045][  T797]  asm_exc_page_fault+0x22/0x30
[ 1079.726577][  T797] page last free stack trace:
[ 1079.727089][  T797]  free_pcp_prepare+0x276/0x530
[ 1079.727680][  T797]  free_unref_page+0x37/0x250
[ 1079.728193][  T797]  collapse_and_free_pmd+0x2a3/0x420
[ 1079.728773][  T797]  collapse_file.cold+0x3a4/0x400
[ 1079.729321][  T797]  hpage_collapse_scan_file+0x72a/0xad0
[ 1079.729939][  T797]  khugepaged+0x66b/0xbf0
[ 1079.730426][  T797]  kthread+0x17e/0x1b0
[ 1079.730874][  T797]  ret_from_fork+0x2c/0x50
[ 1079.731393][  T797]
[ 1079.731674][  T797] Memory state around the buggy address:
[ 1079.732284][  T797]  ffff888171e53f00: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[ 1079.733154][  T797]  ffff888171e53f80: 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00
[ 1079.734032][  T797] >ffff888171e54000: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.734932][  T797]                    ^
[ 1079.735380][  T797]  ffff888171e54080: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.736249][  T797]  ffff888171e54100: ff ff ff ff ff ff ff ff ff
ff ff ff ff ff ff ff
[ 1079.737119][  T797]
==================================================================
[ 1079.738129][  T797] Disabling lock debugging due to kernel taint
[ 1079.738897][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4acc80 pvmw->vma->vm_start=0x1400000
[ 1079.744091][  T797] map_pte:   ... continuing
[ 1079.744621][  T797] map_pte: executing under TEST1
pvmw->vma->vm_mm==ffff88817b4a9980 pvmw->vma->vm_start=0x1400000
[ 1079.745802][  T797] map_pte:   ... continuing


Here's the corresponding reproducer output (including dumps of the
anon_vma hierarchies attached to the VMAs, thanks to the extra debug
info I added to procfs in the attached diff):


user@vm:~/thp/thp_newbug$ ./thp_newbug
Rss = 2048 kB
[...]
Rss = 2048 kB
Rss = 2048 kB
Rss = 2048 kB
Rss = 2048 kB
Rss = 0 kB
khugepaged is about to write_trylock! associating anon_vma...
anon_vma associated, telling child to hit rmap now...
child hitting rmap...
child done with rmap

=========== PARENT ===========
001fe000-00400000 r--p 003fe000 00:01 13313
  /memfd:memfd (deleted)
Size:               2056 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd mr mw me dc hg
01400000-01800000 rw-p 00400000 00:01 13313
  /memfd:memfd (deleted)
Size:               4096 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
023ff000-02600000 rwxp 003ff000 00:01 13313
  /memfd:memfd (deleted)
Size:               2052 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   4 kB
Pss:                   4 kB
Pss_Dirty:             4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr ex mr mw me dc ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
02600000-02800000 rw-p 00600000 00:01 13313
  /memfd:memfd (deleted)
Size:               2048 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   0 kB
Pss:                   0 kB
Pss_Dirty:             0 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         0 kB
Referenced:            0 kB
Anonymous:             0 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me dc ac
AnonVMAs:
 # [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
[...]
=========== CHILD ===========
01400000-01800000 rw-p 00400000 00:01 13313
  /memfd:memfd (deleted)
Size:               4096 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Rss:                   4 kB
Pss:                   4 kB
Pss_Dirty:             4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            0 kB
Anonymous:             4 kB
LazyFree:              0 kB
AnonHugePages:         0 kB
ShmemPmdMapped:        0 kB
FilePmdMapped:         0 kB
Shared_Hugetlb:        0 kB
Private_Hugetlb:       0 kB
Swap:                  0 kB
SwapPss:               0 kB
Locked:                0 kB
THPeligible:    1
VmFlags: rd wr mr mw me ac
AnonVMAs:
   [  0] ffff8881149b1100 root=ffff8881149b1100 refcount=2 anon_vmas=4
num_children=2 num_active_vmas=3 parent=ffff8881149b1100
 #   [  1] ffff8881149b0cc0 root=ffff8881149b1100 refcount=1
anon_vmas=1 num_children=0 num_active_vmas=1 parent=ffff8881149b1100
[...]
user@vm:~/thp/thp_newbug$

[-- Attachment #2: HACK-thp-anonvma-logging-and-delay.diff --]
[-- Type: text/x-patch, Size: 5384 bytes --]

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..865ad026cab3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -856,6 +856,39 @@ static void __show_smap(struct seq_file *m, const struct mem_size_stats *mss,
 	seq_puts(m, " kB\n");
 }
 
+static void anon_vma_dump_recursive(struct seq_file *m,
+				    struct vm_area_struct *vma,
+				    struct anon_vma *parent,
+				    int level)
+{
+	struct anon_vma_chain *avc;
+	int idx = -1;
+	int i;
+
+	list_for_each_entry_reverse(avc, &vma->anon_vma_chain, same_vma) {
+		struct anon_vma *av = avc->anon_vma;
+		struct anon_vma_chain *avc2;
+		unsigned long anon_vma_count = 0;
+
+		idx++;
+		if ((parent == NULL) ? (av != vma->anon_vma->root) :
+		    (av->parent != parent || av == parent))
+			continue;
+
+		anon_vma_interval_tree_foreach(avc2, &av->rb_root, 0, -(pgoff_t)1) {
+			anon_vma_count++;
+		}
+
+		seq_puts(m, (av == vma->anon_vma) ? " # " : "   ");
+		for (i=0; i<level; i++)
+			seq_puts(m, "  ");
+		seq_printf(m, "[%3d] %px root=%px refcount=%d anon_vmas=%lu num_children=%lu num_active_vmas=%lu parent=%px\n",
+			   idx, av, av->root, atomic_read(&av->refcount),
+			   anon_vma_count, av->num_children, av->num_active_vmas, av->parent);
+		anon_vma_dump_recursive(m, vma, av, level+1);
+	}
+}
+
 static int show_smap(struct seq_file *m, void *v)
 {
 	struct vm_area_struct *vma = v;
@@ -881,6 +914,14 @@ static int show_smap(struct seq_file *m, void *v)
 		seq_printf(m, "ProtectionKey:  %8u\n", vma_pkey(vma));
 	show_smap_vma_flags(m, vma);
 
+	/* dump anon_vma information */
+	if (READ_ONCE(vma->anon_vma)) {
+		seq_printf(m, "AnonVMAs:\n");
+		anon_vma_lock_write(vma->anon_vma);
+		anon_vma_dump_recursive(m, vma, NULL, 0);
+		anon_vma_unlock_write(vma->anon_vma);
+	}
+
 	return 0;
 }
 
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cb401aa2b9d..f25e409eb1f1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -19,6 +19,7 @@
 #include <linux/page_table_check.h>
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
+#include <linux/delay.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1619,6 +1620,9 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 	struct vm_area_struct *vma;
 	int target_result = SCAN_FAIL;
 
+	pr_warn("%s: retract_page_tables(), cc->is_khugepaged=%d, target_mm=%px\n",
+		__func__, cc->is_khugepaged, target_mm);
+
 	i_mmap_lock_write(mapping);
 	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 		int result = SCAN_FAIL;
@@ -1645,20 +1649,37 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 		 * the anon_vma.
 		 */
 		if (vma->anon_vma) {
+			pr_warn("%s: skipping vma at 0x%lx in mm=%px because anon_vma\n",
+				__func__, vma->vm_start, vma->vm_mm);
 			result = SCAN_PAGE_ANON;
 			goto next;
 		}
 		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 		if (addr & ~HPAGE_PMD_MASK ||
 		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
+			pr_warn("%s: skipping vma at 0x%lx in mm=%px because align\n",
+				__func__, vma->vm_start, vma->vm_mm);
 			result = SCAN_VMA_CHECK;
 			goto next;
 		}
 		mm = vma->vm_mm;
 		is_target = mm == target_mm && addr == target_addr;
 		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
-		if (result != SCAN_SUCCEED)
+		if (result != SCAN_SUCCEED) {
+			pr_warn("%s: skipping vma at 0x%lx in mm=%px because no PMD\n",
+				__func__, vma->vm_start, vma->vm_mm);
 			goto next;
+		}
+
+		if (cc->is_khugepaged && mm == target_mm && vma->vm_start == 0x2400000-0x1000) {
+			pr_warn("%s: sleeping before mmap lock...\n", __func__);
+			mdelay(3000);
+			pr_warn("%s: done sleeping\n", __func__);
+		} else {
+			pr_warn("%s: running on vma at 0x%lx in mm=%px\n",
+				__func__, vma->vm_start, vma->vm_mm);
+		}
+
 		/*
 		 * We need exclusive mmap_lock to retract page table.
 		 *
@@ -1688,7 +1709,9 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
 				result = SCAN_PTE_UFFD_WP;
 				goto unlock_next;
 			}
+			pr_warn("%s: about to call collapse_and_free_pmd()\n", __func__);
 			collapse_and_free_pmd(mm, vma, addr, pmd);
+			pr_warn("%s: done with collapse_and_free_pmd()\n", __func__);
 			if (!cc->is_khugepaged && is_target)
 				result = set_huge_pmd(vma, addr, pmd, hpage);
 			else
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 93e13fc17d3c..771a0124329b 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -4,6 +4,7 @@
 #include <linux/hugetlb.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/delay.h>
 
 #include "internal.h"
 
@@ -16,6 +17,19 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
 static bool map_pte(struct page_vma_mapped_walk *pvmw)
 {
 	pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+
+	if (strcmp(current->comm, "TEST1") == 0) {
+		pr_warn("%s: executing under TEST1 pvmw->vma->vm_mm==%px pvmw->vma->vm_start=0x%lx\n",
+			__func__, pvmw->vma->vm_mm, pvmw->vma->vm_start);
+		if (current->mm == pvmw->vma->vm_mm || pvmw->vma->vm_start != 0x2400000-0x1000) {
+			pr_warn("%s:   ... continuing\n", __func__);
+		} else {
+			pr_warn("%s:   ... begin delay...\n", __func__);
+			mdelay(3500);
+			pr_warn("%s:   ... end delay\n", __func__);
+		}
+	}
+
 	if (!(pvmw->flags & PVMW_SYNC)) {
 		if (pvmw->flags & PVMW_MIGRATION) {
 			if (!is_swap_pte(*pvmw->pte))

[-- Attachment #3: thp_newbug.c --]
[-- Type: text/x-csrc, Size: 5750 bytes --]

#define _GNU_SOURCE
#include <err.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

// NOTE: assumes hugepage support is set to madvise-only and shmem thp is set
// to "force".
int main(void) {
#if 0
  // for better correlation of events in dmesg
  FILE *kmsg_file = fopen("/dev/kmsg", "w");
  if (!kmsg_file)
    errx(1, "open kmsg");
  setlinebuf(kmsg_file);
#else
  // for running unprivileged
  #define kmsg_file stderr
#endif

  // create a memfd containing 8 MiB of data, with some extra back-and-forth
  // to avoid having transhuge shmem pages at this point even in "force" mode
  int fd = SYSCHK(syscall(__NR_memfd_create, "memfd", 0));
  for (int i=0; i<4*512; i++) {
    char dummy_page[0x1000];
    memset(dummy_page, 'P', 0x1000);
    assert(SYSCHK(pwrite(fd, dummy_page, 0x1000, i*0x1000)) == 0x1000);
  }
  SYSCHK(fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, 0x4000, 0x1000));
  SYSCHK(fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, 0x204000, 0x1000));
  SYSCHK(fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, 0x404000, 0x1000));
  SYSCHK(fallocate(fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE, 0x604000, 0x1000));
  for (int i=0; i<4*512; i++) {
    char dummy_page[0x1000];
    memset(dummy_page, 'P', 0x1000);
    assert(SYSCHK(pwrite(fd, dummy_page, 0x1000, i*0x1000)) == 0x1000);
  }

  // create trigger_vma (2 MiB, offset 4 MiB)
  char *trigger_vma = SYSCHK(mmap((void*)0x200000 - 0x2000, 0x200000 + 0x2000, PROT_READ,
                                  MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0x400000 - 0x2000)) + 0x2000;
  SYSCHK(madvise(trigger_vma-0x2000, 0x200000+0x2000, MADV_DONTFORK));

  // create mapping1 (8 MiB, full file)
  char *mapping1 = SYSCHK(mmap((void*)0x1000000, 0x800000, PROT_READ|PROT_WRITE,
                               MAP_PRIVATE|MAP_FIXED_NOREPLACE, fd, 0));

  // create anon page and anon_vma in mapping1 (at offset 4 MiB)
  *(volatile char *)(mapping1 + 0x400000) = 1;

  // make it so that we end up with two VMAs that share an anon_vma and have the
  // same offset range.
  // start with a placeholder mapping, then mremap over it.
  char *mapping2 = SYSCHK(mmap((void*)0x2000000, 0x800000, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED_NOREPLACE,
                               -1, 0));
  // move
  SYSCHK(mremap(mapping1, 0x400000, 0x400000, MREMAP_MAYMOVE|MREMAP_FIXED,
                mapping2));
  // expand
  munmap(mapping2+0x400000, 0x400000);
  SYSCHK(mremap(mapping2, 0x400000, 0x800000, 0));
  // trim
  SYSCHK(munmap(mapping2, 0x400000));
  // replace first half of mapping2 with a new vma, not mergeable but
  // anonvma-mergeable.
  // with an extra page in front to make it go first in the anon_vma interval
  // tree.
  SYSCHK(mmap(mapping2+0x400000-0x1000, 0x1000+0x200000, PROT_READ|PROT_WRITE|PROT_EXEC,
              MAP_PRIVATE|MAP_FIXED, fd, 0x400000-0x1000));
  // don't inherit either half of mapping2
  SYSCHK(madvise(mapping2+0x400000-0x1000, 0x400000+0x1000, MADV_DONTFORK));

  // inherit mapping1 in child
  int child_cont_fd = SYSCHK(eventfd(0, EFD_SEMAPHORE));
  pid_t child = SYSCHK(fork());
  if (child == 0) {
    SYSCHK(prctl(PR_SET_PDEATHSIG, SIGKILL));
    if (getppid() == 1)
      exit(1);

    eventfd_t dummy_val;
    SYSCHK(eventfd_read(child_cont_fd, &dummy_val));

    fprintf(kmsg_file, "child hitting rmap...\n");
    prctl(PR_SET_NAME, "TEST1");
    SYSCHK(madvise(mapping1+0x400000, 0x1000, MADV_PAGEOUT));
    prctl(PR_SET_NAME, "thp_newbug_child");
    fprintf(kmsg_file, "child done with rmap\n");

    sleep(10);
    char cmd[1000];
    sprintf(cmd, "echo; echo =========== PARENT ===========; head -n120 /proc/%d/smaps; echo; echo =========== CHILD ===========; head -n60 /proc/%d/smaps", getppid(), getpid());
    system(cmd);
    exit(0);
  }

  // populate PTEs in trigger_vma and mapping2 with shmem pages
  for (int i=0; i<512; i++)
    ((volatile char *)trigger_vma)[i*0x1000];
  for (int i=0; i<512; i++)
    ((volatile char *)mapping2)[0x400000 + i*0x1000];

  // get rid of parent's mapping of anon page so that MADV_PAGEOUT works
  SYSCHK(madvise(mapping1+0x400000, 0x1000, MADV_DONTNEED));
  
  sleep(1);

  // enable hugepage support for trigger_vma to make khugepaged look at us
  SYSCHK(madvise(trigger_vma-0x2000, 0x200000+0x2000, MADV_HUGEPAGE));

  // wait for khugepaged (relies on first vma being trigger VMA)
  int parent_smaps_fd = SYSCHK(open("/proc/self/smaps", O_RDONLY));
  while (1) {
    char parent_smaps[0x1001];
    int parent_smaps_len = SYSCHK(pread(parent_smaps_fd, parent_smaps, 0x1000, 0));
    parent_smaps[parent_smaps_len] = '\0';
    char *line = strstr(parent_smaps, "\nRss:");
    if (!line)
      errx(1, "no Rss in smaps_rollup?");
    char *numptr = line + strlen("\nRss:");
    while (*numptr == ' ' || *numptr == '\t') numptr++;
    int rss = atoi(numptr);
    printf("Rss = %d kB\n", rss);
    if (rss == 0)
      break;
    usleep(1000*100);
  }

  fprintf(kmsg_file, "khugepaged is about to write_trylock! associating anon_vma...\n");
  *(volatile char *)(mapping2 + 0x400000 - 0x1000) = 1;
  fprintf(kmsg_file, "anon_vma associated, telling child to hit rmap now...\n");

  SYSCHK(eventfd_write(child_cont_fd, 1));


  // wait for child
  int wstatus;
  SYSCHK(waitpid(child, &wstatus, 0));

  exit(0);
}

  parent reply	other threads:[~2023-01-13 19:29 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-01-11 13:33 [PATCH] mm/khugepaged: Fix ->anon_vma race Jann Horn
2023-01-12  1:06 ` Yang Shi
2023-01-13 19:36   ` Jann Horn
2023-01-12  8:56 ` Kirill A. Shutemov
2023-01-12 18:12   ` Yang Shi
2023-01-13  0:10     ` Kirill A. Shutemov
2023-01-13  3:22       ` Yang Shi
2023-01-13 19:28   ` Jann Horn [this message]
2023-01-15 19:06     ` Kirill A. Shutemov
2023-01-16 12:06       ` Jann Horn
2023-01-16 12:34         ` Kirill A. Shutemov
2023-01-16 12:54           ` Jann Horn
2023-01-16 13:07           ` David Hildenbrand
2023-01-16 13:47             ` Kirill A. Shutemov
2023-01-23 11:07               ` David Hildenbrand
2023-01-24  0:51                 ` Kirill A. Shutemov
2023-01-24 10:19                   ` David Hildenbrand
2023-01-17 18:57       ` Yang Shi
2023-01-17 19:12 ` Jann Horn
2023-01-17 22:55   ` Andrew Morton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAG48ez3434wZBKFFbdx4M9j6eUwSUVPd4dxhzW_k_POneSDF+A@mail.gmail.com \
    --to=jannh@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=shy828301@gmail.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).