llvm.lists.linux.dev archive mirror
 help / color / mirror / Atom feed
From: Mike Kravetz <mike.kravetz@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yang Shi <shy828301@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	syzbot <syzbot+152d76c44ba142f8992b@syzkaller.appspotmail.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, llvm@lists.linux.dev, nathan@kernel.org,
	ndesaulniers@google.com, songmuchun@bytedance.com,
	syzkaller-bugs@googlegroups.com, trix@redhat.com
Subject: Re: [syzbot] general protection fault in PageHeadHuge
Date: Thu, 29 Sep 2022 16:33:53 -0700	[thread overview]
Message-ID: <YzYrYVeA0b9d5dos@monkey> (raw)
In-Reply-To: <YzMomT+OusJnLOPC@x1n>

On 09/27/22 12:45, Peter Xu wrote:
> On Tue, Sep 27, 2022 at 09:24:37AM -0700, Mike Kravetz wrote:
> > This should guarantee a read fault independent of what pthread_mutex_lock
> > does.  However, it still results in the occasional "ERROR: unexpected write
> > fault".  So, something else if happening.  I will continue to experiment
> > and think about this.
> 
> Thanks for verifying this, Mike.  I didn't yet reply but I did have some
> update on my side too.  I plan to look deeper and wanted to reply until
> that, because I do think there's something special on hugetlb and I still
> don't know. I just keep getting distracted by other things.. but maybe I
> should still share out what I have already.
> 
> I think I already know what had guaranteed the read faults - the NPTL
> pthread lib implemented mutex in different types, and the 1st instruction
> of lock() is to fetch the mutex type (at offset 0x10) then we know how we
> should move on:
> 
> (gdb) disas pthread_mutex_lock
> Dump of assembler code for function ___pthread_mutex_lock:
>    0x00007ffff7e3b0d0 <+0>:     endbr64 
>    0x00007ffff7e3b0d4 <+4>:     mov    0x10(%rdi),%eax       <---- read 0x10 of &mutex
>    0x00007ffff7e3b0d7 <+7>:     mov    %eax,%edx
>    0x00007ffff7e3b0d9 <+9>:     and    $0x17f,%edx
> (gdb) ptype pthread_mutex_t
> type = union {
>     struct __pthread_mutex_s __data;
>     char __size[40];
>     long __align;
> }
> (gdb) ptype struct __pthread_mutex_s
> type = struct __pthread_mutex_s {
>     int __lock;
>     unsigned int __count;
>     int __owner;
>     unsigned int __nusers;
>     int __kind;                                              <--- 0x10 offset here
>     short __spins;
>     short __elision;
>     __pthread_list_t __list;
> }
> 
> I looked directly at asm dumped from the binary I tested to make sure it's
> accurate.  So it means with current uffd selftest logically there should
> never be a write missing fault (I still don't think it's good to have the
> write check though.. but that's separate question to ask).
> 
> It also means hugetlb does something special here.  It smells really like
> for some reason the hugetlb pgtable got evicted after UFFDIO_COPY during
> locking_thread running, then any further lock() (e.g. cmpxchg) or modifying
> the counter could trigger a write fault.
> 
> OTOH this also explained why futex code was never tested on userfaultfd
> selftest, simply because futex() will always to be after that "read the
> type of mutex" thing.. which is something I want to rework a bit, so as to
> have uffd selftest to cover gup as you used to rightfully suggest.  But
> that'll be nothing urgent, and be something after we know what's special
> with hugetlb new code.
> 

I was able to do a little more debugging:

As you know the hugetlb calling path to handle_userfault is something
like this,

hugetlb_fault
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	if (huge_pte_none_mostly())
		hugetlb_no_page()
			page = find_lock_page(mapping, idx);
			if (!page) {
				if (userfaultfd_missing(vma))
					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
					return handle_userfault()
			}

For anon mappings, find_lock_page() will never find a page, so as long
as huge_pte_none_mostly() is true we will call into handle_userfault().

Since your analysis shows the testcase should never call handle_userfault() for
a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
the call to handle_userfault().  Sure enough, I saw plenty of printk messages.

Then, before calling handle_userfault() I added code to take the page table
lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
this second test of huge_pte_none_mostly() was false.  So, the condition
changed from the check in hugetlb_fault until the check (with page table
lock) in hugetlb_no_page.

IIUC, the only code that should be modifying the pte in this test is
hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
updating the pte.

It 'appears' that hugetlb_fault is not seeing the updated pte and I can
only guess that it is due to some caching issues.

After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(dst_vma, dst_addr, dst_pte);

I suspect that is true.  However, it seems like this test depends on all
CPUs seeing the updated pte immediately?

I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
issues take me a while to figure out.
-- 
Mike Kravetz

  reply	other threads:[~2022-09-29 23:34 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-23 16:05 [syzbot] general protection fault in PageHeadHuge syzbot
2022-09-23 21:11 ` Mike Kravetz
2022-09-23 21:37   ` Yang Shi
2022-09-24  0:14   ` Hugh Dickins
2022-09-24  0:18     ` Nathan Chancellor
2022-09-24  7:28       ` Hugh Dickins
2022-09-24  0:58     ` Peter Xu
2022-09-24 15:06       ` Peter Xu
2022-09-24 19:01         ` Mike Kravetz
2022-09-26  0:11           ` Peter Xu
2022-09-27 16:24             ` Mike Kravetz
2022-09-27 16:45               ` Peter Xu
2022-09-29 23:33                 ` Mike Kravetz [this message]
2022-09-30  1:29                   ` Peter Xu
2022-09-30  1:35                     ` Peter Xu
2022-09-30 16:05                   ` Peter Xu
2022-09-30 21:38                     ` Mike Kravetz
2022-10-01  2:47                       ` Peter Xu
2022-10-01  3:01                         ` Peter Xu
2022-10-03  1:16                           ` Mike Kravetz
2022-10-03 15:24                             ` Peter Xu
2022-09-25 18:55     ` Andrew Morton
2022-09-24 21:56   ` Matthew Wilcox
2022-09-27  4:32     ` Hugh Dickins

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YzYrYVeA0b9d5dos@monkey \
    --to=mike.kravetz@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=hughd@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=llvm@lists.linux.dev \
    --cc=nathan@kernel.org \
    --cc=ndesaulniers@google.com \
    --cc=peterx@redhat.com \
    --cc=shy828301@gmail.com \
    --cc=songmuchun@bytedance.com \
    --cc=syzbot+152d76c44ba142f8992b@syzkaller.appspotmail.com \
    --cc=syzkaller-bugs@googlegroups.com \
    --cc=trix@redhat.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).