Re: [syzbot] general protection fault in PageHeadHuge

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yang Shi <shy828301@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	syzbot <syzbot+152d76c44ba142f8992b@syzkaller.appspotmail.com>,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, llvm@lists.linux.dev, nathan@kernel.org,
	ndesaulniers@google.com, songmuchun@bytedance.com,
	syzkaller-bugs@googlegroups.com, trix@redhat.com
Subject: Re: [syzbot] general protection fault in PageHeadHuge
Date: Tue, 27 Sep 2022 09:24:37 -0700	[thread overview]
Message-ID: <YzMjxY5O6Hf/IPTx@monkey> (raw)
In-Reply-To: <YzDuHbuo2x/b2Mbr@x1n>

On 09/25/22 20:11, Peter Xu wrote:
> On Sat, Sep 24, 2022 at 12:01:16PM -0700, Mike Kravetz wrote:
> > On 09/24/22 11:06, Peter Xu wrote:
> > > 
> > > Sorry I forgot to reply on this one.
> > > 
> > > I didn't try linux-next, but I can easily reproduce this with mm-unstable
> > > already, and I verified that Hugh's patch fixes the problem for shmem.
> > > 
> > > When I was testing I found hugetlb selftest is broken too but with some
> > > other errors:
> > > 
> > > $ sudo ./userfaultfd hugetlb 100 10  
> > > ...
> > > bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)
> > > 
> > > The failing check was making sure all MISSING events are not triggered by
> > > writes, but frankly I don't really know why it's required, and that check
> > > existed since the 1st commit when test was introduced.
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291
> > > 
> > > And obviously some recent hugetlb-related change caused that to happen.
> > > 
> > > Dropping that check can definitely work, but I'll have a closer look soon
> > > too to make sure I didn't miss something.  Mike, please also let me know if
> > > you are aware of this problem.
> > > 
> > 
> > Peter, I am not aware of this problem.  I really should make running ALL
> > hugetlb tests part of my regular routine.
> > 
> > If you do not beat me to it, I will take a look in the next few days.
> 
> Just to update - my bisection points to 00cdec99f3eb ("hugetlbfs: revert
> use i_mmap_rwsem to address page fault/truncate race", 2022-09-21).
> 
> I don't understand how they are related so far, though.  It should be a
> timing thing because the failure cannot be reproduced on a VM but only on
> the host, and it can also pass sometimes even on the host but rarely.

Thanks Peter!

After your analysis, I also started looking at this.
- I did reproduce a few times in a VM
- On BM (a laptop) I could reproduce but it would take several (10's of) runs

> Logically all the uffd messages in the stress test should be generated by
> the locking thread, upon:
> 
> 		pthread_mutex_lock(area_mutex(area_dst, page_nr));

I personally find that test program hard to understand/follow and it takes me 
a day or so to understand what it is doing, then I immediately loose context
when I stop looking at it. :(

So, as you mention below the program is depending on pthread_mutex_lock()
doing a read fault before a write.

> I thought a common scheme for lock() fast path should already be an
> userspace cmpxchg() and that should be a write fault already.
> 
> For example, I did some stupid hack on the test and I can trigger the write
> check fault with anonymous easily with an explicit cmpxchg on byte offset 128:
> 
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 74babdbc02e5..a7d6938d4553 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -637,6 +637,10 @@ static void *locking_thread(void *arg)
>                 } else
>                         page_nr += 1;
>                 page_nr %= nr_pages;
> +               char *ptr = area_dst + (page_nr * page_size) + 128;
> +               char _old = 0, new = 1;
> +               (void)__atomic_compare_exchange_n(ptr, &_old, new, false,
> +                               __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>                 pthread_mutex_lock(area_mutex(area_dst, page_nr));
>                 count = *area_count(area_dst, page_nr);
>                 if (count != count_verify[page_nr])
> 
> I'll need some more time thinking about it before I send a patch to drop
> the write check..

I did another stupid hack, and duplicated the statement:
	count = *area_count(area_dst, page_nr);
before the,
	pthread_mutex_lock(area_mutex(area_dst, page_nr));

This should guarantee a read fault independent of what pthread_mutex_lock
does.  However, it still results in the occasional "ERROR: unexpected write
fault".  So, something else if happening.  I will continue to experiment
and think about this.
-- 
Mike Kravetz