From: Rafael Aquini <aquini@redhat.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	akpm@linux-foundation.org
Subject: Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Date: Thu, 24 Sep 2020 11:08:33 -0400
Message-ID: <20200924150833.GE1023012@optiplex-lnx>
In-Reply-To: <87tuvnd3db.fsf@yhuang-dev.intel.com>

On Thu, Sep 24, 2020 at 03:45:52PM +0800, Huang, Ying wrote:
> Rafael Aquini <aquini@redhat.com> writes:
> 
> > On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> >> Rafael Aquini <aquini@redhat.com> writes:
> >> > The bug here is quite simple: split_swap_cluster() misses checking for
> >> > lock_cluster() returning NULL before committing to change cluster_info->flags.
> >> 
> >> I don't think so.  We shouldn't run into this situation in the first
> >> place.  So the "fix" hides the real bug instead of fixing it.  Just like we call
> >> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> >> instead of returning if !PageLocked(head) silently.
> >>
> >
> > Not the same thing, obviously, as you are going for an apples-to-carrots
> > comparison, but since you mentioned:
> >
> > split_huge_page_to_list() asserts (in debug builds) *page is locked,
> 
> 	VM_BUG_ON_PAGE(!PageLocked(head), head);
> 
> It asserts *head instead of *page.
>
> > and later checks if *head bears the SwapCache flag. 
> > deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> > but the 2nd page in the group instead.
> 
> No.  deferred_split_scan() will call trylock_page() on the 2nd page in
> the group, but
> 
> static inline int trylock_page(struct page *page)
> {
> 	page = compound_head(page);
> 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
> }
> 
> So the head page will be locked instead.
> 
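To make the point above concrete, here is a minimal userspace sketch of the
behaviour described: taking the lock through a tail page ends up locking the
compound head. The struct layout, the non-atomic lock bit and the
demo_trylock_on_tail() helper are assumptions made for illustration only; the
kernel uses test_and_set_bit_lock() on page->flags and a different encoding
for the head pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct page; NOT the kernel layout. */
struct page {
	bool locked;        /* models the PG_locked bit */
	struct page *head;  /* NULL for a head page, else points to the head */
};

/* Models compound_head(): a tail page resolves to its head page. */
struct page *compound_head(struct page *page)
{
	return page->head ? page->head : page;
}

/*
 * Models trylock_page(): returns 1 if the lock was taken, 0 if it was
 * already held. Unlike the kernel version, this is not atomic.
 */
int trylock_page(struct page *page)
{
	page = compound_head(page);
	if (page->locked)
		return 0;
	page->locked = true;
	return 1;
}

/* Hypothetical demo: lock via the tail, observe the head being locked. */
void demo_trylock_on_tail(void)
{
	struct page head = { .locked = false, .head = NULL };
	struct page tail = { .locked = false, .head = &head };

	assert(trylock_page(&tail) == 1); /* lock requested on the tail */
	assert(head.locked);              /* the head carries the lock */
	assert(!tail.locked);             /* the tail itself is untouched */
	assert(trylock_page(&head) == 0); /* head is already locked */
}
```

Calling demo_trylock_on_tail() from a small main() is enough to exercise it.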

Yep, missed that. Thanks for straightening me out on this one.


> > This doesn't necessarily mean it's a problem, though, but it might help
> > in hitting the issue.
> >  
> >> > The fundamental problem has nothing to do with allocating, or not allocating
> >> > a swap cluster, but it has to do with the fact that the THP deferred split scan
> >> > can transiently race with swapcache insertion, and the fact that when you run
> >> > your swap area on rotational storage cluster_info is _always_ NULL.
> >> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
> >> > that's one possible case, and it clearly fails to do so.
> >> 
> >> If there's a race, we should fix the race.  But the code path for
> >> swapcache insertion is,
> >> 
> >> add_to_swap()
> >>   get_swap_page() /* Return if fails to allocate */
> >>   add_to_swap_cache()
> >>     SetPageSwapCache()
> >> 
> >> While the code path to split THP is,
> >> 
> >> split_huge_page_to_list()
> >>   if PageSwapCache()
> >>     split_swap_cluster()
> >> 
> >> Both code paths are protected by the page lock.  So there should be some
> >> other reasons to trigger the bug.
> >
> > As mentioned above, no they seem to not be protected (at least, not the
> > same page, depending on the case). While add_to_swap() will assure a 
> > page_lock on the compound head, split_huge_page_to_list() does not.
> >
> >
> >> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
> >> first place.  If so, the bug is that the flag is set and we should fix
> >> the setting.
> >> 
> >
> > I fail to follow your claim here. Where is the guarantee, in the code, that 
> > you'll never have a compound head in the swapcache? 
> 
> We may have a THP in the swap cache only if a non-rotational disk is used
> as the swap device.  This is the design assumption of the THP swap support.
> And this is guaranteed because swap space allocation for a THP will fail
> for HDD.  If the implementation doesn't guarantee this, we will fix the
> implementation to guarantee this.
> 
> >> > Run a workload that causes multiple THP COW, and add a memory hogger to create
> >> > memory pressure so you'll force the reclaimers to kick the registered
> >> > shrinkers. The trigger is not heavy swapping, and that's probably why
> >> > most swap test cases don't hit it. The window is tight, but you will get the
> >> > NULL pointer dereference.
> >> 
> >> Do you have a script to reproduce the bug?
> >> 
> >
> > Nope, a convoluted set of internal regression tests we have usually
> > triggers it. In the wild, customers running HANNA are seeing it,
> > occasionally.
> 
> So you haven't reproduced the bug on an upstream kernel?
> 

Have you seen the stack dump in the patch? It still reproduces with v5.9,
even though the rate is a lot lower than with earlier kernels.


> Or can you help run the test with a debug kernel based on the upstream
> kernel?  I can provide a debug patch.
> 

Sure, I can set your patches to run with the test cases we have that tend to 
reproduce the issue with some degree of success.


> >> > Regardless of whether you find further bugs or not, this patch is needed to
> >> > correct a blunt coding mistake.
> >> 
> >> As above.  I don't agree with that.
> >> 
> >
> > It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> > though.
> 
> In contrast, if the checking is necessary, we shouldn't ignore it, but
> use something like
> 
>         ci = lock_cluster(si, offset);
> +       VM_BUG_ON(!ci);

Wrong. This will still allow for a NULL pointer dereference on non-debug builds.
If ci can be NULL -- and it clearly can -- we need to protect
cluster_clear_huge(ci) against that.
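
For illustration only, a minimal userspace sketch of the kind of guard being
argued for here. It is not the patch itself: the structs are stripped-down
stand-ins, lock_cluster() takes a cluster index and does no locking, the flag
value and the -1 return are placeholders, and split_swap_cluster_checked() and
the demo helper are hypothetical names. The only point it models is that a
swap area with no cluster_info yields a NULL ci, and that the flag update has
to be skipped in that case whether or not debug assertions are compiled in.

```c
#include <assert.h>
#include <stddef.h>

#define CLUSTER_FLAG_HUGE 4u /* illustrative value for this sketch */

/* Stripped-down stand-ins for the kernel structures. */
struct swap_cluster_info {
	unsigned int flags;
};

struct swap_info_struct {
	struct swap_cluster_info *cluster_info; /* NULL models rotational swap */
};

/* Models lock_cluster(): NULL when the swap area has no cluster_info. */
struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
				       unsigned long idx)
{
	if (!si->cluster_info)
		return NULL;
	return &si->cluster_info[idx];
}

/*
 * Hypothetical guarded variant: check ci before touching ci->flags.
 * Returns 0 on success, -1 (placeholder) when there is no cluster.
 */
int split_swap_cluster_checked(struct swap_info_struct *si, unsigned long idx)
{
	struct swap_cluster_info *ci = lock_cluster(si, idx);

	if (!ci)
		return -1;
	ci->flags &= ~CLUSTER_FLAG_HUGE; /* what cluster_clear_huge() is for */
	return 0;
}

/* Hypothetical demo covering both the NULL and the non-NULL case. */
void demo_split_on_hdd_and_ssd(void)
{
	struct swap_cluster_info clusters[1] = { { CLUSTER_FLAG_HUGE } };
	struct swap_info_struct hdd = { NULL };
	struct swap_info_struct ssd = { clusters };

	assert(split_swap_cluster_checked(&hdd, 0) == -1); /* no dereference */
	assert(split_swap_cluster_checked(&ssd, 0) == 0);
	assert(clusters[0].flags == 0); /* huge flag cleared */
}
```

An assertion-only version would behave the same on a debug build, but would
fall through to the ci->flags update once the assertion is compiled out.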




