From: "Huang, Ying" <ying.huang@intel.com>
To: Rafael Aquini <aquini@redhat.com>
Cc: linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	 akpm@linux-foundation.org
Subject: Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Date: Thu, 24 Sep 2020 15:45:52 +0800
Message-ID: <87tuvnd3db.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200924063038.GD1023012@optiplex-lnx> (Rafael Aquini's message of "Thu, 24 Sep 2020 02:30:38 -0400")

Rafael Aquini <aquini@redhat.com> writes:

> On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
>> Rafael Aquini <aquini@redhat.com> writes:
>> > The bug here is quite simple: split_swap_cluster() misses checking for
>> > lock_cluster() returning NULL before committing to change cluster_info->flags.
>> 
>> I don't think so.  We shouldn't run into this situation firstly.  So the
>> "fix" hides the real bug instead of fixing it.  Just like we call
>> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
>> instead of returning if !PageLocked(head) silently.
>>
>
> Not the same thing, obviously, as you are going for an apples-to-carrots
> comparison, but since you mentioned:
>
> split_huge_page_to_list() asserts (in debug builds) *page is locked,

	VM_BUG_ON_PAGE(!PageLocked(head), head);

It asserts *head instead of *page.

> and later checks if *head bears the SwapCache flag. 
> deferred_split_scan(), OTOH, doesn't hand down the compound head locked, 
> but the 2nd page in the group instead.

No.  deferred_split_scan() will call trylock_page() on the 2nd page in
the group, but

static inline int trylock_page(struct page *page)
{
	page = compound_head(page);
	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
}

So the head page will be locked instead.
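
As a rough userspace illustration (not kernel code), the effect can be
modelled as below.  The struct layout, the plain "head" pointer, and the
names model_compound_head() and model_trylock_page() are assumptions
made up for this sketch; the real kernel encodes the head differently
and uses an atomic bit operation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel layout. */
struct model_page {
	struct model_page *head;	/* NULL for a head or order-0 page */
	bool locked;			/* models PG_locked */
};

static struct model_page *model_compound_head(struct model_page *page)
{
	return page->head ? page->head : page;
}

/* Like trylock_page(): the lock bit always lives on the head page. */
static bool model_trylock_page(struct model_page *page)
{
	page = model_compound_head(page);
	if (page->locked)
		return false;
	page->locked = true;	/* non-atomic; fine for a single-threaded model */
	return true;
}
```

Calling model_trylock_page() on any tail page sets "locked" on the head
only, so a later trylock through another tail page of the same group
fails.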

> This doesn't necessarily mean it's a problem, though, but it might help
> in hitting the issue.
>  
>> > The fundamental problem has nothing to do with allocating, or not allocating
>> > a swap cluster, but it has to do with the fact that the THP deferred split scan
>> > can transiently race with swapcache insertion, and the fact that when you run
>> > your swap area on rotational storage cluster_info is _always_ NULL.
>> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
>> > that's one possible case, and it clearly fails to do so.
>> 
>> If there's a race, we should fix the race.  But the code path for
>> swapcache insertion is,
>> 
>> add_to_swap()
>>   get_swap_page() /* Return if fails to allocate */
>>   add_to_swap_cache()
>>     SetPageSwapCache()
>> 
>> While the code path to split THP is,
>> 
>> split_huge_page_to_list()
>>   if PageSwapCache()
>>     split_swap_cluster()
>> 
>> Both code paths are protected by the page lock.  So there should be some
>> other reasons to trigger the bug.
>
> As mentioned above, no they seem to not be protected (at least, not the
> same page, depending on the case). While add_to_swap() will assure a 
> page_lock on the compound head, split_huge_page_to_list() does not.
>
>
>> And again, for HDD, a THP shouldn't have PageSwapCache() set at the
>> first place.  If so, the bug is that the flag is set and we should fix
>> the setting.
>> 
>
> I fail to follow your claim here. Where is the guarantee, in the code, that 
> you'll never have a compound head in the swapcache? 

We may have a THP in the swap cache only if a non-rotational disk is used
as the swap device.  This is the design assumption of the THP swap
support.  It is guaranteed because swap space allocation for a THP will
fail on an HDD.  If the implementation doesn't guarantee this, we should
fix the implementation so that it does.
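
To make the intended invariant concrete, here is a simplified userspace
model.  It is only a sketch of the design assumption, not the actual
swapfile.c logic: every name in it (sketch_swap_dev,
sketch_get_swap_slots(), sketch_add_thp_to_swap(), SKETCH_HPAGE_NR) and
the value 512 are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_HPAGE_NR 512	/* assumed number of slots in a THP-sized entry */

struct sketch_swap_dev {
	bool has_cluster_info;	/* false models a rotational (HDD) device */
	int free_clusters;	/* whole free clusters available */
};

/* Returns the number of swap slots allocated, 0 on failure. */
static int sketch_get_swap_slots(struct sketch_swap_dev *dev, int nr)
{
	if (nr == SKETCH_HPAGE_NR) {
		/* THP-sized request: only cluster-managed devices may serve it. */
		if (!dev->has_cluster_info || dev->free_clusters == 0)
			return 0;
		dev->free_clusters--;
		return nr;
	}
	return nr == 1 ? 1 : 0;	/* order-0 request, simplified: always succeeds */
}

/* Models add_to_swap(): the swap cache flag is set only after allocation. */
static bool sketch_add_thp_to_swap(struct sketch_swap_dev *dev,
				   bool *in_swap_cache)
{
	if (!sketch_get_swap_slots(dev, SKETCH_HPAGE_NR))
		return false;	/* flag stays clear when allocation fails */
	*in_swap_cache = true;
	return true;
}
```

In this model a THP can never end up with the swap cache flag set on an
HDD, because the allocation step fails before the flag is touched.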

>> > Run a workload that cause multiple THP COW, and add a memory hogger to create
>> > memory pressure so you'll force the reclaimers to kick the registered
>> > shrinkers. The trigger is not heavy swapping, and that's probably why
>> > most swap test cases don't hit it. The window is tight, but you will get the
>> > NULL pointer dereference.
>> 
>> Do you have a script to reproduce the bug?
>> 
>
> Nope, a convoluted set of internal regression tests we have usually
> triggers it. In the wild, customers running HANNA are seeing it,
> occasionally.

So you haven't reproduced the bug on an upstream kernel?

Or, can you help to run the test with a debug kernel based on the
upstream kernel?  I can provide a debug patch.

>> > Regardless of whether you find further bugs or not, this patch is needed
>> > to correct a blunt coding mistake.
>> 
>> As above.  I don't agree with that.
>> 
>
> It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> though.

On the contrary, if the check is necessary, we shouldn't silently skip
the NULL case, but use something like

        ci = lock_cluster(si, offset);
+       VM_BUG_ON(!ci);
        cluster_clear_huge(ci);

in split_swap_cluster() to enforce the check and report the bug as early
as possible.  But this appears unnecessary now, because a NULL ci is
already dereferenced, and thus reported, in cluster_clear_huge().
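
The difference between asserting and silently returning can be sketched
in userspace as follows.  This is a toy model under stated assumptions:
assert() stands in for VM_BUG_ON(), locking is omitted, and the
model_* names, MODEL_CLUSTER_FLAG_HUGE, and MODEL_CLUSTER_SIZE are
invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_CLUSTER_FLAG_HUGE 0x1u	/* hypothetical flag value */
#define MODEL_CLUSTER_SIZE 512u		/* hypothetical slots per cluster */

struct model_cluster {
	unsigned int flags;
};

struct model_swap_info {
	struct model_cluster *cluster_info;	/* NULL models rotational storage */
};

/* Models lock_cluster(): no cluster array means there is nothing to lock. */
static struct model_cluster *model_lock_cluster(struct model_swap_info *si,
						unsigned long offset)
{
	if (!si->cluster_info)
		return NULL;
	return &si->cluster_info[offset / MODEL_CLUSTER_SIZE];
}

static void model_split_swap_cluster(struct model_swap_info *si,
				     unsigned long offset)
{
	struct model_cluster *ci = model_lock_cluster(si, offset);

	assert(ci);	/* stands in for VM_BUG_ON(!ci): fail loudly, early */
	ci->flags &= ~MODEL_CLUSTER_FLAG_HUGE;
}
```

With the assertion, reaching this function on a device without
cluster_info stops at the point where the broken assumption is first
visible, instead of returning quietly and leaving the real bug hidden.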

Best Regards,
Huang, Ying

