Date: Thu, 24 Sep 2020 11:08:33 -0400
From: Rafael Aquini
To: "Huang, Ying"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Message-ID: <20200924150833.GE1023012@optiplex-lnx>
References: <20200922184838.978540-1-aquini@redhat.com>
 <878sd1qllb.fsf@yhuang-dev.intel.com>
 <20200923043459.GL795820@optiplex-lnx>
 <87sgb9oz1u.fsf@yhuang-dev.intel.com>
 <20200923130138.GM795820@optiplex-lnx>
 <87blhwng5f.fsf@yhuang-dev.intel.com>
 <20200924020928.GC1023012@optiplex-lnx>
 <877dsjessq.fsf@yhuang-dev.intel.com>
 <20200924063038.GD1023012@optiplex-lnx>
 <87tuvnd3db.fsf@yhuang-dev.intel.com>
In-Reply-To: <87tuvnd3db.fsf@yhuang-dev.intel.com>

On Thu, Sep 24, 2020 at 03:45:52PM +0800, Huang, Ying wrote:
> Rafael Aquini writes:
>
> > On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
> >> Rafael Aquini writes:
> >> > The bug here is quite simple: split_swap_cluster() misses checking for
> >> > lock_cluster() returning NULL before committing to change cluster_info->flags.
> >>
> >> I don't think so.  We shouldn't run into this situation firstly.  So the
> >> "fix" hides the real bug instead of fixing it.  Just like we call
> >> VM_BUG_ON_PAGE(!PageLocked(head), head) in split_huge_page_to_list()
> >> instead of returning if !PageLocked(head) silently.
> >>
> >
> > Not the same thing, obviously, as you are going for an apples-to-carrots
> > comparison, but since you mentioned:
> >
> > split_huge_page_to_list() asserts (in debug builds) *page is locked,
>
> 	VM_BUG_ON_PAGE(!PageLocked(head), head);
>
> It asserts *head instead of *page.
>
> > and later checks if *head bears the SwapCache flag.
> > deferred_split_scan(), OTOH, doesn't hand down the compound head locked,
> > but the 2nd page in the group instead.
>
> No.
> deferred_split_scan() will call trylock_page() on the 2nd page in
> the group, but
>
> static inline int trylock_page(struct page *page)
> {
> 	page = compound_head(page);
> 	return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
> }
>
> So the head page will be locked instead.
>

Yep, missed that. Thanks for straightening me out on this one.

> > This doesn't necessarily mean it's a problem, though, but it might help
> > in hitting the issue.
> >
> >> > The fundamental problem has nothing to do with allocating, or not allocating
> >> > a swap cluster, but it has to do with the fact that the THP deferred split scan
> >> > can transiently race with swapcache insertion, and the fact that when you run
> >> > your swap area on rotational storage cluster_info is _always_ NULL.
> >> > split_swap_cluster() needs to check for lock_cluster() returning NULL because
> >> > that's one possible case, and it clearly fails to do so.
> >>
> >> If there's a race, we should fix the race.  But the code path for
> >> swapcache insertion is,
> >>
> >> add_to_swap()
> >>   get_swap_page() /* Return if fails to allocate */
> >>   add_to_swap_cache()
> >>     SetPageSwapCache()
> >>
> >> While the code path to split THP is,
> >>
> >> split_huge_page_to_list()
> >>   if PageSwapCache()
> >>     split_swap_cluster()
> >>
> >> Both code paths are protected by the page lock.  So there should be some
> >> other reasons to trigger the bug.
> >
> > As mentioned above, no, they seem not to be protected (at least, not the
> > same page, depending on the case). While add_to_swap() will assure a
> > page_lock on the compound head, split_huge_page_to_list() does not.
> >
> >
> >> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
> >> first place.  If so, the bug is that the flag is set and we should fix
> >> the setting.
> >>
> >
> > I fail to follow your claim here. Where is the guarantee, in the code, that
> > you'll never have a compound head in the swapcache?
>
> We may have a THP in the swap cache only if a non-rotational disk is used
> as the swap device.  This is the design assumption of the THP swap support.
> And this is guaranteed because swap space allocation for THP will fail for
> HDD.  If the implementation doesn't guarantee this, we will fix the
> implementation to guarantee this.
>
> >> > Run a workload that causes multiple THP COW, and add a memory hogger to create
> >> > memory pressure so you'll force the reclaimers to kick the registered
> >> > shrinkers. The trigger is not heavy swapping, and that's probably why
> >> > most swap test cases don't hit it. The window is tight, but you will get the
> >> > NULL pointer dereference.
> >>
> >> Do you have a script to reproduce the bug?
> >>
> >
> > Nope, a convoluted set of internal regression tests we have usually
> > triggers it. In the wild, customers running HANA are seeing it,
> > occasionally.
>
> So you haven't reproduced the bug on upstream kernel?
>

Have you seen the stack dump in the patch?
It still reproduces with v5.9, even though the rate is a lot lower than
with earlier kernels.

> Or, can you help to run the test with a debug kernel based on upstream
> kernel.  I can provide some debug patch.
>

Sure, I can set your patches to run with the test cases we
have that tend to reproduce the issue with some degree of success.

> >> > Regardless of whether you find further bugs or not, this patch is needed to correct a
> >> > blunt coding mistake.
> >>
> >> As above.  I don't agree with that.
> >>
> >
> > It's OK to disagree, split_swap_cluster still misses the cluster_info NULL check,
> > though.
>
> In contrast, if the checking is necessary, we shouldn't ignore it, but
> use something like
>
>   	ci = lock_cluster(si, offset);
> + 	VM_BUG_ON(!ci);

Wrong. This will still allow for NULL ptr dereference on non-debug builds.
If ci can be NULL -- and it clearly can, we need to protect
cluster_clear_huge(ci) against that.
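
A minimal userspace sketch of the point above, not the kernel code and
not the patch itself. Every model_* name, the MODEL_DEBUG_VM switch and
the -1 return value are made up for illustration. The only things taken
from the discussion are that lock_cluster() can return NULL when
cluster_info is NULL, and that a VM_BUG_ON()-style assertion is compiled
out of non-debug builds, so it cannot stand in for a real NULL check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are NOT the kernel definitions. */
#define MODEL_CLUSTER_FLAG_HUGE 0x1u

struct model_cluster {
	unsigned int flags;
};

struct model_swap_info {
	struct model_cluster *cluster_info;	/* NULL models rotational storage */
};

/*
 * Models VM_BUG_ON(): a real check only when debugging is enabled,
 * otherwise the condition is never evaluated at run time.
 */
#ifdef MODEL_DEBUG_VM
#define MODEL_VM_BUG_ON(cond) assert(!(cond))
#else
#define MODEL_VM_BUG_ON(cond) ((void)sizeof(cond))
#endif

/* Models lock_cluster(): returns NULL when there is no cluster_info array. */
struct model_cluster *model_lock_cluster(struct model_swap_info *si, size_t idx)
{
	return si->cluster_info ? &si->cluster_info[idx] : NULL;
}

/* Assertion only: without MODEL_DEBUG_VM a NULL ci is still dereferenced. */
void model_split_assert_only(struct model_swap_info *si, size_t idx)
{
	struct model_cluster *ci = model_lock_cluster(si, idx);

	MODEL_VM_BUG_ON(!ci);
	ci->flags &= ~MODEL_CLUSTER_FLAG_HUGE;
}

/* Explicit check: returns -1 and touches nothing when ci is NULL. */
int model_split_checked(struct model_swap_info *si, size_t idx)
{
	struct model_cluster *ci = model_lock_cluster(si, idx);

	if (!ci)
		return -1;
	ci->flags &= ~MODEL_CLUSTER_FLAG_HUGE;
	return 0;
}
```

In this model, calling model_split_assert_only() with cluster_info set to
NULL is only caught when built with -DMODEL_DEBUG_VM, while
model_split_checked() is safe in either build. What the real function
should return in the NULL case is a separate question that this sketch
does not try to answer.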