From: Linus Torvalds
Date: Wed, 23 Sep 2020 13:00:45 -0700
Subject: Re: BUG: Bad page state in process dirtyc0w_child
To: Gerald Schaefer
Cc: Peter Xu, Heiko Carstens, Qian Cai, Alexander Gordeev, Vasily Gorbik,
    Christian Borntraeger, linux-s390, Linux-MM, Linux Kernel Mailing List

On Wed, Sep 23, 2020 at 6:39 AM Gerald Schaefer wrote:
>
> OK, I can now reproduce this, and unfortunately also with the gup_fast
> fix, so it is something different. Bisecting is a bit hard, as it will
> not always show immediately, sometimes takes up to an hour.
>
> Still, I think I found the culprit, merge commit b25d1dc9474e "Merge
> branch 'simplify-do_wp_page'". Without those 4 patches, it works fine,
> running over night.

Odd, but I have a strong suspicion that the "do_wp_page()
simplification" only ends up removing serialization that previously
hid some existing problem.

> Not sure why this only shows on s390, should not be architecture-specific,
> but we do often see subtle races earlier than others due to hypervisor
> impact.

Yeah, so if it needs very particular timing, maybe the s390 page table
handling together with the hypervisor interfaces ends up being more
likely to trigger this, and thus the different timings at do_wp_page()
then end up showing it.

> One thing that seems strange to me is that the page flags from the
> bad page state output are (uptodate|swapbacked), see below, or
> (referenced|uptodate|dirty|swapbacked) in the original report. But IIUC,
> that should not qualify for the "PAGE_FLAGS_CHECK_AT_FREE flag(s) set"
> reason. So it seems that the flags have changed between check_free_page()
> and __dump_page(), which would be very odd. Or maybe some issue with
> compound pages, because __dump_page() looks at head->flags.

The odd thing is that all of this _should_ be serialized by the page
table lock, as far as I can tell.

From your trace, it looks very much like it's do_madvise() ->
zap_pte_range() (your stack trace only has zap_p4d_range mentioned,
but all the lower levels are inlined) that races with presumably
fast-gup.

But zap_pte_range() has the pte lock, and fast-gup is - by design -
not allowed to change the page state other than taking a reference to
it, and should do that with a "try_get" operation, so even taking the
reference should never ever race with somebody doing the final free.

IOW, the fast-GUP code does that

        page = pte_page(pte);

        head = try_grab_compound_head(page, 1, flags);
        if (!head)
                goto pte_unmap;

        if (unlikely(pte_val(pte) != pte_val(*ptep))) {
                put_compound_head(head, 1, flags);
                goto pte_unmap;
        }

where the important part is "try_grab_compound_head()", which does
the whole careful atomic "increase page count only if it wasn't zero".
See page_cache_add_speculative().
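
To make the "increase page count only if it wasn't zero" idea concrete,
here is a minimal userspace sketch, assuming C11 atomics stand in for the
kernel's page refcount. The names toy_page, toy_try_get() and toy_put()
are made up for illustration and are not kernel APIs; the real logic sits
behind try_grab_compound_head() and is more involved (compound pages,
GUP flags).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the reference count. */
struct toy_page {
	atomic_int refcount;
};

/*
 * Take a reference only if the count is not already zero.
 * Returns false when the final put has already happened, in which
 * case the caller must not touch the page at all.
 */
static bool toy_try_get(struct toy_page *page)
{
	int old = atomic_load(&page->refcount);

	while (old != 0) {
		/* On failure 'old' is reloaded, so the zero check is repeated. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the final one. */
static bool toy_put(struct toy_page *page)
{
	int old = atomic_fetch_sub(&page->refcount, 1);

	assert(old > 0);
	return old == 1;
}
```

With this shape, a lockless walker that loses the race against the final
toy_put() simply gets false back and backs off, instead of resurrecting a
page that is already on its way to being freed.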

So the rule is

 - if you hold the page table lock, you can just do
   "get_page(pte_page())" directly, because you know the pte cannot go
   away from under you

 - if you are fast-gup, the pte *can* go away from under you, so you
   need to do that very careful "get page unless it's gone" dance

but I don't see us violating that.

There's maybe some interesting memory ordering in the above case, but
it does atomic_add_unless() which is ordered, and s390 is strongly
ordered anyway, isn't it?

(Yes, and it doesn't do the atomic stuff at all if TINY_RCU is set,
but that's only set for non-preemptible UP kernels, so that doesn't
matter).

So if zap_page_range() races with fast-gup, then either
zap_page_range() wins the race and puts the page - but then fast-gup
won't touch it, or fast-gup wins and gets a reference to the page, and
then zap_page_range() will clear it and drop the ref to it, but it
won't be the final ref.

Your dump seems to show that zap_page_range() *did* drop the final
ref, but something is racing with it to the point of actually
modifying the page flags.

Odd.

And the do_wp_page() change itself shouldn't be directly involved,
because that's all done under the page table lock. But it obviously
does change the page locking a lot, and changes timing a lot.

And in fact, the zap_pte_range() code itself doesn't take the page
lock (and cannot, because it's all under the page table spinlock).

So it does smell like timing to me. But possibly with some
s390-specific twist to it.

Ooh. One thing that is *very* different about s390 is that it frees
the page directly, and doesn't batch things up to happen after the TLB
flush.

Maybe THAT is the difference? Not that I can tell why it should
matter, for all the reasons outlined above. But on x86-64, the
__tlb_remove_page() function just adds the page to the "free this
later" TLB flush structure, and if it fills up it does the TLB flush
and then does the actual batched page freeing outside the page table
lock.
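
The batching behaviour described for x86-64 can be pictured with a small
userspace sketch. Everything here is hypothetical (toy_gather,
toy_remove_page(), toy_flush_and_free(), and the batch size of 4); it
models only the ordering "queue now, flush, then free", not the kernel's
actual data structures or locking.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 4

/* Hypothetical stand-in for the "free this later" TLB flush structure. */
struct toy_gather {
	int pending[TOY_BATCH_MAX];	/* page ids queued for freeing */
	size_t nr;			/* number of queued pages */
	unsigned int flushes;		/* TLB flushes issued so far */
	unsigned int freed;		/* pages actually freed so far */
};

/* Flush first, then free everything that was queued. */
static void toy_flush_and_free(struct toy_gather *tlb)
{
	if (tlb->nr == 0)
		return;
	tlb->flushes++;			/* stands in for the TLB flush */
	tlb->freed += tlb->nr;		/* stands in for the batched freeing */
	tlb->nr = 0;
}

/* Queue a page; freeing is deferred until the batch fills up. */
static void toy_remove_page(struct toy_gather *tlb, int page_id)
{
	assert(tlb->nr < TOY_BATCH_MAX);
	tlb->pending[tlb->nr++] = page_id;
	if (tlb->nr == TOY_BATCH_MAX)
		toy_flush_and_free(tlb);
}
```

In this model a page handed to toy_remove_page() is never freed before a
flush has been counted, whereas an architecture that frees directly would
correspond to bumping 'freed' straight away with no queue in between.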

And that *has* been one of the things that the fast-gup code depended
on. We even have a big comment about it:

        /*
         * Disable interrupts. The nested form is used, in order to allow
         * full, general purpose use of this routine.
         *
         * With interrupts disabled, we block page table pages from being
         * freed from under us. See struct mmu_table_batch comments in
         * include/asm-generic/tlb.h for more details.
         *
         * We do not adopt an rcu_read_lock() here as we also want to
         * block IPIs that come from THPs splitting.
         */

and maybe that whole thing doesn't hold true for s390 at all.

              Linus