Date: Tue, 3 Aug 2021 23:42:41 -0700 (PDT)
From: Hugh Dickins
To: huang ying
Cc: Hugh Dickins, Huang Ying, Andrew Morton, David Hildenbrand,
    Yang Shi, Linux-MM, LKML, Miaohe Lin, Johannes Weiner,
    Michal Hocko, Joonsoo Kim, Matthew Wilcox, Minchan Kim
Subject: Re: [PATCH] mm,shmem: Fix a typo in shmem_swapin_page()
Message-ID: <3f91e0a9-7ba-c165-6231-b2bb38f7f63b@google.com>
References: <20210723080000.93953-1-ying.huang@intel.com>
 <24187e5e-069-9f3f-cefe-39ac70783753@google.com>

On Wed, 28 Jul 2021, huang ying wrote:
> On Sat, Jul 24, 2021 at 4:23 AM Hugh Dickins wrote:
> >
> > I was wary because, if the (never observed) race to be fixed is in
> > swap_cluster_readahead(), why was shmem_swapin_page() being patched?
>
> When we get a swap entry from the page table or the shmem xarray, and
> hold no lock that would prevent the swap device from being swapped
> off (e.g. the page table lock, the page lock, etc.), it's possible
> that the swap device has been swapped off by the time we operate on
> the swap entry (e.g. at swapin).

Yes.  And even without any swapoff, the swap entry may no longer be
right by the time we go to swap it in, or after it has been swapped
in.  Both mm/memory.c and mm/shmem.c have done an unlocked lookup to
get the swap entry, and have to do a pte_same() or
shmem_confirm_swap() later, to ensure that what they've got is still
right.  That lockless lookup and raciness is intentional, and has
been working for years.
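(For anyone reading along without the tree open: the shmem side of
that recheck is tiny.  This is quoted from memory of mm/shmem.c around
v5.14, so treat it as an illustrative sketch rather than the letter of
the source:

static bool shmem_confirm_swap(struct address_space *mapping,
                               pgoff_t index, swp_entry_t swap)
{
        return xa_load(&mapping->i_pages, index) ==
               swp_to_radix_entry(swap);
}

If the slot has changed - truncation, a parallel swapin, whatever -
the caller backs out and retries, just as a pte_same() failure does on
the anonymous side.)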
> So we need to find a way to prevent the swap device from being
> swapped off, and get_swap_device(), based on percpu_ref, is used for
> that.

To prevent swapoff, no, it would be bad if swapoff could be prevented
indefinitely.  But I think you're saying: to prevent swapoff from
completing - from reaching the point of freeing structures which might
still be being examined.  Yes, though I thought that was already
prevented.  But I may well have missed the inode_read_congested() case
which came in two years ago (affecting shmem swapin only).  And I'll
admit now to never(?) having studied or tested the SWP_SYNCHRONOUS_IO
case in do_swap_page() (with suspiciously no equivalent in shmem
swapin): I'd better start.

I do dislike the way SWP_SYNCHRONOUS_IO imported low-level swap
business into do_swap_page(): I'd love to try to move that into
swap_state.c, and in doing so hopefully get to appreciate it better
(and the suggested swapoff race: I presume it comes from skipping
swapcache_prepare()).  But I have no time for that at present.

> To avoid calling get_swap_device() here and there (it is now called
> in many different places), I think it's better to call
> get_swap_device() at the point where we have just got a swap entry
> without holding the necessary lock, that is, in do_swap_page(),
> shmem_swapin_page(), etc.  Then we can delete the get_swap_device()
> calls in lookup_swap_cache(), __read_swap_cache_async(), etc.  This
> will make it easier to understand when to use get_swap_device(), and
> clean up the code.  Do you agree?

Offhand I'd say that I do not agree: but I can easily imagine coming
to agree with you, once I have tried to do better and discovered I
cannot.  I dislike the way swap internals are being pushed out to the
higher levels, particularly since those higher levels already have to
deal with similar races which are not protected by get_swap_device().
I'd forgotten how, earlier, you found you had to add get_swap_device()
into lookup_swap_cache() and friends, and I can see the attraction of
doing it once instead of here and there.  But it is still there in
lookup_swap_cache() etc., so that's a poor argument for these commits.

> > Not explained in its commit message, probably a misunderstanding
> > of how mm/shmem.c already manages races (and prefers not to be
> > involved in swap_info_struct stuff).
>
> Yes.  The commit message isn't clear enough about why we do that.

Thanks for clearing that up.  In the intervening days I did read about
50 of the ~100 mails on these commits, April and later (yes, I was
Cc'ed throughout, thanks, but that didn't give me the time).  I've not
yet reached any that explain the get_swap_device() placement; perhaps
I'll change my mind when eventually I read those.

> > But why do I now say it's bad?  Because even if you correct the
> > EINVAL to -EINVAL, that's an unexpected error: -EEXIST is common,
> > -ENOMEM is not surprising, -ENOSPC can need consideration, but
> > -EIO and anything else just end up as SIGBUS when faulting (or as
> > error from syscall).
>
> Yes.  -EINVAL isn't a good choice.  If it's the swapoff race, then
> retrying can fix it, so -EAGAIN may be a choice.  But if the swap
> entry is really invalid (almost impossible in theory), we may need
> something else, for example, WARN_ON_ONCE() and SIGBUS?  This reminds
> me that we may need to distinguish the two possibilities in
> get_swap_device()?

Ah, I guess in that last sentence you're thinking of what I realized
in writing the previous mail: behaviour when given a corrupted swap
entry.

Hugh
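P.S. For concreteness, the -EAGAIN shape being discussed might look
something like this - a hypothetical, untested sketch (the function
name and structure are invented for illustration, not taken from any
posted patch):

static int shmem_swapin_page_sketch(struct inode *inode, pgoff_t index,
                                    swp_entry_t swap)
{
        struct swap_info_struct *si;
        int error = 0;

        /*
         * get_swap_device() takes a percpu_ref reference on the swap
         * device, and fails once swapoff has begun: while we hold the
         * reference, the device's structures cannot be freed.
         */
        si = get_swap_device(swap);
        if (!si)
                return -EAGAIN; /* raced with swapoff: caller retries */

        /*
         * ... do the actual swapin here, revalidating the entry
         * against the xarray with shmem_confirm_swap() before
         * trusting what was read in ...
         */

        put_swap_device(si);
        return error;
}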