From: Zhongkun He <hezhongkun.hzk@bytedance.com>
Date: Sat, 23 Mar 2024 16:38:10 +0800
Subject: Re: [External] Re: [RFC PATCH] mm: add folio in swapcache if swapin from zswap
To: Johannes Weiner
Cc: Yosry Ahmed, Barry Song <21cnbao@gmail.com>, chengming.zhou@linux.dev,
        nphamcs@gmail.com, akpm@linux-foundation.org, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org
In-Reply-To: <20240322234826.GA448621@cmpxchg.org>
References: <20240322163939.17846-1-chengming.zhou@linux.dev>
        <20240322234826.GA448621@cmpxchg.org>

On Sat, Mar 23, 2024 at 7:48 AM Johannes Weiner wrote:
>
> On Fri, Mar 22, 2024 at 10:33:13PM +0000, Yosry Ahmed wrote:
> > On Sat, Mar 23, 2024 at 10:41:32AM +1300, Barry Song wrote:
> > > On Sat, Mar 23, 2024 at 8:38 AM Yosry Ahmed wrote:
> > > >
> > > > On Fri, Mar 22, 2024 at 9:40 AM wrote:
> > > > >
> > > > > From: Chengming Zhou
> > > > >
> > > > > There is a report of data corruption caused by double swapin, which is
> > > > > only possible in the skip swapcache path on SWP_SYNCHRONOUS_IO backends.
> > > > >
> > > > > The root cause is that zswap is not like other "normal" swap backends,
> > > > > it won't keep the copy of data after the first time of swapin. So if
> > >
> > > I don't quite understand this, so once we load a page from zswap, zswap
> > > will free it even though do_swap_page might not set it to PTE?
> > >
> > > shouldn't zswap free the memory after notify_free just like zram?
> >
> > It's an optimization that zswap has, exclusive loads. After a page is
> > swapped in it can stick around in the swapcache for a while. In this
> > case, there would be two copies in memory with zram (compressed and
> > uncompressed). Zswap implements exclusive loads to drop the compressed
> > copy. The folio is marked as dirty so that any attempts to reclaim it
> > cause a new write (compression) to zswap. It is also for a lot of
> > cleanups and straightforward entry lifetime tracking in zswap.
> >
> > It is mostly fine, the problem here happens because we skip the
> > swapcache during swapin, so there is a possibility that we load the
> > folio from zswap then just drop it without stashing it anywhere.
> >
> > > > >
> > > > > the folio in the first time of swapin can't be installed in the pagetable
> > > > > successfully and we just free it directly. Then in the second time of
> > > > > swapin, we can't find anything in zswap and read wrong data from swapfile,
> > > > > so this data corruption problem happened.
> > > > >
> > > > > We can fix it by always adding the folio into swapcache if we know the
> > > > > pinned swap entry can be found in zswap, so it won't get freed even though
> > > > > it can't be installed successfully in the first time of swapin.
> > > >
> > > > A concurrent faulting thread could have already checked the swapcache
> > > > before we add the folio to it, right? In this case, that thread will
> > > > go ahead and call swap_read_folio() anyway.
> > > >
> > > > Also, I suspect the zswap lookup might hurt performance. Would it be
> > > > better to add the folio back to zswap upon failure? This should be
> > > > detectable by checking if the folio is dirty as I mentioned in the bug
> > > > report thread.
> > >
> > > I don't like the idea either as sync-io is the fast path for zram etc.
> > > or, can we use
> > > the way of zram to free compressed data?
> >
> > I don't think we want to stop doing exclusive loads in zswap due to this
> > interaction with zram, which shouldn't be common.
> >
> > I think we can solve this by just writing the folio back to zswap upon
> > failure as I mentioned.
>
> Instead of storing again, can we avoid invalidating the entry in the
> first place if the load is not "exclusive"?
>
> The reason for exclusive loads is that the ownership is transferred to
> the swapcache, so there is no point in keeping our copy. With an
> optimistic read that doesn't transfer ownership, this doesn't
> apply. And we can easily tell inside zswap_load() if we're dealing
> with a swapcache read or not by testing the folio.
>
> The synchronous read already has to pin the swp_entry_t to be safe,
> using swapcache_prepare(). That blocks __read_swap_cache_async() which
> means no other (exclusive) loads and no invalidates can occur.
>
> The zswap entry is freed during the regular swap_free() path, which
> the sync fault calls on success. Otherwise we keep it.
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 535c907345e0..686364a6dd86 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1622,6 +1622,7 @@ bool zswap_load(struct folio *folio)
>          swp_entry_t swp = folio->swap;
>          pgoff_t offset = swp_offset(swp);
>          struct page *page = &folio->page;
> +        bool swapcache = folio_test_swapcache(folio);
>          struct zswap_tree *tree = swap_zswap_tree(swp);
>          struct zswap_entry *entry;
>          u8 *dst;
> @@ -1634,7 +1635,8 @@ bool zswap_load(struct folio *folio)
>                  spin_unlock(&tree->lock);
>                  return false;
>          }
> -        zswap_rb_erase(&tree->rbroot, entry);
> +        if (swapcache)
> +                zswap_rb_erase(&tree->rbroot, entry);
>          spin_unlock(&tree->lock);
>
>          if (entry->length)
> @@ -1649,9 +1651,10 @@ bool zswap_load(struct folio *folio)
>          if (entry->objcg)
>                  count_objcg_event(entry->objcg, ZSWPIN);
>
> -        zswap_entry_free(entry);
> -
> -        folio_mark_dirty(folio);
> +        if (swapcache) {
> +                zswap_entry_free(entry);
> +                folio_mark_dirty(folio);
> +        }
>
>          return true;
>  }

Hi Johannes,

After a few hours I haven't seen any problems.
If you don't mind, please add the following tags:

Tested-by: Zhongkun He
Reported-by: Zhongkun He

Thanks.
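To make the failure mode easier to see outside the thread, here is a minimal
userspace C sketch of the rule the diff above applies: only an exclusive load,
i.e. one whose folio is owned by the swapcache, invalidates the compressed
copy, so a discarded non-exclusive (swapcache-skipping) load leaves the data
recoverable for a second fault. This is not kernel code; the toy_store/toy_load
names and the whole data structure are invented for illustration.

/*
 * Toy model of "only invalidate the compressed copy on an exclusive load".
 * Build: cc -Wall -o toy toy.c
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NSLOTS  4
#define PAGE_SZ 16

struct toy_entry {
        bool present;
        char data[PAGE_SZ];
};

static struct toy_entry store[NSLOTS];  /* stands in for the zswap tree */

static void toy_store(int slot, const char *data)
{
        store[slot].present = true;
        snprintf(store[slot].data, PAGE_SZ, "%s", data);
}

/*
 * Load a slot into buf.  Erase the stored copy only when the caller owns
 * the page via the swapcache (mirrors the `if (swapcache)` checks above).
 */
static bool toy_load(int slot, char *buf, bool swapcache)
{
        if (!store[slot].present)
                return false;
        memcpy(buf, store[slot].data, PAGE_SZ);
        if (swapcache)
                store[slot].present = false;    /* exclusive load */
        return true;
}

int main(void)
{
        char buf[PAGE_SZ];

        toy_store(1, "page contents");

        /* First fault: synchronous path, swapcache skipped; the loaded
         * copy is then dropped because the PTE changed under us. */
        toy_load(1, buf, /*swapcache=*/false);

        /* Second fault: the compressed copy must still be there. */
        if (toy_load(1, buf, /*swapcache=*/false))
                printf("second load still sees: %s\n", buf);
        else
                puts("second load: data lost (the reported corruption)");

        /* The pre-fix behavior erased unconditionally; modeling that by
         * treating the first load as exclusive shows the data loss. */
        toy_store(2, "another page");
        toy_load(2, buf, /*swapcache=*/true);   /* copy erased */
        if (!toy_load(2, buf, /*swapcache=*/false))
                puts("old behavior: second load finds nothing");

        return 0;
}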