From: Yosry Ahmed <yosryahmed@google.com>
Date: Wed, 22 Mar 2023 20:27:24 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: "Huang, Ying"
Cc: Chris Li, lsf-pc@lists.linux-foundation.org, Johannes Weiner, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Yang Shi, Peter Xu, Minchan Kim, Andrew Morton,
 Aneesh Kumar K V, Wei Xu
In-Reply-To: <87v8isrwck.fsf@yhuang6-desk2.ccr.corp.intel.com>
On Wed, Mar 22, 2023 at 8:17 PM Huang, Ying wrote:
>
> Yosry Ahmed writes:
>
> > On Wed, Mar 22, 2023 at 6:50 PM Huang, Ying wrote:
> >>
> >> Yosry Ahmed writes:
> >>
> >> > On Sun, Mar 19, 2023 at 7:56 PM Huang, Ying wrote:
> >> >>
> >> >> Yosry Ahmed writes:
> >> >>
> >> >> > On Thu, Mar 16, 2023 at 12:51 AM Huang, Ying wrote:
> >> >> >>
> >> >> >> Yosry Ahmed writes:
> >> >> >>
> >> >> >> > On Sun, Mar 12, 2023 at 7:13 PM Huang, Ying wrote:
> >> >> >> >>
> >> >> >> >> Yosry Ahmed writes:
> >> >> >> >>
> >> >> >> >> > My current idea is to have one xarray that stores the swap_descs
> >> >> >> >> > (which include swap_entry, swapcache, swap_count, etc), and only for
> >> >> >> >> > rotating disks have an additional xarray that maps swap_entry ->
> >> >> >> >> > swap_desc for cluster readahead, assuming we can eliminate all other
> >> >> >> >> > situations requiring a reverse mapping.
> >> >> >> >> >
> >> >> >> >> > I am not sure how having separate xarrays helps? If we have one xarray,
> >> >> >> >> > might as well save the other lookups and put everything in swap_desc.
> >> >> >> >> > In fact, this should improve the locking today as swapcache /
> >> >> >> >> > swap_count operations can be lockless or very lightly contended.
> >> >> >> >>
> >> >> >> >> The condition of the proposal is "reverse mapping cannot be avoided for
> >> >> >> >> enough situations". So, if reverse mapping (or cluster readahead) can be
> >> >> >> >> avoided for enough situations, I think your proposal is good. Otherwise,
> >> >> >> >> I propose to use 2 xarrays. You don't need another reverse mapping
> >> >> >> >> xarray, because you just need to read the next several swap_entry into
> >> >> >> >> the swap cache for cluster readahead. swap_desc isn't needed for
> >> >> >> >> cluster readahead.
> >> >> >> >
> >> >> >> > swap_desc would be needed for cluster readahead in my original
> >> >> >> > proposal as the swap cache lives in swap_descs. Based on the current
> >> >> >> > implementation, we would need a reverse mapping (swap entry ->
> >> >> >> > swap_desc) in 3 situations:
> >> >> >> >
> >> >> >> > 1) __try_to_reclaim_swap(): when trying to find an empty swap slot and
> >> >> >> > failing, we fall back to trying to find swap entries that only have a
> >> >> >> > page in the swap cache (no references in page tables or page cache)
> >> >> >> > and free them. This would require a reverse mapping.
> >> >> >> >
> >> >> >> > 2) swapoff: we need to swap in all entries in a swapfile, so we need
> >> >> >> > to get all swap_descs associated with that swapfile.
> >> >> >> >
> >> >> >> > 3) swap cluster readahead.
> >> >> >> >
> >> >> >> > For (1), I think we can drop the dependency on a reverse mapping if we
> >> >> >> > free swap entries once we swap a page in and add it to the swap cache,
> >> >> >> > even if the swap count does not drop to 0.
> >> >> >>
> >> >> >> Now, we will not drop the swap cache even if the swap count becomes 0 if
> >> >> >> swap space utility < 50%. Per my understanding, this avoids swap page
> >> >> >> writing for read accesses. So I don't think we can change this directly
> >> >> >> without necessary discussion first.
> >> >> >
> >> >> > Right. I am not sure I understand why we do this today, is it to save
> >> >> > the overhead of allocating a new swap entry if the page is swapped out
> >> >> > again soon? I am not sure I understand this statement "this avoids swap
> >> >> > page writing for read accesses".
> >> >> >
> >> >> >> > For (2), instead of scanning page tables and shmem page cache to find
> >> >> >> > swapped out pages for the swapfile, we can scan all swap_descs
> >> >> >> > instead, which should be more efficient. This is one of the proposal's
> >> >> >> > potential advantages.
> >> >> >>
> >> >> >> Good.
> >> >> >>
> >> >> >> > (3) is the one that would still need a reverse mapping with the
> >> >> >> > current proposal. Today we use swap cluster readahead for anon pages
> >> >> >> > if we have a spinning disk or vma readahead is disabled. For shmem, we
> >> >> >> > always use cluster readahead. If we can limit cluster readahead to
> >> >> >> > only rotating disks, then the reverse mapping can only be maintained
> >> >> >> > for swapfiles on rotating disks. Otherwise, we will need to maintain a
> >> >> >> > reverse mapping for all swapfiles.
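To make the shape of this concrete, here is a toy userspace Python model of the idea discussed above: a forward map from swap ID to swap_desc, with a reverse map (swap entry -> swap ID) maintained only for rotating disks, where cluster readahead still needs to walk adjacent device slots. This is purely illustrative — all names are hypothetical stand-ins, not the kernel data structures.

```python
# Toy model: forward map (swap ID -> swap_desc) plus an optional reverse
# map (swap entry -> swap ID) kept only for rotating disks, where cluster
# readahead needs entry-order lookups. Not kernel code.
from dataclasses import dataclass
from itertools import count
from typing import Optional

@dataclass
class SwapDesc:
    swap_entry: Optional[int]            # slot on a swap device (None if in zswap)
    swap_count: int = 1
    swapcache_page: Optional[object] = None

class SwapTable:
    def __init__(self, rotating_disk=False):
        self.descs = {}                  # models the swap_desc xarray
        self.reverse = {}                # swap_entry -> swap ID (rotating only)
        self.rotating = rotating_disk
        self._ids = count()

    def alloc(self, swap_entry):
        sid = next(self._ids)
        self.descs[sid] = SwapDesc(swap_entry)
        if self.rotating:
            self.reverse[swap_entry] = sid
        return sid

    def desc(self, sid):
        return self.descs[sid]

    def readahead_neighbors(self, swap_entry, window=2):
        # Cluster readahead walks adjacent device slots, hence the reverse map.
        return [self.reverse[e]
                for e in range(swap_entry + 1, swap_entry + 1 + window)
                if e in self.reverse]
```

On an SSD-backed table (`rotating_disk=False`) the reverse map stays empty, which is the memory saving the proposal is after.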
> >> >> >>
> >> >> >> For shmem, I think that it should be good to readahead based on shmem
> >> >> >> file offset instead of swap device offset.
> >> >> >>
> >> >> >> It's possible that some pages in the readahead window are from HDD while
> >> >> >> some other pages aren't. So it's a little hard to enable cluster read
> >> >> >> for HDD only. Anyway, it's not common to use HDD for swap now.
> >> >> >>
> >> >> >> >> > If the point is to store the swap_desc directly inside the xarray to
> >> >> >> >> > save 8 bytes, I am concerned that having multiple xarrays for
> >> >> >> >> > swapcache, swap_count, etc will use more than that.
> >> >> >> >>
> >> >> >> >> The idea is to save the memory used by the reverse mapping xarray.
> >> >> >> >
> >> >> >> > I see.
> >> >> >> >
> >> >> >> >> >> >> > Keep in mind that the current overhead is 1 byte O(max swap pages) not
> >> >> >> >> >> >> > O(swapped). Also, 1 byte is assuming we do not use the swap
> >> >> >> >> >> >> > continuation pages. If we do, it may end up being more. We also
> >> >> >> >> >> >> > allocate continuation in full 4k pages, so even if one swap_map
> >> >> >> >> >> >> > element in a page requires continuation, we will allocate an entire
> >> >> >> >> >> >> > page. What I am trying to say is that to get an actual comparison you
> >> >> >> >> >> >> > need to also factor in the swap utilization and the rate of usage of
> >> >> >> >> >> >> > swap continuation. I don't know how to come up with a formula for this
> >> >> >> >> >> >> > tbh.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Also, like Johannes said, the worst case overhead (32 bytes if you
> >> >> >> >> >> >> > count the reverse mapping) is 0.8% of swapped memory, aka 8M for every
> >> >> >> >> >> >> > 1G swapped. It doesn't sound *very* bad. I understand that it is pure
> >> >> >> >> >> >> > overhead for people not using zswap, but it is not very awful.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> It seems what you really need is one bit of information to indicate
> >> >> >> >> >> >> >> this page is backed by zswap. Then you can have a separate pointer
> >> >> >> >> >> >> >> for the zswap entry.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > If you use one bit in swp_entry_t (or one of the available swap types)
> >> >> >> >> >> >> > to indicate whether the page is backed with a swapfile or zswap it
> >> >> >> >> >> >> > doesn't really work. We lose the indirection layer. How do we move the
> >> >> >> >> >> >> > page from zswap to swapfile? We need to go update the page tables and
> >> >> >> >> >> >> > the shmem page cache, similar to swapoff.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Instead, if we store a key else in swp_entry_t and use this to look up
> >> >> >> >> >> >> > the swp_entry_t or zswap_entry pointer then that's essentially what
> >> >> >> >> >> >> > the swap_desc does. It just goes the extra mile of unifying the
> >> >> >> >> >> >> > swapcache as well and storing it directly in the swap_desc instead of
> >> >> >> >> >> >> > storing it in another lookup structure.
> >> >> >> >> >> >>
> >> >> >> >> >> >> If we choose to make sizeof(struct swap_desc) == 8, that is, store only
> >> >> >> >> >> >> swap_entry in swap_desc. The added indirection appears to be another
> >> >> >> >> >> >> level of page table with 1 entry. Then, we may use the similar method
> >> >> >> >> >> >> as supporting systems with 2 level and 3 level page tables, like the code
> >> >> >> >> >> >> in include/asm-generic/pgtable-nopmd.h. But I haven't thought about
> >> >> >> >> >> >> this deeply.
> >> >> >> >> >> >
> >> >> >> >> >> > Can you expand further on this idea? I am not sure I fully understand.
> >> >> >> >> >>
> >> >> >> >> >> OK. The goal is to avoid the overhead if indirection isn't enabled via
> >> >> >> >> >> kconfig.
> >> >> >> >> >>
> >> >> >> >> >> If indirection isn't enabled, store swap_entry in the PTE directly.
> >> >> >> >> >> Otherwise, store the index of swap_desc in the PTE. Different functions
> >> >> >> >> >> (e.g., to get/set swap_entry in the PTE) are implemented based on kconfig.
> >> >> >> >> >
> >> >> >> >> > I thought about this, the problem is that we will have multiple
> >> >> >> >> > implementations of multiple things. For example, swap_count without
> >> >> >> >> > the indirection layer lives in the swap_map (with continuation logic).
> >> >> >> >> > With the indirection layer, it lives in the swap_desc (or somewhere
> >> >> >> >> > else). Same for the swapcache. Even if we keep the swapcache in an
> >> >> >> >> > xarray and not inside swap_desc, it would be indexed by swap_entry if
> >> >> >> >> > the indirection is disabled, and by swap_desc (or similar) if the
> >> >> >> >> > indirection is enabled. I think maintaining separate implementations
> >> >> >> >> > for when the indirection is enabled/disabled would be adding too much
> >> >> >> >> > complexity.
> >> >> >> >> >
> >> >> >> >> > WDYT?
> >> >> >> >>
> >> >> >> >> If we go this way, swap cache and swap_count will always be indexed by
> >> >> >> >> swap_entry. swap_desc just provides an indirection to make it possible
> >> >> >> >> to move between swap devices.
> >> >> >> >>
> >> >> >> >> Why must we index swap cache and swap_count by swap_desc if indirection
> >> >> >> >> is enabled? Yes, we can save one xarray indexing if we do so, but I
> >> >> >> >> don't think the overhead of one xarray indexing is a showstopper.
> >> >> >> >>
> >> >> >> >> I think this can be one intermediate step towards your final target.
> >> >> >> >> The changes to the current implementation can be smaller.
> >> >> >> >
> >> >> >> > IIUC, the idea is to have two xarrays:
> >> >> >> > (a) xarray that stores a pointer to a struct containing swap_count and
> >> >> >> > swap cache.
> >> >> >> > (b) xarray that stores the underlying swap entry or zswap entry.
> >> >> >> >
> >> >> >> > When indirection is disabled:
> >> >> >> > page tables & page cache have the swap entry directly like today, xarray
> >> >> >> > (a) is indexed by swap entry, xarray (b) does not exist. No reverse
> >> >> >> > mapping needed.
> >> >> >> >
> >> >> >> > In this case we have an extra overhead of 12-16 bytes (the struct
> >> >> >> > containing swap_count and swap cache) vs. 24 bytes of the swap_desc.
> >> >> >> >
> >> >> >> > When indirection is enabled:
> >> >> >> > page tables & page cache have a swap id (or swap_desc index), xarray
> >> >> >> > (a) is indexed by swap id,
> >> >> >>
> >> >> >> xarray (a) is indexed by swap entry.
> >> >> >
> >> >> > How so? With the indirection enabled, the page tables & page cache
> >> >> > have the swap id (or swap_desc index), which can point to a swap entry
> >> >> > or a zswap entry -- which can change when the page is moved between
> >> >> > zswap & swapfiles. How is xarray (a) indexed by the swap entry in this
> >> >> > case? Shouldn't it be indexed by the abstract swap id so that the
> >> >> > writeback from zswap is transparent?
> >> >>
> >> >> In my mind,
> >> >>
> >> >> - swap core will define an abstract interface to swap implementations
> >> >>   (zswap, swap device/file, maybe more in the future), like VFS.
> >> >>
> >> >> - zswap will be a special swap implementation (compressing instead of
> >> >>   writing to disk).
> >> >>
> >> >> - swap core will manage the indirection layer and swap cache.
> >> >>
> >> >> - swap core can move swap pages between swap implementations (e.g., from
> >> >>   zswap to a swap device, or from one swap device to another swap
> >> >>   device) with the help of the indirection layer.
> >> >>
> >> >> In this design, the writeback from zswap becomes moving swapped pages
> >> >> from zswap to a swap device.
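The layering sketched above can be illustrated with a toy userspace model: a swap core that owns the indirection table and talks to swap implementations through one abstract interface, VFS-style. All class and method names here are hypothetical illustrations; the dicts stand in for xarrays and the strings for page contents.

```python
# Toy sketch: swap core + abstract backend interface. "Writeback" is just
# moving a page between backends; PTEs hold only the stable swap ID.
class SwapBackend:
    def store(self, page): ...       # returns a backend-private handle
    def load(self, handle): ...
    def free(self, handle): ...

class ZswapBackend(SwapBackend):
    def __init__(self):
        self.pool, self._next = {}, 0
    def store(self, page):
        h, self._next = self._next, self._next + 1
        self.pool[h] = ("compressed", page)   # tuple stands in for compression
        return h
    def load(self, handle):
        return self.pool[handle][1]           # "decompress"
    def free(self, handle):
        del self.pool[handle]

class SwapDeviceBackend(SwapBackend):
    def __init__(self):
        self.slots, self._next = {}, 0
    def store(self, page):
        h, self._next = self._next, self._next + 1
        self.slots[h] = page                  # stands in for disk I/O
        return h
    def load(self, handle):
        return self.slots[handle]
    def free(self, handle):
        del self.slots[handle]

class SwapCore:
    """Owns the indirection layer: swap ID -> (backend, handle)."""
    def __init__(self):
        self.table, self._ids = {}, 0
    def swap_out(self, page, backend):
        sid, self._ids = self._ids, self._ids + 1
        self.table[sid] = (backend, backend.store(page))
        return sid                            # this ID is what goes in the PTE
    def swap_in(self, sid):
        backend, handle = self.table[sid]
        return backend.load(handle)
    def move(self, sid, dst):
        # zswap writeback: move between implementations without touching PTEs.
        src, handle = self.table[sid]
        page = src.load(handle)
        src.free(handle)
        self.table[sid] = (dst, dst.store(page))
```

The point of the sketch is that `move()` only rewrites the indirection table entry, which is why writeback becomes transparent to page tables and the shmem page cache.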
> >> >
> >> > All the above matches my understanding of this proposal. swap_desc is
> >> > the proposed indirection layer, and the swap implementations are zswap
> >> > & swap devices. For now, we only have 2 static swap implementations
> >> > (zswap->swapfile). In the future, we can make this more dynamic as the
> >> > need arises.
> >>
> >> Great to align with you on this.
> >
> > Great!
> >
> >> >> If my understanding were correct, your suggestion is kind of moving
> >> >> zswap logic to the swap core? And zswap will always be at a higher
> >> >> layer on top of swap device/file?
> >> >
> >> > We do not want to move the zswap logic into the swap core, we want to
> >> > make the swap core independent of the swap implementation, and zswap
> >> > is just one possible implementation.
> >>
> >> Good!
> >>
> >> I found that you put zswap related data structures inside struct
> >> swap_desc directly. I think that we should avoid that as much as
> >> possible.
> >
> > That was just an initial draft to show what it would be like, I do not
> > intend to hardcode zswap-specific items in swap_desc.
> >
> >> >> >> > xarray (b) is indexed by swap id as well
> >> >> >> > and contains the swap entry or zswap entry. Reverse mapping might be
> >> >> >> > needed.
> >> >> >>
> >> >> >> Reverse mapping isn't needed.
> >> >> >
> >> >> > It would be needed if xarray (a) is indexed by the swap id. I am not
> >> >> > sure I understand how it can be indexed by the swap entry if the
> >> >> > indirection is enabled.
> >> >> >
> >> >> >> > In this case we have an extra overhead of 12-16 bytes + 8 bytes for
> >> >> >> > the xarray (b) entry + memory overhead from the 2nd xarray + reverse
> >> >> >> > mapping where needed.
> >> >> >> >
> >> >> >> > There is also the extra cpu overhead for an extra lookup in certain paths.
> >> >> >> >
> >> >> >> > Is my analysis correct? If yes, I agree that the original proposal is
> >> >> >> > good if the reverse mapping can be avoided in enough situations, and
> >> >> >> > that we should consider such alternatives otherwise. As I mentioned
> >> >> >> > above, I think it comes down to whether we can completely restrict
> >> >> >> > cluster readahead to rotating disks or not -- in which case we need to
> >> >> >> > decide what to do for shmem and for anon when vma readahead is
> >> >> >> > disabled.
> >> >> >>
> >> >> >> We can even have a minimal indirection implementation, where swap
> >> >> >> cache and swap_map[] are kept as they were before and just one xarray
> >> >> >> is added. The xarray is indexed by swap id (or swap_desc index) to
> >> >> >> store the corresponding swap entry.
> >> >> >>
> >> >> >> When indirection is disabled, no extra overhead.
> >> >> >>
> >> >> >> When indirection is enabled, the extra overhead is just 8 bytes per
> >> >> >> swapped page.
> >> >> >>
> >> >> >> The basic migration support can be built on top of this.
> >> >> >>
> >> >> >> I think that this could be a baseline for indirection support. Then
> >> >> >> further optimization can be built on top of it step by step with
> >> >> >> supporting data.
> >> >> >
> >> >> > I am not sure how this works with zswap. Currently the swap_map[]
> >> >> > implementation is specific to swapfiles, it does not work for zswap
> >> >> > unless we implement separate swap counting logic for zswap &
> >> >> > swapfiles. Same for the swapcache, it currently supports being indexed
> >> >> > by a swap entry, it would need to support being indexed by a swap id,
> >> >> > or have a separate swap cache for zswap. Having separate
> >> >> > implementations would add complexity, and we would need to perform
> >> >> > handoffs of the swap count/cache when a page is moved from zswap to a
> >> >> > swapfile.
> >> >>
> >> >> We can allocate a swap entry for each swapped page in zswap.
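The "minimal indirection" baseline described above can be sketched as a toy model: swap cache and swap_map[] stay keyed by swap entry as today, and the only addition is one table mapping swap ID -> swap entry, which disappears entirely when indirection is disabled. Names are hypothetical; a dict stands in for the added xarray.

```python
# Toy model of the minimal-indirection baseline: one added swap ID ->
# swap entry table, compiled out when indirection is disabled. Not kernel code.
class MinimalIndirection:
    def __init__(self, enabled):
        self.enabled = enabled
        self.id_to_entry = {}          # the single added xarray (~8 bytes/page)
        self._next = 0

    def record(self, swap_entry):
        """Called at swap-out; returns what gets stored in the PTE."""
        if not self.enabled:
            return swap_entry          # PTE holds the entry directly: no overhead
        sid, self._next = self._next, self._next + 1
        self.id_to_entry[sid] = swap_entry
        return sid

    def resolve(self, pte_value):
        """Called at fault time to get the entry keying swap cache/swap_map."""
        if not self.enabled:
            return pte_value
        return self.id_to_entry[pte_value]

    def migrate(self, sid, new_entry):
        """Move the page to another device slot without touching PTEs."""
        assert self.enabled
        self.id_to_entry[sid] = new_entry
```

With `enabled=False` the table is never populated, which models the "no extra overhead" case; with `enabled=True`, `migrate()` is the basic migration support the baseline enables.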
> >> >
> >> > This is exactly what the current implementation does and what we want
> >> > to move away from. The current implementation uses zswap as an
> >> > in-memory compressed cache on top of an actual swap device, and each
> >> > swapped page in zswap has a swap entry allocated. With this
> >> > implementation, zswap cannot be used without a swap device.
> >>
> >> I totally agree that we should avoid using an actual swap device under
> >> zswap. And, as a swap implementation, zswap can manage the swap entry
> >> inside zswap without an underlying actual swap device. For example,
> >> when we swap a page to zswap (actually compress), we can allocate a
> >> (virtual) swap entry in zswap. I understand that there's overhead
> >> to manage the swap entry in zswap. We can consider how to reduce the
> >> overhead.
> >
> > I see. So we can (for example) use one of the swap types for zswap,
> > and then have zswap code handle this entry according to its
> > implementation. We can then have an xarray that maps swap ID -> swap
> > entry, and this swap entry is used to index the swap cache and such.
> > When a swapped page is moved between backends we update the swap ID ->
> > swap entry xarray.
> >
> > This is one possible implementation that I thought of (very briefly
> > tbh), but it does have its problems:
> >
> > For zswap:
> > - Managing swap entries inside zswap unnecessarily.
> > - We need to maintain a swap entry -> zswap entry mapping in zswap --
> > similar to the current rbtree, which is something that we can get rid
> > of with the initial proposal if we embed the zswap_entry pointer
> > directly in the swap_desc (it can be encoded to avoid breaking the
> > abstraction).
> >
> > For mm/swap in general:
> > - When we allocate a swap entry today, we store it in folio->private
> > (or page->private), which is used by the unmapping code to be placed
> > in the page tables or shmem page cache. With this implementation, we
> > need to store the swap ID in page->private instead, which means that
> > every time we need to access the swap cache during reclaim/swapout we
> > need to look up the swap entry first.
> > - On the fault path, we need two lookups instead of one (swap ID ->
> > swap entry, swap entry -> swap cache), not sure how this affects fault
> > latency.
> > - Each swap backend will have its own separate implementation of swap
> > counting, which is hard to maintain and very error-prone since the
> > logic is backend-agnostic.
> > - Handing over a page from one swap backend to another includes
> > handing over swap cache entries and swap counts, which I imagine will
> > involve considerable synchronization.
> >
> > Do you have any thoughts on this?
>
> Yes. I understand there's additional overhead. I have no clear idea
> about how to reduce this now. We need to think about that in depth.
>
> The bottom line is whether this is worse than the current zswap
> implementation?

It's not just zswap, as I note above, this design would introduce some
overheads to the core swapping code as well as long as the indirection
layer is active. I am particularly worried about the extra lookups on
the fault path.

For zswap, we already have a lookup today, so maintaining a swap entry ->
zswap entry mapping would not be a regression, but I am not sure about
the extra overhead to manage swap entries within zswap. Keep in mind
that using swap entries for zswap probably implies having a fixed/max
size for zswap (to be able to manage the swap entries efficiently
similar to swap devices), which is a limitation that the initial
proposal was hoping to overcome.

> Best Regards,
> Huang, Ying
>
> >> >> >> >> >> >> >> Depending on how much you are going to reuse the swap cache, you might
> >> >> >> >> >> >> >> need to have something like a swap_info_struct to keep the locks happy.
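The fault-path concern raised above (two lookups instead of one) can be made concrete with a toy model where a counter stands in for lookup cost: with the extra mapping active, a fault resolves swap ID -> swap entry and then swap entry -> swap cache, versus a single swap-cache lookup today. All structures are hypothetical stand-ins for xarrays.

```python
# Toy illustration of the fault-path lookup count with and without the
# added swap ID -> swap entry mapping. Not kernel code.
class FaultPath:
    def __init__(self, indirection):
        self.indirection = indirection
        self.id_to_entry = {}    # present only with indirection enabled
        self.swap_cache = {}     # keyed by swap entry, as today
        self.lookups = 0

    def _lookup(self, table, key):
        self.lookups += 1        # counter stands in for the cost of one xarray walk
        return table.get(key)

    def fault(self, pte_value):
        if self.indirection:
            entry = self._lookup(self.id_to_entry, pte_value)
        else:
            entry = pte_value    # PTE already holds the swap entry
        return self._lookup(self.swap_cache, entry)
```

Whether the second walk matters in practice would depend on cache behavior of the real xarray, which this sketch deliberately does not model.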
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > My current intention is to reimplement the swapcache completely as a
> >> >> >> >> >> >> > pointer in struct swap_desc. This would eliminate this need and a lot
> >> >> >> >> >> >> > of the locking we do today if I get things right.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Another potential concern is readahead. With this design, we have no
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Readahead is for spinning disk :-) Even a normal swap file with an SSD can
> >> >> >> >> >> >> >> use some modernization.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Yeah, I initially thought we would only need the swp_entry_t ->
> >> >> >> >> >> >> > swap_desc reverse mapping for readahead, and that we can only store
> >> >> >> >> >> >> > that for spinning disks, but I was wrong. We need it for other things as
> >> >> >> >> >> >> > well today: swapoff, and when trying to find an empty swap slot and we
> >> >> >> >> >> >> > start trying to free swap slots used only by the swapcache. However, I
> >> >> >> >> >> >> > think both of these cases can be fixed (I can share more details if
> >> >> >> >> >> >> > you want). If everything goes well we should only need to maintain the
> >> >> >> >> >> >> > reverse mapping (extra overhead above 24 bytes) for swap files on
> >> >> >> >> >> >> > spinning disks for readahead.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> Looking forward to your discussion.
> >> >> >> >> >>
> >> >> >> >> >> Per my understanding, the indirection is to make it easy to move
> >> >> >> >> >> (swapped) pages among swap devices based on hot/cold. This is similar
> >> >> >> >> >> to the target of memory tiering. It appears that we can extend the
> >> >> >> >> >> memory tiering (mm/memory-tiers.c) framework to cover swap devices too?
> >> >> >> >> >> Is it possible for zswap to be faster than some slow memory media?
> >> >> >> >> >
> >> >> >> >> > Agree with Chris that this may require a much larger overhaul. A slow
> >> >> >> >> > memory tier is still addressable memory, swap/zswap requires a page
> >> >> >> >> > fault to read the pages. I think (at least for now) there is a
> >> >> >> >> > fundamental difference. We want reclaim to eventually treat slow
> >> >> >> >> > memory & swap as just different tiers to place cold memory in with
> >> >> >> >> > different characteristics, but otherwise I think the swapping
> >> >> >> >> > implementation itself is very different. Am I missing something?
> >> >> >> >>
> >> >> >> >> Is it possible that zswap is faster than a really slow memory
> >> >> >> >> addressable device backed by NAND? TBH, I don't have the answer.
> >> >> >> >
> >> >> >> > I am not sure either.
> >> >> >> >
> >> >> >> >> Anyway, do you need a way to describe the tiers of the swap devices?
> >> >> >> >> So, you can move the cold pages among the swap devices based on that?
> >> >> >> >
> >> >> >> > For now I think the "tiers" in this proposal are just zswap and normal
> >> >> >> > swapfiles. We can later extend it to support more explicit tiering.
> >> >> >>
> >> >> >> IIUC, in the original zswap implementation, there's a 1:1 relationship
> >> >> >> between zswap and the normal swapfile. But now, you make demoting among
> >> >> >> swap devices more general. Then we need some general way to specify which
> >> >> >> swap devices are fast and which are slow, and the demoting relationship
> >> >> >> among them. It can be memory tiers or something else, but we need one.
> >> >> >
> >> >> > I think for this proposal, there are only 2 hardcoded tiers. Zswap is
> >> >> > fast, swapfile is slow. In the future, we can support more dynamic
> >> >> > tiering if the need arises.
> >> >>
> >> >> We can start from a simple implementation. And I think that it's better
> >> >> to consider the general design too. Try not to make it impossible now.
> >> >
> >> > Right. I am proposing we come up with an abstract generic interface
> >> > for swap implementations, and have 2 implementations statically
> >> > defined (swapfiles and zswap). If the need arises, we can make swap
> >> > implementations more dynamic in the future.
> >> >
> >> >> Best Regards,
> >> >> Huang, Ying
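The "two hardcoded tiers" idea discussed above can be sketched as a toy model: zswap as the fast tier, a swapfile as the slow tier, with demotion moving cold pages down while the swap ID stays stable. Entirely illustrative; the names and the dict-based "tiers" are hypothetical stand-ins.

```python
# Toy sketch of two hardcoded swap tiers: zswap (fast) over a swapfile
# (slow), with demotion that never touches page tables. Not kernel code.
class TwoTierSwap:
    def __init__(self):
        self.fast = {}   # stands in for zswap (compressed, in memory)
        self.slow = {}   # stands in for a swapfile

    def swap_out(self, sid, page):
        self.fast[sid] = page          # cold pages land in the fast tier first

    def demote(self, sid):
        # Writeback: move from zswap to the swapfile. The swap ID is stable,
        # so nothing that references the page (PTEs, page cache) changes.
        self.slow[sid] = self.fast.pop(sid)

    def swap_in(self, sid):
        return self.fast.get(sid, self.slow.get(sid))
```

More dynamic tiering would replace the two dicts with an ordered list of backends plus a demotion relationship between them, which is the generalization the thread leaves for later.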