From: Kalesh Singh
Date: Mon, 27 Feb 2023 20:29:00 -0800
Subject: Re: [LSF/MM/BPF TOPIC] Swap Abstraction / Native Zswap
To: Yosry Ahmed
Cc: Johannes Weiner, Yang Shi, lsf-pc@lists.linux-foundation.org, Linux-MM,
 Michal Hocko, Shakeel Butt, David Rientjes, Hugh Dickins, Seth Jennings,
 Dan Streetman, Vitaly Wool, Peter Xu, Minchan Kim, Andrew Morton,
 Nhat Pham, Akilesh Kailash

On Wed, Feb 22, 2023 at 2:47 PM Yosry Ahmed wrote:
>
> On Wed, Feb 22, 2023 at 8:57 AM Johannes Weiner wrote:
> >
> > Hello,
> >
> > thanks for proposing this, Yosry. I'm very interested in this work.
> > Unfortunately, I won't be able to attend LSF/MM/BPF myself this time
> > around due to a scheduling conflict :(
>
> Ugh, would have been great to have you. I guess there might be a remote
> option, or we will end up discussing it on the mailing list eventually
> anyway.
>
> > On Tue, Feb 21, 2023 at 03:38:57PM -0800, Yosry Ahmed wrote:
> > > On Tue, Feb 21, 2023 at 3:34 PM Yang Shi wrote:
> > > >
> > > > On Tue, Feb 21, 2023 at 11:46 AM Yosry Ahmed wrote:
> > > > >
> > > > > On Tue, Feb 21, 2023 at 11:26 AM Yang Shi wrote:
> > > > > >
> > > > > > On Tue, Feb 21, 2023 at 10:56 AM Yosry Ahmed wrote:
> > > > > > >
> > > > > > > On Tue, Feb 21, 2023 at 10:40 AM Yang Shi wrote:
> > > > > > > >
> > > > > > > > Hi Yosry,
> > > > > > > >
> > > > > > > > Thanks for proposing this topic. I was thinking about this before,
> > > > > > > > but I didn't make too much progress due to some other distractions,
> > > > > > > > and I have a couple of follow-up questions about your design.
> > > > > > > > Please see the inline comments below.
> > > > > > >
> > > > > > > Great to see interested folks, thanks!
> > > > > > >
> > > > > > > > On Sat, Feb 18, 2023 at 2:39 PM Yosry Ahmed wrote:
> > > > > > > > >
> > > > > > > > > Hello everyone,
> > > > > > > > >
> > > > > > > > > I would like to propose a topic for the upcoming LSF/MM/BPF in
> > > > > > > > > May 2023 about swap & zswap (hope I am not too late).
> > > > > > > > >
> > > > > > > > > ==================== Intro ====================
> > > > > > > > > Currently, using zswap is dependent on swapfiles in an unnecessary
> > > > > > > > > way. To use zswap, you need a swapfile configured (even if the
> > > > > > > > > space will not be used) and zswap is restricted by the swapfile's
> > > > > > > > > size. When pages reside in zswap, the corresponding swap entry in
> > > > > > > > > the swapfile cannot be used, and is essentially wasted. We also go
> > > > > > > > > through unnecessary code paths when using zswap, such as finding
> > > > > > > > > and allocating a swap entry on the swapout path, or readahead in
> > > > > > > > > the swapin path. I am proposing a swapping abstraction layer that
> > > > > > > > > would allow us to remove zswap's dependency on swapfiles. This can
> > > > > > > > > be done by introducing a data structure between the actual
> > > > > > > > > swapping implementation (swapfiles, zswap) and the rest of the MM
> > > > > > > > > code.
> > > > > > > > >
> > > > > > > > > ==================== Objective ====================
> > > > > > > > > Enabling the use of zswap without a backing swapfile, which makes
> > > > > > > > > zswap useful for a wider variety of use cases. Also, when zswap is
> > > > > > > > > used with a swapfile, the pages in zswap do not use up space in
> > > > > > > > > the swapfile, so the overall swapping capacity increases.
> > > > > > > > >
> > > > > > > > > ==================== Idea ====================
> > > > > > > > > Introduce a data structure, which I currently call a swap_desc, as
> > > > > > > > > an abstraction layer between the swapping implementation and the
> > > > > > > > > rest of the MM code.
> > > > > > > > > Page tables & page caches would store a swap id (encoded as a
> > > > > > > > > swp_entry_t) instead of directly storing the swap entry associated
> > > > > > > > > with the swapfile. This swap id maps to a struct swap_desc, which
> > > > > > > > > acts as our abstraction layer. All MM code not concerned with
> > > > > > > > > swapping details would operate in terms of swap descs. The
> > > > > > > > > swap_desc can point to either a normal swap entry (associated with
> > > > > > > > > a swapfile) or a zswap entry. It can also include all non-backend
> > > > > > > > > specific operations, such as the swapcache (which would be a
> > > > > > > > > simple pointer in swap_desc), swap counting, etc. It creates a
> > > > > > > > > clear, nice abstraction layer between MM code and the actual
> > > > > > > > > swapping implementation.
> > > > > > > >
> > > > > > > > How will the swap_desc be allocated? Dynamically or preallocated?
> > > > > > > > Is it 1:1 mapped to the swap slots on swap devices (whatever it is
> > > > > > > > backed by, for example, zswap, swap partition, swapfile, etc)?
> > > > > > >
> > > > > > > I imagine swap_desc's would be dynamically allocated when we need to
> > > > > > > swap something out. When allocated, a swap_desc would either point to
> > > > > > > a zswap_entry (if available), or a swap slot otherwise. In this case,
> > > > > > > it would be 1:1 mapped to swapped out pages, not the swap slots on
> > > > > > > devices.
> > > > > >
> > > > > > It makes sense to be 1:1 mapped to swapped out pages if the swapfile
> > > > > > is used as the backing of zswap.
> > > > > >
> > > > > > > I know that it might not be ideal to make allocations on the reclaim
> > > > > > > path (although it would be a small-ish slab allocation so we might
> > > > > > > be able to get away with it), but otherwise we would have statically
> > > > > > > allocated swap_desc's for all swap slots on a swap device, even
> > > > > > > unused ones, which I imagine is too expensive. Also for things like
> > > > > > > zswap, it doesn't really make sense to preallocate at all.
> > > > > >
> > > > > > Yeah, it is not perfect to allocate memory in the reclamation path. We
> > > > > > do have such cases, but the fewer the better IMHO.
> > > > >
> > > > > Yeah. Perhaps we can preallocate a pool of swap_desc's on top of the
> > > > > slab cache, idk if that makes sense, or if there is a way to tell slab
> > > > > to proactively refill a cache.
> > > > >
> > > > > I am open to suggestions here. I don't think we should/can preallocate
> > > > > the swap_desc's, and we cannot completely eliminate the allocations in
> > > > > the reclaim path. We can only try to minimize them through caching,
> > > > > etc. Right?
> > > >
> > > > Yeah, preallocation would not work. But I'm not sure whether caching
> > > > works well for this case either. I suppose you were thinking about
> > > > something similar to pcp: when the available number of elements is
> > > > lower than a threshold, refill the cache. It should work well with
> > > > moderate memory pressure. But I'm not sure how it would behave with
> > > > severe memory pressure, particularly when anonymous memory dominates
> > > > the memory usage. Or maybe dynamic allocation works well and we are
> > > > just over-engineering.
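
(Interjecting with a rough sketch to check my understanding of the object
being sized here. Everything below is my own guess for illustration -- the
field names and layout are not from the proposal, just one way the pieces
described above could fit together:)

/*
 * Illustrative sketch only. Assumes the pieces described in the proposal:
 * a backend reference (swapfile slot or zswap entry), a swapcache pointer,
 * and swap counting. swp_entry_t and struct folio are existing kernel
 * types; struct zswap_entry is zswap's internal entry.
 */
struct swap_desc {
        union {
                swp_entry_t slot;               /* backed by a swapfile slot */
                struct zswap_entry *zswap;      /* backed by a zswap entry */
        };
        struct folio *swapcache;                /* swapcache, a simple pointer here */
        unsigned int swap_count;                /* replaces swap_map + continuations */
        unsigned int flags;                     /* e.g. which backend is in use */
};

On 64-bit something of that shape lands in the ballpark of 24-32 bytes per
swapped-out page, i.e. the kind of small slab object being discussed here.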
> > >
> > > Yeah it would be interesting to look into whether the swap_desc
> > > allocation will be a bottleneck. Definitely something to look out for.
> > > I share your thoughts about wanting to do something about it but also
> > > not wanting to over-engineer it.
> >
> > I'm not too concerned by this. It's a PF_MEMALLOC allocation, meaning
> > it's not subject to watermarks. And the swapped page is freed right
> > afterwards. As long as the compression delta exceeds the size of
> > swap_desc, the process is a net reduction in allocated memory. For
> > regular swap, the only requirement is that swap_desc < page_size() :-)
> >
> > To put this into perspective, the zswap backends allocate backing
> > pages on-demand during reclaim. zsmalloc also kmallocs metadata in
> > that path. We haven't had any issues with this in production, even
> > under fairly severe memory pressure scenarios.
>
> Right. The only problem would be for pages that do not compress well
> in zswap, in which case we might not end up freeing memory. As you
> said, this is already happening today with zswap tho.
>
> > > > > > > > > ==================== Benefits ====================
> > > > > > > > > This work enables using zswap without a backing swapfile and
> > > > > > > > > increases the swap capacity when zswap is used with a swapfile.
> > > > > > > > > It also creates a separation that allows us to skip code paths
> > > > > > > > > that don't make sense in the zswap path (e.g. readahead). We get
> > > > > > > > > to drop zswap's rbtree, which might result in better performance
> > > > > > > > > (fewer lookups, less lock contention).
> > > > > > > > >
> > > > > > > > > The abstraction layer also opens the door for multiple cleanups
> > > > > > > > > (e.g. removing swapper address spaces, removing swap count
> > > > > > > > > continuation code, etc). Another nice cleanup that this work
> > > > > > > > > enables would be separating the overloaded swp_entry_t into two
> > > > > > > > > distinct types: one for things that are stored in page tables /
> > > > > > > > > caches, and one for actual swap entries. In the future, we can
> > > > > > > > > potentially further optimize how we use the bits in the page
> > > > > > > > > tables instead of sticking everything into the current
> > > > > > > > > type/offset format.
> > > > > > > > >
> > > > > > > > > Another potential win here can be swapoff, which can be made more
> > > > > > > > > practical by directly scanning all swap_desc's instead of going
> > > > > > > > > through page tables and shmem page caches.
> > > > > > > > >
> > > > > > > > > Overall zswap becomes more accessible and available to a wider
> > > > > > > > > range of use cases.
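
(A strawman from my side on the "two distinct types" cleanup mentioned
above, only to make sure I read it right -- these names are invented for
illustration and are not part of the proposal:)

/* Handle stored in page tables / page caches; resolves to a swap_desc. */
typedef struct { unsigned long val; } swp_id_t;

/* Position in a swapfile (type + offset); only the swapfile backend sees it. */
typedef struct { unsigned long val; } swp_slot_t;

If I understand correctly, generic MM code would then only ever deal in the
first type, and the second would stay internal to the swapfile code.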
> > > > > > > >
> > > > > > > > How will you handle zswap writeback? Zswap may write back to the
> > > > > > > > backing swap device IIUC. Assuming you have both zswap and a
> > > > > > > > swapfile, they are separate devices with this design, right? If
> > > > > > > > so, is the swapfile still the writeback target of zswap? And if it
> > > > > > > > is the writeback target, what if the swapfile is full?
> > > > > > >
> > > > > > > When we try to write back from zswap, we try to allocate a swap slot
> > > > > > > in the swapfile, and switch the swap_desc to point to that instead.
> > > > > > > The process would be transparent to the rest of MM (page tables,
> > > > > > > page cache, etc). If the swapfile is full, then there's really
> > > > > > > nothing we can do, reclaim fails and we start OOMing. I imagine this
> > > > > > > is the same behavior as today when swap is full; the difference
> > > > > > > would be that we have to fill both zswap AND the swapfile to get to
> > > > > > > the OOMing point, so an overall increased swapping capacity.
> > > > > >
> > > > > > When zswap is full but the swapfile is not yet, will swap try to write
> > > > > > back zswap to the swapfile to make more room for zswap, or just swap
> > > > > > out to the swapfile directly?
> > > > >
> > > > > The current behavior is that we swap to the swapfile directly in this
> > > > > case, which is far from ideal as we break LRU ordering by skipping
> > > > > zswap. I believe this should be addressed, but not as part of this
> > > > > effort. The work to make zswap respect the LRU ordering by writing
> > > > > back from zswap to make room can be done orthogonally to this effort.
> > > > > I believe Johannes was looking into this at some point.
> >
> > Actually, zswap already does LRU writeback when the pool is full. Nhat
> > Pham (CCd) recently upstreamed the LRU implementation for zsmalloc, so
> > as of today all backends support this.
> >
> > There are still a few quirks in zswap that can cause rejections which
> > bypass the LRU; those need fixing. But for the most part LRU writeback
> > to the backing file is the default behavior.
>
> Right, I was specifically talking about this case. When zswap is full
> it rejects incoming pages and they go directly to the swapfile, but we
> also kick off writeback, so this only happens until we do some LRU
> writeback. I guess I should have been more clear here. Thanks for
> clarifying and correcting.
>
> > > > Other than breaking LRU ordering, I'm also concerned about the
> > > > potential for deteriorating performance when writing/reading from the
> > > > swapfile when zswap is full. The zswap->swapfile order should be able
> > > > to maintain consistent performance for userspace.
> > >
> > > Right. This happens today anyway AFAICT: when zswap is full we just
> > > fall back to writing to the swapfile, so this would not be a behavior
> > > change. I agree it should be addressed anyway.
> > > >
> > > > But anyway I don't have data from real-life workloads to back the
> > > > above points. If you or Johannes could share some real data, that
> > > > would be very helpful for making the decisions.
> > >
> > > I actually don't, since we mostly run zswap without a backing
> > > swapfile. Perhaps Johannes might have some data on this (or anyone
> > > using zswap with a backing swapfile).
> >
> > Due to LRU writeback, the latency increase when zswap spills its
> > coldest entries into backing swap is fairly linear, as you may
> > expect. We have some limited production data on this from the
> > webservers.
> >
> > The biggest challenge in this space is properly sizing the zswap pool,
> > such that it's big enough to hold the warm set that the workload is
> > most latency-sensitive to, yet small enough that the cold pages get
> > spilled to backing swap. Nhat is working on improving this.
> >
> > That said, I think this discussion is orthogonal to the proposed
> > topic. zswap spills to backing swap in LRU order as of today. The
> > LRU/pool size tweaking is an optimization to get smarter zswap/swap
> > placement according to access frequency. The proposed swap descriptor
> > is an optimization to get better disk utilization, the ability to run
> > zswap without backing swap, and a dramatic speedup in swapoff time.
>
> Fully agree.
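
(To check that I follow the writeback flow described above, a purely
illustrative sketch of "allocate a swap slot and switch the swap_desc to
point to it". None of these helpers exist -- they are placeholders for the
steps described in the thread, and the fields match my earlier guess at a
swap_desc layout:)

/* Hypothetical writeback of one zswap-backed page to the backing swapfile. */
static int swap_desc_writeback(struct swap_desc *desc)
{
        swp_entry_t slot;
        int err;

        /* Reserve a slot in the backing swapfile. */
        err = swap_slot_alloc(&slot);
        if (err)
                return err;     /* swapfile full: writeback fails, entry stays in zswap */

        /* Decompress the zswap entry and write it to the reserved slot. */
        err = zswap_entry_write(desc->zswap, slot);
        if (err) {
                swap_slot_free(slot);
                return err;
        }

        /* Repoint the descriptor; the swap id seen by page tables is unchanged. */
        zswap_entry_free(desc->zswap);
        desc->slot = slot;
        desc->flags &= ~SWAP_DESC_ZSWAP;        /* invented flag, see earlier sketch */

        return 0;
}

The key property, as I read it, is that only the swap_desc changes; the swap
id stored in page tables and the page cache stays the same, which is what
makes the move transparent to the rest of MM.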
> > > > > > > >
> > > > > > > > Anyway, I'm interested in attending the discussion for this topic.
> > > > > > >
> > > > > > > Great! Looking forward to discussing this more!
> > > > > > >
> > > > > > > > > ==================== Cost ====================
> > > > > > > > > The obvious downside of this is added memory overhead,
> > > > > > > > > specifically for users that use swapfiles without zswap. Instead
> > > > > > > > > of paying one byte (swap_map) for every potential page in the
> > > > > > > > > swapfile (+ swap count continuation), we pay the size of the
> > > > > > > > > swap_desc for every page that is actually in the swapfile, which
> > > > > > > > > I am estimating can be roughly around 24 bytes or so, so maybe
> > > > > > > > > 0.6% of swapped out memory. The overhead only scales with pages
> > > > > > > > > actually swapped out. For zswap users, it should be a win (or at
> > > > > > > > > least even) because we get to drop a lot of fields from struct
> > > > > > > > > zswap_entry (e.g. rbtree, index, etc).
> >
> > Shifting the cost from O(swapspace) to O(swapped) could be a win for
> > many regular swap users too.
> >
> > There are legacy setups that provision 2*RAM worth of swap as an
> > emergency overflow that is then rarely used.
> >
> > We have setups that swap to disk more proactively, but we also
> > overprovision those in terms of swap space due to the cliff behavior
> > when swap fills up and the VM runs out of options.
> >
> > To make a fair comparison, you really have to take average swap
> > utilization into account. And I doubt that's very high.
>
> Yeah, I was looking for some data here, but it varies heavily based on
> the use case, so I opted to only state the overhead of the swap
> descriptor without directly comparing it to the current overhead.
>
> > In terms of worst-case behavior, +0.8% per swapped page doesn't sound
> > like a show-stopper to me. Especially when compared to zswap's current
> > O(swapped) waste of disk space.
>
> Yeah, for zswap users this should be a win on most/all fronts, even
> memory overhead, as we will end up trimming struct zswap_entry, which
> is also O(swapped) memory overhead. It should also make zswap
> available for more use cases: you don't need to provision and
> configure swap space, you just need to turn zswap on.
>
> > > > > > > > > Another potential concern is readahead. With this design, we
> > > > > > > > > have no way to get a swap_desc given a swap entry (type &
> > > > > > > > > offset). We would need to maintain a reverse mapping, adding a
> > > > > > > > > little bit more overhead, or search all swapped out pages
> > > > > > > > > instead :). A reverse mapping might pump the per-swapped page
> > > > > > > > > overhead to ~32 bytes (~0.8% of swapped out memory).
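
(Quick sanity check on the numbers above, plus a guess at what the reverse
mapping could be. With 4 KiB pages, 24 bytes per swapped-out page is
24/4096 ~= 0.6% and 32 bytes is 32/4096 ~= 0.8%, so the estimates line up.
For the reverse map itself I would imagine something xarray-based, keyed by
the swapfile slot -- again my own assumption, not something from the
proposal:)

/* Hypothetical reverse map: swapfile slot value -> swap_desc, for readahead. */
static DEFINE_XARRAY(swap_desc_rmap);

static int swap_desc_rmap_add(swp_entry_t slot, struct swap_desc *desc)
{
        return xa_err(xa_store(&swap_desc_rmap, slot.val, desc, GFP_KERNEL));
}

static struct swap_desc *swap_desc_rmap_lookup(swp_entry_t slot)
{
        return xa_load(&swap_desc_rmap, slot.val);
}

Readahead could then resolve neighboring slots back to their swap_descs via
the lookup, at the cost of roughly one extra pointer-sized entry per swapped
page, which seems to be where the ~32 byte figure comes from.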
> > > > > > > > >
> > > > > > > > > ==================== Bottom Line ====================
> > > > > > > > > It would be nice to discuss the potential here and the tradeoffs.
> > > > > > > > > I know that other folks using zswap (or interested in using it)
> > > > > > > > > may find this very useful. I am sure I am missing some context on
> > > > > > > > > why things are the way they are, and perhaps some obvious holes
> > > > > > > > > in my story. Looking forward to discussing this with anyone
> > > > > > > > > interested :)
> > > > > > > > >
> > > > > > > > > I think Johannes may be interested in attending this discussion,
> > > > > > > > > since a lot of ideas here are inspired by discussions I had with
> > > > > > > > > him :)

Hi everyone,

I came across this interesting proposal and I would like to participate in
the discussion. I think it will overlap with, and be useful for, some
projects we are currently planning in Android.

Thanks,
Kalesh

> > > > Thanks!