* [PATCH] mm/zswap: Improve with alloc_workqueue() call
@ 2023-12-11  5:28 Ronald Monthero
  2023-12-11 14:15 ` Nhat Pham
  0 siblings, 1 reply; 20+ messages in thread
From: Ronald Monthero @ 2023-12-11  5:28 UTC (permalink / raw)
  Cc: sjenning, akpm, Ronald Monthero, Dan Streetman, Vitaly Wool,
	linux-mm, linux-kernel

Use alloc_workqueue() to create the workqueue and set finer-grained
work item attributes, instead of create_workqueue(), which is
deprecated.

Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
---
 mm/zswap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 74411dfdad92..64dbe3e944a2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1620,7 +1620,8 @@ static int zswap_setup(void)
 		zswap_enabled = false;
 	}
 
-	shrink_wq = create_workqueue("zswap-shrink");
+	shrink_wq = alloc_workqueue("zswap-shrink",
+			WQ_UNBOUND|WQ_MEM_RECLAIM, 0);
 	if (!shrink_wq)
 		goto fallback_fail;
 
-- 
2.34.1



* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-11  5:28 [PATCH] mm/zswap: Improve with alloc_workqueue() call Ronald Monthero
@ 2023-12-11 14:15 ` Nhat Pham
  2023-12-13 13:20   ` Ronald Monthero
  0 siblings, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2023-12-11 14:15 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm, linux-kernel

On Sun, Dec 10, 2023 at 9:31 PM Ronald Monthero
<debug.penguin32@gmail.com> wrote:
>
> Use alloc_workqueue() to create the workqueue and set finer-grained
> work item attributes, instead of create_workqueue(), which is
> deprecated.
>
> Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> ---
>  mm/zswap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..64dbe3e944a2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
>                 zswap_enabled = false;
>         }
>
> -       shrink_wq = create_workqueue("zswap-shrink");
> +       shrink_wq = alloc_workqueue("zswap-shrink",
> +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 0);

Hmmm this changes the current behavior a bit right? create_workqueue()
is currently defined as:

alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

I think this should be noted in the changelog, at the very least, even
if it is fine. We should be as explicit as possible about behavior
changes.
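
For clarity, here is the old call after macro expansion next to the
new one (a side-by-side sketch based on the wrapper quoted above, not
a suggested diff):

	/* old: bound (per-cpu), __WQ_LEGACY, max_active = 1 */
	shrink_wq = alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1,
				    "zswap-shrink");

	/* new: unbound, max_active = 0, i.e. the default limit */
	shrink_wq = alloc_workqueue("zswap-shrink",
				    WQ_UNBOUND | WQ_MEM_RECLAIM, 0);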



>         if (!shrink_wq)
>                 goto fallback_fail;
>
> --
> 2.34.1
>
>


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-11 14:15 ` Nhat Pham
@ 2023-12-13 13:20   ` Ronald Monthero
  2023-12-14  0:28     ` Nhat Pham
  0 siblings, 1 reply; 20+ messages in thread
From: Ronald Monthero @ 2023-12-13 13:20 UTC (permalink / raw)
  To: Nhat Pham
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm, linux-kernel

Hi Nhat,
Thanks for checking.
On Tue, Dec 12, 2023 at 12:16 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Dec 10, 2023 at 9:31 PM Ronald Monthero
> <debug.penguin32@gmail.com> wrote:
> >
> > Use alloc_workqueue() to create the workqueue and set finer-grained
> > work item attributes, instead of create_workqueue(), which is
> > deprecated.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > ---
> >  mm/zswap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> >                 zswap_enabled = false;
> >         }
> >
> > -       shrink_wq = create_workqueue("zswap-shrink");
> > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 0);
>
> Hmmm this changes the current behavior a bit right? create_workqueue()
> is currently defined as:
>
> alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

create_workqueue() is deprecated, and most subsystems have already
changed over to alloc_workqueue(); only a small minority of remnant
instances in the kernel and some drivers still use create_workqueue().
As defined, create_workqueue() hardcodes the workqueue flags to be
bound and per-cpu, and gives no flexibility to use any other workqueue
flags/attributes (WQ_CPU_INTENSIVE, WQ_HIGHPRI, WQ_RESCUER,
WQ_FREEZABLE, WQ_UNBOUND). Hence most subsystems and drivers use the
alloc_workqueue() api:
#define create_workqueue(name) \
alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))

> I think this should be noted in the changelog, at the very least, even
> if it is fine. We should be as explicit as possible about behavior
> changes.
>
imo, this is sort of known and has been consistently changed for quite
some time already:
https://lists.openwall.net/linux-kernel/2016/06/07/1086
https://lists.openwall.net/linux-kernel/2011/01/03/124
https://lwn.net/Articles/403891/ => quoted: "The long-term plan, it
seems, is to convert all create_workqueue() users over to an
appropriate alloc_workqueue() call; eventually create_workqueue() will
be removed"

glad to take some suggestions, thoughts?

BR,
ronald


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-13 13:20   ` Ronald Monthero
@ 2023-12-14  0:28     ` Nhat Pham
  2023-12-14  1:02       ` Nhat Pham
  2023-12-15  9:03       ` Ronald Monthero
  0 siblings, 2 replies; 20+ messages in thread
From: Nhat Pham @ 2023-12-14  0:28 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm, linux-kernel

On Wed, Dec 13, 2023 at 5:20 AM Ronald Monthero
<debug.penguin32@gmail.com> wrote:
>
> Hi Nhat,
> Thanks for checking.
> On Tue, Dec 12, 2023 at 12:16 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Sun, Dec 10, 2023 at 9:31 PM Ronald Monthero
> > <debug.penguin32@gmail.com> wrote:
> > >
> > > Use alloc_workqueue() to create the workqueue and set finer-grained
> > > work item attributes, instead of create_workqueue(), which is
> > > deprecated.
> > >
> > > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > > ---
> > >  mm/zswap.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 74411dfdad92..64dbe3e944a2 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > >                 zswap_enabled = false;
> > >         }
> > >
> > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 0);
> >
> > Hmmm this changes the current behavior a bit right? create_workqueue()
> > is currently defined as:
> >
> > alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))
>
> create_workqueue() is deprecated, and most subsystems have already
> changed over to alloc_workqueue(); only a small minority of remnant
> instances in the kernel and some drivers still use create_workqueue().
> As defined, create_workqueue() hardcodes the workqueue flags to be
> bound and per-cpu, and gives no flexibility to use any other workqueue
> flags/attributes (WQ_CPU_INTENSIVE, WQ_HIGHPRI, WQ_RESCUER,
> WQ_FREEZABLE, WQ_UNBOUND). Hence most subsystems and drivers use the
> alloc_workqueue() api:
> #define create_workqueue(name) \
> alloc_workqueue("%s", __WQ_LEGACY | WQ_MEM_RECLAIM, 1, (name))
>
> > I think this should be noted in the changelog, at the very least, even
> > if it is fine. We should be as explicit as possible about behavior
> > changes.
> >
> imo, this is sort of known and has been consistently changed for quite
> some time already:
> https://lists.openwall.net/linux-kernel/2016/06/07/1086
> https://lists.openwall.net/linux-kernel/2011/01/03/124
> https://lwn.net/Articles/403891/ => quoted: "The long-term plan, it
> seems, is to convert all create_workqueue() users over to an
> appropriate alloc_workqueue() call; eventually create_workqueue() will
> be removed"
>
> glad to take some suggestions, thoughts?
>
> BR,
> ronald

I should have been clearer. I'm not against the change per se - I
agree that we should replace create_workqueue() with
alloc_workqueue(). What I meant was, IIUC, there are two behavioral
changes with this new workqueue creation:

a) We're replacing a bounded workqueue (which as you noted, is fixed
by create_workqueue()) with an unbounded one (WQ_UNBOUND). This seems
fine to me - I doubt locality buys us much here.

b) create_workqueue() limits the number of concurrent per-cpu
execution contexts at 1 (i.e only one single global reclaimer),
whereas after this patch this is set to the default value. This seems
fine to me too - I don't remember us taking advantage of the previous
concurrency limitation. Also, in practice, the task_struct is
one-to-one with the zswap_pool's anyway, and most of the time, there
is just a single pool being used. (But it begs the question - what's
the point of using 0 instead of 1 here?)

Both seem fine (to me anyway - other reviewers feel free to take a
look and fact-check everything). I just feel like this should be
explicitly noted in the changelog, IMHO, in case we are mistaken and
need to revisit this :) Either way, not a NACK from me.


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-14  0:28     ` Nhat Pham
@ 2023-12-14  1:02       ` Nhat Pham
  2023-12-15  9:03       ` Ronald Monthero
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2023-12-14  1:02 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm, linux-kernel

On Wed, Dec 13, 2023 at 4:28 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> concurrency limitation. Also, in practice, the task_struct is
/s/task/work.
I was looking at some articles about tasks recently, so my brain just
auto-completed. I was referring to pool->shrink_work here.


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-14  0:28     ` Nhat Pham
  2023-12-14  1:02       ` Nhat Pham
@ 2023-12-15  9:03       ` Ronald Monthero
  2023-12-20  0:21         ` Nhat Pham
  1 sibling, 1 reply; 20+ messages in thread
From: Ronald Monthero @ 2023-12-15  9:03 UTC (permalink / raw)
  To: Nhat Pham
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm, linux-kernel

On Thu, Dec 14, 2023 at 10:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
..
< snipped >

> I should have been clearer. I'm not against the change per se - I
> agree that we should replace create_workqueue() with
> alloc_workqueue(). What I meant was, IIUC, there are two behavioral
> changes with this new workqueue creation:
>
> a) We're replacing a bounded workqueue (which as you noted, is fixed
> by create_workqueue()) with an unbounded one (WQ_UNBOUND). This seems
> fine to me - I doubt locality buys us much here.

Yes, the workqueue attributes change per se, but the existing
functionality remains seamless and the change is more of a
forward-looking one. imo, under a memory-pressure scenario an unbound
workqueue might handle the situation better, as the number of backing
pools is dynamic. And with the WQ_UNBOUND attribute the scheduler is
more open to improvise in any demanding scenarios for offloading cpu
time slices to workers, i.e. if any other worker of the same primary
cpu had to be served due to workers with WQ_HIGHPRI and
WQ_CPU_INTENSIVE. Although normal and highpri worker-pools don't
interact with each other, the target cpu at least need not be the same
if our worker for zswap is WQ_UNBOUND.

Also noting that the existing wq of the zswap worker has the
WQ_MEM_RECLAIM attribute, so is there a rescue worker for zswap during
a memory-pressure scenario?
Quoting the workqueue documentation: "All work items which might be
used on code paths that handle memory reclaim are required to be
queued on wq's that have a rescue-worker reserved for execution under
memory pressure. Else it is possible that the worker-pool deadlocks
waiting for execution contexts to free up"

An additional thought: would adding the WQ_FREEZABLE attribute while
creating the zswap worker make sense for scenarios that handle freeze
and unfreeze of specific cgroups, or filesystem-wide freeze and
unfreeze? Does the zswap worker participate in the freeze/unfreeze
code paths?

> b) create_workqueue() limits the number of concurrent per-cpu
> execution contexts at 1 (i.e only one single global reclaimer),
> whereas after this patch this is set to the default value. This seems
> fine to me too - I don't remember us taking advantage of the previous
> concurrency limitation. Also, in practice, the task_struct is
> one-to-one with the zswap_pool's anyway, and most of the time, there
> is just a single pool being used. (But it begs the question - what's
> the point of using 0 instead of 1 here?)

Nothing in particular, but I left it at the default 0, which can go
up to 256 (@max_active per cpu).
But if the zswap worker is always intended to only have 1 active
worker per cpu, then 1 is fine; otherwise the default setting might be
more flexible for scaling.
Just a thought: does having a higher value help on larger-memory
systems?
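
(For reference, the limits referred to above; an abridged excerpt from
include/linux/workqueue.h of the v6.x kernels under discussion -
values may differ in other releases:)

	enum {
		WQ_MAX_ACTIVE	= 512,			/* upper bound */
		WQ_DFL_ACTIVE	= WQ_MAX_ACTIVE / 2,	/* 256: what
							 * max_active == 0
							 * selects */
	};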

> Both seem fine (to me anyway - other reviewers feel free to take a
> look and fact-check everything). I just feel like this should be
> explicitly noted in the changelog, IMHO, in case we are mistaken and
> need to revisit this :) Either way, not a NACK from me.

Thanks Nhat for checking. Above are my thoughts; I could be missing
some info or be incorrect on certain fronts, so I would seek
clarifications. Also thanks in advance to other experts/maintainers -
please share feedback and suggestions.

BR,
ronald


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-15  9:03       ` Ronald Monthero
@ 2023-12-20  0:21         ` Nhat Pham
  2024-01-16 13:31           ` Ronald Monthero
  0 siblings, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2023-12-20  0:21 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, akpm, Dan Streetman, Vitaly Wool, linux-mm,
	linux-kernel, Chris Li

On Fri, Dec 15, 2023 at 1:04 AM Ronald Monthero
<debug.penguin32@gmail.com> wrote:
>
> On Thu, Dec 14, 2023 at 10:28 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> ..
> < snipped >
>
> > I should have been clearer. I'm not against the change per se - I
> > agree that we should replace create_workqueue() with
> > alloc_workqueue(). What I meant was, IIUC, there are two behavioral
> > changes with this new workqueue creation:
> >
> > a) We're replacing a bounded workqueue (which as you noted, is fixed
> > by create_workqueue()) with an unbounded one (WQ_UNBOUND). This seems
> > fine to me - I doubt locality buys us much here.
>
> Yes, the workqueue attributes change per se, but the existing
> functionality remains seamless and the change is more of a
> forward-looking one. imo, under a memory-pressure scenario an unbound
> workqueue might handle the situation better, as the number of backing
> pools is dynamic. And with the WQ_UNBOUND attribute the scheduler is
> more open to improvise in any demanding scenarios for offloading cpu
> time slices to workers, i.e. if any other worker of the same primary
> cpu had to be served due to workers with WQ_HIGHPRI and
> WQ_CPU_INTENSIVE. Although normal and highpri worker-pools don't
> interact with each other, the target cpu at least need not be the same
> if our worker for zswap is WQ_UNBOUND.

I don't object to the change per se. I just meant that these
changes/discussion should be spelled out in the patch's changelog :)
IMHO, we should document behavior changes if there are any. For
instance, when we switched to kmap_local_page() for zswap, there is a
discussion in the changelog regarding how it differs from the old (and
deprecated) kmap_atomic():

https://lore.kernel.org/linux-mm/20231127160058.586446-1-fabio.maria.de.francesco@linux.intel.com/

and how that difference doesn't matter in the case of zswap.

>
> Also noting that the existing wq of the zswap worker has the
> WQ_MEM_RECLAIM attribute, so is there a rescue worker for zswap during
> a memory-pressure scenario?
> Quoting the workqueue documentation: "All work items which might be
> used on code paths that handle memory reclaim are required to be
> queued on wq's that have a rescue-worker reserved for execution under
> memory pressure. Else it is possible that the worker-pool deadlocks
> waiting for execution contexts to free up"

Seems like it, but this behavior is not changed by your patch :) So
I'm not too concerned by it one way or another.

>
> An additional thought: would adding the WQ_FREEZABLE attribute while
> creating the zswap worker make sense for scenarios that handle freeze
> and unfreeze of specific cgroups, or filesystem-wide freeze and
> unfreeze? Does the zswap worker participate in the freeze/unfreeze
> code paths?

I don't think so, no? This zswap worker is scheduled upon the zswap
pool limit being hit (which happens on the zswap store / swapping /
memory reclaim path).
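
(A simplified sketch of that path, paraphrased from the mm/zswap.c of
this era rather than quoted verbatim:)

	/* zswap_store(), abridged: on limit hit, reject the page and
	 * kick the asynchronous shrinker on shrink_wq. */
	if (zswap_is_full()) {
		zswap_pool_limit_hit++;
		zswap_pool_reached_full = true;
		goto shrink;
	}
	/* ... */
shrink:
	pool = zswap_pool_last_get();
	if (pool && !queue_work(shrink_wq, &pool->shrink_work))
		zswap_pool_put(pool);
	goto reject;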

>
> > b) create_workqueue() limits the number of concurrent per-cpu
> > execution contexts at 1 (i.e only one single global reclaimer),
> > whereas after this patch this is set to the default value. This seems
> > fine to me too - I don't remember us taking advantage of the previous
> > concurrency limitation. Also, in practice, the task_struct is
> > one-to-one with the zswap_pool's anyway, and most of the time, there
> > is just a single pool being used. (But it begs the question - what's
> > the point of using 0 instead of 1 here?)
>
> Nothing in particular, but I left it at the default 0, which can go
> up to 256 (@max_active per cpu).
> But if the zswap worker is always intended to only have 1 active
> worker per cpu, then 1 is fine; otherwise the default setting might be
> more flexible for scaling.
> Just a thought: does having a higher value help on larger-memory
> systems?

I don't think having a higher value helps here tbh. We only have one
work_struct per pool, so it shouldn't make a difference either way :)
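
(The one-to-one relationship in question; an abridged excerpt of the
pool struct from the mm/zswap.c of this era:)

	struct zswap_pool {
		struct zpool *zpool;
		struct crypto_acomp_ctx __percpu *acomp_ctx;
		/* ... */
		struct work_struct shrink_work;	/* the single shrink work item */
		/* ... */
	};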

>
> > Both seem fine (to me anyway - other reviewers feel free to take a
> > look and fact-check everything). I just feel like this should be
> > explicitly noted in the changelog, IMHO, in case we are mistaken and
> > need to revisit this :) Either way, not a NACK from me.
>
> Thanks Nhat for checking. Above are my thoughts; I could be missing
> some info or be incorrect on certain fronts, so I would seek
> clarifications. Also thanks in advance to other experts/maintainers -
> please share feedback and suggestions.
>
> BR,
> ronald

Also +Chris Li, who is also working on improving zswap :)


* [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2023-12-20  0:21         ` Nhat Pham
@ 2024-01-16 13:31           ` Ronald Monthero
  2024-01-17 19:13             ` Nhat Pham
  0 siblings, 1 reply; 20+ messages in thread
From: Ronald Monthero @ 2024-01-16 13:31 UTC (permalink / raw)
  To: nphamcs
  Cc: sjenning, ddstreet, vitaly.wool, akpm, chrisl, linux-mm,
	linux-kernel, Ronald Monthero

The core-api create_workqueue() is deprecated; this patch replaces it
with alloc_workqueue(). The previous zswap workqueue was a bound
(per-cpu) workqueue; this patch uses alloc_workqueue() to create an
unbound workqueue. The WQ_UNBOUND attribute is desirable because it
does not localize the workqueue to a specific cpu, leaving the
scheduler free to improvise in any demanding scenarios for offloading
cpu time slices for workqueues.
For example, other workqueues of the same primary cpu may have to be
served which are WQ_HIGHPRI or WQ_CPU_INTENSIVE. An unbound workqueue
also happens to be more efficient in a system under memory pressure
than a bound workqueue.

shrink_wq = alloc_workqueue("zswap-shrink",
                     WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Overall the change suggested in this patch should be seamless and
does not alter the existing behavior, other than making the workqueue
unbound.

Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
---
 mm/zswap.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 74411dfdad92..64dbe3e944a2 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1620,7 +1620,8 @@ static int zswap_setup(void)
 		zswap_enabled = false;
 	}
 
-	shrink_wq = create_workqueue("zswap-shrink");
+	shrink_wq = alloc_workqueue("zswap-shrink",
+			WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
 	if (!shrink_wq)
 		goto fallback_fail;
 
-- 
2.34.1



* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-16 13:31           ` Ronald Monthero
@ 2024-01-17 19:13             ` Nhat Pham
  2024-01-17 19:30               ` Yosry Ahmed
  2024-01-18 18:03               ` Nhat Pham
  0 siblings, 2 replies; 20+ messages in thread
From: Nhat Pham @ 2024-01-17 19:13 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, ddstreet, vitaly.wool, akpm, chrisl, linux-mm,
	linux-kernel, Johannes Weiner, Yosry Ahmed

On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
<debug.penguin32@gmail.com> wrote:

+ Johannes and Yosry

>
> The core-api create_workqueue() is deprecated; this patch replaces it
> with alloc_workqueue(). The previous zswap workqueue was a bound
> (per-cpu) workqueue; this patch uses alloc_workqueue() to create an
> unbound workqueue. The WQ_UNBOUND attribute is desirable because it
> does not localize the workqueue to a specific cpu, leaving the
> scheduler free to improvise in any demanding scenarios for offloading
> cpu time slices for workqueues.

nit: extra space between paragraph would be nice.

> For example, other workqueues of the same primary cpu may have to be
> served which are WQ_HIGHPRI or WQ_CPU_INTENSIVE. An unbound workqueue
> also happens to be more efficient in a system under memory pressure
> than a bound workqueue.
>
> shrink_wq = alloc_workqueue("zswap-shrink",
>                      WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> Overall the change suggested in this patch should be seamless and
> does not alter the existing behavior, other than making the workqueue
> unbound.
>
> Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> ---
>  mm/zswap.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 74411dfdad92..64dbe3e944a2 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
>                 zswap_enabled = false;
>         }
>
> -       shrink_wq = create_workqueue("zswap-shrink");
> +       shrink_wq = alloc_workqueue("zswap-shrink",
> +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Have you benchmarked this to check if there is any regression, just to
be safe? With an unbounded workqueue, you're gaining scheduling
flexibility at the cost of cache locality. My intuition is that it
doesn't matter too much here, but you should probably double check by
stress testing - run some workload with a relatively small zswap pool
limit (i.e heavy global writeback), and see if there is any difference
in performance.

>         if (!shrink_wq)
>                 goto fallback_fail;
>
> --
> 2.34.1
>

On a different note, I wonder if it would help to perform synchronous
reclaim here instead. With our current design, the zswap store failure
(due to global limit hit) would leave the incoming page going to swap
instead, creating an LRU inversion. Not sure if that's ideal.


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-17 19:13             ` Nhat Pham
@ 2024-01-17 19:30               ` Yosry Ahmed
  2024-01-18 16:16                 ` Johannes Weiner
  2024-01-18 18:03               ` Nhat Pham
  1 sibling, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2024-01-17 19:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Ronald Monthero, sjenning, ddstreet, vitaly.wool, akpm, chrisl,
	linux-mm, linux-kernel, Johannes Weiner

On Wed, Jan 17, 2024 at 11:14 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> <debug.penguin32@gmail.com> wrote:
>
> + Johannes and Yosry
>
> >
> > The core-api create_workqueue() is deprecated; this patch replaces it
> > with alloc_workqueue(). The previous zswap workqueue was a bound
> > (per-cpu) workqueue; this patch uses alloc_workqueue() to create an
> > unbound workqueue. The WQ_UNBOUND attribute is desirable because it
> > does not localize the workqueue to a specific cpu, leaving the
> > scheduler free to improvise in any demanding scenarios for offloading
> > cpu time slices for workqueues.
>
> nit: extra space between paragraph would be nice.
>
> > For example, other workqueues of the same primary cpu may have to be
> > served which are WQ_HIGHPRI or WQ_CPU_INTENSIVE. An unbound workqueue
> > also happens to be more efficient in a system under memory pressure
> > than a bound workqueue.
> >
> > shrink_wq = alloc_workqueue("zswap-shrink",
> >                      WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Overall the change suggested in this patch should be seamless and
> > does not alter the existing behavior, other than making the workqueue
> > unbound.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > ---
> >  mm/zswap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> >                 zswap_enabled = false;
> >         }
> >
> > -       shrink_wq = create_workqueue("zswap-shrink");
> > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> Have you benchmarked this to check if there is any regression, just to
> be safe? With an unbounded workqueue, you're gaining scheduling
> flexibility at the cost of cache locality. My intuition is that it
> doesn't matter too much here, but you should probably double check by
> stress testing - run some workload with a relatively small zswap pool
> limit (i.e heavy global writeback), and see if there is any difference
> in performance.

I also think this shouldn't make a large difference. The global
shrinking work is already expensive, and I imagine that it exhausts
the caches anyway by iterating memcgs. A performance smoketest would
be reassuring for sure, but I believe it won't make a difference.

Keep in mind that even with WQ_UNBOUND, we prefer the local CPU (see
wq_select_unbound_cpu()), so it will take more than global writeback
to observe a difference. The local CPU must not be in
wq_unbound_cpumask, or CONFIG_DEBUG_WQ_FORCE_RR_CPU should be on.
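
(The relevant logic, paraphrased from kernel/workqueue.c:)

	static int wq_select_unbound_cpu(int cpu)
	{
		if (likely(!wq_debug_force_rr_cpu)) {
			/* prefer the submitting CPU when it's allowed */
			if (cpumask_test_cpu(cpu, wq_unbound_cpumask))
				return cpu;
		}
		/* otherwise round-robin across wq_unbound_cpumask */
		/* ... */
	}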

>
> >         if (!shrink_wq)
> >                 goto fallback_fail;
> >
> > --
> > 2.34.1
> >
>
> On a different note, I wonder if it would help to perform synchronous
> reclaim here instead. With our current design, the zswap store failure
> (due to global limit hit) would leave the incoming page going to swap
> instead, creating an LRU inversion. Not sure if that's ideal.

The global shrink path keeps reclaiming until zswap can accept again
(by default, that means reclaiming 10% of the total limit). I think
this is too expensive to be done synchronously.
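
(For context, the acceptance threshold being discussed; lightly
paraphrased from the mm/zswap.c of this era. accept_thr_percent
defaults to 90, hence "reclaiming 10%" above:)

	static bool zswap_can_accept(void)
	{
		return zswap_accept_thr_percent * zswap_max_pool_percent *
			totalram_pages() / (100 * 100) >
				DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
	}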


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-17 19:30               ` Yosry Ahmed
@ 2024-01-18 16:16                 ` Johannes Weiner
  2024-01-18 16:48                   ` Johannes Weiner
  2024-01-18 17:06                   ` Yosry Ahmed
  0 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2024-01-18 16:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Wed, Jan 17, 2024 at 11:30:50AM -0800, Yosry Ahmed wrote:
> On Wed, Jan 17, 2024 at 11:14 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > <debug.penguin32@gmail.com> wrote:
> >
> > + Johannes and Yosry
> >
> > >
> > > The core-api create_workqueue() is deprecated; this patch replaces it
> > > with alloc_workqueue(). The previous zswap workqueue was a bound
> > > (per-cpu) workqueue; this patch uses alloc_workqueue() to create an
> > > unbound workqueue. The WQ_UNBOUND attribute is desirable because it
> > > does not localize the workqueue to a specific cpu, leaving the
> > > scheduler free to improvise in any demanding scenarios for offloading
> > > cpu time slices for workqueues.
> >
> > nit: extra space between paragraph would be nice.
> >
> > > For example, other workqueues of the same primary cpu may have to be
> > > served which are WQ_HIGHPRI or WQ_CPU_INTENSIVE. An unbound workqueue
> > > also happens to be more efficient in a system under memory pressure
> > > than a bound workqueue.
> > >
> > > shrink_wq = alloc_workqueue("zswap-shrink",
> > >                      WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> > >
> > > Overall the change suggested in this patch should be seamless and
> > > does not alter the existing behavior, other than making the workqueue
> > > unbound.
> > >
> > > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > > ---
> > >  mm/zswap.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index 74411dfdad92..64dbe3e944a2 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > >                 zswap_enabled = false;
> > >         }
> > >
> > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Have you benchmarked this to check if there is any regression, just to
> > be safe? With an unbounded workqueue, you're gaining scheduling
> > flexibility at the cost of cache locality. My intuition is that it
> > doesn't matter too much here, but you should probably double check by
> > stress testing - run some workload with a relatively small zswap pool
> > limit (i.e heavy global writeback), and see if there is any difference
> > in performance.
> 
> I also think this shouldn't make a large difference. The global
> shrinking work is already expensive, and I imagine that it exhausts
> the caches anyway by iterating memcgs. A performance smoketest would
> be reassuring for sure, but I believe it won't make a difference.

The LRU inherently makes the shrinker work on the oldest and coldest
entries, so I doubt we benefit a lot from cache locality there.

What could make a difference though is the increased concurrency by
switching max_active from 1 to 0. This could cause a higher rate of
shrinker runs, which might increase lock contention and reclaim
volume. That part would be good to double check with the shrinker
benchmarks.

> > On a different note, I wonder if it would help to perform synchronous
> > reclaim here instead. With our current design, the zswap store failure
> > (due to global limit hit) would leave the incoming page going to swap
> > instead, creating an LRU inversion. Not sure if that's ideal.
> 
> The global shrink path keeps reclaiming until zswap can accept again
> (by default, that means reclaiming 10% of the total limit). I think
> this is too expensive to be done synchronously.

That thresholding code is a bit weird right now.

It wakes the shrinker and rejects at the same time. We're guaranteed
to see rejections, even if the shrinker has no trouble flushing some
entries a split second later.

It would make more sense to wake the shrinker at e.g. 95% full and
have it run until 90%.

But with that in place we also *should* do synchronous reclaim once we
hit 100%. Just enough to make room for the store. This is important to
catch the case where reclaim rate exceeds swapout rate. Rejecting and
going to swap means the reclaimer will be throttled down to IO rate
anyway, and the app latency isn't any worse. But this way we keep the
pipeline alive, and keep swapping out the oldest zswap entries,
instead of rejecting and swapping what would be the hottest ones.
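
(A hypothetical sketch of that scheme; the thresholds and helper names
here are illustrative, not existing zswap code:)

	/* on store: wake the async shrinker early... */
	if (zswap_total_size() > zswap_max_size() * 95 / 100)
		queue_work(shrink_wq, &pool->shrink_work); /* runs down to 90% */

	/* ...and at the hard limit, reclaim just enough synchronously
	 * so this store still succeeds instead of being rejected. */
	if (zswap_total_size() + entry_size > zswap_max_size())
		zswap_writeback_until(zswap_max_size() - entry_size);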


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 16:16                 ` Johannes Weiner
@ 2024-01-18 16:48                   ` Johannes Weiner
  2024-01-18 17:03                     ` Yosry Ahmed
  2024-01-18 17:06                   ` Yosry Ahmed
  1 sibling, 1 reply; 20+ messages in thread
From: Johannes Weiner @ 2024-01-18 16:48 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > >                 zswap_enabled = false;
> > > >         }
> > > >
> > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

> What could make a difference though is the increased concurrency by
> switching max_active from 1 to 0. This could cause a higher rate of
> shrinker runs, which might increase lock contention and reclaim
> volume. That part would be good to double check with the shrinker
> benchmarks.

Nevermind, I clearly can't read.

Could still be worthwhile testing with the default 0, but it's not a
concern in the patch as-is.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>



* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 16:48                   ` Johannes Weiner
@ 2024-01-18 17:03                     ` Yosry Ahmed
  2024-01-18 18:08                       ` Nhat Pham
  0 siblings, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2024-01-18 17:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 8:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > > >                 zswap_enabled = false;
> > > > >         }
> > > > >
> > > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>
> > What could make a difference though is the increased concurrency by
> > switching max_active from 1 to 0. This could cause a higher rate of
> > shrinker runs, which might increase lock contention and reclaim
> > volume. That part would be good to double check with the shrinker
> > benchmarks.
>
> Nevermind, I clearly can't read.

Regardless of max_active, we only have one shrink_work per zswap pool,
and we can only have one instance of the work running at any time,
right?

>
> Could still be worthwhile testing with the default 0, but it's not a
> concern in the patch as-is.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 16:16                 ` Johannes Weiner
  2024-01-18 16:48                   ` Johannes Weiner
@ 2024-01-18 17:06                   ` Yosry Ahmed
  2024-01-18 17:39                     ` Johannes Weiner
  1 sibling, 1 reply; 20+ messages in thread
From: Yosry Ahmed @ 2024-01-18 17:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

> > > On a different note, I wonder if it would help to perform synchronous
> > > reclaim here instead. With our current design, the zswap store failure
> > > (due to global limit hit) would leave the incoming page going to swap
> > > instead, creating an LRU inversion. Not sure if that's ideal.
> >
> > The global shrink path keeps reclaiming until zswap can accept again
> > (by default, that means reclaiming 10% of the total limit). I think
> > this is too expensive to be done synchronously.
>
> That thresholding code is a bit weird right now.
>
> It wakes the shrinker and rejects at the same time. We're guaranteed
> to see rejections, even if the shrinker has no trouble flushing some
> entries a split second later.
>
> It would make more sense to wake the shrinker at e.g. 95% full and
> have it run until 90%.
>
> But with that in place we also *should* do synchronous reclaim once we
> hit 100%. Just enough to make room for the store. This is important to
> catch the case where reclaim rate exceeds swapout rate. Rejecting and
> going to swap means the reclaimer will be throttled down to IO rate
> anyway, and the app latency isn't any worse. But this way we keep the
> pipeline alive, and keep swapping out the oldest zswap entries,
> instead of rejecting and swapping what would be the hottest ones.

I fully agree with the thresholding code being weird, and with waking
up the shrinker before the pool is full. What I don't understand is
how we can do synchronous reclaim when we hit 100% and still respect
the acceptance threshold :/

Are you proposing we change the semantics of the acceptance threshold
to begin with?


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 17:06                   ` Yosry Ahmed
@ 2024-01-18 17:39                     ` Johannes Weiner
  2024-01-18 18:03                       ` Yosry Ahmed
  2024-01-18 18:32                       ` Nhat Pham
  0 siblings, 2 replies; 20+ messages in thread
From: Johannes Weiner @ 2024-01-18 17:39 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > On a different note, I wonder if it would help to perform synchronous
> > > > reclaim here instead. With our current design, the zswap store failure
> > > > (due to global limit hit) would leave the incoming page going to swap
> > > > instead, creating an LRU inversion. Not sure if that's ideal.
> > >
> > > The global shrink path keeps reclaiming until zswap can accept again
> > > (by default, that means reclaiming 10% of the total limit). I think
> > > this is too expensive to be done synchronously.
> >
> > That thresholding code is a bit weird right now.
> >
> > It wakes the shrinker and rejects at the same time. We're guaranteed
> > to see rejections, even if the shrinker has no trouble flushing some
> > entries a split second later.
> >
> > It would make more sense to wake the shrinker at e.g. 95% full and
> > have it run until 90%.
> >
> > But with that in place we also *should* do synchronous reclaim once we
> > hit 100%. Just enough to make room for the store. This is important to
> > catch the case where reclaim rate exceeds swapout rate. Rejecting and
> > going to swap means the reclaimer will be throttled down to IO rate
> > anyway, and the app latency isn't any worse. But this way we keep the
> > pipeline alive, and keep swapping out the oldest zswap entries,
> > instead of rejecting and swapping what would be the hottest ones.
> 
> I fully agree with the thresholding code being weird, and with waking
> up the shrinker before the pool is full. What I don't understand is
> how we can do synchronous reclaim when we hit 100% and still respect
> the acceptance threshold :/
> 
> Are you proposing we change the semantics of the acceptance threshold
> to begin with?

I kind of am. It's worth looking at the history of this knob.

It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and
from the changelogs and the code in this patch I do not understand how
this was supposed to work.

It also *didn't* work for very basic real world applications. See
Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which
effectively reverted it to get halfway reasonable behavior.

If there are no good usecases for this knob, then I think it makes
sense to phase it out again.


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 17:39                     ` Johannes Weiner
@ 2024-01-18 18:03                       ` Yosry Ahmed
  2024-01-18 18:32                       ` Nhat Pham
  1 sibling, 0 replies; 20+ messages in thread
From: Yosry Ahmed @ 2024-01-18 18:03 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Nhat Pham, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 9:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > > On a different note, I wonder if it would help to perform synchronous
> > > > > reclaim here instead. With our current design, the zswap store failure
> > > > > (due to global limit hit) would leave the incoming page going to swap
> > > > > instead, creating an LRU inversion. Not sure if that's ideal.
> > > >
> > > > The global shrink path keeps reclaiming until zswap can accept again
> > > > (by default, that means reclaiming 10% of the total limit). I think
> > > > this is too expensive to be done synchronously.
> > >
> > > That thresholding code is a bit weird right now.
> > >
> > > It wakes the shrinker and rejects at the same time. We're guaranteed
> > > to see rejections, even if the shrinker has no trouble flushing some
> > > entries a split second later.
> > >
> > > It would make more sense to wake the shrinker at e.g. 95% full and
> > > have it run until 90%.
> > >
> > > But with that in place we also *should* do synchronous reclaim once we
> > > hit 100%. Just enough to make room for the store. This is important to
> > > catch the case where reclaim rate exceeds swapout rate. Rejecting and
> > > going to swap means the reclaimer will be throttled down to IO rate
> > > anyway, and the app latency isn't any worse. But this way we keep the
> > > pipeline alive, and keep swapping out the oldest zswap entries,
> > > instead of rejecting and swapping what would be the hottest ones.
> >
> > I fully agree with the thresholding code being weird, and with waking
> > up the shrinker before the pool is full. What I don't understand is
> > how we can do synchronous reclaim when we hit 100% and still respect
> > the acceptance threshold :/
> >
> > Are you proposing we change the semantics of the acceptance threshold
> > to begin with?
>
> I kind of am. It's worth looking at the history of this knob.
>
> It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and
> from the changelogs and the code in this patch I do not understand how
> this was supposed to work.
>
> It also *didn't* work for very basic real world applications. See
> Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which
> effectively reverted it to get halfway reasonable behavior.
>
> If there are no good usecases for this knob, then I think it makes
> sense to phase it out again.

I am always nervous about removing/altering user visible knobs, but if
you think it's fine then I am all for it. I think it makes more sense
to start writeback early to avoid the whole situation if possible, and
synchronously reclaim a little bit if we hit 100%. I think the
proactive writeback should reduce the amount of synchronous IO we need
to do in reclaim as well, so we may see some latency improvements.


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-17 19:13             ` Nhat Pham
  2024-01-17 19:30               ` Yosry Ahmed
@ 2024-01-18 18:03               ` Nhat Pham
  1 sibling, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2024-01-18 18:03 UTC (permalink / raw)
  To: Ronald Monthero
  Cc: sjenning, ddstreet, vitaly.wool, akpm, chrisl, linux-mm,
	linux-kernel, Johannes Weiner, Yosry Ahmed

On Wed, Jan 17, 2024 at 11:13 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> <debug.penguin32@gmail.com> wrote:
>
> + Johannes and Yosry
>
> >
> > The core-api create_workqueue() is deprecated; this patch replaces it
> > with alloc_workqueue(). The previous zswap workqueue was a bound
> > (per-cpu) workqueue; this patch uses alloc_workqueue() to create an
> > unbound workqueue. The WQ_UNBOUND attribute is desirable because it
> > does not localize the workqueue to a specific cpu, leaving the
> > scheduler free to improvise in any demanding scenarios for offloading
> > cpu time slices for workqueues.
>
> nit: extra space between paragraph would be nice.
>
> > For example, other workqueues of the same primary cpu may have to be
> > served which are WQ_HIGHPRI or WQ_CPU_INTENSIVE. An unbound workqueue
> > also happens to be more efficient in a system under memory pressure
> > than a bound workqueue.
> >
> > shrink_wq = alloc_workqueue("zswap-shrink",
> >                      WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > Overall the change suggested in this patch should be seamless and
> > does not alter the existing behavior, other than making the workqueue
> > unbound.
> >
> > Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
> > ---
> >  mm/zswap.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 74411dfdad92..64dbe3e944a2 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> >                 zswap_enabled = false;
> >         }
> >
> > -       shrink_wq = create_workqueue("zswap-shrink");
> > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
>

[...]

> >         if (!shrink_wq)
> >                 goto fallback_fail;
> >
> > --
> > 2.34.1
> >

FWIW:
Acked-by: Nhat Pham <nphamcs@gmail.com>


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 17:03                     ` Yosry Ahmed
@ 2024-01-18 18:08                       ` Nhat Pham
  0 siblings, 0 replies; 20+ messages in thread
From: Nhat Pham @ 2024-01-18 18:08 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Ronald Monthero, sjenning, ddstreet,
	vitaly.wool, akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 9:03 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Jan 18, 2024 at 8:48 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Jan 18, 2024 at 11:16:08AM -0500, Johannes Weiner wrote:
> > > > > On Tue, Jan 16, 2024 at 5:32 AM Ronald Monthero
> > > > > > @@ -1620,7 +1620,8 @@ static int zswap_setup(void)
> > > > > >                 zswap_enabled = false;
> > > > > >         }
> > > > > >
> > > > > > -       shrink_wq = create_workqueue("zswap-shrink");
> > > > > > +       shrink_wq = alloc_workqueue("zswap-shrink",
> > > > > > +                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
> >
> > > What could make a difference though is the increased concurrency by
> > > switching max_active from 1 to 0. This could cause a higher rate of
> > > shrinker runs, which might increase lock contention and reclaim
> > > volume. That part would be good to double check with the shrinker
> > > benchmarks.
> >
> > Nevermind, I clearly can't read.
>
> Regardless of max_active, we only have one shrink_work per zswap pool,
> and we can only have one instance of the work running at any time,
> right?

I believe so, yeah. Well, I guess you can have a weird setup where
somehow multiple pools are full and submit shrink_work concurrently?
But who does that :) Keeping it as is, to reduce our mental workload
(i.e. not having to keep track of what changes), would be ideal.

>
> >
> > Could still be worthwhile testing with the default 0, but it's not a
> > concern in the patch as-is.
> >
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> >


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 17:39                     ` Johannes Weiner
  2024-01-18 18:03                       ` Yosry Ahmed
@ 2024-01-18 18:32                       ` Nhat Pham
  2024-02-21 13:32                         ` Ronald Monthero
  1 sibling, 1 reply; 20+ messages in thread
From: Nhat Pham @ 2024-01-18 18:32 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, Ronald Monthero, sjenning, ddstreet, vitaly.wool,
	akpm, chrisl, linux-mm, linux-kernel

On Thu, Jan 18, 2024 at 9:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > > On a different note, I wonder if it would help to perform synchronous
> > > > > reclaim here instead. With our current design, the zswap store failure
> > > > > (due to global limit hit) would leave the incoming page going to swap
> > > > > instead, creating an LRU inversion. Not sure if that's ideal.
> > > >
> > > > The global shrink path keeps reclaiming until zswap can accept again
> > > > (by default, that means reclaiming 10% of the total limit). I think
> > > > this is too expensive to be done synchronously.
> > >
> > > That thresholding code is a bit weird right now.
> > >
> > > It wakes the shrinker and rejects at the same time. We're guaranteed
> > > to see rejections, even if the shrinker has no trouble flushing some
> > > entries a split second later.
> > >
> > > It would make more sense to wake the shrinker at e.g. 95% full and
> > > have it run until 90%.

Yep, we should be reclaiming zswap objects way ahead of the pool
limit. Hence the new shrinker, which is memory-pressure driven (i.e.
independent of zswap's internal limits) and will typically be
triggered even if the pool is not full. During experiments, I never
observed the pool becoming full with the default settings. I'd be
happy to extend it (or build in extra shrinking logic) to cover these
pool limits too, if it turns out to be necessary.

> > >
> > > But with that in place we also *should* do synchronous reclaim once we
> > > hit 100%. Just enough to make room for the store. This is important to
> > > catch the case where reclaim rate exceeds swapout rate. Rejecting and
> > > going to swap means the reclaimer will be throttled down to IO rate
> > > anyway, and the app latency isn't any worse. But this way we keep the
> > > pipeline alive, and keep swapping out the oldest zswap entries,
> > > instead of rejecting and swapping what would be the hottest ones.
> >
> > I fully agree with the thresholding code being weird, and with waking
> > up the shrinker before the pool is full. What I don't understand is
> > how we can do synchronous reclaim when we hit 100% and still respect
> > the acceptance threshold :/
> >
> > Are you proposing we change the semantics of the acceptance threshold
> > to begin with?
>
> I kind of am. It's worth looking at the history of this knob.
>
> It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and
> from the changelogs and the code in this patch I do not understand how
> this was supposed to work.
>
> It also *didn't* work for very basic real world applications. See
> Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which
> effectively reverted it to get halfway reasonable behavior.
>
> If there are no good usecases for this knob, then I think it makes
> sense to phase it out again.

Yeah, this was my original proposal - remove this knob altogether :)
Based on a cursory read, it seems zswap originally tried to shrink
(synchronously) one "object", then checked whether the pool size was
back under the limit. This is indeed insufficient. However, I'm not
quite convinced by the solution (hysteresis) either.

Maybe we can synchronously shrink a la Domenico, i.e. until the pool
can accept new pages, but this time capacity-based (maybe under the
limit + some headroom, 1 page for example)? This is just so that the
immediate incoming zswap store succeeds - we can still have the
shrinker freeing up space later on (or maybe keep an asynchronous
pool-limit-based shrinker around).
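
(A hypothetical sketch of that capacity-based variant; helper names
are illustrative, not existing code:)

	/* reclaim synchronously just until the incoming page fits */
	while (zswap_pool_total_size + PAGE_SIZE > zswap_max_pool_size()) {
		if (zswap_reclaim_entry(pool))	/* write back one LRU entry */
			break;			/* cannot make room; reject */
	}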


* Re: [PATCH] mm/zswap: Improve with alloc_workqueue() call
  2024-01-18 18:32                       ` Nhat Pham
@ 2024-02-21 13:32                         ` Ronald Monthero
  0 siblings, 0 replies; 20+ messages in thread
From: Ronald Monthero @ 2024-02-21 13:32 UTC (permalink / raw)
  To: linux-kernel, mm-commits
  Cc: Johannes Weiner, Yosry Ahmed, sjenning, ddstreet, vitaly.wool,
	akpm, Nhat Pham, chrisl, linux-mm

Thanks for the reviews.
This patch is available at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-zswap-improve-with-alloc_workqueue-call.patch

This patch will later appear in the mm-unstable branch at
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

The -mm tree is included into linux-next via the mm-everything
branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

------------------------------------------------------
From: Ronald Monthero <debug.penguin32@gmail.com>
Subject: mm/zswap: improve with alloc_workqueue() call
Date: Tue, 16 Jan 2024 23:31:45 +1000

The core-api create_workqueue() is deprecated; this patch replaces it with
alloc_workqueue().  The previous zswap workqueue was a bound (per-cpu)
workqueue; this patch uses alloc_workqueue() to create an unbound workqueue.
The WQ_UNBOUND attribute is desirable because it does not localize the
workqueue to a specific cpu, leaving the scheduler free to improvise in any
demanding scenarios for offloading cpu time slices for workqueues.  For
example, other workqueues of the same primary cpu may have to be served
which are WQ_HIGHPRI or WQ_CPU_INTENSIVE.  An unbound workqueue also happens
to be more efficient in a system under memory pressure than a bound
workqueue.

shrink_wq = alloc_workqueue("zswap-shrink",
                     WQ_UNBOUND|WQ_MEM_RECLAIM, 1);

Overall the change suggested in this patch should be seamless and does not
alter the existing behavior, other than making the workqueue unbound.

Link: https://lkml.kernel.org/r/20240116133145.12454-1-debug.penguin32@gmail.com
Signed-off-by: Ronald Monthero <debug.penguin32@gmail.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitaly.wool@konsulko.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zswap.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/zswap.c~mm-zswap-improve-with-alloc_workqueue-call
+++ a/mm/zswap.c
@@ -1884,7 +1884,8 @@ static int zswap_setup(void)
                zswap_enabled = false;
        }

-       shrink_wq = create_workqueue("zswap-shrink");
+       shrink_wq = alloc_workqueue("zswap-shrink",
+                       WQ_UNBOUND|WQ_MEM_RECLAIM, 1);
        if (!shrink_wq)
                goto fallback_fail;

_

On Fri, Jan 19, 2024 at 4:33 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 9:39 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Thu, Jan 18, 2024 at 09:06:43AM -0800, Yosry Ahmed wrote:
> > > > > > On a different note, I wonder if it would help to perform synchronous
> > > > > > reclaim here instead. With our current design, the zswap store failure
> > > > > > (due to global limit hit) would leave the incoming page going to swap
> > > > > > instead, creating an LRU inversion. Not sure if that's ideal.
> > > > >
> > > > > The global shrink path keeps reclaiming until zswap can accept again
> > > > > (by default, that means reclaiming 10% of the total limit). I think
> > > > > this is too expensive to be done synchronously.
> > > >
> > > > That thresholding code is a bit weird right now.
> > > >
> > > > It wakes the shrinker and rejects at the same time. We're guaranteed
> > > > to see rejections, even if the shrinker has no trouble flushing some
> > > > entries a split second later.
> > > >
> > > > It would make more sense to wake the shrinker at e.g. 95% full and
> > > > have it run until 90%.
>
> Yep, we should be reclaiming zswap objects way ahead of the pool
> limit. Hence the new shrinker, which is memory-pressure driven (i.e.
> independent of zswap's internal limits) and will typically be
> triggered even if the pool is not full. During experiments, I never
> observed the pool becoming full with the default settings. I'd be
> happy to extend it (or build in extra shrinking logic) to cover these
> pool limits too, if it turns out to be necessary.
>
> > > >
> > > > But with that in place we also *should* do synchronous reclaim once we
> > > > hit 100%. Just enough to make room for the store. This is important to
> > > > catch the case where reclaim rate exceeds swapout rate. Rejecting and
> > > > going to swap means the reclaimer will be throttled down to IO rate
> > > > anyway, and the app latency isn't any worse. But this way we keep the
> > > > pipeline alive, and keep swapping out the oldest zswap entries,
> > > > instead of rejecting and swapping what would be the hottest ones.
> > >
> > > I fully agree with the thresholding code being weird, and with waking
> > > up the shrinker before the pool is full. What I don't understand is
> > > how we can do synchronous reclaim when we hit 100% and still respect
> > > the acceptance threshold :/
> > >
> > > Are you proposing we change the semantics of the acceptance threshold
> > > to begin with?
> >
> > I kind of am. It's worth looking at the history of this knob.
> >
> > It was added in 2020 by 45190f01dd402112d3d22c0ddc4152994f9e1e55, and
> > from the changelogs and the code in this patch I do not understand how
> > this was supposed to work.
> >
> > It also *didn't* work for very basic real world applications. See
> > Domenico's follow-up (e0228d590beb0d0af345c58a282f01afac5c57f3), which
> > effectively reverted it to get halfway reasonable behavior.
> >
> > If there are no good usecases for this knob, then I think it makes
> > sense to phase it out again.
>
> Yeah, this was my original proposal - remove this knob altogether :)
> Based on a cursory read, it seems zswap originally tried to shrink
> (synchronously) one "object", then checked whether the pool size was
> back under the limit. This is indeed insufficient. However, I'm not
> quite convinced by the solution (hysteresis) either.
>
> Maybe we can synchronously shrink a la Domenico, i.e. until the pool
> can accept new pages, but this time capacity-based (maybe under the
> limit + some headroom, 1 page for example)? This is just so that the
> immediate incoming zswap store succeeds - we can still have the
> shrinker freeing up space later on (or maybe keep an asynchronous
> pool-limit-based shrinker around).

