linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] psi: reduce min window size to 50ms
@ 2023-02-10 22:31 Sudarshan Rajagopalan
  2023-02-10 22:31 ` Sudarshan Rajagopalan
  2023-02-10 23:03 ` Suren Baghdasaryan
  0 siblings, 2 replies; 20+ messages in thread
From: Sudarshan Rajagopalan @ 2023-02-10 22:31 UTC (permalink / raw)
  To: David Hildenbrand, Johannes Weiner, Suren Baghdasaryan,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm
  Cc: Sudarshan Rajagopalan, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

The PSI mechanism is useful tool to monitor pressure stall
information in the system. Currently, the minimum window size
is set to 500ms. May we know what is the rationale for this?

For lightweight systems such as Linux Embedded Systems, PSI
can be used to monitor and track memory pressure building up
in the system and respond quickly to such memory demands.
Example, the Linux Embedded Systems could be a secondary VM
system which requests for memory from Primary host. With 500ms
window size, the sampling period is 50ms (one-tenth of windwo
size). So the minimum amount of time the process needs to stall,
so that a PSI event can be generated and actions can be done
is 50ms. This reaction time can be much reduced by reducing the
sampling time (by reducing window size), so that responses to
such memory pressures in system can be serviced much quicker.

Please let us know your thoughts on reducing window size to 50ms.

Sudarshan Rajagopalan (1):
  psi: reduce min window size to 50ms

 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH] psi: reduce min window size to 50ms
  2023-02-10 22:31 [PATCH] psi: reduce min window size to 50ms Sudarshan Rajagopalan
@ 2023-02-10 22:31 ` Sudarshan Rajagopalan
  2023-02-10 23:03 ` Suren Baghdasaryan
  1 sibling, 0 replies; 20+ messages in thread
From: Sudarshan Rajagopalan @ 2023-02-10 22:31 UTC (permalink / raw)
  To: David Hildenbrand, Johannes Weiner, Suren Baghdasaryan,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm
  Cc: Sudarshan Rajagopalan, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

Few systems would require much finer-grained tracking of memory
pressure in the system using PSI mechanism. Reduce the minimum
allowable window size to be 50ms to increase the sampling rate
of PSI monitor for much faster response and reaction to memory
pressures in the system. With 50ms window size, the smallest
resolution of memory pressure that can be tracked is now 5ms.

Signed-off-by: Sudarshan Rajagopalan <quic_sudaraja@quicinc.com>
---
 kernel/sched/psi.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee2ecc0..6eabf27 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -160,7 +160,7 @@ __setup("psi=", setup_psi);
 #define EXP_300s	2034		/* 1/exp(2s/300s) */
 
 /* PSI trigger definitions */
-#define WINDOW_MIN_US 500000	/* Min window size is 500ms */
+#define WINDOW_MIN_US 50000	/* Min window size is 50ms */
 #define WINDOW_MAX_US 10000000	/* Max window size is 10s */
 #define UPDATES_PER_WINDOW 10	/* 10 updates per window */
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-10 22:31 [PATCH] psi: reduce min window size to 50ms Sudarshan Rajagopalan
  2023-02-10 22:31 ` Sudarshan Rajagopalan
@ 2023-02-10 23:03 ` Suren Baghdasaryan
  2023-02-11  0:44   ` Sudarshan Rajagopalan
  1 sibling, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-10 23:03 UTC (permalink / raw)
  To: Sudarshan Rajagopalan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
<quic_sudaraja@quicinc.com> wrote:
>
> The PSI mechanism is useful tool to monitor pressure stall
> information in the system. Currently, the minimum window size
> is set to 500ms. May we know what is the rationale for this?

The limit was set to avoid regressions in performance and power
consumption if the window is set too small and the system ends up
polling too frequently. That said, the limit was chosen based on
results of specific experiments which might not represent all
usecases. If you want to change this limit, you would need to describe
why the new limit is inherently better than the current one (why not
higher, why not lower).
Thanks,
Suren.

>
> For lightweight systems such as Linux Embedded Systems, PSI
> can be used to monitor and track memory pressure building up
> in the system and respond quickly to such memory demands.
> Example, the Linux Embedded Systems could be a secondary VM
> system which requests for memory from Primary host. With 500ms
> window size, the sampling period is 50ms (one-tenth of windwo
> size). So the minimum amount of time the process needs to stall,
> so that a PSI event can be generated and actions can be done
> is 50ms. This reaction time can be much reduced by reducing the
> sampling time (by reducing window size), so that responses to
> such memory pressures in system can be serviced much quicker.
>
> Please let us know your thoughts on reducing window size to 50ms.
>
> Sudarshan Rajagopalan (1):
>   psi: reduce min window size to 50ms
>
>  kernel/sched/psi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-10 23:03 ` Suren Baghdasaryan
@ 2023-02-11  0:44   ` Sudarshan Rajagopalan
  2023-02-11  1:09     ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Sudarshan Rajagopalan @ 2023-02-11  0:44 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, Sudarshan Rajagopalan


On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
> <quic_sudaraja@quicinc.com> wrote:
>> The PSI mechanism is useful tool to monitor pressure stall
>> information in the system. Currently, the minimum window size
>> is set to 500ms. May we know what is the rationale for this?
> The limit was set to avoid regressions in performance and power
> consumption if the window is set too small and the system ends up
> polling too frequently. That said, the limit was chosen based on
> results of specific experiments which might not represent all

Rightly as you said, the effect on power and performance depends on type 
of the system - embedded systems, or Android mobile, or commercial VMs 
or servers. With higher PSI sampling, it may not be much of power impact 
to embedded systems with low-tier chipsets or performance impact to 
powerful servers.

> usecases. If you want to change this limit, you would need to describe
> why the new limit is inherently better than the current one (why not
> higher, why not lower).

This is in regards to the userspace daemon [1] that we are working on, 
that dynamically resizes the VM memory based on PSI memory pressure 
events. With current min window size of 500ms, the PSI monitor sampling 
period would be 50ms. So to detect increase in memory demand in system 
and plug-in memory into VM when pressure goes up, the minimum time the 
process needs to stall for is 50ms before a event can be generated and 
sent out to userspace and the daemon can do actions.

This again I'm talking w.r.t. lightweight embedded systems, where even 
background kswapd/kcompd (which I'm calling it as natural memory 
pressure) in the system would be less than 5-10ms stall. So any stall 
more than 5-10ms would "hint" us that a memory consuming usecase has 
ranB  and memory may need to be plugged in.

So in these cases, having as low as 5ms psimon sampling time would give 
us faster reaction time and daemon can be responsive more quickly. In 
general, this will reduce the malloc latencies significantly.

Pasting here the same excerpt I mentioned in [1].

"

4. Detecting increase in memory demand b\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-11  0:44   ` Sudarshan Rajagopalan
@ 2023-02-11  1:09     ` Suren Baghdasaryan
  2023-02-11  1:46       ` Sudarshan Rajagopalan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-11  1:09 UTC (permalink / raw)
  To: Sudarshan Rajagopalan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan
<quic_sudaraja@quicinc.com> wrote:
>
>
> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
> > On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
> > <quic_sudaraja@quicinc.com> wrote:
> >> The PSI mechanism is useful tool to monitor pressure stall
> >> information in the system. Currently, the minimum window size
> >> is set to 500ms. May we know what is the rationale for this?
> > The limit was set to avoid regressions in performance and power
> > consumption if the window is set too small and the system ends up
> > polling too frequently. That said, the limit was chosen based on
> > results of specific experiments which might not represent all
>
> Rightly as you said, the effect on power and performance depends on type
> of the system - embedded systems, or Android mobile, or commercial VMs
> or servers. With higher PSI sampling, it may not be much of power impact
> to embedded systems with low-tier chipsets or performance impact to
> powerful servers.
>
> > usecases. If you want to change this limit, you would need to describe
> > why the new limit is inherently better than the current one (why not
> > higher, why not lower).
>
> This is in regards to the userspace daemon [1] that we are working on,
> that dynamically resizes the VM memory based on PSI memory pressure
> events. With current min window size of 500ms, the PSI monitor sampling
> period would be 50ms. So to detect increase in memory demand in system
> and plug-in memory into VM when pressure goes up, the minimum time the
> process needs to stall for is 50ms before a event can be generated and
> sent out to userspace and the daemon can do actions.
>
> This again I'm talking w.r.t. lightweight embedded systems, where even
> background kswapd/kcompd (which I'm calling it as natural memory
> pressure) in the system would be less than 5-10ms stall. So any stall
> more than 5-10ms would "hint" us that a memory consuming usecase has
> ranB  and memory may need to be plugged in.
>
> So in these cases, having as low as 5ms psimon sampling time would give
> us faster reaction time and daemon can be responsive more quickly. In
> general, this will reduce the malloc latencies significantly.
>
> Pasting here the same excerpt I mentioned in [1].

My question is: why do you think 5ms is the optimal limit here? I want
to avoid a race to the bottom where next time someone can argue that
they would like to detect a stall within a lower period than 5ms.
Technically the limit can be as small as one wants but at some point I
think we should consider the possibility of this being used for a DoS
attack.


>
> "
>
> 4. Detecting increase in memory demand b   when a certain usecase starts
> in VM that does memory allocations, it will stall causing PSI mechanism
> to generate a memory pressure event to userspace. To simply put, when
> pressure increases certain set threshold, it can make educated guess
> that a memory requiring usecase has ran and VM system needs memory to be
> added.
>
> "
>
> [1]
> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
>
> > Thanks,
> > Suren.
> >
> >> For lightweight systems such as Linux Embedded Systems, PSI
> >> can be used to monitor and track memory pressure building up
> >> in the system and respond quickly to such memory demands.
> >> Example, the Linux Embedded Systems could be a secondary VM
> >> system which requests for memory from Primary host. With 500ms
> >> window size, the sampling period is 50ms (one-tenth of windwo
> >> size). So the minimum amount of time the process needs to stall,
> >> so that a PSI event can be generated and actions can be done
> >> is 50ms. This reaction time can be much reduced by reducing the
> >> sampling time (by reducing window size), so that responses to
> >> such memory pressures in system can be serviced much quicker.
> >>
> >> Please let us know your thoughts on reducing window size to 50ms.
> >>
> >> Sudarshan Rajagopalan (1):
> >>    psi: reduce min window size to 50ms
> >>
> >>   kernel/sched/psi.c | 2 +-
> >>   1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> --
> >> 2.7.4
> >>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-11  1:09     ` Suren Baghdasaryan
@ 2023-02-11  1:46       ` Sudarshan Rajagopalan
  2023-02-11  2:13         ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Sudarshan Rajagopalan @ 2023-02-11  1:46 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly


On 2/10/2023 5:09 PM, Suren Baghdasaryan wrote:
> On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan
> <quic_sudaraja@quicinc.com> wrote:
>>
>> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
>>> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
>>> <quic_sudaraja@quicinc.com> wrote:
>>>> The PSI mechanism is useful tool to monitor pressure stall
>>>> information in the system. Currently, the minimum window size
>>>> is set to 500ms. May we know what is the rationale for this?
>>> The limit was set to avoid regressions in performance and power
>>> consumption if the window is set too small and the system ends up
>>> polling too frequently. That said, the limit was chosen based on
>>> results of specific experiments which might not represent all
>> Rightly as you said, the effect on power and performance depends on type
>> of the system - embedded systems, or Android mobile, or commercial VMs
>> or servers. With higher PSI sampling, it may not be much of power impact
>> to embedded systems with low-tier chipsets or performance impact to
>> powerful servers.
>>
>>> usecases. If you want to change this limit, you would need to describe
>>> why the new limit is inherently better than the current one (why not
>>> higher, why not lower).
>> This is in regards to the userspace daemon [1] that we are working on,
>> that dynamically resizes the VM memory based on PSI memory pressure
>> events. With current min window size of 500ms, the PSI monitor sampling
>> period would be 50ms. So to detect increase in memory demand in system
>> and plug-in memory into VM when pressure goes up, the minimum time the
>> process needs to stall for is 50ms before a event can be generated and
>> sent out to userspace and the daemon can do actions.
>>
>> This again I'm talking w.r.t. lightweight embedded systems, where even
>> background kswapd/kcompd (which I'm calling it as natural memory
>> pressure) in the system would be less than 5-10ms stall. So any stall
>> more than 5-10ms would "hint" us that a memory consuming usecase has
>> ranB  and memory may need to be plugged in.
>>
>> So in these cases, having as low as 5ms psimon sampling time would give
>> us faster reaction time and daemon can be responsive more quickly. In
>> general, this will reduce the malloc latencies significantly.
>>
>> Pasting here the same excerpt I mentioned in [1].
> My question is: why do you think 5ms is the optimal limit here? I want
> to avoid a race to the bottom where next time someone can argue that
> they would like to detect a stall within a lower period than 5ms.
> Technically the limit can be as small as one wants but at some point I
> think we should consider the possibility of this being used for a DoS
> attack.

Well the optimal limit should be something which is least destructive? I 
do understand about possibility of DoS attacks, but wouldn't that still 
be possible with 500ms window today? Which will atleast be 1/10th less 
severe compared to 50ms window. The way I see it is - min pressure 
sampling should be such that even the least pressure stall which we 
think is significant should be captured (this could be 5ms or 50ms at 
present) while balancing the power and performance impact across all 
usecases.

At present, Android's LMKD sets 1000ms as window for which it considers 
100ms sampling to be significant. And here, with psi_daemon usecase we 
are saying 5ms sampling would be significant. So there's no actual 
optimal limit, but we must limit as much possible without effecting 
power or performance as a whole. Also, this is just the "minimum 
allowable" window, and system admins can configure it as per the system 
type/requirement.

Also, about possible DoS attacks - file permissions for 
/proc/pressure/... can be set such that not any random user can register 
to psi events right?

>
>> "
>>
>> 4. Detecting increase in memory demand b   when a certain usecase starts
>> in VM that does memory allocations, it will stall causing PSI mechanism
>> to generate a memory pressure event to userspace. To simply put, when
>> pressure increases certain set threshold, it can make educated guess
>> that a memory requiring usecase has ran and VM system needs memory to be
>> added.
>>
>> "
>>
>> [1]
>> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
>>
>>> Thanks,
>>> Suren.
>>>
>>>> For lightweight systems such as Linux Embedded Systems, PSI
>>>> can be used to monitor and track memory pressure building up
>>>> in the system and respond quickly to such memory demands.
>>>> Example, the Linux Embedded Systems could be a secondary VM
>>>> system which requests for memory from Primary host. With 500ms
>>>> window size, the sampling period is 50ms (one-tenth of windwo
>>>> size). So the minimum amount of time the process needs to stall,
>>>> so that a PSI event can be generated and actions can be done
>>>> is 50ms. This reaction time can be much reduced by reducing the
>>>> sampling time (by reducing window size), so that responses to
>>>> such memory pressures in system can be serviced much quicker.
>>>>
>>>> Please let us know your thoughts on reducing window size to 50ms.
>>>>
>>>> Sudarshan Rajagopalan (1):
>>>>     psi: reduce min window size to 50ms
>>>>
>>>>    kernel/sched/psi.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> --
>>>> 2.7.4
>>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-11  1:46       ` Sudarshan Rajagopalan
@ 2023-02-11  2:13         ` Suren Baghdasaryan
  2023-02-14  2:12           ` Sudarshan Rajagopalan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-11  2:13 UTC (permalink / raw)
  To: Sudarshan Rajagopalan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Fri, Feb 10, 2023 at 5:46 PM Sudarshan Rajagopalan
<quic_sudaraja@quicinc.com> wrote:
>
>
> On 2/10/2023 5:09 PM, Suren Baghdasaryan wrote:
> > On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan
> > <quic_sudaraja@quicinc.com> wrote:
> >>
> >> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
> >>> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
> >>> <quic_sudaraja@quicinc.com> wrote:
> >>>> The PSI mechanism is useful tool to monitor pressure stall
> >>>> information in the system. Currently, the minimum window size
> >>>> is set to 500ms. May we know what is the rationale for this?
> >>> The limit was set to avoid regressions in performance and power
> >>> consumption if the window is set too small and the system ends up
> >>> polling too frequently. That said, the limit was chosen based on
> >>> results of specific experiments which might not represent all
> >> Rightly as you said, the effect on power and performance depends on type
> >> of the system - embedded systems, or Android mobile, or commercial VMs
> >> or servers. With higher PSI sampling, it may not be much of power impact
> >> to embedded systems with low-tier chipsets or performance impact to
> >> powerful servers.
> >>
> >>> usecases. If you want to change this limit, you would need to describe
> >>> why the new limit is inherently better than the current one (why not
> >>> higher, why not lower).
> >> This is in regards to the userspace daemon [1] that we are working on,
> >> that dynamically resizes the VM memory based on PSI memory pressure
> >> events. With current min window size of 500ms, the PSI monitor sampling
> >> period would be 50ms. So to detect increase in memory demand in system
> >> and plug-in memory into VM when pressure goes up, the minimum time the
> >> process needs to stall for is 50ms before a event can be generated and
> >> sent out to userspace and the daemon can do actions.
> >>
> >> This again I'm talking w.r.t. lightweight embedded systems, where even
> >> background kswapd/kcompd (which I'm calling it as natural memory
> >> pressure) in the system would be less than 5-10ms stall. So any stall
> >> more than 5-10ms would "hint" us that a memory consuming usecase has
> >> ranB  and memory may need to be plugged in.
> >>
> >> So in these cases, having as low as 5ms psimon sampling time would give
> >> us faster reaction time and daemon can be responsive more quickly. In
> >> general, this will reduce the malloc latencies significantly.
> >>
> >> Pasting here the same excerpt I mentioned in [1].
> > My question is: why do you think 5ms is the optimal limit here? I want
> > to avoid a race to the bottom where next time someone can argue that
> > they would like to detect a stall within a lower period than 5ms.
> > Technically the limit can be as small as one wants but at some point I
> > think we should consider the possibility of this being used for a DoS
> > attack.
>
> Well the optimal limit should be something which is least destructive? I
> do understand about possibility of DoS attacks, but wouldn't that still
> be possible with 500ms window today? Which will atleast be 1/10th less
> severe compared to 50ms window. The way I see it is - min pressure
> sampling should be such that even the least pressure stall which we
> think is significant should be captured (this could be 5ms or 50ms at
> present) while balancing the power and performance impact across all
> usecases.
>
> At present, Android's LMKD sets 1000ms as window for which it considers
> 100ms sampling to be significant. And here, with psi_daemon usecase we
> are saying 5ms sampling would be significant. So there's no actual
> optimal limit, but we must limit as much possible without effecting
> power or performance as a whole. Also, this is just the "minimum
> allowable" window, and system admins can configure it as per the system
> type/requirement.

Ok, let me ask you another way which might be more productive. What
caused you to choose 5ms as the time you care to react to a stall
buildup?

>
> Also, about possible DoS attacks - file permissions for
> /proc/pressure/... can be set such that not any random user can register
> to psi events right?

True. We have a CAP_SYS_RESOURCE check for the writers of these files.

>
> >
> >> "
> >>
> >> 4. Detecting increase in memory demand b   when a certain usecase starts
> >> in VM that does memory allocations, it will stall causing PSI mechanism
> >> to generate a memory pressure event to userspace. To simply put, when
> >> pressure increases certain set threshold, it can make educated guess
> >> that a memory requiring usecase has ran and VM system needs memory to be
> >> added.
> >>
> >> "
> >>
> >> [1]
> >> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
> >>
> >>> Thanks,
> >>> Suren.
> >>>
> >>>> For lightweight systems such as Linux Embedded Systems, PSI
> >>>> can be used to monitor and track memory pressure building up
> >>>> in the system and respond quickly to such memory demands.
> >>>> Example, the Linux Embedded Systems could be a secondary VM
> >>>> system which requests for memory from Primary host. With 500ms
> >>>> window size, the sampling period is 50ms (one-tenth of windwo
> >>>> size). So the minimum amount of time the process needs to stall,
> >>>> so that a PSI event can be generated and actions can be done
> >>>> is 50ms. This reaction time can be much reduced by reducing the
> >>>> sampling time (by reducing window size), so that responses to
> >>>> such memory pressures in system can be serviced much quicker.
> >>>>
> >>>> Please let us know your thoughts on reducing window size to 50ms.
> >>>>
> >>>> Sudarshan Rajagopalan (1):
> >>>>     psi: reduce min window size to 50ms
> >>>>
> >>>>    kernel/sched/psi.c | 2 +-
> >>>>    1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> --
> >>>> 2.7.4
> >>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-11  2:13         ` Suren Baghdasaryan
@ 2023-02-14  2:12           ` Sudarshan Rajagopalan
  2023-02-14 19:34             ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Sudarshan Rajagopalan @ 2023-02-14  2:12 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, Sudarshan Rajagopalan


On 2/10/2023 6:13 PM, Suren Baghdasaryan wrote:
> On Fri, Feb 10, 2023 at 5:46 PM Sudarshan Rajagopalan
> <quic_sudaraja@quicinc.com> wrote:
>>
>> On 2/10/2023 5:09 PM, Suren Baghdasaryan wrote:
>>> On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan
>>> <quic_sudaraja@quicinc.com> wrote:
>>>> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
>>>>> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
>>>>> <quic_sudaraja@quicinc.com> wrote:
>>>>>> The PSI mechanism is useful tool to monitor pressure stall
>>>>>> information in the system. Currently, the minimum window size
>>>>>> is set to 500ms. May we know what is the rationale for this?
>>>>> The limit was set to avoid regressions in performance and power
>>>>> consumption if the window is set too small and the system ends up
>>>>> polling too frequently. That said, the limit was chosen based on
>>>>> results of specific experiments which might not represent all
>>>> Rightly as you said, the effect on power and performance depends on type
>>>> of the system - embedded systems, or Android mobile, or commercial VMs
>>>> or servers. With higher PSI sampling, it may not be much of power impact
>>>> to embedded systems with low-tier chipsets or performance impact to
>>>> powerful servers.
>>>>
>>>>> usecases. If you want to change this limit, you would need to describe
>>>>> why the new limit is inherently better than the current one (why not
>>>>> higher, why not lower).
>>>> This is in regards to the userspace daemon [1] that we are working on,
>>>> that dynamically resizes the VM memory based on PSI memory pressure
>>>> events. With current min window size of 500ms, the PSI monitor sampling
>>>> period would be 50ms. So to detect increase in memory demand in system
>>>> and plug-in memory into VM when pressure goes up, the minimum time the
>>>> process needs to stall for is 50ms before a event can be generated and
>>>> sent out to userspace and the daemon can do actions.
>>>>
>>>> This again I'm talking w.r.t. lightweight embedded systems, where even
>>>> background kswapd/kcompd (which I'm calling it as natural memory
>>>> pressure) in the system would be less than 5-10ms stall. So any stall
>>>> more than 5-10ms would "hint" us that a memory consuming usecase has
>>>> ranB  and memory may need to be plugged in.
>>>>
>>>> So in these cases, having as low as 5ms psimon sampling time would give
>>>> us faster reaction time and daemon can be responsive more quickly. In
>>>> general, this will reduce the malloc latencies significantly.
>>>>
>>>> Pasting here the same excerpt I mentioned in [1].
>>> My question is: why do you think 5ms is the optimal limit here? I want
>>> to avoid a race to the bottom where next time someone can argue that
>>> they would like to detect a stall within a lower period than 5ms.
>>> Technically the limit can be as small as one wants but at some point I
>>> think we should consider the possibility of this being used for a DoS
>>> attack.
>> Well the optimal limit should be something which is least destructive? I
>> do understand about possibility of DoS attacks, but wouldn't that still
>> be possible with 500ms window today? Which will atleast be 1/10th less
>> severe compared to 50ms window. The way I see it is - min pressure
>> sampling should be such that even the least pressure stall which we
>> think is significant should be captured (this could be 5ms or 50ms at
>> present) while balancing the power and performance impact across all
>> usecases.
>>
>> At present, Android's LMKD sets 1000ms as window for which it considers
>> 100ms sampling to be significant. And here, with psi_daemon usecase we
>> are saying 5ms sampling would be significant. So there's no actual
>> optimal limit, but we must limit as much possible without effecting
>> power or performance as a whole. Also, this is just the "minimum
>> allowable" window, and system admins can configure it as per the system
>> type/requirement.
> Ok, let me ask you another way which might be more productive. What
> caused you to choose 5ms as the time you care to react to a stall
> buildup?

We basically want to capture any stalls caused by direct reclaim. And 
ignore any stalls caused by indirect reclaim and alloc retries. Stalls 
due to direct reclaim is what indicates that memory pressure is building 
up in system and memory needs to be free'd (by oom-killer or LMKD 
killing apps) or made available (by plugin-in any available memory or 
requesting memory from Primary host). We see that any stalls above 5ms 
is significant enough that alloc request would've invoked direct 
reclaim, hinting that
memory pressure is starting to build up.

Keeping the 5ms and other numbers aside, lets see what is smallest 
pressure that is of significance to be captured.

A PSI memory stall is wholly comprised of: compaction (kcompactd), 
thrashing, kswapd, direct compaction and direct reclaim. Out of these, 
compaction, thrashing and kswapd stalls may not necessarily give 
significance towards memory demand building up (i.e. system is in need 
of more memory). Direct compaction stall would indicate memory is 
fragmented. But a significant direct reclaim stall would indicate that 
system is in memory pressure. Usually, direct compaction and direct 
reclaim are the smallest in an aggregated PSI memory stall.

So now the question - what is the smallest direct reclaim stall that we 
should capture, which would be significant to us? Now this depends on 
system type and configuration and the nature of work loads. For Android 
mobile maybe 100ms (lmkd), and for servers maybe 1s (because max window 
is 10s?). For Linux Embedded Systems, this would be even smaller. From 
our experiments, we observed that 5ms stall would be significant to 
capture direct reclaim stalls that would indicate pressure build up.

I think the min window size should be set such that even the smallest 
pressure stall that we think is significant should be be captured. 
Rather than hard-coding min window to 500ms, let the system admin choose 
what's best? We anyway have bigger cap set for max window of 10s (though 
I hardly doubt one would think 1s is the least pressure that they care 
of for cpu/io/memory). Also these window thresholds have never changed 
since psi monitor was introduced in kernel.org, and are based on 
previous experiments which may not have represented all workloads.

Finding the true bottom of the well would be hard. But to keep things in 
ms range, we can define range 1ms-500ms in Kconfig:

--- a/init/Kconfig
+++ b/init/Kconfig

+config PSI_MIN_WINDOW_MS
+B B B B B B  int "Minimum PSI window (ms)"
+B B B B B B  range 1 500
+B B B B B B  default 500
+
+

With PSI mechanism finding more of its uses, the same requirement might 
be applicable for io and cpu as well. Giving more flexibility in setting 
window size and sampling period would be beneficial.

>> Also, about possible DoS attacks - file permissions for
>> /proc/pressure/... can be set such that not any random user can register
>> to psi events right?
> True. We have a CAP_SYS_RESOURCE check for the writers of these files.
>
>>>> "
>>>>
>>>> 4. Detecting increase in memory demand b   when a certain usecase starts
>>>> in VM that does memory allocations, it will stall causing PSI mechanism
>>>> to generate a memory pressure event to userspace. To simply put, when
>>>> pressure increases certain set threshold, it can make educated guess
>>>> that a memory requiring usecase has ran and VM system needs memory to be
>>>> added.
>>>>
>>>> "
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
>>>>
>>>>> Thanks,
>>>>> Suren.
>>>>>
>>>>>> For lightweight systems such as Linux Embedded Systems, PSI
>>>>>> can be used to monitor and track memory pressure building up
>>>>>> in the system and respond quickly to such memory demands.
>>>>>> Example, the Linux Embedded Systems could be a secondary VM
>>>>>> system which requests for memory from Primary host. With 500ms
>>>>>> window size, the sampling period is 50ms (one-tenth of windwo
>>>>>> size). So the minimum amount of time the process needs to stall,
>>>>>> so that a PSI event can be generated and actions can be done
>>>>>> is 50ms. This reaction time can be much reduced by reducing the
>>>>>> sampling time (by reducing window size), so that responses to
>>>>>> such memory pressures in system can be serviced much quicker.
>>>>>>
>>>>>> Please let us know your thoughts on reducing window size to 50ms.
>>>>>>
>>>>>> Sudarshan Rajagopalan (1):
>>>>>>      psi: reduce min window size to 50ms
>>>>>>
>>>>>>     kernel/sched/psi.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> --
>>>>>> 2.7.4
>>>>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-14  2:12           ` Sudarshan Rajagopalan
@ 2023-02-14 19:34             ` Suren Baghdasaryan
  2023-02-24 12:47               ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-14 19:34 UTC (permalink / raw)
  To: Sudarshan Rajagopalan
  Cc: David Hildenbrand, Johannes Weiner, Mike Rapoport,
	Oscar Salvador, Anshuman Khandual, mark.rutland, will,
	virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, Michal Hocko

On Mon, Feb 13, 2023 at 6:12 PM Sudarshan Rajagopalan
<quic_sudaraja@quicinc.com> wrote:
>
>
> On 2/10/2023 6:13 PM, Suren Baghdasaryan wrote:
> > On Fri, Feb 10, 2023 at 5:46 PM Sudarshan Rajagopalan
> > <quic_sudaraja@quicinc.com> wrote:
> >>
> >> On 2/10/2023 5:09 PM, Suren Baghdasaryan wrote:
> >>> On Fri, Feb 10, 2023 at 4:45 PM Sudarshan Rajagopalan
> >>> <quic_sudaraja@quicinc.com> wrote:
> >>>> On 2/10/2023 3:03 PM, Suren Baghdasaryan wrote:
> >>>>> On Fri, Feb 10, 2023 at 2:31 PM Sudarshan Rajagopalan
> >>>>> <quic_sudaraja@quicinc.com> wrote:
> >>>>>> The PSI mechanism is useful tool to monitor pressure stall
> >>>>>> information in the system. Currently, the minimum window size
> >>>>>> is set to 500ms. May we know what is the rationale for this?
> >>>>> The limit was set to avoid regressions in performance and power
> >>>>> consumption if the window is set too small and the system ends up
> >>>>> polling too frequently. That said, the limit was chosen based on
> >>>>> results of specific experiments which might not represent all
> >>>> Rightly as you said, the effect on power and performance depends on type
> >>>> of the system - embedded systems, or Android mobile, or commercial VMs
> >>>> or servers. With higher PSI sampling, it may not be much of power impact
> >>>> to embedded systems with low-tier chipsets or performance impact to
> >>>> powerful servers.
> >>>>
> >>>>> usecases. If you want to change this limit, you would need to describe
> >>>>> why the new limit is inherently better than the current one (why not
> >>>>> higher, why not lower).
> >>>> This is in regards to the userspace daemon [1] that we are working on,
> >>>> that dynamically resizes the VM memory based on PSI memory pressure
> >>>> events. With current min window size of 500ms, the PSI monitor sampling
> >>>> period would be 50ms. So to detect increase in memory demand in system
> >>>> and plug-in memory into VM when pressure goes up, the minimum time the
> >>>> process needs to stall for is 50ms before a event can be generated and
> >>>> sent out to userspace and the daemon can do actions.
> >>>>
> >>>> This again I'm talking w.r.t. lightweight embedded systems, where even
> >>>> background kswapd/kcompd (which I'm calling it as natural memory
> >>>> pressure) in the system would be less than 5-10ms stall. So any stall
> >>>> more than 5-10ms would "hint" us that a memory consuming usecase has
> >>>> ranB  and memory may need to be plugged in.
> >>>>
> >>>> So in these cases, having as low as 5ms psimon sampling time would give
> >>>> us faster reaction time and daemon can be responsive more quickly. In
> >>>> general, this will reduce the malloc latencies significantly.
> >>>>
> >>>> Pasting here the same excerpt I mentioned in [1].
> >>> My question is: why do you think 5ms is the optimal limit here? I want
> >>> to avoid a race to the bottom where next time someone can argue that
> >>> they would like to detect a stall within a lower period than 5ms.
> >>> Technically the limit can be as small as one wants but at some point I
> >>> think we should consider the possibility of this being used for a DoS
> >>> attack.
> >> Well the optimal limit should be something which is least destructive? I
> >> do understand about possibility of DoS attacks, but wouldn't that still
> >> be possible with 500ms window today? Which will atleast be 1/10th less
> >> severe compared to 50ms window. The way I see it is - min pressure
> >> sampling should be such that even the least pressure stall which we
> >> think is significant should be captured (this could be 5ms or 50ms at
> >> present) while balancing the power and performance impact across all
> >> usecases.
> >>
> >> At present, Android's LMKD sets 1000ms as window for which it considers
> >> 100ms sampling to be significant. And here, with psi_daemon usecase we
> >> are saying 5ms sampling would be significant. So there's no actual
> >> optimal limit, but we must limit as much possible without effecting
> >> power or performance as a whole. Also, this is just the "minimum
> >> allowable" window, and system admins can configure it as per the system
> >> type/requirement.
> > Ok, let me ask you another way which might be more productive. What
> > caused you to choose 5ms as the time you care to react to a stall
> > buildup?
>
> We basically want to capture any stalls caused by direct reclaim. And
> ignore any stalls caused by indirect reclaim and alloc retries. Stalls
> due to direct reclaim is what indicates that memory pressure is building
> up in system and memory needs to be free'd (by oom-killer or LMKD
> killing apps) or made available (by plugin-in any available memory or
> requesting memory from Primary host). We see that any stalls above 5ms
> is significant enough that alloc request would've invoked direct
> reclaim, hinting that
> memory pressure is starting to build up.
>
> Keeping the 5ms and other numbers aside, lets see what is smallest
> pressure that is of significance to be captured.
>
> A PSI memory stall is wholly comprised of: compaction (kcompactd),
> thrashing, kswapd, direct compaction and direct reclaim. Out of these,
> compaction, thrashing and kswapd stalls may not necessarily give
> significance towards memory demand building up (i.e. system is in need
> of more memory). Direct compaction stall would indicate memory is
> fragmented. But a significant direct reclaim stall would indicate that
> system is in memory pressure. Usually, direct compaction and direct
> reclaim are the smallest in an aggregated PSI memory stall.
>
> So now the question - what is the smallest direct reclaim stall that we
> should capture, which would be significant to us? Now this depends on
> system type and configuration and the nature of work loads. For Android
> mobile maybe 100ms (lmkd), and for servers maybe 1s (because max window
> is 10s?). For Linux Embedded Systems, this would be even smaller. From
> our experiments, we observed that 5ms stall would be significant to
> capture direct reclaim stalls that would indicate pressure build up.
>
> I think the min window size should be set such that even the smallest
> pressure stall that we think is significant should be be captured.
> Rather than hard-coding min window to 500ms, let the system admin choose
> what's best? We anyway have bigger cap set for max window of 10s (though
> I hardly doubt one would think 1s is the least pressure that they care
> of for cpu/io/memory). Also these window thresholds have never changed
> since psi monitor was introduced in kernel.org, and are based on
> previous experiments which may not have represented all workloads.
>
> Finding the true bottom of the well would be hard. But to keep things in
> ms range, we can define range 1ms-500ms in Kconfig:
>
> --- a/init/Kconfig
> +++ b/init/Kconfig
>
> +config PSI_MIN_WINDOW_MS
> +B B B B B B  int "Minimum PSI window (ms)"
> +B B B B B B  range 1 500
> +B B B B B B  default 500
> +
> +
>
> With PSI mechanism finding more of its uses, the same requirement might
> be applicable for io and cpu as well. Giving more flexibility in setting
> window size and sampling period would be beneficial.

Hmm. I understand the need and I still don't see a definite answer why
5ms minimum is optimal. The above description argues that 5ms is
indicative of direct reclaim but if some kernel internals change and
system stalls less during direct reclaim (for example, experimenting
with MGLRU we see less stalls), 5ms might end up being too high. You
would need to adjust the limit again.
Your suggestion to have this limit configurable sounds like obvious
solution. I would like to get some opinions from other maintainers.
Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
related to memory stalls.


>
> >> Also, about possible DoS attacks - file permissions for
> >> /proc/pressure/... can be set such that not any random user can register
> >> to psi events right?
> > True. We have a CAP_SYS_RESOURCE check for the writers of these files.
> >
> >>>> "
> >>>>
> >>>> 4. Detecting increase in memory demand b   when a certain usecase starts
> >>>> in VM that does memory allocations, it will stall causing PSI mechanism
> >>>> to generate a memory pressure event to userspace. To simply put, when
> >>>> pressure increases certain set threshold, it can make educated guess
> >>>> that a memory requiring usecase has ran and VM system needs memory to be
> >>>> added.
> >>>>
> >>>> "
> >>>>
> >>>> [1]
> >>>> https://lore.kernel.org/linux-arm-kernel/1bf30145-22a5-cc46-e583-25053460b105@redhat.com/T/#m95ccf038c568271e759a277a08b8e44e51e8f90b
> >>>>
> >>>>> Thanks,
> >>>>> Suren.
> >>>>>
> >>>>>> For lightweight systems such as Linux Embedded Systems, PSI
> >>>>>> can be used to monitor and track memory pressure building up
> >>>>>> in the system and respond quickly to such memory demands.
> >>>>>> Example, the Linux Embedded Systems could be a secondary VM
> >>>>>> system which requests for memory from Primary host. With 500ms
> >>>>>> window size, the sampling period is 50ms (one-tenth of windwo
> >>>>>> size). So the minimum amount of time the process needs to stall,
> >>>>>> so that a PSI event can be generated and actions can be done
> >>>>>> is 50ms. This reaction time can be much reduced by reducing the
> >>>>>> sampling time (by reducing window size), so that responses to
> >>>>>> such memory pressures in system can be serviced much quicker.
> >>>>>>
> >>>>>> Please let us know your thoughts on reducing window size to 50ms.
> >>>>>>
> >>>>>> Sudarshan Rajagopalan (1):
> >>>>>>      psi: reduce min window size to 50ms
> >>>>>>
> >>>>>>     kernel/sched/psi.c | 2 +-
> >>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>
> >>>>>> --
> >>>>>> 2.7.4
> >>>>>>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-14 19:34             ` Suren Baghdasaryan
@ 2023-02-24 12:47               ` Michal Hocko
  2023-02-24 21:07                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-02-24 12:47 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
[...]
> Your suggestion to have this limit configurable sounds like obvious
> solution. I would like to get some opinions from other maintainers.
> Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> related to memory stalls.

I do not think that making this configurable helps much. Many users will
be bound to distribution config and also it would be hard to experiment
with a recompile cycle every time. This seems just too impractical.

Is there any reason why we shouldn't allow any timeout? Shorter
timeouts could be restricted to a priviledged context to avoid an easy
way to swamp system by too frequent polling.

Btw. it seems that there is is only a limit on a single trigger per fd
but no limits per user so it doesn't sound too hard to end up with too
much polling even with a larger timeouts. To me it seems like we need to
contain the polling thread to be bound by the cpu controller.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-24 12:47               ` Michal Hocko
@ 2023-02-24 21:07                 ` Suren Baghdasaryan
  2023-02-27 13:34                   ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-24 21:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
> [...]
> > Your suggestion to have this limit configurable sounds like obvious
> > solution. I would like to get some opinions from other maintainers.
> > Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> > related to memory stalls.
>
> I do not think that making this configurable helps much. Many users will
> be bound to distribution config and also it would be hard to experiment
> with a recompile cycle every time. This seems just too impractical.
>
> Is there any reason why we shouldn't allow any timeout? Shorter
> timeouts could be restricted to a priviledged context to avoid an easy
> way to swamp system by too frequent polling.

Hmm, ok. Maybe then we just ensure that only privileged users can set
triggers and remove the min limit (use a >0 check)?

>
> Btw. it seems that there is is only a limit on a single trigger per fd
> but no limits per user so it doesn't sound too hard to end up with too
> much polling even with a larger timeouts. To me it seems like we need to
> contain the polling thread to be bound by the cpu controller.

Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
poll_min_period for each thread is chosen as the min() of polling
periods between triggers created in that group. So, a bad trigger that
causes overly aggressive polling and polling thread being throttled,
might affect other triggers in that cgroup.
I would prefer to deny creation of a new trigger if it would cause too
much polling. However, that seems to be getting into the territory of
"implementing policy inside the kernel". Maybe we just limit trigger
creation to privileged processes only and let those privileged system
components worry about trigger creation policies? These system
processes can also limit cpu shares of all "psimon" threads using cpu
controller, if desired. WDYT?

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-24 21:07                 ` Suren Baghdasaryan
@ 2023-02-27 13:34                   ` Michal Hocko
  2023-02-27 17:49                     ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-02-27 13:34 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly

On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
> > [...]
> > > Your suggestion to have this limit configurable sounds like obvious
> > > solution. I would like to get some opinions from other maintainers.
> > > Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> > > related to memory stalls.
> >
> > I do not think that making this configurable helps much. Many users will
> > be bound to distribution config and also it would be hard to experiment
> > with a recompile cycle every time. This seems just too impractical.
> >
> > Is there any reason why we shouldn't allow any timeout? Shorter
> > timeouts could be restricted to a priviledged context to avoid an easy
> > way to swamp system by too frequent polling.
> 
> Hmm, ok. Maybe then we just ensure that only privileged users can set
> triggers and remove the min limit (use a >0 check)?

This could break existing userspace which is not privileged. I would
just go with CAP_SYS_NICE or similar with small (sub min) timeouts.

> > Btw. it seems that there is is only a limit on a single trigger per fd
> > but no limits per user so it doesn't sound too hard to end up with too
> > much polling even with a larger timeouts. To me it seems like we need to
> > contain the polling thread to be bound by the cpu controller.
> 
> Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> poll_min_period for each thread is chosen as the min() of polling
> periods between triggers created in that group. So, a bad trigger that
> causes overly aggressive polling and polling thread being throttled,
> might affect other triggers in that cgroup.

Yes, and why that would be a problem?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 13:34                   ` Michal Hocko
@ 2023-02-27 17:49                     ` Suren Baghdasaryan
  2023-02-27 19:11                       ` Michal Hocko
  2023-02-27 19:19                       ` Josh Hunt
  0 siblings, 2 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-27 17:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
> > > [...]
> > > > Your suggestion to have this limit configurable sounds like obvious
> > > > solution. I would like to get some opinions from other maintainers.
> > > > Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> > > > related to memory stalls.
> > >
> > > I do not think that making this configurable helps much. Many users will
> > > be bound to distribution config and also it would be hard to experiment
> > > with a recompile cycle every time. This seems just too impractical.
> > >
> > > Is there any reason why we shouldn't allow any timeout? Shorter
> > > timeouts could be restricted to a priviledged context to avoid an easy
> > > way to swamp system by too frequent polling.
> >
> > Hmm, ok. Maybe then we just ensure that only privileged users can set
> > triggers and remove the min limit (use a >0 check)?
>
> This could break existing userspace which is not privileged. I would
> just go with CAP_SYS_NICE or similar with small (sub min) timeouts.

Yeah, that's what I meant. /proc/pressure/* files already check for
CAP_SYS_RESOURCE
(https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c#L1440)
but per-cgroup pressure files do not have this check. I think the
original patch which added this check
(https://lore.kernel.org/all/20210402025833.27599-1-johunt@akamai.com/)
missed the cgroup ones. This should be easy to add but I wonder if
that was left that way intentionally.

CC'ing the author. Josh, Johannes is that inconsistency between system
pressure files and cgroup-specific ones intentional? Can we change
them all to check for CAP_SYS_RESOURCE?

>
> > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > but no limits per user so it doesn't sound too hard to end up with too
> > > much polling even with a larger timeouts. To me it seems like we need to
> > > contain the polling thread to be bound by the cpu controller.
> >
> > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > poll_min_period for each thread is chosen as the min() of polling
> > periods between triggers created in that group. So, a bad trigger that
> > causes overly aggressive polling and polling thread being throttled,
> > might affect other triggers in that cgroup.
>
> Yes, and why that would be a problem?

If unprivileged processes are allowed to add new triggers then a
malicious process can add a bad trigger and affect other legit
processes. That sounds like a problem to me.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 17:49                     ` Suren Baghdasaryan
@ 2023-02-27 19:11                       ` Michal Hocko
  2023-02-27 19:50                         ` Suren Baghdasaryan
  2023-02-27 19:19                       ` Josh Hunt
  1 sibling, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-02-27 19:11 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Mon 27-02-23 09:49:59, Suren Baghdasaryan wrote:
> On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
[...]
> > > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > > but no limits per user so it doesn't sound too hard to end up with too
> > > > much polling even with a larger timeouts. To me it seems like we need to
> > > > contain the polling thread to be bound by the cpu controller.
> > >
> > > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > > poll_min_period for each thread is chosen as the min() of polling
> > > periods between triggers created in that group. So, a bad trigger that
> > > causes overly aggressive polling and polling thread being throttled,
> > > might affect other triggers in that cgroup.
> >
> > Yes, and why that would be a problem?
> 
> If unprivileged processes are allowed to add new triggers then a
> malicious process can add a bad trigger and affect other legit
> processes. That sounds like a problem to me.

Hmm, I am not sure we are on the same page. My argument was that the
monitoring kernel thread should be bound by the same cpu controller so
even if it was excessive it would be bound to the cgroup constrains.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 17:49                     ` Suren Baghdasaryan
  2023-02-27 19:11                       ` Michal Hocko
@ 2023-02-27 19:19                       ` Josh Hunt
  2023-02-27 19:51                         ` Suren Baghdasaryan
  1 sibling, 1 reply; 20+ messages in thread
From: Josh Hunt @ 2023-02-27 19:19 UTC (permalink / raw)
  To: Suren Baghdasaryan, Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly



On 2/27/23 9:49 AM, Suren Baghdasaryan wrote:
> On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
>>
>> On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
>>> On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
>>>>
>>>> On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
>>>> [...]
>>>>> Your suggestion to have this limit configurable sounds like obvious
>>>>> solution. I would like to get some opinions from other maintainers.
>>>>> Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
>>>>> related to memory stalls.
>>>>
>>>> I do not think that making this configurable helps much. Many users will
>>>> be bound to distribution config and also it would be hard to experiment
>>>> with a recompile cycle every time. This seems just too impractical.
>>>>
>>>> Is there any reason why we shouldn't allow any timeout? Shorter
>>>> timeouts could be restricted to a priviledged context to avoid an easy
>>>> way to swamp system by too frequent polling.
>>>
>>> Hmm, ok. Maybe then we just ensure that only privileged users can set
>>> triggers and remove the min limit (use a >0 check)?
>>
>> This could break existing userspace which is not privileged. I would
>> just go with CAP_SYS_NICE or similar with small (sub min) timeouts.
> 
> Yeah, that's what I meant. /proc/pressure/* files already check for
> CAP_SYS_RESOURCE
> (https://urldefense.com/v3/__https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c*L1440__;Iw!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jEObYN6x$ )
> but per-cgroup pressure files do not have this check. I think the
> original patch which added this check
> (https://urldefense.com/v3/__https://lore.kernel.org/all/20210402025833.27599-1-johunt@akamai.com/__;!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jAVqIVDv$ )
> missed the cgroup ones. This should be easy to add but I wonder if
> that was left that way intentionally.
> 
> CC'ing the author. Josh, Johannes is that inconsistency between system
> pressure files and cgroup-specific ones intentional? Can we change
> them all to check for CAP_SYS_RESOURCE?

No, this was just an oversight in the original patch at least from my
end, and did not come up during code review. Fine with me to change them
all to use CAP_SYS_RESOURCE.

Josh

> 
>>
>>>> Btw. it seems that there is is only a limit on a single trigger per fd
>>>> but no limits per user so it doesn't sound too hard to end up with too
>>>> much polling even with a larger timeouts. To me it seems like we need to
>>>> contain the polling thread to be bound by the cpu controller.
>>>
>>> Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
>>> poll_min_period for each thread is chosen as the min() of polling
>>> periods between triggers created in that group. So, a bad trigger that
>>> causes overly aggressive polling and polling thread being throttled,
>>> might affect other triggers in that cgroup.
>>
>> Yes, and why that would be a problem?
> 
> If unprivileged processes are allowed to add new triggers then a
> malicious process can add a bad trigger and affect other legit
> processes. That sounds like a problem to me.
> Thanks,
> Suren.
> 
>> --
>> Michal Hocko
>> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 19:11                       ` Michal Hocko
@ 2023-02-27 19:50                         ` Suren Baghdasaryan
  2023-02-28 13:50                           ` Michal Hocko
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-27 19:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Mon, Feb 27, 2023 at 11:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 27-02-23 09:49:59, Suren Baghdasaryan wrote:
> > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > > > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> [...]
> > > > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > > > but no limits per user so it doesn't sound too hard to end up with too
> > > > > much polling even with a larger timeouts. To me it seems like we need to
> > > > > contain the polling thread to be bound by the cpu controller.
> > > >
> > > > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > > > poll_min_period for each thread is chosen as the min() of polling
> > > > periods between triggers created in that group. So, a bad trigger that
> > > > causes overly aggressive polling and polling thread being throttled,
> > > > might affect other triggers in that cgroup.
> > >
> > > Yes, and why that would be a problem?
> >
> > If unprivileged processes are allowed to add new triggers then a
> > malicious process can add a bad trigger and affect other legit
> > processes. That sounds like a problem to me.
>
> Hmm, I am not sure we are on the same page. My argument was that the
> monitoring kernel thread should be bound by the same cpu controller so
> even if it was excessive it would be bound to the cgroup constrains.

Right. But if cgroup constraints are violated then the psimon thread's
activity will be impacted by throttling. In such cases won't that
affect other "good" triggers served by that thread even if they are
using higher polling periods?

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 19:19                       ` Josh Hunt
@ 2023-02-27 19:51                         ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-27 19:51 UTC (permalink / raw)
  To: Josh Hunt
  Cc: Michal Hocko, Sudarshan Rajagopalan, David Hildenbrand,
	Johannes Weiner, Mike Rapoport, Oscar Salvador,
	Anshuman Khandual, mark.rutland, will, virtualization, linux-mm,
	linux-kernel, linux-arm-kernel, linux-arm-msm, Trilok Soni,
	Sukadev Bhattiprolu, Srivatsa Vaddagiri, Patrick Daly

On Mon, Feb 27, 2023 at 11:19 AM Josh Hunt <johunt@akamai.com> wrote:
>
>
>
> On 2/27/23 9:49 AM, Suren Baghdasaryan wrote:
> > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> >>
> >> On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> >>> On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> >>>>
> >>>> On Tue 14-02-23 11:34:30, Suren Baghdasaryan wrote:
> >>>> [...]
> >>>>> Your suggestion to have this limit configurable sounds like obvious
> >>>>> solution. I would like to get some opinions from other maintainers.
> >>>>> Johannes, WDYT? CC'ing Michal to chime in as well since this is mostly
> >>>>> related to memory stalls.
> >>>>
> >>>> I do not think that making this configurable helps much. Many users will
> >>>> be bound to distribution config and also it would be hard to experiment
> >>>> with a recompile cycle every time. This seems just too impractical.
> >>>>
> >>>> Is there any reason why we shouldn't allow any timeout? Shorter
> >>>> timeouts could be restricted to a priviledged context to avoid an easy
> >>>> way to swamp system by too frequent polling.
> >>>
> >>> Hmm, ok. Maybe then we just ensure that only privileged users can set
> >>> triggers and remove the min limit (use a >0 check)?
> >>
> >> This could break existing userspace which is not privileged. I would
> >> just go with CAP_SYS_NICE or similar with small (sub min) timeouts.
> >
> > Yeah, that's what I meant. /proc/pressure/* files already check for
> > CAP_SYS_RESOURCE
> > (https://urldefense.com/v3/__https://elixir.bootlin.com/linux/latest/source/kernel/sched/psi.c*L1440__;Iw!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jEObYN6x$ )
> > but per-cgroup pressure files do not have this check. I think the
> > original patch which added this check
> > (https://urldefense.com/v3/__https://lore.kernel.org/all/20210402025833.27599-1-johunt@akamai.com/__;!!GjvTz_vk!WtI61poYlZk9kg5P1sX19RdYnUNGvBJRjnOpu8hL6IOZ_NKhuw2qZ_tAdNRwzZoQVlO4jAVqIVDv$ )
> > missed the cgroup ones. This should be easy to add but I wonder if
> > that was left that way intentionally.
> >
> > CC'ing the author. Josh, Johannes is that inconsistency between system
> > pressure files and cgroup-specific ones intentional? Can we change
> > them all to check for CAP_SYS_RESOURCE?
>
> No, this was just an oversight in the original patch at least from my
> end, and did not come up during code review. Fine with me to change them
> all to use CAP_SYS_RESOURCE.

Thanks for the confirmation! Will get this fixed.

>
> Josh
>
> >
> >>
> >>>> Btw. it seems that there is is only a limit on a single trigger per fd
> >>>> but no limits per user so it doesn't sound too hard to end up with too
> >>>> much polling even with a larger timeouts. To me it seems like we need to
> >>>> contain the polling thread to be bound by the cpu controller.
> >>>
> >>> Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> >>> poll_min_period for each thread is chosen as the min() of polling
> >>> periods between triggers created in that group. So, a bad trigger that
> >>> causes overly aggressive polling and polling thread being throttled,
> >>> might affect other triggers in that cgroup.
> >>
> >> Yes, and why that would be a problem?
> >
> > If unprivileged processes are allowed to add new triggers then a
> > malicious process can add a bad trigger and affect other legit
> > processes. That sounds like a problem to me.
> > Thanks,
> > Suren.
> >
> >> --
> >> Michal Hocko
> >> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-27 19:50                         ` Suren Baghdasaryan
@ 2023-02-28 13:50                           ` Michal Hocko
  2023-02-28 18:18                             ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Michal Hocko @ 2023-02-28 13:50 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Mon 27-02-23 11:50:48, Suren Baghdasaryan wrote:
> On Mon, Feb 27, 2023 at 11:11 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 27-02-23 09:49:59, Suren Baghdasaryan wrote:
> > > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > > > > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > [...]
> > > > > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > > > > but no limits per user so it doesn't sound too hard to end up with too
> > > > > > much polling even with a larger timeouts. To me it seems like we need to
> > > > > > contain the polling thread to be bound by the cpu controller.
> > > > >
> > > > > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > > > > poll_min_period for each thread is chosen as the min() of polling
> > > > > periods between triggers created in that group. So, a bad trigger that
> > > > > causes overly aggressive polling and polling thread being throttled,
> > > > > might affect other triggers in that cgroup.
> > > >
> > > > Yes, and why that would be a problem?
> > >
> > > If unprivileged processes are allowed to add new triggers then a
> > > malicious process can add a bad trigger and affect other legit
> > > processes. That sounds like a problem to me.
> >
> > Hmm, I am not sure we are on the same page. My argument was that the
> > monitoring kernel thread should be bound by the same cpu controller so
> > even if it was excessive it would be bound to the cgroup constrains.
> 
> Right. But if cgroup constraints are violated then the psimon thread's
> activity will be impacted by throttling. In such cases won't that
> affect other "good" triggers served by that thread even if they are
> using higher polling periods?

That is no different from any other part of the workload running within
the same cpu bound cgroup running overboard with the cpu consumption. I
do not see why psimon or anything else should be any different.

Actually the only difference here is that the psi monitoring is
outsourced to a kernel thread which is running ourside of any constrains.
I am not sure where do we stand with kernel thread cpu cgroup accounting
and I suspect this is not a trivial thing to do ATM. Hence longer term
plan.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-28 13:50                           ` Michal Hocko
@ 2023-02-28 18:18                             ` Suren Baghdasaryan
  2023-03-01  1:49                               ` Suren Baghdasaryan
  0 siblings, 1 reply; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-02-28 18:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Tue, Feb 28, 2023 at 5:50 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 27-02-23 11:50:48, Suren Baghdasaryan wrote:
> > On Mon, Feb 27, 2023 at 11:11 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Mon 27-02-23 09:49:59, Suren Baghdasaryan wrote:
> > > > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > > > > > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > > [...]
> > > > > > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > > > > > but no limits per user so it doesn't sound too hard to end up with too
> > > > > > > much polling even with a larger timeouts. To me it seems like we need to
> > > > > > > contain the polling thread to be bound by the cpu controller.
> > > > > >
> > > > > > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > > > > > poll_min_period for each thread is chosen as the min() of polling
> > > > > > periods between triggers created in that group. So, a bad trigger that
> > > > > > causes overly aggressive polling and polling thread being throttled,
> > > > > > might affect other triggers in that cgroup.
> > > > >
> > > > > Yes, and why that would be a problem?
> > > >
> > > > If unprivileged processes are allowed to add new triggers then a
> > > > malicious process can add a bad trigger and affect other legit
> > > > processes. That sounds like a problem to me.
> > >
> > > Hmm, I am not sure we are on the same page. My argument was that the
> > > monitoring kernel thread should be bound by the same cpu controller so
> > > even if it was excessive it would be bound to the cgroup constrains.
> >
> > Right. But if cgroup constraints are violated then the psimon thread's
> > activity will be impacted by throttling. In such cases won't that
> > affect other "good" triggers served by that thread even if they are
> > using higher polling periods?
>
> That is no different from any other part of the workload running within
> the same cpu bound cgroup running overboard with the cpu consumption. I
> do not see why psimon or anything else should be any different.
>
> Actually the only difference here is that the psi monitoring is
> outsourced to a kernel thread which is running ourside of any constrains.
> I am not sure where do we stand with kernel thread cpu cgroup accounting
> and I suspect this is not a trivial thing to do ATM. Hence longer term
> plan.

Yeah, that sounds right.
In the meantime I think the prudent thing to do is to add
CAP_SYS_RESOURCE check for cgroup interface for consistency with
system-wide one. After that we can change the min period to be
anything more than 0 and let userspace privileged services implement
policies to limit trigger cpu consumption (might be via cpu
controller, limiting the number of triggers/their periods, etc).
Sudarshan, I'll post the CAP_SYS_RESOURCE change shortly and you can
follow up with the change to the min trigger period.
Thanks for the input folks!

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] psi: reduce min window size to 50ms
  2023-02-28 18:18                             ` Suren Baghdasaryan
@ 2023-03-01  1:49                               ` Suren Baghdasaryan
  0 siblings, 0 replies; 20+ messages in thread
From: Suren Baghdasaryan @ 2023-03-01  1:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sudarshan Rajagopalan, David Hildenbrand, Johannes Weiner,
	Mike Rapoport, Oscar Salvador, Anshuman Khandual, mark.rutland,
	will, virtualization, linux-mm, linux-kernel, linux-arm-kernel,
	linux-arm-msm, Trilok Soni, Sukadev Bhattiprolu,
	Srivatsa Vaddagiri, Patrick Daly, johunt

On Tue, Feb 28, 2023 at 10:18 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Feb 28, 2023 at 5:50 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Mon 27-02-23 11:50:48, Suren Baghdasaryan wrote:
> > > On Mon, Feb 27, 2023 at 11:11 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Mon 27-02-23 09:49:59, Suren Baghdasaryan wrote:
> > > > > On Mon, Feb 27, 2023 at 5:34 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > > >
> > > > > > On Fri 24-02-23 13:07:57, Suren Baghdasaryan wrote:
> > > > > > > On Fri, Feb 24, 2023 at 4:47 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > [...]
> > > > > > > > Btw. it seems that there is is only a limit on a single trigger per fd
> > > > > > > > but no limits per user so it doesn't sound too hard to end up with too
> > > > > > > > much polling even with a larger timeouts. To me it seems like we need to
> > > > > > > > contain the polling thread to be bound by the cpu controller.
> > > > > > >
> > > > > > > Hmm. We have one "psimon" thread per cgroup (+1 system-level one) and
> > > > > > > poll_min_period for each thread is chosen as the min() of polling
> > > > > > > periods between triggers created in that group. So, a bad trigger that
> > > > > > > causes overly aggressive polling and polling thread being throttled,
> > > > > > > might affect other triggers in that cgroup.
> > > > > >
> > > > > > Yes, and why that would be a problem?
> > > > >
> > > > > If unprivileged processes are allowed to add new triggers then a
> > > > > malicious process can add a bad trigger and affect other legit
> > > > > processes. That sounds like a problem to me.
> > > >
> > > > Hmm, I am not sure we are on the same page. My argument was that the
> > > > monitoring kernel thread should be bound by the same cpu controller so
> > > > even if it was excessive it would be bound to the cgroup constrains.
> > >
> > > Right. But if cgroup constraints are violated then the psimon thread's
> > > activity will be impacted by throttling. In such cases won't that
> > > affect other "good" triggers served by that thread even if they are
> > > using higher polling periods?
> >
> > That is no different from any other part of the workload running within
> > the same cpu bound cgroup running overboard with the cpu consumption. I
> > do not see why psimon or anything else should be any different.
> >
> > Actually the only difference here is that the psi monitoring is
> > outsourced to a kernel thread which is running ourside of any constrains.
> > I am not sure where do we stand with kernel thread cpu cgroup accounting
> > and I suspect this is not a trivial thing to do ATM. Hence longer term
> > plan.
>
> Yeah, that sounds right.
> In the meantime I think the prudent thing to do is to add
> CAP_SYS_RESOURCE check for cgroup interface for consistency with
> system-wide one. After that we can change the min period to be
> anything more than 0 and let userspace privileged services implement
> policies to limit trigger cpu consumption (might be via cpu
> controller, limiting the number of triggers/their periods, etc).
> Sudarshan, I'll post the CAP_SYS_RESOURCE change shortly and you can
> follow up with the change to the min trigger period.

Patch to require CAP_SYS_RESOURCE for writing per-cgroup psi files is
posted at https://lore.kernel.org/all/20230301014651.1370939-1-surenb@google.com/

> Thanks for the input folks!
>
> > --
> > Michal Hocko
> > SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2023-03-01  1:49 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-10 22:31 [PATCH] psi: reduce min window size to 50ms Sudarshan Rajagopalan
2023-02-10 22:31 ` Sudarshan Rajagopalan
2023-02-10 23:03 ` Suren Baghdasaryan
2023-02-11  0:44   ` Sudarshan Rajagopalan
2023-02-11  1:09     ` Suren Baghdasaryan
2023-02-11  1:46       ` Sudarshan Rajagopalan
2023-02-11  2:13         ` Suren Baghdasaryan
2023-02-14  2:12           ` Sudarshan Rajagopalan
2023-02-14 19:34             ` Suren Baghdasaryan
2023-02-24 12:47               ` Michal Hocko
2023-02-24 21:07                 ` Suren Baghdasaryan
2023-02-27 13:34                   ` Michal Hocko
2023-02-27 17:49                     ` Suren Baghdasaryan
2023-02-27 19:11                       ` Michal Hocko
2023-02-27 19:50                         ` Suren Baghdasaryan
2023-02-28 13:50                           ` Michal Hocko
2023-02-28 18:18                             ` Suren Baghdasaryan
2023-03-01  1:49                               ` Suren Baghdasaryan
2023-02-27 19:19                       ` Josh Hunt
2023-02-27 19:51                         ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).