All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] psi: fix sampling artifact from pressure file read frequency
@ 2021-06-08 19:03 Johannes Weiner
  2021-06-10  3:32 ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2021-06-08 19:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Suren Baghdasaryan, Jared Pochtar, linux-kernel, kernel-team

Currently, every time a psi pressure file is read, the per-cpu states
are collected and aggregated according to the CPU's non-idle time
weight. This dynamically changes the sampling period, which means read
frequency can introduce variance into the observed results. This is
somewhat unexpected for users and can be confusing, when e.g. two
nested cgroups with the same workload report different pressure levels
just because they have different consumers reading the pressure files.

Consider the following two CPU timelines:

	CPU0: [ STALL ] [ SLEEP ]
	CPU1: [  RUN  ] [  RUN  ]

If we sample and aggregate once for the whole period, we get the
following total stall time for CPU0:

	CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3

But if we sample twice, the total for the period is higher:

	CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
	CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
                                                          ---
                                                          0.5

Neither answer is inherently wrong: if the user asks for pressure
after half the period, we can't know yet that the CPU will go to sleep
right after and its weight would be lower over the combined period.

We could normalize the weight averaging to a fixed window regardless
of how often stall times themselves are sampled. But that would make
reporting less adaptive to sudden changes when the user intentionally
uses poll() with short intervals in order to get a higher resolution.

For now, simply limit sampling of the pressure file contents to the
fixed two-second period already used by the aggregation worker.

psi_show() still needs to force a catch-up run in case the workload
went idle and periodic aggregation shut off. Since that can race with
periodic aggregation, worker and psi_show() need to fully serialize.
And with sampling becoming exclusive, whoever wins the race must also
requeue the worker if necessary. Move the locking, the aggregation
work, and the worker scheduling logic into update_averages(); use that
from the worker and psi_show().

poll() continues to use the proportional weight sampling window.

Reported-by: Jared Pochtar <jpochtar@fb.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 kernel/sched/psi.c | 83 ++++++++++++++++++++++++----------------------
 1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index cc25a3cff41f..9d647d974f55 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -358,17 +358,36 @@ static void collect_percpu_times(struct psi_group *group,
 		*pchanged_states = changed_states;
 }
 
-static u64 update_averages(struct psi_group *group, u64 now)
+static void update_averages(struct psi_group *group)
 {
 	unsigned long missed_periods = 0;
-	u64 expires, period;
-	u64 avg_next_update;
+	u64 now, expires, period;
+	u32 changed_states;
 	int s;
 
 	/* avgX= */
+	mutex_lock(&group->avgs_lock);
+
+	now = sched_clock();
 	expires = group->avg_next_update;
-	if (now - expires >= psi_period)
-		missed_periods = div_u64(now - expires, psi_period);
+
+	/*
+	 * Periodic aggregation.
+	 *
+	 * When tasks in the group are active, we make sure to
+	 * aggregate per-cpu samples and calculate the running
+	 * averages at exactly once per PSI_FREQ period.
+	 *
+	 * When tasks go idle, there is no point in keeping the
+	 * workers running, so we shut them down too. Once restarted,
+	 * we backfill zeroes for the missed periods in calc_avgs().
+	 *
+	 * We can get here from inside the aggregation worker, but
+	 * also from psi_show() as userspace may query pressure files
+	 * of an idle group whose aggregation worker shut down.
+	 */
+	if (now < expires)
+		goto unlock;
 
 	/*
 	 * The periodic clock tick can get delayed for various
@@ -377,10 +396,13 @@ static u64 update_averages(struct psi_group *group, u64 now)
 	 * But the deltas we sample out of the per-cpu buckets above
 	 * are based on the actual time elapsing between clock ticks.
 	 */
-	avg_next_update = expires + ((1 + missed_periods) * psi_period);
+	if (now - expires >= psi_period)
+		missed_periods = div_u64(now - expires, psi_period);
 	period = now - (group->avg_last_update + (missed_periods * psi_period));
 	group->avg_last_update = now;
+	group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
 
+	collect_percpu_times(group, PSI_AVGS, &changed_states);
 	for (s = 0; s < NR_PSI_STATES - 1; s++) {
 		u32 sample;
 
@@ -408,42 +430,25 @@ static u64 update_averages(struct psi_group *group, u64 now)
 		calc_avgs(group->avg[s], missed_periods, sample, period);
 	}
 
-	return avg_next_update;
+	if (changed_states & (1 << PSI_NONIDLE)) {
+		unsigned long delay;
+
+		delay = nsecs_to_jiffies(group->avg_next_update - now) + 1;
+		schedule_delayed_work(&group->avgs_work, delay);
+	}
+unlock:
+	mutex_unlock(&group->avgs_lock);
 }
 
 static void psi_avgs_work(struct work_struct *work)
 {
 	struct delayed_work *dwork;
 	struct psi_group *group;
-	u32 changed_states;
-	bool nonidle;
-	u64 now;
 
 	dwork = to_delayed_work(work);
 	group = container_of(dwork, struct psi_group, avgs_work);
 
-	mutex_lock(&group->avgs_lock);
-
-	now = sched_clock();
-
-	collect_percpu_times(group, PSI_AVGS, &changed_states);
-	nonidle = changed_states & (1 << PSI_NONIDLE);
-	/*
-	 * If there is task activity, periodically fold the per-cpu
-	 * times and feed samples into the running averages. If things
-	 * are idle and there is no data to process, stop the clock.
-	 * Once restarted, we'll catch up the running averages in one
-	 * go - see calc_avgs() and missed_periods.
-	 */
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-
-	if (nonidle) {
-		schedule_delayed_work(dwork, nsecs_to_jiffies(
-				group->avg_next_update - now) + 1);
-	}
-
-	mutex_unlock(&group->avgs_lock);
+	update_averages(group);
 }
 
 /* Trigger tracking window manipulations */
@@ -1029,18 +1034,16 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
 int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	int full;
-	u64 now;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
 
-	/* Update averages before reporting them */
-	mutex_lock(&group->avgs_lock);
-	now = sched_clock();
-	collect_percpu_times(group, PSI_AVGS, NULL);
-	if (now >= group->avg_next_update)
-		group->avg_next_update = update_averages(group, now);
-	mutex_unlock(&group->avgs_lock);
+	/*
+	 * Periodic aggregation should do all the sampling for us, but
+	 * if the workload goes idle, the worker goes to sleep before
+	 * old stalls may have averaged out. Backfill any idle zeroes.
+	 */
+	update_averages(group);
 
 	for (full = 0; full < 2; full++) {
 		unsigned long avg[3];
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] psi: fix sampling artifact from pressure file read frequency
  2021-06-08 19:03 [PATCH] psi: fix sampling artifact from pressure file read frequency Johannes Weiner
@ 2021-06-10  3:32 ` Suren Baghdasaryan
  2021-06-10 13:58   ` Johannes Weiner
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2021-06-10  3:32 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Peter Zijlstra, Jared Pochtar, LKML, kernel-team

On Tue, Jun 8, 2021 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Currently, every time a psi pressure file is read, the per-cpu states
> are collected and aggregated according to the CPU's non-idle time
> weight. This dynamically changes the sampling period, which means read
> frequency can introduce variance into the observed results. This is
> somewhat unexpected for users and can be confusing, when e.g. two
> nested cgroups with the same workload report different pressure levels
> just because they have different consumers reading the pressure files.
>
> Consider the following two CPU timelines:
>
>         CPU0: [ STALL ] [ SLEEP ]
>         CPU1: [  RUN  ] [  RUN  ]
>
> If we sample and aggregate once for the whole period, we get the
> following total stall time for CPU0:
>
>         CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3
>
> But if we sample twice, the total for the period is higher:
>
>         CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
>         CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
>                                                           ---
>                                                           0.5
>
> Neither answer is inherently wrong: if the user asks for pressure
> after half the period, we can't know yet that the CPU will go to sleep
> right after and its weight would be lower over the combined period.
>
> We could normalize the weight averaging to a fixed window regardless
> of how often stall times themselves are sampled. But that would make
> reporting less adaptive to sudden changes when the user intentionally
> uses poll() with short intervals in order to get a higher resolution.
>
> For now, simply limit sampling of the pressure file contents to the
> fixed two-second period already used by the aggregation worker.

Hmm. This is tricky. So, userspace-visible effect of this change is
that totals will not update when the psi file is read unless the
psi_period expires. We used to postpone updating only the averages and
now the totals will follow suit. Not sure if presenting stale data is
better than having this dependency on timing of the read. As you
noted, the value we get is not inherently wrong. But one could argue
both ways I guess... Having this "quantum" effect when the act of
observation changes the state of the object is indeed weird, to say
the least.
In the paragraph above you say "For now". Do you have an idea that
could solve the issue with totals being stale while removing this
dependency on the timing of reads?

>
> psi_show() still needs to force a catch-up run in case the workload
> went idle and periodic aggregation shut off. Since that can race with
> periodic aggregation, worker and psi_show() need to fully serialize.
> And with sampling becoming exclusive, whoever wins the race must also
> requeue the worker if necessary. Move the locking, the aggregation
> work, and the worker scheduling logic into update_averages(); use that
> from the worker and psi_show().
>
> poll() continues to use the proportional weight sampling window.
>
> Reported-by: Jared Pochtar <jpochtar@fb.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  kernel/sched/psi.c | 83 ++++++++++++++++++++++++----------------------
>  1 file changed, 43 insertions(+), 40 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index cc25a3cff41f..9d647d974f55 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -358,17 +358,36 @@ static void collect_percpu_times(struct psi_group *group,
>                 *pchanged_states = changed_states;
>  }
>
> -static u64 update_averages(struct psi_group *group, u64 now)
> +static void update_averages(struct psi_group *group)
>  {
>         unsigned long missed_periods = 0;
> -       u64 expires, period;
> -       u64 avg_next_update;
> +       u64 now, expires, period;
> +       u32 changed_states;
>         int s;
>
>         /* avgX= */
> +       mutex_lock(&group->avgs_lock);
> +
> +       now = sched_clock();
>         expires = group->avg_next_update;
> -       if (now - expires >= psi_period)
> -               missed_periods = div_u64(now - expires, psi_period);
> +
> +       /*
> +        * Periodic aggregation.
> +        *
> +        * When tasks in the group are active, we make sure to
> +        * aggregate per-cpu samples and calculate the running
> +        * averages at exactly once per PSI_FREQ period.
> +        *
> +        * When tasks go idle, there is no point in keeping the
> +        * workers running, so we shut them down too. Once restarted,
> +        * we backfill zeroes for the missed periods in calc_avgs().
> +        *
> +        * We can get here from inside the aggregation worker, but
> +        * also from psi_show() as userspace may query pressure files
> +        * of an idle group whose aggregation worker shut down.
> +        */
> +       if (now < expires)
> +               goto unlock;
>
>         /*
>          * The periodic clock tick can get delayed for various
> @@ -377,10 +396,13 @@ static u64 update_averages(struct psi_group *group, u64 now)
>          * But the deltas we sample out of the per-cpu buckets above
>          * are based on the actual time elapsing between clock ticks.
>          */
> -       avg_next_update = expires + ((1 + missed_periods) * psi_period);
> +       if (now - expires >= psi_period)
> +               missed_periods = div_u64(now - expires, psi_period);
>         period = now - (group->avg_last_update + (missed_periods * psi_period));
>         group->avg_last_update = now;
> +       group->avg_next_update = expires + ((1 + missed_periods) * psi_period);
>
> +       collect_percpu_times(group, PSI_AVGS, &changed_states);
>         for (s = 0; s < NR_PSI_STATES - 1; s++) {
>                 u32 sample;
>
> @@ -408,42 +430,25 @@ static u64 update_averages(struct psi_group *group, u64 now)
>                 calc_avgs(group->avg[s], missed_periods, sample, period);
>         }
>
> -       return avg_next_update;
> +       if (changed_states & (1 << PSI_NONIDLE)) {
> +               unsigned long delay;
> +
> +               delay = nsecs_to_jiffies(group->avg_next_update - now) + 1;
> +               schedule_delayed_work(&group->avgs_work, delay);
> +       }
> +unlock:
> +       mutex_unlock(&group->avgs_lock);
>  }
>
>  static void psi_avgs_work(struct work_struct *work)
>  {
>         struct delayed_work *dwork;
>         struct psi_group *group;
> -       u32 changed_states;
> -       bool nonidle;
> -       u64 now;
>
>         dwork = to_delayed_work(work);
>         group = container_of(dwork, struct psi_group, avgs_work);
>
> -       mutex_lock(&group->avgs_lock);
> -
> -       now = sched_clock();
> -
> -       collect_percpu_times(group, PSI_AVGS, &changed_states);
> -       nonidle = changed_states & (1 << PSI_NONIDLE);
> -       /*
> -        * If there is task activity, periodically fold the per-cpu
> -        * times and feed samples into the running averages. If things
> -        * are idle and there is no data to process, stop the clock.
> -        * Once restarted, we'll catch up the running averages in one
> -        * go - see calc_avgs() and missed_periods.
> -        */
> -       if (now >= group->avg_next_update)
> -               group->avg_next_update = update_averages(group, now);
> -
> -       if (nonidle) {
> -               schedule_delayed_work(dwork, nsecs_to_jiffies(
> -                               group->avg_next_update - now) + 1);
> -       }
> -
> -       mutex_unlock(&group->avgs_lock);
> +       update_averages(group);
>  }
>
>  /* Trigger tracking window manipulations */
> @@ -1029,18 +1034,16 @@ void cgroup_move_task(struct task_struct *task, struct css_set *to)
>  int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
>  {
>         int full;
> -       u64 now;
>
>         if (static_branch_likely(&psi_disabled))
>                 return -EOPNOTSUPP;
>
> -       /* Update averages before reporting them */
> -       mutex_lock(&group->avgs_lock);
> -       now = sched_clock();
> -       collect_percpu_times(group, PSI_AVGS, NULL);
> -       if (now >= group->avg_next_update)
> -               group->avg_next_update = update_averages(group, now);
> -       mutex_unlock(&group->avgs_lock);
> +       /*
> +        * Periodic aggregation should do all the sampling for us, but
> +        * if the workload goes idle, the worker goes to sleep before
> +        * old stalls may have averaged out. Backfill any idle zeroes.
> +        */
> +       update_averages(group);
>
>         for (full = 0; full < 2; full++) {
>                 unsigned long avg[3];
> --
> 2.32.0
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] psi: fix sampling artifact from pressure file read frequency
  2021-06-10  3:32 ` Suren Baghdasaryan
@ 2021-06-10 13:58   ` Johannes Weiner
  2021-06-10 17:33     ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2021-06-10 13:58 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Peter Zijlstra, Jared Pochtar, LKML, kernel-team

On Wed, Jun 09, 2021 at 08:32:51PM -0700, Suren Baghdasaryan wrote:
> On Tue, Jun 8, 2021 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > Currently, every time a psi pressure file is read, the per-cpu states
> > are collected and aggregated according to the CPU's non-idle time
> > weight. This dynamically changes the sampling period, which means read
> > frequency can introduce variance into the observed results. This is
> > somewhat unexpected for users and can be confusing, when e.g. two
> > nested cgroups with the same workload report different pressure levels
> > just because they have different consumers reading the pressure files.
> >
> > Consider the following two CPU timelines:
> >
> >         CPU0: [ STALL ] [ SLEEP ]
> >         CPU1: [  RUN  ] [  RUN  ]
> >
> > If we sample and aggregate once for the whole period, we get the
> > following total stall time for CPU0:
> >
> >         CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3
> >
> > But if we sample twice, the total for the period is higher:
> >
> >         CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
> >         CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
> >                                                           ---
> >                                                           0.5
> >
> > Neither answer is inherently wrong: if the user asks for pressure
> > after half the period, we can't know yet that the CPU will go to sleep
> > right after and its weight would be lower over the combined period.
> >
> > We could normalize the weight averaging to a fixed window regardless
> > of how often stall times themselves are sampled. But that would make
> > reporting less adaptive to sudden changes when the user intentionally
> > uses poll() with short intervals in order to get a higher resolution.
> >
> > For now, simply limit sampling of the pressure file contents to the
> > fixed two-second period already used by the aggregation worker.
> 
> Hmm. This is tricky.

Yes ;)

> So, userspace-visible effect of this change is that totals will not
> update when the psi file is read unless the psi_period expires.

That's a visible side effect, but yeah, correct.

> We used to postpone updating only the averages and now the totals
> will follow suit. Not sure if presenting stale data is better than
> having this dependency on timing of the read. As you noted, the
> value we get is not inherently wrong. But one could argue both ways
> I guess... Having this "quantum" effect when the act of observation
> changes the state of the object is indeed weird, to say the least.

Yes. Especially *because* we don't update the averages more than once
per 2s window. It gives the impression they would follow steady
sampling, but they're calculated based on total= which is aggregated
on every read.

Tying the total= updates to the same fixed window presents slightly
less current data, but results in more obvious and intuitive behavior.

For more current data there is always poll() - which is a better idea
than busy-reading the pressure files for a sub-2s resolution...

> In the paragraph above you say "For now". Do you have an idea that
> could solve the issue with totals being stale while removing this
> dependency on the timing of reads?

Yeah, it's hinted at in the paragraph before that.

What we could do is decouple the CPU weight sampling from the stall
time sampling, and use a fixed-window average for the nonidle /
nonidle_total ratio when aggregating. This way, no matter how
frequently you read the stall times, you get the same results.

However, because the update frequency absent any reads is 2s, that
would be the window size we'd have to use. So while we could present
updated total= on every read, they would still be aggregated based on
a relatively stale CPU load distribution.

They wouldn't be as current as they might seem - and as current as the
user would assume them to be if they read the file frequently and got
new totals every time. This is even more subtle than the status quo.

IMO the cleanest interface is simply to be honest and consistent about
the pressure files following a fixed 2s aggregation frequency, and
refer people who need a higher resolution to poll().

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] psi: fix sampling artifact from pressure file read frequency
  2021-06-10 13:58   ` Johannes Weiner
@ 2021-06-10 17:33     ` Suren Baghdasaryan
  2021-06-14 14:41       ` Johannes Weiner
  0 siblings, 1 reply; 6+ messages in thread
From: Suren Baghdasaryan @ 2021-06-10 17:33 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Peter Zijlstra, Jared Pochtar, LKML, kernel-team

On Thu, Jun 10, 2021 at 6:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jun 09, 2021 at 08:32:51PM -0700, Suren Baghdasaryan wrote:
> > On Tue, Jun 8, 2021 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > Currently, every time a psi pressure file is read, the per-cpu states
> > > are collected and aggregated according to the CPU's non-idle time
> > > weight. This dynamically changes the sampling period, which means read
> > > frequency can introduce variance into the observed results. This is
> > > somewhat unexpected for users and can be confusing, when e.g. two
> > > nested cgroups with the same workload report different pressure levels
> > > just because they have different consumers reading the pressure files.
> > >
> > > Consider the following two CPU timelines:
> > >
> > >         CPU0: [ STALL ] [ SLEEP ]
> > >         CPU1: [  RUN  ] [  RUN  ]
> > >
> > > If we sample and aggregate once for the whole period, we get the
> > > following total stall time for CPU0:
> > >
> > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3
> > >
> > > But if we sample twice, the total for the period is higher:
> > >
> > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
> > >         CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
> > >                                                           ---
> > >                                                           0.5

Could you please clarify that above you are calculating the
nonidle/nonidle_total ratio? I understood your description in the text
but these calculations seem to claim that:

1+1/3=0.3
1+1/2=0.5
0+0/1=0

Clarification would be helpful IMHO.

Also IIUC the state contributions are:
STALL == stall(1) + nonidle(1)
SLEEP == stall(0) + nonidle(0)
RUN == nonidle(1)
Is that correct?

> > >
> > > Neither answer is inherently wrong: if the user asks for pressure
> > > after half the period, we can't know yet that the CPU will go to sleep
> > > right after and its weight would be lower over the combined period.
> > >
> > > We could normalize the weight averaging to a fixed window regardless
> > > of how often stall times themselves are sampled. But that would make
> > > reporting less adaptive to sudden changes when the user intentionally
> > > uses poll() with short intervals in order to get a higher resolution.
> > >
> > > For now, simply limit sampling of the pressure file contents to the
> > > fixed two-second period already used by the aggregation worker.
> >
> > Hmm. This is tricky.
>
> Yes ;)
>
> > So, userspace-visible effect of this change is that totals will not
> > update when the psi file is read unless the psi_period expires.
>
> That's a visible side effect, but yeah, correct.
>
> > We used to postpone updating only the averages and now the totals
> > will follow suit. Not sure if presenting stale data is better than
> > having this dependency on timing of the read. As you noted, the
> > value we get is not inherently wrong. But one could argue both ways
> > I guess... Having this "quantum" effect when the act of observation
> > changes the state of the object is indeed weird, to say the least.
>
> Yes. Especially *because* we don't update the averages more than once
> per 2s window. It gives the impression they would follow steady
> sampling, but they're calculated based on total= which is aggregated
> on every read.
>
> Tying the total= updates to the same fixed window presents slightly
> less current data, but results in more obvious and intuitive behavior.
>
> For more current data there is always poll() - which is a better idea
> than busy-reading the pressure files for a sub-2s resolution...

True, however triggers are rate-limited to one per tracking window
(0.5s min), so they have their own limitations.

>
> > In the paragraph above you say "For now". Do you have an idea that
> > could solve the issue with totals being stale while removing this
> > dependency on the timing of reads?
>
> Yeah, it's hinted at in the paragraph before that.
>
> What we could do is decouple the CPU weight sampling from the stall
> time sampling, and use a fixed-window average for the nonidle /
> nonidle_total ratio when aggregating. This way, no matter how
> frequently you read the stall times, you get the same results.
>
> However, because the update frequency absent any reads is 2s, that
> would be the window size we'd have to use. So while we could present
> updated total= on every read, they would still be aggregated based on
> a relatively stale CPU load distribution.
>
> They wouldn't be as current as they might seem - and as current as the
> user would assume them to be if they read the file frequently and got
> new totals every time. This is even more subtle than the status quo.
>
> IMO the cleanest interface is simply to be honest and consistent about
> the pressure files following a fixed 2s aggregation frequency, and
> refer people who need a higher resolution to poll().

I would agree, however 2s interval when (memory) pressure rises fast
is an eternity. The device can get into OOM state much faster than
that. I'm worried that this limitation might render totals useless.
The change does not pose issues for Android today because we rely on
poll() and then we read other stats. However in the ideal world we
would use only psi and after receiving the initial trigger we would
start reading psi periodically to detect any further spikes. This
would require totals to be up-to-date.
These are my thoughts about long-term possible uses but I don't object
to this change as a short-term solution. Hopefully nobody uses totals
this way today, otherwise they are up for an unpleasant surprise.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] psi: fix sampling artifact from pressure file read frequency
  2021-06-10 17:33     ` Suren Baghdasaryan
@ 2021-06-14 14:41       ` Johannes Weiner
  2021-06-15 17:58         ` Suren Baghdasaryan
  0 siblings, 1 reply; 6+ messages in thread
From: Johannes Weiner @ 2021-06-14 14:41 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: Peter Zijlstra, Jared Pochtar, LKML, kernel-team

On Thu, Jun 10, 2021 at 10:33:15AM -0700, Suren Baghdasaryan wrote:
> On Thu, Jun 10, 2021 at 6:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Jun 09, 2021 at 08:32:51PM -0700, Suren Baghdasaryan wrote:
> > > On Tue, Jun 8, 2021 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > >
> > > > Currently, every time a psi pressure file is read, the per-cpu states
> > > > are collected and aggregated according to the CPU's non-idle time
> > > > weight. This dynamically changes the sampling period, which means read
> > > > frequency can introduce variance into the observed results. This is
> > > > somewhat unexpected for users and can be confusing, when e.g. two
> > > > nested cgroups with the same workload report different pressure levels
> > > > just because they have different consumers reading the pressure files.
> > > >
> > > > Consider the following two CPU timelines:
> > > >
> > > >         CPU0: [ STALL ] [ SLEEP ]
> > > >         CPU1: [  RUN  ] [  RUN  ]
> > > >
> > > > If we sample and aggregate once for the whole period, we get the
> > > > following total stall time for CPU0:
> > > >
> > > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3
> > > >
> > > > But if we sample twice, the total for the period is higher:
> > > >
> > > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
> > > >         CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
> > > >                                                           ---
> > > >                                                           0.5
> 
> Could you please clarify that above you are calculating the
> nonidle/nonidle_total ratio? I understood your description in the text
> but these calculations seem to claim that:
> 
> 1+1/3=0.3
> 1+1/2=0.5
> 0+0/1=0
> 
> Clarification would be helpful IMHO.

Oof, my bad, that's a plain typo. The + should be *:

stall(1) * nonidle(1)/nonidle_total(3) = 0.3

vs

stall(1) * nonidle(1)/nonidle_total(2) = 0.5
stall(0) * nonidle(0)/nonidle_total(1) = 0.0

> Also IIUC the state contributions are:
> STALL == stall(1) + nonidle(1)
> SLEEP == stall(0) + nonidle(0)
> RUN == nonidle(1)
> Is that correct?

Correct. And nonidle_total is from STALL + RUN + RUN.

> > > > Neither answer is inherently wrong: if the user asks for pressure
> > > > after half the period, we can't know yet that the CPU will go to sleep
> > > > right after and its weight would be lower over the combined period.
> > > >
> > > > We could normalize the weight averaging to a fixed window regardless
> > > > of how often stall times themselves are sampled. But that would make
> > > > reporting less adaptive to sudden changes when the user intentionally
> > > > uses poll() with short intervals in order to get a higher resolution.
> > > >
> > > > For now, simply limit sampling of the pressure file contents to the
> > > > fixed two-second period already used by the aggregation worker.
> > >
> > > Hmm. This is tricky.
> >
> > Yes ;)
> >
> > > So, userspace-visible effect of this change is that totals will not
> > > update when the psi file is read unless the psi_period expires.
> >
> > That's a visible side effect, but yeah, correct.
> >
> > > We used to postpone updating only the averages and now the totals
> > > will follow suit. Not sure if presenting stale data is better than
> > > having this dependency on timing of the read. As you noted, the
> > > value we get is not inherently wrong. But one could argue both ways
> > > I guess... Having this "quantum" effect when the act of observation
> > > changes the state of the object is indeed weird, to say the least.
> >
> > Yes. Especially *because* we don't update the averages more than once
> > per 2s window. It gives the impression they would follow steady
> > sampling, but they're calculated based on total= which is aggregated
> > on every read.
> >
> > Tying the total= updates to the same fixed window presents slightly
> > less current data, but results in more obvious and intuitive behavior.
> >
> > For more current data there is always poll() - which is a better idea
> > than busy-reading the pressure files for a sub-2s resolution...
> 
> True, however triggers are rate-limited to one per tracking window
> (0.5s min), so they have their own limitations.

Right, but if somebody needed *higher* resolution than this, wouldn't
it make more sense to lift the restrictions on poll() rather than read
the pressure file several times per second?

> > > In the paragraph above you say "For now". Do you have an idea that
> > > could solve the issue with totals being stale while removing this
> > > dependency on the timing of reads?
> >
> > Yeah, it's hinted at in the paragraph before that.
> >
> > What we could do is decouple the CPU weight sampling from the stall
> > time sampling, and use a fixed-window average for the nonidle /
> > nonidle_total ratio when aggregating. This way, no matter how
> > frequently you read the stall times, you get the same results.
> >
> > However, because the update frequency absent any reads is 2s, that
> > would be the window size we'd have to use. So while we could present
> > updated total= on every read, they would still be aggregated based on
> > a relatively stale CPU load distribution.
> >
> > They wouldn't be as current as they might seem - and as current as the
> > user would assume them to be if they read the file frequently and got
> > new totals every time. This is even more subtle than the status quo.
> >
> > IMO the cleanest interface is simply to be honest and consistent about
> > the pressure files following a fixed 2s aggregation frequency, and
> > refer people who need a higher resolution to poll().
> 
> I would agree, however 2s interval when (memory) pressure rises fast
> is an eternity. The device can get into OOM state much faster than
> that. I'm worried that this limitation might render totals useless.

> The change does not pose issues for Android today because we rely on
> poll() and then we read other stats. However in the ideal world we
> would use only psi and after receiving the initial trigger we would
> start reading psi periodically to detect any further spikes. This
> would require totals to be up-to-date.

I might be missing something, but why manually read the pressure files
at a high frequency when you get the event? Wouldn't it be possible to
keep listening for further trigger events from poll() in that case?

Does it all come down to the 500ms window restriction?

You're right that this change would make it impossible to use total=
in the pressure files for manual high-frequency sampling, my question
is whether that does something that poll() can not, even in theory.

> These are my thoughts about long-term possible uses but I don't object
> to this change as a short-term solution. Hopefully nobody uses totals
> this way today, otherwise they are up for an unpleasant surprise.

Yeah I think that's a valid concern. We shouldn't break such usecases
if they exist. And it would break them in a non-obvious way. So let's
table this version of the patch for now.

Another option could be to use a separate aggregator state for avg=
and total=, such that total= updates stay adaptive to the read
frequency, but the canned avg= would at least behave more predictably
for less sophisticated usecases.

The implication would be that manual averaging of total= at sampling
intervals other than 2s would no longer match up to avg=. But it's
harder to imagine anybody would rely on that.

What do you think?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] psi: fix sampling artifact from pressure file read frequency
  2021-06-14 14:41       ` Johannes Weiner
@ 2021-06-15 17:58         ` Suren Baghdasaryan
  0 siblings, 0 replies; 6+ messages in thread
From: Suren Baghdasaryan @ 2021-06-15 17:58 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Peter Zijlstra, Jared Pochtar, LKML, kernel-team

On Mon, Jun 14, 2021 at 7:41 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Jun 10, 2021 at 10:33:15AM -0700, Suren Baghdasaryan wrote:
> > On Thu, Jun 10, 2021 at 6:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Jun 09, 2021 at 08:32:51PM -0700, Suren Baghdasaryan wrote:
> > > > On Tue, Jun 8, 2021 at 12:03 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > Currently, every time a psi pressure file is read, the per-cpu states
> > > > > are collected and aggregated according to the CPU's non-idle time
> > > > > weight. This dynamically changes the sampling period, which means read
> > > > > frequency can introduce variance into the observed results. This is
> > > > > somewhat unexpected for users and can be confusing, when e.g. two
> > > > > nested cgroups with the same workload report different pressure levels
> > > > > just because they have different consumers reading the pressure files.
> > > > >
> > > > > Consider the following two CPU timelines:
> > > > >
> > > > >         CPU0: [ STALL ] [ SLEEP ]
> > > > >         CPU1: [  RUN  ] [  RUN  ]
> > > > >
> > > > > If we sample and aggregate once for the whole period, we get the
> > > > > following total stall time for CPU0:
> > > > >
> > > > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(3) = 0.3
> > > > >
> > > > > But if we sample twice, the total for the period is higher:
> > > > >
> > > > >         CPU0 = stall(1) + nonidle(1) / nonidle_total(2) = 0.5
> > > > >         CPU0 = stall(0) + nonidle(0) / nonidle_total(1) = 0
> > > > >                                                           ---
> > > > >                                                           0.5
> >
> > Could you please clarify that above you are calculating the
> > nonidle/nonidle_total ratio? I understood your description in the text
> > but these calculations seem to claim that:
> >
> > 1+1/3=0.3
> > 1+1/2=0.5
> > 0+0/1=0
> >
> > Clarification would be helpful IMHO.
>
> Oof, my bad, that's a plain typo. The + should be *:
>
> stall(1) * nonidle(1)/nonidle_total(3) = 0.3
>
> vs
>
> stall(1) * nonidle(1)/nonidle_total(2) = 0.5
> stall(0) * nonidle(0)/nonidle_total(1) = 0.0
>
> > Also IIUC the state contributions are:
> > STALL == stall(1) + nonidle(1)
> > SLEEP == stall(0) + nonidle(0)
> > RUN == nonidle(1)
> > Is that correct?
>
> Correct. And nonidle_total is from STALL + RUN + RUN.
>
> > > > > Neither answer is inherently wrong: if the user asks for pressure
> > > > > after half the period, we can't know yet that the CPU will go to sleep
> > > > > right after and its weight would be lower over the combined period.
> > > > >
> > > > > We could normalize the weight averaging to a fixed window regardless
> > > > > of how often stall times themselves are sampled. But that would make
> > > > > reporting less adaptive to sudden changes when the user intentionally
> > > > > uses poll() with short intervals in order to get a higher resolution.
> > > > >
> > > > > For now, simply limit sampling of the pressure file contents to the
> > > > > fixed two-second period already used by the aggregation worker.
> > > >
> > > > Hmm. This is tricky.
> > >
> > > Yes ;)
> > >
> > > > So, userspace-visible effect of this change is that totals will not
> > > > update when the psi file is read unless the psi_period expires.
> > >
> > > That's a visible side effect, but yeah, correct.
> > >
> > > > We used to postpone updating only the averages and now the totals
> > > > will follow suit. Not sure if presenting stale data is better than
> > > > having this dependency on timing of the read. As you noted, the
> > > > value we get is not inherently wrong. But one could argue both ways
> > > > I guess... Having this "quantum" effect when the act of observation
> > > > changes the state of the object is indeed weird, to say the least.
> > >
> > > Yes. Especially *because* we don't update the averages more than once
> > > per 2s window. It gives the impression they would follow steady
> > > sampling, but they're calculated based on total= which is aggregated
> > > on every read.
> > >
> > > Tying the total= updates to the same fixed window presents slightly
> > > less current data, but results in more obvious and intuitive behavior.
> > >
> > > For more current data there is always poll() - which is a better idea
> > > than busy-reading the pressure files for a sub-2s resolution...
> >
> > True, however triggers are rate-limited to one per tracking window
> > (0.5s min), so they have their own limitations.
>
> Right, but if somebody needed *higher* resolution than this, wouldn't
> it make more sense to lift the restrictions on poll() rather than read
> the pressure file several times per second?

Maybe. The current logic is simple: you set up the tracking window
size and you get one trigger once the growth within that window hits
the specified limit with no more triggers in that window. If we want
to support more triggers within one window we would need to reset the
growth after sending the trigger and provide a way to specify the
desired rate-limit (resolution). It's definitely doable but I would
like to see real-life usage before implementing this.

>
> > > > In the paragraph above you say "For now". Do you have an idea that
> > > > could solve the issue with totals being stale while removing this
> > > > dependency on the timing of reads?
> > >
> > > Yeah, it's hinted at in the paragraph before that.
> > >
> > > What we could do is decouple the CPU weight sampling from the stall
> > > time sampling, and use a fixed-window average for the nonidle /
> > > nonidle_total ratio when aggregating. This way, no matter how
> > > frequently you read the stall times, you get the same results.
> > >
> > > However, because the update frequency absent any reads is 2s, that
> > > would be the window size we'd have to use. So while we could present
> > > updated total= on every read, they would still be aggregated based on
> > > a relatively stale CPU load distribution.
> > >
> > > They wouldn't be as current as they might seem - and as current as the
> > > user would assume them to be if they read the file frequently and got
> > > new totals every time. This is even more subtle than the status quo.
> > >
> > > IMO the cleanest interface is simply to be honest and consistent about
> > > the pressure files following a fixed 2s aggregation frequency, and
> > > refer people who need a higher resolution to poll().
> >
> > I would agree, however 2s interval when (memory) pressure rises fast
> > is an eternity. The device can get into OOM state much faster than
> > that. I'm worried that this limitation might render totals useless.
>
> > The change does not pose issues for Android today because we rely on
> > poll() and then we read other stats. However in the ideal world we
> > would use only psi and after receiving the initial trigger we would
> > start reading psi periodically to detect any further spikes. This
> > would require totals to be up-to-date.
>
> I might be missing something, but why manually read the pressure files
> at a high frequency when you get the event? Wouldn't it be possible to
> keep listening for further trigger events from poll() in that case?
>
> Does it all come down to the 500ms window restriction?

Yes, pretty much so. But again, we don't live in an ideal world and
today we have to get some additional metrics, which requires periodic
reads after we receive the initial psi signal.

>
> You're right that this change would make it impossible to use total=
> in the pressure files for manual high-frequency sampling, my question
> is whether that does something that poll() can not, even in theory.

I think you are correct. If we remove rate-limits and psi signal is
all we need to make a kill/no-kill decision then we could rely purely
on psi triggers and would never need high-frequency total= reads.

>
> > These are my thoughts about long-term possible uses but I don't object
> > to this change as a short-term solution. Hopefully nobody uses totals
> > this way today, otherwise they are up for an unpleasant surprise.
>
> Yeah I think that's a valid concern. We shouldn't break such usecases
> if they exist. And it would break them in a non-obvious way. So let's
> table this version of the patch for now.
>
> Another option could be to use a separate aggregator state for avg=
> and total=, such that total= updates stay adaptive to the read
> frequency, but the canned avg= would at least behave more predictably
> for less sophisticated usecases.
>
> The implication would be that manual averaging of total= at sampling
> intervals other than 2s would no longer match up to avg=. But it's
> harder to imagine anybody would rely on that.
>
> What do you think?

That sounds like a good solution to me.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-06-15 17:59 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-08 19:03 [PATCH] psi: fix sampling artifact from pressure file read frequency Johannes Weiner
2021-06-10  3:32 ` Suren Baghdasaryan
2021-06-10 13:58   ` Johannes Weiner
2021-06-10 17:33     ` Suren Baghdasaryan
2021-06-14 14:41       ` Johannes Weiner
2021-06-15 17:58         ` Suren Baghdasaryan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.