* weird loadavg on idle machine post 5.7
@ 2020-07-02 17:15 Dave Jones
  2020-07-02 19:46 ` Dave Jones
                   ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Dave Jones @ 2020-07-02 17:15 UTC (permalink / raw)
  To: Linux Kernel; +Cc: peterz, mgorman, mingo, Linus Torvalds

When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
idle machine (that usually sees loadavg hover in the 0.xx range)
that it was consistently above 1.00 even when there was nothing running.
All that perf showed was the kernel was spending time in the idle loop
(and running perf).

For the first hour or so after boot, everything seems fine, but over
time loadavg creeps up, and once it's established a new baseline, it
never seems to ever drop below that again.

One morning I woke up to find loadavg at '7.xx', after almost as many
hours of uptime, which makes me wonder if perhaps this is triggered
by something in cron.  I have a bunch of scripts that fire off
every hour that involve thousands of shortlived runs of iptables/ipset,
but running them manually didn't seem to automatically trigger the bug.

Given it took a few hours of runtime to confirm good/bad, bisecting this
took the last two weeks. I did it four different times, the first
producing bogus results from over-eager 'good', but the last two runs
both implicated this commit:

commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Sun May 24 21:29:55 2020 +0100

    sched/core: Optimize ttwu() spinning on p->on_cpu
    
    Both Rik and Mel reported seeing ttwu() spend significant time on:
    
      smp_cond_load_acquire(&p->on_cpu, !VAL);
    
    Attempt to avoid this by queueing the wakeup on the CPU that owns the
    p->on_cpu value. This will then allow the ttwu() to complete without
    further waiting.
    
    Since we run schedule() with interrupts disabled, the IPI is
    guaranteed to happen after p->on_cpu is cleared, this is what makes it
    safe to queue early.
    
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Jirka Hladky <jhladky@redhat.com>
    Cc: Vincent Guittot <vincent.guittot@linaro.org>
    Cc: valentin.schneider@arm.com
    Cc: Hillf Danton <hdanton@sina.com>
    Cc: Rik van Riel <riel@surriel.com>
    Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net

Unfortunately it doesn't revert cleanly on top of rc3, so I haven't
confirmed 100% that it's the cause yet, but the two separate bisects
seem promising.

I don't see any obvious correlation between what's changing there and
the symptoms (other than "scheduler magic"), but maybe those closer to
this have ideas about what could be going awry?

	Dave

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
@ 2020-07-02 19:46 ` Dave Jones
  2020-07-02 21:15 ` Paul Gortmaker
  2020-07-02 21:36 ` Mel Gorman
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Jones @ 2020-07-02 19:46 UTC (permalink / raw)
  To: Linux Kernel, peterz, mgorman, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
 > When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
 > idle machine (that usually sees loadavg hover in the 0.xx range)
 > that it was consistently above 1.00 even when there was nothing running.
 > All that perf showed was the kernel was spending time in the idle loop
 > (and running perf).

Unfortunate typo there, I meant 5.8-rc2, and just confirmed the bug persists in
5.8-rc3.

	Dave

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
  2020-07-02 19:46 ` Dave Jones
@ 2020-07-02 21:15 ` Paul Gortmaker
  2020-07-03 13:23   ` Paul Gortmaker
  2020-07-02 21:36 ` Mel Gorman
  2 siblings, 1 reply; 20+ messages in thread
From: Paul Gortmaker @ 2020-07-02 21:15 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, peterz, mgorman, mingo, Linus Torvalds
  Cc: Michal Kubecek, Paul E. McKenney, Thomas Gleixner

[weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
> 
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
> 
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
> 
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first

I've seen pretty much the same thing - I was helping paulmck test
rcu-dev for something hopefully unrelated when I first saw it, and
assumed it came in with the sched-core merge, so I was using the commit
just under that merge as "good" to attempt a bisect.

> producing bogus results from over-eager 'good', but the last two runs

Yeah - it sucks.  I was using Paul's TREE03 rcu-torture for loading and
even after a two hour test I'd still get "false good" results.  Only
after 7h was I quite confident that good was really good.

> both implicated this commit:
> 
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
> 
>     sched/core: Optimize ttwu() spinning on p->on_cpu

I was down to 10 commits roughly above and below this guy before hearing
you were working the same problem.

I just got this guy to reveal a false load after a 2h test as well.
I want to let the one underneath soak overnight just to also confirm it
is "good" - so that is pending.

What I can add is that it looks like we are "leaking" an instance into
calc_load_tasks -- which isn't anything new -- see when tglx fixed it
before in d60585c5766.  It's unfortunate that we don't have some
low-overhead leak checks on that...?
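
For reference, the per-rq counts get folded into calc_load_tasks roughly
like this (paraphrased from memory of kernel/sched/loadavg.c, so treat it
as a sketch rather than the exact source):

   /* sketch: how each runqueue folds its counts into the global sum */
   long calc_load_fold_active(struct rq *this_rq, long adjust)
   {
           long nr_active, delta = 0;

           nr_active  = this_rq->nr_running - adjust;
           nr_active += (long)this_rq->nr_uninterruptible;

           if (nr_active != this_rq->calc_load_active) {
                   delta = nr_active - this_rq->calc_load_active;
                   this_rq->calc_load_active = nr_active;
           }
           return delta;
   }

   /* called from the tick path, once per LOAD_FREQ window per rq */
   atomic_long_add(calc_load_fold_active(this_rq, 0), &calc_load_tasks);

So if an nr_uninterruptible increment and its matching decrement ever
fail to balance out, the difference just sits in calc_load_tasks forever,
which is exactly what the stuck count below looks like.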

Anyway, if I "fix" the leak, then everything seems back to normal:

   (gdb) p calc_load_tasks
   $2 = {counter = 1}
   (gdb) set variable calc_load_tasks = { 0 }
   (gdb) p calc_load_tasks
   $4 = {counter = 0}
   (gdb) continue
   Continuing.
   
   [ ... watching decay on resumed target ....]
   
    10:13:14 up  9:54,  4 users,  load average: 0.92, 0.98, 1.15
    10:13:54 up  9:55,  4 users,  load average: 0.47, 0.86, 1.10
    10:15:17 up  9:56,  4 users,  load average: 0.12, 0.65, 1.00
    10:19:20 up 10:00,  4 users,  load average: 0.00, 0.28, 0.76
    10:26:07 up 10:07,  4 users,  load average: 0.00, 0.06, 0.48
    10:32:48 up 10:14,  4 users,  load average: 0.00, 0.00, 0.29

Obviously that isn't a fix, but it shows it is an accounting thing.
I've also used gdb to snoop all the cfs->avg fields and they look as
expected for a completely idle machine.  Nothing hiding in avg_rt or
avg_dl either.
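
The decay above is just loadavg's fixed-point exponential average doing
its thing once the stuck count is gone.  Roughly (again paraphrased from
memory, constants as I recall them from include/linux/sched/loadavg.h):

   #define FSHIFT   11                  /* nr of bits of precision */
   #define FIXED_1  (1 << FSHIFT)       /* 1.0 as fixed-point      */
   #define EXP_1    1884                /* 1/exp(5sec/1min)        */
   #define EXP_5    2014                /* 1/exp(5sec/5min)        */
   #define EXP_15   2037                /* 1/exp(5sec/15min)       */

   /* sketch: one 5-second update of a single average */
   unsigned long
   calc_load(unsigned long load, unsigned long exp, unsigned long active)
   {
           unsigned long newload;

           newload = load * exp + active * (FIXED_1 - exp);
           if (active >= load)
                   newload += FIXED_1 - 1;

           return newload / FIXED_1;
   }

With calc_load_tasks stuck at a positive value, 'active' never reaches
zero, so the averages converge on that value and never drop below it --
which matches the pinned baseline Dave described.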

>     
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>     
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>     
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>     
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>     
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
>     Cc: Jirka Hladky <jhladky@redhat.com>
>     Cc: Vincent Guittot <vincent.guittot@linaro.org>
>     Cc: valentin.schneider@arm.com
>     Cc: Hillf Danton <hdanton@sina.com>
>     Cc: Rik van Riel <riel@surriel.com>
>     Link: https://lore.kernel.org/r/20200524202956.27665-2-mgorman@techsingularity.net
> 
> Unfortunately it doesn't revert cleanly on top of rc3, so I haven't
> confirmed 100% that it's the cause yet, but the two separate bisects
> seem promising.

I've not tried the revert (yet) - but Kyle saw me boring people on
#kernel with the details of bisecting this and gave me the heads-up you
were looking at it too (thanks Kyle!).   So I figured I'd better add
what I'd seen so far.

I'm testing with what is largely a defconfig, plus KVM_INTEL (needed for
paulmck TREE03 rcu-torture), plus I enabled KGDB and DEBUG_INFO after a
while so I could poke and prod - but was reproducing it before that.

For completeness, the test was:

  tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 120 \
	--configs TREE03 --trust-make

...on a 24-core, 2013-vintage Xeon v2 COTS box.   As above, the 120m runs
seemed to give somewhere between 60% and 75% confidence of not getting a
false good.

Anyway - so that is all I know so far...

Paul.
--

> 
> I don't see any obvious correlation between what's changing there and
> the symptoms (other than "scheduler magic"), but maybe those closer to
> this have ideas about what could be going awry?
> 
> 	Dave

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
  2020-07-02 19:46 ` Dave Jones
  2020-07-02 21:15 ` Paul Gortmaker
@ 2020-07-02 21:36 ` Mel Gorman
  2020-07-02 23:11   ` Michal Kubecek
                     ` (2 more replies)
  2 siblings, 3 replies; 20+ messages in thread
From: Mel Gorman @ 2020-07-02 21:36 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, peterz, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 01:15:48PM -0400, Dave Jones wrote:
> When I upgraded my firewall to 5.7-rc2 I noticed that on a mostly
> idle machine (that usually sees loadavg hover in the 0.xx range)
> that it was consistently above 1.00 even when there was nothing running.
> All that perf showed was the kernel was spending time in the idle loop
> (and running perf).
> 
> For the first hour or so after boot, everything seems fine, but over
> time loadavg creeps up, and once it's established a new baseline, it
> never seems to ever drop below that again.
> 
> One morning I woke up to find loadavg at '7.xx', after almost as many
> hours of uptime, which makes me wonder if perhaps this is triggered
> by something in cron.  I have a bunch of scripts that fire off
> every hour that involve thousands of shortlived runs of iptables/ipset,
> but running them manually didn't seem to automatically trigger the bug.
> 
> Given it took a few hours of runtime to confirm good/bad, bisecting this
> took the last two weeks. I did it four different times, the first
> producing bogus results from over-eager 'good', but the last two runs
> both implicated this commit:
> 
> commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Sun May 24 21:29:55 2020 +0100
> 
>     sched/core: Optimize ttwu() spinning on p->on_cpu
>     
>     Both Rik and Mel reported seeing ttwu() spend significant time on:
>     
>       smp_cond_load_acquire(&p->on_cpu, !VAL);
>     
>     Attempt to avoid this by queueing the wakeup on the CPU that owns the
>     p->on_cpu value. This will then allow the ttwu() to complete without
>     further waiting.
>     
>     Since we run schedule() with interrupts disabled, the IPI is
>     guaranteed to happen after p->on_cpu is cleared, this is what makes it
>     safe to queue early.
>     
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>

Peter, I'm not supremely confident about this but could it be because
"p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
happens while a task is still being dequeued? In the final stages of a
task switch we have

        prev_state = prev->state;
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_task(prev);

finish_task is when p->on_cpu is cleared after the state is updated.
With the patch, we potentially update sched_contributes_to_load while
p->state is transient so if the check below is true and ttwu_queue_wakelist
is used then sched_contributes_to_load was based on a transient value
and potentially wrong.

        if (smp_load_acquire(&p->on_cpu) &&
            ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
                goto unlock;

sched_contributes_to_load determines the value of rq->nr_uninterruptible,
which is used in the load value, so it's a partial fit. The race would not
happen very often on a normal desktop, which would explain why it takes
a long time for the value to get screwed up, so it appears to fit.

I'm thinking that the !!task_contributes_to_load(p) should still happen
after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
held to stabilised p->state against a parallel wakeup or updating the
task rq. I do not see any hazards with respect to smp_rmb and the value
of p->state in this particular path but I've confused myself enough in
the various scheduler and wakeup paths that I don't want to bet money on
it late in the evening.

It builds, not booted, it's for discussion but maybe Dave is feeling brave!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40392d4..52c73598b18a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_cond_load_acquire(&p->on_cpu, !VAL);
 
+	/*
+	 * p is off the cpu and pi_lock is held to p->state is stable
+	 * for calculating whether it contributes to load.
+	 */
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
 	if (task_cpu(p) != cpu) {
 		wake_flags |= WF_MIGRATED;

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 21:36 ` Mel Gorman
@ 2020-07-02 23:11   ` Michal Kubecek
  2020-07-02 23:24   ` Dave Jones
  2020-07-03  9:02   ` Peter Zijlstra
  2 siblings, 0 replies; 20+ messages in thread
From: Michal Kubecek @ 2020-07-02 23:11 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Dave Jones, Linux Kernel, peterz, mingo, Linus Torvalds,
	Paul E. McKenney, Paul Gortmaker, Thomas Gleixner

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
> 
> It builds, not booted, it's for discussion but maybe Dave is feeling brave!
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca5db40392d4..52c73598b18a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2592,9 +2592,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	}
>  
>  #ifdef CONFIG_SMP
> -	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> -	p->state = TASK_WAKING;
> -
>  	/*
>  	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
>  	 * possible to, falsely, observe p->on_cpu == 0.
> @@ -2650,6 +2647,13 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>  	 */
>  	smp_cond_load_acquire(&p->on_cpu, !VAL);
>  
> +	/*
> +	 * p is off the cpu and pi_lock is held to p->state is stable
> +	 * for calculating whether it contributes to load.
> +	 */
> +	p->sched_contributes_to_load = !!task_contributes_to_load(p);
> +	p->state = TASK_WAKING;
> +
>  	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
>  	if (task_cpu(p) != cpu) {
>  		wake_flags |= WF_MIGRATED;
> 
> -- 

I felt brave but something is probably wrong: I tried to boot three
times and all three attempts crashed during boot, all three in scheduler
related functions:

  set_next_entity()
  check_preempt_wakeup()
  finish_task_switch()

I left my USB-to-serial adapter in the office but I'll try again
tomorrow on a test machine with serial console.

Michal

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 21:36 ` Mel Gorman
  2020-07-02 23:11   ` Michal Kubecek
@ 2020-07-02 23:24   ` Dave Jones
  2020-07-03  9:02   ` Peter Zijlstra
  2 siblings, 0 replies; 20+ messages in thread
From: Dave Jones @ 2020-07-02 23:24 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Linux Kernel, peterz, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
 
 > I'm thinking that the !!task_contributes_to_load(p) should still happen
 > after smp_cond_load_acquire() when on_cpu is stable and the pi_lock is
 > held to stabilise p->state against a parallel wakeup or updating the
 > task rq. I do not see any hazards with respect to smp_rmb and the value
 > of p->state in this particular path but I've confused myself enough in
 > the various scheduler and wakeup paths that I don't want to bet money on
 > it late in the evening.
 > 
 > It builds, not booted, it's for discussion but maybe Dave is feeling brave!

stalls, and then panics during boot :(


[   16.933212] igb 0000:02:00.0 eth1: igb: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
[   69.572840] watchdog: BUG: soft lockup - CPU#3 stuck for 44s! [kworker/u8:0:7]
[   69.572849] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Not tainted 5.8.0-rc3-firewall+ #2
[   69.572852] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[   69.572861] Workqueue:  0x0 (events_power_efficient)
[   69.572877] RIP: 0010:finish_task_switch+0x71/0x1a0
[   69.572884] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[   69.572887] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[   69.572893] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[   69.572895] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[   69.572898] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[   69.572901] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[   69.572904] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[   69.572909] FS:  0000000000000000(0000) GS:ffff946557580000(0000) knlGS:0000000000000000
[   69.572912] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   69.572917] CR2: 00007f29b26abc30 CR3: 000000020812d001 CR4: 00000000001606e0
[   69.572919] Call Trace:
[   69.572937]  __schedule+0x28d/0x570
[   69.572946]  ? _cond_resched+0x15/0x30
[   69.572954]  schedule+0x38/0xa0
[   69.572962]  worker_thread+0xaa/0x3c0
[   69.572968]  ? process_one_work+0x3c0/0x3c0
[   69.572972]  kthread+0x116/0x130
[   69.572977]  ? __kthread_create_on_node+0x180/0x180
[   69.572982]  ret_from_fork+0x22/0x30
[   69.572988] Kernel panic - not syncing: softlockup: hung tasks
[   69.572993] CPU: 3 PID: 7 Comm: kworker/u8:0 Kdump: loaded Tainted: G             L    5.8.0-rc3-firewall+ #2
[   69.572995] Hardware name: To be filled by O.E.M. To be filled by O.E.M./Q3XXG4-P, BIOS 5.6.5 06/30/2018
[   69.572998] Workqueue:  0x0 (events_power_efficient)
[   69.573001] Call Trace:
[   69.573004]  <IRQ>
[   69.573010]  dump_stack+0x57/0x70
[   69.573016]  panic+0xfb/0x2cb
[   69.573024]  watchdog_timer_fn.cold.12+0x7d/0x96
[   69.573030]  ? softlockup_fn+0x30/0x30
[   69.573035]  __hrtimer_run_queues+0x100/0x280
[   69.573041]  hrtimer_interrupt+0xf4/0x210
[   69.573049]  __sysvec_apic_timer_interrupt+0x5d/0xf0
[   69.573055]  asm_call_on_stack+0x12/0x20
[   69.573058]  </IRQ>
[   69.573064]  sysvec_apic_timer_interrupt+0x6d/0x80
[   69.573069]  asm_sysvec_apic_timer_interrupt+0xf/0x20
[   69.573078] RIP: 0010:finish_task_switch+0x71/0x1a0
[   69.573082] Code: 00 00 4d 8b 7c 24 10 65 4c 8b 34 25 c0 6c 01 00 0f 1f 44 00 00 0f 1f 44 00 00 41 c7 44 24 2c 00 00 00 00 c6 03 00 fb 4d 85 ed <74> 0b f0 41 ff 4d 4c 0f 84 d9 00 00 00 49 83 c7 80 74 7a 48 89 d8
[   69.573085] RSP: 0018:ffffb36700067e40 EFLAGS: 00000246
[   69.573088] RAX: ffff94654eab0000 RBX: ffff9465575a8b40 RCX: 0000000000000000
[   69.573090] RDX: 0000000000000000 RSI: ffff9465565c0000 RDI: ffff94654eab0000
[   69.573092] RBP: ffffb36700067e68 R08: 0000000000000001 R09: 00000000000283c0
[   69.573094] R10: 0000000000000000 R11: 0000000000000000 R12: ffff94654eab0000
[   69.573096] R13: 0000000000000000 R14: ffff9465565c0000 R15: 0000000000000001
[   69.573106]  __schedule+0x28d/0x570
[   69.573113]  ? _cond_resched+0x15/0x30
[   69.573119]  schedule+0x38/0xa0
[   69.573125]  worker_thread+0xaa/0x3c0
[   69.573130]  ? process_one_work+0x3c0/0x3c0
[   69.573134]  kthread+0x116/0x130
[   69.573149]  ? __kthread_create_on_node+0x180/0x180
[   69.792344]  ret_from_fork+0x22/0x30



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 21:36 ` Mel Gorman
  2020-07-02 23:11   ` Michal Kubecek
  2020-07-02 23:24   ` Dave Jones
@ 2020-07-03  9:02   ` Peter Zijlstra
  2020-07-03 10:40     ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-03  9:02 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Dave Jones, Linux Kernel, mingo, Linus Torvalds

On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:

> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <peterz@infradead.org>

> Peter, I'm not supremely confident about this but could it be because
> "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> happens while a task is still being dequeued? In the final stages of a
> task switch we have
> 
>         prev_state = prev->state;
>         vtime_task_switch(prev);
>         perf_event_task_sched_in(prev, current);
>         finish_task(prev);
> 
> finish_task is when p->on_cpu is cleared after the state is updated.
> With the patch, we potentially update sched_contributes_to_load while
> p->state is transient so if the check below is true and ttwu_queue_wakelist
> is used then sched_contributes_to_load was based on a transient value
> and potentially wrong.

I'm not seeing it. Once a task hits schedule(), p->state doesn't change,
except through wakeup.

And while dequeue depends on p->state, it doesn't change it.

At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
started, which means we've (at least) stopped executing the task -- we
started or finished schedule().

Let me stare at this more...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-03  9:02   ` Peter Zijlstra
@ 2020-07-03 10:40     ` Peter Zijlstra
  2020-07-03 20:51       ` Dave Jones
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-03 10:40 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Dave Jones, Linux Kernel, mingo, Linus Torvalds

On Fri, Jul 03, 2020 at 11:02:26AM +0200, Peter Zijlstra wrote:
> On Thu, Jul 02, 2020 at 10:36:27PM +0100, Mel Gorman wrote:
> 
> > > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > > Author: Peter Zijlstra <peterz@infradead.org>
> 
> > Peter, I'm not supremely confident about this but could it be because
> > "p->sched_contributes_to_load = !!task_contributes_to_load(p)" potentially
> > happens while a task is still being dequeued? In the final stages of a
> > task switch we have
> > 
> >         prev_state = prev->state;
> >         vtime_task_switch(prev);
> >         perf_event_task_sched_in(prev, current);
> >         finish_task(prev);
> > 
> > finish_task is when p->on_cpu is cleared after the state is updated.
> > With the patch, we potentially update sched_contributes_to_load while
> > p->state is transient so if the check below is true and ttwu_queue_wakelist
> > is used then sched_contributes_to_load was based on a transient value
> > and potentially wrong.
> 
> I'm not seeing it. Once a task hits schedule(), p->state doesn't change,
> except through wakeup.
> 
> And while dequeue depends on p->state, it doesn't change it.
> 
> At this point in ttwu() we know p->on_rq == 0, which implies dequeue has
> started, which means we've (at least) stopped executing the task -- we
> started or finished schedule().
> 
> Let me stare at this more...

So ARM/Power/etc.. can speculate the load such that the
task_contributes_to_load() value is from before ->on_rq.

The compiler might similarly re-order things -- although I've not found it
doing so with the few builds I looked at.

So I think at the very least we should do something like this. But I've
no idea how to reproduce this problem.

Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
this, and by not resetting p->sched_contributes_to_load it would skew
accounting even worse.

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fcd56f04b706..cba8a56d0f7f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2799,9 +2799,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2823,6 +2820,9 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 */
 	smp_rmb();
 
+	p->sched_contributes_to_load = !!task_contributes_to_load(p);
+	p->state = TASK_WAKING;
+
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
 	 * this task as prev, considering queueing p on the remote CPUs wake_list

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-02 21:15 ` Paul Gortmaker
@ 2020-07-03 13:23   ` Paul Gortmaker
  0 siblings, 0 replies; 20+ messages in thread
From: Paul Gortmaker @ 2020-07-03 13:23 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel, peterz, mgorman, mingo, Linus Torvalds
  Cc: Michal Kubecek, Paul E. McKenney, Thomas Gleixner

[Re: weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 17:15) Paul Gortmaker wrote:

> [weird loadavg on idle machine post 5.7] On 02/07/2020 (Thu 13:15) Dave Jones wrote:

[...]

> > both implicated this commit:
> > 
> > commit c6e7bd7afaeb3af55ffac122828035f1c01d1d7b (refs/bisect/bad)
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Sun May 24 21:29:55 2020 +0100
> > 
> >     sched/core: Optimize ttwu() spinning on p->on_cpu
> 
> I was down to 10 commits roughly above and below this guy before hearing
> you were working the same problem.
> 
> I just got this guy to reveal a false load after a 2h test as well.
> I want to let the one underneath soak overnight just to also confirm it
> is "good" - so that is pending.

As per above, I ran a 12h test overnight on d505b8af5891 and it seems
fine.  Every other "bad" bisect point failed in 7h or less.  So my
testing seems to give the same result as Dave.

Paul.
--

root@t5610:/home/paul/git/linux-head# 
[1]-  Done                    nohup
tools/testing/selftests/rcutorture/bin/kvm.sh --cpus 24 --duration 720 --configs TREE03 --trust-make > /tmp/kvm.sh.out 2>&1
root@t5610:/home/paul/git/linux-head# uptime
 09:10:56 up 13:12,  2 users,  load average: 0.00, 0.00, 0.14
root@t5610:/home/paul/git/linux-head# cat /proc/version 
Linux version 5.7.0-rc6-00029-gd505b8af5891 (paul@t5610) (gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2), GNU ld (GNU Binutils for Ubuntu) 2.34) #2 SMP Thu Jul 2 18:55:40 EDT 2020

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-03 10:40     ` Peter Zijlstra
@ 2020-07-03 20:51       ` Dave Jones
  2020-07-06 14:59         ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Jones @ 2020-07-03 20:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mel Gorman, Linux Kernel, mingo, Linus Torvalds

On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
 
 > So ARM/Power/etc.. can speculate the load such that the
 > task_contributes_to_load() value is from before ->on_rq.
 > 
 > The compiler might similarly re-order things -- although I've not found it
 > doing so with the few builds I looked at.
 > 
 > So I think at the very least we should do something like this. But I've
 > no idea how to reproduce this problem.
 > 
 > Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
 > this, and by not resetting p->sched_contributes_to_load it would skew
 > accounting even worse.

looked promising the first few hours, but as soon as it hit four hours
of uptime, loadavg spiked and is now pinned to at least 1.00

	Dave

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-03 20:51       ` Dave Jones
@ 2020-07-06 14:59         ` Peter Zijlstra
  2020-07-06 21:20           ` Dave Jones
                             ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-06 14:59 UTC (permalink / raw)
  To: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds
  Cc: paul.gortmaker, valentin.schneider

On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
> On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
>  
>  > So ARM/Power/etc.. can speculate the load such that the
>  > task_contributes_to_load() value is from before ->on_rq.
>  > 
 >  > The compiler might similarly re-order things -- although I've not found it
>  > doing so with the few builds I looked at.
>  > 
 >  > So I think at the very least we should do something like this. But I've
>  > no idea how to reproduce this problem.
>  > 
>  > Mel's patch placed it too far down, as the WF_ON_CPU path also relies on
>  > this, and by not resetting p->sched_contributes_to_load it would skew
>  > accounting even worse.
> 
> looked promising the first few hours, but as soon as it hit four hours
> of uptime, loadavg spiked and is now pinned to at least 1.00

OK, lots of cursing later, I now have the below...

The TL;DR is that while schedule() doesn't change p->state once it
starts, it does read it quite a bit, and ttwu() will actually change it
to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
reads it to do loadavg accounting, things go sideways.

The below is extra complicated by the fact that I've had to scrounge up
a bunch of load-store ordering without actually adding barriers. It adds
yet another control dependency to ttwu(), so take that C standard :-)

I've booted it, and built a few kernels with it and checked loadavg
drops to 0 after each build, so from that pov all is well, but since
I'm not confident I can reproduce the issue, I can't tell this actually
fixes anything, except maybe phantoms of my imagination.

---
 include/linux/sched.h |  4 ---
 kernel/sched/core.c   | 67 +++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 49 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9bd073a10224..e26c8bbeda00 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-					 (task->flags & PF_FROZEN) == 0 && \
-					 (task->state & TASK_NOLOAD) == 0)
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1d3d2d67f398..f245444b4b15 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about it's own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(bool preempt)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	prev_state = prev->state;
+
 	/*
-	 * Make sure that signal_pending_state()->signal_pending() below
-	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * __set_current_state(@state)
+	 * schedule()				signal_wake_up()
+	 *   prev_state = p->state		  set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *					  wake_up_state()
+	 *   LOCK rq->lock			    LOCK p->pi_state
+	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
+	 *     if (signal_pending_state()	    if (p->state & @state)
+	 *
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+	/*
+	 * We must re-load p->state in case ttwu_runnable() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)) == TASK_UNINTERRUPTIBLE &&
+				(prev->flags & PF_FROZEN) == 0;
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()			ttwu()
+			 *   prev_state = prev->state;	  if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock		    goto out;
+			 *   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;		  p->state = TASK_WAKING;
+			 *
+			 * After this, schedule() must not care about p->state any more.
+			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-06 14:59         ` Peter Zijlstra
@ 2020-07-06 21:20           ` Dave Jones
  2020-07-07  7:48             ` Peter Zijlstra
  2020-07-06 23:56           ` Valentin Schneider
  2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
  2 siblings, 1 reply; 20+ messages in thread
From: Dave Jones @ 2020-07-06 21:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linux Kernel, mingo, Linus Torvalds, paul.gortmaker,
	valentin.schneider

On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote:
 > On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
 > > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
 > >  
 > > looked promising the first few hours, but as soon as it hit four hours
 > > of uptime, loadavg spiked and is now pinned to at least 1.00
 > 
 > OK, lots of cursing later, I now have the below...
 > 
 > The TL;DR is that while schedule() doesn't change p->state once it
 > starts, it does read it quite a bit, and ttwu() will actually change it
 > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
 > reads it to do loadavg accounting, things go sideways.
 > 
 > The below is extra complicated by the fact that I've had to scrounge up
 > a bunch of load-store ordering without actually adding barriers. It adds
 > yet another control dependency to ttwu(), so take that C standard :-)

Man this stuff is subtle. I could've read this a hundred times and not
even come close to approaching this.

Basically me reading scheduler code:
http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg

 > I've booted it, and built a few kernels with it and checked loadavg
 > drops to 0 after each build, so from that pov all is well, but since
 > I'm not confident I can reproduce the issue, I can't tell this actually
 > fixes anything, except maybe phantoms of my imagination.

Five hours in, looking good so far.  I think you nailed it.

	Dave


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-06 14:59         ` Peter Zijlstra
  2020-07-06 21:20           ` Dave Jones
@ 2020-07-06 23:56           ` Valentin Schneider
  2020-07-07  8:17             ` Peter Zijlstra
  2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
  2 siblings, 1 reply; 20+ messages in thread
From: Valentin Schneider @ 2020-07-06 23:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker


On 06/07/20 15:59, Peter Zijlstra wrote:
> OK, lots of cursing later, I now have the below...
>
> The TL;DR is that while schedule() doesn't change p->state once it
> starts, it does read it quite a bit, and ttwu() will actually change it
> to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
> reads it to do loadavg accounting, things go sideways.
>
> The below is extra complicated by the fact that I've had to scrounge up
> a bunch of load-store ordering without actually adding barriers. It adds
> yet another control dependency to ttwu(), so take that C standard :-)
>
> I've booted it, and built a few kernels with it and checked loadavg
> drops to 0 after each build, so from that pov all is well, but since
> I'm not confident I can reproduce the issue, I can't tell this actually
> fixes anything, except maybe phantoms of my imagination.
>

As you said on IRC, the one apparent race would lead to "negative"
rq->nr_uninterruptible accounting, i.e. we'd skip some increments and so
end up with more decrements. As for the described issue, I think we were
both expecting "positive" accounting, i.e. more increments than decrements,
leading to an artificially inflated loadavg.

In any case, this should get rid of any existing race. I'll need some more
time (and more aperol spritz) to go through it all though.

> @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>        *
>        * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
>        * __schedule().  See the comment for smp_mb__after_spinlock().
> +	 *
> +	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> +	 * schedule()'s deactivate_task() has 'happened' and p will no longer
> +	 * care about it's own p->state. See the comment in __schedule().
>        */
> -	smp_rmb();
> +	smp_acquire__after_ctrl_dep();

Apologies for asking again, but I'm foolishly hopeful I'll someday be able
to grok those things without half a dozen tabs open with documentation and
Paul McKenney papers.

Do I get it right that the 'acquire' part hints this is equivalent to
issuing a load-acquire on whatever was needed to figure out whether or not
to take the branch (in this case, p->on_rq, amongst other things); IOW
ensures any memory access appearing later in program order has to happen
after the load?

That at least explains to me the load->{load,store} wording in
smp_acquire__after_ctrl_dep().

> +
> +	/*
> +	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
> +	 * == 0), which means we need to do an enqueue, change p->state to
> +	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
> +	 * enqueue, such as ttwu_queue_wakelist().
> +	 */
> +	p->state = TASK_WAKING;
>
>       /*
>        * If the owning (remote) CPU is still in the middle of schedule() with

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-06 21:20           ` Dave Jones
@ 2020-07-07  7:48             ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-07  7:48 UTC (permalink / raw)
  To: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, valentin.schneider

On Mon, Jul 06, 2020 at 05:20:57PM -0400, Dave Jones wrote:
> On Mon, Jul 06, 2020 at 04:59:52PM +0200, Peter Zijlstra wrote:
>  > On Fri, Jul 03, 2020 at 04:51:53PM -0400, Dave Jones wrote:
>  > > On Fri, Jul 03, 2020 at 12:40:33PM +0200, Peter Zijlstra wrote:
>  > >  
>  > > looked promising the first few hours, but as soon as it hit four hours
>  > > of uptime, loadavg spiked and is now pinned to at least 1.00
>  > 
>  > OK, lots of cursing later, I now have the below...
>  > 
>  > The TL;DR is that while schedule() doesn't change p->state once it
>  > starts, it does read it quite a bit, and ttwu() will actually change it
>  > to TASK_WAKING. So if ttwu() changes it to WAKING before schedule()
>  > reads it to do loadavg accounting, things go sideways.
>  > 
>  > The below is extra complicated by the fact that I've had to scrounge up
>  > a bunch of load-store ordering without actually adding barriers. It adds
>  > yet another control dependency to ttwu(), so take that C standard :-)
> 
> Man this stuff is subtle. I could've read this a hundred times and not
> even come close to approaching this.
> 
> Basically me reading scheduler code:
> http://www.quickmeme.com/img/96/9642ed212bbced00885592b39880ec55218e922245e0637cf94db2e41857d558.jpg

Heh, that one made me nearly spill my tea, much funnies :-)

But yes, Dave Chinner also complained about this for the previous fix.
I've written this:

  https://lore.kernel.org/lkml/20200703133259.GE4781@hirez.programming.kicks-ass.net/

to help with that. But clearly I'll need to update that patch again
after this little adventure.

>  > I've booted it, and built a few kernels with it and checked loadavg
>  > drops to 0 after each build, so from that pov all is well, but since
>  > I'm not confident I can reproduce the issue, I can't tell this actually
>  > fixes anything, except maybe phantoms of my imagination.
> 
> Five hours in, looking good so far.  I think you nailed it.

\o/ hooray! Thanks for testing Dave!

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-06 23:56           ` Valentin Schneider
@ 2020-07-07  8:17             ` Peter Zijlstra
  2020-07-07 10:20               ` Valentin Schneider
  2020-07-07 10:29               ` Peter Zijlstra
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-07  8:17 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, paulmck

On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote:

> > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
> >        *
> >        * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
> >        * __schedule().  See the comment for smp_mb__after_spinlock().
> > +	 *
> > +	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
> > +	 * schedule()'s deactivate_task() has 'happened' and p will no longer
> > +	 * care about it's own p->state. See the comment in __schedule().
> >        */
> > -	smp_rmb();
> > +	smp_acquire__after_ctrl_dep();
> 
> Apologies for asking again, but I'm foolishly hopeful I'll someday be able
> to grok those things without half a dozen tabs open with documentation and
> Paul McKenney papers.
> 
> Do I get it right that the 'acquire' part hints this is equivalent to
> issuing a load-acquire on whatever was needed to figure out whether or not
> the take the branch (in this case, p->on_rq, amongst other things); IOW
> ensures any memory access appearing later in program order has to happen
> after the load?
> 
> That at least explains to me the load->{load,store} wording in
> smp_acquire__after_ctrl_dep().

Yes.

So the thing is that hardware MUST NOT speculate stores, or rather, if
it does, it must take extreme measures to ensure they do not become
visible in any way shape or form, since speculative stores lead to
instant OOTA problems.

Therefore we can say that branches order stores and if the branch
condition depends on a load, we get a load->store order. IOW the load
must complete before we can resolve the branch, which in turn enables
the store to become visible/happen.

If we then add an smp_rmb() to the branch to order load->load, we end up
with a load->{load,store} ordering, which is equivalent to a
load-acquire.

The reason to do it like that is that a load-acquire would otherwise
require an smp_mb(), since for many platforms that's the only barrier
that has load->store ordering.

The down-side of doing it like this, as Paul will be quick to point out,
is that the C standard doesn't recognise control dependencies and thus
the compiler would be within its rights to 'optimize' our conditional away.

We're relying on the compilers not having done this in the past and
there being sufficient compiler people interested in compiling Linux to
avoid this from happening.
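
In code, the pattern is roughly this -- illustrative only, not the exact
ttwu() sequence:

	if (!READ_ONCE(p->on_rq)) {		/* the LOAD the branch depends on */
		/*
		 * The branch cannot resolve before the LOAD completes, and
		 * stores must not be speculated, so any STORE in here is
		 * already ordered after the LOAD (LOAD->STORE).
		 */
		smp_acquire__after_ctrl_dep();	/* add LOAD->LOAD; combined:
						 * LOAD->{LOAD,STORE} == ACQUIRE */

		p->state = TASK_WAKING;		/* ordered after the p->on_rq load */
	}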


Anyway, this patch is basically:

	LOAD p->state		LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0	STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the p->state load.
Just a wee bit complicated due to not actually adding any barriers while
adding additional ordering.

Anyway, let me now endeavour to write a coherent Changelog for this mess
:-(

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-06 14:59         ` Peter Zijlstra
  2020-07-06 21:20           ` Dave Jones
  2020-07-06 23:56           ` Valentin Schneider
@ 2020-07-07  9:20           ` Qais Yousef
  2020-07-07  9:47             ` Peter Zijlstra
  2 siblings, 1 reply; 20+ messages in thread
From: Qais Yousef @ 2020-07-07  9:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, valentin.schneider

On 07/06/20 16:59, Peter Zijlstra wrote:

[...]

> @@ -4104,12 +4108,19 @@ static void __sched notrace __schedule(bool preempt)
>  	local_irq_disable();
>  	rcu_note_context_switch(preempt);
>  
> +	prev_state = prev->state;
> +
>  	/*
> -	 * Make sure that signal_pending_state()->signal_pending() below
> -	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
> -	 * done by the caller to avoid the race with signal_wake_up().
> +	 * __set_current_state(@state)
> +	 * schedule()				signal_wake_up()
> +	 *   prev_state = p->state		  set_tsk_thread_flag(p, TIF_SIGPENDING)
> +	 *					  wake_up_state()
> +	 *   LOCK rq->lock			    LOCK p->pi_state
> +	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
> +	 *     if (signal_pending_state()	    if (p->state & @state)
> +	 *
>  	 *
> -	 * The membarrier system call requires a full memory barrier
> +	 * Also, the membarrier system call requires a full memory barrier
>  	 * after coming from user-space, before storing to rq->curr.
>  	 */
>  	rq_lock(rq, &rf);
> @@ -4120,10 +4131,30 @@ static void __sched notrace __schedule(bool preempt)
>  	update_rq_clock(rq);
>  
>  	switch_count = &prev->nivcsw;
> -	if (!preempt && prev->state) {
> -		if (signal_pending_state(prev->state, prev)) {
> +	/*
> +	 * We must re-load p->state in case ttwu_runnable() changed it
> +	 * before we acquired rq->lock.
> +	 */
> +	if (!preempt && prev_state && prev_state == prev->state) {

I think the compiler won't optimize `prev_state == prev->state` out because of
the smp_mb__after_spinlock() which implies a compiler barrier. Still not sure
if it's worth making prev->state accesses a READ_ONCE()?

Thanks

--
Qais Yousef

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
@ 2020-07-07  9:47             ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-07  9:47 UTC (permalink / raw)
  To: Qais Yousef
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, valentin.schneider

On Tue, Jul 07, 2020 at 10:20:05AM +0100, Qais Yousef wrote:
> On 07/06/20 16:59, Peter Zijlstra wrote:

> > +	if (!preempt && prev_state && prev_state == prev->state) {
> 
> I think the compiler won't optimize `prev_state == prev->state` out because of
> the smp_mb__after_spinlock() which implies a compiler barrier. Still not sure
> if it's worth making prev->state accesses a READ_ONCE()?

task_struct::state is one of the very rare (and ancient) variables
that's declared volatile.

We should probably clean that up some day, but so far I've not attempted
to do such a thing.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-07  8:17             ` Peter Zijlstra
@ 2020-07-07 10:20               ` Valentin Schneider
  2020-07-07 10:29               ` Peter Zijlstra
  1 sibling, 0 replies; 20+ messages in thread
From: Valentin Schneider @ 2020-07-07 10:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, paulmck


On 07/07/20 09:17, Peter Zijlstra wrote:
> On Tue, Jul 07, 2020 at 12:56:04AM +0100, Valentin Schneider wrote:
>
>> > @@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
>> >        *
>> >        * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
>> >        * __schedule().  See the comment for smp_mb__after_spinlock().
>> > +	 *
>> > +	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
>> > +	 * schedule()'s deactivate_task() has 'happened' and p will no longer
>> > +	 * care about it's own p->state. See the comment in __schedule().
>> >        */
>> > -	smp_rmb();
>> > +	smp_acquire__after_ctrl_dep();
>>
>> Apologies for asking again, but I'm foolishly hopeful I'll someday be able
>> to grok those things without half a dozen tabs open with documentation and
>> Paul McKenney papers.
>>
>> Do I get it right that the 'acquire' part hints this is equivalent to
>> issuing a load-acquire on whatever was needed to figure out whether or not
>> to take the branch (in this case, p->on_rq, amongst other things); IOW
>> ensures any memory access appearing later in program order has to happen
>> after the load?
>>
>> That at least explains to me the load->{load,store} wording in
>> smp_acquire__after_ctrl_dep().
>
> Yes.
>
> So the thing is that hardware MUST NOT speculate stores, or rather, if
> it does, it must take extreme measures to ensure they do not become
> visible in any way shape or form, since speculative stores lead to
> instant OOTA problems.
>
> Therefore we can say that branches order stores and if the branch
> condition depends on a load, we get a load->store order. IOW the load
> must complete before we can resolve the branch, which in turn enables
> the store to become visible/happen.
>

Right, I think that point is made clear in memory-barriers.txt.

> If we then add an smp_rmb() to the branch to order load->load, we end up
> with a load->{load,store} ordering, which is equivalent to a
> load-acquire.
>
> The reason to do it like that is that a load-acquire would otherwise
> require an smp_mb(), since for many platforms that's the only barrier
> that has load->store ordering.
>
> The down-side of doing it like this, as Paul will be quick to point out,
> is that the C standard doesn't recognise control dependencies and thus
> the compiler would be within its rights to 'optimize' our conditional away.
>

Yikes!

> We're relying on the compilers not having done this in the past and
> there being sufficient compiler people interested in compiling Linux to
> avoid this from happening.
>
>
> Anyway, this patch is basically:
>
>       LOAD p->state		LOAD-ACQUIRE p->on_rq == 0
>       MB
>       STORE p->on_rq, 0	STORE p->state, TASK_WAKING
>
> which ensures the TASK_WAKING store happens after the p->state load.
> Just a wee bit complicated due to not actually adding any barriers while
> adding additional ordering.
>

Your newer changelog also helped in that regard.

Thanks a ton for the writeup!

> Anyway, let me now endeavour to write a coherent Changelog for this mess
> :-(

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: weird loadavg on idle machine post 5.7
  2020-07-07  8:17             ` Peter Zijlstra
  2020-07-07 10:20               ` Valentin Schneider
@ 2020-07-07 10:29               ` Peter Zijlstra
  2020-07-08  9:46                 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2020-07-07 10:29 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Dave Jones, Mel Gorman, Linux Kernel, mingo, Linus Torvalds,
	paul.gortmaker, paulmck

On Tue, Jul 07, 2020 at 10:17:19AM +0200, Peter Zijlstra wrote:
> Anyway, let me now endeavour to write a coherent Changelog for this mess

I'll go stick this in sched/urgent and update that other documentation
patch (again)..

---
Subject: sched: Fix loadavg accounting race
From: Peter Zijlstra <peterz@infradead.org>
Date: Fri, 3 Jul 2020 12:40:33 +0200

The recent commit:

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

up before:

	smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change its ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to keep weakly ordered hardware from re-ordering things for us. This
can fairly easily be achieved by relying on the control-dependency
already in place.

The second problem, which exposes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make the same hold for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to be a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state afterwards. However,
we know that if it changed, we no longer have to worry about the
blocking path. This gives us the required ordering: if we block, we did
the prev->state load before an (effective) smp_mb(), and the p->on_rq
store need not change.

With this we end up with the effective ordering:

	LOAD p->state           LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0       STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
---
 include/linux/sched.h |    4 --
 kernel/sched/core.c   |   67 ++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 51 insertions(+), 20 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-					 (task->flags & PF_FROZEN) == 0 && \
-					 (task->state & TASK_NOLOAD) == 0)
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
 /*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1313,9 +1313,6 @@ static inline void dequeue_task(struct r
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1325,9 +1322,6 @@ void deactivate_task(struct rq *rq, stru
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2228,10 +2222,10 @@ ttwu_do_activate(struct rq *rq, struct t
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2575,7 +2569,7 @@ try_to_wake_up(struct task_struct *p, un
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2584,9 +2578,6 @@ try_to_wake_up(struct task_struct *p, un
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2605,8 +2596,20 @@ try_to_wake_up(struct task_struct *p, un
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about it's own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4088,6 +4091,7 @@ static void __sched notrace __schedule(b
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4104,12 +4108,22 @@ static void __sched notrace __schedule(b
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	/* See deactivate_task() below. */
+	prev_state = prev->state;
+
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * done by the caller to avoid the race with signal_wake_up():
+	 *
+	 * __set_current_state(@state)		signal_wake_up()
+	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *					  wake_up_state(p, state)
+	 *   LOCK rq->lock			    LOCK p->pi_state
+	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
+	 *     if (signal_pending_state())	    if (p->state & @state)
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4120,10 +4134,31 @@ static void __sched notrace __schedule(b
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+	/*
+	 * We must re-load prev->state in case ttwu_remote() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & TASK_UNINTERRUPTIBLE) &&
+				!(prev_state & TASK_NOLOAD) &&
+				!(prev->flags & PF_FROZEN);
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()			ttwu()
+			 *   prev_state = prev->state;	  if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock		    goto out;
+			 *   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;		  p->state = TASK_WAKING;
+			 *
+			 * After this, schedule() must not care about p->state any more.
+			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {
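
Condensed for illustration (not a literal copy of the hunk above;
elided details are marked with '...'), the core of the fix is a
snapshot of prev->state taken before rq->lock and revalidated after
the lock's implied full barrier:

	prev_state = prev->state;	/* snapshot before taking rq->lock */

	rq_lock(rq, &rf);
	smp_mb__after_spinlock();	/* full barrier, pairs with ttwu() */
	...
	if (!preempt && prev_state && prev_state == prev->state) {
		/*
		 * The state did not change, so no wakeup slipped in between
		 * the snapshot and the lock; it is safe to do the
		 * nr_uninterruptible accounting and dequeue the task.  If it
		 * did change, ttwu_remote() already woke the task and the
		 * blocking path is skipped entirely.
		 */
		...
		deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
	}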

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip: sched/urgent] sched: Fix loadavg accounting race
  2020-07-07 10:29               ` Peter Zijlstra
@ 2020-07-08  9:46                 ` tip-bot2 for Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-07-08  9:46 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Dave Jones, Paul Gortmaker, Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7
Gitweb:        https://git.kernel.org/tip/dbfb089d360b1cc623c51a2c7cf9b99eff78e0e7
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 03 Jul 2020 12:40:33 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 08 Jul 2020 11:38:49 +02:00

sched: Fix loadavg accounting race

The recent commit:

  c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")

moved these lines in ttwu():

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

up before:

	smp_cond_load_acquire(&p->on_cpu, !VAL);

into the 'p->on_rq == 0' block, with the thinking that once we hit
schedule() the current task cannot change its ->state anymore. And
while this is true, it is both incorrect and flawed.

It is incorrect in that we need at least an ACQUIRE on 'p->on_rq == 0'
to keep weakly ordered hardware from re-ordering things for us. This
can fairly easily be achieved by relying on the control-dependency
already in place.

The second problem, which exposes the flaw in the original argument, is
that while schedule() will not change prev->state, it will read it a
number of times (arguably too many times since it's marked volatile).
The previous condition 'p->on_cpu == 0' was sufficient because that
indicates schedule() has completed, and will no longer read
prev->state. So now the trick is to make the same hold for the (much)
earlier 'prev->on_rq == 0' case.

Furthermore, in order to make the ordering stick, the 'prev->on_rq = 0'
assignment needs to be a RELEASE, but adding additional ordering to
schedule() is an unwelcome proposition at the best of times, doubly so
for mere accounting.

Luckily we can push the prev->state load up before rq->lock, with the
only caveat that we then have to re-read the state afterwards. However,
we know that if it changed, we no longer have to worry about the
blocking path. This gives us the required ordering: if we block, we did
the prev->state load before an (effective) smp_mb(), and the p->on_rq
store need not change.

With this we end up with the effective ordering:

	LOAD p->state           LOAD-ACQUIRE p->on_rq == 0
	MB
	STORE p->on_rq, 0       STORE p->state, TASK_WAKING

which ensures the TASK_WAKING store happens after the prev->state
load, and all is well again.

Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
Reported-by: Dave Jones <davej@codemonkey.org.uk>
Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Dave Jones <davej@codemonkey.org.uk>
Tested-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: https://lkml.kernel.org/r/20200707102957.GN117543@hirez.programming.kicks-ass.net
---
 include/linux/sched.h |  4 +---
 kernel/sched/core.c   | 67 +++++++++++++++++++++++++++++++-----------
 2 files changed, 51 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 692e327..6833729 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -114,10 +114,6 @@ struct task_group;
 
 #define task_is_stopped_or_traced(task)	((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
 
-#define task_contributes_to_load(task)	((task->state & TASK_UNINTERRUPTIBLE) != 0 && \
-					 (task->flags & PF_FROZEN) == 0 && \
-					 (task->state & TASK_NOLOAD) == 0)
-
 #ifdef CONFIG_DEBUG_ATOMIC_SLEEP
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca5db40..950ac45 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1311,9 +1311,6 @@ static inline void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
 
 void activate_task(struct rq *rq, struct task_struct *p, int flags)
 {
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible--;
-
 	enqueue_task(rq, p, flags);
 
 	p->on_rq = TASK_ON_RQ_QUEUED;
@@ -1323,9 +1320,6 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
 {
 	p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
 
-	if (task_contributes_to_load(p))
-		rq->nr_uninterruptible++;
-
 	dequeue_task(rq, p, flags);
 }
 
@@ -2236,10 +2230,10 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
 
 	lockdep_assert_held(&rq->lock);
 
-#ifdef CONFIG_SMP
 	if (p->sched_contributes_to_load)
 		rq->nr_uninterruptible--;
 
+#ifdef CONFIG_SMP
 	if (wake_flags & WF_MIGRATED)
 		en_flags |= ENQUEUE_MIGRATED;
 #endif
@@ -2583,7 +2577,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 * A similar smb_rmb() lives in try_invoke_on_locked_down_task().
 	 */
 	smp_rmb();
-	if (p->on_rq && ttwu_remote(p, wake_flags))
+	if (READ_ONCE(p->on_rq) && ttwu_remote(p, wake_flags))
 		goto unlock;
 
 	if (p->in_iowait) {
@@ -2592,9 +2586,6 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	}
 
 #ifdef CONFIG_SMP
-	p->sched_contributes_to_load = !!task_contributes_to_load(p);
-	p->state = TASK_WAKING;
-
 	/*
 	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
 	 * possible to, falsely, observe p->on_cpu == 0.
@@ -2613,8 +2604,20 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
 	 *
 	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
 	 * __schedule().  See the comment for smp_mb__after_spinlock().
+	 *
+	 * Form a control-dep-acquire with p->on_rq == 0 above, to ensure
+	 * schedule()'s deactivate_task() has 'happened' and p will no longer
+	 * care about it's own p->state. See the comment in __schedule().
 	 */
-	smp_rmb();
+	smp_acquire__after_ctrl_dep();
+
+	/*
+	 * We're doing the wakeup (@success == 1), they did a dequeue (p->on_rq
+	 * == 0), which means we need to do an enqueue, change p->state to
+	 * TASK_WAKING such that we can unlock p->pi_lock before doing the
+	 * enqueue, such as ttwu_queue_wakelist().
+	 */
+	p->state = TASK_WAKING;
 
 	/*
 	 * If the owning (remote) CPU is still in the middle of schedule() with
@@ -4097,6 +4100,7 @@ static void __sched notrace __schedule(bool preempt)
 {
 	struct task_struct *prev, *next;
 	unsigned long *switch_count;
+	unsigned long prev_state;
 	struct rq_flags rf;
 	struct rq *rq;
 	int cpu;
@@ -4113,12 +4117,22 @@ static void __sched notrace __schedule(bool preempt)
 	local_irq_disable();
 	rcu_note_context_switch(preempt);
 
+	/* See deactivate_task() below. */
+	prev_state = prev->state;
+
 	/*
 	 * Make sure that signal_pending_state()->signal_pending() below
 	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
-	 * done by the caller to avoid the race with signal_wake_up().
+	 * done by the caller to avoid the race with signal_wake_up():
+	 *
+	 * __set_current_state(@state)		signal_wake_up()
+	 * schedule()				  set_tsk_thread_flag(p, TIF_SIGPENDING)
+	 *					  wake_up_state(p, state)
+	 *   LOCK rq->lock			    LOCK p->pi_state
+	 *   smp_mb__after_spinlock()		    smp_mb__after_spinlock()
+	 *     if (signal_pending_state())	    if (p->state & @state)
 	 *
-	 * The membarrier system call requires a full memory barrier
+	 * Also, the membarrier system call requires a full memory barrier
 	 * after coming from user-space, before storing to rq->curr.
 	 */
 	rq_lock(rq, &rf);
@@ -4129,10 +4143,31 @@ static void __sched notrace __schedule(bool preempt)
 	update_rq_clock(rq);
 
 	switch_count = &prev->nivcsw;
-	if (!preempt && prev->state) {
-		if (signal_pending_state(prev->state, prev)) {
+	/*
+	 * We must re-load prev->state in case ttwu_remote() changed it
+	 * before we acquired rq->lock.
+	 */
+	if (!preempt && prev_state && prev_state == prev->state) {
+		if (signal_pending_state(prev_state, prev)) {
 			prev->state = TASK_RUNNING;
 		} else {
+			prev->sched_contributes_to_load =
+				(prev_state & TASK_UNINTERRUPTIBLE) &&
+				!(prev_state & TASK_NOLOAD) &&
+				!(prev->flags & PF_FROZEN);
+
+			if (prev->sched_contributes_to_load)
+				rq->nr_uninterruptible++;
+
+			/*
+			 * __schedule()			ttwu()
+			 *   prev_state = prev->state;	  if (READ_ONCE(p->on_rq) && ...)
+			 *   LOCK rq->lock		    goto out;
+			 *   smp_mb__after_spinlock();	  smp_acquire__after_ctrl_dep();
+			 *   p->on_rq = 0;		  p->state = TASK_WAKING;
+			 *
+			 * After this, schedule() must not care about p->state any more.
+			 */
 			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
 
 			if (prev->in_iowait) {

^ permalink raw reply related	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2020-07-08  9:46 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-02 17:15 weird loadavg on idle machine post 5.7 Dave Jones
2020-07-02 19:46 ` Dave Jones
2020-07-02 21:15 ` Paul Gortmaker
2020-07-03 13:23   ` Paul Gortmaker
2020-07-02 21:36 ` Mel Gorman
2020-07-02 23:11   ` Michal Kubecek
2020-07-02 23:24   ` Dave Jones
2020-07-03  9:02   ` Peter Zijlstra
2020-07-03 10:40     ` Peter Zijlstra
2020-07-03 20:51       ` Dave Jones
2020-07-06 14:59         ` Peter Zijlstra
2020-07-06 21:20           ` Dave Jones
2020-07-07  7:48             ` Peter Zijlstra
2020-07-06 23:56           ` Valentin Schneider
2020-07-07  8:17             ` Peter Zijlstra
2020-07-07 10:20               ` Valentin Schneider
2020-07-07 10:29               ` Peter Zijlstra
2020-07-08  9:46                 ` [tip: sched/urgent] sched: Fix loadavg accounting race tip-bot2 for Peter Zijlstra
2020-07-07  9:20           ` weird loadavg on idle machine post 5.7 Qais Yousef
2020-07-07  9:47             ` Peter Zijlstra
