linux-kernel.vger.kernel.org archive mirror
From: Libo Chen <libo.chen@oracle.com>
To: Tim Chen <tim.c.chen@linux.intel.com>,
	peterz@infradead.org, vincent.guittot@linaro.org,
	mgorman@suse.de, 21cnbao@gmail.com, dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org, tglx@linutronix.de
Subject: Re: [PATCH] sched/fair: no sync wakeup from interrupt context
Date: Wed, 13 Jul 2022 14:37:20 -0700
Message-ID: <fdcc81b3-3d60-d090-de25-e2583d43daa4@oracle.com>
In-Reply-To: <ff27aa3bce81a4f9cdf9e71b989af7db5b0fa44a.camel@linux.intel.com>



On 7/13/22 13:51, Tim Chen wrote:
> On Wed, 2022-07-13 at 12:17 -0700, Libo Chen wrote:
>>
>> On 7/13/22 09:40, Tim Chen wrote:
>>> On Mon, 2022-07-11 at 15:47 -0700, Libo Chen wrote:
>>>> Barry Song first pointed out that replacing sync wakeup with regular wakeup
>>>> seems to reduce overeager wakeup pulling and shows noticeable performance
>>>> improvement.[1]
>>>>
>>>> This patch argues that allowing sync for wakeups from interrupt context
>>>> is a bug and fixing it can improve performance even when irq/softirq is
>>>> evenly spread out.
>>>>
>>>> For wakeups from ISR, the waking CPU is just the CPU of ISR and the so-called
>>>> waker can be any random task that happens to be running on that CPU when the
>>>> interrupt comes in. This is completely different from other wakeups where the
>>>> task running on the waking CPU is the actual waker. For example, two tasks
>>>> communicate through a pipe or multiple tasks access the same critical section,
>>>> etc. This difference is important because with sync we assume the waker will
>>>> get off the runqueue and go to sleep immediately after the wakeup. The
>>>> assumption is built into wake_affine() where it discounts the waker's presence
>>>> from the runqueue when sync is true. The random waker from interrupts bears no
>>>> relation to the wakee and doesn't usually go to sleep immediately afterwards
>>>> unless wakeup granularity is reached. Plus the scheduler no longer enforces the
>>>> preemption of waker for sync wakeup as it used to before
>>>> patch f2e74eeac03ffb7 ("sched: Remove WAKEUP_SYNC feature"). Enforcing sync
>>>> wakeup preemption for wakeups from interrupt contexts doesn't seem to be
>>>> appropriate either but at least sync wakeup will do what it's supposed to do.
>>> Will there be scenarios where you do want the task being woken up to be pulled
>>> to the CPU where the interrupt happened, as the data that needs to be accessed is
>>> on the local CPU/NUMA node where the interrupt happened?  For example, an interrupt associated with
>>> received network packets.  Sync still seems desirable, at least if there is no task currently
>>> running on the CPU where the interrupt happened.  So maybe we should have some consideration
>>> of the load on the CPU/NUMA node before deciding whether we should do a sync wake for such
>>> an interrupt.
>>>
>>   There are only two places where sync wakeup matters: wake_affine_idle() and wake_affine_weight().
>> In wake_affine_idle(), it considers pulling if there is one runnable task on the waking CPU because
>> of the belief that this task will voluntarily get off the runqueue. In wake_affine_weight(),
>> it basically takes off the waker's load again, assuming the waker goes to sleep after the wakeup.
>> My argument is that this assumption doesn't really hold for wakeups from interrupt context
>> when the waking CPU is non-idle. Wakeups from task context? Sure, it seems to be a reasonable
>> assumption.
> I agree with you that the sync case load computation for wake_affine_idle()
> and wake_affine_weight() is incorrect when waking a task from the interrupt context.
> In this light, your proposal makes sense.
>
>> For your idle case, I totally agree but I don't think having sync or not will actually
>> have any impact here given what the code does. The real impact comes from Mel's patch 7332dec055f2457c3
>> which makes it less likely to pull tasks when the waking CPU is idle. I believe we should consider
>> reverting 7332dec055f2 because a significant RDS latency regression has been spotted recently on our
>> system due to this patch.
> The commit 7332dec055f2 prevented cross NUMA node pulling.  I think if the
> average load of the waking CPU's NUMA node is less than that of the prev CPU's
> NUMA node, this cross NUMA node pull could be allowed for better load distribution.
Yeah, we should rewrite wake_affine_weight() so that it compares the average
loads of the two nodes instead of the two rq loads.
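
To make the node-level comparison concrete, here is a minimal userspace
sketch, not kernel code. The names struct node_model, node_avg_load() and
allow_cross_node_pull() are hypothetical and exist only for illustration;
the real wake_affine_weight() works on runqueue load and capacity and would
need more care than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a NUMA node: one load value per CPU. */
struct node_model {
	const unsigned long *cpu_load;	/* per-CPU runqueue load */
	size_t nr_cpus;			/* number of entries in cpu_load */
};

/* Average load across all CPUs of a node; 0 for an empty node. */
static unsigned long node_avg_load(const struct node_model *n)
{
	unsigned long sum = 0;
	size_t i;

	assert(n != NULL);
	if (n->nr_cpus == 0)
		return 0;
	for (i = 0; i < n->nr_cpus; i++)
		sum += n->cpu_load[i];
	return sum / n->nr_cpus;
}

/*
 * Allow a cross-node pull only when the waking CPU's node is, on average,
 * less loaded than the node of the task's previous CPU.
 */
static bool allow_cross_node_pull(const struct node_model *this_node,
				  const struct node_model *prev_node)
{
	return node_avg_load(this_node) < node_avg_load(prev_node);
}
```

With per-CPU loads of {100, 300} on the waking node and {400, 600} on the
previous node, the averages are 200 and 500, so the pull would be allowed in
that direction and refused in the other.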


Libo

>>> Can you provide some further insights on why pgbench is helped in the low load
>>> case?  Is it because the woken tasks tend to stay put and not get moved around with interrupts
>>> and maintain cache warmth?
>>   Yes, and for read-only database workloads, the cache (whether it's an incoming packet or not) on the interrupt
>> CPU isn't as performance critical as the cache of the previous CPU where the db task ran to process data.
>> To give you an example, consider a db client/server case: a client sends a request for a select query
>> through the network, the server accepts the query request, does all the heavy lifting and sends the result
>> back. For the server, the incoming packet is just a line of query whereas the CPU the db process previously
>> ran on, and its L3, have all the warm db caches; pulling it away from them is a crime :)  This may seem to be a little contradictory
>> to what I said earlier about the idle case and Mel's patch, but ¯\_(ツ)_/¯ it's hard to make all the workloads out
>> there happy. Anyway, like I said earlier, this patch doesn't affect the idle case.
>>
>> At higher load, sync in wake_affine_idle() doesn't really matter because the waking CPU could easily have more than
>> 1 runnable task. Sync in wake_affine_weight() also doesn't matter much as both sides have work to do, and the cache
>> benefit of not pulling decreases simply because there are a lot more db processes under the same L3, and they can compete
>> for the same cachelines.
>>
>> Hope my explanation helps!
> Yes, that makes sense.
>
> Tim
>
>
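
The two places where sync matters, as described in this thread, can be
modelled with a minimal userspace sketch. This is not the kernel code:
struct rq_model and the three helpers are hypothetical names that only
mirror the behaviour discussed above, namely that sync makes a lone
runnable task look like an idle CPU and removes the waker's load from the
comparison, and that the proposal is to ignore sync outside task context.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the waking CPU's runqueue. */
struct rq_model {
	unsigned int nr_running;	/* runnable tasks, including the current one */
	unsigned long load;		/* total runqueue load */
	unsigned long curr_load;	/* load contributed by the current task */
};

/* Proposed rule: honour the sync hint only for wakeups from task context. */
static bool effective_sync(bool wf_sync, bool from_task_context)
{
	return wf_sync && from_task_context;
}

/*
 * wake_affine_idle()-style check: with sync, a single runnable task is
 * assumed to go to sleep right away, so the CPU is treated as idle.
 */
static bool treat_as_idle(const struct rq_model *rq, bool sync)
{
	if (rq->nr_running == 0)
		return true;
	return sync && rq->nr_running == 1;
}

/*
 * wake_affine_weight()-style load: with sync, the waker's own load is
 * discounted because the waker is expected to leave the runqueue.
 */
static unsigned long effective_load(const struct rq_model *rq, bool sync)
{
	assert(rq->curr_load <= rq->load);
	return sync ? rq->load - rq->curr_load : rq->load;
}
```

For a runqueue holding one task with load 1024, a sync wakeup from task
context sees an idle-looking CPU with an effective load of 0, while the same
wakeup from interrupt context sees a busy CPU with the full load of 1024,
because the interrupted task is not expected to sleep.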


Thread overview: 21+ messages
2022-07-11 22:47 [PATCH] sched/fair: no sync wakeup from interrupt context Libo Chen
2022-07-13 16:40 ` Tim Chen
     [not found]   ` <0917f479-b6aa-19de-3d6a-6fd422df4d21@oracle.com>
2022-07-13 19:34     ` Libo Chen
2022-07-13 20:51     ` Tim Chen
2022-07-13 21:37       ` Libo Chen [this message]
2022-07-14 14:18     ` Mel Gorman
2022-07-14 20:21       ` Libo Chen
2022-07-15 10:07         ` Mel Gorman
2022-07-14 11:26 ` Peter Zijlstra
2022-07-14 11:35 ` Peter Zijlstra
2022-07-14 18:18   ` Libo Chen
2022-07-21  8:44 ` [tip: sched/core] sched/fair: Disallow " tip-bot2 for Libo Chen
2022-07-29  4:47 ` [PATCH] sched/fair: no " K Prateek Nayak
2022-08-01 13:26   ` Ingo Molnar
2022-08-01 14:59     ` Libo Chen
2022-08-03  9:18       ` Ingo Molnar
2022-08-03 19:37         ` Libo Chen
2022-08-04  8:55           ` Ingo Molnar
2022-08-04  9:51             ` Mel Gorman
2022-08-01 14:57   ` Libo Chen
2022-08-02  4:40     ` K Prateek Nayak
