All of lore.kernel.org
 help / color / mirror / Atom feed
From: Parth Shah <parth@linux.ibm.com>
To: Pavan Kondeti <pkondeti@codeaurora.org>
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
	mingo@redhat.com, vincent.guittot@linaro.org,
	dietmar.eggemann@arm.com, qais.yousef@arm.com,
	chris.hyser@oracle.com, valentin.schneider@arm.com,
	rjw@rjwysocki.net
Subject: Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points
Date: Tue, 12 May 2020 13:21:59 +0530	[thread overview]
Message-ID: <4d05f8c0-250b-bc1e-360f-18dfa197064c@linux.ibm.com> (raw)
In-Reply-To: <20200509023915.GN19464@codeaurora.org>



On 5/9/20 8:09 AM, Pavan Kondeti wrote:
> On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
>> Hi Pavan,
>>
>> Thanks for going through this patch-set.
>>
>> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
>>> Hi Parth,
>>>
>>> On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
>>>> Monitor tasks at:
>>>> 1. wake_up_new_task() - forked tasks
>>>>
>>>> 2. set_task_cpu() - task migrations, Load balancer
>>>>
>>>> 3. __sched_setscheduler() - set/unset latency_nice value
>>>> Increment the nr_lat_sensitive count on the CPU with task marked with
>>>> latency_nice == -20.
>>>> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
>>>> with >-20 latency_nice task.
>>>>
>>>> 4. finish_task_switch() - dying task
>>>>
>>>
>>>
>>>> Signed-off-by: Parth Shah <parth@linux.ibm.com>
>>>> ---
>>>>  kernel/sched/core.c  | 30 ++++++++++++++++++++++++++++--
>>>>  kernel/sched/sched.h |  5 +++++
>>>>  2 files changed, 33 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>>> index 2d8b76f41d61..ad396c36eba6 100644
>>>> --- a/kernel/sched/core.c
>>>> +++ b/kernel/sched/core.c
>>>> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>>>  	trace_sched_migrate_task(p, new_cpu);
>>>>  
>>>>  	if (task_cpu(p) != new_cpu) {
>>>> +		if (task_is_lat_sensitive(p)) {
>>>> +			per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> +			per_cpu(nr_lat_sensitive, new_cpu)++;
>>>> +		}
>>>> +
>>>
>>> Since we can come here without rq locks, there is a possibility
>>> of a race and incorrect updates can happen. Since the counters
>>> are used to prevent C-states, we don't want that to happen.
>>
>> I did tried using task_lock(p) wherever we do change refcount and when
>> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
>> type.
>>
>> After lots of thinking to optimize it and thinking that we anyways hold rq
>> lock, I thought of not using any lock here and see if scheduler community
>> has well known solution for this :-)
>>
>> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changin
>> refcount should solve problem, right?
>>
>> If you or anyone have solution for this kind of pattern, then that surely
>> will be helpful.
>>
> I am not sure if task_lock() can help here, because we are operating the
> counter on per CPU basis here. May be cmpxchg based loop works here to make
> sure that increment/decrement operation happens atomically here.
> 
>>>
>>>>  		if (p->sched_class->migrate_task_rq)
>>>>  			p->sched_class->migrate_task_rq(p, new_cpu);
>>>>  		p->se.nr_migrations++;
> 
> [...]
> 
>>>> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct task_struct *p,
>>>>  	p->normal_prio = normal_prio(p);
>>>>  	set_load_weight(p, true);
>>>>  
>>>> -	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
>>>> +	if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
>>>> +		if (p->state != TASK_DEAD &&
>>>> +		    attr->sched_latency_nice != p->latency_nice) {
>>>> +			if (attr->sched_latency_nice == MIN_LATENCY_NICE)
>>>> +				per_cpu(nr_lat_sensitive, task_cpu(p))++;
>>>> +			else if (task_is_lat_sensitive(p))
>>>> +				per_cpu(nr_lat_sensitive, task_cpu(p))--;
>>>> +		}
>>>> +
>>>>  		p->latency_nice = attr->sched_latency_nice;
>>>> +	}
>>>>  }
>>>
>>> There is a potential race here due to which we can mess up the refcount.
>>>
>>> - A latency sensitive task is marked TASK_DEAD
>>> <snip>
>>> - sched_setattr() called on the task to clear the latency nice. Since
>>> we check the task state here, we skip the decrement.
>>> - The task is finally context switched out and we skip the decrement again
>>> since it is not a latency senstivie task.
>>
>> if task is already marked TASK_DEAD then we should have already decremented
>> its refcount in finish_task_switch().
>> am I missing something?
> 
> There is a window (context switch and dropping rq lock) between
> marking a task DEAD (in do_task_dead()) and dropping the ref counter
> (in finish_task_switch()) during which we can run into here and skip
> the checking because task is marked as DEAD.
> 

Yeah, TASK_DEAD seems to be genuine race conditions. At every other places
we do hold task_rq_lock() except when the task is dying. There is a window
between do_task_dead() and finish_task_switch() which may create race
condition, so if marking TASK_DEAD is protected under task_rq_lock() then
this can be prevented. I will have to look at it more thoroughly at the
code and figure out a way to protect the refcount under such circumstances.


Thanks,
Parth

  reply	other threads:[~2020-05-12  7:52 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-07 13:37 [RFC 0/4] IDLE gating in presence of latency-sensitive tasks Parth Shah
2020-05-07 13:37 ` [RFC 1/4] sched/core: Introduce per_cpu counter to track latency sensitive tasks Parth Shah
2020-05-08  8:40   ` Pavan Kondeti
2020-05-08 11:30     ` Parth Shah
2020-05-09  2:14       ` Pavan Kondeti
2020-05-07 13:37 ` [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points Parth Shah
2020-05-08  8:33   ` Pavan Kondeti
2020-05-08 11:15     ` Parth Shah
2020-05-09  2:39       ` Pavan Kondeti
2020-05-12  7:51         ` Parth Shah [this message]
2020-05-07 13:37 ` [RFC 3/4] sched/idle: Disable idle call on least latency requirements Parth Shah
2020-05-08  8:36   ` Pavan Kondeti
2020-05-08 11:19     ` Parth Shah
2020-05-09  2:18       ` Pavan Kondeti
2020-05-07 13:37 ` [RFC 4/4] sched/idle: Add debugging bits to validate inconsistency in latency sensitive task calculations Parth Shah

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4d05f8c0-250b-bc1e-360f-18dfa197064c@linux.ibm.com \
    --to=parth@linux.ibm.com \
    --cc=chris.hyser@oracle.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pkondeti@codeaurora.org \
    --cc=qais.yousef@arm.com \
    --cc=rjw@rjwysocki.net \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.