From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753156AbdJGDe4 (ORCPT ); Fri, 6 Oct 2017 23:34:56 -0400 Received: from szxga04-in.huawei.com ([45.249.212.190]:7503 "EHLO szxga04-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752725AbdJGDey (ORCPT ); Fri, 6 Oct 2017 23:34:54 -0400 Message-ID: <59D84A4B.70900@huawei.com> Date: Sat, 7 Oct 2017 11:30:19 +0800 From: zhouchengming User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 MIME-Version: 1.0 To: Steven Rostedt , CC: , , Subject: Re: [PATCH] sched/rt.c: pick and check task if double_lock_balance() unlock the rq References: <1505112709-102019-1-git-send-email-zhouchengming1@huawei.com> <20170925154057.191e3fd1@vmware.local.home> <59C9AC08.9000302@huawei.com> <20170925231843.140d3a21@vmware.local.home> In-Reply-To: <20170925231843.140d3a21@vmware.local.home> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [10.177.236.183] X-CFilter-Loop: Reflected X-Mirapoint-Virus-RAPID-Raw: score=unknown(0), refid=str=0001.0A010206.59D84B30.0022,ss=1,re=0.000,recu=0.000,reip=0.000,cl=1,cld=1,fgs=0, ip=0.0.0.0, so=2014-11-16 11:51:01, dmn=2013-03-21 17:37:32 X-Mirapoint-Loop-Id: 165376affd61c9d1c1eec407f1340da5 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Steven, Peter, On 2017/9/26 11:18, Steven Rostedt wrote: > On Tue, 26 Sep 2017 09:23:20 +0800 > zhouchengming wrote: > >> On 2017/9/26 3:40, Steven Rostedt wrote: >>> On Mon, 11 Sep 2017 14:51:49 +0800 >>> Zhou Chengming wrote: >>> >>>> push_rt_task() pick the first pushable task and find an eligible >>>> lowest_rq, then double_lock_balance(rq, lowest_rq). So if >>>> double_lock_balance() unlock the rq (when double_lock_balance() return 1), >>>> we have to check if this task is still on the rq. >>>> >>>> The problem is that the check conditions are not sufficient: >>>> >>>> if (unlikely(task_rq(task) != rq || >>>> !cpumask_test_cpu(lowest_rq->cpu,&task->cpus_allowed) || >>>> task_running(rq, task) || >>>> !rt_task(task) || >>>> !task_on_rq_queued(task))) { >>>> >>>> cpu2 cpu1 cpu0 >>>> push_rt_task(rq1) >>>> pick task_A on rq1 >>>> find rq0 >>>> double_lock_balance(rq1, rq0) >>>> unlock(rq1) >>>> rq1 __schedule >>>> pick task_A run >>>> task_A sleep (dequeued) >>>> lock(rq0) >>>> lock(rq1) >>>> do_above_check(task_A) >>>> task_rq(task_A) == rq1 >>>> cpus_allowed unchanged >>>> task_running == false >>>> rt_task(task_A) == true >>>> try_to_wake_up(task_A) >>>> select_cpu = cpu3 >>>> enqueue(rq3, task_A) >>> How can this happen? The try_to_wake_up(task_A) needs to grab the rq >>> that task A is on, and we have that rq lock. >>> >>> /me confused. >>> >>> -- Steve >> Thanks for the reply! >> After the task_A sleep on cpu1, the try_to_wake_up(task_A) on cpu0 select a different cpu3, >> so it will grab the rq3 lock, not the rq1 lock. > Ah crap. This is caused by 7608dec2ce20 ("sched: Drop the rq argument > to sched_class::select_task_rq()"). Because this code depends on > try_to_wake_up() grabbing the task's rq lock. But it no longer does > that, and it causes this race. > > OK, I need to look at this deeper when I'm not so jetlagged and typing > this because I can't sleep at 5am. > > Thanks for pointing this out! > > It may be fixed by simply grabbing the run queue lock on migration, as > that would sync things up. Is there any new solution? I don't think grabbing the rq lock without the task->pi_lock will fix this problem. And I think my patch is correct and the changes are small. Thanks! > Peter? > > > -- Steve > > > > . >