From: Hillf Danton
To: Frederic Weisbecker
Cc: Thomas Gleixner, Yu Liao, fweisbec@gmail.com, mingo@kernel.org, liwei391@huawei.com, adobriyan@gmail.com, mirsad.todorovac@alu.unizg.hr, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Peter Zijlstra
Subject: Re: [PATCH RFC] tick/nohz: fix data races in get_cpu_idle_time_us()
Date: Wed, 1 Feb 2023 22:01:17 +0800
Message-Id: <20230201140117.539-1-hdanton@sina.com>
References: <20230128020051.2328465-1-liaoyu15@huawei.com> <20230201045302.316-1-hdanton@sina.com>

On Wed, 1 Feb 2023 13:02:41 +0100 Frederic Weisbecker
> On Wed, Feb 01, 2023 at 12:53:02PM +0800, Hillf Danton wrote:
> > On Tue, 31 Jan 2023 15:44:00 +0100 Thomas Gleixner
> > >
> > > Seriously this procfs accuracy is the least of the problems and if this
> > > would be the only issue then we could trivially fix it by declaring that
> > > the procfs output might go backwards. It's an estimate after all. If
> > > there would be a real reason to ensure monotonicity there then we could
> > > easily do that in the readout code.
> > >
> > > But the real issue is that both get_cpu_idle_time_us() and
> > > get_cpu_iowait_time_us() can invoke update_ts_time_stats() which is way
> > > worse than the above procfs idle time going backwards.
> > >
> > > If update_ts_time_stats() is invoked concurrently for the same CPU then
> > > ts->idle_sleeptime and ts->iowait_sleeptime are turning into random
> > > numbers.
> > >
> > > This has been broken 12 years ago in commit 595aac488b54 ("sched:
> > > Introduce a function to update the idle statistics").
> >
> > [...]
> >
> > >
> > > P.S.: I hate the spinlock in the idle code path, but I don't have a
> > > better idea.
> >
> > Provided the percpu rule is enforced, the random numbers mentioned above
> > could be erased without another spinlock added.
> >
> > Hillf
> > +++ b/kernel/time/tick-sched.c
> > @@ -640,13 +640,26 @@ static void tick_nohz_update_jiffies(kti
> >  /*
> >   * Updates the per-CPU time idle statistics counters
> >   */
> > -static void
> > -update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
> > +static u64 update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now,
> > +				int io, u64 *last_update_time)
> >  {
> >  	ktime_t delta;
> >
> > +	if (last_update_time)
> > +		*last_update_time = ktime_to_us(now);
> > +
> >  	if (ts->idle_active) {
> >  		delta = ktime_sub(now, ts->idle_entrytime);
> > +
> > +		/* update is only expected on the local CPU */
> > +		if (cpu != smp_processor_id()) {
>
> Why not just updating it only on idle exit then?

This aligns to idle exit as much as it can by disallowing remote update.

>
> > +			if (io)
>
> I fear it's not up to the caller to decides if the idle time is IO or not.

Could you specify a bit on your concern, given the callers of this function?
>
> > +				delta = ktime_add(ts->iowait_sleeptime, delta);
> > +			else
> > +				delta = ktime_add(ts->idle_sleeptime, delta);
> > +			return ktime_to_us(delta);

Based on the above comments, I guess you missed this line, which prevents
get_cpu_idle_time_us() and get_cpu_iowait_time_us() from updating ts.

> > +		}
> > +
> >  		if (nr_iowait_cpu(cpu) > 0)
> >  			ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
> >  		else
>
> But you kept the old update above.
>
> So if this is not the local CPU, what do you do?
>
> You'd need to return (without updating iowait_sleeptime):
>
>       ts->idle_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
> Right?

Yes, the diff goes as you suggest.

> But then you may race with the local updater, risking to return
> the delta added twice. So you need at least a seqcount.

Add seqcount if needed. No problem.

>
> But in the end, nr_iowait_cpu() is broken because that counter can be
> decremented remotely and so the whole thing is beyond repair:
>
> CPU 0                       CPU 1                          CPU 2
> -----                       -----                          ------
> //io_schedule() TASK A
> current->in_iowait = 1
> rq(0)->nr_iowait++
> //switch to idle
>                             // READ /proc/stat
>                             // See nr_iowait_cpu(0) == 1
>                             return ts->iowait_sleeptime + ktime_sub(ktime_get(), ts->idle_entrytime)
>
>                                                            //try_to_wake_up(TASK A)
>                                                            rq(0)->nr_iowait--
> //idle exit
> // See nr_iowait_cpu(0) == 0
> ts->idle_sleeptime += ktime_sub(ktime_get(), ts->idle_entrytime)

Ah, I see your point.

The diff disallows remotely updating ts, and it is updated in idle exit
after my proposal, so what nr_iowait_cpu() breaks is mitigated.

Thanks for taking a look, particularly the race linked to nr_iowait_cpu().

Hillf
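
For illustration only, here is a minimal userspace sketch of the seqcount idea discussed above: the owning CPU is the only writer, and a remote reader computes idle_sleeptime plus the in-flight delta, retrying if it overlapped an update, so the delta cannot be counted twice. It uses C11 atomics rather than the kernel's seqcount_t, and struct idle_stats, idle_enter(), idle_exit() and read_idle_time() are hypothetical names, not code from tick-sched.c. It does not address the remote nr_iowait decrement shown in the timeline above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the idle fields of struct tick_sched. */
struct idle_stats {
	atomic_uint seq;                 /* odd while the owner is mid-update */
	atomic_bool idle_active;
	_Atomic uint64_t idle_entrytime; /* microseconds */
	_Atomic uint64_t idle_sleeptime; /* accumulated idle microseconds */
};

/* Owner side: open a write section (sequence becomes odd). */
static void stats_write_begin(struct idle_stats *st)
{
	unsigned int s = atomic_load_explicit(&st->seq, memory_order_relaxed);

	atomic_store_explicit(&st->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
}

/* Owner side: close the write section (sequence becomes even again). */
static void stats_write_end(struct idle_stats *st)
{
	unsigned int s = atomic_load_explicit(&st->seq, memory_order_relaxed);

	atomic_store_explicit(&st->seq, s + 1, memory_order_release);
}

/* Called only by the owning CPU when it goes idle. */
static void idle_enter(struct idle_stats *st, uint64_t now)
{
	stats_write_begin(st);
	atomic_store_explicit(&st->idle_entrytime, now, memory_order_relaxed);
	atomic_store_explicit(&st->idle_active, true, memory_order_relaxed);
	stats_write_end(st);
}

/* Called only by the owning CPU on idle exit: the single place that accumulates. */
static void idle_exit(struct idle_stats *st, uint64_t now)
{
	uint64_t entry, total;

	assert(atomic_load_explicit(&st->idle_active, memory_order_relaxed));
	stats_write_begin(st);
	entry = atomic_load_explicit(&st->idle_entrytime, memory_order_relaxed);
	total = atomic_load_explicit(&st->idle_sleeptime, memory_order_relaxed);
	atomic_store_explicit(&st->idle_sleeptime, total + (now - entry),
			      memory_order_relaxed);
	atomic_store_explicit(&st->idle_active, false, memory_order_relaxed);
	stats_write_end(st);
}

/* May be called from any thread: never writes to *st, retries on overlap. */
static uint64_t read_idle_time(struct idle_stats *st, uint64_t now)
{
	unsigned int s1, s2;
	uint64_t total, entry;

	do {
		s1 = atomic_load_explicit(&st->seq, memory_order_acquire);
		total = atomic_load_explicit(&st->idle_sleeptime, memory_order_relaxed);
		if (atomic_load_explicit(&st->idle_active, memory_order_relaxed)) {
			entry = atomic_load_explicit(&st->idle_entrytime,
						     memory_order_relaxed);
			if (now > entry) /* reader's clock sample may predate entry */
				total += now - entry;
		}
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&st->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);

	return total;
}
```

The reader only ever sees a consistent (idle_active, idle_entrytime, idle_sleeptime) triple, which is what rules out the "delta added twice" case; whether the interval should be booked as idle or iowait is a separate question this sketch leaves open.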