Date: Mon, 30 Jan 2017 16:38:08 +0100
From: Frederic Weisbecker
To: Stanislaw Gruszka
Cc: Peter Zijlstra, LKML, Tony Luck, Wanpeng Li, Michael Ellerman,
	Heiko Carstens, Benjamin Herrenschmidt, Thomas Gleixner,
	Paul Mackerras, Ingo Molnar, Fenghua Yu, Rik van Riel,
	Martin Schwidefsky
Subject: Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs
Message-ID: <20170130153807.GB16945@lerouge>
References: <1485109213-8561-1-git-send-email-fweisbec@gmail.com>
	<1485109213-8561-9-git-send-email-fweisbec@gmail.com>
	<20170128115740.GA688@redhat.com>
	<20170128152810.GA16568@lerouge>
	<20170130135451.GB2669@redhat.com>
In-Reply-To: <20170130135451.GB2669@redhat.com>

On Mon, Jan 30, 2017 at 02:54:51PM +0100, Stanislaw Gruszka wrote:
> On Sat, Jan 28, 2017 at 04:28:13PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> > > On 32 bit architectures a 64bit store/load is not atomic and, if not
> > > protected, 64bit variables can be mangled. I do not see any protection
> > > (lock) between the utime/stime store and load in the patch, and it
> > > seems that the {u,s}time store/load can be performed at the same time.
> > > Though the problem is very improbable, it can still happen, at least
> > > theoretically, when the lower and upper 32 bits change at the same
> > > time, i.e. the process {u,s}time comes close to a multiple of 2**32
> > > nsec (approx. 4 sec) and the 64bit {u,s}time is stored and loaded at
> > > the same time on different cpus. As said, this is a very improbable
> > > situation, but it could eventually happen with long lived processes.
> >
> > "Improbable situation" doesn't apply to Linux. With millions (billions?)
> > of machines using it, a rare issue in the core turns into one likely to
> > happen somewhere on the planet every second.
> >
> > So it's definitely a race we want to consider. Note it goes beyond the
> > scope of this patchset, as the issue was already there before: cputime_t
> > can already map to u64 on 32 bit systems upstream. But this patchset
> > definitely extends the issue to all 32 bit configs.
> >
> > kcpustat has the same issue upstream. It is made of u64 on all configs.
>
> I would like to add the possible consequences if a value gets mangled.
> For sum_exec_runtime, utime and stime we could get wrong values from
> cpu-clock related syscalls like clock_gettime() or clock_nanosleep(),
> and cpu-clock timers like timer_create(CLOCK_PROCESS_CPUTIME_ID) can
> trigger before or long after the expected time. For kcpustat this means
> wrong values read via procfs and by 3 drivers (cpufreq, appldata,
> macintosh).

Yep, all agreed.
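For anyone who wants to see the race concretely, here is a minimal
userspace demo of such a torn 64-bit read (the names, the increment step
and the build line are illustrative only, not taken from the patchset):

	/*
	 * Hypothetical userspace demo (not kernel code) of a torn 64-bit
	 * read.  Build as 32-bit code: gcc -m32 -O2 -pthread torn64.c
	 */
	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>

	static volatile uint64_t counter;	/* plain 64-bit var, no locking */

	static void *writer(void *unused)
	{
		(void)unused;
		for (;;)
			/*
			 * On a 32-bit target this update is two 32-bit stores
			 * (low word, high word), so a concurrent reader can
			 * observe one old half and one new half.  This step
			 * crosses a 2^32 boundary every other increment.
			 */
			counter += UINT64_C(0x80000000);
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;
		uint64_t prev = 0, cur;
		unsigned long i;

		pthread_create(&tid, NULL, writer, NULL);

		for (i = 0; i < 100000000UL; i++) {
			cur = counter;	/* 64-bit load: may tear on 32-bit */
			/*
			 * The writer only ever increments, so a sample that
			 * goes backwards proves one of the reads was mangled.
			 */
			if (cur < prev)
				printf("torn read: %llu after %llu\n",
				       (unsigned long long)cur,
				       (unsigned long long)prev);
			prev = cur;
		}
		return 0;
	}

On a 64-bit build the load and store are single instructions and nothing
prints; on a 32-bit build the reader can catch a half-updated value
whenever the counter crosses a 2^32 boundary.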
> > > I was considering fixing the problem of possible sum_exec_runtime
> > > mangling by using prev_sum_exec_runtime:
> > >
> > > u64 read_sum_exec_runtime(struct task_struct *t)
> > > {
> > > 	u64 ns, prev_ns;
> > >
> > > 	do {
> > > 		prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> > > 		ns = READ_ONCE(t->se.sum_exec_runtime);
> > > 	} while (ns < prev_ns || ns > (prev_ns + U32_MAX));
> > >
> > > 	return ns;
> > > }
> > >
> > > This should work based on the fact that prev_sum_exec_runtime and
> > > sum_exec_runtime are not modified and stored at the same time, so
> > > only one of those variables can be mangled. Though I need to think
> > > about the correctness of that a bit more.
> >
> > I'm not sure that would be enough. READ_ONCE() prevents reordering by
> > the compiler but not by the CPU. You'd need memory barriers between
> > the reads and writes of prev_ns and ns.
>
> It will not be enough. This was _supposed_ to work based on the fact
> that sum_exec_runtime and prev_sum_exec_runtime are not written at the
> same time, i.e. only one variable can be mangled while the other one
> already sits in memory. However, "not written at the same time" is the
> weak part of the reasoning. Even if those variables are stored in
> different parts of the code (sum_exec_runtime in update_curr() and
> prev_sum_exec_runtime in set_next_entity()), we cannot assume the store
> of one variable has finished before the other one starts.

Right.

> > 	WRITE ns		READ prev_ns
> > 	smp_wmb()		smp_rmb()
> > 	WRITE prev_ns		READ ns
> > 	smp_wmb()		smp_rmb()
> >
> > It seems to be the only way to make sure that at least one of the
> > reads (prev_ns or ns) is correct.

Well, reading that again, I'm not 100% sure about the correctness
guarantee. But it might work.

> I think you are right, but it seems that on many code paths we have
> this scenario:
>
> 	WRITE ns		READ prev_ns
> 	smp_wmb()		smp_rmb()
> 	WRITE prev_ns		READ ns
>
> and we already have an smp_wmb() after the write of sum_exec_runtime in
> update_min_vruntime().

You still need a second barrier after the second write and read (or
before the first write and read, which amounts to the same thing) to
ensure that if you read a mangled version of ns, prev_ns is correct.

Still, I think u64_stats_sync is less trouble and more reliable.

Thanks.
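For reference, a rough sketch of what the u64_stats_sync approach could
look like. The struct and the two helpers below are hypothetical, made up
for illustration; only the u64_stats_* API itself is the real one from
<linux/u64_stats_sync.h>:

	#include <linux/u64_stats_sync.h>

	struct cputime_ns {
		struct u64_stats_sync	sync;	/* seqcount on 32-bit, empty on 64-bit */
		u64			utime;
		u64			stime;
	};
	/* u64_stats_init(&ct->sync) must run before first use. */

	/* Writer side; callers must already be serialized (e.g. rq lock held). */
	static void cputime_ns_add(struct cputime_ns *ct, u64 ut, u64 st)
	{
		u64_stats_update_begin(&ct->sync);
		ct->utime += ut;
		ct->stime += st;
		u64_stats_update_end(&ct->sync);
	}

	/* Reader side: lockless; loops only if it raced with a writer. */
	static void cputime_ns_read(struct cputime_ns *ct, u64 *ut, u64 *st)
	{
		unsigned int seq;

		do {
			seq = u64_stats_fetch_begin(&ct->sync);
			*ut = ct->utime;
			*st = ct->stime;
		} while (u64_stats_fetch_retry(&ct->sync, seq));
	}

The nice property is that on 64-bit configs the seqcount compiles away to
nothing, while on 32-bit the reader simply retries if it overlapped a
writer, so we avoid both the tearing and the barrier-ordering puzzle
above. The writer side still needs external serialization, which the
scheduler paths discussed here already have via the rq lock.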