Date: Mon, 30 Jan 2017 16:38:08 +0100
From: Frederic Weisbecker
To: Stanislaw Gruszka
Cc: Peter Zijlstra, LKML, Tony Luck, Wanpeng Li, Michael Ellerman,
	Heiko Carstens, Benjamin Herrenschmidt, Thomas Gleixner,
	Paul Mackerras, Ingo Molnar, Fenghua Yu, Rik van Riel,
	Martin Schwidefsky
Subject: Re: [PATCH 08/37] cputime: Convert task/group cputime to nsecs
Message-ID: <20170130153807.GB16945@lerouge>
References: <1485109213-8561-1-git-send-email-fweisbec@gmail.com>
	<1485109213-8561-9-git-send-email-fweisbec@gmail.com>
	<20170128115740.GA688@redhat.com>
	<20170128152810.GA16568@lerouge>
	<20170130135451.GB2669@redhat.com>
In-Reply-To: <20170130135451.GB2669@redhat.com>

On Mon, Jan 30, 2017 at 02:54:51PM +0100, Stanislaw Gruszka wrote:
> On Sat, Jan 28, 2017 at 04:28:13PM +0100, Frederic Weisbecker wrote:
> > On Sat, Jan 28, 2017 at 12:57:40PM +0100, Stanislaw Gruszka wrote:
> > > On 32 bit architectures a 64bit store/load is not atomic and, if not
> > > protected, 64bit variables can be mangled. I do not see any protection
> > > (lock) between the utime/stime store and load in the patch, and it
> > > seems that the {u,s}time store/load can be performed at the same time.
> > > Though the problem is very improbable, it can still happen, at least
> > > theoretically, when the lower and upper 32 bits change at the same
> > > time, i.e. the process {u,s}time comes close to a multiple of 2**32
> > > nsec (approx. 4 sec) and the 64bit {u,s}time is stored and loaded at
> > > the same time on different cpus. As said, this is a very improbable
> > > situation, but it could eventually happen with long lived processes.
> >
> > "Improbable situation" doesn't apply to Linux. With millions (billions?)
> > of machines using it, a rare issue in the core turns into one likely to
> > happen somewhere on the planet every second.
> >
> > So it's definitely a race we want to consider. Note it goes beyond the
> > scope of this patchset, as the issue was already there before: cputime_t
> > can already map to u64 on 32 bit systems upstream. But this patchset
> > definitely extends the issue to all 32 bit configs.
> >
> > kcpustat has the same issue upstream. It is made of u64 on all configs.
>
> I would like to add the possible consequences if a value gets mangled.
> For sum_exec_runtime, utime and stime we could get wrong values from
> cpu-clock related syscalls like clock_gettime() or clock_nanosleep(),
> and cpu-clock timers like timer_create(CLOCK_PROCESS_CPUTIME_ID) can
> trigger before or long after the expected time. For kcpustat this means
> wrong values read via procfs and by 3 drivers (cpufreq, appldata,
> macintosh).

Yep, all agreed.
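For anyone who wants to see the race concretely, here is a minimal
userspace demo of such a torn 64-bit read (the names, the increment step
and the build line are illustrative only, not taken from the patchset):

	/*
	 * Hypothetical userspace demo (not kernel code) of a torn 64-bit
	 * read.  Build as 32-bit code: gcc -m32 -O2 -pthread torn64.c
	 */
	#include <pthread.h>
	#include <stdint.h>
	#include <stdio.h>

	static volatile uint64_t counter;	/* plain 64-bit var, no locking */

	static void *writer(void *unused)
	{
		(void)unused;
		for (;;)
			/*
			 * On a 32-bit target this update is two 32-bit stores
			 * (low word, high word), so a concurrent reader can
			 * observe one old half and one new half.  This step
			 * crosses a 2^32 boundary every other increment.
			 */
			counter += UINT64_C(0x80000000);
		return NULL;
	}

	int main(void)
	{
		pthread_t tid;
		uint64_t prev = 0, cur;
		unsigned long i;

		pthread_create(&tid, NULL, writer, NULL);

		for (i = 0; i < 100000000UL; i++) {
			cur = counter;	/* 64-bit load: may tear on 32-bit */
			/*
			 * The writer only ever increments, so a sample that
			 * goes backwards proves one of the reads was mangled.
			 */
			if (cur < prev)
				printf("torn read: %llu after %llu\n",
				       (unsigned long long)cur,
				       (unsigned long long)prev);
			prev = cur;
		}
		return 0;
	}

On a 64-bit build the load and store are single instructions and nothing
prints; on a 32-bit build the reader can catch a half-updated value
whenever the counter crosses a 2^32 boundary.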
> > > I was considering fixing the problem of possible sum_exec_runtime
> > > mangling by using prev_sum_exec_runtime:
> > >
> > > u64 read_sum_exec_runtime(struct task_struct *t)
> > > {
> > > 	u64 ns, prev_ns;
> > >
> > > 	do {
> > > 		prev_ns = READ_ONCE(t->se.prev_sum_exec_runtime);
> > > 		ns = READ_ONCE(t->se.sum_exec_runtime);
> > > 	} while (ns < prev_ns || ns > (prev_ns + U32_MAX));
> > >
> > > 	return ns;
> > > }
> > >
> > > This should work based on the fact that prev_sum_exec_runtime and
> > > sum_exec_runtime are not modified and stored at the same time, so
> > > only one of those variables can be mangled. Though I need to think
> > > about the correctness of that a bit more.
> >
> > I'm not sure that would be enough. READ_ONCE() prevents reordering by
> > the compiler but not by the CPU. You'd need memory barriers between
> > the reads and writes of prev_ns and ns.
>
> It will not be enough. This was _supposed_ to work based on the fact
> that sum_exec_runtime and prev_sum_exec_runtime are not written at the
> same time, i.e. only one variable can be mangled while the other one
> already sits in memory. However, "not written at the same time" is the
> weak part of the reasoning. Even if those variables are stored in
> different parts of the code (sum_exec_runtime in update_curr() and
> prev_sum_exec_runtime in set_next_entity()), we cannot assume the store
> of one variable has finished before the other one starts.

Right.

> > 	WRITE ns		READ prev_ns
> > 	smp_wmb()		smp_rmb()
> > 	WRITE prev_ns		READ ns
> > 	smp_wmb()		smp_rmb()
> >
> > It seems to be the only way to make sure that at least one of the
> > reads (prev_ns or ns) is correct.

Well, reading that again, I'm not 100% sure about the correctness
guarantee. But it might work.

> I think you are right, but it seems that on many code paths we have
> this scenario:
>
> 	WRITE ns		READ prev_ns
> 	smp_wmb()		smp_rmb()
> 	WRITE prev_ns		READ ns
>
> and we already have an smp_wmb() after the write of sum_exec_runtime in
> update_min_vruntime().

You still need a second barrier after the second write and read (or
before the first write and read, which amounts to the same thing) to
ensure that if you read a mangled version of ns, prev_ns is correct.

Still, I think u64_stats_sync is less trouble and more reliable.

Thanks.
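For reference, a rough sketch of what the u64_stats_sync approach could
look like. The struct and the two helpers below are hypothetical, made up
for illustration; only the u64_stats_* API itself is the real one from
<linux/u64_stats_sync.h>:

	#include <linux/u64_stats_sync.h>

	struct cputime_ns {
		struct u64_stats_sync	sync;	/* seqcount on 32-bit, empty on 64-bit */
		u64			utime;
		u64			stime;
	};
	/* u64_stats_init(&ct->sync) must run before first use. */

	/* Writer side; callers must already be serialized (e.g. rq lock held). */
	static void cputime_ns_add(struct cputime_ns *ct, u64 ut, u64 st)
	{
		u64_stats_update_begin(&ct->sync);
		ct->utime += ut;
		ct->stime += st;
		u64_stats_update_end(&ct->sync);
	}

	/* Reader side: lockless; loops only if it raced with a writer. */
	static void cputime_ns_read(struct cputime_ns *ct, u64 *ut, u64 *st)
	{
		unsigned int seq;

		do {
			seq = u64_stats_fetch_begin(&ct->sync);
			*ut = ct->utime;
			*st = ct->stime;
		} while (u64_stats_fetch_retry(&ct->sync, seq));
	}

The nice property is that on 64-bit configs the seqcount compiles away to
nothing, while on 32-bit the reader simply retries if it overlapped a
writer, so we avoid both the tearing and the barrier-ordering puzzle
above. The writer side still needs external serialization, which the
scheduler paths discussed here already have via the rq lock.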