From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753049AbbEHOVO (ORCPT <rfc822;w@1wt.eu>);
	Fri, 8 May 2015 10:21:14 -0400
Received: from foss.arm.com ([217.140.101.70]:53131 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752485AbbEHOVN (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Fri, 8 May 2015 10:21:13 -0400
Date: Fri, 8 May 2015 15:21:08 +0100
From: Will Deacon <will.deacon@arm.com>
To: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>, Ingo Molnar <mingo@kernel.org>,
        David Ahern <dsahern@gmail.com>, Jiri Olsa <jolsa@redhat.com>,
        Namhyung Kim <namhyung@gmail.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: Question about barriers for ARM on tools/perf/
Message-ID: <20150508142107.GA25587@arm.com>
References: <20150508140459.GI7862@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20150508140459.GI7862@kernel.org>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, May 08, 2015 at 03:04:59PM +0100, Arnaldo Carvalho de Melo wrote:
> Hi Will,

Hi Arnaldo,

> 	I am working on moving the stuff we have for mb/rmb/wmb from
> tools/perf/perf-sys.h to tools/include/asm/barrier.h, redirecting
> to tools/arch/$ARCH/include/asm/barrier.h, to make it look like the
> kernel and who knows, at some point even share the source code.
> 
> 	For now I am getting just what is needed for work on having
> atomic.h done in the same fashion, to implement refcounts for various
> perf data structures, starting with struct thread, for which I have
> a patch that makes perf survive in high core count machines where it
> currently crashes, most notably 'perf top'.

Sharing atomic.h with userspace sounds a bit scary to me. I'm currently
working on patches that involve patching those routines at runtime to
enable use of some new instructions that we have, so that would cause
problems for userspace.

> 	While doing that I noticed that arm64 implementation, lastly
> fixed in:
> 
>   f428ebd184c82a7914b2aa7e9f868918aaf7ea78
>   perf tools: Fix AAAAARGH64 memory barriers
> 
> By peterz, it implements those barriers as:
> 
> #define mb()            asm volatile("dmb ish" ::: "memory")
> #define wmb()           asm volatile("dmb ishst" ::: "memory")
> #define rmb()           asm volatile("dmb ishld" ::: "memory")
> 
> Which are not the same as in the kernel, i.e. in
> arch/arm64/include/asm/barrier.h, where the above are really smp_mb,
> smp_wmb and smp_rmb.
> 
> Would it be enough for us to use the same implementation as the kernel?
> I.e. make it be:
> 
> #define mb()            asm volatile("dsb sy" ::: "memory")
> #define wmb()           asm volatile("dsb st" ::: "memory")
> #define rmb()           asm volatile("dsb ld" ::: "memory")
> 
> ? If so I would then use those dsb/dmb macros, etc, to get tools/ to use
> the proper instructions, etc.

The mandatory barriers (i.e. the non-smp_* versions) are used for ordering
between CPUs and I/O, so they have a significantly higher performance
penalty on ARM. Given that the perf tool presumably only cares about
ordering between CPUs, the smp_* variants are the correct versions to use.
However, on a !SMP kernel, the smp_* macros become nops (compiler barriers
only), which is why perf defines its own barriers the way it does at the
moment.

> I need now, for arm64, smp_mb, that is used by atomic_sub_return(), that
> in turn is used by atomic_dec_and_test(), that I need for refcounts.

Hmm, that would mean if I build a perf tool in a kernel source tree that is
configured as !SMP, then the tool would be subtly broken.

Wouldn't it be better to go the other way, and use compiler builtins for
the memory barriers instead of relying on the kernel? It looks like the
perf_mmap__{read,write}_head functions are basically just acquire/release
operations and could therefore be implemented using something like
__atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE) and
__atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE).
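
As a rough sketch of what I mean (not tested against perf itself): struct
ring_page and the two helper names below are made up for illustration and
only stand in for the data_head/data_tail fields of the real mmap page.
It assumes a compiler that provides the GCC-style __atomic builtins.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two mmap page fields used here. */
struct ring_page {
	uint64_t data_head;	/* advanced by the producer (the kernel) */
	uint64_t data_tail;	/* advanced by the consumer (the tool) */
};

/*
 * Acquire load of the head: accesses to the ring data that follow this
 * call cannot be reordered before the load of data_head.
 */
static inline uint64_t ring_read_head(struct ring_page *pc)
{
	return __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);
}

/*
 * Release store of the tail: accesses to the ring data that precede this
 * call are ordered before the new tail value becomes visible.
 */
static inline void ring_write_tail(struct ring_page *pc, uint64_t tail)
{
	__atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
}
```

The compiler then picks the barrier or load/store instructions for the
target, so the tool no longer depends on how the kernel tree happens to be
configured.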

Will