Date: Thu, 2 Feb 2017 17:29:01 +0100
From: Ingo Molnar
To: "Ghannam, Yazen"
Cc: Borislav Petkov, x86-ml, Yves Dionne, Brice Goglin, Peter Zijlstra, lkml
Subject: Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID
Message-ID: <20170202162901.GB12498@gmail.com>
References: <20170201200237.36s2jwjgxi24we66@pd.tnic> <20170201214421.ppw2ww3faxxu2jrm@pd.tnic> <20170201222507.qvcn6dsxucn6fqcv@pd.tnic> <20170201224150.ohb7f7jvbttnikkz@pd.tnic> <20170202121054.im3c3iiqp26a2dyb@pd.tnic>

* Ghannam, Yazen wrote:

> Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache
> so I added "ccache -C" in the pre-build script so the cache gets cleared.
>
> Before:
>
>  Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
>     2375752.777479      task-clock (msec)      #   23.589 CPUs utilized    ( +-  0.35% )
>          1,198,979      context-switches       #    0.505 K/sec            ( +-  0.34% )
>      8,964,671,259      cache-misses                                       ( +-  0.44% )
>             79,399      cpu-migrations         #    0.033 K/sec            ( +-  1.92% )
>         37,840,875      page-faults            #    0.016 M/sec            ( +-  0.20% )
>  5,425,612,846,538      cycles                 #    2.284 GHz              ( +-  0.36% )
>  3,367,750,745,825      instructions           #    0.62  insn per cycle   ( +-  0.11% )
>    750,591,286,261      branches               #  315.938 M/sec            ( +-  0.11% )
>     43,544,059,077      branch-misses          #    5.80% of all branches  ( +-  0.08% )
>
>      100.716043494 seconds time elapsed                                    ( +-  1.97% )
>
> After:
>
>  Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
>     1736720.488346      task-clock (msec)      #   23.529 CPUs utilized    ( +-  0.16% )
>          1,144,737      context-switches       #    0.659 K/sec            ( +-  0.20% )
>      8,570,352,975      cache-misses                                       ( +-  0.33% )
>             91,817      cpu-migrations         #    0.053 K/sec            ( +-  1.67% )
>         37,688,118      page-faults            #    0.022 M/sec            ( +-  0.03% )
>  5,547,082,899,245      cycles                 #    3.194 GHz              ( +-  0.19% )
>  3,363,365,420,405      instructions           #    0.61  insn per cycle   ( +-  0.00% )
>    749,676,420,820      branches               #  431.662 M/sec            ( +-  0.00% )
>     43,243,046,270      branch-misses          #    5.77% of all branches  ( +-  0.01% )
>
>       73.810517234 seconds time elapsed                                    ( +-  0.02% )

That's pretty impressive: a ~35% difference in wall-clock performance of this
workload. And that while both the cycle and instruction counts are within 2.5%
of each other.

The only stat that differs beyond the level of noise is cache-misses:

      8,964,671,259      cache-misses                                       ( +-  0.44% )
      8,570,352,975      cache-misses                                       ( +-  0.33% )

which is about 4.5%, but I have trouble believing that just 4.5% more cache
misses can have such a massive effect on performance.

So unless +4.5% cache misses can cause a 35% difference in performance, this
is a really weird result. Where did the extra performance come from - was the
'good' workload perhaps running at higher CPU frequencies for some reason?

Thanks,

	Ingo
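
[ Side note on the measurement setup: the thread does not show the exact perf
  invocation or the pre-build script, so the following is only a sketch of how
  such a run might look. The --pre hook contents (clearing ccache and the build
  tree) are assumptions, not Yazen's actual script:

      # Hypothetical reconstruction of the benchmark run - not the actual script used.
      # 'perf stat -r 3' repeats the measurement three times and reports mean +- stddev;
      # '--pre' runs its shell command before each measured build.
      perf stat -r 3 \
          --pre 'ccache -C; make -s clean' \
          -e task-clock,context-switches,cache-misses,cpu-migrations,page-faults,cycles,instructions,branches,branch-misses \
          make -s -j65 bzImage

  Clearing the compiler cache with 'ccache -C' before each iteration forces
  every run to do real compilation work instead of replaying cached objects,
  which is why it belongs in the pre-build hook rather than in the measured
  command itself. ]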
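
[ On the frequency question: none of these commands are from the thread, but
  one way to sample the effective clock while a build is running would be
  something like:

      # Hypothetical check, not part of the original measurement: watch the
      # per-core clock reported by the kernel while the build runs.
      watch -n 5 'grep "cpu MHz" /proc/cpuinfo | sort -k4 -n | tail -3'

  perf stat itself already gives a per-run average: the "GHz" value printed
  next to the cycles count is simply cycles divided by task-clock. ]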