From: "Paul E. McKenney" <paulmck@kernel.org>
To: Feng Tang <feng.tang@intel.com>
Cc: Borislav Petkov <bp@alien8.de>,
	kernel test robot <oliver.sang@intel.com>,
	Jonathan Lemon <bsd@fb.com>, Tony Luck <tony.luck@intel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	x86@kernel.org, lkp@lists.01.org, lkp@intel.com,
	ying.huang@intel.com, zhengjun.xing@intel.com
Subject: Re: [x86/mce]  7bb39313cd:  netperf.Throughput_tps -4.5% regression
Date: Sat, 16 Jan 2021 07:34:26 -0800
Message-ID: <20210116153413.GP2743@paulmck-ThinkPad-P72>
In-Reply-To: <20210116035251.GB29609@shbuild999.sh.intel.com>

On Sat, Jan 16, 2021 at 11:52:51AM +0800, Feng Tang wrote:
> Hi Boris,
> 
> On Tue, Jan 12, 2021 at 03:14:38PM +0100, Borislav Petkov wrote:
> > On Tue, Jan 12, 2021 at 10:21:09PM +0800, kernel test robot wrote:
> > > 
> > > Greeting,
> > > 
> > > FYI, we noticed a -4.5% regression of netperf.Throughput_tps due to commit:
> > > 
> > > 
> > > commit: 7bb39313cd6239e7eb95198950a02b4ad2a08316 ("x86/mce: Make mce_timed_out() identify holdout CPUs")
> > > https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ras/core
> > > 
> > > 
> > > in testcase: netperf
> > > on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> > > with following parameters:
> > > 
> > > 	ip: ipv4
> > > 	runtime: 300s
> > > 	nr_threads: 16
> > > 	cluster: cs-localhost
> > > 	test: TCP_CRR
> > > 	cpufreq_governor: performance
> > > 	ucode: 0x5003003
> > > 
> > > test-description: Netperf is a benchmark that can be used to measure various aspects of networking performance.
> > > test-url: http://www.netperf.org/netperf/
> > 
> > I'm very very sceptical this thing benchmarks #MC exception handler
> > performance. Because the code this patch adds gets run only during a MCE
> > exception.
> > 
> > So unless I'm missing something obvious please check your setup.
> 
> We've tracked some similar strange kernel performance changes, like
> another mce-related one [1]. For many of them, the root cause is
> that the patch changes the code or data alignment/address of other
> components, as can be seen from the System.map file.
> 
> We added a debug patch that tries to force the data sections of each
> .o to be aligned (isolating components), ran the test 3 times, and
> the regression is gone.
> 
>          %stddev     %change         %stddev
>              \          |                \  
>     263059            -0.2%     262523        netperf.Throughput_total_tps
>      16441            -0.2%      16407        netperf.Throughput_tps
> 
> So the -4.5% is likely caused by the data address change.
> 
> But there is still something I don't understand: the patch
> introduces a new cpumask 'mce_missing_cpus', which is 1024B, and
> according to System.map, all data following it gets a 1024B offset,
> which does not change the cacheline alignment situation.
> 
> The 2 original System.map files are attached in case people want
> to check.
> 
> [1]. https://lore.kernel.org/lkml/20200425114414.GU26573@shao2-debian/

One possibility is that the data-address changes put more stress on the
TLB, for example, if that region of memory is not covered by a huge
TLB entry.  If this is the case, is there a convenient way to define
mce_missing_cpus so as to get it out of the way?

							Thanx, Paul
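
The System.map comparison and the TLB hypothesis discussed above can be
checked mechanically. What follows is a minimal Python sketch, not part
of the original thread. It parses two System.map texts and, for each
symbol that moved, reports the shift, whether its offset within a
cacheline changed, and whether it landed in a different 2 MiB-aligned
region. The helper names (parse_system_map, compare_maps), the 64-byte
cacheline and 2 MiB huge-page sizes, and every address in the sample
data are assumptions made for illustration; the System.map files
attached to the thread are not reproduced here.

```python
CACHELINE = 64            # bytes; assumed x86-64 cacheline size
HUGE_PAGE = 2 * 1024**2   # bytes; assumed 2 MiB huge-page mapping


def parse_system_map(text):
    """Return {symbol: address} from lines of the form '<hex addr> <type> <name>'."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3:
            symbols[parts[2]] = int(parts[0], 16)
    return symbols


def compare_maps(before, after):
    """Describe how each symbol present in both maps moved (unmoved ones are skipped)."""
    report = {}
    for name in before.keys() & after.keys():
        old, new = before[name], after[name]
        if old == new:
            continue
        report[name] = {
            "shift": new - old,
            "cacheline_offset_changed": old % CACHELINE != new % CACHELINE,
            "huge_page_changed": old // HUGE_PAGE != new // HUGE_PAGE,
        }
    return report


# Hypothetical addresses, invented for illustration only.
BEFORE = """\
ffffffff829ffc00 b sym_a
ffffffff829ffe00 b sym_b
"""
AFTER = """\
ffffffff829ffc00 b mce_missing_cpus
ffffffff82a00000 b sym_a
ffffffff82a00200 b sym_b
"""

report = compare_maps(parse_system_map(BEFORE), parse_system_map(AFTER))
for name in sorted(report):
    print(name, report[name])
```

With this invented data, a 1024-byte shift leaves the cacheline offsets
untouched yet moves both symbols across a 2 MiB boundary, which is the
kind of change that would be consistent with the TLB hypothesis. Whether
anything like this happened in the actual builds can only be determined
from the real System.map files and the kernel's actual page mappings.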
