Hi Jiri,

On Sun, Feb 23, 2020 at 08:36:24PM +0100, Jiri Olsa wrote:
> > Thanks for the "perf c2c" suggestion. 
> 
> I'm fighting with lkp tests.. looks like it's not fedora friendly ;-)
> 
> which specific test is doing this? perhaps I can dig it out and run
> without the script machinery..

The test is the 'signal1' test from will-it-scale:
https://github.com/antonblanchard/will-it-scale.git

To make debugging easier, I wrote some rough test code that runs on a
local machine as './a.out task_nums loop_nums'. I usually run it with
task_nums set to the number of CPUs and a very large loop count, then
profile it with perf-c2c. (The code is at the end of this mail.)

> > 
> > I tried to use perf-c2c on one platform (not the one that show
> > the 5.5% regression), and found the main "hitm" points to the
> > "root_user" global data, as there is a task for each CPU doing
> > the signal stress test, and both __sigqueue_alloc() and
> > __sigqueue_free() will call get_user() and free_uid() to inc/dec
> > this root_user's refcount.
> > 
> > Then I added some alignment inside struct "user_struct" (for
> > "root_user"), then the -5.5% is gone, with a +2.6% instead.
> 
> could you share the change?
 
Some of the details are explained in the reply to Linus' mail.

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c..e8e7c6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -13,6 +13,7 @@ struct key;
  * Some day this will be a full-fledged user tracking system..
  */
 struct user_struct {
+	char dummy[0] ____cacheline_aligned;
 	refcount_t __count;	/* reference count */
 	atomic_t processes;	/* How many processes does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
@@ -46,7 +47,8 @@ struct user_struct {
 
 	/* Miscellaneous per-user rate limit */
 	struct ratelimit_state ratelimit;
-};
+
+} ____cacheline_aligned ;
 

> > 
> > One c2c report log is attached.
> 
> could you also post one (for same data) without the callchains?
> 
>   # perf c2c report --stdio --call-graph none
> 
> it should show the read/write/offset for cachelines in more
> readable way

Sure, the report and the 'kallsyms' are attached. Please note that
this was recorded on a 16C/32T machine, not the 96C/192T Cascade Lake
one that showed the -5.5% regression.

From the report, the major 'hitm' is for 2 members of 'root_user':
__count and sigpending.

> 
> I'd be also interested to see the data if you can share (no worries
> if not) ... I'd need the perf.data and bz2 file from 'perf archive'
> run on the perf.data

The raw perf data is about 300MB, and I think you can get similar data
running my test code. If you still need it, I'll try to find a way
to send it to you.

> > 
> > One thing I don't understand is, this -5.5% only happens in
> > one 2 sockets, 96C/192T Cascadelake platform, as we've run
> > the same test on several different platforms. In theory,
> > the false sharing may also take effect? 
> 
> I don't have access to cascade lake, but AFAICT the bigger
> machine the bigger issues with false sharing ;-)

That was my initial thought too :), but I also tried it on a 288T
machine, and the regression did not reproduce there.

Debug code (based on will-it-scale) below:
----------------------------------------------------------------
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>

void handler(int param)
{
}

#define TEST_NUM	500000

void sig_test(int loop)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigaction(SIGUSR1, &act, NULL);

	while (loop--)
		raise(SIGUSR1);
}

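int main(int argc, char *argv[])
{
	int nr_cpus, pid;
	int loop = TEST_NUM;
	int i;

	if (argc < 2) {
		fprintf(stderr, "Usage: %s nr_tasks [loops]\n", argv[0]);
		return 1;
	}

	nr_cpus = atoi(argv[1]);
	if (argc >= 3)
		loop = atoi(argv[2]);

	for (i = 0; i < nr_cpus; i++) {
		pid = fork();
		if (pid < 0) {
			perror("fork");
			break;
		}
		if (pid)
			continue;

		/* forked task: run the signal loop, then exit */
		sig_test(loop);
		return 0;
	}

	/* parent: wait for all children so the run ends with the test */
	while (wait(NULL) > 0)
		;

	return 0;
}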
----------------------------------------------------------------

Thanks,
Feng