Hi Jiri,

On Sun, Feb 23, 2020 at 08:36:24PM +0100, Jiri Olsa wrote:
> > Thanks for the "perf c2c" suggestion.
>
> I'm fighting with lkp tests.. looks like it's not fedora friendly ;-)
>
> which specific test is doing this? perhaps I can dig it out and run
> without the script machinery..

The test is the 'signal1' test from will-it-scale:
https://github.com/antonblanchard/will-it-scale.git

To make debugging easier, I wrote some rough test code that runs on a
local machine: simply run './a.out task_nums loop_nums'. I usually run
it with the number of CPUs as task_nums and a very large loop count,
then use perf-c2c. (The code is at the end of this mail.)

> >
> > I tried to use perf-c2c on one platform (not the one that shows
> > the 5.5% regression), and found the main "hitm" points to the
> > "root_user" global data, as there is a task for each CPU doing
> > the signal stress test, and both __sigqueue_alloc() and
> > __sigqueue_free() will call get_uid() and free_uid() to inc/dec
> > this root_user's refcount.
> >
> > Then I added some alignment inside struct "user_struct" (for
> > "root_user"), then the -5.5% is gone, with a +2.6% instead.
>
> could you share the change?

Some details are explained in the reply to Linus' mail.

diff --git a/include/linux/sched/user.h b/include/linux/sched/user.h
index 39ad98c..e8e7c6b 100644
--- a/include/linux/sched/user.h
+++ b/include/linux/sched/user.h
@@ -13,6 +13,7 @@ struct key;
  * Some day this will be a full-fledged user tracking system..
  */
 struct user_struct {
+	char dummy[0] ____cacheline_aligned;
 	refcount_t __count;	/* reference count */
 	atomic_t processes;	/* How many processes does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
@@ -46,7 +47,8 @@ struct user_struct {
 	/* Miscellaneous per-user rate limit */
 	struct ratelimit_state ratelimit;
-};
+
+} ____cacheline_aligned ;

> >
> > One c2c report log is attached.
>
> could you also post one (for same data) without the callchains?
>
>   # perf c2c report --stdio --call-graph none
>
> it should show the read/write/offset for cachelines in more
> readable way

Sure, the report and the 'kallsyms' are attached. Please note that this
was recorded on a 16C/32T machine, not the 96C/192T Cascade Lake one
that showed the -5.5% regression.

From the report, the major 'hitm' is on 2 members of 'root_user':
__count and sigpending.

>
> I'd be also interested to see the data if you can share (no worries
> if not) ... I'd need the perf.data and bz2 file from 'perf archive'
> run on the perf.data

The raw perf data is about 300MB, and I think you can get similar data
by running my test code. If you still need it, I'll try to find a way
to send it to you.

> >
> > One thing I don't understand is, this -5.5% only happens in
> > one 2 sockets, 96C/192T Cascadelake platform, as we've run
> > the same test on several different platforms. In theory,
> > the false sharing may also take effect?
>
> I don't have access to cascade lake, but AFAICT the bigger
> machine the bigger issues with false sharing ;-)

That was my initial thought too :), but I tried it on a 288T machine
and the regression did not reproduce there.

Debug code (modeled on the will-it-scale test) below:
----------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>

/* Empty handler: only the signal delivery path matters here. */
void handler(int param)
{
}

#define TEST_NUM 500000

/* Install the handler, then raise SIGUSR1 'loop' times. */
void sig_test(int loop)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = handler;
	sigaction(SIGUSR1, &act, NULL);

	while (loop--)
		raise(SIGUSR1);
}

int main(int argc, char *argv[])
{
	int nr_cpus, pid;
	int loop = TEST_NUM;
	int i;

	if (argc == 1) {
		printf("Usage: tsignal nr_cpus [loops]!\n");
		return -1;
	}

	nr_cpus = atoi(argv[1]);
	if (argc == 3)
		loop = atoi(argv[2]);

	/* Fork one child per requested task; the parent does not wait. */
	for (i = 0; i < nr_cpus; i++) {
		pid = fork();
		if (pid)
			continue;

		/* forked task */
		sig_test(loop);
		break;
	}

	return 0;
}
----------------------------------------------------------------

Thanks,
Feng
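
For illustration only, here is a minimal user-space sketch of what the alignment in the patch above changes. It is not kernel code. The struct names, the helper, and the 64-byte cacheline size are assumptions made for the example (the kernel uses ____cacheline_aligned with its own L1 cache size), and the aligned attribute is the GCC/Clang extension.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumption: 64-byte cachelines, typical for x86. */
#define CACHELINE 64

/* Hypothetical stand-in for the unpatched layout: natural alignment
 * only, so the object may start anywhere inside a cacheline and share
 * that line with unrelated neighbouring data. */
struct user_plain {
	int count;		/* stands in for __count */
	int processes;
	int sigpending;
};

/* Hypothetical stand-in for the patched layout: the object starts on a
 * cacheline boundary and its size is padded to whole cachelines, so no
 * unrelated data can share its lines. */
struct user_aligned {
	int count;
	int processes;
	int sigpending;
} __attribute__((aligned(CACHELINE)));

/* Index of the cacheline, relative to the start of an aligned object,
 * that holds the byte at 'offset'. */
static inline size_t cacheline_index(size_t offset)
{
	return offset / CACHELINE;
}

/* Compile-time checks of the properties described above. */
static_assert(alignof(struct user_aligned) == CACHELINE,
	      "aligned struct starts on a cacheline boundary");
static_assert(sizeof(struct user_aligned) % CACHELINE == 0,
	      "aligned struct occupies whole cachelines");
```

Note that in this sketch count and sigpending still sit in the same cacheline as each other (cacheline_index() returns 0 for both offsets). The alignment only keeps unrelated neighbouring data out of that line; it does not separate the two counters.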