* Re: [RFC v2] perf: Rewrite core context handling
@ 2022-01-14 21:48 kernel test robot
  2022-01-18  6:01   ` kernel test robot
  0 siblings, 1 reply; 47+ messages in thread
From: kernel test robot @ 2022-01-14 21:48 UTC (permalink / raw)
  To: kbuild

[-- Attachment #1: Type: text/plain, Size: 19978 bytes --]

CC: llvm@lists.linux.dev
CC: kbuild-all@lists.01.org
In-Reply-To: <20220113134743.1292-1-ravi.bangoria@amd.com>
References: <20220113134743.1292-1-ravi.bangoria@amd.com>
TO: Ravi Bangoria <ravi.bangoria@amd.com>

Hi Ravi,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on powerpc/next tip/sched/core v5.16 next-20220114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git a9f4a6e92b3b319296fb078da2615f618f6cd80c
:::::: branch date: 32 hours ago
:::::: commit date: 32 hours ago
config: arm-randconfig-c002-20220113 (https://download.01.org/0day-ci/archive/20220115/202201150516.CCZxTJTq-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project d1021978b8e7e35dcc30201ca1731d64b5a602a8)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm cross compiling tool for clang build
        # apt-get install binutils-arm-linux-gnueabi
        # https://github.com/0day-ci/linux/commit/f7cf7134e405062bf0f22c3ba5637241c4c4d06a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
        git checkout f7cf7134e405062bf0f22c3ba5637241c4c4d06a
        # save the config file to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm clang-analyzer 

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


clang-analyzer warnings: (new ones prefixed by >>)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:3752:2: note: Loop condition is true.  Entering loop body
           while (event_heap.nr) {
           ^
   kernel/events/core.c:3753:9: note: Calling 'merge_sched_in'
                   ret = func(*evt, data);
                         ^~~~~~~~~~~~~~~~
   kernel/events/core.c:3792:35: note: Assuming pointer value is null
           struct perf_event_context *ctx = event->ctx;
                                            ^~~~~~~~~~
   kernel/events/core.c:3795:6: note: Assuming field 'state' is > PERF_EVENT_STATE_OFF
           if (event->state <= PERF_EVENT_STATE_OFF)
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:3795:2: note: Taking false branch
           if (event->state <= PERF_EVENT_STATE_OFF)
           ^
   kernel/events/core.c:3798:6: note: Assuming the condition is true
           if (!event_filter_match(event))
               ^~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:3798:2: note: Taking true branch
           if (!event_filter_match(event))
           ^
   kernel/events/core.c:3799:3: note: Returning zero, which participates in a condition later
                   return 0;
                   ^~~~~~~~
   kernel/events/core.c:3753:9: note: Returning from 'merge_sched_in'
                   ret = func(*evt, data);
                         ^~~~~~~~~~~~~~~~
   kernel/events/core.c:3754:7: note: 'ret' is 0
                   if (ret)
                       ^~~
   kernel/events/core.c:3754:3: note: Taking false branch
                   if (ret)
                   ^
   kernel/events/core.c:3758:3: note: Taking false branch
                   if (*evt)
                   ^
   kernel/events/core.c:3761:4: note: Calling 'min_heap_pop'
                           min_heap_pop(&event_heap, &perf_min_heap);
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/min_heap.h:84:16: note: Assuming field 'nr' is > 0
           if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
                         ^
   include/asm-generic/bug.h:150:18: note: expanded from macro 'WARN_ONCE'
           DO_ONCE_LITE_IF(condition, WARN, 1, format)
                           ^~~~~~~~~
   include/linux/once_lite.h:15:27: note: expanded from macro 'DO_ONCE_LITE_IF'
                   bool __ret_do_once = !!(condition);                     \
                                           ^~~~~~~~~
   include/linux/min_heap.h:84:6: note: '__ret_do_once' is false
           if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
               ^
   include/asm-generic/bug.h:150:2: note: expanded from macro 'WARN_ONCE'
           DO_ONCE_LITE_IF(condition, WARN, 1, format)
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/once_lite.h:17:16: note: expanded from macro 'DO_ONCE_LITE_IF'
                   if (unlikely(__ret_do_once && !__already_done)) {       \
                                ^~~~~~~~~~~~~
   include/linux/compiler.h:78:42: note: expanded from macro 'unlikely'
   # define unlikely(x)    __builtin_expect(!!(x), 0)
                                               ^
   include/linux/min_heap.h:84:6: note: Left side of '&&' is false
           if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
               ^
   include/asm-generic/bug.h:150:2: note: expanded from macro 'WARN_ONCE'
           DO_ONCE_LITE_IF(condition, WARN, 1, format)
           ^
   include/linux/once_lite.h:17:30: note: expanded from macro 'DO_ONCE_LITE_IF'
                   if (unlikely(__ret_do_once && !__already_done)) {       \
                                              ^
   include/linux/min_heap.h:84:6: note: Taking false branch
           if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
               ^
   include/asm-generic/bug.h:150:2: note: expanded from macro 'WARN_ONCE'
           DO_ONCE_LITE_IF(condition, WARN, 1, format)
           ^
   include/linux/once_lite.h:17:3: note: expanded from macro 'DO_ONCE_LITE_IF'
                   if (unlikely(__ret_do_once && !__already_done)) {       \
                   ^
   include/linux/min_heap.h:84:2: note: Taking false branch
           if (WARN_ONCE(heap->nr <= 0, "Popping an empty heap"))
           ^
   include/linux/min_heap.h:90:2: note: Value assigned to 'event_heap.nr', which participates in a condition later
           min_heapify(heap, 0, func);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:3761:4: note: Returning from 'min_heap_pop'
                           min_heap_pop(&event_heap, &perf_min_heap);
                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:3752:2: note: Loop condition is true.  Entering loop body
           while (event_heap.nr) {
           ^
   kernel/events/core.c:3753:14: note: Passing null pointer value via 1st parameter 'event'
                   ret = func(*evt, data);
                              ^~~~
   kernel/events/core.c:3753:9: note: Calling 'merge_sched_in'
                   ret = func(*evt, data);
                         ^~~~~~~~~~~~~~~~
   kernel/events/core.c:3792:35: note: Access to field 'ctx' results in a dereference of a null pointer (loaded from variable 'event')
           struct perf_event_context *ctx = event->ctx;
                                            ^~~~~
>> kernel/events/core.c:4277:2: warning: Value stored to 'task_ctx' is never read [clang-analyzer-deadcode.DeadStores]
           task_ctx = cpuctx->task_ctx;
           ^          ~~~~~~~~~~~~~~~~
   kernel/events/core.c:4277:2: note: Value stored to 'task_ctx' is never read
           task_ctx = cpuctx->task_ctx;
           ^          ~~~~~~~~~~~~~~~~
   kernel/events/core.c:4777:2: warning: Value stored to 'err' is never read [clang-analyzer-deadcode.DeadStores]
           err = -EINVAL;
           ^     ~~~~~~~
   kernel/events/core.c:4777:2: note: Value stored to 'err' is never read
           err = -EINVAL;
           ^     ~~~~~~~
   kernel/events/core.c:10439:13: warning: Dereference of null pointer [clang-analyzer-core.NullDereference]
           for (vma = mm->mmap; vma; vma = vma->vm_next) {
                      ^
   kernel/events/core.c:10454:39: note: Calling 'perf_event_addr_filters'
           struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/perf_event.h:1475:6: note: Assuming field 'parent' is null
           if (event->parent)
               ^~~~~~~~~~~~~
   include/linux/perf_event.h:1475:2: note: Taking false branch
           if (event->parent)
           ^
   include/linux/perf_event.h:1478:2: note: Returning without writing to 'event->addr_filters.nr_file_filters', which participates in a condition later
           return ifh;
           ^
   kernel/events/core.c:10454:39: note: Returning from 'perf_event_addr_filters'
           struct perf_addr_filters_head *ifh = perf_event_addr_filters(event);
                                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   kernel/events/core.c:10455:29: note: Left side of '||' is false
           struct task_struct *task = READ_ONCE(event->ctx->task);
                                      ^
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:302:3: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
            ^
   kernel/events/core.c:10455:29: note: Left side of '||' is false
           struct task_struct *task = READ_ONCE(event->ctx->task);
                                      ^
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:302:3: note: expanded from macro '__native_word'
           (sizeof(t) == sizeof(char) || sizeof(t) == sizeof(short) || \
            ^
   kernel/events/core.c:10455:29: note: Left side of '||' is true
           struct task_struct *task = READ_ONCE(event->ctx->task);
                                      ^
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:21: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
                              ^
   include/linux/compiler_types.h:303:28: note: expanded from macro '__native_word'
            sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
                                     ^
   kernel/events/core.c:10455:29: note: Taking false branch
           struct task_struct *task = READ_ONCE(event->ctx->task);
                                      ^
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
           ^
   include/linux/compiler_types.h:335:2: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
           ^
   include/linux/compiler_types.h:323:2: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
           ^
   include/linux/compiler_types.h:315:3: note: expanded from macro '__compiletime_assert'
                   if (!(condition))                                       \
                   ^
   kernel/events/core.c:10455:29: note: Loop condition is false.  Exiting loop
           struct task_struct *task = READ_ONCE(event->ctx->task);
                                      ^
   include/asm-generic/rwonce.h:49:2: note: expanded from macro 'READ_ONCE'
           compiletime_assert_rwonce_type(x);                              \
           ^
   include/asm-generic/rwonce.h:36:2: note: expanded from macro 'compiletime_assert_rwonce_type'
           compiletime_assert(__native_word(t) || sizeof(t) == sizeof(long long),  \
           ^
   include/linux/compiler_types.h:335:2: note: expanded from macro 'compiletime_assert'
           _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
           ^
   include/linux/compiler_types.h:323:2: note: expanded from macro '_compiletime_assert'
           __compiletime_assert(condition, msg, prefix, suffix)
           ^
   include/linux/compiler_types.h:307:2: note: expanded from macro '__compiletime_assert'
           do {                                                            \
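
[A side note for anyone trying to follow the long trace above: the path the
analyzer describes assumes an element handed out of the event min-heap in
visit_groups_merge() can be NULL ("Assuming pointer value is null") and is
then dereferenced as event->ctx inside merge_sched_in(). Whether that can
actually happen with the kernel's real heap contents is exactly what the
report leaves open. The standalone program below only distills the loop shape
being reasoned about -- every fake_* name in it is an invented stand-in, not a
kernel API -- and shows the kind of guard that would satisfy the analyzer.]

#include <stdio.h>
#include <stddef.h>

struct fake_ctx   { int nr_active; };
struct fake_event { struct fake_ctx *ctx; };

struct fake_heap {
	struct fake_event **data;
	size_t nr;
};

/* Plays the role of merge_sched_in(): dereferences its argument unconditionally. */
static int fake_sched_in(struct fake_event *event)
{
	return event->ctx->nr_active;		/* NULL deref if event == NULL */
}

/* Plays the role of min_heap_pop(): drops the root element. */
static void fake_heap_pop(struct fake_heap *heap)
{
	if (heap->nr == 0)
		return;
	heap->data[0] = heap->data[--heap->nr];
}

int main(void)
{
	struct fake_ctx ctx = { .nr_active = 1 };
	struct fake_event e = { .ctx = &ctx };
	/* The NULL slot models the analyzer's "Assuming pointer value is null". */
	struct fake_event *slots[2] = { &e, NULL };
	struct fake_heap heap = { .data = slots, .nr = 2 };

	while (heap.nr) {
		struct fake_event *evt = heap.data[0];

		if (!evt) {			/* the guard the analyzer wants to see */
			fake_heap_pop(&heap);
			continue;
		}
		printf("scheduled, nr_active=%d\n", fake_sched_in(evt));
		fake_heap_pop(&heap);
	}
	return 0;
}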

vim +/task_ctx +4277 kernel/events/core.c

8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4257  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4258  static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4259  {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4260  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4261  	struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4262  	struct perf_event *cpu_event = NULL, *task_event = NULL;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01  4263  	struct perf_event_context *task_ctx = NULL;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01  4264  	int cpu_rotate, task_rotate;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4265  	struct pmu *pmu;
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4266  
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4267  	/*
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4268  	 * Since we run this from IRQ context, nobody can install new
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4269  	 * events, thus the event count values are stable.
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4270  	 */
7fc23a53807970 kernel/perf_counter.c Peter Zijlstra   2009-05-08  4271  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4272  	cpu_epc = &cpc->epc;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4273  	pmu = cpu_epc->pmu;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4274  	task_epc = cpc->task_epc;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4275  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4276  	cpu_rotate = cpu_epc->rotate_necessary;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01 @4277  	task_ctx = cpuctx->task_ctx;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4278  	task_rotate = task_epc ? task_epc->rotate_necessary : 0;
9717e6cd3db22e kernel/perf_event.c   Peter Zijlstra   2010-01-28  4279  
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4280  	if (!(cpu_rotate || task_rotate))
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4281  		return false;
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4282  
facc43071cc0d4 kernel/events/core.c  Peter Zijlstra   2011-04-09  4283  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4284  	perf_pmu_disable(pmu);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4285  
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4286  	if (task_rotate)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4287  		task_event = ctx_event_to_rotate(task_epc);
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4288  	if (cpu_rotate)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4289  		cpu_event = ctx_event_to_rotate(cpu_epc);
8703a7cfe148f7 kernel/events/core.c  Peter Zijlstra   2017-11-13  4290  
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4291  	/*
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4292  	 * As per the order given at ctx_resched() first 'pop' task flexible
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4293  	 * and then, if needed CPU flexible.
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4294  	 */
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4295  	if (task_event || (task_epc && cpu_event)) {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4296  		update_context_time(task_epc->ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4297  		__pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4298  	}
235c7fc7c500e4 kernel/perf_counter.c Ingo Molnar      2008-12-21  4299  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4300  	if (cpu_event) {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4301  		update_context_time(&cpuctx->ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4302  		__pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4303  		rotate_ctx(&cpuctx->ctx, cpu_event);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4304  		__pmu_ctx_sched_in(&cpuctx->ctx, pmu);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4305  	}
0793a61d4df8da kernel/perf_counter.c Thomas Gleixner  2008-12-04  4306  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4307  	if (task_event)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4308  		rotate_ctx(task_epc->ctx, task_event);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4309  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4310  	if (task_event || (task_epc && cpu_event))
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4311  		__pmu_ctx_sched_in(task_epc->ctx, pmu);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4312  
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4313  	perf_pmu_enable(pmu);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4314  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
9e6302056f8029 kernel/events/core.c  Stephane Eranian 2013-04-03  4315  
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4316  	return true;
e9d2b064149ff7 kernel/perf_event.c   Peter Zijlstra   2010-09-17  4317  }
e9d2b064149ff7 kernel/perf_event.c   Peter Zijlstra   2010-09-17  4318  
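
[The DeadStores warning at line 4277 can be read straight off the listing
above: task_ctx is declared at 4263 and assigned from cpuctx->task_ctx at
4277, but nothing in the function reads it afterwards -- the later code uses
cpuctx->task_ctx and task_epc directly. An untested sketch of the obvious
fix, written against the excerpt rather than taken from any posted patch,
would simply drop the local and its assignment:]

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ perf_rotate_context() @@
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
 	struct perf_event *cpu_event = NULL, *task_event = NULL;
-	struct perf_event_context *task_ctx = NULL;
 	int cpu_rotate, task_rotate;
 	struct pmu *pmu;
@@ perf_rotate_context() @@
 	cpu_rotate = cpu_epc->rotate_necessary;
-	task_ctx = cpuctx->task_ctx;
 	task_rotate = task_epc ? task_epc->rotate_necessary : 0;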

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-14 21:48 [RFC v2] perf: Rewrite core context handling kernel test robot
@ 2022-01-18  6:01   ` kernel test robot
  0 siblings, 0 replies; 47+ messages in thread
From: kernel test robot @ 2022-01-18  6:01 UTC (permalink / raw)
  To: Ravi Bangoria; +Cc: llvm, kbuild-all

Hi Ravi,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on powerpc/next tip/sched/core v5.16 next-20220114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git a9f4a6e92b3b319296fb078da2615f618f6cd80c
config: arm-randconfig-c002-20220113 (https://download.01.org/0day-ci/archive/20220115/202201150516.CCZxTJTq-lkp@intel.com/config)
compiler: clang version 14.0.0 (https://github.com/llvm/llvm-project d1021978b8e7e35dcc30201ca1731d64b5a602a8)
reproduce (this is a W=1 build):
         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
         chmod +x ~/bin/make.cross
         # install arm cross compiling tool for clang build
         # apt-get install binutils-arm-linux-gnueabi
         # https://github.com/0day-ci/linux/commit/f7cf7134e405062bf0f22c3ba5637241c4c4d06a
         git remote add linux-review https://github.com/0day-ci/linux
         git fetch --no-tags linux-review Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
         git checkout f7cf7134e405062bf0f22c3ba5637241c4c4d06a
         # save the config file to linux build tree
         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=arm clang-analyzer

If you fix the issue, kindly add the following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


clang-analyzer warnings: (new ones prefixed by >>)

 >> kernel/events/core.c:4277:2: warning: Value stored to 'task_ctx' is never read [clang-analyzer-deadcode.DeadStores]
            task_ctx = cpuctx->task_ctx;
            ^          ~~~~~~~~~~~~~~~~

vim +/task_ctx +4277 kernel/events/core.c

8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4257
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4258  static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4259  {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4260  	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4261  	struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4262  	struct perf_event *cpu_event = NULL, *task_event = NULL;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01  4263  	struct perf_event_context *task_ctx = NULL;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01  4264  	int cpu_rotate, task_rotate;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4265  	struct pmu *pmu;
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4266
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4267  	/*
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4268  	 * Since we run this from IRQ context, nobody can install new
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4269  	 * events, thus the event count values are stable.
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4270  	 */
7fc23a53807970 kernel/perf_counter.c Peter Zijlstra   2009-05-08  4271
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4272  	cpu_epc = &cpc->epc;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4273  	pmu = cpu_epc->pmu;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4274  	task_epc = cpc->task_epc;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4275
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4276  	cpu_rotate = cpu_epc->rotate_necessary;
fd7d55172d1e2e kernel/events/core.c  Ian Rogers       2019-06-01 @4277  	task_ctx = cpuctx->task_ctx;
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4278  	task_rotate = task_epc ? task_epc->rotate_necessary : 0;
9717e6cd3db22e kernel/perf_event.c   Peter Zijlstra   2010-01-28  4279
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4280  	if (!(cpu_rotate || task_rotate))
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4281  		return false;
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4282
facc43071cc0d4 kernel/events/core.c  Peter Zijlstra   2011-04-09  4283  	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4284  	perf_pmu_disable(pmu);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4285
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4286  	if (task_rotate)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4287  		task_event = ctx_event_to_rotate(task_epc);
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4288  	if (cpu_rotate)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4289  		cpu_event = ctx_event_to_rotate(cpu_epc);
8703a7cfe148f7 kernel/events/core.c  Peter Zijlstra   2017-11-13  4290
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4291  	/*
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4292  	 * As per the order given at ctx_resched() first 'pop' task flexible
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4293  	 * and then, if needed CPU flexible.
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4294  	 */
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4295  	if (task_event || (task_epc && cpu_event)) {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4296  		update_context_time(task_epc->ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4297  		__pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4298  	}
235c7fc7c500e4 kernel/perf_counter.c Ingo Molnar      2008-12-21  4299
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4300  	if (cpu_event) {
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4301  		update_context_time(&cpuctx->ctx);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4302  		__pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4303  		rotate_ctx(&cpuctx->ctx, cpu_event);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4304  		__pmu_ctx_sched_in(&cpuctx->ctx, pmu);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4305  	}
0793a61d4df8da kernel/perf_counter.c Thomas Gleixner  2008-12-04  4306
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4307  	if (task_event)
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4308  		rotate_ctx(task_epc->ctx, task_event);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4309
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4310  	if (task_event || (task_epc && cpu_event))
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4311  		__pmu_ctx_sched_in(task_epc->ctx, pmu);
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4312
f7cf7134e40506 kernel/events/core.c  Peter Zijlstra   2022-01-13  4313  	perf_pmu_enable(pmu);
0f5a2601284237 kernel/events/core.c  Peter Zijlstra   2011-11-16  4314  	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
9e6302056f8029 kernel/events/core.c  Stephane Eranian 2013-04-03  4315
8d5bce0c37fa10 kernel/events/core.c  Peter Zijlstra   2018-03-09  4316  	return true;
e9d2b064149ff7 kernel/perf_event.c   Peter Zijlstra   2010-09-17  4317  }
e9d2b064149ff7 kernel/perf_event.c   Peter Zijlstra   2010-09-17  4318

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-29  4:00             ` Ravi Bangoria
@ 2022-08-29 11:58               ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-29 11:58 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Aug 29, 2022 at 09:30:50AM +0530, Ravi Bangoria wrote:
> > With this, I can run 'perf test' and perf_event_tests without any error in
> > dmesg. I'll run the perf fuzzer overnight and see if it reports any issue.
> 
> I also ran the fuzzer on an Intel machine over the weekend. I see only one WARN_ON()
> hit. Otherwise the system is running normally. FWIW, I was running the fuzzer as a normal
> user with perf_event_paranoid=0.
> 
>   WARNING: CPU: 3 PID: 2840537 at arch/x86/events/core.c:1606 x86_pmu_stop+0xd0/0x100

That's the WARN about PERF_HES_STOPPED already being set.

>   Call Trace:
>    <TASK>
>    x86_pmu_del+0x8e/0x2d0
>    ? debug_smp_processor_id+0x17/0x20
>    event_sched_out+0x10b/0x2b0
>    ? x86_pmu_del+0x5c/0x2d0
>    merge_sched_in+0x39f/0x410

And this callchain suggests this is the group_error path.

I can't immediately spot a fail there, but I'll try and stare at it
some.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 16:37           ` Ravi Bangoria
  2022-08-23  4:20             ` Ravi Bangoria
  2022-08-23  6:30             ` Peter Zijlstra
@ 2022-08-29  4:00             ` Ravi Bangoria
  2022-08-29 11:58               ` Peter Zijlstra
  2 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-29  4:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

> With this, I can run 'perf test' and perf_event_tests without any error in
> dmesg. I'll run the perf fuzzer overnight and see if it reports any issue.

I also ran the fuzzer on an Intel machine over the weekend. I see only one WARN_ON()
hit. Otherwise the system is running normally. FWIW, I was running the fuzzer as a normal
user with perf_event_paranoid=0.

  WARNING: CPU: 3 PID: 2840537 at arch/x86/events/core.c:1606 x86_pmu_stop+0xd0/0x100
  Modules linked in: ipmi_ssif intel_rapl_msr intel_rapl_common intel_tcc_cooling x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel wmi_bmof kvm rapl intel_cstate input_leds ee1004 joydev mei_me mei intel_pch_thermal ie31200_edac acpi_ipmi wmi ipmi_si mac_hid acpi_pad acpi_power_meter acpi_tad tcp_westwood sch_fq_codel dm_multipath scsi_dh_rdac bonding scsi_dh_emc tls scsi_dh_alua ipmi_devintf ipmi_msghandler msr ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear hid_generic uas usbhid cdc_ether hid usb_storage usbnet mii i915 ast drm_vram_helper drm_ttm_helper i2c_algo_bit drm_buddy drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core ttm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd i2c_i801 drm i40e cryptd
   i2c_smbus ahci xhci_pci libahci xhci_pci_renesas video pinctrl_cannonlake
  CPU: 3 PID: 2840537 Comm: perf_fuzzer Not tainted 6.0.0-rc2+ #3
  Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4I-NL, BIOS L2.09C 09/23/2020
  RIP: 0010:x86_pmu_stop+0xd0/0x100
  Code: c8 01 41 89 84 24 d8 01 00 00 eb 9f 4c 89 e7 e8 76 fe ff ff 5b 41 83 8c 24 d8 01 00 00 02 41 5c 41 5d 41 5e 5d c3 cc cc cc cc <0f> 0b eb d1 4c 89 f6 48 c7 c7 00 86 03 b1 e8 cd 18 76 00 e9 48 ff
  RSP: 0000:ffffbda8c818fbd0 EFLAGS: 00010002
  RAX: 0000000000000003 RBX: ffff97b71de19c60 RCX: 0000000000000188
  RDX: 0000000000000000 RSI: 00000000001382d0 RDI: 0000000000000188
  RBP: ffffbda8c818fbf0 R08: ffffffffb1039100 R09: 0000000000000005
  R10: ffff97b71de1a388 R11: 0000000000000004 R12: ffff97b069c19d40
  R13: 0000000000000004 R14: 0000000000000002 R15: ffff97b71de00000
  FS:  00007fbf787c6740(0000) GS:ffff97b71de00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000563033ec5010 CR3: 00000001ab91c002 CR4: 00000000003707e0
  DR0: 0000000000000000 DR1: 000000000000ffff DR2: 0000000081008000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  Call Trace:
   <TASK>
   x86_pmu_del+0x8e/0x2d0
   ? debug_smp_processor_id+0x17/0x20
   event_sched_out+0x10b/0x2b0
   ? x86_pmu_del+0x5c/0x2d0
   merge_sched_in+0x39f/0x410
   visit_groups_merge.constprop.0.isra.0+0x207/0x670
   ctx_flexible_sched_in+0xb8/0xd0
   ctx_sched_in+0x10a/0x290
   ctx_resched+0x97/0x100
   __perf_event_enable+0x21b/0x310
   event_function+0xb3/0x120
   ? perf_duration_warn+0x30/0x30
   remote_function+0x52/0x70
   __flush_smp_call_function_queue+0xc4/0x510
   generic_smp_call_function_single_interrupt+0x1a/0xb0
   __sysvec_call_function_single+0x48/0x1f0
   sysvec_call_function_single+0x56/0xd0
   asm_sysvec_call_function_single+0x1b/0x20
  RIP: 0033:0x563033ec501b
  Code: 0f 1e fa 48 89 d1 31 c0 48 89 f2 89 fe bf 41 01 00 00 e9 48 f7 fe ff 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 31 c9 b9 1f a1 07 00 <ff> c9 75 fc 31 c0 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3
  RSP: 002b:00007ffd7b16aaf8 EFLAGS: 00000202
  RAX: 0000000000002303 RBX: 0000000000000000 RCX: 000000000006c1b6
  RDX: 00007fbf787c6a00 RSI: 0000000000000000 RDI: 0000000000000001
  RBP: 00007ffd7b16ab10 R08: 0000000000000000 R09: 00007fbf787c6740
  R10: 00007fbf7880d0c8 R11: 0000000000000246 R12: 00007ffd7b16cf28
  R13: 0000563033eb527a R14: 0000563033ed1b68 R15: 00007fbf7880c040
   </TASK>
  irq event stamp: 0
  hardirqs last  enabled at (0): [<0000000000000000>] 0x0
  hardirqs last disabled at (0): [<ffffffffaf0bfef8>] copy_process+0xa38/0x1f80
  softirqs last  enabled at (0): [<ffffffffaf0bfef8>] copy_process+0xa38/0x1f80
  softirqs last disabled at (0): [<0000000000000000>] 0x0
  ---[ end trace 0000000000000000 ]---

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-23  4:20             ` Ravi Bangoria
@ 2022-08-29  3:54               ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-29  3:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	Sandipan Das, ravi.bangoria

On 23-Aug-22 9:50 AM, Ravi Bangoria wrote:
> 
>> With this, I can run 'perf test' and perf_event_tests without any error in
>> dmesg. I'll run the perf fuzzer overnight and see if it reports any issue.
> 
> I hit a kernel crash with the fuzzer. I've yet to debug it. Here is the trace:
> 
>   BUG: kernel NULL pointer dereference, address: 0000000000000198
>   #PF: supervisor read access in kernel mode
>   #PF: error_code(0x0000) - not-present page
>   PGD 0 P4D 0
>   Oops: 0000 [#1] PREEMPT SMP NOPTI
>   CPU: 48 PID: 0 Comm: swapper/48 Not tainted 6.0.0-rc1-perf-event-context-peter-queue+ #153
>   Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.7.3 03/31/2022
>   RIP: 0010:x86_pmu_enable_event+0x3c/0x120

I was able to reproduce this with the vanilla v6.0-rc2 kernel.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-24 14:59     ` Peter Zijlstra
  2022-08-25  5:39       ` Ravi Bangoria
@ 2022-08-25 11:03       ` Ravi Bangoria
  1 sibling, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-25 11:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 24-Aug-22 8:29 PM, Peter Zijlstra wrote:
> On Wed, Aug 24, 2022 at 02:15:13PM +0200, Peter Zijlstra wrote:
>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>>>  void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
>>>  {
>>> -	struct perf_cpu_context *cpuctx;
>>> +	/* XXX: Don't need this quirk anymore */
>>> +	/*struct perf_cpu_context *cpuctx;
>>>  
>>>  	if (!pmu->pmu_cpu_context)
>>>  		return;
>>>  
>>>  	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
>>> -	cpuctx->ctx.pmu = pmu;
>>> +	cpuctx->ctx.pmu = pmu;*/
>>>  }
>>
>> Confirmed; my ADL seems to work fine without all that.
> 
> Additionally; this doesn't insta crash.

While collating this, I came across armv8pmu_start(), which does:

	struct perf_event_context *task_ctx =
		this_cpu_ptr(cpu_pmu->pmu.pmu_cpu_context)->task_ctx;

	if (sysctl_perf_user_access && task_ctx && task_ctx->nr_user)

I'm not sure why it does not lock task_ctx. Should it be changed to
something like below? Untested:

---
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index 016072a89f8f..747415a5f2b2 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -806,10 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
 
 static void armv8pmu_start(struct arm_pmu *cpu_pmu)
 {
-	struct perf_event_context *task_ctx =
-		this_cpu_ptr(cpu_pmu->pmu.pmu_cpu_context)->task_ctx;
+	struct perf_event_context *ctx;
+	int nr_user = 0;
+
+	rcu_read_lock();
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx) {
+		raw_spin_lock(&ctx->lock);
+		nr_user = ctx->nr_user;
+		raw_spin_unlock(&ctx->lock);
+	}
+	rcu_read_unlock();
 
-	if (sysctl_perf_user_access && task_ctx && task_ctx->nr_user)
+	if (sysctl_perf_user_access && nr_user)
 		armv8pmu_enable_user_access(cpu_pmu);
 	else
 		armv8pmu_disable_user_access();
---

Thanks,
Ravi

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-25  5:39       ` Ravi Bangoria
@ 2022-08-25  9:17         ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-25  9:17 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Thu, Aug 25, 2022 at 11:09:05AM +0530, Ravi Bangoria wrote:
> > -static inline int __pmu_filter_match(struct perf_event *event)
> > -{
> > -	struct pmu *pmu = event->pmu;
> > -	return pmu->filter_match ? pmu->filter_match(event) : 1;
> > -}
> > -
> > -/*
> > - * Check whether we should attempt to schedule an event group based on
> > - * PMU-specific filtering. An event group can consist of HW and SW events,
> > - * potentially with a SW leader, so we must check all the filters, to
> > - * determine whether a group is schedulable:
> > - */
> > -static inline int pmu_filter_match(struct perf_event *event)
> > -{
> > -	struct perf_event *sibling;
> > -
> > -	if (!__pmu_filter_match(event))
> > -		return 0;
> > -
> > -	for_each_sibling_event(sibling, event) {
> > -		if (!__pmu_filter_match(sibling))
> > -			return 0;
> > -	}
> > -
> > -	return 1;
> > -}
> > -
> >  static inline int
> >  event_filter_match(struct perf_event *event)
> >  {
> >  	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
> > -	       perf_cgroup_match(event) && pmu_filter_match(event);
> > +	       perf_cgroup_match(event);
> 
> There are many callers of event_filter_match() which might not end up calling
> visit_groups_merge(). I hope this is an intentional change?

I thought I did, but let's go through them again.

event_filter_match() is called from:

 - __perf_event_enable(); here we'll end up in ctx_sched_in() which
   will dutifully skip the pmu in question.

   (fwiw, this is one of those sites where ctx_sched_{out,in}() could do
   with a @pmu argument.)

 - merge_sched_in(); this is after the new callsite in
   visit_groups_merge().

 - perf_adjust_freq_unthrottle_context(); if the pmu was skipped in
   visit_groups_merge() then ->state != ACTIVE and we'll bail out.

 - perf_iterate_ctx() / perf_iterate_sb_cpu(); these are for generating
   side-band events, and arguably not delivering them when running on
   the 'wrong' CPU wasn't right to begin with.


So I think we're good. Hmm?
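
[For anyone reading this subthread out of order: the check being discussed is
the one the patch later in this thread adds at the top of visit_groups_merge(),
reproduced here only as a reminder of what replaced the per-event
pmu_filter_match():]

	if (pmu->filter && pmu->filter(pmu, cpu))
		return 0;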

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-24 14:59     ` Peter Zijlstra
@ 2022-08-25  5:39       ` Ravi Bangoria
  2022-08-25  9:17         ` Peter Zijlstra
  2022-08-25 11:03       ` Ravi Bangoria
  1 sibling, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-25  5:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

> -static inline int __pmu_filter_match(struct perf_event *event)
> -{
> -	struct pmu *pmu = event->pmu;
> -	return pmu->filter_match ? pmu->filter_match(event) : 1;
> -}
> -
> -/*
> - * Check whether we should attempt to schedule an event group based on
> - * PMU-specific filtering. An event group can consist of HW and SW events,
> - * potentially with a SW leader, so we must check all the filters, to
> - * determine whether a group is schedulable:
> - */
> -static inline int pmu_filter_match(struct perf_event *event)
> -{
> -	struct perf_event *sibling;
> -
> -	if (!__pmu_filter_match(event))
> -		return 0;
> -
> -	for_each_sibling_event(sibling, event) {
> -		if (!__pmu_filter_match(sibling))
> -			return 0;
> -	}
> -
> -	return 1;
> -}
> -
>  static inline int
>  event_filter_match(struct perf_event *event)
>  {
>  	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
> -	       perf_cgroup_match(event) && pmu_filter_match(event);
> +	       perf_cgroup_match(event);

There are many callers of event_filter_match() which might not end up calling
visit_groups_merge(). I hope this is an intentional change?

>  }
>  
>  static void
> @@ -3661,6 +3634,9 @@ static noinline int visit_groups_merge(struct perf_event_context *ctx,
>  	struct perf_event **evt;
>  	int ret;
>  
> +	if (pmu->filter && pmu->filter(pmu, cpu))
> +		return 0;
> +
>  	if (!ctx->task) {
>  		cpuctx = this_cpu_ptr(&cpu_context);
>  		event_heap = (struct min_heap){

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-24 12:15   ` Peter Zijlstra
@ 2022-08-24 14:59     ` Peter Zijlstra
  2022-08-25  5:39       ` Ravi Bangoria
  2022-08-25 11:03       ` Ravi Bangoria
  0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-24 14:59 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Wed, Aug 24, 2022 at 02:15:13PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> >  void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
> >  {
> > -	struct perf_cpu_context *cpuctx;
> > +	/* XXX: Don't need this quirk anymore */
> > +	/*struct perf_cpu_context *cpuctx;
> >  
> >  	if (!pmu->pmu_cpu_context)
> >  		return;
> >  
> >  	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
> > -	cpuctx->ctx.pmu = pmu;
> > +	cpuctx->ctx.pmu = pmu;*/
> >  }
> 
> Confirmed; my ADL seems to work fine without all that.

Additionally; this doesn't insta crash.

---
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index cb69ff1e6138..016072a89f8f 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -1019,10 +1019,10 @@ static int armv8pmu_set_event_filter(struct hw_perf_event *event,
 	return 0;
 }
 
-static int armv8pmu_filter_match(struct perf_event *event)
+static bool armv8pmu_filter(struct pmu *pmu, int cpu)
 {
-	unsigned long evtype = event->hw.config_base & ARMV8_PMU_EVTYPE_EVENT;
-	return evtype != ARMV8_PMUV3_PERFCTR_CHAIN;
+	struct arm_pmu *armpmu = to_arm_pmu(pmu);
+	return !cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus);
 }
 
 static void armv8pmu_reset(void *info)
@@ -1253,7 +1253,7 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
 	cpu_pmu->stop			= armv8pmu_stop;
 	cpu_pmu->reset			= armv8pmu_reset;
 	cpu_pmu->set_event_filter	= armv8pmu_set_event_filter;
-	cpu_pmu->filter_match		= armv8pmu_filter_match;
+	cpu_pmu->filter			= armv8pmu_filter;
 
 	cpu_pmu->pmu.event_idx		= armv8pmu_user_event_idx;
 
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7a2d12ad6d1f..a8f1e38c66a7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -86,6 +86,8 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_swap_task_ctx, *x86_pmu.swap_task_ctx);
 DEFINE_STATIC_CALL_NULL(x86_pmu_drain_pebs,   *x86_pmu.drain_pebs);
 DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, *x86_pmu.pebs_aliases);
 
+DEFINE_STATIC_CALL_NULL(x86_pmu_filter, *x86_pmu.filter);
+
 /*
  * This one is magic, it will get called even when PMU init fails (because
  * there is no PMU), in which case it should simply return NULL.
@@ -2038,6 +2040,7 @@ static void x86_pmu_static_call_update(void)
 	static_call_update(x86_pmu_pebs_aliases, x86_pmu.pebs_aliases);
 
 	static_call_update(x86_pmu_guest_get_msrs, x86_pmu.guest_get_msrs);
+	static_call_update(x86_pmu_filter, x86_pmu.filter);
 }
 
 static void _x86_pmu_read(struct perf_event *event)
@@ -2668,12 +2671,13 @@ static int x86_pmu_aux_output_match(struct perf_event *event)
 	return 0;
 }
 
-static int x86_pmu_filter_match(struct perf_event *event)
+static bool x86_pmu_filter(struct pmu *pmu, int cpu)
 {
-	if (x86_pmu.filter_match)
-		return x86_pmu.filter_match(event);
+	bool ret = false;
 
-	return 1;
+	static_call_cond(x86_pmu_filter)(pmu, cpu, &ret);
+
+	return ret;
 }
 
 static struct pmu pmu = {
@@ -2704,7 +2708,7 @@ static struct pmu pmu = {
 
 	.aux_output_match	= x86_pmu_aux_output_match,
 
-	.filter_match		= x86_pmu_filter_match,
+	.filter			= x86_pmu_filter,
 };
 
 void arch_perf_update_userpage(struct perf_event *event,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 768771e5e4e9..40cebd9b90a1 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4675,12 +4675,11 @@ static int intel_pmu_aux_output_match(struct perf_event *event)
 	return is_intel_pt_event(event);
 }
 
-static int intel_pmu_filter_match(struct perf_event *event)
+static void intel_pmu_filter(struct pmu *pmu, int cpu, bool *ret)
 {
-	struct x86_hybrid_pmu *pmu = hybrid_pmu(event->pmu);
-	unsigned int cpu = smp_processor_id();
+	struct x86_hybrid_pmu *hpmu = hybrid_pmu(pmu);
 
-	return cpumask_test_cpu(cpu, &pmu->supported_cpus);
+	*ret = !cpumask_test_cpu(cpu, &hpmu->supported_cpus);
 }
 
 PMU_FORMAT_ATTR(offcore_rsp, "config1:0-63");
@@ -6348,7 +6347,7 @@ __init int intel_pmu_init(void)
 		x86_pmu.update_topdown_event = adl_update_topdown_event;
 		x86_pmu.set_topdown_event_period = adl_set_topdown_event_period;
 
-		x86_pmu.filter_match = intel_pmu_filter_match;
+		x86_pmu.filter = intel_pmu_filter;
 		x86_pmu.get_event_constraints = adl_get_event_constraints;
 		x86_pmu.hw_config = adl_hw_config;
 		x86_pmu.limit_period = spr_limit_period;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 9c835ecb232e..b3ff55fc5794 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -924,7 +924,7 @@ struct x86_pmu {
 
 	int (*aux_output_match) (struct perf_event *event);
 
-	int (*filter_match)(struct perf_event *event);
+	void (*filter)(struct pmu *pmu, int cpu, bool *ret);
 	/*
 	 * Hybrid support
 	 *
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 0407a38b470a..0f9519874fde 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -99,7 +99,7 @@ struct arm_pmu {
 	void		(*stop)(struct arm_pmu *);
 	void		(*reset)(void *);
 	int		(*map_event)(struct perf_event *event);
-	int		(*filter_match)(struct perf_event *event);
+	bool		(*filter)(struct pmu *pmu, int cpu);
 	int		num_events;
 	bool		secure_access; /* 32-bit ARM only */
 #define ARMV8_PMUV3_MAX_COMMON_EVENTS		0x40
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7847818e5397..4be3aaae89be 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -519,9 +519,10 @@ struct pmu {
 					/* optional */
 
 	/*
-	 * Filter events for PMU-specific reasons.
+	 * Skip programming this PMU on the given CPU. Typically needed for
+	 * big.LITTLE things.
 	 */
-	int (*filter_match)		(struct perf_event *event); /* optional */
+	bool (*filter)			(struct pmu *pmu, int cpu); /* optional */
 
 	/*
 	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
diff --git a/kernel/events/core.c b/kernel/events/core.c
index c6b64a48dea6..180842ba8473 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2181,38 +2181,11 @@ static bool is_orphaned_event(struct perf_event *event)
 	return event->state == PERF_EVENT_STATE_DEAD;
 }
 
-static inline int __pmu_filter_match(struct perf_event *event)
-{
-	struct pmu *pmu = event->pmu;
-	return pmu->filter_match ? pmu->filter_match(event) : 1;
-}
-
-/*
- * Check whether we should attempt to schedule an event group based on
- * PMU-specific filtering. An event group can consist of HW and SW events,
- * potentially with a SW leader, so we must check all the filters, to
- * determine whether a group is schedulable:
- */
-static inline int pmu_filter_match(struct perf_event *event)
-{
-	struct perf_event *sibling;
-
-	if (!__pmu_filter_match(event))
-		return 0;
-
-	for_each_sibling_event(sibling, event) {
-		if (!__pmu_filter_match(sibling))
-			return 0;
-	}
-
-	return 1;
-}
-
 static inline int
 event_filter_match(struct perf_event *event)
 {
 	return (event->cpu == -1 || event->cpu == smp_processor_id()) &&
-	       perf_cgroup_match(event) && pmu_filter_match(event);
+	       perf_cgroup_match(event);
 }
 
 static void
@@ -3661,6 +3634,9 @@ static noinline int visit_groups_merge(struct perf_event_context *ctx,
 	struct perf_event **evt;
 	int ret;
 
+	if (pmu->filter && pmu->filter(pmu, cpu))
+		return 0;
+
 	if (!ctx->task) {
 		cpuctx = this_cpu_ptr(&cpu_context);
 		event_heap = (struct min_heap){

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
                     ` (6 preceding siblings ...)
  2022-06-27  4:18   ` Ravi Bangoria
@ 2022-08-24 12:15   ` Peter Zijlstra
  2022-08-24 14:59     ` Peter Zijlstra
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-24 12:15 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>  void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
>  {
> -	struct perf_cpu_context *cpuctx;
> +	/* XXX: Don't need this quirk anymore */
> +	/*struct perf_cpu_context *cpuctx;
>  
>  	if (!pmu->pmu_cpu_context)
>  		return;
>  
>  	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
> -	cpuctx->ctx.pmu = pmu;
> +	cpuctx->ctx.pmu = pmu;*/
>  }

Confirmed; my ADL seems to work fine without all that.

---
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index fd043cd0e3c9..7a2d12ad6d1f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2059,24 +2059,6 @@ void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
 	pr_info("... event mask:             %016Lx\n", intel_ctrl);
 }
 
-/*
- * The generic code is not hybrid friendly. The hybrid_pmu->pmu
- * of the first registered PMU is unconditionally assigned to
- * each possible cpuctx->ctx.pmu.
- * Update the correct hybrid PMU to the cpuctx->ctx.pmu.
- */
-void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
-{
-	/* XXX: Don't need this quirk anymore */
-	/*struct perf_cpu_context *cpuctx;
-
-	if (!pmu->pmu_cpu_context)
-		return;
-
-	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-	cpuctx->ctx.pmu = pmu;*/
-}
-
 static int __init init_hw_perf_events(void)
 {
 	struct x86_pmu_quirk *quirk;
@@ -2197,9 +2179,6 @@ static int __init init_hw_perf_events(void)
 						(hybrid_pmu->cpu_type == hybrid_big) ? PERF_TYPE_RAW : -1);
 			if (err)
 				break;
-
-			if (cpu_type == hybrid_pmu->cpu_type)
-				x86_pmu_update_cpu_context(&hybrid_pmu->pmu, raw_smp_processor_id());
 		}
 
 		if (i < x86_pmu.num_hybrid_pmus) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8a72e6fe27a5..768771e5e4e9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4508,8 +4508,6 @@ static bool init_hybrid_pmu(int cpu)
 	cpumask_set_cpu(cpu, &pmu->supported_cpus);
 	cpuc->pmu = &pmu->pmu;
 
-	x86_pmu_update_cpu_context(&pmu->pmu, cpu);
-
 	return true;
 }
 
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 94fb65d7b291..9c835ecb232e 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1175,8 +1175,6 @@ int x86_pmu_handle_irq(struct pt_regs *regs);
 void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
 			  u64 intel_ctrl);
 
-void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu);
-
 extern struct event_constraint emptyconstraint;
 
 extern struct event_constraint unconstrained;

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-17 13:36   ` Peter Zijlstra
@ 2022-08-24 10:13     ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-24 10:13 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Fri, Jun 17, 2022 at 03:36:51PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> > +/* XXX: No need of list now. Convert it to per-cpu variable */
> >  static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
> 
> Something like so I suppose...
> 

I need this on top to avoid a splat on perf_cgroup_attach()

---


diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0cd81a3ef374..c6b64a48dea6 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13536,9 +13536,12 @@ static int perf_cgroup_css_online(struct cgroup_subsys_state *css)
 static int __perf_cgroup_move(void *info)
 {
 	struct task_struct *task = info;
-	rcu_read_lock();
-	perf_cgroup_switch(task);
-	rcu_read_unlock();
+
+	preempt_disable();
+	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
+		perf_cgroup_switch(task);
+	preempt_enable();
+
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-24  7:27           ` Peter Zijlstra
@ 2022-08-24  7:53             ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-24  7:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 24-Aug-22 12:57 PM, Peter Zijlstra wrote:
> On Wed, Aug 24, 2022 at 10:37:36AM +0530, Ravi Bangoria wrote:
> 
>>> Now, I suppose making that:
>>>
>>>   {-1, NULL, NULL}, {cpu, NULL, NULL}
>>>
>>> could work, but wouldn't iterating the tree be more expensive than
>>> just finding the sub-trees as we do now?
>>
>> pmu=NULL can be used while scheduling entire context. We can just traverse
>> through all pmu events of both cpu subtrees.
> 
> > But imagine the case where we have 50 events for a PMU that can only
> schedule 8. Then we have to iterate 42 events for naught instead of
> directly jumping to the next PMU.

Yes, that needs to be handled. And, IIRC, you proposed maintaining a list
of the leftmost event from each pmu subtree.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-24  5:07         ` Ravi Bangoria
@ 2022-08-24  7:27           ` Peter Zijlstra
  2022-08-24  7:53             ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-24  7:27 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Wed, Aug 24, 2022 at 10:37:36AM +0530, Ravi Bangoria wrote:

> > Now, I suppose making that:
> > 
> >   {-1, NULL, NULL}, {cpu, NULL, NULL}
> > 
> > could work, but wouldn't iterating the tree be more expensive than
> > just finding the sub-trees as we do now?
> 
> pmu=NULL can be used while scheduling entire context. We can just traverse
> through all pmu events of both cpu subtrees.

But imagine the case where we have 50 events for a PMU that can only
schedule 8. Then we have to iterate 42 events for naught instead of
directly jumping to the next PMU.



^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-23  8:57       ` Peter Zijlstra
@ 2022-08-24  5:07         ` Ravi Bangoria
  2022-08-24  7:27           ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-24  5:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 23-Aug-22 2:27 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
>> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
>>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> 
>>>> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>>>>  {
>>>> +	struct perf_event_pmu_context *pmu_ctx;
>>>>  	int can_add_hw = 1;
>>>>  
>>>> -	if (ctx != &cpuctx->ctx)
>>>> -		cpuctx = NULL;
>>>> -
>>>> -	visit_groups_merge(cpuctx, &ctx->pinned_groups,
>>>> -			   smp_processor_id(),
>>>> -			   merge_sched_in, &can_add_hw);
>>>> +	if (pmu) {
>>>> +		visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> +				   smp_processor_id(), pmu,
>>>> +				   merge_sched_in, &can_add_hw);
>>>> +	} else {
>>>> +		/*
>>>> +		 * XXX: This can be optimized for per-task context by calling
>>>> +		 * visit_groups_merge() only once with:
>>>> +		 *   1) pmu=NULL
>>>> +		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
>>>> +		 *   3) Making can_add_hw a per-pmu variable
>>>> +		 *
>>>> +		 * Though, it can not be opimized for per-cpu context because
>>>> +		 * per-cpu rb-tree consist of pmu-subtrees and pmu-subtrees
>>>> +		 * consist of cgroup-subtrees. i.e. a cgroup events of same
>>>> +		 * cgroup but different pmus are seperated out into respective
>>>> +		 * pmu-subtrees.
>>>> +		 */
>>>> +		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>>> +			can_add_hw = 1;
>>>> +			visit_groups_merge(ctx, &ctx->pinned_groups,
>>>> +					   smp_processor_id(), pmu_ctx->pmu,
>>>> +					   merge_sched_in, &can_add_hw);
>>>> +		}
>>>> +	}
>>>>  }
>>>
>>> I'm not sure I follow.. task context can have multiple PMUs just the
>>> same as CPU context can, that's more or less the entire point of the
>>> patch.
>>
>> Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
>> task specific context is {cpu, group_idx} because cgroup_id is always 0. And
>> effective key for cpu specific context is {cgroup_id, group_idx} because cpu
>> is same for entire rbtree.
>>
>> With New design, rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
>> explained above, effective key for task specific context will be {cpu, pmu,
>> group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
>> did in the very first RFC[1]. (This may make things more complicated though
>> because we might also need to increase heap size to accommodate all pmu events
>> in single heap. Current heap size is 2 for task specific context, which is
>> sufficient if we iterate over all pmus).
>>
>> Same optimization won't work for cpu specific context because, it's effective
>> key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
>> cgroup subtrees.
> 
> Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}
> 
> Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
> leftmost group_idx event to schedule next.
> 
> However, since cgroup events are per-cpu events, per-task events will
> always have cgroup=NULL. Resulting in the subtrees:
> 
>   {-1, pmu, NULL} and {cpu, pmu, NULL}
> 
> Which is what the code does, it iterates ctx->pmu_ctx_list to find all
> @pmu values and then for each does the schedule dance.
> 
> Now, I suppose making that:
> 
>   {-1, NULL, NULL}, {cpu, NULL, NULL}
> 
> could work, but wouldn't iterating the tree be more expensive than
> just finding the sub-trees as we do now?

pmu=NULL can be used while scheduling entire context. We can just traverse
through all pmu events of both cpu subtrees.

> 
> You also talk about extending the heap, which I read as
> doing the heap-merge over:
> 
>  {-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
>  {cpu, pmu0, NULL}, ...
> 
> But that doesn't make sense, the schedule dance is per-pmu.
> 
> Or am I just still not getting it?

Ok. Let's not complicate the design. We can go with the current approach of
iterating over all pmus in the first phase and think about optimizing it
later.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-23  7:26   ` Peter Zijlstra
@ 2022-08-23 15:14     ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-23 15:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria


>> Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
>> (is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?
> 
> So that code is identical to what it currently is upstream; but yes that
> looks somewhat dodgy.
> 
> So the code itself (does as the comment says) starts time.

Got it.

> This should only be done if EVENT_TIME is not set.

Does that mean context time should be started only when the context is getting
scheduled in, i.e. when ctx->is_active is 0?

> That is, I'm thinking it should be something like:
> 
> 	!(is_active & EVENT_TIME)
> 
> which happens to be the same as:
> 
> 	is_active ^ EVENT_TIME
> 
> under the assumption is_active contains no other bits -- which I don't
> think is a valid assumption.

Correct, we can't assume that. There are cases where we call
ctx_sched_out(EVENT_TIME) followed by ctx_sched_in(EVENT_TIME) when PINNED /
FLEXIBLE are also set in ctx->is_active. For ex, perf_event_enable_on_exec().
In such cases, we will not advance ctx->time. Example:

  child()
  {
      ...
      execv();
  }

  main()
  {
      pid = fork();

      attr.enable_on_exec = 0;
      fd0 = perf_event_open(&attr, pid, -1, -1, 0);
      ...
      wait(NULL);
  }

Here execv() will cause a call to ctx_sched_in() --> __update_context_time()
with adv=false. I think that's fine; sometime later we will advance
ctx->time anyway.

Sorry, I've not spent enough time with this timekeeping code. Please let
me know if I'm talking nonsense.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:16     ` Ravi Bangoria
@ 2022-08-23  8:57       ` Peter Zijlstra
  2022-08-24  5:07         ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-23  8:57 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:46:32AM +0530, Ravi Bangoria wrote:
> On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
> > On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:

> >> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
> >>  {
> >> +	struct perf_event_pmu_context *pmu_ctx;
> >>  	int can_add_hw = 1;
> >>  
> >> -	if (ctx != &cpuctx->ctx)
> >> -		cpuctx = NULL;
> >> -
> >> -	visit_groups_merge(cpuctx, &ctx->pinned_groups,
> >> -			   smp_processor_id(),
> >> -			   merge_sched_in, &can_add_hw);
> >> +	if (pmu) {
> >> +		visit_groups_merge(ctx, &ctx->pinned_groups,
> >> +				   smp_processor_id(), pmu,
> >> +				   merge_sched_in, &can_add_hw);
> >> +	} else {
> >> +		/*
> >> +		 * XXX: This can be optimized for per-task context by calling
> >> +		 * visit_groups_merge() only once with:
> >> +		 *   1) pmu=NULL
> >> +		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
> >> +		 *   3) Making can_add_hw a per-pmu variable
> >> +		 *
> >> +		 * Though, it can not be opimized for per-cpu context because
> >> +		 * per-cpu rb-tree consist of pmu-subtrees and pmu-subtrees
> >> +		 * consist of cgroup-subtrees. i.e. a cgroup events of same
> >> +		 * cgroup but different pmus are seperated out into respective
> >> +		 * pmu-subtrees.
> >> +		 */
> >> +		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> >> +			can_add_hw = 1;
> >> +			visit_groups_merge(ctx, &ctx->pinned_groups,
> >> +					   smp_processor_id(), pmu_ctx->pmu,
> >> +					   merge_sched_in, &can_add_hw);
> >> +		}
> >> +	}
> >>  }
> > 
> > I'm not sure I follow.. task context can have multiple PMUs just the
> > same as CPU context can, that's more or less the entire point of the
> > patch.
> 
> Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
> task specific context is {cpu, group_idx} because cgroup_id is always 0. And
> effective key for cpu specific context is {cgroup_id, group_idx} because cpu
> is same for entire rbtree.
> 
> With New design, rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
> explained above, effective key for task specific context will be {cpu, pmu,
> group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
> did in the very first RFC[1]. (This may make things more complicated though
> because we might also need to increase heap size to accommodate all pmu events
> in single heap. Current heap size is 2 for task specific context, which is
> sufficient if we iterate over all pmus).
> 
> Same optimization won't work for cpu specific context because, it's effective
> key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
> cgroup subtrees.

Agreed, new order is: {cpu, pmu, cgroup_id, group_idx}

Event scheduling looks at the {cpu, pmu, cgroup_id} subtree to find the
leftmost group_idx event to schedule next.
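
Roughly, that ordering corresponds to a compare like the sketch below (not
the actual perf_event_groups_cmp(); 'cgrp_id' stands in for the real cgroup
key lookup):

	static int group_key_cmp(const struct perf_event *l,
				 const struct perf_event *r)
	{
		if (l->cpu != r->cpu)
			return l->cpu < r->cpu ? -1 : 1;

		if (l->pmu_ctx->pmu != r->pmu_ctx->pmu)	/* new: per-pmu subtrees */
			return l->pmu_ctx->pmu < r->pmu_ctx->pmu ? -1 : 1;

		if (l->cgrp_id != r->cgrp_id)	/* 'cgrp_id' is a stand-in; 0 for per-task events */
			return l->cgrp_id < r->cgrp_id ? -1 : 1;

		if (l->group_index != r->group_index)
			return l->group_index < r->group_index ? -1 : 1;

		return 0;
	}

so any fixed {cpu, pmu, cgroup} prefix selects one contiguous subtree.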

However, since cgroup events are per-cpu events, per-task events will
always have cgroup=NULL. Resulting in the subtrees:

  {-1, pmu, NULL} and {cpu, pmu, NULL}

Which is what the code does, it iterates ctx->pmu_ctx_list to find all
@pmu values and then for each does the schedule dance.

Now, I suppose making that:

  {-1, NULL, NULL}, {cpu, NULL, NULL}

could work, but wouldn't iterating the tree be more expensive than
just finding the sub-trees as we do now?

You also talk about extending the heap, which I read as
doing the heap-merge over:

 {-1, pmu0, NULL}, {-1, pmu1, NULL}, ...
 {cpu, pmu0, NULL}, ...

But that doesn't make sense, the schedule dance is per-pmu.

Or am I just still not getting it?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:17 ` Ravi Bangoria
@ 2022-08-23  7:26   ` Peter Zijlstra
  2022-08-23 15:14     ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-23  7:26 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:47:24AM +0530, Ravi Bangoria wrote:
> [...]
> 
> >  static void
> > -ctx_sched_in(struct perf_event_context *ctx,
> > -	     struct perf_cpu_context *cpuctx,
> > -	     enum event_type_t event_type,
> > +ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
> >  	     struct task_struct *task)
> >  {
> > +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
> >  	int is_active = ctx->is_active;
> >  	u64 now;
> >  
> > @@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
> >  		/* start ctx time */
> >  		now = perf_clock();
> >  		ctx->timestamp = now;
> > +		// XXX ctx->task =? task
> 
> Couldn't get this XXX, it's from your original patch. If you can recall, it
> would be helpful.

No memories at all; but looking at it, it seems to worry whether ctx->task is
up-to-date. In this context the only thing that relies on the task is
the cgroup, for which we update the timestamp in the next statement:

> >  		perf_cgroup_set_timestamp(task, ctx);

I suppose I should really write less cryptic notes; then again, I never
imagined this would take that many years to complete :/

> >  	}
> 
> Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
> (is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?

So that code is identical to what it currently is upstream; but yes that
looks somewhat dodgy.

So the code itself (does as the comment says) starts time. This should
only be done if EVENT_TIME is not set. That is, I'm thinking it should
be something like:

	!(is_active & EVENT_TIME)

which happens to be the same as:

	is_active ^ EVENT_TIME

under the assumption is_active contains no other bits -- which I don't
think is a valid assumption.
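
Concretely, with bit values assumed only for illustration (say EVENT_PINNED
= 0x2, EVENT_TIME = 0x4):

	int is_active = EVENT_TIME | EVENT_PINNED;	/* 0x6 */

	is_active ^ EVENT_TIME;		/* 0x2 -> true : the "start ctx time" block runs again */
	!(is_active & EVENT_TIME);	/* 0x0 -> false: time is left alone */

The two forms only agree when is_active carries no bits besides EVENT_TIME.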

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:13 ` Ravi Bangoria
@ 2022-08-23  7:10   ` Peter Zijlstra
  0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-23  7:10 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:43:03AM +0530, Ravi Bangoria wrote:
> [...]
> 
> >  /*
> > @@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> >  			struct perf_event_context *task_ctx,
> >  			enum event_type_t event_type)
> >  {
> > -	enum event_type_t ctx_event_type;
> >  	bool cpu_event = !!(event_type & EVENT_CPU);
> >  
> >  	/*
> > @@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> >  	if (event_type & EVENT_PINNED)
> >  		event_type |= EVENT_FLEXIBLE;
> >  
> > -	ctx_event_type = event_type & EVENT_ALL;
> > +	event_type &= EVENT_ALL;
> >  
> > -	perf_pmu_disable(cpuctx->ctx.pmu);
> > -	if (task_ctx)
> > -		task_ctx_sched_out(cpuctx, task_ctx, event_type);
> > +	perf_ctx_disable(&cpuctx->ctx);
> > +	if (task_ctx) {
> > +		perf_ctx_disable(task_ctx);
> > +		task_ctx_sched_out(task_ctx, event_type);
> > +	}
> >  
> >  	/*
> >  	 * Decide which cpu ctx groups to schedule out based on the types
> > @@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
> >  	 *  - otherwise, do nothing more.
> >  	 */
> >  	if (cpu_event)
> > -		cpu_ctx_sched_out(cpuctx, ctx_event_type);
> > -	else if (ctx_event_type & EVENT_PINNED)
> > -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> > +		ctx_sched_out(&cpuctx->ctx, event_type);
> > +	else if (event_type & EVENT_PINNED)
> > +		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
> >  
> >  	perf_event_sched_in(cpuctx, task_ctx, current);
> > -	perf_pmu_enable(cpuctx->ctx.pmu);
> > +
> > +	perf_ctx_enable(&cpuctx->ctx);
> > +	if (task_ctx)
> > +		perf_ctx_enable(task_ctx);
> >  }
> 
> ctx_resched() reschedules the entire perf_event_context when adding a new
> event to the context or enabling an existing event in it. We can probably
> optimize it by rescheduling only the affected pmu_ctx.

Yes, it would probably make sense to add a pmu argument there and limit
the rescheduling where possible.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 16:37           ` Ravi Bangoria
  2022-08-23  4:20             ` Ravi Bangoria
@ 2022-08-23  6:30             ` Peter Zijlstra
  2022-08-29  4:00             ` Ravi Bangoria
  2 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-23  6:30 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Aug 22, 2022 at 10:07:45PM +0530, Ravi Bangoria wrote:
> On 22-Aug-22 9:13 PM, Peter Zijlstra wrote:
> > On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
> >> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> >>>
> >>>> pulling up the ctx->mutex makes things simpler, but also violates the
> >>>> locking order vs exec_update_lock.
> >>>>
> >>>> Pull that lock up as well...
> >>>
> >>> I'm not able to apply this patch as is but I get the idea. Few
> >>> questions below...
> >>
> >> I was just about to rebase the 'series' to current, let me do that and
> >> get back to you on the specifics.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite
> 
> Additional set of changes on top of this tree is required to build and boot,
> at least on my AMD machine:

Right; clearly I didn't even test build that thing... your changes look
fine and I've added them on top, so the above tree should be updated.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 16:52       ` Peter Zijlstra
@ 2022-08-23  4:57         ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-23  4:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 22-Aug-22 10:22 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> 
>>> @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
>>>  			goto err_context;
>>>  	}
>>>  
>>> -	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
>>> -	if (IS_ERR(event_file)) {
>>> -		err = PTR_ERR(event_file);
>>> -		event_file = NULL;
>>> -		goto err_context;
>>> -	}
>>> -
>>> -	if (task) {
>>> -		err = down_read_interruptible(&task->signal->exec_update_lock);
>>> -		if (err)
>>> -			goto err_file;
>>> -
>>> -		/*
>>> -		 * We must hold exec_update_lock across this and any potential
>>> -		 * perf_install_in_context() call for this new event to
>>> -		 * serialize against exec() altering our credentials (and the
>>> -		 * perf_event_exit_task() that could imply).
>>> -		 */
>>> -		err = -EACCES;
>>> -		if (!perf_check_permission(&attr, task))
>>> -			goto err_cred;
>>> -	}
>>> -
>>> -	if (ctx->task == TASK_TOMBSTONE) {
>>> -		err = -ESRCH;
>>> -		goto err_locked;
>>> -	}
>>
>> I think we need to keep (ctx->task == TASK_TOMBSTONE) check?
> 
> I think so too; in fact the code I have still has it, perhaps it was
> there right before this patch?
> 
>>> -
>>>  	if (!perf_event_validate_size(event)) {
>>>  		err = -E2BIG;
>>> -		goto err_locked;
>>> -	}
>>> -
>>> -	if (!task) {
>>> -		/*
>>> -		 * Check if the @cpu we're creating an event for is online.
>>> -		 *
>>> -		 * We use the perf_cpu_context::ctx::mutex to serialize against
>>> -		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
>>> -		 */
>>> -		struct perf_cpu_context *cpuctx =
>>> -			container_of(ctx, struct perf_cpu_context, ctx);
>>> -
>>> -		if (!cpuctx->online) {
>>> -			err = -ENODEV;
>>> -			goto err_locked;
>>> -		}
>>> +		goto err_context;
>>
>> Why did you remove this hunk? We should confirm whether cpu is online or not
>> before creating event. No?
> 
> Idem.
> 
> Perhaps it is best if we look at the end result of all these patches
> combined and then I'll fold the lot if we're in agreement and then we
> can forget about these intermediate steps.

Let me accumulate all these changes, rebase to v6.0-rc2 and send RFC v3.

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 16:44       ` Peter Zijlstra
@ 2022-08-23  4:46         ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-23  4:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 22-Aug-22 10:14 PM, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:40:34AM +0530, Ravi Bangoria wrote:
>> On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
>>> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>>>> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
>>>>  		goto err_task;
>>>>  	}
>>>>  
>>>> +	// XXX premature; what if this is allowed, but we get moved to a PMU
>>>> +	// that doesn't have this.
>>>>  	if (is_sampling_event(event)) {
>>>>  		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>>>>  			err = -EOPNOTSUPP;
>>>
>>> No; this really should be against the event's native PMU. If the event
>>> can't natively sample, it can't sample when placed in another group
>>> either.
>>
>> Right. But IIUC, the question was, would there be any issue if we allow
>> grouping of perf_sw_context sampling event as group leader and
>> perf_{hw|invalid}_context counting event as group member. I think no. It
>> should just work fine. And, there could be real usecases of it as you
>> described in one old thread[1].
> 
> Like you I need to bend my brain around this again, but I'm not seeing a
> contradiction. The use-case from [1] is a software sampler with a bunch
> of non-sampling uncore events.
> 
> The uncore events aren't sampling, they are simply read by the software
> event (SAMPLE_READ). And moving the sampling software event to the
> non-sample capable uncore PMU shouldn't matter.

Ok.

> That is; the code as it stands here seems right, we should check
> is_sampling_event() against an event's native pmu->capabilities.
> 
> Or am I misunderstanding things?

No, that's correct. We must use the event's native pmu to check capabilities.
I'll remove this comment from the code.
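
That is, the check keeps using the event's own pmu, roughly (a sketch; the
error label is illustrative, not the real one):

	if (is_sampling_event(event) &&
	    (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT)) {
		/* the event's native PMU cannot sample: reject as before */
		err = -EOPNOTSUPP;
		goto err;	/* placeholder label */
	}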

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 16:37           ` Ravi Bangoria
@ 2022-08-23  4:20             ` Ravi Bangoria
  2022-08-29  3:54               ` Ravi Bangoria
  2022-08-23  6:30             ` Peter Zijlstra
  2022-08-29  4:00             ` Ravi Bangoria
  2 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-23  4:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria


> With this, I can run 'perf test' and perf_event_tests without any error in
> dmesg. I'll run the perf fuzzer overnight and see if it reports any issues.

I hit a kernel crash with the fuzzer. I've yet to debug it. Here is the trace:

  BUG: kernel NULL pointer dereference, address: 0000000000000198
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: 0000 [#1] PREEMPT SMP NOPTI
  CPU: 48 PID: 0 Comm: swapper/48 Not tainted 6.0.0-rc1-perf-event-context-peter-queue+ #153
  Hardware name: Dell Inc. PowerEdge R6515/07PXPY, BIOS 2.7.3 03/31/2022
  RIP: 0010:x86_pmu_enable_event+0x3c/0x120
  Code: a0 63 82 e8 26 7c cd 00 65 8b 05 4f 29 01 7f 85 c0 75 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 84 a0 63 82 e8 04 7c cd 00 <8b> 8b 98 01 00 00 65 48 8b 2d 2e 3a 01 7f 85 c9 0f 85 9a 00 00 00
  RSP: 0018:ffffc900004e7d78 EFLAGS: 00010002
  RAX: 0000000000000030 RBX: 0000000000000000 RCX: 00000000c0010200
  RDX: 0000000000000000 RSI: ffffffff8263a084 RDI: ffffffff825d5466
  RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000006 R11: ffffc900004e7ba0 R12: ffff88bff6019c60
  R13: ffff88bff6019e60 R14: ffffffff82c35220 R15: ffffc9003ca83d38
  FS:  0000000000000000(0000) GS:ffff88bff6000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000198 CR3: 000000407be26003 CR4: 0000000000770ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  PKRU: 55555554
  Call Trace:
   <TASK>
   amd_pmu_enable_all+0x68/0xb0
   ctx_resched+0xd9/0x150
   event_function+0xb8/0x130
   ? hrtimer_start_range_ns+0x141/0x4a0
   ? perf_duration_warn+0x30/0x30
   remote_function+0x4d/0x60
   __flush_smp_call_function_queue+0xc4/0x500
   flush_smp_call_function_queue+0x11d/0x1b0
   do_idle+0x18f/0x2d0
   cpu_startup_entry+0x19/0x20
   start_secondary+0x121/0x160
   secondary_startup_64_no_verify+0xe5/0xeb
   </TASK>
  Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables libcrc32c nfnetlink intel_rapl_msr intel_rapl_common kvm_amd kvm ipmi_ssif wmi_bmof irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sp5100_tco rapl pcspkr acpi_ipmi ccp k10temp i2c_piix4 wmi ipmi_si acpi_power_meter vfat fat ext4 mbcache
  g200 i2c_algo_bit drm_shmem_helper drm_kms_helper sg syscopyarea nvme sysfillrect sysimgblt fb_sys_fops nvme_core ahci libahci t10_pi drm crc32c_intel tg3 crc64_rocksoft libata crc64 megaraid_sas ipmi_devintf ipmi_msghandler fuse
  CR2: 0000000000000198
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:x86_pmu_enable_event+0x3c/0x120
  Code: a0 63 82 e8 26 7c cd 00 65 8b 05 4f 29 01 7f 85 c0 75 0b 5b 5d 41 5c 41 5d c3 cc cc cc cc 48 c7 c7 84 a0 63 82 e8 04 7c cd 00 <8b> 8b 98 01 00 00 65 48 8b 2d 2e 3a 01 7f 85 c9 0f 85 9a 00 00 00
  RSP: 0018:ffffc900004e7d78 EFLAGS: 00010002
  RAX: 0000000000000030 RBX: 0000000000000000 RCX: 00000000c0010200
  RDX: 0000000000000000 RSI: ffffffff8263a084 RDI: ffffffff825d5466
  RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
  R10: 0000000000000006 R11: ffffc900004e7ba0 R12: ffff88bff6019c60
  R13: ffff88bff6019e60 R14: ffffffff82c35220 R15: ffffc9003ca83d38
  FS:  0000000000000000(0000) GS:ffff88bff6000000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000198 CR3: 000000407be26003 CR4: 0000000000770ee0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
  PKRU: 55555554
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception ]---

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:11     ` Ravi Bangoria
  2022-08-22 15:29       ` Peter Zijlstra
@ 2022-08-22 16:52       ` Peter Zijlstra
  2022-08-23  4:57         ` Ravi Bangoria
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-22 16:52 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:

> > @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
> >  			goto err_context;
> >  	}
> >  
> > -	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
> > -	if (IS_ERR(event_file)) {
> > -		err = PTR_ERR(event_file);
> > -		event_file = NULL;
> > -		goto err_context;
> > -	}
> > -
> > -	if (task) {
> > -		err = down_read_interruptible(&task->signal->exec_update_lock);
> > -		if (err)
> > -			goto err_file;
> > -
> > -		/*
> > -		 * We must hold exec_update_lock across this and any potential
> > -		 * perf_install_in_context() call for this new event to
> > -		 * serialize against exec() altering our credentials (and the
> > -		 * perf_event_exit_task() that could imply).
> > -		 */
> > -		err = -EACCES;
> > -		if (!perf_check_permission(&attr, task))
> > -			goto err_cred;
> > -	}
> > -
> > -	if (ctx->task == TASK_TOMBSTONE) {
> > -		err = -ESRCH;
> > -		goto err_locked;
> > -	}
> 
> I think we need to keep (ctx->task == TASK_TOMBSTONE) check?

I think so too; in fact the code I have still has it, perhaps it was
there right before this patch?

> > -
> >  	if (!perf_event_validate_size(event)) {
> >  		err = -E2BIG;
> > -		goto err_locked;
> > -	}
> > -
> > -	if (!task) {
> > -		/*
> > -		 * Check if the @cpu we're creating an event for is online.
> > -		 *
> > -		 * We use the perf_cpu_context::ctx::mutex to serialize against
> > -		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> > -		 */
> > -		struct perf_cpu_context *cpuctx =
> > -			container_of(ctx, struct perf_cpu_context, ctx);
> > -
> > -		if (!cpuctx->online) {
> > -			err = -ENODEV;
> > -			goto err_locked;
> > -		}
> > +		goto err_context;
> 
> Why did you remove this hunk? We should confirm whether cpu is online or not
> before creating event. No?

Idem.

Perhaps it is best if we look at the end result of all these patches
combined; then, if we're in agreement, I'll fold the lot and we
can forget about these intermediate steps.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:10     ` Ravi Bangoria
@ 2022-08-22 16:44       ` Peter Zijlstra
  2022-08-23  4:46         ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-22 16:44 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:40:34AM +0530, Ravi Bangoria wrote:
> On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
> > On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> >> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
> >>  		goto err_task;
> >>  	}
> >>  
> >> +	// XXX premature; what if this is allowed, but we get moved to a PMU
> >> +	// that doesn't have this.
> >>  	if (is_sampling_event(event)) {
> >>  		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
> >>  			err = -EOPNOTSUPP;
> > 
> > No; this really should be against the event's native PMU. If the event
> > can't natively sample, it can't sample when placed in another group
> > either.
> 
> Right. But IIUC, the question was, would there be any issue if we allow
> grouping of perf_sw_context sampling event as group leader and
> perf_{hw|invalid}_context counting event as group member. I think no. It
> should just work fine. And, there could be real usecases of it as you
> described in one old thread[1].

Like you I need to bend my brain around this again, but I'm not seeing a
contradiction. The use-case from [1] is a software sampler with a bunch
of non-sampling uncore events.

The uncore events aren't sampling, they are simply read by the software
event (SAMPLE_READ). And moving the sampling software event to the
non-sample capable uncore PMU shouldn't matter.

That is; the code as it stands here seems right, we should check
is_sampling_event() against an event's native pmu->capabilities.

Or am I misunderstanding things?

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 15:43         ` Peter Zijlstra
@ 2022-08-22 16:37           ` Ravi Bangoria
  2022-08-23  4:20             ` Ravi Bangoria
                               ` (2 more replies)
  0 siblings, 3 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-22 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 22-Aug-22 9:13 PM, Peter Zijlstra wrote:
> On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
>>>
>>>> pulling up the ctx->mutex makes things simpler, but also violates the
>>>> locking order vs exec_update_lock.
>>>>
>>>> Pull that lock up as well...
>>>
>>> I'm not able to apply this patch as is but I get the idea. Few
>>> questions below...
>>
>> I was just about to rebase the 'series' to current, let me do that and
>> get back to you on the specifics.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite

Additional set of changes on top of this tree is required to build and boot,
at least on my AMD machine:

---
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ccd231ea6a4e..94fb65d7b291 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1248,7 +1248,7 @@ static inline void amd_pmu_brs_add(struct perf_event *event)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->pmu_ctx->pmu);
 	cpuc->lbr_users++;
 	/*
 	 * No need to reset BRS because it is reset
@@ -1263,7 +1263,7 @@ static inline void amd_pmu_brs_del(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 
-	perf_sched_cb_dec(event->ctx->pmu);
+	perf_sched_cb_dec(event->pmu_ctx->pmu);
 }
 
 void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 31ae032d6783..086e37fa32be 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -843,7 +843,7 @@ static void perf_cgroup_switch(struct task_struct *task)
 
 	WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
 	if (READ_ONCE(cpuctx->cgrp) == cgrp)
-		continue;
+		return;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_ctx_disable(&cpuctx->ctx);
@@ -881,7 +881,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
 		heap_size++;
 
 	for_each_possible_cpu(cpu) {
-		cpuctx = this_cpu_ptr(&cpu_context);
+		cpuctx = per_cpu_ptr(&cpu_context, cpu);
 		if (heap_size <= cpuctx->heap_size)
 			continue;
 
@@ -2315,7 +2315,7 @@ __perf_remove_from_context(struct perf_event *event,
 	if (!pmu_ctx->nr_events) {
 		pmu_ctx->rotate_necessary = 0;
 
-		if (ctx->task) {
+		if (ctx->task && ctx->is_active) {
 			struct perf_cpu_pmu_context *cpc;
 
 			cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
@@ -11972,6 +11972,15 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
 	goto out;
 }
 
+static void mutex_lock_double(struct mutex *a, struct mutex *b)
+{
+	if (b < a)
+		swap(a, b);
+
+	mutex_lock(a);
+	mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
+}
+
 static int
 perf_event_set_output(struct perf_event *event, struct perf_event *output_event)
 {
---

With this, I can run 'perf test' and perf_event_tests without any error in
dmesg. I'll run the perf fuzzer overnight and see if it reports any issues.

Thanks,
Ravi

^ permalink raw reply related	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-22 15:29       ` Peter Zijlstra
@ 2022-08-22 15:43         ` Peter Zijlstra
  2022-08-22 16:37           ` Ravi Bangoria
  0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-22 15:43 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Aug 22, 2022 at 05:29:11PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> > 
> > > pulling up the ctx->mutex makes things simpler, but also violates the
> > > locking order vs exec_update_lock.
> > > 
> > > Pull that lock up as well...
> > 
> > I'm not able to apply this patch as is but I get the idea. Few
> > questions below...
> 
> I was just about to rebase the 'series' to current, let me do that and
> get back to you on the specifics.

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/wip.rewrite

I need to go make dinner, but I'll try and remember how it all was
supposed to work later this evening when the loonies are in bed.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-08-02  6:11     ` Ravi Bangoria
@ 2022-08-22 15:29       ` Peter Zijlstra
  2022-08-22 15:43         ` Peter Zijlstra
  2022-08-22 16:52       ` Peter Zijlstra
  1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-08-22 15:29 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Tue, Aug 02, 2022 at 11:41:42AM +0530, Ravi Bangoria wrote:
> 
> > pulling up the ctx->mutex makes things simpler, but also violates the
> > locking order vs exec_update_lock.
> > 
> > Pull that lock up as well...
> 
> I'm not able to apply this patch as is but I get the idea. Few
> questions below...

I was just about to rebase the 'series' to current, let me do that and
get back to you on the specifics.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
                   ` (4 preceding siblings ...)
  2022-08-02  6:17 ` Ravi Bangoria
@ 2022-08-22 14:40 ` Ravi Bangoria
  5 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-22 14:40 UTC (permalink / raw)
  To: peterz
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria


> @@ -915,7 +925,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
>  		heap_size++;
>  
>  	for_each_possible_cpu(cpu) {
> -		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
> +		cpuctx = this_cpu_ptr(&cpu_context);

This should be fixed as well:

s/this_cpu_ptr(&cpu_context)/per_cpu_ptr(&cpu_context, cpu)/
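
With that, the loop sizes every CPU's heap rather than repeatedly looking at
whichever CPU happens to run perf_cgroup_ensure_storage() (a sketch of the
fixed form, matching the earlier fixup):

	for_each_possible_cpu(cpu) {
		cpuctx = per_cpu_ptr(&cpu_context, cpu);	/* the instance for @cpu */
		if (heap_size <= cpuctx->heap_size)
			continue;
		/* otherwise grow this CPU's storage, as in the patch */
	}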

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:41   ` Peter Zijlstra
@ 2022-08-22 14:38     ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-22 14:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

[...]

> You mentioned trouble with cpc->task_epc, there's one rebase mistake
> from you and an original bug from me.
> 
> You lost the last hunk, I forgot to clear cpc on
> perf_remove_from_context().
> 
> With these fixes I can run: 'perf test' without things going
> insta-splat.
> 
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2311,6 +2311,7 @@ __perf_remove_from_context(struct perf_e
>  			   struct perf_event_context *ctx,
>  			   void *info)
>  {
> +	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
>  	unsigned long flags = (unsigned long)info;
>  
>  	if (ctx->is_active & EVENT_TIME) {
> @@ -2325,8 +2326,17 @@ __perf_remove_from_context(struct perf_e
>  		perf_child_detach(event);
>  	list_del_event(event, ctx);
>  
> -	if (!event->pmu_ctx->nr_events)
> -		event->pmu_ctx->rotate_necessary = 0;
> +	if (!pmu_ctx->nr_events) {
> +		pmu_ctx->rotate_necessary = 0;
> +
> +		if (ctx->task) {

IIUC, this should also check for ctx->is_active? i.e.

		if (ctx->task && ctx->is_active) {
			...

> +			struct perf_cpu_pmu_context *cpc;
> +
> +			cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
> +			WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
> +			cpc->task_epc = NULL;
> +		}
> +	}

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
                   ` (3 preceding siblings ...)
  2022-08-02  6:13 ` Ravi Bangoria
@ 2022-08-02  6:17 ` Ravi Bangoria
  2022-08-23  7:26   ` Peter Zijlstra
  2022-08-22 14:40 ` Ravi Bangoria
  5 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:17 UTC (permalink / raw)
  To: peterz
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

[...]

>  static void
> -ctx_sched_in(struct perf_event_context *ctx,
> -	     struct perf_cpu_context *cpuctx,
> -	     enum event_type_t event_type,
> +ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
>  	     struct task_struct *task)
>  {
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
>  	int is_active = ctx->is_active;
>  	u64 now;
>  
> @@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
>  		/* start ctx time */
>  		now = perf_clock();
>  		ctx->timestamp = now;
> +		// XXX ctx->task =? task

Couldn't get this XXX, it's from your original patch. If you can recall, it
would be helpful.

>  		perf_cgroup_set_timestamp(task, ctx);
>  	}

Also, this hunk is under if (is_active ^ EVENT_TIME), which effectively is
(is_active != EVENT_TIME). I'm assuming it should be (is_active & EVENT_TIME)?

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:43   ` Peter Zijlstra
@ 2022-08-02  6:16     ` Ravi Bangoria
  2022-08-23  8:57       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 13-Jun-22 8:13 PM, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> 
>> @@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
>>  			.size = ARRAY_SIZE(itrs),
>>  		};
>>  		/* Events not within a CPU context may be on any CPU. */
>> -		__heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
>> +		__heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
>>  	}
>>  	evt = event_heap.data;
>>  
>> -	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
>> +	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
>>  
>>  #ifdef CONFIG_CGROUP_PERF
>>  	for (; css; css = css->parent)
>> -		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
>> +		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
>>  #endif
>>  
>> +	if (event_heap.nr) {
>> +		/*
>> +		 * XXX: For now, visit_groups_merge() gets called with pmu
>> +		 * pointer never NULL. But these functions needs to be called
>> +		 * once for each pmu if I implement pmu=NULL optimization.
>> +		 */
>> +		__link_epc((*evt)->pmu_ctx);
>> +		perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
>> +	}
>> +
>> +
>>  	min_heapify_all(&event_heap, &perf_min_heap);
>>  
>>  	while (event_heap.nr) {
> 
>> @@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
>>  	return 0;
>>  }
>>  
>> -static void
>> -ctx_pinned_sched_in(struct perf_event_context *ctx,
>> -		    struct perf_cpu_context *cpuctx)
>> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>>  {
>> +	struct perf_event_pmu_context *pmu_ctx;
>>  	int can_add_hw = 1;
>>  
>> -	if (ctx != &cpuctx->ctx)
>> -		cpuctx = NULL;
>> -
>> -	visit_groups_merge(cpuctx, &ctx->pinned_groups,
>> -			   smp_processor_id(),
>> -			   merge_sched_in, &can_add_hw);
>> +	if (pmu) {
>> +		visit_groups_merge(ctx, &ctx->pinned_groups,
>> +				   smp_processor_id(), pmu,
>> +				   merge_sched_in, &can_add_hw);
>> +	} else {
>> +		/*
>> +		 * XXX: This can be optimized for per-task context by calling
>> +		 * visit_groups_merge() only once with:
>> +		 *   1) pmu=NULL
>> +		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
>> +		 *   3) Making can_add_hw a per-pmu variable
>> +		 *
>> +		 * Though, it can not be opimized for per-cpu context because
>> +		 * per-cpu rb-tree consist of pmu-subtrees and pmu-subtrees
>> +		 * consist of cgroup-subtrees. i.e. a cgroup events of same
>> +		 * cgroup but different pmus are seperated out into respective
>> +		 * pmu-subtrees.
>> +		 */
>> +		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>> +			can_add_hw = 1;
>> +			visit_groups_merge(ctx, &ctx->pinned_groups,
>> +					   smp_processor_id(), pmu_ctx->pmu,
>> +					   merge_sched_in, &can_add_hw);
>> +		}
>> +	}
>>  }
> 
> I'm not sure I follow.. task context can have multiple PMUs just the
> same as CPU context can, that's more or less the entire point of the
> patch.

Current rbtree key is {cpu, cgroup_id, group_idx}. However, effective key for
task specific context is {cpu, group_idx} because cgroup_id is always 0. And
effective key for cpu specific context is {cgroup_id, group_idx} because cpu
is same for entire rbtree.

With New design, rbtree key will be {cpu, pmu, cgroup_id, group_idx}. But as
explained above, effective key for task specific context will be {cpu, pmu,
group_idx}. Thus, we can handle pmu=NULL in visit_groups_merge(), same as you
did in the very first RFC[1]. (This may make things more complicated though
because we might also need to increase heap size to accommodate all pmu events
in single heap. Current heap size is 2 for task specific context, which is
sufficient if we iterate over all pmus).

Same optimization won't work for cpu specific context because, it's effective
key would be {pmu, cgroup_id, group_idx} i.e. each pmu subtree is made up of
cgroup subtrees.

Please correct me if my understanding is wrong.

Thanks,
Ravi

[1]:
https://lore.kernel.org/lkml/20181010104559.GO5728@hirez.programming.kicks-ass.net

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
                   ` (2 preceding siblings ...)
  2022-06-13 14:35 ` Peter Zijlstra
@ 2022-08-02  6:13 ` Ravi Bangoria
  2022-08-23  7:10   ` Peter Zijlstra
  2022-08-02  6:17 ` Ravi Bangoria
  2022-08-22 14:40 ` Ravi Bangoria
  5 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:13 UTC (permalink / raw)
  To: peterz
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

[...]

>  /*
> @@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  			struct perf_event_context *task_ctx,
>  			enum event_type_t event_type)
>  {
> -	enum event_type_t ctx_event_type;
>  	bool cpu_event = !!(event_type & EVENT_CPU);
>  
>  	/*
> @@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  	if (event_type & EVENT_PINNED)
>  		event_type |= EVENT_FLEXIBLE;
>  
> -	ctx_event_type = event_type & EVENT_ALL;
> +	event_type &= EVENT_ALL;
>  
> -	perf_pmu_disable(cpuctx->ctx.pmu);
> -	if (task_ctx)
> -		task_ctx_sched_out(cpuctx, task_ctx, event_type);
> +	perf_ctx_disable(&cpuctx->ctx);
> +	if (task_ctx) {
> +		perf_ctx_disable(task_ctx);
> +		task_ctx_sched_out(task_ctx, event_type);
> +	}
>  
>  	/*
>  	 * Decide which cpu ctx groups to schedule out based on the types
> @@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
>  	 *  - otherwise, do nothing more.
>  	 */
>  	if (cpu_event)
> -		cpu_ctx_sched_out(cpuctx, ctx_event_type);
> -	else if (ctx_event_type & EVENT_PINNED)
> -		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
> +		ctx_sched_out(&cpuctx->ctx, event_type);
> +	else if (event_type & EVENT_PINNED)
> +		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
>  
>  	perf_event_sched_in(cpuctx, task_ctx, current);
> -	perf_pmu_enable(cpuctx->ctx.pmu);
> +
> +	perf_ctx_enable(&cpuctx->ctx);
> +	if (task_ctx)
> +		perf_ctx_enable(task_ctx);
>  }

ctx_resched() reschedules the entire perf_event_context when adding a new
event to the context or enabling an existing event in it. We can probably
optimize it by rescheduling only the affected pmu_ctx.
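
Roughly something like the below -- only a sketch of the idea for a task
event; __pmu_ctx_sched_in() is an assumed per-pmu counterpart of
ctx_sched_in() that doesn't exist in the patch, and the cpu/task and
pinned/flexible ordering details of ctx_resched() are ignored:

  static void pmu_ctx_resched(struct perf_event_pmu_context *task_epc,
                              enum event_type_t event_type)
  {
          struct perf_cpu_pmu_context *cpc =
                  this_cpu_ptr(task_epc->pmu->cpu_pmu_context);

          if (event_type & EVENT_PINNED)
                  event_type |= EVENT_FLEXIBLE;

          perf_pmu_disable(task_epc->pmu);

          /* Only this pmu's events get scheduled out ... */
          __pmu_ctx_sched_out(task_epc, event_type);
          __pmu_ctx_sched_out(&cpc->epc, event_type);

          /* ... and back in: cpu pmu_ctx first, then the task pmu_ctx. */
          __pmu_ctx_sched_in(&cpc->epc, EVENT_ALL);       /* assumed helper */
          __pmu_ctx_sched_in(task_epc, EVENT_ALL);

          perf_pmu_enable(task_epc->pmu);
  }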

Thanks,
Ravi


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:38   ` Peter Zijlstra
@ 2022-08-02  6:11     ` Ravi Bangoria
  2022-08-22 15:29       ` Peter Zijlstra
  2022-08-22 16:52       ` Peter Zijlstra
  0 siblings, 2 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria


> pulling up the ctx->mutex makes things simpler, but also violates the
> locking order vs exec_update_lock.
> 
> Pull that lock up as well...

I'm not able to apply this patch as is, but I get the idea. A few
questions below...

> 
> ---
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -12254,13 +12254,29 @@ SYSCALL_DEFINE5(perf_event_open,
>  	if (pmu->task_ctx_nr == perf_sw_context)
>  		event->event_caps |= PERF_EV_CAP_SOFTWARE;
>  
> +	if (task) {
> +		err = down_read_interruptible(&task->signal->exec_update_lock);
> +		if (err)
> +			goto err_alloc;
> +
> +		/*
> +		 * We must hold exec_update_lock across this and any potential
> +		 * perf_install_in_context() call for this new event to
> +		 * serialize against exec() altering our credentials (and the
> +		 * perf_event_exit_task() that could imply).
> +		 */
> +		err = -EACCES;
> +		if (!perf_check_permission(&attr, task))
> +			goto err_cred;
> +	}
> +
>  	/*
>  	 * Get the target context (task or percpu):
>  	 */
>  	ctx = find_get_context(task, event);
>  	if (IS_ERR(ctx)) {
>  		err = PTR_ERR(ctx);
> -		goto err_alloc;
> +		goto err_cred;
>  	}
>  
>  	mutex_lock(&ctx->mutex);
> @@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
>  			goto err_context;
>  	}
>  
> -	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
> -	if (IS_ERR(event_file)) {
> -		err = PTR_ERR(event_file);
> -		event_file = NULL;
> -		goto err_context;
> -	}
> -
> -	if (task) {
> -		err = down_read_interruptible(&task->signal->exec_update_lock);
> -		if (err)
> -			goto err_file;
> -
> -		/*
> -		 * We must hold exec_update_lock across this and any potential
> -		 * perf_install_in_context() call for this new event to
> -		 * serialize against exec() altering our credentials (and the
> -		 * perf_event_exit_task() that could imply).
> -		 */
> -		err = -EACCES;
> -		if (!perf_check_permission(&attr, task))
> -			goto err_cred;
> -	}
> -
> -	if (ctx->task == TASK_TOMBSTONE) {
> -		err = -ESRCH;
> -		goto err_locked;
> -	}

I think we need to keep the (ctx->task == TASK_TOMBSTONE) check?

> -
>  	if (!perf_event_validate_size(event)) {
>  		err = -E2BIG;
> -		goto err_locked;
> -	}
> -
> -	if (!task) {
> -		/*
> -		 * Check if the @cpu we're creating an event for is online.
> -		 *
> -		 * We use the perf_cpu_context::ctx::mutex to serialize against
> -		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> -		 */
> -		struct perf_cpu_context *cpuctx =
> -			container_of(ctx, struct perf_cpu_context, ctx);
> -
> -		if (!cpuctx->online) {
> -			err = -ENODEV;
> -			goto err_locked;
> -		}
> +		goto err_context;

Why did you remove this hunk? We should confirm whether the cpu is online
or not before creating the event. No?
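
i.e. keep the removed hunk, adjusted to the new error labels; only a
sketch of what I mean:

  if (!task) {
          /*
           * Check if the @cpu we're creating an event for is online.
           * Same check as before, now against the per-cpu cpu_context.
           */
          struct perf_cpu_context *cpuctx =
                  per_cpu_ptr(&cpu_context, event->cpu);

          if (!cpuctx->online) {
                  err = -ENODEV;
                  goto err_context;
          }
  }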

Thanks,
Ravi


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:55   ` Peter Zijlstra
@ 2022-08-02  6:10     ` Ravi Bangoria
  2022-08-22 16:44       ` Peter Zijlstra
  0 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 13-Jun-22 8:25 PM, Peter Zijlstra wrote:
> On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
>> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
>>  		goto err_task;
>>  	}
>>  
>> +	// XXX premature; what if this is allowed, but we get moved to a PMU
>> +	// that doesn't have this.
>>  	if (is_sampling_event(event)) {
>>  		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>>  			err = -EOPNOTSUPP;
> 
> No; this really should be against the event's native PMU. If the event
> can't natively sample, it can't sample when placed in another group
> either.

Right. But IIUC, the question was: would there be any issue if we allow
grouping of a perf_sw_context sampling event as group leader and a
perf_{hw|invalid}_context counting event as group member? I think not; it
should just work fine. And there could be real use cases for it, as you
described in one old thread[1].

TL;DR

Although I can't find any such pmu combination on AMD (not considering
real sw pmus), I just tried the opposite scenario:

    Group leader: msr/tsc/ as counting event (perf_sw_context)
    Group member: ibs_op/cnt_ctl=1/ as sampling event (perf_invalid_context)

And a simple test program seems to work fine:

  #include <unistd.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <errno.h>
  #include <sys/stat.h>
  #include <fcntl.h>
  #include <string.h>
  #include <linux/perf_event.h>
  #include <sys/types.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <sys/ioctl.h>
  
  #define PAGE_SIZE               sysconf(_SC_PAGESIZE)
  
  #define PERF_MMAP_DATA_PAGES    256
  #define PERF_MMAP_DATA_SIZE     (PERF_MMAP_DATA_PAGES * PAGE_SIZE)
  #define PERF_MMAP_DATA_MASK     (PERF_MMAP_DATA_SIZE - 1)
  #define PERF_MMAP_TOTAL_PAGES   (PERF_MMAP_DATA_PAGES + 1)
  #define PERF_MMAP_TOTAL_SIZE    (PERF_MMAP_TOTAL_PAGES * PAGE_SIZE)
  
  #define rmb()   asm volatile("lfence":::"memory")
  
  struct perf_event {
          int fd;
          void *p;
  };
  
  static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
                             int cpu, int group_fd, unsigned long flags)
  {
          int fd = syscall(__NR_perf_event_open, attr, pid, cpu,
                          group_fd, flags);
          if (fd < 0)
                  perror("perf_event_open() failed.");
          return fd;
  }
  
  static void *perf_event_mmap(int fd)
  {
          void *p = mmap(NULL, PERF_MMAP_TOTAL_SIZE,
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  perror("mmap() failed.");
          return p;
  }
  
  static void
  copy_event_data(void *src, unsigned long offset, void *dest, size_t size)
  {
          size_t chunk1_size, chunk2_size;
  
          if ((offset + size) < PERF_MMAP_DATA_SIZE) {
                  memcpy(dest, src + offset, size);
          } else {
                  chunk1_size = PERF_MMAP_DATA_SIZE - offset;
                  chunk2_size = size - chunk1_size;
  
                  memcpy(dest, src + offset, chunk1_size);
                  memcpy(dest + chunk1_size, src, chunk2_size);
          }
  }
  
  static int mmap_read(struct perf_event_mmap_page *p, void *dest, size_t size)
  {
          void *base;
          unsigned long data_tail, data_head;
  
          /* Casting to (void *) is needed. */
          base = (void *)p + PAGE_SIZE;
  
          data_head = p->data_head;
          rmb();
          data_tail = p->data_tail;
  
          if ((data_head - data_tail) < size)
                  return -1;
  
          data_tail &= PERF_MMAP_DATA_MASK;
          copy_event_data(base, data_tail, dest, size);
          p->data_tail += size;
          return 0;
  }
  
  static void mmap_skip(struct perf_event_mmap_page *p, size_t size)
  {
          int data_head = p->data_head;
  
          rmb();
  
          if ((p->data_tail + size) > data_head)
                  p->data_tail = data_head;
          else
                  p->data_tail += size;
  }
  
  static void perf_read_event_details(struct perf_event_mmap_page *p)
  {
          struct perf_event_header hdr;
          unsigned int pid, tid;
  
          /*
           * PERF_RECORD_SAMPLE:
           * struct {
           *     struct perf_event_header hdr;
           *     u32 pid;                         // PERF_SAMPLE_TID
           *     u32 tid;                         // PERF_SAMPLE_TID
           * };
           */
          while(1) {
                  if (mmap_read(p, &hdr, sizeof(hdr)))
                          return;
  
                  if (hdr.type == PERF_RECORD_SAMPLE) {
                          if (mmap_read(p, &pid, sizeof(pid)))
                                  perror("Error reading pid.");
  
                          if (mmap_read(p, &tid, sizeof(tid)))
                                  perror("Error reading tid.");
  
                          printf("pid: %d, tid: %d\n", pid, tid);
                  } else {
                          mmap_skip(p, hdr.size - sizeof(hdr));
                  }
          }
  }
  
  int main(int argc, char *argv[])
  {
          struct perf_event_attr attr;
          struct perf_event events[2];
          int i;
          long long count1, count2;
  
          memset(&attr, 0, sizeof(struct perf_event_attr));
          attr.size = sizeof(struct perf_event_attr);
  
          attr.type = 16; /* /sys/bus/event_source/devices/msr/type */
          attr.config = 0x0; /* /sys/bus/event_source/devices/msr/events/tsc */
          attr.disabled = 1;
          events[0].fd = perf_event_open(&attr, -1, 0, -1, 0);
  
          attr.type = 9; /* /sys/bus/event_source/devices/ibs_op/type */
          attr.config = (0x1 << 19); /* /sys/bus/event_source/devices/ibs_op/format/cnt_ctl */
          attr.disabled = 1;
          /* perf_read_event_details() can parse PERF_SAMPLE_TID only */
          attr.sample_type = PERF_SAMPLE_TID;
          attr.sample_period = 10000000;
          events[1].fd = perf_event_open(&attr, -1, 0, events[0].fd, 0);
          events[1].p = perf_event_mmap(events[1].fd);
  
          ioctl(events[0].fd, PERF_EVENT_IOC_RESET, 0);
          ioctl(events[1].fd, PERF_EVENT_IOC_RESET, 0);
          ioctl(events[0].fd, PERF_EVENT_IOC_ENABLE, 0);
          ioctl(events[1].fd, PERF_EVENT_IOC_ENABLE, 0);
  
          i = 5;
          while(i--) {
                  sleep(1);
                  read(events[0].fd, &count1, sizeof(long long));
                  read(events[1].fd, &count2, sizeof(long long));
                  perf_read_event_details(events[1].p);
                  ioctl(events[0].fd, PERF_EVENT_IOC_RESET, 0);
                  ioctl(events[1].fd, PERF_EVENT_IOC_RESET, 0);
  
                  printf("%lld, %lld\n", count1, count2);
          }
  
          close(events[1].fd);
          close(events[0].fd);
  }
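
(For reference, I built the above with something like the below, assuming
the listing is saved as perf-group-sample-count.c:)

  $ gcc -Wall -o perf-group-sample-count perf-group-sample-count.c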

Example run:

  [term1~]$ taskset -c 0 top

  [term2~]$ pgrep top
  85747

  [term2~]$ sudo ./perf-group-sample-count
  1996319080, 0
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 0, tid: 0
  1996510960, 150000000
  1996325400, 0
  1996348600, 0
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 85747, tid: 85747
  pid: 0, tid: 0
  1996341420, 130000000

Thanks,
Ravi

[1] https://lore.kernel.org/all/20150204125954.GL21418@twins.programming.kicks-ass.net


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-27  4:18   ` Ravi Bangoria
@ 2022-08-02  6:06     ` Ravi Bangoria
  0 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-08-02  6:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	ravi.bangoria

On 27-Jun-22 9:48 AM, Ravi Bangoria wrote:
> 
> On 13-Jun-22 8:05 PM, Peter Zijlstra wrote:
>>
>>
>> Right, so sorry for being incredibly tardy on this. Find below the
>> patch fwd ported to something recent.
>>
>> I'll reply to this with fixes and comments.
> 
> Thanks! I've resumed work on this, but my mind has lost all the context,
> so it might take a while for me to reply to your comments. Please bear
> with me if I'm a bit slow.

Sorry, it took a while for me to get started on it. Anyway, thanks for
providing the fixes. I applied those and ran some tests on an AMD Milan
machine:

- Built-in perf tests ran fine without any issues.
- perf_event_tests reported one BUG_ON() and one WARN_ON(). I'll work on
  fixing those.
- Ran the perf fuzzer for almost a day. It reported one softlockup, but
  the system recovered from it. Later it reported one hardlockup, but
  unfortunately my config had HARDLOCKUP_PANIC set, so I couldn't confirm
  whether that hardlockup was recoverable or not. Anyway, the system was
  running pretty much fine until then.
- No lockdep warnings were observed in any of the tests.

I'll work on verifying functionality changes.

Thanks,
Ravi


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
                     ` (5 preceding siblings ...)
  2022-06-17 13:36   ` Peter Zijlstra
@ 2022-06-27  4:18   ` Ravi Bangoria
  2022-08-02  6:06     ` Ravi Bangoria
  2022-08-24 12:15   ` Peter Zijlstra
  7 siblings, 1 reply; 47+ messages in thread
From: Ravi Bangoria @ 2022-06-27  4:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla,
	Ravi Bangoria


On 13-Jun-22 8:05 PM, Peter Zijlstra wrote:
> 
> 
> Right, so sorry for being incredibly tardy on this. Find below the
> patch fwd ported to something recent.
> 
> I'll reply to this with fixes and comments.

Thanks! I've resumed work on this, but my mind has lost all the context,
so it might take a while for me to reply to your comments. Please bear
with me if I'm a bit slow.

Thanks,
Ravi


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
                     ` (4 preceding siblings ...)
  2022-06-13 14:55   ` Peter Zijlstra
@ 2022-06-17 13:36   ` Peter Zijlstra
  2022-08-24 10:13     ` Peter Zijlstra
  2022-06-27  4:18   ` Ravi Bangoria
  2022-08-24 12:15   ` Peter Zijlstra
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-17 13:36 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> +/* XXX: No need of list now. Convert it to per-cpu variable */
>  static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);

Something like so I suppose...

---
 include/linux/perf_event.h |    1 
 kernel/events/core.c       |   70 ++++++++++++++-------------------------------
 2 files changed, 22 insertions(+), 49 deletions(-)

--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -936,7 +936,6 @@ struct perf_cpu_context {
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp;
-	struct list_head		cgrp_cpuctx_entry;
 #endif
 
 	/*
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -829,55 +829,41 @@ perf_cgroup_set_timestamp(struct perf_cp
 	}
 }
 
-/* XXX: No need of list now. Convert it to per-cpu variable */
-static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
-
 /*
  * reschedule events based on the cgroup constraint of task.
  */
 static void perf_cgroup_switch(struct task_struct *task)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_cgroup *cgrp;
-	struct perf_cpu_context *cpuctx, *tmp;
 	struct list_head *list;
 	unsigned long flags;
 
-	/*
-	 * Disable interrupts and preemption to avoid this CPU's
-	 * cgrp_cpuctx_entry to change under us.
-	 */
-	local_irq_save(flags);
-
 	cgrp = perf_cgroup_from_task(task, NULL);
 
-	list = this_cpu_ptr(&cgrp_cpuctx_list);
-	list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
-		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
-		if (READ_ONCE(cpuctx->cgrp) == cgrp)
-			continue;
-
-		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-		perf_ctx_disable(&cpuctx->ctx);
+	WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
+	if (READ_ONCE(cpuctx->cgrp) == cgrp)
+		continue;
 
-		ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
-		/*
-		 * must not be done before ctxswout due
-		 * to update_cgrp_time_from_cpuctx() in
-		 * ctx_sched_out()
-		 */
-		cpuctx->cgrp = cgrp;
-		/*
-		 * set cgrp before ctxsw in to allow
-		 * perf_cgroup_set_timestamp() in ctx_sched_in()
-		 * to not have to pass task around
-		 */
-		ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
+	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
+	perf_ctx_disable(&cpuctx->ctx);
 
-		perf_ctx_enable(&cpuctx->ctx);
-		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
-	}
+	ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
+	/*
+	 * must not be done before ctxswout due
+	 * to update_cgrp_time_from_cpuctx() in
+	 * ctx_sched_out()
+	 */
+	cpuctx->cgrp = cgrp;
+	/*
+	 * set cgrp before ctxsw in to allow
+	 * perf_cgroup_set_timestamp() in ctx_sched_in()
+	 * to not have to pass task around
+	 */
+	ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
 
-	local_irq_restore(flags);
+	perf_ctx_enable(&cpuctx->ctx);
+	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 }
 
 static int perf_cgroup_ensure_storage(struct perf_event *event,
@@ -979,8 +965,6 @@ perf_cgroup_event_enable(struct perf_eve
 		return;
 
 	cpuctx->cgrp = perf_cgroup_from_task(current, ctx);
-	list_add(&cpuctx->cgrp_cpuctx_entry,
-			per_cpu_ptr(&cgrp_cpuctx_list, event->cpu));
 }
 
 static inline void
@@ -1001,7 +985,6 @@ perf_cgroup_event_disable(struct perf_ev
 		return;
 
 	cpuctx->cgrp = NULL;
-	list_del(&cpuctx->cgrp_cpuctx_entry);
 }
 
 #else /* !CONFIG_CGROUP_PERF */
@@ -2372,11 +2355,7 @@ static void perf_remove_from_context(str
 	 * event_function_call() user.
 	 */
 	raw_spin_lock_irq(&ctx->lock);
-	/*
-	 * Cgroup events are per-cpu events, and must IPI because of
-	 * cgrp_cpuctx_list.
-	 */
-	if (!ctx->is_active && !is_cgroup_event(event)) {
+	if (!ctx->is_active) {
 		__perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
 					   ctx, (void *)flags);
 		raw_spin_unlock_irq(&ctx->lock);
@@ -2807,8 +2786,6 @@ perf_install_in_context(struct perf_even
 	 * perf_event_attr::disabled events will not run and can be initialized
 	 * without IPI. Except when this is the first event for the context, in
 	 * that case we need the magic of the IPI to set ctx->is_active.
-	 * Similarly, cgroup events for the context also needs the IPI to
-	 * manipulate the cgrp_cpuctx_list.
 	 *
 	 * The IOC_ENABLE that is sure to follow the creation of a disabled
 	 * event will issue the IPI and reprogram the hardware.
@@ -13301,9 +13278,6 @@ static void __init perf_event_init_all_c
 		INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
 		raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
 
-#ifdef CONFIG_CGROUP_PERF
-		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
-#endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
 
 		cpuctx = per_cpu_ptr(&cpu_context, cpu);


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
                     ` (3 preceding siblings ...)
  2022-06-13 14:43   ` Peter Zijlstra
@ 2022-06-13 14:55   ` Peter Zijlstra
  2022-08-02  6:10     ` Ravi Bangoria
  2022-06-17 13:36   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:55 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> @@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
>  		goto err_task;
>  	}
>  
> +	// XXX premature; what if this is allowed, but we get moved to a PMU
> +	// that doesn't have this.
>  	if (is_sampling_event(event)) {
>  		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
>  			err = -EOPNOTSUPP;

No; this really should be against the event's native PMU. If the event
can't natively sample, it can't sample when placed in another group
either.


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
                     ` (2 preceding siblings ...)
  2022-06-13 14:41   ` Peter Zijlstra
@ 2022-06-13 14:43   ` Peter Zijlstra
  2022-08-02  6:16     ` Ravi Bangoria
  2022-06-13 14:55   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:43 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:

> @@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
>  			.size = ARRAY_SIZE(itrs),
>  		};
>  		/* Events not within a CPU context may be on any CPU. */
> -		__heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
> +		__heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
>  	}
>  	evt = event_heap.data;
>  
> -	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
> +	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
>  
>  #ifdef CONFIG_CGROUP_PERF
>  	for (; css; css = css->parent)
> -		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
> +		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
>  #endif
>  
> +	if (event_heap.nr) {
> +		/*
> +		 * XXX: For now, visit_groups_merge() gets called with pmu
> +		 * pointer never NULL. But these functions needs to be called
> +		 * pointer never NULL. But this function needs to be called
> +		 * once for each pmu if I implement the pmu=NULL optimization.
> +		__link_epc((*evt)->pmu_ctx);
> +		perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
> +	}
> +
> +
>  	min_heapify_all(&event_heap, &perf_min_heap);
>  
>  	while (event_heap.nr) {

> @@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
>  	return 0;
>  }
>  
> -static void
> -ctx_pinned_sched_in(struct perf_event_context *ctx,
> -		    struct perf_cpu_context *cpuctx)
> +static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
>  {
> +	struct perf_event_pmu_context *pmu_ctx;
>  	int can_add_hw = 1;
>  
> -	if (ctx != &cpuctx->ctx)
> -		cpuctx = NULL;
> -
> -	visit_groups_merge(cpuctx, &ctx->pinned_groups,
> -			   smp_processor_id(),
> -			   merge_sched_in, &can_add_hw);
> +	if (pmu) {
> +		visit_groups_merge(ctx, &ctx->pinned_groups,
> +				   smp_processor_id(), pmu,
> +				   merge_sched_in, &can_add_hw);
> +	} else {
> +		/*
> +		 * XXX: This can be optimized for per-task context by calling
> +		 * visit_groups_merge() only once with:
> +		 *   1) pmu=NULL
> +		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
> +		 *   3) Making can_add_hw a per-pmu variable
> +		 *
> +		 * Though, it cannot be optimized for per-cpu context because
> +		 * the per-cpu rb-tree consists of pmu-subtrees and pmu-subtrees
> +		 * consist of cgroup-subtrees, i.e. cgroup events of the same
> +		 * cgroup but different pmus are separated out into respective
> +		 * pmu-subtrees.
> +		 */
> +		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> +			can_add_hw = 1;
> +			visit_groups_merge(ctx, &ctx->pinned_groups,
> +					   smp_processor_id(), pmu_ctx->pmu,
> +					   merge_sched_in, &can_add_hw);
> +		}
> +	}
>  }

I'm not sure I follow.. task context can have multiple PMUs just the
same as CPU context can, that's more or less the entire point of the
patch.


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
  2022-06-13 14:36   ` Peter Zijlstra
  2022-06-13 14:38   ` Peter Zijlstra
@ 2022-06-13 14:41   ` Peter Zijlstra
  2022-08-22 14:38     ` Ravi Bangoria
  2022-06-13 14:43   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:41 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:

> @@ -3196,11 +3187,52 @@ static int perf_event_modify_attr(struct
>  	return err;
>  }
>  
> -static void ctx_sched_out(struct perf_event_context *ctx,
> -			  struct perf_cpu_context *cpuctx,
> -			  enum event_type_t event_type)
> +static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
> +				enum event_type_t event_type)
>  {
> +	struct perf_event_context *ctx = pmu_ctx->ctx;
>  	struct perf_event *event, *tmp;
> +	struct pmu *pmu = pmu_ctx->pmu;
> +
> +	if (ctx->task && !ctx->is_active) {
> +		struct perf_cpu_pmu_context *cpc;
> +
> +		cpc = this_cpu_ptr(pmu->cpu_pmu_context);
> +		WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
> +		cpc->task_epc = NULL;
> +	}
> +
> +	if (!event_type)
> +		return;
> +
> +	perf_pmu_disable(pmu);
> +	if (event_type & EVENT_PINNED) {
> +		list_for_each_entry_safe(event, tmp,
> +				&pmu_ctx->pinned_active,
> +				active_list)
> +			group_sched_out(event, ctx);
> +	}
> +
> +	if (event_type & EVENT_FLEXIBLE) {
> +		list_for_each_entry_safe(event, tmp,
> +				&pmu_ctx->flexible_active,
> +				active_list)
> +			group_sched_out(event, ctx);
> +		/*
> +		 * Since we cleared EVENT_FLEXIBLE, also clear
> +		 * rotate_necessary, is will be reset by
> +		 * ctx_flexible_sched_in() when needed.
> +		 */
> +		pmu_ctx->rotate_necessary = 0;
> +	}
> +	perf_pmu_enable(pmu);
> +}
> +
> +static void
> +ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
> +{
> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
> +	struct perf_event_pmu_context *pmu_ctx;
>  	int is_active = ctx->is_active;
>  
>  	lockdep_assert_held(&ctx->lock);
> @@ -3251,24 +3283,8 @@ static void ctx_sched_out(struct perf_ev
>  	if (!ctx->nr_active || !(is_active & EVENT_ALL))
>  		return;
>  
> -	perf_pmu_disable(ctx->pmu);
> -	if (is_active & EVENT_PINNED) {
> -		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
> -			group_sched_out(event, cpuctx, ctx);
> -	}
> -
> -	if (is_active & EVENT_FLEXIBLE) {
> -		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
> -			group_sched_out(event, cpuctx, ctx);
> -
> -		/*
> -		 * Since we cleared EVENT_FLEXIBLE, also clear
> -		 * rotate_necessary, is will be reset by
> -		 * ctx_flexible_sched_in() when needed.
> -		 */
> -		ctx->rotate_necessary = 0;
> -	}
> -	perf_pmu_enable(ctx->pmu);
> +	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
> +		__pmu_ctx_sched_out(pmu_ctx, is_active);
>  }

You mentioned trouble with cpc->task_epc; there's one rebase mistake
from you and an original bug from me.

You lost the last hunk, and I forgot to clear cpc on
perf_remove_from_context().

With these fixes I can run 'perf test' without things going
insta-splat.

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2311,6 +2311,7 @@ __perf_remove_from_context(struct perf_e
 			   struct perf_event_context *ctx,
 			   void *info)
 {
+	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
 	unsigned long flags = (unsigned long)info;
 
 	if (ctx->is_active & EVENT_TIME) {
@@ -2325,8 +2326,17 @@ __perf_remove_from_context(struct perf_e
 		perf_child_detach(event);
 	list_del_event(event, ctx);
 
-	if (!event->pmu_ctx->nr_events)
-		event->pmu_ctx->rotate_necessary = 0;
+	if (!pmu_ctx->nr_events) {
+		pmu_ctx->rotate_necessary = 0;
+
+		if (ctx->task) {
+			struct perf_cpu_pmu_context *cpc;
+
+			cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+			WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+			cpc->task_epc = NULL;
+		}
+	}
 
 	if (!ctx->nr_events && ctx->is_active) {
 		if (ctx == &cpuctx->ctx)
@@ -3198,7 +3208,7 @@ static void __pmu_ctx_sched_out(struct p
 		struct perf_cpu_pmu_context *cpc;
 
 		cpc = this_cpu_ptr(pmu->cpu_pmu_context);
-		WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+		WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
 		cpc->task_epc = NULL;
 	}
 
@@ -3280,9 +3290,6 @@ ctx_sched_out(struct perf_event_context
 
 	is_active ^= ctx->is_active; /* changed bits */
 
-	if (!ctx->nr_active || !(is_active & EVENT_ALL))
-		return;
-
 	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
 		__pmu_ctx_sched_out(pmu_ctx, is_active);
 }


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
  2022-06-13 14:36   ` Peter Zijlstra
@ 2022-06-13 14:38   ` Peter Zijlstra
  2022-08-02  6:11     ` Ravi Bangoria
  2022-06-13 14:41   ` Peter Zijlstra
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:38 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:

Another one of those lockdep splats:

> @@ -12147,42 +12256,37 @@ SYSCALL_DEFINE5(perf_event_open,
>  	if (pmu->task_ctx_nr == perf_sw_context)
>  		event->event_caps |= PERF_EV_CAP_SOFTWARE;
>  
> -	if (group_leader) {
> -		if (is_software_event(event) &&
> -		    !in_software_context(group_leader)) {
> -			/*
> -			 * If the event is a sw event, but the group_leader
> -			 * is on hw context.
> -			 *
> -			 * Allow the addition of software events to hw
> -			 * groups, this is safe because software events
> -			 * never fail to schedule.
> -			 */
> -			pmu = group_leader->ctx->pmu;
> -		} else if (!is_software_event(event) &&
> -			   is_software_event(group_leader) &&
> -			   (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
> -			/*
> -			 * In case the group is a pure software group, and we
> -			 * try to add a hardware event, move the whole group to
> -			 * the hardware context.
> -			 */
> -			move_group = 1;
> -		}
> -	}
> -
>  	/*
>  	 * Get the target context (task or percpu):
>  	 */
> -	ctx = find_get_context(pmu, task, event);
> +	ctx = find_get_context(task, event);
>  	if (IS_ERR(ctx)) {
>  		err = PTR_ERR(ctx);
>  		goto err_alloc;
>  	}
>  
> -	/*
> -	 * Look up the group leader (we will attach this event to it):
> -	 */
> +	mutex_lock(&ctx->mutex);
> +
> +	if (ctx->task == TASK_TOMBSTONE) {
> +		err = -ESRCH;
> +		goto err_locked;
> +	}
> +
> +	if (!task) {
> +		/*
> +		 * Check if the @cpu we're creating an event for is online.
> +		 *
> +		 * We use the perf_cpu_context::ctx::mutex to serialize against
> +		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
> +		 */
> +		struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
> +
> +		if (!cpuctx->online) {
> +			err = -ENODEV;
> +			goto err_locked;
> +		}
> +	}
> +
>  	if (group_leader) {
>  		err = -EINVAL;
>  


pulling up the ctx->mutex makes things simpler, but also violates the
locking order vs exec_update_lock.

Pull that lock up as well...

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -12254,13 +12254,29 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (pmu->task_ctx_nr == perf_sw_context)
 		event->event_caps |= PERF_EV_CAP_SOFTWARE;
 
+	if (task) {
+		err = down_read_interruptible(&task->signal->exec_update_lock);
+		if (err)
+			goto err_alloc;
+
+		/*
+		 * We must hold exec_update_lock across this and any potential
+		 * perf_install_in_context() call for this new event to
+		 * serialize against exec() altering our credentials (and the
+		 * perf_event_exit_task() that could imply).
+		 */
+		err = -EACCES;
+		if (!perf_check_permission(&attr, task))
+			goto err_cred;
+	}
+
 	/*
 	 * Get the target context (task or percpu):
 	 */
 	ctx = find_get_context(task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
-		goto err_alloc;
+		goto err_cred;
 	}
 
 	mutex_lock(&ctx->mutex);
@@ -12358,58 +12374,14 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_context;
 	}
 
-	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
-	if (IS_ERR(event_file)) {
-		err = PTR_ERR(event_file);
-		event_file = NULL;
-		goto err_context;
-	}
-
-	if (task) {
-		err = down_read_interruptible(&task->signal->exec_update_lock);
-		if (err)
-			goto err_file;
-
-		/*
-		 * We must hold exec_update_lock across this and any potential
-		 * perf_install_in_context() call for this new event to
-		 * serialize against exec() altering our credentials (and the
-		 * perf_event_exit_task() that could imply).
-		 */
-		err = -EACCES;
-		if (!perf_check_permission(&attr, task))
-			goto err_cred;
-	}
-
-	if (ctx->task == TASK_TOMBSTONE) {
-		err = -ESRCH;
-		goto err_locked;
-	}
-
 	if (!perf_event_validate_size(event)) {
 		err = -E2BIG;
-		goto err_locked;
-	}
-
-	if (!task) {
-		/*
-		 * Check if the @cpu we're creating an event for is online.
-		 *
-		 * We use the perf_cpu_context::ctx::mutex to serialize against
-		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
-		 */
-		struct perf_cpu_context *cpuctx =
-			container_of(ctx, struct perf_cpu_context, ctx);
-
-		if (!cpuctx->online) {
-			err = -ENODEV;
-			goto err_locked;
-		}
+		goto err_context;
 	}
 
 	if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader)) {
 		err = -EINVAL;
-		goto err_locked;
+		goto err_context;
 	}
 
 	/*
@@ -12418,11 +12390,18 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 	if (!exclusive_event_installable(event, ctx)) {
 		err = -EBUSY;
-		goto err_cred;
+		goto err_context;
 	}
 
 	WARN_ON_ONCE(ctx->parent_ctx);
 
+	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
+	if (IS_ERR(event_file)) {
+		err = PTR_ERR(event_file);
+		event_file = NULL;
+		goto err_context;
+	}
+
 	/*
 	 * This is the point on no return; we cannot fail hereafter. This is
 	 * where we start modifying current state.
@@ -12500,17 +12479,15 @@ SYSCALL_DEFINE5(perf_event_open,
 	fd_install(event_fd, event_file);
 	return event_fd;
 
-err_cred:
-	if (task)
-		up_read(&task->signal->exec_update_lock);
-err_file:
-	fput(event_file);
 err_context:
 	/* event->pmu_ctx freed by free_event() */
 err_locked:
 	mutex_unlock(&ctx->mutex);
 	perf_unpin_context(ctx);
 	put_ctx(ctx);
+err_cred:
+	if (task)
+		up_read(&task->signal->exec_update_lock);
 err_alloc:
 	/*
 	 * If event_file is set, the fput() above will have called ->release()


* Re: [RFC v2] perf: Rewrite core context handling
  2022-06-13 14:35 ` Peter Zijlstra
@ 2022-06-13 14:36   ` Peter Zijlstra
  2022-06-13 14:38   ` Peter Zijlstra
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:36 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla

On Mon, Jun 13, 2022 at 04:35:11PM +0200, Peter Zijlstra wrote:
> 
> 
> Right, so sorry for being incredibly tardy on this. Find below the
> patch fwd ported to something recent.
> 
> I'll reply to this with fixes and comments.

You write:

>> A simple perf stat/record/top survives with the patch but machine
>> crashes with first run of perf test (stale cpc->task_epc causing the
>> crash). Lockdep is also screaming a lot :)


> @@ -7669,20 +7877,15 @@ static void perf_event_addr_filters_exec
>  void perf_event_exec(void)
>  {
>  	struct perf_event_context *ctx;
> -	int ctxn;
> -
> -	for_each_task_context_nr(ctxn) {
> -		perf_event_enable_on_exec(ctxn);
> -		perf_event_remove_on_exec(ctxn);
>  
> -		rcu_read_lock();
> -		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
> -		if (ctx) {
> -			perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
> -					 NULL, true);
> -		}
> -		rcu_read_unlock();
> +	rcu_read_lock();
> +	ctx = rcu_dereference(current->perf_event_ctxp);
> +	if (ctx) {
> +		perf_event_enable_on_exec(ctx);
> +		perf_event_remove_on_exec(ctx);
> +		perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
>  	}
> +	rcu_read_unlock();
>  }
>  
>  struct remote_output {

The above goes *bang* because perf_event_remove_on_exec() will take a
mutex, which isn't allowed under rcu_read_lock().

The below cures.

---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4384,8 +4384,6 @@ static void perf_event_remove_on_exec(st
 	unsigned long flags;
 	bool modified = false;
 
-	perf_pin_task_context(current);
-
 	mutex_lock(&ctx->mutex);
 
 	if (WARN_ON_ONCE(ctx->task != current))
@@ -4406,13 +4404,11 @@ static void perf_event_remove_on_exec(st
 	raw_spin_lock_irqsave(&ctx->lock, flags);
 	if (modified)
 		clone_ctx = unclone_ctx(ctx);
-	--ctx->pin_count;
 	raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
 unlock:
 	mutex_unlock(&ctx->mutex);
 
-	put_ctx(ctx);
 	if (clone_ctx)
 		put_ctx(clone_ctx);
 }
@@ -7878,14 +7874,16 @@ void perf_event_exec(void)
 {
 	struct perf_event_context *ctx;
 
-	rcu_read_lock();
-	ctx = rcu_dereference(current->perf_event_ctxp);
-	if (ctx) {
-		perf_event_enable_on_exec(ctx);
-		perf_event_remove_on_exec(ctx);
-		perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
-	}
-	rcu_read_unlock();
+	ctx = perf_pin_task_context(current);
+	if (!ctx)
+		return;
+
+	perf_event_enable_on_exec(ctx);
+	perf_event_remove_on_exec(ctx);
+	perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
+
+	perf_unpin_context(ctx);
+	put_ctx(ctx);
 }
 
 struct remote_output {


* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
  2022-01-13 19:15 ` kernel test robot
  2022-01-31  4:43 ` Ravi Bangoria
@ 2022-06-13 14:35 ` Peter Zijlstra
  2022-06-13 14:36   ` Peter Zijlstra
                     ` (7 more replies)
  2022-08-02  6:13 ` Ravi Bangoria
                   ` (2 subsequent siblings)
  5 siblings, 8 replies; 47+ messages in thread
From: Peter Zijlstra @ 2022-06-13 14:35 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla



Right, so sorry for being incredibly tardy on this. Find below the
patch fwd ported to something recent.

I'll reply to this with fixes and comments.

---
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -132,7 +132,7 @@ static unsigned long ebb_switch_in(bool
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) {}
 static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
 static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
@@ -451,7 +451,7 @@ static void power_pmu_bhrb_disable(struc
 /* Called from ctxsw to prevent one process's branch entries to
  * mingle with the other process's entries during context switch.
  */
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	if (!ppmu->bhrb_nr)
 		return;
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -317,7 +317,7 @@ static void amd_brs_poison_buffer(void)
  * On ctxswin, sched_in = true, called after the PMU has started
  * On ctxswout, sched_in = false, called before the PMU is stopped
  */
-void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
--- a/arch/x86/events/amd/core.c
+++ b/arch/x86/events/amd/core.c
@@ -1248,11 +1248,11 @@ static ssize_t amd_event_sysfs_show(char
 	return x86_event_sysfs_show(page, config, event);
 }
 
-static void amd_pmu_sched_task(struct perf_event_context *ctx,
+static void amd_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
 				 bool sched_in)
 {
 	if (sched_in && x86_pmu.lbr_nr)
-		amd_pmu_brs_sched_task(ctx, sched_in);
+		amd_pmu_brs_sched_task(pmu_ctx, sched_in);
 }
 
 static u64 amd_pmu_limit_period(struct perf_event *event, u64 left)
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2067,13 +2067,14 @@ void x86_pmu_show_pmu_cap(int num_counte
  */
 void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
 {
-	struct perf_cpu_context *cpuctx;
+	/* XXX: Don't need this quirk anymore */
+	/*struct perf_cpu_context *cpuctx;
 
 	if (!pmu->pmu_cpu_context)
 		return;
 
 	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-	cpuctx->ctx.pmu = pmu;
+	cpuctx->ctx.pmu = pmu;*/
 }
 
 static int __init init_hw_perf_events(void)
@@ -2644,15 +2645,15 @@ static const struct attribute_group *x86
 	NULL,
 };
 
-static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
-	static_call_cond(x86_pmu_sched_task)(ctx, sched_in);
+	static_call_cond(x86_pmu_sched_task)(pmu_ctx, sched_in);
 }
 
-static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
-				  struct perf_event_context *next)
+static void x86_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				  struct perf_event_pmu_context *next_epc)
 {
-	static_call_cond(x86_pmu_swap_task_ctx)(prev, next);
+	static_call_cond(x86_pmu_swap_task_ctx)(prev_epc, next_epc);
 }
 
 void perf_check_microcode(void)
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4545,17 +4545,17 @@ static void intel_pmu_cpu_dead(int cpu)
 		cpumask_clear_cpu(cpu, &hybrid_pmu(cpuc->pmu)->supported_cpus);
 }
 
-static void intel_pmu_sched_task(struct perf_event_context *ctx,
+static void intel_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
 				 bool sched_in)
 {
-	intel_pmu_pebs_sched_task(ctx, sched_in);
-	intel_pmu_lbr_sched_task(ctx, sched_in);
+	intel_pmu_pebs_sched_task(pmu_ctx, sched_in);
+	intel_pmu_lbr_sched_task(pmu_ctx, sched_in);
 }
 
-static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
-				    struct perf_event_context *next)
+static void intel_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				    struct perf_event_pmu_context *next_epc)
 {
-	intel_pmu_lbr_swap_task_ctx(prev, next);
+	intel_pmu_lbr_swap_task_ctx(prev_epc, next_epc);
 }
 
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1005,7 +1005,7 @@ static inline bool pebs_needs_sched_cb(s
 	return cpuc->n_pebs && (cpuc->n_pebs == cpuc->n_large_pebs);
 }
 
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
@@ -1113,7 +1113,7 @@ static void
 pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
 		  struct perf_event *event, bool add)
 {
-	struct pmu *pmu = event->ctx->pmu;
+	struct pmu *pmu = event->pmu;
 	/*
 	 * Make sure we get updated with the first PEBS
 	 * event. It will trigger also during removal, but
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -575,21 +575,21 @@ static void __intel_pmu_lbr_save(void *c
 	cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
 }
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next)
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				 struct perf_event_pmu_context *next_epc)
 {
 	void *prev_ctx_data, *next_ctx_data;
 
-	swap(prev->task_ctx_data, next->task_ctx_data);
+	swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
 
 	/*
-	 * Architecture specific synchronization makes sense in
-	 * case both prev->task_ctx_data and next->task_ctx_data
+	 * Architecture specific synchronization makes sense in case
+	 * both prev_epc->task_ctx_data and next_epc->task_ctx_data
 	 * pointers are allocated.
 	 */
 
-	prev_ctx_data = next->task_ctx_data;
-	next_ctx_data = prev->task_ctx_data;
+	prev_ctx_data = next_epc->task_ctx_data;
+	next_ctx_data = prev_epc->task_ctx_data;
 
 	if (!prev_ctx_data || !next_ctx_data)
 		return;
@@ -598,7 +598,7 @@ void intel_pmu_lbr_swap_task_ctx(struct
 	     task_context_opt(next_ctx_data)->lbr_callstack_users);
 }
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	void *task_ctx;
@@ -611,7 +611,7 @@ void intel_pmu_lbr_sched_task(struct per
 	 * the task was scheduled out, restore the stack. Otherwise flush
 	 * the LBR stack.
 	 */
-	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
 	if (task_ctx) {
 		if (sched_in)
 			__intel_pmu_lbr_restore(task_ctx);
@@ -647,8 +647,8 @@ void intel_pmu_lbr_add(struct perf_event
 
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
-	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
+	if (branch_user_callstack(cpuc->br_sel) && event->pmu_ctx->task_ctx_data)
+		task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users++;
 
 	/*
 	 * Request pmu::sched_task() callback, which will fire inside the
@@ -671,7 +671,7 @@ void intel_pmu_lbr_add(struct perf_event
 	 */
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users++;
-	perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->pmu);
 	if (!cpuc->lbr_users++ && !event->total_time_running)
 		intel_pmu_lbr_reset();
 }
@@ -724,8 +724,8 @@ void intel_pmu_lbr_del(struct perf_event
 		return;
 
 	if (branch_user_callstack(cpuc->br_sel) &&
-	    event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
+	    event->pmu_ctx->task_ctx_data)
+		task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users--;
 
 	if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
 		cpuc->lbr_select = 0;
@@ -735,7 +735,7 @@ void intel_pmu_lbr_del(struct perf_event
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 	WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
-	perf_sched_cb_dec(event->ctx->pmu);
+	perf_sched_cb_dec(event->pmu);
 }
 
 static inline bool vlbr_exclude_host(void)
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -798,7 +798,7 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*sched_task)(struct perf_event_context *ctx,
+	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx,
 				      bool sched_in);
 
 	/*
@@ -880,12 +880,12 @@ struct x86_pmu {
 	int		(*set_topdown_event_period)(struct perf_event *event);
 
 	/*
-	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
+	 * perf task context (i.e. struct perf_event_pmu_context::task_ctx_data)
 	 * switch helper to bridge calls from perf/core to perf/x86.
 	 * See struct pmu::swap_task_ctx() usage for examples;
 	 */
-	void		(*swap_task_ctx)(struct perf_event_context *prev,
-					 struct perf_event_context *next);
+	void		(*swap_task_ctx)(struct perf_event_pmu_context *prev_epc,
+					 struct perf_event_pmu_context *next_epc);
 
 	/*
 	 * AMD bits
@@ -1253,7 +1253,7 @@ static inline void amd_pmu_brs_del(struc
 	perf_sched_cb_dec(event->ctx->pmu);
 }
 
-void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 #else
 static inline int amd_brs_init(void)
 {
@@ -1278,7 +1278,7 @@ static inline void amd_pmu_brs_del(struc
 {
 }
 
-static inline void amd_pmu_brs_sched_task(struct perf_event_context *ctx, bool sched_in)
+static inline void amd_pmu_brs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 }
 
@@ -1436,7 +1436,7 @@ void intel_pmu_pebs_enable_all(void);
 
 void intel_pmu_pebs_disable_all(void);
 
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
 void intel_pmu_auto_reload_read(struct perf_event *event);
 
@@ -1444,10 +1444,10 @@ void intel_pmu_store_pebs_lbrs(struct lb
 
 void intel_ds_init(void);
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next);
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				 struct perf_event_pmu_context *next_epc);
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
 u64 lbr_from_signext_quirk_wr(u64 val);
 
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -262,6 +262,7 @@ struct hw_perf_event {
 };
 
 struct perf_event;
+struct perf_event_pmu_context;
 
 /*
  * Common implementation detail of pmu::{start,commit,cancel}_txn
@@ -304,7 +305,7 @@ struct pmu {
 	int				capabilities;
 
 	int __percpu			*pmu_disable_count;
-	struct perf_cpu_context __percpu *pmu_cpu_context;
+	struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
 	atomic_t			exclusive_cnt; /* < 0: cpu; > 0: tsk */
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
@@ -439,7 +440,7 @@ struct pmu {
 	/*
 	 * context-switches callback
 	 */
-	void (*sched_task)		(struct perf_event_context *ctx,
+	void (*sched_task)		(struct perf_event_pmu_context *pmu_ctx,
 					bool sched_in);
 
 	/*
@@ -453,8 +454,8 @@ struct pmu {
 	 * implementation and Perf core context switch handling callbacks for usage
 	 * examples.
 	 */
-	void (*swap_task_ctx)		(struct perf_event_context *prev,
-					 struct perf_event_context *next);
+	void (*swap_task_ctx)		(struct perf_event_pmu_context *prev_epc,
+					 struct perf_event_pmu_context *next_epc);
 					/* optional */
 
 	/*
@@ -675,6 +676,11 @@ struct perf_event {
 	int				group_caps;
 
 	struct perf_event		*group_leader;
+	/*
+	 * event->pmu will always point to pmu in which this event belongs.
+	 * Unlike event->pmu_ctx->pmu which points to other pmu when group of
+	 * different events are created.
+	 */
 	struct pmu			*pmu;
 	void				*pmu_private;
 
@@ -700,6 +706,12 @@ struct perf_event {
 	struct hw_perf_event		hw;
 
 	struct perf_event_context	*ctx;
+	/*
+	 * event->pmu_ctx points to perf_event_pmu_context in which the event
+	 * is added. This pmu_ctx can be of other pmu for sw event when such
+	 * sw event is added to a non-sw event group.
+	 */
+	struct perf_event_pmu_context	*pmu_ctx;
 	atomic_long_t			refcount;
 
 	/*
@@ -787,19 +799,60 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+/*
+ *           ,------------------------[1:n]---------------------.
+ *           V                                                  V
+ * perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
+ *           ^                      ^     |                     |
+ *           `--------[1:n]---------'     `-[n:1]-> pmu <-[1:n]-'
+ *
+ *
+ * XXX destroy epc when empty
+ *   refcount, !rcu
+ *
+ * XXX epc locking
+ *
+ *   event->pmu_ctx            ctx->mutex && inactive
+ *   ctx->pmu_ctx_list         ctx->mutex && ctx->lock
+ *
+ */
+struct perf_event_pmu_context {
+	struct pmu			*pmu;
+	struct perf_event_context       *ctx;
+
+	struct list_head		pmu_ctx_entry;
+
+	struct list_head		pinned_active;
+	struct list_head		flexible_active;
+
+	/* Used to avoid freeing per-cpu perf_event_pmu_context */
+	unsigned int			embedded : 1;
+
+	unsigned int			nr_events;
+	unsigned int			nr_active;
+
+	atomic_t			refcount; /* event <-> epc */
+
+	void				*task_ctx_data; /* pmu specific data */
+	/*
+	 * Set when nr_events != nr_active, except tolerant to events not
+	 * necessary to be active due to scheduling constraints, such as cgroups.
+	 */
+	int				rotate_necessary;
+};
 
 struct perf_event_groups {
 	struct rb_root	tree;
 	u64		index;
 };
 
+
 /**
  * struct perf_event_context - event context structure
  *
  * Used as a container for task events and CPU events as well:
  */
 struct perf_event_context {
-	struct pmu			*pmu;
 	/*
 	 * Protect the states of the events in the list,
 	 * nr_active, and the list:
@@ -812,26 +865,21 @@ struct perf_event_context {
 	 */
 	struct mutex			mutex;
 
-	struct list_head		active_ctx_list;
+	struct list_head		pmu_ctx_list;
 	struct perf_event_groups	pinned_groups;
 	struct perf_event_groups	flexible_groups;
 	struct list_head		event_list;
 
-	struct list_head		pinned_active;
-	struct list_head		flexible_active;
-
 	int				nr_events;
 	int				nr_active;
 	int				nr_user;
 	int				is_active;
+
+	int				nr_task_data;
 	int				nr_stat;
 	int				nr_freq;
 	int				rotate_disable;
-	/*
-	 * Set when nr_events != nr_active, except tolerant to events not
-	 * necessary to be active due to scheduling constraints, such as cgroups.
-	 */
-	int				rotate_necessary;
+
 	refcount_t			refcount;
 	struct task_struct		*task;
 
@@ -853,7 +901,6 @@ struct perf_event_context {
 #ifdef CONFIG_CGROUP_PERF
 	int				nr_cgroups;	 /* cgroup evts */
 #endif
-	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
 };
 
@@ -863,12 +910,13 @@ struct perf_event_context {
  */
 #define PERF_NR_CONTEXTS	4
 
-/**
- * struct perf_cpu_context - per cpu event context structure
- */
-struct perf_cpu_context {
-	struct perf_event_context	ctx;
-	struct perf_event_context	*task_ctx;
+struct perf_cpu_pmu_context {
+	struct perf_event_pmu_context	epc;
+	struct perf_event_pmu_context	*task_epc;
+
+	struct list_head		sched_cb_entry;
+	int				sched_cb_usage;
+
 	int				active_oncpu;
 	int				exclusive;
 
@@ -876,16 +924,21 @@ struct perf_cpu_context {
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
 	unsigned int			hrtimer_active;
+};
+
+/**
+ * struct perf_cpu_context - per cpu event context structure
+ */
+struct perf_cpu_context {
+	struct perf_event_context	ctx;
+	struct perf_event_context	*task_ctx;
+	int				online;
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp;
 	struct list_head		cgrp_cpuctx_entry;
 #endif
 
-	struct list_head		sched_cb_entry;
-	int				sched_cb_usage;
-
-	int				online;
 	/*
 	 * Per-CPU storage for iterators used in visit_groups_merge. The default
 	 * storage is of size 2 to hold the CPU and any CPU event iterators.
@@ -1151,7 +1204,7 @@ static inline int is_software_event(stru
  */
 static inline int in_software_context(struct perf_event *event)
 {
-	return event->ctx->pmu->task_ctx_nr == perf_sw_context;
+	return event->pmu_ctx->pmu->task_ctx_nr == perf_sw_context;
 }
 
 static inline int is_exclusive_pmu(struct pmu *pmu)
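
For illustration, a standalone, simplified C sketch of the object graph the new
perf_event_pmu_context comment block describes: a context keeps a list of per-pmu
contexts, each per-pmu context points at exactly one pmu and one context, and every
event carries both its owning pmu and the pmu_ctx it is scheduled through. The toy_*
names below are illustrative only, not kernel types.

/*
 * Simplified, non-kernel model of:
 *   perf_event_context <-[1:n]-> perf_event_pmu_context -[n:1]-> pmu
 *   perf_event -> {pmu, pmu_ctx, ctx}
 */
struct toy_pmu;
struct toy_ctx;

struct toy_epc {			/* ~ perf_event_pmu_context */
	struct toy_pmu	*pmu;		/* n:1 - many epcs per pmu */
	struct toy_ctx	*ctx;		/* n:1 - many epcs per context */
	struct toy_epc	*next_in_ctx;	/* ~ ctx->pmu_ctx_list linkage */
	unsigned int	nr_events;
	unsigned int	nr_active;
};

struct toy_ctx {			/* ~ perf_event_context */
	struct toy_epc	*pmu_ctx_list;	/* head of this context's epcs */
};

struct toy_event {			/* ~ perf_event */
	struct toy_pmu	*pmu;		/* pmu this event itself belongs to */
	struct toy_epc	*pmu_ctx;	/* may be another pmu's epc (sw event in hw group) */
	struct toy_ctx	*ctx;
};
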
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1226,7 +1226,7 @@ struct task_struct {
 	unsigned int			futex_state;
 #endif
 #ifdef CONFIG_PERF_EVENTS
-	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
+	struct perf_event_context	*perf_event_ctxp;
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
 #endif
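
The sched.h change collapses the per-context-type array into a single pointer, so a
task has at most one perf event context. A minimal, non-kernel sketch of what the
lookup side becomes (toy_* names are illustrative; the real code loads the pointer
with rcu_dereference() under rcu_read_lock()):

struct toy_ctx;

struct toy_task {
	struct toy_ctx *perf_event_ctxp;	/* was: perf_event_ctxp[perf_nr_task_contexts] */
};

static inline struct toy_ctx *toy_task_ctx(struct toy_task *t)
{
	/* one load instead of a loop over context-type slots */
	return t->perf_event_ctxp;
}
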
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,12 +154,6 @@ static int cpu_function_call(int cpu, re
 	return data.ret;
 }
 
-static inline struct perf_cpu_context *
-__get_cpu_context(struct perf_event_context *ctx)
-{
-	return this_cpu_ptr(ctx->pmu->pmu_cpu_context);
-}
-
 static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
 			  struct perf_event_context *ctx)
 {
@@ -183,6 +177,8 @@ static bool is_kernel_event(struct perf_
 	return READ_ONCE(event->owner) == TASK_TOMBSTONE;
 }
 
+static DEFINE_PER_CPU(struct perf_cpu_context, cpu_context);
+
 /*
  * On task ctx scheduling...
  *
@@ -216,7 +212,7 @@ static int event_function(void *info)
 	struct event_function_struct *efs = info;
 	struct perf_event *event = efs->event;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	int ret = 0;
 
@@ -313,7 +309,7 @@ static void event_function_call(struct p
 static void event_function_local(struct perf_event *event, event_f func, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct task_struct *task = READ_ONCE(ctx->task);
 	struct perf_event_context *task_ctx = NULL;
 
@@ -387,7 +383,6 @@ static DEFINE_MUTEX(perf_sched_mutex);
 static atomic_t perf_sched_count;
 
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
 
 static atomic_t nr_mmap_events __read_mostly;
@@ -447,7 +442,7 @@ static void update_perf_cpu_limits(void)
 	WRITE_ONCE(perf_sample_allowed_ns, tmp);
 }
 
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc);
 
 int perf_proc_update_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
@@ -570,12 +565,6 @@ void perf_sample_event_took(u64 sample_l
 
 static atomic64_t perf_event_id;
 
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
-
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type);
-
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
 
@@ -690,13 +679,31 @@ do {									\
 	___p;								\
 })
 
+static void perf_ctx_disable(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		perf_pmu_disable(pmu_ctx->pmu);
+}
+
+static void perf_ctx_enable(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		perf_pmu_enable(pmu_ctx->pmu);
+}
+
+static void ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type);
+static void ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type);
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 
 	/* @event doesn't care about cgroup */
 	if (!event->cgrp)
@@ -822,6 +829,7 @@ perf_cgroup_set_timestamp(struct perf_cp
 	}
 }
 
+/* XXX: No need for the list now. Convert it to a per-cpu variable. */
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 /*
@@ -849,9 +857,9 @@ static void perf_cgroup_switch(struct ta
 			continue;
 
 		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-		perf_pmu_disable(cpuctx->ctx.pmu);
+		perf_ctx_disable(&cpuctx->ctx);
 
-		cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+		ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
 		/*
 		 * must not be done before ctxswout due
 		 * to update_cgrp_time_from_cpuctx() in
@@ -863,9 +871,9 @@ static void perf_cgroup_switch(struct ta
 		 * perf_cgroup_set_timestamp() in ctx_sched_in()
 		 * to not have to pass task around
 		 */
-		cpu_ctx_sched_in(cpuctx, EVENT_ALL);
+		ctx_sched_in(&cpuctx->ctx, EVENT_ALL);
 
-		perf_pmu_enable(cpuctx->ctx.pmu);
+		perf_ctx_enable(&cpuctx->ctx);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
 
@@ -887,7 +895,7 @@ static int perf_cgroup_ensure_storage(st
 		heap_size++;
 
 	for_each_possible_cpu(cpu) {
-		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		cpuctx = per_cpu_ptr(&cpu_context, cpu);
 		if (heap_size <= cpuctx->heap_size)
 			continue;
 
@@ -1068,34 +1076,30 @@ static void perf_cgroup_switch(struct ta
  */
 static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
-	struct perf_cpu_context *cpuctx;
+	struct perf_cpu_pmu_context *cpc;
 	bool rotations;
 
 	lockdep_assert_irqs_disabled();
 
-	cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-	rotations = perf_rotate_context(cpuctx);
+	cpc = container_of(hr, struct perf_cpu_pmu_context, hrtimer);
+	rotations = perf_rotate_context(cpc);
 
-	raw_spin_lock(&cpuctx->hrtimer_lock);
+	raw_spin_lock(&cpc->hrtimer_lock);
 	if (rotations)
-		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
+		hrtimer_forward_now(hr, cpc->hrtimer_interval);
 	else
-		cpuctx->hrtimer_active = 0;
-	raw_spin_unlock(&cpuctx->hrtimer_lock);
+		cpc->hrtimer_active = 0;
+	raw_spin_unlock(&cpc->hrtimer_lock);
 
 	return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
 }
 
-static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_pmu_context *cpc, int cpu)
 {
-	struct hrtimer *timer = &cpuctx->hrtimer;
-	struct pmu *pmu = cpuctx->ctx.pmu;
+	struct hrtimer *timer = &cpc->hrtimer;
+	struct pmu *pmu = cpc->epc.pmu;
 	u64 interval;
 
-	/* no multiplexing needed for SW PMU */
-	if (pmu->task_ctx_nr == perf_sw_context)
-		return;
-
 	/*
 	 * check default is sane, if not set then force to
 	 * default interval (1/tick)
@@ -1104,30 +1108,25 @@ static void __perf_mux_hrtimer_init(stru
 	if (interval < 1)
 		interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
 
-	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
+	cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	raw_spin_lock_init(&cpuctx->hrtimer_lock);
+	raw_spin_lock_init(&cpc->hrtimer_lock);
 	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
-static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_pmu_context *cpc)
 {
-	struct hrtimer *timer = &cpuctx->hrtimer;
-	struct pmu *pmu = cpuctx->ctx.pmu;
+	struct hrtimer *timer = &cpc->hrtimer;
 	unsigned long flags;
 
-	/* not for SW PMU */
-	if (pmu->task_ctx_nr == perf_sw_context)
-		return 0;
-
-	raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
-	if (!cpuctx->hrtimer_active) {
-		cpuctx->hrtimer_active = 1;
-		hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+	raw_spin_lock_irqsave(&cpc->hrtimer_lock, flags);
+	if (!cpc->hrtimer_active) {
+		cpc->hrtimer_active = 1;
+		hrtimer_forward_now(timer, cpc->hrtimer_interval);
 		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED_HARD);
 	}
-	raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
+	raw_spin_unlock_irqrestore(&cpc->hrtimer_lock, flags);
 
 	return 0;
 }
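
The multiplexing hrtimer now lives in perf_cpu_pmu_context, so there is one rotation
timer per (cpu, pmu) pair and the handler recovers its cpc from the embedded timer.
A standalone sketch of that embedding/recovery pattern (toy_* names illustrative;
the kernel uses container_of()):

#include <stddef.h>

struct toy_timer { int expired; };

struct toy_cpc {			/* ~ perf_cpu_pmu_context */
	int		active_oncpu;
	struct toy_timer hrtimer;	/* embedded: one timer per (cpu, pmu) */
};

static struct toy_cpc *toy_cpc_from_timer(struct toy_timer *t)
{
	/* same idea as container_of(hr, struct perf_cpu_pmu_context, hrtimer) */
	return (struct toy_cpc *)((char *)t - offsetof(struct toy_cpc, hrtimer));
}
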
@@ -1146,32 +1145,9 @@ void perf_pmu_enable(struct pmu *pmu)
 		pmu->pmu_enable(pmu);
 }
 
-static DEFINE_PER_CPU(struct list_head, active_ctx_list);
-
-/*
- * perf_event_ctx_activate(), perf_event_ctx_deactivate(), and
- * perf_event_task_tick() are fully serialized because they're strictly cpu
- * affine and perf_event_ctx{activate,deactivate} are called with IRQs
- * disabled, while perf_event_task_tick is called from IRQ context.
- */
-static void perf_event_ctx_activate(struct perf_event_context *ctx)
-{
-	struct list_head *head = this_cpu_ptr(&active_ctx_list);
-
-	lockdep_assert_irqs_disabled();
-
-	WARN_ON(!list_empty(&ctx->active_ctx_list));
-
-	list_add(&ctx->active_ctx_list, head);
-}
-
-static void perf_event_ctx_deactivate(struct perf_event_context *ctx)
+static void perf_assert_pmu_disabled(struct pmu *pmu)
 {
-	lockdep_assert_irqs_disabled();
-
-	WARN_ON(list_empty(&ctx->active_ctx_list));
-
-	list_del_init(&ctx->active_ctx_list);
+	WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
 }
 
 static void get_ctx(struct perf_event_context *ctx)
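
This hunk drops the active_ctx_list activate/deactivate helpers and adds
perf_assert_pmu_disabled(), which checks that the caller already holds a pmu disable
(the per-cpu pmu_disable_count is non-zero). A tiny standalone model of the nested
disable count being checked (illustrative only; the kernel count is per-cpu and the
check is WARN_ON_ONCE(), not a hard assert):

#include <assert.h>

struct toy_pmu { int disable_count; };	/* ~ per-cpu pmu_disable_count */

static void toy_pmu_disable(struct toy_pmu *p) { p->disable_count++; }
static void toy_pmu_enable(struct toy_pmu *p)  { p->disable_count--; }

static void toy_assert_pmu_disabled(struct toy_pmu *p)
{
	assert(p->disable_count != 0);	/* caller must already hold a disable */
}
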
@@ -1198,7 +1174,6 @@ static void free_ctx(struct rcu_head *he
 	struct perf_event_context *ctx;
 
 	ctx = container_of(head, struct perf_event_context, rcu_head);
-	free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
 	kfree(ctx);
 }
 
@@ -1383,7 +1358,7 @@ static u64 primary_event_id(struct perf_
  * the context could get moved to another task.
  */
 static struct perf_event_context *
-perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
+perf_lock_task_context(struct task_struct *task, unsigned long *flags)
 {
 	struct perf_event_context *ctx;
 
@@ -1399,7 +1374,7 @@ perf_lock_task_context(struct task_struc
 	 */
 	local_irq_save(*flags);
 	rcu_read_lock();
-	ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
+	ctx = rcu_dereference(task->perf_event_ctxp);
 	if (ctx) {
 		/*
 		 * If this context is a clone of another, it might
@@ -1412,7 +1387,7 @@ perf_lock_task_context(struct task_struc
 		 * can't get swapped on us any more.
 		 */
 		raw_spin_lock(&ctx->lock);
-		if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
+		if (ctx != rcu_dereference(task->perf_event_ctxp)) {
 			raw_spin_unlock(&ctx->lock);
 			rcu_read_unlock();
 			local_irq_restore(*flags);
@@ -1439,12 +1414,12 @@ perf_lock_task_context(struct task_struc
  * reference count so that the context can't get freed.
  */
 static struct perf_event_context *
-perf_pin_task_context(struct task_struct *task, int ctxn)
+perf_pin_task_context(struct task_struct *task)
 {
 	struct perf_event_context *ctx;
 	unsigned long flags;
 
-	ctx = perf_lock_task_context(task, ctxn, &flags);
+	ctx = perf_lock_task_context(task, &flags);
 	if (ctx) {
 		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
@@ -1590,14 +1565,22 @@ static inline struct cgroup *event_cgrou
  * which provides ordering when rotating groups for the same CPU.
  */
 static __always_inline int
-perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
-		      const u64 left_group_index, const struct perf_event *right)
+perf_event_groups_cmp(const int left_cpu, const struct pmu *left_pmu,
+		      const struct cgroup *left_cgroup, const u64 left_group_index,
+		      const struct perf_event *right)
 {
 	if (left_cpu < right->cpu)
 		return -1;
 	if (left_cpu > right->cpu)
 		return 1;
 
+	if (left_pmu) {
+		if (left_pmu < right->pmu_ctx->pmu)
+			return -1;
+		if (left_pmu > right->pmu_ctx->pmu)
+			return 1;
+	}
+
 #ifdef CONFIG_CGROUP_PERF
 	{
 		const struct cgroup *right_cgroup = event_cgroup(right);
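
With the pmu folded into perf_event_groups_cmp(), the group RB-tree is effectively
keyed by {cpu, pmu, cgroup, group_index}, compared lexicographically, and a NULL
left_pmu means "match any pmu" for partial/subtree lookups. A standalone sketch of
that ordering with a simplified key type (toy_* names illustrative only; the real
cgroup comparison additionally special-cases NULL cgroups):

struct toy_group_key {
	int			cpu;
	const void		*pmu;		/* NULL => ignore pmu (subtree match) */
	const void		*cgroup;
	unsigned long long	index;
};

static int toy_group_cmp(const struct toy_group_key *l,
			 const struct toy_group_key *r)
{
	if (l->cpu != r->cpu)
		return l->cpu < r->cpu ? -1 : 1;
	/* pointer-value ordering, as in the kernel comparator */
	if (l->pmu && l->pmu != r->pmu)
		return l->pmu < r->pmu ? -1 : 1;
	if (l->cgroup != r->cgroup)
		return l->cgroup < r->cgroup ? -1 : 1;
	if (l->index != r->index)
		return l->index < r->index ? -1 : 1;
	return 0;
}
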
@@ -1640,12 +1623,13 @@ perf_event_groups_cmp(const int left_cpu
 static inline bool __group_less(struct rb_node *a, const struct rb_node *b)
 {
 	struct perf_event *e = __node_2_pe(a);
-	return perf_event_groups_cmp(e->cpu, event_cgroup(e), e->group_index,
-				     __node_2_pe(b)) < 0;
+	return perf_event_groups_cmp(e->cpu, e->pmu_ctx->pmu, event_cgroup(e),
+				     e->group_index, __node_2_pe(b)) < 0;
 }
 
 struct __group_key {
 	int cpu;
+	struct pmu *pmu;
 	struct cgroup *cgroup;
 };
 
@@ -1654,14 +1638,25 @@ static inline int __group_cmp(const void
 	const struct __group_key *a = key;
 	const struct perf_event *b = __node_2_pe(node);
 
-	/* partial/subtree match: @cpu, @cgroup; ignore: @group_index */
-	return perf_event_groups_cmp(a->cpu, a->cgroup, b->group_index, b);
+	/* partial/subtree match: @cpu, @pmu, @cgroup; ignore: @group_index */
+	return perf_event_groups_cmp(a->cpu, a->pmu, a->cgroup, b->group_index, b);
+}
+
+static inline int
+__group_cmp_ignore_cgroup(const void *key, const struct rb_node *node)
+{
+	const struct __group_key *a = key;
+	const struct perf_event *b = __node_2_pe(node);
+
+	/* partial/subtree match: @cpu, @pmu, ignore: @cgroup, @group_index */
+	return perf_event_groups_cmp(a->cpu, a->pmu, event_cgroup(b),
+				     b->group_index, b);
 }
 
 /*
- * Insert @event into @groups' tree; using {@event->cpu, ++@groups->index} for
- * key (see perf_event_groups_less). This places it last inside the CPU
- * subtree.
+ * Insert @event into @groups' tree; using
+ *   {@event->cpu, @event->pmu_ctx->pmu, event_cgroup(@event), ++@groups->index}
+ * as key. This places it last inside the {cpu,pmu,cgroup} subtree.
  */
 static void
 perf_event_groups_insert(struct perf_event_groups *groups,
@@ -1711,14 +1706,15 @@ del_event_from_groups(struct perf_event
 }
 
 /*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the {cpu,pmu,cgroup} subtree.
  */
 static struct perf_event *
 perf_event_groups_first(struct perf_event_groups *groups, int cpu,
-			struct cgroup *cgrp)
+			struct pmu *pmu, struct cgroup *cgrp)
 {
 	struct __group_key key = {
 		.cpu = cpu,
+		.pmu = pmu,
 		.cgroup = cgrp,
 	};
 	struct rb_node *node;
@@ -1730,14 +1726,12 @@ perf_event_groups_first(struct perf_even
 	return NULL;
 }
 
-/*
- * Like rb_entry_next_safe() for the @cpu subtree.
- */
 static struct perf_event *
-perf_event_groups_next(struct perf_event *event)
+perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
 {
 	struct __group_key key = {
 		.cpu = event->cpu,
+		.pmu = pmu,
 		.cgroup = event_cgroup(event),
 	};
 	struct rb_node *next;
@@ -1793,6 +1787,7 @@ list_add_event(struct perf_event *event,
 		perf_cgroup_event_enable(event, ctx);
 
 	ctx->generation++;
+	event->pmu_ctx->nr_events++;
 }
 
 /*
@@ -2000,6 +1995,7 @@ list_del_event(struct perf_event *event,
 	}
 
 	ctx->generation++;
+	event->pmu_ctx->nr_events--;
 }
 
 static int
@@ -2016,13 +2012,11 @@ perf_aux_output_match(struct perf_event
 
 static void put_event(struct perf_event *event);
 static void event_sched_out(struct perf_event *event,
-			    struct perf_cpu_context *cpuctx,
 			    struct perf_event_context *ctx);
 
 static void perf_put_aux_event(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event *iter;
 
 	/*
@@ -2051,7 +2045,7 @@ static void perf_put_aux_event(struct pe
 		 * state so that we don't try to schedule it again. Note
 		 * that perf_event_enable() will clear the ERROR status.
 		 */
-		event_sched_out(iter, cpuctx, ctx);
+		event_sched_out(iter, ctx);
 		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 	}
 }
@@ -2102,8 +2096,8 @@ static int perf_get_aux_event(struct per
 
 static inline struct list_head *get_event_list(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
+	return event->attr.pinned ? &event->pmu_ctx->pinned_active :
+				    &event->pmu_ctx->flexible_active;
 }
 
 /*
@@ -2114,10 +2108,7 @@ static inline struct list_head *get_even
  */
 static inline void perf_remove_sibling_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-
-	event_sched_out(event, cpuctx, ctx);
+	event_sched_out(event, event->ctx);
 	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 }
 
@@ -2241,12 +2232,14 @@ event_filter_match(struct perf_event *ev
 }
 
 static void
-event_sched_out(struct perf_event *event,
-		  struct perf_cpu_context *cpuctx,
-		  struct perf_event_context *ctx)
+event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
 	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
 
+	// XXX cpc serialization, probably per-cpu IRQ disabled
+
 	WARN_ON_ONCE(event->ctx != ctx);
 	lockdep_assert_held(&ctx->lock);
 
@@ -2273,38 +2266,34 @@ event_sched_out(struct perf_event *event
 	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
-		cpuctx->active_oncpu--;
-	if (!--ctx->nr_active)
-		perf_event_ctx_deactivate(ctx);
+		cpc->active_oncpu--;
+	ctx->nr_active--;
+	event->pmu_ctx->nr_active--;
 	if (event->attr.freq && event->attr.sample_freq)
 		ctx->nr_freq--;
-	if (event->attr.exclusive || !cpuctx->active_oncpu)
-		cpuctx->exclusive = 0;
+	if (event->attr.exclusive || !cpc->active_oncpu)
+		cpc->exclusive = 0;
 
 	perf_pmu_enable(event->pmu);
 }
 
 static void
-group_sched_out(struct perf_event *group_event,
-		struct perf_cpu_context *cpuctx,
-		struct perf_event_context *ctx)
+group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
 {
 	struct perf_event *event;
 
 	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
-	perf_pmu_disable(ctx->pmu);
+	perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
 
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, ctx);
 
 	/*
 	 * Schedule out siblings (if any):
 	 */
 	for_each_sibling_event(event, group_event)
-		event_sched_out(event, cpuctx, ctx);
-
-	perf_pmu_enable(ctx->pmu);
+		event_sched_out(event, ctx);
 }
 
 #define DETACH_GROUP	0x01UL
@@ -2329,19 +2318,21 @@ __perf_remove_from_context(struct perf_e
 		update_cgrp_time_from_cpuctx(cpuctx, false);
 	}
 
-	event_sched_out(event, cpuctx, ctx);
+	event_sched_out(event, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
 
+	if (!event->pmu_ctx->nr_events)
+		event->pmu_ctx->rotate_necessary = 0;
+
 	if (!ctx->nr_events && ctx->is_active) {
 		if (ctx == &cpuctx->ctx)
 			update_cgrp_time_from_cpuctx(cpuctx, true);
 
 		ctx->is_active = 0;
-		ctx->rotate_necessary = 0;
 		if (ctx->task) {
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 			cpuctx->task_ctx = NULL;
@@ -2376,7 +2367,7 @@ static void perf_remove_from_context(str
 	 * cgrp_cpuctx_list.
 	 */
 	if (!ctx->is_active && !is_cgroup_event(event)) {
-		__perf_remove_from_context(event, __get_cpu_context(ctx),
+		__perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
 					   ctx, (void *)flags);
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
@@ -2402,13 +2393,17 @@ static void __perf_event_disable(struct
 		update_cgrp_time_from_event(event);
 	}
 
+	perf_pmu_disable(event->pmu_ctx->pmu);
+
 	if (event == event->group_leader)
-		group_sched_out(event, cpuctx, ctx);
+		group_sched_out(event, ctx);
 	else
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, ctx);
 
 	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 	perf_cgroup_event_disable(event, ctx);
+
+	perf_pmu_enable(event->pmu_ctx->pmu);
 }
 
 /*
@@ -2471,10 +2466,10 @@ static void perf_log_throttle(struct per
 static void perf_log_itrace_start(struct perf_event *event);
 
 static int
-event_sched_in(struct perf_event *event,
-		 struct perf_cpu_context *cpuctx,
-		 struct perf_event_context *ctx)
+event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
 	int ret = 0;
 
 	WARN_ON_ONCE(event->ctx != ctx);
@@ -2515,14 +2510,14 @@ event_sched_in(struct perf_event *event,
 	}
 
 	if (!is_software_event(event))
-		cpuctx->active_oncpu++;
-	if (!ctx->nr_active++)
-		perf_event_ctx_activate(ctx);
+		cpc->active_oncpu++;
+	ctx->nr_active++;
+	event->pmu_ctx->nr_active++;
 	if (event->attr.freq && event->attr.sample_freq)
 		ctx->nr_freq++;
 
 	if (event->attr.exclusive)
-		cpuctx->exclusive = 1;
+		cpc->exclusive = 1;
 
 out:
 	perf_pmu_enable(event->pmu);
@@ -2531,26 +2526,24 @@ event_sched_in(struct perf_event *event,
 }
 
 static int
-group_sched_in(struct perf_event *group_event,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
+group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx)
 {
 	struct perf_event *event, *partial_group = NULL;
-	struct pmu *pmu = ctx->pmu;
+	struct pmu *pmu = group_event->pmu_ctx->pmu;
 
 	if (group_event->state == PERF_EVENT_STATE_OFF)
 		return 0;
 
 	pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
 
-	if (event_sched_in(group_event, cpuctx, ctx))
+	if (event_sched_in(group_event, ctx))
 		goto error;
 
 	/*
 	 * Schedule in siblings as one group (if any):
 	 */
 	for_each_sibling_event(event, group_event) {
-		if (event_sched_in(event, cpuctx, ctx)) {
+		if (event_sched_in(event, ctx)) {
 			partial_group = event;
 			goto group_error;
 		}
@@ -2569,9 +2562,9 @@ group_sched_in(struct perf_event *group_
 		if (event == partial_group)
 			break;
 
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, ctx);
 	}
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, ctx);
 
 error:
 	pmu->cancel_txn(pmu);
@@ -2581,10 +2574,11 @@ group_sched_in(struct perf_event *group_
 /*
  * Work out whether we can put this event group on the CPU now.
  */
-static int group_can_go_on(struct perf_event *event,
-			   struct perf_cpu_context *cpuctx,
-			   int can_add_hw)
+static int group_can_go_on(struct perf_event *event, int can_add_hw)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+
 	/*
 	 * Groups consisting entirely of software events can always go on.
 	 */
@@ -2594,7 +2588,7 @@ static int group_can_go_on(struct perf_e
 	 * If an exclusive group is already on, no other hardware
 	 * events can go on.
 	 */
-	if (cpuctx->exclusive)
+	if (cpc->exclusive)
 		return 0;
 	/*
 	 * If this group is exclusive and there are already
@@ -2616,36 +2610,29 @@ static void add_event_to_ctx(struct perf
 	perf_group_attach(event);
 }
 
-static void ctx_sched_out(struct perf_event_context *ctx,
-			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
-static void
-ctx_sched_in(struct perf_event_context *ctx,
-	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type);
-
-static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			       struct perf_event_context *ctx,
-			       enum event_type_t event_type)
+static void task_ctx_sched_out(struct perf_event_context *ctx,
+				enum event_type_t event_type)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+
 	if (!cpuctx->task_ctx)
 		return;
 
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, event_type);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct perf_event_context *ctx)
 {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED);
+	ctx_sched_in(&cpuctx->ctx, EVENT_PINNED);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_in(ctx, EVENT_PINNED);
+	ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_in(ctx, EVENT_FLEXIBLE);
 }
 
 /*
@@ -2667,7 +2654,6 @@ static void ctx_resched(struct perf_cpu_
 			struct perf_event_context *task_ctx,
 			enum event_type_t event_type)
 {
-	enum event_type_t ctx_event_type;
 	bool cpu_event = !!(event_type & EVENT_CPU);
 
 	/*
@@ -2677,11 +2663,13 @@ static void ctx_resched(struct perf_cpu_
 	if (event_type & EVENT_PINNED)
 		event_type |= EVENT_FLEXIBLE;
 
-	ctx_event_type = event_type & EVENT_ALL;
+	event_type &= EVENT_ALL;
 
-	perf_pmu_disable(cpuctx->ctx.pmu);
-	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx, event_type);
+	perf_ctx_disable(&cpuctx->ctx);
+	if (task_ctx) {
+		perf_ctx_disable(task_ctx);
+		task_ctx_sched_out(task_ctx, event_type);
+	}
 
 	/*
 	 * Decide which cpu ctx groups to schedule out based on the types
@@ -2691,17 +2679,20 @@ static void ctx_resched(struct perf_cpu_
 	 *  - otherwise, do nothing more.
 	 */
 	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
-	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(&cpuctx->ctx, event_type);
+	else if (event_type & EVENT_PINNED)
+		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
 
 	perf_event_sched_in(cpuctx, task_ctx);
-	perf_pmu_enable(cpuctx->ctx.pmu);
+
+	perf_ctx_enable(&cpuctx->ctx);
+	if (task_ctx)
+		perf_ctx_enable(task_ctx);
 }
 
 void perf_pmu_resched(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 
 	perf_ctx_lock(cpuctx, task_ctx);
@@ -2719,7 +2710,7 @@ static int  __perf_install_in_context(vo
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	bool reprogram = true;
 	int ret = 0;
@@ -2761,7 +2752,7 @@ static int  __perf_install_in_context(vo
 #endif
 
 	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx, get_event_type(event));
 	} else {
@@ -2909,7 +2900,7 @@ static void __perf_event_enable(struct p
 		return;
 
 	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, EVENT_TIME);
 
 	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 	perf_cgroup_event_enable(event, ctx);
@@ -2918,7 +2909,7 @@ static void __perf_event_enable(struct p
 		return;
 
 	if (!event_filter_match(event)) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_in(ctx, EVENT_TIME);
 		return;
 	}
 
@@ -2927,7 +2918,7 @@ static void __perf_event_enable(struct p
 	 * then don't put it on unless the group is on.
 	 */
 	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_in(ctx, EVENT_TIME);
 		return;
 	}
 
@@ -3196,11 +3187,52 @@ static int perf_event_modify_attr(struct
 	return err;
 }
 
-static void ctx_sched_out(struct perf_event_context *ctx,
-			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
+				enum event_type_t event_type)
 {
+	struct perf_event_context *ctx = pmu_ctx->ctx;
 	struct perf_event *event, *tmp;
+	struct pmu *pmu = pmu_ctx->pmu;
+
+	if (ctx->task && !ctx->is_active) {
+		struct perf_cpu_pmu_context *cpc;
+
+		cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+		WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+		cpc->task_epc = NULL;
+	}
+
+	if (!event_type)
+		return;
+
+	perf_pmu_disable(pmu);
+	if (event_type & EVENT_PINNED) {
+		list_for_each_entry_safe(event, tmp,
+				&pmu_ctx->pinned_active,
+				active_list)
+			group_sched_out(event, ctx);
+	}
+
+	if (event_type & EVENT_FLEXIBLE) {
+		list_for_each_entry_safe(event, tmp,
+				&pmu_ctx->flexible_active,
+				active_list)
+			group_sched_out(event, ctx);
+		/*
+		 * Since we cleared EVENT_FLEXIBLE, also clear
+		 * rotate_necessary, it will be reset by
+		 * ctx_flexible_sched_in() when needed.
+		 */
+		pmu_ctx->rotate_necessary = 0;
+	}
+	perf_pmu_enable(pmu);
+}
+
+static void
+ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
 
 	lockdep_assert_held(&ctx->lock);
@@ -3251,24 +3283,8 @@ static void ctx_sched_out(struct perf_ev
 	if (!ctx->nr_active || !(is_active & EVENT_ALL))
 		return;
 
-	perf_pmu_disable(ctx->pmu);
-	if (is_active & EVENT_PINNED) {
-		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
-			group_sched_out(event, cpuctx, ctx);
-	}
-
-	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
-			group_sched_out(event, cpuctx, ctx);
-
-		/*
-		 * Since we cleared EVENT_FLEXIBLE, also clear
-		 * rotate_necessary, is will be reset by
-		 * ctx_flexible_sched_in() when needed.
-		 */
-		ctx->rotate_necessary = 0;
-	}
-	perf_pmu_enable(ctx->pmu);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		__pmu_ctx_sched_out(pmu_ctx, is_active);
 }
 
 /*
@@ -3373,26 +3389,65 @@ static void perf_event_sync_stat(struct
 	}
 }
 
-static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
-					 struct task_struct *next)
+static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
+					  struct perf_event_context *next_ctx)
+{
+	struct perf_event_pmu_context *prev_epc, *next_epc;
+
+	if (!prev_ctx->nr_task_data)
+		return;
+
+	prev_epc = list_first_entry(&prev_ctx->pmu_ctx_list,
+				    struct perf_event_pmu_context,
+				    pmu_ctx_entry);
+	next_epc = list_first_entry(&next_ctx->pmu_ctx_list,
+				    struct perf_event_pmu_context,
+				    pmu_ctx_entry);
+
+	while (&prev_epc->pmu_ctx_entry != &prev_ctx->pmu_ctx_list &&
+	       &next_epc->pmu_ctx_entry != &next_ctx->pmu_ctx_list) {
+
+		WARN_ON_ONCE(prev_epc->pmu != next_epc->pmu);
+
+		/*
+		 * PMU specific parts of task perf context can require
+		 * additional synchronization. As an example of such
+		 * synchronization see implementation details of Intel
+		 * LBR call stack data profiling;
+		 */
+		if (prev_epc->pmu->swap_task_ctx)
+			prev_epc->pmu->swap_task_ctx(prev_epc, next_epc);
+		else
+			swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
+	}
+}
+
+static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in)
 {
-	struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
+	struct perf_event_pmu_context *pmu_ctx;
+	struct perf_cpu_pmu_context *cpc;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+
+		if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
+			pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
+	}
+}
+
+static void
+perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
+{
+	struct perf_event_context *ctx = task->perf_event_ctxp;
 	struct perf_event_context *next_ctx;
 	struct perf_event_context *parent, *next_parent;
-	struct perf_cpu_context *cpuctx;
 	int do_switch = 1;
-	struct pmu *pmu;
 
 	if (likely(!ctx))
 		return;
 
-	pmu = ctx->pmu;
-	cpuctx = __get_cpu_context(ctx);
-	if (!cpuctx->task_ctx)
-		return;
-
 	rcu_read_lock();
-	next_ctx = next->perf_event_ctxp[ctxn];
+	next_ctx = rcu_dereference(next->perf_event_ctxp);
 	if (!next_ctx)
 		goto unlock;
 
@@ -3420,23 +3475,12 @@ static void perf_event_context_sched_out
 			WRITE_ONCE(ctx->task, next);
 			WRITE_ONCE(next_ctx->task, task);
 
-			perf_pmu_disable(pmu);
+			perf_ctx_disable(ctx);
 
-			if (cpuctx->sched_cb_usage && pmu->sched_task)
-				pmu->sched_task(ctx, false);
+			perf_ctx_sched_task_cb(ctx, false);
+			perf_event_swap_task_ctx_data(ctx, next_ctx);
 
-			/*
-			 * PMU specific parts of task perf context can require
-			 * additional synchronization. As an example of such
-			 * synchronization see implementation details of Intel
-			 * LBR call stack data profiling;
-			 */
-			if (pmu->swap_task_ctx)
-				pmu->swap_task_ctx(ctx, next_ctx);
-			else
-				swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
-
-			perf_pmu_enable(pmu);
+			perf_ctx_enable(ctx);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
@@ -3445,8 +3489,8 @@ static void perf_event_context_sched_out
 			 * since those values are always verified under
 			 * ctx->lock which we're now holding.
 			 */
-			RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], next_ctx);
-			RCU_INIT_POINTER(next->perf_event_ctxp[ctxn], ctx);
+			RCU_INIT_POINTER(task->perf_event_ctxp, next_ctx);
+			RCU_INIT_POINTER(next->perf_event_ctxp, ctx);
 
 			do_switch = 0;
 
@@ -3460,37 +3504,39 @@ static void perf_event_context_sched_out
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		perf_pmu_disable(pmu);
+		perf_ctx_disable(ctx);
 
-		if (cpuctx->sched_cb_usage && pmu->sched_task)
-			pmu->sched_task(ctx, false);
-		task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+		perf_ctx_sched_task_cb(ctx, false);
+		task_ctx_sched_out(ctx, EVENT_ALL);
 
-		perf_pmu_enable(pmu);
+		perf_ctx_enable(ctx);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
 
 static DEFINE_PER_CPU(struct list_head, sched_cb_list);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 void perf_sched_cb_dec(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
 
 	this_cpu_dec(perf_sched_cb_usages);
+	barrier();
 
-	if (!--cpuctx->sched_cb_usage)
-		list_del(&cpuctx->sched_cb_entry);
+	if (!--cpc->sched_cb_usage)
+		list_del(&cpc->sched_cb_entry);
 }
 
 
 void perf_sched_cb_inc(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
 
-	if (!cpuctx->sched_cb_usage++)
-		list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+	if (!cpc->sched_cb_usage++)
+		list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
 
+	barrier();
 	this_cpu_inc(perf_sched_cb_usages);
 }
 
@@ -3502,19 +3548,21 @@ void perf_sched_cb_inc(struct pmu *pmu)
  * PEBS requires this to provide PID/TID information. This requires we flush
  * all queued PEBS records before we context switch to a new task.
  */
-static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc, bool sched_in)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct pmu *pmu;
 
-	pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
+	pmu = cpc->epc.pmu;
 
+	/* software PMUs will not have sched_task */
 	if (WARN_ON_ONCE(!pmu->sched_task))
 		return;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(pmu);
 
-	pmu->sched_task(cpuctx->task_ctx, sched_in);
+	pmu->sched_task(cpc->task_epc, sched_in);
 
 	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3524,26 +3572,20 @@ static void perf_pmu_sched_task(struct t
 				struct task_struct *next,
 				bool sched_in)
 {
-	struct perf_cpu_context *cpuctx;
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_cpu_pmu_context *cpc;
 
-	if (prev == next)
+	/* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
+	if (prev == next || cpuctx->task_ctx)
 		return;
 
-	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
-		/* will be handled in perf_event_context_sched_in/out */
-		if (cpuctx->task_ctx)
-			continue;
-
-		__perf_pmu_sched_task(cpuctx, sched_in);
-	}
+	list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+		__perf_pmu_sched_task(cpc, sched_in);
 }
 
 static void perf_event_switch(struct task_struct *task,
 			      struct task_struct *next_prev, bool sched_in);
 
-#define for_each_task_context_nr(ctxn)					\
-	for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
 /*
  * Called from scheduler to remove the events of the current task,
  * with interrupts disabled.
@@ -3558,16 +3600,13 @@ static void perf_event_switch(struct tas
 void __perf_event_task_sched_out(struct task_struct *task,
 				 struct task_struct *next)
 {
-	int ctxn;
-
 	if (__this_cpu_read(perf_sched_cb_usages))
 		perf_pmu_sched_task(task, next, false);
 
 	if (atomic_read(&nr_switch_events))
 		perf_event_switch(task, next, false);
 
-	for_each_task_context_nr(ctxn)
-		perf_event_context_sched_out(task, ctxn, next);
+	perf_event_context_sched_out(task, next);
 
 	/*
 	 * if cgroup events exist on this CPU, then we need
@@ -3578,15 +3617,6 @@ void __perf_event_task_sched_out(struct
 		perf_cgroup_switch(next);
 }
 
-/*
- * Called with IRQs disabled
- */
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
-{
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
-}
-
 static bool perf_less_group_idx(const void *l, const void *r)
 {
 	const struct perf_event *le = *(const struct perf_event **)l;
@@ -3618,21 +3648,36 @@ static void __heap_add(struct min_heap *
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+static void __link_epc(struct perf_event_pmu_context *pmu_ctx)
+{
+	struct perf_cpu_pmu_context *cpc;
+
+	if (!pmu_ctx->ctx->task)
+		return;
+
+	cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+	WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+	cpc->task_epc = pmu_ctx;
+}
+
+static noinline int visit_groups_merge(struct perf_event_context *ctx,
 				struct perf_event_groups *groups, int cpu,
+				struct pmu *pmu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
 #endif
+	struct perf_cpu_context *cpuctx = NULL;
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_heap event_heap;
 	struct perf_event **evt;
 	int ret;
 
-	if (cpuctx) {
+	if (!ctx->task) {
+		cpuctx = this_cpu_ptr(&cpu_context);
 		event_heap = (struct min_heap){
 			.data = cpuctx->heap,
 			.nr = 0,
@@ -3652,17 +3697,28 @@ static noinline int visit_groups_merge(s
 			.size = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
 
 #ifdef CONFIG_CGROUP_PERF
 	for (; css; css = css->parent)
-		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
 #endif
 
+	if (event_heap.nr) {
+		/*
+		 * XXX: For now, visit_groups_merge() gets called with pmu
+		 * pointer never NULL. But these functions need to be called
+		 * once for each pmu if I implement the pmu=NULL optimization.
+		 */
+		__link_epc((*evt)->pmu_ctx);
+		perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
+	}
+
+
 	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.nr) {
@@ -3670,7 +3726,7 @@ static noinline int visit_groups_merge(s
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		*evt = perf_event_groups_next(*evt, pmu);
 		if (*evt)
 			min_heapify(&event_heap, 0, &perf_min_heap);
 		else
@@ -3712,7 +3768,6 @@ static inline void group_update_userpage
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	int *can_add_hw = data;
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3721,8 +3776,8 @@ static int merge_sched_in(struct perf_ev
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
-		if (!group_sched_in(event, cpuctx, ctx))
+	if (group_can_go_on(event, *can_add_hw)) {
+		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
 	}
 
@@ -3732,8 +3787,11 @@ static int merge_sched_in(struct perf_ev
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 		} else {
-			ctx->rotate_necessary = 1;
-			perf_mux_hrtimer_restart(cpuctx);
+			struct perf_cpu_pmu_context *cpc;
+
+			event->pmu_ctx->rotate_necessary = 1;
+			cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
+			perf_mux_hrtimer_restart(cpc);
 			group_update_userpage(event);
 		}
 	}
@@ -3741,39 +3799,67 @@ static int merge_sched_in(struct perf_ev
 	return 0;
 }
 
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
+static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	int can_add_hw = 1;
 
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->pinned_groups,
-			   smp_processor_id(),
-			   merge_sched_in, &can_add_hw);
+	if (pmu) {
+		visit_groups_merge(ctx, &ctx->pinned_groups,
+				   smp_processor_id(), pmu,
+				   merge_sched_in, &can_add_hw);
+	} else {
+		/*
+		 * XXX: This can be optimized for per-task context by calling
+		 * visit_groups_merge() only once with:
+		 *   1) pmu=NULL
+		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
+		 *   3) Making can_add_hw a per-pmu variable
+		 *
+		 * However, it cannot be optimized for per-cpu context because
+		 * the per-cpu rb-tree consists of pmu-subtrees, and pmu-subtrees
+		 * consist of cgroup-subtrees; i.e. cgroup events of the same
+		 * cgroup but different pmus are separated out into their
+		 * respective pmu-subtrees.
+		 */
+		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+			can_add_hw = 1;
+			visit_groups_merge(ctx, &ctx->pinned_groups,
+					   smp_processor_id(), pmu_ctx->pmu,
+					   merge_sched_in, &can_add_hw);
+		}
+	}
 }
 
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
+/* XXX .busy thingy from Peter's patch */
+static void ctx_flexible_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	int can_add_hw = 1;
 
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
+	if (pmu) {
+		visit_groups_merge(ctx, &ctx->flexible_groups,
+				   smp_processor_id(), pmu,
+				   merge_sched_in, &can_add_hw);
+	} else {
+		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+			can_add_hw = 1;
+			visit_groups_merge(ctx, &ctx->flexible_groups,
+					   smp_processor_id(), pmu_ctx->pmu,
+					   merge_sched_in, &can_add_hw);
+		}
+	}
+}
 
-	visit_groups_merge(cpuctx, &ctx->flexible_groups,
-			   smp_processor_id(),
-			   merge_sched_in, &can_add_hw);
+static void __pmu_ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
+{
+	ctx_flexible_sched_in(ctx, pmu);
 }
 
 static void
-ctx_sched_in(struct perf_event_context *ctx,
-	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type)
+ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	int is_active = ctx->is_active;
 
 	lockdep_assert_held(&ctx->lock);
@@ -3785,6 +3871,7 @@ ctx_sched_in(struct perf_event_context *
 		/* start ctx time */
 		__update_context_time(ctx, false);
 		perf_cgroup_set_timestamp(cpuctx);
+		// XXX ctx->task =? task
 		/*
 		 * CPU-release for the below ->is_active store,
 		 * see __load_acquire() in perf_event_time_now()
@@ -3807,39 +3894,32 @@ ctx_sched_in(struct perf_event_context *
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		ctx_pinned_sched_in(ctx, NULL);
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+		ctx_flexible_sched_in(ctx, NULL);
 }
 
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type)
+static void perf_event_context_sched_in(struct task_struct *task)
 {
-	struct perf_event_context *ctx = &cpuctx->ctx;
-
-	ctx_sched_in(ctx, cpuctx, event_type);
-}
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_context *ctx;
 
-static void perf_event_context_sched_in(struct perf_event_context *ctx,
-					struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
+	rcu_read_lock();
+	ctx = rcu_dereference(task->perf_event_ctxp);
+	if (!ctx)
+		goto rcu_unlock;
 
-	cpuctx = __get_cpu_context(ctx);
+	if (cpuctx->task_ctx == ctx) {
+		perf_ctx_lock(cpuctx, ctx);
+		perf_ctx_disable(ctx);
 
-	/*
-	 * HACK: for HETEROGENEOUS the task context might have switched to a
-	 * different PMU, force (re)set the context,
-	 */
-	pmu = ctx->pmu = cpuctx->ctx.pmu;
+		perf_ctx_sched_task_cb(ctx, true);
 
-	if (cpuctx->task_ctx == ctx) {
-		if (cpuctx->sched_cb_usage)
-			__perf_pmu_sched_task(cpuctx, true);
-		return;
+		perf_ctx_enable(ctx);
+		perf_ctx_unlock(cpuctx, ctx);
+		goto rcu_unlock;
 	}
 
 	perf_ctx_lock(cpuctx, ctx);
@@ -3850,7 +3930,7 @@ static void perf_event_context_sched_in(
 	if (!ctx->nr_events)
 		goto unlock;
 
-	perf_pmu_disable(pmu);
+	perf_ctx_disable(ctx);
 	/*
 	 * We want to keep the following priority order:
 	 * cpu pinned (that don't need to move), task pinned,
@@ -3859,17 +3939,24 @@ static void perf_event_context_sched_in(
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
+		perf_ctx_disable(&cpuctx->ctx);
+		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
+	}
+
 	perf_event_sched_in(cpuctx, ctx);
 
-	if (cpuctx->sched_cb_usage && pmu->sched_task)
-		pmu->sched_task(cpuctx->task_ctx, true);
+	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
 
-	perf_pmu_enable(pmu);
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
+		perf_ctx_enable(&cpuctx->ctx);
+
+	perf_ctx_enable(ctx);
 
 unlock:
 	perf_ctx_unlock(cpuctx, ctx);
+rcu_unlock:
+	rcu_read_unlock();
 }
 
 /*
@@ -3886,16 +3973,7 @@ static void perf_event_context_sched_in(
 void __perf_event_task_sched_in(struct task_struct *prev,
 				struct task_struct *task)
 {
-	struct perf_event_context *ctx;
-	int ctxn;
-
-	for_each_task_context_nr(ctxn) {
-		ctx = task->perf_event_ctxp[ctxn];
-		if (likely(!ctx))
-			continue;
-
-		perf_event_context_sched_in(ctx, task);
-	}
+	perf_event_context_sched_in(task);
 
 	if (atomic_read(&nr_switch_events))
 		perf_event_switch(task, prev, true);
@@ -4014,8 +4092,8 @@ static void perf_adjust_period(struct pe
  * events. At the same time, make sure, having freq events does not change
  * the rate of unthrottling as that would introduce bias.
  */
-static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
-					   int needs_unthr)
+static void
+perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
@@ -4027,16 +4105,16 @@ static void perf_adjust_freq_unthr_conte
 	 * - context have events in frequency mode (needs freq adjust)
 	 * - there are events to unthrottle on this cpu
 	 */
-	if (!(ctx->nr_freq || needs_unthr))
+	if (!(ctx->nr_freq || unthrottle))
 		return;
 
 	raw_spin_lock(&ctx->lock);
-	perf_pmu_disable(ctx->pmu);
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
 
+		// XXX use visit thingy to avoid the -1,cpu match
 		if (!event_filter_match(event))
 			continue;
 
@@ -4077,7 +4155,6 @@ static void perf_adjust_freq_unthr_conte
 		perf_pmu_enable(event->pmu);
 	}
 
-	perf_pmu_enable(ctx->pmu);
 	raw_spin_unlock(&ctx->lock);
 }
 
@@ -4099,72 +4176,111 @@ static void rotate_ctx(struct perf_event
 
 /* pick an event from the flexible_groups to rotate */
 static inline struct perf_event *
-ctx_event_to_rotate(struct perf_event_context *ctx)
+ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
 {
 	struct perf_event *event;
+	struct rb_node *node;
+	struct rb_root *tree;
+	struct __group_key key = {
+		.pmu = pmu_ctx->pmu,
+	};
 
 	/* pick the first active flexible event */
-	event = list_first_entry_or_null(&ctx->flexible_active,
+	event = list_first_entry_or_null(&pmu_ctx->flexible_active,
 					 struct perf_event, active_list);
+	if (event)
+		goto out;
 
 	/* if no active flexible event, pick the first event */
-	if (!event) {
-		event = rb_entry_safe(rb_first(&ctx->flexible_groups.tree),
-				      typeof(*event), group_node);
+	tree = &pmu_ctx->ctx->flexible_groups.tree;
+
+	if (!pmu_ctx->ctx->task) {
+		key.cpu = smp_processor_id();
+
+		node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+		if (node)
+			event = __node_2_pe(node);
+		goto out;
 	}
 
+	key.cpu = -1;
+	node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+	if (node) {
+		event = __node_2_pe(node);
+		goto out;
+	}
+
+	key.cpu = smp_processor_id();
+	node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+	if (node)
+		event = __node_2_pe(node);
+
+out:
 	/*
 	 * Unconditionally clear rotate_necessary; if ctx_flexible_sched_in()
 	 * finds there are unschedulable events, it will set it again.
 	 */
-	ctx->rotate_necessary = 0;
+	pmu_ctx->rotate_necessary = 0;
 
 	return event;
 }
 
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
 	struct perf_event *cpu_event = NULL, *task_event = NULL;
 	struct perf_event_context *task_ctx = NULL;
 	int cpu_rotate, task_rotate;
+	struct pmu *pmu;
 
 	/*
 	 * Since we run this from IRQ context, nobody can install new
 	 * events, thus the event count values are stable.
 	 */
 
-	cpu_rotate = cpuctx->ctx.rotate_necessary;
+	cpu_epc = &cpc->epc;
+	pmu = cpu_epc->pmu;
+	task_epc = cpc->task_epc;
+
+	cpu_rotate = cpu_epc->rotate_necessary;
 	task_ctx = cpuctx->task_ctx;
-	task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
+	task_rotate = task_epc ? task_epc->rotate_necessary : 0;
 
 	if (!(cpu_rotate || task_rotate))
 		return false;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-	perf_pmu_disable(cpuctx->ctx.pmu);
+	perf_pmu_disable(pmu);
 
 	if (task_rotate)
-		task_event = ctx_event_to_rotate(task_ctx);
+		task_event = ctx_event_to_rotate(task_epc);
 	if (cpu_rotate)
-		cpu_event = ctx_event_to_rotate(&cpuctx->ctx);
+		cpu_event = ctx_event_to_rotate(cpu_epc);
 
 	/*
 	 * As per the order given at ctx_resched() first 'pop' task flexible
 	 * and then, if needed CPU flexible.
 	 */
-	if (task_event || (task_ctx && cpu_event))
-		ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
-	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (task_event || (task_epc && cpu_event)) {
+		update_context_time(task_epc->ctx);
+		__pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
+	}
 
-	if (task_event)
-		rotate_ctx(task_ctx, task_event);
-	if (cpu_event)
+	if (cpu_event) {
+		update_context_time(&cpuctx->ctx);
+		__pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
 		rotate_ctx(&cpuctx->ctx, cpu_event);
+		__pmu_ctx_sched_in(&cpuctx->ctx, pmu);
+	}
 
-	perf_event_sched_in(cpuctx, task_ctx);
+	if (task_event)
+		rotate_ctx(task_epc->ctx, task_event);
+
+	if (task_event || (task_epc && cpu_event))
+		__pmu_ctx_sched_in(task_epc->ctx, pmu);
 
-	perf_pmu_enable(cpuctx->ctx.pmu);
+	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 
 	return true;
@@ -4172,8 +4288,8 @@ static bool perf_rotate_context(struct p
 
 void perf_event_task_tick(void)
 {
-	struct list_head *head = this_cpu_ptr(&active_ctx_list);
-	struct perf_event_context *ctx, *tmp;
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_context *ctx;
 	int throttled;
 
 	lockdep_assert_irqs_disabled();
@@ -4182,8 +4298,13 @@ void perf_event_task_tick(void)
 	throttled = __this_cpu_xchg(perf_throttled_count, 0);
 	tick_dep_clear_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
 
-	list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
-		perf_adjust_freq_unthr_context(ctx, throttled);
+	perf_adjust_freq_unthr_context(&cpuctx->ctx, !!throttled);
+
+	rcu_read_lock();
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
+		perf_adjust_freq_unthr_context(ctx, !!throttled);
+	rcu_read_unlock();
 }
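
With active_ctx_list gone, the tick path reduces to adjusting the single per-CPU
context and then, under RCU, the current task's single context, if any. A
trimmed-down, non-kernel sketch of that shape (toy_* names illustrative; the real
code uses rcu_dereference() on current->perf_event_ctxp):

struct toy_ctx { int nr_freq; };

static void toy_adjust_freq_unthr(struct toy_ctx *ctx, int unthrottle)
{
	(void)ctx; (void)unthrottle;	/* stand-in for perf_adjust_freq_unthr_context() */
}

static void toy_task_tick(struct toy_ctx *cpu_ctx, struct toy_ctx *curr_task_ctx,
			  int throttled)
{
	toy_adjust_freq_unthr(cpu_ctx, throttled);	/* always: this CPU's context */
	if (curr_task_ctx)				/* at most one task context now */
		toy_adjust_freq_unthr(curr_task_ctx, throttled);
}
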
 
 static int event_enable_on_exec(struct perf_event *event,
@@ -4205,9 +4326,9 @@ static int event_enable_on_exec(struct p
  * Enable all of a task's events that have been marked enable-on-exec.
  * This expects task == current.
  */
-static void perf_event_enable_on_exec(int ctxn)
+static void perf_event_enable_on_exec(struct perf_event_context *ctx)
 {
-	struct perf_event_context *ctx, *clone_ctx = NULL;
+	struct perf_event_context *clone_ctx = NULL;
 	enum event_type_t event_type = 0;
 	struct perf_cpu_context *cpuctx;
 	struct perf_event *event;
@@ -4215,13 +4336,16 @@ static void perf_event_enable_on_exec(in
 	int enabled = 0;
 
 	local_irq_save(flags);
-	ctx = current->perf_event_ctxp[ctxn];
-	if (!ctx || !ctx->nr_events)
+	if (WARN_ON_ONCE(current->perf_event_ctxp != ctx))
+		goto out;
+
+	if (!ctx->nr_events)
 		goto out;
 
-	cpuctx = __get_cpu_context(ctx);
+	cpuctx = this_cpu_ptr(&cpu_context);
 	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, EVENT_TIME);
+
 	list_for_each_entry(event, &ctx->event_list, event_entry) {
 		enabled |= event_enable_on_exec(event, ctx);
 		event_type |= get_event_type(event);
@@ -4234,7 +4358,7 @@ static void perf_event_enable_on_exec(in
 		clone_ctx = unclone_ctx(ctx);
 		ctx_resched(cpuctx, ctx, event_type);
 	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_in(ctx, EVENT_TIME);
 	}
 	perf_ctx_unlock(cpuctx, ctx);
 
@@ -4253,16 +4377,14 @@ static void perf_event_exit_event(struct
  * Removes all events from the current task that have been marked
  * remove-on-exec, and feeds their values back to parent events.
  */
-static void perf_event_remove_on_exec(int ctxn)
+static void perf_event_remove_on_exec(struct perf_event_context *ctx)
 {
-	struct perf_event_context *ctx, *clone_ctx = NULL;
+	struct perf_event_context *clone_ctx = NULL;
 	struct perf_event *event, *next;
 	unsigned long flags;
 	bool modified = false;
 
-	ctx = perf_pin_task_context(current, ctxn);
-	if (!ctx)
-		return;
+	perf_pin_task_context(current);
 
 	mutex_lock(&ctx->mutex);
 
@@ -4326,7 +4448,7 @@ static void __perf_event_read(void *info
 	struct perf_read_data *data = info;
 	struct perf_event *sub, *event = data->event;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct pmu *pmu = event->pmu;
 
 	/*
@@ -4552,17 +4674,25 @@ static void __perf_event_init_context(st
 {
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
-	INIT_LIST_HEAD(&ctx->active_ctx_list);
+	INIT_LIST_HEAD(&ctx->pmu_ctx_list);
 	perf_event_groups_init(&ctx->pinned_groups);
 	perf_event_groups_init(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
-	INIT_LIST_HEAD(&ctx->pinned_active);
-	INIT_LIST_HEAD(&ctx->flexible_active);
 	refcount_set(&ctx->refcount, 1);
 }
 
+static void
+__perf_init_event_pmu_context(struct perf_event_pmu_context *epc, struct pmu *pmu)
+{
+	epc->pmu = pmu;
+	INIT_LIST_HEAD(&epc->pmu_ctx_entry);
+	INIT_LIST_HEAD(&epc->pinned_active);
+	INIT_LIST_HEAD(&epc->flexible_active);
+	atomic_set(&epc->refcount, 1);
+}
+
 static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct task_struct *task)
 {
 	struct perf_event_context *ctx;
 
@@ -4573,7 +4703,6 @@ alloc_perf_context(struct pmu *pmu, stru
 	__perf_event_init_context(ctx);
 	if (task)
 		ctx->task = get_task_struct(task);
-	ctx->pmu = pmu;
 
 	return ctx;
 }
@@ -4602,15 +4731,12 @@ find_lively_task_by_vpid(pid_t vpid)
  * Returns a matching context with refcount and pincount.
  */
 static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task,
-		struct perf_event *event)
+find_get_context(struct task_struct *task, struct perf_event *event)
 {
 	struct perf_event_context *ctx, *clone_ctx = NULL;
 	struct perf_cpu_context *cpuctx;
-	void *task_ctx_data = NULL;
 	unsigned long flags;
-	int ctxn, err;
-	int cpu = event->cpu;
+	int err;
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
@@ -4618,7 +4744,7 @@ find_get_context(struct pmu *pmu, struct
 		if (err)
 			return ERR_PTR(err);
 
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+		cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
 		ctx = &cpuctx->ctx;
 		get_ctx(ctx);
 		raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -4629,43 +4755,22 @@ find_get_context(struct pmu *pmu, struct
 	}
 
 	err = -EINVAL;
-	ctxn = pmu->task_ctx_nr;
-	if (ctxn < 0)
-		goto errout;
-
-	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
-		task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!task_ctx_data) {
-			err = -ENOMEM;
-			goto errout;
-		}
-	}
-
 retry:
-	ctx = perf_lock_task_context(task, ctxn, &flags);
+	ctx = perf_lock_task_context(task, &flags);
 	if (ctx) {
 		clone_ctx = unclone_ctx(ctx);
 		++ctx->pin_count;
 
-		if (task_ctx_data && !ctx->task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
 		if (clone_ctx)
 			put_ctx(clone_ctx);
 	} else {
-		ctx = alloc_perf_context(pmu, task);
+		ctx = alloc_perf_context(task);
 		err = -ENOMEM;
 		if (!ctx)
 			goto errout;
 
-		if (task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
-
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
 		/*
@@ -4674,12 +4779,12 @@ find_get_context(struct pmu *pmu, struct
 		 */
 		if (task->flags & PF_EXITING)
 			err = -ESRCH;
-		else if (task->perf_event_ctxp[ctxn])
+		else if (task->perf_event_ctxp)
 			err = -EAGAIN;
 		else {
 			get_ctx(ctx);
 			++ctx->pin_count;
-			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+			rcu_assign_pointer(task->perf_event_ctxp, ctx);
 		}
 		mutex_unlock(&task->perf_event_mutex);
 
@@ -4692,14 +4797,117 @@ find_get_context(struct pmu *pmu, struct
 		}
 	}
 
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ctx;
 
 errout:
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ERR_PTR(err);
 }
 
+struct perf_event_pmu_context *
+find_get_pmu_context(struct pmu *pmu, struct perf_event_context *ctx,
+		     struct perf_event *event)
+{
+	struct perf_event_pmu_context *new = NULL, *epc;
+	void *task_ctx_data = NULL;
+
+	if (!ctx->task) {
+		struct perf_cpu_pmu_context *cpc;
+
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+		epc = &cpc->epc;
+
+		if (!epc->ctx) {
+			atomic_set(&epc->refcount, 1);
+			epc->embedded = 1;
+			raw_spin_lock_irq(&ctx->lock);
+			list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+			epc->ctx = ctx;
+			raw_spin_unlock_irq(&ctx->lock);
+		} else {
+			WARN_ON_ONCE(epc->ctx != ctx);
+			atomic_inc(&epc->refcount);
+		}
+
+		return epc;
+	}
+
+	new = kzalloc(sizeof(*epc), GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+		task_ctx_data = alloc_task_ctx_data(pmu);
+		if (!task_ctx_data) {
+			kfree(new);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+
+	__perf_init_event_pmu_context(new, pmu);
+
+	raw_spin_lock_irq(&ctx->lock);
+	list_for_each_entry(epc, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		if (epc->pmu == pmu) {
+			WARN_ON_ONCE(epc->ctx != ctx);
+			atomic_inc(&epc->refcount);
+			goto found_epc;
+		}
+	}
+
+	epc = new;
+	new = NULL;
+
+	list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+	epc->ctx = ctx;
+
+found_epc:
+	if (task_ctx_data && !epc->task_ctx_data) {
+		epc->task_ctx_data = task_ctx_data;
+		task_ctx_data = NULL;
+		ctx->nr_task_data++;
+	}
+	raw_spin_unlock_irq(&ctx->lock);
+
+	free_task_ctx_data(pmu, task_ctx_data);
+	kfree(new);
+
+	return epc;
+}
+
+static void get_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
+}
+
+static void put_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+	unsigned long flags;
+
+	if (!atomic_dec_and_test(&epc->refcount))
+		return;
+
+	if (epc->ctx) {
+		struct perf_event_context *ctx = epc->ctx;
+
+		// XXX ctx->mutex
+
+		WARN_ON_ONCE(list_empty(&epc->pmu_ctx_entry));
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+		list_del_init(&epc->pmu_ctx_entry);
+		epc->ctx = NULL;
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+	}
+
+	WARN_ON_ONCE(!list_empty(&epc->pinned_active));
+	WARN_ON_ONCE(!list_empty(&epc->flexible_active));
+
+	if (epc->embedded)
+		return;
+
+	kfree(epc->task_ctx_data);
+	kfree(epc);
+}
+
 static void perf_event_free_filter(struct perf_event *event);
 
 static void free_event_rcu(struct rcu_head *head)
@@ -4968,6 +5176,9 @@ static void _free_event(struct perf_even
 	if (event->hw.target)
 		put_task_struct(event->hw.target);
 
+	if (event->pmu_ctx)
+		put_pmu_ctx(event->pmu_ctx);
+
 	/*
 	 * perf_event_free_task() relies on put_ctx() being 'last', in particular
 	 * all task references must be cleaned up.
@@ -5498,7 +5709,7 @@ static void __perf_event_period(struct p
 
 	active = (event->state == PERF_EVENT_STATE_ACTIVE);
 	if (active) {
-		perf_pmu_disable(ctx->pmu);
+		perf_pmu_disable(event->pmu);
 		/*
 		 * We could be throttled; unthrottle now to avoid the tick
 		 * trying to unthrottle while we already re-started the event.
@@ -5514,7 +5725,7 @@ static void __perf_event_period(struct p
 
 	if (active) {
 		event->pmu->start(event, PERF_EF_RELOAD);
-		perf_pmu_enable(ctx->pmu);
+		perf_pmu_enable(event->pmu);
 	}
 }
 
@@ -7606,7 +7817,6 @@ perf_iterate_sb(perf_iterate_f output, v
 	       struct perf_event_context *task_ctx)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
 
 	rcu_read_lock();
 	preempt_disable();
@@ -7623,11 +7833,9 @@ perf_iterate_sb(perf_iterate_f output, v
 
 	perf_iterate_sb_cpu(output, data);
 
-	for_each_task_context_nr(ctxn) {
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (ctx)
-			perf_iterate_ctx(ctx, output, data, false);
-	}
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
+		perf_iterate_ctx(ctx, output, data, false);
 done:
 	preempt_enable();
 	rcu_read_unlock();
@@ -7669,20 +7877,15 @@ static void perf_event_addr_filters_exec
 void perf_event_exec(void)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
-
-	for_each_task_context_nr(ctxn) {
-		perf_event_enable_on_exec(ctxn);
-		perf_event_remove_on_exec(ctxn);
 
-		rcu_read_lock();
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (ctx) {
-			perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
-					 NULL, true);
-		}
-		rcu_read_unlock();
+	rcu_read_lock();
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx) {
+		perf_event_enable_on_exec(ctx);
+		perf_event_remove_on_exec(ctx);
+		perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
 	}
+	rcu_read_unlock();
 }
 
 struct remote_output {
@@ -7722,8 +7925,7 @@ static void __perf_event_output_stop(str
 static int __perf_pmu_output_stop(void *info)
 {
 	struct perf_event *event = info;
-	struct pmu *pmu = event->ctx->pmu;
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct remote_output ro = {
 		.rb	= event->rb,
 	};
@@ -8512,7 +8714,6 @@ static void __perf_addr_filters_adjust(s
 static void perf_addr_filters_adjust(struct vm_area_struct *vma)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
 
 	/*
 	 * Data tracing isn't supported yet and as such there is no need
@@ -8522,13 +8723,9 @@ static void perf_addr_filters_adjust(str
 		return;
 
 	rcu_read_lock();
-	for_each_task_context_nr(ctxn) {
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (!ctx)
-			continue;
-
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
 		perf_iterate_ctx(ctx, __perf_addr_filters_adjust, vma, true);
-	}
 	rcu_read_unlock();
 }
 
@@ -9737,10 +9934,13 @@ void perf_tp_event(u16 event_type, u64 c
 		struct trace_entry *entry = record;
 
 		rcu_read_lock();
-		ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+		ctx = rcu_dereference(task->perf_event_ctxp);
 		if (!ctx)
 			goto unlock;
 
+		// XXX iterate groups instead, we should be able to
+		// find the subtree for the perf_tracepoint pmu and CPU.
+
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 			if (event->cpu != smp_processor_id())
 				continue;
@@ -10873,36 +11073,9 @@ static int perf_event_idx_default(struct
 	return 0;
 }
 
-/*
- * Ensures all contexts with the same task_ctx_nr have the same
- * pmu_cpu_context too.
- */
-static struct perf_cpu_context __percpu *find_pmu_context(int ctxn)
-{
-	struct pmu *pmu;
-
-	if (ctxn < 0)
-		return NULL;
-
-	list_for_each_entry(pmu, &pmus, entry) {
-		if (pmu->task_ctx_nr == ctxn)
-			return pmu->pmu_cpu_context;
-	}
-
-	return NULL;
-}
-
 static void free_pmu_context(struct pmu *pmu)
 {
-	/*
-	 * Static contexts such as perf_sw_context have a global lifetime
-	 * and may be shared between different PMUs. Avoid freeing them
-	 * when a single PMU is going away.
-	 */
-	if (pmu->task_ctx_nr > perf_invalid_context)
-		return;
-
-	free_percpu(pmu->pmu_cpu_context);
+	free_percpu(pmu->cpu_pmu_context);
 }
 
 /*
@@ -10966,12 +11139,12 @@ perf_event_mux_interval_ms_store(struct
 	/* update all cpuctx for this PMU */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct perf_cpu_context *cpuctx;
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+		struct perf_cpu_pmu_context *cpc;
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
 
 		cpu_function_call(cpu,
-			(remote_function_f)perf_mux_hrtimer_restart, cpuctx);
+			(remote_function_f)perf_mux_hrtimer_restart, cpc);
 	}
 	cpus_read_unlock();
 	mutex_unlock(&mux_interval_mutex);
@@ -11082,47 +11255,19 @@ int perf_pmu_register(struct pmu *pmu, c
 	}
 
 skip_type:
-	if (pmu->task_ctx_nr == perf_hw_context) {
-		static int hw_context_taken = 0;
-
-		/*
-		 * Other than systems with heterogeneous CPUs, it never makes
-		 * sense for two PMUs to share perf_hw_context. PMUs which are
-		 * uncore must use perf_invalid_context.
-		 */
-		if (WARN_ON_ONCE(hw_context_taken &&
-		    !(pmu->capabilities & PERF_PMU_CAP_HETEROGENEOUS_CPUS)))
-			pmu->task_ctx_nr = perf_invalid_context;
-
-		hw_context_taken = 1;
-	}
-
-	pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
-	if (pmu->pmu_cpu_context)
-		goto got_cpu_context;
-
 	ret = -ENOMEM;
-	pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
-	if (!pmu->pmu_cpu_context)
+	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
+	if (!pmu->cpu_pmu_context)
 		goto free_dev;
 
 	for_each_possible_cpu(cpu) {
-		struct perf_cpu_context *cpuctx;
-
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		__perf_event_init_context(&cpuctx->ctx);
-		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
-		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
-		cpuctx->ctx.pmu = pmu;
-		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
-
-		__perf_mux_hrtimer_init(cpuctx, cpu);
+		struct perf_cpu_pmu_context *cpc;
 
-		cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
-		cpuctx->heap = cpuctx->heap_default;
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		__perf_init_event_pmu_context(&cpc->epc, pmu);
+		__perf_mux_hrtimer_init(cpc, cpu);
 	}
 
-got_cpu_context:
 	if (!pmu->start_txn) {
 		if (pmu->pmu_enable) {
 			/*
@@ -11604,10 +11749,11 @@ perf_event_alloc(struct perf_event_attr
 	}
 
 	/*
-	 * Disallow uncore-cgroup events, they don't make sense as the cgroup will
-	 * be different on other CPUs in the uncore mask.
+	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
+	 * events (they don't make sense as the cgroup will be different
+	 * on other CPUs in the uncore mask).
 	 */
-	if (pmu->task_ctx_nr == perf_invalid_context && cgroup_fd != -1) {
+	if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
 		err = -EINVAL;
 		goto err_pmu;
 	}
@@ -11893,15 +12039,6 @@ perf_event_set_output(struct perf_event
 	return ret;
 }
 
-static void mutex_lock_double(struct mutex *a, struct mutex *b)
-{
-	if (b < a)
-		swap(a, b);
-
-	mutex_lock(a);
-	mutex_lock_nested(b, SINGLE_DEPTH_NESTING);
-}
-
 static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 {
 	bool nmi_safe = false;
@@ -11939,37 +12076,6 @@ static int perf_event_set_clock(struct p
 	return 0;
 }
 
-/*
- * Variation on perf_event_ctx_lock_nested(), except we take two context
- * mutexes.
- */
-static struct perf_event_context *
-__perf_event_ctx_lock_double(struct perf_event *group_leader,
-			     struct perf_event_context *ctx)
-{
-	struct perf_event_context *gctx;
-
-again:
-	rcu_read_lock();
-	gctx = READ_ONCE(group_leader->ctx);
-	if (!refcount_inc_not_zero(&gctx->refcount)) {
-		rcu_read_unlock();
-		goto again;
-	}
-	rcu_read_unlock();
-
-	mutex_lock_double(&gctx->mutex, &ctx->mutex);
-
-	if (group_leader->ctx != gctx) {
-		mutex_unlock(&ctx->mutex);
-		mutex_unlock(&gctx->mutex);
-		put_ctx(gctx);
-		goto again;
-	}
-
-	return gctx;
-}
-
 static bool
 perf_check_permission(struct perf_event_attr *attr, struct task_struct *task)
 {
@@ -12015,9 +12121,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
 {
 	struct perf_event *group_leader = NULL, *output_event = NULL;
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event *event, *sibling;
 	struct perf_event_attr attr;
-	struct perf_event_context *ctx, *gctx;
+	struct perf_event_context *ctx;
 	struct file *event_file = NULL;
 	struct fd group = {NULL, 0};
 	struct task_struct *task = NULL;
@@ -12125,6 +12232,8 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_task;
 	}
 
+	// XXX premature; what if this is allowed, but we get moved to a PMU
+	// that doesn't have this.
 	if (is_sampling_event(event)) {
 		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
 			err = -EOPNOTSUPP;
@@ -12147,42 +12256,37 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (pmu->task_ctx_nr == perf_sw_context)
 		event->event_caps |= PERF_EV_CAP_SOFTWARE;
 
-	if (group_leader) {
-		if (is_software_event(event) &&
-		    !in_software_context(group_leader)) {
-			/*
-			 * If the event is a sw event, but the group_leader
-			 * is on hw context.
-			 *
-			 * Allow the addition of software events to hw
-			 * groups, this is safe because software events
-			 * never fail to schedule.
-			 */
-			pmu = group_leader->ctx->pmu;
-		} else if (!is_software_event(event) &&
-			   is_software_event(group_leader) &&
-			   (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
-			/*
-			 * In case the group is a pure software group, and we
-			 * try to add a hardware event, move the whole group to
-			 * the hardware context.
-			 */
-			move_group = 1;
-		}
-	}
-
 	/*
 	 * Get the target context (task or percpu):
 	 */
-	ctx = find_get_context(pmu, task, event);
+	ctx = find_get_context(task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
 		goto err_alloc;
 	}
 
-	/*
-	 * Look up the group leader (we will attach this event to it):
-	 */
+	mutex_lock(&ctx->mutex);
+
+	if (ctx->task == TASK_TOMBSTONE) {
+		err = -ESRCH;
+		goto err_locked;
+	}
+
+	if (!task) {
+		/*
+		 * Check if the @cpu we're creating an event for is online.
+		 *
+		 * We use the perf_cpu_context::ctx::mutex to serialize against
+		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
+		 */
+		struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
+
+		if (!cpuctx->online) {
+			err = -ENODEV;
+			goto err_locked;
+		}
+	}
+
 	if (group_leader) {
 		err = -EINVAL;
 
@@ -12191,11 +12295,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * becoming part of another group-sibling):
 		 */
 		if (group_leader->group_leader != group_leader)
-			goto err_context;
+			goto err_locked;
 
 		/* All events in a group should have the same clock */
 		if (group_leader->clock != event->clock)
-			goto err_context;
+			goto err_locked;
 
 		/*
 		 * Make sure we're both events for the same CPU;
@@ -12203,41 +12307,60 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * you can never concurrently schedule them anyhow.
 		 */
 		if (group_leader->cpu != event->cpu)
-			goto err_context;
-
-		/*
-		 * Make sure we're both on the same task, or both
-		 * per-CPU events.
-		 */
-		if (group_leader->ctx->task != ctx->task)
-			goto err_context;
+			goto err_locked;
 
 		/*
-		 * Do not allow to attach to a group in a different task
-		 * or CPU context. If we're moving SW events, we'll fix
-		 * this up later, so allow that.
-		 *
-		 * Racy, not holding group_leader->ctx->mutex, see comment with
-		 * perf_event_ctx_lock().
+		 * Make sure we're both on the same context; either task or cpu.
 		 */
-		if (!move_group && group_leader->ctx != ctx)
-			goto err_context;
+		if (group_leader->ctx != ctx)
+			goto err_locked;
 
 		/*
 		 * Only a group leader can be exclusive or pinned
 		 */
 		if (attr.exclusive || attr.pinned)
-			goto err_context;
+			goto err_locked;
+
+		if (is_software_event(event) &&
+		    !in_software_context(group_leader)) {
+			/*
+			 * If the event is a sw event, but the group_leader
+			 * is on hw context.
+			 *
+			 * Allow the addition of software events to hw
+			 * groups, this is safe because software events
+			 * never fail to schedule.
+			 */
+			pmu = group_leader->pmu_ctx->pmu;
+		} else if (!is_software_event(event) &&
+			is_software_event(group_leader) &&
+			(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
+			/*
+			 * In case the group is a pure software group, and we
+			 * try to add a hardware event, move the whole group to
+			 * the hardware context.
+			 */
+			move_group = 1;
+		}
 	}
 
+	/*
+	 * Now that we're certain of the pmu; find the pmu_ctx.
+	 */
+	pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+	if (IS_ERR(pmu_ctx)) {
+		err = PTR_ERR(pmu_ctx);
+		goto err_locked;
+	}
+	event->pmu_ctx = pmu_ctx;
+
 	if (output_event) {
 		err = perf_event_set_output(event, output_event);
 		if (err)
 			goto err_context;
 	}
 
-	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
-					f_flags);
+	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
 	if (IS_ERR(event_file)) {
 		err = PTR_ERR(event_file);
 		event_file = NULL;
@@ -12260,59 +12383,6 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_cred;
 	}
 
-	if (move_group) {
-		gctx = __perf_event_ctx_lock_double(group_leader, ctx);
-
-		if (gctx->task == TASK_TOMBSTONE) {
-			err = -ESRCH;
-			goto err_locked;
-		}
-
-		/*
-		 * Check if we raced against another sys_perf_event_open() call
-		 * moving the software group underneath us.
-		 */
-		if (!(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
-			/*
-			 * If someone moved the group out from under us, check
-			 * if this new event wound up on the same ctx, if so
-			 * its the regular !move_group case, otherwise fail.
-			 */
-			if (gctx != ctx) {
-				err = -EINVAL;
-				goto err_locked;
-			} else {
-				perf_event_ctx_unlock(group_leader, gctx);
-				move_group = 0;
-				goto not_move_group;
-			}
-		}
-
-		/*
-		 * Failure to create exclusive events returns -EBUSY.
-		 */
-		err = -EBUSY;
-		if (!exclusive_event_installable(group_leader, ctx))
-			goto err_locked;
-
-		for_each_sibling_event(sibling, group_leader) {
-			if (!exclusive_event_installable(sibling, ctx))
-				goto err_locked;
-		}
-	} else {
-		mutex_lock(&ctx->mutex);
-
-		/*
-		 * Now that we hold ctx->lock, (re)validate group_leader->ctx == ctx,
-		 * see the group_leader && !move_group test earlier.
-		 */
-		if (group_leader && group_leader->ctx != ctx) {
-			err = -EINVAL;
-			goto err_locked;
-		}
-	}
-not_move_group:
-
 	if (ctx->task == TASK_TOMBSTONE) {
 		err = -ESRCH;
 		goto err_locked;
@@ -12350,7 +12420,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 	if (!exclusive_event_installable(event, ctx)) {
 		err = -EBUSY;
-		goto err_locked;
+		goto err_cred;
 	}
 
 	WARN_ON_ONCE(ctx->parent_ctx);
@@ -12361,25 +12431,15 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 
 	if (move_group) {
-		/*
-		 * See perf_event_ctx_lock() for comments on the details
-		 * of swizzling perf_event::ctx.
-		 */
 		perf_remove_from_context(group_leader, 0);
-		put_ctx(gctx);
+		put_pmu_ctx(group_leader->pmu_ctx);
 
 		for_each_sibling_event(sibling, group_leader) {
 			perf_remove_from_context(sibling, 0);
-			put_ctx(gctx);
+			put_pmu_ctx(sibling->pmu_ctx);
 		}
 
 		/*
-		 * Wait for everybody to stop referencing the events through
-		 * the old lists, before installing it on new lists.
-		 */
-		synchronize_rcu();
-
-		/*
 		 * Install the group siblings before the group leader.
 		 *
 		 * Because a group leader will try and install the entire group
@@ -12390,9 +12450,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * reachable through the group lists.
 		 */
 		for_each_sibling_event(sibling, group_leader) {
+			sibling->pmu_ctx = pmu_ctx;
+			get_pmu_ctx(pmu_ctx);
 			perf_event__state_init(sibling);
 			perf_install_in_context(ctx, sibling, sibling->cpu);
-			get_ctx(ctx);
 		}
 
 		/*
@@ -12400,9 +12461,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * event. What we want here is event in the initial
 		 * startup state, ready to be add into new context.
 		 */
+		group_leader->pmu_ctx = pmu_ctx;
+		get_pmu_ctx(pmu_ctx);
 		perf_event__state_init(group_leader);
 		perf_install_in_context(ctx, group_leader, group_leader->cpu);
-		get_ctx(ctx);
 	}
 
 	/*
@@ -12419,8 +12481,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 
-	if (move_group)
-		perf_event_ctx_unlock(group_leader, gctx);
 	mutex_unlock(&ctx->mutex);
 
 	if (task) {
@@ -12442,16 +12502,15 @@ SYSCALL_DEFINE5(perf_event_open,
 	fd_install(event_fd, event_file);
 	return event_fd;
 
-err_locked:
-	if (move_group)
-		perf_event_ctx_unlock(group_leader, gctx);
-	mutex_unlock(&ctx->mutex);
 err_cred:
 	if (task)
 		up_read(&task->signal->exec_update_lock);
 err_file:
 	fput(event_file);
 err_context:
+	/* event->pmu_ctx freed by free_event() */
+err_locked:
+	mutex_unlock(&ctx->mutex);
 	perf_unpin_context(ctx);
 	put_ctx(ctx);
 err_alloc:
@@ -12486,8 +12545,10 @@ perf_event_create_kernel_counter(struct
 				 perf_overflow_handler_t overflow_handler,
 				 void *context)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event_context *ctx;
 	struct perf_event *event;
+	struct pmu *pmu;
 	int err;
 
 	/*
@@ -12506,16 +12567,32 @@ perf_event_create_kernel_counter(struct
 
 	/* Mark owner so we could distinguish it from user events. */
 	event->owner = TASK_TOMBSTONE;
+	pmu = event->pmu;
+
+	if (pmu->task_ctx_nr < 0 && task) {
+		err = -EINVAL;
+		goto err_alloc;
+	}
+
+	if (pmu->task_ctx_nr == perf_sw_context)
+		event->event_caps |= PERF_EV_CAP_SOFTWARE;
 
 	/*
 	 * Get the target context (task or percpu):
 	 */
-	ctx = find_get_context(event->pmu, task, event);
+	ctx = find_get_context(task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
-		goto err_free;
+		goto err_alloc;
 	}
 
+	pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+	if (IS_ERR(pmu_ctx)) {
+		err = PTR_ERR(pmu_ctx);
+		goto err_ctx;
+	}
+	event->pmu_ctx = pmu_ctx;
+
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
 	if (ctx->task == TASK_TOMBSTONE) {
@@ -12551,9 +12628,10 @@ perf_event_create_kernel_counter(struct
 
 err_unlock:
 	mutex_unlock(&ctx->mutex);
+err_ctx:
 	perf_unpin_context(ctx);
 	put_ctx(ctx);
-err_free:
+err_alloc:
 	free_event(event);
 err:
 	return ERR_PTR(err);
@@ -12562,6 +12640,7 @@ EXPORT_SYMBOL_GPL(perf_event_create_kern
 
 void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 {
+#if 0 // XXX buggered - cpu hotplug, who cares
 	struct perf_event_context *src_ctx;
 	struct perf_event_context *dst_ctx;
 	struct perf_event *event, *tmp;
@@ -12622,6 +12701,7 @@ void perf_pmu_migrate_context(struct pmu
 	}
 	mutex_unlock(&dst_ctx->mutex);
 	mutex_unlock(&src_ctx->mutex);
+#endif
 }
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
@@ -12699,14 +12779,14 @@ perf_event_exit_event(struct perf_event
 	perf_event_wakeup(event);
 }
 
-static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
+static void perf_event_exit_task_context(struct task_struct *child)
 {
 	struct perf_event_context *child_ctx, *clone_ctx = NULL;
 	struct perf_event *child_event, *next;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = perf_pin_task_context(child, ctxn);
+	child_ctx = perf_pin_task_context(child);
 	if (!child_ctx)
 		return;
 
@@ -12728,13 +12808,13 @@ static void perf_event_exit_task_context
 	 * in.
 	 */
 	raw_spin_lock_irq(&child_ctx->lock);
-	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*
 	 * Now that the context is inactive, destroy the task <-> ctx relation
 	 * and mark the context dead.
 	 */
-	RCU_INIT_POINTER(child->perf_event_ctxp[ctxn], NULL);
+	RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
 	put_ctx(child_ctx); /* cannot be last */
 	WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
 	put_task_struct(current); /* cannot be last */
@@ -12769,7 +12849,6 @@ static void perf_event_exit_task_context
 void perf_event_exit_task(struct task_struct *child)
 {
 	struct perf_event *event, *tmp;
-	int ctxn;
 
 	mutex_lock(&child->perf_event_mutex);
 	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
@@ -12785,8 +12864,7 @@ void perf_event_exit_task(struct task_st
 	}
 	mutex_unlock(&child->perf_event_mutex);
 
-	for_each_task_context_nr(ctxn)
-		perf_event_exit_task_context(child, ctxn);
+	perf_event_exit_task_context(child);
 
 	/*
 	 * The perf_event_exit_task_context calls perf_event_task
@@ -12829,56 +12907,51 @@ void perf_event_free_task(struct task_st
 {
 	struct perf_event_context *ctx;
 	struct perf_event *event, *tmp;
-	int ctxn;
 
-	for_each_task_context_nr(ctxn) {
-		ctx = task->perf_event_ctxp[ctxn];
-		if (!ctx)
-			continue;
+	ctx = rcu_dereference(task->perf_event_ctxp);
+	if (!ctx)
+		return;
 
-		mutex_lock(&ctx->mutex);
-		raw_spin_lock_irq(&ctx->lock);
-		/*
-		 * Destroy the task <-> ctx relation and mark the context dead.
-		 *
-		 * This is important because even though the task hasn't been
-		 * exposed yet the context has been (through child_list).
-		 */
-		RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], NULL);
-		WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
-		put_task_struct(task); /* cannot be last */
-		raw_spin_unlock_irq(&ctx->lock);
+	mutex_lock(&ctx->mutex);
+	raw_spin_lock_irq(&ctx->lock);
+	/*
+	 * Destroy the task <-> ctx relation and mark the context dead.
+	 *
+	 * This is important because even though the task hasn't been
+	 * exposed yet the context has been (through child_list).
+	 */
+	RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+	WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
+	put_task_struct(task); /* cannot be last */
+	raw_spin_unlock_irq(&ctx->lock);
 
-		list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
-			perf_free_event(event, ctx);
 
-		mutex_unlock(&ctx->mutex);
+	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
+		perf_free_event(event, ctx);
 
-		/*
-		 * perf_event_release_kernel() could've stolen some of our
-		 * child events and still have them on its free_list. In that
-		 * case we must wait for these events to have been freed (in
-		 * particular all their references to this task must've been
-		 * dropped).
-		 *
-		 * Without this copy_process() will unconditionally free this
-		 * task (irrespective of its reference count) and
-		 * _free_event()'s put_task_struct(event->hw.target) will be a
-		 * use-after-free.
-		 *
-		 * Wait for all events to drop their context reference.
-		 */
-		wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
-		put_ctx(ctx); /* must be last */
-	}
+	mutex_unlock(&ctx->mutex);
+
+	/*
+	 * perf_event_release_kernel() could've stolen some of our
+	 * child events and still have them on its free_list. In that
+	 * case we must wait for these events to have been freed (in
+	 * particular all their references to this task must've been
+	 * dropped).
+	 *
+	 * Without this copy_process() will unconditionally free this
+	 * task (irrespective of its reference count) and
+	 * _free_event()'s put_task_struct(event->hw.target) will be a
+	 * use-after-free.
+	 *
+	 * Wait for all events to drop their context reference.
+	 */
+	wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
+	put_ctx(ctx); /* must be last */
 }
 
 void perf_event_delayed_put(struct task_struct *task)
 {
-	int ctxn;
-
-	for_each_task_context_nr(ctxn)
-		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+	WARN_ON_ONCE(task->perf_event_ctxp);
 }
 
 struct file *perf_event_get(unsigned int fd)
@@ -12928,6 +13001,7 @@ inherit_event(struct perf_event *parent_
 	      struct perf_event_context *child_ctx)
 {
 	enum perf_event_state parent_state = parent_event->state;
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event *child_event;
 	unsigned long flags;
 
@@ -12948,17 +13022,12 @@ inherit_event(struct perf_event *parent_
 	if (IS_ERR(child_event))
 		return child_event;
 
-
-	if ((child_event->attach_state & PERF_ATTACH_TASK_DATA) &&
-	    !child_ctx->task_ctx_data) {
-		struct pmu *pmu = child_event->pmu;
-
-		child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!child_ctx->task_ctx_data) {
-			free_event(child_event);
-			return ERR_PTR(-ENOMEM);
-		}
+	pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
+	if (!pmu_ctx) {
+		free_event(child_event);
+		return NULL;
 	}
+	child_event->pmu_ctx = pmu_ctx;
 
 	/*
 	 * is_orphaned_event() and list_add_tail(&parent_event->child_list)
@@ -13081,11 +13150,11 @@ static int inherit_group(struct perf_eve
 static int
 inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		   struct perf_event_context *parent_ctx,
-		   struct task_struct *child, int ctxn,
+		   struct task_struct *child,
 		   u64 clone_flags, int *inherited_all)
 {
-	int ret;
 	struct perf_event_context *child_ctx;
+	int ret;
 
 	if (!event->attr.inherit ||
 	    (event->attr.inherit_thread && !(clone_flags & CLONE_THREAD)) ||
@@ -13095,7 +13164,7 @@ inherit_task_group(struct perf_event *ev
 		return 0;
 	}
 
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = child->perf_event_ctxp;
 	if (!child_ctx) {
 		/*
 		 * This is executed from the parent task context, so
@@ -13103,16 +13172,14 @@ inherit_task_group(struct perf_event *ev
 		 * First allocate and initialize a context for the
 		 * child.
 		 */
-		child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+		child_ctx = alloc_perf_context(child);
 		if (!child_ctx)
 			return -ENOMEM;
 
-		child->perf_event_ctxp[ctxn] = child_ctx;
+		child->perf_event_ctxp = child_ctx;
 	}
 
-	ret = inherit_group(event, parent, parent_ctx,
-			    child, child_ctx);
-
+	ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
 	if (ret)
 		*inherited_all = 0;
 
@@ -13122,8 +13189,7 @@ inherit_task_group(struct perf_event *ev
 /*
  * Initialize the perf_event context in task_struct
  */
-static int perf_event_init_context(struct task_struct *child, int ctxn,
-				   u64 clone_flags)
+static int perf_event_init_context(struct task_struct *child, u64 clone_flags)
 {
 	struct perf_event_context *child_ctx, *parent_ctx;
 	struct perf_event_context *cloned_ctx;
@@ -13133,14 +13199,14 @@ static int perf_event_init_context(struc
 	unsigned long flags;
 	int ret = 0;
 
-	if (likely(!parent->perf_event_ctxp[ctxn]))
+	if (likely(!parent->perf_event_ctxp))
 		return 0;
 
 	/*
 	 * If the parent's context is a clone, pin it so it won't get
 	 * swapped under us.
 	 */
-	parent_ctx = perf_pin_task_context(parent, ctxn);
+	parent_ctx = perf_pin_task_context(parent);
 	if (!parent_ctx)
 		return 0;
 
@@ -13163,8 +13229,7 @@ static int perf_event_init_context(struc
 	 */
 	perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
 		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, clone_flags,
-					 &inherited_all);
+					 child, clone_flags, &inherited_all);
 		if (ret)
 			goto out_unlock;
 	}
@@ -13180,8 +13245,7 @@ static int perf_event_init_context(struc
 
 	perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
 		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, clone_flags,
-					 &inherited_all);
+					 child, clone_flags, &inherited_all);
 		if (ret)
 			goto out_unlock;
 	}
@@ -13189,7 +13253,7 @@ static int perf_event_init_context(struc
 	raw_spin_lock_irqsave(&parent_ctx->lock, flags);
 	parent_ctx->rotate_disable = 0;
 
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = child->perf_event_ctxp;
 
 	if (child_ctx && inherited_all) {
 		/*
@@ -13225,18 +13289,16 @@ static int perf_event_init_context(struc
  */
 int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 {
-	int ctxn, ret;
+	int ret;
 
-	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+	child->perf_event_ctxp = NULL;
 	mutex_init(&child->perf_event_mutex);
 	INIT_LIST_HEAD(&child->perf_event_list);
 
-	for_each_task_context_nr(ctxn) {
-		ret = perf_event_init_context(child, ctxn, clone_flags);
-		if (ret) {
-			perf_event_free_task(child);
-			return ret;
-		}
+	ret = perf_event_init_context(child, clone_flags);
+	if (ret) {
+		perf_event_free_task(child);
+		return ret;
 	}
 
 	return 0;
@@ -13245,6 +13307,7 @@ int perf_event_init_task(struct task_str
 static void __init perf_event_init_all_cpus(void)
 {
 	struct swevent_htable *swhash;
+	struct perf_cpu_context *cpuctx;
 	int cpu;
 
 	zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL);
@@ -13252,7 +13315,6 @@ static void __init perf_event_init_all_c
 	for_each_possible_cpu(cpu) {
 		swhash = &per_cpu(swevent_htable, cpu);
 		mutex_init(&swhash->hlist_mutex);
-		INIT_LIST_HEAD(&per_cpu(active_ctx_list, cpu));
 
 		INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
 		raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
@@ -13261,6 +13323,14 @@ static void __init perf_event_init_all_c
 		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
 #endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
+
+		cpuctx = per_cpu_ptr(&cpu_context, cpu);
+		__perf_event_init_context(&cpuctx->ctx);
+		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
+		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
+		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
+		cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
+		cpuctx->heap = cpuctx->heap_default;
 	}
 }
 
@@ -13282,12 +13352,12 @@ static void perf_swevent_init_cpu(unsign
 #if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC_CORE
 static void __perf_event_exit_context(void *__info)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *ctx = __info;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event *event;
 
 	raw_spin_lock(&ctx->lock);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, EVENT_TIME);
 	list_for_each_entry(event, &ctx->event_list, event_entry)
 		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
 	raw_spin_unlock(&ctx->lock);
@@ -13297,18 +13367,16 @@ static void perf_event_exit_cpu_context(
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
-	struct pmu *pmu;
 
+	// XXX simplify cpuctx->online
 	mutex_lock(&pmus_lock);
-	list_for_each_entry(pmu, &pmus, entry) {
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		ctx = &cpuctx->ctx;
+	cpuctx = per_cpu_ptr(&cpu_context, cpu);
+	ctx = &cpuctx->ctx;
 
-		mutex_lock(&ctx->mutex);
-		smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
-		cpuctx->online = 0;
-		mutex_unlock(&ctx->mutex);
-	}
+	mutex_lock(&ctx->mutex);
+	smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
+	cpuctx->online = 0;
+	mutex_unlock(&ctx->mutex);
 	cpumask_clear_cpu(cpu, perf_online_mask);
 	mutex_unlock(&pmus_lock);
 }
@@ -13322,20 +13390,17 @@ int perf_event_init_cpu(unsigned int cpu
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
-	struct pmu *pmu;
 
 	perf_swevent_init_cpu(cpu);
 
 	mutex_lock(&pmus_lock);
 	cpumask_set_cpu(cpu, perf_online_mask);
-	list_for_each_entry(pmu, &pmus, entry) {
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		ctx = &cpuctx->ctx;
+	cpuctx = per_cpu_ptr(&cpu_context, cpu);
+	ctx = &cpuctx->ctx;
 
-		mutex_lock(&ctx->mutex);
-		cpuctx->online = 1;
-		mutex_unlock(&ctx->mutex);
-	}
+	mutex_lock(&ctx->mutex);
+	cpuctx->online = 1;
+	mutex_unlock(&ctx->mutex);
 	mutex_unlock(&pmus_lock);
 
 	return 0;

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
  2022-01-13 19:15 ` kernel test robot
@ 2022-01-31  4:43 ` Ravi Bangoria
  2022-06-13 14:35 ` Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-01-31  4:43 UTC (permalink / raw)
  To: peterz
  Cc: acme, alexander.shishkin, jolsa, namhyung, songliubraving,
	eranian, alexey.budankov, ak, mark.rutland, megha.dey, frederic,
	maddy, irogers, kim.phillips, linux-kernel, santosh.shukla



On 13-Jan-22 7:17 PM, Ravi Bangoria wrote:
> From: Peter Zijlstra <peterz@infradead.org>
> 
> This is the 2nd version of RFC originally posted by Peter[1].
> 
> There have been various issues and limitations with the way perf uses
> (task) contexts to track events. Most notable is the single hardware
> PMU task context, which has resulted in a number of yucky things (both
> proposed and merged).
> 
> Notably:
>  - HW breakpoint PMU
>  - ARM big.little PMU / Intel ADL PMU
>  - Intel Branch Monitoring PMU
>  - AMD IBS
> 
> Current design:
> ---------------
> Currently we have per-task and per-CPU perf_event_contexts:
> 
>   task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
>        ^                                 |    ^     |
>        `---------------------------------'    |     `--> pmu
>                                               v           ^
>                                          perf_event ------'
> 
> Each task has an array of pointers to a perf_event_context. Each
> perf_event_context has a direct relation to a PMU and a group of
> events for that PMU. The task-related perf_event_contexts have a
> pointer back to that task.
> 
> Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
> includes a perf_event_context, which again has a direct relation to
> that PMU, and a group of events for that PMU.
> 
> The perf_cpu_context also tracks which task context is currently
> associated with that CPU and includes a few other things like the
> hrtimer for rotation etc.
> 
> Each perf_event is then associated with its PMU and one
> perf_event_context.
> 
> Proposed design:
> ----------------
> The new design proposed by this patch reduces to a single task context
> and a single CPU context, but adds some intermediate data structures:
> 
>   task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
>        ^                                 |    ^ ^
>        `---------------------------------'    | |
>                                               | |    perf_cpu_pmu_context
>                                               | `----.    ^
>                                               |      |    |
>                                               |      v    v
>                                               | ,--> perf_event_pmu_context
>                                               | |         ^
>                                               | |         |
>                                               v v         v
>                                          perf_event ---> pmu
> 
> With the new design, perf_event_context will hold all pmu events in the
> respective (pinned/flexible) rbtrees. This can be achieved by adding
> the pmu to the rbtree key:
> 
>   {cpu, pmu, cgroup_id, group_index}
> 
> Each perf_event_context carries a list of perf_event_pmu_context, which
> is used to hold per-pmu-per-context state. For example, it keeps track
> of the currently active events for that pmu, a pmu-specific
> task_ctx_data, a flag to tell whether rotation is required, etc.
> 
> Similarly, perf_cpu_pmu_context is used to hold per-pmu-per-cpu state,
> such as the hrtimer details that drive event rotation, a pointer to the
> perf_event_pmu_context of the currently running task, and some other
> ancillary information.
> 
> Each perf_event is associated with its pmu, perf_event_context and
> perf_event_pmu_context.
> 
> Original RFC -> RFC v2:
> -----------------------
> In addition to porting the patch to the latest (v5.16-rc6) kernel, here
> are some of the major changes between the two revisions:
> 
> - There have been quite a few fundamental changes since the original
>   patch. Most notably, the rbtree key has changed from {cpu,group_index}
>   to {cpu,cgroup_id,group_index}. Adding a pmu key in between, as
>   proposed in the original patch, is not straightforward as it will
>   break the cgroup-specific optimization. Hence we need to iterate over
>   all pmu_ctx for a given ctx and call visit_groups_merge() one by one.
> - Enabled cgroup support (CGROUP_PERF).
> - Some changes wrt multiplexing events, as with the new design the
>   rotation happens at the cgroup subtree, unlike at the pmu subtree in
>   the original patch.
> 
> Because of the additional complexity the above changes bring in, I
> wanted to get an initial review of the overall approach before starting
> to make it upstream ready. Hence this patch just provides an idea of
> the direction we will head toward. There are many loose ends in the
> patch right now. For example, I've not paid much attention to
> synchronization-related aspects. Similarly, some of the issues marked
> in the original patch (XXX) haven't been fixed.
> 
> A simple perf stat/record/top survives with the patch, but the machine
> crashes on the first run of perf test (a stale cpc->task_epc causing
> the crash). Lockdep is also screaming a lot :)

Hi Peter, can you please review this?

Thanks,
Ravi

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC v2] perf: Rewrite core context handling
  2022-01-13 13:47 Ravi Bangoria
@ 2022-01-13 19:15 ` kernel test robot
  2022-01-31  4:43 ` Ravi Bangoria
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 47+ messages in thread
From: kernel test robot @ 2022-01-13 19:15 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 13155 bytes --]

Hi Ravi,

[FYI, it's a private test report for your RFC patch.]
[auto build test WARNING on tip/perf/core]
[also build test WARNING on powerpc/next tip/sched/core v5.16 next-20220113]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git a9f4a6e92b3b319296fb078da2615f618f6cd80c
config: i386-randconfig-s001 (https://download.01.org/0day-ci/archive/20220114/202201140347.ue2Ckk9k-lkp(a)intel.com/config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/f7cf7134e405062bf0f22c3ba5637241c4c4d06a
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Ravi-Bangoria/perf-Rewrite-core-context-handling/20220113-215022
        git checkout f7cf7134e405062bf0f22c3ba5637241c4c4d06a
        # save the config file to linux build tree
        mkdir build_dir
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=i386 SHELL=/bin/bash kernel/events/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   kernel/events/core.c:1440:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:1440:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:1440:15: sparse:    struct perf_event_context *
   kernel/events/core.c:1453:28: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:1453:28: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:1453:28: sparse:    struct perf_event_context *
   kernel/events/core.c:3478:20: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3478:20: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3478:20: sparse:    struct perf_event_context *
   kernel/events/core.c:3482:18: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3482:18: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3482:18: sparse:    struct perf_event_context *
   kernel/events/core.c:3483:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3483:23: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3483:23: sparse:    struct perf_event_context *
   kernel/events/core.c:3520:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3520:25: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3520:25: sparse:    struct perf_event_context *
   kernel/events/core.c:3521:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3521:25: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3521:25: sparse:    struct perf_event_context *
   kernel/events/core.c:3930:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:3930:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:3930:15: sparse:    struct perf_event_context *
   kernel/events/core.c:4334:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:4334:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:4334:15: sparse:    struct perf_event_context *
   kernel/events/core.c:4807:25: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:4807:25: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:4807:25: sparse:    struct perf_event_context *
>> kernel/events/core.c:4826:31: sparse: sparse: symbol 'find_get_pmu_context' was not declared. Should it be static?
   kernel/events/core.c:6188:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6188:9: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6188:9: sparse:    struct perf_buffer *
   kernel/events/core.c:5650:24: sparse: sparse: incorrect type in assignment (different base types) @@     expected restricted __poll_t [usertype] events @@     got int @@
   kernel/events/core.c:5894:22: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:5894:22: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:5894:22: sparse:    struct perf_buffer *
   kernel/events/core.c:6030:14: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6030:14: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6030:14: sparse:    struct perf_buffer *
   kernel/events/core.c:6063:14: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6063:14: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6063:14: sparse:    struct perf_buffer *
   kernel/events/core.c:6120:14: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6120:14: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6120:14: sparse:    struct perf_buffer *
   kernel/events/core.c:6206:14: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6206:14: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6206:14: sparse:    struct perf_buffer *
   kernel/events/core.c:6219:14: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:6219:14: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:6219:14: sparse:    struct perf_buffer *
   kernel/events/core.c:7864:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:7864:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:7864:15: sparse:    struct perf_event_context *
   kernel/events/core.c:7910:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:7910:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:7910:15: sparse:    struct perf_event_context *
   kernel/events/core.c:7949:13: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:7949:13: sparse:    struct perf_buffer [noderef] __rcu *
   kernel/events/core.c:7949:13: sparse:    struct perf_buffer *
   kernel/events/core.c:8053:61: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *real_parent @@
   kernel/events/core.c:8053:61: sparse:     expected struct task_struct *p
   kernel/events/core.c:8053:61: sparse:     got struct task_struct [noderef] __rcu *real_parent
   kernel/events/core.c:8055:61: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected struct task_struct *p @@     got struct task_struct [noderef] __rcu *real_parent @@
   kernel/events/core.c:8055:61: sparse:     expected struct task_struct *p
   kernel/events/core.c:8055:61: sparse:     got struct task_struct [noderef] __rcu *real_parent
   kernel/events/core.c:8754:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:8754:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:8754:15: sparse:    struct perf_event_context *
   kernel/events/core.c:9745:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9745:9: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9745:9: sparse:    struct swevent_hlist *
   kernel/events/core.c:9784:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9784:17: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9784:17: sparse:    struct swevent_hlist *
   kernel/events/core.c:9965:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9965:23: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:9965:23: sparse:    struct perf_event_context *
   kernel/events/core.c:11116:1: sparse: sparse: symbol 'dev_attr_nr_addr_filters' was not declared. Should it be static?
   kernel/events/core.c:12826:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:12826:9: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:12826:9: sparse:    struct perf_event_context *
   kernel/events/core.c:12920:15: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:12920:15: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:12920:15: sparse:    struct perf_event_context *
   kernel/events/core.c:12932:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:12932:9: sparse:    struct perf_event_context [noderef] __rcu *
   kernel/events/core.c:12932:9: sparse:    struct perf_event_context *
   kernel/events/core.c:13356:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:13356:17: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:13356:17: sparse:    struct swevent_hlist *
   kernel/events/core.c:161:9: sparse: sparse: context imbalance in 'perf_ctx_lock' - wrong count at exit
   kernel/events/core.c:169:17: sparse: sparse: context imbalance in 'perf_ctx_unlock' - unexpected unlock
   kernel/events/core.c: note: in included file (through include/linux/rcupdate.h, include/linux/rculist.h, include/linux/dcache.h, ...):
   include/linux/rcutiny.h:102:44: sparse: sparse: context imbalance in 'perf_lock_task_context' - different lock contexts for basic block
   kernel/events/core.c:1487:17: sparse: sparse: context imbalance in 'perf_pin_task_context' - unexpected unlock
   kernel/events/core.c:2815:9: sparse: sparse: context imbalance in '__perf_install_in_context' - wrong count at exit
   kernel/events/core.c:4781:17: sparse: sparse: context imbalance in 'find_get_context' - unexpected unlock
   kernel/events/core.c: note: in included file:
   kernel/events/internal.h:204:1: sparse: sparse: incorrect type in argument 2 (different address spaces) @@     expected void const [noderef] __user *from @@     got void const *buf @@
   kernel/events/core.c:9594:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9594:17: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9594:17: sparse:    struct swevent_hlist *
   kernel/events/core.c:9614:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9614:17: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9614:17: sparse:    struct swevent_hlist *
   kernel/events/core.c:9734:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist *
   kernel/events/core.c:9734:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist *
   kernel/events/core.c:9734:16: sparse: sparse: incompatible types in comparison expression (different address spaces):
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist [noderef] __rcu *
   kernel/events/core.c:9734:16: sparse:    struct swevent_hlist *

Please review and possibly fold the followup patch.

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC v2] perf: Rewrite core context handling
@ 2022-01-13 13:47 Ravi Bangoria
  2022-01-13 19:15 ` kernel test robot
                   ` (5 more replies)
  0 siblings, 6 replies; 47+ messages in thread
From: Ravi Bangoria @ 2022-01-13 13:47 UTC (permalink / raw)
  To: peterz
  Cc: ravi.bangoria, acme, alexander.shishkin, jolsa, namhyung,
	songliubraving, eranian, alexey.budankov, ak, mark.rutland,
	megha.dey, frederic, maddy, irogers, kim.phillips, linux-kernel,
	santosh.shukla

From: Peter Zijlstra <peterz@infradead.org>

This is the 2nd version of the RFC originally posted by Peter [1].

There have been various issues and limitations with the way perf uses
(task) contexts to track events. Most notable is the single hardware
PMU task context, which has resulted in a number of yucky things (both
proposed and merged).

Notably:
 - HW breakpoint PMU
 - ARM big.little PMU / Intel ADL PMU
 - Intel Branch Monitoring PMU
 - AMD IBS

Current design:
---------------
Currently we have per-task and per-cpu perf_event_contexts:

  task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
       ^                                 |    ^     |
       `---------------------------------'    |     `--> pmu
                                              v           ^
                                         perf_event ------'

Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of
events for that PMU. The task-related perf_event_contexts have a
pointer back to that task.

Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to
that PMU, and a group of events for that PMU.

The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.

Each perf_event is then associated with its PMU and one
perf_event_context.
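
Purely as a reference, the current relationships reduce to roughly the
following trimmed sketch (only the linkage fields of the existing
structures are shown, everything else is omitted):

  struct pmu {
          /* one CPU context per PMU */
          struct perf_cpu_context __percpu *pmu_cpu_context;
  };

  struct perf_event_context {
          struct pmu              *pmu;   /* context is bound to one PMU */
          struct task_struct      *task;  /* NULL for the per-CPU context */
  };

  struct perf_cpu_context {
          struct perf_event_context  ctx;      /* this PMU's CPU context */
          struct perf_event_context *task_ctx; /* task ctx on this CPU */
  };

  struct task_struct {
          struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
  };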

Proposed design:
----------------
The new design proposed by this patch reduces this to a single task
context and a single CPU context, but adds some intermediate data
structures:

  task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
       ^                                 |    ^ ^
       `---------------------------------'    | |
                                              | |    perf_cpu_pmu_context
                                              | `----.    ^
                                              |      |    |
                                              |      v    v
                                              | ,--> perf_event_pmu_context
                                              | |         ^
                                              | |         |
                                              v v         v
                                         perf_event ---> pmu

With the new design, a single perf_event_context holds the events of
all pmus in the respective (pinned/flexible) rbtrees. This is achieved
by adding the pmu to the rbtree key:

  {cpu, pmu, cgroup_id, group_index}
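
For illustration, a minimal sketch of the resulting lexicographic
ordering (group_key and group_key_cmp below are made-up names; the
in-tree comparator is perf_event_groups_cmp() in the diff, which
additionally supports partial keys for subtree lookups):

  /*
   * Sorting events by this tuple makes all events of one pmu (within
   * one cpu and cgroup) a contiguous subtree of the rbtree.
   */
  struct group_key {
          int              cpu;
          struct pmu      *pmu;
          u64              cgroup_id;
          u64              group_index;
  };

  static int group_key_cmp(const struct group_key *a, const struct group_key *b)
  {
          if (a->cpu != b->cpu)
                  return a->cpu < b->cpu ? -1 : 1;
          if (a->pmu != b->pmu)   /* new: pmu sorts between cpu and cgroup */
                  return a->pmu < b->pmu ? -1 : 1;
          if (a->cgroup_id != b->cgroup_id)
                  return a->cgroup_id < b->cgroup_id ? -1 : 1;
          if (a->group_index != b->group_index)
                  return a->group_index < b->group_index ? -1 : 1;
          return 0;
  }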

Each perf_event_context carries a list of perf_event_pmu_context
structures, which hold per-pmu-per-context state. For example, a
perf_event_pmu_context keeps track of the currently active events for
its pmu, a pmu-specific task_ctx_data, a flag telling whether rotation
is required, etc.

Similarly, perf_cpu_pmu_context holds per-pmu-per-cpu state such as
the hrtimer details that drive event rotation, a pointer to the
perf_event_pmu_context of the currently running task, and some other
ancillary information.

Each perf_event is associated with its pmu, its perf_event_context
and its perf_event_pmu_context.
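
As a trimmed sketch of the pointer fields involved (matching the
structure definitions further down in the patch, with everything else
omitted):

  struct perf_event_context {
          struct list_head     pmu_ctx_list; /* list of perf_event_pmu_context */
          struct task_struct  *task;         /* NULL for the per-CPU context */
          /* pinned/flexible rbtrees keyed by {cpu, pmu, cgroup_id, group_index} */
  };

  struct perf_event_pmu_context {              /* per-pmu-per-context state */
          struct pmu                    *pmu;
          struct perf_event_context     *ctx;
          struct list_head               pmu_ctx_entry; /* on ctx->pmu_ctx_list */
          void                          *task_ctx_data; /* pmu specific data */
  };

  struct perf_cpu_pmu_context {                /* per-pmu-per-cpu state */
          struct perf_event_pmu_context  epc;      /* CPU-side pmu context */
          struct perf_event_pmu_context *task_epc; /* pmu ctx of current task */
          struct hrtimer                 hrtimer;  /* drives event rotation */
  };

  struct perf_event {
          struct pmu                    *pmu;     /* pmu this event belongs to */
          struct perf_event_context     *ctx;     /* task or CPU context */
          struct perf_event_pmu_context *pmu_ctx; /* per-pmu state in that ctx */
  };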

Original RFC -> RFC v2:
-----------------------
In addition to porting the patch to the latest (v5.16-rc6) kernel,
here are some of the major changes between the two revisions:

- There have been quite a few fundamental changes since the original
  patch. Most notably, the rbtree key has changed from
  {cpu,group_index} to {cpu,cgroup_id,group_index}. Adding a pmu key
  in between, as proposed in the original patch, is not
  straightforward as it would break the cgroup specific optimization.
  Hence we need to iterate over all pmu_ctx for a given ctx and call
  visit_groups_merge() one by one (see the sketch after this list).
- Enabled cgroup support (CGROUP_PERF).
- Some changes wrt multiplexing events, as with the new design the
  rotation happens at the cgroup subtree rather than at the pmu
  subtree as in the original patch.
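
A rough sketch of what iterating over all pmu_ctx means for the
scheduling path (illustrative only: ctx_groups_sched_in() is a made-up
helper and the argument list of visit_groups_merge() is abbreviated;
the real code also handles the pinned/flexible split and the cgroup
optimization):

  static void ctx_groups_sched_in(struct perf_event_context *ctx,
                                  struct perf_event_groups *groups,
                                  int (*func)(struct perf_event *, void *),
                                  void *data)
  {
          struct perf_event_pmu_context *pmu_ctx;

          list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
                  /* only walk the {cpu, pmu, cgroup} subtree of this pmu */
                  visit_groups_merge(ctx, groups, pmu_ctx->pmu, func, data);
          }
  }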

Because of the additional complexity the above changes bring in, I
thought to get an initial review of the overall approach before
starting to make it upstream ready. Hence this patch just provides an
idea of the direction we will head toward. There are many loose ends
in the patch right now. For example, I've not paid much attention to
synchronization-related aspects. Similarly, some of the issues marked
in the original patch (XXX) haven't been fixed.

A simple perf stat/record/top survives with the patch, but the
machine crashes on the first run of perf test (a stale cpc->task_epc
causes the crash). Lockdep is also screaming a lot :)

[1]: https://lore.kernel.org/lkml/20181010104559.GO5728@hirez.programming.kicks-ass.net

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ravi Bangoria <ravi.bangoria@amd.com>
---
 arch/powerpc/perf/core-book3s.c |    4 +-
 arch/x86/events/core.c          |   15 +-
 arch/x86/events/intel/core.c    |   12 +-
 arch/x86/events/intel/ds.c      |    4 +-
 arch/x86/events/intel/lbr.c     |   30 +-
 arch/x86/events/perf_event.h    |   16 +-
 include/linux/perf_event.h      |  105 +-
 include/linux/sched.h           |    2 +-
 kernel/events/core.c            | 1655 ++++++++++++++++---------------
 9 files changed, 982 insertions(+), 861 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 73e62e9b179b..fc5cdc6550d6 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -131,7 +131,7 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in) {}
 static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
 static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
@@ -450,7 +450,7 @@ static void power_pmu_bhrb_disable(struct perf_event *event)
 /* Called from ctxsw to prevent one process's branch entries to
  * mingle with the other process's entries during context switch.
  */
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	if (!ppmu->bhrb_nr)
 		return;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 6dfa8ddaa60f..51ffb1e8de0a 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2053,13 +2053,14 @@ void x86_pmu_show_pmu_cap(int num_counters, int num_counters_fixed,
  */
 void x86_pmu_update_cpu_context(struct pmu *pmu, int cpu)
 {
-	struct perf_cpu_context *cpuctx;
+	/* XXX: Don't need this quirk anymore */
+	/*struct perf_cpu_context *cpuctx;
 
 	if (!pmu->pmu_cpu_context)
 		return;
 
 	cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-	cpuctx->ctx.pmu = pmu;
+	cpuctx->ctx.pmu = pmu;*/
 }
 
 static int __init init_hw_perf_events(void)
@@ -2630,15 +2631,15 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
-static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
-	static_call_cond(x86_pmu_sched_task)(ctx, sched_in);
+	static_call_cond(x86_pmu_sched_task)(pmu_ctx, sched_in);
 }
 
-static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
-				  struct perf_event_context *next)
+static void x86_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				  struct perf_event_pmu_context *next_epc)
 {
-	static_call_cond(x86_pmu_swap_task_ctx)(prev, next);
+	static_call_cond(x86_pmu_swap_task_ctx)(prev_epc, next_epc);
 }
 
 void perf_check_microcode(void)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 9a044438072b..de9ca948b042 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4477,17 +4477,17 @@ static void intel_pmu_cpu_dead(int cpu)
 		cpumask_clear_cpu(cpu, &hybrid_pmu(cpuc->pmu)->supported_cpus);
 }
 
-static void intel_pmu_sched_task(struct perf_event_context *ctx,
+static void intel_pmu_sched_task(struct perf_event_pmu_context *pmu_ctx,
 				 bool sched_in)
 {
-	intel_pmu_pebs_sched_task(ctx, sched_in);
-	intel_pmu_lbr_sched_task(ctx, sched_in);
+	intel_pmu_pebs_sched_task(pmu_ctx, sched_in);
+	intel_pmu_lbr_sched_task(pmu_ctx, sched_in);
 }
 
-static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
-				    struct perf_event_context *next)
+static void intel_pmu_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				    struct perf_event_pmu_context *next_epc)
 {
-	intel_pmu_lbr_swap_task_ctx(prev, next);
+	intel_pmu_lbr_swap_task_ctx(prev_epc, next_epc);
 }
 
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 8647713276a7..67697c32fe92 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1004,7 +1004,7 @@ static inline bool pebs_needs_sched_cb(struct cpu_hw_events *cpuc)
 	return cpuc->n_pebs && (cpuc->n_pebs == cpuc->n_large_pebs);
 }
 
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
@@ -1112,7 +1112,7 @@ static void
 pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
 		  struct perf_event *event, bool add)
 {
-	struct pmu *pmu = event->ctx->pmu;
+	struct pmu *pmu = event->pmu;
 	/*
 	 * Make sure we get updated with the first PEBS
 	 * event. It will trigger also during removal, but
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 9e6d6eaeb4cb..b708be7e4709 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -596,21 +596,21 @@ static void __intel_pmu_lbr_save(void *ctx)
 	cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
 }
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next)
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				 struct perf_event_pmu_context *next_epc)
 {
 	void *prev_ctx_data, *next_ctx_data;
 
-	swap(prev->task_ctx_data, next->task_ctx_data);
+	swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
 
 	/*
-	 * Architecture specific synchronization makes sense in
-	 * case both prev->task_ctx_data and next->task_ctx_data
+	 * Architecture specific synchronization makes sense in case
+	 * both prev_epc->task_ctx_data and next_epc->task_ctx_data
 	 * pointers are allocated.
 	 */
 
-	prev_ctx_data = next->task_ctx_data;
-	next_ctx_data = prev->task_ctx_data;
+	prev_ctx_data = next_epc->task_ctx_data;
+	next_ctx_data = prev_epc->task_ctx_data;
 
 	if (!prev_ctx_data || !next_ctx_data)
 		return;
@@ -619,7 +619,7 @@ void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 	     task_context_opt(next_ctx_data)->lbr_callstack_users);
 }
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	void *task_ctx;
@@ -632,7 +632,7 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
 	 * the task was scheduled out, restore the stack. Otherwise flush
 	 * the LBR stack.
 	 */
-	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	task_ctx = pmu_ctx ? pmu_ctx->task_ctx_data : NULL;
 	if (task_ctx) {
 		if (sched_in)
 			__intel_pmu_lbr_restore(task_ctx);
@@ -668,8 +668,8 @@ void intel_pmu_lbr_add(struct perf_event *event)
 
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
-	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
+	if (branch_user_callstack(cpuc->br_sel) && event->pmu_ctx->task_ctx_data)
+		task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users++;
 
 	/*
 	 * Request pmu::sched_task() callback, which will fire inside the
@@ -692,7 +692,7 @@ void intel_pmu_lbr_add(struct perf_event *event)
 	 */
 	if (x86_pmu.intel_cap.pebs_baseline && event->attr.precise_ip > 0)
 		cpuc->lbr_pebs_users++;
-	perf_sched_cb_inc(event->ctx->pmu);
+	perf_sched_cb_inc(event->pmu);
 	if (!cpuc->lbr_users++ && !event->total_time_running)
 		intel_pmu_lbr_reset();
 }
@@ -745,8 +745,8 @@ void intel_pmu_lbr_del(struct perf_event *event)
 		return;
 
 	if (branch_user_callstack(cpuc->br_sel) &&
-	    event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
+	    event->pmu_ctx->task_ctx_data)
+		task_context_opt(event->pmu_ctx->task_ctx_data)->lbr_callstack_users--;
 
 	if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
 		cpuc->lbr_select = 0;
@@ -756,7 +756,7 @@ void intel_pmu_lbr_del(struct perf_event *event)
 	cpuc->lbr_users--;
 	WARN_ON_ONCE(cpuc->lbr_users < 0);
 	WARN_ON_ONCE(cpuc->lbr_pebs_users < 0);
-	perf_sched_cb_dec(event->ctx->pmu);
+	perf_sched_cb_dec(event->pmu);
 }
 
 static inline bool vlbr_exclude_host(void)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index e3ac05c97b5e..fd937743b51a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -792,7 +792,7 @@ struct x86_pmu {
 	void		(*cpu_dead)(int cpu);
 
 	void		(*check_microcode)(void);
-	void		(*sched_task)(struct perf_event_context *ctx,
+	void		(*sched_task)(struct perf_event_pmu_context *pmu_ctx,
 				      bool sched_in);
 
 	/*
@@ -869,12 +869,12 @@ struct x86_pmu {
 	int		(*set_topdown_event_period)(struct perf_event *event);
 
 	/*
-	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
+	 * perf task context (i.e. struct perf_event_pmu_context::task_ctx_data)
 	 * switch helper to bridge calls from perf/core to perf/x86.
 	 * See struct pmu::swap_task_ctx() usage for examples;
 	 */
-	void		(*swap_task_ctx)(struct perf_event_context *prev,
-					 struct perf_event_context *next);
+	void		(*swap_task_ctx)(struct perf_event_pmu_context *prev_epc,
+					 struct perf_event_pmu_context *next_epc);
 
 	/*
 	 * AMD bits
@@ -1316,7 +1316,7 @@ void intel_pmu_pebs_enable_all(void);
 
 void intel_pmu_pebs_disable_all(void);
 
-void intel_pmu_pebs_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
 void intel_pmu_auto_reload_read(struct perf_event *event);
 
@@ -1324,10 +1324,10 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
 
 void intel_ds_init(void);
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next);
+void intel_pmu_lbr_swap_task_ctx(struct perf_event_pmu_context *prev_epc,
+				 struct perf_event_pmu_context *next_epc);
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_lbr_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
 u64 lbr_from_signext_quirk_wr(u64 val);
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 9b60bb89d86a..c7d1f455de0d 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -250,6 +250,7 @@ struct hw_perf_event {
 };
 
 struct perf_event;
+struct perf_event_pmu_context;
 
 /*
  * Common implementation detail of pmu::{start,commit,cancel}_txn
@@ -292,7 +293,7 @@ struct pmu {
 	int				capabilities;
 
 	int __percpu			*pmu_disable_count;
-	struct perf_cpu_context __percpu *pmu_cpu_context;
+	struct perf_cpu_pmu_context __percpu *cpu_pmu_context;
 	atomic_t			exclusive_cnt; /* < 0: cpu; > 0: tsk */
 	int				task_ctx_nr;
 	int				hrtimer_interval_ms;
@@ -427,7 +428,7 @@ struct pmu {
 	/*
 	 * context-switches callback
 	 */
-	void (*sched_task)		(struct perf_event_context *ctx,
+	void (*sched_task)		(struct perf_event_pmu_context *pmu_ctx,
 					bool sched_in);
 
 	/*
@@ -441,8 +442,8 @@ struct pmu {
 	 * implementation and Perf core context switch handling callbacks for usage
 	 * examples.
 	 */
-	void (*swap_task_ctx)		(struct perf_event_context *prev,
-					 struct perf_event_context *next);
+	void (*swap_task_ctx)		(struct perf_event_pmu_context *prev_epc,
+					 struct perf_event_pmu_context *next_epc);
 					/* optional */
 
 	/*
@@ -662,6 +663,11 @@ struct perf_event {
 	int				group_caps;
 
 	struct perf_event		*group_leader;
+	/*
+	 * event->pmu will always point to pmu in which this event belongs.
+	 * Unlike event->pmu_ctx->pmu which points to other pmu when group of
+	 * different events are created.
+	 */
 	struct pmu			*pmu;
 	void				*pmu_private;
 
@@ -699,6 +705,12 @@ struct perf_event {
 	struct hw_perf_event		hw;
 
 	struct perf_event_context	*ctx;
+	/*
+	 * event->pmu_ctx points to perf_event_pmu_context in which the event
+	 * is added. This pmu_ctx can be of other pmu for sw event when such
+	 * sw event is added to a non-sw event group.
+	 */
+	struct perf_event_pmu_context	*pmu_ctx;
 	atomic_long_t			refcount;
 
 	/*
@@ -786,19 +798,60 @@ struct perf_event {
 #endif /* CONFIG_PERF_EVENTS */
 };
 
+/*
+ *           ,------------------------[1:n]---------------------.
+ *           V                                                  V
+ * perf_event_context <-[1:n]-> perf_event_pmu_context <--- perf_event
+ *           ^                      ^     |                     |
+ *           `--------[1:n]---------'     `-[n:1]-> pmu <-[1:n]-'
+ *
+ *
+ * XXX destroy epc when empty
+ *   refcount, !rcu
+ *
+ * XXX epc locking
+ *
+ *   event->pmu_ctx            ctx->mutex && inactive
+ *   ctx->pmu_ctx_list         ctx->mutex && ctx->lock
+ *
+ */
+struct perf_event_pmu_context {
+	struct pmu			*pmu;
+	struct perf_event_context       *ctx;
+
+	struct list_head		pmu_ctx_entry;
+
+	struct list_head		pinned_active;
+	struct list_head		flexible_active;
+
+	/* Used to avoid freeing per-cpu perf_event_pmu_context */
+	unsigned int			embedded : 1;
+
+	unsigned int			nr_events;
+	unsigned int			nr_active;
+
+	atomic_t			refcount; /* event <-> epc */
+
+	void				*task_ctx_data; /* pmu specific data */
+	/*
+	 * Set when nr_events != nr_active, except tolerant to events not
+	 * necessary to be active due to scheduling constraints, such as cgroups.
+	 */
+	int				rotate_necessary;
+};
 
 struct perf_event_groups {
 	struct rb_root	tree;
 	u64		index;
 };
 
+
 /**
  * struct perf_event_context - event context structure
  *
  * Used as a container for task events and CPU events as well:
  */
 struct perf_event_context {
-	struct pmu			*pmu;
 	/*
 	 * Protect the states of the events in the list,
 	 * nr_active, and the list:
@@ -811,25 +864,20 @@ struct perf_event_context {
 	 */
 	struct mutex			mutex;
 
-	struct list_head		active_ctx_list;
+	struct list_head		pmu_ctx_list;
 	struct perf_event_groups	pinned_groups;
 	struct perf_event_groups	flexible_groups;
 	struct list_head		event_list;
 
-	struct list_head		pinned_active;
-	struct list_head		flexible_active;
-
 	int				nr_events;
 	int				nr_active;
 	int				is_active;
+
+	int				nr_task_data;
 	int				nr_stat;
 	int				nr_freq;
 	int				rotate_disable;
-	/*
-	 * Set when nr_events != nr_active, except tolerant to events not
-	 * necessary to be active due to scheduling constraints, such as cgroups.
-	 */
-	int				rotate_necessary;
+
 	refcount_t			refcount;
 	struct task_struct		*task;
 
@@ -850,7 +898,6 @@ struct perf_event_context {
 #ifdef CONFIG_CGROUP_PERF
 	int				nr_cgroups;	 /* cgroup evts */
 #endif
-	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
 };
 
@@ -860,12 +907,13 @@ struct perf_event_context {
  */
 #define PERF_NR_CONTEXTS	4
 
-/**
- * struct perf_event_cpu_context - per cpu event context structure
- */
-struct perf_cpu_context {
-	struct perf_event_context	ctx;
-	struct perf_event_context	*task_ctx;
+struct perf_cpu_pmu_context {
+	struct perf_event_pmu_context   epc;
+	struct perf_event_pmu_context   *task_epc;
+
+	struct list_head		sched_cb_entry;
+	int				sched_cb_usage;
+
 	int				active_oncpu;
 	int				exclusive;
 
@@ -873,16 +921,21 @@ struct perf_cpu_context {
 	struct hrtimer			hrtimer;
 	ktime_t				hrtimer_interval;
 	unsigned int			hrtimer_active;
+};
+
+/**
+ * struct perf_event_cpu_context - per cpu event context structure
+ */
+struct perf_cpu_context {
+	struct perf_event_context	ctx;
+	struct perf_event_context	*task_ctx;
+	int				online;
 
 #ifdef CONFIG_CGROUP_PERF
 	struct perf_cgroup		*cgrp;
 	struct list_head		cgrp_cpuctx_entry;
 #endif
 
-	struct list_head		sched_cb_entry;
-	int				sched_cb_usage;
-
-	int				online;
 	/*
 	 * Per-CPU storage for iterators used in visit_groups_merge. The default
 	 * storage is of size 2 to hold the CPU and any CPU event iterators.
@@ -1130,7 +1183,7 @@ static inline int is_software_event(struct perf_event *event)
  */
 static inline int in_software_context(struct perf_event *event)
 {
-	return event->ctx->pmu->task_ctx_nr == perf_sw_context;
+	return event->pmu_ctx->pmu->task_ctx_nr == perf_sw_context;
 }
 
 static inline int is_exclusive_pmu(struct pmu *pmu)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c1a927ddec64..17e8e1b04ded 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1221,7 +1221,7 @@ struct task_struct {
 	unsigned int			futex_state;
 #endif
 #ifdef CONFIG_PERF_EVENTS
-	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
+	struct perf_event_context	*perf_event_ctxp;
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
 #endif
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f23ca260307f..cf95240c6db0 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -154,12 +154,6 @@ static int cpu_function_call(int cpu, remote_function_f func, void *info)
 	return data.ret;
 }
 
-static inline struct perf_cpu_context *
-__get_cpu_context(struct perf_event_context *ctx)
-{
-	return this_cpu_ptr(ctx->pmu->pmu_cpu_context);
-}
-
 static void perf_ctx_lock(struct perf_cpu_context *cpuctx,
 			  struct perf_event_context *ctx)
 {
@@ -183,6 +177,8 @@ static bool is_kernel_event(struct perf_event *event)
 	return READ_ONCE(event->owner) == TASK_TOMBSTONE;
 }
 
+static DEFINE_PER_CPU(struct perf_cpu_context, cpu_context);
+
 /*
  * On task ctx scheduling...
  *
@@ -216,7 +212,7 @@ static int event_function(void *info)
 	struct event_function_struct *efs = info;
 	struct perf_event *event = efs->event;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	int ret = 0;
 
@@ -313,7 +309,7 @@ static void event_function_call(struct perf_event *event, event_f func, void *da
 static void event_function_local(struct perf_event *event, event_f func, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct task_struct *task = READ_ONCE(ctx->task);
 	struct perf_event_context *task_ctx = NULL;
 
@@ -387,7 +383,6 @@ static DEFINE_MUTEX(perf_sched_mutex);
 static atomic_t perf_sched_count;
 
 static DEFINE_PER_CPU(atomic_t, perf_cgroup_events);
-static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 static DEFINE_PER_CPU(struct pmu_event_list, pmu_sb_events);
 
 static atomic_t nr_mmap_events __read_mostly;
@@ -447,7 +442,7 @@ static void update_perf_cpu_limits(void)
 	WRITE_ONCE(perf_sample_allowed_ns, tmp);
 }
 
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc);
 
 int perf_proc_update_handler(struct ctl_table *table, int write,
 		void *buffer, size_t *lenp, loff_t *ppos)
@@ -570,13 +565,6 @@ void perf_sample_event_took(u64 sample_len_ns)
 
 static atomic64_t perf_event_id;
 
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type);
-
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type,
-			     struct task_struct *task);
-
 static void update_context_time(struct perf_event_context *ctx);
 static u64 perf_event_time(struct perf_event *event);
 
@@ -674,13 +662,35 @@ perf_event_set_state(struct perf_event *event, enum perf_event_state state)
 	WRITE_ONCE(event->state, state);
 }
 
+static void perf_ctx_disable(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		perf_pmu_disable(pmu_ctx->pmu);
+}
+
+static void perf_ctx_enable(struct perf_event_context *ctx)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		perf_pmu_enable(pmu_ctx->pmu);
+}
+
+static void ctx_sched_out(struct perf_event_context *ctx,
+			  enum event_type_t event_type);
+static void
+ctx_sched_in(struct perf_event_context *ctx,
+	     enum event_type_t event_type,
+	     struct task_struct *task);
+
 #ifdef CONFIG_CGROUP_PERF
 
 static inline bool
 perf_cgroup_match(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 
 	/* @event doesn't care about cgroup */
 	if (!event->cgrp)
@@ -789,6 +799,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
 	}
 }
 
+/* XXX: No need of list now. Convert it to per-cpu variable */
 static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
 
 #define PERF_CGROUP_SWOUT	0x1 /* cgroup switch out every event */
@@ -817,10 +828,10 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 		WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
 
 		perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-		perf_pmu_disable(cpuctx->ctx.pmu);
+		perf_ctx_disable(&cpuctx->ctx);
 
 		if (mode & PERF_CGROUP_SWOUT) {
-			cpu_ctx_sched_out(cpuctx, EVENT_ALL);
+			ctx_sched_out(&cpuctx->ctx, EVENT_ALL);
 			/*
 			 * must not be done before ctxswout due
 			 * to event_filter_match() in event_sched_out()
@@ -837,11 +848,10 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 * we pass the cpuctx->ctx to perf_cgroup_from_task()
 			 * because cgorup events are only per-cpu
 			 */
-			cpuctx->cgrp = perf_cgroup_from_task(task,
-							     &cpuctx->ctx);
-			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			cpuctx->cgrp = perf_cgroup_from_task(task, &cpuctx->ctx);
+			ctx_sched_in(&cpuctx->ctx, EVENT_ALL, task);
 		}
-		perf_pmu_enable(cpuctx->ctx.pmu);
+		perf_ctx_enable(&cpuctx->ctx);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 	}
 
@@ -915,7 +925,7 @@ static int perf_cgroup_ensure_storage(struct perf_event *event,
 		heap_size++;
 
 	for_each_possible_cpu(cpu) {
-		cpuctx = per_cpu_ptr(event->pmu->pmu_cpu_context, cpu);
+		cpuctx = this_cpu_ptr(&cpu_context);
 		if (heap_size <= cpuctx->heap_size)
 			continue;
 
@@ -1129,34 +1139,30 @@ perf_cgroup_event_disable(struct perf_event *event, struct perf_event_context *c
  */
 static enum hrtimer_restart perf_mux_hrtimer_handler(struct hrtimer *hr)
 {
-	struct perf_cpu_context *cpuctx;
+	struct perf_cpu_pmu_context *cpc;
 	bool rotations;
 
 	lockdep_assert_irqs_disabled();
 
-	cpuctx = container_of(hr, struct perf_cpu_context, hrtimer);
-	rotations = perf_rotate_context(cpuctx);
+	cpc = container_of(hr, struct perf_cpu_pmu_context, hrtimer);
+	rotations = perf_rotate_context(cpc);
 
-	raw_spin_lock(&cpuctx->hrtimer_lock);
+	raw_spin_lock(&cpc->hrtimer_lock);
 	if (rotations)
-		hrtimer_forward_now(hr, cpuctx->hrtimer_interval);
+		hrtimer_forward_now(hr, cpc->hrtimer_interval);
 	else
-		cpuctx->hrtimer_active = 0;
-	raw_spin_unlock(&cpuctx->hrtimer_lock);
+		cpc->hrtimer_active = 0;
+	raw_spin_unlock(&cpc->hrtimer_lock);
 
 	return rotations ? HRTIMER_RESTART : HRTIMER_NORESTART;
 }
 
-static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
+static void __perf_mux_hrtimer_init(struct perf_cpu_pmu_context *cpc, int cpu)
 {
-	struct hrtimer *timer = &cpuctx->hrtimer;
-	struct pmu *pmu = cpuctx->ctx.pmu;
+	struct hrtimer *timer = &cpc->hrtimer;
+	struct pmu *pmu = cpc->epc.pmu;
 	u64 interval;
 
-	/* no multiplexing needed for SW PMU */
-	if (pmu->task_ctx_nr == perf_sw_context)
-		return;
-
 	/*
 	 * check default is sane, if not set then force to
 	 * default interval (1/tick)
@@ -1165,30 +1171,25 @@ static void __perf_mux_hrtimer_init(struct perf_cpu_context *cpuctx, int cpu)
 	if (interval < 1)
 		interval = pmu->hrtimer_interval_ms = PERF_CPU_HRTIMER;
 
-	cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
+	cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * interval);
 
-	raw_spin_lock_init(&cpuctx->hrtimer_lock);
+	raw_spin_lock_init(&cpc->hrtimer_lock);
 	hrtimer_init(timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED_HARD);
 	timer->function = perf_mux_hrtimer_handler;
 }
 
-static int perf_mux_hrtimer_restart(struct perf_cpu_context *cpuctx)
+static int perf_mux_hrtimer_restart(struct perf_cpu_pmu_context *cpc)
 {
-	struct hrtimer *timer = &cpuctx->hrtimer;
-	struct pmu *pmu = cpuctx->ctx.pmu;
+	struct hrtimer *timer = &cpc->hrtimer;
 	unsigned long flags;
 
-	/* not for SW PMU */
-	if (pmu->task_ctx_nr == perf_sw_context)
-		return 0;
-
-	raw_spin_lock_irqsave(&cpuctx->hrtimer_lock, flags);
-	if (!cpuctx->hrtimer_active) {
-		cpuctx->hrtimer_active = 1;
-		hrtimer_forward_now(timer, cpuctx->hrtimer_interval);
+	raw_spin_lock_irqsave(&cpc->hrtimer_lock, flags);
+	if (!cpc->hrtimer_active) {
+		cpc->hrtimer_active = 1;
+		hrtimer_forward_now(timer, cpc->hrtimer_interval);
 		hrtimer_start_expires(timer, HRTIMER_MODE_ABS_PINNED_HARD);
 	}
-	raw_spin_unlock_irqrestore(&cpuctx->hrtimer_lock, flags);
+	raw_spin_unlock_irqrestore(&cpc->hrtimer_lock, flags);
 
 	return 0;
 }
@@ -1207,32 +1208,9 @@ void perf_pmu_enable(struct pmu *pmu)
 		pmu->pmu_enable(pmu);
 }
 
-static DEFINE_PER_CPU(struct list_head, active_ctx_list);
-
-/*
- * perf_event_ctx_activate(), perf_event_ctx_deactivate(), and
- * perf_event_task_tick() are fully serialized because they're strictly cpu
- * affine and perf_event_ctx{activate,deactivate} are called with IRQs
- * disabled, while perf_event_task_tick is called from IRQ context.
- */
-static void perf_event_ctx_activate(struct perf_event_context *ctx)
-{
-	struct list_head *head = this_cpu_ptr(&active_ctx_list);
-
-	lockdep_assert_irqs_disabled();
-
-	WARN_ON(!list_empty(&ctx->active_ctx_list));
-
-	list_add(&ctx->active_ctx_list, head);
-}
-
-static void perf_event_ctx_deactivate(struct perf_event_context *ctx)
+static void perf_assert_pmu_disabled(struct pmu *pmu)
 {
-	lockdep_assert_irqs_disabled();
-
-	WARN_ON(list_empty(&ctx->active_ctx_list));
-
-	list_del_init(&ctx->active_ctx_list);
+	WARN_ON_ONCE(*this_cpu_ptr(pmu->pmu_disable_count) == 0);
 }
 
 static void get_ctx(struct perf_event_context *ctx)
@@ -1259,7 +1237,6 @@ static void free_ctx(struct rcu_head *head)
 	struct perf_event_context *ctx;
 
 	ctx = container_of(head, struct perf_event_context, rcu_head);
-	free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
 	kfree(ctx);
 }
 
@@ -1444,7 +1421,7 @@ static u64 primary_event_id(struct perf_event *event)
  * the context could get moved to another task.
  */
 static struct perf_event_context *
-perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
+perf_lock_task_context(struct task_struct *task, unsigned long *flags)
 {
 	struct perf_event_context *ctx;
 
@@ -1460,7 +1437,7 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
 	 */
 	local_irq_save(*flags);
 	rcu_read_lock();
-	ctx = rcu_dereference(task->perf_event_ctxp[ctxn]);
+	ctx = rcu_dereference(task->perf_event_ctxp);
 	if (ctx) {
 		/*
 		 * If this context is a clone of another, it might
@@ -1473,7 +1450,7 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
 		 * can't get swapped on us any more.
 		 */
 		raw_spin_lock(&ctx->lock);
-		if (ctx != rcu_dereference(task->perf_event_ctxp[ctxn])) {
+		if (ctx != rcu_dereference(task->perf_event_ctxp)) {
 			raw_spin_unlock(&ctx->lock);
 			rcu_read_unlock();
 			local_irq_restore(*flags);
@@ -1500,12 +1477,12 @@ perf_lock_task_context(struct task_struct *task, int ctxn, unsigned long *flags)
  * reference count so that the context can't get freed.
  */
 static struct perf_event_context *
-perf_pin_task_context(struct task_struct *task, int ctxn)
+perf_pin_task_context(struct task_struct *task)
 {
 	struct perf_event_context *ctx;
 	unsigned long flags;
 
-	ctx = perf_lock_task_context(task, ctxn, &flags);
+	ctx = perf_lock_task_context(task, &flags);
 	if (ctx) {
 		++ctx->pin_count;
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
@@ -1614,14 +1591,22 @@ static inline struct cgroup *event_cgroup(const struct perf_event *event)
  * which provides ordering when rotating groups for the same CPU.
  */
 static __always_inline int
-perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
-		      const u64 left_group_index, const struct perf_event *right)
+perf_event_groups_cmp(const int left_cpu, const struct pmu *left_pmu,
+		      const struct cgroup *left_cgroup, const u64 left_group_index,
+		      const struct perf_event *right)
 {
 	if (left_cpu < right->cpu)
 		return -1;
 	if (left_cpu > right->cpu)
 		return 1;
 
+	if (left_pmu) {
+		if (left_pmu < right->pmu_ctx->pmu)
+			return -1;
+		if (left_pmu > right->pmu_ctx->pmu)
+			return 1;
+	}
+
 #ifdef CONFIG_CGROUP_PERF
 	{
 		const struct cgroup *right_cgroup = event_cgroup(right);
@@ -1664,12 +1649,13 @@ perf_event_groups_cmp(const int left_cpu, const struct cgroup *left_cgroup,
 static inline bool __group_less(struct rb_node *a, const struct rb_node *b)
 {
 	struct perf_event *e = __node_2_pe(a);
-	return perf_event_groups_cmp(e->cpu, event_cgroup(e), e->group_index,
-				     __node_2_pe(b)) < 0;
+	return perf_event_groups_cmp(e->cpu, e->pmu_ctx->pmu, event_cgroup(e),
+				     e->group_index, __node_2_pe(b)) < 0;
 }
 
 struct __group_key {
 	int cpu;
+	struct pmu *pmu;
 	struct cgroup *cgroup;
 };
 
@@ -1678,14 +1664,25 @@ static inline int __group_cmp(const void *key, const struct rb_node *node)
 	const struct __group_key *a = key;
 	const struct perf_event *b = __node_2_pe(node);
 
-	/* partial/subtree match: @cpu, @cgroup; ignore: @group_index */
-	return perf_event_groups_cmp(a->cpu, a->cgroup, b->group_index, b);
+	/* partial/subtree match: @cpu, @pmu, @cgroup; ignore: @group_index */
+	return perf_event_groups_cmp(a->cpu, a->pmu, a->cgroup, b->group_index, b);
+}
+
+static inline int
+__group_cmp_ignore_cgroup(const void *key, const struct rb_node *node)
+{
+	const struct __group_key *a = key;
+	const struct perf_event *b = __node_2_pe(node);
+
+	/* partial/subtree match: @cpu, @pmu, ignore: @cgroup, @group_index */
+	return perf_event_groups_cmp(a->cpu, a->pmu, event_cgroup(b),
+				     b->group_index, b);
 }
 
 /*
- * Insert @event into @groups' tree; using {@event->cpu, ++@groups->index} for
- * key (see perf_event_groups_less). This places it last inside the CPU
- * subtree.
+ * Insert @event into @groups' tree; using
+ *   {@event->cpu, @event->pmu_ctx->pmu, event_cgroup(@event), ++@groups->index}
+ * as key. This places it last inside the {cpu,pmu,cgroup} subtree.
  */
 static void
 perf_event_groups_insert(struct perf_event_groups *groups,
@@ -1735,14 +1732,15 @@ del_event_from_groups(struct perf_event *event, struct perf_event_context *ctx)
 }
 
 /*
- * Get the leftmost event in the cpu/cgroup subtree.
+ * Get the leftmost event in the {cpu,pmu,cgroup} subtree.
  */
 static struct perf_event *
 perf_event_groups_first(struct perf_event_groups *groups, int cpu,
-			struct cgroup *cgrp)
+			struct pmu *pmu, struct cgroup *cgrp)
 {
 	struct __group_key key = {
 		.cpu = cpu,
+		.pmu = pmu,
 		.cgroup = cgrp,
 	};
 	struct rb_node *node;
@@ -1754,14 +1752,12 @@ perf_event_groups_first(struct perf_event_groups *groups, int cpu,
 	return NULL;
 }
 
-/*
- * Like rb_entry_next_safe() for the @cpu subtree.
- */
 static struct perf_event *
-perf_event_groups_next(struct perf_event *event)
+perf_event_groups_next(struct perf_event *event, struct pmu *pmu)
 {
 	struct __group_key key = {
 		.cpu = event->cpu,
+		.pmu = pmu,
 		.cgroup = event_cgroup(event),
 	};
 	struct rb_node *next;
@@ -1815,6 +1811,7 @@ list_add_event(struct perf_event *event, struct perf_event_context *ctx)
 		perf_cgroup_event_enable(event, ctx);
 
 	ctx->generation++;
+	event->pmu_ctx->nr_events++;
 }
 
 /*
@@ -2020,6 +2017,7 @@ list_del_event(struct perf_event *event, struct perf_event_context *ctx)
 	}
 
 	ctx->generation++;
+	event->pmu_ctx->nr_events--;
 }
 
 static int
@@ -2036,13 +2034,11 @@ perf_aux_output_match(struct perf_event *event, struct perf_event *aux_event)
 
 static void put_event(struct perf_event *event);
 static void event_sched_out(struct perf_event *event,
-			    struct perf_cpu_context *cpuctx,
 			    struct perf_event_context *ctx);
 
 static void perf_put_aux_event(struct perf_event *event)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event *iter;
 
 	/*
@@ -2071,7 +2067,7 @@ static void perf_put_aux_event(struct perf_event *event)
 		 * state so that we don't try to schedule it again. Note
 		 * that perf_event_enable() will clear the ERROR status.
 		 */
-		event_sched_out(iter, cpuctx, ctx);
+		event_sched_out(iter, ctx);
 		perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 	}
 }
@@ -2122,8 +2118,8 @@ static int perf_get_aux_event(struct perf_event *event,
 
 static inline struct list_head *get_event_list(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	return event->attr.pinned ? &ctx->pinned_active : &ctx->flexible_active;
+	return event->attr.pinned ? &event->pmu_ctx->pinned_active :
+				    &event->pmu_ctx->flexible_active;
 }
 
 /*
@@ -2134,10 +2130,7 @@ static inline struct list_head *get_event_list(struct perf_event *event)
  */
 static inline void perf_remove_sibling_event(struct perf_event *event)
 {
-	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
-
-	event_sched_out(event, cpuctx, ctx);
+	event_sched_out(event, event->ctx);
 	perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 }
 
@@ -2261,12 +2254,14 @@ event_filter_match(struct perf_event *event)
 }
 
 static void
-event_sched_out(struct perf_event *event,
-		  struct perf_cpu_context *cpuctx,
-		  struct perf_event_context *ctx)
+event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
 	enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;
 
+	// XXX cpc serialization, probably per-cpu IRQ disabled
+
 	WARN_ON_ONCE(event->ctx != ctx);
 	lockdep_assert_held(&ctx->lock);
 
@@ -2293,38 +2288,34 @@ event_sched_out(struct perf_event *event,
 	perf_event_set_state(event, state);
 
 	if (!is_software_event(event))
-		cpuctx->active_oncpu--;
-	if (!--ctx->nr_active)
-		perf_event_ctx_deactivate(ctx);
+		cpc->active_oncpu--;
+	ctx->nr_active--;
+	event->pmu_ctx->nr_active--;
 	if (event->attr.freq && event->attr.sample_freq)
 		ctx->nr_freq--;
-	if (event->attr.exclusive || !cpuctx->active_oncpu)
-		cpuctx->exclusive = 0;
+	if (event->attr.exclusive || !cpc->active_oncpu)
+		cpc->exclusive = 0;
 
 	perf_pmu_enable(event->pmu);
 }
 
 static void
-group_sched_out(struct perf_event *group_event,
-		struct perf_cpu_context *cpuctx,
-		struct perf_event_context *ctx)
+group_sched_out(struct perf_event *group_event, struct perf_event_context *ctx)
 {
 	struct perf_event *event;
 
 	if (group_event->state != PERF_EVENT_STATE_ACTIVE)
 		return;
 
-	perf_pmu_disable(ctx->pmu);
+	perf_assert_pmu_disabled(group_event->pmu_ctx->pmu);
 
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, ctx);
 
 	/*
 	 * Schedule out siblings (if any):
 	 */
 	for_each_sibling_event(event, group_event)
-		event_sched_out(event, cpuctx, ctx);
-
-	perf_pmu_enable(ctx->pmu);
+		event_sched_out(event, ctx);
 }
 
 #define DETACH_GROUP	0x01UL
@@ -2349,16 +2340,18 @@ __perf_remove_from_context(struct perf_event *event,
 		update_cgrp_time_from_cpuctx(cpuctx);
 	}
 
-	event_sched_out(event, cpuctx, ctx);
+	event_sched_out(event, ctx);
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
 
+	if (!event->pmu_ctx->nr_events)
+		event->pmu_ctx->rotate_necessary = 0;
+
 	if (!ctx->nr_events && ctx->is_active) {
 		ctx->is_active = 0;
-		ctx->rotate_necessary = 0;
 		if (ctx->task) {
 			WARN_ON_ONCE(cpuctx->task_ctx != ctx);
 			cpuctx->task_ctx = NULL;
@@ -2389,7 +2382,7 @@ static void perf_remove_from_context(struct perf_event *event, unsigned long fla
 	 */
 	raw_spin_lock_irq(&ctx->lock);
 	if (!ctx->is_active) {
-		__perf_remove_from_context(event, __get_cpu_context(ctx),
+		__perf_remove_from_context(event, this_cpu_ptr(&cpu_context),
 					   ctx, (void *)flags);
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
@@ -2415,13 +2408,17 @@ static void __perf_event_disable(struct perf_event *event,
 		update_cgrp_time_from_event(event);
 	}
 
+	perf_pmu_disable(event->pmu_ctx->pmu);
+
 	if (event == event->group_leader)
-		group_sched_out(event, cpuctx, ctx);
+		group_sched_out(event, ctx);
 	else
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, ctx);
 
 	perf_event_set_state(event, PERF_EVENT_STATE_OFF);
 	perf_cgroup_event_disable(event, ctx);
+
+	perf_pmu_enable(event->pmu_ctx->pmu);
 }
 
 /*
@@ -2518,10 +2515,10 @@ static void perf_log_throttle(struct perf_event *event, int enable);
 static void perf_log_itrace_start(struct perf_event *event);
 
 static int
-event_sched_in(struct perf_event *event,
-		 struct perf_cpu_context *cpuctx,
-		 struct perf_event_context *ctx)
+event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
 	int ret = 0;
 
 	WARN_ON_ONCE(event->ctx != ctx);
@@ -2564,14 +2561,14 @@ event_sched_in(struct perf_event *event,
 	}
 
 	if (!is_software_event(event))
-		cpuctx->active_oncpu++;
-	if (!ctx->nr_active++)
-		perf_event_ctx_activate(ctx);
+		cpc->active_oncpu++;
+	ctx->nr_active++;
+	event->pmu_ctx->nr_active++;
 	if (event->attr.freq && event->attr.sample_freq)
 		ctx->nr_freq++;
 
 	if (event->attr.exclusive)
-		cpuctx->exclusive = 1;
+		cpc->exclusive = 1;
 
 out:
 	perf_pmu_enable(event->pmu);
@@ -2580,26 +2577,24 @@ event_sched_in(struct perf_event *event,
 }
 
 static int
-group_sched_in(struct perf_event *group_event,
-	       struct perf_cpu_context *cpuctx,
-	       struct perf_event_context *ctx)
+group_sched_in(struct perf_event *group_event, struct perf_event_context *ctx)
 {
 	struct perf_event *event, *partial_group = NULL;
-	struct pmu *pmu = ctx->pmu;
+	struct pmu *pmu = group_event->pmu_ctx->pmu;
 
 	if (group_event->state == PERF_EVENT_STATE_OFF)
 		return 0;
 
 	pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
 
-	if (event_sched_in(group_event, cpuctx, ctx))
+	if (event_sched_in(group_event, ctx))
 		goto error;
 
 	/*
 	 * Schedule in siblings as one group (if any):
 	 */
 	for_each_sibling_event(event, group_event) {
-		if (event_sched_in(event, cpuctx, ctx)) {
+		if (event_sched_in(event, ctx)) {
 			partial_group = event;
 			goto group_error;
 		}
@@ -2618,9 +2613,9 @@ group_sched_in(struct perf_event *group_event,
 		if (event == partial_group)
 			break;
 
-		event_sched_out(event, cpuctx, ctx);
+		event_sched_out(event, ctx);
 	}
-	event_sched_out(group_event, cpuctx, ctx);
+	event_sched_out(group_event, ctx);
 
 error:
 	pmu->cancel_txn(pmu);
@@ -2630,10 +2625,11 @@ group_sched_in(struct perf_event *group_event,
 /*
  * Work out whether we can put this event group on the CPU now.
  */
-static int group_can_go_on(struct perf_event *event,
-			   struct perf_cpu_context *cpuctx,
-			   int can_add_hw)
+static int group_can_go_on(struct perf_event *event, int can_add_hw)
 {
+	struct perf_event_pmu_context *epc = event->pmu_ctx;
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(epc->pmu->cpu_pmu_context);
+
 	/*
 	 * Groups consisting entirely of software events can always go on.
 	 */
@@ -2643,7 +2639,7 @@ static int group_can_go_on(struct perf_event *event,
 	 * If an exclusive group is already on, no other hardware
 	 * events can go on.
 	 */
-	if (cpuctx->exclusive)
+	if (cpc->exclusive)
 		return 0;
 	/*
 	 * If this group is exclusive and there are already
@@ -2665,38 +2661,30 @@ static void add_event_to_ctx(struct perf_event *event,
 	perf_group_attach(event);
 }
 
-static void ctx_sched_out(struct perf_event_context *ctx,
-			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type);
-static void
-ctx_sched_in(struct perf_event_context *ctx,
-	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type,
-	     struct task_struct *task);
-
-static void task_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			       struct perf_event_context *ctx,
-			       enum event_type_t event_type)
+static void task_ctx_sched_out(struct perf_event_context *ctx,
+				enum event_type_t event_type)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+
 	if (!cpuctx->task_ctx)
 		return;
 
 	if (WARN_ON_ONCE(ctx != cpuctx->task_ctx))
 		return;
 
-	ctx_sched_out(ctx, cpuctx, event_type);
+	ctx_sched_out(ctx, event_type);
 }
 
 static void perf_event_sched_in(struct perf_cpu_context *cpuctx,
 				struct perf_event_context *ctx,
 				struct task_struct *task)
 {
-	cpu_ctx_sched_in(cpuctx, EVENT_PINNED, task);
+	ctx_sched_in(&cpuctx->ctx, EVENT_PINNED, task);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_PINNED, task);
-	cpu_ctx_sched_in(cpuctx, EVENT_FLEXIBLE, task);
+		 ctx_sched_in(ctx, EVENT_PINNED, task);
+	ctx_sched_in(&cpuctx->ctx, EVENT_FLEXIBLE, task);
 	if (ctx)
-		ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE, task);
+		 ctx_sched_in(ctx, EVENT_FLEXIBLE, task);
 }
 
 /*
@@ -2718,7 +2706,6 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 			struct perf_event_context *task_ctx,
 			enum event_type_t event_type)
 {
-	enum event_type_t ctx_event_type;
 	bool cpu_event = !!(event_type & EVENT_CPU);
 
 	/*
@@ -2728,11 +2715,13 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	if (event_type & EVENT_PINNED)
 		event_type |= EVENT_FLEXIBLE;
 
-	ctx_event_type = event_type & EVENT_ALL;
+	event_type &= EVENT_ALL;
 
-	perf_pmu_disable(cpuctx->ctx.pmu);
-	if (task_ctx)
-		task_ctx_sched_out(cpuctx, task_ctx, event_type);
+	perf_ctx_disable(&cpuctx->ctx);
+	if (task_ctx) {
+		perf_ctx_disable(task_ctx);
+		task_ctx_sched_out(task_ctx, event_type);
+	}
 
 	/*
 	 * Decide which cpu ctx groups to schedule out based on the types
@@ -2742,17 +2731,20 @@ static void ctx_resched(struct perf_cpu_context *cpuctx,
 	 *  - otherwise, do nothing more.
 	 */
 	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, ctx_event_type);
-	else if (ctx_event_type & EVENT_PINNED)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+		ctx_sched_out(&cpuctx->ctx, event_type);
+	else if (event_type & EVENT_PINNED)
+		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
 
 	perf_event_sched_in(cpuctx, task_ctx, current);
-	perf_pmu_enable(cpuctx->ctx.pmu);
+
+	perf_ctx_enable(&cpuctx->ctx);
+	if (task_ctx)
+		perf_ctx_enable(task_ctx);
 }
 
 void perf_pmu_resched(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 
 	perf_ctx_lock(cpuctx, task_ctx);
@@ -2770,7 +2762,7 @@ static int  __perf_install_in_context(void *info)
 {
 	struct perf_event *event = info;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
 	bool reprogram = true;
 	int ret = 0;
@@ -2812,7 +2804,7 @@ static int  __perf_install_in_context(void *info)
 #endif
 
 	if (reprogram) {
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx, get_event_type(event));
 	} else {
@@ -2957,7 +2949,7 @@ static void __perf_event_enable(struct perf_event *event,
 		return;
 
 	if (ctx->is_active)
-		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+		ctx_sched_out(ctx, EVENT_TIME);
 
 	perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
 	perf_cgroup_event_enable(event, ctx);
@@ -2966,7 +2958,7 @@ static void __perf_event_enable(struct perf_event *event,
 		return;
 
 	if (!event_filter_match(event)) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, EVENT_TIME, current);
 		return;
 	}
 
@@ -2975,7 +2967,7 @@ static void __perf_event_enable(struct perf_event *event,
 	 * then don't put it on unless the group is on.
 	 */
 	if (leader != event && leader->state != PERF_EVENT_STATE_ACTIVE) {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, EVENT_TIME, current);
 		return;
 	}
 
@@ -3228,11 +3220,52 @@ static int perf_event_modify_attr(struct perf_event *event,
 	return err;
 }
 
-static void ctx_sched_out(struct perf_event_context *ctx,
-			  struct perf_cpu_context *cpuctx,
-			  enum event_type_t event_type)
+static void __pmu_ctx_sched_out(struct perf_event_pmu_context *pmu_ctx,
+				enum event_type_t event_type)
 {
+	struct perf_event_context *ctx = pmu_ctx->ctx;
 	struct perf_event *event, *tmp;
+	struct pmu *pmu = pmu_ctx->pmu;
+
+	if (ctx->task && !ctx->is_active) {
+		struct perf_cpu_pmu_context *cpc;
+
+		cpc = this_cpu_ptr(pmu->cpu_pmu_context);
+		WARN_ON_ONCE(cpc->task_epc != pmu_ctx);
+		cpc->task_epc = NULL;
+	}
+
+	if (!event_type)
+		return;
+
+	perf_pmu_disable(pmu);
+	if (event_type & EVENT_PINNED) {
+		list_for_each_entry_safe(event, tmp,
+				&pmu_ctx->pinned_active,
+				active_list)
+			group_sched_out(event, ctx);
+	}
+
+	if (event_type & EVENT_FLEXIBLE) {
+		list_for_each_entry_safe(event, tmp,
+				&pmu_ctx->flexible_active,
+				active_list)
+			group_sched_out(event, ctx);
+		/*
+		 * Since we cleared EVENT_FLEXIBLE, also clear
+		 * rotate_necessary, is will be reset by
+		 * ctx_flexible_sched_in() when needed.
+		 */
+		pmu_ctx->rotate_necessary = 0;
+	}
+	perf_pmu_enable(pmu);
+}
+
+static void
+ctx_sched_out(struct perf_event_context *ctx, enum event_type_t event_type)
+{
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_pmu_context *pmu_ctx;
 	int is_active = ctx->is_active;
 
 	lockdep_assert_held(&ctx->lock);
@@ -3278,24 +3311,8 @@ static void ctx_sched_out(struct perf_event_context *ctx,
 	if (!ctx->nr_active || !(is_active & EVENT_ALL))
 		return;
 
-	perf_pmu_disable(ctx->pmu);
-	if (is_active & EVENT_PINNED) {
-		list_for_each_entry_safe(event, tmp, &ctx->pinned_active, active_list)
-			group_sched_out(event, cpuctx, ctx);
-	}
-
-	if (is_active & EVENT_FLEXIBLE) {
-		list_for_each_entry_safe(event, tmp, &ctx->flexible_active, active_list)
-			group_sched_out(event, cpuctx, ctx);
-
-		/*
-		 * Since we cleared EVENT_FLEXIBLE, also clear
-		 * rotate_necessary, is will be reset by
-		 * ctx_flexible_sched_in() when needed.
-		 */
-		ctx->rotate_necessary = 0;
-	}
-	perf_pmu_enable(ctx->pmu);
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry)
+		__pmu_ctx_sched_out(pmu_ctx, is_active);
 }
 
 /*
@@ -3400,26 +3417,65 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
 	}
 }
 
-static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
-					 struct task_struct *next)
+static void perf_event_swap_task_ctx_data(struct perf_event_context *prev_ctx,
+					  struct perf_event_context *next_ctx)
+{
+	struct perf_event_pmu_context *prev_epc, *next_epc;
+
+	if (!prev_ctx->nr_task_data)
+		return;
+
+	prev_epc = list_first_entry(&prev_ctx->pmu_ctx_list,
+				    struct perf_event_pmu_context,
+				    pmu_ctx_entry);
+	next_epc = list_first_entry(&next_ctx->pmu_ctx_list,
+				    struct perf_event_pmu_context,
+				    pmu_ctx_entry);
+
+	while (&prev_epc->pmu_ctx_entry != &prev_ctx->pmu_ctx_list &&
+	       &next_epc->pmu_ctx_entry != &next_ctx->pmu_ctx_list) {
+
+		WARN_ON_ONCE(prev_epc->pmu != next_epc->pmu);
+
+		/*
+		 * PMU specific parts of task perf context can require
+		 * additional synchronization. As an example of such
+		 * synchronization see implementation details of Intel
+		 * LBR call stack data profiling;
+		 */
+		if (prev_epc->pmu->swap_task_ctx)
+			prev_epc->pmu->swap_task_ctx(prev_epc, next_epc);
+		else
+			swap(prev_epc->task_ctx_data, next_epc->task_ctx_data);
+	}
+}
+
+static void perf_ctx_sched_task_cb(struct perf_event_context *ctx, bool sched_in)
+{
+	struct perf_event_pmu_context *pmu_ctx;
+	struct perf_cpu_pmu_context *cpc;
+
+	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+
+		if (cpc->sched_cb_usage && pmu_ctx->pmu->sched_task)
+			pmu_ctx->pmu->sched_task(pmu_ctx, sched_in);
+	}
+}
+
+static void
+perf_event_context_sched_out(struct task_struct *task, struct task_struct *next)
 {
-	struct perf_event_context *ctx = task->perf_event_ctxp[ctxn];
+	struct perf_event_context *ctx = task->perf_event_ctxp;
 	struct perf_event_context *next_ctx;
 	struct perf_event_context *parent, *next_parent;
-	struct perf_cpu_context *cpuctx;
 	int do_switch = 1;
-	struct pmu *pmu;
 
 	if (likely(!ctx))
 		return;
 
-	pmu = ctx->pmu;
-	cpuctx = __get_cpu_context(ctx);
-	if (!cpuctx->task_ctx)
-		return;
-
 	rcu_read_lock();
-	next_ctx = next->perf_event_ctxp[ctxn];
+	next_ctx = rcu_dereference(next->perf_event_ctxp);
 	if (!next_ctx)
 		goto unlock;
 
@@ -3447,23 +3503,12 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			WRITE_ONCE(ctx->task, next);
 			WRITE_ONCE(next_ctx->task, task);
 
-			perf_pmu_disable(pmu);
-
-			if (cpuctx->sched_cb_usage && pmu->sched_task)
-				pmu->sched_task(ctx, false);
+			perf_ctx_disable(ctx);
 
-			/*
-			 * PMU specific parts of task perf context can require
-			 * additional synchronization. As an example of such
-			 * synchronization see implementation details of Intel
-			 * LBR call stack data profiling;
-			 */
-			if (pmu->swap_task_ctx)
-				pmu->swap_task_ctx(ctx, next_ctx);
-			else
-				swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
+			perf_ctx_sched_task_cb(ctx, false);
+			perf_event_swap_task_ctx_data(ctx, next_ctx);
 
-			perf_pmu_enable(pmu);
+			perf_ctx_enable(ctx);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
@@ -3472,8 +3517,8 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			 * since those values are always verified under
 			 * ctx->lock which we're now holding.
 			 */
-			RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], next_ctx);
-			RCU_INIT_POINTER(next->perf_event_ctxp[ctxn], ctx);
+			RCU_INIT_POINTER(task->perf_event_ctxp, next_ctx);
+			RCU_INIT_POINTER(next->perf_event_ctxp, ctx);
 
 			do_switch = 0;
 
@@ -3487,37 +3532,39 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 
 	if (do_switch) {
 		raw_spin_lock(&ctx->lock);
-		perf_pmu_disable(pmu);
+		perf_ctx_disable(ctx);
 
-		if (cpuctx->sched_cb_usage && pmu->sched_task)
-			pmu->sched_task(ctx, false);
-		task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
+		perf_ctx_sched_task_cb(ctx, false);
+		task_ctx_sched_out(ctx, EVENT_ALL);
 
-		perf_pmu_enable(pmu);
+		perf_ctx_enable(ctx);
 		raw_spin_unlock(&ctx->lock);
 	}
 }
 
 static DEFINE_PER_CPU(struct list_head, sched_cb_list);
+static DEFINE_PER_CPU(int, perf_sched_cb_usages);
 
 void perf_sched_cb_dec(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
 
 	this_cpu_dec(perf_sched_cb_usages);
+	barrier();
 
-	if (!--cpuctx->sched_cb_usage)
-		list_del(&cpuctx->sched_cb_entry);
+	if (!--cpc->sched_cb_usage)
+		list_del(&cpc->sched_cb_entry);
 }
 
 
 void perf_sched_cb_inc(struct pmu *pmu)
 {
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_pmu_context *cpc = this_cpu_ptr(pmu->cpu_pmu_context);
 
-	if (!cpuctx->sched_cb_usage++)
-		list_add(&cpuctx->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
+	if (!cpc->sched_cb_usage++)
+		list_add(&cpc->sched_cb_entry, this_cpu_ptr(&sched_cb_list));
 
+	barrier();
 	this_cpu_inc(perf_sched_cb_usages);
 }
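
Not part of the patch: a minimal user-space sketch of the perf_sched_cb_{inc,dec}() pattern, with illustrative names. The first increment registers an entry on a list and the last decrement removes it, while a separate global counter gives the scheduler a cheap "anything registered?" fast-path check.

#include <stdio.h>

#define MAX_ENTRIES 8

struct entry {
	int usage;
	int on_list;
};

static struct entry *cb_list[MAX_ENTRIES];
static int cb_list_nr;
static int global_usage;	/* fast-path check, like perf_sched_cb_usages */

static void cb_inc(struct entry *e)
{
	if (!e->usage++) {		/* 0 -> 1: register */
		cb_list[cb_list_nr++] = e;
		e->on_list = 1;
	}
	global_usage++;
}

static void cb_dec(struct entry *e)
{
	global_usage--;
	if (!--e->usage) {		/* 1 -> 0: unregister */
		int i;

		for (i = 0; i < cb_list_nr; i++) {
			if (cb_list[i] == e) {
				cb_list[i] = cb_list[--cb_list_nr];
				break;
			}
		}
		e->on_list = 0;
	}
}

int main(void)
{
	struct entry a = { 0, 0 };

	cb_inc(&a);
	cb_inc(&a);
	printf("usage=%d on_list=%d global=%d\n", a.usage, a.on_list, global_usage);
	cb_dec(&a);
	cb_dec(&a);
	printf("usage=%d on_list=%d global=%d\n", a.usage, a.on_list, global_usage);
	return 0;
}
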
 
@@ -3529,19 +3576,21 @@ void perf_sched_cb_inc(struct pmu *pmu)
  * PEBS requires this to provide PID/TID information. This requires we flush
  * all queued PEBS records before we context switch to a new task.
  */
-static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc, bool sched_in)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct pmu *pmu;
 
-	pmu = cpuctx->ctx.pmu; /* software PMUs will not have sched_task */
+	pmu = cpc->epc.pmu;
 
+	/* software PMUs will not have sched_task */
 	if (WARN_ON_ONCE(!pmu->sched_task))
 		return;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(pmu);
 
-	pmu->sched_task(cpuctx->task_ctx, sched_in);
+	pmu->sched_task(cpc->task_epc, sched_in);
 
 	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3551,26 +3600,20 @@ static void perf_pmu_sched_task(struct task_struct *prev,
 				struct task_struct *next,
 				bool sched_in)
 {
-	struct perf_cpu_context *cpuctx;
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_cpu_pmu_context *cpc;
 
-	if (prev == next)
+	/* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
+	if (prev == next || cpuctx->task_ctx)
 		return;
 
-	list_for_each_entry(cpuctx, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
-		/* will be handled in perf_event_context_sched_in/out */
-		if (cpuctx->task_ctx)
-			continue;
-
-		__perf_pmu_sched_task(cpuctx, sched_in);
-	}
+	list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+		__perf_pmu_sched_task(cpc, sched_in);
 }
 
 static void perf_event_switch(struct task_struct *task,
 			      struct task_struct *next_prev, bool sched_in);
 
-#define for_each_task_context_nr(ctxn)					\
-	for ((ctxn) = 0; (ctxn) < perf_nr_task_contexts; (ctxn)++)
-
 /*
  * Called from scheduler to remove the events of the current task,
  * with interrupts disabled.
@@ -3585,16 +3628,13 @@ static void perf_event_switch(struct task_struct *task,
 void __perf_event_task_sched_out(struct task_struct *task,
 				 struct task_struct *next)
 {
-	int ctxn;
-
 	if (__this_cpu_read(perf_sched_cb_usages))
 		perf_pmu_sched_task(task, next, false);
 
 	if (atomic_read(&nr_switch_events))
 		perf_event_switch(task, next, false);
 
-	for_each_task_context_nr(ctxn)
-		perf_event_context_sched_out(task, ctxn, next);
+	perf_event_context_sched_out(task, next);
 
 	/*
 	 * if cgroup events exist on this CPU, then we need
@@ -3605,15 +3645,6 @@ void __perf_event_task_sched_out(struct task_struct *task,
 		perf_cgroup_sched_out(task, next);
 }
 
-/*
- * Called with IRQs disabled
- */
-static void cpu_ctx_sched_out(struct perf_cpu_context *cpuctx,
-			      enum event_type_t event_type)
-{
-	ctx_sched_out(&cpuctx->ctx, cpuctx, event_type);
-}
-
 static bool perf_less_group_idx(const void *l, const void *r)
 {
 	const struct perf_event *le = *(const struct perf_event **)l;
@@ -3645,21 +3676,36 @@ static void __heap_add(struct min_heap *heap, struct perf_event *event)
 	}
 }
 
-static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
+static void __link_epc(struct perf_event_pmu_context *pmu_ctx)
+{
+	struct perf_cpu_pmu_context *cpc;
+
+	if (!pmu_ctx->ctx->task)
+		return;
+
+	cpc = this_cpu_ptr(pmu_ctx->pmu->cpu_pmu_context);
+	WARN_ON_ONCE(cpc->task_epc && cpc->task_epc != pmu_ctx);
+	cpc->task_epc = pmu_ctx;
+}
+
+static noinline int visit_groups_merge(struct perf_event_context *ctx,
 				struct perf_event_groups *groups, int cpu,
+				struct pmu *pmu,
 				int (*func)(struct perf_event *, void *),
 				void *data)
 {
 #ifdef CONFIG_CGROUP_PERF
 	struct cgroup_subsys_state *css = NULL;
 #endif
+	struct perf_cpu_context *cpuctx = NULL;
 	/* Space for per CPU and/or any CPU event iterators. */
 	struct perf_event *itrs[2];
 	struct min_heap event_heap;
 	struct perf_event **evt;
 	int ret;
 
-	if (cpuctx) {
+	if (!ctx->task) {
+		cpuctx = this_cpu_ptr(&cpu_context);
 		event_heap = (struct min_heap){
 			.data = cpuctx->heap,
 			.nr = 0,
@@ -3679,17 +3725,28 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 			.size = ARRAY_SIZE(itrs),
 		};
 		/* Events not within a CPU context may be on any CPU. */
-		__heap_add(&event_heap, perf_event_groups_first(groups, -1, NULL));
+		__heap_add(&event_heap, perf_event_groups_first(groups, -1, pmu, NULL));
 	}
 	evt = event_heap.data;
 
-	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, NULL));
+	__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, NULL));
 
 #ifdef CONFIG_CGROUP_PERF
 	for (; css; css = css->parent)
-		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, css->cgroup));
+		__heap_add(&event_heap, perf_event_groups_first(groups, cpu, pmu, css->cgroup));
 #endif
 
+	if (event_heap.nr) {
+		/*
+		 * XXX: For now, visit_groups_merge() gets called with a pmu
+		 * pointer that is never NULL. But it needs to be called once
+		 * for each pmu if the pmu=NULL optimization is implemented.
+		 */
+		__link_epc((*evt)->pmu_ctx);
+		perf_assert_pmu_disabled((*evt)->pmu_ctx->pmu);
+	}
+
+
 	min_heapify_all(&event_heap, &perf_min_heap);
 
 	while (event_heap.nr) {
@@ -3697,7 +3754,7 @@ static noinline int visit_groups_merge(struct perf_cpu_context *cpuctx,
 		if (ret)
 			return ret;
 
-		*evt = perf_event_groups_next(*evt);
+		*evt = perf_event_groups_next(*evt, pmu);
 		if (*evt)
 			min_heapify(&event_heap, 0, &perf_min_heap);
 		else
@@ -3733,7 +3790,6 @@ static inline void group_update_userpage(struct perf_event *group_event)
 static int merge_sched_in(struct perf_event *event, void *data)
 {
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	int *can_add_hw = data;
 
 	if (event->state <= PERF_EVENT_STATE_OFF)
@@ -3742,8 +3798,8 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	if (!event_filter_match(event))
 		return 0;
 
-	if (group_can_go_on(event, cpuctx, *can_add_hw)) {
-		if (!group_sched_in(event, cpuctx, ctx))
+	if (group_can_go_on(event, *can_add_hw)) {
+		if (!group_sched_in(event, ctx))
 			list_add_tail(&event->active_list, get_event_list(event));
 	}
 
@@ -3753,8 +3809,11 @@ static int merge_sched_in(struct perf_event *event, void *data)
 			perf_cgroup_event_disable(event, ctx);
 			perf_event_set_state(event, PERF_EVENT_STATE_ERROR);
 		} else {
-			ctx->rotate_necessary = 1;
-			perf_mux_hrtimer_restart(cpuctx);
+			struct perf_cpu_pmu_context *cpc;
+
+			event->pmu_ctx->rotate_necessary = 1;
+			cpc = this_cpu_ptr(event->pmu_ctx->pmu->cpu_pmu_context);
+			perf_mux_hrtimer_restart(cpc);
 			group_update_userpage(event);
 		}
 	}
@@ -3762,40 +3821,68 @@ static int merge_sched_in(struct perf_event *event, void *data)
 	return 0;
 }
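
Not part of the patch: a minimal user-space sketch of the heap-based k-way merge that visit_groups_merge() performs, with plain arrays standing in for the rb-tree iterators (any-CPU, this-CPU, per-cgroup) and a visitor callback standing in for merge_sched_in(). All names are illustrative.

#include <stdio.h>

struct iter {
	const int *vals;
	int len, pos;
};

static int iter_head(const struct iter *it)
{
	return it->vals[it->pos];
}

static void sift_down(struct iter **heap, int nr, int i)
{
	struct iter *tmp;

	for (;;) {
		int l = 2 * i + 1, r = l + 1, min = i;

		if (l < nr && iter_head(heap[l]) < iter_head(heap[min]))
			min = l;
		if (r < nr && iter_head(heap[r]) < iter_head(heap[min]))
			min = r;
		if (min == i)
			return;
		tmp = heap[i];
		heap[i] = heap[min];
		heap[min] = tmp;
		i = min;
	}
}

static void visit_merge(struct iter **heap, int nr, int (*func)(int))
{
	int i;

	for (i = nr / 2 - 1; i >= 0; i--)	/* like min_heapify_all() */
		sift_down(heap, nr, i);

	while (nr) {
		if (func(iter_head(heap[0])))		/* visitor callback */
			return;
		if (++heap[0]->pos >= heap[0]->len)	/* iterator exhausted */
			heap[0] = heap[--nr];
		if (nr)
			sift_down(heap, nr, 0);		/* restore heap order */
	}
}

static int print_event(int v)
{
	printf("%d ", v);
	return 0;
}

int main(void)
{
	int a[] = { 1, 4, 9 }, b[] = { 2, 3, 8 }, c[] = { 5, 6, 7 };
	struct iter ia = { a, 3, 0 }, ib = { b, 3, 0 }, ic = { c, 3, 0 };
	struct iter *heap[] = { &ia, &ib, &ic };

	visit_merge(heap, 3, print_event);	/* prints 1..9 in order */
	printf("\n");
	return 0;
}
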
 
-static void
-ctx_pinned_sched_in(struct perf_event_context *ctx,
-		    struct perf_cpu_context *cpuctx)
+static void ctx_pinned_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	int can_add_hw = 1;
 
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
-
-	visit_groups_merge(cpuctx, &ctx->pinned_groups,
-			   smp_processor_id(),
-			   merge_sched_in, &can_add_hw);
+	if (pmu) {
+		visit_groups_merge(ctx, &ctx->pinned_groups,
+				   smp_processor_id(), pmu,
+				   merge_sched_in, &can_add_hw);
+	} else {
+		/*
+		 * XXX: This can be optimized for per-task context by calling
+		 * visit_groups_merge() only once with:
+		 *   1) pmu=NULL
+		 *   2) Ignoring pmu in perf_event_groups_cmp() when it's NULL
+		 *   3) Making can_add_hw a per-pmu variable
+		 *
+		 * However, it cannot be optimized for per-cpu context because
+		 * the per-cpu rb-tree consists of pmu-subtrees, and the
+		 * pmu-subtrees consist of cgroup-subtrees; i.e. cgroup events
+		 * of the same cgroup but different pmus are separated out into
+		 * their respective pmu-subtrees.
+		 */
+		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+			can_add_hw = 1;
+			visit_groups_merge(ctx, &ctx->pinned_groups,
+					   smp_processor_id(), pmu_ctx->pmu,
+					   merge_sched_in, &can_add_hw);
+		}
+	}
 }
 
-static void
-ctx_flexible_sched_in(struct perf_event_context *ctx,
-		      struct perf_cpu_context *cpuctx)
+/* XXX .busy thingy from Peter's patch */
+static void ctx_flexible_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	int can_add_hw = 1;
 
-	if (ctx != &cpuctx->ctx)
-		cpuctx = NULL;
+	if (pmu) {
+		visit_groups_merge(ctx, &ctx->flexible_groups,
+				   smp_processor_id(), pmu,
+				   merge_sched_in, &can_add_hw);
+	} else {
+		list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+			can_add_hw = 1;
+			visit_groups_merge(ctx, &ctx->flexible_groups,
+					   smp_processor_id(), pmu_ctx->pmu,
+					   merge_sched_in, &can_add_hw);
+		}
+	}
+}
 
-	visit_groups_merge(cpuctx, &ctx->flexible_groups,
-			   smp_processor_id(),
-			   merge_sched_in, &can_add_hw);
+static void __pmu_ctx_sched_in(struct perf_event_context *ctx, struct pmu *pmu)
+{
+	ctx_flexible_sched_in(ctx, pmu);
 }
 
 static void
-ctx_sched_in(struct perf_event_context *ctx,
-	     struct perf_cpu_context *cpuctx,
-	     enum event_type_t event_type,
+ctx_sched_in(struct perf_event_context *ctx, enum event_type_t event_type,
 	     struct task_struct *task)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	int is_active = ctx->is_active;
 	u64 now;
 
@@ -3818,6 +3905,7 @@ ctx_sched_in(struct perf_event_context *ctx,
 		/* start ctx time */
 		now = perf_clock();
 		ctx->timestamp = now;
+		// XXX ctx->task =? task
 		perf_cgroup_set_timestamp(task, ctx);
 	}
 
@@ -3826,40 +3914,32 @@ ctx_sched_in(struct perf_event_context *ctx,
 	 * in order to give them the best chance of going on.
 	 */
 	if (is_active & EVENT_PINNED)
-		ctx_pinned_sched_in(ctx, cpuctx);
+		ctx_pinned_sched_in(ctx, NULL);
 
 	/* Then walk through the lower prio flexible groups */
 	if (is_active & EVENT_FLEXIBLE)
-		ctx_flexible_sched_in(ctx, cpuctx);
+		ctx_flexible_sched_in(ctx, NULL);
 }
 
-static void cpu_ctx_sched_in(struct perf_cpu_context *cpuctx,
-			     enum event_type_t event_type,
-			     struct task_struct *task)
+static void perf_event_context_sched_in(struct task_struct *task)
 {
-	struct perf_event_context *ctx = &cpuctx->ctx;
-
-	ctx_sched_in(ctx, cpuctx, event_type, task);
-}
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_context *ctx;
 
-static void perf_event_context_sched_in(struct perf_event_context *ctx,
-					struct task_struct *task)
-{
-	struct perf_cpu_context *cpuctx;
-	struct pmu *pmu;
+	rcu_read_lock();
+	ctx = rcu_dereference(task->perf_event_ctxp);
+	if (!ctx)
+		goto rcu_unlock;
 
-	cpuctx = __get_cpu_context(ctx);
+	if (cpuctx->task_ctx == ctx) {
+		perf_ctx_lock(cpuctx, ctx);
+		perf_ctx_disable(ctx);
 
-	/*
-	 * HACK: for HETEROGENEOUS the task context might have switched to a
-	 * different PMU, force (re)set the context,
-	 */
-	pmu = ctx->pmu = cpuctx->ctx.pmu;
+		perf_ctx_sched_task_cb(ctx, true);
 
-	if (cpuctx->task_ctx == ctx) {
-		if (cpuctx->sched_cb_usage)
-			__perf_pmu_sched_task(cpuctx, true);
-		return;
+		perf_ctx_enable(ctx);
+		perf_ctx_unlock(cpuctx, ctx);
+		goto rcu_unlock;
 	}
 
 	perf_ctx_lock(cpuctx, ctx);
@@ -3870,7 +3950,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	if (!ctx->nr_events)
 		goto unlock;
 
-	perf_pmu_disable(pmu);
+	perf_ctx_disable(ctx);
 	/*
 	 * We want to keep the following priority order:
 	 * cpu pinned (that don't need to move), task pinned,
@@ -3879,17 +3959,24 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	 * However, if task's ctx is not carrying any pinned
 	 * events, no need to flip the cpuctx's events around.
 	 */
-	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree)) {
+		perf_ctx_disable(&cpuctx->ctx);
+		ctx_sched_out(&cpuctx->ctx, EVENT_FLEXIBLE);
+	}
+
 	perf_event_sched_in(cpuctx, ctx, task);
 
-	if (cpuctx->sched_cb_usage && pmu->sched_task)
-		pmu->sched_task(cpuctx->task_ctx, true);
+	perf_ctx_sched_task_cb(cpuctx->task_ctx, true);
 
-	perf_pmu_enable(pmu);
+	if (!RB_EMPTY_ROOT(&ctx->pinned_groups.tree))
+		perf_ctx_enable(&cpuctx->ctx);
+
+	perf_ctx_enable(ctx);
 
 unlock:
 	perf_ctx_unlock(cpuctx, ctx);
+rcu_unlock:
+	rcu_read_unlock();
 }
 
 /*
@@ -3906,9 +3993,6 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 void __perf_event_task_sched_in(struct task_struct *prev,
 				struct task_struct *task)
 {
-	struct perf_event_context *ctx;
-	int ctxn;
-
 	/*
 	 * If cgroup events exist on this CPU, then we need to check if we have
 	 * to switch in PMU state; cgroup event are system-wide mode only.
@@ -3919,13 +4003,7 @@ void __perf_event_task_sched_in(struct task_struct *prev,
 	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
 		perf_cgroup_sched_in(prev, task);
 
-	for_each_task_context_nr(ctxn) {
-		ctx = task->perf_event_ctxp[ctxn];
-		if (likely(!ctx))
-			continue;
-
-		perf_event_context_sched_in(ctx, task);
-	}
+	perf_event_context_sched_in(task);
 
 	if (atomic_read(&nr_switch_events))
 		perf_event_switch(task, prev, true);
@@ -4044,8 +4122,8 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
  * events. At the same time, make sure, having freq events does not change
  * the rate of unthrottling as that would introduce bias.
  */
-static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
-					   int needs_unthr)
+static void
+perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
 {
 	struct perf_event *event;
 	struct hw_perf_event *hwc;
@@ -4057,16 +4135,16 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 	 * - context have events in frequency mode (needs freq adjust)
 	 * - there are events to unthrottle on this cpu
 	 */
-	if (!(ctx->nr_freq || needs_unthr))
+	if (!(ctx->nr_freq || unthrottle))
 		return;
 
 	raw_spin_lock(&ctx->lock);
-	perf_pmu_disable(ctx->pmu);
 
 	list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 		if (event->state != PERF_EVENT_STATE_ACTIVE)
 			continue;
 
+		// XXX use visit thingy to avoid the -1,cpu match
 		if (!event_filter_match(event))
 			continue;
 
@@ -4107,7 +4185,6 @@ static void perf_adjust_freq_unthr_context(struct perf_event_context *ctx,
 		perf_pmu_enable(event->pmu);
 	}
 
-	perf_pmu_enable(ctx->pmu);
 	raw_spin_unlock(&ctx->lock);
 }
 
@@ -4129,72 +4206,111 @@ static void rotate_ctx(struct perf_event_context *ctx, struct perf_event *event)
 
 /* pick an event from the flexible_groups to rotate */
 static inline struct perf_event *
-ctx_event_to_rotate(struct perf_event_context *ctx)
+ctx_event_to_rotate(struct perf_event_pmu_context *pmu_ctx)
 {
 	struct perf_event *event;
+	struct rb_node *node;
+	struct rb_root *tree;
+	struct __group_key key = {
+		.pmu = pmu_ctx->pmu,
+	};
 
 	/* pick the first active flexible event */
-	event = list_first_entry_or_null(&ctx->flexible_active,
+	event = list_first_entry_or_null(&pmu_ctx->flexible_active,
 					 struct perf_event, active_list);
+	if (event)
+		goto out;
 
 	/* if no active flexible event, pick the first event */
-	if (!event) {
-		event = rb_entry_safe(rb_first(&ctx->flexible_groups.tree),
-				      typeof(*event), group_node);
+	tree = &pmu_ctx->ctx->flexible_groups.tree;
+
+	if (!pmu_ctx->ctx->task) {
+		key.cpu = smp_processor_id();
+
+		node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+		if (node)
+			event = __node_2_pe(node);
+		goto out;
+	}
+
+	key.cpu = -1;
+	node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+	if (node) {
+		event = __node_2_pe(node);
+		goto out;
 	}
 
+	key.cpu = smp_processor_id();
+	node = rb_find_first(&key, tree, __group_cmp_ignore_cgroup);
+	if (node)
+		event = __node_2_pe(node);
+
+out:
 	/*
 	 * Unconditionally clear rotate_necessary; if ctx_flexible_sched_in()
 	 * finds there are unschedulable events, it will set it again.
 	 */
-	ctx->rotate_necessary = 0;
+	pmu_ctx->rotate_necessary = 0;
 
 	return event;
 }
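
Not part of the patch: a minimal user-space sketch of the selection order in ctx_event_to_rotate() above, with a sorted array standing in for the flexible_groups rb-tree and illustrative names: prefer the first active flexible event, then the any-CPU (-1) subtree, then the current-CPU subtree.

#include <stdio.h>

struct ev { int cpu; const char *name; };

static struct ev *first_for_cpu(struct ev *tree, int nr, int cpu)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (tree[i].cpu == cpu)
			return &tree[i];
	}
	return NULL;
}

static struct ev *pick_rotation_event(struct ev *active, struct ev *tree,
				      int nr, int this_cpu)
{
	struct ev *ev;

	if (active)				/* first active flexible event */
		return active;

	ev = first_for_cpu(tree, nr, -1);	/* any-CPU events first */
	if (ev)
		return ev;

	return first_for_cpu(tree, nr, this_cpu);
}

int main(void)
{
	/* sorted by cpu, as the flexible_groups tree would be */
	struct ev tree[] = { { -1, "any-cpu" }, { 0, "cpu0" }, { 1, "cpu1" } };
	struct ev *ev = pick_rotation_event(NULL, tree, 3, 1);

	printf("rotate: %s\n", ev ? ev->name : "none");
	return 0;
}
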
 
-static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
+static bool perf_rotate_context(struct perf_cpu_pmu_context *cpc)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_pmu_context *cpu_epc, *task_epc = NULL;
 	struct perf_event *cpu_event = NULL, *task_event = NULL;
 	struct perf_event_context *task_ctx = NULL;
 	int cpu_rotate, task_rotate;
+	struct pmu *pmu;
 
 	/*
 	 * Since we run this from IRQ context, nobody can install new
 	 * events, thus the event count values are stable.
 	 */
 
-	cpu_rotate = cpuctx->ctx.rotate_necessary;
+	cpu_epc = &cpc->epc;
+	pmu = cpu_epc->pmu;
+	task_epc = cpc->task_epc;
+
+	cpu_rotate = cpu_epc->rotate_necessary;
 	task_ctx = cpuctx->task_ctx;
-	task_rotate = task_ctx ? task_ctx->rotate_necessary : 0;
+	task_rotate = task_epc ? task_epc->rotate_necessary : 0;
 
 	if (!(cpu_rotate || task_rotate))
 		return false;
 
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
-	perf_pmu_disable(cpuctx->ctx.pmu);
+	perf_pmu_disable(pmu);
 
 	if (task_rotate)
-		task_event = ctx_event_to_rotate(task_ctx);
+		task_event = ctx_event_to_rotate(task_epc);
 	if (cpu_rotate)
-		cpu_event = ctx_event_to_rotate(&cpuctx->ctx);
+		cpu_event = ctx_event_to_rotate(cpu_epc);
 
 	/*
 	 * As per the order given at ctx_resched() first 'pop' task flexible
 	 * and then, if needed CPU flexible.
 	 */
-	if (task_event || (task_ctx && cpu_event))
-		ctx_sched_out(task_ctx, cpuctx, EVENT_FLEXIBLE);
-	if (cpu_event)
-		cpu_ctx_sched_out(cpuctx, EVENT_FLEXIBLE);
+	if (task_event || (task_epc && cpu_event)) {
+		update_context_time(task_epc->ctx);
+		__pmu_ctx_sched_out(task_epc, EVENT_FLEXIBLE);
+	}
 
-	if (task_event)
-		rotate_ctx(task_ctx, task_event);
-	if (cpu_event)
+	if (cpu_event) {
+		update_context_time(&cpuctx->ctx);
+		__pmu_ctx_sched_out(cpu_epc, EVENT_FLEXIBLE);
 		rotate_ctx(&cpuctx->ctx, cpu_event);
+		__pmu_ctx_sched_in(&cpuctx->ctx, pmu);
+	}
 
-	perf_event_sched_in(cpuctx, task_ctx, current);
+	if (task_event)
+		rotate_ctx(task_epc->ctx, task_event);
 
-	perf_pmu_enable(cpuctx->ctx.pmu);
+	if (task_event || (task_epc && cpu_event))
+		__pmu_ctx_sched_in(task_epc->ctx, pmu);
+
+	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
 
 	return true;
@@ -4202,8 +4318,8 @@ static bool perf_rotate_context(struct perf_cpu_context *cpuctx)
 
 void perf_event_task_tick(void)
 {
-	struct list_head *head = this_cpu_ptr(&active_ctx_list);
-	struct perf_event_context *ctx, *tmp;
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
+	struct perf_event_context *ctx;
 	int throttled;
 
 	lockdep_assert_irqs_disabled();
@@ -4212,8 +4328,13 @@ void perf_event_task_tick(void)
 	throttled = __this_cpu_xchg(perf_throttled_count, 0);
 	tick_dep_clear_cpu(smp_processor_id(), TICK_DEP_BIT_PERF_EVENTS);
 
-	list_for_each_entry_safe(ctx, tmp, head, active_ctx_list)
-		perf_adjust_freq_unthr_context(ctx, throttled);
+	perf_adjust_freq_unthr_context(&cpuctx->ctx, !!throttled);
+
+	rcu_read_lock();
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
+		perf_adjust_freq_unthr_context(ctx, !!throttled);
+	rcu_read_unlock();
 }
 
 static int event_enable_on_exec(struct perf_event *event,
@@ -4235,9 +4356,9 @@ static int event_enable_on_exec(struct perf_event *event,
  * Enable all of a task's events that have been marked enable-on-exec.
  * This expects task == current.
  */
-static void perf_event_enable_on_exec(int ctxn)
+static void perf_event_enable_on_exec(struct perf_event_context *ctx)
 {
-	struct perf_event_context *ctx, *clone_ctx = NULL;
+	struct perf_event_context *clone_ctx = NULL;
 	enum event_type_t event_type = 0;
 	struct perf_cpu_context *cpuctx;
 	struct perf_event *event;
@@ -4245,13 +4366,16 @@ static void perf_event_enable_on_exec(int ctxn)
 	int enabled = 0;
 
 	local_irq_save(flags);
-	ctx = current->perf_event_ctxp[ctxn];
-	if (!ctx || !ctx->nr_events)
+	if (WARN_ON_ONCE(current->perf_event_ctxp != ctx))
+		goto out;
+
+	if (!ctx->nr_events)
 		goto out;
 
-	cpuctx = __get_cpu_context(ctx);
+	cpuctx = this_cpu_ptr(&cpu_context);
 	perf_ctx_lock(cpuctx, ctx);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, EVENT_TIME);
+
 	list_for_each_entry(event, &ctx->event_list, event_entry) {
 		enabled |= event_enable_on_exec(event, ctx);
 		event_type |= get_event_type(event);
@@ -4264,7 +4388,7 @@ static void perf_event_enable_on_exec(int ctxn)
 		clone_ctx = unclone_ctx(ctx);
 		ctx_resched(cpuctx, ctx, event_type);
 	} else {
-		ctx_sched_in(ctx, cpuctx, EVENT_TIME, current);
+		ctx_sched_in(ctx, EVENT_TIME, current);
 	}
 	perf_ctx_unlock(cpuctx, ctx);
 
@@ -4283,17 +4407,15 @@ static void perf_event_exit_event(struct perf_event *event,
  * Removes all events from the current task that have been marked
  * remove-on-exec, and feeds their values back to parent events.
  */
-static void perf_event_remove_on_exec(int ctxn)
+static void perf_event_remove_on_exec(struct perf_event_context *ctx)
 {
-	struct perf_event_context *ctx, *clone_ctx = NULL;
+	struct perf_event_context *clone_ctx = NULL;
 	struct perf_event *event, *next;
 	LIST_HEAD(free_list);
 	unsigned long flags;
 	bool modified = false;
 
-	ctx = perf_pin_task_context(current, ctxn);
-	if (!ctx)
-		return;
+	perf_pin_task_context(current);
 
 	mutex_lock(&ctx->mutex);
 
@@ -4357,7 +4479,7 @@ static void __perf_event_read(void *info)
 	struct perf_read_data *data = info;
 	struct perf_event *sub, *event = data->event;
 	struct perf_event_context *ctx = event->ctx;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct pmu *pmu = event->pmu;
 
 	/*
@@ -4572,17 +4694,25 @@ static void __perf_event_init_context(struct perf_event_context *ctx)
 {
 	raw_spin_lock_init(&ctx->lock);
 	mutex_init(&ctx->mutex);
-	INIT_LIST_HEAD(&ctx->active_ctx_list);
+	INIT_LIST_HEAD(&ctx->pmu_ctx_list);
 	perf_event_groups_init(&ctx->pinned_groups);
 	perf_event_groups_init(&ctx->flexible_groups);
 	INIT_LIST_HEAD(&ctx->event_list);
-	INIT_LIST_HEAD(&ctx->pinned_active);
-	INIT_LIST_HEAD(&ctx->flexible_active);
 	refcount_set(&ctx->refcount, 1);
 }
 
+static void
+__perf_init_event_pmu_context(struct perf_event_pmu_context *epc, struct pmu *pmu)
+{
+	epc->pmu = pmu;
+	INIT_LIST_HEAD(&epc->pmu_ctx_entry);
+	INIT_LIST_HEAD(&epc->pinned_active);
+	INIT_LIST_HEAD(&epc->flexible_active);
+	atomic_set(&epc->refcount, 1);
+}
+
 static struct perf_event_context *
-alloc_perf_context(struct pmu *pmu, struct task_struct *task)
+alloc_perf_context(struct task_struct *task)
 {
 	struct perf_event_context *ctx;
 
@@ -4593,7 +4723,6 @@ alloc_perf_context(struct pmu *pmu, struct task_struct *task)
 	__perf_event_init_context(ctx);
 	if (task)
 		ctx->task = get_task_struct(task);
-	ctx->pmu = pmu;
 
 	return ctx;
 }
@@ -4622,15 +4751,12 @@ find_lively_task_by_vpid(pid_t vpid)
  * Returns a matching context with refcount and pincount.
  */
 static struct perf_event_context *
-find_get_context(struct pmu *pmu, struct task_struct *task,
-		struct perf_event *event)
+find_get_context(struct task_struct *task, struct perf_event *event)
 {
 	struct perf_event_context *ctx, *clone_ctx = NULL;
 	struct perf_cpu_context *cpuctx;
-	void *task_ctx_data = NULL;
 	unsigned long flags;
-	int ctxn, err;
-	int cpu = event->cpu;
+	int err;
 
 	if (!task) {
 		/* Must be root to operate on a CPU event: */
@@ -4638,7 +4764,7 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		if (err)
 			return ERR_PTR(err);
 
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
+		cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
 		ctx = &cpuctx->ctx;
 		get_ctx(ctx);
 		raw_spin_lock_irqsave(&ctx->lock, flags);
@@ -4649,43 +4775,22 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 	}
 
 	err = -EINVAL;
-	ctxn = pmu->task_ctx_nr;
-	if (ctxn < 0)
-		goto errout;
+retry:
+	ctx = perf_lock_task_context(task, &flags);
+	if (ctx) {
+		clone_ctx = unclone_ctx(ctx);
+		++ctx->pin_count;
 
-	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
-		task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!task_ctx_data) {
-			err = -ENOMEM;
-			goto errout;
-		}
-	}
-
-retry:
-	ctx = perf_lock_task_context(task, ctxn, &flags);
-	if (ctx) {
-		clone_ctx = unclone_ctx(ctx);
-		++ctx->pin_count;
-
-		if (task_ctx_data && !ctx->task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
 		if (clone_ctx)
 			put_ctx(clone_ctx);
 	} else {
-		ctx = alloc_perf_context(pmu, task);
+		ctx = alloc_perf_context(task);
 		err = -ENOMEM;
 		if (!ctx)
 			goto errout;
 
-		if (task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
-
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
 		/*
@@ -4694,12 +4799,12 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		 */
 		if (task->flags & PF_EXITING)
 			err = -ESRCH;
-		else if (task->perf_event_ctxp[ctxn])
+		else if (task->perf_event_ctxp)
 			err = -EAGAIN;
 		else {
 			get_ctx(ctx);
 			++ctx->pin_count;
-			rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
+			rcu_assign_pointer(task->perf_event_ctxp, ctx);
 		}
 		mutex_unlock(&task->perf_event_mutex);
 
@@ -4712,14 +4817,117 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		}
 	}
 
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ctx;
 
 errout:
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ERR_PTR(err);
 }
 
+struct perf_event_pmu_context *
+find_get_pmu_context(struct pmu *pmu, struct perf_event_context *ctx,
+		     struct perf_event *event)
+{
+	struct perf_event_pmu_context *new = NULL, *epc;
+	void *task_ctx_data = NULL;
+
+	if (!ctx->task) {
+		struct perf_cpu_pmu_context *cpc;
+
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, event->cpu);
+		epc = &cpc->epc;
+
+		if (!epc->ctx) {
+			atomic_set(&epc->refcount, 1);
+			epc->embedded = 1;
+			raw_spin_lock_irq(&ctx->lock);
+			list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+			epc->ctx = ctx;
+			raw_spin_unlock_irq(&ctx->lock);
+		} else {
+			WARN_ON_ONCE(epc->ctx != ctx);
+			atomic_inc(&epc->refcount);
+		}
+
+		return epc;
+	}
+
+	new = kzalloc(sizeof(*epc), GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+
+	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
+		task_ctx_data = alloc_task_ctx_data(pmu);
+		if (!task_ctx_data) {
+			kfree(new);
+			return ERR_PTR(-ENOMEM);
+		}
+	}
+
+	__perf_init_event_pmu_context(new, pmu);
+
+	raw_spin_lock_irq(&ctx->lock);
+	list_for_each_entry(epc, &ctx->pmu_ctx_list, pmu_ctx_entry) {
+		if (epc->pmu == pmu) {
+			WARN_ON_ONCE(epc->ctx != ctx);
+			atomic_inc(&epc->refcount);
+			goto found_epc;
+		}
+	}
+
+	epc = new;
+	new = NULL;
+
+	list_add(&epc->pmu_ctx_entry, &ctx->pmu_ctx_list);
+	epc->ctx = ctx;
+
+found_epc:
+	if (task_ctx_data && !epc->task_ctx_data) {
+		epc->task_ctx_data = task_ctx_data;
+		task_ctx_data = NULL;
+		ctx->nr_task_data++;
+	}
+	raw_spin_unlock_irq(&ctx->lock);
+
+	free_task_ctx_data(pmu, task_ctx_data);
+	kfree(new);
+
+	return epc;
+}
+
+static void get_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+	WARN_ON_ONCE(!atomic_inc_not_zero(&epc->refcount));
+}
+
+static void put_pmu_ctx(struct perf_event_pmu_context *epc)
+{
+	unsigned long flags;
+
+	if (!atomic_dec_and_test(&epc->refcount))
+		return;
+
+	if (epc->ctx) {
+		struct perf_event_context *ctx = epc->ctx;
+
+		// XXX ctx->mutex
+
+		WARN_ON_ONCE(list_empty(&epc->pmu_ctx_entry));
+		raw_spin_lock_irqsave(&ctx->lock, flags);
+		list_del_init(&epc->pmu_ctx_entry);
+		epc->ctx = NULL;
+		raw_spin_unlock_irqrestore(&ctx->lock, flags);
+	}
+
+	WARN_ON_ONCE(!list_empty(&epc->pinned_active));
+	WARN_ON_ONCE(!list_empty(&epc->flexible_active));
+
+	if (epc->embedded)
+		return;
+
+	kfree(epc->task_ctx_data);
+	kfree(epc);
+}
+
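
Not part of the patch: a minimal user-space sketch of the find_get_pmu_context()/put_pmu_ctx() pattern, with illustrative names. A candidate is allocated outside the lock; under the lock either an existing entry for the same key gains a reference or the candidate is installed, and whichever loses the race is freed after unlocking.

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

struct epc {
	int key;		/* stands in for the pmu pointer */
	int refcount;
	struct epc *next;
};

static struct epc *epc_list;
static pthread_mutex_t epc_lock = PTHREAD_MUTEX_INITIALIZER;

static struct epc *find_get_epc(int key)
{
	struct epc *new = calloc(1, sizeof(*new)), *epc;

	if (!new)
		return NULL;
	new->key = key;
	new->refcount = 1;

	pthread_mutex_lock(&epc_lock);
	for (epc = epc_list; epc; epc = epc->next) {
		if (epc->key == key) {
			epc->refcount++;	/* existing entry wins */
			goto out;
		}
	}
	new->next = epc_list;			/* install the candidate */
	epc_list = new;
	epc = new;
	new = NULL;
out:
	pthread_mutex_unlock(&epc_lock);
	free(new);				/* candidate lost the race */
	return epc;
}

static void put_epc(struct epc *epc)
{
	struct epc **p;
	int free_it = 0;

	pthread_mutex_lock(&epc_lock);
	if (!--epc->refcount) {			/* last reference: unlink */
		for (p = &epc_list; *p; p = &(*p)->next) {
			if (*p == epc) {
				*p = epc->next;
				break;
			}
		}
		free_it = 1;
	}
	pthread_mutex_unlock(&epc_lock);
	if (free_it)
		free(epc);
}

int main(void)
{
	struct epc *a = find_get_epc(1), *b = find_get_epc(1);

	printf("same entry: %d, refcount: %d\n", a == b, a->refcount);
	put_epc(b);
	put_epc(a);
	return 0;
}
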
 static void perf_event_free_filter(struct perf_event *event);
 
 static void free_event_rcu(struct rcu_head *head)
@@ -4988,6 +5196,9 @@ static void _free_event(struct perf_event *event)
 	if (event->hw.target)
 		put_task_struct(event->hw.target);
 
+	if (event->pmu_ctx)
+		put_pmu_ctx(event->pmu_ctx);
+
 	/*
 	 * perf_event_free_task() relies on put_ctx() being 'last', in particular
 	 * all task references must be cleaned up.
@@ -5518,7 +5729,7 @@ static void __perf_event_period(struct perf_event *event,
 
 	active = (event->state == PERF_EVENT_STATE_ACTIVE);
 	if (active) {
-		perf_pmu_disable(ctx->pmu);
+		perf_pmu_disable(event->pmu);
 		/*
 		 * We could be throttled; unthrottle now to avoid the tick
 		 * trying to unthrottle while we already re-started the event.
@@ -5534,7 +5745,7 @@ static void __perf_event_period(struct perf_event *event,
 
 	if (active) {
 		event->pmu->start(event, PERF_EF_RELOAD);
-		perf_pmu_enable(ctx->pmu);
+		perf_pmu_enable(event->pmu);
 	}
 }
 
@@ -7617,7 +7828,6 @@ perf_iterate_sb(perf_iterate_f output, void *data,
 	       struct perf_event_context *task_ctx)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
 
 	rcu_read_lock();
 	preempt_disable();
@@ -7634,11 +7844,9 @@ perf_iterate_sb(perf_iterate_f output, void *data,
 
 	perf_iterate_sb_cpu(output, data);
 
-	for_each_task_context_nr(ctxn) {
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (ctx)
-			perf_iterate_ctx(ctx, output, data, false);
-	}
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
+		perf_iterate_ctx(ctx, output, data, false);
 done:
 	preempt_enable();
 	rcu_read_unlock();
@@ -7680,20 +7888,15 @@ static void perf_event_addr_filters_exec(struct perf_event *event, void *data)
 void perf_event_exec(void)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
-
-	for_each_task_context_nr(ctxn) {
-		perf_event_enable_on_exec(ctxn);
-		perf_event_remove_on_exec(ctxn);
 
-		rcu_read_lock();
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (ctx) {
-			perf_iterate_ctx(ctx, perf_event_addr_filters_exec,
-					 NULL, true);
-		}
-		rcu_read_unlock();
+	rcu_read_lock();
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx) {
+		perf_event_enable_on_exec(ctx);
+		perf_event_remove_on_exec(ctx);
+		perf_iterate_ctx(ctx, perf_event_addr_filters_exec, NULL, true);
 	}
+	rcu_read_unlock();
 }
 
 struct remote_output {
@@ -7733,8 +7936,7 @@ static void __perf_event_output_stop(struct perf_event *event, void *data)
 static int __perf_pmu_output_stop(void *info)
 {
 	struct perf_event *event = info;
-	struct pmu *pmu = event->ctx->pmu;
-	struct perf_cpu_context *cpuctx = this_cpu_ptr(pmu->pmu_cpu_context);
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct remote_output ro = {
 		.rb	= event->rb,
 	};
@@ -8523,7 +8725,6 @@ static void __perf_addr_filters_adjust(struct perf_event *event, void *data)
 static void perf_addr_filters_adjust(struct vm_area_struct *vma)
 {
 	struct perf_event_context *ctx;
-	int ctxn;
 
 	/*
 	 * Data tracing isn't supported yet and as such there is no need
@@ -8533,13 +8734,9 @@ static void perf_addr_filters_adjust(struct vm_area_struct *vma)
 		return;
 
 	rcu_read_lock();
-	for_each_task_context_nr(ctxn) {
-		ctx = rcu_dereference(current->perf_event_ctxp[ctxn]);
-		if (!ctx)
-			continue;
-
+	ctx = rcu_dereference(current->perf_event_ctxp);
+	if (ctx)
 		perf_iterate_ctx(ctx, __perf_addr_filters_adjust, vma, true);
-	}
 	rcu_read_unlock();
 }
 
@@ -9718,10 +9915,13 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
 		struct trace_entry *entry = record;
 
 		rcu_read_lock();
-		ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);
+		ctx = rcu_dereference(task->perf_event_ctxp);
 		if (!ctx)
 			goto unlock;
 
+		// XXX iterate groups instead, we should be able to
+		// find the subtree for the perf_tracepoint pmu and CPU.
+
 		list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
 			if (event->cpu != smp_processor_id())
 				continue;
@@ -10850,36 +11050,9 @@ static int perf_event_idx_default(struct perf_event *event)
 	return 0;
 }
 
-/*
- * Ensures all contexts with the same task_ctx_nr have the same
- * pmu_cpu_context too.
- */
-static struct perf_cpu_context __percpu *find_pmu_context(int ctxn)
-{
-	struct pmu *pmu;
-
-	if (ctxn < 0)
-		return NULL;
-
-	list_for_each_entry(pmu, &pmus, entry) {
-		if (pmu->task_ctx_nr == ctxn)
-			return pmu->pmu_cpu_context;
-	}
-
-	return NULL;
-}
-
 static void free_pmu_context(struct pmu *pmu)
 {
-	/*
-	 * Static contexts such as perf_sw_context have a global lifetime
-	 * and may be shared between different PMUs. Avoid freeing them
-	 * when a single PMU is going away.
-	 */
-	if (pmu->task_ctx_nr > perf_invalid_context)
-		return;
-
-	free_percpu(pmu->pmu_cpu_context);
+	free_percpu(pmu->cpu_pmu_context);
 }
 
 /*
@@ -10943,12 +11116,12 @@ perf_event_mux_interval_ms_store(struct device *dev,
 	/* update all cpuctx for this PMU */
 	cpus_read_lock();
 	for_each_online_cpu(cpu) {
-		struct perf_cpu_context *cpuctx;
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		cpuctx->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
+		struct perf_cpu_pmu_context *cpc;
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		cpc->hrtimer_interval = ns_to_ktime(NSEC_PER_MSEC * timer);
 
 		cpu_function_call(cpu,
-			(remote_function_f)perf_mux_hrtimer_restart, cpuctx);
+			(remote_function_f)perf_mux_hrtimer_restart, cpc);
 	}
 	cpus_read_unlock();
 	mutex_unlock(&mux_interval_mutex);
@@ -11059,47 +11232,19 @@ int perf_pmu_register(struct pmu *pmu, const char *name, int type)
 	}
 
 skip_type:
-	if (pmu->task_ctx_nr == perf_hw_context) {
-		static int hw_context_taken = 0;
-
-		/*
-		 * Other than systems with heterogeneous CPUs, it never makes
-		 * sense for two PMUs to share perf_hw_context. PMUs which are
-		 * uncore must use perf_invalid_context.
-		 */
-		if (WARN_ON_ONCE(hw_context_taken &&
-		    !(pmu->capabilities & PERF_PMU_CAP_HETEROGENEOUS_CPUS)))
-			pmu->task_ctx_nr = perf_invalid_context;
-
-		hw_context_taken = 1;
-	}
-
-	pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
-	if (pmu->pmu_cpu_context)
-		goto got_cpu_context;
-
 	ret = -ENOMEM;
-	pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
-	if (!pmu->pmu_cpu_context)
+	pmu->cpu_pmu_context = alloc_percpu(struct perf_cpu_pmu_context);
+	if (!pmu->cpu_pmu_context)
 		goto free_dev;
 
 	for_each_possible_cpu(cpu) {
-		struct perf_cpu_context *cpuctx;
+		struct perf_cpu_pmu_context *cpc;
 
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		__perf_event_init_context(&cpuctx->ctx);
-		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
-		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
-		cpuctx->ctx.pmu = pmu;
-		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
-
-		__perf_mux_hrtimer_init(cpuctx, cpu);
-
-		cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
-		cpuctx->heap = cpuctx->heap_default;
+		cpc = per_cpu_ptr(pmu->cpu_pmu_context, cpu);
+		__perf_init_event_pmu_context(&cpc->epc, pmu);
+		__perf_mux_hrtimer_init(cpc, cpu);
 	}
 
-got_cpu_context:
 	if (!pmu->start_txn) {
 		if (pmu->pmu_enable) {
 			/*
@@ -11578,10 +11723,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	}
 
 	/*
-	 * Disallow uncore-cgroup events, they don't make sense as the cgroup will
-	 * be different on other CPUs in the uncore mask.
+	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
+	 * events (they don't make sense as the cgroup will be different
+	 * on other CPUs in the uncore mask).
 	 */
-	if (pmu->task_ctx_nr == perf_invalid_context && cgroup_fd != -1) {
+	if (pmu->task_ctx_nr == perf_invalid_context && (task || cgroup_fd != -1)) {
 		err = -EINVAL;
 		goto err_pmu;
 	}
@@ -11913,37 +12059,6 @@ static int perf_event_set_clock(struct perf_event *event, clockid_t clk_id)
 	return 0;
 }
 
-/*
- * Variation on perf_event_ctx_lock_nested(), except we take two context
- * mutexes.
- */
-static struct perf_event_context *
-__perf_event_ctx_lock_double(struct perf_event *group_leader,
-			     struct perf_event_context *ctx)
-{
-	struct perf_event_context *gctx;
-
-again:
-	rcu_read_lock();
-	gctx = READ_ONCE(group_leader->ctx);
-	if (!refcount_inc_not_zero(&gctx->refcount)) {
-		rcu_read_unlock();
-		goto again;
-	}
-	rcu_read_unlock();
-
-	mutex_lock_double(&gctx->mutex, &ctx->mutex);
-
-	if (group_leader->ctx != gctx) {
-		mutex_unlock(&ctx->mutex);
-		mutex_unlock(&gctx->mutex);
-		put_ctx(gctx);
-		goto again;
-	}
-
-	return gctx;
-}
-
 static bool
 perf_check_permission(struct perf_event_attr *attr, struct task_struct *task)
 {
@@ -11989,9 +12104,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		pid_t, pid, int, cpu, int, group_fd, unsigned long, flags)
 {
 	struct perf_event *group_leader = NULL, *output_event = NULL;
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event *event, *sibling;
 	struct perf_event_attr attr;
-	struct perf_event_context *ctx, *gctx;
+	struct perf_event_context *ctx;
 	struct file *event_file = NULL;
 	struct fd group = {NULL, 0};
 	struct task_struct *task = NULL;
@@ -12099,6 +12215,8 @@ SYSCALL_DEFINE5(perf_event_open,
 		goto err_task;
 	}
 
+	// XXX premature; what if this is allowed, but we get moved to a PMU
+	// that doesn't have this.
 	if (is_sampling_event(event)) {
 		if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
 			err = -EOPNOTSUPP;
@@ -12121,42 +12239,37 @@ SYSCALL_DEFINE5(perf_event_open,
 	if (pmu->task_ctx_nr == perf_sw_context)
 		event->event_caps |= PERF_EV_CAP_SOFTWARE;
 
-	if (group_leader) {
-		if (is_software_event(event) &&
-		    !in_software_context(group_leader)) {
-			/*
-			 * If the event is a sw event, but the group_leader
-			 * is on hw context.
-			 *
-			 * Allow the addition of software events to hw
-			 * groups, this is safe because software events
-			 * never fail to schedule.
-			 */
-			pmu = group_leader->ctx->pmu;
-		} else if (!is_software_event(event) &&
-			   is_software_event(group_leader) &&
-			   (group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
-			/*
-			 * In case the group is a pure software group, and we
-			 * try to add a hardware event, move the whole group to
-			 * the hardware context.
-			 */
-			move_group = 1;
-		}
-	}
-
 	/*
 	 * Get the target context (task or percpu):
 	 */
-	ctx = find_get_context(pmu, task, event);
+	ctx = find_get_context(task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
 		goto err_alloc;
 	}
 
-	/*
-	 * Look up the group leader (we will attach this event to it):
-	 */
+	mutex_lock(&ctx->mutex);
+
+	if (ctx->task == TASK_TOMBSTONE) {
+		err = -ESRCH;
+		goto err_locked;
+	}
+
+	if (!task) {
+		/*
+		 * Check if the @cpu we're creating an event for is online.
+		 *
+		 * We use the perf_cpu_context::ctx::mutex to serialize against
+		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
+		 */
+		struct perf_cpu_context *cpuctx = per_cpu_ptr(&cpu_context, event->cpu);
+
+		if (!cpuctx->online) {
+			err = -ENODEV;
+			goto err_locked;
+		}
+	}
+
 	if (group_leader) {
 		err = -EINVAL;
 
@@ -12165,11 +12278,11 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * becoming part of another group-sibling):
 		 */
 		if (group_leader->group_leader != group_leader)
-			goto err_context;
+			goto err_locked;
 
 		/* All events in a group should have the same clock */
 		if (group_leader->clock != event->clock)
-			goto err_context;
+			goto err_locked;
 
 		/*
 		 * Make sure we're both events for the same CPU;
@@ -12177,29 +12290,52 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * you can never concurrently schedule them anyhow.
 		 */
 		if (group_leader->cpu != event->cpu)
-			goto err_context;
-
-		/*
-		 * Make sure we're both on the same task, or both
-		 * per-CPU events.
-		 */
-		if (group_leader->ctx->task != ctx->task)
-			goto err_context;
+			goto err_locked;
 
 		/*
-		 * Do not allow to attach to a group in a different task
-		 * or CPU context. If we're moving SW events, we'll fix
-		 * this up later, so allow that.
+		 * Make sure we're both on the same context; either task or cpu.
 		 */
-		if (!move_group && group_leader->ctx != ctx)
-			goto err_context;
+		if (group_leader->ctx != ctx)
+			goto err_locked;
 
 		/*
 		 * Only a group leader can be exclusive or pinned
 		 */
 		if (attr.exclusive || attr.pinned)
-			goto err_context;
+			goto err_locked;
+
+		if (is_software_event(event) &&
+		    !in_software_context(group_leader)) {
+			/*
+			 * If the event is a sw event, but the group_leader
+			 * is on hw context.
+			 *
+			 * Allow the addition of software events to hw
+			 * groups, this is safe because software events
+			 * never fail to schedule.
+			 */
+			pmu = group_leader->pmu_ctx->pmu;
+		} else if (!is_software_event(event) &&
+			is_software_event(group_leader) &&
+			(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
+			/*
+			 * In case the group is a pure software group, and we
+			 * try to add a hardware event, move the whole group to
+			 * the hardware context.
+			 */
+			move_group = 1;
+		}
+	}
+
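
Not part of the patch: a minimal user-space sketch of the group/pmu reconciliation above, with illustrative names. A software event may always join a hardware group because software events never fail to schedule, while adding a hardware event to a pure-software group forces the whole group onto the hardware pmu.

#include <stdio.h>
#include <stdbool.h>

struct fake_pmu { const char *name; bool software; };

struct fake_event {
	struct fake_pmu *pmu;
	bool group_all_software;	/* stands in for PERF_EV_CAP_SOFTWARE */
};

/* returns the pmu the new event should use; sets *move_group as needed */
static struct fake_pmu *pick_pmu(struct fake_event *event,
				 struct fake_event *leader, bool *move_group)
{
	*move_group = false;

	if (!leader)
		return event->pmu;

	if (event->pmu->software && !leader->pmu->software)
		return leader->pmu;	/* sw event joins hw group as-is */

	if (!event->pmu->software && leader->group_all_software)
		*move_group = true;	/* drag the sw group onto the hw pmu */

	return event->pmu;
}

int main(void)
{
	struct fake_pmu hw = { "cpu", false }, sw = { "software", true };
	struct fake_event leader = { &sw, true }, event = { &hw, false };
	bool move_group;
	struct fake_pmu *pmu = pick_pmu(&event, &leader, &move_group);

	printf("use pmu %s, move_group=%d\n", pmu->name, move_group);
	return 0;
}
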
+	/*
+	 * Now that we're certain of the pmu; find the pmu_ctx.
+	 */
+	pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+	if (IS_ERR(pmu_ctx)) {
+		err = PTR_ERR(pmu_ctx);
+		goto err_locked;
 	}
+	event->pmu_ctx = pmu_ctx;
 
 	if (output_event) {
 		err = perf_event_set_output(event, output_event);
@@ -12207,8 +12343,7 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_context;
 	}
 
-	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event,
-					f_flags);
+	event_file = anon_inode_getfile("[perf_event]", &perf_fops, event, f_flags);
 	if (IS_ERR(event_file)) {
 		err = PTR_ERR(event_file);
 		event_file = NULL;
@@ -12231,77 +12366,14 @@ SYSCALL_DEFINE5(perf_event_open,
 			goto err_cred;
 	}
 
-	if (move_group) {
-		gctx = __perf_event_ctx_lock_double(group_leader, ctx);
-
-		if (gctx->task == TASK_TOMBSTONE) {
-			err = -ESRCH;
-			goto err_locked;
-		}
-
-		/*
-		 * Check if we raced against another sys_perf_event_open() call
-		 * moving the software group underneath us.
-		 */
-		if (!(group_leader->group_caps & PERF_EV_CAP_SOFTWARE)) {
-			/*
-			 * If someone moved the group out from under us, check
-			 * if this new event wound up on the same ctx, if so
-			 * its the regular !move_group case, otherwise fail.
-			 */
-			if (gctx != ctx) {
-				err = -EINVAL;
-				goto err_locked;
-			} else {
-				perf_event_ctx_unlock(group_leader, gctx);
-				move_group = 0;
-			}
-		}
-
-		/*
-		 * Failure to create exclusive events returns -EBUSY.
-		 */
-		err = -EBUSY;
-		if (!exclusive_event_installable(group_leader, ctx))
-			goto err_locked;
-
-		for_each_sibling_event(sibling, group_leader) {
-			if (!exclusive_event_installable(sibling, ctx))
-				goto err_locked;
-		}
-	} else {
-		mutex_lock(&ctx->mutex);
-	}
-
-	if (ctx->task == TASK_TOMBSTONE) {
-		err = -ESRCH;
-		goto err_locked;
-	}
-
 	if (!perf_event_validate_size(event)) {
 		err = -E2BIG;
-		goto err_locked;
-	}
-
-	if (!task) {
-		/*
-		 * Check if the @cpu we're creating an event for is online.
-		 *
-		 * We use the perf_cpu_context::ctx::mutex to serialize against
-		 * the hotplug notifiers. See perf_event_{init,exit}_cpu().
-		 */
-		struct perf_cpu_context *cpuctx =
-			container_of(ctx, struct perf_cpu_context, ctx);
-
-		if (!cpuctx->online) {
-			err = -ENODEV;
-			goto err_locked;
-		}
+		goto err_cred;
 	}
 
 	if (perf_need_aux_event(event) && !perf_get_aux_event(event, group_leader)) {
 		err = -EINVAL;
-		goto err_locked;
+		goto err_cred;
 	}
 
 	/*
@@ -12310,7 +12382,7 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 	if (!exclusive_event_installable(event, ctx)) {
 		err = -EBUSY;
-		goto err_locked;
+		goto err_cred;
 	}
 
 	WARN_ON_ONCE(ctx->parent_ctx);
@@ -12321,24 +12393,14 @@ SYSCALL_DEFINE5(perf_event_open,
 	 */
 
 	if (move_group) {
-		/*
-		 * See perf_event_ctx_lock() for comments on the details
-		 * of swizzling perf_event::ctx.
-		 */
 		perf_remove_from_context(group_leader, 0);
-		put_ctx(gctx);
+		put_pmu_ctx(group_leader->pmu_ctx);
 
 		for_each_sibling_event(sibling, group_leader) {
 			perf_remove_from_context(sibling, 0);
-			put_ctx(gctx);
+			put_pmu_ctx(sibling->pmu_ctx);
 		}
 
-		/*
-		 * Wait for everybody to stop referencing the events through
-		 * the old lists, before installing it on new lists.
-		 */
-		synchronize_rcu();
-
 		/*
 		 * Install the group siblings before the group leader.
 		 *
@@ -12350,9 +12412,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * reachable through the group lists.
 		 */
 		for_each_sibling_event(sibling, group_leader) {
+			sibling->pmu_ctx = pmu_ctx;
+			get_pmu_ctx(pmu_ctx);
 			perf_event__state_init(sibling);
 			perf_install_in_context(ctx, sibling, sibling->cpu);
-			get_ctx(ctx);
 		}
 
 		/*
@@ -12360,9 +12423,10 @@ SYSCALL_DEFINE5(perf_event_open,
 		 * event. What we want here is event in the initial
 		 * startup state, ready to be add into new context.
 		 */
+		group_leader->pmu_ctx = pmu_ctx;
+		get_pmu_ctx(pmu_ctx);
 		perf_event__state_init(group_leader);
 		perf_install_in_context(ctx, group_leader, group_leader->cpu);
-		get_ctx(ctx);
 	}
 
 	/*
@@ -12379,8 +12443,6 @@ SYSCALL_DEFINE5(perf_event_open,
 	perf_install_in_context(ctx, event, event->cpu);
 	perf_unpin_context(ctx);
 
-	if (move_group)
-		perf_event_ctx_unlock(group_leader, gctx);
 	mutex_unlock(&ctx->mutex);
 
 	if (task) {
@@ -12402,16 +12464,15 @@ SYSCALL_DEFINE5(perf_event_open,
 	fd_install(event_fd, event_file);
 	return event_fd;
 
-err_locked:
-	if (move_group)
-		perf_event_ctx_unlock(group_leader, gctx);
-	mutex_unlock(&ctx->mutex);
 err_cred:
 	if (task)
 		up_read(&task->signal->exec_update_lock);
 err_file:
 	fput(event_file);
 err_context:
+	/* event->pmu_ctx freed by free_event() */
+err_locked:
+	mutex_unlock(&ctx->mutex);
 	perf_unpin_context(ctx);
 	put_ctx(ctx);
 err_alloc:
@@ -12446,8 +12507,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 				 perf_overflow_handler_t overflow_handler,
 				 void *context)
 {
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event_context *ctx;
 	struct perf_event *event;
+	struct pmu *pmu;
 	int err;
 
 	/*
@@ -12466,15 +12529,31 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 	/* Mark owner so we could distinguish it from user events. */
 	event->owner = TASK_TOMBSTONE;
+	pmu = event->pmu;
+
+	if (pmu->task_ctx_nr < 0 && task) {
+		err = -EINVAL;
+		goto err_alloc;
+	}
+
+	if (pmu->task_ctx_nr == perf_sw_context)
+		event->event_caps |= PERF_EV_CAP_SOFTWARE;
 
 	/*
 	 * Get the target context (task or percpu):
 	 */
-	ctx = find_get_context(event->pmu, task, event);
+	ctx = find_get_context(task, event);
 	if (IS_ERR(ctx)) {
 		err = PTR_ERR(ctx);
-		goto err_free;
+		goto err_alloc;
+	}
+
+	pmu_ctx = find_get_pmu_context(pmu, ctx, event);
+	if (IS_ERR(pmu_ctx)) {
+		err = PTR_ERR(pmu_ctx);
+		goto err_ctx;
 	}
+	event->pmu_ctx = pmu_ctx;
 
 	WARN_ON_ONCE(ctx->parent_ctx);
 	mutex_lock(&ctx->mutex);
@@ -12511,9 +12590,10 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr, int cpu,
 
 err_unlock:
 	mutex_unlock(&ctx->mutex);
+err_ctx:
 	perf_unpin_context(ctx);
 	put_ctx(ctx);
-err_free:
+err_alloc:
 	free_event(event);
 err:
 	return ERR_PTR(err);
@@ -12522,6 +12602,7 @@ EXPORT_SYMBOL_GPL(perf_event_create_kernel_counter);
 
 void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 {
+#if 0 // XXX buggered - cpu hotplug, who cares
 	struct perf_event_context *src_ctx;
 	struct perf_event_context *dst_ctx;
 	struct perf_event *event, *tmp;
@@ -12582,6 +12663,7 @@ void perf_pmu_migrate_context(struct pmu *pmu, int src_cpu, int dst_cpu)
 	}
 	mutex_unlock(&dst_ctx->mutex);
 	mutex_unlock(&src_ctx->mutex);
+#endif
 }
 EXPORT_SYMBOL_GPL(perf_pmu_migrate_context);
 
@@ -12659,14 +12741,14 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 	perf_event_wakeup(event);
 }
 
-static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
+static void perf_event_exit_task_context(struct task_struct *child)
 {
 	struct perf_event_context *child_ctx, *clone_ctx = NULL;
 	struct perf_event *child_event, *next;
 
 	WARN_ON_ONCE(child != current);
 
-	child_ctx = perf_pin_task_context(child, ctxn);
+	child_ctx = perf_pin_task_context(child);
 	if (!child_ctx)
 		return;
 
@@ -12688,13 +12770,13 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 	 * in.
 	 */
 	raw_spin_lock_irq(&child_ctx->lock);
-	task_ctx_sched_out(__get_cpu_context(child_ctx), child_ctx, EVENT_ALL);
+	task_ctx_sched_out(child_ctx, EVENT_ALL);
 
 	/*
 	 * Now that the context is inactive, destroy the task <-> ctx relation
 	 * and mark the context dead.
 	 */
-	RCU_INIT_POINTER(child->perf_event_ctxp[ctxn], NULL);
+	RCU_INIT_POINTER(child->perf_event_ctxp, NULL);
 	put_ctx(child_ctx); /* cannot be last */
 	WRITE_ONCE(child_ctx->task, TASK_TOMBSTONE);
 	put_task_struct(current); /* cannot be last */
@@ -12729,7 +12811,6 @@ static void perf_event_exit_task_context(struct task_struct *child, int ctxn)
 void perf_event_exit_task(struct task_struct *child)
 {
 	struct perf_event *event, *tmp;
-	int ctxn;
 
 	mutex_lock(&child->perf_event_mutex);
 	list_for_each_entry_safe(event, tmp, &child->perf_event_list,
@@ -12745,8 +12826,7 @@ void perf_event_exit_task(struct task_struct *child)
 	}
 	mutex_unlock(&child->perf_event_mutex);
 
-	for_each_task_context_nr(ctxn)
-		perf_event_exit_task_context(child, ctxn);
+	perf_event_exit_task_context(child);
 
 	/*
 	 * The perf_event_exit_task_context calls perf_event_task
@@ -12789,56 +12869,51 @@ void perf_event_free_task(struct task_struct *task)
 {
 	struct perf_event_context *ctx;
 	struct perf_event *event, *tmp;
-	int ctxn;
 
-	for_each_task_context_nr(ctxn) {
-		ctx = task->perf_event_ctxp[ctxn];
-		if (!ctx)
-			continue;
+	ctx = rcu_dereference(task->perf_event_ctxp);
+	if (!ctx)
+		return;
 
-		mutex_lock(&ctx->mutex);
-		raw_spin_lock_irq(&ctx->lock);
-		/*
-		 * Destroy the task <-> ctx relation and mark the context dead.
-		 *
-		 * This is important because even though the task hasn't been
-		 * exposed yet the context has been (through child_list).
-		 */
-		RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], NULL);
-		WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
-		put_task_struct(task); /* cannot be last */
-		raw_spin_unlock_irq(&ctx->lock);
+	mutex_lock(&ctx->mutex);
+	raw_spin_lock_irq(&ctx->lock);
+	/*
+	 * Destroy the task <-> ctx relation and mark the context dead.
+	 *
+	 * This is important because even though the task hasn't been
+	 * exposed yet the context has been (through child_list).
+	 */
+	RCU_INIT_POINTER(task->perf_event_ctxp, NULL);
+	WRITE_ONCE(ctx->task, TASK_TOMBSTONE);
+	put_task_struct(task); /* cannot be last */
+	raw_spin_unlock_irq(&ctx->lock);
 
-		list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
-			perf_free_event(event, ctx);
 
-		mutex_unlock(&ctx->mutex);
+	list_for_each_entry_safe(event, tmp, &ctx->event_list, event_entry)
+		perf_free_event(event, ctx);
 
-		/*
-		 * perf_event_release_kernel() could've stolen some of our
-		 * child events and still have them on its free_list. In that
-		 * case we must wait for these events to have been freed (in
-		 * particular all their references to this task must've been
-		 * dropped).
-		 *
-		 * Without this copy_process() will unconditionally free this
-		 * task (irrespective of its reference count) and
-		 * _free_event()'s put_task_struct(event->hw.target) will be a
-		 * use-after-free.
-		 *
-		 * Wait for all events to drop their context reference.
-		 */
-		wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
-		put_ctx(ctx); /* must be last */
-	}
+	mutex_unlock(&ctx->mutex);
+
+	/*
+	 * perf_event_release_kernel() could've stolen some of our
+	 * child events and still have them on its free_list. In that
+	 * case we must wait for these events to have been freed (in
+	 * particular all their references to this task must've been
+	 * dropped).
+	 *
+	 * Without this copy_process() will unconditionally free this
+	 * task (irrespective of its reference count) and
+	 * _free_event()'s put_task_struct(event->hw.target) will be a
+	 * use-after-free.
+	 *
+	 * Wait for all events to drop their context reference.
+	 */
+	wait_var_event(&ctx->refcount, refcount_read(&ctx->refcount) == 1);
+	put_ctx(ctx); /* must be last */
 }
 
 void perf_event_delayed_put(struct task_struct *task)
 {
-	int ctxn;
-
-	for_each_task_context_nr(ctxn)
-		WARN_ON_ONCE(task->perf_event_ctxp[ctxn]);
+	WARN_ON_ONCE(task->perf_event_ctxp);
 }
 
 struct file *perf_event_get(unsigned int fd)
@@ -12888,6 +12963,7 @@ inherit_event(struct perf_event *parent_event,
 	      struct perf_event_context *child_ctx)
 {
 	enum perf_event_state parent_state = parent_event->state;
+	struct perf_event_pmu_context *pmu_ctx;
 	struct perf_event *child_event;
 	unsigned long flags;
 
@@ -12908,17 +12984,12 @@ inherit_event(struct perf_event *parent_event,
 	if (IS_ERR(child_event))
 		return child_event;
 
-
-	if ((child_event->attach_state & PERF_ATTACH_TASK_DATA) &&
-	    !child_ctx->task_ctx_data) {
-		struct pmu *pmu = child_event->pmu;
-
-		child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!child_ctx->task_ctx_data) {
-			free_event(child_event);
-			return ERR_PTR(-ENOMEM);
-		}
+	pmu_ctx = find_get_pmu_context(child_event->pmu, child_ctx, child_event);
+	if (IS_ERR(pmu_ctx)) {
+		free_event(child_event);
+		return ERR_CAST(pmu_ctx);
+	}
+	child_event->pmu_ctx = pmu_ctx;
 
 	/*
 	 * is_orphaned_event() and list_add_tail(&parent_event->child_list)
@@ -13041,11 +13112,11 @@ static int inherit_group(struct perf_event *parent_event,
 static int
 inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		   struct perf_event_context *parent_ctx,
-		   struct task_struct *child, int ctxn,
+		   struct task_struct *child,
 		   u64 clone_flags, int *inherited_all)
 {
-	int ret;
 	struct perf_event_context *child_ctx;
+	int ret;
 
 	if (!event->attr.inherit ||
 	    (event->attr.inherit_thread && !(clone_flags & CLONE_THREAD)) ||
@@ -13055,7 +13126,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		return 0;
 	}
 
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = child->perf_event_ctxp;
 	if (!child_ctx) {
 		/*
 		 * This is executed from the parent task context, so
@@ -13063,16 +13134,14 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 		 * First allocate and initialize a context for the
 		 * child.
 		 */
-		child_ctx = alloc_perf_context(parent_ctx->pmu, child);
+		child_ctx = alloc_perf_context(child);
 		if (!child_ctx)
 			return -ENOMEM;
 
-		child->perf_event_ctxp[ctxn] = child_ctx;
+		child->perf_event_ctxp = child_ctx;
 	}
 
-	ret = inherit_group(event, parent, parent_ctx,
-			    child, child_ctx);
-
+	ret = inherit_group(event, parent, parent_ctx, child, child_ctx);
 	if (ret)
 		*inherited_all = 0;
 
@@ -13082,8 +13151,7 @@ inherit_task_group(struct perf_event *event, struct task_struct *parent,
 /*
  * Initialize the perf_event context in task_struct
  */
-static int perf_event_init_context(struct task_struct *child, int ctxn,
-				   u64 clone_flags)
+static int perf_event_init_context(struct task_struct *child, u64 clone_flags)
 {
 	struct perf_event_context *child_ctx, *parent_ctx;
 	struct perf_event_context *cloned_ctx;
@@ -13093,14 +13161,14 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
 	unsigned long flags;
 	int ret = 0;
 
-	if (likely(!parent->perf_event_ctxp[ctxn]))
+	if (likely(!parent->perf_event_ctxp))
 		return 0;
 
 	/*
 	 * If the parent's context is a clone, pin it so it won't get
 	 * swapped under us.
 	 */
-	parent_ctx = perf_pin_task_context(parent, ctxn);
+	parent_ctx = perf_pin_task_context(parent);
 	if (!parent_ctx)
 		return 0;
 
@@ -13123,8 +13191,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
 	 */
 	perf_event_groups_for_each(event, &parent_ctx->pinned_groups) {
 		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, clone_flags,
-					 &inherited_all);
+					 child, clone_flags, &inherited_all);
 		if (ret)
 			goto out_unlock;
 	}
@@ -13140,8 +13207,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
 
 	perf_event_groups_for_each(event, &parent_ctx->flexible_groups) {
 		ret = inherit_task_group(event, parent, parent_ctx,
-					 child, ctxn, clone_flags,
-					 &inherited_all);
+					 child, clone_flags, &inherited_all);
 		if (ret)
 			goto out_unlock;
 	}
@@ -13149,7 +13215,7 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
 	raw_spin_lock_irqsave(&parent_ctx->lock, flags);
 	parent_ctx->rotate_disable = 0;
 
-	child_ctx = child->perf_event_ctxp[ctxn];
+	child_ctx = child->perf_event_ctxp;
 
 	if (child_ctx && inherited_all) {
 		/*
@@ -13185,18 +13251,16 @@ static int perf_event_init_context(struct task_struct *child, int ctxn,
  */
 int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 {
-	int ctxn, ret;
+	int ret;
 
-	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
+	child->perf_event_ctxp = NULL;
 	mutex_init(&child->perf_event_mutex);
 	INIT_LIST_HEAD(&child->perf_event_list);
 
-	for_each_task_context_nr(ctxn) {
-		ret = perf_event_init_context(child, ctxn, clone_flags);
-		if (ret) {
-			perf_event_free_task(child);
-			return ret;
-		}
+	ret = perf_event_init_context(child, clone_flags);
+	if (ret) {
+		perf_event_free_task(child);
+		return ret;
 	}
 
 	return 0;
@@ -13205,6 +13269,7 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 static void __init perf_event_init_all_cpus(void)
 {
 	struct swevent_htable *swhash;
+	struct perf_cpu_context *cpuctx;
 	int cpu;
 
 	zalloc_cpumask_var(&perf_online_mask, GFP_KERNEL);
@@ -13212,7 +13277,6 @@ static void __init perf_event_init_all_cpus(void)
 	for_each_possible_cpu(cpu) {
 		swhash = &per_cpu(swevent_htable, cpu);
 		mutex_init(&swhash->hlist_mutex);
-		INIT_LIST_HEAD(&per_cpu(active_ctx_list, cpu));
 
 		INIT_LIST_HEAD(&per_cpu(pmu_sb_events.list, cpu));
 		raw_spin_lock_init(&per_cpu(pmu_sb_events.lock, cpu));
@@ -13221,6 +13285,14 @@ static void __init perf_event_init_all_cpus(void)
 		INIT_LIST_HEAD(&per_cpu(cgrp_cpuctx_list, cpu));
 #endif
 		INIT_LIST_HEAD(&per_cpu(sched_cb_list, cpu));
+
+		cpuctx = per_cpu_ptr(&cpu_context, cpu);
+		__perf_event_init_context(&cpuctx->ctx);
+		lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
+		lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
+		cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
+		cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
+		cpuctx->heap = cpuctx->heap_default;
 	}
 }
 
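[Since there is now exactly one perf_cpu_context per CPU (the per-CPU cpu_context used above) rather than one per PMU, its setup moves to boot time here, including pointing the event min-heap at the embedded default storage. A tiny sketch of that embedded-default-storage idiom, with invented types and slot count:]

/*
 * Toy illustration of the heap/heap_default idiom used in the hunk
 * above: the pointer starts out aimed at embedded storage of a small
 * default size and can later be redirected to a larger allocation.
 * Types and the slot count are invented for the example.
 */
#include <stddef.h>

struct toy_event;

struct toy_cpuctx {
	struct toy_event **heap;           /* current heap storage */
	size_t heap_size;                  /* slots available */
	struct toy_event *heap_default[2]; /* embedded fallback storage */
};

static void toy_cpuctx_init(struct toy_cpuctx *c)
{
	/* Mirrors cpuctx->heap = cpuctx->heap_default in the hunk above. */
	c->heap = c->heap_default;
	c->heap_size = sizeof(c->heap_default) / sizeof(c->heap_default[0]);
}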
@@ -13242,12 +13314,12 @@ static void perf_swevent_init_cpu(unsigned int cpu)
 #if defined CONFIG_HOTPLUG_CPU || defined CONFIG_KEXEC_CORE
 static void __perf_event_exit_context(void *__info)
 {
+	struct perf_cpu_context *cpuctx = this_cpu_ptr(&cpu_context);
 	struct perf_event_context *ctx = __info;
-	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event *event;
 
 	raw_spin_lock(&ctx->lock);
-	ctx_sched_out(ctx, cpuctx, EVENT_TIME);
+	ctx_sched_out(ctx, EVENT_TIME);
 	list_for_each_entry(event, &ctx->event_list, event_entry)
 		__perf_remove_from_context(event, cpuctx, ctx, (void *)DETACH_GROUP);
 	raw_spin_unlock(&ctx->lock);
@@ -13257,18 +13329,16 @@ static void perf_event_exit_cpu_context(int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
-	struct pmu *pmu;
 
+	// XXX simplify cpuctx->online
 	mutex_lock(&pmus_lock);
-	list_for_each_entry(pmu, &pmus, entry) {
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		ctx = &cpuctx->ctx;
+	cpuctx = per_cpu_ptr(&cpu_context, cpu);
+	ctx = &cpuctx->ctx;
 
-		mutex_lock(&ctx->mutex);
-		smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
-		cpuctx->online = 0;
-		mutex_unlock(&ctx->mutex);
-	}
+	mutex_lock(&ctx->mutex);
+	smp_call_function_single(cpu, __perf_event_exit_context, ctx, 1);
+	cpuctx->online = 0;
+	mutex_unlock(&ctx->mutex);
 	cpumask_clear_cpu(cpu, perf_online_mask);
 	mutex_unlock(&pmus_lock);
 }
@@ -13282,20 +13352,17 @@ int perf_event_init_cpu(unsigned int cpu)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
-	struct pmu *pmu;
 
 	perf_swevent_init_cpu(cpu);
 
 	mutex_lock(&pmus_lock);
 	cpumask_set_cpu(cpu, perf_online_mask);
-	list_for_each_entry(pmu, &pmus, entry) {
-		cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
-		ctx = &cpuctx->ctx;
+	cpuctx = per_cpu_ptr(&cpu_context, cpu);
+	ctx = &cpuctx->ctx;
 
-		mutex_lock(&ctx->mutex);
-		cpuctx->online = 1;
-		mutex_unlock(&ctx->mutex);
-	}
+	mutex_lock(&ctx->mutex);
+	cpuctx->online = 1;
+	mutex_unlock(&ctx->mutex);
 	mutex_unlock(&pmus_lock);
 
 	return 0;
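[With a single per-CPU context, the hotplug callbacks above no longer walk the pmus list; each CPU has exactly one context to bring online or offline, still serialized by pmus_lock and that context's mutex. A compressed toy model of what they now operate on — plain arrays stand in for the kernel's per-CPU machinery, and the names and CPU count are invented:]

/*
 * Toy model of the per-CPU singleton the hotplug paths now touch:
 * one context per CPU, instead of one context per (PMU, CPU) pair.
 */
#define TOY_NR_CPUS 8

struct toy_cpu_context {
	int online;
};

static struct toy_cpu_context toy_cpu_context[TOY_NR_CPUS];

static void toy_cpu_set_online(int cpu, int online)
{
	/* One context to update per CPU: no loop over registered PMUs. */
	toy_cpu_context[cpu].online = online;
}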
-- 
2.27.0



end of thread, other threads:[~2022-08-29 12:15 UTC | newest]

Thread overview: 47+ messages
2022-01-14 21:48 [RFC v2] perf: Rewrite core context handling kernel test robot
2022-01-18  6:01 ` kernel test robot
2022-01-18  6:01   ` kernel test robot
  -- strict thread matches above, loose matches on Subject: below --
2022-01-13 13:47 Ravi Bangoria
2022-01-13 19:15 ` kernel test robot
2022-01-31  4:43 ` Ravi Bangoria
2022-06-13 14:35 ` Peter Zijlstra
2022-06-13 14:36   ` Peter Zijlstra
2022-06-13 14:38   ` Peter Zijlstra
2022-08-02  6:11     ` Ravi Bangoria
2022-08-22 15:29       ` Peter Zijlstra
2022-08-22 15:43         ` Peter Zijlstra
2022-08-22 16:37           ` Ravi Bangoria
2022-08-23  4:20             ` Ravi Bangoria
2022-08-29  3:54               ` Ravi Bangoria
2022-08-23  6:30             ` Peter Zijlstra
2022-08-29  4:00             ` Ravi Bangoria
2022-08-29 11:58               ` Peter Zijlstra
2022-08-22 16:52       ` Peter Zijlstra
2022-08-23  4:57         ` Ravi Bangoria
2022-06-13 14:41   ` Peter Zijlstra
2022-08-22 14:38     ` Ravi Bangoria
2022-06-13 14:43   ` Peter Zijlstra
2022-08-02  6:16     ` Ravi Bangoria
2022-08-23  8:57       ` Peter Zijlstra
2022-08-24  5:07         ` Ravi Bangoria
2022-08-24  7:27           ` Peter Zijlstra
2022-08-24  7:53             ` Ravi Bangoria
2022-06-13 14:55   ` Peter Zijlstra
2022-08-02  6:10     ` Ravi Bangoria
2022-08-22 16:44       ` Peter Zijlstra
2022-08-23  4:46         ` Ravi Bangoria
2022-06-17 13:36   ` Peter Zijlstra
2022-08-24 10:13     ` Peter Zijlstra
2022-06-27  4:18   ` Ravi Bangoria
2022-08-02  6:06     ` Ravi Bangoria
2022-08-24 12:15   ` Peter Zijlstra
2022-08-24 14:59     ` Peter Zijlstra
2022-08-25  5:39       ` Ravi Bangoria
2022-08-25  9:17         ` Peter Zijlstra
2022-08-25 11:03       ` Ravi Bangoria
2022-08-02  6:13 ` Ravi Bangoria
2022-08-23  7:10   ` Peter Zijlstra
2022-08-02  6:17 ` Ravi Bangoria
2022-08-23  7:26   ` Peter Zijlstra
2022-08-23 15:14     ` Ravi Bangoria
2022-08-22 14:40 ` Ravi Bangoria
