From: Stephen Brennan <stephen.s.brennan@oracle.com>
To: Jiri Olsa <jolsa@redhat.com>
Cc: linux-perf-users@vger.kernel.org
Subject: Re: Perf loses events without reporting
Date: Thu, 01 Jul 2021 09:45:45 -0700	[thread overview]
Message-ID: <87im1uc79i.fsf@stepbren-lnx.us.oracle.com> (raw)
In-Reply-To: <YN2a3pSvealjK6uR@krava>

Jiri Olsa <jolsa@redhat.com> writes:

> On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
>> Hi all,
>> 
>> I've been trying to understand the behavior of the x86_64 performance
>> monitoring interrupt, specifically when IRQs are disabled. Since it's an
>> NMI, it should still trigger and record events. However, I've noticed
>> that when interrupts are disabled for a long time, events seem to be
>> silently dropped, and I'm wondering if this is expected behavior.
>> 
>> To test this, I created a simple kernel module "irqoff" which creates a
>> file /proc/irqoff_sleep_millis. On write, the module uses
>> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
>> call for whatever number of milliseconds was written. This allows us to
>> busy wait with IRQ disabled. (Source for the module at the end of this
>> email).
>> 
>> When I use perf to record a write to this file, we see the following:
>> 
>> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>
> seems strange.. I'll check
>
> could you see that also when monitoring the cpu? like:
>
>   $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..
>
> jirka

Thanks for taking a look. I tried monitoring the CPU instead of the task
and saw the exact same result -- about a two second gap in the resulting
data when viewed through perf script. I tried specifying a few different
CPUs just in case.
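
In case it's useful, this is roughly how I've been spotting the gap in the
perf script output -- just a quick sketch that assumes the timestamp is the
third whitespace-separated field of the default perf script format, with an
arbitrary 0.5s threshold:

$ sudo perf script | awk '
      { t = $3; sub(/:$/, "", t) }            # timestamp is field 3, strip trailing ":"
      prev != "" && t - prev > 0.5 {          # flag anything bigger than 0.5s
              printf "%.3f s gap before: %s\n", t - prev, $0
      }
      { prev = t }'

On the Intel systems that corresponds to the single ~2 second gap right where
the mdelay() runs; on AMD nothing is flagged.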

I've reproduced this behavior on a few bare-metal (no VM) systems with
different CPUs, and my initial observation is that it seems to be
Intel-specific. Below are two Intel CPUs that show exactly the same behavior:
(a) perf stat shows the cycles counter in the billions, (b) perf record
collects orders of magnitude fewer samples than expected, and (c) there is a
large ~2s gap in the data.

** Server (v5.4 based)
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
Stepping:            1

** Laptop (v5.11 based)
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Stepping:                        12

On the other hand, with AMD, I get similarly few samples, but the data
seems correct. Here is the CPU info for my AMD test system:

** AMD Server (v5.4 based)
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7551 32-Core Processor
Stepping:            2

$ sudo perf stat -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'                          

 Performance counter stats for 'sh -c echo 2000 >/proc/irqoff_sleep_millis':

          2,000.83 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
               130      page-faults               #    0.065 K/sec                  
         7,196,575      cycles                    #    0.004 GHz                      (99.75%)
     3,606,261,791      stalled-cycles-frontend   # 50110.81% frontend cycles idle    (0.29%)
         3,156,105      stalled-cycles-backend    #   43.86% backend cycles idle      (99.99%)
         3,399,963      instructions              #    0.47  insn per cycle         
                                                  # 1060.68  stalled cycles per insn
           655,974      branches                  #    0.328 M/sec                  
            12,880      branch-misses             #    1.96% of all branches          (99.97%)

       2.001346768 seconds time elapsed

       0.396201000 seconds user
       1.605213000 seconds sys

Since the cycles counter is so low, I assume that the AMD PMU doesn't count
the idle/stalled cycles spent inside the mdelay() call, which seems
reasonable: every sample during the delay lands in delay_mwaitx, so the core
is presumably halted in MWAITX rather than spinning. Or maybe this is an
effect of frequency scaling? Either way, the recorded perf data during the
mdelay call shows no gap, just very few samples spread uniformly across the
runtime:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'
...
$ sudo perf script
...
              sh 14305   683.219192:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.300989:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.383783:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.465580:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.547376:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.629173:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.710970:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.793764:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.875561:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.957357:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.039154:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.120951:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.203745:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.285542:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.367339:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.449135:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.530932:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.613726:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.695523:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.778317:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.861111:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.942908:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.978837:     100000 cycles:  ffffffffb0203d71 syscall_slow_exit_work+0x111 ([kernel.kallsyms])
              sh 14305   684.978890:     100000 cycles:  ffffffffb029d8a1 __raw_spin_unlock+0x1 ([kernel.kallsyms])
              sh 14305   684.978922:     100000 cycles:  ffffffffb04502f9 unmap_page_range+0x639 ([kernel.kallsyms]
...
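
If I wanted to tell "the core is halted in MWAITX" apart from "frequency
scaling", one way might be to count the msr PMU events alongside cycles,
since the TSC keeps ticking at a constant rate even while the core is halted.
An untested sketch, assuming the msr PMU is exposed on this box:

$ sudo perf stat -e cycles -e msr/tsc/ -e msr/aperf/ -e msr/mperf/ -C 1 -- \
      taskset -c 1 sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

If msr/tsc/ comes back with roughly two seconds' worth of reference cycles
while cycles stays in the single-digit millions, the core is mostly halted
rather than merely downclocked; the aperf/mperf ratio would show the
effective frequency for whatever time it does spend unhalted.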

While this sort of data (very few data points over a long period of time) is
not ideal for my debugging, it's at least consistent, and it's a well-known
perf gotcha. I also double-checked monitoring a single CPU on AMD (rather
than the task) and saw the same results.

So hopefully this information helps narrow down the exploration: the problem
seems to be specific to the Intel PMU!
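
One more thing that might be worth checking on the Intel systems is whether
any LOST records made it into the perf.data file at all, since that would
separate "samples were dropped from the ring buffer" from "samples were never
written". Something like this, assuming a perf new enough to have --stats
(perf report -D prints the same aggregate at the end of its dump):

$ sudo perf report --stats | grep -Ei 'sample|lost'

If LOST stays at zero there, then presumably the missing samples never
reached the buffer in the first place.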

Thanks,
Stephen

>
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
>> 
>> $ sudo perf script
>> # ... filtered down:
>>               sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>>               sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>>               sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>> #                      ^ v   ~2 second gap!
>>               sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])
>> 
>> The perf stat shows the following counters over a similar run:
>> 
>> $ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>> 
>>  Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':
>> 
>>           1,975.55 msec task-clock                #    0.999 CPUs utilized
>>                  1      context-switches          #    0.001 K/sec
>>                  0      cpu-migrations            #    0.000 K/sec
>>                 61      page-faults               #    0.031 K/sec
>>      7,952,267,470      cycles                    #    4.025 GHz
>>        541,904,608      instructions              #    0.07  insn per cycle
>>         83,406,021      branches                  #   42.219 M/sec
>>             10,365      branch-misses             #    0.01% of all branches
>> 
>>        1.977234595 seconds time elapsed
>> 
>>        0.000000000 seconds user
>>        1.977162000 seconds sys
>> 
>> According to this, we should see roughly 79k samples (7.9 billion cycles
>> / 100k sample period), but perf only records 58. What it "looks like" to
>> me is that the per-CPU ring buffer might run out of space after several
>> events and the perf process doesn't get scheduled soon enough to read
>> the data -- but in my experience, perf usually reports when it has missed
>> events. So I wonder if anybody is familiar with the factors at play when
>> a PMI fires while IRQs are disabled? I'd appreciate any pointers to
>> guide my exploration.
>> 
>> My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
>> have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
>> mainline kernel too.
>> 
>> Thanks,
>> Stephen
>> 
>> Makefile:
>> <<<
>> obj-m += irqoff.o
>> 
>> all:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
>> 
>> clean:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>> >>>
>> 
>> irqoff.c:
>> <<<
>> #include <linux/module.h>
>> #include <linux/kernel.h>
>> #include <linux/init.h>
>> #include <linux/proc_fs.h>
>> #include <linux/spinlock.h>
>> #include <linux/delay.h>
>> #include <linux/version.h>
>> #include <linux/uaccess.h>
>> 
>> MODULE_LICENSE("GPL");
>> MODULE_DESCRIPTION("Test module that disables IRQs for a configurable time");
>> MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");
>> 
>> 
>> // Store the proc dir entry we can use to check status
>> struct proc_dir_entry *pde = NULL;
>> 
>> DEFINE_SPINLOCK(irqoff_lock);
>> 
>> 
>> static noinline void irqsoff_inirq_delay(unsigned long millis)
>> {
>> 	mdelay(millis);
>> }
>> 
>> 
>> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
>> {
>> 	char buf[32];
>> 	int rv;
>> 	unsigned long millis = 0;
>> 
>> 	if (amt > sizeof(buf) - 1)
>> 		return -EFBIG;
>> 
>> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
>> 		return -EFAULT;
>> 
>> 	buf[amt] = '\0';
>> 
>> 	if (sscanf(buf, "%lu", &millis) != 1)
>> 		return -EINVAL;
>> 
>> 	/* The value written is a number of milliseconds, passed straight to
>> 	   mdelay(). Cap it at 5 minutes for safety. */
>> 	if (millis > 5 * 60 * 1000)
>> 		return -EINVAL;
>> 
>> 	pr_info("[irqoff] lock for %lu millis\n", millis);
>> 	spin_lock_irq(&irqoff_lock);
>> 	irqsoff_inirq_delay(millis);
>> 	spin_unlock_irq(&irqoff_lock);
>> 
>> 	return amt;
>> }
>> 
>> static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
>> {
>> 	return 0;
>> }
>> 
>> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>> static const struct file_operations irqsoff_fops = {
>> 	.owner                  = THIS_MODULE,
>> 	.read                   = irqsoff_read,
>> 	.write                  = irqsoff_write,
>> };
>> #else
>> static const struct proc_ops irqsoff_fops = {
>> 	.proc_read		= irqsoff_read,
>> 	.proc_write		= irqsoff_write,
>> };
>> #endif
>> 
>> static int irqoff_init(void)
>> {
>> 	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
>> 	if (!pde)
>> 		return -ENOENT;
>> 
>> 	pr_info("[irqoff] successfully initialized\n");
>> 	return 0;
>> }
>> 
>> static void irqoff_exit(void)
>> {
>> 	proc_remove(pde);
>> 	pde = NULL;
>> }
>> 
>> module_init(irqoff_init);
>> module_exit(irqoff_exit);
>> >>>
>> 
