linux-perf-users.vger.kernel.org archive mirror
* Perf loses events without reporting
From: Stephen Brennan @ 2021-06-30 17:29 UTC
  To: linux-perf-users

Hi all,

I've been trying to understand the behavior of the x86_64 performance
monitoring interrupt, specifically when IRQ is disabled. Since it's an
NMI, it should still trigger and record events. However, I've noticed
that when interrupts are disabled for a long time, events seem to be
silently dropped, and I'm wondering if this is expected behavior.

To test this, I created a simple kernel module "irqoff" which creates a
file /proc/irqoff_sleep_millis. On write, the module uses
"spin_lock_irq()" to disable interrupts, and then issues an mdelay()
call for whatever number of milliseconds was written. This allows us to
busy wait with IRQ disabled. (Source for the module at the end of this
email).

When I use perf to record a write to this file, we see the following:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]

$ sudo perf script
# ... filtered down:
              sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
              sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
              sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
#                      ^ v   ~2 second gap!
              sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
              sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
              sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])

The perf stat shows the following counters over a similar run:

$ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

 Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':

          1,975.55 msec task-clock                #    0.999 CPUs utilized
                 1      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                61      page-faults               #    0.031 K/sec
     7,952,267,470      cycles                    #    4.025 GHz
       541,904,608      instructions              #    0.07  insn per cycle
        83,406,021      branches                  #   42.219 M/sec
            10,365      branch-misses             #    0.01% of all branches

       1.977234595 seconds time elapsed

       0.000000000 seconds user
       1.977162000 seconds sys

According to this, we should see roughly 79k samples (7.9 billion cycles
/ 100k sample period), but perf only gets 58. What it "looks like" to
me is that the CPU ring buffer might run out of space after several
events, and the perf process doesn't get scheduled soon enough to read
the data. But in my experience, perf usually reports that it missed some
events. So I wonder if anybody is familiar with the factors at play
when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
my exploration.

My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
mainline kernel too.

Thanks,
Stephen

Makefile:
<<<
obj-m += irqoff.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>>>

irqoff.c:
<<<
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");


// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);


static noinline void irqsoff_inirq_delay(unsigned long millis)
{
	mdelay(millis);
}


static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long usecs = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &usecs) != 1)
		return -EINVAL;

	/* The written value is a number of milliseconds, passed straight to
	   mdelay(). Threshold it at 5 minutes for safety. */
	if (usecs > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", usecs);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(usecs);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner                  = THIS_MODULE,
	.read                   = irqsoff_read,
	.write                  = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read		= irqsoff_read,
	.proc_write		= irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
}

module_init(irqoff_init);
module_exit(irqoff_exit);
>>>

* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-01 10:37 UTC
  To: Stephen Brennan; +Cc: linux-perf-users

On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
> Hi all,
> 
> I've been trying to understand the behavior of the x86_64 performance
> monitoring interrupt, specifically when IRQ is disabled. Since it's an
> NMI, it should still trigger and record events. However, I've noticed
> that when interrupts are disabled for a long time, events seem to be
> silently dropped, and I'm wondering if this is expected behavior.
> 
> To test this, I created a simple kernel module "irqoff" which creates a
> file /proc/irqoff_sleep_millis. On write, the module uses
> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
> call for whatever number of milliseconds was written. This allows us to
> busy wait with IRQ disabled. (Source for the module at the end of this
> email).
> 
> When I use perf to record a write to this file, we see the following:
> 
> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

seems strange.. I'll check

could you see that also when monitoring the cpu? like:

  $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..

jirka

> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
> 
> $ sudo perf script
> # ... filtered down:
>               sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>               sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>               sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
> #                      ^ v   ~2 second gap!
>               sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>               sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>               sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])
> 
> The perf stat shows the following counters over a similar run:
> 
> $ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
> 
>  Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':
> 
>           1,975.55 msec task-clock                #    0.999 CPUs utilized
>                  1      context-switches          #    0.001 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 61      page-faults               #    0.031 K/sec
>      7,952,267,470      cycles                    #    4.025 GHz
>        541,904,608      instructions              #    0.07  insn per cycle
>         83,406,021      branches                  #   42.219 M/sec
>             10,365      branch-misses             #    0.01% of all branches
> 
>        1.977234595 seconds time elapsed
> 
>        0.000000000 seconds user
>        1.977162000 seconds sys
> 
> According to this, we should see roughly 79k samples (7.9 billion cycles
> / 100k sample period), but perf only gets 58. What it "looks like" to
> me is that the CPU ring buffer might run out of space after several
> events, and the perf process doesn't get scheduled soon enough to read
> the data. But in my experience, perf usually reports that it missed some
> events. So I wonder if anybody is familiar with the factors at play
> when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
> my exploration.
> 
> My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
> have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
> mainline kernel too.
> 
> Thanks,
> Stephen
> 
> Makefile:
> <<<
> obj-m += irqoff.o
> 
> all:
> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
> 
> clean:
> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
> >>>
> 
> irqoff.c:
> <<<
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/init.h>
> #include <linux/proc_fs.h>
> #include <linux/spinlock.h>
> #include <linux/delay.h>
> #include <linux/version.h>
> #include <linux/uaccess.h>
> 
> MODULE_LICENSE("GPL");
> MODULE_DESCRIPTION("Test module that allows disabling IRQs for a configurable time");
> MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");
> 
> 
> // Store the proc dir entry we can use to check status
> struct proc_dir_entry *pde = NULL;
> 
> DEFINE_SPINLOCK(irqoff_lock);
> 
> 
> static noinline void irqsoff_inirq_delay(unsigned long millis)
> {
> 	mdelay(millis);
> }
> 
> 
> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> {
> 	char buf[32];
> 	int rv;
> 	unsigned long usecs = 0;
> 
> 	if (amt > sizeof(buf) - 1)
> 		return -EFBIG;
> 
> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> 		return -EFAULT;
> 
> 	buf[amt] = '\0';
> 
> 	if (sscanf(buf, "%lu", &usecs) != 1)
> 		return -EINVAL;
> 
> 	/* The written value is a number of milliseconds, passed straight to
> 	   mdelay(). Threshold it at 5 minutes for safety. */
> 	if (usecs > 5 * 60 * 1000)
> 		return -EINVAL;
> 
> 	pr_info("[irqoff] lock for %lu millis\n", usecs);
> 	spin_lock_irq(&irqoff_lock);
> 	irqsoff_inirq_delay(usecs);
> 	spin_unlock_irq(&irqoff_lock);
> 
> 	return amt;
> }
> 
> static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
> {
> 	return 0;
> }
> 
> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
> static const struct file_operations irqsoff_fops = {
> 	.owner                  = THIS_MODULE,
> 	.read                   = irqsoff_read,
> 	.write                  = irqsoff_write,
> };
> #else
> static const struct proc_ops irqsoff_fops = {
> 	.proc_read		= irqsoff_read,
> 	.proc_write		= irqsoff_write,
> };
> #endif
> 
> static int irqoff_init(void)
> {
> 	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
> 	if (!pde)
> 		return -ENOENT;
> 
> 	pr_info("[irqoff] successfully initialized\n");
> 	return 0;
> }
> 
> static void irqoff_exit(void)
> {
> 	proc_remove(pde);
> 	pde = NULL;
> }
> 
> module_init(irqoff_init);
> module_exit(irqoff_exit);
> >>>
> 


* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 16:45 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Jiri Olsa <jolsa@redhat.com> writes:

> On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
>> Hi all,
>> 
>> I've been trying to understand the behavior of the x86_64 performance
>> monitoring interrupt, specifically when IRQ is disabled. Since it's an
>> NMI, it should still trigger and record events. However, I've noticed
>> that when interrupts are disabled for a long time, events seem to be
>> silently dropped, and I'm wondering if this is expected behavior.
>> 
>> To test this, I created a simple kernel module "irqoff" which creates a
>> file /proc/irqoff_sleep_millis. On write, the module uses
>> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
>> call for whatever number of milliseconds was written. This allows us to
>> busy wait with IRQ disabled. (Source for the module at the end of this
>> email).
>> 
>> When I use perf to record a write to this file, we see the following:
>> 
>> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>
> seems strange.. I'll check
>
> could you see that also when monitoring the cpu? like:
>
>   $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..
>
> jirka

Thanks for taking a look. I tried monitoring the CPU instead of the task
and saw the exact same result -- about a two second gap in the resulting
data when viewed through perf script. I tried specifying a few different
CPUs just in case.

I've reproduced this behavior on a few bare-metal (no VM) systems with
different CPUs, and my initial observation is that it seems to be Intel
specific. Below are two Intel CPUs which have the exact same behavior:
(a) perf stat shows the cycles counter in the billions, (b) orders of
magnitude fewer samples than expected, and (c) large ~2s gap in the
data.

** Server (v5.4 based)
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
Stepping:            1

** Laptop (v5.11 based)
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Stepping:                        12

On the other hand, with AMD, I get similarly few samples, but the data
seems correct. Here is the CPU info for my AMD test system:

** AMD Server (v5.4 based)
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7551 32-Core Processor
Stepping:            2

$ sudo perf stat -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'                          

 Performance counter stats for 'sh -c echo 2000 >/proc/irqoff_sleep_millis':

          2,000.83 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
               130      page-faults               #    0.065 K/sec                  
         7,196,575      cycles                    #    0.004 GHz                      (99.75%)
     3,606,261,791      stalled-cycles-frontend   # 50110.81% frontend cycles idle    (0.29%)
         3,156,105      stalled-cycles-backend    #   43.86% backend cycles idle      (99.99%)
         3,399,963      instructions              #    0.47  insn per cycle         
                                                  # 1060.68  stalled cycles per insn
           655,974      branches                  #    0.328 M/sec                  
            12,880      branch-misses             #    1.96% of all branches          (99.97%)

       2.001346768 seconds time elapsed

       0.396201000 seconds user
       1.605213000 seconds sys

Since the cycles counter is so low, I assume that the AMD PMU doesn't
count idle/stalled/nop cycles from the mdelay() function call, which
seems reasonable. Maybe this is an example of frequency scaling? The
recorded perf data during the mdelay call shows no gap, just very few
samples that uniformly stretch over the runtime:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'
...
$ sudo perf script
...
              sh 14305   683.219192:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.300989:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.383783:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.465580:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.547376:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.629173:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.710970:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.793764:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.875561:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.957357:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.039154:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.120951:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.203745:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.285542:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.367339:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.449135:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.530932:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.613726:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.695523:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.778317:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.861111:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.942908:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.978837:     100000 cycles:  ffffffffb0203d71 syscall_slow_exit_work+0x111 ([kernel.kallsyms])
              sh 14305   684.978890:     100000 cycles:  ffffffffb029d8a1 __raw_spin_unlock+0x1 ([kernel.kallsyms])
              sh 14305   684.978922:     100000 cycles:  ffffffffb04502f9 unmap_page_range+0x639 ([kernel.kallsyms]
...

While this sort of data (very few data points during a long period of
time) is not ideal for my debugging, it's at least consistent, and a
well-known gotcha of perf. I also double checked monitoring a single CPU
on AMD (rather than task) and saw the same results.

So hopefully this information helps narrow down the exploration. It
seems to be Intel PMU specific!

Thanks,
Stephen

>
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
>> 
>> $ sudo perf script
>> # ... filtered down:
>>               sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>>               sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>>               sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>> #                      ^ v   ~2 second gap!
>>               sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])
>> 
>> The perf stat shows the following counters over a similar run:
>> 
>> $ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>> 
>>  Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':
>> 
>>           1,975.55 msec task-clock                #    0.999 CPUs utilized
>>                  1      context-switches          #    0.001 K/sec
>>                  0      cpu-migrations            #    0.000 K/sec
>>                 61      page-faults               #    0.031 K/sec
>>      7,952,267,470      cycles                    #    4.025 GHz
>>        541,904,608      instructions              #    0.07  insn per cycle
>>         83,406,021      branches                  #   42.219 M/sec
>>             10,365      branch-misses             #    0.01% of all branches
>> 
>>        1.977234595 seconds time elapsed
>> 
>>        0.000000000 seconds user
>>        1.977162000 seconds sys
>> 
>> According to this, we should see roughly 79k samples (7.9 billion cycles
>> / 100k sample period), but perf only gets 58. What it "looks like" to
>> me is that the CPU ring buffer might run out of space after several
>> events, and the perf process doesn't get scheduled soon enough to read
>> the data. But in my experience, perf usually reports that it missed some
>> events. So I wonder if anybody is familiar with the factors at play
>> when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
>> my exploration.
>> 
>> My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
>> have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
>> mainline kernel too.
>> 
>> Thanks,
>> Stephen
>> 
>> Makefile:
>> <<<
>> obj-m += irqoff.o
>> 
>> all:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
>> 
>> clean:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>> >>>
>> 
>> irqoff.c:
>> <<<
>> #include <linux/module.h>
>> #include <linux/kernel.h>
>> #include <linux/init.h>
>> #include <linux/proc_fs.h>
>> #include <linux/spinlock.h>
>> #include <linux/delay.h>
>> #include <linux/version.h>
>> #include <linux/uaccess.h>
>> 
>> MODULE_LICENSE("GPL");
>> MODULE_DESCRIPTION("Test module that allows disabling IRQs for a configurable time");
>> MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");
>> 
>> 
>> // Store the proc dir entry we can use to check status
>> struct proc_dir_entry *pde = NULL;
>> 
>> DEFINE_SPINLOCK(irqoff_lock);
>> 
>> 
>> static noinline void irqsoff_inirq_delay(unsigned long millis)
>> {
>> 	mdelay(millis);
>> }
>> 
>> 
>> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
>> {
>> 	char buf[32];
>> 	int rv;
>> 	unsigned long usecs = 0;
>> 
>> 	if (amt > sizeof(buf) - 1)
>> 		return -EFBIG;
>> 
>> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
>> 		return -EFAULT;
>> 
>> 	buf[amt] = '\0';
>> 
>> 	if (sscanf(buf, "%lu", &usecs) != 1)
>> 		return -EINVAL;
>> 
>> 	/* The written value is a number of milliseconds, passed straight to
>> 	   mdelay(). Threshold it at 5 minutes for safety. */
>> 	if (usecs > 5 * 60 * 1000)
>> 		return -EINVAL;
>> 
>> 	pr_info("[irqoff] lock for %lu millis\n", usecs);
>> 	spin_lock_irq(&irqoff_lock);
>> 	irqsoff_inirq_delay(usecs);
>> 	spin_unlock_irq(&irqoff_lock);
>> 
>> 	return amt;
>> }
>> 
>> static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
>> {
>> 	return 0;
>> }
>> 
>> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>> static const struct file_operations irqsoff_fops = {
>> 	.owner                  = THIS_MODULE,
>> 	.read                   = irqsoff_read,
>> 	.write                  = irqsoff_write,
>> };
>> #else
>> static const struct proc_ops irqsoff_fops = {
>> 	.proc_read		= irqsoff_read,
>> 	.proc_write		= irqsoff_write,
>> };
>> #endif
>> 
>> static int irqoff_init(void)
>> {
>> 	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
>> 	if (!pde)
>> 		return -ENOENT;
>> 
>> 	pr_info("[irqoff] successfully initialized\n");
>> 	return 0;
>> }
>> 
>> static void irqoff_exit(void)
>> {
>> 	proc_remove(pde);
>> 	pde = NULL;
>> }
>> 
>> module_init(irqoff_init);
>> module_exit(irqoff_exit);
>> >>>
>> 

* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 17:05 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:
> So hopefully this information helps narrow down the exploration. It
> seems to be Intel PMU specific!

Or more accurately, we only observe it on the Intel systems I've
checked. The AMD systems record fewer cycle events in mdelay() and thus
may not have a chance to skip events. I'll try updating my kernel module
to busy wait by looping until jiffies is above a particular value. This
would hopefully avoid any CPU-specific idling which would reduce the
cycle count on AMD. I'll update this thread with my findings after
testing.

Thanks again,
Stephen

* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 18:00 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:

> Stephen Brennan <stephen.s.brennan@oracle.com> writes:
>> So hopefully this information helps narrow down the exploration. It
>> seems to be Intel PMU specific!
>
> Or more accurately, we only observe it on the Intel systems I've
> checked. The AMD systems record fewer cycle events in mdelay() and thus
> may not have a chance to skip events. I'll try updating my kernel module
> to busy wait by looping until jiffies is above a particular value. This
> would hopefully avoid any CPU-specific idling which would reduce the
> cycle count on AMD. I'll update this thread with my findings after
> testing.

Quite sorry for the spam, but I have completed my testing on AMD. By
abandoning mdelay() and using a jiffies-based busy loop (updated module
code at the end of this message), I was able to reproduce the 2 second
gap in the perf data. It seems that the difference here has more to do
with the implementation of mdelay.

Regards,
Stephen

Updated module code:

/*
 * irqoff.c: Linux kernel module for disabling interrupts and delaying for a
 * certain amount of time. Useful for experimenting with perf, lockup detector,
 * and more.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <linux/jiffies.h>

#include <asm/msr.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");


// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);

bool use_mdelay = false;
module_param(use_mdelay, bool, false);


static noinline void irqsoff_inirq_manual_delay(unsigned long millis)
{
	uint64_t start_tsc, tsc;
	unsigned long jiffies_start = jiffies;
	unsigned long jiffies_end = jiffies + millis * HZ / 1000;

	/*
	 * With interrupts disabled, it's possible that jiffies may not get
	 * updated! To account for this, we'll use the x86_64 cycle count too.
	 * After 100 million cycles, if jiffies still has not been updated, then it's likely
	 * that we've accidentally disabled the timer interrupt and we can't
	 * accurately time our sleep, so we should bail out.
	 */
	#define TSC_EMERGENCY 100000000
	start_tsc = rdtsc();

	pr_info("[irqoff] manual delay %lu millis (jiffies cur=%lu target=%lu)\n",
		millis, jiffies_start, jiffies_end);
	while (time_before(jiffies, jiffies_end)) {
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		tsc = rdtsc();
		if ((tsc >= start_tsc + TSC_EMERGENCY) && (jiffies == jiffies_start)) {
			pr_info(
				"[irqoff] looks like we disabled timer interrupt, "
				"bailing! (TSC start %llu, cur %llu)\n", start_tsc, tsc);
			break;
		}
	}
	pr_info("[irqoff] manual delay complete, jiffies: %lu\n", jiffies);
}

static void irqsoff_inirq_delay(unsigned long millis)
{
	if (use_mdelay) {
		mdelay(millis);
	} else {
		irqsoff_inirq_manual_delay(millis);
	}
}

static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long millis = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &millis) != 1)
		return -EINVAL;

	/* Threshold it at 5 minutes for safety. */
	if (millis > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", millis);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(millis);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner                  = THIS_MODULE,
	.read                   = irqsoff_read,
	.write                  = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read		= irqsoff_read,
	.proc_write		= irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
	pr_info("[irqoff] exited\n");
}

module_init(irqoff_init);
module_exit(irqoff_exit);

* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-02 16:22 UTC
  To: Stephen Brennan; +Cc: linux-perf-users

On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:

SNIP

> 
> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> {
> 	char buf[32];
> 	int rv;
> 	unsigned long millis = 0;
> 
> 	if (amt > sizeof(buf) - 1)
> 		return -EFBIG;
> 
> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> 		return -EFAULT;
> 
> 	buf[amt] = '\0';
> 
> 	if (sscanf(buf, "%lu", &millis) != 1)
> 		return -EINVAL;
> 
> 	/* Threshold it at 5 minutes for safety. */
> 	if (millis > 5 * 60 * 1000)
> 		return -EINVAL;
> 
> 	pr_info("[irqoff] lock for %lu millis\n", millis);
> 	spin_lock_irq(&irqoff_lock);
> 	irqsoff_inirq_delay(millis);
> 	spin_unlock_irq(&irqoff_lock);

yea, I can see that as well.. if I took the spin_lock_irq it works as
expected.. will check

thanks,
jirka


* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-02 20:12 UTC
  To: Stephen Brennan; +Cc: linux-perf-users, Peter Zijlstra

On Fri, Jul 02, 2021 at 06:22:10PM +0200, Jiri Olsa wrote:
> On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:
> 
> SNIP
> 
> > 
> > static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> > {
> > 	char buf[32];
> > 	int rv;
> > 	unsigned long millis = 0;
> > 
> > 	if (amt > sizeof(buf) - 1)
> > 		return -EFBIG;
> > 
> > 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> > 		return -EFAULT;
> > 
> > 	buf[amt] = '\0';
> > 
> > 	if (sscanf(buf, "%lu", &millis) != 1)
> > 		return -EINVAL;
> > 
> > 	/* Threshold it at 5 minutes for safety. */
> > 	if (millis > 5 * 60 * 1000)
> > 		return -EINVAL;
> > 
> > 	pr_info("[irqoff] lock for %lu millis\n", millis);
> > 	spin_lock_irq(&irqoff_lock);
> > 	irqsoff_inirq_delay(millis);
> > 	spin_unlock_irq(&irqoff_lock);
> 
> yea, I can see that as well.. if I took the spin_lock_irq it works as
> expected.. will check

ok, so by disabling irqs you disable the task tick, which
is what kicks off the unthrottling

throttling is started when you cross the limit of samples
per tick, which you do real fast with busy waiting

then when the event is throttled (and no samples are generated),
the next task tick restores it.. but with irqs disabled,
there's no task tick, so the event stays throttled the whole
time irqs are disabled
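
if it helps, here's a toy user-space model of that mechanism, just as
an illustration (the real logic lives in kernel/events/core.c; the
names and numbers below are made up, the real per-tick limit is derived
from the kernel.perf_event_max_sample_rate sysctl):

<<<
/* throttle_sim.c: toy model of perf's per-tick sample throttling.
 * PMIs arriving after the per-tick budget is spent are dropped, and
 * the budget is only refilled by the timer tick.  No tick (irqs
 * disabled) means the event stays throttled. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_SAMPLES_PER_TICK 100	/* stand-in for the real limit */

static int interrupts;			/* PMIs seen since the last tick */
static bool throttled;

static bool pmi(void)			/* true if a sample gets recorded */
{
	if (throttled)
		return false;
	if (++interrupts >= MAX_SAMPLES_PER_TICK) {
		throttled = true;	/* counter stopped until next tick */
		return false;
	}
	return true;
}

static void tick(void)			/* what the timer tick would do */
{
	interrupts = 0;
	throttled = false;
}

int main(void)
{
	int recorded = 0, i, t;

	/* irqs enabled: the tick clears the throttle every time */
	for (t = 0; t < 10; t++) {
		for (i = 0; i < 1000; i++)
			recorded += pmi();
		tick();
	}
	printf("with ticks:    %d samples out of 10000 PMIs\n", recorded);

	/* same PMI load, but the tick never runs ("irqs disabled") */
	recorded = 0;
	tick();
	for (i = 0; i < 10000; i++)
		recorded += pmi();
	printf("without ticks: %d samples out of 10000 PMIs\n", recorded);
	return 0;
}
>>>

the first run keeps recording across ticks, the second stops for good
after the first hundred or so PMIs.. same shape as your trace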

is there any use case you have to have irqs disabled for so long?

cc-ing peterz
initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637

jirka


* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-02 22:48 UTC
  To: Jiri Olsa; +Cc: linux-perf-users, Peter Zijlstra

Jiri Olsa <jolsa@redhat.com> writes:
> On Fri, Jul 02, 2021 at 06:22:10PM +0200, Jiri Olsa wrote:
>> On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:
>> 
>> SNIP
>> 
>> > 
>> > static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
>> > {
>> > 	char buf[32];
>> > 	int rv;
>> > 	unsigned long millis = 0;
>> > 
>> > 	if (amt > sizeof(buf) - 1)
>> > 		return -EFBIG;
>> > 
>> > 	if ((rv = copy_from_user(buf, data, amt)) != 0)
>> > 		return -EFAULT;
>> > 
>> > 	buf[amt] = '\0';
>> > 
>> > 	if (sscanf(buf, "%lu", &millis) != 1)
>> > 		return -EINVAL;
>> > 
>> > 	/* Threshold it at 5 minutes for safety. */
>> > 	if (millis > 5 * 60 * 1000)
>> > 		return -EINVAL;
>> > 
>> > 	pr_info("[irqoff] lock for %lu millis\n", millis);
>> > 	spin_lock_irq(&irqoff_lock);
>> > 	irqsoff_inirq_delay(millis);
>> > 	spin_unlock_irq(&irqoff_lock);
>> 
>> yea, I can see that as well.. if I took the spin_lock_irq it works as
>> expected.. will check
>
> ok, so by disabling irqs you disable the task tick, which
> is what kicks off the unthrottling
>
> throttling is started when you cross the limit of samples
> per tick, which you do real fast with busy waiting
>
> then when the event is throttled (and no samples are generated),
> the next task tick restores it.. but with irqs disabled,
> there's no task tick, so the event stays throttled the whole
> time irqs are disabled

Ah, this makes sense, I'll take a look at the throttling code to
understand it a bit better myself.

>
> is there any use case you have to have irqs disabled for so long?

I don't have a good use case for keeping IRQs disabled so long. However,
one of my use cases for perf is as a debugging tool for anything which
doesn't result in panics. One common enough issue is spinlock
contention, during which IRQs can be disabled. Normally if a lock is
contended, IRQs are disabled and then reenabled very frequently. In
these cases, you wouldn't necessarily notice lost events because either
(a) they wouldn't be lost, or (b) the event could be unthrottled
frequently. But sometimes, a buggy lock is held for a longer time, and the
event loss can make it hard to see the contention in perf.

For example, in a recent issue we saw a spinlock held with IRQs disabled
for several seconds (causing other threads to pile up waiting). The wait
was long enough to cause serious issues with devices due to missed
interrupts, but short enough to not trigger the configured lockup
detector. Our perf recording didn't show any particular "hot functions"
at this spinlock, since so few samples were captured there, despite the
large amount of time spent on it.

Ideally for this sort of debugging it would be best to have all the
samples. However, failing that, it would be useful to have a reliable
detection mechanism for these sorts of gaps. I have a script right now
for frequency-based sampling, which detects "holes" -- whenever the
period between samples is more than 10x the expected sample period. Of
course the sample period can never be perfect, but 10x seems to be a
reasonable threshold that only really gets hit during idle (due to
frequency scaling, I'd guess). This script is decent but still has false
positives.
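
For reference, the detection logic is roughly along these lines (a
sketch, not the exact script I use; it assumes perf script's default
output, where the timestamp is the first field that ends with ':' and
contains a '.'):

<<<
/* holes.c: flag gaps in `perf script` output that are larger than 10x
 * the expected inter-sample period.
 * Build: cc -o holes holes.c
 * Use:   sudo perf script | ./holes 0.000025   (~25us for -c 100000 at ~4GHz)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	double expect = (argc > 1) ? atof(argv[1]) : 0.001; /* seconds */
	double prev = -1.0;
	char line[4096];

	while (fgets(line, sizeof(line), stdin)) {
		double ts = -1.0;
		char *tok = strtok(line, " \t\n");

		/* the timestamp field looks like "52318.991716:" */
		for (; tok; tok = strtok(NULL, " \t\n")) {
			size_t n = strlen(tok);
			if (n > 2 && tok[n - 1] == ':' && strchr(tok, '.')) {
				ts = atof(tok);
				break;
			}
		}
		if (ts < 0)
			continue;	/* no timestamp on this line */
		if (prev >= 0.0 && ts - prev > 10.0 * expect)
			printf("hole: %.6f -> %.6f (%.3f s gap)\n",
			       prev, ts, ts - prev);
		prev = ts;
	}
	return 0;
}
>>>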

Thanks for this analysis, it's quite helpful!

Stephen

>
> cc-ing peterz
> initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637
>
> jirka
