linux-perf-users.vger.kernel.org archive mirror
* Perf loses events without reporting
From: Stephen Brennan @ 2021-06-30 17:29 UTC
  To: linux-perf-users

Hi all,

I've been trying to understand the behavior of the x86_64 performance
monitoring interrupt, specifically when IRQ is disabled. Since it's an
NMI, it should still trigger and record events. However, I've noticed
that when interrupts are disabled for a long time, events seem to be
silently dropped, and I'm wondering if this is expected behavior.

To test this, I created a simple kernel module "irqoff" which creates a
file /proc/irqoff_sleep_millis. On write, the module uses
"spin_lock_irq()" to disable interrupts, and then issues an mdelay()
call for whatever number of milliseconds was written. This allows us to
busy wait with IRQ disabled. (Source for the module at the end of this
email).

When I use perf to record a write to this file, we see the following:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]

$ sudo perf script
# ... filtered down:
              sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
              sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
              sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
#                      ^ v   ~2 second gap!
              sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
              sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
              sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])

The perf stat shows the following counters over a similar run:

$ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

 Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':

          1,975.55 msec task-clock                #    0.999 CPUs utilized
                 1      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                61      page-faults               #    0.031 K/sec
     7,952,267,470      cycles                    #    4.025 GHz
       541,904,608      instructions              #    0.07  insn per cycle
        83,406,021      branches                  #   42.219 M/sec
            10,365      branch-misses             #    0.01% of all branches

       1.977234595 seconds time elapsed

       0.000000000 seconds user
       1.977162000 seconds sys

According to this, we should see roughly 79k samples (7.9 billion cycles
/ 100k sample period), but perf only gets 58. What it "looks like" to
me is that the CPU ring buffer might run out of space after several
events, and the perf process doesn't get scheduled soon enough to read
the data. But in my experience, perf usually reports that it missed some
events. So I wonder if anybody is familiar with the factors at play
when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
my exploration.

My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
mainline kernel too.

Thanks,
Stephen

Makefile:
<<<
obj-m += irqoff.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>>>

irqoff.c:
<<<
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");


// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);


static noinline void irqsoff_inirq_delay(unsigned long millis)
{
	mdelay(millis);
}


static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long usecs = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &usecs) != 1)
		return -EINVAL;

	/* The written value is a number of milliseconds, passed straight to
	   mdelay(). Threshold it at 5 minutes for safety. */
	if (usecs > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", usecs);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(usecs);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner                  = THIS_MODULE,
	.read                   = irqsoff_read,
	.write                  = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read		= irqsoff_read,
	.proc_write		= irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
}

module_init(irqoff_init);
module_exit(irqoff_exit);
>>>

* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-01 10:37 UTC
  To: Stephen Brennan; +Cc: linux-perf-users

On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
> Hi all,
> 
> I've been trying to understand the behavior of the x86_64 performance
> monitoring interrupt, specifically when IRQ is disabled. Since it's an
> NMI, it should still trigger and record events. However, I've noticed
> that when interrupts are disabled for a long time, events seem to be
> silently dropped, and I'm wondering if this is expected behavior.
> 
> To test this, I created a simple kernel module "irqoff" which creates a
> file /proc/irqoff_sleep_millis. On write, the module uses
> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
> call for whatever number of milliseconds was written. This allows us to
> busy wait with IRQ disabled. (Source for the module at the end of this
> email).
> 
> When I use perf to record a write to this file, we see the following:
> 
> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'

seems strange.. I'll check

could you see that also when monitoring the cpu? like:

  $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..

jirka

> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
> 
> $ sudo perf script
> # ... filtered down:
>               sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>               sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>               sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
> #                      ^ v   ~2 second gap!
>               sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>               sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>               sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])
> 
> The perf stat shows the following counters over a similar run:
> 
> $ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
> 
>  Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':
> 
>           1,975.55 msec task-clock                #    0.999 CPUs utilized
>                  1      context-switches          #    0.001 K/sec
>                  0      cpu-migrations            #    0.000 K/sec
>                 61      page-faults               #    0.031 K/sec
>      7,952,267,470      cycles                    #    4.025 GHz
>        541,904,608      instructions              #    0.07  insn per cycle
>         83,406,021      branches                  #   42.219 M/sec
>             10,365      branch-misses             #    0.01% of all branches
> 
>        1.977234595 seconds time elapsed
> 
>        0.000000000 seconds user
>        1.977162000 seconds sys
> 
> According to this, we should see roughly 79k samples (7.9 billion cycles
> / 100k sample period), but perf only gets 58. What it "looks like" to
> me is that the CPU ring buffer might run out of space after several
> events, and the perf process doesn't get scheduled soon enough to read
> the data. But in my experience, perf usually reports that it missed some
> events. So I wonder if anybody is familiar with the factors at play
> when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
> my exploration.
> 
> My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
> have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
> mainline kernel too.
> 
> Thanks,
> Stephen
> 
> Makefile:
> <<<
> obj-m += irqoff.o
> 
> all:
> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
> 
> clean:
> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
> >>>
> 
> irqoff.c:
> <<<
> #include <linux/module.h>
> #include <linux/kernel.h>
> #include <linux/init.h>
> #include <linux/proc_fs.h>
> #include <linux/spinlock.h>
> #include <linux/delay.h>
> #include <linux/version.h>
> #include <linux/uaccess.h>
> 
> MODULE_LICENSE("GPL");
> MODULE_DESCRIPTION("Test module that allows disabling IRQs for a configurable time");
> MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");
> 
> 
> // Store the proc dir entry we can use to check status
> struct proc_dir_entry *pde = NULL;
> 
> DEFINE_SPINLOCK(irqoff_lock);
> 
> 
> static noinline void irqsoff_inirq_delay(unsigned long millis)
> {
> 	mdelay(millis);
> }
> 
> 
> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> {
> 	char buf[32];
> 	int rv;
> 	unsigned long usecs = 0;
> 
> 	if (amt > sizeof(buf) - 1)
> 		return -EFBIG;
> 
> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> 		return -EFAULT;
> 
> 	buf[amt] = '\0';
> 
> 	if (sscanf(buf, "%lu", &usecs) != 1)
> 		return -EINVAL;
> 
> 	/* The written value is a number of milliseconds, passed straight to
> 	   mdelay(). Threshold it at 5 minutes for safety. */
> 	if (usecs > 5 * 60 * 1000)
> 		return -EINVAL;
> 
> 	pr_info("[irqoff] lock for %lu millis\n", usecs);
> 	spin_lock_irq(&irqoff_lock);
> 	irqsoff_inirq_delay(usecs);
> 	spin_unlock_irq(&irqoff_lock);
> 
> 	return amt;
> }
> 
> static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
> {
> 	return 0;
> }
> 
> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
> static const struct file_operations irqsoff_fops = {
> 	.owner                  = THIS_MODULE,
> 	.read                   = irqsoff_read,
> 	.write                  = irqsoff_write,
> };
> #else
> static const struct proc_ops irqsoff_fops = {
> 	.proc_read		= irqsoff_read,
> 	.proc_write		= irqsoff_write,
> };
> #endif
> 
> static int irqoff_init(void)
> {
> 	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
> 	if (!pde)
> 		return -ENOENT;
> 
> 	pr_info("[irqoff] successfully initialized\n");
> 	return 0;
> }
> 
> static void irqoff_exit(void)
> {
> 	proc_remove(pde);
> 	pde = NULL;
> }
> 
> module_init(irqoff_init);
> module_exit(irqoff_exit);
> >>>
> 


* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 16:45 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Jiri Olsa <jolsa@redhat.com> writes:

> On Wed, Jun 30, 2021 at 10:29:13AM -0700, Stephen Brennan wrote:
>> Hi all,
>> 
>> I've been trying to understand the behavior of the x86_64 performance
>> monitoring interrupt, specifically when IRQ is disabled. Since it's an
>> NMI, it should still trigger and record events. However, I've noticed
>> that when interrupts are disabled for a long time, events seem to be
>> silently dropped, and I'm wondering if this is expected behavior.
>> 
>> To test this, I created a simple kernel module "irqoff" which creates a
>> file /proc/irqoff_sleep_millis. On write, the module uses
>> "spin_lock_irq()" to disable interrupts, and then issues an mdelay()
>> call for whatever number of milliseconds was written. This allows us to
>> busy wait with IRQ disabled. (Source for the module at the end of this
>> email).
>> 
>> When I use perf to record a write to this file, we see the following:
>> 
>> $ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>
> seems strange.. I'll check
>
> could you see that also when monitoring the cpu? like:
>
>   $ sudo perf record -e cycles -c 100000 -C 1 -- taskset -c 1 sh -c ..
>
> jirka

Thanks for taking a look. I tried monitoring the CPU instead of the task
and saw the exact same result -- about a two second gap in the resulting
data when viewed through perf script. I tried specifying a few different
CPUs just in case.

I've reproduced this behavior on a few bare-metal (no VM) systems with
different CPUs, and my initial observation is that it seems to be Intel
specific. Below are two Intel CPUs which have the exact same behavior:
(a) perf stat shows the cycles counter in the billions, (b) orders of
magnitude fewer samples than expected, and (c) large ~2s gap in the
data.

** Server (v5.4 based)
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz
Stepping:            1

** Laptop (v5.11 based)
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           142
Model name:                      Intel(R) Core(TM) i7-8665U CPU @ 1.90GHz
Stepping:                        12

On the other hand, with AMD, I get similarly few samples, but the data
seems correct. Here is the CPU info for my AMD test system:

** AMD Server (v5.4 based)
Vendor ID:           AuthenticAMD
CPU family:          23
Model:               1
Model name:          AMD EPYC 7551 32-Core Processor
Stepping:            2

$ sudo perf stat -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'                          

 Performance counter stats for 'sh -c echo 2000 >/proc/irqoff_sleep_millis':

          2,000.83 msec task-clock                #    1.000 CPUs utilized          
                 1      context-switches          #    0.000 K/sec                  
                 1      cpu-migrations            #    0.000 K/sec                  
               130      page-faults               #    0.065 K/sec                  
         7,196,575      cycles                    #    0.004 GHz                      (99.75%)
     3,606,261,791      stalled-cycles-frontend   # 50110.81% frontend cycles idle    (0.29%)
         3,156,105      stalled-cycles-backend    #   43.86% backend cycles idle      (99.99%)
         3,399,963      instructions              #    0.47  insn per cycle         
                                                  # 1060.68  stalled cycles per insn
           655,974      branches                  #    0.328 M/sec                  
            12,880      branch-misses             #    1.96% of all branches          (99.97%)

       2.001346768 seconds time elapsed

       0.396201000 seconds user
       1.605213000 seconds sys

Since the cycles counter is so low, I assume that the AMD PMU doesn't
count idle/stalled/nop cycles from the mdelay() function call, which
seems reasonable. Maybe this is an example of frequency scaling? The
recorded perf data during the mdelay call shows no gap, just very few
samples that uniformly stretch over the runtime:

$ sudo perf record -e cycles -c 100000 -- sh -c 'echo 2000 >/proc/irqoff_sleep_millis'
...
$ sudo perf script
...
              sh 14305   683.219192:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.300989:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.383783:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.465580:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.547376:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.629173:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.710970:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.793764:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.875561:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   683.957357:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.039154:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.120951:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.203745:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.285542:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.367339:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.449135:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.530932:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.613726:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.695523:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.778317:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.861111:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.942908:     100000 cycles:  ffffffffb0b7a172 delay_mwaitx+0x72 ([kernel.kallsyms])
              sh 14305   684.978837:     100000 cycles:  ffffffffb0203d71 syscall_slow_exit_work+0x111 ([kernel.kallsyms])
              sh 14305   684.978890:     100000 cycles:  ffffffffb029d8a1 __raw_spin_unlock+0x1 ([kernel.kallsyms])
              sh 14305   684.978922:     100000 cycles:  ffffffffb04502f9 unmap_page_range+0x639 ([kernel.kallsyms]
...

While this sort of data (very few data points during a long period of
time) is not ideal for my debugging, it's at least consistent, and a
well-known gotcha of perf. I also double checked monitoring a single CPU
on AMD (rather than task) and saw the same results.

So hopefully this information helps narrow down the exploration. It
seems to be Intel PMU specific!

Thanks,
Stephen

>
>> [ perf record: Woken up 1 times to write data ]
>> [ perf record: Captured and wrote 0.030 MB perf.data (58 samples) ]
>> 
>> $ sudo perf script
>> # ... filtered down:
>>               sh 62863 52318.991716:     100000 cycles:  ffffffff8a8237a9 delay_tsc+0x39 ([kernel.kallsyms])
>>               sh 62863 52318.991740:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>>               sh 62863 52318.991765:     100000 cycles:  ffffffff8a823797 delay_tsc+0x27 ([kernel.kallsyms])
>> #                      ^ v   ~2 second gap!
>>               sh 62863 52320.963900:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963923:     100000 cycles:  ffffffff8ae47417 _raw_spin_lock_irqsave+0x27 ([kernel.kallsyms])
>>               sh 62863 52320.963948:     100000 cycles:  ffffffff8ab1db9a handle_tx_event+0x2da ([kernel.kallsyms])
>> 
>> The perf stat shows the following counters over a similar run:
>> 
>> $ sudo perf stat -- sh -c 'echo 2000 > /proc/irqoff_sleep_millis'
>> 
>>  Performance counter stats for 'sh -c echo 2000 > /proc/irqoff_sleep_millis':
>> 
>>           1,975.55 msec task-clock                #    0.999 CPUs utilized
>>                  1      context-switches          #    0.001 K/sec
>>                  0      cpu-migrations            #    0.000 K/sec
>>                 61      page-faults               #    0.031 K/sec
>>      7,952,267,470      cycles                    #    4.025 GHz
>>        541,904,608      instructions              #    0.07  insn per cycle
>>         83,406,021      branches                  #   42.219 M/sec
>>             10,365      branch-misses             #    0.01% of all branches
>> 
>>        1.977234595 seconds time elapsed
>> 
>>        0.000000000 seconds user
>>        1.977162000 seconds sys
>> 
>> According to this, we should see roughly 79k samples (7.9 billion cycles
>> / 100k sample period), but perf only gets 58. What it "looks like" to
>> me is that the CPU ring buffer might run out of space after several
>> events, and the perf process doesn't get scheduled soon enough to read
>> the data. But in my experience, perf usually reports that it missed some
>> events. So I wonder if anybody is familiar with the factors at play
>> when IRQs are disabled during a PMI? I'd appreciate any pointers to guide
>> my exploration.
>> 
>> My test case here ran on Ubuntu distro kernel 5.11.0-22-generic, and I
>> have also tested on a 5.4 based kernel. I'm happy to reproduce this on a
>> mainline kernel too.
>> 
>> Thanks,
>> Stephen
>> 
>> Makefile:
>> <<<
>> obj-m += irqoff.o
>> 
>> all:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
>> 
>> clean:
>> 	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean
>> >>>
>> 
>> irqoff.c:
>> <<<
>> #include <linux/module.h>
>> #include <linux/kernel.h>
>> #include <linux/init.h>
>> #include <linux/proc_fs.h>
>> #include <linux/spinlock.h>
>> #include <linux/delay.h>
>> #include <linux/version.h>
>> #include <linux/uaccess.h>
>> 
>> MODULE_LICENSE("GPL");
>> MODULE_DESCRIPTION("Test module that allows disabling IRQs for a configurable time");
>> MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");
>> 
>> 
>> // Store the proc dir entry we can use to check status
>> struct proc_dir_entry *pde = NULL;
>> 
>> DEFINE_SPINLOCK(irqoff_lock);
>> 
>> 
>> static noinline void irqsoff_inirq_delay(unsigned long millis)
>> {
>> 	mdelay(millis);
>> }
>> 
>> 
>> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
>> {
>> 	char buf[32];
>> 	int rv;
>> 	unsigned long usecs = 0;
>> 
>> 	if (amt > sizeof(buf) - 1)
>> 		return -EFBIG;
>> 
>> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
>> 		return -EFAULT;
>> 
>> 	buf[amt] = '\0';
>> 
>> 	if (sscanf(buf, "%lu", &usecs) != 1)
>> 		return -EINVAL;
>> 
>> 	/* The written value is a number of milliseconds, passed straight to
>> 	   mdelay(). Threshold it at 5 minutes for safety. */
>> 	if (usecs > 5 * 60 * 1000)
>> 		return -EINVAL;
>> 
>> 	pr_info("[irqoff] lock for %lu millis\n", usecs);
>> 	spin_lock_irq(&irqoff_lock);
>> 	irqsoff_inirq_delay(usecs);
>> 	spin_unlock_irq(&irqoff_lock);
>> 
>> 	return amt;
>> }
>> 
>> static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
>> {
>> 	return 0;
>> }
>> 
>> #if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
>> static const struct file_operations irqsoff_fops = {
>> 	.owner                  = THIS_MODULE,
>> 	.read                   = irqsoff_read,
>> 	.write                  = irqsoff_write,
>> };
>> #else
>> static const struct proc_ops irqsoff_fops = {
>> 	.proc_read		= irqsoff_read,
>> 	.proc_write		= irqsoff_write,
>> };
>> #endif
>> 
>> static int irqoff_init(void)
>> {
>> 	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
>> 	if (!pde)
>> 		return -ENOENT;
>> 
>> 	pr_info("[irqoff] successfully initialized\n");
>> 	return 0;
>> }
>> 
>> static void irqoff_exit(void)
>> {
>> 	proc_remove(pde);
>> 	pde = NULL;
>> }
>> 
>> module_init(irqoff_init);
>> module_exit(irqoff_exit);
>> >>>
>> 

* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 17:05 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:
> So hopefully this information helps narrow down the exploration. It
> seems to be Intel PMU specific!

Or more accurately, we only observe it on the Intel systems I've
checked. The AMD systems record fewer cycle events in mdelay() and thus
may not have a chance to skip events. I'll try updating my kernel module
to busy wait by looping until jiffies is above a particular value. This
would hopefully avoid any CPU-specific idling which would reduce the
cycle count on AMD. I'll update this thread with my findings after
testing.

Thanks again,
Stephen

* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-01 18:00 UTC
  To: Jiri Olsa; +Cc: linux-perf-users

Stephen Brennan <stephen.s.brennan@oracle.com> writes:

> Stephen Brennan <stephen.s.brennan@oracle.com> writes:
>> So hopefully this information helps narrow down the exploration. It
>> seems to be Intel PMU specific!
>
> Or more accurately, we only observe it on the Intel systems I've
> checked. The AMD systems record fewer cycle events in mdelay() and thus
> may not have a chance to skip events. I'll try updating my kernel module
> to busy wait by looping until jiffies is above a particular value. This
> would hopefully avoid any CPU-specific idling which would reduce the
> cycle count on AMD. I'll update this thread with my findings after
> testing.

Quite sorry for the spam, but I have completed my testing on AMD. By
abandoning mdelay() and using a jiffies-based busy loop (updated module
code at the end of this message), I was able to reproduce the 2 second
gap in the perf data. It seems that the difference here has more to do
with the implementation of mdelay.

Regards,
Stephen

Updated module code:

/*
 * irqoff.c: Linux kernel module for disabling interrupts and delaying for a
 * certain amount of time. Useful for experimenting with perf, lockup detector,
 * and more.
 */
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/proc_fs.h>
#include <linux/spinlock.h>
#include <linux/delay.h>
#include <linux/version.h>
#include <linux/uaccess.h>
#include <linux/types.h>
#include <linux/jiffies.h>

#include <asm/msr.h>

MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Test module that allows to disable IRQ for configurable time");
MODULE_AUTHOR("Stephen Brennan <stephen.s.brennan@oracle.com>");


// Store the proc dir entry we can use to check status
struct proc_dir_entry *pde = NULL;

DEFINE_SPINLOCK(irqoff_lock);

bool use_mdelay = false;
module_param(use_mdelay, bool, false);


static noinline void irqsoff_inirq_manual_delay(unsigned long millis)
{
	uint64_t start_tsc, tsc;
	unsigned long jiffies_start = jiffies;
	unsigned long jiffies_end = jiffies + millis * HZ / 1000;

	/*
	 * With interrupts disabled, it's possible that jiffies may not get
	 * updated! To account for this, we'll use the x86_64 cycle count too.
	 * After 100 million cycles, if jiffies still has not been updated, then it's likely
	 * that we've accidentally disabled the timer interrupt and we can't
	 * accurately time our sleep, so we should bail out.
	 */
	#define TSC_EMERGENCY 100000000
	start_tsc = rdtsc();

	pr_info("[irqoff] manual delay %lu millis (jiffies cur=%lu target=%lu)\n",
		millis, jiffies_start, jiffies_end);
	while (time_before(jiffies, jiffies_end)) {
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		asm volatile ("nop");
		tsc = rdtsc();
		if ((tsc >= start_tsc + TSC_EMERGENCY) && (jiffies == jiffies_start)) {
			pr_info(
				"[irqoff] looks like we disabled timer interrupt, "
				"bailing! (TSC start %llu, cur %llu)\n", start_tsc, tsc);
			break;
		}
	}
	pr_info("[irqoff] manual delay complete, jiffies: %lu\n", jiffies);
}

static void irqsoff_inirq_delay(unsigned long millis)
{
	if (use_mdelay) {
		mdelay(millis);
	} else {
		irqsoff_inirq_manual_delay(millis);
	}
}

static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
{
	char buf[32];
	int rv;
	unsigned long millis = 0;

	if (amt > sizeof(buf) - 1)
		return -EFBIG;

	if ((rv = copy_from_user(buf, data, amt)) != 0)
		return -EFAULT;

	buf[amt] = '\0';

	if (sscanf(buf, "%lu", &millis) != 1)
		return -EINVAL;

	/* Threshold it at 5 minutes for safety. */
	if (millis > 5 * 60 * 1000)
		return -EINVAL;

	pr_info("[irqoff] lock for %lu millis\n", millis);
	spin_lock_irq(&irqoff_lock);
	irqsoff_inirq_delay(millis);
	spin_unlock_irq(&irqoff_lock);

	return amt;
}

static ssize_t irqsoff_read(struct file *f, char __user *data, size_t amt, loff_t *off)
{
	return 0;
}

#if LINUX_VERSION_CODE < KERNEL_VERSION(5,6,0)
static const struct file_operations irqsoff_fops = {
	.owner                  = THIS_MODULE,
	.read                   = irqsoff_read,
	.write                  = irqsoff_write,
};
#else
static const struct proc_ops irqsoff_fops = {
	.proc_read		= irqsoff_read,
	.proc_write		= irqsoff_write,
};
#endif

static int irqoff_init(void)
{
	pde = proc_create("irqoff_sleep_millis", 0644, NULL, &irqsoff_fops);
	if (!pde)
		return -ENOENT;

	pr_info("[irqoff] successfully initialized\n");
	return 0;
}

static void irqoff_exit(void)
{
	proc_remove(pde);
	pde = NULL;
	pr_info("[irqoff] exited\n");
}

module_init(irqoff_init);
module_exit(irqoff_exit);

* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-02 16:22 UTC
  To: Stephen Brennan; +Cc: linux-perf-users

On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:

SNIP

> 
> static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> {
> 	char buf[32];
> 	int rv;
> 	unsigned long millis = 0;
> 
> 	if (amt > sizeof(buf) - 1)
> 		return -EFBIG;
> 
> 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> 		return -EFAULT;
> 
> 	buf[amt] = '\0';
> 
> 	if (sscanf(buf, "%lu", &millis) != 1)
> 		return -EINVAL;
> 
> 	/* Threshold it at 5 minutes for safety. */
> 	if (millis > 5 * 60 * 1000)
> 		return -EINVAL;
> 
> 	pr_info("[irqoff] lock for %lu millis\n", millis);
> 	spin_lock_irq(&irqoff_lock);
> 	irqsoff_inirq_delay(millis);
> 	spin_unlock_irq(&irqoff_lock);

yea, I can see that as well.. if I took the spin_lock_irq it works as
expected.. will check

thanks,
jirka


* Re: Perf loses events without reporting
From: Jiri Olsa @ 2021-07-02 20:12 UTC
  To: Stephen Brennan; +Cc: linux-perf-users, Peter Zijlstra

On Fri, Jul 02, 2021 at 06:22:10PM +0200, Jiri Olsa wrote:
> On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:
> 
> SNIP
> 
> > 
> > static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
> > {
> > 	char buf[32];
> > 	int rv;
> > 	unsigned long millis = 0;
> > 
> > 	if (amt > sizeof(buf) - 1)
> > 		return -EFBIG;
> > 
> > 	if ((rv = copy_from_user(buf, data, amt)) != 0)
> > 		return -EFAULT;
> > 
> > 	buf[amt] = '\0';
> > 
> > 	if (sscanf(buf, "%lu", &millis) != 1)
> > 		return -EINVAL;
> > 
> > 	/* Threshold it at 5 minutes for safety. */
> > 	if (millis > 5 * 60 * 1000)
> > 		return -EINVAL;
> > 
> > 	pr_info("[irqoff] lock for %lu millis\n", millis);
> > 	spin_lock_irq(&irqoff_lock);
> > 	irqsoff_inirq_delay(millis);
> > 	spin_unlock_irq(&irqoff_lock);
> 
> yea, I can see that as well.. if I took the spin_lock_irq it works as
> expected.. will check

ok, so by disabling irqs you disable the task tick, which
is what kicks off the unthrottling

throttling is started when you cross the limit of samples
per tick, which you do real fast with busy waiting

then when the event is throttled (and no samples are generated),
the next task tick restores it.. but with irqs disabled,
there's no task tick, so the event stays throttled the whole
time irqs are disabled
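
if it helps, here's a toy user-space model of that mechanism, just as
an illustration (the real logic lives in kernel/events/core.c; the
names and numbers below are made up, the real per-tick limit is derived
from the kernel.perf_event_max_sample_rate sysctl):

<<<
/* throttle_sim.c: toy model of perf's per-tick sample throttling.
 * PMIs arriving after the per-tick budget is spent are dropped, and
 * the budget is only refilled by the timer tick.  No tick (irqs
 * disabled) means the event stays throttled. */
#include <stdio.h>
#include <stdbool.h>

#define MAX_SAMPLES_PER_TICK 100	/* stand-in for the real limit */

static int interrupts;			/* PMIs seen since the last tick */
static bool throttled;

static bool pmi(void)			/* true if a sample gets recorded */
{
	if (throttled)
		return false;
	if (++interrupts >= MAX_SAMPLES_PER_TICK) {
		throttled = true;	/* counter stopped until next tick */
		return false;
	}
	return true;
}

static void tick(void)			/* what the timer tick would do */
{
	interrupts = 0;
	throttled = false;
}

int main(void)
{
	int recorded = 0, i, t;

	/* irqs enabled: the tick clears the throttle every time */
	for (t = 0; t < 10; t++) {
		for (i = 0; i < 1000; i++)
			recorded += pmi();
		tick();
	}
	printf("with ticks:    %d samples out of 10000 PMIs\n", recorded);

	/* same PMI load, but the tick never runs ("irqs disabled") */
	recorded = 0;
	tick();
	for (i = 0; i < 10000; i++)
		recorded += pmi();
	printf("without ticks: %d samples out of 10000 PMIs\n", recorded);
	return 0;
}
>>>

the first run keeps recording across ticks, the second stops for good
after the first hundred or so PMIs.. same shape as your trace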

is there any use case you have to have irqs disabled for so long?

cc-ing peterz
initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637

jirka


* Re: Perf loses events without reporting
From: Stephen Brennan @ 2021-07-02 22:48 UTC
  To: Jiri Olsa; +Cc: linux-perf-users, Peter Zijlstra

Jiri Olsa <jolsa@redhat.com> writes:
> On Fri, Jul 02, 2021 at 06:22:10PM +0200, Jiri Olsa wrote:
>> On Thu, Jul 01, 2021 at 11:00:24AM -0700, Stephen Brennan wrote:
>> 
>> SNIP
>> 
>> > 
>> > static ssize_t irqsoff_write(struct file *f, const char __user *data, size_t amt, loff_t *off)
>> > {
>> > 	char buf[32];
>> > 	int rv;
>> > 	unsigned long millis = 0;
>> > 
>> > 	if (amt > sizeof(buf) - 1)
>> > 		return -EFBIG;
>> > 
>> > 	if ((rv = copy_from_user(buf, data, amt)) != 0)
>> > 		return -EFAULT;
>> > 
>> > 	buf[amt] = '\0';
>> > 
>> > 	if (sscanf(buf, "%lu", &millis) != 1)
>> > 		return -EINVAL;
>> > 
>> > 	/* Threshold it at 5 minutes for safety. */
>> > 	if (millis > 5 * 60 * 1000)
>> > 		return -EINVAL;
>> > 
>> > 	pr_info("[irqoff] lock for %lu millis\n", millis);
>> > 	spin_lock_irq(&irqoff_lock);
>> > 	irqsoff_inirq_delay(millis);
>> > 	spin_unlock_irq(&irqoff_lock);
>> 
>> yea, I can see that as well.. if I took the spin_lock_irq it works as
>> expected.. will check
>
> ok, so by disabling irqs you disable the task tick, which
> is what kicks off the unthrottling
>
> throttling is started when you cross the limit of samples
> per tick, which you do real fast with busy waiting
>
> then when the event is throttled (and no samples are generated),
> the next task tick restores it.. but with irqs disabled,
> there's no task tick, so the event stays throttled the whole
> time irqs are disabled

Ah, this makes sense, I'll take a look at the throttling code to
understand it a bit better myself.

>
> is there any use case you have to have irqs disabled for so long?

I don't have a good use case for keeping IRQs disabled so long. However,
one of my use cases for perf is as a debugging tool for anything which
doesn't result in panics. One common enough issue is spinlock
contention, during which IRQs can be disabled. Normally if a lock is
contended, IRQs are disabled and then reenabled very frequently. In
these cases, you wouldn't necessarily notice lost events because either
(a) they wouldn't be lost, or (b) the event could be unthrottled
frequently. But sometimes, a buggy lock is held for a longer time, and the
event loss can make it hard to see the contention in perf.

For example, in a recent issue we saw a spinlock held with IRQs disabled
for several seconds (causing other threads to pile up waiting). The wait
was long enough to cause serious issues with devices due to missed
interrupts, but short enough to not trigger the configured lockup
detector. Our perf recording didn't show any particular "hot functions"
at this spinlock, since so few samples were captured there, despite the
large amount of time spent on it.

Ideally for this sort of debugging it would be best to have all the
samples. However, failing that, it would be useful to have a reliable
detection mechanism for these sorts of gaps. I have a script right now
for frequency-based sampling, which detects "holes" -- whenever the
period between samples is more than 10x the expected sample period. Of
course the sample period can never be perfect, but 10x seems to be a
reasonable threshold that only really gets hit during idle (due to
frequency scaling, I'd guess). This script is decent but still has false
positives.
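
For reference, the detection logic is roughly along these lines (a
sketch, not the exact script I use; it assumes perf script's default
output, where the timestamp is the first field that ends with ':' and
contains a '.'):

<<<
/* holes.c: flag gaps in `perf script` output that are larger than 10x
 * the expected inter-sample period.
 * Build: cc -o holes holes.c
 * Use:   sudo perf script | ./holes 0.000025   (~25us for -c 100000 at ~4GHz)
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	double expect = (argc > 1) ? atof(argv[1]) : 0.001; /* seconds */
	double prev = -1.0;
	char line[4096];

	while (fgets(line, sizeof(line), stdin)) {
		double ts = -1.0;
		char *tok = strtok(line, " \t\n");

		/* the timestamp field looks like "52318.991716:" */
		for (; tok; tok = strtok(NULL, " \t\n")) {
			size_t n = strlen(tok);
			if (n > 2 && tok[n - 1] == ':' && strchr(tok, '.')) {
				ts = atof(tok);
				break;
			}
		}
		if (ts < 0)
			continue;	/* no timestamp on this line */
		if (prev >= 0.0 && ts - prev > 10.0 * expect)
			printf("hole: %.6f -> %.6f (%.3f s gap)\n",
			       prev, ts, ts - prev);
		prev = ts;
	}
	return 0;
}
>>>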

Thanks for this analysis, it's quite helpful!

Stephen

>
> cc-ing peterz
> initial email: https://lore.kernel.org/linux-perf-users/YN89LwSzTKc8wBar@krava/T/#m616481aa95da10e17d9b0769717f006b9ff38637
>
> jirka
