From: Sagi Grimberg <sagi@grimberg.me>
To: Jens Axboe <axboe@kernel.dk>,
	Johannes Thumshirn <jthumshirn@suse.de>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Cc: linux-block@vger.kernel.org, Linux-scsi@vger.kernel.org,
	linux-nvme@lists.infradead.org,
	Christoph Hellwig <hch@infradead.org>,
	Keith Busch <keith.busch@intel.com>
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] NAPI polling for block drivers
Date: Tue, 17 Jan 2017 17:38:54 +0200	[thread overview]
Message-ID: <499af528-7810-f82d-1f11-cbf8f3a5b21c@grimberg.me> (raw)
In-Reply-To: <8b47ca34-d2ff-26dc-721e-2cb1e18f1efc@grimberg.me>

[-- Attachment #1: Type: text/plain, Size: 5615 bytes --]

Hey, so I did some initial analysis of what's going on with
irq-poll.

First, I sampled how much time passes between taking the interrupt in
nvme_irq and the first visit to nvme_irqpoll_handler. I ran a single
threaded fio with QD=32 of 4K reads.
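For reference, this is roughly the invocation I have in mind (the device
path and runtime are placeholders; only the single-job, 4K random read,
QD=32 shape matters):
   $ fio --name=randread-qd32 --filename=/dev/nvme0n1 --ioengine=libaio \
         --direct=1 --rw=randread --bs=4k --iodepth=32 --numjobs=1 \
         --runtime=60 --time_based

Below are two displays of the interrupt-to-irqpoll latency histogram (usecs):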
--
[1]
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 7310     |****************************************|
          2 -> 3          : 11       |                                        |
          4 -> 7          : 10       |                                        |
          8 -> 15         : 20       |                                        |
         16 -> 31         : 0        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 1        |                                        |

[2]
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 7309     |****************************************|
          2 -> 3          : 14       |                                        |
          4 -> 7          : 7        |                                        |
          8 -> 15         : 17       |                                        |

We can see that most of the time our latency is pretty good (<1us), but with
huge tail latencies (some in 8-15 us and even one in 64-127 us).
**NOTE: in order to reduce the tracing impact on performance I only sampled
every 100th interrupt.

I also sampled a run with multiple threads/queues, QD=32 of 4K reads.
This is a collection of histograms for 5 queues (5 fio threads):
queue = b'nvme0q1'
      usecs               : count     distribution
          0 -> 1          : 701      |****************************************|
          2 -> 3          : 177      |**********                              |
          4 -> 7          : 56       |***                                     |
          8 -> 15         : 24       |*                                       |
         16 -> 31         : 6        |                                        |
         32 -> 63         : 1        |                                        |

queue = b'nvme0q2'
      usecs               : count     distribution
          0 -> 1          : 412      |****************************************|
          2 -> 3          : 52       |*****                                   |
          4 -> 7          : 19       |*                                       |
          8 -> 15         : 13       |*                                       |
         16 -> 31         : 5        |                                        |

queue = b'nvme0q3'
      usecs               : count     distribution
          0 -> 1          : 381      |****************************************|
          2 -> 3          : 74       |*******                                 |
          4 -> 7          : 26       |**                                      |
          8 -> 15         : 12       |*                                       |
         16 -> 31         : 3        |                                        |
         32 -> 63         : 0        |                                        |
         64 -> 127        : 0        |                                        |
        128 -> 255        : 1        |                                        |

queue = b'nvme0q4'
      usecs               : count     distribution
          0 -> 1          : 386      |****************************************|
          2 -> 3          : 63       |******                                  |
          4 -> 7          : 30       |***                                     |
          8 -> 15         : 11       |*                                       |
         16 -> 31         : 7        |                                        |
         32 -> 63         : 1        |                                        |

queue = b'nvme0q5'
      usecs               : count     distribution
          0 -> 1          : 384      |****************************************|
          2 -> 3          : 69       |*******                                 |
          4 -> 7          : 25       |**                                      |
          8 -> 15         : 15       |*                                       |
         16 -> 31         : 3        |                                        |

Overall it looks pretty much the same, just with some more samples in the tail buckets...

Next, I sampled how many completions we are able to consume per interrupt.
Here are two examples of histograms of how many completions we reap per interrupt:
--
queue = b'nvme0q1'
      completed     : count     distribution
         0          : 0        |                                        |
         1          : 11690    |****************************************|
         2          : 46       |                                        |
         3          : 1        |                                        |

queue = b'nvme0q1'
      completed     : count     distribution
         0          : 0        |                                        |
         1          : 944      |****************************************|
         2          : 8        |                                        |
--

So it looks like we are very inefficient: most of the time we catch only 1
completion per interrupt, and the whole point is that we need to find more!
This fio is single threaded with QD=32, so I'd expect us to land somewhere
in the 8-31 bucket almost all the time... I also tried QD=1024 and the
histogram is still the same.
**NOTE: here too I only sampled every 100th interrupt.


I'll try to run the counter on the current nvme driver and see what I get.



I attached the bpf scripts I wrote (nvme-trace-irq, nvme-count-comps)
in the hope that someone is interested enough to try to reproduce these
numbers on their setup and maybe suggest some other useful tracing
we can do.

Prerequisites:
1. iovisor is needed for python bpf support.
   $ echo "deb [trusted=yes] https://repo.iovisor.org/apt/xenial 
xenial-nightly main" | sudo tee /etc/apt/sources.list.d/iovisor.list
   $ sudo apt-get update -y
   $ sudo apt-get install bcc-tools -y
   # Nastty hack .. bcc only available in python2 but copliant with 
python3..
   $ sudo cp -r /usr/lib/python2.7/dist-packages/bcc 
/usr/lib/python3/dist-packages/

2. Because we don't have the nvme-pci symbols exported, The nvme.h file 
is needed on the
    test machine (where the bpf code is running). I used nfs mount for 
the linux source (this
    is why I include from /mnt/linux in the scripts).
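
Example invocations (the flags are the ones the scripts below accept;
--freq 100 matches the 1-in-100 sampling I used for the numbers above):
   $ sudo ./nvme-trace-irq -Q --freq 100 1 10     # interrupt->irqpoll latency per queue
   $ sudo ./nvme-count-comps -Q --freq 100 1 10   # completions consumed per interrupt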


[-- Attachment #2: nvme-count-comps --]
[-- Type: text/plain, Size: 6398 bytes --]

#!/usr/bin/python3
# @lint-avoid-python-3-compatibility-imports

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse

# arguments
examples = """examples:
    ./nvme-count-comps            # summarize completions consumed per interrupt as a histogram
    ./nvme-count-comps 1 10       # print 1 second summaries, 10 times
    ./nvme-count-comps -T 1       # 1s summaries with timestamps
    ./nvme-count-comps -Q         # show each nvme queue device separately
"""
parser = argparse.ArgumentParser(
    description="Summarize block device I/O latency as a histogram",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
    help="millisecond histogram")
parser.add_argument("-Q", "--queues", action="store_true",
    help="print a histogram per queue")
parser.add_argument("--freq", help="Account every N-th request",
    type=int, required=False)
parser.add_argument("interval", nargs="?", default=2,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
/*****************************************************************
 * Nasty hack because we don't have the nvme-pci structs exported
 *****************************************************************/
#include <linux/aer.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/hdreg.h>
#include <linux/idr.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/t10-pi.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <asm/unaligned.h>
#include <linux/irq_poll.h>
#include "/mnt/linux/drivers/nvme/host/nvme.h"

struct nvme_dev;
struct nvme_queue;
/*
 * Represents an NVM Express device.  Each nvme_dev is a PCI function.
 */
struct nvme_dev {
	struct nvme_queue **queues;
	struct blk_mq_tag_set tagset;
	struct blk_mq_tag_set admin_tagset;
	u32 __iomem *dbs;
	struct device *dev;
	struct dma_pool *prp_page_pool;
	struct dma_pool *prp_small_pool;
	unsigned queue_count;
	unsigned online_queues;
	unsigned max_qid;
	int q_depth;
	u32 db_stride;
	void __iomem *bar;
	struct work_struct reset_work;
	struct work_struct remove_work;
	struct timer_list watchdog_timer;
	struct mutex shutdown_lock;
	bool subsystem;
	void __iomem *cmb;
	dma_addr_t cmb_dma_addr;
	u64 cmb_size;
	u32 cmbsz;
	u32 cmbloc;
	struct nvme_ctrl ctrl;
	struct completion ioq_wait;
};

/*
 * An NVM Express queue.  Each device has at least two (one for admin
 * commands and one for I/O commands).
 */
struct nvme_queue {
	struct device *q_dmadev;
	struct nvme_dev *dev;
	char irqname[24];
	spinlock_t sq_lock;
	spinlock_t cq_lock;
	struct nvme_command *sq_cmds;
	struct nvme_command __iomem *sq_cmds_io;
	volatile struct nvme_completion *cqes;
	struct blk_mq_tags **tags;
	dma_addr_t sq_dma_addr;
	dma_addr_t cq_dma_addr;
	u32 __iomem *q_db;
	u16 q_depth;
	s16 cq_vector;
	u16 sq_tail;
	u16 cq_head;
	u16 qid;
	u8 cq_phase;
	struct irq_poll	iop;
};

typedef struct queue_key {
    char queue[24];
    u64 slot;
} queue_key_t;

/* Completion counter context */
struct nvmeq {
    struct nvme_queue *q;
    u64 completed;
};

BPF_TABLE("percpu_array", int, struct nvmeq, qarr, 1);
BPF_TABLE("percpu_array", int, int, call_count, 1);
STORAGE

/* trace nvme interrupt */
int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data)
{
    __CALL__COUNT__FILTER__

    struct nvmeq q ={};
    int index = 0;

    q.q = data;
    q.completed = 0; /* reset completions */

    qarr.update(&index, &q);
    return 0;
}

/* count completed on each irqpoll end */
int trace_irqpoll_end(struct pt_regs *ctx)
{
    __CALL__COUNT__FILTER__

    struct nvmeq zero = {};
    int index = 0;
    struct nvmeq *q;
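    /* kretprobe: on x86-64 the handler's return value (completions reaped) is in regs->ax */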
    int completed = ctx->ax;

    q = qarr.lookup_or_init(&index, &zero);
    if (q == NULL) {
	goto out;
    }

    q->completed += completed;
    /* No variables in kretprobe :( 64 is our budget */
    if (completed < 64) {
        /* store as histogram */
        STORE
        q->completed = 0;
    }

out:
    return 0;
}
"""

call_count_filter = """
{
    int zero = 0;
    int index =0;
    int *skip;

    skip = call_count.lookup_or_init(&index, &zero);

    if ((*skip) < %d) {
        (*skip)++;
        return 0;
    }
    (*skip) = 0;
}
"""

# code substitutions
if args.queues:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, queue_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'queue_key_t key = {.slot = q->completed}; ' +
        'bpf_probe_read(&key.queue, sizeof(key.queue), ' +
        'q->q->irqname); dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(q->completed);')

bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "")
if debug:
    print(bpf_text)


# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start")
b.attach_kretprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_end")

print("Tracing nvme I/O interrupt/irqpoll... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_linear_hist("completed", "queue")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()

[-- Attachment #3: nvme-trace-irq --]
[-- Type: text/plain, Size: 6397 bytes --]

#!/usr/bin/python3
# @lint-avoid-python-3-compatibility-imports

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse

# arguments
examples = """examples:
    ./nvme-trace-irq            # summarize interrupt->irqpoll latency as a histogram
    ./nvme-trace-irq 1 10       # print 1 second summaries, 10 times
    ./nvme-trace-irq -mT 1      # 1s summaries, milliseconds, and timestamps
    ./nvme-trace-irq -Q         # show each nvme queue device separately
"""
parser = argparse.ArgumentParser(
    description="Summarize interrupt to softirq latency as a histogram",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-m", "--milliseconds", action="store_true",
    help="millisecond histogram")
parser.add_argument("-Q", "--queues", action="store_true",
    help="print a histogram per queue")
parser.add_argument("--freq", help="Account every N-th request", type=int, required=False)
parser.add_argument("interval", nargs="?", default=2,
    help="output interval, in seconds")
parser.add_argument("count", nargs="?", default=99999999,
    help="number of outputs")
args = parser.parse_args()
countdown = int(args.count)
debug = 0

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
/*****************************************************************
 * Nasty hack because we don't have the nvme-pci structs exported
 *****************************************************************/
#include <linux/aer.h>
#include <linux/bitops.h>
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/blk-mq-pci.h>
#include <linux/cpu.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/hdreg.h>
#include <linux/idr.h>
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/kdev_t.h>
#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/mutex.h>
#include <linux/pci.h>
#include <linux/poison.h>
#include <linux/ptrace.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/t10-pi.h>
#include <linux/timer.h>
#include <linux/types.h>
#include <linux/io-64-nonatomic-lo-hi.h>
#include <asm/unaligned.h>
#include <linux/irq_poll.h>

/* location of nvme.h */
#include "/mnt/linux/drivers/nvme/host/nvme.h"

struct nvme_dev;
struct nvme_queue;
/*
 * Represents an NVM Express device.  Each nvme_dev is a PCI function.
 */
struct nvme_dev {
	struct nvme_queue **queues;
	struct blk_mq_tag_set tagset;
	struct blk_mq_tag_set admin_tagset;
	u32 __iomem *dbs;
	struct device *dev;
	struct dma_pool *prp_page_pool;
	struct dma_pool *prp_small_pool;
	unsigned queue_count;
	unsigned online_queues;
	unsigned max_qid;
	int q_depth;
	u32 db_stride;
	void __iomem *bar;
	struct work_struct reset_work;
	struct work_struct remove_work;
	struct timer_list watchdog_timer;
	struct mutex shutdown_lock;
	bool subsystem;
	void __iomem *cmb;
	dma_addr_t cmb_dma_addr;
	u64 cmb_size;
	u32 cmbsz;
	u32 cmbloc;
	struct nvme_ctrl ctrl;
	struct completion ioq_wait;
};

/*
 * An NVM Express queue.  Each device has at least two (one for admin
 * commands and one for I/O commands).
 */
struct nvme_queue {
	struct device *q_dmadev;
	struct nvme_dev *dev;
	char irqname[24];
	spinlock_t sq_lock;
	spinlock_t cq_lock;
	struct nvme_command *sq_cmds;
	struct nvme_command __iomem *sq_cmds_io;
	volatile struct nvme_completion *cqes;
	struct blk_mq_tags **tags;
	dma_addr_t sq_dma_addr;
	dma_addr_t cq_dma_addr;
	u32 __iomem *q_db;
	u16 q_depth;
	s16 cq_vector;
	u16 sq_tail;
	u16 cq_head;
	u16 qid;
	u8 cq_phase;
	struct irq_poll	iop;
};

typedef struct queue_key {
    char queue[24];
    u64 slot;
} queue_key_t;

BPF_HASH(start, struct nvme_queue *);
BPF_TABLE("percpu_array", int, int, call_count, 1);
STORAGE

/* timestamp nvme interrupt */
int trace_interrupt_start(struct pt_regs *ctx, int irq, void *data)
{
    __CALL__COUNT__FILTER__

    struct nvme_queue *q = data;
    u64 ts = bpf_ktime_get_ns();
    start.update(&q, &ts);
    return 0;
}

/* timestamp nvme irqpoll */
int trace_irqpoll_start(struct pt_regs *ctx, struct irq_poll *iop, int budget)
{
    struct nvme_queue *q = container_of(iop, struct nvme_queue, iop);
    u64 *tsp, delta;

    /* fetch timestamp and calculate delta */
    tsp = start.lookup(&q);
    if (tsp == 0) {
        return 0;   /* missed issue */
    }
    delta = bpf_ktime_get_ns() - *tsp;
    FACTOR

    /* store as histogram */
    STORE
    start.delete(&q);

    return 0;
}
"""

# code substitutions
call_count_filter = """
{
    int zero = 0;
    int index =0;
    int *skip;

    skip = call_count.lookup_or_init(&index, &zero);

    if ((*skip) < %d) {
        (*skip)++;
        return 0;
    }
    (*skip) = 0;
}
"""

if args.milliseconds:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000000;')
    label = "msecs"
else:
    bpf_text = bpf_text.replace('FACTOR', 'delta /= 1000;')
    label = "usecs"
if args.queues:
    bpf_text = bpf_text.replace('STORAGE',
        'BPF_HISTOGRAM(dist, queue_key_t);')
    bpf_text = bpf_text.replace('STORE',
        'queue_key_t key = {.slot = bpf_log2l(delta)}; ' +
        'bpf_probe_read(&key.queue, sizeof(key.queue), ' +
        'q->irqname); dist.increment(key);')
else:
    bpf_text = bpf_text.replace('STORAGE', 'BPF_HISTOGRAM(dist);')
    bpf_text = bpf_text.replace('STORE',
        'dist.increment(bpf_log2l(delta));')

bpf_text = bpf_text.replace("__CALL__COUNT__FILTER__", call_count_filter % (args.freq - 1) if args.freq is not None else "")
if debug:
    print(bpf_text)

# load BPF program
b = BPF(text=bpf_text)
b.attach_kprobe(event="nvme_irq", fn_name="trace_interrupt_start")
b.attach_kprobe(event="nvme_irqpoll_handler", fn_name="trace_irqpoll_start")

print("Tracing nvme I/O interrupt/irqpoll... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("dist")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("%-8s\n" % strftime("%H:%M:%S"), end="")

    dist.print_log2_hist(label, "queue")
    dist.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()
