All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: John Garry <john.garry@huawei.com>
Cc: Marc Zyngier <maz@kernel.org>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"chenxiang (M)" <chenxiang66@hisilicon.com>,
	"bigeasy@linutronix.de" <bigeasy@linutronix.de>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"hare@suse.com" <hare@suse.com>, "hch@lst.de" <hch@lst.de>,
	"axboe@kernel.dk" <axboe@kernel.dk>,
	"bvanassche@acm.org" <bvanassche@acm.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"mingo@redhat.com" <mingo@redhat.com>,
	Zhang Yi <yi.zhang@redhat.com>
Subject: Re: [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt
Date: Sat, 4 Jan 2020 20:03:49 +0800	[thread overview]
Message-ID: <20200104120349.GA3810@ming.t460p> (raw)
In-Reply-To: <e40ec76b-c9d7-2b2f-8257-d3ced0894260@huawei.com>

[-- Attachment #1: Type: text/plain, Size: 2050 bytes --]

On Fri, Jan 03, 2020 at 11:50:51AM +0000, John Garry wrote:
> On 03/01/2020 11:29, Ming Lei wrote:
> > On Fri, Jan 03, 2020 at 10:41:48AM +0000, John Garry wrote:
> > > On 03/01/2020 00:46, Ming Lei wrote:
> > > > > > > d the
> > > > > > > DMA API more than an architecture-specific problem.
> > > > > > > 
> > > > > > > Given that we have so far very little data, I'd hold off any conclusion.
> > > > > > We can start to collect latency data of dma unmapping vs nvme_irq()
> > > > > > on both x86 and arm64.
> > > > > > 
> > > > > > I will see if I can get a such box for collecting the latency data.
> > > > > To reiterate what I mentioned before about IOMMU DMA unmap on x86, a key
> > > > > difference is that by default it uses the non-strict (lazy) mode unmap, i.e.
> > > > > we unmap in batches. ARM64 uses general default, which is strict mode, i.e.
> > > > > every unmap results in an IOTLB fluch.
> > > > > 
> > > > > In my setup, if I switch to lazy unmap (set iommu.strict=0 on cmdline), then
> > > > > no lockup.
> > > > > 
> > > > > Are any special IOMMU setups being used for x86, like enabling strict mode?
> > > > > I don't know...
> > > > BTW, I have run the test on one 224-core ARM64 with one 32-hw_queue NVMe, the
> > > > softlock issue can be triggered in one minute.
> > > > 
> > > > nvme_irq() often takes ~5us to complete on this machine, then there is really
> > > > risk of cpu lockup when IOPS is > 200K.
> > > 
> > > Do you have a typical nvme_irq() completion time for a mid-range x86 server?
> > 
> > ~1us.
> 
> Eh, so ~ x5 faster on x86 machine?! Seems some real issue here.
> 
> > 
> > It is done via bcc script, and ebpf itself may introduce some overhead.
> > 
> 
> Can you share the script/instructions? I would like to test on my machine. I
> assume you tested on an ThunderX2.

It should have been done easier by bpftrace than bcc, however it has bug in case
of too many cpu cores on arm64.

So I uses a modified hardirqs.py to do that, you can collect the latency
histogram via funclatency too.



Thanks,
Ming

[-- Attachment #2: hardirqs.py --]
[-- Type: text/plain, Size: 5435 bytes --]

#!/usr/bin/python
# @lint-avoid-python-3-compatibility-imports
#
# hardirqs  Summarize hard IRQ (interrupt) event time.
#           For Linux, uses BCC, eBPF.
#
# USAGE: hardirqs [-h] [-T] [-N] [-C] [-d] [interval] [outputs]
#
# Thanks Amer Ather for help understanding irq behavior.
#
# Copyright (c) 2015 Brendan Gregg.
# Licensed under the Apache License, Version 2.0 (the "License")
#
# 19-Oct-2015   Brendan Gregg   Created this.

from __future__ import print_function
from bcc import BPF
from time import sleep, strftime
import argparse

# arguments
examples = """examples:
    ./hardirqs            # sum hard irq event time
    ./hardirqs -d         # show hard irq event time as histograms
    ./hardirqs 1 10       # print 1 second summaries, 10 times
    ./hardirqs -NT 1      # 1s summaries, nanoseconds, and timestamps
"""
parser = argparse.ArgumentParser(
    description="Summarize hard irq event time as histograms",
    formatter_class=argparse.RawDescriptionHelpFormatter,
    epilog=examples)
parser.add_argument("-T", "--timestamp", action="store_true",
    help="include timestamp on output")
parser.add_argument("-N", "--nanoseconds", action="store_true",
    help="output in nanoseconds")
parser.add_argument("-C", "--count", action="store_true",
    help="show event counts instead of timing")
parser.add_argument("-d", "--dist", action="store_true",
    help="show distributions as histograms")
parser.add_argument("interval", nargs="?", default=99999999,
    help="output interval, in seconds")
parser.add_argument("outputs", nargs="?", default=99999999,
    help="number of outputs")
parser.add_argument("--ebpf", action="store_true",
    help=argparse.SUPPRESS)
args = parser.parse_args()
countdown = int(args.outputs)
if args.count and (args.dist or args.nanoseconds):
    print("The --count option can't be used with time-based options")
    exit()
if args.count:
    factor = 1
    label = "count"
elif args.nanoseconds:
    factor = 1
    label = "nsecs"
else:
    factor = 1000
    label = "usecs"
debug = 0

# define BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/irq.h>
#include <linux/irqdesc.h>
#include <linux/interrupt.h>

struct irq_key {
    u64 id;
    u64 slot;
};

BPF_HISTOGRAM(irq_rqs, struct irq_key, 512);

struct irq_info {
    u32 irq;
    u32 rqs;
    u64 start;
};

//at most 256 CPUs
//BPF_ARRAY(curr_irq, struct irq_info, 256);
BPF_HASH(curr_irq, u32, struct irq_info);

struct irq_sum_info {
    u64 total;
    u64 cnt;
    u64 rq_cnt;
};
BPF_HASH(irq, u32, struct irq_sum_info);

// time IRQ
int trace_start(struct pt_regs *ctx, struct irq_desc *desc)
{
    struct irq_info __info = {};
    struct irq_info *info = &__info;
    u32 id = bpf_get_smp_processor_id();

    info->irq = desc->irq_data.irq;
    info->start = bpf_ktime_get_ns();
    info->rqs = 0;

    curr_irq.update(&id, &__info);

    return 0;
}

int trace_completion(struct pt_regs *ctx)
{
    u32 irq_no;
    struct irq_sum_info *sum;
    struct irq_sum_info zero = {}; 
    u64 delta;
    struct irq_desc *desc = (struct irq_desc *)PT_REGS_PARM1(ctx);
    u32 id = bpf_get_smp_processor_id();
    struct irq_info *info = curr_irq.lookup(&id);

    if (!info)
        return 0;

    irq_no = info->irq;
    delta = bpf_ktime_get_ns() - info->start;

    if (irq_no != desc->irq_data.irq)
        bpf_trace_printk("irq not match: %d %d\\n", irq_no, desc->irq_data.irq);
    
    sum = irq.lookup_or_init(&irq_no, &zero);
    if (sum) {
        sum->total += delta;
        sum->cnt++;
        sum->rq_cnt += info->rqs;
    }

    struct irq_key ikey = {.id = irq_no, .slot = info->rqs};
    irq_rqs.increment(ikey);
    curr_irq.delete(&id);

    return 0;
}
int trace_rq(struct pt_regs *ctx, struct request *rq)
{
    u32 id = bpf_get_smp_processor_id();
    struct irq_info *info = curr_irq.lookup(&id);

    if (!info)
        return 0;

    info->rqs++;

    return 0;
}
"""

# code substitutions
if debug or args.ebpf:
    print(bpf_text)
    if args.ebpf:
        exit()

# load BPF program
b = BPF(text=bpf_text)

# these should really use irq:irq_handler_entry/exit tracepoints:
b.attach_kprobe(event="handle_irq_event_percpu", fn_name="trace_start")
b.attach_kretprobe(event="handle_irq_event_percpu", fn_name="trace_completion")
b.attach_kprobe(event="blk_mq_complete_request", fn_name="trace_rq")
print("Tracing hard irq event time... Hit Ctrl-C to end.")

# output
exiting = 0 if args.interval else 1
dist = b.get_table("irq")
irq_rqs = b.get_table("irq_rqs")

if args.timestamp:
    print("start time %-8s\n" % strftime("%H:%M:%S"), end="")
while (1):
    try:
        sleep(int(args.interval))
    except KeyboardInterrupt:
        exiting = 1

    print()
    if args.timestamp:
        print("end time %-8s\n" % strftime("%H:%M:%S"), end="")

    if args.dist:
        dist.print_log2_hist(label, "hardirq")
    else:
        print("%-10s %11s %11s %11s %11s %11s" % ("HARDIRQ", "TOTAL_TIME", "TOTAL_COUNT",
            "AVG_TIME", "RQS", "AVG_RQS"))
        for k, v in sorted(dist.items(), key=lambda dist: dist[1].total):
            print("%-10d %11d %11d %11d %11d %11d" % (k.value, v.total / factor, v.cnt,
                v.total / (factor * v.cnt),
                v.rq_cnt,
                v.rq_cnt / v.cnt))
    dist.clear()

    irq_rqs.print_linear_hist("rqs", "irq")
    irq_rqs.clear()

    countdown -= 1
    if exiting or countdown == 0:
        exit()

  reply	other threads:[~2020-01-04 12:04 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-12-06 14:35 [PATCH RFC 0/1] Threaded handler uses irq affinity for when the interrupt is managed John Garry
2019-12-06 14:35 ` [PATCH RFC 1/1] genirq: Make threaded handler use irq affinity for managed interrupt John Garry
2019-12-06 15:22   ` Marc Zyngier
2019-12-06 16:16     ` John Garry
2019-12-07  8:03   ` Ming Lei
2019-12-09 14:30     ` John Garry
2019-12-09 15:09       ` Hannes Reinecke
2019-12-09 15:17         ` Marc Zyngier
2019-12-09 15:25           ` Hannes Reinecke
2019-12-09 15:36             ` Marc Zyngier
2019-12-09 15:49           ` Qais Yousef
2019-12-09 15:55             ` Marc Zyngier
2019-12-10  1:43       ` Ming Lei
2019-12-10  9:45         ` John Garry
2019-12-10 10:06           ` Ming Lei
2019-12-10 10:28           ` Marc Zyngier
2019-12-10 10:59             ` John Garry
2019-12-10 11:36               ` Marc Zyngier
2019-12-10 12:05                 ` John Garry
2019-12-10 18:32                   ` Marc Zyngier
2019-12-11  9:41                     ` John Garry
2019-12-13 10:07                       ` John Garry
2019-12-13 10:31                         ` Marc Zyngier
2019-12-13 12:08                           ` John Garry
2019-12-14 10:59                             ` Marc Zyngier
2019-12-11 17:09         ` John Garry
2019-12-12 22:38           ` Ming Lei
2019-12-13 11:12             ` John Garry
2019-12-13 13:18               ` Ming Lei
2019-12-13 15:43                 ` John Garry
2019-12-13 17:12                   ` Ming Lei
2019-12-13 17:50                     ` John Garry
2019-12-14 13:56                   ` Marc Zyngier
2019-12-16 10:47                     ` John Garry
2019-12-16 11:40                       ` Marc Zyngier
2019-12-16 14:17                         ` John Garry
2019-12-16 18:00                           ` Marc Zyngier
2019-12-16 18:50                             ` John Garry
2019-12-20 11:30                               ` John Garry
2019-12-20 14:43                                 ` Marc Zyngier
2019-12-20 15:38                                   ` John Garry
2019-12-20 16:16                                     ` Marc Zyngier
2019-12-20 23:31                                     ` Ming Lei
2019-12-23  9:07                                       ` Marc Zyngier
2019-12-23 10:26                                         ` John Garry
2019-12-23 10:47                                           ` Marc Zyngier
2019-12-23 11:35                                             ` John Garry
2019-12-24  1:59                                             ` Ming Lei
2019-12-24 11:20                                               ` Marc Zyngier
2019-12-25  0:48                                                 ` Ming Lei
2020-01-02 10:35                                                   ` John Garry
2020-01-03  0:46                                                     ` Ming Lei
2020-01-03 10:41                                                       ` John Garry
2020-01-03 11:29                                                         ` Ming Lei
2020-01-03 11:50                                                           ` John Garry
2020-01-04 12:03                                                             ` Ming Lei [this message]
2020-05-30  7:46 ` [tip: irq/core] irqchip/gic-v3-its: Balance initial LPI affinity across CPUs tip-bot2 for Marc Zyngier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200104120349.GA3810@ming.t460p \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bigeasy@linutronix.de \
    --cc=bvanassche@acm.org \
    --cc=chenxiang66@hisilicon.com \
    --cc=hare@suse.com \
    --cc=hch@lst.de \
    --cc=john.garry@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=tglx@linutronix.de \
    --cc=yi.zhang@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.