KVM Archive on lore.kernel.org
 help / color / Atom feed
From: Zenghui Yu <yuzenghui@huawei.com>
To: James Morse <james.morse@arm.com>
Cc: <linux-arm-kernel@lists.infradead.org>,
	<kvmarm@lists.cs.columbia.edu>, <kvm@vger.kernel.org>,
	<linux-perf-users@vger.kernel.org>, <christoffer.dall@arm.com>,
	<marc.zyngier@arm.com>, <acme@redhat.com>, <peterz@infradead.org>,
	<mingo@redhat.com>, <ganapatrao.kulkarni@cavium.com>,
	<catalin.marinas@arm.com>, <will.deacon@arm.com>,
	<mark.rutland@arm.com>, <acme@kernel.org>,
	<alexander.shishkin@linux.intel.com>, <jolsa@redhat.com>,
	<namhyung@kernel.org>, <wanghaibin.wang@huawei.com>,
	<xiexiangyou@huawei.com>, <linuxarm@huawei.com>
Subject: Re: [PATCH v1 2/5] KVM: arm/arm64: Adjust entry/exit and trap related tracepoints
Date: Fri, 21 Jun 2019 21:25:42 +0800
Message-ID: <34bcb641-9815-cda7-69e0-98ed477e84a6@huawei.com> (raw)
In-Reply-To: <4d16d690-e93b-7b89-3251-aa4bd8489715@arm.com>

Hi James,

sorry for the late reply.

On 2019/6/17 19:19, James Morse wrote:
> Hi Zenghui,
> On 13/06/2019 12:28, Zenghui Yu wrote:
>> On 2019/6/12 20:49, James Morse wrote:
>>> On 12/06/2019 10:08, Zenghui Yu wrote:
>>>> Currently, we use trace_kvm_exit() to report exception type (e.g.,
>>>> "IRQ", "TRAP") and exception class (ESR_ELx's bit[31:26]) together.
>>>> But hardware only saves the exit class to ESR_ELx on synchronous
>>>> exceptions, not on asynchronous exceptions. When the guest exits
>>>> due to external interrupts, we will get tracing output like:
>>>>      "kvm_exit: IRQ: HSR_EC: 0x0000 (UNKNOWN), PC: 0xffff87259e30"
>>>> Obviously, "HSR_EC" here is meaningless.
>>> I assume we do it this way so there is only one guest-exit tracepoint that catches all
>>> exits.
>>> I don't think its a problem if user-space has to know the EC isn't set for asynchronous
>>> exceptions, this is a property of the architecture and anything using these trace-points
>>> is already arch specific.
>> Actually, *no* problem in current implementation, and I'm OK to still
>> keep the EC in trace_kvm_exit().  What I really want to do is adding the
>> EC in trace_trap_enter (the new tracepoint), will explain it later.
>>>> This patch splits "exit" and "trap" events by adding two tracepoints
>>>> explicitly in handle_trap_exceptions(). Let trace_kvm_exit() report VM
>>>> exit events, and trace_kvm_trap_exit() report VM trap events.
>>>> These tracepoints are adjusted also in preparation for supporting
>>>> 'perf kvm stat' on arm64.
>>> Because the existing tracepoints are ABI, I don't think we can change them.
>>> We can add new ones if there is something that a user reasonably needs to trace, and can't
>>> be done any other way.
>>> What can't 'perf kvm stat' do with the existing trace points?
>> First, how does 'perf kvm stat' interact with tracepoints?
> Start at the beginning, good idea. (I've never used this thing!)
>> We have three handlers for a specific event (e.g., "VM-EXIT") --
>> "is_begin_event", "is_end_event", "decode_key". The first two handlers
>> make use of two existing tracepoints ("kvm:kvm_exit" & "kvm:kvm_entry")
>> to check when the VM-EXIT events started/ended, thus the time difference
>> stats, event start/end time etc. can be calculated.
>> "is_begin_event" handler gets a *key* from the "ret" field (exit_code)
>> of "kvm:kvm_exit" payload, and "decode_key" handler makes use of the
>> *key* to find out the reason for the VM-EXIT event. Of course we should
>> maintain the mapping between exit_code and exit_reason in userspace.
> Interpreting 'ret' is going to get tricky if we change those values on a whim. Its
> internal to the KVM arch code.

Yes. We have to keep the definition of 'ret' and the mapping
(kvm_arm_exception_type) consistent between user side and kernel side.
kernel side: arch/arm64/include/asm/kvm_asm.h
user side:   tools/perf/arch/arm64/util/aarch64_guest_exits.h (patch #4)

Do we have a better approach?

>> These are all what *patch #4* had done, #4 is a simple patch to review!
>> Oh, we can also set "vcpu_id_str" to achieve per vcpu event record, but
>> currently, we only have the "vcpu_pc" field in "kvm:kvm_entry", without
>> something like "vcpu_id".
> Heh, so from the trace-point data, you can't know which on is vcpu-0 and which is vcpu-1.


>> OK, next comes the more important question - what should/can we do to
>> the tracepoints in preparation of 'perf kvm stat' on arm64?
>>  From the article you've provided, it's clear that we can't remove the EC
>> from trace_kvm_exit(). But can we add something like "vcpu_id" into
>> (at least) trace_kvm_entry(), just like what this patch has done?
> Adding something is still likely to break a badly written user-space that is trying to
> parse the trace information. A regex picking out the last argument will now get a
> different value.
>> If not, which means we have to keep the existing tracepoints totally
>> unchanged, then 'perf kvm stat' will have no way to record/report per
>> vcpu VM-EXIT events (other arch like X86, powerpc, s390 etc. have this
>> capability, if I understand it correctly).
> Well, you get the events, but you don't know which vCPU is which. You can map this back to
> the pid of the host thread assuming user-space isn't moving vcpu between host threads.
> If we're really stuck: Adding tracepoints to KVM-core's vcpu get/put, that export the
> vcpu_id would let you map pid->vcpu_id, which you can then use for the batch of enter/exit
> events that come before a final vcpu put.
> grepping "vpu_id" shows perf has a mapping for which arch-specific argument in enter/exit
> is the vcpu-id. Done with this core-code mapping, you could drop that code...

Yes. In the current 'perf kvm' implementation, we make use of the
"vcpu id" field (specified by vcpu_id_str, differ between arch) of
"kvm:kvm_entry" tracepoint payload, to achieve per vcpu record/report.
(Details in per_vcpu_record() in tools/perf/builtin-kvm.c)

Adding tracepoints in vcpu_load()/vcpu_put() might be a good idea, we
can get "vcpu id" information without breaking existing tracepoint ABI.
And I think other arch will benefit from this way, for those who haven't
supported 'perf kvm stat' yet and don't have "vcpu id" info in the
"kvm:kvm_entry" tracepoint.

But there is at least one problem (as I can see)... KVM will not always
do a vcpu_load()/vcpu_put(), while guest always enters/exits.
In the worst case, if KVM can always handle guest's exits, perf will not
have a chance to catch vcpu_load()/vcpu_put(). And still, we fail to get
per vcpu statistical analysis.

I've written a simple patch to prove this, attached at the end of this
mail. I run 'perf kvm stat record/report' against a 4-VCPU guest, but
only vcpu-3 will be reported, as only vcpu-3 is doing some MMIO access
and we have to return to user space (and vcpu_load() will be invoked).

I'm not sure if I (likely) missed some important points you've provided.

> But I'd be a little nervous adding a new trace-point to work around an ABI problem, as we
> may have just moved the ABI problem! (What does a user of a vcpu_put tracepoint really need?)
>> As for TRAP events, should we consider adding two new tracepoints --
>> "kvm_trap_enter" and "kvm_trap_exit", to keep tracking of the trap
>> handling process? We should also record the EC in "kvm_trap_enter", which will be used as
>> *key* in TRAP event's "is_begin_event" handler.
> The EC can't change between trace_kvm_exit() and handle_exit(), so you already have this.
> What are the 'trap' trace points needed for? You get the timing and 'exception class' from
> the guest enter/exit tracepoints. What about handle_exit() can't you work out from this?

In short, these two trap tracepoints are for patch #5.
Some users (me) may want to know, how many traps a guest has in a given
period of time, how many times a particular type of trap (e.g., Dabort)
has occurred, and how long each trap has been processed. That's what
patch #5 does, you can get these info through 'perf kvm stat report

If we use guest enter/exit tracepoints to support TRAP events, there
will be two problems:
   1) the timing becomes inaccurate (longer than the real timing)
   2) the reported 'EC' becomes "Unknown" in the case when guest exits
      due to IRQ, as we set 'esr_ec' to 0 on asynchronous exceptions,
      in trace_kvm_exit.

>> Patch #5 tells us the whole story, it's simple too.
> (I only skimmed the perf patches, I'll go back now that I know a little more about what
> you're doing)

Much appreciated :)

>> What do you suggest?
> We can explore the vcpu_load()/vcpu_put() trace idea, (it may not work for some other
> reason). I'd like to understand what the 'trap' tracepoints are needed for.

Thanks for your suggestion!

Btw, I have collected some useful links, attached here, to share with
someone who're interested in this series.

[1] the initial idea for 'perf kvm stat':
[2] a discussion on improving kvm_exit tracepoint:



Two tracepoints have already been added in vcpu_load/put().

  tools/perf/arch/arm64/util/kvm-stat.c | 2 ++
  tools/perf/builtin-kvm.c              | 7 ++++++-
  2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/tools/perf/arch/arm64/util/kvm-stat.c 
index 41b28de..a91fe7d 100644
--- a/tools/perf/arch/arm64/util/kvm-stat.c
+++ b/tools/perf/arch/arm64/util/kvm-stat.c
@@ -103,6 +103,8 @@ static void trap_event_decode_key(struct 
perf_kvm_stat *kvm __maybe_unused,
+	"kvm:vcpu_load",
+	"kvm:vcpu_put",

diff --git a/tools/perf/builtin-kvm.c b/tools/perf/builtin-kvm.c
index dbb6f73..f011fcf 100644
--- a/tools/perf/builtin-kvm.c
+++ b/tools/perf/builtin-kvm.c
@@ -81,6 +81,11 @@ bool exit_event_begin(struct perf_evsel *evsel,
  	return false;

+static bool kvm_vcpu_load_event(struct perf_evsel *evsel)
+	return !strcmp(evsel->name, "kvm:vcpu_load");
  bool kvm_entry_event(struct perf_evsel *evsel)
  	return !strcmp(evsel->name, kvm_entry_trace);
@@ -400,7 +405,7 @@ struct vcpu_event_record *per_vcpu_record(struct 
thread *thread,
  					  struct perf_sample *sample)
  	/* Only kvm_entry records vcpu id. */
-	if (!thread__priv(thread) && kvm_entry_event(evsel)) {
+	if (!thread__priv(thread) && kvm_vcpu_load_event(evsel)) {
  		struct vcpu_event_record *vcpu_record;

  		vcpu_record = zalloc(sizeof(*vcpu_record));

  reply index

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-12  9:08 [PATCH v1 0/5] perf kvm: Add stat support on arm64 Zenghui Yu
2019-06-12  9:08 ` [PATCH v1 1/5] KVM: arm/arm64: Remove kvm_mmio_emulate tracepoint Zenghui Yu
2019-06-12 12:48   ` James Morse
2019-06-13 11:20     ` Zenghui Yu
2019-06-12  9:08 ` [PATCH v1 2/5] KVM: arm/arm64: Adjust entry/exit and trap related tracepoints Zenghui Yu
2019-06-12 12:49   ` James Morse
2019-06-13 11:28     ` Zenghui Yu
2019-06-17 11:19       ` James Morse
2019-06-21 13:25         ` Zenghui Yu [this message]
2019-06-12  9:08 ` [PATCH v1 3/5] perf tools arm64: Add support for get_cpuid() function Zenghui Yu
2019-06-12  9:08 ` [PATCH v1 4/5] perf,kvm/arm64: Add stat support on arm64 Zenghui Yu
2019-06-12  9:08 ` [PATCH v1 5/5] perf,kvm/arm64: perf-kvm-stat to report VM TRAP Zenghui Yu

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=34bcb641-9815-cda7-69e0-98ed477e84a6@huawei.com \
    --to=yuzenghui@huawei.com \
    --cc=acme@kernel.org \
    --cc=acme@redhat.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=catalin.marinas@arm.com \
    --cc=christoffer.dall@arm.com \
    --cc=ganapatrao.kulkarni@cavium.com \
    --cc=james.morse@arm.com \
    --cc=jolsa@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.cs.columbia.edu \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=marc.zyngier@arm.com \
    --cc=mark.rutland@arm.com \
    --cc=mingo@redhat.com \
    --cc=namhyung@kernel.org \
    --cc=peterz@infradead.org \
    --cc=wanghaibin.wang@huawei.com \
    --cc=will.deacon@arm.com \
    --cc=xiexiangyou@huawei.com \


* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

KVM Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/kvm/0 kvm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 kvm kvm/ https://lore.kernel.org/kvm \
	public-inbox-index kvm

Example config snippet for mirrors

Newsgroup available over NNTP:

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git