Date: Tue, 19 Apr 2022 16:20:09 +0100
From: Alexandru Elisei
To: Will Deacon
Cc: maz@kernel.org, kvmarm@lists.cs.columbia.edu, linux-arm-kernel@lists.infradead.org
Subject: Re: KVM/arm64: SPE: Translate VA to IPA on a stage 2 fault instead of pinning VM memory
In-Reply-To: <20220419145945.GC6186@willie-the-truck>
References: <20220419141012.GB6143@willie-the-truck> <20220419145945.GC6186@willie-the-truck>

Hi,

On Tue, Apr 19, 2022 at 03:59:46PM +0100, Will Deacon wrote:
> On Tue, Apr 19, 2022 at 03:44:02PM +0100, Alexandru Elisei wrote:
> > On Tue, Apr 19, 2022 at 03:10:13PM +0100, Will Deacon wrote:
> > > On Tue, Apr 19, 2022 at 02:51:05PM +0100, Alexandru Elisei wrote:
> > > > 2. The stage 2 fault is reported asynchronously via an interrupt, which
> > > > means there will be a window where profiling is stopped from the moment
> > > > SPE triggers the fault to the moment the PE takes the interrupt. This
> > > > blackout window is obviously not present when running on bare metal, as
> > > > there is no second stage of address translation being performed.
> > >
> > > Are these faults actually recoverable? My memory is a bit hazy here, but I
> > > thought SPE buffer data could be written out in whacky ways such that even
> > > a bog-standard page fault could result in unrecoverable data loss (i.e.
> > > DL=1), and so pinning is the only game in town.
> >
> > Ah, I forgot about that, I think you're right (ARM DDI 0487H.a, page
> > D10-5177):
> >
> > "The architecture does not require that a sample record is written
> > sequentially by the SPU, only that:
> > [..]
> > - On a Profiling Buffer management interrupt, PMBSR_EL1.DL indicates
> > whether PMBPTR_EL1 points to the first byte after the last complete
> > sample record.
> > - On an MMU fault or synchronous External abort, PMBPTR_EL1 serves as a
> > Fault Address Register."
> >
> > and (page D10-5179):
> >
> > "If a write to the Profiling Buffer generates a fault and PMBSR_EL1.S is 0,
> > then a Profiling Buffer management event is generated:
> > [..]
> > - If PMBPTR_EL1 is not the address of the first byte after the last
> > complete sample record written by the SPU, then PMBSR_EL1.DL is set to 1.
> > Otherwise, PMBSR_EL1.DL is unchanged."
> >
> > Since there is no way to know the record size (well, unless
> > PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align, but that's not an architectural
> > requirement), KVM cannot restore the write pointer to the address of the
> > last complete record + 1 to let the guest resume profiling without
> > corrupted records.
> >
> > > A funkier approach might be to defer pinning of the buffer until the SPE is
> > > enabled and avoid pinning all of VM memory that way, although I can't
> > > immediately tell how flexible the architecture is in allowing you to cache
> > > the base/limit values.
> >
> > A guest can use this to pin the VM memory (or a significant part of it),
> > either by doing it on purpose or by allocating new buffers as they get
> > full. This will probably result in KVM killing the VM if the pinned memory
> > exceeds ulimit's max locked memory, which I believe is going to be a bad
> > experience for a user caught unaware. Unless we don't want KVM to take
> > ulimit into account when pinning the memory, which as far as I can tell
> > goes against KVM's approach so far.
>
> Yeah, it gets pretty messy and ulimit definitely needs to be taken into
> account, as it is today.
>
> That said, we could just continue if the pinning fails and the guest gets to
> keep the pieces if we get a stage-2 fault -- putting the device into an
> error state and re-injecting the interrupt should cause the perf session in
> the guest to fail gracefully.
> I don't think the complexity is necessarily worth it, but pinning all of
> guest memory is really crap so it's worth thinking about alternatives.

On the subject of pinning the memory when the guest enables SPE: the guest
can configure SPE to profile userspace only. Programming is done at EL1,
where SPE is disabled in that configuration, and KVM doesn't trap the ERET
to EL0, so the only sensible thing to do is to pin the memory while SPE is
still disabled. If pinning fails, how should KVM notify the guest that
something went wrong while SPE is disabled?

KVM could inject an interrupt, as those are asynchronous, and one could
(rather weakly) argue that the interrupt might have been raised because of
something that happened in the previous profiling session. But what if the
guest never enabled SPE? What if the guest is in the middle of configuring
SPE and the interrupt handler isn't even set up? Alternatively, KVM could
avoid using an interrupt to report error conditions to the guest, but then
how would the guest detect that SPE has stopped? Neither option looks
particularly appealing to me.
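Going back to the record-size point above: the only situation in which KVM
could safely roll PMBPTR_EL1 back to "last complete record + 1" is when every
sample record is exactly one buffer alignment unit, i.e. when
PMSIDR_EL1.MaxSize == PMBIDR_EL1.Align. A minimal userspace-style sketch of
that check follows; the field offsets are my reading of ARM DDI 0487H.a, and
the helper name is hypothetical, not existing KVM code:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch only (not KVM code): can the SPE write pointer be rolled back
 * to "last complete record + 1" after a stage 2 fault?
 *
 * Per the spec quotes above, this is only knowable when every sample
 * record is exactly one alignment unit long, i.e. when
 * PMSIDR_EL1.MaxSize equals PMBIDR_EL1.Align. Both fields encode the
 * log2 of a byte count. Field offsets per my reading of DDI 0487H.a:
 * MaxSize is PMSIDR_EL1[15:12], Align is PMBIDR_EL1[3:0].
 */
#define PMSIDR_EL1_MAXSIZE_SHIFT 12
#define PMSIDR_EL1_MAXSIZE_MASK  0xfULL
#define PMBIDR_EL1_ALIGN_MASK    0xfULL

static bool spe_write_ptr_recoverable(uint64_t pmsidr, uint64_t pmbidr)
{
	uint64_t max_size = (pmsidr >> PMSIDR_EL1_MAXSIZE_SHIFT) &
			    PMSIDR_EL1_MAXSIZE_MASK;
	uint64_t align = pmbidr & PMBIDR_EL1_ALIGN_MASK;

	/* Records can only straddle a fault if they can span >1 unit. */
	return max_size == align;
}
```

When the check fails, the fallback discussed above is all that's left:
report the buffer management event with PMBSR_EL1.DL set and let the guest's
perf session fail gracefully.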
Thanks,
Alex
_______________________________________________
kvmarm mailing list
kvmarm@lists.cs.columbia.edu
https://lists.cs.columbia.edu/mailman/listinfo/kvmarm