All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lorenzo Pieralisi <lpieralisi@kernel.org>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Jason Gunthorpe <jgg@nvidia.com>,
	ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Thu, 9 Nov 2023 16:34:10 +0100	[thread overview]
Message-ID: <ZUz78gFPgMupew+m@lpieralisi> (raw)
In-Reply-To: <ZSlBOiebenPKXBY4@arm.com>

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

Updated commit log (it might be [is] too verbose) below, it should probably
be moved into a documentation file but to do that I should decouple
it from this changeset (ie a document explaining memory attributes
and error containment for ARM64 - indipendent from KVM S2 defaults).

I'd also add a Link: to the lore archive entry for reference (did not
add it in the log below).

Please let me know what's the best option here.

Thanks,
Lorenzo

-- >8 --
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overriden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings.

Actually, DEVICE_nGnRE memory type is even more problematic
than eg Normal-NC memory type in terms of faults containability
in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
architecturally, synchronous (ie that would imply that the processor
should issue at most 1 load transaction at a time - ie it can't pipeline
them - otherwise the synchronous abort semantics would break the
no-speculation attribute attached to DEVICE_XXX memory).

This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform
capabilities and the device type being assigned (ie PCIe AER/DPC
error containment and RAS architecture[3]); therefore the default
KVM device stage 2 memory attributes play no role in making device
assignment safer for a given platform (if the platform design
adheres to design guidelines outlined in [3]) and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to Normal-NC.

A different Normal memory type default at stage-2
(eg Normal Write-through) could have been chosen
instead of Normal-NC but Normal-NC was chosen
because:

- Its attributes are well-understood compared to
  other Normal memory types for MMIO mappings
- On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
  default for normal memory types. For systems implementing
  FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
  to choose the memory types are as follows:

  if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
  (XXX determined by S2 MemAttr[1:0]). This is what we have
  today (MemAttr[2:0] == 0b001) and therefore it is not described
  any further.

  if S2 MemAttr[2] == 1, there are the following options:
 
  S2 MemAttr[2:0] | Resulting mapping
  -----------------------------------------------------------------------------
  0b101           | Prevent the guest from selecting cachable memory type, it
		  | allows it to choose Device-* or Normal-NC
  0b110           | It forces write-back memory type; it breaks MMIO.
  0b111           | Allow the VM to select any memory type including cachable.
		  | It is unclear whether this is safe from a platform
		  | perspective, especially wrt uncontained failures and
		  | cacheability (eg device reclaim/reset and cache
		  | maintenance).
  ------------------------------------------------------------------------------

  - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
    identical to FEAT_S2FWB (that basically would force Normal-NC, see
    option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
    between the two

Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (ie device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered wrt MMIO transactions
according to the PCI ordering rules.

Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map devices MMIO space to user
space in a device pass-through use case the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) ordering between the userspace and
GuestOS mappings resulting stream of transactions (ie it does not cause loss of
property for either stream of transactions on its own), which is harmless
given that the userspace and GuestOS access to the device is carried
out through independent transactions streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf

WARNING: multiple messages have this Message-ID (diff)
From: Lorenzo Pieralisi <lpieralisi@kernel.org>
To: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>, Jason Gunthorpe <jgg@nvidia.com>,
	ankita@nvidia.com, maz@kernel.org, oliver.upton@linux.dev,
	aniketa@nvidia.com, cjia@nvidia.com, kwankhede@nvidia.com,
	targupta@nvidia.com, vsethi@nvidia.com, acurrid@nvidia.com,
	apopple@nvidia.com, jhubbard@nvidia.com, danw@nvidia.com,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory
Date: Thu, 9 Nov 2023 16:34:10 +0100	[thread overview]
Message-ID: <ZUz78gFPgMupew+m@lpieralisi> (raw)
In-Reply-To: <ZSlBOiebenPKXBY4@arm.com>

On Fri, Oct 13, 2023 at 02:08:10PM +0100, Catalin Marinas wrote:

[...]

> > > Things can go wrong but that's not because Device does anything better.
> > > Given the RAS implementation, external aborts caused on Device memory
> > > (e.g. wrong size access) is uncontainable. For Normal NC it can be
> > > contained (I can dig out the reasoning behind this if you want, IIUC
> > > something to do with not being able to cancel an already issued Device
> > > access since such accesses don't allow speculation due to side-effects;
> > > for Normal NC, it's just about the software not getting the data).
> > 
> > I really think these details belong in the commit message.
> 
> I guess another task for Lorenzo ;).

Updated commit log (it might be [is] too verbose) below, it should probably
be moved into a documentation file but to do that I should decouple
it from this changeset (ie a document explaining memory attributes
and error containment for ARM64 - indipendent from KVM S2 defaults).

I'd also add a Link: to the lore archive entry for reference (did not
add it in the log below).

Please let me know what's the best option here.

Thanks,
Lorenzo

-- >8 --
Currently, KVM for ARM64 maps at stage 2 memory that is
considered device (ie it is not RAM) with DEVICE_nGnRE
memory attributes; this setting overrides (as per the ARM
architecture [1]) any device MMIO mapping present at stage
1, resulting in a set-up whereby a guest operating system
can't determine device MMIO mapping memory attributes on its
own but it is always overriden by the KVM stage 2 default.

This set-up does not allow guest operating systems to select
device memory attributes on a page by page basis independently
from KVM stage-2 mappings (refer to [1], "Combining stage 1 and stage
2 memory type attributes"), which turns out to be an issue in that
guest operating systems (eg Linux) may request to map
devices MMIO regions with memory attributes that guarantee
better performance (eg gathering attribute - that for some
devices can generate larger PCIe memory writes TLPs)
and specific operations (eg unaligned transactions) such as
the NormalNC memory type.

The default device stage 2 mapping was chosen in KVM
for ARM64 since it was considered safer (ie it would
not allow guests to trigger uncontained failures
ultimately crashing the machine) but this turned out
to be imprecise.

Failures containability is a property of the platform
and is independent from the memory type used for MMIO
device memory mappings.

Actually, DEVICE_nGnRE memory type is even more problematic
than eg Normal-NC memory type in terms of faults containability
in that eg aborts triggered on DEVICE_nGnRE loads cannot be made,
architecturally, synchronous (ie that would imply that the processor
should issue at most 1 load transaction at a time - ie it can't pipeline
them - otherwise the synchronous abort semantics would break the
no-speculation attribute attached to DEVICE_XXX memory).

This means that regardless of the combined stage1+stage2 mappings a
platform is safe if and only if device transactions cannot trigger
uncontained failures and that in turn relies on platform
capabilities and the device type being assigned (ie PCIe AER/DPC
error containment and RAS architecture[3]); therefore the default
KVM device stage 2 memory attributes play no role in making device
assignment safer for a given platform (if the platform design
adheres to design guidelines outlined in [3]) and therefore can
be relaxed.

For all these reasons, relax the KVM stage 2 device
memory attributes from DEVICE_nGnRE to Normal-NC.

A different Normal memory type default at stage-2
(eg Normal Write-through) could have been chosen
instead of Normal-NC but Normal-NC was chosen
because:

- Its attributes are well-understood compared to
  other Normal memory types for MMIO mappings
- On systems implementing S2FWB (FEAT_S2FWB), that's the only sensible
  default for normal memory types. For systems implementing
  FEAT_S2FWB (see [1] D8.5.5 S2=stage-2 - S2 MemAttr[3:0]), the options
  to choose the memory types are as follows:

  if S2 MemAttr[2] == 0, the mapping defaults to DEVICE_XXX
  (XXX determined by S2 MemAttr[1:0]). This is what we have
  today (MemAttr[2:0] == 0b001) and therefore it is not described
  any further.

  if S2 MemAttr[2] == 1, there are the following options:
 
  S2 MemAttr[2:0] | Resulting mapping
  -----------------------------------------------------------------------------
  0b101           | Prevent the guest from selecting cachable memory type, it
		  | allows it to choose Device-* or Normal-NC
  0b110           | It forces write-back memory type; it breaks MMIO.
  0b111           | Allow the VM to select any memory type including cachable.
		  | It is unclear whether this is safe from a platform
		  | perspective, especially wrt uncontained failures and
		  | cacheability (eg device reclaim/reset and cache
		  | maintenance).
  ------------------------------------------------------------------------------

  - For !FEAT_S2FWB systems, it is logical to choose a default S2 mapping
    identical to FEAT_S2FWB (that basically would force Normal-NC, see
    option S2 MemAttr[2:0] == 0b101 above), to guarantee a coherent approach
    between the two

Relaxing S2 KVM device MMIO mappings to Normal-NC is not expected to
trigger any issue on guest device reclaim use cases either (ie device
MMIO unmap followed by a device reset) at least for PCIe devices, in that
in PCIe a device reset is architected and carried out through PCI config
space transactions that are naturally ordered wrt MMIO transactions
according to the PCI ordering rules.

Having Normal-NC S2 default puts guests in control (thanks to
stage1+stage2 combined memory attributes rules [1]) of device MMIO
regions memory mappings, according to the rules described in [1]
and summarized here ([(S1) - stage1], [(S2) - stage 2]):

S1	     |  S2	     | Result
NORMAL-WB    |  NORMAL-NC    | NORMAL-NC
NORMAL-WT    |  NORMAL-NC    | NORMAL-NC
NORMAL-NC    |  NORMAL-NC    | NORMAL-NC
DEVICE<attr> |  NORMAL-NC    | DEVICE<attr>

It is worth noting that currently, to map devices MMIO space to user
space in a device pass-through use case the VFIO framework applies memory
attributes derived from pgprot_noncached() settings applied to VMAs, which
result in device-nGnRnE memory attributes for the stage-1 VMM mappings.

This means that a userspace mapping for device MMIO space carried
out with the current VFIO framework and a guest OS mapping for the same
MMIO space may result in a mismatched alias as described in [2].

Defaulting KVM device stage-2 mappings to Normal-NC attributes does not change
anything in this respect, in that the mismatched aliases would only affect
(refer to [2] for a detailed explanation) ordering between the userspace and
GuestOS mappings resulting stream of transactions (ie it does not cause loss of
property for either stream of transactions on its own), which is harmless
given that the userspace and GuestOS access to the device is carried
out through independent transactions streams.

[1] section D8.5 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[2] section B2.8 - DDI0487_I_a_a-profile_architecture_reference_manual.pdf
[3] sections 1.7.7.3/1.8.5.2/appendix C - DEN0029H_SBSA_7.1.pdf

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  parent reply	other threads:[~2023-11-09 15:34 UTC|newest]

Thread overview: 110+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-07 18:14 [PATCH v1 0/2] KVM: arm64: support write combining and cachable IO memory in VMs ankita
2023-09-07 18:14 ` ankita
2023-09-07 18:14 ` [PATCH v1 1/2] KVM: arm64: determine memory type from VMA ankita
2023-09-07 18:14   ` ankita
2023-09-07 19:12   ` Jason Gunthorpe
2023-09-07 19:12     ` Jason Gunthorpe
2023-10-05 16:15   ` Catalin Marinas
2023-10-05 16:15     ` Catalin Marinas
2023-10-05 16:54     ` Jason Gunthorpe
2023-10-05 16:54       ` Jason Gunthorpe
2023-10-10 14:25       ` Catalin Marinas
2023-10-10 14:25         ` Catalin Marinas
2023-10-10 15:05         ` Jason Gunthorpe
2023-10-10 15:05           ` Jason Gunthorpe
2023-10-10 17:19           ` Catalin Marinas
2023-10-10 17:19             ` Catalin Marinas
2023-10-10 18:23             ` Jason Gunthorpe
2023-10-10 18:23               ` Jason Gunthorpe
2023-10-11 17:45               ` Catalin Marinas
2023-10-11 17:45                 ` Catalin Marinas
2023-10-11 18:38                 ` Jason Gunthorpe
2023-10-11 18:38                   ` Jason Gunthorpe
2023-10-12 16:16                   ` Catalin Marinas
2023-10-12 16:16                     ` Catalin Marinas
2024-03-10  3:49                     ` Ankit Agrawal
2024-03-10  3:49                       ` Ankit Agrawal
2024-03-19 13:38                       ` Jason Gunthorpe
2024-03-19 13:38                         ` Jason Gunthorpe
2023-10-23 13:20   ` Shameerali Kolothum Thodi
2023-10-23 13:20     ` Shameerali Kolothum Thodi
2023-09-07 18:14 ` [PATCH v1 2/2] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory ankita
2023-09-07 18:14   ` ankita
2023-09-08 16:40   ` Catalin Marinas
2023-09-08 16:40     ` Catalin Marinas
2023-09-11 14:57   ` Lorenzo Pieralisi
2023-09-11 14:57     ` Lorenzo Pieralisi
2023-09-11 17:20     ` Jason Gunthorpe
2023-09-11 17:20       ` Jason Gunthorpe
2023-09-13 15:26       ` Lorenzo Pieralisi
2023-09-13 15:26         ` Lorenzo Pieralisi
2023-09-13 18:54         ` Jason Gunthorpe
2023-09-13 18:54           ` Jason Gunthorpe
2023-09-26  8:31           ` Lorenzo Pieralisi
2023-09-26  8:31             ` Lorenzo Pieralisi
2023-09-26 12:25             ` Jason Gunthorpe
2023-09-26 12:25               ` Jason Gunthorpe
2023-09-26 13:52             ` Catalin Marinas
2023-09-26 13:52               ` Catalin Marinas
2023-09-26 16:12               ` Lorenzo Pieralisi
2023-09-26 16:12                 ` Lorenzo Pieralisi
2023-10-05  9:56               ` Lorenzo Pieralisi
2023-10-05  9:56                 ` Lorenzo Pieralisi
2023-10-05 11:56                 ` Jason Gunthorpe
2023-10-05 11:56                   ` Jason Gunthorpe
2023-10-05 14:08                   ` Lorenzo Pieralisi
2023-10-05 14:08                     ` Lorenzo Pieralisi
2023-10-12 12:35                 ` Will Deacon
2023-10-12 12:35                   ` Will Deacon
2023-10-12 13:20                   ` Jason Gunthorpe
2023-10-12 13:20                     ` Jason Gunthorpe
2023-10-12 14:29                     ` Lorenzo Pieralisi
2023-10-12 14:29                       ` Lorenzo Pieralisi
2023-10-12 13:53                   ` Catalin Marinas
2023-10-12 13:53                     ` Catalin Marinas
2023-10-12 14:48                     ` Will Deacon
2023-10-12 14:48                       ` Will Deacon
2023-10-12 15:44                       ` Jason Gunthorpe
2023-10-12 15:44                         ` Jason Gunthorpe
2023-10-12 16:39                         ` Will Deacon
2023-10-12 16:39                           ` Will Deacon
2023-10-12 18:36                           ` Jason Gunthorpe
2023-10-12 18:36                             ` Jason Gunthorpe
2023-10-13  9:29                             ` Will Deacon
2023-10-13  9:29                               ` Will Deacon
2023-10-12 17:26                       ` Catalin Marinas
2023-10-12 17:26                         ` Catalin Marinas
2023-10-13  9:29                         ` Will Deacon
2023-10-13  9:29                           ` Will Deacon
2023-10-13 13:08                           ` Catalin Marinas
2023-10-13 13:08                             ` Catalin Marinas
2023-10-13 13:45                             ` Jason Gunthorpe
2023-10-13 13:45                               ` Jason Gunthorpe
2023-10-19 11:07                               ` Catalin Marinas
2023-10-19 11:07                                 ` Catalin Marinas
2023-10-19 11:51                                 ` Jason Gunthorpe
2023-10-19 11:51                                   ` Jason Gunthorpe
2023-10-20 11:21                                   ` Catalin Marinas
2023-10-20 11:21                                     ` Catalin Marinas
2023-10-20 11:47                                     ` Jason Gunthorpe
2023-10-20 11:47                                       ` Jason Gunthorpe
2023-10-20 14:03                                       ` Lorenzo Pieralisi
2023-10-20 14:03                                         ` Lorenzo Pieralisi
2023-10-20 14:28                                         ` Jason Gunthorpe
2023-10-20 14:28                                           ` Jason Gunthorpe
2023-10-19 13:35                                 ` Lorenzo Pieralisi
2023-10-19 13:35                                   ` Lorenzo Pieralisi
2023-10-13 15:28                             ` Lorenzo Pieralisi
2023-10-13 15:28                               ` Lorenzo Pieralisi
2023-10-19 11:12                               ` Catalin Marinas
2023-10-19 11:12                                 ` Catalin Marinas
2023-11-09 15:34                             ` Lorenzo Pieralisi [this message]
2023-11-09 15:34                               ` Lorenzo Pieralisi
2023-11-10 14:26                               ` Jason Gunthorpe
2023-11-10 14:26                                 ` Jason Gunthorpe
2023-11-13  0:42                                 ` Lorenzo Pieralisi
2023-11-13  0:42                                   ` Lorenzo Pieralisi
2023-11-13 17:41                               ` Catalin Marinas
2023-11-13 17:41                                 ` Catalin Marinas
2023-10-12 12:27   ` Will Deacon
2023-10-12 12:27     ` Will Deacon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZUz78gFPgMupew+m@lpieralisi \
    --to=lpieralisi@kernel.org \
    --cc=acurrid@nvidia.com \
    --cc=aniketa@nvidia.com \
    --cc=ankita@nvidia.com \
    --cc=apopple@nvidia.com \
    --cc=catalin.marinas@arm.com \
    --cc=cjia@nvidia.com \
    --cc=danw@nvidia.com \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=kvmarm@lists.linux.dev \
    --cc=kwankhede@nvidia.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=targupta@nvidia.com \
    --cc=vsethi@nvidia.com \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.