kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH kvmtool] pci: Disable writes to Status register
@ 2022-09-08 14:42 Jean-Philippe Brucker
  2022-09-12 12:46 ` Pierre Gondois
  2022-09-13 13:52 ` Alexandru Elisei
  0 siblings, 2 replies; 5+ messages in thread
From: Jean-Philippe Brucker @ 2022-09-08 14:42 UTC (permalink / raw)
  To: will
  Cc: andre.przywara, alexandru.elisei, kvm, Jean-Philippe Brucker,
	Pierre Gondois

Although the PCI Status register only contains read-only and
write-1-to-clear bits, we currently keep anything written there, which
can confuse a guest.

The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
Clear PCI_STATUS when setting up device"), which unconditionally writes
0xffff to the Status register in order to clear pending errors. Then the
EDAC driver sees the parity status bits set and attempts to clear them
by writing 0xc100, which in turn clears the Capabilities List bit.
Later on, when the virtio-pci driver starts probing, it assumes due to
missing capabilities that the device is using the legacy transport, and
fails to setup the device because of mismatched protocol.

Filter writes to the config space, keeping only those to writable
fields. Tighten the access size check while we're at it, to prevent
overflow. This is only a small step in the right direction, not a
foolproof solution, because a guest could still write both Command and
Status registers using a single 32-bit write. More work is needed for:
* Supporting arbitrary sized writes.
* Sanitizing accesses to capabilities, which are device-specific.
* Fine-grained filtering of the Command register, where only some bits
  are writable.

Also remove the old hack that filtered accesses. It was wrong and not
properly explained in the git history, but whatever it was guarding
against should be prevented by these new checks.

Reported-by: Pierre Gondois <pierre.gondois@arm.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
Note that the issue described here only shows up during ACPI boot for
me, because edac_init() happens after PCI enumeration. With DT boot,
edac_pci_clear_parity_errors() runs earlier and doesn't find any device.
---
 pci.c | 41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

diff --git a/pci.c b/pci.c
index a769ae27..84dc7d1d 100644
--- a/pci.c
+++ b/pci.c
@@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm,
 	pci_activate_bar_regions(kvm, old_addr, bar_size);
 }
 
+/*
+ * Bits that are writable in the config space header.
+ * Write-1-to-clear Status bits are missing since we never set them.
+ */
+static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = {
+	[PCI_COMMAND] =
+		PCI_COMMAND_IO |
+		PCI_COMMAND_MEMORY |
+		PCI_COMMAND_MASTER |
+		PCI_COMMAND_PARITY,
+	[PCI_COMMAND + 1] =
+		(PCI_COMMAND_SERR |
+		 PCI_COMMAND_INTX_DISABLE) >> 8,
+	[PCI_INTERRUPT_LINE] = 0xff,
+	[PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff,
+	[PCI_CACHE_LINE_SIZE] = 0xff,
+};
+
 void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
 {
 	void *base;
@@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
 	u16 offset;
 	struct pci_device_header *pci_hdr;
 	u8 dev_num = addr.device_number;
-	u32 value = 0;
+	u32 value = 0, mask = 0;
 
 	if (!pci_device_exists(addr.bus_number, dev_num, 0))
 		return;
@@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
 	if (pci_hdr->cfg_ops.write)
 		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
 
-	/*
-	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
-	 * Not very nice but has been working so far.
-	 */
-	if (*(u32 *)(base + offset) == 0)
-		return;
+	/* We don't sanity-check capabilities for the moment */
+	if (offset < PCI_STD_HEADER_SIZEOF) {
+		memcpy(&mask, pci_config_writable + offset, size);
+		if (!mask)
+			return;
+	}
 
 	if (offset == PCI_COMMAND) {
 		memcpy(&value, data, size);
@@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
 	cfg_addr.w		= (u32)addr;
 	cfg_addr.enable_bit	= 1;
 
-	if (len > 4)
-		len = 4;
+	/*
+	 * "Root Complex implementations are not required to support the
+	 * generation of Configuration Requests from accesses that cross DW
+	 * [4 bytes] boundaries."
+	 */
+	if ((addr & 3) + len > 4)
+		return;
 
 	if (is_write)
 		pci__config_wr(kvm, cfg_addr, data, len);
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH kvmtool] pci: Disable writes to Status register
  2022-09-08 14:42 [PATCH kvmtool] pci: Disable writes to Status register Jean-Philippe Brucker
@ 2022-09-12 12:46 ` Pierre Gondois
  2022-09-13 13:52 ` Alexandru Elisei
  1 sibling, 0 replies; 5+ messages in thread
From: Pierre Gondois @ 2022-09-12 12:46 UTC (permalink / raw)
  To: Jean-Philippe Brucker, will; +Cc: andre.przywara, alexandru.elisei, kvm

Thanks for the fix Jean-Philippe:
Tested-by: Pierre Gondois <pierre.gondois@arm.com>

On 9/8/22 16:42, Jean-Philippe Brucker wrote:
> Although the PCI Status register only contains read-only and
> write-1-to-clear bits, we currently keep anything written there, which
> can confuse a guest.
> 
> The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
> Clear PCI_STATUS when setting up device"), which unconditionally writes
> 0xffff to the Status register in order to clear pending errors. Then the
> EDAC driver sees the parity status bits set and attempts to clear them
> by writing 0xc100, which in turn clears the Capabilities List bit.
> Later on, when the virtio-pci driver starts probing, it assumes due to
> missing capabilities that the device is using the legacy transport, and
> fails to setup the device because of mismatched protocol.
> 
> Filter writes to the config space, keeping only those to writable
> fields. Tighten the access size check while we're at it, to prevent
> overflow. This is only a small step in the right direction, not a
> foolproof solution, because a guest could still write both Command and
> Status registers using a single 32-bit write. More work is needed for:
> * Supporting arbitrary sized writes.
> * Sanitizing accesses to capabilities, which are device-specific.
> * Fine-grained filtering of the Command register, where only some bits
>    are writable.
> 
> Also remove the old hack that filtered accesses. It was wrong and not
> properly explained in the git history, but whatever it was guarding
> against should be prevented by these new checks.
> 
> Reported-by: Pierre Gondois <pierre.gondois@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> Note that the issue described here only shows up during ACPI boot for
> me, because edac_init() happens after PCI enumeration. With DT boot,
> edac_pci_clear_parity_errors() runs earlier and doesn't find any device.
> ---
>   pci.c | 41 ++++++++++++++++++++++++++++++++---------
>   1 file changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/pci.c b/pci.c
> index a769ae27..84dc7d1d 100644
> --- a/pci.c
> +++ b/pci.c
> @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm,
>   	pci_activate_bar_regions(kvm, old_addr, bar_size);
>   }
>   
> +/*
> + * Bits that are writable in the config space header.
> + * Write-1-to-clear Status bits are missing since we never set them.
> + */
> +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = {
> +	[PCI_COMMAND] =
> +		PCI_COMMAND_IO |
> +		PCI_COMMAND_MEMORY |
> +		PCI_COMMAND_MASTER |
> +		PCI_COMMAND_PARITY,
> +	[PCI_COMMAND + 1] =
> +		(PCI_COMMAND_SERR |
> +		 PCI_COMMAND_INTX_DISABLE) >> 8,
> +	[PCI_INTERRUPT_LINE] = 0xff,
> +	[PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff,
> +	[PCI_CACHE_LINE_SIZE] = 0xff,
> +};
> +
>   void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
>   {
>   	void *base;
> @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
>   	u16 offset;
>   	struct pci_device_header *pci_hdr;
>   	u8 dev_num = addr.device_number;
> -	u32 value = 0;
> +	u32 value = 0, mask = 0;
>   
>   	if (!pci_device_exists(addr.bus_number, dev_num, 0))
>   		return;
> @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
>   	if (pci_hdr->cfg_ops.write)
>   		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
>   
> -	/*
> -	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
> -	 * Not very nice but has been working so far.
> -	 */
> -	if (*(u32 *)(base + offset) == 0)
> -		return;
> +	/* We don't sanity-check capabilities for the moment */
> +	if (offset < PCI_STD_HEADER_SIZEOF) {
> +		memcpy(&mask, pci_config_writable + offset, size);
> +		if (!mask)
> +			return;
> +	}
>   
>   	if (offset == PCI_COMMAND) {
>   		memcpy(&value, data, size);
> @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
>   	cfg_addr.w		= (u32)addr;
>   	cfg_addr.enable_bit	= 1;
>   
> -	if (len > 4)
> -		len = 4;
> +	/*
> +	 * "Root Complex implementations are not required to support the
> +	 * generation of Configuration Requests from accesses that cross DW
> +	 * [4 bytes] boundaries."
> +	 */
> +	if ((addr & 3) + len > 4)
> +		return;
>   
>   	if (is_write)
>   		pci__config_wr(kvm, cfg_addr, data, len);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH kvmtool] pci: Disable writes to Status register
  2022-09-08 14:42 [PATCH kvmtool] pci: Disable writes to Status register Jean-Philippe Brucker
  2022-09-12 12:46 ` Pierre Gondois
@ 2022-09-13 13:52 ` Alexandru Elisei
  2022-09-13 18:40   ` Jean-Philippe Brucker
  1 sibling, 1 reply; 5+ messages in thread
From: Alexandru Elisei @ 2022-09-13 13:52 UTC (permalink / raw)
  To: Jean-Philippe Brucker; +Cc: will, andre.przywara, kvm, Pierre Gondois

Hi,

On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote:
> Although the PCI Status register only contains read-only and
> write-1-to-clear bits, we currently keep anything written there, which
> can confuse a guest.
> 
> The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
> Clear PCI_STATUS when setting up device"), which unconditionally writes
> 0xffff to the Status register in order to clear pending errors. Then the
> EDAC driver sees the parity status bits set and attempts to clear them
> by writing 0xc100, which in turn clears the Capabilities List bit.
> Later on, when the virtio-pci driver starts probing, it assumes due to
> missing capabilities that the device is using the legacy transport, and
> fails to setup the device because of mismatched protocol.
> 
> Filter writes to the config space, keeping only those to writable
> fields. Tighten the access size check while we're at it, to prevent
> overflow. This is only a small step in the right direction, not a
> foolproof solution, because a guest could still write both Command and
> Status registers using a single 32-bit write. More work is needed for:
> * Supporting arbitrary sized writes.
> * Sanitizing accesses to capabilities, which are device-specific.
> * Fine-grained filtering of the Command register, where only some bits
>   are writable.

I'm confused here. Why not do value &= mask to keep only those bits that
writable?

> 
> Also remove the old hack that filtered accesses. It was wrong and not
> properly explained in the git history, but whatever it was guarding
> against should be prevented by these new checks.

If I remember correctly, that was guarding against the guest kernel poking
the ROM base address register for drivers that assumed that the ROM was
always there, I vaguely remember that was the case with GPUs. Pairs with
the similar check in the vfio callback, vfio_pci_cfg_write().

> 
> Reported-by: Pierre Gondois <pierre.gondois@arm.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> Note that the issue described here only shows up during ACPI boot for
> me, because edac_init() happens after PCI enumeration. With DT boot,
> edac_pci_clear_parity_errors() runs earlier and doesn't find any device.
> ---
>  pci.c | 41 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/pci.c b/pci.c
> index a769ae27..84dc7d1d 100644
> --- a/pci.c
> +++ b/pci.c
> @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm,
>  	pci_activate_bar_regions(kvm, old_addr, bar_size);
>  }
>  
> +/*
> + * Bits that are writable in the config space header.
> + * Write-1-to-clear Status bits are missing since we never set them.
> + */
> +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = {
> +	[PCI_COMMAND] =
> +		PCI_COMMAND_IO |
> +		PCI_COMMAND_MEMORY |
> +		PCI_COMMAND_MASTER |
> +		PCI_COMMAND_PARITY,
> +	[PCI_COMMAND + 1] =
> +		(PCI_COMMAND_SERR |
> +		 PCI_COMMAND_INTX_DISABLE) >> 8,
> +	[PCI_INTERRUPT_LINE] = 0xff,
> +	[PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff,
> +	[PCI_CACHE_LINE_SIZE] = 0xff,
> +};
> +
>  void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
>  {
>  	void *base;
> @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
>  	u16 offset;
>  	struct pci_device_header *pci_hdr;
>  	u8 dev_num = addr.device_number;
> -	u32 value = 0;
> +	u32 value = 0, mask = 0;
>  
>  	if (!pci_device_exists(addr.bus_number, dev_num, 0))
>  		return;
> @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
>  	if (pci_hdr->cfg_ops.write)
>  		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
>  
> -	/*
> -	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
> -	 * Not very nice but has been working so far.
> -	 */
> -	if (*(u32 *)(base + offset) == 0)
> -		return;
> +	/* We don't sanity-check capabilities for the moment */
> +	if (offset < PCI_STD_HEADER_SIZEOF) {
> +		memcpy(&mask, pci_config_writable + offset, size);
> +		if (!mask)
> +			return;

Shouldn't this be performed before the VFIO callbacks?  Also, the vfio
callbacks still do the writes to the VFIO in-kernel PCI header, but now
kvmtool would skip those writes entirely. Shouldn't kvmtool's view of the
configuration space be identical to that of VFIO?

> +	}
>  
>  	if (offset == PCI_COMMAND) {
>  		memcpy(&value, data, size);
> @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
>  	cfg_addr.w		= (u32)addr;
>  	cfg_addr.enable_bit	= 1;
>  
> -	if (len > 4)
> -		len = 4;
> +	/*
> +	 * "Root Complex implementations are not required to support the
> +	 * generation of Configuration Requests from accesses that cross DW
> +	 * [4 bytes] boundaries."
> +	 */
> +	if ((addr & 3) + len > 4)
> +		return;

Isn't that a change in behaviour? How about:

    len = 4 - (addr & 3);

Which should conform to the spec, but still allow writes like before.

Thanks,
Alex

>  
>  	if (is_write)
>  		pci__config_wr(kvm, cfg_addr, data, len);
> -- 
> 2.37.3
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH kvmtool] pci: Disable writes to Status register
  2022-09-13 13:52 ` Alexandru Elisei
@ 2022-09-13 18:40   ` Jean-Philippe Brucker
  2022-09-14 10:48     ` Alexandru Elisei
  0 siblings, 1 reply; 5+ messages in thread
From: Jean-Philippe Brucker @ 2022-09-13 18:40 UTC (permalink / raw)
  To: Alexandru Elisei; +Cc: will, andre.przywara, kvm, Pierre Gondois

On Tue, Sep 13, 2022 at 02:52:44PM +0100, Alexandru Elisei wrote:
> Hi,
> 
> On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote:
> > Although the PCI Status register only contains read-only and
> > write-1-to-clear bits, we currently keep anything written there, which
> > can confuse a guest.
> > 
> > The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
> > Clear PCI_STATUS when setting up device"), which unconditionally writes
> > 0xffff to the Status register in order to clear pending errors. Then the
> > EDAC driver sees the parity status bits set and attempts to clear them
> > by writing 0xc100, which in turn clears the Capabilities List bit.
> > Later on, when the virtio-pci driver starts probing, it assumes due to
> > missing capabilities that the device is using the legacy transport, and
> > fails to setup the device because of mismatched protocol.
> > 
> > Filter writes to the config space, keeping only those to writable
> > fields. Tighten the access size check while we're at it, to prevent
> > overflow. This is only a small step in the right direction, not a
> > foolproof solution, because a guest could still write both Command and
> > Status registers using a single 32-bit write. More work is needed for:
> > * Supporting arbitrary sized writes.
> > * Sanitizing accesses to capabilities, which are device-specific.
> > * Fine-grained filtering of the Command register, where only some bits
> >   are writable.
> 
> I'm confused here. Why not do value &= mask to keep only those bits that
> writable?

Sure, I can add it

> 
> > 
> > Also remove the old hack that filtered accesses. It was wrong and not
> > properly explained in the git history, but whatever it was guarding
> > against should be prevented by these new checks.
> 
> If I remember correctly, that was guarding against the guest kernel poking
> the ROM base address register for drivers that assumed that the ROM was
> always there, I vaguely remember that was the case with GPUs. Pairs with
> the similar check in the vfio callback, vfio_pci_cfg_write().

Right, makes sense. I think that's what I assumed when rewriting
pci__config_wr() hence the current comment but the original commits didn't
say anything about it.

> 
> > 
> > Reported-by: Pierre Gondois <pierre.gondois@arm.com>
> > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > ---
> > Note that the issue described here only shows up during ACPI boot for
> > me, because edac_init() happens after PCI enumeration. With DT boot,
> > edac_pci_clear_parity_errors() runs earlier and doesn't find any device.
> > ---
> >  pci.c | 41 ++++++++++++++++++++++++++++++++---------
> >  1 file changed, 32 insertions(+), 9 deletions(-)
> > 
> > diff --git a/pci.c b/pci.c
> > index a769ae27..84dc7d1d 100644
> > --- a/pci.c
> > +++ b/pci.c
> > @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm,
> >  	pci_activate_bar_regions(kvm, old_addr, bar_size);
> >  }
> >  
> > +/*
> > + * Bits that are writable in the config space header.
> > + * Write-1-to-clear Status bits are missing since we never set them.
> > + */
> > +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = {
> > +	[PCI_COMMAND] =
> > +		PCI_COMMAND_IO |
> > +		PCI_COMMAND_MEMORY |
> > +		PCI_COMMAND_MASTER |
> > +		PCI_COMMAND_PARITY,
> > +	[PCI_COMMAND + 1] =
> > +		(PCI_COMMAND_SERR |
> > +		 PCI_COMMAND_INTX_DISABLE) >> 8,
> > +	[PCI_INTERRUPT_LINE] = 0xff,
> > +	[PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff,
> > +	[PCI_CACHE_LINE_SIZE] = 0xff,
> > +};
> > +
> >  void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
> >  {
> >  	void *base;
> > @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
> >  	u16 offset;
> >  	struct pci_device_header *pci_hdr;
> >  	u8 dev_num = addr.device_number;
> > -	u32 value = 0;
> > +	u32 value = 0, mask = 0;
> >  
> >  	if (!pci_device_exists(addr.bus_number, dev_num, 0))
> >  		return;
> > @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
> >  	if (pci_hdr->cfg_ops.write)
> >  		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
> >  
> > -	/*
> > -	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
> > -	 * Not very nice but has been working so far.
> > -	 */
> > -	if (*(u32 *)(base + offset) == 0)
> > -		return;
> > +	/* We don't sanity-check capabilities for the moment */
> > +	if (offset < PCI_STD_HEADER_SIZEOF) {
> > +		memcpy(&mask, pci_config_writable + offset, size);
> > +		if (!mask)
> > +			return;
> 
> Shouldn't this be performed before the VFIO callbacks?

Yes I think I can move it up

> Also, the vfio callbacks still do the writes to the VFIO in-kernel PCI
> header, but now kvmtool would skip those writes entirely. Shouldn't
> kvmtool's view of the configuration space be identical to that of VFIO?

VFIO also skips writes to read-only fields, so they should now be more in
sync than before :) But their views are already desynchronized, because
kvmtool doesn't read back the config space virtualized by VFIO after
writing to it. We should probably improve it, but that's also for a future
patch.

> 
> > +	}
> >  
> >  	if (offset == PCI_COMMAND) {
> >  		memcpy(&value, data, size);
> > @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
> >  	cfg_addr.w		= (u32)addr;
> >  	cfg_addr.enable_bit	= 1;
> >  
> > -	if (len > 4)
> > -		len = 4;
> > +	/*
> > +	 * "Root Complex implementations are not required to support the
> > +	 * generation of Configuration Requests from accesses that cross DW
> > +	 * [4 bytes] boundaries."
> > +	 */
> > +	if ((addr & 3) + len > 4)
> > +		return;
> 
> Isn't that a change in behaviour?

Yes, but it should be safe to change since it is implementation defined.
According to the spec 64-bit config space writes through ECAM are not
expected to work and I find the old behaviour, truncating the write, worse
than rejecting the whole thing. It looks like at least linux, freebsd,
u-boot and edk2 don't issue 64-bit writes.

Thanks,
Jean

> How about:
> 
>     len = 4 - (addr & 3);
> 
> Which should conform to the spec, but still allow writes like before.
> 
> Thanks,
> Alex
> 
> >  
> >  	if (is_write)
> >  		pci__config_wr(kvm, cfg_addr, data, len);
> > -- 
> > 2.37.3
> > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH kvmtool] pci: Disable writes to Status register
  2022-09-13 18:40   ` Jean-Philippe Brucker
@ 2022-09-14 10:48     ` Alexandru Elisei
  0 siblings, 0 replies; 5+ messages in thread
From: Alexandru Elisei @ 2022-09-14 10:48 UTC (permalink / raw)
  To: Jean-Philippe Brucker; +Cc: will, andre.przywara, kvm, Pierre Gondois

Hi,

On Tue, Sep 13, 2022 at 07:40:02PM +0100, Jean-Philippe Brucker wrote:
> On Tue, Sep 13, 2022 at 02:52:44PM +0100, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Thu, Sep 08, 2022 at 03:42:09PM +0100, Jean-Philippe Brucker wrote:
> > > Although the PCI Status register only contains read-only and
> > > write-1-to-clear bits, we currently keep anything written there, which
> > > can confuse a guest.
> > > 
> > > The problem was highlighted by recent Linux commit 6cd514e58f12 ("PCI:
> > > Clear PCI_STATUS when setting up device"), which unconditionally writes
> > > 0xffff to the Status register in order to clear pending errors. Then the
> > > EDAC driver sees the parity status bits set and attempts to clear them
> > > by writing 0xc100, which in turn clears the Capabilities List bit.
> > > Later on, when the virtio-pci driver starts probing, it assumes due to
> > > missing capabilities that the device is using the legacy transport, and
> > > fails to setup the device because of mismatched protocol.
> > > 
> > > Filter writes to the config space, keeping only those to writable
> > > fields. Tighten the access size check while we're at it, to prevent
> > > overflow. This is only a small step in the right direction, not a
> > > foolproof solution, because a guest could still write both Command and
> > > Status registers using a single 32-bit write. More work is needed for:
> > > * Supporting arbitrary sized writes.
> > > * Sanitizing accesses to capabilities, which are device-specific.
> > > * Fine-grained filtering of the Command register, where only some bits
> > >   are writable.
> > 
> > I'm confused here. Why not do value &= mask to keep only those bits that
> > writable?
> 
> Sure, I can add it
> 
> > 
> > > 
> > > Also remove the old hack that filtered accesses. It was wrong and not
> > > properly explained in the git history, but whatever it was guarding
> > > against should be prevented by these new checks.
> > 
> > If I remember correctly, that was guarding against the guest kernel poking
> > the ROM base address register for drivers that assumed that the ROM was
> > always there, I vaguely remember that was the case with GPUs. Pairs with
> > the similar check in the vfio callback, vfio_pci_cfg_write().
> 
> Right, makes sense. I think that's what I assumed when rewriting
> pci__config_wr() hence the current comment but the original commits didn't
> say anything about it.
> 
> > 
> > > 
> > > Reported-by: Pierre Gondois <pierre.gondois@arm.com>
> > > Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> > > ---
> > > Note that the issue described here only shows up during ACPI boot for
> > > me, because edac_init() happens after PCI enumeration. With DT boot,
> > > edac_pci_clear_parity_errors() runs earlier and doesn't find any device.
> > > ---
> > >  pci.c | 41 ++++++++++++++++++++++++++++++++---------
> > >  1 file changed, 32 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/pci.c b/pci.c
> > > index a769ae27..84dc7d1d 100644
> > > --- a/pci.c
> > > +++ b/pci.c
> > > @@ -350,6 +350,24 @@ static void pci_config_bar_wr(struct kvm *kvm,
> > >  	pci_activate_bar_regions(kvm, old_addr, bar_size);
> > >  }
> > >  
> > > +/*
> > > + * Bits that are writable in the config space header.
> > > + * Write-1-to-clear Status bits are missing since we never set them.
> > > + */
> > > +static const u8 pci_config_writable[PCI_STD_HEADER_SIZEOF] = {
> > > +	[PCI_COMMAND] =
> > > +		PCI_COMMAND_IO |
> > > +		PCI_COMMAND_MEMORY |
> > > +		PCI_COMMAND_MASTER |
> > > +		PCI_COMMAND_PARITY,
> > > +	[PCI_COMMAND + 1] =
> > > +		(PCI_COMMAND_SERR |
> > > +		 PCI_COMMAND_INTX_DISABLE) >> 8,
> > > +	[PCI_INTERRUPT_LINE] = 0xff,
> > > +	[PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_5 + 3] = 0xff,
> > > +	[PCI_CACHE_LINE_SIZE] = 0xff,
> > > +};
> > > +
> > >  void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data, int size)
> > >  {
> > >  	void *base;
> > > @@ -357,7 +375,7 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
> > >  	u16 offset;
> > >  	struct pci_device_header *pci_hdr;
> > >  	u8 dev_num = addr.device_number;
> > > -	u32 value = 0;
> > > +	u32 value = 0, mask = 0;
> > >  
> > >  	if (!pci_device_exists(addr.bus_number, dev_num, 0))
> > >  		return;
> > > @@ -368,12 +386,12 @@ void pci__config_wr(struct kvm *kvm, union pci_config_address addr, void *data,
> > >  	if (pci_hdr->cfg_ops.write)
> > >  		pci_hdr->cfg_ops.write(kvm, pci_hdr, offset, data, size);
> > >  
> > > -	/*
> > > -	 * legacy hack: ignore writes to uninitialized regions (e.g. ROM BAR).
> > > -	 * Not very nice but has been working so far.
> > > -	 */
> > > -	if (*(u32 *)(base + offset) == 0)
> > > -		return;
> > > +	/* We don't sanity-check capabilities for the moment */
> > > +	if (offset < PCI_STD_HEADER_SIZEOF) {
> > > +		memcpy(&mask, pci_config_writable + offset, size);
> > > +		if (!mask)
> > > +			return;
> > 
> > Shouldn't this be performed before the VFIO callbacks?
> 
> Yes I think I can move it up
> 
> > Also, the vfio callbacks still do the writes to the VFIO in-kernel PCI
> > header, but now kvmtool would skip those writes entirely. Shouldn't
> > kvmtool's view of the configuration space be identical to that of VFIO?
> 
> VFIO also skips writes to read-only fields, so they should now be more in
> sync than before :) But their views are already desynchronized, because
> kvmtool doesn't read back the config space virtualized by VFIO after
> writing to it. We should probably improve it, but that's also for a future
> patch.

Ah, I see now. My concern was that kvmtool was still writing to the VFIO
config space, while ignoring those writes to its own instance of the config
space.

> 
> > 
> > > +	}
> > >  
> > >  	if (offset == PCI_COMMAND) {
> > >  		memcpy(&value, data, size);
> > > @@ -419,8 +437,13 @@ static void pci_config_mmio_access(struct kvm_cpu *vcpu, u64 addr, u8 *data,
> > >  	cfg_addr.w		= (u32)addr;
> > >  	cfg_addr.enable_bit	= 1;
> > >  
> > > -	if (len > 4)
> > > -		len = 4;
> > > +	/*
> > > +	 * "Root Complex implementations are not required to support the
> > > +	 * generation of Configuration Requests from accesses that cross DW
> > > +	 * [4 bytes] boundaries."
> > > +	 */
> > > +	if ((addr & 3) + len > 4)
> > > +		return;
> > 
> > Isn't that a change in behaviour?
> 
> Yes, but it should be safe to change since it is implementation defined.
> According to the spec 64-bit config space writes through ECAM are not
> expected to work and I find the old behaviour, truncating the write, worse
> than rejecting the whole thing. It looks like at least linux, freebsd,
> u-boot and edk2 don't issue 64-bit writes.

Ok, great, so this change won't break existing VMs because the guest
kernels don't do these kind of imp def writes.

Would you mind also moving this before the cfg_ops.write callback?

Thanks,
Alex

> 
> Thanks,
> Jean
> 
> > How about:
> > 
> >     len = 4 - (addr & 3);
> > 
> > Which should conform to the spec, but still allow writes like before.
> > 
> > Thanks,
> > Alex
> > 
> > >  
> > >  	if (is_write)
> > >  		pci__config_wr(kvm, cfg_addr, data, len);
> > > -- 
> > > 2.37.3
> > > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-09-14 10:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 14:42 [PATCH kvmtool] pci: Disable writes to Status register Jean-Philippe Brucker
2022-09-12 12:46 ` Pierre Gondois
2022-09-13 13:52 ` Alexandru Elisei
2022-09-13 18:40   ` Jean-Philippe Brucker
2022-09-14 10:48     ` Alexandru Elisei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).