[ARM][xencons] PV Console hangs due to illegal ring buffer accesses

* [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
@ 2023-07-19 16:13 Andrei Cherechesu (OSS)
  2023-07-20 10:33 ` Julien Grall
  0 siblings, 1 reply; 6+ messages in thread
From: Andrei Cherechesu (OSS) @ 2023-07-19 16:13 UTC
  To: xen-devel; +Cc: Stefano Stabellini, george.mocanu

Hello,

We are running Xen 4.17 (with platform-related support added) on NXP S32G SoCs (ARMv8), with a custom Linux distribution built through Yocto. While setting up some Xen-based demos, we encountered an issue which we think might not be related to our hardware. For additional context, the Linux kernel version we are running is 5.15.96-rt (also with platform-related support added).

The setup to reproduce the problem is fairly simple: after booting a Dom0 (we can provide configuration details if needed), we boot a normal PV DomU with PV Networking. Additionally, the VMs have k3s (Lightweight Kubernetes - version v1.25.8+k3s1: https://github.com/k3s-io/k3s/releases/tag/v1.25.8%2Bk3s1) installed in their root filesystems.

The problem is that the DomU console hangs (no new output is shown and no input can be sent) some time after we run the `k3s server` command. The delay is non-deterministic: sometimes 5 seconds, other times 15-20 seconds. We have this command running as part of a sysvinit service, and the same behavior can be observed in that case as well. The k3s version we use is the one mentioned in the paragraph above, but this can be reproduced with other versions as well (e.g., v1.21.11, v1.22.6). If the `k3s server` command is run in the Dom0 VM, everything works fine. Using DomU as an agent node also works fine; the console problem occurs only when it runs as a server.

Immediately after the serial console hangs, we can still log in to DomU using SSH, and we can observe the following messages in its dmesg:
[   57.905806] xencons: Illegal ring page indices
[   59.399620] xenbus: error -5 while reading message
[   59.399649] xenbus: error -5 while writing message
[   67.353608] xencons: Illegal ring page indices
[   78.027813] IPVS: Registered protocols (TCP, UDP, SCTP, AH, ESP)
[   78.027865] IPVS: Connection hash table configured (size=4096, memory=32Kbytes)
[   78.028038] IPVS: ipvs loaded.
[   78.065479] IPVS: [rr] scheduler registered.
[   78.071249] IPVS: [wrr] scheduler registered.
[   78.084190] IPVS: [sh] scheduler registered.

Sometimes, Xen also dumps some info about expanding the grant tables, after the DomU console becomes unresponsive:
(XEN) common/grant_table.c:1882:d2v1: Expanding d2 grant table from 5 to 6 frames
(XEN) common/grant_table.c:1882:d2v1: Expanding d2 grant table from 6 to 7 frames
(XEN) common/grant_table.c:1882:d2v1: Expanding d2 grant table from 7 to 8 frames

It seems that when spawning the k3s server process, somehow (maybe due to intensive usage) the console ring buffers and the indices used for accessing them become corrupt. But the PV networking still works fine, and the domain is reachable via SSH and can continue to process the workload.

We've not been able so far to figure out why this happens, so any help would be appreciated. If you need other Domain configuration details or any inputs from our side, let us know.

Thank you,
Andrei Cherechesu

* Re: [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
  2023-07-19 16:13 [ARM][xencons] PV Console hangs due to illegal ring buffer accesses Andrei Cherechesu (OSS)
@ 2023-07-20 10:33 ` Julien Grall
  2023-07-20 23:25   ` Stefano Stabellini
  0 siblings, 1 reply; 6+ messages in thread
From: Julien Grall @ 2023-07-20 10:33 UTC
  To: Andrei Cherechesu (OSS), xen-devel
  Cc: Stefano Stabellini, george.mocanu, Juergen Gross

(+ Juergen)

On 19/07/2023 17:13, Andrei Cherechesu (OSS) wrote:
> Hello,

Hi Andrei,

> As we're running Xen 4.17 (with platform-related support added) on NXP S32G SoCs (ARMv8), with a custom Linux distribution built through Yocto, and we've set some Xen-based demos up, we encountered some issues which we think might not be related to our hardware. For additional context, the Linux kernel version we're running is 5.15.96-rt (with platform-related support added as well).
> 
> The setup to reproduce the problem is fairly simple: after booting a Dom0 (can provide configuration details if needed), we're booting a normal PV DomU with PV Networking. Additionally, the VMs have k3s (Lightweight Kubernetes - version v1.25.8+k3s1: https://github.com/k3s-io/k3s/releases/tag/v1.25.8%2Bk3s1) installed in their rootfs'es.
> 
> The problem is that the DomU console hangs (no new output is shown, no input can be sent) some time (non-deterministic, sometimes 5 seconds, other times like 15-20 seconds) after we run the `k3s server` command. We have this command running as part of a sysvinit service, and the same behavior can be observed in that case as well. The k3s version we use is the one mentioned in the paragraph above, but this can be reproduced with other versions as well (i.e., v1.21.11, v1.22.6). If the `k3s server` command is ran in the Dom0 VM, everything works fine. Using DomU as an agent node is also working fine, only when it is run as a server the console problem occurs.
> 
> Immediately after the serial console hangs, we can still log in on DomU using SSH, and we can observe the following messages its dmesg:
> [   57.905806] xencons: Illegal ring page indices

Looking at the Linux code, this message is printed in a couple of places in
the xenconsole driver.

I would assume that this is printed when reading from the buffer 
(otherwise you would not see any message). Can you confirm it?

Also, can you provide the indices that Linux considers buggy?

Lastly, it seems like the barriers used are incorrect. They should be the
virt_*() versions rather than plain mb()/wmb(). I don't think it matters
for arm64 though (I am assuming you are not running 32-bit).

> [   59.399620] xenbus: error -5 while reading message

So this message is coming from the xenbus driver (used to read the 
xenstore ring). This is -EIO, and AFAICT returned when the indices are 
also incorrect.

For this driver, I think there is also a TOCTOU because a compiler is 
free to reload intf->rsp_cons after the check. Moving virt_mb() is 
probably not sufficient. You would also want to use ACCESS_ONCE().

What I find odd is that you have two distinct rings (xenconsole and xenbus)
with similar issues. Above, you said you are using Linux RT. I wonder if
this plays a part in the issue because, if I am not mistaken, the two
functions would now be fully preemptible.

This could expose some races. For instance, there are some missing 
ACCESS_ONCE() (as mentioned above).

In particular, Xenstored (I haven't checked xenconsoled) is using += to 
update intf->rsp_cons. There is no guarantee that the update will be atomic.
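
As a rough illustration of the two points above (reading each index exactly
once, and publishing the new consumer index with a single store instead of
+=), here is a minimal userspace sketch. It is not the xenbus or xenstored
code: struct demo_ring, demo_ring_read(), RING_SIZE and the LOAD_ONCE() /
STORE_ONCE() macros are made-up names standing in for the kernel's
READ_ONCE()/WRITE_ONCE() (the current spelling of ACCESS_ONCE()), and the
barriers a real shared ring needs are only marked by comments.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024u                       /* hypothetical, power of two */
#define MASK_IDX(idx) ((idx) & (RING_SIZE - 1))

/* Userspace stand-ins for READ_ONCE()/WRITE_ONCE(): force one access. */
#define LOAD_ONCE(x)     (*(const volatile uint32_t *)&(x))
#define STORE_ONCE(x, v) (*(volatile uint32_t *)&(x) = (v))

/* Hypothetical ring shared with another agent (e.g. a backend daemon). */
struct demo_ring {
    char     rsp[RING_SIZE];
    uint32_t rsp_cons;   /* advanced by the reader */
    uint32_t rsp_prod;   /* advanced by the writer */
};

/* Copy up to len bytes out of the ring.
 * Returns the number of bytes copied, or -1 if the indices are inconsistent. */
int demo_ring_read(struct demo_ring *r, char *dst, size_t len)
{
    /* Snapshot both indices once; every later use goes through the local
     * copies, so the compiler cannot reload them after the check. */
    uint32_t cons  = LOAD_ONCE(r->rsp_cons);
    uint32_t prod  = LOAD_ONCE(r->rsp_prod);
    uint32_t avail = prod - cons;             /* wraps correctly in 32 bits */
    uint32_t n = 0;

    if (avail > RING_SIZE)
        return -1;                            /* "illegal indices" case */

    /* A real shared ring needs a read barrier here (virt_rmb() in Linux). */
    while (n < len && n < avail) {
        dst[n] = r->rsp[MASK_IDX(cons + n)];
        n++;
    }

    /* A real shared ring needs a full barrier here (virt_mb() in Linux). */
    /* Publish the new index with one store of a locally computed value,
     * rather than a read-modify-write (+=) on the shared field. */
    STORE_ONCE(r->rsp_cons, cons + n);
    return (int)n;
}
```

The point of the sketch is only the access pattern: the check and the copy
both operate on the same local snapshot, and the shared index is written
exactly once.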

Overall, I am not 100% sure that what I wrote is related. But it is probably
a good starting list of things that can be exacerbated by Linux RT.

> [   59.399649] xenbus: error -5 while writing message

This is in xenbus as well. But this time in the write part. The analysis 
I wrote above for the read part can be applied here.

Cheers,

-- 
Julien Grall

* Re: [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
  2023-07-20 10:33 ` Julien Grall
@ 2023-07-20 23:25   ` Stefano Stabellini
  2023-07-21  8:39     ` Andrei Cherechesu (OSS)
  2023-07-21 14:28     ` George Mocanu
  0 siblings, 2 replies; 6+ messages in thread
From: Stefano Stabellini @ 2023-07-20 23:25 UTC
  To: Julien Grall
  Cc: Andrei Cherechesu (OSS),
	xen-devel, Stefano Stabellini, george.mocanu, Juergen Gross

On Thu, 20 Jul 2023, Julien Grall wrote:
> (+ Juergen)
> 
> On 19/07/2023 17:13, Andrei Cherechesu (OSS) wrote:
> > Hello,
> 
> Hi Andrei,
> 
> > As we're running Xen 4.17 (with platform-related support added) on NXP S32G
> > SoCs (ARMv8), with a custom Linux distribution built through Yocto, and
> > we've set some Xen-based demos up, we encountered some issues which we think
> > might not be related to our hardware. For additional context, the Linux
> > kernel version we're running is 5.15.96-rt (with platform-related support
> > added as well).
> > 
> > The setup to reproduce the problem is fairly simple: after booting a Dom0
> > (can provide configuration details if needed), we're booting a normal PV
> > DomU with PV Networking. Additionally, the VMs have k3s (Lightweight
> > Kubernetes - version v1.25.8+k3s1:
> > https://github.com/k3s-io/k3s/releases/tag/v1.25.8%2Bk3s1) installed in
> > their rootfs'es.
> > 
> > The problem is that the DomU console hangs (no new output is shown, no input
> > can be sent) some time (non-deterministic, sometimes 5 seconds, other times
> > like 15-20 seconds) after we run the `k3s server` command. We have this
> > command running as part of a sysvinit service, and the same behavior can be
> > observed in that case as well. The k3s version we use is the one mentioned
> > in the paragraph above, but this can be reproduced with other versions as
> > well (i.e., v1.21.11, v1.22.6). If the `k3s server` command is ran in the
> > Dom0 VM, everything works fine. Using DomU as an agent node is also working
> > fine, only when it is run as a server the console problem occurs.
> > 
> > Immediately after the serial console hangs, we can still log in on DomU
> > using SSH, and we can observe the following messages its dmesg:
> > [   57.905806] xencons: Illegal ring page indices
> 
> Looking at Linux code, this message is printed in a couple of place in the
> xenconsole driver.
> 
> I would assume that this is printed when reading from the buffer (otherwise
> you would not see any message). Can you confirm it?
> 
> Also, can you provide the indices that Linux considers buggy?
> 
> Lastly, it seems like the barrier used are incorrect. It should be the
> virt_*() version rather than a plain mb()/wmb(). I don't think it matter for
> arm64 though (I am assuming you are not running 32-bit).
> 
> > [   59.399620] xenbus: error -5 while reading message
> 
> So this message is coming from the xenbus driver (used to read the xenstore
> ring). This is -EIO, and AFAICT returned when the indices are also incorrect.
> 
> For this driver, I think there is also a TOCTOU because a compiler is free to
> reload intf->rsp_cons after the check. Moving virt_mb() is probably not
> sufficient. You would also want to use ACCESS_ONCE().
> 
> What I find odd is you have two distinct rings (xenconsole and xenbus) with
> similar issues. Above, you said you are using Linux RT. I wonder if this has a
> play into the issue because if I am not mistaken, the two functions would now
> be fully preemptible.
> 
> This could expose some races. For instance, there are some missing
> ACCESS_ONCE() (as mentioned above).
> 
> In particular, Xenstored (I haven't checked xenconsoled) is using += to update
> intf->rsp_cons. There is no guarantee that the update will be atomic.
> 
> Overall, I am not 100% sure what I wrote is related. But that's probably a
> good start of things that can be exacerbated with Linux RT.
> 
> > [   59.399649] xenbus: error -5 while writing message
> 
> This is in xenbus as well. But this time in the write part. The analysis I
> wrote above for the read part can be applied here.

This is really strange. What is also strange is that somehow the indexes
recover after 10-15 seconds. How is that even possible? Let's say there
is a memory corruption of some sort, maybe due to missing barriers like
Julien suggested: how can it go back to normal after a while?

I am really confused. I would try with regular Linux instead of Linux RT
and also would try to replace all the barriers in
drivers/tty/hvc/hvc_xen.c with their virt_* version to see if we can
narrow down the problem a bit.


Keep in mind that during PV network operations grants are used, which
involve mapping pages at the backend and changing the MMU/IOMMU
pagetables to introduce the new mapping. After the DMA operation,
typically the page is unmapped and removed from the pagetable.

Is it possible that the pagetable change is causing the problem, and
when the mapping is removed everything goes back to normal?

I don't know how that could happen, but the mapping and unmapping of the
page is something ongoing which could break things then go back to
normal. One thing you could try is to force all DMA operations to go via
swiotlb-xen in Linux:

diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
index 3d826c0b5fee..f78d86f1bb9c 100644
--- a/arch/arm/xen/mm.c
+++ b/arch/arm/xen/mm.c
@@ -112,8 +112,7 @@ bool xen_arch_need_swiotlb(struct device *dev,
         * require a bounce buffer because the device doesn't support coherent
         * memory and we are not able to flush the cache.
         */
-       return (!hypercall_cflush && (xen_pfn != bfn) &&
-               !dev_is_dma_coherent(dev));
+       return true;
 }
 
 static int __init xen_mm_init(void)


Then you can remove any iommu pagetable flushes in Xen:


diff --git a/xen/arch/arm/include/asm/grant_table.h b/xen/arch/arm/include/asm/grant_table.h
index d3c518a926..b72f8391bd 100644
--- a/xen/arch/arm/include/asm/grant_table.h
+++ b/xen/arch/arm/include/asm/grant_table.h
@@ -74,7 +74,7 @@ int replace_grant_host_mapping(uint64_t gpaddr, mfn_t frame,
     page_get_xenheap_gfn(gnttab_status_page(t, i))
 
 #define gnttab_need_iommu_mapping(d)                    \
-    (is_domain_direct_mapped(d) && is_iommu_enabled(d))
+    (0)
 
 #endif /* __ASM_GRANT_TABLE_H__ */
 /*


I don't know how this could be related but it might help narrow down the
problem.


* Re: [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
  2023-07-20 23:25   ` Stefano Stabellini
@ 2023-07-21  8:39     ` Andrei Cherechesu (OSS)
  2023-07-21 22:06       ` Stefano Stabellini
  2023-07-21 14:28     ` George Mocanu
  1 sibling, 1 reply; 6+ messages in thread
From: Andrei Cherechesu (OSS) @ 2023-07-21  8:39 UTC
  To: Stefano Stabellini, Julien Grall; +Cc: xen-devel, george.mocanu, Juergen Gross

Hello, Julien, Stefano,

Thank you for your replies.

On 21/07/2023 02:25, Stefano Stabellini wrote:
> 
> On Thu, 20 Jul 2023, Julien Grall wrote:
>> (+ Juergen)
>>
>> On 19/07/2023 17:13, Andrei Cherechesu (OSS) wrote:
>>> Hello,
>>
>> Hi Andrei,
>>
>>> As we're running Xen 4.17 (with platform-related support added) on NXP S32G
>>> SoCs (ARMv8), with a custom Linux distribution built through Yocto, and
>>> we've set some Xen-based demos up, we encountered some issues which we think
>>> might not be related to our hardware. For additional context, the Linux
>>> kernel version we're running is 5.15.96-rt (with platform-related support
>>> added as well).
>>>
>>> The setup to reproduce the problem is fairly simple: after booting a Dom0
>>> (can provide configuration details if needed), we're booting a normal PV
>>> DomU with PV Networking. Additionally, the VMs have k3s (Lightweight
>>> Kubernetes - version v1.25.8+k3s1:
>>> https://github.com/k3s-io/k3s/releases/tag/v1.25.8%2Bk3s1) installed in
>>> their rootfs'es.
>>>
>>> The problem is that the DomU console hangs (no new output is shown, no input
>>> can be sent) some time (non-deterministic, sometimes 5 seconds, other times
>>> like 15-20 seconds) after we run the `k3s server` command. We have this
>>> command running as part of a sysvinit service, and the same behavior can be
>>> observed in that case as well. The k3s version we use is the one mentioned
>>> in the paragraph above, but this can be reproduced with other versions as
>>> well (i.e., v1.21.11, v1.22.6). If the `k3s server` command is ran in the
>>> Dom0 VM, everything works fine. Using DomU as an agent node is also working
>>> fine, only when it is run as a server the console problem occurs.
>>>
>>> Immediately after the serial console hangs, we can still log in on DomU
>>> using SSH, and we can observe the following messages its dmesg:
>>> [   57.905806] xencons: Illegal ring page indices
>>
>> Looking at Linux code, this message is printed in a couple of place in the
>> xenconsole driver.
>>
>> I would assume that this is printed when reading from the buffer (otherwise
>> you would not see any message). Can you confirm it?
>>
>> Also, can you provide the indices that Linux considers buggy?
>>
>> Lastly, it seems like the barrier used are incorrect. It should be the
>> virt_*() version rather than a plain mb()/wmb(). I don't think it matter for
>> arm64 though (I am assuming you are not running 32-bit).
>>
>>> [   59.399620] xenbus: error -5 while reading message
>>
>> So this message is coming from the xenbus driver (used to read the xenstore
>> ring). This is -EIO, and AFAICT returned when the indices are also incorrect.
>>
>> For this driver, I think there is also a TOCTOU because a compiler is free to
>> reload intf->rsp_cons after the check. Moving virt_mb() is probably not
>> sufficient. You would also want to use ACCESS_ONCE().
>>
>> What I find odd is you have two distinct rings (xenconsole and xenbus) with
>> similar issues. Above, you said you are using Linux RT. I wonder if this has a
>> play into the issue because if I am not mistaken, the two functions would now
>> be fully preemptible.
>>
>> This could expose some races. For instance, there are some missing
>> ACCESS_ONCE() (as mentioned above).
>>
>> In particular, Xenstored (I haven't checked xenconsoled) is using += to update
>> intf->rsp_cons. There is no guarantee that the update will be atomic.
>>
>> Overall, I am not 100% sure what I wrote is related. But that's probably a
>> good start of things that can be exacerbated with Linux RT.
>>
>>> [   59.399649] xenbus: error -5 while writing message
>>
>> This is in xenbus as well. But this time in the write part. The analysis I
>> wrote above for the read part can be applied here.
> 
> This is really strange. What is also strange is that somehow the indexes
> recover after 10-15 seconds? How is that even possible. Let's say there
> is a memory corruption of some sort, maybe due to missing barriers like
> Julien suggested, how can it go back to normal after a while?

The console does not go back to normal. I mentioned that we get that dmesg output
after logging in to DomU via SSH, so at least the grant tables for PV Networking are not corrupted.
But the normal console is still blocked.

> 
> I am really confused. I would try with regular Linux instead of Linux RT
> and also would try to replace all the barriers in
> drivers/tty/hvc/hvc_xen.c with their virt_* version to see if we can
> narrow down the problem a bit.
> 

Unfortunately, we do not normally run regular Linux and we do not have a
stable regular Linux version with our HW support ported to it. We've been running
Linux RT since 4.14 (or even earlier, I think), but this issue only started happening
after we upgraded to Xen 4.17 (from 4.14), with both Linux RT 5.15 and 5.10.

> 
> Keep in mind that during PV network operations grants are used, which
> involve mapping pages at the backend and changing the MMU/IOMMU
> pagetables to introduce the new mapping. After the DMA operation,
> typically the page is unmapped and removed from the pagetable.
> 
> Is it possible that the pagetable change is causing the problem, and
> when the mapping is removed everything goes back to normal?
> 
> I don't know how that could happen, but the mapping and unmapping of the
> page is something ongoing which could break things then go back to
> normal. One thing you could try is to force all DMA operations to go via
> swiotlb-xen in Linux:
> 
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 3d826c0b5fee..f78d86f1bb9c 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -112,8 +112,7 @@ bool xen_arch_need_swiotlb(struct device *dev,
>          * require a bounce buffer because the device doesn't support coherent
>          * memory and we are not able to flush the cache.
>          */
> -       return (!hypercall_cflush && (xen_pfn != bfn) &&
> -               !dev_is_dma_coherent(dev));
> +       return true;
>  }
> 
>  static int __init xen_mm_init(void)
> 
> 
> Then you can remove any iommu pagetable flushes in Xen:
> 
> 
> diff --git a/xen/arch/arm/include/asm/grant_table.h b/xen/arch/arm/include/asm/grant_table.h
> index d3c518a926..b72f8391bd 100644
> --- a/xen/arch/arm/include/asm/grant_table.h
> +++ b/xen/arch/arm/include/asm/grant_table.h
> @@ -74,7 +74,7 @@ int replace_grant_host_mapping(uint64_t gpaddr, mfn_t frame,
>      page_get_xenheap_gfn(gnttab_status_page(t, i))
> 
>  #define gnttab_need_iommu_mapping(d)                    \
> -    (is_domain_direct_mapped(d) && is_iommu_enabled(d))
> +    (0)
> 
>  #endif /* __ASM_GRANT_TABLE_H__ */
>  /*
> 
> 
> I don't know how this could be related but it might help narrow down the
> problem.

We will try your advice and Julien's, to see if the situation improves.

Thank you very much,
Andrei Cherechesu


* Re: [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
  2023-07-20 23:25   ` Stefano Stabellini
  2023-07-21  8:39     ` Andrei Cherechesu (OSS)
@ 2023-07-21 14:28     ` George Mocanu
  1 sibling, 0 replies; 6+ messages in thread
From: George Mocanu @ 2023-07-21 14:28 UTC
  To: Stefano Stabellini, Julien Grall
  Cc: Andrei Cherechesu (OSS), xen-devel, Juergen Gross

Hello, Stefano, 
Hello, Julien,

Thanks for your suggestions. I gave each of them a try, but none of them
has gotten us anywhere so far.

On 21/07/2023 02:25, Stefano Stabellini wrote:
> 
> On Thu, 20 Jul 2023, Julien Grall wrote:
> > (+ Juergen)
> >
> > On 19/07/2023 17:13, Andrei Cherechesu (OSS) wrote:
> > > Hello,
> >
> > Hi Andrei,
> >
> > > As we're running Xen 4.17 (with platform-related support added) on NXP
> S32G
> > > SoCs (ARMv8), with a custom Linux distribution built through Yocto, and
> > > we've set some Xen-based demos up, we encountered some issues which we think
> > > might not be related to our hardware. For additional context, the Linux
> > > kernel version we're running is 5.15.96-rt (with platform-related support
> > > added as well).
> > >
> > > The setup to reproduce the problem is fairly simple: after booting a Dom0
> > > (can provide configuration details if needed), we're booting a normal PV
> > > DomU with PV Networking. Additionally, the VMs have k3s (Lightweight
> > > Kubernetes - version v1.25.8+k3s1) installed in
> > > their rootfs'es.
> > >
> > > The problem is that the DomU console hangs (no new output is shown, no input
> > > can be sent) some time (non-deterministic, sometimes 5 seconds, other times
> > > like 15-20 seconds) after we run the `k3s server` command. We have this
> > > command running as part of a sysvinit service, and the same behavior can be
> > > observed in that case as well. The k3s version we use is the one mentioned
> > > in the paragraph above, but this can be reproduced with other versions as
> > > well (i.e., v1.21.11, v1.22.6). If the `k3s server` command is ran in the
> > > Dom0 VM, everything works fine. Using DomU as an agent node is also working
> > > fine, only when it is run as a server the console problem occurs.
> > >
> > > Immediately after the serial console hangs, we can still log in on DomU
> > > using SSH, and we can observe the following messages its dmesg:
> > > [   57.905806] xencons: Illegal ring page indices
> >
> > Looking at Linux code, this message is printed in a couple of place in the
> > xenconsole driver.
> >
> > I would assume that this is printed when reading from the buffer (otherwise
> > you would not see any message). Can you confirm it?
> >
> > Also, can you provide the indices that Linux considers buggy?

Adding to what Andrei said previously, we log in to the DomU console
to observe its state, and send some input keys to confirm whether it is
in the buggy state. Considering this flow, it looks like this message
comes from the write_console() call. In one instance I started the k3s
server process in the console (having disabled the sysvinit service beforehand),
then killed it after some time; in that instance a message from read_console()
was displayed. As for the indices, I've dumped them in
a separate message, and they are always different:

[   45.303520] xencons: Illegal ring page indices -- write_console()
[   45.303529] xencons: prod 4289880869, cons 2015782840, intf->out size 2048

[   59.203570] xencons: Illegal ring page indices -- write_console()
[   59.203576] xencons: prod 1735287148, cons 1869033263, intf->out size 2048

[   40.838740] xencons: Illegal ring page indices -- write_console()
[   40.838753] xencons: prod 1647211507, cons 2923534489, intf->out size 2048
[...]
[  126.184299] xencons: Illegal ring page indices -- read_console()
[  126.184317] xencons: prod 127, cons 1815732224, intf->int size 1024
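
For reference, as far as I can tell the sanity check behind this message
boils down to unsigned 32-bit arithmetic on the two indices: the number of
bytes in flight, prod - cons, must not exceed the size of the ring (2048 for
intf->out, 1024 for the input ring in the dumps above). A minimal sketch of
that arithmetic follows; the helper names ring_used() and
ring_indices_legal() are hypothetical and not taken from the driver.

```c
#include <stdint.h>

/* Bytes currently in flight; unsigned subtraction handles index wraparound. */
static inline uint32_t ring_used(uint32_t prod, uint32_t cons)
{
    return prod - cons;
}

/* The indices are plausible only if the bytes in flight fit in the ring. */
static inline int ring_indices_legal(uint32_t prod, uint32_t cons,
                                     uint32_t ring_size)
{
    return ring_used(prod, cons) <= ring_size;
}
```

With the first dump above, prod - cons is 4289880869 - 2015782840 =
2274098029, far above 2048. The other three dumps also fail the check once
the 32-bit wraparound is taken into account.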

> >
> > Lastly, it seems like the barrier used are incorrect. It should be the
> > virt_*() version rather than a plain mb()/wmb(). I don't think it matter for
> > arm64 though (I am assuming you are not running 32-bit).
> >

I replaced them with their virt_*() counterparts, but did not notice any change
in the behavior.

> > > [   59.399620] xenbus: error -5 while reading message
> >
> > So this message is coming from the xenbus driver (used to read the xenstore
> > ring). This is -EIO, and AFAICT returned when the indices are also incorrect.
> >
> > For this driver, I think there is also a TOCTOU because a compiler is free to
> > reload intf->rsp_cons after the check. Moving virt_mb() is probably not
> > sufficient. You would also want to use ACCESS_ONCE().
> >
> > What I find odd is you have two distinct rings (xenconsole and xenbus) with
> > similar issues. Above, you said you are using Linux RT. I wonder if this has a
> > play into the issue because if I am not mistaken, the two functions would now
> > be fully preemptible.
> >
> > This could expose some races. For instance, there are some missing
> > ACCESS_ONCE() (as mentioned above).
> >
> > In particular, Xenstored (I haven't checked xenconsoled) is using += to update
> > intf->rsp_cons. There is no guarantee that the update will be atomic.
> >
> > Overall, I am not 100% sure what I wrote is related. But that's probably a
> > good start of things that can be exacerbated with Linux RT.

I added memory barriers wherever I saw the corresponding ring indexes used in
both the xenconsole and xenbus drivers, but nothing changed.

> >
> > > [   59.399649] xenbus: error -5 while writing message
> >
> > This is in xenbus as well. But this time in the write part. The analysis I
> > wrote above for the read part can be applied here.
> 
> This is really strange. What is also strange is that somehow the indexes
> recover after 10-15 seconds? How is that even possible. Let's say there
> is a memory corruption of some sort, maybe due to missing barriers like
> Julien suggested, how can it go back to normal after a while?
> 
> I am really confused. I would try with regular Linux instead of Linux RT
> and also would try to replace all the barriers in
> drivers/tty/hvc/hvc_xen.c with their virt_* version to see if we can
> narrow down the problem a bit.
> 
> 
> Keep in mind that during PV network operations grants are used, which
> involve mapping pages at the backend and changing the MMU/IOMMU
> pagetables to introduce the new mapping. After the DMA operation,
> typically the page is unmapped and removed from the pagetable.
> 
> Is it possible that the pagetable change is causing the problem, and
> when the mapping is removed everything goes back to normal?
> 
> I don't know how that could happen, but the mapping and unmapping of the
> page is something ongoing which could break things then go back to
> normal. One thing you could try is to force all DMA operations to go via
> swiotlb-xen in Linux:
> 
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 3d826c0b5fee..f78d86f1bb9c 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -112,8 +112,7 @@ bool xen_arch_need_swiotlb(struct device *dev,
>          * require a bounce buffer because the device doesn't support coherent
>          * memory and we are not able to flush the cache.
>          */
> -       return (!hypercall_cflush && (xen_pfn != bfn) &&
> -               !dev_is_dma_coherent(dev));
> +       return true;
>  }
> 
>  static int __init xen_mm_init(void)
> 
> 
> Then you can remove any iommu pagetable flushes in Xen:
> 
> 
> diff --git a/xen/arch/arm/include/asm/grant_table.h
> b/xen/arch/arm/include/asm/grant_table.h
> index d3c518a926..b72f8391bd 100644
> --- a/xen/arch/arm/include/asm/grant_table.h
> +++ b/xen/arch/arm/include/asm/grant_table.h
> @@ -74,7 +74,7 @@ int replace_grant_host_mapping(uint64_t gpaddr, mfn_t
> frame,
>      page_get_xenheap_gfn(gnttab_status_page(t, i))
> 
>  #define gnttab_need_iommu_mapping(d)                    \
> -    (is_domain_direct_mapped(d) && is_iommu_enabled(d))
> +    (0)
> 
>  #endif /* __ASM_GRANT_TABLE_H__ */
>  /*
> 
> 
> I don't know how this could be related but it might help narrow down the
> problem.

I applied your suggestion regarding DMA operations, but we observe the same
behavior (the serial console still hangs after some time), plus some new
issues with other drivers.

We will continue to look into this issue, but if you have some new ideas,
please let us know.

Thank you,
George Mocanu



* Re: [ARM][xencons] PV Console hangs due to illegal ring buffer accesses
  2023-07-21  8:39     ` Andrei Cherechesu (OSS)
@ 2023-07-21 22:06       ` Stefano Stabellini
  0 siblings, 0 replies; 6+ messages in thread
From: Stefano Stabellini @ 2023-07-21 22:06 UTC
  To: Andrei Cherechesu (OSS)
  Cc: Stefano Stabellini, Julien Grall, xen-devel, george.mocanu,
	Juergen Gross

On Fri, 21 Jul 2023, Andrei Cherechesu (OSS) wrote:
> > I am really confused. I would try with regular Linux instead of Linux RT
> > and also would try to replace all the barriers in
> > drivers/tty/hvc/hvc_xen.c with their virt_* version to see if we can
> > narrow down the problem a bit.
> > 
> 
> Unfortunately, we do not normally run regular Linux and we do not have a
> stable regular Linux version with our HW support ported on it. We've been running
> Linux RT since 4.14 (or even earlier I think), but this issue has started to happen
> since we upgraded to Xen 4.17 (from 4.14), with both Linux RT 5.15 and 5.10.

I saw that George tried both suggestions without making progress.

This is a very difficult bug. If Xen 4.14 works and the issue starts
with Xen 4.17, then I would try to bisect it (try Xen 4.15, 4.16, etc.
until you narrow down the commit).

