* [BUG]SMMU-V3 queue need no-cache memory
@ 2022-12-07  2:04 sisyphean
  2022-12-07 10:24 ` Rahul Singh
  0 siblings, 1 reply; 12+ messages in thread
From: sisyphean @ 2022-12-07  2:04 UTC (permalink / raw)
  To: xen-devel

Hi,

     I am trying to run Xen on my ARM board (sorry, for commercial reasons
     I can't say which platform) with SMMUv3 enabled, but all commands in
     the cmdq fail when Xen starts.

     After tracing with a debugger, I found the cause: the queue memory in
     the smmu-v3 driver is not mapped non-cacheable, so after
     arm_smmu_cmdq_build_cmd has run, the command is still sitting in the
     cache. The SMMUv3 hardware therefore cannot fetch the correct command
     from memory.

     My temporary workaround is to call clean_dcache in queue_write every
     time a command is copied into the cmdq, but this will obviously hurt
     efficiency. I have not found a way to allocate non-cacheable memory
     in Xen. Is this not implemented?
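
     A minimal sketch of the workaround described above, written as a toy
     model rather than the actual Xen code. All names (toy_cmdq,
     toy_queue_write, toy_clean_dcache) are hypothetical; the "cache" and
     "DRAM" are plain arrays so that the effect of the clean is visible
     without hardware. Only the 16-byte command size is taken from SMMUv3.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CMDQ_ENT_DWORDS 2   /* an SMMUv3 command is 16 bytes (two 64-bit words) */
#define CMDQ_ENTRIES    8   /* toy queue depth, chosen arbitrarily */

/* Toy model: CPU stores land in `cached`; a non-coherent device only sees `dram`. */
struct toy_cmdq {
    uint64_t cached[CMDQ_ENTRIES][CMDQ_ENT_DWORDS];
    uint64_t dram[CMDQ_ENTRIES][CMDQ_ENT_DWORDS];
    unsigned int prod;
};

/* Stand-in for a clean-by-address cache operation: write the slot back to memory. */
static void toy_clean_dcache(struct toy_cmdq *q, unsigned int idx)
{
    assert(idx < CMDQ_ENTRIES);
    memcpy(q->dram[idx], q->cached[idx], sizeof(q->dram[idx]));
}

/* Copy one command into the next slot; optionally clean it so the device can see it. */
static unsigned int toy_queue_write(struct toy_cmdq *q, const uint64_t *cmd, int clean)
{
    unsigned int idx = q->prod % CMDQ_ENTRIES;

    memcpy(q->cached[idx], cmd, sizeof(q->cached[idx]));
    if (clean)
        toy_clean_dcache(q, idx);
    q->prod++;
    return idx;
}

/* What a non-coherent device would fetch for a given slot and word. */
static uint64_t toy_device_read(const struct toy_cmdq *q, unsigned int idx,
                                unsigned int word)
{
    return q->dram[idx][word];
}
```

     Without the clean, the device-side read returns whatever was in memory
     before the write; with it, the device sees the command. The cost is
     one cache maintenance operation per queued command.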

     The XEN version I am running is RELEASE-4.16.2.

     English is not my native language; please excuse typing errors.

Cheers,

-- 
Sisyphean



* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Rahul Singh @ 2022-12-07 10:24 UTC (permalink / raw)
  To: sisyphean; +Cc: xen-devel

Hi Sisyphean,

> On 7 Dec 2022, at 2:04 am, sisyphean <sisyphean@zlw.email> wrote:
> 
> Hi,
> 
>     I try to run XEN on my ARM board(Sorry, for some commercial reasons, I can't tell you
>     on which platform I run XEN)  and enable SMMU-V3, but all cmds in cmdq failed when XEN started.
> 
>     After using the debugger to track debugging, the reason for this problem is that
>     the queue in the smmu-v3 driver is not no-cache, so after the function arm_smmu_cmdq_build_cmd
>     is executed, the cmd is still in cache.Therefore, the SMMU-V3 hardware cannot obtain the correct cmd
>     from the memory for execution.

Yes, you are right. At the moment we allocate the memory for the command queue
via _xzalloc(), which returns cached memory, and that is why you are observing
the issue. We have only tested the Xen SMMUv3 driver on SoCs where the SMMUv3 HW
is in the coherency domain, so we have not encountered this issue.

I think in your case the SMMUv3 HW is not in the coherency domain. Please confirm
on your side that the "dma-coherent" property is not set in the DT.

I think there is currently no function to ask Xen for memory that is not cached.

@Julien and @Stefano, do you have any suggestions on how we can request
non-cached memory from Xen, something like dma_alloc_coherent() in Linux?
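
One way to picture the role of "dma-coherent" is a driver that records the
property in a feature bit at probe time and only performs cache maintenance
when the bit is clear. The sketch below is a toy model under that assumption;
the names (toy_smmu, TOY_FEAT_COHERENCY, toy_after_queue_write) are
hypothetical and the clean itself is replaced by a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FEAT_COHERENCY (1u << 0)   /* hypothetical feature bit */

struct toy_smmu {
    uint32_t features;
    unsigned int cleans;   /* number of cache maintenance operations issued */
};

/* Record at probe time whether the DT node carried "dma-coherent". */
static void toy_probe(struct toy_smmu *smmu, bool dt_has_dma_coherent)
{
    assert(smmu);
    smmu->features = dt_has_dma_coherent ? TOY_FEAT_COHERENCY : 0;
    smmu->cleans = 0;
}

/* True when queue memory must be cleaned after a CPU write. */
static bool toy_needs_clean(const struct toy_smmu *smmu)
{
    return !(smmu->features & TOY_FEAT_COHERENCY);
}

/* Called after each command is written to the queue. */
static void toy_after_queue_write(struct toy_smmu *smmu)
{
    if (toy_needs_clean(smmu))
        smmu->cleans++;   /* placeholder for the real clean operation */
}
```

With this shape, a coherent SMMU pays nothing extra, while a non-coherent one
pays one clean per command written.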

Regards,
Rahul

* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Julien Grall @ 2022-12-07 12:13 UTC (permalink / raw)
  To: Rahul Singh, sisyphean; +Cc: xen-devel, Stefano Stabellini

Hi,

I only noticed this e-mail because I was skimming xen-devel. If you want 
to get our attention, I would suggest CCing both of us, because I (and 
I guess Stefano) have filter rules so that such e-mails land directly 
in our inboxes.

On 07/12/2022 10:24, Rahul Singh wrote:
>> On 7 Dec 2022, at 2:04 am, sisyphean <sisyphean@zlw.email> wrote:
>>
>> Hi,
>>
>>      I try to run XEN on my ARM board(Sorry, for some commercial reasons, I can't tell you
>>      on which platform I run XEN)  and enable SMMU-V3, but all cmds in cmdq failed when XEN started.
>>
>>      After using the debugger to track debugging, the reason for this problem is that
>>      the queue in the smmu-v3 driver is not no-cache, so after the function arm_smmu_cmdq_build_cmd
>>      is executed, the cmd is still in cache.Therefore, the SMMU-V3 hardware cannot obtain the correct cmd
>>      from the memory for execution.
> 
> Yes you are right as of now we are allocating the memory for cmdqueue via _xzalloc() which is cached
> memory because of that you are observing the issue. We have tested the Xen SMMUv3 driver on SOC
> where SMMUv3 HW is in the coherency domain, and because of that we have not encountered this issue.
> 
> I think In your case SMMUv3 HW is not in the coherency domain. Please confirm from your side if the
> "dma-coherent” property is not set in DT.
> 
> I think there is no function available as of now to request Xen to allocate memory that is not cached.

You are correct.

> 
> @Julien and @Stefano do you have any suggestion on how we can request memory from Xen that is not
> cached something like dma_alloc_coherent() in Linux.

At the moment all RAM is mapped cacheable in Xen, so it will require 
some work to make some memory uncacheable.

There are two options:
  1) Allocate a pool of memory at boot time that is mapped with 
different memory attributes. This means we would need a separate pool, 
and the user would have to size it.
  2) Change the caching attributes of the memory after allocation, and 
revert them after freeing. The downside is that we would end up 
shattering superpages. We also can't re-create superpages (yet), but 
that might be fine if the memory is never freed.
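
To illustrate the sizing trade-off of option 1, here is a minimal sketch of 
a fixed-size pool with a bump allocator. It is a toy model only: the names 
(toy_pool, toy_alloc_uncached, TOY_POOL_SIZE) are hypothetical, and a real 
implementation would also have to map the region with non-cacheable 
attributes, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_POOL_SIZE  4096u   /* size the user would have to choose at boot */
#define TOY_POOL_ALIGN 64u     /* alignment of the pool itself */

/* Stand-in for a region that would be mapped non-cacheable at boot. */
static _Alignas(TOY_POOL_ALIGN) unsigned char toy_pool[TOY_POOL_SIZE];
static size_t toy_pool_used;

/*
 * Bump-allocate `size` bytes aligned to `align` (a power of two, at most
 * TOY_POOL_ALIGN). Returns NULL once the pool is exhausted; nothing is
 * ever freed in this sketch.
 */
static void *toy_alloc_uncached(size_t size, size_t align)
{
    size_t start;

    assert(align != 0 && (align & (align - 1)) == 0);
    assert(align <= TOY_POOL_ALIGN);

    start = (toy_pool_used + align - 1) & ~(align - 1);
    if (start > TOY_POOL_SIZE || size > TOY_POOL_SIZE - start)
        return NULL;

    toy_pool_used = start + size;
    return &toy_pool[start];
}
```

If the pool is sized too small, later allocations simply fail, which is why 
the sizing burden falls on the user.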

Option two would probably be the best. But before going down that route, 
I have one question...

 > The temporary solution I use is to execute function clean_dcache every
 > time cmd is copied to cmdq in function queue_write. But it is obvious
 > that this will seriously affect the efficiency.

I agree you will see some performance impact in a micro-benchmark, but I 
am not sure about normal use-cases. How often do you expect the command 
queue to be used?

Also, I am a bit surprised you are seeing issues with the command queue 
but not with the stage-2 page-tables. Does your SMMU support coherent 
walks but cannot snoop for the command queue?

Cheers,

-- 
Julien Grall


* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Stefano Stabellini @ 2022-12-07 22:22 UTC (permalink / raw)
  To: Julien Grall; +Cc: Rahul Singh, sisyphean, xen-devel, Stefano Stabellini


On Wed, 7 Dec 2022, Julien Grall wrote:
> I agree you will see some performance impact in micro-benchmark. But I am not
> sure about normal use-cases. How often do you expect the command queue to be
> used?

That is a good question. But even for the micro-benchmark, is the
difference significant?

My gut feeling (to be discussed and confirmed) is that for this use-case
it might not be worth doing option 1) or option 2) above. Calling
clean_dcache as needed might be good enough?



* Re: [BUG]SMMU-V3 queue need no-cache memory
From: sisyphean @ 2022-12-08  2:48 UTC (permalink / raw)
  To: Rahul Singh; +Cc: julien, sstabellini, xen-devel

Hi Rahul Singh,

On 2022/12/7 18:24, Rahul Singh wrote:
> I think In your case SMMUv3 HW is not in the coherency domain. Please confirm from your side if the
> "dma-coherent” property is not set in DT.
I have tried both with and without "dma-coherent" set in the DT. The 
results are the same: the SMMUv3 HW cannot get the correct command from 
memory.


* Re: [BUG]SMMU-V3 queue need no-cache memory
From: sisyphean @ 2022-12-08  3:22 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall, Rahul Singh; +Cc: xen-devel

On 2022/12/8 06:22, Stefano Stabellini wrote:

> On Wed, 7 Dec 2022, Julien Grall wrote:
>> I agree you will see some performance impact in micro-benchmark. But I am not
>> sure about normal use-cases. How often do you expect the command queue to be
>> used?
> That is a good question. But even for the micro-benchmark, is the
> difference significant?
>
> My gut feeling (to be discussed and confirmed) is that for this use-case
> it might not be worth to do option 1) or option 2) above. Clean_dcache
> as needed might be good enough?
>
>
>> Also, I am a bit surprised you are seing issue with the command queue but not
>> with the stage-2 page-tables. Does your SMMU support coherent walk but cannot
>> snoop for the command queue?

Hi,

I'm sorry that my wording caused a misunderstanding. I haven't run a 
micro-benchmark yet.

I found this problem because "CMD_SYNC timeout" was reported frequently 
while initializing SMMUv3 during Xen startup.

As for how often the command queue is used: I'm trying to pass PCIe 
devices through to a DomU. As I understand it, all operations on the 
device go through the SMMUv3 once the device is passed through? If so, 
the queues will be used frequently.

This is my first contact with SMMU and Xen. Please forgive any basic 
mistakes.

Cheers,
-- 
Sisyphean



* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Rahul Singh @ 2022-12-08 13:04 UTC (permalink / raw)
  To: Julien Grall; +Cc: sisyphean, xen-devel, Stefano Stabellini

Hi Julien,

> On 7 Dec 2022, at 12:13 pm, Julien Grall <julien@xen.org> wrote:
> 
> I agree you will see some performance impact in micro-benchmark. But I am not sure about normal use-cases. How often do you expect the command queue to be used?

To be precise, the command queue is used:
 - When setting up the stage-2 translation as devices are assigned to guests. This typically happens at dom0 boot and domU creation.
 - When iommu_iotlb_flush() is called, which calls the IOMMU-specific iotlb_flush. The SMMUv3 driver then sends a command to
   the SMMUv3 HW to invalidate the entries.
  
Regards,
Rahul


* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Julien Grall @ 2022-12-08 13:21 UTC (permalink / raw)
  To: sisyphean, Stefano Stabellini, Rahul Singh; +Cc: xen-devel

Hi,

On 08/12/2022 03:22, sisyphean wrote:
> As for the usage frequency of the command queue, I'm trying to 
> passthrough PCIE devices to the DomU.
> According to my understanding, all operations on the device will be 
> performed through SMMUv3 after
> the device passesthrough? Therefore, queues will be used frequently.
"All operations on the device" is a bit vague. From what Rahul just 
wrote, the command queue is for controlling the SMMU (e.g. assigning 
a device, flushing the TLBs...). Anything related to device access (e.g. 
accessing the BARs, configuration space...) does not go through it.

Cheers,

-- 
Julien Grall


* Re: [BUG]SMMU-V3 queue need no-cache memory
From: sisyphean @ 2022-12-08 13:27 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini, Rahul Singh; +Cc: xen-devel


On 2022/12/8 21:21, Julien Grall wrote:
> "all operations on the device" is a bit vague. From what Rahul just 
> wrote this is a command queue is for controlling the SMMU (e.g. assign 
> the device, flush the TLBs...). Anything related to the access (e.g. 
> accessing the BAR, configuration space...) are not going through it.
>
> Cheers,
>
So does this mean that operations on the SMMU queues are infrequent? I 
still have some problems with PCIe device passthrough.
I will run some benchmarks once PCIe device passthrough is working. Are 
there any test cases I can use for reference?

Cheers,




* Re: [BUG]SMMU-V3 queue need no-cache memory
From: Julien Grall @ 2022-12-08 13:31 UTC (permalink / raw)
  To: Rahul Singh; +Cc: sisyphean, xen-devel, Stefano Stabellini



On 08/12/2022 13:04, Rahul Singh wrote:
> Hi Julien,

Hi Rahul,

>> On 7 Dec 2022, at 12:13 pm, Julien Grall <julien@xen.org> wrote:
>>
>> Hi,
>>
>> I only noticed this e-mail because I was skimming xen-devel. If you want to get our attention, then I would suggest to CC both of us because I (and I guess Stefano) have filter rules so those e-mails land directly in my inbox.
>>
>> On 07/12/2022 10:24, Rahul Singh wrote:
>>>> On 7 Dec 2022, at 2:04 am, sisyphean <sisyphean@zlw.email> wrote:
>>>>
>>>> Hi,
>>>>
>>>>      I try to run XEN on my ARM board(Sorry, for some commercial reasons, I can't tell you
>>>>      on which platform I run XEN)  and enable SMMU-V3, but all cmds in cmdq failed when XEN started.
>>>>
>>>>      After using the debugger to track debugging, the reason for this problem is that
>>>>      the queue in the smmu-v3 driver is not no-cache, so after the function arm_smmu_cmdq_build_cmd
>>>>      is executed, the cmd is still in cache.Therefore, the SMMU-V3 hardware cannot obtain the correct cmd
>>>>      from the memory for execution.
>>> Yes you are right as of now we are allocating the memory for cmdqueue via _xzalloc() which is cached
>>> memory because of that you are observing the issue. We have tested the Xen SMMUv3 driver on SOC
>>> where SMMUv3 HW is in the coherency domain, and because of that we have not encountered this issue.
>>> I think In your case SMMUv3 HW is not in the coherency domain. Please confirm from your side if the
>>> "dma-coherent" property is not set in DT.
>>> I think there is no function available as of now to request Xen to allocate memory that is not cached.
>>
>> You are correct.
>>
>>> @Julien and @Stefano do you have any suggestion on how we can request memory from Xen that is not
>>> cached something like dma_alloc_coherent() in Linux.
>>
>> At the moment all the RAM is mapped cacheable in Xen. So it will require some work to have some memory uncacheable.
>>
>> There are two options:
>> 1) Allocate a pool of memory at boot time that will be mapped with different memory attribute. This means we would need a separate pool and the user will have to size it.
>> 2) Modify after the allocation the caching attribute in the memory and then revert back after freeing. The cons is we would end up to shatter superpage. We also can't re-create superpage (yet), but that might be fine if the memory is never freed.
>>
>> Option two would probably the best. But before going that route I have one question...
>>
>>> The temporary solution I use is to execute function clean_dcache every
>>> time cmd is copied to cmdq in function queue_write. But it is obvious
>>> that this will seriously affect the efficiency.
>>
>> I agree you will see some performance impact in micro-benchmark. But I am not sure about normal use-cases. How often do you expect the command queue to be used?
> 
> To be precise command queue will be used when

Thanks for the list. See my comments below.

>   - Set up the stage-2 translation when we assigned the devices to guests. This happens typically dom0 boot and domU creation.

Hotplugging is another case. At the moment, I would expect that in 
this situation the cache flush will just be noise, as domain creation 
is quite complex.
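
For reference, the workaround under discussion (clean the cache after each 
command is copied into the queue) might look roughly like the sketch below. 
This is not the Xen driver code: `struct cmdq`, `cmdq_write`, the `coherent` 
flag and `clean_dcache_range_stub` are hypothetical names, and the stub only 
counts calls so the sketch can be built and checked without hardware. The 
point it illustrates is that the clean can be skipped when the SMMU is in 
the coherency domain, so coherent platforms would pay nothing extra.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CMDQ_ENT_DWORDS 2   /* an SMMUv3 command is 16 bytes */
#define CMDQ_ENTRIES    8   /* hypothetical queue depth for the sketch */

/* Hypothetical counter so the sketch can be checked without hardware. */
static unsigned long clean_calls;

/*
 * Hypothetical stand-in for a data cache clean by virtual address.
 * On real hardware this would write the covered lines back to memory.
 */
static void clean_dcache_range_stub(const void *addr, size_t len)
{
    (void)addr;
    (void)len;
    clean_calls++;
}

struct cmdq {
    uint64_t ring[CMDQ_ENTRIES][CMDQ_ENT_DWORDS];
    unsigned int prod;   /* software producer index */
    int coherent;        /* non-zero if the SMMU snoops CPU caches */
};

/* Copy one command into the ring, cleaning it only when required. */
static void cmdq_write(struct cmdq *q, const uint64_t *cmd)
{
    uint64_t *slot;

    assert(q != NULL && cmd != NULL);
    slot = q->ring[q->prod % CMDQ_ENTRIES];
    memcpy(slot, cmd, sizeof(q->ring[0]));

    /* The clean has to complete before the producer index is published. */
    if (!q->coherent)
        clean_dcache_range_stub(slot, sizeof(q->ring[0]));

    q->prod++;
}
```

With `coherent` set, the function reduces to the plain copy, which is why 
the cost only matters on platforms where the SMMU cannot snoop.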

>   - When there is a call to iommu_iotlb_flush() that will call IOMMU specific iotlb_flush. SMMuv3 driver will send the command to
>     SMMUv3 HW to invalidate the entries.

This is an interesting one. Those operations will usually be heavily 
used by backend PV drivers when mapping/unmapping the grant entries.

I am not aware of anyone who has done performance tests with the IOMMU 
enabled (I think Stefano did some in the past with it disabled).

Grant mappings are usually done one page at a time. So it would be 
interesting to check the overhead of the SMMU (even without the cache 
flush). The tests I am thinking of would compare the numbers with and 
without the IOMMU enabled:
  1) Micro-benchmark the map/unmap operations
  2) Benchmark throughput for block and network devices
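
The comparison in 1) could be structured along the lines of the sketch 
below. It is a user-space illustration only, assuming a POSIX 
`clock_gettime()`; inside the hypervisor a different time source would be 
needed. `op_plain` and `op_with_clean` are hypothetical placeholders for 
the real map/unmap path without and with the extra cache clean, and 
`struct bench_ctx` exists only so the run can be sanity-checked.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Counters used to confirm that every iteration actually ran. */
struct bench_ctx {
    unsigned long ops;
    unsigned long cleans;
};

typedef void (*bench_op_t)(struct bench_ctx *ctx);

/* Placeholder for one map/unmap pair without the cache clean. */
static void op_plain(struct bench_ctx *ctx)
{
    ctx->ops++;
}

/* Placeholder for one map/unmap pair followed by a cache clean. */
static void op_with_clean(struct bench_ctx *ctx)
{
    ctx->ops++;
    ctx->cleans++;
}

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Total elapsed nanoseconds for iters back-to-back calls of op. */
static uint64_t bench_total_ns(bench_op_t op, struct bench_ctx *ctx,
                               unsigned long iters)
{
    uint64_t start;
    unsigned long i;

    assert(op != NULL && ctx != NULL);
    start = now_ns();
    for (i = 0; i < iters; i++)
        op(ctx);
    return now_ns() - start;
}
```

Running `bench_total_ns()` once per variant with the same iteration count 
and dividing by that count gives a per-operation cost to compare.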

Cheers,

-- 
Julien Grall



* Re: [BUG]SMMU-V3 queue need no-cache memory
  2022-12-08 13:27           ` sisyphean
@ 2022-12-08 13:32             ` Julien Grall
  2022-12-08 23:53               ` sisyphean
  0 siblings, 1 reply; 12+ messages in thread
From: Julien Grall @ 2022-12-08 13:32 UTC (permalink / raw)
  To: sisyphean, Stefano Stabellini, Rahul Singh; +Cc: xen-devel

Hi,

On 08/12/2022 13:27, sisyphean wrote:
> 
> On 2022/12/8 21:21, Julien Grall wrote:
>> Hi,
>>
>> On 08/12/2022 03:22, sisyphean wrote:
>>> On 2022/12/8 06:22, Stefano Stabellini wrote:
>>>
>>>> On Wed, 7 Dec 2022, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> I only noticed this e-mail because I was skimming xen-devel. If you 
>>>>> want to
>>>>> get our attention, then I would suggest to CC both of us because I 
>>>>> (and I
>>>>> guess Stefano) have filter rules so those e-mails land directly in 
>>>>> my inbox.
>>>>>
>>>>> On 07/12/2022 10:24, Rahul Singh wrote:
>>>>>>> On 7 Dec 2022, at 2:04 am, sisyphean <sisyphean@zlw.email> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>       I try to run XEN on my ARM board(Sorry, for some commercial 
>>>>>>> reasons,
>>>>>>> I can't tell you
>>>>>>>       on which platform I run XEN)  and enable SMMU-V3, but all 
>>>>>>> cmds in
>>>>>>> cmdq failed when XEN started.
>>>>>>>
>>>>>>>       After using the debugger to track debugging, the reason for 
>>>>>>> this
>>>>>>> problem is that
>>>>>>>       the queue in the smmu-v3 driver is not no-cache, so after the
>>>>>>> function arm_smmu_cmdq_build_cmd
>>>>>>>       is executed, the cmd is still in cache.Therefore, the SMMU-V3
>>>>>>> hardware cannot obtain the correct cmd
>>>>>>>       from the memory for execution.
>>>>>> Yes you are right as of now we are allocating the memory for 
>>>>>> cmdqueue via
>>>>>> _xzalloc() which is cached
>>>>>> memory because of that you are observing the issue. We have tested 
>>>>>> the Xen
>>>>>> SMMUv3 driver on SOC
>>>>>> where SMMUv3 HW is in the coherency domain, and because of that we 
>>>>>> have not
>>>>>> encountered this issue.
>>>>>>
>>>>>> I think In your case SMMUv3 HW is not in the coherency domain. Please
>>>>>> confirm from your side if the
>>>>>> "dma-coherent" property is not set in DT.
>>>>>>
>>>>>> I think there is no function available as of now to request Xen to 
>>>>>> allocate
>>>>>> memory that is not cached.
>>>>> You are correct.
>>>>>
>>>>>> @Julien and @Stefano do you have any suggestion on how we can 
>>>>>> request memory
>>>>>> from Xen that is not
>>>>>> cached something like dma_alloc_coherent() in Linux.
>>>>> At the moment all the RAM is mapped cacheable in Xen. So it will 
>>>>> require some
>>>>> work to have some memory uncacheable.
>>>>>
>>>>> There are two options:
>>>>>   1) Allocate a pool of memory at boot time that will be mapped 
>>>>> with different
>>>>> memory attribute. This means we would need a separate pool and the 
>>>>> user will
>>>>> have to size it.
>>>>>   2) Modify after the allocation the caching attribute in the 
>>>>> memory and then
>>>>> revert back after freeing. The cons is we would end up to shatter 
>>>>> superpage.
>>>>> We also can't re-create superpage (yet), but that might be fine if 
>>>>> the memory
>>>>> is never freed.
>>>>>
>>>>> Option two would probably the best. But before going that route I 
>>>>> have one
>>>>> question...
>>>>>
>>>>>> The temporary solution I use is to execute function clean_dcache 
>>>>>> every
>>>>>> time cmd is copied to cmdq in function queue_write. But it is obvious
>>>>>> that this will seriously affect the efficiency.
>>>>> I agree you will see some performance impact in micro-benchmark. 
>>>>> But I am not
>>>>> sure about normal use-cases. How often do you expect the command 
>>>>> queue to be
>>>>> used?
>>>> That is a good question. But even for the micro-benchmark, is the
>>>> difference significant?
>>>>
>>>> My gut feeling (to be discussed and confirmed) is that for this 
>>>> use-case
>>>> it might not be worth to do option 1) or option 2) above. Clean_dcache
>>>> as needed might be good enough?
>>>>
>>>>
>>>>> Also, I am a bit surprised you are seing issue with the command 
>>>>> queue but not
>>>>> with the stage-2 page-tables. Does your SMMU support coherent walk 
>>>>> but cannot
>>>>> snoop for the command queue?
>>>
>>> Hi,
>>>
>>> I'm sorry that my statement made you misunderstand. I haven't 
>>> conducted micro-benchmark yet.
>>>
>>> I found this problem because "CMD_SYNC timeout" was frequently 
>>> prompted when initializing
>>> SMMUv3 during XEN startup.
>>>
>>> As for the usage frequency of the command queue, I'm trying to 
>>> passthrough PCIE devices to the DomU.
>>> According to my understanding, all operations on the device will be 
>>> performed through SMMUv3 after
>>> the device passesthrough? Therefore, queues will be used frequently.
>> "all operations on the device" is a bit vague. From what Rahul just 
>> wrote this is a command queue is for controlling the SMMU (e.g. assign 
>> the device, flush the TLBs...). Anything related to the access (e.g. 
>> accessing the BAR, configuration space...) are not going through it.
>>
>> Cheers,
>>
> So does this mean that operations on smmu queues are not frequent? There 
> are still some problems with PCIE device passthrough.
> I will conduct some benchmark tests after completing PCIE device 
> passthrough. Are there any test cases for my reference?

See my reply to Rahul. I have provided some ideas on how to benchmark it.

Cheers,

-- 
Julien Grall



* Re: [BUG]SMMU-V3 queue need no-cache memory
  2022-12-08 13:32             ` Julien Grall
@ 2022-12-08 23:53               ` sisyphean
  0 siblings, 0 replies; 12+ messages in thread
From: sisyphean @ 2022-12-08 23:53 UTC (permalink / raw)
  To: Julien Grall, Stefano Stabellini, Rahul Singh; +Cc: xen-devel

Hi,

On 2022/12/8 21:32, Julien Grall wrote:
> Hi,
>
> On 08/12/2022 13:27, sisyphean wrote:
>>
>> On 2022/12/8 21:21, Julien Grall wrote:
>>> Hi,
>>>
>>> On 08/12/2022 03:22, sisyphean wrote:
>>>> On 2022/12/8 06:22, Stefano Stabellini wrote:
>>>>
>>>>> On Wed, 7 Dec 2022, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I only noticed this e-mail because I was skimming xen-devel. If 
>>>>>> you want to
>>>>>> get our attention, then I would suggest to CC both of us because 
>>>>>> I (and I
>>>>>> guess Stefano) have filter rules so those e-mails land directly 
>>>>>> in my inbox.
>>>>>>
>>>>>> On 07/12/2022 10:24, Rahul Singh wrote:
>>>>>>>> On 7 Dec 2022, at 2:04 am, sisyphean <sisyphean@zlw.email> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>       I try to run XEN on my ARM board(Sorry, for some 
>>>>>>>> commercial reasons,
>>>>>>>> I can't tell you
>>>>>>>>       on which platform I run XEN)  and enable SMMU-V3, but all 
>>>>>>>> cmds in
>>>>>>>> cmdq failed when XEN started.
>>>>>>>>
>>>>>>>>       After using the debugger to track debugging, the reason 
>>>>>>>> for this
>>>>>>>> problem is that
>>>>>>>>       the queue in the smmu-v3 driver is not no-cache, so after 
>>>>>>>> the
>>>>>>>> function arm_smmu_cmdq_build_cmd
>>>>>>>>       is executed, the cmd is still in cache.Therefore, the 
>>>>>>>> SMMU-V3
>>>>>>>> hardware cannot obtain the correct cmd
>>>>>>>>       from the memory for execution.
>>>>>>> Yes you are right as of now we are allocating the memory for 
>>>>>>> cmdqueue via
>>>>>>> _xzalloc() which is cached
>>>>>>> memory because of that you are observing the issue. We have 
>>>>>>> tested the Xen
>>>>>>> SMMUv3 driver on SOC
>>>>>>> where SMMUv3 HW is in the coherency domain, and because of that 
>>>>>>> we have not
>>>>>>> encountered this issue.
>>>>>>>
>>>>>>> I think In your case SMMUv3 HW is not in the coherency domain. 
>>>>>>> Please
>>>>>>> confirm from your side if the
>>>>>>> "dma-coherent" property is not set in DT.
>>>>>>>
>>>>>>> I think there is no function available as of now to request Xen 
>>>>>>> to allocate
>>>>>>> memory that is not cached.
>>>>>> You are correct.
>>>>>>
>>>>>>> @Julien and @Stefano do you have any suggestion on how we can 
>>>>>>> request memory
>>>>>>> from Xen that is not
>>>>>>> cached something like dma_alloc_coherent() in Linux.
>>>>>> At the moment all the RAM is mapped cacheable in Xen. So it will 
>>>>>> require some
>>>>>> work to have some memory uncacheable.
>>>>>>
>>>>>> There are two options:
>>>>>>   1) Allocate a pool of memory at boot time that will be mapped 
>>>>>> with different
>>>>>> memory attribute. This means we would need a separate pool and 
>>>>>> the user will
>>>>>> have to size it.
>>>>>>   2) Modify after the allocation the caching attribute in the 
>>>>>> memory and then
>>>>>> revert back after freeing. The cons is we would end up to shatter 
>>>>>> superpage.
>>>>>> We also can't re-create superpage (yet), but that might be fine 
>>>>>> if the memory
>>>>>> is never freed.
>>>>>>
>>>>>> Option two would probably the best. But before going that route I 
>>>>>> have one
>>>>>> question...
>>>>>>
>>>>>>> The temporary solution I use is to execute function clean_dcache 
>>>>>>> every
>>>>>>> time cmd is copied to cmdq in function queue_write. But it is 
>>>>>>> obvious
>>>>>>> that this will seriously affect the efficiency.
>>>>>> I agree you will see some performance impact in micro-benchmark. 
>>>>>> But I am not
>>>>>> sure about normal use-cases. How often do you expect the command 
>>>>>> queue to be
>>>>>> used?
>>>>> That is a good question. But even for the micro-benchmark, is the
>>>>> difference significant?
>>>>>
>>>>> My gut feeling (to be discussed and confirmed) is that for this 
>>>>> use-case
>>>>> it might not be worth to do option 1) or option 2) above. 
>>>>> Clean_dcache
>>>>> as needed might be good enough?
>>>>>
>>>>>
>>>>>> Also, I am a bit surprised you are seing issue with the command 
>>>>>> queue but not
>>>>>> with the stage-2 page-tables. Does your SMMU support coherent 
>>>>>> walk but cannot
>>>>>> snoop for the command queue?
>>>>
>>>> Hi,
>>>>
>>>> I'm sorry that my statement made you misunderstand. I haven't 
>>>> conducted micro-benchmark yet.
>>>>
>>>> I found this problem because "CMD_SYNC timeout" was frequently 
>>>> prompted when initializing
>>>> SMMUv3 during XEN startup.
>>>>
>>>> As for the usage frequency of the command queue, I'm trying to 
>>>> passthrough PCIE devices to the DomU.
>>>> According to my understanding, all operations on the device will be 
>>>> performed through SMMUv3 after
>>>> the device passesthrough? Therefore, queues will be used frequently.
>>> "all operations on the device" is a bit vague. From what Rahul just 
>>> wrote this is a command queue is for controlling the SMMU (e.g. 
>>> assign the device, flush the TLBs...). Anything related to the 
>>> access (e.g. accessing the BAR, configuration space...) are not 
>>> going through it.
>>>
>>> Cheers,
>>>
>> So does this mean that operations on smmu queues are not frequent? 
>> There are still some problems with PCIE device passthrough.
>> I will conduct some benchmark tests after completing PCIE device 
>> passthrough. Are there any test cases for my reference?
>
> See my reply to Rahul. I have provided some ideas how to benchmark it.
>
> Cheers,
>
Thanks for your suggestion. I will write some test cases and run 
benchmarks once the PCIe passthrough is complete.

Cheers,




Thread overview: 12+ messages
2022-12-07  2:04 [BUG]SMMU-V3 queue need no-cache memory sisyphean
2022-12-07 10:24 ` Rahul Singh
2022-12-07 12:13   ` Julien Grall
2022-12-07 22:22     ` Stefano Stabellini
2022-12-08  3:22       ` sisyphean
2022-12-08 13:21         ` Julien Grall
2022-12-08 13:27           ` sisyphean
2022-12-08 13:32             ` Julien Grall
2022-12-08 23:53               ` sisyphean
2022-12-08 13:04     ` Rahul Singh
2022-12-08 13:31       ` Julien Grall
2022-12-08  2:48   ` sisyphean
