All of lore.kernel.org
 help / color / mirror / Atom feed
From: Robin Murphy <robin.murphy@arm.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: joro@8bytes.org, will@kernel.org, nicoleotsuka@gmail.com,
	thierry.reding@gmail.com, vdumpa@nvidia.com,
	nwatterson@nvidia.com, jean-philippe@linaro.org,
	thunder.leizhen@huawei.com, chenxiang66@hisilicon.com,
	Jonathan.Cameron@huawei.com, yuzenghui@huawei.com,
	linux-kernel@vger.kernel.org, iommu@lists.linux-foundation.org,
	linux-arm-kernel@lists.infradead.org,
	linux-tegra@vger.kernel.org, jgg@nvidia.com
Subject: Re: [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V
Date: Tue, 21 Dec 2021 18:55:20 +0000	[thread overview]
Message-ID: <7e68fa19-90b1-bbb5-9991-36b5d35278fa@arm.com> (raw)
In-Reply-To: <20211220192714.GA27303@Asurada-Nvidia>

On 2021-12-20 19:27, Nicolin Chen wrote:
> Hi Robin,
> 
> Thank you for the reply!
> 
> On Mon, Dec 20, 2021 at 06:42:26PM +0000, Robin Murphy wrote:
>> On 2021-11-19 07:19, Nicolin Chen wrote:
>>> From: Nate Watterson <nwatterson@nvidia.com>
>>>
>>> NVIDIA's Grace Soc has a CMDQ-Virtualization (CMDQV) hardware,
>>> which extends the standard ARM SMMU v3 IP to support multiple
>>> VCMDQs with virtualization capabilities. In-kernel of host OS,
>>> they're used to reduce contention on a single queue. In terms
>>> of command queue, they are very like the standard CMDQ/ECMDQs,
>>> but only support CS_NONE in the CS field of CMD_SYNC command.
>>>
>>> This patch adds a new nvidia-grace-cmdqv file and inserts its
>>> structure pointer into the existing arm_smmu_device, and then
>>> adds related function calls in the arm-smmu-v3 driver.
>>>
>>> In the CMDQV driver itself, this patch only adds minimal part
>>> for host kernel support. Upon probe(), VINTF0 is reserved for
>>> in-kernel use. And some of the VCMDQs are assigned to VINTF0.
>>> Then the driver will select one of VCMDQs in the VINTF0 based
>>> on the CPU currently executing, to issue commands.
>>
>> Is there a tangible difference to DMA API or VFIO performance?
> 
> Our testing environment is currently running on a single-core
> CPU, so unfortunately we don't have a perf data at this point.

OK, as for the ECMDQ patches I think we'll need some investigation with 
real workloads to judge whether we can benefit from these things enough 
to justify the complexity, and whether the design is right.

My gut feeling is that if these multi-queue schemes really can live up 
to their promise of making contention negligible, then they should 
further stand to benefit from bypassing the complex lock-free command 
batching in favour of something more lightweight, which could change the 
direction of much of the refactoring.

>> [...]
>>> +struct arm_smmu_cmdq *nvidia_grace_cmdqv_get_cmdq(struct arm_smmu_device *smmu)
>>> +{
>>> +     struct nvidia_grace_cmdqv *cmdqv = smmu->nvidia_grace_cmdqv;
>>> +     struct nvidia_grace_cmdqv_vintf *vintf0 = &cmdqv->vintf0;
>>> +     u16 qidx;
>>> +
>>> +     /* Check error status of vintf0 */
>>> +     if (!FIELD_GET(VINTF_STATUS, vintf0->status))
>>> +             return &smmu->cmdq;
>>> +
>>> +     /*
>>> +      * Select a vcmdq to use. Here we use a temporal solution to
>>> +      * balance out traffic on cmdq issuing: each cmdq has its own
>>> +      * lock, if all cpus issue cmdlist using the same cmdq, only
>>> +      * one CPU at a time can enter the process, while the others
>>> +      * will be spinning at the same lock.
>>> +      */
>>> +     qidx = smp_processor_id() % cmdqv->num_vcmdqs_per_vintf;
>>
>> How does ordering work between queues? Do they follow a global order
>> such that a sync on any queue is guaranteed to complete all prior
>> commands on all queues?
> 
> CMDQV internal scheduler would insert a SYNC when (for example)
> switching from VCMDQ0 to VCMDQ1 while last command in VCMDQ0 is
> not SYNC. HW has a configuration bit in the register to disable
> this feature, which is by default enabled.

Interesting, thanks. So it sounds like this is something you can get 
away with for the moment, but may need to revisit once people chasing 
real-world performance start wanting to turn that bit off.

>> The challenge to make ECMDQ useful to Linux is how to make sure that all
>> the commands expected to be within scope of a future CMND_SYNC plus that
>> sync itself all get issued on the same queue, so I'd be mildly surprised
>> if you didn't have the same problem.
> 
> PATCH-3 in this series actually helps align the command queues,
> between issued commands and SYNC, if bool sync == true. Yet, if
> doing something like issue->issue->issue_with_sync, it could be
> tricker.

Indeed between the iommu_iotlb_gather mechanism and low-level command 
batching things are already a lot more concentrated than they could be, 
but arm_smmu_cmdq_batch_add() and its callers stand out as examples of 
where we'd still be vulnerable to preemption. What I haven't even tried 
to reason about yet is assumptions in the higher-level APIs, e.g. if 
io-pgtable might chuck out a TLBI during an iommu_unmap() which we 
implicitly expect a later iommu_iotlb_sync() to cover.

I've been thinking that in many ways per-domain queues make quite a bit 
of sense and would be easier to manage than per-CPU ones - plus that's 
pretty much the usage model once we get to VMs anyway - but that fails 
to help the significant cases like networking and storage where many 
CPUs are servicing a big monolithic device in a single domain :(

Robin.

WARNING: multiple messages have this Message-ID (diff)
From: Robin Murphy <robin.murphy@arm.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: jean-philippe@linaro.org, linux-kernel@vger.kernel.org,
	iommu@lists.linux-foundation.org, linux-tegra@vger.kernel.org,
	thierry.reding@gmail.com, jgg@nvidia.com, will@kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V
Date: Tue, 21 Dec 2021 18:55:20 +0000	[thread overview]
Message-ID: <7e68fa19-90b1-bbb5-9991-36b5d35278fa@arm.com> (raw)
In-Reply-To: <20211220192714.GA27303@Asurada-Nvidia>

On 2021-12-20 19:27, Nicolin Chen wrote:
> Hi Robin,
> 
> Thank you for the reply!
> 
> On Mon, Dec 20, 2021 at 06:42:26PM +0000, Robin Murphy wrote:
>> On 2021-11-19 07:19, Nicolin Chen wrote:
>>> From: Nate Watterson <nwatterson@nvidia.com>
>>>
>>> NVIDIA's Grace Soc has a CMDQ-Virtualization (CMDQV) hardware,
>>> which extends the standard ARM SMMU v3 IP to support multiple
>>> VCMDQs with virtualization capabilities. In-kernel of host OS,
>>> they're used to reduce contention on a single queue. In terms
>>> of command queue, they are very like the standard CMDQ/ECMDQs,
>>> but only support CS_NONE in the CS field of CMD_SYNC command.
>>>
>>> This patch adds a new nvidia-grace-cmdqv file and inserts its
>>> structure pointer into the existing arm_smmu_device, and then
>>> adds related function calls in the arm-smmu-v3 driver.
>>>
>>> In the CMDQV driver itself, this patch only adds minimal part
>>> for host kernel support. Upon probe(), VINTF0 is reserved for
>>> in-kernel use. And some of the VCMDQs are assigned to VINTF0.
>>> Then the driver will select one of VCMDQs in the VINTF0 based
>>> on the CPU currently executing, to issue commands.
>>
>> Is there a tangible difference to DMA API or VFIO performance?
> 
> Our testing environment is currently running on a single-core
> CPU, so unfortunately we don't have a perf data at this point.

OK, as for the ECMDQ patches I think we'll need some investigation with 
real workloads to judge whether we can benefit from these things enough 
to justify the complexity, and whether the design is right.

My gut feeling is that if these multi-queue schemes really can live up 
to their promise of making contention negligible, then they should 
further stand to benefit from bypassing the complex lock-free command 
batching in favour of something more lightweight, which could change the 
direction of much of the refactoring.

>> [...]
>>> +struct arm_smmu_cmdq *nvidia_grace_cmdqv_get_cmdq(struct arm_smmu_device *smmu)
>>> +{
>>> +     struct nvidia_grace_cmdqv *cmdqv = smmu->nvidia_grace_cmdqv;
>>> +     struct nvidia_grace_cmdqv_vintf *vintf0 = &cmdqv->vintf0;
>>> +     u16 qidx;
>>> +
>>> +     /* Check error status of vintf0 */
>>> +     if (!FIELD_GET(VINTF_STATUS, vintf0->status))
>>> +             return &smmu->cmdq;
>>> +
>>> +     /*
>>> +      * Select a vcmdq to use. Here we use a temporal solution to
>>> +      * balance out traffic on cmdq issuing: each cmdq has its own
>>> +      * lock, if all cpus issue cmdlist using the same cmdq, only
>>> +      * one CPU at a time can enter the process, while the others
>>> +      * will be spinning at the same lock.
>>> +      */
>>> +     qidx = smp_processor_id() % cmdqv->num_vcmdqs_per_vintf;
>>
>> How does ordering work between queues? Do they follow a global order
>> such that a sync on any queue is guaranteed to complete all prior
>> commands on all queues?
> 
> CMDQV internal scheduler would insert a SYNC when (for example)
> switching from VCMDQ0 to VCMDQ1 while last command in VCMDQ0 is
> not SYNC. HW has a configuration bit in the register to disable
> this feature, which is by default enabled.

Interesting, thanks. So it sounds like this is something you can get 
away with for the moment, but may need to revisit once people chasing 
real-world performance start wanting to turn that bit off.

>> The challenge to make ECMDQ useful to Linux is how to make sure that all
>> the commands expected to be within scope of a future CMND_SYNC plus that
>> sync itself all get issued on the same queue, so I'd be mildly surprised
>> if you didn't have the same problem.
> 
> PATCH-3 in this series actually helps align the command queues,
> between issued commands and SYNC, if bool sync == true. Yet, if
> doing something like issue->issue->issue_with_sync, it could be
> tricker.

Indeed between the iommu_iotlb_gather mechanism and low-level command 
batching things are already a lot more concentrated than they could be, 
but arm_smmu_cmdq_batch_add() and its callers stand out as examples of 
where we'd still be vulnerable to preemption. What I haven't even tried 
to reason about yet is assumptions in the higher-level APIs, e.g. if 
io-pgtable might chuck out a TLBI during an iommu_unmap() which we 
implicitly expect a later iommu_iotlb_sync() to cover.

I've been thinking that in many ways per-domain queues make quite a bit 
of sense and would be easier to manage than per-CPU ones - plus that's 
pretty much the usage model once we get to VMs anyway - but that fails 
to help the significant cases like networking and storage where many 
CPUs are servicing a big monolithic device in a single domain :(

Robin.
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

WARNING: multiple messages have this Message-ID (diff)
From: Robin Murphy <robin.murphy@arm.com>
To: Nicolin Chen <nicolinc@nvidia.com>
Cc: jean-philippe@linaro.org, nwatterson@nvidia.com,
	chenxiang66@hisilicon.com, joro@8bytes.org,
	Jonathan.Cameron@huawei.com, linux-kernel@vger.kernel.org,
	iommu@lists.linux-foundation.org, nicoleotsuka@gmail.com,
	linux-tegra@vger.kernel.org, thierry.reding@gmail.com,
	jgg@nvidia.com, thunder.leizhen@huawei.com, yuzenghui@huawei.com,
	will@kernel.org, linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V
Date: Tue, 21 Dec 2021 18:55:20 +0000	[thread overview]
Message-ID: <7e68fa19-90b1-bbb5-9991-36b5d35278fa@arm.com> (raw)
In-Reply-To: <20211220192714.GA27303@Asurada-Nvidia>

On 2021-12-20 19:27, Nicolin Chen wrote:
> Hi Robin,
> 
> Thank you for the reply!
> 
> On Mon, Dec 20, 2021 at 06:42:26PM +0000, Robin Murphy wrote:
>> On 2021-11-19 07:19, Nicolin Chen wrote:
>>> From: Nate Watterson <nwatterson@nvidia.com>
>>>
>>> NVIDIA's Grace Soc has a CMDQ-Virtualization (CMDQV) hardware,
>>> which extends the standard ARM SMMU v3 IP to support multiple
>>> VCMDQs with virtualization capabilities. In-kernel of host OS,
>>> they're used to reduce contention on a single queue. In terms
>>> of command queue, they are very like the standard CMDQ/ECMDQs,
>>> but only support CS_NONE in the CS field of CMD_SYNC command.
>>>
>>> This patch adds a new nvidia-grace-cmdqv file and inserts its
>>> structure pointer into the existing arm_smmu_device, and then
>>> adds related function calls in the arm-smmu-v3 driver.
>>>
>>> In the CMDQV driver itself, this patch only adds minimal part
>>> for host kernel support. Upon probe(), VINTF0 is reserved for
>>> in-kernel use. And some of the VCMDQs are assigned to VINTF0.
>>> Then the driver will select one of VCMDQs in the VINTF0 based
>>> on the CPU currently executing, to issue commands.
>>
>> Is there a tangible difference to DMA API or VFIO performance?
> 
> Our testing environment is currently running on a single-core
> CPU, so unfortunately we don't have a perf data at this point.

OK, as for the ECMDQ patches I think we'll need some investigation with 
real workloads to judge whether we can benefit from these things enough 
to justify the complexity, and whether the design is right.

My gut feeling is that if these multi-queue schemes really can live up 
to their promise of making contention negligible, then they should 
further stand to benefit from bypassing the complex lock-free command 
batching in favour of something more lightweight, which could change the 
direction of much of the refactoring.

>> [...]
>>> +struct arm_smmu_cmdq *nvidia_grace_cmdqv_get_cmdq(struct arm_smmu_device *smmu)
>>> +{
>>> +     struct nvidia_grace_cmdqv *cmdqv = smmu->nvidia_grace_cmdqv;
>>> +     struct nvidia_grace_cmdqv_vintf *vintf0 = &cmdqv->vintf0;
>>> +     u16 qidx;
>>> +
>>> +     /* Check error status of vintf0 */
>>> +     if (!FIELD_GET(VINTF_STATUS, vintf0->status))
>>> +             return &smmu->cmdq;
>>> +
>>> +     /*
>>> +      * Select a vcmdq to use. Here we use a temporal solution to
>>> +      * balance out traffic on cmdq issuing: each cmdq has its own
>>> +      * lock, if all cpus issue cmdlist using the same cmdq, only
>>> +      * one CPU at a time can enter the process, while the others
>>> +      * will be spinning at the same lock.
>>> +      */
>>> +     qidx = smp_processor_id() % cmdqv->num_vcmdqs_per_vintf;
>>
>> How does ordering work between queues? Do they follow a global order
>> such that a sync on any queue is guaranteed to complete all prior
>> commands on all queues?
> 
> CMDQV internal scheduler would insert a SYNC when (for example)
> switching from VCMDQ0 to VCMDQ1 while last command in VCMDQ0 is
> not SYNC. HW has a configuration bit in the register to disable
> this feature, which is by default enabled.

Interesting, thanks. So it sounds like this is something you can get 
away with for the moment, but may need to revisit once people chasing 
real-world performance start wanting to turn that bit off.

>> The challenge to make ECMDQ useful to Linux is how to make sure that all
>> the commands expected to be within scope of a future CMND_SYNC plus that
>> sync itself all get issued on the same queue, so I'd be mildly surprised
>> if you didn't have the same problem.
> 
> PATCH-3 in this series actually helps align the command queues,
> between issued commands and SYNC, if bool sync == true. Yet, if
> doing something like issue->issue->issue_with_sync, it could be
> tricker.

Indeed between the iommu_iotlb_gather mechanism and low-level command 
batching things are already a lot more concentrated than they could be, 
but arm_smmu_cmdq_batch_add() and its callers stand out as examples of 
where we'd still be vulnerable to preemption. What I haven't even tried 
to reason about yet is assumptions in the higher-level APIs, e.g. if 
io-pgtable might chuck out a TLBI during an iommu_unmap() which we 
implicitly expect a later iommu_iotlb_sync() to cover.

I've been thinking that in many ways per-domain queues make quite a bit 
of sense and would be easier to manage than per-CPU ones - plus that's 
pretty much the usage model once we get to VMs anyway - but that fails 
to help the significant cases like networking and storage where many 
CPUs are servicing a big monolithic device in a single domain :(

Robin.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

  reply	other threads:[~2021-12-21 18:55 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-19  7:19 [PATCH v3 0/5] iommu/arm-smmu-v3: Add NVIDIA Grace CMDQ-V Support Nicolin Chen
2021-11-19  7:19 ` Nicolin Chen
2021-11-19  7:19 ` Nicolin Chen via iommu
2021-11-19  7:19 ` [PATCH v3 1/5] iommu/arm-smmu-v3: Add CS_NONE quirk Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen via iommu
2021-11-19  7:19 ` [PATCH v3 2/5] iommu/arm-smmu-v3: Make arm_smmu_cmdq_init reusable Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen via iommu
2021-11-19  7:19 ` [PATCH v3 3/5] iommu/arm-smmu-v3: Pass cmdq pointer in arm_smmu_cmdq_issue_cmdlist() Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen via iommu
2021-11-19  7:19 ` [PATCH v3 4/5] iommu/arm-smmu-v3: Add host support for NVIDIA Grace CMDQ-V Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen via iommu
2021-12-20 18:42   ` Robin Murphy
2021-12-20 18:42     ` Robin Murphy
2021-12-20 18:42     ` Robin Murphy
2021-12-20 19:27     ` Nicolin Chen
2021-12-20 19:27       ` Nicolin Chen
2021-12-20 19:27       ` Nicolin Chen via iommu
2021-12-21 18:55       ` Robin Murphy [this message]
2021-12-21 18:55         ` Robin Murphy
2021-12-21 18:55         ` Robin Murphy
2021-12-21 22:00         ` Nicolin Chen
2021-12-21 22:00           ` Nicolin Chen
2021-12-21 22:00           ` Nicolin Chen via iommu
2021-12-22 11:57           ` Robin Murphy
2021-12-22 11:57             ` Robin Murphy
2021-12-22 11:57             ` Robin Murphy
2021-11-19  7:19 ` [PATCH v3 5/5] iommu/nvidia-grace-cmdqv: Limit CMDs for guest owned VINTF Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen
2021-11-19  7:19   ` Nicolin Chen via iommu
2021-12-22 12:32   ` Robin Murphy
2021-12-22 12:32     ` Robin Murphy
2021-12-22 12:32     ` Robin Murphy
2021-12-22 22:52     ` Nicolin Chen
2021-12-22 22:52       ` Nicolin Chen
2021-12-22 22:52       ` Nicolin Chen via iommu
2021-12-23 11:14       ` Robin Murphy
2021-12-23 11:14         ` Robin Murphy
2021-12-23 11:14         ` Robin Murphy
2021-12-24  8:02         ` Nicolin Chen
2021-12-24  8:02           ` Nicolin Chen
2021-12-24  8:02           ` Nicolin Chen via iommu
2021-12-24 12:13           ` Robin Murphy
2021-12-24 12:13             ` Robin Murphy
2021-12-24 12:13             ` Robin Murphy
2021-12-28  5:49             ` Nicolin Chen
2021-12-28  5:49               ` Nicolin Chen
2021-12-28  5:49               ` Nicolin Chen via iommu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e68fa19-90b1-bbb5-9991-36b5d35278fa@arm.com \
    --to=robin.murphy@arm.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=chenxiang66@hisilicon.com \
    --cc=iommu@lists.linux-foundation.org \
    --cc=jean-philippe@linaro.org \
    --cc=jgg@nvidia.com \
    --cc=joro@8bytes.org \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-tegra@vger.kernel.org \
    --cc=nicoleotsuka@gmail.com \
    --cc=nicolinc@nvidia.com \
    --cc=nwatterson@nvidia.com \
    --cc=thierry.reding@gmail.com \
    --cc=thunder.leizhen@huawei.com \
    --cc=vdumpa@nvidia.com \
    --cc=will@kernel.org \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.