* [virtio-comment] [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
@ 2023-12-20 14:40 Heng Qi
  2023-12-21  3:37 ` [virtio-comment] " Jason Wang
  2023-12-21  6:56 ` Michael S. Tsirkin
  0 siblings, 2 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-20 14:40 UTC (permalink / raw)
  To: virtio-comment
  Cc: Jason Wang, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo

Currently, each time the driver attempts to update the coalescing parameters
for a vq, it needs to kick the device and wait for the ctrlq response to return.

If a command can only update the parameters of one vq, then when the parameters
are updated frequently (such as by netdim), the ctrlq on the device is kicked
frequently, which increases the device's CPU scheduling overhead; the number of
device DMA operations, and their overhead, increase as well.

Merging the updated parameters of multiple vqs into one command can effectively
reduce the number of device kicks and device DMA operations.

Test results show that this greatly improves the efficiency of the ctrlq in
responding to multiple vq coalescing parameter updates issued by the driver.

I propose this as an update to the VIRTIO_NET_F_VQ_NOTF_COAL feature, but I am
not sure whether that is timely at this point. If it is not, is a new feature
bit needed?

Suggested-by: Xiaoming Zhao <zxm377917@alibaba-inc.com>
Signed-off-by: Heng Qi <hengqi@linux.alibaba.com>
---
 device-types/net/description.tex | 26 ++++++++++++++++++++------
 1 file changed, 20 insertions(+), 6 deletions(-)
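
For reference, below is a minimal driver-side sketch of how the new command
could be filled in. This is hypothetical illustration code, not part of the
patch: the structure and field names (virtio_net_ctrl_mrg_coal_vq, entries[],
vq_index, num_entries) follow the spec text below, virtnet_send_command() and
rxq2vq() are names from the upstream Linux driver, and virtnet_set_rx_coal_all()
is an invented helper taking usecs/packets values to apply to every rx vq.

static int virtnet_set_rx_coal_all(struct virtnet_info *vi,
                                   u32 usecs, u32 packets)
{
        struct virtio_net_ctrl_mrg_coal_vq *cmd;
        struct scatterlist sg;
        size_t len;
        int i;

        /* One entry per enabled receive virtqueue. */
        len = sizeof(*cmd) + vi->curr_queue_pairs * sizeof(cmd->entries[0]);
        cmd = kzalloc(len, GFP_KERNEL);
        if (!cmd)
                return -ENOMEM;

        for (i = 0; i < vi->curr_queue_pairs; i++) {
                cmd->entries[i].vq_index = cpu_to_le16(rxq2vq(i));
                cmd->entries[i].coal.max_usecs = cpu_to_le32(usecs);
                cmd->entries[i].coal.max_packets = cpu_to_le32(packets);
        }
        cmd->num_entries = cpu_to_le16(vi->curr_queue_pairs);

        /* A single command, and thus a single kick, covers all virtqueues. */
        sg_init_one(&sg, cmd, len);
        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
                                  VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET, &sg)) {
                kfree(cmd);
                return -EINVAL;
        }

        kfree(cmd);
        return 0;
}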

diff --git a/device-types/net/description.tex b/device-types/net/description.tex
index aff5e08..b3766c4 100644
--- a/device-types/net/description.tex
+++ b/device-types/net/description.tex
@@ -1667,8 +1667,8 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 for notification coalescing.
 
 If the VIRTIO_NET_F_VQ_NOTF_COAL feature is negotiated, the driver can
-send commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET
-for virtqueue notification coalescing.
+send commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET, VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET and
+VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET for virtqueue notification coalescing.
 
 \begin{lstlisting}
 struct virtio_net_ctrl_coal {
@@ -1682,11 +1682,17 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
     struct virtio_net_ctrl_coal coal;
 };
 
+struct virtio_net_ctrl_mrg_coal_vq {
+        le16 num_entries; /* indicates number of valid entries */
+        struct virtio_net_ctrl_coal_vq entries[];
+};
+
 #define VIRTIO_NET_CTRL_NOTF_COAL 6
  #define VIRTIO_NET_CTRL_NOTF_COAL_TX_SET  0
  #define VIRTIO_NET_CTRL_NOTF_COAL_RX_SET 1
  #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET 2
  #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET 3
+ #define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET 4
 \end{lstlisting}
 
 Coalescing parameters:
@@ -1706,6 +1712,7 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 \item For the command VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET, the structure virtio_net_ctrl_coal_vq is write-only for the driver.
 \item For the command VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET, \field{vq_index} and \field{reserved} are write-only
       for the driver, and the structure virtio_net_ctrl_coal is read-only for the driver.
+\item For the command VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET, the structure virtio_net_ctrl_mrg_coal_vq is write-only for the driver.
 \end{itemize}
 
 The class VIRTIO_NET_CTRL_NOTF_COAL has the following commands:
@@ -1716,6 +1723,9 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
                                         for an enabled transmit/receive virtqueue whose index is \field{vq_index}.
 \item VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET: use the structure virtio_net_ctrl_coal_vq to get the \field{max_usecs} and \field{max_packets} parameters
                                         for an enabled transmit/receive virtqueue whose index is \field{vq_index}.
+\item VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET: use the structure virtio_net_ctrl_mrg_coal_vq to set the \field{max_usecs} and \field{max_packets} parameters
+                                         for \field{num_entries} enabled transmit/receive virtqueues. The index of each virtqueue
+                                         being configured is given by the \field{vq_index} of the corresponding entry.
 \end{enumerate}
 
 The device may generate notifications more or less frequently than specified by set commands of the VIRTIO_NET_CTRL_NOTF_COAL class.
@@ -1782,9 +1792,13 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 
 The driver MUST set \field{vq_index} to the virtqueue index of an enabled transmit or receive virtqueue.
 
+The driver MUST set \field{num_entries} to a non-zero value and MUST NOT set \field{num_entries} to
+a value greater than the number of enabled transmit and receive virtqueues.
+
 The driver MUST have negotiated the VIRTIO_NET_F_NOTF_COAL feature when issuing commands VIRTIO_NET_CTRL_NOTF_COAL_TX_SET and VIRTIO_NET_CTRL_NOTF_COAL_RX_SET.
 
-The driver MUST have negotiated the VIRTIO_NET_F_VQ_NOTF_COAL feature when issuing commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET.
+The driver MUST have negotiated the VIRTIO_NET_F_VQ_NOTF_COAL feature when issuing commands
+VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET, VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET.
 
 The driver MUST ignore the values of coalescing parameters received from the VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET command if the device responds with VIRTIO_NET_ERR.
 
@@ -1794,10 +1808,10 @@ \subsubsection{Control Virtqueue}\label{sec:Device Types / Network Device / Devi
 
 The device SHOULD respond to VIRTIO_NET_CTRL_NOTF_COAL_TX_SET and VIRTIO_NET_CTRL_NOTF_COAL_RX_SET commands with VIRTIO_NET_ERR if it was not able to change the parameters.
 
-The device MUST respond to the VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET command with VIRTIO_NET_ERR if it was not able to change the parameters.
+The device MUST respond to VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET commands with VIRTIO_NET_ERR if it was not able to change the parameters.
 
-The device MUST respond to VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET commands with
-VIRTIO_NET_ERR if the designated virtqueue is not an enabled transmit or receive virtqueue.
+The device MUST respond to VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET, VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET
+commands with VIRTIO_NET_ERR if the designated virtqueue is not an enabled transmit or receive virtqueue.
 
 Upon disabling and re-enabling a transmit virtqueue, the device MUST set the coalescing parameters of the virtqueue
 to those configured through the VIRTIO_NET_CTRL_NOTF_COAL_TX_SET command, or, if the driver did not set any TX coalescing parameters, to 0.
-- 
2.19.1.6.gb485710b


This publicly archived list offers a means to provide input to the
OASIS Virtual I/O Device (VIRTIO) TC.

In order to verify user consent to the Feedback License terms and
to minimize spam in the list archive, subscription is required
before posting.

Subscribe: virtio-comment-subscribe@lists.oasis-open.org
Unsubscribe: virtio-comment-unsubscribe@lists.oasis-open.org
List help: virtio-comment-help@lists.oasis-open.org
List archive: https://lists.oasis-open.org/archives/virtio-comment/
Feedback License: https://www.oasis-open.org/who/ipr/feedback_license.pdf
List Guidelines: https://www.oasis-open.org/policies-guidelines/mailing-lists
Committee: https://www.oasis-open.org/committees/virtio/
Join OASIS: https://www.oasis-open.org/join/



* [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-20 14:40 [virtio-comment] [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs Heng Qi
@ 2023-12-21  3:37 ` Jason Wang
  2023-12-21  4:55   ` Heng Qi
  2023-12-22  8:26   ` Michael S. Tsirkin
  2023-12-21  6:56 ` Michael S. Tsirkin
  1 sibling, 2 replies; 18+ messages in thread
From: Jason Wang @ 2023-12-21  3:37 UTC (permalink / raw)
  To: Heng Qi; +Cc: virtio-comment, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo

On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
> Currently, each time the driver attempts to update the coalescing parameters
> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>
> If a command can only update the parameters of one vq, then when the parameters
> are updated frequently (such as by netdim), the ctrlq on the device is kicked
> frequently, which increases the device's CPU scheduling overhead; the number of
> device DMA operations, and their overhead, increase as well.
>
> Merging the updated parameters of multiple vqs into one command can effectively
> reduce the number of device kicks and device DMA operations.
>
> Test results show that this greatly improves the efficiency of the ctrlq in
> responding to multiple vq coalescing parameter updates issued by the driver.

So netdim is per virtqueue; to make use of this, you need to batch the
netdim requests first. And if you do that, you can batch the commands
as well. Then you get one kick for several coal requests?

Or are you saying you run out of ctrl vq?

Thanks



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  3:37 ` [virtio-comment] " Jason Wang
@ 2023-12-21  4:55   ` Heng Qi
  2023-12-22  2:40     ` Jason Wang
  2023-12-22  8:26   ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Heng Qi @ 2023-12-21  4:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo



On 2023/12/21 at 11:37 AM, Jason Wang wrote:
> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>> Currently, each time the driver attempts to update the coalescing parameters
>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>>
>> If a command can only update the parameters of one vq, then when the parameters
>> are updated frequently (such as by netdim), the ctrlq on the device is kicked
>> frequently, which increases the device's CPU scheduling overhead; the number of
>> device DMA operations, and their overhead, increase as well.
>>
>> Merging the updated parameters of multiple vqs into one command can effectively
>> reduce the number of device kicks and device DMA operations.
>>
>> Test results show that this greatly improves the efficiency of the ctrlq in
>> responding to multiple vq coalescing parameter updates issued by the driver.
> So netdim is per virtqueue, to make use of this, you need to batch the
> netdim requests first.

Yes. I think the code makes this more intuitive.
Please view the following PoC test code in combination with the current
upstream virtio netdim code:

+struct ctrl_coal_vq_entry {
+       u16 vqn;
+       u16 reserved;
+       int usecs;
+       int packets;
+};
+
+struct virtio_net_ctrl_mrg_coal_vq {
+       int num_entries;
+       struct ctrl_coal_vq_entry coal_vqs[];
+};
+

@@ -313,12 +335,18 @@ struct virtnet_info {

         struct control_buf *ctrl;

+       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
+


@@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
         if (!vi->rq)
                 goto err_rq;

+       vi->ctrl_coal_vqs = kzalloc(sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct ctrl_coal_vq_entry), GFP_KERNEL);


+       struct scatterlist sg;
+       for (i = 0; i < vi->curr_queue_pairs; i++) {
+                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
+                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
+                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
+        }
+        vi->ctrl_coal_vqs->num_entries = i;
+        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries * sizeof(struct ctrl_coal_vq_entry));
+        vi->req_num++;
+        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
+                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))

  #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
+#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4

>   And if you do that, you can batch the commands
> as well.


Batch commands we have tried:

+               for (i = 0; i < vi->curr_queue_pairs; i++) {
+                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i], 8, 16);
+               }
+
+               if (unlikely(!virtqueue_kick(vi->cvq)))
+                       return vi->ctrl->status == VIRTIO_NET_OK;
+
+               for (i = 0; i < vi->curr_queue_pairs; i++) {
+                       while (!virtqueue_is_broken(vi->cvq) && !virtqueue_get_buf(vi->cvq, &tmp))
+                               cpu_relax();
+               }
+       }


+
+static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct vq_coal *vq_ctrls,
+                                u32 max_usecs, u32 max_packets)
+{
+       struct scatterlist hdr, out, stat;
+       struct scatterlist *sgs[4];
+       unsigned out_num = 0;
+       int ret;
+
+       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
+
+       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
+       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
+       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
+       sgs[out_num++] = &hdr;
+
+       /* PoC: always configures vq 0. */
+       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
+       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
+       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
+       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
+       sgs[out_num++] = &out;
+
+       vq_ctrls->status = ~0;
+       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
+       sgs[out_num] = &stat;
+
+       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
+       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
+       if (ret < 0) {
+               dev_warn(&vi->vdev->dev,
+                        "Failed to add sgs for command vq: %d.\n", ret);
+               return false;
+       }


Overall, batch reqs are sufficient, because the current major overhead
is the number of DMAs.
For example, for a device with 256 queues:

For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
The overhead of batch cmds is 1 kick + 256*8 DMA times.
The overhead of batch reqs is 1 kick + 8 DMA times.

Below are the 8 DMA times:
- get the avail idx: 1 time
- pull the avail ring information: 1 time
- pull the descs pointed to by the avail ring: 3 times
- pull the hdr and out bufs pointed to by the avail ring descs: 2 times
- write the status buf pointed to by the status desc: 1 time
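
To make the arithmetic above explicit, here is a toy model in C (purely
illustrative; it just restates the figures above, assuming roughly 8 DMA
operations per ctrlq command, and real counts will vary with ring layout):

static unsigned int ctrlq_dma_ops(unsigned int num_commands)
{
        return num_commands * 8; /* ~8 DMAs per command, per the list above */
}

/* upstream:   256 kicks, ctrlq_dma_ops(256) == 2048 DMAs
 * batch cmds:   1 kick,  ctrlq_dma_ops(256) == 2048 DMAs
 * batch reqs:   1 kick,  ctrlq_dma_ops(1)   ==    8 DMAs
 */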

Thanks!

> Then you get one kick for several coal requests?
>
> Or are you saying you run out of ctrl vq?
>
> Thanks

* [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-20 14:40 [virtio-comment] [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs Heng Qi
  2023-12-21  3:37 ` [virtio-comment] " Jason Wang
@ 2023-12-21  6:56 ` Michael S. Tsirkin
  2023-12-21  7:01   ` Heng Qi
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-21  6:56 UTC (permalink / raw)
  To: Heng Qi; +Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo

On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
> Currently, each time the driver attempts to update the coalescing parameters
> for a vq, it needs to kick the device and wait for the ctrlq response to return.

But there's no fundamental reason for it to do so.
It can just as well submit multiple commands and then wait.

-- 
MST



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  6:56 ` Michael S. Tsirkin
@ 2023-12-21  7:01   ` Heng Qi
  2023-12-21  7:08     ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Heng Qi @ 2023-12-21  7:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo



On 2023/12/21 at 2:56 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
>> Currently, each time the driver attempts to update the coalescing parameters
>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
> But there's no fundamental reason for it to do so.

Indeed, please look at the current upstream netdim code:

static void virtnet_rx_dim_work(struct work_struct *work)
{
         struct dim *dim = container_of(work, struct dim, work);
         struct receive_queue *rq = container_of(dim,
                         struct receive_queue, dim);
         struct virtnet_info *vi = rq->vq->vdev->priv;
         struct net_device *dev = vi->dev;
         struct dim_cq_moder update_moder;
         int i, qnum, err;

         if (!rtnl_trylock())
                 return;

         /* Each rxq's work is queued by "net_dim()->schedule_work()"
          * in response to NAPI traffic changes. Note that dim->profile_ix
          * for each rxq is updated prior to the queuing action.
          * So we only need to traverse and update profiles for all rxqs
          * in the work which is holding rtnl_lock.
          */
         for (i = 0; i < vi->curr_queue_pairs; i++) {    <-----------------
                 rq = &vi->rq[i];
                 dim = &rq->dim;
                 qnum = rq - vi->rq;

                 if (!rq->dim_enabled)
                         continue;

                 update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
                 if (update_moder.usec != rq->intr_coal.max_usecs ||
                     update_moder.pkts != rq->intr_coal.max_packets) {
                         err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
                                                                update_moder.usec,
                                                                update_moder.pkts);
                         if (err)
                                 pr_debug("%s: Failed to send dim parameters on rxq%d\n",
                                          dev->name, qnum);
                         dim->state = DIM_START_MEASURE;
                 }
         }

         rtnl_unlock();
}

> It can just as well submit multiple commands and then wait.

Sending multiple commands and then waiting reduces the number of kicks,
but it does not reduce the number of device DMAs. I already responded to
this in a thread with Jason; please check:

https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html

"
Overall, batch reqs are sufficient, because the current major overhead
is the number of DMAs.
For example, for a device with 256 queues,

For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
The overhead of batch cmds is 1 kick + 256*8 DMA times.
The overhead of batch reqs is 1 kick + 8 DMA times.

Below is 8 DMA times:
- get avail idx 1 time
- Pull available ring information 1 time
- Pull the desc pointed to by the avail ring 3 times
- Pull the hdr and out bufs pointed to by avail ring desc 2 times
- Write once to the status buf pointed to by status 1 time "

Thanks.

>



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  7:01   ` Heng Qi
@ 2023-12-21  7:08     ` Michael S. Tsirkin
  2023-12-21  7:19       ` Heng Qi
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-21  7:08 UTC (permalink / raw)
  To: Heng Qi; +Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo

On Thu, Dec 21, 2023 at 03:01:57PM +0800, Heng Qi wrote:
> 
> 
On 2023/12/21 at 2:56 PM, Michael S. Tsirkin wrote:
> > On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
> > > Currently, each time the driver attempts to update the coalescing parameters
> > > for a vq, it needs to kick the device and wait for the ctrlq response to return.
> > But there's no fundamental reason for it to do so.
> 
> Indeed, please look at the current upstream netdim code:
> 
> static void virtnet_rx_dim_work(struct work_struct *work)
> {
>         struct dim *dim = container_of(work, struct dim, work);
>         struct receive_queue *rq = container_of(dim,
>                         struct receive_queue, dim);
>         struct virtnet_info *vi = rq->vq->vdev->priv;
>         struct net_device *dev = vi->dev;
>         struct dim_cq_moder update_moder;
>         int i, qnum, err;
> 
>         if (!rtnl_trylock())
>                 return;
> 
>         /* Each rxq's work is queued by "net_dim()->schedule_work()"
>          * in response to NAPI traffic changes. Note that dim->profile_ix
>          * for each rxq is updated prior to the queuing action.
>          * So we only need to traverse and update profiles for all rxqs
>          * in the work which is holding rtnl_lock.
>          */
>         for (i = 0; i < vi->curr_queue_pairs; i++) {    <-----------------
>                 rq = &vi->rq[i];
>                 dim = &rq->dim;
>                 qnum = rq - vi->rq;
> 
>                 if (!rq->dim_enabled)
>                         continue;
> 
>                 update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
>                 if (update_moder.usec != rq->intr_coal.max_usecs ||
>                     update_moder.pkts != rq->intr_coal.max_packets) {
>                         err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>                                                                update_moder.usec,
>                                                                update_moder.pkts);
>                         if (err)
>                                 pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>                                          dev->name, qnum);
>                         dim->state = DIM_START_MEASURE;
>                 }
>         }
> 
>         rtnl_unlock();
> }
> 
> > It can just as well submit multiple commands and then wait.
> 
> Sending multiple commands and then waiting reduces the number of kicks,
> but it does not reduce the number of device DMAs. I already responded to
> this in a thread with Jason; please check:
> 
> https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html
> 
> "
>> Overall, batch reqs are sufficient, because the current major overhead is
> the number of DMAs.
> For example, for a device with 256 queues,
> 
> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> The overhead of batch cmds is 1 kick + 256*8 DMA times.
> The overhead of batch reqs is 1 kick + 8 DMA times.
> 
> Below is 8 DMA times:
> - get avail idx 1 time
> - Pull available ring information 1 time
> - Pull the desc pointed to by the avail ring 3 times
> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
> - Write once to the status buf pointed to by status 1 time "
> 
> Thanks.

So there's more DMA but it's all slow path.
Why do we need to micro-optimize it?
What is the overhead practically, in milliseconds?

-- 
MST



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  7:08     ` Michael S. Tsirkin
@ 2023-12-21  7:19       ` Heng Qi
  2023-12-21  7:48         ` Heng Qi
  2023-12-22  8:24         ` Michael S. Tsirkin
  0 siblings, 2 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-21  7:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo



On 2023/12/21 at 3:08 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 21, 2023 at 03:01:57PM +0800, Heng Qi wrote:
>>
>> On 2023/12/21 at 2:56 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
>>>> Currently, each time the driver attempts to update the coalescing parameters
>>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>>> But there's no fundamental reason for it to do so.
>> Indeed, please look at the current upstream netdim code:
>>
>> static void virtnet_rx_dim_work(struct work_struct *work)
>> {
>>          struct dim *dim = container_of(work, struct dim, work);
>>          struct receive_queue *rq = container_of(dim,
>>                          struct receive_queue, dim);
>>          struct virtnet_info *vi = rq->vq->vdev->priv;
>>          struct net_device *dev = vi->dev;
>>          struct dim_cq_moder update_moder;
>>          int i, qnum, err;
>>
>>          if (!rtnl_trylock())
>>                  return;
>>
>>          /* Each rxq's work is queued by "net_dim()->schedule_work()"
>>           * in response to NAPI traffic changes. Note that dim->profile_ix
>>           * for each rxq is updated prior to the queuing action.
>>           * So we only need to traverse and update profiles for all rxqs
>>           * in the work which is holding rtnl_lock.
>>           */
>>          for (i = 0; i < vi->curr_queue_pairs; i++) {    <-----------------
>>                  rq = &vi->rq[i];
>>                  dim = &rq->dim;
>>                  qnum = rq - vi->rq;
>>
>>                  if (!rq->dim_enabled)
>>                          continue;
>>
>>                  update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
>>                  if (update_moder.usec != rq->intr_coal.max_usecs ||
>>                      update_moder.pkts != rq->intr_coal.max_packets) {
>>                          err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>>                                                                 update_moder.usec,
>>                                                                 update_moder.pkts);
>>                          if (err)
>>                                  pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>>                                           dev->name, qnum);
>>                          dim->state = DIM_START_MEASURE;
>>                  }
>>          }
>>
>>          rtnl_unlock();
>> }
>>
>>> It can just as well submit multiple commands and then wait.
>> Sending multiple commands and then waiting reduces the number of kicks,
>> but it does not reduce the number of device DMAs. I already responded to
>> this in a thread with Jason; please check:
>>
>> https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html
>>
>> "
>> Overall, batch reqs are sufficient, because the current major overhead is
>> the number of DMAs.
>> For example, for a device with 256 queues,
>>
>> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
>> The overhead of batch cmds is 1 kick + 256*8 DMA times.
>> The overhead of batch reqs is 1 kick + 8 DMA times.
>>
>> Below is 8 DMA times:
>> - get avail idx 1 time
>> - Pull available ring information 1 time
>> - Pull the desc pointed to by the avail ring 3 times
>> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
>> - Write once to the status buf pointed to by status 1 time "
>>
>> Thanks.
> So there's more DMA but it's all slow path.
> Why do we need to micro-optimize it?

On each of our DPUs, multiple VMs need to be supported. The ctrlqs of these
VMs are emulated in software, which consumes the DPU's limited CPU resources.
Therefore, optimizing how fast the ctrlq responds to netdim commands lets us
better support many VMs and VMs with large numbers of queues.

> What is the overhead practically, in milliseconds?

Latency overhead is only one aspect. More importantly, the limited CPU
resources need to support effective command processing for more VMs and
for VMs with large numbers of queues.

Thanks.

>



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  7:19       ` Heng Qi
@ 2023-12-21  7:48         ` Heng Qi
  2023-12-22  8:24         ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-21  7:48 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo



On 2023/12/21 at 3:19 PM, Heng Qi wrote:
>
>
> On 2023/12/21 at 3:08 PM, Michael S. Tsirkin wrote:
>> On Thu, Dec 21, 2023 at 03:01:57PM +0800, Heng Qi wrote:
>>>
>>> On 2023/12/21 at 2:56 PM, Michael S. Tsirkin wrote:
>>>> On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
>>>>> Currently, each time the driver attempts to update the coalescing parameters
>>>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>>>> But there's no fundamental reason for it to do so.
>>> Indeed, please look at the current upstream netdim code:
>>>
>>> static void virtnet_rx_dim_work(struct work_struct *work)
>>> {
>>>          struct dim *dim = container_of(work, struct dim, work);
>>>          struct receive_queue *rq = container_of(dim,
>>>                          struct receive_queue, dim);
>>>          struct virtnet_info *vi = rq->vq->vdev->priv;
>>>          struct net_device *dev = vi->dev;
>>>          struct dim_cq_moder update_moder;
>>>          int i, qnum, err;
>>>
>>>          if (!rtnl_trylock())
>>>                  return;
>>>
>>>          /* Each rxq's work is queued by "net_dim()->schedule_work()"
>>>           * in response to NAPI traffic changes. Note that dim->profile_ix
>>>           * for each rxq is updated prior to the queuing action.
>>>           * So we only need to traverse and update profiles for all rxqs
>>>           * in the work which is holding rtnl_lock.
>>>           */
>>>          for (i = 0; i < vi->curr_queue_pairs; i++) {    <-----------------
>>>                  rq = &vi->rq[i];
>>>                  dim = &rq->dim;
>>>                  qnum = rq - vi->rq;
>>>
>>>                  if (!rq->dim_enabled)
>>>                          continue;
>>>
>>>                  update_moder = net_dim_get_rx_moderation(dim->mode, dim->profile_ix);
>>>                  if (update_moder.usec != rq->intr_coal.max_usecs ||
>>>                      update_moder.pkts != rq->intr_coal.max_packets) {
>>>                          err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>>>                                                                 update_moder.usec,
>>>                                                                 update_moder.pkts);
>>>                          if (err)
>>>                                  pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>>>                                           dev->name, qnum);
>>>                          dim->state = DIM_START_MEASURE;
>>>                  }
>>>          }
>>>
>>>          rtnl_unlock();
>>> }
>>>
>>>> It can just as well submit multiple commands and then wait.
>>> Sending multiple commands and then waiting reduces the number of kicks,
>>> but it does not reduce the number of device DMAs. I already responded to
>>> this in a thread with Jason; please check:
>>>
>>> https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html 
>>>
>>>
>>> "
>>> Overall, batch reqs are sufficient, because the current major overhead is
>>> the number of DMAs.
>>> For example, for a device with 256 queues,
>>>
>>> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
>>> The overhead of batch cmds is 1 kick + 256*8 DMA times.
>>> The overhead of batch reqs is 1 kick + 8 DMA times.
>>>
>>> Below is 8 DMA times:
>>> - get avail idx 1 time
>>> - Pull available ring information 1 time
>>> - Pull the desc pointed to by the avail ring 3 times
>>> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
>>> - Write once to the status buf pointed to by status 1 time "
>>>
>>> Thanks.
>> So there's more DMA but it's all slow path.
>> Why do we need to micro-optimize it?
>
> On each of our DPUs, multiple VMs need to be supported. The ctrlqs of these
> VMs are emulated in software, which consumes the DPU's limited CPU resources.

Each DMA needs to wait for the DMA hardware to complete. The DPU CPU does
other work in the meantime and is rescheduled when the DMA completes.
Therefore, more DMA operations mean more CPU scheduling, which hurts CPU
execution efficiency.

And with batched reqs, reducing the number of DMA operations not only
improves DPU CPU execution efficiency but also reduces DMA latency.
This matters a lot when a single DPU serves many VMs with many queues.


> Therefore, optimizing how fast the ctrlq responds to netdim commands lets us
> better support many VMs and VMs with large numbers of queues.
>
>> What is the overhead practically, in milliseconds?
>
> Latency overhead is only one aspect. More importantly, the limited CPU
> resources need to support effective command processing for more VMs and
> for VMs with large numbers of queues.
>
> Thanks.



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  4:55   ` Heng Qi
@ 2023-12-22  2:40     ` Jason Wang
  2023-12-22  6:45       ` Heng Qi
  0 siblings, 1 reply; 18+ messages in thread
From: Jason Wang @ 2023-12-22  2:40 UTC (permalink / raw)
  To: Heng Qi; +Cc: virtio-comment, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo

On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> On 2023/12/21 at 11:37 AM, Jason Wang wrote:
> > On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >> Currently, each time the driver attempts to update the coalescing parameters
> >> for a vq, it needs to kick the device and wait for the ctrlq response to return.
> >>
> >> If a command can only update the parameters of one vq, then when the parameters
> >> are updated frequently (such as by netdim), the ctrlq on the device is kicked
> >> frequently, which increases the device's CPU scheduling overhead; the number of
> >> device DMA operations, and their overhead, increase as well.
> >>
> >> Merging the updated parameters of multiple vqs into one command can effectively
> >> reduce the number of device kicks and device DMA operations.
> >>
> >> Test results show that this greatly improves the efficiency of the ctrlq in
> >> responding to multiple vq coalescing parameter updates issued by the driver.
> > So netdim is per virtqueue, to make use of this, you need to batch the
> > netdim requests first.
>
> Yes. I think the code makes this more intuitive.
> Please view the following PoC test code in combination with the current
> upstream virtio netdim code:
>
> +struct ctrl_coal_vq_entry {
> +       u16 vqn;
> +       u16 reserved;
> +       int usecs;
> +       int packets;
> +};
> +
> +struct virtio_net_ctrl_mrg_coal_vq {
> +       int num_entries;
> +       struct ctrl_coal_vq_entry coal_vqs[];
> +};
> +
>
> @@ -313,12 +335,18 @@ struct virtnet_info {
>
>          struct control_buf *ctrl;
>
> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
> +
>
>
> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>          if (!vi->rq)
>                  goto err_rq;
>
> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct ctrl_coal_vq_entry), GFP_KERNEL);
>
>
> +       struct scatterlist sg;
> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
> +        }
> +        vi->ctrl_coal_vqs->num_entries = i;
> +        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries * sizeof(struct ctrl_coal_vq_entry));
> +        vi->req_num++;
> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> +                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))
>
>   #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
>
> >   And if you do that, you can batch the commands
> > as well.
>
>
> Batch commands we have tried:
>
> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i], 8, 16);
> +               }
> +
> +               if (unlikely(!virtqueue_kick(vi->cvq)))
> +                       return vi->ctrl->status == VIRTIO_NET_OK;
> +
> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> +                       while (!virtqueue_is_broken(vi->cvq) && !virtqueue_get_buf(vi->cvq, &tmp))
> +                               cpu_relax();
> +               }
> +       }
>
>
> +
> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct vq_coal *vq_ctrls,
> +                                u32 max_usecs, u32 max_packets)
> +{
> +       struct scatterlist hdr, out, stat;
> +       struct scatterlist *sgs[4];
> +       unsigned out_num = 0;
> +       int ret;
> +
> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> +
> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
> +       sgs[out_num++] = &hdr;
> +
> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> +       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
> +
> +       sgs[out_num++] = &out;
> +
> +       vq_ctrls->status = ~0;
> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
> +       sgs[out_num] = &stat;
> +
> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> +       if (ret < 0) {
> +               dev_warn(&vi->vdev->dev,
> +                        "Failed to add sgs for command vq: %d.\n", ret);
> +               return false;
> +       }
>
>
> Overall, batch reqs are sufficient, because the current major overhead
> is the number of DMAs.
> For example, for a device with 256 queues,
>
> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> The overhead of batch cmds is 1 kick + 256*8 DMA times.

This can be further optimized by using IN_ORDER; the DMAs can be batched as well.

Or even by using a packed virtqueue.

> The overhead of batch reqs is 1 kick + 8 DMA times.

If this is a real issue for your architecture, we should not make it
work only for coalescing. Any per-vq tweaking may suffer from the same
problem, e.g. if you want to do ARFS in the future. So it should be,
for example, a generic container command for the cvq instead of just a
coalescing-specific command.
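
For illustration only, such a container command might look something like
this (a hypothetical layout in the style of the spec's listings; nothing
like this exists in the spec today):

struct virtio_net_ctrl_container {
        le16 num_cmds;   /* number of inner commands that follow */
        le16 reserved;
        /* followed by num_cmds variable-sized entries: */
};

struct virtio_net_ctrl_container_entry {
        u8 class;        /* inner command class, e.g. VIRTIO_NET_CTRL_NOTF_COAL */
        u8 cmd;          /* inner command, e.g. VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET */
        le16 len;        /* length of command_data[] */
        u8 command_data[];
};

The driver could then place any mix of existing commands in one buffer, and
the device could return one status per inner command, so a netdim-style
update of N virtqueues, or a future ARFS update, would cost one kick and one
descriptor chain rather than N.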

We need to figure out whether the existing facility is sufficient or not.

Thanks

>
> Below is 8 DMA times:
> - get avail idx 1 time
> - Pull avail ring information 1 time
> - Pull the desc pointed to by the avail ring 3 times
> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
> - Write once to the status buf pointed to by status 1 time
>
> Thanks!
>
> > Then you get one kick for several coal requests?
> >
> > Or are you saying you run out of ctrl vq?
> >
> > Thanks

* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-22  2:40     ` Jason Wang
@ 2023-12-22  6:45       ` Heng Qi
  2023-12-22  8:20         ` Michael S. Tsirkin
  2023-12-26  3:32         ` Jason Wang
  0 siblings, 2 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-22  6:45 UTC (permalink / raw)
  To: Jason Wang
  Cc: virtio-comment, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit



On 2023/12/22 at 10:40 AM, Jason Wang wrote:
> On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>
>>
>> On 2023/12/21 at 11:37 AM, Jason Wang wrote:
>>> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>>>> Currently, each time the driver attempts to update the coalescing parameters
>>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>>>>
>>>> If a command can only update the parameters of one vq, then when the parameters
>>>> are updated frequently (such as by netdim), the ctrlq on the device is kicked
>>>> frequently, which increases the device's CPU scheduling overhead; the number of
>>>> device DMA operations, and their overhead, increase as well.
>>>>
>>>> Merging the updated parameters of multiple vqs into one command can effectively
>>>> reduce the number of device kicks and device DMA operations.
>>>>
>>>> Test results show that this greatly improves the efficiency of the ctrlq in
>>>> responding to multiple vq coalescing parameter updates issued by the driver.
>>> So netdim is per virtqueue, to make use of this, you need to batch the
>>> netdim requests first.
>> Yes. I think the code makes this more intuitive.
>> Please view the following PoC test code in combination with the current
>> upstream virtio netdim code:
>>
>> +struct ctrl_coal_vq_entry {
>> +       u16 vqn;
>> +       u16 reserved;
>> +       int usecs;
>> +       int packets;
>> +};
>> +
>> +struct virtio_net_ctrl_mrg_coal_vq {
>> +       int num_entries;
>> +       struct ctrl_coal_vq_entry coal_vqs[];
>> +};
>> +
>>
>> @@ -313,12 +335,18 @@ struct virtnet_info {
>>
>>           struct control_buf *ctrl;
>>
>> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
>> +
>>
>>
>> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct virtnet_info *vi)
>>           if (!vi->rq)
>>                   goto err_rq;
>>
>> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct ctrl_coal_vq_entry), GFP_KERNEL);
>>
>>
>> +       struct scatterlist sg;
>> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
>> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
>> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
>> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
>> +        }
>> +        vi->ctrl_coal_vqs->num_entries = i;
>> +        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries * sizeof(struct ctrl_coal_vq_entry));
>> +        vi->req_num++;
>> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
>> +                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))
>>
>>    #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
>> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
>>
>>>    And if you do that, you can batch the commands
>>> as well.
>>
>> Batch commands we have tried:
>>
>> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
>> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i], 8, 16);
>> +               }
>> +
>> +               if (unlikely(!virtqueue_kick(vi->cvq)))
>> +                       return vi->ctrl->status == VIRTIO_NET_OK;
>> +
>> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
>> +                       while (!virtqueue_is_broken(vi->cvq) && !virtqueue_get_buf(vi->cvq, &tmp))
>> +                               cpu_relax();
>> +               }
>> +       }
>>
>>
>> +
>> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct vq_coal *vq_ctrls,
>> +                                u32 max_usecs, u32 max_packets)
>> +{
>> +       struct scatterlist hdr, out, stat;
>> +       struct scatterlist *sgs[4];
>> +       unsigned out_num = 0;
>> +       int ret;
>> +
>> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
>> +
>> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
>> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
>> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
>> +       sgs[out_num++] = &hdr;
>> +
>> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
>> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
>> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
>> +       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
>> +
>> +       sgs[out_num++] = &out;
>> +
>> +       vq_ctrls->status = ~0;
>> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
>> +       sgs[out_num] = &stat;
>> +
>> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
>> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
>> +       if (ret < 0) {
>> +               dev_warn(&vi->vdev->dev,
>> +                        "Failed to add sgs for command vq: %d.\n", ret);
>> +               return false;
>> +       }
>>
>>
>> Overall, batch reqs are sufficient, because the current major overhead
>> is the number of DMAs.
>> For example, for a device with 256 queues,
>>
>> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
>> The overhead of batch cmds is 1 kick + 256*8 DMA times.
> This can be further optimized by using IN_ORDER; the DMAs can be batched as well.

DMA batching reduces the number of DMA requests, but not the actual number
of DMAs.
For our DPU, how much to batch and when to end a batch also need to be
evaluated in detail.

>
> Or even by using a packed virtqueue.
>
>> The overhead of batch reqs is 1 kick + 8 DMA times.
> If this is a real issue for your architecture, we should not make it
> work only for coalescing. Any per-vq tweaking may suffer from the same problem.

YES.

> E.g. if you want to do ARFS in the future. So it should be, for example, a
> generic container command for the cvq instead of just a coalescing-specific
> command.
>
> We need to figure out whether the existing facility is sufficient or not.

We considered a generic ctrlq batch command, but so far we have only seen
this requirement for netdim and aRFS. At the same time, aRFS has not yet
entered the community. We will try to add a batch request interface for it,
and it may get a dedicated flow vq to ensure that its command volume, which
is larger than netdim's, is responded to in time.
So for now we plan to advance the batching of coalescing parameter requests
first.

Thanks.

>
> Thanks
>
>> Below is 8 DMA times:
>> - get avail idx 1 time
>> - Pull avail ring information 1 time
>> - Pull the desc pointed to by the avail ring 3 times
>> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
>> - Write once to the status buf pointed to by status 1 time
>>
>> Thanks!
>>
>>> Then you get one kick for several coal requests?
>>>
>>> Or are you saying you run out of ctrl vq?
>>>
>>> Thanks

* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-22  6:45       ` Heng Qi
@ 2023-12-22  8:20         ` Michael S. Tsirkin
  2023-12-26  2:17           ` Heng Qi
  2023-12-26  3:32         ` Jason Wang
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-22  8:20 UTC (permalink / raw)
  To: Heng Qi
  Cc: Jason Wang, virtio-comment, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit

On Fri, Dec 22, 2023 at 02:45:21PM +0800, Heng Qi wrote:
> > > For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> > > The overhead of batch cmds is 1 kick + 256*8 DMA times.
> > This can be further optimized by using IN_ORDER. The DMA can be batched as well.
> 
> DMA batching reduces the number of DMA requests, but not the actual number
> of DMAs.
> For our DPU, how much to batch and when to end a batch also need to be
> evaluated in detail.

And what is the result of this evaluation? Again, all this is slow-path
stuff. Speed only matters if the operation takes many milliseconds.

-- 
MST



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  7:19       ` Heng Qi
  2023-12-21  7:48         ` Heng Qi
@ 2023-12-22  8:24         ` Michael S. Tsirkin
  2023-12-26  2:04           ` Heng Qi
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-22  8:24 UTC (permalink / raw)
  To: Heng Qi; +Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo

On Thu, Dec 21, 2023 at 03:19:19PM +0800, Heng Qi wrote:
> 
> 
> > On 2023/12/21 3:08 PM, Michael S. Tsirkin wrote:
> > On Thu, Dec 21, 2023 at 03:01:57PM +0800, Heng Qi wrote:
> > > 
> > > On 2023/12/21 2:56 PM, Michael S. Tsirkin wrote:
> > > > On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
> > > > > Currently, when each time the driver attempts to update the coalescing parameters
> > > > > for a vq, it needs to kick the device and wait for the ctrlq response to return.
> > > > But there's no fundamental reason for it to do so.
> > > Indeed, please look at the current upstream netdim code:
> > > 
> > > static void virtnet_rx_dim_work(struct work_struct *work)
> > > {
> > >          struct dim *dim = container_of(work, struct dim, work);
> > >          struct receive_queue *rq = container_of(dim,
> > >                          struct receive_queue, dim);
> > >          struct virtnet_info *vi = rq->vq->vdev->priv;
> > >          struct net_device *dev = vi->dev;
> > >          struct dim_cq_moder update_moder;
> > >          int i, qnum, err;
> > > 
> > >          if (!rtnl_trylock())
> > >                  return;
> > > 
> > >          /* Each rxq's work is queued by "net_dim()->schedule_work()"
> > >           * in response to NAPI traffic changes. Note that dim->profile_ix
> > >           * for each rxq is updated prior to the queuing action.
> > >           * So we only need to traverse and update profiles for all rxqs
> > >           * in the work which is holding rtnl_lock.
> > >           */
> > >          for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > <-----------------
> > >                  rq = &vi->rq[i];
> > >                  dim = &rq->dim;
> > >                  qnum = rq - vi->rq;
> > > 
> > >                  if (!rq->dim_enabled)
> > >                          continue;
> > > 
> > >                  update_moder = net_dim_get_rx_moderation(dim->mode,
> > >                                                           dim->profile_ix);
> > >                  if (update_moder.usec != rq->intr_coal.max_usecs ||
> > >                      update_moder.pkts != rq->intr_coal.max_packets) {
> > >                          err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
> > >                                                                 update_moder.usec,
> > >                                                                 update_moder.pkts);
> > >                          if (err)
> > >                                  pr_debug("%s: Failed to send dim parameters on rxq%d\n",
> > >                                           dev->name, qnum);
> > >                          dim->state = DIM_START_MEASURE;
> > >                  }
> > >          }
> > > 
> > >          rtnl_unlock();
> > > }
> > > 
> > > > It can just as well submit multiple commands and then wait.
> > > Sending multiple commands and then waiting reduces the number of kicks.
> > > But it does not reduce the number of device DMAs. I already responded to
> > > this in a thread to Jason.
> > > please check:
> > > 
> > > https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html
> > > 
> > > "
> > > Overall, batch reqs are sufficient, because the current major overhead is
> > > the number of DMAs.
> > > For example, for a device with 256 queues,
> > > 
> > > For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> > > The overhead of batch cmds is 1 kick + 256*8 DMA times.
> > > The overhead of batch reqs is 1 kick + 8 DMA times.
> > > 
> > > Below is 8 DMA times:
> > > - get avail idx 1 time
> > > - Pull available ring information 1 time
> > > - Pull the desc pointed to by the avail ring 3 times
> > > - Pull the hdr and out bufs pointed to by avail ring desc 2 times
> > > - Write once to the status buf pointed to by status 1 time "
> > > 
> > > Thanks.
> > So there's more DMA but it's all slow path.
> > Why do we need to micro-optimize it?
> 
> On each of our DPUs, multiple VMs need to be supported. The ctrlqs of
> these VMs are emulated by software, which consumes the DPU's limited CPU
> resources. Therefore, optimizing the ctrlq response speed for netdim lets
> us better support multiple VMs as well as VMs with large queue counts.
> 
> > What is the overhead practically, in milliseconds?
> 
> Latency overhead is only one aspect. More importantly, the limited CPU
> resources need to sustain effective command processing for more VMs and
> for VMs with large queue counts.
> 
> Thanks.
> 

All this is interesting but too handwavy to address. Please work on the
motivation for your patch: e.g. implement submitting multiple commands
in the Linux driver and show in numbers how the current approach is
problematic.
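
(To be concrete, "submitting multiple commands and then waiting" means
roughly the following - a sketch only, reusing the virtnet_pre_send_cmd()
PoC helper and vq_ctrls array quoted elsewhere in this thread, with
virtnet_coal_cmds_batched() as a made-up name:

static bool virtnet_coal_cmds_batched(struct virtnet_info *vi,
                                      u32 usecs, u32 packets)
{
        unsigned int tmp;
        int i;

        /* Queue one VQ_SET command per rx queue without kicking. */
        for (i = 0; i < vi->curr_queue_pairs; i++)
                if (!virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i],
                                          usecs, packets))
                        return false;

        /* A single kick covers all queued commands. */
        if (unlikely(!virtqueue_kick(vi->cvq)))
                return false;

        /* Reap one completion per queued command. */
        for (i = 0; i < vi->curr_queue_pairs; i++)
                while (!virtqueue_is_broken(vi->cvq) &&
                       !virtqueue_get_buf(vi->cvq, &tmp))
                        cpu_relax();

        return true;
}

This saves the kicks but not the per-command DMAs, which is the point
under dispute.)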

-- 
MST



* [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-21  3:37 ` [virtio-comment] " Jason Wang
  2023-12-21  4:55   ` Heng Qi
@ 2023-12-22  8:26   ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-22  8:26 UTC (permalink / raw)
  To: Jason Wang; +Cc: Heng Qi, virtio-comment, Yuri Benditovich, Xuan Zhuo

On Thu, Dec 21, 2023 at 11:37:17AM +0800, Jason Wang wrote:
> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >
> > Currently, when each time the driver attempts to update the coalescing parameters
> > for a vq, it needs to kick the device and wait for the ctrlq response to return.
> >
> > If a command can only update one vq parameters, when the parameters are updated
> > frequently (such as netdim), ctrlq on the device is kicked frequently, which will
> > increase the device CPU scheduling overhead, and the number and overhead of device
> > DMA will also increase.
> >
> > Merging multiple vq updated parameters into one command can effectively reduce
> > the number of kick devices and device DMA times.
> >
> > Test results show that this greatly improves the efficiency of the ctrlq in
> > responding to multiple vq coalescing parameter updates issued by the driver.
> 
> So netdim is per virtqueue, to make use of this, you need to batch the
> netdim requests first. And if you do that, you can batch the commands
> as well. Then you get one kick for several coal requests?
> 
> Or are you saying you run out of ctrl vq?
> 
> Thanks

I think Heng is saying the ctrl vq implementation on this DPU is so bad
that you need to avoid it as much as possible.

-- 
MST



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-22  8:24         ` Michael S. Tsirkin
@ 2023-12-26  2:04           ` Heng Qi
  0 siblings, 0 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-26  2:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: virtio-comment, Jason Wang, Yuri Benditovich, Xuan Zhuo



On 2023/12/22 4:24 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 21, 2023 at 03:19:19PM +0800, Heng Qi wrote:
>>
>> On 2023/12/21 3:08 PM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 21, 2023 at 03:01:57PM +0800, Heng Qi wrote:
>>>> On 2023/12/21 2:56 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 20, 2023 at 10:40:34PM +0800, Heng Qi wrote:
>>>>>> Currently, when each time the driver attempts to update the coalescing parameters
>>>>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
>>>>> But there's no fundamental reason for it to do so.
>>>> Indeed, please look at the current upstream netdim code:
>>>>
>>>> static void virtnet_rx_dim_work(struct work_struct *work)
>>>> {
>>>>           struct dim *dim = container_of(work, struct dim, work);
>>>>           struct receive_queue *rq = container_of(dim,
>>>>                           struct receive_queue, dim);
>>>>           struct virtnet_info *vi = rq->vq->vdev->priv;
>>>>           struct net_device *dev = vi->dev;
>>>>           struct dim_cq_moder update_moder;
>>>>           int i, qnum, err;
>>>>
>>>>           if (!rtnl_trylock())
>>>>                   return;
>>>>
>>>>           /* Each rxq's work is queued by "net_dim()->schedule_work()"
>>>>            * in response to NAPI traffic changes. Note that dim->profile_ix
>>>>            * for each rxq is updated prior to the queuing action.
>>>>            * So we only need to traverse and update profiles for all rxqs
>>>>            * in the work which is holding rtnl_lock.
>>>>            */
>>>>           for (i = 0; i < vi->curr_queue_pairs; i++) {
>>>> <-----------------
>>>>                   rq = &vi->rq[i];
>>>>                   dim = &rq->dim;
>>>>                   qnum = rq - vi->rq;
>>>>
>>>>                   if (!rq->dim_enabled)
>>>>                           continue;
>>>>
>>>>                   update_moder = net_dim_get_rx_moderation(dim->mode,
>>>>                                                            dim->profile_ix);
>>>>                   if (update_moder.usec != rq->intr_coal.max_usecs ||
>>>>                       update_moder.pkts != rq->intr_coal.max_packets) {
>>>>                           err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, qnum,
>>>>                                                                  update_moder.usec,
>>>>                                                                  update_moder.pkts);
>>>>                           if (err)
>>>>                                   pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>>>>                                            dev->name, qnum);
>>>>                           dim->state = DIM_START_MEASURE;
>>>>                   }
>>>>           }
>>>>
>>>>           rtnl_unlock();
>>>> }
>>>>
>>>>> It can just as well submit multiple commands and then wait.
>>>> Sending multiple commands and then waiting reduces the number of kicks.
>>>> But it does not reduce the number of device DMAs. I already responded to
>>>> this in a thread to Jason.
>>>> please check:
>>>>
>>>> https://lists.oasis-open.org/archives/virtio-comment/202312/msg00142.html
>>>>
>>>> "
>>>> Overall, batch reqs are sufficient, because the current major overhead is
>>>> the number of DMAs.
>>>> For example, for a device with 256 queues,
>>>>
>>>> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
>>>> The overhead of batch cmds is 1 kick + 256*8 DMA times.
>>>> The overhead of batch reqs is 1 kick + 8 DMA times.
>>>>
>>>> Below is 8 DMA times:
>>>> - get avail idx 1 time
>>>> - Pull available ring information 1 time
>>>> - Pull the desc pointed to by the avail ring 3 times
>>>> - Pull the hdr and out bufs pointed to by avail ring desc 2 times
>>>> - Write once to the status buf pointed to by status 1 time "
>>>>
>>>> Thanks.
>>> So there's more DMA but it's all slow path.
>>> Why do we need to micro-optimize it?
>> On each of our DPUs, multiple VMs need to be supported. The ctrlqs of
>> these VMs are emulated by software, which consumes the DPU's limited CPU
>> resources. Therefore, optimizing the ctrlq response speed for netdim lets
>> us better support multiple VMs as well as VMs with large queue counts.
>>
>>> What is the overhead practically, in milliseconds?
>> Latency overhead is only one aspect. More importantly, the limited CPU
>> resources need to sustain effective command processing for more VMs and
>> for VMs with large queue counts.
>>
>> Thanks.
>>
> All this is interesting but too handwavy to address. Please work on the
> motivation for your patch: e.g. implement submitting multiple commands
> in the Linux driver and show in numbers how the current approach is
> problematic.


OK, I will add it in the next version.

Thanks.

>



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-22  8:20         ` Michael S. Tsirkin
@ 2023-12-26  2:17           ` Heng Qi
  0 siblings, 0 replies; 18+ messages in thread
From: Heng Qi @ 2023-12-26  2:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, virtio-comment, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit



On 2023/12/22 4:20 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 22, 2023 at 02:45:21PM +0800, Heng Qi wrote:
>>>> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
>>>> The overhead of batch cmds is 1 kick + 256*8 DMA times.
>>> This can be further optimized by using IN_ORDER. The DMA can be batched as well.
>> DMA batching reduces the number of DMA requests, but not the actual
>> number of DMAs.
>> For our DPU, how much to batch and when to end a batch also need to be
>> evaluated in detail.
> And what is the result of this evaluation?

It’s not easy. Not having a good start and end time means that DMA batch 
will increase the response delay
of the entire machine. And DMA batch will not reduce the number of DMA 
times.

>   Again, all this is slow-path
> stuff. Speed only matters if the operation takes many milliseconds.

Yes, when there are many queues and the device responds slowly, the CPU
running the driver's dim_work may be kept waiting. This is the overhead
introduced by enabling DIM, and we want to do our best to avoid it. Also,
due to DPU resource limitations, we want to better support multiple VMs.

Thanks.




* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-22  6:45       ` Heng Qi
  2023-12-22  8:20         ` Michael S. Tsirkin
@ 2023-12-26  3:32         ` Jason Wang
  2023-12-26  9:18           ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Jason Wang @ 2023-12-26  3:32 UTC (permalink / raw)
  To: Heng Qi
  Cc: virtio-comment, Michael S . Tsirkin, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit

On Fri, Dec 22, 2023 at 2:45 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
>
>
>
> On 2023/12/22 10:40 AM, Jason Wang wrote:
> > On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>
> >>
> >> On 2023/12/21 11:37 AM, Jason Wang wrote:
> >>> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >>>> Currently, when each time the driver attempts to update the coalescing parameters
> >>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
> >>>>
> >>>> If a command can only update one vq parameters, when the parameters are updated
> >>>> frequently (such as netdim), ctrlq on the device is kicked frequently, which will
> >>>> increase the device CPU scheduling overhead, and the number and overhead of device
> >>>> DMA will also increase.
> >>>>
> >>>> Merging multiple vq updated parameters into one command can effectively reduce
> >>>> the number of kick devices and device DMA times.
> >>>>
> >>>> Test results show that this greatly improves the efficiency of the ctrlq in
> >>>> responding to multiple vq coalescing parameter updates issued by the driver.
> >>> So netdim is per virtqueue, to make use of this, you need to batch the
> >>> netdim requests first.
> >> Yes. I think the code looks more intuitive.
> >> Please view the following PoC test code combined with the current code
> >> of upstream virtio netdim
> >>
> >> +struct ctrl_coal_vq_entry {
> >> +       u16 vqn;
> >> +       u16 reserved;
> >> +       int usecs;
> >> +       int packets;
> >> +};
> >> +
> >> +struct virtio_net_ctrl_mrg_coal_vq {
> >> +       int num_entries;
> >> +       struct ctrl_coal_vq_entry coal_vqs[];
> >> +};
> >> +
> >>
> >> @@ -313,12 +335,18 @@ struct virtnet_info {
> >>
> >>           struct control_buf *ctrl;
> >>
> >> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
> >> +
> >>
> >>
> >> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct
> >> virtnet_info *vi)
> >>           if (!vi->rq)
> >>                   goto err_rq;
> >>
> >> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct
> >> virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct
> >> ctrl_coal_vq_entry), GFP_KERNEL);
> >>
> >>
> >> +       struct scatterlist sg;
> >> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
> >> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
> >> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
> >> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
> >> +        }
> >> +        vi->ctrl_coal_vqs->num_entries = i;
> >> +        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct
> >> virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries *
> >> sizeof(struct ctrl_coal_vq_entry));
> >> +        vi->req_num++;
> >> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> >> +                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))
> >>
> >>    #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
> >> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
> >>
> >>>    And if you do that, you can batch the commands
> >>> as well.
> >>
> >> Batch commands we have tried:
> >>
> >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> >> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i],
> >> 8, 16);
> >> +               }
> >> +
> >> +               if (unlikely(!virtqueue_kick(vi->cvq)))
> >> +                       return vi->ctrl->status == VIRTIO_NET_OK;
> >> +
> >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> >> +                       while (!virtqueue_is_broken(vi->cvq) &&
> >> !virtqueue_get_buf(vi->cvq, &tmp))
> >> +                               cpu_relax();
> >> +               }
> >> +       }
> >>
> >>
> >> +
> >> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct
> >> vq_coal *vq_ctrls,
> >> +                                u32 max_usecs, u32 max_packets)
> >> +{
> >> +       struct scatterlist hdr, out, stat;
> >> +       struct scatterlist *sgs[4];
> >> +       unsigned out_num = 0;
> >> +       int ret;
> >> +
> >> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> >> +
> >> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
> >> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
> >> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
> >> +       sgs[out_num++] = &hdr;
> >> +
> >> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
> >> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> >> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> >> +       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
> >> +
> >> +       sgs[out_num++] = &out;
> >> +
> >> +       vq_ctrls->status = ~0;
> >> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
> >> +       sgs[out_num] = &stat;
> >> +
> >> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> >> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> >> +       if (ret < 0) {
> >> +               dev_warn(&vi->vdev->dev,
> >> +                        "Failed to add sgs for command vq: %d\n.", ret);
> >> +               return false;
> >> +       }
> >>
> >>
> >> Overall, batch reqs are sufficient, because the current major overhead
> >> is the number of DMAs.
> >> For example, for a device with 256 queues,
> >>
> >> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> >> The overhead of batch cmds is 1 kick + 256*8 DMA times.
> > This can be further optimized by using IN_ORDER. The DMA can be batched as well.
>
> DMA batching reduces the number of DMA requests, but not the actual
> number of DMAs.

Well, if we do IN_ORDER, then:

1 DMA for the avail idx.

At most 2 DMAs are needed for all the descriptors.

The driver can allocate an array of cvq commands, so the CVQ commands
could be fetched in at most 2 DMAs.

etc.
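
To put rough numbers on it (reusing the 8-DMA breakdown quoted above, for
a device with 256 queues; illustrative only):

  scheme                     kicks   DMAs (approx.)
  one cmd per vq (upstream)  256     256 * 8 = 2048
  batch cmds                 1       256 * 8 = 2048
  batch cmds + IN_ORDER      1       a few for avail idx + descriptors +
                                     commands, plus the status writes
  batch reqs (VQS_SET)       1       8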

> For our DPU, how much to batch and when to end a batch also need to be
> evaluated in detail.
>
> >
> > Or even using packed virtqueue.
> >
> >> The overhead of batch reqs is 1 kick + 8 DMA times.
> > If this is a real issue for your architecture, we should not make it
> > only work for coalescing. Any per vq tweaking may suffer from that.
>
> YES.
>
> > E.g. if you want to do ARFS in the future. So it should be e.g. a
> > container command for cvq instead of just for the coalescing command
> > itself.
> >
> > We need to make sure if the existing facility is sufficient or not.
>
> We considered updating the generic ctrlq batch command, but so far we
> have only seen this requirement for netdim and aRFS.

Two use cases should be sufficient to justify it.

We do have others, for example the vlan/mac filters.


> At the same time, aRFS has not yet entered the community.
> We will try to add a batch req interface for it, and it may get a
> dedicated flow vq to ensure that its command issuances, which outnumber
> netdim's, are responded to in time. So for now we plan to advance the
> batching of coalescing parameter requests first.

Thanks



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-26  3:32         ` Jason Wang
@ 2023-12-26  9:18           ` Michael S. Tsirkin
  2023-12-27  2:32             ` Jason Wang
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2023-12-26  9:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Heng Qi, virtio-comment, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit

On Tue, Dec 26, 2023 at 11:32:12AM +0800, Jason Wang wrote:
> On Fri, Dec 22, 2023 at 2:45 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> >
> >
> >
> > > On 2023/12/22 10:40 AM, Jason Wang wrote:
> > > On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > >>
> > >>
> > > >> On 2023/12/21 11:37 AM, Jason Wang wrote:
> > >>> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > >>>> Currently, when each time the driver attempts to update the coalescing parameters
> > >>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
> > >>>>
> > >>>> If a command can only update one vq parameters, when the parameters are updated
> > >>>> frequently (such as netdim), ctrlq on the device is kicked frequently, which will
> > >>>> increase the device CPU scheduling overhead, and the number and overhead of device
> > >>>> DMA will also increase.
> > >>>>
> > >>>> Merging multiple vq updated parameters into one command can effectively reduce
> > >>>> the number of kick devices and device DMA times.
> > >>>>
> > >>>> Test results show that this greatly improves the efficiency of the ctrlq in
> > >>>> responding to multiple vq coalescing parameter updates issued by the driver.
> > >>> So netdim is per virtqueue, to make use of this, you need to batch the
> > >>> netdim requests first.
> > >> Yes. I think the code looks more intuitive.
> > >> Please view the following PoC test code combined with the current code
> > >> of upstream virtio netdim
> > >>
> > >> +struct ctrl_coal_vq_entry {
> > >> +       u16 vqn;
> > >> +       u16 reserved;
> > >> +       int usecs;
> > >> +       int packets;
> > >> +};
> > >> +
> > >> +struct virtio_net_ctrl_mrg_coal_vq {
> > >> +       int num_entries;
> > >> +       struct ctrl_coal_vq_entry coal_vqs[];
> > >> +};
> > >> +
> > >>
> > >> @@ -313,12 +335,18 @@ struct virtnet_info {
> > >>
> > >>           struct control_buf *ctrl;
> > >>
> > >> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
> > >> +
> > >>
> > >>
> > >> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct
> > >> virtnet_info *vi)
> > >>           if (!vi->rq)
> > >>                   goto err_rq;
> > >>
> > >> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct
> > >> virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct
> > >> ctrl_coal_vq_entry), GFP_KERNEL);
> > >>
> > >>
> > >> +       struct scatterlist sg;
> > >> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
> > >> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
> > >> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
> > >> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
> > >> +        }
> > >> +        vi->ctrl_coal_vqs->num_entries = i;
> > >> +        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct
> > >> virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries *
> > >> sizeof(struct ctrl_coal_vq_entry));
> > >> +        vi->req_num++;
> > >> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> > >> +                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))
> > >>
> > >>    #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
> > >> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
> > >>
> > >>>    And if you do that, you can batch the commands
> > >>> as well.
> > >>
> > >> Batch commands we have tried:
> > >>
> > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > >> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i],
> > >> 8, 16);
> > >> +               }
> > >> +
> > >> +               if (unlikely(!virtqueue_kick(vi->cvq)))
> > >> +                       return vi->ctrl->status == VIRTIO_NET_OK;
> > >> +
> > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > >> +                       while (!virtqueue_is_broken(vi->cvq) &&
> > >> !virtqueue_get_buf(vi->cvq, &tmp))
> > >> +                               cpu_relax();
> > >> +               }
> > >> +       }
> > >>
> > >>
> > >> +
> > >> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct
> > >> vq_coal *vq_ctrls,
> > >> +                                u32 max_usecs, u32 max_packets)
> > >> +{
> > >> +       struct scatterlist hdr, out, stat;
> > >> +       struct scatterlist *sgs[4];
> > >> +       unsigned out_num = 0;
> > >> +       int ret;
> > >> +
> > >> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> > >> +
> > >> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
> > >> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
> > >> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
> > >> +       sgs[out_num++] = &hdr;
> > >> +
> > >> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
> > >> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> > >> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> > >> +       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
> > >> +
> > >> +       sgs[out_num++] = &out;
> > >> +
> > >> +       vq_ctrls->status = ~0;
> > >> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
> > >> +       sgs[out_num] = &stat;
> > >> +
> > >> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> > >> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> > >> +       if (ret < 0) {
> > >> +               dev_warn(&vi->vdev->dev,
> > >> +                        "Failed to add sgs for command vq: %d\n.", ret);
> > >> +               return false;
> > >> +       }
> > >>
> > >>
> > >> Overall, batch reqs are sufficient, because the current major overhead
> > >> is the number of DMAs.
> > >> For example, for a device with 256 queues,
> > >>
> > >> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> > >> The overhead of batch cmds is 1 kick + 256*8 DMA times.
> > > This can be further optimized by using IN_ORDER. The DMA can be batched as well.
> >
> > DMA batching reduces the number of DMA requests, but not the actual
> > number of DMAs.
> 
> Well, if we do IN_ORDER, then:
>
> 1 DMA for the avail idx.
>
> At most 2 DMAs are needed for all the descriptors.
>
> The driver can allocate an array of cvq commands, so the CVQ commands
> could be fetched in at most 2 DMAs.
>
> etc.

For most users, all this happens maybe once per VM start.  It's not
reasonable to start micro-optimizing this until it is taking so much
time that it slows VM start measurably.  With each DMA taking at most
an extra microsecond, I just do not see the benefit; even with 1000s
of DMAs, this will take at most milliseconds.
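
(For the 256-queue example above: 256 * 8 = 2048 DMAs, at ~1us each that
is about 2ms, typically paid once at VM start.)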


> > For our DPU, how much to batch and when to end a batch also need to be
> > evaluated in detail.
> >
> > >
> > > Or even using packed virtqueue.
> > >
> > >> The overhead of batch reqs is 1 kick + 8 DMA times.
> > > If this is a real issue for your architecture, we should not make it
> > > only work for coalescing. Any per vq tweaking may suffer from that.
> >
> > YES.
> >
> > > E.g. if you want to do ARFS in the future. So it should be e.g. a
> > > container command for cvq instead of just for the coalescing command
> > > itself.
> > >
> > > We need to make sure if the existing facility is sufficient or not.
> >
> > We considered updating the generic ctrlq batch command, but so far we
> > have only seen this requirement for netdim and aRFS.
> 
> Two use cases should be sufficient to justify it.
> 
> We do have others, for example the vlan/mac filters.
> 
> 
> > At the same time, aRFS has not yet entered the community.
> > We will try to add a batch req interface for it, and it may get a
> > dedicated flow vq to ensure that its command issuances, which outnumber
> > netdim's, are responded to in time. So for now we plan to advance the
> > batching of coalescing parameter requests first.
> 
> Thanks



* Re: [virtio-comment] Re: [PATCH RESEND] virtio-net: support setting coalescing params for multiple vqs
  2023-12-26  9:18           ` Michael S. Tsirkin
@ 2023-12-27  2:32             ` Jason Wang
  0 siblings, 0 replies; 18+ messages in thread
From: Jason Wang @ 2023-12-27  2:32 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Heng Qi, virtio-comment, Yuri Benditovich, Xuan Zhuo,
	Alvaro Karsz, Parav Pandit

On Tue, Dec 26, 2023 at 5:18 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Tue, Dec 26, 2023 at 11:32:12AM +0800, Jason Wang wrote:
> > On Fri, Dec 22, 2023 at 2:45 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > >
> > >
> > >
> > > > On 2023/12/22 10:40 AM, Jason Wang wrote:
> > > > On Thu, Dec 21, 2023 at 12:55 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > > >>
> > > >>
> > > > >> On 2023/12/21 11:37 AM, Jason Wang wrote:
> > > >>> On Wed, Dec 20, 2023 at 10:40 PM Heng Qi <hengqi@linux.alibaba.com> wrote:
> > > >>>> Currently, when each time the driver attempts to update the coalescing parameters
> > > >>>> for a vq, it needs to kick the device and wait for the ctrlq response to return.
> > > >>>>
> > > >>>> If a command can only update one vq parameters, when the parameters are updated
> > > >>>> frequently (such as netdim), ctrlq on the device is kicked frequently, which will
> > > >>>> increase the device CPU scheduling overhead, and the number and overhead of device
> > > >>>> DMA will also increase.
> > > >>>>
> > > >>>> Merging multiple vq updated parameters into one command can effectively reduce
> > > >>>> the number of kick devices and device DMA times.
> > > >>>>
> > > >>>> Test results show that this greatly improves the efficiency of the ctrlq in
> > > >>>> responding to multiple vq coalescing parameter updates issued by the driver.
> > > >>> So netdim is per virtqueue, to make use of this, you need to batch the
> > > >>> netdim requests first.
> > > >> Yes. I think the code looks more intuitive.
> > > >> Please view the following PoC test code combined with the current code
> > > >> of upstream virtio netdim
> > > >>
> > > >> +struct ctrl_coal_vq_entry {
> > > >> +       u16 vqn;
> > > >> +       u16 reserved;
> > > >> +       int usecs;
> > > >> +       int packets;
> > > >> +};
> > > >> +
> > > >> +struct virtio_net_ctrl_mrg_coal_vq {
> > > >> +       int num_entries;
> > > >> +       struct ctrl_coal_vq_entry coal_vqs[];
> > > >> +};
> > > >> +
> > > >>
> > > >> @@ -313,12 +335,18 @@ struct virtnet_info {
> > > >>
> > > >>           struct control_buf *ctrl;
> > > >>
> > > >> +       struct virtio_net_ctrl_mrg_coal_vq *ctrl_coal_vqs;
> > > >> +
> > > >>
> > > >>
> > > >> @@ -4390,6 +4586,7 @@ static int virtnet_alloc_queues(struct
> > > >> virtnet_info *vi)
> > > >>           if (!vi->rq)
> > > >>                   goto err_rq;
> > > >>
> > > >> +       vi->ctrl_coal_vqs = kzalloc(sizeof(struct
> > > >> virtio_net_ctrl_mrg_coal_vq) + vi->max_queue_pairs * sizeof(struct
> > > >> ctrl_coal_vq_entry), GFP_KERNEL);
> > > >>
> > > >>
> > > >> +       struct scatterlist sg;
> > > >> +       for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].vqn = rxq2vq(i);
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].usecs = 8;
> > > >> +                vi->ctrl_coal_vqs->coal_vqs[i].packets = 16;
> > > >> +        }
> > > >> +        vi->ctrl_coal_vqs->num_entries = i;
> > > >> +        sg_init_one(&sg, vi->ctrl_coal_vqs, sizeof(struct
> > > >> virtio_net_ctrl_mrg_coal_vq) + vi->ctrl_coal_vqs->num_entries *
> > > >> sizeof(struct ctrl_coal_vq_entry));
> > > >> +        vi->req_num++;
> > > >> +        if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_NOTF_COAL,
> > > >> +                              VIRTIO_NET_CTRL_NOTF_COAL_*VQS*_SET, &sg))
> > > >>
> > > >>    #define VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET               3
> > > >> +#define VIRTIO_NET_CTRL_NOTF_COAL_VQS_SET              4
> > > >>
> > > >>>    And if you do that, you can batch the commands
> > > >>> as well.
> > > >>
> > > >> Batch commands we have tried:
> > > >>
> > > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                       err = virtnet_pre_send_cmd(vi, &vi->vq_ctrls[i],
> > > >> 8, 16);
> > > >> +               }
> > > >> +
> > > >> +               if (unlikely(!virtqueue_kick(vi->cvq)))
> > > >> +                       return vi->ctrl->status == VIRTIO_NET_OK;
> > > >> +
> > > >> +               for (i = 0; i < vi->curr_queue_pairs; i++) {
> > > >> +                       while (!virtqueue_is_broken(vi->cvq) &&
> > > >> !virtqueue_get_buf(vi->cvq, &tmp))
> > > >> +                               cpu_relax();
> > > >> +               }
> > > >> +       }
> > > >>
> > > >>
> > > >> +
> > > >> +static bool virtnet_pre_send_cmd(struct virtnet_info *vi, struct
> > > >> vq_coal *vq_ctrls,
> > > >> +                                u32 max_usecs, u32 max_packets)
> > > >> +{
> > > >> +       struct scatterlist hdr, out, stat;
> > > >> +       struct scatterlist *sgs[4];
> > > >> +       unsigned out_num = 0;
> > > >> +       int ret;
> > > >> +
> > > >> +       BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ));
> > > >> +
> > > >> +       vq_ctrls->hdr.class = VIRTIO_NET_CTRL_NOTF_COAL;
> > > >> +       vq_ctrls->hdr.cmd = VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET;
> > > >> +       sg_init_one(&hdr, &vq_ctrls->hdr, sizeof(vq_ctrls->hdr));
> > > >> +       sgs[out_num++] = &hdr;
> > > >> +
> > > >> +       vq_ctrls->coal_vq.vqn = cpu_to_le16(0);
> > > >> +       vq_ctrls->coal_vq.coal.max_usecs = cpu_to_le32(max_usecs);
> > > >> +       vq_ctrls->coal_vq.coal.max_packets = cpu_to_le32(max_packets);
> > > >> +       sg_init_one(&out, &vq_ctrls->ho, sizeof(vq_ctrls->ho));
> > > >> +
> > > >> +       sgs[out_num++] = &out;
> > > >> +
> > > >> +       vq_ctrls->status = ~0;
> > > >> +       sg_init_one(&stat, &vq_ctrls->status, sizeof(vq_ctrls->status));
> > > >> +       sgs[out_num] = &stat;
> > > >> +
> > > >> +       BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
> > > >> +       ret = virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
> > > >> +       if (ret < 0) {
> > > >> +               dev_warn(&vi->vdev->dev,
> > > >> +                        "Failed to add sgs for command vq: %d\n.", ret);
> > > >> +               return false;
> > > >> +       }
> > > >>
> > > >>
> > > >> Overall, batch reqs are sufficient, because the current major overhead
> > > >> is the number of DMAs.
> > > >> For example, for a device with 256 queues,
> > > >>
> > > >> For the current upstream code, the overhead is 256 kicks + 256*8 DMA times.
> > > >> The overhead of batch cmds is 1 kick + 256*8 DMA times.
> > > > This can be further optimized by using IN_ORDER. The DMA can be batched as well.
> > >
> > > DMA batching reduces the number of DMA requests, but not the actual
> > > number of DMAs.
> >
> > Well, if we do IN_ORDER, then:
> >
> > 1 DMA for the avail idx.
> >
> > At most 2 DMAs are needed for all the descriptors.
> >
> > The driver can allocate an array of cvq commands, so the CVQ commands
> > could be fetched in at most 2 DMAs.
> >
> > etc.
>
> For most users, all this happens maybe once per VM start.  It's not
> reasonable to start micro-optimizing this until it is taking so much
> time that it slows VM start measurably.  With each DMA taking at most
> an extra microsecond, I just do not see the benefit; even with 1000s
> of DMAs, this will take at most milliseconds.

Probably; it is just an example of a possible optimization.

To make the proposal more general, a generic container command would be
better than an ad-hoc batch command just for coalescing. A hypothetical
layout is sketched below.
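
Just to illustrate the shape of such a container (a hypothetical layout,
not a concrete proposal):

struct virtio_net_ctrl_container_entry {
        u8 class;    /* inner command class, e.g. VIRTIO_NET_CTRL_NOTF_COAL */
        u8 command;  /* inner command, e.g. VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET */
        le16 len;    /* length of data[] */
        u8 data[];   /* inner command-specific payload */
};

struct virtio_net_ctrl_container {
        le16 num_entries;
        le16 reserved;
        /* num_entries container entries follow */
};

Any cvq command - coalescing updates, aRFS, vlan/mac filters - could then
be carried in one buffer and fetched by the device in a handful of DMAs.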

Thanks

>
>
> > > For our DPU, how much to batch and when to end a batch also need to be
> > > evaluated in detail.
> > >
> > > >
> > > > Or even using packed virtqueue.
> > > >
> > > >> The overhead of batch reqs is 1 kick + 8 DMA times.
> > > > If this is a real issue for your architecture, we should not make it
> > > > only work for coalescing. Any per vq tweaking may suffer from that.
> > >
> > > YES.
> > >
> > > > E.g. if you want to do ARFS in the future. So it should be e.g. a
> > > > container command for cvq instead of just for the coalescing command
> > > > itself.
> > > >
> > > > We need to make sure if the existing facility is sufficient or not.
> > >
> > > We considered updating the generic ctrlq batch command, but so far we
> > > have only seen this requirement for netdim and aRFS.
> >
> > Two use cases should be sufficient to justify it.
> >
> > We do have others, for example the vlan/mac filters.
> >
> >
> > > At the same time, aRFS has not yet entered the community.
> > > We will try to add a batch req interface for it, and it may get a
> > > dedicated flow vq to ensure that its command issuances, which outnumber
> > > netdim's, are responded to in time. So for now we plan to advance the
> > > batching of coalescing parameter requests first.
> >
> > Thanks
>


