* Re: virtio-fs: adding support for multi-queue
       [not found] <2fd99bc2-0414-0b85-2bff-3a84ae6c23bd@gootzen.net>
@ 2023-02-07 19:45 ` Stefan Hajnoczi
  2023-02-07 19:53   ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-02-07 19:45 UTC (permalink / raw)
  To: Peter-Jan Gootzen; +Cc: virtualization, Jonas Pfefferle, vgoyal, miklos


On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
> Hi,
> 
> For my MSc thesis project in collaboration with IBM
> (https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
> performance of the virtio-fs driver in high throughput scenarios. We think
> the main bottleneck is the fact that the virtio-fs driver does not support
> multi-queue (while the spec does). A big factor in this is that our setup on
> the virtio-fs device-side (a DPU) does not easily allow multiple cores to
> tend to a single virtio queue.
> 
> We are therefore looking to implement multi-queue functionality in the
> virtio-fs driver. The request queues seem to already get created at probe,
> but left unused afterwards. The current plan is to select the queue for a
> request based on the current smp processor id and set the virtio queue
> interrupt affinity for each core accordingly at probe.
> 
> This is my first time contributing to the Linux kernel so I am here to ask
> what the maintainers' thoughts are about this plan.

Hi,
Sounds good. Assigning vqs round-robin is the strategy that virtio-net
and virtio-blk use. virtio-blk could be an interesting example as it's
similar to virtiofs. The Linux multiqueue block layer and core virtio
irq allocation handle CPU affinity in the case of virtio-blk.
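
For illustration, a minimal sketch of the plan you describe (only
virtio_find_vqs(), struct irq_affinity, and raw_smp_processor_id() are
real kernel APIs here; the virtio_fs fields and helper names are
hypothetical, and error handling is omitted):

  /* At probe: ask the virtio core to spread the request-vq MSI vectors
   * across the online CPUs, as virtio-blk does. One vector is reserved
   * up front for the hiprio queue (vq 0 in the virtio-fs spec).
   */
  static int virtio_fs_find_vqs(struct virtio_device *vdev,
                                struct virtio_fs *fs)
  {
          struct irq_affinity desc = { .pre_vectors = 1 };

          return virtio_find_vqs(vdev, fs->nvqs, fs->vqs,
                                 fs->callbacks, fs->names, &desc);
  }

  /* At request submission: map the submitting CPU to a request queue. */
  static struct virtqueue *virtio_fs_pick_vq(struct virtio_fs *fs)
  {
          unsigned int qid = raw_smp_processor_id() % fs->num_request_queues;

          return fs->vqs[1 + qid]; /* request queues start after hiprio */
  }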

Which DPU are you targeting?

Stefan


* Re: virtio-fs: adding support for multi-queue
  2023-02-07 19:45 ` virtio-fs: adding support for multi-queue Stefan Hajnoczi
@ 2023-02-07 19:53   ` Vivek Goyal
  2023-02-07 21:32     ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2023-02-07 19:53 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: German Maglione, virtualization, Jonas Pfefferle, miklos

On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:
> On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
> > Hi,
> > 

[cc German]

> > For my MSc thesis project in collaboration with IBM
> > (https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
> > performance of the virtio-fs driver in high throughput scenarios. We think
> > the main bottleneck is the fact that the virtio-fs driver does not support
> > multi-queue (while the spec does). A big factor in this is that our setup on
> > the virtio-fs device-side (a DPU) does not easily allow multiple cores to
> > tend to a single virtio queue.

This is an interesting limitation in the DPU.

> > 
> > We are therefore looking to implement multi-queue functionality in the
> > virtio-fs driver. The request queues seem to already get created at probe,
> > but left unused afterwards. The current plan is to select the queue for a
> > request based on the current smp processor id and set the virtio queue
> > interrupt affinity for each core accordingly at probe.
> > 
> > This is my first time contributing to the Linux kernel so I am here to ask
> > what the maintainers' thoughts are about this plan.

In general, we have talked about multiqueue support in the past, but
nothing actually made it upstream. So if there are patches to make it
happen, it would be reasonable to look at and review them.

Is it just a theory at this point, or have you implemented it and seen
a significant performance benefit with multiqueue?

Thanks
Vivek

* Re: virtio-fs: adding support for multi-queue
  2023-02-07 19:53   ` Vivek Goyal
@ 2023-02-07 21:32     ` Stefan Hajnoczi
  2023-02-07 21:57       ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-02-07 21:32 UTC (permalink / raw)
  To: Vivek Goyal; +Cc: German Maglione, virtualization, Jonas Pfefferle, miklos


On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:
> [...]
> 
> This is an interesting limitation in the DPU.

Virtqueues are single-consumer queues anyway. Sharing them between
multiple threads would be expensive. I think using multiqueue is natural
and not specific to DPUs.

Stefan


* Re: virtio-fs: adding support for multi-queue
  2023-02-07 21:32     ` Stefan Hajnoczi
@ 2023-02-07 21:57       ` Vivek Goyal
  2023-02-08  8:33         ` Peter-Jan Gootzen via Virtualization
  0 siblings, 1 reply; 11+ messages in thread
From: Vivek Goyal @ 2023-02-07 21:57 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: German Maglione, virtualization, Jonas Pfefferle, miklos

On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:
> > [...]
> > 
> > This is an interesting limitation in the DPU.
> 
> Virtqueues are single-consumer queues anyway. Sharing them between
> multiple threads would be expensive. I think using multiqueue is natural
> and not specific to DPUs.

Can we create multiple threads (a thread pool) on the DPU and let these
threads process requests in parallel (while there is only one
virtqueue)?

This is what we did in virtiofsd: one thread is dedicated to pulling
requests from the virtqueue and passing them to a thread pool for
processing. That seems to help with performance in certain cases.

Is that possible on the DPU? That by itself can give a nice performance
boost for certain workloads without actually having to implement
multiqueue.
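
Schematically, the virtiofsd model above is something like this (a
self-contained pthreads sketch; the vq_* and process_request() calls
are hypothetical stand-ins for the virtqueue plumbing, and the
completion path would need its own vq locking, which is elided):

  #include <pthread.h>
  #include <stddef.h>

  /* Stand-in for a parsed FUSE request. */
  struct fuse_req { struct fuse_req *next; };

  /* Minimal thread-safe FIFO between the dispatcher and the workers. */
  static struct fuse_req *head, *tail;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

  static void fifo_push(struct fuse_req *r)
  {
          pthread_mutex_lock(&lock);
          r->next = NULL;
          if (tail)
                  tail->next = r;
          else
                  head = r;
          tail = r;
          pthread_cond_signal(&nonempty);
          pthread_mutex_unlock(&lock);
  }

  static struct fuse_req *fifo_pop(void)
  {
          struct fuse_req *r;

          pthread_mutex_lock(&lock);
          while (!head)
                  pthread_cond_wait(&nonempty, &lock);
          r = head;
          head = r->next;
          if (!head)
                  tail = NULL;
          pthread_mutex_unlock(&lock);
          return r;
  }

  /* One dedicated thread pulls requests off the single virtqueue... */
  static void *dispatcher(void *unused)
  {
          for (;;)
                  fifo_push(vq_pop_request());      /* hypothetical */
          return NULL;
  }

  /* ...and N worker threads process them in parallel. */
  static void *worker(void *unused)
  {
          for (;;) {
                  struct fuse_req *r = fifo_pop();
                  process_request(r);               /* hypothetical I/O */
                  vq_complete_request(r);           /* hypothetical */
          }
          return NULL;
  }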

Just curious. I am not opposed to the idea of multiqueue; I am just
curious about the kind of performance gain (if any) it can provide.
And will this be helpful for the Rust virtiofsd running on the host as
well?

Thanks
Vivek


* Re: virtio-fs: adding support for multi-queue
  2023-02-07 21:57       ` Vivek Goyal
@ 2023-02-08  8:33         ` Peter-Jan Gootzen via Virtualization
  2023-02-08 10:43           ` Stefan Hajnoczi
  0 siblings, 1 reply; 11+ messages in thread
From: Peter-Jan Gootzen via Virtualization @ 2023-02-08  8:33 UTC (permalink / raw)
  To: Vivek Goyal, Stefan Hajnoczi
  Cc: German Maglione, virtualization, Jonas Pfefferle, miklos



On 07/02/2023 22:57, Vivek Goyal wrote:
>>> [...]
>>>
>>> This is an interesting limitation in the DPU.
>>
>> Virtqueues are single-consumer queues anyway. Sharing them between
>> multiple threads would be expensive. I think using multiqueue is natural
>> and not specific to DPUs.
> 
> Can we create multiple threads (a thread pool) on the DPU and let these
> threads process requests in parallel (while there is only one
> virtqueue)?
> 
> This is what we did in virtiofsd: one thread is dedicated to pulling
> requests from the virtqueue and passing them to a thread pool for
> processing. That seems to help with performance in certain cases.
> 
> Is that possible on the DPU? That by itself can give a nice performance
> boost for certain workloads without actually having to implement
> multiqueue.
> 
> Just curious. I am not opposed to the idea of multiqueue; I am just
> curious about the kind of performance gain (if any) it can provide.
> And will this be helpful for the Rust virtiofsd running on the host as
> well?
> 
> Thanks
> Vivek
>
There is technically nothing preventing us from consuming a single
queue on multiple cores; however, our current virtio implementation
(DPU-side) is set up with the assumption that you should never want to
do that (concurrency mayhem around the virtqueues and the DMAs). So
instead of putting all the work into reworking the implementation to
support that, and still incurring the big overhead, we see it as more
fitting to amend the virtio-fs driver with multi-queue support.


 > Is it just a theory at this point, or have you implemented it and
 > seen a significant performance benefit with multiqueue?

It is a theory, but we are currently seeing that with the single
request queue, the single core attending to that queue on the DPU is
reasonably close to fully saturated.

 > And will this be helpful for the Rust virtiofsd running on the host
 > as well?

I figure this would depend on the workload and the user's needs.
Having many cores concurrently pull on their own virtqueue and then
immediately process the request locally would of course improve
performance. But we are offloading all this work to the DPU to provide
high-throughput cloud services.

 > Sounds good. Assigning vqs round-robin is the strategy that virtio-net
 > and virtio-blk use. virtio-blk could be an interesting example as it's
 > similar to virtiofs. The Linux multiqueue block layer and core virtio
 > irq allocation handle CPU affinity in the case of virtio-blk.

virtio-blk uses the queue assigned by the multiqueue block layer, and
virtio-net the queue assigned by the net core layer, correct?

If I interpret you correctly, the round-robin strategy assigns cores to
queues statically, not requests dynamically?
This is what I remembered as well, but I can't find it clearly in the
source right now; do you have source references for this?

 > Which DPU are you targeting?

This is something I unfortunately can't disclose at the moment.

Thanks,
Peter-Jan

* Re: virtio-fs: adding support for multi-queue
  2023-02-08  8:33         ` Peter-Jan Gootzen via Virtualization
@ 2023-02-08 10:43           ` Stefan Hajnoczi
  2023-02-08 16:29             ` Peter-Jan Gootzen via Virtualization
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-02-08 10:43 UTC (permalink / raw)
  To: Peter-Jan Gootzen
  Cc: German Maglione, virtualization, Jonas Pfefferle, Vivek Goyal, miklos


On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:
> [...]
> 
> There is technically nothing preventing us from consuming a single queue
> on multiple cores; however, our current virtio implementation (DPU-side)
> is set up with the assumption that you should never want to do that
> (concurrency mayhem around the virtqueues and the DMAs). So instead of
> putting all the work into reworking the implementation to support that,
> and still incurring the big overhead, we see it as more fitting to amend
> the virtio-fs driver with multi-queue support.
> 
> 
> > Is it just a theory at this point, or have you implemented it and
> > seen a significant performance benefit with multiqueue?
> 
> It is a theory, but we are currently seeing that with the single request
> queue, the single core attending to that queue on the DPU is reasonably
> close to fully saturated.
> 
> > And will this be helpful for the Rust virtiofsd running on the host as
> > well?
> 
> I figure this would depend on the workload and the user's needs.
> Having many cores concurrently pull on their own virtqueue and then
> immediately process the request locally would of course improve
> performance. But we are offloading all this work to the DPU to provide
> high-throughput cloud services.

I think Vivek is getting at whether your code processes requests
sequentially or in parallel. A single thread processing the virtqueue
that hands off requests to worker threads or uses io_uring to perform
I/O asynchronously will perform differently from a single thread that
processes requests sequentially in a blocking fashion. Multiqueue is not
necessary for parallelism, but the single queue might become a
bottleneck.
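
(For example, a single consumer thread can keep many I/Os in flight
with io_uring. A minimal sketch — struct request and the
get_next_request()/complete_request() helpers are hypothetical
stand-ins for the virtqueue plumbing, and SQ-full handling is elided:)

  #include <liburing.h>

  struct request {            /* hypothetical parsed request */
          int fd;
          void *buf;
          unsigned int len;
          off_t off;
  };

  struct request *get_next_request(void);              /* hypothetical */
  void complete_request(struct request *req, int res); /* hypothetical */

  static void serve(struct io_uring *ring)
  {
          struct io_uring_cqe *cqe;
          struct request *req;

          for (;;) {
                  /* Queue up everything that has arrived, without
                   * blocking on any single I/O.
                   */
                  while ((req = get_next_request()) != NULL) {
                          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                          io_uring_prep_read(sqe, req->fd, req->buf,
                                             req->len, req->off);
                          io_uring_sqe_set_data(sqe, req);
                  }
                  io_uring_submit(ring);

                  /* Reap one completion, then loop to submit more. */
                  io_uring_wait_cqe(ring, &cqe);
                  complete_request(io_uring_cqe_get_data(cqe), cqe->res);
                  io_uring_cqe_seen(ring, cqe);
          }
  }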

> > Sounds good. Assigning vqs round-robin is the strategy that virtio-net
> > and virtio-blk use. virtio-blk could be an interesting example as it's
> > similar to virtiofs. The Linux multiqueue block layer and core virtio
> > irq allocation handle CPU affinity in the case of virtio-blk.
> 
> virtio-blk uses the queue assigned by the multiqueue block layer, and
> virtio-net the queue assigned by the net core layer, correct?

Yes.

> If I interpret you correctly, the round-robin strategy assigns cores to
> queues statically, not requests dynamically?

Yes, virtqueues are assigned to CPUs statically.

> This is what I remembered as well, but I can't find it clearly in the
> source right now; do you have source references for this?

virtio_blk.ko uses an irq_affinity descriptor to tell virtio_find_vqs()
to spread MSI interrupts across CPUs:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/virtio_blk.c#n609

The core blk-mq code has the blk_mq_virtio_map_queues() function to map
block layer queues to virtqueues:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-mq-virtio.c#n24

virtio_net.ko manually sets virtqueue affinity:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/virtio_net.c#n2283

virtio_net.ko tells the core net subsystem about queues using
netif_set_real_num_tx_queues() and then skbs are mapped to queues by
common code:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c#n4079
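
(In code, the manual virtqueue-affinity variant boils down to roughly
the following at probe time — an illustrative sketch only, with
vqs/nvqs standing for the driver's own bookkeeping:)

  /* Pin each vq's interrupt to one online CPU, round-robin, the way
   * virtio_net does it by hand.
   */
  static void assign_vq_affinity(struct virtqueue *vqs[], unsigned int nvqs)
  {
          unsigned int i, cpu = cpumask_first(cpu_online_mask);

          for (i = 0; i < nvqs; i++) {
                  virtqueue_set_affinity(vqs[i], cpumask_of(cpu));
                  cpu = cpumask_next(cpu, cpu_online_mask);
                  if (cpu >= nr_cpu_ids)
                          cpu = cpumask_first(cpu_online_mask);
          }
  }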

Stefan


* Re: virtio-fs: adding support for multi-queue
  2023-02-08 10:43           ` Stefan Hajnoczi
@ 2023-02-08 16:29             ` Peter-Jan Gootzen via Virtualization
  2023-02-08 20:23               ` Vivek Goyal
  2023-02-22 14:32               ` Stefan Hajnoczi
  0 siblings, 2 replies; 11+ messages in thread
From: Peter-Jan Gootzen via Virtualization @ 2023-02-08 16:29 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: German Maglione, virtualization, Jonas Pfefferle, Vivek Goyal, miklos

On 08/02/2023 11:43, Stefan Hajnoczi wrote:
> [...]
> 
> I think Vivek is getting at whether your code processes requests
> sequentially or in parallel. A single thread processing the virtqueue
> that hands off requests to worker threads or uses io_uring to perform
> I/O asynchronously will perform differently from a single thread that
> processes requests sequentially in a blocking fashion. Multiqueue is not
> necessary for parallelism, but the single queue might become a
> bottleneck.

Requests are handled non-blocking with remote I/O on the DPU. Our
current architecture is as follows:
T1: tends to the virtqueue, translates FUSE requests to remote I/O, and
fires off the asynchronous remote I/O.
T2: polls for completion of the remote I/O, translates it back to FUSE,
and puts the FUSE buffers in a completion queue belonging to T1.
T1: handles the virtio completion and the DMA of the requests in the CQ.

Thread 1 busy-polls on its two queues (virtqueue and CQ) with equal
priority; thread 2 busy-polls as well. This setup is not really optimal,
but we are working within the constraints of both our DPU and our
remote I/O stack.
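
In rough C-style pseudocode, the loop structure is (every function here
is a hypothetical stand-in for our DPU SDK and remote I/O stack, which
we cannot show):

  /* T1: sole consumer of the virtqueue; also completes requests. */
  static void *t1_main(void *unused)
  {
          for (;;) {
                  struct fuse_req *req;

                  /* New FUSE request on the single virtqueue? */
                  if ((req = virtq_poll()) != NULL)
                          remote_io_submit(fuse_to_remote(req)); /* async */

                  /* Remote I/O finished? DMA the data back, complete. */
                  if ((req = cq_pop()) != NULL) {
                          dma_to_host(req);
                          virtq_complete(req);
                  }
          }
  }

  /* T2: polls the remote I/O stack and feeds T1's completion queue. */
  static void *t2_main(void *unused)
  {
          for (;;) {
                  struct remote_cqe *c = remote_io_poll();

                  if (c != NULL)
                          cq_push(remote_to_fuse(c));
          }
  }
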
Currently, with a sequential single-job 4k workload, we are able to get:
Write: 246 MiB/s
Read: 20 MiB/s
We are not sure yet where the bottleneck is for reads; we hope to be
able to match the write speed. For writes, the two main bottlenecks we
see are the single virtqueue (so limited parallelism on the DPU and the
remote side) and the fact that virtio-fs I/O is constrained to the 4k
page size (NFS, for example, which we are trying to replace, sees huge
performance gains with larger block sizes).

>> This is what I remembered as well, but I can't find it clearly in the
>> source right now; do you have source references for this?
> 
> virtio_blk.ko uses an irq_affinity descriptor to tell virtio_find_vqs()
> to spread MSI interrupts across CPUs:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/block/virtio_blk.c#n609
> 
> The core blk-mq code has the blk_mq_virtio_map_queues() function to map
> block layer queues to virtqueues:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/block/blk-mq-virtio.c#n24
> 
> virtio_net.ko manually sets virtqueue affinity:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/virtio_net.c#n2283
> 
> virtio_net.ko tells the core net subsystem about queues using
> netif_set_real_num_tx_queues() and then skbs are mapped to queues by
> common code:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/core/dev.c#n4079

Thanks for the pointers. :)

Thanks,
Peter-Jan


* Re: virtio-fs: adding support for multi-queue
  2023-02-08 16:29             ` Peter-Jan Gootzen via Virtualization
@ 2023-02-08 20:23               ` Vivek Goyal
  2023-02-22 14:32               ` Stefan Hajnoczi
  1 sibling, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2023-02-08 20:23 UTC (permalink / raw)
  To: Peter-Jan Gootzen
  Cc: German Maglione, virtualization, Jonas Pfefferle,
	Stefan Hajnoczi, miklos

On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:
> On 08/02/2023 11:43, Stefan Hajnoczi wrote:
> > [...]
> > 
> > I think Vivek is getting at whether your code processes requests
> > sequentially or in parallel. A single thread processing the virtqueue
> > that hands off requests to worker threads or uses io_uring to perform
> > I/O asynchronously will perform differently from a single thread that
> > processes requests sequentially in a blocking fashion. Multiqueue is not
> > necessary for parallelism, but the single queue might become a
> > bottleneck.
> 
> Requests are handled non-blocking with remote I/O on the DPU. Our current
> architecture is as follows:
> T1: tends to the virtqueue, translates FUSE requests to remote I/O, and
> fires off the asynchronous remote I/O.
> T2: polls for completion of the remote I/O, translates it back to FUSE,
> and puts the FUSE buffers in a completion queue belonging to T1.
> T1: handles the virtio completion and the DMA of the requests in the CQ.
> 
> Thread 1 busy-polls on its two queues (virtqueue and CQ) with equal
> priority; thread 2 busy-polls as well. This setup is not really optimal,
> but we are working within the constraints of both our DPU and our remote
> I/O stack.
> Currently, with a sequential single-job 4k workload, we are able to get:
> Write: 246 MiB/s
> Read: 20 MiB/s

I had been doing some performance benchmarking for virtiofs and I found
some old results.

https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-10-2021

While running on top of a local fs, with bs=4K and a single queue, I
could achieve more than 600MB/s.

NAME                    WORKLOAD                Bandwidth       IOPS            
default                 seqread-psync           625.0mb         156.2k          
no-tpool                seqread-psync           660.8mb         165.2k          

But the catch here, I think, is that the host is doing the caching. In
your case I am assuming there is no caching on the DPU and all the I/O
is going to remote storage (which might be doing caching in memory).

Anyway, the point I am trying to make is that even with a single vq,
virtiofs can push a reasonable amount of I/O.

I will be curious to see how much multiqueue can improve these numbers.

> We are not sure yet where the bottleneck is for reads; we hope to be able
> to match the write speed. For writes, the two main bottlenecks we see are
> the single virtqueue (so limited parallelism on the DPU and the remote
> side) and the fact that virtio-fs I/O is constrained to the 4k page size
> (NFS, for example, which we are trying to replace, sees huge performance
> gains with larger block sizes).

I am wondering how you concluded that the single vq is the performance
bottleneck, and not the remote storage the DPU is sending I/O to.

Thanks
Vivek


* Re: virtio-fs: adding support for multi-queue
  2023-02-08 16:29             ` Peter-Jan Gootzen via Virtualization
  2023-02-08 20:23               ` Vivek Goyal
@ 2023-02-22 14:32               ` Stefan Hajnoczi
  2023-03-07 19:43                 ` Peter-Jan Gootzen via Virtualization
  1 sibling, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2023-02-22 14:32 UTC (permalink / raw)
  To: Peter-Jan Gootzen
  Cc: German Maglione, virtualization, Jonas Pfefferle, Vivek Goyal, miklos


On Wed, Feb 08, 2023 at 05:29:25PM +0100, Peter-Jan Gootzen wrote:
> On 08/02/2023 11:43, Stefan Hajnoczi wrote:
> > On Wed, Feb 08, 2023 at 09:33:33AM +0100, Peter-Jan Gootzen wrote:
> > > 
> > > 
> > > On 07/02/2023 22:57, Vivek Goyal wrote:
> > > > On Tue, Feb 07, 2023 at 04:32:02PM -0500, Stefan Hajnoczi wrote:
> > > > > On Tue, Feb 07, 2023 at 02:53:58PM -0500, Vivek Goyal wrote:
> > > > > > On Tue, Feb 07, 2023 at 02:45:39PM -0500, Stefan Hajnoczi wrote:
> > > > > > > On Tue, Feb 07, 2023 at 11:14:46AM +0100, Peter-Jan Gootzen wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > 
> > > > > > [cc German]
> > > > > > 
> > > > > > > > For my MSc thesis project in collaboration with IBM
> > > > > > > > (https://github.com/IBM/dpu-virtio-fs) we are looking to improve the
> > > > > > > > performance of the virtio-fs driver in high throughput scenarios. We think
> > > > > > > > the main bottleneck is the fact that the virtio-fs driver does not support
> > > > > > > > multi-queue (while the spec does). A big factor in this is that our setup on
> > > > > > > > the virtio-fs device-side (a DPU) does not easily allow multiple cores to
> > > > > > > > tend to a single virtio queue.
> > > > > > 
> > > > > > This is an interesting limitation in DPU.
> > > > > 
> > > > > Virtqueues are single-consumer queues anyway. Sharing them between
> > > > > multiple threads would be expensive. I think using multiqueue is natural
> > > > > and not specific to DPUs.
> > > > 
> > > > Can we create multiple threads (a thread pool) on DPU and let these
> > > > threads process requests in parallel (While there is only one virt
> > > > queue).
> > > > 
> > > > So this is what we had done in virtiofsd. One thread is dedicated to
> > > > pull the requests from virt queue and then pass the request to thread
> > > > pool to process it. And that seems to help with performance in
> > > > certain cases.
> > > > 
> > > > Is that possible on DPU? That itself can give a nice performance
> > > > boost for certain workloads without having to implement multiqueue
> > > > actually.
> > > > 
> > > > Just curious. I am not opposed to the idea of multiqueue. I am
> > > > just curious about the kind of performance gain (if any) it can
> > > > provide. And will this be helpful for rust virtiofsd running on
> > > > host as well?
> > > > 
> > > > Thanks
> > > > Vivek
> > > > 
> > > There is technically nothing preventing us from consuming a single queue on
> > > multiple cores, however our current Virtio implementation (DPU-side) is set
> > > up with the assumption that you should never want to do that (concurrency
> > > mayham around the Virtqueues and the DMAs). So instead of putting all the
> > > work into reworking the implementation to support that and still incur the
> > > big overhead, we see it more fitting to amend the virtio-fs driver with
> > > multi-queue support.
> > > 
> > > 
> > > > Is it just a theory at this point of time or have you implemented
> > > > it and seeing significant performance benefit with multiqueue?
> > > 
> > > It is a theory, but we are currently seeing that using the single request
> > > queue, the single core attending to that queue on the DPU is reasonably
> > > close to being fully saturated.
> > > 
> > > > And will this be helpful for rust virtiofsd running on
> > > > host as well?
> > > 
> > > I figure this would be dependent on the workload and the users-needs.
> > > Having many cores concurrently pulling on their own virtq and then
> > > immediately process the request locally would of course improve performance.
> > > But we are offloading all this work to the DPU, for providing
> > > high-throughput cloud services.
> > 
> > I think Vivek is getting at whether your code processes requests
> > sequentially or in parallel. A single thread processing the virtqueue
> > that hands off requests to worker threads or uses io_uring to perform
> > I/O asynchronously will perform differently from a single thread that
> > processes requests sequentially in a blocking fashion. Multiqueue is not
> > necessary for parallelism, but the single queue might become a
> > bottleneck.
> 
> Requests are handled non-blocking with remote I/O on the DPU. Our current
> architecture is as follows:
> T1: tends to the virtqueue, translates FUSE requests to remote I/O, and
> fires off the asynchronous remote I/O.
> T2: polls for completion of the remote I/O, translates it back to FUSE,
> and puts the FUSE buffers in a completion queue belonging to T1.
> T1: handles the virtio completion and the DMA of the requests in the CQ.
> 
> Thread 1 busy-polls on its two queues (virtqueue and CQ) with equal
> priority; thread 2 busy-polls as well. This setup is not really optimal,
> but we are working within the constraints of both our DPU and our remote
> I/O stack.

Why does T1 need to handle VIRTIO completion and DMA requests instead of
T2?

Stefan


* Re: virtio-fs: adding support for multi-queue
  2023-02-22 14:32               ` Stefan Hajnoczi
@ 2023-03-07 19:43                 ` Peter-Jan Gootzen via Virtualization
  2023-03-07 22:26                   ` Vivek Goyal
  0 siblings, 1 reply; 11+ messages in thread
From: Peter-Jan Gootzen via Virtualization @ 2023-03-07 19:43 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: German Maglione, virtualization, Jonas Pfefferle, Vivek Goyal, miklos

On 22-02-2023 15:32, Stefan Hajnoczi wrote:
> [...]
> 
> Why does T1 need to handle VIRTIO completion and DMA requests instead of
> T2?
> 
> Stefan

No good reason other than the fact that the concurrency safety of our 
DPU's virtio-fs library requires this.

 > I had been doing some performance benchmarking for virtio-fs and I found
 > some old results.
 >
 > https://github.com/rhvgoyal/virtiofs-tests/tree/master/performance-results/feb-10-2021
 >
 > While running on top of a local fs, with bs=4K and a single queue, I
 > could achieve more than 600MB/s.
 >
 > NAME                    WORKLOAD                Bandwidth       IOPS
 > default                 seqread-psync           625.0mb         156.2k
 > no-tpool                seqread-psync           660.8mb         165.2k
 >
 > But the catch here, I think, is that the host is doing the caching. In
 > your case I am assuming there is no caching on the DPU and all the I/O
 > is going to remote storage (which might be doing caching in memory).
 >
 > Anyway, the point I am trying to make is that even with a single vq,
 > virtiofs can push a reasonable amount of I/O.
 >
 > I will be curious to see how much multiqueue can improve these numbers.

We are currently seeing the following throughput numbers:
https://github.com/IBM/dpu-virtio-fs/blob/d0e0560546e2da86b0022a69abe02ab6ac4a6541/experiments/results/graphs/nulldev_tp.pdf
This is using a null-device implementation in the DPU (reads and writes
return immediately in the FUSE file system), with a single vq and one
DPU thread attending to it. On the host, this experiment uses two fio
threads pinned to the DPU's NUMA node. We see no additional throughput
with more than two threads.

 > I am wondering how you concluded that the single vq is the performance
 > bottleneck, and not the remote storage the DPU is sending I/O to.

The single vq is not the sole bottleneck.

Operating within the constraints of our DPU library leaves us two
options:
1. Do thread pooling DPU-side to distribute the requests from the
single queue to our eight cores.
2. Implement multi-queue.
We would like to go for multi-queue, as it is the superior option: it
exploits parallelism both host- and DPU-side (as in the block and net
layers).

* Re: virtio-fs: adding support for multi-queue
  2023-03-07 19:43                 ` Peter-Jan Gootzen via Virtualization
@ 2023-03-07 22:26                   ` Vivek Goyal
  0 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2023-03-07 22:26 UTC (permalink / raw)
  To: Peter-Jan Gootzen
  Cc: German Maglione, virtualization, Jonas Pfefferle,
	Stefan Hajnoczi, miklos

On Tue, Mar 07, 2023 at 08:43:33PM +0100, Peter-Jan Gootzen wrote:
> On 22-02-2023 15:32, Stefan Hajnoczi wrote:
> > [...]
> > 
> > Why does T1 need to handle VIRTIO completion and DMA requests instead of
> > T2?
> > 
> > Stefan
> 
> No good reason other than the fact that the concurrency safety of our DPU's
> virtio-fs library requires this.
> 
> > [...]
> >
> > Anyway, the point I am trying to make is that even with a single vq,
> > virtiofs can push a reasonable amount of I/O.
> >
> > I will be curious to see how much multiqueue can improve these numbers.
> 
> We are currently seeing the following throughput numbers:
> https://github.com/IBM/dpu-virtio-fs/blob/d0e0560546e2da86b0022a69abe02ab6ac4a6541/experiments/results/graphs/nulldev_tp.pdf
> This is using a null-device implementation in the DPU (reads and writes
> return immediately in the FUSE file system), with a single vq and one
> DPU thread attending to it. On the host, this experiment uses two fio
> threads pinned to the DPU's NUMA node. We see no additional throughput
> with more than two threads.

As per this chart, you are getting around 1GB/s at a 4K block size, so
that's roughly 256K IOPS with a single queue. Not too bad, I would say.
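
(Checking the arithmetic, reading 1GB/s as 1 GiB/s: 2^30 B/s divided by
2^12 B per request is 2^18 = 262,144 ≈ 256K requests per second.)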

Would be interesting to see how multiqueue support impacts that number.

Thanks
Vivek

