* Any performance gains from using per thread(thread local) urings?
@ 2020-05-12 20:20 Dmitry Sychov
  2020-05-13  6:07 ` H. de Vries
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-12 20:20 UTC (permalink / raw)
  To: io-uring

Hello,

I'm writing a small web + embedded database application that takes
advantage of the multicore performance of the latest AMD Epyc (up to
128 threads/CPU).

Is there any performance advantage to using per-thread uring setups,
where every thread owns its own unique sq+cq pair?

My feeling is that there are no gains, since internally, in the Linux
kernel, the uring system is represented as a single queue-pickup
thread anyway(?), so sharing one sq+cq pair (through exclusive locks)
across all threads would be enough to achieve maximum throughput.

I want to squeeze the maximum performance out of uring in a
multithreaded client <-> server environment, where the number of
threads is always bounded by the number of CPU cores.

Regards, Dmitry

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-12 20:20 Any performance gains from using per thread(thread local) urings? Dmitry Sychov
@ 2020-05-13  6:07 ` H. de Vries
  2020-05-13 11:01   ` Dmitry Sychov
  0 siblings, 1 reply; 14+ messages in thread
From: H. de Vries @ 2020-05-13  6:07 UTC (permalink / raw)
  To: Dmitry Sychov, io-uring

Hi Dmitry,

If you want max performance, what you will generally see in non-blocking servers is one event loop per core/thread. This means one ring per core/thread. Of course, there is no simple answer to this; compare how thread-based servers work vs. non-blocking servers, e.g. Apache vs. Nginx or Tomcat vs. Netty.

—
Hielke de Vries

On Tue, May 12, 2020, at 22:20, Dmitry Sychov wrote:
> Hello,
> 
> I'am writing a small web + embedded database application taking
> advantage of the multicore performance of the latest AMD Epyc (up to
> 128 threads/CPU).
> 
> Is there any performance advantage of using per thread uring setups?
> Such as every thread will own its unique sq+cq.
> 
> My feeling is there are no gains since internally, in Linux kernel,
> the uring system is represented as a single queue pickup thread
> anyway(?) and sharing a one pair of sq+cq (through exclusive locks)
> via all threads would be enough to achieve maximum throughput.
> 
> I want to squeeze the max performance out of uring in multi threading
> clients <-> server environment, where the max number of threads is
> always bounded by the max number of CPUs cores.
> 
> Regards, Dmitry
>


* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13  6:07 ` H. de Vries
@ 2020-05-13 11:01   ` Dmitry Sychov
  2020-05-13 11:56     ` Mark Papadakis
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 11:01 UTC (permalink / raw)
  To: H. de Vries; +Cc: io-uring

Hi Hielke,

> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> This means one ring per core/thread. Of course there is no simple answer to this.
> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.

I think a lot depends on the internal uring implementation: to what
degree the kernel is able to handle multiple urings independently,
without many congestion points (like updates of the same memory
locations from multiple threads), and thus take advantage of one ring
per CPU core.

For example, if the tasks from multiple rings are later combined into
a single input kernel queue (effectively forming a congestion point),
I see no reason to use an exclusive ring per core in user space.

[BTW, in Windows, IOCP always uses one input+output queue for all (active) threads.]

We could also pop multiple completion events from a single CQ at once
and spread the handling across core-bound threads.

I thought about one uring per core at first, but now I'm not sure -
maybe the kernel devs have something to add to the discussion?

P.S. uring is the main reason I'm switching from Windows to Linux dev
for a client-server app, so I want to extract the maximum performance
possible out of this new exciting uring stuff. :)

Thanks, Dmitry


* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 11:01   ` Dmitry Sychov
@ 2020-05-13 11:56     ` Mark Papadakis
  2020-05-13 13:15       ` Dmitry Sychov
  0 siblings, 1 reply; 14+ messages in thread
From: Mark Papadakis @ 2020-05-13 11:56 UTC (permalink / raw)
  To: Dmitry Sychov; +Cc: H. de Vries, io-uring

For what it’s worth, I am (also) using multiple “reactor” (i.e. event-driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.

Even if scheduling all SQEs through a single io_uring SQ — by e.g. collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all — is very cheap, you’d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for the SQEs that it itself submitted.
You could have a single OS thread just for I/O, and all other threads could do something else, but you’d presumably need to serialize access/share state between them and the one I/O thread, which may be a scalability bottleneck.

( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )

If you experiment with the various possible designs though, I’d love it if you were to share your findings.

—
@markpapapdakis


> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> 
> Hi Hielke,
> 
>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
>> This means one ring per core/thread. Of course there is no simple answer to this.
>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> 
> I think a lot depends on the internal uring implementation. To what
> degree the kernel is able to handle multiple urings independently,
> without much congestion points(like updates of the same memory
> locations from multiple threads), thus taking advantage of one ring
> per CPU core.
> 
> For example, if the tasks from multiple rings are later combined into
> single input kernel queue (effectively forming a congestion point) I
> see
> no reason to use exclusive ring per core in user space.
> 
> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> 
> Also we could pop out multiple completion events from a single CQ at
> once to spread the handling to cores-bound threads .
> 
> I thought about one uring per core at first, but now I'am not sure -
> maybe the kernel devs have something to add to the discussion?
> 
> P.S. uring is the main reason I'am switching from windows to linux dev
> for client-sever app so I want to extract the max performance possible
> out of this new exciting uring stuff. :)
> 
> Thanks, Dmitry



* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 11:56     ` Mark Papadakis
@ 2020-05-13 13:15       ` Dmitry Sychov
  2020-05-13 13:27         ` Mark Papadakis
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 13:15 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: H. de Vries, io-uring

Hey Mark,

Or we could share one SQ and one CQ between multiple threads (bounded
by the number of CPU cores) for direct read/write access, using a
very light mutex to sync.

This also solves the thread starvation issue: thread A submits a job
into the shared SQ while thread B both collects and _processes_ the
result from the shared CQ, instead of waiting on its own unique CQ
for the next completion event.

On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
<markuspapadakis@icloud.com> wrote:
>
> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
>
> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
>
> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
>
> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
>
> —
> @markpapapdakis
>
>
> > On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >
> > Hi Hielke,
> >
> >> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >> This means one ring per core/thread. Of course there is no simple answer to this.
> >> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >
> > I think a lot depends on the internal uring implementation. To what
> > degree the kernel is able to handle multiple urings independently,
> > without much congestion points(like updates of the same memory
> > locations from multiple threads), thus taking advantage of one ring
> > per CPU core.
> >
> > For example, if the tasks from multiple rings are later combined into
> > single input kernel queue (effectively forming a congestion point) I
> > see
> > no reason to use exclusive ring per core in user space.
> >
> > [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> >
> > Also we could pop out multiple completion events from a single CQ at
> > once to spread the handling to cores-bound threads .
> >
> > I thought about one uring per core at first, but now I'am not sure -
> > maybe the kernel devs have something to add to the discussion?
> >
> > P.S. uring is the main reason I'am switching from windows to linux dev
> > for client-sever app so I want to extract the max performance possible
> > out of this new exciting uring stuff. :)
> >
> > Thanks, Dmitry
>


* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 13:15       ` Dmitry Sychov
@ 2020-05-13 13:27         ` Mark Papadakis
  2020-05-13 13:48           ` Dmitry Sychov
                             ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Mark Papadakis @ 2020-05-13 13:27 UTC (permalink / raw)
  To: Dmitry Sychov; +Cc: H. de Vries, io-uring



> On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> 
> Hey Mark,
> 
> Or we could share one SQ and one CQ between multiple threads(bound by
> the max number of CPU cores) for direct read/write access using very
> light mutex to sync.
> 
> This also solves threads starvation issue  - thread A submits the job
> into shared SQ while thread B both collects and _processes_ the result
> from the shared CQ instead of waiting on his own unique CQ for next
> completion event.
> 


Well, if an SQE submitted by A has its matching CQE consumed by B, and A needs access to that CQE because it is tightly coupled to state A owns exclusively (for example), or for other reasons, then you’d still need to move that CQE from B back to A, or share it somehow, which seems expensive-ish.

It depends on what kind of roles your threads have, though; I am personally very much against sharing state between threads unless there is a really good reason for it.






> On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> <markuspapadakis@icloud.com> wrote:
>> 
>> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
>> 
>> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
>> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
>> 
>> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
>> 
>> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
>> 
>> —
>> @markpapapdakis
>> 
>> 
>>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>>> 
>>> Hi Hielke,
>>> 
>>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
>>>> This means one ring per core/thread. Of course there is no simple answer to this.
>>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
>>> 
>>> I think a lot depends on the internal uring implementation. To what
>>> degree the kernel is able to handle multiple urings independently,
>>> without much congestion points(like updates of the same memory
>>> locations from multiple threads), thus taking advantage of one ring
>>> per CPU core.
>>> 
>>> For example, if the tasks from multiple rings are later combined into
>>> single input kernel queue (effectively forming a congestion point) I
>>> see
>>> no reason to use exclusive ring per core in user space.
>>> 
>>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
>>> 
>>> Also we could pop out multiple completion events from a single CQ at
>>> once to spread the handling to cores-bound threads .
>>> 
>>> I thought about one uring per core at first, but now I'am not sure -
>>> maybe the kernel devs have something to add to the discussion?
>>> 
>>> P.S. uring is the main reason I'am switching from windows to linux dev
>>> for client-sever app so I want to extract the max performance possible
>>> out of this new exciting uring stuff. :)
>>> 
>>> Thanks, Dmitry
>> 



* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 13:27         ` Mark Papadakis
@ 2020-05-13 13:48           ` Dmitry Sychov
  2020-05-13 14:12           ` Sergiy Yevtushenko
       [not found]           ` <CAO5MNut+nD-OqsKgae=eibWYuPim1f8-NuwqVpD87eZQnrwscA@mail.gmail.com>
  2 siblings, 0 replies; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 13:48 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: H. de Vries, io-uring

Yep, I want all state to be decoupled from threads - it's more about
moving unique state from one thread (core) to another for processing;
only the SQ+CQ are shared between the core-bound threads.

> I am personally very much against sharing state between threads unless there a really good reason for it.

Yeah, I understand, but for max performance we should start to think
of state as an entity independent of threads - otherwise, what's the
point of using uring for max performance in the first place? We might
as well stick to the very poor Apache model (an unbounded number of
threads with coupled state).

On Wed, May 13, 2020 at 4:27 PM Mark Papadakis
<markuspapadakis@icloud.com> wrote:
>
>
>
> > On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >
> > Hey Mark,
> >
> > Or we could share one SQ and one CQ between multiple threads(bound by
> > the max number of CPU cores) for direct read/write access using very
> > light mutex to sync.
> >
> > This also solves threads starvation issue  - thread A submits the job
> > into shared SQ while thread B both collects and _processes_ the result
> > from the shared CQ instead of waiting on his own unique CQ for next
> > completion event.
> >
>
>
> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
>
> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
>
>
>
>
>
>
> > On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> > <markuspapadakis@icloud.com> wrote:
> >>
> >> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
> >>
> >> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
> >> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
> >>
> >> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
> >>
> >> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
> >>
> >> —
> >> @markpapapdakis
> >>
> >>
> >>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>
> >>> Hi Hielke,
> >>>
> >>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >>>> This means one ring per core/thread. Of course there is no simple answer to this.
> >>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >>>
> >>> I think a lot depends on the internal uring implementation. To what
> >>> degree the kernel is able to handle multiple urings independently,
> >>> without much congestion points(like updates of the same memory
> >>> locations from multiple threads), thus taking advantage of one ring
> >>> per CPU core.
> >>>
> >>> For example, if the tasks from multiple rings are later combined into
> >>> single input kernel queue (effectively forming a congestion point) I
> >>> see
> >>> no reason to use exclusive ring per core in user space.
> >>>
> >>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> >>>
> >>> Also we could pop out multiple completion events from a single CQ at
> >>> once to spread the handling to cores-bound threads .
> >>>
> >>> I thought about one uring per core at first, but now I'am not sure -
> >>> maybe the kernel devs have something to add to the discussion?
> >>>
> >>> P.S. uring is the main reason I'am switching from windows to linux dev
> >>> for client-sever app so I want to extract the max performance possible
> >>> out of this new exciting uring stuff. :)
> >>>
> >>> Thanks, Dmitry
> >>
>


* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 13:27         ` Mark Papadakis
  2020-05-13 13:48           ` Dmitry Sychov
@ 2020-05-13 14:12           ` Sergiy Yevtushenko
       [not found]           ` <CAO5MNut+nD-OqsKgae=eibWYuPim1f8-NuwqVpD87eZQnrwscA@mail.gmail.com>
  2 siblings, 0 replies; 14+ messages in thread
From: Sergiy Yevtushenko @ 2020-05-13 14:12 UTC (permalink / raw)
  To: Mark Papadakis; +Cc: Dmitry Sychov, H. de Vries, io-uring

Completely agree. Sharing state should be avoided as much as possible.
Returning to the original question: I believe that the
uring-per-thread scheme is better regardless of how the queue is
managed inside the kernel.
- If there is only one queue inside the kernel, then it's more
efficient to perform the multiplexing/demultiplexing of requests in
kernel space.
- If there are several queues inside the kernel, then the user-space
code better matches the kernel-space code.
- If the kernel implementation changes from a single queue to
multiple queues, user space is already prepared for the change.


On Wed, May 13, 2020 at 3:30 PM Mark Papadakis
<markuspapadakis@icloud.com> wrote:
>
>
>
> > On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >
> > Hey Mark,
> >
> > Or we could share one SQ and one CQ between multiple threads(bound by
> > the max number of CPU cores) for direct read/write access using very
> > light mutex to sync.
> >
> > This also solves threads starvation issue  - thread A submits the job
> > into shared SQ while thread B both collects and _processes_ the result
> > from the shared CQ instead of waiting on his own unique CQ for next
> > completion event.
> >
>
>
> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
>
> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
>
>
>
>
>
>
> > On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> > <markuspapadakis@icloud.com> wrote:
> >>
> >> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
> >>
> >> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
> >> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
> >>
> >> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
> >>
> >> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
> >>
> >> —
> >> @markpapapdakis
> >>
> >>
> >>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>
> >>> Hi Hielke,
> >>>
> >>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >>>> This means one ring per core/thread. Of course there is no simple answer to this.
> >>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >>>
> >>> I think a lot depends on the internal uring implementation. To what
> >>> degree the kernel is able to handle multiple urings independently,
> >>> without much congestion points(like updates of the same memory
> >>> locations from multiple threads), thus taking advantage of one ring
> >>> per CPU core.
> >>>
> >>> For example, if the tasks from multiple rings are later combined into
> >>> single input kernel queue (effectively forming a congestion point) I
> >>> see
> >>> no reason to use exclusive ring per core in user space.
> >>>
> >>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> >>>
> >>> Also we could pop out multiple completion events from a single CQ at
> >>> once to spread the handling to cores-bound threads .
> >>>
> >>> I thought about one uring per core at first, but now I'am not sure -
> >>> maybe the kernel devs have something to add to the discussion?
> >>>
> >>> P.S. uring is the main reason I'am switching from windows to linux dev
> >>> for client-sever app so I want to extract the max performance possible
> >>> out of this new exciting uring stuff. :)
> >>>
> >>> Thanks, Dmitry
> >>
>


* Re: Any performance gains from using per thread(thread local) urings?
       [not found]           ` <CAO5MNut+nD-OqsKgae=eibWYuPim1f8-NuwqVpD87eZQnrwscA@mail.gmail.com>
@ 2020-05-13 14:22             ` Dmitry Sychov
  2020-05-13 14:31               ` Dmitry Sychov
  2020-05-13 16:02               ` Pavel Begunkov
  0 siblings, 2 replies; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 14:22 UTC (permalink / raw)
  To: Sergiy Yevtushenko; +Cc: Mark Papadakis, H. de Vries, io-uring

Could anyone shed some light on the inner implementation of uring, please? :)

Specifically, how well does the kernel scale with an increased number
of user-created urings?

> If kernel implementation will change from single to multiple queues,
> user space is already prepared for this change.

That's +1 for per-thread urings, with the expectation that the kernel
will get better and better at scaling multiple urings in the future.

On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
<sergiy.yevtushenko@gmail.com> wrote:
>
> Completely agree. Sharing state should be avoided as much as possible.
> Returning to original question: I believe that uring-per-thread scheme is better regardless from how queue is managed inside the kernel.
> - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing requests in kernel space
> - If there are several queues inside the kernel, then user space code better matches kernel-space code.
> - If kernel implementation will change from single to multiple queues, user space is already prepared for this change.
>
>
> On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
>>
>>
>>
>> > On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>> >
>> > Hey Mark,
>> >
>> > Or we could share one SQ and one CQ between multiple threads(bound by
>> > the max number of CPU cores) for direct read/write access using very
>> > light mutex to sync.
>> >
>> > This also solves threads starvation issue  - thread A submits the job
>> > into shared SQ while thread B both collects and _processes_ the result
>> > from the shared CQ instead of waiting on his own unique CQ for next
>> > completion event.
>> >
>>
>>
>> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
>>
>> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
>>
>>
>>
>>
>>
>>
>> > On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
>> > <markuspapadakis@icloud.com> wrote:
>> >>
>> >> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
>> >>
>> >> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
>> >> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
>> >>
>> >> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
>> >>
>> >> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
>> >>
>> >> —
>> >> @markpapapdakis
>> >>
>> >>
>> >>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>> >>>
>> >>> Hi Hielke,
>> >>>
>> >>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
>> >>>> This means one ring per core/thread. Of course there is no simple answer to this.
>> >>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
>> >>>
>> >>> I think a lot depends on the internal uring implementation. To what
>> >>> degree the kernel is able to handle multiple urings independently,
>> >>> without much congestion points(like updates of the same memory
>> >>> locations from multiple threads), thus taking advantage of one ring
>> >>> per CPU core.
>> >>>
>> >>> For example, if the tasks from multiple rings are later combined into
>> >>> single input kernel queue (effectively forming a congestion point) I
>> >>> see
>> >>> no reason to use exclusive ring per core in user space.
>> >>>
>> >>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
>> >>>
>> >>> Also we could pop out multiple completion events from a single CQ at
>> >>> once to spread the handling to cores-bound threads .
>> >>>
>> >>> I thought about one uring per core at first, but now I'am not sure -
>> >>> maybe the kernel devs have something to add to the discussion?
>> >>>
>> >>> P.S. uring is the main reason I'am switching from windows to linux dev
>> >>> for client-sever app so I want to extract the max performance possible
>> >>> out of this new exciting uring stuff. :)
>> >>>
>> >>> Thanks, Dmitry
>> >>
>>


* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 14:22             ` Dmitry Sychov
@ 2020-05-13 14:31               ` Dmitry Sychov
  2020-05-13 16:02               ` Pavel Begunkov
  1 sibling, 0 replies; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 14:31 UTC (permalink / raw)
  To: Sergiy Yevtushenko; +Cc: Mark Papadakis, H. de Vries, io-uring

> Sharing state should be avoided as much as possible.

It's more about freely moving state between threads (e.g. using
io_uring_cqe::user_data), not sharing it...

On Wed, May 13, 2020 at 5:22 PM Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>
> Anyone could shed some light on the inner implementation of uring please? :)
>
> Specifically how well kernel scales with the increased number of user
> created urings?
>
> > If kernel implementation will change from single to multiple queues,
> > user space is already prepared for this change.
>
> Thats +1 for per-thread urings. An expectation for the kernel to
> become better and better in multiple urings scaling in the future.
>
> On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
> <sergiy.yevtushenko@gmail.com> wrote:
> >
> > Completely agree. Sharing state should be avoided as much as possible.
> > Returning to original question: I believe that uring-per-thread scheme is better regardless from how queue is managed inside the kernel.
> > - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing requests in kernel space
> > - If there are several queues inside the kernel, then user space code better matches kernel-space code.
> > - If kernel implementation will change from single to multiple queues, user space is already prepared for this change.
> >
> >
> > On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
> >>
> >>
> >>
> >> > On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >> >
> >> > Hey Mark,
> >> >
> >> > Or we could share one SQ and one CQ between multiple threads(bound by
> >> > the max number of CPU cores) for direct read/write access using very
> >> > light mutex to sync.
> >> >
> >> > This also solves threads starvation issue  - thread A submits the job
> >> > into shared SQ while thread B both collects and _processes_ the result
> >> > from the shared CQ instead of waiting on his own unique CQ for next
> >> > completion event.
> >> >
> >>
> >>
> >> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
> >>
> >> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
> >>
> >>
> >>
> >>
> >>
> >>
> >> > On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> >> > <markuspapadakis@icloud.com> wrote:
> >> >>
> >> >> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
> >> >>
> >> >> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
> >> >> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
> >> >>
> >> >> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
> >> >>
> >> >> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
> >> >>
> >> >> —
> >> >> @markpapapdakis
> >> >>
> >> >>
> >> >>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >> >>>
> >> >>> Hi Hielke,
> >> >>>
> >> >>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >> >>>> This means one ring per core/thread. Of course there is no simple answer to this.
> >> >>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >> >>>
> >> >>> I think a lot depends on the internal uring implementation. To what
> >> >>> degree the kernel is able to handle multiple urings independently,
> >> >>> without much congestion points(like updates of the same memory
> >> >>> locations from multiple threads), thus taking advantage of one ring
> >> >>> per CPU core.
> >> >>>
> >> >>> For example, if the tasks from multiple rings are later combined into
> >> >>> single input kernel queue (effectively forming a congestion point) I
> >> >>> see
> >> >>> no reason to use exclusive ring per core in user space.
> >> >>>
> >> >>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> >> >>>
> >> >>> Also we could pop out multiple completion events from a single CQ at
> >> >>> once to spread the handling to cores-bound threads .
> >> >>>
> >> >>> I thought about one uring per core at first, but now I'am not sure -
> >> >>> maybe the kernel devs have something to add to the discussion?
> >> >>>
> >> >>> P.S. uring is the main reason I'am switching from windows to linux dev
> >> >>> for client-sever app so I want to extract the max performance possible
> >> >>> out of this new exciting uring stuff. :)
> >> >>>
> >> >>> Thanks, Dmitry
> >> >>
> >>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 14:22             ` Dmitry Sychov
  2020-05-13 14:31               ` Dmitry Sychov
@ 2020-05-13 16:02               ` Pavel Begunkov
  2020-05-13 19:23                 ` Dmitry Sychov
  1 sibling, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-05-13 16:02 UTC (permalink / raw)
  To: Dmitry Sychov, Sergiy Yevtushenko; +Cc: Mark Papadakis, H. de Vries, io-uring

On 13/05/2020 17:22, Dmitry Sychov wrote:
> Anyone could shed some light on the inner implementation of uring please? :)

It really depends on the workload, hardware, etc.

io_uring instances are intended to be independent, and each have one CQ and SQ.
The main user's concern should be synchronisation (in userspace) on CQ+SQ. E.g.
100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
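For concreteness, the contended shared-SQ pattern under discussion looks roughly like this (a sketch assuming liburing and pthreads; not from the mail):

```c
/* Sketch of the shared-SQ pattern: N threads submit through one ring
 * behind a mutex. Assumes liburing; error paths mostly omitted. */
#include <liburing.h>
#include <pthread.h>

struct shared_ring {
    struct io_uring ring;
    pthread_mutex_t lock;
};

/* Every submitting thread funnels through here; with enough cores the
 * mutex itself becomes the bottleneck described above. */
static int shared_submit_read(struct shared_ring *sr, int fd,
                              void *buf, unsigned len)
{
    int ret;

    pthread_mutex_lock(&sr->lock);
    struct io_uring_sqe *sqe = io_uring_get_sqe(&sr->ring);
    if (!sqe) {
        pthread_mutex_unlock(&sr->lock);
        return -1;                      /* SQ full */
    }
    io_uring_prep_read(sqe, fd, buf, len, 0);
    ret = io_uring_submit(&sr->ring);
    pthread_mutex_unlock(&sr->lock);
    return ret;
}
```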

Everything that can't be inline completed/submitted during io_uring_enter() will
be offloaded to an internal thread pool (aka io-wq), which is per io_uring by
default, but can be shared if specified. There are pros and cons, but I'd
recommend first to share a single io-wq, and then experiment and tune.
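Sharing one io-wq across rings is done with IORING_SETUP_ATTACH_WQ; a hedged sketch assuming liburing's io_uring_queue_init_params (the flag requires a kernel that supports it, roughly 5.6+):

```c
/* Sketch: N per-thread rings all attached to the first ring's io-wq.
 * Assumes liburing; error handling abbreviated. */
#include <liburing.h>
#include <string.h>

static int setup_rings(struct io_uring *rings, int nr)
{
    struct io_uring_params p;

    memset(&p, 0, sizeof(p));
    if (io_uring_queue_init_params(256, &rings[0], &p) < 0)
        return -1;

    for (int i = 1; i < nr; i++) {
        struct io_uring_params pi;

        memset(&pi, 0, sizeof(pi));
        pi.flags = IORING_SETUP_ATTACH_WQ;
        pi.wq_fd = rings[0].ring_fd;   /* reuse the first ring's io-wq */
        if (io_uring_queue_init_params(256, &rings[i], &pi) < 0)
            return -1;
    }
    return 0;
}
```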

Also, in-kernel submission is not instantaneous and is done by only one thread at any
moment. A single io_uring may bottleneck you there or add high latency in some cases.

And there are a lot of details, probably worth a separate write-up.

> 
> Specifically how well kernel scales with the increased number of user
> created urings?

Should scale well, especially for rw. Just don't overwhelm the kernel with
threads from dozens of io-wqs.

> 
>> If kernel implementation will change from single to multiple queues,
>> user space is already prepared for this change.
> 
> Thats +1 for per-thread urings. An expectation for the kernel to
> become better and better in multiple urings scaling in the future.
> 
> On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
> <sergiy.yevtushenko@gmail.com> wrote:
>>
>> Completely agree. Sharing state should be avoided as much as possible.
>> Returning to original question: I believe that uring-per-thread scheme is better regardless from how queue is managed inside the kernel.
>> - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing requests in kernel space
>> - If there are several queues inside the kernel, then user space code better matches kernel-space code.
>> - If kernel implementation will change from single to multiple queues, user space is already prepared for this change.
>>
>>
>> On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
>>>
>>>
>>>
>>>> On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>>>>
>>>> Hey Mark,
>>>>
>>>> Or we could share one SQ and one CQ between multiple threads(bound by
>>>> the max number of CPU cores) for direct read/write access using very
>>>> light mutex to sync.
>>>>
>>>> This also solves threads starvation issue  - thread A submits the job
>>>> into shared SQ while thread B both collects and _processes_ the result
>>>> from the shared CQ instead of waiting on his own unique CQ for next
>>>> completion event.
>>>>
>>>
>>>
>>> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
>>>
>>> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
>>>
>>>
>>>
>>>
>>>
>>>
>>>> On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
>>>> <markuspapadakis@icloud.com> wrote:
>>>>>
>>>>> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
>>>>>
>>>>> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
>>>>> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
>>>>>
>>>>> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
>>>>>
>>>>> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
>>>>>
>>>>> —
>>>>> @markpapapdakis
>>>>>
>>>>>
>>>>>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>>>>>>
>>>>>> Hi Hielke,
>>>>>>
>>>>>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
>>>>>>> This means one ring per core/thread. Of course there is no simple answer to this.
>>>>>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
>>>>>>
>>>>>> I think a lot depends on the internal uring implementation. To what
>>>>>> degree the kernel is able to handle multiple urings independently,
>>>>>> without much congestion points(like updates of the same memory
>>>>>> locations from multiple threads), thus taking advantage of one ring
>>>>>> per CPU core.
>>>>>>
>>>>>> For example, if the tasks from multiple rings are later combined into
>>>>>> single input kernel queue (effectively forming a congestion point) I
>>>>>> see
>>>>>> no reason to use exclusive ring per core in user space.
>>>>>>
>>>>>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
>>>>>>
>>>>>> Also we could pop out multiple completion events from a single CQ at
>>>>>> once to spread the handling to cores-bound threads .
>>>>>>
>>>>>> I thought about one uring per core at first, but now I'am not sure -
>>>>>> maybe the kernel devs have something to add to the discussion?
>>>>>>
>>>>>> P.S. uring is the main reason I'am switching from windows to linux dev
>>>>>> for client-sever app so I want to extract the max performance possible
>>>>>> out of this new exciting uring stuff. :)
>>>>>>
>>>>>> Thanks, Dmitry
>>>>>
>>>

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 16:02               ` Pavel Begunkov
@ 2020-05-13 19:23                 ` Dmitry Sychov
  2020-05-14 10:06                   ` Pavel Begunkov
  0 siblings, 1 reply; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-13 19:23 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Sergiy Yevtushenko, Mark Papadakis, H. de Vries, io-uring

> E.g. 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.

It's possible to mitigate the hammering by using a proxy buffer - instead
of spinning, the particular thread could add the next entry into the
buffer through XADD instead, and another thread currently holding an
exclusive lock could in turn check this buffer and batch-submit all
pending entries to the SQ before releasing the SQ mutex.
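One self-contained way to approximate this staging idea (an illustrative sketch, not from the thread) is a lock-free list: producers stage entries with a single atomic RMW each, and whichever thread holds the SQ lock grabs the whole pending batch with one atomic exchange before submitting:

```c
/* Sketch: lock-free staging of pending submissions. Producers push
 * without touching the SQ mutex; the lock holder drains the whole
 * batch at once. C11 atomics; entries come out in LIFO order. */
#include <stdatomic.h>
#include <stddef.h>

struct staged {
    void *entry;                 /* e.g. a prepared request description */
    struct staged *next;
};

struct stage_list {
    _Atomic(struct staged *) head;
};

/* Producer side: one atomic RMW per push instead of a mutex acquire. */
void stage_push(struct stage_list *s, struct staged *n)
{
    n->next = atomic_load_explicit(&s->head, memory_order_relaxed);
    while (!atomic_compare_exchange_weak_explicit(
            &s->head, &n->next, n,
            memory_order_release, memory_order_relaxed))
        ;                        /* on failure, n->next is reloaded for us */
}

/* Lock-holder side: claim the entire pending batch in one shot, then
 * walk it and feed each entry into the SQ before releasing the mutex. */
struct staged *stage_drain(struct stage_list *s)
{
    return atomic_exchange_explicit(&s->head, NULL, memory_order_acquire);
}
```

The batch comes out newest-first, so the drainer should reverse the list if submission order matters.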

> will be offloaded to an internal thread pool (aka io-wq), which is per io_uring by default, but can be shared if specified.

Well, that sounds like mumbo jumbo to me. Does this mean that the
kernel holds an internal pool of threads to perform uring tasks
independent of the number of user urings?

If there are multiple kernel work flows bound to corresponding uring
setups, a thread starvation issue could exist if they do not
actively steal from each other's SQs.

And the costs of starvation could be greater than those of allowing
multiple threads to dig into one uring queue, even under an exclusive lock.

> And there a lot of details, probably worth of a separate write-up.

I've reread io_uring.pdf and there is not much technical detail on the
inner implementation of uring, so it's hard to apply best practices and
avoid noob questions like mine.



On Wed, May 13, 2020 at 7:03 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 13/05/2020 17:22, Dmitry Sychov wrote:
> > Anyone could shed some light on the inner implementation of uring please? :)
>
> It really depends on the workload, hardware, etc.
>
> io_uring instances are intended to be independent, and each have one CQ and SQ.
> The main user's concern should be synchronisation (in userspace) on CQ+SQ. E.g.
> 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
>
> Everything that can't be inline completed\submitted during io_urng_enter(), will
> be offloaded to an internal thread pool (aka io-wq), which is per io_uring by
> default, but can be shared if specified. There are pros and cons, but I'd
> recommend first to share a single io-wq, and then experiment and tune.
>
> Also, in-kernel submission is not instantaneous and done by only thread at any
> moment. Single io_uring may bottleneck you there or add high latency in some cases.
>
> And there a lot of details, probably worth of a separate write-up.
>
> >
> > Specifically how well kernel scales with the increased number of user
> > created urings?
>
> Should scale well, especially for rw. Just don't overthrow the kernel with
> threads from dozens of io-wqs.
>
> >
> >> If kernel implementation will change from single to multiple queues,
> >> user space is already prepared for this change.
> >
> > Thats +1 for per-thread urings. An expectation for the kernel to
> > become better and better in multiple urings scaling in the future.
> >
> > On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
> > <sergiy.yevtushenko@gmail.com> wrote:
> >>
> >> Completely agree. Sharing state should be avoided as much as possible.
> >> Returning to original question: I believe that uring-per-thread scheme is better regardless from how queue is managed inside the kernel.
> >> - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing requests in kernel space
> >> - If there are several queues inside the kernel, then user space code better matches kernel-space code.
> >> - If kernel implementation will change from single to multiple queues, user space is already prepared for this change.
> >>
> >>
> >> On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
> >>>
> >>>
> >>>
> >>>> On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>>
> >>>> Hey Mark,
> >>>>
> >>>> Or we could share one SQ and one CQ between multiple threads(bound by
> >>>> the max number of CPU cores) for direct read/write access using very
> >>>> light mutex to sync.
> >>>>
> >>>> This also solves threads starvation issue  - thread A submits the job
> >>>> into shared SQ while thread B both collects and _processes_ the result
> >>>> from the shared CQ instead of waiting on his own unique CQ for next
> >>>> completion event.
> >>>>
> >>>
> >>>
> >>> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
> >>>
> >>> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>> On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> >>>> <markuspapadakis@icloud.com> wrote:
> >>>>>
> >>>>> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
> >>>>>
> >>>>> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
> >>>>> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
> >>>>>
> >>>>> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
> >>>>>
> >>>>> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
> >>>>>
> >>>>> —
> >>>>> @markpapapdakis
> >>>>>
> >>>>>
> >>>>>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi Hielke,
> >>>>>>
> >>>>>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >>>>>>> This means one ring per core/thread. Of course there is no simple answer to this.
> >>>>>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >>>>>>
> >>>>>> I think a lot depends on the internal uring implementation. To what
> >>>>>> degree the kernel is able to handle multiple urings independently,
> >>>>>> without much congestion points(like updates of the same memory
> >>>>>> locations from multiple threads), thus taking advantage of one ring
> >>>>>> per CPU core.
> >>>>>>
> >>>>>> For example, if the tasks from multiple rings are later combined into
> >>>>>> single input kernel queue (effectively forming a congestion point) I
> >>>>>> see
> >>>>>> no reason to use exclusive ring per core in user space.
> >>>>>>
> >>>>>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
> >>>>>>
> >>>>>> Also we could pop out multiple completion events from a single CQ at
> >>>>>> once to spread the handling to cores-bound threads .
> >>>>>>
> >>>>>> I thought about one uring per core at first, but now I'am not sure -
> >>>>>> maybe the kernel devs have something to add to the discussion?
> >>>>>>
> >>>>>> P.S. uring is the main reason I'am switching from windows to linux dev
> >>>>>> for client-sever app so I want to extract the max performance possible
> >>>>>> out of this new exciting uring stuff. :)
> >>>>>>
> >>>>>> Thanks, Dmitry
> >>>>>
> >>>
>
> --
> Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-13 19:23                 ` Dmitry Sychov
@ 2020-05-14 10:06                   ` Pavel Begunkov
  2020-05-14 11:35                     ` Dmitry Sychov
  0 siblings, 1 reply; 14+ messages in thread
From: Pavel Begunkov @ 2020-05-14 10:06 UTC (permalink / raw)
  To: Dmitry Sychov; +Cc: Sergiy Yevtushenko, Mark Papadakis, H. de Vries, io-uring

On 13/05/2020 22:23, Dmitry Sychov wrote:
>> E.g. 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
> 
> Its possible to mitigate the hammering by using proxy buffer - instead
> of spinning, the particular thread
> could add the next entry into the buffer through XADD instead, and
> another thread currently holding an exclusive
> lock could in turn check this buffer and batch-submit all pending
> entries to SQ before leasing SQ mutex.

Sure, there are many ways, but I think my point is clear.
FWIW, atomics/wait-free will fail to scale well enough after some point.

>> will be offloaded to an internal thread pool (aka io-wq), which is per io_uring by default, but can be shared if specified.
> 
> Well, thats sounds like mumbo jumbo to me, does this mean that the
> kernel holds and internal pool of threads to
> perform uring tasks independent to the number of user urings?

If I parsed the question correctly: again, it creates a separate thread pool for
each new io_uring, unless specified otherwise.

> 
> If there are multiple kernel work flows bound to corresponding uring
> setups the issue with threads starvation could exist if they do not
> actively steal from each other SQs.
The threads can go to sleep or be dynamically created/destroyed.

Not sure what kind of starvation you meant, but feel free to rephrase your
questions if any of them weren't understood well.

> And starvation costs could be greater than allowing for multiple
> threads to dig into one uring queue, even under the exclusive lock.
Thread pools can be shared.

> 
>> And there a lot of details, probably worth of a separate write-up.
> 
> I've reread io_uring.pdf and there are not much tech details on the
> inner implementation of uring to try to apply best practices and to
> avoid noob questions like mine.
> 
> 
> 
> On Wed, May 13, 2020 at 7:03 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>>
>> On 13/05/2020 17:22, Dmitry Sychov wrote:
>>> Anyone could shed some light on the inner implementation of uring please? :)
>>
>> It really depends on the workload, hardware, etc.
>>
>> io_uring instances are intended to be independent, and each have one CQ and SQ.
>> The main user's concern should be synchronisation (in userspace) on CQ+SQ. E.g.
>> 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
>>
>> Everything that can't be inline completed\submitted during io_urng_enter(), will
>> be offloaded to an internal thread pool (aka io-wq), which is per io_uring by
>> default, but can be shared if specified. There are pros and cons, but I'd
>> recommend first to share a single io-wq, and then experiment and tune.
>>
>> Also, in-kernel submission is not instantaneous and done by only thread at any
>> moment. Single io_uring may bottleneck you there or add high latency in some cases.
>>
>> And there a lot of details, probably worth of a separate write-up.
>>
>>>
>>> Specifically how well kernel scales with the increased number of user
>>> created urings?
>>
>> Should scale well, especially for rw. Just don't overthrow the kernel with
>> threads from dozens of io-wqs.
>>
>>>
>>>> If kernel implementation will change from single to multiple queues,
>>>> user space is already prepared for this change.
>>>
>>> Thats +1 for per-thread urings. An expectation for the kernel to
>>> become better and better in multiple urings scaling in the future.
>>>
>>> On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
>>> <sergiy.yevtushenko@gmail.com> wrote:
>>>>
>>>> Completely agree. Sharing state should be avoided as much as possible.
>>>> Returning to original question: I believe that uring-per-thread scheme is better regardless from how queue is managed inside the kernel.
>>>> - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing requests in kernel space
>>>> - If there are several queues inside the kernel, then user space code better matches kernel-space code.
>>>> - If kernel implementation will change from single to multiple queues, user space is already prepared for this change.
>>>>
>>>>
>>>> On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>>>>>>
>>>>>> Hey Mark,
>>>>>>
>>>>>> Or we could share one SQ and one CQ between multiple threads(bound by
>>>>>> the max number of CPU cores) for direct read/write access using very
>>>>>> light mutex to sync.
>>>>>>
>>>>>> This also solves threads starvation issue  - thread A submits the job
>>>>>> into shared SQ while thread B both collects and _processes_ the result
>>>>>> from the shared CQ instead of waiting on his own unique CQ for next
>>>>>> completion event.
>>>>>>
>>>>>
>>>>>
>>>>> Well, if the SQ submitted by A and its matching CQ is consumed by B, and A will need access to that CQ because it is tightly coupled to state it owns exclusively(for example), or other reasons, then you’d still need to move that CQ from B back to A, or share it somehow, which seems expensive-is.
>>>>>
>>>>> It depends on what kind of roles your threads have though; I am personally very much against sharing state between threads unless there a really good reason for it.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>> On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
>>>>>> <markuspapadakis@icloud.com> wrote:
>>>>>>>
>>>>>>> For what it’s worth, I am (also) using using multiple “reactor” (i.e event driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
>>>>>>>
>>>>>>> Even if scheduling all SQEs through a single io_uring SQ — by e.g collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all -- is very cheap, you ‘d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that itself submitted.
>>>>>>> You could have a single OS thread just for I/O and all other threads could do something else but you’d presumably need to serialize access/share state between them and the one OS thread for I/O which maybe a scalability bottleneck.
>>>>>>>
>>>>>>> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
>>>>>>>
>>>>>>> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
>>>>>>>
>>>>>>> —
>>>>>>> @markpapapdakis
>>>>>>>
>>>>>>>
>>>>>>>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Hielke,
>>>>>>>>
>>>>>>>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
>>>>>>>>> This means one ring per core/thread. Of course there is no simple answer to this.
>>>>>>>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
>>>>>>>>
>>>>>>>> I think a lot depends on the internal uring implementation. To what
>>>>>>>> degree the kernel is able to handle multiple urings independently,
>>>>>>>> without much congestion points(like updates of the same memory
>>>>>>>> locations from multiple threads), thus taking advantage of one ring
>>>>>>>> per CPU core.
>>>>>>>>
>>>>>>>> For example, if the tasks from multiple rings are later combined into
>>>>>>>> single input kernel queue (effectively forming a congestion point) I
>>>>>>>> see
>>>>>>>> no reason to use exclusive ring per core in user space.
>>>>>>>>
>>>>>>>> [BTW in Windows IOCP is always one input+output queue for all(active) threads].
>>>>>>>>
>>>>>>>> Also we could pop out multiple completion events from a single CQ at
>>>>>>>> once to spread the handling to cores-bound threads .
>>>>>>>>
>>>>>>>> I thought about one uring per core at first, but now I'am not sure -
>>>>>>>> maybe the kernel devs have something to add to the discussion?
>>>>>>>>
>>>>>>>> P.S. uring is the main reason I'am switching from windows to linux dev
>>>>>>>> for client-sever app so I want to extract the max performance possible
>>>>>>>> out of this new exciting uring stuff. :)
>>>>>>>>
>>>>>>>> Thanks, Dmitry
>>>>>>>
>>>>>
>>
>> --
>> Pavel Begunkov

-- 
Pavel Begunkov

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: Any performance gains from using per thread(thread local) urings?
  2020-05-14 10:06                   ` Pavel Begunkov
@ 2020-05-14 11:35                     ` Dmitry Sychov
  0 siblings, 0 replies; 14+ messages in thread
From: Dmitry Sychov @ 2020-05-14 11:35 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: Sergiy Yevtushenko, Mark Papadakis, H. de Vries, io-uring

> If I parsed the question correctly, again, it creates a separate
> thread pool per each new io_uring, if wasn't specified otherwise.

Aha, got it - finally found the IORING_SETUP_ATTACH_WQ flag description :)

What's the default number of threads in a pool? Is it fixed, or does it depend
on the number of system CPU cores?

> Not sure what kind of starvation you meant, but feel free to rephrase your
> questions if any of them weren't understood well.

A starvation problem arises when, for example, one uring becomes
overloaded with pending tasks while another is already empty,
and the only way to mitigate the stall is to have all worker flows
check all the other urings (steal jobs from each other)... or to use one
shared uring in the first place.

With an increasing number of urings, the cost of checking the other queues
grows, suppressing scaling.

The same thing happens on the consumer side. If I'm using states decoupled
from threads and submit (move) them to random urings from a
uring pool, I have
to check all the other CQs for completed work whenever the CQ currently
associated with my CPU thread is empty.
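At least the per-ring check can be made cheap by reaping completions in batches; a sketch assuming liburing's batched peek helper (illustrative, not from the thread):

```c
/* Sketch: drain up to a batch of CQEs from one ring in a single pass,
 * so a worker scanning several rings pays one check per ring, not one
 * per completion. Assumes liburing. */
#include <liburing.h>

static unsigned drain_cq(struct io_uring *ring)
{
    struct io_uring_cqe *cqes[64];
    unsigned n = io_uring_peek_batch_cqe(ring, cqes, 64);

    for (unsigned i = 0; i < n; i++) {
        void *state = io_uring_cqe_get_data(cqes[i]);
        /* ...dispatch state to whichever thread should process it... */
        (void)state;
    }
    io_uring_cq_advance(ring, n);      /* mark all n CQEs as seen */
    return n;                          /* 0 means: move on to the next ring */
}
```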

> FWIW, atomics/wait-free will fail to scale good enough after some point.

Yep... in general, memory updates shared between multiple cores
suppress scalability, down to zero gains after roughly the first hundred
CPU threads.

There was a paper somewhere showing that even one shared counter
between 100+ threads already blocks further scaling completely,
and that wait-free containers are almost the same source of memory
bouncing as write-shared data...

On Thu, May 14, 2020 at 1:08 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
>
> On 13/05/2020 22:23, Dmitry Sychov wrote:
> >> E.g. 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
> >
> > Its possible to mitigate the hammering by using proxy buffer - instead
> > of spinning, the particular thread
> > could add the next entry into the buffer through XADD instead, and
> > another thread currently holding an exclusive
> > lock could in turn check this buffer and batch-submit all pending
> > entries to SQ before leasing SQ mutex.
>
> Sure there are many ways, but I think my point is clear.
> FWIW, atomics/wait-free will fail to scale good enough after some point.
>
> >> will be offloaded to an internal thread pool (aka io-wq), which is per io_uring by default, but can be shared if specified.
> >
> > Well, that sounds like mumbo jumbo to me - does this mean that the
> > kernel holds an internal pool of threads to
> > perform uring tasks, independent of the number of user urings?
>
> If I parsed the question correctly: again, it creates a separate thread pool for
> each new io_uring, unless specified otherwise.
>
> >
> > If there are multiple kernel work flows bound to corresponding uring
> > setups, a thread starvation issue could exist if they do not
> > actively steal from each other's SQs.
> The threads can go to sleep or be dynamically created/destroyed.
>
> Not sure what kind of starvation you meant, but feel free to rephrase your
> questions if any of them weren't understood well.
>
> > And the starvation costs could be greater than the cost of allowing
> > multiple threads to dig into one uring queue, even under an exclusive lock.
> Thread pools can be shared.
>
> >
> >> And there are a lot of details, probably worth a separate write-up.
> >
> > I've reread io_uring.pdf and there is not much technical detail on the
> > inner implementation of uring to help apply best practices and
> > avoid noob questions like mine.
> >
> >
> >
> > On Wed, May 13, 2020 at 7:03 PM Pavel Begunkov <asml.silence@gmail.com> wrote:
> >>
> >> On 13/05/2020 17:22, Dmitry Sychov wrote:
> >>> Could anyone shed some light on the inner implementation of uring, please? :)
> >>
> >> It really depends on the workload, hardware, etc.
> >>
> >> io_uring instances are intended to be independent, and each have one CQ and SQ.
> >> The main user's concern should be synchronisation (in userspace) on CQ+SQ. E.g.
> >> 100+ cores hammering on a spinlock/mutex protecting an SQ wouldn't do any good.
> >>
> >> Everything that can't be inline completed/submitted during io_uring_enter() will
> >> be offloaded to an internal thread pool (aka io-wq), which is per io_uring by
> >> default, but can be shared if specified. There are pros and cons, but I'd
> >> recommend first sharing a single io-wq, and then experimenting and tuning.
> >>
> >> Also, in-kernel submission is not instantaneous and is done by only one thread
> >> at any moment. A single io_uring may bottleneck you there or add high latency in some cases.
> >>
> >> And there are a lot of details, probably worth a separate write-up.
> >>
> >>>
> >>> Specifically, how well does the kernel scale with an increased number of
> >>> user-created urings?
> >>
> >> Should scale well, especially for rw. Just don't overwhelm the kernel with
> >> threads from dozens of io-wqs.
> >>
> >>>
> >>>> If the kernel implementation changes from a single queue to multiple queues,
> >>>> user space is already prepared for this change.
> >>>
> >>> That's +1 for per-thread urings - an expectation that the kernel will
> >>> become better and better at scaling multiple urings in the future.
> >>>
> >>> On Wed, May 13, 2020 at 4:52 PM Sergiy Yevtushenko
> >>> <sergiy.yevtushenko@gmail.com> wrote:
> >>>>
> >>>> Completely agree. Sharing state should be avoided as much as possible.
> >>>> Returning to the original question: I believe that a uring-per-thread scheme is better regardless of how the queue is managed inside the kernel.
> >>>> - If there is only one queue inside the kernel, then it's more efficient to perform multiplexing/demultiplexing of requests in kernel space.
> >>>> - If there are several queues inside the kernel, then user-space code better matches kernel-space code.
> >>>> - If the kernel implementation changes from a single queue to multiple queues, user space is already prepared for this change.
> >>>>
> >>>>
> >>>> On Wed, May 13, 2020 at 3:30 PM Mark Papadakis <markuspapadakis@icloud.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 13 May 2020, at 4:15 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>>>>
> >>>>>> Hey Mark,
> >>>>>>
> >>>>>> Or we could share one SQ and one CQ between multiple threads (bound by
> >>>>>> the max number of CPU cores) for direct read/write access, using a very
> >>>>>> light mutex to sync.
> >>>>>>
> >>>>>> This also solves the thread starvation issue - thread A submits a job
> >>>>>> into the shared SQ while thread B both collects and _processes_ the
> >>>>>> result from the shared CQ, instead of waiting on its own unique CQ
> >>>>>> for the next completion event.
> >>>>>>
> >>>>>
> >>>>>
> >>>>> Well, if the SQE submitted by A has its matching CQE consumed by B, and A will need access to that CQE because it is tightly coupled to state A owns exclusively (for example), or for other reasons, then you’d still need to move that CQE from B back to A, or share it somehow, which seems expensive-ish.
> >>>>>
> >>>>> It depends on what kind of roles your threads have, though; I am personally very much against sharing state between threads unless there is a really good reason for it.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Wed, May 13, 2020 at 2:56 PM Mark Papadakis
> >>>>>> <markuspapadakis@icloud.com> wrote:
> >>>>>>>
> >>>>>>> For what it’s worth, I am (also) using multiple “reactor” (i.e. event-driven) cores, each associated with one OS thread, and each reactor core manages its own io_uring context/queues.
> >>>>>>>
> >>>>>>> Even if scheduling all SQEs through a single io_uring SQ — by e.g. collecting all such SQEs in every OS thread and then somehow “moving” them to the one OS thread that manages the SQ so that it can enqueue them all — is very cheap, you’d still need to drain the CQ from that thread and presumably process those CQEs in a single OS thread, which will definitely be more work than having each reactor/OS thread dequeue CQEs for SQEs that it itself submitted.
> >>>>>>> You could have a single OS thread just for I/O, and all other threads could do something else, but you’d presumably need to serialize access/share state between them and the one I/O thread, which may be a scalability bottleneck.
> >>>>>>>
> >>>>>>> ( if you are curious, you can read about it here https://medium.com/@markpapadakis/building-high-performance-services-in-2020-e2dea272f6f6 )
> >>>>>>>
> >>>>>>> If you experiment with the various possible designs though, I’d love it if you were to share your findings.
> >>>>>>>
> >>>>>>> —
> >>>>>>> @markpapapdakis
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 13 May 2020, at 2:01 PM, Dmitry Sychov <dmitry.sychov@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hi Hielke,
> >>>>>>>>
> >>>>>>>>> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> >>>>>>>>> This means one ring per core/thread. Of course there is no simple answer to this.
> >>>>>>>>> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.
> >>>>>>>>
> >>>>>>>> I think a lot depends on the internal uring implementation: to what
> >>>>>>>> degree the kernel is able to handle multiple urings independently,
> >>>>>>>> without many congestion points (like updates of the same memory
> >>>>>>>> locations from multiple threads), thus taking advantage of one ring
> >>>>>>>> per CPU core.
> >>>>>>>>
> >>>>>>>> For example, if the tasks from multiple rings are later combined into
> >>>>>>>> a single input kernel queue (effectively forming a congestion point),
> >>>>>>>> I see no reason to use an exclusive ring per core in user space.
> >>>>>>>>
> >>>>>>>> [BTW, in Windows, IOCP always has one input+output queue for all (active) threads.]
> >>>>>>>>
> >>>>>>>> Also, we could pop multiple completion events from a single CQ at
> >>>>>>>> once to spread the handling across core-bound threads.
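Batch-draining can be modeled like this (a plain ring buffer of my own,
not the real io_uring CQ layout): pop up to `batch` completions with a
single head advance, then hand them out to workers.

```c
/* Toy single CQ with batched consumption: one head update covers the
 * whole batch, instead of one synchronized pop per completion. */
#include <assert.h>

#define CQ_CAP 16

struct cq {
    int entries[CQ_CAP];
    unsigned head, tail;
};

static unsigned cq_pop_batch(struct cq *q, int *out, unsigned batch)
{
    unsigned avail = q->tail - q->head;
    unsigned n = avail < batch ? avail : batch;
    for (unsigned i = 0; i < n; i++)
        out[i] = q->entries[(q->head + i) % CQ_CAP];
    q->head += n;               /* single advance for the whole batch */
    return n;
}
```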
> >>>>>>>>
> >>>>>>>> I thought about one uring per core at first, but now I'm not sure -
> >>>>>>>> maybe the kernel devs have something to add to the discussion?
> >>>>>>>>
> >>>>>>>> P.S. uring is the main reason I'm switching from Windows to Linux
> >>>>>>>> development for a client-server app, so I want to extract the maximum
> >>>>>>>> performance possible out of this new exciting uring stuff. :)
> >>>>>>>>
> >>>>>>>> Thanks, Dmitry
> >>>>>>>
> >>>>>
> >>
> >> --
> >> Pavel Begunkov
>
> --
> Pavel Begunkov


end of thread, other threads:[~2020-05-14 11:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-12 20:20 Any performance gains from using per thread(thread local) urings? Dmitry Sychov
2020-05-13  6:07 ` H. de Vries
2020-05-13 11:01   ` Dmitry Sychov
2020-05-13 11:56     ` Mark Papadakis
2020-05-13 13:15       ` Dmitry Sychov
2020-05-13 13:27         ` Mark Papadakis
2020-05-13 13:48           ` Dmitry Sychov
2020-05-13 14:12           ` Sergiy Yevtushenko
     [not found]           ` <CAO5MNut+nD-OqsKgae=eibWYuPim1f8-NuwqVpD87eZQnrwscA@mail.gmail.com>
2020-05-13 14:22             ` Dmitry Sychov
2020-05-13 14:31               ` Dmitry Sychov
2020-05-13 16:02               ` Pavel Begunkov
2020-05-13 19:23                 ` Dmitry Sychov
2020-05-14 10:06                   ` Pavel Begunkov
2020-05-14 11:35                     ` Dmitry Sychov
