* Per-CPU Queueing for QoS
@ 2017-11-10  1:20 Michael Ma
  2017-11-12 21:43 ` Michael Ma
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Ma @ 2017-11-10  1:20 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: jianjun.duan, xiangning.yu

Currently txq/qdisc selection is based on the flow hash, so packets
from the same flow keep their order when they enter the qdisc/txq,
which avoids the out-of-order problem.

To improve the concurrency of the QoS algorithm we plan to have
multiple per-cpu queues for a single TC class and do busy polling from
a per-class thread to drain these queues. If we can do this frequently
enough, the out-of-order window in this polling thread should not be
that bad.

To give more details - in the send path we introduce per-cpu,
per-class queues so that packets from the same class and the same core
are enqueued to the same place. A per-class thread then polls the
queues belonging to its class from all the cpus and aggregates them
into another per-class queue. This can effectively reduce contention
but inevitably introduces a potential out-of-order issue.

Any concerns or suggestions about working in this direction?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-10  1:20 Per-CPU Queueing for QoS Michael Ma
@ 2017-11-12 21:43 ` Michael Ma
  2017-11-13  0:14   ` Stephen Hemminger
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Ma @ 2017-11-12 21:43 UTC (permalink / raw)
  To: Linux Kernel Network Developers; +Cc: jianjun.duan, xiangning.yu

Any comments? We plan to implement this as a qdisc and appreciate any early feedback.

Thanks,
Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
> 
> Currently txq/qdisc selection is based on flow hash so packets from
> the same flow will follow the order when they enter qdisc/txq, which
> avoids out-of-order problem.
> 
> To improve the concurrency of QoS algorithm we plan to have multiple
> per-cpu queues for a single TC class and do busy polling from a
> per-class thread to drain these queues. If we can do this frequently
> enough the out-of-order situation in this polling thread should not be
> that bad.
> 
> To give more details - in the send path we introduce per-cpu per-class
> queues so that packets from the same class and same core will be
> enqueued to the same place. Then a per-class thread poll the queues
> belonging to its class from all the cpus and aggregate them into
> another per-class queue. This can effectively reduce contention but
> inevitably introduces potential out-of-order issue.
> 
> Any concern/suggestion for working towards this direction?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-12 21:43 ` Michael Ma
@ 2017-11-13  0:14   ` Stephen Hemminger
  2017-11-13 18:17     ` Michael Ma
  0 siblings, 1 reply; 17+ messages in thread
From: Stephen Hemminger @ 2017-11-13  0:14 UTC (permalink / raw)
  To: Michael Ma; +Cc: Linux Kernel Network Developers, jianjun.duan, xiangning.yu

On Sun, 12 Nov 2017 13:43:13 -0800
Michael Ma <make0818@gmail.com> wrote:

> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
> 
> Thanks,
> Michael
> 
> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
> > 
> > Currently txq/qdisc selection is based on flow hash so packets from
> > the same flow will follow the order when they enter qdisc/txq, which
> > avoids out-of-order problem.
> > 
> > To improve the concurrency of QoS algorithm we plan to have multiple
> > per-cpu queues for a single TC class and do busy polling from a
> > per-class thread to drain these queues. If we can do this frequently
> > enough the out-of-order situation in this polling thread should not be
> > that bad.
> > 
> > To give more details - in the send path we introduce per-cpu per-class
> > queues so that packets from the same class and same core will be
> > enqueued to the same place. Then a per-class thread poll the queues
> > belonging to its class from all the cpus and aggregate them into
> > another per-class queue. This can effectively reduce contention but
> > inevitably introduces potential out-of-order issue.
> > 
> > Any concern/suggestion for working towards this direction?  

In general, there are no meta-design discussions in Linux development.
Several developers have tried to do lockless qdiscs and similar things
in the past.

The devil is in the details, show us the code.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13  0:14   ` Stephen Hemminger
@ 2017-11-13 18:17     ` Michael Ma
  2017-11-13 22:47       ` Alexander Duyck
  0 siblings, 1 reply; 17+ messages in thread
From: Michael Ma @ 2017-11-13 18:17 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Linux Kernel Network Developers, jianjun.duan, xiangning.yu

2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
> On Sun, 12 Nov 2017 13:43:13 -0800
> Michael Ma <make0818@gmail.com> wrote:
>
>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>
>> Thanks,
>> Michael
>>
>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>> >
>> > Currently txq/qdisc selection is based on flow hash so packets from
>> > the same flow will follow the order when they enter qdisc/txq, which
>> > avoids out-of-order problem.
>> >
>> > To improve the concurrency of QoS algorithm we plan to have multiple
>> > per-cpu queues for a single TC class and do busy polling from a
>> > per-class thread to drain these queues. If we can do this frequently
>> > enough the out-of-order situation in this polling thread should not be
>> > that bad.
>> >
>> > To give more details - in the send path we introduce per-cpu per-class
>> > queues so that packets from the same class and same core will be
>> > enqueued to the same place. Then a per-class thread poll the queues
>> > belonging to its class from all the cpus and aggregate them into
>> > another per-class queue. This can effectively reduce contention but
>> > inevitably introduces potential out-of-order issue.
>> >
>> > Any concern/suggestion for working towards this direction?
>
> In general, there is no meta design discussions in Linux development
> Several developers have tried to do lockless
> qdisc and similar things in the past.
>
> The devil is in the details, show us the code.

Thanks for the response, Stephen. The code is fairly straightforward -
we have a per-cpu, per-class queue defined like this:

struct bandwidth_group
{
    struct skb_list queues[MAX_CPU_COUNT]; /* one producer queue per cpu */
    struct skb_list drain;                 /* aggregated by the class thread */
};

The "drain" queue is used to aggregate the per-cpu queues belonging to
the same class. In the enqueue function we determine the cpu on which
the packet is being processed and append the skb to the corresponding
per-cpu queue:

int cpu;
struct bandwidth_group *bwg = &bw_rx_groups[bwgid];

cpu = get_cpu();  /* pin to the current cpu; paired with put_cpu() below */
skb_list_append(&bwg->queues[cpu], skb);
put_cpu();

Here we don't check the flow of the packet, so if there is task
migration, or multiple threads send packets through the same flow,
packets can theoretically be enqueued to different per-cpu queues and
aggregated into the "drain" queue out of order.

Also, AFAIK there is no lockless HTB-like qdisc implementation
currently; however, if a similar effort is already ongoing, please let
me know.
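
To illustrate the aggregation side, below is a minimal userspace model
of the per-class drain thread (an illustrative sketch only, not kernel
code - skb_list_splice, drain_thread and the pthread/mutex scaffolding
are made-up stand-ins for whatever synchronization the real qdisc
would use):

/*
 * Userspace model of the per-cpu, per-class queueing described above.
 * Each cpu appends to its own list; one thread per class splices every
 * per-cpu list into "drain", where rate limiting/dequeue would run.
 */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CPU_COUNT 8

struct sk_buff { struct sk_buff *next; unsigned int len; };

struct skb_list {
    struct sk_buff *head, *tail;
    pthread_mutex_t lock;  /* stand-in for the per-queue SPSC synchronization */
};

struct bandwidth_group {
    struct skb_list queues[MAX_CPU_COUNT];
    struct skb_list drain;
    bool stop;
};

static void skb_list_append(struct skb_list *l, struct sk_buff *skb)
{
    pthread_mutex_lock(&l->lock);
    skb->next = NULL;
    if (l->tail)
        l->tail->next = skb;
    else
        l->head = skb;
    l->tail = skb;
    pthread_mutex_unlock(&l->lock);
}

/* Move the whole src list onto the tail of dst in one step. */
static void skb_list_splice(struct skb_list *src, struct skb_list *dst)
{
    pthread_mutex_lock(&src->lock);
    struct sk_buff *head = src->head, *tail = src->tail;
    src->head = src->tail = NULL;
    pthread_mutex_unlock(&src->lock);

    if (!head)
        return;
    pthread_mutex_lock(&dst->lock);
    if (dst->tail)
        dst->tail->next = head;
    else
        dst->head = head;
    dst->tail = tail;
    pthread_mutex_unlock(&dst->lock);
}

/* Per-class busy-polling thread: aggregate the per-cpu queues into "drain". */
static void *drain_thread(void *arg)
{
    struct bandwidth_group *bwg = arg;

    while (!bwg->stop) {
        for (int cpu = 0; cpu < MAX_CPU_COUNT; cpu++)
            skb_list_splice(&bwg->queues[cpu], &bwg->drain);
        /* Rate limiting and dequeue of "drain" would happen here. */
        sched_yield();
    }
    return NULL;
}

The out-of-order window mentioned above is the gap between a producer
appending to queues[cpu] and the drain thread splicing that queue.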

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13 18:17     ` Michael Ma
@ 2017-11-13 22:47       ` Alexander Duyck
  2017-11-13 23:08         ` Eric Dumazet
  2017-11-14  2:02         ` Michael Ma
  0 siblings, 2 replies; 17+ messages in thread
From: Alexander Duyck @ 2017-11-13 22:47 UTC (permalink / raw)
  To: Michael Ma
  Cc: Stephen Hemminger, Linux Kernel Network Developers, jianjun.duan,
	xiangning.yu

On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>> On Sun, 12 Nov 2017 13:43:13 -0800
>> Michael Ma <make0818@gmail.com> wrote:
>>
>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>
>>> Thanks,
>>> Michael
>>>
>>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>> >
>>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> > the same flow will follow the order when they enter qdisc/txq, which
>>> > avoids out-of-order problem.
>>> >
>>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> > per-cpu queues for a single TC class and do busy polling from a
>>> > per-class thread to drain these queues. If we can do this frequently
>>> > enough the out-of-order situation in this polling thread should not be
>>> > that bad.
>>> >
>>> > To give more details - in the send path we introduce per-cpu per-class
>>> > queues so that packets from the same class and same core will be
>>> > enqueued to the same place. Then a per-class thread poll the queues
>>> > belonging to its class from all the cpus and aggregate them into
>>> > another per-class queue. This can effectively reduce contention but
>>> > inevitably introduces potential out-of-order issue.
>>> >
>>> > Any concern/suggestion for working towards this direction?
>>
>> In general, there is no meta design discussions in Linux development
>> Several developers have tried to do lockless
>> qdisc and similar things in the past.
>>
>> The devil is in the details, show us the code.
>
> Thanks for the response, Stephen. The code is fairly straightforward,
> we have a per-cpu per-class queue defined as this:
>
> struct bandwidth_group
> {
>     struct skb_list queues[MAX_CPU_COUNT];
>     struct skb_list drain;
> }
>
> "drain" queue is used to aggregate per-cpu queues belonging to the
> same class. In the enqueue function, we determine the cpu where the
> packet is processed and enqueue it to the corresponding per-cpu queue:
>
> int cpu;
> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>
> cpu = get_cpu();
> skb_list_append(&bwg->queues[cpu], skb);
>
> Here we don't check the flow of the packet so if there is task
> migration or multiple threads sending packets through the same flow we
> theoretically can have packets enqueued to different queues and
> aggregated to the "drain" queue out of order.
>
> Also AFAIK there is no lockless htb-like qdisc implementation
> currently, however if there is already similar effort ongoing please
> let me know.

The question I would have is how would this differ from using XPS w/
mqprio? Would this be a classful qdisc like HTB or a classless one
like mqprio?

From what I can tell, XPS would be able to get you your per-cpu
functionality; the benefit of it, though, would be that it would avoid
out-of-order issues for sockets originating on the local system. The
only thing I see as an issue right now is that the rate limiting with
mqprio is assumed to be handled via hardware, due to mechanisms such
as DCB.

- Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13 22:47       ` Alexander Duyck
@ 2017-11-13 23:08         ` Eric Dumazet
  2017-11-14  0:13           ` Alexander Duyck
  2017-11-14  2:05           ` Michael Ma
  2017-11-14  2:02         ` Michael Ma
  1 sibling, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2017-11-13 23:08 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Michael Ma, Stephen Hemminger, Linux Kernel Network Developers,
	jianjun.duan, xiangning.yu

On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma <make0818@gmail.com> wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash so packets from
> >>> > the same flow will follow the order when they enter qdisc/txq, which
> >>> > avoids out-of-order problem.
> >>> >
> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
> >>> > per-cpu queues for a single TC class and do busy polling from a
> >>> > per-class thread to drain these queues. If we can do this frequently
> >>> > enough the out-of-order situation in this polling thread should not be
> >>> > that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu per-class
> >>> > queues so that packets from the same class and same core will be
> >>> > enqueued to the same place. Then a per-class thread poll the queues
> >>> > belonging to its class from all the cpus and aggregate them into
> >>> > another per-class queue. This can effectively reduce contention but
> >>> > inevitably introduces potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there is no meta design discussions in Linux development
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward,
> > we have a per-cpu per-class queue defined as this:
> >
> > struct bandwidth_group
> > {
> >     struct skb_list queues[MAX_CPU_COUNT];
> >     struct skb_list drain;
> > }
> >
> > "drain" queue is used to aggregate per-cpu queues belonging to the
> > same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> >
> > Here we don't check the flow of the packet so if there is task
> > migration or multiple threads sending packets through the same flow we
> > theoretically can have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also AFAIK there is no lockless htb-like qdisc implementation
> > currently, however if there is already similar effort ongoing please
> > let me know.
> 
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
> 
> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.

I think one of the key points was in: "do busy polling from a
per-class thread to drain these queues."

I mentioned this idea for the TX path in:

https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13 23:08         ` Eric Dumazet
@ 2017-11-14  0:13           ` Alexander Duyck
  2017-11-14  2:18             ` Michael Ma
  2017-11-14  2:05           ` Michael Ma
  1 sibling, 1 reply; 17+ messages in thread
From: Alexander Duyck @ 2017-11-14  0:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michael Ma, Stephen Hemminger, Linux Kernel Network Developers,
	jianjun.duan, xiangning.yu

On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>> >> Michael Ma <make0818@gmail.com> wrote:
>> >>
>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>> >>>
>> >>> Thanks,
>> >>> Michael
>> >>>
>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>> >>> >
>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>> >>> > avoids out-of-order problem.
>> >>> >
>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>> >>> > per-cpu queues for a single TC class and do busy polling from a
>> >>> > per-class thread to drain these queues. If we can do this frequently
>> >>> > enough the out-of-order situation in this polling thread should not be
>> >>> > that bad.
>> >>> >
>> >>> > To give more details - in the send path we introduce per-cpu per-class
>> >>> > queues so that packets from the same class and same core will be
>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>> >>> > belonging to its class from all the cpus and aggregate them into
>> >>> > another per-class queue. This can effectively reduce contention but
>> >>> > inevitably introduces potential out-of-order issue.
>> >>> >
>> >>> > Any concern/suggestion for working towards this direction?
>> >>
>> >> In general, there is no meta design discussions in Linux development
>> >> Several developers have tried to do lockless
>> >> qdisc and similar things in the past.
>> >>
>> >> The devil is in the details, show us the code.
>> >
>> > Thanks for the response, Stephen. The code is fairly straightforward,
>> > we have a per-cpu per-class queue defined as this:
>> >
>> > struct bandwidth_group
>> > {
>> >     struct skb_list queues[MAX_CPU_COUNT];
>> >     struct skb_list drain;
>> > }
>> >
>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>> > same class. In the enqueue function, we determine the cpu where the
>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>> >
>> > int cpu;
>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>> >
>> > cpu = get_cpu();
>> > skb_list_append(&bwg->queues[cpu], skb);
>> >
>> > Here we don't check the flow of the packet so if there is task
>> > migration or multiple threads sending packets through the same flow we
>> > theoretically can have packets enqueued to different queues and
>> > aggregated to the "drain" queue out of order.
>> >
>> > Also AFAIK there is no lockless htb-like qdisc implementation
>> > currently, however if there is already similar effort ongoing please
>> > let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key point was in : " do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in TX path of :
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

I think this is a bit different from that idea, in that the busy
polling is transferring packets from a per-cpu qdisc to a
per-traffic-class queueing discipline. Basically it would be a
busy-poll version of qdisc_run that would be transferring packets from
one qdisc onto another instead of attempting to transmit them
directly.

What I think is tripping me up is that I don't think this is even
meant to work with a multiqueue device. The description seems more
like a mqprio implementation feeding into a prio qdisc which is then
used for dequeue. To me it seems like this solution would be pulling
complexity off of the enqueue side and just adding it to the dequeue
side. The use of the "busy poll" is just to try to simplify things as
the agent would then be consolidating traffic from multiple per-cpu
queues onto one drain queue.

Structure-wise this ends up looking not too different from mqprio; the
main difference, though, would be that this would be a classful qdisc,
and the virtual qdiscs we have for the traffic classes would need to
be replaced with actual qdiscs for handling the "drain" aspect of
this.

- Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13 22:47       ` Alexander Duyck
  2017-11-13 23:08         ` Eric Dumazet
@ 2017-11-14  2:02         ` Michael Ma
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14  2:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Stephen Hemminger, Linux Kernel Network Developers, jianjun.duan,
	xiangning.yu

2017-11-13 14:47 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>> Michael Ma <make0818@gmail.com> wrote:
>>>
>>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>> >
>>>> > Currently txq/qdisc selection is based on flow hash so packets from
>>>> > the same flow will follow the order when they enter qdisc/txq, which
>>>> > avoids out-of-order problem.
>>>> >
>>>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>>> > per-cpu queues for a single TC class and do busy polling from a
>>>> > per-class thread to drain these queues. If we can do this frequently
>>>> > enough the out-of-order situation in this polling thread should not be
>>>> > that bad.
>>>> >
>>>> > To give more details - in the send path we introduce per-cpu per-class
>>>> > queues so that packets from the same class and same core will be
>>>> > enqueued to the same place. Then a per-class thread poll the queues
>>>> > belonging to its class from all the cpus and aggregate them into
>>>> > another per-class queue. This can effectively reduce contention but
>>>> > inevitably introduces potential out-of-order issue.
>>>> >
>>>> > Any concern/suggestion for working towards this direction?
>>>
>>> In general, there is no meta design discussions in Linux development
>>> Several developers have tried to do lockless
>>> qdisc and similar things in the past.
>>>
>>> The devil is in the details, show us the code.
>>
>> Thanks for the response, Stephen. The code is fairly straightforward,
>> we have a per-cpu per-class queue defined as this:
>>
>> struct bandwidth_group
>> {
>>     struct skb_list queues[MAX_CPU_COUNT];
>>     struct skb_list drain;
>> }
>>
>> "drain" queue is used to aggregate per-cpu queues belonging to the
>> same class. In the enqueue function, we determine the cpu where the
>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>
>> int cpu;
>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>
>> cpu = get_cpu();
>> skb_list_append(&bwg->queues[cpu], skb);
>>
>> Here we don't check the flow of the packet so if there is task
>> migration or multiple threads sending packets through the same flow we
>> theoretically can have packets enqueued to different queues and
>> aggregated to the "drain" queue out of order.
>>
>> Also AFAIK there is no lockless htb-like qdisc implementation
>> currently, however if there is already similar effort ongoing please
>> let me know.
>
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>

Yes, XPS + mqprio will achieve a similar effect. However, XPS + mqprio
requires quite a bit of static configuration to make things work.
Also, all the overhead associated with the qdisc root lock still
exists even though the contention might be reduced. With a thread
dedicated to each cpu we can avoid acquiring a lock for enqueue and
dequeue, since each queue is effectively single-producer and
single-consumer. In the case that multiple threads do share the same
flow the lock contention still exists, and we don't have per-class
parallelism for rate limiting if classes ever share the same cores -
which is avoidable by having a per-class thread run something like a
leaky bucket.

We still plan to implement it as a classful qdisc.
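
As a concrete illustration of the single-producer/single-consumer
point, this is roughly what one per-cpu queue could look like as a
lock-free ring (a userspace sketch using C11 atomics; spsc_ring,
spsc_push and spsc_pop are made-up names, not the code we actually
have):

/*
 * Single-producer / single-consumer ring: one cpu enqueues, the
 * per-class thread dequeues, so no lock is needed -- only ordered
 * loads/stores of the head and tail indices.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 1024                    /* must be a power of two */

struct spsc_ring {
    void *slots[RING_SIZE];
    _Atomic size_t head;                  /* advanced only by the consumer */
    _Atomic size_t tail;                  /* advanced only by the producer */
};

/* Enqueue path: called only on the producing cpu. */
static bool spsc_push(struct spsc_ring *r, void *item)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_SIZE)
        return false;                     /* ring is full */
    r->slots[tail & (RING_SIZE - 1)] = item;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Dequeue path: called only from the per-class drain thread. */
static void *spsc_pop(struct spsc_ring *r)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return NULL;                      /* ring is empty */
    void *item = r->slots[head & (RING_SIZE - 1)];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return item;
}

The property we rely on is that head is only ever written by the
consumer and tail only by the producer, which is what lets the enqueue
path skip the root lock entirely.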

> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.
>

mqprio can also have qdiscs like HTB attached to its classes, so this
can actually work without DCB.

> - Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-13 23:08         ` Eric Dumazet
  2017-11-14  0:13           ` Alexander Duyck
@ 2017-11-14  2:05           ` Michael Ma
  2017-11-14  2:06             ` Michael Ma
  1 sibling, 1 reply; 17+ messages in thread
From: Michael Ma @ 2017-11-14  2:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>> >> Michael Ma <make0818@gmail.com> wrote:
>> >>
>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>> >>>
>> >>> Thanks,
>> >>> Michael
>> >>>
>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>> >>> >
>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>> >>> > avoids out-of-order problem.
>> >>> >
>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>> >>> > per-cpu queues for a single TC class and do busy polling from a
>> >>> > per-class thread to drain these queues. If we can do this frequently
>> >>> > enough the out-of-order situation in this polling thread should not be
>> >>> > that bad.
>> >>> >
>> >>> > To give more details - in the send path we introduce per-cpu per-class
>> >>> > queues so that packets from the same class and same core will be
>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>> >>> > belonging to its class from all the cpus and aggregate them into
>> >>> > another per-class queue. This can effectively reduce contention but
>> >>> > inevitably introduces potential out-of-order issue.
>> >>> >
>> >>> > Any concern/suggestion for working towards this direction?
>> >>
>> >> In general, there is no meta design discussions in Linux development
>> >> Several developers have tried to do lockless
>> >> qdisc and similar things in the past.
>> >>
>> >> The devil is in the details, show us the code.
>> >
>> > Thanks for the response, Stephen. The code is fairly straightforward,
>> > we have a per-cpu per-class queue defined as this:
>> >
>> > struct bandwidth_group
>> > {
>> >     struct skb_list queues[MAX_CPU_COUNT];
>> >     struct skb_list drain;
>> > }
>> >
>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>> > same class. In the enqueue function, we determine the cpu where the
>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>> >
>> > int cpu;
>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>> >
>> > cpu = get_cpu();
>> > skb_list_append(&bwg->queues[cpu], skb);
>> >
>> > Here we don't check the flow of the packet so if there is task
>> > migration or multiple threads sending packets through the same flow we
>> > theoretically can have packets enqueued to different queues and
>> > aggregated to the "drain" queue out of order.
>> >
>> > Also AFAIK there is no lockless htb-like qdisc implementation
>> > currently, however if there is already similar effort ongoing please
>> > let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key point was in : " do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in TX path of :
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
>

Right - this part is the key difference. With mqprio we still don't
have the ability to exploit parallelism at the class level; the
parallelism is restricted to how flows are partitioned across queues.

>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  2:05           ` Michael Ma
@ 2017-11-14  2:06             ` Michael Ma
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14  2:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Duyck, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

2017-11-13 18:05 GMT-08:00 Michael Ma <make0818@gmail.com>:
> 2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0818@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>> >>> > avoids out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>> >>> > belonging to its class from all the cpus and aggregate them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there is no meta design discussions in Linux development
>>> >> Several developers have tried to do lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>> > we have a per-cpu per-class queue defined as this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > }
>>> >
>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet so if there is task
>>> > migration or multiple threads sending packets through the same flow we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>> > currently, however if there is already similar effort ongoing please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key point was in : " do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in TX path of :
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>>
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to explore parallelism at the level of class. The
> parallelism is restricted to the way of partitioning flows across
> queues.
>
>>

Eric - do you think that if we do busy polling frequently enough, the
out-of-order problem will effectively be mitigated? I'll take a look
at your slides as well.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  0:13           ` Alexander Duyck
@ 2017-11-14  2:18             ` Michael Ma
  2017-11-14  3:45               ` John Fastabend
  2017-11-14 22:59               ` Michael Ma
  0 siblings, 2 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14  2:18 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Stephen Hemminger, Linux Kernel Network Developers,
	jianjun.duan, xiangning.yu

2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0818@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>> >>> > avoids out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>> >>> > belonging to its class from all the cpus and aggregate them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there is no meta design discussions in Linux development
>>> >> Several developers have tried to do lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>> > we have a per-cpu per-class queue defined as this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > }
>>> >
>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet so if there is task
>>> > migration or multiple threads sending packets through the same flow we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>> > currently, however if there is already similar effort ongoing please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key point was in : " do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in TX path of :
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
> I think this is a bit different from that idea in that the busy
> polling is transferring packets from a per-cpu qdisc to a per traffic
> class queueing discipline. Basically it would be a busy_poll version
> of qdisc_run that would be transferring packets from one qdisc onto
> another instead of attempting to transmit them directly.

The idea is to implement the whole thing as one classful qdisc and
remove the qdisc root lock, since all the synchronization will be
handled internally (let's put aside for now that other
filters/actions/qdiscs might still require a root lock).
>
> What I think is tripping me up is that I don't think this is even
> meant to work with a multiqueue device. The description seems more
> like a mqprio implementation feeding into a prio qdisc which is then
> used for dequeue. To me it seems like this solution would be pulling
> complexity off of the enqueue side and just adding it to the dequeue
> side. The use of the "busy poll" is just to try to simplify things as
> the agent would then be consolidating traffic from multiple per-cpu
> queues onto one drain queue.

We're essentially trying to spread the complexity from enqueue across
different stages, such as enqueue/aggregation and rate
limiting/dequeue, and each stage will have a different degree of
parallelism. It should work with a multi-queue device since txq
selection can stay the same as today. However, our concern is that
between enqueue and aggregation there is a small window which can let
packets go out of order - a sacrifice for better concurrency.

>
> Structure wise this ends up looking not too different from mqprio, the
> main difference though would be that this would be a classful qdisc
> and that the virtual qdiscs we have for the traffic classes would need
> to be replaced with actual qdiscs for handling the "drain" aspect of
> this.

Structure-wise it's similar to mqprio + a rate-limiting qdisc without
the root lock, replacing txq/flow-level parallelism with cpu-level
parallelism. I'm actually not sure about the similarity with the busy
polling that Eric mentioned since I haven't read the slides yet.
>
> - Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  2:18             ` Michael Ma
@ 2017-11-14  3:45               ` John Fastabend
  2017-11-14  3:54                 ` Dave Taht
                                   ` (2 more replies)
  2017-11-14 22:59               ` Michael Ma
  1 sibling, 3 replies; 17+ messages in thread
From: John Fastabend @ 2017-11-14  3:45 UTC (permalink / raw)
  To: Michael Ma, Alexander Duyck
  Cc: Eric Dumazet, Stephen Hemminger, Linux Kernel Network Developers,
	jianjun.duan, xiangning.yu

On 11/13/2017 06:18 PM, Michael Ma wrote:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>> Michael Ma <make0818@gmail.com> wrote:
>>>>>>
>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Michael
>>>>>>>
>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Currently txq/qdisc selection is based on flow hash so packets from
>>>>>>>> the same flow will follow the order when they enter qdisc/txq, which
>>>>>>>> avoids out-of-order problem.
>>>>>>>>
>>>>>>>> To improve the concurrency of QoS algorithm we plan to have multiple
>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>> enough the out-of-order situation in this polling thread should not be
>>>>>>>> that bad.
>>>>>>>>
>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>> enqueued to the same place. Then a per-class thread poll the queues
>>>>>>>> belonging to its class from all the cpus and aggregate them into
>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>> inevitably introduces potential out-of-order issue.
>>>>>>>>
>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>
>>>>>> In general, there is no meta design discussions in Linux development
>>>>>> Several developers have tried to do lockless
>>>>>> qdisc and similar things in the past.
>>>>>>
>>>>>> The devil is in the details, show us the code.
>>>>>
>>>>> Thanks for the response, Stephen. The code is fairly straightforward,
>>>>> we have a per-cpu per-class queue defined as this:
>>>>>
>>>>> struct bandwidth_group
>>>>> {
>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>     struct skb_list drain;
>>>>> }
>>>>>
>>>>> "drain" queue is used to aggregate per-cpu queues belonging to the
>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>
>>>>> int cpu;
>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>
>>>>> cpu = get_cpu();
>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>
>>>>> Here we don't check the flow of the packet so if there is task
>>>>> migration or multiple threads sending packets through the same flow we
>>>>> theoretically can have packets enqueued to different queues and
>>>>> aggregated to the "drain" queue out of order.
>>>>>
>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>> currently, however if there is already similar effort ongoing please
>>>>> let me know.
>>>>
>>>> The question I would have is how would this differ from using XPS w/
>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>> like mqprio?
>>>>
>>>> From what I can tell XPS would be able to get you your per-cpu
>>>> functionality, the benefit of it though would be that it would avoid
>>>> out-of-order issues for sockets originating on the local system. The
>>>> only thing I see as an issue right now is that the rate limiting with
>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>> DCB.
>>>
>>> I think one of the key point was in : " do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in TX path of :
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
> 
> The idea is to have the whole part implemented as one classful qdisc
> and remove the qdisc root lock since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like a mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue side and just adding it to the dequeue
>> side. The use of the "busy poll" is just to try to simplify things as
>> the agent would then be consolidating traffic from multiple per-cpu
>> queues onto one drain queue.
> 
> We're essentially trying to spread the complexity from enqueue to
> different stages such as enqueue/aggregation and rate
> limiting/dequeue. Each stage will have different parallelisms. It
> should work with multi-queue device since txq selection can be the
> same as today. However our concern is that between enqueue and
> aggregation we have a small window which can allow packet oob, which
> is a sacrifice to better concurrency.
> 

So OOO will happen when the application migrates cpus, presumably?
This is normally prevented with the skb ooo_okay flag, but it looks
like you plan to violate this somehow. I think a design using
ptr_rings/skb_arrays with bulk dequeue and a good concurrent token
bucket ring would suffice and also not introduce OOO packets.

But I don't completely understand your design, so I might be missing
something.
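
For reference, a minimal token-bucket check of the kind such a
per-class thread would run (an illustrative userspace sketch; the tb_*
names are made up, and a real concurrent token bucket ring as John
suggests needs more care than this):

/*
 * Minimal per-class token bucket. Single-threaded on purpose -- in the
 * design above only the per-class drain thread would consume tokens,
 * so no extra synchronization is shown here.
 */
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct token_bucket {
    uint64_t rate_bytes_per_sec;  /* refill rate */
    uint64_t burst_bytes;         /* bucket depth */
    uint64_t tokens;              /* current credit, in bytes */
    uint64_t last_ns;             /* timestamp of the last refill */
};

static uint64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Refill from elapsed time, then try to spend "len" bytes. */
static bool tb_try_consume(struct token_bucket *tb, uint64_t len)
{
    uint64_t now = now_ns();

    tb->tokens += (now - tb->last_ns) * tb->rate_bytes_per_sec / 1000000000ull;
    if (tb->tokens > tb->burst_bytes)
        tb->tokens = tb->burst_bytes;
    tb->last_ns = now;

    if (tb->tokens < len)
        return false;             /* class is over its rate; hold the packet */
    tb->tokens -= len;
    return true;
}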

>>
>> Structure wise this ends up looking not too different from mqprio, the
>> main difference though would be that this would be a classful qdisc
>> and that the virtual qdiscs we have for the traffic classes would need
>> to be replaced with actual qdiscs for handling the "drain" aspect of
>> this.
> 
> Structure wise it's similar to mqprio + rate limiting qdisc without
> root lock, and replacing txq/flow level parallelism by cpu level
> parallelism. I'm actually not sure about the similarity with busy
> polling that Eric mentioned since I haven't read the slides yet.

I pushed the lockless qdisc patches again today and will repost when
net-next opens. These, plus a lockless version of tbf, might be what
you need. At one point I had a lockless tbf; I can probably dig that
up as well if it's useful.

https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html

Thanks,
John

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  3:45               ` John Fastabend
@ 2017-11-14  3:54                 ` Dave Taht
  2017-11-14  4:53                 ` Tom Herbert
  2017-11-14 23:06                 ` Michael Ma
  2 siblings, 0 replies; 17+ messages in thread
From: Dave Taht @ 2017-11-14  3:54 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael Ma, Alexander Duyck, Eric Dumazet, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

I have been thinking we'd try to submit sch_cake to mainline this
go-around. It's been out of tree for way too long.

I look forward to understanding your patches soon in the tbf case.

(I'm only responding because cake uses a deficit scheduler rather than
a token bucket, is not reliant on the tc filter infrastructure for its
queueing, and I'd love to have it handle multiple cpus better.)

repo: https://github.com/dtaht/sch_cake.git

doc: https://www.bufferbloat.net/projects/codel/wiki/CakeTechnical/
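
For readers unfamiliar with the distinction Dave is drawing, a
bare-bones deficit round-robin dequeue looks roughly like this (an
illustrative sketch of the general "deficit" idea, not sch_cake's
actual code; the drr_* names are made up):

/*
 * Bare-bones deficit round-robin. Each backlogged class earns QUANTUM
 * bytes of credit per visit and may dequeue packets while its credit
 * covers the head packet, so bandwidth is shared by bytes, not packets.
 */
#include <stddef.h>

#define NUM_CLASSES 4
#define QUANTUM 1514                      /* roughly one MTU per round */

struct pkt { struct pkt *next; unsigned int len; };

struct drr_class {
    struct pkt *head;                     /* simple FIFO of queued packets */
    unsigned int deficit;                 /* byte credit carried across rounds */
};

struct drr_sched {
    struct drr_class cls[NUM_CLASSES];
    int cur;                              /* class currently being served */
};

/* Return the next packet to send, or NULL if every class is empty. */
static struct pkt *drr_dequeue(struct drr_sched *s)
{
    int empty_seen = 0;

    while (empty_seen < NUM_CLASSES) {
        struct drr_class *c = &s->cls[s->cur];

        if (!c->head) {
            c->deficit = 0;               /* idle classes carry no credit */
            s->cur = (s->cur + 1) % NUM_CLASSES;
            empty_seen++;
            continue;
        }
        empty_seen = 0;
        if (c->head->len <= c->deficit) {
            struct pkt *p = c->head;

            c->head = p->next;
            c->deficit -= p->len;
            return p;                     /* keep serving this class next call */
        }
        /* Head doesn't fit yet: grant one quantum, move to the next class. */
        c->deficit += QUANTUM;
        s->cur = (s->cur + 1) % NUM_CLASSES;
    }
    return NULL;
}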

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  3:45               ` John Fastabend
  2017-11-14  3:54                 ` Dave Taht
@ 2017-11-14  4:53                 ` Tom Herbert
  2017-11-14 23:10                   ` Michael Ma
  2017-11-14 23:06                 ` Michael Ma
  2 siblings, 1 reply; 17+ messages in thread
From: Tom Herbert @ 2017-11-14  4:53 UTC (permalink / raw)
  To: John Fastabend
  Cc: Michael Ma, Alexander Duyck, Eric Dumazet, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>> Michael Ma <make0818@gmail.com> wrote:
>>>>>>>
>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently txq/qdisc selection is based on flow hash so packets from
>>>>>>>>> the same flow will follow the order when they enter qdisc/txq, which
>>>>>>>>> avoids out-of-order problem.
>>>>>>>>>
>>>>>>>>> To improve the concurrency of QoS algorithm we plan to have multiple
>>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>>> enough the out-of-order situation in this polling thread should not be
>>>>>>>>> that bad.
>>>>>>>>>
>>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>>> enqueued to the same place. Then a per-class thread poll the queues
>>>>>>>>> belonging to its class from all the cpus and aggregate them into
>>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>>> inevitably introduces potential out-of-order issue.
>>>>>>>>>
>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>
>>>>>>> In general, there is no meta design discussions in Linux development
>>>>>>> Several developers have tried to do lockless
>>>>>>> qdisc and similar things in the past.
>>>>>>>
>>>>>>> The devil is in the details, show us the code.
>>>>>>
>>>>>> Thanks for the response, Stephen. The code is fairly straightforward,
>>>>>> we have a per-cpu per-class queue defined as this:
>>>>>>
>>>>>> struct bandwidth_group
>>>>>> {
>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>     struct skb_list drain;
>>>>>> }
>>>>>>
>>>>>> "drain" queue is used to aggregate per-cpu queues belonging to the
>>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>>
>>>>>> int cpu;
>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>
>>>>>> cpu = get_cpu();
>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>>
>>>>>> Here we don't check the flow of the packet so if there is task
>>>>>> migration or multiple threads sending packets through the same flow we
>>>>>> theoretically can have packets enqueued to different queues and
>>>>>> aggregated to the "drain" queue out of order.
>>>>>>
>>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>>> currently, however if there is already similar effort ongoing please
>>>>>> let me know.
>>>>>
>>>>> The question I would have is how would this differ from using XPS w/
>>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>>> like mqprio?
>>>>>
>>>>> From what I can tell XPS would be able to get you your per-cpu
>>>>> functionality, the benefit of it though would be that it would avoid
>>>>> out-of-order issues for sockets originating on the local system. The
>>>>> only thing I see as an issue right now is that the rate limiting with
>>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>>> DCB.
>>>>
>>>> I think one of the key point was in : " do busy polling from a per-class
>>>> thread to drain these queues."
>>>>
>>>> I mentioned this idea in TX path of :
>>>>
>>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a per traffic
>>> class queueing discipline. Basically it would be a busy_poll version
>>> of qdisc_run that would be transferring packets from one qdisc onto
>>> another instead of attempting to transmit them directly.
>>
>> The idea is to have the whole part implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside that other filters/actions/qdiscs
>> might still require a root lock for now).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like a mqprio implementation feeding into a prio qdisc which is then
>>> used for dequeue. To me it seems like this solution would be pulling
>>> complexity off of the enqueue side and just adding it to the dequeue
>>> side. The use of the "busy poll" is just to try to simplify things as
>>> the agent would then be consolidating traffic from multiple per-cpu
>>> queues onto one drain queue.
>>
>> We're essentially trying to spread the complexity from enqueue to
>> different stages such as enqueue/aggregation and rate
>> limiting/dequeue. Each stage will have different parallelisms. It
>> should work with multi-queue device since txq selection can be the
>> same as today. However our concern is that between enqueue and
>> aggregation we have a small window which can allow packet oob, which
>> is a sacrifice to better concurrency.
>>
>
> So OOO will happen when application cpu migrates presumably? This is
> normally prevented with skb ooo flag but it looks like you plan to
> violate this somehow. I think a design using ptr_rings/skb_arrays
> with bulk dequeue and a good concurrent token bucket ring would
> suffice and also not introduce OOO packets.
>
> But don't completely understand your design so might be missing
> something.
>
I'm missing something here as well. The check for ooo_okay is a
straightforward way to see if we can change the sk TX queue. Does the
design not use the TX queue any more?

Tom

>>>
>>> Structure wise this ends up looking not too different from mqprio, the
>>> main difference though would be that this would be a classful qdisc
>>> and that the virtual qdiscs we have for the traffic classes would need
>>> to be replaced with actual qdiscs for handling the "drain" aspect of
>>> this.
>>
>> Structure wise it's similar to mqprio + rate limiting qdisc without
>> root lock, and replacing txq/flow level parallelism by cpu level
>> parallelism. I'm actually not sure about the similarity with busy
>> polling that Eric mentioned since I haven't read the slides yet.
>
> I pushed lockless qdisc patches again today and will repost when
> net-next opens. These plus a lockless version of tbf might be
> what you need. At one point I had a lockless tbf I can probably
> dig that up as well if its useful.
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html
>
> Thanks,
> John

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  2:18             ` Michael Ma
  2017-11-14  3:45               ` John Fastabend
@ 2017-11-14 22:59               ` Michael Ma
  1 sibling, 0 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14 22:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Eric Dumazet, Stephen Hemminger, Linux Kernel Network Developers,
	jianjun.duan, xiangning.yu

2017-11-13 18:18 GMT-08:00 Michael Ma <make0818@gmail.com>:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>>> >> Michael Ma <make0818@gmail.com> wrote:
>>>> >>
>>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Michael
>>>> >>>
>>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>> >>> >
>>>> >>> > Currently txq/qdisc selection is based on flow hash so packets from
>>>> >>> > the same flow will follow the order when they enter qdisc/txq, which
>>>> >>> > avoids out-of-order problem.
>>>> >>> >
>>>> >>> > To improve the concurrency of QoS algorithm we plan to have multiple
>>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>>> >>> > enough the out-of-order situation in this polling thread should not be
>>>> >>> > that bad.
>>>> >>> >
>>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>>> >>> > queues so that packets from the same class and same core will be
>>>> >>> > enqueued to the same place. Then a per-class thread poll the queues
>>>> >>> > belonging to its class from all the cpus and aggregate them into
>>>> >>> > another per-class queue. This can effectively reduce contention but
>>>> >>> > inevitably introduces potential out-of-order issue.
>>>> >>> >
>>>> >>> > Any concern/suggestion for working towards this direction?
>>>> >>
>>>> >> In general, there is no meta design discussions in Linux development
>>>> >> Several developers have tried to do lockless
>>>> >> qdisc and similar things in the past.
>>>> >>
>>>> >> The devil is in the details, show us the code.
>>>> >
>>>> > Thanks for the response, Stephen. The code is fairly straightforward,
>>>> > we have a per-cpu per-class queue defined as this:
>>>> >
>>>> > struct bandwidth_group
>>>> > {
>>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>>> >     struct skb_list drain;
>>>> > }
>>>> >
>>>> > "drain" queue is used to aggregate per-cpu queues belonging to the
>>>> > same class. In the enqueue function, we determine the cpu where the
>>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>>> >
>>>> > int cpu;
>>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>> >
>>>> > cpu = get_cpu();
>>>> > skb_list_append(&bwg->queues[cpu], skb);
>>>> >
>>>> > Here we don't check the flow of the packet so if there is task
>>>> > migration or multiple threads sending packets through the same flow we
>>>> > theoretically can have packets enqueued to different queues and
>>>> > aggregated to the "drain" queue out of order.
>>>> >
>>>> > Also AFAIK there is no lockless htb-like qdisc implementation
>>>> > currently, however if there is already similar effort ongoing please
>>>> > let me know.
>>>>
>>>> The question I would have is how would this differ from using XPS w/
>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>> like mqprio?
>>>>
>>>> From what I can tell XPS would be able to get you your per-cpu
>>>> functionality, the benefit of it though would be that it would avoid
>>>> out-of-order issues for sockets originating on the local system. The
>>>> only thing I see as an issue right now is that the rate limiting with
>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>> DCB.
>>>
>>> I think one of the key point was in : " do busy polling from a per-class
>>> thread to drain these queues."
>>>
>>> I mentioned this idea in TX path of :
>>>
>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>> I think this is a bit different from that idea in that the busy
>> polling is transferring packets from a per-cpu qdisc to a per traffic
>> class queueing discipline. Basically it would be a busy_poll version
>> of qdisc_run that would be transferring packets from one qdisc onto
>> another instead of attempting to transmit them directly.
>
> The idea is to have the whole part implemented as one classful qdisc
> and remove the qdisc root lock since all the synchronization will be
> handled internally (let's put aside that other filters/actions/qdiscs
> might still require a root lock for now).
>>
>> What I think is tripping me up is that I don't think this is even
>> meant to work with a multiqueue device. The description seems more
>> like a mqprio implementation feeding into a prio qdisc which is then
>> used for dequeue. To me it seems like this solution would be pulling
>> complexity off of the enqueue side and just adding it to the dequeue
>> side. The use of the "busy poll" is just to try to simplify things as
>> the agent would then be consolidating traffic from multiple per-cpu
>> queues onto one drain queue.
>
> We're essentially trying to spread the complexity from enqueue to
> different stages such as enqueue/aggregation and rate
> limiting/dequeue. Each stage will have different parallelisms. It
> should work with multi-queue device since txq selection can be the
> same as today. However our concern is that between enqueue and
> aggregation we have a small window which can allow packet oob, which
> is a sacrifice to better concurrency.
>
>>
>> Structure wise this ends up looking not too different from mqprio, the
>> main difference though would be that this would be a classful qdisc
>> and that the virtual qdiscs we have for the traffic classes would need
>> to be replaced with actual qdiscs for handling the "drain" aspect of
>> this.
>
> Structure wise it's similar to mqprio + rate limiting qdisc without
> root lock, and replacing txq/flow level parallelism by cpu level
> parallelism. I'm actually not sure about the similarity with busy
> polling that Eric mentioned since I haven't read the slides yet.

The idea is actually pretty similar to what is described in the busy
polling slides. As you mentioned, busy polling here is only used for
transferring packets from one queue to another, and we plan to
implement these queues inside a single qdisc.
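
(To make the intended split more concrete, here is a rough sketch under
the same assumptions as the earlier snippet - bandwidth_group and the
skb_list helpers, including the skb_list_splice()/skb_list_dequeue()
names used here, are hypothetical, and locking as well as rate limiting
are elided:)

    /* Per-class polling thread: drain every per-cpu queue of this class
     * into the single per-class "drain" queue. */
    static void bwg_poll_once(struct bandwidth_group *bwg)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    skb_list_splice(&bwg->queues[cpu], &bwg->drain);
    }

    /* The qdisc ->dequeue path then only touches the drain queue, which
     * is also where per-class rate limiting would be applied. */
    static struct sk_buff *bwg_dequeue(struct bandwidth_group *bwg)
    {
            return skb_list_dequeue(&bwg->drain);
    }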
>>
>> - Alex

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  3:45               ` John Fastabend
  2017-11-14  3:54                 ` Dave Taht
  2017-11-14  4:53                 ` Tom Herbert
@ 2017-11-14 23:06                 ` Michael Ma
  2 siblings, 0 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14 23:06 UTC (permalink / raw)
  To: John Fastabend
  Cc: Alexander Duyck, Eric Dumazet, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

2017-11-13 19:45 GMT-08:00 John Fastabend <john.fastabend@gmail.com>:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>> Michael Ma <make0818@gmail.com> wrote:
>>>>>>>
>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently txq/qdisc selection is based on flow hash so packets from
>>>>>>>>> the same flow will follow the order when they enter qdisc/txq, which
>>>>>>>>> avoids out-of-order problem.
>>>>>>>>>
>>>>>>>>> To improve the concurrency of QoS algorithm we plan to have multiple
>>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>>> enough the out-of-order situation in this polling thread should not be
>>>>>>>>> that bad.
>>>>>>>>>
>>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>>> enqueued to the same place. Then a per-class thread poll the queues
>>>>>>>>> belonging to its class from all the cpus and aggregate them into
>>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>>> inevitably introduces potential out-of-order issue.
>>>>>>>>>
>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>
>>>>>>> In general, there is no meta design discussions in Linux development
>>>>>>> Several developers have tried to do lockless
>>>>>>> qdisc and similar things in the past.
>>>>>>>
>>>>>>> The devil is in the details, show us the code.
>>>>>>
>>>>>> Thanks for the response, Stephen. The code is fairly straightforward,
>>>>>> we have a per-cpu per-class queue defined as this:
>>>>>>
>>>>>> struct bandwidth_group
>>>>>> {
>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>     struct skb_list drain;
>>>>>> }
>>>>>>
>>>>>> "drain" queue is used to aggregate per-cpu queues belonging to the
>>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>>
>>>>>> int cpu;
>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>
>>>>>> cpu = get_cpu();
>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>>
>>>>>> Here we don't check the flow of the packet so if there is task
>>>>>> migration or multiple threads sending packets through the same flow we
>>>>>> theoretically can have packets enqueued to different queues and
>>>>>> aggregated to the "drain" queue out of order.
>>>>>>
>>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>>> currently, however if there is already similar effort ongoing please
>>>>>> let me know.
>>>>>
>>>>> The question I would have is how would this differ from using XPS w/
>>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>>> like mqprio?
>>>>>
>>>>> From what I can tell XPS would be able to get you your per-cpu
>>>>> functionality, the benefit of it though would be that it would avoid
>>>>> out-of-order issues for sockets originating on the local system. The
>>>>> only thing I see as an issue right now is that the rate limiting with
>>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>>> DCB.
>>>>
>>>> I think one of the key point was in : " do busy polling from a per-class
>>>> thread to drain these queues."
>>>>
>>>> I mentioned this idea in TX path of :
>>>>
>>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>
>>> I think this is a bit different from that idea in that the busy
>>> polling is transferring packets from a per-cpu qdisc to a per traffic
>>> class queueing discipline. Basically it would be a busy_poll version
>>> of qdisc_run that would be transferring packets from one qdisc onto
>>> another instead of attempting to transmit them directly.
>>
>> The idea is to have the whole part implemented as one classful qdisc
>> and remove the qdisc root lock since all the synchronization will be
>> handled internally (let's put aside that other filters/actions/qdiscs
>> might still require a root lock for now).
>>>
>>> What I think is tripping me up is that I don't think this is even
>>> meant to work with a multiqueue device. The description seems more
>>> like a mqprio implementation feeding into a prio qdisc which is then
>>> used for dequeue. To me it seems like this solution would be pulling
>>> complexity off of the enqueue side and just adding it to the dequeue
>>> side. The use of the "busy poll" is just to try to simplify things as
>>> the agent would then be consolidating traffic from multiple per-cpu
>>> queues onto one drain queue.
>>
>> We're essentially trying to spread the complexity from enqueue to
>> different stages such as enqueue/aggregation and rate
>> limiting/dequeue. Each stage will have different parallelisms. It
>> should work with multi-queue device since txq selection can be the
>> same as today. However our concern is that between enqueue and
>> aggregation we have a small window which can allow packet oob, which
>> is a sacrifice to better concurrency.
>>
>
> So OOO will happen when application cpu migrates presumably? This is
> normally prevented with skb ooo flag but it looks like you plan to
> violate this somehow. I think a design using ptr_rings/skb_arrays
> with bulk dequeue and a good concurrent token bucket ring would
> suffice and also not introduce OOO packets.
>

We haven't thought of leveraging the ooo flag - thanks for bringing
this up. The original question still holds, though - if the ooo flag is
not set, we may not have a simple way of avoiding out-of-order packets
in our design. A possible solution is to use a global atomic counter as
a sequence generator for packets and to let the per-class thread do a
light-weight sort based on it when the ooo flag is not set and multiple
packets from the same flow have been collected.
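
(A minimal sketch of that idea, again reusing the hypothetical
bandwidth_group/skb_list names from earlier in the thread and stashing
the sequence number in skb->cb:)

    static atomic64_t bwg_seq;          /* global sequence generator */

    struct bwg_cb {
            u64 seq;                    /* written into skb->cb at enqueue */
    };

    static void bwg_enqueue(struct bandwidth_group *bwg, struct sk_buff *skb)
    {
            struct bwg_cb *cb = (struct bwg_cb *)skb->cb;
            int cpu = get_cpu();

            /* Tag every packet of the class with a monotonic sequence
             * number before it enters the per-cpu queue. */
            cb->seq = atomic64_inc_return(&bwg_seq);
            skb_list_append(&bwg->queues[cpu], skb);
            put_cpu();
    }

    /* The per-class thread would then merge the per-cpu queues into
     * "drain" in ascending cb->seq order, restoring the enqueue order
     * even when a flow has migrated between cpus. */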

You're absolutely right that we could introduce a data structure with
better concurrency so that dedicated per-cpu/per-class threads are not
necessary, but the complexity of such a structure could be high. We
chose to put the complexity into the threading model here instead.
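
(For comparison, a rough sketch of the ptr_ring direction John
describes - per-cpu rings produced from the xmit path and consumed by a
single dequeuer, with the token bucket omitted; the bwg_pcpu structure
and helper names are made up for illustration:)

    #include <linux/ptr_ring.h>

    struct bwg_pcpu {
            struct ptr_ring ring;       /* single producer: this cpu */
    };

    /* Each cpu only produces into its own ring, so the producer lock
     * inside ptr_ring is effectively uncontended. */
    static int bwg_ring_enqueue(struct bwg_pcpu *pcpu, struct sk_buff *skb)
    {
            return ptr_ring_produce(&pcpu->ring, skb); /* -ENOSPC if full */
    }

    /* A single consumer walks all per-cpu rings and feeds the skbs to
     * the (not shown) token bucket before transmission. */
    static struct sk_buff *bwg_ring_consume(struct bwg_pcpu *pcpu)
    {
            return ptr_ring_consume(&pcpu->ring);
    }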


> But don't completely understand your design so might be missing
> something.
>
>>>
>>> Structure wise this ends up looking not too different from mqprio, the
>>> main difference though would be that this would be a classful qdisc
>>> and that the virtual qdiscs we have for the traffic classes would need
>>> to be replaced with actual qdiscs for handling the "drain" aspect of
>>> this.
>>
>> Structure wise it's similar to mqprio + rate limiting qdisc without
>> root lock, and replacing txq/flow level parallelism by cpu level
>> parallelism. I'm actually not sure about the similarity with busy
>> polling that Eric mentioned since I haven't read the slides yet.
>
> I pushed lockless qdisc patches again today and will repost when
> net-next opens. These plus a lockless version of tbf might be
> what you need. At one point I had a lockless tbf I can probably
> dig that up as well if its useful.
>
> https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html
>

Cool - thanks for sharing; we look forward to the lockless tbf
implementation!

> Thanks,
> John

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Per-CPU Queueing for QoS
  2017-11-14  4:53                 ` Tom Herbert
@ 2017-11-14 23:10                   ` Michael Ma
  0 siblings, 0 replies; 17+ messages in thread
From: Michael Ma @ 2017-11-14 23:10 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, Alexander Duyck, Eric Dumazet, Stephen Hemminger,
	Linux Kernel Network Developers, jianjun.duan, xiangning.yu

2017-11-13 20:53 GMT-08:00 Tom Herbert <tom@herbertland.com>:
> On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> On 11/13/2017 06:18 PM, Michael Ma wrote:
>>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.duyck@gmail.com>:
>>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0818@gmail.com> wrote:
>>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <stephen@networkplumber.org>:
>>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>>> Michael Ma <make0818@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any early feedback.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0818@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Currently txq/qdisc selection is based on flow hash so packets from
>>>>>>>>>> the same flow will follow the order when they enter qdisc/txq, which
>>>>>>>>>> avoids out-of-order problem.
>>>>>>>>>>
>>>>>>>>>> To improve the concurrency of QoS algorithm we plan to have multiple
>>>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>>>> enough the out-of-order situation in this polling thread should not be
>>>>>>>>>> that bad.
>>>>>>>>>>
>>>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>>>> enqueued to the same place. Then a per-class thread poll the queues
>>>>>>>>>> belonging to its class from all the cpus and aggregate them into
>>>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>>>> inevitably introduces potential out-of-order issue.
>>>>>>>>>>
>>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>>
>>>>>>>> In general, there is no meta design discussions in Linux development
>>>>>>>> Several developers have tried to do lockless
>>>>>>>> qdisc and similar things in the past.
>>>>>>>>
>>>>>>>> The devil is in the details, show us the code.
>>>>>>>
>>>>>>> Thanks for the response, Stephen. The code is fairly straightforward,
>>>>>>> we have a per-cpu per-class queue defined as this:
>>>>>>>
>>>>>>> struct bandwidth_group
>>>>>>> {
>>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>>     struct skb_list drain;
>>>>>>> }
>>>>>>>
>>>>>>> "drain" queue is used to aggregate per-cpu queues belonging to the
>>>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>>>
>>>>>>> int cpu;
>>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>>
>>>>>>> cpu = get_cpu();
>>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>>>
>>>>>>> Here we don't check the flow of the packet so if there is task
>>>>>>> migration or multiple threads sending packets through the same flow we
>>>>>>> theoretically can have packets enqueued to different queues and
>>>>>>> aggregated to the "drain" queue out of order.
>>>>>>>
>>>>>>> Also AFAIK there is no lockless htb-like qdisc implementation
>>>>>>> currently, however if there is already similar effort ongoing please
>>>>>>> let me know.
>>>>>>
>>>>>> The question I would have is how would this differ from using XPS w/
>>>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>>>> like mqprio?
>>>>>>
>>>>>> From what I can tell XPS would be able to get you your per-cpu
>>>>>> functionality, the benefit of it though would be that it would avoid
>>>>>> out-of-order issues for sockets originating on the local system. The
>>>>>> only thing I see as an issue right now is that the rate limiting with
>>>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>>>> DCB.
>>>>>
>>>>> I think one of the key point was in : " do busy polling from a per-class
>>>>> thread to drain these queues."
>>>>>
>>>>> I mentioned this idea in TX path of :
>>>>>
>>>>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>>>
>>>> I think this is a bit different from that idea in that the busy
>>>> polling is transferring packets from a per-cpu qdisc to a per traffic
>>>> class queueing discipline. Basically it would be a busy_poll version
>>>> of qdisc_run that would be transferring packets from one qdisc onto
>>>> another instead of attempting to transmit them directly.
>>>
>>> The idea is to have the whole part implemented as one classful qdisc
>>> and remove the qdisc root lock since all the synchronization will be
>>> handled internally (let's put aside that other filters/actions/qdiscs
>>> might still require a root lock for now).
>>>>
>>>> What I think is tripping me up is that I don't think this is even
>>>> meant to work with a multiqueue device. The description seems more
>>>> like a mqprio implementation feeding into a prio qdisc which is then
>>>> used for dequeue. To me it seems like this solution would be pulling
>>>> complexity off of the enqueue side and just adding it to the dequeue
>>>> side. The use of the "busy poll" is just to try to simplify things as
>>>> the agent would then be consolidating traffic from multiple per-cpu
>>>> queues onto one drain queue.
>>>
>>> We're essentially trying to spread the complexity from enqueue to
>>> different stages such as enqueue/aggregation and rate
>>> limiting/dequeue. Each stage will have different parallelisms. It
>>> should work with multi-queue device since txq selection can be the
>>> same as today. However our concern is that between enqueue and
>>> aggregation we have a small window which can allow packet oob, which
>>> is a sacrifice to better concurrency.
>>>
>>
>> So OOO will happen when application cpu migrates presumably? This is
>> normally prevented with skb ooo flag but it looks like you plan to
>> violate this somehow. I think a design using ptr_rings/skb_arrays
>> with bulk dequeue and a good concurrent token bucket ring would
>> suffice and also not introduce OOO packets.
>>
>> But don't completely understand your design so might be missing
>> something.
>>
> I'm missing something here as well. The ooo_okay check makes it
> straightforward to see whether we can change the sk TX queue. Does the
> design not use the TX queue any more?
>

We do use the TX queue, but TX queue selection is orthogonal to how we
process packets in the qdisc here. Packets belonging to the same flow
(without the ooo flag set) will still be steered to the same TXQ, but in
our qdisc they may land in different per-cpu queues if they are handled
by different cpus. They eventually reach the preselected TXQ; the
problem is that by the time they get there they may already be out of
order.

> Tom
>
>>>>
>>>> Structure wise this ends up looking not too different from mqprio, the
>>>> main difference though would be that this would be a classful qdisc
>>>> and that the virtual qdiscs we have for the traffic classes would need
>>>> to be replaced with actual qdiscs for handling the "drain" aspect of
>>>> this.
>>>
>>> Structure wise it's similar to mqprio + rate limiting qdisc without
>>> root lock, and replacing txq/flow level parallelism by cpu level
>>> parallelism. I'm actually not sure about the similarity with busy
>>> polling that Eric mentioned since I haven't read the slides yet.
>>
>> I pushed lockless qdisc patches again today and will repost when
>> net-next opens. These plus a lockless version of tbf might be
>> what you need. At one point I had a lockless tbf I can probably
>> dig that up as well if its useful.
>>
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg200244.html
>>
>> Thanks,
>> John

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread

Thread overview: 17+ messages
2017-11-10  1:20 Per-CPU Queueing for QoS Michael Ma
2017-11-12 21:43 ` Michael Ma
2017-11-13  0:14   ` Stephen Hemminger
2017-11-13 18:17     ` Michael Ma
2017-11-13 22:47       ` Alexander Duyck
2017-11-13 23:08         ` Eric Dumazet
2017-11-14  0:13           ` Alexander Duyck
2017-11-14  2:18             ` Michael Ma
2017-11-14  3:45               ` John Fastabend
2017-11-14  3:54                 ` Dave Taht
2017-11-14  4:53                 ` Tom Herbert
2017-11-14 23:10                   ` Michael Ma
2017-11-14 23:06                 ` Michael Ma
2017-11-14 22:59               ` Michael Ma
2017-11-14  2:05           ` Michael Ma
2017-11-14  2:06             ` Michael Ma
2017-11-14  2:02         ` Michael Ma
