* rte_sched library performance question
@ 2017-02-16 15:13 Zoltan Kiss
  2017-02-16 19:08 ` Dumitrescu, Cristian
  0 siblings, 1 reply; 3+ messages in thread
From: Zoltan Kiss @ 2017-02-16 15:13 UTC (permalink / raw)
  To: dev

Hi,

I'm experimenting a little bit with the scheduler library, and the
performance numbers I'm getting seem worse than I expected.
I'm sending 64-byte packets on a 10G interface to a separate thread, and
my simple test program (based on the qos_sched example) does the following:

while (1) {
        uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
                        (void **)flushbatch, FLUSH_SIZE);
        struct rte_mbuf **t = flushbatch;

        if (!ret) {
                /* This call is necessary to make sure the TX completed
                 * mbufs are returned to the pool even if there is
                 * nothing to transmit */
                rte_eth_tx_burst(it.portid, lcore, t, 0);
                continue;
        }
        rte_sched_port_enqueue(it.port, flushbatch, ret);
        ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);
        while (ret) {
                uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
                /* we cannot drop the packets, so re-send;
                 * update the number of packets still to be sent */
                ret -= n;
                t = &t[n];
        }
}

I run this on a separate thread; another thread does RX and feeds the
packets into the ring. When I comment out the enqueue and dequeue part of
the code (reducing it to a simple l2fwd), I can forward the entire ~14 Mpps
of traffic, whereas with the scheduler enabled I can only reach ~5.4 Mpps
at best. I've tried with a single pipe and with 4k pipes (using rand() to
distribute packets randomly across pipes; everything else, such as traffic
class, was set to 0), and it didn't make a difference. Is this expected?
I'm running this on a Xeon E5-2630 0 @ 2.30GHz.
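
(For context on the ~14 Mpps figure: a 64-byte frame plus 20 bytes of
preamble, SFD and inter-frame gap occupies 84 * 8 = 672 bit times on the
wire, so 10 Gbit/s works out to roughly 14.88 Mpps of line rate.)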

I've used the following configuration:

; port configuration [port]

[port]
frame overhead = 24
number of subports per port = 1
number of pipes per subport = 1024
queue sizes = 64 64 64 64

; Subport configuration

[subport 0]
tb rate = 1250000000; Bytes per second
tb size = 1000000000; Bytes
tc 0 rate = 1250000000;     Bytes per second
tc 1 rate = 1250000000;     Bytes per second
tc 2 rate = 1250000000;     Bytes per second
tc 3 rate = 1250000000;     Bytes per second
tc period = 10;             Milliseconds
tc oversubscription period = 1000;     Milliseconds

pipe 0-1024 = 0;        These pipes are configured with pipe profile 0

; Pipe configuration

[pipe profile 0]
tb rate = 1250000000; Bytes per second
tb size = 1000000000; Bytes

tc 0 rate = 1250000000; Bytes per second
tc 1 rate = 1250000000; Bytes per second
tc 2 rate = 1250000000; Bytes per second
tc 3 rate = 1250000000; Bytes per second
tc period = 10; Milliseconds

tc 0 oversubscription weight = 1
tc 1 oversubscription weight = 1
tc 2 oversubscription weight = 1
tc 3 oversubscription weight = 1

tc 0 wrr weights = 1 1 1 1
tc 1 wrr weights = 1 1 1 1
tc 2 wrr weights = 1 1 1 1
tc 3 wrr weights = 1 1 1 1
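
For reference, my understanding is that this config roughly maps onto the
rte_sched API as in the sketch below (field names taken from rte_sched.h
and the qos_sched example as far as I can tell; name, socket, rate and mtu
are not in the config file, so the values shown for them are placeholders):

static struct rte_sched_pipe_params pipe_profiles[] = {
        {       /* [pipe profile 0] */
                .tb_rate = 1250000000,
                .tb_size = 1000000000,
                .tc_rate = {1250000000, 1250000000, 1250000000, 1250000000},
                .tc_period = 10,
                .wrr_weights = {1, 1, 1, 1,  1, 1, 1, 1,
                                1, 1, 1, 1,  1, 1, 1, 1},
        },
};

static struct rte_sched_subport_params subport_params = {
        /* [subport 0] */
        .tb_rate = 1250000000,
        .tb_size = 1000000000,
        .tc_rate = {1250000000, 1250000000, 1250000000, 1250000000},
        .tc_period = 10,
};

static struct rte_sched_port_params port_params = {
        /* [port] */
        .name = "qos_port_0",           /* placeholder */
        .socket = 0,                    /* placeholder */
        .rate = 1250000000,             /* 10 Gbit/s expressed in bytes/s */
        .mtu = 1522,                    /* placeholder */
        .frame_overhead = 24,
        .n_subports_per_port = 1,
        .n_pipes_per_subport = 1024,
        .qsize = {64, 64, 64, 64},
        .pipe_profiles = pipe_profiles,
        .n_pipe_profiles = 1,
};

/* at init time: */
struct rte_sched_port *port = rte_sched_port_config(&port_params);
rte_sched_subport_config(port, 0, &subport_params);
for (uint32_t pipe = 0; pipe < port_params.n_pipes_per_subport; pipe++)
        rte_sched_pipe_config(port, 0, pipe, 0);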

Regards,

Zoltan


* Re: rte_sched library performance question
  2017-02-16 15:13 rte_sched library performance question Zoltan Kiss
@ 2017-02-16 19:08 ` Dumitrescu, Cristian
  2017-02-24 21:09   ` Zoltan Kiss
  0 siblings, 1 reply; 3+ messages in thread
From: Dumitrescu, Cristian @ 2017-02-16 19:08 UTC (permalink / raw)
  To: Zoltan Kiss, dev

Hi Zoltan,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
> Sent: Thursday, February 16, 2017 3:14 PM
> To: dev@dpdk.org
> Subject: [dpdk-dev] rte_sched library performance question
> 
> Hi,
> 
> I'm experimenting a little bit with the scheduler library, and the
> performance numbers I'm getting seem worse than I expected.
> I'm sending 64-byte packets on a 10G interface to a separate thread, and
> my simple test program (based on the qos_sched example) does the
> following:
> 
> while (1) {
>         uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
>                         (void **)flushbatch, FLUSH_SIZE);
>         struct rte_mbuf **t = flushbatch;
> 
>         if (!ret) {
>                 /* This call is necessary to make sure the TX completed
>                  * mbufs are returned to the pool even if there is
>                  * nothing to transmit */
>                 rte_eth_tx_burst(it.portid, lcore, t, 0);
>                 continue;
>         }
>         rte_sched_port_enqueue(it.port, flushbatch, ret);
>         ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);

Looks to me like both the scheduler enqueue and dequeue burst sizes are equal to FLUSH_SIZE, right?
In that case, you are always dequeuing exactly the packets that you just enqueued, and the scheduler dequeue has to work really hard to find exactly those FLUSH_SIZE queues, each of which holds a single packet at this point.

This is why the enqueue burst size should be bigger than the dequeue burst size. Basically, you let the reservoir fill up to a reasonable level before you start pouring water into your glass, if you want to fill the glass quickly.

Typical values used:
-for vector PMD: (enqueue = 32, dequeue = 24), (32, 28), (32, 16), etc
-for scalar PMD: (64, 48), (64, 32), ... We used (256, 248) for VPP
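
Something along these lines, as a rough sketch only (it reuses it.ring,
it.port, it.portid and lcore from your snippet; the 64/32 split is just an
example taken from the scalar-PMD range above, not a tuned value):

#define QOS_ENQ_BURST 64   /* fill burst: read up to 64 packets from the ring */
#define QOS_DEQ_BURST 32   /* drain burst: ask the scheduler for at most 32 */

struct rte_mbuf *enq[QOS_ENQ_BURST];
struct rte_mbuf *deq[QOS_DEQ_BURST];

while (1) {
        /* keep topping up the scheduler queues with the larger burst */
        uint16_t n_rx = rte_ring_sc_dequeue_burst(it.ring, (void **)enq,
                        QOS_ENQ_BURST);
        if (n_rx)
                rte_sched_port_enqueue(it.port, enq, n_rx);

        /* ...and drain with the smaller burst, so the dequeue usually
         * finds queues that already hold several packets */
        uint16_t n_tx = rte_sched_port_dequeue(it.port, deq, QOS_DEQ_BURST);
        struct rte_mbuf **t = deq;

        if (!n_tx) {
                /* still flush TX completions when there is nothing to send */
                rte_eth_tx_burst(it.portid, lcore, t, 0);
                continue;
        }
        while (n_tx) {
                uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, n_tx);
                n_tx -= n;
                t = &t[n];
        }
}

This way the dequeue typically pulls from a handful of queues holding
several packets each, instead of hunting for FLUSH_SIZE queues with one
packet each.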

>         while (ret) {
>                 uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
>                 /* we cannot drop the packets, so re-send;
>                  * update the number of packets still to be sent */
>                 ret -= n;
>                 t = &t[n];
>         }
> }
> 
> I run this on a separate thread; another thread does RX and feeds the
> packets into the ring. When I comment out the enqueue and dequeue part of
> the code (reducing it to a simple l2fwd), I can forward the entire ~14 Mpps
> of traffic, whereas with the scheduler enabled I can only reach ~5.4 Mpps
> at best. I've tried with a single pipe and with 4k pipes (using rand() to
> distribute packets randomly across pipes; everything else, such as traffic
> class, was set to 0), and it didn't make a difference. Is this expected?
> I'm running this on a Xeon E5-2630 0 @ 2.30GHz.
> 
> I've used the following configuration:
> 
> ; port configuration [port]
> 
> [port]
> frame overhead = 24
> number of subports per port = 1
> number of pipes per subport = 1024
> queue sizes = 64 64 64 64
> 
> ; Subport configuration
> 
> [subport 0]
> tb rate = 1250000000; Bytes per second
> tb size = 1000000000; Bytes
> tc 0 rate = 1250000000;     Bytes per second
> tc 1 rate = 1250000000;     Bytes per second
> tc 2 rate = 1250000000;     Bytes per second
> tc 3 rate = 1250000000;     Bytes per second
> tc period = 10;             Milliseconds
> tc oversubscription period = 1000;     Milliseconds
> 
> pipe 0-1024 = 0;        These pipes are configured with pipe profile 0
> 
> ; Pipe configuration
> 
> [pipe profile 0]
> tb rate = 1250000000; Bytes per second
> tb size = 1000000000; Bytes
> 
> tc 0 rate = 1250000000; Bytes per second
> tc 1 rate = 1250000000; Bytes per second
> tc 2 rate = 1250000000; Bytes per second
> tc 3 rate = 1250000000; Bytes per second
> tc period = 10; Milliseconds
> 
> tc 0 oversubscription weight = 1
> tc 1 oversubscription weight = 1
> tc 2 oversubscription weight = 1
> tc 3 oversubscription weight = 1
> 
> tc 0 wrr weights = 1 1 1 1
> tc 1 wrr weights = 1 1 1 1
> tc 2 wrr weights = 1 1 1 1
> tc 3 wrr weights = 1 1 1 1
> 
> Regards,
> 
> Zoltan

Regards,
Cristian


* Re: rte_sched library performance question
  2017-02-16 19:08 ` Dumitrescu, Cristian
@ 2017-02-24 21:09   ` Zoltan Kiss
  0 siblings, 0 replies; 3+ messages in thread
From: Zoltan Kiss @ 2017-02-24 21:09 UTC (permalink / raw)
  To: Dumitrescu, Cristian, dev

On 16/02/17 20:08, Dumitrescu, Cristian wrote:
> Hi Zoltan,
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zoltan Kiss
>> Sent: Thursday, February 16, 2017 3:14 PM
>> To: dev@dpdk.org
>> Subject: [dpdk-dev] rte_sched library performance question
>>
>> Hi,
>>
>> I'm experimenting a little bit with the scheduler library, and the
>> performance numbers I'm getting seem worse than I expected.
>> I'm sending 64-byte packets on a 10G interface to a separate thread, and
>> my simple test program (based on the qos_sched example) does the
>> following:
>>
>> while (1) {
>>         uint16_t ret = rte_ring_sc_dequeue_burst(it.ring,
>>                         (void **)flushbatch, FLUSH_SIZE);
>>         struct rte_mbuf **t = flushbatch;
>>
>>         if (!ret) {
>>                 /* This call is necessary to make sure the TX completed
>>                  * mbufs are returned to the pool even if there is
>>                  * nothing to transmit */
>>                 rte_eth_tx_burst(it.portid, lcore, t, 0);
>>                 continue;
>>         }
>>         rte_sched_port_enqueue(it.port, flushbatch, ret);
>>         ret = rte_sched_port_dequeue(it.port, flushbatch, FLUSH_SIZE);
> Looks to me like both the scheduler enqueue and dequeue burst sizes are equal to FLUSH_SIZE, right?
> In that case, you are always dequeuing exactly the packets that you just enqueued, and the scheduler dequeue has to work really hard to find exactly those FLUSH_SIZE queues, each of which holds a single packet at this point.
>
> This is why the enqueue burst size should be bigger than the dequeue burst size. Basically, you let the reservoir fill up to a reasonable level before you start pouring water into your glass, if you want to fill the glass quickly.
>
> Typical values used:
> -for vector PMD: (enqueue = 32, dequeue = 24), (32, 28), (32, 16), etc
> -for scalar PMD: (64, 48), (64, 32), ... We used (256, 248) for VPP

Thanks, it helped my case too. Btw. it would be good to link this
document somewhere in the DPDK docs, as it contains a lot of good
information about the scheduler:

https://networkbuilders.intel.com/docs/Network_Builders_RA_NFV_QoS_Aug2014.pdf

>
>>         while (ret) {
>>                 uint16_t n = rte_eth_tx_burst(it.portid, lcore, t, ret);
>>                 /* we cannot drop the packets, so re-send;
>>                  * update the number of packets still to be sent */
>>                 ret -= n;
>>                 t = &t[n];
>>         }
>> }
>>
>> I run this on a separate thread; another thread does RX and feeds the
>> packets into the ring. When I comment out the enqueue and dequeue part of
>> the code (reducing it to a simple l2fwd), I can forward the entire ~14 Mpps
>> of traffic, whereas with the scheduler enabled I can only reach ~5.4 Mpps
>> at best. I've tried with a single pipe and with 4k pipes (using rand() to
>> distribute packets randomly across pipes; everything else, such as traffic
>> class, was set to 0), and it didn't make a difference. Is this expected?
>> I'm running this on a Xeon E5-2630 0 @ 2.30GHz.
>>
>> I've used the following configuration:
>>
>> ; port configuration [port]
>>
>> [port]
>> frame overhead = 24
>> number of subports per port = 1
>> number of pipes per subport = 1024
>> queue sizes = 64 64 64 64
>>
>> ; Subport configuration
>>
>> [subport 0]
>> tb rate = 1250000000; Bytes per second
>> tb size = 1000000000; Bytes
>> tc 0 rate = 1250000000;     Bytes per second
>> tc 1 rate = 1250000000;     Bytes per second
>> tc 2 rate = 1250000000;     Bytes per second
>> tc 3 rate = 1250000000;     Bytes per second
>> tc period = 10;             Milliseconds
>> tc oversubscription period = 1000;     Milliseconds
>>
>> pipe 0-1024 = 0;        These pipes are configured with pipe profile 0
>>
>> ; Pipe configuration
>>
>> [pipe profile 0]
>> tb rate = 1250000000; Bytes per second
>> tb size = 1000000000; Bytes
>>
>> tc 0 rate = 1250000000; Bytes per second
>> tc 1 rate = 1250000000; Bytes per second
>> tc 2 rate = 1250000000; Bytes per second
>> tc 3 rate = 1250000000; Bytes per second
>> tc period = 10; Milliseconds
>>
>> tc 0 oversubscription weight = 1
>> tc 1 oversubscription weight = 1
>> tc 2 oversubscription weight = 1
>> tc 3 oversubscription weight = 1
>>
>> tc 0 wrr weights = 1 1 1 1
>> tc 1 wrr weights = 1 1 1 1
>> tc 2 wrr weights = 1 1 1 1
>> tc 3 wrr weights = 1 1 1 1
>>
>> Regards,
>>
>> Zoltan
> Regards,
> Cristian

