* [PATCH] ixgbe: take online CPU number as MQ max limit when  alloc_etherdev_mq()
@ 2016-05-13  5:56 ` Ethan Zhao
  0 siblings, 0 replies; 21+ messages in thread
From: Ethan Zhao @ 2016-05-13  5:56 UTC (permalink / raw)
  To: jeffrey.t.kirsher, jesse.brandeburg, shannon.nelson,
	carolyn.wyborny, donald.c.skidmore, bruce.w.allan, john.ronciak,
	mitch.a.williams, intel-wired-lan, netdev
  Cc: linux-kernel, ethan.kernel, ethan.zhao

Allocating 64 Tx/Rx queues by default doesn't benefit performance when
fewer CPUs are assigned, especially when DCB is enabled. So take
num_online_cpus() as the upper limit, and, to make sure every TC gets
at least one queue, take MAX_TRAFFIC_CLASS as the lower limit of the
queue count.

Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7df3fe2..1f9769c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9105,6 +9105,10 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		indices = IXGBE_MAX_RSS_INDICES;
 #endif
 	}
+	/* Don't allocate too more queues than online cpus number */
+	indices = min_t(int, indices, num_online_cpus());
+	/* To make sure TC works, allocate at least 1 queue per TC */
+	indices = max_t(int, indices, MAX_TRAFFIC_CLASS);
 
 	netdev = alloc_etherdev_mq(sizeof(struct ixgbe_adapter), indices);
 	if (!netdev) {
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-13  5:56 ` [Intel-wired-lan] " Ethan Zhao
@ 2016-05-13 12:52   ` Sergei Shtylyov
  -1 siblings, 0 replies; 21+ messages in thread
From: Sergei Shtylyov @ 2016-05-13 12:52 UTC (permalink / raw)
  To: Ethan Zhao, jeffrey.t.kirsher, jesse.brandeburg, shannon.nelson,
	carolyn.wyborny, donald.c.skidmore, bruce.w.allan, john.ronciak,
	mitch.a.williams, intel-wired-lan, netdev
  Cc: linux-kernel, ethan.kernel

Hello.

On 5/13/2016 8:56 AM, Ethan Zhao wrote:

> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less

    Performance.

> CPUs were assigned. especially when DCB is enabled, so we should take
> num_online_cpus() as top limit, and aslo to make sure every TC has

    Also.

> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
> number.
>
> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 7df3fe2..1f9769c 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -9105,6 +9105,10 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>  		indices = IXGBE_MAX_RSS_INDICES;
>  #endif
>  	}
> +	/* Don't allocate too more queues than online cpus number */

    "Too" not needed here. CPUs.

> +	indices = min_t(int, indices, num_online_cpus());
> +	/* To make sure TC works, allocate at least 1 queue per TC */
> +	indices = max_t(int, indices, MAX_TRAFFIC_CLASS);
>
>  	netdev = alloc_etherdev_mq(sizeof(struct ixgbe_adapter), indices);
>  	if (!netdev) {

MBR, Sergei

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-13  5:56 ` [Intel-wired-lan] " Ethan Zhao
@ 2016-05-13 16:46   ` Alexander Duyck
  -1 siblings, 0 replies; 21+ messages in thread
From: Alexander Duyck @ 2016-05-13 16:46 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Jeff Kirsher, Brandeburg, Jesse, shannon nelson, Carolyn Wyborny,
	Skidmore, Donald C, Bruce W Allan, John Ronciak, Mitch Williams,
	intel-wired-lan, Netdev, linux-kernel, Ethan Zhao

On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com> wrote:
> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
> CPUs were assigned. especially when DCB is enabled, so we should take
> num_online_cpus() as top limit, and aslo to make sure every TC has
> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
> number.
>
> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>

What is the harm in allowing the user to specify up to 64 queues if
they want to?  Also what is your opinion based on?  In the case of RSS
traffic the upper limit is only 16 on older NICs, but last I knew the
latest X550 can support more queues for RSS.  Have you only been
testing on older NICs or did you test on the latest hardware as well?

If you want to control the number of queues allocated in a given
configuration you should look at the code over in the ixgbe_lib.c, not
ixgbe_main.c.  All you are doing with this patch is denying the user
choice, as they are then not allowed to set more queues, even if they
find your decision was wrong for their configuration.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-13 16:46   ` [Intel-wired-lan] " Alexander Duyck
@ 2016-05-16  2:59     ` ethan zhao
  -1 siblings, 0 replies; 21+ messages in thread
From: ethan zhao @ 2016-05-16  2:59 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jeff Kirsher, Brandeburg, Jesse, shannon nelson, Carolyn Wyborny,
	Skidmore, Donald C, Bruce W Allan, John Ronciak, Mitch Williams,
	intel-wired-lan, Netdev, linux-kernel, Ethan Zhao

Alexander,

On 2016/5/14 0:46, Alexander Duyck wrote:
> On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com> wrote:
>> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
>> CPUs were assigned. especially when DCB is enabled, so we should take
>> num_online_cpus() as top limit, and aslo to make sure every TC has
>> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
>> number.
>>
>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> What is the harm in allowing the user to specify up to 64 queues if
> they want to?  Also what is your opinion based on?  In the case of RSS

  There is no module parameter to specify the queue number in this
  upstream ixgbe driver.  Why would anyone specify more queues than
  num_online_cpus() via ethtool?  I couldn't figure out the benefit of
  doing that.

  But if DCB is turned on after loading, the queues would be 64/64,
  which doesn't make sense if only 16 CPUs are assigned.
> traffic the upper limit is only 16 on older NICs, but last I knew the
> latest X550 can support more queues for RSS.  Have you only been
> testing on older NICs or did you test on the latest hardware as well?
   Could more RSS queues than num_online_cpus() bring better
   performance?  Our test results show otherwise.  And even though
   memory cost is not an issue for most of the expensive servers, it
   is not negligible for all of them.

>
> If you want to control the number of queues allocated in a given
> configuration you should look at the code over in the ixgbe_lib.c, not
   Yes, RSS, RSS with SR-IOV, FCoE, DCB etc. use different queue
   calculation algorithms, but they all take the dev queues allocated
   in alloc_etherdev_mq() as the upper limit.

  If we set 64 as the default here, DCB says "oh, there are 64 there,
  I could use them".
> ixgbe_main.c.  All you are doing with this patch is denying the user
> choice with this change as they then are not allowed to set more
   Yes, the purpose is to deny a configuration that brings no benefit.
> queues.  Even if they find your decision was wrong for their
> configuration.
>
> - Alex
>
  Thanks,
  Ethan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-13 12:52   ` [Intel-wired-lan] " Sergei Shtylyov
@ 2016-05-16  5:38     ` ethan zhao
  -1 siblings, 0 replies; 21+ messages in thread
From: ethan zhao @ 2016-05-16  5:38 UTC (permalink / raw)
  To: Sergei Shtylyov, jeffrey.t.kirsher, jesse.brandeburg,
	shannon.nelson, carolyn.wyborny, donald.c.skidmore,
	bruce.w.allan, john.ronciak, mitch.a.williams, intel-wired-lan,
	netdev
  Cc: linux-kernel, ethan.kernel

Thanks for your review.

Ethan

On 2016/5/13 20:52, Sergei Shtylyov wrote:
> Hello.
>
> On 5/13/2016 8:56 AM, Ethan Zhao wrote:
>
>> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
>
>    Performance.
>
>> CPUs were assigned. especially when DCB is enabled, so we should take
>> num_online_cpus() as top limit, and aslo to make sure every TC has
>
>    Also.
>
>> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
>> number.
>>
>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>> ---
>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
>> b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> index 7df3fe2..1f9769c 100644
>> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
>> @@ -9105,6 +9105,10 @@ static int ixgbe_probe(struct pci_dev *pdev, 
>> const struct pci_device_id *ent)
>>          indices = IXGBE_MAX_RSS_INDICES;
>>  #endif
>>      }
>> +    /* Don't allocate too more queues than online cpus number */
>
>    "Too" not needed here. CPUs.
>
>> +    indices = min_t(int, indices, num_online_cpus());
>> +    /* To make sure TC works, allocate at least 1 queue per TC */
>> +    indices = max_t(int, indices, MAX_TRAFFIC_CLASS);
>>
>>      netdev = alloc_etherdev_mq(sizeof(struct ixgbe_adapter), indices);
>>      if (!netdev) {
>
> MBR, Sergei
>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-16  2:59     ` [Intel-wired-lan] " ethan zhao
@ 2016-05-16 16:09       ` Alexander Duyck
  -1 siblings, 0 replies; 21+ messages in thread
From: Alexander Duyck @ 2016-05-16 16:09 UTC (permalink / raw)
  To: ethan zhao
  Cc: Jeff Kirsher, Brandeburg, Jesse, shannon nelson, Carolyn Wyborny,
	Skidmore, Donald C, Bruce W Allan, John Ronciak, Mitch Williams,
	intel-wired-lan, Netdev, linux-kernel, Ethan Zhao

On Sun, May 15, 2016 at 7:59 PM, ethan zhao <ethan.zhao@oracle.com> wrote:
> Alexander,
>
> On 2016/5/14 0:46, Alexander Duyck wrote:
>>
>> On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com>
>> wrote:
>>>
>>> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
>>> CPUs were assigned. especially when DCB is enabled, so we should take
>>> num_online_cpus() as top limit, and aslo to make sure every TC has
>>> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
>>> number.
>>>
>>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>>
>> What is the harm in allowing the user to specify up to 64 queues if
>> they want to?  Also what is your opinion based on?  In the case of RSS
>
>
>  There is no module parameter to specify queue number in this upstream ixgbe
>   driver.  for what to specify more queues than num_online_cpus() via
> ethtool ?
>  I couldn't figure out the benefit to do that.

There are a number of benefits to being able to set the number of
queues based on the user desire.  Just because you can't figure out
how to use a feature is no reason to break it so that nobody else can.

>  But if DCB is turned on after loading, the queues would be 64/64, that
> doesn't
>  make sense if only 16 CPUs assigned.

It makes perfect sense.  What is happening is that it is allocating an
RSS set per TC.  So what you should have is either 4 queues per CPU
with each one belonging to a different TC, or 4 queues per CPU with
the first 8 CPUs covering TCs 0-3, and the last 8 CPUs covering TCs
4-7.

I can see how the last setup might actually be a bit confusing.  To
that end you might consider modifying ixgbe_acquire_msix_vectors to use
the number of RSS queues instead of the number of Rx queues in the
case of DCB.  Then you would get more consistent behavior with each
q_vector or CPU (if num_q_vectors == num_online_cpus()) having one
queue belonging to each TC.  You would end up with either 8 or 16
q_vectors hosting 8 or 4 queues so that they can process DCB requests
without having to worry about head of line blocking.
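
For illustration, a minimal sketch of the vector-sizing idea described
above (the helper name and signature are hypothetical, not the actual
ixgbe_acquire_msix_vectors() code):

/*
 * Hypothetical sketch: in DCB mode, request one MSI-X vector per RSS
 * queue of a TC (so each q_vector hosts one queue from every TC)
 * rather than one per Rx queue, and never ask for more vectors than
 * there are online CPUs to service them.
 */
static unsigned int pick_msix_vector_count(unsigned int num_rx_queues,
                                           unsigned int rss_per_tc,
                                           unsigned int online_cpus,
                                           int dcb_enabled)
{
        unsigned int want = dcb_enabled ? rss_per_tc : num_rx_queues;

        return want < online_cpus ? want : online_cpus;
}

With 8 TCs and 8 RSS queues per TC this asks for 8 vectors hosting 8
queues each; with 4 TCs and 16 RSS queues per TC, 16 vectors hosting 4
queues each, matching the layouts described above.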

>> traffic the upper limit is only 16 on older NICs, but last I knew the
>> latest X550 can support more queues for RSS.  Have you only been
>> testing on older NICs or did you test on the latest hardware as well?
>
>   More queues for RSS than num_online_cpus() could bring better performance
> ?
>   Test result shows false result.  even memory cost is not an issue for most
> of
>   the expensive servers, but not for all.

The feature is called DCB.  What it allows for is the avoidance of
head-of-line blocking.  So when you have DCB enabled you should have a
set of queues for each possible RSS result so that if you get a higher
priority request on one of the queues it can use the higher priority
queue instead of having to rely on the lower priority queue to
receive traffic.  You cannot do that without allocating a queue for
each TC, and reducing the number of RSS queues supported on the system
will hurt performance.  Therefore on a 16 CPU system it is very useful
to be able to allocate 4 queues per RSS flow as that way you get
optimal CPU distribution and can still avoid head-of-line blocking via
DCB.
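
For illustration, a simplified sketch (not the actual hardware or
driver queue-selection logic) of how giving each TC its own block of
RSS queues avoids head-of-line blocking:

/*
 * Simplified illustration: with DCB each traffic class gets its own
 * contiguous block of RSS queues, so a packet's priority (TC) selects
 * the block and the RSS hash selects a queue inside that block.
 * Higher-priority traffic therefore never waits behind a queue owned
 * by a lower-priority class.
 */
static unsigned int pick_rx_queue(unsigned int tc,
                                  unsigned int rss_per_tc,
                                  unsigned int rss_hash)
{
        return tc * rss_per_tc + (rss_hash % rss_per_tc);
}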

>>
>> If you want to control the number of queues allocated in a given
>> configuration you should look at the code over in the ixgbe_lib.c, not
>
>   Yes,  RSS,  RSS with SRIOV, FCoE, DCB etc uses different queues
> calculation algorithm.
>   But they all take the dev queues allocated in alloc_etherdev_mq() as upper
> limit.
>
>  If we set 64 as default here, DCB would says "oh, there is 64 there, I
> could use it"

Right.  But the deciding factor for DCB is RSS which is already
limited by the number of CPUs.  If it is allocating 64 queues it is
because there are either at least 8 CPUs present and 8 TCs being
allocated per CPU, or there are at least 16 CPUs present and it is
allocating 4 TCs per CPU.

>>
>> ixgbe_main.c.  All you are doing with this patch is denying the user
>> choice with this change as they then are not allowed to set more
>
>   Yes, it is purposed to deny configuration that doesn't benefit.

Doesn't benefit who?  It is obvious you don't understand how DCB is
meant to work since you are assuming the queues are throw-away.
Anyone who makes use of the ability to prioritize their traffic would
likely have a different opinion.

>> queues.  Even if they find your decision was wrong for their
>> configuration.
>>
>> - Alex
>>
>  Thanks,
>  Ethan

Your response clearly points out you don't understand DCB.  I suggest
you take another look at how things are actually being configured.  I
believe what you will find is that the current implementation is
basing things on the number of online CPUs already based on the
ring_feature[RING_F_RSS].limit value.  All that is happening is that
you are getting that value multiplied by the number of TCs and the RSS
value is reduced if the result is greater than 64 based on the maximum
number of queues.
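
For illustration, a minimal sketch (assumed numbers, not the actual
ixgbe_lib.c code) of the arithmetic described above: the per-TC RSS
limit already tracks the online CPU count, is multiplied by the number
of TCs, and is scaled back if the product would exceed the 64-queue
maximum:

/*
 * Illustrative only.  The RSS limit defaults to the online CPU count,
 * is multiplied by the number of traffic classes, and the per-TC RSS
 * set shrinks if the total would exceed the assumed 64-queue device
 * maximum.
 */
static unsigned int dcb_queue_count(unsigned int online_cpus,
                                    unsigned int num_tcs,
                                    unsigned int rss_limit)
{
        unsigned int max_queues = 64;
        unsigned int rss = rss_limit < online_cpus ? rss_limit : online_cpus;

        while (rss > 1 && rss * num_tcs > max_queues)
                rss--;

        return rss * num_tcs;   /* e.g. 8 TCs x 8 queues = 64 */
}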

With your code on an 8 core system you go from being able to perform
RSS over 8 queues to only being able to perform RSS over 1 queue when
you enable DCB.  There was a bug a long time ago where this actually
didn't provide any gain because the interrupt allocation was binding
all 8 RSS queues to a single q_vector, but that has long since been
fixed and what you should be seeing is that RSS will spread traffic
across either 8 or 16 queues when DCB is enabled in either 8 or 4 TC
mode.

My advice would be to use a netperf TCP_CRR test and watch what queues
and what interrupts traffic is being delivered to.  Then if you have
DCB enabled on both ends you might try changing the priority of your
netperf session and watch what happens when you switch between TCs.
What you should find is that you will shift between groups of queues
and as you do so you should not have any active queues overlapping
unless you have less interrupts than CPUs.

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Intel-wired-lan] [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-16 16:09       ` [Intel-wired-lan] " Alexander Duyck
@ 2016-05-16 17:14         ` John Fastabend
  -1 siblings, 0 replies; 21+ messages in thread
From: John Fastabend @ 2016-05-16 17:14 UTC (permalink / raw)
  To: Alexander Duyck, ethan zhao
  Cc: linux-kernel, intel-wired-lan, Netdev, Ethan Zhao

[...]

>>> ixgbe_main.c.  All you are doing with this patch is denying the user
>>> choice with this change as they then are not allowed to set more
>>
>>   Yes, it is purposed to deny configuration that doesn't benefit.
> 
> Doesn't benefit who?  It is obvious you don't understand how DCB is
> meant to work since you are assuming the queues are throw-away.
> Anyone who makes use of the ability to prioritize their traffic would
> likely have a different opinion.


+1 this is actually needed so that when DCB is turned on we can both
prioritize between TCs (the DCB feature) and also not see a
performance degradation with just a single TC transmitting.

If we break this (and it's happened occasionally) we end up with bug
reports, so it's clear to me folks care about it.

> 
>>> queues.  Even if they find your decision was wrong for their
>>> configuration.
>>>
>>> - Alex
>>>
>>  Thanks,
>>  Ethan
> 
> Your response clearly points out you don't understand DCB.  I suggest
> you take another look at how things are actually being configured.  I
> believe what you will find is that the current implementation is
> basing things on the number of online CPUs already based on the
> ring_feature[RING_F_RSS].limit value.  All that is happening is that
> you are getting that value multiplied by the number of TCs and the RSS
> value is reduced if the result is greater than 64 based on the maximum
> number of queues.
> 
> With your code on an 8 core system you go from being able to perform
> RSS over 8 queues to only being able to perform RSS over 1 queue when
> you enable DCB.  There was a bug a long time ago where this actually
> didn't provide any gain because the interrupt allocation was binding
> all 8 RSS queues to a single q_vector, but that has long since been
> fixed and what you should be seeing is that RSS will spread traffic
> across either 8 or 16 queues when DCB is enabled in either 8 or 4 TC
> mode.
> 
> My advice would be to use a netperf TCP_CRR test and watch what queues
> and what interrupts traffic is being delivered to.  Then if you have
> DCB enabled on both ends you might try changing the priority of your
> netperf session and watch what happens when you switch between TCs.
> What you should find is that you will shift between groups of queues
> and as you do so you should not have any active queues overlapping
> unless you have less interrupts than CPUs.
> 

Yep.

Thanks,
John

> - Alex
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan
> 

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when  alloc_etherdev_mq()
  2016-05-13  5:56 ` [Intel-wired-lan] " Ethan Zhao
@ 2016-05-16 20:58   ` Jeff Kirsher
  -1 siblings, 0 replies; 21+ messages in thread
From: Jeff Kirsher @ 2016-05-16 20:58 UTC (permalink / raw)
  To: Ethan Zhao, jesse.brandeburg, shannon.nelson, carolyn.wyborny,
	donald.c.skidmore, bruce.w.allan, john.ronciak, mitch.a.williams,
	intel-wired-lan, netdev
  Cc: linux-kernel, ethan.kernel

On Fri, 2016-05-13 at 14:56 +0900, Ethan Zhao wrote:
> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
> CPUs were assigned. especially when DCB is enabled, so we should take
> num_online_cpus() as top limit, and aslo to make sure every TC has
> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
> number.
> 
> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4 ++++
>  1 file changed, 4 insertions(+)

Dropping this patch based on Alex's and John's feedback.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-16 16:09       ` [Intel-wired-lan] " Alexander Duyck
@ 2016-05-17  9:00         ` ethan zhao
  -1 siblings, 0 replies; 21+ messages in thread
From: ethan zhao @ 2016-05-17  9:00 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jeff Kirsher, Brandeburg, Jesse, shannon nelson, Carolyn Wyborny,
	Skidmore, Donald C, Bruce W Allan, John Ronciak, Mitch Williams,
	intel-wired-lan, Netdev, linux-kernel, Ethan Zhao

Alexander,

On 2016/5/17 0:09, Alexander Duyck wrote:
> On Sun, May 15, 2016 at 7:59 PM, ethan zhao <ethan.zhao@oracle.com> wrote:
>> Alexander,
>>
>> On 2016/5/14 0:46, Alexander Duyck wrote:
>>> On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com>
>>> wrote:
>>>> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
>>>> CPUs were assigned. especially when DCB is enabled, so we should take
>>>> num_online_cpus() as top limit, and aslo to make sure every TC has
>>>> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
>>>> number.
>>>>
>>>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>>> What is the harm in allowing the user to specify up to 64 queues if
>>> they want to?  Also what is your opinion based on?  In the case of RSS
>>
>>   There is no module parameter to specify queue number in this upstream ixgbe
>>    driver.  for what to specify more queues than num_online_cpus() via
>> ethtool ?
>>   I couldn't figure out the benefit to do that.
> There are a number of benefits to being able to set the number of
> queues based on the user desire.  Just because you can't figure out
> how to use a feature is no reason to break it so that nobody else can.
>
>>   But if DCB is turned on after loading, the queues would be 64/64, that
>> doesn't
>>   make sense if only 16 CPUs assigned.
> It makes perfect sense.  What is happening is that it is allocating an
> RSS set per TC.  So what you should have is either 4 queues per CPU
> with each one belonging to a different TC, or 4 queues per CPU with
> the first 8 CPUs covering TCs 0-3, and the last 8 CPUs covering TCs
> 4-7.
>
> I can see how the last setup might actually be a bit confusing.  To
> that end you might consider modifying ixgbe_acquire_msix_vectors uses
> the number of RSS queues instead of the number of Rx queues in the


> case of DCB.  Then you would get more consistent behavior with each
> q_vector or CPU (if num_q_vecotrs == num_online_cpus()) having one
> queue belonging to each TC.  You would end up with either 8 or 16
> q_vectors hosting 8 or 4 queues so that they can process DCB requests
> without having to worry about head of line blocking.
>
>>> traffic the upper limit is only 16 on older NICs, but last I knew the
>>> latest X550 can support more queues for RSS.  Have you only been
>>> testing on older NICs or did you test on the latest hardware as well?
>>    More queues for RSS than num_online_cpus() could bring better performance
>> ?
>>    Test result shows false result.  even memory cost is not an issue for most
>> of
>>    the expensive servers, but not for all.
> The feature is called DCB.  What it allows for is the avoidance of
> head-of-line blocking.  So when you have DCB enabled you should have a
> set of queues for each possible RSS result so that if you get a higher
> priority request on one of the queues it can use the higher priority
> queue instead of having to rely on the the lower priority queue to
> receive traffic.  You cannot do that without allocating a queue for
> each TC, and reducing the number of RSS queues supported on the system
> will hurt performance.  Therefore on a 16 CPU system it is very useful
> to be able to allocate 4 queues per RSS flow as that way you get
> optimal CPU distribution and can still avoid head-of-line blocking via
> DCB.
>
>>> If you want to control the number of queues allocated in a given
>>> configuration you should look at the code over in the ixgbe_lib.c, not
>>    Yes,  RSS,  RSS with SRIOV, FCoE, DCB etc uses different queues
>> calculation algorithm.
>>    But they all take the dev queues allocated in alloc_etherdev_mq() as upper
>> limit.
>>
>>   If we set 64 as default here, DCB would says "oh, there is 64 there, I
>> could use it"
> Right.  But the deciding factor for DCB is RSS which is already
> limited by the number of CPUs.  If it is allocating 64 queues it is
> because there are either at least 8 CPUs present and 8 TCs being
> allocated per CPU, or there are at least 16 queues present and it is
> allocating 4 TCs per CPU.
>
>>> ixgbe_main.c.  All you are doing with this patch is denying the user
>>> choice with this change as they then are not allowed to set more
>>    Yes, it is purposed to deny configuration that doesn't benefit.
> Doesn't benefit who?  It is obvious you don't understand how DCB is
> meant to work since you are assuming the queues are throw-away.
> Anyone who makes use of the ability to prioritize their traffic would
> likely have a different opinion.
>
>>> queues.  Even if they find your decision was wrong for their
>>> configuration.
>>>
>>> - Alex
>>>
>>   Thanks,
>>   Ethan
> Your response clearly points out you don't understand DCB.  I suggest
> you take another look at how things are actually being configured.  I
> believe what you will find is that the current implementation is
> basing things on the number of online CPUs already based on the
> ring_feature[RING_F_RSS].limit value.  All that is happening is that
> you are getting that value multiplied by the number of TCs and the RSS
> value is reduced if the result is greater than 64 based on the maximum
> number of queues.
>
> With your code on an 8 core system you go from being able to perform
> RSS over 8 queues to only being able to perform RSS over 1 queue when
> you enable DCB.  There was a bug a long time ago where this actually
> didn't provide any gain because the interrupt allocation was binding
> all 8 RSS queues to a single q_vector, but that has long since been
> fixed and what you should be seeing is that RSS will spread traffic
> across either 8 or 16 queues when DCB is enabled in either 8 or 4 TC
Here is my understanding of how the current code does the DCB mapping.
Is it right?

If we have 8 TCs and 4 RSS queues per TC, one q_vector per queue, and
32 CPUs in total, the proper layout would be:

App0---> Prio0 --> TC0 --> RSS_queue0  ---> Q_vector0  ----> CPU0
                    |----> RSS_queue1  ---> Q_vector1  ----> CPU1
                    |----> RSS_queue2  ---> Q_vector2  ----> CPU2
                    |----> RSS_queue3  ---> Q_vector3  ----> CPU3
                             ...             ...              ...
App7---> Prio7 --> TC7 --> RSS_queue28 ---> Q_vector28 ----> CPU28
                    |----> RSS_queue29 ---> Q_vector29 ----> CPU29
                    |----> RSS_queue30 ---> Q_vector30 ----> CPU30
                    |----> RSS_queue31 ---> Q_vector31 ----> CPU31

If we have fewer CPUs, for example only 4, the layout would be
(according to the current implementation):

App0---> Prio0 --> TC0 --> RSS_queue0  ---> Q_vector0  ----> CPU0
                    |----> RSS_queue1  ---> Q_vector1  ----> CPU1
                    |----> RSS_queue2  ---> Q_vector2  ----> CPU2
                    |----> RSS_queue3  ---> Q_vector3  ----> CPU3
                             ...             ...              ...
App7---> Prio7 --> TC7 --> RSS_queue28 ---> Q_vector0  ----> CPU0
                    |----> RSS_queue29 ---> Q_vector1  ----> CPU1
                    |----> RSS_queue30 ---> Q_vector2  ----> CPU2
                    |----> RSS_queue31 ---> Q_vector3  ----> CPU3

So we bind 8 queues (one per TC) to each q_vector / CPU.  And here,
yes, we can scale every TC's traffic across all 4 CPUs with RSS, which
is useful if the workload of one TC's traffic is beyond one CPU's
capability, though it might break the CPU affinity between the
application and the stack/driver data flow.
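
For illustration, a minimal sketch of the wrap-around shown in the
second diagram (a hypothetical helper, not the driver's actual mapping
code), assuming queues are simply distributed round-robin across the
available q_vectors:

/*
 * Hypothetical helper: which q_vector/CPU a given queue index lands on
 * when the queues wrap around the available vectors, as in the 4-CPU
 * diagram above (queue 28 -> vector 0, queue 29 -> vector 1, ...).
 */
static unsigned int queue_to_vector(unsigned int queue,
                                    unsigned int num_vectors)
{
        return queue % num_vectors;
}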

Thanks,
Ethan
> mode.
>
> My advice would be to use a netperf TCP_CRR test and watch what queues
> and what interrupts traffic is being delivered to.  Then if you have
> DCB enabled on both ends you might try changing the priority of your
> netperf session and watch what happens when you switch between TCs.
> What you should find is that you will shift between groups of queues
> and as you do so you should not have any active queues overlapping
> unless you have less interrupts than CPUs.
>
> - Alex
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [Intel-wired-lan] [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
@ 2016-05-17  9:00         ` ethan zhao
  0 siblings, 0 replies; 21+ messages in thread
From: ethan zhao @ 2016-05-17  9:00 UTC (permalink / raw)
  To: intel-wired-lan

Alexander,

On 2016/5/17 0:09, Alexander Duyck wrote:
> On Sun, May 15, 2016 at 7:59 PM, ethan zhao <ethan.zhao@oracle.com> wrote:
>> Alexander,
>>
>> On 2016/5/14 0:46, Alexander Duyck wrote:
>>> On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com>
>>> wrote:
>>>> Allocating 64 Tx/Rx as default doesn't benefit perfomrnace when less
>>>> CPUs were assigned. especially when DCB is enabled, so we should take
>>>> num_online_cpus() as top limit, and aslo to make sure every TC has
>>>> at least one queue, take the MAX_TRAFFIC_CLASS as bottom limit of queues
>>>> number.
>>>>
>>>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>>> What is the harm in allowing the user to specify up to 64 queues if
>>> they want to?  Also what is your opinion based on?  In the case of RSS
>>
>>   There is no module parameter to specify queue number in this upstream ixgbe
>>    driver.  for what to specify more queues than num_online_cpus() via
>> ethtool ?
>>   I couldn't figure out the benefit to do that.
> There are a number of benefits to being able to set the number of
> queues based on the user desire.  Just because you can't figure out
> how to use a feature is no reason to break it so that nobody else can.
>
>>   But if DCB is turned on after loading, the queues would be 64/64, that
>> doesn't
>>   make sense if only 16 CPUs assigned.
> It makes perfect sense.  What is happening is that it is allocating an
> RSS set per TC.  So what you should have is either 4 queues per CPU
> with each one belonging to a different TC, or 4 queues per CPU with
> the first 8 CPUs covering TCs 0-3, and the last 8 CPUs covering TCs
> 4-7.
>
> I can see how the last setup might actually be a bit confusing.  To
> that end you might consider modifying ixgbe_acquire_msix_vectors uses
> the number of RSS queues instead of the number of Rx queues in the


> case of DCB.  Then you would get more consistent behavior with each
> q_vector or CPU (if num_q_vecotrs == num_online_cpus()) having one
> queue belonging to each TC.  You would end up with either 8 or 16
> q_vectors hosting 8 or 4 queues so that they can process DCB requests
> without having to worry about head of line blocking.
>
>>> traffic the upper limit is only 16 on older NICs, but last I knew the
>>> latest X550 can support more queues for RSS.  Have you only been
>>> testing on older NICs or did you test on the latest hardware as well?
>>    More queues for RSS than num_online_cpus() could bring better performance
>> ?
>>    Test result shows false result.  even memory cost is not an issue for most
>> of
>>    the expensive servers, but not for all.
> The feature is called DCB.  What it allows for is the avoidance of
> head-of-line blocking.  So when you have DCB enabled you should have a
> set of queues for each possible RSS result so that if you get a higher
> priority request on one of the queues it can use the higher priority
> queue instead of having to rely on the the lower priority queue to
> receive traffic.  You cannot do that without allocating a queue for
> each TC, and reducing the number of RSS queues supported on the system
> will hurt performance.  Therefore on a 16 CPU system it is very useful
> to be able to allocate 4 queues per RSS flow as that way you get
> optimal CPU distribution and can still avoid head-of-line blocking via
> DCB.
>
>>> If you want to control the number of queues allocated in a given
>>> configuration you should look at the code over in the ixgbe_lib.c, not
>>    Yes,  RSS,  RSS with SRIOV, FCoE, DCB etc uses different queues
>> calculation algorithm.
>>    But they all take the dev queues allocated in alloc_etherdev_mq() as upper
>> limit.
>>
>>   If we set 64 as default here, DCB would says "oh, there is 64 there, I
>> could use it"
> Right.  But the deciding factor for DCB is RSS which is already
> limited by the number of CPUs.  If it is allocating 64 queues it is
> because there are either at least 8 CPUs present and 8 TCs being
> allocated per CPU, or there are at least 16 queues present and it is
> allocating 4 TCs per CPU.
>
>>> ixgbe_main.c.  All you are doing with this patch is denying the user
>>> choice with this change as they then are not allowed to set more
>>    Yes, it is purposed to deny configuration that doesn't benefit.
> Doesn't benefit who?  It is obvious you don't understand how DCB is
> meant to work since you are assuming the queues are throw-away.
> Anyone who makes use of the ability to prioritize their traffic would
> likely have a different opinion.
>
>>> queues.  Even if they find your decision was wrong for their
>>> configuration.
>>>
>>> - Alex
>>>
>>   Thanks,
>>   Ethan
> Your response clearly points out you don't understand DCB.  I suggest
> you take another look at how things are actually being configured.  I
> believe what you will find is that the current implementation is
> basing things on the number of online CPUs already based on the
> ring_feature[RING_F_RSS].limit value.  All that is happening is that
> you are getting that value multiplied by the number of TCs and the RSS
> value is reduced if the result is greater than 64 based on the maximum
> number of queues.
>
> With your code on an 8 core system you go from being able to perform
> RSS over 8 queues to only being able to perform RSS over 1 queue when
> you enable DCB.  There was a bug a long time ago where this actually
> didn't provide any gain because the interrupt allocation was binding
> all 8 RSS queues to a single q_vector, but that has long since been
> fixed and what you should be seeing is that RSS will spread traffic
> across either 8 or 16 queues when DCB is enabled in either 8 or 4 TC
Here is my understanding of current code about the DCB mapping.
Is it right ?

If we have 8 TCs and 4 RSS queues per TC, one q_vector per queue and
we have total 32 CPUs, the proper layout would be:

App0---> Prio0 --> TC0 --> RSS_queue0 --->Q_vector0 ---->CPU0
                                   |----> RSS_queue1 --->Q_vector1 ---->CPU1
                                   |----> RSS_queue2 --->Q_vector2 ---->CPU2
                                   |----> RSS_queue3 --->Q_vector3 ---->CPU3
                                      .          .          .
                                      .          .          .
                                      .          .          .
App7---> Prio7 --> TC7 --> RSS_queue28 --->Q_vector28 ---->CPU28
                                  |----> RSS_queue29 --->Q_vector29 ---->CPU29
                                  |----> RSS_queue30 --->Q_vector30 ---->CPU30
                                  |----> RSS_queue31 --->Q_vector31 ---->CPU31

If we have fewer CPUs, for example only 4 CPUs, the layout (according to
the current implementation) would be:

App0---> Prio0 --> TC0 --> RSS_queue0 --->Q_vector0 ---->CPU0
                                  |----> RSS_queue1 --->Q_vector1 ---->CPU1
                                  |----> RSS_queue2 --->Q_vector2 ---->CPU2
                                  |----> RSS_queue3 --->Q_vector3 ---->CPU3
                                      .          .          .
                                      .          .          .
                                      .          .          .
App7---> Prio7 --> TC7 --> RSS_queue28 --->Q_vector0 ---->CPU0
                                  |----> RSS_queue29 --->Q_vector1 ---->CPU1
                                  |----> RSS_queue30 --->Q_vector2 ---->CPU2
                                  |----> RSS_queue31 --->Q_vector3 ---->CPU3

So we bind 8 queues to each q_vector / CPU.
And here, yes, we can scale every TC's traffic across all 4 CPUs with RSS.
If the workload of one TC's traffic is beyond one CPU's capability, it
is useful to be able to scale like this, though it might break the CPU
affinity between the application and the stack/driver data flow.
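
A rough sketch of that fold-over (illustrative only; mapping queue i to
q_vector i % num_cpus is an assumption for the example, not the exact
driver logic):

/*
 * Illustration of the 4-CPU case above: 8 TCs x 4 RSS queues = 32 queues
 * folded onto 4 q_vectors/CPUs, so each q_vector ends up hosting 8 queues,
 * one from each TC.
 */
#include <stdio.h>

int main(void)
{
	unsigned int num_tcs = 8, rss = 4, num_cpus = 4;
	unsigned int q;

	for (q = 0; q < num_tcs * rss; q++)
		printf("TC%u RSS_queue%-2u -> Q_vector%u -> CPU%u\n",
		       q / rss, q, q % num_cpus, q % num_cpus);
	return 0;
}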

Thanks,
Ethan
> mode.
>
> My advice would be to use a netperf TCP_CRR test and watch what queues
> and what interrupts traffic is being delivered to.  Then if you have
> DCB enabled on both ends you might try changing the priority of your
> netperf session and watch what happens when you switch between TCs.
> What you should find is that you will shift between groups of queues
> and as you do so you should not have any active queues overlapping
> unless you have fewer interrupts than CPUs.
>
> - Alex
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq()
  2016-05-17  9:00         ` [Intel-wired-lan] " ethan zhao
@ 2016-05-17 15:58           ` Alexander Duyck
  -1 siblings, 0 replies; 21+ messages in thread
From: Alexander Duyck @ 2016-05-17 15:58 UTC (permalink / raw)
  To: ethan zhao
  Cc: Jeff Kirsher, Brandeburg, Jesse, shannon nelson, Carolyn Wyborny,
	Skidmore, Donald C, Bruce W Allan, John Ronciak, Mitch Williams,
	intel-wired-lan, Netdev, linux-kernel, Ethan Zhao

On Tue, May 17, 2016 at 2:00 AM, ethan zhao <ethan.zhao@oracle.com> wrote:
> Alexander,
>
>
> On 2016/5/17 0:09, Alexander Duyck wrote:
>>
>> On Sun, May 15, 2016 at 7:59 PM, ethan zhao <ethan.zhao@oracle.com> wrote:
>>>
>>> Alexander,
>>>
>>> On 2016/5/14 0:46, Alexander Duyck wrote:
>>>>
>>>> On Thu, May 12, 2016 at 10:56 PM, Ethan Zhao <ethan.zhao@oracle.com>
>>>> wrote:
>>>>>
>>>>> Allocating 64 Tx/Rx queues as the default doesn't benefit performance
>>>>> when fewer CPUs are assigned, especially when DCB is enabled, so we
>>>>> should take num_online_cpus() as the top limit, and also, to make sure
>>>>> every TC has at least one queue, take MAX_TRAFFIC_CLASS as the bottom
>>>>> limit of the queue number.
>>>>>
>>>>> Signed-off-by: Ethan Zhao <ethan.zhao@oracle.com>
>>>>
>>>> What is the harm in allowing the user to specify up to 64 queues if
>>>> they want to?  Also what is your opinion based on?  In the case of RSS
>>>
>>>
>>>   There is no module parameter to specify the queue number in this
>>> upstream ixgbe driver.  Why would one specify more queues than
>>> num_online_cpus() via ethtool?  I couldn't figure out the benefit of
>>> doing that.
>>
>> There are a number of benefits to being able to set the number of
>> queues based on the user desire.  Just because you can't figure out
>> how to use a feature is no reason to break it so that nobody else can.
>>
>>>   But if DCB is turned on after loading, the queues would be 64/64; that
>>> doesn't make sense if only 16 CPUs are assigned.
>>
>> It makes perfect sense.  What is happening is that it is allocating an
>> RSS set per TC.  So what you should have is either 4 queues per CPU
>> with each one belonging to a different TC, or 4 queues per CPU with
>> the first 8 CPUs covering TCs 0-3, and the last 8 CPUs covering TCs
>> 4-7.
>>
>> I can see how the last setup might actually be a bit confusing.  To
>> that end you might consider modifying ixgbe_acquire_msix_vectors uses
>> the number of RSS queues instead of the number of Rx queues in the
>
>
>
>> case of DCB.  Then you would get more consistent behavior with each
>> q_vector or CPU (if num_q_vecotrs == num_online_cpus()) having one
>> queue belonging to each TC.  You would end up with either 8 or 16
>> q_vectors hosting 8 or 4 queues so that they can process DCB requests
>> without having to worry about head of line blocking.
>>
>>>> traffic the upper limit is only 16 on older NICs, but last I knew the
>>>> latest X550 can support more queues for RSS.  Have you only been
>>>> testing on older NICs or did you test on the latest hardware as well?
>>>
>>>    Could more RSS queues than num_online_cpus() bring better performance?
>>>    Test results show otherwise.  Even the memory cost is not an issue for
>>> most of the expensive servers, but not for all.
>>
>> The feature is called DCB.  What it allows for is the avoidance of
>> head-of-line blocking.  So when you have DCB enabled you should have a
>> set of queues for each possible RSS result so that if you get a higher
>> priority request on one of the queues it can use the higher priority
>> queue instead of having to rely on the lower priority queue to
>> receive traffic.  You cannot do that without allocating a queue for
>> each TC, and reducing the number of RSS queues supported on the system
>> will hurt performance.  Therefore on a 16 CPU system it is very useful
>> to be able to allocate 4 queues per RSS flow as that way you get
>> optimal CPU distribution and can still avoid head-of-line blocking via
>> DCB.
>>
>>>> If you want to control the number of queues allocated in a given
>>>> configuration you should look at the code over in the ixgbe_lib.c, not
>>>
>>>    Yes, RSS, RSS with SRIOV, FCoE, DCB etc. use different queue
>>> calculation algorithms.
>>>    But they all take the dev queues allocated in alloc_etherdev_mq() as
>>> the upper limit.
>>>
>>>   If we set 64 as the default here, DCB would say "oh, there are 64
>>> there, I could use them."
>>
>> Right.  But the deciding factor for DCB is RSS which is already
>> limited by the number of CPUs.  If it is allocating 64 queues it is
>> because there are either at least 8 CPUs present and 8 TCs being
>> allocated per CPU, or there are at least 16 CPUs present and it is
>> allocating 4 TCs per CPU.
>>
>>>> ixgbe_main.c.  All you are doing with this patch is denying the user
>>>> choice with this change as they then are not allowed to set more
>>>
>>>    Yes, it is intended to deny a configuration that doesn't bring any benefit.
>>
>> Doesn't benefit who?  It is obvious you don't understand how DCB is
>> meant to work since you are assuming the queues are throw-away.
>> Anyone who makes use of the ability to prioritize their traffic would
>> likely have a different opinion.
>>
>>>> queues.  Even if they find your decision was wrong for their
>>>> configuration.
>>>>
>>>> - Alex
>>>>
>>>   Thanks,
>>>   Ethan
>>
>> Your response clearly points out you don't understand DCB.  I suggest
>> you take another look at how things are actually being configured.  I
>> believe what you will find is that the current implementation is
>> basing things on the number of online CPUs already based on the
>> ring_feature[RING_F_RSS].limit value.  All that is happening is that
>> you are getting that value multiplied by the number of TCs and the RSS
>> value is reduced if the result is greater than 64 based on the maximum
>> number of queues.
>>
>> With your code on an 8 core system you go from being able to perform
>> RSS over 8 queues to only being able to perform RSS over 1 queue when
>> you enable DCB.  There was a bug a long time ago where this actually
>> didn't provide any gain because the interrupt allocation was binding
>> all 8 RSS queues to a single q_vector, but that has long since been
>> fixed and what you should be seeing is that RSS will spread traffic
>> across either 8 or 16 queues when DCB is enabled in either 8 or 4 TC
>
> Here is my understanding of how the current code does the DCB mapping.
> Is it right?
>
> If we have 8 TCs and 4 RSS queues per TC, one q_vector per queue and
> 32 CPUs in total, the proper layout would be:
>
> App0---> Prio0 --> TC0 --> RSS_queue0 --->Q_vector0 ---->CPU0
>                                   |----> RSS_queue1 --->Q_vector1 ---->CPU1
>                                   |----> RSS_queue2 --->Q_vector2 ---->CPU2
>                                   |----> RSS_queue3 --->Q_vector3 ---->CPU3
>                                       .          .          .
>                                       .          .          .
>                                       .          .          .
> App7---> Prio7 --> TC7 --> RSS_queue28 --->Q_vector28 ---->CPU28
>                                  |----> RSS_queue29 --->Q_vector29 ---->CPU29
>                                  |----> RSS_queue30 --->Q_vector30 ---->CPU30
>                                  |----> RSS_queue31 --->Q_vector31 ---->CPU31
>
> If we have fewer CPUs, for example only 4 CPUs, the layout (according to
> the current implementation) would be:
>
> App0---> Prio0 --> TC0 --> RSS_queue0 --->Q_vector0 ---->CPU0
>                                  |----> RSS_queue1 --->Q_vector1 ---->CPU1
>                                  |----> RSS_queue2 --->Q_vector2 ---->CPU2
>                                  |----> RSS_queue3 --->Q_vector3 ---->CPU3
>                                               . .             .            .
>                                               . .             .            .
>                                               . .             .            .
> App7---> Prio7 --> TC7 --> RSS_queue28 --->Q_vector0 ---->CPU0
>                                  |----> RSS_queue29 --->Q_vector1 ---->CPU1
>                                  |----> RSS_queue30 --->Q_vector2 ---->CPU2
>                                  |----> RSS_queue31 --->Q_vector3 ---->CPU3
>
> So we bind 8 queues to each q_vector / CPU.
> And here, yes, we can scale every TC's traffic across all 4 CPUs with RSS.
> If the workload of one TC's traffic is beyond one CPU's capability, it
> is useful to be able to scale like this, though it might break the CPU
> affinity between the application and the stack/driver data flow.
>

I think you are generally getting the idea.  Basically what we end up
doing is laying things out so that each TC has access to as many CPUs
as possible.

I'm not sure what the CPU affinity comment is in reference to.
Basically we are doing the best that RSS can do seeing as how DCB is
incompatible with ATR.
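
(A purely illustrative sketch of that layout goal -- not the ixgbe code,
and the queue indexing here is made up: with 8 CPUs and 8 TCs, each
CPU/q_vector would host one queue from every TC, so every TC can still
spread across all CPUs.)

#include <stdio.h>

int main(void)
{
	unsigned int num_tcs = 8, num_cpus = 8;
	unsigned int cpu, tc;

	for (cpu = 0; cpu < num_cpus; cpu++) {
		printf("CPU%u/q_vector%u:", cpu, cpu);
		/* one queue per TC on every CPU/q_vector */
		for (tc = 0; tc < num_tcs; tc++)
			printf(" q%u(TC%u)", tc * num_cpus + cpu, tc);
		printf("\n");
	}
	return 0;
}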

- Alex

^ permalink raw reply	[flat|nested] 21+ messages in thread


end of thread

Thread overview: 21+ messages
2016-05-13  5:56 [PATCH] ixgbe: take online CPU number as MQ max limit when alloc_etherdev_mq() Ethan Zhao
2016-05-13  5:56 ` [Intel-wired-lan] " Ethan Zhao
2016-05-13 12:52 ` Sergei Shtylyov
2016-05-13 12:52   ` [Intel-wired-lan] " Sergei Shtylyov
2016-05-16  5:38   ` ethan zhao
2016-05-16  5:38     ` [Intel-wired-lan] " ethan zhao
2016-05-13 16:46 ` Alexander Duyck
2016-05-13 16:46   ` [Intel-wired-lan] " Alexander Duyck
2016-05-16  2:59   ` ethan zhao
2016-05-16  2:59     ` [Intel-wired-lan] " ethan zhao
2016-05-16 16:09     ` Alexander Duyck
2016-05-16 16:09       ` [Intel-wired-lan] " Alexander Duyck
2016-05-16 17:14       ` John Fastabend
2016-05-16 17:14         ` John Fastabend
2016-05-17  9:00       ` ethan zhao
2016-05-17  9:00         ` [Intel-wired-lan] " ethan zhao
2016-05-17 15:58         ` Alexander Duyck
2016-05-17 15:58           ` [Intel-wired-lan] " Alexander Duyck
2016-05-16 20:58 ` Jeff Kirsher
2016-05-16 20:58   ` [Intel-wired-lan] " Jeff Kirsher
2016-05-16 20:58   ` Jeff Kirsher
