* [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
@ 2012-06-20  0:43 Alexander Duyck
  2012-06-20  1:49 ` Alexander Duyck
  2012-06-20  5:36 ` Eric Dumazet
  0 siblings, 2 replies; 19+ messages in thread
From: Alexander Duyck @ 2012-06-20  0:43 UTC (permalink / raw)
  To: netdev; +Cc: davem, jeffrey.t.kirsher

This patch is meant to help improve system performance when
netdev_alloc_frag is used in scenarios in which buffers are short lived.
This is accomplished by allowing the page offset to be reset in the event
that the page count is 1.  I also reordered the direction in which we give
out sections of the page so that we start at the end of the page and end at
the start.  The main motivation being that I preferred to have offset
represent the amount of page remaining to be used.

My primary test case was using ixgbe in combination with TCP.  With this
patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
thread of netperf receiving a TCP stream via ixgbe.

I also tested several scenarios in which the page reuse would not be
possible such as UDP flows and routing.  In both of these scenarios I saw
no noticeable performance degradation compared to the kernel without this
patch.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
---

 net/core/skbuff.c |   15 +++++++++++----
 1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..eb3853c 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	if (unlikely(!nc->page)) {
 refill:
 		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
-		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		unsigned int offset = PAGE_SIZE;
+
+		if (page_count(nc->page) != 1)
+			offset = nc->offset;
+
+		if (offset < fragsz) {
 			put_page(nc->page);
 			goto refill;
 		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
+
+		offset -= fragsz;
+		nc->offset = offset;
+
+		data = page_address(nc->page) + offset;
 		get_page(nc->page);
 	}
 	local_irq_restore(flags);


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  0:43 [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO Alexander Duyck
@ 2012-06-20  1:49 ` Alexander Duyck
  2012-06-20  5:36 ` Eric Dumazet
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Duyck @ 2012-06-20  1:49 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher, Eric Dumazet

On 6/19/2012 5:43 PM, Alexander Duyck wrote:
> This patch is meant to help improve system performance when
> netdev_alloc_frag is used in scenarios in which buffers are short lived.
> This is accomplished by allowing the page offset to be reset in the event
> that the page count is 1.  I also reordered the direction in which we give
> out sections of the page so that we start at the end of the page and end at
> the start.  The main motivation being that I preferred to have offset
> represent the amount of page remaining to be used.
>
> My primary test case was using ixgbe in combination with TCP.  With this
> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
> thread of netperf receiving a TCP stream via ixgbe.
>
> I also tested several scenarios in which the page reuse would not be
> possible such as UDP flows and routing.  In both of these scenarios I saw
> no noticeable performance degradation compared to the kernel without this
> patch.
>
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
>
>   net/core/skbuff.c |   15 +++++++++++----
>   1 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..eb3853c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>   	if (unlikely(!nc->page)) {
>   refill:
>   		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> -		nc->offset = 0;
>   	}
>   	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> +		unsigned int offset = PAGE_SIZE;
> +
> +		if (page_count(nc->page) != 1)
> +			offset = nc->offset;
> +
> +		if (offset < fragsz) {
>   			put_page(nc->page);
>   			goto refill;
>   		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> +
> +		offset -= fragsz;
> +		nc->offset = offset;
> +
> +		data = page_address(nc->page) + offset;
>   		get_page(nc->page);
>   	}
>   	local_irq_restore(flags);
>
It looks like I forgot to add "--auto" to the command line when I sent
this out via stg mail, so I am just adding Eric to the CC list on this
reply.  Sorry for the extra noise.

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  0:43 [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO Alexander Duyck
  2012-06-20  1:49 ` Alexander Duyck
@ 2012-06-20  5:36 ` Eric Dumazet
  2012-06-20  8:17   ` Eric Dumazet
  2012-06-20 16:30   ` Alexander Duyck
  1 sibling, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20  5:36 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher

On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
> This patch is meant to help improve system performance when
> netdev_alloc_frag is used in scenarios in which buffers are short lived.
> This is accomplished by allowing the page offset to be reset in the event
> that the page count is 1.  I also reordered the direction in which we give
> out sections of the page so that we start at the end of the page and end at
> the start.  The main motivation being that I preferred to have offset
> represent the amount of page remaining to be used.
> 
> My primary test case was using ixgbe in combination with TCP.  With this
> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
> thread of netperf receiving a TCP stream via ixgbe.
> 
> I also tested several scenarios in which the page reuse would not be
> possible such as UDP flows and routing.  In both of these scenarios I saw
> no noticeable performance degradation compared to the kernel without this
> patch.
> 
> Cc: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
> ---
> 
>  net/core/skbuff.c |   15 +++++++++++----
>  1 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..eb3853c 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>  	if (unlikely(!nc->page)) {
>  refill:
>  		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> -		nc->offset = 0;
>  	}
>  	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> +		unsigned int offset = PAGE_SIZE;
> +
> +		if (page_count(nc->page) != 1)
> +			offset = nc->offset;
> +
> +		if (offset < fragsz) {
>  			put_page(nc->page);
>  			goto refill;
>  		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> +
> +		offset -= fragsz;
> +		nc->offset = offset;
> +
> +		data = page_address(nc->page) + offset;
>  		get_page(nc->page);
>  	}
>  	local_irq_restore(flags);
> 

I tested this idea one month ago and did not get convincing results, because
the branch was taken half of the time.

The cases where page can be reused is probably specific to ixgbe because
it uses a different allocator for the frags themselves.
netdev_alloc_frag() is only used to allocate the skb head.

For typical nics, we allocate frags to populate the RX ring _way_ before
packet is received by the NIC.

Then, I played with using order-2 pages instead of order-0 ones if
PAGE_SIZE < 8192.

No clear win either, but you might try this too.


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  5:36 ` Eric Dumazet
@ 2012-06-20  8:17   ` Eric Dumazet
  2012-06-20  8:44     ` Eric Dumazet
                       ` (2 more replies)
  2012-06-20 16:30   ` Alexander Duyck
  1 sibling, 3 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20  8:17 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher

On Wed, 2012-06-20 at 07:36 +0200, Eric Dumazet wrote:

> I tested this idea one month ago and did not get convincing results, because
> the branch was taken half of the time.
> 
> The cases where page can be reused is probably specific to ixgbe because
> it uses a different allocator for the frags themselves.
> netdev_alloc_frag() is only used to allocate the skb head.
> 
> For typical nics, we allocate frags to populate the RX ring _way_ before
> packet is received by the NIC.
> 
> Then, I played with using order-2 pages instead of order-0 ones if
> PAGE_SIZE < 8192.
> 
> No clear win either, but you might try this too.

By the way, a big cost in netdev_alloc_frag() is the irq masking/restore.
We probably could have a version for softirq users...
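
As a rough illustration of that idea (purely a sketch, assuming such a
caller never runs from hard irq context; the function name is made up and
no such helper existed at the time), a softirq-only variant would
essentially be today's allocator minus the irq save/restore:

/* Hypothetical sketch: a variant for callers known to run in softirq
 * (NAPI) context, so the per-CPU cache needs no local_irq_save()/restore().
 */
static void *__softirq_netdev_alloc_frag(unsigned int fragsz)
{
	struct netdev_alloc_cache *nc = &__get_cpu_var(netdev_alloc_cache);
	void *data = NULL;

	if (unlikely(!nc->page)) {
refill:
		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
		nc->offset = 0;
	}
	if (likely(nc->page)) {
		if (nc->offset + fragsz > PAGE_SIZE) {
			put_page(nc->page);
			goto refill;
		}
		data = page_address(nc->page) + nc->offset;
		nc->offset += fragsz;
		get_page(nc->page);
	}
	return data;
}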

Another idea would be to use a pool of pages, instead of a single one,
if we want to play with the "clear the offset if page count is one"
idea.

Strange, I ran the benchmarks again with order-2 allocations and got good
results this time, but this was with the latest net-next, so maybe things
have changed since the last time I did this.

(netdev_alloc_frag(), get_page_from_freelist() and put_page() less
prevalent in perf results)


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  8:17   ` Eric Dumazet
@ 2012-06-20  8:44     ` Eric Dumazet
  2012-06-20  9:04       ` David Miller
  2012-06-20 13:21     ` Eric Dumazet
  2012-06-21  5:56     ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20  8:44 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher

On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:

> Strange, I ran the benchmarks again with order-2 allocations and got good
> results this time, but this was with the latest net-next, so maybe things
> have changed since the last time I did this.
> 
> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
> prevalent in perf results)
> 

Oh well, I now remember why I abandoned this: machines don't have
infinite memory after all...

(put_page assumes order-0 page)


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  8:44     ` Eric Dumazet
@ 2012-06-20  9:04       ` David Miller
  2012-06-20  9:14         ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2012-06-20  9:04 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 10:44:07 +0200

> (put_page assumes order-0 page)

But it certainly can handle compound pages, I thought?


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  9:04       ` David Miller
@ 2012-06-20  9:14         ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20  9:14 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher

On Wed, 2012-06-20 at 02:04 -0700, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 20 Jun 2012 10:44:07 +0200
> 
> > (put_page assumes order-0 page)
> 
> But it certainly can handle compound pages, I thought?

Yes, I forgot the GFP_COMP ;)

Oh well, time for a coffee


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  8:17   ` Eric Dumazet
  2012-06-20  8:44     ` Eric Dumazet
@ 2012-06-20 13:21     ` Eric Dumazet
  2012-06-21  4:07       ` Alexander Duyck
  2012-06-21  5:56     ` David Miller
  2 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20 13:21 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher

On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:

> Strange, I ran the benchmarks again with order-2 allocations and got good
> results this time, but this was with the latest net-next, so maybe things
> have changed since the last time I did this.
> 
> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
> prevalent in perf results)
> 

In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
this as well, and results are really good, on ixgbe at least.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..ffd2cba 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -299,6 +299,9 @@ struct netdev_alloc_cache {
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#define MAX_NETDEV_FRAGSIZE	max_t(unsigned int, PAGE_SIZE, 32768)
+#define NETDEV_FRAG_ORDER	get_order(MAX_NETDEV_FRAGSIZE)
+
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
+				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
+				       NETDEV_FRAG_ORDER);
 		nc->offset = 0;
 	}
 	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
+		if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
 			put_page(nc->page);
 			goto refill;
 		}
@@ -353,7 +358,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
 		void *data = netdev_alloc_frag(fragsz);
 
 		if (likely(data)) {


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  5:36 ` Eric Dumazet
  2012-06-20  8:17   ` Eric Dumazet
@ 2012-06-20 16:30   ` Alexander Duyck
  2012-06-20 17:14     ` Alexander Duyck
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2012-06-20 16:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, jeffrey.t.kirsher

On 06/19/2012 10:36 PM, Eric Dumazet wrote:
> On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
>> This patch is meant to help improve system performance when
>> netdev_alloc_frag is used in scenarios in which buffers are short lived.
>> This is accomplished by allowing the page offset to be reset in the event
>> that the page count is 1.  I also reordered the direction in which we give
>> out sections of the page so that we start at the end of the page and end at
>> the start.  The main motivation being that I preferred to have offset
>> represent the amount of page remaining to be used.
>>
>> My primary test case was using ixgbe in combination with TCP.  With this
>> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
>> thread of netperf receiving a TCP stream via ixgbe.
>>
>> I also tested several scenarios in which the page reuse would not be
>> possible such as UDP flows and routing.  In both of these scenarios I saw
>> no noticeable performance degradation compared to the kernel without this
>> patch.
>>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>> ---
>>
>>  net/core/skbuff.c |   15 +++++++++++----
>>  1 files changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 5b21522..eb3853c 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>>  	if (unlikely(!nc->page)) {
>>  refill:
>>  		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
>> -		nc->offset = 0;
>>  	}
>>  	if (likely(nc->page)) {
>> -		if (nc->offset + fragsz > PAGE_SIZE) {
>> +		unsigned int offset = PAGE_SIZE;
>> +
>> +		if (page_count(nc->page) != 1)
>> +			offset = nc->offset;
>> +
>> +		if (offset < fragsz) {
>>  			put_page(nc->page);
>>  			goto refill;
>>  		}
>> -		data = page_address(nc->page) + nc->offset;
>> -		nc->offset += fragsz;
>> +
>> +		offset -= fragsz;
>> +		nc->offset = offset;
>> +
>> +		data = page_address(nc->page) + offset;
>>  		get_page(nc->page);
>>  	}
>>  	local_irq_restore(flags);
>>
> I tested this idea one month ago and did not get convincing results, because
> the branch was taken half of the time.
>
> The cases where page can be reused is probably specific to ixgbe because
> it uses a different allocator for the frags themselves.
> netdev_alloc_frag() is only used to allocate the skb head.
Actually, it applies pretty much anywhere a copy-break type setup exists.  I
think ixgbe and a few other drivers have this type of setup, where
netdev_alloc_skb is called and the data is just copied into the buffer.
My thought was that if I can improve this one case without hurting the
other cases, I should just go ahead and submit it, since it is a net win
performance-wise.

I think one of the biggest advantages of this for ixgbe is that it
allows the buffer to become cache warm so that writing the shared info
and copying the header contents becomes very cheap compared to accessing
a cache cold page.
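
For readers unfamiliar with the pattern, a minimal copy-break sketch
(illustrative only, not taken from ixgbe; the threshold and helper name
are invented here) looks roughly like this:

/* Copy-break sketch: for small frames, build a fresh skb via
 * netdev_alloc_skb() (which goes through netdev_alloc_frag()) and memcpy
 * the payload, so the driver's receive page can be reused immediately. */
#define COPYBREAK_LEN	256	/* illustrative threshold */

static struct sk_buff *rx_copybreak(struct net_device *dev,
				    const void *rx_buf, unsigned int len)
{
	struct sk_buff *skb;

	if (len > COPYBREAK_LEN)
		return NULL;	/* caller attaches the page as a frag instead */

	skb = netdev_alloc_skb(dev, len);
	if (unlikely(!skb))
		return NULL;

	memcpy(skb_put(skb, len), rx_buf, len);
	return skb;
}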

> For typical nics, we allocate frags to populate the RX ring _way_ before
> packet is received by the NIC.
>
> Then, I played with using order-2 pages instead of order-0 ones if
> PAGE_SIZE < 8192.
>
> No clear win either, but you might try this too.
The biggest issue I see with an order-2 page is that it means the memory
is going to take much longer to cycle out of a shared page.  As a result
changes like the one I just came up with would likely have little to no
benefit because we would run out of room in the frags list before we
could start reusing a fresh page.

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20 16:30   ` Alexander Duyck
@ 2012-06-20 17:14     ` Alexander Duyck
  2012-06-20 18:41       ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2012-06-20 17:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, jeffrey.t.kirsher

On 06/20/2012 09:30 AM, Alexander Duyck wrote:
> On 06/19/2012 10:36 PM, Eric Dumazet wrote:
>> On Tue, 2012-06-19 at 17:43 -0700, Alexander Duyck wrote:
>>> This patch is meant to help improve system performance when
>>> netdev_alloc_frag is used in scenarios in which buffers are short lived.
>>> This is accomplished by allowing the page offset to be reset in the event
>>> that the page count is 1.  I also reordered the direction in which we give
>>> out sections of the page so that we start at the end of the page and end at
>>> the start.  The main motivation being that I preferred to have offset
>>> represent the amount of page remaining to be used.
>>>
>>> My primary test case was using ixgbe in combination with TCP.  With this
>>> patch applied I saw CPU utilization drop from 3.4% to 3.0% for a single
>>> thread of netperf receiving a TCP stream via ixgbe.
>>>
>>> I also tested several scenarios in which the page reuse would not be
>>> possible such as UDP flows and routing.  In both of these scenarios I saw
>>> no noticeable performance degradation compared to the kernel without this
>>> patch.
>>>
>>> Cc: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
>>> ---
>>>
>>>  net/core/skbuff.c |   15 +++++++++++----
>>>  1 files changed, 11 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index 5b21522..eb3853c 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -317,15 +317,22 @@ void *netdev_alloc_frag(unsigned int fragsz)
>>>  	if (unlikely(!nc->page)) {
>>>  refill:
>>>  		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
>>> -		nc->offset = 0;
>>>  	}
>>>  	if (likely(nc->page)) {
>>> -		if (nc->offset + fragsz > PAGE_SIZE) {
>>> +		unsigned int offset = PAGE_SIZE;
>>> +
>>> +		if (page_count(nc->page) != 1)
>>> +			offset = nc->offset;
>>> +
>>> +		if (offset < fragsz) {
>>>  			put_page(nc->page);
>>>  			goto refill;
>>>  		}
>>> -		data = page_address(nc->page) + nc->offset;
>>> -		nc->offset += fragsz;
>>> +
>>> +		offset -= fragsz;
>>> +		nc->offset = offset;
>>> +
>>> +		data = page_address(nc->page) + offset;
>>>  		get_page(nc->page);
>>>  	}
>>>  	local_irq_restore(flags);
>>>
>> I tested this idea one month ago and did not get convincing results, because
>> the branch was taken half of the time.
>>
>> The cases where page can be reused is probably specific to ixgbe because
>> it uses a different allocator for the frags themselves.
>> netdev_alloc_frag() is only used to allocate the skb head.
> Actually, it applies pretty much anywhere a copy-break type setup exists.  I
> think ixgbe and a few other drivers have this type of setup, where
> netdev_alloc_skb is called and the data is just copied into the buffer.
> My thought was that if I can improve this one case without hurting the
> other cases, I should just go ahead and submit it, since it is a net win
> performance-wise.
>
> I think one of the biggest advantages of this for ixgbe is that it
> allows the buffer to become cache warm so that writing the shared info
> and copying the header contents becomes very cheap compared to accessing
> a cache cold page.
>
>> For typical nics, we allocate frags to populate the RX ring _way_ before
>> packet is received by the NIC.
>>
>> Then, I played with using order-2 pages instead of order-0 ones if
>> PAGE_SIZE < 8192.
>>
>> No clear win either, but you might try this too.
> The biggest issue I see with an order-2 page is that it means the memory
> is going to take much longer to cycle out of a shared page.  As a result
> changes like the one I just came up with would likely have little to no
> benefit because we would run out of room in the frags list before we
> could start reusing a fresh page.
>
> Thanks,
>
> Alex
>
Actually I think I just realized what the difference is.  I was looking
at things with LRO disabled.  With LRO enabled our hardware RSC feature
kind of defeats the whole point of the GRO or TCP coalescing anyway
since it will stuff 16 fragments into a single packet before we even
hand the packet off to the stack.

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20 17:14     ` Alexander Duyck
@ 2012-06-20 18:41       ` Eric Dumazet
  2012-06-20 20:10         ` Alexander Duyck
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-06-20 18:41 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev, davem, jeffrey.t.kirsher

On Wed, 2012-06-20 at 10:14 -0700, Alexander Duyck wrote:

> Actually I think I just realized what the difference is.  I was looking
> at things with LRO disabled.  With LRO enabled our hardware RSC feature
> kind of defeats the whole point of the GRO or TCP coalescing anyway
> since it will stuff 16 fragments into a single packet before we even
> hand the packet off to the stack.

I noticed LRO is now 'off' by default on ixgbe (net-next tree); I am
pretty sure it was 'on' some months ago?


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20 18:41       ` Eric Dumazet
@ 2012-06-20 20:10         ` Alexander Duyck
  0 siblings, 0 replies; 19+ messages in thread
From: Alexander Duyck @ 2012-06-20 20:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, davem, jeffrey.t.kirsher

On 06/20/2012 11:41 AM, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 10:14 -0700, Alexander Duyck wrote:
>
>> Actually I think I just realized what the difference is.  I was looking
>> at things with LRO disabled.  With LRO enabled our hardware RSC feature
>> kind of defeats the whole point of the GRO or TCP coalescing anyway
>> since it will stuff 16 fragments into a single packet before we even
>> hand the packet off to the stack.
> I noticed LRO is now 'off' by default on ixgbe (net-next tree); I am
> pretty sure it was 'on' some months ago?
It should be on by default unless you are doing some routing.  In that
case, as soon as the interface comes up, LRO is disabled by the stack.

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20 13:21     ` Eric Dumazet
@ 2012-06-21  4:07       ` Alexander Duyck
  2012-06-21  5:07         ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2012-06-21  4:07 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On 6/20/2012 6:21 AM, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 10:17 +0200, Eric Dumazet wrote:
>
>> Strange, I ran the benchmarks again with order-2 allocations and got good
>> results this time, but this was with the latest net-next, so maybe things
>> have changed since the last time I did this.
>>
>> (netdev_alloc_frag(), get_page_from_freelist() and put_page() less
>> prevalent in perf results)
>>
> In fact, since SLUB uses order-3 for kmalloc-2048, I felt lucky to try
> this as well, and results are really good, on ixgbe at least.
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..ffd2cba 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -299,6 +299,9 @@ struct netdev_alloc_cache {
>   };
>   static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
>
> +#define MAX_NETDEV_FRAGSIZE	max_t(unsigned int, PAGE_SIZE, 32768)
> +#define NETDEV_FRAG_ORDER	get_order(MAX_NETDEV_FRAGSIZE)
> +
>   /**
>    * netdev_alloc_frag - allocate a page fragment
>    * @fragsz: fragment size
> @@ -316,11 +319,13 @@ void *netdev_alloc_frag(unsigned int fragsz)
>  	nc = &__get_cpu_var(netdev_alloc_cache);
>   	if (unlikely(!nc->page)) {
>   refill:
> -		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
> +				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
> +				       NETDEV_FRAG_ORDER);
>   		nc->offset = 0;
>   	}
I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
From what I can tell, setting __GFP_COMP for an order-0 page has no
effect, since it only seems to get checked in prep_new_page, and that is
after a check to verify whether the page is order 0 or not.
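
For reference, the check in question is along these lines in
mm/page_alloc.c of that era (paraphrased from memory, so treat it as
approximate):

	/* in prep_new_page(): __GFP_COMP only matters for order > 0,
	 * so passing it on an order-0 allocation is effectively a no-op */
	if (order && (gfp_flags & __GFP_COMP))
		prep_compound_page(page, order);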

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-21  4:07       ` Alexander Duyck
@ 2012-06-21  5:07         ` Eric Dumazet
  2012-06-22 12:33           ` Eric Dumazet
  0 siblings, 1 reply; 19+ messages in thread
From: Eric Dumazet @ 2012-06-21  5:07 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On Wed, 2012-06-20 at 21:07 -0700, Alexander Duyck wrote:
> On 6/20/2012 6:21 AM, Eric Dumazet wrote:

> > +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
> > +				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
> > +				       NETDEV_FRAG_ORDER);

> I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
> From what I can tell, setting __GFP_COMP for an order-0 page has no
> effect, since it only seems to get checked in prep_new_page, and that is
> after a check to verify whether the page is order 0 or not.

Good point, it seems some net drivers could be changed to remove
useless tests.

I'll post some performance data as well.


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-20  8:17   ` Eric Dumazet
  2012-06-20  8:44     ` Eric Dumazet
  2012-06-20 13:21     ` Eric Dumazet
@ 2012-06-21  5:56     ` David Miller
  2 siblings, 0 replies; 19+ messages in thread
From: David Miller @ 2012-06-21  5:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: alexander.h.duyck, netdev, jeffrey.t.kirsher

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 20 Jun 2012 10:17:03 +0200

> By the way, a big cost in netdev_alloc_frag() is the irq
> masking/restore. We probably could have a version for softirq
> users...

That's an extremely disappointing way for us to be losing cycles.

Everyone pays this price purely because:

1) __netdev_alloc_skb() uses netdev_alloc_frag()

and:

2) #1 is invoked, either directly or indirectly, by tons
   of slow non-NAPI drivers.

This got me looking into the plethora of interfaces we let drivers use
to allocate RX buffers.  It's a big mess.

We have dev_alloc_skb() which essentially calls __netdev_alloc_skb()
with a NULL device argument.  This is terrible because it means that
if we want to do something interesting on a per-device level we can't
rely upon the device being non-NULL in __netdev_alloc_skb().

I looked at the remaining dev_alloc_skb() users and these are in places
which are allocating packets in a module which is one level removed from
the netdevice level.  For example, ATM and infiniband IPATH.

What these callers want is something more like:

static inline struct sk_buff *alloc_skb_and_reserve_pad(unsigned int length,
							gfp_t gfp_mask)
{
	struct sk_buff *skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
					  0, NUMA_NO_NODE);
	if (likely(skb))
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}

Then we won't have the NULL device case for __netdev_alloc_skb() any more.


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-21  5:07         ` Eric Dumazet
@ 2012-06-22 12:33           ` Eric Dumazet
  2012-06-23  0:17             ` Alexander Duyck
  2012-06-29 23:04             ` Alexander Duyck
  0 siblings, 2 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-06-22 12:33 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On Thu, 2012-06-21 at 07:07 +0200, Eric Dumazet wrote:
> On Wed, 2012-06-20 at 21:07 -0700, Alexander Duyck wrote:
> > On 6/20/2012 6:21 AM, Eric Dumazet wrote:
> 
> > > +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
> > > +				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
> > > +				       NETDEV_FRAG_ORDER);
> 
> > I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
> > From what I can tell, setting __GFP_COMP for an order-0 page has no
> > effect, since it only seems to get checked in prep_new_page, and that is
> > after a check to verify whether the page is order 0 or not.
> 
> Good point, it seems some net drivers could be changed to remove
> useless tests.
> 
> I'll post some performance data as well.

Here is the patch I tested here.

Using 32768-byte allocations is actually nice for MTU=9000 traffic,
since we can fit 3 frames per 32KB instead of only 2 frames (using the
kmalloc-16384 slab).

Also, I prefill page->_count with a high bias value, to avoid the
get_page() we did for each allocated frag.

In my profiles, the get_page() cost was dominant, because of false
sharing with skb consumers (as they might run on different cpus)

This way, when 32768 bytes are filled, we perform a single
atomic_sub_return() and can recycle the page if we find we are the last
user (this is what you did in your patch, when testing page->_count
being 1)

Note: if I used max(PAGE_SIZE, 32768) for MAX_NETDEV_FRAGSIZE,
gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE); strange...


diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b21522..d31efa2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
 struct netdev_alloc_cache {
 	struct page *page;
 	unsigned int offset;
+	unsigned int pagecnt_bias;
 };
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
+#if PAGE_SIZE > 32768
+#define MAX_NETDEV_FRAGSIZE	PAGE_SIZE
+#else
+#define MAX_NETDEV_FRAGSIZE	32768
+#endif
+
+#define NETDEV_PAGECNT_BIAS	(MAX_NETDEV_FRAGSIZE /		\
+				 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
 /**
  * netdev_alloc_frag - allocate a page fragment
  * @fragsz: fragment size
@@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
 	nc = &__get_cpu_var(netdev_alloc_cache);
 	if (unlikely(!nc->page)) {
 refill:
-		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
+				       get_order(MAX_NETDEV_FRAGSIZE));
+		if (unlikely(!nc->page))
+			goto end;
+recycle:
+		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
+		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
 		nc->offset = 0;
 	}
-	if (likely(nc->page)) {
-		if (nc->offset + fragsz > PAGE_SIZE) {
-			put_page(nc->page);
-			goto refill;
-		}
-		data = page_address(nc->page) + nc->offset;
-		nc->offset += fragsz;
-		get_page(nc->page);
+	if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
+		if (!atomic_sub_return(nc->pagecnt_bias,
+				       &nc->page->_count))
+			goto recycle;
+		goto refill;
 	}
+	data = page_address(nc->page) + nc->offset;
+	nc->offset += fragsz;
+	nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
+end:
 	local_irq_restore(flags);
 	return data;
 }
@@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
-	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
+	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
 		void *data = netdev_alloc_frag(fragsz);
 
 		if (likely(data)) {


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-22 12:33           ` Eric Dumazet
@ 2012-06-23  0:17             ` Alexander Duyck
  2012-06-29 23:04             ` Alexander Duyck
  1 sibling, 0 replies; 19+ messages in thread
From: Alexander Duyck @ 2012-06-23  0:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On 06/22/2012 05:33 AM, Eric Dumazet wrote:
> On Thu, 2012-06-21 at 07:07 +0200, Eric Dumazet wrote:
>> On Wed, 2012-06-20 at 21:07 -0700, Alexander Duyck wrote:
>>> On 6/20/2012 6:21 AM, Eric Dumazet wrote:
>>>> +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD |
>>>> +				       (NETDEV_FRAG_ORDER ? __GFP_COMP : 0),
>>>> +				       NETDEV_FRAG_ORDER);
>>> I was wondering if you needed the check for NETDEV_FRAG_ORDER here.
>>> From what I can tell, setting __GFP_COMP for an order-0 page has no
>>> effect, since it only seems to get checked in prep_new_page, and that is
>>> after a check to verify whether the page is order 0 or not.
>> Good point, it seems some net drivers could be changed to remove
>> useless tests.
>>
>> I'll post some performance data as well.
> Here is the patch I tested here.
>
> Using 32768-byte allocations is actually nice for MTU=9000 traffic,
> since we can fit 3 frames per 32KB instead of only 2 frames (using the
> kmalloc-16384 slab).
>
> Also, I prefill page->_count with a high bias value, to avoid the
> get_page() we did for each allocated frag.
>
> In my profiles, the get_page() cost was dominant, because of false
> sharing with skb consumers (as they might run on different cpus)
>
> This way, when 32768 bytes are filled, we perform a single
> atomic_sub_return() and can recycle the page if we find we are the last
> user (this is what you did in your patch, when testing page->_count
> being 1)
This is working really nicely.  On my system put_page dropped to
somewhere near the bottom of the perf top runs I was doing.  In addition
netdev_alloc_frag dropped from about 4% CPU to 2%.

>
> Note: if I used max(PAGE_SIZE, 32768) for MAX_NETDEV_FRAGSIZE,
> gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE); strange...
The issue is probably the type checking in the max macro.  You might
have better luck using max_t and specifying a type.

> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..d31efa2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
>  struct netdev_alloc_cache {
>  	struct page *page;
>  	unsigned int offset;
> +	unsigned int pagecnt_bias;
>  };
>  static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
>  
> +#if PAGE_SIZE > 32768
> +#define MAX_NETDEV_FRAGSIZE	PAGE_SIZE
> +#else
> +#define MAX_NETDEV_FRAGSIZE	32768
> +#endif
> +
> +#define NETDEV_PAGECNT_BIAS	(MAX_NETDEV_FRAGSIZE /		\
> +				 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>  /**
>   * netdev_alloc_frag - allocate a page fragment
>   * @fragsz: fragment size
I'm assuming the reason for using the size of skb_shared_info here is
that you don't expect any requests to be smaller than that?  I
suppose that is reasonable, but is there any reason why this couldn't be
a smaller value such as SMP_CACHE_BYTES?
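
For a rough sense of scale (the 320-byte figure below is an assumption for
a 64-bit build, not something measured in this thread):

	/* If SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) came to 320 bytes:
	 *	NETDEV_PAGECNT_BIAS = 32768 / 320 = 102
	 * i.e. at most ~102 fragments fit in one 32KB block, so presetting
	 * page->_count to the bias can never underflow.  Using SMP_CACHE_BYTES
	 * (say 64) as the divisor would give a bias of 512, which still fits
	 * comfortably in the page reference count. */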

> @@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
>  	nc = &__get_cpu_var(netdev_alloc_cache);
>  	if (unlikely(!nc->page)) {
>  refill:
> -		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
> +				       get_order(MAX_NETDEV_FRAGSIZE));
> +		if (unlikely(!nc->page))
> +			goto end;
> +recycle:
> +		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
> +		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
>  		nc->offset = 0;
>  	}
> -	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> -			put_page(nc->page);
> -			goto refill;
> -		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> -		get_page(nc->page);
> +	if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
> +		if (!atomic_sub_return(nc->pagecnt_bias,
> +				       &nc->page->_count))
> +			goto recycle;
> +		goto refill;
>  	}
> +	data = page_address(nc->page) + nc->offset;
> +	nc->offset += fragsz;
> +	nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
> +end:
>  	local_irq_restore(flags);
>  	return data;
>  }
> @@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
>  			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> -	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
> +	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
>  		void *data = netdev_alloc_frag(fragsz);
>  
>  		if (likely(data)) {
>
>
One idea I have been trying to figure out how to make work would
be to run the offset in the reverse direction, so that you start
it at MAX_NETDEV_FRAGSIZE and work your way back down to 0.  The
advantage of that approach is that you then only have to perform one
check instead of two, and you can drop a goto.  The only problem I have
been running into is that it doesn't seem to perform as well as your
patch, but I am still working out the details.
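
A rough sketch of how the reverse-direction offset might combine with the
bias trick (illustrative only, not the patch as eventually submitted):

	/* body sketch; assumes the same prologue as today's netdev_alloc_frag() */
	if (unlikely(!nc->page)) {
refill:
		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
				       get_order(MAX_NETDEV_FRAGSIZE));
		if (unlikely(!nc->page))
			goto end;
recycle:
		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
		nc->offset = MAX_NETDEV_FRAGSIZE;	/* count down instead of up */
	}
	if (nc->offset < fragsz) {
		/* block exhausted: recycle in place if we were the last user */
		if (!atomic_sub_return(nc->pagecnt_bias, &nc->page->_count))
			goto recycle;
		goto refill;
	}
	nc->offset -= fragsz;
	nc->pagecnt_bias--;	/* one reference handed out per fragment */
	data = page_address(nc->page) + nc->offset;
end:
	local_irq_restore(flags);
	return data;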

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-22 12:33           ` Eric Dumazet
  2012-06-23  0:17             ` Alexander Duyck
@ 2012-06-29 23:04             ` Alexander Duyck
  2012-06-30  8:39               ` Eric Dumazet
  1 sibling, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2012-06-29 23:04 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On 06/22/2012 05:33 AM, Eric Dumazet wrote:
> Here is the patch I tested here.
>
> Using 32768-byte allocations is actually nice for MTU=9000 traffic,
> since we can fit 3 frames per 32KB instead of only 2 frames (using the
> kmalloc-16384 slab).
>
> Also, I prefill page->_count with a high bias value, to avoid the
> get_page() we did for each allocated frag.
>
> In my profiles, the get_page() cost was dominant, because of false
> sharing with skb consumers (as they might run on different cpus)
>
> This way, when 32768 bytes are filled, we perform a single
> atomic_sub_return() and can recycle the page if we find we are the last
> user (this is what you did in your patch, when testing page->_count
> being 1)
>
> Note: if I used max(PAGE_SIZE, 32768) for MAX_NETDEV_FRAGSIZE,
> gcc was not able to optimise get_order(MAX_NETDEV_FRAGSIZE); strange...
>
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 5b21522..d31efa2 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -296,9 +296,18 @@ EXPORT_SYMBOL(build_skb);
>  struct netdev_alloc_cache {
>  	struct page *page;
>  	unsigned int offset;
> +	unsigned int pagecnt_bias;
>  };
>  static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
>  
> +#if PAGE_SIZE > 32768
> +#define MAX_NETDEV_FRAGSIZE	PAGE_SIZE
> +#else
> +#define MAX_NETDEV_FRAGSIZE	32768
> +#endif
> +
> +#define NETDEV_PAGECNT_BIAS	(MAX_NETDEV_FRAGSIZE /		\
> +				 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
>  /**
>   * netdev_alloc_frag - allocate a page fragment
>   * @fragsz: fragment size
> @@ -316,18 +325,25 @@ void *netdev_alloc_frag(unsigned int fragsz)
>  	nc = &__get_cpu_var(netdev_alloc_cache);
>  	if (unlikely(!nc->page)) {
>  refill:
> -		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
> +		nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP,
> +				       get_order(MAX_NETDEV_FRAGSIZE));
> +		if (unlikely(!nc->page))
> +			goto end;
> +recycle:
> +		atomic_set(&nc->page->_count, NETDEV_PAGECNT_BIAS);
> +		nc->pagecnt_bias = NETDEV_PAGECNT_BIAS;
>  		nc->offset = 0;
>  	}
> -	if (likely(nc->page)) {
> -		if (nc->offset + fragsz > PAGE_SIZE) {
> -			put_page(nc->page);
> -			goto refill;
> -		}
> -		data = page_address(nc->page) + nc->offset;
> -		nc->offset += fragsz;
> -		get_page(nc->page);
> +	if (nc->offset + fragsz > MAX_NETDEV_FRAGSIZE) {
> +		if (!atomic_sub_return(nc->pagecnt_bias,
> +				       &nc->page->_count))
> +			goto recycle;
> +		goto refill;
>  	}
> +	data = page_address(nc->page) + nc->offset;
> +	nc->offset += fragsz;
> +	nc->pagecnt_bias--; /* avoid get_page()/get_page() false sharing */
> +end:
>  	local_irq_restore(flags);
>  	return data;
>  }
> @@ -353,7 +369,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
>  	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
>  			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> -	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
> +	if (fragsz <= MAX_NETDEV_FRAGSIZE && !(gfp_mask & __GFP_WAIT)) {
>  		void *data = netdev_alloc_frag(fragsz);
>  
>  		if (likely(data)) {
>
>
I was wondering if there are any plans to clean this patch up and
submit it to net-next.  If not, I can probably work on that, since this
addresses the concerns I had in my original patch.

Thanks,

Alex


* Re: [PATCH] net: Update netdev_alloc_frag to work more efficiently with TCP and GRO
  2012-06-29 23:04             ` Alexander Duyck
@ 2012-06-30  8:39               ` Eric Dumazet
  0 siblings, 0 replies; 19+ messages in thread
From: Eric Dumazet @ 2012-06-30  8:39 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Alexander Duyck, netdev, davem, jeffrey.t.kirsher

On Fri, 2012-06-29 at 16:04 -0700, Alexander Duyck wrote:

> I was wondering if there are any plans to clean this patch up and
> submit it to net-next.  If not, I can probably work on that, since this
> addresses the concerns I had in my original patch.
> 

I used this patch for a while on my machines, but I am working on
something allowing a fallback to order-0 allocations if memory gets
fragmented.  This fallback should almost never be needed, but we should
have it just in case?
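
A minimal sketch of that fallback idea (the helper name and the extra
frag_size field are assumptions on my part, not the eventual
implementation):

/* Try a high-order block first, then fall back to a single order-0 page
 * when memory is fragmented.  Assumes netdev_alloc_cache gains a
 * frag_size field recording how large the current block really is. */
static struct page *netdev_frag_refill(struct netdev_alloc_cache *nc)
{
	nc->page = alloc_pages(GFP_ATOMIC | __GFP_COLD | __GFP_COMP |
			       __GFP_NOWARN | __GFP_NORETRY,
			       get_order(MAX_NETDEV_FRAGSIZE));
	nc->frag_size = MAX_NETDEV_FRAGSIZE;
	if (unlikely(!nc->page)) {
		/* order-0 allocations should almost never fail */
		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
		nc->frag_size = PAGE_SIZE;
	}
	return nc->page;
}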
