* [PATCH] net: use hardware buffer pool to allocate skb
@ 2014-10-15  3:26 Pan Jiafei
  2014-10-15  4:15 ` Eric Dumazet
                   ` (4 more replies)
  0 siblings, 5 replies; 38+ messages in thread
From: Pan Jiafei @ 2014-10-15  3:26 UTC (permalink / raw)
  To: davem, jkosina; +Cc: netdev, Jiafei.Pan, LeoLi, linux-doc

On some platforms, there are some hardware blocks provided
to manage buffers and improve performance. So in some cases,
it is expected that the packets received by some generic
NIC should be put into such hardware managed buffers
directly, so that such buffers can be released either by
the hardware or by the driver.

This patch provides general APIs that let a generic NIC use
hardware block managed buffers without any modification to the
generic NIC driver.
In this patch, the following fields are added to "net_device":
    void *hw_skb_priv;
    struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
    void (*free_hw_skb)(struct sk_buff *skb);
To let a generic NIC driver use hardware managed buffers,
"alloc_hw_skb" and "free_hw_skb" implement the allocation and
freeing of such buffers, and "hw_skb_priv" passes private data to
these two functions.

When a socket buffer is allocated through these APIs, the new
"hw_skb_state" field in struct "sk_buff" indicates that the buffer
is a hardware managed buffer and whether it should be freed by
software or by hardware.

Documentation on how to use this feature can be found at
<file:Documentation/networking/hw_skb.txt>.

Signed-off-by: Pan Jiafei <Jiafei.Pan@freescale.com>
---
 Documentation/networking/hw_skb.txt | 117 ++++++++++++++++++++++++++++++++++++
 include/linux/netdevice.h           |   5 ++
 include/linux/skbuff.h              |  16 +++++
 net/Kconfig                         |  10 +++
 net/core/skbuff.c                   |  28 +++++++++
 5 files changed, 176 insertions(+)
 create mode 100644 Documentation/networking/hw_skb.txt

diff --git a/Documentation/networking/hw_skb.txt b/Documentation/networking/hw_skb.txt
new file mode 100644
index 0000000..256f3fc
--- /dev/null
+++ b/Documentation/networking/hw_skb.txt
@@ -0,0 +1,117 @@
+Documentation for using hardware managed SKBs.
+
+1. Description
+
+On some platforms, there are some hardware blocks provided
+to manage buffers and improve performance. So in some cases,
+it is expected that the packets received by some generic
+NIC should be put into such hardware managed buffers
+directly, so that such buffers can be released either by
+the hardware or by the driver.
+
+2. Related Struct Definition
+
+General APIs are provided so that a generic NIC can use hardware
+block managed buffers without any modification to the generic NIC
+driver.
+
+1) Kernel Configuration Item
+
+	"CONFIG_USE_HW_SKB"
+
+2) The DEVICE structure
+
+	struct net_device {
+		...
+	#ifdef CONFIG_USE_HW_SKB
+		void *hw_skb_priv;
+		struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
+		void (*free_hw_skb)(struct sk_buff *skb);
+	#endif
+		...
+	}
+
+"hw_skb_priv" is private data for "alloc_hw_skb" and "free_hw_skb" functions.
+"alloc_hw_skb" is for allocating skb by using hardware managed buffer.
+"free_hw_skb" is for freeing skb allocated by hardware manager buffer.
+
+3) struct sk_buff - socket buffer
+
+	struct sk_buff {
+		...
+	#ifdef CONFIG_USE_HW_SKB
+		__u32 			hw_skb_state;
+		void			*hw_skb_priv;
+		void 			(*free_hw_skb)(struct sk_buff *skb);
+	#endif
+		...
+	}
+
+	/* hw_skb_state list */
+	enum hw_skb_state {
+		/* If set, the SKB uses a hardware managed buffer */
+		IS_HW_SKB = 1 << 0,
+		/* If set, the skb can be freed by software by calling
+		 * netdev->free_hw_skb
+		 */
+		HW_SKB_SW_FREE = 1 << 1,
+	};
+
+"hw_skb_priv" and "free_hw_skb" are the same with the field in the
+struct "net_device"
+
+After calling "alloc_hw_skb" to allocate skb by using hardware managed
+buffers, "hw_skb_priv" and "free_hw_skb" is set in SKB driver:
+	skb->hw_skb_priv = dev->hw_skb_priv;
+	skb->free_hw_skb = dev->free_hw_skb;
+So that when "struct net_device	*dev" is changed after the skb is allocated,
+It is be confirmed that this skb can be freed by the method synced
+with allocation.
+
+"hw_skb_state" indicates that the state of SKB. When the skb is allocated
+by "alloc_hw_skb" function, the flag of "IS_HW_SKB" is set by
+"__netdev_alloc_skb" function in skbuff.c when returned from "alloc_hw_skb".
+But in "alloc_hw_skb", "HW_SKB_SW_FREE" must be set if the skb should be
+freed by calling "free_hw_skb", otherwise, the skb will never be freed by
+any driver until it is freed by hardware block.
+
+SKB using hardware managed buffer is not recycleable.
+
+3. How to use this feature
+
+For example, driver "A" wants the third-party NIC driver "B" to
+store received data in a hardware managed buffer and then hand it to "A".
+
+1) Select "CONFIG_USE_HW_SKB" to enable this feature.
+
+2) In driver "A", implement the functions "alloc_hw_skb" and
+"free_hw_skb". For example:
+
+struct sk_buff *alloc_hw_skb(void *priv, unsigned int length)
+{
+	void *buf = alloc_hw_buffer();
+	struct sk_buff *skb = build_skb(buf, ...);
+	if (skb)
+		skb->hw_skb_state |= HW_SKB_SW_FREE;
+
+	return skb;
+}
+
+void free_hw_skb(struct sk_buff *skb)
+{
+	free_hw_buffer(skb->head);
+}
+
+3) In driver "A", get the "net_device" handle of the net device driven
+by driver "B", and set the hooks on it:
+	...
+	net_dev_b->hw_skb_priv = priv;
+	net_dev_b->alloc_hw_skb = alloc_hw_skb;
+	net_dev_b->free_hw_skb = free_hw_skb;
+	...
+
+4) Then, when driver "B" allocates an skb, "alloc_hw_skb" is
+called first to allocate a hardware managed skb. If that fails,
+a normal skb is allocated instead. If it succeeds, the skb is
+freed by calling "free_hw_skb" when "kfree_skb" is later called
+on this skb.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 838407a..42b6158 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1689,6 +1689,11 @@ struct net_device {
 	struct lock_class_key *qdisc_tx_busylock;
 	int group;
 	struct pm_qos_request	pm_qos_req;
+#ifdef CONFIG_USE_HW_SKB
+	void *hw_skb_priv;
+	struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
+	void (*free_hw_skb)(struct sk_buff *skb);
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 776104b..d9afdeb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -436,6 +436,16 @@ static inline u32 skb_mstamp_us_delta(const struct skb_mstamp *t1,
 }
 
 
+/* hw_skb_state list */
+enum hw_skb_state {
+	/* If set, the SKB uses a hardware managed buffer */
+	IS_HW_SKB = 1 << 0,
+	/* If set, the skb can be freed by software by calling
+	 * netdev->free_hw_skb
+	 */
+	HW_SKB_SW_FREE = 1 << 1,
+};
+
 /** 
  *	struct sk_buff - socket buffer
  *	@next: Next buffer in list
@@ -646,6 +656,12 @@ struct sk_buff {
 	__u16			network_header;
 	__u16			mac_header;
 
+#ifdef CONFIG_USE_HW_SKB
+	__u32			hw_skb_state;
+	void			*hw_skb_priv;
+	void			(*free_hw_skb)(struct sk_buff *skb);
+#endif
+
 	__u32			headers_end[0];
 
 	/* These elements must be at the end, see alloc_skb() for details.  */
diff --git a/net/Kconfig b/net/Kconfig
index d6b138e..346e021 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
 	  with many clients some protection against DoS by a single (spoofed)
 	  flow that greatly exceeds average workload.
 
+config USE_HW_SKB
+	bool "NIC use hardware managed buffer to build skb"
+	depends on INET
+	---help---
+	  If this is selected, third-party drivers can use hardware managed
+	  buffers to allocate SKBs without any modification to the driver.
+
+	  Documentation on how to use this feature can be found at
+	  <file:Documentation/networking/hw_skb.txt>.
+
 menu "Network testing"
 
 config NET_PKTGEN
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7b3df0d..f8603e5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -415,6 +415,19 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
+#ifdef CONFIG_USE_HW_SKB
+	if (dev->alloc_hw_skb) {
+		skb = dev->alloc_hw_skb(dev->hw_skb_priv, length);
+		if (likely(skb)) {
+			skb->hw_skb_state |= IS_HW_SKB;
+			skb->hw_skb_priv = dev->hw_skb_priv;
+			skb->free_hw_skb = dev->free_hw_skb;
+			skb_reserve(skb, NET_SKB_PAD);
+			skb->dev = dev;
+			return skb;
+		}
+	}
+#endif
 	if (fragsz <= PAGE_SIZE && !(gfp_mask & (__GFP_WAIT | GFP_DMA))) {
 		void *data;
 
@@ -432,6 +445,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
 		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask,
 				  SKB_ALLOC_RX, NUMA_NO_NODE);
 	}
+
 	if (likely(skb)) {
 		skb_reserve(skb, NET_SKB_PAD);
 		skb->dev = dev;
@@ -483,6 +497,15 @@ static void skb_clone_fraglist(struct sk_buff *skb)
 
 static void skb_free_head(struct sk_buff *skb)
 {
+#ifdef CONFIG_USE_HW_SKB
+	if (skb->hw_skb_state & IS_HW_SKB) {
+		if (skb->hw_skb_state & HW_SKB_SW_FREE) {
+			BUG_ON(!skb->free_hw_skb);
+			skb->free_hw_skb(skb);
+		}
+		return;
+	}
+#endif
 	if (skb->head_frag)
 		put_page(virt_to_head_page(skb->head));
 	else
@@ -506,6 +529,10 @@ static void skb_release_data(struct sk_buff *skb)
 	 * If skb buf is from userspace, we need to notify the caller
 	 * the lower device DMA has done;
 	 */
+#ifdef CONFIG_USE_HW_SKB
+	if (skb->hw_skb_state & IS_HW_SKB)
+		goto skip_callback;
+#endif
 	if (shinfo->tx_flags & SKBTX_DEV_ZEROCOPY) {
 		struct ubuf_info *uarg;
 
@@ -514,6 +541,7 @@ static void skb_release_data(struct sk_buff *skb)
 			uarg->callback(uarg, true);
 	}
 
+skip_callback:
 	if (shinfo->frag_list)
 		kfree_skb_list(shinfo->frag_list);
 
-- 
2.1.0.27.g96db324


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
@ 2014-10-15  4:15 ` Eric Dumazet
  2014-10-15  4:26   ` David Miller
  2014-10-15  5:43   ` Jiafei.Pan
  2014-10-15  4:25 ` David Miller
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-10-15  4:15 UTC (permalink / raw)
  To: Pan Jiafei; +Cc: davem, jkosina, netdev, LeoLi, linux-doc

On Wed, 2014-10-15 at 11:26 +0800, Pan Jiafei wrote:
> In some platform, there are some hardware block provided
> to manage buffers to improve performance. So in some case,
> it is expected that the packets received by some generic
> NIC should be put into such hardware managed buffers
> directly, so that such buffer can be released by hardware
> or by driver.

You repeat 'some' four times.

> 
> This patch provide such general APIs for generic NIC to
> use hardware block managed buffers without any modification
> for generic NIC drivers.

...

> In this patch, the following fields are added to "net_device":
>     void *hw_skb_priv;
>     struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
>     void (*free_hw_skb)(struct sk_buff *skb);
> so in order to let generic NIC driver to use hardware managed
> buffers, the function "alloc_hw_skb" and "free_hw_skb"
> provide implementation for allocate and free hardware managed
> buffers. "hw_skb_priv" is provided to pass some private data for
> these two functions.
> 
> When the socket buffer is allocated by these APIs, "hw_skb_state"
> is provided in struct "sk_buff". this argument can indicate
> that the buffer is hardware managed buffer, this buffer
> should freed by software or by hardware.
> 
> Documentation on how to use this featue can be found at
> <file:Documentation/networking/hw_skb.txt>.
> 
> Signed-off-by: Pan Jiafei <Jiafei.Pan@freescale.com>


I am giving a strong NACK, of course.

We are not going to grow sk_buff and add yet another conditional in fast
path for a very obscure feature like that.

Memory management is not going to be done by drivers.

The way it should work is that if your hardware has specific needs, rx
and tx paths (of the driver) need to make the needed adaptation.
Not the other way.

We already have complex skb layouts, we do not need a new one. 

Take a look at how drivers can 'lock' pages already, and build skbs with
page frags. It is already there.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
  2014-10-15  4:15 ` Eric Dumazet
@ 2014-10-15  4:25 ` David Miller
  2014-10-15  5:34   ` Jiafei.Pan
  2014-10-15  4:59 ` Oliver Hartkopp
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 38+ messages in thread
From: David Miller @ 2014-10-15  4:25 UTC (permalink / raw)
  To: Jiafei.Pan; +Cc: jkosina, netdev, LeoLi, linux-doc

From: Pan Jiafei <Jiafei.Pan@freescale.com>
Date: Wed, 15 Oct 2014 11:26:11 +0800

> In some platform, there are some hardware block provided
> to manage buffers to improve performance. So in some case,
> it is expected that the packets received by some generic
> NIC should be put into such hardware managed buffers
> directly, so that such buffer can be released by hardware
> or by driver.
> 
> This patch provide such general APIs for generic NIC to
> use hardware block managed buffers without any modification
> for generic NIC drivers.

Why are existing interfaces insufficient for your needs?

Several ethernet drivers already build SKBs from block
buffer pools.

They allocate pools of pages which the hardware divides into various
powers of 2, then the RX descriptor says what pieces of which pools
were used to hold the data for a packet, and then the SKB is
constructed with page frags pointing to those locations.

It's very cheap, and existing APIs are considered to
cover all use cases.
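
For illustration only, a rough sketch of that existing pattern; every name
here (my_ring, my_rx_desc, MY_RX_BUFSZ, ...) is hypothetical and this is not
code from any in-tree driver:

#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MY_RX_BUFSZ	2048	/* one piece of a pooled 4K page */
#define MY_RX_HDR_LEN	256	/* linear-area budget for headers */

struct my_rx_buffer {
	struct page	*page;
	unsigned int	page_offset;
};

struct my_rx_desc {
	__le16		len;
	__le16		buf_idx;
};

struct my_ring {
	struct net_device	*netdev;
	struct my_rx_buffer	*rx_buf;
};

/* RX completion: the descriptor says which piece of which pooled page
 * holds the packet; attach that piece to the skb as a page frag
 * instead of copying it.
 */
static struct sk_buff *my_build_rx_skb(struct my_ring *ring,
				       struct my_rx_desc *desc)
{
	struct my_rx_buffer *buf = &ring->rx_buf[le16_to_cpu(desc->buf_idx)];
	unsigned int len = le16_to_cpu(desc->len);
	struct sk_buff *skb;

	/* small skb; only the headers will end up in the linear area */
	skb = netdev_alloc_skb_ip_align(ring->netdev, MY_RX_HDR_LEN);
	if (!skb)
		return NULL;

	/* hold the page so the driver can keep reusing its other
	 * pieces while the stack owns this frag
	 */
	get_page(buf->page);
	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, buf->page,
			buf->page_offset, len, MY_RX_BUFSZ);

	/* headers are then pulled into the linear area (e.g. via
	 * pskb_may_pull()) before the skb is passed up the stack
	 */
	return skb;
}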

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  4:15 ` Eric Dumazet
@ 2014-10-15  4:26   ` David Miller
  2014-10-15  5:43   ` Jiafei.Pan
  1 sibling, 0 replies; 38+ messages in thread
From: David Miller @ 2014-10-15  4:26 UTC (permalink / raw)
  To: eric.dumazet; +Cc: Jiafei.Pan, jkosina, netdev, LeoLi, linux-doc

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 14 Oct 2014 21:15:03 -0700

> Take a look at how drivers can 'lock' pages already, and build skb
> sith page frags. It is already there.

+1

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
  2014-10-15  4:15 ` Eric Dumazet
  2014-10-15  4:25 ` David Miller
@ 2014-10-15  4:59 ` Oliver Hartkopp
  2014-10-15  5:47   ` Jiafei.Pan
  2014-10-15  8:57 ` David Laight
  2014-10-15  9:33 ` Stephen Hemminger
  4 siblings, 1 reply; 38+ messages in thread
From: Oliver Hartkopp @ 2014-10-15  4:59 UTC (permalink / raw)
  To: Pan Jiafei, davem, jkosina; +Cc: netdev, LeoLi, linux-doc

On 15.10.2014 05:26, Pan Jiafei wrote:
> In some platform, there are some hardware block provided
> to manage buffers to improve performance.

(..)

> diff --git a/net/Kconfig b/net/Kconfig
> index d6b138e..346e021 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
>   	  with many clients some protection against DoS by a single (spoofed)
>   	  flow that greatly exceeds average workload.
>
> +config USE_HW_SKB
> +	bool "NIC use hardware managed buffer to build skb"
> +	depends on INET

The feature seems to be valid for network devices in general.
Why did you make it depend on INET ??

Regards,
Oliver

> +	---help---
> +	  If select this, the third party drivers will use hardware managed
> +	  buffers to allocate SKB without any modification for the driver.
> +
> +	  Documentation on how to use this featue can be found at
> +	  <file:Documentation/networking/hw_skb.txt>.
> +
>   menu "Network testing"
>
>   config NET_PKTGEN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  4:25 ` David Miller
@ 2014-10-15  5:34   ` Jiafei.Pan
  2014-10-15  9:15     ` Eric Dumazet
  2014-10-15 15:51     ` David Miller
  0 siblings, 2 replies; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-15  5:34 UTC (permalink / raw)
  To: David Miller; +Cc: jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, October 15, 2014 12:25 PM
> To: Pan Jiafei-B37022
> Cc: jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472; linux-
> doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> From: Pan Jiafei <Jiafei.Pan@freescale.com>
> Date: Wed, 15 Oct 2014 11:26:11 +0800
> 
> > In some platform, there are some hardware block provided
> > to manage buffers to improve performance. So in some case,
> > it is expected that the packets received by some generic
> > NIC should be put into such hardware managed buffers
> > directly, so that such buffer can be released by hardware
> > or by driver.
> >
> > This patch provide such general APIs for generic NIC to
> > use hardware block managed buffers without any modification
> > for generic NIC drivers.
> 
> Why are existing interfaces insufficent for your needs?
> 
> Several ethernet drivers already build SKBs from block
> buffer pools.
> 
Yes, but in order to do that we would have to modify the Ethernet
drivers. For example, if driver A wants drivers B, C, D.. to support driver
A's hardware block access functions, we have to modify drivers B, C, D...
That becomes very complex.

But with my patch, I just set a flag (and the hooks) on an Ethernet device
(I don't care which driver it uses) from driver A, in order to make this
Ethernet device use the hardware block access functions provided by
driver A.

> They allocate pools of pages which the hardware divides into various
> powers of 2, then the RX descriptor says what pieces of which pools
> were used to hold the data for a packet, and then the SKB is
> constructed with page frags pointing to those locations.
> 
> It's very cheap, inexpensive, and existing APIs are considered to
> cover all use cases.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  4:15 ` Eric Dumazet
  2014-10-15  4:26   ` David Miller
@ 2014-10-15  5:43   ` Jiafei.Pan
  2014-10-15  9:15     ` Eric Dumazet
  1 sibling, 1 reply; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-15  5:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan



> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, October 15, 2014 12:15 PM
> To: Pan Jiafei-B37022
> Cc: davem@davemloft.net; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-
> R58472; linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On Wed, 2014-10-15 at 11:26 +0800, Pan Jiafei wrote:
> > In some platform, there are some hardware block provided
> > to manage buffers to improve performance. So in some case,
> > it is expected that the packets received by some generic
> > NIC should be put into such hardware managed buffers
> > directly, so that such buffer can be released by hardware
> > or by driver.
> 
> You repeat 'some' four times.
> 
> >
> > This patch provide such general APIs for generic NIC to
> > use hardware block managed buffers without any modification
> > for generic NIC drivers.
> 
> ...
> 
> > In this patch, the following fields are added to "net_device":
> >     void *hw_skb_priv;
> >     struct sk_buff *(*alloc_hw_skb)(void *hw_skb_priv, unsigned int length);
> >     void (*free_hw_skb)(struct sk_buff *skb);
> > so in order to let generic NIC driver to use hardware managed
> > buffers, the function "alloc_hw_skb" and "free_hw_skb"
> > provide implementation for allocate and free hardware managed
> > buffers. "hw_skb_priv" is provided to pass some private data for
> > these two functions.
> >
> > When the socket buffer is allocated by these APIs, "hw_skb_state"
> > is provided in struct "sk_buff". this argument can indicate
> > that the buffer is hardware managed buffer, this buffer
> > should freed by software or by hardware.
> >
> > Documentation on how to use this featue can be found at
> > <file:Documentation/networking/hw_skb.txt>.
> >
> > Signed-off-by: Pan Jiafei <Jiafei.Pan@freescale.com>
> 
> 
> I am giving a strong NACK, of course.
> 
> We are not going to grow sk_buff and add yet another conditional in fast
> path for a very obscure feature like that.
> 
> Memory management is not going to be done by drivers.
> 
> The way it should work is that if your hardware has specific needs, rx
> and tx paths (of the driver) need to make the needed adaptation.
> Not the other way.
> 
> We already have complex skb layouts, we do not need a new one.
> 
> Take a look at how drivers can 'lock' pages already, and build skb sith
> page frags. It is already there.
> 
In my case, there are some shortcomings to using page frags. Firstly, I would
have to modify each Ethernet driver to support it, and I don't know which
vendor's driver the customer will use. Secondly, it is not enough to only
build the skb from 'locked' pages; the buffer address comes from the hardware
block and must be aligned to the hardware block.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  4:59 ` Oliver Hartkopp
@ 2014-10-15  5:47   ` Jiafei.Pan
  0 siblings, 0 replies; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-15  5:47 UTC (permalink / raw)
  To: Oliver Hartkopp, davem, jkosina; +Cc: netdev, LeoLi, linux-doc, Jiafei.Pan



> -----Original Message-----
> From: Oliver Hartkopp [mailto:socketcan@hartkopp.net]
> Sent: Wednesday, October 15, 2014 12:59 PM
> To: Pan Jiafei-B37022; davem@davemloft.net; jkosina@suse.cz
> Cc: netdev@vger.kernel.org; Li Yang-Leo-R58472; linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On 15.10.2014 05:26, Pan Jiafei wrote:
> > In some platform, there are some hardware block provided
> > to manage buffers to improve performance.
> 
> (..)
[Pan Jiafei] I want to make a general patch that builds skbs
using the hardware buffer manager block.
> 
> > diff --git a/net/Kconfig b/net/Kconfig
> > index d6b138e..346e021 100644
> > --- a/net/Kconfig
> > +++ b/net/Kconfig
> > @@ -291,6 +291,16 @@ config NET_FLOW_LIMIT
> >   	  with many clients some protection against DoS by a single (spoofed)
> >   	  flow that greatly exceeds average workload.
> >
> > +config USE_HW_SKB
> > +	bool "NIC use hardware managed buffer to build skb"
> > +	depends on INET
> 
> The feature seems to be valid for network devices in general.
> Why did you make it depending on INET ??
> 
> Regards,
> Oliver
> 
[Pan Jiafei] Yes, INET dependency should be removed, thanks.
> > +	---help---
> > +	  If select this, the third party drivers will use hardware managed
> > +	  buffers to allocate SKB without any modification for the driver.
> > +
> > +	  Documentation on how to use this featue can be found at
> > +	  <file:Documentation/networking/hw_skb.txt>.
> > +
> >   menu "Network testing"
> >
> >   config NET_PKTGEN


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
                   ` (2 preceding siblings ...)
  2014-10-15  4:59 ` Oliver Hartkopp
@ 2014-10-15  8:57 ` David Laight
  2014-10-15  9:33 ` Stephen Hemminger
  4 siblings, 0 replies; 38+ messages in thread
From: David Laight @ 2014-10-15  8:57 UTC (permalink / raw)
  To: 'Pan Jiafei', davem, jkosina; +Cc: netdev, LeoLi, linux-doc

From: Pan Jiafei
> In some platform, there are some hardware block provided
> to manage buffers to improve performance. So in some case,
> it is expected that the packets received by some generic
> NIC should be put into such hardware managed buffers
> directly, so that such buffer can be released by hardware
> or by driver.

This looks like some strange variant of 'buffer loaning'.
In general it just doesn't work due to the limited number
of such buffers - they soon all become queued waiting for
applications to read from sockets.

It also isn't at all clear how you expect a 'generic NIC'
to actually allocate buffers from your 'special area'.

	David




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  5:43   ` Jiafei.Pan
@ 2014-10-15  9:15     ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-10-15  9:15 UTC (permalink / raw)
  To: Jiafei.Pan; +Cc: davem, jkosina, netdev, LeoLi, linux-doc

On Wed, 2014-10-15 at 05:43 +0000, Jiafei.Pan@freescale.com wrote:

> For my case, there are some shortcoming to use page frags. Firstly, I have to
> modify each Ethernet drivers to support it especially I don’t which vendor's
> driver the customer will use. Secondly, it is not enough only 
> build skb by 'lock' pages, the buffer address comes from hardware block and
> should be aligned to hardware block.

So you align to hardware block. What is the problem with this ?



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  5:34   ` Jiafei.Pan
@ 2014-10-15  9:15     ` Eric Dumazet
  2014-10-16  2:17       ` Jiafei.Pan
  2014-10-16  2:17       ` Jiafei.Pan
  2014-10-15 15:51     ` David Miller
  1 sibling, 2 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-10-15  9:15 UTC (permalink / raw)
  To: Jiafei.Pan; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc

On Wed, 2014-10-15 at 05:34 +0000, Jiafei.Pan@freescale.com wrote:
 
> Yes, for this matter, in order to do this we should modify the Ethernet
> drivers. For example, driver A want to driver B, C, D.. to support driver
> A's Hardware block access functions, so we have to modify driver B, C, D...
> It will be so complex for this matter.
> 
> But by using my patch, I just modify a Ethernet device (I don't care
> Which driver it is used) flag in driver A in order to implement this
> Ethernet device using hardware block access functions provided by
> Driver A.

We care a lot about all the bugs added by your patches. You have little
idea of how many of them were added. We do not want to spend days of
work explaining everything or fixing all the details for you.

Carefully read net/core/skbuff.c, net/core/dev.c, and the GRO layer; you'll
see how many spots you missed.

You cannot control how skbs are cooked before reaching your driver
ndo_start_xmit(). You are not going to add hooks in UDP , TCP, or other
drivers RX path. This would be absolutely insane.

Trying to control how skb are cooked in RX path is absolutely something
drivers do, using page frags that are read-only by all the stack.

Fix your driver to use existing infra, your suggestion is not going to
be accepted.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
                   ` (3 preceding siblings ...)
  2014-10-15  8:57 ` David Laight
@ 2014-10-15  9:33 ` Stephen Hemminger
  2014-10-16  2:30   ` Jiafei.Pan
  4 siblings, 1 reply; 38+ messages in thread
From: Stephen Hemminger @ 2014-10-15  9:33 UTC (permalink / raw)
  To: Pan Jiafei; +Cc: davem, jkosina, netdev, LeoLi, linux-doc

Since an skb can sit forever in an application queue, you have created
an easy way to livelock the system when enough skb's are waiting to be
read.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  5:34   ` Jiafei.Pan
  2014-10-15  9:15     ` Eric Dumazet
@ 2014-10-15 15:51     ` David Miller
  1 sibling, 0 replies; 38+ messages in thread
From: David Miller @ 2014-10-15 15:51 UTC (permalink / raw)
  To: Jiafei.Pan; +Cc: jkosina, netdev, LeoLi, linux-doc

From: "Jiafei.Pan@freescale.com" <Jiafei.Pan@freescale.com>
Date: Wed, 15 Oct 2014 05:34:24 +0000

> Yes, for this matter, in order to do this we should modify the Ethernet
> drivers. For example, driver A want to driver B, C, D.. to support driver
> A's Hardware block access functions, so we have to modify driver B, C, D...
> It will be so complex for this matter.

Experience says otherwise.  It's three or four lines of code to attach
a page to an SKB frag.

The driver needs its own buffer management and setup code anyways,
and no generic facility will replace that.

I think your precondition for these changes therefore doesn't really
exist.

Please, look over all of the drivers that exist already in the tree
and build SKBs using page frags from hardware device buffer pools.

You have to show us that all of those drivers can make use of your
facility.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  9:15     ` Eric Dumazet
@ 2014-10-16  2:17       ` Jiafei.Pan
  2014-10-16  4:15         ` Eric Dumazet
  2014-10-16  2:17       ` Jiafei.Pan
  1 sibling, 1 reply; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-16  2:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, October 15, 2014 5:16 PM
> To: Pan Jiafei-B37022
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On Wed, 2014-10-15 at 05:34 +0000, Jiafei.Pan@freescale.com wrote:
> 
> > Yes, for this matter, in order to do this we should modify the Ethernet
> > drivers. For example, driver A want to driver B, C, D.. to support driver
> > A's Hardware block access functions, so we have to modify driver B, C, D...
> > It will be so complex for this matter.
> >
> > But by using my patch, I just modify a Ethernet device (I don't care
> > Which driver it is used) flag in driver A in order to implement this
> > Ethernet device using hardware block access functions provided by
> > Driver A.
> 
> We care a lot of all the bugs added by your patches. You have little
> idea of how many of them were added. We do not want to spend days of
> work explaining everything or fixing all the details for you.
> 
> Carefully read net/core/skbuff.c, net/core/dev.c, GRO layer, you'll see
> how many spots you missed.
> 
> You cannot control how skbs are cooked before reaching your driver
> ndo_start_xmit(). You are not going to add hooks in UDP , TCP, or other
> drivers RX path. This would be absolutely insane.
> 

Thanks for your comments and suggestions. In my case, I want to build skbs
from memory specified by the hardware block, and I can only see two ways to
do it: one is to modify the net card driver to replace the common skb
allocation function with my special functions; the other is to hook the
common skb allocation function so that it redirects to my special functions.
My patch takes the second way. Besides these two ways, would you please give
me some advice on other approaches for my case? Thanks.

> Trying to control how skb are cooked in RX path is absolutely something
> drivers do, using page frags that are read-only by all the stack.
> 
> Fix your driver to use existing infra, your suggestion is not going to
> be accepted.
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  9:15     ` Eric Dumazet
  2014-10-16  2:17       ` Jiafei.Pan
@ 2014-10-16  2:17       ` Jiafei.Pan
  1 sibling, 0 replies; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-16  2:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan


> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Wednesday, October 15, 2014 5:16 PM
> To: Pan Jiafei-B37022
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On Wed, 2014-10-15 at 05:34 +0000, Jiafei.Pan@freescale.com wrote:
> 
> > Yes, for this matter, in order to do this we should modify the Ethernet
> > drivers. For example, driver A want to driver B, C, D.. to support driver
> > A's Hardware block access functions, so we have to modify driver B, C, D...
> > It will be so complex for this matter.
> >
> > But by using my patch, I just modify a Ethernet device (I don't care
> > Which driver it is used) flag in driver A in order to implement this
> > Ethernet device using hardware block access functions provided by
> > Driver A.
> 
> We care a lot of all the bugs added by your patches. You have little
> idea of how many of them were added. We do not want to spend days of
> work explaining everything or fixing all the details for you.
> 
> Carefully read net/core/skbuff.c, net/core/dev.c, GRO layer, you'll see
> how many spots you missed.
> 
> You cannot control how skbs are cooked before reaching your driver
> ndo_start_xmit(). You are not going to add hooks in UDP , TCP, or other
> drivers RX path. This would be absolutely insane.
> 
> Trying to control how skb are cooked in RX path is absolutely something
> drivers do, using page frags that are read-only by all the stack.
> 
> Fix your driver to use existing infra, your suggestion is not going to
> be accepted.
> 
I think my patch can connect a hardware buffer block with third-party
net card drivers; this should be a general requirement in order to get
better performance. Yes, there may be defects in my patch, but any comments
and suggestions toward this goal are welcome, and thanks.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-15  9:33 ` Stephen Hemminger
@ 2014-10-16  2:30   ` Jiafei.Pan
  0 siblings, 0 replies; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-16  2:30 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: davem, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Wednesday, October 15, 2014 5:33 PM
> To: Pan Jiafei-B37022
> Cc: davem@davemloft.net; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-
> R58472; linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> Since an skb can sit forever in an application queue, you have created
> an easy way to livelock the system when enough skb's are waiting to be
> read.

I think it is not possible to livelock the system, because in my patch
the function __netdev_alloc_skb first tries to allocate a hardware block
buffer if dev->alloc_hw_skb is set, and falls back to allocating a normal
skb buffer if the hardware block buffer allocation fails.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16  2:17       ` Jiafei.Pan
@ 2014-10-16  4:15         ` Eric Dumazet
  2014-10-16  5:15           ` Jiafei.Pan
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-16  4:15 UTC (permalink / raw)
  To: Jiafei.Pan; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc

On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:

> Thanks for your comments and suggestion. In my case, I want to build skb
> from hardware block specified memory, I only can see two ways, one is modified
> net card driver replace common skb allocation function with my specially
> functions, another way is to hack common skb allocation function in which
> redirect to my specially functions. My patch is just for the second way.
> Except these two ways, would you please give me some advice for some other
> ways for my case? Thanks

I suggest you read the numerous examples in drivers/net/ethernet.

No need to change anything  in net/* or include/*, really.

For a start, look at drivers/net/ethernet/intel/igb/igb_main.c

Mentioning 'hack' in your mails should simply hint that you are doing
something very wrong.

What makes you think your hardware is so special ?



^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16  4:15         ` Eric Dumazet
@ 2014-10-16  5:15           ` Jiafei.Pan
  2014-10-16 15:28             ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-16  5:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Thursday, October 16, 2014 12:15 PM
> To: Pan Jiafei-B37022
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
> 
> > Thanks for your comments and suggestion. In my case, I want to build skb
> > from hardware block specified memory, I only can see two ways, one is modified
> > net card driver replace common skb allocation function with my specially
> > functions, another way is to hack common skb allocation function in which
> > redirect to my specially functions. My patch is just for the second way.
> > Except these two ways, would you please give me some advice for some other
> > ways for my case? Thanks
> 
> I suggest you read drivers/net/ethernet numerous examples.
> 
> No need to change anything  in net/* or include/*, really.
> 
> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
> 
> Mentioning 'hack' in your mails simply should hint you are doing
> something very wrong.
> 
> What makes you think your hardware is so special ?
> 
In fact, I am developing a bridge driver; it can bridge between any
third-party net card and my own net card. My goal is to let any third-party
net card directly use my own net card's specified buffers, so that there is
no memory copy in the whole bridge process.
By the way, I don't see any similarity between igb_main.c and my case. Also,
my bridge can't be implemented with "skb frags" if it is to achieve zero
memory copy.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16  5:15           ` Jiafei.Pan
@ 2014-10-16 15:28             ` Alexander Duyck
  2014-10-16 16:57               ` Eric Dumazet
  2014-10-17  2:35               ` Jiafei.Pan
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-16 15:28 UTC (permalink / raw)
  To: Jiafei.Pan, Eric Dumazet; +Cc: David Miller, jkosina, netdev, LeoLi, linux-doc


On 10/15/2014 10:15 PM, Jiafei.Pan@freescale.com wrote:
>> -----Original Message-----
>> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
>> Sent: Thursday, October 16, 2014 12:15 PM
>> To: Pan Jiafei-B37022
>> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
>> linux-doc@vger.kernel.org
>> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
>>
>> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
>>
>>> Thanks for your comments and suggestion. In my case, I want to build skb
>>> from hardware block specified memory, I only can see two ways, one is modified
>>> net card driver replace common skb allocation function with my specially
>>> functions, another way is to hack common skb allocation function in which
>>> redirect to my specially functions. My patch is just for the second way.
>>> Except these two ways, would you please give me some advice for some other
>>> ways for my case? Thanks
>> I suggest you read drivers/net/ethernet numerous examples.
>>
>> No need to change anything  in net/* or include/*, really.
>>
>> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
>>
>> Mentioning 'hack' in your mails simply should hint you are doing
>> something very wrong.
>>
>> What makes you think your hardware is so special ?
>>
> In fact, I am developing a bridge driver, it can bridge between any other the
> third party net card and my own net card. My target is to let any other the
> third party net card can directly use my own net card specified buffer, then
> there will be no memory copy in the whole bridge process.
> By the way, I don’t see any similar between igb_main.c and my case. And also
> My bridge also can’t implemented with "skb frag" in order to aim at zero memory
> copy.

I think the part you are not getting is that this is how buffers are 
essentially handled now.  So for example, in the case of igb, the only 
part we have copied out is usually the header, or the entire frame in 
the case of small packets.  This has to happen in order to allow for 
changes to the header for routing and such.  Beyond that the frags that 
are passed are the buffers that igb is still holding onto.  So 
effectively what the other device transmits in a bridging/routing 
scenario is my own net card specified buffer plus the copied/modified 
header.
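
Roughly, that looks like the sketch below; MY_RX_HDR_LEN, MY_RX_BUFSZ and
my_receive_frame are made-up names, and this is not the actual igb code:

#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define MY_RX_BUFSZ	2048	/* size of one pooled buffer piece */
#define MY_RX_HDR_LEN	256	/* how much header we copy out */

/* Copy only the headers into skb->data, where the stack may rewrite
 * them for routing/bridging; the payload stays in the driver-owned
 * page and is handed up as a read-only frag.
 */
static struct sk_buff *my_receive_frame(struct net_device *dev,
					struct page *page,
					unsigned int offset,
					unsigned int len)
{
	/* real drivers compute the actual header length here */
	unsigned int hlen = min_t(unsigned int, len, MY_RX_HDR_LEN);
	void *va = page_address(page) + offset;
	struct sk_buff *skb;

	skb = netdev_alloc_skb_ip_align(dev, MY_RX_HDR_LEN);
	if (!skb)
		return NULL;

	/* headers (or the whole small frame) live in the linear area
	 * and can be modified freely
	 */
	memcpy(skb_put(skb, hlen), va, hlen);

	if (len > hlen) {
		/* payload remains in the driver's pooled page */
		get_page(page);
		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
				offset + hlen, len - hlen, MY_RX_BUFSZ);
	}

	return skb;
}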

For a brief period igb used build_skb but that isn't valid on most 
systems as memory mapped for a device can be overwritten if the page is 
unmapped resulting in any changes to the header for routing/bridging 
purposes being invalidated.  Thus we cannot use the buffers for both the 
skb->data header which may be changed and Rx DMA simultaneously.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 15:28             ` Alexander Duyck
@ 2014-10-16 16:57               ` Eric Dumazet
  2014-10-16 17:10                 ` Alexander Duyck
  2014-10-17  2:35               ` Jiafei.Pan
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-16 16:57 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc

On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:

> I think the part you are not getting is that is how buffers are 
> essentially handled now.  So for example in the case if igb the only 
> part we have copied out is usually the header, or the entire frame in 
> the case of small packets.  This has to happen in order to allow for 
> changes to the header for routing and such.  Beyond that the frags that 
> are passed are the buffers that igb is still holding onto.  So 
> effectively what the other device transmits in a bridging/routing 
> scenario is my own net card specified buffer plus the copied/modified 
> header.
> 
> For a brief period igb used build_skb but that isn't valid on most 
> systems as memory mapped for a device can be overwritten if the page is 
> unmapped resulting in any changes to the header for routing/bridging 
> purposes being invalidated.  Thus we cannot use the buffers for both the 
> skb->data header which may be changed and Rx DMA simultaneously.

This reminds me that igb still has skb->truesize underestimation by 100%

If a fragment is held in some socket receive buffer, a full page is
consumed, not 2048 bytes.

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index a21b14495ebd..56ca6c78985e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 	struct page *page = rx_buffer->page;
 	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
-	unsigned int truesize = IGB_RX_BUFSZ;
+	unsigned int segsize = IGB_RX_BUFSZ;
+	unsigned int truesize = PAGE_SIZE;
 #else
-	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
+	unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
+	unsigned int truesize = segsize;
 #endif
 
 	if ((size <= IGB_RX_HDR_LEN) && !skb_is_nonlinear(skb)) {
@@ -6614,7 +6616,7 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
 	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, page,
 			rx_buffer->page_offset, size, truesize);
 
-	return igb_can_reuse_rx_page(rx_buffer, page, truesize);
+	return igb_can_reuse_rx_page(rx_buffer, page, segsize);
 }
 
 static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring,
 


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 16:57               ` Eric Dumazet
@ 2014-10-16 17:10                 ` Alexander Duyck
  2014-10-16 17:45                   ` Eric Dumazet
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2014-10-16 17:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc


On 10/16/2014 09:57 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 08:28 -0700, Alexander Duyck wrote:
>
>> I think the part you are not getting is that is how buffers are
>> essentially handled now.  So for example in the case if igb the only
>> part we have copied out is usually the header, or the entire frame in
>> the case of small packets.  This has to happen in order to allow for
>> changes to the header for routing and such.  Beyond that the frags that
>> are passed are the buffers that igb is still holding onto.  So
>> effectively what the other device transmits in a bridging/routing
>> scenario is my own net card specified buffer plus the copied/modified
>> header.
>>
>> For a brief period igb used build_skb but that isn't valid on most
>> systems as memory mapped for a device can be overwritten if the page is
>> unmapped resulting in any changes to the header for routing/bridging
>> purposes being invalidated.  Thus we cannot use the buffers for both the
>> skb->data header which may be changed and Rx DMA simultaneously.
> This reminds me that igb still has skb->truesize underestimation by 100%
>
> If a fragment is held in some socket receive buffer, a full page is
> consumed, not 2048 bytes.
>
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
> index a21b14495ebd..56ca6c78985e 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -6586,9 +6586,11 @@ static bool igb_add_rx_frag(struct igb_ring *rx_ring,
>   	struct page *page = rx_buffer->page;
>   	unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
>   #if (PAGE_SIZE < 8192)
> -	unsigned int truesize = IGB_RX_BUFSZ;
> +	unsigned int segsize = IGB_RX_BUFSZ;
> +	unsigned int truesize = PAGE_SIZE;
>   #else
> -	unsigned int truesize = ALIGN(size, L1_CACHE_BYTES);
> +	unsigned int segsize = ALIGN(size, L1_CACHE_BYTES);
> +	unsigned int truesize = segsize;
>   #endif

So if a page is used twice, we are double counting the page size for the 
socket then, is that correct?  I just want to make sure, because prior to 
this patch both flows did the same thing and counted the portion of the 
page used in this pass; now, with this change, for a PAGE_SIZE of 4K we 
count the entire page, while for all other cases we count the portion of 
the page used.

Thanks,

Alex



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 17:10                 ` Alexander Duyck
@ 2014-10-16 17:45                   ` Eric Dumazet
  2014-10-16 18:20                     ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-16 17:45 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc

On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:

> So if a page is used twice we are double counting the page size for the 
> socket then, is that correct?  I just want to make sure because prior to 
> this patch both flows did the same thing and counted the portion of the 
> page used in this pass, now with this change for PAGE_SIZE of 4K we 
> count the entire page, and for all other cases we count the portion of 
> the page used.

When a page is split in 2 parts only, the probability that a segment holds
the 4K page is quite high (there is only a single half page).

When we split, say, 64KB into 42 segments, the probability that a single
segment holds the full 64KB block is very low, so we are almost safe when we
consider 'truesize = 1536'.

Of course there are pathological cases, but attacker has to be quite
smart.

I am just saying that counting 2048 might have a big impact on memory
consumption if all these incoming segments are stored a long time in
receive queues (TCP receive queues or out of order queues) : We might be
off by a factor of 2 on the real memory usage, and delay the TCP
collapsing too much.



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 17:45                   ` Eric Dumazet
@ 2014-10-16 18:20                     ` Alexander Duyck
  2014-10-16 21:40                       ` Eric Dumazet
  2014-10-17  9:11                       ` David Laight
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-16 18:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc


On 10/16/2014 10:45 AM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 10:10 -0700, Alexander Duyck wrote:
>
>> So if a page is used twice we are double counting the page size for the
>> socket then, is that correct?  I just want to make sure because prior to
>> this patch both flows did the same thing and counted the portion of the
>> page used in this pass, now with this change for PAGE_SIZE of 4K we
>> count the entire page, and for all other cases we count the portion of
>> the page used.
> When a page is split in 2 parts only, probability that a segment holds
> the 4K page is quite high (There is a single half page)

Actually, anything holding onto the 4K page for very long doesn't seem 
to occur, at least from the driver's perspective.  It 
is one of the reasons why I went for the page reuse approach rather than 
just partitioning a single large page.  It allows us to avoid having to 
call IOMMU map/unmap for the pages since the entire page is usually back 
in the driver ownership before we need to reuse the portion given to the 
stack.

> When we split say 64KB in 42 segments, the probability a single segment
> hold the full 64KB block is very low, so we can almost be safe when we
> consider 'truesize = 1536'

Yes, but the likelihood that only a few segments are holding the page is 
still very high.  So you might not have one segment holding the 64K 
page, but I find it very difficult to believe that all 42 would be 
holding it at the same time.  In that case should we be adding some 
portion of the 64K to the truesize for all frames to account for this?

> Of course there are pathological cases, but attacker has to be quite
> smart.
>
> I am just saying that counting 2048 might have a big impact on memory
> consumption if all these incoming segments are stored a long time in
> receive queues (TCP receive queues or out of order queues) : We might be
> off by a factor of 2 on the real memory usage, and delay the TCP
> collapsing too much.

My concern would be that we are off by a factor of 2 and prematurely 
collapse the TCP too soon with this change.  For example if you are 
looking at a socket that is holding pages for a long period of time 
there would be a good chance of it ending up with both halves of the 
page.  In this case is it fair to charge it for 8K or memory use when in 
reality it is only using 4K?

Thanks,

Alex


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 18:20                     ` Alexander Duyck
@ 2014-10-16 21:40                       ` Eric Dumazet
  2014-10-16 22:12                         ` Alexander Duyck
  2014-10-17  9:11                       ` David Laight
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-16 21:40 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc

On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:

> My concern would be that we are off by a factor of 2 and prematurely 
> collapse the TCP too soon with this change. 

That is the opposite actually. We can consume 4K but we pretend we
consume 2K in some worst cases.

>  For example if you are 
> looking at a socket that is holding pages for a long period of time 
> there would be a good chance of it ending up with both halves of the 
> page.  In this case is it fair to charge it for 8K or memory use when in 
> reality it is only using 4K?

It's better to collapse too soon than too late.

If you want to avoid collapses because one host has plenty of memory,
all you need to do is increase tcp_rmem.

Why are you referring to 8K ? PAGE_SIZE is 4K



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 21:40                       ` Eric Dumazet
@ 2014-10-16 22:12                         ` Alexander Duyck
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-16 22:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc


On 10/16/2014 02:40 PM, Eric Dumazet wrote:
> On Thu, 2014-10-16 at 11:20 -0700, Alexander Duyck wrote:
>
>> My concern would be that we are off by a factor of 2 and prematurely
>> collapse the TCP too soon with this change.
> That is the opposite actually. We can consume 4K but we pretend we
> consume 2K in some worst cases.

The only case where we consume the full 4K but only list it as 2K should 
be if we have memory from the wrong node and we want to flush it from 
the descriptor queue.  For all other cases we should be using the page 
at least twice per buffer.  So the first page that was assigned for 
an Rx descriptor might be flushed but then after that reuse should take 
hold and stay in place as long as the NAPI poll doesn't change NUMA nodes.

That should be no worse than the case where the remaining space in a 
large page is not large enough to use as a buffer. You still use the 
current size as your truesize, you don't include the overhead of the 
unused space in your calculation.

>>   For example if you are
>> looking at a socket that is holding pages for a long period of time
>> there would be a good chance of it ending up with both halves of the
>> page.  In this case is it fair to charge it for 8K or memory use when in
>> reality it is only using 4K?
> Its better to collapse too soon than too late.
>
> If you want to avoid collapses because one host has plenty of memory,
> all you need to do is increase tcp_rmem.
>
> Why are you referring to 8K ? PAGE_SIZE is 4K

The truesize would be reported as 8K vs 4K for 2 half pages with your 
change if we were to hand off both halves of a page to the same socket.

The 2K value makes sense and is consistent with how we handle this in 
other cases where we are partitioning pages for use as network buffers.  
I think increasing this to 4K is just going to cause performance issues 
as flows are going to get choked off prematurely for memory usage that 
they aren't actually getting.

Part of my hesitation is that I spent the last couple of years 
explaining to our performance testing team and customers that they need 
to adjust tcp_rmem with all of the changes that have been made to 
truesize and the base network drivers, and I think I would prefer it if 
I didn't have to go another round of it.  Then again I probably won't 
have to anyway since I am not doing drivers for Intel any more, but 
still my reaction to this kind of change is what it is.

Thanks,

Alex





^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 15:28             ` Alexander Duyck
  2014-10-16 16:57               ` Eric Dumazet
@ 2014-10-17  2:35               ` Jiafei.Pan
  2014-10-17 14:05                 ` Eric Dumazet
  1 sibling, 1 reply; 38+ messages in thread
From: Jiafei.Pan @ 2014-10-17  2:35 UTC (permalink / raw)
  To: Alexander Duyck, Eric Dumazet
  Cc: David Miller, jkosina, netdev, LeoLi, linux-doc, Jiafei.Pan


> -----Original Message-----
> From: Alexander Duyck [mailto:alexander.h.duyck@redhat.com]
> Sent: Thursday, October 16, 2014 11:28 PM
> To: Pan Jiafei-B37022; Eric Dumazet
> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> linux-doc@vger.kernel.org
> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> 
> 
> On 10/15/2014 10:15 PM, Jiafei.Pan@freescale.com wrote:
> >> -----Original Message-----
> >> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> >> Sent: Thursday, October 16, 2014 12:15 PM
> >> To: Pan Jiafei-B37022
> >> Cc: David Miller; jkosina@suse.cz; netdev@vger.kernel.org; Li Yang-Leo-R58472;
> >> linux-doc@vger.kernel.org
> >> Subject: Re: [PATCH] net: use hardware buffer pool to allocate skb
> >>
> >> On Thu, 2014-10-16 at 02:17 +0000, Jiafei.Pan@freescale.com wrote:
> >>
> >>> Thanks for your comments and suggestion. In my case, I want to build skb
> >>> from hardware block specified memory. I can only see two ways: one is to
> >>> modify the net card driver to replace the common skb allocation function
> >>> with my special functions; the other is to hack the common skb allocation
> >>> function so that it redirects to my special functions. My patch is just for
> >>> the second way. Apart from these two ways, would you please give me some
> >>> advice about other ways for my case? Thanks
> >> I suggest you read the numerous examples in drivers/net/ethernet.
> >>
> >> No need to change anything  in net/* or include/*, really.
> >>
> >> For a start, look at drivers/net/ethernet/intel/igb/igb_main.c
> >>
> >> Mentioning 'hack' in your mails simply should hint you are doing
> >> something very wrong.
> >>
> >> What makes you think your hardware is so special ?
> >>
> > In fact, I am developing a bridge driver; it can bridge between any other
> > third-party net card and my own net card. My target is to let any other
> > third-party net card directly use my own net card specified buffer, so that
> > there will be no memory copy in the whole bridge process.
> > By the way, I don’t see any similarity between igb_main.c and my case. Also,
> > my bridge can’t be implemented with "skb frag" in order to achieve zero
> > memory copy.
> 
> I think the part you are not getting is that this is how buffers are
> essentially handled now.  

[Pan Jiafei] Hi Alex, thanks for your comments. I am not sure you have
caught my concern. For example, I want to add an igb net card into my
bridge and want the igb net driver to allocate skbs using my specified
memory address, but I don’t want to modify the igb net driver directly.
How can I do this in my bridge driver?

Thanks,
Jiafei.

> So for example in the case of igb, the only
> part we have copied out is usually the header, or the entire frame in
> the case of small packets.  This has to happen in order to allow for
> changes to the header for routing and such.  Beyond that the frags that
> are passed are the buffers that igb is still holding onto.  So
> effectively what the other device transmits in a bridging/routing
> scenario is my own net card specified buffer plus the copied/modified
> header.
> 
> For a brief period igb used build_skb, but that isn't valid on most
> systems, as memory mapped for a device can be overwritten if the page is
> unmapped, resulting in any changes to the header for routing/bridging
> purposes being invalidated.  Thus we cannot use the buffers for both the
> skb->data header which may be changed and Rx DMA simultaneously.
> 
> Thanks,
> 
> Alex
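
For illustration, a rough sketch of the header-copy-plus-frag pattern
described above (simplified assumptions; netdev_alloc_skb(), __skb_put(),
skb_add_rx_frag() and page_address() are real kernel APIs, the sizes and
helper are made up for the example):

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define RX_HDR_LEN 256	/* illustrative copy budget for the headers */

static struct sk_buff *rx_build_skb(struct net_device *dev,
				    struct page *page, unsigned int offset,
				    unsigned int len)
{
	unsigned int hlen = min_t(unsigned int, len, RX_HDR_LEN);
	void *data = page_address(page) + offset;
	struct sk_buff *skb;

	skb = netdev_alloc_skb(dev, RX_HDR_LEN);
	if (!skb)
		return NULL;

	/* copy the headers (or the whole small frame) so they can be edited */
	memcpy(__skb_put(skb, hlen), data, hlen);
	/* the remaining payload stays in the driver's page, attached as a frag */
	if (len > hlen)
		skb_add_rx_frag(skb, 0, page, offset + hlen,
				len - hlen, 2048);
	return skb;
}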

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-16 18:20                     ` Alexander Duyck
  2014-10-16 21:40                       ` Eric Dumazet
@ 2014-10-17  9:11                       ` David Laight
  2014-10-17 14:40                         ` Alexander Duyck
  1 sibling, 1 reply; 38+ messages in thread
From: David Laight @ 2014-10-17  9:11 UTC (permalink / raw)
  To: 'Alexander Duyck', Eric Dumazet
  Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc

From: Alexander Duyck
...
> Actually the likelihood of anything holding onto the 4K page for very
> long doesn't seem to occur, at least from the drivers perspective.  It
> is one of the reasons why I went for the page reuse approach rather than
> just partitioning a single large page.  It allows us to avoid having to
> call IOMMU map/unmap for the pages since the entire page is usually back
> in the driver ownership before we need to reuse the portion given to the
> stack.

That is almost certainly true for most benchmarks, since benchmark processes
consume receive data.
But what about real life situations?

There must be some 'normal' workloads where receive data doesn't get consumed.

	David


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17  2:35               ` Jiafei.Pan
@ 2014-10-17 14:05                 ` Eric Dumazet
  2014-10-17 14:12                   ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-17 14:05 UTC (permalink / raw)
  To: Jiafei.Pan
  Cc: Alexander Duyck, David Miller, jkosina, netdev, LeoLi, linux-doc

On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:

> [Pan Jiafei] Hi Alex, thanks for your comments. I am not sure you have
> caught my concern. For example, I want to add an igb net card into my
> bridge and want the igb net driver to allocate skbs using my specified
> memory address, but I don’t want to modify the igb net driver directly.
> How can I do this in my bridge driver?

This is exactly our point: we do not want to modify all drivers so that
your bridge is happy with them.

You'll have to make your bridge use the standard infra instead.

IGB has no way to know in advance that a particular frame should
eventually reach your bridge anyway.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 14:05                 ` Eric Dumazet
@ 2014-10-17 14:12                   ` Alexander Duyck
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-17 14:12 UTC (permalink / raw)
  To: Eric Dumazet, Jiafei.Pan
  Cc: Alexander Duyck, David Miller, jkosina, netdev, LeoLi, linux-doc

On 10/17/2014 07:05 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 02:35 +0000, Jiafei.Pan@freescale.com wrote:
>
>> [Pan Jiafei] Hi Alex, thanks for your comments. I am not sure you have
>> caught my concern. For example, I want to add an igb net card into my
>> bridge and want the igb net driver to allocate skbs using my specified
>> memory address, but I don’t want to modify the igb net driver directly.
>> How can I do this in my bridge driver?
> This is exactly our point: we do not want to modify all drivers so that
> your bridge is happy with them.
>
> You'll have to make your bridge use the standard infra instead.
>
> IGB has no way to know in advance that a particular frame should
> eventually reach your bridge anyway.

Also, why is it that igb should use your buffers? Is there any reason why your
device cannot use the receive buffers that are handed off to the stack
from igb?  It isn't as if there is a copy in the routing or bridging
path.  The receive buffer is normally handed off to the stack from the
ingress device, a few headers might get tweaked, and then that buffer is
transmitted by the egress interface.  Only in the case of a buffer being
handed to multiple egress devices or user space should it ever be copied.
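
A schematic of that forwarding path (illustrative helper, not the
kernel's actual bridge code; skb_push() and dev_queue_xmit() are real
APIs):

#include <linux/if_ether.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static int forward_to_egress(struct sk_buff *skb, struct net_device *egress)
{
	skb->dev = egress;		/* retarget the same skb, no data copy */
	skb_push(skb, ETH_HLEN);	/* put the link-layer header back in front */
	return dev_queue_xmit(skb);	/* egress transmits the ingress driver's buffer */
}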

Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17  9:11                       ` David Laight
@ 2014-10-17 14:40                         ` Alexander Duyck
  2014-10-17 16:55                           ` Eric Dumazet
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2014-10-17 14:40 UTC (permalink / raw)
  To: David Laight, Eric Dumazet
  Cc: Jiafei.Pan, David Miller, jkosina, netdev, LeoLi, linux-doc


On 10/17/2014 02:11 AM, David Laight wrote:
> From: Alexander Duyck
> ...
>> Actually the likelihood of anything holding onto the 4K page for very
>> long doesn't seem to occur, at least from the drivers perspective.  It
>> is one of the reasons why I went for the page reuse approach rather than
>> just partitioning a single large page.  It allows us to avoid having to
>> call IOMMU map/unmap for the pages since the entire page is usually back
>> in the driver ownership before we need to reuse the portion given to the
>> stack.
> That is almost certainly true for most benchmarks, benchmark processes
> consume receive data.
> But what about real life situations?
>
> There must be some 'normal' workloads where receive data doesn't get consumed.
>
> 	David
>

Yes, but for workloads where receive data doesn't get consumed it is 
very unlikely that much receive data is generated.  As such, from the 
device's perspective, the time the socket holds the page is still not 
all that long, since the descriptor ring is not being looped through that 
quickly.  The page has almost always been freed by the time we have 
processed our way through the full descriptor ring.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 14:40                         ` Alexander Duyck
@ 2014-10-17 16:55                           ` Eric Dumazet
  2014-10-17 18:28                             ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-17 16:55 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc


On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
> On 10/17/2014 02:11 AM, David Laight wrote:
> > From: Alexander Duyck
> > ...
> >> Actually the likelihood of anything holding onto the 4K page for very
> >> long doesn't seem to occur, at least from the drivers perspective.  It
> >> is one of the reasons why I went for the page reuse approach rather than
> >> just partitioning a single large page.  It allows us to avoid having to
> >> call IOMMU map/unmap for the pages since the entire page is usually back
> >> in the driver ownership before we need to reuse the portion given to the
> >> stack.
> > That is almost certainly true for most benchmarks, benchmark processes
> > consume receive data.
> > But what about real life situations?
> >
> > There must be some 'normal' workloads where receive data doesn't get consumed.
> >
> > 	David
> >
> 
> Yes, but for workloads where receive data doesn't get consumed it is 
> very unlikely that much receive data is generated. 

This is very optimistic.

Any kind of flood can generate 5 or 6 Million packets per second.

So under stress conditions, we possibly consume twice as much RAM as allotted
in tcp_mem.  (About 3 GBytes per second; think about it.)

This is fine, if admins are aware of that and can adjust tcp_mem
accordingly.

Apparently none of your customers suffered from this; maybe they had
enough headroom to absorb the overcommit, or they trusted us and could not
find the culprit when they had issues.

Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
are busy on disk access or doing cpu intensive work). As traffic is interleaved
(between consumed data and non consumed data), you'll have the side
effect of consuming more ram than advertised.

Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
'free', and you'll see that we are not good citizens.

I will work on the TCP stack, to go beyond what I did in commit
b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")

So that TCP should not care if a driver chose to potentially use 4K per
MSS.

Right now, it seems we can drop a few packets and get a slight reduction in
throughput (TCP is very sensitive to losses, even if we drop 0.1% of packets).



^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 16:55                           ` Eric Dumazet
@ 2014-10-17 18:28                             ` Alexander Duyck
  2014-10-17 18:53                               ` Eric Dumazet
  2014-10-17 19:02                               ` Eric Dumazet
  0 siblings, 2 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-17 18:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc


On 10/17/2014 09:55 AM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 07:40 -0700, Alexander Duyck wrote:
>> On 10/17/2014 02:11 AM, David Laight wrote:
>>> From: Alexander Duyck
>>> ...
>>>> Actually the likelihood of anything holding onto the 4K page for very
>>>> long doesn't seem to occur, at least from the drivers perspective.  It
>>>> is one of the reasons why I went for the page reuse approach rather than
>>>> just partitioning a single large page.  It allows us to avoid having to
>>>> call IOMMU map/unmap for the pages since the entire page is usually back
>>>> in the driver ownership before we need to reuse the portion given to the
>>>> stack.
>>> That is almost certainly true for most benchmarks, benchmark processes
>>> consume receive data.
>>> But what about real life situations?
>>>
>>> There must be some 'normal' workloads where receive data doesn't get consumed.
>>>
>>> 	David
>>>
>> Yes, but for workloads where receive data doesn't get consumed it is
>> very unlikely that much receive data is generated.
> This is very optimistic.
>
> Any kind of flood can generate 5 or 6 Million packets per second.

That is fine.  The first 256 buffers (the default descriptor ring size) might be 4K 
while reporting a truesize of 2K; after that, each page is guaranteed to be 
split in half, so we get at least 2 uses per page.

> So under stress conditions, we possibly consume twice as much RAM as allotted
> in tcp_mem.  (About 3 GBytes per second; think about it.)

I see what you are trying to get at, but I don't see how my scenario is 
worse than the setups that use a large page and partition it.

> This is fine, if admins are aware of that and can adjust tcp_mem
> accordingly.

I can say I have never had a single report of us feeding too much 
memory to the sockets; if anything, the complaints I have seen have 
always been that the socket is being starved due to too much memory 
being used to move small packets.  That is one of the reasons I decided 
we had to have a copy-break built in for packets 256B and smaller.  It 
doesn't make much sense to allocate 2K + ~1K (skb + skb->head) for 256B 
or less of payload data.
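
A minimal sketch of that copy-break idea (illustrative, not the exact
driver code; netdev_alloc_skb() and __skb_put() are real APIs, the
threshold and helper are made up):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

#define COPYBREAK 256	/* the threshold mentioned above */

static struct sk_buff *rx_copybreak(struct net_device *dev,
				    const void *data, unsigned int len)
{
	struct sk_buff *skb;

	if (len > COPYBREAK)
		return NULL;	/* caller attaches the half-page as a frag instead */

	/* small packet: copy it and leave the 2K half-page in the ring for reuse */
	skb = netdev_alloc_skb(dev, len);
	if (skb)
		memcpy(__skb_put(skb, len), data, len);
	return skb;
}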

> Apparently none of your customers suffered from this; maybe they had
> enough headroom to absorb the overcommit, or they trusted us and could not
> find the culprit when they had issues.

Correct.  I've never received complaints about memory overcommit. Like I 
have said, in most cases we are getting the page back anyway, so we 
usually get a good ROI on the page recycling.

> Open 50,000 tcp sockets, do not read data on 50% of them (pretend you
> are busy on disk access or doing cpu intensive work). As traffic is interleaved
> (between consumed data and non consumed data), you'll have the side
> effect of consuming more ram than advertised.

Yes, but in the case we are talking about it is only off by a factor of 
2.  How do you account for setups such as the skb allocation code that 
spreads a 32K page over multiple frames?  In your setup I would suspect 
it wouldn't be uncommon to end up with situations where only a few 
sockets are holding the entire 32K page for some period of time.  So 
does that mean we should hit anybody that uses netdev_alloc_skb with 
the overhead for 32K, since there are scenarios where that can happen?
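
For comparison, a sketch of the page-fragment case being referred to
(netdev_alloc_frag() and build_skb() are real APIs; the helper and
sizes are illustrative):

#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

static struct sk_buff *rx_skb_from_page_frag(unsigned int buf_len)
{
	/* room for the data plus the shared info that build_skb() expects */
	unsigned int frag_sz = SKB_DATA_ALIGN(buf_len) +
			       SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	void *data = netdev_alloc_frag(frag_sz);	/* slice of a shared 32K page */
	struct sk_buff *skb;

	if (!data)
		return NULL;

	skb = build_skb(data, frag_sz);	/* truesize reflects frag_sz, not the 32K page */
	if (!skb)
		put_page(virt_to_head_page(data));
	return skb;
}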

> Compare /proc/net/protocols (grep TCP /proc/net/protocols) and output of
> 'free', and you'll see that we are not good citizens.

I'm assuming there is some sort of test I should be running while I do 
this?  Otherwise the current dump of those is not too interesting 
because my system is idle.

> I will work on the TCP stack, to go beyond what I did in commit
> b49960a05e3212 ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
>
> So that TCP should not care if a driver chose to potentially use 4K per
> MSS.

So long as it doesn't impact performance significantly I am fine with 
it.  My concern is that you are bringing up issues that none of the 
customers were bringing up when I was at Intel, and the fixes you are 
proposing are likely to result in customers seeing things they will 
report as issues.

> Right now, it seems we can drop a few packets and get a slight reduction in
> throughput (TCP is very sensitive to losses, even if we drop 0.1% of packets).

Yes, I am well aware of this bit.  That is my concern.  If you increase 
the truesize, it will cut the amount of queueing in half, and then 
igb will start seeing drops when it has to deal with bursty traffic and 
people will start to complain about a performance regression.  That is 
the bit I want to avoid.

Thanks,

Alex


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 18:28                             ` Alexander Duyck
@ 2014-10-17 18:53                               ` Eric Dumazet
  2014-10-18  0:26                                 ` Eric Dumazet
  2014-10-17 19:02                               ` Eric Dumazet
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-17 18:53 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc

On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> Yes, I am well aware of this bit.  That is my concern.  If you increase 
> the truesize, it will cut the amount of queueing in half, and then 
> igb will start seeing drops when it has to deal with bursty traffic and 
> people will start to complain about a performance regression.  That is 
> the bit I want to avoid.

Then, what about addressing the issue for good, instead of working
around it?

Don't worry, I will work on it.

We are also working on direct placement in memory for the TCP receive
path (the equivalent of sendfile(), but for the receiver), to avoid the need to 
hold payload in kernel memory (out-of-order queues).




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 18:28                             ` Alexander Duyck
  2014-10-17 18:53                               ` Eric Dumazet
@ 2014-10-17 19:02                               ` Eric Dumazet
  2014-10-17 19:38                                 ` Alexander Duyck
  1 sibling, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-17 19:02 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc

On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:

> So long as it doesn't impact performance significantly I am fine with 
> it.  My concern is that you are bringing up issues that none of the 
> customers were bringing up when I was at Intel, and the fixes you are 
> proposing are likely to result in customers seeing things they will 
> report as issues.

We regularly hit these issues at Google.

We have memory containers, and we detect quite easily if some layer is
lying, because we can't afford having 20% of headroom on our servers.

I am not claiming IGB is the only offender.

I am sorry if you believe it was an attack on IGB, or any network
driver.

truesize should really be the thing that protects us from OOM,
and apparently driver authors hitting TCP performance problems
think it is better to simply let TCP do no garbage collection.

It seems that nobody cares or even understands what I am saying,
so I should probably not care and suggest Google buy petabytes of
memory, right?




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 19:02                               ` Eric Dumazet
@ 2014-10-17 19:38                                 ` Alexander Duyck
  2014-10-17 19:51                                   ` Eric Dumazet
  0 siblings, 1 reply; 38+ messages in thread
From: Alexander Duyck @ 2014-10-17 19:38 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc

On 10/17/2014 12:02 PM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 11:28 -0700, Alexander Duyck wrote:
>
>> So long as it doesn't impact performance significantly I am fine with 
>> it.  My concern is that you are bringing up issues that none of the 
>> customers were bringing up when I was at Intel, and the fixes you are 
>> proposing are likely to result in customers seeing things they will 
>> report as issues.
> We regularly hit these issues at Google.
>
> We have memory containers, and we detect quite easily if some layer is
> lying, because we can't afford having 20% of headroom on our servers.

Data is useful here.  If you can give enough data about the setup to
reproduce it then we can actually start looking at fixing it.  Otherwise
it is just your anecdotal data versus mine.

> I am not claiming IGB is the only offender.
>
> I am sorry if you believe it was an attack on IGB, or any network
> driver.

My concern is that igb is guilty of being off by at most a factor of 2. 
What about the drivers and implementations that are off by possibly much
larger values?  I'd be much more interested in this if there was data to
back up your position.  Right now it is mostly just conjecture and my
concern is that this type of change may cause more harm than good.

> truesize should really be the thing that protects us from OOM,
> and apparently driver authors hitting TCP performance problems
> think it is better to simply let TCP do no garbage collection.

One key point to keep in mind is that the igb performance should take a
pretty hard hit if pages aren't being freed.  The overhead difference
between the page reuse path vs. non-page reuse is pretty significant.  If
this is a scenario you are actually seeing, it would be of interest.

> It seems that nobody cares or even understands what I am saying,
> so I should probably not care and suggest Google buy petabytes of
> memory, right?

That's not what I am saying, but there is a trade-off we always have to
take into account.  Cutting memory overhead will likely have an impact
on performance.  I would like to make the best informed trade-off in
that regard rather than just assuming worst case always for the driver.

Thanks,

Alex

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 19:38                                 ` Alexander Duyck
@ 2014-10-17 19:51                                   ` Eric Dumazet
  2014-10-17 22:13                                     ` Alexander Duyck
  0 siblings, 1 reply; 38+ messages in thread
From: Eric Dumazet @ 2014-10-17 19:51 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Alexander Duyck, David Laight, Jiafei.Pan, David Miller, jkosina,
	netdev, LeoLi, linux-doc

On Fri, 2014-10-17 at 12:38 -0700, Alexander Duyck wrote:

> That's not what I am saying, but there is a trade-off we always have to
> take into account.  Cutting memory overhead will likely have an impact
> on performance.  I would like to make the best informed trade-off in
> that regard rather than just assuming worst case always for the driver.

It seems you misunderstood me. You believe I suggested doing another
allocation strategy in the drivers.

This was not the case.

This allocation strategy is wonderful. I repeat : This is wonderful.

We only have to make sure we do not fool memory management layers, when
they do not understand where the memory is.

Apparently you think it is hard, while it really is not.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 19:51                                   ` Eric Dumazet
@ 2014-10-17 22:13                                     ` Alexander Duyck
  0 siblings, 0 replies; 38+ messages in thread
From: Alexander Duyck @ 2014-10-17 22:13 UTC (permalink / raw)
  To: Eric Dumazet, Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc


On 10/17/2014 12:51 PM, Eric Dumazet wrote:
> On Fri, 2014-10-17 at 12:38 -0700, Alexander Duyck wrote:
>
>> That's not what I am saying, but there is a trade-off we always have to
>> take into account.  Cutting memory overhead will likely have an impact
>> on performance.  I would like to make the best informed trade-off in
>> that regard rather than just assuming worst case always for the driver.
> It seems you misunderstood me. You believe I suggested doing another
> allocation strategy in the drivers.
>
> This was not the case.
>
> This allocation strategy is wonderful. I repeat : This is wonderful.

No, I think I understand you.  I'm just not sure listing this as a 4K 
allocation in truesize makes sense.  The problem is the actual 
allocation can be either 2K or 4K, and my concern is that by setting it 
to 4K we are going to be hurting the case where the actual allocation to 
the socket is only 2K for the half page w/ reuse.

I was bringing up the other allocation strategy to prove a point. From my 
perspective it wouldn't make any more sense to assign 32K to the 
truesize for an allocated fragment using __netdev_alloc_frag, but it can 
suffer the same type of issues only to a greater extent due to the use 
of the compound page.  Just because it is shared among many more uses 
doesn't mean it couldn't end up in a scenario where one socket somehow 
keeps queueing up the 32K pages and sitting on them.  I would think all 
it would take is one bad-acting flow interleaved in ~20 active flows to 
suddenly gobble up a ton of memory without it being accounted for.

> We only have to make sure we do not fool memory management layers, when
> they do not understand where the memory is.
>
> Apparently you think it is hard, while it really is not.

I think you are oversimplifying it.  By setting it to 4K there are 
situations where a socket will be double-charged for getting two halves 
of the same page.  In these cases there will be a negative impact on 
performance, as the number of frames that can be queued is reduced.  What 
you are proposing is possibly over-reporting memory use by a factor of 2 
instead of possibly under-reporting it by a factor of 2.

I would be more moved by data than just conjecture on what the driver is 
or isn't doing.  My theory is that most of the time the page is reused 
so 2K is the correct value to report, and very seldom would 4K ever be 
the correct value.  This is what I have seen historically with igb/ixgbe 
using the page reuse.  If you have cases that show that the page isn't 
being reused then we can explore the 4K truesize change, but until then 
I think the page is likely being reused and we should probably just 
stick with the 2K value as we should be getting at least 2 uses per page.

Thanks,

Alex


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] net: use hardware buffer pool to allocate skb
  2014-10-17 18:53                               ` Eric Dumazet
@ 2014-10-18  0:26                                 ` Eric Dumazet
  0 siblings, 0 replies; 38+ messages in thread
From: Eric Dumazet @ 2014-10-18  0:26 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: David Laight, Jiafei.Pan, David Miller, jkosina, netdev, LeoLi,
	linux-doc

On Fri, 2014-10-17 at 11:53 -0700, Eric Dumazet wrote:

> Don't worry, I will work on it.

A very simple head drop strategy instead of tail drop on the socket
backlog seems to do the trick:

TCP behaves way better when there are head drops, as fast retransmits
can usually close the gap without any added delay. The sender automatically
adjusts its cwnd (or rate) according to the receiver's skb->truesize /
skb->len ratio.
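
A schematic sketch of head drop versus tail drop on a generic skb queue
(illustrative only, not the actual TCP backlog code; the queue helpers
are real kernel APIs):

#include <linux/skbuff.h>

static void queue_with_head_drop(struct sk_buff_head *q, struct sk_buff *skb,
				 unsigned int limit)
{
	if (skb_queue_len(q) >= limit)
		kfree_skb(__skb_dequeue(q));	/* drop the oldest packet (head drop) */
	__skb_queue_tail(q, skb);		/* always keep the newest arrival */
}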






^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2014-10-18  0:26 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-15  3:26 [PATCH] net: use hardware buffer pool to allocate skb Pan Jiafei
2014-10-15  4:15 ` Eric Dumazet
2014-10-15  4:26   ` David Miller
2014-10-15  5:43   ` Jiafei.Pan
2014-10-15  9:15     ` Eric Dumazet
2014-10-15  4:25 ` David Miller
2014-10-15  5:34   ` Jiafei.Pan
2014-10-15  9:15     ` Eric Dumazet
2014-10-16  2:17       ` Jiafei.Pan
2014-10-16  4:15         ` Eric Dumazet
2014-10-16  5:15           ` Jiafei.Pan
2014-10-16 15:28             ` Alexander Duyck
2014-10-16 16:57               ` Eric Dumazet
2014-10-16 17:10                 ` Alexander Duyck
2014-10-16 17:45                   ` Eric Dumazet
2014-10-16 18:20                     ` Alexander Duyck
2014-10-16 21:40                       ` Eric Dumazet
2014-10-16 22:12                         ` Alexander Duyck
2014-10-17  9:11                       ` David Laight
2014-10-17 14:40                         ` Alexander Duyck
2014-10-17 16:55                           ` Eric Dumazet
2014-10-17 18:28                             ` Alexander Duyck
2014-10-17 18:53                               ` Eric Dumazet
2014-10-18  0:26                                 ` Eric Dumazet
2014-10-17 19:02                               ` Eric Dumazet
2014-10-17 19:38                                 ` Alexander Duyck
2014-10-17 19:51                                   ` Eric Dumazet
2014-10-17 22:13                                     ` Alexander Duyck
2014-10-17  2:35               ` Jiafei.Pan
2014-10-17 14:05                 ` Eric Dumazet
2014-10-17 14:12                   ` Alexander Duyck
2014-10-16  2:17       ` Jiafei.Pan
2014-10-15 15:51     ` David Miller
2014-10-15  4:59 ` Oliver Hartkopp
2014-10-15  5:47   ` Jiafei.Pan
2014-10-15  8:57 ` David Laight
2014-10-15  9:33 ` Stephen Hemminger
2014-10-16  2:30   ` Jiafei.Pan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).