netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
@ 2015-10-13  1:15 Ronen Arad
  2015-10-13  8:56 ` Thomas Graf
  2015-10-15  8:55 ` [PATCH net-next v2] netlink: Trim skb to alloc " Ronen Arad
  0 siblings, 2 replies; 7+ messages in thread
From: Ronen Arad @ 2015-10-13  1:15 UTC (permalink / raw)
  To: netdev; +Cc: Ronen Arad

The available room in the skb allocated in netlink_dump for iproute2
show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
set in netlink_recvmg.
This was handled properly for small skb allocated when no interface has
many VLANs configured. This patch applies the same logic to larger skbs
which are allocated using the calculated min_dump_alloc size.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
---
 net/netlink/af_netlink.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8f060d7..d628253 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2817,9 +2817,13 @@ static int netlink_dump(struct sock *sk)
 			skb_reserve(skb, skb_tailroom(skb) -
 					 nlk->max_recvmsg_len);
 	}
-	if (!skb)
+	if (!skb) {
 		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
 					GFP_KERNEL);
+		/* available room should be exact amount to avoid MSG_TRUNC */
+		if (skb)
+			skb_reserve(skb, skb_tailroom(skb) - alloc_size);
+	}
 	if (!skb)
 		goto errout_skb;
 	netlink_skb_set_owner_r(skb, sk);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
  2015-10-13  1:15 [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC Ronen Arad
@ 2015-10-13  8:56 ` Thomas Graf
  2015-10-13 17:52   ` Arad, Ronen
  2015-10-15  8:55 ` [PATCH net-next v2] netlink: Trim skb to alloc " Ronen Arad
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Graf @ 2015-10-13  8:56 UTC (permalink / raw)
  To: Ronen Arad; +Cc: netdev

On 10/12/15 at 06:15pm, Ronen Arad wrote:
> The available room in the skb allocated in netlink_dump for iproute2
> show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
> be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
> set in netlink_recvmg.
> This was handled properly for small skb allocated when no interface has
> many VLANs configured. This patch applies the same logic to larger skbs
> which are allocated using the calculated min_dump_alloc size.
> 
> Signed-off-by: Ronen Arad <ronen.arad@intel.com>

Wouldn't this imply a bug in which rtnl_calcit() does not account for
some data that is later dumped? How else could the skb end up being
larger than what alloc_size accounts for at this point unless we end
up stuffing 2x smallish messages into the padded projection of the
calculated maximum message size of that type. Is that what you are
seeing?

Regardless of that, alloc_size is likely larger than nlk->max_recvmsg_len
anyway at this point so unless the reader suddenly provides a different
message size or does peeking it will likely still be truncated.

I'm just trying to understand which exact case you are solving here.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
  2015-10-13  8:56 ` Thomas Graf
@ 2015-10-13 17:52   ` Arad, Ronen
  2015-10-14  7:44     ` Thomas Graf
  0 siblings, 1 reply; 7+ messages in thread
From: Arad, Ronen @ 2015-10-13 17:52 UTC (permalink / raw)
  To: Thomas Graf, netdev



>-----Original Message-----
>From: Thomas Graf [mailto:tgraf@suug.ch]
>Sent: Tuesday, October 13, 2015 1:56 AM
>To: Arad, Ronen
>Cc: netdev@vger.kernel.org
>Subject: Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
>
>On 10/12/15 at 06:15pm, Ronen Arad wrote:
>> The available room in the skb allocated in netlink_dump for iproute2
>> show requests (e.g. "ip link [show]", "bridge [-c] vlan show") should
>> be trimmed to the exact size requested in order to avoid MSG_TRUNC flag
>> set in netlink_recvmg.
>> This was handled properly for small skb allocated when no interface has
>> many VLANs configured. This patch applies the same logic to larger skbs
>> which are allocated using the calculated min_dump_alloc size.
>>
>> Signed-off-by: Ronen Arad <ronen.arad@intel.com>
>
>Wouldn't this imply a bug in which rtnl_calcit() does not account for
>some data that is later dumped?
[@Ronen] rtnl_calcit() is not bug-free. It is not, however, the direct
cause of the problem this patch intends to solve.
rtnl_calcit() overestimates the min_alloc_size. The space for VLAN_INFO for
the maximum number ov VLANs on any interface is always added to
min_alloc_size even when the dump request does not specify VLAN or
compressed VLANs. The overestimation is because if_nlmsg_size() does not
pass ext_filter_mask to  rtnl_link_get_af_size() or rtnl_link_get_size().
(ext_filter_mask is passed to rtnl_vfinfo_size() and rtnl_port_size())
 
>How else could the skb end up being
>larger than what alloc_size accounts for at this point unless we end
>up stuffing 2x smallish messages into the padded projection of the
>calculated maximum message size of that type.
[@Ronen] The skb size (i.e. the tailroom of a newly allocated skb) is
greater than the argument passed to netlink_skb_alloc(). I didn't fully
looked at the allocation mechanism (there could be multiple ways and
netlink skbs could be memory mapped)). My understanding is that this is
expected as indicated by the comment in the code.

Netlink dump by design attempts to pack as many smallish messages into the
same skb to minimize the number of parts of a multi-part dump response.
The min_alloc_size is intended to guarantee drivers/modules sufficient
space for the largest dump message of a single netdev. 

>Is that what you are
>seeing?

[@Ronen] My initial observation was that with many VLANs configured, the 
min_alloc_size is greater than the 16KiB buffer iproute2 uses for both
"ip link show" and "bridge [-c[ompressvlans]] vlan show" commands. This buf
is defined locally within iproute2's lib/libnetlink.c.
Each VLAN_INFO attribute takes 8 bytes. 4094 VLANs requires 32,752 bytes.
Compressed VLANs would need a lot less (at the extreme only 8 bytes for a
single range covering the entire 1-4094 or any other range).

I bumped iproute2 buffer size from 16KiB to 3*16KiB when I first noticed
"Message truncated" error from "ip" and "bridge" commands. I was surprised
when such a large buffer remained insufficient for a system with more
interfaces.

The root cause is that the skb is allocated with more space than requested
and the dump is allowed to use this extra space available. In my case I
observed skb allocated with total space of 65,216 bytes when 34,420
were requested.
 
>
>Regardless of that, alloc_size is likely larger than nlk->max_recvmsg_len
>anyway at this point so unless the reader suddenly provides a different
>message size or does peeking it will likely still be truncated.
>
[@Ronen] My reader as I described above is providing a larger message
which I'm trying to properly size. I'm aware that libnl shields
applications from the need to know and provide properly sized buffer by
peeking or/and re-allocating a buffer.
My issue is with iproute2 "ip link show" and "bridge vlan show" commands.
 
>I'm just trying to understand which exact case you are solving here.
[@Ronen] My patch applies the same logic that is used when allocation is
done by nlk->max_recvmsg_len to allocation that is done by min_alloc_size.
The code could be clearer like that:

	/* NLMSG_GOODSIZE is small to avoid high order allocations being
	 * required, but it makes sense to _attempt_ a 16K bytes allocation
	 * to reduce number of system calls on dump operations, if user
	 * ever provided a big enough buffer.
	 */
	cb = &nlk->cb;
	alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);

	if (alloc_min_size < nlk->max_recvmsg_len) {
		alloc_size = nlk->max_recvmsg_len;
		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
					GFP_KERNEL |
					__GFP_NOWARN |
					__GFP_NORETRY);
	}
	if (!skb) {
		alloc_size = alloc_min_size;
		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
					GFP_KERNEL);
	}
	if (!skb)
		goto errout_skb;
	/* available room should be exact amount to avoid MSG_TRUNC */
      skb_reserve(skb, skb_tailroom(skb) - alloc_size);

Allocation is always performed by alloc_size which could be
nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
upon failure falling back to alloc_min_size.
The trimming of the skb space is common regardless of the allocation call.
I tried to submit the minimal patch to address the issue. If you think the
Re-organized code is better I can re-submit a V2.


 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
  2015-10-13 17:52   ` Arad, Ronen
@ 2015-10-14  7:44     ` Thomas Graf
  2015-10-15  6:41       ` Arad, Ronen
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Graf @ 2015-10-14  7:44 UTC (permalink / raw)
  To: Arad, Ronen; +Cc: netdev

On 10/13/15 at 05:52pm, Arad, Ronen wrote:
> [@Ronen] My reader as I described above is providing a larger message
> which I'm trying to properly size. I'm aware that libnl shields
> applications from the need to know and provide properly sized buffer by
> peeking or/and re-allocating a buffer.
> My issue is with iproute2 "ip link show" and "bridge vlan show" commands.

>  
> >I'm just trying to understand which exact case you are solving here.
> Allocation is always performed by alloc_size which could be
> nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
> upon failure falling back to alloc_min_size.
> The trimming of the skb space is common regardless of the allocation call.
> I tried to submit the minimal patch to address the issue. If you think the
> Re-organized code is better I can re-submit a V2.

I was about to suggest the same code change after initial discussion ;-)

So you are fixing the case where >2x messages fit the padded skb size.
This was not clear from the commit message. I would appreciate a note
in the commit message and updated code comment to reflect this.

The fix is definitely not incorrect and the penalty for readers which
peek first is less than I thought since nlk->max_recvmsg_len is at
least 16K in size. Since most peekers will double buffer sizes they
will most likely end up growing nlk->max_recvmsg_len after the first
read.

However, if alloc_size is > 16K, we would have typically ended up with a
giant skb which peeking users were able to take advantage of while
with this fix this is no longer the case.

However #2, I'll see if it makes sense to look at MSG_PEEK in recvmsg
and change nlk->max_recvmsg_len accordingly so we take advantage of
the full skb size on sockets which perform peeking. Given that both
reader behaviours can be preserved, I'm good with your proposed v2.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
  2015-10-14  7:44     ` Thomas Graf
@ 2015-10-15  6:41       ` Arad, Ronen
  0 siblings, 0 replies; 7+ messages in thread
From: Arad, Ronen @ 2015-10-15  6:41 UTC (permalink / raw)
  To: Thomas Graf; +Cc: netdev



>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Thomas Graf
>Sent: Wednesday, October 14, 2015 12:45 AM
>To: Arad, Ronen
>Cc: netdev@vger.kernel.org
>Subject: Re: [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC
>
>On 10/13/15 at 05:52pm, Arad, Ronen wrote:
>> [@Ronen] My reader as I described above is providing a larger message
>> which I'm trying to properly size. I'm aware that libnl shields
>> applications from the need to know and provide properly sized buffer by
>> peeking or/and re-allocating a buffer.
>> My issue is with iproute2 "ip link show" and "bridge vlan show" commands.
>
>>
>> >I'm just trying to understand which exact case you are solving here.
>> Allocation is always performed by alloc_size which could be
>> nlk->max_recvmsg_len (only when min_dump_alloc is sufficiently small) and
>> upon failure falling back to alloc_min_size.
>> The trimming of the skb space is common regardless of the allocation call.
>> I tried to submit the minimal patch to address the issue. If you think the
>> Re-organized code is better I can re-submit a V2.
>
>I was about to suggest the same code change after initial discussion ;-)
>
>So you are fixing the case where >2x messages fit the padded skb size.
[@Ronen] I'm not sure I understand this statement. I'm fixing the padding
of the skb such that reader could have reasonable buffer size based on the
largest netdev. It is just happened that the skb allocation was about
double the size. Probably because the allocation was some kind of power
of 2 and the requested size was slightly above the next lower power of 2.

On a separate patch titled 
[PATCH net-next v3] netlink: Rightsize IFLA_AF_SPEC size calculation
I'm reducing the over-estimation of the buffer size for "ip link"
requests. It turned out that VLAN information space was added to
unrelated dump requests since ext_filter_mask was not passed to
rtnl_link_get_af_size().
The "rightsizing" patch also reduces the buffer size of compressed VLANs
dump. Non-compressed VLANs dump will continue to require more than the
16KiB buffer size from somewhere around 1800 VLANs and above (based on
8 bytes per VLAN plus other attributes consuming about 1700 bytes).
Using the same logic the full range of 4094 VLANs uncompressed would take
Roughly 34500 bytes. It looks like a "safe" iproute2 statically sized
Buffer would have to be about 36000 bytes or so.

>This was not clear from the commit message. I would appreciate a note
>in the commit message and updated code comment to reflect this.
>
>The fix is definitely not incorrect and the penalty for readers which
>peek first is less than I thought since nlk->max_recvmsg_len is at
>least 16K in size. Since most peekers will double buffer sizes they
>will most likely end up growing nlk->max_recvmsg_len after the first
>read.
[@Ronen] nlk->max_recvmsg_len is actually capped at 16KiB.
/* Record the max length of recvmsg() calls for future allocations */
nlk->max_recvmsg_len = max(nlk->max_recvmsg_len, len);
nlk->max_recvmsg_len = min_t(size_t, nlk->max_recvmsg_len, 16384);
>
>However, if alloc_size is > 16K, we would have typically ended up with a
>giant skb which peeking users were able to take advantage of while
>with this fix this is no longer the case.
[@Ronen] As I noted above, peeking reader could only enjoy saving up to 
16KiB.
>
>However #2, I'll see if it makes sense to look at MSG_PEEK in recvmsg
>and change nlk->max_recvmsg_len accordingly so we take advantage of
>the full skb size on sockets which perform peeking. Given that both
>reader behaviours can be preserved, I'm good with your proposed v2.
[@Ronen] I'll submit by suggested v2.
>--
>To unsubscribe from this list: send the line "unsubscribe netdev" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH net-next v2] netlink: Trim skb to alloc size to avoid MSG_TRUNC
  2015-10-13  1:15 [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC Ronen Arad
  2015-10-13  8:56 ` Thomas Graf
@ 2015-10-15  8:55 ` Ronen Arad
  2015-10-19  2:33   ` David Miller
  1 sibling, 1 reply; 7+ messages in thread
From: Ronen Arad @ 2015-10-15  8:55 UTC (permalink / raw)
  To: netdev; +Cc: Ronen Arad

netlink_dump() allocates skb based on the calculated min_dump_alloc or
a per socket max_recvmsg_len.
min_alloc_size is maximum space required for any single netdev
attributes as calculated by rtnl_calcit().
max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
It is capped at 16KiB.
The intention is to avoid small allocations and to minimize the number
of calls required to obtain dump information for all net devices.

netlink_dump packs as many small messages as could fit within an skb
that was sized for the largest single netdev information. The actual
space available within an skb is larger than what is requested. It could
be much larger and up to near 2x with align to next power of 2 approach.

Allowing netlink_dump to use all the space available within the
allocated skb increases the buffer size a user has to provide to avoid
truncaion (i.e. MSG_TRUNG flag set).

It was observed that with many VLANs configured on at least one netdev,
a larger buffer of near 64KiB was necessary to avoid "Message truncated"
error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
min_alloc_size was only little over 32KiB.

This patch trims skb to allocated size in order to allow the user to
avoid truncation with more reasonable buffer size.

Signed-off-by: Ronen Arad <ronen.arad@intel.com>
---
 net/netlink/af_netlink.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 8f060d7..0a49a8c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2785,6 +2785,7 @@ static int netlink_dump(struct sock *sk)
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
 	int len, err = -ENOBUFS;
+	int alloc_min_size;
 	int alloc_size;
 
 	mutex_lock(nlk->cb_mutex);
@@ -2793,9 +2794,6 @@ static int netlink_dump(struct sock *sk)
 		goto errout_skb;
 	}
 
-	cb = &nlk->cb;
-	alloc_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
-
 	if (!netlink_rx_is_mmaped(sk) &&
 	    atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
 		goto errout_skb;
@@ -2805,23 +2803,35 @@ static int netlink_dump(struct sock *sk)
 	 * to reduce number of system calls on dump operations, if user
 	 * ever provided a big enough buffer.
 	 */
-	if (alloc_size < nlk->max_recvmsg_len) {
-		skb = netlink_alloc_skb(sk,
-					nlk->max_recvmsg_len,
-					nlk->portid,
+	cb = &nlk->cb;
+	alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
+
+	if (alloc_min_size < nlk->max_recvmsg_len) {
+		alloc_size = nlk->max_recvmsg_len;
+		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
 					GFP_KERNEL |
 					__GFP_NOWARN |
 					__GFP_NORETRY);
-		/* available room should be exact amount to avoid MSG_TRUNC */
-		if (skb)
-			skb_reserve(skb, skb_tailroom(skb) -
-					 nlk->max_recvmsg_len);
 	}
-	if (!skb)
+	if (!skb) {
+		alloc_size = alloc_min_size;
 		skb = netlink_alloc_skb(sk, alloc_size, nlk->portid,
 					GFP_KERNEL);
+	}
 	if (!skb)
 		goto errout_skb;
+
+	/* Trim skb to allocated size. User is expected to provide buffer as
+	 * large as max(min_dump_alloc, 16KiB (mac_recvmsg_len capped at
+	 * netlink_recvmsg())). dump will pack as many smaller messages as
+	 * could fit within the allocated skb. skb is typically allocated
+	 * with larger space than required (could be as much as near 2x the
+	 * requested size with align to next power of 2 approach). Allowing
+	 * dump to use the excess space makes it difficult for a user to have a
+	 * reasonable static buffer based on the expected largest dump of a
+	 * single netdev. The outcome is MSG_TRUNC error.
+	 */
+	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
 	len = cb->dump(skb, cb);
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH net-next v2] netlink: Trim skb to alloc size to avoid MSG_TRUNC
  2015-10-15  8:55 ` [PATCH net-next v2] netlink: Trim skb to alloc " Ronen Arad
@ 2015-10-19  2:33   ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2015-10-19  2:33 UTC (permalink / raw)
  To: ronen.arad; +Cc: netdev

From: Ronen Arad <ronen.arad@intel.com>
Date: Thu, 15 Oct 2015 01:55:17 -0700

> netlink_dump() allocates skb based on the calculated min_dump_alloc or
> a per socket max_recvmsg_len.
> min_alloc_size is maximum space required for any single netdev
> attributes as calculated by rtnl_calcit().
> max_recvmsg_len tracks the user provided buffer to netlink_recvmsg.
> It is capped at 16KiB.
> The intention is to avoid small allocations and to minimize the number
> of calls required to obtain dump information for all net devices.
> 
> netlink_dump packs as many small messages as could fit within an skb
> that was sized for the largest single netdev information. The actual
> space available within an skb is larger than what is requested. It could
> be much larger and up to near 2x with align to next power of 2 approach.
> 
> Allowing netlink_dump to use all the space available within the
> allocated skb increases the buffer size a user has to provide to avoid
> truncaion (i.e. MSG_TRUNG flag set).
> 
> It was observed that with many VLANs configured on at least one netdev,
> a larger buffer of near 64KiB was necessary to avoid "Message truncated"
> error in "ip link" or "bridge [-c[ompressvlans]] vlan show" when
> min_alloc_size was only little over 32KiB.
> 
> This patch trims skb to allocated size in order to allow the user to
> avoid truncation with more reasonable buffer size.
> 
> Signed-off-by: Ronen Arad <ronen.arad@intel.com>

Getting spurious MSG_TRUNC is surely a bug, so I am going to apply
this to 'net' as it is a genuine bug fix.

Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-10-19  2:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-13  1:15 [PATCH] netlink: trim skb to exact size to avoid MSG_TRUNC Ronen Arad
2015-10-13  8:56 ` Thomas Graf
2015-10-13 17:52   ` Arad, Ronen
2015-10-14  7:44     ` Thomas Graf
2015-10-15  6:41       ` Arad, Ronen
2015-10-15  8:55 ` [PATCH net-next v2] netlink: Trim skb to alloc " Ronen Arad
2015-10-19  2:33   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).