netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
       [not found] <5422254.3711328278789768.JavaMail.root@5-MeO-DMT.ynet.sk>
@ 2012-02-03 14:24 ` Stefan Gula
  2012-02-04  0:29   ` David Miller
  0 siblings, 1 reply; 15+ messages in thread
From: Stefan Gula @ 2012-02-03 14:24 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev

From: Stefan Gula <steweg@gmail.com>

Adding new rtnetlink ops and command for getting more information about
network devices, which are not able to fit inside predefined SKB structures
(e.g. PAGE_SIZE limit). DEVDUMP command allows to call specific device driver
code for complete handling this netlink message. Useful if devices needed to
list some addition dynamic structures like hlists and doesn't require to have
complete set of codes for it new PF families.

Signed-off-by: Stefan Gula <steweg@gmail.com>

---
 - this patch is per-requisition for my new macvlan patch


diff -uprN -X linux-3.2.1-orig/Documentation/dontdiff linux-3.2.1-orig/include/linux/rtnetlink.h linux-3.2.1-macvlan/include/linux/rtnetlink.h
--- linux-3.2.1-orig/include/linux/rtnetlink.h	2012-01-27 13:38:57.000000000 +0000
+++ linux-3.2.1-macvlan/include/linux/rtnetlink.h	2012-02-02 08:34:14.000000000 +0000
@@ -120,6 +120,8 @@ enum {
 	RTM_SETDCB,
 #define RTM_SETDCB RTM_SETDCB
 
+	RTM_DEVDUMP = 80,
+#define RTM_DEVDUMP	RTM_DEVDUMP
 	__RTM_MAX,
 #define RTM_MAX		(((__RTM_MAX + 3) & ~3) - 1)
 };
diff -uprN -X linux-3.2.1-orig/Documentation/dontdiff linux-3.2.1-orig/include/net/rtnetlink.h linux-3.2.1-macvlan/include/net/rtnetlink.h
--- linux-3.2.1-orig/include/net/rtnetlink.h	2012-01-27 13:38:57.000000000 +0000
+++ linux-3.2.1-macvlan/include/net/rtnetlink.h	2012-02-02 05:58:50.000000000 +0000
@@ -78,6 +78,9 @@ struct rtnl_link_ops {
 	int			(*get_tx_queues)(struct net *net, struct nlattr *tb[],
 						 unsigned int *tx_queues,
 						 unsigned int *real_tx_queues);
+	int			(*dev_dump)(struct sk_buff *skb,
+					    struct net_device *dev, int type,
+					    u32 pid, u32 seq);
 };
 
 extern int	__rtnl_link_register(struct rtnl_link_ops *ops);
diff -uprN -X linux-3.2.1-orig/Documentation/dontdiff linux-3.2.1-orig/net/core/rtnetlink.c linux-3.2.1-macvlan/net/core/rtnetlink.c
--- linux-3.2.1-orig/net/core/rtnetlink.c	2012-01-27 14:22:12.000000000 +0000
+++ linux-3.2.1-macvlan/net/core/rtnetlink.c	2012-02-02 23:59:54.000000000 +0000
@@ -1874,6 +1874,42 @@ static int rtnl_getlink(struct sk_buff *
 	return err;
 }
 
+static int rtnl_devdump(struct sk_buff *skb, struct nlmsghdr* nlh, void *arg)
+{
+	struct net *net = sock_net(skb->sk);
+	struct ifinfomsg *ifm;
+	char ifname[IFNAMSIZ];
+	struct nlattr *tb[IFLA_MAX+1];
+	struct net_device *dev = NULL;
+	int err;
+	const struct rtnl_link_ops *ops;
+
+	err = nlmsg_parse(nlh, sizeof(*ifm), tb, IFLA_MAX, ifla_policy);
+	if (err < 0)
+		return err;
+
+	if (tb[IFLA_IFNAME])
+		nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
+
+	ifm = nlmsg_data(nlh);
+	if (ifm->ifi_index > 0)
+		dev = __dev_get_by_index(net, ifm->ifi_index);
+	else if (tb[IFLA_IFNAME])
+		dev = __dev_get_by_name(net, ifname);
+	else
+		return -EINVAL;
+
+	if (dev == NULL)
+		return -ENODEV;
+
+	ops = dev->rtnl_link_ops;
+	if (ops && ops->dev_dump)
+		return ops->dev_dump(skb, dev, RTM_DEVDUMP,
+				     NETLINK_CB(skb).pid, nlh->nlmsg_seq);
+
+	return -EOPNOTSUPP;
+}
+
 static u16 rtnl_calcit(struct sk_buff *skb)
 {
 	return min_ifinfo_dump_size;
@@ -2097,5 +2133,7 @@ void __init rtnetlink_init(void)
 
 	rtnl_register(PF_UNSPEC, RTM_GETADDR, NULL, rtnl_dump_all, NULL);
 	rtnl_register(PF_UNSPEC, RTM_GETROUTE, NULL, rtnl_dump_all, NULL);
+
+	rtnl_register(PF_UNSPEC, RTM_DEVDUMP, rtnl_devdump, NULL, NULL);
 }
 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-03 14:24 ` [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue Stefan Gula
@ 2012-02-04  0:29   ` David Miller
  2012-02-06  4:41     ` Rose, Gregory V
  0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2012-02-04  0:29 UTC (permalink / raw)
  To: steweg; +Cc: linux-kernel, netdev

From: Stefan Gula <steweg@ynet.sk>
Date: Fri, 3 Feb 2012 15:24:21 +0100 (CET)

> From: Stefan Gula <steweg@gmail.com>
> 
> Adding new rtnetlink ops and command for getting more information about
> network devices, which are not able to fit inside predefined SKB structures
> (e.g. PAGE_SIZE limit). DEVDUMP command allows to call specific device driver
> code for complete handling this netlink message. Useful if devices needed to
> list some addition dynamic structures like hlists and doesn't require to have
> complete set of codes for it new PF families.
> 
> Signed-off-by: Stefan Gula <steweg@gmail.com>

This is not how we're going to fix this.  I already stated the desired
way to fix this, which is to make the existing dump request have a way
for the requestor to enable extended parts of the device dump.

This is just like netlink diag socket dumps, where the dump request
specifies what the user wants to see.

In this case we'd add a netlink attribute to the dump request which
is just a u32 bitmask or similar.

The Intel engineer who added the VF dump support said he would work on
this fix so why don't you just wait patiently for him to do the work?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-04  0:29   ` David Miller
@ 2012-02-06  4:41     ` Rose, Gregory V
  2012-02-06  8:53       ` Štefan Gula
  2012-02-06 23:50       ` Ben Hutchings
  0 siblings, 2 replies; 15+ messages in thread
From: Rose, Gregory V @ 2012-02-06  4:41 UTC (permalink / raw)
  To: David Miller, steweg; +Cc: linux-kernel, netdev

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of David Miller
> Sent: Friday, February 03, 2012 4:30 PM
> To: steweg@ynet.sk
> Cc: linux-kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> the skb buff size issue
> 
> From: Stefan Gula <steweg@ynet.sk>
> Date: Fri, 3 Feb 2012 15:24:21 +0100 (CET)
> 
> > From: Stefan Gula <steweg@gmail.com>
> >
> > Adding new rtnetlink ops and command for getting more information about
> > network devices, which are not able to fit inside predefined SKB
> structures
> > (e.g. PAGE_SIZE limit). DEVDUMP command allows to call specific device
> driver
> > code for complete handling this netlink message. Useful if devices
> needed to
> > list some addition dynamic structures like hlists and doesn't require to
> have
> > complete set of codes for it new PF families.
> >
> > Signed-off-by: Stefan Gula <steweg@gmail.com>
> 
> This is not how we're going to fix this.  I already stated the desired
> way to fix this, which is to make the existing dump request have a way
> for the requestor to enable extended parts of the device dump.
> 
> This is just like netlink diag socket dumps, where the dump request
> specifies what the user wants to see.
> 
> In this case we'd add a netlink attribute to the dump request which
> is just a u32 bitmask or similar.
> 
> The Intel engineer who added the VF dump support said he would work on
> this fix so why don't you just wait patiently for him to do the work?

The patch below is what I've got so far.  Right now the bit mask array is global so if you enable display of VF (n) on one interface it will enable display of the same VF on other interfaces.  I intend to move the bit mask array into the net_device structure so we can set the display mask for each interface independently.

The command to set the filter mask is "set only", I see no reason to add it to the info dump.  If other folks see it differently then I can do that too.

Anyway, it will allow the user to control which VFs are getting displayed during the info dump.  They all default to off so initially no VF info gets displayed.

I've also whipped up a patch for the iproute2 ip command.  It'll work like this:

'ip link set <dev> vf (n) filter [on|off]'

So if you have 128 VFs on the device you could enable info dumps for arbitrary VFs, e.g. VFs 3, 9, 16, 21, and 31.  Only the info for those VFs would display.  This method has the advantage of not breaking scripts which parse the current VF info display.  Of course, one could also script up something to sequentially enable the display of a single VF, dump the info for it, and then move on to the next.

Before I go much further on this let me know if this is the right track or not.

Thanks,

- Greg

---

 include/linux/if_link.h |    6 ++++++
 net/core/rtnetlink.c    |   30 +++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 3 deletions(-)


diff --git a/include/linux/if_link.h b/include/linux/if_link.h
index c52d4b5..052c240 100644
--- a/include/linux/if_link.h
+++ b/include/linux/if_link.h
@@ -280,6 +280,7 @@ enum {
 	IFLA_VF_VLAN,
 	IFLA_VF_TX_RATE,	/* TX Bandwidth Allocation */
 	IFLA_VF_SPOOFCHK,	/* Spoof Checking on/off switch */
+	IFLA_VF_INFOFILTER,	/* Filter vfinfo on dumps */
 	__IFLA_VF_MAX,
 };
 
@@ -305,6 +306,11 @@ struct ifla_vf_spoofchk {
 	__u32 vf;
 	__u32 setting;
 };
+
+struct ifla_vf_infofilter {
+	__u32 vf;
+	__u32 filter;
+};
 #ifdef __KERNEL__
 
 /* We don't want this structure exposed to user space */
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index af1da12..8c0c8c1 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -62,6 +62,9 @@ struct rtnl_link {
 static DEFINE_MUTEX(rtnl_mutex);
 static u16 min_ifinfo_dump_size;
 
+/* VF info display filter - Number of VFs max is 256 */
+static unsigned long show_vfinfo_filter[256 / sizeof(unsigned long)];
+
 void rtnl_lock(void)
 {
 	mutex_lock(&rtnl_mutex);
@@ -876,6 +879,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 	const struct rtnl_link_stats64 *stats;
 	struct nlattr *attr, *af_spec;
 	struct rtnl_af_ops *af_ops;
+	u32 num_vf_filters_set = 0;
 
 	ASSERT_RTNL();
 	nlh = nlmsg_put(skb, pid, seq, type, sizeof(*ifm), flags);
@@ -941,10 +945,18 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 		goto nla_put_failure;
 	copy_rtnl_link_stats64(nla_data(attr), stats);
 
-	if (dev->dev.parent)
-		NLA_PUT_U32(skb, IFLA_NUM_VF, dev_num_vf(dev->dev.parent));
+	if (dev->dev.parent) {
+		int j;
+		for (j = 0; j < 256; j++) {
+			if (test_bit(j, show_vfinfo_filter))
+				num_vf_filters_set++;
+		}
+		if (num_vf_filters_set)
+			NLA_PUT_U32(skb, IFLA_NUM_VF, num_vf_filters_set);
+	}
 
-	if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent) {
+	if (dev->netdev_ops->ndo_get_vf_config && dev->dev.parent &&
+			num_vf_filters_set) {
 		int i;
 
 		struct nlattr *vfinfo, *vf;
@@ -960,6 +972,9 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct net_device *dev,
 			struct ifla_vf_tx_rate vf_tx_rate;
 			struct ifla_vf_spoofchk vf_spoofchk;
 
+			if (!test_bit(i, show_vfinfo_filter))
+				continue;
+
 			/*
 			 * Not all SR-IOV capable drivers support the
 			 * spoofcheck query.  Preset to -1 so the user
@@ -1234,6 +1249,15 @@ static int do_setvfinfo(struct net_device *dev, struct nlattr *attr)
 							       ivs->setting);
 			break;
 		}
+		case IFLA_VF_INFOFILTER: {
+			 struct ifla_vf_infofilter *ivf;
+			 ivf = nla_data(vf);
+			 if (ivf->filter)
+				 set_bit(ivf->vf, show_vfinfo_filter);
+			 else
+				 clear_bit(ivf->vf, show_vfinfo_filter);
+			 break;
+		}
 		default:
 			err = -EINVAL;
 			break;



> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06  4:41     ` Rose, Gregory V
@ 2012-02-06  8:53       ` Štefan Gula
  2012-02-06 15:15         ` David Miller
  2012-02-06 17:29         ` Rose, Gregory V
  2012-02-06 23:50       ` Ben Hutchings
  1 sibling, 2 replies; 15+ messages in thread
From: Štefan Gula @ 2012-02-06  8:53 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, linux-kernel, netdev

2012/2/6 Rose, Gregory V <gregory.v.rose@intel.com>:
>
> The patch below is what I've got so far.  Right now the bit mask array is global so if you enable display of VF (n) on one interface it will enable display of the same VF on other interfaces.  I intend to move the bit mask array into the net_device structure so we can set the display mask for each interface independently.
>
> The command to set the filter mask is "set only", I see no reason to add it to the info dump.  If other folks see it differently then I can do that too.
>
> Anyway, it will allow the user to control which VFs are getting displayed during the info dump.  They all default to off so initially no VF info gets displayed.
>
> I've also whipped up a patch for the iproute2 ip command.  It'll work like this:
>
> 'ip link set <dev> vf (n) filter [on|off]'
>
> So if you have 128 VFs on the device you could enable info dumps for arbitrary VFs, e.g. VFs 3, 9, 16, 21, and 31.  Only the info for those VFs would display.  This method has the advantage of not breaking scripts which parse the current VF info display.  Of course, one could also script up something to sequentially enable the display of a single VF, dump the info for it, and then move on to the next.
As this patch will allow one to filter some information and possible
lower the need on skb buffer size, the general idea is ok. On the
other hand it will not eliminate the problem. e.g.:
- assume that one didn't know the limits behind it and put all options enabled
- it also doesn't fix the need to fill relevant info by interface
bigger than buffer size, e.g. my macvlan interface mac address list.
If I try to request for it, it will eventually fail with a lot of
records even with filtering...

So I would rather see a proper general method for requesting some
information in cycles inside the single interface like sending request
to kernel per VF for particular device or per MAC address from macvlan
associated lists. This approach is I believe slightly more scalable as
it can be potentially reused on other types of network devices as
well.

My original idea was to have these methods:
1st kernel method will return some info about the absolute number of
cycles needed per given interface - this can be done in standard
GETLINK operation with some associated IFLA_* value.
2nd user-space method will based on that number sends netlink request
per records or reasonable page of records (e.g. 10) and parse the
output in the user-space.
 - this is needed to overcome another issue when kernel generates so
many netlink messages with NLM_F_MULTI that netlink socket will not be
able to hold and further write/read code will fail.
3rd kernel method, which will allocated, fill, and send required info
per record -> this one can be done by ops/command netlink association
(in my proposal it is DEVDUMP)

To sum up, I believe that both approaches (using cycles and filtering)
should be allowed to coexists in kernel, but they should be considered
separately as they are doing different jobs.

I can provide whole example of code (not just rtnetlink part), if the
list is interested to see live example.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06  8:53       ` Štefan Gula
@ 2012-02-06 15:15         ` David Miller
  2012-02-06 16:56           ` Eric Dumazet
  2012-02-06 17:29         ` Rose, Gregory V
  1 sibling, 1 reply; 15+ messages in thread
From: David Miller @ 2012-02-06 15:15 UTC (permalink / raw)
  To: steweg; +Cc: gregory.v.rose, linux-kernel, netdev

From: Štefan Gula <steweg@ynet.sk>
Date: Mon, 6 Feb 2012 09:53:28 +0100

> If I try to request for it, it will eventually fail with a lot of
> records even with filtering...

Then the user can loop increasing the buffer size until the netlink
request succeeds.

It is not a problem.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06 15:15         ` David Miller
@ 2012-02-06 16:56           ` Eric Dumazet
  2012-02-06 18:52             ` Štefan Gula
  0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2012-02-06 16:56 UTC (permalink / raw)
  To: David Miller; +Cc: steweg, gregory.v.rose, linux-kernel, netdev

Le lundi 06 février 2012 à 10:15 -0500, David Miller a écrit :
> From: Štefan Gula <steweg@ynet.sk>
> Date: Mon, 6 Feb 2012 09:53:28 +0100
> 
> > If I try to request for it, it will eventually fail with a lot of
> > records even with filtering...
> 
> Then the user can loop increasing the buffer size until the netlink
> request succeeds.
> 
> It is not a problem.

Actually we always truncate message in netlink_recvmsg()

We could use a MSG_NOPARTIAL flag in netlink_recvmsg() so that user can
avoid the MSG_PEEK operation to fetch next message length.

(Ie not consume/copy skb if user buffer is too small to hold full
message, and only return the needed length)

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06  8:53       ` Štefan Gula
  2012-02-06 15:15         ` David Miller
@ 2012-02-06 17:29         ` Rose, Gregory V
  2012-02-06 18:32           ` Štefan Gula
  1 sibling, 1 reply; 15+ messages in thread
From: Rose, Gregory V @ 2012-02-06 17:29 UTC (permalink / raw)
  To: Štefan Gula; +Cc: David Miller, linux-kernel, netdev

> -----Original Message-----
> From: steweg@gmail.com [mailto:steweg@gmail.com] On Behalf Of Štefan Gula
> Sent: Monday, February 06, 2012 12:53 AM
> To: Rose, Gregory V
> Cc: David Miller; linux-kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> the skb buff size issue
> 
> 2012/2/6 Rose, Gregory V <gregory.v.rose@intel.com>:
> >
> > The patch below is what I've got so far.  Right now the bit mask array
> is global so if you enable display of VF (n) on one interface it will
> enable display of the same VF on other interfaces.  I intend to move the
> bit mask array into the net_device structure so we can set the display
> mask for each interface independently.
> >
> > The command to set the filter mask is "set only", I see no reason to add
> it to the info dump.  If other folks see it differently then I can do that
> too.
> >
> > Anyway, it will allow the user to control which VFs are getting
> displayed during the info dump.  They all default to off so initially no
> VF info gets displayed.
> >
> > I've also whipped up a patch for the iproute2 ip command.  It'll work
> like this:
> >
> > 'ip link set <dev> vf (n) filter [on|off]'
> >
> > So if you have 128 VFs on the device you could enable info dumps for
> arbitrary VFs, e.g. VFs 3, 9, 16, 21, and 31.  Only the info for those VFs
> would display.  This method has the advantage of not breaking scripts
> which parse the current VF info display.  Of course, one could also script
> up something to sequentially enable the display of a single VF, dump the
> info for it, and then move on to the next.
> As this patch will allow one to filter some information and possible
> lower the need on skb buffer size, the general idea is ok. On the
> other hand it will not eliminate the problem. e.g.:
> - assume that one didn't know the limits behind it and put all options
> enabled
> - it also doesn't fix the need to fill relevant info by interface
> bigger than buffer size, e.g. my macvlan interface mac address list.
> If I try to request for it, it will eventually fail with a lot of
> records even with filtering...
> 
> So I would rather see a proper general method for requesting some
> information in cycles inside the single interface like sending request
> to kernel per VF for particular device or per MAC address from macvlan
> associated lists. This approach is I believe slightly more scalable as
> it can be potentially reused on other types of network devices as
> well.
> 
> My original idea was to have these methods:
> 1st kernel method will return some info about the absolute number of
> cycles needed per given interface - this can be done in standard
> GETLINK operation with some associated IFLA_* value.
> 2nd user-space method will based on that number sends netlink request
> per records or reasonable page of records (e.g. 10) and parse the
> output in the user-space.
>  - this is needed to overcome another issue when kernel generates so
> many netlink messages with NLM_F_MULTI that netlink socket will not be
> able to hold and further write/read code will fail.
> 3rd kernel method, which will allocated, fill, and send required info
> per record -> this one can be done by ops/command netlink association
> (in my proposal it is DEVDUMP)
> 
> To sum up, I believe that both approaches (using cycles and filtering)
> should be allowed to coexists in kernel, but they should be considered
> separately as they are doing different jobs.

Stefan,

That is exactly my approach.  We currently have a *bug* in the kernel that this patch is addressing.  The kernel is attempting to provide too much information for the netlink interface to handle and it's breaking things.  So what I want to do is fix the immediate problem while still providing a way for folks to get the information they need.  I've accomplished this by doing exactly what Dave asked me to do, provide a filter that defaults to off and then provide a way for the user to request discrete chunks of information in the dump that won't exceed the netlink buffer limits.

The patch is fairly unobtrusive and simple to understand.

I appreciate that it doesn't do all that you'd like to see done and I see no reason why you couldn't go on and develop the extended features that you would like to see, correct?  There's nothing in my patch that would prevent that so far as I can tell, although I'm not that familiar with your requirements or proposals yet.

- Greg

> 
> I can provide whole example of code (not just rtnetlink part), if the
> list is interested to see live example.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06 17:29         ` Rose, Gregory V
@ 2012-02-06 18:32           ` Štefan Gula
  2012-02-06 18:36             ` Rose, Gregory V
  0 siblings, 1 reply; 15+ messages in thread
From: Štefan Gula @ 2012-02-06 18:32 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, linux-kernel, netdev

2012/2/6 Rose, Gregory V <gregory.v.rose@intel.com>:
> That is exactly my approach.  We currently have a *bug* in the kernel that this patch is addressing.  The kernel is attempting to provide too much information for the netlink interface to handle and it's breaking things.  So what I want to do is fix the immediate problem while still providing a way for folks to get the information they need.  I've accomplished this by doing exactly what Dave asked me to do, provide a filter that defaults to off and then provide a way for the user to request discrete chunks of information in the dump that won't exceed the netlink buffer limits.
>
> The patch is fairly unobtrusive and simple to understand.
>
> I appreciate that it doesn't do all that you'd like to see done and I see no reason why you couldn't go on and develop the extended features that you would like to see, correct?  There's nothing in my patch that would prevent that so far as I can tell, although I'm not that familiar with your requirements or proposals yet.
Greg, your patch is completely ok for filtering. I like that thing. I
am just stating that it doesn't eliminate every possible option that
can happen so I believe we should also have method for using cycles -
that's what my patch is doing. So I believe both approaches should be
applied.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06 18:32           ` Štefan Gula
@ 2012-02-06 18:36             ` Rose, Gregory V
  0 siblings, 0 replies; 15+ messages in thread
From: Rose, Gregory V @ 2012-02-06 18:36 UTC (permalink / raw)
  To: Štefan Gula; +Cc: David Miller, linux-kernel, netdev

> -----Original Message-----
> From: steweg@gmail.com [mailto:steweg@gmail.com] On Behalf Of Štefan Gula
> Sent: Monday, February 06, 2012 10:32 AM
> To: Rose, Gregory V
> Cc: David Miller; linux-kernel@vger.kernel.org; netdev@vger.kernel.org
> Subject: Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> the skb buff size issue
> 
> 2012/2/6 Rose, Gregory V <gregory.v.rose@intel.com>:
> > That is exactly my approach.  We currently have a *bug* in the kernel
> that this patch is addressing.  The kernel is attempting to provide too
> much information for the netlink interface to handle and it's breaking
> things.  So what I want to do is fix the immediate problem while still
> providing a way for folks to get the information they need.  I've
> accomplished this by doing exactly what Dave asked me to do, provide a
> filter that defaults to off and then provide a way for the user to request
> discrete chunks of information in the dump that won't exceed the netlink
> buffer limits.
> >
> > The patch is fairly unobtrusive and simple to understand.
> >
> > I appreciate that it doesn't do all that you'd like to see done and I
> see no reason why you couldn't go on and develop the extended features
> that you would like to see, correct?  There's nothing in my patch that
> would prevent that so far as I can tell, although I'm not that familiar
> with your requirements or proposals yet.
> Greg, your patch is completely ok for filtering. I like that thing. I
> am just stating that it doesn't eliminate every possible option that
> can happen so I believe we should also have method for using cycles -
> that's what my patch is doing. So I believe both approaches should be
> applied.

OK, good deal.  I'll go ahead and finish up my kernel patch and the associated iproute2 patch.

Thanks,

- Greg

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06 16:56           ` Eric Dumazet
@ 2012-02-06 18:52             ` Štefan Gula
  0 siblings, 0 replies; 15+ messages in thread
From: Štefan Gula @ 2012-02-06 18:52 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, gregory.v.rose, linux-kernel, netdev

2012/2/6 Eric Dumazet <eric.dumazet@gmail.com>:
> Le lundi 06 février 2012 à 10:15 -0500, David Miller a écrit :
>> From: Štefan Gula <steweg@ynet.sk>
>> Date: Mon, 6 Feb 2012 09:53:28 +0100
>>
>> > If I try to request for it, it will eventually fail with a lot of
>> > records even with filtering...
>>
>> Then the user can loop increasing the buffer size until the netlink
>> request succeeds.
>>
>> It is not a problem.
>
> Actually we always truncate message in netlink_recvmsg()
>
> We could use a MSG_NOPARTIAL flag in netlink_recvmsg() so that user can
> avoid the MSG_PEEK operation to fetch next message length.
>
> (Ie not consume/copy skb if user buffer is too small to hold full
> message, and only return the needed length)
>
>
>
Not sure if this will work. I tried to implement this by the way of
sending one request from user-space to kernel and using NLM_F_MULTI
messages per record to receive the data back from kernel. The problem
was that if I went somewhere beyond 700 messages/records. I get EAGAIN
error code from kernel while trying to write to netlink socket. On the
other hand iproute code was getting error on recvmsg() that buffer is
full. The messages was only 40B long so they should always be able to
fit the 16k buffer used. So I end up with not being able to write nor
read from the socket -> not really sure why. If I introduce paging to
this, so kernel will put only limited number of records (in my case it
was 10) per one request and wait for another request message to
continue... this approach has done job for me. So maybe a good thing
here would be to post the whole code, including rtnetlink part,
macvlan part, iproute part and let you guys check, if you want. Do you
agree?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06  4:41     ` Rose, Gregory V
  2012-02-06  8:53       ` Štefan Gula
@ 2012-02-06 23:50       ` Ben Hutchings
  2012-02-07  0:13         ` Rose, Gregory V
  1 sibling, 1 reply; 15+ messages in thread
From: Ben Hutchings @ 2012-02-06 23:50 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, steweg, linux-kernel, netdev

On Mon, 2012-02-06 at 04:41 +0000, Rose, Gregory V wrote:
[...]
> > This is not how we're going to fix this.  I already stated the desired
> > way to fix this, which is to make the existing dump request have a way
> > for the requestor to enable extended parts of the device dump.
> > 
> > This is just like netlink diag socket dumps, where the dump request
> > specifies what the user wants to see.
> > 
> > In this case we'd add a netlink attribute to the dump request which
> > is just a u32 bitmask or similar.
> > 
> > The Intel engineer who added the VF dump support said he would work on
> > this fix so why don't you just wait patiently for him to do the work?
> 
> The patch below is what I've got so far.  Right now the bit mask array
> is global so if you enable display of VF (n) on one interface it will
> enable display of the same VF on other interfaces.  I intend to move
> the bit mask array into the net_device structure so we can set the
> display mask for each interface independently.

I don't think this can work.  Using an application that requests VF
information and uses large buffers (e.g. the updated 'ip') will break
another application that doesn't (e.g. current Network Manager), won't
it?

The filter should be per-request and not persistent (and I think it
could just be a boolean or a limit value rather than a bitmask).

> The command to set the filter mask is "set only", I see no reason to
> add it to the info dump.  If other folks see it differently then I can
> do that too.
> 
> Anyway, it will allow the user to control which VFs are getting
> displayed during the info dump.  They all default to off so initially
> no VF info gets displayed.
> 
> I've also whipped up a patch for the iproute2 ip command.  It'll work
> like this:
> 
> 'ip link set <dev> vf (n) filter [on|off]'

Well there's no need for a persistent filter.  And I think that the
default behaviour of 'ip' should be to show all the VFs, as it does now.

[...]
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -62,6 +62,9 @@ struct rtnl_link {
>  static DEFINE_MUTEX(rtnl_mutex);
>  static u16 min_ifinfo_dump_size;
>  
> +/* VF info display filter - Number of VFs max is 256 */
> +static unsigned long show_vfinfo_filter[256 / sizeof(unsigned long)];
[...]

This array is 8 times too large; use BITS_TO_LONGS.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-06 23:50       ` Ben Hutchings
@ 2012-02-07  0:13         ` Rose, Gregory V
  2012-02-07 18:26           ` Ben Hutchings
  0 siblings, 1 reply; 15+ messages in thread
From: Rose, Gregory V @ 2012-02-07  0:13 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David Miller, steweg, linux-kernel, netdev

> -----Original Message-----
> From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> Sent: Monday, February 06, 2012 3:51 PM
> To: Rose, Gregory V
> Cc: David Miller; steweg@ynet.sk; linux-kernel@vger.kernel.org;
> netdev@vger.kernel.org
> Subject: RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> the skb buff size issue
> 
> On Mon, 2012-02-06 at 04:41 +0000, Rose, Gregory V wrote:
> [...]
> > > This is not how we're going to fix this.  I already stated the desired
> > > way to fix this, which is to make the existing dump request have a way
> > > for the requestor to enable extended parts of the device dump.
> > >
> > > This is just like netlink diag socket dumps, where the dump request
> > > specifies what the user wants to see.
> > >
> > > In this case we'd add a netlink attribute to the dump request which
> > > is just a u32 bitmask or similar.
> > >
> > > The Intel engineer who added the VF dump support said he would work on
> > > this fix so why don't you just wait patiently for him to do the work?
> >
> > The patch below is what I've got so far.  Right now the bit mask array
> > is global so if you enable display of VF (n) on one interface it will
> > enable display of the same VF on other interfaces.  I intend to move
> > the bit mask array into the net_device structure so we can set the
> > display mask for each interface independently.
> 
> I don't think this can work.  Using an application that requests VF
> information and uses large buffers (e.g. the updated 'ip') will break
> another application that doesn't (e.g. current Network Manager), won't
> it?

It's my understanding that the problem isn't with the application buffer size but with the kernel buffer size, which is limited to a page.

> 
> The filter should be per-request and not persistent (and I think it
> could just be a boolean or a limit value rather than a bitmask).

Why?

I think having the ability to specify one or more discrete VFs for the info dump is useful.

> 
> > The command to set the filter mask is "set only", I see no reason to
> > add it to the info dump.  If other folks see it differently then I can
> > do that too.
> >
> > Anyway, it will allow the user to control which VFs are getting
> > displayed during the info dump.  They all default to off so initially
> > no VF info gets displayed.
> >
> > I've also whipped up a patch for the iproute2 ip command.  It'll work
> > like this:
> >
> > 'ip link set <dev> vf (n) filter [on|off]'
> 
> Well there's no need for a persistent filter.  And I think that the
> default behaviour of 'ip' should be to show all the VFs, as it does now.

Dave asked me to make the default behavior to not show the VFs.  Take it up with him.

> 
> [...]
> > --- a/net/core/rtnetlink.c
> > +++ b/net/core/rtnetlink.c
> > @@ -62,6 +62,9 @@ struct rtnl_link {
> >  static DEFINE_MUTEX(rtnl_mutex);
> >  static u16 min_ifinfo_dump_size;
> >
> > +/* VF info display filter - Number of VFs max is 256 */
> > +static unsigned long show_vfinfo_filter[256 / sizeof(unsigned long)];
> [...]
> 
> This array is 8 times too large; use BITS_TO_LONGS.

Oops... yeah, that'll be fixed in the actual patch.

- Greg


^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-07  0:13         ` Rose, Gregory V
@ 2012-02-07 18:26           ` Ben Hutchings
  2012-02-07 18:32             ` David Miller
  0 siblings, 1 reply; 15+ messages in thread
From: Ben Hutchings @ 2012-02-07 18:26 UTC (permalink / raw)
  To: Rose, Gregory V; +Cc: David Miller, steweg, linux-kernel, netdev

On Tue, 2012-02-07 at 00:13 +0000, Rose, Gregory V wrote:
> > -----Original Message-----
> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> > Sent: Monday, February 06, 2012 3:51 PM
> > To: Rose, Gregory V
> > Cc: David Miller; steweg@ynet.sk; linux-kernel@vger.kernel.org;
> > netdev@vger.kernel.org
> > Subject: RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> > the skb buff size issue
> > 
> > On Mon, 2012-02-06 at 04:41 +0000, Rose, Gregory V wrote:
> > [...]
> > > > This is not how we're going to fix this.  I already stated the desired
> > > > way to fix this, which is to make the existing dump request have a way
> > > > for the requestor to enable extended parts of the device dump.
> > > >
> > > > This is just like netlink diag socket dumps, where the dump request
> > > > specifies what the user wants to see.
> > > >
> > > > In this case we'd add a netlink attribute to the dump request which
> > > > is just a u32 bitmask or similar.
> > > >
> > > > The Intel engineer who added the VF dump support said he would work on
> > > > this fix so why don't you just wait patiently for him to do the work?
> > >
> > > The patch below is what I've got so far.  Right now the bit mask array
> > > is global so if you enable display of VF (n) on one interface it will
> > > enable display of the same VF on other interfaces.  I intend to move
> > > the bit mask array into the net_device structure so we can set the
> > > display mask for each interface independently.
> > 
> > I don't think this can work.  Using an application that requests VF
> > information and uses large buffers (e.g. the updated 'ip') will break
> > another application that doesn't (e.g. current Network Manager), won't
> > it?
> 
> It's my understanding that the problem isn't with the application
> buffer size but with the kernel buffer size, which is limited to a
> page.
[...]

Then it's no wonder you're going about this the wrong way.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-07 18:26           ` Ben Hutchings
@ 2012-02-07 18:32             ` David Miller
  2012-02-07 18:59               ` Rose, Gregory V
  0 siblings, 1 reply; 15+ messages in thread
From: David Miller @ 2012-02-07 18:32 UTC (permalink / raw)
  To: bhutchings; +Cc: gregory.v.rose, steweg, linux-kernel, netdev

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Tue, 7 Feb 2012 18:26:32 +0000

> On Tue, 2012-02-07 at 00:13 +0000, Rose, Gregory V wrote:
>> > -----Original Message-----
>> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
>> > Sent: Monday, February 06, 2012 3:51 PM
>> > To: Rose, Gregory V
>> > Cc: David Miller; steweg@ynet.sk; linux-kernel@vger.kernel.org;
>> > netdev@vger.kernel.org
>> > Subject: RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around
>> > the skb buff size issue
>> > 
>> > On Mon, 2012-02-06 at 04:41 +0000, Rose, Gregory V wrote:
>> > [...]
>> > > > This is not how we're going to fix this.  I already stated the desired
>> > > > way to fix this, which is to make the existing dump request have a way
>> > > > for the requestor to enable extended parts of the device dump.
>> > > >
>> > > > This is just like netlink diag socket dumps, where the dump request
>> > > > specifies what the user wants to see.
>> > > >
>> > > > In this case we'd add a netlink attribute to the dump request which
>> > > > is just a u32 bitmask or similar.
>> > > >
>> > > > The Intel engineer who added the VF dump support said he would work on
>> > > > this fix so why don't you just wait patiently for him to do the work?
>> > >
>> > > The patch below is what I've got so far.  Right now the bit mask array
>> > > is global so if you enable display of VF (n) on one interface it will
>> > > enable display of the same VF on other interfaces.  I intend to move
>> > > the bit mask array into the net_device structure so we can set the
>> > > display mask for each interface independently.
>> > 
>> > I don't think this can work.  Using an application that requests VF
>> > information and uses large buffers (e.g. the updated 'ip') will break
>> > another application that doesn't (e.g. current Network Manager), won't
>> > it?
>> 
>> It's my understanding that the problem isn't with the application
>> buffer size but with the kernel buffer size, which is limited to a
>> page.
> [...]
> 
> Then it's no wonder you're going about this the wrong way.

It is the userland buffer size that is the problem, userland will allocate
max(8192, PAGE_SIZE) for it's recvmsg() call, that's why we have to only
dump VF's and other potentially large objects when the user enables it in
the request.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue
  2012-02-07 18:32             ` David Miller
@ 2012-02-07 18:59               ` Rose, Gregory V
  0 siblings, 0 replies; 15+ messages in thread
From: Rose, Gregory V @ 2012-02-07 18:59 UTC (permalink / raw)
  To: David Miller, bhutchings; +Cc: steweg, linux-kernel, netdev

> -----Original Message-----
> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
> On Behalf Of David Miller
> Sent: Tuesday, February 07, 2012 10:32 AM
> To: bhutchings@solarflare.com
> Cc: Rose, Gregory V; steweg@ynet.sk; linux-kernel@vger.kernel.org;
> netdev@vger.kernel.org
> Subject: Re: [patch v1, kernel version 3.2.1] rtnetlink workaround around
> the skb buff size issue
> 
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Tue, 7 Feb 2012 18:26:32 +0000
> 
> > On Tue, 2012-02-07 at 00:13 +0000, Rose, Gregory V wrote:
> >> > -----Original Message-----
> >> > From: Ben Hutchings [mailto:bhutchings@solarflare.com]
> >> > Sent: Monday, February 06, 2012 3:51 PM
> >> > To: Rose, Gregory V
> >> > Cc: David Miller; steweg@ynet.sk; linux-kernel@vger.kernel.org;
> >> > netdev@vger.kernel.org
> >> > Subject: RE: [patch v1, kernel version 3.2.1] rtnetlink workaround
> around
> >> > the skb buff size issue
> >> >
> >> > On Mon, 2012-02-06 at 04:41 +0000, Rose, Gregory V wrote:
> >> > [...]
> >> > > > This is not how we're going to fix this.  I already stated the
> desired
> >> > > > way to fix this, which is to make the existing dump request have
> a way
> >> > > > for the requestor to enable extended parts of the device dump.
> >> > > >
> >> > > > This is just like netlink diag socket dumps, where the dump
> request
> >> > > > specifies what the user wants to see.
> >> > > >
> >> > > > In this case we'd add a netlink attribute to the dump request
> which
> >> > > > is just a u32 bitmask or similar.
> >> > > >
> >> > > > The Intel engineer who added the VF dump support said he would
> work on
> >> > > > this fix so why don't you just wait patiently for him to do the
> work?
> >> > >
> >> > > The patch below is what I've got so far.  Right now the bit mask
> array
> >> > > is global so if you enable display of VF (n) on one interface it
> will
> >> > > enable display of the same VF on other interfaces.  I intend to
> move
> >> > > the bit mask array into the net_device structure so we can set the
> >> > > display mask for each interface independently.
> >> >
> >> > I don't think this can work.  Using an application that requests VF
> >> > information and uses large buffers (e.g. the updated 'ip') will break
> >> > another application that doesn't (e.g. current Network Manager),
> won't
> >> > it?
> >>
> >> It's my understanding that the problem isn't with the application
> >> buffer size but with the kernel buffer size, which is limited to a
> >> page.
> > [...]
> >
> > Then it's no wonder you're going about this the wrong way.
> 
> It is the userland buffer size that is the problem, userland will allocate
> max(8192, PAGE_SIZE) for it's recvmsg() call, that's why we have to only
> dump VF's and other potentially large objects when the user enables it in
> the request.

Ah well, when that was stated before I was under the impression it was in the kernel.  My bad...

Anyway, if Ben would like to do this work it's fine by me.  He seems to have some very definite opinions about this and I don't feel like going through another several weeks of fighting about it him.  I've got a patch ready to submit today that goes along the lines of what I've already posted but with a few clean ups.  I'll post it, if it gets shot down then fine, I'll get on with my other work.

- Greg

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2012-02-07 18:59 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <5422254.3711328278789768.JavaMail.root@5-MeO-DMT.ynet.sk>
2012-02-03 14:24 ` [patch v1, kernel version 3.2.1] rtnetlink workaround around the skb buff size issue Stefan Gula
2012-02-04  0:29   ` David Miller
2012-02-06  4:41     ` Rose, Gregory V
2012-02-06  8:53       ` Štefan Gula
2012-02-06 15:15         ` David Miller
2012-02-06 16:56           ` Eric Dumazet
2012-02-06 18:52             ` Štefan Gula
2012-02-06 17:29         ` Rose, Gregory V
2012-02-06 18:32           ` Štefan Gula
2012-02-06 18:36             ` Rose, Gregory V
2012-02-06 23:50       ` Ben Hutchings
2012-02-07  0:13         ` Rose, Gregory V
2012-02-07 18:26           ` Ben Hutchings
2012-02-07 18:32             ` David Miller
2012-02-07 18:59               ` Rose, Gregory V

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).