All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCHv4 iproute2 0/2] libnetlink: malloc correct buff at run time
@ 2017-09-28 13:33 Hangbin Liu
  2017-09-28 13:33 ` [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough Hangbin Liu
  2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
  0 siblings, 2 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-09-28 13:33 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Michal Kubecek, Phil Sutter, Hangbin Liu

From: Hangbin Liu <liuhangbin@gmail.com>

With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size") and
460c03f3f3cc ("iplink: double the buffer size also in iplink_get()"), we
extend the buffer size to avoid truncated message with large numbers of
VFs. But just as Michal said, this is not future-proof since the NIC
number is increasing. We have customer even has 220+ VFs now.

This is not make sense to hard code the buffer and increase it all the time.
So let's just malloc the correct buff size at run time.

Tested with most ip cmds and all look good.

---
v3 -> v4:
* rtnl_recvmsg():
  * As Michal suggested, use zero iov len at the first time to avoid
    copy same data from kernel to userspace.
  * As Stephen suggested, remove loops via goto to make the logic more clear.
* With Phil's help, add __rtnl_recvmsg() to reduce duplicate code.

v2 -> v3:
* rtnl_recvmsg():
  * free buf before each return.
  * return errno when recvmsg failed.

v1 -> v2 by Phil:
* rtnl_recvmsg():
  * Rename output buffer pointer arg to 'answer'.
  * Use realloc() and make sure old buffer is freed on error.
  * Always return a newly allocated buffer for caller to free.
  * Retry on EINTR or EAGAIN so caller doesn't have to.
  * Return well-known negative error codes instead of just -1 on error.
  * Simplify goto label names.
  * If no answer pointer was passed, just free the buffer.
* rtnl_dump_filter_l():
  * Don't retry if rtnl_recvmsg() returns 0 as this can't happen
    anymore.
  * Free buffer returned by rtnl_recvmsg().
* __rtnl_talk():
  * Don't retry if rtnl_recvmsg() returns 0 as this can't happen
    anymore.
  * Free buffer returned by rtnl_recvmsg().
  * Return a newly allocated buffer for callers to free.
* genl_ctrl_resolve_family()
  * Replace 'ghdr + GENL_HDRLEN' to 'answer + NLMSG_LENGTH(GENL_HDRLEN)'
* tc_action_gd()
  * Call print_action() only if cmd == RTM_GETACTION
* Change callers of rtnl_talk*() to always free the answer buffer if
  they passed one.
* Drop extra request buffer space in callers if only used for holding
  output data.
* Drop initialization of answer pointer if not necessary.
* Change callers to pass NULL instead of answer pointer if they don't
  use it afterwards.

Hangbin Liu (2):
  lib/libnetlink: re malloc buff if size is not enough
  lib/libnetlink: update rtnl_talk to support malloc buff at run time

 bridge/fdb.c         |   2 +-
 bridge/link.c        |   2 +-
 bridge/mdb.c         |   2 +-
 bridge/vlan.c        |   2 +-
 genl/ctrl.c          |  19 +++++---
 include/libnetlink.h |   6 +--
 ip/ipaddress.c       |   4 +-
 ip/ipaddrlabel.c     |   4 +-
 ip/ipfou.c           |   4 +-
 ip/ipila.c           |   4 +-
 ip/ipl2tp.c          |   8 +--
 ip/iplink.c          |  38 +++++++--------
 ip/iplink_vrf.c      |  44 ++++++++---------
 ip/ipmacsec.c        |   2 +-
 ip/ipneigh.c         |   2 +-
 ip/ipnetns.c         |  23 +++++----
 ip/ipntable.c        |   2 +-
 ip/iproute.c         |  26 ++++++----
 ip/iprule.c          |   6 +--
 ip/ipseg6.c          |   8 +--
 ip/iptoken.c         |   2 +-
 ip/link_gre.c        |  11 +++--
 ip/link_gre6.c       |  11 +++--
 ip/link_ip6tnl.c     |  11 +++--
 ip/link_iptnl.c      |  10 ++--
 ip/link_vti.c        |  11 +++--
 ip/link_vti6.c       |  11 +++--
 ip/tcp_metrics.c     |   8 +--
 ip/xfrm_policy.c     |  25 +++++-----
 ip/xfrm_state.c      |  30 ++++++------
 lib/libgenl.c        |   9 +++-
 lib/libnetlink.c     | 134 ++++++++++++++++++++++++++++++++++-----------------
 misc/ss.c            |   2 +-
 tc/m_action.c        |  12 ++---
 tc/tc_class.c        |   2 +-
 tc/tc_filter.c       |   8 +--
 tc/tc_qdisc.c        |   2 +-
 37 files changed, 298 insertions(+), 209 deletions(-)

-- 
2.5.5

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
  2017-09-28 13:33 [PATCHv4 iproute2 0/2] libnetlink: malloc correct buff at run time Hangbin Liu
@ 2017-09-28 13:33 ` Hangbin Liu
  2017-09-29 12:55   ` Michal Kubecek
  2017-09-29 17:54   ` Stephen Hemminger
  2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
  1 sibling, 2 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-09-28 13:33 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Michal Kubecek, Phil Sutter, Hangbin Liu

From: Hangbin Liu <liuhangbin@gmail.com>

With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
we doubled the buffer size to support more VFs. But the VFs number is
increasing all the time. Some customers even use more than 200 VFs now.

We could not double it everytime when the buffer is not enough. Let's just
not hard code the buffer size and malloc the correct number when running.

Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
The caller need to free it after using.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 lib/libnetlink.c | 114 ++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 80 insertions(+), 34 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index be7ac86..1847c0b 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -402,6 +402,64 @@ static void rtnl_dump_error(const struct rtnl_handle *rth,
 	}
 }
 
+static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
+{
+	int len;
+
+	do {
+		len = recvmsg(fd, msg, flags);
+	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
+
+	if (len < 0) {
+		fprintf(stderr, "netlink receive error %s (%d)\n",
+			strerror(errno), errno);
+		return -errno;
+	}
+
+	if (len == 0) {
+		fprintf(stderr, "EOF on netlink\n");
+		return -ENODATA;
+	}
+
+	return len;
+}
+
+static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
+{
+	struct iovec *iov = msg->msg_iov;
+	char *buf;
+	int len;
+
+	iov->iov_base = NULL;
+	iov->iov_len = 0;
+
+	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
+	if (len < 0)
+		return len;
+
+	buf = malloc(len);
+	if (!buf) {
+		fprintf(stderr, "malloc error: not enough buffer\n");
+		return -ENOMEM;
+	}
+
+	iov->iov_base = buf;
+	iov->iov_len = len;
+
+	len = __rtnl_recvmsg(fd, msg, 0);
+	if (len < 0) {
+		free(buf);
+		return len;
+	}
+
+	if (answer)
+		*answer = buf;
+	else
+		free(buf);
+
+	return len;
+}
+
 int rtnl_dump_filter_l(struct rtnl_handle *rth,
 		       const struct rtnl_dump_filter_arg *arg)
 {
@@ -413,31 +471,18 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
 		.msg_iov = &iov,
 		.msg_iovlen = 1,
 	};
-	char buf[32768];
+	char *buf;
 	int dump_intr = 0;
 
-	iov.iov_base = buf;
 	while (1) {
 		int status;
 		const struct rtnl_dump_filter_arg *a;
 		int found_done = 0;
 		int msglen = 0;
 
-		iov.iov_len = sizeof(buf);
-		status = recvmsg(rth->fd, &msg, 0);
-
-		if (status < 0) {
-			if (errno == EINTR || errno == EAGAIN)
-				continue;
-			fprintf(stderr, "netlink receive error %s (%d)\n",
-				strerror(errno), errno);
-			return -1;
-		}
-
-		if (status == 0) {
-			fprintf(stderr, "EOF on netlink\n");
-			return -1;
-		}
+		status = rtnl_recvmsg(rth->fd, &msg, &buf);
+		if (status < 0)
+			return status;
 
 		if (rth->dump_fp)
 			fwrite(buf, 1, NLMSG_ALIGN(status), rth->dump_fp);
@@ -462,8 +507,10 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
 
 				if (h->nlmsg_type == NLMSG_DONE) {
 					err = rtnl_dump_done(h);
-					if (err < 0)
+					if (err < 0) {
+						free(buf);
 						return -1;
+					}
 
 					found_done = 1;
 					break; /* process next filter */
@@ -471,19 +518,23 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
 
 				if (h->nlmsg_type == NLMSG_ERROR) {
 					rtnl_dump_error(rth, h);
+					free(buf);
 					return -1;
 				}
 
 				if (!rth->dump_fp) {
 					err = a->filter(&nladdr, h, a->arg1);
-					if (err < 0)
+					if (err < 0) {
+						free(buf);
 						return err;
+					}
 				}
 
 skip_it:
 				h = NLMSG_NEXT(h, msglen);
 			}
 		}
+		free(buf);
 
 		if (found_done) {
 			if (dump_intr)
@@ -543,7 +594,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 		.msg_iov = &iov,
 		.msg_iovlen = 1,
 	};
-	char   buf[32768] = {};
+	char *buf;
 
 	n->nlmsg_seq = seq = ++rtnl->seq;
 
@@ -556,22 +607,12 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 		return -1;
 	}
 
-	iov.iov_base = buf;
 	while (1) {
-		iov.iov_len = sizeof(buf);
-		status = recvmsg(rtnl->fd, &msg, 0);
+		status = rtnl_recvmsg(rtnl->fd, &msg, &buf);
+
+		if (status < 0)
+			return status;
 
-		if (status < 0) {
-			if (errno == EINTR || errno == EAGAIN)
-				continue;
-			fprintf(stderr, "netlink receive error %s (%d)\n",
-				strerror(errno), errno);
-			return -1;
-		}
-		if (status == 0) {
-			fprintf(stderr, "EOF on netlink\n");
-			return -1;
-		}
 		if (msg.msg_namelen != sizeof(nladdr)) {
 			fprintf(stderr,
 				"sender address length == %d\n",
@@ -585,6 +626,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 			if (l < 0 || len > status) {
 				if (msg.msg_flags & MSG_TRUNC) {
 					fprintf(stderr, "Truncated message\n");
+					free(buf);
 					return -1;
 				}
 				fprintf(stderr,
@@ -611,6 +653,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 					if (answer)
 						memcpy(answer, h,
 						       MIN(maxlen, h->nlmsg_len));
+					free(buf);
 					return 0;
 				}
 
@@ -619,12 +662,14 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 					rtnl_talk_error(h, err, errfn);
 
 				errno = -err->error;
+				free(buf);
 				return -1;
 			}
 
 			if (answer) {
 				memcpy(answer, h,
 				       MIN(maxlen, h->nlmsg_len));
+				free(buf);
 				return 0;
 			}
 
@@ -633,6 +678,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 			status -= NLMSG_ALIGN(len);
 			h = (struct nlmsghdr *)((char *)h + NLMSG_ALIGN(len));
 		}
+		free(buf);
 
 		if (msg.msg_flags & MSG_TRUNC) {
 			fprintf(stderr, "Message truncated\n");
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-09-28 13:33 [PATCHv4 iproute2 0/2] libnetlink: malloc correct buff at run time Hangbin Liu
  2017-09-28 13:33 ` [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough Hangbin Liu
@ 2017-09-28 13:33 ` Hangbin Liu
  2017-09-29 13:06   ` Michal Kubecek
                     ` (2 more replies)
  1 sibling, 3 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-09-28 13:33 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Michal Kubecek, Phil Sutter, Hangbin Liu

From: Hangbin Liu <liuhangbin@gmail.com>

This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
iplink_get()"). After update, we will not need to double the buffer size
every time when VFs number increased.

With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
length parameter.

With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
answer to avoid overwrite data in nlh, because it may has more info after
nlh. also this will avoid nlh buffer not enough issue.

We need to free answer after using.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Phil Sutter <phil@nwl.cc>
---
 bridge/fdb.c         |  2 +-
 bridge/link.c        |  2 +-
 bridge/mdb.c         |  2 +-
 bridge/vlan.c        |  2 +-
 genl/ctrl.c          | 19 ++++++++++++-------
 include/libnetlink.h |  6 +++---
 ip/ipaddress.c       |  4 ++--
 ip/ipaddrlabel.c     |  4 ++--
 ip/ipfou.c           |  4 ++--
 ip/ipila.c           |  4 ++--
 ip/ipl2tp.c          |  8 ++++----
 ip/iplink.c          | 38 +++++++++++++++++++-------------------
 ip/iplink_vrf.c      | 44 ++++++++++++++++++++------------------------
 ip/ipmacsec.c        |  2 +-
 ip/ipneigh.c         |  2 +-
 ip/ipnetns.c         | 23 ++++++++++++++---------
 ip/ipntable.c        |  2 +-
 ip/iproute.c         | 26 +++++++++++++++++---------
 ip/iprule.c          |  6 +++---
 ip/ipseg6.c          |  8 +++++---
 ip/iptoken.c         |  2 +-
 ip/link_gre.c        | 11 +++++++----
 ip/link_gre6.c       | 11 +++++++----
 ip/link_ip6tnl.c     | 11 +++++++----
 ip/link_iptnl.c      | 10 ++++++----
 ip/link_vti.c        | 11 +++++++----
 ip/link_vti6.c       | 11 +++++++----
 ip/tcp_metrics.c     |  8 +++++---
 ip/xfrm_policy.c     | 25 +++++++++++++------------
 ip/xfrm_state.c      | 30 ++++++++++++++++--------------
 lib/libgenl.c        |  9 +++++++--
 lib/libnetlink.c     | 24 +++++++++++-------------
 misc/ss.c            |  2 +-
 tc/m_action.c        | 12 ++++++------
 tc/tc_class.c        |  2 +-
 tc/tc_filter.c       |  8 +++++---
 tc/tc_qdisc.c        |  2 +-
 37 files changed, 220 insertions(+), 177 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index e5cebf9..807914f 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -535,7 +535,7 @@ static int fdb_modify(int cmd, int flags, int argc, char **argv)
 		return -1;
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -1;
 
 	return 0;
diff --git a/bridge/link.c b/bridge/link.c
index 93472ad..cc29a2a 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -426,7 +426,7 @@ static int brlink_modify(int argc, char **argv)
 		addattr_nest_end(&req.n, nest);
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -1;
 
 	return 0;
diff --git a/bridge/mdb.c b/bridge/mdb.c
index 748091b..f38e326 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -440,7 +440,7 @@ static int mdb_modify(int cmd, int flags, int argc, char **argv)
 	entry.vid = vid;
 	addattr_l(&req.n, sizeof(req), MDBA_SET_ENTRY, &entry, sizeof(entry));
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -1;
 
 	return 0;
diff --git a/bridge/vlan.c b/bridge/vlan.c
index ebcdace..5d68359 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -133,7 +133,7 @@ static int vlan_modify(int cmd, int argc, char **argv)
 
 	addattr_nest_end(&req.n, afspec);
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -1;
 
 	return 0;
diff --git a/genl/ctrl.c b/genl/ctrl.c
index 448988e..a6d31b0 100644
--- a/genl/ctrl.c
+++ b/genl/ctrl.c
@@ -55,6 +55,7 @@ int genl_ctrl_resolve_family(const char *family)
 	};
 	struct nlmsghdr *nlh = &req.n;
 	struct genlmsghdr *ghdr = &req.g;
+	struct nlmsghdr *answer = NULL;
 
 	if (rtnl_open_byproto(&rth, 0, NETLINK_GENERIC) < 0) {
 		fprintf(stderr, "Cannot open generic netlink socket\n");
@@ -63,19 +64,19 @@ int genl_ctrl_resolve_family(const char *family)
 
 	addattr_l(nlh, 128, CTRL_ATTR_FAMILY_NAME, family, strlen(family) + 1);
 
-	if (rtnl_talk(&rth, nlh, nlh, sizeof(req)) < 0) {
+	if (rtnl_talk(&rth, nlh, &answer) < 0) {
 		fprintf(stderr, "Error talking to the kernel\n");
 		goto errout;
 	}
 
 	{
 		struct rtattr *tb[CTRL_ATTR_MAX + 1];
-		int len = nlh->nlmsg_len;
+		int len = answer->nlmsg_len;
 		struct rtattr *attrs;
 
-		if (nlh->nlmsg_type !=  GENL_ID_CTRL) {
+		if (answer->nlmsg_type !=  GENL_ID_CTRL) {
 			fprintf(stderr, "Not a controller message, nlmsg_len=%d "
-				"nlmsg_type=0x%x\n", nlh->nlmsg_len, nlh->nlmsg_type);
+				"nlmsg_type=0x%x\n", answer->nlmsg_len, answer->nlmsg_type);
 			goto errout;
 		}
 
@@ -88,10 +89,11 @@ int genl_ctrl_resolve_family(const char *family)
 
 		if (len < 0) {
 			fprintf(stderr, "wrong controller message len %d\n", len);
+			free(answer);
 			return -1;
 		}
 
-		attrs = (struct rtattr *) ((char *) ghdr + GENL_HDRLEN);
+		attrs = (struct rtattr *) ((char *) answer + NLMSG_LENGTH(GENL_HDRLEN));
 		parse_rtattr(tb, CTRL_ATTR_MAX, attrs, len);
 
 		if (tb[CTRL_ATTR_FAMILY_ID] == NULL) {
@@ -103,6 +105,7 @@ int genl_ctrl_resolve_family(const char *family)
 	}
 
 errout:
+	free(answer);
 	rtnl_close(&rth);
 	return ret;
 }
@@ -299,6 +302,7 @@ static int ctrl_list(int cmd, int argc, char **argv)
 		.g.cmd = CTRL_CMD_GETFAMILY,
 	};
 	struct nlmsghdr *nlh = &req.n;
+	struct nlmsghdr *answer = NULL;
 
 	if (rtnl_open_byproto(&rth, 0, NETLINK_GENERIC) < 0) {
 		fprintf(stderr, "Cannot open generic netlink socket\n");
@@ -331,12 +335,12 @@ static int ctrl_list(int cmd, int argc, char **argv)
 			goto ctrl_done;
 		}
 
-		if (rtnl_talk(&rth, nlh, nlh, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, nlh, &answer) < 0) {
 			fprintf(stderr, "Error talking to the kernel\n");
 			goto ctrl_done;
 		}
 
-		if (print_ctrl2(NULL, nlh, (void *) stdout) < 0) {
+		if (print_ctrl2(NULL, answer, (void *) stdout) < 0) {
 			fprintf(stderr, "Dump terminated\n");
 			goto ctrl_done;
 		}
@@ -358,6 +362,7 @@ static int ctrl_list(int cmd, int argc, char **argv)
 
 	ret = 0;
 ctrl_done:
+	free(answer);
 	rtnl_close(&rth);
 	return ret;
 }
diff --git a/include/libnetlink.h b/include/libnetlink.h
index 69257f0..77b6260 100644
--- a/include/libnetlink.h
+++ b/include/libnetlink.h
@@ -93,13 +93,13 @@ int rtnl_dump_filter_nc(struct rtnl_handle *rth,
 #define rtnl_dump_filter(rth, filter, arg) \
 	rtnl_dump_filter_nc(rth, filter, arg, 0)
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-	      struct nlmsghdr *answer, size_t len)
+	      struct nlmsghdr **answer)
 	__attribute__((warn_unused_result));
 int rtnl_talk_extack(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-	      struct nlmsghdr *answer, size_t len, nl_ext_ack_fn_t errfn)
+	      struct nlmsghdr **answer, nl_ext_ack_fn_t errfn)
 	__attribute__((warn_unused_result));
 int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-				   struct nlmsghdr *answer, size_t len)
+				   struct nlmsghdr **answer)
 	__attribute__((warn_unused_result));
 int rtnl_send(struct rtnl_handle *rth, const void *buf, int)
 	__attribute__((warn_unused_result));
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 9797145..019dd1e 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -1824,7 +1824,7 @@ static int restore_handler(const struct sockaddr_nl *nl,
 
 	ll_init_map(&rth);
 
-	ret = rtnl_talk(&rth, n, n, sizeof(*n));
+	ret = rtnl_talk(&rth, n, NULL);
 	if ((ret < 0) && (errno == EEXIST))
 		ret = 0;
 
@@ -2520,7 +2520,7 @@ static int ipaddr_modify(int cmd, int flags, int argc, char **argv)
 		return -1;
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/ipaddrlabel.c b/ip/ipaddrlabel.c
index 1d324da..6ea9bff 100644
--- a/ip/ipaddrlabel.c
+++ b/ip/ipaddrlabel.c
@@ -176,7 +176,7 @@ static int ipaddrlabel_modify(int cmd, int argc, char **argv)
 	if (req.ifal.ifal_family == AF_UNSPEC)
 		req.ifal.ifal_family = AF_INET6;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -203,7 +203,7 @@ static int flush_addrlabel(const struct sockaddr_nl *who, struct nlmsghdr *n, vo
 		if (rtnl_open(&rth2, 0) < 0)
 			return -1;
 
-		if (rtnl_talk(&rth2, n, NULL, 0) < 0)
+		if (rtnl_talk(&rth2, n, NULL) < 0)
 			return -2;
 
 		rtnl_close(&rth2);
diff --git a/ip/ipfou.c b/ip/ipfou.c
index 00dbe15..23000dc 100644
--- a/ip/ipfou.c
+++ b/ip/ipfou.c
@@ -116,7 +116,7 @@ static int do_add(int argc, char **argv)
 
 	fou_parse_opt(argc, argv, &req.n, true);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -128,7 +128,7 @@ static int do_del(int argc, char **argv)
 
 	fou_parse_opt(argc, argv, &req.n, false);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/ipila.c b/ip/ipila.c
index 843cc16..0403fc4 100644
--- a/ip/ipila.c
+++ b/ip/ipila.c
@@ -220,7 +220,7 @@ static int do_add(int argc, char **argv)
 
 	ila_parse_opt(argc, argv, &req.n, true);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -232,7 +232,7 @@ static int do_del(int argc, char **argv)
 
 	ila_parse_opt(argc, argv, &req.n, false);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/ipl2tp.c b/ip/ipl2tp.c
index 88664c9..742adbe 100644
--- a/ip/ipl2tp.c
+++ b/ip/ipl2tp.c
@@ -129,7 +129,7 @@ static int create_tunnel(struct l2tp_parm *p)
 			addattr(&req.n, 1024, L2TP_ATTR_UDP_ZERO_CSUM6_RX);
 	}
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -142,7 +142,7 @@ static int delete_tunnel(struct l2tp_parm *p)
 
 	addattr32(&req.n, 128, L2TP_ATTR_CONN_ID, p->tunnel_id);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -185,7 +185,7 @@ static int create_session(struct l2tp_parm *p)
 	if (p->ifname && p->ifname[0])
 		addattrstrz(&req.n, 1024, L2TP_ATTR_IFNAME, p->ifname);
 
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -198,7 +198,7 @@ static int delete_session(struct l2tp_parm *p)
 
 	addattr32(&req.n, 1024, L2TP_ATTR_CONN_ID, p->tunnel_id);
 	addattr32(&req.n, 1024, L2TP_ATTR_SESSION_ID, p->session_id);
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/iplink.c b/ip/iplink.c
index ff5b56c..21a22cd 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -249,19 +249,26 @@ static int nl_get_ll_addr_len(unsigned int dev_index)
 			.ifi_index = dev_index,
 		}
 	};
+	struct nlmsghdr *answer;
 	struct rtattr *tb[IFLA_MAX+1];
 
-	if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0)
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		return -1;
 
-	len = req.n.nlmsg_len - NLMSG_LENGTH(sizeof(req.i));
-	if (len < 0)
+	len = answer->nlmsg_len - NLMSG_LENGTH(sizeof(struct ifinfomsg));
+	if (len < 0) {
+		free(answer);
 		return -1;
+	}
 
-	parse_rtattr_flags(tb, IFLA_MAX, IFLA_RTA(&req.i), len, NLA_F_NESTED);
-	if (!tb[IFLA_ADDRESS])
+	parse_rtattr_flags(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)),
+			   len, NLA_F_NESTED);
+	if (!tb[IFLA_ADDRESS]) {
+		free(answer);
 		return -1;
+	}
 
+	free(answer);
 	return RTA_PAYLOAD(tb[IFLA_ADDRESS]);
 }
 
@@ -913,7 +920,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 
 			req.i.ifi_index = 0;
 			addattr32(&req.n, sizeof(req), IFLA_GROUP, group);
-			if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+			if (rtnl_talk(&rth, &req.n, NULL) < 0)
 				return -2;
 			return 0;
 		}
@@ -1008,7 +1015,7 @@ static int iplink_modify(int cmd, unsigned int flags, int argc, char **argv)
 		return -1;
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -1023,10 +1030,7 @@ int iplink_get(unsigned int flags, char *name, __u32 filt_mask)
 		.n.nlmsg_type = RTM_GETLINK,
 		.i.ifi_family = preferred_family,
 	};
-	struct {
-		struct nlmsghdr n;
-		char buf[32768];
-	} answer;
+	struct nlmsghdr *answer;
 
 	if (name) {
 		len = strlen(name) + 1;
@@ -1039,21 +1043,17 @@ int iplink_get(unsigned int flags, char *name, __u32 filt_mask)
 	}
 	addattr32(&req.n, sizeof(req), IFLA_EXT_MASK, filt_mask);
 
-	if (rtnl_talk(&rth, &req.n, &answer.n, sizeof(answer)) < 0)
-		return -2;
-	if (answer.n.nlmsg_len > sizeof(answer.buf)) {
-		fprintf(stderr, "Message truncated from %u to %lu\n",
-			answer.n.nlmsg_len, sizeof(answer.buf));
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		return -2;
-	}
 
 	open_json_object(NULL);
 	if (brief)
-		print_linkinfo_brief(NULL, &answer.n, stdout, NULL);
+		print_linkinfo_brief(NULL, answer, stdout, NULL);
 	else
-		print_linkinfo(NULL, &answer.n, stdout);
+		print_linkinfo(NULL, answer, stdout);
 	close_json_object();
 
+	free(answer);
 	return 0;
 }
 
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index 7a1bb5e..e9dd0df 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -119,10 +119,7 @@ __u32 ipvrf_get_table(const char *name)
 			.ifi_family  = preferred_family,
 		},
 	};
-	struct {
-		struct nlmsghdr n;
-		char buf[8192];
-	} answer;
+	struct nlmsghdr *answer;
 	struct rtattr *tb[IFLA_MAX+1];
 	struct rtattr *li[IFLA_INFO_MAX+1];
 	struct rtattr *vrf_attr[IFLA_VRF_MAX + 1];
@@ -132,8 +129,7 @@ __u32 ipvrf_get_table(const char *name)
 
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
-					   &answer.n, sizeof(answer)) < 0) {
+	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n, &answer) < 0) {
 		/* special case "default" vrf to be the main table */
 		if (errno == ENODEV && !strcmp(name, "default"))
 			if (rtnl_rttable_a2n(&tb_id, "main"))
@@ -143,25 +139,25 @@ __u32 ipvrf_get_table(const char *name)
 		return tb_id;
 	}
 
-	ifi = NLMSG_DATA(&answer.n);
-	len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+	ifi = NLMSG_DATA(answer);
+	len = answer->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
 	if (len < 0) {
 		fprintf(stderr, "BUG: Invalid response to link query.\n");
-		return 0;
+		goto out;
 	}
 
 	parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
 
 	if (!tb[IFLA_LINKINFO])
-		return 0;
+		goto out;
 
 	parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
 
 	if (!li[IFLA_INFO_KIND] || !li[IFLA_INFO_DATA])
-		return 0;
+		goto out;
 
 	if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
-		return 0;
+		goto out;
 
 	parse_rtattr_nested(vrf_attr, IFLA_VRF_MAX, li[IFLA_INFO_DATA]);
 	if (vrf_attr[IFLA_VRF_TABLE])
@@ -170,6 +166,8 @@ __u32 ipvrf_get_table(const char *name)
 	if (!tb_id)
 		fprintf(stderr, "BUG: VRF %s is missing table id\n", name);
 
+out:
+	free(answer);
 	return tb_id;
 }
 
@@ -189,10 +187,7 @@ int name_is_vrf(const char *name)
 			.ifi_family  = preferred_family,
 		},
 	};
-	struct {
-		struct nlmsghdr n;
-		char buf[8192];
-	} answer;
+	struct nlmsghdr *answer;
 	struct rtattr *tb[IFLA_MAX+1];
 	struct rtattr *li[IFLA_INFO_MAX+1];
 	struct ifinfomsg *ifi;
@@ -200,29 +195,30 @@ int name_is_vrf(const char *name)
 
 	addattr_l(&req.n, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
 
-	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n,
-					   &answer.n, sizeof(answer)) < 0)
+	if (rtnl_talk_suppress_rtnl_errmsg(&rth, &req.n, &answer) < 0)
 		return 0;
 
-	ifi = NLMSG_DATA(&answer.n);
-	len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+	ifi = NLMSG_DATA(answer);
+	len = answer->nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
 	if (len < 0) {
 		fprintf(stderr, "BUG: Invalid response to link query.\n");
-		return 0;
+		goto out;
 	}
 
 	parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
 
 	if (!tb[IFLA_LINKINFO])
-		return 0;
+		goto out;
 
 	parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
 
 	if (!li[IFLA_INFO_KIND])
-		return 0;
+		goto out;
 
 	if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
-		return 0;
+		goto out;
 
+out:
+	free(answer);
 	return ifi->ifi_index;
 }
diff --git a/ip/ipmacsec.c b/ip/ipmacsec.c
index ecc371a..345a707 100644
--- a/ip/ipmacsec.c
+++ b/ip/ipmacsec.c
@@ -421,7 +421,7 @@ static int do_modify_nl(enum cmd c, enum macsec_nl_commands cmd, int ifindex,
 	addattr_nest_end(&req.n, attr_sa);
 
 talk:
-	if (rtnl_talk(&genl_rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&genl_rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 9c38a60..32f2d55 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -184,7 +184,7 @@ static int ipneigh_modify(int cmd, int flags, int argc, char **argv)
 		return -1;
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	return 0;
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index afb4978..bc70997 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -95,12 +95,13 @@ static int get_netnsid_from_name(const char *name)
 		struct nlmsghdr n;
 		struct rtgenmsg g;
 		char            buf[1024];
-	} answer, req = {
+	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtgenmsg)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
 		.n.nlmsg_type = RTM_GETNSID,
 		.g.rtgen_family = AF_UNSPEC,
 	};
+	struct nlmsghdr *answer;
 	struct rtattr *tb[NETNSA_MAX + 1];
 	struct rtgenmsg *rthdr;
 	int len, fd;
@@ -110,26 +111,30 @@ static int get_netnsid_from_name(const char *name)
 		return fd;
 
 	addattr32(&req.n, 1024, NETNSA_FD, fd);
-	if (rtnl_talk(&rtnsh, &req.n, &answer.n, sizeof(answer)) < 0) {
+	if (rtnl_talk(&rtnsh, &req.n, &answer) < 0) {
 		close(fd);
 		return -2;
 	}
 	close(fd);
 
 	/* Validate message and parse attributes */
-	if (answer.n.nlmsg_type == NLMSG_ERROR)
-		return -1;
+	if (answer->nlmsg_type == NLMSG_ERROR)
+		goto err_out;
 
-	rthdr = NLMSG_DATA(&answer.n);
-	len = answer.n.nlmsg_len - NLMSG_SPACE(sizeof(*rthdr));
+	rthdr = NLMSG_DATA(answer);
+	len = answer->nlmsg_len - NLMSG_SPACE(sizeof(*rthdr));
 	if (len < 0)
-		return -1;
+		goto err_out;
 
 	parse_rtattr(tb, NETNSA_MAX, NETNS_RTA(rthdr), len);
 
-	if (tb[NETNSA_NSID])
+	if (tb[NETNSA_NSID]) {
+		free(answer);
 		return rta_getattr_u32(tb[NETNSA_NSID]);
+	}
 
+err_out:
+	free(answer);
 	return -1;
 }
 
@@ -689,7 +694,7 @@ static int set_netnsid_from_name(const char *name, int nsid)
 
 	addattr32(&req.n, 1024, NETNSA_FD, fd);
 	addattr32(&req.n, 1024, NETNSA_NSID, nsid);
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		err = -2;
 
 	close(fd);
diff --git a/ip/ipntable.c b/ip/ipntable.c
index 88236ce..2f72c98 100644
--- a/ip/ipntable.c
+++ b/ip/ipntable.c
@@ -304,7 +304,7 @@ static int ipntable_modify(int cmd, int flags, int argc, char **argv)
 			  RTA_PAYLOAD(parms_rta));
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	return 0;
diff --git a/ip/iproute.c b/ip/iproute.c
index a8733f4..e45a5f2 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -1300,7 +1300,7 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (!type_ok && req.r.rtm_family == AF_MPLS)
 		req.r.rtm_type = RTN_UNICAST;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
@@ -1680,6 +1680,7 @@ static int iproute_get(int argc, char **argv)
 	};
 	char  *idev = NULL;
 	char  *odev = NULL;
+	struct nlmsghdr *answer;
 	int connected = 0;
 	int fib_match = 0;
 	int from_ok = 0;
@@ -1800,26 +1801,29 @@ static int iproute_get(int argc, char **argv)
 	if (fib_match)
 		req.r.rtm_flags |= RTM_F_FIB_MATCH;
 
-	if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0)
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		return -2;
 
 	if (connected && !from_ok) {
-		struct rtmsg *r = NLMSG_DATA(&req.n);
-		int len = req.n.nlmsg_len;
+		struct rtmsg *r = NLMSG_DATA(answer);
+		int len = answer->nlmsg_len;
 		struct rtattr *tb[RTA_MAX+1];
 
-		if (print_route(NULL, &req.n, (void *)stdout) < 0) {
+		if (print_route(NULL, answer, (void *)stdout) < 0) {
 			fprintf(stderr, "An error :-)\n");
+			free(answer);
 			return -1;
 		}
 
-		if (req.n.nlmsg_type != RTM_NEWROUTE) {
+		if (answer->nlmsg_type != RTM_NEWROUTE) {
 			fprintf(stderr, "Not a route?\n");
+			free(answer);
 			return -1;
 		}
 		len -= NLMSG_LENGTH(sizeof(*r));
 		if (len < 0) {
 			fprintf(stderr, "Wrong len %d\n", len);
+			free(answer);
 			return -1;
 		}
 
@@ -1830,6 +1834,7 @@ static int iproute_get(int argc, char **argv)
 			r->rtm_src_len = 8*RTA_PAYLOAD(tb[RTA_PREFSRC]);
 		} else if (!tb[RTA_SRC]) {
 			fprintf(stderr, "Failed to connect the route\n");
+			free(answer);
 			return -1;
 		}
 		if (!odev && tb[RTA_OIF])
@@ -1843,15 +1848,18 @@ static int iproute_get(int argc, char **argv)
 		req.n.nlmsg_flags = NLM_F_REQUEST;
 		req.n.nlmsg_type = RTM_GETROUTE;
 
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0)
+		free(answer);
+		if (rtnl_talk(&rth, &req.n, &answer) < 0)
 			return -2;
 	}
 
-	if (print_route(NULL, &req.n, (void *)stdout) < 0) {
+	if (print_route(NULL, answer, (void *)stdout) < 0) {
 		fprintf(stderr, "An error :-)\n");
+		free(answer);
 		return -1;
 	}
 
+	free(answer);
 	return 0;
 }
 
@@ -1895,7 +1903,7 @@ restore:
 
 	ll_init_map(&rth);
 
-	ret = rtnl_talk(&rth, n, n, sizeof(*n));
+	ret = rtnl_talk(&rth, n, NULL);
 	if ((ret < 0) && (errno == EEXIST))
 		ret = 0;
 
diff --git a/ip/iprule.c b/ip/iprule.c
index 8313138..e64b4d7 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -393,7 +393,7 @@ static int flush_rule(const struct sockaddr_nl *who, struct nlmsghdr *n,
 		if (rtnl_open(&rth2, 0) < 0)
 			return -1;
 
-		if (rtnl_talk(&rth2, n, NULL, 0) < 0)
+		if (rtnl_talk(&rth2, n, NULL) < 0)
 			return -2;
 
 		rtnl_close(&rth2);
@@ -553,7 +553,7 @@ static int restore_handler(const struct sockaddr_nl *nl,
 
 	ll_init_map(&rth);
 
-	ret = rtnl_talk(&rth, n, n, sizeof(*n));
+	ret = rtnl_talk(&rth, n, NULL);
 	if ((ret < 0) && (errno == EEXIST))
 		ret = 0;
 
@@ -760,7 +760,7 @@ static int iprule_modify(int cmd, int argc, char **argv)
 	if (!table_ok && cmd == RTM_NEWRULE)
 		req.r.rtm_table = RT_TABLE_MAIN;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/ipseg6.c b/ip/ipseg6.c
index a8f5c69..461a3c1 100644
--- a/ip/ipseg6.c
+++ b/ip/ipseg6.c
@@ -125,6 +125,7 @@ static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
 static int seg6_do_cmd(void)
 {
 	SEG6_REQUEST(req, 1024, opts.cmd, NLM_F_REQUEST);
+	struct nlmsghdr *answer;
 	int repl = 0, dump = 0;
 
 	if (genl_family < 0) {
@@ -163,15 +164,16 @@ static int seg6_do_cmd(void)
 	}
 
 	if (!repl && !dump) {
-		if (rtnl_talk(&grth, &req.n, NULL, 0) < 0)
+		if (rtnl_talk(&grth, &req.n, NULL) < 0)
 			return -1;
 	} else if (repl) {
-		if (rtnl_talk(&grth, &req.n, &req.n, sizeof(req)) < 0)
+		if (rtnl_talk(&grth, &req.n, &answer) < 0)
 			return -2;
-		if (process_msg(NULL, &req.n, stdout) < 0) {
+		if (process_msg(NULL, answer, stdout) < 0) {
 			fprintf(stderr, "Error parsing reply\n");
 			exit(1);
 		}
+		free(answer);
 	} else {
 		req.n.nlmsg_flags |= NLM_F_DUMP;
 		req.n.nlmsg_seq = grth.dump = ++grth.seq;
diff --git a/ip/iptoken.c b/ip/iptoken.c
index 1869f76..0528bad 100644
--- a/ip/iptoken.c
+++ b/ip/iptoken.c
@@ -166,7 +166,7 @@ static int iptoken_set(int argc, char **argv, bool delete)
 	addattr_nest_end(&req.n, afs6);
 	addattr_nest_end(&req.n, afs);
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return -2;
 
 	return 0;
diff --git a/ip/link_gre.c b/ip/link_gre.c
index 9ea2970..35782ca 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -68,7 +68,6 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[16384];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -76,6 +75,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *greinfo[IFLA_GRE_MAX + 1];
@@ -100,19 +100,20 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u32 erspan_idx = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -177,6 +178,8 @@ get_failed:
 
 		if (greinfo[IFLA_GRE_ERSPAN_INDEX])
 			erspan_idx = rta_getattr_u32(greinfo[IFLA_GRE_ERSPAN_INDEX]);
+
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 7d07932..2eedec8 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -78,7 +78,6 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[1024];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -86,6 +85,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *greinfo[IFLA_GRE_MAX + 1];
@@ -108,19 +108,20 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u32 fwmark = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -180,6 +181,8 @@ get_failed:
 
 		if (greinfo[IFLA_GRE_FWMARK])
 			fwmark = rta_getattr_u32(greinfo[IFLA_GRE_FWMARK]);
+
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/link_ip6tnl.c b/ip/link_ip6tnl.c
index a419900..2f8c3f3 100644
--- a/ip/link_ip6tnl.c
+++ b/ip/link_ip6tnl.c
@@ -75,7 +75,6 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[2048];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -83,6 +82,7 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
@@ -103,19 +103,20 @@ static int ip6tunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u32 fwmark = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -158,6 +159,8 @@ get_failed:
 
 		if (iptuninfo[IFLA_IPTUN_FWMARK])
 			fwmark = rta_getattr_u32(iptuninfo[IFLA_IPTUN_FWMARK]);
+
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/link_iptnl.c b/ip/link_iptnl.c
index 6a725e9..4940b8b 100644
--- a/ip/link_iptnl.c
+++ b/ip/link_iptnl.c
@@ -76,7 +76,6 @@ static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[2048];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -84,6 +83,7 @@ static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *iptuninfo[IFLA_IPTUN_MAX + 1];
@@ -108,19 +108,20 @@ static int iptunnel_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u32 fwmark = 0;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -188,6 +189,7 @@ get_failed:
 		if (iptuninfo[IFLA_IPTUN_FWMARK])
 			fwmark = rta_getattr_u32(iptuninfo[IFLA_IPTUN_FWMARK]);
 
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/link_vti.c b/ip/link_vti.c
index 8bd4d90..07ac94e 100644
--- a/ip/link_vti.c
+++ b/ip/link_vti.c
@@ -53,7 +53,6 @@ static int vti_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[1024];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -61,6 +60,7 @@ static int vti_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *vtiinfo[IFLA_VTI_MAX + 1];
@@ -73,19 +73,20 @@ static int vti_parse_opt(struct link_util *lu, int argc, char **argv,
 	int len;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -115,6 +116,8 @@ get_failed:
 
 		if (vtiinfo[IFLA_VTI_FWMARK])
 			fwmark = rta_getattr_u32(vtiinfo[IFLA_VTI_FWMARK]);
+
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/link_vti6.c b/ip/link_vti6.c
index 8198d46..6d08bfe 100644
--- a/ip/link_vti6.c
+++ b/ip/link_vti6.c
@@ -48,7 +48,6 @@ static int vti6_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct {
 		struct nlmsghdr n;
 		struct ifinfomsg i;
-		char buf[1024];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
@@ -56,6 +55,7 @@ static int vti6_parse_opt(struct link_util *lu, int argc, char **argv,
 		.i.ifi_family = preferred_family,
 		.i.ifi_index = ifi->ifi_index,
 	};
+	struct nlmsghdr *answer = NULL;
 	struct rtattr *tb[IFLA_MAX + 1];
 	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
 	struct rtattr *vtiinfo[IFLA_VTI_MAX + 1];
@@ -68,19 +68,20 @@ static int vti6_parse_opt(struct link_util *lu, int argc, char **argv,
 	int len;
 
 	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
-		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
+		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 get_failed:
 			fprintf(stderr,
 				"Failed to get existing tunnel info.\n");
+			free(answer);
 			return -1;
 		}
 
-		len = req.n.nlmsg_len;
+		len = answer->nlmsg_len;
 		len -= NLMSG_LENGTH(sizeof(*ifi));
 		if (len < 0)
 			goto get_failed;
 
-		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(&req.i), len);
+		parse_rtattr(tb, IFLA_MAX, IFLA_RTA(NLMSG_DATA(answer)), len);
 
 		if (!tb[IFLA_LINKINFO])
 			goto get_failed;
@@ -110,6 +111,8 @@ get_failed:
 
 		if (vtiinfo[IFLA_VTI_FWMARK])
 			fwmark = rta_getattr_u32(vtiinfo[IFLA_VTI_FWMARK]);
+
+		free(answer);
 	}
 
 	while (argc > 0) {
diff --git a/ip/tcp_metrics.c b/ip/tcp_metrics.c
index 8972acd..3f9790e 100644
--- a/ip/tcp_metrics.c
+++ b/ip/tcp_metrics.c
@@ -306,6 +306,7 @@ static int process_msg(const struct sockaddr_nl *who, struct nlmsghdr *n,
 static int tcpm_do_cmd(int cmd, int argc, char **argv)
 {
 	TCPM_REQUEST(req, 1024, TCP_METRICS_CMD_GET, NLM_F_REQUEST);
+	struct nlmsghdr *answer;
 	int atype = -1, stype = -1;
 	int ack;
 
@@ -457,15 +458,16 @@ static int tcpm_do_cmd(int cmd, int argc, char **argv)
 	}
 
 	if (ack) {
-		if (rtnl_talk(&grth, &req.n, NULL, 0) < 0)
+		if (rtnl_talk(&grth, &req.n, NULL) < 0)
 			return -2;
 	} else if (atype >= 0) {
-		if (rtnl_talk(&grth, &req.n, &req.n, sizeof(req)) < 0)
+		if (rtnl_talk(&grth, &req.n, &answer) < 0)
 			return -2;
-		if (process_msg(NULL, &req.n, stdout) < 0) {
+		if (process_msg(NULL, answer, stdout) < 0) {
 			fprintf(stderr, "Dump terminated\n");
 			exit(1);
 		}
+		free(answer);
 	} else {
 		req.n.nlmsg_seq = grth.dump = ++grth.seq;
 		if (rtnl_send(&grth, &req, req.n.nlmsg_len) < 0) {
diff --git a/ip/xfrm_policy.c b/ip/xfrm_policy.c
index de689c4..98460a0 100644
--- a/ip/xfrm_policy.c
+++ b/ip/xfrm_policy.c
@@ -386,7 +386,7 @@ static int xfrm_policy_modify(int cmd, unsigned int flags, int argc, char **argv
 	if (req.xpinfo.sel.family == AF_UNSPEC)
 		req.xpinfo.sel.family = AF_INET;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
@@ -548,7 +548,7 @@ int xfrm_policy_print(const struct sockaddr_nl *who, struct nlmsghdr *n,
 }
 
 static int xfrm_policy_get_or_delete(int argc, char **argv, int delete,
-				     void *res_nlbuf, size_t res_size)
+				     struct nlmsghdr **answer)
 {
 	struct rtnl_handle rth;
 	struct {
@@ -659,7 +659,7 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, int delete,
 			  (void *)&ctx, ctx.sctx.len);
 	}
 
-	if (rtnl_talk(&rth, &req.n, res_nlbuf, res_size) < 0)
+	if (rtnl_talk(&rth, &req.n, answer) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
@@ -669,21 +669,21 @@ static int xfrm_policy_get_or_delete(int argc, char **argv, int delete,
 
 static int xfrm_policy_delete(int argc, char **argv)
 {
-	return xfrm_policy_get_or_delete(argc, argv, 1, NULL, 0);
+	return xfrm_policy_get_or_delete(argc, argv, 1, NULL);
 }
 
 static int xfrm_policy_get(int argc, char **argv)
 {
-	char buf[NLMSG_BUF_SIZE] = {};
-	struct nlmsghdr *n = (struct nlmsghdr *)buf;
+	struct nlmsghdr *n = NULL;
 
-	xfrm_policy_get_or_delete(argc, argv, 0, n, sizeof(buf));
+	xfrm_policy_get_or_delete(argc, argv, 0, &n);
 
 	if (xfrm_policy_print(NULL, n, (void *)stdout) < 0) {
 		fprintf(stderr, "An error :-)\n");
 		exit(1);
 	}
 
+	free(n);
 	return 0;
 }
 
@@ -1049,7 +1049,7 @@ static int xfrm_spd_setinfo(int argc, char **argv)
 	if (rtnl_open_byproto(&rth, 0, NETLINK_XFRM) < 0)
 		exit(1);
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
@@ -1063,22 +1063,23 @@ static int xfrm_spd_getinfo(int argc, char **argv)
 	struct {
 		struct nlmsghdr			n;
 		__u32				flags;
-		char				ans[128];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(__u32)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
 		.n.nlmsg_type = XFRM_MSG_GETSPDINFO,
 		.flags = 0XFFFFFFFF,
 	};
+	struct nlmsghdr *answer;
 
 	if (rtnl_open_byproto(&rth, 0, NETLINK_XFRM) < 0)
 		exit(1);
 
-	if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0)
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		exit(2);
 
-	print_spdinfo(&req.n, (void *)stdout);
+	print_spdinfo(answer, (void *)stdout);
 
+	free(answer);
 	rtnl_close(&rth);
 
 	return 0;
@@ -1123,7 +1124,7 @@ static int xfrm_policy_flush(int argc, char **argv)
 	if (show_stats > 1)
 		fprintf(stderr, "Flush policy\n");
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 4483fb8..a3d4fd7 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -726,7 +726,7 @@ static int xfrm_state_modify(int cmd, unsigned int flags, int argc, char **argv)
 	if (req.xsinfo.family == AF_UNSPEC)
 		req.xsinfo.family = AF_INET;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
@@ -757,8 +757,7 @@ static int xfrm_state_allocspi(int argc, char **argv)
 	char *minp = NULL;
 	char *maxp = NULL;
 	struct xfrm_mark mark = {0, 0};
-	char res_buf[NLMSG_BUF_SIZE] = {};
-	struct nlmsghdr *res_n = (struct nlmsghdr *)res_buf;
+	struct nlmsghdr *answer;
 
 	while (argc > 0) {
 		if (strcmp(*argv, "mode") == 0) {
@@ -858,14 +857,15 @@ static int xfrm_state_allocspi(int argc, char **argv)
 		req.xspi.info.family = AF_INET;
 
 
-	if (rtnl_talk(&rth, &req.n, res_n, sizeof(res_buf)) < 0)
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		exit(2);
 
-	if (xfrm_state_print(NULL, res_n, (void *)stdout) < 0) {
+	if (xfrm_state_print(NULL, answer, (void *)stdout) < 0) {
 		fprintf(stderr, "An error :-)\n");
 		exit(1);
 	}
 
+	free(answer);
 	rtnl_close(&rth);
 
 	return 0;
@@ -1046,19 +1046,20 @@ static int xfrm_state_get_or_delete(int argc, char **argv, int delete)
 		req.xsid.family = AF_INET;
 
 	if (delete) {
-		if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+		if (rtnl_talk(&rth, &req.n, NULL) < 0)
 			exit(2);
 	} else {
-		char buf[NLMSG_BUF_SIZE] = {};
-		struct nlmsghdr *res_n = (struct nlmsghdr *)buf;
+		struct nlmsghdr *answer;
 
-		if (rtnl_talk(&rth, &req.n, res_n, sizeof(req)) < 0)
+		if (rtnl_talk(&rth, &req.n, &answer) < 0)
 			exit(2);
 
-		if (xfrm_state_print(NULL, res_n, (void *)stdout) < 0) {
+		if (xfrm_state_print(NULL, answer, (void *)stdout) < 0) {
 			fprintf(stderr, "An error :-)\n");
 			exit(1);
 		}
+
+		free(answer);
 	}
 
 	rtnl_close(&rth);
@@ -1314,22 +1315,23 @@ static int xfrm_sad_getinfo(int argc, char **argv)
 	struct {
 		struct nlmsghdr			n;
 		__u32				flags;
-		char				ans[64];
 	} req = {
 		.n.nlmsg_len = NLMSG_LENGTH(sizeof(req.flags)),
 		.n.nlmsg_flags = NLM_F_REQUEST,
 		.n.nlmsg_type = XFRM_MSG_GETSADINFO,
 		.flags = 0XFFFFFFFF,
 	};
+	struct nlmsghdr *answer;
 
 	if (rtnl_open_byproto(&rth, 0, NETLINK_XFRM) < 0)
 		exit(1);
 
-	if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0)
+	if (rtnl_talk(&rth, &req.n, &answer) < 0)
 		exit(2);
 
-	print_sadinfo(&req.n, (void *)stdout);
+	print_sadinfo(answer, (void *)stdout);
 
+	free(answer);
 	rtnl_close(&rth);
 
 	return 0;
@@ -1376,7 +1378,7 @@ static int xfrm_state_flush(int argc, char **argv)
 		fprintf(stderr, "Flush state with XFRM-PROTO value \"%s\"\n",
 			strxf_xfrmproto(req.xsf.proto));
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		exit(2);
 
 	rtnl_close(&rth);
diff --git a/lib/libgenl.c b/lib/libgenl.c
index 50d2d92..bb5fbb5 100644
--- a/lib/libgenl.c
+++ b/lib/libgenl.c
@@ -49,16 +49,21 @@ int genl_resolve_family(struct rtnl_handle *grth, const char *family)
 {
 	GENL_REQUEST(req, 1024, GENL_ID_CTRL, 0, 0, CTRL_CMD_GETFAMILY,
 		     NLM_F_REQUEST);
+	struct nlmsghdr *answer;
+	int fnum;
 
 	addattr_l(&req.n, sizeof(req), CTRL_ATTR_FAMILY_NAME,
 		  family, strlen(family) + 1);
 
-	if (rtnl_talk(grth, &req.n, &req.n, sizeof(req)) < 0) {
+	if (rtnl_talk(grth, &req.n, &answer) < 0) {
 		fprintf(stderr, "Error talking to the kernel\n");
 		return -2;
 	}
 
-	return genl_parse_getfamily(&req.n);
+	fnum = genl_parse_getfamily(answer);
+	free(answer);
+
+	return fnum;
 }
 
 int genl_init_handle(struct rtnl_handle *grth, const char *family,
diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index 1847c0b..4618dc0 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -577,7 +577,7 @@ static void rtnl_talk_error(struct nlmsghdr *h, struct nlmsgerr *err,
 }
 
 static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-		       struct nlmsghdr *answer, size_t maxlen,
+		       struct nlmsghdr **answer,
 		       bool show_rtnl_err, nl_ext_ack_fn_t errfn)
 {
 	int status;
@@ -651,9 +651,9 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 					fprintf(stderr, "ERROR truncated\n");
 				} else if (!err->error) {
 					if (answer)
-						memcpy(answer, h,
-						       MIN(maxlen, h->nlmsg_len));
-					free(buf);
+						*answer = (struct nlmsghdr *)buf;
+					else
+						free(buf);
 					return 0;
 				}
 
@@ -667,9 +667,7 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 			}
 
 			if (answer) {
-				memcpy(answer, h,
-				       MIN(maxlen, h->nlmsg_len));
-				free(buf);
+				*answer = (struct nlmsghdr *)buf;
 				return 0;
 			}
 
@@ -693,22 +691,22 @@ static int __rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
 }
 
 int rtnl_talk(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-	      struct nlmsghdr *answer, size_t maxlen)
+	      struct nlmsghdr **answer)
 {
-	return __rtnl_talk(rtnl, n, answer, maxlen, true, NULL);
+	return __rtnl_talk(rtnl, n, answer, true, NULL);
 }
 
 int rtnl_talk_extack(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-		     struct nlmsghdr *answer, size_t maxlen,
+		     struct nlmsghdr **answer,
 		     nl_ext_ack_fn_t errfn)
 {
-	return __rtnl_talk(rtnl, n, answer, maxlen, true, errfn);
+	return __rtnl_talk(rtnl, n, answer, true, errfn);
 }
 
 int rtnl_talk_suppress_rtnl_errmsg(struct rtnl_handle *rtnl, struct nlmsghdr *n,
-				   struct nlmsghdr *answer, size_t maxlen)
+				   struct nlmsghdr **answer)
 {
-	return __rtnl_talk(rtnl, n, answer, maxlen, false, NULL);
+	return __rtnl_talk(rtnl, n, answer, false, NULL);
 }
 
 int rtnl_listen_all_nsid(struct rtnl_handle *rth)
diff --git a/misc/ss.c b/misc/ss.c
index dd8dfaa..bf21d80 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2597,7 +2597,7 @@ static int kill_inet_sock(struct nlmsghdr *h, void *arg, struct sockstat *s)
 		raw->sdiag_raw_protocol = s->raw_prot;
 	}
 
-	return rtnl_talk(rth, &req.nlh, NULL, 0);
+	return rtnl_talk(rth, &req.nlh, NULL);
 }
 
 static int show_one_inet_sock(const struct sockaddr_nl *addr,
diff --git a/tc/m_action.c b/tc/m_action.c
index 402228b..8cbf764 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -518,18 +518,18 @@ static int tc_action_gd(int cmd, unsigned int flags, int *argc_p, char ***argv_p
 	tail->rta_len = (void *) NLMSG_TAIL(&req.n) - (void *) tail;
 
 	req.n.nlmsg_seq = rth.dump = ++rth.seq;
-	if (cmd == RTM_GETACTION)
-		ans = &req.n;
 
-	if (rtnl_talk(&rth, &req.n, ans, MAX_MSG) < 0) {
+	if (rtnl_talk(&rth, &req.n, &ans) < 0) {
 		fprintf(stderr, "We have an error talking to the kernel\n");
 		return 1;
 	}
 
-	if (ans && print_action(NULL, &req.n, (void *)stdout) < 0) {
+	if (cmd == RTM_GETACTION && print_action(NULL, ans, stdout) < 0) {
 		fprintf(stderr, "Dump terminated\n");
+		free(ans);
 		return 1;
 	}
+	free(ans);
 
 	*argc_p = argc;
 	*argv_p = argv;
@@ -562,7 +562,7 @@ static int tc_action_modify(int cmd, unsigned int flags, int *argc_p, char ***ar
 	}
 	tail->rta_len = (void *) NLMSG_TAIL(&req.n) - (void *) tail;
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0) {
+	if (rtnl_talk(&rth, &req.n, NULL) < 0) {
 		fprintf(stderr, "We have an error talking to the kernel\n");
 		ret = -1;
 	}
@@ -653,7 +653,7 @@ static int tc_act_list_or_flush(int *argc_p, char ***argv_p, int event)
 		req.n.nlmsg_type = RTM_DELACTION;
 		req.n.nlmsg_flags |= NLM_F_ROOT;
 		req.n.nlmsg_flags |= NLM_F_REQUEST;
-		if (rtnl_talk(&rth, &req.n, NULL, 0) < 0) {
+		if (rtnl_talk(&rth, &req.n, NULL) < 0) {
 			fprintf(stderr, "We have an error flushing\n");
 			return 1;
 		}
diff --git a/tc/tc_class.c b/tc/tc_class.c
index 1a1f1fa..0214775 100644
--- a/tc/tc_class.c
+++ b/tc/tc_class.c
@@ -149,7 +149,7 @@ static int tc_class_modify(int cmd, unsigned int flags, int argc, char **argv)
 		}
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return 2;
 
 	return 0;
diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index cf290ae..8dbebf1 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -194,7 +194,7 @@ static int tc_filter_modify(int cmd, unsigned int flags, int argc, char **argv)
 		}
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0) {
+	if (rtnl_talk(&rth, &req.n, NULL) < 0) {
 		fprintf(stderr, "We have an error talking to the kernel\n");
 		return 2;
 	}
@@ -331,6 +331,7 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 		.t.tcm_parent = TC_H_UNSPEC,
 		.t.tcm_family = AF_UNSPEC,
 	};
+	struct nlmsghdr *answer;
 	struct filter_util *q = NULL;
 	__u32 prio = 0;
 	__u32 protocol = 0;
@@ -484,13 +485,14 @@ static int tc_filter_get(int cmd, unsigned int flags, int argc, char **argv)
 		return -1;
 	}
 
-	if (rtnl_talk(&rth, &req.n, &req.n, MAX_MSG) < 0) {
+	if (rtnl_talk(&rth, &req.n, &answer) < 0) {
 		fprintf(stderr, "We have an error talking to the kernel\n");
 		return 2;
 	}
 
-	print_filter(NULL, &req.n, (void *)stdout);
+	print_filter(NULL, answer, (void *)stdout);
 
+	free(answer);
 	return 0;
 }
 
diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index 1e9d909..c52114a 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -190,7 +190,7 @@ static int tc_qdisc_modify(int cmd, unsigned int flags, int argc, char **argv)
 		req.t.tcm_ifindex = idx;
 	}
 
-	if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+	if (rtnl_talk(&rth, &req.n, NULL) < 0)
 		return 2;
 
 	return 0;
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
  2017-09-28 13:33 ` [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough Hangbin Liu
@ 2017-09-29 12:55   ` Michal Kubecek
  2017-09-29 17:54   ` Stephen Hemminger
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Kubecek @ 2017-09-29 12:55 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Stephen Hemminger, Phil Sutter, Hangbin Liu

On Thu, Sep 28, 2017 at 09:33:45PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
> we doubled the buffer size to support more VFs. But the VFs number is
> increasing all the time. Some customers even use more than 200 VFs now.
> 
> We could not double it everytime when the buffer is not enough. Let's just
> not hard code the buffer size and malloc the correct number when running.
> 
> Introduce function rtnl_recvmsg() to always return a newly allocated buffer.
> The caller need to free it after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---
>  lib/libnetlink.c | 114 ++++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 80 insertions(+), 34 deletions(-)
> 

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
@ 2017-09-29 13:06   ` Michal Kubecek
  2017-10-02 17:37   ` Stephen Hemminger
  2017-10-25 10:45   ` Stephen Hemminger
  2 siblings, 0 replies; 19+ messages in thread
From: Michal Kubecek @ 2017-09-29 13:06 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Stephen Hemminger, Phil Sutter, Hangbin Liu

On Thu, Sep 28, 2017 at 09:33:46PM +0800, Hangbin Liu wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> iplink_get()"). After update, we will not need to double the buffer size
> every time when VFs number increased.
> 
> With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> length parameter.
> 
> With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> answer to avoid overwrite data in nlh, because it may has more info after
> nlh. also this will avoid nlh buffer not enough issue.
> 
> We need to free answer after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---

Reviewed-by: Michal Kubecek <mkubecek@suse.cz>

> diff --git a/ip/link_gre.c b/ip/link_gre.c
> index 9ea2970..35782ca 100644
> --- a/ip/link_gre.c
> +++ b/ip/link_gre.c
> @@ -68,7 +68,6 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  	struct {
>  		struct nlmsghdr n;
>  		struct ifinfomsg i;
> -		char buf[16384];
>  	} req = {
>  		.n.nlmsg_len = NLMSG_LENGTH(sizeof(*ifi)),
>  		.n.nlmsg_flags = NLM_F_REQUEST,
> @@ -76,6 +75,7 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  		.i.ifi_family = preferred_family,
>  		.i.ifi_index = ifi->ifi_index,
>  	};
> +	struct nlmsghdr *answer = NULL;
>  	struct rtattr *tb[IFLA_MAX + 1];
>  	struct rtattr *linkinfo[IFLA_INFO_MAX+1];
>  	struct rtattr *greinfo[IFLA_GRE_MAX + 1];
> @@ -100,19 +100,20 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
>  	__u32 erspan_idx = 0;
>  
>  	if (!(n->nlmsg_flags & NLM_F_CREATE)) {
> -		if (rtnl_talk(&rth, &req.n, &req.n, sizeof(req)) < 0) {
> +		if (rtnl_talk(&rth, &req.n, &answer) < 0) {
>  get_failed:
>  			fprintf(stderr,
>  				"Failed to get existing tunnel info.\n");
> +			free(answer);
>  			return -1;
>  		}

Took me a moment to realize answer is still NULL if we get here via
failed rtnl_talk() but non-NULL if we get here via "goto get_failed"
later. Nice trick. :-)

Michal Kubecek

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
  2017-09-28 13:33 ` [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough Hangbin Liu
  2017-09-29 12:55   ` Michal Kubecek
@ 2017-09-29 17:54   ` Stephen Hemminger
  2017-09-29 18:20     ` Michal Kubecek
  2017-09-30 13:54     ` Hangbin Liu
  1 sibling, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2017-09-29 17:54 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Michal Kubecek, Phil Sutter, Hangbin Liu

On Thu, 28 Sep 2017 21:33:45 +0800
Hangbin Liu <haliu@redhat.com> wrote:

>  
> +static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
> +{
> +	int len;
> +
> +	do {
> +		len = recvmsg(fd, msg, flags);
> +	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
> +
> +	if (len < 0) {
> +		fprintf(stderr, "netlink receive error %s (%d)\n",
> +			strerror(errno), errno);
> +		return -errno;
> +	}
> +
> +	if (len == 0) {
> +		fprintf(stderr, "EOF on netlink\n");
> +		return -ENODATA;
> +	}
> +
> +	return len;
> +}
> +
> +static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> +{
> +	struct iovec *iov = msg->msg_iov;
> +	char *buf;
> +	int len;
> +
> +	iov->iov_base = NULL;
> +	iov->iov_len = 0;
> +
> +	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
> +	if (len < 0)
> +		return len;
> +
> +	buf = malloc(len);
> +	if (!buf) {
> +		fprintf(stderr, "malloc error: not enough buffer\n");
> +		return -ENOMEM;
> +	}
> +
> +	iov->iov_base = buf;
> +	iov->iov_len = len;
> +
> +	len = __rtnl_recvmsg(fd, msg, 0);
> +	if (len < 0) {
> +		free(buf);
> +		return len;
> +	}
> +
> +	if (answer)
> +		*answer = buf;
> +	else
> +		free(buf);
> +
> +	return len;
> +}

Doubling the number of system calls per message is not going to make
users with 5,000,000 routes or 1000 vlans, or 10,000 tunnels happy.
Please rethink this.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
  2017-09-29 17:54   ` Stephen Hemminger
@ 2017-09-29 18:20     ` Michal Kubecek
  2017-09-30 13:54     ` Hangbin Liu
  1 sibling, 0 replies; 19+ messages in thread
From: Michal Kubecek @ 2017-09-29 18:20 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Hangbin Liu, netdev, Phil Sutter, Hangbin Liu

On Fri, Sep 29, 2017 at 10:54:40AM -0700, Stephen Hemminger wrote:
> On Thu, 28 Sep 2017 21:33:45 +0800
> Hangbin Liu <haliu@redhat.com> wrote:
> 
> >  
> > +static int __rtnl_recvmsg(int fd, struct msghdr *msg, int flags)
> > +{
> > +	int len;
> > +
> > +	do {
> > +		len = recvmsg(fd, msg, flags);
> > +	} while (len < 0 && (errno == EINTR || errno == EAGAIN));
> > +
> > +	if (len < 0) {
> > +		fprintf(stderr, "netlink receive error %s (%d)\n",
> > +			strerror(errno), errno);
> > +		return -errno;
> > +	}
> > +
> > +	if (len == 0) {
> > +		fprintf(stderr, "EOF on netlink\n");
> > +		return -ENODATA;
> > +	}
> > +
> > +	return len;
> > +}
> > +
> > +static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
> > +{
> > +	struct iovec *iov = msg->msg_iov;
> > +	char *buf;
> > +	int len;
> > +
> > +	iov->iov_base = NULL;
> > +	iov->iov_len = 0;
> > +
> > +	len = __rtnl_recvmsg(fd, msg, MSG_PEEK | MSG_TRUNC);
> > +	if (len < 0)
> > +		return len;
> > +
> > +	buf = malloc(len);
> > +	if (!buf) {
> > +		fprintf(stderr, "malloc error: not enough buffer\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	iov->iov_base = buf;
> > +	iov->iov_len = len;
> > +
> > +	len = __rtnl_recvmsg(fd, msg, 0);
> > +	if (len < 0) {
> > +		free(buf);
> > +		return len;
> > +	}
> > +
> > +	if (answer)
> > +		*answer = buf;
> > +	else
> > +		free(buf);
> > +
> > +	return len;
> > +}
> 
> Doubling the number of system calls per message is not going to make
> users with 5,000,000 routes or 1000 vlans, or 10,000 tunnels happy.
> Please rethink this.

I'm not sure it's possible to avoid this if we want to be able to get
rid of a preset message length limit. If you call recvmsg() without
MSG_PEEK and your buffer isn't sufficiently large, the message is lost.
And once you use MSG_PEEK, you need another syscall to remove the
message from the queue even if you read all data. In other words, to be
sure you don't lose the reply, you have to do two syscalls.

One alternative I can see would be calling recvmsg() without MSG_PEEK
(but with reasonably large buffer) and repeating the request if the
buffer is not large enough (and caller is actually interested in the
answer). But I don't think this is desirable either as that would result
in even worse overhead.

Michal Kubecek

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough
  2017-09-29 17:54   ` Stephen Hemminger
  2017-09-29 18:20     ` Michal Kubecek
@ 2017-09-30 13:54     ` Hangbin Liu
  1 sibling, 0 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-09-30 13:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Michal Kubecek, Phil Sutter, Hangbin Liu

Hi Stephen,
On Fri, Sep 29, 2017 at 10:54:40AM -0700, Stephen Hemminger wrote:
> 
> Doubling the number of system calls per message is not going to make
> users with 5,000,000 routes or 1000 vlans, or 10,000 tunnels happy.
> Please rethink this.

I tried to add 2500 vlans and 70,000 routes. Then show the result. The
time looks reasonable.

# ip link show | wc -l
5024

# time ip link show > /dev/null

real    0m0.218s
user    0m0.007s
sys     0m0.210s

# time iproute2/ip/ip link show > /dev/null

real    0m0.221s
user    0m0.008s
sys     0m0.212s

# time ip addr show > /dev/null

real    0m0.299s
user    0m0.094s
sys     0m0.205s

# time iproute2/ip/ip addr show > /dev/null

real    0m0.302s
user    0m0.099s
sys     0m0.202s

# ip -6 route show | wc -l
704458

# time ip -6 route show > /dev/null

real    0m5.400s
user    0m0.947s
sys     0m4.453s

# time iproute2/ip/ip -6 route show > /dev/null

real    0m5.404s
user    0m1.070s
sys     0m4.333s


Thanks
Hangbin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
  2017-09-29 13:06   ` Michal Kubecek
@ 2017-10-02 17:37   ` Stephen Hemminger
  2017-10-09 20:25     ` Phil Sutter
  2017-10-25 10:45   ` Stephen Hemminger
  2 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2017-10-02 17:37 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Michal Kubecek, Phil Sutter, Hangbin Liu

On Thu, 28 Sep 2017 21:33:46 +0800
Hangbin Liu <haliu@redhat.com> wrote:

> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> iplink_get()"). After update, we will not need to double the buffer size
> every time when VFs number increased.
> 
> With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> length parameter.
> 
> With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> answer to avoid overwrite data in nlh, because it may has more info after
> nlh. also this will avoid nlh buffer not enough issue.
> 
> We need to free answer after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>
> ---

Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
Can only those places that need that be targeted?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-02 17:37   ` Stephen Hemminger
@ 2017-10-09 20:25     ` Phil Sutter
  2017-10-10  6:41       ` Michal Kubecek
  0 siblings, 1 reply; 19+ messages in thread
From: Phil Sutter @ 2017-10-09 20:25 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Hangbin Liu, netdev, Michal Kubecek, Hangbin Liu

Hi Stephen,

On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:
> On Thu, 28 Sep 2017 21:33:46 +0800
> Hangbin Liu <haliu@redhat.com> wrote:
> 
> > From: Hangbin Liu <liuhangbin@gmail.com>
> > 
> > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > iplink_get()"). After update, we will not need to double the buffer size
> > every time when VFs number increased.
> > 
> > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > length parameter.
> > 
> > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > answer to avoid overwrite data in nlh, because it may has more info after
> > nlh. also this will avoid nlh buffer not enough issue.
> > 
> > We need to free answer after using.
> > 
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > ---
> 
> Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> Can only those places that need that be targeted?

We could probably do that, by having a buffer on stack in __rtnl_talk()
which will be used instead of the allocated one if 'answer' is NULL. Or
maybe even introduce a dedicated API call for the dynamically allocated
receive buffer. But I really doubt that's feasible: AFAICT, that stack
buffer still needs to be reasonably sized since the reply might be
larger than the request (reusing the request buffer would be the most
simple way to tackle this), also there is support for extack which may
bloat the response to arbitrary size. Hangbin has shown in his benchmark
that the overhead of the second syscall is negligible, so why care about
that and increase code complexity even further?

Not saying it's not possible, but I just doubt it's worth the effort.

Cheers, Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-09 20:25     ` Phil Sutter
@ 2017-10-10  6:41       ` Michal Kubecek
  2017-10-10 16:47         ` Stephen Hemminger
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Kubecek @ 2017-10-10  6:41 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, Hangbin Liu, netdev, Hangbin Liu

On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> Hi Stephen,
> 
> On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:
> > On Thu, 28 Sep 2017 21:33:46 +0800
> > Hangbin Liu <haliu@redhat.com> wrote:
> > 
> > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > 
> > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > iplink_get()"). After update, we will not need to double the buffer size
> > > every time when VFs number increased.
> > > 
> > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > length parameter.
> > > 
> > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > answer to avoid overwrite data in nlh, because it may has more info after
> > > nlh. also this will avoid nlh buffer not enough issue.
> > > 
> > > We need to free answer after using.
> > > 
> > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > ---
> > 
> > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > Can only those places that need that be targeted?
> 
> We could probably do that, by having a buffer on stack in __rtnl_talk()
> which will be used instead of the allocated one if 'answer' is NULL. Or
> maybe even introduce a dedicated API call for the dynamically allocated
> receive buffer. But I really doubt that's feasible: AFAICT, that stack
> buffer still needs to be reasonably sized since the reply might be
> larger than the request (reusing the request buffer would be the most
> simple way to tackle this), also there is support for extack which may
> bloat the response to arbitrary size. Hangbin has shown in his benchmark
> that the overhead of the second syscall is negligible, so why care about
> that and increase code complexity even further?
> 
> Not saying it's not possible, but I just doubt it's worth the effort.

Agreed. Current code is based on the assumption that we can estimate the
maximum reply length in advance and the reason for this series is that
this assumption turned out to be wrong. I'm afraid that if we replace
it by an assumption that we can estimate the maximum reply length for
most requests with only few exceptions, it's only matter of time for us
to be proven wrong again.

Michal Kubecek

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-10  6:41       ` Michal Kubecek
@ 2017-10-10 16:47         ` Stephen Hemminger
  2017-10-11 10:40           ` Hangbin Liu
  2017-10-11 11:10           ` Phil Sutter
  0 siblings, 2 replies; 19+ messages in thread
From: Stephen Hemminger @ 2017-10-10 16:47 UTC (permalink / raw)
  To: Michal Kubecek; +Cc: Phil Sutter, Hangbin Liu, netdev, Hangbin Liu

On Tue, 10 Oct 2017 08:41:17 +0200
Michal Kubecek <mkubecek@suse.cz> wrote:

> On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > Hi Stephen,
> > 
> > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:  
> > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > Hangbin Liu <haliu@redhat.com> wrote:
> > >   
> > > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > > 
> > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > every time when VFs number increased.
> > > > 
> > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > length parameter.
> > > > 
> > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > 
> > > > We need to free answer after using.
> > > > 
> > > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > > ---  
> > > 
> > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > Can only those places that need that be targeted?  
> > 
> > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > which will be used instead of the allocated one if 'answer' is NULL. Or
> > maybe even introduce a dedicated API call for the dynamically allocated
> > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > buffer still needs to be reasonably sized since the reply might be
> > larger than the request (reusing the request buffer would be the most
> > simple way to tackle this), also there is support for extack which may
> > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > that the overhead of the second syscall is negligible, so why care about
> > that and increase code complexity even further?
> > 
> > Not saying it's not possible, but I just doubt it's worth the effort.  
> 
> Agreed. Current code is based on the assumption that we can estimate the
> maximum reply length in advance and the reason for this series is that
> this assumption turned out to be wrong. I'm afraid that if we replace
> it by an assumption that we can estimate the maximum reply length for
> most requests with only few exceptions, it's only matter of time for us
> to be proven wrong again.
> 
> Michal Kubecek
> 

For query responses, yes the response may be large. But for the common cases of
add address or add route, the response should just be ack or error.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-10 16:47         ` Stephen Hemminger
@ 2017-10-11 10:40           ` Hangbin Liu
  2017-10-11 11:10           ` Phil Sutter
  1 sibling, 0 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-10-11 10:40 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Michal Kubecek, Phil Sutter, netdev, Hangbin Liu

Hi Stephen,
On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> > Agreed. Current code is based on the assumption that we can estimate the
> > maximum reply length in advance and the reason for this series is that
> > this assumption turned out to be wrong. I'm afraid that if we replace
> > it by an assumption that we can estimate the maximum reply length for
> > most requests with only few exceptions, it's only matter of time for us
> > to be proven wrong again.
> > 
> > Michal Kubecek
> > 
> 
> For query responses, yes the response may be large. But for the common cases of
> add address or add route, the response should just be ack or error.

I tried to list 10 NIC links with ip cmd.

With unpatched ip cmd:
# time for i in `seq 100000`; do ip link show &> /dev/null; done

real    5m14.591s
user    0m58.134s
sys     4m21.104s


With patched ip cmd:
# time for i in `seq 100000`; do ./ip link show &> /dev/null; done

real    4m48.579s
user    0m8.570s
sys     4m43.460s


Then tested add 99,00 address via script
# cat add_addr.sh
#!/bin/bash
dev=$1
for vid in $(seq 99); do
        ip link add link $dev name ${dev}.$vid type vlan id $vid
        ip link set ${dev}.$vid up
        for n in $(seq 100); do
                ip addr add 20$vid::$n dev ${dev}.$vid
        done
done

with unpatched ip cmd:
# time ./add_addr.sh p7p1

real    0m13.456s
user    0m2.551s
sys     0m11.106s


With patched ip cmd:
# time ./add_addr.sh p7p1

real    0m13.700s
user    0m2.827s
sys     0m11.148s


The result don't have much difference and looks good. And I wonder if adding
thousands of address is a common case.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-10 16:47         ` Stephen Hemminger
  2017-10-11 10:40           ` Hangbin Liu
@ 2017-10-11 11:10           ` Phil Sutter
  2017-10-11 14:35             ` David Ahern
  2017-10-12 16:07             ` Stephen Hemminger
  1 sibling, 2 replies; 19+ messages in thread
From: Phil Sutter @ 2017-10-11 11:10 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Michal Kubecek, Hangbin Liu, netdev, Hangbin Liu

On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> On Tue, 10 Oct 2017 08:41:17 +0200
> Michal Kubecek <mkubecek@suse.cz> wrote:
> 
> > On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > > Hi Stephen,
> > > 
> > > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:  
> > > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > > Hangbin Liu <haliu@redhat.com> wrote:
> > > >   
> > > > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > > > 
> > > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > > every time when VFs number increased.
> > > > > 
> > > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > > length parameter.
> > > > > 
> > > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > > 
> > > > > We need to free answer after using.
> > > > > 
> > > > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > > > ---  
> > > > 
> > > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > > Can only those places that need that be targeted?  
> > > 
> > > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > > which will be used instead of the allocated one if 'answer' is NULL. Or
> > > maybe even introduce a dedicated API call for the dynamically allocated
> > > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > > buffer still needs to be reasonably sized since the reply might be
> > > larger than the request (reusing the request buffer would be the most
> > > simple way to tackle this), also there is support for extack which may
> > > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > > that the overhead of the second syscall is negligible, so why care about
> > > that and increase code complexity even further?
> > > 
> > > Not saying it's not possible, but I just doubt it's worth the effort.  
> > 
> > Agreed. Current code is based on the assumption that we can estimate the
> > maximum reply length in advance and the reason for this series is that
> > this assumption turned out to be wrong. I'm afraid that if we replace
> > it by an assumption that we can estimate the maximum reply length for
> > most requests with only few exceptions, it's only matter of time for us
> > to be proven wrong again.
> > 
> > Michal Kubecek
> > 
> 
> For query responses, yes the response may be large. But for the common cases of
> add address or add route, the response should just be ack or error.

And with extack, error is comprised of the original request plus an
arbitrarily sized error message, so we can't just reuse the request
buffer and are back to "guessing" the right length again.

To get an idea of what we're talking about, I wrote a simple benchmark
which adds 256 * 254 (= 65024) addresses to an interface, then removes
them again one by one and measured the time that takes for binaries with
and without Hangbin's patches:

OP	Vanilla		Hangbin		Delta
--------------------------------------------------------
add	real 2m16.244s	real 2m27.964s	+11.72s	(108.6%)
	user 0m15.241s	user 0m17.295s	+2.054s	(113.5%)
	sys  1m40.229s	sys  1m48.239s	+8.01s	(108.0%)

remove	real 1m44.950s	real 1m47.044s	+2.094s	(102.0%)
	user 0m13.899s	user 0m14.723s	+0.824s (105.9%)
	sys  1m30.798s	sys  1m31.938s	+1.140s (101.3%)

So the overhead of the second syscall and dynamic memory allocation is
less than 10% overall. Given the short time a single call to 'ip'
typically takes, I don't think the difference is noticeable even in
highly performance critical applications.

Cheers, Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-11 11:10           ` Phil Sutter
@ 2017-10-11 14:35             ` David Ahern
  2017-10-12 16:07             ` Stephen Hemminger
  1 sibling, 0 replies; 19+ messages in thread
From: David Ahern @ 2017-10-11 14:35 UTC (permalink / raw)
  To: Phil Sutter, Stephen Hemminger, Michal Kubecek, Hangbin Liu,
	netdev, Hangbin Liu

On 10/11/17 5:10 AM, Phil Sutter wrote:
> less than 10% overall. Given the short time a single call to 'ip'
> typically takes, I don't think the difference is noticeable even in
> highly performance critical applications.

high performance critical applications should not be forking ip and
definitely not for 1 command at a time.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-11 11:10           ` Phil Sutter
  2017-10-11 14:35             ` David Ahern
@ 2017-10-12 16:07             ` Stephen Hemminger
  2017-10-13 10:31               ` Phil Sutter
  1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2017-10-12 16:07 UTC (permalink / raw)
  To: Phil Sutter; +Cc: Michal Kubecek, Hangbin Liu, netdev, Hangbin Liu

On Wed, 11 Oct 2017 13:10:07 +0200
Phil Sutter <phil@nwl.cc> wrote:

> On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> > On Tue, 10 Oct 2017 08:41:17 +0200
> > Michal Kubecek <mkubecek@suse.cz> wrote:
> > 
> > > On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > > > Hi Stephen,
> > > > 
> > > > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:  
> > > > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > > > Hangbin Liu <haliu@redhat.com> wrote:
> > > > >   
> > > > > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > > > > 
> > > > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > > > every time when VFs number increased.
> > > > > > 
> > > > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > > > length parameter.
> > > > > > 
> > > > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > > > 
> > > > > > We need to free answer after using.
> > > > > > 
> > > > > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > > > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > > > > ---  
> > > > > 
> > > > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > > > Can only those places that need that be targeted?  
> > > > 
> > > > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > > > which will be used instead of the allocated one if 'answer' is NULL. Or
> > > > maybe even introduce a dedicated API call for the dynamically allocated
> > > > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > > > buffer still needs to be reasonably sized since the reply might be
> > > > larger than the request (reusing the request buffer would be the most
> > > > simple way to tackle this), also there is support for extack which may
> > > > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > > > that the overhead of the second syscall is negligible, so why care about
> > > > that and increase code complexity even further?
> > > > 
> > > > Not saying it's not possible, but I just doubt it's worth the effort.  
> > > 
> > > Agreed. Current code is based on the assumption that we can estimate the
> > > maximum reply length in advance and the reason for this series is that
> > > this assumption turned out to be wrong. I'm afraid that if we replace
> > > it by an assumption that we can estimate the maximum reply length for
> > > most requests with only few exceptions, it's only matter of time for us
> > > to be proven wrong again.
> > > 
> > > Michal Kubecek
> > > 
> > 
> > For query responses, yes the response may be large. But for the common cases of
> > add address or add route, the response should just be ack or error.
> 
> And with extack, error is comprised of the original request plus an
> arbitrarily sized error message, so we can't just reuse the request
> buffer and are back to "guessing" the right length again.
> 
> To get an idea of what we're talking about, I wrote a simple benchmark
> which adds 256 * 254 (= 65024) addresses to an interface, then removes
> them again one by one and measured the time that takes for binaries with
> and without Hangbin's patches:
> 
> OP	Vanilla		Hangbin		Delta
> --------------------------------------------------------
> add	real 2m16.244s	real 2m27.964s	+11.72s	(108.6%)
> 	user 0m15.241s	user 0m17.295s	+2.054s	(113.5%)
> 	sys  1m40.229s	sys  1m48.239s	+8.01s	(108.0%)
> 
> remove	real 1m44.950s	real 1m47.044s	+2.094s	(102.0%)
> 	user 0m13.899s	user 0m14.723s	+0.824s (105.9%)
> 	sys  1m30.798s	sys  1m31.938s	+1.140s (101.3%)
> 
> So the overhead of the second syscall and dynamic memory allocation is
> less than 10% overall. Given the short time a single call to 'ip'
> typically takes, I don't think the difference is noticeable even in
> highly performance critical applications.
> 
> Cheers, Phil

For a better benchmark, I generated 4 Million routes
then did: 
	# ip ---batch routes.txt


OP	Vanilla		Hangbin		Delta
-----------------------------------------------------
add	real 1:25.840	1:33.677	+9.13%
	user   10.690	   6.078	-56.85%
	sys  1:00.920	1:13.109	+20.00%	

remove	real 2:29.881	2:25.872	-2.67%
	user   12.862	   7.942	-38.25%
	sys    44.127	  44.633	+1.15%


So the answer is addition is slower but deletion appears faster?
If I rerun the Vanilla test, get about the same times.

The slowdown won't impact me, but what about large scale users
like Cumulus.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-12 16:07             ` Stephen Hemminger
@ 2017-10-13 10:31               ` Phil Sutter
  0 siblings, 0 replies; 19+ messages in thread
From: Phil Sutter @ 2017-10-13 10:31 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Michal Kubecek, Hangbin Liu, netdev, Hangbin Liu

On Thu, Oct 12, 2017 at 09:07:06AM -0700, Stephen Hemminger wrote:
> On Wed, 11 Oct 2017 13:10:07 +0200
> Phil Sutter <phil@nwl.cc> wrote:
> 
> > On Tue, Oct 10, 2017 at 09:47:43AM -0700, Stephen Hemminger wrote:
> > > On Tue, 10 Oct 2017 08:41:17 +0200
> > > Michal Kubecek <mkubecek@suse.cz> wrote:
> > > 
> > > > On Mon, Oct 09, 2017 at 10:25:25PM +0200, Phil Sutter wrote:
> > > > > Hi Stephen,
> > > > > 
> > > > > On Mon, Oct 02, 2017 at 10:37:08AM -0700, Stephen Hemminger wrote:  
> > > > > > On Thu, 28 Sep 2017 21:33:46 +0800
> > > > > > Hangbin Liu <haliu@redhat.com> wrote:
> > > > > >   
> > > > > > > From: Hangbin Liu <liuhangbin@gmail.com>
> > > > > > > 
> > > > > > > This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> > > > > > > iplink_get()"). After update, we will not need to double the buffer size
> > > > > > > every time when VFs number increased.
> > > > > > > 
> > > > > > > With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> > > > > > > length parameter.
> > > > > > > 
> > > > > > > With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> > > > > > > answer to avoid overwrite data in nlh, because it may has more info after
> > > > > > > nlh. also this will avoid nlh buffer not enough issue.
> > > > > > > 
> > > > > > > We need to free answer after using.
> > > > > > > 
> > > > > > > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> > > > > > > Signed-off-by: Phil Sutter <phil@nwl.cc>
> > > > > > > ---  
> > > > > > 
> > > > > > Most of the uses of rtnl_talk() don't need to this peek and dynamic sizing.
> > > > > > Can only those places that need that be targeted?  
> > > > > 
> > > > > We could probably do that, by having a buffer on stack in __rtnl_talk()
> > > > > which will be used instead of the allocated one if 'answer' is NULL. Or
> > > > > maybe even introduce a dedicated API call for the dynamically allocated
> > > > > receive buffer. But I really doubt that's feasible: AFAICT, that stack
> > > > > buffer still needs to be reasonably sized since the reply might be
> > > > > larger than the request (reusing the request buffer would be the most
> > > > > simple way to tackle this), also there is support for extack which may
> > > > > bloat the response to arbitrary size. Hangbin has shown in his benchmark
> > > > > that the overhead of the second syscall is negligible, so why care about
> > > > > that and increase code complexity even further?
> > > > > 
> > > > > Not saying it's not possible, but I just doubt it's worth the effort.  
> > > > 
> > > > Agreed. Current code is based on the assumption that we can estimate the
> > > > maximum reply length in advance and the reason for this series is that
> > > > this assumption turned out to be wrong. I'm afraid that if we replace
> > > > it by an assumption that we can estimate the maximum reply length for
> > > > most requests with only few exceptions, it's only matter of time for us
> > > > to be proven wrong again.
> > > > 
> > > > Michal Kubecek
> > > > 
> > > 
> > > For query responses, yes the response may be large. But for the common cases of
> > > add address or add route, the response should just be ack or error.
> > 
> > And with extack, error is comprised of the original request plus an
> > arbitrarily sized error message, so we can't just reuse the request
> > buffer and are back to "guessing" the right length again.
> > 
> > To get an idea of what we're talking about, I wrote a simple benchmark
> > which adds 256 * 254 (= 65024) addresses to an interface, then removes
> > them again one by one and measured the time that takes for binaries with
> > and without Hangbin's patches:
> > 
> > OP	Vanilla		Hangbin		Delta
> > --------------------------------------------------------
> > add	real 2m16.244s	real 2m27.964s	+11.72s	(108.6%)
> > 	user 0m15.241s	user 0m17.295s	+2.054s	(113.5%)
> > 	sys  1m40.229s	sys  1m48.239s	+8.01s	(108.0%)
> > 
> > remove	real 1m44.950s	real 1m47.044s	+2.094s	(102.0%)
> > 	user 0m13.899s	user 0m14.723s	+0.824s (105.9%)
> > 	sys  1m30.798s	sys  1m31.938s	+1.140s (101.3%)
> > 
> > So the overhead of the second syscall and dynamic memory allocation is
> > less than 10% overall. Given the short time a single call to 'ip'
> > typically takes, I don't think the difference is noticeable even in
> > highly performance critical applications.
> > 
> > Cheers, Phil
> 
> For a better benchmark, I generated 4 Million routes
> then did: 
> 	# ip ---batch routes.txt

Ah, batch mode. Nice trick!

> OP	Vanilla		Hangbin		Delta
> -----------------------------------------------------
> add	real 1:25.840	1:33.677	+9.13%
> 	user   10.690	   6.078	-56.85%
> 	sys  1:00.920	1:13.109	+20.00%	
> 
> remove	real 2:29.881	2:25.872	-2.67%
> 	user   12.862	   7.942	-38.25%
> 	sys    44.127	  44.633	+1.15%
> 
> 
> So the answer is addition is slower but deletion appears faster?

Yeah, that's funny. Hangbin's tests show the same in his 'ip link show'
test. I can imagine a performance improvement in some situations since
the patches eliminate that memcpy() of the reply buffer in
__rtnl_talk(), but neither 'route add' nor 'route del' trigger that code
path.

> If I rerun the Vanilla test, get about the same times.
> 
> The slowdown won't impact me, but what about large scale users
> like Cumulus.

If they delete routes as often as they add them, things don't look too
bad at least. :)

Cheers, Phil

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
  2017-09-29 13:06   ` Michal Kubecek
  2017-10-02 17:37   ` Stephen Hemminger
@ 2017-10-25 10:45   ` Stephen Hemminger
  2017-10-25 13:16     ` Hangbin Liu
  2 siblings, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2017-10-25 10:45 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, Michal Kubecek, Phil Sutter, Hangbin Liu

On Thu, 28 Sep 2017 21:33:46 +0800
Hangbin Liu <haliu@redhat.com> wrote:

> From: Hangbin Liu <liuhangbin@gmail.com>
> 
> This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
> iplink_get()"). After update, we will not need to double the buffer size
> every time when VFs number increased.
> 
> With call like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
> length parameter.
> 
> With call like rtnl_talk(&rth, nlh, nlh, sizeof(req), I add a new variable
> answer to avoid overwrite data in nlh, because it may has more info after
> nlh. also this will avoid nlh buffer not enough issue.
> 
> We need to free answer after using.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> Signed-off-by: Phil Sutter <phil@nwl.cc>

This does not apply cleanly to net-next branch

Applying: lib/libnetlink: update rtnl_talk to support malloc buff at run time
error: patch failed: ip/ipl2tp.c:185
error: ip/ipl2tp.c: patch does not apply
error: patch failed: ip/iplink.c:1023
error: ip/iplink.c: patch does not apply
Patch failed at 0002 lib/libnetlink: update rtnl_talk to support malloc buff at run time

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time
  2017-10-25 10:45   ` Stephen Hemminger
@ 2017-10-25 13:16     ` Hangbin Liu
  0 siblings, 0 replies; 19+ messages in thread
From: Hangbin Liu @ 2017-10-25 13:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, Michal Kubecek, Phil Sutter, Hangbin Liu

On Wed, Oct 25, 2017 at 12:45:56PM +0200, Stephen Hemminger wrote:
> This does not apply cleanly to net-next branch
> 
> Applying: lib/libnetlink: update rtnl_talk to support malloc buff at run time
> error: patch failed: ip/ipl2tp.c:185
> error: ip/ipl2tp.c: patch does not apply
> error: patch failed: ip/iplink.c:1023
> error: ip/iplink.c: patch does not apply
> Patch failed at 0002 lib/libnetlink: update rtnl_talk to support malloc buff at run time
> 
Oh, I will send a new version patch based on latest net-next code ASAP.

Thanks
Hangbin

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2017-10-25 13:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-28 13:33 [PATCHv4 iproute2 0/2] libnetlink: malloc correct buff at run time Hangbin Liu
2017-09-28 13:33 ` [PATCHv4 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough Hangbin Liu
2017-09-29 12:55   ` Michal Kubecek
2017-09-29 17:54   ` Stephen Hemminger
2017-09-29 18:20     ` Michal Kubecek
2017-09-30 13:54     ` Hangbin Liu
2017-09-28 13:33 ` [PATCHv4 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time Hangbin Liu
2017-09-29 13:06   ` Michal Kubecek
2017-10-02 17:37   ` Stephen Hemminger
2017-10-09 20:25     ` Phil Sutter
2017-10-10  6:41       ` Michal Kubecek
2017-10-10 16:47         ` Stephen Hemminger
2017-10-11 10:40           ` Hangbin Liu
2017-10-11 11:10           ` Phil Sutter
2017-10-11 14:35             ` David Ahern
2017-10-12 16:07             ` Stephen Hemminger
2017-10-13 10:31               ` Phil Sutter
2017-10-25 10:45   ` Stephen Hemminger
2017-10-25 13:16     ` Hangbin Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.