All of lore.kernel.org
* nfnetlink: Busy-loop in nfnetlink_rcv_msg()
@ 2020-08-21 23:06 Phil Sutter
  2020-08-22 18:46 ` Florian Westphal
  2020-08-23 12:04 ` Pablo Neira Ayuso
  0 siblings, 2 replies; 11+ messages in thread
From: Phil Sutter @ 2020-08-21 23:06 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Pablo Neira Ayuso, Florian Westphal

Hi,

Starting firewalld with two active zones in an lxc container provokes a
situation in which nfnetlink_rcv_msg() loops indefinitely, because
nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
time.

I identified netlink_attachskb() as the originator for the above error
code. The conditional leading to it looks like this:

| if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
|      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
|         [...]
|         if (!*timeo) {

*timeo is zero, so this seems to be a non-blocking socket. Both
conditions hold: the NETLINK_S_CONGESTED bit is set, and
sk->sk_rmem_alloc exceeds sk->sk_rcvbuf.

From the user space side, firewalld seems to simply call sendto(), and
the call never returns.

How to solve that? I tried to find other code which does the same, but I
haven't found one that does any looping. Should nfnetlink_rcv_msg()
maybe just return -EAGAIN to the caller if it comes from call_rcu
backend?

This happening only in an lxc container may be due to some setsockopt()
calls not being allowed. In particular, setsockopt(SO_RCVBUFFORCE)
returns EPERM.

The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
whether this is related, and how.

Cheers, Phil

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-21 23:06 nfnetlink: Busy-loop in nfnetlink_rcv_msg() Phil Sutter
@ 2020-08-22 18:46 ` Florian Westphal
  2020-08-24 10:47   ` Pablo Neira Ayuso
  2020-08-24 13:11   ` Phil Sutter
  2020-08-23 12:04 ` Pablo Neira Ayuso
  1 sibling, 2 replies; 11+ messages in thread
From: Florian Westphal @ 2020-08-22 18:46 UTC (permalink / raw)
  To: Phil Sutter, netfilter-devel, Pablo Neira Ayuso, Florian Westphal

Phil Sutter <phil@nwl.cc> wrote:
> Starting firewalld with two active zones in an lxc container provokes a
> situation in which nfnetlink_rcv_msg() loops indefinitely, because
> nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> time.
> 
> I identified netlink_attachskb() as the originator for the above error
> code. The conditional leading to it looks like this:
> 
> | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> |         [...]
> |         if (!*timeo) {
> 
> *timeo is zero, so this seems to be a non-blocking socket. Both
> NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> sk->sk_rcvbuf.
> 
> From user space side, firewalld seems to simply call sendto() and the
> call never returns.
> 
> How to solve that? I tried to find other code which does the same, but I
> haven't found one that does any looping. Should nfnetlink_rcv_msg()
> maybe just return -EAGAIN to the caller if it comes from call_rcu
> backend?

Yes, I think that's the most straightforward solution.

We can of course also intercept -EAGAIN in nf_tables_api.c and translate
it to -ENOBUFS like in nft_get_set_elem().

But I think a generic solution is better.  The call_rcu backends should
not result in changes to nf_tables internal state so they do not load
modules and therefore don't need a restart.

> This happening only in an lxc container may be due to some setsockopt()
> calls not being allowed. In particular, setsockopt(SO_RCVBUFFORCE)
> returns EPERM.

Right.

> The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> if this is related and how.

Does that SO_RCVBUF succeed? How large is the recvbuf?
We should investigate and make sure nft actually works, rather than
just fix the loop and claim that fixes the bug (which merely changes
'nft loops' to 'nft exits with an error').

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH nf] netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS
@ 2020-08-23 11:55 Pablo Neira Ayuso
  0 siblings, 0 replies; 11+ messages in thread
From: Pablo Neira Ayuso @ 2020-08-23 11:55 UTC (permalink / raw)
  To: netfilter-devel; +Cc: phil

The frontend callback reports EAGAIN to nfnetlink to request a command
retry; this is used to signal that module autoloading is required.
Unfortunately, nlmsg_unicast() also reports EAGAIN when the receiver
socket buffer is full, so nfnetlink enters a busy-loop.

This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
to use nlmsg_unicast(). Remove the flags argument of
nfnetlink_unicast(), since existing callers always pass MSG_DONTWAIT,
which is exactly what nlmsg_unicast() passes to netlink_unicast().

Fixes: 96518518cc41 ("netfilter: add nftables")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/nfnetlink.h |  3 +-
 net/netfilter/nf_tables_api.c       | 61 ++++++++++++++---------------
 net/netfilter/nfnetlink.c           | 11 ++++--
 net/netfilter/nfnetlink_log.c       |  3 +-
 net/netfilter/nfnetlink_queue.c     |  2 +-
 5 files changed, 40 insertions(+), 40 deletions(-)

diff --git a/include/linux/netfilter/nfnetlink.h b/include/linux/netfilter/nfnetlink.h
index 851425c3178f..89016d08f6a2 100644
--- a/include/linux/netfilter/nfnetlink.h
+++ b/include/linux/netfilter/nfnetlink.h
@@ -43,8 +43,7 @@ int nfnetlink_has_listeners(struct net *net, unsigned int group);
 int nfnetlink_send(struct sk_buff *skb, struct net *net, u32 portid,
 		   unsigned int group, int echo, gfp_t flags);
 int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error);
-int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid,
-		      int flags);
+int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid);
 
 static inline u16 nfnl_msg_type(u8 subsys, u8 msg_type)
 {
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 71e501c5ad21..b7dc1cbf40ea 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -815,11 +815,11 @@ static int nf_tables_gettable(struct net *net, struct sock *nlsk,
 					nlh->nlmsg_seq, NFT_MSG_NEWTABLE, 0,
 					family, table);
 	if (err < 0)
-		goto err;
+		goto err_fill_table_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
 
-err:
+err_fill_table_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -1563,11 +1563,11 @@ static int nf_tables_getchain(struct net *net, struct sock *nlsk,
 					nlh->nlmsg_seq, NFT_MSG_NEWCHAIN, 0,
 					family, table, chain);
 	if (err < 0)
-		goto err;
+		goto err_fill_chain_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
 
-err:
+err_fill_chain_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -3008,11 +3008,11 @@ static int nf_tables_getrule(struct net *net, struct sock *nlsk,
 				       nlh->nlmsg_seq, NFT_MSG_NEWRULE, 0,
 				       family, table, chain, rule, NULL);
 	if (err < 0)
-		goto err;
+		goto err_fill_rule_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
 
-err:
+err_fill_rule_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -3968,11 +3968,11 @@ static int nf_tables_getset(struct net *net, struct sock *nlsk,
 
 	err = nf_tables_fill_set(skb2, &ctx, set, NFT_MSG_NEWSET, 0);
 	if (err < 0)
-		goto err;
+		goto err_fill_set_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
 
-err:
+err_fill_set_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -4860,24 +4860,18 @@ static int nft_get_set_elem(struct nft_ctx *ctx, struct nft_set *set,
 	err = -ENOMEM;
 	skb = nlmsg_new(NLMSG_GOODSIZE, GFP_ATOMIC);
 	if (skb == NULL)
-		goto err1;
+		return err;
 
 	err = nf_tables_fill_setelem_info(skb, ctx, ctx->seq, ctx->portid,
 					  NFT_MSG_NEWSETELEM, 0, set, &elem);
 	if (err < 0)
-		goto err2;
+		goto err_fill_setelem;
 
-	err = nfnetlink_unicast(skb, ctx->net, ctx->portid, MSG_DONTWAIT);
-	/* This avoids a loop in nfnetlink. */
-	if (err < 0)
-		goto err1;
+	return nfnetlink_unicast(skb, ctx->net, ctx->portid);
 
-	return 0;
-err2:
+err_fill_setelem:
 	kfree_skb(skb);
-err1:
-	/* this avoids a loop in nfnetlink. */
-	return err == -EAGAIN ? -ENOBUFS : err;
+	return err;
 }
 
 /* called with rcu_read_lock held */
@@ -6182,10 +6176,11 @@ static int nf_tables_getobj(struct net *net, struct sock *nlsk,
 				      nlh->nlmsg_seq, NFT_MSG_NEWOBJ, 0,
 				      family, table, obj, reset);
 	if (err < 0)
-		goto err;
+		goto err_fill_obj_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
-err:
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
+
+err_fill_obj_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -7045,10 +7040,11 @@ static int nf_tables_getflowtable(struct net *net, struct sock *nlsk,
 					    NFT_MSG_NEWFLOWTABLE, 0, family,
 					    flowtable, &flowtable->hook_list);
 	if (err < 0)
-		goto err;
+		goto err_fill_flowtable_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
-err:
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
+
+err_fill_flowtable_info:
 	kfree_skb(skb2);
 	return err;
 }
@@ -7234,10 +7230,11 @@ static int nf_tables_getgen(struct net *net, struct sock *nlsk,
 	err = nf_tables_fill_gen_info(skb2, net, NETLINK_CB(skb).portid,
 				      nlh->nlmsg_seq);
 	if (err < 0)
-		goto err;
+		goto err_fill_gen_info;
 
-	return nlmsg_unicast(nlsk, skb2, NETLINK_CB(skb).portid);
-err:
+	return nfnetlink_unicast(skb2, net, NETLINK_CB(skb).portid);
+
+err_fill_gen_info:
 	kfree_skb(skb2);
 	return err;
 }
diff --git a/net/netfilter/nfnetlink.c b/net/netfilter/nfnetlink.c
index 5f24edf95830..3a2e64e13b22 100644
--- a/net/netfilter/nfnetlink.c
+++ b/net/netfilter/nfnetlink.c
@@ -149,10 +149,15 @@ int nfnetlink_set_err(struct net *net, u32 portid, u32 group, int error)
 }
 EXPORT_SYMBOL_GPL(nfnetlink_set_err);
 
-int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid,
-		      int flags)
+int nfnetlink_unicast(struct sk_buff *skb, struct net *net, u32 portid)
 {
-	return netlink_unicast(net->nfnl, skb, portid, flags);
+	int err;
+
+	err = nlmsg_unicast(net->nfnl, skb, portid);
+	if (err == -EAGAIN)
+		err = -ENOBUFS;
+
+	return err;
 }
 EXPORT_SYMBOL_GPL(nfnetlink_unicast);
 
diff --git a/net/netfilter/nfnetlink_log.c b/net/netfilter/nfnetlink_log.c
index f02992419850..b35e8d9a5b37 100644
--- a/net/netfilter/nfnetlink_log.c
+++ b/net/netfilter/nfnetlink_log.c
@@ -356,8 +356,7 @@ __nfulnl_send(struct nfulnl_instance *inst)
 			goto out;
 		}
 	}
-	nfnetlink_unicast(inst->skb, inst->net, inst->peer_portid,
-			  MSG_DONTWAIT);
+	nfnetlink_unicast(inst->skb, inst->net, inst->peer_portid);
 out:
 	inst->qlen = 0;
 	inst->skb = NULL;
diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index dadfc06245a3..d1d8bca03b4f 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -681,7 +681,7 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
 	*packet_id_ptr = htonl(entry->id);
 
 	/* nfnetlink_unicast will either free the nskb or add it to a socket */
-	err = nfnetlink_unicast(nskb, net, queue->peer_portid, MSG_DONTWAIT);
+	err = nfnetlink_unicast(nskb, net, queue->peer_portid);
 	if (err < 0) {
 		if (queue->flags & NFQA_CFG_F_FAIL_OPEN) {
 			failopen = 1;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-21 23:06 nfnetlink: Busy-loop in nfnetlink_rcv_msg() Phil Sutter
  2020-08-22 18:46 ` Florian Westphal
@ 2020-08-23 12:04 ` Pablo Neira Ayuso
  1 sibling, 0 replies; 11+ messages in thread
From: Pablo Neira Ayuso @ 2020-08-23 12:04 UTC (permalink / raw)
  To: Phil Sutter, netfilter-devel, Florian Westphal

Hi Phil,

On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> Hi,
> 
> Starting firewalld with two active zones in an lxc container provokes a
> situation in which nfnetlink_rcv_msg() loops indefinitely, because
> nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> time.
> 
> I identified netlink_attachskb() as the originator for the above error
> code. The conditional leading to it looks like this:
> 
> | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> |         [...]
> |         if (!*timeo) {
> 
> *timeo is zero, so this seems to be a non-blocking socket. Both
> NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> sk->sk_rcvbuf.
> 
> From user space side, firewalld seems to simply call sendto() and the
> call never returns.
> 
> How to solve that? I tried to find other code which does the same, but I
> haven't found one that does any looping. Should nfnetlink_rcv_msg()
> maybe just return -EAGAIN to the caller if it comes from call_rcu
> backend?

It's a bug in the netlink frontend, which erroneously reports -EAGAIN
to the nfnetlink when the socket buffer is full, see:

https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/

> This happening only in an lxc container may be due to some setsockopt()
> calls not being allowed. In particular, setsockopt(SO_RCVBUFFORCE)
> returns EPERM.

SO_RCVBUFFORCE fails with EPERM if CAP_NET_ADMIN is not available.

> The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> if this is related and how.

The next problem is to track down why the socket buffer is getting
full with GET_GENID.

firewalld heavily uses NLM_F_ECHO, so I can see how it can easily
reach the default socket buffer size. With GET_GENID I'm not sure
yet; probably the problem is elsewhere, but it manifests in GET_GENID
because it's the first thing that is done when sending a batch. Maybe
there are unread messages in the socket buffer; you might check
/proc/net/netlink to see whether the socket buffer keeps growing as
firewalld moves on.

Is this easy to reproduce? Or does this happen only after firewalld
has been running for some time?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-22 18:46 ` Florian Westphal
@ 2020-08-24 10:47   ` Pablo Neira Ayuso
  2020-08-24 12:39     ` Florian Westphal
  2020-08-24 13:11   ` Phil Sutter
  1 sibling, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2020-08-24 10:47 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Phil Sutter, netfilter-devel

Hi Florian,

Sorry, I overlooked your reply.

On Sat, Aug 22, 2020 at 08:46:21PM +0200, Florian Westphal wrote:
> Phil Sutter <phil@nwl.cc> wrote:
> > Starting firewalld with two active zones in an lxc container provokes a
> > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > time.
> > 
> > I identified netlink_attachskb() as the originator for the above error
> > code. The conditional leading to it looks like this:
> > 
> > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > |         [...]
> > |         if (!*timeo) {
> > 
> > *timeo is zero, so this seems to be a non-blocking socket. Both
> > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > sk->sk_rcvbuf.
> > 
> > From user space side, firewalld seems to simply call sendto() and the
> > call never returns.
> > 
> > How to solve that? I tried to find other code which does the same, but I
> > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > backend?
> 
> Yes, I think thats the most straightforward solution.
> 
> We can of course also intercept -EAGAIN in nf_tables_api.c and translate
> it to -ENOBUFS like in nft_get_set_elem().
> 
> But I think a generic solution it better.  The call_rcu backends should
> not result in changes to nf_tables internal state so they do not load
> modules and therefore don't need a restart.

Handling this from the core would be better, so people don't have to
remember to use the nfnetlink_unicast() that I'm proposing.

Looking at the tree, call_rcu is not enough to assume this: there are
several nfnetlink subsystems calling netlink_unicast() that translate
EAGAIN to ENOBUFS, from .call and .call_rcu. The way to identify this
would be to decorate callbacks to know what are specifically GET
commands.

So either do this or just extend my patch to use nfnetlink_send()
everywhere to remove all existing translations from EAGAIN to ENOBUFS.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-24 10:47   ` Pablo Neira Ayuso
@ 2020-08-24 12:39     ` Florian Westphal
  0 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2020-08-24 12:39 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, Phil Sutter, netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> > We can of course also intercept -EAGAIN in nf_tables_api.c and translate
> > it to -ENOBUFS like in nft_get_set_elem().
> > 
> > But I think a generic solution it better.  The call_rcu backends should
> > not result in changes to nf_tables internal state so they do not load
> > modules and therefore don't need a restart.
> 
> Handling this from the core would be better, so people don't have to
> remember to use the nfnetlink_unicast() that I'm proposing.

Your patch looks good to me.

> Looking at the tree, call_rcu is not enough to assume this: there are
> several nfnetlink subsystems calling netlink_unicast() that translate
> EAGAIN to ENOBUFS, from .call and .call_rcu. The way to identify this
> would be to decorate callbacks to know what are specifically GET
> commands.

?  Which .call_rcu is NOT "passive" (does not just peek at things) and,
more specifically, which .call_rcu depends on -EAGAIN requiring a
transaction replay?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-22 18:46 ` Florian Westphal
  2020-08-24 10:47   ` Pablo Neira Ayuso
@ 2020-08-24 13:11   ` Phil Sutter
  2020-08-26 15:32     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 11+ messages in thread
From: Phil Sutter @ 2020-08-24 13:11 UTC (permalink / raw)
  To: Florian Westphal, Pablo Neira Ayuso; +Cc: netfilter-devel

Hi,

On Sat, Aug 22, 2020 at 08:46:21PM +0200, Florian Westphal wrote:
> Phil Sutter <phil@nwl.cc> wrote:
[...]
> > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > if this is related and how.
> 
> Does that SO_RCVBUF succeed? How large is the recvbuf?
> We should try to investigate and see that nft works rather than just
> fix the loop and claim that fixes the bug (but just changes
> 'nft loops' to 'nft exits with an error').

Yes, that call succeeds. A subsequent getsockopt() reveals a buffer
size of 425984. Should we print a warning if the kernel accepts but restricts
the value? Not sure how well warnings work when emitted by libnftables,
though.

In mnl_nft_event_listener(), we check the return value of
mnl_set_rcvbuffer() and print an error message:

| nft_print(octx, "# Cannot set up netlink receive socket buffer size to %u bytes, falling back to %u bytes\n",
|                           NFTABLES_NLEVENT_BUFSIZ, bufsiz);

When called from mnl_batch_talk(), the return code is ignored.

On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote:
> Hi Phil,
> 
> On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> > Hi,
> > 
> > Starting firewalld with two active zones in an lxc container provokes a
> > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > time.
> > 
> > I identified netlink_attachskb() as the originator for the above error
> > code. The conditional leading to it looks like this:
> > 
> > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > |         [...]
> > |         if (!*timeo) {
> > 
> > *timeo is zero, so this seems to be a non-blocking socket. Both
> > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > sk->sk_rcvbuf.
> > 
> > From user space side, firewalld seems to simply call sendto() and the
> > call never returns.
> > 
> > How to solve that? I tried to find other code which does the same, but I
> > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > backend?
> 
> It's a bug in the netlink frontend, which erroneously reports -EAGAIN
> to the nfnetlink when the socket buffer is full, see:
> 
> https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/

Obviously this avoids the lockup. As correctly assumed by Florian,
firewalld startup fails instead. (The daemon keeps running, but an error
message is printed indicating that initial ruleset setup failed.)

[...]
> > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > if this is related and how.
> 
> Next problem is to track why socket buffer is getting full with
> GET_GENID.
> 
> firewalld heavily uses NLM_F_ECHO, there I can see how it can easily
> reach the default socket buffer size, but with GET_GENID I'm not sure
> yet, probably the problem is elsewhere but it manifests in GET_GENID
> because it's the first thing that is done when sending a batch (maybe
> there are unread messages in the socket buffer, you might check
> /proc/net/netlink to see if the socket buffer keeps growing as
> firewalld moves on).

Yes, it happens only for echo mode. With your fix in place, I also see
what firewalld is trying to do: The JSON input leading to the error is
huge (~72k characters). I suspect that GET_GENID just happens to be the
last straw. Or my debugging was faulty somehow and netlink_attachskb()
really got called via a different code-path.

> Is this easy to reproduce? Or does this happens after some time of
> firewalld execution?

The necessary lxd setup aside, it's pretty trivial: launch an instance
of images:centos/8/amd64, install firewalld therein, add two zone files
and start firewalld. It happens immediately, so two active zones already
make firewalld generate enough rules to exceed the buffer space.

On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote:
> Frontend callback reports EAGAIN to nfnetlink to retry a command, this
> is used to signal that module autoloading is required. Unfortunately,
> nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
> full, so it enters a busy-loop.
> 
> This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
> to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
> since this is always MSG_DONTWAIT in the existing code which is exactly
> what nlmsg_unicast() passes to netlink_unicast() as parameter.
> 
> Fixes: 96518518cc41 ("netfilter: add nftables")
> Reported-by: Phil Sutter <phil@nwl.cc>
> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

This indeed "fixes" the problem. Or rather, exposes the actual problem
in echo-related code, namely the tendency to exhaust socket buffers.

So the problem we're facing is that the receive buffer fills up while
user space is still waiting for sendmsg() to complete. Is it possible
to buffer the data elsewhere in the kernel so that user space has a
chance to call recvmsg()?

Thanks, Phil

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-24 13:11   ` Phil Sutter
@ 2020-08-26 15:32     ` Pablo Neira Ayuso
  2020-08-26 18:54       ` Eric Garver
  2020-08-27 14:23       ` Phil Sutter
  0 siblings, 2 replies; 11+ messages in thread
From: Pablo Neira Ayuso @ 2020-08-26 15:32 UTC (permalink / raw)
  To: Phil Sutter, Florian Westphal, netfilter-devel

Hi Phil,

On Mon, Aug 24, 2020 at 03:11:04PM +0200, Phil Sutter wrote:
[...]
> On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote:
> > Hi Phil,
> > 
> > On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> > > Hi,
> > > 
> > > Starting firewalld with two active zones in an lxc container provokes a
> > > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > > time.
> > > 
> > > I identified netlink_attachskb() as the originator for the above error
> > > code. The conditional leading to it looks like this:
> > > 
> > > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > > |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > > |         [...]
> > > |         if (!*timeo) {
> > > 
> > > *timeo is zero, so this seems to be a non-blocking socket. Both
> > > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > > sk->sk_rcvbuf.
> > > 
> > > From user space side, firewalld seems to simply call sendto() and the
> > > call never returns.
> > > 
> > > How to solve that? I tried to find other code which does the same, but I
> > > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > > backend?
> > 
> > It's a bug in the netlink frontend, which erroneously reports -EAGAIN
> > to the nfnetlink when the socket buffer is full, see:
> > 
> > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/
> 
> Obviously this avoids the lockup. As correctly assumed by Florian,
> firewalld startup fails instead. (The daemon keeps running, but an error
> message is printed indicating that initial ruleset setup failed.)

Thanks for confirming, I'll apply this patch to nf.git.

> [...]
> > > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > > if this is related and how.
> > 
> > Next problem is to track why socket buffer is getting full with
> > GET_GENID.
> > 
> > firewalld heavily uses NLM_F_ECHO, there I can see how it can easily
> > reach the default socket buffer size, but with GET_GENID I'm not sure
> > yet, probably the problem is elsewhere but it manifests in GET_GENID
> > because it's the first thing that is done when sending a batch (maybe
> > there are unread messages in the socket buffer, you might check
> > /proc/net/netlink to see if the socket buffer keeps growing as
> > firewalld moves on).
> 
> Yes, it happens only for echo mode. With your fix in place, I also see
> what firewalld is trying to do: The JSON input leading to the error is
> huge (~72k characters). I suspect that GET_GENID just happens to be the
> last straw. Or my debugging was faulty somehow and netlink_attachskb()
> really got called via a different code-path.
> 
> > Is this easy to reproduce? Or does this happens after some time of
> > firewalld execution?
> 
> The necessary lxd setup aside, it's pretty trivial: launch an instance
> of images:centos/8/amd64, install firewalld therein, add two zone files
> and start firewalld. It happens immediately, so two active zones already
> make firewalld generate enough rules to exceed the buffer space.
> 
> On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote:
> > Frontend callback reports EAGAIN to nfnetlink to retry a command, this
> > is used to signal that module autoloading is required. Unfortunately,
> > nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
> > full, so it enters a busy-loop.
> > 
> > This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
> > to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
> > since this is always MSG_DONTWAIT in the existing code which is exactly
> > what nlmsg_unicast() passes to netlink_unicast() as parameter.
> > 
> > Fixes: 96518518cc41 ("netfilter: add nftables")
> > Reported-by: Phil Sutter <phil@nwl.cc>
> > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> 
> This indeed "fixes" the problem. Or rather, exposes the actual problem
> in echo-related code, namely the tendency to exhaust socket buffers.
> 
> So the problem we're facing is that while user space still waits for
> sendmsg() to complete, receive buffer fills up. Is it possible to buffer
> the data in kernel somewhere else so user space has a chance to call
> recvmsg()?

There are several possibilities; just a few that could be explored:

* I made a quick patch to batch several netlink messages coming as
  reply to the NLM_F_ECHO request into one single skbuff. If you look
  at the _notify() functions in the kernel, this is currently taking
  one single skbuff for one message which adds a bit of overhead (the
  skbuff truesize is used for the socket buffer accounting). I'm
  measuring here on x86_64 that each command takes 768 bytes. With a
  quick patch I'm batching several reply netlink messages into one
  single skbuff; now each command takes ~120 bytes (the exact size
  depends on how many expressions are used). This means the buffer
  can hold ~3550 commands vs. the existing ~555 commands (assuming
  the rather small sk_rmem_alloc limit of 426240 that you're reporting).

  Even if this does not fix your problem (you refer to 72k chars, not
  sure how many commands this is), this is probably good to have
  anyway, since it decreases memory consumption by ~85%. This will also
  make event reporting (monitor mode) more reliable through netlink
  (it's the same codepath).

  Note that userspace requires no changes to support batching mode:
  libmnl's mnl_cb_run() keeps iterating over the buffer that was
  received until all netlink messages are consumed.

  The quick patch is incomplete; I just wanted to prove the savings in
  terms of memory. I'll give it another spin and submit this for
  review.

* Probably recover the cookie idea: firewalld specifies a cookie
  that identifies the rule from userspace, so there is no need for
  NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie;
  this should be possible by looking up the handle from the cookie
  before deleting the rule. The cookie is stored in NFTA_RULE_USERDATA
  so this is only meaningful to userspace. No kernel changes are
  required (this has been supported for a while already).

  Note: The cookie could be duplicated. Since the cookie is allocated
  by userspace, it's up to userspace to ensure uniqueness. If cookies
  are duplicated, deleting a rule by cookie becomes ambiguous: the
  lookup may resolve to a rule other than the intended one.

  Rule deletion by cookie requires dumping the whole ruleset though
  (slow). However, I think firewalld keeps a rule cache, so it uses
  the rule handle to delete the rule instead (no need to dump the
  cache then). Therefore, the cookie is only used to fetch the rule
  handle.

  With the rule cookie in place, firewalld can send the batch, then
  make a NLM_F_DUMP request after sending the batch to retrieve the
  handle from the cookie instead of using NLM_F_ECHO. The batch send +
  NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
  dump is not atomic though.

  Because NLM_F_ECHO is only used to get back the rule handle, right?
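
The accounting figures in the first option above can be sanity-checked
with quick shell arithmetic (426240 is the sk_rmem_alloc limit reported
earlier in the thread; the per-reply costs are the measured values
quoted above, not general constants):

```shell
rcvbuf=426240          # receive buffer limit from the report
per_reply=768          # truesize charged when each echo reply gets its own skbuff
per_reply_batched=120  # approximate per-message cost once replies share one skbuff

echo "unbatched: $((rcvbuf / per_reply)) commands before the buffer fills"
echo "batched:   $((rcvbuf / per_reply_batched)) commands"
```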

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-26 15:32     ` Pablo Neira Ayuso
@ 2020-08-26 18:54       ` Eric Garver
  2020-08-27 14:23       ` Phil Sutter
  1 sibling, 0 replies; 11+ messages in thread
From: Eric Garver @ 2020-08-26 18:54 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Phil Sutter, Florian Westphal, netfilter-devel

On Wed, Aug 26, 2020 at 05:32:19PM +0200, Pablo Neira Ayuso wrote:
> Hi Phil,
> 
> On Mon, Aug 24, 2020 at 03:11:04PM +0200, Phil Sutter wrote:
> [...]
> > On Sun, Aug 23, 2020 at 02:04:34PM +0200, Pablo Neira Ayuso wrote:
> > > Hi Phil,
> > > 
> > > On Sat, Aug 22, 2020 at 01:06:15AM +0200, Phil Sutter wrote:
> > > > Hi,
> > > > 
> > > > Starting firewalld with two active zones in an lxc container provokes a
> > > > situation in which nfnetlink_rcv_msg() loops indefinitely, because
> > > > nc->call_rcu() (nf_tables_getgen() in this case) returns -EAGAIN every
> > > > time.
> > > > 
> > > > I identified netlink_attachskb() as the originator for the above error
> > > > code. The conditional leading to it looks like this:
> > > > 
> > > > | if ((atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf ||
> > > > |      test_bit(NETLINK_S_CONGESTED, &nlk->state))) {
> > > > |         [...]
> > > > |         if (!*timeo) {
> > > > 
> > > > *timeo is zero, so this seems to be a non-blocking socket. Both
> > > > NETLINK_S_CONGESTED bit is set and sk->sk_rmem_alloc exceeds
> > > > sk->sk_rcvbuf.
> > > > 
> > > > From user space side, firewalld seems to simply call sendto() and the
> > > > call never returns.
> > > > 
> > > > How to solve that? I tried to find other code which does the same, but I
> > > > haven't found one that does any looping. Should nfnetlink_rcv_msg()
> > > > maybe just return -EAGAIN to the caller if it comes from call_rcu
> > > > backend?
> > > 
> > > It's a bug in the netlink frontend, which erroneously reports -EAGAIN
> > > to the nfnetlink when the socket buffer is full, see:
> > > 
> > > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/
> > 
> > Obviously this avoids the lockup. As correctly assumed by Florian,
> > firewalld startup fails instead. (The daemon keeps running, but an error
> > message is printed indicating that initial ruleset setup failed.)
> 
> Thanks for confirming, I'll apply this patch to nf.git.
> 
> > [...]
> > > > The value of sk_rcvbuf is 425984, BTW. sk_rmem_alloc is 426240. In user
> > > > space, I see a call to setsockopt(SO_RCVBUF) with value 4194304. No idea
> > > > if this is related and how.
> > > 
> > > Next problem is to track why socket buffer is getting full with
> > > GET_GENID.
> > > 
> > > firewalld heavily uses NLM_F_ECHO, there I can see how it can easily
> > > reach the default socket buffer size, but with GET_GENID I'm not sure
> > > yet, probably the problem is elsewhere but it manifests in GET_GENID
> > > because it's the first thing that is done when sending a batch (maybe
> > > there are unread messages in the socket buffer, you might check
> > > /proc/net/netlink to see if the socket buffer keeps growing as
> > > firewalld moves on).
> > 
> > Yes, it happens only for echo mode. With your fix in place, I also see
> > what firewalld is trying to do: The JSON input leading to the error is
> > huge (~72k characters). I suspect that GET_GENID just happens to be the
> > last straw. Or my debugging was faulty somehow and netlink_attachskb()
> > really got called via a different code-path.
> > 
> > > Is this easy to reproduce? Or does this happens after some time of
> > > firewalld execution?
> > 
> > The necessary lxd setup aside, it's pretty trivial: launch an instance
> > of images:centos/8/amd64, install firewalld therein, add two zone files
> > and start firewalld. It happens immediately, so two active zones already
> > make firewalld generate enough rules to exceed the buffer space.
> > 
> > On Sun, Aug 23, 2020 at 01:55:36PM +0200, Pablo Neira Ayuso wrote:
> > > Frontend callback reports EAGAIN to nfnetlink to retry a command, this
> > > is used to signal that module autoloading is required. Unfortunately,
> > > nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
> > > full, so it enters a busy-loop.
> > > 
> > > This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
> > > to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
> > > since this is always MSG_DONTWAIT in the existing code which is exactly
> > > what nlmsg_unicast() passes to netlink_unicast() as parameter.
> > > 
> > > Fixes: 96518518cc41 ("netfilter: add nftables")
> > > Reported-by: Phil Sutter <phil@nwl.cc>
> > > Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
> > 
> > This indeed "fixes" the problem. Or rather, exposes the actual problem
> > in echo-related code, namely the tendency to exhaust socket buffers.
> > 
> > So the problem we're facing is that while user space still waits for
> > sendmsg() to complete, the receive buffer fills up. Is it possible to
> > buffer the data somewhere else in the kernel so user space has a
> > chance to call recvmsg()?
> 
> There are several possibilities, just a few that can be explored:
> 
> * I made a quick patch to batch several netlink messages coming as
>   reply to the NLM_F_ECHO request into one single skbuff. If you look
>   at the _notify() functions in the kernel, this is currently taking
>   one single skbuff for one message which adds a bit of overhead (the
>   skbuff truesize is used for the socket buffer accounting). I'm
>   measuring here on x86_64 that each command takes 768 bytes. With a
>   quick patch I'm batching several reply netlink messages into one
>   single skbuff; now each command takes ~120 bytes (well, size
>   depends on how many expressions you use actually). This means this
>   can handle ~3550 commands vs. the existing ~555 commands (assuming
>   very small sk_rmem_alloc 426240 as the one you're reporting).
> 
>   Even if this does not fix your problem (you refer to 72k chars, not
>   sure how many commands this is), this is probably good to have
>   anyway, as it decreases memory consumption by 85%. This will also
>   make event reporting (monitor mode) more reliable through netlink
>   (it's the same codepath).
> 
>   Note that userspace requires no changes to support batching mode:
>   libmnl's mnl_cb_run() keeps iterating over the buffer that was
>   received until all netlink messages are consumed.
> 
>   The quick patch is incomplete, I just want to prove the saving in
>   terms of memory. I'll give it another spin and submit this for
>   review.
> 
> * Probably recover the cookie idea: firewalld specifies a cookie
>   that identifies the rule from userspace, so there is no need for
>   NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie,
>   this should be possible by looking up the handle from the
>   cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA
>   so this is only meaningful to userspace. No kernel changes are
>   required (this has been supported for quite some time already).

Good that it doesn't require a new kernel.

>   Note: The cookie could be duplicated. Since the cookie is allocated
>   by userspace, it's up to userspace to ensure uniqueness. In case
>   cookies are duplicated, deleting a rule by cookie would affect every
>   rule carrying that cookie.

I think this is advantageous. It means a user/program can group rules
with a given cookie value.

>   Rule deletion by cookie requires dumping the whole ruleset though
>   (slow). However, I think firewalld keeps a rule cache, so it uses
>   the rule handle to delete the rule instead (no need to dump the
>   cache then). Therefore, the cookie is only used to fetch the rule
>   handle.
> 
>   With the rule cookie in place, firewalld can send the batch, then
>   make a NLM_F_DUMP request after sending the batch to retrieve the
>   handle from the cookie instead of using NLM_F_ECHO. The batch send +
>   NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
>   dump is not atomic though.

NLM_F_DUMP has the downside that a single rule update means fetching the
whole rule set.

firewalld has a large batch when it starts up. After the initial
startup, rule updates are generally small. I don't think NLM_F_DUMP is
a good fit for firewalld's use case. The simpler solution would be for
firewalld to split the initial large batch, i.e. break it up into
smaller batches.
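
The splitting could be as simple as the following sketch (the batch
size of 256 commands is an arbitrary illustration value, not something
firewalld uses; a real client would derive it from SO_RCVBUF and the
per-reply cost):

```python
def split_batch(commands, max_per_batch=256):
    """Yield the command list in fixed-size slices so the NLM_F_ECHO
    replies for each slice fit into the receive buffer."""
    for i in range(0, len(commands), max_per_batch):
        yield commands[i:i + max_per_batch]

cmds = ["add rule inet firewalld filter_IN cmd%d" % i for i in range(1000)]
batches = list(split_batch(cmds))
assert len(batches) == 4 and sum(len(b) for b in batches) == 1000
```

The trade-off is that a split batch is no longer applied atomically,
so error handling across partially applied batches gets more involved.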

>   Because NLM_F_ECHO is only used to get back the rule handle, right?

Right.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-26 15:32     ` Pablo Neira Ayuso
  2020-08-26 18:54       ` Eric Garver
@ 2020-08-27 14:23       ` Phil Sutter
  2020-08-27 17:50         ` Pablo Neira Ayuso
  1 sibling, 1 reply; 11+ messages in thread
From: Phil Sutter @ 2020-08-27 14:23 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel

Hi Pablo,

On Wed, Aug 26, 2020 at 05:32:19PM +0200, Pablo Neira Ayuso wrote:
[...]
> > > https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200823115536.16631-1-pablo@netfilter.org/
> > 
> > Obviously this avoids the lockup. As correctly assumed by Florian,
> > firewalld startup fails instead. (The daemon keeps running, but an error
> > message is printed indicating that initial ruleset setup failed.)
> 
> Thanks for confirming, I'll apply this patch to nf.git.

Thanks!

[...]
> There are several possibilities, just a few that can be explored:
> 
> * I made a quick patch to batch several netlink messages coming as
>   reply to the NLM_F_ECHO request into one single skbuff. If you look
>   at the _notify() functions in the kernel, this is currently taking
>   one single skbuff for one message which adds a bit of overhead (the
>   skbuff truesize is used for the socket buffer accounting). I'm
>   measuring here on x86_64 that each command takes 768 bytes. With a
>   quick patch I'm batching several reply netlink messages into one
>   single skbuff; now each command takes ~120 bytes (well, size
>   depends on how many expressions you use actually). This means this
>   can handle ~3550 commands vs. the existing ~555 commands (assuming
>   very small sk_rmem_alloc 426240 as the one you're reporting).
> 
>   Even if this does not fix your problem (you refer to 72k chars, not
>   sure how many commands this is), this is probably good to have
>   anyway, as it decreases memory consumption by 85%. This will also
>   make event reporting (monitor mode) more reliable through netlink
>   (it's the same codepath).

I didn't count the number of commands, but I highly doubt it was even
close to 3000.

>   Note that userspace requires no changes to support batching mode:
>   libmnl's mnl_cb_run() keeps iterating over the buffer that was
>   received until all netlink messages are consumed.
> 
>   The quick patch is incomplete, I just want to prove the saving in
>   terms of memory. I'll give it another spin and submit this for
>   review.

Highly appreciated. Put me in Cc and I'll stress-test it a bit.

> * Probably recover the cookie idea: firewalld specifies a cookie
>   that identifies the rule from userspace, so there is no need for
>   NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie,
>   this should be possible by looking up the handle from the
>   cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA
>   so this is only meaningful to userspace. No kernel changes are
>   required (this has been supported for quite some time already).
> 
>   Note: The cookie could be duplicated. Since the cookie is allocated
>   by userspace, it's up to userspace to ensure uniqueness. In case
>   cookies are duplicated, deleting a rule by cookie would affect every
>   rule carrying that cookie.
> 
>   Rule deletion by cookie requires dumping the whole ruleset though
>   (slow). However, I think firewalld keeps a rule cache, so it uses
>   the rule handle to delete the rule instead (no need to dump the
>   cache then). Therefore, the cookie is only used to fetch the rule
>   handle.
> 
>   With the rule cookie in place, firewalld can send the batch, then
>   make a NLM_F_DUMP request after sending the batch to retrieve the
>   handle from the cookie instead of using NLM_F_ECHO. The batch send +
>   NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
>   dump is not atomic though.
> 
>   Because NLM_F_ECHO is only used to get back the rule handle, right?

The discussion that led to NLM_F_ECHO was to support atomic rule handle
retrieval. A user-defined "pseudo-handle" obviously can't uphold that
promise and therefore won't be a full replacement.

Apart from that, JSON echo output in its current form is useful for
scripts to retrieve the handle. We've had that discussion already; I
pointed out they could just do 'input = output' and know
'input["nftables"][5]["add"]["rule"]["expr"][0]["match"]["right"]' is
still present and has the expected value.

I recently found out that firewalld (e.g.) doesn't even do that, but
instead manually iterates over the list of commands it got back and
extracts handle values. Doing that without NLM_F_ECHO but with cookies
instead means adding some rules, then listing the ruleset and
searching for one's cookies.
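
Roughly what that iteration looks like (the echo structure below is
abbreviated to the parts relevant here; real libnftables JSON output
carries more attributes per rule):

```python
def extract_handles(echo_output):
    """Walk the command list echoed back by the kernel and collect the
    kernel-assigned rule handles, similar to what firewalld does."""
    handles = []
    for cmd in echo_output.get("nftables", []):
        rule = cmd.get("add", {}).get("rule")
        if rule and "handle" in rule:
            handles.append(rule["handle"])
    return handles

echo = {"nftables": [
    {"add": {"rule": {"family": "inet", "table": "firewalld",
                      "chain": "filter_IN", "handle": 42, "expr": []}}},
    {"add": {"rule": {"family": "inet", "table": "firewalld",
                      "chain": "filter_IN", "handle": 43, "expr": []}}},
]}
assert extract_handles(echo) == [42, 43]
```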

That aside, if nftables doesn't support cookies beyond keeping them in
place, why not just use a custom comment instead? That's even
backwards-compatible.

Cheers, Phil

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: nfnetlink: Busy-loop in nfnetlink_rcv_msg()
  2020-08-27 14:23       ` Phil Sutter
@ 2020-08-27 17:50         ` Pablo Neira Ayuso
  0 siblings, 0 replies; 11+ messages in thread
From: Pablo Neira Ayuso @ 2020-08-27 17:50 UTC (permalink / raw)
  To: Phil Sutter, Florian Westphal, netfilter-devel

Hi Phil,

On Thu, Aug 27, 2020 at 04:23:19PM +0200, Phil Sutter wrote:
> Hi Pablo,
> 
> On Wed, Aug 26, 2020 at 05:32:19PM +0200, Pablo Neira Ayuso wrote:
> [...]
> > There are several possibilities, just a few that can be explored:
[...]
> >   Note that userspace requires no changes to support batching mode:
> >   libmnl's mnl_cb_run() keeps iterating over the buffer that was
> >   received until all netlink messages are consumed.
> > 
> >   The quick patch is incomplete, I just want to prove the saving in
> >   terms of memory. I'll give it another spin and submit this for
> >   review.
> 
> Highly appreciated. Put me in Cc and I'll stress-test it a bit.

Let's give this patch a try:

https://patchwork.ozlabs.org/project/netfilter-devel/patch/20200827172842.24478-1-pablo@netfilter.org/

> > * Probably recover the cookie idea: firewalld specifies a cookie
> >   that identifies the rule from userspace, so there is no need for
> >   NLM_F_ECHO. Eric mentioned he would like to delete rules by cookie,
> >   this should be possible by looking up the handle from the
> >   cookie to delete rules. The cookie is stored in NFTA_RULE_USERDATA
> >   so this is only meaningful to userspace. No kernel changes are
> >   required (this has been supported for quite some time already).
> > 
> >   Note: The cookie could be duplicated. Since the cookie is allocated
> >   by userspace, it's up to userspace to ensure uniqueness. In case
> >   cookies are duplicated, deleting a rule by cookie would affect every
> >   rule carrying that cookie.
> > 
> >   Rule deletion by cookie requires dumping the whole ruleset though
> >   (slow). However, I think firewalld keeps a rule cache, so it uses
> >   the rule handle to delete the rule instead (no need to dump the
> >   cache then). Therefore, the cookie is only used to fetch the rule
> >   handle.
> > 
> >   With the rule cookie in place, firewalld can send the batch, then
> >   make a NLM_F_DUMP request after sending the batch to retrieve the
> >   handle from the cookie instead of using NLM_F_ECHO. The batch send +
> >   NLM_F_DUMP barely consumes 16KBytes per recvmsg() call. The send +
> >   dump is not atomic though.
> > 
> >   Because NLM_F_ECHO is only used to get back the rule handle, right?
> 
> The discussion that led to NLM_F_ECHO was to support atomic rule handle
> retrieval. A user-defined "pseudo-handle" obviously can't uphold that
> promise and therefore won't be a full replacement.
>
> Apart from that, JSON echo output in its current form is useful for
> scripts to retrieve the handle. We've had that discussion already, I
> pointed out they could just do 'input = output' and know
> 'input["nftables"][5]["add"]["rule"]["expr"][0]["match"]["right"]' is
> still present and has the expected value.
>
> I recently found out that firewalld (e.g.) doesn't even do that, but
> instead manually iterates over the list of commands it got back and
> extracts handle values.
>
> Doing that without NLM_F_ECHO but with cookies instead means to add
> some rules, then list ruleset and search for one's cookies.
> 
> That aside, if nftables doesn't support cookies beyond keeping them in
> place, why not just use a custom comment instead? That's even
> backwards-compatible.

Something else to improve netlink receive socket usage: userspace
might also set NLM_F_ECHO for rules only, so the kernel only consumes
receive socket buffer space for those reports. I'm not sure yet how to
expose this through the library in a nice way, though...

Probably it should be possible to extend netlink to filter out
attributes that do not need to be reported back to userspace via
NLM_F_ECHO...

The main gain is to amortize the skbuff cost, i.e. increasing the
batching factor, but going over NLMSG_GOODSIZE is tricky. We would
have to update userspace (provide a larger buffer) and revisit whether
this is possible without breaking backward compatibility.

BTW, if NLM_F_ECHO results in ENOSPC, the ruleset has still been
applied; it's just that the socket buffer got full. A fallback to
NLM_F_DUMP is probably an option in that case? But I understand the
goal is to not hit ENOSPC.
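
That fallback could look like this (hypothetical helper callables
standing in for the actual libnftables send and dump paths):

```python
import errno

def commit_and_get_handles(send_batch_echo, dump_ruleset):
    """Commit via NLM_F_ECHO; on ENOSPC the batch was still applied and
    only the reply buffer overflowed, so recover handles with a dump."""
    try:
        return send_batch_echo()
    except OSError as exc:
        if exc.errno != errno.ENOSPC:
            raise
        return dump_ruleset()

# Echo succeeds: handles come straight from the echoed replies.
assert commit_and_get_handles(lambda: [42, 43], lambda: []) == [42, 43]

# Echo overflows: fall back to dumping the (already committed) ruleset.
def overflowing():
    raise OSError(errno.ENOSPC, "No space left on device")
assert commit_and_get_handles(overflowing, lambda: [42, 43]) == [42, 43]
```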

Let me know if my patch helps mitigate the problem, thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-08-27 17:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-21 23:06 nfnetlink: Busy-loop in nfnetlink_rcv_msg() Phil Sutter
2020-08-22 18:46 ` Florian Westphal
2020-08-24 10:47   ` Pablo Neira Ayuso
2020-08-24 12:39     ` Florian Westphal
2020-08-24 13:11   ` Phil Sutter
2020-08-26 15:32     ` Pablo Neira Ayuso
2020-08-26 18:54       ` Eric Garver
2020-08-27 14:23       ` Phil Sutter
2020-08-27 17:50         ` Pablo Neira Ayuso
2020-08-23 12:04 ` Pablo Neira Ayuso
2020-08-23 11:55 [PATCH nf] netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS Pablo Neira Ayuso
