netfilter-devel.vger.kernel.org archive mirror
* [PATCH nf] netfilter: arptables: use percpu jumpstack
@ 2015-06-30 20:21 Florian Westphal
  2015-07-02 11:30 ` Pablo Neira Ayuso
  0 siblings, 1 reply; 7+ messages in thread
From: Florian Westphal @ 2015-06-30 20:21 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Unlike ip and ip6tables, arp tables were never converted to use the percpu
jump stack.

They still use the rule blob to store the return address, which is no
longer safe since commit 482cfc318559 ("netfilter: xtables: avoid percpu
ruleset duplication") made this blob shared among all processors.

Because there is no TEE support for arptables, we don't need to cope
with reentrancy, so we can use a local variable to hold the stack offset.

Fixes: 482cfc318559 ("netfilter: xtables: avoid percpu ruleset duplication")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/netfilter/arp_tables.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 95c9b6e..0fbe1a6 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -254,9 +254,10 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
 	unsigned int verdict = NF_DROP;
 	const struct arphdr *arp;
-	struct arpt_entry *e, *back;
+	struct arpt_entry *e, **jumpstack;
 	const char *indev, *outdev;
 	const void *table_base;
+	unsigned int cpu, stackidx = 0;
 	const struct xt_table_info *private;
 	struct xt_action_param acpar;
 	unsigned int addend;
@@ -270,15 +271,16 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	local_bh_disable();
 	addend = xt_write_recseq_begin();
 	private = table->private;
+	cpu     = smp_processor_id();
 	/*
 	 * Ensure we load private-> members after we've fetched the base
 	 * pointer.
 	 */
 	smp_read_barrier_depends();
 	table_base = private->entries;
+	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
 
 	e = get_entry(table_base, private->hook_entry[hook]);
-	back = get_entry(table_base, private->underflow[hook]);
 
 	acpar.in      = state->in;
 	acpar.out     = state->out;
@@ -312,18 +314,23 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 					verdict = (unsigned int)(-v) - 1;
 					break;
 				}
-				e = back;
-				back = get_entry(table_base, back->comefrom);
+				if (stackidx == 0) {
+					e = get_entry(table_base,
+						      private->underflow[hook]);
+				} else {
+					e = jumpstack[--stackidx];
+					e = arpt_next_entry(e);
+				}
 				continue;
 			}
 			if (table_base + v
 			    != arpt_next_entry(e)) {
-				/* Save old back ptr in next entry */
-				struct arpt_entry *next = arpt_next_entry(e);
-				next->comefrom = (void *)back - table_base;
 
-				/* set back pointer to next entry */
-				back = next;
+				if (WARN_ON_ONCE(stackidx >= private->stacksize)) {
+					verdict = NF_DROP;
+					break;
+				}
+				jumpstack[stackidx++] = e;
 			}
 
 			e = get_entry(table_base, v);
-- 
2.0.5


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-06-30 20:21 [PATCH nf] netfilter: arptables: use percpu jumpstack Florian Westphal
@ 2015-07-02 11:30 ` Pablo Neira Ayuso
  2015-07-02 11:48   ` Jan Engelhardt
  2015-07-02 11:48   ` Florian Westphal
  0 siblings, 2 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2015-07-02 11:30 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

On Tue, Jun 30, 2015 at 10:21:00PM +0200, Florian Westphal wrote:
> Unlike ip and ip6tables, arp tables were never converted to use the percpu
> jump stack.
> 
> They still use the rule blob to store the return address, which is no
> longer safe since commit 482cfc318559 ("netfilter: xtables: avoid percpu
> ruleset duplication") made this blob shared among all processors.
> 
> Because there is no TEE support for arptables, we don't need to cope
> with reentrancy, so we can use a local variable to hold the stack offset.
> 
> Fixes: 482cfc318559 ("netfilter: xtables: avoid percpu ruleset duplication")
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  net/ipv4/netfilter/arp_tables.c | 25 ++++++++++++++++---------
>  1 file changed, 16 insertions(+), 9 deletions(-)
> 
> diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
> index 95c9b6e..0fbe1a6 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -254,9 +254,10 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  	static const char nulldevname[IFNAMSIZ] __attribute__((aligned(sizeof(long))));
>  	unsigned int verdict = NF_DROP;
>  	const struct arphdr *arp;
> -	struct arpt_entry *e, *back;
> +	struct arpt_entry *e, **jumpstack;
>  	const char *indev, *outdev;
>  	const void *table_base;
> +	unsigned int cpu, stackidx = 0;
>  	const struct xt_table_info *private;
>  	struct xt_action_param acpar;
>  	unsigned int addend;
> @@ -270,15 +271,16 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  	local_bh_disable();
>  	addend = xt_write_recseq_begin();
>  	private = table->private;
> +	cpu     = smp_processor_id();
>  	/*
>  	 * Ensure we load private-> members after we've fetched the base
>  	 * pointer.
>  	 */
>  	smp_read_barrier_depends();
>  	table_base = private->entries;
> +	jumpstack  = (struct arpt_entry **)private->jumpstack[cpu];
>  
>  	e = get_entry(table_base, private->hook_entry[hook]);
> -	back = get_entry(table_base, private->underflow[hook]);
>  
>  	acpar.in      = state->in;
>  	acpar.out     = state->out;
> @@ -312,18 +314,23 @@ unsigned int arpt_do_table(struct sk_buff *skb,
>  					verdict = (unsigned int)(-v) - 1;
>  					break;
>  				}
> -				e = back;
> -				back = get_entry(table_base, back->comefrom);
> +				if (stackidx == 0) {
> +					e = get_entry(table_base,
> +						      private->underflow[hook]);
> +				} else {
> +					e = jumpstack[--stackidx];
> +					e = arpt_next_entry(e);
> +				}
>  				continue;
>  			}
>  			if (table_base + v
>  			    != arpt_next_entry(e)) {
> -				/* Save old back ptr in next entry */
> -				struct arpt_entry *next = arpt_next_entry(e);
> -				next->comefrom = (void *)back - table_base;
>  
> -				/* set back pointer to next entry */
> -				back = next;
> +				if (WARN_ON_ONCE(stackidx >= private->stacksize)) {
> +					verdict = NF_DROP;
> +					break;
> +				}

I can see you're getting this in sync with iptables, but I wonder
about this defensive check to make sure we don't go over the allocated
jumpstack area. This was added in f3c5c1bfd43.

If we remove it and things are broken, then this will crash with a
general protection fault when accessing memory outside the jumpstack
boundary. On the other hand, if we keep it, packets will be dropped
and it will keep going until someone checks the logs and reports this.
If we hit this, then things are really broken, so being aggressive in
this case probably makes sense.

Moreover, this adds another branch to the packet path (not critical
in arptables, but we have it in iptables too).

What do you think?

BTW, not related to this patch: Eric Dumazet indicated during the NFWS
that it would be a good idea to make this jumpstack fixed length as in
nftables, so we can place it on the stack and get rid of this percpu
jumpstack that was introduced to cope with reentrancy (only TEE needs
this). I've been checking this, but we have no limits at the moment,
so the concern is that if we limit this, we may break some crazy setup
out there with lots of jumps to chains. So I suspect we cannot get rid
of this easily :-(.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-07-02 11:30 ` Pablo Neira Ayuso
@ 2015-07-02 11:48   ` Jan Engelhardt
  2015-07-02 11:48   ` Florian Westphal
  1 sibling, 0 replies; 7+ messages in thread
From: Jan Engelhardt @ 2015-07-02 11:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel


On Thursday 2015-07-02 13:30, Pablo Neira Ayuso wrote:
>> 
>> Unlike ip and ip6tables, arp tables were never converted to use the percpu
>> jump stack.
>> 
>>  net/ipv4/netfilter/arp_tables.c | 25 ++++++++++++++++---------
>> -				/* Save old back ptr in next entry */
>> -				struct arpt_entry *next = arpt_next_entry(e);
>> -				next->comefrom = (void *)back - table_base;
>>  
>> -				/* set back pointer to next entry */
>> -				back = next;
>> +				if (WARN_ON_ONCE(stackidx >= private->stacksize)) {
>> +					verdict = NF_DROP;
>> +					break;
>> +				}
>
>I can see you're getting this in sync with iptables, but I wonder
>about this defensive check to make sure we don't go over the allocated
>jumpstack area. This was added in f3c5c1bfd43.
>
>If we remove it and things are broken, then this will crash with a
>general protection fault when accessing memory out of the jumpstack
>boundary.

To the best of my knowledge, arp_tables does not offer any target
that can re-enter (as REJECT, TEE and SYNPROXY do in IPv4/IPv6), so
we would never hit a situation with stackidx >= stacksize anyway.

>Eric Dumazet indicated during the NFWS that it would be a good idea
>to make this jumpstack fixed length as in nftables, so we can place
>it in the stack and get rid of this percpu jumpstack that was
>introduced to cope with reentrancy (only TEE needs this).

fixed *and reduced* (you don't want to place a 64-chain list
onto the stack), and the reduction is what will conflict
with existing rulesets.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-07-02 11:30 ` Pablo Neira Ayuso
  2015-07-02 11:48   ` Jan Engelhardt
@ 2015-07-02 11:48   ` Florian Westphal
  2015-07-02 15:38     ` Eric Dumazet
  2015-07-02 15:43     ` Pablo Neira Ayuso
  1 sibling, 2 replies; 7+ messages in thread
From: Florian Westphal @ 2015-07-02 11:48 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: Florian Westphal, netfilter-devel, eric.dumazet

Pablo Neira Ayuso <pablo@netfilter.org> wrote:

[ CC Eric ]

> > +				if (WARN_ON_ONCE(stackidx >= private->stacksize)) {
> > +					verdict = NF_DROP;
> > +					break;
> > +				}
> 
> I can see you're getting this in sync with iptables, but I wonder
> about this defensive check to make sure we don't go over the allocated
> jumpstack area. This was added in f3c5c1bfd43.

Yes, I added it since iptables does this and because this is an nf
patch, not nf-next.

> If we remove it and things are broken, then this will crash with a
> general protection fault when accessing memory outside the jumpstack
> boundary. On the other hand, if we keep it, packets will be dropped
> and it will keep going until someone checks the logs and reports this.
> If we hit this, then things are really broken, so being aggressive in
> this case probably makes sense.
> Moreover, this adds another branch to the packet path (not critical
> in arptables, but we have it in iptables too).
> 
> What do you think?

I will remove it in the nf-next patch series I am currently working on.

When this happens something is *seriously* wrong with the ruleset
validation checks that we have in place.

> BTW, not related to this patch: Eric Dumazet indicated during the NFWS
> that it would be a good idea to make this jumpstack fixed length as in
> nftables, so we can place it on the stack and get rid of this percpu
> jumpstack that was introduced to cope with reentrancy (only TEE needs
> this). I've been checking this, but we have no limits at the moment,
> so the concern is that if we limit this, we may break some crazy setup
> out there with lots of jumps to chains. So I suspect we cannot get rid
> of this easily :-(.

Seems Eric lobbied this to several people ;)

I'm working on it.

I agree that using the kernel stack with an auto-sized variable makes
the most sense, BUT since this could theoretically cause userspace
breakage I decided against it.

My plan:

- move tee_active percpu variable to xtables core (suggested by Eric)
- in do_table, check if we're TEE'd or not

1. if no, then just use the jumpstack from offset 0 onwards.
2. If yes, then fetch jumpstack, and use the upper half:

if (__this_cpu_read(xt_tee_active))
 	jumpstack += private->stacksize;

(jumpstack is twice the size of 'stacksize' to accommodate this).

This means that the relative stack offset during table traversal
always starts at 0 and we do not have to store the old stack location
when we leave the do_table function anymore.

The stackptr percpu variable is now unused; I'll unify
it with the jumpstack so that the stack is fetched via

jumpstack  = (struct ipt_entry **) this_cpu_ptr(private->stackptr);

I was also planning to compute the stack size actually needed
(i.e., track the largest call chain seen; this should be simple to add
 to the 'mark_source_chains' function) rather than just using the number
 of chains.

In most rulesets the call chain will not be deep, even when there
are 100 or so user-defined rules.

I'd guess a percpu jumpstack of < 128 bytes is quite realistic for most
rulesets.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-07-02 11:48   ` Florian Westphal
@ 2015-07-02 15:38     ` Eric Dumazet
  2015-07-02 20:00       ` Florian Westphal
  2015-07-02 15:43     ` Pablo Neira Ayuso
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2015-07-02 15:38 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Pablo Neira Ayuso, netfilter-devel

On Thu, 2015-07-02 at 13:48 +0200, Florian Westphal wrote:

> My plan:
> 
> > - move tee_active percpu variable to xtables core (suggested by Eric)
> - in do_table, check if we're TEE'd or not
> 
> 1. if no, then just use the jumpstack from offset 0 onwards.
> 2. If yes, then fetch jumpstack, and use the upper half:
> 
> if (__this_cpu_read(xt_tee_active))
>  	jumpstack += private->stacksize;

Or maybe not using a conditional

jumpstack += private->stacksize * __this_cpu_read(xt_tee_active);


BTW, I do not remember why I used a conditional in
xt_write_recseq_begin(). It also adds extra setup cost, as @addend has
to be preserved on the stack.

Hmm... What about something like :

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 286098a5667f..bee1484c18d2 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -287,44 +287,28 @@ DECLARE_PER_CPU(seqcount_t, xt_recseq);
  * Begin packet processing : all readers must wait the end
  * 1) Must be called with preemption disabled
  * 2) softirqs must be disabled too (or we should use this_cpu_add())
- * Returns :
- *  1 if no recursion on this cpu
- *  0 if recursion detected
  */
-static inline unsigned int xt_write_recseq_begin(void)
+static inline void xt_write_recseq_begin(void)
 {
-	unsigned int addend;
-
-	/*
-	 * Low order bit of sequence is set if we already
-	 * called xt_write_recseq_begin().
-	 */
-	addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
-
-	/*
-	 * This is kind of a write_seqcount_begin(), but addend is 0 or 1
-	 * We dont check addend value to avoid a test and conditional jump,
-	 * since addend is most likely 1
+	/* This is kind of a write_seqcount_begin(),
+	 * but can be nested (no lockdep support)
 	 */
-	__this_cpu_add(xt_recseq.sequence, addend);
+	__this_cpu_inc(xt_recseq.sequence);
 	smp_wmb();
-
-	return addend;
 }
 
 /**
  * xt_write_recseq_end - end of a write section
- * @addend: return value from previous xt_write_recseq_begin()
  *
  * End packet processing : all readers can proceed
  * 1) Must be called with preemption disabled
  * 2) softirqs must be disabled too (or we should use this_cpu_add())
  */
-static inline void xt_write_recseq_end(unsigned int addend)
+static inline void xt_write_recseq_end(void)
 {
-	/* this is kind of a write_seqcount_end(), but addend is 0 or 1 */
+	/* this is kind of a write_seqcount_end() */
 	smp_wmb();
-	__this_cpu_add(xt_recseq.sequence, addend);
+	__this_cpu_inc(xt_recseq.sequence);
 }
 
 /*
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 95c9b6eece25..b34ac984bd7f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -259,7 +259,6 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	const void *table_base;
 	const struct xt_table_info *private;
 	struct xt_action_param acpar;
-	unsigned int addend;
 
 	if (!pskb_may_pull(skb, arp_hdr_len(skb->dev)))
 		return NF_DROP;
@@ -268,7 +267,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 	outdev = state->out ? state->out->name : nulldevname;
 
 	local_bh_disable();
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	private = table->private;
 	/*
 	 * Ensure we load private-> members after we've fetched the base
@@ -346,7 +345,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 			/* Verdict */
 			break;
 	} while (!acpar.hotdrop);
-	xt_write_recseq_end(addend);
+	xt_write_recseq_end();
 	local_bh_enable();
 
 	if (acpar.hotdrop)
@@ -1130,7 +1129,6 @@ static int do_add_counters(struct net *net, const void __user *user,
 	const struct xt_table_info *private;
 	int ret = 0;
 	struct arpt_entry *iter;
-	unsigned int addend;
 #ifdef CONFIG_COMPAT
 	struct compat_xt_counters_info compat_tmp;
 
@@ -1185,7 +1183,7 @@ static int do_add_counters(struct net *net, const void __user *user,
 
 	i = 0;
 
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	xt_entry_foreach(iter,  private->entries, private->size) {
 		struct xt_counters *tmp;
 
@@ -1193,7 +1191,7 @@ static int do_add_counters(struct net *net, const void __user *user,
 		ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
 		++i;
 	}
-	xt_write_recseq_end(addend);
+	xt_write_recseq_end();
  unlock_up_free:
 	local_bh_enable();
 	xt_table_unlock(t);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 6c72fbb7b49e..4c575fb623aa 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -299,7 +299,6 @@ ipt_do_table(struct sk_buff *skb,
 	unsigned int *stackptr, origptr, cpu;
 	const struct xt_table_info *private;
 	struct xt_action_param acpar;
-	unsigned int addend;
 
 	/* Initialization */
 	ip = ip_hdr(skb);
@@ -321,7 +320,7 @@ ipt_do_table(struct sk_buff *skb,
 
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
 	local_bh_disable();
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	private = table->private;
 	cpu        = smp_processor_id();
 	/*
@@ -426,7 +425,7 @@ ipt_do_table(struct sk_buff *skb,
 	pr_debug("Exiting %s; resetting sp from %u to %u\n",
 		 __func__, *stackptr, origptr);
 	*stackptr = origptr;
- 	xt_write_recseq_end(addend);
+ 	xt_write_recseq_end();
  	local_bh_enable();
 
 #ifdef DEBUG_ALLOW_ALL
@@ -1312,7 +1311,6 @@ do_add_counters(struct net *net, const void __user *user,
 	const struct xt_table_info *private;
 	int ret = 0;
 	struct ipt_entry *iter;
-	unsigned int addend;
 #ifdef CONFIG_COMPAT
 	struct compat_xt_counters_info compat_tmp;
 
@@ -1366,7 +1364,7 @@ do_add_counters(struct net *net, const void __user *user,
 	}
 
 	i = 0;
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	xt_entry_foreach(iter, private->entries, private->size) {
 		struct xt_counters *tmp;
 
@@ -1374,7 +1372,7 @@ do_add_counters(struct net *net, const void __user *user,
 		ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
 		++i;
 	}
-	xt_write_recseq_end(addend);
+	xt_write_recseq_end();
  unlock_up_free:
 	local_bh_enable();
 	xt_table_unlock(t);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 3c35ced39b42..d0f66a11ed90 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -327,7 +327,6 @@ ip6t_do_table(struct sk_buff *skb,
 	unsigned int *stackptr, origptr, cpu;
 	const struct xt_table_info *private;
 	struct xt_action_param acpar;
-	unsigned int addend;
 
 	/* Initialization */
 	indev = state->in ? state->in->name : nulldevname;
@@ -347,7 +346,7 @@ ip6t_do_table(struct sk_buff *skb,
 	IP_NF_ASSERT(table->valid_hooks & (1 << hook));
 
 	local_bh_disable();
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	private = table->private;
 	/*
 	 * Ensure we load private-> members after we've fetched the base
@@ -439,7 +438,7 @@ ip6t_do_table(struct sk_buff *skb,
 
 	*stackptr = origptr;
 
- 	xt_write_recseq_end(addend);
+ 	xt_write_recseq_end();
  	local_bh_enable();
 
 #ifdef DEBUG_ALLOW_ALL
@@ -1325,7 +1324,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len,
 	const struct xt_table_info *private;
 	int ret = 0;
 	struct ip6t_entry *iter;
-	unsigned int addend;
+
 #ifdef CONFIG_COMPAT
 	struct compat_xt_counters_info compat_tmp;
 
@@ -1379,7 +1378,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len,
 	}
 
 	i = 0;
-	addend = xt_write_recseq_begin();
+	xt_write_recseq_begin();
 	xt_entry_foreach(iter, private->entries, private->size) {
 		struct xt_counters *tmp;
 
@@ -1387,7 +1386,7 @@ do_add_counters(struct net *net, const void __user *user, unsigned int len,
 		ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
 		++i;
 	}
-	xt_write_recseq_end(addend);
+	xt_write_recseq_end();
  unlock_up_free:
 	local_bh_enable();
 	xt_table_unlock(t);






^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-07-02 11:48   ` Florian Westphal
  2015-07-02 15:38     ` Eric Dumazet
@ 2015-07-02 15:43     ` Pablo Neira Ayuso
  1 sibling, 0 replies; 7+ messages in thread
From: Pablo Neira Ayuso @ 2015-07-02 15:43 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel, eric.dumazet

On Thu, Jul 02, 2015 at 01:48:37PM +0200, Florian Westphal wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> wrote:
[...]
> > If we remove it and things are broken, then this will crash with a
> > general protection fault when accessing memory outside the jumpstack
> > boundary. On the other hand, if we keep it, packets will be dropped
> > and it will keep going until someone checks the logs and reports this.
> > If we hit this, then things are really broken, so being aggressive in
> > this case probably makes sense.
> > Moreover, this adds another branch to the packet path (not critical
> > in arptables, but we have it in iptables too).
> > 
> > What do you think?
> 
> I will remove it in the nf-next patch series i am currently working on.
> 
> When this happens something is *seriously* wrong with the ruleset
> validation checks that we have in place.

OK. I'm going to apply this, but will remove the WARN_ON_ONCE to get
this in sync with iptables and given that this will be gone soon.

> > BTW, not related to this patch: Eric Dumazet indicated during the NFWS
> > that it would be a good idea to make this jumpstack fixed length as in
> > nftables, so we can place it on the stack and get rid of this percpu
> > jumpstack that was introduced to cope with reentrancy (only TEE needs
> > this). I've been checking this, but we have no limits at the moment,
> > so the concern is that if we limit this, we may break some crazy setup
> > out there with lots of jumps to chains. So I suspect we cannot get rid
> > of this easily :-(.
> 
> Seems Eric lobbied this to several people ;)
> 
> I'm working on it.
> 
> I agree that using kernel stack with auto-sized variable makes most
> sense BUT since this could theoretically cause userspace breakage I
> decided against it.

Agreed.

> My plan:
> 
> - move tee_active percpu variable to xtables core (suggested by Eric)

I'll need to move this to the netfilter core to support tee from both
iptables and nftables in my next round of patches, just to avoid
strange problems in case people mix iptables and nftables. But I think
this patch should come first, so I'll rebase, no problem.

> - in do_table, check if we're TEE'd or not
> 
> 1. if no, then just use the jumpstack from offset 0 onwards.
> 2. If yes, then fetch jumpstack, and use the upper half:
> 
> if (__this_cpu_read(xt_tee_active))
>  	jumpstack += private->stacksize;
> 
> (jumpstack is twice the size of 'stacksize' to accommodate this).
> 
> This means that the relative stack offset during table traversal
> always starts at 0 and we do not have to store the old stack location
> when we leave the do_table function anymore.
> 
> The stackptr percpu variable is now unused; i'll unify
> it with the jumpstack so that stack is fetched via
> 
> jumpstack  = (struct ipt_entry **) this_cpu_ptr(private->stackptr);
> 
> I was also planning to compute real needed stack size
> (i.e., track largest callchain seen, should be simple by adding this
>  to 'mark_source_chains' function) rather than just using the number
>  of chains.
> 
> In most rulesets the call chain will not be deep, even when there
> are 100 or so user-defined rules.
>
> I'd guess a percpu jumpstack of < 128 bytes is quite realistic for most
> rulesets.

Sounds good. Thanks.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH nf] netfilter: arptables: use percpu jumpstack
  2015-07-02 15:38     ` Eric Dumazet
@ 2015-07-02 20:00       ` Florian Westphal
  0 siblings, 0 replies; 7+ messages in thread
From: Florian Westphal @ 2015-07-02 20:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Florian Westphal, Pablo Neira Ayuso, netfilter-devel

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-07-02 at 13:48 +0200, Florian Westphal wrote:
> 
> > My plan:
> > 
> > - move tee_active percpu variable to xtables core (suggested by Eric)
> > - in do_table, check if we're TEE'd or not
> > 
> > 1. if no, then just use the jumpstack from offset 0 onwards.
> > 2. If yes, then fetch jumpstack, and use the upper half:
> > 
> > if (__this_cpu_read(xt_tee_active))
> >  	jumpstack += private->stacksize;
> 
> Or maybe not using a conditional
> 
> jumpstack += private->stacksize * __this_cpu_read(xt_tee_active);
> 
> 
> BTW, I do not remember why I used a conditional in
> xt_write_recseq_begin(). This also adds extra setup cost, as @addend has
> to be preserved in the stack.
> 
> Hmm... What about something like :

Hmm.  I don't understand how this would work reliably.

   xt_write_recseq_begin(); /* value is now odd */

  /* other cpu fetches counters, blocks on the read seqlock */
   for_each_rule( .. ) {
	-> packet is sent by some target
  	  /* reentry into do_table */
	  xt_write_recseq_begin(); /* value is now even */

          -> other cpu unblocks since it thinks seqlock is taken

   }

This is problematic because get_counters() (and thus the seqlock) seems
to be used as a sync point for table replacement.

And reading Jan's email, we can also have re-entrancy in do_table()
via -j REJECT and SYNPROXY, not just TEE :-/

MAYBE it's worth investigating a dual-stack approach:

Keep a small jumpstack on the local kernel stack.

If table traversal exceeds it, switch to the private->jumpstack.

This adds a conditional, but would avoid all the save/restore setup for
most rulesets.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-07-02 20:00 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-30 20:21 [PATCH nf] netfilter: arptables: use percpu jumpstack Florian Westphal
2015-07-02 11:30 ` Pablo Neira Ayuso
2015-07-02 11:48   ` Jan Engelhardt
2015-07-02 11:48   ` Florian Westphal
2015-07-02 15:38     ` Eric Dumazet
2015-07-02 20:00       ` Florian Westphal
2015-07-02 15:43     ` Pablo Neira Ayuso

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).