* DDoS attack causing bad effect on conntrack searches
@ 2010-04-22 12:58 Jesper Dangaard Brouer
  2010-04-22 13:13 ` Changli Gao
  0 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-22 12:58 UTC (permalink / raw)
  To: Eric Dumazet, Patrick McHardy
  Cc: Linux Kernel Network Hackers, netfilter-devel, Paul E McKenney


At an unnamed ISP, we experienced a DDoS attack against one of our
customers.  This attack also caused problems for one of our Linux
based routers.

The attack was "only" generating 300 kpps (packets per sec), which
usually isn't a problem for this (fairly old) Linux Router.  But the
conntracking system choked and reduced pps processing power to
40kpps.

I do extensive RRD/graph monitoring of the machines.  The IP conntrack
searches exploded in this period, to a stunning 700.000 searches per
sec.

http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png

First I thought it might be caused by bad hashing, but after reading
the kernel code (func: __nf_conntrack_find()), I think it's caused by
the loop restart (goto begin) of the conntrack search, running under
local_bh_disable().  These RCU changes to conntrack were introduced in
ea781f19 by Eric Dumazet.

Code: net/netfilter/nf_conntrack_core.c
Func: __nf_conntrack_find()

struct nf_conntrack_tuple_hash *
__nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
{
	struct nf_conntrack_tuple_hash *h;
	struct hlist_nulls_node *n;
	unsigned int hash = hash_conntrack(tuple);

	/* Disable BHs the entire time since we normally need to disable them
	 * at least once for the stats anyway.
	 */
	local_bh_disable();
begin:
	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
		if (nf_ct_tuple_equal(tuple, &h->tuple)) {
			NF_CT_STAT_INC(net, found);
			local_bh_enable();
			return h;
		}
		NF_CT_STAT_INC(net, searched);
	}
	/*
	 * if the nulls value we got at the end of this lookup is
	 * not the expected one, we must restart lookup.
	 * We probably met an item that was moved to another chain.
	 */
	if (get_nulls_value(n) != hash)
		goto begin;
	local_bh_enable();

	return NULL;
}

From the graphs:
 http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html

It's possible to see that the problems are most likely caused by the
number of conntrack elements being deleted.

http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_delete001.png

If you look closely at the graphs, you should be able to see that
CPU1 is doing all the conntrack "searches", and CPU2 is doing most of
the conntrack "deletes" (while CPU1 is creating a lot of new entries).


The question is: how do we avoid this unfortunate behavior, where the
delete process disturbs the search process (and sends it into a
loop)?


-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


Extra info: Conntrack tuning
-----------
I have tuned the conntrack system on these hosts.  Firstly I have
increased the number of hash buckets for the conntrack system to
around 300.000.

 cat /sys/module/nf_conntrack/parameters/hashsize
 300032

Next I have increased the max conntracking elements to 900.000.

 cat /proc/sys/net/nf_conntrack_max
 900000




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 12:58 DDoS attack causing bad effect on conntrack searches Jesper Dangaard Brouer
@ 2010-04-22 13:13 ` Changli Gao
  2010-04-22 13:17   ` Patrick McHardy
  2010-04-22 13:31   ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 54+ messages in thread
From: Changli Gao @ 2010-04-22 13:13 UTC (permalink / raw)
  To: hawk
  Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney

On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>
> At an unnamed ISP, we experienced a DDoS attack against one of our
> customers.  This attack also caused problems for one of our Linux
> based routers.
>
> The attack was "only" generating 300 kpps (packets per sec), which
> usually isn't a problem for this (fairly old) Linux Router.  But the
> conntracking system choked and reduced pps processing power to
> 40kpps.
>
> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> searches in the period exploded, to a stunning 700.000 searches per
> sec.
>
> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>
> First I thought it might be caused by bad hashing, but after reading
> the kernel code (func: __nf_conntrack_find()), I think it's caused by
> the loop restart (goto begin) of the conntrack search, running under
> local_bh_disable().  These RCU changes to conntrack were introduced in
> ea781f19 by Eric Dumazet.
>
> Code: net/netfilter/nf_conntrack_core.c
> Func: __nf_conntrack_find()
>
> struct nf_conntrack_tuple_hash *
> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> {
>        struct nf_conntrack_tuple_hash *h;
>        struct hlist_nulls_node *n;
>        unsigned int hash = hash_conntrack(tuple);
>
>        /* Disable BHs the entire time since we normally need to disable them
>         * at least once for the stats anyway.
>         */
>        local_bh_disable();
> begin:
>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
>                        NF_CT_STAT_INC(net, found);
>                        local_bh_enable();
>                        return h;
>                }
>                NF_CT_STAT_INC(net, searched);
>        }
>        /*
>         * if the nulls value we got at the end of this lookup is
>         * not the expected one, we must restart lookup.
>         * We probably met an item that was moved to another chain.
>         */
>        if (get_nulls_value(n) != hash)
>                goto begin;
>        local_bh_enable();
>

We should add a retry limit there.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 13:13 ` Changli Gao
@ 2010-04-22 13:17   ` Patrick McHardy
  2010-04-22 14:36     ` Eric Dumazet
  2010-04-22 13:31   ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-04-22 13:17 UTC (permalink / raw)
  To: Changli Gao
  Cc: hawk, Eric Dumazet, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney

Changli Gao wrote:
> On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
>> At an unnamed ISP, we experienced a DDoS attack against one of our
>> customers.  This attack also caused problems for one of our Linux
>> based routers.
>>
>> The attack was "only" generating 300 kpps (packets per sec), which
>> usually isn't a problem for this (fairly old) Linux Router.  But the
>> conntracking system choked and reduced pps processing power to
>> 40kpps.
>>
>> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
>> searches in the period exploded, to a stunning 700.000 searches per
>> sec.
>>
>> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
>>
>> First I thought it might be caused by bad hashing, but after reading
>> the kernel code (func: __nf_conntrack_find()), I think it's caused by
>> the loop restart (goto begin) of the conntrack search, running under
>> local_bh_disable().  These RCU changes to conntrack were introduced in
>> ea781f19 by Eric Dumazet.
>>
>> Code: net/netfilter/nf_conntrack_core.c
>> Func: __nf_conntrack_find()
>>
>> struct nf_conntrack_tuple_hash *
>> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
>> {
>>        struct nf_conntrack_tuple_hash *h;
>>        struct hlist_nulls_node *n;
>>        unsigned int hash = hash_conntrack(tuple);
>>
>>        /* Disable BHs the entire time since we normally need to disable them
>>         * at least once for the stats anyway.
>>         */
>>        local_bh_disable();
>> begin:
>>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
>>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
>>                        NF_CT_STAT_INC(net, found);
>>                        local_bh_enable();
>>                        return h;
>>                }
>>                NF_CT_STAT_INC(net, searched);
>>        }
>>        /*
>>         * if the nulls value we got at the end of this lookup is
>>         * not the expected one, we must restart lookup.
>>         * We probably met an item that was moved to another chain.
>>         */
>>        if (get_nulls_value(n) != hash)
>>                goto begin;
>>        local_bh_enable();
>>
> 
> We should add a retry limit there.

We can't do that since that would allow false negatives.


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 13:13 ` Changli Gao
  2010-04-22 13:17   ` Patrick McHardy
@ 2010-04-22 13:31   ` Jesper Dangaard Brouer
  2010-04-23 10:35     ` Patrick McHardy
  1 sibling, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-22 13:31 UTC (permalink / raw)
  To: Changli Gao
  Cc: Eric Dumazet, Patrick McHardy, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney


I have added a stats counter to prove my case; I think we should add it to the kernel, to detect this case in the future.
The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.



[PATCH] net: netfilter conntrack extended with extra stat counter.

From: Jesper Dangaard Brouer <hawk@comx.dk>

I suspect an unfortunate series of events occurring under a DDoS
attack, in the function __nf_conntrack_find() in nf_conntrack_core.c.

Adding a stats counter to see if the search is restarted too often.

Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
---

 include/linux/netfilter/nf_conntrack_common.h      |    1 +
 .../netfilter/nf_conntrack_l3proto_ipv4_compat.c   |    7 ++++---
 net/netfilter/nf_conntrack_core.c                  |    4 +++-
 net/netfilter/nf_conntrack_standalone.c            |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)


diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index a8248ee..57ff312 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -93,6 +93,7 @@ struct ip_conntrack_stat
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int search_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 8668a3d..b94f510 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 4299db7..5ca8286 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -315,8 +315,10 @@ begin:
 	 * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
-	if (get_nulls_value(n) != hash)
+	if (get_nulls_value(n) != hash) {
+		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
+	}
 	local_bh_enable();
 
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index 1935153..4812750 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -245,12 +245,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete search_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -267,7 +267,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->search_restart
 		);
 	return 0;
 }



-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 13:17   ` Patrick McHardy
@ 2010-04-22 14:36     ` Eric Dumazet
  2010-04-22 14:53       ` Eric Dumazet
  2010-04-23 10:56       ` DDoS attack causing bad effect on conntrack searches Patrick McHardy
  0 siblings, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-22 14:36 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney

Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
> Changli Gao wrote:
> > On Thu, Apr 22, 2010 at 8:58 PM, Jesper Dangaard Brouer <hawk@comx.dk> wrote:
> >> At an unnamed ISP, we experienced a DDoS attack against one of our
> >> customers.  This attack also caused problems for one of our Linux
> >> based routers.
> >>
> >> The attack was "only" generating 300 kpps (packets per sec), which
> >> usually isn't a problem for this (fairly old) Linux Router.  But the
> >> conntracking system choked and reduced pps processing power to
> >> 40kpps.
> >>
> >> I do extensive RRD/graph monitoring of the machines.  The IP conntrack
> >> searches in the period exploded, to a stunning 700.000 searches per
> >> sec.
> >>
> >> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_searches001.png
> >>
> >> First I thought it might be caused by bad hashing, but after reading
> >> the kernel code (func: __nf_conntrack_find()), I think it's caused by
> >> the loop restart (goto begin) of the conntrack search, running under
> >> local_bh_disable().  These RCU changes to conntrack were introduced in
> >> ea781f19 by Eric Dumazet.
> >>
> >> Code: net/netfilter/nf_conntrack_core.c
> >> Func: __nf_conntrack_find()
> >>
> >> struct nf_conntrack_tuple_hash *
> >> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
> >> {
> >>        struct nf_conntrack_tuple_hash *h;
> >>        struct hlist_nulls_node *n;
> >>        unsigned int hash = hash_conntrack(tuple);
> >>
> >>        /* Disable BHs the entire time since we normally need to disable them
> >>         * at least once for the stats anyway.
> >>         */
> >>        local_bh_disable();
> >> begin:
> >>        hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
> >>                if (nf_ct_tuple_equal(tuple, &h->tuple)) {
> >>                        NF_CT_STAT_INC(net, found);
> >>                        local_bh_enable();
> >>                        return h;
> >>                }
> >>                NF_CT_STAT_INC(net, searched);
> >>        }
> >>        /*
> >>         * if the nulls value we got at the end of this lookup is
> >>         * not the expected one, we must restart lookup.
> >>         * We probably met an item that was moved to another chain.
> >>         */
> >>        if (get_nulls_value(n) != hash)
> >>                goto begin;
> >>        local_bh_enable();
> >>
> > 
> > We should add a retry limit there.
> 
> We can't do that since that would allow false negatives.

If one hash slot is under attack, then there is a bug somewhere.

If we cannot avoid this, we can fall back to a secure mode at the second
retry, and take the spinlock.

This way, most lookups stay lockless (one pass), and some might take
the slot lock to avoid the possibility of a loop.

I suspect a bug elsewhere, quite frankly!

We have a chain whose end pointer doesn't match the expected
one.






* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 14:36     ` Eric Dumazet
@ 2010-04-22 14:53       ` Eric Dumazet
  2010-04-22 15:51         ` Paul E. McKenney
  2010-04-23 10:56       ` DDoS attack causing bad effect on conntrack searches Patrick McHardy
  1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-22 14:53 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney

Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :

> If one hash slot is under attack, then there is a bug somewhere.
> 
> If we cannot avoid this, we can fallback to a secure mode at the second
> retry, and take the spinlock.
> 
> This way, most lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.
> 
> I suspect a bug elsewhere, quite frankly !
> 
> We have a chain whose end pointer doesn't match the expected
> one.
> 

In a normal situation, we always finish the lookup:

1) We found the thing we were looking for.

2) We reached the list end (item not found); we then check that it is the
expected end.

It is _not_ the expected end only if some writer deleted/inserted an
element in _this_ chain during our lookup.

Because our lookup is lockless, we then have to redo it because we might
miss the object we are looking for.

If we can do the 'retry' 10 times, it means the attacker was really
clever enough to inject new packets (new conntracks) at the right
moment, in the right hash chain, and this sounds so highly incredible
that I cannot believe it at all :)





* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 14:53       ` Eric Dumazet
@ 2010-04-22 15:51         ` Paul E. McKenney
  2010-04-22 16:02           ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2010-04-22 15:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel

On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> 
> > If one hash slot is under attack, then there is a bug somewhere.
> > 
> > If we cannot avoid this, we can fallback to a secure mode at the second
> > retry, and take the spinlock.
> > 
> > This way, most lookups stay lockless (one pass), and some might take
> > the slot lock to avoid the possibility of a loop.
> > 
> > I suspect a bug elsewhere, quite frankly !
> > 
> > We have a chain whose end pointer doesn't match the expected
> > one.
> > 
> 
> On normal situation, we always finish the lookup :
> 
> 1) If we found the thing we were looking at.
> 
> 2) We get the list end (item not found), we then check if it is the
> expected end.
> 
> It is _not_ the expected end only if some writer deleted/inserted an
> element in _this_ chain during our lookup.

So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
elements?  (Not obvious from the code, but my ignorance of the networking
code is such that many things in that part of the kernel are not obvious
to me, I am afraid.)

Otherwise, of course you would simply allow deleted elements to continue
pointing where they did previously, so that concurrent readers would not
miss anything.

Of course, the same potential might arise on insertion, but it is usually
OK to miss an element that was inserted after you started searching.

> Because our lookup is lockless, we then have to redo it because we might
> miss the object we are looking for.

Ah...  Is there also a resize operation?  Herbert did do a resizable
hash table recently, but I was under the impression that (1) it was in
some other part of the networking stack and (2) it avoided the need to
restart readers.

> If we can do the 'retry' 10 times, it means the attacker was really
> clever enough to inject new packets (new conntracks) at the right
> moment, in the right hash chain, and this sounds so highly incredible
> that I cannot believe it at all :)

Or maybe the DoS attack is injecting so many new conntracks that a large
fraction of the hash chains are being modified at any given time?

							Thanx, Paul


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 15:51         ` Paul E. McKenney
@ 2010-04-22 16:02           ` Eric Dumazet
  2010-04-22 16:34             ` Paul E. McKenney
  2010-04-22 20:38             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-22 16:02 UTC (permalink / raw)
  To: paulmck
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel

Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > 
> > > If one hash slot is under attack, then there is a bug somewhere.
> > > 
> > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > retry, and take the spinlock.
> > > 
> > > This way, most lookups stay lockless (one pass), and some might take
> > > the slot lock to avoid the possibility of a loop.
> > > 
> > > I suspect a bug elsewhere, quite frankly !
> > > 
> > > We have a chain whose end pointer doesn't match the expected
> > > one.
> > > 
> > 
> > On normal situation, we always finish the lookup :
> > 
> > 1) If we found the thing we were looking at.
> > 
> > 2) We get the list end (item not found), we then check if it is the
> > expected end.
> > 
> > It is _not_ the expected end only if some writer deleted/inserted an
> > element in _this_ chain during our lookup.
> 
> So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> elements?  (Not obvious from the code, but my ignorance of the networking
> code is such that many things in that part of the kernel are not obvious
> to me, I am afraid.)
> 

Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

> Otherwise, of course you would simply allow deleted elements to continue
> pointing where they did previously, so that concurrent readers would not
> miss anything.
> 




> Of course, the same potential might arise on insertion, but it is usually
> OK to miss an element that was inserted after you started searching.
> 
> > Because our lookup is lockless, we then have to redo it because we might
> > miss the object we are looking for.
> 
> Ah...  Is there also a resize operation?  Herbert did do a resizable
> hash table recently, but I was under the impression that (1) it was in
> some other part of the networking stack and (2) it avoided the need to
> restart readers.
> 
> > If we can do the 'retry' 10 times, it means the attacker was really
> > clever enough to inject new packets (new conntracks) at the right
> > moment, in the right hash chain, and this sounds so highly incredible
> > that I cannot believe it at all :)
> 
> Or maybe the DoS attack is injecting so many new conntracks that a large
> fraction of the hash chains are being modified at any given time?
> 
> 							Thanx, Paul

maybe hash table has one slot :)




* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 16:02           ` Eric Dumazet
@ 2010-04-22 16:34             ` Paul E. McKenney
  2010-04-22 20:38             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 54+ messages in thread
From: Paul E. McKenney @ 2010-04-22 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel

On Thu, Apr 22, 2010 at 06:02:08PM +0200, Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> > On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > > Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > > 
> > > > If one hash slot is under attack, then there is a bug somewhere.
> > > > 
> > > > If we cannot avoid this, we can fallback to a secure mode at the second
> > > > retry, and take the spinlock.
> > > > 
> > > > This way, most lookups stay lockless (one pass), and some might take
> > > > the slot lock to avoid the possibility of a loop.
> > > > 
> > > > I suspect a bug elsewhere, quite frankly !
> > > > 
> > > > We have a chain whose end pointer doesn't match the expected
> > > > one.
> > > > 
> > > 
> > > On normal situation, we always finish the lookup :
> > > 
> > > 1) If we found the thing we were looking at.
> > > 
> > > 2) We get the list end (item not found), we then check if it is the
> > > expected end.
> > > 
> > > It is _not_ the expected end only if some writer deleted/inserted an
> > > element in _this_ chain during our lookup.
> > 
> > So this situation uses SLAB_DESTROY_BY_RCU to quickly recycle deleted
> > elements?  (Not obvious from the code, but my ignorance of the networking
> > code is such that many things in that part of the kernel are not obvious
> > to me, I am afraid.)
> 
> Yes, this uses SLAB_DESTROY_BY_RCU, like tcp/udp lookups.

OK, that will do it!!!  ;-)

One way of throttling the bad effects of updates on readers is to
periodically force updates through a grace period.  But this seems
to be a very big hammer, and likely to have little practical effect.

Another approach would be to have multiple list pointers per element,
so that a given element could be reused a small number of times without
messing up concurrent readers (sort of like Herbert's resizable hash
table).

But as you say, if some other bug is really behind this, better to fix
that bug than to work around it.

> > Otherwise, of course you would simply allow deleted elements to continue
> > pointing where they did previously, so that concurrent readers would not
> > miss anything.
> 
> > Of course, the same potential might arise on insertion, but it is usually
> > OK to miss an element that was inserted after you started searching.
> > 
> > > Because our lookup is lockless, we then have to redo it because we might
> > > miss the object we are looking for.
> > 
> > Ah...  Is there also a resize operation?  Herbert did do a resizable
> > hash table recently, but I was under the impression that (1) it was in
> > some other part of the networking stack and (2) it avoided the need to
> > restart readers.
> > 
> > > If we can do the 'retry' 10 times, it means the attacker was really
> > > clever enough to inject new packets (new conntracks) at the right
> > > moment, in the right hash chain, and this sounds so highly incredible
> > > that I cannot believe it at all :)
> > 
> > Or maybe the DoS attack is injecting so many new conntracks that a large
> > fraction of the hash chains are being modified at any given time?
> 
> maybe hash table has one slot :)

;-) ;-) ;-)

							Thanx, Paul


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 16:02           ` Eric Dumazet
  2010-04-22 16:34             ` Paul E. McKenney
@ 2010-04-22 20:38             ` Jesper Dangaard Brouer
  2010-04-22 21:03               ` Eric Dumazet
  2010-04-23 20:57               ` Eric Dumazet
  1 sibling, 2 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-22 20:38 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers



On Thu, 22 Apr 2010, Eric Dumazet wrote:

> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
>> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
>>> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
>>>
>>> If we can do the 'retry' 10 times, it means the attacker was really
>>> clever enough to inject new packets (new conntracks) at the right
>>> moment, in the right hash chain, and this sounds so highly incredible
>>> that I cannot believe it at all :)
>>
>> Or maybe the DoS attack is injecting so many new conntracks that a large
>> fraction of the hash chains are being modified at any given time?
>>

I think it's plausible; there is a lot of modification going on.
Approx 40.000 deletes/sec and 40.000 inserts/sec.
The hash bucket size is 300032, and with 80000 modifications/sec, we are 
(potentially) changing 26.6% of the hash chains each second.

As can be seen from the graphs:
  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html

Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is 
caught searching...


> maybe hash table has one slot :)

Guess I have to reproduce the DoS attack in a testlab (the earliest I
will have time is Tuesday), so we can determine whether it's bad hashing
or the restart of the search loop.


The traffic pattern was fairly simple:

200-byte UDP packets, coming from approx 60 source IPs, going to one 
destination IP.  The UDP destination port number was varied in the range 
of 1 to 6000.   The source UDP port was varied a bit more, some ranging 
from 32768 to 61000, and some from 1028 to 5000.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 20:38             ` Jesper Dangaard Brouer
@ 2010-04-22 21:03               ` Eric Dumazet
  2010-04-22 21:14                 ` Eric Dumazet
                                   ` (2 more replies)
  2010-04-23 20:57               ` Eric Dumazet
  1 sibling, 3 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-22 21:03 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :
> On Thu, 22 Apr 2010, Eric Dumazet wrote:
> 
> > Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> >> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> >>> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> >>>
> >>> If we can do the 'retry' a 10 times, it means the attacker was really
> >>> clever enough to inject new packets (new conntracks) at the right
> >>> moment, in the right hash chain, and this sounds so higly incredible
> >>> that I cannot believe it at all :)
> >>
> >> Or maybe the DoS attack is injecting so many new conntracks that a large
> >> fraction of the hash chains are being modified at any given time?
> >>
> 
> I think its plausable, there is a lot of modification going on.
> Approx 40.000 deletes/sec and 40.000 inserts/sec.
> The hash bucket size is 300032, and with 80000 modifications/sec, we are 
> (potentially) changing 26.6% of the hash chains each second.
> 

OK, but a lookup lasts a fraction of a microsecond, unless interrupted by
a hard IRQ.

The probability of a change during a lookup should be very, very small.

Note that the scenario for a restart is:

The lookup goes through the chain.
While it is examining one object, this object is deleted.
The object is re-allocated by another CPU and inserted into a new chain.


What exact version of the kernel are you running?

> As can be seen from the graphs:
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
> 
> Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is 
> caught searching...
> 
> 
> > maybe hash table has one slot :)
> 
> Guess I have to reproduce the DoS attack in a testlab (I will first have 
> time Tuesday).  So we can determine if its bad hashing or restart of the 
> search loop.
> 
> 
> The traffic pattern was fairly simple:
> 
> 200 bytes UDP packets, comming from approx 60 source IPs, going to one 
> destination IP.  The UDP destination port number was varied in the range 
> of 1 to 6000.   The source UDP port was varied a bit more, some ranging 
> from 32768 to 61000, and some from 1028 to 5000.
> 
> 
> Cheers,
>    Jesper Brouer
> 
> --
> -------------------------------------------------------------------
> MSc. Master of Computer Science
> Dept. of Computer Science, University of Copenhagen
> Author of http://www.adsl-optimizer.dk
> -------------------------------------------------------------------


--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:03               ` Eric Dumazet
@ 2010-04-22 21:14                 ` Eric Dumazet
  2010-04-22 23:44                   ` David Miller
  2010-04-23 10:36                   ` Patrick McHardy
  2010-04-22 21:28                 ` Jesper Dangaard Brouer
  2010-04-23 10:55                 ` Patrick McHardy
  2 siblings, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-22 21:14 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le jeudi 22 avril 2010 à 23:03 +0200, Eric Dumazet a écrit :
> Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :
> > On Thu, 22 Apr 2010, Eric Dumazet wrote:
> > 
> > > Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
> > >> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
> > >>> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
> > >>>
> > >>> If we can do the 'retry' a 10 times, it means the attacker was really
> > >>> clever enough to inject new packets (new conntracks) at the right
> > >>> moment, in the right hash chain, and this sounds so higly incredible
> > >>> that I cannot believe it at all :)
> > >>
> > >> Or maybe the DoS attack is injecting so many new conntracks that a large
> > >> fraction of the hash chains are being modified at any given time?
> > >>
> > 
> > I think its plausable, there is a lot of modification going on.
> > Approx 40.000 deletes/sec and 40.000 inserts/sec.
> > The hash bucket size is 300032, and with 80000 modifications/sec, we are 
> > (potentially) changing 26.6% of the hash chains each second.
> > 
> 
> OK but a lookup last a fraction of a micro second, unless interrupted by
> hard irq.
> 
> Probability of a change during a lookup should be very very small.
> 
> Note that the scenario for a restart is :
> 
> The lookup go through the chain.
> While it is examining one object, this object is deleted.
> The object is re-allocated by another cpu and inserted to a new chain.
> 
> 
> What exact version of kernel are you running ?
> 
> > As can be seen from the graphs:
> >   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
> > 
> > Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is 
> > caught searching...
> > 
> > 
> > > maybe hash table has one slot :)
> > 
> > Guess I have to reproduce the DoS attack in a testlab (I will first have 
> > time Tuesday).  So we can determine if its bad hashing or restart of the 
> > search loop.
> > 

Or very long chains, if the attacker managed to find a jhash flaw.

You could add a lookup_restart counter:


 include/linux/netfilter/nf_conntrack_common.h         |    1 +
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c |    7 ++++---
 net/netfilter/nf_conntrack_core.c                     |    4 +++-
 net/netfilter/nf_conntrack_standalone.c               |    7 ++++---
 4 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index c608677..cf9a8df 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -113,6 +113,7 @@ struct ip_conntrack_stat {
 	unsigned int expect_new;
 	unsigned int expect_create;
 	unsigned int expect_delete;
+	unsigned int lookup_restart;
 };
 
 /* call to create an explicit dependency on nf_conntrack. */
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
index 2fb7b76..95f2227 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4_compat.c
@@ -336,12 +336,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete lookup_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -358,7 +358,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->lookup_restart
 		);
 	return 0;
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 0c9bbe9..68e53f1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -319,8 +319,10 @@ begin:
 	 * not the expected one, we must restart lookup.
 	 * We probably met an item that was moved to another chain.
 	 */
-	if (get_nulls_value(n) != hash)
+	if (unlikely(get_nulls_value(n) != hash)) {
+		NF_CT_STAT_INC(net, lookup_restart);
 		goto begin;
+	}
 	local_bh_enable();
 
 	return NULL;
diff --git a/net/netfilter/nf_conntrack_standalone.c b/net/netfilter/nf_conntrack_standalone.c
index faa8eb3..c8a286e 100644
--- a/net/netfilter/nf_conntrack_standalone.c
+++ b/net/netfilter/nf_conntrack_standalone.c
@@ -252,12 +252,12 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 	const struct ip_conntrack_stat *st = v;
 
 	if (v == SEQ_START_TOKEN) {
-		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete\n");
+		seq_printf(seq, "entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete lookup_restart\n");
 		return 0;
 	}
 
 	seq_printf(seq, "%08x  %08x %08x %08x %08x %08x %08x %08x "
-			"%08x %08x %08x %08x %08x  %08x %08x %08x \n",
+			"%08x %08x %08x %08x %08x  %08x %08x %08x %08x\n",
 		   nr_conntracks,
 		   st->searched,
 		   st->found,
@@ -274,7 +274,8 @@ static int ct_cpu_seq_show(struct seq_file *seq, void *v)
 
 		   st->expect_new,
 		   st->expect_create,
-		   st->expect_delete
+		   st->expect_delete,
+		   st->lookup_restart
 		);
 	return 0;
 }



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:03               ` Eric Dumazet
  2010-04-22 21:14                 ` Eric Dumazet
@ 2010-04-22 21:28                 ` Jesper Dangaard Brouer
  2010-04-23  7:23                   ` Jan Engelhardt
  2010-04-23 10:55                 ` Patrick McHardy
  2 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-22 21:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, hawk, Linux Kernel Network Hackers,
	Netfilter Developers

On Thu, 22 Apr 2010, Eric Dumazet wrote:

> What exact version of kernel are you running ?

2.6.31.7-pvlan2G #3 SMP PREEMPT
32-bit kernel with 2G kernel mem (you showed me that trick).

I have some patches on top (which are accepted upstream), but the tree is 
based on commit f8ebcb2ebc49a, Linux 2.6.31.7.


Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:14                 ` Eric Dumazet
@ 2010-04-22 23:44                   ` David Miller
  2010-04-23  5:44                     ` Eric Dumazet
  2010-04-23 10:36                   ` Patrick McHardy
  1 sibling, 1 reply; 54+ messages in thread
From: David Miller @ 2010-04-22 23:44 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel


Eric, I wonder if we run into some kind of issue on 32-bit systems
because we always lose a bit of the conntrack hash value when we store
it into the 'nulls' area?

Wouldn't that make the "get_nulls_value(n) != hash" fail?

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 23:44                   ` David Miller
@ 2010-04-23  5:44                     ` Eric Dumazet
  2010-04-23  8:13                       ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23  5:44 UTC (permalink / raw)
  To: David Miller; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel

Le jeudi 22 avril 2010 à 16:44 -0700, David Miller a écrit :
> Eric, I wonder if we run into some kind of issue on 32-bit systems
> because we always lose a bit of the conntrack hash value when we store
> it into the 'nulls' area?
> 
> Wouldn't that make the "get_nulls_value(n) != hash" fail?
> --


Well, 'hash' at this point is not the result of the jhash() transform [0
- 0xFFFFFFFF], but a slot number in the htable [0 - (300032-1)].


And we can have a nulls_value of up to 0x7FFFFFFF (31 bits):

static inline unsigned long get_nulls_value(const struct hlist_nulls_node *ptr)
{
        return ((unsigned long)ptr) >> 1;
}





^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:28                 ` Jesper Dangaard Brouer
@ 2010-04-23  7:23                   ` Jan Engelhardt
  2010-04-23  7:46                     ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Engelhardt @ 2010-04-23  7:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Patrick McHardy, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:

> On Thu, 22 Apr 2010, Eric Dumazet wrote:
>
>> What exact version of kernel are you running ?
>
> 2.6.31.7-pvlan2G #3 SMP PREEMPT
> 32-bit kernel with 2G kernel mem (you showed me that trick).

Since when is enabling 2G a trick? :)
There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  7:23                   ` Jan Engelhardt
@ 2010-04-23  7:46                     ` Eric Dumazet
  2010-04-23  7:55                       ` Jan Engelhardt
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23  7:46 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le vendredi 23 avril 2010 à 09:23 +0200, Jan Engelhardt a écrit :
> On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:
> 
> > On Thu, 22 Apr 2010, Eric Dumazet wrote:
> >
> >> What exact version of kernel are you running ?
> >
> > 2.6.31.7-pvlan2G #3 SMP PREEMPT
> > 32-bit kernel with 2G kernel mem (you showed me that trick).
> 
> Since when is enabling 2G a trick? :)
> There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.
> 

Yes, when you know it, it's not a trick anymore :)

Years ago, we had to manually change PAGE_OFFSET, and I remember some
machines with PAGE_OFFSET 0xA0000000 (1.5 GB LOWMEM)
or 0xB0000000 (1.25 GB), with PAE off.




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  7:46                     ` Eric Dumazet
@ 2010-04-23  7:55                       ` Jan Engelhardt
  2010-04-23  9:23                         ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Jan Engelhardt @ 2010-04-23  7:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

On Friday 2010-04-23 09:46, Eric Dumazet wrote:

>Le vendredi 23 avril 2010 à 09:23 +0200, Jan Engelhardt a écrit :
>> On Thursday 2010-04-22 23:28, Jesper Dangaard Brouer wrote:
>> 
>> > On Thu, 22 Apr 2010, Eric Dumazet wrote:
>> >
>> >> What exact version of kernel are you running ?
>> >
>> > 2.6.31.7-pvlan2G #3 SMP PREEMPT
>> > 32-bit kernel with 2G kernel mem (you showed me that trick).
>> 
>> Since when is enabling 2G a trick? :)
>> There's CONFIG_VMSPLIT_2G (and 2G_OPT) for quite some time now.
>> 
>
>Yes, when you know it, its not a trick anymore :)
>
>Years ago, we had to manually change PAGE_OFFSET, and I remember some
>machines with PAGE_OFFSET 0xA0000000  (1.5 GB LOWMEM), 
>or 0xB0000000 (1.25 GB), (PAE off)

I notice that 0xB0000000, which is now known as VMSPLIT_3G_OPT,
is only available when PAE is off. Would you know the reason for
that decision? Are some values unsuitable for PAE?



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  5:44                     ` Eric Dumazet
@ 2010-04-23  8:13                       ` David Miller
  2010-04-23  8:18                         ` David Miller
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2010-04-23  8:13 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 23 Apr 2010 07:44:38 +0200

> Le jeudi 22 avril 2010 à 16:44 -0700, David Miller a écrit :
>> Eric, I wonder if we run into some kind of issue on 32-bit systems
>> because we always lose a bit of the conntrack hash value when we store
>> it into the 'nulls' area?
>> 
>> Wouldn't that make the "get_nulls_value(n) != hash" fail?
>> --
> 
> 
> Well, 'hash' at this time is not the result of the jhash() transform [0
> - 0xFFFFFFFF], but a slot number in htable [0 - (300032-1)].

Aha, I see.

I really can't see what might cause this behavior then.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  8:13                       ` David Miller
@ 2010-04-23  8:18                         ` David Miller
  2010-04-23  8:40                           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 54+ messages in thread
From: David Miller @ 2010-04-23  8:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: hawk, paulmck, kaber, xiaosuo, hawk, netdev, netfilter-devel

From: David Miller <davem@davemloft.net>
Date: Fri, 23 Apr 2010 01:13:28 -0700 (PDT)

> I really can't see what might cause this behavior then.

This all reminds me of the namespace bug we dealt with
a month or two ago.

Jesper, you don't happen to be using network namespaces are you?
Because if so, the following might be your cure.

commit 5b3501faa8741d50617ce4191c20061c6ef36cb3
Author: Eric Dumazet <eric.dumazet@gmail.com>
Date:   Mon Feb 8 11:16:56 2010 -0800

    netfilter: nf_conntrack: per netns nf_conntrack_cachep

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  8:18                         ` David Miller
@ 2010-04-23  8:40                           ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-23  8:40 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, paulmck, Patrick McHardy, xiaosuo, netdev,
	Netfilter Developers

On Fri, 23 Apr 2010, David Miller wrote:

> This all reminds me of the namespace bug we dealt with
> a month or two ago.
>
> Jesper, you don't happen to be using network namespaces are you?

No, I don't use network namespaces.
(In .config CONFIG_NAMESPACES is not set.)

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23  7:55                       ` Jan Engelhardt
@ 2010-04-23  9:23                         ` Eric Dumazet
  0 siblings, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23  9:23 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Jesper Dangaard Brouer, Patrick McHardy, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le vendredi 23 avril 2010 à 09:55 +0200, Jan Engelhardt a écrit :
> On Friday 2010-04-23 09:46, Eric Dumazet wrote:
> >Years ago, we had to manually change PAGE_OFFSET, and I remember some
> >machines with PAGE_OFFSET 0xA0000000  (1.5 GB LOWMEM), 
> >or 0xB0000000 (1.25 GB), (PAE off)
> 
> I notice that 0xB0000000, which is now known as LOWMEM_3G_OPT,
> is only available when PAE is off. Would you know the reason for
> that decision? Are some values unsuitable for PAE?
> 

If PAE is on, PAGE_OFFSET must be a multiple of 1 GB.
This is because of hardware limitations.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 13:31   ` Jesper Dangaard Brouer
@ 2010-04-23 10:35     ` Patrick McHardy
  0 siblings, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 10:35 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Changli Gao, Eric Dumazet, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney

Jesper Dangaard Brouer wrote:
> I have added a stats counter to prove my case, which I think we should add to the kernel (to detect the case in the future).
> The DDoS attack has disappeared, so I guess I'll try to see if I can reproduce the problem in my testlab.
> 
> 
> 
> [PATCH] net: netfilter conntrack extended with extra stat counter.
> 
> From: Jesper Dangaard Brouer <hawk@comx.dk>
> 
> I suspect an unfortunate series of events occurring under a DDoS
> attack, in function __nf_conntrack_find() in nf_conntrack_core.c.
> 
> Adding a stats counter to see if the search is restarted too often.

Applied, thanks Jesper.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:14                 ` Eric Dumazet
  2010-04-22 23:44                   ` David Miller
@ 2010-04-23 10:36                   ` Patrick McHardy
  2010-04-23 11:06                     ` Eric Dumazet
  1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 23:03 +0200, Eric Dumazet a écrit :
>>> Guess I have to reproduce the DoS attack in a testlab (I will first have 
>>> time Tuesday).  So we can determine if its bad hashing or restart of the 
>>> search loop.
>>>
> 
> Or very long chains, if attacker managed to find a jhash flaw.

That should be visible in the "searched" statistic.

> You could add a lookup_restart counter :

I've applied Jesper's equivalent patch.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 21:03               ` Eric Dumazet
  2010-04-22 21:14                 ` Eric Dumazet
  2010-04-22 21:28                 ` Jesper Dangaard Brouer
@ 2010-04-23 10:55                 ` Patrick McHardy
  2010-04-23 11:05                   ` Eric Dumazet
  2 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 10:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :
>> On Thu, 22 Apr 2010, Eric Dumazet wrote:
>>
>>> Le jeudi 22 avril 2010 à 08:51 -0700, Paul E. McKenney a écrit :
>>>> On Thu, Apr 22, 2010 at 04:53:49PM +0200, Eric Dumazet wrote:
>>>>> Le jeudi 22 avril 2010 à 16:36 +0200, Eric Dumazet a écrit :
>>>>>
>>>>> If we can do the 'retry' a 10 times, it means the attacker was really
>>>>> clever enough to inject new packets (new conntracks) at the right
>>>>> moment, in the right hash chain, and this sounds so higly incredible
>>>>> that I cannot believe it at all :)
>>>> Or maybe the DoS attack is injecting so many new conntracks that a large
>>>> fraction of the hash chains are being modified at any given time?
>>>>
>> I think its plausable, there is a lot of modification going on.
>> Approx 40.000 deletes/sec and 40.000 inserts/sec.
>> The hash bucket size is 300032, and with 80000 modifications/sec, we are 
>> (potentially) changing 26.6% of the hash chains each second.
>>
> 
> OK but a lookup last a fraction of a micro second, unless interrupted by
> hard irq.
> 
> Probability of a change during a lookup should be very very small.
> 
> Note that the scenario for a restart is :
> 
> The lookup go through the chain.
> While it is examining one object, this object is deleted.
> The object is re-allocated by another cpu and inserted to a new chain.

I think another scenario that seems a bit more likely would be
that a new entry is added to the chain after it was fully searched.
Perhaps we could continue searching at the last position if the
last entry is not a nulls entry to improve this.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 14:36     ` Eric Dumazet
  2010-04-22 14:53       ` Eric Dumazet
@ 2010-04-23 10:56       ` Patrick McHardy
  2010-04-23 12:45         ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 10:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Linux Kernel Network Hackers, netfilter-devel,
	Paul E McKenney

Eric Dumazet wrote:
> Le jeudi 22 avril 2010 à 15:17 +0200, Patrick McHardy a écrit :
>> Changli Gao wrote:
>>>> struct nf_conntrack_tuple_hash *
>>>> __nf_conntrack_find(struct net *net, const struct nf_conntrack_tuple *tuple)
>>>> ...
>>> We should add a retry limit there.
>> We can't do that since that would allow false negatives.
> 
> If one hash slot is under attack, then there is a bug somewhere.
> 
> If we cannot avoid this, we can fallback to a secure mode at the second
> retry, and take the spinlock.
> 
> Tis way, most of lookups stay lockless (one pass), and some might take
> the slot lock to avoid the possibility of a loop.

That sounds like a good idea. But let's wait for Jesper's test results
before we start fixing this problem :)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 10:55                 ` Patrick McHardy
@ 2010-04-23 11:05                   ` Eric Dumazet
  2010-04-23 11:06                     ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23 11:05 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le vendredi 23 avril 2010 à 12:55 +0200, Patrick McHardy a écrit :
> Eric Dumazet wrote:
> > 
> > OK but a lookup last a fraction of a micro second, unless interrupted by
> > hard irq.
> > 
> > Probability of a change during a lookup should be very very small.
> > 
> > Note that the scenario for a restart is :
> > 
> > The lookup go through the chain.
> > While it is examining one object, this object is deleted.
> > The object is re-allocated by another cpu and inserted to a new chain.
> 
> I think another scenario that seems a bit more likely would be
> that a new entry is added to the chain after it was fully searched.
> Perhaps we could continue searching at the last position if the
> last entry is not a nulls entry to improve this.

But the last entry is always a nulls entry; what do you mean exactly?

When an insert (of a fresh object, not a reused one) is done, this
doesn't affect lookups in any way, since it's done at the head of the list.




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 10:36                   ` Patrick McHardy
@ 2010-04-23 11:06                     ` Eric Dumazet
  0 siblings, 0 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23 11:06 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le vendredi 23 avril 2010 à 12:36 +0200, Patrick McHardy a écrit :
> Eric Dumazet wrote:
> > Le jeudi 22 avril 2010 à 23:03 +0200, Eric Dumazet a écrit :
> >>> Guess I have to reproduce the DoS attack in a testlab (I will first have 
> >>> time Tuesday).  So we can determine if its bad hashing or restart of the 
> >>> search loop.
> >>>
> > 
> > Or very long chains, if attacker managed to find a jhash flaw.
> 
> That should be visible in the "searched" statistic.
> 
> > You could add a lookup_restart counter :
> 
> I've applied Jespers equivalent patch.

Yes of course, I missed it or I would not have cooked it ;)

Thanks



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 11:05                   ` Eric Dumazet
@ 2010-04-23 11:06                     ` Patrick McHardy
  0 siblings, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 11:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, paulmck, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Eric Dumazet wrote:
> Le vendredi 23 avril 2010 à 12:55 +0200, Patrick McHardy a écrit :
>> Eric Dumazet wrote:
>>> OK but a lookup last a fraction of a micro second, unless interrupted by
>>> hard irq.
>>>
>>> Probability of a change during a lookup should be very very small.
>>>
>>> Note that the scenario for a restart is :
>>>
>>> The lookup go through the chain.
>>> While it is examining one object, this object is deleted.
>>> The object is re-allocated by another cpu and inserted to a new chain.
>> I think another scenario that seems a bit more likely would be
>> that a new entry is added to the chain after it was fully searched.
>> Perhaps we could continue searching at the last position if the
>> last entry is not a nulls entry to improve this.
> 
> But the last entry is always a nulls entry, what do you mean exactly ?
> 
> When an unsert (of a fresh object, not a reused one) is done, this
> doesnt affect lookups in any way, since its done at the head of list.

Right, I missed that :)


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 10:56       ` DDoS attack causing bad effect on conntrack searches Patrick McHardy
@ 2010-04-23 12:45         ` Jesper Dangaard Brouer
  2010-04-23 13:57           ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-23 12:45 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Eric Dumazet, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney

On Fri, 23 Apr 2010, Patrick McHardy wrote:

> That sounds like a good idea. But lets what for Jesper's test results
> before we start fixing this problem :)

I will first have time to perform the tests Monday or Tuesday.

BUT I have just noticed there seems to be a correlation between conntrack 
early_drop and searches.  I have uploaded a new graph:

  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png

I have not had time to check out the code path yet...

Cheers,
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 12:45         ` Jesper Dangaard Brouer
@ 2010-04-23 13:57           ` Patrick McHardy
  0 siblings, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2010-04-23 13:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Eric Dumazet, Changli Gao, hawk, Linux Kernel Network Hackers,
	netfilter-devel, Paul E McKenney

Jesper Dangaard Brouer wrote:
> On Fri, 23 Apr 2010, Patrick McHardy wrote:
> 
>> That sounds like a good idea. But lets what for Jesper's test results
>> before we start fixing this problem :)
> 
> I will first have time to perform the tests Monday or Tuesday.
> 
> BUT I have just noticed there seems to be a corrolation between
> conntrack early_drop and searches.  I have upload a new graph:
> 
>  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/conntrack_early_drop002.png

I guess that's somewhat expected when your conntrack table is full
and all you're seeing is new connection setup attempts.

First you have a search for an existing conntrack, then it attempts
to create a new one and tries to early_drop an old one.
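
The per-packet flow under table pressure can be sketched as a toy model (simplified, invented helpers, not the kernel's actual code; the real early_drop picks a non-assured entry from a hash chain rather than the globally oldest one):

```c
#define CT_MAX 4	/* toy table limit; real nf_conntrack_max is much larger */

struct toy_ct {
	unsigned int tuple;	/* stand-in for the real 5-tuple */
	int in_use;
	unsigned long last_seen;
};

static struct toy_ct table[CT_MAX];
static unsigned long now;

/* 1. search for an existing entry (the 'found'/'searched' stats) */
static struct toy_ct *toy_find(unsigned int tuple)
{
	for (int i = 0; i < CT_MAX; i++)
		if (table[i].in_use && table[i].tuple == tuple)
			return &table[i];
	return 0;
}

/* 2. on a miss with a full table, reclaim ("early drop") an old entry;
 * here simply the least recently seen one. */
static struct toy_ct *toy_new(unsigned int tuple)
{
	struct toy_ct *victim = 0;

	for (int i = 0; i < CT_MAX; i++) {
		if (!table[i].in_use) {
			victim = &table[i];
			break;
		}
		if (!victim || table[i].last_seen < victim->last_seen)
			victim = &table[i];
	}
	victim->tuple = tuple;
	victim->in_use = 1;
	victim->last_seen = ++now;
	return victim;
}

/* Per-packet path: lookup, else create (possibly early-dropping). */
static struct toy_ct *toy_packet(unsigned int tuple)
{
	struct toy_ct *ct = toy_find(tuple);

	if (ct) {
		ct->last_seen = ++now;
		return ct;
	}
	return toy_new(tuple);
}
```

Every new flow in a full table therefore costs one search plus one drop plus one insert, which is why a flood of fresh flows hammers both the search path and the (locked) modification path.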


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-22 20:38             ` Jesper Dangaard Brouer
  2010-04-22 21:03               ` Eric Dumazet
@ 2010-04-23 20:57               ` Eric Dumazet
  2010-04-24 11:11                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-23 20:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :

> 
> I think its plausable, there is a lot of modification going on.
> Approx 40.000 deletes/sec and 40.000 inserts/sec.
> The hash bucket size is 300032, and with 80000 modifications/sec, we are 
> (potentially) changing 26.6% of the hash chains each second.
> 
> As can be seen from the graphs:
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
> 
> Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is 
> caught searching...
> 
> 
> > maybe hash table has one slot :)
> 
> Guess I have to reproduce the DoS attack in a testlab (I will first have 
> time Tuesday).  So we can determine if its bad hashing or restart of the 
> search loop.
> 
> 
> The traffic pattern was fairly simple:
> 
> 200 bytes UDP packets, comming from approx 60 source IPs, going to one 
> destination IP.  The UDP destination port number was varied in the range 
> of 1 to 6000.   The source UDP port was varied a bit more, some ranging 
> from 32768 to 61000, and some from 1028 to 5000.
> 
> 

Re-reading this, I am not sure there is a real problem on RCU as you
pointed out.

With 800.000 entries in a 300.032-bucket hash table, each lookup hits
about 3 entries (aka 'searched' in the conntrack stats).

300.000 packets/second -> 900.000 'searches' per second.

If you have four cpus all trying to insert/delete entries in parallel,
they all hit the central conntrack lock.

On a DDOS scenario, every packet needs to take this lock twice,
once to free an old conntrack (early drop), once to insert a new entry.

To scale this, the only way would be to have an array of locks, like we
have for TCP/UDP hash tables.

I did some tests here, with a multiqueue card, flooded with 300.000
packets/second, 65.536 source IPs, millions of flows, and nothing wrong
happened (but packet drops, of course).

My two cpus were busy 100%, after tweaking smp_affinities, because on
first try, irqbalance put "01" mask on both queues, so only one ksoftirq
was working, other cpu was idle :(





* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-23 20:57               ` Eric Dumazet
@ 2010-04-24 11:11                 ` Jesper Dangaard Brouer
  2010-04-24 20:11                   ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-24 11:11 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers


On Fri, 23 Apr 2010, Eric Dumazet wrote:

> Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :
>
>>
>> I think its plausable, there is a lot of modification going on.
>> Approx 40.000 deletes/sec and 40.000 inserts/sec.
>> The hash bucket size is 300032, and with 80000 modifications/sec, we are
>> (potentially) changing 26.6% of the hash chains each second.
>>
>> As can be seen from the graphs:
>>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
>>
>> Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is
>> caught searching...
>>
>>
>>> maybe hash table has one slot :)
>>
>> Guess I have to reproduce the DoS attack in a testlab (I will first have
>> time Tuesday).  So we can determine if its bad hashing or restart of the
>> search loop.
>>
>>
>> The traffic pattern was fairly simple:
>>
>> 200 bytes UDP packets, comming from approx 60 source IPs, going to one
>> destination IP.  The UDP destination port number was varied in the range
>> of 1 to 6000.   The source UDP port was varied a bit more, some ranging
>> from 32768 to 61000, and some from 1028 to 5000.
>>
>>
>
> Re-reading this, I am not sure there is a real problem on RCU as you
> pointed out.
>
> With 800.000 entries, in a 300.032 buckets hash table, each lookup hit
> about 3 entries (aka searches in conntrack stats)
>
> 300.000 packets/second -> 900.000 'searches' per second.

The machine is not getting 300.000 pps, it's only getting 40.000 pps (the 
rest is stopped by the NIC sending Ethernet flow-control pause frames):

  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/eth0-rx.png

We are doing 700.000 'searches' per second with 40.000 pps, thus on 
average the list length (in each hash bucket) only needs to be 17.5 
elements.  Is this an acceptable hash distribution, with 900.000 elements 
in a 300.032-bucket hash table?
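
As a sanity check of the arithmetic above (assuming searches are spread evenly over lookups):

```c
/* Back-of-envelope check of the figures discussed in this thread. */
static double expected_chain_len(double entries, double buckets)
{
	return entries / buckets;	/* average entries per hash bucket */
}

static double observed_walk_len(double searches_per_sec, double pps)
{
	return searches_per_sec / pps;	/* entries walked per lookup */
}

/* With ~800.000 entries in 300.032 buckets the expected chain length is
 * under 3, while 700.000 searches/sec at 40.000 pps implies ~17.5
 * entries walked per lookup -- roughly a 6x discrepancy, which is what
 * points at hash skew (or search restarts) rather than normal load. */
```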


> If you have four cpus all trying to insert/delete entries in //, they
> all hit the central conntrack lock.

This machine only has two CPUs, or rather one physical CPU with 
hyperthreading.  The server is an old HP380 G4, with an old 3.4 GHz 
Xeon-type CPU (1MB cache).  If I remember correctly it's based on 
Pentium-4 technology, which had pretty bad hyperthreading.


> On a DDOS scenario, every packet needs to take this lock twice,
> once to free an old conntrack (early drop), once to insert a new entry.

I was worried about whether the "early drop" (i.e. freeing an old 
conntrack) would disturb the conntrack searching.


> To scale this, only way would be to have an array of locks, like we have
> for TCP/UDP hash tables.
>
> I did some tests here, with a multiqueue card, flooded with 300.000
> pack/second, 65.536 source IP, millions of flows, and nothing wrong
> happened (but packets drops, of course)

A small hint when testing: use Harald's tool 'lnstat' to see the stats on 
the command line, so you don't need to graph everything with RRDtool:

Command:
  lnstat -f nf_conntrack -i 1 -c 1000

I don't have a multiqueue NIC in this old machine.

I also ran some tests in my 10G testlab, but it didn't go wrong. 
Tweaking the pktgen DDoS I could get the system to do 4.500.000 'searches' 
per sec at 1.500.000 packets/sec.  (I have not reloaded the kernel with 
the new failed-lookup stats.)  I guess my 10G machines are too fast to hit 
the issue.


> My two cpus were busy 100%, after tweaking smp_affinities, because on
> first try, irqbalance put "01" mask on both queues, so only one ksoftirq
> was working, other cpu was idle :(

I think my machine is somewhat slower than yours; perhaps it's simply not 
fast enough for this kind of workload (I'm pretty sure the CPU cache is 
getting trashed in this case).

On my machine one CPU is stuck in softirq:
  http://people.netfilter.org/hawk/DDoS/2010-04-12__001/cpu_softirq001.png

And another observation is that the CPUs are disturbing each other on the 
RX softirq code path.
http://people.netfilter.org/hawk/DDoS/2010-04-12__001/softnet_time_squeeze_rx-softirq001.png
(/proc/net/softnet_stat column 3)

Monday or Tuesday I'll do a test setup with some old HP380 G4 machines to 
see if I can reproduce the DDoS attack scenario, and see if I can get 
it into the lookup loop.

Hilsen
   Jesper Brouer

--
-------------------------------------------------------------------
MSc. Master of Computer Science
Dept. of Computer Science, University of Copenhagen
Author of http://www.adsl-optimizer.dk
-------------------------------------------------------------------


* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-24 11:11                 ` Jesper Dangaard Brouer
@ 2010-04-24 20:11                   ` Eric Dumazet
  2010-04-26 14:36                     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-04-24 20:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: paulmck, Patrick McHardy, Changli Gao, hawk,
	Linux Kernel Network Hackers, Netfilter Developers

Le samedi 24 avril 2010 à 13:11 +0200, Jesper Dangaard Brouer a écrit :
> On Fri, 23 Apr 2010, Eric Dumazet wrote:
> 
> > Le jeudi 22 avril 2010 à 22:38 +0200, Jesper Dangaard Brouer a écrit :
> >
> >>
> >> I think its plausable, there is a lot of modification going on.
> >> Approx 40.000 deletes/sec and 40.000 inserts/sec.
> >> The hash bucket size is 300032, and with 80000 modifications/sec, we are
> >> (potentially) changing 26.6% of the hash chains each second.
> >>
> >> As can be seen from the graphs:
> >>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/list.html
> >>
> >> Notice that primarily CPU2 is doing the 40k deletes/sec, while CPU1 is
> >> caught searching...
> >>
> >>
> >>> maybe hash table has one slot :)
> >>
> >> Guess I have to reproduce the DoS attack in a testlab (I will first have
> >> time Tuesday).  So we can determine if its bad hashing or restart of the
> >> search loop.
> >>
> >>
> >> The traffic pattern was fairly simple:
> >>
> >> 200 bytes UDP packets, comming from approx 60 source IPs, going to one
> >> destination IP.  The UDP destination port number was varied in the range
> >> of 1 to 6000.   The source UDP port was varied a bit more, some ranging
> >> from 32768 to 61000, and some from 1028 to 5000.
> >>
> >>
> >
> > Re-reading this, I am not sure there is a real problem on RCU as you
> > pointed out.
> >
> > With 800.000 entries, in a 300.032 buckets hash table, each lookup hit
> > about 3 entries (aka searches in conntrack stats)
> >
> > 300.000 packets/second -> 900.000 'searches' per second.
> 
> The machine is not getting 300.000 pps, its only getting 40.000 pps (the 
> rest is stopped by the NIC by sending Ethernet flowcontrol pause frames)
> 
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/eth0-rx.png
> 
> We are doing 700.000 'searches' per second, with 40.000 pps, thus on 
> average the list lenght (in each hash bucket) just need to be 17.5 
> elements.  Is this an acceptable has distribution, with 900.000 elements 
> in a 300.032 buckets hash table?
> 
> 
> > If you have four cpus all trying to insert/delete entries in //, they
> > all hit the central conntrack lock.
> 
> This machine only have two CPUs, or rather one physical CPU and one 
> hyperthreaded.  The server is a old HP380 G4, with an old Xeon type CPU 
> 3.4 GHz (1MB cache).  If remember correctly its based on Pentium-4 
> technologi, which had a pretty bad hyperthreading.
> 
> 
> > On a DDOS scenario, every packet needs to take this lock twice,
> > once to free an old conntrack (early drop), once to insert a new entry.
> 
> I was worried about if the "early drop" e.g. free an old conntrack would 
> disturbe the conntrack searching?
> 
> 
> > To scale this, only way would be to have an array of locks, like we have
> > for TCP/UDP hash tables.
> >
> > I did some tests here, with a multiqueue card, flooded with 300.000
> > pack/second, 65.536 source IP, millions of flows, and nothing wrong
> > happened (but packets drops, of course)
> 
> A small hint when testing, use Haralds tool 'lnstat' to see the stats on 
> the command line, thus you don't need to RRDtool graphe every thing:
> 
> Command:
>   lnstat -f nf_conntrack -i 1 -c 1000
> 
> I don't have a multiqueue NIC in this old machine.
> 
> I also ran some tests on my 10G testlab, but it didn't go wrong. 
> Tweeking the pktgen DDoS I could get the system to do 4.500.000 'searches' 
> per sec with a 1.500.000 packets/sec. (Have not reloaded the kernel with 
> the new failed lookup stats).  Guess my 10G machines are too fast to hit 
> the issues.
> 
> 
> > My two cpus were busy 100%, after tweaking smp_affinities, because on
> > first try, irqbalance put "01" mask on both queues, so only one ksoftirq
> > was working, other cpu was idle :(
> 
> Think my machine is some what slower than yours, perhaps its simply not 
> fast enough for this kind of workload (pretty sure that the cache is the 
> CPU is getting f*ked in this case).
> 
> On my machine one CPU in stuck in softirq:
>   http://people.netfilter.org/hawk/DDoS/2010-04-12__001/cpu_softirq001.png
> 
> And another observation is that the CPUs are disturbing each other on the 
> RX softirq code path.
> http://people.netfilter.org/hawk/DDoS/2010-04-12__001/softnet_time_squeeze_rx-softirq001.png
> (/proc/net/softnet_stat column 3)
> 
> Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to 
> see if I can reproduce the DDoS attack senario.  And see if I can get 
> it into to lookup loop.

Theoretically a loop is very unlikely, given that a single retry is very
unlikely too.

Unless a cpu gets a corrupted value of a 'next' pointer in its cache.

Maybe a hardware problem ?

My test machine is a fairly low end one, an AMD Athlon Dual core 5050e,
2.6 GHz

I used an igb card for ease of setup, and to make sure my two cores
would handle packets in parallel, without RPS.

With the same hash bucket size (300.032) and max conntracks (800.000), and
after more than 10 hours of testing, not a single lookup was restarted
because of a nulls value with the wrong bucket.
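
The "nulls" check being referred to can be illustrated in userspace roughly like this (a simplification of the kernel's hlist_nulls idiom, with invented helper names modeled on it):

```c
#include <stdint.h>

/* Simplified hlist_nulls idiom: the terminating "pointer" of each hash
 * chain is not NULL but an odd value encoding the bucket number.  A
 * lockless reader that ends its walk on a nulls value for a *different*
 * bucket knows an entry it followed was freed and reused on another
 * chain (SLAB_DESTROY_BY_RCU) and must restart the lookup. */
static int is_a_nulls(uintptr_t ptr)
{
	return ptr & 1;			/* odd => chain terminator */
}

static uintptr_t make_nulls(unsigned int bucket)
{
	return ((uintptr_t)bucket << 1) | 1;
}

static unsigned int get_nulls_value(uintptr_t ptr)
{
	return (unsigned int)(ptr >> 1);
}

/* After a full walk of bucket 'bucket': restart if we ended on the
 * terminator of some other bucket. */
static int must_restart(uintptr_t end, unsigned int bucket)
{
	return is_a_nulls(end) && get_nulls_value(end) != bucket;
}
```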

I can setup a test on a 16 cpu machine, multiqueue card too.

Hmm, I forgot to say I am using net-next-2.6, not your kernel version...





* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-24 20:11                   ` Eric Dumazet
@ 2010-04-26 14:36                     ` Jesper Dangaard Brouer
  2010-05-31 21:21                       ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Jesper Dangaard Brouer @ 2010-04-26 14:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, paulmck, Patrick McHardy, Changli Gao,
	Linux Kernel Network Hackers, Netfilter Developers

On Sat, 2010-04-24 at 22:11 +0200, Eric Dumazet wrote:
>  
> > Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to 
> > see if I can reproduce the DDoS attack senario.  And see if I can get 
> > it into to lookup loop.
> 
> Theorically a loop is very unlikely, given a single retry is very
> unlikly too.
> 
> Unless a cpu gets in its cache a corrupted value of a 'next' pointer.
> 
...
>
> With same hash bucket size (300.032) and max conntracks (800.000), and
> after more than 10 hours of test, not a single lookup was restarted
> because of a nulls with wrong value.

So far, I have to agree with you.  I have now tested on the same type
of hardware (although running a 64-bit kernel, off net-next-2.6),
and the result is the same as yours: I don't see any restarts of the
loop.  The test system differs a bit, as it has two physical CPUs and 2M
cache (and annoyingly the system insists on using HPET as clocksource).

I guess the only explanation would be bad/sub-optimal hash distribution.
With 40 kpps and 700.000 'searches' per second, the hash bucket list
length "only" needs to be 17.5 elements on average, where the optimum is 3.
With my pktgen DoS test, where I tried to reproduce the DoS attack, I only
see a skew of 6 elements on average.


> I can setup a test on a 16 cpu machine, multiqueue card too.

I don't think that is necessary.  My theory was that it was possible on a
slower single-queue NIC, where one CPU is 100% busy in the conntrack
search, and the other CPUs delete the entries (due to early drop and
call_rcu()).  But I guess that's not the case, and RCU works perfectly ;-)

> Hmm, I forgot to say I am using net-next-2.6, not your kernel version...

I also did this test using net-next-2.6, perhaps I should try the
version I use in production...


-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network Kernel Developer
  Cand. Scient Datalog / MSc.CS
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer



* Re: DDoS attack causing bad effect on conntrack searches
  2010-04-26 14:36                     ` Jesper Dangaard Brouer
@ 2010-05-31 21:21                       ` Eric Dumazet
  2010-06-01  0:28                         ` Changli Gao
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-05-31 21:21 UTC (permalink / raw)
  To: hawk
  Cc: Jesper Dangaard Brouer, paulmck, Patrick McHardy, Changli Gao,
	Linux Kernel Network Hackers, Netfilter Developers

Le lundi 26 avril 2010 à 16:36 +0200, Jesper Dangaard Brouer a écrit :
> On Sat, 2010-04-24 at 22:11 +0200, Eric Dumazet wrote:
> >  
> > > Monday or Tuesdag I'll do a test setup with some old HP380 G4 machines to 
> > > see if I can reproduce the DDoS attack senario.  And see if I can get 
> > > it into to lookup loop.
> > 
> > Theorically a loop is very unlikely, given a single retry is very
> > unlikly too.
> > 
> > Unless a cpu gets in its cache a corrupted value of a 'next' pointer.
> > 
> ...
> >
> > With same hash bucket size (300.032) and max conntracks (800.000), and
> > after more than 10 hours of test, not a single lookup was restarted
> > because of a nulls with wrong value.
> 
> So fare, I have to agree with you.  I have now tested on the same type
> of hardware (although running a 64-bit kernel, and off net-next-2.6),
> and the result is, the same as yours, I don't see a any restarts of the
> loop.  The test systems differs a bit, as it has two physical CPU and 2M
> cache (and annoyingly the system insists on using HPET as clocksource).
> 
> Guess, the only explanation would be bad/sub-optimal hash distribution.
> With 40 kpps and 700.000 'searches' per second, the hash bucket list
> length "only" need to be 17.5 elements on average, where optimum is 3.
> With my pktgen DoS test, where I tried to reproduce the DoS attack, only
> see a screw of 6 elements on average.
> 
> 
> > I can setup a test on a 16 cpu machine, multiqueue card too.
> 
> Don't think that is necessary.  My theory was it was possible on slower
> single queue NIC, where one CPU is 100% busy in the conntrack search,
> and the other CPUs delete the entries (due to early drop and
> call_rcu()).  But guess that note the case, and RCU works perfectly ;-)
> 
> > Hmm, I forgot to say I am using net-next-2.6, not your kernel version...
> 
> I also did this test using net-next-2.6, perhaps I should try the
> version I use in production...
> 
> 


I had a look at current conntrack and found that the 'unconfirmed' list
may be a candidate for a potential blackhole.

That is, if a reader happens to hit an entry that is moved from a regular
hash table slot 'hash' to the unconfirmed list, the reader might scan the
whole unconfirmed list to find out it is no longer on the wanted hash chain.

The problem is this unconfirmed list might be very, very long in case of
DDoS.  It's really not designed to be scanned during a lookup.

So I guess we should stop early if we find an unconfirmed entry ?



diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..0573641 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 extern unsigned int nf_conntrack_htable_size;
 extern unsigned int nf_conntrack_max;
 
-#define NF_CT_STAT_INC(net, count)	\
+#define NF_CT_STAT_INC(net, count)		\
 	__this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_ADD(net, count, value)	\
+	__this_cpu_add((net)->ct.stat->count, value)
 #define NF_CT_STAT_INC_ATOMIC(net, count)		\
 do {							\
 	local_bh_disable();				\
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..e96d999 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(net, zone, tuple);
+	unsigned int cnt = 0;
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -309,10 +310,19 @@ begin:
 		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
 		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
+			NF_CT_STAT_ADD(net, searched, cnt);
 			local_bh_enable();
 			return h;
 		}
-		NF_CT_STAT_INC(net, searched);
+		/*
+		 * If we find an unconfirmed entry, restart the lookup to
+		 * avoid scanning whole unconfirmed list
+		 */
+		if (unlikely(++cnt > 8 &&
+			     !nf_ct_is_confirmed(nf_ct_tuplehash_to_ctrack(h)))) {
+			NF_CT_STAT_INC(net, search_restart);
+			goto begin;
+		}
 	}
 	/*
 	 * if the nulls value we got at the end of this lookup is
@@ -323,6 +333,7 @@ begin:
 		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
 	}
+	NF_CT_STAT_ADD(net, searched, cnt);
 	local_bh_enable();
 
 	return NULL;




* Re: DDoS attack causing bad effect on conntrack searches
  2010-05-31 21:21                       ` Eric Dumazet
@ 2010-06-01  0:28                         ` Changli Gao
  2010-06-01  5:05                           ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Changli Gao @ 2010-06-01  0:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hawk, Jesper Dangaard Brouer, paulmck, Patrick McHardy,
	Linux Kernel Network Hackers, Netfilter Developers

On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I had a look at current conntrack and found the 'unconfirmed' list was
> maybe a candidate for a potential blackhole.
>
> That is, if a reader happens to hit an entry that is moved from regular
> hash table slot 'hash' to unconfirmed list,

Sorry, but I can't find where we do this.  The unconfirmed list is
used to track the unconfirmed cts, whose corresponding skbs are still
in the path from the first to the last netfilter hook.  As soon as the
skbs end their travel in netfilter, the corresponding cts will be
confirmed (moving the ct from the unconfirmed list to the regular hash
table).

The unconfirmed list should be small, as network receive runs in BH.
How about implementing the unconfirmed list as a per-cpu variable?

> reader might scan whole
> unconfirmed list to find out he is not anymore on the wanted hash chain.
>
> Problem is this unconfirmed list might be very very long in case of
> DDOS. It's really not designed to be scanned during a lookup.
>
> So I guess we should stop early if we find an unconfirmed entry ?
>
>
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index bde095f..0573641 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
>  extern unsigned int nf_conntrack_htable_size;
>  extern unsigned int nf_conntrack_max;
>
> -#define NF_CT_STAT_INC(net, count)     \
> +#define NF_CT_STAT_INC(net, count)             \
>        __this_cpu_inc((net)->ct.stat->count)
> +#define NF_CT_STAT_ADD(net, count, value)      \
> +       __this_cpu_add((net)->ct.stat->count, value)
>  #define NF_CT_STAT_INC_ATOMIC(net, count)              \
>  do {                                                   \
>        local_bh_disable();                             \
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index eeeb8bc..e96d999 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
>        struct nf_conntrack_tuple_hash *h;
>        struct hlist_nulls_node *n;
>        unsigned int hash = hash_conntrack(net, zone, tuple);
> +       unsigned int cnt = 0;
>
>        /* Disable BHs the entire time since we normally need to disable them
>         * at least once for the stats anyway.
> @@ -309,10 +310,19 @@ begin:
>                if (nf_ct_tuple_equal(tuple, &h->tuple) &&
>                    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
>                        NF_CT_STAT_INC(net, found);
> +                       NF_CT_STAT_ADD(net, searched, cnt);
>                        local_bh_enable();
>                        return h;
>                }
> -               NF_CT_STAT_INC(net, searched);
> +               /*
> +                * If we find an unconfirmed entry, restart the lookup to
> +                * avoid scanning whole unconfirmed list
> +                */
> +               if (unlikely(++cnt > 8 &&
> +                            !nf_ct_is_confirmed(nf_ct_tuplehash_to_ctrack(h)))) {
> +                       NF_CT_STAT_INC(net, search_restart);
> +                       goto begin;
> +               }
>        }
>        /*
>         * if the nulls value we got at the end of this lookup is
> @@ -323,6 +333,7 @@ begin:
>                NF_CT_STAT_INC(net, search_restart);
>                goto begin;
>        }
> +       NF_CT_STAT_ADD(net, searched, cnt);
>        local_bh_enable();
>
>        return NULL;
>
>
>



-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: DDoS attack causing bad effect on conntrack searches
  2010-06-01  0:28                         ` Changli Gao
@ 2010-06-01  5:05                           ` Eric Dumazet
  2010-06-01  5:48                             ` Changli Gao
  2010-06-01 10:18                             ` Patrick McHardy
  0 siblings, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-06-01  5:05 UTC (permalink / raw)
  To: Changli Gao
  Cc: hawk, Jesper Dangaard Brouer, paulmck, Patrick McHardy,
	Linux Kernel Network Hackers, Netfilter Developers

Le mardi 01 juin 2010 à 08:28 +0800, Changli Gao a écrit :
> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I had a look at current conntrack and found the 'unconfirmed' list was
> > maybe a candidate for a potential blackhole.
> >
> > That is, if a reader happens to hit an entry that is moved from regular
> > hash table slot 'hash' to unconfirmed list,
> 
> Sorry, but I can't find where we do this things. unconfirmed list is
> used to track the unconfirmed cts, whose corresponding skbs are still
> in path from the first to the last netfilter hooks. As soon as the
> skbs end their travel in netfilter, the corresponding cts will be
> confirmed(moving ct from unconfirmed list to regular hash table).
> 

So netfilter is a monolithic thing.

When a packet begins its travel into netfilter, you guarantee that no
other packet can also begin its travel and find an unconfirmed
conntrack ?

I wonder why we use atomic ops then to track the confirmed bit :)


> unconfirmed list should be small, as networking receiving is in BH.

So according to you, netfilter/ct runs only in the input path ?

So I assume a packet is handled by CPU X, which creates a new conntrack
(possibly early dropping an old entry that was previously in a standard
hash chain) and inserts it in the unconfirmed list.  _You_ guarantee
another CPU Y, handling another packet, possibly sent by a hacker reading
your netdev mails, cannot find the conntrack that was early dropped ?

> How about implementing unconfirmed list as a per cpu variable?

I first implemented such a patch to reduce cache line contention, then I
asked myself: what exactly is an unconfirmed conntrack?  Can their
number be unbounded?  If yes, we have a problem, even on a two-cpu
machine.  Using two lists instead of one won't solve the fundamental
problem.

The real question is: why do we need this unconfirmed 'list' in the
first place?  Is it really a private per-cpu thing?  Can you prove this,
with respect to lockless lookups, and things like NFQUEUE? 

Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
one for IP_CT_DIR_REPLY.

The unconfirmed list uses the first anchor.  This means another cpu can
definitely find an unconfirmed item in a regular hash chain, since we
don't respect an RCU grace period before re-using an object.

If memory were not a problem, we would probably use a third anchor to
avoid this, or regular RCU instead of the SLAB_DESTROY_BY_RCU variant.





* Re: DDoS attack causing bad effect on conntrack searches
  2010-06-01  5:05                           ` Eric Dumazet
@ 2010-06-01  5:48                             ` Changli Gao
  2010-06-01 10:18                             ` Patrick McHardy
  1 sibling, 0 replies; 54+ messages in thread
From: Changli Gao @ 2010-06-01  5:48 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hawk, Jesper Dangaard Brouer, paulmck, Patrick McHardy,
	Linux Kernel Network Hackers, Netfilter Developers

On Tue, Jun 1, 2010 at 1:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tue, Jun 1, 2010 at 08:28 +0800, Changli Gao wrote:
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > I had a look at current conntrack and found that the 'unconfirmed'
>> > list might be a potential blackhole.
>> >
>> > That is, if a reader happens to hit an entry that is moved from a
>> > regular hash table slot 'hash' to the unconfirmed list,
>>
>> Sorry, but I can't find where we do these things. The unconfirmed list
>> is used to track the unconfirmed cts, whose corresponding skbs are
>> still in the path from the first to the last netfilter hook. As soon
>> as the skbs end their travel through netfilter, the corresponding cts
>> will be confirmed (moving the ct from the unconfirmed list to the
>> regular hash table).
>>
>
> So netfilter is a monolithic thing.
>
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack?
>
> I wonder why we use atomic ops then to track the confirmed bit :)

Seems there is no need.

>
>
>> The unconfirmed list should be small, as network receiving is done in BH.
>
> So, according to you, netfilter/ct runs only in the input path?

No. There are other entry points: local out and nf_reinject. If there
isn't any packet queued, since netfilter runs in atomic context, the
number of unconfirmed cts should be small (at most 2 * nr_cpu?).

>
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early-dropping an old entry that was previously in a standard
> hash chain), which is inserted in the unconfirmed list.
>

Oh, thanks, I got it.

> _You_ guarantee that another CPU
> Y, handling another packet, possibly sent by a hacker reading your
> netdev mails, cannot find the conntrack that was early dropped?
>
>> How about implementing the unconfirmed list as a per-cpu variable?
>
> I first implemented such a patch to reduce cache line contention, then I
> asked myself: what exactly is an unconfirmed conntrack? Can their
> number be unbounded? If yes, we have a problem, even on a two-CPU
> machine. Using two lists instead of one won't solve the fundamental
> problem.
>
> The real question is: why do we need this unconfirmed 'list' in the
> first place? Is it really a private per-cpu thing?

No, it isn't really private. But most of the time it is accessed
locally, if it is implemented as a per-cpu var.

> Can you prove this,
> with respect to lockless lookups, and things like NFQUEUE?
>
> Each conntrack object has two list anchors: one for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
>
> The unconfirmed list uses the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> don't respect an RCU grace period before re-using an object.
>
> If memory were not a problem, we would probably use a third anchor to
> avoid this, or regular RCU instead of the SLAB_DESTROY_BY_RCU variant.
>

Thanks for the explanation.


-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: DDoS attack causing bad effect on conntrack searches
  2010-06-01  5:05                           ` Eric Dumazet
  2010-06-01  5:48                             ` Changli Gao
@ 2010-06-01 10:18                             ` Patrick McHardy
  2010-06-01 10:31                               ` Eric Dumazet
  1 sibling, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-06-01 10:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers

Eric Dumazet wrote:
> On Tue, Jun 1, 2010 at 08:28 +0800, Changli Gao wrote:
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> I had a look at current conntrack and found that the 'unconfirmed'
>>> list might be a potential blackhole.
>>>
>>> That is, if a reader happens to hit an entry that is moved from a
>>> regular hash table slot 'hash' to the unconfirmed list,
>> Sorry, but I can't find where we do these things. The unconfirmed list
>> is used to track the unconfirmed cts, whose corresponding skbs are
>> still in the path from the first to the last netfilter hook. As soon
>> as the skbs end their travel through netfilter, the corresponding cts
>> will be confirmed (moving the ct from the unconfirmed list to the
>> regular hash table).
>>
> 
> So netfilter is a monolithic thing.
> 
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack?

Correct, the unconfirmed list exists only for cleanup.

> I wonder why we use atomic ops then to track the confirmed bit :)

Good question, that looks unnecessary :)

>> The unconfirmed list should be small, as network receiving is done in BH.
> 
> So, according to you, netfilter/ct runs only in the input path?
> 
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early-dropping an old entry that was previously in a standard
> hash chain), which is inserted in the unconfirmed list. _You_ guarantee
> that another CPU Y, handling another packet, possibly sent by a hacker
> reading your netdev mails, cannot find the conntrack that was early
> dropped?
> 
>> How about implementing the unconfirmed list as a per-cpu variable?
> 
> I first implemented such a patch to reduce cache line contention, then I
> asked myself: what exactly is an unconfirmed conntrack? Can their
> number be unbounded? If yes, we have a problem, even on a two-CPU
> machine. Using two lists instead of one won't solve the fundamental
> problem.

If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
added to the unconfirmed list and moved to the hash as soon as the
packet passes POST_ROUTING. This means the number of unconfirmed entries
created by the network is bounded by the number of CPUs due to BH
processing. The number created by locally generated packets is
unbounded on preemptible kernels, however.

> The real question is: why do we need this unconfirmed 'list' in the
> first place? Is it really a private per-cpu thing? Can you prove this,
> with respect to lockless lookups, and things like NFQUEUE?

It's used for cleaning up conntracks that are not yet in the hash table
on module unload (or a manual flush). It is supposed to be write-only
during regular operation.
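Patrick's description already suggests the shape of the per-cpu variant proposed later in the thread: the hot path appends only to the local CPU's list, and only the rare cleanup path walks all of them. A minimal userspace sketch of that idea, with made-up names and no locking shown:

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4			/* illustrative */

struct unconfirmed {
	struct unconfirmed *next;
};

/* One list head per cpu: regular operation only ever touches the
 * local head, so there is no cross-cpu cache line contention. */
static struct unconfirmed *percpu_unconfirmed[NR_CPUS];

static void add_unconfirmed(int cpu, struct unconfirmed *ct)
{
	ct->next = percpu_unconfirmed[cpu];
	percpu_unconfirmed[cpu] = ct;
}

/* Cleanup (module unload / manual flush): the one place that has to
 * visit every cpu's list. */
static int count_unconfirmed(void)
{
	int cpu, n = 0;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		for (struct unconfirmed *ct = percpu_unconfirmed[cpu];
		     ct; ct = ct->next)
			n++;
	return n;
}
```

Because the list is write-only during regular operation, making it per-cpu costs nothing on the fast path and only makes the already-slow cleanup walk NR_CPUS lists instead of one.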

> Each conntrack object has two list anchors: one for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
> 
> The unconfirmed list uses the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> don't respect an RCU grace period before re-using an object.
> 
> If memory were not a problem, we would probably use a third anchor to
> avoid this, or regular RCU instead of the SLAB_DESTROY_BY_RCU variant.

So I guess we should check the CONFIRMED bit when searching
in the hash table.
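Patrick's conclusion (filter on the confirmed bit during the hash search) can be sketched as follows. The IPS_CONFIRMED bit position mirrors the kernel's IPS_CONFIRMED_BIT, but the rest is a hypothetical, simplified chain walk, not the kernel code:

```c
#include <assert.h>
#include <stddef.h>

#define IPS_CONFIRMED_BIT 3
#define IPS_CONFIRMED (1 << IPS_CONFIRMED_BIT)

struct ct {
	unsigned long status;
	int tuple;			/* stand-in for the real tuple */
	struct ct *next;
};

/* Skip entries whose CONFIRMED bit is clear: an early-dropped object
 * that was recycled onto the unconfirmed list can then never be
 * returned from a hash search, even though it is still reachable
 * through the shared first list anchor. */
static struct ct *find_confirmed(struct ct *head, int tuple)
{
	struct ct *c;

	for (c = head; c; c = c->next)
		if (c->tuple == tuple && (c->status & IPS_CONFIRMED))
			return c;
	return NULL;
}
```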


* Re: DDoS attack causing bad effect on conntrack searches
  2010-06-01 10:18                             ` Patrick McHardy
@ 2010-06-01 10:31                               ` Eric Dumazet
  2010-06-01 10:41                                 ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-01 10:31 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers

On Tue, Jun 1, 2010 at 12:18 +0200, Patrick McHardy wrote:

> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
> added to the unconfirmed list and moved to the hash as soon as the
> packet passes POST_ROUTING. This means the number of unconfirmed entries
> created by the network is bounded by the number of CPUs due to BH
> processing. The number created by locally generated packets is
> unbounded on preemptible kernels, however.
> in case of preemptible kernels however.
> 

OK, we should have a percpu list then.

BTW, I notice nf_conntrack_untracked is incorrectly annotated
'__read_mostly'.

It can be written very often :(

Shouldn't we special-case it and make it really const?





* Re: DDoS attack causing bad effect on conntrack searches
  2010-06-01 10:31                               ` Eric Dumazet
@ 2010-06-01 10:41                                 ` Patrick McHardy
  2010-06-01 16:20                                   ` [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-06-01 10:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, hawk, Jesper Dangaard Brouer, paulmck,
	Linux Kernel Network Hackers, Netfilter Developers

Eric Dumazet wrote:
> On Tue, Jun 1, 2010 at 12:18 +0200, Patrick McHardy wrote:
> 
>> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
>> added to the unconfirmed list and moved to the hash as soon as the
>> packet passes POST_ROUTING. This means the number of unconfirmed entries
>> created by the network is bounded by the number of CPUs due to BH
>> processing. The number created by locally generated packets is
>> unbounded on preemptible kernels, however.
>> in case of preemptible kernels however.
>>
> 
> OK, we should have a percpu list then.

Yes, that makes sense.

> BTW, I notice nf_conntrack_untracked is incorrectly annotated
> '__read_mostly'.
> 
> It can be written very often :(
> 
> Shouldn't we special-case it and make it really const?

That would need quite a bit of special-casing to avoid touching
the reference counts. So far this is completely hidden, so I'd
say it just shouldn't be marked __read_mostly. Alternatively we
can make "untracked" an nfctinfo state.


* [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
  2010-06-01 10:41                                 ` Patrick McHardy
@ 2010-06-01 16:20                                   ` Eric Dumazet
  2010-06-04 11:40                                     ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-01 16:20 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Netfilter Developers, netdev

On Tue, Jun 1, 2010 at 12:41 +0200, Patrick McHardy wrote:

> > BTW, I notice nf_conntrack_untracked is incorrectly annotated
> > '__read_mostly'.
> > 
> > It can be written very often :(
> > 
> > Shouldn't we special-case it and make it really const?
> 
> That would need quite a bit of special-casing to avoid touching
> the reference counts. So far this is completely hidden, so I'd
> say it just shouldn't be marked __read_mostly. Alternatively we
> can make "untracked" an nfctinfo state.

I tried this suggestion (a new IP_CT_UNTRACKED ctinfo) over a per_cpu
untracked ct, but it's a bit hard.

For example, I cannot find a way to change ctnetlink_conntrack_event():

	if (ct == &nf_conntrack_untracked)
		return 0;

Maybe this piece of code is not necessary, since we should not come here
anyway; or does it mean several packets could store events for this
(shared) ct?

Obviously, an IPS_UNTRACKED bit would be much easier to implement.
Would it be acceptable?


Preliminary patch with IP_CT_UNTRACKED, probably not working at all...

 include/linux/netfilter/nf_conntrack_common.h  |    3 +
 include/net/netfilter/nf_conntrack.h           |   11 +++--
 include/net/netfilter/nf_conntrack_core.h      |    2 
 net/ipv4/netfilter/nf_nat_core.c               |    4 +
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    4 -
 net/netfilter/nf_conntrack_core.c              |   32 +++++++++------
 net/netfilter/nf_conntrack_netlink.c           |    6 +-
 net/netfilter/xt_CT.c                          |   13 +++---
 net/netfilter/xt_NOTRACK.c                     |    4 -
 net/netfilter/xt_TEE.c                         |    8 +--
 net/netfilter/xt_cluster.c                     |    2 
 net/netfilter/xt_conntrack.c                   |    2 
 net/netfilter/xt_socket.c                      |    2 
 14 files changed, 58 insertions(+), 37 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..5f7c947 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -15,6 +15,9 @@ enum ip_conntrack_info {
            IP_CT_DIR_ORIGINAL); may be a retransmission. */
 	IP_CT_NEW,
 
+	/* Untracked */
+	IP_CT_UNTRACKED,
+
 	/* >= this indicates reply direction */
 	IP_CT_IS_REPLY,
 
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..884ade9 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -175,7 +175,7 @@ static inline struct nf_conn *
 nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
 {
 	*ctinfo = skb->nfctinfo;
-	return (struct nf_conn *)skb->nfct;
+	return container_of(skb->nfct, struct nf_conn, ct_general);
 }
 
 /* decrement reference count on a conntrack */
@@ -261,7 +261,7 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
-extern struct nf_conn nf_conntrack_untracked;
+DECLARE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
@@ -291,7 +291,12 @@ static inline int nf_ct_is_dying(struct nf_conn *ct)
 
 static inline int nf_ct_is_untracked(const struct sk_buff *skb)
 {
-	return (skb->nfct == &nf_conntrack_untracked.ct_general);
+	return (skb->nfctinfo == IP_CT_UNTRACKED);
+}
+
+static inline int nf_ct_is_tracked(const struct sk_buff *skb)
+{
+	return (skb->nfctinfo != IP_CT_UNTRACKED);
 }
 
 extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 3d7524f..8dd05ea 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -60,7 +60,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
 	int ret = NF_ACCEPT;
 
-	if (ct && ct != &nf_conntrack_untracked) {
+	if (ct && nf_ct_is_tracked(skb)) {
 		if (!nf_ct_is_confirmed(ct))
 			ret = __nf_conntrack_confirm(skb);
 		if (likely(ret == NF_ACCEPT))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 4f8bddb..a797999 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -719,6 +719,7 @@ static int __init nf_nat_init(void)
 {
 	size_t i;
 	int ret;
+	int cpu;
 
 	need_ipv4_conntrack();
 
@@ -742,7 +743,8 @@ static int __init nf_nat_init(void)
 	spin_unlock_bh(&nf_nat_lock);
 
 	/* Initialize fake conntrack so that NAT will skip it */
-	nf_conntrack_untracked.status |= IPS_NAT_DONE_MASK;
+	for_each_possible_cpu(cpu)
+		per_cpu(pcpu_nf_conntrack_untracked,cpu).status |= IPS_NAT_DONE_MASK;
 
 	l3proto = nf_ct_l3proto_find_get((u_int16_t)AF_INET);
 
diff --git a/net/ipv4/netfilter/nf_nat_standalone.c b/net/ipv4/netfilter/nf_nat_standalone.c
index beb2581..17af2bb 100644
--- a/net/ipv4/netfilter/nf_nat_standalone.c
+++ b/net/ipv4/netfilter/nf_nat_standalone.c
@@ -98,7 +98,7 @@ nf_nat_fn(unsigned int hooknum,
 		return NF_ACCEPT;
 
 	/* Don't try to NAT if this packet is not conntracked */
-	if (ct == &nf_conntrack_untracked)
+	if (ctinfo == IP_CT_UNTRACKED)
 		return NF_ACCEPT;
 
 	nat = nfct_nat(ct);
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 9be8177..b67029c 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -208,8 +208,8 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	type = icmp6h->icmp6_type - 130;
 	if (type >= 0 && type < sizeof(noct_valid_new) &&
 	    noct_valid_new[type]) {
-		skb->nfct = &nf_conntrack_untracked.ct_general;
-		skb->nfctinfo = IP_CT_NEW;
+		skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+		skb->nfctinfo = IP_CT_UNTRACKED;
 		nf_conntrack_get(skb->nfct);
 		return NF_ACCEPT;
 	}
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..eea5df1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked __read_mostly;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, pcpu_nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(pcpu_nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1185,10 +1185,16 @@ static void nf_ct_release_dying_list(struct net *net)
 
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	int cpu, use;
+	for (;;) {
+		use = 0;
+		for_each_possible_cpu(cpu)
+			use += atomic_read(&per_cpu(pcpu_nf_conntrack_untracked, cpu).ct_general.use) - 1;
+		/* wait until all references to nf_conntrack_untracked are dropped */
+		if (!use)
+			break;
 		schedule();
-
+	}
 	nf_conntrack_helper_fini();
 	nf_conntrack_proto_fini();
 #ifdef CONFIG_NF_CONNTRACK_ZONES
@@ -1325,6 +1331,7 @@ static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
 	int ret;
+	int cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1362,14 +1369,15 @@ static int nf_conntrack_init_init_net(void)
 	if (ret < 0)
 		goto err_extend;
 #endif
-	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
-	/*  - and look it like as a confirmed connection */
-	set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);
+	/* Set up fake conntracks: to never be deleted, not in any hashes */
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(pcpu_nf_conntrack_untracked, cpu);
 
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	/*  - and look it like as a confirmed connection */
+		__set_bit(IPS_CONFIRMED_BIT, &ct->status);
+	}
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_ZONES
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c42ff6a..ac21514 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -479,9 +479,9 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item)
 	unsigned int flags = 0, group;
 	int err;
 
-	/* ignore our fake conntrack entry */
-	if (ct == &nf_conntrack_untracked)
-		return 0;
+//	/* ignore our fake conntrack entry */
+//	if (ct == &nf_conntrack_untracked)
+//		return 0;
 
 	if (events & (1 << IPCT_DESTROY)) {
 		type = IPCTNL_MSG_CT_DELETE;
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 562bf32..5723f9a 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -29,9 +29,13 @@ static unsigned int xt_ct_target(struct sk_buff *skb,
 	if (skb->nfct != NULL)
 		return XT_CONTINUE;
 
+	skb->nfctinfo = IP_CT_NEW;
+	if (info->flags & XT_CT_NOTRACK) {
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
+		skb->nfctinfo = IP_CT_UNTRACKED;
+	}
 	atomic_inc(&ct->ct_general.use);
 	skb->nfct = &ct->ct_general;
-	skb->nfctinfo = IP_CT_NEW;
 
 	return XT_CONTINUE;
 }
@@ -67,8 +71,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par)
 		return -EINVAL;
 
 	if (info->flags & XT_CT_NOTRACK) {
-		ct = &nf_conntrack_untracked;
-		atomic_inc(&ct->ct_general.use);
+		ct = &__get_cpu_var(pcpu_nf_conntrack_untracked);
 		goto out;
 	}
 
@@ -132,14 +135,14 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par)
 	struct nf_conn *ct = info->ct;
 	struct nf_conn_help *help;
 
-	if (ct != &nf_conntrack_untracked) {
+	if (!(info->flags & XT_CT_NOTRACK)) {
 		help = nfct_help(ct);
 		if (help)
 			module_put(help->helper->me);
 
 		nf_ct_l3proto_module_put(par->family);
+		nf_ct_put(info->ct);
 	}
-	nf_ct_put(info->ct);
 }
 
 static struct xt_target xt_ct_tg __read_mostly = {
diff --git a/net/netfilter/xt_NOTRACK.c b/net/netfilter/xt_NOTRACK.c
index 512b912..9547b58 100644
--- a/net/netfilter/xt_NOTRACK.c
+++ b/net/netfilter/xt_NOTRACK.c
@@ -23,8 +23,8 @@ notrack_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	   If there is a real ct entry correspondig to this packet,
 	   it'll hang aroun till timing out. We don't deal with it
 	   for performance reasons. JK */
-	skb->nfct = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 
 	return XT_CONTINUE;
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 859d9fd..b8e46b3 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -104,8 +104,8 @@ tee_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 #ifdef WITH_CONNTRACK
 	/* Avoid counting cloned packets towards the original connection. */
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	/*
@@ -177,8 +177,8 @@ tee_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 
 #ifdef WITH_CONNTRACK
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
-	skb->nfctinfo = IP_CT_NEW;
+	skb->nfct     = &__get_cpu_var(pcpu_nf_conntrack_untracked).ct_general;
+	skb->nfctinfo = IP_CT_UNTRACKED;
 	nf_conntrack_get(skb->nfct);
 #endif
 	if (par->hooknum == NF_INET_PRE_ROUTING ||
diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c
index 30b95a1..b26f94d 100644
--- a/net/netfilter/xt_cluster.c
+++ b/net/netfilter/xt_cluster.c
@@ -120,7 +120,7 @@ xt_cluster_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (ct == NULL)
 		return false;
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		return false;
 
 	if (ct->master)
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 39681f1..95bcfbb 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -123,7 +123,7 @@ conntrack_mt(const struct sk_buff *skb, struct xt_action_param *par,
 
 	ct = nf_ct_get(skb, &ctinfo);
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(skb))
 		statebit = XT_CONNTRACK_STATE_UNTRACKED;
 	else if (ct != NULL)
 		statebit = XT_CONNTRACK_STATE_BIT(ctinfo);
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 3d54c23..1f760b5 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -127,7 +127,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	 * reply packet of an established SNAT-ted connection. */
 
 	ct = nf_ct_get(skb, &ctinfo);
-	if (ct && (ct != &nf_conntrack_untracked) &&
+	if (ct && nf_ct_is_tracked(skb) &&
 	    ((iph->protocol != IPPROTO_ICMP &&
 	      ctinfo == IP_CT_IS_REPLY + IP_CT_ESTABLISHED) ||
 	     (iph->protocol == IPPROTO_ICMP &&




* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
  2010-06-01 16:20                                   ` [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked Eric Dumazet
@ 2010-06-04 11:40                                     ` Patrick McHardy
  2010-06-04 12:10                                       ` Changli Gao
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-06-04 11:40 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Netfilter Developers, netdev

Eric Dumazet wrote:
> On Tue, Jun 1, 2010 at 12:41 +0200, Patrick McHardy wrote:
> 
>>> BTW, I notice nf_conntrack_untracked is incorrectly annotated
>>> '__read_mostly'.
>>>
>>> It can be written very often :(
>>>
>>> Shouldn't we special-case it and make it really const?
>> That would need quite a bit of special-casing to avoid touching
>> the reference counts. So far this is completely hidden, so I'd
>> say it just shouldn't be marked __read_mostly. Alternatively we
>> can make "untracked" an nfctinfo state.
> 
> I tried this suggestion (a new IP_CT_UNTRACKED ctinfo) over a per_cpu
> untracked ct, but it's a bit hard.
> 
> For example, I cannot find a way to change ctnetlink_conntrack_event():
> 
> 	if (ct == &nf_conntrack_untracked)
> 		return 0;
> 
> Maybe this piece of code is not necessary, since we should not come here
> anyway; or does it mean several packets could store events for this
> (shared) ct?

We probably shouldn't be reaching that code, since that would mean
that we previously did modifications to the untracked conntrack.
But a quick audit shows that, for instance, xt_connmark will do just that.

> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
> Would it be acceptable?

That also would be fine. However, the main idea behind using an nfctinfo
bit was that we wouldn't need the untracked conntrack anymore at all.
But I guess a per-cpu untracked conntrack would already be an improvement
over the current situation.


* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
  2010-06-04 11:40                                     ` Patrick McHardy
@ 2010-06-04 12:10                                       ` Changli Gao
  2010-06-04 12:29                                         ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Changli Gao @ 2010-06-04 12:10 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Eric Dumazet, Netfilter Developers, netdev

On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
> Eric Dumazet wrote:
>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
>> Would it be acceptable?
>
> That also would be fine. However, the main idea behind using an nfctinfo
> bit was that we wouldn't need the untracked conntrack anymore at all.
> But I guess a per-cpu untracked conntrack would already be an improvement
> over the current situation.

I think Eric didn't mean an ip_conntrack_info bit but an
ip_conntrack_status bit. Since we already have an IPS_TEMPLATE bit,
I think another IPS_UNTRACKED bit is also acceptable.
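The attraction of a status bit can be sketched as follows: with one fake conntrack per cpu there is no longer a single address to compare against, but a flag test works for any instance. The bit position and names here are illustrative only (the eventual kernel patch defines its own IPS_UNTRACKED_BIT):

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4			/* illustrative */
#define IPS_UNTRACKED_BIT 12		/* illustrative bit position */
#define IPS_UNTRACKED (1 << IPS_UNTRACKED_BIT)

struct ct {
	unsigned long status;
};

static struct ct untracked[NR_CPUS];	/* one fake conntrack per cpu */

/* A pointer comparison like 'ct == &nf_conntrack_untracked' breaks as
 * soon as the fake conntrack becomes per-cpu; testing a status bit
 * does not care which instance the skb happens to hold. */
static bool ct_is_untracked(const struct ct *c)
{
	return c->status & IPS_UNTRACKED;
}
```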

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
  2010-06-04 12:10                                       ` Changli Gao
@ 2010-06-04 12:29                                         ` Patrick McHardy
  2010-06-04 12:36                                           ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-06-04 12:29 UTC (permalink / raw)
  To: Changli Gao; +Cc: Eric Dumazet, Netfilter Developers, netdev

Changli Gao wrote:
> On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
>> Eric Dumazet wrote:
>>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
>>> Would it be acceptable?
>> That also would be fine. However, the main idea behind using an nfctinfo
>> bit was that we wouldn't need the untracked conntrack anymore at all.
>> But I guess a per-cpu untracked conntrack would already be an improvement
>> over the current situation.
> 
> I think Eric didn't mean an ip_conntrack_info bit but an
> ip_conntrack_status bit. Since we already have an IPS_TEMPLATE bit,
> I think another IPS_UNTRACKED bit is also acceptable.

Yes, of course. But using one of these bits implies that we'd still
have the untracked conntrack.


* Re: [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked
  2010-06-04 12:29                                         ` Patrick McHardy
@ 2010-06-04 12:36                                           ` Eric Dumazet
  2010-06-04 16:25                                             ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-04 12:36 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev

On Fri, Jun 4, 2010 at 14:29 +0200, Patrick McHardy wrote:
> Changli Gao wrote:
> > On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
> >> Eric Dumazet wrote:
> >>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
> >>> Would it be acceptable?
> >> That also would be fine. However, the main idea behind using an nfctinfo
> >> bit was that we wouldn't need the untracked conntrack anymore at all.
> >> But I guess a per-cpu untracked conntrack would already be an improvement
> >> over the current situation.
> > 
> > I think Eric didn't mean an ip_conntrack_info bit but an
> > ip_conntrack_status bit. Since we already have an IPS_TEMPLATE bit,
> > I think another IPS_UNTRACKED bit is also acceptable.
> 
> Yes, of course. But using one of these bits implies that we'd still
> have the untracked conntrack.

Yes, it was my idea, with a per_cpu untracked conntrack.

I'll submit a patch, thanks.





* [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
  2010-06-04 12:36                                           ` Eric Dumazet
@ 2010-06-04 16:25                                             ` Eric Dumazet
  2010-06-04 20:15                                               ` [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking Eric Dumazet
  2010-06-08 14:12                                               ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Patrick McHardy
  0 siblings, 2 replies; 54+ messages in thread
From: Eric Dumazet @ 2010-06-04 16:25 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev

On Fri, Jun 4, 2010 at 14:36 +0200, Eric Dumazet wrote:
> On Fri, Jun 4, 2010 at 14:29 +0200, Patrick McHardy wrote:
> > Changli Gao wrote:
> > > On Fri, Jun 4, 2010 at 7:40 PM, Patrick McHardy <kaber@trash.net> wrote:
> > >> Eric Dumazet wrote:
> > >>> Obviously, an IPS_UNTRACKED bit would be much easier to implement.
> > >>> Would it be acceptable?
> > >> That also would be fine. However, the main idea behind using an nfctinfo
> > >> bit was that we wouldn't need the untracked conntrack anymore at all.
> > >> But I guess a per-cpu untracked conntrack would already be an improvement
> > >> over the current situation.
> > > 
> > > I think Eric didn't mean an ip_conntrack_info bit but an
> > > ip_conntrack_status bit. Since we already have an IPS_TEMPLATE bit,
> > > I think another IPS_UNTRACKED bit is also acceptable.
> > 
> > Yes, of course. But using one of these bits implies that we'd still
> > have the untracked conntrack.
> 
> Yes, it was my idea, with a per_cpu untracked conntrack.
> 
> I'll submit a patch, thanks.
> 
> 

Here is first part, introducing IPS_UNTRACKED bit and various helpers to
abstract nf_conntrack_untracked access.

I'll cook second patch in a couple of hours for per_cpu conversion.

Thanks !

[PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet. This is bad for performance.
__read_mostly annotation is also a bad choice.

This patch introduces IPS_UNTRACKED bit so that we can later use a
per_cpu untracked structure more easily.

A new helper, nf_ct_untracked_get() returns a pointer to
nf_conntrack_untracked.

Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
IPS_NAT_DONE_MASK bits to untracked status.

nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/netfilter/nf_conntrack_common.h  |    4 ++++
 include/net/netfilter/nf_conntrack.h           |   12 +++++++++---
 include/net/netfilter/nf_conntrack_core.h      |    2 +-
 net/ipv4/netfilter/nf_nat_core.c               |    2 +-
 net/ipv4/netfilter/nf_nat_standalone.c         |    2 +-
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |    2 +-
 net/netfilter/nf_conntrack_core.c              |   11 ++++++++---
 net/netfilter/nf_conntrack_netlink.c           |    2 +-
 net/netfilter/xt_CT.c                          |    4 ++--
 net/netfilter/xt_NOTRACK.c                     |    2 +-
 net/netfilter/xt_TEE.c                         |    4 ++--
 net/netfilter/xt_cluster.c                     |    2 +-
 net/netfilter/xt_conntrack.c                   |   11 ++++++-----
 net/netfilter/xt_socket.c                      |    2 +-
 net/netfilter/xt_state.c                       |   14 ++++++++------
 15 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/include/linux/netfilter/nf_conntrack_common.h b/include/linux/netfilter/nf_conntrack_common.h
index 14e6d32..1afd18c 100644
--- a/include/linux/netfilter/nf_conntrack_common.h
+++ b/include/linux/netfilter/nf_conntrack_common.h
@@ -76,6 +76,10 @@ enum ip_conntrack_status {
 	/* Conntrack is a template */
 	IPS_TEMPLATE_BIT = 11,
 	IPS_TEMPLATE = (1 << IPS_TEMPLATE_BIT),
+
+	/* Conntrack is a fake untracked entry */
+	IPS_UNTRACKED_BIT = 12,
+	IPS_UNTRACKED = (1 << IPS_UNTRACKED_BIT),
 };
 
 /* Connection tracking event types */
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..3bc38c7 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,7 +261,13 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
-extern struct nf_conn nf_conntrack_untracked;
+static inline struct nf_conn *nf_ct_untracked_get(void)
+{
+	extern struct nf_conn nf_conntrack_untracked;
+
+	return &nf_conntrack_untracked;
+}
+extern void nf_ct_untracked_status_or(unsigned long bits);
 
 /* Iterate over all conntracks: if iter returns true, it's deleted. */
 extern void
@@ -289,9 +295,9 @@ static inline int nf_ct_is_dying(struct nf_conn *ct)
 	return test_bit(IPS_DYING_BIT, &ct->status);
 }
 
-static inline int nf_ct_is_untracked(const struct sk_buff *skb)
+static inline int nf_ct_is_untracked(const struct nf_conn *ct)
 {
-	return (skb->nfct == &nf_conntrack_untracked.ct_general);
+	return test_bit(IPS_UNTRACKED_BIT, &ct->status);
 }
 
 extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 3d7524f..aced085 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -60,7 +60,7 @@ static inline int nf_conntrack_confirm(struct sk_buff *skb)
 	struct nf_conn *ct = (struct nf_conn *)skb->nfct;
 	int ret = NF_ACCEPT;
 
-	if (ct && ct != &nf_conntrack_untracked) {
+	if (ct && !nf_ct_is_untracked(ct)) {
 		if (!nf_ct_is_confirmed(ct))
 			ret = __nf_conntrack_confirm(skb);
 		if (likely(ret == NF_ACCEPT))
diff --git a/net/ipv4/netfilter/nf_nat_core.c b/net/ipv4/netfilter/nf_nat_core.c
index 4f8bddb..c7719b2 100644
--- a/net/ipv4/netfilter/nf_nat_core.c
+++ b/net/ipv4/netfilter/nf_nat_core.c
@@ -742,7 +742,7 @@ static int __init nf_nat_init(void)
 	spin_unlock_bh(&nf_nat_lock);
 
 	/* Initialize fake conntrack so that NAT will skip it */
-	nf_conntrack_untracked.status |= IPS_NAT_DONE_MASK;
+	nf_ct_untracked_status_or(IPS_NAT_DONE_MASK);
 
 	l3proto = nf_ct_l3proto_find_get((u_int16_t)AF_INET);
 
diff --git a/net/ipv4/netfilter/nf_nat_standalone.c b/net/ipv4/netfilter/nf_nat_standalone.c
index beb2581..6723c68 100644
--- a/net/ipv4/netfilter/nf_nat_standalone.c
+++ b/net/ipv4/netfilter/nf_nat_standalone.c
@@ -98,7 +98,7 @@ nf_nat_fn(unsigned int hooknum,
 		return NF_ACCEPT;
 
 	/* Don't try to NAT if this packet is not conntracked */
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(ct))
 		return NF_ACCEPT;
 
 	nat = nfct_nat(ct);
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 9be8177..1df3c8b 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -208,7 +208,7 @@ icmpv6_error(struct net *net, struct nf_conn *tmpl,
 	type = icmp6h->icmp6_type - 130;
 	if (type >= 0 && type < sizeof(noct_valid_new) &&
 	    noct_valid_new[type]) {
-		skb->nfct = &nf_conntrack_untracked.ct_general;
+		skb->nfct = &nf_ct_untracked_get()->ct_general;
 		skb->nfctinfo = IP_CT_NEW;
 		nf_conntrack_get(skb->nfct);
 		return NF_ACCEPT;
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..6c1da21 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,7 +62,7 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked __read_mostly;
+struct nf_conn nf_conntrack_untracked;
 EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
@@ -1321,6 +1321,12 @@ EXPORT_SYMBOL_GPL(nf_conntrack_set_hashsize);
 module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint,
 		  &nf_conntrack_htable_size, 0600);
 
+void nf_ct_untracked_status_or(unsigned long bits)
+{
+	nf_conntrack_untracked.status |= bits;
+}
+EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or);
+
 static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
@@ -1368,8 +1374,7 @@ static int nf_conntrack_init_init_net(void)
 #endif
 	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
 	/*  - and look it like as a confirmed connection */
-	set_bit(IPS_CONFIRMED_BIT, &nf_conntrack_untracked.status);
-
+	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 	return 0;
 
 #ifdef CONFIG_NF_CONNTRACK_ZONES
diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index c42ff6a..5bae1cd 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -480,7 +480,7 @@ ctnetlink_conntrack_event(unsigned int events, struct nf_ct_event *item)
 	int err;
 
 	/* ignore our fake conntrack entry */
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(ct))
 		return 0;
 
 	if (events & (1 << IPCT_DESTROY)) {
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 562bf32..0cb6053 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -67,7 +67,7 @@ static int xt_ct_tg_check(const struct xt_tgchk_param *par)
 		return -EINVAL;
 
 	if (info->flags & XT_CT_NOTRACK) {
-		ct = &nf_conntrack_untracked;
+		ct = nf_ct_untracked_get();
 		atomic_inc(&ct->ct_general.use);
 		goto out;
 	}
@@ -132,7 +132,7 @@ static void xt_ct_tg_destroy(const struct xt_tgdtor_param *par)
 	struct nf_conn *ct = info->ct;
 	struct nf_conn_help *help;
 
-	if (ct != &nf_conntrack_untracked) {
+	if (!nf_ct_is_untracked(ct)) {
 		help = nfct_help(ct);
 		if (help)
 			module_put(help->helper->me);
diff --git a/net/netfilter/xt_NOTRACK.c b/net/netfilter/xt_NOTRACK.c
index 512b912..9d78218 100644
--- a/net/netfilter/xt_NOTRACK.c
+++ b/net/netfilter/xt_NOTRACK.c
@@ -23,7 +23,7 @@ notrack_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	   If there is a real ct entry correspondig to this packet,
 	   it'll hang aroun till timing out. We don't deal with it
 	   for performance reasons. JK */
-	skb->nfct = &nf_conntrack_untracked.ct_general;
+	skb->nfct = &nf_ct_untracked_get()->ct_general;
 	skb->nfctinfo = IP_CT_NEW;
 	nf_conntrack_get(skb->nfct);
 
diff --git a/net/netfilter/xt_TEE.c b/net/netfilter/xt_TEE.c
index 859d9fd..7a11826 100644
--- a/net/netfilter/xt_TEE.c
+++ b/net/netfilter/xt_TEE.c
@@ -104,7 +104,7 @@ tee_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 #ifdef WITH_CONNTRACK
 	/* Avoid counting cloned packets towards the original connection. */
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
+	skb->nfct     = &nf_ct_untracked_get()->ct_general;
 	skb->nfctinfo = IP_CT_NEW;
 	nf_conntrack_get(skb->nfct);
 #endif
@@ -177,7 +177,7 @@ tee_tg6(struct sk_buff *skb, const struct xt_action_param *par)
 
 #ifdef WITH_CONNTRACK
 	nf_conntrack_put(skb->nfct);
-	skb->nfct     = &nf_conntrack_untracked.ct_general;
+	skb->nfct     = &nf_ct_untracked_get()->ct_general;
 	skb->nfctinfo = IP_CT_NEW;
 	nf_conntrack_get(skb->nfct);
 #endif
diff --git a/net/netfilter/xt_cluster.c b/net/netfilter/xt_cluster.c
index 30b95a1..f4af1bf 100644
--- a/net/netfilter/xt_cluster.c
+++ b/net/netfilter/xt_cluster.c
@@ -120,7 +120,7 @@ xt_cluster_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	if (ct == NULL)
 		return false;
 
-	if (ct == &nf_conntrack_untracked)
+	if (nf_ct_is_untracked(ct))
 		return false;
 
 	if (ct->master)
diff --git a/net/netfilter/xt_conntrack.c b/net/netfilter/xt_conntrack.c
index 39681f1..e536710 100644
--- a/net/netfilter/xt_conntrack.c
+++ b/net/netfilter/xt_conntrack.c
@@ -123,11 +123,12 @@ conntrack_mt(const struct sk_buff *skb, struct xt_action_param *par,
 
 	ct = nf_ct_get(skb, &ctinfo);
 
-	if (ct == &nf_conntrack_untracked)
-		statebit = XT_CONNTRACK_STATE_UNTRACKED;
-	else if (ct != NULL)
-		statebit = XT_CONNTRACK_STATE_BIT(ctinfo);
-	else
+	if (ct) {
+		if (nf_ct_is_untracked(ct))
+			statebit = XT_CONNTRACK_STATE_UNTRACKED;
+		else
+			statebit = XT_CONNTRACK_STATE_BIT(ctinfo);
+	} else
 		statebit = XT_CONNTRACK_STATE_INVALID;
 
 	if (info->match_flags & XT_CONNTRACK_STATE) {
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 3d54c23..1ca8990 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -127,7 +127,7 @@ socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 	 * reply packet of an established SNAT-ted connection. */
 
 	ct = nf_ct_get(skb, &ctinfo);
-	if (ct && (ct != &nf_conntrack_untracked) &&
+	if (ct && !nf_ct_is_untracked(ct) &&
 	    ((iph->protocol != IPPROTO_ICMP &&
 	      ctinfo == IP_CT_IS_REPLY + IP_CT_ESTABLISHED) ||
 	     (iph->protocol == IPPROTO_ICMP &&
diff --git a/net/netfilter/xt_state.c b/net/netfilter/xt_state.c
index e12e053..a507922 100644
--- a/net/netfilter/xt_state.c
+++ b/net/netfilter/xt_state.c
@@ -26,14 +26,16 @@ state_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct xt_state_info *sinfo = par->matchinfo;
 	enum ip_conntrack_info ctinfo;
 	unsigned int statebit;
+	struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 
-	if (nf_ct_is_untracked(skb))
-		statebit = XT_STATE_UNTRACKED;
-	else if (!nf_ct_get(skb, &ctinfo))
+	if (!ct)
 		statebit = XT_STATE_INVALID;
-	else
-		statebit = XT_STATE_BIT(ctinfo);
-
+	else {
+		if (nf_ct_is_untracked(ct))
+			statebit = XT_STATE_UNTRACKED;
+		else
+			statebit = XT_STATE_BIT(ctinfo);
+	}
 	return (sinfo->statemask & statebit);
 }
 



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
  2010-06-04 16:25                                             ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Eric Dumazet
@ 2010-06-04 20:15                                               ` Eric Dumazet
  2010-06-08 14:29                                                 ` Patrick McHardy
  2010-06-08 14:12                                               ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Patrick McHardy
  1 sibling, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-04 20:15 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

We assume same cpu is used for a given packet, entering and exiting the
NOTRACK state.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_conntrack.h |    5 +--
 net/netfilter/nf_conntrack_core.c    |   36 ++++++++++++++++++-------
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
-	extern struct nf_conn nf_conntrack_untracked;
-
-	return &nf_conntrack_untracked;
+	return &__raw_get_cpu_var(nf_conntrack_untracked);
 }
 extern void nf_ct_untracked_status_or(unsigned long bits);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1183,10 +1183,21 @@ static void nf_ct_release_dying_list(struct net *net)
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 
+static int untrack_refs(void)
+{
+	int cnt = 0, cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		cnt += atomic_read(&ct->ct_general.use) - 1;
+	}
+	return cnt;
+}
+
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	while (untrack_refs() > 0)
 		schedule();
 
 	nf_conntrack_helper_fini();
@@ -1323,14 +1334,17 @@ module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint,
 
 void nf_ct_untracked_status_or(unsigned long bits)
 {
-	nf_conntrack_untracked.status |= bits;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu(nf_conntrack_untracked, cpu).status |= bits;
 }
 EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or);
 
 static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
-	int ret;
+	int ret, cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1369,10 +1383,12 @@ static int nf_conntrack_init_init_net(void)
 		goto err_extend;
 #endif
 	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	}
 	/*  - and look it like as a confirmed connection */
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 	return 0;



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
  2010-06-04 16:25                                             ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Eric Dumazet
  2010-06-04 20:15                                               ` [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking Eric Dumazet
@ 2010-06-08 14:12                                               ` Patrick McHardy
  1 sibling, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2010-06-08 14:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev

On 04.06.2010 18:25, Eric Dumazet wrote:
> [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit
> 
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet. This is bad for performance.
> __read_mostly annotation is also a bad choice.
> 
> This patch introduces IPS_UNTRACKED bit so that we can later use a
> per_cpu untracked structure more easily.
> 
> A new helper, nf_ct_untracked_get() returns a pointer to
> nf_conntrack_untracked.
> 
> Another one, nf_ct_untracked_status_or() is used by nf_nat_init() to add
> IPS_NAT_DONE_MASK bits to untracked status.
> 
> nf_ct_is_untracked() prototype is changed to work on a nf_conn pointer.

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
  2010-06-04 20:15                                               ` [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking Eric Dumazet
@ 2010-06-08 14:29                                                 ` Patrick McHardy
  2010-06-08 14:52                                                   ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Patrick McHardy @ 2010-06-08 14:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev

On 04.06.2010 22:15, Eric Dumazet wrote:
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet, slowing down performance.
> 
> This patch converts it to a per_cpu variable.
> 
> We assume same cpu is used for a given packet, entering and exiting the
> NOTRACK state.

That doesn't seem to be a valid assumption, the conntrack entry is
attached to the skb and processing in the output path might get
preempted and rescheduled to a different CPU.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
  2010-06-08 14:29                                                 ` Patrick McHardy
@ 2010-06-08 14:52                                                   ` Eric Dumazet
  2010-06-08 15:12                                                     ` Eric Dumazet
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-08 14:52 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev

On Tuesday, 08 June 2010 at 16:29 +0200, Patrick McHardy wrote:
> On 04.06.2010 22:15, Eric Dumazet wrote:
> > NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> > twice per packet, slowing down performance.
> > 
> > This patch converts it to a per_cpu variable.
> > 
> > We assume same cpu is used for a given packet, entering and exiting the
> > NOTRACK state.
> 
> That doesn't seem to be a valid assumption, the conntrack entry is
> attached to the skb and processing in the output path might get
> preempted and rescheduled to a different CPU.

That's unfortunate.

Ok, only choice then is to not change refcount on the untracked ct, and
keep a shared (read only after setup time) untrack structure.






^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
  2010-06-08 14:52                                                   ` Eric Dumazet
@ 2010-06-08 15:12                                                     ` Eric Dumazet
  2010-06-09 12:45                                                       ` Patrick McHardy
  0 siblings, 1 reply; 54+ messages in thread
From: Eric Dumazet @ 2010-06-08 15:12 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: Changli Gao, Netfilter Developers, netdev

On Tuesday, 08 June 2010 at 16:52 +0200, Eric Dumazet wrote:
> On Tuesday, 08 June 2010 at 16:29 +0200, Patrick McHardy wrote:
> > On 04.06.2010 22:15, Eric Dumazet wrote:
> > > NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> > > twice per packet, slowing down performance.
> > > 
> > > This patch converts it to a per_cpu variable.
> > > 
> > > We assume same cpu is used for a given packet, entering and exiting the
> > > NOTRACK state.
> > 
> > That doesn't seem to be a valid assumption, the conntrack entry is
> > attached to the skb and processing in the output path might get
> > preempted and rescheduled to a different CPU.
> 
> That's unfortunate.
> 
> Ok, only choice then is to not change refcount on the untracked ct, and
> keep a shared (read only after setup time) untrack structure.
> 
> 

Oh well, re-reading my patch, I don't see why I said this in the changelog :)

We lazily select the untracked structure on one cpu, then keep the pointer
to that structure attached to the skb.

The (still atomic) increment / decrement of the refcount is done on the
saved pointer, not on the current cpu's per_cpu structure.

So if a packet is rescheduled to a different CPU, the second cpu will "only"
dirty the first cpu's cache line, and that should almost never happen...

Thanks

[PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking

NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
twice per packet, slowing down performance.

This patch converts it to a per_cpu variable.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/nf_conntrack.h |    5 +--
 net/netfilter/nf_conntrack_core.c    |   36 ++++++++++++++++++-------
 2 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 3bc38c7..84a4b6f 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -261,11 +261,10 @@ extern s16 (*nf_ct_nat_offset)(const struct nf_conn *ct,
 			       u32 seq);
 
 /* Fake conntrack entry for untracked connections */
+DECLARE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
 static inline struct nf_conn *nf_ct_untracked_get(void)
 {
-	extern struct nf_conn nf_conntrack_untracked;
-
-	return &nf_conntrack_untracked;
+	return &__raw_get_cpu_var(nf_conntrack_untracked);
 }
 extern void nf_ct_untracked_status_or(unsigned long bits);
 
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 6c1da21..9c66141 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -62,8 +62,8 @@ EXPORT_SYMBOL_GPL(nf_conntrack_htable_size);
 unsigned int nf_conntrack_max __read_mostly;
 EXPORT_SYMBOL_GPL(nf_conntrack_max);
 
-struct nf_conn nf_conntrack_untracked;
-EXPORT_SYMBOL_GPL(nf_conntrack_untracked);
+DEFINE_PER_CPU(struct nf_conn, nf_conntrack_untracked);
+EXPORT_PER_CPU_SYMBOL(nf_conntrack_untracked);
 
 static int nf_conntrack_hash_rnd_initted;
 static unsigned int nf_conntrack_hash_rnd;
@@ -1183,10 +1183,21 @@ static void nf_ct_release_dying_list(struct net *net)
 	spin_unlock_bh(&nf_conntrack_lock);
 }
 
+static int untrack_refs(void)
+{
+	int cnt = 0, cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		cnt += atomic_read(&ct->ct_general.use) - 1;
+	}
+	return cnt;
+}
+
 static void nf_conntrack_cleanup_init_net(void)
 {
-	/* wait until all references to nf_conntrack_untracked are dropped */
-	while (atomic_read(&nf_conntrack_untracked.ct_general.use) > 1)
+	while (untrack_refs() > 0)
 		schedule();
 
 	nf_conntrack_helper_fini();
@@ -1323,14 +1334,17 @@ module_param_call(hashsize, nf_conntrack_set_hashsize, param_get_uint,
 
 void nf_ct_untracked_status_or(unsigned long bits)
 {
-	nf_conntrack_untracked.status |= bits;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		per_cpu(nf_conntrack_untracked, cpu).status |= bits;
 }
 EXPORT_SYMBOL_GPL(nf_ct_untracked_status_or);
 
 static int nf_conntrack_init_init_net(void)
 {
 	int max_factor = 8;
-	int ret;
+	int ret, cpu;
 
 	/* Idea from tcp.c: use 1/16384 of memory.  On i386: 32MB
 	 * machine has 512 buckets. >= 1GB machines have 16384 buckets. */
@@ -1369,10 +1383,12 @@ static int nf_conntrack_init_init_net(void)
 		goto err_extend;
 #endif
 	/* Set up fake conntrack: to never be deleted, not in any hashes */
-#ifdef CONFIG_NET_NS
-	nf_conntrack_untracked.ct_net = &init_net;
-#endif
-	atomic_set(&nf_conntrack_untracked.ct_general.use, 1);
+	for_each_possible_cpu(cpu) {
+		struct nf_conn *ct = &per_cpu(nf_conntrack_untracked, cpu);
+
+		write_pnet(&ct->ct_net, &init_net);
+		atomic_set(&ct->ct_general.use, 1);
+	}
 	/*  - and look it like as a confirmed connection */
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 	return 0;



^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
  2010-06-08 15:12                                                     ` Eric Dumazet
@ 2010-06-09 12:45                                                       ` Patrick McHardy
  0 siblings, 0 replies; 54+ messages in thread
From: Patrick McHardy @ 2010-06-09 12:45 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Changli Gao, Netfilter Developers, netdev

Eric Dumazet wrote:
> On Tuesday, 08 June 2010 at 16:52 +0200, Eric Dumazet wrote:
>   
>> On Tuesday, 08 June 2010 at 16:29 +0200, Patrick McHardy wrote:
>>     
>>> On 04.06.2010 22:15, Eric Dumazet wrote:
>>>       
>>>> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
>>>> twice per packet, slowing down performance.
>>>>
>>>> This patch converts it to a per_cpu variable.
>>>>
>>>> We assume same cpu is used for a given packet, entering and exiting the
>>>> NOTRACK state.
>>>>         
>>> That doesn't seem to be a valid assumption, the conntrack entry is
>>> attached to the skb and processing in the output path might get
>>> preempted and rescheduled to a different CPU.
>>>       
>> That's unfortunate.
>>
>> Ok, only choice then is to not change refcount on the untracked ct, and
>> keep a shared (read only after setup time) untrack structure.
>>
>>
>>     
>
> Oh well, re-reading my patch, I don't see why I said this in the changelog :)
>
> We lazily select the untrack structure in one cpu, then keep the pointer
> to this untrack structure, attached to ct.
>
> The (still atomic) increment / decrement of refcount is done on the
> saved pointer, not on actual per_cpu structure.
>
> So if a packet is rescheduled on a different CPU, second cpu will "only"
> dirty cache line of other cpu, it probably almost never happens...
>   

Indeed, you're right of course.
> Thanks
>
> [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking
>
> NOTRACK makes all cpus share a cache line on nf_conntrack_untracked
> twice per packet, slowing down performance.
>
> This patch converts it to a per_cpu variable.
>   
Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2010-06-09 12:45 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-04-22 12:58 DDoS attack causing bad effect on conntrack searches Jesper Dangaard Brouer
2010-04-22 13:13 ` Changli Gao
2010-04-22 13:17   ` Patrick McHardy
2010-04-22 14:36     ` Eric Dumazet
2010-04-22 14:53       ` Eric Dumazet
2010-04-22 15:51         ` Paul E. McKenney
2010-04-22 16:02           ` Eric Dumazet
2010-04-22 16:34             ` Paul E. McKenney
2010-04-22 20:38             ` Jesper Dangaard Brouer
2010-04-22 21:03               ` Eric Dumazet
2010-04-22 21:14                 ` Eric Dumazet
2010-04-22 23:44                   ` David Miller
2010-04-23  5:44                     ` Eric Dumazet
2010-04-23  8:13                       ` David Miller
2010-04-23  8:18                         ` David Miller
2010-04-23  8:40                           ` Jesper Dangaard Brouer
2010-04-23 10:36                   ` Patrick McHardy
2010-04-23 11:06                     ` Eric Dumazet
2010-04-22 21:28                 ` Jesper Dangaard Brouer
2010-04-23  7:23                   ` Jan Engelhardt
2010-04-23  7:46                     ` Eric Dumazet
2010-04-23  7:55                       ` Jan Engelhardt
2010-04-23  9:23                         ` Eric Dumazet
2010-04-23 10:55                 ` Patrick McHardy
2010-04-23 11:05                   ` Eric Dumazet
2010-04-23 11:06                     ` Patrick McHardy
2010-04-23 20:57               ` Eric Dumazet
2010-04-24 11:11                 ` Jesper Dangaard Brouer
2010-04-24 20:11                   ` Eric Dumazet
2010-04-26 14:36                     ` Jesper Dangaard Brouer
2010-05-31 21:21                       ` Eric Dumazet
2010-06-01  0:28                         ` Changli Gao
2010-06-01  5:05                           ` Eric Dumazet
2010-06-01  5:48                             ` Changli Gao
2010-06-01 10:18                             ` Patrick McHardy
2010-06-01 10:31                               ` Eric Dumazet
2010-06-01 10:41                                 ` Patrick McHardy
2010-06-01 16:20                                   ` [RFC nf-next-2.6] conntrack: per cpu nf_conntrack_untracked Eric Dumazet
2010-06-04 11:40                                     ` Patrick McHardy
2010-06-04 12:10                                       ` Changli Gao
2010-06-04 12:29                                         ` Patrick McHardy
2010-06-04 12:36                                           ` Eric Dumazet
2010-06-04 16:25                                             ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Eric Dumazet
2010-06-04 20:15                                               ` [PATCH nf-next-2.6 2/2] conntrack: per_cpu untracking Eric Dumazet
2010-06-08 14:29                                                 ` Patrick McHardy
2010-06-08 14:52                                                   ` Eric Dumazet
2010-06-08 15:12                                                     ` Eric Dumazet
2010-06-09 12:45                                                       ` Patrick McHardy
2010-06-08 14:12                                               ` [PATCH nf-next-2.6] conntrack: IPS_UNTRACKED bit Patrick McHardy
2010-04-23 10:56       ` DDoS attack causing bad effect on conntrack searches Patrick McHardy
2010-04-23 12:45         ` Jesper Dangaard Brouer
2010-04-23 13:57           ` Patrick McHardy
2010-04-22 13:31   ` Jesper Dangaard Brouer
2010-04-23 10:35     ` Patrick McHardy
