* tproxy related crash in inet_hashtables
@ 2010-08-13  8:15 Stephen Buck
  2010-08-13 10:54 ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Buck @ 2010-08-13  8:15 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 5089 bytes --]

Recently I encountered a number of crashes related to tproxy on the
2.6.34.1 (x86_64 SMP) kernel. These usually manifested as an oops like the
following (although the bug was confirmed to be present on a vanilla
kernel, this particular trace is from a kernel with some customisations):

[ 1504.765077] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1504.848183] IP: [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
[ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0
[ 1504.980125] Thread overran stack, or stack corrupted
[ 1505.039325] Oops: 0002 #1 SMP
[ 1505.077775] last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings
[ 1505.169166] CPU 0
[ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev 8021q bridge stp llc
[ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp #0 PowerEdge R710
[ 1506.014118] RIP: 0010:[<ffffffff8135a79b>] [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
[ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246
[ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX: ffffea00098e24b0
[ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI: ffffffff815b4600
[ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09: 0000000000000001
[ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011b894fc0
[ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15: 0000000000000003
[ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000) knlGS:0000000000000000
[ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4: 00000000000006f0
[ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000, task ffffffff815b6bc0)
[ 1507.129345] Stack:
[ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001 ffff88011b894fc0
[ 1507.239581] <0> ffffffff815fe560 0000000000000002 ffffc90000003e80 ffffffff8135bbf6
[ 1507.331133] <0> ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c0 ffffc90000003eb0
[ 1507.424833] Call Trace:
[ 1507.453976] <IRQ>
[ 1507.478971] [<ffffffff8135b834>] __inet_twsk_kill+0xb4/0xf0
[ 1507.546538] [<ffffffff8135bbf6>] inet_twdr_do_twkill_work+0x66/0xd0
[ 1507.622408] [<ffffffff8135bd40>] ? inet_twdr_hangman+0x0/0xd0
[ 1507.692041] [<ffffffff8135bd95>] inet_twdr_hangman+0x55/0xd0
[ 1507.760650] [<ffffffff810515ec>] run_timer_softirq+0x18c/0x220
[ 1507.831330] [<ffffffff8104c3b8>] __do_softirq+0xc8/0x1f0
[ 1507.895799] [<ffffffff8100cf5c>] call_softirq+0x1c/0x30
[ 1507.959209] [<ffffffff8100e5f5>] do_softirq+0x45/0x80
[ 1508.020538] [<ffffffff8104c2e7>] irq_exit+0x87/0x90
[ 1508.079797] [<ffffffff813bd0c1>] smp_apic_timer_interrupt+0x71/0x9d
[ 1508.155667] [<ffffffff8100c933>] apic_timer_interrupt+0x13/0x20
[ 1508.227385] <EOI>
[ 1508.252383] [<ffffffff8101317e>] ? mwait_idle+0x7e/0x110
[ 1508.316836] [<ffffffff813bb0bd>] ? __atomic_notifier_call_chain+0xd/0x10
[ 1508.397903] [<ffffffff813bb0d1>] ? atomic_notifier_call_chain+0x11/0x20
[ 1508.477928] [<ffffffff8100ac31>] ? cpu_idle+0x51/0x90
[ 1508.539265] [<ffffffff813aa92b>] ? rest_init+0x6b/0x80
[ 1508.601639] [<ffffffff816230f5>] ? start_kernel+0x2c5/0x370
[ 1508.669200] [<ffffffff81622611>] ? x86_64_start_reservations+0x81/0xc0
[ 1508.748192] [<ffffffff81622726>] ? x86_64_start_kernel+0xd6/0x100
[ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00 55 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10
[ 1509.048123] RIP [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
[ 1509.128157] RSP <ffffc90000003e10>
[ 1509.169759] CR2: 0000000000000000


After spending a while tracking it down, I discovered that the wrong 
locks get held when operating on the bind hash table's chains.

This is due to the listen socket and the child socket having different 
local ports when __inet_inherit_port() is called. The lock is held based 
on the child socket's port, but the list operated on is the one the 
listen socket belongs to.

e.g.
There is a transparent proxy listening on port 9999.
A new HTTP connection (with local port 80) is redirected to the proxy.

The inet_bind_hashbucket locked in this case is table->bhash[80] (via its
.lock), but the inet_bind_bucket the child socket is added to is in the
chain of the table->bhash[9999] inet_bind_hashbucket. This means that if
another connection with a different local port arrived and was redirected
to the proxy, both could be operating on the same chain at the same time.
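
To make the interleaving concrete, here is a rough sketch (illustrative
only; the two-CPU split and the second port, 443, are hypothetical, and
the calls shown are the ones the unpatched __inet_inherit_port() makes):

// CPU0: child A, original destination port 80
//     spin_lock(&table->bhash[inet_bhashfn(net, 80, size)].lock);
//     tb = inet_csk(listener)->icsk_bind_hash;   /* the port 9999 bucket */
//     sk_add_bind_node(childA, &tb->owners);
//
// CPU1: child B, original destination port 443
//     spin_lock(&table->bhash[inet_bhashfn(net, 443, size)].lock);
//     tb = inet_csk(listener)->icsk_bind_hash;   /* same port 9999 bucket */
//     sk_add_bind_node(childB, &tb->owners);
//
// Two different spinlocks are held, yet both CPUs modify the same
// tb->owners list, so the chain can be corrupted.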


Attached is a patch that should fix this by looking up the correct
inet_bind_bucket based on the child's local port whenever the
inet_bind_bucket inherited from the listen socket has a different port
from the child's inet_num. It was built against 2.6.34.1, but should
apply to any mainline kernel.

It is possible the same bug exists in the IPv6 code as well. As I have
not had to deal with IPv6 yet, I have not looked into it.

[-- Attachment #2: tproxy-bind-fix.patch --]
[-- Type: text/x-patch, Size: 1443 bytes --]

commit f517f61e37907ea66e48f224c135a9e1c257bb04
Author: Stephen Buck <stephen.buck@exinda.com>
Date:   Thu Aug 12 16:19:19 2010 +1000

    Fix locking issue in inet_hashtables when using tproxy
    
    When __inet_inherit_port() is called on a tproxy connection the wrong locks
    are held for the inet_bind_bucket it is added to.

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..1ba982f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -105,13 +105,24 @@ EXPORT_SYMBOL(inet_put_port);
 void __inet_inherit_port(struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(child)->inet_num,
+	unsigned short port = inet_sk(child)->inet_num;
+	const int bhash = inet_bhashfn(sock_net(sk), port,
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
+	if (tb->port != port) {
+		struct hlist_node *node;
+		inet_bind_bucket_for_each(tb, node, &head->chain) {
+			if (net_eq(ib_net(tb), sock_net(sk)) && tb->port == port)
+				break;
+		}
+		if (!node)
+			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
+						     sock_net(sk), head, port);
+	}
 	sk_add_bind_node(child, &tb->owners);
 	inet_csk(child)->icsk_bind_hash = tb;
 	spin_unlock(&head->lock);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-13  8:15 tproxy related crash in inet_hashtables Stephen Buck
@ 2010-08-13 10:54 ` Eric Dumazet
  2010-08-13 13:05   ` Stephen Buck
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2010-08-13 10:54 UTC (permalink / raw)
  To: Stephen Buck
  Cc: netdev, Netfilter Development Mailinglist, Patrick McHardy,
	KOVACS Krisztian

On Friday 13 August 2010 at 18:15 +1000, Stephen Buck wrote:
> Recently I encountered a number of crashes related to tproxy on the 
> 2.6.34.1 (x86_64 SMP) kernel. These usually manifested as a bug like the 
> following (Although the bug was confirmed to be present on a vanilla 
> kernel, this particular trace is from a kernel with some customisations):
> 
> [ 1504.765077] BUG: unable to handle kernel NULL pointer dereference at 
> (null)
> [ 1504.848183] IP: [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
> [ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0
> [ 1504.980125] Thread overran stack, or stack corrupted
> [ 1505.039325] Oops: 0002 #1 SMP
> [ 1505.077775] last sysfs file: 
> /sys/devices/system/cpu/cpu15/topology/thread_siblings
> [ 1505.169166] CPU 0
> [ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev 
> 8021q bridge stp llc
> [ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp #0 
> PowerEdge R710
> [ 1506.014118] RIP: 0010:[<ffffffff8135a79b>] [<ffffffff8135a79b>] 
> inet_bind_bucket_destroy+0x1b/0x30
> [ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246
> [ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX: 
> ffffea00098e24b0
> [ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI: 
> ffffffff815b4600
> [ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09: 
> 0000000000000001
> [ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12: 
> ffff88011b894fc0
> [ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15: 
> 0000000000000003
> [ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000) 
> knlGS:0000000000000000
> [ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4: 
> 00000000000006f0
> [ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
> 0000000000000000
> [ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
> 0000000000000400
> [ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000, 
> task ffffffff815b6bc0)
> [ 1507.129345] Stack:
> [ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001 
> ffff88011b894fc0
> [ 1507.239581] <0> ffffffff815fe560 0000000000000002 ffffc90000003e80 
> ffffffff8135bbf6
> [ 1507.331133] <0> ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c0 
> ffffc90000003eb0
> [ 1507.424833] Call Trace:
> [ 1507.453976] <IRQ>
> [ 1507.478971] [<ffffffff8135b834>] __inet_twsk_kill+0xb4/0xf0
> [ 1507.546538] [<ffffffff8135bbf6>] inet_twdr_do_twkill_work+0x66/0xd0
> [ 1507.622408] [<ffffffff8135bd40>] ? inet_twdr_hangman+0x0/0xd0
> [ 1507.692041] [<ffffffff8135bd95>] inet_twdr_hangman+0x55/0xd0
> [ 1507.760650] [<ffffffff810515ec>] run_timer_softirq+0x18c/0x220
> [ 1507.831330] [<ffffffff8104c3b8>] __do_softirq+0xc8/0x1f0
> [ 1507.895799] [<ffffffff8100cf5c>] call_softirq+0x1c/0x30
> [ 1507.959209] [<ffffffff8100e5f5>] do_softirq+0x45/0x80
> [ 1508.020538] [<ffffffff8104c2e7>] irq_exit+0x87/0x90
> [ 1508.079797] [<ffffffff813bd0c1>] smp_apic_timer_interrupt+0x71/0x9d
> [ 1508.155667] [<ffffffff8100c933>] apic_timer_interrupt+0x13/0x20
> [ 1508.227385] <EOI>
> [ 1508.252383] [<ffffffff8101317e>] ? mwait_idle+0x7e/0x110
> [ 1508.316836] [<ffffffff813bb0bd>] ? __atomic_notifier_call_chain+0xd/0x10
> [ 1508.397903] [<ffffffff813bb0d1>] ? atomic_notifier_call_chain+0x11/0x20
> [ 1508.477928] [<ffffffff8100ac31>] ? cpu_idle+0x51/0x90
> [ 1508.539265] [<ffffffff813aa92b>] ? rest_init+0x6b/0x80
> [ 1508.601639] [<ffffffff816230f5>] ? start_kernel+0x2c5/0x370
> [ 1508.669200] [<ffffffff81622611>] ? x86_64_start_reservations+0x81/0xc0
> [ 1508.748192] [<ffffffff81622726>] ? x86_64_start_kernel+0xd6/0x100
> [ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00 55 
> 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10
> [ 1509.048123] RIP [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
> [ 1509.128157] RSP <ffffc90000003e10>
> [ 1509.169759] CR2: 0000000000000000
> 
> 
> After spending a while tracking it down, I discovered that the wrong 
> locks get held when operating on the bind hash table's chains.
> 
> This is due to the listen socket and the child socket having different 
> local ports when __inet_inherit_port() is called. The lock is held based 
> on the child socket's port, but the list operated on is the one the 
> listen socket belongs to.
> 
> e.g.
> There is a transparent proxy listening on port 9999.
> A new http connection (with port 80) is redirected to the proxy.
> 
> The inet_bind_hashbucket locked in this case is table->bhash[80].lock, 
> but the inet_bind_bucket the child socket is added to is in the chain of 
> the table->bhash[9999] inet_bind_hashbucket. This means that if another 
> connection with a different local port arrived and was redirected to the 
> proxy, they could both be operating on the list at the same time.
> 
> 
> Attached is a patch that should fix this by looking up the correct 
> inet_bind_bucket based on the child's local port when the 
> inet_bind_bucket from the listen socket has a different port to the 
> child's inet_num. It was built against 2.6.34.1, but should apply to any 
> mainline kernel.
> 
> It is also possible the same bug exists in the IPv6 code as well. As I 
> have not had to deal with IPv6 yet, I have not had a look.

Hi Stephen

CC netfilter-devel & Patrick & Krisztian

I cannot convince myself this patch is the right fix.

This should probably be fixed in the netfilter tree, not in
net/ipv4/inet_hashtables.c?

Once tproxy is involved, the original port (80) should be changed to
9999 by tproxy (for the SYN packet) and by conntrack for the following
ones.

So the listening socket and its children all use source port 9999?

(inet_sk(child)->inet_num == inet_sk(parent)->inet_num)


You claim the wrong lock is taken at insert time, but are you sure the
right lock is taken at deletion time?

Hmm...



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-13 10:54 ` Eric Dumazet
@ 2010-08-13 13:05   ` Stephen Buck
  2010-08-13 13:55     ` Eric Dumazet
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Buck @ 2010-08-13 13:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Netfilter Development Mailinglist, Patrick McHardy,
	KOVACS Krisztian

  On 13/08/10 20:54, Eric Dumazet wrote:
> Le vendredi 13 août 2010 à 18:15 +1000, Stephen Buck a écrit :
>> Recently I encountered a number of crashes related to tproxy on the
>> 2.6.34.1 (x86_64 SMP) kernel. These usually manifested as a bug like the
>> following (Although the bug was confirmed to be present on a vanilla
>> kernel, this particular trace is from a kernel with some customisations):
>>
>> [ 1504.765077] BUG: unable to handle kernel NULL pointer dereference at
>> (null)
>> [ 1504.848183] IP: [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
>> [ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0
>> [ 1504.980125] Thread overran stack, or stack corrupted
>> [ 1505.039325] Oops: 0002 #1 SMP
>> [ 1505.077775] last sysfs file:
>> /sys/devices/system/cpu/cpu15/topology/thread_siblings
>> [ 1505.169166] CPU 0
>> [ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev
>> 8021q bridge stp llc
>> [ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp #0
>> PowerEdge R710
>> [ 1506.014118] RIP: 0010:[<ffffffff8135a79b>] [<ffffffff8135a79b>]
>> inet_bind_bucket_destroy+0x1b/0x30
>> [ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246
>> [ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX:
>> ffffea00098e24b0
>> [ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI:
>> ffffffff815b4600
>> [ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09:
>> 0000000000000001
>> [ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12:
>> ffff88011b894fc0
>> [ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15:
>> 0000000000000003
>> [ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000)
>> knlGS:0000000000000000
>> [ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>> [ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4:
>> 00000000000006f0
>> [ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>> 0000000000000000
>> [ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>> 0000000000000400
>> [ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000,
>> task ffffffff815b6bc0)
>> [ 1507.129345] Stack:
>> [ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001
>> ffff88011b894fc0
>> [ 1507.239581]<0>  ffffffff815fe560 0000000000000002 ffffc90000003e80
>> ffffffff8135bbf6
>> [ 1507.331133]<0>  ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c0
>> ffffc90000003eb0
>> [ 1507.424833] Call Trace:
>> [ 1507.453976]<IRQ>
>> [ 1507.478971] [<ffffffff8135b834>] __inet_twsk_kill+0xb4/0xf0
>> [ 1507.546538] [<ffffffff8135bbf6>] inet_twdr_do_twkill_work+0x66/0xd0
>> [ 1507.622408] [<ffffffff8135bd40>] ? inet_twdr_hangman+0x0/0xd0
>> [ 1507.692041] [<ffffffff8135bd95>] inet_twdr_hangman+0x55/0xd0
>> [ 1507.760650] [<ffffffff810515ec>] run_timer_softirq+0x18c/0x220
>> [ 1507.831330] [<ffffffff8104c3b8>] __do_softirq+0xc8/0x1f0
>> [ 1507.895799] [<ffffffff8100cf5c>] call_softirq+0x1c/0x30
>> [ 1507.959209] [<ffffffff8100e5f5>] do_softirq+0x45/0x80
>> [ 1508.020538] [<ffffffff8104c2e7>] irq_exit+0x87/0x90
>> [ 1508.079797] [<ffffffff813bd0c1>] smp_apic_timer_interrupt+0x71/0x9d
>> [ 1508.155667] [<ffffffff8100c933>] apic_timer_interrupt+0x13/0x20
>> [ 1508.227385]<EOI>
>> [ 1508.252383] [<ffffffff8101317e>] ? mwait_idle+0x7e/0x110
>> [ 1508.316836] [<ffffffff813bb0bd>] ? __atomic_notifier_call_chain+0xd/0x10
>> [ 1508.397903] [<ffffffff813bb0d1>] ? atomic_notifier_call_chain+0x11/0x20
>> [ 1508.477928] [<ffffffff8100ac31>] ? cpu_idle+0x51/0x90
>> [ 1508.539265] [<ffffffff813aa92b>] ? rest_init+0x6b/0x80
>> [ 1508.601639] [<ffffffff816230f5>] ? start_kernel+0x2c5/0x370
>> [ 1508.669200] [<ffffffff81622611>] ? x86_64_start_reservations+0x81/0xc0
>> [ 1508.748192] [<ffffffff81622726>] ? x86_64_start_kernel+0xd6/0x100
>> [ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00 55
>> 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10
>> [ 1509.048123] RIP [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
>> [ 1509.128157] RSP<ffffc90000003e10>
>> [ 1509.169759] CR2: 0000000000000000
>>
>>
>> After spending a while tracking it down, I discovered that the wrong
>> locks get held when operating on the bind hash table's chains.
>>
>> This is due to the listen socket and the child socket having different
>> local ports when __inet_inherit_port() is called. The lock is held based
>> on the child socket's port, but the list operated on is the one the
>> listen socket belongs to.
>>
>> e.g.
>> There is a transparent proxy listening on port 9999.
>> A new http connection (with port 80) is redirected to the proxy.
>>
>> The inet_bind_hashbucket locked in this case is table->bhash[80].lock,
>> but the inet_bind_bucket the child socket is added to is in the chain of
>> the table->bhash[9999] inet_bind_hashbucket. This means that if another
>> connection with a different local port arrived and was redirected to the
>> proxy, they could both be operating on the list at the same time.
>>
>>
>> Attached is a patch that should fix this by looking up the correct
>> inet_bind_bucket based on the child's local port when the
>> inet_bind_bucket from the listen socket has a different port to the
>> child's inet_num. It was built against 2.6.34.1, but should apply to any
>> mainline kernel.
>>
>> It is also possible the same bug exists in the IPv6 code as well. As I
>> have not had to deal with IPv6 yet, I have not had a look.
> Hi Stephen
>
> CC netfilter-devel&  Patrick&  Krisztian
>
> I cannot convince myself this patch is a right fix.
>
> This probably should be fixed in netfilter tree, not in
> net/ipv4/inet_hashtables.c ?
>
> Once tproxy is involved, the original port (80) should be changed to
> 9999 by tproxy (for SYN packet) and conntrack for following ones.
>
> So listening socket and its children all use source port 9999 ?
>
> (inet_sk(child)->inet_num == inet_sk(parent)->inet_num)
I think you are referring to the older method of transparent proxying. 
Tproxy4 does not rely on conntrack. The socket is created with the 
connection's original ports and IPs.
>
> You claim wrong lock is taken at insert time, but are you sure the right
> lock is taken at deletion time ?
>
> Hmm...
>
Without the patch, you have the same problem in __inet_put_port(). The 
lock is taken based on the child's inet_num, but the icsk_bind_hash of 
the socket was inherited from the parent, so it belongs to a different 
inet_bind_hashbucket.

With the patch, icsk_bind_hash refers to the inet_bind_bucket that was
found by searching the hash table, rather than the one directly inherited
from the parent. This means that the correct lock is chosen for the list
being manipulated.
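
For reference, __inet_put_port() in this era of the tree looks roughly
like the following (trimmed to the relevant lines and quoted from memory,
so treat it as a sketch rather than an exact copy of the source):

static void __inet_put_port(struct sock *sk)
{
	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(sk)->inet_num,
			hashinfo->bhash_size);
	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
	struct inet_bind_bucket *tb;

	/* The lock is chosen from inet_sk(sk)->inet_num (80 for the child)... */
	spin_lock(&head->lock);
	/* ...but the list unlinked from is whatever icsk_bind_hash points at,
	 * which without the patch is the listener's port 9999 bucket. */
	tb = inet_csk(sk)->icsk_bind_hash;
	__sk_del_bind_node(sk);
	tb->num_owners--;
	inet_csk(sk)->icsk_bind_hash = NULL;
	inet_sk(sk)->inet_num = 0;
	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
	spin_unlock(&head->lock);
}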

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-13 13:05   ` Stephen Buck
@ 2010-08-13 13:55     ` Eric Dumazet
  2010-08-14  4:35       ` Stephen Buck
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2010-08-13 13:55 UTC (permalink / raw)
  To: Stephen Buck
  Cc: netdev, Netfilter Development Mailinglist, Patrick McHardy,
	KOVACS Krisztian

On Friday 13 August 2010 at 23:05 +1000, Stephen Buck wrote:
> On 13/08/10 20:54, Eric Dumazet wrote:
> > Le vendredi 13 août 2010 à 18:15 +1000, Stephen Buck a écrit :
> >> Recently I encountered a number of crashes related to tproxy on the
> >> 2.6.34.1 (x86_64 SMP) kernel. These usually manifested as a bug like the
> >> following (Although the bug was confirmed to be present on a vanilla
> >> kernel, this particular trace is from a kernel with some customisations):
> >>
> >> [ 1504.765077] BUG: unable to handle kernel NULL pointer dereference at
> >> (null)
> >> [ 1504.848183] IP: [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
> >> [ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0
> >> [ 1504.980125] Thread overran stack, or stack corrupted
> >> [ 1505.039325] Oops: 0002 #1 SMP
> >> [ 1505.077775] last sysfs file:
> >> /sys/devices/system/cpu/cpu15/topology/thread_siblings
> >> [ 1505.169166] CPU 0
> >> [ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev
> >> 8021q bridge stp llc
> >> [ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp #0
> >> PowerEdge R710
> >> [ 1506.014118] RIP: 0010:[<ffffffff8135a79b>] [<ffffffff8135a79b>]
> >> inet_bind_bucket_destroy+0x1b/0x30
> >> [ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246
> >> [ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX:
> >> ffffea00098e24b0
> >> [ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI:
> >> ffffffff815b4600
> >> [ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09:
> >> 0000000000000001
> >> [ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12:
> >> ffff88011b894fc0
> >> [ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15:
> >> 0000000000000003
> >> [ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000)
> >> knlGS:0000000000000000
> >> [ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> >> [ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4:
> >> 00000000000006f0
> >> [ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >> 0000000000000000
> >> [ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
> >> 0000000000000400
> >> [ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000,
> >> task ffffffff815b6bc0)
> >> [ 1507.129345] Stack:
> >> [ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001
> >> ffff88011b894fc0
> >> [ 1507.239581]<0>  ffffffff815fe560 0000000000000002 ffffc90000003e80
> >> ffffffff8135bbf6
> >> [ 1507.331133]<0>  ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c0
> >> ffffc90000003eb0
> >> [ 1507.424833] Call Trace:
> >> [ 1507.453976]<IRQ>
> >> [ 1507.478971] [<ffffffff8135b834>] __inet_twsk_kill+0xb4/0xf0
> >> [ 1507.546538] [<ffffffff8135bbf6>] inet_twdr_do_twkill_work+0x66/0xd0
> >> [ 1507.622408] [<ffffffff8135bd40>] ? inet_twdr_hangman+0x0/0xd0
> >> [ 1507.692041] [<ffffffff8135bd95>] inet_twdr_hangman+0x55/0xd0
> >> [ 1507.760650] [<ffffffff810515ec>] run_timer_softirq+0x18c/0x220
> >> [ 1507.831330] [<ffffffff8104c3b8>] __do_softirq+0xc8/0x1f0
> >> [ 1507.895799] [<ffffffff8100cf5c>] call_softirq+0x1c/0x30
> >> [ 1507.959209] [<ffffffff8100e5f5>] do_softirq+0x45/0x80
> >> [ 1508.020538] [<ffffffff8104c2e7>] irq_exit+0x87/0x90
> >> [ 1508.079797] [<ffffffff813bd0c1>] smp_apic_timer_interrupt+0x71/0x9d
> >> [ 1508.155667] [<ffffffff8100c933>] apic_timer_interrupt+0x13/0x20
> >> [ 1508.227385]<EOI>
> >> [ 1508.252383] [<ffffffff8101317e>] ? mwait_idle+0x7e/0x110
> >> [ 1508.316836] [<ffffffff813bb0bd>] ? __atomic_notifier_call_chain+0xd/0x10
> >> [ 1508.397903] [<ffffffff813bb0d1>] ? atomic_notifier_call_chain+0x11/0x20
> >> [ 1508.477928] [<ffffffff8100ac31>] ? cpu_idle+0x51/0x90
> >> [ 1508.539265] [<ffffffff813aa92b>] ? rest_init+0x6b/0x80
> >> [ 1508.601639] [<ffffffff816230f5>] ? start_kernel+0x2c5/0x370
> >> [ 1508.669200] [<ffffffff81622611>] ? x86_64_start_reservations+0x81/0xc0
> >> [ 1508.748192] [<ffffffff81622726>] ? x86_64_start_kernel+0xd6/0x100
> >> [ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00 55
> >> 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10
> >> [ 1509.048123] RIP [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
> >> [ 1509.128157] RSP<ffffc90000003e10>
> >> [ 1509.169759] CR2: 0000000000000000
> >>
> >>
> >> After spending a while tracking it down, I discovered that the wrong
> >> locks get held when operating on the bind hash table's chains.
> >>
> >> This is due to the listen socket and the child socket having different
> >> local ports when __inet_inherit_port() is called. The lock is held based
> >> on the child socket's port, but the list operated on is the one the
> >> listen socket belongs to.
> >>
> >> e.g.
> >> There is a transparent proxy listening on port 9999.
> >> A new http connection (with port 80) is redirected to the proxy.
> >>
> >> The inet_bind_hashbucket locked in this case is table->bhash[80].lock,
> >> but the inet_bind_bucket the child socket is added to is in the chain of
> >> the table->bhash[9999] inet_bind_hashbucket. This means that if another
> >> connection with a different local port arrived and was redirected to the
> >> proxy, they could both be operating on the list at the same time.
> >>
> >>
> >> Attached is a patch that should fix this by looking up the correct
> >> inet_bind_bucket based on the child's local port when the
> >> inet_bind_bucket from the listen socket has a different port to the
> >> child's inet_num. It was built against 2.6.34.1, but should apply to any
> >> mainline kernel.
> >>
> >> It is also possible the same bug exists in the IPv6 code as well. As I
> >> have not had to deal with IPv6 yet, I have not had a look.
> > Hi Stephen
> >
> > CC netfilter-devel&  Patrick&  Krisztian
> >
> > I cannot convince myself this patch is a right fix.
> >
> > This probably should be fixed in netfilter tree, not in
> > net/ipv4/inet_hashtables.c ?
> >
> > Once tproxy is involved, the original port (80) should be changed to
> > 9999 by tproxy (for SYN packet) and conntrack for following ones.
> >
> > So listening socket and its children all use source port 9999 ?
> >
> > (inet_sk(child)->inet_num == inet_sk(parent)->inet_num)
> I think you are referring to the older method of transparent proxying. 
> Tproxy4 does not rely on conntrack. The socket is created with the 
> connection's original ports and IPs.

So in your case "netstat -atn" displays port 80, not port 9999, for the
children, and 9999 for the listener?

> >
> > You claim wrong lock is taken at insert time, but are you sure the right
> > lock is taken at deletion time ?
> >
> > Hmm...
> >
> Without the patch, you have the same problem in __inet_put_port(). The 
> lock is taken based on the child's inet_num, but the icsk_bind_hash of 
> the socket was inherited from the parent, so it belongs to a different 
> inet_bind_hashbucket.
> 
> With the patch, inet_bind_hash refers to the inet_bind_bucket that was 
> found by searching the hash table, rather than directly inherited from 
> the parent. This means that the correct lock is chosen for the list 
> being manipulated.

Really?

So lock[80] is taken at insert time (as your patch doesn't change the
locking), and lock[9999] is taken at delete time? How come it doesn't
race?


Look, I am _very_ confused by this stuff.

How can you explain that the lock is taken on table->bhash[80].lock but
your patch adds a chain to table->bhash[9999]? This is not what your
patch is trying to do. Or you should lock table->bhash[9999].lock.


Please check the comment in front of inet_bind_bucket_create() :

/*
 * Allocate and initialize a new local port bind bucket.
 * The bindhash mutex for snum's hash chain must be held here.
 */

Maybe change this comment, because the meaning of "snum's hash" is
clearly very different after the tproxy inclusion (and your patch)?

Your patch changelog is not clear enough, or is very misleading...

commit f517f61e37907ea66e48f224c135a9e1c257bb04
Author: Stephen Buck <stephen.buck@exinda.com>
Date:   Thu Aug 12 16:19:19 2010 +1000

    Fix locking issue in inet_hashtables when using tproxy
    
    When __inet_inherit_port() is called on a tproxy connection the wrong locks
    are held for the inet_bind_bucket it is added to.



Your patch does nothing about the lock itself; it adds a new struct
inet_bind_bucket in case the parent's one is not appropriate and a
matching sibling is not yet present.

I suggest you include _all_ of your email in the changelog, because I am
pretty sure nobody will understand the intent of your patch in 2011.

I am sure the full explanation is better than a very short log entry.

Thanks



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-13 13:55     ` Eric Dumazet
@ 2010-08-14  4:35       ` Stephen Buck
  2010-08-15  5:16         ` David Miller
  0 siblings, 1 reply; 7+ messages in thread
From: Stephen Buck @ 2010-08-14  4:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Netfilter Development Mailinglist, Patrick McHardy,
	KOVACS Krisztian

[-- Attachment #1: Type: text/plain, Size: 11863 bytes --]

  On 13/08/10 23:55, Eric Dumazet wrote:
> Le vendredi 13 août 2010 à 23:05 +1000, Stephen Buck a écrit :
>> On 13/08/10 20:54, Eric Dumazet wrote:
>>> Le vendredi 13 août 2010 à 18:15 +1000, Stephen Buck a écrit :
>>>> Recently I encountered a number of crashes related to tproxy on the
>>>> 2.6.34.1 (x86_64 SMP) kernel. These usually manifested as a bug like the
>>>> following (Although the bug was confirmed to be present on a vanilla
>>>> kernel, this particular trace is from a kernel with some customisations):
>>>>
>>>> [ 1504.765077] BUG: unable to handle kernel NULL pointer dereference at
>>>> (null)
>>>> [ 1504.848183] IP: [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
>>>> [ 1504.927126] PGD 1a9933067 PUD 1ad909067 PMD 0
>>>> [ 1504.980125] Thread overran stack, or stack corrupted
>>>> [ 1505.039325] Oops: 0002 #1 SMP
>>>> [ 1505.077775] last sysfs file:
>>>> /sys/devices/system/cpu/cpu15/topology/thread_siblings
>>>> [ 1505.169166] CPU 0
>>>> [ 1505.193070] Modules linked in: sch_sfq cls_fw sch_htb xt_physdev
>>>> 8021q bridge stp llc
>>>> [ 1505.923769] Pid: 0, comm: swapper Not tainted 2.6.31-12EXINDAsmp #0
>>>> PowerEdge R710
>>>> [ 1506.014118] RIP: 0010:[<ffffffff8135a79b>] [<ffffffff8135a79b>]
>>>> inet_bind_bucket_destroy+0x1b/0x30
>>>> [ 1506.122242] RSP: 0018:ffffc90000003e10 EFLAGS: 00010246
>>>> [ 1506.185655] RAX: 0000000000000000 RBX: ffffc900164a02a0 RCX:
>>>> ffffea00098e24b0
>>>> [ 1506.270863] RDX: 0000000000000000 RSI: ffff8802e1186280 RDI:
>>>> ffffffff815b4600
>>>> [ 1506.356077] RBP: ffffc90000003e10 R08: 0000000000000016 R09:
>>>> 0000000000000001
>>>> [ 1506.441284] R10: 0000000000000000 R11: 0000000000000000 R12:
>>>> ffff88011b894fc0
>>>> [ 1506.526500] R13: ffffffff81744c80 R14: ffffffff815fe6c0 R15:
>>>> 0000000000000003
>>>> [ 1506.611722] FS: 0000000000000000(0000) GS:ffffc90000000000(0000)
>>>> knlGS:0000000000000000
>>>> [ 1506.708410] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
>>>> [ 1506.777011] CR2: 0000000000000000 CR3: 00000001ae17a000 CR4:
>>>> 00000000000006f0
>>>> [ 1506.862225] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>>>> 0000000000000000
>>>> [ 1506.947442] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
>>>> 0000000000000400
>>>> [ 1507.032650] Process swapper (pid: 0, threadinfo ffffffff815b0000,
>>>> task ffffffff815b6bc0)
>>>> [ 1507.129345] Stack:
>>>> [ 1507.153299] ffffc90000003e40 ffffffff8135b834 0000000000000001
>>>> ffff88011b894fc0
>>>> [ 1507.239581]<0>   ffffffff815fe560 0000000000000002 ffffc90000003e80
>>>> ffffffff8135bbf6
>>>> [ 1507.331133]<0>   ffffffff815fe560 ffffffff815fe560 ffffffff815fe6c0
>>>> ffffc90000003eb0
>>>> [ 1507.424833] Call Trace:
>>>> [ 1507.453976]<IRQ>
>>>> [ 1507.478971] [<ffffffff8135b834>] __inet_twsk_kill+0xb4/0xf0
>>>> [ 1507.546538] [<ffffffff8135bbf6>] inet_twdr_do_twkill_work+0x66/0xd0
>>>> [ 1507.622408] [<ffffffff8135bd40>] ? inet_twdr_hangman+0x0/0xd0
>>>> [ 1507.692041] [<ffffffff8135bd95>] inet_twdr_hangman+0x55/0xd0
>>>> [ 1507.760650] [<ffffffff810515ec>] run_timer_softirq+0x18c/0x220
>>>> [ 1507.831330] [<ffffffff8104c3b8>] __do_softirq+0xc8/0x1f0
>>>> [ 1507.895799] [<ffffffff8100cf5c>] call_softirq+0x1c/0x30
>>>> [ 1507.959209] [<ffffffff8100e5f5>] do_softirq+0x45/0x80
>>>> [ 1508.020538] [<ffffffff8104c2e7>] irq_exit+0x87/0x90
>>>> [ 1508.079797] [<ffffffff813bd0c1>] smp_apic_timer_interrupt+0x71/0x9d
>>>> [ 1508.155667] [<ffffffff8100c933>] apic_timer_interrupt+0x13/0x20
>>>> [ 1508.227385]<EOI>
>>>> [ 1508.252383] [<ffffffff8101317e>] ? mwait_idle+0x7e/0x110
>>>> [ 1508.316836] [<ffffffff813bb0bd>] ? __atomic_notifier_call_chain+0xd/0x10
>>>> [ 1508.397903] [<ffffffff813bb0d1>] ? atomic_notifier_call_chain+0x11/0x20
>>>> [ 1508.477928] [<ffffffff8100ac31>] ? cpu_idle+0x51/0x90
>>>> [ 1508.539265] [<ffffffff813aa92b>] ? rest_init+0x6b/0x80
>>>> [ 1508.601639] [<ffffffff816230f5>] ? start_kernel+0x2c5/0x370
>>>> [ 1508.669200] [<ffffffff81622611>] ? x86_64_start_reservations+0x81/0xc0
>>>> [ 1508.748192] [<ffffffff81622726>] ? x86_64_start_kernel+0xd6/0x100
>>>> [ 1508.821979] Code: 64 10 40 eb 94 0f 1f 44 00 00 66 0f 1f 44 00 00 55
>>>> 48 89 e5 0f 1f 44 00 00 48 83 7e 18 00 75 19 48 8b 46 08 48 8b 56 10
>>>> [ 1509.048123] RIP [<ffffffff8135a79b>] inet_bind_bucket_destroy+0x1b/0x30
>>>> [ 1509.128157] RSP<ffffc90000003e10>
>>>> [ 1509.169759] CR2: 0000000000000000
>>>>
>>>>
>>>> After spending a while tracking it down, I discovered that the wrong
>>>> locks get held when operating on the bind hash table's chains.
>>>>
>>>> This is due to the listen socket and the child socket having different
>>>> local ports when __inet_inherit_port() is called. The lock is held based
>>>> on the child socket's port, but the list operated on is the one the
>>>> listen socket belongs to.
>>>>
>>>> e.g.
>>>> There is a transparent proxy listening on port 9999.
>>>> A new http connection (with port 80) is redirected to the proxy.
>>>>
>>>> The inet_bind_hashbucket locked in this case is table->bhash[80].lock,
>>>> but the inet_bind_bucket the child socket is added to is in the chain of
>>>> the table->bhash[9999] inet_bind_hashbucket. This means that if another
>>>> connection with a different local port arrived and was redirected to the
>>>> proxy, they could both be operating on the list at the same time.
>>>>
>>>>
>>>> Attached is a patch that should fix this by looking up the correct
>>>> inet_bind_bucket based on the child's local port when the
>>>> inet_bind_bucket from the listen socket has a different port to the
>>>> child's inet_num. It was built against 2.6.34.1, but should apply to any
>>>> mainline kernel.
>>>>
>>>> It is also possible the same bug exists in the IPv6 code as well. As I
>>>> have not had to deal with IPv6 yet, I have not had a look.
>>> Hi Stephen
>>>
>>> CC netfilter-devel&   Patrick&   Krisztian
>>>
>>> I cannot convince myself this patch is a right fix.
>>>
>>> This probably should be fixed in netfilter tree, not in
>>> net/ipv4/inet_hashtables.c ?
>>>
>>> Once tproxy is involved, the original port (80) should be changed to
>>> 9999 by tproxy (for SYN packet) and conntrack for following ones.
>>>
>>> So listening socket and its children all use source port 9999 ?
>>>
>>> (inet_sk(child)->inet_num == inet_sk(parent)->inet_num)
>> I think you are referring to the older method of transparent proxying.
>> Tproxy4 does not rely on conntrack. The socket is created with the
>> connection's original ports and IPs.
> So in your case "netstat -atn" displays port 80 , not port 9999 for
> children, and 9999 for the listener ?
Yes, that is correct.
>>> You claim wrong lock is taken at insert time, but are you sure the right
>>> lock is taken at deletion time ?
>>>
>>> Hmm...
>>>
>> Without the patch, you have the same problem in __inet_put_port(). The
>> lock is taken based on the child's inet_num, but the icsk_bind_hash of
>> the socket was inherited from the parent, so it belongs to a different
>> inet_bind_hashbucket.
>>
>> With the patch, inet_bind_hash refers to the inet_bind_bucket that was
>> found by searching the hash table, rather than directly inherited from
>> the parent. This means that the correct lock is chosen for the list
>> being manipulated.
> Really ?
>
> So lock[80] is taken at insert time (as your patch doesnt change
> locking), and lock[9999] taken at delete time ? How comes it doesnt
> race ?
No, lock[80] is taken at delete time. It is taken using the exact same 
procedure as in __inet_inherit_port().

The problem in __inet_put_port() occurs when
inet_csk(sk)->icsk_bind_hash->port != inet_sk(sk)->inet_num. With the
patch, inet_csk(sk)->icsk_bind_hash->port == inet_sk(sk)->inet_num, so
the locking is correct.

Below, I've attempted to walk through the original code. Hopefully it 
makes things a little clearer.

// inet_sk(sk)->inet_num = 9999
// inet_sk(child)->inet_num = 80
// inet_csk(sk)->icsk_bind_hash is a bucket where
//         inet_csk(sk)->icsk_bind_hash->port = 9999
// inet_csk(child)->icsk_bind_hash is uninitialised

void __inet_inherit_port(struct sock *sk, struct sock *child)
{
         struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
         const int bhash = inet_bhashfn(sock_net(sk), inet_sk(child)->inet_num,
                         table->bhash_size);

// head is retrieved using inet_sk(child)->inet_num, so we end up with
// head = &table->bhash[80]

         struct inet_bind_hashbucket *head = &table->bhash[bhash];
         struct inet_bind_bucket *tb;

         spin_lock(&head->lock);

// Here the child is queued in the same list as the parent. This means that
// the child is queued in an inet_bind_bucket where
// inet_bind_bucket->port = 9999, but the lock being held is bhash[80].lock

         tb = inet_csk(sk)->icsk_bind_hash;
         sk_add_bind_node(child, &tb->owners);

// Now inet_csk(child)->icsk_bind_hash is assigned to the bucket where
// inet_bind_bucket->port = 9999 resulting in the wrong lock being taken
// for the list being manipulated in __inet_put_port()

         inet_csk(child)->icsk_bind_hash = tb;
         spin_unlock(&head->lock);
}

// inet_sk(sk)->inet_num = 9999
// inet_sk(child)->inet_num = 80
// inet_csk(sk)->icsk_bind_hash is a bucket where
//         inet_csk(sk)->icsk_bind_hash->port = 9999
// inet_csk(child)->icsk_bind_hash is a bucket where
//         inet_csk(child)->icsk_bind_hash->port = 9999

It is the difference between inet_sk(child)->inet_num and 
inet_csk(child)->icsk_bind_hash->port that causes problems in 
__inet_put_port().

With the patch, inet_sk(child)->inet_num and
inet_csk(child)->icsk_bind_hash->port are kept in sync, eliminating the
locking error in __inet_put_port().

> Look I am _very_ confused by this stuff.
>
> How can you explain lock is taken on table->bhash[80].lock but your
> patch adds a chain to table->bhash[9999] ? This is not what your patch
> is trying to do. Or you should lock table->bhash[9999].lock
>
The patch does not cause an inet_bind_bucket to be added to the
table->bhash[9999] chain. It is added (if it doesn't already exist) to
the table->bhash[80] chain, so there is no problem with the locking.
> Please check the comment in front of inet_bind_bucket_create() :
>
> /*
>   * Allocate and initialize a new local port bind bucket.
>   * The bindhash mutex for snum's hash chain must be held here.
>   */
>
> Maybe change this comment, because clearly "snum's hash" meaning is very
> different after tproxy inclusion (and your patch) ?
I'm not sure I agree. The bindhash mutex for snum's hash is held, as 
described. The lock held is bhash[80], and snum is also 80, so the 
comment is correct.
> Your patch Changelog is not clear enough, or very misleading ...
>
> commit f517f61e37907ea66e48f224c135a9e1c257bb04
> Author: Stephen Buck<stephen.buck@exinda.com>
> Date:   Thu Aug 12 16:19:19 2010 +1000
>
>      Fix locking issue in inet_hashtables when using tproxy
>
>      When __inet_inherit_port() is called on a tproxy connection the wrong locks
>      are held for the inet_bind_bucket it is added to.
>
>
>
> Since your patch does nothing about the lock itself, but adds a
> new struct inet_bind_bucket in case the parent one is not appropriate, and
> another sibling is not yet present.
>
> I suggest you include in ChangeLog _all_ your email, because I am
> pretty sure nobody will understand the intent of your patch in 2011.
>
> I am sure the full explanation is better than a very short log entry.
OK, I've reworked the patch's changelog to describe the problem in more
detail and to describe more accurately what it actually does.
> Thanks
>

On a slight tangent, should tb->num_owners be incremented in 
__inet_inherit_port()? It is decremented in __inet_put_port(), causing 
the value to go strongly negative. My interpretation of what it should 
do is count the number of full (non-timewait) sockets in the owners list.

[-- Attachment #2: tproxy-bind-fix.patch --]
[-- Type: text/plain, Size: 3898 bytes --]

commit eedcf97aca5a1c483475216696d5e79fa868ab66
Author: Stephen Buck <stephen.buck@exinda.com>
Date:   Thu Aug 12 16:19:19 2010 +1000

    Put a tproxy socket in the correct bind bucket
    
    When __inet_inherit_port() is called on a tproxy connection, the parent's
    inet_num is not the same as the child's inet_num. This results in the child
    socket being added to the wrong inet_bind_bucket.
    
    In addition to this, the discrepancy causes the wrong lock to be held, so
    that if two connections with different destination ports arrive at the same
    time, the list may be corrupted. Similarly, the wrong lock is held when the
    socket is removed from the bind hash in __inet_put_port().
    
    When the parent's inet_num is different from the child's inet_num, the
    inet_bind_bucket needs to be retrieved directly from the hash table.
    This keeps the child's inet_num and icsk_bind_hash in sync, eliminating
    the locking errors described above.
    
    Below is a walk through the original code, using a scenario where the tproxy
    listen socket is bound to port 9999, and has a new http connection redirected
    to it:
    
    // inet_sk(sk)->inet_num = 9999
    // inet_sk(child)->inet_num = 80
    // inet_csk(sk)->icsk_bind_hash is a bucket where
    //         inet_csk(sk)->icsk_bind_hash->port = 9999
    // inet_csk(child)->icsk_bind_hash is uninitialised
    
    void __inet_inherit_port(struct sock *sk, struct sock *child)
    {
    	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
    	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(child)->inet_num,
    			table->bhash_size);
    
    // head is retrieved using inet_sk(child)->inet_num, so we end up with
    // head = &table->bhash[80]
    
    	struct inet_bind_hashbucket *head = &table->bhash[bhash];
    	struct inet_bind_bucket *tb;
    
    	spin_lock(&head->lock);
    
    // Here the child is queued in the same list as the parent. This means that
    // the child is queued in an inet_bind_bucket where
    // inet_bind_bucket->port = 9999, but the lock being held is bhash[80].lock
    
    	tb = inet_csk(sk)->icsk_bind_hash;
    	sk_add_bind_node(child, &tb->owners);
    
    // Now inet_csk(child)->icsk_bind_hash is assigned to the bucket where
    // inet_bind_bucket->port = 9999 resulting in the wrong lock being taken
    // for the list being manipulated in __inet_put_port()
    
    	inet_csk(child)->icsk_bind_hash = tb;
    	spin_unlock(&head->lock);
    }
    
    // inet_sk(sk)->inet_num = 9999
    // inet_sk(child)->inet_num = 80
    // inet_csk(sk)->icsk_bind_hash is a bucket where
    //         inet_csk(sk)->icsk_bind_hash->port = 9999
    // inet_csk(child)->icsk_bind_hash is a bucket where
    //         inet_csk(child)->icsk_bind_hash->port = 9999

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..1ba982f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -105,13 +105,24 @@ EXPORT_SYMBOL(inet_put_port);
 void __inet_inherit_port(struct sock *sk, struct sock *child)
 {
 	struct inet_hashinfo *table = sk->sk_prot->h.hashinfo;
-	const int bhash = inet_bhashfn(sock_net(sk), inet_sk(child)->inet_num,
+	unsigned short port = inet_sk(child)->inet_num;
+	const int bhash = inet_bhashfn(sock_net(sk), port,
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
+	if (tb->port != port) {
+		struct hlist_node *node;
+		inet_bind_bucket_for_each(tb, node, &head->chain) {
+			if (net_eq(ib_net(tb), sock_net(sk)) && tb->port == port)
+				break;
+		}
+		if (!node)
+			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
+						     sock_net(sk), head, port);
+	}
 	sk_add_bind_node(child, &tb->owners);
 	inet_csk(child)->icsk_bind_hash = tb;
 	spin_unlock(&head->lock);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-14  4:35       ` Stephen Buck
@ 2010-08-15  5:16         ` David Miller
  2010-08-16  8:26           ` Stephen Buck
  0 siblings, 1 reply; 7+ messages in thread
From: David Miller @ 2010-08-15  5:16 UTC (permalink / raw)
  To: stephen.buck; +Cc: eric.dumazet, netdev, netfilter-devel, kaber, hidden

From: Stephen Buck <stephen.buck@exinda.com>
Date: Sat, 14 Aug 2010 14:35:21 +1000

> +		}
> +		if (!node)
> +			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
> +						     sock_net(sk), head, port);
> +	}
>  	sk_add_bind_node(child, &tb->owners);

inet_bind_bucket_create() can fail due to memory allocation failure;
you're therefore going to have to find a way to handle that.

And in doing so you're going to find out that there is no easy way out
of this code path on such an error; this code was absolutely not
designed to be able to handle cases like this.
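
Purely to illustrate the difficulty (a hypothetical sketch, not something
this thread settled on): handling the failure would mean letting
__inet_inherit_port() report an error and making the caller that sets up
the child socket unwind, along these lines:

	if (!node) {
		tb = inet_bind_bucket_create(table->bind_bucket_cachep,
					     sock_net(sk), head, port);
		if (!tb) {
			spin_unlock(&head->lock);
			return -ENOMEM;	/* caller must tear down the child */
		}
	}

which in turn means changing __inet_inherit_port() to return int and
auditing each call site for a safe unwind path, which is exactly the part
that is not easy.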

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: tproxy related crash in inet_hashtables
  2010-08-15  5:16         ` David Miller
@ 2010-08-16  8:26           ` Stephen Buck
  0 siblings, 0 replies; 7+ messages in thread
From: Stephen Buck @ 2010-08-16  8:26 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, netfilter-devel, kaber, hidden

On 15/08/10 15:16, David Miller wrote:
> From: Stephen Buck<stephen.buck@exinda.com>
> Date: Sat, 14 Aug 2010 14:35:21 +1000
>
>    
>> +		}
>> +		if (!node)
>> +			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
>> +						     sock_net(sk), head, port);
>> +	}
>>   	sk_add_bind_node(child,&tb->owners);
>>      
> inet_bind_bucket_create() can fail due to memory allocation failure,
> you're therefore going to have to find a way to handle that.
>
> And in doing so you're going to find out that there is no easy
> way out of this code path in such an error, this code was
> absolutely not designed to be able to handle cases like this.
>    
You're right. I didn't cover the failure case. And no, it doesn't look easy.
I'll have another look at the problem when I get a chance.

Thanks for pointing it out.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-08-16  8:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-13  8:15 tproxy related crash in inet_hashtables Stephen Buck
2010-08-13 10:54 ` Eric Dumazet
2010-08-13 13:05   ` Stephen Buck
2010-08-13 13:55     ` Eric Dumazet
2010-08-14  4:35       ` Stephen Buck
2010-08-15  5:16         ` David Miller
2010-08-16  8:26           ` Stephen Buck
