All of lore.kernel.org
 help / color / mirror / Atom feed
* possible bug in tcp_input.c
@ 2003-10-24 16:29 Tomas Szepe
  2003-10-25  2:30 ` David S. Miller
  0 siblings, 1 reply; 8+ messages in thread
From: Tomas Szepe @ 2003-10-24 16:29 UTC (permalink / raw)
  To: davem; +Cc: linux-kernel, netdev, grof

Hi David,

We came up with the attached patch during a hectic oops tracing session
that got started by our sysadmin writing down an oops using the pen & paper
method, no less.  The crashing machine has been a firewall running a very
unusual NAT + QoS configuration, however we believe that we might have
discovered a real bug in 2.4's tcp_input.c.  Since our insight into the
internals of the tcp/ip stack is far from even basic, we are seeking your
opinion on whether we are correct.

The static inline function skb_peek() as defined in include/linux/skbuff.h
returns a pointer to a sk_buff, or NULL when its argument is an empty list
or a pointer to the head element.  Since this is documented behavior, it is
not surprising that all segments of code within tcp_input.c dealing with
a return from skb_peek() take care not to dereference the returned pointer
if it happens to be NULL.  There is an exception, though:

/* tcp_input.c, line 1138 */
static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
{
  return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
}

The passed NULL (and yes, this is where we are getting one) is dereferenced
immediately in:

/* tcp_input.c, line 1133 */
static inline int tcp_skb_timedout(struct tcp_opt *tp, struct sk_buff *skb)
{
  return (tcp_time_stamp - TCP_SKB_CB(skb)->when > tp->rto);
}

with TCP_SKB_CB that is defined as

/* tcp.h, line 1034 */
#define TCP_SKB_CB(__skb)	((struct tcp_skb_cb *)&((__skb)->cb[0]))

We are proposing to cure the problem by adding a simple check in
tcp_head_timedout(), but are not sure whether this is the right
thing to do, because as a friend put it, we seem to be fixing
a leaking faucet in a god damn power plant.

Thanks for any help,
-- 
Tomas Szepe <szepe@pinerecords.com>


diff -urN a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2003-06-13 16:51:39 +0200
+++ b/net/ipv4/tcp_input.c	2003-10-24 17:41:19 +0200
@@ -1138,7 +1138,11 @@
 
 static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
 {
-	return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
+	struct sk_buff *skb = skb_peek(&sk->write_queue);
+	if (skb == NULL)
+		return 1;
+
+	return tp->packets_out && tcp_skb_timedout(tp, skb);
 }
 
 /* Linux NewReno/SACK/FACK/ECN state machine.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-10-24 16:29 possible bug in tcp_input.c Tomas Szepe
@ 2003-10-25  2:30 ` David S. Miller
  2003-10-25  9:12   ` Tomas Szepe
  2003-10-26  6:55   ` Tomas Szepe
  0 siblings, 2 replies; 8+ messages in thread
From: David S. Miller @ 2003-10-25  2:30 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel, netdev, grof

On Fri, 24 Oct 2003 18:29:59 +0200
Tomas Szepe <szepe@pinerecords.com> wrote:

> /* tcp_input.c, line 1138 */
> static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
> {
>   return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
> }
> 
> The passed NULL (and yes, this is where we are getting one) is dereferenced
> immediately in:
> 
> /* tcp_input.c, line 1133 */
> static inline int tcp_skb_timedout(struct tcp_opt *tp, struct sk_buff *skb)
> {
>   return (tcp_time_stamp - TCP_SKB_CB(skb)->when > tp->rto);
> }

If tp->packets_out is non-zero (which by definition it is
in your case else the right hand side of the "&&" would not be
evaluated) then we _MUST_ have some packets in sk->write_queue.

Something is being fiercely corrupted.  Probably some piece of
netfilter is freeing up an SKB one too many times thus corrupting
the TCP write queue list pointers.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-10-25  2:30 ` David S. Miller
@ 2003-10-25  9:12   ` Tomas Szepe
  2003-10-26  6:55   ` Tomas Szepe
  1 sibling, 0 replies; 8+ messages in thread
From: Tomas Szepe @ 2003-10-25  9:12 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev, grof

On Oct-24 2003, Fri, 19:30 -0700
David S. Miller <davem@redhat.com> wrote:

> > /* tcp_input.c, line 1138 */
> > static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
> > {
> >   return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
> > }
> > 
> > The passed NULL (and yes, this is where we are getting one) is dereferenced
> > immediately in:
> > 
> > /* tcp_input.c, line 1133 */
> > static inline int tcp_skb_timedout(struct tcp_opt *tp, struct sk_buff *skb)
> > {
> >   return (tcp_time_stamp - TCP_SKB_CB(skb)->when > tp->rto);
> > }
> 
> If tp->packets_out is non-zero (which by definition it is
> in your case else the right hand side of the "&&" would not be
> evaluated) then we _MUST_ have some packets in sk->write_queue.
> 
> Something is being fiercely corrupted.  Probably some piece of
> netfilter is freeing up an SKB one too many times thus corrupting
> the TCP write queue list pointers.

Ok, thanks.  Here's the backtrace in case you are able to extract
some information out of it.

>>EIP; c01fa86f <tcp_time_to_recover+6f/1c0>   <=====
>>ebx; c606c19c <_end+5d94b7c/85faa40>
>>esi; c606c0e0 <_end+5d94ac0/85faa40>
>>ebp; c606c0e0 <_end+5d94ac0/85faa40>
>>esp; c026bc24 <init_task_union+1c24/2000>
Trace; c01ffbe5 <tcp_reset_xmit_timer+85/f0>
Trace; c01fb46b <tcp_fastretrans_alert+17b/5b0>
Trace; c01fbc67 <tcp_clean_rtx_queue+307/320>
Trace; c01fc06c <tcp_ack+14c/360>
Trace; c01fe6f2 <tcp_rcv_established+372/830>
Trace; c88e3c34 <[ip_tables]__kstrtab_ipt_register_table+0/0>
Trace; c01063a2 <sys_sigaction+72/110>
Trace; c0206835 <tcp_v4_rcv+485/520>
Trace; c01ed38e <ip_local_deliver_finish+14e/170>
Trace; c01e0a3b <nf_hook_slow+bb/190>
Trace; c01ed240 <ip_local_deliver_finish+0/170>
Trace; c01ed008 <ip_local_deliver+68/90>
Trace; c01ed240 <ip_local_deliver_finish+0/170>
Trace; c01ed589 <ip_rcv_finish+1d9/240>
Trace; c8912ddc <.data.end+3ba5/????>
Trace; c01e0ca4 <nf_reinject+194/1c0>
Trace; c01ed3b0 <ip_rcv_finish+0/240>
Trace; c88fa12a <[ip_nat_ftp].data.end+184b/8781>
Trace; c01e21cd <qdisc_restart+6d/100>
Trace; c88fa1cf <[ip_nat_ftp].data.end+18f0/8781>
Trace; c01e08c2 <nf_queue+122/1e0>
Trace; c88fa7c0 <[ip_nat_ftp].data.end+1ee1/8781>
Trace; c01ed3b0 <ip_rcv_finish+0/240>
Trace; c01ed3b0 <ip_rcv_finish+0/240>
Trace; c01e0a97 <nf_hook_slow+117/190>
Trace; c88fa7a0 <[ip_nat_ftp].data.end+1ec1/8781>
Trace; c01ed3b0 <ip_rcv_finish+0/240>
Trace; c88fa7c0 <[ip_nat_ftp].data.end+1ee1/8781>
Trace; c01ed1de <ip_rcv+1ae/210>
Trace; c01ed3b0 <ip_rcv_finish+0/240>
Trace; c01da3ef <netif_receive_skb+12f/1f0>
Trace; c01da533 <process_backlog+83/120>
Trace; c01da644 <net_rx_action+74/110>
Trace; c011b3b5 <do_softirq+95/a0>
Trace; c0108b4a <do_IRQ+9a/a0>
Trace; c0105390 <default_idle+0/40>
Trace; c010b0a8 <call_do_IRQ+5/d>
Trace; c0105390 <default_idle+0/40>
Trace; c01053b3 <default_idle+23/40>
Trace; c0105442 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>

-- 
Tomas Szepe <szepe@pinerecords.com>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-10-25  2:30 ` David S. Miller
  2003-10-25  9:12   ` Tomas Szepe
@ 2003-10-26  6:55   ` Tomas Szepe
  2003-10-27  6:33     ` David S. Miller
  1 sibling, 1 reply; 8+ messages in thread
From: Tomas Szepe @ 2003-10-26  6:55 UTC (permalink / raw)
  To: David S. Miller; +Cc: linux-kernel, netdev, grof, volf

On Oct-24 2003, Fri, 19:30 -0700
David S. Miller <davem@redhat.com> wrote:

> > The passed NULL (and yes, this is where we are getting one) is dereferenced
> > immediately in:
> > 
> > /* tcp_input.c, line 1133 */
> > static inline int tcp_skb_timedout(struct tcp_opt *tp, struct sk_buff *skb)
> > {
> >   return (tcp_time_stamp - TCP_SKB_CB(skb)->when > tp->rto);
> > }
> 
> If tp->packets_out is non-zero (which by definition it is
> in your case else the right hand side of the "&&" would not be
> evaluated) then we _MUST_ have some packets in sk->write_queue.
> 
> Something is being fiercely corrupted.  Probably some piece of
> netfilter is freeing up an SKB one too many times thus corrupting
> the TCP write queue list pointers.

Dave, we've been thinking about this some more and have concluded
that since the problem is a relatively non-fatal one, the kernel
should just print out an "assertion failed" error similar to the
one in tcp_input.c, line 1323 [BUG_TRAP(cnt <= tp->packets_out);]
and maybe fix things up a little rather than oops on a NULL pointer
dereference;  The state in question, although invalid, is possible
and should IMHO be checked for as in all the other "if (skb != NULL)
..." places).

What do you think?  We keep on trying to locate which code is causing
the corruption, meanwhile the affected system has been running crash-lessly
with the attached patch.

Thanks,
-- 
Tomas Szepe <szepe@pinerecords.com>

diff -urN a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c	2003-06-13 16:51:39 +0200
+++ b/net/ipv4/tcp_input.c	2003-10-26 07:29:11 +0100
@@ -1138,7 +1138,19 @@
 
 static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
 {
-	return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
+	struct sk_buff *skb = skb_peek(&sk->write_queue);
+
+	if (skb == NULL) {
+		if (tp->packets_out) {
+			printk("KERNEL: assertion (%s) failed at %s(%d)\n",
+				"skb == NULL && tp->packets_out",
+				__FILE__, __LINE__);
+			tp->packets_out = 0;
+		}
+		return 0;
+	} else {
+		return tp->packets_out && tcp_skb_timedout(tp, skb);
+	}
 }
 
 /* Linux NewReno/SACK/FACK/ECN state machine.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-10-26  6:55   ` Tomas Szepe
@ 2003-10-27  6:33     ` David S. Miller
  0 siblings, 0 replies; 8+ messages in thread
From: David S. Miller @ 2003-10-27  6:33 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel, netdev, grof, volf

On Sun, 26 Oct 2003 07:55:19 +0100
Tomas Szepe <szepe@pinerecords.com> wrote:

> Dave, we've been thinking about this some more and have concluded
> that since the problem is a relatively non-fatal one, the kernel
> should just print out an "assertion failed" error similar to the
> one in tcp_input.c, line 1323 [BUG_TRAP(cnt <= tp->packets_out);]
> and maybe fix things up a little rather than oops on a NULL pointer
> dereference;  The state in question, although invalid, is possible
> and should IMHO be checked for as in all the other "if (skb != NULL)
> ..." places).

If this condition triggers, the lists are corrupt and the kernel
should not try to access this socket in any way whatsoever
past this point.

Therefore it should OOPS, which is exactly what it does now.
A BUG_ON() is an equivalent response as far as I am concerned,
it has the same exact result, and the backtrace shows where the
problem is occuring.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-11-18 13:58   ` Tomas Szepe
@ 2003-11-18 14:01     ` Andi Kleen
  0 siblings, 0 replies; 8+ messages in thread
From: Andi Kleen @ 2003-11-18 14:01 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel, netdev, grof, davem

On Tue, 18 Nov 2003 14:58:05 +0100
Tomas Szepe <szepe@pinerecords.com> wrote:

> On Oct-24 2003, Fri, 19:57 +0200
> Andi Kleen <ak@suse.de> wrote:
> 
> > > /* tcp_input.c, line 1138 */
> > > static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
> > > {
> > >   return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
> > > }
> > 
> > tp->packets_out > 0 implies that there is at least one packet in the write 
> > queue (it counts the number of unacked packets in flight, which are kept
> > in the write queue). When that's not the case something else is wrong.
> 
> Yes, that's exactly what davem said.  The corruption is happening somewhere
> in netsched/imq code that's not even part of the official kernel tree (and
> I'm told there's nobody to maintain the patch at present).

Ignore the mail. It was some machine flushing out an old mail queue
(with some very old mails from me that never made it out) 

I actually wrote it before DaveM if you check the dates ;-)

-Andi

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
  2003-10-24 17:57 ` Andi Kleen
@ 2003-11-18 13:58   ` Tomas Szepe
  2003-11-18 14:01     ` Andi Kleen
  0 siblings, 1 reply; 8+ messages in thread
From: Tomas Szepe @ 2003-11-18 13:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, netdev, grof, davem

On Oct-24 2003, Fri, 19:57 +0200
Andi Kleen <ak@suse.de> wrote:

> > /* tcp_input.c, line 1138 */
> > static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
> > {
> >   return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
> > }
> 
> tp->packets_out > 0 implies that there is at least one packet in the write 
> queue (it counts the number of unacked packets in flight, which are kept
> in the write queue). When that's not the case something else is wrong.

Yes, that's exactly what davem said.  The corruption is happening somewhere
in netsched/imq code that's not even part of the official kernel tree (and
I'm told there's nobody to maintain the patch at present).

Thanks,
-- 
Tomas Szepe <szepe@pinerecords.com>

P.S.  I can post the patchset we've been using on the crashing machines
in case someone's interested, it's reasonably short:

      9101 Jul  6 11:48 bridge-nf-0.0.7-against-2.4.22pre3.diff.gz
      4123 Jul  6 11:14 imq-2.4.22pre3-1.diff.gz
      1883 Jul  6 12:01 imq-nf-20030625-2.4.22pre3.diff.gz

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: possible bug in tcp_input.c
       [not found] <20031024162959.GB11154@louise.pinerecords.com.suse.lists.linux.kernel>
@ 2003-10-24 17:57 ` Andi Kleen
  2003-11-18 13:58   ` Tomas Szepe
  0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2003-10-24 17:57 UTC (permalink / raw)
  To: Tomas Szepe; +Cc: linux-kernel, netdev, grof, davem

Tomas Szepe <szepe@pinerecords.com> writes:

> /* tcp_input.c, line 1138 */
> static inline int tcp_head_timedout(struct sock *sk, struct tcp_opt *tp)
> {
>   return tp->packets_out && tcp_skb_timedout(tp, skb_peek(&sk->write_queue));
> }

tp->packets_out > 0 implies that there is at least one packet in the write 
queue (it counts the number of unacked packets in flight, which are kept
in the write queue). When that's not the case something else is wrong.

-Andi


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2003-11-18 14:02 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2003-10-24 16:29 possible bug in tcp_input.c Tomas Szepe
2003-10-25  2:30 ` David S. Miller
2003-10-25  9:12   ` Tomas Szepe
2003-10-26  6:55   ` Tomas Szepe
2003-10-27  6:33     ` David S. Miller
     [not found] <20031024162959.GB11154@louise.pinerecords.com.suse.lists.linux.kernel>
2003-10-24 17:57 ` Andi Kleen
2003-11-18 13:58   ` Tomas Szepe
2003-11-18 14:01     ` Andi Kleen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.