linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] network: return errors if we know tcp_connect failed
@ 2010-11-11 21:03 Eric Paris
  2010-11-11 21:14 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Eric Paris @ 2010-11-11 21:03 UTC (permalink / raw)
  To: netdev, linux-kernel; +Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber

THIS PATCH IS VERY POSSIBLY WRONG!  But if it is I want some feedback.

Basically what I found was that if I added an iptables rule like so:

iptables -A OUTPUT -p tcp --dport 80 -j DROP

And then ran a web browser like links it would just hang on 'establishing
connection.'  I expected that the application would immediately, or at least
very quickly, get notified that the connect failed.   This waiting for timeout
would be expected if something else dropped the SYN or if we were dropping the
SYN/ACK packet coming back, but I figured if we knew we threw away the SYN we knew
right away that the connection was denied and we should be able to indicate
that to the application.  Yes, I realize this is little different than if the
SYN was dropped in the first network device, but it is different because we
know what happened!  We know that connect() call failed and that there isn't
anything coming back.

What I discovered was that we actually had 2 problems in making it possible.
For userspace to quickly realize the connect failed.  The first was a problem
in the netfilter code which wasn't passing errors back up the stack correctly,
due to what I believe to be a mistake in precedence rules.

http://marc.info/?l=netfilter-devel&m=128950262021804&w=2

And the second was that tcp_connect() was just ignoring the return value from
tcp_transmit_skb().  Maybe this was intentional but I really wish we could
find out that connect failed long before the minutes long timeout.  Once I
fixed both of those issues I find that links gets denied (with EPERM)
immediately when it calls connect().  Is this wrong?  Is this bad to tell
userspace more quickly what happened?  Does passing this error code back up
the stack here break something else?  Why do some functions seem to pay
attention to tcp_transmit_skb() return codes and some functions just ignore
it?  What do others think?

NOT-AT-ALL-SIGNED-OFF-BY
---

 net/ipv4/tcp_output.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e961522..67b8535 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2592,6 +2592,7 @@ int tcp_connect(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *buff;
+	int ret;
 
 	tcp_connect_init(sk);
 
@@ -2614,7 +2615,7 @@ int tcp_connect(struct sock *sk)
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
 	tp->packets_out += tcp_skb_pcount(buff);
-	tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+	ret = tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
 
 	/* We change tp->snd_nxt after the tcp_transmit_skb() call
 	 * in order to make this packet get counted in tcpOutSegs.
@@ -2626,7 +2627,7 @@ int tcp_connect(struct sock *sk)
 	/* Timer for repeating the SYN until an answer. */
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 				  inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL(tcp_connect);
 


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-11 21:03 [RFC PATCH] network: return errors if we know tcp_connect failed Eric Paris
@ 2010-11-11 21:14 ` Eric Dumazet
  2010-11-11 21:58 ` Hua Zhong
  2010-11-12 17:46 ` Alexey Kuznetsov
  2 siblings, 0 replies; 31+ messages in thread
From: Eric Dumazet @ 2010-11-11 21:14 UTC (permalink / raw)
  To: Eric Paris
  Cc: netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji, kaber

Le jeudi 11 novembre 2010 à 16:03 -0500, Eric Paris a écrit :
> THIS PATCH IS VERY POSSIBLY WRONG!  But if it is I want some feedback.
> 
> Basically what I found was that if I added an iptables rule like so:
> 
> iptables -A OUTPUT -p tcp --dport 80 -j DROP
> 
> And then ran a web browser like links it would just hang on 'establishing
> connection.'  I expected that the application would immediately, or at least
> very quickly, get notified that the connect failed.   This waiting for timeout
> would be expected if something else dropped the SYN or if we were dropping the
> SYN/ACK packet coming back, but I figured if we knew we threw away the SYN we knew
> right away that the connection was denied and we should be able to indicate
> that to the application.  Yes, I realize this is little different than if the
> SYN was dropped in the first network device, but it is different because we
> know what happened!  We know that connect() call failed and that there isn't
> anything coming back.
> 
> What I discovered was that we actually had 2 problems in making it possible.
> For userspace to quickly realize the connect failed.  The first was a problem
> in the netfilter code which wasn't passing errors back up the stack correctly,
> due to what I believe to be a mistake in precedence rules.
> 
> http://marc.info/?l=netfilter-devel&m=128950262021804&w=2
> 
> And the second was that tcp_connect() was just ignoring the return value from
> tcp_transmit_skb().  Maybe this was intentional but I really wish we could
> find out that connect failed long before the minutes long timeout.  Once I
> fixed both of those issues I find that links gets denied (with EPERM)
> immediately when it calls connect().  Is this wrong?  Is this bad to tell
> userspace more quickly what happened?  Does passing this error code back up
> the stack here break something else?  Why do some functions seem to pay
> attention to tcp_transmit_skb() return codes and some functions just ignore
> it?  What do others think?
> 


I think its an interesting idea, but a temporary memory shortage would
abort the connect().

We could imagine some special handling of the first packet of a flow
being DROPED for whatever reason (flow control...)

So it needs some refinement I think.

SYN packets should be allowed to be re-transmitted before saying a TCP
connect() cannot succeed.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-11 21:03 [RFC PATCH] network: return errors if we know tcp_connect failed Eric Paris
  2010-11-11 21:14 ` Eric Dumazet
@ 2010-11-11 21:58 ` Hua Zhong
  2010-11-12  7:36   ` Patrick McHardy
  2010-11-12 16:08   ` Eric Paris
  2010-11-12 17:46 ` Alexey Kuznetsov
  2 siblings, 2 replies; 31+ messages in thread
From: Hua Zhong @ 2010-11-11 21:58 UTC (permalink / raw)
  To: 'Eric Paris', netdev, linux-kernel
  Cc: davem, kuznet, pekkas, jmorris, yoshfuji, kaber

> Yes, I realize this is little different than if the
> SYN was dropped in the first network device, but it is different
> because we know what happened!  We know that connect() call failed
> and that there isn't anything coming back.

I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).

It does not only make sense, but also is a highly useful testing technique that we use -j DROP in OUTPUT to emulate network losses and see how the application behaves.





^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-11 21:58 ` Hua Zhong
@ 2010-11-12  7:36   ` Patrick McHardy
  2010-11-12 23:14     ` Hua Zhong
  2010-11-12 16:08   ` Eric Paris
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2010-11-12  7:36 UTC (permalink / raw)
  To: Hua Zhong
  Cc: 'Eric Paris',
	netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji

On 11.11.2010 22:58, Hua Zhong wrote:
>> Yes, I realize this is little different than if the
>> SYN was dropped in the first network device, but it is different
>> because we know what happened!  We know that connect() call failed
>> and that there isn't anything coming back.
> 
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).

It sends an ICMP error or TCP reset. Interpretation is up to TCP.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-11 21:58 ` Hua Zhong
  2010-11-12  7:36   ` Patrick McHardy
@ 2010-11-12 16:08   ` Eric Paris
  2010-11-12 16:15     ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Paris @ 2010-11-12 16:08 UTC (permalink / raw)
  To: Hua Zhong
  Cc: netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji,
	kaber, eric.dumazet, paul.moore

On Thu, 2010-11-11 at 13:58 -0800, Hua Zhong wrote:
> > Yes, I realize this is little different than if the
> > SYN was dropped in the first network device, but it is different
> > because we know what happened!  We know that connect() call failed
> > and that there isn't anything coming back.
> 
> I would argue that -j DROP should behave exactly as the packet is dropped in the network, while -j REJECT should signal the failure to the application as soon as possible (which it doesn't seem to do).
> 
> It does not only make sense, but also is a highly useful testing technique that we use -j DROP in OUTPUT to emulate network losses and see how the application behaves.

I guess I can be a bit more descriptive of my specific situation,
although I'm not sure it matters.  I don't actually plan to drop packets
with -j REJECT or -j DROP, that's just a simple example everyone can see
on their own machine.  I plan to have the packets drop in the selinux
netfilter hook.  The SELinux hook uses NF_DROP/NF_ACCEPT just like any
other netfilter hook.  Maybe the answer is that I need to duplicate the
-j REJECT type operations in the SELinux hook.  -j REJECT doesn't do
what I want today, but if that's the right way forward tell me and I'll
look down that path.

But the path I first started looking down rules in 2 distinct questions:

1) What should netfilter pass back up the stack.  From my looking at
this I see that nf_hook_slow() will convert NF_DROP into -EPERM and pass
that back up the stack.  Is this wrong?  Should it more intelligently
pass errors back up the stack?  Maybe it needs an NF_REJECT as well as
NF_DROP?  NF_DROP returns 0 maybe and NF_REJECT return EPERM?

2) What should the generic TCP code (tcp_connect()) do if the skb failed
to send.  Should it return error codes back up the stack somehow or
should they continue to be ignored?  Obviously continuing to just ignore
information we have doesn't make me happy (otherwise I wouldn't have
started scratching this itch).  But the point about ENOBUFS is well
taken.  Maybe I should make tcp_connect(), or the caller to
tcp_connect() more intelligent about specific error codes?

I'm looking for a path forward.  If SELinux is rejecting the SYN packets
on connect() I want to pass that info to userspace rather than just
hanging.  What's the best way to accomplish that?

-Eric


^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 16:08   ` Eric Paris
@ 2010-11-12 16:15     ` Eric Dumazet
  2010-11-12 16:35       ` David Lamparter
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 16:15 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji, kaber, paul.moore

Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :

> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> to send.  Should it return error codes back up the stack somehow or
> should they continue to be ignored?  Obviously continuing to just ignore
> information we have doesn't make me happy (otherwise I wouldn't have
> started scratching this itch).  But the point about ENOBUFS is well
> taken.  Maybe I should make tcp_connect(), or the caller to
> tcp_connect() more intelligent about specific error codes?
> 
> I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> on connect() I want to pass that info to userspace rather than just
> hanging.  What's the best way to accomplish that?
> 

Eric, if you can differentiate a permanent reject, instead of a
temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
yes, you could make tcp_connect() report to user the permanent error,
and ignore the temporary one.




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 16:15     ` Eric Dumazet
@ 2010-11-12 16:35       ` David Lamparter
  2010-11-12 16:53         ` Eric Paris
  2010-11-12 16:54         ` Patrick McHardy
  0 siblings, 2 replies; 31+ messages in thread
From: David Lamparter @ 2010-11-12 16:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Paris, Hua Zhong, netdev, linux-kernel, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber, paul.moore

On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> 
> > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > to send.  Should it return error codes back up the stack somehow or
> > should they continue to be ignored?  Obviously continuing to just ignore
> > information we have doesn't make me happy (otherwise I wouldn't have
> > started scratching this itch).  But the point about ENOBUFS is well
> > taken.  Maybe I should make tcp_connect(), or the caller to
> > tcp_connect() more intelligent about specific error codes?
> > 
> > I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> > on connect() I want to pass that info to userspace rather than just
> > hanging.  What's the best way to accomplish that?
> > 
> 
> Eric, if you can differentiate a permanent reject, instead of a
> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> yes, you could make tcp_connect() report to user the permanent error,
> and ignore the temporary one.

If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
counterparts, which i guess they do but i didn't read the source ;),
then SELinux should use NF_REJECT in my opinion.

NF_DROP does exactly what the name says, it drops the packet aka
basically puts it in /dev/null. As with writing to /dev/null, you don't
get an error for that. Even more, if in the meantime the DROP rule does
not match anymore, the 2nd or 3rd SYN from the connect() can come
through and establish a connection (think of "-m statistic" & co.)

This is very different from REJECT.

If REJECT doesn't immediately get reported to the application, that *is*
a bug, but last time i checked i got EPERM immediately. I would fix
SELinux to use the same mechanism.


-David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 16:35       ` David Lamparter
@ 2010-11-12 16:53         ` Eric Paris
  2010-11-12 16:54         ` Patrick McHardy
  1 sibling, 0 replies; 31+ messages in thread
From: Eric Paris @ 2010-11-12 16:53 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric Dumazet, Hua Zhong, netdev, linux-kernel, davem, kuznet,
	pekkas, jmorris, yoshfuji, kaber, paul.moore

On Fri, 2010-11-12 at 17:35 +0100, David Lamparter wrote:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> > Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> > 
> > > 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> > > to send.  Should it return error codes back up the stack somehow or
> > > should they continue to be ignored?  Obviously continuing to just ignore
> > > information we have doesn't make me happy (otherwise I wouldn't have
> > > started scratching this itch).  But the point about ENOBUFS is well
> > > taken.  Maybe I should make tcp_connect(), or the caller to
> > > tcp_connect() more intelligent about specific error codes?
> > > 
> > > I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> > > on connect() I want to pass that info to userspace rather than just
> > > hanging.  What's the best way to accomplish that?
> > > 
> > 
> > Eric, if you can differentiate a permanent reject, instead of a
> > temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> > yes, you could make tcp_connect() report to user the permanent error,
> > and ignore the temporary one.
> 
> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.

As it stands today there is no NF_REJECT.  NF_DROP is the only (related)
permitted return value from a netfilter hook.  Maybe I need to change
that fact though.

> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
> 
> This is very different from REJECT.
> 
> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.

I haven't looked at what -j REJECT does (or was intended to do) but it
most certainly does not return an error to sys_connect().  Try it out.

iptables -A OUTPUT -p tcp --dport 80 -j REJECT
links www.google.com

it just hangs on 'making connection'  (exact same for -j DROP)

If everyone agrees that's the wrong behavior (for -j REJECT) I'll work
on fixing that (however is appropriate) and will change the SELinux code
if needed after we've fixed the -j REJECT code.  Obviously there's
problems with my original way to fix the lack of error returns (namely
that I would immediately EACCES for DROP as well as REJECT).

I'm glad to hear that others seem to believe the current code is buggy
and I'm not completely off my rocker to think that applications should
be able to learn somehow that things fell down...

-Eric


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 16:35       ` David Lamparter
  2010-11-12 16:53         ` Eric Paris
@ 2010-11-12 16:54         ` Patrick McHardy
  2010-11-12 17:57           ` a problem tcp_v4_err() Alexey Kuznetsov
  2010-11-12 21:16           ` [RFC PATCH] network: return errors if we know tcp_connect failed David Lamparter
  1 sibling, 2 replies; 31+ messages in thread
From: Patrick McHardy @ 2010-11-12 16:54 UTC (permalink / raw)
  To: David Lamparter
  Cc: Eric Dumazet, Eric Paris, Hua Zhong, netdev, linux-kernel, davem,
	kuznet, pekkas, jmorris, yoshfuji, paul.moore

Am 12.11.2010 17:35, schrieb David Lamparter:
> On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
>> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
>>
>>> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
>>> to send.  Should it return error codes back up the stack somehow or
>>> should they continue to be ignored?  Obviously continuing to just ignore
>>> information we have doesn't make me happy (otherwise I wouldn't have
>>> started scratching this itch).  But the point about ENOBUFS is well
>>> taken.  Maybe I should make tcp_connect(), or the caller to
>>> tcp_connect() more intelligent about specific error codes?
>>>
>>> I'm looking for a path forward.  If SELinux is rejecting the SYN packets
>>> on connect() I want to pass that info to userspace rather than just
>>> hanging.  What's the best way to accomplish that?
>>>
>>
>> Eric, if you can differentiate a permanent reject, instead of a
>> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
>> yes, you could make tcp_connect() report to user the permanent error,
>> and ignore the temporary one.

Indeed. We could even make the NF_DROP return value configurable
by encoding it in the verdict.

> If the netfilter targets DROP/REJECT match the NF_DROP/NF_REJECT
> counterparts, which i guess they do but i didn't read the source ;),
> then SELinux should use NF_REJECT in my opinion.

There is no NF_REJECT.

> NF_DROP does exactly what the name says, it drops the packet aka
> basically puts it in /dev/null. As with writing to /dev/null, you don't
> get an error for that. Even more, if in the meantime the DROP rule does
> not match anymore, the 2nd or 3rd SYN from the connect() can come
> through and establish a connection (think of "-m statistic" & co.)
> 
> This is very different from REJECT.

Returning NF_DROP results in -EPERM getting reported back. As Eric
noticed, this is ignored for SYN packets.

> If REJECT doesn't immediately get reported to the application, that *is*
> a bug, but last time i checked i got EPERM immediately. I would fix
> SELinux to use the same mechanism.

NF_DROP returns -EPERM, the REJECT targets send packets to reject
a connection. Whether this is reported immediately depends on the
error and the protocol in question. Using a TCP reset immediately
resets the connection.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-11 21:03 [RFC PATCH] network: return errors if we know tcp_connect failed Eric Paris
  2010-11-11 21:14 ` Eric Dumazet
  2010-11-11 21:58 ` Hua Zhong
@ 2010-11-12 17:46 ` Alexey Kuznetsov
  2010-11-12 19:28   ` David Miller
  2 siblings, 1 reply; 31+ messages in thread
From: Alexey Kuznetsov @ 2010-11-12 17:46 UTC (permalink / raw)
  To: Eric Paris; +Cc: netdev, linux-kernel, davem, pekkas, jmorris, yoshfuji, kaber

Hello!

On Thu, Nov 11, 2010 at 04:03:41PM -0500, Eric Paris wrote:
> immediately when it calls connect().  Is this wrong?  Is this bad to tell
> userspace more quickly what happened?  Does passing this error code back up
> the stack here break something else?  Why do some functions seem to pay
> attention to tcp_transmit_skb() return codes and some functions just ignore
> it?

Essentially, return value of tcp_transmit_skb() is always ignored.
It is used only for accounting and for some optimization of retransmission behaviour.
Generally, tcp does not react on errors coming outside of tcp protocol.

The only loophole is ICMP error in the same case as yours. In _violation_ of specs
linux immediately aborts unestablished connect on an icmp error. IMHO that thing
which you suggest is correct (of course, provided you filter out transient errors and react only
to EPERM or something like this). It was not done because it was expected
firewall rule prescribing immediate abort is configured with "--reject-with icmp-port-unreachable",
otherwise the rule orders real blackhole.

Alexey

^ permalink raw reply	[flat|nested] 31+ messages in thread

* a problem tcp_v4_err()
  2010-11-12 16:54         ` Patrick McHardy
@ 2010-11-12 17:57           ` Alexey Kuznetsov
  2010-11-12 18:12             ` Eric Dumazet
  2010-11-12 21:16           ` [RFC PATCH] network: return errors if we know tcp_connect failed David Lamparter
  1 sibling, 1 reply; 31+ messages in thread
From: Alexey Kuznetsov @ 2010-11-12 17:57 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Lamparter, Eric Dumazet, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore

Hello!

I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
are performed on unlocked sockets. It looks like at least this BUG_ON():

                skb = tcp_write_queue_head(sk);
                BUG_ON(!skb);

can be easily triggered.

Do I miss something?

Alexey

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 17:57           ` a problem tcp_v4_err() Alexey Kuznetsov
@ 2010-11-12 18:12             ` Eric Dumazet
  2010-11-12 18:21               ` Eric Dumazet
  2010-11-12 18:29               ` Alexey Kuznetsov
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 18:12 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore

Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> Hello!
> 
> I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> are performed on unlocked sockets. It looks like at least this BUG_ON():
> 
>                 skb = tcp_write_queue_head(sk);
>                 BUG_ON(!skb);
> 
> can be easily triggered.
> 
> Do I miss something?
> 

Hi Alexey !

I see socket is locked around line 368,

        bh_lock_sock(sk);
        /* If too many ICMPs get dropped on busy
         * servers this needs to be solved differently.
         */
        if (sock_owned_by_user(sk))
                NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);


Hmm, maybe some goto is missing ;)




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:12             ` Eric Dumazet
@ 2010-11-12 18:21               ` Eric Dumazet
  2010-11-12 18:27                 ` Eric Dumazet
  2010-11-12 18:29               ` Alexey Kuznetsov
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 18:21 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore

Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > Hello!
> > 
> > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > are performed on unlocked sockets. It looks like at least this BUG_ON():
> > 
> >                 skb = tcp_write_queue_head(sk);
> >                 BUG_ON(!skb);
> > 
> > can be easily triggered.
> > 
> > Do I miss something?
> > 
> 
> Hi Alexey !
> 
> I see socket is locked around line 368,
> 
>         bh_lock_sock(sk);
>         /* If too many ICMPs get dropped on busy
>          * servers this needs to be solved differently.
>          */
>         if (sock_owned_by_user(sk))
>                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> 
> 
> Hmm, maybe some goto is missing ;)
> 

Well, goto is not missing.

Why do you think BUG_ON(!skb) can be triggered ?

We test before :

	if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
		!icsk->icsk_backoff)
		break;

So a concurrent user only can add new skb(s) in the (non empty) queue ?




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:21               ` Eric Dumazet
@ 2010-11-12 18:27                 ` Eric Dumazet
  2010-11-12 18:31                   ` Alexey Kuznetsov
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 18:27 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore,
	Damian Lukowski

Le vendredi 12 novembre 2010 à 19:21 +0100, Eric Dumazet a écrit :
> Le vendredi 12 novembre 2010 à 19:12 +0100, Eric Dumazet a écrit :
> > Le vendredi 12 novembre 2010 à 20:57 +0300, Alexey Kuznetsov a écrit :
> > > Hello!
> > > 
> > > I looked at tcp_v4_err() and found something strange. Quite non-trivial operations
> > > are performed on unlocked sockets. It looks like at least this BUG_ON():
> > > 
> > >                 skb = tcp_write_queue_head(sk);
> > >                 BUG_ON(!skb);
> > > 
> > > can be easily triggered.
> > > 
> > > Do I miss something?
> > > 
> > 
> > Hi Alexey !
> > 
> > I see socket is locked around line 368,
> > 
> >         bh_lock_sock(sk);
> >         /* If too many ICMPs get dropped on busy
> >          * servers this needs to be solved differently.
> >          */
> >         if (sock_owned_by_user(sk))
> >                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> > 
> > 
> > Hmm, maybe some goto is missing ;)
> > 
> 
> Well, goto is not missing.
> 
> Why do you think BUG_ON(!skb) can be triggered ?
> 
> We test before :
> 
> 	if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
> 		!icsk->icsk_backoff)
> 		break;
> 
> So a concurrent user only can add new skb(s) in the (non empty) queue ?
> 
> 

Oh well, it seems you are right (backlog processing)

Bug was introduced in commit f1ecd5d9e736660 (Revert Backoff [v3]:
Revert RTO on ICMP destination unreachable) from Damian Lukowski




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:12             ` Eric Dumazet
  2010-11-12 18:21               ` Eric Dumazet
@ 2010-11-12 18:29               ` Alexey Kuznetsov
  2010-11-12 18:33                 ` Eric Dumazet
  1 sibling, 1 reply; 31+ messages in thread
From: Alexey Kuznetsov @ 2010-11-12 18:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore

Hello!

On Fri, Nov 12, 2010 at 07:12:58PM +0100, Eric Dumazet wrote:
> I see socket is locked around line 368,
> 
>         bh_lock_sock(sk);
>         /* If too many ICMPs get dropped on busy
>          * servers this needs to be solved differently.
>          */
>         if (sock_owned_by_user(sk))
>                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> 
> 
> Hmm, maybe some goto is missing ;)

It is not missing, sock_owned_by_user() is checked later when some operation which
cannot be done without lock is required. It was done to save error in sk_err_soft even
when socket is locked.

This code also _understands_ this: look at

                } else if (sock_owned_by_user(sk)) {
                        /* RTO revert clocked out retransmission,
                         * but socket is locked. Will defer. */
                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                                  HZ/20, TCP_RTO_MAX);

but somehow it considers the manipulations with rto/backoff/write_queue as safe.
Seems, they are not.

Alexey




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:27                 ` Eric Dumazet
@ 2010-11-12 18:31                   ` Alexey Kuznetsov
  0 siblings, 0 replies; 31+ messages in thread
From: Alexey Kuznetsov @ 2010-11-12 18:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore,
	Damian Lukowski

Hello!

> Oh well, it seems you are right (backlog processing)

Exactly.

Alexey

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:29               ` Alexey Kuznetsov
@ 2010-11-12 18:33                 ` Eric Dumazet
  2010-11-12 19:22                   ` David Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 18:33 UTC (permalink / raw)
  To: Alexey Kuznetsov
  Cc: Patrick McHardy, David Lamparter, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, pekkas, jmorris, yoshfuji, paul.moore,
	Damian Lukowski

Le vendredi 12 novembre 2010 à 21:29 +0300, Alexey Kuznetsov a écrit :
> Hello!
> 
> On Fri, Nov 12, 2010 at 07:12:58PM +0100, Eric Dumazet wrote:
> > I see socket is locked around line 368,
> > 
> >         bh_lock_sock(sk);
> >         /* If too many ICMPs get dropped on busy
> >          * servers this needs to be solved differently.
> >          */
> >         if (sock_owned_by_user(sk))
> >                 NET_INC_STATS_BH(net, LINUX_MIB_LOCKDROPPEDICMPS);
> > 
> > 
> > Hmm, maybe some goto is missing ;)
> 
> It is not missing, sock_owned_by_user() is checked later when some operation which
> cannot be done without lock is required. It was done to save error in sk_err_soft even
> when socket is locked.
> 
> This code also _understands_ this: look at
> 
>                 } else if (sock_owned_by_user(sk)) {
>                         /* RTO revert clocked out retransmission,
>                          * but socket is locked. Will defer. */
>                         inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>                                                   HZ/20, TCP_RTO_MAX);
> 
> but somehow it considers the manipulations with rto/backoff/write_queue as safe.
> Seems, they are not.

Indeed, right you are, I came to same conclusion.

I CC Damian Lukowski in my previous answer (and this one too)

Thanks !



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 18:33                 ` Eric Dumazet
@ 2010-11-12 19:22                   ` David Miller
  2010-11-12 21:18                     ` Eric Dumazet
  0 siblings, 1 reply; 31+ messages in thread
From: David Miller @ 2010-11-12 19:22 UTC (permalink / raw)
  To: eric.dumazet
  Cc: kuznet, kaber, equinox, eparis, hzhong, netdev, linux-kernel,
	pekkas, jmorris, yoshfuji, paul.moore, damian

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 12 Nov 2010 19:33:23 +0100

> I CC Damian Lukowski in my previous answer (and this one too)

Probably the safest fix is this:

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8f8527d..69ccbc1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		    !icsk->icsk_backoff)
 			break;
 
+		if (sock_owned_by_user(sk))
+			break;
+
 		icsk->icsk_backoff--;
 		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
 					 icsk->icsk_backoff;
@@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		if (remaining) {
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  remaining, TCP_RTO_MAX);
-		} else if (sock_owned_by_user(sk)) {
-			/* RTO revert clocked out retransmission,
-			 * but socket is locked. Will defer. */
-			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-						  HZ/20, TCP_RTO_MAX);
 		} else {
 			/* RTO revert clocked out retransmission.
 			 * Will retransmit now */

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 17:46 ` Alexey Kuznetsov
@ 2010-11-12 19:28   ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2010-11-12 19:28 UTC (permalink / raw)
  To: kuznet; +Cc: eparis, netdev, linux-kernel, pekkas, jmorris, yoshfuji, kaber

From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Fri, 12 Nov 2010 20:46:20 +0300

> The only loophole is ICMP error in the same case as yours. In
> _violation_ of specs linux immediately aborts unestablished connect
> on an icmp error. IMHO that thing which you suggest is correct (of
> course, provided you filter out transient errors and react only to
> EPERM or something like this). It was not done because it was
> expected firewall rule prescribing immediate abort is configured
> with "--reject-with icmp-port-unreachable", otherwise the rule
> orders real blackhole.

The idea to signal on -EPERM might be OK, but if that's also
what things like "-m statistical" and friends end up reporting
then we still cannot do it.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 16:54         ` Patrick McHardy
  2010-11-12 17:57           ` a problem tcp_v4_err() Alexey Kuznetsov
@ 2010-11-12 21:16           ` David Lamparter
  2010-11-12 21:18             ` David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: David Lamparter @ 2010-11-12 21:16 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Lamparter, Eric Dumazet, Eric Paris, Hua Zhong, netdev,
	linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji,
	paul.moore

On Fri, Nov 12, 2010 at 05:54:29PM +0100, Patrick McHardy wrote:
> Am 12.11.2010 17:35, schrieb David Lamparter:
> > On Fri, Nov 12, 2010 at 05:15:32PM +0100, Eric Dumazet wrote:
> >> Le vendredi 12 novembre 2010 à 11:08 -0500, Eric Paris a écrit :
> >>
> >>> 2) What should the generic TCP code (tcp_connect()) do if the skb failed
> >>> to send.  Should it return error codes back up the stack somehow or
> >>> should they continue to be ignored?  Obviously continuing to just ignore
> >>> information we have doesn't make me happy (otherwise I wouldn't have
> >>> started scratching this itch).  But the point about ENOBUFS is well
> >>> taken.  Maybe I should make tcp_connect(), or the caller to
> >>> tcp_connect() more intelligent about specific error codes?
> >>>
> >>> I'm looking for a path forward.  If SELinux is rejecting the SYN packets
> >>> on connect() I want to pass that info to userspace rather than just
> >>> hanging.  What's the best way to accomplish that?
> >>>
> >>
> >> Eric, if you can differentiate a permanent reject, instead of a
> >> temporary one (congestion, or rate limiting, or ENOBUF, or ...), then
> >> yes, you could make tcp_connect() report to user the permanent error,
> >> and ignore the temporary one.
> 
> Indeed. We could even make the NF_DROP return value configurable
> by encoding it in the verdict.

> There is no NF_REJECT.
Ah, sorry, not at home in netfilter, coming from an user perspective here.

> Returning NF_DROP results in -EPERM getting reported back. As Eric
> noticed, this is ignored for SYN packets.

Hrm. But how do you silently drop packets? This seems counterintuitive
or even buggy to me; or at least the netfilter DROP target shouldn't use
this kind of error-reporting drop.

As food for thought I'd like to pose the following rule:
iptables -A OUTPUT -m statistic --mode nth --every 5 -j DROP
which should, to my understanding, still allow the connect to complete,
even if the first SYN got (silently!...) dropped.

Also, i'm *very* sure i was able to trigger a "permission denied" from
either firewall or route rules; weirdly enough i can't get that on my
2.6.35.7 router... (poking older boxes to reproduce it right now)

Just for reference some test results: (heavily cropped)
TL;DR: only tcp-reset and route prohibit work immediately.


+ telnet 74.125.43.105 80
Connected to 74.125.43.105.
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT
  # default w/o reject-with is icmp-port-unreachable
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Connection refused
real    0m3.014s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with tcp-reset
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Connection refused
real    0m0.007s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with host-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: No route to host
real    0m3.010s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with admin-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: No route to host
real    0m3.009s
+ iptables -I OUTPUT -p tcp -d 74.125.43.105 --dport 80 -j REJECT --reject-with net-prohib
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Network is unreachable
real    0m3.011s
+ iptables -F OUTPUT
+ ip route add prohibit 74.125.43.105
+ ip route flush cache
+ telnet 74.125.43.105 80
telnet: connect to address 74.125.43.105: Network is unreachable
real    0m0.007s


-David


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 21:16           ` [RFC PATCH] network: return errors if we know tcp_connect failed David Lamparter
@ 2010-11-12 21:18             ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2010-11-12 21:18 UTC (permalink / raw)
  To: equinox
  Cc: kaber, eric.dumazet, eparis, hzhong, netdev, linux-kernel,
	kuznet, pekkas, jmorris, yoshfuji, paul.moore

From: David Lamparter <equinox@diac24.net>
Date: Fri, 12 Nov 2010 22:16:27 +0100

> As food for thought I'd like to pose the following rule:
> iptables -A OUTPUT -m statistic --mode nth --every 5 -j DROP
> which should, to my understanding, still allow the connect to complete,
> even if the first SYN got (silently!...) dropped.

Yes, I agree and this is pretty much the point I tried to make
earlier.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 19:22                   ` David Miller
@ 2010-11-12 21:18                     ` Eric Dumazet
  2010-11-12 21:36                       ` David Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Dumazet @ 2010-11-12 21:18 UTC (permalink / raw)
  To: David Miller
  Cc: kuznet, kaber, equinox, eparis, hzhong, netdev, linux-kernel,
	pekkas, jmorris, yoshfuji, paul.moore, damian

Le vendredi 12 novembre 2010 à 11:22 -0800, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 12 Nov 2010 19:33:23 +0100
> 
> > I CC Damian Lukowski in my previous answer (and this one too)
> 
> Probably the safest fix is this:
> 
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 8f8527d..69ccbc1 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
>  		    !icsk->icsk_backoff)
>  			break;
>  
> +		if (sock_owned_by_user(sk))
> +			break;
> +
>  		icsk->icsk_backoff--;
>  		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
>  					 icsk->icsk_backoff;
> @@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
>  		if (remaining) {
>  			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>  						  remaining, TCP_RTO_MAX);
> -		} else if (sock_owned_by_user(sk)) {
> -			/* RTO revert clocked out retransmission,
> -			 * but socket is locked. Will defer. */
> -			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> -						  HZ/20, TCP_RTO_MAX);
>  		} else {
>  			/* RTO revert clocked out retransmission.
>  			 * Will retransmit now */


Acked-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: a problem tcp_v4_err()
  2010-11-12 21:18                     ` Eric Dumazet
@ 2010-11-12 21:36                       ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2010-11-12 21:36 UTC (permalink / raw)
  To: eric.dumazet
  Cc: kuznet, kaber, equinox, eparis, hzhong, netdev, linux-kernel,
	pekkas, jmorris, yoshfuji, paul.moore, damian

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 12 Nov 2010 22:18:38 +0100

> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks for reviewing, here is the final bit I committed and
I'll queue this up for whatever -stable trees need this.

--------------------
tcp: Don't change unlocked socket state in tcp_v4_err().

Alexey Kuznetsov noticed a regression introduced by
commit f1ecd5d9e7366609d640ff4040304ea197fbc618
("Revert Backoff [v3]: Revert RTO on ICMP destination unreachable")

The RTO and timer modification code added to tcp_v4_err()
doesn't check sock_owned_by_user(), which if true means we
don't have exclusive access to the socket and therefore cannot
modify it's critical state.

Just skip this new code block if sock_owned_by_user() is true
and eliminate the now superfluous sock_owned_by_user() code
block contained within.

Reported-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
CC: Damian Lukowski <damian@tvk.rwth-aachen.de>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/tcp_ipv4.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 8f8527d..69ccbc1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -415,6 +415,9 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		    !icsk->icsk_backoff)
 			break;
 
+		if (sock_owned_by_user(sk))
+			break;
+
 		icsk->icsk_backoff--;
 		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp) <<
 					 icsk->icsk_backoff;
@@ -429,11 +432,6 @@ void tcp_v4_err(struct sk_buff *icmp_skb, u32 info)
 		if (remaining) {
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
 						  remaining, TCP_RTO_MAX);
-		} else if (sock_owned_by_user(sk)) {
-			/* RTO revert clocked out retransmission,
-			 * but socket is locked. Will defer. */
-			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
-						  HZ/20, TCP_RTO_MAX);
 		} else {
 			/* RTO revert clocked out retransmission.
 			 * Will retransmit now */
-- 
1.7.3.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* RE: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12  7:36   ` Patrick McHardy
@ 2010-11-12 23:14     ` Hua Zhong
  2010-11-15 10:32       ` Patrick McHardy
  0 siblings, 1 reply; 31+ messages in thread
From: Hua Zhong @ 2010-11-12 23:14 UTC (permalink / raw)
  To: 'Patrick McHardy'
  Cc: 'Eric Paris',
	netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji

> On 11.11.2010 22:58, Hua Zhong wrote:
> >> Yes, I realize this is little different than if the
> >> SYN was dropped in the first network device, but it is different
> >> because we know what happened!  We know that connect() call failed
> >> and that there isn't anything coming back.
> >
> > I would argue that -j DROP should behave exactly as the packet is
> dropped in the network, while -j REJECT should signal the failure to
> the application as soon as possible (which it doesn't seem to do).
> 
> It sends an ICMP error or TCP reset. Interpretation is up to TCP.

Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
TCP reset.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-12 23:14     ` Hua Zhong
@ 2010-11-15 10:32       ` Patrick McHardy
  2010-11-15 15:47         ` Eric Paris
  0 siblings, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2010-11-15 10:32 UTC (permalink / raw)
  To: Hua Zhong
  Cc: 'Eric Paris',
	netdev, linux-kernel, davem, kuznet, pekkas, jmorris, yoshfuji

On 13.11.2010 00:14, Hua Zhong wrote:
>> On 11.11.2010 22:58, Hua Zhong wrote:
>>>> Yes, I realize this is little different than if the
>>>> SYN was dropped in the first network device, but it is different
>>>> because we know what happened!  We know that connect() call failed
>>>> and that there isn't anything coming back.
>>>
>>> I would argue that -j DROP should behave exactly as the packet is
>> dropped in the network, while -j REJECT should signal the failure to
>> the application as soon as possible (which it doesn't seem to do).
>>
>> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
> 
> Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
> TCP reset.

Of course there is.

ICMP (default):

iptables -A OUTPUT -p tcp -j REJECT

TCP reset:

iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset

The second one will cause a hard error for the connection.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 10:32       ` Patrick McHardy
@ 2010-11-15 15:47         ` Eric Paris
  2010-11-15 15:57           ` Patrick McHardy
  2010-11-15 20:00           ` Alexey Kuznetsov
  0 siblings, 2 replies; 31+ messages in thread
From: Eric Paris @ 2010-11-15 15:47 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji

On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
> On 13.11.2010 00:14, Hua Zhong wrote:
> >> On 11.11.2010 22:58, Hua Zhong wrote:
> >>>> Yes, I realize this is little different than if the
> >>>> SYN was dropped in the first network device, but it is different
> >>>> because we know what happened!  We know that connect() call failed
> >>>> and that there isn't anything coming back.
> >>>
> >>> I would argue that -j DROP should behave exactly as the packet is
> >> dropped in the network, while -j REJECT should signal the failure to
> >> the application as soon as possible (which it doesn't seem to do).
> >>
> >> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
> > 
> > Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
> > TCP reset.
> 
> Of course there is.
> 
> ICMP (default):
> 
> iptables -A OUTPUT -p tcp -j REJECT
> 
> TCP reset:
> 
> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
> 
> The second one will cause a hard error for the connection.

Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
anything with a local outgoing connection but --reject-with tcp-reset
does something like what I'm looking for.

I notice the heavy lifting for this is done in 
net/ipv4/netfilter/ipt_REJECT.c::send_rest()
(and something very similar for IPv6)

I really don't want to duplicate that code into SELinux (for obvious
reasons) and I'm wondering if anyone has objections to me making it
available outside of netlink and/or suggestions on how to make that code
available outside of netfilter (aka what header to expose it, and does
it still make logical sense in ipt_REJECT.c or somewhere else?)

-Eric


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 15:47         ` Eric Paris
@ 2010-11-15 15:57           ` Patrick McHardy
  2010-11-15 16:04             ` Patrick McHardy
  2010-11-15 16:36             ` Patrick McHardy
  2010-11-15 20:00           ` Alexey Kuznetsov
  1 sibling, 2 replies; 31+ messages in thread
From: Patrick McHardy @ 2010-11-15 15:57 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji

On 15.11.2010 16:47, Eric Paris wrote:
> On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
>> On 13.11.2010 00:14, Hua Zhong wrote:
>>>> On 11.11.2010 22:58, Hua Zhong wrote:
>>>>>> Yes, I realize this is little different than if the
>>>>>> SYN was dropped in the first network device, but it is different
>>>>>> because we know what happened!  We know that connect() call failed
>>>>>> and that there isn't anything coming back.
>>>>>
>>>>> I would argue that -j DROP should behave exactly as the packet is
>>>> dropped in the network, while -j REJECT should signal the failure to
>>>> the application as soon as possible (which it doesn't seem to do).
>>>>
>>>> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
>>>
>>> Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
>>> TCP reset.
>>
>> Of course there is.
>>
>> ICMP (default):
>>
>> iptables -A OUTPUT -p tcp -j REJECT
>>
>> TCP reset:
>>
>> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>>
>> The second one will cause a hard error for the connection.
> 
> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
> anything with a local outgoing connection but --reject-with tcp-reset
> does something like what I'm looking for.
> 
> I notice the heavy lifting for this is done in 
> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
> (and something very similar for IPv6)
> 
> I really don't want to duplicate that code into SELinux (for obvious
> reasons) and I'm wondering if anyone has objections to me making it
> available outside of netlink and/or suggestions on how to make that code
> available outside of netfilter (aka what header to expose it, and does
> it still make logical sense in ipt_REJECT.c or somewhere else?)

I don't think having SELinux sending packets to handle local
connections is a very elegant design, its not a firewall after
all. What's wrong with reacting only to specific errno codes
in tcp_connect()? You could f.i. return -ECONNREFUSED from
SELinux, that one is pretty much guaranteed not to occur in
the network stack itself and can be returned directly.

That would need minor changes to nf_hook_slow so we can
encode errno values in the upper 16 bits of the verdict,
as we already do with the queue number. The added benefit
is that we don't have to return EPERM anymore when f.i.
rerouting fails.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 15:57           ` Patrick McHardy
@ 2010-11-15 16:04             ` Patrick McHardy
  2010-11-15 16:36             ` Patrick McHardy
  1 sibling, 0 replies; 31+ messages in thread
From: Patrick McHardy @ 2010-11-15 16:04 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji

On 15.11.2010 16:57, Patrick McHardy wrote:
> On 15.11.2010 16:47, Eric Paris wrote:
>>> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>>>
>>> The second one will cause a hard error for the connection.
>>
>> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
>> anything with a local outgoing connection but --reject-with tcp-reset
>> does something like what I'm looking for.
>>
>> I notice the heavy lifting for this is done in 
>> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
>> (and something very similar for IPv6)
>>
>> I really don't want to duplicate that code into SELinux (for obvious
>> reasons) and I'm wondering if anyone has objections to me making it
>> available outside of netlink and/or suggestions on how to make that code
>> available outside of netfilter (aka what header to expose it, and does
>> it still make logical sense in ipt_REJECT.c or somewhere else?)
> 
> I don't think having SELinux sending packets to handle local
> connections is a very elegant design, its not a firewall after
> all. What's wrong with reacting only to specific errno codes
> in tcp_connect()? You could f.i. return -ECONNREFUSED from
> SELinux, that one is pretty much guaranteed not to occur in
> the network stack itself and can be returned directly.

One more note: there is also the problem that the RST might never
reach the socket, f.i. because netfilter drops it, or TC actions
reroute it etc. With netfilter users are expected to make sure the
entire combination of network features does what the expect, but
that's probably not what you want for SELinux.

> That would need minor changes to nf_hook_slow so we can
> encode errno values in the upper 16 bits of the verdict,
> as we already do with the queue number. The added benefit
> is that we don't have to return EPERM anymore when f.i.
> rerouting fails.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 15:57           ` Patrick McHardy
  2010-11-15 16:04             ` Patrick McHardy
@ 2010-11-15 16:36             ` Patrick McHardy
  2010-11-15 16:46               ` David Miller
  1 sibling, 1 reply; 31+ messages in thread
From: Patrick McHardy @ 2010-11-15 16:36 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji, Netfilter Development Mailinglist

[-- Attachment #1: Type: text/plain, Size: 1599 bytes --]

On 15.11.2010 16:57, Patrick McHardy wrote:
> On 15.11.2010 16:47, Eric Paris wrote:
>> I notice the heavy lifting for this is done in 
>> net/ipv4/netfilter/ipt_REJECT.c::send_rest()
>> (and something very similar for IPv6)
>>
>> I really don't want to duplicate that code into SELinux (for obvious
>> reasons) and I'm wondering if anyone has objections to me making it
>> available outside of netlink and/or suggestions on how to make that code
>> available outside of netfilter (aka what header to expose it, and does
>> it still make logical sense in ipt_REJECT.c or somewhere else?)
> 
> I don't think having SELinux sending packets to handle local
> connections is a very elegant design, its not a firewall after
> all. What's wrong with reacting only to specific errno codes
> in tcp_connect()? You could f.i. return -ECONNREFUSED from
> SELinux, that one is pretty much guaranteed not to occur in
> the network stack itself and can be returned directly.
> 
> That would need minor changes to nf_hook_slow so we can
> encode errno values in the upper 16 bits of the verdict,
> as we already do with the queue number. The added benefit
> is that we don't have to return EPERM anymore when f.i.
> rerouting fails.

Patch for demonstration purposes attached. I've modified the
MARK target so it returns NF_DROP with an errno code of
-ECONNREFUSED:

# iptables -A OUTPUT -d 1.2.3.4 -j MARK --set-mark 1
# ping 1.2.3.4
PING 1.2.3.4 (1.2.3.4) 56(84) bytes of data.
ping: sendmsg: Connection refused
# telnet 1.2.3.4
Trying 1.2.3.4...
telnet: Unable to connect to remote host: Connection refused




[-- Attachment #2: x --]
[-- Type: text/plain, Size: 2252 bytes --]

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 89341c3..ef2af8f 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -33,6 +33,8 @@
 
 #define NF_QUEUE_NR(x) ((((x) << NF_VERDICT_BITS) & NF_VERDICT_QMASK) | NF_QUEUE)
 
+#define NF_DROP_ERR(x) (((-x) << NF_VERDICT_BITS) | NF_DROP)
+
 /* only for userspace compatibility */
 #ifndef __KERNEL__
 /* Generic cache responses from hook functions.
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 05b1ecf..bb8f547 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2592,6 +2592,7 @@ int tcp_connect(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *buff;
+	int err;
 
 	tcp_connect_init(sk);
 
@@ -2614,7 +2615,9 @@ int tcp_connect(struct sock *sk)
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
 	tp->packets_out += tcp_skb_pcount(buff);
-	tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+	err = tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
+	if (err == -ECONNREFUSED)
+		return err;
 
 	/* We change tp->snd_nxt after the tcp_transmit_skb() call
 	 * in order to make this packet get counted in tcpOutSegs.
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 85dabb8..32fcbe2 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -173,9 +173,11 @@ next_hook:
 			     outdev, &elem, okfn, hook_thresh);
 	if (verdict == NF_ACCEPT || verdict == NF_STOP) {
 		ret = 1;
-	} else if (verdict == NF_DROP) {
+	} else if ((verdict & NF_VERDICT_MASK) == NF_DROP) {
 		kfree_skb(skb);
-		ret = -EPERM;
+		ret = -(verdict >> NF_VERDICT_BITS);
+		if (ret == 0)
+			ret = -EPERM;
 	} else if ((verdict & NF_VERDICT_MASK) == NF_QUEUE) {
 		if (!nf_queue(skb, elem, pf, hook, indev, outdev, okfn,
 			      verdict >> NF_VERDICT_BITS))
diff --git a/net/netfilter/xt_mark.c b/net/netfilter/xt_mark.c
index 2334523..185330c 100644
--- a/net/netfilter/xt_mark.c
+++ b/net/netfilter/xt_mark.c
@@ -30,7 +30,7 @@ mark_tg(struct sk_buff *skb, const struct xt_action_param *par)
 	const struct xt_mark_tginfo2 *info = par->targinfo;
 
 	skb->mark = (skb->mark & ~info->mask) ^ info->mark;
-	return XT_CONTINUE;
+	return NF_DROP_ERR(-ECONNREFUSED);
 }
 
 static bool


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 16:36             ` Patrick McHardy
@ 2010-11-15 16:46               ` David Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David Miller @ 2010-11-15 16:46 UTC (permalink / raw)
  To: kaber
  Cc: eparis, hzhong, netdev, linux-kernel, kuznet, pekkas, jmorris,
	yoshfuji, netfilter-devel

From: Patrick McHardy <kaber@trash.net>
Date: Mon, 15 Nov 2010 17:36:40 +0100

> Patch for demonstration purposes attached. I've modified the
> MARK target so it returns NF_DROP with an errno code of
> -ECONNREFUSED:

I'm fine with the tcp_output.c changes.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
  2010-11-15 15:47         ` Eric Paris
  2010-11-15 15:57           ` Patrick McHardy
@ 2010-11-15 20:00           ` Alexey Kuznetsov
  1 sibling, 0 replies; 31+ messages in thread
From: Alexey Kuznetsov @ 2010-11-15 20:00 UTC (permalink / raw)
  To: Eric Paris
  Cc: Patrick McHardy, Hua Zhong, netdev, linux-kernel, davem, pekkas,
	jmorris, yoshfuji

Hello!

On Mon, Nov 15, 2010 at 10:47:46AM -0500, Eric Paris wrote:
> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
> anything with a local outgoing connection

Yeah. This was "an obvious surpise": for local socket the first icmp error always
arrives on locked socket and gets dropped. I even forgot about this (tcp-reset
still works ok due to backlog) This makes the idea to respect error more attractive.

Alexey

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2010-11-15 20:01 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-11 21:03 [RFC PATCH] network: return errors if we know tcp_connect failed Eric Paris
2010-11-11 21:14 ` Eric Dumazet
2010-11-11 21:58 ` Hua Zhong
2010-11-12  7:36   ` Patrick McHardy
2010-11-12 23:14     ` Hua Zhong
2010-11-15 10:32       ` Patrick McHardy
2010-11-15 15:47         ` Eric Paris
2010-11-15 15:57           ` Patrick McHardy
2010-11-15 16:04             ` Patrick McHardy
2010-11-15 16:36             ` Patrick McHardy
2010-11-15 16:46               ` David Miller
2010-11-15 20:00           ` Alexey Kuznetsov
2010-11-12 16:08   ` Eric Paris
2010-11-12 16:15     ` Eric Dumazet
2010-11-12 16:35       ` David Lamparter
2010-11-12 16:53         ` Eric Paris
2010-11-12 16:54         ` Patrick McHardy
2010-11-12 17:57           ` a problem tcp_v4_err() Alexey Kuznetsov
2010-11-12 18:12             ` Eric Dumazet
2010-11-12 18:21               ` Eric Dumazet
2010-11-12 18:27                 ` Eric Dumazet
2010-11-12 18:31                   ` Alexey Kuznetsov
2010-11-12 18:29               ` Alexey Kuznetsov
2010-11-12 18:33                 ` Eric Dumazet
2010-11-12 19:22                   ` David Miller
2010-11-12 21:18                     ` Eric Dumazet
2010-11-12 21:36                       ` David Miller
2010-11-12 21:16           ` [RFC PATCH] network: return errors if we know tcp_connect failed David Lamparter
2010-11-12 21:18             ` David Miller
2010-11-12 17:46 ` Alexey Kuznetsov
2010-11-12 19:28   ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).