linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* sockets affected by IPsec always block (2.6.23)
@ 2007-12-04 18:53 Simon Arlott
  2007-12-05  0:12 ` Herbert Xu
  2007-12-05  6:06 ` David Miller
  0 siblings, 2 replies; 27+ messages in thread
From: Simon Arlott @ 2007-12-04 18:53 UTC (permalink / raw)
  To: Linux Kernel Mailing List, netdev

If I have a IPsec rule like:
	spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
(i.e. a remote host 1.2.3.4 which will not respond)

Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl64(3, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl64(3, F_SETFL, O_RDWR|O_NONBLOCK)  = 0 <-- non-blocking socket
connect(3, {sa_family=AF_INET, sin_port=htons(22), sin_addr=inet_addr("1.2.3.4")}, 16 <-- blocked connect()

[277657.564773] netcat        S b06bcf20     0  9450   9449
[277657.564785]        c8d51d28 00200046 00200046 b06bcf20 00200286 c8d51d14 00200286 c8d51d84
[277657.564814]        b06bcf20 c8d51d28 b013680b c8d51d84 eeae8800 c8d51d78 c8d51dd0 b04d3fc5
[277657.564843]        c8d51da4 00000002 00000001 ede87284 00000002 00000040 e9318ac0 db3f20a0
[277657.564874] Call Trace:
[277657.564881]  [<b04d3fc5>] __xfrm_lookup+0x2f5/0x510
[277657.564905]  [<b0494f9e>] ip_route_output_flow+0x4e/0x80
[277657.564919]  [<b04af303>] tcp_v4_connect+0x183/0x6d0
[277657.564934]  [<b04beaf2>] inet_stream_connect+0x122/0x1c0
[277657.564949]  [<b0471c8e>] sys_connect+0x9e/0xd0
[277657.564963]  [<b0472785>] sys_socketcall+0xa5/0x230
[277657.564973]  [<b01042ba>] syscall_call+0x7/0xb
[277657.564984]  =======================

I had a process using non-blocking sockets stuck in connect() for over 8 hours because of this...

00002630 <__xfrm_lookup>:
...
    290b:   b8 00 00 00 00          mov    $0x0,%eax
            290c: R_386_32  km_waitq
    2910:   e8 fc ff ff ff          call   2911 <__xfrm_lookup+0x2e1>
            2911: R_386_PC32    add_wait_queue
    2915:   a1 00 00 00 00          mov    0x0,%eax
            2916: R_386_32  per_cpu__current_task
    291a:   c7 00 01 00 00 00       movl   $0x1,(%eax)
    2920:   e8 fc ff ff ff          call   2921 <__xfrm_lookup+0x2f1>
            2921: R_386_PC32    schedule
    2925:   a1 00 00 00 00          mov    0x0,%eax
            2926: R_386_32  per_cpu__current_task
    292a:   c7 00 00 00 00 00       movl   $0x0,(%eax)
    2930:   b8 00 00 00 00          mov    $0x0,%eax
            2931: R_386_32  km_waitq
    2935:   89 da                   mov    %ebx,%edx
    2937:   e8 fc ff ff ff          call   2938 <__xfrm_lookup+0x308>
            2938: R_386_PC32    remove_wait_queue

-- 
Simon Arlott

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-04 18:53 sockets affected by IPsec always block (2.6.23) Simon Arlott
@ 2007-12-05  0:12 ` Herbert Xu
  2007-12-05  6:30   ` David Miller
  2007-12-05  6:06 ` David Miller
  1 sibling, 1 reply; 27+ messages in thread
From: Herbert Xu @ 2007-12-05  0:12 UTC (permalink / raw)
  To: Simon Arlott, David S. Miller; +Cc: Linux Kernel Mailing List, netdev

On Tue, Dec 04, 2007 at 06:53:19PM +0000, Simon Arlott wrote:
> If I have a IPsec rule like:
> 	spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
> (i.e. a remote host 1.2.3.4 which will not respond)
> 
> Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

This patch should help.

[INET]: Export non-blocking flags to proto connect call

Previously we made connect(2) block on IPsec SA resolution.  This is
good in general but not desirable for non-blocking sockets.

To fix this properly we'd need to implement the larval IPsec dst stuff
that we talked about.  For now let's just revert to the old behaviour
on non-blocking sockets.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/include/net/ip.h b/include/net/ip.h
index 83fb9f1..9b4ed7e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -121,7 +121,8 @@ extern void		ip_flush_pending_frames(struct sock *sk);
 
 /* datagram.c */
 extern int		ip4_datagram_connect(struct sock *sk, 
-					     struct sockaddr *uaddr, int addr_len);
+					     struct sockaddr *uaddr,
+					     int addr_len, int flags);
 
 /*
  *	Map a multicast IP onto multicast MAC for type Token Ring.
diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index e90f962..2686850 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -567,7 +567,8 @@ extern void			ipv6_packet_init(void);
 extern void			ipv6_packet_cleanup(void);
 
 extern int			ip6_datagram_connect(struct sock *sk, 
-						     struct sockaddr *addr, int addr_len);
+						     struct sockaddr *addr,
+						     int addr_len, int flags);
 
 extern int 			ipv6_recv_error(struct sock *sk, struct msghdr *msg, int len);
 extern void			ipv6_icmp_error(struct sock *sk, struct sk_buff *skb, int err, __be16 port,
diff --git a/include/net/sock.h b/include/net/sock.h
index 43e3cd9..d70b110 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -522,8 +522,8 @@ struct proto {
 	void			(*close)(struct sock *sk, 
 					long timeout);
 	int			(*connect)(struct sock *sk,
-				        struct sockaddr *uaddr, 
-					int addr_len);
+					   struct sockaddr *uaddr,
+					   int addr_len, int flags);
 	int			(*disconnect)(struct sock *sk, int flags);
 
 	struct sock *		(*accept) (struct sock *sk, int flags, int *err);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 9dbed0b..d93eef6 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -421,7 +421,7 @@ extern int			tcp_v4_do_rcv(struct sock *sk,
 
 extern int			tcp_v4_connect(struct sock *sk,
 					       struct sockaddr *uaddr,
-					       int addr_len);
+					       int addr_len, int flags);
 
 extern int			tcp_connect(struct sock *sk);
 
diff --git a/net/dccp/dccp.h b/net/dccp/dccp.h
index ee97950..4062068 100644
--- a/net/dccp/dccp.h
+++ b/net/dccp/dccp.h
@@ -292,7 +292,7 @@ extern int	   inet_dccp_listen(struct socket *sock, int backlog);
 extern unsigned int dccp_poll(struct file *file, struct socket *sock,
 			     poll_table *wait);
 extern int	   dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr,
-				   int addr_len);
+				   int addr_len, int flags);
 
 extern struct sk_buff *dccp_ctl_make_reset(struct socket *ctl,
 					   struct sk_buff *skb);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index db17b83..e2aba0f 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -44,7 +44,8 @@ static int dccp_v4_get_port(struct sock *sk, const unsigned short snum)
 				 inet_csk_bind_conflict);
 }
 
-int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		    int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct dccp_sock *dp = dccp_sk(sk);
@@ -72,7 +73,8 @@ int dccp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	tmp = ip_route_connect(&rt, nexthop, inet->saddr,
 			       RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
 			       IPPROTO_DCCP,
-			       inet->sport, usin->sin_port, sk, 1);
+			       inet->sport, usin->sin_port, sk,
+			       !(flags & O_NONBLOCK));
 	if (tmp < 0)
 		return tmp;
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 87c98fb..7da5269 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -866,7 +866,7 @@ discard_and_relse:
 }
 
 static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-			   int addr_len)
+			   int addr_len, int flags)
 {
 	struct sockaddr_in6 *usin = (struct sockaddr_in6 *)uaddr;
 	struct inet_connection_sock *icsk = inet_csk(sk);
@@ -952,7 +952,8 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		icsk->icsk_af_ops = &dccp_ipv6_mapped;
 		sk->sk_backlog_rcv = dccp_v4_do_rcv;
 
-		err = dccp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
+		err = dccp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin),
+				      flags);
 		if (err) {
 			icsk->icsk_ext_hdr_len = exthdrlen;
 			icsk->icsk_af_ops = &dccp_ipv6_af_ops;
@@ -994,7 +995,7 @@ static int dccp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (final_p)
 		ipv6_addr_copy(&fl.fl6_dst, final_p);
 
-	err = __xfrm_lookup(&dst, &fl, sk, 1);
+	err = __xfrm_lookup(&dst, &fl, sk, !(flags & O_NONBLOCK));
 	if (err < 0) {
 		if (err == -EREMOTE)
 			err = ip6_dst_blackhole(sk, &dst, &fl);
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c75f20b..5fc5de0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -513,7 +513,8 @@ int inet_dgram_connect(struct socket *sock, struct sockaddr * uaddr,
 
 	if (!inet_sk(sk)->num && inet_autobind(sk))
 		return -EAGAIN;
-	return sk->sk_prot->connect(sk, (struct sockaddr *)uaddr, addr_len);
+	return sk->sk_prot->connect(sk, (struct sockaddr *)uaddr, addr_len,
+				    flags);
 }
 
 static long inet_wait_for_connect(struct sock *sk, long timeo)
@@ -574,7 +575,7 @@ int inet_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 		if (sk->sk_state != TCP_CLOSE)
 			goto out;
 
-		err = sk->sk_prot->connect(sk, uaddr, addr_len);
+		err = sk->sk_prot->connect(sk, uaddr, addr_len, flags);
 		if (err < 0)
 			goto out;
 
diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index 0301dd4..19ab5b1 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -20,7 +20,8 @@
 #include <net/route.h>
 #include <net/tcp_states.h>
 
-int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+			 int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct sockaddr_in *usin = (struct sockaddr_in *) uaddr;
@@ -49,7 +50,8 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	err = ip_route_connect(&rt, usin->sin_addr.s_addr, saddr,
 			       RT_CONN_FLAGS(sk), oif,
 			       sk->sk_protocol,
-			       inet->sport, usin->sin_port, sk, 1);
+			       inet->sport, usin->sin_port, sk,
+			       !(flags & O_NONBLOCK));
 	if (err) {
 		if (err == -ENETUNREACH)
 			IP_INC_STATS_BH(IPSTATS_MIB_OUTNOROUTES);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5a27e42..0a2ee0b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -166,7 +166,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 EXPORT_SYMBOL_GPL(tcp_twsk_unique);
 
 /* This will initiate an outgoing connection. */
-int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+		   int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -192,7 +193,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	tmp = ip_route_connect(&rt, nexthop, inet->saddr,
 			       RT_CONN_FLAGS(sk), sk->sk_bound_dev_if,
 			       IPPROTO_TCP,
-			       inet->sport, usin->sin_port, sk, 1);
+			       inet->sport, usin->sin_port, sk,
+			       !(flags & O_NONBLOCK));
 	if (tmp < 0) {
 		if (tmp == -ENETUNREACH)
 			IP_INC_STATS_BH(IPSTATS_MIB_OUTNOROUTES);
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 2ed689a..591634a 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -34,7 +34,8 @@
 #include <linux/errqueue.h>
 #include <asm/uaccess.h>
 
-int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
+int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len,
+			 int flags)
 {
 	struct sockaddr_in6	*usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock      	*inet = inet_sk(sk);
@@ -49,7 +50,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	if (usin->sin6_family == AF_INET) {
 		if (__ipv6_only_sock(sk))
 			return -EAFNOSUPPORT;
-		err = ip4_datagram_connect(sk, uaddr, addr_len);
+		err = ip4_datagram_connect(sk, uaddr, addr_len, flags);
 		goto ipv4_connected;
 	}
 
@@ -94,7 +95,7 @@ int ip6_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 		err = ip4_datagram_connect(sk,
 					   (struct sockaddr*) &sin,
-					   sizeof(sin));
+					   sizeof(sin), flags);
 
 ipv4_connected:
 		if (err)
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index f1294dc..9cb9068 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -122,7 +122,7 @@ static __u32 tcp_v6_init_sequence(struct sk_buff *skb)
 }
 
 static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
-			  int addr_len)
+			  int addr_len, int flags)
 {
 	struct sockaddr_in6 *usin = (struct sockaddr_in6 *) uaddr;
 	struct inet_sock *inet = inet_sk(sk);
@@ -219,7 +219,8 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 		tp->af_specific = &tcp_sock_ipv6_mapped_specific;
 #endif
 
-		err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin));
+		err = tcp_v4_connect(sk, (struct sockaddr *)&sin, sizeof(sin),
+				     flags);
 
 		if (err) {
 			icsk->icsk_ext_hdr_len = exthdrlen;
@@ -265,7 +266,7 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	if (final_p)
 		ipv6_addr_copy(&fl.fl6_dst, final_p);
 
-	if ((err = __xfrm_lookup(&dst, &fl, sk, 1)) < 0) {
+	if ((err = __xfrm_lookup(&dst, &fl, sk, !(flags & O_NONBLOCK))) < 0) {
 		if (err == -EREMOTE)
 			err = ip6_dst_blackhole(sk, &dst, &fl);
 		if (err < 0)
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 248b9a5..1ca8913 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -960,7 +960,8 @@ out:
  */
 static int __sctp_connect(struct sock* sk,
 			  struct sockaddr *kaddrs,
-			  int addrs_size)
+			  int addrs_size,
+			  int f_flags)
 {
 	struct sctp_sock *sp;
 	struct sctp_endpoint *ep;
@@ -977,7 +978,6 @@ static int __sctp_connect(struct sock* sk,
 	union sctp_addr *sa_addr = NULL;
 	void *addr_buf;
 	unsigned short port;
-	unsigned int f_flags = 0;
 
 	sp = sctp_sk(sk);
 	ep = sp->ep;
@@ -1106,12 +1106,6 @@ static int __sctp_connect(struct sock* sk,
 	af->to_sk_daddr(sa_addr, sk);
 	sk->sk_err = 0;
 
-	/* in-kernel sockets don't generally have a file allocated to them
-	 * if all they do is call sock_create_kern().
-	 */
-	if (sk->sk_socket->file)
-		f_flags = sk->sk_socket->file->f_flags;
-
 	timeo = sock_sndtimeo(sk, f_flags & O_NONBLOCK);
 
 	err = sctp_wait_for_connect(asoc, &timeo);
@@ -1194,6 +1188,7 @@ SCTP_STATIC int sctp_setsockopt_connectx(struct sock* sk,
 {
 	int err = 0;
 	struct sockaddr *kaddrs;
+	int f_flags;
 
 	SCTP_DEBUG_PRINTK("%s - sk %p addrs %p addrs_size %d\n",
 			  __FUNCTION__, sk, addrs, addrs_size);
@@ -1210,10 +1205,15 @@ SCTP_STATIC int sctp_setsockopt_connectx(struct sock* sk,
 	if (unlikely(!kaddrs))
 		return -ENOMEM;
 
+	/* in-kernel sockets don't generally have a file allocated to them
+	 * if all they do is call sock_create_kern().
+	 */
+	f_flags = sk->sk_socket->file ? sk->sk_socket->file->f_flags : 0;
+
 	if (__copy_from_user(kaddrs, addrs, addrs_size)) {
 		err = -EFAULT;
 	} else {
-		err = __sctp_connect(sk, kaddrs, addrs_size);
+		err = __sctp_connect(sk, kaddrs, addrs_size, f_flags);
 	}
 
 	kfree(kaddrs);
@@ -3270,7 +3270,7 @@ out_nounlock:
  * len: the size of the address.
  */
 SCTP_STATIC int sctp_connect(struct sock *sk, struct sockaddr *addr,
-			     int addr_len)
+			     int addr_len, int flags)
 {
 	int err = 0;
 	struct sctp_af *af;
@@ -3288,7 +3288,7 @@ SCTP_STATIC int sctp_connect(struct sock *sk, struct sockaddr *addr,
 		/* Pass correct addr len to common routine (so it knows there
 		 * is only one address being passed.
 		 */
-		err = __sctp_connect(sk, addr, af->sockaddr_len);
+		err = __sctp_connect(sk, addr, af->sockaddr_len, flags);
 	}
 
 	sctp_release_sock(sk);

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-04 18:53 sockets affected by IPsec always block (2.6.23) Simon Arlott
  2007-12-05  0:12 ` Herbert Xu
@ 2007-12-05  6:06 ` David Miller
  1 sibling, 0 replies; 27+ messages in thread
From: David Miller @ 2007-12-05  6:06 UTC (permalink / raw)
  To: simon; +Cc: linux-kernel, netdev

From: Simon Arlott <simon@fire.lp0.eu>
Date: Tue, 04 Dec 2007 18:53:19 +0000

> If I have a IPsec rule like:
> 	spdadd 192.168.7.8 1.2.3.4 any -P out ipsec esp/transport//require;
> (i.e. a remote host 1.2.3.4 which will not respond)
> 
> Then any attempt to communicate with 1.2.3.4 will block, even when using non-blocking sockets:

If you don't like this behavior:

	echo "1" >/proc/sys/net/core/xfrm_larval_drop

but those initial connection setup packets will be dropped while
waiting for the IPSEC route to be resolved, and in your 8 hour case
the TCP connect will fail.

Anyways, the choice for different behavior is there, select it
to suit your tastes.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  0:12 ` Herbert Xu
@ 2007-12-05  6:30   ` David Miller
  2007-12-05  6:51     ` Herbert Xu
  2007-12-16 22:47     ` Bill Davidsen
  0 siblings, 2 replies; 27+ messages in thread
From: David Miller @ 2007-12-05  6:30 UTC (permalink / raw)
  To: herbert; +Cc: simon, linux-kernel, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 5 Dec 2007 11:12:32 +1100

> [INET]: Export non-blocking flags to proto connect call
> 
> Previously we made connect(2) block on IPsec SA resolution.  This is
> good in general but not desirable for non-blocking sockets.
> 
> To fix this properly we'd need to implement the larval IPsec dst stuff
> that we talked about.  For now let's just revert to the old behaviour
> on non-blocking sockets.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

We made an explicit decision not to do things this way.

Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
setting, and this is across the board.  If xfrm_larval_drop is zero,
non-blocking semantics do not extend to IPSEC route resolution,
otherwise it does.

If he sets this sysctl to "1" as I detailed in my reply, he'll
get the behavior he wants.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  6:30   ` David Miller
@ 2007-12-05  6:51     ` Herbert Xu
  2007-12-05  7:12       ` David Miller
  2007-12-05 18:39       ` Stefan Rompf
  2007-12-16 22:47     ` Bill Davidsen
  1 sibling, 2 replies; 27+ messages in thread
From: Herbert Xu @ 2007-12-05  6:51 UTC (permalink / raw)
  To: David Miller; +Cc: simon, linux-kernel, netdev

On Tue, Dec 04, 2007 at 10:30:23PM -0800, David Miller wrote:
> 
> We made an explicit decision not to do things this way.

Thanks for pointing this out.

> Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> setting, and this is across the board.  If xfrm_larval_drop is zero,
> non-blocking semantics do not extend to IPSEC route resolution,
> otherwise it does.
> 
> If he sets this sysctl to "1" as I detailed in my reply, he'll
> get the behavior he wants.

Does anybody actually need the 0 setting? What would we break if
the default became 1?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  6:51     ` Herbert Xu
@ 2007-12-05  7:12       ` David Miller
  2007-12-05  7:16         ` Herbert Xu
  2007-12-05 18:42         ` Stefan Rompf
  2007-12-05 18:39       ` Stefan Rompf
  1 sibling, 2 replies; 27+ messages in thread
From: David Miller @ 2007-12-05  7:12 UTC (permalink / raw)
  To: herbert; +Cc: simon, linux-kernel, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 5 Dec 2007 17:51:32 +1100

> Does anybody actually need the 0 setting? What would we break if
> the default became 1?

I bet there are UDP apps out there that would break if we
didn't do this.

Actually, consider even a case like DNS.  Let's say the timeout
is set to 2 seconds or something and you have 3 DNS servers
listed, on different IPSEC destinations, in your resolv.conf

Each IPSEC route that isn't currently resolved will cause packet loss
of the DNS lookup request with xfrm_larval_drop set to '1'.

If all 3 need to be resolved, the DNS lookup will fully fail
which defeats the purpose of listing 3 servers for redundancy
don't you think? :-)

As much as I even personally prefer the xfrm_larval_drop=1
behavior, it cases like above that keep me from jumping at
making it the default.

Arguably, potentially blocking forever (which is what can easily
happen with xfrm_larval_drop=0 if your IPSEC daemon cannot resolve the
IPSEC path for whatever reason) is worse than the above, but the
other cases are still something to consider as well.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  7:12       ` David Miller
@ 2007-12-05  7:16         ` Herbert Xu
  2007-12-05  7:34           ` David Miller
  2007-12-05 18:42         ` Stefan Rompf
  1 sibling, 1 reply; 27+ messages in thread
From: Herbert Xu @ 2007-12-05  7:16 UTC (permalink / raw)
  To: David Miller; +Cc: simon, linux-kernel, netdev

On Tue, Dec 04, 2007 at 11:12:00PM -0800, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 5 Dec 2007 17:51:32 +1100
> 
> > Does anybody actually need the 0 setting? What would we break if
> > the default became 1?
> 
> I bet there are UDP apps out there that would break if we
> didn't do this.

Right.  This is definitely bad for protocols without a retransmission
mechanism.

However, is the 0 setting ever useful for TCP and in particular, TCP's
connect(2) call? Perhaps we can just make that one always drop.

Well, until someone implements queueing to fix all of this properly
that is :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  7:16         ` Herbert Xu
@ 2007-12-05  7:34           ` David Miller
  2007-12-05  7:39             ` Herbert Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-05  7:34 UTC (permalink / raw)
  To: herbert; +Cc: simon, linux-kernel, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 5 Dec 2007 18:16:07 +1100

> Right.  This is definitely bad for protocols without a retransmission
> mechanism.
> 
> However, is the 0 setting ever useful for TCP and in particular, TCP's
> connect(2) call? Perhaps we can just make that one always drop.

TCP has some built-in assumptions about characteristics of
interent links and what constitutes a timeout which is "too long"
and should thus result in a full connection failure.

IPSEC changes this because of IPSEC route resolution via
ISAKMP.

With this in mind I can definitely see people preferring
the "block until IPSEC resolves" behavior, especially for
something like, say, periodic remote backups and stuff like
that where you really want the thing to just sit and wait
for the connect() to succeed instead of failing.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  7:34           ` David Miller
@ 2007-12-05  7:39             ` Herbert Xu
  2007-12-05  9:55               ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Herbert Xu @ 2007-12-05  7:39 UTC (permalink / raw)
  To: David Miller; +Cc: simon, linux-kernel, netdev

On Tue, Dec 04, 2007 at 11:34:32PM -0800, David Miller wrote:
> 
> TCP has some built-in assumptions about characteristics of
> interent links and what constitutes a timeout which is "too long"
> and should thus result in a full connection failure.
> 
> IPSEC changes this because of IPSEC route resolution via
> ISAKMP.
> 
> With this in mind I can definitely see people preferring
> the "block until IPSEC resolves" behavior, especially for
> something like, say, periodic remote backups and stuff like
> that where you really want the thing to just sit and wait
> for the connect() to succeed instead of failing.

Hmm, but connect(2) should succeed in that case thanks to the
blackhole route, no? The subsequent SYNs will then be dropped
until the IPsec SAs are in place.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  7:39             ` Herbert Xu
@ 2007-12-05  9:55               ` David Miller
  2007-12-05  9:57                 ` Herbert Xu
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-05  9:55 UTC (permalink / raw)
  To: herbert; +Cc: simon, linux-kernel, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 5 Dec 2007 18:39:27 +1100

> On Tue, Dec 04, 2007 at 11:34:32PM -0800, David Miller wrote:
> > 
> > TCP has some built-in assumptions about characteristics of
> > interent links and what constitutes a timeout which is "too long"
> > and should thus result in a full connection failure.
> > 
> > IPSEC changes this because of IPSEC route resolution via
> > ISAKMP.
> > 
> > With this in mind I can definitely see people preferring
> > the "block until IPSEC resolves" behavior, especially for
> > something like, say, periodic remote backups and stuff like
> > that where you really want the thing to just sit and wait
> > for the connect() to succeed instead of failing.
> 
> Hmm, but connect(2) should succeed in that case thanks to the
> blackhole route, no? The subsequent SYNs will then be dropped
> until the IPsec SAs are in place.

If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard
fail.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  9:55               ` David Miller
@ 2007-12-05  9:57                 ` Herbert Xu
  0 siblings, 0 replies; 27+ messages in thread
From: Herbert Xu @ 2007-12-05  9:57 UTC (permalink / raw)
  To: David Miller; +Cc: simon, linux-kernel, netdev

On Wed, Dec 05, 2007 at 01:55:58AM -0800, David Miller wrote:
>
> If it hits sysctl_tcp_syn_retries SYN attempts, the connect will hard
> fail.

Right.  Let's just forget about this until we have a queueing system :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  6:51     ` Herbert Xu
  2007-12-05  7:12       ` David Miller
@ 2007-12-05 18:39       ` Stefan Rompf
  2007-12-06  2:25         ` David Miller
  1 sibling, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-05 18:39 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, simon, linux-kernel, netdev

Am Mittwoch, 5. Dezember 2007 07:51 schrieb Herbert Xu:

> > If he sets this sysctl to "1" as I detailed in my reply, he'll
> > get the behavior he wants.
>
> Does anybody actually need the 0 setting? What would we break if
> the default became 1?

I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking 
sockets is quite well defined in POSIX. If this is changed for some IP 
sockets, event-driven applications will randomly and subtly break.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  7:12       ` David Miller
  2007-12-05  7:16         ` Herbert Xu
@ 2007-12-05 18:42         ` Stefan Rompf
  1 sibling, 0 replies; 27+ messages in thread
From: Stefan Rompf @ 2007-12-05 18:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Mittwoch, 5. Dezember 2007 08:12 schrieb David Miller:

> Actually, consider even a case like DNS.  Let's say the timeout
> is set to 2 seconds or something and you have 3 DNS servers
> listed, on different IPSEC destinations, in your resolv.conf
>
> Each IPSEC route that isn't currently resolved will cause packet loss
> of the DNS lookup request with xfrm_larval_drop set to '1'.
>
> If all 3 need to be resolved, the DNS lookup will fully fail
> which defeats the purpose of listing 3 servers for redundancy
> don't you think? :-)

In your example, the DNS server might actually stop responding to other 
clients while waiting for the (expected to be non-blocking) connect() to 
return. This is much much worse.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05 18:39       ` Stefan Rompf
@ 2007-12-06  2:25         ` David Miller
  2007-12-06  8:49           ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-06  2:25 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Wed, 5 Dec 2007 19:39:07 +0100

> I'd strongly suggest doing so. AFAIK, behaviour of connect() on nonblocking 
> sockets is quite well defined in POSIX.

You are entitled to your opinion.

POSIX says nothing about the semantics of route resolution.
Non-blocking doesn't mean "cannot sleep no matter what".

> If this is changed for some IP sockets, event-driven applications
> will randomly and subtly break.

If this was such a clear cut case we'd have changed things
a long time ago, but it isn't so don't pretend this is the
case.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06  2:25         ` David Miller
@ 2007-12-06  8:49           ` Stefan Rompf
  2007-12-06  8:53             ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-06  8:49 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Donnerstag, 6. Dezember 2007 03:25 schrieb David Miller:

> POSIX says nothing about the semantics of route resolution.

Of course not. Applications must not care about what happens at the transport 
layer.

> Non-blocking doesn't mean "cannot sleep no matter what".

... and as O_CREAT on open() isn't specifically documented to apply to 
filenames starting with 'a', it is perfectly normal that "echo x >ash" always 
fails since 2.6.22. To revert to the old behaviour, please do "echo 1 
>/proc/sys/fs/allow_a_file_creation".

Ok, irony aside. Just have a look at
http://www.opengroup.org/onlinepubs/009695399/functions/connect.html (I hope 
009695399 is not a personalition cookie ;-)

"If the connection cannot be established immediately and O_NONBLOCK is set for 
the file descriptor for the socket, connect() shall fail and set errno to 
[EINPROGRESS], but the connection request shall not be aborted, and the 
connection shall be established asynchronously."

I think the words "shall fail" and "immediately" are quite clear.

> > If this is changed for some IP sockets, event-driven applications
> > will randomly and subtly break.
>
> If this was such a clear cut case we'd have changed things
> a long time ago, but it isn't so don't pretend this is the
> case.

Well, the only reason this doesn't break on a daily basis is because the code 
isn't in the kernel that long and not many people run applications on an 
IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior 
connections or dnssec based anonymous IPSEC someday. Trust me, you will 
revert this misbehaviour in -stable then.

For some real life applications that break when nonblocking connect() blocks, 
please look f.e. at squid or mozilla firefox.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06  8:49           ` Stefan Rompf
@ 2007-12-06  8:53             ` David Miller
  2007-12-06 10:56               ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-06  8:53 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Thu, 6 Dec 2007 09:49:01 +0100

> "If the connection cannot be established immediately and O_NONBLOCK is set for 
> the file descriptor for the socket, connect() shall fail and set errno to 
> [EINPROGRESS], but the connection request shall not be aborted, and the 
> connection shall be established asynchronously."
> 
> I think the words "shall fail" and "immediately" are quite clear.

They are, but the context in which they apply is vague.

I can equally generate examples where the non-blocking behavior you
are a proponent of would break non-blocking UDP apps during a
sendmsg() call when we hit IPSEC resolution.  Yet similar language on
blocking semantics exists for sendmsg() in the standards.

The world is shades of gray, implying anything else is foolhardy and
that's how I'm handling this.

> Well, the only reason this doesn't break on a daily basis is because the code 
> isn't in the kernel that long and not many people run applications on an 
> IPSEC gateway. This will change if kernel based IPSEC is used for roadwarrior 
> connections or dnssec based anonymous IPSEC someday. Trust me, you will 
> revert this misbehaviour in -stable then.

I use IPSEC every single day in this fashion, and I haven't.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06  8:53             ` David Miller
@ 2007-12-06 10:56               ` Stefan Rompf
  2007-12-06 11:13                 ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-06 10:56 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:

> > I think the words "shall fail" and "immediately" are quite clear.
>
> They are, but the context in which they apply is vague.

"socket is connection-mode" => SOCK_STREAM

> I can equally generate examples where the non-blocking behavior you
> are a proponent of would break non-blocking UDP apps during a
> sendmsg() call when we hit IPSEC resolution.  Yet similar language on
> blocking semantics exists for sendmsg() in the standards.

I am not a good enough kernel hacker to exactly understand the code flow in 
udp_sendmsg(). However, it seems that it first checks destination validity 
via ip_route_output_flow() and queues the message then. The sendmsg() 
documentation only talks about buffer space. I can see your dilemma.

The reason why I'm pushing this issue another time is that I know quite a 
bit about system level application development. A very typical design pattern 
for non-naive single or multi threaded programs is that they set all 
communication sockets to be nonblocking and use a select()/epoll() based loop 
to dispatch IO. This often includes initiating a TCP connect() and 
asynchronously waiting for it to finish or fail from the main loop.

The dangerous situation here is that in 99% of all cases things will just work 
because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec. 
However, in the rest of time the server application that you have considered 
to be stable will end up sleeping with all threads in a connect() call that 
is supposed to return immediatly.

> The world is shades of gray, implying anything else is foolhardy and
> that's how I'm handling this.

Even though I consider programmers that ignore the result code on a 
nonblocking UDP sendmsg() fools, I agree. May be the best compromise is what 
Herbert Xu suggested in <20071205001230.GA11391@gondor.apana.org.au> in this 
thread: At least, for connect() O_NONBLOCK ist ALWAYS respected. Because this 
is where the chance for breakage is highest.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 10:56               ` Stefan Rompf
@ 2007-12-06 11:13                 ` David Miller
  2007-12-06 11:35                   ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-06 11:13 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Thu, 6 Dec 2007 11:56:48 +0100

> Am Donnerstag, 6. Dezember 2007 09:53 schrieb David Miller:
> 
> > > I think the words "shall fail" and "immediately" are quite clear.
> >
> > They are, but the context in which they apply is vague.
> 
> "socket is connection-mode" => SOCK_STREAM

I meant whether "immediately" mean in reference to socket
state or includes auxiliary things like route lookups.

When you do a non-blocking write on a socket, things like
memory allocations can block, potentially for a long time.
It is an example where there are definite boundaries to where
the non-blocking'ness applies.

And therefore it is not so cut and dry and you present this
issue.

> The reason why I'm pushing this issue another time is that I know quite a 
> bit about system level application development. A very typical design pattern 
> for non-naive single or multi threaded programs is that they set all 
> communication sockets to be nonblocking and use a select()/epoll() based loop 
> to dispatch IO. This often includes initiating a TCP connect() and 
> asynchronously waiting for it to finish or fail from the main loop.
>
> The dangerous situation here is that in 99% of all cases things will just work 
> because the phase 2 SA exists. In 0.8%, the SA will be established in <1 sec. 
> However, in the rest of time the server application that you have considered 
> to be stable will end up sleeping with all threads in a connect() call that 
> is supposed to return immediatly.

And that connect() call can hang for a long time due to any memory
allocation done in the connect() path.

You are not avoiding blocking by setting O_NONBLOCK on the socket, it
is quite foolhardy to think that it does so unilaterally.

And that's why this is a grey area.  Why is waiting for memory
allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
resolution is not?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 11:13                 ` David Miller
@ 2007-12-06 11:35                   ` Stefan Rompf
  2007-12-06 11:39                     ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-06 11:35 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Donnerstag, 6. Dezember 2007 12:13 schrieb David Miller:

> And that's why this is a grey area.  Why is waiting for memory
> allocation on a O_NONBLOCK socket OK but waiting for IPSEC route
> resolution is not?

Because you just will put enough RAM modules into you server when setting up a 
scalable system. Local resource, managable by the admin. What you cannot 
control in many cases is the network connection to the remote node. Simon 
Arlott has been talking about an 8 hour network outage.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 11:35                   ` Stefan Rompf
@ 2007-12-06 11:39                     ` David Miller
  2007-12-06 12:30                       ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-06 11:39 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Thu, 6 Dec 2007 12:35:05 +0100

> Because you just will put enough RAM modules into you server when
> setting up a scalable system.

This suggestion is avoiding the important semantic issue, and
won't lead to a real discussion of the core problem.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 11:39                     ` David Miller
@ 2007-12-06 12:30                       ` Stefan Rompf
  2007-12-06 13:55                         ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-06 12:30 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Donnerstag, 6. Dezember 2007 12:39 schrieb David Miller:

> > Because you just will put enough RAM modules into you server when
> > setting up a scalable system.
>
> This suggestion is avoiding the important semantic issue, and
> won't lead to a real discussion of the core problem.

When writing applications for unix operating systems, it is known since ages 
that stuff can be swapped out and that even things like memory accesses can 
block. So it does not really surprise when a system call has to wait for 
memory - just imagine the kernel code for connect() could be and has been 
swapped out.

Even with moderate swap activity, this memory should be available in much less 
than one second. If on the other hand the system is already threshing, it is 
no difference if it does so within connect() or while reaching the connect() 
system call in the application flow.

Btw, this is where admin responsibility to size their systems kicks in.

So where I would draw the line: connect() is clearly a network related 
function. Therefore, if a nonblocking connect() has to sleep for a local, 
controllable resource like memory to become available, this is ok. Maybe it 
shouldn't wait for a 128MB buffer if someone configured such an abonimation, 
haven't thought deeply about that. But when being told not to wait the 
connection to complete, it should never ever wait for another network related 
activity like IPSEC SA setup to complete, especially not for hours.

IMHO this is what developers expect, and is also consistent with the fact that 
POSIX does not define O_NONBLOCK behaviour for local files.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 12:30                       ` Stefan Rompf
@ 2007-12-06 13:55                         ` David Miller
  2007-12-06 14:31                           ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-06 13:55 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Thu, 6 Dec 2007 13:30:20 +0100

> IMHO this is what developers expect, and is also consistent with the
> fact that POSIX does not define O_NONBLOCK behaviour for local
> files.

You keep ignoring the fact that, as Herbert and I discussed, not
blocking for IPSEC resolution will make some connect() cases fail that
would otherwise not fail.

There are two sides to this issue, and we need to consider them
both.

Long term a resolution-packet-queue provides a solution that handles
both angles correctly, but we don't have that code yet.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 13:55                         ` David Miller
@ 2007-12-06 14:31                           ` Stefan Rompf
  2007-12-07  3:20                             ` David Miller
  0 siblings, 1 reply; 27+ messages in thread
From: Stefan Rompf @ 2007-12-06 14:31 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Donnerstag, 6. Dezember 2007 14:55 schrieb David Miller:

> You keep ignoring the fact that, as Herbert and I discussed, not
> blocking for IPSEC resolution will make some connect() cases fail that
> would otherwise not fail.
>
> There are two sides to this issue, and we need to consider them
> both.

as far as I've understood Herbert's patch, at least TCP connect can be fixed 
so that non blocking connect() will neither fail nor block, but just use the 
first or second retransmission of the SYN packet to complete the handshake 
after IPSEC is up. As this will fix the common breakage case, just do so and 
keep UDP sendmsg() etc for later.

You are looking at this issue too much from the kernel side. Admitted, this is 
a corner case, but therefore nobody cares if connection completion takes two 
SYNs and three seconds instead of one SYN and may be two seconds. But 
application developers and users will validly complain if their applications 
block unexpectedly for hours just because some random provider has a network 
outage and IPSEC cannot come up.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-06 14:31                           ` Stefan Rompf
@ 2007-12-07  3:20                             ` David Miller
  2007-12-07  9:29                               ` Stefan Rompf
  0 siblings, 1 reply; 27+ messages in thread
From: David Miller @ 2007-12-07  3:20 UTC (permalink / raw)
  To: stefan; +Cc: herbert, simon, linux-kernel, netdev

From: Stefan Rompf <stefan@loplof.de>
Date: Thu, 6 Dec 2007 15:31:53 +0100

> as far as I've understood Herbert's patch, at least TCP connect can be fixed 
> so that non blocking connect() will neither fail nor block, but just use the 
> first or second retransmission of the SYN packet to complete the handshake 
> after IPSEC is up.

If IPSEC takes a long time to resolve, and we don't block, the
connect() can hard fail (we will just keep dropping the outgoing SYN
packet send attempts, eventually hitting the retry limit) in cases
where if we did block it would not fail (because we wouldn't send
the first SYN until IPSEC resolved).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-07  3:20                             ` David Miller
@ 2007-12-07  9:29                               ` Stefan Rompf
  0 siblings, 0 replies; 27+ messages in thread
From: Stefan Rompf @ 2007-12-07  9:29 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

Am Freitag, 7. Dezember 2007 04:20 schrieb David Miller:

> If IPSEC takes a long time to resolve, and we don't block, the
> connect() can hard fail (we will just keep dropping the outgoing SYN
> packet send attempts, eventually hitting the retry limit) in cases
> where if we did block it would not fail (because we wouldn't send
> the first SYN until IPSEC resolved).

David - I'm aware of this, the discussion is which behaviour is ok. Let's go 
back to a real life example. I've already researched that the squid web proxy 
has a poll() based main loop doing nonblocking connects, may be with multiple 
threads.

Situation: One user wants to access a web page that needs IPSEC. The SA takes 
30 seconds to come up.

a) Non-blocking connect is respected: SYN packets during the first 30 seconds 
will be dropped as you said. Connection can be completed on the next SYN 
retry (timeout in linux: 3 minutes). During this time, the 500 other users 
can continue to browse using the proxy.

b) Non-blocking connect is ignored during IPSEC resolving as you advocate it: 
Connection for the one user can be completed immediatly after IPSEC comes up. 
That's the pro. However, until then, the other 500 proxy user CANNOT ACCESS 
THE WEB because squid's threads are stuck in connect()s on sockets they 
configured not to block. If the IPSEC SA never resolves due to some network 
outage, squid will sleep forever or until an admin configures it that it 
doesn't try to connect the adress in question and restarts it.

Don't you realize how broken this behaviour is? Can you give me ONE example of 
an application that works better with b) and why this outweights the problems 
it creates for everybody else?

Even the DNS example you posted in  
<20071204.231200.117152338.davem@davemloft.net> is wrong because the second 
server will never queried if the kernel puts the process into coma while the 
IPSEC SA to the first server cannot be resolved.

Stefan

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-05  6:30   ` David Miller
  2007-12-05  6:51     ` Herbert Xu
@ 2007-12-16 22:47     ` Bill Davidsen
  2007-12-16 23:22       ` David Miller
  1 sibling, 1 reply; 27+ messages in thread
From: Bill Davidsen @ 2007-12-16 22:47 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, simon, linux-kernel, netdev

David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 5 Dec 2007 11:12:32 +1100
> 
>> [INET]: Export non-blocking flags to proto connect call
>>
>> Previously we made connect(2) block on IPsec SA resolution.  This is
>> good in general but not desirable for non-blocking sockets.
>>
>> To fix this properly we'd need to implement the larval IPsec dst stuff
>> that we talked about.  For now let's just revert to the old behaviour
>> on non-blocking sockets.
>>
>> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> We made an explicit decision not to do things this way.
> 
> Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> setting, and this is across the board.  If xfrm_larval_drop is zero,
> non-blocking semantics do not extend to IPSEC route resolution,
> otherwise it does.
> 
> If he sets this sysctl to "1" as I detailed in my reply, he'll
> get the behavior he wants.
> 
I think you for the hint, but I would hardly call this sentence 
"detailed" in terms of being a cookbook solution to the problem.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: sockets affected by IPsec always block (2.6.23)
  2007-12-16 22:47     ` Bill Davidsen
@ 2007-12-16 23:22       ` David Miller
  0 siblings, 0 replies; 27+ messages in thread
From: David Miller @ 2007-12-16 23:22 UTC (permalink / raw)
  To: davidsen; +Cc: herbert, simon, linux-kernel, netdev

From: Bill Davidsen <davidsen@tmr.com>
Date: Sun, 16 Dec 2007 17:47:24 -0500

> David Miller wrote:
> > From: Herbert Xu <herbert@gondor.apana.org.au>
> > Date: Wed, 5 Dec 2007 11:12:32 +1100
> > 
> >> [INET]: Export non-blocking flags to proto connect call
> >>
> >> Previously we made connect(2) block on IPsec SA resolution.  This is
> >> good in general but not desirable for non-blocking sockets.
> >>
> >> To fix this properly we'd need to implement the larval IPsec dst stuff
> >> that we talked about.  For now let's just revert to the old behaviour
> >> on non-blocking sockets.
> >>
> >> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> > 
> > We made an explicit decision not to do things this way.
> > 
> > Non-blocking has a meaning dependant upon the xfrm_larval_drop sysctl
> > setting, and this is across the board.  If xfrm_larval_drop is zero,
> > non-blocking semantics do not extend to IPSEC route resolution,
> > otherwise it does.
> > 
> > If he sets this sysctl to "1" as I detailed in my reply, he'll
> > get the behavior he wants.
> > 
> I think you for the hint, but I would hardly call this sentence 
> "detailed" in terms of being a cookbook solution to the problem.

I guess "echo '1' >/proc/sys/net/core/xfrm_larval_drop" is not
explicit enough?  What more would you like me to say?

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2007-12-16 23:22 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-12-04 18:53 sockets affected by IPsec always block (2.6.23) Simon Arlott
2007-12-05  0:12 ` Herbert Xu
2007-12-05  6:30   ` David Miller
2007-12-05  6:51     ` Herbert Xu
2007-12-05  7:12       ` David Miller
2007-12-05  7:16         ` Herbert Xu
2007-12-05  7:34           ` David Miller
2007-12-05  7:39             ` Herbert Xu
2007-12-05  9:55               ` David Miller
2007-12-05  9:57                 ` Herbert Xu
2007-12-05 18:42         ` Stefan Rompf
2007-12-05 18:39       ` Stefan Rompf
2007-12-06  2:25         ` David Miller
2007-12-06  8:49           ` Stefan Rompf
2007-12-06  8:53             ` David Miller
2007-12-06 10:56               ` Stefan Rompf
2007-12-06 11:13                 ` David Miller
2007-12-06 11:35                   ` Stefan Rompf
2007-12-06 11:39                     ` David Miller
2007-12-06 12:30                       ` Stefan Rompf
2007-12-06 13:55                         ` David Miller
2007-12-06 14:31                           ` Stefan Rompf
2007-12-07  3:20                             ` David Miller
2007-12-07  9:29                               ` Stefan Rompf
2007-12-16 22:47     ` Bill Davidsen
2007-12-16 23:22       ` David Miller
2007-12-05  6:06 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).