* [PATCH net-next 0/6] TCP connection repair (v4)
@ 2012-04-19 13:38 Pavel Emelyanov
  2012-04-19 13:39 ` [PATCH 1/6] sock: Introduce named constants for sk_reuse Pavel Emelyanov
                   ` (6 more replies)
  0 siblings, 7 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:38 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

Hi!

Attempt #4 with an API for TCP connection recreation (the previous one is
at http://lists.openwall.net/netdev/2012/03/28/84), rebased on
today's net-next tree.


Changes since v3:

* Added repair for TCP options negotiated during the 3WHS, as pointed
  out by Li Yu. The explanation of how this happens is in patch #6.

* Named constants for sk_reuse values, as proposed by Ben Hutchings.

* Fixed an off-by-one in the repair-queue sockoption, caught by Ben.

Thanks,
Pavel


* [PATCH 1/6] sock: Introduce named constants for sk_reuse
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
@ 2012-04-19 13:39 ` Pavel Emelyanov
  2012-04-19 13:40 ` [PATCH 2/6] tcp: Move code around Pavel Emelyanov
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:39 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

Name them in a "backward compatible" manner, i.e. reuse and no-reuse
are still 1 and 0 respectively. The new reuse value of 2 means that
a socket with it set will forcibly reuse everyone else's port.

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 drivers/block/drbd/drbd_receiver.c    |    6 +++---
 drivers/scsi/iscsi_tcp.c              |    2 +-
 drivers/staging/ramster/cluster/tcp.c |    2 +-
 fs/ocfs2/cluster/tcp.c                |    2 +-
 include/net/sock.h                    |   11 +++++++++++
 net/core/sock.c                       |    2 +-
 net/econet/af_econet.c                |    4 ++--
 net/ipv4/af_inet.c                    |    2 +-
 net/ipv4/inet_connection_sock.c       |    3 +++
 net/ipv6/af_inet6.c                   |    2 +-
 net/netfilter/ipvs/ip_vs_sync.c       |    2 +-
 net/rds/tcp_listen.c                  |    2 +-
 net/sunrpc/svcsock.c                  |    2 +-
 13 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/drivers/block/drbd/drbd_receiver.c b/drivers/block/drbd/drbd_receiver.c
index 43beaca..436f519 100644
--- a/drivers/block/drbd/drbd_receiver.c
+++ b/drivers/block/drbd/drbd_receiver.c
@@ -664,7 +664,7 @@ static struct socket *drbd_wait_for_connect(struct drbd_conf *mdev)
 	timeo = mdev->net_conf->try_connect_int * HZ;
 	timeo += (random32() & 1) ? timeo / 7 : -timeo / 7; /* 28.5% random jitter */
 
-	s_listen->sk->sk_reuse    = 1; /* SO_REUSEADDR */
+	s_listen->sk->sk_reuse    = SK_CAN_REUSE; /* SO_REUSEADDR */
 	s_listen->sk->sk_rcvtimeo = timeo;
 	s_listen->sk->sk_sndtimeo = timeo;
 	drbd_setbufsize(s_listen, mdev->net_conf->sndbuf_size,
@@ -841,8 +841,8 @@ retry:
 		}
 	} while (1);
 
-	msock->sk->sk_reuse = 1; /* SO_REUSEADDR */
-	sock->sk->sk_reuse = 1; /* SO_REUSEADDR */
+	msock->sk->sk_reuse = SK_CAN_REUSE; /* SO_REUSEADDR */
+	sock->sk->sk_reuse = SK_CAN_REUSE; /* SO_REUSEADDR */
 
 	sock->sk->sk_allocation = GFP_NOIO;
 	msock->sk->sk_allocation = GFP_NOIO;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 453a740..9220861 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -662,7 +662,7 @@ iscsi_sw_tcp_conn_bind(struct iscsi_cls_session *cls_session,
 
 	/* setup Socket parameters */
 	sk = sock->sk;
-	sk->sk_reuse = 1;
+	sk->sk_reuse = SK_CAN_REUSE;
 	sk->sk_sndtimeo = 15 * HZ; /* FIXME: make it configurable */
 	sk->sk_allocation = GFP_ATOMIC;
 
diff --git a/drivers/staging/ramster/cluster/tcp.c b/drivers/staging/ramster/cluster/tcp.c
index 3af1b2c..b9721c1 100644
--- a/drivers/staging/ramster/cluster/tcp.c
+++ b/drivers/staging/ramster/cluster/tcp.c
@@ -2106,7 +2106,7 @@ static int r2net_open_listening_sock(__be32 addr, __be16 port)
 	r2net_listen_sock = sock;
 	INIT_WORK(&r2net_listen_work, r2net_accept_many);
 
-	sock->sk->sk_reuse = 1;
+	sock->sk->sk_reuse = SK_CAN_REUSE;
 	ret = sock->ops->bind(sock, (struct sockaddr *)&sin, sizeof(sin));
 	if (ret < 0) {
 		printk(KERN_ERR "ramster: Error %d while binding socket at "
diff --git a/fs/ocfs2/cluster/tcp.c b/fs/ocfs2/cluster/tcp.c
index 044e7b5..1bfe880 100644
--- a/fs/ocfs2/cluster/tcp.c
+++ b/fs/ocfs2/cluster/tcp.c
@@ -2005,7 +2005,7 @@ static int o2net_open_listening_sock(__be32 addr, __be16 port)
 	o2net_listen_sock = sock;
 	INIT_WORK(&o2net_listen_work, o2net_accept_many);
 
-	sock->sk->sk_reuse = 1;
+	sock->sk->sk_reuse = SK_CAN_REUSE;
 	ret = sock->ops->bind(sock, (struct sockaddr *)&sin, sizeof(sin));
 	if (ret < 0) {
 		printk(KERN_ERR "o2net: Error %d while binding socket at "
diff --git a/include/net/sock.h b/include/net/sock.h
index a6ba1f8..4cdb9b3 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -376,6 +376,17 @@ struct sock {
 	void                    (*sk_destruct)(struct sock *sk);
 };
 
+/*
+ * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
+ * or not whether his port will be reused by someone else. SK_FORCE_REUSE
+ * on a socket means that the socket will reuse everybody else's port
+ * without looking at the other's sk_reuse value.
+ */
+
+#define SK_NO_REUSE	0
+#define SK_CAN_REUSE	1
+#define SK_FORCE_REUSE	2
+
 static inline int sk_peek_offset(struct sock *sk, int flags)
 {
 	if ((flags & MSG_PEEK) && (sk->sk_peek_off >= 0))
diff --git a/net/core/sock.c b/net/core/sock.c
index c7e60ea..679c5bb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -561,7 +561,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
 		break;
 	case SO_REUSEADDR:
-		sk->sk_reuse = valbool;
+		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
 		break;
 	case SO_TYPE:
 	case SO_PROTOCOL:
diff --git a/net/econet/af_econet.c b/net/econet/af_econet.c
index 71b5edc..fa14ca7 100644
--- a/net/econet/af_econet.c
+++ b/net/econet/af_econet.c
@@ -617,7 +617,7 @@ static int econet_create(struct net *net, struct socket *sock, int protocol,
 	if (sk == NULL)
 		goto out;
 
-	sk->sk_reuse = 1;
+	sk->sk_reuse = SK_CAN_REUSE;
 	sock->ops = &econet_ops;
 	sock_init_data(sock, sk);
 
@@ -1012,7 +1012,7 @@ static int __init aun_udp_initialise(void)
 		return error;
 	}
 
-	udpsock->sk->sk_reuse = 1;
+	udpsock->sk->sk_reuse = SK_CAN_REUSE;
 	udpsock->sk->sk_allocation = GFP_ATOMIC; /* we're going to call it
 						    from interrupts */
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 3744c1c..c8f7aee 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -350,7 +350,7 @@ lookup_protocol:
 	err = 0;
 	sk->sk_no_check = answer_no_check;
 	if (INET_PROTOSW_REUSE & answer_flags)
-		sk->sk_reuse = 1;
+		sk->sk_reuse = SK_CAN_REUSE;
 
 	inet = inet_sk(sk);
 	inet->is_icsk = (INET_PROTOSW_ICSK & answer_flags) != 0;
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 7d972f6..95e6159 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -182,6 +182,9 @@ have_snum:
 	goto tb_not_found;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
+		if (sk->sk_reuse == SK_FORCE_REUSE)
+			goto success;
+
 		if (tb->fastreuse > 0 &&
 		    sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
 		    smallest_size == -1) {
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 8ed1b93..499e74e 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -180,7 +180,7 @@ lookup_protocol:
 	err = 0;
 	sk->sk_no_check = answer_no_check;
 	if (INET_PROTOSW_REUSE & answer_flags)
-		sk->sk_reuse = 1;
+		sk->sk_reuse = SK_CAN_REUSE;
 
 	inet = inet_sk(sk);
 	inet->is_icsk = (INET_PROTOSW_ICSK & answer_flags) != 0;
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index f4e0b6c..bf5e538 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -1368,7 +1368,7 @@ static struct socket *make_receive_sock(struct net *net)
 	 */
 	sk_change_net(sock->sk, net);
 	/* it is equivalent to the REUSEADDR option in user-space */
-	sock->sk->sk_reuse = 1;
+	sock->sk->sk_reuse = SK_CAN_REUSE;
 
 	result = sock->ops->bind(sock, (struct sockaddr *) &mcast_addr,
 			sizeof(struct sockaddr));
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 8b5cc4a..7298137 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -145,7 +145,7 @@ int rds_tcp_listen_init(void)
 	if (ret < 0)
 		goto out;
 
-	sock->sk->sk_reuse = 1;
+	sock->sk->sk_reuse = SK_CAN_REUSE;
 	rds_tcp_nonagle(sock);
 
 	write_lock_bh(&sock->sk->sk_callback_lock);
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 824d32f..f0132b2 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1556,7 +1556,7 @@ static struct svc_xprt *svc_create_socket(struct svc_serv *serv,
 					(char *)&val, sizeof(val));
 
 	if (type == SOCK_STREAM)
-		sock->sk->sk_reuse = 1;		/* allow address reuse */
+		sock->sk->sk_reuse = SK_CAN_REUSE; /* allow address reuse */
 	error = kernel_bind(sock, sin, len);
 	if (error < 0)
 		goto bummer;
-- 
1.5.5.6


* [PATCH 2/6] tcp: Move code around
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
  2012-04-19 13:39 ` [PATCH 1/6] sock: Introduce named constants for sk_reuse Pavel Emelyanov
@ 2012-04-19 13:40 ` Pavel Emelyanov
  2012-04-19 13:40 ` [PATCH 3/6] tcp: Initial repair mode Pavel Emelyanov
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:40 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

This is just a preparation patch, which makes the code needed for
TCP repair ready for use.

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 include/net/tcp.h     |    3 ++
 net/ipv4/tcp.c        |    2 +-
 net/ipv4/tcp_input.c  |   81 +++++++++++++++++++++++++++++--------------------
 net/ipv4/tcp_output.c |    4 +-
 4 files changed, 54 insertions(+), 36 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..633fde2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -435,6 +435,9 @@ extern struct sk_buff * tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 					struct request_values *rvp);
 extern int tcp_disconnect(struct sock *sk, int flags);
 
+void tcp_connect_init(struct sock *sk);
+void tcp_finish_connect(struct sock *sk, struct sk_buff *skb);
+void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen);
 
 /* From syncookies.c */
 extern __u32 syncookie_secret[2][16-4+SHA_DIGEST_WORDS];
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c53e8a8..bb4200f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -919,7 +919,7 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int iovlen, flags, err, copied;
-	int mss_now, size_goal;
+	int mss_now = 0, size_goal;
 	bool sg;
 	long timeo;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..37e1c5c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5325,6 +5325,14 @@ discard:
 	return 0;
 }
 
+void tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen)
+{
+	__skb_pull(skb, hdrlen);
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+	skb_set_owner_r(skb, sk);
+	tcp_sk(sk)->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+}
+
 /*
  *	TCP receive function for the ESTABLISHED state.
  *
@@ -5490,10 +5498,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
 				/* Bulk data transfer: receiver */
-				__skb_pull(skb, tcp_header_len);
-				__skb_queue_tail(&sk->sk_receive_queue, skb);
-				skb_set_owner_r(skb, sk);
-				tp->rcv_nxt = TCP_SKB_CB(skb)->end_seq;
+				tcp_queue_rcv(sk, skb, tcp_header_len);
 			}
 
 			tcp_event_data_recv(sk, skb);
@@ -5559,6 +5564,44 @@ discard:
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
+void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_set_state(sk, TCP_ESTABLISHED);
+
+	if (skb != NULL)
+		security_inet_conn_established(sk, skb);
+
+	/* Make sure socket is routed, for correct metrics.  */
+	icsk->icsk_af_ops->rebuild_header(sk);
+
+	tcp_init_metrics(sk);
+
+	tcp_init_congestion_control(sk);
+
+	/* Prevent spurious tcp_cwnd_restart() on first data
+	 * packet.
+	 */
+	tp->lsndtime = tcp_time_stamp;
+
+	tcp_init_buffer_space(sk);
+
+	if (sock_flag(sk, SOCK_KEEPOPEN))
+		inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
+
+	if (!tp->rx_opt.snd_wscale)
+		__tcp_fast_path_on(tp, tp->snd_wnd);
+	else
+		tp->pred_flags = 0;
+
+	if (!sock_flag(sk, SOCK_DEAD)) {
+		sk->sk_state_change(sk);
+		sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
+	}
+}
+
 static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 					 const struct tcphdr *th, unsigned int len)
 {
@@ -5691,36 +5734,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		}
 
 		smp_mb();
-		tcp_set_state(sk, TCP_ESTABLISHED);
-
-		security_inet_conn_established(sk, skb);
-
-		/* Make sure socket is routed, for correct metrics.  */
-		icsk->icsk_af_ops->rebuild_header(sk);
-
-		tcp_init_metrics(sk);
 
-		tcp_init_congestion_control(sk);
-
-		/* Prevent spurious tcp_cwnd_restart() on first data
-		 * packet.
-		 */
-		tp->lsndtime = tcp_time_stamp;
-
-		tcp_init_buffer_space(sk);
-
-		if (sock_flag(sk, SOCK_KEEPOPEN))
-			inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
-
-		if (!tp->rx_opt.snd_wscale)
-			__tcp_fast_path_on(tp, tp->snd_wnd);
-		else
-			tp->pred_flags = 0;
-
-		if (!sock_flag(sk, SOCK_DEAD)) {
-			sk->sk_state_change(sk);
-			sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
-		}
+		tcp_finish_connect(sk, skb);
 
 		if (sk->sk_write_pending ||
 		    icsk->icsk_accept_queue.rskq_defer_accept ||
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..db126a6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2561,7 +2561,7 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 EXPORT_SYMBOL(tcp_make_synack);
 
 /* Do all connect socket setups that can be done AF independent. */
-static void tcp_connect_init(struct sock *sk)
+void tcp_connect_init(struct sock *sk)
 {
 	const struct dst_entry *dst = __sk_dst_get(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -2616,6 +2616,7 @@ static void tcp_connect_init(struct sock *sk)
 	tp->snd_una = tp->write_seq;
 	tp->snd_sml = tp->write_seq;
 	tp->snd_up = tp->write_seq;
+	tp->snd_nxt = tp->write_seq;
 	tp->rcv_nxt = 0;
 	tp->rcv_wup = 0;
 	tp->copied_seq = 0;
@@ -2641,7 +2642,6 @@ int tcp_connect(struct sock *sk)
 	/* Reserve space for headers. */
 	skb_reserve(buff, MAX_TCP_HEADER);
 
-	tp->snd_nxt = tp->write_seq;
 	tcp_init_nondata_skb(buff, tp->write_seq++, TCPHDR_SYN);
 	TCP_ECN_send_syn(sk, buff);
 
-- 
1.5.5.6


* [PATCH 3/6] tcp: Initial repair mode
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
  2012-04-19 13:39 ` [PATCH 1/6] sock: Introduce named constants for sk_reuse Pavel Emelyanov
  2012-04-19 13:40 ` [PATCH 2/6] tcp: Move code around Pavel Emelyanov
@ 2012-04-19 13:40 ` Pavel Emelyanov
  2012-04-19 13:41 ` [PATCH 4/6] tcp: Repair socket queues Pavel Emelyanov
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:40 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

This includes (according to the previous description):

* TCP_REPAIR sockoption

This one just puts the socket in/out of repair mode.
Allowed for CAP_NET_ADMIN and for closed/established sockets only.
When repair mode is turned off and the socket happens to be in
the established state, a window probe is sent to the peer to
'unlock' the connection.

* TCP_REPAIR_QUEUE sockoption

This one sets the queue which we're about to repair. The
'no-queue' is set by default.

* TCP_QUEUE_SEQ sockoption

Sets the write_seq/rcv_nxt of the selected repaired queue.
Allowed for TCP_CLOSE-d sockets only. When the socket changes
its state, the other seqs are changed by the kernel according
to the protocol rules (most of the existing code is actually
reused).

* Ability to forcibly bind a socket to a port

The sk->sk_reuse is set to SK_FORCE_REUSE.

* Immediate connect modification

The connect syscall initializes the connection, then directly jumps
to the code which finalizes it.

* Silent close modification

The close just aborts the connection (similar to SO_LINGER with 0
time) but without sending any FIN/RST-s to the peer.
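
For illustration only (not part of this patch), here is a rough sketch
of how a checkpoint/restore tool might drive this API from user space.
The numeric option values come from the hunks below; the saved_*
parameters are hypothetical placeholders for previously dumped state,
and error handling is omitted:

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	#ifndef TCP_REPAIR			/* values from this series */
	#define TCP_REPAIR		19
	#define TCP_REPAIR_QUEUE	20
	#define TCP_QUEUE_SEQ		21
	#define TCP_RECV_QUEUE		1
	#define TCP_SEND_QUEUE		2
	#endif

	int restore_tcp_sock(struct sockaddr_in *saved_self,
			     struct sockaddr_in *saved_peer,
			     unsigned int saved_write_seq,
			     unsigned int saved_rcv_nxt)
	{
		int s = socket(AF_INET, SOCK_STREAM, 0);
		int one = 1, zero = 0, q;

		/* Enter repair mode (needs CAP_NET_ADMIN, socket is TCP_CLOSE) */
		setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &one, sizeof(one));

		/* sk_reuse is now SK_FORCE_REUSE, so the old port can be re-taken */
		bind(s, (struct sockaddr *)saved_self, sizeof(*saved_self));

		/* Restore the sequence numbers of both queues while still TCP_CLOSE */
		q = TCP_SEND_QUEUE;
		setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
		setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ,
			   &saved_write_seq, sizeof(saved_write_seq));
		q = TCP_RECV_QUEUE;
		setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
		setsockopt(s, IPPROTO_TCP, TCP_QUEUE_SEQ,
			   &saved_rcv_nxt, sizeof(saved_rcv_nxt));

		/* "Immediate connect": no packets hit the wire */
		connect(s, (struct sockaddr *)saved_peer, sizeof(*saved_peer));

		/* ... refill queues and restore negotiated options here
		 * (patches #4 and #6) ... */

		/* Leave repair mode; a window probe 'unlocks' the connection */
		setsockopt(s, IPPROTO_TCP, TCP_REPAIR, &zero, sizeof(zero));
		return s;
	}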

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 include/linux/tcp.h   |   14 +++++++++-
 include/net/tcp.h     |    2 +
 net/ipv4/tcp.c        |   68 ++++++++++++++++++++++++++++++++++++++++++++++++-
 net/ipv4/tcp_ipv4.c   |   19 +++++++++++--
 net/ipv4/tcp_output.c |   16 +++++++++--
 5 files changed, 111 insertions(+), 8 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b6c62d2..4e90e6a 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -106,6 +106,16 @@ enum {
 #define TCP_THIN_LINEAR_TIMEOUTS 16      /* Use linear timeouts for thin streams*/
 #define TCP_THIN_DUPACK         17      /* Fast retrans. after 1 dupack */
 #define TCP_USER_TIMEOUT	18	/* How long for loss retry before timeout */
+#define TCP_REPAIR		19	/* TCP sock is under repair right now */
+#define TCP_REPAIR_QUEUE	20
+#define TCP_QUEUE_SEQ		21
+
+enum {
+	TCP_NO_QUEUE,
+	TCP_RECV_QUEUE,
+	TCP_SEND_QUEUE,
+	TCP_QUEUES_NR,
+};
 
 /* for TCP_INFO socket option */
 #define TCPI_OPT_TIMESTAMPS	1
@@ -353,7 +363,9 @@ struct tcp_sock {
 	u8	nonagle     : 4,/* Disable Nagle algorithm?             */
 		thin_lto    : 1,/* Use linear timeouts for thin streams */
 		thin_dupack : 1,/* Fast retransmit on first dupack      */
-		unused      : 2;
+		repair      : 1,
+		unused      : 1;
+	u8	repair_queue;
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 633fde2..b4ccb8a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -612,6 +612,8 @@ static inline u32 tcp_receive_window(const struct tcp_sock *tp)
  */
 extern u32 __tcp_select_window(struct sock *sk);
 
+void tcp_send_window_probe(struct sock *sk);
+
 /* TCP timestamps are only 32-bits, this causes a slight
  * complication on 64-bit systems since we store a snapshot
  * of jiffies in the buffer control blocks below.  We decided
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index bb4200f..e38d6f2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1935,7 +1935,9 @@ void tcp_close(struct sock *sk, long timeout)
 	 * advertise a zero window, then kill -9 the FTP client, wheee...
 	 * Note: timeout is always zero in such a case.
 	 */
-	if (data_was_unread) {
+	if (unlikely(tcp_sk(sk)->repair)) {
+		sk->sk_prot->disconnect(sk, 0);
+	} else if (data_was_unread) {
 		/* Unread data was tossed, zap the connection. */
 		NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
 		tcp_set_state(sk, TCP_CLOSE);
@@ -2074,6 +2076,8 @@ int tcp_disconnect(struct sock *sk, int flags)
 	/* ABORT function of RFC793 */
 	if (old_state == TCP_LISTEN) {
 		inet_csk_listen_stop(sk);
+	} else if (unlikely(tp->repair)) {
+		sk->sk_err = ECONNABORTED;
 	} else if (tcp_need_reset(old_state) ||
 		   (tp->snd_nxt != tp->write_seq &&
 		    (1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK))) {
@@ -2125,6 +2129,12 @@ int tcp_disconnect(struct sock *sk, int flags)
 }
 EXPORT_SYMBOL(tcp_disconnect);
 
+static inline int tcp_can_repair_sock(struct sock *sk)
+{
+	return capable(CAP_NET_ADMIN) &&
+		((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_ESTABLISHED));
+}
+
 /*
  *	Socket option code for TCP.
  */
@@ -2297,6 +2307,42 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			tp->thin_dupack = val;
 		break;
 
+	case TCP_REPAIR:
+		if (!tcp_can_repair_sock(sk))
+			err = -EPERM;
+		else if (val == 1) {
+			tp->repair = 1;
+			sk->sk_reuse = SK_FORCE_REUSE;
+			tp->repair_queue = TCP_NO_QUEUE;
+		} else if (val == 0) {
+			tp->repair = 0;
+			sk->sk_reuse = SK_NO_REUSE;
+			tcp_send_window_probe(sk);
+		} else
+			err = -EINVAL;
+
+		break;
+
+	case TCP_REPAIR_QUEUE:
+		if (!tp->repair)
+			err = -EPERM;
+		else if (val < TCP_QUEUES_NR)
+			tp->repair_queue = val;
+		else
+			err = -EINVAL;
+		break;
+
+	case TCP_QUEUE_SEQ:
+		if (sk->sk_state != TCP_CLOSE)
+			err = -EPERM;
+		else if (tp->repair_queue == TCP_SEND_QUEUE)
+			tp->write_seq = val;
+		else if (tp->repair_queue == TCP_RECV_QUEUE)
+			tp->rcv_nxt = val;
+		else
+			err = -EINVAL;
+		break;
+
 	case TCP_CORK:
 		/* When set indicates to always queue non-full frames.
 		 * Later the user clears this option and we transmit
@@ -2632,6 +2678,26 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		val = tp->thin_dupack;
 		break;
 
+	case TCP_REPAIR:
+		val = tp->repair;
+		break;
+
+	case TCP_REPAIR_QUEUE:
+		if (tp->repair)
+			val = tp->repair_queue;
+		else
+			return -EINVAL;
+		break;
+
+	case TCP_QUEUE_SEQ:
+		if (tp->repair_queue == TCP_SEND_QUEUE)
+			val = tp->write_seq;
+		else if (tp->repair_queue == TCP_RECV_QUEUE)
+			val = tp->rcv_nxt;
+		else
+			return -EINVAL;
+		break;
+
 	case TCP_USER_TIMEOUT:
 		val = jiffies_to_msecs(icsk->icsk_user_timeout);
 		break;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0cb86ce..ba6dad8 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -138,6 +138,14 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
 }
 EXPORT_SYMBOL_GPL(tcp_twsk_unique);
 
+static int tcp_repair_connect(struct sock *sk)
+{
+	tcp_connect_init(sk);
+	tcp_finish_connect(sk, NULL);
+
+	return 0;
+}
+
 /* This will initiate an outgoing connection. */
 int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 {
@@ -196,7 +204,8 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		/* Reset inherited state */
 		tp->rx_opt.ts_recent	   = 0;
 		tp->rx_opt.ts_recent_stamp = 0;
-		tp->write_seq		   = 0;
+		if (likely(!tp->repair))
+			tp->write_seq	   = 0;
 	}
 
 	if (tcp_death_row.sysctl_tw_recycle &&
@@ -247,7 +256,7 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	sk->sk_gso_type = SKB_GSO_TCPV4;
 	sk_setup_caps(sk, &rt->dst);
 
-	if (!tp->write_seq)
+	if (!tp->write_seq && likely(!tp->repair))
 		tp->write_seq = secure_tcp_sequence_number(inet->inet_saddr,
 							   inet->inet_daddr,
 							   inet->inet_sport,
@@ -255,7 +264,11 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 
 	inet->inet_id = tp->write_seq ^ jiffies;
 
-	err = tcp_connect(sk);
+	if (likely(!tp->repair))
+		err = tcp_connect(sk);
+	else
+		err = tcp_repair_connect(sk);
+
 	rt = NULL;
 	if (err)
 		goto failure;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index db126a6..fa442a6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2617,9 +2617,11 @@ void tcp_connect_init(struct sock *sk)
 	tp->snd_sml = tp->write_seq;
 	tp->snd_up = tp->write_seq;
 	tp->snd_nxt = tp->write_seq;
-	tp->rcv_nxt = 0;
-	tp->rcv_wup = 0;
-	tp->copied_seq = 0;
+
+	if (likely(!tp->repair))
+		tp->rcv_nxt = 0;
+	tp->rcv_wup = tp->rcv_nxt;
+	tp->copied_seq = tp->rcv_nxt;
 
 	inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
 	inet_csk(sk)->icsk_retransmits = 0;
@@ -2790,6 +2792,14 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent)
 	return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC);
 }
 
+void tcp_send_window_probe(struct sock *sk)
+{
+	if (sk->sk_state == TCP_ESTABLISHED) {
+		tcp_sk(sk)->snd_wl1 = tcp_sk(sk)->rcv_nxt - 1;
+		tcp_xmit_probe_skb(sk, 0);
+	}
+}
+
 /* Initiate keepalive or window probe from timer. */
 int tcp_write_wakeup(struct sock *sk)
 {
-- 
1.5.5.6


* [PATCH 4/6] tcp: Repair socket queues
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
                   ` (2 preceding siblings ...)
  2012-04-19 13:40 ` [PATCH 3/6] tcp: Initial repair mode Pavel Emelyanov
@ 2012-04-19 13:41 ` Pavel Emelyanov
  2012-05-02 11:11   ` Eric Dumazet
  2012-04-19 13:41 ` [PATCH 5/6] tcp: Report mss_clamp with TCP_MAXSEG option in repair mode Pavel Emelyanov
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:41 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

Reading queues under repair mode is done with recvmsg call.
The queue-under-repair set by TCP_REPAIR_QUEUE option is used
to determine which queue should be read. Thus both send and
receive queue can be read with this.

Caller must pass the MSG_PEEK flag.

Writing to queues is done with sendmsg call and yet again --
the repair-queue option can be used to push data into the
receive queue.

When putting an skb into the receive queue a zero tcp header is
appended to its head to address the tcp_hdr(skb)->syn and
the ->fin checks by the (after repair) tcp_recvmsg. These
flags are both set to zero, which is exactly what those checks need.

The fin cannot be met in the queue while reading the source
socket, since the repair only works for closed/established
sockets and queueing a fin packet always changes its state.

The syn in the queue denotes that the respective skb's seq
is "off-by-one" as compared to the actual payload length. Thus,
at the rcv queue refill we can just drop this flag and set the
skb's sequences to precise values.

When the repair mode is turned off, the write queue seqs are
updated so that the whole queue is considered to be 'already sent,
waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
protocol POV the send queue looks like it was sent, but the data
between the write_seq and snd_nxt is lost in the network.

This helps to avoid another sockoption for setting the snd_nxt
sequence. Leaving the whole queue in a 'not yet sent' state (as
it will be after sendmsg-s) will not allow us to receive any acks
from the peer since the ack_seq will be after the snd_nxt. Thus
even the ack for the window probe will be dropped and the
connection will be 'locked' with the zero peer window.
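
To make the above concrete, a purely illustrative user-space fragment
(not from this patch; it reuses the TCP_REPAIR_QUEUE/TCP_SEND_QUEUE/
TCP_RECV_QUEUE defines from the sketch in patch #3, 's' is a socket
already in repair mode, 'buf', 'saved_rcv_data' and 'saved_rcv_len' are
hypothetical placeholders, and error handling is omitted):

	int q;
	ssize_t dumped;

	/* Dump the send queue: select it, then peek the data */
	q = TCP_SEND_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	dumped = recv(s, buf, sizeof(buf), MSG_PEEK);	/* MSG_PEEK is mandatory */

	/* Refill the receive queue: select it, then a plain send()
	 * pushes the bytes in (no packets are sent to the peer) */
	q = TCP_RECV_QUEUE;
	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_QUEUE, &q, sizeof(q));
	send(s, saved_rcv_data, saved_rcv_len, 0);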

Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
---
 net/ipv4/tcp.c        |   89 +++++++++++++++++++++++++++++++++++++++++++++++--
 net/ipv4/tcp_output.c |    1 +
 2 files changed, 87 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e38d6f2..47e2f49 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -912,6 +912,39 @@ static inline int select_size(const struct sock *sk, bool sg)
 	return tmp;
 }
 
+static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
+{
+	struct sk_buff *skb;
+	struct tcp_skb_cb *cb;
+	struct tcphdr *th;
+
+	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
+	if (!skb)
+		goto err;
+
+	th = (struct tcphdr *)skb_put(skb, sizeof(*th));
+	skb_reset_transport_header(skb);
+	memset(th, 0, sizeof(*th));
+
+	if (memcpy_fromiovec(skb_put(skb, size), msg->msg_iov, size))
+		goto err_free;
+
+	cb = TCP_SKB_CB(skb);
+
+	TCP_SKB_CB(skb)->seq = tcp_sk(sk)->rcv_nxt;
+	TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
+	TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
+
+	tcp_queue_rcv(sk, skb, sizeof(*th));
+
+	return size;
+
+err_free:
+	kfree_skb(skb);
+err:
+	return -ENOMEM;
+}
+
 int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		size_t size)
 {
@@ -933,6 +966,19 @@ int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
 			goto out_err;
 
+	if (unlikely(tp->repair)) {
+		if (tp->repair_queue == TCP_RECV_QUEUE) {
+			copied = tcp_send_rcvq(sk, msg, size);
+			goto out;
+		}
+
+		err = -EINVAL;
+		if (tp->repair_queue == TCP_NO_QUEUE)
+			goto out_err;
+
+		/* 'common' sending to sendq */
+	}
+
 	/* This should be in poll */
 	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
@@ -1089,7 +1135,7 @@ new_segment:
 			if ((seglen -= copy) == 0 && iovlen == 0)
 				goto out;
 
-			if (skb->len < max || (flags & MSG_OOB))
+			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
 				continue;
 
 			if (forced_push(tp)) {
@@ -1102,7 +1148,7 @@ new_segment:
 wait_for_sndbuf:
 			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 wait_for_memory:
-			if (copied)
+			if (copied && likely(!tp->repair))
 				tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
 
 			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
@@ -1113,7 +1159,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && likely(!tp->repair))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	release_sock(sk);
 	return copied;
@@ -1187,6 +1233,24 @@ static int tcp_recv_urg(struct sock *sk, struct msghdr *msg, int len, int flags)
 	return -EAGAIN;
 }
 
+static int tcp_peek_sndq(struct sock *sk, struct msghdr *msg, int len)
+{
+	struct sk_buff *skb;
+	int copied = 0, err = 0;
+
+	/* XXX -- need to support SO_PEEK_OFF */
+
+	skb_queue_walk(&sk->sk_write_queue, skb) {
+		err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, skb->len);
+		if (err)
+			break;
+
+		copied += skb->len;
+	}
+
+	return err ?: copied;
+}
+
 /* Clean up the receive buffer for full frames taken by the user,
  * then send an ACK if necessary.  COPIED is the number of bytes
  * tcp_recvmsg has given to the user so far, it speeds up the
@@ -1432,6 +1496,21 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	if (flags & MSG_OOB)
 		goto recv_urg;
 
+	if (unlikely(tp->repair)) {
+		err = -EPERM;
+		if (!(flags & MSG_PEEK))
+			goto out;
+
+		if (tp->repair_queue == TCP_SEND_QUEUE)
+			goto recv_sndq;
+
+		err = -EINVAL;
+		if (tp->repair_queue == TCP_NO_QUEUE)
+			goto out;
+
+		/* 'common' recv queue MSG_PEEK-ing */
+	}
+
 	seq = &tp->copied_seq;
 	if (flags & MSG_PEEK) {
 		peek_seq = tp->copied_seq;
@@ -1783,6 +1862,10 @@ out:
 recv_urg:
 	err = tcp_recv_urg(sk, msg, len, flags);
 	goto out;
+
+recv_sndq:
+	err = tcp_peek_sndq(sk, msg, len);
+	goto out;
 }
 EXPORT_SYMBOL(tcp_recvmsg);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fa442a6..57a834c 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2796,6 +2796,7 @@ void tcp_send_window_probe(struct sock *sk)
 {
 	if (sk->sk_state == TCP_ESTABLISHED) {
 		tcp_sk(sk)->snd_wl1 = tcp_sk(sk)->rcv_nxt - 1;
+		tcp_sk(sk)->snd_nxt = tcp_sk(sk)->write_seq;
 		tcp_xmit_probe_skb(sk, 0);
 	}
 }
-- 
1.5.5.6


* [PATCH 5/6] tcp: Report mss_clamp with TCP_MAXSEG option in repair mode
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
                   ` (3 preceding siblings ...)
  2012-04-19 13:41 ` [PATCH 4/6] tcp: Repair socket queues Pavel Emelyanov
@ 2012-04-19 13:41 ` Pavel Emelyanov
  2012-04-19 13:41 ` [PATCH 6/6] tcp: Repair connection-time negotiated parameters Pavel Emelyanov
  2012-04-21 19:53 ` [PATCH net-next 0/6] TCP connection repair (v4) David Miller
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:41 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

The mss_clamp is the only connection-time negotiated option which
cannot be obtained from user space. Make the TCP_MAXSEG sockopt
report it in repair mode.
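
For illustration (not part of the patch), a dumping tool could then read
the negotiated mss with an ordinary getsockopt call on a socket that is
in repair mode ('s' is such a socket, error handling omitted):

	int mss;
	socklen_t len = sizeof(mss);

	/* with tp->repair set this now returns rx_opt.mss_clamp */
	getsockopt(s, IPPROTO_TCP, TCP_MAXSEG, &mss, &len);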

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 net/ipv4/tcp.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 47e2f49..b4e690d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2659,6 +2659,8 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
 		val = tp->mss_cache;
 		if (!val && ((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_LISTEN)))
 			val = tp->rx_opt.user_mss;
+		if (tp->repair)
+			val = tp->rx_opt.mss_clamp;
 		break;
 	case TCP_NODELAY:
 		val = !!(tp->nonagle&TCP_NAGLE_OFF);
-- 
1.5.5.6


* [PATCH 6/6] tcp: Repair connection-time negotiated parameters
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
                   ` (4 preceding siblings ...)
  2012-04-19 13:41 ` [PATCH 5/6] tcp: Report mss_clamp with TCP_MAXSEG option in repair mode Pavel Emelyanov
@ 2012-04-19 13:41 ` Pavel Emelyanov
  2012-04-21 19:53 ` [PATCH net-next 0/6] TCP connection repair (v4) David Miller
  6 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-04-19 13:41 UTC (permalink / raw)
  To: Linux Netdev List, David Miller

There are options which are set up on a socket while performing the
TCP handshake; they need to be resurrected on a socket being repaired.
A new sockoption accepts a buffer and parses it. The buffer should
be a CODE:VALUE sequence of bytes, where CODE is a standard option
code and VALUE is the respective value.

Only 4 options need to be handled on a repaired socket.

To read 3 out of 4 of these options the TCP_INFO sockoption can be
used. An ability to get the last one (the mss_clamp) was added by
the previous patch.

Now the restore. Three of these options -- timestamp_ok, mss_clamp
and snd_wscale -- are just restored on a socket.

The sack_ok flags field has two issues. First, whether or not to do
sacks at all. This flag is just read and set back. No other sack info is
saved or restored, since according to the standard and the code
dropping all sack-ed segments is OK: the sender will resubmit them
again, so after the repair we will probably experience a pause in
the connection. Next, the fack bit. It's just set back on a socket if
the respective sysctl is set. No collected stats about packet flow
are preserved. As far as I can see (please correct me if I'm wrong) the
fack-based congestion algorithm survives dropping all of the stats
and repairs itself eventually, probably losing some performance for
that period.
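
Purely as an illustration (not part of the patch), a restoring tool
might build the buffer like this. The TCPOPT_* numbers are the standard
TCP option kinds (MSS=2, WINDOW=3, SACK_PERM=4, TIMESTAMP=8), the values
shown are placeholders, 's' is an established socket in repair mode, and
error handling is omitted:

	#define TCPOPT_MSS		2
	#define TCPOPT_WINDOW		3
	#define TCPOPT_SACK_PERM	4
	#define TCPOPT_TIMESTAMP	8
	#define TCP_REPAIR_OPTIONS	22	/* value from this patch */

	unsigned char opts[16], *p = opts;
	unsigned short mss = 1460;	/* e.g. TCP_MAXSEG read in repair mode */
	unsigned char wscale = 7;	/* e.g. taken from TCP_INFO */

	*p++ = TCPOPT_MSS;		/* note: the u16 lands at an odd offset */
	memcpy(p, &mss, sizeof(mss));
	p += sizeof(mss);

	*p++ = TCPOPT_WINDOW;
	*p++ = wscale;

	*p++ = TCPOPT_SACK_PERM;	/* no value */
	*p++ = TCPOPT_TIMESTAMP;	/* no value */

	setsockopt(s, IPPROTO_TCP, TCP_REPAIR_OPTIONS, opts, p - opts);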

Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
---
 include/linux/tcp.h |    1 +
 net/ipv4/tcp.c      |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 0 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4e90e6a..9865936 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -109,6 +109,7 @@ enum {
 #define TCP_REPAIR		19	/* TCP sock is under repair right now */
 #define TCP_REPAIR_QUEUE	20
 #define TCP_QUEUE_SEQ		21
+#define TCP_REPAIR_OPTIONS	22
 
 enum {
 	TCP_NO_QUEUE,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b4e690d..3ce3bd0 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2218,6 +2218,68 @@ static inline int tcp_can_repair_sock(struct sock *sk)
 		((1 << sk->sk_state) & (TCPF_CLOSE | TCPF_ESTABLISHED));
 }
 
+static int tcp_repair_options_est(struct tcp_sock *tp, char __user *optbuf, unsigned int len)
+{
+	/*
+	 * Options are stored in CODE:VALUE form where CODE is 8bit and VALUE
+	 * fits the respective TCPOLEN_ size
+	 */
+
+	while (len > 0) {
+		u8 opcode;
+
+		if (get_user(opcode, optbuf))
+			return -EFAULT;
+
+		optbuf++;
+		len--;
+
+		switch (opcode) {
+		case TCPOPT_MSS: {
+			u16 in_mss;
+
+			if (len < sizeof(in_mss))
+				return -ENODATA;
+			if (get_user(in_mss, optbuf))
+				return -EFAULT;
+
+			tp->rx_opt.mss_clamp = in_mss;
+
+			optbuf += sizeof(in_mss);
+			len -= sizeof(in_mss);
+			break;
+		}
+		case TCPOPT_WINDOW: {
+			u8 wscale;
+
+			if (len < sizeof(wscale))
+				return -ENODATA;
+			if (get_user(wscale, optbuf))
+				return -EFAULT;
+
+			if (wscale > 14)
+				return -EFBIG;
+
+			tp->rx_opt.snd_wscale = wscale;
+
+			optbuf += sizeof(wscale);
+			len -= sizeof(wscale);
+			break;
+		}
+		case TCPOPT_SACK_PERM:
+			tp->rx_opt.sack_ok |= TCP_SACK_SEEN;
+			if (sysctl_tcp_fack)
+				tcp_enable_fack(tp);
+			break;
+		case TCPOPT_TIMESTAMP:
+			tp->rx_opt.tstamp_ok = 1;
+			break;
+		}
+	}
+
+	return 0;
+}
+
 /*
  *	Socket option code for TCP.
  */
@@ -2426,6 +2488,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EINVAL;
 		break;
 
+	case TCP_REPAIR_OPTIONS:
+		if (!tp->repair)
+			err = -EINVAL;
+		else if (sk->sk_state == TCP_ESTABLISHED)
+			err = tcp_repair_options_est(tp, optval, optlen);
+		else
+			err = -EPERM;
+		break;
+
 	case TCP_CORK:
 		/* When set indicates to always queue non-full frames.
 		 * Later the user clears this option and we transmit
-- 
1.5.5.6


* Re: [PATCH net-next 0/6] TCP connection repair (v4)
  2012-04-19 13:38 [PATCH net-next 0/6] TCP connection repair (v4) Pavel Emelyanov
                   ` (5 preceding siblings ...)
  2012-04-19 13:41 ` [PATCH 6/6] tcp: Repair connection-time negotiated parameters Pavel Emelyanov
@ 2012-04-21 19:53 ` David Miller
  6 siblings, 0 replies; 13+ messages in thread
From: David Miller @ 2012-04-21 19:53 UTC (permalink / raw)
  To: xemul; +Cc: netdev

From: Pavel Emelyanov <xemul@parallels.com>
Date: Thu, 19 Apr 2012 17:38:58 +0400

> Attempt #4 with an API for TCP connection recreation (the previous one is
> at http://lists.openwall.net/netdev/2012/03/28/84), rebased on
> today's net-next tree.
> 
> Changes since v3:
> 
> * Added repair for TCP options negotiated during the 3WHS, as pointed
>   out by Li Yu. The explanation of how this happens is in patch #6.
> 
> * Named constants for sk_reuse values, as proposed by Ben Hutchings.
> 
> * Fixed an off-by-one in the repair-queue sockoption, caught by Ben.

All applied to net-next, nice work.

Please make the following fix for me.  The option recovery code will
result in unaligned accesses; for example you'll do a byte-aligned
get_user() for the u16 MSS object in many cases, and this will trap
on cpus such as sparc.

Either add a padding facility or pass more structured data into the
socket option.

Thanks.


* Re: [PATCH 4/6] tcp: Repair socket queues
  2012-04-19 13:41 ` [PATCH 4/6] tcp: Repair socket queues Pavel Emelyanov
@ 2012-05-02 11:11   ` Eric Dumazet
  2012-05-03  8:59     ` Pavel Emelyanov
  0 siblings, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-05-02 11:11 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller

On Thu, 2012-04-19 at 17:41 +0400, Pavel Emelyanov wrote:
> Reading queues under repair mode is done with recvmsg call.
> The queue-under-repair set by TCP_REPAIR_QUEUE option is used
> to determine which queue should be read. Thus both send and
> receive queue can be read with this.
> 
> Caller must pass the MSG_PEEK flag.
> 
> Writing to queues is done with sendmsg call and yet again --
> the repair-queue option can be used to push data into the
> receive queue.
> 
> When putting an skb into the receive queue a zero tcp header is
> appended to its head to address the tcp_hdr(skb)->syn and
> the ->fin checks by the (after repair) tcp_recvmsg. These
> flags are both set to zero, which is exactly what those checks need.
> 
> The fin cannot be met in the queue while reading the source
> socket, since the repair only works for closed/established
> sockets and queueing a fin packet always changes its state.
> 
> The syn in the queue denotes that the respective skb's seq
> is "off-by-one" as compared to the actual payload length. Thus,
> at the rcv queue refill we can just drop this flag and set the
> skb's sequences to precise values.
> 
> When the repair mode is turned off, the write queue seqs are
> updated so that the whole queue is considered to be 'already sent,
> waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
> protocol POV the send queue looks like it was sent, but the data
> between the write_seq and snd_nxt is lost in the network.
> 
> This helps to avoid another sockoption for setting the snd_nxt
> sequence. Leaving the whole queue in a 'not yet sent' state (as
> it will be after sendmsg-s) will not allow us to receive any acks
> from the peer since the ack_seq will be after the snd_nxt. Thus
> even the ack for the window probe will be dropped and the
> connection will be 'locked' with the zero peer window.
> 
> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
> ---
>  net/ipv4/tcp.c        |   89 +++++++++++++++++++++++++++++++++++++++++++++++--
>  net/ipv4/tcp_output.c |    1 +
>  2 files changed, 87 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index e38d6f2..47e2f49 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -912,6 +912,39 @@ static inline int select_size(const struct sock *sk, bool sg)
>  	return tmp;
>  }
>  
> +static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
> +{
> +	struct sk_buff *skb;
> +	struct tcp_skb_cb *cb;
> +	struct tcphdr *th;
> +
> +	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);

I am not sure any check is performed on 'size' ?

A caller might trigger OOM or wrap bug.


* Re: [PATCH 4/6] tcp: Repair socket queues
  2012-05-02 11:11   ` Eric Dumazet
@ 2012-05-03  8:59     ` Pavel Emelyanov
  2012-05-03  9:08       ` Eric Dumazet
  2012-05-03  9:31       ` David Miller
  0 siblings, 2 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-05-03  8:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller

On 05/02/2012 03:11 PM, Eric Dumazet wrote:
> On Thu, 2012-04-19 at 17:41 +0400, Pavel Emelyanov wrote:
>> Reading queues under repair mode is done with recvmsg call.
>> The queue-under-repair set by TCP_REPAIR_QUEUE option is used
>> to determine which queue should be read. Thus both send and
>> receive queue can be read with this.
>>
>> Caller must pass the MSG_PEEK flag.
>>
>> Writing to queues is done with sendmsg call and yet again --
>> the repair-queue option can be used to push data into the
>> receive queue.
>>
>> When putting an skb into the receive queue a zero tcp header is
>> appended to its head to address the tcp_hdr(skb)->syn and
>> the ->fin checks by the (after repair) tcp_recvmsg. These
>> flags are both set to zero, which is exactly what those checks need.
>>
>> The fin cannot be met in the queue while reading the source
>> socket, since the repair only works for closed/established
>> sockets and queueing a fin packet always changes its state.
>>
>> The syn in the queue denotes that the respective skb's seq
>> is "off-by-one" as compared to the actual payload length. Thus,
>> at the rcv queue refill we can just drop this flag and set the
>> skb's sequences to precise values.
>>
>> When the repair mode is turned off, the write queue seqs are
>> updated so that the whole queue is considered to be 'already sent,
>> waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
>> protocol POV the send queue looks like it was sent, but the data
>> between the write_seq and snd_nxt is lost in the network.
>>
>> This helps to avoid another sockoption for setting the snd_nxt
>> sequence. Leaving the whole queue in a 'not yet sent' state (as
>> it will be after sendmsg-s) will not allow us to receive any acks
>> from the peer since the ack_seq will be after the snd_nxt. Thus
>> even the ack for the window probe will be dropped and the
>> connection will be 'locked' with the zero peer window.
>>
>> Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
>> ---
>>  net/ipv4/tcp.c        |   89 +++++++++++++++++++++++++++++++++++++++++++++++--
>>  net/ipv4/tcp_output.c |    1 +
>>  2 files changed, 87 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>> index e38d6f2..47e2f49 100644
>> --- a/net/ipv4/tcp.c
>> +++ b/net/ipv4/tcp.c
>> @@ -912,6 +912,39 @@ static inline int select_size(const struct sock *sk, bool sg)
>>  	return tmp;
>>  }
>>  
>> +static int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size)
>> +{
>> +	struct sk_buff *skb;
>> +	struct tcp_skb_cb *cb;
>> +	struct tcphdr *th;
>> +
>> +	skb = alloc_skb(size + sizeof(*th), sk->sk_allocation);
> 
> I am not sure any check is performed on 'size' ?

No, no checks here.

> A caller might trigger OOM or wrap bug.

Well, yes, but this ability is given to CAP_NET_ADMIN users only.
Do you think it's nonetheless worth accounting this allocation into
the socket's rmem?

Thanks,
Pavel


* Re: [PATCH 4/6] tcp: Repair socket queues
  2012-05-03  8:59     ` Pavel Emelyanov
@ 2012-05-03  9:08       ` Eric Dumazet
  2012-05-03  9:15         ` Pavel Emelyanov
  2012-05-03  9:31       ` David Miller
  1 sibling, 1 reply; 13+ messages in thread
From: Eric Dumazet @ 2012-05-03  9:08 UTC (permalink / raw)
  To: Pavel Emelyanov; +Cc: Linux Netdev List, David Miller

On Thu, 2012-05-03 at 12:59 +0400, Pavel Emelyanov wrote:
> On 05/02/2012 03:11 PM, Eric Dumazet wrote:

> > I am not sure any check is performed on 'size' ?
> 
> No, no checks here.
> 
> > A caller might trigger OOM or wrap bug.
> 
> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
> Do you think it's nonetheless worth accounting this allocation into
> the socket's rmem?

Yes, something must be done...

Might be a good reason to un-inline tcp_try_rmem_schedule(), this fat
thing...


* Re: [PATCH 4/6] tcp: Repair socket queues
  2012-05-03  9:08       ` Eric Dumazet
@ 2012-05-03  9:15         ` Pavel Emelyanov
  0 siblings, 0 replies; 13+ messages in thread
From: Pavel Emelyanov @ 2012-05-03  9:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, David Miller

On 05/03/2012 01:08 PM, Eric Dumazet wrote:
> On Thu, 2012-05-03 at 12:59 +0400, Pavel Emelyanov wrote:
>> On 05/02/2012 03:11 PM, Eric Dumazet wrote:
> 
>>> I am not sure any check is performed on 'size' ?
>>
>> No, no checks here.
>>
>>> A caller might trigger OOM or wrap bug.
>>
>> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
>> Do you think it's nonetheless worth accounting this allocation into
>> the socket's rmem?
> 
> Yes, something must be done...
> 
> Might be a good reason to un-inline tcp_try_rmem_schedule(), this fat
> thing...

OK, will try to look at it.

Thanks,
Pavel


* Re: [PATCH 4/6] tcp: Repair socket queues
  2012-05-03  8:59     ` Pavel Emelyanov
  2012-05-03  9:08       ` Eric Dumazet
@ 2012-05-03  9:31       ` David Miller
  1 sibling, 0 replies; 13+ messages in thread
From: David Miller @ 2012-05-03  9:31 UTC (permalink / raw)
  To: xemul; +Cc: eric.dumazet, netdev

From: Pavel Emelyanov <xemul@parallels.com>
Date: Thu, 03 May 2012 12:59:16 +0400

> Well, yes, but this ability is given to CAP_NET_ADMIN users only.
> Do you think it's nonetheless worth accounting this allocation into
> the socket's rmem?

Often such too-large lengths can be a bug in the application, so it is
better to catch it than to let it silently succeed.

Also, restricting an operation to "privileged" entities does not mean
we should forego resource utilization checks.

