linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] ipvs: avoid drop first packet to reuse conntrack
@ 2020-06-11  9:28 YangYuxi
  2020-06-11 18:27 ` Julian Anastasov
  0 siblings, 1 reply; 4+ messages in thread
From: YangYuxi @ 2020-06-11  9:28 UTC (permalink / raw)
  To: wensong, horms, ja, pablo, kadlec, fw, davem, kuba
  Cc: netdev, lvs-devel, netfilter-devel, coreteam, linux-kernel, yx.atom1

Since commit f719e3754ee ("ipvs: drop first packet
to redirect conntrack"), when a new TCP connection
meet the conditions that need reschedule, the first
syn packet is dropped, this cause one second latency
for the new connection, more discussion about this
problem can easy seach from google, such as:

1)One second connection delay in masque
https://marc.info/?t=151683118100004&r=1&w=2

2)IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

3)Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

4)Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

The root cause is when the old session is expired, the
conntrack related to the session is dropped by
ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
...

                if ((cp->flags & IP_VS_CONN_F_NFCT) &&
                    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
                        /* Do not access conntracks during subsys cleanup
                         * because nf_conntrack_find_get can not be used after
                         * conntrack cleanup for the net.
                         */
                        smp_rmb();
                        if (ipvs->enable)
                                ip_vs_conn_drop_conntrack(cp);
                }
...
}
```
As the code show, only if the condition  (cp->flags & IP_VS_CONN_F_NFCT)
is true, ip_vs_conn_drop_conntrack will be called.
So we solve this bug by following steps:
1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will
   use the old session)
2) call ip_vs_conn_expire_now to release the old session, then the related
   conntrack will not be dropped
3) then ipvs unnecessary to drop the first syn packet,
   it just continue to pass the syn packet to the next process,
   create a new ipvs session, and the new session will related to
   the old conntrack(which is reopened by conntrack as a new one),
   the next whole things is just as normal as that the old session
   isn't used to exist.

This patch has been verified on our thousands of kubernets node servers on Tencent Inc.
Signed-off-by: YangYuxi <yx.atom1@gmail.com>
---
 net/netfilter/ipvs/ip_vs_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..2f750145172f 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -2086,11 +2086,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
 		}
 
 		if (resched) {
+			if (uses_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 			if (!atomic_read(&cp->n_control))
 				ip_vs_conn_expire_now(cp);
 			__ip_vs_conn_put(cp);
-			if (uses_ct)
-				return NF_DROP;
 			cp = NULL;
 		}
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] ipvs: avoid drop first packet to reuse conntrack
  2020-06-11  9:28 [PATCH] ipvs: avoid drop first packet to reuse conntrack YangYuxi
@ 2020-06-11 18:27 ` Julian Anastasov
  0 siblings, 0 replies; 4+ messages in thread
From: Julian Anastasov @ 2020-06-11 18:27 UTC (permalink / raw)
  To: YangYuxi
  Cc: wensong, horms, pablo, kadlec, fw, davem, kuba, netdev,
	lvs-devel, netfilter-devel, coreteam, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5155 bytes --]


	Hello,

On Thu, 11 Jun 2020, YangYuxi wrote:

> Since commit f719e3754ee ("ipvs: drop first packet
> to redirect conntrack"), when a new TCP connection
> meet the conditions that need reschedule, the first
> syn packet is dropped, this cause one second latency
> for the new connection, more discussion about this
> problem can easy seach from google, such as:
> 
> 1)One second connection delay in masque
> https://marc.info/?t=151683118100004&r=1&w=2
> 
> 2)IPVS low throughput #70747
> https://github.com/kubernetes/kubernetes/issues/70747
> 
> 3)Apache Bench can fill up ipvs service proxy in seconds #544
> https://github.com/cloudnativelabs/kube-router/issues/544
> 
> 4)Additional 1s latency in `host -> service IP -> pod`
> https://github.com/kubernetes/kubernetes/issues/90854

	Such delays occur only on collision, say some
client IP creates many connections that lead to
reusing same client port...

> The root cause is when the old session is expired, the
> conntrack related to the session is dropped by
> ip_vs_conn_drop_conntrack. The code is as follows:
> ```
> static void ip_vs_conn_expire(struct timer_list *t)
> {
> ...
> 
>                 if ((cp->flags & IP_VS_CONN_F_NFCT) &&
>                     !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
>                         /* Do not access conntracks during subsys cleanup
>                          * because nf_conntrack_find_get can not be used after
>                          * conntrack cleanup for the net.
>                          */
>                         smp_rmb();
>                         if (ipvs->enable)
>                                 ip_vs_conn_drop_conntrack(cp);
>                 }
> ...
> }
> ```
> As the code show, only if the condition  (cp->flags & IP_VS_CONN_F_NFCT)
> is true, ip_vs_conn_drop_conntrack will be called.
> So we solve this bug by following steps:

	Not exactly a bug, we do the delay intentionally.

> 1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will
>    use the old session)
> 2) call ip_vs_conn_expire_now to release the old session, then the related
>    conntrack will not be dropped

	The IPVS connection table allows the newly created
connection to have priority when next packets lookup for
connection. That is why we delay only when conntracks are
used. When they are not used, we can create IPVS connection
to different real server by creating collision in original
direction in the IPVS table. When reply packet is received
it will find its connection.

	IPVS does not create duplicate conntracks. When
packet is received it will hit existing conntrack or
new conntrack will be created. This is what happens in
Netfilter in original direction. Note that active FTP
can create connection also for packet from real server.

	IPVS simply alters the reply tuple while processing
this first packet and only when conntrack is not confirmed
yet, because that is the only possible time to insert the
both tuples into netfilter hash table:

CIP->VIP(orig),VIP->CIP(reply) becomes CIP->VIP,RIP->CIP
CIP: Client IP, VIP: Virtual IP, RIP: Real Server IP

	After the new reply tuple is determined, it can not
be changed after the first packet. That is why we have to
drop the old conntrack. Then, reply direction will match
packets from the correct real server.

> 3) then ipvs unnecessary to drop the first syn packet,
>    it just continue to pass the syn packet to the next process,
>    create a new ipvs session, and the new session will related to
>    the old conntrack(which is reopened by conntrack as a new one),
>    the next whole things is just as normal as that the old session
>    isn't used to exist.

	If we leave the old conntrack, say with reply
tuple RIP1->CIP but the IPVS scheduling selects different
RIP2 for the new IPVS connection which uses the old
conntrack due to the equal original tuple, we create
inconsistency.

	We can not be sure if admins use connmarks,
state matching (-m state), passive FTP (which requires
conntracks in IPVS), so we can not just go and use
the old conntrack which points to wrong real server.

	How exactly are relayed the reply packets from
real servers in your tests? Do you have NF SNAT rules to
translate the source addresses? Also, it will be good
to know what is the effect on the passive FTP.

> This patch has been verified on our thousands of kubernets node servers on Tencent Inc.
> Signed-off-by: YangYuxi <yx.atom1@gmail.com>
> ---
>  net/netfilter/ipvs/ip_vs_core.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
> index aa6a603a2425..2f750145172f 100644
> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -2086,11 +2086,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
>  		}
>  
>  		if (resched) {
> +			if (uses_ct)
> +				cp->flags &= ~IP_VS_CONN_F_NFCT;
>  			if (!atomic_read(&cp->n_control))
>  				ip_vs_conn_expire_now(cp);
>  			__ip_vs_conn_put(cp);
> -			if (uses_ct)
> -				return NF_DROP;
>  			cp = NULL;
>  		}
>  	}
> -- 

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] ipvs: avoid drop first packet to reuse conntrack
@ 2020-06-12 13:59 YangYuxi
  0 siblings, 0 replies; 4+ messages in thread
From: YangYuxi @ 2020-06-12 13:59 UTC (permalink / raw)
  To: wensong, horms, ja, pablo, kadlec, fw, davem, kuba
  Cc: netdev, lvs-devel, netfilter-devel, coreteam, linux-kernel, yx.atom1

Since 'commit f719e3754ee2 ("ipvs: drop first packet to
redirect conntrack")', when a new TCP connection meet
the conditions that need reschedule, the first syn packet
is dropped, this cause one second latency for the new
connection, more discussion about this problem can easy
search from google, such as:

1)One second connection delay in masque
https://marc.info/?t=151683118100004&r=1&w=2

2)IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

3)Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

4)Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

The root cause is when the old session is expired, the
conntrack related to the session is dropped by
ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
...

                if ((cp->flags & IP_VS_CONN_F_NFCT) &&
                    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
                        /* Do not access conntracks during subsys cleanup
                         * because nf_conntrack_find_get can not be used after
                         * conntrack cleanup for the net.
                         */
                        smp_rmb();
                        if (ipvs->enable)
                                ip_vs_conn_drop_conntrack(cp);
                }
...
}
```
As the code show, only if the condition  (cp->flags & IP_VS_CONN_F_NFCT)
is true, ip_vs_conn_drop_conntrack will be called.
So we optimize this by following steps:
1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will
   use the old session)
2) call ip_vs_conn_expire_now to release the old session, then the related
   conntrack will not be dropped
3) then ipvs unnecessary to drop the first syn packet,
   it just continue to pass the syn packet to the next process,
   create a new ipvs session, and the new session will related to
   the old conntrack(which is reopened by conntrack as a new one),
   the next whole things is just as normal as that the old session
   isn't used to exist.

The above scenario has no problems except passive FTP and
connmarks (state matching (-m state)). So, ipvs can give
users the right to choose, when FTP or connmarks is not used,
they can choose a high performance one by set net.ipv4.vs.conn_reuse_old_conntrack=1,
this is necessary because most scenarios, such as kubernetes,
do not have FTP and connmarks scenarios, but are very sensitive
to TCP short link performance.

This patch has been verified on our thousands of kubernets node servers on Tencent Inc.
Signed-off-by: YangYuxi <yx.atom1@gmail.com>
---
 include/net/ip_vs.h             | 11 +++++++++++
 net/netfilter/ipvs/ip_vs_core.c | 10 ++++++++--
 net/netfilter/ipvs/ip_vs_ctl.c  |  2 ++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 83be2d93b407..052fa87d2a44 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -928,6 +928,7 @@ struct netns_ipvs {
 	int			sysctl_pmtu_disc;
 	int			sysctl_backup_only;
 	int			sysctl_conn_reuse_mode;
+	int			sysctl_conn_reuse_old_conntrack;
 	int			sysctl_schedule_icmp;
 	int			sysctl_ignore_tunneled;
 
@@ -1049,6 +1050,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return ipvs->sysctl_conn_reuse_mode;
 }
 
+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return ipvs->sysctl_conn_reuse_old_conntrack;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return ipvs->sysctl_schedule_icmp;
@@ -1136,6 +1142,11 @@ static inline int sysctl_conn_reuse_mode(struct netns_ipvs *ipvs)
 	return 1;
 }
 
+static inline int sysctl_conn_reuse_old_conntrack(struct netns_ipvs *ipvs)
+{
+	return 1;
+}
+
 static inline int sysctl_schedule_icmp(struct netns_ipvs *ipvs)
 {
 	return 0;
diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..0b89c872ea46 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -2066,7 +2066,7 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
 
 	conn_reuse_mode = sysctl_conn_reuse_mode(ipvs);
 	if (conn_reuse_mode && !iph.fragoffs && is_new_conn(skb, &iph) && cp) {
-		bool uses_ct = false, resched = false;
+		bool uses_ct = false, resched = false, drop = false;
 
 		if (unlikely(sysctl_expire_nodest_conn(ipvs)) && cp->dest &&
 		    unlikely(!atomic_read(&cp->dest->weight))) {
@@ -2086,10 +2086,16 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
 		}
 
 		if (resched) {
+			if (uses_ct) {
+				if (likely(sysctl_conn_reuse_old_conntrack(ipvs)))
+					cp->flags &= ~IP_VS_CONN_F_NFCT;
+				else
+					drop = true;
+			}
 			if (!atomic_read(&cp->n_control))
 				ip_vs_conn_expire_now(cp);
 			__ip_vs_conn_put(cp);
-			if (uses_ct)
+			if (drop)
 				return NF_DROP;
 			cp = NULL;
 		}
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 412656c34f20..eeb87994c21f 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4049,7 +4049,9 @@ static int __net_init ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
 	tbl[idx++].data = &ipvs->sysctl_pmtu_disc;
 	tbl[idx++].data = &ipvs->sysctl_backup_only;
 	ipvs->sysctl_conn_reuse_mode = 1;
+	ipvs->sysctl_conn_reuse_old_conntrack = 1;
 	tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
+	tbl[idx++].data = &ipvs->sysctl_conn_reuse_old_conntrack;
 	tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
 	tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH] ipvs: avoid drop first packet to reuse conntrack
@ 2020-06-11 10:01 YangYuxi
  0 siblings, 0 replies; 4+ messages in thread
From: YangYuxi @ 2020-06-11 10:01 UTC (permalink / raw)
  To: wensong, horms, ja, pablo, kadlec, fw, davem, kuba
  Cc: netdev, lvs-devel, netfilter-devel, coreteam, linux-kernel, yx.atom1

Since 'commit f719e3754ee2 ("ipvs: drop first packet to
redirect conntrack")', when a new TCP connection meet
the conditions that need reschedule, the first syn packet
is dropped, this cause one second latency for the new
connection, more discussion about this problem can easy
search from google, such as:

1)One second connection delay in masque
https://marc.info/?t=151683118100004&r=1&w=2

2)IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

3)Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

4)Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

The root cause is when the old session is expired, the
conntrack related to the session is dropped by
ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
...

                if ((cp->flags & IP_VS_CONN_F_NFCT) &&
                    !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
                        /* Do not access conntracks during subsys cleanup
                         * because nf_conntrack_find_get can not be used after
                         * conntrack cleanup for the net.
                         */
                        smp_rmb();
                        if (ipvs->enable)
                                ip_vs_conn_drop_conntrack(cp);
                }
...
}
```
As the code show, only if the condition  (cp->flags & IP_VS_CONN_F_NFCT)
is true, ip_vs_conn_drop_conntrack will be called.
So we solve this bug by following steps:
1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will
   use the old session)
2) call ip_vs_conn_expire_now to release the old session, then the related
   conntrack will not be dropped
3) then ipvs unnecessary to drop the first syn packet,
   it just continue to pass the syn packet to the next process,
   create a new ipvs session, and the new session will related to
   the old conntrack(which is reopened by conntrack as a new one),
   the next whole things is just as normal as that the old session
   isn't used to exist.

This patch has been verified on our thousands of kubernets node servers on Tencent Inc.
Signed-off-by: YangYuxi <yx.atom1@gmail.com>
---
 net/netfilter/ipvs/ip_vs_core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_core.c b/net/netfilter/ipvs/ip_vs_core.c
index aa6a603a2425..2f750145172f 100644
--- a/net/netfilter/ipvs/ip_vs_core.c
+++ b/net/netfilter/ipvs/ip_vs_core.c
@@ -2086,11 +2086,11 @@ static int ip_vs_in_icmp_v6(struct netns_ipvs *ipvs, struct sk_buff *skb,
 		}
 
 		if (resched) {
+			if (uses_ct)
+				cp->flags &= ~IP_VS_CONN_F_NFCT;
 			if (!atomic_read(&cp->n_control))
 				ip_vs_conn_expire_now(cp);
 			__ip_vs_conn_put(cp);
-			if (uses_ct)
-				return NF_DROP;
 			cp = NULL;
 		}
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2020-06-12 14:00 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-06-11  9:28 [PATCH] ipvs: avoid drop first packet to reuse conntrack YangYuxi
2020-06-11 18:27 ` Julian Anastasov
2020-06-11 10:01 YangYuxi
2020-06-12 13:59 YangYuxi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).