* [RFC] Routing loop handling in IP VS...
@ 2016-07-28  9:23 Dwip N. Banerjee
  2016-07-28 20:21 ` Julian Anastasov
  0 siblings, 1 reply; 4+ messages in thread
From: Dwip N. Banerjee @ 2016-07-28  9:23 UTC (permalink / raw)
  To: lvs-devel

Problem:

A problem has been identified in a cluster environment using IPVS with
Direct Routing, where multiple appliances can end up in the "active
forwarder/distributor" state simultaneously. As an "active distributor",
an appliance balances the workload by forwarding packets to the group
members.
Because "active distributors" also consider each other group members
available to receive forwarded packets (i.e. the load balancers also
front as real servers and run in HA mode with active/backup roles), the
distributors may forward the same packet to each other, forming a
routing loop.

While the immediate trigger in the scenario above is CPU starvation
caused by lock contention, leading to an active/active situation (i.e.
two instances both acting as "active" virtual servers), similar routing
loops in an ip_vs installation are possible through other means as well
(e.g. http://marc.info/?l=linux-virtual-server&m=136008320907330&w=2).
 
As it stands now, there is no mitigation/damping mechanism available in
ip_vs to limit the impact of the routing loop described above. When the
scenario occurs, it leads to starvation and requires administrative
network action on the cluster controller to terminate the routing loop
and recover.

Although the situation described above was observed with Virtual Server
via Direct Routing, it is just as applicable to Virtual Server via NAT
and via IP Tunneling.

ip_vs does not decrement the IP TTL as standard routers do, and as a
result has nothing to protect itself from re-forwarding the same packet
an unbounded number of times. Standard IP routers always decrement the
TTL as required by RFC 791, but ip_vs does not, even though it is acting
as a specialized kind of IP router.

In a scenario where two ip_vs instances are forwarding to each other
(which admittedly should not happen, but is not impossible, as
illustrated above), there is no way for the system to recover because
the routing loop persists: the two hosts forward the same packet back
and forth at full speed.

Test Case:
It is possible to configure two ip_vs instances to forward to each other
and starve the network. The starvation itself makes it impossible to
recover from this situation, since the communication channel is blocked
by the forwarding loop.

Proposed fix:
A sample fix for Linux v4.7, which decrements the TTL when forwarding,
is shown below for the Direct Routing transmitter.



============================================================================

diff -Naur linux_4.7/net/netfilter/ipvs/ip_vs_xmit.c linux_ipvs_patch/net/netfilter/ipvs/ip_vs_xmit.c
--- linux_4.7/net/netfilter/ipvs/ip_vs_xmit.c	2016-07-28 00:01:10.040974435 -0500
+++ linux_ipvs_patch/net/netfilter/ipvs/ip_vs_xmit.c	2016-07-28 00:01:42.900977155 -0500
@@ -1156,10 +1156,18 @@
 	      struct ip_vs_protocol *pp, struct ip_vs_iphdr *ipvsh)
 {
 	int local;
+	struct iphdr  *iph = ip_hdr(skb);
 
 	EnterFunction(10);
 
 	rcu_read_lock();
+	if (iph->ttl <= 1) {
+		/* Tell the sender its packet died... */
+		__IP_INC_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
+		icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
+		goto tx_error;
+	}
+
 	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
 				   IP_VS_RT_MODE_LOCAL |
 				   IP_VS_RT_MODE_NON_LOCAL |
@@ -1171,7 +1179,10 @@
 		return ip_vs_send_or_cont(NFPROTO_IPV4, skb, cp, 1);
 	}
 
-	ip_send_check(ip_hdr(skb));
+	/* Decrease ttl */
+	ip_decrease_ttl(iph);
+
+	ip_send_check(iph);
 
 	/* Another hack: avoid icmp_send in ip_fragment */
 	skb->ignore_df = 1;

==================================================================================

P.S. A similar fix may be made to the other modes too (NAT, IP
Tunneling, and the ICMP packet transmitter).




* Re: [RFC] Routing loop handling in IP VS...
  2016-07-28  9:23 [RFC] Routing loop handling in IP VS Dwip N. Banerjee
@ 2016-07-28 20:21 ` Julian Anastasov
  2016-07-29  1:39   ` Dwip N. Banerjee
  0 siblings, 1 reply; 4+ messages in thread
From: Julian Anastasov @ 2016-07-28 20:21 UTC (permalink / raw)
  To: Dwip N. Banerjee; +Cc: lvs-devel


	Hello,

On Thu, 28 Jul 2016, Dwip N. Banerjee wrote:

> Problem:
> 
> A problem has been identified in a cluster environment using IPVS with
> Direct Routing, where multiple appliances can end up in the "active
> forwarder/distributor" state simultaneously. As an "active distributor",
> an appliance balances the workload by forwarding packets to the group
> members.
> Because "active distributors" also consider each other group members
> available to receive forwarded packets (i.e. the load balancers also
> front as real servers and run in HA mode with active/backup roles), the
> distributors may forward the same packet to each other, forming a
> routing loop.
> 
> While the immediate trigger in the scenario above is CPU starvation
> caused by lock contention, leading to an active/active situation (i.e.
> two instances both acting as "active" virtual servers), similar routing
> loops in an ip_vs installation are possible through other means as well
> (e.g. http://marc.info/?l=linux-virtual-server&m=136008320907330&w=2).

	In some cases backup_only=1 can help, but not if the
modes do not change in time and both servers are set as
masters.

> As it stands now, there is no mitigation/damping mechanism available in
> ip_vs to limit the impact of the routing loop described above. When the
> scenario occurs, it leads to starvation and requires administrative
> network action on the cluster controller to terminate the routing loop
> and recover.
> 
> Although the situation described above was observed with Virtual Server
> via Direct Routing, it is just as applicable to Virtual Server via NAT
> and via IP Tunneling.
> 
> ip_vs does not decrement the IP TTL as standard routers do, and as a
> result has nothing to protect itself from re-forwarding the same packet
> an unbounded number of times. Standard IP routers always decrement the
> TTL as required by RFC 791, but ip_vs does not, even though it is acting
> as a specialized kind of IP router.
> 
> In a scenario where two ip_vs instances are forwarding to each other
> (which admittedly should not happen, but is not impossible, as
> illustrated above), there is no way for the system to recover because
> the routing loop persists: the two hosts forward the same packet back
> and forth at full speed.
> 
> Test Case:
> It is possible to configure two ip_vs instances to forward to each other
> and starve the network. The starvation itself makes it impossible to
> recover from this situation, since the communication channel is blocked
> by the forwarding loop.
> 
> Proposed fix:
> A sample fix for Linux v4.7, which decrements the TTL when forwarding,
> is shown below for the Direct Routing transmitter.
> 
> 
> 
> ============================================================================
> 
> diff -Naur linux_4.7/net/netfilter/ipvs/ip_vs_xmit.c linux_ipvs_patch/net/netfilter/ipvs/ip_vs_xmit.c
> --- linux_4.7/net/netfilter/ipvs/ip_vs_xmit.c	2016-07-28 00:01:10.040974435 -0500
> +++ linux_ipvs_patch/net/netfilter/ipvs/ip_vs_xmit.c	2016-07-28 00:01:42.900977155 -0500
> @@ -1156,10 +1156,18 @@
>  	      struct ip_vs_protocol *pp, struct ip_vs_iphdr *ipvsh)
>  {
>  	int local;
> +	struct iphdr  *iph = ip_hdr(skb);
>  
>  	EnterFunction(10);
>  
>  	rcu_read_lock();
> +	if (iph->ttl <= 1) {
> +		/* Tell the sender its packet died... */
> +		__IP_INC_STATS(dev_net(skb_dst(skb)->dev), IPSTATS_MIB_INHDRERRORS);
> +		icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
> +		goto tx_error;
> +	}
> +
>  	local = __ip_vs_get_out_rt(cp->ipvs, cp->af, skb, cp->dest, cp->daddr.ip,
>  				   IP_VS_RT_MODE_LOCAL |
>  				   IP_VS_RT_MODE_NON_LOCAL |
> @@ -1171,7 +1179,10 @@
>  		return ip_vs_send_or_cont(NFPROTO_IPV4, skb, cp, 1);
>  	}
>  
> -	ip_send_check(ip_hdr(skb));
> +	/* Decrease ttl */
> +	ip_decrease_ttl(iph);
> +
> +	ip_send_check(iph);

	OK, let's add the TTL decrease. We write the IP header anyway,
so I guess the CPU write-back caching will hide the extra write
operation.
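
	For reference, the decrement itself is cheap: ip_decrease_ttl()
patches the header checksum incrementally instead of forcing a full
recomputation. Roughly like this (paraphrased from memory of the helper
in include/net/ip.h, so treat it as a sketch rather than an exact copy):

static inline int ip_decrease_ttl(struct iphdr *iph)
{
	u32 check = (__force u32)iph->check;

	/* TTL sits in the high byte of its 16-bit word, so decrementing
	 * the TTL by one means adding 0x0100 to the stored one's-complement
	 * checksum, with the end-around carry folded back in.
	 */
	check += (__force u32)htons(0x0100);
	iph->check = (__force __sum16)(check + (check >= 0xFFFF));
	return --iph->ttl;
}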

	Such a change should also include:

- an IPv6 solution: hop_limit handling, as in ip6_forward

- DR, TUN, ip_vs_bypass_xmit* and the others that call the
	__ip_vs_get_out_rt* funcs; this includes ICMP packets.
	Even better, hide the ttl <= 1 check in
	__ip_vs_get_out_rt* after the 'if (local) ... return local;'
	and before the MTU checks. ensure_mtu_is_adequate is
	a good example. As a result, the ttl <= 1 check should
	apply only to the '!local' case (see the sketch after
	this list).

- No need for !ip_vs_iph_icmp(ipvsh) checks as done in
	ensure_mtu_is_adequate; icmp_send is smart enough
	to avoid sending an ICMP error in response to an ICMP error.

- a skb_make_writable guard as done in ip_vs_nat_xmit, to ensure
	our change does not propagate to cloned packets,
	e.g. causing tcpdump to see the decreased TTL.
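
	Something along these lines, as an untested sketch (the helper
name and exact arguments are only illustrative; the callers would jump
to their error path when it returns false):

/* For ip_vs_xmit.c, which already includes the needed headers.
 * Called from __ip_vs_get_out_rt*() after the "if (local) ... return
 * local;" block and before the MTU checks, i.e. only for packets that
 * are really forwarded (!local).  Sketch only, not a tested patch.
 */
static inline bool decrement_ttl(struct netns_ipvs *ipvs, int skb_af,
				 struct sk_buff *skb)
{
	struct net *net = ipvs->net;

#ifdef CONFIG_IP_VS_IPV6
	if (skb_af == AF_INET6) {
		if (ipv6_hdr(skb)->hop_limit <= 1) {
			/* Force output device to be used as source address */
			skb->dev = skb_dst(skb)->dev;
			icmpv6_send(skb, ICMPV6_TIME_EXCEED,
				    ICMPV6_EXC_HOPLIMIT, 0);
			__IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
					IPSTATS_MIB_INHDRERRORS);
			return false;
		}

		/* don't propagate the hop_limit change to cloned packets */
		if (!skb_make_writable(skb, sizeof(struct ipv6hdr)))
			return false;

		ipv6_hdr(skb)->hop_limit--;
	} else
#endif
	{
		if (ip_hdr(skb)->ttl <= 1) {
			/* Tell the sender its packet died... */
			__IP_INC_STATS(net, IPSTATS_MIB_INHDRERRORS);
			icmp_send(skb, ICMP_TIME_EXCEEDED, ICMP_EXC_TTL, 0);
			return false;
		}

		/* don't propagate the ttl change to cloned packets */
		if (!skb_make_writable(skb, sizeof(struct iphdr)))
			return false;

		/* Decrease ttl; this also fixes up the checksum */
		ip_decrease_ttl(ip_hdr(skb));
	}

	return true;
}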

>  	/* Another hack: avoid icmp_send in ip_fragment */
>  	skb->ignore_df = 1;
> 
> ==================================================================================
> 
> P.S. A similar fix may be made to the other modes too (NAT, IP
> Tunneling, and the ICMP packet transmitter).

	Yep. Let me know if you prefer to play and provide
a complete patch.

Regards

--
Julian Anastasov <ja@ssi.bg>


* Re: [RFC] Routing loop handling in IP VS...
  2016-07-28 20:21 ` Julian Anastasov
@ 2016-07-29  1:39   ` Dwip N. Banerjee
  2016-07-29  7:43     ` Julian Anastasov
  0 siblings, 1 reply; 4+ messages in thread
From: Dwip N. Banerjee @ 2016-07-29  1:39 UTC (permalink / raw)
  To: Julian Anastasov; +Cc: lvs-devel

Thank you for the prompt and detailed response... much appreciated!

Yes, I can provide a more comprehensive patch - it may take a 
little time, but I will send it out as soon as I can.

Thanks
Dwip Banerjee

On Thu, 2016-07-28 at 23:21 +0300, Julian Anastasov wrote:
> 	Hello,
> 
> [...]
> 
> 	Yep. Let me know if you prefer to play and provide
> a complete patch.
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>




* Re: [RFC] Routing loop handling in IP VS...
  2016-07-29  1:39   ` Dwip N. Banerjee
@ 2016-07-29  7:43     ` Julian Anastasov
  0 siblings, 0 replies; 4+ messages in thread
From: Julian Anastasov @ 2016-07-29  7:43 UTC (permalink / raw)
  To: Dwip N. Banerjee; +Cc: lvs-devel


	Hello,

On Thu, 28 Jul 2016, Dwip N. Banerjee wrote:

> Thank you for the prompt and detailed response... much appreciated!
> 
> Yes, I can provide a more comprehensive patch - it may take a 
> little time, but I will send it out as soon as I can.

	Thanks! It is not something urgent, so no problem.

Regards

--
Julian Anastasov <ja@ssi.bg>

