netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop
@ 2022-07-29 10:35 Alexander Mikhalitsyn
  2022-07-29 10:35 ` [PATCH 1/2] neigh: " Alexander Mikhalitsyn
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-07-29 10:35 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, linux-kernel, Denis V . Lunev,
	Alexey Kuznetsov, Konstantin Khorenko, Pavel Tikhomirov,
	Andrey Zhadchenko, Alexander Mikhalitsyn, kernel

Dear friends,

Recently one of OpenVZ users reported that they have issues with network
availability of some containers. It was discovered that the reason is absence
of ARP replies from the Host Node on the requests about container IPs.

Of course, we started from tcpdump analysis and noticed that ARP requests
successfuly comes to the problematic node external interface. So, something
was wrong from the kernel side.

I've played a lot with arping and perf in attempts to understand what's
happening. And the key observation was that we experiencing issues only
with ARP requests with broadcast source ip (skb->pkt_type == PACKET_BROADCAST).
But for packets skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
				if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
				    skb->pkt_type == PACKET_HOST ||
				    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
				} else {
					pneigh_enqueue(&arp_tbl,                     // reply with delay
						       in_dev->arp_parms, skb);
					goto out_free_dst;
				}

The problem was that for PACKET_BROADCAST packets we delaying replies and use pneigh_enqueue() function.
For some reason, queued packets were lost almost all the time! The reason for such behaviour is pneigh_queue_purge()
function which cleanups all the queue, and this function called everytime once some network device in the system
gets link down.

neigh_ifdown -> pneigh_queue_purge

Now imagine that we have a node with 500+ containers with microservices. And some of that microservices are buggy
and always restarting... in this case, pneigh_queue_purge function will be called very frequently.

This problem is reproducible only with so-called "host routed" setup. The classical scheme bridge + veth
is not affected.

Minimal reproducer

Suppose that we have a network 172.29.1.1/16 brd 172.29.255.255
and we have free-to-use IP, let it be 172.29.128.3

1. Network configuration. I showing the minimal configuration, it makes no sense
as we have both veth devices stay at the same net namespace, but for demonstation and simplicity sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totaly block ARP replies for 172.29.128.3 address. Just try
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's cleanup queue partially. Remove only skb's that related to the net namespace
of the adapter which link is down.

2. Let's account proxy_queue limit properly per-device. Current limitation looks
not fully correct because we comparing per-device configurable limit with the
"global" qlen of proxy_queue.

Thanks,
Alex

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Denis V. Lunev <den@openvz.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: kernel@openvz.org

Alexander Mikhalitsyn (1):
  neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
  neigh: fix possible DoS due to net iface start/stop loop

 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 43 +++++++++++++++++++++++++++++++++--------
 2 files changed, 36 insertions(+), 8 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-07-29 10:35 [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
@ 2022-07-29 10:35 ` Alexander Mikhalitsyn
  2022-08-01 15:08   ` David Ahern
  2022-07-29 10:35 ` [PATCH 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-07-29 10:35 UTC (permalink / raw)
  To: netdev
  Cc: Denis V. Lunev, Alexander Mikhalitsyn, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Daniel Borkmann,
	David Ahern, Yajun Deng, Roopa Prabhu, linux-kernel,
	Alexey Kuznetsov, Konstantin Khorenko, kernel

From: "Denis V. Lunev" <den@openvz.org>

Normal processing of ARP request (usually this is Ethernet broadcast
packet) coming to the host is looking like the following:
* the packet comes to arp_process() call and is passed through routing
  procedure
* the request is put into the queue using pneigh_enqueue() if
  corresponding ARP record is not local (common case for container
  records on the host)
* the request is processed by timer (within 80 jiffies by default) and
  ARP reply is sent from the same arp_process() using
  NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
  pneigh_enqueue())

And here the problem comes. Linux kernel calls pneigh_queue_purge()
which destroys the whole queue of ARP requests on ANY network interface
start/stop event through __neigh_ifdown().

This is actually not a problem within the original world as network
interface start/stop was accessible to the host 'root' only, which
could do more destructive things. But the world is changed and there
are Linux containers available. Here container 'root' has an access
to this API and could be considered as untrusted user in the hosting
(container's) world.

Thus there is an attack vector to other containers on node when
container's root will endlessly start/stop interfaces. We have observed
similar situation on a real production node when docker container was
doing such activity and thus other containers on the node become not
accessible.

The patch proposed doing very simple thing. It drops only packets from
the same namespace in the pneigh_queue_purge() where network interface
state change is detected. This is enough to prevent the problem for the
whole node preserving original semantics of the code.

Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 net/core/neighbour.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 54625287ee5b..213ec0be800b 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,24 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+
+		if (!net || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
 
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] neighbour: make proxy_queue.qlen limit per-device
  2022-07-29 10:35 [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
  2022-07-29 10:35 ` [PATCH 1/2] neigh: " Alexander Mikhalitsyn
@ 2022-07-29 10:35 ` Alexander Mikhalitsyn
  2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
  2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
  3 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-07-29 10:35 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, linux-kernel, Alexey Kuznetsov,
	Konstantin Khorenko, kernel, Denis V . Lunev

Right now we have a neigh_param PROXY_QLEN which specifies maximum length
of neigh_table->proxy_queue. But in fact, this limitation doesn't work well
because check condition looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)

The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.

It seems reasonable to make proxy_queue limit per-device based.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
---
 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 25 ++++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 87419f7f5421..bc3fbec70d10 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -82,6 +82,7 @@ struct neigh_parms {
 	struct rcu_head rcu_head;
 
 	int	reachable_time;
+	int	qlen;
 	int	data[NEIGH_VAR_DATA_MAX];
 	DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
 };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 213ec0be800b..d98923e3ff02 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -316,9 +316,19 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 	skb = skb_peek(list);
 	while (skb != NULL) {
 		struct sk_buff *skb_next = skb_peek_next(skb, list);
-		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+		struct net_device *dev = skb->dev;
+
+		if (!net || net_eq(dev_net(dev), net)) {
+			struct in_device *in_dev;
+
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, list);
-			dev_put(skb->dev);
+
+			dev_put(dev);
 			kfree_skb(skb);
 		}
 		skb = skb_next;
@@ -1605,8 +1615,15 @@ static void neigh_proxy_process(struct timer_list *t)
 
 		if (tdif <= 0) {
 			struct net_device *dev = skb->dev;
+			struct in_device *in_dev;
 
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, &tbl->proxy_queue);
+
 			if (tbl->proxy_redo && netif_running(dev)) {
 				rcu_read_lock();
 				tbl->proxy_redo(skb);
@@ -1631,7 +1648,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	unsigned long sched_next = jiffies +
 			prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));
 
-	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
+	if (p->qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
 		return;
 	}
@@ -1647,6 +1664,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	skb_dst_drop(skb);
 	dev_hold(skb->dev);
 	__skb_queue_tail(&tbl->proxy_queue, skb);
+	p->qlen++;
 	mod_timer(&tbl->proxy_timer, sched_next);
 	spin_unlock(&tbl->proxy_queue.lock);
 }
@@ -1679,6 +1697,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 		refcount_set(&p->refcnt, 1);
 		p->reachable_time =
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
+		p->qlen = 0;
 		dev_hold_track(dev, &p->dev_tracker, GFP_KERNEL);
 		p->dev = dev;
 		write_pnet(&p->net, net);
@@ -1744,6 +1763,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 	refcount_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
+	tbl->parms.qlen = 0;
 
 	tbl->stats = alloc_percpu(struct neigh_statistics);
 	if (!tbl->stats)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-07-29 10:35 ` [PATCH 1/2] neigh: " Alexander Mikhalitsyn
@ 2022-08-01 15:08   ` David Ahern
  2022-08-01 15:44     ` Denis V. Lunev
  0 siblings, 1 reply; 18+ messages in thread
From: David Ahern @ 2022-08-01 15:08 UTC (permalink / raw)
  To: Alexander Mikhalitsyn, netdev
  Cc: Denis V. Lunev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Borkmann, Yajun Deng, Roopa Prabhu,
	linux-kernel, Alexey Kuznetsov, Konstantin Khorenko, kernel

On 7/29/22 4:35 AM, Alexander Mikhalitsyn wrote:
> The patch proposed doing very simple thing. It drops only packets from

it does 2 things - adds a namespace check and a performance based change
with the way the list is walked.

> the same namespace in the pneigh_queue_purge() where network interface
> state change is detected. This is enough to prevent the problem for the
> whole node preserving original semantics of the code.
> 


> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 54625287ee5b..213ec0be800b 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c

> @@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>  	neigh_flush_dev(tbl, dev, skip_perm);
>  	pneigh_ifdown_and_unlock(tbl, dev);
>  
> -	del_timer_sync(&tbl->proxy_timer);

why are you removing this line too?

> -	pneigh_queue_purge(&tbl->proxy_queue);
> +	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
>  	return 0;
>  }
>  

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-08-01 15:08   ` David Ahern
@ 2022-08-01 15:44     ` Denis V. Lunev
  0 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2022-08-01 15:44 UTC (permalink / raw)
  To: David Ahern, Alexander Mikhalitsyn, netdev
  Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Daniel Borkmann, Yajun Deng, Roopa Prabhu, linux-kernel,
	Alexey Kuznetsov, Konstantin Khorenko, kernel

On 01.08.2022 17:08, David Ahern wrote:
> On 7/29/22 4:35 AM, Alexander Mikhalitsyn wrote:
>> The patch proposed doing very simple thing. It drops only packets from
> it does 2 things - adds a namespace check and a performance based change
> with the way the list is walked.
I believe no. All items are removed before the patch and
all items are walked after the patch, similar O(n) operation.


>> the same namespace in the pneigh_queue_purge() where network interface
>> state change is detected. This is enough to prevent the problem for the
>> whole node preserving original semantics of the code.
>>
>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 54625287ee5b..213ec0be800b 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>>   	neigh_flush_dev(tbl, dev, skip_perm);
>>   	pneigh_ifdown_and_unlock(tbl, dev);
>>   
>> -	del_timer_sync(&tbl->proxy_timer);
> why are you removing this line too?

Because we have packets in the queue after the purge and they have to be 
processed.
May be we should better do

if (skb_queue_empty(&tbl->proxy_queue))
    del_timer_sync(&tbl->proxy_timer);

after the purge. That could be more correct.

>> -	pneigh_queue_purge(&tbl->proxy_queue);
>> +	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
>>   	return 0;
>>   }
>>   


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-07-29 10:35 [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
  2022-07-29 10:35 ` [PATCH 1/2] neigh: " Alexander Mikhalitsyn
  2022-07-29 10:35 ` [PATCH 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
@ 2022-08-10 16:08 ` Alexander Mikhalitsyn
  2022-08-10 16:08   ` [PATCH v2 1/2] neigh: " Alexander Mikhalitsyn
                     ` (2 more replies)
  2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
  3 siblings, 3 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Denis V . Lunev, Alexey Kuznetsov, Konstantin Khorenko,
	Pavel Tikhomirov, Andrey Zhadchenko, Alexander Mikhalitsyn,
	kernel, devel

Dear friends,

Recently one of OpenVZ users reported that they have issues with network
availability of some containers. It was discovered that the reason is absence
of ARP replies from the Host Node on the requests about container IPs.

Of course, we started from tcpdump analysis and noticed that ARP requests
successfuly comes to the problematic node external interface. So, something
was wrong from the kernel side.

I've played a lot with arping and perf in attempts to understand what's
happening. And the key observation was that we experiencing issues only
with ARP requests with broadcast source ip (skb->pkt_type == PACKET_BROADCAST).
But for packets skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
				if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
				    skb->pkt_type == PACKET_HOST ||
				    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
				} else {
					pneigh_enqueue(&arp_tbl,                     // reply with delay
						       in_dev->arp_parms, skb);
					goto out_free_dst;
				}

The problem was that for PACKET_BROADCAST packets we delaying replies and use pneigh_enqueue() function.
For some reason, queued packets were lost almost all the time! The reason for such behaviour is pneigh_queue_purge()
function which cleanups all the queue, and this function called everytime once some network device in the system
gets link down.

neigh_ifdown -> pneigh_queue_purge

Now imagine that we have a node with 500+ containers with microservices. And some of that microservices are buggy
and always restarting... in this case, pneigh_queue_purge function will be called very frequently.

This problem is reproducible only with so-called "host routed" setup. The classical scheme bridge + veth
is not affected.

Minimal reproducer

Suppose that we have a network 172.29.1.1/16 brd 172.29.255.255
and we have free-to-use IP, let it be 172.29.128.3

1. Network configuration. I showing the minimal configuration, it makes no sense
as we have both veth devices stay at the same net namespace, but for demonstation and simplicity sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totaly block ARP replies for 172.29.128.3 address. Just try
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's cleanup queue partially. Remove only skb's that related to the net namespace
of the adapter which link is down.

2. Let's account proxy_queue limit properly per-device. Current limitation looks
not fully correct because we comparing per-device configurable limit with the
"global" qlen of proxy_queue.

Thanks,
Alex

v2:
	- only ("neigh: fix possible DoS due to net iface start/stop") is changed
		do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Denis V. Lunev <den@openvz.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Alexander Mikhalitsyn (1):
  neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
  neigh: fix possible DoS due to net iface start/stop loop

 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 9 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
@ 2022-08-10 16:08   ` Alexander Mikhalitsyn
  2022-08-15  9:44     ` Christian Brauner
  2022-08-10 16:08   ` [PATCH v2 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
  2022-08-11 14:46   ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Jakub Kicinski
  2 siblings, 1 reply; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Denis V. Lunev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Borkmann, David Ahern, Yajun Deng,
	Roopa Prabhu, Christian Brauner, linux-kernel, Alexey Kuznetsov,
	Alexander Mikhalitsyn, Konstantin Khorenko, kernel, devel

From: "Denis V. Lunev" <den@openvz.org>

Normal processing of ARP request (usually this is Ethernet broadcast
packet) coming to the host is looking like the following:
* the packet comes to arp_process() call and is passed through routing
  procedure
* the request is put into the queue using pneigh_enqueue() if
  corresponding ARP record is not local (common case for container
  records on the host)
* the request is processed by timer (within 80 jiffies by default) and
  ARP reply is sent from the same arp_process() using
  NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
  pneigh_enqueue())

And here the problem comes. Linux kernel calls pneigh_queue_purge()
which destroys the whole queue of ARP requests on ANY network interface
start/stop event through __neigh_ifdown().

This is actually not a problem within the original world as network
interface start/stop was accessible to the host 'root' only, which
could do more destructive things. But the world is changed and there
are Linux containers available. Here container 'root' has an access
to this API and could be considered as untrusted user in the hosting
(container's) world.

Thus there is an attack vector to other containers on node when
container's root will endlessly start/stop interfaces. We have observed
similar situation on a real production node when docker container was
doing such activity and thus other containers on the node become not
accessible.

The patch proposed doing very simple thing. It drops only packets from
the same namespace in the pneigh_queue_purge() where network interface
state change is detected. This is enough to prevent the problem for the
whole node preserving original semantics of the code.

v2:
	- do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 net/core/neighbour.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 54625287ee5b..19d99d1eff53 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb != NULL) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
-
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	if (skb_queue_empty_lockless(&tbl->proxy_queue))
+		del_timer_sync(&tbl->proxy_timer);
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v2 2/2] neighbour: make proxy_queue.qlen limit per-device
  2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
  2022-08-10 16:08   ` [PATCH v2 1/2] neigh: " Alexander Mikhalitsyn
@ 2022-08-10 16:08   ` Alexander Mikhalitsyn
  2022-08-11 14:46   ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Jakub Kicinski
  2 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-10 16:08 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Alexey Kuznetsov, Konstantin Khorenko, kernel, devel,
	Denis V . Lunev

Right now we have a neigh_param PROXY_QLEN which specifies maximum length
of neigh_table->proxy_queue. But in fact, this limitation doesn't work well
because check condition looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)

The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.

It seems reasonable to make proxy_queue limit per-device based.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
---
 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 25 ++++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 87419f7f5421..bc3fbec70d10 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -82,6 +82,7 @@ struct neigh_parms {
 	struct rcu_head rcu_head;
 
 	int	reachable_time;
+	int	qlen;
 	int	data[NEIGH_VAR_DATA_MAX];
 	DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
 };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 19d99d1eff53..0469fafffd5d 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -316,9 +316,18 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 	skb = skb_peek(list);
 	while (skb != NULL) {
 		struct sk_buff *skb_next = skb_peek_next(skb, list);
-		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+		struct net_device *dev = skb->dev;
+		if (net == NULL || net_eq(dev_net(dev), net)) {
+			struct in_device *in_dev;
+
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, list);
-			dev_put(skb->dev);
+
+			dev_put(dev);
 			kfree_skb(skb);
 		}
 		skb = skb_next;
@@ -1606,8 +1615,15 @@ static void neigh_proxy_process(struct timer_list *t)
 
 		if (tdif <= 0) {
 			struct net_device *dev = skb->dev;
+			struct in_device *in_dev;
 
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, &tbl->proxy_queue);
+
 			if (tbl->proxy_redo && netif_running(dev)) {
 				rcu_read_lock();
 				tbl->proxy_redo(skb);
@@ -1632,7 +1648,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	unsigned long sched_next = jiffies +
 			prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));
 
-	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
+	if (p->qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
 		return;
 	}
@@ -1648,6 +1664,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	skb_dst_drop(skb);
 	dev_hold(skb->dev);
 	__skb_queue_tail(&tbl->proxy_queue, skb);
+	p->qlen++;
 	mod_timer(&tbl->proxy_timer, sched_next);
 	spin_unlock(&tbl->proxy_queue.lock);
 }
@@ -1680,6 +1697,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 		refcount_set(&p->refcnt, 1);
 		p->reachable_time =
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
+		p->qlen = 0;
 		dev_hold_track(dev, &p->dev_tracker, GFP_KERNEL);
 		p->dev = dev;
 		write_pnet(&p->net, net);
@@ -1745,6 +1763,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 	refcount_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
+	tbl->parms.qlen = 0;
 
 	tbl->stats = alloc_percpu(struct neigh_statistics);
 	if (!tbl->stats)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
  2022-08-10 16:08   ` [PATCH v2 1/2] neigh: " Alexander Mikhalitsyn
  2022-08-10 16:08   ` [PATCH v2 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
@ 2022-08-11 14:46   ` Jakub Kicinski
  2022-08-11 14:51     ` Alexander Mikhalitsyn
  2 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2022-08-11 14:46 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Daniel Borkmann, David Ahern, Yajun Deng, Roopa Prabhu,
	Christian Brauner, linux-kernel, Denis V . Lunev,
	Alexey Kuznetsov, Konstantin Khorenko, Pavel Tikhomirov,
	Andrey Zhadchenko, Alexander Mikhalitsyn, kernel, devel

On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
>  include/net/neighbour.h |  1 +
>  net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
>  2 files changed, 38 insertions(+), 9 deletions(-)

Which tree are these based on? They don't seem to apply cleanly

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-08-11 14:46   ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Jakub Kicinski
@ 2022-08-11 14:51     ` Alexander Mikhalitsyn
  2022-08-11 14:53       ` Jakub Kicinski
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-11 14:51 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Daniel Borkmann, David Ahern, Yajun Deng, Roopa Prabhu,
	Christian Brauner, linux-kernel, Denis V . Lunev,
	Alexey Kuznetsov, Konstantin Khorenko, Pavel Tikhomirov,
	Andrey Zhadchenko, kernel, devel

Hi, Jakub

On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> >  include/net/neighbour.h |  1 +
> >  net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
> >  2 files changed, 38 insertions(+), 9 deletions(-)
>
> Which tree are these based on? They don't seem to apply cleanly

It's based on 5.19 tree, but I can easily resent it based on net-next.

Regards,
Alex

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-08-11 14:51     ` Alexander Mikhalitsyn
@ 2022-08-11 14:53       ` Jakub Kicinski
  2022-08-11 14:57         ` Alexander Mikhalitsyn
  0 siblings, 1 reply; 18+ messages in thread
From: Jakub Kicinski @ 2022-08-11 14:53 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Daniel Borkmann, David Ahern, Yajun Deng, Roopa Prabhu,
	Christian Brauner, linux-kernel, Denis V . Lunev,
	Alexey Kuznetsov, Konstantin Khorenko, Pavel Tikhomirov,
	Andrey Zhadchenko, kernel, devel

On Thu, 11 Aug 2022 17:51:32 +0300 Alexander Mikhalitsyn wrote:
> On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:  
> > >  include/net/neighbour.h |  1 +
> > >  net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
> > >  2 files changed, 38 insertions(+), 9 deletions(-)  
> >
> > Which tree are these based on? They don't seem to apply cleanly  
> 
> It's based on 5.19 tree, but I can easily resent it based on net-next.

netdev/net would be the most appropriate tree for a fix.
Not that it differs much from net-next at this stage of 
the merge window.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-08-11 14:53       ` Jakub Kicinski
@ 2022-08-11 14:57         ` Alexander Mikhalitsyn
  0 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-11 14:57 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: netdev, David S. Miller, Eric Dumazet, Paolo Abeni,
	Daniel Borkmann, David Ahern, Yajun Deng, Roopa Prabhu,
	Christian Brauner, linux-kernel, Denis V . Lunev,
	Alexey Kuznetsov, Konstantin Khorenko, Pavel Tikhomirov,
	Andrey Zhadchenko, kernel, devel

On Thu, Aug 11, 2022 at 5:53 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 11 Aug 2022 17:51:32 +0300 Alexander Mikhalitsyn wrote:
> > On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> > > >  include/net/neighbour.h |  1 +
> > > >  net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
> > > >  2 files changed, 38 insertions(+), 9 deletions(-)
> > >
> > > Which tree are these based on? They don't seem to apply cleanly
> >
> > It's based on 5.19 tree, but I can easily resent it based on net-next.
>
> netdev/net would be the most appropriate tree for a fix.
> Not that it differs much from net-next at this stage of
> the merge window.

Yes, thanks Jakub. I'm a newbie here ;) Sorry for the inconvenience.

Will rebase and send patches soon.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-07-29 10:35 [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
                   ` (2 preceding siblings ...)
  2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
@ 2022-08-11 15:20 ` Alexander Mikhalitsyn
  2022-08-11 15:20   ` [PATCH v3 1/2] neigh: " Alexander Mikhalitsyn
                     ` (2 more replies)
  3 siblings, 3 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-11 15:20 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Denis V . Lunev, Alexey Kuznetsov, Konstantin Khorenko,
	Pavel Tikhomirov, Andrey Zhadchenko, Alexander Mikhalitsyn,
	kernel, devel

Dear friends,

Recently one of OpenVZ users reported that they have issues with network
availability of some containers. It was discovered that the reason is absence
of ARP replies from the Host Node on the requests about container IPs.

Of course, we started from tcpdump analysis and noticed that ARP requests
successfuly comes to the problematic node external interface. So, something
was wrong from the kernel side.

I've played a lot with arping and perf in attempts to understand what's
happening. And the key observation was that we experiencing issues only
with ARP requests with broadcast source ip (skb->pkt_type == PACKET_BROADCAST).
But for packets skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
				if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
				    skb->pkt_type == PACKET_HOST ||
				    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
					arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
						     sip, dev, tip, sha,
						     dev->dev_addr, sha,
						     reply_dst);
				} else {
					pneigh_enqueue(&arp_tbl,                     // reply with delay
						       in_dev->arp_parms, skb);
					goto out_free_dst;
				}

The problem was that for PACKET_BROADCAST packets we delaying replies and use pneigh_enqueue() function.
For some reason, queued packets were lost almost all the time! The reason for such behaviour is pneigh_queue_purge()
function which cleanups all the queue, and this function called everytime once some network device in the system
gets link down.

neigh_ifdown -> pneigh_queue_purge

Now imagine that we have a node with 500+ containers with microservices. And some of that microservices are buggy
and always restarting... in this case, pneigh_queue_purge function will be called very frequently.

This problem is reproducible only with so-called "host routed" setup. The classical scheme bridge + veth
is not affected.

Minimal reproducer

Suppose that we have a network 172.29.1.1/16 brd 172.29.255.255
and we have free-to-use IP, let it be 172.29.128.3

1. Network configuration. I showing the minimal configuration, it makes no sense
as we have both veth devices stay at the same net namespace, but for demonstation and simplicity sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totaly block ARP replies for 172.29.128.3 address. Just try
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's cleanup queue partially. Remove only skb's that related to the net namespace
of the adapter which link is down.

2. Let's account proxy_queue limit properly per-device. Current limitation looks
not fully correct because we comparing per-device configurable limit with the
"global" qlen of proxy_queue.

Thanks,
Alex

v2:
	- only ("neigh: fix possible DoS due to net iface start/stop") is changed
		do del_timer_sync() if queue is empty after pneigh_queue_purge()

v3:
	- rebase to net tree

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Denis V. Lunev <den@openvz.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander@mihalicyn.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Signed-off-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>

Alexander Mikhalitsyn (1):
  neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
  neigh: fix possible DoS due to net iface start/stop loop

 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 46 +++++++++++++++++++++++++++++++++--------
 2 files changed, 38 insertions(+), 9 deletions(-)

-- 
2.36.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v3 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
@ 2022-08-11 15:20   ` Alexander Mikhalitsyn
  2022-08-11 15:20   ` [PATCH v3 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
  2022-08-15 10:40   ` [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop patchwork-bot+netdevbpf
  2 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-11 15:20 UTC (permalink / raw)
  To: netdev
  Cc: Denis V. Lunev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Borkmann, David Ahern, Yajun Deng,
	Roopa Prabhu, Christian Brauner, linux-kernel, Alexey Kuznetsov,
	Alexander Mikhalitsyn, Konstantin Khorenko, kernel, devel

From: "Denis V. Lunev" <den@openvz.org>

Normal processing of ARP request (usually this is Ethernet broadcast
packet) coming to the host is looking like the following:
* the packet comes to arp_process() call and is passed through routing
  procedure
* the request is put into the queue using pneigh_enqueue() if
  corresponding ARP record is not local (common case for container
  records on the host)
* the request is processed by timer (within 80 jiffies by default) and
  ARP reply is sent from the same arp_process() using
  NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
  pneigh_enqueue())

And here the problem comes. Linux kernel calls pneigh_queue_purge()
which destroys the whole queue of ARP requests on ANY network interface
start/stop event through __neigh_ifdown().

This is actually not a problem within the original world as network
interface start/stop was accessible to the host 'root' only, which
could do more destructive things. But the world is changed and there
are Linux containers available. Here container 'root' has an access
to this API and could be considered as untrusted user in the hosting
(container's) world.

Thus there is an attack vector to other containers on node when
container's root will endlessly start/stop interfaces. We have observed
similar situation on a real production node when docker container was
doing such activity and thus other containers on the node become not
accessible.

The patch proposed doing very simple thing. It drops only packets from
the same namespace in the pneigh_queue_purge() where network interface
state change is detected. This is enough to prevent the problem for the
whole node preserving original semantics of the code.

v2:
	- do del_timer_sync() if queue is empty after pneigh_queue_purge()
v3:
	- rebase to net tree

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
---
 net/core/neighbour.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6a8c2596ebab..0e38a05d5b23 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb != NULL) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
-
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	if (skb_queue_empty_lockless(&tbl->proxy_queue))
+		del_timer_sync(&tbl->proxy_timer);
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v3 2/2] neighbour: make proxy_queue.qlen limit per-device
  2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
  2022-08-11 15:20   ` [PATCH v3 1/2] neigh: " Alexander Mikhalitsyn
@ 2022-08-11 15:20   ` Alexander Mikhalitsyn
  2022-08-15 10:40   ` [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop patchwork-bot+netdevbpf
  2 siblings, 0 replies; 18+ messages in thread
From: Alexander Mikhalitsyn @ 2022-08-11 15:20 UTC (permalink / raw)
  To: netdev
  Cc: Alexander Mikhalitsyn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, Christian Brauner, linux-kernel,
	Alexey Kuznetsov, Konstantin Khorenko, kernel, devel,
	Denis V . Lunev

Right now we have a neigh_param PROXY_QLEN which specifies maximum length
of neigh_table->proxy_queue. But in fact, this limitation doesn't work well
because check condition looks like:
tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)

The problem is that p (struct neigh_parms) is a per-device thing,
but tbl (struct neigh_table) is a system-wide global thing.

It seems reasonable to make proxy_queue limit per-device based.

v2:
	- nothing changed in this patch
v3:
	- rebase to net tree

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@kernel.org>
Cc: Yajun Deng <yajun.deng@linux.dev>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
Cc: kernel@openvz.org
Cc: devel@openvz.org
Suggested-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Reviewed-by: Denis V. Lunev <den@openvz.org>
---
 include/net/neighbour.h |  1 +
 net/core/neighbour.c    | 25 ++++++++++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index 9f0bab0589d9..3827a6b395fd 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -83,6 +83,7 @@ struct neigh_parms {
 	struct rcu_head rcu_head;
 
 	int	reachable_time;
+	int	qlen;
 	int	data[NEIGH_VAR_DATA_MAX];
 	DECLARE_BITMAP(data_state, NEIGH_VAR_DATA_MAX);
 };
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 0e38a05d5b23..5b669eb80270 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -316,9 +316,18 @@ static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 	skb = skb_peek(list);
 	while (skb != NULL) {
 		struct sk_buff *skb_next = skb_peek_next(skb, list);
-		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+		struct net_device *dev = skb->dev;
+		if (net == NULL || net_eq(dev_net(dev), net)) {
+			struct in_device *in_dev;
+
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, list);
-			dev_put(skb->dev);
+
+			dev_put(dev);
 			kfree_skb(skb);
 		}
 		skb = skb_next;
@@ -1606,8 +1615,15 @@ static void neigh_proxy_process(struct timer_list *t)
 
 		if (tdif <= 0) {
 			struct net_device *dev = skb->dev;
+			struct in_device *in_dev;
 
+			rcu_read_lock();
+			in_dev = __in_dev_get_rcu(dev);
+			if (in_dev)
+				in_dev->arp_parms->qlen--;
+			rcu_read_unlock();
 			__skb_unlink(skb, &tbl->proxy_queue);
+
 			if (tbl->proxy_redo && netif_running(dev)) {
 				rcu_read_lock();
 				tbl->proxy_redo(skb);
@@ -1632,7 +1648,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	unsigned long sched_next = jiffies +
 			prandom_u32_max(NEIGH_VAR(p, PROXY_DELAY));
 
-	if (tbl->proxy_queue.qlen > NEIGH_VAR(p, PROXY_QLEN)) {
+	if (p->qlen > NEIGH_VAR(p, PROXY_QLEN)) {
 		kfree_skb(skb);
 		return;
 	}
@@ -1648,6 +1664,7 @@ void pneigh_enqueue(struct neigh_table *tbl, struct neigh_parms *p,
 	skb_dst_drop(skb);
 	dev_hold(skb->dev);
 	__skb_queue_tail(&tbl->proxy_queue, skb);
+	p->qlen++;
 	mod_timer(&tbl->proxy_timer, sched_next);
 	spin_unlock(&tbl->proxy_queue.lock);
 }
@@ -1680,6 +1697,7 @@ struct neigh_parms *neigh_parms_alloc(struct net_device *dev,
 		refcount_set(&p->refcnt, 1);
 		p->reachable_time =
 				neigh_rand_reach_time(NEIGH_VAR(p, BASE_REACHABLE_TIME));
+		p->qlen = 0;
 		netdev_hold(dev, &p->dev_tracker, GFP_KERNEL);
 		p->dev = dev;
 		write_pnet(&p->net, net);
@@ -1745,6 +1763,7 @@ void neigh_table_init(int index, struct neigh_table *tbl)
 	refcount_set(&tbl->parms.refcnt, 1);
 	tbl->parms.reachable_time =
 			  neigh_rand_reach_time(NEIGH_VAR(&tbl->parms, BASE_REACHABLE_TIME));
+	tbl->parms.qlen = 0;
 
 	tbl->stats = alloc_percpu(struct neigh_statistics);
 	if (!tbl->stats)
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-08-10 16:08   ` [PATCH v2 1/2] neigh: " Alexander Mikhalitsyn
@ 2022-08-15  9:44     ` Christian Brauner
  2022-08-15 10:47       ` Denis V. Lunev
  0 siblings, 1 reply; 18+ messages in thread
From: Christian Brauner @ 2022-08-15  9:44 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: netdev, Denis V. Lunev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Daniel Borkmann, David Ahern,
	Yajun Deng, Roopa Prabhu, linux-kernel, Alexey Kuznetsov,
	Konstantin Khorenko, kernel, devel

On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote:
> From: "Denis V. Lunev" <den@openvz.org>
> 
> Normal processing of ARP request (usually this is Ethernet broadcast
> packet) coming to the host is looking like the following:
> * the packet comes to arp_process() call and is passed through routing
>   procedure
> * the request is put into the queue using pneigh_enqueue() if
>   corresponding ARP record is not local (common case for container
>   records on the host)
> * the request is processed by timer (within 80 jiffies by default) and
>   ARP reply is sent from the same arp_process() using
>   NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
>   pneigh_enqueue())
> 
> And here the problem comes. Linux kernel calls pneigh_queue_purge()
> which destroys the whole queue of ARP requests on ANY network interface
> start/stop event through __neigh_ifdown().
> 
> This is actually not a problem within the original world as network
> interface start/stop was accessible to the host 'root' only, which
> could do more destructive things. But the world is changed and there
> are Linux containers available. Here container 'root' has an access
> to this API and could be considered as untrusted user in the hosting
> (container's) world.
> 
> Thus there is an attack vector to other containers on node when
> container's root will endlessly start/stop interfaces. We have observed
> similar situation on a real production node when docker container was
> doing such activity and thus other containers on the node become not
> accessible.
> 
> The patch proposed doing very simple thing. It drops only packets from
> the same namespace in the pneigh_queue_purge() where network interface
> state change is detected. This is enough to prevent the problem for the
> whole node preserving original semantics of the code.

This is how I'd do it as well.

> 
> v2:
> 	- do del_timer_sync() if queue is empty after pneigh_queue_purge()
> 
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Paolo Abeni <pabeni@redhat.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: David Ahern <dsahern@kernel.org>
> Cc: Yajun Deng <yajun.deng@linux.dev>
> Cc: Roopa Prabhu <roopa@nvidia.com>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: netdev@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
> Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
> Cc: kernel@openvz.org
> Cc: devel@openvz.org
> Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> ---
>  net/core/neighbour.c | 25 +++++++++++++++++--------
>  1 file changed, 17 insertions(+), 8 deletions(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 54625287ee5b..19d99d1eff53 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
>  	return 0;
>  }
>  
> -static void pneigh_queue_purge(struct sk_buff_head *list)
> +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
>  {
> +	unsigned long flags;
>  	struct sk_buff *skb;
>  
> -	while ((skb = skb_dequeue(list)) != NULL) {
> -		dev_put(skb->dev);
> -		kfree_skb(skb);
> +	spin_lock_irqsave(&list->lock, flags);

I'm a bit surprised to see a spinlock held around a while loop walking a
linked list but that seems to be quite common in this file. I take it
the lists are guaranteed to be short.

> +	skb = skb_peek(list);
> +	while (skb != NULL) {
> +		struct sk_buff *skb_next = skb_peek_next(skb, list);
> +		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
> +			__skb_unlink(skb, list);
> +			dev_put(skb->dev);
> +			kfree_skb(skb);
> +		}
> +		skb = skb_next;
>  	}
> +	spin_unlock_irqrestore(&list->lock, flags);
>  }
>  
>  static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
> @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>  	write_lock_bh(&tbl->lock);
>  	neigh_flush_dev(tbl, dev, skip_perm);
>  	pneigh_ifdown_and_unlock(tbl, dev);
> -
> -	del_timer_sync(&tbl->proxy_timer);
> -	pneigh_queue_purge(&tbl->proxy_queue);
> +	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
> +	if (skb_queue_empty_lockless(&tbl->proxy_queue))
> +		del_timer_sync(&tbl->proxy_timer);
>  	return 0;
>  }
>  
> @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
>  	cancel_delayed_work_sync(&tbl->managed_work);
>  	cancel_delayed_work_sync(&tbl->gc_work);
>  	del_timer_sync(&tbl->proxy_timer);
> -	pneigh_queue_purge(&tbl->proxy_queue);
> +	pneigh_queue_purge(&tbl->proxy_queue, NULL);
>  	neigh_ifdown(tbl, NULL);
>  	if (atomic_read(&tbl->entries))
>  		pr_crit("neighbour leakage\n");
> -- 
> 2.36.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop
  2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
  2022-08-11 15:20   ` [PATCH v3 1/2] neigh: " Alexander Mikhalitsyn
  2022-08-11 15:20   ` [PATCH v3 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
@ 2022-08-15 10:40   ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 18+ messages in thread
From: patchwork-bot+netdevbpf @ 2022-08-15 10:40 UTC (permalink / raw)
  To: Alexander Mikhalitsyn
  Cc: netdev, davem, edumazet, kuba, pabeni, daniel, dsahern,
	yajun.deng, roopa, brauner, linux-kernel, den, kuznet, khorenko,
	ptikhomirov, andrey.zhadchenko, alexander, kernel, devel

Hello:

This series was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Thu, 11 Aug 2022 18:20:10 +0300 you wrote:
> Dear friends,
> 
> Recently one of OpenVZ users reported that they have issues with network
> availability of some containers. It was discovered that the reason is absence
> of ARP replies from the Host Node on the requests about container IPs.
> 
> Of course, we started from tcpdump analysis and noticed that ARP requests
> successfuly comes to the problematic node external interface. So, something
> was wrong from the kernel side.
> 
> [...]

Here is the summary with links:
  - [v3,1/2] neigh: fix possible DoS due to net iface start/stop loop
    https://git.kernel.org/netdev/net/c/66ba215cb513
  - [v3,2/2] neighbour: make proxy_queue.qlen limit per-device
    https://git.kernel.org/netdev/net/c/0ff4eb3d5ebb

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop
  2022-08-15  9:44     ` Christian Brauner
@ 2022-08-15 10:47       ` Denis V. Lunev
  0 siblings, 0 replies; 18+ messages in thread
From: Denis V. Lunev @ 2022-08-15 10:47 UTC (permalink / raw)
  To: Christian Brauner, Alexander Mikhalitsyn
  Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Daniel Borkmann, David Ahern, Yajun Deng,
	Roopa Prabhu, linux-kernel, Alexey Kuznetsov,
	Konstantin Khorenko, kernel, devel

On 15.08.2022 11:44, Christian Brauner wrote:
> On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote:
>> From: "Denis V. Lunev" <den@openvz.org>
>>
>> Normal processing of ARP request (usually this is Ethernet broadcast
>> packet) coming to the host is looking like the following:
>> * the packet comes to arp_process() call and is passed through routing
>>    procedure
>> * the request is put into the queue using pneigh_enqueue() if
>>    corresponding ARP record is not local (common case for container
>>    records on the host)
>> * the request is processed by timer (within 80 jiffies by default) and
>>    ARP reply is sent from the same arp_process() using
>>    NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
>>    pneigh_enqueue())
>>
>> And here the problem comes. Linux kernel calls pneigh_queue_purge()
>> which destroys the whole queue of ARP requests on ANY network interface
>> start/stop event through __neigh_ifdown().
>>
>> This is actually not a problem within the original world as network
>> interface start/stop was accessible to the host 'root' only, which
>> could do more destructive things. But the world is changed and there
>> are Linux containers available. Here container 'root' has an access
>> to this API and could be considered as untrusted user in the hosting
>> (container's) world.
>>
>> Thus there is an attack vector to other containers on node when
>> container's root will endlessly start/stop interfaces. We have observed
>> similar situation on a real production node when docker container was
>> doing such activity and thus other containers on the node become not
>> accessible.
>>
>> The patch proposed doing very simple thing. It drops only packets from
>> the same namespace in the pneigh_queue_purge() where network interface
>> state change is detected. This is enough to prevent the problem for the
>> whole node preserving original semantics of the code.
> This is how I'd do it as well.
>
>> v2:
>> 	- do del_timer_sync() if queue is empty after pneigh_queue_purge()
>>
>> Cc: "David S. Miller" <davem@davemloft.net>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Jakub Kicinski <kuba@kernel.org>
>> Cc: Paolo Abeni <pabeni@redhat.com>
>> Cc: Daniel Borkmann <daniel@iogearbox.net>
>> Cc: David Ahern <dsahern@kernel.org>
>> Cc: Yajun Deng <yajun.deng@linux.dev>
>> Cc: Roopa Prabhu <roopa@nvidia.com>
>> Cc: Christian Brauner <brauner@kernel.org>
>> Cc: netdev@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org
>> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
>> Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
>> Cc: Konstantin Khorenko <khorenko@virtuozzo.com>
>> Cc: kernel@openvz.org
>> Cc: devel@openvz.org
>> Investigated-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> ---
>>   net/core/neighbour.c | 25 +++++++++++++++++--------
>>   1 file changed, 17 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 54625287ee5b..19d99d1eff53 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
>>   	return 0;
>>   }
>>   
>> -static void pneigh_queue_purge(struct sk_buff_head *list)
>> +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
>>   {
>> +	unsigned long flags;
>>   	struct sk_buff *skb;
>>   
>> -	while ((skb = skb_dequeue(list)) != NULL) {
>> -		dev_put(skb->dev);
>> -		kfree_skb(skb);
>> +	spin_lock_irqsave(&list->lock, flags);
> I'm a bit surprised to see a spinlock held around a while loop walking a
> linked list but that seems to be quite common in this file. I take it
> the lists are guaranteed to be short.
Within the current code the size of the list is 64 packets at most
(same spinlock is held during packets processing).

Though this semantics is changed in the next patch from
Alexander, where we will get 64 packets/interface.

Den

>> +	skb = skb_peek(list);
>> +	while (skb != NULL) {
>> +		struct sk_buff *skb_next = skb_peek_next(skb, list);
>> +		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
>> +			__skb_unlink(skb, list);
>> +			dev_put(skb->dev);
>> +			kfree_skb(skb);
>> +		}
>> +		skb = skb_next;
>>   	}
>> +	spin_unlock_irqrestore(&list->lock, flags);
>>   }
>>   
>>   static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
>> @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>>   	write_lock_bh(&tbl->lock);
>>   	neigh_flush_dev(tbl, dev, skip_perm);
>>   	pneigh_ifdown_and_unlock(tbl, dev);
>> -
>> -	del_timer_sync(&tbl->proxy_timer);
>> -	pneigh_queue_purge(&tbl->proxy_queue);
>> +	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
>> +	if (skb_queue_empty_lockless(&tbl->proxy_queue))
>> +		del_timer_sync(&tbl->proxy_timer);
>>   	return 0;
>>   }
>>   
>> @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
>>   	cancel_delayed_work_sync(&tbl->managed_work);
>>   	cancel_delayed_work_sync(&tbl->gc_work);
>>   	del_timer_sync(&tbl->proxy_timer);
>> -	pneigh_queue_purge(&tbl->proxy_queue);
>> +	pneigh_queue_purge(&tbl->proxy_queue, NULL);
>>   	neigh_ifdown(tbl, NULL);
>>   	if (atomic_read(&tbl->entries))
>>   		pr_crit("neighbour leakage\n");
>> -- 
>> 2.36.1
>>


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-08-15 10:47 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-29 10:35 [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
2022-07-29 10:35 ` [PATCH 1/2] neigh: " Alexander Mikhalitsyn
2022-08-01 15:08   ` David Ahern
2022-08-01 15:44     ` Denis V. Lunev
2022-07-29 10:35 ` [PATCH 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
2022-08-10 16:08 ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Alexander Mikhalitsyn
2022-08-10 16:08   ` [PATCH v2 1/2] neigh: " Alexander Mikhalitsyn
2022-08-15  9:44     ` Christian Brauner
2022-08-15 10:47       ` Denis V. Lunev
2022-08-10 16:08   ` [PATCH v2 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
2022-08-11 14:46   ` [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop Jakub Kicinski
2022-08-11 14:51     ` Alexander Mikhalitsyn
2022-08-11 14:53       ` Jakub Kicinski
2022-08-11 14:57         ` Alexander Mikhalitsyn
2022-08-11 15:20 ` [PATCH v3 " Alexander Mikhalitsyn
2022-08-11 15:20   ` [PATCH v3 1/2] neigh: " Alexander Mikhalitsyn
2022-08-11 15:20   ` [PATCH v3 2/2] neighbour: make proxy_queue.qlen limit per-device Alexander Mikhalitsyn
2022-08-15 10:40   ` [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).