* [PATCH net] net/packet: tpacket_rcv: avoid a producer race condition
@ 2020-03-13 16:18 Willem de Bruijn
2020-03-14 11:56 ` FW: " Jon Rosen (jrosen)
2020-03-15 7:25 ` David Miller
0 siblings, 2 replies; 3+ messages in thread
From: Willem de Bruijn @ 2020-03-13 16:18 UTC (permalink / raw)
To: netdev; +Cc: davem, jrosen, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large slots (e.g., with GSO packets).
Synchronize kernel thread ownership of rx ring slots with a bitmap.
Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.
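The racy acquire path can be sketched in userspace terms as follows. This is a minimal single-threaded simulation, not kernel code; the names `try_claim_slot`, `finish_copy`, `slot_status` and `writers_in_slot` are hypothetical stand-ins for the real tpacket_rcv logic:

```c
#include <assert.h>
#include <stdbool.h>

/* Simulation of the pre-patch acquire path: tp_status is the only
 * ownership marker, and it stays TP_STATUS_KERNEL for the entire
 * duration of the copy, so a second writer that wraps the ring can
 * claim the same slot before the first writer finishes. */
#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   (1 << 0)

static unsigned int slot_status = TP_STATUS_KERNEL;
static int writers_in_slot;   /* writers currently copying into the slot */

/* Called with the receive queue lock notionally held. */
static bool try_claim_slot(void)
{
	if (slot_status != TP_STATUS_KERNEL)
		return false;
	/* Pre-patch: nothing records that a writer now owns the slot. */
	writers_in_slot++;
	return true;
}

static void finish_copy(void)
{
	slot_status = TP_STATUS_USER;
	writers_in_slot--;
}
```

With only tp_status to go on, a fast writer that laps the ring sees the same TP_STATUS_KERNEL as the slow writer still mid-copy, and both end up in the slot.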
Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.
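The acquire/release protocol above can be sketched in userspace as follows. This is illustrative only: a plain bool array stands in for the kernel bitmap and its `test_bit`/`__set_bit`/`__clear_bit` helpers, and the receive queue lock is assumed rather than modeled:

```c
#include <assert.h>
#include <stdbool.h>

#define SLOTS            8
#define TP_STATUS_KERNEL 0
#define TP_STATUS_USER   (1 << 0)

static unsigned int tp_status[SLOTS];   /* shared with userspace */
static bool rx_owner_map[SLOTS];        /* kernel-private owner bits */

/* Acquire: called with the receive queue lock held. The slot must be
 * free per tp_status AND not already owned by another kernel writer. */
static bool slot_acquire(unsigned int slot)
{
	if (tp_status[slot] != TP_STATUS_KERNEL || rx_owner_map[slot])
		return false;
	rx_owner_map[slot] = true;
	return true;
}

/* Release: tp_status and the owner bit must change as one transaction,
 * so the kernel retakes the lock around these two stores. */
static void slot_release(unsigned int slot)
{
	tp_status[slot] = TP_STATUS_USER;
	rx_owner_map[slot] = false;
}
```

The owner bit closes the window: a wrapping writer that takes the lock during a copy still sees TP_STATUS_KERNEL, but now also sees the set owner bit and drops the packet instead of double-writing.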
This is one of a variety of options discussed (see Link below):
* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.
* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.
* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But legacy applications may interpret all non-zero
tp_status as owned by the user, as libpcap does. So this is possible
only as an opt-in for newer processes. It can be added as an optional mode.
* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.
The additional locking cost on release adds contention, no different
from scaling on multicore or multiqueue h/w. In practice, neither the
below reproducer nor small-packet tcpdump showed a noticeable change
in cycles spent in spinlock in a perf report. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.
Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.
Based on an earlier patchset by Jon Rosen. See links below.
I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.
Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
We discussed this two years ago. After again spending a few days
trying all options, I found I again ended up with essentially your
last solution, Jon. Please feel free to add your Signed-off-by
or resubmit as first author.
I can move the alternatives to below the --- to reduce commit length.
---
net/packet/af_packet.c | 21 +++++++++++++++++++++
net/packet/internal.h | 5 ++++-
2 files changed, 25 insertions(+), 1 deletion(-)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index e5b0986215d2..29bd405adbbd 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2173,6 +2173,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
struct timespec64 ts;
__u32 ts_status;
bool is_drop_n_account = false;
+ unsigned int slot_id = 0;
bool do_vnet = false;
/* struct tpacket{2,3}_hdr is aligned to a multiple of TPACKET_ALIGNMENT.
@@ -2275,6 +2276,13 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
if (!h.raw)
goto drop_n_account;
+ if (po->tp_version <= TPACKET_V2) {
+ slot_id = po->rx_ring.head;
+ if (test_bit(slot_id, po->rx_ring.rx_owner_map))
+ goto drop_n_account;
+ __set_bit(slot_id, po->rx_ring.rx_owner_map);
+ }
+
if (do_vnet &&
virtio_net_hdr_from_skb(skb, h.raw + macoff -
sizeof(struct virtio_net_hdr),
@@ -2380,7 +2388,10 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
#endif
if (po->tp_version <= TPACKET_V2) {
+ spin_lock(&sk->sk_receive_queue.lock);
__packet_set_status(po, h.raw, status);
+ __clear_bit(slot_id, po->rx_ring.rx_owner_map);
+ spin_unlock(&sk->sk_receive_queue.lock);
sk->sk_data_ready(sk);
} else {
prb_clear_blk_fill_status(&po->rx_ring);
@@ -4277,6 +4288,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
{
struct pgv *pg_vec = NULL;
struct packet_sock *po = pkt_sk(sk);
+ unsigned long *rx_owner_map = NULL;
int was_running, order = 0;
struct packet_ring_buffer *rb;
struct sk_buff_head *rb_queue;
@@ -4362,6 +4374,12 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
}
break;
default:
+ if (!tx_ring) {
+ rx_owner_map = bitmap_alloc(req->tp_frame_nr,
+ GFP_KERNEL | __GFP_NOWARN | __GFP_ZERO);
+ if (!rx_owner_map)
+ goto out_free_pg_vec;
+ }
break;
}
}
@@ -4391,6 +4409,8 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
err = 0;
spin_lock_bh(&rb_queue->lock);
swap(rb->pg_vec, pg_vec);
+ if (po->tp_version <= TPACKET_V2)
+ swap(rb->rx_owner_map, rx_owner_map);
rb->frame_max = (req->tp_frame_nr - 1);
rb->head = 0;
rb->frame_size = req->tp_frame_size;
@@ -4422,6 +4442,7 @@ static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u,
}
out_free_pg_vec:
+ bitmap_free(rx_owner_map);
if (pg_vec)
free_pg_vec(pg_vec, order, req->tp_block_nr);
out:
diff --git a/net/packet/internal.h b/net/packet/internal.h
index 82fb2b10f790..907f4cd2a718 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -70,7 +70,10 @@ struct packet_ring_buffer {
unsigned int __percpu *pending_refcnt;
- struct tpacket_kbdq_core prb_bdqc;
+ union {
+ unsigned long *rx_owner_map;
+ struct tpacket_kbdq_core prb_bdqc;
+ };
};
extern struct mutex fanout_mutex;
--
2.25.1.481.gfbce0eb801-goog
* FW: [PATCH net] net/packet: tpacket_rcv: avoid a producer race condition
2020-03-13 16:18 [PATCH net] net/packet: tpacket_rcv: avoid a producer race condition Willem de Bruijn
@ 2020-03-14 11:56 ` Jon Rosen (jrosen)
2020-03-15 7:25 ` David Miller
1 sibling, 0 replies; 3+ messages in thread
From: Jon Rosen (jrosen) @ 2020-03-14 11:56 UTC (permalink / raw)
To: Willem de Bruijn, netdev; +Cc: davem, Willem de Bruijn
Signed-off-by: Jon Rosen <jrosen@cisco.com>
* Re: [PATCH net] net/packet: tpacket_rcv: avoid a producer race condition
2020-03-13 16:18 [PATCH net] net/packet: tpacket_rcv: avoid a producer race condition Willem de Bruijn
2020-03-14 11:56 ` FW: " Jon Rosen (jrosen)
@ 2020-03-15 7:25 ` David Miller
1 sibling, 0 replies; 3+ messages in thread
From: David Miller @ 2020-03-15 7:25 UTC (permalink / raw)
To: willemdebruijn.kernel; +Cc: netdev, jrosen, willemb
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Date: Fri, 13 Mar 2020 12:18:09 -0400
Applied and queued up for -stable, thanks everyone.