* [PATCH net-next v2 0/3] net: dev: PREEMPT_RT fixups.
@ 2022-02-04 20:12 Sebastian Andrzej Siewior
  2022-02-04 20:12 ` [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal() Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04 20:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

Hi,

this series removes or replaces preempt_disable() and local_irq_save()
sections which are problematic on PREEMPT_RT.
Patch 2 makes netif_rx() work from any context, following suggestions I
found in an old thread. Should that work out, the context-specific
variants could be removed.

I already sketched the removal at
   https://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git/log/?h=nettree


v1…v2:
  - #1 and #2
    - merged patches 1 and 2 of the series (as per Toke).
    - updated patch description and corrected the first commit number (as
      per Eric).
   - #2
     - Provide netif_rx() as in v1 and additionally __netif_rx() without
       local_bh_disable()+enable() for the loopback driver. __netif_rx() is
       not exported (loopback is built-in only) so it won't be used by
       drivers. If this doesn't work out, we can still export it or define
       a wrapper as Eric suggested.
     - Added a comment that netif_rx() is considered legacy.
   - #3
     - Moved the ____napi_schedule() invocation into rps_ipi_queued() and
       renamed that function to napi_schedule_rps().

v1:
   https://lore.kernel.org/all/20220202122848.647635-1-bigeasy@linutronix.de

Sebastian




* [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().
  2022-02-04 20:12 [PATCH net-next v2 0/3] net: dev: PREEMPT_RT fixups Sebastian Andrzej Siewior
@ 2022-02-04 20:12 ` Sebastian Andrzej Siewior
  2022-02-04 23:44   ` Toke Høiland-Jørgensen
  2022-02-04 20:12 ` [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context Sebastian Andrzej Siewior
  2022-02-04 20:12 ` [PATCH net-next v2 3/3] net: dev: Make rps_lock() disable interrupts Sebastian Andrzej Siewior
  2 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04 20:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen, Sebastian Andrzej Siewior

The preempt_disable() section was introduced in commit
    cece1945bffcf ("net: disable preemption before call smp_processor_id()")

and was added in case this function is invoked from preemptible context,
given that get_cpu() had been added further down in the function.

The get_cpu() usage was added in commit
    b0e28f1effd1d ("net: netif_rx() must disable preemption")

because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption
causing a warning in smp_processor_id(). The function netif_rx() should
only be invoked from an interrupt context which implies disabled
preemption. The commit
   e30b38c298b55 ("ip: Fix ip_dev_loopback_xmit()")

addressed this and replaced netif_rx() with netif_rx_ni() in
ip_dev_loopback_xmit().

Based on the discussion on the list, the former patch (b0e28f1effd1d)
should not have been applied; only the latter (e30b38c298b55) was needed.

Remove get_cpu() and preempt_disable() since the function is supposed to
be invoked from context with stable per-CPU pointers. Bottom halves have
to be disabled at this point because the function may raise softirqs
which need to be processed.
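
For illustration (not part of this patch), a hedged sketch of the
calling convention netif_rx() now relies on; the local_bh_disable()
stands in for whatever interrupt or BH-disabled context a real caller
already provides:

	local_bh_disable();	/* per-CPU data stays stable */
	ret = netif_rx(skb);	/* may raise NET_RX_SOFTIRQ */
	local_bh_enable();	/* runs the raised softirqs */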

Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 net/core/dev.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1baab07820f65..0d13340ed4054 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4796,7 +4796,6 @@ static int netif_rx_internal(struct sk_buff *skb)
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
 		int cpu;
 
-		preempt_disable();
 		rcu_read_lock();
 
 		cpu = get_rps_cpu(skb->dev, skb, &rflow);
@@ -4806,14 +4805,12 @@ static int netif_rx_internal(struct sk_buff *skb)
 		ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
 
 		rcu_read_unlock();
-		preempt_enable();
 	} else
 #endif
 	{
 		unsigned int qtail;
 
-		ret = enqueue_to_backlog(skb, get_cpu(), &qtail);
-		put_cpu();
+		ret = enqueue_to_backlog(skb, smp_processor_id(), &qtail);
 	}
 	return ret;
 }
-- 
2.34.1



* [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-04 20:12 [PATCH net-next v2 0/3] net: dev: PREEMPT_RT fixups Sebastian Andrzej Siewior
  2022-02-04 20:12 ` [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal() Sebastian Andrzej Siewior
@ 2022-02-04 20:12 ` Sebastian Andrzej Siewior
  2022-02-04 23:44   ` Toke Høiland-Jørgensen
  2022-02-05  4:17   ` Jakub Kicinski
  2022-02-04 20:12 ` [PATCH net-next v2 3/3] net: dev: Make rps_lock() disable interrupts Sebastian Andrzej Siewior
  2 siblings, 2 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04 20:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen, Sebastian Andrzej Siewior

Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
work in all contexts and get rid of netif_rx_ni()". Eric agreed and
pointed out that modern devices should use netif_receive_skb() to avoid
the overhead.
In the meantime someone added another variant, netif_rx_any_context(),
which behaves as suggested.

netif_rx() must be invoked with disabled bottom halves to ensure that
pending softirqs, which were raised within the function, are handled.
netif_rx_ni() can be invoked only from process context (bottom halves
must be enabled) because the function handles pending softirqs without
checking if bottom halves were disabled or not.
netif_rx_any_context() dispatches to one of the former functions by
checking in_interrupt().

netif_rx() could be taught to handle both cases (disabled and enabled
bottom halves) by simply disabling bottom halves while invoking
netif_rx_internal(). The local_bh_enable() invocation will then invoke
pending softirqs only if the BH-disable counter drops to zero.
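
For illustration (not part of this patch), a sketch of the BH-disable
counter when netif_rx() is invoked with bottom halves already disabled;
the nesting shown is hypothetical:

	local_bh_disable();	/* caller: count 0 -> 1 */
	netif_rx(skb);		/* inside: count 1 -> 2, may raise a
				 * softirq; count 2 -> 1 on the inner
				 * enable, so nothing runs here */
	local_bh_enable();	/* count 1 -> 0: pending softirqs run */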

Eric is concerned about the overhead of BH disable+enable, especially
with regard to the loopback driver. Since this driver is
performance-critical, it receives a shortcut to avoid the additional
overhead, which it does not need.

Add a local_bh_disable() section in netif_rx() to ensure softirqs are
handled if needed. Provide the internal bits as __netif_rx() which can
be used by the loopback driver. This function is not exported so it
can't be used by modules.
Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
can be removed once there are no users left.

Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
---
 drivers/net/loopback.c     |  2 +-
 include/linux/netdevice.h  | 14 ++++++++--
 include/trace/events/net.h | 14 ----------
 net/core/dev.c             | 53 +++++++++++---------------------------
 4 files changed, 28 insertions(+), 55 deletions(-)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index ed0edf5884ef8..77f5b564382b6 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -86,7 +86,7 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 	skb->protocol = eth_type_trans(skb, dev);
 
 	len = skb->len;
-	if (likely(netif_rx(skb) == NET_RX_SUCCESS))
+	if (likely(__netif_rx(skb) == NET_RX_SUCCESS))
 		dev_lstats_add(dev, len);
 
 	return NETDEV_TX_OK;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e490b84732d16..c9e883104adb1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3669,8 +3669,18 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
 void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog);
 int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb);
 int netif_rx(struct sk_buff *skb);
-int netif_rx_ni(struct sk_buff *skb);
-int netif_rx_any_context(struct sk_buff *skb);
+int __netif_rx(struct sk_buff *skb);
+
+static inline int netif_rx_ni(struct sk_buff *skb)
+{
+	return netif_rx(skb);
+}
+
+static inline int netif_rx_any_context(struct sk_buff *skb)
+{
+	return netif_rx(skb);
+}
+
 int netif_receive_skb(struct sk_buff *skb);
 int netif_receive_skb_core(struct sk_buff *skb);
 void netif_receive_skb_list_internal(struct list_head *head);
diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 78c448c6ab4c5..032b431b987b6 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -260,13 +260,6 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_entry,
 	TP_ARGS(skb)
 );
 
-DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_ni_entry,
-
-	TP_PROTO(const struct sk_buff *skb),
-
-	TP_ARGS(skb)
-);
-
 DECLARE_EVENT_CLASS(net_dev_rx_exit_template,
 
 	TP_PROTO(int ret),
@@ -312,13 +305,6 @@ DEFINE_EVENT(net_dev_rx_exit_template, netif_rx_exit,
 	TP_ARGS(ret)
 );
 
-DEFINE_EVENT(net_dev_rx_exit_template, netif_rx_ni_exit,
-
-	TP_PROTO(int ret),
-
-	TP_ARGS(ret)
-);
-
 DEFINE_EVENT(net_dev_rx_exit_template, netif_receive_skb_list_exit,
 
 	TP_PROTO(int ret),
diff --git a/net/core/dev.c b/net/core/dev.c
index 0d13340ed4054..f34a8f3a448a7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4815,6 +4815,16 @@ static int netif_rx_internal(struct sk_buff *skb)
 	return ret;
 }
 
+int __netif_rx(struct sk_buff *skb)
+{
+	int ret;
+
+	trace_netif_rx_entry(skb);
+	ret = netif_rx_internal(skb);
+	trace_netif_rx_exit(ret);
+	return ret;
+}
+
 /**
  *	netif_rx	-	post buffer to the network code
  *	@skb: buffer to post
@@ -4823,58 +4833,25 @@ static int netif_rx_internal(struct sk_buff *skb)
  *	the upper (protocol) levels to process.  It always succeeds. The buffer
  *	may be dropped during processing for congestion control or by the
  *	protocol layers.
+ *	This interface is considered legacy. Modern NIC drivers should use NAPI
+ *	and GRO.
  *
  *	return values:
  *	NET_RX_SUCCESS	(no congestion)
  *	NET_RX_DROP     (packet was dropped)
  *
  */
-
 int netif_rx(struct sk_buff *skb)
 {
 	int ret;
 
-	trace_netif_rx_entry(skb);
-
-	ret = netif_rx_internal(skb);
-	trace_netif_rx_exit(ret);
-
+	local_bh_disable();
+	ret = __netif_rx(skb);
+	local_bh_enable();
 	return ret;
 }
 EXPORT_SYMBOL(netif_rx);
 
-int netif_rx_ni(struct sk_buff *skb)
-{
-	int err;
-
-	trace_netif_rx_ni_entry(skb);
-
-	preempt_disable();
-	err = netif_rx_internal(skb);
-	if (local_softirq_pending())
-		do_softirq();
-	preempt_enable();
-	trace_netif_rx_ni_exit(err);
-
-	return err;
-}
-EXPORT_SYMBOL(netif_rx_ni);
-
-int netif_rx_any_context(struct sk_buff *skb)
-{
-	/*
-	 * If invoked from contexts which do not invoke bottom half
-	 * processing either at return from interrupt or when softrqs are
-	 * reenabled, use netif_rx_ni() which invokes bottomhalf processing
-	 * directly.
-	 */
-	if (in_interrupt())
-		return netif_rx(skb);
-	else
-		return netif_rx_ni(skb);
-}
-EXPORT_SYMBOL(netif_rx_any_context);
-
 static __latent_entropy void net_tx_action(struct softirq_action *h)
 {
 	struct softnet_data *sd = this_cpu_ptr(&softnet_data);
-- 
2.34.1



* [PATCH net-next v2 3/3] net: dev: Make rps_lock() disable interrupts.
  2022-02-04 20:12 [PATCH net-next v2 0/3] net: dev: PREEMPT_RT fixups Sebastian Andrzej Siewior
  2022-02-04 20:12 ` [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal() Sebastian Andrzej Siewior
  2022-02-04 20:12 ` [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context Sebastian Andrzej Siewior
@ 2022-02-04 20:12 ` Sebastian Andrzej Siewior
  2022-02-05  4:17   ` Jakub Kicinski
  2 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-04 20:12 UTC (permalink / raw)
  To: bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen, Sebastian Andrzej Siewior

Disabling interrupts and, in the RPS case, locking input_pkt_queue is
split into local_irq_disable() and an optional spin_lock().

This breaks on PREEMPT_RT because the spinlock_t typed lock cannot be
acquired with disabled interrupts.
The sections in which the lock is held are usually short, in the sense
that they do not cause long and unbounded latencies. One exception is
the skb_flow_limit() invocation, which may invoke a BPF program (and may
require sleeping locks).
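
For illustration, the pattern this patch removes, as it exists in
enqueue_to_backlog() before the change:

	local_irq_save(flags);		/* hard interrupts off */
	rps_lock(sd);			/* spin_lock() with CONFIG_RPS; on
					 * PREEMPT_RT spinlock_t is a
					 * sleeping lock, so taking it with
					 * interrupts disabled is invalid */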

By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep
interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels.
Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens
as part of local_bh_disable() on the local CPU.
____napi_schedule() is only invoked if sd is from the local CPU. Replace
it with __napi_schedule_irqoff() which already disables interrupts on
PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the
function to napi_schedule_rps as suggested by Jakub.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 net/core/dev.c | 76 ++++++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 34 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index f34a8f3a448a7..b7578f47e151c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -216,18 +216,38 @@ static inline struct hlist_head *dev_index_hash(struct net *net, int ifindex)
 	return &net->dev_index_head[ifindex & (NETDEV_HASHENTRIES - 1)];
 }
 
-static inline void rps_lock(struct softnet_data *sd)
+static inline void rps_lock_irqsave(struct softnet_data *sd,
+				    unsigned long *flags)
 {
-#ifdef CONFIG_RPS
-	spin_lock(&sd->input_pkt_queue.lock);
-#endif
+	if (IS_ENABLED(CONFIG_RPS))
+		spin_lock_irqsave(&sd->input_pkt_queue.lock, *flags);
+	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(*flags);
 }
 
-static inline void rps_unlock(struct softnet_data *sd)
+static inline void rps_lock_irq_disable(struct softnet_data *sd)
 {
-#ifdef CONFIG_RPS
-	spin_unlock(&sd->input_pkt_queue.lock);
-#endif
+	if (IS_ENABLED(CONFIG_RPS))
+		spin_lock_irq(&sd->input_pkt_queue.lock);
+	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_disable();
+}
+
+static inline void rps_unlock_irq_restore(struct softnet_data *sd,
+					  unsigned long *flags)
+{
+	if (IS_ENABLED(CONFIG_RPS))
+		spin_unlock_irqrestore(&sd->input_pkt_queue.lock, *flags);
+	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(*flags);
+}
+
+static inline void rps_unlock_irq_enable(struct softnet_data *sd)
+{
+	if (IS_ENABLED(CONFIG_RPS))
+		spin_unlock_irq(&sd->input_pkt_queue.lock);
+	else if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_enable();
 }
 
 static struct netdev_name_node *netdev_name_node_alloc(struct net_device *dev,
@@ -4456,11 +4476,11 @@ static void rps_trigger_softirq(void *data)
  * If yes, queue it to our IPI list and return 1
  * If no, return 0
  */
-static int rps_ipi_queued(struct softnet_data *sd)
+static int napi_schedule_rps(struct softnet_data *sd)
 {
-#ifdef CONFIG_RPS
 	struct softnet_data *mysd = this_cpu_ptr(&softnet_data);
 
+#ifdef CONFIG_RPS
 	if (sd != mysd) {
 		sd->rps_ipi_next = mysd->rps_ipi_list;
 		mysd->rps_ipi_list = sd;
@@ -4469,6 +4489,7 @@ static int rps_ipi_queued(struct softnet_data *sd)
 		return 1;
 	}
 #endif /* CONFIG_RPS */
+	__napi_schedule_irqoff(&mysd->backlog);
 	return 0;
 }
 
@@ -4525,9 +4546,7 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 
 	sd = &per_cpu(softnet_data, cpu);
 
-	local_irq_save(flags);
-
-	rps_lock(sd);
+	rps_lock_irqsave(sd, &flags);
 	if (!netif_running(skb->dev))
 		goto drop;
 	qlen = skb_queue_len(&sd->input_pkt_queue);
@@ -4536,26 +4555,21 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
 enqueue:
 			__skb_queue_tail(&sd->input_pkt_queue, skb);
 			input_queue_tail_incr_save(sd, qtail);
-			rps_unlock(sd);
-			local_irq_restore(flags);
+			rps_unlock_irq_restore(sd, &flags);
 			return NET_RX_SUCCESS;
 		}
 
 		/* Schedule NAPI for backlog device
 		 * We can use non atomic operation since we own the queue lock
 		 */
-		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state)) {
-			if (!rps_ipi_queued(sd))
-				____napi_schedule(sd, &sd->backlog);
-		}
+		if (!__test_and_set_bit(NAPI_STATE_SCHED, &sd->backlog.state))
+			napi_schedule_rps(sd);
 		goto enqueue;
 	}
 
 drop:
 	sd->dropped++;
-	rps_unlock(sd);
-
-	local_irq_restore(flags);
+	rps_unlock_irq_restore(sd, &flags);
 
 	atomic_long_inc(&skb->dev->rx_dropped);
 	kfree_skb(skb);
@@ -5624,8 +5638,7 @@ static void flush_backlog(struct work_struct *work)
 	local_bh_disable();
 	sd = this_cpu_ptr(&softnet_data);
 
-	local_irq_disable();
-	rps_lock(sd);
+	rps_lock_irq_disable(sd);
 	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
 			__skb_unlink(skb, &sd->input_pkt_queue);
@@ -5633,8 +5646,7 @@ static void flush_backlog(struct work_struct *work)
 			input_queue_head_incr(sd);
 		}
 	}
-	rps_unlock(sd);
-	local_irq_enable();
+	rps_unlock_irq_enable(sd);
 
 	skb_queue_walk_safe(&sd->process_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
@@ -5652,16 +5664,14 @@ static bool flush_required(int cpu)
 	struct softnet_data *sd = &per_cpu(softnet_data, cpu);
 	bool do_flush;
 
-	local_irq_disable();
-	rps_lock(sd);
+	rps_lock_irq_disable(sd);
 
 	/* as insertion into process_queue happens with the rps lock held,
 	 * process_queue access may race only with dequeue
 	 */
 	do_flush = !skb_queue_empty(&sd->input_pkt_queue) ||
 		   !skb_queue_empty_lockless(&sd->process_queue);
-	rps_unlock(sd);
-	local_irq_enable();
+	rps_unlock_irq_enable(sd);
 
 	return do_flush;
 #endif
@@ -5776,8 +5786,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 
 		}
 
-		local_irq_disable();
-		rps_lock(sd);
+		rps_lock_irq_disable(sd);
 		if (skb_queue_empty(&sd->input_pkt_queue)) {
 			/*
 			 * Inline a custom version of __napi_complete().
@@ -5793,8 +5802,7 @@ static int process_backlog(struct napi_struct *napi, int quota)
 			skb_queue_splice_tail_init(&sd->input_pkt_queue,
 						   &sd->process_queue);
 		}
-		rps_unlock(sd);
-		local_irq_enable();
+		rps_unlock_irq_enable(sd);
 	}
 
 	return work;
-- 
2.34.1



* Re: [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal().
  2022-02-04 20:12 ` [PATCH net-next v2 1/3] net: dev: Remove preempt_disable() and get_cpu() in netif_rx_internal() Sebastian Andrzej Siewior
@ 2022-02-04 23:44   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-02-04 23:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner, Sebastian Andrzej Siewior

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> The preempt_disable() section was introduced in commit
>     cece1945bffcf ("net: disable preemption before call smp_processor_id()")
>
> and was added in case this function is invoked from preemptible context,
> given that get_cpu() had been added further down in the function.
>
> The get_cpu() usage was added in commit
>     b0e28f1effd1d ("net: netif_rx() must disable preemption")
>
> because ip_dev_loopback_xmit() invoked netif_rx() with enabled preemption
> causing a warning in smp_processor_id(). The function netif_rx() should
> only be invoked from an interrupt context which implies disabled
> preemption. The commit
>    e30b38c298b55 ("ip: Fix ip_dev_loopback_xmit()")
>
> addressed this and replaced netif_rx() with netif_rx_ni() in
> ip_dev_loopback_xmit().
>
> Based on the discussion on the list, the former patch (b0e28f1effd1d)
> should not have been applied; only the latter (e30b38c298b55) was needed.
>
> Remove get_cpu() and preempt_disable() since the function is supposed to
> be invoked from context with stable per-CPU pointers. Bottom halves have
> to be disabled at this point because the function may raise softirqs
> which need to be processed.
>
> Link: https://lkml.kernel.org/r/20100415.013347.98375530.davem@davemloft.net
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-04 20:12 ` [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context Sebastian Andrzej Siewior
@ 2022-02-04 23:44   ` Toke Høiland-Jørgensen
  2022-02-05  4:17   ` Jakub Kicinski
  1 sibling, 0 replies; 13+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-02-04 23:44 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, bpf, netdev
  Cc: David S. Miller, Alexei Starovoitov, Daniel Borkmann,
	Eric Dumazet, Jakub Kicinski, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner, Sebastian Andrzej Siewior

Sebastian Andrzej Siewior <bigeasy@linutronix.de> writes:

> Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
> work in all contexts and get rid of netif_rx_ni()". Eric agreed and
> pointed out that modern devices should use netif_receive_skb() to avoid
> the overhead.
> In the meantime someone added another variant, netif_rx_any_context(),
> which behaves as suggested.
>
> netif_rx() must be invoked with disabled bottom halves to ensure that
> pending softirqs, which were raised within the function, are handled.
> netif_rx_ni() can be invoked only from process context (bottom halves
> must be enabled) because the function handles pending softirqs without
> checking if bottom halves were disabled or not.
> netif_rx_any_context() dispatches to one of the former functions by
> checking in_interrupt().
>
> netif_rx() could be taught to handle both cases (disabled and enabled
> bottom halves) by simply disabling bottom halves while invoking
> netif_rx_internal(). The local_bh_enable() invocation will then invoke
> pending softirqs only if the BH-disable counter drops to zero.
>
> Eric is concerned about the overhead of BH disable+enable, especially
> with regard to the loopback driver. Since this driver is
> performance-critical, it receives a shortcut to avoid the additional
> overhead, which it does not need.
>
> Add a local_bh_disable() section in netif_rx() to ensure softirqs are
> handled if needed. Provide the internal bits as __netif_rx() which can
> be used by the loopback driver. This function is not exported so it
> can't be used by modules.
> Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
> can be removed once there are no users left.
>
> Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Reviewed-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-04 20:12 ` [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context Sebastian Andrzej Siewior
  2022-02-04 23:44   ` Toke Høiland-Jørgensen
@ 2022-02-05  4:17   ` Jakub Kicinski
  2022-02-05 20:36     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-02-05  4:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On Fri,  4 Feb 2022 21:12:58 +0100 Sebastian Andrzej Siewior wrote:
> +int __netif_rx(struct sk_buff *skb)
> +{
> +	int ret;
> +
> +	trace_netif_rx_entry(skb);
> +	ret = netif_rx_internal(skb);
> +	trace_netif_rx_exit(ret);
> +	return ret;
> +}

Any reason this is not exported? I don't think there's anything wrong
with drivers calling this function, especially SW drivers which already
know to be in BH. I'd vote for roughly all of $(ls drivers/net/*.c) to
get the same treatment as loopback.


* Re: [PATCH net-next v2 3/3] net: dev: Make rps_lock() disable interrupts.
  2022-02-04 20:12 ` [PATCH net-next v2 3/3] net: dev: Make rps_lock() disable interrupts Sebastian Andrzej Siewior
@ 2022-02-05  4:17   ` Jakub Kicinski
  0 siblings, 0 replies; 13+ messages in thread
From: Jakub Kicinski @ 2022-02-05  4:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On Fri,  4 Feb 2022 21:12:59 +0100 Sebastian Andrzej Siewior wrote:
> Disabling interrupts and, in the RPS case, locking input_pkt_queue is
> split into local_irq_disable() and an optional spin_lock().
> 
> This breaks on PREEMPT_RT because the spinlock_t typed lock cannot be
> acquired with disabled interrupts.
> The sections in which the lock is held are usually short, in the sense
> that they do not cause long and unbounded latencies. One exception is
> the skb_flow_limit() invocation, which may invoke a BPF program (and may
> require sleeping locks).
> 
> By moving local_irq_disable() + spin_lock() into rps_lock(), we can keep
> interrupts disabled on !PREEMPT_RT and enabled on PREEMPT_RT kernels.
> Without RPS on a PREEMPT_RT kernel, the needed synchronisation happens
> as part of local_bh_disable() on the local CPU.
> ____napi_schedule() is only invoked if sd is from the local CPU. Replace
> it with __napi_schedule_irqoff() which already disables interrupts on
> PREEMPT_RT as needed. Move this call to rps_ipi_queued() and rename the
> function to napi_schedule_rps as suggested by Jakub.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Reviewed-by: Jakub Kicinski <kuba@kernel.org>


* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-05  4:17   ` Jakub Kicinski
@ 2022-02-05 20:36     ` Sebastian Andrzej Siewior
  2022-02-07 16:47       ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-05 20:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On 2022-02-04 20:17:15 [-0800], Jakub Kicinski wrote:
> On Fri,  4 Feb 2022 21:12:58 +0100 Sebastian Andrzej Siewior wrote:
> > +int __netif_rx(struct sk_buff *skb)
> > +{
> > +	int ret;
> > +
> > +	trace_netif_rx_entry(skb);
> > +	ret = netif_rx_internal(skb);
> > +	trace_netif_rx_exit(ret);
> > +	return ret;
> > +}
> 
> Any reason this is not exported? I don't think there's anything wrong
> with drivers calling this function, especially SW drivers which already
> know to be in BH. I'd vote for roughly all of $(ls drivers/net/*.c) to
> get the same treatment as loopback.

Don't we end up in the same situation as netif_rx() vs netif_rx_ni()?

Sebastian


* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-05 20:36     ` Sebastian Andrzej Siewior
@ 2022-02-07 16:47       ` Jakub Kicinski
  2022-02-10 12:22         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-02-07 16:47 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On Sat, 5 Feb 2022 21:36:05 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-04 20:17:15 [-0800], Jakub Kicinski wrote:
> > On Fri,  4 Feb 2022 21:12:58 +0100 Sebastian Andrzej Siewior wrote:  
> > > +int __netif_rx(struct sk_buff *skb)
> > > +{
> > > +	int ret;
> > > +
> > > +	trace_netif_rx_entry(skb);
> > > +	ret = netif_rx_internal(skb);
> > > +	trace_netif_rx_exit(ret);
> > > +	return ret;
> > > +}  
> > 
> > Any reason this is not exported? I don't think there's anything wrong
> > with drivers calling this function, especially SW drivers which already
> > know to be in BH. I'd vote for roughly all of $(ls drivers/net/*.c) to
> > get the same treatment as loopback.  
> 
> Don't we end up in the same situation as netif_rx() vs netif_rx_ni()?

Sort of. TBH my understanding of the motivation is a bit vague.
IIUC you want to reduce the API duplication so drivers know what to
do[1]. I believe the quote from Eric you put in the commit message
pertains to HW devices, where using netif_rx() is quite anachronistic. 
But software devices like loopback, veth or tunnels may want to go via
backlog for good reasons. Would it make it better if we called
netif_rx() netif_rx_backlog() instead? Or am I missing the point?


* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-07 16:47       ` Jakub Kicinski
@ 2022-02-10 12:22         ` Sebastian Andrzej Siewior
  2022-02-10 18:13           ` Jakub Kicinski
  0 siblings, 1 reply; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-10 12:22 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On 2022-02-07 08:47:17 [-0800], Jakub Kicinski wrote:
> On Sat, 5 Feb 2022 21:36:05 +0100 Sebastian Andrzej Siewior wrote:
> > On 2022-02-04 20:17:15 [-0800], Jakub Kicinski wrote:
> > > On Fri,  4 Feb 2022 21:12:58 +0100 Sebastian Andrzej Siewior wrote:  
> > > > +int __netif_rx(struct sk_buff *skb)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	trace_netif_rx_entry(skb);
> > > > +	ret = netif_rx_internal(skb);
> > > > +	trace_netif_rx_exit(ret);
> > > > +	return ret;
> > > > +}  
> > > 
> > > Any reason this is not exported? I don't think there's anything wrong
> > > with drivers calling this function, especially SW drivers which already
> > > know to be in BH. I'd vote for roughly all of $(ls drivers/net/*.c) to
> > > get the same treatment as loopback.  
> > 
> > Don't we end up in the same situation as netif_rx() vs netif_rx_ni()?
> 
> Sort of. TBH my understanding of the motivation is a bit vague.
> IIUC you want to reduce the API duplication so drivers know what to
> do[1]. I believe the quote from Eric you put in the commit message
> pertains to HW devices, where using netif_rx() is quite anachronistic. 
> But software devices like loopback, veth or tunnels may want to go via
> backlog for good reasons. Would it make it better if we called
> netif_rx() netif_rx_backlog() instead? Or am I missing the point?

So we do netif_rx_backlog() with the bh disable+enable and
__netif_rx_backlog() without it and export both tree wide? It would make
it more obvious indeed. Could we add
	WARN_ON_ONCE(!(hardirq_count() | softirq_count()))
to the shortcut to catch the "you did it wrong folks"? This costs me
about 2ns.
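
Roughly, a sketch of where that check would sit (not part of the posted
patch):

	int __netif_rx(struct sk_buff *skb)
	{
		int ret;

		/* catch callers outside hard-IRQ or BH-disabled context */
		WARN_ON_ONCE(!(hardirq_count() | softirq_count()));

		trace_netif_rx_entry(skb);
		ret = netif_rx_internal(skb);
		trace_netif_rx_exit(ret);
		return ret;
	}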

TL;DR

netif_rx_ni() is problematic on RT and I tried to do something about
it. I remembered from the in_atomic() cleanup that a few drivers got it
wrong (one way or another). We also added netif_rx_any_context(), which
is used by some of the drivers (yet another entry point) while the few
others got fixed directly.
Then I stumbled over the thread where the entry point (netif_rx() vs
netif_rx_ni()) was wrong and Dave suggested having one entry point for
them all. This sounded like a good idea since it would eliminate the
several API entry points where things can go wrong, and my RT trouble
would vanish in one go.
The deprecation part looked promising, but I didn't take into account
that the overhead for legitimate users (like the backlog or the software
tunnels you mention) is not acceptable.

Sebastian


* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-10 12:22         ` Sebastian Andrzej Siewior
@ 2022-02-10 18:13           ` Jakub Kicinski
  2022-02-10 19:52             ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 13+ messages in thread
From: Jakub Kicinski @ 2022-02-10 18:13 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On Thu, 10 Feb 2022 13:22:32 +0100 Sebastian Andrzej Siewior wrote:
> On 2022-02-07 08:47:17 [-0800], Jakub Kicinski wrote:
> > On Sat, 5 Feb 2022 21:36:05 +0100 Sebastian Andrzej Siewior wrote:  
> > > Don't we end up in the same situation as netif_rx() vs netif_rx_ni()?
> > 
> > Sort of. TBH my understanding of the motivation is a bit vague.
> > IIUC you want to reduce the API duplication so drivers know what to
> > do[1]. I believe the quote from Eric you put in the commit message
> > pertains to HW devices, where using netif_rx() is quite anachronistic. 
> > But software devices like loopback, veth or tunnels may want to go via
> > backlog for good reasons. Would it make it better if we called
> > netif_rx() netif_rx_backlog() instead? Or am I missing the point?  
> 
> So we do netif_rx_backlog() with the bh disable+enable and
> __netif_rx_backlog() without it and export both tree wide?

At the risk of confusing people about the API, we could also name the
"non-super-optimized" version netif_rx(), like you had in your patch.
Grepping thru the drivers there's ~250 uses so maybe we don't wanna
touch all that code. No strong preference, I just didn't expect to 
see __netif_rx_backlog(), but either way works.

> It would make it more obvious indeed. Could we add
> 	WARN_ON_ONCE(!(hardirq_count() | softirq_count()))
> to the shortcut to catch the "you did it wrong folks"? This costs me
> about 2ns.

Modulo lockdep_..(), so we don't have to run this check on prod kernels?

> TL;DR
> 
> netif_rx_ni() is problematic on RT and I tried to do something about
> it. I remembered from the in_atomic() cleanup that a few drivers got it
> wrong (one way or another). We also added netif_rx_any_context(), which
> is used by some of the drivers (yet another entry point) while the few
> others got fixed directly.
> Then I stumbled over the thread where the entry point (netif_rx() vs
> netif_rx_ni()) was wrong and Dave suggested having one entry point for
> them all. This sounded like a good idea since it would eliminate the
> several API entry points where things can go wrong, and my RT trouble
> would vanish in one go.
> The deprecation part looked promising, but I didn't take into account
> that the overhead for legitimate users (like the backlog or the software
> tunnels you mention) is not acceptable.

I see. So IIUC the primary motivation is replacing preempt-disable with
BH-disable, but the cleanup seemed like a good idea.


* Re: [PATCH net-next v2 2/3] net: dev: Makes sure netif_rx() can be invoked in any context.
  2022-02-10 18:13           ` Jakub Kicinski
@ 2022-02-10 19:52             ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 13+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-02-10 19:52 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: bpf, netdev, David S. Miller, Alexei Starovoitov,
	Daniel Borkmann, Eric Dumazet, Jesper Dangaard Brouer,
	John Fastabend, Thomas Gleixner,
	Toke Høiland-Jørgensen

On 2022-02-10 10:13:30 [-0800], Jakub Kicinski wrote:
> > So we do netif_rx_backlog() with the bh disable+enable and
> > __netif_rx_backlog() without it and export both tree wide?
> 
> At the risk of confusing people about the API, we could also name the
> "non-super-optimized" version netif_rx(), like you had in your patch.
> Grepping thru the drivers there's ~250 uses so maybe we don't wanna
> touch all that code. No strong preference, I just didn't expect to 
> see __netif_rx_backlog(), but either way works.

So let me keep the naming as-is, export __netif_rx() and update the
kernel doc with the bits about the backlog.
After that, if we are up to renaming the function in ~250 drivers, it
should be simpler.

> > It would make it more obvious indeed. Could we add
> > 	WARN_ON_ONCE(!(hardirq_count() | softirq_count()))
> > to the shortcut to catch the "you did it wrong folks"? This costs me
> > about 2ns.
> 
> Modulo lockdep_..(), so we don't have to run this check on prod kernels?

I was a little worried about the corner cases, but then lockdep is your
friend and you should test your code. Okay.
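
So the check becomes something like this in __netif_rx() (a sketch,
assuming lockdep_assert_once(), which is compiled out unless lock
debugging is enabled):

	/* replaces the open-coded WARN_ON_ONCE() from the earlier sketch */
	lockdep_assert_once(hardirq_count() | softirq_count());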

Sebastian
