* [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts
@ 2021-11-05 10:51 Barry Song
  2021-11-05 12:24 ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Barry Song @ 2021-11-05 10:51 UTC (permalink / raw)
  To: davem, kuba, edumazet, pabeni, fw, mingo, peterz, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tglx, netdev, linux-kernel
  Cc: linuxarm, guodong.xu, yangyicong, shenyang39, tangchengchang,
	Barry Song, Libo Chen, Tim Chen

From: Barry Song <song.bao.hua@hisilicon.com>

At LPC 2021, both Libo Chen and Tim Chen reported tasks being
over-pulled by network interrupts[1]. For example, while running a
database with the ethernet card on NUMA node 0, node 1 might be
almost idle because affine wakeups keep pulling tasks to node 0.
I have seen the same problem. One way to mitigate it is to use a
normal wakeup in the network code rather than a sync wakeup, which
pulls tasks towards the waking CPU more aggressively in the
scheduler core.

On a Kunpeng 920 with 4 NUMA nodes, the ethernet card is on node 0
and the storage disk on node 2. While driving this MySQL machine
with sysbench, I see node 1 stay idle although nodes 0, 2 and 3
are quite busy.
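
(For reference, the placement can be double-checked with something
like the sketch below; "eth0" is only a placeholder for the real
interface name, and interrupt naming varies by driver.)

 # NUMA node the NIC's PCI device sits on (-1 means unknown)
 cat /sys/class/net/eth0/device/numa_node
 # overall CPU <-> node layout
 numactl --hardware
 # where the NIC queue interrupts are currently being delivered
 grep eth0 /proc/interrupts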

The benchmark command:

 sysbench --db-driver=mysql --mysql-user=sbtest_user \
 --mysql_password=password --mysql-db=sbtest \
 --mysql-host=192.168.101.3 --mysql-port=3306 \
 --point-selects=10 --simple-ranges=1 \
 --sum-ranges=1 --order-ranges=1 --distinct-ranges=1 \
 --index-updates=1 --non-index-updates=1 \
 --delete-inserts=1 --range-size=100 \
 --time=600 --events=0 --report-interval=60 \
 --tables=64 --table-size=2000000 --threads=128 \
  /usr/share/sysbench/oltp_read_only.lua run

The benchmark result is as below:
                   tps          qps
  w/o patch   31748.22    507971.56
  w/  patch   35075.20    561203.13
                +10.5%       +10.5%

With the patch, NUMA node 1 becomes busy as well, which gives the
10%+ performance improvement above.

I am not saying this patch is exactly the right approach, but I'd
like to use this RFC to connect the net and scheduler people and
start the discussion with this wider audience.

Testing was done on the latest Linus tree at commit d4439a1189,
with the .config[2].

[1] https://linuxplumbersconf.org/event/11/contributions/1044/attachments/801/1508/lpc21_wakeup_pulling_libochen.pdf
[2] http://www.linuxep.com/patches/config

Cc: Libo Chen <libo.chen@oracle.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 net/core/sock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 9862eef..a346359 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -3133,7 +3133,7 @@ void sock_def_readable(struct sock *sk)
 	rcu_read_lock();
 	wq = rcu_dereference(sk->sk_wq);
 	if (skwq_has_sleeper(wq))
-		wake_up_interruptible_sync_poll(&wq->wait, EPOLLIN | EPOLLPRI |
+		wake_up_interruptible_poll(&wq->wait, EPOLLIN | EPOLLPRI |
 						EPOLLRDNORM | EPOLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 	rcu_read_unlock();
@@ -3151,7 +3151,7 @@ static void sock_def_write_space(struct sock *sk)
 	if ((refcount_read(&sk->sk_wmem_alloc) << 1) <= READ_ONCE(sk->sk_sndbuf)) {
 		wq = rcu_dereference(sk->sk_wq);
 		if (skwq_has_sleeper(wq))
-			wake_up_interruptible_sync_poll(&wq->wait, EPOLLOUT |
+			wake_up_interruptible_poll(&wq->wait, EPOLLOUT |
 						EPOLLWRNORM | EPOLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
-- 
1.8.3.1



* Re: [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts
  2021-11-05 10:51 [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts Barry Song
@ 2021-11-05 12:24 ` Peter Zijlstra
  2021-11-07 18:08   ` Barry Song
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2021-11-05 12:24 UTC (permalink / raw)
  To: Barry Song
  Cc: davem, kuba, edumazet, pabeni, fw, mingo, juri.lelli,
	vincent.guittot, dietmar.eggemann, rostedt, bsegall, mgorman,
	bristot, tglx, netdev, linux-kernel, linuxarm, guodong.xu,
	yangyicong, shenyang39, tangchengchang, Barry Song, Libo Chen,
	Tim Chen

On Fri, Nov 05, 2021 at 06:51:36PM +0800, Barry Song wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> At LPC 2021, both Libo Chen and Tim Chen reported tasks being
> over-pulled by network interrupts[1]. For example, while running a
> database with the ethernet card on NUMA node 0, node 1 might be
> almost idle because affine wakeups keep pulling tasks to node 0.
> I have seen the same problem. One way to mitigate it is to use a
> normal wakeup in the network code rather than a sync wakeup, which
> pulls tasks towards the waking CPU more aggressively in the
> scheduler core.
> 
> On a Kunpeng 920 with 4 NUMA nodes, the ethernet card is on node 0
> and the storage disk on node 2. While driving this MySQL machine
> with sysbench, I see node 1 stay idle although nodes 0, 2 and 3
> are quite busy.
> 

> I am not saying this patch is exactly the right approach, but I'd
> like to use this RFC to connect the net and scheduler people and
> start the discussion with this wider audience.

Well the normal way would be to use multi-queue crud and/or receive
packet steering to get the interrupt/wakeup back to the cpu that data
came from.
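
A minimal sketch of the RPS side of that (placeholder interface name
and CPU mask; Documentation/networking/scaling.rst is the reference):

 # RPS: allow packets landing on each RX queue to be processed on the
 # CPUs in the given mask rather than only on the interrupted CPU
 # (eth0 and the mask are placeholders)
 for q in /sys/class/net/eth0/queues/rx-*
 do
 	echo ffffffff > $q/rps_cpus
 done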


* Re: [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts
  2021-11-05 12:24 ` Peter Zijlstra
@ 2021-11-07 18:08   ` Barry Song
  2021-11-08  9:27     ` Peter Zijlstra
  0 siblings, 1 reply; 5+ messages in thread
From: Barry Song @ 2021-11-07 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: David Miller, kuba, Eric Dumazet, pabeni, fw, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, netdev, LKML, Linuxarm, Guodong Xu, yangyicong,
	shenyang39, tangchengchang, Barry Song, Libo Chen, Tim Chen

On Sat, Nov 6, 2021 at 1:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Nov 05, 2021 at 06:51:36PM +0800, Barry Song wrote:
> > From: Barry Song <song.bao.hua@hisilicon.com>
> >
> > At LPC 2021, both Libo Chen and Tim Chen reported tasks being
> > over-pulled by network interrupts[1]. For example, while running a
> > database with the ethernet card on NUMA node 0, node 1 might be
> > almost idle because affine wakeups keep pulling tasks to node 0.
> > I have seen the same problem. One way to mitigate it is to use a
> > normal wakeup in the network code rather than a sync wakeup, which
> > pulls tasks towards the waking CPU more aggressively in the
> > scheduler core.
> >
> > On a Kunpeng 920 with 4 NUMA nodes, the ethernet card is on node 0
> > and the storage disk on node 2. While driving this MySQL machine
> > with sysbench, I see node 1 stay idle although nodes 0, 2 and 3
> > are quite busy.
> >
>
> > I am not saying this patch is exactly the right approach, but I'd
> > like to use this RFC to connect the net and scheduler people and
> > start the discussion with this wider audience.
>
> Well the normal way would be to use multi-queue crud and/or receive
> packet steering to get the interrupt/wakeup back to the cpu that data
> came from.

The test case already uses a multi-queue ethernet card; its IRQs are
either balanced to NUMA node 0 by irqbalance or pinned to node 0,
where the card is located, by a script like:
#!/bin/bash
irq_list=(`cat /proc/interrupts | grep network_name| awk -F: '{print $1}'`)
cpunum=0
for irq in ${irq_list[@]}
do
echo $cpunum > /proc/irq/$irq/smp_affinity_list
echo `cat /proc/irq/$irq/smp_affinity_list`
(( cpunum+=1 ))
done

I have heard some people work around this issue by pinning the
multi-queue IRQs across multiple NUMA nodes, which spreads the
interrupts and avoids over-pulling tasks to one node, but loses
ethernet locality.
Hi @Tim, it seems at LPC 2021 you mentioned you are using this
solution?
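
A rough, untested sketch of that workaround, reusing the placeholder
interrupt name from the script above and reading each node's CPU
list from sysfs:

 #!/bin/bash
 # Spread the NIC queue IRQs round-robin across all NUMA nodes
 # instead of packing them all onto node 0.
 nodes=(/sys/devices/system/node/node[0-9]*)
 irq_list=($(awk -F: '/network_name/ {print $1}' /proc/interrupts))
 i=0
 for irq in "${irq_list[@]}"
 do
 	node=${nodes[$((i % ${#nodes[@]}))]}
 	cat $node/cpulist > /proc/irq/$irq/smp_affinity_list
 	(( i+=1 ))
 done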

Other people pin the ethernet IRQs to a couple of CPUs within the
NUMA node the card belongs to, then isolate those CPUs from tasks
and use them for interrupts only. This avoids the wake-up pulling
entirely.
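
A sketch of that second workaround (made-up CPU numbers, assuming
CPUs 0-3 sit on the NIC's node):

 # boot with isolcpus=0-3 so the scheduler keeps tasks off those CPUs,
 # then steer every NIC queue IRQ onto them ("network_name" is again a
 # placeholder for the real interrupt name):
 for irq in $(awk -F: '/network_name/ {print $1}' /proc/interrupts)
 do
 	echo 0-3 > /proc/irq/$irq/smp_affinity_list
 done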

I think we need some generic way to resolve this problem. Hi
@Libo, what is your solution for working around this issue?

Thanks
Barry


* Re: [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts
  2021-11-07 18:08   ` Barry Song
@ 2021-11-08  9:27     ` Peter Zijlstra
  2021-11-08 16:27       ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2021-11-08  9:27 UTC (permalink / raw)
  To: Barry Song
  Cc: David Miller, kuba, Eric Dumazet, pabeni, fw, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, netdev, LKML, Linuxarm, Guodong Xu, yangyicong,
	shenyang39, tangchengchang, Barry Song, Libo Chen, Tim Chen

On Mon, Nov 08, 2021 at 07:08:09AM +1300, Barry Song wrote:
> On Sat, Nov 6, 2021 at 1:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Nov 05, 2021 at 06:51:36PM +0800, Barry Song wrote:
> > > From: Barry Song <song.bao.hua@hisilicon.com>
> > >
> > > At LPC 2021, both Libo Chen and Tim Chen reported tasks being
> > > over-pulled by network interrupts[1]. For example, while running a
> > > database with the ethernet card on NUMA node 0, node 1 might be
> > > almost idle because affine wakeups keep pulling tasks to node 0.
> > > I have seen the same problem. One way to mitigate it is to use a
> > > normal wakeup in the network code rather than a sync wakeup, which
> > > pulls tasks towards the waking CPU more aggressively in the
> > > scheduler core.
> > >
> > > On a Kunpeng 920 with 4 NUMA nodes, the ethernet card is on node 0
> > > and the storage disk on node 2. While driving this MySQL machine
> > > with sysbench, I see node 1 stay idle although nodes 0, 2 and 3
> > > are quite busy.
> > >
> >
> > > I am not saying this patch is exactly the right approach, but I'd
> > > like to use this RFC to connect the net and scheduler people and
> > > start the discussion with this wider audience.
> >
> > Well the normal way would be to use multi-queue crud and/or receive
> > packet steering to get the interrupt/wakeup back to the cpu that data
> > came from.
> 
> The test case already uses a multi-queue ethernet card; its IRQs are
> either balanced to NUMA node 0 by irqbalance or pinned to node 0,
> where the card is located, by a script like:
> #!/bin/bash
> irq_list=(`cat /proc/interrupts | grep network_name| awk -F: '{print $1}'`)
> cpunum=0
> for irq in ${irq_list[@]}
> do
> echo $cpunum > /proc/irq/$irq/smp_affinity_list
> echo `cat /proc/irq/$irq/smp_affinity_list`
> (( cpunum+=1 ))
> done
> 
> I have heard some people work around this issue by pinning the
> multi-queue IRQs across multiple NUMA nodes, which spreads the
> interrupts and avoids over-pulling tasks to one node, but loses
> ethernet locality.

So you're doing explicitly the wrong thing with your script above and
then complain the scheduler follows that and destroys your data
locality?

The network folks made RPS/RFS specifically to spread the processing of
the packets back to the CPUs/Nodes the TX happened on to increase data
locality. Why not use that?
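
For completeness, a minimal sketch of the RFS part (placeholder
interface name and table sizes; the real guidance is in
Documentation/networking/scaling.rst):

 # RFS: steer each flow to the CPU where its consuming thread last ran
 # (RFS builds on RPS, so rps_cpus also needs to be configured)
 echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
 for q in /sys/class/net/eth0/queues/rx-*
 do
 	echo 2048 > $q/rps_flow_cnt
 done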



* Re: [RFC PATCH] sched&net: avoid over-pulling tasks due to network interrupts
  2021-11-08  9:27     ` Peter Zijlstra
@ 2021-11-08 16:27       ` Eric Dumazet
  0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2021-11-08 16:27 UTC (permalink / raw)
  To: Peter Zijlstra, Barry Song
  Cc: David Miller, kuba, Eric Dumazet, pabeni, fw, Ingo Molnar,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Thomas Gleixner, netdev, LKML, Linuxarm, Guodong Xu, yangyicong,
	shenyang39, tangchengchang, Barry Song, Libo Chen, Tim Chen



On 11/8/21 1:27 AM, Peter Zijlstra wrote:
> On Mon, Nov 08, 2021 at 07:08:09AM +1300, Barry Song wrote:
>> On Sat, Nov 6, 2021 at 1:25 AM Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Fri, Nov 05, 2021 at 06:51:36PM +0800, Barry Song wrote:
>>>> From: Barry Song <song.bao.hua@hisilicon.com>
>>>>
>>>> At LPC 2021, both Libo Chen and Tim Chen reported tasks being
>>>> over-pulled by network interrupts[1]. For example, while running a
>>>> database with the ethernet card on NUMA node 0, node 1 might be
>>>> almost idle because affine wakeups keep pulling tasks to node 0.
>>>> I have seen the same problem. One way to mitigate it is to use a
>>>> normal wakeup in the network code rather than a sync wakeup, which
>>>> pulls tasks towards the waking CPU more aggressively in the
>>>> scheduler core.
>>>>
>>>> On a Kunpeng 920 with 4 NUMA nodes, the ethernet card is on node 0
>>>> and the storage disk on node 2. While driving this MySQL machine
>>>> with sysbench, I see node 1 stay idle although nodes 0, 2 and 3
>>>> are quite busy.
>>>>
>>>
>>>> I am not saying this patch is exactly the right approach, but I'd
>>>> like to use this RFC to connect the net and scheduler people and
>>>> start the discussion with this wider audience.
>>>
>>> Well the normal way would be to use multi-queue crud and/or receive
>>> packet steering to get the interrupt/wakeup back to the cpu that data
>>> came from.
>>
>> The test case already uses a multi-queue ethernet card; its IRQs are
>> either balanced to NUMA node 0 by irqbalance or pinned to node 0,
>> where the card is located, by a script like:
>> #!/bin/bash
>> irq_list=(`cat /proc/interrupts | grep network_name| awk -F: '{print $1}'`)
>> cpunum=0
>> for irq in ${irq_list[@]}
>> do
>> echo $cpunum > /proc/irq/$irq/smp_affinity_list
>> echo `cat /proc/irq/$irq/smp_affinity_list`
>> (( cpunum+=1 ))
>> done
>>
>> I have heard some people work around this issue by pinning the
>> multi-queue IRQs across multiple NUMA nodes, which spreads the
>> interrupts and avoids over-pulling tasks to one node, but loses
>> ethernet locality.
> 
> So you're doing explicitly the wrong thing with your script above and
> then complain the scheduler follows that and destroys your data
> locality?
> 
> The network folks made RPS/RFS specifically to spread the processing of
> the packets back to the CPUs/Nodes the TX happened on to increase data
> locality. Why not use that?
> 

+1

This documentation should describe how this can be done

Documentation/networking/scaling.rst

Hopefully it is not completely outdated.
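
For the transmit side it also documents XPS, which is configured per
TX queue, e.g. (placeholder interface and mask):

 # XPS: restrict each TX queue to the CPUs in the given mask
 echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus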
 

