* [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
@ 2022-01-30 19:03 Tony Lu
  2022-01-31  7:20 ` Leon Romanovsky
From: Tony Lu @ 2022-01-30 19:03 UTC (permalink / raw)
  To: kgraul, kuba, davem; +Cc: netdev, linux-s390

Currently, pages are allocated in the process context, so their NUMA
node may differ from the ibdev's, which is not the best policy for
performance.

Applications will generally perform best when their processes access
memory on the same NUMA node. When numa_balancing is enabled (as it is
by most OS distributions), it moves tasks closer to the memory of the
sndbuf or rmb and the ibdev; meanwhile, the IRQs of the ibdev are
usually bound to the same node. This reduces the latency caused by
remote memory access.

According to our tests in different scenarios, there is up to a 15.30%
performance drop (Redis benchmark) when remote memory is accessed.

Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
---
 net/smc/smc_core.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 8935ef4811b0..2a28b045edfa 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2065,9 +2065,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
 	return rc;
 }
 
-static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
+static struct smc_buf_desc *smcr_new_buf_create(struct smc_connection *conn,
 						bool is_rmb, int bufsize)
 {
+	int node = ibdev_to_node(conn->lnk->smcibdev->ibdev);
 	struct smc_buf_desc *buf_desc;
 
 	/* try to alloc a new buffer */
@@ -2076,10 +2077,10 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 		return ERR_PTR(-ENOMEM);
 
 	buf_desc->order = get_order(bufsize);
-	buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-				      __GFP_NOMEMALLOC | __GFP_COMP |
-				      __GFP_NORETRY | __GFP_ZERO,
-				      buf_desc->order);
+	buf_desc->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_NOWARN |
+					   __GFP_NOMEMALLOC | __GFP_COMP |
+					   __GFP_NORETRY | __GFP_ZERO,
+					   buf_desc->order);
 	if (!buf_desc->pages) {
 		kfree(buf_desc);
 		return ERR_PTR(-EAGAIN);
@@ -2190,7 +2191,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
 		if (is_smcd)
 			buf_desc = smcd_new_buf_create(lgr, is_rmb, bufsize);
 		else
-			buf_desc = smcr_new_buf_create(lgr, is_rmb, bufsize);
+			buf_desc = smcr_new_buf_create(conn, is_rmb, bufsize);
 
 		if (PTR_ERR(buf_desc) == -ENOMEM)
 			break;
-- 
2.32.0.3.g01195cf9f



* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-01-30 19:03 [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node Tony Lu
@ 2022-01-31  7:20 ` Leon Romanovsky
  2022-02-07  9:59   ` Tony Lu
From: Leon Romanovsky @ 2022-01-31  7:20 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> Currently, pages are allocated in the process context, for its NUMA node
> isn't equal to ibdev's, which is not the best policy for performance.
> 
> Applications will generally perform best when the processes are
> accessing memory on the same NUMA node. When numa_balancing enabled
> (which is enabled by most of OS distributions), it moves tasks closer to
> the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> to the same node usually. This reduces the latency when accessing remote
> memory.

This is very subjective and test-specific. I would expect the
application to control NUMA memory policies (set_mempolicy(), ...)
by itself, without the kernel picking the NUMA node.

The various *_alloc_node() APIs are meant for in-kernel allocations
where the user can't control the memory policy.

I don't know SMC-R well enough, but judging from your description, this
allocation is controlled by the application.
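
For illustration, this is roughly what an application could do by
itself, as suggested above; a minimal sketch using the numaif.h wrapper
from libnuma, with an assumed node number:

#include <numaif.h>	/* set_mempolicy(), MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Bind future memory allocations of this process to NUMA node 1,
	 * i.e. the node the RDMA device is assumed to be attached to.   */
	unsigned long nodemask = 1UL << 1;

	if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
		perror("set_mempolicy");
		return EXIT_FAILURE;
	}

	/* ... open sockets, allocate buffers, run the workload ... */
	return EXIT_SUCCESS;
}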

Thanks

> 
> According to our tests in different scenarios, there has up to 15.30%
> performance drop (Redis benchmark) when accessing remote memory.
> 
> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
> ---
>  net/smc/smc_core.c | 13 +++++++------
>  1 file changed, 7 insertions(+), 6 deletions(-)
> 
> diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
> index 8935ef4811b0..2a28b045edfa 100644
> --- a/net/smc/smc_core.c
> +++ b/net/smc/smc_core.c
> @@ -2065,9 +2065,10 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
>  	return rc;
>  }
>  
> -static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
> +static struct smc_buf_desc *smcr_new_buf_create(struct smc_connection *conn,
>  						bool is_rmb, int bufsize)
>  {
> +	int node = ibdev_to_node(conn->lnk->smcibdev->ibdev);
>  	struct smc_buf_desc *buf_desc;
>  
>  	/* try to alloc a new buffer */
> @@ -2076,10 +2077,10 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
>  		return ERR_PTR(-ENOMEM);
>  
>  	buf_desc->order = get_order(bufsize);
> -	buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
> -				      __GFP_NOMEMALLOC | __GFP_COMP |
> -				      __GFP_NORETRY | __GFP_ZERO,
> -				      buf_desc->order);
> +	buf_desc->pages = alloc_pages_node(node, GFP_KERNEL | __GFP_NOWARN |
> +					   __GFP_NOMEMALLOC | __GFP_COMP |
> +					   __GFP_NORETRY | __GFP_ZERO,
> +					   buf_desc->order);
>  	if (!buf_desc->pages) {
>  		kfree(buf_desc);
>  		return ERR_PTR(-EAGAIN);
> @@ -2190,7 +2191,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
>  		if (is_smcd)
>  			buf_desc = smcd_new_buf_create(lgr, is_rmb, bufsize);
>  		else
> -			buf_desc = smcr_new_buf_create(lgr, is_rmb, bufsize);
> +			buf_desc = smcr_new_buf_create(conn, is_rmb, bufsize);
>  
>  		if (PTR_ERR(buf_desc) == -ENOMEM)
>  			break;
> -- 
> 2.32.0.3.g01195cf9f
> 


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-01-31  7:20 ` Leon Romanovsky
@ 2022-02-07  9:59   ` Tony Lu
  2022-02-07 13:49     ` Leon Romanovsky
From: Tony Lu @ 2022-02-07  9:59 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > Currently, pages are allocated in the process context, for its NUMA node
> > isn't equal to ibdev's, which is not the best policy for performance.
> > 
> > Applications will generally perform best when the processes are
> > accessing memory on the same NUMA node. When numa_balancing enabled
> > (which is enabled by most of OS distributions), it moves tasks closer to
> > the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> > to the same node usually. This reduces the latency when accessing remote
> > memory.
> 
> It is very subjective per-specific test. I would expect that
> application will control NUMA memory policies (set_mempolicy(), ...)
> by itself without kernel setting NUMA node.
> 
> Various *_alloc_node() APIs are applicable for in-kernel allocations
> where user can't control memory policy.
> 
> I don't know SMC-R enough, but if I judge from your description, this
> allocation is controlled by the application.

The original design of SMC doesn't handle memory allocation across
different NUMA nodes, and the application can't control the NUMA policy
used inside SMC.

SMC allocates memory on the NUMA node of the process context, which is
determined by the scheduler. If the application process runs on NUMA
node 0, SMC allocates on node 0, and so on; it all depends on the
scheduler. If the RDMA device is attached to node 1 but the process
runs on node 0, the memory is allocated on node 0.

This patch tries to allocate memory on the same NUMA node as the RDMA
device. Applications can't know the current node of the RDMA device,
but the scheduler knows the node of the memory and can let applications
run on the same node as the memory and the RDMA device.
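
To make the contrast concrete, here is a condensed sketch of the two
strategies; the wrapper name is made up for illustration, the real
change is in smcr_new_buf_create() as in the patch:

#include <rdma/ib_verbs.h>	/* ibdev_to_node() */
#include <linux/gfp.h>		/* alloc_pages(), alloc_pages_node() */

/* Illustrative helper, not actual SMC code. */
static struct page *smc_buf_alloc_pages(struct ib_device *ibdev,
					unsigned int order,
					bool on_ibdev_node)
{
	gfp_t gfp = GFP_KERNEL | __GFP_NOWARN | __GFP_NOMEMALLOC |
		    __GFP_COMP | __GFP_NORETRY | __GFP_ZERO;

	if (!on_ibdev_node)
		/* old behaviour: the node follows the calling process */
		return alloc_pages(gfp, order);

	/* new behaviour: the node follows the RDMA device,
	 * no matter where the connecting process runs */
	return alloc_pages_node(ibdev_to_node(ibdev), gfp, order);
}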

Thanks,
Tony Lu


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-02-07  9:59   ` Tony Lu
@ 2022-02-07 13:49     ` Leon Romanovsky
  2022-02-08  9:10       ` Stefan Raspl
From: Leon Romanovsky @ 2022-02-07 13:49 UTC (permalink / raw)
  To: Tony Lu; +Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> > On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > > Currently, pages are allocated in the process context, for its NUMA node
> > > isn't equal to ibdev's, which is not the best policy for performance.
> > > 
> > > Applications will generally perform best when the processes are
> > > accessing memory on the same NUMA node. When numa_balancing enabled
> > > (which is enabled by most of OS distributions), it moves tasks closer to
> > > the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> > > to the same node usually. This reduces the latency when accessing remote
> > > memory.
> > 
> > It is very subjective per-specific test. I would expect that
> > application will control NUMA memory policies (set_mempolicy(), ...)
> > by itself without kernel setting NUMA node.
> > 
> > Various *_alloc_node() APIs are applicable for in-kernel allocations
> > where user can't control memory policy.
> > 
> > I don't know SMC-R enough, but if I judge from your description, this
> > allocation is controlled by the application.
> 
> The original design of SMC doesn't handle the memory allocation of
> different NUMA node, and the application can't control the NUMA policy
> in SMC.
> 
> It allocates memory according to the NUMA node based on the process
> context, which is determined by the scheduler. If application process
> runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
> on the scheduler. If RDMA device is attached to node 1, the process runs
> on node 0, it allocates memory on node 0.
> 
> This patch tries to allocate memory on the same NUMA node of RDMA
> device. Applications can't know the current node of RDMA device. The
> scheduler knows the node of memory, and can let applications run on the
> same node of memory and RDMA device.

I don't know; everything explained above can be controlled through the
memory policy, where the application needs to run on the same node as
the ibdev.
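
For example, an application (or a wrapper launching it) could pin
itself to the ibdev's node via libnuma; the node number and function
name below are illustrative:

#include <numa.h>	/* libnuma; link with -lnuma */
#include <stdio.h>

/* Run the current task on the CPUs of 'node' and prefer its memory,
 * where 'node' stands for the NUMA node the ibdev is attached to.   */
int bind_to_ibdev_node(int node)
{
	if (numa_available() < 0) {
		fprintf(stderr, "NUMA is not available\n");
		return -1;
	}
	if (numa_run_on_node(node))
		return -1;
	numa_set_preferred(node);
	return 0;
}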

Thanks

> 
> Thanks,
> Tony Lu


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-02-07 13:49     ` Leon Romanovsky
@ 2022-02-08  9:10       ` Stefan Raspl
  2022-02-08  9:32         ` Leon Romanovsky
From: Stefan Raspl @ 2022-02-08  9:10 UTC (permalink / raw)
  To: Leon Romanovsky, Tony Lu
  Cc: kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On 2/7/22 14:49, Leon Romanovsky wrote:
> On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
>> On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
>>> On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
>>>> Currently, pages are allocated in the process context, for its NUMA node
>>>> isn't equal to ibdev's, which is not the best policy for performance.
>>>>
>>>> Applications will generally perform best when the processes are
>>>> accessing memory on the same NUMA node. When numa_balancing enabled
>>>> (which is enabled by most of OS distributions), it moves tasks closer to
>>>> the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
>>>> to the same node usually. This reduces the latency when accessing remote
>>>> memory.
>>>
>>> It is very subjective per-specific test. I would expect that
>>> application will control NUMA memory policies (set_mempolicy(), ...)
>>> by itself without kernel setting NUMA node.
>>>
>>> Various *_alloc_node() APIs are applicable for in-kernel allocations
>>> where user can't control memory policy.
>>>
>>> I don't know SMC-R enough, but if I judge from your description, this
>>> allocation is controlled by the application.
>>
>> The original design of SMC doesn't handle the memory allocation of
>> different NUMA node, and the application can't control the NUMA policy
>> in SMC.
>>
>> It allocates memory according to the NUMA node based on the process
>> context, which is determined by the scheduler. If application process
>> runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
>> on the scheduler. If RDMA device is attached to node 1, the process runs
>> on node 0, it allocates memory on node 0.
>>
>> This patch tries to allocate memory on the same NUMA node of RDMA
>> device. Applications can't know the current node of RDMA device. The
>> scheduler knows the node of memory, and can let applications run on the
>> same node of memory and RDMA device.
> 
> I don't know, everything explained above is controlled through memory
> policy, where application needs to run on same node as ibdev.

The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP 
applications. The idea is to avoid almost any modification to the application, 
just switch the address family. So while what you say makes a lot of sense for 
applications that intend to use RDMA, in the case of SMC-R we can safely assume 
that most if not all applications running it assume they get connectivity 
through a non-RDMA NIC. Hence we cannot expect the applications to think about 
aspects such as NUMA, and we should do the right thing within SMC-R.
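
To illustrate the drop-in idea, the only native change for a TCP
application is the address family passed to socket(); a minimal sketch
with error handling omitted:

#include <sys/socket.h>

#ifndef AF_SMC
#define AF_SMC 43	/* as defined in linux/socket.h */
#endif

int open_stream_socket(void)
{
	/* A plain TCP application would call:
	 *	socket(AF_INET, SOCK_STREAM, 0);
	 * For SMC, only the family changes; protocol 0 selects
	 * SMCPROTO_SMC (IPv4 addressing). bind/connect/send/recv
	 * stay exactly the same.                                 */
	return socket(AF_SMC, SOCK_STREAM, 0);
}

In practice, the smc-tools "smc_run" wrapper does this switch
transparently via an LD_PRELOAD library, so the application does not
need to be touched at all.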

Ciao,
Stefan


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-02-08  9:10       ` Stefan Raspl
@ 2022-02-08  9:32         ` Leon Romanovsky
  2022-02-09  8:00           ` Tony Lu
From: Leon Romanovsky @ 2022-02-08  9:32 UTC (permalink / raw)
  To: Stefan Raspl
  Cc: Tony Lu, kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Tue, Feb 08, 2022 at 10:10:55AM +0100, Stefan Raspl wrote:
> On 2/7/22 14:49, Leon Romanovsky wrote:
> > On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> > > On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> > > > On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > > > > Currently, pages are allocated in the process context, for its NUMA node
> > > > > isn't equal to ibdev's, which is not the best policy for performance.
> > > > > 
> > > > > Applications will generally perform best when the processes are
> > > > > accessing memory on the same NUMA node. When numa_balancing enabled
> > > > > (which is enabled by most of OS distributions), it moves tasks closer to
> > > > > the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> > > > > to the same node usually. This reduces the latency when accessing remote
> > > > > memory.
> > > > 
> > > > It is very subjective per-specific test. I would expect that
> > > > application will control NUMA memory policies (set_mempolicy(), ...)
> > > > by itself without kernel setting NUMA node.
> > > > 
> > > > Various *_alloc_node() APIs are applicable for in-kernel allocations
> > > > where user can't control memory policy.
> > > > 
> > > > I don't know SMC-R enough, but if I judge from your description, this
> > > > allocation is controlled by the application.
> > > 
> > > The original design of SMC doesn't handle the memory allocation of
> > > different NUMA node, and the application can't control the NUMA policy
> > > in SMC.
> > > 
> > > It allocates memory according to the NUMA node based on the process
> > > context, which is determined by the scheduler. If application process
> > > runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
> > > on the scheduler. If RDMA device is attached to node 1, the process runs
> > > on node 0, it allocates memory on node 0.
> > > 
> > > This patch tries to allocate memory on the same NUMA node of RDMA
> > > device. Applications can't know the current node of RDMA device. The
> > > scheduler knows the node of memory, and can let applications run on the
> > > same node of memory and RDMA device.
> > 
> > I don't know, everything explained above is controlled through memory
> > policy, where application needs to run on same node as ibdev.
> 
> The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
> applications. The idea is to avoid almost any modification to the
> application, just switch the address family. So while what you say makes a
> lot of sense for applications that intend to use RDMA, in the case of SMC-R
> we can safely assume that most if not all applications running it assume
> they get connectivity through a non-RDMA NIC. Hence we cannot expect the
> applications to think about aspects such as NUMA, and we should do the right
> thing within SMC-R.

And here comes the problem: you are doing the right thing for a very
specific and narrow use case, where the application and the ibdev run
on the same node. That is not true for multi-core systems, as the
application will be scheduled on the less loaded node (to put it very
simply).

In the general case, the application will get CPU and memory based on
scheduler heuristics, since you don't use a memory policy to restrict
it. The assumption that allocations need to be close to the ibdev and
not to the application can lead to worse performance.

Thanks

> 
> Ciao,
> Stefan


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-02-08  9:32         ` Leon Romanovsky
@ 2022-02-09  8:00           ` Tony Lu
  2022-02-09 10:10             ` Leon Romanovsky
From: Tony Lu @ 2022-02-09  8:00 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Stefan Raspl, kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Tue, Feb 08, 2022 at 11:32:23AM +0200, Leon Romanovsky wrote:
> On Tue, Feb 08, 2022 at 10:10:55AM +0100, Stefan Raspl wrote:
> > On 2/7/22 14:49, Leon Romanovsky wrote:
> > > On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> > > > On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> > > > > On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > > > > > Currently, pages are allocated in the process context, for its NUMA node
> > > > > > isn't equal to ibdev's, which is not the best policy for performance.
> > > > > > 
> > > > > > Applications will generally perform best when the processes are
> > > > > > accessing memory on the same NUMA node. When numa_balancing enabled
> > > > > > (which is enabled by most of OS distributions), it moves tasks closer to
> > > > > > the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> > > > > > to the same node usually. This reduces the latency when accessing remote
> > > > > > memory.
> > > > > 
> > > > > It is very subjective per-specific test. I would expect that
> > > > > application will control NUMA memory policies (set_mempolicy(), ...)
> > > > > by itself without kernel setting NUMA node.
> > > > > 
> > > > > Various *_alloc_node() APIs are applicable for in-kernel allocations
> > > > > where user can't control memory policy.
> > > > > 
> > > > > I don't know SMC-R enough, but if I judge from your description, this
> > > > > allocation is controlled by the application.
> > > > 
> > > > The original design of SMC doesn't handle the memory allocation of
> > > > different NUMA node, and the application can't control the NUMA policy
> > > > in SMC.
> > > > 
> > > > It allocates memory according to the NUMA node based on the process
> > > > context, which is determined by the scheduler. If application process
> > > > runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
> > > > on the scheduler. If RDMA device is attached to node 1, the process runs
> > > > on node 0, it allocates memory on node 0.
> > > > 
> > > > This patch tries to allocate memory on the same NUMA node of RDMA
> > > > device. Applications can't know the current node of RDMA device. The
> > > > scheduler knows the node of memory, and can let applications run on the
> > > > same node of memory and RDMA device.
> > > 
> > > I don't know, everything explained above is controlled through memory
> > > policy, where application needs to run on same node as ibdev.
> > 
> > The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
> > applications. The idea is to avoid almost any modification to the
> > application, just switch the address family. So while what you say makes a
> > lot of sense for applications that intend to use RDMA, in the case of SMC-R
> > we can safely assume that most if not all applications running it assume
> > they get connectivity through a non-RDMA NIC. Hence we cannot expect the
> > applications to think about aspects such as NUMA, and we should do the right
> > thing within SMC-R.
> 
> And here comes the problem, you are doing the right thing for very
> specific and narrow use case, where application and ibdev run on
> same node. It is not true for multi-core systems as application will
> be scheduled on less load node (in very simplistic form).
> 
> In general case, the application will get CPU and memory based on scheduler
> heuristic as you don't use memory policy to restrict it. The assumption
> that allocations need to be close to ibdev and not to applications can
> lead to worse performance.
> 

Yes, applications cannot run faster if they always access remote
memory. There are some complexities in SMC, which is why we chose to
bind the allocation to the RDMA device.

As Stefan mentioned, SMC is meant to be a drop-in replacement for TCP.
Most of the time SMC doesn't allocate memory for a new connection at
all; it has a link-group-level buffer reuse pool. Memory is only
allocated during connection setup, in process context or in a
(non-blocking) workqueue, when no buffer is available yet. After that,
the buffer is reused within the link group. The data operations
(send/recv) happen later, in processes woken up by the scheduler
(epoll). Also, binding the IRQs locally can help the process run on
the same node as the RDMA device. For example:

NUMA 0                | NUMA 1
// Application A      |
connect()             |
  smc_connect_rdma()  |
    smc_conn_create() |
      // create buffer|
      smc_buf_create()|
      ...             |
                      |
close()               |
  ...                 |
    // recycle buffer |
    smc_buf_unuse()   |
                      | // Application B
                      | connect()
                      |   smc_connect_rdma()
                      |     smc_conn_create()
                      |       // reuse buffer in NUMA 0
                      |       smc_buf_create()
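
A self-contained sketch of the flow above, with deliberately simplified
types and names (locking and error paths trimmed); the real logic lives
in __smc_buf_create() and its helpers:

#include <linux/list.h>
#include <linux/slab.h>
#include <linux/gfp.h>
#include <rdma/ib_verbs.h>

struct demo_buf {		/* simplified stand-in for smc_buf_desc */
	struct list_head list;
	struct page *pages;
	unsigned int order;
	bool used;
};

static struct demo_buf *demo_get_or_create_buf(struct list_head *lgr_bufs,
					       struct ib_device *ibdev,
					       unsigned int order)
{
	struct demo_buf *buf;

	/* 1. Reuse a free buffer already owned by the link group:
	 *    no new allocation, no NUMA decision to make.         */
	list_for_each_entry(buf, lgr_bufs, list) {
		if (!buf->used && buf->order == order) {
			buf->used = true;
			return buf;
		}
	}

	/* 2. Otherwise allocate once, on the NUMA node of the RDMA
	 *    device rather than of the connecting process.         */
	buf = kzalloc(sizeof(*buf), GFP_KERNEL);
	if (!buf)
		return NULL;
	buf->order = order;
	buf->used = true;
	buf->pages = alloc_pages_node(ibdev_to_node(ibdev),
				      GFP_KERNEL | __GFP_ZERO, order);
	if (!buf->pages) {
		kfree(buf);
		return NULL;
	}
	list_add(&buf->list, lgr_bufs);
	return buf;
}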

Thanks,
Tony Lu


* Re: [PATCH net-next] net/smc: Allocate pages of SMC-R on ibdev NUMA node
  2022-02-09  8:00           ` Tony Lu
@ 2022-02-09 10:10             ` Leon Romanovsky
From: Leon Romanovsky @ 2022-02-09 10:10 UTC (permalink / raw)
  To: Tony Lu
  Cc: Stefan Raspl, kgraul, kuba, davem, netdev, linux-s390, RDMA mailing list

On Wed, Feb 09, 2022 at 04:00:34PM +0800, Tony Lu wrote:
> On Tue, Feb 08, 2022 at 11:32:23AM +0200, Leon Romanovsky wrote:
> > On Tue, Feb 08, 2022 at 10:10:55AM +0100, Stefan Raspl wrote:
> > > On 2/7/22 14:49, Leon Romanovsky wrote:
> > > > On Mon, Feb 07, 2022 at 05:59:58PM +0800, Tony Lu wrote:
> > > > > On Mon, Jan 31, 2022 at 09:20:52AM +0200, Leon Romanovsky wrote:
> > > > > > On Mon, Jan 31, 2022 at 03:03:00AM +0800, Tony Lu wrote:
> > > > > > > Currently, pages are allocated in the process context, for its NUMA node
> > > > > > > isn't equal to ibdev's, which is not the best policy for performance.
> > > > > > > 
> > > > > > > Applications will generally perform best when the processes are
> > > > > > > accessing memory on the same NUMA node. When numa_balancing enabled
> > > > > > > (which is enabled by most of OS distributions), it moves tasks closer to
> > > > > > > the memory of sndbuf or rmb and ibdev, meanwhile, the IRQs of ibdev bind
> > > > > > > to the same node usually. This reduces the latency when accessing remote
> > > > > > > memory.
> > > > > > 
> > > > > > It is very subjective per-specific test. I would expect that
> > > > > > application will control NUMA memory policies (set_mempolicy(), ...)
> > > > > > by itself without kernel setting NUMA node.
> > > > > > 
> > > > > > Various *_alloc_node() APIs are applicable for in-kernel allocations
> > > > > > where user can't control memory policy.
> > > > > > 
> > > > > > I don't know SMC-R enough, but if I judge from your description, this
> > > > > > allocation is controlled by the application.
> > > > > 
> > > > > The original design of SMC doesn't handle the memory allocation of
> > > > > different NUMA node, and the application can't control the NUMA policy
> > > > > in SMC.
> > > > > 
> > > > > It allocates memory according to the NUMA node based on the process
> > > > > context, which is determined by the scheduler. If application process
> > > > > runs on NUMA node 0, SMC allocates on node 0 and so on, it all depends
> > > > > on the scheduler. If RDMA device is attached to node 1, the process runs
> > > > > on node 0, it allocates memory on node 0.
> > > > > 
> > > > > This patch tries to allocate memory on the same NUMA node of RDMA
> > > > > device. Applications can't know the current node of RDMA device. The
> > > > > scheduler knows the node of memory, and can let applications run on the
> > > > > same node of memory and RDMA device.
> > > > 
> > > > I don't know, everything explained above is controlled through memory
> > > > policy, where application needs to run on same node as ibdev.
> > > 
> > > The purpose of SMC-R is to provide a drop-in replacement for existing TCP/IP
> > > applications. The idea is to avoid almost any modification to the
> > > application, just switch the address family. So while what you say makes a
> > > lot of sense for applications that intend to use RDMA, in the case of SMC-R
> > > we can safely assume that most if not all applications running it assume
> > > they get connectivity through a non-RDMA NIC. Hence we cannot expect the
> > > applications to think about aspects such as NUMA, and we should do the right
> > > thing within SMC-R.
> > 
> > And here comes the problem, you are doing the right thing for very
> > specific and narrow use case, where application and ibdev run on
> > same node. It is not true for multi-core systems as application will
> > be scheduled on less load node (in very simplistic form).
> > 
> > In general case, the application will get CPU and memory based on scheduler
> > heuristic as you don't use memory policy to restrict it. The assumption
> > that allocations need to be close to ibdev and not to applications can
> > lead to worse performance.
> > 
> 
> Yes, the applications cannot run faster if they always access remote
> memory. There are something complex in SMC, so choose to bind to the
> RDMA device.
> 
> As Stefan mentioned, SMC is to provide a drop-in replacement for TCP.

If I'm looking at the right piece of code (net/core/skbuff.c:build_skb),
even the SKB is not allocated close to the ethernet device. I'm not
convinced that SMC should be different here.

Thanks

