All of lore.kernel.org
* [PATCH] RDS: sync congestion map updating
@ 2016-03-30  9:08 Wengang Wang
       [not found] ` <1459328902-31968-1-git-send-email-wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Wengang Wang @ 2016-03-30  9:08 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA

A problem was found where, among many parallel RDS communications, some hang.
In my test, ten or so out of 33 communications hung. The send requests got an
-ENOBUFS error, meaning the peer socket (port) is congested, even though the
peer socket (port) is not actually congested.

The congestion map update can happen on two paths: one is the rds_recvmsg path,
and the other is when packets are received from the hardware. There is no
synchronization when updating the congestion map, so a bit operation (a clear)
on the rds_recvmsg path can be overwritten and lost to another bit operation
(a set) on the hardware packet receiving path.

The fix is to add a spinlock per congestion map to synchronize updates to it.
No performance drop was found while testing the fix.

Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 net/rds/cong.c | 7 +++++++
 net/rds/rds.h  | 1 +
 2 files changed, 8 insertions(+)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..7afc1bf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -144,6 +144,7 @@ static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
 	if (!map)
 		return NULL;
 
+	spin_lock_init(&map->m_lock);
 	map->m_addr = addr;
 	init_waitqueue_head(&map->m_waitq);
 	INIT_LIST_HEAD(&map->m_conn_list);
@@ -292,6 +293,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
 {
 	unsigned long i;
 	unsigned long off;
+	unsigned long flags;
 
 	rdsdebug("setting congestion for %pI4:%u in map %p\n",
 	  &map->m_addr, ntohs(port), map);
@@ -299,13 +301,16 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
+	spin_lock_irqsave(&map->m_lock, flags);
 	__set_bit_le(off, (void *)map->m_page_addrs[i]);
+	spin_unlock_irqrestore(&map->m_lock, flags);
 }
 
 void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
 {
 	unsigned long i;
 	unsigned long off;
+	unsigned long flags;
 
 	rdsdebug("clearing congestion for %pI4:%u in map %p\n",
 	  &map->m_addr, ntohs(port), map);
@@ -313,7 +318,9 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
 	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
 	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
 
+	spin_lock_irqsave(&map->m_lock, flags);
 	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
+	spin_unlock_irqrestore(&map->m_lock, flags);
 }
 
 static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 80256b0..f359cf8 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -59,6 +59,7 @@ struct rds_cong_map {
 	__be32			m_addr;
 	wait_queue_head_t	m_waitq;
 	struct list_head	m_conn_list;
+	spinlock_t		m_lock;
 	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
 };
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH] RDS: sync congestion map updating
       [not found] ` <1459328902-31968-1-git-send-email-wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-03-30 16:19   ` Leon Romanovsky
       [not found]     ` <20160330161952.GA2670-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Romanovsky @ 2016-03-30 16:19 UTC (permalink / raw)
  To: Wengang Wang; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
> Problem is found that some among a lot of parallel RDS communications hang.
> In my test ten or so among 33 communications hang. The send requests got
> -ENOBUF error meaning the peer socket (port) is congested. But meanwhile,
> peer socket (port) is not congested.
> 
> The congestion map updating can happen in two paths: one is in rds_recvmsg path
> and the other is when it receives packets from the hardware. There is no
> synchronization when updating the congestion map. So a bit operation (clearing)
> in the rds_recvmsg path can be skipped by another bit operation (setting) in
> hardware packet receiving path.
> 
> Fix is to add a spin lock per congestion map to sync the update on it.
> No performance drop found during the test for the fix.

I assume that this change fixed your issue; however, it looks suspicious
that the performance didn't change.

> 
> Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
>  net/rds/cong.c | 7 +++++++
>  net/rds/rds.h  | 1 +
>  2 files changed, 8 insertions(+)

According to the get_maintainer script, you sent this patch to the wrong
lists and people.

➜  linux git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
"David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> (maintainer:NETWORKING [GENERAL])
netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-devel-N0ozoZBvEnrZJqsBc5GL+g@public.gmane.org (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list)

> 
> diff --git a/net/rds/cong.c b/net/rds/cong.c
> index e6144b8..7afc1bf 100644
> --- a/net/rds/cong.c
> +++ b/net/rds/cong.c
> @@ -144,6 +144,7 @@ static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
>  	if (!map)
>  		return NULL;
>  
> +	spin_lock_init(&map->m_lock);
>  	map->m_addr = addr;
>  	init_waitqueue_head(&map->m_waitq);
>  	INIT_LIST_HEAD(&map->m_conn_list);
> @@ -292,6 +293,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
>  {
>  	unsigned long i;
>  	unsigned long off;
> +	unsigned long flags;
>  
>  	rdsdebug("setting congestion for %pI4:%u in map %p\n",
>  	  &map->m_addr, ntohs(port), map);
> @@ -299,13 +301,16 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
>  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
>  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
>  
> +	spin_lock_irqsave(&map->m_lock, flags);
>  	__set_bit_le(off, (void *)map->m_page_addrs[i]);
> +	spin_unlock_irqrestore(&map->m_lock, flags);
>  }
>  
>  void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
>  {
>  	unsigned long i;
>  	unsigned long off;
> +	unsigned long flags;
>  
>  	rdsdebug("clearing congestion for %pI4:%u in map %p\n",
>  	  &map->m_addr, ntohs(port), map);
> @@ -313,7 +318,9 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
>  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
>  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
>  
> +	spin_lock_irqsave(&map->m_lock, flags);
>  	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
> +	spin_unlock_irqrestore(&map->m_lock, flags);
>  }
>  
>  static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
> diff --git a/net/rds/rds.h b/net/rds/rds.h
> index 80256b0..f359cf8 100644
> --- a/net/rds/rds.h
> +++ b/net/rds/rds.h
> @@ -59,6 +59,7 @@ struct rds_cong_map {
>  	__be32			m_addr;
>  	wait_queue_head_t	m_waitq;
>  	struct list_head	m_conn_list;
> +	spinlock_t		m_lock;
>  	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
>  };
>  
> -- 
> 2.1.0
> 


* Re: [PATCH] RDS: sync congestion map updating
       [not found]     ` <20160330161952.GA2670-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-03-30 17:16       ` santosh shilimkar
       [not found]         ` <56FC09D6.7090602-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2016-03-31  1:24       ` Wengang Wang
  1 sibling, 1 reply; 9+ messages in thread
From: santosh shilimkar @ 2016-03-30 17:16 UTC (permalink / raw)
  To: Wengang Wang, linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: leon-2ukJVAZIZ/Y

Hi Wengang,

On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>> Problem is found that some among a lot of parallel RDS communications hang.
>> In my test ten or so among 33 communications hang. The send requests got
>> -ENOBUF error meaning the peer socket (port) is congested. But meanwhile,
>> peer socket (port) is not congested.
>>
>> The congestion map updating can happen in two paths: one is in rds_recvmsg path
>> and the other is when it receives packets from the hardware. There is no
>> synchronization when updating the congestion map. So a bit operation (clearing)
>> in the rds_recvmsg path can be skipped by another bit operation (setting) in
>> hardware packet receiving path.
>>
>> Fix is to add a spin lock per congestion map to sync the update on it.
>> No performance drop found during the test for the fix.
>
> I assume that this change fixed your issue, however it looks suspicious
> that the performance didn't change.
>
First of all, thanks for finding the issue and posting a patch
for it. I do agree with Leon's performance comment.
We shouldn't need locks for map updates.

Moreover, the parallel receive path on which this patch
is based doesn't exist in the upstream code. I have kept
that out so far because of issues similar to the one you
encountered.

Anyway, let's discuss the fix offline, even for the
downstream kernel. I suspect we can address it without locks.

Regards,
Santosh


* Re: [PATCH] RDS: sync congestion map updating
       [not found]     ` <20160330161952.GA2670-2ukJVAZIZ/Y@public.gmane.org>
  2016-03-30 17:16       ` santosh shilimkar
@ 2016-03-31  1:24       ` Wengang Wang
  1 sibling, 0 replies; 9+ messages in thread
From: Wengang Wang @ 2016-03-31  1:24 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Leon,

在 2016年03月31日 00:19, Leon Romanovsky 写道:
> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>> Problem is found that some among a lot of parallel RDS communications hang.
>> In my test ten or so among 33 communications hang. The send requests got
>> -ENOBUF error meaning the peer socket (port) is congested. But meanwhile,
>> peer socket (port) is not congested.
>>
>> The congestion map updating can happen in two paths: one is in rds_recvmsg path
>> and the other is when it receives packets from the hardware. There is no
>> synchronization when updating the congestion map. So a bit operation (clearing)
>> in the rds_recvmsg path can be skipped by another bit operation (setting) in
>> hardware packet receiving path.
>>
>> Fix is to add a spin lock per congestion map to sync the update on it.
>> No performance drop found during the test for the fix.
> I assume that this change fixed your issue, however it looks suspicious
> that the performance didn't change.
Sure, I verified that the patch fixes the issue.
For performance, I will reply to Santosh's email later; please check there.
>> Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> ---
>>   net/rds/cong.c | 7 +++++++
>>   net/rds/rds.h  | 1 +
>>   2 files changed, 8 insertions(+)
> According to the get_maintainer script, you sent this patch to the
> wrong lists and people.
>
> ➜  linux git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
> Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> (supporter:RDS - RELIABLE DATAGRAM SOCKETS)
> "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> (maintainer:NETWORKING [GENERAL])
> netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)

So linux-rdma is here :)

thanks,
wengang
> rds-devel-N0ozoZBvEnrZJqsBc5GL+g@public.gmane.org (moderated list:RDS - RELIABLE DATAGRAM SOCKETS)
> linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list)
>
>> diff --git a/net/rds/cong.c b/net/rds/cong.c
>> index e6144b8..7afc1bf 100644
>> --- a/net/rds/cong.c
>> +++ b/net/rds/cong.c
>> @@ -144,6 +144,7 @@ static struct rds_cong_map *rds_cong_from_addr(__be32 addr)
>>   	if (!map)
>>   		return NULL;
>>   
>> +	spin_lock_init(&map->m_lock);
>>   	map->m_addr = addr;
>>   	init_waitqueue_head(&map->m_waitq);
>>   	INIT_LIST_HEAD(&map->m_conn_list);
>> @@ -292,6 +293,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
>>   {
>>   	unsigned long i;
>>   	unsigned long off;
>> +	unsigned long flags;
>>   
>>   	rdsdebug("setting congestion for %pI4:%u in map %p\n",
>>   	  &map->m_addr, ntohs(port), map);
>> @@ -299,13 +301,16 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
>>   	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
>>   	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
>>   
>> +	spin_lock_irqsave(&map->m_lock, flags);
>>   	__set_bit_le(off, (void *)map->m_page_addrs[i]);
>> +	spin_unlock_irqrestore(&map->m_lock, flags);
>>   }
>>   
>>   void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
>>   {
>>   	unsigned long i;
>>   	unsigned long off;
>> +	unsigned long flags;
>>   
>>   	rdsdebug("clearing congestion for %pI4:%u in map %p\n",
>>   	  &map->m_addr, ntohs(port), map);
>> @@ -313,7 +318,9 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
>>   	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
>>   	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;
>>   
>> +	spin_lock_irqsave(&map->m_lock, flags);
>>   	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
>> +	spin_unlock_irqrestore(&map->m_lock, flags);
>>   }
>>   
>>   static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
>> diff --git a/net/rds/rds.h b/net/rds/rds.h
>> index 80256b0..f359cf8 100644
>> --- a/net/rds/rds.h
>> +++ b/net/rds/rds.h
>> @@ -59,6 +59,7 @@ struct rds_cong_map {
>>   	__be32			m_addr;
>>   	wait_queue_head_t	m_waitq;
>>   	struct list_head	m_conn_list;
>> +	spinlock_t		m_lock;
>>   	unsigned long		m_page_addrs[RDS_CONG_MAP_PAGES];
>>   };
>>   
>> -- 
>> 2.1.0
>>


* Re: [PATCH] RDS: sync congestion map updating
       [not found]         ` <56FC09D6.7090602-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-03-31  1:51           ` Wengang Wang
       [not found]             ` <56FC82B7.3070504-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Wengang Wang @ 2016-03-31  1:51 UTC (permalink / raw)
  To: santosh shilimkar, linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: leon-2ukJVAZIZ/Y



在 2016年03月31日 01:16, santosh shilimkar 写道:
> Hi Wengang,
>
> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>> Problem is found that some among a lot of parallel RDS 
>>> communications hang.
>>> In my test ten or so among 33 communications hang. The send requests 
>>> got
>>> -ENOBUF error meaning the peer socket (port) is congested. But 
>>> meanwhile,
>>> peer socket (port) is not congested.
>>>
>>> The congestion map updating can happen in two paths: one is in 
>>> rds_recvmsg path
>>> and the other is when it receives packets from the hardware. There 
>>> is no
>>> synchronization when updating the congestion map. So a bit operation 
>>> (clearing)
>>> in the rds_recvmsg path can be skipped by another bit operation 
>>> (setting) in
>>> hardware packet receiving path.
>>>
>>> Fix is to add a spin lock per congestion map to sync the update on it.
>>> No performance drop found during the test for the fix.
>>
>> I assume that this change fixed your issue, however it looks suspicious
>> that the performance didn't change.
>>
> First of all thanks for finding the issue and posting patch
> for it. I do agree with Leon on performance comment.
> We shouldn't need locks for map updates.
>
Here is the performance data I collected yesterday.
Settings:
net.core.rmem_default = 4194304
net.core.wmem_default = 262144
net.core.rmem_max = 4194304
net.core.wmem_max = 2097152

test case:  rds-stress -s 192.168.111.16 -q 1m -d 10 -T 300 -t 10
With 1M-sized sends, the 10 pending send requests are enough to trigger
congestion on the receiver side, and the test lasts 5 minutes.

The result is like this:
without patch:
sender
10   2231   2355 4697759.63       0.00       0.00  473.38 19123.89 -1.00  (average)
receiver
10   2356   2231 4698350.06       0.00       0.00  486.28 18537.23 -1.00  (average)

with patch applied:
sender
10   2230   2396 47x.53       0.00       0.00  475.87 31954.35 -1.00  (average)
receiver
10   2396   2230 4738051.76       0.00       0.00  480.85 18408.13 -1.00  (average)

So I don't see a performance drop. In a previous test the result was
reversed, that is, it was faster with the patch not applied, but the
numbers were still 47xxxxx vs 46xxxxx. So I don't have a very stable
test result, but on average there is no obvious performance drop.

Let me try to explain in theory:
First, on both the rds_recvmsg path and the hardware data-receive path,
we take rds_sock->rs_recv_lock (which by itself is not enough to fix our
issue, since there can be many different rds_socks) very shortly before
we lock the congestion map, so the performance cost of refilling the
CPU cache is small.
Second, though the problem exists, the malformed map may not occur that
frequently, especially for this test case with 10 parallel
communications.

> Moreover the parallel receive path on which this patch
> is based doesn't exist in upstream code. I have kept
> that out so far because of similar issue like one you
> encountered.
But I don't see how the rds_recvmsg path is different from the UEK
kernels. Can you explain more, here or offline?

>
> Anyway, let's discuss the fix offline, even for the
> downstream kernel. I suspect we can address it without locks.
>
If we have no performance issue in normal use (and until we find an
important use case that locking would hurt), I think locking is fine.
Well, what ideas do you have for avoiding locks? After all, we are
updating an 8KB bitmap, not a single u64 or smaller variable. Whether
we use a lock or not, we need to make sure the bits being updated
can't be stale in different CPUs' caches.

thanks,
wengang

> Regards,
> Santosh



* Re: [PATCH] RDS: sync congestion map updating
       [not found]             ` <56FC82B7.3070504-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-03-31  2:59               ` Wengang Wang
       [not found]                 ` <56FC927E.9090404-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Wengang Wang @ 2016-03-31  2:59 UTC (permalink / raw)
  To: santosh shilimkar, linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: leon-2ukJVAZIZ/Y



在 2016年03月31日 09:51, Wengang Wang 写道:
>
>
> 在 2016年03月31日 01:16, santosh shilimkar 写道:
>> Hi Wengang,
>>
>> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>>> Problem is found that some among a lot of parallel RDS 
>>>> communications hang.
>>>> In my test ten or so among 33 communications hang. The send 
>>>> requests got
>>>> -ENOBUF error meaning the peer socket (port) is congested. But 
>>>> meanwhile,
>>>> peer socket (port) is not congested.
>>>>
>>>> The congestion map updating can happen in two paths: one is in 
>>>> rds_recvmsg path
>>>> and the other is when it receives packets from the hardware. There 
>>>> is no
>>>> synchronization when updating the congestion map. So a bit 
>>>> operation (clearing)
>>>> in the rds_recvmsg path can be skipped by another bit operation 
>>>> (setting) in
>>>> hardware packet receiving path.
>>>>

To be more detailed: here, the two paths (a user calling recvmsg and
the hardware receiving data) can be for different rds socks, so
rds_sock->rs_recv_lock does not help to synchronize the updates to the
congestion map.

thanks,
wengang
>>>> Fix is to add a spin lock per congestion map to sync the update on it.
>>>> No performance drop found during the test for the fix.
>>>
>>> I assume that this change fixed your issue, however it looks suspicious
>>> that the performance didn't change.
>>>
>> First of all thanks for finding the issue and posting patch
>> for it. I do agree with Leon on performance comment.
>> We shouldn't need locks for map updates.
>>
> Here is the performance data I collected yesterday.
> Settings:
> net.core.rmem_default = 4194304
> net.core.wmem_default = 262144
> net.core.rmem_max = 4194304
> net.core.wmem_max = 2097152
>
> test case:  rds-stress -s 192.168.111.16 -q 1m -d 10 -T 300 -t 10
> With 1M size sends, the 10 pending send request is enough to trigger 
> the congestion on receiver side. And the test last 5 mins.
>
> result is like this:
> without patch:
> sender
> 10   2231   2355 4697759.63       0.00       0.00  473.38 19123.89 -1.00  (average)
> receiver
> 10   2356   2231 4698350.06       0.00       0.00  486.28 18537.23 -1.00  (average)
>
> with patch applied:
> sender
> 10   2230   2396 47x.53       0.00       0.00  475.87 31954.35 -1.00  (average)
> receiver
> 10   2396   2230 4738051.76       0.00       0.00  480.85 18408.13 -1.00  (average)
>
> So I don't see a performance drop. In a previous test the result was
> reversed, that is, it was faster with the patch not applied, but the
> numbers were still 47xxxxx vs 46xxxxx. So I don't have a very stable
> test result, but on average there is no obvious performance drop.
>
> Let me try to explain in theory:
> First, on both the rds_recvmsg path and the hardware data-receive path,
> we take rds_sock->rs_recv_lock (which by itself is not enough to fix
> our issue, since there can be many different rds_socks) very shortly
> before we lock the congestion map, so the performance cost of refilling
> the CPU cache is small.
> Second, though the problem exists, the malformed map may not occur that
> frequently, especially for this test case with 10 parallel
> communications.
>
>> Moreover the parallel receive path on which this patch
>> is based doesn't exist in upstream code. I have kept
>> that out so far because of similar issue like one you
>> encountered.
> But I don't see how the rds_recvmsg path is different from the UEK
> kernels. Can you explain more, here or offline?
>
>>
>> Anyway, let's discuss the fix offline, even for the
>> downstream kernel. I suspect we can address it without locks.
>>
> If we have no performance issue in normal use (and until we find an
> important use case that locking would hurt), I think locking is fine.
> Well, what ideas do you have for avoiding locks? After all, we are
> updating an 8KB bitmap, not a single u64 or smaller variable. Whether
> we use a lock or not, we need to make sure the bits being updated
> can't be stale in different CPUs' caches.
>
> thanks,
> wengang
>
>> Regards,
>> Santosh
>


* Re: [PATCH] RDS: sync congestion map updating
       [not found]                 ` <56FC927E.9090404-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-04-01 19:47                   ` santosh shilimkar
  2016-04-02  1:14                     ` Leon Romanovsky
  0 siblings, 1 reply; 9+ messages in thread
From: santosh shilimkar @ 2016-04-01 19:47 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA
  Cc: Wengang Wang, leon-2ukJVAZIZ/Y, netdev-u79uwXL29TY76Z2rM5mHXA

(cc-ing netdev)
On 3/30/2016 7:59 PM, Wengang Wang wrote:
>
>
> 在 2016年03月31日 09:51, Wengang Wang 写道:
>>
>>
>> 在 2016年03月31日 01:16, santosh shilimkar 写道:
>>> Hi Wengang,
>>>
>>> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>>>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>>>> Problem is found that some among a lot of parallel RDS
>>>>> communications hang.
>>>>> In my test ten or so among 33 communications hang. The send
>>>>> requests got
>>>>> -ENOBUF error meaning the peer socket (port) is congested. But
>>>>> meanwhile,
>>>>> peer socket (port) is not congested.
>>>>>
>>>>> The congestion map updating can happen in two paths: one is in
>>>>> rds_recvmsg path
>>>>> and the other is when it receives packets from the hardware. There
>>>>> is no
>>>>> synchronization when updating the congestion map. So a bit
>>>>> operation (clearing)
>>>>> in the rds_recvmsg path can be skipped by another bit operation
>>>>> (setting) in
>>>>> hardware packet receiving path.
>>>>>
>
> To be more detailed.  Here, the two paths (user calls recvmsg and
> hardware receives data) are for different rds socks. thus the
> rds_sock->rs_recv_lock is not helpful to sync the updating on congestion
> map.
>
For archival purposes, let me conclude the thread. I synced
with Wengang off-list and came up with the fix below. I was under
the impression that __set_bit_le() was the atomic version. After fixing
it as in the patch (at the end of the email), the bug is addressed.

I will probably send this as a fix for stable as well.


 From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
From: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date: Wed, 30 Mar 2016 23:26:47 -0700
Subject: [PATCH] RDS: Fix the atomicity for congestion map update

Two different threads with different rds sockets may be in
rds_recv_rcvbuf_delta() via the receive path. If their ports
both map to the same word in the congestion map, then
using non-atomic ops to update it could cause the map to
be incorrect. Let's use atomics to avoid such an issue.

Full credit to Wengang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> for
finding the issue, analysing it, and pointing to the
offending code with a spinlock-based fix.

Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
  net/rds/cong.c |    4 ++--
  1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/cong.c b/net/rds/cong.c
index e6144b8..6641bcf 100644
--- a/net/rds/cong.c
+++ b/net/rds/cong.c
@@ -299,7 +299,7 @@ void rds_cong_set_bit(struct rds_cong_map *map, __be16 port)
  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;

-	__set_bit_le(off, (void *)map->m_page_addrs[i]);
+	set_bit_le(off, (void *)map->m_page_addrs[i]);
  }

  void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
@@ -313,7 +313,7 @@ void rds_cong_clear_bit(struct rds_cong_map *map, __be16 port)
  	i = be16_to_cpu(port) / RDS_CONG_MAP_PAGE_BITS;
  	off = be16_to_cpu(port) % RDS_CONG_MAP_PAGE_BITS;

-	__clear_bit_le(off, (void *)map->m_page_addrs[i]);
+	clear_bit_le(off, (void *)map->m_page_addrs[i]);
  }

  static int rds_cong_test_bit(struct rds_cong_map *map, __be16 port)
-- 
1.7.1



* Re: [PATCH] RDS: sync congestion map updating
  2016-04-01 19:47                   ` santosh shilimkar
@ 2016-04-02  1:14                     ` Leon Romanovsky
       [not found]                       ` <20160402011459.GC8565-2ukJVAZIZ/Y@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Leon Romanovsky @ 2016-04-02  1:14 UTC (permalink / raw)
  To: santosh shilimkar; +Cc: linux-rdma, Wengang Wang, netdev

On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote:
> (cc-ing netdev)
> On 3/30/2016 7:59 PM, Wengang Wang wrote:
> >
> >
> >在 2016年03月31日 09:51, Wengang Wang 写道:
> >>
> >>
> >>在 2016年03月31日 01:16, santosh shilimkar 写道:
> >>>Hi Wengang,
> >>>
> >>>On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
> >>>>On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
> >>>>>Problem is found that some among a lot of parallel RDS
> >>>>>communications hang.
> >>>>>In my test ten or so among 33 communications hang. The send
> >>>>>requests got
> >>>>>-ENOBUF error meaning the peer socket (port) is congested. But
> >>>>>meanwhile,
> >>>>>peer socket (port) is not congested.
> >>>>>
> >>>>>The congestion map updating can happen in two paths: one is in
> >>>>>rds_recvmsg path
> >>>>>and the other is when it receives packets from the hardware. There
> >>>>>is no
> >>>>>synchronization when updating the congestion map. So a bit
> >>>>>operation (clearing)
> >>>>>in the rds_recvmsg path can be skipped by another bit operation
> >>>>>(setting) in
> >>>>>hardware packet receiving path.
> >>>>>
> >
> >To be more detailed.  Here, the two paths (user calls recvmsg and
> >hardware receives data) are for different rds socks. thus the
> >rds_sock->rs_recv_lock is not helpful to sync the updating on congestion
> >map.
> >
> For archive purpose, let me try to conclude the thread. I synced
> with Wengang offlist and came up with below fix. I was under
> impression that __set_bit_le() was the atomic version. After fixing
> it like patch(end of the email), the bug gets addressed.
> 
> I will probably send this as fix for stable as well.
> 
> 
> From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
> From: Santosh Shilimkar <santosh.shilimkar@oracle.com>
> Date: Wed, 30 Mar 2016 23:26:47 -0700
> Subject: [PATCH] RDS: Fix the atomicity for congestion map update
> 
> Two different threads with different rds sockets may be in
> rds_recv_rcvbuf_delta() via receive path. If their ports
> both map to the same word in the congestion map, then
> using non-atomic ops to update it could cause the map to
> be incorrect. Let's use atomics to avoid such an issue.
> 
> Full credit to Wengang <wen.gang.wang@oracle.com> for
> finding the issue, analysing it and also pointing out
> to offending code with spin lock based fix.

I'm glad that you solved the issue without spinlocks.
Out of curiosity, I see that this patch needs to be sent
to Dave and applied by him. Is that right?

➜  linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
Santosh Shilimkar <santosh.shilimkar@oracle.com> (supporter:RDS -
RELIABLE DATAGRAM SOCKETS)
"David S. Miller" <davem@davemloft.net> (maintainer:NETWORKING
[GENERAL])
netdev@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
linux-rdma@vger.kernel.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
rds-devel@oss.oracle.com (moderated list:RDS - RELIABLE DATAGRAM
SOCKETS)
linux-kernel@vger.kernel.org (open list)

> 
> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com>
> Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>

Reviewed-by: Leon Romanovsky <leon@leon.nu>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] RDS: sync congestion map updating
       [not found]                       ` <20160402011459.GC8565-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-04-02  4:30                         ` santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA
  0 siblings, 0 replies; 9+ messages in thread
From: santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA @ 2016-04-02  4:30 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Wengang Wang,
	netdev-u79uwXL29TY76Z2rM5mHXA



On 4/1/16 6:14 PM, Leon Romanovsky wrote:
> On Fri, Apr 01, 2016 at 12:47:24PM -0700, santosh shilimkar wrote:
>> (cc-ing netdev)
>> On 3/30/2016 7:59 PM, Wengang Wang wrote:
>>>
>>>
>>> 在 2016年03月31日 09:51, Wengang Wang 写道:
>>>>
>>>>
>>>> 在 2016年03月31日 01:16, santosh shilimkar 写道:
>>>>> Hi Wengang,
>>>>>
>>>>> On 3/30/2016 9:19 AM, Leon Romanovsky wrote:
>>>>>> On Wed, Mar 30, 2016 at 05:08:22PM +0800, Wengang Wang wrote:
>>>>>>> The problem found is that some among many parallel RDS
>>>>>>> communications hang. In my test, ten or so out of 33
>>>>>>> communications hang. The send requests get an -ENOBUF
>>>>>>> error, meaning the peer socket (port) is congested; but
>>>>>>> meanwhile the peer socket (port) is not congested.
>>>>>>>
>>>>>>> The congestion map update can happen on two paths: one is
>>>>>>> the rds_recvmsg path, and the other is when packets are
>>>>>>> received from the hardware. There is no synchronization
>>>>>>> when updating the congestion map, so a bit operation
>>>>>>> (clearing) on the rds_recvmsg path can be lost to another
>>>>>>> bit operation (setting) on the hardware packet receiving
>>>>>>> path.
>>>>>>>
>>>
>>> To be more detailed: here the two paths (the user calling recvmsg
>>> and the hardware receiving data) belong to different rds socks,
>>> thus rds_sock->rs_recv_lock does not help synchronize updates to
>>> the congestion map.
>>>
>> For archival purposes, let me try to conclude the thread. I synced
>> with Wengang off-list and came up with the fix below. I was under
>> the impression that __set_bit_le() was the atomic version. After
>> fixing it as in the patch (end of the email), the bug is addressed.
>>
>> I will probably send this as fix for stable as well.
>>
>>
>>  From 5614b61f6fdcd6ae0c04e50b97efd13201762294 Mon Sep 17 00:00:00 2001
>> From: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Date: Wed, 30 Mar 2016 23:26:47 -0700
>> Subject: [PATCH] RDS: Fix the atomicity for congestion map update
>>
>> Two different threads with different rds sockets may be in
>> rds_recv_rcvbuf_delta() via the receive path. If their ports
>> both map to the same word in the congestion map, then
>> using non-atomic ops to update it could cause the map to
>> be incorrect. Let's use atomics to avoid such an issue.
>>
>> Full credit to Wengang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> for
>> finding the issue, analysing it, and also pointing to
>> the offending code with a spinlock-based fix.
>
> I'm glad that you solved the issue without spinlocks.
> Out of curiosity, I see that this patch needs to be sent
> to Dave and applied by him. Is that right?
>
Right. I was planning to send this one along with one more
fix on netdev for Dave to pick up.

> ➜  linus-tree git:(master) ./scripts/get_maintainer.pl -f net/rds/cong.c
> Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> (supporter:RDS -
> RELIABLE DATAGRAM SOCKETS)
> "David S. Miller" <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org> (maintainer:NETWORKING
> [GENERAL])
> netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
> linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list:RDS - RELIABLE DATAGRAM SOCKETS)
> rds-devel-N0ozoZBvEnrZJqsBc5GL+g@public.gmane.org (moderated list:RDS - RELIABLE DATAGRAM
> SOCKETS)
> linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org (open list)
>
>>
>> Signed-off-by: Wengang Wang <wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>> Signed-off-by: Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
>
> Reviewed-by: Leon Romanovsky <leon-2ukJVAZIZ/Y@public.gmane.org>
>
Thanks for review.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2016-04-02  4:30 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-03-30  9:08 [PATCH] RDS: sync congestion map updating Wengang Wang
     [not found] ` <1459328902-31968-1-git-send-email-wen.gang.wang-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-03-30 16:19   ` Leon Romanovsky
     [not found]     ` <20160330161952.GA2670-2ukJVAZIZ/Y@public.gmane.org>
2016-03-30 17:16       ` santosh shilimkar
     [not found]         ` <56FC09D6.7090602-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-03-31  1:51           ` Wengang Wang
     [not found]             ` <56FC82B7.3070504-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-03-31  2:59               ` Wengang Wang
     [not found]                 ` <56FC927E.9090404-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-04-01 19:47                   ` santosh shilimkar
2016-04-02  1:14                     ` Leon Romanovsky
     [not found]                       ` <20160402011459.GC8565-2ukJVAZIZ/Y@public.gmane.org>
2016-04-02  4:30                         ` santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA
2016-03-31  1:24       ` Wengang Wang
