Subject: Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
From: Håkon Bugge
Date: Wed, 6 Feb 2019 09:50:56 +0100
To: Jason Gunthorpe
Cc: netdev@vger.kernel.org, OFED mailing list, rds-devel@oss.oracle.com, linux-kernel@vger.kernel.org, Jack Morgenstein
In-Reply-To: <20190205223608.GA23110@ziepe.ca>
References: <20190131170951.178676-1-haakon.bugge@oracle.com> <20190205223608.GA23110@ziepe.ca>
Message-Id: <13750147-482A-4F90-976A-033C52DCF85E@oracle.com>
> On 5 Feb 2019, at 23:36, Jason Gunthorpe wrote:
> 
> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>> Using CX-3 virtual functions, either from a bare-metal machine or
>> pass-through from a VM, MAD packets are proxied through the PF driver.
>> 
>> Since the VMs have separate name spaces for MAD Transaction Ids
>> (TIDs), the PF driver has to re-map the TIDs and keep the bookkeeping
>> in a cache.
>> 
>> Following the RDMA CM protocol, it is clear when an entry has to be
>> evicted from the cache. But life is not perfect; remote peers may die
>> or be rebooted. Hence, there is a timeout to wipe out a cache entry,
>> after which the PF driver assumes the remote peer has gone.
>> 
>> We have experienced an excessive amount of DREQ retries during
>> fail-over testing, when running with eight VMs per database server.
>> 
>> The problem has been reproduced in a bare-metal system using one VM
>> per physical node. In this environment, running 256 processes in each
>> VM, each process uses RDMA CM to create an RC QP between itself and
>> all (256) remote processes. All in all, 16K QPs.
>> 
>> When tearing down these 16K QPs, excessive DREQ retries (and
>> duplicates) are observed. With some cat/paste/awk wizardry on the
>> infiniband_cm sysfs, we observe:
>> 
>> cm_rx_duplicates:
>>       dreq:  5007
>> cm_rx_msgs:
>>       drep:  3838
>>       dreq: 13018
>>       rep:   8128
>>       req:   8256
>>       rtu:   8256
>> cm_tx_msgs:
>>       drep:  8011
>>       dreq: 68856
>>       rep:   8256
>>       req:   8128
>>       rtu:   8128
>> cm_tx_retries:
>>       dreq: 60483
>> 
>> Note that the active/passive side is distributed.
>> 
>> Enabling pr_debug in cm.c gives tons of:
>> 
>> [171778.814239] mlx4_ib_multiplex_cm_handler: id{slave:
>> 1,sl_cm_id: 0xd393089f} is NULL!
>> 
>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
>> tear-down phase of the application is reduced from 113 to 67
>> seconds.
>> Retries/duplicates are also significantly reduced:
>> 
>> cm_rx_duplicates:
>>       dreq:  7726
>> []
>> cm_tx_retries:
>>       drep:     1
>>       dreq:  7779
>> 
>> Increasing the timeout further didn't help, as these duplicates and
>> retries stem from a too-short CMA timeout, which was 20 (~4 seconds)
>> on the systems. By increasing the CMA timeout to 22 (~17 seconds),
>> the numbers fell to about one hundred for both of them.
>> 
>> Adjustment of the CMA timeout is _not_ part of this commit.
>> 
>> Signed-off-by: Håkon Bugge
>> ---
>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Jack? What do you think?

I am tempted to send a v2 making this a sysctl tuneable. This because
full-rack testing using 8 servers, each with 8 VMs, only showed a 33%
reduction in the occurrences of "mlx4_ib_multiplex_cm_handler:
id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.

But sure, Jack's opinion matters.


Thxs, Håkon

> 
>> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
>> index fedaf8260105..8c79a480f2b7 100644
>> --- a/drivers/infiniband/hw/mlx4/cm.c
>> +++ b/drivers/infiniband/hw/mlx4/cm.c
>> @@ -39,7 +39,7 @@
>> 
>> #include "mlx4_ib.h"
>> 
>> -#define CM_CLEANUP_CACHE_TIMEOUT (5 * HZ)
>> +#define CM_CLEANUP_CACHE_TIMEOUT (30 * HZ)
>> 
>> struct id_map_entry {
>> 	struct rb_node node;
>> -- 
>> 2.20.1