From: Leon Romanovsky <leonro@mellanox.com>

Hi,

This pack of small fixes is sent as a patchset simply to simplify their
tracking. Some of them, like the first and second patches, were already
sent to the mailing list. The ucma patch was in our regression testing
for a whole cycle and we didn't notice any failures related to that
change.

Changelog of second patch:
 1. Maor added IB_QP_PKEY_INDEX and IB_QP_PORT checks and I rewrote the
    code logic to be less hairy.

Thanks

Leon Romanovsky (2):
  RDMA/ucma: Mask QPN to be 24 bits according to IBTA
  RDMA/mlx5: Prevent overflow in mmap offset calculations

Maor Gottlieb (1):
  RDMA/core: Fix protection fault in get_pkey_idx_qp_list

Michael Guralnik (1):
  RDMA/core: Add missing list deletion on freeing event queue

Parav Pandit (1):
  Revert "RDMA/cma: Simplify rdma_resolve_addr() error flow"

Valentine Fatiev (1):
  IB/ipoib: Fix double free of skb in case of multicast traffic in CM mode

Yishai Hadas (1):
  IB/mlx5: Fix async events cleanup flows

Yonatan Cohen (1):
  IB/umad: Fix kernel crash while unloading ib_umad

Zhu Yanjun (1):
  RDMA/rxe: Fix soft lockup problem due to using tasklets in softirq

 drivers/infiniband/core/cm.c               |  3 ++
 drivers/infiniband/core/cma.c              | 15 +++++--
 drivers/infiniband/core/security.c         | 24 ++++------
 drivers/infiniband/core/ucma.c             |  2 +-
 drivers/infiniband/core/user_mad.c         |  5 ++-
 drivers/infiniband/core/uverbs_std_types.c |  1 +
 drivers/infiniband/hw/mlx5/devx.c          | 51 ++++++++++++----------
 drivers/infiniband/hw/mlx5/main.c          |  4 +-
 drivers/infiniband/sw/rxe/rxe_comp.c       |  8 ++--
 drivers/infiniband/ulp/ipoib/ipoib.h       |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c    | 15 ++++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |  8 +++-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  1 +
 13 files changed, 78 insertions(+), 60 deletions(-)

--
2.24.1
From: Leon Romanovsky <leonro@mellanox.com>

IBTA declares QPN as a 24-bit value; mask the input to ensure that the
kernel doesn't get the higher bits, and ensure with a WARN_ONCE() that
other CM users do the same.

Fixes: 75216638572f ("RDMA/cma: Export rdma cm interface to userspace")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/cm.c   | 3 +++
 drivers/infiniband/core/ucma.c | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 68cc1b2d6824..33c0d9e7bb66 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -2188,6 +2188,9 @@ int ib_send_cm_rep(struct ib_cm_id *cm_id,
 	cm_id_priv->initiator_depth = param->initiator_depth;
 	cm_id_priv->responder_resources = param->responder_resources;
 	cm_id_priv->rq_psn = cpu_to_be32(IBA_GET(CM_REP_STARTING_PSN, rep_msg));
+	WARN_ONCE(param->qp_num & 0xFF000000,
+		  "IBTA declares QPN to be 24 bits, but it is 0x%X\n",
+		  param->qp_num);
 	cm_id_priv->local_qpn = cpu_to_be32(param->qp_num & 0xFFFFFF);
 out:
 	spin_unlock_irqrestore(&cm_id_priv->lock, flags);
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 0274e9b704be..57e68491a2fd 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -1045,7 +1045,7 @@ static void ucma_copy_conn_param(struct rdma_cm_id *id,
 	dst->retry_count = src->retry_count;
 	dst->rnr_retry_count = src->rnr_retry_count;
 	dst->srq = src->srq;
-	dst->qp_num = src->qp_num;
+	dst->qp_num = src->qp_num & 0xFFFFFF;
 	dst->qkey = (id->route.addr.src_addr.ss_family == AF_IB) ?
 			src->qkey : 0;
 }
--
2.24.1
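For readers outside the kernel tree, the masking above can be sketched in a few lines of userspace C. The helper names `clamp_qpn` and `qpn_has_high_bits` are hypothetical; only the `0xFFFFFF` mask and the `0xFF000000` warn condition come from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* IBTA defines QPN as a 24-bit field; anything above bit 23 must not
 * leak into CM messages. */
#define QPN_MASK 0xFFFFFFu

/* What the ucma hunk does to user-supplied qp_num. */
static uint32_t clamp_qpn(uint32_t qpn)
{
	return qpn & QPN_MASK;
}

/* The condition the added WARN_ONCE() fires on in ib_send_cm_rep(). */
static int qpn_has_high_bits(uint32_t qpn)
{
	return (qpn & 0xFF000000u) != 0;
}
```

The CM side keeps the mask too, so a misbehaving in-kernel caller is warned about but still clamped.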
From: Maor Gottlieb <maorg@mellanox.com>

We don't need to set the pkey as valid in case the user set only one of
the pkey index or port number; otherwise it results in a NULL pointer
dereference while accessing the uninitialized pkey list. The following
crash from Syzkaller revealed it.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 14753 Comm: syz-executor.2 Not tainted 5.5.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
RIP: 0010:get_pkey_idx_qp_list+0x161/0x2d0
Code: 01 00 00 49 8b 5e 20 4c 39 e3 0f 84 b9 00 00 00 e8 e4 42 6e fe 48
8d 7b 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04
02 84 c0 74 08 3c 01 0f 8e d0 00 00 00 48 8d 7d 04 48 b8
RSP: 0018:ffffc9000bc6f950 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff82c8bdec
RDX: 0000000000000002 RSI: ffffc900030a8000 RDI: 0000000000000010
RBP: ffff888112c8ce80 R08: 0000000000000004 R09: fffff5200178df1f
R10: 0000000000000001 R11: fffff5200178df1f R12: ffff888115dc4430
R13: ffff888115da8498 R14: ffff888115dc4410 R15: ffff888115da8000
FS:  00007f20777de700(0000) GS:ffff88811b100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f721000 CR3: 00000001173ca002 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 port_pkey_list_insert+0xd7/0x7c0
 ib_security_modify_qp+0x6fa/0xfc0
 _ib_modify_qp+0x8c4/0xbf0
 modify_qp+0x10da/0x16d0
 ib_uverbs_modify_qp+0x9a/0x100
 ib_uverbs_write+0xaa5/0xdf0
 __vfs_write+0x7c/0x100
 vfs_write+0x168/0x4a0
 ksys_write+0xc8/0x200
 do_syscall_64+0x9c/0x390
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/security.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c
index 6eb6d2717ca5..5c8bf8ffb08c 100644
--- a/drivers/infiniband/core/security.c
+++ b/drivers/infiniband/core/security.c
@@ -339,22 +339,16 @@ static struct ib_ports_pkeys *get_new_pps(const struct ib_qp *qp,
 	if (!new_pps)
 		return NULL;
 
-	if (qp_attr_mask & (IB_QP_PKEY_INDEX | IB_QP_PORT)) {
-		if (!qp_pps) {
-			new_pps->main.port_num = qp_attr->port_num;
-			new_pps->main.pkey_index = qp_attr->pkey_index;
-		} else {
-			new_pps->main.port_num = (qp_attr_mask & IB_QP_PORT) ?
-				qp_attr->port_num :
-				qp_pps->main.port_num;
-
-			new_pps->main.pkey_index =
-				(qp_attr_mask & IB_QP_PKEY_INDEX) ?
-				qp_attr->pkey_index :
-				qp_pps->main.pkey_index;
-		}
+	if (qp_attr_mask & IB_QP_PORT)
+		new_pps->main.port_num =
+			(qp_pps) ? qp_pps->main.port_num : qp_attr->port_num;
+	if (qp_attr_mask & IB_QP_PKEY_INDEX)
+		new_pps->main.pkey_index = (qp_pps) ? qp_pps->main.pkey_index :
+						      qp_attr->pkey_index;
+	if ((qp_attr_mask & IB_QP_PKEY_INDEX) && (qp_attr_mask & IB_QP_PORT))
 		new_pps->main.state = IB_PORT_PKEY_VALID;
-	} else if (qp_pps) {
+
+	if (!(qp_attr_mask & IB_QP_PKEY_INDEX & IB_QP_PORT) && qp_pps) {
 		new_pps->main.port_num = qp_pps->main.port_num;
 		new_pps->main.pkey_index = qp_pps->main.pkey_index;
 		if (qp_pps->main.state != IB_PORT_PKEY_NOT_VALID)
--
2.24.1
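One subtlety worth a userspace sketch: the last hunk above tests `qp_attr_mask & IB_QP_PKEY_INDEX & IB_QP_PORT`. Attribute-mask members are distinct single bits, so ANDing two of them together always yields zero. The flag values below are illustrative stand-ins for the real `enum ib_qp_attr_mask` bits, and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative single-bit attribute flags; the real values live in
 * include/rdma/ib_verbs.h. */
#define QP_PKEY_INDEX (1u << 4)
#define QP_PORT       (1u << 5)

/* The test as written in the hunk: for two distinct single-bit flags,
 * QP_PKEY_INDEX & QP_PORT == 0, so the negation is always true. */
static bool neither_set_combined_and(unsigned int mask)
{
	return !(mask & QP_PKEY_INDEX & QP_PORT);
}

/* True only when NEITHER flag is set: test each bit separately. */
static bool neither_set_per_bit(unsigned int mask)
{
	return !(mask & QP_PKEY_INDEX) && !(mask & QP_PORT);
}
```

The difference shows up exactly when both flags are set, which is why the check needs per-bit tests rather than a chained AND.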
From: Parav Pandit <parav@mellanox.com>

This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.

The below flow requires cm_id_priv's destination address to be set up
before performing rdma_bind_addr(). Without that, source port allocation
for an existing bind list fails when the port range is small, resulting
in rdma_resolve_addr() failure.

rdma_resolve_addr()
  cma_bind_addr()
    rdma_bind_addr()
      cma_get_port()
        cma_alloc_any_port()
          cma_port_is_unique() <- compared with zero daddr

Fixes: 219d2e9dfda9 ("RDMA/cma: Simplify rdma_resolve_addr() error flow")
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/cma.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 72f032160c4b..2dec3a02ab9f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -3212,19 +3212,26 @@ int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr,
 	int ret;
 
 	id_priv = container_of(id, struct rdma_id_private, id);
+	memcpy(cma_dst_addr(id_priv), dst_addr, rdma_addr_size(dst_addr));
 	if (id_priv->state == RDMA_CM_IDLE) {
 		ret = cma_bind_addr(id, src_addr, dst_addr);
-		if (ret)
+		if (ret) {
+			memset(cma_dst_addr(id_priv), 0,
+			       rdma_addr_size(dst_addr));
 			return ret;
+		}
 	}
 
-	if (cma_family(id_priv) != dst_addr->sa_family)
+	if (cma_family(id_priv) != dst_addr->sa_family) {
+		memset(cma_dst_addr(id_priv), 0, rdma_addr_size(dst_addr));
 		return -EINVAL;
+	}
 
-	if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_BOUND, RDMA_CM_ADDR_QUERY))
+	if (!cma_comp_exch(id_priv, RDMA_CM_ADDR_BOUND, RDMA_CM_ADDR_QUERY)) {
+		memset(cma_dst_addr(id_priv), 0, rdma_addr_size(dst_addr));
 		return -EINVAL;
+	}
 
-	memcpy(cma_dst_addr(id_priv), dst_addr, rdma_addr_size(dst_addr));
 	if (cma_any_addr(dst_addr)) {
 		ret = cma_resolve_loopback(id_priv);
 	} else {
--
2.24.1
From: Valentine Fatiev <valentinef@mellanox.com>

While connected mode is set and we have connected and datagram traffic
in parallel, it might end with a double free of a datagram skb.

The current mechanism assumes that the order in the completion queue is
the same as the order of the sent packets for all QPs. The order is kept
only per QP; in case of mixed UD and CM traffic we have a few QPs (one
UD and a few CMs) in parallel.

The problem:
----------------------------------------------------------

Transmit queue:
-----------------
The UD skb pointer is kept in the queue itself, while the CM skb is kept
in a separate queue and uses the transmit queue only as a placeholder to
count the number of transmitted packets.

0   1   2   3   4   5   6   7   8   9  10  11  12  13 .........127
------------------------------------------------------------
NL ud1 UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL  NL  NL ...........
--------------------------------------------------------------
    ^                               ^
   tail                            head

Completion queue (problematic scenario) - order is not the same as in
the transmit queue:

 1   2   3   4   5   6   7   8   9
----------------------------
ud1 CM1 UD2 ud3 cm2 cm3 ud4 cm4 ud5
----------------------------

1. CM1 'wc' processing
   - skb freed in the cm separate ring.
   - tx_tail of the transmit queue increased although UD2 is not freed.
     Now the driver assumes the UD2 index is already freed and could be
     used for a newly transmitted skb.

0   1   2   3   4   5   6   7   8   9  10  11  12  13 .........127
------------------------------------------------------------
NL  NL UD2 CM1 ud3 cm2 cm3 ud4 cm4 ud5 NL  NL  NL ...........
--------------------------------------------------------------
    ^    ^                          ^
(Bad)tail                          head
(Bad - could be used for a new SKB)

In this case (due to heavy load) the UD2 skb pointer could be replaced
by a newly transmitted packet UD_NEW, as the driver assumes it is free.
At this point we will have to process two 'wc' with the same index, but
we have only one pointer to free. During the second attempt to free the
same skb we will get a NULL pointer exception.

2. UD2 'wc' processing
   - skb freed according to the index we got from the 'wc', but it was
     already overwritten by mistake. So the skb that was actually
     released is the skb of the newly transmitted packet and not the
     original one.

3. UD_NEW 'wc' processing
   - attempt to free an already freed skb. NULL pointer exception.

Fix:
-----------------------------------------------------------------------
The fix is to stop using the UD ring as a placeholder for CM packets,
and instead use an atomic counter to track the total number of packets
sent and use it to stop/wake the queue.

Fixes: 2c104ea68350 ("IB/ipoib: Get rid of the tx_outstanding variable in all modes")
Signed-off-by: Valentine Fatiev <valentinef@mellanox.com>
Reviewed-by: Erez Shitrit <erezsh@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib.h      |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c   | 15 ++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   |  8 ++++++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  1 +
 4 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 2aa3457a30ce..c614cb87d09b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -379,6 +379,7 @@ struct ipoib_dev_priv {
 	struct ipoib_tx_buf *tx_ring;
 	unsigned int	     tx_head;
 	unsigned int	     tx_tail;
+	atomic_t	     tx_outstanding;
 	struct ib_sge	     tx_sge[MAX_SKB_FRAGS + 1];
 	struct ib_ud_wr      tx_wr;
 	struct ib_wc	     send_wc[MAX_SEND_CQE];
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index c59e00a0881f..db6aace83fe5 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 		return;
 	}
 
-	if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
+	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
 		ipoib_dbg(priv, "TX ring 0x%x
full, stopping kernel net queue\n", tx->qp->qp_num); netif_stop_queue(dev); @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_ } else { netif_trans_update(dev); ++tx->tx_head; - ++priv->tx_head; + atomic_inc(&priv->tx_outstanding); } } @@ -820,11 +820,11 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) netif_tx_lock(dev); ++tx->tx_tail; - ++priv->tx_tail; + atomic_dec(&priv->tx_outstanding); if (unlikely(netif_queue_stopped(dev) && - (priv->tx_head - priv->tx_tail) <= ipoib_sendq_size >> 1 && - test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))) + (atomic_read(&priv->tx_outstanding) <= ipoib_sendq_size >> 1) && + test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))) netif_wake_queue(dev); if (wc->status != IB_WC_SUCCESS && @@ -1232,8 +1232,9 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p) dev_kfree_skb_any(tx_req->skb); netif_tx_lock_bh(p->dev); ++p->tx_tail; - ++priv->tx_tail; - if (unlikely(priv->tx_head - priv->tx_tail == ipoib_sendq_size >> 1) && + atomic_dec(&priv->tx_outstanding); + if (unlikely(atomic_read(&priv->tx_outstanding) <= + ipoib_sendq_size >> 1) && netif_queue_stopped(p->dev) && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags)) netif_wake_queue(p->dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index c332b4761816..0652f25e8d0f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -407,9 +407,11 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; + atomic_dec(&priv->tx_outstanding); if (unlikely(netif_queue_stopped(dev) && - ((priv->tx_head - priv->tx_tail) <= ipoib_sendq_size >> 1) && + (atomic_read(&priv->tx_outstanding) <= ipoib_sendq_size >> + 1) && test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))) netif_wake_queue(dev); @@ -634,7 +636,7 @@ int ipoib_send(struct net_device *dev, struct sk_buff *skb, else 
priv->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM; /* increase the tx_head after send success, but use it for queue state */ - if (priv->tx_head - priv->tx_tail == ipoib_sendq_size - 1) { + if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) { ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n"); netif_stop_queue(dev); } @@ -662,6 +664,7 @@ int ipoib_send(struct net_device *dev, struct sk_buff *skb, rc = priv->tx_head; ++priv->tx_head; + atomic_inc(&priv->tx_outstanding); } return rc; } @@ -807,6 +810,7 @@ int ipoib_ib_dev_stop_default(struct net_device *dev) ipoib_dma_unmap_tx(priv, tx_req); dev_kfree_skb_any(tx_req->skb); ++priv->tx_tail; + atomic_dec(&priv->tx_outstanding); } for (i = 0; i < ipoib_recvq_size; ++i) { diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 4a0d3a9e72e1..0d98886cfef8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1706,6 +1706,7 @@ static int ipoib_dev_init_default(struct net_device *dev) } /* priv->tx_head, tx_tail & tx_outstanding are already 0 */ + atomic_set(&priv->tx_outstanding, 0); if (ipoib_transport_dev_init(dev, priv->ca)) { pr_warn("%s: ipoib_transport_dev_init failed\n", -- 2.24.1
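The patch's bookkeeping change can be pictured with a small userspace model. `tx_post`/`tx_complete` are hypothetical stand-ins for the driver's send and completion handlers; only the full (`sendq_size - 1`) and wake (`sendq_size >> 1`) thresholds mirror the patched logic, where one atomic in-flight counter is shared by the UD and CM paths instead of head/tail arithmetic on a single ring:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum { sendq_size = 128 };          /* stand-in for ipoib_sendq_size */

static atomic_int tx_outstanding;   /* shared by all QPs, UD and CM alike */

/* Called for every send, regardless of which QP it goes to.
 * Returns false when the caller should stop the netdev queue. */
static bool tx_post(void)
{
	if (atomic_load(&tx_outstanding) == sendq_size - 1)
		return false;   /* "TX ring full, stopping kernel net queue" */
	atomic_fetch_add(&tx_outstanding, 1);
	return true;
}

/* Called per completion, from any QP's handler, in any order.
 * Returns true when the queue may be woken again. */
static bool tx_complete(void)
{
	atomic_fetch_sub(&tx_outstanding, 1);
	return atomic_load(&tx_outstanding) <= sendq_size >> 1;
}
```

Because the counter never cares which QP completed first, out-of-order completions between UD and CM no longer corrupt the ring accounting.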
From: Michael Guralnik <michaelgur@mellanox.com>

Fix a bug in the disassociate flow while deleting the event queues. The
current code has a race between ib_uverbs_free_event_queue() and
ib_uverbs_event_read() which might leave entries in the list and lead
to a double free.

Fixes: f7c8416ccea5 ("RDMA/core: Simplify destruction of FD uobjects")
Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Reviewed-by: Yishai Hadas <yishaih@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/uverbs_std_types.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/infiniband/core/uverbs_std_types.c b/drivers/infiniband/core/uverbs_std_types.c
index 994d8744b246..3abfc63225cb 100644
--- a/drivers/infiniband/core/uverbs_std_types.c
+++ b/drivers/infiniband/core/uverbs_std_types.c
@@ -220,6 +220,7 @@ void ib_uverbs_free_event_queue(struct ib_uverbs_event_queue *event_queue)
 	list_for_each_entry_safe(entry, tmp, &event_queue->event_list, list) {
 		if (entry->counter)
 			list_del(&entry->obj_list);
+		list_del(&entry->list);
 		kfree(entry);
 	}
 	spin_unlock_irq(&event_queue->lock);
--
2.24.1
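Why the added `list_del()` matters can be shown with a tiny userspace model of a circular doubly-linked list in the style of the kernel's `list_head`. The names here (`list_init`, `list_add_tail_`, `list_del_`, `drain`) are hypothetical stand-ins, not the kernel API: freeing an entry without unlinking it first leaves the head (and a concurrent reader) holding a dangling pointer into freed memory.

```c
#include <assert.h>
#include <stdlib.h>

struct node { struct node *prev, *next; };

static void list_init(struct node *head)
{
	head->prev = head->next = head;
}

static void list_add_tail_(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

static void list_del_(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
}

/* Drain the list the way the patch does: unlink each entry, THEN free
 * it, so the list never points at freed memory. Returns the count. */
static int drain(struct node *head)
{
	int freed = 0;

	while (head->next != head) {
		struct node *n = head->next;

		list_del_(n);   /* without this, head->next dangles */
		free(n);
		freed++;
	}
	return freed;
}
```

After `drain()` the head points back at itself, so a racing reader that re-checks the list finds it empty instead of walking freed entries.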
From: Yishai Hadas <yishaih@mellanox.com> Fix async events flows to prevent race between the read event APIs and their destroy uobj API. In both async command/event flows, delete the event entry from the list before its memory de-allocation and fix the async command flow to check properly for the 'is_destroyed' under the lock. The above comes to prevent accessing an entry post its de-allocation. Fixes: f7c8416ccea5 ("RDMA/core: Simplify destruction of FD uobjects") Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> --- drivers/infiniband/hw/mlx5/devx.c | 51 +++++++++++++++++-------------- 1 file changed, 28 insertions(+), 23 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c index d7efc9f6daf0..46e1ab771f10 100644 --- a/drivers/infiniband/hw/mlx5/devx.c +++ b/drivers/infiniband/hw/mlx5/devx.c @@ -2319,14 +2319,12 @@ static int deliver_event(struct devx_event_subscription *event_sub, if (ev_file->omit_data) { spin_lock_irqsave(&ev_file->lock, flags); - if (!list_empty(&event_sub->event_list)) { + if (!list_empty(&event_sub->event_list) || + ev_file->is_destroyed) { spin_unlock_irqrestore(&ev_file->lock, flags); return 0; } - /* is_destroyed is ignored here because we don't have any memory - * allocation to clean up for the omit_data case - */ list_add_tail(&event_sub->event_list, &ev_file->event_list); spin_unlock_irqrestore(&ev_file->lock, flags); wake_up_interruptible(&ev_file->poll_wait); @@ -2473,11 +2471,11 @@ static ssize_t devx_async_cmd_event_read(struct file *filp, char __user *buf, return -ERESTARTSYS; } - if (list_empty(&ev_queue->event_list) && - ev_queue->is_destroyed) - return -EIO; - spin_lock_irq(&ev_queue->lock); + if (ev_queue->is_destroyed) { + spin_unlock_irq(&ev_queue->lock); + return -EIO; + } } event = list_entry(ev_queue->event_list.next, @@ -2551,10 +2549,6 @@ static ssize_t devx_async_event_read(struct file *filp, char __user *buf, return 
-EOVERFLOW; } - if (ev_file->is_destroyed) { - spin_unlock_irq(&ev_file->lock); - return -EIO; - } while (list_empty(&ev_file->event_list)) { spin_unlock_irq(&ev_file->lock); @@ -2667,8 +2661,10 @@ static int devx_async_cmd_event_destroy_uobj(struct ib_uobject *uobj, spin_lock_irq(&comp_ev_file->ev_queue.lock); list_for_each_entry_safe(entry, tmp, - &comp_ev_file->ev_queue.event_list, list) + &comp_ev_file->ev_queue.event_list, list) { + list_del(&entry->list); kvfree(entry); + } spin_unlock_irq(&comp_ev_file->ev_queue.lock); return 0; }; @@ -2680,11 +2676,29 @@ static int devx_async_event_destroy_uobj(struct ib_uobject *uobj, container_of(uobj, struct devx_async_event_file, uobj); struct devx_event_subscription *event_sub, *event_sub_tmp; - struct devx_async_event_data *entry, *tmp; struct mlx5_ib_dev *dev = ev_file->dev; spin_lock_irq(&ev_file->lock); ev_file->is_destroyed = 1; + + /* free the pending events allocation */ + if (ev_file->omit_data) { + struct devx_event_subscription *event_sub, *tmp; + + list_for_each_entry_safe(event_sub, tmp, &ev_file->event_list, + event_list) + list_del_init(&event_sub->event_list); + + } else { + struct devx_async_event_data *entry, *tmp; + + list_for_each_entry_safe(entry, tmp, &ev_file->event_list, + list) { + list_del(&entry->list); + kfree(entry); + } + } + spin_unlock_irq(&ev_file->lock); wake_up_interruptible(&ev_file->poll_wait); @@ -2699,15 +2713,6 @@ static int devx_async_event_destroy_uobj(struct ib_uobject *uobj, } mutex_unlock(&dev->devx_event_table.event_xa_lock); - /* free the pending events allocation */ - if (!ev_file->omit_data) { - spin_lock_irq(&ev_file->lock); - list_for_each_entry_safe(entry, tmp, - &ev_file->event_list, list) - kfree(entry); /* read can't come any more */ - spin_unlock_irq(&ev_file->lock); - } - put_device(&dev->ib_dev.dev); return 0; }; -- 2.24.1
From: Zhu Yanjun <yanjunz@mellanox.com>

When running stress tests with RXE, the following Call Traces often
occur:

watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [swapper/2:0]
...
Call Trace:
<IRQ>
create_object+0x3f/0x3b0
kmem_cache_alloc_node_trace+0x129/0x2d0
__kmalloc_reserve.isra.52+0x2e/0x80
__alloc_skb+0x83/0x270
rxe_init_packet+0x99/0x150 [rdma_rxe]
rxe_requester+0x34e/0x11a0 [rdma_rxe]
rxe_do_task+0x85/0xf0 [rdma_rxe]
tasklet_action_common.isra.21+0xeb/0x100
__do_softirq+0xd0/0x298
irq_exit+0xc5/0xd0
smp_apic_timer_interrupt+0x68/0x120
apic_timer_interrupt+0xf/0x20
</IRQ>
...

The root cause is that a tasklet is actually a softirq. In a tasklet
handler, another softirq handler is triggered. Usually these softirq
handlers run on the same CPU core, so this causes the "soft lockup" bug.

Fixes: 8700e3e7c485 ("Soft RoCE driver")
CC: Yossef Efraim <yossefe@mellanox.com>
Signed-off-by: Zhu Yanjun <yanjunz@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 116cafc9afcf..4bc88708b355 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -329,7 +329,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 			qp->comp.psn = pkt->psn;
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_run_task(&qp->req.task, 1);
+				rxe_run_task(&qp->req.task, 0);
 			}
 		}
 		return COMPST_ERROR_RETRY;
@@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_run_task(&qp->req.task, 1);
+		rxe_run_task(&qp->req.task, 0);
 	}
 }
 
@@ -479,7 +479,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_run_task(&qp->req.task,
0); } } @@ -725,7 +725,7 @@ int rxe_completer(void *arg) RXE_CNT_COMP_RETRY); qp->req.need_retry = 1; qp->comp.started_retry = 1; - rxe_run_task(&qp->req.task, 1); + rxe_run_task(&qp->req.task, 0); } if (pkt) { -- 2.24.1
From: Yonatan Cohen <yonatanc@mellanox.com>

When unloading ib_umad, remove the ibdev sysfs file first, before port
removal, to prevent a kernel oops. ib_umad's ibdev_show() method might
access a umad port whose ibdev field has already been NULLed when
"rmmod ib_umad" was issued from another shell. Consider this scenario:

shell-1                      shell-2
rmmod ib_umad                cat /sys/devices/../ibdev
  |                            |
ib_umad_kill_port()          ibdev_show()
  port->ib_dev = NULL          dev_name(port->ib_dev)

kernel stack:
PF: error_code(0x0000) - not-present page
Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
RIP: 0010:ibdev_show+0x18/0x50 [ib_umad]
RSP: 0018:ffffc9000097fe40 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffffffffa0441120 RCX: ffff8881df514000
RDX: ffff8881df514000 RSI: ffffffffa0441120 RDI: ffff8881df1e8870
RBP: ffffffff81caf000 R08: ffff8881df1e8870 R09: 0000000000000000
R10: 0000000000001000 R11: 0000000000000003 R12: ffff88822f550b40
R13: 0000000000000001 R14: ffffc9000097ff08 R15: ffff8882238bad58
FS:  00007f1437ff3740(0000) GS:ffff888236940000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000004e8 CR3: 00000001e0dfc001 CR4: 00000000001606e0
Call Trace:
 dev_attr_show+0x15/0x50
 sysfs_kf_seq_show+0xb8/0x1a0
 seq_read+0x12d/0x350
 vfs_read+0x89/0x140
 ksys_read+0x55/0xd0
 do_syscall_64+0x55/0x1b0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: e9dd5daf884c ("IB/umad: Refactor code to use cdev_device_add()")
Signed-off-by: Yonatan Cohen <yonatanc@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/user_mad.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c
index d1407fa378e8..1235ffb2389b 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -1312,6 +1312,9 @@ static void ib_umad_kill_port(struct ib_umad_port *port)
 	struct ib_umad_file *file;
 	int id;
 
+	cdev_device_del(&port->sm_cdev, &port->sm_dev);
+
cdev_device_del(&port->cdev, &port->dev); + mutex_lock(&port->file_mutex); /* Mark ib_dev NULL and block ioctl or other file ops to progress @@ -1331,8 +1334,6 @@ static void ib_umad_kill_port(struct ib_umad_port *port) mutex_unlock(&port->file_mutex); - cdev_device_del(&port->sm_cdev, &port->sm_dev); - cdev_device_del(&port->cdev, &port->dev); ida_free(&umad_ida, port->dev_num); /* balances device_initialize() */ -- 2.24.1
From: Leon Romanovsky <leonro@mellanox.com>

The cmd and index variables are declared as u16, while the result is
supposed to be stored in a u64. The C arithmetic rules don't promote
"(index >> 8) << 16" to be u64, and the end result gets truncated.

Fixes: 7be76bef320b ("IB/mlx5: Introduce VAR object and its alloc/destroy methods")
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/hw/mlx5/main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index e874d688d040..987bfdcd12a5 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2283,8 +2283,8 @@ static int mlx5_ib_mmap_offset(struct mlx5_ib_dev *dev,
 
 static u64 mlx5_entry_to_mmap_offset(struct mlx5_user_mmap_entry *entry)
 {
-	u16 cmd = entry->rdma_entry.start_pgoff >> 16;
-	u16 index = entry->rdma_entry.start_pgoff & 0xFFFF;
+	u64 cmd = (entry->rdma_entry.start_pgoff >> 16) & 0xFFFF;
+	u64 index = entry->rdma_entry.start_pgoff & 0xFFFF;
 
 	return (((index >> 8) << 16) | (cmd << MLX5_IB_MMAP_CMD_SHIFT) |
 		(index & 0xFF)) << PAGE_SHIFT;
--
2.24.1
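The truncation is easy to reproduce in userspace. `offset_narrow` models the buggy flavour by doing the intermediate arithmetic in 32 bits (which is what the usual int promotions give you for u16 operands on common platforms), while `offset_wide` models the patched u64 math. `CMD_SHIFT` 8 and `PAGE_SHIFT` 12 are assumed stand-ins for `MLX5_IB_MMAP_CMD_SHIFT` and the kernel's page shift:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12   /* assumed 4K pages */
#define CMD_SHIFT  8    /* assumed stand-in for MLX5_IB_MMAP_CMD_SHIFT */

/* Buggy flavour: nothing in the expression forces 64-bit math, so the
 * final left shift is done in 32 bits and the high bits are lost. */
static uint64_t offset_narrow(uint16_t cmd, uint16_t index)
{
	uint32_t v = ((uint32_t)(index >> 8) << 16) |
		     ((uint32_t)cmd << CMD_SHIFT) | (index & 0xFFu);
	uint32_t truncated = v << PAGE_SHIFT; /* bits above 31 vanish here */

	return truncated;
}

/* Fixed flavour: widen to u64 before any shifting, as the patch does. */
static uint64_t offset_wide(uint16_t cmd, uint16_t index)
{
	uint64_t c = cmd, i = index;

	return (((i >> 8) << 16) | (c << CMD_SHIFT) | (i & 0xFF)) << PAGE_SHIFT;
}
```

With `index = 0x8000` the intermediate value is `0x800000`, and shifting that left by 12 needs 36 bits, so the narrow version wraps to zero while the wide one keeps the full offset.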
On Wed, Feb 12, 2020 at 09:26:28AM +0200, Leon Romanovsky wrote: > From: Maor Gottlieb <maorg@mellanox.com> > > We don't need to set pkey as valid in case that user set only one of > pkey index or port number, otherwise it will be resulted in NULL > pointer dereference while accessing to uninitialized pkey list. > The following crash from Syzkaller revealed it. > > kasan: CONFIG_KASAN_INLINE enabled > kasan: GPF could be caused by NULL-ptr deref or user memory access > general protection fault: 0000 [#1] SMP KASAN PTI > CPU: 1 PID: 14753 Comm: syz-executor.2 Not tainted 5.5.0-rc5 #2 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 > RIP: 0010:get_pkey_idx_qp_list+0x161/0x2d0 > Code: 01 00 00 49 8b 5e 20 4c 39 e3 0f 84 b9 00 00 00 e8 e4 42 6e fe 48 > 8d 7b 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 > 02 84 c0 74 08 3c 01 0f 8e d0 00 00 00 48 8d 7d 04 48 b8 > RSP: 0018:ffffc9000bc6f950 EFLAGS: 00010202 > RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff82c8bdec > RDX: 0000000000000002 RSI: ffffc900030a8000 RDI: 0000000000000010 > RBP: ffff888112c8ce80 R08: 0000000000000004 R09: fffff5200178df1f > R10: 0000000000000001 R11: fffff5200178df1f R12: ffff888115dc4430 > R13: ffff888115da8498 R14: ffff888115dc4410 R15: ffff888115da8000 > FS: 00007f20777de700(0000) GS:ffff88811b100000(0000) > knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 0000001b2f721000 CR3: 00000001173ca002 CR4: 0000000000360ee0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > port_pkey_list_insert+0xd7/0x7c0 > ib_security_modify_qp+0x6fa/0xfc0 > _ib_modify_qp+0x8c4/0xbf0 > modify_qp+0x10da/0x16d0 > ib_uverbs_modify_qp+0x9a/0x100 > ib_uverbs_write+0xaa5/0xdf0 > __vfs_write+0x7c/0x100 > vfs_write+0x168/0x4a0 > ksys_write+0xc8/0x200 > do_syscall_64+0x9c/0x390 > 
entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs") > Signed-off-by: Maor Gottlieb <maorg@mellanox.com> > Signed-off-by: Leon Romanovsky <leonro@mellanox.com> > --- > drivers/infiniband/core/security.c | 24 +++++++++--------------- > 1 file changed, 9 insertions(+), 15 deletions(-) > > diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c > index 6eb6d2717ca5..5c8bf8ffb08c 100644 > --- a/drivers/infiniband/core/security.c > +++ b/drivers/infiniband/core/security.c > @@ -339,22 +339,16 @@ static struct ib_ports_pkeys *get_new_pps(const struct ib_qp *qp, > if (!new_pps) > return NULL; > > - if (qp_attr_mask & (IB_QP_PKEY_INDEX | IB_QP_PORT)) { > - if (!qp_pps) { > - new_pps->main.port_num = qp_attr->port_num; > - new_pps->main.pkey_index = qp_attr->pkey_index; > - } else { > - new_pps->main.port_num = (qp_attr_mask & IB_QP_PORT) ? > - qp_attr->port_num : > - qp_pps->main.port_num; > - > - new_pps->main.pkey_index = > - (qp_attr_mask & IB_QP_PKEY_INDEX) ? > - qp_attr->pkey_index : > - qp_pps->main.pkey_index; > - } > + if (qp_attr_mask & IB_QP_PORT) > + new_pps->main.port_num = > + (qp_pps) ? qp_pps->main.port_num : qp_attr->port_num; > + if (qp_attr_mask & IB_QP_PKEY_INDEX) > + new_pps->main.pkey_index = (qp_pps) ? qp_pps->main.pkey_index : > + qp_attr->pkey_index; > + if ((qp_attr_mask & IB_QP_PKEY_INDEX) && (qp_attr_mask & IB_QP_PORT)) > new_pps->main.state = IB_PORT_PKEY_VALID; > - } else if (qp_pps) { > + > + if (!(qp_attr_mask & IB_QP_PKEY_INDEX & IB_QP_PORT) && qp_pps) { ^^ Sorry, this is rebase error, I'll send another version of this patch now. > new_pps->main.port_num = qp_pps->main.port_num; > new_pps->main.pkey_index = qp_pps->main.pkey_index; > if (qp_pps->main.state != IB_PORT_PKEY_NOT_VALID) > -- > 2.24.1 >
On Wed, Feb 12, 2020 at 09:26:28AM +0200, Leon Romanovsky wrote:
> From: Maor Gottlieb <maorg@mellanox.com>
>
> We don't need to set the pkey as valid in case the user set only one of
> pkey index or port number, otherwise it will result in a NULL pointer
> dereference while accessing the uninitialized pkey list. The following
> crash from Syzkaller revealed it.
>
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] SMP KASAN PTI
> CPU: 1 PID: 14753 Comm: syz-executor.2 Not tainted 5.5.0-rc5 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
> RIP: 0010:get_pkey_idx_qp_list+0x161/0x2d0
> Code: 01 00 00 49 8b 5e 20 4c 39 e3 0f 84 b9 00 00 00 e8 e4 42 6e fe 48
> 8d 7b 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04
> 02 84 c0 74 08 3c 01 0f 8e d0 00 00 00 48 8d 7d 04 48 b8
> RSP: 0018:ffffc9000bc6f950 EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff82c8bdec
> RDX: 0000000000000002 RSI: ffffc900030a8000 RDI: 0000000000000010
> RBP: ffff888112c8ce80 R08: 0000000000000004 R09: fffff5200178df1f
> R10: 0000000000000001 R11: fffff5200178df1f R12: ffff888115dc4430
> R13: ffff888115da8498 R14: ffff888115dc4410 R15: ffff888115da8000
> FS: 00007f20777de700(0000) GS:ffff88811b100000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2f721000 CR3: 00000001173ca002 CR4: 0000000000360ee0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> port_pkey_list_insert+0xd7/0x7c0
> ib_security_modify_qp+0x6fa/0xfc0
> _ib_modify_qp+0x8c4/0xbf0
> modify_qp+0x10da/0x16d0
> ib_uverbs_modify_qp+0x9a/0x100
> ib_uverbs_write+0xaa5/0xdf0
> __vfs_write+0x7c/0x100
> vfs_write+0x168/0x4a0
> ksys_write+0xc8/0x200
> do_syscall_64+0x9c/0x390
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
> Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
>  drivers/infiniband/core/security.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)

From e4536a662610eac76dc817c3a6956fa19542772d Mon Sep 17 00:00:00 2001
From: Maor Gottlieb <maorg@mellanox.com>
Date: Thu, 16 Jan 2020 09:31:40 +0200
Subject: [PATCH rdma-rc v1] RDMA/core: Fix protection fault in get_pkey_idx_qp_list
To: Doug Ledford <dledford@redhat.com>, Jason Gunthorpe <jgg@mellanox.com>
Cc: RDMA mailing list <linux-rdma@vger.kernel.org>,
    Daniel Jurgens <danielj@mellanox.com>,
    Maor Gottlieb <maorg@mellanox.com>

We don't need to set the pkey as valid in case the user set only one of
pkey index or port number, otherwise it will result in a NULL pointer
dereference while accessing the uninitialized pkey list. The following
crash from Syzkaller revealed it.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 14753 Comm: syz-executor.2 Not tainted 5.5.0-rc5 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
RIP: 0010:get_pkey_idx_qp_list+0x161/0x2d0
Code: 01 00 00 49 8b 5e 20 4c 39 e3 0f 84 b9 00 00 00 e8 e4 42 6e fe 48
8d 7b 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04
02 84 c0 74 08 3c 01 0f 8e d0 00 00 00 48 8d 7d 04 48 b8
RSP: 0018:ffffc9000bc6f950 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff82c8bdec
RDX: 0000000000000002 RSI: ffffc900030a8000 RDI: 0000000000000010
RBP: ffff888112c8ce80 R08: 0000000000000004 R09: fffff5200178df1f
R10: 0000000000000001 R11: fffff5200178df1f R12: ffff888115dc4430
R13: ffff888115da8498 R14: ffff888115dc4410 R15: ffff888115da8000
FS: 00007f20777de700(0000) GS:ffff88811b100000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2f721000 CR3: 00000001173ca002 CR4: 0000000000360ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
port_pkey_list_insert+0xd7/0x7c0
ib_security_modify_qp+0x6fa/0xfc0
_ib_modify_qp+0x8c4/0xbf0
modify_qp+0x10da/0x16d0
ib_uverbs_modify_qp+0x9a/0x100
ib_uverbs_write+0xaa5/0xdf0
__vfs_write+0x7c/0x100
vfs_write+0x168/0x4a0
ksys_write+0xc8/0x200
do_syscall_64+0x9c/0x390
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: d291f1a65232 ("IB/core: Enforce PKey security on QPs")
Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
Changelog:
1. Replaced "&" with "||" in last if.
---
 drivers/infiniband/core/security.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/core/security.c b/drivers/infiniband/core/security.c
index 6eb6d2717ca5..2b4d80393bd0 100644
--- a/drivers/infiniband/core/security.c
+++ b/drivers/infiniband/core/security.c
@@ -339,22 +339,16 @@ static struct ib_ports_pkeys *get_new_pps(const struct ib_qp *qp,
 	if (!new_pps)
 		return NULL;

-	if (qp_attr_mask & (IB_QP_PKEY_INDEX | IB_QP_PORT)) {
-		if (!qp_pps) {
-			new_pps->main.port_num = qp_attr->port_num;
-			new_pps->main.pkey_index = qp_attr->pkey_index;
-		} else {
-			new_pps->main.port_num = (qp_attr_mask & IB_QP_PORT) ?
-						 qp_attr->port_num :
-						 qp_pps->main.port_num;
-
-			new_pps->main.pkey_index =
-				(qp_attr_mask & IB_QP_PKEY_INDEX) ?
-				qp_attr->pkey_index :
-				qp_pps->main.pkey_index;
-		}
+	if (qp_attr_mask & IB_QP_PORT)
+		new_pps->main.port_num =
+			(qp_pps) ? qp_pps->main.port_num : qp_attr->port_num;
+	if (qp_attr_mask & IB_QP_PKEY_INDEX)
+		new_pps->main.pkey_index = (qp_pps) ? qp_pps->main.pkey_index :
+						      qp_attr->pkey_index;
+	if ((qp_attr_mask & IB_QP_PKEY_INDEX) && (qp_attr_mask & IB_QP_PORT))
 		new_pps->main.state = IB_PORT_PKEY_VALID;
-	} else if (qp_pps) {
+
+	if (!(qp_attr_mask & (IB_QP_PKEY_INDEX || IB_QP_PORT)) && qp_pps) {
 		new_pps->main.port_num = qp_pps->main.port_num;
 		new_pps->main.pkey_index = qp_pps->main.pkey_index;
 		if (qp_pps->main.state != IB_PORT_PKEY_NOT_VALID)
--
2.24.1
On Wed, Feb 12, 2020 at 09:26:29AM +0200, Leon Romanovsky wrote:
> From: Parav Pandit <parav@mellanox.com>
>
> This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.
>
> Below flow requires cm_id_priv's destination address to be setup
> before performing rdma_bind_addr().
> Without which, source port allocation for existing bind list fails
> when port range is small, resulting into rdma_resolve_addr()
> failure.

I don't quite understand this - what is "when port range is small"?

> rdma_resolve_addr()
> cma_bind_addr()
> rdma_bind_addr()
> cma_get_port()
> cma_alloc_any_port()
> cma_port_is_unique() <- compared with zero daddr

Do you understand why cma_port_is_unique() is even testing the dst_addr?
It seems very strange.

Outbound connections should not alias the source port in the first
place??

Jason
On Wed, Feb 12, 2020 at 09:26:34AM +0200, Leon Romanovsky wrote:
> From: Yonatan Cohen <yonatanc@mellanox.com>
>
> When unloading ib_umad, remove ibdev sys file 1st before
> port removal to prevent kernel oops.
>
> ib_mad's method ibdev_show() might access a umad port
> whose ibdev field has already been NULLed when rmmod ib_umad
> was issued from another shell.
>
> Consider this scenario
> shell-1 shell-2
> rmmod ib_umad cat /sys/devices/../ibdev
> | |
> ib_umad_kill_port() ibdev_show()
> port->ib_dev = NULL dev_name(port->ib_dev)
>
> kernel stack
> PF: error_code(0x0000) - not-present page
> Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> RIP: 0010:ibdev_show+0x18/0x50 [ib_umad]
> RSP: 0018:ffffc9000097fe40 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: ffffffffa0441120 RCX: ffff8881df514000
> RDX: ffff8881df514000 RSI: ffffffffa0441120 RDI: ffff8881df1e8870
> RBP: ffffffff81caf000 R08: ffff8881df1e8870 R09: 0000000000000000
> R10: 0000000000001000 R11: 0000000000000003 R12: ffff88822f550b40
> R13: 0000000000000001 R14: ffffc9000097ff08 R15: ffff8882238bad58
> FS: 00007f1437ff3740(0000) GS:ffff888236940000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000000000004e8 CR3: 00000001e0dfc001 CR4: 00000000001606e0
> Call Trace:
> dev_attr_show+0x15/0x50
> sysfs_kf_seq_show+0xb8/0x1a0
> seq_read+0x12d/0x350
> vfs_read+0x89/0x140
> ksys_read+0x55/0xd0
> do_syscall_64+0x55/0x1b0
> entry_SYSCALL_64_after_hwframe+0x44/0xa9:
>
> Fixes: e9dd5daf884c ("IB/umad: Refactor code to use cdev_device_add()")
This is the wrong Fixes line; this ordering change was actually
done deliberately:
commit cf7ad3030271c55a7119a8c2162563e3f6e93879
Author: Parav Pandit <parav@mellanox.com>
Date: Fri Dec 21 16:19:24 2018 +0200
IB/umad: Avoid destroying device while it is accessed
ib_umad_reg_agent2() and ib_umad_reg_agent() access the device name in
dev_notice(), while concurrently, ib_umad_kill_port() can destroy the
device using device_destroy().
cpu-0 cpu-1
----- -----
ib_umad_ioctl()
[...] ib_umad_kill_port()
device_destroy(dev)
ib_umad_reg_agent()
dev_notice(dev)
The mistake in the above was to move the device_destroy() down, not
split it into device_del() above and put_device() below.
Now that it is already split, we are OK.
Jason
On Wed, Feb 12, 2020 at 09:26:30AM +0200, Leon Romanovsky wrote:
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> index 2aa3457a30ce..c614cb87d09b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> @@ -379,6 +379,7 @@ struct ipoib_dev_priv {
> struct ipoib_tx_buf *tx_ring;
> unsigned int tx_head;
> unsigned int tx_tail;
> + atomic_t tx_outstanding;
> struct ib_sge tx_sge[MAX_SKB_FRAGS + 1];
> struct ib_ud_wr tx_wr;
> struct ib_wc send_wc[MAX_SEND_CQE];
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> index c59e00a0881f..db6aace83fe5 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> @@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> return;
> }
>
> - if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
> + if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
> tx->qp->qp_num);
> netif_stop_queue(dev);
> @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> } else {
> netif_trans_update(dev);
> ++tx->tx_head;
> - ++priv->tx_head;
> + atomic_inc(&priv->tx_outstanding);
> }
This use of an atomic is very weird, probably wrong.
Why is it an atomic? priv->tx_head wasn't an atomic, and every place
touching tx_outstanding was also touching tx_head.
I assume there is some hidden locking here? Or much more stuff is
busted up.
In that case, drop the atomic.
However, if the atomic is needed (where/why?) then something has to
be dealing with the races, and if the write side is fully locked then
an atomic is the wrong choice, use READ_ONCE/WRITE_ONCE instead
Jason
On Wed, Feb 12, 2020 at 09:26:26AM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
>
> Hi,
>
> This pack of small fixes is sent as a patchset simply to simplify their
> tracking. Some of them, like the first and second patches, were already
> sent to the mailing list. The ucma patch was in our regression for a
> whole cycle and we didn't notice any failures related to that change.
>
> Changelog of second patch:
> 1. Maor added IB_QP_PKEY_INDEX and IB_QP_PORT checks and I rewrote the
> code logic to be less hairy.
>
> Thanks
>
> Leon Romanovsky (2):
>   RDMA/ucma: Mask QPN to be 24 bits according to IBTA

I put this one into for-next

>   RDMA/mlx5: Prevent overflow in mmap offset calculations
>
> Maor Gottlieb (1):
>   RDMA/core: Fix protection fault in get_pkey_idx_qp_list
>
> Michael Guralnik (1):
>   RDMA/core: Add missing list deletion on freeing event queue
>
> Yishai Hadas (1):
>   IB/mlx5: Fix async events cleanup flows
>
> Yonatan Cohen (1):
>   IB/umad: Fix kernel crash while unloading ib_umad
>
> Zhu Yanjun (1):
>   RDMA/rxe: Fix soft lockup problem due to using tasklets in softirq

And these to for-rc

> Parav Pandit (1):
>   Revert "RDMA/cma: Simplify rdma_resolve_addr() error flow"
>
> Valentine Fatiev (1):
>   IB/ipoib: Fix double free of skb in case of multicast traffic in CM
>     mode

These I want to see the discussion on

Thanks,
Jason
On Thu, Feb 13, 2020 at 10:28:18AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 12, 2020 at 09:26:34AM +0200, Leon Romanovsky wrote:
> > From: Yonatan Cohen <yonatanc@mellanox.com>
> >
> > When unloading ib_umad, remove ibdev sys file 1st before
> > port removal to prevent kernel oops.
> >
> > ib_mad's method ibdev_show() might access a umad port
> > whose ibdev field has already been NULLed when rmmod ib_umad
> > was issued from another shell.
> >
> > Consider this scenario
> > shell-1 shell-2
> > rmmod ib_umad cat /sys/devices/../ibdev
> > | |
> > ib_umad_kill_port() ibdev_show()
> > port->ib_dev = NULL dev_name(port->ib_dev)
> >
> > kernel stack
> > PF: error_code(0x0000) - not-present page
> > Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
> > RIP: 0010:ibdev_show+0x18/0x50 [ib_umad]
> > RSP: 0018:ffffc9000097fe40 EFLAGS: 00010282
> > RAX: 0000000000000000 RBX: ffffffffa0441120 RCX: ffff8881df514000
> > RDX: ffff8881df514000 RSI: ffffffffa0441120 RDI: ffff8881df1e8870
> > RBP: ffffffff81caf000 R08: ffff8881df1e8870 R09: 0000000000000000
> > R10: 0000000000001000 R11: 0000000000000003 R12: ffff88822f550b40
> > R13: 0000000000000001 R14: ffffc9000097ff08 R15: ffff8882238bad58
> > FS: 00007f1437ff3740(0000) GS:ffff888236940000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000000004e8 CR3: 00000001e0dfc001 CR4: 00000000001606e0
> > Call Trace:
> > dev_attr_show+0x15/0x50
> > sysfs_kf_seq_show+0xb8/0x1a0
> > seq_read+0x12d/0x350
> > vfs_read+0x89/0x140
> > ksys_read+0x55/0xd0
> > do_syscall_64+0x55/0x1b0
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9:
> >
> > Fixes: e9dd5daf884c ("IB/umad: Refactor code to use cdev_device_add()")
>
> This is the wrong fixes line, this ordering change was actually
> deliberately done:
>
Can you please fix the Fixes line, so I will resend only the unaccepted
patches?
Thanks
On Thu, Feb 13, 2020 at 11:37:43AM -0400, Jason Gunthorpe wrote:
> On Wed, Feb 12, 2020 at 09:26:30AM +0200, Leon Romanovsky wrote:
>
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> > index 2aa3457a30ce..c614cb87d09b 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib.h
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > @@ -379,6 +379,7 @@ struct ipoib_dev_priv {
> >  	struct ipoib_tx_buf *tx_ring;
> >  	unsigned int tx_head;
> >  	unsigned int tx_tail;
> > +	atomic_t tx_outstanding;
> >  	struct ib_sge tx_sge[MAX_SKB_FRAGS + 1];
> >  	struct ib_ud_wr tx_wr;
> >  	struct ib_wc send_wc[MAX_SEND_CQE];
> > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > index c59e00a0881f..db6aace83fe5 100644
> > --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > @@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> >  		return;
> >  	}
> >
> > -	if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
> > +	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> >  		ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
> >  			  tx->qp->qp_num);
> >  		netif_stop_queue(dev);
> > @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> >  	} else {
> >  		netif_trans_update(dev);
> >  		++tx->tx_head;
> > -		++priv->tx_head;
> > +		atomic_inc(&priv->tx_outstanding);
> >  	}
>
> This use of an atomic is very weird, probably wrong.
>
> Why is it an atomic? priv->tx_head wasn't an atomic, and every place
> touching tx_outstanding was also touching tx_head.
>
> I assume there is some hidden locking here? Or much more stuff is
> busted up.
>
> In that case, drop the atomic.
>
> However, if the atomic is needed (where/why?) then something has to
> be dealing with the races, and if the write side is fully locked then
> an atomic is the wrong choice, use READ_ONCE/WRITE_ONCE instead

I thought that atomic_t is appropriate here. Valentine wanted to
implement a "window" and he needed to ensure that this window is counted
correctly, while it doesn't need to be 100% accurate.

Thanks

> Jason
On Thu, Feb 13, 2020 at 08:10:12PM +0200, Leon Romanovsky wrote:
> On Thu, Feb 13, 2020 at 11:37:43AM -0400, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2020 at 09:26:30AM +0200, Leon Romanovsky wrote:
> >
> > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > index 2aa3457a30ce..c614cb87d09b 100644
> > > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > @@ -379,6 +379,7 @@ struct ipoib_dev_priv {
> > > struct ipoib_tx_buf *tx_ring;
> > > unsigned int tx_head;
> > > unsigned int tx_tail;
> > > + atomic_t tx_outstanding;
> > > struct ib_sge tx_sge[MAX_SKB_FRAGS + 1];
> > > struct ib_ud_wr tx_wr;
> > > struct ib_wc send_wc[MAX_SEND_CQE];
> > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > index c59e00a0881f..db6aace83fe5 100644
> > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > @@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > return;
> > > }
> > >
> > > - if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
> > > + if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> > > ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
> > > tx->qp->qp_num);
> > > netif_stop_queue(dev);
> > > @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > } else {
> > > netif_trans_update(dev);
> > > ++tx->tx_head;
> > > - ++priv->tx_head;
> > > + atomic_inc(&priv->tx_outstanding);
> > > }
> >
> > This use of an atomic is very weird, probably wrong.
> >
> > Why is it an atomic? priv->tx_head wasn't an atomic, and every place
> > touching tx_outstanding was also touching tx_head.
> >
> > I assume there is some hidden locking here? Or much more stuff is
> > busted up.
> >
> > In that case, drop the atomic.
> >
> > However, if the atomic is needed (where/why?) then something has to
> > be dealing with the races, and if the write side is fully locked then
> > an atomic is the wrong choice, use READ_ONCE/WRITE_ONCE instead
>
> I thought that atomic_t is appropriate here. Valentin wanted to
> implement "window" and he needed to ensure that this window is counted
> correctly while it doesn't need to be 100% accurate.
It does need to be 100% accurate because the stop_queue condition is:
+ if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
And presumably the whole problem here is overflowing some sized q
object.
That atomic is only needed if there is concurrency, and where is the
concurrency?
Jason
On Thu, Feb 13, 2020 at 02:26:55PM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 13, 2020 at 08:10:12PM +0200, Leon Romanovsky wrote:
> > On Thu, Feb 13, 2020 at 11:37:43AM -0400, Jason Gunthorpe wrote:
> > > On Wed, Feb 12, 2020 at 09:26:30AM +0200, Leon Romanovsky wrote:
> > >
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > > index 2aa3457a30ce..c614cb87d09b 100644
> > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > > @@ -379,6 +379,7 @@ struct ipoib_dev_priv {
> > > >  	struct ipoib_tx_buf *tx_ring;
> > > >  	unsigned int tx_head;
> > > >  	unsigned int tx_tail;
> > > > +	atomic_t tx_outstanding;
> > > >  	struct ib_sge tx_sge[MAX_SKB_FRAGS + 1];
> > > >  	struct ib_ud_wr tx_wr;
> > > >  	struct ib_wc send_wc[MAX_SEND_CQE];
> > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > > index c59e00a0881f..db6aace83fe5 100644
> > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > > @@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > >  		return;
> > > >  	}
> > > >
> > > > -	if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
> > > > +	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> > > >  		ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
> > > >  			  tx->qp->qp_num);
> > > >  		netif_stop_queue(dev);
> > > > @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > >  	} else {
> > > >  		netif_trans_update(dev);
> > > >  		++tx->tx_head;
> > > > -		++priv->tx_head;
> > > > +		atomic_inc(&priv->tx_outstanding);
> > > >  	}
> > >
> > > This use of an atomic is very weird, probably wrong.
> > >
> > > Why is it an atomic? priv->tx_head wasn't an atomic, and every place
> > > touching tx_outstanding was also touching tx_head.
> > >
> > > I assume there is some hidden locking here? Or much more stuff is
> > > busted up.
> > >
> > > In that case, drop the atomic.
> > >
> > > However, if the atomic is needed (where/why?) then something has to
> > > be dealing with the races, and if the write side is fully locked then
> > > an atomic is the wrong choice, use READ_ONCE/WRITE_ONCE instead
> >
> > I thought that atomic_t is appropriate here. Valentine wanted to
> > implement a "window" and he needed to ensure that this window is
> > counted correctly, while it doesn't need to be 100% accurate.
>
> It does need to be 100% accurate because the stop_queue condition is:
>
> +	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {

So better to write ">=" instead of "==".

> And presumably the whole problem here is overflowing some sized q
> object.
>
> That atomic is only needed if there is concurrency, and where is the
> concurrency?

Depends on if you believe the description in the commit message :)

> Jason
On Thu, Feb 13, 2020 at 08:36:22PM +0200, Leon Romanovsky wrote:
> On Thu, Feb 13, 2020 at 02:26:55PM -0400, Jason Gunthorpe wrote:
> > On Thu, Feb 13, 2020 at 08:10:12PM +0200, Leon Romanovsky wrote:
> > > On Thu, Feb 13, 2020 at 11:37:43AM -0400, Jason Gunthorpe wrote:
> > > > On Wed, Feb 12, 2020 at 09:26:30AM +0200, Leon Romanovsky wrote:
> > > > >
> > > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > > > > index 2aa3457a30ce..c614cb87d09b 100644
> > > > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib.h
> > > > > > @@ -379,6 +379,7 @@ struct ipoib_dev_priv {
> > > > > >  	struct ipoib_tx_buf *tx_ring;
> > > > > >  	unsigned int tx_head;
> > > > > >  	unsigned int tx_tail;
> > > > > > +	atomic_t tx_outstanding;
> > > > > >  	struct ib_sge tx_sge[MAX_SKB_FRAGS + 1];
> > > > > >  	struct ib_ud_wr tx_wr;
> > > > > >  	struct ib_wc send_wc[MAX_SEND_CQE];
> > > > > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > > > > index c59e00a0881f..db6aace83fe5 100644
> > > > > > +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
> > > > > > @@ -756,7 +756,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > > > >  		return;
> > > > > >  	}
> > > > > >
> > > > > > -	if ((priv->tx_head - priv->tx_tail) == ipoib_sendq_size - 1) {
> > > > > > +	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> > > > > >  		ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
> > > > > >  			  tx->qp->qp_num);
> > > > > >  		netif_stop_queue(dev);
> > > > > > @@ -786,7 +786,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
> > > > > >  	} else {
> > > > > >  		netif_trans_update(dev);
> > > > > >  		++tx->tx_head;
> > > > > > -		++priv->tx_head;
> > > > > > +		atomic_inc(&priv->tx_outstanding);
> > > > > >  	}
> > > > >
> > > > > This use of an atomic is very weird, probably wrong.
> > > > >
> > > > > Why is it an atomic? priv->tx_head wasn't an atomic, and every place
> > > > > touching tx_outstanding was also touching tx_head.
> > > > >
> > > > > I assume there is some hidden locking here? Or much more stuff is
> > > > > busted up.
> > > > >
> > > > > In that case, drop the atomic.
> > > > >
> > > > > However, if the atomic is needed (where/why?) then something has to
> > > > > be dealing with the races, and if the write side is fully locked then
> > > > > an atomic is the wrong choice, use READ_ONCE/WRITE_ONCE instead
> > > >
> > > > I thought that atomic_t is appropriate here. Valentine wanted to
> > > > implement a "window" and he needed to ensure that this window is
> > > > counted correctly, while it doesn't need to be 100% accurate.
> > >
> > > It does need to be 100% accurate because the stop_queue condition is:
> > >
> > > +	if (atomic_read(&priv->tx_outstanding) == ipoib_sendq_size - 1) {
> >
> > So better to write ">=" instead of "==".

Then you risk overflowing the send q.

The required scheme appears to be to globally bound the # of packets
outstanding to tx such that the global bound is lower than any of the
(many) QP's sendq, thus there will always be a place to put the packets.
So it must be exact, or at least pessimistic.

This isn't a multiq netdev, so AFAICT the tx side is single threaded by
netdev, and it looks like the wc side was single threaded by NAPI as
well.

The old code used two variables, each written in a single threaded way,
and then read to generate the state. With the missing READ_ONCE added,
it should look like this:

  if ((priv->tx_head - READ_ONCE(priv->tx_tail)) == ipoib_sendq_size - 1) {

I feel like sticking with the single writer packets posted/completed
scheme is better and clearer than trying to collapse it into a single
atomic.

> > And presumably the whole problem here is overflowing some sized q
> > object.
> >
> > That atomic is only needed if there is concurrency, and where is the
> > concurrency?
>
> Depends on if you believe the description in the commit message :)

I do.. It is obviously wrong now that you point to it :)

Each QP must have exclusive control of its head/tail pointers, having a
CM QP reach into the head/tail of the UD QP is certainly wrong.

Jason
On 2/13/2020 7:30 AM, Jason Gunthorpe wrote:
> On Wed, Feb 12, 2020 at 09:26:29AM +0200, Leon Romanovsky wrote:
>> From: Parav Pandit <parav@mellanox.com>
>>
>> This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.
>>
>> Below flow requires cm_id_priv's destination address to be setup
>> before performing rdma_bind_addr().
>> Without which, source port allocation for existing bind list fails
>> when port range is small, resulting into rdma_resolve_addr()
>> failure.
>
> I don't quite understand this - what is "when port range is small"?
>
There is a sysfs knob to reduce the source port range used for binding.
So it is easy to create the issue by reducing the port range to a
handful of ports.

>> rdma_resolve_addr()
>> cma_bind_addr()
>> rdma_bind_addr()
>> cma_get_port()
>> cma_alloc_any_port()
>> cma_port_is_unique() <- compared with zero daddr
>
> Do you understand why cma_port_is_unique is even testing the dst_addr?
> It seems very strange.

cma_port_is_unique() reuses the source port as long as the destination
is different (destination = either different dest.addr or different
dest.port between two cm_ids).

This behavior is synonymous to TCP also.

> Outbound connections should not alias the source port in the first
> place??
>
I believe it can because TCP seems to allow that too, as long as the
destination is different.

I think it allows it if it results in a different 4-tuple.
On Fri, Feb 14, 2020 at 03:11:48AM +0000, Parav Pandit wrote:
> On 2/13/2020 7:30 AM, Jason Gunthorpe wrote:
> > On Wed, Feb 12, 2020 at 09:26:29AM +0200, Leon Romanovsky wrote:
> >> From: Parav Pandit <parav@mellanox.com>
> >>
> >> This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.
> >>
> >> Below flow requires cm_id_priv's destination address to be setup
> >> before performing rdma_bind_addr().
> >> Without which, source port allocation for existing bind list fails
> >> when port range is small, resulting into rdma_resolve_addr()
> >> failure.
> >
> > I don't quite understands this - what is "when port range is small" ?
> >
> There is sysfs knob to reduce source port range to use for binding.
> So it is easy to create the issue by reducing port range to handful of them.
>
> >> rdma_resolve_addr()
> >> cma_bind_addr()
> >> rdma_bind_addr()
> >> cma_get_port()
> >> cma_alloc_any_port()
> >> cma_port_is_unique() <- compared with zero daddr
> >
> > Do you understand why cma_port_is_unique is even testing the dst_addr?
> > It seems very strange.
>
> ma_port_unique() reuses the source port as long as
> destination is different (destination = either different
> dest.addr or different dest.port between two cm_ids ).
>
> This behavior is synonymous to TCP also.
>
> > Outbound connections should not alias the source port in the first
> > place??
> >
> I believe it can because TCP seems to allow that too as long destination
> is different.
>
> I think it allows if it results into different 4-tuple.
So the issue here is if the dest is left as 0 then the
cma_alloc_any_port() isn't able to alias source ports any more?
Jason
On 2/14/2020 8:08 AM, Jason Gunthorpe wrote:
> On Fri, Feb 14, 2020 at 03:11:48AM +0000, Parav Pandit wrote:
>> On 2/13/2020 7:30 AM, Jason Gunthorpe wrote:
>>> On Wed, Feb 12, 2020 at 09:26:29AM +0200, Leon Romanovsky wrote:
>>>> From: Parav Pandit <parav@mellanox.com>
>>>>
>>>> This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.
>>>>
>>>> Below flow requires cm_id_priv's destination address to be setup
>>>> before performing rdma_bind_addr().
>>>> Without which, source port allocation for existing bind list fails
>>>> when port range is small, resulting into rdma_resolve_addr()
>>>> failure.
>>>
>>> I don't quite understand this - what is "when port range is small"?
>>>
>> There is a sysfs knob to reduce the source port range used for binding.
>> So it is easy to create the issue by reducing the port range to a
>> handful of ports.
>>
>>>> rdma_resolve_addr()
>>>> cma_bind_addr()
>>>> rdma_bind_addr()
>>>> cma_get_port()
>>>> cma_alloc_any_port()
>>>> cma_port_is_unique() <- compared with zero daddr
>>>
>>> Do you understand why cma_port_is_unique is even testing the dst_addr?
>>> It seems very strange.
>>
>> cma_port_is_unique() reuses the source port as long as the destination
>> is different (destination = either different dest.addr or different
>> dest.port between two cm_ids).
>>
>> This behavior is synonymous to TCP also.
>>
>>> Outbound connections should not alias the source port in the first
>>> place??
>>>
>> I believe it can because TCP seems to allow that too, as long as the
>> destination is different.
>>
>> I think it allows it if it results in a different 4-tuple.
>
> So the issue here is if the dest is left as 0 then the
> cma_alloc_any_port() isn't able to alias source ports any more?
>
Correct.

> Jason
>
On Wed, Feb 12, 2020 at 09:26:29AM +0200, Leon Romanovsky wrote:
> From: Parav Pandit <parav@mellanox.com>
>
> This reverts commit 219d2e9dfda9431b808c28d5efc74b404b95b638.
>
> Below flow requires cm_id_priv's destination address to be setup
> before performing rdma_bind_addr().
> Without which, source port allocation for existing bind list fails
> when port range is small, resulting into rdma_resolve_addr()
> failure.
>
> rdma_resolve_addr()
> cma_bind_addr()
> rdma_bind_addr()
> cma_get_port()
> cma_alloc_any_port()
> cma_port_is_unique() <- compared with zero daddr
>
> Fixes: 219d2e9dfda9 ("RDMA/cma: Simplify rdma_resolve_addr() error flow")
> Signed-off-by: Parav Pandit <parav@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
> ---
> drivers/infiniband/core/cma.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
Applied to for-rc
Thanks,
Jason