* MLX4 Cq Question
@ 2013-05-17 19:25 Tom Tucker
[not found] ` <51968438.7070907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Tom Tucker @ 2013-05-17 19:25 UTC (permalink / raw)
To: roland-DgEjT+Ai2ygdnm+yROfE0A; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
Hi Roland,
I'm looking at the Linux MLX4 net driver and found something that confuses
me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c,
the mlx4_ib_completion function does not take any kind of lock when
looking up the SW CQ in the radix tree, however, the mlx4_cq_event
function does. In addition if I go look at the code paths where cq are
removed from this tree, they are protected by spin_lock_irq. So I am
baffled at this point as to what the locking strategy is and how this is
supposed to work. I'm sure I'm missing something and would greatly
appreciate it if someone would explain this.
Thanks,
Tom
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <51968438.7070907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2013-05-17 21:37 ` Roland Dreier
[not found] ` <CAG4TOxNi0PxxskqXgxRhMPG0bmr+sS-x0_RG-zKyvLW1LNzoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Roland Dreier @ 2013-05-17 21:37 UTC (permalink / raw)
To: Tom Tucker; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Fri, May 17, 2013 at 12:25 PM, Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this.
This is a bit tricky. If you look at
void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
{
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_cq_table *cq_table = &priv->cq_table;
int err;
err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
if (err)
mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n",
err, cq->cqn);
synchronize_irq(priv->eq_table.eq[cq->vector].irq);
spin_lock_irq(&cq_table->lock);
radix_tree_delete(&cq_table->tree, cq->cqn);
spin_unlock_irq(&cq_table->lock);
if (atomic_dec_and_test(&cq->refcount))
complete(&cq->free);
wait_for_completion(&cq->free);
mlx4_cq_free_icm(dev, cq->cqn);
}
you see that when freeing a CQ, we first do the HW2SW_CQ firmware
command; once this command completes, no more events will be generated
for that CQ. Then we do synchronize_irq for the CQ's interrupt
vector. Once that completes, no more completion handlers will be
running for the CQ, so we can safely delete the CQ from the radix tree
(relying on the radix tree's safety of deleting one entry while
possibly looking up other entries, so no lock is needed). We also use
the lock to synchronize against the CQ event function, which as you
noted does take the lock too.
Basic idea is that we're tricky and careful so we can make the fast
path (completion interrupt handling) lock-free, but then use locks and
whatever else needed in the slow path (CQ async event handling, CQ
destroy).
- R.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <CAG4TOxNi0PxxskqXgxRhMPG0bmr+sS-x0_RG-zKyvLW1LNzoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-05-19 6:09 ` Or Gerlitz
2013-05-20 14:53 ` Jack Morgenstein
1 sibling, 0 replies; 14+ messages in thread
From: Or Gerlitz @ 2013-05-19 6:09 UTC (permalink / raw)
To: Roland Dreier, Jack Morgenstein
Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Yishai Hadas
On 18/05/2013 00:37, Roland Dreier wrote:
> you see that when freeing a CQ, we first do the HW2SW_CQ firmware
> command; once this command completes, no more events will be generated
> for that CQ. Then we do synchronize_irq for the CQ's interrupt
> vector. Once that completes, no more completion handlers will be
> running for the CQ, so we can safely delete the CQ from the radix tree
> (relying on the radix tree's safety of deleting one entry while
> possibly looking up other entries, so no lock is needed). We also use
> the lock to synchronize against the CQ event function, which as you
> noted does take the lock too.
>
> Basic idea is that we're tricky and careful so we can make the fast
> path (completion interrupt handling) lock-free, but then use locks and
> whatever else needed in the slow path (CQ async event handling, CQ
> destroy).
Jack, so do we finally agree to this analysis? last time when this was
on the list, I was under the impression that there was no consensus and
I also see that on the stack we provide to customers there's a patch of
yours in that area, or it may fix another bug?
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <CAG4TOxNi0PxxskqXgxRhMPG0bmr+sS-x0_RG-zKyvLW1LNzoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-05-19 6:09 ` Or Gerlitz
@ 2013-05-20 14:53 ` Jack Morgenstein
[not found] ` <201305201753.10806.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
1 sibling, 1 reply; 14+ messages in thread
From: Jack Morgenstein @ 2013-05-20 14:53 UTC (permalink / raw)
To: Roland Dreier; +Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
On Saturday 18 May 2013 00:37, Roland Dreier wrote:
> On Fri, May 17, 2013 at 12:25 PM, Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> > I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this.
>
> This is a bit tricky. If you look at
>
> void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
> {
> struct mlx4_priv *priv = mlx4_priv(dev);
> struct mlx4_cq_table *cq_table = &priv->cq_table;
> int err;
>
> err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
> if (err)
> mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n",
> err, cq->cqn);
>
> synchronize_irq(priv->eq_table.eq[cq->vector].irq);
>
> spin_lock_irq(&cq_table->lock);
> radix_tree_delete(&cq_table->tree, cq->cqn);
> spin_unlock_irq(&cq_table->lock);
>
> if (atomic_dec_and_test(&cq->refcount))
> complete(&cq->free);
> wait_for_completion(&cq->free);
>
> mlx4_cq_free_icm(dev, cq->cqn);
> }
>
> you see that when freeing a CQ, we first do the HW2SW_CQ firmware
> command; once this command completes, no more events will be generated
> for that CQ. Then we do synchronize_irq for the CQ's interrupt
> vector. Once that completes, no more completion handlers will be
> running for the CQ, so we can safely delete the CQ from the radix tree
> (relying on the radix tree's safety of deleting one entry while
> possibly looking up other entries, so no lock is needed). We also use
> the lock to synchronize against the CQ event function, which as you
> noted does take the lock too.
>
> Basic idea is that we're tricky and careful so we can make the fast
> path (completion interrupt handling) lock-free, but then use locks and
> whatever else needed in the slow path (CQ async event handling, CQ
> destroy).
>
> - R.
=======================================================
Roland, unfortunately we have seen that we need some locking on the
cq completion handler (there is a stack trace which resulted from this
lack of proper locking).
In our current driver, we are using the patch below (which uses RCU locking
instead of spinlocks). I can prepare a proper patch for the upstream kernel.
===================================================
net/mlx4_core: Fix racy flow in the driver CQ completion handler
The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
the radix tree which is used to manage the table of CQs, nor does it increase
the reference count of the CQ before invoking the user provided callback
(and decrease it afterwards).
This is racy and can cause use-after-free, null pointer dereference, etc, which
result in kernel crashes.
To fix this, we must do the following in mlx4_cq_completion:
- increase the ref count on the cq before invoking the user callback, and
decrement it after the callback.
- Place a lock around the radix tree lookup/ref-count-increase
Using an irq spinlock will not fix this issue. The problem is that under VPI,
the ETH interface uses multiple msix irq's, which can result in one cq completion
event interrupting another in-progress cq completion event. A deadlock results
when the handler for the first cq completion grabs the spinlock, and is
interrupted by the second completion before it has a chance to release the spinlock.
The handler for the second completion will deadlock waiting for the spinlock
to be released.
The proper fix is to use the RCU mechanism for locking radix-tree accesses in
the cq completion event handler (The radix-tree implementation uses the RCU
mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the
updater, will do the job).
Note that the same issue exists in mlx4_cq_event() (the cq async event
handler), which also takes the same lock on the radix tree. Here, we replace the
spinlock with an rcu_read_lock().
This patch was motivated by the following report from the field:
"[...] box panic'ed when trying to find a completion queue. There is
no corruption but there is a possible race which could result in
mlx4_cq_completion getting wrong height of the radix tree and
following a bit too deep into the chains. In the other code which uses
this radix tree the access is protected by the lock but
mlx4_cq_completion is running in the interrupt context and cannot
take locks, so instead it runs without any protection whatsoever."
The stack trace below is from the mlnx ofed 1.5.3 driver running under RHEL5.7.
(this driver uses the upstream kernel mlx4_cq_completion() code)
PID: 8178 TASK: ffff810b91a52400 CPU: 11 COMMAND: "universal_consu"
[exception RIP: radix_tree_lookup+38]
RIP: ffffffff8016405f RSP: ffff81182fe3be90 RFLAGS: 00210002
RAX: 6b6b6b6b6b6b6b8b RBX: ffff810c2fb90000 RCX: 0000000000000006
RDX: 0000000000000001 RSI: 00000000000000c0 RDI: 6b6b6b6b6b6b6b6b
RBP: 00000000000000c0 R8: 0000000000010000 R9: 000000000920ea94
R10: 00000000000000b3 R11: ffffffff800c7ea5 R12: 00000000000000b3
R13: ffff810c2b15a280 R14: ffff810b7a98ff58 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
*** <IRQ stack> ***
RIP: 00000000f5cd8859 RSP: 00000000f3eafa28 RFLAGS: 00200202
RAX: 0000000000000000 RBX: 00000000f3eafaf8 RCX: 000000000b6d4f00
RDX: 000000000000001c RSI: 000000000000001c RDI: 0000000000000000
RBP: ffff810bf1a0a120 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: 0000000000000000 R14: ffff810b9e7d5bf0 R15: 0000000000000000
ORIG_RAX: ffffffffffffff4c CS: 0023 SS: 002b
The original patch was generated by Yishai Hadas,
and reviewed by Or Gerlitz and Jack Morgenstein. A subsequent fix
by Jack Morgenstein was reviewed by Eli Cohen.
Signed-off-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
---
drivers/net/ethernet/mellanox/mlx4/cq.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index ff91904..0e28258 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -57,8 +57,13 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
{
struct mlx4_cq *cq;
+ rcu_read_lock();
cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
cqn & (dev->caps.num_cqs - 1));
+ if (cq)
+ atomic_inc(&cq->refcount);
+ rcu_read_unlock();
+
if (!cq) {
mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
return;
@@ -67,6 +72,9 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
++cq->arm_sn;
cq->comp(cq);
+
+ if (atomic_dec_and_test(&cq->refcount))
+ complete(&cq->free);
}
void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
@@ -74,13 +82,13 @@ void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
struct mlx4_cq *cq;
- spin_lock(&cq_table->lock);
+ rcu_read_lock();
cq = radix_tree_lookup(&cq_table->tree, cqn & (dev->caps.num_cqs - 1));
if (cq)
atomic_inc(&cq->refcount);
- spin_unlock(&cq_table->lock);
+ rcu_read_unlock();
if (!cq) {
mlx4_warn(dev, "Async event for bogus CQ %08x\n", cqn);
@@ -328,6 +336,7 @@ err_radix:
spin_lock_irq(&cq_table->lock);
radix_tree_delete(&cq_table->tree, cq->cqn);
spin_unlock_irq(&cq_table->lock);
+ synchronize_rcu();
err_icm:
mlx4_cq_free_icm(dev, cq->cqn);
@@ -351,6 +360,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
spin_lock_irq(&cq_table->lock);
radix_tree_delete(&cq_table->tree, cq->cqn);
spin_unlock_irq(&cq_table->lock);
+ synchronize_rcu();
if (atomic_dec_and_test(&cq->refcount))
complete(&cq->free);
--
1.7.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <201305201753.10806.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-05-20 16:07 ` Roland Dreier
2013-05-20 19:51 ` Tom Tucker
2013-05-21 9:40 ` Or Gerlitz
2 siblings, 0 replies; 14+ messages in thread
From: Roland Dreier @ 2013-05-20 16:07 UTC (permalink / raw)
To: Jack Morgenstein
Cc: Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
On Mon, May 20, 2013 at 7:53 AM, Jack Morgenstein
<jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> This is racy and can cause use-after-free, null pointer dereference, etc, which
> result in kernel crashes.
Sounds fine and I'd be happy to apply your final patch, but I'd be
curious to know what the race is in more detail.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <201305201753.10806.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2013-05-20 16:07 ` Roland Dreier
@ 2013-05-20 19:51 ` Tom Tucker
[not found] ` <519A7EB2.8090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2013-05-21 9:40 ` Or Gerlitz
2 siblings, 1 reply; 14+ messages in thread
From: Tom Tucker @ 2013-05-20 19:51 UTC (permalink / raw)
To: Jack Morgenstein
Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
Hi Guys,
One other quick one. I've received conflicting claims on the validity of
the wc.opcode when wc.status != 0 for mlx4 hardware.
My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
error and therefore, the only way to recover what the opcode was is
through the wr_id you used when submitting the WR.
Is my reading of the code correct?
Thanks,
Tom
On 5/20/13 9:53 AM, Jack Morgenstein wrote:
> On Saturday 18 May 2013 00:37, Roland Dreier wrote:
>> On Fri, May 17, 2013 at 12:25 PM, Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>>> I'm looking at the Linux MLX4 net driver and found something that confuses me mightily. In particular in the file net/ethernet/mellanox/mlx4/cq.c, the mlx4_ib_completion function does not take any kind of lock when looking up the SW CQ in the radix tree, however, the mlx4_cq_event function does. In addition if I go look at the code paths where cq are removed from this tree, they are protected by spin_lock_irq. So I am baffled at this point as to what the locking strategy is and how this is supposed to work. I'm sure I'm missing something and would greatly appreciate it if someone would explain this.
>> This is a bit tricky. If you look at
>>
>> void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
>> {
>> struct mlx4_priv *priv = mlx4_priv(dev);
>> struct mlx4_cq_table *cq_table = &priv->cq_table;
>> int err;
>>
>> err = mlx4_HW2SW_CQ(dev, NULL, cq->cqn);
>> if (err)
>> mlx4_warn(dev, "HW2SW_CQ failed (%d) for CQN %06x\n",
>> err, cq->cqn);
>>
>> synchronize_irq(priv->eq_table.eq[cq->vector].irq);
>>
>> spin_lock_irq(&cq_table->lock);
>> radix_tree_delete(&cq_table->tree, cq->cqn);
>> spin_unlock_irq(&cq_table->lock);
>>
>> if (atomic_dec_and_test(&cq->refcount))
>> complete(&cq->free);
>> wait_for_completion(&cq->free);
>>
>> mlx4_cq_free_icm(dev, cq->cqn);
>> }
>>
>> you see that when freeing a CQ, we first do the HW2SW_CQ firmware
>> command; once this command completes, no more events will be generated
>> for that CQ. Then we do synchronize_irq for the CQ's interrupt
>> vector. Once that completes, no more completion handlers will be
>> running for the CQ, so we can safely delete the CQ from the radix tree
>> (relying on the radix tree's safety of deleting one entry while
>> possibly looking up other entries, so no lock is needed). We also use
>> the lock to synchronize against the CQ event function, which as you
>> noted does take the lock too.
>>
>> Basic idea is that we're tricky and careful so we can make the fast
>> path (completion interrupt handling) lock-free, but then use locks and
>> whatever else needed in the slow path (CQ async event handling, CQ
>> destroy).
>>
>> - R.
> =======================================================
>
> Roland, unfortunately we have seen that we need some locking on the
> cq completion handler (there is a stack trace which resulted from this
> lack of proper locking).
> In our current driver, we are using the patch below (which uses RCU locking
> instead of spinlocks). I can prepare a proper patch for the upstream kernel.
>
> ===================================================
> net/mlx4_core: Fix racy flow in the driver CQ completion handler
>
> The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
> the radix tree which is used to manage the table of CQs, nor does it increase
> the reference count of the CQ before invoking the user provided callback
> (and decrease it afterwards).
>
> This is racy and can cause use-after-free, null pointer dereference, etc, which
> result in kernel crashes.
>
> To fix this, we must do the following in mlx4_cq_completion:
> - increase the ref count on the cq before invoking the user callback, and
> decrement it after the callback.
> - Place a lock around the radix tree lookup/ref-count-increase
>
> Using an irq spinlock will not fix this issue. The problem is that under VPI,
> the ETH interface uses multiple msix irq's, which can result in one cq completion
> event interrupting another in-progress cq completion event. A deadlock results
> when the handler for the first cq completion grabs the spinlock, and is
> interrupted by the second completion before it has a chance to release the spinlock.
> The handler for the second completion will deadlock waiting for the spinlock
> to be released.
>
> The proper fix is to use the RCU mechanism for locking radix-tree accesses in
> the cq completion event handler (The radix-tree implementation uses the RCU
> mechanism, so rcu_read_lock/unlock in the reader, with rcu_synchronize in the
> updater, will do the job).
>
> Note that the same issue exists in mlx4_cq_event() (the cq async event
> handler), which also takes the same lock on the radix tree. Here, we replace the
> spinlock with an rcu_read_lock().
>
> This patch was motivated by the following report from the field:
>
> "[...] box panic'ed when trying to find a completion queue. There is
> no corruption but there is a possible race which could result in
> mlx4_cq_completion getting wrong height of the radix tree and
> following a bit too deep into the chains. In the other code which uses
> this radix tree the access is protected by the lock but
> mlx4_cq_completion is running in the interrupt context and cannot
> take locks, so instead it runs without any protection whatsoever."
>
> The stack trace below is from the mlnx ofed 1.5.3 driver running under RHEL5.7.
> (this driver uses the upstream kernel mlx4_cq_completion() code)
>
> PID: 8178 TASK: ffff810b91a52400 CPU: 11 COMMAND: "universal_consu"
>
> [exception RIP: radix_tree_lookup+38]
> RIP: ffffffff8016405f RSP: ffff81182fe3be90 RFLAGS: 00210002
> RAX: 6b6b6b6b6b6b6b8b RBX: ffff810c2fb90000 RCX: 0000000000000006
> RDX: 0000000000000001 RSI: 00000000000000c0 RDI: 6b6b6b6b6b6b6b6b
> RBP: 00000000000000c0 R8: 0000000000010000 R9: 000000000920ea94
> R10: 00000000000000b3 R11: ffffffff800c7ea5 R12: 00000000000000b3
> R13: ffff810c2b15a280 R14: ffff810b7a98ff58 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
>
> *** <IRQ stack> ***
>
> RIP: 00000000f5cd8859 RSP: 00000000f3eafa28 RFLAGS: 00200202
> RAX: 0000000000000000 RBX: 00000000f3eafaf8 RCX: 000000000b6d4f00
> RDX: 000000000000001c RSI: 000000000000001c RDI: 0000000000000000
> RBP: ffff810bf1a0a120 R8: 0000000000000000 R9: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
> R13: 0000000000000000 R14: ffff810b9e7d5bf0 R15: 0000000000000000
> ORIG_RAX: ffffffffffffff4c CS: 0023 SS: 002b
>
> The original patch was generated by Yishai Hadas,
> and reviewed by Or Gerlitz and Jack Morgenstein. A subsequent fix
> by Jack Morgenstein was reviewed by Eli Cohen.
>
> Signed-off-by: Jack Morgenstein <jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
> ---
> drivers/net/ethernet/mellanox/mlx4/cq.c | 14 ++++++++++++--
> 1 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> index ff91904..0e28258 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> @@ -57,8 +57,13 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
> {
> struct mlx4_cq *cq;
>
> + rcu_read_lock();
> cq = radix_tree_lookup(&mlx4_priv(dev)->cq_table.tree,
> cqn & (dev->caps.num_cqs - 1));
> + if (cq)
> + atomic_inc(&cq->refcount);
> + rcu_read_unlock();
> +
> if (!cq) {
> mlx4_dbg(dev, "Completion event for bogus CQ %08x\n", cqn);
> return;
> @@ -67,6 +72,9 @@ void mlx4_cq_completion(struct mlx4_dev *dev, u32 cqn)
> ++cq->arm_sn;
>
> cq->comp(cq);
> +
> + if (atomic_dec_and_test(&cq->refcount))
> + complete(&cq->free);
> }
>
> void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
> @@ -74,13 +82,13 @@ void mlx4_cq_event(struct mlx4_dev *dev, u32 cqn, int event_type)
> struct mlx4_cq_table *cq_table = &mlx4_priv(dev)->cq_table;
> struct mlx4_cq *cq;
>
> - spin_lock(&cq_table->lock);
> + rcu_read_lock();
>
> cq = radix_tree_lookup(&cq_table->tree, cqn & (dev->caps.num_cqs - 1));
> if (cq)
> atomic_inc(&cq->refcount);
>
> - spin_unlock(&cq_table->lock);
> + rcu_read_unlock();
>
> if (!cq) {
> mlx4_warn(dev, "Async event for bogus CQ %08x\n", cqn);
> @@ -328,6 +336,7 @@ err_radix:
> spin_lock_irq(&cq_table->lock);
> radix_tree_delete(&cq_table->tree, cq->cqn);
> spin_unlock_irq(&cq_table->lock);
> + synchronize_rcu();
>
> err_icm:
> mlx4_cq_free_icm(dev, cq->cqn);
> @@ -351,6 +360,7 @@ void mlx4_cq_free(struct mlx4_dev *dev, struct mlx4_cq *cq)
> spin_lock_irq(&cq_table->lock);
> radix_tree_delete(&cq_table->tree, cq->cqn);
> spin_unlock_irq(&cq_table->lock);
> + synchronize_rcu();
>
> if (atomic_dec_and_test(&cq->refcount))
> complete(&cq->free);
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: MLX4 Cq Question
[not found] ` <519A7EB2.8090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
@ 2013-05-20 19:58 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823736FD2955D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Hefty, Sean @ 2013-05-20 19:58 UTC (permalink / raw)
To: Tom Tucker, Jack Morgenstein
Cc: Roland Dreier, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
> My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
> owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
> error and therefore, the only way to recover what the opcode was is
> through the wr_id you used when submitting the WR.
>
> Is my reading of the code correct?
I believe this is true wrt the IB spec.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <1828884A29C6694DAF28B7E6B8A823736FD2955D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-05-20 20:10 ` Tom Tucker
2013-05-20 20:11 ` Or Gerlitz
1 sibling, 0 replies; 14+ messages in thread
From: Tom Tucker @ 2013-05-20 20:10 UTC (permalink / raw)
To: Hefty, Sean
Cc: Jack Morgenstein, Roland Dreier,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
On 5/20/13 2:58 PM, Hefty, Sean wrote:
>> My reading of the code (i.e. hw/mlx4/cq.c) is that the hardware cqe
>> owner_sr_opcode field contains MLX4_CQE_OPCODE_ERROR when there is an
>> error and therefore, the only way to recover what the opcode was is
>> through the wr_id you used when submitting the WR.
>>
>> Is my reading of the code correct?
> I believe this is true wrt the IB spec.
Thanks, this was my recollection as well.
Tom
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <1828884A29C6694DAF28B7E6B8A823736FD2955D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-05-20 20:10 ` Tom Tucker
@ 2013-05-20 20:11 ` Or Gerlitz
1 sibling, 0 replies; 14+ messages in thread
From: Or Gerlitz @ 2013-05-20 20:11 UTC (permalink / raw)
To: Hefty, Sean, Tom Tucker
Cc: Jack Morgenstein, Roland Dreier,
linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz
On Mon, May 20, 2013 at 10:58 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> I believe this is true wrt the IB spec.
indeed, don't go anywhere else on that corner.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <201305201753.10806.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2013-05-20 16:07 ` Roland Dreier
2013-05-20 19:51 ` Tom Tucker
@ 2013-05-21 9:40 ` Or Gerlitz
[not found] ` <519B4104.4090102-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2 siblings, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2013-05-21 9:40 UTC (permalink / raw)
To: Jack Morgenstein, Eli Cohen
Cc: Roland Dreier, Tom Tucker, linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 20/05/2013 17:53, Jack Morgenstein wrote:
> ===================================================
> net/mlx4_core: Fix racy flow in the driver CQ completion handler
>
> The mlx4 CQ completion handler, mlx4_cq_completion, doesn't bother to lock
> the radix tree which is used to manage the table of CQs, nor does it increase
> the reference count of the CQ before invoking the user provided callback
> (and decrease it afterwards).
>
> This is racy and can cause use-after-free, null pointer dereference, etc, which
> result in kernel crashes.
>
> To fix this, we must do the following in mlx4_cq_completion:
> - increase the ref count on the cq before invoking the user callback, and
> decrement it after the callback.
> - Place a lock around the radix tree lookup/ref-count-increase
>
> Using an irq spinlock will not fix this issue. The problem is that under VPI,
> the ETH interface uses multiple msix irq's, which can result in one cq completion
> event interrupting another in-progress cq completion event. A deadlock results
> when the handler for the first cq completion grabs the spinlock, and is
> interrupted by the second completion before it has a chance to release the spinlock.
> The handler for the second completion will deadlock waiting for the spinlock
> to be released.
I am not sure to follow on two pieces here:
1. why we say that only mlx4_en uses multiple msix irq's? mlx4_ib also
exposes multiple vectors (--> EQs --> MSI-X --> IRQ)
and the iser driver use that, e.g creates multiple CQs each on different EQ
2. is possible in the Linux kernel for one hard irq callback to flash on
CPU X while another hard irq callback is running on the same CPU?
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <519B4104.4090102-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-05-21 10:42 ` Bart Van Assche
[not found] ` <519B4F81.9040108-HInyCGIudOg@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Bart Van Assche @ 2013-05-21 10:42 UTC (permalink / raw)
To: Or Gerlitz
Cc: Jack Morgenstein, Eli Cohen, Roland Dreier, Tom Tucker,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 05/21/13 11:40, Or Gerlitz wrote:
> 2. is possible in the Linux kernel for one hard irq callback to flash on
> CPU X while another hard irq callback is running on the same CPU?
I think that from kernel 2.6.35 on MSI IRQs are no longer nested. See
also
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=753649dbc49345a73a2454c770a3f2d54d11aec6
or http://lwn.net/Articles/380931/.
Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <519B4F81.9040108-HInyCGIudOg@public.gmane.org>
@ 2013-05-21 10:43 ` Or Gerlitz
[not found] ` <519B4FEB.5090701-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Or Gerlitz @ 2013-05-21 10:43 UTC (permalink / raw)
To: Jack Morgenstein
Cc: Bart Van Assche, Eli Cohen, Roland Dreier, Tom Tucker,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 21/05/2013 13:42, Bart Van Assche wrote:
> On 05/21/13 11:40, Or Gerlitz wrote:
>> 2. is possible in the Linux kernel for one hard irq callback to flash on
>> CPU X while another hard irq callback is running on the same CPU?
>
> I think that from kernel 2.6.35 on MSI IRQs are no longer nested. See
> also
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=753649dbc49345a73a2454c770a3f2d54d11aec6
> or http://lwn.net/Articles/380931/
thanks, so suppose we agree on that, still the patch makes sense as the
race is there, but does the patch has to change?
Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <519B4FEB.5090701-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-05-21 14:13 ` Jack Morgenstein
[not found] ` <201305211713.24370.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
0 siblings, 1 reply; 14+ messages in thread
From: Jack Morgenstein @ 2013-05-21 14:13 UTC (permalink / raw)
To: Or Gerlitz
Cc: Bart Van Assche, Eli Cohen, Roland Dreier, Tom Tucker,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On Tuesday 21 May 2013 13:43, Or Gerlitz wrote:
> On 21/05/2013 13:42, Bart Van Assche wrote:
> > On 05/21/13 11:40, Or Gerlitz wrote:
> >> 2. is possible in the Linux kernel for one hard irq callback to flash on
> >> CPU X while another hard irq callback is running on the same CPU?
> >
> > I think that from kernel 2.6.35 on MSI IRQs are no longer nested. See
> > also
> > http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=753649dbc49345a73a2454c770a3f2d54d11aec6
> > or http://lwn.net/Articles/380931/
>
> thanks, so suppose we agree on that, still the patch makes sense as the
> race is there, but does the patch has to change?
>
> Or.
>
I just need to verify that the patch can be applied correctly on the upstream kernel.
The use of RCU (and not spinlock) makes sense from a performance standpoint
in any case. We do NOT want to force mlx4_cq_completion to have a spinlock
which is device-global, resulting in having completion event processing be
single-threaded in effect).
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: MLX4 Cq Question
[not found] ` <201305211713.24370.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-05-21 15:10 ` Or Gerlitz
0 siblings, 0 replies; 14+ messages in thread
From: Or Gerlitz @ 2013-05-21 15:10 UTC (permalink / raw)
To: Jack Morgenstein
Cc: Bart Van Assche, Eli Cohen, Roland Dreier, Tom Tucker,
linux-rdma-u79uwXL29TY76Z2rM5mHXA
On 21/05/2013 17:13, Jack Morgenstein wrote:
> I just need to verify that the patch can be applied correctly on the upstream kernel.
> The use of RCU (and not spinlock) makes sense from a performance standpoint
> in any case. We do NOT want to force mlx4_cq_completion to have a spinlock
> which is device-global, resulting in having completion event processing be
> single-threaded in effect).
cool, lets do that and re-submit
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2013-05-21 15:10 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-17 19:25 MLX4 Cq Question Tom Tucker
[not found] ` <51968438.7070907-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2013-05-17 21:37 ` Roland Dreier
[not found] ` <CAG4TOxNi0PxxskqXgxRhMPG0bmr+sS-x0_RG-zKyvLW1LNzoBg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-05-19 6:09 ` Or Gerlitz
2013-05-20 14:53 ` Jack Morgenstein
[not found] ` <201305201753.10806.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2013-05-20 16:07 ` Roland Dreier
2013-05-20 19:51 ` Tom Tucker
[not found] ` <519A7EB2.8090206-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
2013-05-20 19:58 ` Hefty, Sean
[not found] ` <1828884A29C6694DAF28B7E6B8A823736FD2955D-P5GAC/sN6hkd3b2yrw5b5LfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-05-20 20:10 ` Tom Tucker
2013-05-20 20:11 ` Or Gerlitz
2013-05-21 9:40 ` Or Gerlitz
[not found] ` <519B4104.4090102-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-05-21 10:42 ` Bart Van Assche
[not found] ` <519B4F81.9040108-HInyCGIudOg@public.gmane.org>
2013-05-21 10:43 ` Or Gerlitz
[not found] ` <519B4FEB.5090701-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-05-21 14:13 ` Jack Morgenstein
[not found] ` <201305211713.24370.jackm-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2013-05-21 15:10 ` Or Gerlitz
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.