All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-next 0/3] EFA statistics minor updates
@ 2020-04-20  6:22 Gal Pressman
  2020-04-20  6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, Alexander Matushevsky, Gal Pressman

This series contains three small patches to collect a few counters we
found helpful when encountering different issues.

Regards,
Gal

Gal Pressman (3):
  RDMA/efa: Report create CQ error counter
  RDMA/efa: Count mmap failures
  RDMA/efa: Count admin commands errors

 drivers/infiniband/hw/efa/efa.h       |  3 ++-
 drivers/infiniband/hw/efa/efa_com.c   |  5 ++++-
 drivers/infiniband/hw/efa/efa_com.h   |  3 ++-
 drivers/infiniband/hw/efa/efa_verbs.c | 13 +++++++++++--
 4 files changed, 19 insertions(+), 5 deletions(-)


base-commit: 8f3d9f354286745c751374f5f1fcafee6b3f3136
-- 
2.26.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter
  2020-04-20  6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
@ 2020-04-20  6:22 ` Gal Pressman
  2020-04-20  6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah,
	Yossi Leybovich

Create CQ errors are already being counted, report them along all other
counters.

Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
 drivers/infiniband/hw/efa/efa_verbs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 5c57098a4aee..b555845d6c14 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -41,6 +41,7 @@ struct efa_user_mmap_entry {
 	op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \
 	op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \
 	op(EFA_CREATE_QP_ERR, "create_qp_err") \
+	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
 	op(EFA_REG_MR_ERR, "reg_mr_err") \
 	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
 	op(EFA_CREATE_AH_ERR, "create_ah_err")
@@ -1753,6 +1754,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
 	stats->value[EFA_KEEP_ALIVE_RCVD] = atomic64_read(&s->keep_alive_rcvd);
 	stats->value[EFA_ALLOC_PD_ERR] = atomic64_read(&s->sw_stats.alloc_pd_err);
 	stats->value[EFA_CREATE_QP_ERR] = atomic64_read(&s->sw_stats.create_qp_err);
+	stats->value[EFA_CREATE_CQ_ERR] = atomic64_read(&s->sw_stats.create_cq_err);
 	stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err);
 	stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err);
 	stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err);
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-20  6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
  2020-04-20  6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
@ 2020-04-20  6:22 ` Gal Pressman
  2020-04-24 14:59   ` Jason Gunthorpe
  2020-04-20  6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
  2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
  3 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-20  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, Alexander Matushevsky, Gal Pressman, Firas JahJah,
	Yossi Leybovich

Add a new stat that counts mmap failures, which might help when
debugging different issues.

Reviewed-by: Firas JahJah <firasj@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
 drivers/infiniband/hw/efa/efa.h       | 3 ++-
 drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
index aa7396a1588a..77c9ff798117 100644
--- a/drivers/infiniband/hw/efa/efa.h
+++ b/drivers/infiniband/hw/efa/efa.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
 /*
- * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
  */
 
 #ifndef _EFA_H_
@@ -40,6 +40,7 @@ struct efa_sw_stats {
 	atomic64_t reg_mr_err;
 	atomic64_t alloc_ucontext_err;
 	atomic64_t create_ah_err;
+	atomic64_t mmap_err;
 };
 
 /* Don't use anything other than atomic64 */
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index b555845d6c14..75eef1ec2474 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
 	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
 	op(EFA_REG_MR_ERR, "reg_mr_err") \
 	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
-	op(EFA_CREATE_AH_ERR, "create_ah_err")
+	op(EFA_CREATE_AH_ERR, "create_ah_err") \
+	op(EFA_MMAP_ERR, "mmap_err")
 
 #define EFA_STATS_ENUM(ename, name) ename,
 #define EFA_STATS_STR(ename, name) [ename] = name,
@@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
 		ibdev_dbg(&dev->ibdev,
 			  "pgoff[%#lx] does not have valid entry\n",
 			  vma->vm_pgoff);
+		atomic64_inc(&dev->stats.sw_stats.mmap_err);
 		return -EINVAL;
 	}
 	entry = to_emmap(rdma_entry);
@@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
 		err = -EINVAL;
 	}
 
-	if (err)
+	if (err) {
 		ibdev_dbg(
 			&dev->ibdev,
 			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
 			entry->address, rdma_entry->npages * PAGE_SIZE,
 			entry->mmap_flag, err);
+		atomic64_inc(&dev->stats.sw_stats.mmap_err);
+	}
 
 	rdma_user_mmap_entry_put(rdma_entry);
 	return err;
@@ -1758,6 +1762,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
 	stats->value[EFA_REG_MR_ERR] = atomic64_read(&s->sw_stats.reg_mr_err);
 	stats->value[EFA_ALLOC_UCONTEXT_ERR] = atomic64_read(&s->sw_stats.alloc_ucontext_err);
 	stats->value[EFA_CREATE_AH_ERR] = atomic64_read(&s->sw_stats.create_ah_err);
+	stats->value[EFA_MMAP_ERR] = atomic64_read(&s->sw_stats.mmap_err);
 
 	return ARRAY_SIZE(efa_stats_names);
 }
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH for-next 3/3] RDMA/efa: Count admin commands errors
  2020-04-20  6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
  2020-04-20  6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
  2020-04-20  6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
@ 2020-04-20  6:22 ` Gal Pressman
  2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe
  3 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-20  6:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Doug Ledford
  Cc: linux-rdma, Alexander Matushevsky, Gal Pressman,
	Daniel Kranzdorf, Yossi Leybovich

Add a new stat that counts admin commands failures, which might help
when debugging different issues.

Reviewed-by: Daniel Kranzdorf <dkkranzd@amazon.com>
Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
Signed-off-by: Gal Pressman <galpress@amazon.com>
---
 drivers/infiniband/hw/efa/efa_com.c   | 5 ++++-
 drivers/infiniband/hw/efa/efa_com.h   | 3 ++-
 drivers/infiniband/hw/efa/efa_verbs.c | 2 ++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_com.c b/drivers/infiniband/hw/efa/efa_com.c
index 7fce69f5568f..336bc2c57bb1 100644
--- a/drivers/infiniband/hw/efa/efa_com.c
+++ b/drivers/infiniband/hw/efa/efa_com.c
@@ -631,17 +631,20 @@ int efa_com_cmd_exec(struct efa_com_admin_queue *aq,
 			cmd->aq_common_descriptor.opcode, PTR_ERR(comp_ctx));
 
 		up(&aq->avail_cmds);
+		atomic64_inc(&aq->stats.cmd_err);
 		return PTR_ERR(comp_ctx);
 	}
 
 	err = efa_com_wait_and_process_admin_cq(comp_ctx, aq);
-	if (err)
+	if (err) {
 		ibdev_err_ratelimited(
 			aq->efa_dev,
 			"Failed to process command %s (opcode %u) comp_status %d err %d\n",
 			efa_com_cmd_str(cmd->aq_common_descriptor.opcode),
 			cmd->aq_common_descriptor.opcode, comp_ctx->comp_status,
 			err);
+		atomic64_inc(&aq->stats.cmd_err);
+	}
 
 	up(&aq->avail_cmds);
 
diff --git a/drivers/infiniband/hw/efa/efa_com.h b/drivers/infiniband/hw/efa/efa_com.h
index c67dd8109d1c..5e4c88877ddb 100644
--- a/drivers/infiniband/hw/efa/efa_com.h
+++ b/drivers/infiniband/hw/efa/efa_com.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
 /*
- * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
+ * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
  */
 
 #ifndef _EFA_COM_H_
@@ -47,6 +47,7 @@ struct efa_com_admin_sq {
 struct efa_com_stats_admin {
 	atomic64_t submitted_cmd;
 	atomic64_t completed_cmd;
+	atomic64_t cmd_err;
 	atomic64_t no_completion;
 };
 
diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 75eef1ec2474..ce7a3ead1ba7 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -37,6 +37,7 @@ struct efa_user_mmap_entry {
 	op(EFA_RX_DROPS, "rx_drops") \
 	op(EFA_SUBMITTED_CMDS, "submitted_cmds") \
 	op(EFA_COMPLETED_CMDS, "completed_cmds") \
+	op(EFA_CMDS_ERR, "cmds_err") \
 	op(EFA_NO_COMPLETION_CMDS, "no_completion_cmds") \
 	op(EFA_KEEP_ALIVE_RCVD, "keep_alive_rcvd") \
 	op(EFA_ALLOC_PD_ERR, "alloc_pd_err") \
@@ -1752,6 +1753,7 @@ int efa_get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
 	as = &dev->edev.aq.stats;
 	stats->value[EFA_SUBMITTED_CMDS] = atomic64_read(&as->submitted_cmd);
 	stats->value[EFA_COMPLETED_CMDS] = atomic64_read(&as->completed_cmd);
+	stats->value[EFA_CMDS_ERR] = atomic64_read(&as->cmd_err);
 	stats->value[EFA_NO_COMPLETION_CMDS] = atomic64_read(&as->no_completion);
 
 	s = &dev->stats;
-- 
2.26.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-20  6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
@ 2020-04-24 14:59   ` Jason Gunthorpe
  2020-04-24 15:25     ` Gal Pressman
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 14:59 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> Add a new stat that counts mmap failures, which might help when
> debugging different issues.
> 
> Reviewed-by: Firas JahJah <firasj@amazon.com>
> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> Signed-off-by: Gal Pressman <galpress@amazon.com>
>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> index aa7396a1588a..77c9ff798117 100644
> +++ b/drivers/infiniband/hw/efa/efa.h
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>  /*
> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>   */
>  
>  #ifndef _EFA_H_
> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>  	atomic64_t reg_mr_err;
>  	atomic64_t alloc_ucontext_err;
>  	atomic64_t create_ah_err;
> +	atomic64_t mmap_err;
>  };
>  
>  /* Don't use anything other than atomic64 */
> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> index b555845d6c14..75eef1ec2474 100644
> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
> +	op(EFA_MMAP_ERR, "mmap_err")
>  
>  #define EFA_STATS_ENUM(ename, name) ename,
>  #define EFA_STATS_STR(ename, name) [ename] = name,
> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>  		ibdev_dbg(&dev->ibdev,
>  			  "pgoff[%#lx] does not have valid entry\n",
>  			  vma->vm_pgoff);
> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>  		return -EINVAL;
>  	}
>  	entry = to_emmap(rdma_entry);
> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>  		err = -EINVAL;
>  	}
>  
> -	if (err)
> +	if (err) {
>  		ibdev_dbg(
>  			&dev->ibdev,
>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>  			entry->address, rdma_entry->npages * PAGE_SIZE,
>  			entry->mmap_flag, err);
> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);

Really? Isn't this something that is only possible with a buggy
rdma-core provider? Why count it?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-24 14:59   ` Jason Gunthorpe
@ 2020-04-24 15:25     ` Gal Pressman
  2020-04-24 18:26       ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-24 15:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On 24/04/2020 17:59, Jason Gunthorpe wrote:
> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>> Add a new stat that counts mmap failures, which might help when
>> debugging different issues.
>>
>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
>>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>> index aa7396a1588a..77c9ff798117 100644
>> +++ b/drivers/infiniband/hw/efa/efa.h
>> @@ -1,6 +1,6 @@
>>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>>  /*
>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>>   */
>>  
>>  #ifndef _EFA_H_
>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>>  	atomic64_t reg_mr_err;
>>  	atomic64_t alloc_ucontext_err;
>>  	atomic64_t create_ah_err;
>> +	atomic64_t mmap_err;
>>  };
>>  
>>  /* Don't use anything other than atomic64 */
>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>> index b555845d6c14..75eef1ec2474 100644
>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
>>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
>> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
>> +	op(EFA_MMAP_ERR, "mmap_err")
>>  
>>  #define EFA_STATS_ENUM(ename, name) ename,
>>  #define EFA_STATS_STR(ename, name) [ename] = name,
>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>  		ibdev_dbg(&dev->ibdev,
>>  			  "pgoff[%#lx] does not have valid entry\n",
>>  			  vma->vm_pgoff);
>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>  		return -EINVAL;
>>  	}
>>  	entry = to_emmap(rdma_entry);
>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>  		err = -EINVAL;
>>  	}
>>  
>> -	if (err)
>> +	if (err) {
>>  		ibdev_dbg(
>>  			&dev->ibdev,
>>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>>  			entry->address, rdma_entry->npages * PAGE_SIZE,
>>  			entry->mmap_flag, err);
>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
> 
> Really? Isn't this something that is only possible with a buggy
> rdma-core provider? Why count it?

Though unlikely, it could happen, otherwise this error flow wouldn't exist in
the first place.

If for some reason a customer app steps on a bug we're not aware of, this
counter could serve as a red flag.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-24 15:25     ` Gal Pressman
@ 2020-04-24 18:26       ` Jason Gunthorpe
  2020-04-26  6:42         ` Gal Pressman
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-24 18:26 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
> On 24/04/2020 17:59, Jason Gunthorpe wrote:
> > On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> >> Add a new stat that counts mmap failures, which might help when
> >> debugging different issues.
> >>
> >> Reviewed-by: Firas JahJah <firasj@amazon.com>
> >> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> >> Signed-off-by: Gal Pressman <galpress@amazon.com>
> >>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
> >>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
> >>  2 files changed, 9 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> >> index aa7396a1588a..77c9ff798117 100644
> >> +++ b/drivers/infiniband/hw/efa/efa.h
> >> @@ -1,6 +1,6 @@
> >>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> >>  /*
> >> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> >> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
> >>   */
> >>  
> >>  #ifndef _EFA_H_
> >> @@ -40,6 +40,7 @@ struct efa_sw_stats {
> >>  	atomic64_t reg_mr_err;
> >>  	atomic64_t alloc_ucontext_err;
> >>  	atomic64_t create_ah_err;
> >> +	atomic64_t mmap_err;
> >>  };
> >>  
> >>  /* Don't use anything other than atomic64 */
> >> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> >> index b555845d6c14..75eef1ec2474 100644
> >> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> >> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
> >>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
> >>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
> >>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> >> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
> >> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
> >> +	op(EFA_MMAP_ERR, "mmap_err")
> >>  
> >>  #define EFA_STATS_ENUM(ename, name) ename,
> >>  #define EFA_STATS_STR(ename, name) [ename] = name,
> >> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>  		ibdev_dbg(&dev->ibdev,
> >>  			  "pgoff[%#lx] does not have valid entry\n",
> >>  			  vma->vm_pgoff);
> >> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >>  		return -EINVAL;
> >>  	}
> >>  	entry = to_emmap(rdma_entry);
> >> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>  		err = -EINVAL;
> >>  	}
> >>  
> >> -	if (err)
> >> +	if (err) {
> >>  		ibdev_dbg(
> >>  			&dev->ibdev,
> >>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
> >>  			entry->address, rdma_entry->npages * PAGE_SIZE,
> >>  			entry->mmap_flag, err);
> >> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
> > 
> > Really? Isn't this something that is only possible with a buggy
> > rdma-core provider? Why count it?
> 
> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
> the first place.
> 
> If for some reason a customer app steps on a bug we're not aware of, this
> counter could serve as a red flag.

But there are lots of cases where a buggy provider can cause error
exits, why choose this one to count against all the others?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-24 18:26       ` Jason Gunthorpe
@ 2020-04-26  6:42         ` Gal Pressman
  2020-04-26 13:30           ` Jason Gunthorpe
  0 siblings, 1 reply; 11+ messages in thread
From: Gal Pressman @ 2020-04-26  6:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On 24/04/2020 21:26, Jason Gunthorpe wrote:
> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
>> On 24/04/2020 17:59, Jason Gunthorpe wrote:
>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>>>> Add a new stat that counts mmap failures, which might help when
>>>> debugging different issues.
>>>>
>>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>>>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
>>>>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>>>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>>>> index aa7396a1588a..77c9ff798117 100644
>>>> +++ b/drivers/infiniband/hw/efa/efa.h
>>>> @@ -1,6 +1,6 @@
>>>>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>>>>  /*
>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>>   */
>>>>  
>>>>  #ifndef _EFA_H_
>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>>>>  	atomic64_t reg_mr_err;
>>>>  	atomic64_t alloc_ucontext_err;
>>>>  	atomic64_t create_ah_err;
>>>> +	atomic64_t mmap_err;
>>>>  };
>>>>  
>>>>  /* Don't use anything other than atomic64 */
>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>>>> index b555845d6c14..75eef1ec2474 100644
>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>>>>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>>>>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
>>>>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>>>> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
>>>> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
>>>> +	op(EFA_MMAP_ERR, "mmap_err")
>>>>  
>>>>  #define EFA_STATS_ENUM(ename, name) ename,
>>>>  #define EFA_STATS_STR(ename, name) [ename] = name,
>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>  		ibdev_dbg(&dev->ibdev,
>>>>  			  "pgoff[%#lx] does not have valid entry\n",
>>>>  			  vma->vm_pgoff);
>>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>>  		return -EINVAL;
>>>>  	}
>>>>  	entry = to_emmap(rdma_entry);
>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>  		err = -EINVAL;
>>>>  	}
>>>>  
>>>> -	if (err)
>>>> +	if (err) {
>>>>  		ibdev_dbg(
>>>>  			&dev->ibdev,
>>>>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>>>>  			entry->address, rdma_entry->npages * PAGE_SIZE,
>>>>  			entry->mmap_flag, err);
>>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>
>>> Really? Isn't this something that is only possible with a buggy
>>> rdma-core provider? Why count it?
>>
>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
>> the first place.
>>
>> If for some reason a customer app steps on a bug we're not aware of, this
>> counter could serve as a red flag.
> 
> But there are lots of cases where a buggy provider can cause error
> exits, why choose this one to count against all the others?

It's not one against all others, most if not all of our userspace facing API
error flows have a similar counter.
And TBH, I think that the mmap flow is quite convoluted with the cookie response
from the crate verb, so it deserves a counter IMO.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-26  6:42         ` Gal Pressman
@ 2020-04-26 13:30           ` Jason Gunthorpe
  2020-04-26 14:17             ` Gal Pressman
  0 siblings, 1 reply; 11+ messages in thread
From: Jason Gunthorpe @ 2020-04-26 13:30 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote:
> On 24/04/2020 21:26, Jason Gunthorpe wrote:
> > On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
> >> On 24/04/2020 17:59, Jason Gunthorpe wrote:
> >>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
> >>>> Add a new stat that counts mmap failures, which might help when
> >>>> debugging different issues.
> >>>>
> >>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
> >>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
> >>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
> >>>>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
> >>>>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
> >>>>  2 files changed, 9 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
> >>>> index aa7396a1588a..77c9ff798117 100644
> >>>> +++ b/drivers/infiniband/hw/efa/efa.h
> >>>> @@ -1,6 +1,6 @@
> >>>>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
> >>>>  /*
> >>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
> >>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
> >>>>   */
> >>>>  
> >>>>  #ifndef _EFA_H_
> >>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
> >>>>  	atomic64_t reg_mr_err;
> >>>>  	atomic64_t alloc_ucontext_err;
> >>>>  	atomic64_t create_ah_err;
> >>>> +	atomic64_t mmap_err;
> >>>>  };
> >>>>  
> >>>>  /* Don't use anything other than atomic64 */
> >>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
> >>>> index b555845d6c14..75eef1ec2474 100644
> >>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
> >>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
> >>>>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
> >>>>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
> >>>>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
> >>>> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
> >>>> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
> >>>> +	op(EFA_MMAP_ERR, "mmap_err")
> >>>>  
> >>>>  #define EFA_STATS_ENUM(ename, name) ename,
> >>>>  #define EFA_STATS_STR(ename, name) [ename] = name,
> >>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>>>  		ibdev_dbg(&dev->ibdev,
> >>>>  			  "pgoff[%#lx] does not have valid entry\n",
> >>>>  			  vma->vm_pgoff);
> >>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >>>>  		return -EINVAL;
> >>>>  	}
> >>>>  	entry = to_emmap(rdma_entry);
> >>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
> >>>>  		err = -EINVAL;
> >>>>  	}
> >>>>  
> >>>> -	if (err)
> >>>> +	if (err) {
> >>>>  		ibdev_dbg(
> >>>>  			&dev->ibdev,
> >>>>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
> >>>>  			entry->address, rdma_entry->npages * PAGE_SIZE,
> >>>>  			entry->mmap_flag, err);
> >>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
> >>>
> >>> Really? Isn't this something that is only possible with a buggy
> >>> rdma-core provider? Why count it?
> >>
> >> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
> >> the first place.
> >>
> >> If for some reason a customer app steps on a bug we're not aware of, this
> >> counter could serve as a red flag.
> > 
> > But there are lots of cases where a buggy provider can cause error
> > exits, why choose this one to count against all the others?
> 
> It's not one against all others, most if not all of our userspace facing API
> error flows have a similar counter.

Hurm, seems a bit strange, but OK

> And TBH, I think that the mmap flow is quite convoluted with the cookie response
> from the crate verb, so it deserves a counter IMO.

How so? Userspace takes the u64 from the command and pass it to mmap,
what is convoluted?

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 2/3] RDMA/efa: Count mmap failures
  2020-04-26 13:30           ` Jason Gunthorpe
@ 2020-04-26 14:17             ` Gal Pressman
  0 siblings, 0 replies; 11+ messages in thread
From: Gal Pressman @ 2020-04-26 14:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Doug Ledford, linux-rdma, Alexander Matushevsky, Firas JahJah,
	Yossi Leybovich

On 26/04/2020 16:30, Jason Gunthorpe wrote:
> On Sun, Apr 26, 2020 at 09:42:27AM +0300, Gal Pressman wrote:
>> On 24/04/2020 21:26, Jason Gunthorpe wrote:
>>> On Fri, Apr 24, 2020 at 06:25:54PM +0300, Gal Pressman wrote:
>>>> On 24/04/2020 17:59, Jason Gunthorpe wrote:
>>>>> On Mon, Apr 20, 2020 at 09:22:12AM +0300, Gal Pressman wrote:
>>>>>> Add a new stat that counts mmap failures, which might help when
>>>>>> debugging different issues.
>>>>>>
>>>>>> Reviewed-by: Firas JahJah <firasj@amazon.com>
>>>>>> Reviewed-by: Yossi Leybovich <sleybo@amazon.com>
>>>>>> Signed-off-by: Gal Pressman <galpress@amazon.com>
>>>>>>  drivers/infiniband/hw/efa/efa.h       | 3 ++-
>>>>>>  drivers/infiniband/hw/efa/efa_verbs.c | 9 +++++++--
>>>>>>  2 files changed, 9 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/hw/efa/efa.h b/drivers/infiniband/hw/efa/efa.h
>>>>>> index aa7396a1588a..77c9ff798117 100644
>>>>>> +++ b/drivers/infiniband/hw/efa/efa.h
>>>>>> @@ -1,6 +1,6 @@
>>>>>>  /* SPDX-License-Identifier: GPL-2.0 OR BSD-2-Clause */
>>>>>>  /*
>>>>>> - * Copyright 2018-2019 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>>>> + * Copyright 2018-2020 Amazon.com, Inc. or its affiliates. All rights reserved.
>>>>>>   */
>>>>>>  
>>>>>>  #ifndef _EFA_H_
>>>>>> @@ -40,6 +40,7 @@ struct efa_sw_stats {
>>>>>>  	atomic64_t reg_mr_err;
>>>>>>  	atomic64_t alloc_ucontext_err;
>>>>>>  	atomic64_t create_ah_err;
>>>>>> +	atomic64_t mmap_err;
>>>>>>  };
>>>>>>  
>>>>>>  /* Don't use anything other than atomic64 */
>>>>>> diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
>>>>>> index b555845d6c14..75eef1ec2474 100644
>>>>>> +++ b/drivers/infiniband/hw/efa/efa_verbs.c
>>>>>> @@ -44,7 +44,8 @@ struct efa_user_mmap_entry {
>>>>>>  	op(EFA_CREATE_CQ_ERR, "create_cq_err") \
>>>>>>  	op(EFA_REG_MR_ERR, "reg_mr_err") \
>>>>>>  	op(EFA_ALLOC_UCONTEXT_ERR, "alloc_ucontext_err") \
>>>>>> -	op(EFA_CREATE_AH_ERR, "create_ah_err")
>>>>>> +	op(EFA_CREATE_AH_ERR, "create_ah_err") \
>>>>>> +	op(EFA_MMAP_ERR, "mmap_err")
>>>>>>  
>>>>>>  #define EFA_STATS_ENUM(ename, name) ename,
>>>>>>  #define EFA_STATS_STR(ename, name) [ename] = name,
>>>>>> @@ -1569,6 +1570,7 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>>>  		ibdev_dbg(&dev->ibdev,
>>>>>>  			  "pgoff[%#lx] does not have valid entry\n",
>>>>>>  			  vma->vm_pgoff);
>>>>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>>>>  		return -EINVAL;
>>>>>>  	}
>>>>>>  	entry = to_emmap(rdma_entry);
>>>>>> @@ -1604,12 +1606,14 @@ static int __efa_mmap(struct efa_dev *dev, struct efa_ucontext *ucontext,
>>>>>>  		err = -EINVAL;
>>>>>>  	}
>>>>>>  
>>>>>> -	if (err)
>>>>>> +	if (err) {
>>>>>>  		ibdev_dbg(
>>>>>>  			&dev->ibdev,
>>>>>>  			"Couldn't mmap address[%#llx] length[%#zx] mmap_flag[%d] err[%d]\n",
>>>>>>  			entry->address, rdma_entry->npages * PAGE_SIZE,
>>>>>>  			entry->mmap_flag, err);
>>>>>> +		atomic64_inc(&dev->stats.sw_stats.mmap_err);
>>>>>
>>>>> Really? Isn't this something that is only possible with a buggy
>>>>> rdma-core provider? Why count it?
>>>>
>>>> Though unlikely, it could happen, otherwise this error flow wouldn't exist in
>>>> the first place.
>>>>
>>>> If for some reason a customer app steps on a bug we're not aware of, this
>>>> counter could serve as a red flag.
>>>
>>> But there are lots of cases where a buggy provider can cause error
>>> exits, why choose this one to count against all the others?
>>
>> It's not one against all others, most if not all of our userspace facing API
>> error flows have a similar counter.
> 
> Hurm, seems a bit strange, but OK
> 
>> And TBH, I think that the mmap flow is quite convoluted with the cookie response
>> from the crate verb, so it deserves a counter IMO.
> 
> How so? Userspace takes the u64 from the command and pass it to mmap,
> what is convoluted?

It's the only flow that involves two phases/userspace calls and not a single
command-response call, so it's a bit more error prone. Convoluted was a bit
harsh :).

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH for-next 0/3] EFA statistics minor updates
  2020-04-20  6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
                   ` (2 preceding siblings ...)
  2020-04-20  6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
@ 2020-05-02 23:33 ` Jason Gunthorpe
  3 siblings, 0 replies; 11+ messages in thread
From: Jason Gunthorpe @ 2020-05-02 23:33 UTC (permalink / raw)
  To: Gal Pressman; +Cc: Doug Ledford, linux-rdma, Alexander Matushevsky

On Mon, Apr 20, 2020 at 09:22:10AM +0300, Gal Pressman wrote:
> This series contains three small patches to collect a few counters we
> found helpful when encountering different issues.

Applied to for-next, thanks

Jason

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2020-05-02 23:33 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-20  6:22 [PATCH for-next 0/3] EFA statistics minor updates Gal Pressman
2020-04-20  6:22 ` [PATCH for-next 1/3] RDMA/efa: Report create CQ error counter Gal Pressman
2020-04-20  6:22 ` [PATCH for-next 2/3] RDMA/efa: Count mmap failures Gal Pressman
2020-04-24 14:59   ` Jason Gunthorpe
2020-04-24 15:25     ` Gal Pressman
2020-04-24 18:26       ` Jason Gunthorpe
2020-04-26  6:42         ` Gal Pressman
2020-04-26 13:30           ` Jason Gunthorpe
2020-04-26 14:17             ` Gal Pressman
2020-04-20  6:22 ` [PATCH for-next 3/3] RDMA/efa: Count admin commands errors Gal Pressman
2020-05-02 23:33 ` [PATCH for-next 0/3] EFA statistics minor updates Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.