* [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8
From: Dennis Dalessandro @ 2016-07-01 23:00 UTC (permalink / raw)
  To: dledford@redhat.com
  Cc: Jason Gunthorpe, Mike Marciniszyn, Dean Luick, Jakub Pawlak,
	Tadeusz Struk, linux-rdma@vger.kernel.org, Ira Weiny,
	Mitko Haralanov, Ashutosh Dixit, Easwar Hariharan,
	Sebastian Sanchez, Jubin John, Jianxin Xiong

Hi Doug,

Here is a set of fixes and improvements intended for the next release. They
apply on top of the last set of RC fixes previously posted.

Of particular note here is the twsi code cleanup that was asked for
previously while we were in staging. I think the two patches from Dean do
the job of no longer duplicating what is already present in the kernel.

The patches from Mike improve rdmavt by making the posting of sends table
driven, which is easier to work with and extend.
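
To illustrate the table-driven idea named in Mike's patch titles, here is a
rough sketch with made-up names (not the actual rdmavt structures): the post
send path consults a per-opcode table instead of an ever-growing switch, so
a driver extends support by filling in table slots.

#include <linux/kernel.h>
#include <linux/bitops.h>
#include <rdma/ib_verbs.h>

/* Hypothetical per-opcode table entry; the field names here are
 * invented for illustration and are not the rdmavt API.
 */
struct wr_parms {
	size_t length;		/* expected size of the WR for this opcode */
	u32 qpt_support;	/* bitmask of QP types that support it */
};

static const struct wr_parms parms[] = {
	[IB_WR_SEND] = {
		.length = sizeof(struct ib_send_wr),
		.qpt_support = BIT(IB_QPT_UD) | BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
	},
	[IB_WR_RDMA_WRITE] = {
		.length = sizeof(struct ib_rdma_wr),
		.qpt_support = BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
	},
};

/* The validity check becomes a table lookup rather than a switch. */
static bool wr_supported(unsigned int opcode, enum ib_qp_type qpt)
{
	return opcode < ARRAY_SIZE(parms) &&
	       (parms[opcode].qpt_support & BIT(qpt));
}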

There are also performance improvement patches in this bunch, as well as a
couple of minor fixes that we felt are more appropriate for the next merge
cycle than for RC.

These patches have been added to my GitHub branch and have passed zero-day
builds.

https://github.com/ddalessa/kernel/tree/for-4.8

---

Dean Luick (2):
      IB/hfi1: Use built-in i2c bit-shift bus adapter
      IB/hfi1: Remove TWSI references

Ira Weiny (2):
      IB/hfi1: Clean up port state structure definition
      IB/hfi1: Remove unnecessary done label in hfi1_write_iter

Jakub Pawlak (3):
      IB/hfi1: Add VL XmitDiscards counters to the opapmaquery
      IB/hfi1: Add counter to track unsupported packets drop
      IB/hfi1: Correct receive packet handler assignment

Jianxin Xiong (1):
      IB/hfi1: Improve SDMA engine assignment for user SDMA

Mike Marciniszyn (5):
      IB/hfi1: Fix trace sparse errors
      IB/rdmavt: Add data structures and routines for table driven post send
      IB/hfi1: Add hfi1 post send tables
      IB/qib: Add qib post send table
      IB/rdmavt: Use new driver specific post send table

Sebastian Sanchez (4):
      IB/hfi1: Separate tracepoints into specific headers
      IB/hfi1: Add global structure for affinity assignments
      IB/hfi1: Reserve and collapse CPU cores for contexts
      IB/hfi1: Refine user process affinity algorithm

Tadeusz Struk (1):
      IB/hfi1: Fix typo


 drivers/infiniband/hw/hfi1/Kconfig        |    3 
 drivers/infiniband/hw/hfi1/Makefile       |    2 
 drivers/infiniband/hw/hfi1/affinity.c     |  526 +++++++++--
 drivers/infiniband/hw/hfi1/affinity.h     |   34 +
 drivers/infiniband/hw/hfi1/chip.c         |   82 +-
 drivers/infiniband/hw/hfi1/chip.h         |    2 
 drivers/infiniband/hw/hfi1/driver.c       |    1 
 drivers/infiniband/hw/hfi1/file_ops.c     |   46 +
 drivers/infiniband/hw/hfi1/hfi.h          |   67 +
 drivers/infiniband/hw/hfi1/init.c         |   36 +
 drivers/infiniband/hw/hfi1/mad.c          |   26 -
 drivers/infiniband/hw/hfi1/mad.h          |    7 
 drivers/infiniband/hw/hfi1/qp.c           |   44 +
 drivers/infiniband/hw/hfi1/qp.h           |    2 
 drivers/infiniband/hw/hfi1/qsfp.c         |  409 +++++++--
 drivers/infiniband/hw/hfi1/qsfp.h         |    3 
 drivers/infiniband/hw/hfi1/rc.c           |    8 
 drivers/infiniband/hw/hfi1/trace.h        | 1333 -----------------------------
 drivers/infiniband/hw/hfi1/trace_ctxts.h  |  141 +++
 drivers/infiniband/hw/hfi1/trace_dbg.h    |  155 +++
 drivers/infiniband/hw/hfi1/trace_ibhdrs.h |  209 +++++
 drivers/infiniband/hw/hfi1/trace_misc.h   |   81 ++
 drivers/infiniband/hw/hfi1/trace_rc.h     |  123 +++
 drivers/infiniband/hw/hfi1/trace_rx.h     |  322 +++++++
 drivers/infiniband/hw/hfi1/trace_tx.h     |  642 ++++++++++++++
 drivers/infiniband/hw/hfi1/twsi.c         |  489 -----------
 drivers/infiniband/hw/hfi1/twsi.h         |   65 -
 drivers/infiniband/hw/hfi1/user_sdma.c    |   29 +
 drivers/infiniband/hw/hfi1/verbs.c        |   32 -
 drivers/infiniband/hw/qib/qib_qp.c        |   43 +
 drivers/infiniband/hw/qib/qib_verbs.c     |    2 
 drivers/infiniband/hw/qib/qib_verbs.h     |    2 
 drivers/infiniband/sw/rdmavt/qp.c         |  113 ++
 drivers/infiniband/sw/rdmavt/vt.c         |    3 
 include/rdma/opa_port_info.h              |   16 
 include/rdma/rdma_vt.h                    |    3 
 include/rdma/rdmavt_qp.h                  |   28 +
 37 files changed, 2811 insertions(+), 2318 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/trace_ctxts.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_dbg.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_ibhdrs.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_misc.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_rc.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_rx.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_tx.h
 delete mode 100644 drivers/infiniband/hw/hfi1/twsi.c
 delete mode 100644 drivers/infiniband/hw/hfi1/twsi.h

--
-Denny


* [PATCH for-next 01/18] IB/hfi1: Clean up port state structure definition
From: Dennis Dalessandro @ 2016-07-01 23:00 UTC (permalink / raw)
  To: dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

The definition of port state changed mid-development and the
old structure was accidentally kept.  Remove this dead code.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/mad.c |   12 ------------
 drivers/infiniband/hw/hfi1/mad.h |    7 -------
 include/rdma/opa_port_info.h     |   16 ----------------
 3 files changed, 0 insertions(+), 35 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
index fca07a1..223dd46 100644
--- a/drivers/infiniband/hw/hfi1/mad.c
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -588,7 +588,6 @@ static int __subn_get_opa_portinfo(struct opa_smp *smp, u32 am, u8 *data,
 
 	pi->port_phys_conf = (ppd->port_type & 0xf);
 
-#if PI_LED_ENABLE_SUP
 	pi->port_states.ledenable_offlinereason = ppd->neighbor_normal << 4;
 	pi->port_states.ledenable_offlinereason |=
 		ppd->is_sm_config_started << 5;
@@ -602,11 +601,6 @@ static int __subn_get_opa_portinfo(struct opa_smp *smp, u32 am, u8 *data,
 	pi->port_states.ledenable_offlinereason |= is_beaconing_active << 6;
 	pi->port_states.ledenable_offlinereason |=
 		ppd->offline_disabled_reason;
-#else
-	pi->port_states.offline_reason = ppd->neighbor_normal << 4;
-	pi->port_states.offline_reason |= ppd->is_sm_config_started << 5;
-	pi->port_states.offline_reason |= ppd->offline_disabled_reason;
-#endif /* PI_LED_ENABLE_SUP */
 
 	pi->port_states.portphysstate_portstate =
 		(hfi1_ibphys_portstate(ppd) << 4) | state;
@@ -1752,17 +1746,11 @@ static int __subn_get_opa_psi(struct opa_smp *smp, u32 am, u8 *data,
 	if (start_of_sm_config && (lstate == IB_PORT_INIT))
 		ppd->is_sm_config_started = 1;
 
-#if PI_LED_ENABLE_SUP
 	psi->port_states.ledenable_offlinereason = ppd->neighbor_normal << 4;
 	psi->port_states.ledenable_offlinereason |=
 		ppd->is_sm_config_started << 5;
 	psi->port_states.ledenable_offlinereason |=
 		ppd->offline_disabled_reason;
-#else
-	psi->port_states.offline_reason = ppd->neighbor_normal << 4;
-	psi->port_states.offline_reason |= ppd->is_sm_config_started << 5;
-	psi->port_states.offline_reason |= ppd->offline_disabled_reason;
-#endif /* PI_LED_ENABLE_SUP */
 
 	psi->port_states.portphysstate_portstate =
 		(hfi1_ibphys_portstate(ppd) << 4) | (lstate & 0xf);
diff --git a/drivers/infiniband/hw/hfi1/mad.h b/drivers/infiniband/hw/hfi1/mad.h
index 8b734aa..5aa3fd1 100644
--- a/drivers/infiniband/hw/hfi1/mad.h
+++ b/drivers/infiniband/hw/hfi1/mad.h
@@ -48,15 +48,8 @@
 #define _HFI1_MAD_H
 
 #include <rdma/ib_pma.h>
-#define USE_PI_LED_ENABLE	1 /*
-				   * use led enabled bit in struct
-				   * opa_port_states, if available
-				   */
 #include <rdma/opa_smi.h>
 #include <rdma/opa_port_info.h>
-#ifndef PI_LED_ENABLE_SUP
-#define PI_LED_ENABLE_SUP 0
-#endif
 #include "opa_compat.h"
 
 /*
diff --git a/include/rdma/opa_port_info.h b/include/rdma/opa_port_info.h
index 2b95c2c..9303e0e 100644
--- a/include/rdma/opa_port_info.h
+++ b/include/rdma/opa_port_info.h
@@ -33,11 +33,6 @@
 #if !defined(OPA_PORT_INFO_H)
 #define OPA_PORT_INFO_H
 
-/* Temporary until HFI driver is updated */
-#ifndef USE_PI_LED_ENABLE
-#define USE_PI_LED_ENABLE 0
-#endif
-
 #define OPA_PORT_LINK_MODE_NOP	0		/* No change */
 #define OPA_PORT_LINK_MODE_OPA	4		/* Port mode is OPA */
 
@@ -274,23 +269,12 @@ enum port_info_field_masks {
 	OPA_PI_MASK_MTU_CAP                       = 0x0F,
 };
 
-#if USE_PI_LED_ENABLE
 struct opa_port_states {
 	u8     reserved;
 	u8     ledenable_offlinereason;   /* 1 res, 1 bit, 6 bits */
 	u8     reserved2;
 	u8     portphysstate_portstate;   /* 4 bits, 4 bits */
 };
-#define PI_LED_ENABLE_SUP 1
-#else
-struct opa_port_states {
-	u8     reserved;
-	u8     offline_reason;            /* 2 res, 6 bits */
-	u8     reserved2;
-	u8     portphysstate_portstate;   /* 4 bits, 4 bits */
-};
-#define PI_LED_ENABLE_SUP 0
-#endif
 
 struct opa_port_state_info {
 	struct opa_port_states port_states;



* [PATCH for-next 02/18] IB/hfi1: Remove unnecessary done label in hfi1_write_iter
From: Dennis Dalessandro @ 2016-07-01 23:00 UTC (permalink / raw)
  To: dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, Ira Weiny

From: Ira Weiny <ira.weiny@intel.com>

Simple code cleanup of hfi1_write_iter: replace the goto-based error
handling with direct returns and drop the now-unneeded done label.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/file_ops.c |   31 ++++++++++++++-----------------
 1 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index c702a00..2f097d9 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -392,41 +392,38 @@ static ssize_t hfi1_write_iter(struct kiocb *kiocb, struct iov_iter *from)
 	struct hfi1_filedata *fd = kiocb->ki_filp->private_data;
 	struct hfi1_user_sdma_pkt_q *pq = fd->pq;
 	struct hfi1_user_sdma_comp_q *cq = fd->cq;
-	int ret = 0, done = 0, reqs = 0;
+	int done = 0, reqs = 0;
 	unsigned long dim = from->nr_segs;
 
-	if (!cq || !pq) {
-		ret = -EIO;
-		goto done;
-	}
+	if (!cq || !pq)
+		return -EIO;
 
-	if (!iter_is_iovec(from) || !dim) {
-		ret = -EINVAL;
-		goto done;
-	}
+	if (!iter_is_iovec(from) || !dim)
+		return -EINVAL;
 
 	hfi1_cdbg(SDMA, "SDMA request from %u:%u (%lu)",
 		  fd->uctxt->ctxt, fd->subctxt, dim);
 
-	if (atomic_read(&pq->n_reqs) == pq->n_max_reqs) {
-		ret = -ENOSPC;
-		goto done;
-	}
+	if (atomic_read(&pq->n_reqs) == pq->n_max_reqs)
+		return -ENOSPC;
 
 	while (dim) {
+		int ret;
 		unsigned long count = 0;
 
 		ret = hfi1_user_sdma_process_request(
 			kiocb->ki_filp,	(struct iovec *)(from->iov + done),
 			dim, &count);
-		if (ret)
-			goto done;
+		if (ret) {
+			reqs = ret;
+			break;
+		}
 		dim -= count;
 		done += count;
 		reqs++;
 	}
-done:
-	return ret ? ret : reqs;
+
+	return reqs;
 }
 
 static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)



* [PATCH for-next 03/18] IB/hfi1: Fix typo
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, Tadeusz Struk

From: Tadeusz Struk <tadeusz.struk@intel.com>

Fix a copy-and-paste typo in the comment above one_qsfp_write(): it
described the operation as a read instead of a write.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/qsfp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/qsfp.c b/drivers/infiniband/hw/hfi1/qsfp.c
index 9fb5616..6fca2a0 100644
--- a/drivers/infiniband/hw/hfi1/qsfp.c
+++ b/drivers/infiniband/hw/hfi1/qsfp.c
@@ -243,7 +243,7 @@ int qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 
 /*
  * Perform a stand-alone single QSFP write.  Acquire the resource, do the
- * read, then release the resource.
+ * write, then release the resource.
  */
 int one_qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 		   int len)



* [PATCH for-next 04/18] IB/hfi1: Separate tracepoints into specific headers
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford@redhat.com
  Cc: linux-rdma@vger.kernel.org, Mike Marciniszyn, Sebastian Sanchez

From: Sebastian Sanchez <sebastian.sanchez@intel.com>

The ftrace infrastructure used to evaluate the TRACE_SYSTEM
macro on every DEFINE_EVENT() macro. Now the TRACE_SYSTEM
macro only gets evaluated when trace/define_trace.h is
included, so the group event information is lost. This was
introduced in
commit acd388fd3af3 ("tracing: Give system name a pointer").
Therefore, each tracepoint system must be in its own file.
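
For reference, each of the split-out headers follows the standard kernel
tracepoint pattern: it defines its own TRACE_SYSTEM and includes
trace/define_trace.h at the bottom. A minimal sketch of that pattern
(trace_foo.h, the hfi1_foo system, and the event are hypothetical names for
illustration, not files in this series):

#if !defined(__HFI1_TRACE_FOO_H) || defined(TRACE_HEADER_MULTI_READ)
#define __HFI1_TRACE_FOO_H

#include <linux/tracepoint.h>

#undef TRACE_SYSTEM
#define TRACE_SYSTEM hfi1_foo

/* All events defined here are grouped under the hfi1_foo system. */
TRACE_EVENT(hfi1_foo_event,
	    TP_PROTO(u32 ctxt),
	    TP_ARGS(ctxt),
	    TP_STRUCT__entry(__field(u32, ctxt)),
	    TP_fast_assign(__entry->ctxt = ctxt;),
	    TP_printk("ctxt %u", __entry->ctxt)
);

#endif /* __HFI1_TRACE_FOO_H */

#undef TRACE_INCLUDE_PATH
#undef TRACE_INCLUDE_FILE
#define TRACE_INCLUDE_PATH .
#define TRACE_INCLUDE_FILE trace_foo
#include <trace/define_trace.h>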

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/hfi.h          |   51 +
 drivers/infiniband/hw/hfi1/rc.c           |    8 
 drivers/infiniband/hw/hfi1/trace.h        | 1333 -----------------------------
 drivers/infiniband/hw/hfi1/trace_ctxts.h  |  141 +++
 drivers/infiniband/hw/hfi1/trace_dbg.h    |  155 +++
 drivers/infiniband/hw/hfi1/trace_ibhdrs.h |  209 +++++
 drivers/infiniband/hw/hfi1/trace_misc.h   |   81 ++
 drivers/infiniband/hw/hfi1/trace_rc.h     |  123 +++
 drivers/infiniband/hw/hfi1/trace_rx.h     |  322 +++++++
 drivers/infiniband/hw/hfi1/trace_tx.h     |  642 ++++++++++++++
 10 files changed, 1735 insertions(+), 1330 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/trace_ctxts.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_dbg.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_ibhdrs.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_misc.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_rc.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_rx.h
 create mode 100644 drivers/infiniband/hw/hfi1/trace_tx.h

diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 4417a0f..1dd48ef 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1947,4 +1947,55 @@ static inline u32 qsfp_resource(struct hfi1_devdata *dd)
 
 int hfi1_tempsense_rd(struct hfi1_devdata *dd, struct hfi1_temp *temp);
 
+#define DD_DEV_ENTRY(dd)       __string(dev, dev_name(&(dd)->pcidev->dev))
+#define DD_DEV_ASSIGN(dd)      __assign_str(dev, dev_name(&(dd)->pcidev->dev))
+
+#define packettype_name(etype) { RHF_RCV_TYPE_##etype, #etype }
+#define show_packettype(etype)                  \
+__print_symbolic(etype,                         \
+	packettype_name(EXPECTED),              \
+	packettype_name(EAGER),                 \
+	packettype_name(IB),                    \
+	packettype_name(ERROR),                 \
+	packettype_name(BYPASS))
+
+#define ib_opcode_name(opcode) { IB_OPCODE_##opcode, #opcode  }
+#define show_ib_opcode(opcode)                             \
+__print_symbolic(opcode,                                   \
+	ib_opcode_name(RC_SEND_FIRST),                     \
+	ib_opcode_name(RC_SEND_MIDDLE),                    \
+	ib_opcode_name(RC_SEND_LAST),                      \
+	ib_opcode_name(RC_SEND_LAST_WITH_IMMEDIATE),       \
+	ib_opcode_name(RC_SEND_ONLY),                      \
+	ib_opcode_name(RC_SEND_ONLY_WITH_IMMEDIATE),       \
+	ib_opcode_name(RC_RDMA_WRITE_FIRST),               \
+	ib_opcode_name(RC_RDMA_WRITE_MIDDLE),              \
+	ib_opcode_name(RC_RDMA_WRITE_LAST),                \
+	ib_opcode_name(RC_RDMA_WRITE_LAST_WITH_IMMEDIATE), \
+	ib_opcode_name(RC_RDMA_WRITE_ONLY),                \
+	ib_opcode_name(RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE), \
+	ib_opcode_name(RC_RDMA_READ_REQUEST),              \
+	ib_opcode_name(RC_RDMA_READ_RESPONSE_FIRST),       \
+	ib_opcode_name(RC_RDMA_READ_RESPONSE_MIDDLE),      \
+	ib_opcode_name(RC_RDMA_READ_RESPONSE_LAST),        \
+	ib_opcode_name(RC_RDMA_READ_RESPONSE_ONLY),        \
+	ib_opcode_name(RC_ACKNOWLEDGE),                    \
+	ib_opcode_name(RC_ATOMIC_ACKNOWLEDGE),             \
+	ib_opcode_name(RC_COMPARE_SWAP),                   \
+	ib_opcode_name(RC_FETCH_ADD),                      \
+	ib_opcode_name(UC_SEND_FIRST),                     \
+	ib_opcode_name(UC_SEND_MIDDLE),                    \
+	ib_opcode_name(UC_SEND_LAST),                      \
+	ib_opcode_name(UC_SEND_LAST_WITH_IMMEDIATE),       \
+	ib_opcode_name(UC_SEND_ONLY),                      \
+	ib_opcode_name(UC_SEND_ONLY_WITH_IMMEDIATE),       \
+	ib_opcode_name(UC_RDMA_WRITE_FIRST),               \
+	ib_opcode_name(UC_RDMA_WRITE_MIDDLE),              \
+	ib_opcode_name(UC_RDMA_WRITE_LAST),                \
+	ib_opcode_name(UC_RDMA_WRITE_LAST_WITH_IMMEDIATE), \
+	ib_opcode_name(UC_RDMA_WRITE_ONLY),                \
+	ib_opcode_name(UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE), \
+	ib_opcode_name(UD_SEND_ONLY),                      \
+	ib_opcode_name(UD_SEND_ONLY_WITH_IMMEDIATE),       \
+	ib_opcode_name(CNP))
 #endif                          /* _HFI1_KERNEL_H */
diff --git a/drivers/infiniband/hw/hfi1/rc.c b/drivers/infiniband/hw/hfi1/rc.c
index 792f15e..3aeb832 100644
--- a/drivers/infiniband/hw/hfi1/rc.c
+++ b/drivers/infiniband/hw/hfi1/rc.c
@@ -1047,7 +1047,7 @@ void hfi1_rc_timeout(unsigned long arg)
 		ibp->rvp.n_rc_timeouts++;
 		qp->s_flags &= ~RVT_S_TIMER;
 		del_timer(&qp->s_timer);
-		trace_hfi1_rc_timeout(qp, qp->s_last_psn + 1);
+		trace_hfi1_timeout(qp, qp->s_last_psn + 1);
 		restart_rc(qp, qp->s_last_psn + 1, 1);
 		hfi1_schedule_send(qp);
 	}
@@ -1171,7 +1171,7 @@ void hfi1_rc_send_complete(struct rvt_qp *qp, struct hfi1_ib_header *hdr)
 	 * If we were waiting for sends to complete before re-sending,
 	 * and they are now complete, restart sending.
 	 */
-	trace_hfi1_rc_sendcomplete(qp, psn);
+	trace_hfi1_sendcomplete(qp, psn);
 	if (qp->s_flags & RVT_S_WAIT_PSN &&
 	    cmp_psn(qp->s_sending_psn, qp->s_sending_hpsn) > 0) {
 		qp->s_flags &= ~RVT_S_WAIT_PSN;
@@ -1567,7 +1567,7 @@ static void rc_rcv_resp(struct hfi1_ibport *ibp,
 
 	spin_lock_irqsave(&qp->s_lock, flags);
 
-	trace_hfi1_rc_ack(qp, psn);
+	trace_hfi1_ack(qp, psn);
 
 	/* Ignore invalid responses. */
 	smp_read_barrier_depends(); /* see post_one_send */
@@ -1782,7 +1782,7 @@ static noinline int rc_rcv_error(struct hfi1_other_headers *ohdr, void *data,
 	u8 i, prev;
 	int old_req;
 
-	trace_hfi1_rc_rcv_error(qp, psn);
+	trace_hfi1_rcv_error(qp, psn);
 	if (diff > 0) {
 		/*
 		 * Packet sequence error.
diff --git a/drivers/infiniband/hw/hfi1/trace.h b/drivers/infiniband/hw/hfi1/trace.h
index 28c1d08..92dc88f 100644
--- a/drivers/infiniband/hw/hfi1/trace.h
+++ b/drivers/infiniband/hw/hfi1/trace.h
@@ -44,1329 +44,10 @@
  * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  */
-#undef TRACE_SYSTEM_VAR
-#define TRACE_SYSTEM_VAR hfi1
-
-#if !defined(__HFI1_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
-#define __HFI1_TRACE_H
-
-#include <linux/tracepoint.h>
-#include <linux/trace_seq.h>
-
-#include "hfi.h"
-#include "mad.h"
-#include "sdma.h"
-
-#define DD_DEV_ENTRY(dd)       __string(dev, dev_name(&(dd)->pcidev->dev))
-#define DD_DEV_ASSIGN(dd)      __assign_str(dev, dev_name(&(dd)->pcidev->dev))
-
-#define packettype_name(etype) { RHF_RCV_TYPE_##etype, #etype }
-#define show_packettype(etype)                  \
-__print_symbolic(etype,                         \
-	packettype_name(EXPECTED),              \
-	packettype_name(EAGER),                 \
-	packettype_name(IB),                    \
-	packettype_name(ERROR),                 \
-	packettype_name(BYPASS))
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_rx
-
-TRACE_EVENT(hfi1_rcvhdr,
-	    TP_PROTO(struct hfi1_devdata *dd,
-		     u32 ctxt,
-		     u64 eflags,
-		     u32 etype,
-		     u32 hlen,
-		     u32 tlen,
-		     u32 updegr,
-		     u32 etail
-		     ),
-	    TP_ARGS(dd, ctxt, eflags, etype, hlen, tlen, updegr, etail),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __field(u64, eflags)
-			     __field(u32, ctxt)
-			     __field(u32, etype)
-			     __field(u32, hlen)
-			     __field(u32, tlen)
-			     __field(u32, updegr)
-			     __field(u32, etail)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd);
-			   __entry->eflags = eflags;
-			   __entry->ctxt = ctxt;
-			   __entry->etype = etype;
-			   __entry->hlen = hlen;
-			   __entry->tlen = tlen;
-			   __entry->updegr = updegr;
-			   __entry->etail = etail;
-			   ),
-	    TP_printk(
-		      "[%s] ctxt %d eflags 0x%llx etype %d,%s hlen %d tlen %d updegr %d etail %d",
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->eflags,
-		      __entry->etype, show_packettype(__entry->etype),
-		      __entry->hlen,
-		      __entry->tlen,
-		      __entry->updegr,
-		      __entry->etail
-		      )
-);
-
-TRACE_EVENT(hfi1_receive_interrupt,
-	    TP_PROTO(struct hfi1_devdata *dd, u32 ctxt),
-	    TP_ARGS(dd, ctxt),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __field(u32, ctxt)
-			     __field(u8, slow_path)
-			     __field(u8, dma_rtail)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd);
-			   __entry->ctxt = ctxt;
-			   if (dd->rcd[ctxt]->do_interrupt ==
-			       &handle_receive_interrupt) {
-				__entry->slow_path = 1;
-				__entry->dma_rtail = 0xFF;
-			   } else if (dd->rcd[ctxt]->do_interrupt ==
-				      &handle_receive_interrupt_dma_rtail){
-				__entry->dma_rtail = 1;
-				__entry->slow_path = 0;
-			   } else if (dd->rcd[ctxt]->do_interrupt ==
-				      &handle_receive_interrupt_nodma_rtail) {
-				__entry->dma_rtail = 0;
-				__entry->slow_path = 0;
-			   }
-			   ),
-	    TP_printk("[%s] ctxt %d SlowPath: %d DmaRtail: %d",
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->slow_path,
-		      __entry->dma_rtail
-		      )
-);
-
-TRACE_EVENT(hfi1_exp_tid_reg,
-	    TP_PROTO(unsigned ctxt, u16 subctxt, u32 rarr,
-		     u32 npages, unsigned long va, unsigned long pa,
-		     dma_addr_t dma),
-	    TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
-	    TP_STRUCT__entry(
-		    __field(unsigned, ctxt)
-		    __field(u16, subctxt)
-		    __field(u32, rarr)
-		    __field(u32, npages)
-		    __field(unsigned long, va)
-		    __field(unsigned long, pa)
-		    __field(dma_addr_t, dma)
-		    ),
-	    TP_fast_assign(
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->rarr = rarr;
-		    __entry->npages = npages;
-		    __entry->va = va;
-		    __entry->pa = pa;
-		    __entry->dma = dma;
-		    ),
-	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->rarr,
-		      __entry->npages,
-		      __entry->pa,
-		      __entry->va,
-		      __entry->dma
-		    )
-	);
-
-TRACE_EVENT(hfi1_exp_tid_unreg,
-	    TP_PROTO(unsigned ctxt, u16 subctxt, u32 rarr, u32 npages,
-		     unsigned long va, unsigned long pa, dma_addr_t dma),
-	    TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
-	    TP_STRUCT__entry(
-		    __field(unsigned, ctxt)
-		    __field(u16, subctxt)
-		    __field(u32, rarr)
-		    __field(u32, npages)
-		    __field(unsigned long, va)
-		    __field(unsigned long, pa)
-		    __field(dma_addr_t, dma)
-		    ),
-	    TP_fast_assign(
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->rarr = rarr;
-		    __entry->npages = npages;
-		    __entry->va = va;
-		    __entry->pa = pa;
-		    __entry->dma = dma;
-		    ),
-	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->rarr,
-		      __entry->npages,
-		      __entry->pa,
-		      __entry->va,
-		      __entry->dma
-		    )
-	);
-
-TRACE_EVENT(hfi1_exp_tid_inval,
-	    TP_PROTO(unsigned ctxt, u16 subctxt, unsigned long va, u32 rarr,
-		     u32 npages, dma_addr_t dma),
-	    TP_ARGS(ctxt, subctxt, va, rarr, npages, dma),
-	    TP_STRUCT__entry(
-		    __field(unsigned, ctxt)
-		    __field(u16, subctxt)
-		    __field(unsigned long, va)
-		    __field(u32, rarr)
-		    __field(u32, npages)
-		    __field(dma_addr_t, dma)
-		    ),
-	    TP_fast_assign(
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->va = va;
-		    __entry->rarr = rarr;
-		    __entry->npages = npages;
-		    __entry->dma = dma;
-		    ),
-	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx dma: 0x%llx",
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->rarr,
-		      __entry->npages,
-		      __entry->va,
-		      __entry->dma
-		    )
-	);
-
-TRACE_EVENT(hfi1_mmu_invalidate,
-	    TP_PROTO(unsigned ctxt, u16 subctxt, const char *type,
-		     unsigned long start, unsigned long end),
-	    TP_ARGS(ctxt, subctxt, type, start, end),
-	    TP_STRUCT__entry(
-		    __field(unsigned, ctxt)
-		    __field(u16, subctxt)
-		    __string(type, type)
-		    __field(unsigned long, start)
-		    __field(unsigned long, end)
-		    ),
-	    TP_fast_assign(
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __assign_str(type, type);
-		    __entry->start = start;
-		    __entry->end = end;
-		    ),
-	    TP_printk("[%3u:%02u] MMU Invalidate (%s) 0x%lx - 0x%lx",
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __get_str(type),
-		      __entry->start,
-		      __entry->end
-		    )
-	);
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_tx
-
-TRACE_EVENT(hfi1_piofree,
-	    TP_PROTO(struct send_context *sc, int extra),
-	    TP_ARGS(sc, extra),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(sc->dd)
-			     __field(u32, sw_index)
-			     __field(u32, hw_context)
-			     __field(int, extra)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(sc->dd);
-			   __entry->sw_index = sc->sw_index;
-			   __entry->hw_context = sc->hw_context;
-			   __entry->extra = extra;
-			   ),
-	    TP_printk("[%s] ctxt %u(%u) extra %d",
-		      __get_str(dev),
-		      __entry->sw_index,
-		      __entry->hw_context,
-		      __entry->extra
-		      )
-);
-
-TRACE_EVENT(hfi1_wantpiointr,
-	    TP_PROTO(struct send_context *sc, u32 needint, u64 credit_ctrl),
-	    TP_ARGS(sc, needint, credit_ctrl),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(sc->dd)
-			     __field(u32, sw_index)
-			     __field(u32, hw_context)
-			     __field(u32, needint)
-			     __field(u64, credit_ctrl)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(sc->dd);
-			   __entry->sw_index = sc->sw_index;
-			   __entry->hw_context = sc->hw_context;
-			   __entry->needint = needint;
-			   __entry->credit_ctrl = credit_ctrl;
-			   ),
-	    TP_printk("[%s] ctxt %u(%u) on %d credit_ctrl 0x%llx",
-		      __get_str(dev),
-		      __entry->sw_index,
-		      __entry->hw_context,
-		      __entry->needint,
-		      (unsigned long long)__entry->credit_ctrl
-		       )
-);
-
-DECLARE_EVENT_CLASS(hfi1_qpsleepwakeup_template,
-		    TP_PROTO(struct rvt_qp *qp, u32 flags),
-		    TP_ARGS(qp, flags),
-		    TP_STRUCT__entry(
-			    DD_DEV_ENTRY(dd_from_ibdev(qp->ibqp.device))
-			    __field(u32, qpn)
-			    __field(u32, flags)
-			    __field(u32, s_flags)
-			    ),
-		    TP_fast_assign(
-			    DD_DEV_ASSIGN(dd_from_ibdev(qp->ibqp.device))
-			    __entry->flags = flags;
-			    __entry->qpn = qp->ibqp.qp_num;
-			    __entry->s_flags = qp->s_flags;
-			    ),
-		    TP_printk(
-			    "[%s] qpn 0x%x flags 0x%x s_flags 0x%x",
-			    __get_str(dev),
-			    __entry->qpn,
-			    __entry->flags,
-			    __entry->s_flags
-			    )
-);
-
-DEFINE_EVENT(hfi1_qpsleepwakeup_template, hfi1_qpwakeup,
-	     TP_PROTO(struct rvt_qp *qp, u32 flags),
-	     TP_ARGS(qp, flags));
-
-DEFINE_EVENT(hfi1_qpsleepwakeup_template, hfi1_qpsleep,
-	     TP_PROTO(struct rvt_qp *qp, u32 flags),
-	     TP_ARGS(qp, flags));
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_ibhdrs
-
-u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr);
-const char *parse_everbs_hdrs(struct trace_seq *p, u8 opcode, void *ehdrs);
-
-#define __parse_ib_ehdrs(op, ehdrs) parse_everbs_hdrs(p, op, ehdrs)
-
-const char *parse_sdma_flags(struct trace_seq *p, u64 desc0, u64 desc1);
-
-#define __parse_sdma_flags(desc0, desc1) parse_sdma_flags(p, desc0, desc1)
-
-#define lrh_name(lrh) { HFI1_##lrh, #lrh }
-#define show_lnh(lrh)                    \
-__print_symbolic(lrh,                    \
-	lrh_name(LRH_BTH),               \
-	lrh_name(LRH_GRH))
-
-#define ib_opcode_name(opcode) { IB_OPCODE_##opcode, #opcode  }
-#define show_ib_opcode(opcode)                             \
-__print_symbolic(opcode,                                   \
-	ib_opcode_name(RC_SEND_FIRST),                     \
-	ib_opcode_name(RC_SEND_MIDDLE),                    \
-	ib_opcode_name(RC_SEND_LAST),                      \
-	ib_opcode_name(RC_SEND_LAST_WITH_IMMEDIATE),       \
-	ib_opcode_name(RC_SEND_ONLY),                      \
-	ib_opcode_name(RC_SEND_ONLY_WITH_IMMEDIATE),       \
-	ib_opcode_name(RC_RDMA_WRITE_FIRST),               \
-	ib_opcode_name(RC_RDMA_WRITE_MIDDLE),              \
-	ib_opcode_name(RC_RDMA_WRITE_LAST),                \
-	ib_opcode_name(RC_RDMA_WRITE_LAST_WITH_IMMEDIATE), \
-	ib_opcode_name(RC_RDMA_WRITE_ONLY),                \
-	ib_opcode_name(RC_RDMA_WRITE_ONLY_WITH_IMMEDIATE), \
-	ib_opcode_name(RC_RDMA_READ_REQUEST),              \
-	ib_opcode_name(RC_RDMA_READ_RESPONSE_FIRST),       \
-	ib_opcode_name(RC_RDMA_READ_RESPONSE_MIDDLE),      \
-	ib_opcode_name(RC_RDMA_READ_RESPONSE_LAST),        \
-	ib_opcode_name(RC_RDMA_READ_RESPONSE_ONLY),        \
-	ib_opcode_name(RC_ACKNOWLEDGE),                    \
-	ib_opcode_name(RC_ATOMIC_ACKNOWLEDGE),             \
-	ib_opcode_name(RC_COMPARE_SWAP),                   \
-	ib_opcode_name(RC_FETCH_ADD),                      \
-	ib_opcode_name(RC_SEND_LAST_WITH_INVALIDATE),      \
-	ib_opcode_name(RC_SEND_ONLY_WITH_INVALIDATE),      \
-	ib_opcode_name(UC_SEND_FIRST),                     \
-	ib_opcode_name(UC_SEND_MIDDLE),                    \
-	ib_opcode_name(UC_SEND_LAST),                      \
-	ib_opcode_name(UC_SEND_LAST_WITH_IMMEDIATE),       \
-	ib_opcode_name(UC_SEND_ONLY),                      \
-	ib_opcode_name(UC_SEND_ONLY_WITH_IMMEDIATE),       \
-	ib_opcode_name(UC_RDMA_WRITE_FIRST),               \
-	ib_opcode_name(UC_RDMA_WRITE_MIDDLE),              \
-	ib_opcode_name(UC_RDMA_WRITE_LAST),                \
-	ib_opcode_name(UC_RDMA_WRITE_LAST_WITH_IMMEDIATE), \
-	ib_opcode_name(UC_RDMA_WRITE_ONLY),                \
-	ib_opcode_name(UC_RDMA_WRITE_ONLY_WITH_IMMEDIATE), \
-	ib_opcode_name(UD_SEND_ONLY),                      \
-	ib_opcode_name(UD_SEND_ONLY_WITH_IMMEDIATE),       \
-	ib_opcode_name(CNP))
-
-#define LRH_PRN "vl %d lver %d sl %d lnh %d,%s dlid %.4x len %d slid %.4x"
-#define BTH_PRN \
-	"op 0x%.2x,%s se %d m %d pad %d tver %d pkey 0x%.4x " \
-	"f %d b %d qpn 0x%.6x a %d psn 0x%.8x"
-#define EHDR_PRN "%s"
-
-DECLARE_EVENT_CLASS(hfi1_ibhdr_template,
-		    TP_PROTO(struct hfi1_devdata *dd,
-			     struct hfi1_ib_header *hdr),
-		    TP_ARGS(dd, hdr),
-		    TP_STRUCT__entry(
-			    DD_DEV_ENTRY(dd)
-			    /* LRH */
-			    __field(u8, vl)
-			    __field(u8, lver)
-			    __field(u8, sl)
-			    __field(u8, lnh)
-			    __field(u16, dlid)
-			    __field(u16, len)
-			    __field(u16, slid)
-			    /* BTH */
-			    __field(u8, opcode)
-			    __field(u8, se)
-			    __field(u8, m)
-			    __field(u8, pad)
-			    __field(u8, tver)
-			    __field(u16, pkey)
-			    __field(u8, f)
-			    __field(u8, b)
-			    __field(u32, qpn)
-			    __field(u8, a)
-			    __field(u32, psn)
-			    /* extended headers */
-			    __dynamic_array(u8, ehdrs, ibhdr_exhdr_len(hdr))
-			    ),
-		    TP_fast_assign(
-			   struct hfi1_other_headers *ohdr;
-
-			   DD_DEV_ASSIGN(dd);
-			   /* LRH */
-			   __entry->vl =
-			   (u8)(be16_to_cpu(hdr->lrh[0]) >> 12);
-			   __entry->lver =
-			   (u8)(be16_to_cpu(hdr->lrh[0]) >> 8) & 0xf;
-			   __entry->sl =
-			   (u8)(be16_to_cpu(hdr->lrh[0]) >> 4) & 0xf;
-			   __entry->lnh =
-			   (u8)(be16_to_cpu(hdr->lrh[0]) & 3);
-			   __entry->dlid =
-			   be16_to_cpu(hdr->lrh[1]);
-			   /* allow for larger len */
-			   __entry->len =
-			   be16_to_cpu(hdr->lrh[2]);
-			   __entry->slid =
-			   be16_to_cpu(hdr->lrh[3]);
-			   /* BTH */
-			   if (__entry->lnh == HFI1_LRH_BTH)
-				ohdr = &hdr->u.oth;
-			   else
-				ohdr = &hdr->u.l.oth;
-			  __entry->opcode =
-			  (be32_to_cpu(ohdr->bth[0]) >> 24) & 0xff;
-			  __entry->se =
-			  (be32_to_cpu(ohdr->bth[0]) >> 23) & 1;
-			  __entry->m =
-			  (be32_to_cpu(ohdr->bth[0]) >> 22) & 1;
-			  __entry->pad =
-			  (be32_to_cpu(ohdr->bth[0]) >> 20) & 3;
-			  __entry->tver =
-			  (be32_to_cpu(ohdr->bth[0]) >> 16) & 0xf;
-			  __entry->pkey =
-			  be32_to_cpu(ohdr->bth[0]) & 0xffff;
-			  __entry->f =
-			  (be32_to_cpu(ohdr->bth[1]) >> HFI1_FECN_SHIFT) &
-			  HFI1_FECN_MASK;
-			  __entry->b =
-			  (be32_to_cpu(ohdr->bth[1]) >> HFI1_BECN_SHIFT) &
-			  HFI1_BECN_MASK;
-			  __entry->qpn =
-			  be32_to_cpu(ohdr->bth[1]) & RVT_QPN_MASK;
-			  __entry->a =
-			  (be32_to_cpu(ohdr->bth[2]) >> 31) & 1;
-			  /* allow for larger PSN */
-			  __entry->psn =
-			  be32_to_cpu(ohdr->bth[2]) & 0x7fffffff;
-			  /* extended headers */
-			  memcpy(__get_dynamic_array(ehdrs), &ohdr->u,
-				 ibhdr_exhdr_len(hdr));
-			 ),
-		    TP_printk("[%s] " LRH_PRN " " BTH_PRN " " EHDR_PRN,
-			      __get_str(dev),
-			      /* LRH */
-			      __entry->vl,
-			      __entry->lver,
-			      __entry->sl,
-			      __entry->lnh, show_lnh(__entry->lnh),
-			      __entry->dlid,
-			      __entry->len,
-			      __entry->slid,
-			      /* BTH */
-			      __entry->opcode, show_ib_opcode(__entry->opcode),
-			      __entry->se,
-			      __entry->m,
-			      __entry->pad,
-			      __entry->tver,
-			      __entry->pkey,
-			      __entry->f,
-			      __entry->b,
-			      __entry->qpn,
-			      __entry->a,
-			      __entry->psn,
-			      /* extended headers */
-			      __parse_ib_ehdrs(
-					__entry->opcode,
-					(void *)__get_dynamic_array(ehdrs))
-			     )
-);
-
-DEFINE_EVENT(hfi1_ibhdr_template, input_ibhdr,
-	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
-	     TP_ARGS(dd, hdr));
-
-DEFINE_EVENT(hfi1_ibhdr_template, pio_output_ibhdr,
-	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
-	     TP_ARGS(dd, hdr));
-
-DEFINE_EVENT(hfi1_ibhdr_template, ack_output_ibhdr,
-	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
-	     TP_ARGS(dd, hdr));
-
-DEFINE_EVENT(hfi1_ibhdr_template, sdma_output_ibhdr,
-	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
-	     TP_ARGS(dd, hdr));
-
-#define SNOOP_PRN \
-	"slid %.4x dlid %.4x qpn 0x%.6x opcode 0x%.2x,%s " \
-	"svc lvl %d pkey 0x%.4x [header = %d bytes] [data = %d bytes]"
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_snoop
-
-TRACE_EVENT(snoop_capture,
-	    TP_PROTO(struct hfi1_devdata *dd,
-		     int hdr_len,
-		     struct hfi1_ib_header *hdr,
-		     int data_len,
-		     void *data),
-	    TP_ARGS(dd, hdr_len, hdr, data_len, data),
-	    TP_STRUCT__entry(
-		DD_DEV_ENTRY(dd)
-		__field(u16, slid)
-		__field(u16, dlid)
-		__field(u32, qpn)
-		__field(u8, opcode)
-		__field(u8, sl)
-		__field(u16, pkey)
-		__field(u32, hdr_len)
-		__field(u32, data_len)
-		__field(u8, lnh)
-		__dynamic_array(u8, raw_hdr, hdr_len)
-		__dynamic_array(u8, raw_pkt, data_len)
-		),
-	    TP_fast_assign(
-		struct hfi1_other_headers *ohdr;
-
-		__entry->lnh = (u8)(be16_to_cpu(hdr->lrh[0]) & 3);
-		if (__entry->lnh == HFI1_LRH_BTH)
-			ohdr = &hdr->u.oth;
-		else
-			ohdr = &hdr->u.l.oth;
-		DD_DEV_ASSIGN(dd);
-		__entry->slid = be16_to_cpu(hdr->lrh[3]);
-		__entry->dlid = be16_to_cpu(hdr->lrh[1]);
-		__entry->qpn = be32_to_cpu(ohdr->bth[1]) & RVT_QPN_MASK;
-		__entry->opcode = (be32_to_cpu(ohdr->bth[0]) >> 24) & 0xff;
-		__entry->sl = (u8)(be16_to_cpu(hdr->lrh[0]) >> 4) & 0xf;
-		__entry->pkey =	be32_to_cpu(ohdr->bth[0]) & 0xffff;
-		__entry->hdr_len = hdr_len;
-		__entry->data_len = data_len;
-		memcpy(__get_dynamic_array(raw_hdr), hdr, hdr_len);
-		memcpy(__get_dynamic_array(raw_pkt), data, data_len);
-		),
-	    TP_printk(
-		"[%s] " SNOOP_PRN,
-		__get_str(dev),
-		__entry->slid,
-		__entry->dlid,
-		__entry->qpn,
-		__entry->opcode,
-		show_ib_opcode(__entry->opcode),
-		__entry->sl,
-		__entry->pkey,
-		__entry->hdr_len,
-		__entry->data_len
-		)
-);
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_ctxts
-
-#define UCTXT_FMT \
-	"cred:%u, credaddr:0x%llx, piobase:0x%llx, rcvhdr_cnt:%u, "	\
-	"rcvbase:0x%llx, rcvegrc:%u, rcvegrb:0x%llx"
-TRACE_EVENT(hfi1_uctxtdata,
-	    TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ctxtdata *uctxt),
-	    TP_ARGS(dd, uctxt),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __field(unsigned, ctxt)
-			     __field(u32, credits)
-			     __field(u64, hw_free)
-			     __field(u64, piobase)
-			     __field(u16, rcvhdrq_cnt)
-			     __field(u64, rcvhdrq_phys)
-			     __field(u32, eager_cnt)
-			     __field(u64, rcvegr_phys)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd);
-			   __entry->ctxt = uctxt->ctxt;
-			   __entry->credits = uctxt->sc->credits;
-			   __entry->hw_free = (u64)uctxt->sc->hw_free;
-			   __entry->piobase = (u64)uctxt->sc->base_addr;
-			   __entry->rcvhdrq_cnt = uctxt->rcvhdrq_cnt;
-			   __entry->rcvhdrq_phys = uctxt->rcvhdrq_phys;
-			   __entry->eager_cnt = uctxt->egrbufs.alloced;
-			   __entry->rcvegr_phys =
-			   uctxt->egrbufs.rcvtids[0].phys;
-			   ),
-	    TP_printk("[%s] ctxt %u " UCTXT_FMT,
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->credits,
-		      __entry->hw_free,
-		      __entry->piobase,
-		      __entry->rcvhdrq_cnt,
-		      __entry->rcvhdrq_phys,
-		      __entry->eager_cnt,
-		      __entry->rcvegr_phys
-		      )
-);
-
-#define CINFO_FMT \
-	"egrtids:%u, egr_size:%u, hdrq_cnt:%u, hdrq_size:%u, sdma_ring_size:%u"
-TRACE_EVENT(hfi1_ctxt_info,
-	    TP_PROTO(struct hfi1_devdata *dd, unsigned ctxt, unsigned subctxt,
-		     struct hfi1_ctxt_info cinfo),
-	    TP_ARGS(dd, ctxt, subctxt, cinfo),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __field(unsigned, ctxt)
-			     __field(unsigned, subctxt)
-			     __field(u16, egrtids)
-			     __field(u16, rcvhdrq_cnt)
-			     __field(u16, rcvhdrq_size)
-			     __field(u16, sdma_ring_size)
-			     __field(u32, rcvegr_size)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd);
-			    __entry->ctxt = ctxt;
-			    __entry->subctxt = subctxt;
-			    __entry->egrtids = cinfo.egrtids;
-			    __entry->rcvhdrq_cnt = cinfo.rcvhdrq_cnt;
-			    __entry->rcvhdrq_size = cinfo.rcvhdrq_entsize;
-			    __entry->sdma_ring_size = cinfo.sdma_ring_size;
-			    __entry->rcvegr_size = cinfo.rcvegr_size;
-			    ),
-	    TP_printk("[%s] ctxt %u:%u " CINFO_FMT,
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->egrtids,
-		      __entry->rcvegr_size,
-		      __entry->rcvhdrq_cnt,
-		      __entry->rcvhdrq_size,
-		      __entry->sdma_ring_size
-		      )
-);
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_sma
-
-#define BCT_FORMAT \
-	"shared_limit %x vls 0-7 [%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x] 15 [%x,%x]"
-
-#define BCT(field) \
-	be16_to_cpu( \
-		((struct buffer_control *)__get_dynamic_array(bct))->field \
-	)
-
-DECLARE_EVENT_CLASS(hfi1_bct_template,
-		    TP_PROTO(struct hfi1_devdata *dd,
-			     struct buffer_control *bc),
-		    TP_ARGS(dd, bc),
-		    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-				     __dynamic_array(u8, bct, sizeof(*bc))
-				     ),
-		    TP_fast_assign(DD_DEV_ASSIGN(dd);
-				   memcpy(__get_dynamic_array(bct), bc,
-					  sizeof(*bc));
-				   ),
-		    TP_printk(BCT_FORMAT,
-			      BCT(overall_shared_limit),
-
-			      BCT(vl[0].dedicated),
-			      BCT(vl[0].shared),
-
-			      BCT(vl[1].dedicated),
-			      BCT(vl[1].shared),
-
-			      BCT(vl[2].dedicated),
-			      BCT(vl[2].shared),
-
-			      BCT(vl[3].dedicated),
-			      BCT(vl[3].shared),
-
-			      BCT(vl[4].dedicated),
-			      BCT(vl[4].shared),
-
-			      BCT(vl[5].dedicated),
-			      BCT(vl[5].shared),
-
-			      BCT(vl[6].dedicated),
-			      BCT(vl[6].shared),
-
-			      BCT(vl[7].dedicated),
-			      BCT(vl[7].shared),
-
-			      BCT(vl[15].dedicated),
-			      BCT(vl[15].shared)
-			      )
-);
-
-DEFINE_EVENT(hfi1_bct_template, bct_set,
-	     TP_PROTO(struct hfi1_devdata *dd, struct buffer_control *bc),
-	     TP_ARGS(dd, bc));
-
-DEFINE_EVENT(hfi1_bct_template, bct_get,
-	     TP_PROTO(struct hfi1_devdata *dd, struct buffer_control *bc),
-	     TP_ARGS(dd, bc));
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_sdma
-
-TRACE_EVENT(hfi1_sdma_descriptor,
-	    TP_PROTO(struct sdma_engine *sde,
-		     u64 desc0,
-		     u64 desc1,
-		     u16 e,
-		     void *descp),
-	TP_ARGS(sde, desc0, desc1, e, descp),
-	TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-			 __field(void *, descp)
-			 __field(u64, desc0)
-			 __field(u64, desc1)
-			 __field(u16, e)
-			 __field(u8, idx)
-			 ),
-	TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-		       __entry->desc0 = desc0;
-		       __entry->desc1 = desc1;
-		       __entry->idx = sde->this_idx;
-		       __entry->descp = descp;
-		       __entry->e = e;
-		       ),
-	TP_printk(
-		  "[%s] SDE(%u) flags:%s addr:0x%016llx gen:%u len:%u d0:%016llx d1:%016llx to %p,%u",
-		  __get_str(dev),
-		  __entry->idx,
-		  __parse_sdma_flags(__entry->desc0, __entry->desc1),
-		  (__entry->desc0 >> SDMA_DESC0_PHY_ADDR_SHIFT) &
-		  SDMA_DESC0_PHY_ADDR_MASK,
-		  (u8)((__entry->desc1 >> SDMA_DESC1_GENERATION_SHIFT) &
-		       SDMA_DESC1_GENERATION_MASK),
-		  (u16)((__entry->desc0 >> SDMA_DESC0_BYTE_COUNT_SHIFT) &
-			SDMA_DESC0_BYTE_COUNT_MASK),
-		  __entry->desc0,
-		  __entry->desc1,
-		  __entry->descp,
-		  __entry->e
-		  )
-);
-
-TRACE_EVENT(hfi1_sdma_engine_select,
-	    TP_PROTO(struct hfi1_devdata *dd, u32 sel, u8 vl, u8 idx),
-	    TP_ARGS(dd, sel, vl, idx),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __field(u32, sel)
-			     __field(u8, vl)
-			     __field(u8, idx)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd);
-			   __entry->sel = sel;
-			   __entry->vl = vl;
-			   __entry->idx = idx;
-			   ),
-	    TP_printk("[%s] selecting SDE %u sel 0x%x vl %u",
-		      __get_str(dev),
-		      __entry->idx,
-		      __entry->sel,
-		      __entry->vl
-		      )
-);
-
-DECLARE_EVENT_CLASS(hfi1_sdma_engine_class,
-		    TP_PROTO(struct sdma_engine *sde, u64 status),
-		    TP_ARGS(sde, status),
-		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-				     __field(u64, status)
-				     __field(u8, idx)
-				     ),
-		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-				   __entry->status = status;
-				   __entry->idx = sde->this_idx;
-				   ),
-		    TP_printk("[%s] SDE(%u) status %llx",
-			      __get_str(dev),
-			      __entry->idx,
-			      (unsigned long long)__entry->status
-			      )
-);
-
-DEFINE_EVENT(hfi1_sdma_engine_class, hfi1_sdma_engine_interrupt,
-	     TP_PROTO(struct sdma_engine *sde, u64 status),
-	     TP_ARGS(sde, status)
-);
-
-DEFINE_EVENT(hfi1_sdma_engine_class, hfi1_sdma_engine_progress,
-	     TP_PROTO(struct sdma_engine *sde, u64 status),
-	     TP_ARGS(sde, status)
-);
-
-DECLARE_EVENT_CLASS(hfi1_sdma_ahg_ad,
-		    TP_PROTO(struct sdma_engine *sde, int aidx),
-		    TP_ARGS(sde, aidx),
-		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-				     __field(int, aidx)
-				     __field(u8, idx)
-				     ),
-		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-				   __entry->idx = sde->this_idx;
-				   __entry->aidx = aidx;
-				   ),
-		    TP_printk("[%s] SDE(%u) aidx %d",
-			      __get_str(dev),
-			      __entry->idx,
-			      __entry->aidx
-			      )
-);
-
-DEFINE_EVENT(hfi1_sdma_ahg_ad, hfi1_ahg_allocate,
-	     TP_PROTO(struct sdma_engine *sde, int aidx),
-	     TP_ARGS(sde, aidx));
-
-DEFINE_EVENT(hfi1_sdma_ahg_ad, hfi1_ahg_deallocate,
-	     TP_PROTO(struct sdma_engine *sde, int aidx),
-	     TP_ARGS(sde, aidx));
-
-#ifdef CONFIG_HFI1_DEBUG_SDMA_ORDER
-TRACE_EVENT(hfi1_sdma_progress,
-	    TP_PROTO(struct sdma_engine *sde,
-		     u16 hwhead,
-		     u16 swhead,
-		     struct sdma_txreq *txp
-		     ),
-	    TP_ARGS(sde, hwhead, swhead, txp),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-			     __field(u64, sn)
-			     __field(u16, hwhead)
-			     __field(u16, swhead)
-			     __field(u16, txnext)
-			     __field(u16, tx_tail)
-			     __field(u16, tx_head)
-			     __field(u8, idx)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-			   __entry->hwhead = hwhead;
-			   __entry->swhead = swhead;
-			   __entry->tx_tail = sde->tx_tail;
-			   __entry->tx_head = sde->tx_head;
-			   __entry->txnext = txp ? txp->next_descq_idx : ~0;
-			   __entry->idx = sde->this_idx;
-			   __entry->sn = txp ? txp->sn : ~0;
-			   ),
-	    TP_printk(
-		      "[%s] SDE(%u) sn %llu hwhead %u swhead %u next_descq_idx %u tx_head %u tx_tail %u",
-		      __get_str(dev),
-		      __entry->idx,
-		      __entry->sn,
-		      __entry->hwhead,
-		      __entry->swhead,
-		      __entry->txnext,
-		      __entry->tx_head,
-		      __entry->tx_tail
-		      )
-);
-#else
-TRACE_EVENT(hfi1_sdma_progress,
-	    TP_PROTO(struct sdma_engine *sde,
-		     u16 hwhead, u16 swhead,
-		     struct sdma_txreq *txp
-	    ),
-	TP_ARGS(sde, hwhead, swhead, txp),
-	TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-			 __field(u16, hwhead)
-			 __field(u16, swhead)
-			 __field(u16, txnext)
-			 __field(u16, tx_tail)
-			 __field(u16, tx_head)
-			 __field(u8, idx)
-			 ),
-	TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-		       __entry->hwhead = hwhead;
-		       __entry->swhead = swhead;
-		       __entry->tx_tail = sde->tx_tail;
-		       __entry->tx_head = sde->tx_head;
-		       __entry->txnext = txp ? txp->next_descq_idx : ~0;
-		       __entry->idx = sde->this_idx;
-		       ),
-	TP_printk(
-		  "[%s] SDE(%u) hwhead %u swhead %u next_descq_idx %u tx_head %u tx_tail %u",
-		  __get_str(dev),
-		  __entry->idx,
-		  __entry->hwhead,
-		  __entry->swhead,
-		  __entry->txnext,
-		  __entry->tx_head,
-		  __entry->tx_tail
-		  )
-);
-#endif
-
-DECLARE_EVENT_CLASS(hfi1_sdma_sn,
-		    TP_PROTO(struct sdma_engine *sde, u64 sn),
-		    TP_ARGS(sde, sn),
-		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-				     __field(u64, sn)
-				     __field(u8, idx)
-				     ),
-		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-				   __entry->sn = sn;
-				   __entry->idx = sde->this_idx;
-				   ),
-		    TP_printk("[%s] SDE(%u) sn %llu",
-			      __get_str(dev),
-			      __entry->idx,
-			      __entry->sn
-			      )
-);
-
-DEFINE_EVENT(hfi1_sdma_sn, hfi1_sdma_out_sn,
-	     TP_PROTO(
-		struct sdma_engine *sde,
-		u64 sn
-	     ),
-	     TP_ARGS(sde, sn)
-);
-
-DEFINE_EVENT(hfi1_sdma_sn, hfi1_sdma_in_sn,
-	     TP_PROTO(struct sdma_engine *sde, u64 sn),
-	     TP_ARGS(sde, sn)
-);
-
-#define USDMA_HDR_FORMAT \
-	"[%s:%u:%u:%u] PBC=(0x%x 0x%x) LRH=(0x%x 0x%x) BTH=(0x%x 0x%x 0x%x) KDETH=(0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x) TIDVal=0x%x"
-
-TRACE_EVENT(hfi1_sdma_user_header,
-	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 req,
-		     struct hfi1_pkt_header *hdr, u32 tidval),
-	    TP_ARGS(dd, ctxt, subctxt, req, hdr, tidval),
-	    TP_STRUCT__entry(
-		    DD_DEV_ENTRY(dd)
-		    __field(u16, ctxt)
-		    __field(u8, subctxt)
-		    __field(u16, req)
-		    __field(__le32, pbc0)
-		    __field(__le32, pbc1)
-		    __field(__be32, lrh0)
-		    __field(__be32, lrh1)
-		    __field(__be32, bth0)
-		    __field(__be32, bth1)
-		    __field(__be32, bth2)
-		    __field(__le32, kdeth0)
-		    __field(__le32, kdeth1)
-		    __field(__le32, kdeth2)
-		    __field(__le32, kdeth3)
-		    __field(__le32, kdeth4)
-		    __field(__le32, kdeth5)
-		    __field(__le32, kdeth6)
-		    __field(__le32, kdeth7)
-		    __field(__le32, kdeth8)
-		    __field(u32, tidval)
-		    ),
-	    TP_fast_assign(
-		    __le32 *pbc = (__le32 *)hdr->pbc;
-		    __be32 *lrh = (__be32 *)hdr->lrh;
-		    __be32 *bth = (__be32 *)hdr->bth;
-		    __le32 *kdeth = (__le32 *)&hdr->kdeth;
-
-		    DD_DEV_ASSIGN(dd);
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->req = req;
-		    __entry->pbc0 = pbc[0];
-		    __entry->pbc1 = pbc[1];
-		    __entry->lrh0 = be32_to_cpu(lrh[0]);
-		    __entry->lrh1 = be32_to_cpu(lrh[1]);
-		    __entry->bth0 = be32_to_cpu(bth[0]);
-		    __entry->bth1 = be32_to_cpu(bth[1]);
-		    __entry->bth2 = be32_to_cpu(bth[2]);
-		    __entry->kdeth0 = kdeth[0];
-		    __entry->kdeth1 = kdeth[1];
-		    __entry->kdeth2 = kdeth[2];
-		    __entry->kdeth3 = kdeth[3];
-		    __entry->kdeth4 = kdeth[4];
-		    __entry->kdeth5 = kdeth[5];
-		    __entry->kdeth6 = kdeth[6];
-		    __entry->kdeth7 = kdeth[7];
-		    __entry->kdeth8 = kdeth[8];
-		    __entry->tidval = tidval;
-		    ),
-	    TP_printk(USDMA_HDR_FORMAT,
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->req,
-		      __entry->pbc1,
-		      __entry->pbc0,
-		      __entry->lrh0,
-		      __entry->lrh1,
-		      __entry->bth0,
-		      __entry->bth1,
-		      __entry->bth2,
-		      __entry->kdeth0,
-		      __entry->kdeth1,
-		      __entry->kdeth2,
-		      __entry->kdeth3,
-		      __entry->kdeth4,
-		      __entry->kdeth5,
-		      __entry->kdeth6,
-		      __entry->kdeth7,
-		      __entry->kdeth8,
-		      __entry->tidval
-		    )
-	);
-
-#define SDMA_UREQ_FMT \
-	"[%s:%u:%u] ver/op=0x%x, iovcnt=%u, npkts=%u, frag=%u, idx=%u"
-TRACE_EVENT(hfi1_sdma_user_reqinfo,
-	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 *i),
-	    TP_ARGS(dd, ctxt, subctxt, i),
-	    TP_STRUCT__entry(
-		    DD_DEV_ENTRY(dd);
-		    __field(u16, ctxt)
-		    __field(u8, subctxt)
-		    __field(u8, ver_opcode)
-		    __field(u8, iovcnt)
-		    __field(u16, npkts)
-		    __field(u16, fragsize)
-		    __field(u16, comp_idx)
-		    ),
-	    TP_fast_assign(
-		    DD_DEV_ASSIGN(dd);
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->ver_opcode = i[0] & 0xff;
-		    __entry->iovcnt = (i[0] >> 8) & 0xff;
-		    __entry->npkts = i[1];
-		    __entry->fragsize = i[2];
-		    __entry->comp_idx = i[3];
-		    ),
-	    TP_printk(SDMA_UREQ_FMT,
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->ver_opcode,
-		      __entry->iovcnt,
-		      __entry->npkts,
-		      __entry->fragsize,
-		      __entry->comp_idx
-		    )
-	);
-
-#define usdma_complete_name(st) { st, #st }
-#define show_usdma_complete_state(st)			\
-	__print_symbolic(st,				\
-			 usdma_complete_name(FREE),	\
-			 usdma_complete_name(QUEUED),	\
-			 usdma_complete_name(COMPLETE), \
-			 usdma_complete_name(ERROR))
-
-TRACE_EVENT(hfi1_sdma_user_completion,
-	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 idx,
-		     u8 state, int code),
-	    TP_ARGS(dd, ctxt, subctxt, idx, state, code),
-	    TP_STRUCT__entry(
-		    DD_DEV_ENTRY(dd)
-		    __field(u16, ctxt)
-		    __field(u8, subctxt)
-		    __field(u16, idx)
-		    __field(u8, state)
-		    __field(int, code)
-		    ),
-	    TP_fast_assign(
-		    DD_DEV_ASSIGN(dd);
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->idx = idx;
-		    __entry->state = state;
-		    __entry->code = code;
-		    ),
-	    TP_printk("[%s:%u:%u:%u] SDMA completion state %s (%d)",
-		      __get_str(dev), __entry->ctxt, __entry->subctxt,
-		      __entry->idx, show_usdma_complete_state(__entry->state),
-		      __entry->code)
-	);
-
-const char *print_u32_array(struct trace_seq *, u32 *, int);
-#define __print_u32_hex(arr, len) print_u32_array(p, arr, len)
-
-TRACE_EVENT(hfi1_sdma_user_header_ahg,
-	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 req,
-		     u8 sde, u8 ahgidx, u32 *ahg, int len, u32 tidval),
-	    TP_ARGS(dd, ctxt, subctxt, req, sde, ahgidx, ahg, len, tidval),
-	    TP_STRUCT__entry(
-		    DD_DEV_ENTRY(dd)
-		    __field(u16, ctxt)
-		    __field(u8, subctxt)
-		    __field(u16, req)
-		    __field(u8, sde)
-		    __field(u8, idx)
-		    __field(int, len)
-		    __field(u32, tidval)
-		    __array(u32, ahg, 10)
-		    ),
-	    TP_fast_assign(
-		    DD_DEV_ASSIGN(dd);
-		    __entry->ctxt = ctxt;
-		    __entry->subctxt = subctxt;
-		    __entry->req = req;
-		    __entry->sde = sde;
-		    __entry->idx = ahgidx;
-		    __entry->len = len;
-		    __entry->tidval = tidval;
-		    memcpy(__entry->ahg, ahg, len * sizeof(u32));
-		    ),
-	    TP_printk("[%s:%u:%u:%u] (SDE%u/AHG%u) ahg[0-%d]=(%s) TIDVal=0x%x",
-		      __get_str(dev),
-		      __entry->ctxt,
-		      __entry->subctxt,
-		      __entry->req,
-		      __entry->sde,
-		      __entry->idx,
-		      __entry->len - 1,
-		      __print_u32_hex(__entry->ahg, __entry->len),
-		      __entry->tidval
-		    )
-	);
-
-TRACE_EVENT(hfi1_sdma_state,
-	    TP_PROTO(struct sdma_engine *sde,
-		     const char *cstate,
-		     const char *nstate
-		     ),
-	    TP_ARGS(sde, cstate, nstate),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
-			     __string(curstate, cstate)
-			     __string(newstate, nstate)
-			     ),
-	TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
-		       __assign_str(curstate, cstate);
-		       __assign_str(newstate, nstate);
-		       ),
-	TP_printk("[%s] current state %s new state %s",
-		  __get_str(dev),
-		  __get_str(curstate),
-		  __get_str(newstate)
-		  )
-);
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_rc
-
-DECLARE_EVENT_CLASS(hfi1_rc_template,
-		    TP_PROTO(struct rvt_qp *qp, u32 psn),
-		    TP_ARGS(qp, psn),
-		    TP_STRUCT__entry(
-			DD_DEV_ENTRY(dd_from_ibdev(qp->ibqp.device))
-			__field(u32, qpn)
-			__field(u32, s_flags)
-			__field(u32, psn)
-			__field(u32, s_psn)
-			__field(u32, s_next_psn)
-			__field(u32, s_sending_psn)
-			__field(u32, s_sending_hpsn)
-			__field(u32, r_psn)
-			),
-		    TP_fast_assign(
-			DD_DEV_ASSIGN(dd_from_ibdev(qp->ibqp.device))
-			__entry->qpn = qp->ibqp.qp_num;
-			__entry->s_flags = qp->s_flags;
-			__entry->psn = psn;
-			__entry->s_psn = qp->s_psn;
-			__entry->s_next_psn = qp->s_next_psn;
-			__entry->s_sending_psn = qp->s_sending_psn;
-			__entry->s_sending_hpsn = qp->s_sending_hpsn;
-			__entry->r_psn = qp->r_psn;
-			),
-		    TP_printk(
-			"[%s] qpn 0x%x s_flags 0x%x psn 0x%x s_psn 0x%x s_next_psn 0x%x s_sending_psn 0x%x sending_hpsn 0x%x r_psn 0x%x",
-			__get_str(dev),
-			__entry->qpn,
-			__entry->s_flags,
-			__entry->psn,
-			__entry->s_psn,
-			__entry->s_next_psn,
-			__entry->s_sending_psn,
-			__entry->s_sending_hpsn,
-			__entry->r_psn
-			)
-);
-
-DEFINE_EVENT(hfi1_rc_template, hfi1_rc_sendcomplete,
-	     TP_PROTO(struct rvt_qp *qp, u32 psn),
-	     TP_ARGS(qp, psn)
-);
-
-DEFINE_EVENT(hfi1_rc_template, hfi1_rc_ack,
-	     TP_PROTO(struct rvt_qp *qp, u32 psn),
-	     TP_ARGS(qp, psn)
-);
-
-DEFINE_EVENT(hfi1_rc_template, hfi1_rc_timeout,
-	     TP_PROTO(struct rvt_qp *qp, u32 psn),
-	     TP_ARGS(qp, psn)
-);
-
-DEFINE_EVENT(hfi1_rc_template, hfi1_rc_rcv_error,
-	     TP_PROTO(struct rvt_qp *qp, u32 psn),
-	     TP_ARGS(qp, psn)
-);
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_misc
-
-TRACE_EVENT(hfi1_interrupt,
-	    TP_PROTO(struct hfi1_devdata *dd, const struct is_table *is_entry,
-		     int src),
-	    TP_ARGS(dd, is_entry, src),
-	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
-			     __array(char, buf, 64)
-			     __field(int, src)
-			     ),
-	    TP_fast_assign(DD_DEV_ASSIGN(dd)
-			   is_entry->is_name(__entry->buf, 64,
-					     src - is_entry->start);
-			   __entry->src = src;
-			   ),
-	    TP_printk("[%s] source: %s [%d]", __get_str(dev), __entry->buf,
-		      __entry->src)
-);
-
-/*
- * Note:
- * This produces a REALLY ugly trace in the console output when the string is
- * too long.
- */
-
-#undef TRACE_SYSTEM
-#define TRACE_SYSTEM hfi1_trace
-
-#define MAX_MSG_LEN 512
-
-DECLARE_EVENT_CLASS(hfi1_trace_template,
-		    TP_PROTO(const char *function, struct va_format *vaf),
-		    TP_ARGS(function, vaf),
-		    TP_STRUCT__entry(__string(function, function)
-				     __dynamic_array(char, msg, MAX_MSG_LEN)
-				     ),
-		    TP_fast_assign(__assign_str(function, function);
-				   WARN_ON_ONCE(vsnprintf
-						(__get_dynamic_array(msg),
-						 MAX_MSG_LEN, vaf->fmt,
-						 *vaf->va) >=
-						MAX_MSG_LEN);
-				   ),
-		    TP_printk("(%s) %s",
-			      __get_str(function),
-			      __get_str(msg))
-);
-
-/*
- * It may be nice to macroize the __hfi1_trace but the va_* stuff requires an
- * actual function to work and can not be in a macro.
- */
-#define __hfi1_trace_def(lvl) \
-void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);		\
-									\
-DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl,				\
-	TP_PROTO(const char *function, struct va_format *vaf),		\
-	TP_ARGS(function, vaf))
-
-#define __hfi1_trace_fn(lvl) \
-void __hfi1_trace_##lvl(const char *func, char *fmt, ...)		\
-{									\
-	struct va_format vaf = {					\
-		.fmt = fmt,						\
-	};								\
-	va_list args;							\
-									\
-	va_start(args, fmt);						\
-	vaf.va = &args;							\
-	trace_hfi1_ ##lvl(func, &vaf);					\
-	va_end(args);							\
-	return;								\
-}
-
-/*
- * To create a new trace level simply define it below and as a __hfi1_trace_fn
- * in trace.c. This will create all the hooks for calling
- * hfi1_cdbg(LVL, fmt, ...); as well as take care of all
- * the debugfs stuff.
- */
-__hfi1_trace_def(PKT);
-__hfi1_trace_def(PROC);
-__hfi1_trace_def(SDMA);
-__hfi1_trace_def(LINKVERB);
-__hfi1_trace_def(DEBUG);
-__hfi1_trace_def(SNOOP);
-__hfi1_trace_def(CNTR);
-__hfi1_trace_def(PIO);
-__hfi1_trace_def(DC8051);
-__hfi1_trace_def(FIRMWARE);
-__hfi1_trace_def(RCVCTRL);
-__hfi1_trace_def(TID);
-__hfi1_trace_def(MMU);
-__hfi1_trace_def(IOCTL);
-
-#define hfi1_cdbg(which, fmt, ...) \
-	__hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)
-
-#define hfi1_dbg(fmt, ...) \
-	hfi1_cdbg(DEBUG, fmt, ##__VA_ARGS__)
-
-/*
- * Define HFI1_EARLY_DBG at compile time or here to enable early trace
- * messages. Do not check in an enablement for this.
- */
-
-#ifdef HFI1_EARLY_DBG
-#define hfi1_dbg_early(fmt, ...) \
-	trace_printk(fmt, ##__VA_ARGS__)
-#else
-#define hfi1_dbg_early(fmt, ...)
-#endif
-
-#endif /* __HFI1_TRACE_H */
-
-#undef TRACE_INCLUDE_PATH
-#undef TRACE_INCLUDE_FILE
-#define TRACE_INCLUDE_PATH .
-#define TRACE_INCLUDE_FILE trace
-#include <trace/define_trace.h>
+#include "trace_dbg.h"
+#include "trace_misc.h"
+#include "trace_ctxts.h"
+#include "trace_ibhdrs.h"
+#include "trace_rc.h"
+#include "trace_rx.h"
+#include "trace_tx.h"
diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h b/drivers/infiniband/hw/hfi1/trace_ctxts.h
new file mode 100644
index 0000000..5052d49
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -0,0 +1,141 @@
+/*
+* Copyright(c) 2015, 2016 Intel Corporation.
+*
+* This file is provided under a dual BSD/GPLv2 license.  When using or
+* redistributing this file, you may do so under either license.
+*
+* GPL LICENSE SUMMARY
+*
+* This program is free software; you can redistribute it and/or modify
+* it under the terms of version 2 of the GNU General Public License as
+* published by the Free Software Foundation.
+*
+* This program is distributed in the hope that it will be useful, but
+* WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* General Public License for more details.
+*
+* BSD LICENSE
+*
+* Redistribution and use in source and binary forms, with or without
+* modification, are permitted provided that the following conditions
+* are met:
+*
+*  - Redistributions of source code must retain the above copyright
+*    notice, this list of conditions and the following disclaimer.
+*  - Redistributions in binary form must reproduce the above copyright
+*    notice, this list of conditions and the following disclaimer in
+*    the documentation and/or other materials provided with the
+*    distribution.
+*  - Neither the name of Intel Corporation nor the names of its
+*    contributors may be used to endorse or promote products derived
+*    from this software without specific prior written permission.
+*
+* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*
+*/
+#if !defined(__HFI1_TRACE_CTXTS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_CTXTS_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_ctxts
+
+#define UCTXT_FMT \
+	"cred:%u, credaddr:0x%llx, piobase:0x%llx, rcvhdr_cnt:%u, "	\
+	"rcvbase:0x%llx, rcvegrc:%u, rcvegrb:0x%llx"
+TRACE_EVENT(hfi1_uctxtdata,
+	    TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ctxtdata *uctxt),
+	    TP_ARGS(dd, uctxt),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+			     __field(unsigned int, ctxt)
+			     __field(u32, credits)
+			     __field(u64, hw_free)
+			     __field(u64, piobase)
+			     __field(u16, rcvhdrq_cnt)
+			     __field(u64, rcvhdrq_phys)
+			     __field(u32, eager_cnt)
+			     __field(u64, rcvegr_phys)
+			     ),
+	    TP_fast_assign(DD_DEV_ASSIGN(dd);
+			   __entry->ctxt = uctxt->ctxt;
+			   __entry->credits = uctxt->sc->credits;
+			   __entry->hw_free = (u64)uctxt->sc->hw_free;
+			   __entry->piobase = (u64)uctxt->sc->base_addr;
+			   __entry->rcvhdrq_cnt = uctxt->rcvhdrq_cnt;
+			   __entry->rcvhdrq_phys = uctxt->rcvhdrq_phys;
+			   __entry->eager_cnt = uctxt->egrbufs.alloced;
+			   __entry->rcvegr_phys =
+			   uctxt->egrbufs.rcvtids[0].phys;
+			   ),
+	    TP_printk("[%s] ctxt %u " UCTXT_FMT,
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->credits,
+		      __entry->hw_free,
+		      __entry->piobase,
+		      __entry->rcvhdrq_cnt,
+		      __entry->rcvhdrq_phys,
+		      __entry->eager_cnt,
+		      __entry->rcvegr_phys
+		      )
+);
+
+#define CINFO_FMT \
+	"egrtids:%u, egr_size:%u, hdrq_cnt:%u, hdrq_size:%u, sdma_ring_size:%u"
+TRACE_EVENT(hfi1_ctxt_info,
+	    TP_PROTO(struct hfi1_devdata *dd, unsigned int ctxt,
+		     unsigned int subctxt,
+		     struct hfi1_ctxt_info cinfo),
+	    TP_ARGS(dd, ctxt, subctxt, cinfo),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+			     __field(unsigned int, ctxt)
+			     __field(unsigned int, subctxt)
+			     __field(u16, egrtids)
+			     __field(u16, rcvhdrq_cnt)
+			     __field(u16, rcvhdrq_size)
+			     __field(u16, sdma_ring_size)
+			     __field(u32, rcvegr_size)
+			     ),
+	    TP_fast_assign(DD_DEV_ASSIGN(dd);
+			    __entry->ctxt = ctxt;
+			    __entry->subctxt = subctxt;
+			    __entry->egrtids = cinfo.egrtids;
+			    __entry->rcvhdrq_cnt = cinfo.rcvhdrq_cnt;
+			    __entry->rcvhdrq_size = cinfo.rcvhdrq_entsize;
+			    __entry->sdma_ring_size = cinfo.sdma_ring_size;
+			    __entry->rcvegr_size = cinfo.rcvegr_size;
+			    ),
+	    TP_printk("[%s] ctxt %u:%u " CINFO_FMT,
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->egrtids,
+		      __entry->rcvegr_size,
+		      __entry->rcvhdrq_cnt,
+		      __entry->rcvhdrq_size,
+		      __entry->sdma_ring_size
+		      )
+);
+
+#endif /* __HFI1_TRACE_CTXTS_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_ctxts
+#include <trace/define_trace.h>
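
Each of the split headers ends with the same boilerplate seen above; the
!defined(...) || defined(TRACE_HEADER_MULTI_READ) guard is what allows
<trace/define_trace.h> to re-read the header when the tracepoints are
instantiated. A sketch of the skeleton any future trace_foo.h would follow
(foo is a placeholder name):

	#if !defined(__HFI1_TRACE_FOO_H) || defined(TRACE_HEADER_MULTI_READ)
	#define __HFI1_TRACE_FOO_H

	#include <linux/tracepoint.h>

	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM hfi1_foo

	/* TRACE_EVENT()/DECLARE_EVENT_CLASS() definitions go here */

	#endif /* __HFI1_TRACE_FOO_H */

	#undef TRACE_INCLUDE_PATH
	#undef TRACE_INCLUDE_FILE
	#define TRACE_INCLUDE_PATH .
	#define TRACE_INCLUDE_FILE trace_foo
	#include <trace/define_trace.h>
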
diff --git a/drivers/infiniband/hw/hfi1/trace_dbg.h b/drivers/infiniband/hw/hfi1/trace_dbg.h
new file mode 100644
index 0000000..0e7d929
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_dbg.h
@@ -0,0 +1,155 @@
+/*
+* Copyright(c) 2015, 2016 Intel Corporation.
+*
+* This file is provided under a dual BSD/GPLv2 license.  When using or
+* redistributing this file, you may do so under either license.
+*
+* GPL LICENSE SUMMARY
+*
+* This program is free software; you can redistribute it and/or modify
+* it under the terms of version 2 of the GNU General Public License as
+* published by the Free Software Foundation.
+*
+* This program is distributed in the hope that it will be useful, but
+* WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* General Public License for more details.
+*
+* BSD LICENSE
+*
+* Redistribution and use in source and binary forms, with or without
+* modification, are permitted provided that the following conditions
+* are met:
+*
+*  - Redistributions of source code must retain the above copyright
+*    notice, this list of conditions and the following disclaimer.
+*  - Redistributions in binary form must reproduce the above copyright
+*    notice, this list of conditions and the following disclaimer in
+*    the documentation and/or other materials provided with the
+*    distribution.
+*  - Neither the name of Intel Corporation nor the names of its
+*    contributors may be used to endorse or promote products derived
+*    from this software without specific prior written permission.
+*
+* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*
+*/
+#if !defined(__HFI1_TRACE_EXTRA_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_EXTRA_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+/*
+ * Note:
+ * This produces a REALLY ugly trace in the console output when the string is
+ * too long.
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_dbg
+
+#define MAX_MSG_LEN 512
+
+DECLARE_EVENT_CLASS(hfi1_trace_template,
+		    TP_PROTO(const char *function, struct va_format *vaf),
+		    TP_ARGS(function, vaf),
+		    TP_STRUCT__entry(__string(function, function)
+				     __dynamic_array(char, msg, MAX_MSG_LEN)
+				     ),
+		    TP_fast_assign(__assign_str(function, function);
+				   WARN_ON_ONCE(vsnprintf
+						(__get_dynamic_array(msg),
+						 MAX_MSG_LEN, vaf->fmt,
+						 *vaf->va) >=
+						MAX_MSG_LEN);
+				   ),
+		    TP_printk("(%s) %s",
+			      __get_str(function),
+			      __get_str(msg))
+);
+
+/*
+ * It may be nice to macroize __hfi1_trace, but the va_* handling
+ * requires an actual function and cannot be done in a macro.
+ */
+#define __hfi1_trace_def(lvl) \
+void __hfi1_trace_##lvl(const char *funct, char *fmt, ...);		\
+									\
+DEFINE_EVENT(hfi1_trace_template, hfi1_ ##lvl,				\
+	TP_PROTO(const char *function, struct va_format *vaf),		\
+	TP_ARGS(function, vaf))
+
+#define __hfi1_trace_fn(lvl) \
+void __hfi1_trace_##lvl(const char *func, char *fmt, ...)		\
+{									\
+	struct va_format vaf = {					\
+		.fmt = fmt,						\
+	};								\
+	va_list args;							\
+									\
+	va_start(args, fmt);						\
+	vaf.va = &args;							\
+	trace_hfi1_ ##lvl(func, &vaf);					\
+	va_end(args);							\
+	return;								\
+}
+
+/*
+ * To create a new trace level, define it below and add a matching
+ * __hfi1_trace_fn in trace.c. This creates all the hooks for
+ * calling hfi1_cdbg(LVL, fmt, ...) as well as taking care of all
+ * the debugfs stuff.
+ */
+__hfi1_trace_def(PKT);
+__hfi1_trace_def(PROC);
+__hfi1_trace_def(SDMA);
+__hfi1_trace_def(LINKVERB);
+__hfi1_trace_def(DEBUG);
+__hfi1_trace_def(SNOOP);
+__hfi1_trace_def(CNTR);
+__hfi1_trace_def(PIO);
+__hfi1_trace_def(DC8051);
+__hfi1_trace_def(FIRMWARE);
+__hfi1_trace_def(RCVCTRL);
+__hfi1_trace_def(TID);
+__hfi1_trace_def(MMU);
+__hfi1_trace_def(IOCTL);
+
+#define hfi1_cdbg(which, fmt, ...) \
+	__hfi1_trace_##which(__func__, fmt, ##__VA_ARGS__)
+
+#define hfi1_dbg(fmt, ...) \
+	hfi1_cdbg(DEBUG, fmt, ##__VA_ARGS__)
+
+/*
+ * Define HFI1_EARLY_DBG at compile time or here to enable early trace
+ * messages. Do not check in an enablement for this.
+ */
+
+#ifdef HFI1_EARLY_DBG
+#define hfi1_dbg_early(fmt, ...) \
+	trace_printk(fmt, ##__VA_ARGS__)
+#else
+#define hfi1_dbg_early(fmt, ...)
+#endif
+
+#endif /* __HFI1_TRACE_EXTRA_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_dbg
+#include <trace/define_trace.h>
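
The debug macros behave exactly as they did before the split; a hedged usage
sketch (the SDMA level body lives in trace.c, and the message and arguments
below are made up):

	/* trace.c instantiates the body for each level declared above */
	__hfi1_trace_fn(SDMA);

	/* any hfi1 source file can then do */
	hfi1_cdbg(SDMA, "engine %u stalled at head %u", engine, head);
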
diff --git a/drivers/infiniband/hw/hfi1/trace_ibhdrs.h b/drivers/infiniband/hw/hfi1/trace_ibhdrs.h
new file mode 100644
index 0000000..c3e41ae
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_ibhdrs.h
@@ -0,0 +1,209 @@
+/*
+ * Copyright(c) 2015, 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#if !defined(__HFI1_TRACE_IBHDRS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_IBHDRS_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_ibhdrs
+
+u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr);
+const char *parse_everbs_hdrs(struct trace_seq *p, u8 opcode, void *ehdrs);
+
+#define __parse_ib_ehdrs(op, ehdrs) parse_everbs_hdrs(p, op, ehdrs)
+
+#define lrh_name(lrh) { HFI1_##lrh, #lrh }
+#define show_lnh(lrh)                    \
+__print_symbolic(lrh,                    \
+	lrh_name(LRH_BTH),               \
+	lrh_name(LRH_GRH))
+
+#define LRH_PRN "vl %d lver %d sl %d lnh %d,%s dlid %.4x len %d slid %.4x"
+#define BTH_PRN \
+	"op 0x%.2x,%s se %d m %d pad %d tver %d pkey 0x%.4x " \
+	"f %d b %d qpn 0x%.6x a %d psn 0x%.8x"
+#define EHDR_PRN "%s"
+
+DECLARE_EVENT_CLASS(hfi1_ibhdr_template,
+		    TP_PROTO(struct hfi1_devdata *dd,
+			     struct hfi1_ib_header *hdr),
+		    TP_ARGS(dd, hdr),
+		    TP_STRUCT__entry(
+			DD_DEV_ENTRY(dd)
+			/* LRH */
+			__field(u8, vl)
+			__field(u8, lver)
+			__field(u8, sl)
+			__field(u8, lnh)
+			__field(u16, dlid)
+			__field(u16, len)
+			__field(u16, slid)
+			/* BTH */
+			__field(u8, opcode)
+			__field(u8, se)
+			__field(u8, m)
+			__field(u8, pad)
+			__field(u8, tver)
+			__field(u16, pkey)
+			__field(u8, f)
+			__field(u8, b)
+			__field(u32, qpn)
+			__field(u8, a)
+			__field(u32, psn)
+			/* extended headers */
+			__dynamic_array(u8, ehdrs, ibhdr_exhdr_len(hdr))
+			),
+		      TP_fast_assign(
+			struct hfi1_other_headers *ohdr;
+
+			DD_DEV_ASSIGN(dd);
+			/* LRH */
+			__entry->vl =
+			(u8)(be16_to_cpu(hdr->lrh[0]) >> 12);
+			__entry->lver =
+			(u8)(be16_to_cpu(hdr->lrh[0]) >> 8) & 0xf;
+			__entry->sl =
+			(u8)(be16_to_cpu(hdr->lrh[0]) >> 4) & 0xf;
+			__entry->lnh =
+			(u8)(be16_to_cpu(hdr->lrh[0]) & 3);
+			__entry->dlid =
+			be16_to_cpu(hdr->lrh[1]);
+			/* allow for larger len */
+			__entry->len =
+			be16_to_cpu(hdr->lrh[2]);
+			__entry->slid =
+			be16_to_cpu(hdr->lrh[3]);
+			/* BTH */
+			if (__entry->lnh == HFI1_LRH_BTH)
+			ohdr = &hdr->u.oth;
+			else
+			ohdr = &hdr->u.l.oth;
+			__entry->opcode =
+			(be32_to_cpu(ohdr->bth[0]) >> 24) & 0xff;
+			__entry->se =
+			(be32_to_cpu(ohdr->bth[0]) >> 23) & 1;
+			__entry->m =
+			(be32_to_cpu(ohdr->bth[0]) >> 22) & 1;
+			__entry->pad =
+			(be32_to_cpu(ohdr->bth[0]) >> 20) & 3;
+			__entry->tver =
+			(be32_to_cpu(ohdr->bth[0]) >> 16) & 0xf;
+			__entry->pkey =
+			be32_to_cpu(ohdr->bth[0]) & 0xffff;
+			__entry->f =
+			(be32_to_cpu(ohdr->bth[1]) >> HFI1_FECN_SHIFT) &
+			HFI1_FECN_MASK;
+			__entry->b =
+			(be32_to_cpu(ohdr->bth[1]) >> HFI1_BECN_SHIFT) &
+			HFI1_BECN_MASK;
+			__entry->qpn =
+			be32_to_cpu(ohdr->bth[1]) & RVT_QPN_MASK;
+			__entry->a =
+			(be32_to_cpu(ohdr->bth[2]) >> 31) & 1;
+			/* allow for larger PSN */
+			__entry->psn =
+			be32_to_cpu(ohdr->bth[2]) & 0x7fffffff;
+			/* extended headers */
+			memcpy(__get_dynamic_array(ehdrs), &ohdr->u,
+			       ibhdr_exhdr_len(hdr));
+			),
+		TP_printk("[%s] " LRH_PRN " " BTH_PRN " " EHDR_PRN,
+			  __get_str(dev),
+			  /* LRH */
+			  __entry->vl,
+			  __entry->lver,
+			  __entry->sl,
+			  __entry->lnh, show_lnh(__entry->lnh),
+			  __entry->dlid,
+			  __entry->len,
+			  __entry->slid,
+			  /* BTH */
+			  __entry->opcode, show_ib_opcode(__entry->opcode),
+			  __entry->se,
+			  __entry->m,
+			  __entry->pad,
+			  __entry->tver,
+			  __entry->pkey,
+			  __entry->f,
+			  __entry->b,
+			  __entry->qpn,
+			  __entry->a,
+			  __entry->psn,
+			  /* extended headers */
+			  __parse_ib_ehdrs(
+				__entry->opcode,
+				(void *)__get_dynamic_array(ehdrs))
+			)
+);
+
+DEFINE_EVENT(hfi1_ibhdr_template, input_ibhdr,
+	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
+	     TP_ARGS(dd, hdr));
+
+DEFINE_EVENT(hfi1_ibhdr_template, pio_output_ibhdr,
+	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
+	     TP_ARGS(dd, hdr));
+
+DEFINE_EVENT(hfi1_ibhdr_template, ack_output_ibhdr,
+	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
+	     TP_ARGS(dd, hdr));
+
+DEFINE_EVENT(hfi1_ibhdr_template, sdma_output_ibhdr,
+	     TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ib_header *hdr),
+	     TP_ARGS(dd, hdr));
+
+#endif /* __HFI1_TRACE_IBHDRS_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_ibhdrs
+#include <trace/define_trace.h>
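
All four ibhdr events share hfi1_ibhdr_template, so they record identical
fields and differ only in name and call site. The generated _enabled()
static inline lets a hot path skip the argument setup when the event is off;
a sketch of a call site (placement is illustrative):

	/* e.g. in the PIO send path, once hdr points at the IB header */
	if (trace_pio_output_ibhdr_enabled())
		trace_pio_output_ibhdr(dd, hdr);
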
diff --git a/drivers/infiniband/hw/hfi1/trace_misc.h b/drivers/infiniband/hw/hfi1/trace_misc.h
new file mode 100644
index 0000000..d308454
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_misc.h
@@ -0,0 +1,81 @@
+/*
+* Copyright(c) 2015, 2016 Intel Corporation.
+*
+* This file is provided under a dual BSD/GPLv2 license.  When using or
+* redistributing this file, you may do so under either license.
+*
+* GPL LICENSE SUMMARY
+*
+* This program is free software; you can redistribute it and/or modify
+* it under the terms of version 2 of the GNU General Public License as
+* published by the Free Software Foundation.
+*
+* This program is distributed in the hope that it will be useful, but
+* WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* General Public License for more details.
+*
+* BSD LICENSE
+*
+* Redistribution and use in source and binary forms, with or without
+* modification, are permitted provided that the following conditions
+* are met:
+*
+*  - Redistributions of source code must retain the above copyright
+*    notice, this list of conditions and the following disclaimer.
+*  - Redistributions in binary form must reproduce the above copyright
+*    notice, this list of conditions and the following disclaimer in
+*    the documentation and/or other materials provided with the
+*    distribution.
+*  - Neither the name of Intel Corporation nor the names of its
+*    contributors may be used to endorse or promote products derived
+*    from this software without specific prior written permission.
+*
+* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*
+*/
+#if !defined(__HFI1_TRACE_MISC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_MISC_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_misc
+
+TRACE_EVENT(hfi1_interrupt,
+	    TP_PROTO(struct hfi1_devdata *dd, const struct is_table *is_entry,
+		     int src),
+	    TP_ARGS(dd, is_entry, src),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+			     __array(char, buf, 64)
+			     __field(int, src)
+			     ),
+	    TP_fast_assign(DD_DEV_ASSIGN(dd)
+			   is_entry->is_name(__entry->buf, 64,
+					     src - is_entry->start);
+			   __entry->src = src;
+			   ),
+	    TP_printk("[%s] source: %s [%d]", __get_str(dev), __entry->buf,
+		      __entry->src)
+);
+
+#endif /* __HFI1_TRACE_MISC_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_misc
+#include <trace/define_trace.h>
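
hfi1_interrupt resolves the interrupt source name at trace time, inside
TP_fast_assign, because the is_name callback cannot be run later when the
ring buffer is read. A hypothetical callback matching the call above (the
real tables live in chip.c):

	/* hypothetical is_name implementation; buf is __entry->buf (64 bytes) */
	static char *example_int_name(char *buf, size_t size, unsigned int source)
	{
		snprintf(buf, size, "SDmaInt%u", source);
		return buf;
	}
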
diff --git a/drivers/infiniband/hw/hfi1/trace_rc.h b/drivers/infiniband/hw/hfi1/trace_rc.h
new file mode 100644
index 0000000..5ea5005
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_rc.h
@@ -0,0 +1,123 @@
+/*
+* Copyright(c) 2015, 2016 Intel Corporation.
+*
+* This file is provided under a dual BSD/GPLv2 license.  When using or
+* redistributing this file, you may do so under either license.
+*
+* GPL LICENSE SUMMARY
+*
+* This program is free software; you can redistribute it and/or modify
+* it under the terms of version 2 of the GNU General Public License as
+* published by the Free Software Foundation.
+*
+* This program is distributed in the hope that it will be useful, but
+* WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+* General Public License for more details.
+*
+* BSD LICENSE
+*
+* Redistribution and use in source and binary forms, with or without
+* modification, are permitted provided that the following conditions
+* are met:
+*
+*  - Redistributions of source code must retain the above copyright
+*    notice, this list of conditions and the following disclaimer.
+*  - Redistributions in binary form must reproduce the above copyright
+*    notice, this list of conditions and the following disclaimer in
+*    the documentation and/or other materials provided with the
+*    distribution.
+*  - Neither the name of Intel Corporation nor the names of its
+*    contributors may be used to endorse or promote products derived
+*    from this software without specific prior written permission.
+*
+* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+* "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+* LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+* A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+* OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+* SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+* LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+* DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+* OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+*
+*/
+#if !defined(__HFI1_TRACE_RC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_RC_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_rc
+
+DECLARE_EVENT_CLASS(hfi1_rc_template,
+		    TP_PROTO(struct rvt_qp *qp, u32 psn),
+		    TP_ARGS(qp, psn),
+		    TP_STRUCT__entry(
+			DD_DEV_ENTRY(dd_from_ibdev(qp->ibqp.device))
+			__field(u32, qpn)
+			__field(u32, s_flags)
+			__field(u32, psn)
+			__field(u32, s_psn)
+			__field(u32, s_next_psn)
+			__field(u32, s_sending_psn)
+			__field(u32, s_sending_hpsn)
+			__field(u32, r_psn)
+			),
+		    TP_fast_assign(
+			DD_DEV_ASSIGN(dd_from_ibdev(qp->ibqp.device))
+			__entry->qpn = qp->ibqp.qp_num;
+			__entry->s_flags = qp->s_flags;
+			__entry->psn = psn;
+			__entry->s_psn = qp->s_psn;
+			__entry->s_next_psn = qp->s_next_psn;
+			__entry->s_sending_psn = qp->s_sending_psn;
+			__entry->s_sending_hpsn = qp->s_sending_hpsn;
+			__entry->r_psn = qp->r_psn;
+			),
+		    TP_printk(
+			"[%s] qpn 0x%x s_flags 0x%x psn 0x%x s_psn 0x%x s_next_psn 0x%x s_sending_psn 0x%x sending_hpsn 0x%x r_psn 0x%x",
+			__get_str(dev),
+			__entry->qpn,
+			__entry->s_flags,
+			__entry->psn,
+			__entry->s_psn,
+			__entry->s_next_psn,
+			__entry->s_sending_psn,
+			__entry->s_sending_hpsn,
+			__entry->r_psn
+			)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_sendcomplete,
+	     TP_PROTO(struct rvt_qp *qp, u32 psn),
+	     TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_ack,
+	     TP_PROTO(struct rvt_qp *qp, u32 psn),
+	     TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_timeout,
+	     TP_PROTO(struct rvt_qp *qp, u32 psn),
+	     TP_ARGS(qp, psn)
+);
+
+DEFINE_EVENT(hfi1_rc_template, hfi1_rcv_error,
+	     TP_PROTO(struct rvt_qp *qp, u32 psn),
+	     TP_ARGS(qp, psn)
+);
+
+#endif /* __HFI1_TRACE_RC_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_rc
+#include <trace/define_trace.h>
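
One thing to note for reviewers: the trace.h versions of these events removed
above were hfi1_rc_sendcomplete, hfi1_rc_ack, hfi1_rc_timeout and
hfi1_rc_rcv_error; the split header drops the rc_ component, so the hfi1_rc
part now comes only from TRACE_SYSTEM (i.e. events/hfi1_rc/hfi1_ack in
tracefs) and the call sites change shape accordingly. A sketch of a call site
after this patch:

	/* rc.c, on receipt of an ACK covering packet sequence number psn */
	trace_hfi1_ack(qp, psn);
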
diff --git a/drivers/infiniband/hw/hfi1/trace_rx.h b/drivers/infiniband/hw/hfi1/trace_rx.h
new file mode 100644
index 0000000..9ba1f61
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_rx.h
@@ -0,0 +1,322 @@
+/*
+ * Copyright(c) 2015, 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#if !defined(__HFI1_TRACE_RX_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_RX_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_rx
+
+TRACE_EVENT(hfi1_rcvhdr,
+	    TP_PROTO(struct hfi1_devdata *dd,
+		     u32 ctxt,
+		     u64 eflags,
+		     u32 etype,
+		     u32 hlen,
+		     u32 tlen,
+		     u32 updegr,
+		     u32 etail
+		    ),
+	    TP_ARGS(dd, ctxt, eflags, etype, hlen, tlen, updegr, etail),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+			     __field(u64, eflags)
+			     __field(u32, ctxt)
+			     __field(u32, etype)
+			     __field(u32, hlen)
+			     __field(u32, tlen)
+			     __field(u32, updegr)
+			     __field(u32, etail)
+			     ),
+	     TP_fast_assign(DD_DEV_ASSIGN(dd);
+			    __entry->eflags = eflags;
+			    __entry->ctxt = ctxt;
+			    __entry->etype = etype;
+			    __entry->hlen = hlen;
+			    __entry->tlen = tlen;
+			    __entry->updegr = updegr;
+			    __entry->etail = etail;
+			    ),
+	     TP_printk(
+		"[%s] ctxt %d eflags 0x%llx etype %d,%s hlen %d tlen %d updegr %d etail %d",
+		__get_str(dev),
+		__entry->ctxt,
+		__entry->eflags,
+		__entry->etype, show_packettype(__entry->etype),
+		__entry->hlen,
+		__entry->tlen,
+		__entry->updegr,
+		__entry->etail
+		)
+);
+
+TRACE_EVENT(hfi1_receive_interrupt,
+	    TP_PROTO(struct hfi1_devdata *dd, u32 ctxt),
+	    TP_ARGS(dd, ctxt),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+			     __field(u32, ctxt)
+			     __field(u8, slow_path)
+			     __field(u8, dma_rtail)
+			     ),
+	    TP_fast_assign(DD_DEV_ASSIGN(dd);
+			__entry->ctxt = ctxt;
+			if (dd->rcd[ctxt]->do_interrupt ==
+			    &handle_receive_interrupt) {
+				__entry->slow_path = 1;
+				__entry->dma_rtail = 0xFF;
+			} else if (dd->rcd[ctxt]->do_interrupt ==
+					&handle_receive_interrupt_dma_rtail) {
+				__entry->dma_rtail = 1;
+				__entry->slow_path = 0;
+			} else if (dd->rcd[ctxt]->do_interrupt ==
+					&handle_receive_interrupt_nodma_rtail) {
+				__entry->dma_rtail = 0;
+				__entry->slow_path = 0;
+			}
+			),
+	    TP_printk("[%s] ctxt %d SlowPath: %d DmaRtail: %d",
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->slow_path,
+		      __entry->dma_rtail
+		      )
+);
+
+TRACE_EVENT(hfi1_exp_tid_reg,
+	    TP_PROTO(unsigned int ctxt, u16 subctxt, u32 rarr,
+		     u32 npages, unsigned long va, unsigned long pa,
+		     dma_addr_t dma),
+	    TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
+	    TP_STRUCT__entry(
+			     __field(unsigned int, ctxt)
+			     __field(u16, subctxt)
+			     __field(u32, rarr)
+			     __field(u32, npages)
+			     __field(unsigned long, va)
+			     __field(unsigned long, pa)
+			     __field(dma_addr_t, dma)
+			     ),
+	    TP_fast_assign(
+			   __entry->ctxt = ctxt;
+			   __entry->subctxt = subctxt;
+			   __entry->rarr = rarr;
+			   __entry->npages = npages;
+			   __entry->va = va;
+			   __entry->pa = pa;
+			   __entry->dma = dma;
+			   ),
+	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->rarr,
+		      __entry->npages,
+		      __entry->pa,
+		      __entry->va,
+		      __entry->dma
+		      )
+	);
+
+TRACE_EVENT(hfi1_exp_tid_unreg,
+	    TP_PROTO(unsigned int ctxt, u16 subctxt, u32 rarr, u32 npages,
+		     unsigned long va, unsigned long pa, dma_addr_t dma),
+	    TP_ARGS(ctxt, subctxt, rarr, npages, va, pa, dma),
+	    TP_STRUCT__entry(
+			     __field(unsigned int, ctxt)
+			     __field(u16, subctxt)
+			     __field(u32, rarr)
+			     __field(u32, npages)
+			     __field(unsigned long, va)
+			     __field(unsigned long, pa)
+			     __field(dma_addr_t, dma)
+			     ),
+	    TP_fast_assign(
+			   __entry->ctxt = ctxt;
+			   __entry->subctxt = subctxt;
+			   __entry->rarr = rarr;
+			   __entry->npages = npages;
+			   __entry->va = va;
+			   __entry->pa = pa;
+			   __entry->dma = dma;
+			   ),
+	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx, va:0x%lx dma:0x%llx",
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->rarr,
+		      __entry->npages,
+		      __entry->pa,
+		      __entry->va,
+		      __entry->dma
+		      )
+	);
+
+TRACE_EVENT(hfi1_exp_tid_inval,
+	    TP_PROTO(unsigned int ctxt, u16 subctxt, unsigned long va, u32 rarr,
+		     u32 npages, dma_addr_t dma),
+	    TP_ARGS(ctxt, subctxt, va, rarr, npages, dma),
+	    TP_STRUCT__entry(
+			     __field(unsigned int, ctxt)
+			     __field(u16, subctxt)
+			     __field(unsigned long, va)
+			     __field(u32, rarr)
+			     __field(u32, npages)
+			     __field(dma_addr_t, dma)
+			     ),
+	    TP_fast_assign(
+			   __entry->ctxt = ctxt;
+			   __entry->subctxt = subctxt;
+			   __entry->va = va;
+			   __entry->rarr = rarr;
+			   __entry->npages = npages;
+			   __entry->dma = dma;
+			  ),
+	    TP_printk("[%u:%u] entry:%u, %u pages @ 0x%lx dma: 0x%llx",
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->rarr,
+		      __entry->npages,
+		      __entry->va,
+		      __entry->dma
+		      )
+	    );
+
+TRACE_EVENT(hfi1_mmu_invalidate,
+	    TP_PROTO(unsigned int ctxt, u16 subctxt, const char *type,
+		     unsigned long start, unsigned long end),
+	    TP_ARGS(ctxt, subctxt, type, start, end),
+	    TP_STRUCT__entry(
+			     __field(unsigned int, ctxt)
+			     __field(u16, subctxt)
+			     __string(type, type)
+			     __field(unsigned long, start)
+			     __field(unsigned long, end)
+			     ),
+	    TP_fast_assign(
+			__entry->ctxt = ctxt;
+			__entry->subctxt = subctxt;
+			__assign_str(type, type);
+			__entry->start = start;
+			__entry->end = end;
+	    ),
+	    TP_printk("[%3u:%02u] MMU Invalidate (%s) 0x%lx - 0x%lx",
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __get_str(type),
+		      __entry->start,
+		      __entry->end
+		      )
+	    );
+
+#define SNOOP_PRN \
+	"slid %.4x dlid %.4x qpn 0x%.6x opcode 0x%.2x,%s " \
+	"svc lvl %d pkey 0x%.4x [header = %d bytes] [data = %d bytes]"
+
+TRACE_EVENT(snoop_capture,
+	    TP_PROTO(struct hfi1_devdata *dd,
+		     int hdr_len,
+		     struct hfi1_ib_header *hdr,
+		     int data_len,
+		     void *data),
+	    TP_ARGS(dd, hdr_len, hdr, data_len, data),
+	    TP_STRUCT__entry(
+			     DD_DEV_ENTRY(dd)
+			     __field(u16, slid)
+			     __field(u16, dlid)
+			     __field(u32, qpn)
+			     __field(u8, opcode)
+			     __field(u8, sl)
+			     __field(u16, pkey)
+			     __field(u32, hdr_len)
+			     __field(u32, data_len)
+			     __field(u8, lnh)
+			     __dynamic_array(u8, raw_hdr, hdr_len)
+			     __dynamic_array(u8, raw_pkt, data_len)
+			     ),
+	    TP_fast_assign(
+		struct hfi1_other_headers *ohdr;
+
+		__entry->lnh = (u8)(be16_to_cpu(hdr->lrh[0]) & 3);
+		if (__entry->lnh == HFI1_LRH_BTH)
+		ohdr = &hdr->u.oth;
+		else
+		ohdr = &hdr->u.l.oth;
+		DD_DEV_ASSIGN(dd);
+		__entry->slid = be16_to_cpu(hdr->lrh[3]);
+		__entry->dlid = be16_to_cpu(hdr->lrh[1]);
+		__entry->qpn = be32_to_cpu(ohdr->bth[1]) & RVT_QPN_MASK;
+		__entry->opcode = (be32_to_cpu(ohdr->bth[0]) >> 24) & 0xff;
+		__entry->sl = (u8)(be16_to_cpu(hdr->lrh[0]) >> 4) & 0xf;
+		__entry->pkey = be32_to_cpu(ohdr->bth[0]) & 0xffff;
+		__entry->hdr_len = hdr_len;
+		__entry->data_len = data_len;
+		memcpy(__get_dynamic_array(raw_hdr), hdr, hdr_len);
+		memcpy(__get_dynamic_array(raw_pkt), data, data_len);
+		),
+	    TP_printk(
+		"[%s] " SNOOP_PRN,
+		__get_str(dev),
+		__entry->slid,
+		__entry->dlid,
+		__entry->qpn,
+		__entry->opcode,
+		show_ib_opcode(__entry->opcode),
+		__entry->sl,
+		__entry->pkey,
+		__entry->hdr_len,
+		__entry->data_len
+		)
+);
+
+#endif /* __HFI1_TRACE_RX_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_rx
+#include <trace/define_trace.h>
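
One nit in trace_rx.h: in hfi1_receive_interrupt, if do_interrupt matches
none of the three handlers, neither slow_path nor dma_rtail is assigned, and
ring-buffer entries are not zeroed. If that case can ever occur, a defensive
tail for the if/else chain (a sketch, matching the style above) would keep
the output unambiguous:

	} else {
		/* unknown handler: flag both fields as invalid */
		__entry->slow_path = 0xFF;
		__entry->dma_rtail = 0xFF;
	}
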
diff --git a/drivers/infiniband/hw/hfi1/trace_tx.h b/drivers/infiniband/hw/hfi1/trace_tx.h
new file mode 100644
index 0000000..79c93ec
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace_tx.h
@@ -0,0 +1,642 @@
+/*
+ * Copyright(c) 2015, 2016 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#if !defined(__HFI1_TRACE_TX_H) || defined(TRACE_HEADER_MULTI_READ)
+#define __HFI1_TRACE_TX_H
+
+#include <linux/tracepoint.h>
+#include <linux/trace_seq.h>
+
+#include "hfi.h"
+#include "mad.h"
+#include "sdma.h"
+
+const char *parse_sdma_flags(struct trace_seq *p, u64 desc0, u64 desc1);
+
+#define __parse_sdma_flags(desc0, desc1) parse_sdma_flags(p, desc0, desc1)
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM hfi1_tx
+
+TRACE_EVENT(hfi1_piofree,
+	    TP_PROTO(struct send_context *sc, int extra),
+	    TP_ARGS(sc, extra),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(sc->dd)
+	    __field(u32, sw_index)
+	    __field(u32, hw_context)
+	    __field(int, extra)
+	    ),
+	    TP_fast_assign(DD_DEV_ASSIGN(sc->dd);
+	    __entry->sw_index = sc->sw_index;
+	    __entry->hw_context = sc->hw_context;
+	    __entry->extra = extra;
+	    ),
+	    TP_printk("[%s] ctxt %u(%u) extra %d",
+		      __get_str(dev),
+		      __entry->sw_index,
+		      __entry->hw_context,
+		      __entry->extra
+	    )
+);
+
+TRACE_EVENT(hfi1_wantpiointr,
+	    TP_PROTO(struct send_context *sc, u32 needint, u64 credit_ctrl),
+	    TP_ARGS(sc, needint, credit_ctrl),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(sc->dd)
+			__field(u32, sw_index)
+			__field(u32, hw_context)
+			__field(u32, needint)
+			__field(u64, credit_ctrl)
+			),
+	    TP_fast_assign(DD_DEV_ASSIGN(sc->dd);
+			__entry->sw_index = sc->sw_index;
+			__entry->hw_context = sc->hw_context;
+			__entry->needint = needint;
+			__entry->credit_ctrl = credit_ctrl;
+			),
+	    TP_printk("[%s] ctxt %u(%u) on %d credit_ctrl 0x%llx",
+		      __get_str(dev),
+		      __entry->sw_index,
+		      __entry->hw_context,
+		      __entry->needint,
+		      (unsigned long long)__entry->credit_ctrl
+		      )
+);
+
+DECLARE_EVENT_CLASS(hfi1_qpsleepwakeup_template,
+		    TP_PROTO(struct rvt_qp *qp, u32 flags),
+		    TP_ARGS(qp, flags),
+		    TP_STRUCT__entry(
+		    DD_DEV_ENTRY(dd_from_ibdev(qp->ibqp.device))
+		    __field(u32, qpn)
+		    __field(u32, flags)
+		    __field(u32, s_flags)
+		    ),
+		    TP_fast_assign(
+		    DD_DEV_ASSIGN(dd_from_ibdev(qp->ibqp.device))
+		    __entry->flags = flags;
+		    __entry->qpn = qp->ibqp.qp_num;
+		    __entry->s_flags = qp->s_flags;
+		    ),
+		    TP_printk(
+		    "[%s] qpn 0x%x flags 0x%x s_flags 0x%x",
+		    __get_str(dev),
+		    __entry->qpn,
+		    __entry->flags,
+		    __entry->s_flags
+		    )
+);
+
+DEFINE_EVENT(hfi1_qpsleepwakeup_template, hfi1_qpwakeup,
+	     TP_PROTO(struct rvt_qp *qp, u32 flags),
+	     TP_ARGS(qp, flags));
+
+DEFINE_EVENT(hfi1_qpsleepwakeup_template, hfi1_qpsleep,
+	     TP_PROTO(struct rvt_qp *qp, u32 flags),
+	     TP_ARGS(qp, flags));
+
+TRACE_EVENT(hfi1_sdma_descriptor,
+	    TP_PROTO(struct sdma_engine *sde,
+		     u64 desc0,
+		     u64 desc1,
+		     u16 e,
+		     void *descp),
+		     TP_ARGS(sde, desc0, desc1, e, descp),
+		     TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		     __field(void *, descp)
+		     __field(u64, desc0)
+		     __field(u64, desc1)
+		     __field(u16, e)
+		     __field(u8, idx)
+		     ),
+		     TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		     __entry->desc0 = desc0;
+		     __entry->desc1 = desc1;
+		     __entry->idx = sde->this_idx;
+		     __entry->descp = descp;
+		     __entry->e = e;
+		     ),
+	    TP_printk(
+	    "[%s] SDE(%u) flags:%s addr:0x%016llx gen:%u len:%u d0:%016llx d1:%016llx to %p,%u",
+	    __get_str(dev),
+	    __entry->idx,
+	    __parse_sdma_flags(__entry->desc0, __entry->desc1),
+	    (__entry->desc0 >> SDMA_DESC0_PHY_ADDR_SHIFT) &
+	    SDMA_DESC0_PHY_ADDR_MASK,
+	    (u8)((__entry->desc1 >> SDMA_DESC1_GENERATION_SHIFT) &
+	    SDMA_DESC1_GENERATION_MASK),
+	    (u16)((__entry->desc0 >> SDMA_DESC0_BYTE_COUNT_SHIFT) &
+	    SDMA_DESC0_BYTE_COUNT_MASK),
+	    __entry->desc0,
+	    __entry->desc1,
+	    __entry->descp,
+	    __entry->e
+	    )
+);
+
+TRACE_EVENT(hfi1_sdma_engine_select,
+	    TP_PROTO(struct hfi1_devdata *dd, u32 sel, u8 vl, u8 idx),
+	    TP_ARGS(dd, sel, vl, idx),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+	    __field(u32, sel)
+	    __field(u8, vl)
+	    __field(u8, idx)
+	    ),
+	    TP_fast_assign(DD_DEV_ASSIGN(dd);
+	    __entry->sel = sel;
+	    __entry->vl = vl;
+	    __entry->idx = idx;
+	    ),
+	    TP_printk("[%s] selecting SDE %u sel 0x%x vl %u",
+		      __get_str(dev),
+		      __entry->idx,
+		      __entry->sel,
+		      __entry->vl
+		      )
+);
+
+DECLARE_EVENT_CLASS(hfi1_sdma_engine_class,
+		    TP_PROTO(struct sdma_engine *sde, u64 status),
+		    TP_ARGS(sde, status),
+		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		    __field(u64, status)
+		    __field(u8, idx)
+		    ),
+		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		    __entry->status = status;
+		    __entry->idx = sde->this_idx;
+		    ),
+		    TP_printk("[%s] SDE(%u) status %llx",
+			      __get_str(dev),
+			      __entry->idx,
+			      (unsigned long long)__entry->status
+			      )
+);
+
+DEFINE_EVENT(hfi1_sdma_engine_class, hfi1_sdma_engine_interrupt,
+	     TP_PROTO(struct sdma_engine *sde, u64 status),
+	     TP_ARGS(sde, status)
+);
+
+DEFINE_EVENT(hfi1_sdma_engine_class, hfi1_sdma_engine_progress,
+	     TP_PROTO(struct sdma_engine *sde, u64 status),
+	     TP_ARGS(sde, status)
+);
+
+DECLARE_EVENT_CLASS(hfi1_sdma_ahg_ad,
+		    TP_PROTO(struct sdma_engine *sde, int aidx),
+		    TP_ARGS(sde, aidx),
+		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		    __field(int, aidx)
+		    __field(u8, idx)
+		    ),
+		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		    __entry->idx = sde->this_idx;
+		    __entry->aidx = aidx;
+		    ),
+		    TP_printk("[%s] SDE(%u) aidx %d",
+			      __get_str(dev),
+			      __entry->idx,
+			      __entry->aidx
+			      )
+);
+
+DEFINE_EVENT(hfi1_sdma_ahg_ad, hfi1_ahg_allocate,
+	     TP_PROTO(struct sdma_engine *sde, int aidx),
+	     TP_ARGS(sde, aidx));
+
+DEFINE_EVENT(hfi1_sdma_ahg_ad, hfi1_ahg_deallocate,
+	     TP_PROTO(struct sdma_engine *sde, int aidx),
+	     TP_ARGS(sde, aidx));
+
+#ifdef CONFIG_HFI1_DEBUG_SDMA_ORDER
+TRACE_EVENT(hfi1_sdma_progress,
+	    TP_PROTO(struct sdma_engine *sde,
+		     u16 hwhead,
+		     u16 swhead,
+		     struct sdma_txreq *txp
+		     ),
+	    TP_ARGS(sde, hwhead, swhead, txp),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+	    __field(u64, sn)
+	    __field(u16, hwhead)
+	    __field(u16, swhead)
+	    __field(u16, txnext)
+	    __field(u16, tx_tail)
+	    __field(u16, tx_head)
+	    __field(u8, idx)
+	    ),
+	    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+	    __entry->hwhead = hwhead;
+	    __entry->swhead = swhead;
+	    __entry->tx_tail = sde->tx_tail;
+	    __entry->tx_head = sde->tx_head;
+	    __entry->txnext = txp ? txp->next_descq_idx : ~0;
+	    __entry->idx = sde->this_idx;
+	    __entry->sn = txp ? txp->sn : ~0;
+	    ),
+	    TP_printk(
+	    "[%s] SDE(%u) sn %llu hwhead %u swhead %u next_descq_idx %u tx_head %u tx_tail %u",
+	    __get_str(dev),
+	    __entry->idx,
+	    __entry->sn,
+	    __entry->hwhead,
+	    __entry->swhead,
+	    __entry->txnext,
+	    __entry->tx_head,
+	    __entry->tx_tail
+	    )
+);
+#else
+TRACE_EVENT(hfi1_sdma_progress,
+	    TP_PROTO(struct sdma_engine *sde,
+		     u16 hwhead, u16 swhead,
+		     struct sdma_txreq *txp
+		     ),
+	    TP_ARGS(sde, hwhead, swhead, txp),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		    __field(u16, hwhead)
+		    __field(u16, swhead)
+		    __field(u16, txnext)
+		    __field(u16, tx_tail)
+		    __field(u16, tx_head)
+		    __field(u8, idx)
+		    ),
+	    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		    __entry->hwhead = hwhead;
+		    __entry->swhead = swhead;
+		    __entry->tx_tail = sde->tx_tail;
+		    __entry->tx_head = sde->tx_head;
+		    __entry->txnext = txp ? txp->next_descq_idx : ~0;
+		    __entry->idx = sde->this_idx;
+		    ),
+	    TP_printk(
+		    "[%s] SDE(%u) hwhead %u swhead %u next_descq_idx %u tx_head %u tx_tail %u",
+		    __get_str(dev),
+		    __entry->idx,
+		    __entry->hwhead,
+		    __entry->swhead,
+		    __entry->txnext,
+		    __entry->tx_head,
+		    __entry->tx_tail
+	    )
+);
+#endif
+
+DECLARE_EVENT_CLASS(hfi1_sdma_sn,
+		    TP_PROTO(struct sdma_engine *sde, u64 sn),
+		    TP_ARGS(sde, sn),
+		    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		    __field(u64, sn)
+		    __field(u8, idx)
+		    ),
+		    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		    __entry->sn = sn;
+		    __entry->idx = sde->this_idx;
+		    ),
+		    TP_printk("[%s] SDE(%u) sn %llu",
+			      __get_str(dev),
+			      __entry->idx,
+			      __entry->sn
+			      )
+);
+
+DEFINE_EVENT(hfi1_sdma_sn, hfi1_sdma_out_sn,
+	     TP_PROTO(
+	     struct sdma_engine *sde,
+	     u64 sn
+	     ),
+	     TP_ARGS(sde, sn)
+);
+
+DEFINE_EVENT(hfi1_sdma_sn, hfi1_sdma_in_sn,
+	     TP_PROTO(struct sdma_engine *sde, u64 sn),
+	     TP_ARGS(sde, sn)
+);
+
+#define USDMA_HDR_FORMAT \
+	"[%s:%u:%u:%u] PBC=(0x%x 0x%x) LRH=(0x%x 0x%x) BTH=(0x%x 0x%x 0x%x) KDETH=(0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x 0x%x) TIDVal=0x%x"
+
+TRACE_EVENT(hfi1_sdma_user_header,
+	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 req,
+		     struct hfi1_pkt_header *hdr, u32 tidval),
+	    TP_ARGS(dd, ctxt, subctxt, req, hdr, tidval),
+	    TP_STRUCT__entry(
+		    DD_DEV_ENTRY(dd)
+		    __field(u16, ctxt)
+		    __field(u8, subctxt)
+		    __field(u16, req)
+		    __field(__le32, pbc0)
+		    __field(__le32, pbc1)
+		    __field(__be32, lrh0)
+		    __field(__be32, lrh1)
+		    __field(__be32, bth0)
+		    __field(__be32, bth1)
+		    __field(__be32, bth2)
+		    __field(__le32, kdeth0)
+		    __field(__le32, kdeth1)
+		    __field(__le32, kdeth2)
+		    __field(__le32, kdeth3)
+		    __field(__le32, kdeth4)
+		    __field(__le32, kdeth5)
+		    __field(__le32, kdeth6)
+		    __field(__le32, kdeth7)
+		    __field(__le32, kdeth8)
+		    __field(u32, tidval)
+		    ),
+		    TP_fast_assign(
+		    __le32 *pbc = (__le32 *)hdr->pbc;
+		    __be32 *lrh = (__be32 *)hdr->lrh;
+		    __be32 *bth = (__be32 *)hdr->bth;
+		    __le32 *kdeth = (__le32 *)&hdr->kdeth;
+
+		    DD_DEV_ASSIGN(dd);
+		    __entry->ctxt = ctxt;
+		    __entry->subctxt = subctxt;
+		    __entry->req = req;
+		    __entry->pbc0 = pbc[0];
+		    __entry->pbc1 = pbc[1];
+		    __entry->lrh0 = be32_to_cpu(lrh[0]);
+		    __entry->lrh1 = be32_to_cpu(lrh[1]);
+		    __entry->bth0 = be32_to_cpu(bth[0]);
+		    __entry->bth1 = be32_to_cpu(bth[1]);
+		    __entry->bth2 = be32_to_cpu(bth[2]);
+		    __entry->kdeth0 = kdeth[0];
+		    __entry->kdeth1 = kdeth[1];
+		    __entry->kdeth2 = kdeth[2];
+		    __entry->kdeth3 = kdeth[3];
+		    __entry->kdeth4 = kdeth[4];
+		    __entry->kdeth5 = kdeth[5];
+		    __entry->kdeth6 = kdeth[6];
+		    __entry->kdeth7 = kdeth[7];
+		    __entry->kdeth8 = kdeth[8];
+		    __entry->tidval = tidval;
+	    ),
+	    TP_printk(USDMA_HDR_FORMAT,
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->req,
+		      __entry->pbc1,
+		      __entry->pbc0,
+		      __entry->lrh0,
+		      __entry->lrh1,
+		      __entry->bth0,
+		      __entry->bth1,
+		      __entry->bth2,
+		      __entry->kdeth0,
+		      __entry->kdeth1,
+		      __entry->kdeth2,
+		      __entry->kdeth3,
+		      __entry->kdeth4,
+		      __entry->kdeth5,
+		      __entry->kdeth6,
+		      __entry->kdeth7,
+		      __entry->kdeth8,
+		      __entry->tidval
+	    )
+);
+
+#define SDMA_UREQ_FMT \
+	"[%s:%u:%u] ver/op=0x%x, iovcnt=%u, npkts=%u, frag=%u, idx=%u"
+TRACE_EVENT(hfi1_sdma_user_reqinfo,
+	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 *i),
+	    TP_ARGS(dd, ctxt, subctxt, i),
+	    TP_STRUCT__entry(
+		    DD_DEV_ENTRY(dd);
+		    __field(u16, ctxt)
+		    __field(u8, subctxt)
+		    __field(u8, ver_opcode)
+		    __field(u8, iovcnt)
+		    __field(u16, npkts)
+		    __field(u16, fragsize)
+		    __field(u16, comp_idx)
+	    ),
+	    TP_fast_assign(
+		    DD_DEV_ASSIGN(dd);
+		    __entry->ctxt = ctxt;
+		    __entry->subctxt = subctxt;
+		    __entry->ver_opcode = i[0] & 0xff;
+		    __entry->iovcnt = (i[0] >> 8) & 0xff;
+		    __entry->npkts = i[1];
+		    __entry->fragsize = i[2];
+		    __entry->comp_idx = i[3];
+	    ),
+	    TP_printk(SDMA_UREQ_FMT,
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->ver_opcode,
+		      __entry->iovcnt,
+		      __entry->npkts,
+		      __entry->fragsize,
+		      __entry->comp_idx
+		      )
+);
+
+#define usdma_complete_name(st) { st, #st }
+#define show_usdma_complete_state(st)			\
+	__print_symbolic(st,				\
+			usdma_complete_name(FREE),	\
+			usdma_complete_name(QUEUED),	\
+			usdma_complete_name(COMPLETE), \
+			usdma_complete_name(ERROR))
+
+TRACE_EVENT(hfi1_sdma_user_completion,
+	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 idx,
+		     u8 state, int code),
+	    TP_ARGS(dd, ctxt, subctxt, idx, state, code),
+	    TP_STRUCT__entry(
+	    DD_DEV_ENTRY(dd)
+	    __field(u16, ctxt)
+	    __field(u8, subctxt)
+	    __field(u16, idx)
+	    __field(u8, state)
+	    __field(int, code)
+	    ),
+	    TP_fast_assign(
+	    DD_DEV_ASSIGN(dd);
+	    __entry->ctxt = ctxt;
+	    __entry->subctxt = subctxt;
+	    __entry->idx = idx;
+	    __entry->state = state;
+	    __entry->code = code;
+	    ),
+	    TP_printk("[%s:%u:%u:%u] SDMA completion state %s (%d)",
+		      __get_str(dev), __entry->ctxt, __entry->subctxt,
+		      __entry->idx, show_usdma_complete_state(__entry->state),
+		      __entry->code)
+);
+
+const char *print_u32_array(struct trace_seq *, u32 *, int);
+#define __print_u32_hex(arr, len) print_u32_array(p, arr, len)
+
+TRACE_EVENT(hfi1_sdma_user_header_ahg,
+	    TP_PROTO(struct hfi1_devdata *dd, u16 ctxt, u8 subctxt, u16 req,
+		     u8 sde, u8 ahgidx, u32 *ahg, int len, u32 tidval),
+	    TP_ARGS(dd, ctxt, subctxt, req, sde, ahgidx, ahg, len, tidval),
+	    TP_STRUCT__entry(
+	    DD_DEV_ENTRY(dd)
+	    __field(u16, ctxt)
+	    __field(u8, subctxt)
+	    __field(u16, req)
+	    __field(u8, sde)
+	    __field(u8, idx)
+	    __field(int, len)
+	    __field(u32, tidval)
+	    __array(u32, ahg, 10)
+	    ),
+	    TP_fast_assign(
+	    DD_DEV_ASSIGN(dd);
+	    __entry->ctxt = ctxt;
+	    __entry->subctxt = subctxt;
+	    __entry->req = req;
+	    __entry->sde = sde;
+	    __entry->idx = ahgidx;
+	    __entry->len = len;
+	    __entry->tidval = tidval;
+	    memcpy(__entry->ahg, ahg, len * sizeof(u32));
+	    ),
+	    TP_printk("[%s:%u:%u:%u] (SDE%u/AHG%u) ahg[0-%d]=(%s) TIDVal=0x%x",
+		      __get_str(dev),
+		      __entry->ctxt,
+		      __entry->subctxt,
+		      __entry->req,
+		      __entry->sde,
+		      __entry->idx,
+		      __entry->len - 1,
+		      __print_u32_hex(__entry->ahg, __entry->len),
+		      __entry->tidval
+		      )
+);
+
+TRACE_EVENT(hfi1_sdma_state,
+	    TP_PROTO(struct sdma_engine *sde,
+		     const char *cstate,
+		     const char *nstate
+		     ),
+	    TP_ARGS(sde, cstate, nstate),
+	    TP_STRUCT__entry(DD_DEV_ENTRY(sde->dd)
+		__string(curstate, cstate)
+		__string(newstate, nstate)
+	    ),
+	    TP_fast_assign(DD_DEV_ASSIGN(sde->dd);
+		__assign_str(curstate, cstate);
+		__assign_str(newstate, nstate);
+	    ),
+	    TP_printk("[%s] current state %s new state %s",
+		      __get_str(dev),
+		      __get_str(curstate),
+		      __get_str(newstate)
+	    )
+);
+
+#define BCT_FORMAT \
+	"shared_limit %x vls 0-7 [%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x][%x,%x] 15 [%x,%x]"
+
+#define BCT(field) \
+	be16_to_cpu( \
+	((struct buffer_control *)__get_dynamic_array(bct))->field \
+	)
+
+DECLARE_EVENT_CLASS(hfi1_bct_template,
+		    TP_PROTO(struct hfi1_devdata *dd,
+			     struct buffer_control *bc),
+		    TP_ARGS(dd, bc),
+		    TP_STRUCT__entry(DD_DEV_ENTRY(dd)
+		    __dynamic_array(u8, bct, sizeof(*bc))
+		    ),
+		    TP_fast_assign(DD_DEV_ASSIGN(dd);
+				   memcpy(__get_dynamic_array(bct), bc,
+					  sizeof(*bc));
+		    ),
+		    TP_printk(BCT_FORMAT,
+			      BCT(overall_shared_limit),
+
+			      BCT(vl[0].dedicated),
+			      BCT(vl[0].shared),
+
+			      BCT(vl[1].dedicated),
+			      BCT(vl[1].shared),
+
+			      BCT(vl[2].dedicated),
+			      BCT(vl[2].shared),
+
+			      BCT(vl[3].dedicated),
+			      BCT(vl[3].shared),
+
+			      BCT(vl[4].dedicated),
+			      BCT(vl[4].shared),
+
+			      BCT(vl[5].dedicated),
+			      BCT(vl[5].shared),
+
+			      BCT(vl[6].dedicated),
+			      BCT(vl[6].shared),
+
+			      BCT(vl[7].dedicated),
+			      BCT(vl[7].shared),
+
+			      BCT(vl[15].dedicated),
+			      BCT(vl[15].shared)
+		    )
+);
+
+DEFINE_EVENT(hfi1_bct_template, bct_set,
+	     TP_PROTO(struct hfi1_devdata *dd, struct buffer_control *bc),
+	     TP_ARGS(dd, bc));
+
+DEFINE_EVENT(hfi1_bct_template, bct_get,
+	     TP_PROTO(struct hfi1_devdata *dd, struct buffer_control *bc),
+	     TP_ARGS(dd, bc));
+
+#endif /* __HFI1_TRACE_TX_H */
+
+#undef TRACE_INCLUDE_PATH
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace_tx
+#include <trace/define_trace.h>


* [PATCH for-next 05/18] IB/hfi1: Fix trace sparse errors
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mike Marciniszyn

From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Fix sparse errors by making sure the fast-assign destinations
are host-CPU typed: convert the little- and big-endian header
words with le32_to_cpu()/be32_to_cpu() at assignment time.

For the void __iomem * pointer, just make the field type match
the source data.

Also fix a bug where the hw_free trace printed the pointer
instead of the dereferenced value.
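
As an illustration of the pattern, a minimal userspace analogue
(le32toh() stands in for the kernel's le32_to_cpu(); the struct,
names, and values are hypothetical):

#include <endian.h>
#include <stdint.h>
#include <stdio.h>

/* Keep the stored field host-typed and convert once at capture time,
 * so an endianness checker never sees a __le32 assigned to a u32. */
struct trace_entry {
	uint32_t kdeth0;	/* host order; was a little-endian field */
};

static void capture(struct trace_entry *e, const uint32_t *le_hdr)
{
	e->kdeth0 = le32toh(le_hdr[0]);
}

int main(void)
{
	uint32_t raw = htole32(0x1234abcd);
	struct trace_entry e;

	capture(&e, &raw);
	printf("kdeth0=0x%x\n", e.kdeth0);
	return 0;
}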

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/trace_ctxts.h |    8 ++--
 drivers/infiniband/hw/hfi1/trace_tx.h    |   54 +++++++++++++++---------------
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h b/drivers/infiniband/hw/hfi1/trace_ctxts.h
index 5052d49..31654bb 100644
--- a/drivers/infiniband/hw/hfi1/trace_ctxts.h
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -56,7 +56,7 @@
 #define TRACE_SYSTEM hfi1_ctxts
 
 #define UCTXT_FMT \
-	"cred:%u, credaddr:0x%llx, piobase:0x%llx, rcvhdr_cnt:%u, "	\
+	"cred:%u, credaddr:0x%llx, piobase:0x%p, rcvhdr_cnt:%u, "	\
 	"rcvbase:0x%llx, rcvegrc:%u, rcvegrb:0x%llx"
 TRACE_EVENT(hfi1_uctxtdata,
 	    TP_PROTO(struct hfi1_devdata *dd, struct hfi1_ctxtdata *uctxt),
@@ -65,7 +65,7 @@ TRACE_EVENT(hfi1_uctxtdata,
 			     __field(unsigned int, ctxt)
 			     __field(u32, credits)
 			     __field(u64, hw_free)
-			     __field(u64, piobase)
+			     __field(void __iomem *, piobase)
 			     __field(u16, rcvhdrq_cnt)
 			     __field(u64, rcvhdrq_phys)
 			     __field(u32, eager_cnt)
@@ -74,8 +74,8 @@ TRACE_EVENT(hfi1_uctxtdata,
 	    TP_fast_assign(DD_DEV_ASSIGN(dd);
 			   __entry->ctxt = uctxt->ctxt;
 			   __entry->credits = uctxt->sc->credits;
-			   __entry->hw_free = (u64)uctxt->sc->hw_free;
-			   __entry->piobase = (u64)uctxt->sc->base_addr;
+			   __entry->hw_free = le64_to_cpu(*uctxt->sc->hw_free);
+			   __entry->piobase = uctxt->sc->base_addr;
 			   __entry->rcvhdrq_cnt = uctxt->rcvhdrq_cnt;
 			   __entry->rcvhdrq_phys = uctxt->rcvhdrq_phys;
 			   __entry->eager_cnt = uctxt->egrbufs.alloced;
diff --git a/drivers/infiniband/hw/hfi1/trace_tx.h b/drivers/infiniband/hw/hfi1/trace_tx.h
index 79c93ec..415d6be 100644
--- a/drivers/infiniband/hw/hfi1/trace_tx.h
+++ b/drivers/infiniband/hw/hfi1/trace_tx.h
@@ -369,22 +369,22 @@ TRACE_EVENT(hfi1_sdma_user_header,
 		    __field(u16, ctxt)
 		    __field(u8, subctxt)
 		    __field(u16, req)
-		    __field(__le32, pbc0)
-		    __field(__le32, pbc1)
-		    __field(__be32, lrh0)
-		    __field(__be32, lrh1)
-		    __field(__be32, bth0)
-		    __field(__be32, bth1)
-		    __field(__be32, bth2)
-		    __field(__le32, kdeth0)
-		    __field(__le32, kdeth1)
-		    __field(__le32, kdeth2)
-		    __field(__le32, kdeth3)
-		    __field(__le32, kdeth4)
-		    __field(__le32, kdeth5)
-		    __field(__le32, kdeth6)
-		    __field(__le32, kdeth7)
-		    __field(__le32, kdeth8)
+		    __field(u32, pbc0)
+		    __field(u32, pbc1)
+		    __field(u32, lrh0)
+		    __field(u32, lrh1)
+		    __field(u32, bth0)
+		    __field(u32, bth1)
+		    __field(u32, bth2)
+		    __field(u32, kdeth0)
+		    __field(u32, kdeth1)
+		    __field(u32, kdeth2)
+		    __field(u32, kdeth3)
+		    __field(u32, kdeth4)
+		    __field(u32, kdeth5)
+		    __field(u32, kdeth6)
+		    __field(u32, kdeth7)
+		    __field(u32, kdeth8)
 		    __field(u32, tidval)
 		    ),
 		    TP_fast_assign(
@@ -397,22 +397,22 @@ TRACE_EVENT(hfi1_sdma_user_header,
 		    __entry->ctxt = ctxt;
 		    __entry->subctxt = subctxt;
 		    __entry->req = req;
-		    __entry->pbc0 = pbc[0];
-		    __entry->pbc1 = pbc[1];
+		    __entry->pbc0 = le32_to_cpu(pbc[0]);
+		    __entry->pbc1 = le32_to_cpu(pbc[1]);
 		    __entry->lrh0 = be32_to_cpu(lrh[0]);
 		    __entry->lrh1 = be32_to_cpu(lrh[1]);
 		    __entry->bth0 = be32_to_cpu(bth[0]);
 		    __entry->bth1 = be32_to_cpu(bth[1]);
 		    __entry->bth2 = be32_to_cpu(bth[2]);
-		    __entry->kdeth0 = kdeth[0];
-		    __entry->kdeth1 = kdeth[1];
-		    __entry->kdeth2 = kdeth[2];
-		    __entry->kdeth3 = kdeth[3];
-		    __entry->kdeth4 = kdeth[4];
-		    __entry->kdeth5 = kdeth[5];
-		    __entry->kdeth6 = kdeth[6];
-		    __entry->kdeth7 = kdeth[7];
-		    __entry->kdeth8 = kdeth[8];
+		    __entry->kdeth0 = le32_to_cpu(kdeth[0]);
+		    __entry->kdeth1 = le32_to_cpu(kdeth[1]);
+		    __entry->kdeth2 = le32_to_cpu(kdeth[2]);
+		    __entry->kdeth3 = le32_to_cpu(kdeth[3]);
+		    __entry->kdeth4 = le32_to_cpu(kdeth[4]);
+		    __entry->kdeth5 = le32_to_cpu(kdeth[5]);
+		    __entry->kdeth6 = le32_to_cpu(kdeth[6]);
+		    __entry->kdeth7 = le32_to_cpu(kdeth[7]);
+		    __entry->kdeth8 = le32_to_cpu(kdeth[8]);
 		    __entry->tidval = tidval;
 	    ),
 	    TP_printk(USDMA_HDR_FORMAT,


* [PATCH for-next 06/18] IB/hfi1: Add VL XmitDiscards counters to the opapmaquery
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dean Luick, Jakub Pawlak

From: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add per-VL XmitDiscards counters to the opapmaquery status and
error responses, filling in the previously stubbed-out
port_vl_xmit_discards fields.

Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/mad.c |   11 +++++++++--
 1 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
index 223dd46..349a138 100644
--- a/drivers/infiniband/hw/hfi1/mad.c
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -2487,6 +2487,9 @@ static int pma_get_opa_portstatus(struct opa_pma_mad *pmp,
 			cpu_to_be64(read_dev_cntr(dd, C_DC_RCV_BCN_VL,
 						  idx_from_vl(vl)));
 
+		rsp->vls[vfi].port_vl_xmit_discards =
+			cpu_to_be64(read_port_cntr(ppd, C_SW_XMIT_DSCD_VL,
+						   idx_from_vl(vl)));
 		vlinfo++;
 		vfi++;
 	}
@@ -2878,7 +2881,9 @@ static int pma_get_opa_porterrors(struct opa_pma_mad *pmp,
 	for_each_set_bit(vl, (unsigned long *)&(vl_select_mask),
 			 8 * sizeof(req->vl_select_mask)) {
 		memset(vlinfo, 0, sizeof(*vlinfo));
-		/* vlinfo->vls[vfi].port_vl_xmit_discards ??? */
+		rsp->vls[vfi].port_vl_xmit_discards =
+			cpu_to_be64(read_port_cntr(ppd, C_SW_XMIT_DSCD_VL,
+						   idx_from_vl(vl)));
 		vlinfo += 1;
 		vfi++;
 	}
@@ -3211,7 +3216,9 @@ static int pma_set_opa_portstatus(struct opa_pma_mad *pmp,
 		/* if (counter_select & CS_PORT_MARK_FECN)
 		 *     write_csr(dd, DCC_PRF_PORT_VL_MARK_FECN_CNT + offset, 0);
 		 */
-		/* port_vl_xmit_discards ??? */
+		if (counter_select & C_SW_XMIT_DSCD_VL)
+			write_port_cntr(ppd, C_SW_XMIT_DSCD_VL,
+					idx_from_vl(vl), 0);
 	}
 
 	if (resp_len)


* [PATCH for-next 07/18] IB/hfi1: Add counter to track unsupported packets drop
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jakub Pawlak

From: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add a software counter to track dropped unsupported packets.
Report unsupported packet drops as part of the RcvError count.
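
For reference, a standalone sketch of the saturating fold performed
on read (simplified from access_dc_rcv_err_cnt() in this patch; the
test values are hypothetical):

#include <stdint.h>
#include <stdio.h>

#define CNTR_MAX 0xFFFFFFFFFFFFFFFFULL

/* Fold the software drop counter into the hardware CSR value,
 * saturating at CNTR_MAX instead of wrapping on overflow. */
static uint64_t fold_rcv_err(uint64_t csr_val, uint64_t sw_drops)
{
	if (csr_val > CNTR_MAX - sw_drops)
		return CNTR_MAX;
	return csr_val + sw_drops;
}

int main(void)
{
	printf("%llu\n", (unsigned long long)fold_rcv_err(100, 5));
	printf("%llu\n", (unsigned long long)fold_rcv_err(CNTR_MAX - 1, 5));
	return 0;
}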

Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/chip.c   |   31 +++++++++++++++++++++++++++----
 drivers/infiniband/hw/hfi1/driver.c |    1 +
 drivers/infiniband/hw/hfi1/hfi.h    |    3 ++-
 drivers/infiniband/hw/hfi1/mad.c    |    3 ++-
 4 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 0662451..9eb4551 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -238,6 +238,9 @@ struct flag_table {
 /* all CceStatus sub-block RXE pause bits */
 #define ALL_RXE_PAUSE CCE_STATUS_RXE_PAUSED_SMASK
 
+#define CNTR_MAX 0xFFFFFFFFFFFFFFFFULL
+#define CNTR_32BIT_MAX 0x00000000FFFFFFFF
+
 /*
  * CCE Error flags.
  */
@@ -3947,6 +3950,28 @@ static u64 access_sdma_wrong_dw_err_cnt(const struct cntr_entry *entry,
 	return dd->sw_send_dma_eng_err_status_cnt[0];
 }
 
+static u64 access_dc_rcv_err_cnt(const struct cntr_entry *entry,
+				 void *context, int vl, int mode,
+				 u64 data)
+{
+	struct hfi1_devdata *dd = (struct hfi1_devdata *)context;
+
+	u64 val = 0;
+	u64 csr = entry->csr;
+
+	val = read_write_csr(dd, csr, mode, data);
+	if (mode == CNTR_MODE_R) {
+		val = val > CNTR_MAX - dd->sw_rcv_bypass_packet_errors ?
+			CNTR_MAX : val + dd->sw_rcv_bypass_packet_errors;
+	} else if (mode == CNTR_MODE_W) {
+		dd->sw_rcv_bypass_packet_errors = 0;
+	} else {
+		dd_dev_err(dd, "Invalid cntr register access mode");
+		return 0;
+	}
+	return val;
+}
+
 #define def_access_sw_cpu(cntr) \
 static u64 access_sw_cpu_##cntr(const struct cntr_entry *entry,		      \
 			      void *context, int vl, int mode, u64 data)      \
@@ -4020,7 +4045,8 @@ static struct cntr_entry dev_cntrs[DEV_CNTR_LAST] = {
 			CCE_SEND_CREDIT_INT_CNT, CNTR_NORMAL),
 [C_DC_UNC_ERR] = DC_PERF_CNTR(DcUnctblErr, DCC_ERR_UNCORRECTABLE_CNT,
 			      CNTR_SYNTH),
-[C_DC_RCV_ERR] = DC_PERF_CNTR(DcRecvErr, DCC_ERR_PORTRCV_ERR_CNT, CNTR_SYNTH),
+[C_DC_RCV_ERR] = CNTR_ELEM("DcRecvErr", DCC_ERR_PORTRCV_ERR_CNT, 0, CNTR_SYNTH,
+			    access_dc_rcv_err_cnt),
 [C_DC_FM_CFG_ERR] = DC_PERF_CNTR(DcFmCfgErr, DCC_ERR_FMCONFIG_ERR_CNT,
 				 CNTR_SYNTH),
 [C_DC_RMT_PHY_ERR] = DC_PERF_CNTR(DcRmtPhyErr, DCC_ERR_RCVREMOTE_PHY_ERR_CNT,
@@ -11668,9 +11694,6 @@ static void free_cntrs(struct hfi1_devdata *dd)
 	dd->cntrnames = NULL;
 }
 
-#define CNTR_MAX 0xFFFFFFFFFFFFFFFFULL
-#define CNTR_32BIT_MAX 0x00000000FFFFFFFF
-
 static u64 read_dev_port_cntr(struct hfi1_devdata *dd, struct cntr_entry *entry,
 			      u64 *psval, void *context, int vl)
 {
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index c75b0ae..6c81d15 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1362,6 +1362,7 @@ int process_receive_bypass(struct hfi1_packet *packet)
 
 	dd_dev_err(packet->rcd->dd,
 		   "Bypass packets are not supported in normal operation. Dropping\n");
+	incr_cntr64(&packet->rcd->dd->sw_rcv_bypass_packet_errors);
 	return RHF_RCV_CONTINUE;
 }
 
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 1dd48ef..748e235 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1128,7 +1128,8 @@ struct hfi1_devdata {
 		NUM_SEND_DMA_ENG_ERR_STATUS_COUNTERS];
 	/* Software counter that aggregates all cce_err_status errors */
 	u64 sw_cce_err_status_aggregate;
-
+	/* Software counter that aggregates all bypass packet rcv errors */
+	u64 sw_rcv_bypass_packet_errors;
 	/* receive interrupt functions */
 	rhf_rcv_function_ptr *rhf_rcv_function_map;
 	rhf_rcv_function_ptr normal_rhf_rcv_functions[8];
diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
index 349a138..962bb11 100644
--- a/drivers/infiniband/hw/hfi1/mad.c
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -2874,7 +2874,8 @@ static int pma_get_opa_porterrors(struct opa_pma_mad *pmp,
 	tmp = read_dev_cntr(dd, C_DC_UNC_ERR, CNTR_INVALID_VL);
 
 	rsp->uncorrectable_errors = tmp < 0x100 ? (tmp & 0xff) : 0xff;
-
+	rsp->port_rcv_errors =
+		cpu_to_be64(read_dev_cntr(dd, C_DC_RCV_ERR, CNTR_INVALID_VL));
 	vlinfo = &rsp->vls[0];
 	vfi = 0;
 	vl_select_mask = be32_to_cpu(req->vl_select_mask);


* [PATCH for-next 08/18] IB/hfi1: Add global structure for affinity assignments
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jubin John, Sebastian Sanchez,
	Mike Marciniszyn, Jianxin Xiong

From: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

When HFI units are initialized, each uses its own copy of the mask for
affinity assignments. On a multi-HFI system this overbooks CPU cores,
since no HFI knows about the affinity assignments made by the other
HFI units. As a result, some CPU cores are never used for interrupt
handlers on systems with a high number of CPU cores per NUMA node.

On multi-HFI systems, SDMA engine interrupt assignments for each HFI
after the first start all over from the first CPU in the local NUMA
node. This change allows assignments to continue where the last HFI
unit left off.

Add a global structure so that affinity assignments for multiple HFIs
share one affinity mask.
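
A standalone sketch of the effect (a hypothetical bitmask reduction of
the cpu_mask_set bookkeeping; with one shared "used" set, the second
HFI continues where the first stopped instead of reusing its CPUs):

#include <stdint.h>
#include <stdio.h>

static uint64_t used_cpus;	/* shared across devices, like node_affinity */

/* Pick the first node-local CPU not yet used by any device. */
static int pick_cpu(uint64_t node_cpus)
{
	int cpu;

	for (cpu = 0; cpu < 64; cpu++) {
		uint64_t bit = 1ULL << cpu;

		if ((node_cpus & bit) && !(used_cpus & bit)) {
			used_cpus |= bit;
			return cpu;
		}
	}
	return -1;
}

int main(void)
{
	uint64_t node_cpus = 0xff;	/* CPUs 0-7 on this NUMA node */
	int i;

	for (i = 0; i < 4; i++)
		printf("HFI0 irq %d -> CPU %d\n", i, pick_cpu(node_cpus));
	for (i = 0; i < 4; i++)
		printf("HFI1 irq %d -> CPU %d\n", i, pick_cpu(node_cpus));
	return 0;
}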

Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/affinity.c |  250 +++++++++++++++++++++++----------
 drivers/infiniband/hw/hfi1/affinity.h |   25 +++
 drivers/infiniband/hw/hfi1/chip.c     |   20 +--
 drivers/infiniband/hw/hfi1/init.c     |    5 +
 4 files changed, 201 insertions(+), 99 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 14d7eeb..4d82920 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -53,6 +53,11 @@
 #include "sdma.h"
 #include "trace.h"
 
+struct hfi1_affinity_node_list node_affinity = {
+	.list = LIST_HEAD_INIT(node_affinity.list),
+	.lock = __SPIN_LOCK_UNLOCKED(&node_affinity.lock),
+};
+
 /* Name of IRQ types, indexed by enum irq_type */
 static const char * const irq_type_names[] = {
 	"SDMA",
@@ -69,45 +74,100 @@ static inline void init_cpu_mask_set(struct cpu_mask_set *set)
 }
 
 /* Initialize non-HT cpu cores mask */
-int init_real_cpu_mask(struct hfi1_devdata *dd)
+void init_real_cpu_mask(void)
 {
-	struct hfi1_affinity *info;
 	int possible, curr_cpu, i, ht;
 
-	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info)
-		return -ENOMEM;
-
-	cpumask_clear(&info->real_cpu_mask);
+	cpumask_clear(&node_affinity.real_cpu_mask);
 
 	/* Start with cpu online mask as the real cpu mask */
-	cpumask_copy(&info->real_cpu_mask, cpu_online_mask);
+	cpumask_copy(&node_affinity.real_cpu_mask, cpu_online_mask);
 
 	/*
 	 * Remove HT cores from the real cpu mask.  Do this in two steps below.
 	 */
-	possible = cpumask_weight(&info->real_cpu_mask);
+	possible = cpumask_weight(&node_affinity.real_cpu_mask);
 	ht = cpumask_weight(topology_sibling_cpumask(
-					cpumask_first(&info->real_cpu_mask)));
+				cpumask_first(&node_affinity.real_cpu_mask)));
 	/*
 	 * Step 1.  Skip over the first N HT siblings and use them as the
 	 * "real" cores.  Assumes that HT cores are not enumerated in
 	 * succession (except in the single core case).
 	 */
-	curr_cpu = cpumask_first(&info->real_cpu_mask);
+	curr_cpu = cpumask_first(&node_affinity.real_cpu_mask);
 	for (i = 0; i < possible / ht; i++)
-		curr_cpu = cpumask_next(curr_cpu, &info->real_cpu_mask);
+		curr_cpu = cpumask_next(curr_cpu, &node_affinity.real_cpu_mask);
 	/*
 	 * Step 2.  Remove the remaining HT siblings.  Use cpumask_next() to
 	 * skip any gaps.
 	 */
 	for (; i < possible; i++) {
-		cpumask_clear_cpu(curr_cpu, &info->real_cpu_mask);
-		curr_cpu = cpumask_next(curr_cpu, &info->real_cpu_mask);
+		cpumask_clear_cpu(curr_cpu, &node_affinity.real_cpu_mask);
+		curr_cpu = cpumask_next(curr_cpu, &node_affinity.real_cpu_mask);
 	}
+}
 
-	dd->affinity = info;
-	return 0;
+void node_affinity_init(void)
+{
+	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
+	/*
+	 * The real cpu mask is part of the affinity struct but it has to be
+	 * initialized early. It is needed to calculate the number of user
+	 * contexts in set_up_context_variables().
+	 */
+	init_real_cpu_mask();
+}
+
+void node_affinity_destroy(void)
+{
+	struct list_head *pos, *q;
+	struct hfi1_affinity_node *entry;
+
+	spin_lock(&node_affinity.lock);
+	list_for_each_safe(pos, q, &node_affinity.list) {
+		entry = list_entry(pos, struct hfi1_affinity_node,
+				   list);
+		list_del(pos);
+		kfree(entry);
+	}
+	spin_unlock(&node_affinity.lock);
+}
+
+static struct hfi1_affinity_node *node_affinity_allocate(int node)
+{
+	struct hfi1_affinity_node *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return NULL;
+	entry->node = node;
+	INIT_LIST_HEAD(&entry->list);
+
+	return entry;
+}
+
+/*
+ * It appends an entry to the list.
+ * It *must* be called with node_affinity.lock held.
+ */
+static void node_affinity_add_tail(struct hfi1_affinity_node *entry)
+{
+	list_add_tail(&entry->list, &node_affinity.list);
+}
+
+/* It must be called with node_affinity.lock held */
+static struct hfi1_affinity_node *node_affinity_lookup(int node)
+{
+	struct list_head *pos;
+	struct hfi1_affinity_node *entry;
+
+	list_for_each(pos, &node_affinity.list) {
+		entry = list_entry(pos, struct hfi1_affinity_node, list);
+		if (entry->node == node)
+			return entry;
+	}
+
+	return NULL;
 }
 
 /*
@@ -121,10 +181,10 @@ int init_real_cpu_mask(struct hfi1_devdata *dd)
  * to the node relative 1 as necessary.
  *
  */
-void hfi1_dev_affinity_init(struct hfi1_devdata *dd)
+int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 {
 	int node = pcibus_to_node(dd->pcidev->bus);
-	struct hfi1_affinity *info = dd->affinity;
+	struct hfi1_affinity_node *entry;
 	const struct cpumask *local_mask;
 	int curr_cpu, possible, i;
 
@@ -132,55 +192,75 @@ void hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		node = numa_node_id();
 	dd->node = node;
 
-	spin_lock_init(&info->lock);
-
-	init_cpu_mask_set(&info->def_intr);
-	init_cpu_mask_set(&info->rcv_intr);
-	init_cpu_mask_set(&info->proc);
-
 	local_mask = cpumask_of_node(dd->node);
 	if (cpumask_first(local_mask) >= nr_cpu_ids)
 		local_mask = topology_core_cpumask(0);
-	/* Use the "real" cpu mask of this node as the default */
-	cpumask_and(&info->def_intr.mask, &info->real_cpu_mask, local_mask);
-
-	/*  fill in the receive list */
-	possible = cpumask_weight(&info->def_intr.mask);
-	curr_cpu = cpumask_first(&info->def_intr.mask);
-	if (possible == 1) {
-		/*  only one CPU, everyone will use it */
-		cpumask_set_cpu(curr_cpu, &info->rcv_intr.mask);
-	} else {
-		/*
-		 * Retain the first CPU in the default list for the control
-		 * context.
-		 */
-		curr_cpu = cpumask_next(curr_cpu, &info->def_intr.mask);
-		/*
-		 * Remove the remaining kernel receive queues from
-		 * the default list and add them to the receive list.
-		 */
-		for (i = 0; i < dd->n_krcv_queues - 1; i++) {
-			cpumask_clear_cpu(curr_cpu, &info->def_intr.mask);
-			cpumask_set_cpu(curr_cpu, &info->rcv_intr.mask);
-			curr_cpu = cpumask_next(curr_cpu, &info->def_intr.mask);
-			if (curr_cpu >= nr_cpu_ids)
-				break;
+
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
+
+	/*
+	 * If this is the first time this NUMA node's affinity is used,
+	 * create an entry in the global affinity structure and initialize it.
+	 */
+	if (!entry) {
+		entry = node_affinity_allocate(node);
+		if (!entry) {
+			dd_dev_err(dd,
+				   "Unable to allocate global affinity node\n");
+			return -ENOMEM;
 		}
-	}
+		init_cpu_mask_set(&entry->def_intr);
+		init_cpu_mask_set(&entry->rcv_intr);
+		/* Use the "real" cpu mask of this node as the default */
+		cpumask_and(&entry->def_intr.mask, &node_affinity.real_cpu_mask,
+			    local_mask);
+
+		/* fill in the receive list */
+		possible = cpumask_weight(&entry->def_intr.mask);
+		curr_cpu = cpumask_first(&entry->def_intr.mask);
+
+		if (possible == 1) {
+			/* only one CPU, everyone will use it */
+			cpumask_set_cpu(curr_cpu, &entry->rcv_intr.mask);
+		} else {
+			/*
+			 * Retain the first CPU in the default list for the
+			 * control context.
+			 */
+			curr_cpu = cpumask_next(curr_cpu,
+						&entry->def_intr.mask);
 
-	cpumask_copy(&info->proc.mask, cpu_online_mask);
-}
+			/*
+			 * Remove the remaining kernel receive queues from
+			 * the default list and add them to the receive list.
+			 */
+			for (i = 0; i < dd->n_krcv_queues - 1; i++) {
+				cpumask_clear_cpu(curr_cpu,
+						  &entry->def_intr.mask);
+				cpumask_set_cpu(curr_cpu,
+						&entry->rcv_intr.mask);
+				curr_cpu = cpumask_next(curr_cpu,
+							&entry->def_intr.mask);
+				if (curr_cpu >= nr_cpu_ids)
+					break;
+			}
+		}
 
-void hfi1_dev_affinity_free(struct hfi1_devdata *dd)
-{
-	kfree(dd->affinity);
+		spin_lock(&node_affinity.lock);
+		node_affinity_add_tail(entry);
+		spin_unlock(&node_affinity.lock);
+	}
+
+	return 0;
 }
 
 int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 {
 	int ret;
 	cpumask_var_t diff;
+	struct hfi1_affinity_node *entry;
 	struct cpu_mask_set *set;
 	struct sdma_engine *sde = NULL;
 	struct hfi1_ctxtdata *rcd = NULL;
@@ -194,21 +274,25 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	if (!ret)
 		return -ENOMEM;
 
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
+
 	switch (msix->type) {
 	case IRQ_SDMA:
 		sde = (struct sdma_engine *)msix->arg;
 		scnprintf(extra, 64, "engine %u", sde->this_idx);
 		/* fall through */
 	case IRQ_GENERAL:
-		set = &dd->affinity->def_intr;
+		set = &entry->def_intr;
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
 		if (rcd->ctxt == HFI1_CTRL_CTXT) {
-			set = &dd->affinity->def_intr;
+			set = &entry->def_intr;
 			cpu = cpumask_first(&set->mask);
 		} else {
-			set = &dd->affinity->rcv_intr;
+			set = &entry->rcv_intr;
 		}
 		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
 		break;
@@ -222,8 +306,8 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	 * is set above.  Skip accounting for it.  Everything else finds its
 	 * CPU here.
 	 */
-	if (cpu == -1) {
-		spin_lock(&dd->affinity->lock);
+	if (cpu == -1 && set) {
+		spin_lock(&node_affinity.lock);
 		if (cpumask_equal(&set->mask, &set->used)) {
 			/*
 			 * We've used up all the CPUs, bump up the generation
@@ -235,7 +319,7 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 		cpumask_andnot(diff, &set->mask, &set->used);
 		cpu = cpumask_first(diff);
 		cpumask_set_cpu(cpu, &set->used);
-		spin_unlock(&dd->affinity->lock);
+		spin_unlock(&node_affinity.lock);
 	}
 
 	switch (msix->type) {
@@ -263,30 +347,35 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 {
 	struct cpu_mask_set *set = NULL;
 	struct hfi1_ctxtdata *rcd;
+	struct hfi1_affinity_node *entry;
+
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
 
 	switch (msix->type) {
 	case IRQ_SDMA:
 	case IRQ_GENERAL:
-		set = &dd->affinity->def_intr;
+		set = &entry->def_intr;
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
 		/* only do accounting for non control contexts */
 		if (rcd->ctxt != HFI1_CTRL_CTXT)
-			set = &dd->affinity->rcv_intr;
+			set = &entry->rcv_intr;
 		break;
 	default:
 		return;
 	}
 
 	if (set) {
-		spin_lock(&dd->affinity->lock);
+		spin_lock(&node_affinity.lock);
 		cpumask_andnot(&set->used, &set->used, &msix->mask);
 		if (cpumask_empty(&set->used) && set->gen) {
 			set->gen--;
 			cpumask_copy(&set->used, &set->mask);
 		}
-		spin_unlock(&dd->affinity->lock);
+		spin_unlock(&node_affinity.lock);
 	}
 
 	irq_set_affinity_hint(msix->msix.vector, NULL);
@@ -297,9 +386,11 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 {
 	int cpu = -1, ret;
 	cpumask_var_t diff, mask, intrs;
+	struct hfi1_affinity_node *entry;
 	const struct cpumask *node_mask,
 		*proc_mask = tsk_cpus_allowed(current);
-	struct cpu_mask_set *set = &dd->affinity->proc;
+	struct cpu_mask_set *set = &node_affinity.proc;
+	char buf[1024];
 
 	/*
 	 * check whether process/context affinity has already
@@ -338,7 +429,7 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 	if (!ret)
 		goto free_mask;
 
-	spin_lock(&dd->affinity->lock);
+	spin_lock(&node_affinity.lock);
 	/*
 	 * If we've used all available CPUs, clear the mask and start
 	 * overloading.
@@ -348,15 +439,16 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpumask_clear(&set->used);
 	}
 
+	entry = node_affinity_lookup(dd->node);
 	/* CPUs used by interrupt handlers */
-	cpumask_copy(intrs, (dd->affinity->def_intr.gen ?
-			     &dd->affinity->def_intr.mask :
-			     &dd->affinity->def_intr.used));
-	cpumask_or(intrs, intrs, (dd->affinity->rcv_intr.gen ?
-				  &dd->affinity->rcv_intr.mask :
-				  &dd->affinity->rcv_intr.used));
-	hfi1_cdbg(PROC, "CPUs used by interrupts: %*pbl",
-		  cpumask_pr_args(intrs));
+	cpumask_copy(intrs, (entry->def_intr.gen ?
+			     &entry->def_intr.mask :
+			     &entry->def_intr.used));
+	cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
+				  &entry->rcv_intr.mask :
+				  &entry->rcv_intr.used));
+	scnprintf(buf, sizeof(buf), "%*pbl", cpumask_pr_args(intrs));
+	hfi1_cdbg(PROC, "CPUs used by interrupts: %s", buf);
 
 	/*
 	 * If we don't have a NUMA node requested, preference is towards
@@ -400,7 +492,7 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpu = -1;
 	else
 		cpumask_set_cpu(cpu, &set->used);
-	spin_unlock(&dd->affinity->lock);
+	spin_unlock(&node_affinity.lock);
 
 	free_cpumask_var(intrs);
 free_mask:
@@ -413,16 +505,16 @@ done:
 
 void hfi1_put_proc_affinity(struct hfi1_devdata *dd, int cpu)
 {
-	struct cpu_mask_set *set = &dd->affinity->proc;
+	struct cpu_mask_set *set = &node_affinity.proc;
 
 	if (cpu < 0)
 		return;
-	spin_lock(&dd->affinity->lock);
+	spin_lock(&node_affinity.lock);
 	cpumask_clear_cpu(cpu, &set->used);
 	if (cpumask_empty(&set->used) && set->gen) {
 		set->gen--;
 		cpumask_copy(&set->used, &set->mask);
 	}
-	spin_unlock(&dd->affinity->lock);
+	spin_unlock(&node_affinity.lock);
 }
 
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index 20f52fe..ad3e730 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -82,11 +82,9 @@ struct hfi1_affinity {
 struct hfi1_msix_entry;
 
 /* Initialize non-HT cpu cores mask */
-int init_real_cpu_mask(struct hfi1_devdata *);
+void init_real_cpu_mask(void);
 /* Initialize driver affinity data */
-void hfi1_dev_affinity_init(struct hfi1_devdata *);
-/* Free driver affinity data */
-void hfi1_dev_affinity_free(struct hfi1_devdata *);
+int hfi1_dev_affinity_init(struct hfi1_devdata *);
 /*
  * Set IRQ affinity to a CPU. The function will determine the
  * CPU and set the affinity to it.
@@ -105,4 +103,23 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *, int);
 /* Release a CPU used by a user process. */
 void hfi1_put_proc_affinity(struct hfi1_devdata *, int);
 
+struct hfi1_affinity_node {
+	int node;
+	struct cpu_mask_set def_intr;
+	struct cpu_mask_set rcv_intr;
+	struct list_head list;
+};
+
+struct hfi1_affinity_node_list {
+	struct list_head list;
+	struct cpumask real_cpu_mask;
+	struct cpu_mask_set proc;
+	/* protect affinity node list */
+	spinlock_t lock;
+};
+
+void node_affinity_init(void);
+void node_affinity_destroy(void);
+extern struct hfi1_affinity_node_list node_affinity;
+
 #endif /* _HFI1_AFFINITY_H */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 9eb4551..22bfe0e 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -63,6 +63,7 @@
 #include "efivar.h"
 #include "platform.h"
 #include "aspm.h"
+#include "affinity.h"
 
 #define NUM_IB_PORTS 1
 
@@ -12838,7 +12839,7 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	 */
 	if (num_user_contexts < 0)
 		num_user_contexts =
-			cpumask_weight(&dd->affinity->real_cpu_mask);
+			cpumask_weight(&node_affinity.real_cpu_mask);
 
 	total_contexts = num_kernel_contexts + num_user_contexts;
 
@@ -14473,19 +14474,6 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
 		 (dd->revision >> CCE_REVISION_SW_SHIFT)
 		    & CCE_REVISION_SW_MASK);
 
-	/*
-	 * The real cpu mask is part of the affinity struct but has to be
-	 * initialized earlier than the rest of the affinity struct because it
-	 * is needed to calculate the number of user contexts in
-	 * set_up_context_variables(). However, hfi1_dev_affinity_init(),
-	 * which initializes the rest of the affinity struct members,
-	 * depends on set_up_context_variables() for the number of kernel
-	 * contexts, so it cannot be called before set_up_context_variables().
-	 */
-	ret = init_real_cpu_mask(dd);
-	if (ret)
-		goto bail_cleanup;
-
 	ret = set_up_context_variables(dd);
 	if (ret)
 		goto bail_cleanup;
@@ -14499,7 +14487,9 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
 	/* set up KDETH QP prefix in both RX and TX CSRs */
 	init_kdeth_qp(dd);
 
-	hfi1_dev_affinity_init(dd);
+	ret = hfi1_dev_affinity_init(dd);
+	if (ret)
+		goto bail_cleanup;
 
 	/* send contexts must be set up before receive contexts */
 	ret = init_send_contexts(dd);
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index eed971c..b0c3e8a 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -64,6 +64,7 @@
 #include "debugfs.h"
 #include "verbs.h"
 #include "aspm.h"
+#include "affinity.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -1004,7 +1005,6 @@ static void __hfi1_free_devdata(struct kobject *kobj)
 	rcu_barrier(); /* wait for rcu callbacks to complete */
 	free_percpu(dd->int_counter);
 	free_percpu(dd->rcv_limit);
-	hfi1_dev_affinity_free(dd);
 	free_percpu(dd->send_schedule);
 	rvt_dealloc_device(&dd->verbs_dev.rdi);
 }
@@ -1198,6 +1198,8 @@ static int __init hfi1_mod_init(void)
 	if (ret)
 		goto bail;
 
+	node_affinity_init();
+
 	/* validate max MTU before any devices start */
 	if (!valid_opa_max_mtu(hfi1_max_mtu)) {
 		pr_err("Invalid max_mtu 0x%x, using 0x%x instead\n",
@@ -1278,6 +1280,7 @@ module_init(hfi1_mod_init);
 static void __exit hfi1_mod_cleanup(void)
 {
 	pci_unregister_driver(&hfi1_pci_driver);
+	node_affinity_destroy();
 	hfi1_wss_exit();
 	hfi1_dbg_exit();
 	hfi1_cpulist_count = 0;


* [PATCH for-next 09/18] IB/hfi1: Reserve and collapse CPU cores for contexts
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dean Luick, Sebastian Sanchez

From: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Kernel receive queues oversubscribe CPU cores on multi-HFI systems.
To prevent this, the kernel receive queues are separated onto
different cores, and the SDMA engine interrupts are constrained to
a smaller number of cores.

hfi1s_on_numa_node * krcvqs is the number of CPU cores reserved for
kernel receive queues across all HFIs. Each HFI initializes its
kernel receive queues on one of the reserved CPU cores. If there end
up being no CPU cores left over for the SDMA engines, they fall back
to the same CPU core as the general/control contexts.

In addition, each general and control context is assigned its own
CPU core; however, both types of contexts tend to see low traffic.
To save CPU cores, collapse the general and control contexts onto one
CPU core for all HFI units. This change prevents SDMA engine
interrupts from wrapping around general contexts.
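
A worked example of the reservation arithmetic (all numbers are
hypothetical, and the driver's accounting of the control context is
slightly more involved):

#include <stdio.h>

int main(void)
{
	int node_cores = 14;		/* non-HT cores on this NUMA node */
	int hfi1s_on_numa_node = 2;	/* HFIs attached to this node */
	int krcvqs = 4;			/* kernel receive queues per HFI */
	int rcv_cores = hfi1s_on_numa_node * krcvqs;
	/* one core is set aside for the collapsed general/control CPU */
	int sdma_cores = node_cores - 1 - rcv_cores;

	if (sdma_cores <= 0)
		printf("SDMA falls back to the general/control CPU\n");
	else
		printf("rcv cores=%d, sdma cores=%d\n", rcv_cores, sdma_cores);
	return 0;
}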

Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/affinity.c |  101 +++++++++++++++++++++++++--------
 drivers/infiniband/hw/hfi1/affinity.h |    3 +
 drivers/infiniband/hw/hfi1/hfi.h      |    2 +
 drivers/infiniband/hw/hfi1/init.c     |    6 +-
 4 files changed, 84 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 4d82920..3a3ef2a 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -66,6 +66,9 @@ static const char * const irq_type_names[] = {
 	"OTHER",
 };
 
+/* Per NUMA node count of HFI devices */
+static unsigned int *hfi1_per_node_cntr;
+
 static inline void init_cpu_mask_set(struct cpu_mask_set *set)
 {
 	cpumask_clear(&set->mask);
@@ -107,8 +110,12 @@ void init_real_cpu_mask(void)
 	}
 }
 
-void node_affinity_init(void)
+int node_affinity_init(void)
 {
+	int node;
+	struct pci_dev *dev = NULL;
+	const struct pci_device_id *ids = hfi1_pci_tbl;
+
 	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
 	/*
 	 * The real cpu mask is part of the affinity struct but it has to be
@@ -116,6 +123,25 @@ void node_affinity_init(void)
 	 * contexts in set_up_context_variables().
 	 */
 	init_real_cpu_mask();
+
+	hfi1_per_node_cntr = kcalloc(num_possible_nodes(),
+				     sizeof(*hfi1_per_node_cntr), GFP_KERNEL);
+	if (!hfi1_per_node_cntr)
+		return -ENOMEM;
+
+	while (ids->vendor) {
+		dev = NULL;
+		while ((dev = pci_get_device(ids->vendor, ids->device, dev))) {
+			node = pcibus_to_node(dev->bus);
+			if (node < 0)
+				node = numa_node_id();
+
+			hfi1_per_node_cntr[node]++;
+		}
+		ids++;
+	}
+
+	return 0;
 }
 
 void node_affinity_destroy(void)
@@ -131,6 +157,7 @@ void node_affinity_destroy(void)
 		kfree(entry);
 	}
 	spin_unlock(&node_affinity.lock);
+	kfree(hfi1_per_node_cntr);
 }
 
 static struct hfi1_affinity_node *node_affinity_allocate(int node)
@@ -213,6 +240,7 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		}
 		init_cpu_mask_set(&entry->def_intr);
 		init_cpu_mask_set(&entry->rcv_intr);
+		cpumask_clear(&entry->general_intr_mask);
 		/* Use the "real" cpu mask of this node as the default */
 		cpumask_and(&entry->def_intr.mask, &node_affinity.real_cpu_mask,
 			    local_mask);
@@ -224,11 +252,15 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		if (possible == 1) {
 			/* only one CPU, everyone will use it */
 			cpumask_set_cpu(curr_cpu, &entry->rcv_intr.mask);
+			cpumask_set_cpu(curr_cpu, &entry->general_intr_mask);
 		} else {
 			/*
-			 * Retain the first CPU in the default list for the
-			 * control context.
+			 * The general/control context will be the first CPU in
+			 * the default list, so it is removed from the default
+			 * list and added to the general interrupt list.
 			 */
+			cpumask_clear_cpu(curr_cpu, &entry->def_intr.mask);
+			cpumask_set_cpu(curr_cpu, &entry->general_intr_mask);
 			curr_cpu = cpumask_next(curr_cpu,
 						&entry->def_intr.mask);
 
@@ -236,7 +268,10 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 			 * Remove the remaining kernel receive queues from
 			 * the default list and add them to the receive list.
 			 */
-			for (i = 0; i < dd->n_krcv_queues - 1; i++) {
+			for (i = 0;
+			     i < (dd->n_krcv_queues - 1) *
+				  hfi1_per_node_cntr[dd->node];
+			     i++) {
 				cpumask_clear_cpu(curr_cpu,
 						  &entry->def_intr.mask);
 				cpumask_set_cpu(curr_cpu,
@@ -246,6 +281,15 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 				if (curr_cpu >= nr_cpu_ids)
 					break;
 			}
+
+			/*
+			 * If there ends up being 0 CPU cores leftover for SDMA
+			 * engines, use the same CPU cores as general/control
+			 * context.
+			 */
+			if (cpumask_weight(&entry->def_intr.mask) == 0)
+				cpumask_copy(&entry->def_intr.mask,
+					     &entry->general_intr_mask);
 		}
 
 		spin_lock(&node_affinity.lock);
@@ -261,7 +305,7 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	int ret;
 	cpumask_var_t diff;
 	struct hfi1_affinity_node *entry;
-	struct cpu_mask_set *set;
+	struct cpu_mask_set *set = NULL;
 	struct sdma_engine *sde = NULL;
 	struct hfi1_ctxtdata *rcd = NULL;
 	char extra[64];
@@ -282,18 +326,17 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	case IRQ_SDMA:
 		sde = (struct sdma_engine *)msix->arg;
 		scnprintf(extra, 64, "engine %u", sde->this_idx);
-		/* fall through */
-	case IRQ_GENERAL:
 		set = &entry->def_intr;
 		break;
+	case IRQ_GENERAL:
+		cpu = cpumask_first(&entry->general_intr_mask);
+		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
-		if (rcd->ctxt == HFI1_CTRL_CTXT) {
-			set = &entry->def_intr;
-			cpu = cpumask_first(&set->mask);
-		} else {
+		if (rcd->ctxt == HFI1_CTRL_CTXT)
+			cpu = cpumask_first(&entry->general_intr_mask);
+		else
 			set = &entry->rcv_intr;
-		}
 		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
 		break;
 	default:
@@ -302,9 +345,9 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	}
 
 	/*
-	 * The control receive context is placed on a particular CPU, which
-	 * is set above.  Skip accounting for it.  Everything else finds its
-	 * CPU here.
+	 * The general and control contexts are placed on a particular
+	 * CPU, which is set above. Skip accounting for it. Everything else
+	 * finds its CPU here.
 	 */
 	if (cpu == -1 && set) {
 		spin_lock(&node_affinity.lock);
@@ -355,12 +398,14 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 
 	switch (msix->type) {
 	case IRQ_SDMA:
-	case IRQ_GENERAL:
 		set = &entry->def_intr;
 		break;
+	case IRQ_GENERAL:
+		/* Don't accounting for general contexts */
+		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
-		/* only do accounting for non control contexts */
+		/* Don't do accounting for control contexts */
 		if (rcd->ctxt != HFI1_CTRL_CTXT)
 			set = &entry->rcv_intr;
 		break;
@@ -439,14 +484,20 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpumask_clear(&set->used);
 	}
 
-	entry = node_affinity_lookup(dd->node);
-	/* CPUs used by interrupt handlers */
-	cpumask_copy(intrs, (entry->def_intr.gen ?
-			     &entry->def_intr.mask :
-			     &entry->def_intr.used));
-	cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
-				  &entry->rcv_intr.mask :
-				  &entry->rcv_intr.used));
+	/*
+	 * If NUMA node has CPUs used by interrupt handlers, include them in the
+	 * interrupt handler mask.
+	 */
+	entry = node_affinity_lookup(node);
+	if (entry) {
+		cpumask_copy(intrs, (entry->def_intr.gen ?
+				     &entry->def_intr.mask :
+				     &entry->def_intr.used));
+		cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
+					  &entry->rcv_intr.mask :
+					  &entry->rcv_intr.used));
+		cpumask_or(intrs, intrs, &entry->general_intr_mask);
+	}
 	scnprintf(buf, sizeof(buf), "%*pbl", cpumask_pr_args(intrs));
 	hfi1_cdbg(PROC, "CPUs used by interrupts: %s", buf);
 
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index ad3e730..003860e 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -107,6 +107,7 @@ struct hfi1_affinity_node {
 	int node;
 	struct cpu_mask_set def_intr;
 	struct cpu_mask_set rcv_intr;
+	struct cpumask general_intr_mask;
 	struct list_head list;
 };
 
@@ -118,7 +119,7 @@ struct hfi1_affinity_node_list {
 	spinlock_t lock;
 };
 
-void node_affinity_init(void);
+int node_affinity_init(void);
 void node_affinity_destroy(void);
 extern struct hfi1_affinity_node_list node_affinity;
 
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 748e235..fd67e98 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1235,6 +1235,8 @@ int handle_receive_interrupt_nodma_rtail(struct hfi1_ctxtdata *, int);
 int handle_receive_interrupt_dma_rtail(struct hfi1_ctxtdata *, int);
 void set_all_slowpath(struct hfi1_devdata *dd);
 
+extern const struct pci_device_id hfi1_pci_tbl[];
+
 /* receive packet handler dispositions */
 #define RCV_PKT_OK      0x0 /* keep going */
 #define RCV_PKT_LIMIT   0x1 /* stop, hit limit, start thread */
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index b0c3e8a..1620d68 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -1162,7 +1162,7 @@ static int init_one(struct pci_dev *, const struct pci_device_id *);
 #define DRIVER_LOAD_MSG "Intel " DRIVER_NAME " loaded: "
 #define PFX DRIVER_NAME ": "
 
-static const struct pci_device_id hfi1_pci_tbl[] = {
+const struct pci_device_id hfi1_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL0) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL1) },
 	{ 0, }
@@ -1198,7 +1198,9 @@ static int __init hfi1_mod_init(void)
 	if (ret)
 		goto bail;
 
-	node_affinity_init();
+	ret = node_affinity_init();
+	if (ret)
+		goto bail;
 
 	/* validate max MTU before any devices start */
 	if (!valid_opa_max_mtu(hfi1_max_mtu)) {


* [PATCH for-next 10/18] IB/hfi1: Refine user process affinity algorithm
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Ira Weiny, Mitko Haralanov,
	Sebastian Sanchez

From: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

When recommending process affinity for MPI ranks, the current
algorithm does not take multiple HFI units into account, nor does it
distinguish real cores from HT cores. As a result, all HT cores within
the local NUMA node are recommended for assignment before any cores in
other NUMA nodes. To balance the CPU workload, it is better to assign
all real cores across all NUMA nodes first, then all first HT siblings,
then all second HT siblings, and so on. The current algorithm also
ignores that CPU cores in other NUMA nodes may be running interrupt
handlers.

To balance the CPU workload for user processes, the following
recommendation algorithm is used:

 For each user process that is opening a context on HFI Y:
  a) If all cores are assigned to user processes, start assignments all
	 over from the first core
  b) Assign real cores first, then HT cores (first set of HT cores on
	 all physical cores, then second set of HT cores, and so on) in the
	 following order:

	 1. Same NUMA node as HFI Y and not running an IRQ handler
	 2. Same NUMA node as HFI Y and running an IRQ handler
	 3. Different NUMA node to HFI Y and not running an IRQ handler
	 4. Different NUMA node to HFI Y and running an IRQ handler
  c) Mark core as assigned in the global affinity structure. As user
	 processes are done, remove core assignments from global affinity
	 structure.

This implementation allows an arbitrary number of HT cores and
supports multiple HFIs.

This is included in the kernel rather than user space because user
space has no way of knowing the CPU recommendations for contexts
running as part of other jobs.
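
A compact sketch of the preference order in (b) (a hypothetical
bitmask helper; the driver builds these sets with cpumask operations):

#include <stdint.h>
#include <stdio.h>

/* Return the first available CPU, preferring: node-local without an
 * IRQ handler, node-local with one, remote without, then remote with. */
static int pick_recommended_cpu(uint64_t all, uint64_t node_local,
				uint64_t irq_cpus, uint64_t used)
{
	uint64_t order[4] = {
		node_local & ~irq_cpus,			/* 1. local, no IRQ */
		node_local & irq_cpus,			/* 2. local, IRQ */
		(all & ~node_local) & ~irq_cpus,	/* 3. remote, no IRQ */
		(all & ~node_local) & irq_cpus,		/* 4. remote, IRQ */
	};
	int i;

	for (i = 0; i < 4; i++) {
		uint64_t avail = order[i] & ~used;

		if (avail)
			return __builtin_ctzll(avail);	/* lowest set bit */
	}
	return -1;
}

int main(void)
{
	/* CPUs 0-7 exist; 0-3 are node-local; CPU 0 runs an IRQ handler */
	printf("CPU %d\n", pick_recommended_cpu(0xff, 0x0f, 0x01, 0));
	return 0;
}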

Reviewed-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/affinity.c |  235 +++++++++++++++++++++++++--------
 drivers/infiniband/hw/hfi1/affinity.h |    8 +
 drivers/infiniband/hw/hfi1/file_ops.c |   15 +-
 3 files changed, 191 insertions(+), 67 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 3a3ef2a..1e31827 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -116,7 +116,17 @@ int node_affinity_init(void)
 	struct pci_dev *dev = NULL;
 	const struct pci_device_id *ids = hfi1_pci_tbl;
 
+	cpumask_clear(&node_affinity.proc.used);
 	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
+
+	node_affinity.proc.gen = 0;
+	node_affinity.num_core_siblings =
+				cpumask_weight(topology_sibling_cpumask(
+					cpumask_first(&node_affinity.proc.mask)
+					));
+	node_affinity.num_online_nodes = num_online_nodes();
+	node_affinity.num_online_cpus = num_online_cpus();
+
 	/*
 	 * The real cpu mask is part of the affinity struct but it has to be
 	 * initialized early. It is needed to calculate the number of user
@@ -401,7 +411,7 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 		set = &entry->def_intr;
 		break;
 	case IRQ_GENERAL:
-		/* Don't accounting for general contexts */
+		/* Don't do accounting for general contexts */
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
@@ -427,14 +437,47 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 	cpumask_clear(&msix->mask);
 }
 
-int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
+/* This should be called with node_affinity.lock held */
+static void find_hw_thread_mask(uint hw_thread_no, cpumask_var_t hw_thread_mask,
+				struct hfi1_affinity_node_list *affinity)
+{
+	int possible, curr_cpu, i;
+	uint num_cores_per_socket = node_affinity.num_online_cpus /
+					affinity->num_core_siblings /
+						node_affinity.num_online_nodes;
+
+	cpumask_copy(hw_thread_mask, &affinity->proc.mask);
+	if (affinity->num_core_siblings > 0) {
+		/* Removing other siblings not needed for now */
+		possible = cpumask_weight(hw_thread_mask);
+		curr_cpu = cpumask_first(hw_thread_mask);
+		for (i = 0;
+		     i < num_cores_per_socket * node_affinity.num_online_nodes;
+		     i++)
+			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
+
+		for (; i < possible; i++) {
+			cpumask_clear_cpu(curr_cpu, hw_thread_mask);
+			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
+		}
+
+		/* Identifying correct HW threads within physical cores */
+		cpumask_shift_left(hw_thread_mask, hw_thread_mask,
+				   num_cores_per_socket *
+				   node_affinity.num_online_nodes *
+				   hw_thread_no);
+	}
+}
+
+int hfi1_get_proc_affinity(int node)
 {
-	int cpu = -1, ret;
-	cpumask_var_t diff, mask, intrs;
+	int cpu = -1, ret, i;
 	struct hfi1_affinity_node *entry;
+	cpumask_var_t diff, hw_thread_mask, available_mask, intrs_mask;
 	const struct cpumask *node_mask,
 		*proc_mask = tsk_cpus_allowed(current);
-	struct cpu_mask_set *set = &node_affinity.proc;
+	struct hfi1_affinity_node_list *affinity = &node_affinity;
+	struct cpu_mask_set *set = &affinity->proc;
 	char buf[1024];
 
 	/*
@@ -442,9 +485,10 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 	 * been set
 	 */
 	if (cpumask_weight(proc_mask) == 1) {
-		hfi1_cdbg(PROC, "PID %u %s affinity set to CPU %*pbl",
-			  current->pid, current->comm,
+		scnprintf(buf, sizeof(buf), "%*pbl",
 			  cpumask_pr_args(proc_mask));
+		hfi1_cdbg(PROC, "PID %u %s affinity set to CPU %s",
+			  current->pid, current->comm, buf);
 		/*
 		 * Mark the pre-set CPU as used. This is atomic so we don't
 		 * need the lock
@@ -453,30 +497,50 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpumask_set_cpu(cpu, &set->used);
 		goto done;
 	} else if (cpumask_weight(proc_mask) < cpumask_weight(&set->mask)) {
-		hfi1_cdbg(PROC, "PID %u %s affinity set to CPU set(s) %*pbl",
-			  current->pid, current->comm,
+		scnprintf(buf, sizeof(buf), "%*pbl",
 			  cpumask_pr_args(proc_mask));
+		hfi1_cdbg(PROC, "PID %u %s affinity set to CPU set(s) %s",
+			  current->pid, current->comm, buf);
 		goto done;
 	}
 
 	/*
 	 * The process does not have a preset CPU affinity so find one to
-	 * recommend. We prefer CPUs on the same NUMA as the device.
+	 * recommend using the following algorithm:
+	 *
+	 * For each user process that is opening a context on HFI Y:
+	 *  a) If all cores are filled, reinitialize the bitmask
+	 *  b) Fill real cores first, then HT cores (First set of HT
+	 *     cores on all physical cores, then second set of HT core,
+	 *     and, so on) in the following order:
+	 *
+	 *     1. Same NUMA node as HFI Y and not running an IRQ
+	 *        handler
+	 *     2. Same NUMA node as HFI Y and running an IRQ handler
+	 *     3. Different NUMA node to HFI Y and not running an IRQ
+	 *        handler
+	 *     4. Different NUMA node to HFI Y and running an IRQ
+	 *        handler
+	 *  c) Mark core as filled in the bitmask. As user processes are
+	 *     done, clear cores from the bitmask.
 	 */
 
 	ret = zalloc_cpumask_var(&diff, GFP_KERNEL);
 	if (!ret)
 		goto done;
-	ret = zalloc_cpumask_var(&mask, GFP_KERNEL);
+	ret = zalloc_cpumask_var(&hw_thread_mask, GFP_KERNEL);
 	if (!ret)
 		goto free_diff;
-	ret = zalloc_cpumask_var(&intrs, GFP_KERNEL);
+	ret = zalloc_cpumask_var(&available_mask, GFP_KERNEL);
+	if (!ret)
+		goto free_hw_thread_mask;
+	ret = zalloc_cpumask_var(&intrs_mask, GFP_KERNEL);
 	if (!ret)
-		goto free_mask;
+		goto free_available_mask;
 
-	spin_lock(&node_affinity.lock);
+	spin_lock(&affinity->lock);
 	/*
-	 * If we've used all available CPUs, clear the mask and start
+	 * If we've used all available HW threads, clear the mask and start
 	 * overloading.
 	 */
 	if (cpumask_equal(&set->mask, &set->used)) {
@@ -490,82 +554,135 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 	 */
 	entry = node_affinity_lookup(node);
 	if (entry) {
-		cpumask_copy(intrs, (entry->def_intr.gen ?
-				     &entry->def_intr.mask :
-				     &entry->def_intr.used));
-		cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
-					  &entry->rcv_intr.mask :
-					  &entry->rcv_intr.used));
-		cpumask_or(intrs, intrs, &entry->general_intr_mask);
+		cpumask_copy(intrs_mask, (entry->def_intr.gen ?
+					  &entry->def_intr.mask :
+					  &entry->def_intr.used));
+		cpumask_or(intrs_mask, intrs_mask, (entry->rcv_intr.gen ?
+						    &entry->rcv_intr.mask :
+						    &entry->rcv_intr.used));
+		cpumask_or(intrs_mask, intrs_mask, &entry->general_intr_mask);
 	}
-	scnprintf(buf, sizeof(buf), "%*pbl", cpumask_pr_args(intrs));
+	scnprintf(buf, sizeof(buf), "%*pbl", cpumask_pr_args(intrs_mask));
 	hfi1_cdbg(PROC, "CPUs used by interrupts: %s", buf);
 
+	cpumask_copy(hw_thread_mask, &set->mask);
+
 	/*
-	 * If we don't have a NUMA node requested, preference is towards
-	 * device NUMA node
+	 * If HT cores are enabled, identify which HW threads within the
+	 * physical cores should be used.
 	 */
-	if (node == -1)
-		node = dd->node;
-	node_mask = cpumask_of_node(node);
-	hfi1_cdbg(PROC, "device on NUMA %u, CPUs %*pbl", node,
-		  cpumask_pr_args(node_mask));
+	if (affinity->num_core_siblings > 0) {
+		for (i = 0; i < affinity->num_core_siblings; i++) {
+			find_hw_thread_mask(i, hw_thread_mask, affinity);
+
+			/*
+			 * If there's at least one available core for this HW
+			 * thread number, stop looking for a core.
+			 *
+			 * diff will always be not empty at least once in this
+			 * loop as the used mask gets reset when
+			 * (set->mask == set->used) before this loop.
+			 */
+			cpumask_andnot(diff, hw_thread_mask, &set->used);
+			if (!cpumask_empty(diff))
+				break;
+		}
+	}
+	if (!scnprintf(buf, sizeof(buf), "%*pbl",
+		       cpumask_pr_args(hw_thread_mask)))
+		snprintf(buf, sizeof(buf), "None");
+	hfi1_cdbg(PROC, "Same available HW thread on all physical CPUs: %s",
+		  buf);
 
-	/* diff will hold all unused cpus */
-	cpumask_andnot(diff, &set->mask, &set->used);
-	hfi1_cdbg(PROC, "unused CPUs (all) %*pbl", cpumask_pr_args(diff));
+	node_mask = cpumask_of_node(node);
+	scnprintf(buf, 1024, "%*pbl", cpumask_pr_args(node_mask));
+	hfi1_cdbg(PROC, "Device on NUMA %u, CPUs %s", node, buf);
 
-	/* get cpumask of available CPUs on preferred NUMA */
-	cpumask_and(mask, diff, node_mask);
-	hfi1_cdbg(PROC, "available cpus on NUMA %*pbl", cpumask_pr_args(mask));
+	/* Get cpumask of available CPUs on preferred NUMA */
+	cpumask_and(available_mask, hw_thread_mask, node_mask);
+	cpumask_andnot(available_mask, available_mask, &set->used);
+	if (!scnprintf(buf, sizeof(buf), "%*pbl",
+		       cpumask_pr_args(available_mask)))
+		snprintf(buf, sizeof(buf), "None");
+	hfi1_cdbg(PROC, "Available CPUs on NUMA %u: %s", node, buf);
 
 	/*
 	 * At first, we don't want to place processes on the same
-	 * CPUs as interrupt handlers.
+	 * CPUs as interrupt handlers. Then, CPUs running interrupt
+	 * handlers are used.
+	 *
+	 * 1) If diff is not empty, then there are CPUs not running
+	 *    non-interrupt handlers available, so diff gets copied
+	 *    over to available_mask.
+	 * 2) If diff is empty, then all CPUs not running interrupt
+	 *    handlers are taken, so available_mask contains all
+	 *    available CPUs running interrupt handlers.
+	 * 3) If available_mask is empty, then all CPUs on the
+	 *    preferred NUMA node are taken, so other NUMA nodes are
+	 *    used for process assignments using the same method as
+	 *    the preferred NUMA node.
 	 */
-	cpumask_andnot(diff, mask, intrs);
+	cpumask_andnot(diff, available_mask, intrs_mask);
 	if (!cpumask_empty(diff))
-		cpumask_copy(mask, diff);
+		cpumask_copy(available_mask, diff);
+
+	/* If we don't have CPUs on the preferred node, use other NUMA nodes */
+	if (cpumask_empty(available_mask)) {
+		cpumask_andnot(available_mask, hw_thread_mask, &set->used);
+		/* Excluding preferred NUMA cores */
+		cpumask_andnot(available_mask, available_mask, node_mask);
+		if (!scnprintf(buf, sizeof(buf), "%*pbl",
+			       cpumask_pr_args(available_mask)))
+			snprintf(buf, sizeof(buf), "None");
+		hfi1_cdbg(PROC,
+			  "Preferred NUMA node cores are taken, cores available in other NUMA nodes: %s",
+			  buf);
 
-	/*
-	 * if we don't have a cpu on the preferred NUMA, get
-	 * the list of the remaining available CPUs
-	 */
-	if (cpumask_empty(mask)) {
-		cpumask_andnot(diff, &set->mask, &set->used);
-		cpumask_andnot(mask, diff, node_mask);
+		/*
+		 * At first, we don't want to place processes on the same
+		 * CPUs as interrupt handlers.
+		 */
+		cpumask_andnot(diff, available_mask, intrs_mask);
+		if (!cpumask_empty(diff))
+			cpumask_copy(available_mask, diff);
 	}
-	hfi1_cdbg(PROC, "possible CPUs for process %*pbl",
-		  cpumask_pr_args(mask));
+	if (!scnprintf(buf, sizeof(buf), "%*pbl",
+		       cpumask_pr_args(available_mask)))
+		snprintf(buf, sizeof(buf), "None");
+	hfi1_cdbg(PROC, "Possible CPUs for process: %s", buf);
 
-	cpu = cpumask_first(mask);
+	cpu = cpumask_first(available_mask);
 	if (cpu >= nr_cpu_ids) /* empty */
 		cpu = -1;
 	else
 		cpumask_set_cpu(cpu, &set->used);
-	spin_unlock(&node_affinity.lock);
-
-	free_cpumask_var(intrs);
-free_mask:
-	free_cpumask_var(mask);
+	spin_unlock(&affinity->lock);
+	hfi1_cdbg(PROC, "Process assigned to CPU %d", cpu);
+
+	free_cpumask_var(intrs_mask);
+free_available_mask:
+	free_cpumask_var(available_mask);
+free_hw_thread_mask:
+	free_cpumask_var(hw_thread_mask);
 free_diff:
 	free_cpumask_var(diff);
 done:
 	return cpu;
 }
 
-void hfi1_put_proc_affinity(struct hfi1_devdata *dd, int cpu)
+void hfi1_put_proc_affinity(int cpu)
 {
-	struct cpu_mask_set *set = &node_affinity.proc;
+	struct hfi1_affinity_node_list *affinity = &node_affinity;
+	struct cpu_mask_set *set = &affinity->proc;
 
 	if (cpu < 0)
 		return;
-	spin_lock(&node_affinity.lock);
+	spin_lock(&affinity->lock);
 	cpumask_clear_cpu(cpu, &set->used);
+	hfi1_cdbg(PROC, "Returning CPU %d for future process assignment", cpu);
 	if (cpumask_empty(&set->used) && set->gen) {
 		set->gen--;
 		cpumask_copy(&set->used, &set->mask);
 	}
-	spin_unlock(&node_affinity.lock);
+	spin_unlock(&affinity->lock);
 }
-
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index 003860e..f784de5 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -73,7 +73,6 @@ struct cpu_mask_set {
 struct hfi1_affinity {
 	struct cpu_mask_set def_intr;
 	struct cpu_mask_set rcv_intr;
-	struct cpu_mask_set proc;
 	struct cpumask real_cpu_mask;
 	/* spin lock to protect affinity struct */
 	spinlock_t lock;
@@ -99,9 +98,9 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *, struct hfi1_msix_entry *);
  * Determine a CPU affinity for a user process, if the process does not
  * have an affinity set yet.
  */
-int hfi1_get_proc_affinity(struct hfi1_devdata *, int);
+int hfi1_get_proc_affinity(int);
 /* Release a CPU used by a user process. */
-void hfi1_put_proc_affinity(struct hfi1_devdata *, int);
+void hfi1_put_proc_affinity(int);
 
 struct hfi1_affinity_node {
 	int node;
@@ -115,6 +114,9 @@ struct hfi1_affinity_node_list {
 	struct list_head list;
 	struct cpumask real_cpu_mask;
 	struct cpu_mask_set proc;
+	int num_core_siblings;
+	int num_online_nodes;
+	int num_online_cpus;
 	/* protect affinity node list */
 	spinlock_t lock;
 };
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 2f097d9..d7c07bc 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -715,7 +715,7 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
 	hfi1_user_sdma_free_queues(fdata);
 
 	/* release the cpu */
-	hfi1_put_proc_affinity(dd, fdata->rec_cpu_num);
+	hfi1_put_proc_affinity(fdata->rec_cpu_num);
 
 	/*
 	 * Clear any left over, unhandled events so the next process that
@@ -815,9 +815,10 @@ static int assign_ctxt(struct file *fp, struct hfi1_user_info *uinfo)
 		ret = find_shared_ctxt(fp, uinfo);
 		if (ret < 0)
 			goto done_unlock;
-		if (ret)
-			fd->rec_cpu_num = hfi1_get_proc_affinity(
-				fd->uctxt->dd, fd->uctxt->numa_id);
+		if (ret) {
+			fd->rec_cpu_num =
+				hfi1_get_proc_affinity(fd->uctxt->numa_id);
+		}
 	}
 
 	/*
@@ -929,7 +930,11 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
 	if (ctxt == dd->num_rcv_contexts)
 		return -EBUSY;
 
-	fd->rec_cpu_num = hfi1_get_proc_affinity(dd, -1);
+	/*
+	 * If we don't have a NUMA node requested, preference is towards
+	 * device NUMA node.
+	 */
+	fd->rec_cpu_num = hfi1_get_proc_affinity(dd->node);
 	if (fd->rec_cpu_num != -1)
 		numa = cpu_to_node(fd->rec_cpu_num);
 	else
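
Stripped of the cpumask plumbing, the selection order spelled out in the
algorithm comment above reduces to four masked tiers searched in turn.
An illustrative userspace sketch, with plain 64-bit words standing in
for struct cpumask (not driver code):

#include <stdio.h>
#include <stdint.h>

/* Tier masks, most preferred first; CPUs are bits in a 64-bit word. */
static int pick_cpu(uint64_t avail, uint64_t node_local, uint64_t irq_busy)
{
	uint64_t tiers[4] = {
		avail &  node_local & ~irq_busy, /* 1: local, no IRQ handler  */
		avail &  node_local &  irq_busy, /* 2: local, IRQ handler     */
		avail & ~node_local & ~irq_busy, /* 3: remote, no IRQ handler */
		avail & ~node_local &  irq_busy, /* 4: remote, IRQ handler    */
	};
	int i;

	for (i = 0; i < 4; i++)
		if (tiers[i])
			return __builtin_ctzll(tiers[i]); /* lowest set bit */
	return -1; /* nothing available */
}

int main(void)
{
	/* 8 CPUs; CPUs 0-3 on the device's node; CPU 0 runs an IRQ handler */
	printf("picked CPU %d\n", pick_cpu(0xffull, 0x0full, 0x01ull));
	return 0;
}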


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next 11/18] IB/hfi1: Use built-in i2c bit-shift bus adapter
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (9 preceding siblings ...)
  2016-07-01 23:01   ` [PATCH for-next 10/18] IB/hfi1: Refine user process affinity algorithm Dennis Dalessandro
@ 2016-07-01 23:01   ` Dennis Dalessandro
  2016-07-01 23:01   ` [PATCH for-next 12/18] IB/hfi1: Remove TWSI references Dennis Dalessandro
                     ` (7 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Easwar Hariharan, Dean Luick

From: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Use the kernel's built-in i2c bit-shift bus adapter to control
the i2c buses on the chip.
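
For readers new to the core API being adopted here: once the adapter is
registered through i2c_bit_add_bus(), an offset-addressed read is just
two chained messages handed to i2c_transfer(). A minimal sketch (the
function name and one-byte offset are illustrative, not from the patch):

#include <linux/i2c.h>

/*
 * Illustrative only: read "len" bytes starting at a one-byte register
 * offset, expressed as an offset write chained to a data read within
 * a single transfer.
 */
static int example_reg_read(struct i2c_adapter *adap, u8 slave,
			    u8 offset, u8 *data, u16 len)
{
	struct i2c_msg msgs[2] = {
		{ .addr = slave, .flags = 0,        .len = 1,   .buf = &offset },
		{ .addr = slave, .flags = I2C_M_RD, .len = len, .buf = data   },
	};
	int ret = i2c_transfer(adap, msgs, 2);

	/* i2c_transfer() returns the number of messages completed */
	return ret == 2 ? 0 : (ret < 0 ? ret : -EIO);
}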

Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Reviewed-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/Kconfig |    3 
 drivers/infiniband/hw/hfi1/hfi.h   |   11 +
 drivers/infiniband/hw/hfi1/init.c  |   27 ++
 drivers/infiniband/hw/hfi1/qsfp.c  |  407 +++++++++++++++++++++++++++---------
 drivers/infiniband/hw/hfi1/qsfp.h  |    3 
 5 files changed, 336 insertions(+), 115 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/Kconfig b/drivers/infiniband/hw/hfi1/Kconfig
index a925fb0..bac1860 100644
--- a/drivers/infiniband/hw/hfi1/Kconfig
+++ b/drivers/infiniband/hw/hfi1/Kconfig
@@ -1,8 +1,9 @@
 config INFINIBAND_HFI1
 	tristate "Intel OPA Gen1 support"
-	depends on X86_64 && INFINIBAND_RDMAVT
+	depends on X86_64 && INFINIBAND_RDMAVT && I2C
 	select MMU_NOTIFIER
 	select CRC32
+	select I2C_ALGOBIT
 	default m
 	---help---
 	This is a low-level driver for Intel OPA Gen1 adapter.
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index fd67e98..c433eb8 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -62,6 +62,8 @@
 #include <linux/cdev.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
+#include <linux/i2c.h>
+#include <linux/i2c-algo-bit.h>
 #include <rdma/rdma_vt.h>
 
 #include "chip_registers.h"
@@ -805,10 +807,19 @@ struct hfi1_temp {
 	u8 triggers;      /* temperature triggers */
 };
 
+struct hfi1_i2c_bus {
+	struct hfi1_devdata *controlling_dd; /* current controlling device */
+	struct i2c_adapter adapter;	/* bus details */
+	struct i2c_algo_bit_data algo;	/* bus algorithm details */
+	int num;			/* bus number, 0 or 1 */
+};
+
 /* common data between shared ASIC HFIs */
 struct hfi1_asic_data {
 	struct hfi1_devdata *dds[2];	/* back pointers */
 	struct mutex asic_resource_mutex;
+	struct hfi1_i2c_bus *i2c_bus0;
+	struct hfi1_i2c_bus *i2c_bus1;
 };
 
 /* device data struct now contains only "general per-device" info.
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 1620d68..ec77c7e 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -973,34 +973,45 @@ void hfi1_free_ctxtdata(struct hfi1_devdata *dd, struct hfi1_ctxtdata *rcd)
 
 /*
  * Release our hold on the shared asic data.  If we are the last one,
- * free the structure.  Must be holding hfi1_devs_lock.
+ * return the structure to be finalized outside the lock.  Must be
+ * holding hfi1_devs_lock.
  */
-static void release_asic_data(struct hfi1_devdata *dd)
+static struct hfi1_asic_data *release_asic_data(struct hfi1_devdata *dd)
 {
+	struct hfi1_asic_data *ad;
 	int other;
 
 	if (!dd->asic_data)
-		return;
+		return NULL;
 	dd->asic_data->dds[dd->hfi1_id] = NULL;
 	other = dd->hfi1_id ? 0 : 1;
-	if (!dd->asic_data->dds[other]) {
-		/* we are the last holder, free it */
-		kfree(dd->asic_data);
-	}
+	ad = dd->asic_data;
 	dd->asic_data = NULL;
+	/* return NULL if the other dd still has a link */
+	return ad->dds[other] ? NULL : ad;
+}
+
+static void finalize_asic_data(struct hfi1_devdata *dd,
+			       struct hfi1_asic_data *ad)
+{
+	clean_up_i2c(dd, ad);
+	kfree(ad);
 }
 
 static void __hfi1_free_devdata(struct kobject *kobj)
 {
 	struct hfi1_devdata *dd =
 		container_of(kobj, struct hfi1_devdata, kobj);
+	struct hfi1_asic_data *ad;
 	unsigned long flags;
 
 	spin_lock_irqsave(&hfi1_devs_lock, flags);
 	idr_remove(&hfi1_unit_table, dd->unit);
 	list_del(&dd->list);
-	release_asic_data(dd);
+	ad = release_asic_data(dd);
 	spin_unlock_irqrestore(&hfi1_devs_lock, flags);
+	if (ad)
+		finalize_asic_data(dd, ad);
 	free_platform_config(dd);
 	rcu_barrier(); /* wait for rcu callbacks to complete */
 	free_percpu(dd->int_counter);
diff --git a/drivers/infiniband/hw/hfi1/qsfp.c b/drivers/infiniband/hw/hfi1/qsfp.c
index 6fca2a0..a207717 100644
--- a/drivers/infiniband/hw/hfi1/qsfp.c
+++ b/drivers/infiniband/hw/hfi1/qsfp.c
@@ -50,46 +50,285 @@
 #include <linux/vmalloc.h>
 
 #include "hfi.h"
-#include "twsi.h"
+
+/* for the given bus number, return the CSR for reading an i2c line */
+static inline u32 i2c_in_csr(u32 bus_num)
+{
+	return bus_num ? ASIC_QSFP2_IN : ASIC_QSFP1_IN;
+}
+
+/* for the given bus number, return the CSR for writing an i2c line */
+static inline u32 i2c_oe_csr(u32 bus_num)
+{
+	return bus_num ? ASIC_QSFP2_OE : ASIC_QSFP1_OE;
+}
+
+static void hfi1_setsda(void *data, int state)
+{
+	struct hfi1_i2c_bus *bus = (struct hfi1_i2c_bus *)data;
+	struct hfi1_devdata *dd = bus->controlling_dd;
+	u64 reg;
+	u32 target_oe;
+
+	target_oe = i2c_oe_csr(bus->num);
+	reg = read_csr(dd, target_oe);
+	/*
+	 * The OE bit value is inverted and connected to the pin.  When
+	 * OE is 0 the pin is left to be pulled up, when the OE is 1
+	 * the pin is driven low.  This matches the "open drain" or "open
+	 * collector" convention.
+	 */
+	if (state)
+		reg &= ~QSFP_HFI0_I2CDAT;
+	else
+		reg |= QSFP_HFI0_I2CDAT;
+	write_csr(dd, target_oe, reg);
+	/* do a read to force the write into the chip */
+	(void)read_csr(dd, target_oe);
+}
+
+static void hfi1_setscl(void *data, int state)
+{
+	struct hfi1_i2c_bus *bus = (struct hfi1_i2c_bus *)data;
+	struct hfi1_devdata *dd = bus->controlling_dd;
+	u64 reg;
+	u32 target_oe;
+
+	target_oe = i2c_oe_csr(bus->num);
+	reg = read_csr(dd, target_oe);
+	/*
+	 * The OE bit value is inverted and connected to the pin.  When
+	 * OE is 0 the pin is left to be pulled up, when the OE is 1
+	 * the pin is driven low.  This matches the "open drain" or "open
+	 * collector" convention.
+	 */
+	if (state)
+		reg &= ~QSFP_HFI0_I2CCLK;
+	else
+		reg |= QSFP_HFI0_I2CCLK;
+	write_csr(dd, target_oe, reg);
+	/* do a read to force the write into the chip */
+	(void)read_csr(dd, target_oe);
+}
+
+static int hfi1_getsda(void *data)
+{
+	struct hfi1_i2c_bus *bus = (struct hfi1_i2c_bus *)data;
+	u64 reg;
+	u32 target_in;
+
+	hfi1_setsda(data, 1);	/* clear OE so we do not pull line down */
+	udelay(2);		/* 1us pull up + 250ns hold */
+
+	target_in = i2c_in_csr(bus->num);
+	reg = read_csr(bus->controlling_dd, target_in);
+	return !!(reg & QSFP_HFI0_I2CDAT);
+}
+
+static int hfi1_getscl(void *data)
+{
+	struct hfi1_i2c_bus *bus = (struct hfi1_i2c_bus *)data;
+	u64 reg;
+	u32 target_in;
+
+	hfi1_setscl(data, 1);	/* clear OE so we do not pull line down */
+	udelay(2);		/* 1us pull up + 250ns hold */
+
+	target_in = i2c_in_csr(bus->num);
+	reg = read_csr(bus->controlling_dd, target_in);
+	return !!(reg & QSFP_HFI0_I2CCLK);
+}
 
 /*
- * QSFP support for hfi driver, using "Two Wire Serial Interface" driver
- * in twsi.c
+ * Allocate and initialize the given i2c bus number.
+ * Returns NULL on failure.
  */
-#define I2C_MAX_RETRY 4
+static struct hfi1_i2c_bus *init_i2c_bus(struct hfi1_devdata *dd,
+					 struct hfi1_asic_data *ad, int num)
+{
+	struct hfi1_i2c_bus *bus;
+	int ret;
+
+	bus = kzalloc(sizeof(*bus), GFP_KERNEL);
+	if (!bus)
+		return NULL;
+
+	bus->controlling_dd = dd;
+	bus->num = num;	/* our bus number */
+
+	bus->algo.setsda = hfi1_setsda;
+	bus->algo.setscl = hfi1_setscl;
+	bus->algo.getsda = hfi1_getsda;
+	bus->algo.getscl = hfi1_getscl;
+	bus->algo.udelay = 5;
+	bus->algo.timeout = usecs_to_jiffies(50);
+	bus->algo.data = bus;
+
+	bus->adapter.owner = THIS_MODULE;
+	bus->adapter.algo_data = &bus->algo;
+	bus->adapter.dev.parent = &dd->pcidev->dev;
+	snprintf(bus->adapter.name, sizeof(bus->adapter.name),
+		 "hfi1_i2c%d", num);
+
+	ret = i2c_bit_add_bus(&bus->adapter);
+	if (ret) {
+		dd_dev_info(dd, "%s: unable to add i2c bus %d, err %d\n",
+			    __func__, num, ret);
+		kfree(bus);
+		return NULL;
+	}
+
+	return bus;
+}
 
 /*
- * Raw i2c write.  No set-up or lock checking.
+ * Initialize i2c buses.
+ * Return 0 on success, -errno on error.
  */
-static int __i2c_write(struct hfi1_pportdata *ppd, u32 target, int i2c_addr,
-		       int offset, void *bp, int len)
+int set_up_i2c(struct hfi1_devdata *dd, struct hfi1_asic_data *ad)
 {
-	struct hfi1_devdata *dd = ppd->dd;
-	int ret, cnt;
-	u8 *buff = bp;
+	ad->i2c_bus0 = init_i2c_bus(dd, ad, 0);
+	ad->i2c_bus1 = init_i2c_bus(dd, ad, 1);
+	if (!ad->i2c_bus0 || !ad->i2c_bus1)
+		return -ENOMEM;
+	return 0;
+};
 
-	cnt = 0;
-	while (cnt < len) {
-		int wlen = len - cnt;
+static void clean_i2c_bus(struct hfi1_i2c_bus *bus)
+{
+	if (bus) {
+		i2c_del_adapter(&bus->adapter);
+		kfree(bus);
+	}
+}
 
-		ret = hfi1_twsi_blk_wr(dd, target, i2c_addr, offset,
-				       buff + cnt, wlen);
-		if (ret) {
-			/* hfi1_twsi_blk_wr() 1 for error, else 0 */
-			return -EIO;
-		}
-		offset += wlen;
-		cnt += wlen;
+void clean_up_i2c(struct hfi1_devdata *dd, struct hfi1_asic_data *ad)
+{
+	clean_i2c_bus(ad->i2c_bus0);
+	ad->i2c_bus0 = NULL;
+	clean_i2c_bus(ad->i2c_bus1);
+	ad->i2c_bus1 = NULL;
+}
+
+static int i2c_bus_write(struct hfi1_devdata *dd, struct hfi1_i2c_bus *i2c,
+			 u8 slave_addr, int offset, int offset_size,
+			 u8 *data, u16 len)
+{
+	int ret;
+	int num_msgs;
+	u8 offset_bytes[2];
+	struct i2c_msg msgs[2];
+
+	switch (offset_size) {
+	case 0:
+		num_msgs = 1;
+		msgs[0].addr = slave_addr;
+		msgs[0].flags = 0;
+		msgs[0].len = len;
+		msgs[0].buf = data;
+		break;
+	case 2:
+		offset_bytes[1] = (offset >> 8) & 0xff;
+		/* fall through */
+	case 1:
+		num_msgs = 2;
+		offset_bytes[0] = offset & 0xff;
+
+		msgs[0].addr = slave_addr;
+		msgs[0].flags = 0;
+		msgs[0].len = offset_size;
+		msgs[0].buf = offset_bytes;
+
+		msgs[1].addr = slave_addr;
+		msgs[1].flags = I2C_M_NOSTART,
+		msgs[1].len = len;
+		msgs[1].buf = data;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	i2c->controlling_dd = dd;
+	ret = i2c_transfer(&i2c->adapter, msgs, num_msgs);
+	if (ret != num_msgs) {
+		dd_dev_err(dd, "%s: bus %d, i2c slave 0x%x, offset 0x%x, len 0x%x; write failed, ret %d\n",
+			   __func__, i2c->num, slave_addr, offset, len, ret);
+		return ret < 0 ? ret : -EIO;
 	}
+	return 0;
+}
 
-	/* Must wait min 20us between qsfp i2c transactions */
-	udelay(20);
+static int i2c_bus_read(struct hfi1_devdata *dd, struct hfi1_i2c_bus *bus,
+			u8 slave_addr, int offset, int offset_size,
+			u8 *data, u16 len)
+{
+	int ret;
+	int num_msgs;
+	u8 offset_bytes[2];
+	struct i2c_msg msgs[2];
+
+	switch (offset_size) {
+	case 0:
+		num_msgs = 1;
+		msgs[0].addr = slave_addr;
+		msgs[0].flags = I2C_M_RD;
+		msgs[0].len = len;
+		msgs[0].buf = data;
+		break;
+	case 2:
+		offset_bytes[1] = (offset >> 8) & 0xff;
+		/* fall through */
+	case 1:
+		num_msgs = 2;
+		offset_bytes[0] = offset & 0xff;
+
+		msgs[0].addr = slave_addr;
+		msgs[0].flags = 0;
+		msgs[0].len = offset_size;
+		msgs[0].buf = offset_bytes;
+
+		msgs[1].addr = slave_addr;
+		msgs[1].flags = I2C_M_RD,
+		msgs[1].len = len;
+		msgs[1].buf = data;
+		break;
+	default:
+		return -EINVAL;
+	}
 
-	return cnt;
+	bus->controlling_dd = dd;
+	ret = i2c_transfer(&bus->adapter, msgs, num_msgs);
+	if (ret != num_msgs) {
+		dd_dev_err(dd, "%s: bus %d, i2c slave 0x%x, offset 0x%x, len 0x%x; read failed, ret %d\n",
+			   __func__, bus->num, slave_addr, offset, len, ret);
+		return ret < 0 ? ret : -EIO;
+	}
+	return 0;
+}
+
+/*
+ * Raw i2c write.  No set-up or lock checking.
+ *
+ * Return 0 on success, -errno on error.
+ */
+static int __i2c_write(struct hfi1_pportdata *ppd, u32 target, int i2c_addr,
+		       int offset, void *bp, int len)
+{
+	struct hfi1_devdata *dd = ppd->dd;
+	struct hfi1_i2c_bus *bus;
+	u8 slave_addr;
+	int offset_size;
+
+	bus = target ? dd->asic_data->i2c_bus1 : dd->asic_data->i2c_bus0;
+	slave_addr = (i2c_addr & 0xff) >> 1; /* convert to 7-bit addr */
+	offset_size = (i2c_addr >> 8) & 0x3;
+	return i2c_bus_write(dd, bus, slave_addr, offset, offset_size, bp, len);
 }
 
 /*
  * Caller must hold the i2c chain resource.
+ *
+ * Return number of bytes written, or -errno.
  */
 int i2c_write(struct hfi1_pportdata *ppd, u32 target, int i2c_addr, int offset,
 	      void *bp, int len)
@@ -99,63 +338,36 @@ int i2c_write(struct hfi1_pportdata *ppd, u32 target, int i2c_addr, int offset,
 	if (!check_chip_resource(ppd->dd, i2c_target(target), __func__))
 		return -EACCES;
 
-	/* make sure the TWSI bus is in a sane state */
-	ret = hfi1_twsi_reset(ppd->dd, target);
-	if (ret) {
-		hfi1_dev_porterr(ppd->dd, ppd->port,
-				 "I2C chain %d write interface reset failed\n",
-				 target);
+	ret = __i2c_write(ppd, target, i2c_addr, offset, bp, len);
+	if (ret)
 		return ret;
-	}
 
-	return __i2c_write(ppd, target, i2c_addr, offset, bp, len);
+	return len;
 }
 
 /*
  * Raw i2c read.  No set-up or lock checking.
+ *
+ * Return 0 on success, -errno on error.
  */
 static int __i2c_read(struct hfi1_pportdata *ppd, u32 target, int i2c_addr,
 		      int offset, void *bp, int len)
 {
 	struct hfi1_devdata *dd = ppd->dd;
-	int ret, cnt, pass = 0;
-	int orig_offset = offset;
-
-	cnt = 0;
-	while (cnt < len) {
-		int rlen = len - cnt;
-
-		ret = hfi1_twsi_blk_rd(dd, target, i2c_addr, offset,
-				       bp + cnt, rlen);
-		/* Some QSFP's fail first try. Retry as experiment */
-		if (ret && cnt == 0 && ++pass < I2C_MAX_RETRY)
-			continue;
-		if (ret) {
-			/* hfi1_twsi_blk_rd() 1 for error, else 0 */
-			ret = -EIO;
-			goto exit;
-		}
-		offset += rlen;
-		cnt += rlen;
-	}
-
-	ret = cnt;
-
-exit:
-	if (ret < 0) {
-		hfi1_dev_porterr(dd, ppd->port,
-				 "I2C chain %d read failed, addr 0x%x, offset 0x%x, len %d\n",
-				 target, i2c_addr, orig_offset, len);
-	}
-
-	/* Must wait min 20us between qsfp i2c transactions */
-	udelay(20);
-
-	return ret;
+	struct hfi1_i2c_bus *bus;
+	u8 slave_addr;
+	int offset_size;
+
+	bus = target ? dd->asic_data->i2c_bus1 : dd->asic_data->i2c_bus0;
+	slave_addr = (i2c_addr & 0xff) >> 1; /* convert to 7-bit addr */
+	offset_size = (i2c_addr >> 8) & 0x3;
+	return i2c_bus_read(dd, bus, slave_addr, offset, offset_size, bp, len);
 }
 
 /*
  * Caller must hold the i2c chain resource.
+ *
+ * Return number of bytes read, or -errno.
  */
 int i2c_read(struct hfi1_pportdata *ppd, u32 target, int i2c_addr, int offset,
 	     void *bp, int len)
@@ -165,16 +377,11 @@ int i2c_read(struct hfi1_pportdata *ppd, u32 target, int i2c_addr, int offset,
 	if (!check_chip_resource(ppd->dd, i2c_target(target), __func__))
 		return -EACCES;
 
-	/* make sure the TWSI bus is in a sane state */
-	ret = hfi1_twsi_reset(ppd->dd, target);
-	if (ret) {
-		hfi1_dev_porterr(ppd->dd, ppd->port,
-				 "I2C chain %d read interface reset failed\n",
-				 target);
+	ret = __i2c_read(ppd, target, i2c_addr, offset, bp, len);
+	if (ret)
 		return ret;
-	}
 
-	return __i2c_read(ppd, target, i2c_addr, offset, bp, len);
+	return len;
 }
 
 /*
@@ -182,6 +389,8 @@ int i2c_read(struct hfi1_pportdata *ppd, u32 target, int i2c_addr, int offset,
  * by writing @addr = ((256 * n) + m)
  *
  * Caller must hold the i2c chain resource.
+ *
+ * Return number of bytes written or -errno.
  */
 int qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 	       int len)
@@ -189,21 +398,12 @@ int qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 	int count = 0;
 	int offset;
 	int nwrite;
-	int ret;
+	int ret = 0;
 	u8 page;
 
 	if (!check_chip_resource(ppd->dd, i2c_target(target), __func__))
 		return -EACCES;
 
-	/* make sure the TWSI bus is in a sane state */
-	ret = hfi1_twsi_reset(ppd->dd, target);
-	if (ret) {
-		hfi1_dev_porterr(ppd->dd, ppd->port,
-				 "QSFP chain %d write interface reset failed\n",
-				 target);
-		return ret;
-	}
-
 	while (count < len) {
 		/*
 		 * Set the qsfp page based on a zero-based address
@@ -213,11 +413,12 @@ int qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 
 		ret = __i2c_write(ppd, target, QSFP_DEV | QSFP_OFFSET_SIZE,
 				  QSFP_PAGE_SELECT_BYTE_OFFS, &page, 1);
-		if (ret != 1) {
+		/* QSFPs require a 5-10msec delay after write operations */
+		mdelay(5);
+		if (ret) {
 			hfi1_dev_porterr(ppd->dd, ppd->port,
 					 "QSFP chain %d can't write QSFP_PAGE_SELECT_BYTE: %d\n",
 					 target, ret);
-			ret = -EIO;
 			break;
 		}
 
@@ -229,11 +430,13 @@ int qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 
 		ret = __i2c_write(ppd, target, QSFP_DEV | QSFP_OFFSET_SIZE,
 				  offset, bp + count, nwrite);
-		if (ret <= 0)	/* stop on error or nothing written */
+		/* QSFPs require a 5-10msec delay after write operations */
+		mdelay(5);
+		if (ret)	/* stop on error */
 			break;
 
-		count += ret;
-		addr += ret;
+		count += nwrite;
+		addr += nwrite;
 	}
 
 	if (ret < 0)
@@ -266,6 +469,8 @@ int one_qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
  * by reading @addr = ((256 * n) + m)
  *
  * Caller must hold the i2c chain resource.
+ *
+ * Return the number of bytes read or -errno.
  */
 int qsfp_read(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 	      int len)
@@ -273,21 +478,12 @@ int qsfp_read(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 	int count = 0;
 	int offset;
 	int nread;
-	int ret;
+	int ret = 0;
 	u8 page;
 
 	if (!check_chip_resource(ppd->dd, i2c_target(target), __func__))
 		return -EACCES;
 
-	/* make sure the TWSI bus is in a sane state */
-	ret = hfi1_twsi_reset(ppd->dd, target);
-	if (ret) {
-		hfi1_dev_porterr(ppd->dd, ppd->port,
-				 "QSFP chain %d read interface reset failed\n",
-				 target);
-		return ret;
-	}
-
 	while (count < len) {
 		/*
 		 * Set the qsfp page based on a zero-based address
@@ -296,11 +492,12 @@ int qsfp_read(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 		page = (u8)(addr / QSFP_PAGESIZE);
 		ret = __i2c_write(ppd, target, QSFP_DEV | QSFP_OFFSET_SIZE,
 				  QSFP_PAGE_SELECT_BYTE_OFFS, &page, 1);
-		if (ret != 1) {
+		/* QSFPs require a 5-10msec delay after write operations */
+		mdelay(5);
+		if (ret) {
 			hfi1_dev_porterr(ppd->dd, ppd->port,
 					 "QSFP chain %d can't write QSFP_PAGE_SELECT_BYTE: %d\n",
 					 target, ret);
-			ret = -EIO;
 			break;
 		}
 
@@ -310,15 +507,13 @@ int qsfp_read(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 		if (((addr % QSFP_RW_BOUNDARY) + nread) > QSFP_RW_BOUNDARY)
 			nread = QSFP_RW_BOUNDARY - (addr % QSFP_RW_BOUNDARY);
 
-		/* QSFPs require a 5-10msec delay after write operations */
-		mdelay(5);
 		ret = __i2c_read(ppd, target, QSFP_DEV | QSFP_OFFSET_SIZE,
 				 offset, bp + count, nread);
-		if (ret <= 0)	/* stop on error or nothing read */
+		if (ret)	/* stop on error */
 			break;
 
-		count += ret;
-		addr += ret;
+		count += nread;
+		addr += nread;
 	}
 
 	if (ret < 0)
diff --git a/drivers/infiniband/hw/hfi1/qsfp.h b/drivers/infiniband/hw/hfi1/qsfp.h
index dadc66c..69275eb 100644
--- a/drivers/infiniband/hw/hfi1/qsfp.h
+++ b/drivers/infiniband/hw/hfi1/qsfp.h
@@ -238,3 +238,6 @@ int one_qsfp_write(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 		   int len);
 int one_qsfp_read(struct hfi1_pportdata *ppd, u32 target, int addr, void *bp,
 		  int len);
+struct hfi1_asic_data;
+int set_up_i2c(struct hfi1_devdata *dd, struct hfi1_asic_data *ad);
+void clean_up_i2c(struct hfi1_devdata *dd, struct hfi1_asic_data *ad);


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next 12/18] IB/hfi1: Remove TWSI references
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (10 preceding siblings ...)
  2016-07-01 23:01   ` [PATCH for-next 11/18] IB/hfi1: Use built-in i2c bit-shift bus adapter Dennis Dalessandro
@ 2016-07-01 23:01   ` Dennis Dalessandro
  2016-07-01 23:01   ` [PATCH for-next 13/18] IB/hfi1: Improve SDMA engine assignment for user SDMA Dennis Dalessandro
                     ` (6 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Easwar Hariharan, Dean Luick

From: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Remove the TWSI code.  The driver now uses the kernel's built-in
i2c bit bus module.

Cc: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
Reviewed-by: Easwar Hariharan <easwar.hariharan-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/Makefile |    2 
 drivers/infiniband/hw/hfi1/chip.c   |   31 --
 drivers/infiniband/hw/hfi1/chip.h   |    2 
 drivers/infiniband/hw/hfi1/twsi.c   |  489 -----------------------------------
 drivers/infiniband/hw/hfi1/twsi.h   |   65 -----
 5 files changed, 1 insertions(+), 588 deletions(-)
 delete mode 100644 drivers/infiniband/hw/hfi1/twsi.c
 delete mode 100644 drivers/infiniband/hw/hfi1/twsi.h

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 9b5382c..0cf97a0 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_INFINIBAND_HFI1) += hfi1.o
 hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
 	eprom.o file_ops.o firmware.o \
 	init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
-	qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o twsi.o \
+	qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
 	uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
 	verbs_txreq.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 22bfe0e..40d485b 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -12349,37 +12349,6 @@ u8 hfi1_ibphys_portstate(struct hfi1_pportdata *ppd)
 	return ib_pstate;
 }
 
-/*
- * Read/modify/write ASIC_QSFP register bits as selected by mask
- * data: 0 or 1 in the positions depending on what needs to be written
- * dir: 0 for read, 1 for write
- * mask: select by setting
- *      I2CCLK  (bit 0)
- *      I2CDATA (bit 1)
- */
-u64 hfi1_gpio_mod(struct hfi1_devdata *dd, u32 target, u32 data, u32 dir,
-		  u32 mask)
-{
-	u64 qsfp_oe, target_oe;
-
-	target_oe = target ? ASIC_QSFP2_OE : ASIC_QSFP1_OE;
-	if (mask) {
-		/* We are writing register bits, so lock access */
-		dir &= mask;
-		data &= mask;
-
-		qsfp_oe = read_csr(dd, target_oe);
-		qsfp_oe = (qsfp_oe & ~(u64)mask) | (u64)dir;
-		write_csr(dd, target_oe, qsfp_oe);
-	}
-	/* We are exclusively reading bits here, but it is unlikely
-	 * we'll get valid data when we set the direction of the pin
-	 * in the same call, so read should call this function again
-	 * to get valid data
-	 */
-	return read_csr(dd, target ? ASIC_QSFP2_IN : ASIC_QSFP1_IN);
-}
-
 #define CLEAR_STATIC_RATE_CONTROL_SMASK(r) \
 (r &= ~SEND_CTXT_CHECK_ENABLE_DISALLOW_PBC_STATIC_RATE_CONTROL_SMASK)
 
diff --git a/drivers/infiniband/hw/hfi1/chip.h b/drivers/infiniband/hw/hfi1/chip.h
index 66a3279..d0a4ddb 100644
--- a/drivers/infiniband/hw/hfi1/chip.h
+++ b/drivers/infiniband/hw/hfi1/chip.h
@@ -1338,8 +1338,6 @@ struct hfi1_message_header *hfi1_get_msgheader(
 				struct hfi1_devdata *dd, __le32 *rhf_addr);
 int hfi1_get_base_kinfo(struct hfi1_ctxtdata *rcd,
 			struct hfi1_ctxt_info *kinfo);
-u64 hfi1_gpio_mod(struct hfi1_devdata *dd, u32 target, u32 data, u32 dir,
-		  u32 mask);
 int hfi1_init_ctxt(struct send_context *sc);
 void hfi1_put_tid(struct hfi1_devdata *dd, u32 index,
 		  u32 type, unsigned long pa, u16 order);
diff --git a/drivers/infiniband/hw/hfi1/twsi.c b/drivers/infiniband/hw/hfi1/twsi.c
deleted file mode 100644
index e82e52a..0000000
--- a/drivers/infiniband/hw/hfi1/twsi.c
+++ /dev/null
@@ -1,489 +0,0 @@
-/*
- * Copyright(c) 2015, 2016 Intel Corporation.
- *
- * This file is provided under a dual BSD/GPLv2 license.  When using or
- * redistributing this file, you may do so under either license.
- *
- * GPL LICENSE SUMMARY
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * BSD LICENSE
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- *  - Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *  - Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *  - Neither the name of Intel Corporation nor the names of its
- *    contributors may be used to endorse or promote products derived
- *    from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- */
-
-#include <linux/delay.h>
-#include <linux/pci.h>
-#include <linux/vmalloc.h>
-
-#include "hfi.h"
-#include "twsi.h"
-
-/*
- * "Two Wire Serial Interface" support.
- *
- * Originally written for a not-quite-i2c serial eeprom, which is
- * still used on some supported boards. Later boards have added a
- * variety of other uses, most board-specific, so the bit-boffing
- * part has been split off to this file, while the other parts
- * have been moved to chip-specific files.
- *
- * We have also dropped all pretense of fully generic (e.g. pretend
- * we don't know whether '1' is the higher voltage) interface, as
- * the restrictions of the generic i2c interface (e.g. no access from
- * driver itself) make it unsuitable for this use.
- */
-
-#define READ_CMD 1
-#define WRITE_CMD 0
-
-/**
- * i2c_wait_for_writes - wait for a write
- * @dd: the hfi1_ib device
- *
- * We use this instead of udelay directly, so we can make sure
- * that previous register writes have been flushed all the way
- * to the chip.  Since we are delaying anyway, the cost doesn't
- * hurt, and makes the bit twiddling more regular
- */
-static void i2c_wait_for_writes(struct hfi1_devdata *dd, u32 target)
-{
-	/*
-	 * implicit read of EXTStatus is as good as explicit
-	 * read of scratch, if all we want to do is flush
-	 * writes.
-	 */
-	hfi1_gpio_mod(dd, target, 0, 0, 0);
-	rmb(); /* inlined, so prevent compiler reordering */
-}
-
-/*
- * QSFP modules are allowed to hold SCL low for 500uSec. Allow twice that
- * for "almost compliant" modules
- */
-#define SCL_WAIT_USEC 1000
-
-/* BUF_WAIT is time bus must be free between STOP or ACK and to next START.
- * Should be 20, but some chips need more.
- */
-#define TWSI_BUF_WAIT_USEC 60
-
-static void scl_out(struct hfi1_devdata *dd, u32 target, u8 bit)
-{
-	u32 mask;
-
-	udelay(1);
-
-	mask = QSFP_HFI0_I2CCLK;
-
-	/* SCL is meant to be bare-drain, so never set "OUT", just DIR */
-	hfi1_gpio_mod(dd, target, 0, bit ? 0 : mask, mask);
-
-	/*
-	 * Allow for slow slaves by simple
-	 * delay for falling edge, sampling on rise.
-	 */
-	if (!bit) {
-		udelay(2);
-	} else {
-		int rise_usec;
-
-		for (rise_usec = SCL_WAIT_USEC; rise_usec > 0; rise_usec -= 2) {
-			if (mask & hfi1_gpio_mod(dd, target, 0, 0, 0))
-				break;
-			udelay(2);
-		}
-		if (rise_usec <= 0)
-			dd_dev_err(dd, "SCL interface stuck low > %d uSec\n",
-				   SCL_WAIT_USEC);
-	}
-	i2c_wait_for_writes(dd, target);
-}
-
-static u8 scl_in(struct hfi1_devdata *dd, u32 target, int wait)
-{
-	u32 read_val, mask;
-
-	mask = QSFP_HFI0_I2CCLK;
-	/* SCL is meant to be bare-drain, so never set "OUT", just DIR */
-	hfi1_gpio_mod(dd, target, 0, 0, mask);
-	read_val = hfi1_gpio_mod(dd, target, 0, 0, 0);
-	if (wait)
-		i2c_wait_for_writes(dd, target);
-	return (read_val & mask) >> GPIO_SCL_NUM;
-}
-
-static void sda_out(struct hfi1_devdata *dd, u32 target, u8 bit)
-{
-	u32 mask;
-
-	mask = QSFP_HFI0_I2CDAT;
-
-	/* SDA is meant to be bare-drain, so never set "OUT", just DIR */
-	hfi1_gpio_mod(dd, target, 0, bit ? 0 : mask, mask);
-
-	i2c_wait_for_writes(dd, target);
-	udelay(2);
-}
-
-static u8 sda_in(struct hfi1_devdata *dd, u32 target, int wait)
-{
-	u32 read_val, mask;
-
-	mask = QSFP_HFI0_I2CDAT;
-	/* SDA is meant to be bare-drain, so never set "OUT", just DIR */
-	hfi1_gpio_mod(dd, target, 0, 0, mask);
-	read_val = hfi1_gpio_mod(dd, target, 0, 0, 0);
-	if (wait)
-		i2c_wait_for_writes(dd, target);
-	return (read_val & mask) >> GPIO_SDA_NUM;
-}
-
-/**
- * i2c_ackrcv - see if ack following write is true
- * @dd: the hfi1_ib device
- */
-static int i2c_ackrcv(struct hfi1_devdata *dd, u32 target)
-{
-	u8 ack_received;
-
-	/* AT ENTRY SCL = LOW */
-	/* change direction, ignore data */
-	ack_received = sda_in(dd, target, 1);
-	scl_out(dd, target, 1);
-	ack_received = sda_in(dd, target, 1) == 0;
-	scl_out(dd, target, 0);
-	return ack_received;
-}
-
-static void stop_cmd(struct hfi1_devdata *dd, u32 target);
-
-/**
- * rd_byte - read a byte, sending STOP on last, else ACK
- * @dd: the hfi1_ib device
- *
- * Returns byte shifted out of device
- */
-static int rd_byte(struct hfi1_devdata *dd, u32 target, int last)
-{
-	int bit_cntr, data;
-
-	data = 0;
-
-	for (bit_cntr = 7; bit_cntr >= 0; --bit_cntr) {
-		data <<= 1;
-		scl_out(dd, target, 1);
-		data |= sda_in(dd, target, 0);
-		scl_out(dd, target, 0);
-	}
-	if (last) {
-		scl_out(dd, target, 1);
-		stop_cmd(dd, target);
-	} else {
-		sda_out(dd, target, 0);
-		scl_out(dd, target, 1);
-		scl_out(dd, target, 0);
-		sda_out(dd, target, 1);
-	}
-	return data;
-}
-
-/**
- * wr_byte - write a byte, one bit at a time
- * @dd: the hfi1_ib device
- * @data: the byte to write
- *
- * Returns 0 if we got the following ack, otherwise 1
- */
-static int wr_byte(struct hfi1_devdata *dd, u32 target, u8 data)
-{
-	int bit_cntr;
-	u8 bit;
-
-	for (bit_cntr = 7; bit_cntr >= 0; bit_cntr--) {
-		bit = (data >> bit_cntr) & 1;
-		sda_out(dd, target, bit);
-		scl_out(dd, target, 1);
-		scl_out(dd, target, 0);
-	}
-	return (!i2c_ackrcv(dd, target)) ? 1 : 0;
-}
-
-/*
- * issue TWSI start sequence:
- * (both clock/data high, clock high, data low while clock is high)
- */
-static void start_seq(struct hfi1_devdata *dd, u32 target)
-{
-	sda_out(dd, target, 1);
-	scl_out(dd, target, 1);
-	sda_out(dd, target, 0);
-	udelay(1);
-	scl_out(dd, target, 0);
-}
-
-/**
- * stop_seq - transmit the stop sequence
- * @dd: the hfi1_ib device
- *
- * (both clock/data low, clock high, data high while clock is high)
- */
-static void stop_seq(struct hfi1_devdata *dd, u32 target)
-{
-	scl_out(dd, target, 0);
-	sda_out(dd, target, 0);
-	scl_out(dd, target, 1);
-	sda_out(dd, target, 1);
-}
-
-/**
- * stop_cmd - transmit the stop condition
- * @dd: the hfi1_ib device
- *
- * (both clock/data low, clock high, data high while clock is high)
- */
-static void stop_cmd(struct hfi1_devdata *dd, u32 target)
-{
-	stop_seq(dd, target);
-	udelay(TWSI_BUF_WAIT_USEC);
-}
-
-/**
- * hfi1_twsi_reset - reset I2C communication
- * @dd: the hfi1_ib device
- * returns 0 if ok, -EIO on error
- */
-int hfi1_twsi_reset(struct hfi1_devdata *dd, u32 target)
-{
-	int clock_cycles_left = 9;
-	u32 mask;
-
-	/* Both SCL and SDA should be high. If not, there
-	 * is something wrong.
-	 */
-	mask = QSFP_HFI0_I2CCLK | QSFP_HFI0_I2CDAT;
-
-	/*
-	 * Force pins to desired innocuous state.
-	 * This is the default power-on state with out=0 and dir=0,
-	 * So tri-stated and should be floating high (barring HW problems)
-	 */
-	hfi1_gpio_mod(dd, target, 0, 0, mask);
-
-	/* Check if SCL is low, if it is low then we have a slave device
-	 * misbehaving and there is not much we can do.
-	 */
-	if (!scl_in(dd, target, 0))
-		return -EIO;
-
-	/* Check if SDA is low, if it is low then we have to clock SDA
-	 * up to 9 times for the device to release the bus
-	 */
-	while (clock_cycles_left--) {
-		if (sda_in(dd, target, 0))
-			return 0;
-		scl_out(dd, target, 0);
-		scl_out(dd, target, 1);
-	}
-
-	return -EIO;
-}
-
-#define HFI1_TWSI_START 0x100
-#define HFI1_TWSI_STOP 0x200
-
-/* Write byte to TWSI, optionally prefixed with START or suffixed with
- * STOP.
- * returns 0 if OK (ACK received), else != 0
- */
-static int twsi_wr(struct hfi1_devdata *dd, u32 target, int data, int flags)
-{
-	int ret = 1;
-
-	if (flags & HFI1_TWSI_START)
-		start_seq(dd, target);
-
-	/* Leaves SCL low (from i2c_ackrcv()) */
-	ret = wr_byte(dd, target, data);
-
-	if (flags & HFI1_TWSI_STOP)
-		stop_cmd(dd, target);
-	return ret;
-}
-
-/* Added functionality for IBA7220-based cards */
-#define HFI1_TEMP_DEV 0x98
-
-/*
- * hfi1_twsi_blk_rd
- * General interface for data transfer from twsi devices.
- * One vestige of its former role is that it recognizes a device
- * HFI1_TWSI_NO_DEV and does the correct operation for the legacy part,
- * which responded to all TWSI device codes, interpreting them as
- * address within device. On all other devices found on board handled by
- * this driver, the device is followed by a N-byte "address" which selects
- * the "register" or "offset" within the device from which data should
- * be read.
- */
-int hfi1_twsi_blk_rd(struct hfi1_devdata *dd, u32 target, int dev, int addr,
-		     void *buffer, int len)
-{
-	u8 *bp = buffer;
-	int ret = 1;
-	int i;
-	int offset_size;
-
-	/* obtain the offset size, strip it from the device address */
-	offset_size = (dev >> 8) & 0xff;
-	dev &= 0xff;
-
-	/* allow at most a 2 byte offset */
-	if (offset_size > 2)
-		goto bail;
-
-	if (dev == HFI1_TWSI_NO_DEV) {
-		/* legacy not-really-I2C */
-		addr = (addr << 1) | READ_CMD;
-		ret = twsi_wr(dd, target, addr, HFI1_TWSI_START);
-	} else {
-		/* Actual I2C */
-		if (offset_size) {
-			ret = twsi_wr(dd, target,
-				      dev | WRITE_CMD, HFI1_TWSI_START);
-			if (ret) {
-				stop_cmd(dd, target);
-				goto bail;
-			}
-
-			for (i = 0; i < offset_size; i++) {
-				ret = twsi_wr(dd, target,
-					      (addr >> (i * 8)) & 0xff, 0);
-				udelay(TWSI_BUF_WAIT_USEC);
-				if (ret) {
-					dd_dev_err(dd, "Failed to write byte %d of offset 0x%04X\n",
-						   i, addr);
-					goto bail;
-				}
-			}
-		}
-		ret = twsi_wr(dd, target, dev | READ_CMD, HFI1_TWSI_START);
-	}
-	if (ret) {
-		stop_cmd(dd, target);
-		goto bail;
-	}
-
-	/*
-	 * block devices keeps clocking data out as long as we ack,
-	 * automatically incrementing the address. Some have "pages"
-	 * whose boundaries will not be crossed, but the handling
-	 * of these is left to the caller, who is in a better
-	 * position to know.
-	 */
-	while (len-- > 0) {
-		/*
-		 * Get and store data, sending ACK if length remaining,
-		 * else STOP
-		 */
-		*bp++ = rd_byte(dd, target, !len);
-	}
-
-	ret = 0;
-
-bail:
-	return ret;
-}
-
-/*
- * hfi1_twsi_blk_wr
- * General interface for data transfer to twsi devices.
- * One vestige of its former role is that it recognizes a device
- * HFI1_TWSI_NO_DEV and does the correct operation for the legacy part,
- * which responded to all TWSI device codes, interpreting them as
- * address within device. On all other devices found on board handled by
- * this driver, the device is followed by a N-byte "address" which selects
- * the "register" or "offset" within the device to which data should
- * be written.
- */
-int hfi1_twsi_blk_wr(struct hfi1_devdata *dd, u32 target, int dev, int addr,
-		     const void *buffer, int len)
-{
-	const u8 *bp = buffer;
-	int ret = 1;
-	int i;
-	int offset_size;
-
-	/* obtain the offset size, strip it from the device address */
-	offset_size = (dev >> 8) & 0xff;
-	dev &= 0xff;
-
-	/* allow at most a 2 byte offset */
-	if (offset_size > 2)
-		goto bail;
-
-	if (dev == HFI1_TWSI_NO_DEV) {
-		if (twsi_wr(dd, target, (addr << 1) | WRITE_CMD,
-			    HFI1_TWSI_START)) {
-			goto failed_write;
-		}
-	} else {
-		/* Real I2C */
-		if (twsi_wr(dd, target, dev | WRITE_CMD, HFI1_TWSI_START))
-			goto failed_write;
-	}
-
-	for (i = 0; i < offset_size; i++) {
-		ret = twsi_wr(dd, target, (addr >> (i * 8)) & 0xff, 0);
-		udelay(TWSI_BUF_WAIT_USEC);
-		if (ret) {
-			dd_dev_err(dd, "Failed to write byte %d of offset 0x%04X\n",
-				   i, addr);
-			goto bail;
-		}
-	}
-
-	for (i = 0; i < len; i++)
-		if (twsi_wr(dd, target, *bp++, 0))
-			goto failed_write;
-
-	ret = 0;
-
-failed_write:
-	stop_cmd(dd, target);
-
-bail:
-	return ret;
-}
diff --git a/drivers/infiniband/hw/hfi1/twsi.h b/drivers/infiniband/hw/hfi1/twsi.h
deleted file mode 100644
index 5b8a5b5..0000000
--- a/drivers/infiniband/hw/hfi1/twsi.h
+++ /dev/null
@@ -1,65 +0,0 @@
-#ifndef _TWSI_H
-#define _TWSI_H
-/*
- * Copyright(c) 2015, 2016 Intel Corporation.
- *
- * This file is provided under a dual BSD/GPLv2 license.  When using or
- * redistributing this file, you may do so under either license.
- *
- * GPL LICENSE SUMMARY
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of version 2 of the GNU General Public License as
- * published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
- * General Public License for more details.
- *
- * BSD LICENSE
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- *  - Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *  - Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *  - Neither the name of Intel Corporation nor the names of its
- *    contributors may be used to endorse or promote products derived
- *    from this software without specific prior written permission.
- *
- * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
- * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
- * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
- * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
- * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
- * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
- *
- */
-
-#define HFI1_TWSI_NO_DEV 0xFF
-
-struct hfi1_devdata;
-
-/* Bit position of SDA/SCL pins in ASIC_QSFP* registers  */
-#define  GPIO_SDA_NUM 1
-#define  GPIO_SCL_NUM 0
-
-/* these functions must be called with qsfp_lock held */
-int hfi1_twsi_reset(struct hfi1_devdata *dd, u32 target);
-int hfi1_twsi_blk_rd(struct hfi1_devdata *dd, u32 target, int dev, int addr,
-		     void *buffer, int len);
-int hfi1_twsi_blk_wr(struct hfi1_devdata *dd, u32 target, int dev, int addr,
-		     const void *buffer, int len);
-
-#endif /* _TWSI_H */


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next 13/18] IB/hfi1: Improve SDMA engine assignment for user SDMA
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (11 preceding siblings ...)
  2016-07-01 23:01   ` [PATCH for-next 12/18] IB/hfi1: Remove TWSI references Dennis Dalessandro
@ 2016-07-01 23:01   ` Dennis Dalessandro
  2016-07-01 23:02   ` [PATCH for-next 14/18] IB/hfi1: Correct receive packet handler assignment Dennis Dalessandro
                     ` (5 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:01 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dean Luick, Jianxin Xiong,
	Tadeusz Struk

From: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Currently each user context is assigned a single SDMA engine
based on the VL, context id, and subcontext id. That means for
MPI applications, each rank can only use one SDMA engine for
all messages. This can create an unwanted backlog for independent
messages going to different destinations when one destination is
congested.

This patch adds the packet "dlid" to the formula of SDMA engine
selection for user SDMA requests. A simple hash table is used
to maintain even distribution among the available SDMA engines
regardless of how the "dlid" values are distributed.
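
The mapping is simple enough to exercise standalone. A userspace
rendition of the same logic (mirroring dlid_to_selector() from the diff
below; the explicit memset stands in for the driver's lazy-init flag):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint8_t mapping[256];
static uint8_t next_selector;

static uint8_t dlid_to_selector(uint16_t dlid)
{
	int hash = ((dlid >> 8) ^ dlid) & 0xff;

	if (mapping[hash] == 0xff) {			/* unused slot */
		mapping[hash] = next_selector;
		next_selector = (next_selector + 1) & 0x7f;
	}
	return mapping[hash];
}

int main(void)
{
	uint16_t dlid;

	memset(mapping, 0xff, sizeof(mapping));		/* all slots unused */
	for (dlid = 0x1000; dlid < 0x1008; dlid++)
		printf("dlid 0x%04x -> selector %u\n",
		       dlid, dlid_to_selector(dlid));
	return 0;
}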

Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Tadeusz Struk <tadeusz.struk-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/user_sdma.c |   29 ++++++++++++++++++++++++++++-
 1 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
index 47ffd27..d16ed52 100644
--- a/drivers/infiniband/hw/hfi1/user_sdma.c
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -496,6 +496,27 @@ int hfi1_user_sdma_free_queues(struct hfi1_filedata *fd)
 	return 0;
 }
 
+static u8 dlid_to_selector(u16 dlid)
+{
+	static u8 mapping[256];
+	static int initialized;
+	static u8 next;
+	int hash;
+
+	if (!initialized) {
+		memset(mapping, 0xFF, 256);
+		initialized = 1;
+	}
+
+	hash = ((dlid >> 8) ^ dlid) & 0xFF;
+	if (mapping[hash] == 0xFF) {
+		mapping[hash] = next;
+		next = (next + 1) & 0x7F;
+	}
+
+	return mapping[hash];
+}
+
 int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 				   unsigned long dim, unsigned long *count)
 {
@@ -511,6 +532,8 @@ int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 	struct user_sdma_request *req;
 	u8 opcode, sc, vl;
 	int req_queued = 0;
+	u16 dlid;
+	u8 selector;
 
 	if (iovec[idx].iov_len < sizeof(info) + sizeof(req->hdr)) {
 		hfi1_cdbg(
@@ -686,9 +709,13 @@ int hfi1_user_sdma_process_request(struct file *fp, struct iovec *iovec,
 		idx++;
 	}
 
+	dlid = be16_to_cpu(req->hdr.lrh[1]);
+	selector = dlid_to_selector(dlid);
+
 	/* Have to select the engine */
 	req->sde = sdma_select_engine_vl(dd,
-					 (u32)(uctxt->ctxt + fd->subctxt),
+					 (u32)(uctxt->ctxt + fd->subctxt +
+					       selector),
 					 vl);
 	if (!req->sde || !sdma_running(req->sde)) {
 		ret = -ECOMM;

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next 14/18] IB/hfi1: Correct receive packet handler assignment
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (12 preceding siblings ...)
  2016-07-01 23:01   ` [PATCH for-next 13/18] IB/hfi1: Improve SDMA engine assignment for user SDMA Dennis Dalessandro
@ 2016-07-01 23:02   ` Dennis Dalessandro
  2016-07-01 23:02   ` [PATCH for-next 15/18] IB/rdmavt: Add data structures and routines for table driven post send Dennis Dalessandro
                     ` (4 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:02 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mike Marciniszyn, Jakub Pawlak

From: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Prevent processing a receive packet when its opcode is accepted
by the QP but no handler for that packet type is defined.
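
The shape of the fix: qp_ok() stops being a yes/no check that bumps
the drop counter itself and instead returns the handler pointer (or
NULL), so a hole in the opcode table can never be called through. A
standalone sketch of the pattern, with the QP state machine and
opcode masks reduced to stand-in fields:

    #include <stdio.h>

    struct packet;
    typedef void (*opcode_handler)(struct packet *pkt);

    struct qp {
        int recv_ok;            /* stand-in for the state-machine check */
        int allowed_ops;        /* opcode class this QP accepts */
        unsigned long n_pkt_drops;
    };

    struct packet { struct qp *qp; };

    static void handle_send(struct packet *pkt) { (void)pkt; puts("handled"); }

    /* the table may legitimately contain NULL holes: opcodes a QP
     * accepts on the wire but for which no handler was registered */
    static opcode_handler handler_tbl[256] = { [0x04] = handle_send };

    static opcode_handler qp_ok(int opcode, struct packet *pkt)
    {
        if (!pkt->qp->recv_ok)
            return NULL;
        if ((opcode & 0xE0) != pkt->qp->allowed_ops)
            return NULL;
        return handler_tbl[opcode];     /* NULL if nothing registered */
    }

    static void ib_rcv(int opcode, struct packet *pkt)
    {
        opcode_handler h = qp_ok(opcode, pkt);

        if (h)
            h(pkt);                     /* never calls a NULL entry */
        else
            pkt->qp->n_pkt_drops++;     /* caller counts the drop */
    }

    int main(void)
    {
        struct qp qp = { .recv_ok = 1, .allowed_ops = 0x00 };
        struct packet pkt = { .qp = &qp };

        ib_rcv(0x04, &pkt);     /* handled */
        ib_rcv(0x05, &pkt);     /* allowed opcode, no handler: dropped */
        printf("drops: %lu\n", qp.n_pkt_drops);
        return 0;
    }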

Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Jakub Pawlak <jakub.pawlak-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/verbs.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 849c4b9..6ad3f9d 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -540,19 +540,15 @@ void hfi1_skip_sge(struct rvt_sge_state *ss, u32 length, int release)
 /*
  * Make sure the QP is ready and able to accept the given opcode.
  */
-static inline int qp_ok(int opcode, struct hfi1_packet *packet)
+static inline opcode_handler qp_ok(int opcode, struct hfi1_packet *packet)
 {
-	struct hfi1_ibport *ibp;
-
 	if (!(ib_rvt_state_ops[packet->qp->state] & RVT_PROCESS_RECV_OK))
-		goto dropit;
+		return NULL;
 	if (((opcode & RVT_OPCODE_QP_MASK) == packet->qp->allowed_ops) ||
 	    (opcode == IB_OPCODE_CNP))
-		return 1;
-dropit:
-	ibp = &packet->rcd->ppd->ibport_data;
-	ibp->rvp.n_pkt_drops++;
-	return 0;
+		return opcode_handler_tbl[opcode];
+
+	return NULL;
 }
 
 /**
@@ -571,6 +567,7 @@ void hfi1_ib_rcv(struct hfi1_packet *packet)
 	struct hfi1_pportdata *ppd = rcd->ppd;
 	struct hfi1_ibport *ibp = &ppd->ibport_data;
 	struct rvt_dev_info *rdi = &ppd->dd->verbs_dev.rdi;
+	opcode_handler packet_handler;
 	unsigned long flags;
 	u32 qp_num;
 	int lnh;
@@ -616,8 +613,11 @@ void hfi1_ib_rcv(struct hfi1_packet *packet)
 		list_for_each_entry_rcu(p, &mcast->qp_list, list) {
 			packet->qp = p->qp;
 			spin_lock_irqsave(&packet->qp->r_lock, flags);
-			if (likely((qp_ok(opcode, packet))))
-				opcode_handler_tbl[opcode](packet);
+			packet_handler = qp_ok(opcode, packet);
+			if (likely(packet_handler))
+				packet_handler(packet);
+			else
+				ibp->rvp.n_pkt_drops++;
 			spin_unlock_irqrestore(&packet->qp->r_lock, flags);
 		}
 		/*
@@ -634,8 +634,11 @@ void hfi1_ib_rcv(struct hfi1_packet *packet)
 			goto drop;
 		}
 		spin_lock_irqsave(&packet->qp->r_lock, flags);
-		if (likely((qp_ok(opcode, packet))))
-			opcode_handler_tbl[opcode](packet);
+		packet_handler = qp_ok(opcode, packet);
+		if (likely(packet_handler))
+			packet_handler(packet);
+		else
+			ibp->rvp.n_pkt_drops++;
 		spin_unlock_irqrestore(&packet->qp->r_lock, flags);
 		rcu_read_unlock();
 	}


* [PATCH for-next 15/18] IB/rdmavt: Add data structures and routines for table driven post send
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (13 preceding siblings ...)
  2016-07-01 23:02   ` [PATCH for-next 14/18] IB/hfi1: Correct receive packet handler assignment Dennis Dalessandro
@ 2016-07-01 23:02   ` Dennis Dalessandro
  2016-07-01 23:02   ` [PATCH for-next 16/18] IB/hfi1: Add hfi1 post send tables Dennis Dalessandro
                     ` (3 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:02 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: Ashutosh Dixit, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Mike Marciniszyn, Jianxin Xiong

From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add flexibility for driver-dependent operations in post send,
since different drivers support different sets of post send
operations.

This includes data structure definitions to support a table
driven scheme along with the necessary validation routine
using the new table.
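
The scheme in miniature: one table entry per opcode records how many
bytes of the work request to copy and which QP types may use it, so
validation collapses into a few table lookups. A compilable sketch
with made-up sizes, opcode space, and flag values; the field names
follow the patch, everything else is a stand-in:

    #include <stdio.h>
    #include <stddef.h>
    #include <errno.h>

    #define BIT(n)      (1u << (n))
    #define OP_MAX      8
    #define OP_PRIV     0x1     /* kernel-only operation */

    enum qp_type { QPT_RC, QPT_UC, QPT_UD };

    struct op_params {
        size_t length;              /* bytes of the WR copied to the swqe */
        unsigned int qpt_support;   /* bit mask of allowed QP types */
        unsigned int flags;
    };

    struct wr { int opcode; };

    /* entries not listed keep .length == 0 and are rejected below */
    static const struct op_params post_parms[OP_MAX] = {
        [0] = { .length = 64, .qpt_support = BIT(QPT_RC) | BIT(QPT_UC) },
        [1] = { .length = 48, .qpt_support = BIT(QPT_RC), .flags = OP_PRIV },
    };

    static int valid_operation(enum qp_type qpt, int user_pd,
                               const struct wr *wr)
    {
        if (wr->opcode < 0 || wr->opcode >= OP_MAX ||
            !post_parms[wr->opcode].length)
            return -EINVAL;     /* hole in the table: unsupported op */
        if (!(post_parms[wr->opcode].qpt_support & BIT(qpt)))
            return -EINVAL;     /* op not allowed on this QP type */
        if ((post_parms[wr->opcode].flags & OP_PRIV) && user_pd)
            return -EINVAL;     /* privileged op from a user PD */
        return (int)post_parms[wr->opcode].length;  /* copy length */
    }

    int main(void)
    {
        struct wr send = { .opcode = 0 }, priv = { .opcode = 1 };

        printf("%d\n", valid_operation(QPT_UC, 1, &send)); /* 64 */
        printf("%d\n", valid_operation(QPT_UC, 1, &priv)); /* -EINVAL */
        return 0;
    }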

Reviewed-by: Ashutosh Dixit <ashutosh.dixit-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/sw/rdmavt/qp.c |   67 ++++++++++++++++++++++++++++++++++---
 include/rdma/rdma_vt.h            |    3 ++
 include/rdma/rdmavt_qp.h          |   28 +++++++++++++--
 3 files changed, 89 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c
index 41ba7e9..d2b5b54 100644
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -613,6 +613,7 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 	struct rvt_dev_info *rdi = ib_to_rvt(ibpd->device);
 	void *priv = NULL;
 	gfp_t gfp;
+	size_t sqsize;
 
 	if (!rdi)
 		return ERR_PTR(-EINVAL);
@@ -643,7 +644,8 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 		    init_attr->cap.max_recv_wr == 0)
 			return ERR_PTR(-EINVAL);
 	}
-
+	sqsize =
+		init_attr->cap.max_send_wr + 1;
 	switch (init_attr->qp_type) {
 	case IB_QPT_SMI:
 	case IB_QPT_GSI:
@@ -658,11 +660,11 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 			sizeof(struct rvt_swqe);
 		if (gfp == GFP_NOIO)
 			swq = __vmalloc(
-				(init_attr->cap.max_send_wr + 1) * sz,
+				sqsize * sz,
 				gfp | __GFP_ZERO, PAGE_KERNEL);
 		else
 			swq = vzalloc_node(
-				(init_attr->cap.max_send_wr + 1) * sz,
+				sqsize * sz,
 				rdi->dparms.node);
 		if (!swq)
 			return ERR_PTR(-ENOMEM);
@@ -747,7 +749,7 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 		INIT_LIST_HEAD(&qp->rspwait);
 		qp->state = IB_QPS_RESET;
 		qp->s_wq = swq;
-		qp->s_size = init_attr->cap.max_send_wr + 1;
+		qp->s_size = sqsize;
 		qp->s_avail = init_attr->cap.max_send_wr;
 		qp->s_max_sge = init_attr->cap.max_send_sge;
 		if (init_attr->sq_sig_type == IB_SIGNAL_REQ_WR)
@@ -1440,12 +1442,65 @@ int rvt_post_recv(struct ib_qp *ibqp, struct ib_recv_wr *wr,
 }
 
 /**
- * qp_get_savail - return number of avail send entries
+ * rvt_qp_valid_operation - validate post send wr request
+ * @qp - the qp
+ * @post_parms - the post send table for the driver
+ * @wr - the work request
  *
+ * The routine validates the operation based on the
+ * validation table and returns the length of the operation
+ * which can extend beyond struct ib_send_wr.  Operation-
+ * dependent flags key the atomic operation validation.
+ *
+ * There is an exception for UD qps that validates the pd and
+ * overrides the length to include the additional UD specific
+ * length.
+ *
+ * Returns a negative error or the length of the work request
+ * for building the swqe.
+ */
+static inline int rvt_qp_valid_operation(
+	struct rvt_qp *qp,
+	const struct rvt_operation_params *post_parms,
+	struct ib_send_wr *wr)
+{
+	int len;
+
+	if (wr->opcode >= RVT_OPERATION_MAX || !post_parms[wr->opcode].length)
+		return -EINVAL;
+	if (!(post_parms[wr->opcode].qpt_support & BIT(qp->ibqp.qp_type)))
+		return -EINVAL;
+	if ((post_parms[wr->opcode].flags & RVT_OPERATION_PRIV) &&
+	    ibpd_to_rvtpd(qp->ibqp.pd)->user)
+		return -EINVAL;
+	if (post_parms[wr->opcode].flags & RVT_OPERATION_ATOMIC_SGE &&
+	    (wr->num_sge == 0 ||
+	     wr->sg_list[0].length < sizeof(u64) ||
+	     wr->sg_list[0].addr & (sizeof(u64) - 1)))
+		return -EINVAL;
+	if (post_parms[wr->opcode].flags & RVT_OPERATION_ATOMIC &&
+	    !qp->s_max_rd_atomic)
+		return -EINVAL;
+	len = post_parms[wr->opcode].length;
+	/* UD specific */
+	if (qp->ibqp.qp_type != IB_QPT_UC &&
+	    qp->ibqp.qp_type != IB_QPT_RC) {
+		if (qp->ibqp.pd != ud_wr(wr)->ah->pd)
+			return -EINVAL;
+		len = sizeof(struct ib_ud_wr);
+	}
+	return len;
+}
+
+/**
+ * qp_get_savail - return number of avail send entries
  * @qp - the qp
  *
  * This assumes the s_hlock is held but the s_last
  * qp variable is uncontrolled.
+ *
+ * The return is adjusted to not count device specific
+ * reserved operations.
  */
 static inline u32 qp_get_savail(struct rvt_qp *qp)
 {
@@ -1481,6 +1536,8 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	u8 log_pmtu;
 	int ret;
 
+	BUILD_BUG_ON(IB_QPT_MAX >= (sizeof(u32) * BITS_PER_BYTE));
+
 	/* IB spec says that num_sge == 0 is OK. */
 	if (unlikely(wr->num_sge > qp->s_max_sge))
 		return -EINVAL;
diff --git a/include/rdma/rdma_vt.h b/include/rdma/rdma_vt.h
index 9c9a27d..3a70dc0 100644
--- a/include/rdma/rdma_vt.h
+++ b/include/rdma/rdma_vt.h
@@ -351,6 +351,9 @@ struct rvt_dev_info {
 	/* Driver specific properties */
 	struct rvt_driver_params dparms;
 
+	/* post send table */
+	const struct rvt_operation_params *post_parms;
+
 	struct rvt_mregion __rcu *dma_mr;
 	struct rvt_lkey_table lkey_table;
 
diff --git a/include/rdma/rdmavt_qp.h b/include/rdma/rdmavt_qp.h
index 6d23b87..a90d1e9 100644
--- a/include/rdma/rdmavt_qp.h
+++ b/include/rdma/rdmavt_qp.h
@@ -228,11 +228,31 @@ struct rvt_ack_entry {
 
 #define	RC_QP_SCALING_INTERVAL	5
 
-/*
- * Variables prefixed with s_ are for the requester (sender).
- * Variables prefixed with r_ are for the responder (receiver).
- * Variables prefixed with ack_ are for responder replies.
+#define RVT_OPERATION_PRIV        0x00000001
+#define RVT_OPERATION_ATOMIC      0x00000002
+#define RVT_OPERATION_ATOMIC_SGE  0x00000004
+
+#define RVT_OPERATION_MAX (IB_WR_RESERVED10 + 1)
+
+/**
+ * rvt_operation_params - op table entry
+ * @length - the length to copy into the swqe entry
+ * @qpt_support - a bit mask indicating QP type support
+ * @flags - RVT_OPERATION flags (see above)
+ *
+ * This supports table driven post send so that
+ * each driver can support a potentially different
+ * set of operations.
  *
+ **/
+
+struct rvt_operation_params {
+	size_t length;
+	u32 qpt_support;
+	u32 flags;
+};
+
+/*
  * Common variables are protected by both r_rq.lock and s_lock in that order
  * which only happens in modify_qp() or changing the QP 'state'.
  */


* [PATCH for-next 16/18] IB/hfi1: Add hfi1 post send tables
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (14 preceding siblings ...)
  2016-07-01 23:02   ` [PATCH for-next 15/18] IB/rdmavt: Add data structures and routines for table driven post send Dennis Dalessandro
@ 2016-07-01 23:02   ` Dennis Dalessandro
  2016-07-01 23:02   ` [PATCH for-next 17/18] IB/qib: Add qib post send table Dennis Dalessandro
                     ` (2 subsequent siblings)
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:02 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mike Marciniszyn, Jianxin Xiong

From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add initial table for table driven post_send support.
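
The table itself is plain designated-initializer C: any opcode a
driver leaves out keeps a zero .length and is rejected by the core's
validation routine. A toy version of the export-and-register step,
with hypothetical my_* names and sizes standing in for the hfi1
specifics:

    #include <stddef.h>

    #define BIT(n)  (1u << (n))

    enum { QPT_RC, QPT_UC, QPT_UD };
    enum { WR_SEND, WR_RDMA_WRITE, WR_MAX };

    struct op_params {
        size_t length;
        unsigned int qpt_support;
        unsigned int flags;
    };
    struct dev_info { const struct op_params *post_parms; };

    /* anything not listed keeps .length == 0 and is rejected */
    static const struct op_params my_post_parms[WR_MAX] = {
        [WR_SEND] = {
            .length = 64,
            .qpt_support = BIT(QPT_RC) | BIT(QPT_UC) | BIT(QPT_UD),
        },
        [WR_RDMA_WRITE] = {
            .length = 80,
            .qpt_support = BIT(QPT_RC) | BIT(QPT_UC),
        },
    };

    void my_register(struct dev_info *rdi)
    {
        rdi->post_parms = my_post_parms;    /* hand the table to the core */
    }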

Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/hfi1/qp.c    |   44 ++++++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/qp.h    |    2 ++
 drivers/infiniband/hw/hfi1/verbs.c |    3 ++
 3 files changed, 49 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
index 1a942ff..a8b3fc9 100644
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -52,6 +52,7 @@
 #include <linux/seq_file.h>
 #include <rdma/rdma_vt.h>
 #include <rdma/rdmavt_qp.h>
+#include <rdma/ib_verbs.h>
 
 #include "hfi.h"
 #include "qp.h"
@@ -115,6 +116,49 @@ static const u16 credit_table[31] = {
 	32768                   /* 1E */
 };
 
+const struct rvt_operation_params hfi1_post_parms[RVT_OPERATION_MAX] = {
+[IB_WR_RDMA_WRITE] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_RDMA_READ] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC,
+},
+
+[IB_WR_ATOMIC_CMP_AND_SWP] = {
+	.length = sizeof(struct ib_atomic_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC | RVT_OPERATION_ATOMIC_SGE,
+},
+
+[IB_WR_ATOMIC_FETCH_AND_ADD] = {
+	.length = sizeof(struct ib_atomic_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC | RVT_OPERATION_ATOMIC_SGE,
+},
+
+[IB_WR_RDMA_WRITE_WITH_IMM] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_SEND] = {
+	.length = sizeof(struct ib_send_wr),
+	.qpt_support = BIT(IB_QPT_UD) | BIT(IB_QPT_SMI) | BIT(IB_QPT_GSI) |
+		       BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_SEND_WITH_IMM] = {
+	.length = sizeof(struct ib_send_wr),
+	.qpt_support = BIT(IB_QPT_UD) | BIT(IB_QPT_SMI) | BIT(IB_QPT_GSI) |
+		       BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+};
+
 static void flush_tx_list(struct rvt_qp *qp)
 {
 	struct hfi1_qp_priv *priv = qp->priv;
diff --git a/drivers/infiniband/hw/hfi1/qp.h b/drivers/infiniband/hw/hfi1/qp.h
index e7bc8d6..ddf8298 100644
--- a/drivers/infiniband/hw/hfi1/qp.h
+++ b/drivers/infiniband/hw/hfi1/qp.h
@@ -54,6 +54,8 @@
 
 extern unsigned int hfi1_qp_table_size;
 
+extern const struct rvt_operation_params hfi1_post_parms[];
+
 /*
  * free_ahg - clear ahg from QP
  */
diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 6ad3f9d..a89055f 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -1683,6 +1683,9 @@ int hfi1_register_ib_device(struct hfi1_devdata *dd)
 	dd->verbs_dev.rdi.dparms.nports = dd->num_pports;
 	dd->verbs_dev.rdi.dparms.npkeys = hfi1_get_npkeys(dd);
 
+	/* post send table */
+	dd->verbs_dev.rdi.post_parms = hfi1_post_parms;
+
 	ppd = dd->pport;
 	for (i = 0; i < dd->num_pports; i++, ppd++)
 		rvt_init_port(&dd->verbs_dev.rdi,


* [PATCH for-next 17/18] IB/qib: Add qib post send table
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (15 preceding siblings ...)
  2016-07-01 23:02   ` [PATCH for-next 16/18] IB/hfi1: Add hfi1 post send tables Dennis Dalessandro
@ 2016-07-01 23:02   ` Dennis Dalessandro
  2016-07-01 23:02   ` [PATCH for-next 18/18] IB/rdmavt: Use new driver specific " Dennis Dalessandro
  2016-08-02 19:58   ` [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8 Doug Ledford
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:02 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mike Marciniszyn, Jianxin Xiong

From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Add initial table for table driven post_send support.

Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/hw/qib/qib_qp.c    |   43 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/qib/qib_verbs.c |    2 ++
 drivers/infiniband/hw/qib/qib_verbs.h |    2 ++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
index 575b737..9cc0aae 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -106,6 +106,49 @@ static u32 credit_table[31] = {
 	32768                   /* 1E */
 };
 
+const struct rvt_operation_params qib_post_parms[RVT_OPERATION_MAX] = {
+[IB_WR_RDMA_WRITE] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_RDMA_READ] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC,
+},
+
+[IB_WR_ATOMIC_CMP_AND_SWP] = {
+	.length = sizeof(struct ib_atomic_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC | RVT_OPERATION_ATOMIC_SGE,
+},
+
+[IB_WR_ATOMIC_FETCH_AND_ADD] = {
+	.length = sizeof(struct ib_atomic_wr),
+	.qpt_support = BIT(IB_QPT_RC),
+	.flags = RVT_OPERATION_ATOMIC | RVT_OPERATION_ATOMIC_SGE,
+},
+
+[IB_WR_RDMA_WRITE_WITH_IMM] = {
+	.length = sizeof(struct ib_rdma_wr),
+	.qpt_support = BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_SEND] = {
+	.length = sizeof(struct ib_send_wr),
+	.qpt_support = BIT(IB_QPT_UD) | BIT(IB_QPT_SMI) | BIT(IB_QPT_GSI) |
+		       BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+[IB_WR_SEND_WITH_IMM] = {
+	.length = sizeof(struct ib_send_wr),
+	.qpt_support = BIT(IB_QPT_UD) | BIT(IB_QPT_SMI) | BIT(IB_QPT_GSI) |
+		       BIT(IB_QPT_UC) | BIT(IB_QPT_RC),
+},
+
+};
+
 static void get_map_page(struct rvt_qpn_table *qpt, struct rvt_qpn_map *map,
 			 gfp_t gfp)
 {
diff --git a/drivers/infiniband/hw/qib/qib_verbs.c b/drivers/infiniband/hw/qib/qib_verbs.c
index cbf6200..fd1dfbc 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.c
+++ b/drivers/infiniband/hw/qib/qib_verbs.c
@@ -1582,6 +1582,8 @@ static void qib_fill_device_attr(struct qib_devdata *dd)
 	rdi->dparms.props.max_total_mcast_qp_attach =
 					rdi->dparms.props.max_mcast_qp_attach *
 					rdi->dparms.props.max_mcast_grp;
+	/* post send table */
+	dd->verbs_dev.rdi.post_parms = qib_post_parms;
 }
 
 /**
diff --git a/drivers/infiniband/hw/qib/qib_verbs.h b/drivers/infiniband/hw/qib/qib_verbs.h
index 4f87815..736ced6 100644
--- a/drivers/infiniband/hw/qib/qib_verbs.h
+++ b/drivers/infiniband/hw/qib/qib_verbs.h
@@ -497,4 +497,6 @@ extern unsigned int ib_qib_max_srq_wrs;
 
 extern const u32 ib_qib_rnr_table[];
 
+extern const struct rvt_operation_params qib_post_parms[];
+
 #endif                          /* QIB_VERBS_H */


* [PATCH for-next 18/18] IB/rdmavt: Use new driver specific post send table
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (16 preceding siblings ...)
  2016-07-01 23:02   ` [PATCH for-next 17/18] IB/qib: Add qib post send table Dennis Dalessandro
@ 2016-07-01 23:02   ` Dennis Dalessandro
  2016-08-02 19:58   ` [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8 Doug Ledford
  18 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-01 23:02 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Mike Marciniszyn, Jianxin Xiong

From: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Change rvt_post_one_wr to use the new table mechanism for
post send.

Validate that each low-level driver specifies the table.
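
The payoff: the per-opcode validation and memcpy chains in
rvt_post_one_wr() collapse to one call whose non-negative return
value doubles as the copy length. A reduced sketch of that control
flow, with a stubbed-in validator in place of
rvt_qp_valid_operation():

    #include <stdio.h>
    #include <string.h>
    #include <errno.h>

    struct wr { int opcode; unsigned char body[124]; };
    struct qp { unsigned char swqe[128]; };

    /* stand-in validator: < 0 on error, else bytes to copy */
    static int valid_operation(const struct qp *qp, const struct wr *wr)
    {
        (void)qp;
        return wr->opcode == 0 ? (int)sizeof(struct wr) : -EINVAL;
    }

    static int post_one_wr(struct qp *qp, const struct wr *wr)
    {
        int ret = valid_operation(qp, wr);

        if (ret < 0)
            return ret;         /* the table rejected the opcode */

        /* the positive return doubles as the number of bytes to
         * copy: one memcpy replaces the old per-opcode copy chain */
        memcpy(qp->swqe, wr, (size_t)ret);
        return 0;
    }

    int main(void)
    {
        struct qp qp;
        struct wr wr = { .opcode = 0 };

        printf("%d\n", post_one_wr(&qp, &wr)); /* 0: accepted, copied */
        return 0;
    }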

Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
---
 drivers/infiniband/sw/rdmavt/qp.c |   46 ++++++-------------------------------
 drivers/infiniband/sw/rdmavt/vt.c |    3 ++
 2 files changed, 10 insertions(+), 39 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c
index d2b5b54..ebc37f5 100644
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -1535,6 +1535,7 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	struct rvt_dev_info *rdi = ib_to_rvt(qp->ibqp.device);
 	u8 log_pmtu;
 	int ret;
+	size_t cplen;
 
 	BUILD_BUG_ON(IB_QPT_MAX >= (sizeof(u32) * BITS_PER_BYTE));
 
@@ -1542,32 +1543,11 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	if (unlikely(wr->num_sge > qp->s_max_sge))
 		return -EINVAL;
 
-	/*
-	 * Don't allow RDMA reads or atomic operations on UC or
-	 * undefined operations.
-	 * Make sure buffer is large enough to hold the result for atomics.
-	 */
-	if (qp->ibqp.qp_type == IB_QPT_UC) {
-		if ((unsigned)wr->opcode >= IB_WR_RDMA_READ)
-			return -EINVAL;
-	} else if (qp->ibqp.qp_type != IB_QPT_RC) {
-		/* Check IB_QPT_SMI, IB_QPT_GSI, IB_QPT_UD opcode */
-		if (wr->opcode != IB_WR_SEND &&
-		    wr->opcode != IB_WR_SEND_WITH_IMM)
-			return -EINVAL;
-		/* Check UD destination address PD */
-		if (qp->ibqp.pd != ud_wr(wr)->ah->pd)
-			return -EINVAL;
-	} else if ((unsigned)wr->opcode > IB_WR_ATOMIC_FETCH_AND_ADD) {
-		return -EINVAL;
-	} else if (wr->opcode >= IB_WR_ATOMIC_CMP_AND_SWP &&
-		   (wr->num_sge == 0 ||
-		    wr->sg_list[0].length < sizeof(u64) ||
-		    wr->sg_list[0].addr & (sizeof(u64) - 1))) {
-		return -EINVAL;
-	} else if (wr->opcode >= IB_WR_RDMA_READ && !qp->s_max_rd_atomic) {
-		return -EINVAL;
-	}
+	ret = rvt_qp_valid_operation(qp, rdi->post_parms, wr);
+	if (ret < 0)
+		return ret;
+	cplen = ret;
+
 	/* check for avail */
 	if (unlikely(!qp->s_avail)) {
 		qp->s_avail = qp_get_savail(qp);
@@ -1588,18 +1568,8 @@ static int rvt_post_one_wr(struct rvt_qp *qp,
 	pd = ibpd_to_rvtpd(qp->ibqp.pd);
 	wqe = rvt_get_swqe_ptr(qp, qp->s_head);
 
-	if (qp->ibqp.qp_type != IB_QPT_UC &&
-	    qp->ibqp.qp_type != IB_QPT_RC)
-		memcpy(&wqe->ud_wr, ud_wr(wr), sizeof(wqe->ud_wr));
-	else if (wr->opcode == IB_WR_RDMA_WRITE_WITH_IMM ||
-		 wr->opcode == IB_WR_RDMA_WRITE ||
-		 wr->opcode == IB_WR_RDMA_READ)
-		memcpy(&wqe->rdma_wr, rdma_wr(wr), sizeof(wqe->rdma_wr));
-	else if (wr->opcode == IB_WR_ATOMIC_CMP_AND_SWP ||
-		 wr->opcode == IB_WR_ATOMIC_FETCH_AND_ADD)
-		memcpy(&wqe->atomic_wr, atomic_wr(wr), sizeof(wqe->atomic_wr));
-	else
-		memcpy(&wqe->wr, wr, sizeof(wqe->wr));
+	/* cplen has length from above */
+	memcpy(&wqe->wr, wr, cplen);
 
 	wqe->length = 0;
 	j = 0;
diff --git a/drivers/infiniband/sw/rdmavt/vt.c b/drivers/infiniband/sw/rdmavt/vt.c
index 30c4fda..89fe967 100644
--- a/drivers/infiniband/sw/rdmavt/vt.c
+++ b/drivers/infiniband/sw/rdmavt/vt.c
@@ -528,7 +528,8 @@ static noinline int check_support(struct rvt_dev_info *rdi, int verb)
 							 post_send),
 					   rvt_post_send))
 			if (!rdi->driver_f.schedule_send ||
-			    !rdi->driver_f.do_send)
+			    !rdi->driver_f.do_send ||
+			    !rdi->post_parms)
 				return -EINVAL;
 		break;
 


* [PATCH v2] IB/hfi1: Add global structure for affinity assignments
       [not found]     ` <20160701230127.20160.68709.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
@ 2016-07-25 14:52       ` Dennis Dalessandro
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-25 14:52 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Jubin John, Sebastian Sanchez,
	Mike Marciniszyn, Jianxin Xiong

When HFI units are initialized, each uses its own copy of the
affinity mask. On a multi-HFI system, affinity assignments
overbook CPU cores because no HFI knows about the assignments
made by the others. As a result, some CPU cores are never used
for interrupt handlers on systems with a high number of CPU
cores per NUMA node.

For multi-HFI systems, SDMA engine interrupt assignments start all over
from the first CPU in the local NUMA node after the first HFI
initialization. This change allows assignments to continue where the
last HFI unit left off.

Add global structure for affinity assignments for multiple HFIs to share
affinity mask.
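
In outline, the change replaces per-device affinity state with one
process-wide list keyed by NUMA node under a single lock. A
userspace sketch of the lookup-or-create flow, with a pthread mutex
standing in for the spinlock and a flat field standing in for the
cpumask sets:

    #include <stdlib.h>
    #include <pthread.h>

    /* one shared entry per NUMA node, used by every HFI on that node */
    struct affinity_node {
        int node;
        unsigned long def_intr_used;    /* stand-in for the cpumask sets */
        struct affinity_node *next;
    };

    static struct affinity_node *node_list;
    static pthread_mutex_t node_lock = PTHREAD_MUTEX_INITIALIZER;

    /* mirrors node_affinity_lookup(): caller must hold node_lock */
    static struct affinity_node *lookup(int node)
    {
        struct affinity_node *e;

        for (e = node_list; e; e = e->next)
            if (e->node == node)
                return e;
        return NULL;
    }

    /* The first device on a node creates the shared entry; later
     * devices find it and continue engine assignment where the last
     * one left off. Like the patch, this leaves a window between
     * lookup and insert, so it assumes initialization is serialized. */
    static struct affinity_node *get_or_create(int node)
    {
        struct affinity_node *e;

        pthread_mutex_lock(&node_lock);
        e = lookup(node);
        pthread_mutex_unlock(&node_lock);
        if (e)
            return e;

        e = calloc(1, sizeof(*e));
        if (!e)
            return NULL;
        e->node = node;

        pthread_mutex_lock(&node_lock);
        e->next = node_list;
        node_list = e;
        pthread_mutex_unlock(&node_lock);
        return e;
    }

    int main(void)
    {
        return get_or_create(0) == get_or_create(0) ? 0 : 1; /* same entry */
    }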

Reviewed-by: Jianxin Xiong <jianxin.xiong-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Jubin John <jubin.john-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

--

changes since v1:
-----------------
Remove char buf[] that was accidentally put back after being removed in another
patch [1]. Also patch up one of the trace messages to not need scnprintf().

[1]
https://git.kernel.org/cgit/linux/kernel/git/dledford/rdma.git/commit/?id=f242d93ae92032f78840471e5c2bfc2d04ae324c
---
 drivers/infiniband/hw/hfi1/affinity.c |  245 +++++++++++++++++++++++----------
 drivers/infiniband/hw/hfi1/affinity.h |   25 +++
 drivers/infiniband/hw/hfi1/chip.c     |   20 +--
 drivers/infiniband/hw/hfi1/init.c     |    5 +
 4 files changed, 198 insertions(+), 97 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 14d7eeb..1647699 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -53,6 +53,11 @@
 #include "sdma.h"
 #include "trace.h"
 
+struct hfi1_affinity_node_list node_affinity = {
+	.list = LIST_HEAD_INIT(node_affinity.list),
+	.lock = __SPIN_LOCK_UNLOCKED(&node_affinity.lock),
+};
+
 /* Name of IRQ types, indexed by enum irq_type */
 static const char * const irq_type_names[] = {
 	"SDMA",
@@ -69,45 +74,100 @@ static inline void init_cpu_mask_set(struct cpu_mask_set *set)
 }
 
 /* Initialize non-HT cpu cores mask */
-int init_real_cpu_mask(struct hfi1_devdata *dd)
+void init_real_cpu_mask(void)
 {
-	struct hfi1_affinity *info;
 	int possible, curr_cpu, i, ht;
 
-	info = kzalloc(sizeof(*info), GFP_KERNEL);
-	if (!info)
-		return -ENOMEM;
-
-	cpumask_clear(&info->real_cpu_mask);
+	cpumask_clear(&node_affinity.real_cpu_mask);
 
 	/* Start with cpu online mask as the real cpu mask */
-	cpumask_copy(&info->real_cpu_mask, cpu_online_mask);
+	cpumask_copy(&node_affinity.real_cpu_mask, cpu_online_mask);
 
 	/*
 	 * Remove HT cores from the real cpu mask.  Do this in two steps below.
 	 */
-	possible = cpumask_weight(&info->real_cpu_mask);
+	possible = cpumask_weight(&node_affinity.real_cpu_mask);
 	ht = cpumask_weight(topology_sibling_cpumask(
-					cpumask_first(&info->real_cpu_mask)));
+				cpumask_first(&node_affinity.real_cpu_mask)));
 	/*
 	 * Step 1.  Skip over the first N HT siblings and use them as the
 	 * "real" cores.  Assumes that HT cores are not enumerated in
 	 * succession (except in the single core case).
 	 */
-	curr_cpu = cpumask_first(&info->real_cpu_mask);
+	curr_cpu = cpumask_first(&node_affinity.real_cpu_mask);
 	for (i = 0; i < possible / ht; i++)
-		curr_cpu = cpumask_next(curr_cpu, &info->real_cpu_mask);
+		curr_cpu = cpumask_next(curr_cpu, &node_affinity.real_cpu_mask);
 	/*
 	 * Step 2.  Remove the remaining HT siblings.  Use cpumask_next() to
 	 * skip any gaps.
 	 */
 	for (; i < possible; i++) {
-		cpumask_clear_cpu(curr_cpu, &info->real_cpu_mask);
-		curr_cpu = cpumask_next(curr_cpu, &info->real_cpu_mask);
+		cpumask_clear_cpu(curr_cpu, &node_affinity.real_cpu_mask);
+		curr_cpu = cpumask_next(curr_cpu, &node_affinity.real_cpu_mask);
 	}
+}
 
-	dd->affinity = info;
-	return 0;
+void node_affinity_init(void)
+{
+	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
+	/*
+	 * The real cpu mask is part of the affinity struct but it has to be
+	 * initialized early. It is needed to calculate the number of user
+	 * contexts in set_up_context_variables().
+	 */
+	init_real_cpu_mask();
+}
+
+void node_affinity_destroy(void)
+{
+	struct list_head *pos, *q;
+	struct hfi1_affinity_node *entry;
+
+	spin_lock(&node_affinity.lock);
+	list_for_each_safe(pos, q, &node_affinity.list) {
+		entry = list_entry(pos, struct hfi1_affinity_node,
+				   list);
+		list_del(pos);
+		kfree(entry);
+	}
+	spin_unlock(&node_affinity.lock);
+}
+
+static struct hfi1_affinity_node *node_affinity_allocate(int node)
+{
+	struct hfi1_affinity_node *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return NULL;
+	entry->node = node;
+	INIT_LIST_HEAD(&entry->list);
+
+	return entry;
+}
+
+/*
+ * It appends an entry to the list.
+ * It *must* be called with node_affinity.lock held.
+ */
+static void node_affinity_add_tail(struct hfi1_affinity_node *entry)
+{
+	list_add_tail(&entry->list, &node_affinity.list);
+}
+
+/* It must be called with node_affinity.lock held */
+static struct hfi1_affinity_node *node_affinity_lookup(int node)
+{
+	struct list_head *pos;
+	struct hfi1_affinity_node *entry;
+
+	list_for_each(pos, &node_affinity.list) {
+		entry = list_entry(pos, struct hfi1_affinity_node, list);
+		if (entry->node == node)
+			return entry;
+	}
+
+	return NULL;
 }
 
 /*
@@ -121,10 +181,10 @@ int init_real_cpu_mask(struct hfi1_devdata *dd)
  * to the node relative 1 as necessary.
  *
  */
-void hfi1_dev_affinity_init(struct hfi1_devdata *dd)
+int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 {
 	int node = pcibus_to_node(dd->pcidev->bus);
-	struct hfi1_affinity *info = dd->affinity;
+	struct hfi1_affinity_node *entry;
 	const struct cpumask *local_mask;
 	int curr_cpu, possible, i;
 
@@ -132,55 +192,75 @@ void hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		node = numa_node_id();
 	dd->node = node;
 
-	spin_lock_init(&info->lock);
-
-	init_cpu_mask_set(&info->def_intr);
-	init_cpu_mask_set(&info->rcv_intr);
-	init_cpu_mask_set(&info->proc);
-
 	local_mask = cpumask_of_node(dd->node);
 	if (cpumask_first(local_mask) >= nr_cpu_ids)
 		local_mask = topology_core_cpumask(0);
-	/* Use the "real" cpu mask of this node as the default */
-	cpumask_and(&info->def_intr.mask, &info->real_cpu_mask, local_mask);
-
-	/*  fill in the receive list */
-	possible = cpumask_weight(&info->def_intr.mask);
-	curr_cpu = cpumask_first(&info->def_intr.mask);
-	if (possible == 1) {
-		/*  only one CPU, everyone will use it */
-		cpumask_set_cpu(curr_cpu, &info->rcv_intr.mask);
-	} else {
-		/*
-		 * Retain the first CPU in the default list for the control
-		 * context.
-		 */
-		curr_cpu = cpumask_next(curr_cpu, &info->def_intr.mask);
-		/*
-		 * Remove the remaining kernel receive queues from
-		 * the default list and add them to the receive list.
-		 */
-		for (i = 0; i < dd->n_krcv_queues - 1; i++) {
-			cpumask_clear_cpu(curr_cpu, &info->def_intr.mask);
-			cpumask_set_cpu(curr_cpu, &info->rcv_intr.mask);
-			curr_cpu = cpumask_next(curr_cpu, &info->def_intr.mask);
-			if (curr_cpu >= nr_cpu_ids)
-				break;
+
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
+
+	/*
+	 * If this is the first time this NUMA node's affinity is used,
+	 * create an entry in the global affinity structure and initialize it.
+	 */
+	if (!entry) {
+		entry = node_affinity_allocate(node);
+		if (!entry) {
+			dd_dev_err(dd,
+				   "Unable to allocate global affinity node\n");
+			return -ENOMEM;
 		}
-	}
+		init_cpu_mask_set(&entry->def_intr);
+		init_cpu_mask_set(&entry->rcv_intr);
+		/* Use the "real" cpu mask of this node as the default */
+		cpumask_and(&entry->def_intr.mask, &node_affinity.real_cpu_mask,
+			    local_mask);
+
+		/* fill in the receive list */
+		possible = cpumask_weight(&entry->def_intr.mask);
+		curr_cpu = cpumask_first(&entry->def_intr.mask);
+
+		if (possible == 1) {
+			/* only one CPU, everyone will use it */
+			cpumask_set_cpu(curr_cpu, &entry->rcv_intr.mask);
+		} else {
+			/*
+			 * Retain the first CPU in the default list for the
+			 * control context.
+			 */
+			curr_cpu = cpumask_next(curr_cpu,
+						&entry->def_intr.mask);
 
-	cpumask_copy(&info->proc.mask, cpu_online_mask);
-}
+			/*
+			 * Remove the remaining kernel receive queues from
+			 * the default list and add them to the receive list.
+			 */
+			for (i = 0; i < dd->n_krcv_queues - 1; i++) {
+				cpumask_clear_cpu(curr_cpu,
+						  &entry->def_intr.mask);
+				cpumask_set_cpu(curr_cpu,
+						&entry->rcv_intr.mask);
+				curr_cpu = cpumask_next(curr_cpu,
+							&entry->def_intr.mask);
+				if (curr_cpu >= nr_cpu_ids)
+					break;
+			}
+		}
 
-void hfi1_dev_affinity_free(struct hfi1_devdata *dd)
-{
-	kfree(dd->affinity);
+		spin_lock(&node_affinity.lock);
+		node_affinity_add_tail(entry);
+		spin_unlock(&node_affinity.lock);
+	}
+
+	return 0;
 }
 
 int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 {
 	int ret;
 	cpumask_var_t diff;
+	struct hfi1_affinity_node *entry;
 	struct cpu_mask_set *set;
 	struct sdma_engine *sde = NULL;
 	struct hfi1_ctxtdata *rcd = NULL;
@@ -194,21 +274,25 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	if (!ret)
 		return -ENOMEM;
 
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
+
 	switch (msix->type) {
 	case IRQ_SDMA:
 		sde = (struct sdma_engine *)msix->arg;
 		scnprintf(extra, 64, "engine %u", sde->this_idx);
 		/* fall through */
 	case IRQ_GENERAL:
-		set = &dd->affinity->def_intr;
+		set = &entry->def_intr;
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
 		if (rcd->ctxt == HFI1_CTRL_CTXT) {
-			set = &dd->affinity->def_intr;
+			set = &entry->def_intr;
 			cpu = cpumask_first(&set->mask);
 		} else {
-			set = &dd->affinity->rcv_intr;
+			set = &entry->rcv_intr;
 		}
 		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
 		break;
@@ -222,8 +306,8 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	 * is set above.  Skip accounting for it.  Everything else finds its
 	 * CPU here.
 	 */
-	if (cpu == -1) {
-		spin_lock(&dd->affinity->lock);
+	if (cpu == -1 && set) {
+		spin_lock(&node_affinity.lock);
 		if (cpumask_equal(&set->mask, &set->used)) {
 			/*
 			 * We've used up all the CPUs, bump up the generation
@@ -235,7 +319,7 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 		cpumask_andnot(diff, &set->mask, &set->used);
 		cpu = cpumask_first(diff);
 		cpumask_set_cpu(cpu, &set->used);
-		spin_unlock(&dd->affinity->lock);
+		spin_unlock(&node_affinity.lock);
 	}
 
 	switch (msix->type) {
@@ -263,30 +347,35 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 {
 	struct cpu_mask_set *set = NULL;
 	struct hfi1_ctxtdata *rcd;
+	struct hfi1_affinity_node *entry;
+
+	spin_lock(&node_affinity.lock);
+	entry = node_affinity_lookup(dd->node);
+	spin_unlock(&node_affinity.lock);
 
 	switch (msix->type) {
 	case IRQ_SDMA:
 	case IRQ_GENERAL:
-		set = &dd->affinity->def_intr;
+		set = &entry->def_intr;
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
 		/* only do accounting for non control contexts */
 		if (rcd->ctxt != HFI1_CTRL_CTXT)
-			set = &dd->affinity->rcv_intr;
+			set = &entry->rcv_intr;
 		break;
 	default:
 		return;
 	}
 
 	if (set) {
-		spin_lock(&dd->affinity->lock);
+		spin_lock(&node_affinity.lock);
 		cpumask_andnot(&set->used, &set->used, &msix->mask);
 		if (cpumask_empty(&set->used) && set->gen) {
 			set->gen--;
 			cpumask_copy(&set->used, &set->mask);
 		}
-		spin_unlock(&dd->affinity->lock);
+		spin_unlock(&node_affinity.lock);
 	}
 
 	irq_set_affinity_hint(msix->msix.vector, NULL);
@@ -297,9 +386,10 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 {
 	int cpu = -1, ret;
 	cpumask_var_t diff, mask, intrs;
+	struct hfi1_affinity_node *entry;
 	const struct cpumask *node_mask,
 		*proc_mask = tsk_cpus_allowed(current);
-	struct cpu_mask_set *set = &dd->affinity->proc;
+	struct cpu_mask_set *set = &node_affinity.proc;
 
 	/*
 	 * check whether process/context affinity has already
@@ -338,7 +428,7 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 	if (!ret)
 		goto free_mask;
 
-	spin_lock(&dd->affinity->lock);
+	spin_lock(&node_affinity.lock);
 	/*
 	 * If we've used all available CPUs, clear the mask and start
 	 * overloading.
@@ -348,13 +438,14 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpumask_clear(&set->used);
 	}
 
+	entry = node_affinity_lookup(dd->node);
 	/* CPUs used by interrupt handlers */
-	cpumask_copy(intrs, (dd->affinity->def_intr.gen ?
-			     &dd->affinity->def_intr.mask :
-			     &dd->affinity->def_intr.used));
-	cpumask_or(intrs, intrs, (dd->affinity->rcv_intr.gen ?
-				  &dd->affinity->rcv_intr.mask :
-				  &dd->affinity->rcv_intr.used));
+	cpumask_copy(intrs, (entry->def_intr.gen ?
+			     &entry->def_intr.mask :
+			     &entry->def_intr.used));
+	cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
+				  &entry->rcv_intr.mask :
+				  &entry->rcv_intr.used));
 	hfi1_cdbg(PROC, "CPUs used by interrupts: %*pbl",
 		  cpumask_pr_args(intrs));
 
@@ -400,7 +491,7 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpu = -1;
 	else
 		cpumask_set_cpu(cpu, &set->used);
-	spin_unlock(&dd->affinity->lock);
+	spin_unlock(&node_affinity.lock);
 
 	free_cpumask_var(intrs);
 free_mask:
@@ -413,16 +504,16 @@ done:
 
 void hfi1_put_proc_affinity(struct hfi1_devdata *dd, int cpu)
 {
-	struct cpu_mask_set *set = &dd->affinity->proc;
+	struct cpu_mask_set *set = &node_affinity.proc;
 
 	if (cpu < 0)
 		return;
-	spin_lock(&dd->affinity->lock);
+	spin_lock(&node_affinity.lock);
 	cpumask_clear_cpu(cpu, &set->used);
 	if (cpumask_empty(&set->used) && set->gen) {
 		set->gen--;
 		cpumask_copy(&set->used, &set->mask);
 	}
-	spin_unlock(&dd->affinity->lock);
+	spin_unlock(&node_affinity.lock);
 }
 
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index 20f52fe..ad3e730 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -82,11 +82,9 @@ struct hfi1_affinity {
 struct hfi1_msix_entry;
 
 /* Initialize non-HT cpu cores mask */
-int init_real_cpu_mask(struct hfi1_devdata *);
+void init_real_cpu_mask(void);
 /* Initialize driver affinity data */
-void hfi1_dev_affinity_init(struct hfi1_devdata *);
-/* Free driver affinity data */
-void hfi1_dev_affinity_free(struct hfi1_devdata *);
+int hfi1_dev_affinity_init(struct hfi1_devdata *);
 /*
  * Set IRQ affinity to a CPU. The function will determine the
  * CPU and set the affinity to it.
@@ -105,4 +103,23 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *, int);
 /* Release a CPU used by a user process. */
 void hfi1_put_proc_affinity(struct hfi1_devdata *, int);
 
+struct hfi1_affinity_node {
+	int node;
+	struct cpu_mask_set def_intr;
+	struct cpu_mask_set rcv_intr;
+	struct list_head list;
+};
+
+struct hfi1_affinity_node_list {
+	struct list_head list;
+	struct cpumask real_cpu_mask;
+	struct cpu_mask_set proc;
+	/* protect affinity node list */
+	spinlock_t lock;
+};
+
+void node_affinity_init(void);
+void node_affinity_destroy(void);
+extern struct hfi1_affinity_node_list node_affinity;
+
 #endif /* _HFI1_AFFINITY_H */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 97ce886..0de6c0c 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -63,6 +63,7 @@
 #include "efivar.h"
 #include "platform.h"
 #include "aspm.h"
+#include "affinity.h"
 
 #define NUM_IB_PORTS 1
 
@@ -12838,7 +12839,7 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	 */
 	if (num_user_contexts < 0)
 		num_user_contexts =
-			cpumask_weight(&dd->affinity->real_cpu_mask);
+			cpumask_weight(&node_affinity.real_cpu_mask);
 
 	total_contexts = num_kernel_contexts + num_user_contexts;
 
@@ -14468,19 +14469,6 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
 		 (dd->revision >> CCE_REVISION_SW_SHIFT)
 		    & CCE_REVISION_SW_MASK);
 
-	/*
-	 * The real cpu mask is part of the affinity struct but has to be
-	 * initialized earlier than the rest of the affinity struct because it
-	 * is needed to calculate the number of user contexts in
-	 * set_up_context_variables(). However, hfi1_dev_affinity_init(),
-	 * which initializes the rest of the affinity struct members,
-	 * depends on set_up_context_variables() for the number of kernel
-	 * contexts, so it cannot be called before set_up_context_variables().
-	 */
-	ret = init_real_cpu_mask(dd);
-	if (ret)
-		goto bail_cleanup;
-
 	ret = set_up_context_variables(dd);
 	if (ret)
 		goto bail_cleanup;
@@ -14494,7 +14482,9 @@ struct hfi1_devdata *hfi1_init_dd(struct pci_dev *pdev,
 	/* set up KDETH QP prefix in both RX and TX CSRs */
 	init_kdeth_qp(dd);
 
-	hfi1_dev_affinity_init(dd);
+	ret = hfi1_dev_affinity_init(dd);
+	if (ret)
+		goto bail_cleanup;
 
 	/* send contexts must be set up before receive contexts */
 	ret = init_send_contexts(dd);
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index eed971c..b0c3e8a 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -64,6 +64,7 @@
 #include "debugfs.h"
 #include "verbs.h"
 #include "aspm.h"
+#include "affinity.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -1004,7 +1005,6 @@ static void __hfi1_free_devdata(struct kobject *kobj)
 	rcu_barrier(); /* wait for rcu callbacks to complete */
 	free_percpu(dd->int_counter);
 	free_percpu(dd->rcv_limit);
-	hfi1_dev_affinity_free(dd);
 	free_percpu(dd->send_schedule);
 	rvt_dealloc_device(&dd->verbs_dev.rdi);
 }
@@ -1198,6 +1198,8 @@ static int __init hfi1_mod_init(void)
 	if (ret)
 		goto bail;
 
+	node_affinity_init();
+
 	/* validate max MTU before any devices start */
 	if (!valid_opa_max_mtu(hfi1_max_mtu)) {
 		pr_err("Invalid max_mtu 0x%x, using 0x%x instead\n",
@@ -1278,6 +1280,7 @@ module_init(hfi1_mod_init);
 static void __exit hfi1_mod_cleanup(void)
 {
 	pci_unregister_driver(&hfi1_pci_driver);
+	node_affinity_destroy();
 	hfi1_wss_exit();
 	hfi1_dbg_exit();
 	hfi1_cpulist_count = 0;


* [PATCH v2] IB/hfi1: Reserve and collapse CPU cores for contexts
       [not found]     ` <20160701230133.20160.76302.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
@ 2016-07-25 14:54       ` Dennis Dalessandro
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-25 14:54 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Dean Luick, Sebastian Sanchez

From: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

Kernel receive queues oversubscribe CPU cores on multi-HFI systems.
To prevent this, the kernel receive queues are separated onto
different cores, and the SDMA engine interrupts are constrained to
a smaller number of cores.

The number of CPU cores reserved for kernel receive queues is
(HFIs on the NUMA node) * krcvqs, covering all HFIs on that node.
Each HFI pins its kernel receive queues to some of these reserved
CPU cores. If no CPU cores are left over for the SDMA engines,
the SDMA engines use the same CPU cores as the receive contexts.

In addition, the general and control contexts are each assigned
their own CPU core; however, both types of contexts tend to see
little traffic. To save CPU cores, collapse the general and
control contexts onto one CPU core for all HFI units. This also
prevents SDMA engine interrupts from wrapping around onto the
general contexts.
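
The accounting is simple arithmetic; a small sketch with
hypothetical counts shows the reservation the patch computes from
krcvqs and the per-node HFI count:

    #include <stdio.h>

    int main(void)
    {
        int n_krcv_queues = 5;  /* per-HFI kernel receive queues
                                 * (one of them is the control ctxt) */
        int hfis_on_node  = 2;  /* counted by walking the PCI table
                                 * at module load */

        /* cores reserved for kernel receive queues on this node */
        int rcv_cores = (n_krcv_queues - 1) * hfis_on_node;

        printf("reserve %d cores for receive, rest go to SDMA\n",
               rcv_cores);      /* prints: reserve 8 cores ... */
        return 0;
    }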

Reviewed-by: Dean Luick <dean.luick-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

--

Changes since v1
----------------
Resolve conflict due to v2 of previous patch.
---
 drivers/infiniband/hw/hfi1/affinity.c |  101 +++++++++++++++++++++++++--------
 drivers/infiniband/hw/hfi1/affinity.h |    3 +
 drivers/infiniband/hw/hfi1/hfi.h      |    2 +
 drivers/infiniband/hw/hfi1/init.c     |    6 +-
 4 files changed, 84 insertions(+), 28 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index 1647699..eb88927 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -66,6 +66,9 @@ static const char * const irq_type_names[] = {
 	"OTHER",
 };
 
+/* Per NUMA node count of HFI devices */
+static unsigned int *hfi1_per_node_cntr;
+
 static inline void init_cpu_mask_set(struct cpu_mask_set *set)
 {
 	cpumask_clear(&set->mask);
@@ -107,8 +110,12 @@ void init_real_cpu_mask(void)
 	}
 }
 
-void node_affinity_init(void)
+int node_affinity_init(void)
 {
+	int node;
+	struct pci_dev *dev = NULL;
+	const struct pci_device_id *ids = hfi1_pci_tbl;
+
 	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
 	/*
 	 * The real cpu mask is part of the affinity struct but it has to be
@@ -116,6 +123,25 @@ void node_affinity_init(void)
 	 * contexts in set_up_context_variables().
 	 */
 	init_real_cpu_mask();
+
+	hfi1_per_node_cntr = kcalloc(num_possible_nodes(),
+				     sizeof(*hfi1_per_node_cntr), GFP_KERNEL);
+	if (!hfi1_per_node_cntr)
+		return -ENOMEM;
+
+	while (ids->vendor) {
+		dev = NULL;
+		while ((dev = pci_get_device(ids->vendor, ids->device, dev))) {
+			node = pcibus_to_node(dev->bus);
+			if (node < 0)
+				node = numa_node_id();
+
+			hfi1_per_node_cntr[node]++;
+		}
+		ids++;
+	}
+
+	return 0;
 }
 
 void node_affinity_destroy(void)
@@ -131,6 +157,7 @@ void node_affinity_destroy(void)
 		kfree(entry);
 	}
 	spin_unlock(&node_affinity.lock);
+	kfree(hfi1_per_node_cntr);
 }
 
 static struct hfi1_affinity_node *node_affinity_allocate(int node)
@@ -213,6 +240,7 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		}
 		init_cpu_mask_set(&entry->def_intr);
 		init_cpu_mask_set(&entry->rcv_intr);
+		cpumask_clear(&entry->general_intr_mask);
 		/* Use the "real" cpu mask of this node as the default */
 		cpumask_and(&entry->def_intr.mask, &node_affinity.real_cpu_mask,
 			    local_mask);
@@ -224,11 +252,15 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 		if (possible == 1) {
 			/* only one CPU, everyone will use it */
 			cpumask_set_cpu(curr_cpu, &entry->rcv_intr.mask);
+			cpumask_set_cpu(curr_cpu, &entry->general_intr_mask);
 		} else {
 			/*
-			 * Retain the first CPU in the default list for the
-			 * control context.
+			 * The general/control context will be the first CPU in
+			 * the default list, so it is removed from the default
+			 * list and added to the general interrupt list.
 			 */
+			cpumask_clear_cpu(curr_cpu, &entry->def_intr.mask);
+			cpumask_set_cpu(curr_cpu, &entry->general_intr_mask);
 			curr_cpu = cpumask_next(curr_cpu,
 						&entry->def_intr.mask);
 
@@ -236,7 +268,10 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 			 * Remove the remaining kernel receive queues from
 			 * the default list and add them to the receive list.
 			 */
-			for (i = 0; i < dd->n_krcv_queues - 1; i++) {
+			for (i = 0;
+			     i < (dd->n_krcv_queues - 1) *
+				  hfi1_per_node_cntr[dd->node];
+			     i++) {
 				cpumask_clear_cpu(curr_cpu,
 						  &entry->def_intr.mask);
 				cpumask_set_cpu(curr_cpu,
@@ -246,6 +281,15 @@ int hfi1_dev_affinity_init(struct hfi1_devdata *dd)
 				if (curr_cpu >= nr_cpu_ids)
 					break;
 			}
+
+			/*
+			 * If there ends up being 0 CPU cores leftover for SDMA
+			 * engines, use the same CPU cores as general/control
+			 * context.
+			 */
+			if (cpumask_weight(&entry->def_intr.mask) == 0)
+				cpumask_copy(&entry->def_intr.mask,
+					     &entry->general_intr_mask);
 		}
 
 		spin_lock(&node_affinity.lock);
@@ -261,7 +305,7 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	int ret;
 	cpumask_var_t diff;
 	struct hfi1_affinity_node *entry;
-	struct cpu_mask_set *set;
+	struct cpu_mask_set *set = NULL;
 	struct sdma_engine *sde = NULL;
 	struct hfi1_ctxtdata *rcd = NULL;
 	char extra[64];
@@ -282,18 +326,17 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	case IRQ_SDMA:
 		sde = (struct sdma_engine *)msix->arg;
 		scnprintf(extra, 64, "engine %u", sde->this_idx);
-		/* fall through */
-	case IRQ_GENERAL:
 		set = &entry->def_intr;
 		break;
+	case IRQ_GENERAL:
+		cpu = cpumask_first(&entry->general_intr_mask);
+		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
-		if (rcd->ctxt == HFI1_CTRL_CTXT) {
-			set = &entry->def_intr;
-			cpu = cpumask_first(&set->mask);
-		} else {
+		if (rcd->ctxt == HFI1_CTRL_CTXT)
+			cpu = cpumask_first(&entry->general_intr_mask);
+		else
 			set = &entry->rcv_intr;
-		}
 		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
 		break;
 	default:
@@ -302,9 +345,9 @@ int hfi1_get_irq_affinity(struct hfi1_devdata *dd, struct hfi1_msix_entry *msix)
 	}
 
 	/*
-	 * The control receive context is placed on a particular CPU, which
-	 * is set above.  Skip accounting for it.  Everything else finds its
-	 * CPU here.
+	 * The general and control contexts are placed on a particular
+	 * CPU, which is set above. Skip accounting for it. Everything else
+	 * finds its CPU here.
 	 */
 	if (cpu == -1 && set) {
 		spin_lock(&node_affinity.lock);
@@ -355,12 +398,14 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 
 	switch (msix->type) {
 	case IRQ_SDMA:
-	case IRQ_GENERAL:
 		set = &entry->def_intr;
 		break;
+	case IRQ_GENERAL:
+		/* Don't accounting for general contexts */
+		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
-		/* only do accounting for non control contexts */
+		/* Don't do accounting for control contexts */
 		if (rcd->ctxt != HFI1_CTRL_CTXT)
 			set = &entry->rcv_intr;
 		break;
@@ -438,14 +483,20 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 		cpumask_clear(&set->used);
 	}
 
-	entry = node_affinity_lookup(dd->node);
-	/* CPUs used by interrupt handlers */
-	cpumask_copy(intrs, (entry->def_intr.gen ?
-			     &entry->def_intr.mask :
-			     &entry->def_intr.used));
-	cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
-				  &entry->rcv_intr.mask :
-				  &entry->rcv_intr.used));
+	/*
+	 * If NUMA node has CPUs used by interrupt handlers, include them in the
+	 * interrupt handler mask.
+	 */
+	entry = node_affinity_lookup(node);
+	if (entry) {
+		cpumask_copy(intrs, (entry->def_intr.gen ?
+				     &entry->def_intr.mask :
+				     &entry->def_intr.used));
+		cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
+					  &entry->rcv_intr.mask :
+					  &entry->rcv_intr.used));
+		cpumask_or(intrs, intrs, &entry->general_intr_mask);
+	}
 	hfi1_cdbg(PROC, "CPUs used by interrupts: %*pbl",
 		  cpumask_pr_args(intrs));
 
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index ad3e730..003860e 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -107,6 +107,7 @@ struct hfi1_affinity_node {
 	int node;
 	struct cpu_mask_set def_intr;
 	struct cpu_mask_set rcv_intr;
+	struct cpumask general_intr_mask;
 	struct list_head list;
 };
 
@@ -118,7 +119,7 @@ struct hfi1_affinity_node_list {
 	spinlock_t lock;
 };
 
-void node_affinity_init(void);
+int node_affinity_init(void);
 void node_affinity_destroy(void);
 extern struct hfi1_affinity_node_list node_affinity;
 
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 748e235..fd67e98 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1235,6 +1235,8 @@ int handle_receive_interrupt_nodma_rtail(struct hfi1_ctxtdata *, int);
 int handle_receive_interrupt_dma_rtail(struct hfi1_ctxtdata *, int);
 void set_all_slowpath(struct hfi1_devdata *dd);
 
+extern const struct pci_device_id hfi1_pci_tbl[];
+
 /* receive packet handler dispositions */
 #define RCV_PKT_OK      0x0 /* keep going */
 #define RCV_PKT_LIMIT   0x1 /* stop, hit limit, start thread */
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index b0c3e8a..1620d68 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -1162,7 +1162,7 @@ static int init_one(struct pci_dev *, const struct pci_device_id *);
 #define DRIVER_LOAD_MSG "Intel " DRIVER_NAME " loaded: "
 #define PFX DRIVER_NAME ": "
 
-static const struct pci_device_id hfi1_pci_tbl[] = {
+const struct pci_device_id hfi1_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL0) },
 	{ PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL1) },
 	{ 0, }
@@ -1198,7 +1198,9 @@ static int __init hfi1_mod_init(void)
 	if (ret)
 		goto bail;
 
-	node_affinity_init();
+	ret = node_affinity_init();
+	if (ret)
+		goto bail;
 
 	/* validate max MTU before any devices start */
 	if (!valid_opa_max_mtu(hfi1_max_mtu)) {


* [PATCH v2] IB/hfi1: Refine user process affinity algorithm
       [not found]     ` <20160701230138.20160.5753.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
@ 2016-07-25 14:54       ` Dennis Dalessandro
  0 siblings, 0 replies; 23+ messages in thread
From: Dennis Dalessandro @ 2016-07-25 14:54 UTC (permalink / raw)
  To: dledford-H+wXaHxf7aLQT0dZR+AlfA
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Ira Weiny, Mitko Haralanov,
	Sebastian Sanchez

From: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

When recommending process affinity for MPI ranks, the current
algorithm considers neither multiple HFI units nor the difference
between real cores and HT cores. As a result, all HT cores on the
local NUMA node are recommended for assignment before any cores
on other NUMA nodes. To balance the CPU workload it is better to
assign all real cores across all NUMA nodes first, then all first
HT siblings, then all second HT siblings, and so on. The current
algorithm also ignores that CPU cores on other NUMA nodes may be
busy running interrupt handlers.

To balance the CPU workload for user processes, the following
recommendation algorithm is used:

 For each user process that is opening a context on HFI Y:
  a) If all cores are assigned to user processes, start assignments all
	 over from the first core
  b) Assign real cores first, then HT cores (First set of HT cores on
	 all physical cores, then second set of HT cores, and, so on) in the
	 following order:

	 1. Same NUMA node as HFI Y and not running an IRQ handler
	 2. Same NUMA node as HFI Y and running an IRQ handler
	 3. Different NUMA node to HFI Y and not running an IRQ handler
	 4. Different NUMA node to HFI Y and running an IRQ handler
  c) Mark core as assigned in the global affinity structure. As user
	 processes are done, remove core assignments from global affinity
	 structure.

This implementation supports an arbitrary number of HT siblings per
core as well as multiple HFIs.

This is being included in the kernel rather than user space because
user space has no way of knowing the CPU recommendations for contexts
running as part of other jobs.
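
To make the ordering in (b) concrete, here is a hedged sketch using the
kernel cpumask API. It illustrates the selection order only and is not
the driver's implementation; pick_cpu(), its parameters, and the mask
setup are assumptions for the example:

	/*
	 * "avail" is the current HW-thread mask minus CPUs already in
	 * use, "node_mask" is the CPU mask of HFI Y's NUMA node,
	 * "intrs" holds CPUs running IRQ handlers, and "used" is the
	 * global used-core bitmask from step (c).
	 */
	static int pick_cpu(const struct cpumask *avail,
			    const struct cpumask *node_mask,
			    const struct cpumask *intrs,
			    struct cpumask *used)
	{
		cpumask_var_t cand;
		int pass, cpu = -1;

		if (!zalloc_cpumask_var(&cand, GFP_KERNEL))
			return -1;

		for (pass = 0; pass < 4 && cpu == -1; pass++) {
			cpumask_copy(cand, avail);
			if (pass < 2)	/* cases 1 and 2: same NUMA node */
				cpumask_and(cand, cand, node_mask);
			else		/* cases 3 and 4: other NUMA nodes */
				cpumask_andnot(cand, cand, node_mask);
			if (!(pass & 1))	/* cases 1 and 3: skip IRQ CPUs */
				cpumask_andnot(cand, cand, intrs);

			cpu = cpumask_first(cand);
			if (cpu >= nr_cpu_ids)
				cpu = -1;	/* nothing free this pass */
		}
		if (cpu != -1)
			cpumask_set_cpu(cpu, used);	/* step (c) */

		free_cpumask_var(cand);
		return cpu;
	}

Passes 2 and 4 simply drop the IRQ exclusion; the non-IRQ CPUs they
also match were already exhausted in the preceding pass.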

Reviewed-by: Ira Weiny <ira.weiny-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Mitko Haralanov <mitko.haralanov-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Signed-off-by: Sebastian Sanchez <sebastian.sanchez-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

--

Changes since v1:
----------------
Resolve conflicts introduced by the previous two patches.
---
 drivers/infiniband/hw/hfi1/affinity.c |  213 +++++++++++++++++++++++++--------
 drivers/infiniband/hw/hfi1/affinity.h |    8 +
 drivers/infiniband/hw/hfi1/file_ops.c |   15 ++
 3 files changed, 174 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index eb88927..c9dcbd5 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -116,7 +116,17 @@ int node_affinity_init(void)
 	struct pci_dev *dev = NULL;
 	const struct pci_device_id *ids = hfi1_pci_tbl;
 
+	cpumask_clear(&node_affinity.proc.used);
 	cpumask_copy(&node_affinity.proc.mask, cpu_online_mask);
+
+	node_affinity.proc.gen = 0;
+	node_affinity.num_core_siblings =
+				cpumask_weight(topology_sibling_cpumask(
+					cpumask_first(&node_affinity.proc.mask)
+					));
+	node_affinity.num_online_nodes = num_online_nodes();
+	node_affinity.num_online_cpus = num_online_cpus();
+
 	/*
 	 * The real cpu mask is part of the affinity struct but it has to be
 	 * initialized early. It is needed to calculate the number of user
@@ -401,7 +411,7 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 		set = &entry->def_intr;
 		break;
 	case IRQ_GENERAL:
-		/* Don't accounting for general contexts */
+		/* Don't do accounting for general contexts */
 		break;
 	case IRQ_RCVCTXT:
 		rcd = (struct hfi1_ctxtdata *)msix->arg;
@@ -427,14 +437,47 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 	cpumask_clear(&msix->mask);
 }
 
-int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
+/* This should be called with node_affinity.lock held */
+static void find_hw_thread_mask(uint hw_thread_no, cpumask_var_t hw_thread_mask,
+				struct hfi1_affinity_node_list *affinity)
+{
+	int possible, curr_cpu, i;
+	uint num_cores_per_socket = node_affinity.num_online_cpus /
+					affinity->num_core_siblings /
+						node_affinity.num_online_nodes;
+
+	cpumask_copy(hw_thread_mask, &affinity->proc.mask);
+	if (affinity->num_core_siblings > 0) {
+		/* Removing other siblings not needed for now */
+		possible = cpumask_weight(hw_thread_mask);
+		curr_cpu = cpumask_first(hw_thread_mask);
+		for (i = 0;
+		     i < num_cores_per_socket * node_affinity.num_online_nodes;
+		     i++)
+			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
+
+		for (; i < possible; i++) {
+			cpumask_clear_cpu(curr_cpu, hw_thread_mask);
+			curr_cpu = cpumask_next(curr_cpu, hw_thread_mask);
+		}
+
+		/* Identifying correct HW threads within physical cores */
+		cpumask_shift_left(hw_thread_mask, hw_thread_mask,
+				   num_cores_per_socket *
+				   node_affinity.num_online_nodes *
+				   hw_thread_no);
+	}
+}
+
+int hfi1_get_proc_affinity(int node)
 {
-	int cpu = -1, ret;
-	cpumask_var_t diff, mask, intrs;
+	int cpu = -1, ret, i;
 	struct hfi1_affinity_node *entry;
+	cpumask_var_t diff, hw_thread_mask, available_mask, intrs_mask;
 	const struct cpumask *node_mask,
 		*proc_mask = tsk_cpus_allowed(current);
-	struct cpu_mask_set *set = &node_affinity.proc;
+	struct hfi1_affinity_node_list *affinity = &node_affinity;
+	struct cpu_mask_set *set = &affinity->proc;
 
 	/*
 	 * check whether process/context affinity has already
@@ -460,22 +503,41 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 
 	/*
 	 * The process does not have a preset CPU affinity so find one to
-	 * recommend. We prefer CPUs on the same NUMA as the device.
+	 * recommend using the following algorithm:
+	 *
+	 * For each user process that is opening a context on HFI Y:
+	 *  a) If all cores are filled, reinitialize the bitmask
+	 *  b) Fill real cores first, then HT cores (First set of HT
+	 *     cores on all physical cores, then second set of HT core,
+	 *     and, so on) in the following order:
+	 *
+	 *     1. Same NUMA node as HFI Y and not running an IRQ
+	 *        handler
+	 *     2. Same NUMA node as HFI Y and running an IRQ handler
+	 *     3. Different NUMA node to HFI Y and not running an IRQ
+	 *        handler
+	 *     4. Different NUMA node to HFI Y and running an IRQ
+	 *        handler
+	 *  c) Mark core as filled in the bitmask. As user processes are
+	 *     done, clear cores from the bitmask.
 	 */
 
 	ret = zalloc_cpumask_var(&diff, GFP_KERNEL);
 	if (!ret)
 		goto done;
-	ret = zalloc_cpumask_var(&mask, GFP_KERNEL);
+	ret = zalloc_cpumask_var(&hw_thread_mask, GFP_KERNEL);
 	if (!ret)
 		goto free_diff;
-	ret = zalloc_cpumask_var(&intrs, GFP_KERNEL);
+	ret = zalloc_cpumask_var(&available_mask, GFP_KERNEL);
 	if (!ret)
-		goto free_mask;
+		goto free_hw_thread_mask;
+	ret = zalloc_cpumask_var(&intrs_mask, GFP_KERNEL);
+	if (!ret)
+		goto free_available_mask;
 
-	spin_lock(&node_affinity.lock);
+	spin_lock(&affinity->lock);
 	/*
-	 * If we've used all available CPUs, clear the mask and start
+	 * If we've used all available HW threads, clear the mask and start
 	 * overloading.
 	 */
 	if (cpumask_equal(&set->mask, &set->used)) {
@@ -489,82 +551,125 @@ int hfi1_get_proc_affinity(struct hfi1_devdata *dd, int node)
 	 */
 	entry = node_affinity_lookup(node);
 	if (entry) {
-		cpumask_copy(intrs, (entry->def_intr.gen ?
-				     &entry->def_intr.mask :
-				     &entry->def_intr.used));
-		cpumask_or(intrs, intrs, (entry->rcv_intr.gen ?
-					  &entry->rcv_intr.mask :
-					  &entry->rcv_intr.used));
-		cpumask_or(intrs, intrs, &entry->general_intr_mask);
+		cpumask_copy(intrs_mask, (entry->def_intr.gen ?
+					  &entry->def_intr.mask :
+					  &entry->def_intr.used));
+		cpumask_or(intrs_mask, intrs_mask, (entry->rcv_intr.gen ?
+						    &entry->rcv_intr.mask :
+						    &entry->rcv_intr.used));
+		cpumask_or(intrs_mask, intrs_mask, &entry->general_intr_mask);
 	}
 	hfi1_cdbg(PROC, "CPUs used by interrupts: %*pbl",
-		  cpumask_pr_args(intrs));
+		  cpumask_pr_args(intrs_mask));
+
+	cpumask_copy(hw_thread_mask, &set->mask);
 
 	/*
-	 * If we don't have a NUMA node requested, preference is towards
-	 * device NUMA node
+	 * If HT cores are enabled, identify which HW threads within the
+	 * physical cores should be used.
 	 */
-	if (node == -1)
-		node = dd->node;
+	if (affinity->num_core_siblings > 0) {
+		for (i = 0; i < affinity->num_core_siblings; i++) {
+			find_hw_thread_mask(i, hw_thread_mask, affinity);
+
+			/*
+			 * If there's at least one available core for this HW
+			 * thread number, stop looking for a core.
+			 *
+			 * diff will always be not empty at least once in this
+			 * loop as the used mask gets reset when
+			 * (set->mask == set->used) before this loop.
+			 */
+			cpumask_andnot(diff, hw_thread_mask, &set->used);
+			if (!cpumask_empty(diff))
+				break;
+		}
+	}
+	hfi1_cdbg(PROC, "Same available HW thread on all physical CPUs: %*pbl",
+		  cpumask_pr_args(hw_thread_mask));
+
 	node_mask = cpumask_of_node(node);
-	hfi1_cdbg(PROC, "device on NUMA %u, CPUs %*pbl", node,
+	hfi1_cdbg(PROC, "Device on NUMA %u, CPUs %*pbl", node,
 		  cpumask_pr_args(node_mask));
 
-	/* diff will hold all unused cpus */
-	cpumask_andnot(diff, &set->mask, &set->used);
-	hfi1_cdbg(PROC, "unused CPUs (all) %*pbl", cpumask_pr_args(diff));
-
-	/* get cpumask of available CPUs on preferred NUMA */
-	cpumask_and(mask, diff, node_mask);
-	hfi1_cdbg(PROC, "available cpus on NUMA %*pbl", cpumask_pr_args(mask));
+	/* Get cpumask of available CPUs on preferred NUMA */
+	cpumask_and(available_mask, hw_thread_mask, node_mask);
+	cpumask_andnot(available_mask, available_mask, &set->used);
+	hfi1_cdbg(PROC, "Available CPUs on NUMA %u: %*pbl", node,
+		  cpumask_pr_args(available_mask));
 
 	/*
 	 * At first, we don't want to place processes on the same
-	 * CPUs as interrupt handlers.
+	 * CPUs as interrupt handlers. Then, CPUs running interrupt
+	 * handlers are used.
+	 *
+	 * 1) If diff is not empty, then there are CPUs not running
+	 *    non-interrupt handlers available, so diff gets copied
+	 *    over to available_mask.
+	 * 2) If diff is empty, then all CPUs not running interrupt
+	 *    handlers are taken, so available_mask contains all
+	 *    available CPUs running interrupt handlers.
+	 * 3) If available_mask is empty, then all CPUs on the
+	 *    preferred NUMA node are taken, so other NUMA nodes are
+	 *    used for process assignments using the same method as
+	 *    the preferred NUMA node.
 	 */
-	cpumask_andnot(diff, mask, intrs);
+	cpumask_andnot(diff, available_mask, intrs_mask);
 	if (!cpumask_empty(diff))
-		cpumask_copy(mask, diff);
+		cpumask_copy(available_mask, diff);
 
-	/*
-	 * if we don't have a cpu on the preferred NUMA, get
-	 * the list of the remaining available CPUs
-	 */
-	if (cpumask_empty(mask)) {
-		cpumask_andnot(diff, &set->mask, &set->used);
-		cpumask_andnot(mask, diff, node_mask);
+	/* If we don't have CPUs on the preferred node, use other NUMA nodes */
+	if (cpumask_empty(available_mask)) {
+		cpumask_andnot(available_mask, hw_thread_mask, &set->used);
+		/* Excluding preferred NUMA cores */
+		cpumask_andnot(available_mask, available_mask, node_mask);
+		hfi1_cdbg(PROC,
+			  "Preferred NUMA node cores are taken, cores available in other NUMA nodes: %*pbl",
+			  cpumask_pr_args(available_mask));
+
+		/*
+		 * At first, we don't want to place processes on the same
+		 * CPUs as interrupt handlers.
+		 */
+		cpumask_andnot(diff, available_mask, intrs_mask);
+		if (!cpumask_empty(diff))
+			cpumask_copy(available_mask, diff);
 	}
-	hfi1_cdbg(PROC, "possible CPUs for process %*pbl",
-		  cpumask_pr_args(mask));
+	hfi1_cdbg(PROC, "Possible CPUs for process: %*pbl",
+		  cpumask_pr_args(available_mask));
 
-	cpu = cpumask_first(mask);
+	cpu = cpumask_first(available_mask);
 	if (cpu >= nr_cpu_ids) /* empty */
 		cpu = -1;
 	else
 		cpumask_set_cpu(cpu, &set->used);
-	spin_unlock(&node_affinity.lock);
-
-	free_cpumask_var(intrs);
-free_mask:
-	free_cpumask_var(mask);
+	spin_unlock(&affinity->lock);
+	hfi1_cdbg(PROC, "Process assigned to CPU %d", cpu);
+
+	free_cpumask_var(intrs_mask);
+free_available_mask:
+	free_cpumask_var(available_mask);
+free_hw_thread_mask:
+	free_cpumask_var(hw_thread_mask);
 free_diff:
 	free_cpumask_var(diff);
 done:
 	return cpu;
 }
 
-void hfi1_put_proc_affinity(struct hfi1_devdata *dd, int cpu)
+void hfi1_put_proc_affinity(int cpu)
 {
-	struct cpu_mask_set *set = &node_affinity.proc;
+	struct hfi1_affinity_node_list *affinity = &node_affinity;
+	struct cpu_mask_set *set = &affinity->proc;
 
 	if (cpu < 0)
 		return;
-	spin_lock(&node_affinity.lock);
+	spin_lock(&affinity->lock);
 	cpumask_clear_cpu(cpu, &set->used);
+	hfi1_cdbg(PROC, "Returning CPU %d for future process assignment", cpu);
 	if (cpumask_empty(&set->used) && set->gen) {
 		set->gen--;
 		cpumask_copy(&set->used, &set->mask);
 	}
-	spin_unlock(&node_affinity.lock);
+	spin_unlock(&affinity->lock);
 }
-
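
As a worked example of the arithmetic in find_hw_thread_mask() above,
take an assumed topology of 2 NUMA nodes with 14 physical cores each
and 2 HW threads per core (56 online CPUs): num_cores_per_socket is
56 / 2 / 2 = 14, so the mask keeps the first 14 * 2 = 28 CPUs and
shifts them left by 28 * hw_thread_no. This hypothetical standalone
program (not driver code) prints the resulting ranges, assuming the
common enumeration where all first HT siblings precede all second
siblings:

	#include <stdio.h>

	int main(void)
	{
		/* Assumed topology: 2 nodes x 14 cores x 2 HW threads. */
		int online_cpus = 56, siblings = 2, nodes = 2;
		int cores_per_socket = online_cpus / siblings / nodes; /* 14 */
		int span = cores_per_socket * nodes;                   /* 28 */
		int hw_thread;

		for (hw_thread = 0; hw_thread < siblings; hw_thread++)
			printf("HW thread %d -> CPUs %d-%d\n", hw_thread,
			       span * hw_thread,
			       span * (hw_thread + 1) - 1);
		return 0;
	}

	Output:
		HW thread 0 -> CPUs 0-27
		HW thread 1 -> CPUs 28-55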
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index 003860e..f784de5 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -73,7 +73,6 @@ struct cpu_mask_set {
 struct hfi1_affinity {
 	struct cpu_mask_set def_intr;
 	struct cpu_mask_set rcv_intr;
-	struct cpu_mask_set proc;
 	struct cpumask real_cpu_mask;
 	/* spin lock to protect affinity struct */
 	spinlock_t lock;
@@ -99,9 +98,9 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *, struct hfi1_msix_entry *);
  * Determine a CPU affinity for a user process, if the process does not
  * have an affinity set yet.
  */
-int hfi1_get_proc_affinity(struct hfi1_devdata *, int);
+int hfi1_get_proc_affinity(int);
 /* Release a CPU used by a user process. */
-void hfi1_put_proc_affinity(struct hfi1_devdata *, int);
+void hfi1_put_proc_affinity(int);
 
 struct hfi1_affinity_node {
 	int node;
@@ -115,6 +114,9 @@ struct hfi1_affinity_node_list {
 	struct list_head list;
 	struct cpumask real_cpu_mask;
 	struct cpu_mask_set proc;
+	int num_core_siblings;
+	int num_online_nodes;
+	int num_online_cpus;
 	/* protect affinity node list */
 	spinlock_t lock;
 };
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 2f097d9..d7c07bc 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -715,7 +715,7 @@ static int hfi1_file_close(struct inode *inode, struct file *fp)
 	hfi1_user_sdma_free_queues(fdata);
 
 	/* release the cpu */
-	hfi1_put_proc_affinity(dd, fdata->rec_cpu_num);
+	hfi1_put_proc_affinity(fdata->rec_cpu_num);
 
 	/*
 	 * Clear any left over, unhandled events so the next process that
@@ -815,9 +815,10 @@ static int assign_ctxt(struct file *fp, struct hfi1_user_info *uinfo)
 		ret = find_shared_ctxt(fp, uinfo);
 		if (ret < 0)
 			goto done_unlock;
-		if (ret)
-			fd->rec_cpu_num = hfi1_get_proc_affinity(
-				fd->uctxt->dd, fd->uctxt->numa_id);
+		if (ret) {
+			fd->rec_cpu_num =
+				hfi1_get_proc_affinity(fd->uctxt->numa_id);
+		}
 	}
 
 	/*
@@ -929,7 +930,11 @@ static int allocate_ctxt(struct file *fp, struct hfi1_devdata *dd,
 	if (ctxt == dd->num_rcv_contexts)
 		return -EBUSY;
 
-	fd->rec_cpu_num = hfi1_get_proc_affinity(dd, -1);
+	/*
+	 * If we don't have a NUMA node requested, preference is towards
+	 * device NUMA node.
+	 */
+	fd->rec_cpu_num = hfi1_get_proc_affinity(dd->node);
 	if (fd->rec_cpu_num != -1)
 		numa = cpu_to_node(fd->rec_cpu_num);
 	else


* Re: [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8
       [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
                     ` (17 preceding siblings ...)
  2016-07-01 23:02   ` [PATCH for-next 18/18] IB/rdmavt: Use new driver specific " Dennis Dalessandro
@ 2016-08-02 19:58   ` Doug Ledford
  18 siblings, 0 replies; 23+ messages in thread
From: Doug Ledford @ 2016-08-02 19:58 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Jason Gunthorpe, Mike Marciniszyn, Dean Luick, Jakub Pawlak,
	Tadeusz Struk, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Ira Weiny,
	Mitko Haralanov, Ashutosh Dixit, Easwar Hariharan,
	Sebastian Sanchez, Jubin John, Jianxin Xiong

On Fri, 2016-07-01 at 16:00 -0700, Dennis Dalessandro wrote:
> Hi Doug,
> 
> Here is a set of fixes and improvements for the next release. They
> apply on top of the last set of RC fixes previously posted.
> 
> Of particular note in here is the twsi code cleanup that was asked
> for previously while we were in staging. I think this does the job
> of not duplicating what is already present in the kernel. These are
> the two patches from Dean.
> 
> The patches from Mike improve rdmavt and make the posting of sends
> more friendly to work with and extend.
> 
> There are also performance improvement patches in this bunch, as
> well as a couple of minor fixes that we felt are more appropriate
> for the next merge cycle rather than RC.
> 
> These patches have been added to my GitHub branch and have passed
> zero-day builds.
> 
> https://github.com/ddalessa/kernel/tree/for-4.8
> 
> 

Hi Denny,

These are in.  I had to rebase in order to rip out v1 of patches 8, 9,
and 10 and put the v2 versions in their place, but that's done as well.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
              GPG KeyID: 0E572FDD



end of thread (newest message: 2016-08-02 19:58 UTC)

Thread overview: 23+ messages
2016-07-01 23:00 [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8 Dennis Dalessandro
     [not found] ` <20160701225824.20160.19055.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
2016-07-01 23:00   ` [PATCH for-next 01/18] IB/hfi1: Clean up port state structure definition Dennis Dalessandro
2016-07-01 23:00   ` [PATCH for-next 02/18] IB/hfi1: Remove unnecessary done label in hfi1_write_iter Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 03/18] IB/hfi1: Fix typo Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 04/18] IB/hfi1: Separate tracepoints into specific headers Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 05/18] IB/hfi1: Fix trace sparse errors Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 06/18] IB/hfi1: Add VL XmitDiscards counters to the opapmaquery Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 07/18] IB/hfi1: Add counter to track unsupported packets drop Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 08/18] IB/hfi1: Add global structure for affinity assignments Dennis Dalessandro
     [not found]     ` <20160701230127.20160.68709.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
2016-07-25 14:52       ` [PATCH v2] " Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 09/18] IB/hfi1: Reserve and collapse CPU cores for contexts Dennis Dalessandro
     [not found]     ` <20160701230133.20160.76302.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
2016-07-25 14:54       ` [PATCH v2] " Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 10/18] IB/hfi1: Refine user process affinity algorithm Dennis Dalessandro
     [not found]     ` <20160701230138.20160.5753.stgit-9QXIwq+3FY+1XWohqUldA0EOCMrvLtNR@public.gmane.org>
2016-07-25 14:54       ` [PATCH v2] " Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 11/18] IB/hfi1: Use built-in i2c bit-shift bus adapter Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 12/18] IB/hfi1: Remove TWSI references Dennis Dalessandro
2016-07-01 23:01   ` [PATCH for-next 13/18] IB/hfi1: Improve SDMA engine assignment for user SDMA Dennis Dalessandro
2016-07-01 23:02   ` [PATCH for-next 14/18] IB/hfi1: Correct receive packet handler assignment Dennis Dalessandro
2016-07-01 23:02   ` [PATCH for-next 15/18] IB/rdmavt: Add data structures and routines for table driven post send Dennis Dalessandro
2016-07-01 23:02   ` [PATCH for-next 16/18] IB/hfi1: Add hfi1 post send tables Dennis Dalessandro
2016-07-01 23:02   ` [PATCH for-next 17/18] IB/qib: Add qib post send table Dennis Dalessandro
2016-07-01 23:02   ` [PATCH for-next 18/18] IB/rdmavt: Use new driver specific " Dennis Dalessandro
2016-08-02 19:58   ` [PATCH for-next 00/18] IB/hfi1, rdmavt, qib: First batch of fixes for 4.8 Doug Ledford
