* [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
@ 2013-03-07 17:11 Or Gerlitz
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Here's V3 of the IPoIB TSS/RSS patch series. It is basically very similar to V2,
with a fix for one issue we stepped over while testing V2, and with the feedback
provided by Sean on the QP groups concept addressed.

The concept of QP groups for TSS/RSS was introduced at the 2012 OFA conference;
you can take a look at slides 10-14 of the user mode ethernet session. The author
didn't use the terms RSS/TSS, but that's the intention... see

https://openfabrics.org/resources/document-downloads/presentations/cat_view/57-ofa-documents/23-presentations/81-openfabrics-international-workshops/104-2012-ofa-international-workshop/107-2012-ofa-intl-workshop-wednesday.html 

V2 http://marc.info/?l=linux-rdma&m=136007935605406&w=2
V1 http://marc.info/?l=linux-rdma&m=133881081520248&w=2
V0 http://marc.info/?l=linux-rdma&m=133649429821312&w=2

V3 changes:

 - rebased to 3.9-rc1

 - fixed a few sparse errors in patch #3

 - implemented Sean Hefty's suggestion: don't allow modifying the parent QP state
   before all RSS/TSS children have been created. Similarly, disallow destroying the
   parent QP unless all RSS/TSS children have been destroyed.

 - solved a race condition where creation of an ipoib_neigh was attempted from more
   than one TX context; the change was merged into patch #3

V2 changes:

 - added pre-patch correcting the ipoib_neigh hash function

 - ported to infiniband tree / for-next branch 

 - following commit b63b70d877 "IPoIB: Use a private hash table for path lookup in xmit path"
   from kernel 3.6, the TX select queue logic for UD neighbours was changed to be based on
   "full" hashing a la skb_tx_hash, which covers L4 too, whereas in V1 the queue selection
   was done at the neighbour level. This means that different sessions (TCP/UDP five-tuples)
   map to different TX rings, subject to hashing (see the sketch after this list).

 - for CM neighbours, the queue selection uses the destination IPoIB HW address as the base
   for hashing. Previously each ipoib_neigh was assigned a running index upon creation
   and that neighbour was accessed during select queue. Now, we want to issue only
   ONE ipoib_neigh lookup in the xmit path, and we do it in start_xmit.

 - added patch #6 to allow the number of TX and RX rings to be changed at runtime,
   by supporting the ethtool directives to get/set the number of channels. Code which
   is common to device cleanup and device reinit was moved from "ipoib_dev_cleanup"
   to "ipoib_dev_uninit".
       
 - CM TX completions are spread among the CQs (for NAPI) using a hash of the destination
   IPoIB HW address.

 - use netif_tx bh locking in ipoib_cm_handle_tx_wc and drain_tx_cq. Also, in
   drain_tx_cq, revert from subqueue locking to full locking, since
   __netif_tx_lock doesn't set __QUEUE_STATE_FROZEN_BIT.

 - handle the rare case where the device CM "state" ipoib_cm_admin_enabled() status
   changes between the time select queue was done and the time the transmit routine is
   called.

 - fixed a race in the CM RX drain/reap logic caused by the change to multiple
   rings; added a detailed comment in ipoib_cm_start_rx_drain to explain the fix.

 - changed the CM code that posts receive buffers (both srq and non-srq flows) to use
   per-ring WR and SGE objects, since buffer re-fill may now happen from different
   NAPI contexts.
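
As a rough illustration of the queue selection policy described above, here is a
minimal sketch (not the code from patch #3; the helper name and the is_cm/daddr
arguments are made up, and it assumes <linux/jhash.h>, <linux/if_infiniband.h>
and <linux/netdevice.h>):

static u16 ipoib_sketch_pick_tx_queue(struct net_device *dev,
				      struct sk_buff *skb,
				      const u8 *daddr, bool is_cm)
{
	/* CM: hash the 20-byte destination IPoIB HW address so a given
	 * connection always maps to the same TX ring / CQ.
	 */
	if (is_cm)
		return jhash(daddr, INFINIBAND_ALEN, 0) %
		       dev->real_num_tx_queues;

	/* UD: L3/L4 based spreading, a la skb_tx_hash() */
	return skb_tx_hash(dev, skb);
}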

V1 changes:

 - removed accepted patches, the first three of the V0 series
 - fixed crash in the driver EQ teardown flow - merged by commit 3aac6ff "IB/mlx4: Fix EQ deallocation in legacy mode"
 - removed wrong setting done in the ehca driver in ehca_create_srq
 - fixed user space QP creation to specify QPG_NONE
 - fixed usage of wrong API for netif queues stopping in patch 3/4 (V0 6/7)
 - fixed use-after-free of device attr pointer in patch 4/4 (V0 7/7)

* Add support for RSS and TSS for UD.
        The number of RSS and TSS queues is a function of the number
        of cores and of the HW capability.

* Utilize multi-core CPUs and the NIC's multi-queuing in order to increase
        throughput. This utilizes a new "QP Group" concept. A QP group is
        a set of QPs consisting of a parent QP and two disjoint subsets of
        RSS and TSS QPs.

* If RSS is supported by HW then the number of RSS queues is the smallest
        power of two greater than or equal to the number of cores.
        Otherwise the number is one.

* If TSS is supported by HW then the number of TSS queues is the smallest
        power of two greater than or equal to the number of cores.
        Otherwise it is one more than that (the extra queue drives the
        parent QP); see the sketch after this list.

* Transmission and reception in CM mode use a send queue and a receive queue
        assigned to each CM instance at creation time.

* Advertise that packets sent from a set of QPs will be accepted. That is,
        received packets with a source QPN different from the QPN
        advertised with ARP will be accepted.

* The advertisement is done by setting a third bit in the flags part
        of the link layer address. This is similar to RFC 4755
        section 3.1 (CM advertisement).

* If TSS is not supported by HW then transmission of multicast packets
        is done using device queue N, and thus the parent QP, which is
        also the advertised QP.

* If TSS is not supported by HW then SW TSS is used only if the peer
        advertised that it will accept TSS packets.

* Drivers can now use a larger portion of the device vectors/IRQs.
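
For illustration only, the ring-count rules above roughly translate to the
following sketch (the helper name is invented, the capability bits are the ones
added in patch 1/5, and the "+1" in the SW TSS case follows the description
above; this is not the patch code):

static void ipoib_sketch_ring_counts(struct ib_device_attr *attr,
				     unsigned int *num_rx, unsigned int *num_tx)
{
	unsigned int cores = num_online_cpus();

	/* RSS: smallest power of two >= number of cores, if the HW can do it */
	*num_rx = (attr->device_cap_flags & IB_DEVICE_UD_RSS) ?
		  roundup_pow_of_two(cores) : 1;

	if (attr->device_cap_flags & IB_DEVICE_UD_TSS)
		*num_tx = roundup_pow_of_two(cores);
	else
		/* SW TSS: one extra queue (queue N) drives the parent QP,
		 * which carries multicast and is the advertised QPN.
		 */
		*num_tx = roundup_pow_of_two(cores) + 1;
}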




Shlomo Pongratz (5):
  IB/core: Add RSS and TSS QP groups
  IB/mlx4: Add support for RSS and TSS QP groups
  IB/ipoib: Move to multi-queue device
  IB/ipoib: Add RSS and TSS support for datagram mode
  IB/ipoib: Support changing the number of RX/TX rings with ethtool

 drivers/infiniband/core/uverbs_cmd.c           |    1 +
 drivers/infiniband/core/verbs.c                |  118 +++++
 drivers/infiniband/hw/amso1100/c2_provider.c   |    3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c    |    2 +
 drivers/infiniband/hw/cxgb4/qp.c               |    3 +
 drivers/infiniband/hw/ehca/ehca_qp.c           |    3 +
 drivers/infiniband/hw/ipath/ipath_qp.c         |    3 +
 drivers/infiniband/hw/mlx4/main.c              |    5 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h           |   13 +
 drivers/infiniband/hw/mlx4/qp.c                |  344 ++++++++++++-
 drivers/infiniband/hw/mthca/mthca_provider.c   |    3 +
 drivers/infiniband/hw/nes/nes_verbs.c          |    3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c    |    5 +
 drivers/infiniband/hw/qib/qib_qp.c             |    5 +
 drivers/infiniband/ulp/ipoib/ipoib.h           |  118 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  206 +++++---
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |  160 ++++++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  550 ++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  523 +++++++++++++++++---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   44 ++-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  662 +++++++++++++++++++++---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 include/rdma/ib_verbs.h                        |   40 ++-
 23 files changed, 2388 insertions(+), 428 deletions(-)


* [PATCH V3 for-next 1/5] IB/core: Add RSS and TSS QP groups
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-03-07 17:11   ` Or Gerlitz
  2013-03-07 17:11   ` [PATCH V3 for-next 2/5] IB/mlx4: Add support for " Or Gerlitz
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

RSS (Receive Side Scaling) and TSS (Transmit Side Scaling, better known as
MQ/Multi-Queue) are common networking techniques which make use of
contemporary NICs that support multiple receive and transmit descriptor
queues (multi-queue), see also Documentation/networking/scaling.txt

This patch introduces the concept of RSS and TSS QP groups, which
allows them to be implemented by low level drivers and used
by IPoIB and, later, also by user space ULPs.

A QP group is a set of QPs consisting of a parent QP and two disjoint sets
of RSS and TSS QPs. The creation of a QP group is a two stage process:

In the 1st stage, the parent QP is created.

In the 2nd stage the children QPs of the parent are created.

Each child QP indicates whether it is an RSS or a TSS QP. Both the TSS
and RSS sets of QPs should have contiguous QP numbers.

It is forbidden to modify the parent QP state before all RSS/TSS children
have been created. In the same manner, it is disallowed to destroy the parent
QP unless all RSS/TSS children have been destroyed.
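
As an illustration of the two-stage flow, a minimal sketch of how a kernel ULP
might use the new attributes (pd, the CQs and the ring counts are assumed to
exist; caps and error handling are omitted; this is not code from the series):

	struct ib_qp_init_attr init_attr = {
		.send_cq       = send_cq,
		.recv_cq       = recv_cq,
		.sq_sig_type   = IB_SIGNAL_ALL_WR,
		.qp_type       = IB_QPT_UD,
		/* stage 1: the parent declares how many children will follow */
		.qpg_type      = IB_QPG_PARENT,
		.parent_attrib = {
			.tss_child_count = num_tx_rings,
			.rss_child_count = num_rx_rings,
		},
	};
	struct ib_qp *parent, *child;

	parent = ib_create_qp(pd, &init_attr);

	/* stage 2: each child names its parent and its direction */
	init_attr.qpg_type   = IB_QPG_CHILD_TX;	/* or IB_QPG_CHILD_RX */
	init_attr.qpg_parent = parent;
	child = ib_create_qp(pd, &init_attr);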

A few new elements/concepts are introduced to support this:

Three new device capabilities that can be set by the low level driver:

- IB_DEVICE_QPG which is set to indicate QP groups are supported.

- IB_DEVICE_UD_RSS which is set to indicate that the device supports
RSS, that is, applying a hash function to incoming TCP/UDP/IP packets and
dispatching them to multiple "rings" (child QPs).

- IB_DEVICE_UD_TSS which is set to indicate that the device supports
"HW TSS", which means that the HW is capable of overriding the source
UD QPN present in the sent IB datagram header (DETH) with the parent's QPN.

Low level drivers that do not support HW TSS can still support QP groups; such
a combination is referred to as "SW TSS". In this case, the low level driver
fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from
ib_create_qp, such that this mask can be used to retrieve the parent QPN from
incoming packets carrying a child QPN (relying on the contiguous QP numbers
requirement).
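
A hedged sketch of how an SW TSS receiver could map a child QPN back to the
parent (advertised) QPN using qpg_tss_mask_sz; it relies on the contiguous,
power-of-two aligned QPN range the parent reserves, and the helper name is
made up:

static u32 sw_tss_parent_qpn(u32 src_qpn, u32 qpg_tss_mask_sz)
{
	/* The parent sits at the base of the TSS QPN range, so clearing
	 * the low qpg_tss_mask_sz bits of any child QPN yields the
	 * parent QPN.
	 */
	return src_qpn & ~((1U << qpg_tss_mask_sz) - 1);
}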

- max rss table size device attribute, which is the maximal size of the RSS
indirection table supported by the device

- qp group type attribute for qp creation, saying whether this is a parent QP,
an rx/tx (rss/tss) child QP, or none of the above for non rss/tss QPs.

- per qp group type, another attribute is added: for parent QPs, the number
of rx/tx child QPs, and for child QPs, a pointer to the parent.

- IB_QP_GROUP_RSS attribute mask, which should be used when modifying
the parent QP state from reset to init (see the sketch below).
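
For example, taking the parent QP from RESET to INIT might then look like this
(a sketch only; parent, port and qkey are assumed, and the call is legal only
once every RSS/TSS child QP has been created):

	struct ib_qp_attr qp_attr = {
		.qp_state   = IB_QPS_INIT,
		.pkey_index = 0,
		.port_num   = port,
		.qkey       = qkey,
	};
	int ret;

	ret = ib_modify_qp(parent, &qp_attr,
			   IB_QP_STATE | IB_QP_PKEY_INDEX | IB_QP_PORT |
			   IB_QP_QKEY | IB_QP_GROUP_RSS);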

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c         |    1 +
 drivers/infiniband/core/verbs.c              |  118 ++++++++++++++++++++++++++
 drivers/infiniband/hw/amso1100/c2_provider.c |    3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 +
 drivers/infiniband/hw/cxgb4/qp.c             |    3 +
 drivers/infiniband/hw/ehca/ehca_qp.c         |    3 +
 drivers/infiniband/hw/ipath/ipath_qp.c       |    3 +
 drivers/infiniband/hw/mlx4/qp.c              |    3 +
 drivers/infiniband/hw/mthca/mthca_provider.c |    3 +
 drivers/infiniband/hw/nes/nes_verbs.c        |    3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    5 +
 drivers/infiniband/hw/qib/qib_qp.c           |    5 +
 include/rdma/ib_verbs.h                      |   40 ++++++++-
 13 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 3983a05..b41e7b2 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1582,6 +1582,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 	attr.sq_sig_type   = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
 	attr.qp_type       = cmd.qp_type;
 	attr.create_flags  = 0;
+	attr.qpg_type	   = IB_QPG_NONE;
 
 	attr.cap.max_send_wr     = cmd.max_send_wr;
 	attr.cap.max_recv_wr     = cmd.max_recv_wr;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index a8fdd33..f40f194 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -406,12 +406,98 @@ struct ib_qp *ib_open_qp(struct ib_xrcd *xrcd,
 }
 EXPORT_SYMBOL(ib_open_qp);
 
+static int ib_qpg_verify(struct ib_qp_init_attr *qp_init_attr)
+{
+	/* RSS/TSS QP group basic validation */
+	struct ib_qp *parent;
+	struct ib_qpg_init_attrib *attr;
+	struct ib_qpg_attr *pattr;
+
+	switch (qp_init_attr->qpg_type) {
+	case IB_QPG_PARENT:
+		attr = &qp_init_attr->parent_attrib;
+		if (attr->tss_child_count == 1)
+			return -EINVAL; /* doesn't make sense */
+		if (attr->rss_child_count == 1)
+			return -EINVAL; /* doesn't make sense */
+		if ((attr->tss_child_count == 0) &&
+		    (attr->rss_child_count == 0))
+			/* should be called with IB_QPG_NONE */
+			return -EINVAL;
+		break;
+	case IB_QPG_CHILD_RX:
+		parent = qp_init_attr->qpg_parent;
+		if (!parent || parent->qpg_type != IB_QPG_PARENT)
+			return -EINVAL;
+		pattr = &parent->qpg_attr.parent_attr;
+		if (!pattr->rss_child_count)
+			return -EINVAL;
+		if (atomic_read(&pattr->rsscnt) >= pattr->rss_child_count)
+			return -EINVAL;
+		break;
+	case IB_QPG_CHILD_TX:
+		parent = qp_init_attr->qpg_parent;
+		if (!parent || parent->qpg_type != IB_QPG_PARENT)
+			return -EINVAL;
+		pattr = &parent->qpg_attr.parent_attr;
+		if (!pattr->tss_child_count)
+			return -EINVAL;
+		if (atomic_read(&pattr->tsscnt) >= pattr->tss_child_count)
+			return -EINVAL;
+		break;
+	default:
+		break;
+	}
+
+	return 0;
+}
+
+static void ib_init_qpg(struct ib_qp_init_attr *qp_init_attr, struct ib_qp *qp)
+{
+	struct ib_qp *parent;
+	struct ib_qpg_init_attrib *attr;
+	struct ib_qpg_attr *pattr;
+
+	qp->qpg_type = qp_init_attr->qpg_type;
+
+	/* qp was created without an error, parameters are OK */
+	switch (qp_init_attr->qpg_type) {
+	case IB_QPG_PARENT:
+		attr = &qp_init_attr->parent_attrib;
+		pattr = &qp->qpg_attr.parent_attr;
+		pattr->rss_child_count = attr->rss_child_count;
+		pattr->tss_child_count = attr->tss_child_count;
+		atomic_set(&pattr->rsscnt, 0);
+		atomic_set(&pattr->tsscnt, 0);
+		break;
+	case IB_QPG_CHILD_RX:
+		parent = qp_init_attr->qpg_parent;
+		qp->qpg_attr.parent = parent;
+		/* update parent's counter */
+		pattr = &parent->qpg_attr.parent_attr;
+		atomic_inc(&pattr->rsscnt);
+		break;
+	case IB_QPG_CHILD_TX:
+		parent = qp_init_attr->qpg_parent;
+		qp->qpg_attr.parent = parent;
+		/* update parent's counter */
+		pattr = &parent->qpg_attr.parent_attr;
+		atomic_inc(&pattr->tsscnt);
+		break;
+	default:
+		break;
+	}
+}
+
 struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			   struct ib_qp_init_attr *qp_init_attr)
 {
 	struct ib_qp *qp, *real_qp;
 	struct ib_device *device;
 
+	if (ib_qpg_verify(qp_init_attr))
+		return ERR_PTR(-EINVAL);
+
 	device = pd ? pd->device : qp_init_attr->xrcd->device;
 	qp = device->create_qp(pd, qp_init_attr, NULL);
 
@@ -460,6 +546,8 @@ struct ib_qp *ib_create_qp(struct ib_pd *pd,
 			atomic_inc(&pd->usecnt);
 			atomic_inc(&qp_init_attr->send_cq->usecnt);
 		}
+
+		ib_init_qpg(qp_init_attr, qp);
 	}
 
 	return qp;
@@ -496,6 +584,9 @@ static const struct {
 						IB_QP_QKEY),
 				[IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
 						IB_QP_QKEY),
+			},
+			.opt_param = {
+				[IB_QPT_UD]  = IB_QP_GROUP_RSS
 			}
 		},
 	},
@@ -805,6 +896,13 @@ int ib_modify_qp(struct ib_qp *qp,
 		 struct ib_qp_attr *qp_attr,
 		 int qp_attr_mask)
 {
+	if (qp->qpg_type == IB_QPG_PARENT) {
+		struct ib_qpg_attr *pattr = &qp->qpg_attr.parent_attr;
+		if (atomic_read(&pattr->rsscnt) < pattr->rss_child_count)
+			return -EINVAL;
+		if (atomic_read(&pattr->tsscnt) < pattr->tss_child_count)
+			return -EINVAL;
+	}
 	return qp->device->modify_qp(qp->real_qp, qp_attr, qp_attr_mask, NULL);
 }
 EXPORT_SYMBOL(ib_modify_qp);
@@ -878,6 +976,15 @@ int ib_destroy_qp(struct ib_qp *qp)
 	if (atomic_read(&qp->usecnt))
 		return -EBUSY;
 
+	if (qp->qpg_type == IB_QPG_PARENT) {
+		/* All children should have been deleted by now */
+		struct ib_qpg_attr *pattr = &qp->qpg_attr.parent_attr;
+		if (atomic_read(&pattr->rsscnt))
+			return -EINVAL;
+		if (atomic_read(&pattr->tsscnt))
+			return -EINVAL;
+	}
+
 	if (qp->real_qp != qp)
 		return __ib_destroy_shared_qp(qp);
 
@@ -896,6 +1003,17 @@ int ib_destroy_qp(struct ib_qp *qp)
 			atomic_dec(&rcq->usecnt);
 		if (srq)
 			atomic_dec(&srq->usecnt);
+
+		if (qp->qpg_type == IB_QPG_CHILD_RX ||
+		    qp->qpg_type == IB_QPG_CHILD_TX) {
+			/* decrement parent's counters */
+			struct ib_qp *pqp = qp->qpg_attr.parent;
+			struct ib_qpg_attr *pattr = &pqp->qpg_attr.parent_attr;
+			if (qp->qpg_type == IB_QPG_CHILD_RX)
+				atomic_dec(&pattr->rsscnt);
+			else
+				atomic_dec(&pattr->tsscnt);
+		}
 	}
 
 	return ret;
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 07eb3a8..546760b 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -241,6 +241,9 @@ static struct ib_qp *c2_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 074d5c2..a8d0752 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -905,6 +905,8 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
 	PDBG("%s ib_pd %p\n", __func__, pd);
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
 	php = to_iwch_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 17ba4f8..db71190 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1489,6 +1489,9 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
 
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct c4iw_cq *)attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..2df7584 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -464,6 +464,9 @@ static struct ehca_qp *internal_create_qp(
 	int is_llqp = 0, has_srq = 0, is_user = 0;
 	int qp_type, max_send_sge, max_recv_sge, ret;
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
 	u32 swqe_size = 0, rwqe_size = 0, ib_qp_num;
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 0857a9c..117b775 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -755,6 +755,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	if (init_attr->cap.max_send_sge > ib_ipath_max_sges ||
 	    init_attr->cap.max_send_wr > ib_ipath_max_qp_wrs) {
 		ret = ERR_PTR(-EINVAL);
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 35cced2..c58dbdc 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -998,6 +998,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	      init_attr->qp_type > IB_QPT_GSI)))
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
 		pd = to_mxrcd(init_attr->xrcd)->pd;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 5b71d43..120aa1e 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -518,6 +518,9 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 	case IB_QPT_UC:
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 8f67fe2..dfae39a 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1134,6 +1134,9 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	atomic_inc(&qps_created);
 	switch (init_attr->qp_type) {
 		case IB_QPT_RC:
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index b29a424..7c3e0ce 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -841,6 +841,11 @@ static int ocrdma_check_qp_params(struct ib_pd *ibpd, struct ocrdma_dev *dev,
 			   __func__, dev->id, attrs->qp_type);
 		return -EINVAL;
 	}
+	if (attrs->qpg_type != IB_QPG_NONE) {
+		ocrdma_err("%s(%d) unsupported qpg type=0x%x requested\n",
+			   __func__, dev->id, attrs->qpg_type);
+			   return -ENOSYS;
+	}
 	if (attrs->cap.max_send_wr > dev->attr.max_wqe) {
 		ocrdma_err("%s(%d) unsupported send_wr=0x%x requested\n",
 			   __func__, dev->id, attrs->cap.max_send_wr);
diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
index a6a2cc2..eda3f93 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -986,6 +986,11 @@ struct ib_qp *qib_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE) {
+		ret = ERR_PTR(-ENOSYS);
+		goto bail;
+	}
+
 	/* Check receive queue parameters if no SRQ is specified. */
 	if (!init_attr->srq) {
 		if (init_attr->cap.max_recv_sge > ib_qib_max_sges ||
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 98cc4b2..9317e76 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -116,7 +116,10 @@ enum ib_device_cap_flags {
 	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<21),
 	IB_DEVICE_BLOCK_MULTICAST_LOOPBACK = (1<<22),
 	IB_DEVICE_MEM_WINDOW_TYPE_2A	= (1<<23),
-	IB_DEVICE_MEM_WINDOW_TYPE_2B	= (1<<24)
+	IB_DEVICE_MEM_WINDOW_TYPE_2B	= (1<<24),
+	IB_DEVICE_QPG			= (1<<25),
+	IB_DEVICE_UD_RSS		= (1<<26),
+	IB_DEVICE_UD_TSS		= (1<<27)
 };
 
 enum ib_atomic_cap {
@@ -164,6 +167,7 @@ struct ib_device_attr {
 	int			max_srq_wr;
 	int			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
+	int			max_rss_tbl_sz;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 };
@@ -586,6 +590,7 @@ struct ib_qp_cap {
 	u32	max_send_sge;
 	u32	max_recv_sge;
 	u32	max_inline_data;
+	u32	qpg_tss_mask_sz;
 };
 
 enum ib_sig_type {
@@ -621,6 +626,18 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_RESERVED_END		= 1 << 31,
 };
 
+enum ib_qpg_type {
+	IB_QPG_NONE	= 0,
+	IB_QPG_PARENT	= (1<<0),
+	IB_QPG_CHILD_RX = (1<<1),
+	IB_QPG_CHILD_TX = (1<<2)
+};
+
+struct ib_qpg_init_attrib {
+	u32 tss_child_count;
+	u32 rss_child_count;
+};
+
 struct ib_qp_init_attr {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void		       *qp_context;
@@ -629,9 +646,14 @@ struct ib_qp_init_attr {
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd;     /* XRC TGT QPs only */
 	struct ib_qp_cap	cap;
+	union {
+		struct ib_qp *qpg_parent; /* see qpg_type */
+		struct ib_qpg_init_attrib parent_attrib;
+	};
 	enum ib_sig_type	sq_sig_type;
 	enum ib_qp_type		qp_type;
 	enum ib_qp_create_flags	create_flags;
+	enum ib_qpg_type	qpg_type;
 	u8			port_num; /* special QP types only */
 };
 
@@ -698,7 +720,8 @@ enum ib_qp_attr_mask {
 	IB_QP_MAX_DEST_RD_ATOMIC	= (1<<17),
 	IB_QP_PATH_MIG_STATE		= (1<<18),
 	IB_QP_CAP			= (1<<19),
-	IB_QP_DEST_QPN			= (1<<20)
+	IB_QP_DEST_QPN			= (1<<20),
+	IB_QP_GROUP_RSS			= (1<<21)
 };
 
 enum ib_qp_state {
@@ -994,6 +1017,14 @@ struct ib_srq {
 	} ext;
 };
 
+struct ib_qpg_attr {
+	atomic_t rsscnt; /* count open rss children */
+	atomic_t tsscnt; /* count open tss children */
+	u32	 rss_child_count;
+	u32	 tss_child_count;
+};
+
+
 struct ib_qp {
 	struct ib_device       *device;
 	struct ib_pd	       *pd;
@@ -1010,6 +1041,11 @@ struct ib_qp {
 	void		       *qp_context;
 	u32			qp_num;
 	enum ib_qp_type		qp_type;
+	enum ib_qpg_type	qpg_type;
+	union {
+		struct ib_qp   *parent; /* rss/tss parent */
+		struct ib_qpg_attr parent_attr;
+	} qpg_attr;
 };
 
 struct ib_mr {
-- 
1.7.1


* [PATCH V3 for-next 2/5] IB/mlx4: Add support for RSS and TSS QP groups
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-03-07 17:11   ` [PATCH V3 for-next 1/5] IB/core: Add RSS and TSS QP groups Or Gerlitz
@ 2013-03-07 17:11   ` Or Gerlitz
  2013-03-07 17:11   ` [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device Or Gerlitz
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Depending on the mlx4 device capabilities, support the RSS IB device
capability, using the Toeplitz or XOR hash function according to what is
available in the HW. Support creating QP groups where all RX and TX
QPs have contiguous QP numbers.

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |    5 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   13 ++
 drivers/infiniband/hw/mlx4/qp.c      |  345 ++++++++++++++++++++++++++++++++-
 3 files changed, 352 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 23d7343..b29a4b6 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -145,6 +145,11 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 		else
 			props->device_cap_flags |= IB_DEVICE_MEM_WINDOW_TYPE_2A;
 	}
+	props->device_cap_flags |= IB_DEVICE_QPG;
+	if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS) {
+		props->device_cap_flags |= IB_DEVICE_UD_RSS;
+		props->max_rss_tbl_sz = dev->dev->caps.max_rss_tbl_sz;
+	}
 
 	props->vendor_id	   = be32_to_cpup((__be32 *) (out_mad->data + 36)) &
 		0xffffff;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index f61ec26..48aeaab 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -232,6 +232,17 @@ struct mlx4_ib_proxy_sqp_hdr {
 	struct mlx4_rcv_tunnel_hdr tun;
 }  __packed;
 
+struct mlx4_ib_qpg_data {
+	unsigned long *tss_bitmap;
+	unsigned long *rss_bitmap;
+	struct mlx4_ib_qp *qpg_parent;
+	int tss_qpn_base;
+	int rss_qpn_base;
+	u32 tss_child_count;
+	u32 rss_child_count;
+	u32 qpg_tss_mask_sz;
+};
+
 struct mlx4_ib_qp {
 	struct ib_qp		ibqp;
 	struct mlx4_qp		mqp;
@@ -261,6 +272,8 @@ struct mlx4_ib_qp {
 	u8			sq_no_prefetch;
 	u8			state;
 	int			mlx_type;
+	enum ib_qpg_type	qpg_type;
+	struct mlx4_ib_qpg_data *qpg_data;
 	struct list_head	gid_list;
 	struct list_head	steering_rules;
 	struct mlx4_ib_buf	*sqp_proxy_rcv;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index c58dbdc..e504e5f 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -34,6 +34,8 @@
 #include <linux/log2.h>
 #include <linux/slab.h>
 #include <linux/netdevice.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
@@ -593,6 +595,241 @@ static int qp_has_rq(struct ib_qp_init_attr *attr)
 	return !attr->srq;
 }
 
+static int init_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp,
+			   struct ib_qp_init_attr *attr, int *qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data;
+	int tss_num, rss_num;
+	int tss_align_num, rss_align_num;
+	int tss_base, rss_base;
+	int err;
+
+	/* Parent is part of the TSS range (in SW TSS ARP is sent via parent) */
+	tss_num = 1 + attr->parent_attrib.tss_child_count;
+	tss_align_num = roundup_pow_of_two(tss_num);
+	rss_num = attr->parent_attrib.rss_child_count;
+	rss_align_num = roundup_pow_of_two(rss_num);
+
+	if (rss_num > 1) {
+		/* RSS is requested */
+		if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+			return -ENOSYS;
+		if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+			return -EINVAL;
+		/* We must work with power of two */
+		attr->parent_attrib.rss_child_count = rss_align_num;
+	}
+
+	qpg_data = kzalloc(sizeof(*qpg_data), GFP_KERNEL);
+	if (!qpg_data)
+		return -ENOMEM;
+
+	err = mlx4_qp_reserve_range(dev->dev, tss_align_num,
+				    tss_align_num, &tss_base);
+	if (err)
+		goto err1;
+
+	if (tss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(tss_align_num)  * sizeof(long);
+		qpg_data->tss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->tss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err2;
+		}
+		bitmap_fill(qpg_data->tss_bitmap, tss_num);
+		/* Note parent takes first index */
+		clear_bit(0, qpg_data->tss_bitmap);
+	}
+
+	if (rss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(rss_align_num) * sizeof(long);
+		err = mlx4_qp_reserve_range(dev->dev, rss_align_num,
+					    rss_align_num, &rss_base);
+		if (err)
+			goto err3;
+		qpg_data->rss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->rss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err4;
+		}
+		bitmap_fill(qpg_data->rss_bitmap, rss_align_num);
+	}
+
+	qpg_data->tss_child_count = attr->parent_attrib.tss_child_count;
+	qpg_data->rss_child_count = attr->parent_attrib.rss_child_count;
+	qpg_data->qpg_parent = pqp;
+	qpg_data->qpg_tss_mask_sz = ilog2(tss_align_num);
+	qpg_data->tss_qpn_base = tss_base;
+	qpg_data->rss_qpn_base = rss_base;
+
+	pqp->qpg_data = qpg_data;
+	*qpn = tss_base;
+
+	return 0;
+
+err4:
+	mlx4_qp_release_range(dev->dev, rss_base, rss_align_num);
+
+err3:
+	if (tss_num > 1)
+		kfree(qpg_data->tss_bitmap);
+
+err2:
+	mlx4_qp_release_range(dev->dev, tss_base, tss_align_num);
+
+err1:
+	kfree(qpg_data);
+	return err;
+}
+
+static void free_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp)
+{
+	struct mlx4_ib_qpg_data *qpg_data = pqp->qpg_data;
+	int align_num;
+
+	if (qpg_data->tss_child_count > 1)
+		kfree(qpg_data->tss_bitmap);
+
+	align_num = roundup_pow_of_two(1 + qpg_data->tss_child_count);
+	mlx4_qp_release_range(dev->dev, qpg_data->tss_qpn_base, align_num);
+
+	if (qpg_data->rss_child_count > 1) {
+		kfree(qpg_data->rss_bitmap);
+		align_num = roundup_pow_of_two(qpg_data->rss_child_count);
+		mlx4_qp_release_range(dev->dev, qpg_data->rss_qpn_base,
+				      align_num);
+	}
+
+	kfree(qpg_data);
+}
+
+static int alloc_qpg_qpn(struct ib_qp_init_attr *init_attr,
+			 struct mlx4_ib_qp *pqp, int *qpn)
+{
+	struct mlx4_ib_qp *mqp = to_mqp(init_attr->qpg_parent);
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+	u32 idx, old;
+
+	switch (init_attr->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		if (qpg_data->tss_child_count == 0)
+			return -EINVAL;
+		do {
+			/* Parent took index 0 */
+			idx = find_first_bit(qpg_data->tss_bitmap,
+					     qpg_data->tss_child_count + 1);
+			if (idx >= qpg_data->tss_child_count + 1)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->tss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->tss_qpn_base;
+		break;
+	case IB_QPG_CHILD_RX:
+		if (qpg_data->rss_child_count == 0)
+			return -EINVAL;
+		do {
+			idx = find_first_bit(qpg_data->rss_bitmap,
+					     qpg_data->rss_child_count);
+			if (idx >= qpg_data->rss_child_count)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->rss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->rss_qpn_base;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pqp->qpg_data = qpg_data;
+	*qpn = idx;
+
+	return 0;
+}
+
+static void free_qpg_qpn(struct mlx4_ib_qp *mqp, int qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+
+	switch (mqp->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		/* Do range check */
+		qpn -= qpg_data->tss_qpn_base;
+		set_bit(qpn, qpg_data->tss_bitmap);
+		break;
+	case IB_QPG_CHILD_RX:
+		qpn -= qpg_data->rss_qpn_base;
+		set_bit(qpn, qpg_data->rss_bitmap);
+		break;
+	default:
+		/* error */
+		pr_warn("wrong qpg type (%d)\n", mqp->qpg_type);
+		break;
+	}
+}
+
+static int alloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			    struct ib_qp_init_attr *attr, int *qpn)
+{
+	int err = 0;
+
+	switch (attr->qpg_type) {
+	case IB_QPG_NONE:
+		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
+		 * BlueFlame setup flow wrongly causes VLAN insertion. */
+		if (attr->qp_type == IB_QPT_RAW_PACKET)
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, qpn);
+		else
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1, qpn);
+		break;
+	case IB_QPG_PARENT:
+		err = init_qpg_parent(dev, qp, attr, qpn);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		err = alloc_qpg_qpn(attr, qp, qpn);
+		break;
+	default:
+		qp->qpg_type = IB_QPG_NONE;
+		err = -EINVAL;
+		break;
+	}
+	if (err)
+		return err;
+	qp->qpg_type = attr->qpg_type;
+	return 0;
+}
+
+static void free_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			enum ib_qpg_type qpg_type, int qpn)
+{
+	switch (qpg_type) {
+	case IB_QPG_NONE:
+		mlx4_qp_release_range(dev->dev, qpn, 1);
+		break;
+	case IB_QPG_PARENT:
+		free_qpg_parent(dev, qp);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		free_qpg_qpn(qp, qpn);
+		break;
+	default:
+		break;
+	}
+}
+
+/* Revert allocation on create_qp_common */
+static void unalloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			       struct ib_qp_init_attr *attr, int qpn)
+{
+	free_qpn_common(dev, qp, attr->qpg_type, qpn);
+}
+
+static void release_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp)
+{
+	free_qpn_common(dev, qp, qp->qpg_type, qp->mqp.qpn);
+}
+
 static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp **caller_qp)
@@ -760,12 +997,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			}
 		}
 	} else {
-		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
-		 * BlueFlame setup flow wrongly causes VLAN insertion. */
-		if (init_attr->qp_type == IB_QPT_RAW_PACKET)
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, &qpn);
-		else
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn);
+		err = alloc_qpn_common(dev, qp, init_attr, &qpn);
 		if (err)
 			goto err_proxy;
 	}
@@ -790,8 +1022,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_qpn:
-	if (!sqpn)
-		mlx4_qp_release_range(dev->dev, qpn, 1);
+	unalloc_qpn_common(dev, qp, init_attr, qpn);
+
 err_proxy:
 	if (qp->mlx4_ib_qp_type == MLX4_IB_QPT_PROXY_GSI)
 		free_proxy_bufs(pd->device, qp);
@@ -933,7 +1165,7 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	mlx4_qp_free(dev->dev, &qp->mqp);
 
 	if (!is_sqp(dev, qp) && !is_tunnel_qp(dev, qp))
-		mlx4_qp_release_range(dev->dev, qp->mqp.qpn, 1);
+		release_qpn_common(dev, qp);
 
 	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
 
@@ -973,6 +1205,52 @@ static u32 get_sqp_num(struct mlx4_ib_dev *dev, struct ib_qp_init_attr *attr)
 		return dev->dev->caps.qp1_proxy[attr->port_num - 1];
 }
 
+static int check_qpg_attr(struct mlx4_ib_dev *dev,
+			  struct ib_qp_init_attr *attr)
+{
+	if (attr->qpg_type == IB_QPG_NONE)
+		return 0;
+
+	if (attr->qp_type != IB_QPT_UD)
+		return -EINVAL;
+
+	if (attr->qpg_type == IB_QPG_PARENT) {
+		if (attr->parent_attrib.tss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if (attr->parent_attrib.rss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if ((attr->parent_attrib.tss_child_count == 0) &&
+		    (attr->parent_attrib.rss_child_count == 0))
+			/* Should be called with IB_QPG_NONE */
+			return -EINVAL;
+		if (attr->parent_attrib.rss_child_count > 1) {
+			int rss_align_num;
+			if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+				return -ENOSYS;
+			rss_align_num = roundup_pow_of_two(
+					attr->parent_attrib.rss_child_count);
+			if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+				return -EINVAL;
+		}
+	} else {
+		struct mlx4_ib_qpg_data *qpg_data;
+		if (attr->qpg_parent == NULL)
+			return -EINVAL;
+		if (IS_ERR(attr->qpg_parent))
+			return -EINVAL;
+		qpg_data = to_mqp(attr->qpg_parent)->qpg_data;
+		if (qpg_data == NULL)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_TX &&
+		    !qpg_data->tss_child_count)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_RX &&
+		    !qpg_data->rss_child_count)
+			return -EINVAL;
+	}
+	return 0;
+}
+
 struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 				struct ib_qp_init_attr *init_attr,
 				struct ib_udata *udata)
@@ -998,8 +1276,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	      init_attr->qp_type > IB_QPT_GSI)))
 		return ERR_PTR(-EINVAL);
 
-	if (init_attr->qpg_type != IB_QPG_NONE)
-		return ERR_PTR(-ENOSYS);
+	err = check_qpg_attr(to_mdev(pd->device), init_attr);
+	if (err)
+		return ERR_PTR(err);
 
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
@@ -1470,6 +1749,43 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
 		context->rlkey |= (1 << 4);
 
+	if ((attr_mask & IB_QP_GROUP_RSS) &&
+	    (qp->qpg_data->rss_child_count > 1)) {
+		struct mlx4_ib_qpg_data *qpg_data = qp->qpg_data;
+		void *rss_context_base = &context->pri_path;
+		struct mlx4_rss_context *rss_context =
+			(struct mlx4_rss_context *)(rss_context_base
+					+ MLX4_RSS_OFFSET_IN_QPC_PRI_PATH);
+
+		context->flags |= cpu_to_be32(1 << MLX4_RSS_QPC_FLAG_OFFSET);
+
+		/* This should be tbl_sz_base_qpn */
+		rss_context->base_qpn = cpu_to_be32(qpg_data->rss_qpn_base |
+				(ilog2(qpg_data->rss_child_count) << 24));
+		rss_context->default_qpn = cpu_to_be32(qpg_data->rss_qpn_base);
+		/* This should be flags_hash_fn */
+		rss_context->flags = MLX4_RSS_TCP_IPV6 |
+				     MLX4_RSS_TCP_IPV4;
+		if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UDP_RSS) {
+			rss_context->base_qpn_udp = rss_context->default_qpn;
+			rss_context->flags |= MLX4_RSS_IPV6 |
+					MLX4_RSS_IPV4     |
+					MLX4_RSS_UDP_IPV6 |
+					MLX4_RSS_UDP_IPV4;
+		}
+		if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_TOP) {
+			static const u32 rsskey[10] = { 0xD181C62C, 0xF7F4DB5B,
+				0x1983A2FC, 0x943E1ADB, 0xD9389E6B, 0xD1039C2C,
+				0xA74499AD, 0x593D56D9, 0xF3253C06, 0x2ADC1FFC};
+			rss_context->hash_fn = MLX4_RSS_HASH_TOP;
+			memcpy(rss_context->rss_key, rsskey,
+			       sizeof(rss_context->rss_key));
+		} else {
+			rss_context->hash_fn = MLX4_RSS_HASH_XOR;
+			memset(rss_context->rss_key, 0,
+			       sizeof(rss_context->rss_key));
+		}
+	}
 	/*
 	 * Before passing a kernel QP to the HW, make sure that the
 	 * ownership bits of the send queue are set and the SQ
@@ -2763,6 +3079,13 @@ done:
 		qp->sq_signal_bits == cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) ?
 		IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
 
+	qp_init_attr->qpg_type = qp->qpg_type;
+	if (qp->qpg_type == IB_QPG_PARENT)
+		qp_init_attr->cap.qpg_tss_mask_sz =
+			qp->qpg_data->qpg_tss_mask_sz;
+	else
+		qp_init_attr->cap.qpg_tss_mask_sz = 0;
+
 out:
 	mutex_unlock(&qp->mutex);
 	return err;
-- 
1.7.1


* [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-03-07 17:11   ` [PATCH V3 for-next 1/5] IB/core: Add RSS and TSS QP groups Or Gerlitz
  2013-03-07 17:11   ` [PATCH V3 for-next 2/5] IB/mlx4: Add support for " Or Gerlitz
@ 2013-03-07 17:11   ` Or Gerlitz
       [not found]     ` <1362676288-19906-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-03-07 17:11   ` [PATCH V3 for-next 4/5] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch is a restructuring step needed to implement RSS (Receive Side
Scaling) and TSS (multi-queue transmit) for IPoIB.

The following structures and flows are changed:

- Addition of struct ipoib_recv_ring and struct ipoib_send_ring which hold
the per RX / TX ring fields respectively. These fields are the per-ring
counterparts of the receive and send fields previously present in
struct ipoib_dev_priv.

- Add per send/receive ring stats counters. These counters are accessible
through ethtool. Net device stats are no longer accumulated in place; instead,
ndo_get_stats is implemented (see the sketch after this list).

- Use the multi-queue APIs for TX and RX: alloc_netdev_mqs, the netif_xxx_subqueue
and netif_subqueue_yyy helpers, a per TX queue timer, and a NAPI instance per RX queue.

- Put a work request structure and scatter/gather list in the RX ring
structure for the CM code to use, and remove them from ipoib_cm_dev_priv
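
A sketch of how ndo_get_stats could fold the per-ring counters into the netdev
stats (illustrative only; it follows the ring structures added by this patch
but is not the patch code itself, and the function name is invented):

static struct net_device_stats *ipoib_sketch_get_stats(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct net_device_stats *stats = &dev->stats;
	unsigned long rx_packets = 0, rx_bytes = 0, rx_dropped = 0;
	unsigned long tx_packets = 0, tx_bytes = 0, tx_dropped = 0;
	unsigned int i;

	for (i = 0; i < priv->num_rx_queues; i++) {
		rx_packets += priv->recv_ring[i].stats.rx_packets;
		rx_bytes   += priv->recv_ring[i].stats.rx_bytes;
		rx_dropped += priv->recv_ring[i].stats.rx_dropped;
	}
	for (i = 0; i < priv->num_tx_queues; i++) {
		tx_packets += priv->send_ring[i].stats.tx_packets;
		tx_bytes   += priv->send_ring[i].stats.tx_bytes;
		tx_dropped += priv->send_ring[i].stats.tx_dropped;
	}

	stats->rx_packets = rx_packets;
	stats->rx_bytes   = rx_bytes;
	stats->rx_dropped = rx_dropped;
	stats->tx_packets = tx_packets;
	stats->tx_bytes   = tx_bytes;
	stats->tx_dropped = tx_dropped;

	return stats;
}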

Since this patch is an intermediate step, the number of RX and TX rings
is fixed to one, and the single TX ring and RX ring QP/CQs are currently
taken from the "priv" structure.

The Address Handle garbage collection mechanism was changed such
that the data path uses a ref count (inc on post send, dec on send completion),
and the AH GC thread code tests for a zero value of the ref count instead of
comparing tx_head to last_send. Some change was a must here, since the SAME
AH can be used by multiple TX rings, as the skb hashing (which uses L3/L4
headers) can possibly map the same IPoIB daddr to multiple TX rings in parallel.
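
A sketch of the reference-count based AH reaping described above (assumptions:
refcnt is incremented when a send is posted with the AH and decremented in the
TX completion handler; the function name is invented and locking details of the
real reaper are simplified):

static void ipoib_sketch_reap_dead_ahs(struct ipoib_dev_priv *priv)
{
	struct ipoib_ah *ah, *tah;
	unsigned long flags;

	spin_lock_irqsave(&priv->lock, flags);
	list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list) {
		/* Safe to free only when no TX ring still has sends
		 * in flight on this AH (refcnt back to zero).
		 */
		if (atomic_read(&ah->refcnt) == 0) {
			list_del(&ah->list);
			ib_destroy_ah(ah->ah);
			kfree(ah);
		}
	}
	spin_unlock_irqrestore(&priv->lock, flags);
}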

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  102 ++++--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  206 ++++++----
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |   92 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  538 +++++++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  265 ++++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   44 ++-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   63 ++-
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 8 files changed, 973 insertions(+), 339 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index eb71aaa..9bf96db 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -160,6 +160,7 @@ struct ipoib_rx_buf {
 
 struct ipoib_tx_buf {
 	struct sk_buff *skb;
+	struct ipoib_ah *ah;
 	u64		mapping[MAX_SKB_FRAGS + 1];
 };
 
@@ -217,6 +218,7 @@ struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	int index; /* For ring counters */
 };
 
 struct ipoib_cm_tx {
@@ -256,11 +258,10 @@ struct ipoib_cm_dev_priv {
 	struct list_head	start_list;
 	struct list_head	reap_list;
 	struct ib_wc		ibwc[IPOIB_NUM_WC];
-	struct ib_sge		rx_sge[IPOIB_CM_RX_SG];
-	struct ib_recv_wr       rx_wr;
 	int			nonsrq_conn_qp;
 	int			max_cm_mtu;
 	int			num_frags;
+	u32			rx_cq_ind;
 };
 
 struct ipoib_ethtool_st {
@@ -286,6 +287,65 @@ struct ipoib_neigh_table {
 };
 
 /*
+ * Per QP stats
+ */
+
+struct ipoib_tx_ring_stats {
+	unsigned long tx_packets;
+	unsigned long tx_bytes;
+	unsigned long tx_errors;
+	unsigned long tx_dropped;
+};
+
+struct ipoib_rx_ring_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long rx_errors;
+	unsigned long rx_dropped;
+};
+
+/*
+ * Encapsulates the per send QP information
+ */
+struct ipoib_send_ring {
+	struct net_device	*dev;
+	struct ib_cq		*send_cq;
+	struct ib_qp		*send_qp;
+	struct ipoib_tx_buf	*tx_ring;
+	unsigned		tx_head;
+	unsigned		tx_tail;
+	struct ib_sge		tx_sge[MAX_SKB_FRAGS + 1];
+	struct ib_send_wr	tx_wr;
+	unsigned		tx_outstanding;
+	struct ib_wc		tx_wc[MAX_SEND_CQE];
+	struct timer_list	poll_timer;
+	struct ipoib_tx_ring_stats stats;
+	unsigned		index;
+};
+
+struct ipoib_rx_cm_info {
+	struct ib_sge		rx_sge[IPOIB_CM_RX_SG];
+	struct ib_recv_wr       rx_wr;
+};
+
+/*
+ * Encapsulates the per recv QP information
+ */
+struct ipoib_recv_ring {
+	struct net_device	*dev;
+	struct ib_qp		*recv_qp;
+	struct ib_cq		*recv_cq;
+	struct ib_wc		ibwc[IPOIB_NUM_WC];
+	struct napi_struct	napi;
+	struct ipoib_rx_buf	*rx_ring;
+	struct ib_recv_wr	rx_wr;
+	struct ib_sge		rx_sge[IPOIB_UD_RX_SG];
+	struct ipoib_rx_cm_info	cm;
+	struct ipoib_rx_ring_stats stats;
+	unsigned		index;
+};
+
+/*
  * Device private locking: network stack tx_lock protects members used
  * in TX fast path, lock protects everything else.  lock nests inside
  * of tx_lock (ie tx_lock must be acquired first if needed).
@@ -295,8 +355,6 @@ struct ipoib_dev_priv {
 
 	struct net_device *dev;
 
-	struct napi_struct napi;
-
 	unsigned long flags;
 
 	struct mutex vlan_mutex;
@@ -337,21 +395,6 @@ struct ipoib_dev_priv {
 	unsigned int mcast_mtu;
 	unsigned int max_ib_mtu;
 
-	struct ipoib_rx_buf *rx_ring;
-
-	struct ipoib_tx_buf *tx_ring;
-	unsigned	     tx_head;
-	unsigned	     tx_tail;
-	struct ib_sge	     tx_sge[MAX_SKB_FRAGS + 1];
-	struct ib_send_wr    tx_wr;
-	unsigned	     tx_outstanding;
-	struct ib_wc	     send_wc[MAX_SEND_CQE];
-
-	struct ib_recv_wr    rx_wr;
-	struct ib_sge	     rx_sge[IPOIB_UD_RX_SG];
-
-	struct ib_wc ibwc[IPOIB_NUM_WC];
-
 	struct list_head dead_ahs;
 
 	struct ib_event_handler event_handler;
@@ -373,6 +416,10 @@ struct ipoib_dev_priv {
 	int	hca_caps;
 	struct ipoib_ethtool_st ethtool;
 	struct timer_list poll_timer;
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	unsigned int num_rx_queues;
+	unsigned int num_tx_queues;
 };
 
 struct ipoib_ah {
@@ -380,7 +427,7 @@ struct ipoib_ah {
 	struct ib_ah	  *ah;
 	struct list_head   list;
 	struct kref	   ref;
-	unsigned	   last_send;
+	atomic_t	   refcnt;
 };
 
 struct ipoib_path {
@@ -442,8 +489,8 @@ extern struct workqueue_struct *ipoib_workqueue;
 /* functions */
 
 int ipoib_poll(struct napi_struct *napi, int budget);
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr);
+void ipoib_ib_completion(struct ib_cq *cq, void *recv_ring_ptr);
+void ipoib_send_comp_handler(struct ib_cq *cq, void *send_ring_ptr);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 				 struct ib_pd *pd, struct ib_ah_attr *attr);
@@ -462,7 +509,8 @@ void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_mark_paths_invalid(struct net_device *dev);
 void ipoib_flush_paths(struct net_device *dev);
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *format);
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *format,
+					struct ipoib_dev_priv *temp_priv);
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush_light(struct work_struct *work);
@@ -600,7 +648,9 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
 void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 			   unsigned int mtu);
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
+void ipoib_cm_handle_rx_wc(struct net_device *dev,
+			   struct ipoib_recv_ring *recv_ring,
+			   struct ib_wc *wc);
 void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc);
 #else
 
@@ -698,7 +748,9 @@ static inline void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff
 	dev_kfree_skb_any(skb);
 }
 
-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static inline void ipoib_cm_handle_rx_wc(struct net_device *dev,
+					 struct ipoib_recv_ring *recv_ring,
+					 struct ib_wc *wc)
 {
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 67b0c1d..40a40e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/moduleparam.h>
+#include <linux/jhash.h>
 
 #include "ipoib.h"
 
@@ -88,18 +89,24 @@ static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags,
 		ib_dma_unmap_page(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
+static int ipoib_cm_post_receive_srq(struct net_device *dev,
+				     struct ipoib_recv_ring *recv_ring, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
-	priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
+	sge = recv_ring->cm.rx_sge;
+	wr = &recv_ring->cm.rx_wr;
+
+	wr->wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
 	for (i = 0; i < priv->cm.num_frags; ++i)
-		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
+		sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 
-	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
+	ret = ib_post_srq_recv(priv->cm.srq, wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
 		ipoib_cm_dma_unmap_rx(priv, priv->cm.num_frags - 1,
@@ -112,14 +119,18 @@ static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
 }
 
 static int ipoib_cm_post_receive_nonsrq(struct net_device *dev,
-					struct ipoib_cm_rx *rx,
-					struct ib_recv_wr *wr,
-					struct ib_sge *sge, int id)
+					struct ipoib_cm_rx *rx, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring = priv->recv_ring + rx->index;
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
+	sge = recv_ring->cm.rx_sge;
+	wr = &recv_ring->cm.rx_wr;
+
 	wr->wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -225,7 +236,15 @@ static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv)
 	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
 		ipoib_warn(priv, "failed to post drain wr\n");
 
-	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+	/*
+	 * Under the multi ring scheme, different CM QPs are bound to
+	 * different CQs and hence to different NAPI contexts. With that in
+	 * mind, we must make sure that the NAPI context that invokes the reap
+	 * (deletion) of a certain QP is the same context that handles the
+	 * normal RX WC handling. To achieve that, move only one QP at a time to
+	 * the drain list, this will enforce posting the drain WR on each QP.
+	 */
+	list_move(&p->list, &priv->cm.rx_drain_list);
 }
 
 static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
@@ -250,8 +269,6 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->recv_cq, /* For drain WR */
-		.recv_cq = priv->recv_cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
@@ -259,12 +276,23 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 		.qp_type = IB_QPT_RC,
 		.qp_context = p,
 	};
+	int index;
 
 	if (!ipoib_cm_has_srq(dev)) {
 		attr.cap.max_recv_wr  = ipoib_recvq_size;
 		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
 	}
 
+	index = priv->cm.rx_cq_ind;
+	if (index >= priv->num_rx_queues)
+		index = 0;
+
+	priv->cm.rx_cq_ind = index + 1;
+	/* send_cq for drain WR */
+	attr.recv_cq = priv->recv_ring[index].recv_cq;
+	attr.send_cq = attr.recv_cq;
+	p->index = index;
+
 	return ib_create_qp(priv->pd, &attr);
 }
 
@@ -323,33 +351,34 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev,
 	return 0;
 }
 
-static void ipoib_cm_init_rx_wr(struct net_device *dev,
-				struct ib_recv_wr *wr,
-				struct ib_sge *sge)
+static void ipoib_cm_init_rx_wr(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int i;
-
-	for (i = 0; i < priv->cm.num_frags; ++i)
-		sge[i].lkey = priv->mr->lkey;
-
-	sge[0].length = IPOIB_CM_HEAD_SIZE;
-	for (i = 1; i < priv->cm.num_frags; ++i)
-		sge[i].length = PAGE_SIZE;
-
-	wr->next    = NULL;
-	wr->sg_list = sge;
-	wr->num_sge = priv->cm.num_frags;
+	struct ipoib_recv_ring *recv_ring = priv->recv_ring;
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
+	int i, j;
+
+	for (j = 0; j < priv->num_rx_queues; j++, recv_ring++) {
+		sge = recv_ring->cm.rx_sge;
+		wr = &recv_ring->cm.rx_wr;
+		for (i = 0; i < priv->cm.num_frags; ++i)
+			sge[i].lkey = priv->mr->lkey;
+
+		sge[0].length = IPOIB_CM_HEAD_SIZE;
+		for (i = 1; i < priv->cm.num_frags; ++i)
+			sge[i].length = PAGE_SIZE;
+
+		wr->next    = NULL;
+		wr->sg_list = sge;
+		wr->num_sge = priv->cm.num_frags;
+	}
 }
 
 static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_id,
 				   struct ipoib_cm_rx *rx)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct {
-		struct ib_recv_wr wr;
-		struct ib_sge sge[IPOIB_CM_RX_SG];
-	} *t;
 	int ret;
 	int i;
 
@@ -360,14 +389,6 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 		return -ENOMEM;
 	}
 
-	t = kmalloc(sizeof *t, GFP_KERNEL);
-	if (!t) {
-		ret = -ENOMEM;
-		goto err_free;
-	}
-
-	ipoib_cm_init_rx_wr(dev, &t->wr, t->sge);
-
 	spin_lock_irq(&priv->lock);
 
 	if (priv->cm.nonsrq_conn_qp >= ipoib_max_conn_qp) {
@@ -387,7 +408,7 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 				ret = -ENOMEM;
 				goto err_count;
 		}
-		ret = ipoib_cm_post_receive_nonsrq(dev, rx, &t->wr, t->sge, i);
+		ret = ipoib_cm_post_receive_nonsrq(dev, rx, i);
 		if (ret) {
 			ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq "
 				   "failed for buf %d\n", i);
@@ -398,8 +419,6 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 
 	rx->recv_count = ipoib_recvq_size;
 
-	kfree(t);
-
 	return 0;
 
 err_count:
@@ -408,7 +427,6 @@ err_count:
 	spin_unlock_irq(&priv->lock);
 
 err_free:
-	kfree(t);
 	ipoib_cm_free_rx_ring(dev, rx->rx_ring);
 
 	return ret;
@@ -553,7 +571,9 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 	}
 }
 
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev,
+			   struct ipoib_recv_ring *recv_ring,
+			   struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx_buf *rx_ring;
@@ -593,7 +613,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		ipoib_dbg(priv, "cm recv error "
 			   "(status=%d, wrid=%d vend_err %x)\n",
 			   wc->status, wr_id, wc->vendor_err);
-		++dev->stats.rx_dropped;
+		++recv_ring->stats.rx_dropped;
 		if (has_srq)
 			goto repost;
 		else {
@@ -646,7 +666,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		 * this packet and reuse the old buffer.
 		 */
 		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
-		++dev->stats.rx_dropped;
+		++recv_ring->stats.rx_dropped;
 		goto repost;
 	}
 
@@ -663,8 +683,8 @@ copied:
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++recv_ring->stats.rx_packets;
+	recv_ring->stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
@@ -673,13 +693,13 @@ copied:
 
 repost:
 	if (has_srq) {
-		if (unlikely(ipoib_cm_post_receive_srq(dev, wr_id)))
+		if (unlikely(ipoib_cm_post_receive_srq(dev,
+						       recv_ring,
+						       wr_id)))
 			ipoib_warn(priv, "ipoib_cm_post_receive_srq failed "
 				   "for buf %d\n", wr_id);
 	} else {
 		if (unlikely(ipoib_cm_post_receive_nonsrq(dev, p,
-							  &priv->cm.rx_wr,
-							  priv->cm.rx_sge,
 							  wr_id))) {
 			--p->recv_count;
 			ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed "
@@ -691,17 +711,18 @@ repost:
 static inline int post_send(struct ipoib_dev_priv *priv,
 			    struct ipoib_cm_tx *tx,
 			    unsigned int wr_id,
-			    u64 addr, int len)
+			    u64 addr, int len,
+				struct ipoib_send_ring *send_ring)
 {
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge[0].addr          = addr;
-	priv->tx_sge[0].length        = len;
+	send_ring->tx_sge[0].addr          = addr;
+	send_ring->tx_sge[0].length        = len;
 
-	priv->tx_wr.num_sge	= 1;
-	priv->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
+	send_ring->tx_wr.num_sge	= 1;
+	send_ring->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(tx->qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
@@ -710,12 +731,17 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	struct ipoib_cm_tx_buf *tx_req;
 	u64 addr;
 	int rc;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
+
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (unlikely(skb->len > tx->mtu)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 			   skb->len, tx->mtu);
-		++dev->stats.tx_dropped;
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_dropped;
+		++send_ring->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, tx->mtu - IPOIB_ENCAP_LEN);
 		return;
 	}
@@ -734,7 +760,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	tx_req->skb = skb;
 	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
@@ -745,22 +771,23 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	skb_dst_drop(skb);
 
 	rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1),
-		       addr, skb->len);
+		       addr, skb->len, send_ring);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
 		dev_kfree_skb_any(skb);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 		++tx->tx_head;
 
-		if (++priv->tx_outstanding == ipoib_sendq_size) {
+		if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
 				  tx->qp->qp_num);
-			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+			if (ib_req_notify_cq(send_ring->send_cq,
+					     IB_CQ_NEXT_COMP))
 				ipoib_warn(priv, "request notify on send CQ failed\n");
-			netif_stop_queue(dev);
+			netif_stop_subqueue(dev, queue_index);
 		}
 	}
 }
@@ -772,6 +799,8 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM;
 	struct ipoib_cm_tx_buf *tx_req;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 
 	ipoib_dbg_data(priv, "cm send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -783,22 +812,24 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	}
 
 	tx_req = &tx->tx_ring[wr_id];
+	queue_index = skb_get_queue_mapping(tx_req->skb);
+	send_ring = priv->send_ring + queue_index;
 
 	ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE);
 
 	/* FIXME: is this right? Shouldn't we only increment on success? */
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
-	netif_tx_lock(dev);
+	netif_tx_lock_bh(dev);
 
 	++tx->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    __netif_subqueue_stopped(dev, queue_index) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+		netif_wake_subqueue(dev, queue_index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -829,7 +860,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		spin_unlock_irqrestore(&priv->lock, flags);
 	}
 
-	netif_tx_unlock(dev);
+	netif_tx_unlock_bh(dev);
 }
 
 int ipoib_cm_dev_open(struct net_device *dev)
@@ -1017,8 +1048,6 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.send_cq		= priv->recv_cq,
-		.recv_cq		= priv->recv_cq,
 		.srq			= priv->cm.srq,
 		.cap.max_send_wr	= ipoib_sendq_size,
 		.cap.max_send_sge	= 1,
@@ -1026,6 +1055,21 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 		.qp_type		= IB_QPT_RC,
 		.qp_context		= tx
 	};
+	u32 index;
+
+	/* CM uses ipoib_ib_completion for TX completions, which runs in the
+	 * RX NAPI context. Spread the connections among the RX CQs using a
+	 * hash of the destination IPoIB HW address.
+	 */
+	if (priv->num_rx_queues > 1) {
+		u32 *daddr_32 = (u32 *)tx->neigh->daddr;
+		u32 hv = jhash_1word(*daddr_32 & IPOIB_QPN_MASK, 0);
+		index = hv % priv->num_rx_queues;
+	} else {
+		index = 0;
+	}
+
+	attr.recv_cq = priv->recv_ring[index].recv_cq;
+	attr.send_cq = attr.recv_cq;
 
 	return ib_create_qp(priv->pd, &attr);
 }
@@ -1178,16 +1222,21 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 timeout:
 
 	while ((int) p->tx_tail - (int) p->tx_head < 0) {
+		struct ipoib_send_ring *send_ring;
+		u16 queue_index;
 		tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
 		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
 				    DMA_TO_DEVICE);
 		dev_kfree_skb_any(tx_req->skb);
 		++p->tx_tail;
+		queue_index = skb_get_queue_mapping(tx_req->skb);
+		send_ring = priv->send_ring + queue_index;
 		netif_tx_lock_bh(p->dev);
-		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-		    netif_queue_stopped(p->dev) &&
+		if (unlikely(--send_ring->tx_outstanding ==
+				(ipoib_sendq_size >> 1)) &&
+		    __netif_subqueue_stopped(p->dev, queue_index) &&
 		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			netif_wake_queue(p->dev);
+			netif_wake_subqueue(p->dev, queue_index);
 		netif_tx_unlock_bh(p->dev);
 	}
 
@@ -1549,7 +1598,7 @@ int ipoib_cm_dev_init(struct net_device *dev)
 		priv->cm.num_frags  = IPOIB_CM_RX_SG;
 	}
 
-	ipoib_cm_init_rx_wr(dev, &priv->cm.rx_wr, priv->cm.rx_sge);
+	ipoib_cm_init_rx_wr(dev);
 
 	if (ipoib_cm_has_srq(dev)) {
 		for (i = 0; i < ipoib_recvq_size; ++i) {
@@ -1562,7 +1611,8 @@ int ipoib_cm_dev_init(struct net_device *dev)
 				return -ENOMEM;
 			}
 
-			if (ipoib_cm_post_receive_srq(dev, i)) {
+			if (ipoib_cm_post_receive_srq(dev, priv->recv_ring,
+						      i)) {
 				ipoib_warn(priv, "ipoib_cm_post_receive_srq "
 					   "failed for buf %d\n", i);
 				ipoib_cm_dev_cleanup(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index c4b3940..7c56341 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -74,7 +74,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
 			      struct ethtool_coalesce *coal)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int ret;
+	int ret, i;
 
 	/*
 	 * These values are saved in the private data and returned
@@ -84,23 +85,100 @@ static int ipoib_set_coalesce(struct net_device *dev,
 	    coal->rx_max_coalesced_frames > 0xffff)
 		return -EINVAL;
 
-	ret = ib_modify_cq(priv->recv_cq, coal->rx_max_coalesced_frames,
-			   coal->rx_coalesce_usecs);
-	if (ret && ret != -ENOSYS) {
-		ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
-		return ret;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ret = ib_modify_cq(priv->recv_ring[i].recv_cq,
+					coal->rx_max_coalesced_frames,
+					coal->rx_coalesce_usecs);
+		if (ret && ret != -ENOSYS) {
+			ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
+			return ret;
+		}
 	}
-
 	priv->ethtool.coalesce_usecs       = coal->rx_coalesce_usecs;
 	priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;
 
 	return 0;
 }
 
+static void ipoib_get_strings(struct net_device *dev, u32 stringset, u8 *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, index = 0;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < priv->num_rx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_dropped", i);
+		}
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_dropped", i);
+		}
+		break;
+	}
+}
+
+static int ipoib_get_sset_count(struct net_device *dev, int sset)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	switch (sset) {
+	case ETH_SS_STATS:
+		return (priv->num_rx_queues + priv->num_tx_queues) * 4;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static void ipoib_get_ethtool_stats(struct net_device *dev,
+				struct ethtool_stats *stats, uint64_t *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	int index = 0;
+	int i;
+
+	/* Gather the per-ring (per-QP) statistics */
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rx_stats = &recv_ring->stats;
+		data[index++] = rx_stats->rx_packets;
+		data[index++] = rx_stats->rx_bytes;
+		data[index++] = rx_stats->rx_errors;
+		data[index++] = rx_stats->rx_dropped;
+		recv_ring++;
+	}
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tx_stats = &send_ring->stats;
+		data[index++] = tx_stats->tx_packets;
+		data[index++] = tx_stats->tx_bytes;
+		data[index++] = tx_stats->tx_errors;
+		data[index++] = tx_stats->tx_dropped;
+		send_ring++;
+	}
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo		= ipoib_get_drvinfo,
 	.get_coalesce		= ipoib_get_coalesce,
 	.set_coalesce		= ipoib_set_coalesce,
+	.get_strings		= ipoib_get_strings,
+	.get_sset_count		= ipoib_get_sset_count,
+	.get_ethtool_stats	= ipoib_get_ethtool_stats,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2cfa76f..4871dc9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -64,7 +64,6 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		return ERR_PTR(-ENOMEM);
 
 	ah->dev       = dev;
-	ah->last_send = 0;
 	kref_init(&ah->ref);
 
 	vah = ib_create_ah(pd, attr);
@@ -72,6 +71,7 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		kfree(ah);
 		ah = (struct ipoib_ah *)vah;
 	} else {
+		atomic_set(&ah->refcnt, 0);
 		ah->ah = vah;
 		ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah);
 	}
@@ -129,29 +129,32 @@ static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 
 }
 
-static int ipoib_ib_post_receive(struct net_device *dev, int id)
+static int ipoib_ib_post_receive(struct net_device *dev,
+			struct ipoib_recv_ring *recv_ring, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
 	int ret;
 
-	priv->rx_wr.wr_id   = id | IPOIB_OP_RECV;
-	priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
-	priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
+	recv_ring->rx_wr.wr_id   = id | IPOIB_OP_RECV;
+	recv_ring->rx_sge[0].addr = recv_ring->rx_ring[id].mapping[0];
+	recv_ring->rx_sge[1].addr = recv_ring->rx_ring[id].mapping[1];
 
 
-	ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
+	ret = ib_post_recv(recv_ring->recv_qp, &recv_ring->rx_wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
-		dev_kfree_skb_any(priv->rx_ring[id].skb);
-		priv->rx_ring[id].skb = NULL;
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[id].mapping);
+		dev_kfree_skb_any(recv_ring->rx_ring[id].skb);
+		recv_ring->rx_ring[id].skb = NULL;
 	}
 
 	return ret;
 }
 
-static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
+static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev,
+					  struct ipoib_recv_ring *recv_ring,
+					  int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
@@ -178,7 +181,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	 */
 	skb_reserve(skb, 4);
 
-	mapping = priv->rx_ring[id].mapping;
+	mapping = recv_ring->rx_ring[id].mapping;
 	mapping[0] = ib_dma_map_single(priv->ca, skb->data, buf_size,
 				       DMA_FROM_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, mapping[0])))
@@ -196,7 +199,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 			goto partial_error;
 	}
 
-	priv->rx_ring[id].skb = skb;
+	recv_ring->rx_ring[id].skb = skb;
 	return skb;
 
 partial_error:
@@ -206,18 +209,23 @@ error:
 	return NULL;
 }
 
-static int ipoib_ib_post_receives(struct net_device *dev)
+static int ipoib_ib_post_ring_receives(struct net_device *dev,
+				      struct ipoib_recv_ring *recv_ring)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
 	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_alloc_rx_skb(dev, i)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+		if (!ipoib_alloc_rx_skb(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				   "failed to alloc receive buffer (%d,%d)\n",
+				   recv_ring->index, i);
 			return -ENOMEM;
 		}
-		if (ipoib_ib_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
+		if (ipoib_ib_post_receive(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				   "ipoib_ib_post_receive failed buf (%d,%d)\n",
+				   recv_ring->index, i);
 			return -EIO;
 		}
 	}
@@ -225,7 +233,27 @@ static int ipoib_ib_post_receives(struct net_device *dev)
 	return 0;
 }
 
-static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static int ipoib_ib_post_receives(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int err;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		err = ipoib_ib_post_ring_receives(dev, recv_ring);
+		if (err)
+			return err;
+		recv_ring++;
+	}
+
+	return 0;
+}
+
+static void ipoib_ib_handle_rx_wc(struct net_device *dev,
+				  struct ipoib_recv_ring *recv_ring,
+				  struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
@@ -242,16 +270,16 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	skb  = priv->rx_ring[wr_id].skb;
+	skb  = recv_ring->rx_ring[wr_id].skb;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		if (wc->status != IB_WC_WR_FLUSH_ERR)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
 				   wc->status, wr_id, wc->vendor_err);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[wr_id].mapping);
 		dev_kfree_skb_any(skb);
-		priv->rx_ring[wr_id].skb = NULL;
+		recv_ring->rx_ring[wr_id].skb = NULL;
 		return;
 	}
 
@@ -262,18 +290,20 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
 
-	memcpy(mapping, priv->rx_ring[wr_id].mapping,
+	memcpy(mapping, recv_ring->rx_ring[wr_id].mapping,
 	       IPOIB_UD_RX_SG * sizeof *mapping);
 
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
-	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
-		++dev->stats.rx_dropped;
+	if (unlikely(!ipoib_alloc_rx_skb(dev, recv_ring, wr_id))) {
+		++recv_ring->stats.rx_dropped;
 		goto repost;
 	}
 
+	skb_record_rx_queue(skb, recv_ring->index);
+
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
@@ -296,18 +326,18 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++recv_ring->stats.rx_packets;
+	recv_ring->stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	if ((dev->features & NETIF_F_RXCSUM) &&
 			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	napi_gro_receive(&priv->napi, skb);
+	napi_gro_receive(&recv_ring->napi, skb);
 
 repost:
-	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
+	if (unlikely(ipoib_ib_post_receive(dev, recv_ring, wr_id)))
 		ipoib_warn(priv, "ipoib_ib_post_receive failed "
 			   "for buf %d\n", wr_id);
 }
@@ -376,11 +406,14 @@ static void ipoib_dma_unmap_tx(struct ib_device *ca,
 	}
 }
 
-static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
+static void ipoib_ib_handle_tx_wc(struct ipoib_send_ring *send_ring,
+				struct ib_wc *wc)
 {
+	struct net_device *dev = send_ring->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_ah *ah;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -391,20 +424,23 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	tx_req = &send_ring->tx_ring[wr_id];
+
+	ah = tx_req->ah;
+	atomic_dec(&ah->refcnt);
 
 	ipoib_dma_unmap_tx(priv->ca, tx_req);
 
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
-	++priv->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
+	++send_ring->tx_tail;
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    __netif_subqueue_stopped(dev, send_ring->index) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+		netif_wake_subqueue(dev, send_ring->index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR)
@@ -413,45 +449,47 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 			   wc->status, wr_id, wc->vendor_err);
 }
 
-static int poll_tx(struct ipoib_dev_priv *priv)
+static int poll_tx_ring(struct ipoib_send_ring *send_ring)
 {
 	int n, i;
 
-	n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
+	n = ib_poll_cq(send_ring->send_cq, MAX_SEND_CQE, send_ring->tx_wc);
 	for (i = 0; i < n; ++i)
-		ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
+		ipoib_ib_handle_tx_wc(send_ring, send_ring->tx_wc + i);
 
 	return n == MAX_SEND_CQE;
 }
 
 int ipoib_poll(struct napi_struct *napi, int budget)
 {
-	struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi);
-	struct net_device *dev = priv->dev;
+	struct ipoib_recv_ring *rx_ring;
+	struct net_device *dev;
 	int done;
 	int t;
 	int n, i;
 
 	done  = 0;
+	rx_ring = container_of(napi, struct ipoib_recv_ring, napi);
+	dev = rx_ring->dev;
 
 poll_more:
 	while (done < budget) {
 		int max = (budget - done);
 
 		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, t, rx_ring->ibwc);
 
 		for (i = 0; i < n; i++) {
-			struct ib_wc *wc = priv->ibwc + i;
+			struct ib_wc *wc = rx_ring->ibwc + i;
 
 			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
 				if (wc->wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, wc);
+					ipoib_cm_handle_rx_wc(dev, rx_ring, wc);
 				else
-					ipoib_ib_handle_rx_wc(dev, wc);
+					ipoib_ib_handle_rx_wc(dev, rx_ring, wc);
 			} else
-				ipoib_cm_handle_tx_wc(priv->dev, wc);
+				ipoib_cm_handle_tx_wc(dev, wc);
 		}
 
 		if (n != t)
@@ -460,7 +498,7 @@ poll_more:
 
 	if (done < budget) {
 		napi_complete(napi);
-		if (unlikely(ib_req_notify_cq(priv->recv_cq,
+		if (unlikely(ib_req_notify_cq(rx_ring->recv_cq,
 					      IB_CQ_NEXT_COMP |
 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
 		    napi_reschedule(napi))
@@ -470,36 +508,34 @@ poll_more:
 	return done;
 }
 
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
+void ipoib_ib_completion(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct net_device *dev = dev_ptr;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring = (struct ipoib_recv_ring *)ctx_ptr;
 
-	napi_schedule(&priv->napi);
+	napi_schedule(&recv_ring->napi);
 }
 
-static void drain_tx_cq(struct net_device *dev)
+static void drain_tx_cq(struct ipoib_send_ring *send_ring)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	netif_tx_lock_bh(send_ring->dev);
 
-	netif_tx_lock(dev);
-	while (poll_tx(priv))
+	while (poll_tx_ring(send_ring))
 		; /* nothing */
 
-	if (netif_queue_stopped(dev))
-		mod_timer(&priv->poll_timer, jiffies + 1);
+	if (__netif_subqueue_stopped(send_ring->dev, send_ring->index))
+		mod_timer(&send_ring->poll_timer, jiffies + 1);
 
-	netif_tx_unlock(dev);
+	netif_tx_unlock_bh(send_ring->dev);
 }
 
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr)
+void ipoib_send_comp_handler(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev_ptr);
+	struct ipoib_send_ring *send_ring = (struct ipoib_send_ring *)ctx_ptr;
 
-	mod_timer(&priv->poll_timer, jiffies);
+	mod_timer(&send_ring->poll_timer, jiffies);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
+static inline int post_send(struct ipoib_send_ring *send_ring,
 			    unsigned int wr_id,
 			    struct ib_ah *address, u32 qpn,
 			    struct ipoib_tx_buf *tx_req,
@@ -513,30 +549,30 @@ static inline int post_send(struct ipoib_dev_priv *priv,
 	u64 *mapping = tx_req->mapping;
 
 	if (skb_headlen(skb)) {
-		priv->tx_sge[0].addr         = mapping[0];
-		priv->tx_sge[0].length       = skb_headlen(skb);
+		send_ring->tx_sge[0].addr         = mapping[0];
+		send_ring->tx_sge[0].length       = skb_headlen(skb);
 		off = 1;
 	} else
 		off = 0;
 
 	for (i = 0; i < nr_frags; ++i) {
-		priv->tx_sge[i + off].addr = mapping[i + off];
-		priv->tx_sge[i + off].length = skb_frag_size(&frags[i]);
+		send_ring->tx_sge[i + off].addr = mapping[i + off];
+		send_ring->tx_sge[i + off].length = skb_frag_size(&frags[i]);
 	}
-	priv->tx_wr.num_sge	     = nr_frags + off;
-	priv->tx_wr.wr_id 	     = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn = qpn;
-	priv->tx_wr.wr.ud.ah 	     = address;
+	send_ring->tx_wr.num_sge	 = nr_frags + off;
+	send_ring->tx_wr.wr_id		 = wr_id;
+	send_ring->tx_wr.wr.ud.remote_qpn = qpn;
+	send_ring->tx_wr.wr.ud.ah	 = address;
 
 	if (head) {
-		priv->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
-		priv->tx_wr.wr.ud.header = head;
-		priv->tx_wr.wr.ud.hlen	 = hlen;
-		priv->tx_wr.opcode	 = IB_WR_LSO;
+		send_ring->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
+		send_ring->tx_wr.wr.ud.header = head;
+		send_ring->tx_wr.wr.ud.hlen	 = hlen;
+		send_ring->tx_wr.opcode	 = IB_WR_LSO;
 	} else
-		priv->tx_wr.opcode	 = IB_WR_SEND;
+		send_ring->tx_wr.opcode	 = IB_WR_SEND;
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(send_ring->send_qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
@@ -544,16 +580,23 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 	int hlen, rc;
 	void *phead;
+	int req_index;
+
+	/* Find the correct QP to submit the IO to */
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (skb_is_gso(skb)) {
 		hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
 		phead = skb->data;
 		if (unlikely(!skb_pull(skb, hlen))) {
 			ipoib_warn(priv, "linear data too small\n");
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			dev_kfree_skb_any(skb);
 			return;
 		}
@@ -561,8 +604,8 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
 			ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 				   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
 			return;
 		}
@@ -580,48 +623,56 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
+	req_index = send_ring->tx_head & (ipoib_sendq_size - 1);
+	tx_req = &send_ring->tx_ring[req_index];
 	tx_req->skb = skb;
+	tx_req->ah = address;
 	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		priv->tx_wr.send_flags |= IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags |= IB_SEND_IP_CSUM;
 	else
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
 
-	if (++priv->tx_outstanding == ipoib_sendq_size) {
+	if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-		if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+		if (ib_req_notify_cq(send_ring->send_cq, IB_CQ_NEXT_COMP))
 			ipoib_warn(priv, "request notify on send CQ failed\n");
-		netif_stop_queue(dev);
+		netif_stop_subqueue(dev, queue_index);
 	}
 
 	skb_orphan(skb);
 	skb_dst_drop(skb);
 
-	rc = post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
+	/*
+	 * Incrementing the reference count after posting the send could
+	 * race with the completion handler, so increment before posting
+	 * and decrement on error.
+	 */
+	atomic_inc(&address->refcnt);
+	rc = post_send(send_ring, req_index,
 		       address->ah, qpn, tx_req, phead, hlen);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
-		--priv->tx_outstanding;
+		++send_ring->stats.tx_errors;
+		--send_ring->tx_outstanding;
 		ipoib_dma_unmap_tx(priv->ca, tx_req);
 		dev_kfree_skb_any(skb);
-		if (netif_queue_stopped(dev))
-			netif_wake_queue(dev);
+		atomic_dec(&address->refcnt);
+		if (__netif_subqueue_stopped(dev, queue_index))
+			netif_wake_subqueue(dev, queue_index);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+		++send_ring->tx_head;
 	}
 
-	if (unlikely(priv->tx_outstanding > MAX_SEND_CQE))
-		while (poll_tx(priv))
+	if (unlikely(send_ring->tx_outstanding > MAX_SEND_CQE))
+		while (poll_tx_ring(send_ring))
 			; /* nothing */
 }
 
@@ -636,7 +687,7 @@ static void __ipoib_reap_ah(struct net_device *dev)
 	spin_lock_irqsave(&priv->lock, flags);
 
 	list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list)
-		if ((int) priv->tx_tail - (int) ah->last_send >= 0) {
+		if (atomic_read(&ah->refcnt) == 0) {
 			list_del(&ah->list);
 			ib_destroy_ah(ah->ah);
 			kfree(ah);
@@ -661,7 +712,31 @@ void ipoib_reap_ah(struct work_struct *work)
 
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
-	drain_tx_cq((struct net_device *)ctx);
+	drain_tx_cq((struct ipoib_send_ring *)ctx);
+}
+
+static void ipoib_napi_enable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		netif_napi_add(dev, &recv_ring->napi,
+			       ipoib_poll, 100);
+		napi_enable(&recv_ring->napi);
+		recv_ring++;
+	}
+}
+
+static void ipoib_napi_disable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		napi_disable(&priv->recv_ring[i].napi);
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
@@ -701,7 +776,7 @@ int ipoib_ib_dev_open(struct net_device *dev)
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_enable(&priv->napi);
+		ipoib_napi_enable(dev);
 
 	return 0;
 }
@@ -763,19 +838,47 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 static int recvs_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
 	int pending = 0;
-	int i;
+	int i, j;
 
-	for (i = 0; i < ipoib_recvq_size; ++i)
-		if (priv->rx_ring[i].skb)
-			++pending;
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; j++) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (recv_ring->rx_ring[i].skb)
+				++pending;
+		}
+		recv_ring++;
+	}
 
 	return pending;
 }
 
-void ipoib_drain_cq(struct net_device *dev)
+static int sends_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int pending = 0;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		/*
+		 * Note that since head and tail are unsigned, the result of
+		 * the subtraction is correct even when the counters wrap
+		 * around.
+		 */
+		pending += send_ring->tx_head - send_ring->tx_tail;
+		send_ring++;
+	}
+
+	return pending;
+}
+
+static void ipoib_drain_rx_ring(struct ipoib_dev_priv *priv,
+				struct ipoib_recv_ring *rx_ring)
+{
+	struct net_device *dev = priv->dev;
 	int i, n;
 
 	/*
@@ -786,42 +889,191 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_disable();
 
 	do {
-		n = ib_poll_cq(priv->recv_cq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, IPOIB_NUM_WC, rx_ring->ibwc);
 		for (i = 0; i < n; ++i) {
+			struct ib_wc *wc = rx_ring->ibwc + i;
 			/*
 			 * Convert any successful completions to flush
 			 * errors to avoid passing packets up the
 			 * stack after bringing the device down.
 			 */
-			if (priv->ibwc[i].status == IB_WC_SUCCESS)
-				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+			if (wc->status == IB_WC_SUCCESS)
+				wc->status = IB_WC_WR_FLUSH_ERR;
 
-			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			if (wc->wr_id & IPOIB_OP_RECV) {
+				if (wc->wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, rx_ring, wc);
 				else
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
-			} else
-				ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_rx_wc(dev, rx_ring, wc);
+			} else {
+				ipoib_cm_handle_tx_wc(dev, wc);
+			}
 		}
 	} while (n == IPOIB_NUM_WC);
 
-	while (poll_tx(priv))
-		; /* nothing */
-
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+static void drain_rx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ipoib_drain_rx_ring(priv, recv_ring);
+		recv_ring++;
+	}
+}
+
+static void drain_tx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int bool_value = 0;
+	int i;
+
+	do {
+		bool_value = 0;
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			local_bh_disable();
+			bool_value |= poll_tx_ring(send_ring);
+			local_bh_enable();
+			send_ring++;
+		}
+	} while (bool_value);
+}
+
+void ipoib_drain_cq(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	drain_rx_rings(priv);
+
+	drain_tx_rings(priv);
+}
+
+static void ipoib_ib_send_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *tx_ring;
+	struct ipoib_tx_buf *tx_req;
+	int i;
+
+	tx_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		while ((int) tx_ring->tx_tail - (int) tx_ring->tx_head < 0) {
+			tx_req = &tx_ring->tx_ring[tx_ring->tx_tail &
+				  (ipoib_sendq_size - 1)];
+			ipoib_dma_unmap_tx(priv->ca, tx_req);
+			dev_kfree_skb_any(tx_req->skb);
+			++tx_ring->tx_tail;
+			--tx_ring->tx_outstanding;
+		}
+		tx_ring++;
+	}
+}
+
+static void ipoib_ib_recv_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i, j;
+
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; ++j) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			struct ipoib_rx_buf *rx_req;
+
+			rx_req = &recv_ring->rx_ring[i];
+			if (!rx_req->skb)
+				continue;
+			ipoib_ud_dma_unmap_rx(priv,
+					      recv_ring->rx_ring[i].mapping);
+			dev_kfree_skb_any(rx_req->skb);
+			rx_req->skb = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void set_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+	/* Init a timer per queue */
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		setup_timer(&send_ring->poll_timer, ipoib_ib_tx_timer_func,
+			    (unsigned long)send_ring);
+		send_ring++;
+	}
+}
+
+static void del_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		del_timer_sync(&send_ring->poll_timer);
+		send_ring++;
+	}
+}
+
+static void set_tx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_send_ring *send_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i <  priv->num_tx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(send_ring->send_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+		send_ring++;
+	}
+}
+
+static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_recv_ring *recv_ring;
 	struct ib_qp_attr qp_attr;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(recv_ring->recv_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+		recv_ring++;
+	}
+}
+
+static void set_rings_qp_state(struct ipoib_dev_priv *priv,
+				enum ib_qp_state new_state)
+{
+	set_tx_rings_qp_state(priv, new_state);
+
+	if (priv->num_rx_queues > 1)
+		set_rx_rings_qp_state(priv, new_state);
+}
+
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned long begin;
-	struct ipoib_tx_buf *tx_req;
+	struct ipoib_recv_ring *recv_ring;
 	int i;
 
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_disable(&priv->napi);
+		ipoib_napi_disable(dev);
 
 	ipoib_cm_dev_stop(dev);
 
@@ -829,42 +1081,24 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	 * Move our QP to the error state and then reinitialize in
 	 * when all work requests have completed or have been flushed.
 	 */
-	qp_attr.qp_state = IB_QPS_ERR;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to ERROR state\n");
+	set_rings_qp_state(priv, IB_QPS_ERR);
 
 	/* Wait for all sends and receives to complete */
 	begin = jiffies;
 
-	while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) {
+	while (sends_pending(dev) || recvs_pending(dev)) {
 		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "timing out; %d sends %d receives not completed\n",
-				   priv->tx_head - priv->tx_tail, recvs_pending(dev));
+				   sends_pending(dev), recvs_pending(dev));
 
 			/*
 			 * assume the HW is wedged and just free up
 			 * all our pending work requests.
 			 */
-			while ((int) priv->tx_tail - (int) priv->tx_head < 0) {
-				tx_req = &priv->tx_ring[priv->tx_tail &
-							(ipoib_sendq_size - 1)];
-				ipoib_dma_unmap_tx(priv->ca, tx_req);
-				dev_kfree_skb_any(tx_req->skb);
-				++priv->tx_tail;
-				--priv->tx_outstanding;
-			}
+			ipoib_ib_send_ring_stop(priv);
 
-			for (i = 0; i < ipoib_recvq_size; ++i) {
-				struct ipoib_rx_buf *rx_req;
-
-				rx_req = &priv->rx_ring[i];
-				if (!rx_req->skb)
-					continue;
-				ipoib_ud_dma_unmap_rx(priv,
-						      priv->rx_ring[i].mapping);
-				dev_kfree_skb_any(rx_req->skb);
-				rx_req->skb = NULL;
-			}
+			ipoib_ib_recv_ring_stop(priv);
 
 			goto timeout;
 		}
@@ -877,10 +1111,9 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	ipoib_dbg(priv, "All sends and receives done.\n");
 
 timeout:
-	del_timer_sync(&priv->poll_timer);
-	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
+	del_tx_poll_timers(priv);
+
+	set_rings_qp_state(priv, IB_QPS_RESET);
 
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
@@ -901,7 +1134,11 @@ timeout:
 		msleep(1);
 	}
 
-	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP);
+		recv_ring++;
+	}
 
 	return 0;
 }
@@ -919,8 +1156,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		return -ENODEV;
 	}
 
-	setup_timer(&priv->poll_timer, ipoib_ib_tx_timer_func,
-		    (unsigned long) dev);
+	set_tx_poll_timers(priv);
 
 	if (dev->flags & IFF_UP) {
 		if (ipoib_ib_dev_open(dev)) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 8534afd..51bebca 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -132,7 +132,7 @@ int ipoib_open(struct net_device *dev)
 		mutex_unlock(&priv->vlan_mutex);
 	}
 
-	netif_start_queue(dev);
+	netif_tx_start_all_queues(dev);
 
 	return 0;
 
@@ -153,7 +153,7 @@ static int ipoib_stop(struct net_device *dev)
 
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	netif_stop_queue(dev);
+	netif_tx_stop_all_queues(dev);
 
 	ipoib_ib_dev_down(dev, 1);
 	ipoib_ib_dev_stop(dev, 0);
@@ -223,6 +223,8 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 int ipoib_set_mode(struct net_device *dev, const char *buf)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
 
 	/* flush paths if we switch modes so that connections are restarted */
 	if (IPOIB_CM_SUPPORTED(dev->dev_addr) && !strcmp(buf, "connected\n")) {
@@ -231,7 +233,12 @@ int ipoib_set_mode(struct net_device *dev, const char *buf)
 			   "will cause multicast packet drops\n");
 		netdev_update_features(dev);
 		rtnl_unlock();
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+			send_ring++;
+		}
 
 		ipoib_flush_paths(dev);
 		rtnl_lock();
@@ -582,21 +589,35 @@ static int path_rec_start(struct net_device *dev,
 	return 0;
 }
 
-static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
-			   struct net_device *dev)
+static struct ipoib_neigh *neigh_add_path(struct sk_buff *skb, u8 *daddr,
+					  struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 	unsigned long flags;
+	int index;
 
 	spin_lock_irqsave(&priv->lock, flags);
 	neigh = ipoib_neigh_alloc(daddr, dev);
 	if (!neigh) {
 		spin_unlock_irqrestore(&priv->lock, flags);
-		++dev->stats.tx_dropped;
+		index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
-		return;
+		return NULL;
+	}
+
+	/* With multiple TX queues it is possible that more than one skb
+	 * transmission triggered the creation of the neigh. Only one context
+	 * actually created the neigh struct; the others found it in the hash.
+	 * Make sure the neigh is added to the path list only once, since a
+	 * double insertion would lead to an infinite loop in the
+	 * path_rec_completion routine.
+	 */
+	if (unlikely(!list_empty(&neigh->list))) {
+		spin_unlock_irqrestore(&priv->lock, flags);
+		return neigh;
 	}
 
 	path = __path_find(dev, daddr + 4);
@@ -633,7 +654,7 @@ static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
 			spin_unlock_irqrestore(&priv->lock, flags);
 			ipoib_send(dev, skb, path->ah, IPOIB_QPN(daddr));
 			ipoib_neigh_put(neigh);
-			return;
+			return NULL;
 		}
 	} else {
 		neigh->ah  = NULL;
@@ -646,7 +667,7 @@ static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 	ipoib_neigh_put(neigh);
-	return;
+	return NULL;
 
 err_list:
 	list_del(&neigh->list);
@@ -654,11 +675,14 @@ err_list:
 err_path:
 	ipoib_neigh_free(neigh);
 err_drop:
-	++dev->stats.tx_dropped;
+	index = skb_get_queue_mapping(skb);
+	priv->send_ring[index].stats.tx_dropped++;
 	dev_kfree_skb_any(skb);
 
 	spin_unlock_irqrestore(&priv->lock, flags);
 	ipoib_neigh_put(neigh);
+
+	return NULL;
 }
 
 static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
@@ -667,6 +691,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
 	unsigned long flags;
+	int index = skb_get_queue_mapping(skb);
 
 	spin_lock_irqsave(&priv->lock, flags);
 
@@ -689,7 +714,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 			} else
 				__path_add(dev, path);
 		} else {
-			++dev->stats.tx_dropped;
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
@@ -708,7 +733,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		__skb_queue_tail(&path->queue, skb);
 	} else {
-		++dev->stats.tx_dropped;
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 	}
 
@@ -753,8 +778,14 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	case htons(ETH_P_IPV6):
 		neigh = ipoib_neigh_get(dev, cb->hwaddr);
 		if (unlikely(!neigh)) {
-			neigh_add_path(skb, cb->hwaddr, dev);
-			return NETDEV_TX_OK;
+			/* If more than one thread of execution tried to
+			 * create the neigh, only one succeeded; the others
+			 * got the neigh from the hash and should continue
+			 * as usual.
+			 */
+			neigh = neigh_add_path(skb, cb->hwaddr, dev);
+			if (likely(!neigh))
+				return NETDEV_TX_OK;
 		}
 		break;
 	case htons(ETH_P_ARP):
@@ -796,18 +827,70 @@ unref:
 	return NETDEV_TX_OK;
 }
 
+static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+{
+	return 0;
+}
+
 static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	u16 index;
 
 	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));
-	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
-		   netif_queue_stopped(dev),
-		   priv->tx_head, priv->tx_tail);
+
+	for (index = 0; index < priv->num_tx_queues; index++) {
+		if (__netif_subqueue_stopped(dev, index)) {
+			send_ring = priv->send_ring + index;
+			ipoib_warn(priv,
+				   "queue (%d) stopped, head %u, tail %u\n",
+				   index,
+				   send_ring->tx_head, send_ring->tx_tail);
+		}
+	}
 	/* XXX reset QP, etc. */
 }
 
+static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct net_device_stats local_stats;
+	int i;
+
+	memset(&local_stats, 0, sizeof(struct net_device_stats));
+
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rstats = &priv->recv_ring[i].stats;
+		local_stats.rx_packets += rstats->rx_packets;
+		local_stats.rx_bytes   += rstats->rx_bytes;
+		local_stats.rx_errors  += rstats->rx_errors;
+		local_stats.rx_dropped += rstats->rx_dropped;
+	}
+
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tstats = &priv->send_ring[i].stats;
+		local_stats.tx_packets += tstats->tx_packets;
+		local_stats.tx_bytes   += tstats->tx_bytes;
+		local_stats.tx_errors  += tstats->tx_errors;
+		local_stats.tx_dropped += tstats->tx_dropped;
+	}
+
+	stats->rx_packets = local_stats.rx_packets;
+	stats->rx_bytes   = local_stats.rx_bytes;
+	stats->rx_errors  = local_stats.rx_errors;
+	stats->rx_dropped = local_stats.rx_dropped;
+
+	stats->tx_packets = local_stats.tx_packets;
+	stats->tx_bytes   = local_stats.tx_bytes;
+	stats->tx_errors  = local_stats.tx_errors;
+	stats->tx_dropped = local_stats.tx_dropped;
+
+	return stats;
+}
+
 static int ipoib_hard_header(struct sk_buff *skb,
 			     struct net_device *dev,
 			     unsigned short type,
@@ -1260,47 +1343,93 @@ static void ipoib_neigh_hash_uninit(struct net_device *dev)
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rx_allocated, tx_allocated;
+	unsigned long alloc_size;
 
 	if (ipoib_neigh_hash_init(priv) < 0)
 		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
-	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
+	/* Multi queue initialization */
+	priv->recv_ring = kzalloc(priv->num_rx_queues * sizeof(*recv_ring),
 				GFP_KERNEL);
-	if (!priv->rx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
-		       ca->name, ipoib_recvq_size);
+	if (!priv->recv_ring) {
+		pr_warn("%s: failed to allocate the RX ring array (%d rings)\n",
+			ca->name, priv->num_rx_queues);
 		goto out_neigh_hash_cleanup;
 	}
 
-	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
-	if (!priv->tx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, ipoib_sendq_size);
-		goto out_rx_ring_cleanup;
+	alloc_size = ipoib_recvq_size * sizeof(*recv_ring->rx_ring);
+	rx_allocated = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_ring = kzalloc(alloc_size, GFP_KERNEL);
+		if (!recv_ring->rx_ring) {
+			pr_warn("%s: failed to allocate RX ring (%d entries)\n",
+				ca->name, ipoib_recvq_size);
+			goto out_recv_ring_cleanup;
+		}
+		recv_ring->dev = dev;
+		recv_ring->index = i;
+		recv_ring++;
+		rx_allocated++;
+	}
+
+	priv->send_ring = kzalloc(priv->num_tx_queues * sizeof(*send_ring),
+			GFP_KERNEL);
+	if (!priv->send_ring) {
+		pr_warn("%s: failed to allocate the TX ring array (%d rings)\n",
+			ca->name, priv->num_tx_queues);
+		goto out_recv_ring_cleanup;
+	}
+
+	alloc_size = ipoib_sendq_size * sizeof(*send_ring->tx_ring);
+	tx_allocated = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		send_ring->tx_ring = vzalloc(alloc_size);
+		if (!send_ring->tx_ring) {
+			pr_warn(
+				"%s: failed to allocate TX ring (%d entries)\n",
+				ca->name, ipoib_sendq_size);
+			goto out_send_ring_cleanup;
+		}
+		send_ring->dev = dev;
+		send_ring->index = i;
+		send_ring++;
+		tx_allocated++;
 	}
 
 	/* priv->tx_head, tx_tail & tx_outstanding are already 0 */
 
 	if (ipoib_ib_dev_init(dev, ca, port))
-		goto out_tx_ring_cleanup;
+		goto out_send_ring_cleanup;
 
 	return 0;
 
-out_tx_ring_cleanup:
-	vfree(priv->tx_ring);
+out_send_ring_cleanup:
+	for (i = 0; i < tx_allocated; i++)
+		vfree(priv->send_ring[i].tx_ring);
 
-out_rx_ring_cleanup:
-	kfree(priv->rx_ring);
+out_recv_ring_cleanup:
+	for (i = 0; i < rx_allocated; i++)
+		kfree(priv->recv_ring[i].rx_ring);
 
 out_neigh_hash_cleanup:
 	ipoib_neigh_hash_uninit(dev);
 out:
+	priv->send_ring = NULL;
+	priv->recv_ring = NULL;
+
 	return -ENOMEM;
 }
 
 void ipoib_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv;
+	int i;
 	LIST_HEAD(head);
 
 	ASSERT_RTNL();
@@ -1318,11 +1447,17 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	ipoib_ib_dev_cleanup(dev);
 
-	kfree(priv->rx_ring);
-	vfree(priv->tx_ring);
 
-	priv->rx_ring = NULL;
-	priv->tx_ring = NULL;
+	for (i = 0; i < priv->num_tx_queues; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
+
+	priv->recv_ring = NULL;
+	priv->send_ring = NULL;
 
 	ipoib_neigh_hash_uninit(dev);
 }
@@ -1338,7 +1473,9 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
 	.ndo_start_xmit	 	 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_null,
 	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 };
 
@@ -1346,13 +1483,12 @@ void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	/* netdev_ops provides ndo_select_queue for multi-queue operation */
 	dev->netdev_ops		 = &ipoib_netdev_ops;
 	dev->header_ops		 = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
 
-	netif_napi_add(dev, &priv->napi, ipoib_poll, 100);
-
 	dev->watchdog_timeo	 = HZ;
 
 	dev->flags		|= IFF_BROADCAST | IFF_MULTICAST;
@@ -1391,15 +1527,21 @@ void ipoib_setup(struct net_device *dev)
 	INIT_DELAYED_WORK(&priv->neigh_reap_task, ipoib_reap_neigh);
 }
 
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
+					struct ipoib_dev_priv *template_priv)
 {
 	struct net_device *dev;
 
-	dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name,
-			   ipoib_setup);
+	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
+			   ipoib_setup,
+			   template_priv->num_tx_queues,
+			   template_priv->num_rx_queues);
 	if (!dev)
 		return NULL;
 
+	netif_set_real_num_tx_queues(dev, template_priv->num_tx_queues);
+	netif_set_real_num_rx_queues(dev, template_priv->num_rx_queues);
+
 	return netdev_priv(dev);
 }
 
@@ -1499,7 +1641,8 @@ int ipoib_add_pkey_attr(struct net_device *dev)
 	return device_create_file(&dev->dev, &dev_attr_pkey);
 }
 
-int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
+				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
 	int result = -ENOMEM;
@@ -1522,6 +1665,20 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 
 	kfree(device_attr);
 
+	priv->num_rx_queues = 1;
+	priv->num_tx_queues = 1;
+
+	return 0;
+}
+
+int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+{
+	int result;
+
+	result = ipoib_get_hca_features(priv, hca);
+	if (result)
+		return result;
+
 	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
 		priv->dev->hw_features = NETIF_F_SG |
 			NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
@@ -1538,13 +1695,23 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 static struct net_device *ipoib_add_port(const char *format,
 					 struct ib_device *hca, u8 port)
 {
-	struct ipoib_dev_priv *priv;
+	struct ipoib_dev_priv *priv, *template_priv;
 	struct ib_port_attr attr;
 	int result = -ENOMEM;
 
-	priv = ipoib_intf_alloc(format);
-	if (!priv)
-		goto alloc_mem_failed;
+	template_priv = kmalloc(sizeof(*template_priv), GFP_KERNEL);
+	if (!template_priv)
+		goto alloc_mem_failed1;
+
+	if (ipoib_get_hca_features(template_priv, hca))
+		goto device_query_failed;
+
+	priv = ipoib_intf_alloc(format, template_priv);
+	if (!priv) {
+		kfree(template_priv);
+		goto alloc_mem_failed2;
+	}
+	kfree(template_priv);
 
 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
 	priv->dev->dev_id = port - 1;
@@ -1646,7 +1813,13 @@ event_failed:
 device_init_failed:
 	free_netdev(priv->dev);
 
-alloc_mem_failed:
+alloc_mem_failed2:
+	return ERR_PTR(result);
+
+device_query_failed:
+	kfree(template_priv);
+
+alloc_mem_failed1:
 	return ERR_PTR(result);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cecb98a..5c383d9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -69,7 +69,7 @@ struct ipoib_mcast_iter {
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
-	int tx_dropped = 0;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group %pI6\n",
 			mcast->mcmember.mgid.raw);
@@ -81,14 +81,15 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 		ipoib_put_ah(mcast->ah);
 
 	while (!skb_queue_empty(&mcast->pkt_queue)) {
-		++tx_dropped;
-		dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+		struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+		int index = skb_get_queue_mapping(skb);
+		/* Hold the TX lock while updating the per-ring drop counter */
+		netif_tx_lock_bh(dev);
+		priv->send_ring[index].stats.tx_dropped++;
+		netif_tx_unlock_bh(dev);
+		dev_kfree_skb_any(skb);
 	}
 
-	netif_tx_lock_bh(dev);
-	dev->stats.tx_dropped += tx_dropped;
-	netif_tx_unlock_bh(dev);
-
 	kfree(mcast);
 }
 
@@ -172,6 +173,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	struct ipoib_ah *ah;
 	int ret;
 	int set_qkey = 0;
+	int i;
 
 	mcast->mcmember = *mcmember;
 
@@ -188,7 +190,8 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		spin_unlock_irq(&priv->lock);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		for (i = 0; i < priv->num_tx_queues; i++)
+			priv->send_ring[i].tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
 
 		if (!ipoib_cm_admin_enabled(dev)) {
@@ -276,6 +279,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET)
@@ -292,8 +296,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 		/* Flush out any queued packets */
 		netif_tx_lock_bh(dev);
 		while (!skb_queue_empty(&mcast->pkt_queue)) {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+			struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
+			dev_kfree_skb_any(skb);
 		}
 		netif_tx_unlock_bh(dev);
 
@@ -653,7 +659,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)		||
 	    !priv->broadcast					||
 	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		++dev->stats.tx_dropped;
+		int index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		goto unlock;
 	}
@@ -666,9 +673,10 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 
 		mcast = ipoib_mcast_alloc(dev, 0);
 		if (!mcast) {
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			ipoib_warn(priv, "unable to allocate memory for "
 				   "multicast structure\n");
-			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			goto out;
 		}
@@ -683,7 +691,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
 		else {
-			++dev->stats.tx_dropped;
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
@@ -709,7 +718,14 @@ out:
 		spin_lock_irqsave(&priv->lock, flags);
 		if (!neigh) {
 			neigh = ipoib_neigh_alloc(daddr, dev);
-			if (neigh) {
+			/* With TX MQ it is possible that more than one skb
+			 * transmission triggered the creation of the neigh.
+			 * But only one actually created the neigh struct,
+			 * all the others found it in the hash. We must make
+			 * sure that the neigh will be added only once to the
+			 * mcast list.
+			 */
+			if (neigh && list_empty(&neigh->list)) {
 				kref_get(&mcast->ah->ref);
 				neigh->ah	= mcast->ah;
 				list_add_tail(&neigh->list, &mcast->neigh_list);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 049a997..4be626f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -118,6 +118,10 @@ int ipoib_init_qp(struct net_device *dev)
 		goto out_fail;
 	}
 
+	/* Only one ring currently */
+	priv->recv_ring[0].recv_qp = priv->qp;
+	priv->send_ring[0].send_qp = priv->qp;
+
 	return 0;
 
 out_fail:
@@ -142,8 +146,10 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		.qp_type     = IB_QPT_UD
 	};
 
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
-	int i;
+	int i, j;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -167,19 +173,24 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
+	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				     priv->recv_ring, size, 0);
 	if (IS_ERR(priv->recv_cq)) {
 		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
 		goto out_free_mr;
 	}
 
 	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     dev, ipoib_sendq_size, 0);
+				     priv->send_ring, ipoib_sendq_size, 0);
 	if (IS_ERR(priv->send_cq)) {
 		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
 		goto out_free_recv_cq;
 	}
 
+	/* Only one ring */
+	priv->recv_ring[0].recv_cq = priv->recv_cq;
+	priv->send_ring[0].send_cq = priv->send_cq;
+
 	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
 		goto out_free_send_cq;
 
@@ -205,25 +216,43 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
-		priv->tx_sge[i].lkey = priv->mr->lkey;
+	send_ring = priv->send_ring;
+	for (j = 0; j < priv->num_tx_queues; j++) {
+		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
+			send_ring->tx_sge[i].lkey = priv->mr->lkey;
 
-	priv->tx_wr.opcode	= IB_WR_SEND;
-	priv->tx_wr.sg_list	= priv->tx_sge;
-	priv->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring->tx_wr.opcode	= IB_WR_SEND;
+		send_ring->tx_wr.sg_list	= send_ring->tx_sge;
+		send_ring->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring++;
+	}
 
-	priv->rx_sge[0].lkey = priv->mr->lkey;
+	recv_ring = priv->recv_ring;
+	recv_ring->rx_sge[0].lkey = priv->mr->lkey;
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		priv->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
-		priv->rx_sge[1].length = PAGE_SIZE;
-		priv->rx_sge[1].lkey = priv->mr->lkey;
-		priv->rx_wr.num_sge = IPOIB_UD_RX_SG;
+		recv_ring->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
+		recv_ring->rx_sge[1].length = PAGE_SIZE;
+		recv_ring->rx_sge[1].lkey = priv->mr->lkey;
+		recv_ring->rx_wr.num_sge = IPOIB_UD_RX_SG;
 	} else {
-		priv->rx_sge[0].length = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-		priv->rx_wr.num_sge = 1;
+		recv_ring->rx_sge[0].length =
+				IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
+		recv_ring->rx_wr.num_sge = 1;
+	}
+	recv_ring->rx_wr.next = NULL;
+	recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+
+	/* Copy first RX ring sge and wr parameters to the rest of the RX rings */
+	first_recv_ring = recv_ring;
+	recv_ring++;
+	for (i = 1; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_sge[0] = first_recv_ring->rx_sge[0];
+		recv_ring->rx_sge[1] = first_recv_ring->rx_sge[1];
+		recv_ring->rx_wr = first_recv_ring->rx_wr;
+		/* This field is per ring */
+		recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+		recv_ring++;
 	}
-	priv->rx_wr.next = NULL;
-	priv->rx_wr.sg_list = priv->rx_sge;
 
 	return 0;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 8292554..ba633c2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -133,7 +133,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 
 	snprintf(intf_name, sizeof intf_name, "%s.%04x",
 		 ppriv->dev->name, pkey);
-	priv = ipoib_intf_alloc(intf_name);
+	priv = ipoib_intf_alloc(intf_name, ppriv);
 	if (!priv)
 		return -ENOMEM;
 
-- 
1.7.1


* [PATCH V3 for-next 4/5] IB/ipoib: Add RSS and TSS support for datagram mode
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-03-07 17:11   ` [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device Or Gerlitz
@ 2013-03-07 17:11   ` Or Gerlitz
  2013-03-07 17:11   ` [PATCH V3 for-next 5/5] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
  2013-03-18 19:14   ` [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support Or Gerlitz
  5 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds RSS (Receive Side Scaling) and TSS (multi-queue transmit)
support for IPoIB. The RSS and TSS implementation utilizes the new QP
groups concept.

The number of RSS and TSS rings is a function of the number of CPU cores
and of the low-level driver's capability to support QP groups and RSS.

If the low-level driver doesn't support QP groups, then only one RX ring
and one TX ring are created, along with a single QP that both rings use.

If the HW supports RSS then additional receive QPs are created, and each
is assigned to a separate receive ring. The number of additional receive
rings is equal to the number of CPU cores rounded to the next power of two.

If the HW doesn't support RSS then only one receive ring is created
and the parent QP is assigned as its QP.

When TSS is used, additional send QPs are created, and each is assigned to
a separate send ring. The number of additional send rings is equal to the
number of CPU cores rounded to the next power of two.
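
To illustrate, here is a rough sketch of this ring-count selection, reusing
the capability flags (IB_DEVICE_UD_RSS/IB_DEVICE_UD_TSS) and the
max_rss_tbl_sz limit relied on later in this patch; it ignores the
single-core / no-QP-groups fallback and is not the exact driver code:

	/* sketch: pick RSS/TSS ring counts from the online CPU count */
	static void ipoib_pick_ring_counts(struct ipoib_dev_priv *priv,
					   struct ib_device_attr *attr)
	{
		int num_cores = roundup_pow_of_two(num_online_cpus());

		if (priv->hca_caps & IB_DEVICE_UD_RSS) {
			int rss = min(num_cores, (int)attr->max_rss_tbl_sz);

			priv->rss_qp_num    = rounddown_pow_of_two(rss);
			priv->num_rx_queues = priv->rss_qp_num;
		} else {
			priv->rss_qp_num    = 0;	/* parent QP receives */
			priv->num_rx_queues = 1;
		}

		priv->tss_qp_num = num_cores;
		/* with HW TSS the parent QP needs no ring; with SW TSS it
		 * gets an extra ring for ARP and friends */
		priv->num_tx_queues = (priv->hca_caps & IB_DEVICE_UD_TSS) ?
				      priv->tss_qp_num : priv->tss_qp_num + 1;
	}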

It turns out that there are IPoIB drivers used by some operating systems
and/or hypervisors in a para-virtualization (PV) scheme which extract the
source QPN from the CQ WC associated with incoming packets in order to
generate the source MAC address in the emulated MAC header they build.

With TSS, different packets targeted at the same entity (e.g. a VM using a
PV IPoIB instance) could potentially be sent through different TX rings
which map to different UD QPs, each with its own QPN. This may break some
assumptions made by the receiving entity (e.g. rules related to security,
monitoring, etc.).

If the HW supports TSS, it is capable of overriding the source UD QPN
present in the IB datagram header (DETH) of sent packets with the parent's
QPN, which is part of the device HW address as advertised to the Linux
network stack and hence carried in ARP requests/responses. Thus the above
mentioned problem doesn't exist.

When the HW doesn't support TSS but QP groups are supported, which means
the low-level driver can create a set of QPs with contiguous QP numbers,
TSS can still be used; this is called "SW TSS".

In this case, the low-level driver provides IPoIB with a mask when the
parent QP is created. This mask is later written into the reserved field
of the IPoIB header so receivers of SW TSS packets can mask the QPN of
a received packet and discover the parent QPN.
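
As a sketch of the receive-side recovery under that assumption (the mask
size is carried in the upper nibble of the IPoIB header reserved field, as
done later in this series; the helper name and exact mask semantics are
only illustrative and depend on the low-level driver):

	/* sketch: recover the parent QPN of a SW TSS sender */
	static u32 ipoib_sw_tss_parent_qpn(const struct ipoib_header *hdr,
					   u32 src_qp)
	{
		u16 mask_sz = be16_to_cpu(hdr->tss_qpn_mask_sz) >> 12;

		/* clear the low bits that select the TSS child QP */
		return src_qp & ~(u32)((1 << mask_sz) - 1);
	}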

In order not to break interoperability with PV IPoIB drivers which were
not yet enhanced to apply this masking to incoming packets, SW TSS will
only be used if the peer advertised its willingness to accept SW TSS
frames; otherwise the parent QP will be used.

The advertisement to accept TSS frames is done using a dedicated bit in
the reserved byte of the IPoIB HW address (e.g. similar to CM).
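
Sketched, the two sides of that negotiation look roughly like this,
mirroring the IPOIB_FLAGS_TSS flag and the select-queue fallback added
below:

	/* local side: advertise willingness to accept SW TSS frames */
	priv->dev->dev_addr[0] |= IPOIB_FLAGS_TSS;

	/* transmit side: if the peer did not advertise TSS support,
	 * fall back to the parent QP's ring */
	if (unlikely(!IPOIB_TSS_SUPPORTED(cb->hwaddr)))
		return priv->tss_qp_num;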

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |   15 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   10 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  169 +++++++-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  621 ++++++++++++++++++++++++----
 4 files changed, 721 insertions(+), 94 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 9bf96db..1b214f1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -123,7 +123,7 @@ enum {
 
 struct ipoib_header {
 	__be16	proto;
-	u16	reserved;
+	__be16	tss_qpn_mask_sz;
 };
 
 struct ipoib_cb {
@@ -383,9 +383,7 @@ struct ipoib_dev_priv {
 	u16		  pkey_index;
 	struct ib_pd	 *pd;
 	struct ib_mr	 *mr;
-	struct ib_cq	 *recv_cq;
-	struct ib_cq	 *send_cq;
-	struct ib_qp	 *qp;
+	struct ib_qp	 *qp; /* also parent QP for TSS & RSS */
 	u32		  qkey;
 
 	union ib_gid local_gid;
@@ -418,8 +416,11 @@ struct ipoib_dev_priv {
 	struct timer_list poll_timer;
 	struct ipoib_recv_ring *recv_ring;
 	struct ipoib_send_ring *send_ring;
-	unsigned int num_rx_queues;
-	unsigned int num_tx_queues;
+	unsigned int rss_qp_num; /* No RSS HW support 0 */
+	unsigned int tss_qp_num; /* No TSS (HW or SW) used 0 */
+	unsigned int num_rx_queues; /* No RSS HW support 1 */
+	unsigned int num_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	__be16 tss_qpn_mask_sz; /* Put in ipoib header reserved */
 };
 
 struct ipoib_ah {
@@ -587,9 +588,11 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca);
 
 #define IPOIB_FLAGS_RC		0x80
 #define IPOIB_FLAGS_UC		0x40
+#define IPOIB_FLAGS_TSS		0x20
 
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
+#define IPOIB_TSS_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_TSS))
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 4871dc9..01ce5e9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -286,6 +286,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev,
 	/*
 	 * Drop packets that this interface sent, ie multicast packets
 	 * that the HCA has replicated.
+	 * Note: with SW TSS, MC packets were sent using priv->qp, so no need to mask
 	 */
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
@@ -1058,6 +1059,15 @@ static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
 static void set_rings_qp_state(struct ipoib_dev_priv *priv,
 				enum ib_qp_state new_state)
 {
+	if (priv->hca_caps & IB_DEVICE_UD_TSS) {
+		/* TSS HW is supported, parent QP has no ring (send_ring) */
+		struct ib_qp_attr qp_attr;
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+	}
+
 	set_tx_rings_qp_state(priv, new_state);
 
 	if (priv->num_rx_queues > 1)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 51bebca..8089137 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -747,7 +747,9 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
 	struct ipoib_header *header;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
 
+	send_ring = priv->send_ring + skb_get_queue_mapping(skb);
 	header = (struct ipoib_header *) skb->data;
 
 	if (unlikely(cb->hwaddr[4] == 0xff)) {
@@ -757,7 +759,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		    (header->proto != htons(ETH_P_ARP)) &&
 		    (header->proto != htons(ETH_P_RARP))) {
 			/* ethertype not supported by IPoIB */
-			++dev->stats.tx_dropped;
+			++send_ring->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			return NETDEV_TX_OK;
 		}
@@ -795,7 +797,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_OK;
 	default:
 		/* ethertype not supported by IPoIB */
-		++dev->stats.tx_dropped;
+		++send_ring->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
@@ -803,11 +805,19 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 send_using_neigh:
 	/* note we now hold a ref to neigh */
 	if (ipoib_cm_get(neigh)) {
+		/* CM wasn't enabled at select-queue time; ring is likely wrong */
+		if (!IPOIB_CM_SUPPORTED(cb->hwaddr))
+			goto drop;
+
 		if (ipoib_cm_up(neigh)) {
 			ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
 			goto unref;
 		}
 	} else if (neigh->ah) {
+		/* CM was enabled at select-queue time; ring is likely wrong */
+		if (IPOIB_CM_SUPPORTED(cb->hwaddr) && priv->num_tx_queues > 1)
+			goto drop;
+
 		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(cb->hwaddr));
 		goto unref;
 	}
@@ -816,20 +826,78 @@ send_using_neigh:
 		spin_lock_irqsave(&priv->lock, flags);
 		__skb_queue_tail(&neigh->queue, skb);
 		spin_unlock_irqrestore(&priv->lock, flags);
-	} else {
-		++dev->stats.tx_dropped;
-		dev_kfree_skb_any(skb);
+		goto unref;
 	}
 
+drop:
+	++send_ring->stats.tx_dropped;
+	dev_kfree_skb_any(skb);
+
 unref:
 	ipoib_neigh_put(neigh);
 
 	return NETDEV_TX_OK;
 }
 
-static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+static u16 ipoib_select_queue_hw(struct net_device *dev, struct sk_buff *skb)
 {
-	return 0;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cb *cb = (struct ipoib_cb *)skb->cb;
+
+	/* (BC/MC), stay on this core */
+	if (unlikely(cb->hwaddr[4] == 0xff))
+		return smp_processor_id() % priv->tss_qp_num;
+
+	/* is CM in use */
+	if (IPOIB_CM_SUPPORTED(cb->hwaddr)) {
+		if (ipoib_cm_admin_enabled(dev)) {
+			/* use remote QP for hash, so we use the same ring */
+			u32 *d32 = (u32 *)cb->hwaddr;
+			u32 hv = jhash_1word(*d32 & IPOIB_QPN_MASK, 0);
+			return hv % priv->tss_qp_num;
+		} else
+			/* CM might become admin enabled by transmit time,
+			 * and we might then transmit on the CM QP not from
+			 * its designated ring */
+			cb->hwaddr[0] &= ~IPOIB_FLAGS_RC;
+	}
+	return skb_tx_hash(dev, skb);
+}
+
+static u16 ipoib_select_queue_sw(struct net_device *dev, struct sk_buff *skb)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cb *cb = (struct ipoib_cb *)skb->cb;
+	struct ipoib_header *header;
+
+	/* (BC/MC) use designated QDISC -> parent QP */
+	if (unlikely(cb->hwaddr[4] == 0xff))
+		return priv->tss_qp_num;
+
+	/* is CM in use */
+	if (IPOIB_CM_SUPPORTED(cb->hwaddr)) {
+		if (ipoib_cm_admin_enabled(dev)) {
+			/* use remote QP for hash, so we use the same ring */
+			u32 *d32 = (u32 *)cb->hwaddr;
+			u32 hv = jhash_1word(*d32 & IPOIB_QPN_MASK, 0);
+			return hv % priv->tss_qp_num;
+		} else
+			/* CM might become admin enabled by transmit time,
+			 * and we might then transmit on the CM QP not from
+			 * its designated ring */
+			cb->hwaddr[0] &= ~IPOIB_FLAGS_RC;
+	}
+
+	/* Did neighbour advertise TSS support */
+	if (unlikely(!IPOIB_TSS_SUPPORTED(cb->hwaddr)))
+		return priv->tss_qp_num;
+
+	/* We are after ipoib_hard_header so skb->data is O.K. */
+	header = (struct ipoib_header *)skb->data;
+	header->tss_qpn_mask_sz |= priv->tss_qpn_mask_sz;
+
+	/* don't use special ring in TX */
+	return __skb_tx_hash(dev, skb, priv->tss_qp_num);
 }
 
 static void ipoib_timeout(struct net_device *dev)
@@ -902,7 +970,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
 	header->proto = htons(type);
-	header->reserved = 0;
+	header->tss_qpn_mask_sz = 0;
 
 	/*
 	 * we don't rely on dst_entry structure,  always stuff the
@@ -961,7 +1029,8 @@ struct ipoib_neigh *ipoib_neigh_get(struct net_device *dev, u8 *daddr)
 	for (neigh = rcu_dereference_bh(htbl->buckets[hash_val]);
 	     neigh != NULL;
 	     neigh = rcu_dereference_bh(neigh->hnext)) {
-		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+		/* don't use flags for the compare */
+		if (memcmp(daddr+1, neigh->daddr+1, INFINIBAND_ALEN-1) == 0) {
 			/* found, take one ref on behalf of the caller */
 			if (!atomic_inc_not_zero(&neigh->refcnt)) {
 				/* deleted */
@@ -1088,7 +1157,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(u8 *daddr,
 	     neigh != NULL;
 	     neigh = rcu_dereference_protected(neigh->hnext,
 					       lockdep_is_held(&priv->lock))) {
-		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+		/* don't use flags for the compare */
+		if (memcmp(daddr+1, neigh->daddr+1, INFINIBAND_ALEN-1) == 0) {
 			/* found, take one ref on behalf of the caller */
 			if (!atomic_inc_not_zero(&neigh->refcnt)) {
 				/* deleted */
@@ -1466,25 +1536,52 @@ static const struct header_ops ipoib_header_ops = {
 	.create	= ipoib_hard_header,
 };
 
-static const struct net_device_ops ipoib_netdev_ops = {
+static const struct net_device_ops ipoib_netdev_ops_no_tss = {
 	.ndo_uninit		 = ipoib_uninit,
 	.ndo_open		 = ipoib_open,
 	.ndo_stop		 = ipoib_stop,
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
-	.ndo_start_xmit	 	 = ipoib_start_xmit,
-	.ndo_select_queue	 = ipoib_select_queue_null,
+	.ndo_start_xmit		 = ipoib_start_xmit,
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 };
 
+static const struct net_device_ops ipoib_netdev_ops_hw_tss = {
+	.ndo_uninit		 = ipoib_uninit,
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_hw,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+};
+
+static const struct net_device_ops ipoib_netdev_ops_sw_tss = {
+	.ndo_uninit		 = ipoib_uninit,
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_sw,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+};
+
+static const struct net_device_ops *ipoib_netdev_ops;
+
 void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* Use correct ops (ndo_select_queue) */
-	dev->netdev_ops		 = &ipoib_netdev_ops;
+	dev->netdev_ops		 = ipoib_netdev_ops;
 	dev->header_ops		 = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
@@ -1532,6 +1629,16 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 {
 	struct net_device *dev;
 
+	/* Use correct ops (ndo_select_queue) passed to ipoib_setup */
+	if (template_priv->num_tx_queues > 1) {
+		if (template_priv->hca_caps & IB_DEVICE_UD_TSS)
+			ipoib_netdev_ops = &ipoib_netdev_ops_hw_tss;
+		else
+			ipoib_netdev_ops = &ipoib_netdev_ops_sw_tss;
+	} else
+		ipoib_netdev_ops = &ipoib_netdev_ops_no_tss;
+
+
 	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
 			   ipoib_setup,
 			   template_priv->num_tx_queues,
@@ -1645,6 +1752,7 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
+	int num_cores;
 	int result = -ENOMEM;
 
 	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
@@ -1663,10 +1771,39 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 	}
 	priv->hca_caps = device_attr->device_cap_flags;
 
+	num_cores = num_online_cpus();
+	if (num_cores == 1 || !(priv->hca_caps & IB_DEVICE_QPG)) {
+		/* No additional QP, only one QP for RX & TX */
+		priv->rss_qp_num = 0;
+		priv->tss_qp_num = 0;
+		priv->num_rx_queues = 1;
+		priv->num_tx_queues = 1;
+		kfree(device_attr);
+		return 0;
+	}
+	num_cores = roundup_pow_of_two(num_cores);
+	if (priv->hca_caps & IB_DEVICE_UD_RSS) {
+		int max_rss_tbl_sz;
+		max_rss_tbl_sz = device_attr->max_rss_tbl_sz;
+		max_rss_tbl_sz = min(num_cores, max_rss_tbl_sz);
+		max_rss_tbl_sz = rounddown_pow_of_two(max_rss_tbl_sz);
+		priv->rss_qp_num    = max_rss_tbl_sz;
+		priv->num_rx_queues = max_rss_tbl_sz;
+	} else {
+		/* No additional QP, only the parent QP for RX */
+		priv->rss_qp_num = 0;
+		priv->num_rx_queues = 1;
+	}
+
 	kfree(device_attr);
 
-	priv->num_rx_queues = 1;
-	priv->num_tx_queues = 1;
+	priv->tss_qp_num = num_cores;
+	if (priv->hca_caps & IB_DEVICE_UD_TSS)
+		/* TSS is supported by HW */
+		priv->num_tx_queues = priv->tss_qp_num;
+	else
+		/* If TSS is not supported by HW use the parent QP for ARP */
+		priv->num_tx_queues = priv->tss_qp_num + 1;
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 4be626f..3917d3c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -35,6 +35,31 @@
 
 #include "ipoib.h"
 
+static int set_qps_qkey(struct ipoib_dev_priv *priv)
+{
+	struct ib_qp_attr *qp_attr;
+	struct ipoib_recv_ring *recv_ring;
+	int ret = -ENOMEM;
+	int i;
+
+	qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL);
+	if (!qp_attr)
+		return -ENOMEM;
+
+	qp_attr->qkey = priv->qkey;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ret = ib_modify_qp(recv_ring->recv_qp, qp_attr, IB_QP_QKEY);
+		if (ret)
+			break;
+		recv_ring++;
+	}
+
+	kfree(qp_attr);
+
+	return ret;
+}
+
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int set_qkey)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -50,18 +75,9 @@ int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int
 	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 
 	if (set_qkey) {
-		ret = -ENOMEM;
-		qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
-		if (!qp_attr)
-			goto out;
-
-		/* set correct QKey for QP */
-		qp_attr->qkey = priv->qkey;
-		ret = ib_modify_qp(priv->qp, qp_attr, IB_QP_QKEY);
-		if (ret) {
-			ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret);
+		ret = set_qps_qkey(priv);
+		if (ret)
 			goto out;
-		}
 	}
 
 	/* attach QP to multicast group */
@@ -74,16 +90,13 @@ out:
 	return ret;
 }
 
-int ipoib_init_qp(struct net_device *dev)
+static int ipoib_init_one_qp(struct ipoib_dev_priv *priv, struct ib_qp *qp,
+				int init_attr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		return -1;
-
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
@@ -92,17 +105,18 @@ int ipoib_init_qp(struct net_device *dev)
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
 	    IB_QP_PKEY_INDEX |
-	    IB_QP_STATE;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	    IB_QP_STATE | init_attr;
+
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
-		ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret);
+		ipoib_warn(priv, "failed to modify QP to INIT, ret = %d\n", ret);
 		goto out_fail;
 	}
 
 	qp_attr.qp_state = IB_QPS_RTR;
 	/* Can't set this in a INIT->RTR transition */
-	attr_mask &= ~IB_QP_PORT;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	attr_mask &= ~(IB_QP_PORT | init_attr);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret);
 		goto out_fail;
@@ -112,40 +126,417 @@ int ipoib_init_qp(struct net_device *dev)
 	qp_attr.sq_psn = 0;
 	attr_mask |= IB_QP_SQ_PSN;
 	attr_mask &= ~IB_QP_PKEY_INDEX;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret);
 		goto out_fail;
 	}
 
-	/* Only one ring currently */
-	priv->recv_ring[0].recv_qp = priv->qp;
-	priv->send_ring[0].send_qp = priv->qp;
-
 	return 0;
 
 out_fail:
 	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+	if (ib_modify_qp(qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
 	return ret;
 }
 
-int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+static int ipoib_init_rss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, recv_ring->recv_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				   "failed to init rss qp, ind = %d, ret=%d\n",
+				   i, ret);
+			goto out_free_reset_qp;
+		}
+		recv_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_init_tss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	send_ring = priv->send_ring;
+	/*
+	 * Note: if there are more TX queues than TSS QPs, the last QP
+	 * is the parent QP and it will be initialized later
+	 */
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, send_ring->send_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				   "failed to init tss qp, ind = %d, ret=%d\n",
+				   i, ret);
+			goto out_free_reset_qp;
+		}
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+int ipoib_init_qp(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int ret, i, attr;
+
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
+		ipoib_warn(priv, "PKEY not assigned\n");
+		return -1;
+	}
+
+	/* Init parent QP */
+	/* If rss_qp_num = 0 then the parent QP is the RX QP */
+	ret = ipoib_init_rss_qps(dev);
+	if (ret)
+		return ret;
+
+	ret = ipoib_init_tss_qps(dev);
+	if (ret)
+		goto out_reset_tss_qp;
+
+	/* Init the parent QP which can be the only QP */
+	attr = priv->rss_qp_num > 0 ? IB_QP_GROUP_RSS : 0;
+	ret = ipoib_init_one_qp(priv, priv->qp, attr);
+	if (ret) {
+		ipoib_warn(priv, "failed to init parent qp, ret=%d\n", ret);
+		goto out_reset_rss_qp;
+	}
+
+	return 0;
+
+out_reset_rss_qp:
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+out_reset_tss_qp:
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_transport_cq_init(struct net_device *dev,
+							int size)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	struct ib_cq *cq;
+	int i, allocated_rx, allocated_tx, req_vec;
+
+	allocated_rx = 0;
+	allocated_tx = 0;
+
+	/* We may have oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		/* Try to spread vectors based on port and ring numbers */
+		cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				  recv_ring, size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			pr_warn("%s: failed to create recv CQ\n",
+				priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring->recv_cq = cq;
+		allocated_rx++;
+		req_vec++;
+		if (ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP)) {
+			pr_warn("%s: req notify recv CQ\n",
+				priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring++;
+	}
+
+	/* We may have oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		cq = ib_create_cq(priv->ca,
+				  ipoib_send_comp_handler, NULL,
+				  send_ring, ipoib_sendq_size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			pr_warn("%s: failed to create send CQ\n",
+				priv->ca->name);
+			goto out_free_send_cqs;
+		}
+		send_ring->send_cq = cq;
+		allocated_tx++;
+		req_vec++;
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_send_cqs:
+	for (i = 0; i < allocated_tx; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
+
+out_free_recv_cqs:
+	for (i = 0; i < allocated_rx; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
+
+	return -ENODEV;
+}
+
+static int ipoib_create_parent_qp(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
+		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	if (priv->tss_qp_num == 0 && priv->rss_qp_num == 0)
+		/* Legacy mode */
+		init_attr.qpg_type = IB_QPG_NONE;
+	else {
+		init_attr.qpg_type = IB_QPG_PARENT;
+		init_attr.parent_attrib.tss_child_count = priv->tss_qp_num;
+		init_attr.parent_attrib.rss_child_count = priv->rss_qp_num;
+	}
+
+	/*
+	 * No TSS (tss_qp_num == 0, priv->num_tx_queues == 1),
+	 * or TSS is not supported in HW; in either case the
+	 * parent QP is used for ARP and friends transmission
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num) {
+		init_attr.cap.max_send_wr  = ipoib_sendq_size;
+		init_attr.cap.max_send_sge = 1;
+	}
+
+	/* No RSS parent QP will be used for RX */
+	if (priv->rss_qp_num == 0) {
+		init_attr.cap.max_recv_wr  = ipoib_recvq_size;
+		init_attr.cap.max_recv_sge = IPOIB_UD_RX_SG;
+	}
+
+	/* Note that if parent QP is not used for RX/TX then this is harmless */
+	init_attr.recv_cq = priv->recv_ring[0].recv_cq;
+	init_attr.send_cq = priv->send_ring[priv->tss_qp_num].send_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		return -ENODEV; /* qp is an error value and will be checked */
+	}
+
+	priv->qp = qp;
+
+	/* TSS is not supported in HW or NO TSS (tss_qp_num = 0) */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		priv->send_ring[priv->tss_qp_num].send_qp = qp;
+
+	/* No RSS parent QP will be used for RX */
+	if (priv->rss_qp_num == 0)
+		priv->recv_ring[0].recv_qp = qp;
+
+	/* only with SW TSS there is a need for a mask */
+	if ((priv->hca_caps & IB_DEVICE_UD_TSS) || (priv->tss_qp_num == 0))
+		/* TSS is supported by HW or no TSS at all */
+		priv->tss_qpn_mask_sz = 0;
+	else {
+		/* SW TSS, get mask back from HW, put in the upper nibble */
+		u16 tmp = (u16)init_attr.cap.qpg_tss_mask_sz;
+		priv->tss_qpn_mask_sz = cpu_to_be16((tmp << 12));
+	}
+	return 0;
+}
+
+static struct ib_qp *ipoib_create_tss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr init_attr = {
 		.cap = {
 			.max_send_wr  = ipoib_sendq_size,
-			.max_recv_wr  = ipoib_recvq_size,
 			.max_send_sge = 1,
+		},
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	init_attr.qpg_type = IB_QPG_CHILD_TX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.recv_cq = priv->send_ring[ind].send_cq;
+	init_attr.send_cq = init_attr.recv_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create TSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
+
+	return qp;
+}
+
+static struct ib_qp *ipoib_create_rss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.cap = {
+			.max_recv_wr  = ipoib_recvq_size,
 			.max_recv_sge = IPOIB_UD_RX_SG
 		},
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type     = IB_QPT_UD
 	};
+	struct ib_qp *qp;
+
+	init_attr.qpg_type = IB_QPG_CHILD_RX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.recv_cq = priv->recv_ring[ind].recv_cq;
+	init_attr.send_cq = init_attr.recv_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create RSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
 
+	return qp;
+}
+
+static int ipoib_create_other_qps(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rss_created, tss_created;
+	struct ib_qp *qp;
+
+	tss_created = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp = ipoib_create_tss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			pr_warn("%s: failed to create QP\n",
+				ca->name);
+			goto out_free_send_qp;
+		}
+		send_ring->send_qp = qp;
+		send_ring++;
+		tss_created++;
+	}
+
+	rss_created = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp = ipoib_create_rss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			pr_warn("%s: failed to create QP\n",
+				ca->name);
+			goto out_free_recv_qp;
+		}
+		recv_ring->recv_qp = qp;
+		recv_ring++;
+		rss_created++;
+	}
+
+	return 0;
+
+out_free_recv_qp:
+	for (i = 0; i < rss_created; i++) {
+		ib_destroy_qp(priv->recv_ring[i].recv_qp);
+		priv->recv_ring[i].recv_qp = NULL;
+	}
+
+out_free_send_qp:
+	for (i = 0; i < tss_created; i++) {
+		ib_destroy_qp(priv->send_ring[i].send_qp);
+		priv->send_ring[i].send_qp = NULL;
+	}
+
+	return -ENODEV;
+}
+
+int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_send_ring *send_ring;
 	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
@@ -173,49 +564,38 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
-				     priv->recv_ring, size, 0);
-	if (IS_ERR(priv->recv_cq)) {
-		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
+	/* Create CQ(s) */
+	ret = ipoib_transport_cq_init(dev, size);
+	if (ret) {
+		pr_warn("%s: ipoib_transport_cq_init failed\n", ca->name);
 		goto out_free_mr;
 	}
 
-	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     priv->send_ring, ipoib_sendq_size, 0);
-	if (IS_ERR(priv->send_cq)) {
-		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
-		goto out_free_recv_cq;
-	}
-
-	/* Only one ring */
-	priv->recv_ring[0].recv_cq = priv->recv_cq;
-	priv->send_ring[0].send_cq = priv->send_cq;
-
-	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
-		goto out_free_send_cq;
-
-	init_attr.send_cq = priv->send_cq;
-	init_attr.recv_cq = priv->recv_cq;
-
-	if (priv->hca_caps & IB_DEVICE_UD_TSO)
-		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
-
-	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
-		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
-
-	if (dev->features & NETIF_F_SG)
-		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
-
-	priv->qp = ib_create_qp(priv->pd, &init_attr);
-	if (IS_ERR(priv->qp)) {
-		printk(KERN_WARNING "%s: failed to create QP\n", ca->name);
-		goto out_free_send_cq;
+	/* Init the parent QP */
+	ret = ipoib_create_parent_qp(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		goto out_free_cqs;
 	}
 
+	/*
+	 * advertise that we are willing to accept frames from a TSS sender;
+	 * note that this only indicates that this side is willing to accept
+	 * TSS frames, it doesn't imply that it will use TSS since for
+	 * transmission the peer should advertise TSS as well
+	 */
+	priv->dev->dev_addr[0] |= IPOIB_FLAGS_TSS;
 	priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff;
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
+	/* create TSS & RSS QPs */
+	ret = ipoib_create_other_qps(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create QP(s)\n", ca->name);
+		goto out_free_parent_qp;
+	}
+
 	send_ring = priv->send_ring;
 	for (j = 0; j < priv->num_tx_queues; j++) {
 		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
@@ -256,11 +636,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 
 	return 0;
 
-out_free_send_cq:
-	ib_destroy_cq(priv->send_cq);
+out_free_parent_qp:
+	ib_destroy_qp(priv->qp);
+	priv->qp = NULL;
+
+out_free_cqs:
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
 
-out_free_recv_cq:
-	ib_destroy_cq(priv->recv_cq);
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
 
 out_free_mr:
 	ib_dereg_mr(priv->mr);
@@ -271,10 +660,101 @@ out_free_pd:
 	return -ENODEV;
 }
 
+static void ipoib_destroy_tx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		if (send_ring->send_qp) {
+			if (ib_destroy_qp(send_ring->send_qp))
+				ipoib_warn(priv, "ib_destroy_qp (send) failed\n");
+			send_ring->send_qp = NULL;
+		}
+		send_ring++;
+	}
+
+	/*
+	 * No support of TSS in HW
+	 * so there is an extra QP but it is freed later
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		send_ring->send_qp = NULL;
+}
+
+static void ipoib_destroy_rx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		if (recv_ring->recv_qp) {
+			if (ib_destroy_qp(recv_ring->recv_qp))
+				ipoib_warn(priv, "ib_destroy_qp (recv) failed\n");
+			recv_ring->recv_qp = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void ipoib_destroy_tx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		if (send_ring->send_cq) {
+			if (ib_destroy_cq(send_ring->send_cq))
+				ipoib_warn(priv, "ib_destroy_cq (send) failed\n");
+			send_ring->send_cq = NULL;
+		}
+		send_ring++;
+	}
+}
+
+static void ipoib_destroy_rx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		if (recv_ring->recv_cq) {
+			if (ib_destroy_cq(recv_ring->recv_cq))
+				ipoib_warn(priv, "ib_destroy_cq (recv) failed\n");
+			recv_ring->recv_cq = NULL;
+		}
+		recv_ring++;
+	}
+}
+
 void ipoib_transport_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	ipoib_destroy_rx_qps(dev);
+	ipoib_destroy_tx_qps(dev);
+
+	/* Destroy parent or only QP */
 	if (priv->qp) {
 		if (ib_destroy_qp(priv->qp))
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
@@ -283,11 +763,8 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 	}
 
-	if (ib_destroy_cq(priv->send_cq))
-		ipoib_warn(priv, "ib_cq_destroy (send) failed\n");
-
-	if (ib_destroy_cq(priv->recv_cq))
-		ipoib_warn(priv, "ib_cq_destroy (recv) failed\n");
+	ipoib_destroy_rx_cqs(dev);
+	ipoib_destroy_tx_cqs(dev);
 
 	ipoib_cm_dev_cleanup(dev);
 
-- 
1.7.1


* [PATCH V3 for-next 5/5] IB/ipoib: Support changing the number of RX/TX rings with ethtool
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-03-07 17:11   ` [PATCH V3 for-next 4/5] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
@ 2013-03-07 17:11   ` Or Gerlitz
  2013-03-18 19:14   ` [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support Or Gerlitz
  5 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-03-07 17:11 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The number of RX/TX rings can now be queried or changed using the ethtool
ETHTOOL_{G/S}CHANNELS directives which get/set the number of channels.

Added ipoib_reinit(), which releases all the rings and their associated
resources and, immediately afterwards, allocates them again according
to the new number of rings. To that end, code which is common to device
cleanup and device reinit was moved from the device cleanup flow to a
routine which is called in both cases.

In some flows, the ndo_get_stats entry (which now reads the per-ring
statistics of an ipoib netdevice) is called by the core networking
code without RTNL locking. To protect against such a call being made
in parallel with an ethtool call that changes the number of rings, a
read/write semaphore on the rings was added.
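
A condensed sketch of that reader/writer interaction (the rings_rwsem name
is as in the patch below; the two helper names are only illustrative and
the bodies are abbreviated):

	/* writer side: ring reconfiguration triggered by ethtool */
	static void ipoib_rings_reconfig(struct ipoib_dev_priv *priv)
	{
		down_write(&priv->rings_rwsem);	/* block stat readers */
		/* ... free old rings, allocate rings for the new counts ... */
		up_write(&priv->rings_rwsem);
	}

	/* reader side: ndo_get_stats, possibly called without RTNL held */
	static struct net_device_stats *ipoib_read_stats(struct net_device *dev)
	{
		struct ipoib_dev_priv *priv = netdev_priv(dev);
		struct net_device_stats *stats = &dev->stats;

		if (!down_read_trylock(&priv->rings_rwsem))
			return stats;	/* rings in flux, return last values */
		/* ... sum the per-ring counters into *stats ... */
		up_read(&priv->rings_rwsem);
		return stats;
	}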

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h         |    9 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c |   68 +++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c      |    4 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |  133 ++++++++++++++++++++++----
 4 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 1b214f1..cf6ab56 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -418,8 +418,11 @@ struct ipoib_dev_priv {
 	struct ipoib_send_ring *send_ring;
 	unsigned int rss_qp_num; /* No RSS HW support 0 */
 	unsigned int tss_qp_num; /* No TSS (HW or SW) used 0 */
-	unsigned int num_rx_queues; /* No RSS HW support 1 */
-	unsigned int num_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	unsigned int max_rx_queues; /* No RSS HW support 1 */
+	unsigned int max_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	unsigned int num_rx_queues; /* Actual */
+	unsigned int num_tx_queues; /* Actual */
+	struct rw_semaphore rings_rwsem;
 	__be16 tss_qpn_mask_sz; /* Put in ipoib header reserved */
 };
 
@@ -528,6 +531,8 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
 
+int ipoib_reinit(struct net_device *dev, int num_rx, int num_tx);
+
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index 7c56341..f79a8a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -172,6 +172,72 @@ static void ipoib_get_ethtool_stats(struct net_device *dev,
 	}
 }
 
+static void ipoib_get_channels(struct net_device *dev,
+			struct ethtool_channels *channel)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	channel->max_rx = priv->max_rx_queues;
+	channel->max_tx = priv->max_tx_queues;
+	channel->max_other = 0;
+	channel->max_combined = priv->max_rx_queues +
+				priv->max_tx_queues;
+	channel->rx_count = priv->num_rx_queues;
+	channel->tx_count = priv->num_tx_queues;
+	channel->other_count = 0;
+	channel->combined_count = priv->num_rx_queues +
+				priv->num_tx_queues;
+}
+
+static int ipoib_set_channels(struct net_device *dev,
+			struct ethtool_channels *channel)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (channel->other_count)
+		return -EINVAL;
+
+	if (channel->combined_count !=
+		priv->num_rx_queues + priv->num_tx_queues)
+		return -EINVAL;
+
+	if (channel->rx_count == 0 ||
+	    channel->rx_count > priv->max_rx_queues)
+		return -EINVAL;
+
+	if (!is_power_of_2(channel->rx_count))
+		return -EINVAL;
+
+	if (channel->tx_count  == 0 ||
+	    channel->tx_count > priv->max_tx_queues)
+		return -EINVAL;
+
+	/* Nothing to do ? */
+	if (channel->rx_count == priv->num_rx_queues &&
+	    channel->tx_count == priv->num_tx_queues)
+		return 0;
+
+	/* 1 is always O.K. */
+	if (channel->tx_count > 1) {
+		if (priv->hca_caps & IB_DEVICE_UD_TSS) {
+			/* with HW TSS tx_count is 2^N */
+			if (!is_power_of_2(channel->tx_count))
+				return -EINVAL;
+		} else {
+			/*
+			 * with SW TSS tx_count = 1 + 2^N;
+			 * 2 is not allowed since it makes no sense.
+			 * To disable TSS use 1.
+			 */
+			if (!is_power_of_2(channel->tx_count - 1) ||
+			    channel->tx_count == 2)
+				return -EINVAL;
+		}
+	}
+
+	return ipoib_reinit(dev, channel->rx_count, channel->tx_count);
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo		= ipoib_get_drvinfo,
 	.get_coalesce		= ipoib_get_coalesce,
@@ -179,6 +245,8 @@ static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_strings		= ipoib_get_strings,
 	.get_sset_count		= ipoib_get_sset_count,
 	.get_ethtool_stats	= ipoib_get_ethtool_stats,
+	.get_channels		= ipoib_get_channels,
+	.set_channels		= ipoib_set_channels,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 01ce5e9..fa4958c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -736,8 +736,10 @@ static void ipoib_napi_disable(struct net_device *dev)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
-	for (i = 0; i < priv->num_rx_queues; i++)
+	for (i = 0; i < priv->num_rx_queues; i++) {
 		napi_disable(&priv->recv_ring[i].napi);
+		netif_napi_del(&priv->recv_ring[i].napi);
+	}
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 8089137..a1f10b3 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -928,6 +928,10 @@ static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 	struct net_device_stats local_stats;
 	int i;
 
+	/* if rings are not ready yet return last values */
+	if (!down_read_trylock(&priv->rings_rwsem))
+		return stats;
+
 	memset(&local_stats, 0, sizeof(struct net_device_stats));
 
 	for (i = 0; i < priv->num_rx_queues; i++) {
@@ -946,6 +950,8 @@ static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 		local_stats.tx_dropped += tstats->tx_dropped;
 	}
 
+	up_read(&priv->rings_rwsem);
+
 	stats->rx_packets = local_stats.rx_packets;
 	stats->rx_bytes   = local_stats.rx_bytes;
 	stats->rx_errors  = local_stats.rx_errors;
@@ -1476,6 +1482,8 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_send_ring_cleanup;
 
+	/* access to rings allowed */
+	up_write(&priv->rings_rwsem);
 
 	return 0;
 
@@ -1496,10 +1504,36 @@ out:
 	return -ENOMEM;
 }
 
+static void ipoib_dev_uninit(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i;
+
+	ASSERT_RTNL();
+
+	ipoib_ib_dev_cleanup(dev);
+
+	/* no more access to rings */
+	down_write(&priv->rings_rwsem);
+
+	for (i = 0; i < priv->num_tx_queues; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
+
+	priv->recv_ring = NULL;
+	priv->send_ring = NULL;
+
+	ipoib_neigh_hash_uninit(dev);
+}
+
 void ipoib_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv;
-	int i;
+
 	LIST_HEAD(head);
 
 	ASSERT_RTNL();
@@ -1513,23 +1547,71 @@ void ipoib_dev_cleanup(struct net_device *dev)
 		cancel_delayed_work(&cpriv->neigh_reap_task);
 		unregister_netdevice_queue(cpriv->dev, &head);
 	}
+
 	unregister_netdevice_many(&head);
 
-	ipoib_ib_dev_cleanup(dev);
+	ipoib_dev_uninit(dev);
 
+	/* ipoib_dev_uninit took the rings lock but can't release it when
+	 * called by ipoib_reinit; for the cleanup flow, release it here
+	 */
+	up_write(&priv->rings_rwsem);
+}
 
-	for (i = 0; i < priv->num_tx_queues; i++)
-		vfree(priv->send_ring[i].tx_ring);
-	kfree(priv->send_ring);
+int ipoib_reinit(struct net_device *dev, int num_rx, int num_tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int flags;
+	int ret;
 
-	for (i = 0; i < priv->num_rx_queues; i++)
-		kfree(priv->recv_ring[i].rx_ring);
-	kfree(priv->recv_ring);
+	flags = dev->flags;
+	dev_close(dev);
 
-	priv->recv_ring = NULL;
-	priv->send_ring = NULL;
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags))
+		ib_unregister_event_handler(&priv->event_handler);
 
-	ipoib_neigh_hash_uninit(dev);
+	ipoib_dev_uninit(dev);
+
+	priv->num_rx_queues = num_rx;
+	priv->num_tx_queues = num_tx;
+	if (num_rx == 1)
+		priv->rss_qp_num = 0;
+	else
+		priv->rss_qp_num = num_rx;
+	if (num_tx == 1 || !(priv->hca_caps & IB_DEVICE_UD_TSS))
+		priv->tss_qp_num = num_tx - 1;
+	else
+		priv->tss_qp_num = num_tx;
+
+	netif_set_real_num_tx_queues(dev, num_tx);
+	netif_set_real_num_rx_queues(dev, num_rx);
+
+	/* prevent ipoib_ib_dev_init from calling ipoib_ib_dev_open,
+	 * let ipoib_open do it
+	 */
+	dev->flags &= ~IFF_UP;
+	ret = ipoib_dev_init(dev, priv->ca, priv->port);
+	if (ret) {
+		pr_warn("%s: failed to reinitialize port %d (ret = %d)\n",
+			priv->ca->name, priv->port, ret);
+		return ret;
+	}
+
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+		ret = ib_register_event_handler(&priv->event_handler);
+		if (ret)
+			pr_warn("%s: failed to rereg port %d (ret = %d)\n",
+				priv->ca->name, priv->port, ret);
+	}
+
+	/* if the device was up bring it up again */
+	if (flags & IFF_UP) {
+		ret = dev_open(dev);
+		if (ret)
+			pr_warn("%s: failed to reopen port %d (ret = %d)\n",
+				priv->ca->name, priv->port, ret);
+	}
+	return ret;
 }
 
 static const struct header_ops ipoib_header_ops = {
@@ -1608,6 +1690,10 @@ void ipoib_setup(struct net_device *dev)
 
 	mutex_init(&priv->vlan_mutex);
 
+	init_rwsem(&priv->rings_rwsem);
+	/* read access to rings is disabled */
+	down_write(&priv->rings_rwsem);
+
 	INIT_LIST_HEAD(&priv->path_list);
 	INIT_LIST_HEAD(&priv->child_intfs);
 	INIT_LIST_HEAD(&priv->dead_ahs);
@@ -1629,8 +1715,12 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 {
 	struct net_device *dev;
 
-	/* Use correct ops (ndo_select_queue) pass to ipoib_setup */
-	if (template_priv->num_tx_queues > 1) {
+	/* Use correct ops (ndo_select_queue) passed to ipoib_setup.
+	 * A child interface starts with the same number of queues as the
+	 * parent. Even if the parent currently has only one ring, the MQ
+	 * potential must be reserved.
+	 */
+	if (template_priv->max_tx_queues > 1) {
 		if (template_priv->hca_caps & IB_DEVICE_UD_TSS)
 			ipoib_netdev_ops = &ipoib_netdev_ops_hw_tss;
 		else
@@ -1641,8 +1731,8 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 
 	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
 			   ipoib_setup,
-			   template_priv->num_tx_queues,
-			   template_priv->num_rx_queues);
+			   template_priv->max_tx_queues,
+			   template_priv->max_rx_queues);
 	if (!dev)
 		return NULL;
 
@@ -1776,6 +1866,8 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 		/* No additional QP, only one QP for RX & TX */
 		priv->rss_qp_num = 0;
 		priv->tss_qp_num = 0;
+		priv->max_rx_queues = 1;
+		priv->max_tx_queues = 1;
 		priv->num_rx_queues = 1;
 		priv->num_tx_queues = 1;
 		kfree(device_attr);
@@ -1788,22 +1880,25 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 		max_rss_tbl_sz = min(num_cores, max_rss_tbl_sz);
 		max_rss_tbl_sz = rounddown_pow_of_two(max_rss_tbl_sz);
 		priv->rss_qp_num    = max_rss_tbl_sz;
-		priv->num_rx_queues = max_rss_tbl_sz;
+		priv->max_rx_queues = max_rss_tbl_sz;
 	} else {
 		/* No additional QP, only the parent QP for RX */
 		priv->rss_qp_num = 0;
-		priv->num_rx_queues = 1;
+		priv->max_rx_queues = 1;
 	}
+	priv->num_rx_queues = priv->max_rx_queues;
 
 	kfree(device_attr);
 
 	priv->tss_qp_num = num_cores;
 	if (priv->hca_caps & IB_DEVICE_UD_TSS)
 		/* TSS is supported by HW */
-		priv->num_tx_queues = priv->tss_qp_num;
+		priv->max_tx_queues = priv->tss_qp_num;
 	else
 		/* If TSS is not supported by HW use the parent QP for ARP */
-		priv->num_tx_queues = priv->tss_qp_num + 1;
+		priv->max_tx_queues = priv->tss_qp_num + 1;
+
+	priv->num_tx_queues = priv->max_tx_queues;
 
 	return 0;
 }
-- 
1.7.1


* RE: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
       [not found]     ` <1362676288-19906-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-03-09 14:04       ` Marciniszyn, Mike
       [not found]         ` <32E1700B9017364D9B60AED9960492BC0D5C7875-AtyAts71sc88Ug9VwtkbtrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Marciniszyn, Mike @ 2013-03-09 14:04 UTC (permalink / raw)
  To: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

This patch will conflict with http://marc.info/?l=linux-rdma&m=136190765729001&w=2.

Mike
> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Or Gerlitz
> Sent: Thursday, March 07, 2013 12:11 PM
> To: roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Shlomo Pongratz
> Subject: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device

* Re: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
       [not found]         ` <32E1700B9017364D9B60AED9960492BC0D5C7875-AtyAts71sc88Ug9VwtkbtrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-03-14 20:51           ` Or Gerlitz
       [not found]             ` <CAJZOPZ+xHqWT6vrbpYoakM_h=BBYsxq-CaXSNSHchvQ1wu66kQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2013-03-24 11:12           ` Shlomo Pongratz
  1 sibling, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-03-14 20:51 UTC (permalink / raw)
  To: Marciniszyn, Mike
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

Marciniszyn, Mike <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> This patch will conflict with http://marc.info/?l=linux-rdma&m=136190765729001&w=2


What sort of conflict? Is that on lines that are moving, or something
deeper in the proposed design?

Or.

* RE: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
       [not found]             ` <CAJZOPZ+xHqWT6vrbpYoakM_h=BBYsxq-CaXSNSHchvQ1wu66kQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-14 20:53               ` Marciniszyn, Mike
  0 siblings, 0 replies; 20+ messages in thread
From: Marciniszyn, Mike @ 2013-03-14 20:53 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz

The CM side ib_req_notify_cq() is altered in my patch.

The same behavior needs to be adopted.

Mike

> -----Original Message-----
> From: Or Gerlitz [mailto:or.gerlitz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org]
> Sent: Thursday, March 14, 2013 4:52 PM
> To: Marciniszyn, Mike
> Cc: Or Gerlitz; roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Shlomo
> Pongratz
> Subject: Re: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
> 
> Marciniszyn, Mike <mike.marciniszyn-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
> 
> > This patch will conflict with
> > http://marc.info/?l=linux-rdma&m=136190765729001&w=2
> 
> 
> What sort of conflict? Is that on lines that are moving, or something deeper in
> the proposed design?
> 
> Or.

* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-03-07 17:11   ` [PATCH V3 for-next 5/5] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
@ 2013-03-18 19:14   ` Or Gerlitz
       [not found]     ` <CAJZOPZJ_runtaQnj+3n03FniBR83AeRD+Lh_2tn_1XZ0F2wKYg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  5 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-03-18 19:14 UTC (permalink / raw)
  To: Sean Hefty
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, Mar 7, 2013 at 7:11 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:

> Here's V3 of the IPoIB TSS/RSS patch series, basically its very similar to V2,
> with fix to for one issue we stepped over while testing V2 and addressing of
> feedback provided by Sean on the QP groups concept.

Hi Sean,

Re your feedback on V2, do you feel the concept has been fleshed out
deeply enough now?


Or.

* RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]     ` <CAJZOPZJ_runtaQnj+3n03FniBR83AeRD+Lh_2tn_1XZ0F2wKYg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-03-19 18:57       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A823736F366357-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Hefty, Sean @ 2013-03-19 18:57 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA

> > Here's V3 of the IPoIB TSS/RSS patch series, basically its very similar to
> V2,
> > with fix to for one issue we stepped over while testing V2 and addressing of
> > feedback provided by Sean on the QP groups concept.
> 
> Hi Sean,
> 
> Re your feedback on V2, do you feel the concept has been fleshed out
> deeply enough now?

I have not had a chance to look at v3 yet.


* Re: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
       [not found]         ` <32E1700B9017364D9B60AED9960492BC0D5C7875-AtyAts71sc88Ug9VwtkbtrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-03-14 20:51           ` Or Gerlitz
@ 2013-03-24 11:12           ` Shlomo Pongratz
  1 sibling, 0 replies; 20+ messages in thread
From: Shlomo Pongratz @ 2013-03-24 11:12 UTC (permalink / raw)
  To: Marciniszyn, Mike
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 3/9/2013 4:04 PM, Marciniszyn, Mike wrote:
> This patch will conflict with http://marc.info/?l=linux-rdma&m=136190765729001&w=2.
>
> Mike
>> -----Original Message-----
>> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
>> owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Or Gerlitz
>> Sent: Thursday, March 07, 2013 12:11 PM
>> To: roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org
>> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Shlomo Pongratz
>> Subject: [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device
Hi Mike,

You didn't mention it, but you also changed the order of the calls to 
"netif_stop_queue" and "ib_req_notify_cq", placing "netif_stop_queue" 
before the call to "ib_req_notify_cq".

IMO you've solved a theoretical bug in which the handler might be 
called and finish before the call to "netif_stop_queue", which would 
leave the queue stopped forever.
I guess the same reordering should be done in "ipoib_ib.c::ipoib_send".
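
A minimal sketch of the ordering being discussed, assuming the ring-full
path of the current single-queue ipoib_send(); names follow the existing
driver but the snippet is illustrative only:

    /* Stop the queue before re-arming the CQ.  If the CQ is armed first,
     * the completion handler can run, drain the ring and try to wake a
     * queue that is only stopped afterwards, leaving it stopped with no
     * further completion to wake it.
     */
    if (++priv->tx_outstanding == ipoib_sendq_size) {
            ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
            netif_stop_queue(dev);
            if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
                    ipoib_warn(priv, "request notify on send CQ failed\n");
    }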

Thanks.

S.P.


* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]         ` <1828884A29C6694DAF28B7E6B8A823736F366357-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-03-24 12:44           ` Or Gerlitz
       [not found]             ` <514EF53F.3000200-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-03-24 12:44 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 19/03/2013 20:57, Hefty, Sean wrote:
> I have not had a chance to look at v3 yet.

Would love it if you do so... we've just posted V4, which is a respin of V3
over an ipoib change.

Or.

* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]             ` <514EF53F.3000200-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-04-01 21:50               ` Or Gerlitz
       [not found]                 ` <CAJZOPZLbj+YxbELMRh9TioWptHG88Qz2VfzGTsreB+PFTdkNPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-04-01 21:50 UTC (permalink / raw)
  To: Sean Hefty
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shlomo Pongratz, Tzahi Oved

On Sun, Mar 24, 2013 at 2:44 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> On 19/03/2013 20:57, Hefty, Sean wrote:
>> I have not had a chance to look at v3 yet.

> Would love it if you do so... we've just posted V4, which is a respin of V3
> over an ipoib change.

Hi Sean, we posted the TSS/RSS V0 patches in May 2012 and have since
attempted to address all the feedback / questions you provided.
Could you comment on how you see things w.r.t. these patches,
specifically the QP groups concept, on which you had raised some
concerns that we believe were addressed in V3?

thanks,

Or.

* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                 ` <CAJZOPZLbj+YxbELMRh9TioWptHG88Qz2VfzGTsreB+PFTdkNPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-04-03 19:45                   ` Or Gerlitz
       [not found]                     ` <CAJZOPZJ3G3weqAmaTytVAgQTvfiSOjgnZ_ROk4osRv8fxuRWwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-04-03 19:45 UTC (permalink / raw)
  To: Sean Hefty
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shlomo Pongratz, Tzahi Oved

On Tue, Apr 2, 2013 at 12:50 AM, Or Gerlitz <or.gerlitz-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Sun, Mar 24, 2013 at 2:44 PM, Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>> On 19/03/2013 20:57, Hefty, Sean wrote:
>>> I have not had a chance to look at v3 yet.

>> Would love it if you do so... we've just posted V4, which is a respin of V3
>> over an ipoib change.

> Hi Sean, we posted the TSS/RSS V0 patches in May 2012 and have since
> attempted to address all the feedback / questions you provided.
> Could you comment on how you see things w.r.t. these patches,
> specifically the QP groups concept, on which you had raised some
> concerns that we believe were addressed in V3?

Hi Sean, ping. You had concerns about the suggested concept; we want to
know whether we addressed them. Can you comment?

Or.

* RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                     ` <CAJZOPZJ3G3weqAmaTytVAgQTvfiSOjgnZ_ROk4osRv8fxuRWwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-04-03 20:12                       ` Hefty, Sean
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A823736F36B547-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Hefty, Sean @ 2013-04-03 20:12 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shlomo Pongratz, Tzahi Oved

> Hi Sean, ping. You had concerns about the suggested concept; we want to
> know whether we addressed them. Can you comment?

I'm in meetings this week until tomorrow.  I'll try to take a look at the updated patches then or Friday.

* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A823736F36B547-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-04-03 20:14                           ` Or Gerlitz
  2013-04-09 14:07                           ` Or Gerlitz
  1 sibling, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-04-03 20:14 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	Shlomo Pongratz, Tzahi Oved

On Wed, Apr 3, 2013 at 11:12 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:
>> Hi Sean, ping. You had concerns about the suggested concept; we want to
>> know whether we addressed them. Can you comment?

> I'm in meetings this week until tomorrow.  I'll try to take a look at the updated patches then or Friday.

OK, thanks. The 3.10 merge window is coming closer and I want to know
where we stand in that respect. Almost every Ethernet NIC in use has
RSS; there's no reason for IPoIB not to support it too.

Or.

* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                         ` <1828884A29C6694DAF28B7E6B8A823736F36B547-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-04-03 20:14                           ` Or Gerlitz
@ 2013-04-09 14:07                           ` Or Gerlitz
       [not found]                             ` <5164209D.1060101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  1 sibling, 1 reply; 20+ messages in thread
From: Or Gerlitz @ 2013-04-09 14:07 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz, Tzahi Oved

On 03/04/2013 23:12, Hefty, Sean wrote:
>> Hi Sean, ping. You had concerns about the suggested concept; we want to
>> know whether we addressed them. Can you comment?
> I'm in meetings this week until tomorrow.  I'll try to take a look at the updated patches then or Friday.
>

any feedback?

* RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                             ` <5164209D.1060101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-04-09 17:06                               ` Hefty, Sean
       [not found]                                 ` <1828884A29C6694DAF28B7E6B8A823736F36D0E1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 20+ messages in thread
From: Hefty, Sean @ 2013-04-09 17:06 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz, Tzahi Oved

> any feedback?

I have no issue with RSS/TSS.  But the 'qp group' interface to using this seems kludgy.

On a node, this is multiple send/receive queues grouped together to form a larger construct.  On the wire, this is a single QP - maybe?  I'm still not clear on that.  From what's written, all the send queues appear as a single QPN.  The receive queues appear as different QPNs.


* Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support
       [not found]                                 ` <1828884A29C6694DAF28B7E6B8A823736F36D0E1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-04-09 20:41                                   ` Or Gerlitz
  0 siblings, 0 replies; 20+ messages in thread
From: Or Gerlitz @ 2013-04-09 20:41 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, Shlomo Pongratz, Tzahi Oved

On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> I have no issue with RSS/TSS.  But the 'qp group' interface to using this seems kludgy.

OK, so let's take it over the patch that has the QP group description.

> On a node, this is multiple send/receive queues grouped together to form a larger
> construct.  On the wire, this is a single QP - maybe?  I'm still not clear on that.  From
> what's written, all the send queues appear as a single QPN.  The receive queues
> appear as different QPNs.

end of thread, other threads:[~2013-04-09 20:41 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-07 17:11 [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support Or Gerlitz
     [not found] ` <1362676288-19906-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-03-07 17:11   ` [PATCH V3 for-next 1/5] IB/core: Add RSS and TSS QP groups Or Gerlitz
2013-03-07 17:11   ` [PATCH V3 for-next 2/5] IB/mlx4: Add support for " Or Gerlitz
2013-03-07 17:11   ` [PATCH V3 for-next 3/5] IB/ipoib: Move to multi-queue device Or Gerlitz
     [not found]     ` <1362676288-19906-4-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-03-09 14:04       ` Marciniszyn, Mike
     [not found]         ` <32E1700B9017364D9B60AED9960492BC0D5C7875-AtyAts71sc88Ug9VwtkbtrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-03-14 20:51           ` Or Gerlitz
     [not found]             ` <CAJZOPZ+xHqWT6vrbpYoakM_h=BBYsxq-CaXSNSHchvQ1wu66kQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-14 20:53               ` Marciniszyn, Mike
2013-03-24 11:12           ` Shlomo Pongratz
2013-03-07 17:11   ` [PATCH V3 for-next 4/5] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
2013-03-07 17:11   ` [PATCH V3 for-next 5/5] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
2013-03-18 19:14   ` [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support Or Gerlitz
     [not found]     ` <CAJZOPZJ_runtaQnj+3n03FniBR83AeRD+Lh_2tn_1XZ0F2wKYg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-03-19 18:57       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A823736F366357-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-03-24 12:44           ` Or Gerlitz
     [not found]             ` <514EF53F.3000200-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-04-01 21:50               ` Or Gerlitz
     [not found]                 ` <CAJZOPZLbj+YxbELMRh9TioWptHG88Qz2VfzGTsreB+PFTdkNPA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-03 19:45                   ` Or Gerlitz
     [not found]                     ` <CAJZOPZJ3G3weqAmaTytVAgQTvfiSOjgnZ_ROk4osRv8fxuRWwA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-04-03 20:12                       ` Hefty, Sean
     [not found]                         ` <1828884A29C6694DAF28B7E6B8A823736F36B547-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-04-03 20:14                           ` Or Gerlitz
2013-04-09 14:07                           ` Or Gerlitz
     [not found]                             ` <5164209D.1060101-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-04-09 17:06                               ` Hefty, Sean
     [not found]                                 ` <1828884A29C6694DAF28B7E6B8A823736F36D0E1-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-04-09 20:41                                   ` Or Gerlitz
