* [PATCH V2 for-next 0/6] IB/IPoIB: Add multi-queue TSS and RSS support
@ 2013-02-05 15:48 Or Gerlitz
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Or Gerlitz, Shlomo Pongratz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Here's V2 of the IPoIB TSS/RSS patch series. It is basically very similar to V1,
with fixes for issues we stepped over while testing since the V1 submission, plus
an additional capability to change the number of TX and RX rings at run time via
ethtool.

The concept of QP groups for TSS/RSS was introduced at the 2012 OFA conference;
you can take a look at slides 10-14 of the user mode ethernet session. The author
didn't use the terms RSS/TSS, but that's the intention... see

https://openfabrics.org/resources/document-downloads/presentations/cat_view/57-ofa-documents/23-presentations/81-openfabrics-international-workshops/104-2012-ofa-international-workshop/107-2012-ofa-intl-workshop-wednesday.html 

V1 http://marc.info/?l=linux-rdma&m=133881081520248&w=2
V0 http://marc.info/?l=linux-rdma&m=133649429821312&w=2

V2 changes:

 - added pre-patch correcting the ipoib_neigh hash function

 - ported to infiniband tree / for-next branch 

 - following commit b63b70d877 "IPoIB: Use a private hash table for path lookup in xmit path" 
   from kernel 3.6, the TX select queue logic for UD neighbours was changed to be based on 
   "full" hashing a la skb_tx_hash that covers L4 too, whereas in V1 the queue selection 
   was done at the neighbour level. This means that different sessions (TCP/UDP five-tuples)
   will map to different TX rings subject to hashing (a rough sketch of this selection
   appears after this list).

 - for CM neighbours, the queue selection uses the destination IPoIB HW addr as the base 
   for hashing. Previously each ipoib_neigh was assigned a running index upon creation
   and that neighbour was accessed during select queue. Now, we want to issue only 
   ONE ipoib_neigh lookup in the xmit path and do that in start_xmit.

 - added patch #6, which allows the number of TX and RX rings to be changed at runtime 
   by supporting the ethtool directives to get/set the number of channels. Code which is
   common to device cleanup and device reinit was moved from "ipoib_dev_cleanup" to
   "ipoib_dev_uninit".
       
 - CM TX completions are spread among the CQs (for NAPI) using a hash of the destination 
   IPoIB HW address.

 - use netif_tx bh locking in ipoib_cm_handle_tx_wc and drain_tx_cq. Also, in 
   drain_tx_cq revert from subqueue locking to full locking; this was done since 
   __netif_tx_lock doesn't set __QUEUE_STATE_FROZEN_BIT.

 - handle the rare case where the device CM "state" ipoib_cm_admin_enabled() status 
   changes between the time select queue was done and the time the transmit routine 
   is called.

 - fixed a race in the CM RX drain/reap logic caused by the change to multiple 
   rings; added a detailed comment in ipoib_cm_start_rx_drain to explain the fix.

 - changed the CM code that posts receive buffers (both SRQ and non-SRQ flows) to use
   per-ring WR and SGE objects, since buffer re-fill may now happen from different
   NAPI contexts.
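
To illustrate the select queue logic described above, here is a rough sketch; this
is not the actual driver code, the function name and parameters are illustrative
only:

#include <linux/jhash.h>
#include <linux/netdevice.h>
#include <linux/if_infiniband.h>

/*
 * Rough sketch of the TX queue selection described above. UD neighbours
 * rely on an skb_tx_hash()-style hash over the L3/L4 headers, so
 * different TCP/UDP five-tuples may land on different TX rings; CM
 * neighbours hash the destination IPoIB HW address so that a connection
 * sticks to one ring (and hence one CQ/NAPI context).
 */
static u16 ipoib_select_queue_sketch(struct net_device *dev,
				     struct sk_buff *skb,
				     const u8 *daddr, bool is_cm)
{
	if (is_cm)
		return jhash(daddr, INFINIBAND_ALEN, 0) %
		       dev->real_num_tx_queues;

	return skb_tx_hash(dev, skb);
}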

V1 changes:

 - removed accepted patches, the first three on the V0 series
 - fixed crash in the driver EQ teardown flow - merged by commit 3aac6ff "IB/mlx4: Fix EQ deallocation in legacy mode"
 - removed wrong setting done in the ehca driver in ehca_create_srq
 - fixed user space QP creation to specify QPG_NONE
 - fixed usage of wrong API for netif queues stopping in patch 3/4 (V0 6/7)
 - fixed use-after-free of device attr pointer in patch 4/4 (V0 7/7)

* Add support for RSS and TSS for UD.
        The number of RSS and TSS queues is a function of the number
        of cores and the HW capability.

* Utilize multi-core CPUs and the NIC's multi queuing in order to increase
        throughput. This utilizes a new "QP Group" concept. A QP group is
        a set of QPs consisting of a parent QP and two disjoint subsets of
        RSS and TSS QPs.

* If RSS is supported by HW then the number of RSS queues is the smallest
        power of two greater than or equal to the number of cores.
        Otherwise the number is one.

* If TSS is supported by HW then the number of TSS queues is the smallest
        power of two greater than or equal to the number of cores.
        Otherwise the number is the smallest power of two greater than or
        equal to the number of cores, plus one (see the sketch after this
        list).

* Transmission and receiving in CM mode uses a send and receive queue
        assigned to each CM instance at creation time.

* Advertise that packets sent from a set of QPs will be received. That is,
        a received packet with a source QPN different from the QPN
        advertised with ARP will be accepted.

* The advertising is done by setting a third bit in the flags part
        of the link layer address. This is similar to RFC 4755
        section 3.1 (CM advertisement).

* If TSS is not supported by HW then transmission of multicast packets
        is done using device queue N and thus the parent QP, which is
        also the advertised QP.

* If TSS is not supported by HW then SW TSS is used, provided the peer
        advertised that it will accept TSS packets.

* Drivers can now use a larger portion of the device vectors/IRQs.
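
A minimal sketch of the ring-count rules above (illustrative names only, not
the driver's actual code):

#include <linux/log2.h>
#include <linux/types.h>

/* Sketch only: derive RX/TX ring counts from the rules listed above. */
static void ipoib_ring_counts_sketch(unsigned int cores, bool hw_rss,
				     bool hw_tss, unsigned int *num_rx,
				     unsigned int *num_tx)
{
	unsigned int n = roundup_pow_of_two(cores);

	*num_rx = hw_rss ? n : 1;
	/* without HW TSS, one extra queue (N) maps to the parent QP */
	*num_tx = hw_tss ? n : n + 1;
}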

Shlomo Pongratz (6):
  IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
  IB/core: Add RSS and TSS QP groups
  IB/mlx4: Add support for RSS and TSS QP groups
  IB/ipoib: Move to multi-queue device
  IB/ipoib: Add RSS and TSS support for datagram mode
  IB/ipoib: Support changing the number of RX/TX rings with ethtool

 drivers/infiniband/core/uverbs_cmd.c           |    1 +
 drivers/infiniband/core/verbs.c                |    3 +
 drivers/infiniband/hw/amso1100/c2_provider.c   |    3 +
 drivers/infiniband/hw/cxgb3/iwch_provider.c    |    2 +
 drivers/infiniband/hw/cxgb4/qp.c               |    3 +
 drivers/infiniband/hw/ehca/ehca_qp.c           |    3 +
 drivers/infiniband/hw/ipath/ipath_qp.c         |    3 +
 drivers/infiniband/hw/mlx4/main.c              |    5 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h           |   13 +
 drivers/infiniband/hw/mlx4/qp.c                |  344 ++++++++++++-
 drivers/infiniband/hw/mthca/mthca_provider.c   |    3 +
 drivers/infiniband/hw/nes/nes_verbs.c          |    3 +
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c    |    5 +
 drivers/infiniband/hw/qib/qib_qp.c             |    5 +
 drivers/infiniband/ulp/ipoib/ipoib.h           |  118 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  206 +++++---
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |  160 ++++++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  550 ++++++++++++++------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  493 ++++++++++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   35 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  662 +++++++++++++++++++++---
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 include/rdma/ib_verbs.h                        |   26 +-
 23 files changed, 2227 insertions(+), 421 deletions(-)


Cc: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>


* [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-02-05 15:48   ` Or Gerlitz
       [not found]     ` <1360079337-8173-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-02-05 15:48   ` [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups Or Gerlitz
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The hash function introduced in commit b63b70d877 "IPoIB: Use a private hash
table for path lookup in xmit path" was designed to use the 3 octets of the
IPoIB HW address that hold the remote QPN. However, this currently isn't
the case on little endian machines, as the code there uses the flags part
(octet[0]) and not the last octet of the QPN (octet[3]); fix that.

The fix caused a checkpatch warning about a line over 80 characters; to
solve that, the name of the temp variable that holds the daddr was changed.
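
For reference, the 20 byte IPoIB HW address is laid out as

   octet:  [0]      [1..3]       [4..19]
           flags    remote QPN   port GID

daddr_32[0] loads octets 0-3 in network byte order. On a little endian CPU
the constant 0xFFFFFF masks the three low bytes of that word, i.e. octets
0-2 (the flags plus only two QPN octets); on big endian it masks octets 1-3,
the full QPN. cpu_to_be32(0xFFFFFF) selects octets 1-3 on both, which is
what the hunk below does.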

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6fdc9e7..e459fa7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -844,10 +844,10 @@ static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
 	 * different subnets.
 	 */
 	 /* qpn octets[1:4) & port GUID octets[12:20) */
-	u32 *daddr_32 = (u32 *) daddr;
+	u32 *d32 = (u32 *)daddr;
 	u32 hv;
 
-	hv = jhash_3words(daddr_32[3], daddr_32[4], 0xFFFFFF & daddr_32[0], 0);
+	hv = jhash_3words(d32[3], d32[4], cpu_to_be32(0xFFFFFF) & d32[0], 0);
 	return hv & htbl->mask;
 }
 
-- 
1.7.1


* [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-02-05 15:48   ` [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets Or Gerlitz
@ 2013-02-05 15:48   ` Or Gerlitz
       [not found]     ` <1360079337-8173-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-02-05 15:48   ` [PATCH V2 for-next 3/6] IB/mlx4: Add support for " Or Gerlitz
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

RSS (Receive Side Scaling) and TSS (Transmit Side Scaling, better known as
MQ/Multi-Queue) are common networking techniques which make use of
contemporary NICs that support multiple receive and transmit descriptor
queues (multi-queue); see also Documentation/networking/scaling.txt

This patch introduces the concept of RSS and TSS QP groups, which allows
them to be implemented by low level drivers and used by IPoIB and, later,
also by user space ULPs.

A QP group is a set of QPs consisting of a parent QP and two disjoint sets
of RSS and TSS QPs. The creation of a QP group is a two stage process (an
illustrative usage sketch appears below, after the list of new attributes):

In the 1st stage, the parent QP is created.

In the 2nd stage, the children QPs of the parent are created.

Each child QP indicates whether it is an RSS or a TSS QP. Both the TSS
and RSS sets of QPs should have contiguous QP numbers.

A few new elements/concepts are introduced to support this:

Three new device capabilities that can be set by the low level driver:

- IB_DEVICE_QPG which is set to indicate QP groups are supported.

- IB_DEVICE_UD_RSS which is set to indicate that the device supports
RSS, that is applying hash function on incoming TCP/UDP/IP packets and
dispatching them to multiple "rings" (child QPs).

- IB_DEVICE_UD_TSS which is set to indicate that the device supports
"HW TSS", which means that the HW is capable of overriding the source
UD QPN present in the sent IB datagram header (DETH) with the parent's QPN.

Low level drivers that do not support HW TSS can still support QP groups;
such a combination is referred to as "SW TSS". In this case, the low level
driver fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from
ib_create_qp, so that this mask can be used to retrieve the parent QPN from
incoming packets carrying a child QPN (relying on the contiguous QP numbers
requirement).
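
As a minimal illustration (a sketch, not code from this series), a receiver
holding that mask could recover the advertised parent QPN like this:

/*
 * Sketch only: the child QPNs are contiguous and the parent sits at the
 * (aligned) base of the range, so clearing the low qpg_tss_mask_sz bits
 * of a child QPN yields the parent QPN that was advertised over ARP.
 */
static u32 sw_tss_parent_qpn(u32 src_qpn, u32 qpg_tss_mask_sz)
{
	return src_qpn & ~((1U << qpg_tss_mask_sz) - 1);
}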

- max rss table size device attribute, which is the maximal size of the RSS
indirection table supported by the device

- qp group type attribute for qp creation, saying whether this is a parent QP,
an rx/tx (rss/tss) child QP, or none of the above for non rss/tss QPs.

- per qp group type, another attribute is added: for parent QPs, the number
of rx/tx child QPs, and for child QPs, a pointer to the parent.

- IB_QP_GROUP_RSS attribute mask, which should be used when modifying
the parent QP state from reset to init.
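
For illustration, a kernel ULP could use the new attributes roughly as
follows; this is only a sketch against the API added below, the CQ/queue
sizing and child counts are arbitrary and error handling is omitted:

#include <linux/string.h>
#include <rdma/ib_verbs.h>

/* Sketch: create a UD QP group with 4 TSS and 4 RSS children. */
static struct ib_qp *create_qp_group_sketch(struct ib_pd *pd,
					    struct ib_cq *scq,
					    struct ib_cq *rcq)
{
	struct ib_qp_init_attr attr;
	struct ib_qp *parent, *tx_child;

	memset(&attr, 0, sizeof(attr));
	attr.qp_type          = IB_QPT_UD;
	attr.sq_sig_type      = IB_SIGNAL_ALL_WR;
	attr.send_cq          = scq;
	attr.recv_cq          = rcq;
	attr.cap.max_send_wr  = 64;
	attr.cap.max_recv_wr  = 64;
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/* 1st stage: the parent QP of the group */
	attr.qpg_type = IB_QPG_PARENT;
	attr.parent_attrib.tss_child_count = 4;
	attr.parent_attrib.rss_child_count = 4;
	parent = ib_create_qp(pd, &attr);

	/* 2nd stage: a TSS child pointing back at its parent; RSS
	 * children are created the same way with IB_QPG_CHILD_RX */
	attr.qpg_type   = IB_QPG_CHILD_TX;
	attr.qpg_parent = parent;
	tx_child = ib_create_qp(pd, &attr);
	(void)tx_child;

	return parent;
}

When the parent is later moved from RESET to INIT, IB_QP_GROUP_RSS is set
in the attr_mask so that the driver can program the RSS indirection.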

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/core/uverbs_cmd.c         |    1 +
 drivers/infiniband/core/verbs.c              |    3 +++
 drivers/infiniband/hw/amso1100/c2_provider.c |    3 +++
 drivers/infiniband/hw/cxgb3/iwch_provider.c  |    2 ++
 drivers/infiniband/hw/cxgb4/qp.c             |    3 +++
 drivers/infiniband/hw/ehca/ehca_qp.c         |    3 +++
 drivers/infiniband/hw/ipath/ipath_qp.c       |    3 +++
 drivers/infiniband/hw/mlx4/qp.c              |    3 +++
 drivers/infiniband/hw/mthca/mthca_provider.c |    3 +++
 drivers/infiniband/hw/nes/nes_verbs.c        |    3 +++
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |    5 +++++
 drivers/infiniband/hw/qib/qib_qp.c           |    5 +++++
 include/rdma/ib_verbs.h                      |   26 +++++++++++++++++++++++++-
 13 files changed, 62 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/core/uverbs_cmd.c b/drivers/infiniband/core/uverbs_cmd.c
index 0cb0007..d31199f 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1461,6 +1461,7 @@ ssize_t ib_uverbs_create_qp(struct ib_uverbs_file *file,
 	attr.sq_sig_type   = cmd.sq_sig_all ? IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
 	attr.qp_type       = cmd.qp_type;
 	attr.create_flags  = 0;
+	attr.qpg_type	   = IB_QPG_NONE;
 
 	attr.cap.max_send_wr     = cmd.max_send_wr;
 	attr.cap.max_recv_wr     = cmd.max_recv_wr;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 30f199e..bbe0e5f 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -496,6 +496,9 @@ static const struct {
 						IB_QP_QKEY),
 				[IB_QPT_GSI] = (IB_QP_PKEY_INDEX		|
 						IB_QP_QKEY),
+			},
+			.opt_param = {
+				[IB_QPT_UD]  = IB_QP_GROUP_RSS
 			}
 		},
 	},
diff --git a/drivers/infiniband/hw/amso1100/c2_provider.c b/drivers/infiniband/hw/amso1100/c2_provider.c
index 07eb3a8..546760b 100644
--- a/drivers/infiniband/hw/amso1100/c2_provider.c
+++ b/drivers/infiniband/hw/amso1100/c2_provider.c
@@ -241,6 +241,9 @@ static struct ib_qp *c2_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 		qp = kzalloc(sizeof(*qp), GFP_KERNEL);
diff --git a/drivers/infiniband/hw/cxgb3/iwch_provider.c b/drivers/infiniband/hw/cxgb3/iwch_provider.c
index 0bdf09a..49850f6 100644
--- a/drivers/infiniband/hw/cxgb3/iwch_provider.c
+++ b/drivers/infiniband/hw/cxgb3/iwch_provider.c
@@ -902,6 +902,8 @@ static struct ib_qp *iwch_create_qp(struct ib_pd *pd,
 	PDBG("%s ib_pd %p\n", __func__, pd);
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
 	php = to_iwch_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct iwch_cq *) attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 05bfe53..167387c 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1488,6 +1488,9 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	if (attrs->qp_type != IB_QPT_RC)
 		return ERR_PTR(-EINVAL);
 
+	if (attrs->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	schp = get_chp(rhp, ((struct c4iw_cq *)attrs->send_cq)->cq.cqid);
diff --git a/drivers/infiniband/hw/ehca/ehca_qp.c b/drivers/infiniband/hw/ehca/ehca_qp.c
index 1493939..2df7584 100644
--- a/drivers/infiniband/hw/ehca/ehca_qp.c
+++ b/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -464,6 +464,9 @@ static struct ehca_qp *internal_create_qp(
 	int is_llqp = 0, has_srq = 0, is_user = 0;
 	int qp_type, max_send_sge, max_recv_sge, ret;
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	/* h_call's out parameters */
 	struct ehca_alloc_qp_parms parms;
 	u32 swqe_size = 0, rwqe_size = 0, ib_qp_num;
diff --git a/drivers/infiniband/hw/ipath/ipath_qp.c b/drivers/infiniband/hw/ipath/ipath_qp.c
index 0857a9c..117b775 100644
--- a/drivers/infiniband/hw/ipath/ipath_qp.c
+++ b/drivers/infiniband/hw/ipath/ipath_qp.c
@@ -755,6 +755,9 @@ struct ib_qp *ipath_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	if (init_attr->cap.max_send_sge > ib_ipath_max_sges ||
 	    init_attr->cap.max_send_wr > ib_ipath_max_qp_wrs) {
 		ret = ERR_PTR(-EINVAL);
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 19e0637..917b111 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -997,6 +997,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	      init_attr->qp_type > IB_QPT_GSI)))
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
 		pd = to_mxrcd(init_attr->xrcd)->pd;
diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index 5b71d43..120aa1e 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -518,6 +518,9 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	switch (init_attr->qp_type) {
 	case IB_QPT_RC:
 	case IB_QPT_UC:
diff --git a/drivers/infiniband/hw/nes/nes_verbs.c b/drivers/infiniband/hw/nes/nes_verbs.c
index 07e4fba..fe7de14 100644
--- a/drivers/infiniband/hw/nes/nes_verbs.c
+++ b/drivers/infiniband/hw/nes/nes_verbs.c
@@ -1131,6 +1131,9 @@ static struct ib_qp *nes_create_qp(struct ib_pd *ibpd,
 	if (init_attr->create_flags)
 		return ERR_PTR(-EINVAL);
 
+	if (init_attr->qpg_type != IB_QPG_NONE)
+		return ERR_PTR(-ENOSYS);
+
 	atomic_inc(&qps_created);
 	switch (init_attr->qp_type) {
 		case IB_QPT_RC:
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index b29a424..7c3e0ce 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -841,6 +841,11 @@ static int ocrdma_check_qp_params(struct ib_pd *ibpd, struct ocrdma_dev *dev,
 			   __func__, dev->id, attrs->qp_type);
 		return -EINVAL;
 	}
+	if (attrs->qpg_type != IB_QPG_NONE) {
+		ocrdma_err("%s(%d) unsupported qpg type=0x%x requested\n",
+			   __func__, dev->id, attrs->qpg_type);
+			   return -ENOSYS;
+	}
 	if (attrs->cap.max_send_wr > dev->attr.max_wqe) {
 		ocrdma_err("%s(%d) unsupported send_wr=0x%x requested\n",
 			   __func__, dev->id, attrs->cap.max_send_wr);
diff --git a/drivers/infiniband/hw/qib/qib_qp.c b/drivers/infiniband/hw/qib/qib_qp.c
index 3527509..0e6f64d 100644
--- a/drivers/infiniband/hw/qib/qib_qp.c
+++ b/drivers/infiniband/hw/qib/qib_qp.c
@@ -985,6 +985,11 @@ struct ib_qp *qib_create_qp(struct ib_pd *ibpd,
 		goto bail;
 	}
 
+	if (init_attr->qpg_type != IB_QPG_NONE) {
+		ret = ERR_PTR(-ENOSYS);
+		goto bail;
+	}
+
 	/* Check receive queue parameters if no SRQ is specified. */
 	if (!init_attr->srq) {
 		if (init_attr->cap.max_recv_sge > ib_qib_max_sges ||
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 46bc045..c4c1dc7 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -115,6 +115,9 @@ enum ib_device_cap_flags {
 	IB_DEVICE_XRC			= (1<<20),
 	IB_DEVICE_MEM_MGT_EXTENSIONS	= (1<<21),
 	IB_DEVICE_BLOCK_MULTICAST_LOOPBACK = (1<<22),
+	IB_DEVICE_QPG			= (1<<23),
+	IB_DEVICE_UD_RSS		= (1<<24),
+	IB_DEVICE_UD_TSS		= (1<<25)
 };
 
 enum ib_atomic_cap {
@@ -162,6 +165,7 @@ struct ib_device_attr {
 	int			max_srq_wr;
 	int			max_srq_sge;
 	unsigned int		max_fast_reg_page_list_len;
+	int			max_rss_tbl_sz;
 	u16			max_pkeys;
 	u8			local_ca_ack_delay;
 };
@@ -584,6 +588,7 @@ struct ib_qp_cap {
 	u32	max_send_sge;
 	u32	max_recv_sge;
 	u32	max_inline_data;
+	u32	qpg_tss_mask_sz;
 };
 
 enum ib_sig_type {
@@ -619,6 +624,18 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_RESERVED_END		= 1 << 31,
 };
 
+enum ib_qpg_type {
+	IB_QPG_NONE	= 0,
+	IB_QPG_PARENT	= (1<<0),
+	IB_QPG_CHILD_RX = (1<<1),
+	IB_QPG_CHILD_TX = (1<<2)
+};
+
+struct ib_qpg_init_attrib {
+	u32 tss_child_count;
+	u32 rss_child_count;
+};
+
 struct ib_qp_init_attr {
 	void                  (*event_handler)(struct ib_event *, void *);
 	void		       *qp_context;
@@ -627,9 +644,14 @@ struct ib_qp_init_attr {
 	struct ib_srq	       *srq;
 	struct ib_xrcd	       *xrcd;     /* XRC TGT QPs only */
 	struct ib_qp_cap	cap;
+	union {
+		struct ib_qp *qpg_parent; /* see qpg_type */
+		struct ib_qpg_init_attrib parent_attrib;
+	};
 	enum ib_sig_type	sq_sig_type;
 	enum ib_qp_type		qp_type;
 	enum ib_qp_create_flags	create_flags;
+	enum ib_qpg_type	qpg_type;
 	u8			port_num; /* special QP types only */
 };
 
@@ -696,7 +718,8 @@ enum ib_qp_attr_mask {
 	IB_QP_MAX_DEST_RD_ATOMIC	= (1<<17),
 	IB_QP_PATH_MIG_STATE		= (1<<18),
 	IB_QP_CAP			= (1<<19),
-	IB_QP_DEST_QPN			= (1<<20)
+	IB_QP_DEST_QPN			= (1<<20),
+	IB_QP_GROUP_RSS			= (1<<21)
 };
 
 enum ib_qp_state {
@@ -975,6 +998,7 @@ struct ib_qp {
 	void		       *qp_context;
 	u32			qp_num;
 	enum ib_qp_type		qp_type;
+	enum ib_qpg_type	qpg_type;
 };
 
 struct ib_mr {
-- 
1.7.1


* [PATCH V2 for-next 3/6] IB/mlx4: Add support for RSS and TSS QP groups
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-02-05 15:48   ` [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets Or Gerlitz
  2013-02-05 15:48   ` [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups Or Gerlitz
@ 2013-02-05 15:48   ` Or Gerlitz
  2013-02-05 15:48   ` [PATCH V2 for-next 4/6] IB/ipoib: Move to multi-queue device Or Gerlitz
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

Depending on the mlx4 device capabilities, support the RSS IB device
capability, using the Toeplitz or XOR hash functions according to what is
available in the HW. Support creating QP groups where all RX and TX
QPs have contiguous QP numbers.
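
For reference on the encoding done in the modify QP flow below: the RSS
context's base_qpn field packs the log2 of the indirection table size into
its top byte, e.g. with rss_qpn_base = 0x400 and rss_child_count = 8 the
field is set to cpu_to_be32(0x400 | (3 << 24)).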

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/hw/mlx4/main.c    |    5 +
 drivers/infiniband/hw/mlx4/mlx4_ib.h |   13 ++
 drivers/infiniband/hw/mlx4/qp.c      |  345 ++++++++++++++++++++++++++++++++-
 3 files changed, 352 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index e7d81c0..6dea1f3 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -138,6 +138,11 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_XRC)
 		props->device_cap_flags |= IB_DEVICE_XRC;
 
+	props->device_cap_flags |= IB_DEVICE_QPG;
+	if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS) {
+		props->device_cap_flags |= IB_DEVICE_UD_RSS;
+		props->max_rss_tbl_sz = dev->dev->caps.max_rss_tbl_sz;
+	}
 	props->vendor_id	   = be32_to_cpup((__be32 *) (out_mad->data + 36)) &
 		0xffffff;
 	props->vendor_part_id	   = dev->dev->pdev->device;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index dcd845b..77539af 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -227,6 +227,17 @@ struct mlx4_ib_proxy_sqp_hdr {
 	struct mlx4_rcv_tunnel_hdr tun;
 }  __packed;
 
+struct mlx4_ib_qpg_data {
+	unsigned long *tss_bitmap;
+	unsigned long *rss_bitmap;
+	struct mlx4_ib_qp *qpg_parent;
+	int tss_qpn_base;
+	int rss_qpn_base;
+	u32 tss_child_count;
+	u32 rss_child_count;
+	u32 qpg_tss_mask_sz;
+};
+
 struct mlx4_ib_qp {
 	struct ib_qp		ibqp;
 	struct mlx4_qp		mqp;
@@ -256,6 +267,8 @@ struct mlx4_ib_qp {
 	u8			sq_no_prefetch;
 	u8			state;
 	int			mlx_type;
+	enum ib_qpg_type	qpg_type;
+	struct mlx4_ib_qpg_data *qpg_data;
 	struct list_head	gid_list;
 	struct list_head	steering_rules;
 	struct mlx4_ib_buf	*sqp_proxy_rcv;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 917b111..3661c61 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -34,6 +34,8 @@
 #include <linux/log2.h>
 #include <linux/slab.h>
 #include <linux/netdevice.h>
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_pack.h>
@@ -592,6 +594,241 @@ static int qp_has_rq(struct ib_qp_init_attr *attr)
 	return !attr->srq;
 }
 
+static int init_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp,
+			   struct ib_qp_init_attr *attr, int *qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data;
+	int tss_num, rss_num;
+	int tss_align_num, rss_align_num;
+	int tss_base, rss_base;
+	int err;
+
+	/* Parent is part of the TSS range (in SW TSS ARP is sent via parent) */
+	tss_num = 1 + attr->parent_attrib.tss_child_count;
+	tss_align_num = roundup_pow_of_two(tss_num);
+	rss_num = attr->parent_attrib.rss_child_count;
+	rss_align_num = roundup_pow_of_two(rss_num);
+
+	if (rss_num > 1) {
+		/* RSS is requested */
+		if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+			return -ENOSYS;
+		if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+			return -EINVAL;
+		/* We must work with power of two */
+		attr->parent_attrib.rss_child_count = rss_align_num;
+	}
+
+	qpg_data = kzalloc(sizeof(*qpg_data), GFP_KERNEL);
+	if (!qpg_data)
+		return -ENOMEM;
+
+	err = mlx4_qp_reserve_range(dev->dev, tss_align_num,
+				    tss_align_num, &tss_base);
+	if (err)
+		goto err1;
+
+	if (tss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(tss_align_num)  * sizeof(long);
+		qpg_data->tss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->tss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err2;
+		}
+		bitmap_fill(qpg_data->tss_bitmap, tss_num);
+		/* Note parent takes first index */
+		clear_bit(0, qpg_data->tss_bitmap);
+	}
+
+	if (rss_num > 1) {
+		u32 alloc = BITS_TO_LONGS(rss_align_num) * sizeof(long);
+		err = mlx4_qp_reserve_range(dev->dev, rss_align_num,
+					    rss_align_num, &rss_base);
+		if (err)
+			goto err3;
+		qpg_data->rss_bitmap = kzalloc(alloc, GFP_KERNEL);
+		if (qpg_data->rss_bitmap == NULL) {
+			err = -ENOMEM;
+			goto err4;
+		}
+		bitmap_fill(qpg_data->rss_bitmap, rss_align_num);
+	}
+
+	qpg_data->tss_child_count = attr->parent_attrib.tss_child_count;
+	qpg_data->rss_child_count = attr->parent_attrib.rss_child_count;
+	qpg_data->qpg_parent = pqp;
+	qpg_data->qpg_tss_mask_sz = ilog2(tss_align_num);
+	qpg_data->tss_qpn_base = tss_base;
+	qpg_data->rss_qpn_base = rss_base;
+
+	pqp->qpg_data = qpg_data;
+	*qpn = tss_base;
+
+	return 0;
+
+err4:
+	mlx4_qp_release_range(dev->dev, rss_base, rss_align_num);
+
+err3:
+	if (tss_num > 1)
+		kfree(qpg_data->tss_bitmap);
+
+err2:
+	mlx4_qp_release_range(dev->dev, tss_base, tss_align_num);
+
+err1:
+	kfree(qpg_data);
+	return err;
+}
+
+static void free_qpg_parent(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *pqp)
+{
+	struct mlx4_ib_qpg_data *qpg_data = pqp->qpg_data;
+	int align_num;
+
+	if (qpg_data->tss_child_count > 1)
+		kfree(qpg_data->tss_bitmap);
+
+	align_num = roundup_pow_of_two(1 + qpg_data->tss_child_count);
+	mlx4_qp_release_range(dev->dev, qpg_data->tss_qpn_base, align_num);
+
+	if (qpg_data->rss_child_count > 1) {
+		kfree(qpg_data->rss_bitmap);
+		align_num = roundup_pow_of_two(qpg_data->rss_child_count);
+		mlx4_qp_release_range(dev->dev, qpg_data->rss_qpn_base,
+				      align_num);
+	}
+
+	kfree(qpg_data);
+}
+
+static int alloc_qpg_qpn(struct ib_qp_init_attr *init_attr,
+			 struct mlx4_ib_qp *pqp, int *qpn)
+{
+	struct mlx4_ib_qp *mqp = to_mqp(init_attr->qpg_parent);
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+	u32 idx, old;
+
+	switch (init_attr->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		if (qpg_data->tss_child_count == 0)
+			return -EINVAL;
+		do {
+			/* Parent took index 0 */
+			idx = find_first_bit(qpg_data->tss_bitmap,
+					     qpg_data->tss_child_count + 1);
+			if (idx >= qpg_data->tss_child_count + 1)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->tss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->tss_qpn_base;
+		break;
+	case IB_QPG_CHILD_RX:
+		if (qpg_data->rss_child_count == 0)
+			return -EINVAL;
+		do {
+			idx = find_first_bit(qpg_data->rss_bitmap,
+					     qpg_data->rss_child_count);
+			if (idx >= qpg_data->rss_child_count)
+				return -ENOMEM;
+			old = test_and_clear_bit(idx, qpg_data->rss_bitmap);
+		} while (old == 0);
+		idx += qpg_data->rss_qpn_base;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	pqp->qpg_data = qpg_data;
+	*qpn = idx;
+
+	return 0;
+}
+
+static void free_qpg_qpn(struct mlx4_ib_qp *mqp, int qpn)
+{
+	struct mlx4_ib_qpg_data *qpg_data = mqp->qpg_data;
+
+	switch (mqp->qpg_type) {
+	case IB_QPG_CHILD_TX:
+		/* Do range check */
+		qpn -= qpg_data->tss_qpn_base;
+		set_bit(qpn, qpg_data->tss_bitmap);
+		break;
+	case IB_QPG_CHILD_RX:
+		qpn -= qpg_data->rss_qpn_base;
+		set_bit(qpn, qpg_data->rss_bitmap);
+		break;
+	default:
+		/* error */
+		pr_warn("wrong qpg type (%d)\n", mqp->qpg_type);
+		break;
+	}
+}
+
+static int alloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			    struct ib_qp_init_attr *attr, int *qpn)
+{
+	int err = 0;
+
+	switch (attr->qpg_type) {
+	case IB_QPG_NONE:
+		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
+		 * BlueFlame setup flow wrongly causes VLAN insertion. */
+		if (attr->qp_type == IB_QPT_RAW_PACKET)
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, qpn);
+		else
+			err = mlx4_qp_reserve_range(dev->dev, 1, 1, qpn);
+		break;
+	case IB_QPG_PARENT:
+		err = init_qpg_parent(dev, qp, attr, qpn);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		err = alloc_qpg_qpn(attr, qp, qpn);
+		break;
+	default:
+		qp->qpg_type = IB_QPG_NONE;
+		err = -EINVAL;
+		break;
+	}
+	if (err)
+		return err;
+	qp->qpg_type = attr->qpg_type;
+	return 0;
+}
+
+static void free_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			enum ib_qpg_type qpg_type, int qpn)
+{
+	switch (qpg_type) {
+	case IB_QPG_NONE:
+		mlx4_qp_release_range(dev->dev, qpn, 1);
+		break;
+	case IB_QPG_PARENT:
+		free_qpg_parent(dev, qp);
+		break;
+	case IB_QPG_CHILD_TX:
+	case IB_QPG_CHILD_RX:
+		free_qpg_qpn(qp, qpn);
+		break;
+	default:
+		break;
+	}
+}
+
+/* Revert allocation on create_qp_common */
+static void unalloc_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
+			       struct ib_qp_init_attr *attr, int qpn)
+{
+	free_qpn_common(dev, qp, attr->qpg_type, qpn);
+}
+
+static void release_qpn_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp)
+{
+	free_qpn_common(dev, qp, qp->qpg_type, qp->mqp.qpn);
+}
+
 static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			    struct ib_qp_init_attr *init_attr,
 			    struct ib_udata *udata, int sqpn, struct mlx4_ib_qp **caller_qp)
@@ -759,12 +996,7 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 			}
 		}
 	} else {
-		/* Raw packet QPNs must be aligned to 8 bits. If not, the WQE
-		 * BlueFlame setup flow wrongly causes VLAN insertion. */
-		if (init_attr->qp_type == IB_QPT_RAW_PACKET)
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1 << 8, &qpn);
-		else
-			err = mlx4_qp_reserve_range(dev->dev, 1, 1, &qpn);
+		err = alloc_qpn_common(dev, qp, init_attr, &qpn);
 		if (err)
 			goto err_proxy;
 	}
@@ -789,8 +1021,8 @@ static int create_qp_common(struct mlx4_ib_dev *dev, struct ib_pd *pd,
 	return 0;
 
 err_qpn:
-	if (!sqpn)
-		mlx4_qp_release_range(dev->dev, qpn, 1);
+	unalloc_qpn_common(dev, qp, init_attr, qpn);
+
 err_proxy:
 	if (qp->mlx4_ib_qp_type == MLX4_IB_QPT_PROXY_GSI)
 		free_proxy_bufs(pd->device, qp);
@@ -932,7 +1164,7 @@ static void destroy_qp_common(struct mlx4_ib_dev *dev, struct mlx4_ib_qp *qp,
 	mlx4_qp_free(dev->dev, &qp->mqp);
 
 	if (!is_sqp(dev, qp) && !is_tunnel_qp(dev, qp))
-		mlx4_qp_release_range(dev->dev, qp->mqp.qpn, 1);
+		release_qpn_common(dev, qp);
 
 	mlx4_mtt_cleanup(dev->dev, &qp->mtt);
 
@@ -972,6 +1204,52 @@ static u32 get_sqp_num(struct mlx4_ib_dev *dev, struct ib_qp_init_attr *attr)
 		return dev->dev->caps.qp1_proxy[attr->port_num - 1];
 }
 
+static int check_qpg_attr(struct mlx4_ib_dev *dev,
+			  struct ib_qp_init_attr *attr)
+{
+	if (attr->qpg_type == IB_QPG_NONE)
+		return 0;
+
+	if (attr->qp_type != IB_QPT_UD)
+		return -EINVAL;
+
+	if (attr->qpg_type == IB_QPG_PARENT) {
+		if (attr->parent_attrib.tss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if (attr->parent_attrib.rss_child_count == 1)
+			return -EINVAL; /* Doesn't make sense */
+		if ((attr->parent_attrib.tss_child_count == 0) &&
+		    (attr->parent_attrib.rss_child_count == 0))
+			/* Should be called with IP_QPG_NONE */
+			return -EINVAL;
+		if (attr->parent_attrib.rss_child_count > 1) {
+			int rss_align_num;
+			if (!(dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS))
+				return -ENOSYS;
+			rss_align_num = roundup_pow_of_two(
+					attr->parent_attrib.rss_child_count);
+			if (rss_align_num > dev->dev->caps.max_rss_tbl_sz)
+				return -EINVAL;
+		}
+	} else {
+		struct mlx4_ib_qpg_data *qpg_data;
+		if (attr->qpg_parent == NULL)
+			return -EINVAL;
+		if (IS_ERR(attr->qpg_parent))
+			return -EINVAL;
+		qpg_data = to_mqp(attr->qpg_parent)->qpg_data;
+		if (qpg_data == NULL)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_TX &&
+		    !qpg_data->tss_child_count)
+			return -EINVAL;
+		if (attr->qpg_type == IB_QPG_CHILD_RX &&
+		    !qpg_data->rss_child_count)
+			return -EINVAL;
+	}
+	return 0;
+}
+
 struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 				struct ib_qp_init_attr *init_attr,
 				struct ib_udata *udata)
@@ -997,8 +1275,9 @@ struct ib_qp *mlx4_ib_create_qp(struct ib_pd *pd,
 	      init_attr->qp_type > IB_QPT_GSI)))
 		return ERR_PTR(-EINVAL);
 
-	if (init_attr->qpg_type != IB_QPG_NONE)
-		return ERR_PTR(-ENOSYS);
+	err = check_qpg_attr(to_mdev(pd->device), init_attr);
+	if (err)
+		return ERR_PTR(err);
 
 	switch (init_attr->qp_type) {
 	case IB_QPT_XRC_TGT:
@@ -1469,6 +1748,43 @@ static int __mlx4_ib_modify_qp(struct ib_qp *ibqp,
 	if (!ibqp->uobject && cur_state == IB_QPS_RESET && new_state == IB_QPS_INIT)
 		context->rlkey |= (1 << 4);
 
+	if ((attr_mask & IB_QP_GROUP_RSS) &&
+	    (qp->qpg_data->rss_child_count > 1)) {
+		struct mlx4_ib_qpg_data *qpg_data = qp->qpg_data;
+		void *rss_context_base = &context->pri_path;
+		struct mlx4_rss_context *rss_context =
+			(struct mlx4_rss_context *)(rss_context_base
+					+ MLX4_RSS_OFFSET_IN_QPC_PRI_PATH);
+
+		context->flags |= cpu_to_be32(1 << MLX4_RSS_QPC_FLAG_OFFSET);
+
+		/* This should be tbl_sz_base_qpn */
+		rss_context->base_qpn = cpu_to_be32(qpg_data->rss_qpn_base |
+				(ilog2(qpg_data->rss_child_count) << 24));
+		rss_context->default_qpn = cpu_to_be32(qpg_data->rss_qpn_base);
+		/* This should be flags_hash_fn */
+		rss_context->flags = MLX4_RSS_TCP_IPV6 |
+				     MLX4_RSS_TCP_IPV4;
+		if (dev->dev->caps.flags & MLX4_DEV_CAP_FLAG_UDP_RSS) {
+			rss_context->base_qpn_udp = rss_context->default_qpn;
+			rss_context->flags |= MLX4_RSS_IPV6 |
+					MLX4_RSS_IPV4     |
+					MLX4_RSS_UDP_IPV6 |
+					MLX4_RSS_UDP_IPV4;
+		}
+		if (dev->dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_RSS_TOP) {
+			static const u32 rsskey[10] = { 0xD181C62C, 0xF7F4DB5B,
+				0x1983A2FC, 0x943E1ADB, 0xD9389E6B, 0xD1039C2C,
+				0xA74499AD, 0x593D56D9, 0xF3253C06, 0x2ADC1FFC};
+			rss_context->hash_fn = MLX4_RSS_HASH_TOP;
+			memcpy(rss_context->rss_key, rsskey,
+			       sizeof(rss_context->rss_key));
+		} else {
+			rss_context->hash_fn = MLX4_RSS_HASH_XOR;
+			memset(rss_context->rss_key, 0,
+			       sizeof(rss_context->rss_key));
+		}
+	}
 	/*
 	 * Before passing a kernel QP to the HW, make sure that the
 	 * ownership bits of the send queue are set and the SQ
@@ -2736,6 +3052,13 @@ done:
 		qp->sq_signal_bits == cpu_to_be32(MLX4_WQE_CTRL_CQ_UPDATE) ?
 		IB_SIGNAL_ALL_WR : IB_SIGNAL_REQ_WR;
 
+	qp_init_attr->qpg_type = ibqp->qpg_type;
+	if (ibqp->qpg_type == IB_QPG_PARENT)
+		qp_init_attr->cap.qpg_tss_mask_sz =
+			qp->qpg_data->qpg_tss_mask_sz;
+	else
+		qp_init_attr->cap.qpg_tss_mask_sz = 0;
+
 out:
 	mutex_unlock(&qp->mutex);
 	return err;
-- 
1.7.1


* [PATCH V2 for-next 4/6] IB/ipoib: Move to multi-queue device
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (2 preceding siblings ...)
  2013-02-05 15:48   ` [PATCH V2 for-next 3/6] IB/mlx4: Add support for " Or Gerlitz
@ 2013-02-05 15:48   ` Or Gerlitz
  2013-02-05 15:48   ` [PATCH V2 for-next 5/6] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
  2013-02-05 15:48   ` [PATCH V2 for-next 6/6] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
  5 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch is a restructuring step needed to implement RSS (Receive Side
Scaling) and TSS (multi-queue transmit) for IPoIB.

The following structures and flows are changed:

- Addition of struct ipoib_recv_ring and struct ipoib_send_ring which hold
the per RX / TX ring fields respectively. These are the per-ring counterparts
of the receive and send fields previously present in struct ipoib_dev_priv.

- Add per send/receive ring stats counters. These counters are accessible
through ethtool. Net device stats are no longer accumulated in the data path;
instead, ndo_get_stats is implemented (see the sketch at the end of this
description).

- Use the multi queue APIs for TX and RX: alloc_netdev_mqs, netif_xxx_subqueue,
netif_subqueue_yyy; use a per TX queue timer and a NAPI instance per RX queue.

- Put a work request structure and scatter/gather list in the RX ring
structure for the CM code to use, and remove them from ipoib_cm_dev_priv.

Since this patch is an intermediate step, the number of RX and TX rings is
fixed to one, where the single TX ring and RX ring QP/CQs are currently taken
from the "priv" structure.

The Address Handles Garbage Collection mechanism was changed such that the
data path uses a ref count (inc on post send, dec on send completion), and the
AH GC thread code tests for a zero value of the ref count instead of comparing
tx_head to last_send. Some change was a must here, since the SAME AH can be
used by multiple TX rings, as the skb hashing can possibly map the same IPoIB
daddr to multiple TX rings in parallel (it uses L3/L4 headers).
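
As a minimal sketch of the per-ring stats accumulation mentioned above (the
structure and field names follow this patch; the function body itself is
only illustrative):

static struct net_device_stats *ipoib_get_stats_sketch(struct net_device *dev)
{
	struct ipoib_dev_priv *priv = netdev_priv(dev);
	struct net_device_stats *stats = &dev->stats;
	unsigned long rx_packets = 0, rx_bytes = 0;
	unsigned long tx_packets = 0, tx_bytes = 0;
	unsigned int i;

	/* fold the per-ring counters into the netdev stats on demand */
	for (i = 0; i < priv->num_rx_queues; i++) {
		rx_packets += priv->recv_ring[i].stats.rx_packets;
		rx_bytes   += priv->recv_ring[i].stats.rx_bytes;
	}
	for (i = 0; i < priv->num_tx_queues; i++) {
		tx_packets += priv->send_ring[i].stats.tx_packets;
		tx_bytes   += priv->send_ring[i].stats.tx_bytes;
	}

	stats->rx_packets = rx_packets;
	stats->rx_bytes   = rx_bytes;
	stats->tx_packets = tx_packets;
	stats->tx_bytes   = tx_bytes;

	return stats;
}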

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h           |  102 ++++--
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  206 ++++++----
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c   |   92 ++++-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  538 +++++++++++++++++-------
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  231 +++++++++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   35 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |   63 ++-
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    2 +-
 8 files changed, 938 insertions(+), 331 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 07ca6fd..cf5fdd9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -158,6 +158,7 @@ struct ipoib_rx_buf {
 
 struct ipoib_tx_buf {
 	struct sk_buff *skb;
+	struct ipoib_ah *ah;
 	u64		mapping[MAX_SKB_FRAGS + 1];
 };
 
@@ -215,6 +216,7 @@ struct ipoib_cm_rx {
 	unsigned long		jiffies;
 	enum ipoib_cm_state	state;
 	int			recv_count;
+	int index; /* For ring counters */
 };
 
 struct ipoib_cm_tx {
@@ -254,11 +256,10 @@ struct ipoib_cm_dev_priv {
 	struct list_head	start_list;
 	struct list_head	reap_list;
 	struct ib_wc		ibwc[IPOIB_NUM_WC];
-	struct ib_sge		rx_sge[IPOIB_CM_RX_SG];
-	struct ib_recv_wr       rx_wr;
 	int			nonsrq_conn_qp;
 	int			max_cm_mtu;
 	int			num_frags;
+	u32			rx_cq_ind;
 };
 
 struct ipoib_ethtool_st {
@@ -284,6 +285,65 @@ struct ipoib_neigh_table {
 };
 
 /*
+ * Per QP stats
+ */
+
+struct ipoib_tx_ring_stats {
+	unsigned long tx_packets;
+	unsigned long tx_bytes;
+	unsigned long tx_errors;
+	unsigned long tx_dropped;
+};
+
+struct ipoib_rx_ring_stats {
+	unsigned long rx_packets;
+	unsigned long rx_bytes;
+	unsigned long rx_errors;
+	unsigned long rx_dropped;
+};
+
+/*
+ * Encapsulates the per send QP information
+ */
+struct ipoib_send_ring {
+	struct net_device	*dev;
+	struct ib_cq		*send_cq;
+	struct ib_qp		*send_qp;
+	struct ipoib_tx_buf	*tx_ring;
+	unsigned		tx_head;
+	unsigned		tx_tail;
+	struct ib_sge		tx_sge[MAX_SKB_FRAGS + 1];
+	struct ib_send_wr	tx_wr;
+	unsigned		tx_outstanding;
+	struct ib_wc		tx_wc[MAX_SEND_CQE];
+	struct timer_list	poll_timer;
+	struct ipoib_tx_ring_stats stats;
+	unsigned		index;
+};
+
+struct ipoib_rx_cm_info {
+	struct ib_sge		rx_sge[IPOIB_CM_RX_SG];
+	struct ib_recv_wr       rx_wr;
+};
+
+/*
+ * Encapsulates the per recv QP information
+ */
+struct ipoib_recv_ring {
+	struct net_device	*dev;
+	struct ib_qp		*recv_qp;
+	struct ib_cq		*recv_cq;
+	struct ib_wc		ibwc[IPOIB_NUM_WC];
+	struct napi_struct	napi;
+	struct ipoib_rx_buf	*rx_ring;
+	struct ib_recv_wr	rx_wr;
+	struct ib_sge		rx_sge[IPOIB_UD_RX_SG];
+	struct ipoib_rx_cm_info	cm;
+	struct ipoib_rx_ring_stats stats;
+	unsigned		index;
+};
+
+/*
  * Device private locking: network stack tx_lock protects members used
  * in TX fast path, lock protects everything else.  lock nests inside
  * of tx_lock (ie tx_lock must be acquired first if needed).
@@ -293,8 +353,6 @@ struct ipoib_dev_priv {
 
 	struct net_device *dev;
 
-	struct napi_struct napi;
-
 	unsigned long flags;
 
 	struct mutex vlan_mutex;
@@ -335,21 +393,6 @@ struct ipoib_dev_priv {
 	unsigned int mcast_mtu;
 	unsigned int max_ib_mtu;
 
-	struct ipoib_rx_buf *rx_ring;
-
-	struct ipoib_tx_buf *tx_ring;
-	unsigned	     tx_head;
-	unsigned	     tx_tail;
-	struct ib_sge	     tx_sge[MAX_SKB_FRAGS + 1];
-	struct ib_send_wr    tx_wr;
-	unsigned	     tx_outstanding;
-	struct ib_wc	     send_wc[MAX_SEND_CQE];
-
-	struct ib_recv_wr    rx_wr;
-	struct ib_sge	     rx_sge[IPOIB_UD_RX_SG];
-
-	struct ib_wc ibwc[IPOIB_NUM_WC];
-
 	struct list_head dead_ahs;
 
 	struct ib_event_handler event_handler;
@@ -371,6 +414,10 @@ struct ipoib_dev_priv {
 	int	hca_caps;
 	struct ipoib_ethtool_st ethtool;
 	struct timer_list poll_timer;
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	unsigned int num_rx_queues;
+	unsigned int num_tx_queues;
 };
 
 struct ipoib_ah {
@@ -378,7 +425,7 @@ struct ipoib_ah {
 	struct ib_ah	  *ah;
 	struct list_head   list;
 	struct kref	   ref;
-	unsigned	   last_send;
+	atomic_t	   refcnt;
 };
 
 struct ipoib_path {
@@ -440,8 +487,8 @@ extern struct workqueue_struct *ipoib_workqueue;
 /* functions */
 
 int ipoib_poll(struct napi_struct *napi, int budget);
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr);
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr);
+void ipoib_ib_completion(struct ib_cq *cq, void *recv_ring_ptr);
+void ipoib_send_comp_handler(struct ib_cq *cq, void *send_ring_ptr);
 
 struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 				 struct ib_pd *pd, struct ib_ah_attr *attr);
@@ -460,7 +507,8 @@ void ipoib_reap_ah(struct work_struct *work);
 
 void ipoib_mark_paths_invalid(struct net_device *dev);
 void ipoib_flush_paths(struct net_device *dev);
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *format);
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *format,
+					struct ipoib_dev_priv *temp_priv);
 
 int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_ib_dev_flush_light(struct work_struct *work);
@@ -598,7 +646,9 @@ struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path
 void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
 void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 			   unsigned int mtu);
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
+void ipoib_cm_handle_rx_wc(struct net_device *dev,
+			   struct ipoib_recv_ring *recv_ring,
+			   struct ib_wc *wc);
 void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc);
 #else
 
@@ -696,7 +746,9 @@ static inline void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff
 	dev_kfree_skb_any(skb);
 }
 
-static inline void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static inline void ipoib_cm_handle_rx_wc(struct net_device *dev,
+					 struct ipoib_recv_ring *recv_ring,
+					 struct ib_wc *wc)
 {
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 67b0c1d..5bbc404 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
 #include <linux/moduleparam.h>
+#include <linux/jhash.h>
 
 #include "ipoib.h"
 
@@ -88,18 +89,24 @@ static void ipoib_cm_dma_unmap_rx(struct ipoib_dev_priv *priv, int frags,
 		ib_dma_unmap_page(priv->ca, mapping[i + 1], PAGE_SIZE, DMA_FROM_DEVICE);
 }
 
-static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
+static int ipoib_cm_post_receive_srq(struct net_device *dev,
+				     struct ipoib_recv_ring *recv_ring, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
-	priv->cm.rx_wr.wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
+	sge = recv_ring->cm.rx_sge;
+	wr = &recv_ring->cm.rx_wr;
+
+	wr->wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
 	for (i = 0; i < priv->cm.num_frags; ++i)
-		priv->cm.rx_sge[i].addr = priv->cm.srq_ring[id].mapping[i];
+		sge[i].addr = priv->cm.srq_ring[id].mapping[i];
 
-	ret = ib_post_srq_recv(priv->cm.srq, &priv->cm.rx_wr, &bad_wr);
+	ret = ib_post_srq_recv(priv->cm.srq, wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "post srq failed for buf %d (%d)\n", id, ret);
 		ipoib_cm_dma_unmap_rx(priv, priv->cm.num_frags - 1,
@@ -112,14 +119,18 @@ static int ipoib_cm_post_receive_srq(struct net_device *dev, int id)
 }
 
 static int ipoib_cm_post_receive_nonsrq(struct net_device *dev,
-					struct ipoib_cm_rx *rx,
-					struct ib_recv_wr *wr,
-					struct ib_sge *sge, int id)
+					struct ipoib_cm_rx *rx, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring = priv->recv_ring + rx->index;
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
 	struct ib_recv_wr *bad_wr;
 	int i, ret;
 
+	sge = recv_ring->cm.rx_sge;
+	wr = &recv_ring->cm.rx_wr;
+
 	wr->wr_id = id | IPOIB_OP_CM | IPOIB_OP_RECV;
 
 	for (i = 0; i < IPOIB_CM_RX_SG; ++i)
@@ -225,7 +236,15 @@ static void ipoib_cm_start_rx_drain(struct ipoib_dev_priv *priv)
 	if (ib_post_send(p->qp, &ipoib_cm_rx_drain_wr, &bad_wr))
 		ipoib_warn(priv, "failed to post drain wr\n");
 
-	list_splice_init(&priv->cm.rx_flush_list, &priv->cm.rx_drain_list);
+	/*
+	 * Under the multi ring scheme, different CM QPs are bounded to
+	 * different CQs and hence to diferent NAPI contextes. With that in
+	 * mind, we must make sure that the NAPI context that invokes the reap
+	 * (deletion) of a certain QP is the same context that handles the
+	 * normal RX WC handling. To achieve that, move only one QP at a time to
+	 * the drain list, this will enforce posting the drain WR on each QP.
+	 */
+	list_move(&p->list, &priv->cm.rx_drain_list);
 }
 
 static void ipoib_cm_rx_event_handler(struct ib_event *event, void *ctx)
@@ -250,8 +269,6 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
 		.event_handler = ipoib_cm_rx_event_handler,
-		.send_cq = priv->recv_cq, /* For drain WR */
-		.recv_cq = priv->recv_cq,
 		.srq = priv->cm.srq,
 		.cap.max_send_wr = 1, /* For drain WR */
 		.cap.max_send_sge = 1, /* FIXME: 0 Seems not to work */
@@ -259,12 +276,23 @@ static struct ib_qp *ipoib_cm_create_rx_qp(struct net_device *dev,
 		.qp_type = IB_QPT_RC,
 		.qp_context = p,
 	};
+	int index;
 
 	if (!ipoib_cm_has_srq(dev)) {
 		attr.cap.max_recv_wr  = ipoib_recvq_size;
 		attr.cap.max_recv_sge = IPOIB_CM_RX_SG;
 	}
 
+	index = priv->cm.rx_cq_ind;
+	if (index >= priv->num_rx_queues)
+		index = 0;
+
+	priv->cm.rx_cq_ind = index + 1;
+	/* send_cp for drain WR */
+	attr.recv_cq = priv->recv_ring[index].recv_cq;
+	attr.send_cq = attr.recv_cq;
+	p->index = index;
+
 	return ib_create_qp(priv->pd, &attr);
 }
 
@@ -323,33 +351,34 @@ static int ipoib_cm_modify_rx_qp(struct net_device *dev,
 	return 0;
 }
 
-static void ipoib_cm_init_rx_wr(struct net_device *dev,
-				struct ib_recv_wr *wr,
-				struct ib_sge *sge)
+static void ipoib_cm_init_rx_wr(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int i;
-
-	for (i = 0; i < priv->cm.num_frags; ++i)
-		sge[i].lkey = priv->mr->lkey;
-
-	sge[0].length = IPOIB_CM_HEAD_SIZE;
-	for (i = 1; i < priv->cm.num_frags; ++i)
-		sge[i].length = PAGE_SIZE;
-
-	wr->next    = NULL;
-	wr->sg_list = sge;
-	wr->num_sge = priv->cm.num_frags;
+	struct ipoib_recv_ring *recv_ring = priv->recv_ring;
+	struct ib_sge *sge;
+	struct ib_recv_wr *wr;
+	int i, j;
+
+	for (j = 0; j < priv->num_rx_queues; j++, recv_ring++) {
+		sge = recv_ring->cm.rx_sge;
+		wr = &recv_ring->cm.rx_wr;
+		for (i = 0; i < priv->cm.num_frags; ++i)
+			sge[i].lkey = priv->mr->lkey;
+
+		sge[0].length = IPOIB_CM_HEAD_SIZE;
+		for (i = 1; i < priv->cm.num_frags; ++i)
+			sge[i].length = PAGE_SIZE;
+
+		wr->next    = NULL;
+		wr->sg_list = sge;
+		wr->num_sge = priv->cm.num_frags;
+	}
 }
 
 static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_id,
 				   struct ipoib_cm_rx *rx)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	struct {
-		struct ib_recv_wr wr;
-		struct ib_sge sge[IPOIB_CM_RX_SG];
-	} *t;
 	int ret;
 	int i;
 
@@ -360,14 +389,6 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 		return -ENOMEM;
 	}
 
-	t = kmalloc(sizeof *t, GFP_KERNEL);
-	if (!t) {
-		ret = -ENOMEM;
-		goto err_free;
-	}
-
-	ipoib_cm_init_rx_wr(dev, &t->wr, t->sge);
-
 	spin_lock_irq(&priv->lock);
 
 	if (priv->cm.nonsrq_conn_qp >= ipoib_max_conn_qp) {
@@ -387,7 +408,7 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 				ret = -ENOMEM;
 				goto err_count;
 		}
-		ret = ipoib_cm_post_receive_nonsrq(dev, rx, &t->wr, t->sge, i);
+		ret = ipoib_cm_post_receive_nonsrq(dev, rx, i);
 		if (ret) {
 			ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq "
 				   "failed for buf %d\n", i);
@@ -398,8 +419,6 @@ static int ipoib_cm_nonsrq_init_rx(struct net_device *dev, struct ib_cm_id *cm_i
 
 	rx->recv_count = ipoib_recvq_size;
 
-	kfree(t);
-
 	return 0;
 
 err_count:
@@ -408,7 +427,6 @@ err_count:
 	spin_unlock_irq(&priv->lock);
 
 err_free:
-	kfree(t);
 	ipoib_cm_free_rx_ring(dev, rx->rx_ring);
 
 	return ret;
@@ -553,7 +571,9 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 	}
 }
 
-void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+void ipoib_cm_handle_rx_wc(struct net_device *dev,
+			   struct ipoib_recv_ring *recv_ring,
+			   struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_cm_rx_buf *rx_ring;
@@ -593,7 +613,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		ipoib_dbg(priv, "cm recv error "
 			   "(status=%d, wrid=%d vend_err %x)\n",
 			   wc->status, wr_id, wc->vendor_err);
-		++dev->stats.rx_dropped;
+		++recv_ring->stats.rx_dropped;
 		if (has_srq)
 			goto repost;
 		else {
@@ -646,7 +666,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		 * this packet and reuse the old buffer.
 		 */
 		ipoib_dbg(priv, "failed to allocate receive buffer %d\n", wr_id);
-		++dev->stats.rx_dropped;
+		++recv_ring->stats.rx_dropped;
 		goto repost;
 	}
 
@@ -663,8 +683,8 @@ copied:
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++recv_ring->stats.rx_packets;
+	recv_ring->stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	/* XXX get correct PACKET_ type here */
@@ -673,13 +693,13 @@ copied:
 
 repost:
 	if (has_srq) {
-		if (unlikely(ipoib_cm_post_receive_srq(dev, wr_id)))
+		if (unlikely(ipoib_cm_post_receive_srq(dev,
+						       recv_ring,
+						       wr_id)))
 			ipoib_warn(priv, "ipoib_cm_post_receive_srq failed "
 				   "for buf %d\n", wr_id);
 	} else {
 		if (unlikely(ipoib_cm_post_receive_nonsrq(dev, p,
-							  &priv->cm.rx_wr,
-							  priv->cm.rx_sge,
 							  wr_id))) {
 			--p->recv_count;
 			ipoib_warn(priv, "ipoib_cm_post_receive_nonsrq failed "
@@ -691,17 +711,18 @@ repost:
 static inline int post_send(struct ipoib_dev_priv *priv,
 			    struct ipoib_cm_tx *tx,
 			    unsigned int wr_id,
-			    u64 addr, int len)
+			    u64 addr, int len,
+				struct ipoib_send_ring *send_ring)
 {
 	struct ib_send_wr *bad_wr;
 
-	priv->tx_sge[0].addr          = addr;
-	priv->tx_sge[0].length        = len;
+	send_ring->tx_sge[0].addr          = addr;
+	send_ring->tx_sge[0].length        = len;
 
-	priv->tx_wr.num_sge	= 1;
-	priv->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
+	send_ring->tx_wr.num_sge	= 1;
+	send_ring->tx_wr.wr_id	= wr_id | IPOIB_OP_CM;
 
-	return ib_post_send(tx->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(tx->qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_tx *tx)
@@ -710,12 +731,17 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	struct ipoib_cm_tx_buf *tx_req;
 	u64 addr;
 	int rc;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
+
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (unlikely(skb->len > tx->mtu)) {
 		ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 			   skb->len, tx->mtu);
-		++dev->stats.tx_dropped;
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_dropped;
+		++send_ring->stats.tx_errors;
 		ipoib_cm_skb_too_long(dev, skb, tx->mtu - IPOIB_ENCAP_LEN);
 		return;
 	}
@@ -734,7 +760,7 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	tx_req->skb = skb;
 	addr = ib_dma_map_single(priv->ca, skb->data, skb->len, DMA_TO_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, addr))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
@@ -745,22 +771,23 @@ void ipoib_cm_send(struct net_device *dev, struct sk_buff *skb, struct ipoib_cm_
 	skb_dst_drop(skb);
 
 	rc = post_send(priv, tx, tx->tx_head & (ipoib_sendq_size - 1),
-		       addr, skb->len);
+		       addr, skb->len, send_ring);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		ib_dma_unmap_single(priv->ca, addr, skb->len, DMA_TO_DEVICE);
 		dev_kfree_skb_any(skb);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 		++tx->tx_head;
 
-		if (++priv->tx_outstanding == ipoib_sendq_size) {
+		if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 			ipoib_dbg(priv, "TX ring 0x%x full, stopping kernel net queue\n",
 				  tx->qp->qp_num);
-			if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+			if (ib_req_notify_cq(send_ring->send_cq,
+					     IB_CQ_NEXT_COMP))
 				ipoib_warn(priv, "request notify on send CQ failed\n");
-			netif_stop_queue(dev);
+			netif_stop_subqueue(dev, queue_index);
 		}
 	}
 }
@@ -772,6 +799,8 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_CM;
 	struct ipoib_cm_tx_buf *tx_req;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 
 	ipoib_dbg_data(priv, "cm send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -783,22 +812,24 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 	}
 
 	tx_req = &tx->tx_ring[wr_id];
+	queue_index = skb_get_queue_mapping(tx_req->skb);
+	send_ring = priv->send_ring + queue_index;
 
 	ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len, DMA_TO_DEVICE);
 
 	/* FIXME: is this right? Shouldn't we only increment on success? */
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
-	netif_tx_lock(dev);
+	netif_tx_lock_bh(dev);
 
 	++tx->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    __netif_subqueue_stopped(dev, queue_index) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+		netif_wake_subqueue(dev, queue_index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
@@ -829,7 +860,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		spin_unlock_irqrestore(&priv->lock, flags);
 	}
 
-	netif_tx_unlock(dev);
+	netif_tx_unlock_bh(dev);
 }
 
 int ipoib_cm_dev_open(struct net_device *dev)
@@ -1017,8 +1048,6 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr attr = {
-		.send_cq		= priv->recv_cq,
-		.recv_cq		= priv->recv_cq,
 		.srq			= priv->cm.srq,
 		.cap.max_send_wr	= ipoib_sendq_size,
 		.cap.max_send_sge	= 1,
@@ -1026,6 +1055,21 @@ static struct ib_qp *ipoib_cm_create_tx_qp(struct net_device *dev, struct ipoib_
 		.qp_type		= IB_QPT_RC,
 		.qp_context		= tx
 	};
+	u32 index;
+
+	/* CM uses ipoib_ib_completion for TX completion, which uses the RX
+	 * NAPI mechanism. Spread contexts among the RX CQs by address hash.
+	 */
+	if (priv->num_rx_queues > 1) {
+		u32 *daddr_32 = (u32 *)tx->neigh->daddr;
+		u32 hv = jhash_1word(*daddr_32 & cpu_to_be32(0xFFFFFF), 0);
+		index = hv % priv->num_rx_queues;
+	} else {
+		index = 0;
+	}
+
+	attr.recv_cq = priv->recv_ring[index].recv_cq;
+	attr.send_cq = attr.recv_cq;
 
 	return ib_create_qp(priv->pd, &attr);
 }
@@ -1178,16 +1222,21 @@ static void ipoib_cm_tx_destroy(struct ipoib_cm_tx *p)
 timeout:
 
 	while ((int) p->tx_tail - (int) p->tx_head < 0) {
+		struct ipoib_send_ring *send_ring;
+		u16 queue_index;
 		tx_req = &p->tx_ring[p->tx_tail & (ipoib_sendq_size - 1)];
 		ib_dma_unmap_single(priv->ca, tx_req->mapping, tx_req->skb->len,
 				    DMA_TO_DEVICE);
 		dev_kfree_skb_any(tx_req->skb);
 		++p->tx_tail;
+		queue_index = skb_get_queue_mapping(tx_req->skb);
+		send_ring = priv->send_ring + queue_index;
 		netif_tx_lock_bh(p->dev);
-		if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-		    netif_queue_stopped(p->dev) &&
+		if (unlikely(--send_ring->tx_outstanding ==
+				(ipoib_sendq_size >> 1)) &&
+		    __netif_subqueue_stopped(p->dev, queue_index) &&
 		    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			netif_wake_queue(p->dev);
+			netif_wake_subqueue(p->dev, queue_index);
 		netif_tx_unlock_bh(p->dev);
 	}
 
@@ -1549,7 +1598,7 @@ int ipoib_cm_dev_init(struct net_device *dev)
 		priv->cm.num_frags  = IPOIB_CM_RX_SG;
 	}
 
-	ipoib_cm_init_rx_wr(dev, &priv->cm.rx_wr, priv->cm.rx_sge);
+	ipoib_cm_init_rx_wr(dev);
 
 	if (ipoib_cm_has_srq(dev)) {
 		for (i = 0; i < ipoib_recvq_size; ++i) {
@@ -1562,7 +1611,8 @@ int ipoib_cm_dev_init(struct net_device *dev)
 				return -ENOMEM;
 			}
 
-			if (ipoib_cm_post_receive_srq(dev, i)) {
+			if (ipoib_cm_post_receive_srq(dev, priv->recv_ring,
+						      i)) {
 				ipoib_warn(priv, "ipoib_cm_post_receive_srq "
 					   "failed for buf %d\n", i);
 				ipoib_cm_dev_cleanup(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index 29bc7b5..f2cc283 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -57,7 +57,8 @@ static int ipoib_set_coalesce(struct net_device *dev,
 			      struct ethtool_coalesce *coal)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
-	int ret;
+	int ret, i;
+
 
 	/*
 	 * These values are saved in the private data and returned
@@ -67,23 +68,100 @@ static int ipoib_set_coalesce(struct net_device *dev,
 	    coal->rx_max_coalesced_frames > 0xffff)
 		return -EINVAL;
 
-	ret = ib_modify_cq(priv->recv_cq, coal->rx_max_coalesced_frames,
-			   coal->rx_coalesce_usecs);
-	if (ret && ret != -ENOSYS) {
-		ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
-		return ret;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ret = ib_modify_cq(priv->recv_ring[i].recv_cq,
+					coal->rx_max_coalesced_frames,
+					coal->rx_coalesce_usecs);
+		if (ret && ret != -ENOSYS) {
+			ipoib_warn(priv, "failed modifying CQ (%d)\n", ret);
+			return ret;
+		}
 	}
-
 	priv->ethtool.coalesce_usecs       = coal->rx_coalesce_usecs;
 	priv->ethtool.max_coalesced_frames = coal->rx_max_coalesced_frames;
 
 	return 0;
 }
 
+static void ipoib_get_strings(struct net_device *dev, u32 stringset, u8 *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i, index = 0;
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		for (i = 0; i < priv->num_rx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"rx%d_dropped", i);
+		}
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_packets", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_bytes", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_errors", i);
+			sprintf(data + (index++) * ETH_GSTRING_LEN,
+				"tx%d_dropped", i);
+		}
+		break;
+	}
+}
+
+static int ipoib_get_sset_count(struct net_device *dev, int sset)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	switch (sset) {
+	case ETH_SS_STATS:
+		return (priv->num_rx_queues + priv->num_tx_queues) * 4;
+	default:
+		return -EOPNOTSUPP;
+	}
+}
+
+static void ipoib_get_ethtool_stats(struct net_device *dev,
+				struct ethtool_stats *stats, uint64_t *data)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	int index = 0;
+	int i;
+
+	/* Get per QP stats */
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rx_stats = &recv_ring->stats;
+		data[index++] = rx_stats->rx_packets;
+		data[index++] = rx_stats->rx_bytes;
+		data[index++] = rx_stats->rx_errors;
+		data[index++] = rx_stats->rx_dropped;
+		recv_ring++;
+	}
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tx_stats = &send_ring->stats;
+		data[index++] = tx_stats->tx_packets;
+		data[index++] = tx_stats->tx_bytes;
+		data[index++] = tx_stats->tx_errors;
+		data[index++] = tx_stats->tx_dropped;
+		send_ring++;
+	}
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo		= ipoib_get_drvinfo,
 	.get_coalesce		= ipoib_get_coalesce,
 	.set_coalesce		= ipoib_set_coalesce,
+	.get_strings		= ipoib_get_strings,
+	.get_sset_count		= ipoib_get_sset_count,
+	.get_ethtool_stats	= ipoib_get_ethtool_stats,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 2cfa76f..4871dc9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -64,7 +64,6 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		return ERR_PTR(-ENOMEM);
 
 	ah->dev       = dev;
-	ah->last_send = 0;
 	kref_init(&ah->ref);
 
 	vah = ib_create_ah(pd, attr);
@@ -72,6 +71,7 @@ struct ipoib_ah *ipoib_create_ah(struct net_device *dev,
 		kfree(ah);
 		ah = (struct ipoib_ah *)vah;
 	} else {
+		atomic_set(&ah->refcnt, 0);
 		ah->ah = vah;
 		ipoib_dbg(netdev_priv(dev), "Created ah %p\n", ah->ah);
 	}
@@ -129,29 +129,32 @@ static void ipoib_ud_skb_put_frags(struct ipoib_dev_priv *priv,
 
 }
 
-static int ipoib_ib_post_receive(struct net_device *dev, int id)
+static int ipoib_ib_post_receive(struct net_device *dev,
+			struct ipoib_recv_ring *recv_ring, int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_recv_wr *bad_wr;
 	int ret;
 
-	priv->rx_wr.wr_id   = id | IPOIB_OP_RECV;
-	priv->rx_sge[0].addr = priv->rx_ring[id].mapping[0];
-	priv->rx_sge[1].addr = priv->rx_ring[id].mapping[1];
+	recv_ring->rx_wr.wr_id   = id | IPOIB_OP_RECV;
+	recv_ring->rx_sge[0].addr = recv_ring->rx_ring[id].mapping[0];
+	recv_ring->rx_sge[1].addr = recv_ring->rx_ring[id].mapping[1];
 
 
-	ret = ib_post_recv(priv->qp, &priv->rx_wr, &bad_wr);
+	ret = ib_post_recv(recv_ring->recv_qp, &recv_ring->rx_wr, &bad_wr);
 	if (unlikely(ret)) {
 		ipoib_warn(priv, "receive failed for buf %d (%d)\n", id, ret);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[id].mapping);
-		dev_kfree_skb_any(priv->rx_ring[id].skb);
-		priv->rx_ring[id].skb = NULL;
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[id].mapping);
+		dev_kfree_skb_any(recv_ring->rx_ring[id].skb);
+		recv_ring->rx_ring[id].skb = NULL;
 	}
 
 	return ret;
 }
 
-static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
+static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev,
+					  struct ipoib_recv_ring *recv_ring,
+					  int id)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct sk_buff *skb;
@@ -178,7 +181,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 	 */
 	skb_reserve(skb, 4);
 
-	mapping = priv->rx_ring[id].mapping;
+	mapping = recv_ring->rx_ring[id].mapping;
 	mapping[0] = ib_dma_map_single(priv->ca, skb->data, buf_size,
 				       DMA_FROM_DEVICE);
 	if (unlikely(ib_dma_mapping_error(priv->ca, mapping[0])))
@@ -196,7 +199,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 			goto partial_error;
 	}
 
-	priv->rx_ring[id].skb = skb;
+	recv_ring->rx_ring[id].skb = skb;
 	return skb;
 
 partial_error:
@@ -206,18 +209,23 @@ error:
 	return NULL;
 }
 
-static int ipoib_ib_post_receives(struct net_device *dev)
+static int ipoib_ib_post_ring_receives(struct net_device *dev,
+				      struct ipoib_recv_ring *recv_ring)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
 	for (i = 0; i < ipoib_recvq_size; ++i) {
-		if (!ipoib_alloc_rx_skb(dev, i)) {
-			ipoib_warn(priv, "failed to allocate receive buffer %d\n", i);
+		if (!ipoib_alloc_rx_skb(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				   "failed to alloc receive buffer (%d,%d)\n",
+				   recv_ring->index, i);
 			return -ENOMEM;
 		}
-		if (ipoib_ib_post_receive(dev, i)) {
-			ipoib_warn(priv, "ipoib_ib_post_receive failed for buf %d\n", i);
+		if (ipoib_ib_post_receive(dev, recv_ring, i)) {
+			ipoib_warn(priv,
+				   "ipoib_ib_post_receive failed buf (%d,%d)\n",
+				   recv_ring->index, i);
 			return -EIO;
 		}
 	}
@@ -225,7 +233,27 @@ static int ipoib_ib_post_receives(struct net_device *dev)
 	return 0;
 }
 
-static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
+static int ipoib_ib_post_receives(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int err;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		err = ipoib_ib_post_ring_receives(dev, recv_ring);
+		if (err)
+			return err;
+		recv_ring++;
+	}
+
+	return 0;
+}
+
+static void ipoib_ib_handle_rx_wc(struct net_device *dev,
+				  struct ipoib_recv_ring *recv_ring,
+				  struct ib_wc *wc)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id & ~IPOIB_OP_RECV;
@@ -242,16 +270,16 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	skb  = priv->rx_ring[wr_id].skb;
+	skb  = recv_ring->rx_ring[wr_id].skb;
 
 	if (unlikely(wc->status != IB_WC_SUCCESS)) {
 		if (wc->status != IB_WC_WR_FLUSH_ERR)
 			ipoib_warn(priv, "failed recv event "
 				   "(status=%d, wrid=%d vend_err %x)\n",
 				   wc->status, wr_id, wc->vendor_err);
-		ipoib_ud_dma_unmap_rx(priv, priv->rx_ring[wr_id].mapping);
+		ipoib_ud_dma_unmap_rx(priv, recv_ring->rx_ring[wr_id].mapping);
 		dev_kfree_skb_any(skb);
-		priv->rx_ring[wr_id].skb = NULL;
+		recv_ring->rx_ring[wr_id].skb = NULL;
 		return;
 	}
 
@@ -262,18 +290,20 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
 
-	memcpy(mapping, priv->rx_ring[wr_id].mapping,
+	memcpy(mapping, recv_ring->rx_ring[wr_id].mapping,
 	       IPOIB_UD_RX_SG * sizeof *mapping);
 
 	/*
 	 * If we can't allocate a new RX buffer, dump
 	 * this packet and reuse the old buffer.
 	 */
-	if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
-		++dev->stats.rx_dropped;
+	if (unlikely(!ipoib_alloc_rx_skb(dev, recv_ring, wr_id))) {
+		++recv_ring->stats.rx_dropped;
 		goto repost;
 	}
 
+	skb_record_rx_queue(skb, recv_ring->index);
+
 	ipoib_dbg_data(priv, "received %d bytes, SLID 0x%04x\n",
 		       wc->byte_len, wc->slid);
 
@@ -296,18 +326,18 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev, struct ib_wc *wc)
 	skb_reset_mac_header(skb);
 	skb_pull(skb, IPOIB_ENCAP_LEN);
 
-	++dev->stats.rx_packets;
-	dev->stats.rx_bytes += skb->len;
+	++recv_ring->stats.rx_packets;
+	recv_ring->stats.rx_bytes += skb->len;
 
 	skb->dev = dev;
 	if ((dev->features & NETIF_F_RXCSUM) &&
 			likely(wc->wc_flags & IB_WC_IP_CSUM_OK))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
-	napi_gro_receive(&priv->napi, skb);
+	napi_gro_receive(&recv_ring->napi, skb);
 
 repost:
-	if (unlikely(ipoib_ib_post_receive(dev, wr_id)))
+	if (unlikely(ipoib_ib_post_receive(dev, recv_ring, wr_id)))
 		ipoib_warn(priv, "ipoib_ib_post_receive failed "
 			   "for buf %d\n", wr_id);
 }
@@ -376,11 +406,14 @@ static void ipoib_dma_unmap_tx(struct ib_device *ca,
 	}
 }
 
-static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
+static void ipoib_ib_handle_tx_wc(struct ipoib_send_ring *send_ring,
+				struct ib_wc *wc)
 {
+	struct net_device *dev = send_ring->dev;
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned int wr_id = wc->wr_id;
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_ah *ah;
 
 	ipoib_dbg_data(priv, "send completion: id %d, status: %d\n",
 		       wr_id, wc->status);
@@ -391,20 +424,23 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 		return;
 	}
 
-	tx_req = &priv->tx_ring[wr_id];
+	tx_req = &send_ring->tx_ring[wr_id];
+
+	ah = tx_req->ah;
+	atomic_dec(&ah->refcnt);
 
 	ipoib_dma_unmap_tx(priv->ca, tx_req);
 
-	++dev->stats.tx_packets;
-	dev->stats.tx_bytes += tx_req->skb->len;
+	++send_ring->stats.tx_packets;
+	send_ring->stats.tx_bytes += tx_req->skb->len;
 
 	dev_kfree_skb_any(tx_req->skb);
 
-	++priv->tx_tail;
-	if (unlikely(--priv->tx_outstanding == ipoib_sendq_size >> 1) &&
-	    netif_queue_stopped(dev) &&
+	++send_ring->tx_tail;
+	if (unlikely(--send_ring->tx_outstanding == ipoib_sendq_size >> 1) &&
+	    __netif_subqueue_stopped(dev, send_ring->index) &&
 	    test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		netif_wake_queue(dev);
+		netif_wake_subqueue(dev, send_ring->index);
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR)
@@ -413,45 +449,47 @@ static void ipoib_ib_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 			   wc->status, wr_id, wc->vendor_err);
 }
 
-static int poll_tx(struct ipoib_dev_priv *priv)
+static int poll_tx_ring(struct ipoib_send_ring *send_ring)
 {
 	int n, i;
 
-	n = ib_poll_cq(priv->send_cq, MAX_SEND_CQE, priv->send_wc);
+	n = ib_poll_cq(send_ring->send_cq, MAX_SEND_CQE, send_ring->tx_wc);
 	for (i = 0; i < n; ++i)
-		ipoib_ib_handle_tx_wc(priv->dev, priv->send_wc + i);
+		ipoib_ib_handle_tx_wc(send_ring, send_ring->tx_wc + i);
 
 	return n == MAX_SEND_CQE;
 }
 
 int ipoib_poll(struct napi_struct *napi, int budget)
 {
-	struct ipoib_dev_priv *priv = container_of(napi, struct ipoib_dev_priv, napi);
-	struct net_device *dev = priv->dev;
+	struct ipoib_recv_ring *rx_ring;
+	struct net_device *dev;
 	int done;
 	int t;
 	int n, i;
 
 	done  = 0;
+	rx_ring = container_of(napi, struct ipoib_recv_ring, napi);
+	dev = rx_ring->dev;
 
 poll_more:
 	while (done < budget) {
 		int max = (budget - done);
 
 		t = min(IPOIB_NUM_WC, max);
-		n = ib_poll_cq(priv->recv_cq, t, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, t, rx_ring->ibwc);
 
 		for (i = 0; i < n; i++) {
-			struct ib_wc *wc = priv->ibwc + i;
+			struct ib_wc *wc = rx_ring->ibwc + i;
 
 			if (wc->wr_id & IPOIB_OP_RECV) {
 				++done;
 				if (wc->wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, wc);
+					ipoib_cm_handle_rx_wc(dev, rx_ring, wc);
 				else
-					ipoib_ib_handle_rx_wc(dev, wc);
+					ipoib_ib_handle_rx_wc(dev, rx_ring, wc);
 			} else
-				ipoib_cm_handle_tx_wc(priv->dev, wc);
+				ipoib_cm_handle_tx_wc(dev, wc);
 		}
 
 		if (n != t)
@@ -460,7 +498,7 @@ poll_more:
 
 	if (done < budget) {
 		napi_complete(napi);
-		if (unlikely(ib_req_notify_cq(priv->recv_cq,
+		if (unlikely(ib_req_notify_cq(rx_ring->recv_cq,
 					      IB_CQ_NEXT_COMP |
 					      IB_CQ_REPORT_MISSED_EVENTS)) &&
 		    napi_reschedule(napi))
@@ -470,36 +508,34 @@ poll_more:
 	return done;
 }
 
-void ipoib_ib_completion(struct ib_cq *cq, void *dev_ptr)
+void ipoib_ib_completion(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct net_device *dev = dev_ptr;
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring = (struct ipoib_recv_ring *)ctx_ptr;
 
-	napi_schedule(&priv->napi);
+	napi_schedule(&recv_ring->napi);
 }
 
-static void drain_tx_cq(struct net_device *dev)
+static void drain_tx_cq(struct ipoib_send_ring *send_ring)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	netif_tx_lock_bh(send_ring->dev);
 
-	netif_tx_lock(dev);
-	while (poll_tx(priv))
+	while (poll_tx_ring(send_ring))
 		; /* nothing */
 
-	if (netif_queue_stopped(dev))
-		mod_timer(&priv->poll_timer, jiffies + 1);
+	if (__netif_subqueue_stopped(send_ring->dev, send_ring->index))
+		mod_timer(&send_ring->poll_timer, jiffies + 1);
 
-	netif_tx_unlock(dev);
+	netif_tx_unlock_bh(send_ring->dev);
 }
 
-void ipoib_send_comp_handler(struct ib_cq *cq, void *dev_ptr)
+void ipoib_send_comp_handler(struct ib_cq *cq, void *ctx_ptr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev_ptr);
+	struct ipoib_send_ring *send_ring = (struct ipoib_send_ring *)ctx_ptr;
 
-	mod_timer(&priv->poll_timer, jiffies);
+	mod_timer(&send_ring->poll_timer, jiffies);
 }
 
-static inline int post_send(struct ipoib_dev_priv *priv,
+static inline int post_send(struct ipoib_send_ring *send_ring,
 			    unsigned int wr_id,
 			    struct ib_ah *address, u32 qpn,
 			    struct ipoib_tx_buf *tx_req,
@@ -513,30 +549,30 @@ static inline int post_send(struct ipoib_dev_priv *priv,
 	u64 *mapping = tx_req->mapping;
 
 	if (skb_headlen(skb)) {
-		priv->tx_sge[0].addr         = mapping[0];
-		priv->tx_sge[0].length       = skb_headlen(skb);
+		send_ring->tx_sge[0].addr         = mapping[0];
+		send_ring->tx_sge[0].length       = skb_headlen(skb);
 		off = 1;
 	} else
 		off = 0;
 
 	for (i = 0; i < nr_frags; ++i) {
-		priv->tx_sge[i + off].addr = mapping[i + off];
-		priv->tx_sge[i + off].length = skb_frag_size(&frags[i]);
+		send_ring->tx_sge[i + off].addr = mapping[i + off];
+		send_ring->tx_sge[i + off].length = skb_frag_size(&frags[i]);
 	}
-	priv->tx_wr.num_sge	     = nr_frags + off;
-	priv->tx_wr.wr_id 	     = wr_id;
-	priv->tx_wr.wr.ud.remote_qpn = qpn;
-	priv->tx_wr.wr.ud.ah 	     = address;
+	send_ring->tx_wr.num_sge	 = nr_frags + off;
+	send_ring->tx_wr.wr_id		 = wr_id;
+	send_ring->tx_wr.wr.ud.remote_qpn = qpn;
+	send_ring->tx_wr.wr.ud.ah	 = address;
 
 	if (head) {
-		priv->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
-		priv->tx_wr.wr.ud.header = head;
-		priv->tx_wr.wr.ud.hlen	 = hlen;
-		priv->tx_wr.opcode	 = IB_WR_LSO;
+		send_ring->tx_wr.wr.ud.mss	 = skb_shinfo(skb)->gso_size;
+		send_ring->tx_wr.wr.ud.header = head;
+		send_ring->tx_wr.wr.ud.hlen	 = hlen;
+		send_ring->tx_wr.opcode	 = IB_WR_LSO;
 	} else
-		priv->tx_wr.opcode	 = IB_WR_SEND;
+		send_ring->tx_wr.opcode	 = IB_WR_SEND;
 
-	return ib_post_send(priv->qp, &priv->tx_wr, &bad_wr);
+	return ib_post_send(send_ring->send_qp, &send_ring->tx_wr, &bad_wr);
 }
 
 void ipoib_send(struct net_device *dev, struct sk_buff *skb,
@@ -544,16 +580,23 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_tx_buf *tx_req;
+	struct ipoib_send_ring *send_ring;
+	u16 queue_index;
 	int hlen, rc;
 	void *phead;
+	int req_index;
+
+	/* Find the correct QP to submit the IO to */
+	queue_index = skb_get_queue_mapping(skb);
+	send_ring = priv->send_ring + queue_index;
 
 	if (skb_is_gso(skb)) {
 		hlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
 		phead = skb->data;
 		if (unlikely(!skb_pull(skb, hlen))) {
 			ipoib_warn(priv, "linear data too small\n");
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			dev_kfree_skb_any(skb);
 			return;
 		}
@@ -561,8 +604,8 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 		if (unlikely(skb->len > priv->mcast_mtu + IPOIB_ENCAP_LEN)) {
 			ipoib_warn(priv, "packet len %d (> %d) too long to send, dropping\n",
 				   skb->len, priv->mcast_mtu + IPOIB_ENCAP_LEN);
-			++dev->stats.tx_dropped;
-			++dev->stats.tx_errors;
+			++send_ring->stats.tx_dropped;
+			++send_ring->stats.tx_errors;
 			ipoib_cm_skb_too_long(dev, skb, priv->mcast_mtu);
 			return;
 		}
@@ -580,48 +623,56 @@ void ipoib_send(struct net_device *dev, struct sk_buff *skb,
 	 * means we have to make sure everything is properly recorded and
 	 * our state is consistent before we call post_send().
 	 */
-	tx_req = &priv->tx_ring[priv->tx_head & (ipoib_sendq_size - 1)];
+	req_index = send_ring->tx_head & (ipoib_sendq_size - 1);
+	tx_req = &send_ring->tx_ring[req_index];
 	tx_req->skb = skb;
+	tx_req->ah = address;
 	if (unlikely(ipoib_dma_map_tx(priv->ca, tx_req))) {
-		++dev->stats.tx_errors;
+		++send_ring->stats.tx_errors;
 		dev_kfree_skb_any(skb);
 		return;
 	}
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		priv->tx_wr.send_flags |= IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags |= IB_SEND_IP_CSUM;
 	else
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+		send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
 
-	if (++priv->tx_outstanding == ipoib_sendq_size) {
+	if (++send_ring->tx_outstanding == ipoib_sendq_size) {
 		ipoib_dbg(priv, "TX ring full, stopping kernel net queue\n");
-		if (ib_req_notify_cq(priv->send_cq, IB_CQ_NEXT_COMP))
+		if (ib_req_notify_cq(send_ring->send_cq, IB_CQ_NEXT_COMP))
 			ipoib_warn(priv, "request notify on send CQ failed\n");
-		netif_stop_queue(dev);
+		netif_stop_subqueue(dev, queue_index);
 	}
 
 	skb_orphan(skb);
 	skb_dst_drop(skb);
 
-	rc = post_send(priv, priv->tx_head & (ipoib_sendq_size - 1),
+	/*
+	 * Incrementing the reference count after submitting
+	 * may create race condition
+	 * It is better to increment before and decrement in case of error
+	 */
+	atomic_inc(&address->refcnt);
+	rc = post_send(send_ring, req_index,
 		       address->ah, qpn, tx_req, phead, hlen);
 	if (unlikely(rc)) {
 		ipoib_warn(priv, "post_send failed, error %d\n", rc);
-		++dev->stats.tx_errors;
-		--priv->tx_outstanding;
+		++send_ring->stats.tx_errors;
+		--send_ring->tx_outstanding;
 		ipoib_dma_unmap_tx(priv->ca, tx_req);
 		dev_kfree_skb_any(skb);
-		if (netif_queue_stopped(dev))
-			netif_wake_queue(dev);
+		atomic_dec(&address->refcnt);
+		if (__netif_subqueue_stopped(dev, queue_index))
+			netif_wake_subqueue(dev, queue_index);
 	} else {
-		dev->trans_start = jiffies;
+		netdev_get_tx_queue(dev, queue_index)->trans_start = jiffies;
 
-		address->last_send = priv->tx_head;
-		++priv->tx_head;
+		++send_ring->tx_head;
 	}
 
-	if (unlikely(priv->tx_outstanding > MAX_SEND_CQE))
-		while (poll_tx(priv))
+	if (unlikely(send_ring->tx_outstanding > MAX_SEND_CQE))
+		while (poll_tx_ring(send_ring))
 			; /* nothing */
 }
 
@@ -636,7 +687,7 @@ static void __ipoib_reap_ah(struct net_device *dev)
 	spin_lock_irqsave(&priv->lock, flags);
 
 	list_for_each_entry_safe(ah, tah, &priv->dead_ahs, list)
-		if ((int) priv->tx_tail - (int) ah->last_send >= 0) {
+		if (atomic_read(&ah->refcnt) == 0) {
 			list_del(&ah->list);
 			ib_destroy_ah(ah->ah);
 			kfree(ah);
@@ -661,7 +712,31 @@ void ipoib_reap_ah(struct work_struct *work)
 
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
-	drain_tx_cq((struct net_device *)ctx);
+	drain_tx_cq((struct ipoib_send_ring *)ctx);
+}
+
+static void ipoib_napi_enable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		netif_napi_add(dev, &recv_ring->napi,
+			       ipoib_poll, 100);
+		napi_enable(&recv_ring->napi);
+		recv_ring++;
+	}
+}
+
+static void ipoib_napi_disable(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i;
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		napi_disable(&priv->recv_ring[i].napi);
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
@@ -701,7 +776,7 @@ int ipoib_ib_dev_open(struct net_device *dev)
 			   round_jiffies_relative(HZ));
 
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_enable(&priv->napi);
+		ipoib_napi_enable(dev);
 
 	return 0;
 }
@@ -763,19 +838,47 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
 static int recvs_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
 	int pending = 0;
-	int i;
+	int i, j;
 
-	for (i = 0; i < ipoib_recvq_size; ++i)
-		if (priv->rx_ring[i].skb)
-			++pending;
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; j++) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			if (recv_ring->rx_ring[i].skb)
+				++pending;
+		}
+		recv_ring++;
+	}
 
 	return pending;
 }
 
-void ipoib_drain_cq(struct net_device *dev)
+static int sends_pending(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int pending = 0;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		/*
+		* Note that since head and tail are unsigned, the
+		* result of the subtraction is correct even when
+		* the counters wrap around
+		*/
+		pending += send_ring->tx_head - send_ring->tx_tail;
+		send_ring++;
+	}
+
+	return pending;
+}
+
+static void ipoib_drain_rx_ring(struct ipoib_dev_priv *priv,
+				struct ipoib_recv_ring *rx_ring)
+{
+	struct net_device *dev = priv->dev;
 	int i, n;
 
 	/*
@@ -786,42 +889,191 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_disable();
 
 	do {
-		n = ib_poll_cq(priv->recv_cq, IPOIB_NUM_WC, priv->ibwc);
+		n = ib_poll_cq(rx_ring->recv_cq, IPOIB_NUM_WC, rx_ring->ibwc);
 		for (i = 0; i < n; ++i) {
+			struct ib_wc *wc = rx_ring->ibwc + i;
 			/*
 			 * Convert any successful completions to flush
 			 * errors to avoid passing packets up the
 			 * stack after bringing the device down.
 			 */
-			if (priv->ibwc[i].status == IB_WC_SUCCESS)
-				priv->ibwc[i].status = IB_WC_WR_FLUSH_ERR;
+			if (wc->status == IB_WC_SUCCESS)
+				wc->status = IB_WC_WR_FLUSH_ERR;
 
-			if (priv->ibwc[i].wr_id & IPOIB_OP_RECV) {
-				if (priv->ibwc[i].wr_id & IPOIB_OP_CM)
-					ipoib_cm_handle_rx_wc(dev, priv->ibwc + i);
+			if (wc->wr_id & IPOIB_OP_RECV) {
+				if (wc->wr_id & IPOIB_OP_CM)
+					ipoib_cm_handle_rx_wc(dev, rx_ring, wc);
 				else
-					ipoib_ib_handle_rx_wc(dev, priv->ibwc + i);
-			} else
-				ipoib_cm_handle_tx_wc(dev, priv->ibwc + i);
+					ipoib_ib_handle_rx_wc(dev, rx_ring, wc);
+			} else {
+				ipoib_cm_handle_tx_wc(dev, wc);
+			}
 		}
 	} while (n == IPOIB_NUM_WC);
 
-	while (poll_tx(priv))
-		; /* nothing */
-
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+static void drain_rx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ipoib_drain_rx_ring(priv, recv_ring);
+		recv_ring++;
+	}
+}
+
+
+static void drain_tx_rings(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int bool_value = 0;
+	int i;
+
+	do {
+		bool_value = 0;
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			local_bh_disable();
+			bool_value |= poll_tx_ring(send_ring);
+			local_bh_enable();
+			send_ring++;
+		}
+	} while (bool_value);
+}
+
+void ipoib_drain_cq(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	drain_rx_rings(priv);
+
+	drain_tx_rings(priv);
+}
+
+static void ipoib_ib_send_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *tx_ring;
+	struct ipoib_tx_buf *tx_req;
+	int i;
+
+	tx_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		while ((int) tx_ring->tx_tail - (int) tx_ring->tx_head < 0) {
+			tx_req = &tx_ring->tx_ring[tx_ring->tx_tail &
+				  (ipoib_sendq_size - 1)];
+			ipoib_dma_unmap_tx(priv->ca, tx_req);
+			dev_kfree_skb_any(tx_req->skb);
+			++tx_ring->tx_tail;
+			--tx_ring->tx_outstanding;
+		}
+		tx_ring++;
+	}
+}
+
+static void ipoib_ib_recv_ring_stop(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_recv_ring *recv_ring;
+	int i, j;
+
+	recv_ring = priv->recv_ring;
+	for (j = 0; j < priv->num_rx_queues; ++j) {
+		for (i = 0; i < ipoib_recvq_size; ++i) {
+			struct ipoib_rx_buf *rx_req;
+
+			rx_req = &recv_ring->rx_ring[i];
+			if (!rx_req->skb)
+				continue;
+			ipoib_ud_dma_unmap_rx(priv,
+					      recv_ring->rx_ring[i].mapping);
+			dev_kfree_skb_any(rx_req->skb);
+			rx_req->skb = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void set_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+	/* Init a timer per queue */
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		setup_timer(&send_ring->poll_timer, ipoib_ib_tx_timer_func,
+			    (unsigned long)send_ring);
+		send_ring++;
+	}
+}
+
+static void del_tx_poll_timers(struct ipoib_dev_priv *priv)
+{
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		del_timer_sync(&send_ring->poll_timer);
+		send_ring++;
+	}
+}
+
+static void set_tx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_send_ring *send_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i <  priv->num_tx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(send_ring->send_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+		send_ring++;
+	}
+}
+
+static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
+					enum ib_qp_state new_state)
+{
+	struct ipoib_recv_ring *recv_ring;
 	struct ib_qp_attr qp_attr;
+	int i;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(recv_ring->recv_qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+		recv_ring++;
+	}
+}
+
+static void set_rings_qp_state(struct ipoib_dev_priv *priv,
+				enum ib_qp_state new_state)
+{
+	set_tx_rings_qp_state(priv, new_state);
+
+	if (priv->num_rx_queues > 1)
+		set_rx_rings_qp_state(priv, new_state);
+}
+
+
+int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	unsigned long begin;
-	struct ipoib_tx_buf *tx_req;
+	struct ipoib_recv_ring *recv_ring;
 	int i;
 
 	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-		napi_disable(&priv->napi);
+		ipoib_napi_disable(dev);
 
 	ipoib_cm_dev_stop(dev);
 
@@ -829,42 +1081,24 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	 * Move our QP to the error state and then reinitialize in
 	 * when all work requests have completed or have been flushed.
 	 */
-	qp_attr.qp_state = IB_QPS_ERR;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to ERROR state\n");
+	set_rings_qp_state(priv, IB_QPS_ERR);
+
 
 	/* Wait for all sends and receives to complete */
 	begin = jiffies;
 
-	while (priv->tx_head != priv->tx_tail || recvs_pending(dev)) {
+	while (sends_pending(dev) || recvs_pending(dev)) {
 		if (time_after(jiffies, begin + 5 * HZ)) {
 			ipoib_warn(priv, "timing out; %d sends %d receives not completed\n",
-				   priv->tx_head - priv->tx_tail, recvs_pending(dev));
+				   sends_pending(dev), recvs_pending(dev));
 
 			/*
 			 * assume the HW is wedged and just free up
 			 * all our pending work requests.
 			 */
-			while ((int) priv->tx_tail - (int) priv->tx_head < 0) {
-				tx_req = &priv->tx_ring[priv->tx_tail &
-							(ipoib_sendq_size - 1)];
-				ipoib_dma_unmap_tx(priv->ca, tx_req);
-				dev_kfree_skb_any(tx_req->skb);
-				++priv->tx_tail;
-				--priv->tx_outstanding;
-			}
+			ipoib_ib_send_ring_stop(priv);
 
-			for (i = 0; i < ipoib_recvq_size; ++i) {
-				struct ipoib_rx_buf *rx_req;
-
-				rx_req = &priv->rx_ring[i];
-				if (!rx_req->skb)
-					continue;
-				ipoib_ud_dma_unmap_rx(priv,
-						      priv->rx_ring[i].mapping);
-				dev_kfree_skb_any(rx_req->skb);
-				rx_req->skb = NULL;
-			}
+			ipoib_ib_recv_ring_stop(priv);
 
 			goto timeout;
 		}
@@ -877,10 +1111,9 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush)
 	ipoib_dbg(priv, "All sends and receives done.\n");
 
 timeout:
-	del_timer_sync(&priv->poll_timer);
-	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
-		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
+	del_tx_poll_timers(priv);
+
+	set_rings_qp_state(priv, IB_QPS_RESET);
 
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
@@ -901,7 +1134,11 @@ timeout:
 		msleep(1);
 	}
 
-	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP);
+		recv_ring++;
+	}
 
 	return 0;
 }
@@ -919,8 +1156,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		return -ENODEV;
 	}
 
-	setup_timer(&priv->poll_timer, ipoib_ib_tx_timer_func,
-		    (unsigned long) dev);
+	set_tx_poll_timers(priv);
 
 	if (dev->flags & IFF_UP) {
 		if (ipoib_ib_dev_open(dev)) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e459fa7..6d23f44 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -127,7 +127,7 @@ int ipoib_open(struct net_device *dev)
 		mutex_unlock(&priv->vlan_mutex);
 	}
 
-	netif_start_queue(dev);
+	netif_tx_start_all_queues(dev);
 
 	return 0;
 
@@ -148,7 +148,7 @@ static int ipoib_stop(struct net_device *dev)
 
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	netif_stop_queue(dev);
+	netif_tx_stop_all_queues(dev);
 
 	ipoib_ib_dev_down(dev, 1);
 	ipoib_ib_dev_stop(dev, 0);
@@ -218,6 +218,8 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 int ipoib_set_mode(struct net_device *dev, const char *buf)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
 
 	/* flush paths if we switch modes so that connections are restarted */
 	if (IPOIB_CM_SUPPORTED(dev->dev_addr) && !strcmp(buf, "connected\n")) {
@@ -226,7 +228,12 @@ int ipoib_set_mode(struct net_device *dev, const char *buf)
 			   "will cause multicast packet drops\n");
 		netdev_update_features(dev);
 		rtnl_unlock();
-		priv->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+
+		send_ring = priv->send_ring;
+		for (i = 0; i < priv->num_tx_queues; i++) {
+			send_ring->tx_wr.send_flags &= ~IB_SEND_IP_CSUM;
+			send_ring++;
+		}
 
 		ipoib_flush_paths(dev);
 		rtnl_lock();
@@ -581,12 +588,14 @@ static void neigh_add_path(struct sk_buff *skb, u8 *daddr,
 	struct ipoib_path *path;
 	struct ipoib_neigh *neigh;
 	unsigned long flags;
+	int index;
 
 	spin_lock_irqsave(&priv->lock, flags);
 	neigh = ipoib_neigh_alloc(daddr, dev);
 	if (!neigh) {
 		spin_unlock_irqrestore(&priv->lock, flags);
-		++dev->stats.tx_dropped;
+		index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		return;
 	}
@@ -646,7 +655,8 @@ err_list:
 err_path:
 	ipoib_neigh_free(neigh);
 err_drop:
-	++dev->stats.tx_dropped;
+	index = skb_get_queue_mapping(skb);
+	priv->send_ring[index].stats.tx_dropped++;
 	dev_kfree_skb_any(skb);
 
 	spin_unlock_irqrestore(&priv->lock, flags);
@@ -659,6 +669,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_path *path;
 	unsigned long flags;
+	int index = skb_get_queue_mapping(skb);
 
 	spin_lock_irqsave(&priv->lock, flags);
 
@@ -681,7 +692,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 			} else
 				__path_add(dev, path);
 		} else {
-			++dev->stats.tx_dropped;
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
@@ -700,7 +711,7 @@ static void unicast_arp_send(struct sk_buff *skb, struct net_device *dev,
 		   skb_queue_len(&path->queue) < IPOIB_MAX_PATH_REC_QUEUE) {
 		__skb_queue_tail(&path->queue, skb);
 	} else {
-		++dev->stats.tx_dropped;
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 	}
 
@@ -788,18 +799,70 @@ unref:
 	return NETDEV_TX_OK;
 }
 
+static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+{
+	return 0;
+}
+
 static void ipoib_timeout(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	u16 index;
 
 	ipoib_warn(priv, "transmit timeout: latency %d msecs\n",
 		   jiffies_to_msecs(jiffies - dev->trans_start));
-	ipoib_warn(priv, "queue stopped %d, tx_head %u, tx_tail %u\n",
-		   netif_queue_stopped(dev),
-		   priv->tx_head, priv->tx_tail);
+
+	for (index = 0; index < priv->num_tx_queues; index++) {
+		if (__netif_subqueue_stopped(dev, index)) {
+			send_ring = priv->send_ring + index;
+			ipoib_warn(priv,
+				   "queue (%d) stopped, head %u, tail %u\n",
+				   index,
+				   send_ring->tx_head, send_ring->tx_tail);
+		}
+	}
 	/* XXX reset QP, etc. */
 }
 
+static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct net_device_stats *stats = &dev->stats;
+	struct net_device_stats local_stats;
+	int i;
+
+	memset(&local_stats, 0, sizeof(struct net_device_stats));
+
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		struct ipoib_rx_ring_stats *rstats = &priv->recv_ring[i].stats;
+		local_stats.rx_packets += rstats->rx_packets;
+		local_stats.rx_bytes   += rstats->rx_bytes;
+		local_stats.rx_errors  += rstats->rx_errors;
+		local_stats.rx_dropped += rstats->rx_dropped;
+	}
+
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		struct ipoib_tx_ring_stats *tstats = &priv->send_ring[i].stats;
+		local_stats.tx_packets += tstats->tx_packets;
+		local_stats.tx_bytes   += tstats->tx_bytes;
+		local_stats.tx_errors  += tstats->tx_errors;
+		local_stats.tx_dropped += tstats->tx_dropped;
+	}
+
+	stats->rx_packets = local_stats.rx_packets;
+	stats->rx_bytes   = local_stats.rx_bytes;
+	stats->rx_errors  = local_stats.rx_errors;
+	stats->rx_dropped = local_stats.rx_dropped;
+
+	stats->tx_packets = local_stats.tx_packets;
+	stats->tx_bytes   = local_stats.tx_bytes;
+	stats->tx_errors  = local_stats.tx_errors;
+	stats->tx_dropped = local_stats.tx_dropped;
+
+	return stats;
+}
+
 static int ipoib_hard_header(struct sk_buff *skb,
 			     struct net_device *dev,
 			     unsigned short type,
@@ -1252,47 +1315,93 @@ static void ipoib_neigh_hash_uninit(struct net_device *dev)
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rx_allocated, tx_allocated;
+	unsigned long alloc_size;
 
 	if (ipoib_neigh_hash_init(priv) < 0)
 		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
-	priv->rx_ring =	kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
+	/* Multi queue initialization */
+	priv->recv_ring = kzalloc(priv->num_rx_queues * sizeof(*recv_ring),
 				GFP_KERNEL);
-	if (!priv->rx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
-		       ca->name, ipoib_recvq_size);
+	if (!priv->recv_ring) {
+		pr_warn("%s: failed to allocate RECV ring (%d entries)\n",
+			ca->name, priv->num_rx_queues);
 		goto out_neigh_hash_cleanup;
 	}
 
-	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
-	if (!priv->tx_ring) {
-		printk(KERN_WARNING "%s: failed to allocate TX ring (%d entries)\n",
-		       ca->name, ipoib_sendq_size);
-		goto out_rx_ring_cleanup;
+	alloc_size = ipoib_recvq_size * sizeof(*recv_ring->rx_ring);
+	rx_allocated = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_ring = kzalloc(alloc_size, GFP_KERNEL);
+		if (!recv_ring->rx_ring) {
+			pr_warn("%s: failed to allocate RX ring (%d entries)\n",
+				ca->name, ipoib_recvq_size);
+			goto out_recv_ring_cleanup;
+		}
+		recv_ring->dev = dev;
+		recv_ring->index = i;
+		recv_ring++;
+		rx_allocated++;
+	}
+
+	priv->send_ring = kzalloc(priv->num_tx_queues * sizeof(*send_ring),
+			GFP_KERNEL);
+	if (!priv->send_ring) {
+		pr_warn("%s: failed to allocate SEND ring (%d entries)\n",
+			ca->name, priv->num_tx_queues);
+		goto out_recv_ring_cleanup;
+	}
+
+	alloc_size = ipoib_sendq_size * sizeof(*send_ring->tx_ring);
+	tx_allocated = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		send_ring->tx_ring = vzalloc(alloc_size);
+		if (!send_ring->tx_ring) {
+			pr_warn(
+				"%s: failed to allocate TX ring (%d entries)\n",
+				ca->name, ipoib_sendq_size);
+			goto out_send_ring_cleanup;
+		}
+		send_ring->dev = dev;
+		send_ring->index = i;
+		send_ring++;
+		tx_allocated++;
 	}
 
 	/* priv->tx_head, tx_tail & tx_outstanding are already 0 */
 
 	if (ipoib_ib_dev_init(dev, ca, port))
-		goto out_tx_ring_cleanup;
+		goto out_send_ring_cleanup;
+
 
 	return 0;
 
-out_tx_ring_cleanup:
-	vfree(priv->tx_ring);
+out_send_ring_cleanup:
+	for (i = 0; i < tx_allocated; i++)
+		vfree(priv->send_ring[i].tx_ring);
 
-out_rx_ring_cleanup:
-	kfree(priv->rx_ring);
+out_recv_ring_cleanup:
+	for (i = 0; i < rx_allocated; i++)
+		kfree(priv->recv_ring[i].rx_ring);
 
 out_neigh_hash_cleanup:
 	ipoib_neigh_hash_uninit(dev);
 out:
+	priv->send_ring = NULL;
+	priv->recv_ring = NULL;
+
 	return -ENOMEM;
 }
 
 void ipoib_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv;
+	int i;
 	LIST_HEAD(head);
 
 	ASSERT_RTNL();
@@ -1310,11 +1419,17 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	ipoib_ib_dev_cleanup(dev);
 
-	kfree(priv->rx_ring);
-	vfree(priv->tx_ring);
 
-	priv->rx_ring = NULL;
-	priv->tx_ring = NULL;
+	for (i = 0; i < priv->num_tx_queues; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
+
+	priv->recv_ring = NULL;
+	priv->send_ring = NULL;
 
 	ipoib_neigh_hash_uninit(dev);
 }
@@ -1330,7 +1445,9 @@ static const struct net_device_ops ipoib_netdev_ops = {
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
 	.ndo_start_xmit	 	 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_null,
 	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 };
 
@@ -1338,13 +1455,12 @@ void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	/* Use correct ops (ndo_select_queue) */
 	dev->netdev_ops		 = &ipoib_netdev_ops;
 	dev->header_ops		 = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
 
-	netif_napi_add(dev, &priv->napi, ipoib_poll, 100);
-
 	dev->watchdog_timeo	 = HZ;
 
 	dev->flags		|= IFF_BROADCAST | IFF_MULTICAST;
@@ -1383,15 +1499,21 @@ void ipoib_setup(struct net_device *dev)
 	INIT_DELAYED_WORK(&priv->neigh_reap_task, ipoib_reap_neigh);
 }
 
-struct ipoib_dev_priv *ipoib_intf_alloc(const char *name)
+struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
+					struct ipoib_dev_priv *template_priv)
 {
 	struct net_device *dev;
 
-	dev = alloc_netdev((int) sizeof (struct ipoib_dev_priv), name,
-			   ipoib_setup);
+	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
+			   ipoib_setup,
+			   template_priv->num_tx_queues,
+			   template_priv->num_rx_queues);
 	if (!dev)
 		return NULL;
 
+	netif_set_real_num_tx_queues(dev, template_priv->num_tx_queues);
+	netif_set_real_num_rx_queues(dev, template_priv->num_rx_queues);
+
 	return netdev_priv(dev);
 }
 
@@ -1491,7 +1613,8 @@ int ipoib_add_pkey_attr(struct net_device *dev)
 	return device_create_file(&dev->dev, &dev_attr_pkey);
 }
 
-int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
+				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
 	int result = -ENOMEM;
@@ -1514,6 +1637,20 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 
 	kfree(device_attr);
 
+	priv->num_rx_queues = 1;
+	priv->num_tx_queues = 1;
+
+	return 0;
+}
+
+int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
+{
+	int result;
+
+	result = ipoib_get_hca_features(priv, hca);
+	if (result)
+		return result;
+
 	if (priv->hca_caps & IB_DEVICE_UD_IP_CSUM) {
 		priv->dev->hw_features = NETIF_F_SG |
 			NETIF_F_IP_CSUM | NETIF_F_RXCSUM;
@@ -1530,13 +1667,23 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca)
 static struct net_device *ipoib_add_port(const char *format,
 					 struct ib_device *hca, u8 port)
 {
-	struct ipoib_dev_priv *priv;
+	struct ipoib_dev_priv *priv, *template_priv;
 	struct ib_port_attr attr;
 	int result = -ENOMEM;
 
-	priv = ipoib_intf_alloc(format);
-	if (!priv)
-		goto alloc_mem_failed;
+	template_priv = kmalloc(sizeof(*template_priv), GFP_KERNEL);
+	if (!template_priv)
+		goto alloc_mem_failed1;
+
+	if (ipoib_get_hca_features(template_priv, hca))
+		goto device_query_failed;
+
+	priv = ipoib_intf_alloc(format, template_priv);
+	if (!priv) {
+		kfree(template_priv);
+		goto alloc_mem_failed2;
+	}
+	kfree(template_priv);
 
 	SET_NETDEV_DEV(priv->dev, hca->dma_device);
 	priv->dev->dev_id = port - 1;
@@ -1638,7 +1785,13 @@ event_failed:
 device_init_failed:
 	free_netdev(priv->dev);
 
-alloc_mem_failed:
+alloc_mem_failed2:
+	return ERR_PTR(result);
+
+device_query_failed:
+	kfree(template_priv);
+
+alloc_mem_failed1:
 	return ERR_PTR(result);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cecb98a..875cf2c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -69,7 +69,7 @@ struct ipoib_mcast_iter {
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
 	struct net_device *dev = mcast->dev;
-	int tx_dropped = 0;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	ipoib_dbg_mcast(netdev_priv(dev), "deleting multicast group %pI6\n",
 			mcast->mcmember.mgid.raw);
@@ -81,14 +81,15 @@ static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 		ipoib_put_ah(mcast->ah);
 
 	while (!skb_queue_empty(&mcast->pkt_queue)) {
-		++tx_dropped;
-		dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+		struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+		int index = skb_get_queue_mapping(skb);
+		/* Modify to lock queue */
+		netif_tx_lock_bh(dev);
+		priv->send_ring[index].stats.tx_dropped++;
+		netif_tx_unlock_bh(dev);
+		dev_kfree_skb_any(skb);
 	}
 
-	netif_tx_lock_bh(dev);
-	dev->stats.tx_dropped += tx_dropped;
-	netif_tx_unlock_bh(dev);
-
 	kfree(mcast);
 }
 
@@ -172,6 +173,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	struct ipoib_ah *ah;
 	int ret;
 	int set_qkey = 0;
+	int i;
 
 	mcast->mcmember = *mcmember;
 
@@ -188,7 +190,8 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		priv->mcast_mtu = IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		spin_unlock_irq(&priv->lock);
-		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+		for (i = 0; i < priv->num_tx_queues; i++)
+			priv->send_ring[i].tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
 
 		if (!ipoib_cm_admin_enabled(dev)) {
@@ -276,6 +279,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
 	struct ipoib_mcast *mcast = multicast->context;
 	struct net_device *dev = mcast->dev;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET)
@@ -292,8 +296,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 		/* Flush out any queued packets */
 		netif_tx_lock_bh(dev);
 		while (!skb_queue_empty(&mcast->pkt_queue)) {
-			++dev->stats.tx_dropped;
-			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
+			struct sk_buff *skb = skb_dequeue(&mcast->pkt_queue);
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
+			dev_kfree_skb_any(skb);
 		}
 		netif_tx_unlock_bh(dev);
 
@@ -653,7 +659,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)		||
 	    !priv->broadcast					||
 	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
-		++dev->stats.tx_dropped;
+		int index = skb_get_queue_mapping(skb);
+		priv->send_ring[index].stats.tx_dropped++;
 		dev_kfree_skb_any(skb);
 		goto unlock;
 	}
@@ -666,9 +673,10 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 
 		mcast = ipoib_mcast_alloc(dev, 0);
 		if (!mcast) {
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			ipoib_warn(priv, "unable to allocate memory for "
 				   "multicast structure\n");
-			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			goto out;
 		}
@@ -683,7 +691,8 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
 			skb_queue_tail(&mcast->pkt_queue, skb);
 		else {
-			++dev->stats.tx_dropped;
+			int index = skb_get_queue_mapping(skb);
+			priv->send_ring[index].stats.tx_dropped++;
 			dev_kfree_skb_any(skb);
 		}
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 049a997..4be626f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -118,6 +118,10 @@ int ipoib_init_qp(struct net_device *dev)
 		goto out_fail;
 	}
 
+	/* Only one ring currently */
+	priv->recv_ring[0].recv_qp = priv->qp;
+	priv->send_ring[0].send_qp = priv->qp;
+
 	return 0;
 
 out_fail:
@@ -142,8 +146,10 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 		.qp_type     = IB_QPT_UD
 	};
 
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
-	int i;
+	int i, j;
 
 	priv->pd = ib_alloc_pd(priv->ca);
 	if (IS_ERR(priv->pd)) {
@@ -167,19 +173,24 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL, dev, size, 0);
+	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				     priv->recv_ring, size, 0);
 	if (IS_ERR(priv->recv_cq)) {
 		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
 		goto out_free_mr;
 	}
 
 	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     dev, ipoib_sendq_size, 0);
+				     priv->send_ring, ipoib_sendq_size, 0);
 	if (IS_ERR(priv->send_cq)) {
 		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
 		goto out_free_recv_cq;
 	}
 
+	/* Only one ring */
+	priv->recv_ring[0].recv_cq = priv->recv_cq;
+	priv->send_ring[0].send_cq = priv->send_cq;
+
 	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
 		goto out_free_send_cq;
 
@@ -205,25 +216,43 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
-	for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
-		priv->tx_sge[i].lkey = priv->mr->lkey;
+	send_ring = priv->send_ring;
+	for (j = 0; j < priv->num_tx_queues; j++) {
+		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
+			send_ring->tx_sge[i].lkey = priv->mr->lkey;
 
-	priv->tx_wr.opcode	= IB_WR_SEND;
-	priv->tx_wr.sg_list	= priv->tx_sge;
-	priv->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring->tx_wr.opcode	= IB_WR_SEND;
+		send_ring->tx_wr.sg_list	= send_ring->tx_sge;
+		send_ring->tx_wr.send_flags	= IB_SEND_SIGNALED;
+		send_ring++;
+	}
 
-	priv->rx_sge[0].lkey = priv->mr->lkey;
+	recv_ring = priv->recv_ring;
+	recv_ring->rx_sge[0].lkey = priv->mr->lkey;
 	if (ipoib_ud_need_sg(priv->max_ib_mtu)) {
-		priv->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
-		priv->rx_sge[1].length = PAGE_SIZE;
-		priv->rx_sge[1].lkey = priv->mr->lkey;
-		priv->rx_wr.num_sge = IPOIB_UD_RX_SG;
+		recv_ring->rx_sge[0].length = IPOIB_UD_HEAD_SIZE;
+		recv_ring->rx_sge[1].length = PAGE_SIZE;
+		recv_ring->rx_sge[1].lkey = priv->mr->lkey;
+		recv_ring->rx_wr.num_sge = IPOIB_UD_RX_SG;
 	} else {
-		priv->rx_sge[0].length = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
-		priv->rx_wr.num_sge = 1;
+		recv_ring->rx_sge[0].length =
+				IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
+		recv_ring->rx_wr.num_sge = 1;
+	}
+	recv_ring->rx_wr.next = NULL;
+	recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+
+	/* Copy first RX ring sge and wr parameters to the remaining RX rings */
+	first_recv_ring = recv_ring;
+	recv_ring++;
+	for (i = 1; i < priv->num_rx_queues; i++) {
+		recv_ring->rx_sge[0] = first_recv_ring->rx_sge[0];
+		recv_ring->rx_sge[1] = first_recv_ring->rx_sge[1];
+		recv_ring->rx_wr = first_recv_ring->rx_wr;
+		/* This field is per ring */
+		recv_ring->rx_wr.sg_list = recv_ring->rx_sge;
+		recv_ring++;
 	}
-	priv->rx_wr.next = NULL;
-	priv->rx_wr.sg_list = priv->rx_sge;
 
 	return 0;
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 8292554..ba633c2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -133,7 +133,7 @@ int ipoib_vlan_add(struct net_device *pdev, unsigned short pkey)
 
 	snprintf(intf_name, sizeof intf_name, "%s.%04x",
 		 ppriv->dev->name, pkey);
-	priv = ipoib_intf_alloc(intf_name);
+	priv = ipoib_intf_alloc(intf_name, ppriv);
 	if (!priv)
 		return -ENOMEM;
 
-- 
1.7.1


* [PATCH V2 for-next 5/6] IB/ipoib: Add RSS and TSS support for datagram mode
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (3 preceding siblings ...)
  2013-02-05 15:48   ` [PATCH V2 for-next 4/6] IB/ipoib: Move to multi-queue device Or Gerlitz
@ 2013-02-05 15:48   ` Or Gerlitz
  2013-02-05 15:48   ` [PATCH V2 for-next 6/6] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
  5 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

This patch adds RSS (Receive Side Scaling) and TSS (multi-queue transmit)
support for IPoIB. The RSS and TSS implementation utilizes the new QP
groups concept.

The number of RSS and TSS rings is a function of the number of CPU cores
and of the low level driver's capability to support QP groups and RSS.

If the low level driver doesn't support QP groups, then only one RX ring
and one TX ring are created, along with a single QP used by both rings.

If the HW supports RSS then additional receive QPs are created, and each
is assigned to a separate receive ring. The number of additional receive
rings is equal to the number of CPU cores rounded up to the next power of two.

If the HW doesn't support RSS then only one receive ring is created
and the parent QP is assigned as its QP.

When TSS is used, additional send QPs are created, and each is assigned to
a separate send ring. The number of additional send rings is equal to the
number of CPU cores rounded up to the next power of two.

It turns out that there are IPoIB drivers used by some operating systems
and/or hypervisors in a para-virtualization (PV) scheme which extract the
source QPN from the CQ WC associated with an incoming packet in order to
generate the source MAC address in the emulated MAC header they build.

With TSS, different packets targeted for the same entity (e.g. a VM using
a PV IPoIB instance) could potentially be sent through different TX rings
which map to different UD QPs, each with its own QPN. This may break some
assumptions made by the receiving entity (e.g. rules related to security,
monitoring, etc.).

If the HW supports TSS, it is capable of overriding the source UD QPN
present in the IB datagram header (DETH) of sent packets with the parent's
QPN, which is part of the device HW address as advertised to the Linux network
stack and hence carried in ARP requests/responses. Thus the above-mentioned
problem doesn't exist.

When the HW doesn't support TSS but QP groups are supported, which means
the low level driver can create a set of QPs with contiguous QP numbers,
TSS can still be used; this is called "SW TSS".

In this case, the low level driver provides IPoIB with a mask when the
parent QP is created. This mask is later written into the reserved field
of the IPoIB header so receivers of SW TSS packets can mask the QPN of
a received packet and discover the parent QPN.

In order not to break inter-operability with PV IPoIB drivers which were
not yet enhanced to apply this masking to incoming packets, SW TSS will
only be used if the peer advertised its willingness to accept SW TSS
frames; otherwise the parent QP will be used.

The advertisement of willingness to accept TSS frames is done using a
dedicated bit in the reserved byte of the IPoIB HW address (similar to CM).
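
To illustrate the SW TSS masking described above, here is a minimal userspace
sketch (not part of this patch) of how a receiver could recover the parent QPN
from the source QPN of an incoming packet, given the mask size carried in the
upper nibble of the IPoIB header reserved word. The exact recovery rule
(clearing the low-order QPN bits that distinguish the child QPs) is an
assumption based on the description above; only the field placement follows
the patch.

#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: recover the parent QPN of a SW TSS sender.
 * reserved_word is the IPoIB header reserved word (host order here),
 * whose upper nibble carries the mask size; src_qpn is the 24-bit
 * source QPN taken from the work completion. Assumption: child QPNs
 * are contiguous and differ from the parent only in their low
 * mask_sz bits, so clearing those bits yields the parent QPN.
 */
static uint32_t ipoib_parent_qpn(uint16_t reserved_word, uint32_t src_qpn)
{
	unsigned int mask_sz = (reserved_word >> 12) & 0xf;
	uint32_t mask = (1u << mask_sz) - 1;

	return src_qpn & ~mask;
}

int main(void)
{
	uint16_t reserved = 3 << 12;	/* example: 3 low QPN bits vary */
	uint32_t src_qpn = 0x45;	/* packet sent from a child QP */

	printf("parent QPN = 0x%x\n", ipoib_parent_qpn(reserved, src_qpn));
	return 0;
}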

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h       |   15 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c    |   10 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  169 +++++++-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  621 ++++++++++++++++++++++++----
 4 files changed, 721 insertions(+), 94 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index cf5fdd9..87004e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -121,7 +121,7 @@ enum {
 
 struct ipoib_header {
 	__be16	proto;
-	u16	reserved;
+	__be16	tss_qpn_mask_sz;
 };
 
 struct ipoib_cb {
@@ -381,9 +381,7 @@ struct ipoib_dev_priv {
 	u16		  pkey_index;
 	struct ib_pd	 *pd;
 	struct ib_mr	 *mr;
-	struct ib_cq	 *recv_cq;
-	struct ib_cq	 *send_cq;
-	struct ib_qp	 *qp;
+	struct ib_qp	 *qp; /* also parent QP for TSS & RSS */
 	u32		  qkey;
 
 	union ib_gid local_gid;
@@ -416,8 +414,11 @@ struct ipoib_dev_priv {
 	struct timer_list poll_timer;
 	struct ipoib_recv_ring *recv_ring;
 	struct ipoib_send_ring *send_ring;
-	unsigned int num_rx_queues;
-	unsigned int num_tx_queues;
+	unsigned int rss_qp_num; /* No RSS HW support 0 */
+	unsigned int tss_qp_num; /* No TSS (HW or SW) used 0 */
+	unsigned int num_rx_queues; /* No RSS HW support 1 */
+	unsigned int num_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	__be16 tss_qpn_mask_sz; /* Put in ipoib header reserved */
 };
 
 struct ipoib_ah {
@@ -585,9 +586,11 @@ int ipoib_set_dev_features(struct ipoib_dev_priv *priv, struct ib_device *hca);
 
 #define IPOIB_FLAGS_RC		0x80
 #define IPOIB_FLAGS_UC		0x40
+#define IPOIB_FLAGS_TSS		0x20
 
 /* We don't support UC connections at the moment */
 #define IPOIB_CM_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_RC))
+#define IPOIB_TSS_SUPPORTED(ha)   (ha[0] & (IPOIB_FLAGS_TSS))
 
 #ifdef CONFIG_INFINIBAND_IPOIB_CM
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 4871dc9..01ce5e9 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -286,6 +286,7 @@ static void ipoib_ib_handle_rx_wc(struct net_device *dev,
 	/*
 	 * Drop packets that this interface sent, ie multicast packets
 	 * that the HCA has replicated.
+	 * Note: with SW TSS, MC packets are sent using priv->qp so no need to mask
 	 */
 	if (wc->slid == priv->local_lid && wc->src_qp == priv->qp->qp_num)
 		goto repost;
@@ -1058,6 +1059,15 @@ static void set_rx_rings_qp_state(struct ipoib_dev_priv *priv,
 static void set_rings_qp_state(struct ipoib_dev_priv *priv,
 				enum ib_qp_state new_state)
 {
+	if (priv->hca_caps & IB_DEVICE_UD_TSS) {
+		/* TSS HW is supported, parent QP has no ring (send_ring) */
+		struct ib_qp_attr qp_attr;
+		qp_attr.qp_state = new_state;
+		if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv, "Failed to modify QP to state(%d)\n",
+				   new_state);
+	}
+
 	set_tx_rings_qp_state(priv, new_state);
 
 	if (priv->num_rx_queues > 1)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6d23f44..cd9df99 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -725,7 +725,9 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct ipoib_cb *cb = (struct ipoib_cb *) skb->cb;
 	struct ipoib_header *header;
 	unsigned long flags;
+	struct ipoib_send_ring *send_ring;
 
+	send_ring = priv->send_ring + skb_get_queue_mapping(skb);
 	header = (struct ipoib_header *) skb->data;
 
 	if (unlikely(cb->hwaddr[4] == 0xff)) {
@@ -735,7 +737,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		    (header->proto != htons(ETH_P_ARP)) &&
 		    (header->proto != htons(ETH_P_RARP))) {
 			/* ethertype not supported by IPoIB */
-			++dev->stats.tx_dropped;
+			++send_ring->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 			return NETDEV_TX_OK;
 		}
@@ -767,7 +769,7 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		return NETDEV_TX_OK;
 	default:
 		/* ethertype not supported by IPoIB */
-		++dev->stats.tx_dropped;
+		++send_ring->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 		return NETDEV_TX_OK;
 	}
@@ -775,11 +777,19 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
 send_using_neigh:
 	/* note we now hold a ref to neigh */
 	if (ipoib_cm_get(neigh)) {
+		/* CM wasn't indicated in select queue, the ring may be wrong */
+		if (!IPOIB_CM_SUPPORTED(cb->hwaddr))
+			goto drop;
+
 		if (ipoib_cm_up(neigh)) {
 			ipoib_cm_send(dev, skb, ipoib_cm_get(neigh));
 			goto unref;
 		}
 	} else if (neigh->ah) {
+		/* CM was indicated in select queue, the ring may be wrong */
+		if (IPOIB_CM_SUPPORTED(cb->hwaddr) && priv->num_tx_queues > 1)
+			goto drop;
+
 		ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(cb->hwaddr));
 		goto unref;
 	}
@@ -788,20 +798,78 @@ send_using_neigh:
 		spin_lock_irqsave(&priv->lock, flags);
 		__skb_queue_tail(&neigh->queue, skb);
 		spin_unlock_irqrestore(&priv->lock, flags);
-	} else {
-		++dev->stats.tx_dropped;
-		dev_kfree_skb_any(skb);
+		goto unref;
 	}
 
+drop:
+	++send_ring->stats.tx_dropped;
+	dev_kfree_skb_any(skb);
+
 unref:
 	ipoib_neigh_put(neigh);
 
 	return NETDEV_TX_OK;
 }
 
-static u16 ipoib_select_queue_null(struct net_device *dev, struct sk_buff *skb)
+static u16 ipoib_select_queue_hw(struct net_device *dev, struct sk_buff *skb)
 {
-	return 0;
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cb *cb = (struct ipoib_cb *)skb->cb;
+
+	/* (BC/MC), stay on this core */
+	if (unlikely(cb->hwaddr[4] == 0xff))
+		return smp_processor_id() % priv->tss_qp_num;
+
+	/* is CM in use */
+	if (IPOIB_CM_SUPPORTED(cb->hwaddr)) {
+		if (ipoib_cm_admin_enabled(dev)) {
+			/* use remote QP for hash, so we use the same ring */
+			u32 *d32 = (u32 *)cb->hwaddr;
+			u32 hv = jhash_1word(*d32 & cpu_to_be32(0xFFFFFF), 0);
+			return hv % priv->tss_qp_num;
+		} else
+			/* ADMIN CM might become enabled by transmit time,
+			 * and we might transmit on a CM QP not from its
+			 * designated ring */
+			cb->hwaddr[0] &= ~IPOIB_FLAGS_RC;
+	}
+	return skb_tx_hash(dev, skb);
+}
+
+static u16 ipoib_select_queue_sw(struct net_device *dev, struct sk_buff *skb)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_cb *cb = (struct ipoib_cb *)skb->cb;
+	struct ipoib_header *header;
+
+	/* (BC/MC) use designated QDISC -> parent QP */
+	if (unlikely(cb->hwaddr[4] == 0xff))
+		return priv->tss_qp_num;
+
+	/* is CM in use */
+	if (IPOIB_CM_SUPPORTED(cb->hwaddr)) {
+		if (ipoib_cm_admin_enabled(dev)) {
+			/* use remote QP for hash, so we use the same ring */
+			u32 *d32 = (u32 *)cb->hwaddr;
+			u32 hv = jhash_1word(*d32 & cpu_to_be32(0xFFFFFF), 0);
+			return hv % priv->tss_qp_num;
+		} else
+			/* ADMIN CM might become enabled by transmit time,
+			 * and we might transmit on a CM QP not from its
+			 * designated ring */
+			cb->hwaddr[0] &= ~IPOIB_FLAGS_RC;
+	}
+
+	/* Did neighbour advertise TSS support */
+	if (unlikely(!IPOIB_TSS_SUPPORTED(cb->hwaddr)))
+		return priv->tss_qp_num;
+
+	/* We are after ipoib_hard_header so skb->data is O.K. */
+	header = (struct ipoib_header *)skb->data;
+	header->tss_qpn_mask_sz |= priv->tss_qpn_mask_sz;
+
+	/* don't use special ring in TX */
+	return __skb_tx_hash(dev, skb, priv->tss_qp_num);
 }
 
 static void ipoib_timeout(struct net_device *dev)
@@ -874,7 +942,7 @@ static int ipoib_hard_header(struct sk_buff *skb,
 	header = (struct ipoib_header *) skb_push(skb, sizeof *header);
 
 	header->proto = htons(type);
-	header->reserved = 0;
+	header->tss_qpn_mask_sz = 0;
 
 	/*
 	 * we don't rely on dst_entry structure,  always stuff the
@@ -933,7 +1001,8 @@ struct ipoib_neigh *ipoib_neigh_get(struct net_device *dev, u8 *daddr)
 	for (neigh = rcu_dereference_bh(htbl->buckets[hash_val]);
 	     neigh != NULL;
 	     neigh = rcu_dereference_bh(neigh->hnext)) {
-		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+		/* don't use flags for the compare */
+		if (memcmp(daddr+1, neigh->daddr+1, INFINIBAND_ALEN-1) == 0) {
 			/* found, take one ref on behalf of the caller */
 			if (!atomic_inc_not_zero(&neigh->refcnt)) {
 				/* deleted */
@@ -1060,7 +1129,8 @@ struct ipoib_neigh *ipoib_neigh_alloc(u8 *daddr,
 	     neigh != NULL;
 	     neigh = rcu_dereference_protected(neigh->hnext,
 					       lockdep_is_held(&priv->lock))) {
-		if (memcmp(daddr, neigh->daddr, INFINIBAND_ALEN) == 0) {
+		/* don't use flags for the compare */
+		if (memcmp(daddr+1, neigh->daddr+1, INFINIBAND_ALEN-1) == 0) {
 			/* found, take one ref on behalf of the caller */
 			if (!atomic_inc_not_zero(&neigh->refcnt)) {
 				/* deleted */
@@ -1438,25 +1508,52 @@ static const struct header_ops ipoib_header_ops = {
 	.create	= ipoib_hard_header,
 };
 
-static const struct net_device_ops ipoib_netdev_ops = {
+static const struct net_device_ops ipoib_netdev_ops_no_tss = {
 	.ndo_uninit		 = ipoib_uninit,
 	.ndo_open		 = ipoib_open,
 	.ndo_stop		 = ipoib_stop,
 	.ndo_change_mtu		 = ipoib_change_mtu,
 	.ndo_fix_features	 = ipoib_fix_features,
-	.ndo_start_xmit	 	 = ipoib_start_xmit,
-	.ndo_select_queue	 = ipoib_select_queue_null,
+	.ndo_start_xmit		 = ipoib_start_xmit,
 	.ndo_tx_timeout		 = ipoib_timeout,
 	.ndo_get_stats		 = ipoib_get_stats,
 	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
 };
 
+static const struct net_device_ops ipoib_netdev_ops_hw_tss = {
+	.ndo_uninit		 = ipoib_uninit,
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_hw,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+};
+
+static const struct net_device_ops ipoib_netdev_ops_sw_tss = {
+	.ndo_uninit		 = ipoib_uninit,
+	.ndo_open		 = ipoib_open,
+	.ndo_stop		 = ipoib_stop,
+	.ndo_change_mtu		 = ipoib_change_mtu,
+	.ndo_fix_features	 = ipoib_fix_features,
+	.ndo_start_xmit		 = ipoib_start_xmit,
+	.ndo_select_queue	 = ipoib_select_queue_sw,
+	.ndo_tx_timeout		 = ipoib_timeout,
+	.ndo_get_stats		 = ipoib_get_stats,
+	.ndo_set_rx_mode	 = ipoib_set_mcast_list,
+};
+
+static const struct net_device_ops *ipoib_netdev_ops;
+
 void ipoib_setup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
 	/* Use correct ops (ndo_select_queue) */
-	dev->netdev_ops		 = &ipoib_netdev_ops;
+	dev->netdev_ops		 = ipoib_netdev_ops;
 	dev->header_ops		 = &ipoib_header_ops;
 
 	ipoib_set_ethtool_ops(dev);
@@ -1504,6 +1601,16 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 {
 	struct net_device *dev;
 
+	/* Use correct ops (ndo_select_queue) pass to ipoib_setup */
+	if (template_priv->num_tx_queues > 1) {
+		if (template_priv->hca_caps & IB_DEVICE_UD_TSS)
+			ipoib_netdev_ops = &ipoib_netdev_ops_hw_tss;
+		else
+			ipoib_netdev_ops = &ipoib_netdev_ops_sw_tss;
+	} else
+		ipoib_netdev_ops = &ipoib_netdev_ops_no_tss;
+
+
 	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
 			   ipoib_setup,
 			   template_priv->num_tx_queues,
@@ -1617,6 +1724,7 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 				  struct ib_device *hca)
 {
 	struct ib_device_attr *device_attr;
+	int num_cores;
 	int result = -ENOMEM;
 
 	device_attr = kmalloc(sizeof *device_attr, GFP_KERNEL);
@@ -1635,10 +1743,39 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 	}
 	priv->hca_caps = device_attr->device_cap_flags;
 
+	num_cores = num_online_cpus();
+	if (num_cores == 1 || !(priv->hca_caps & IB_DEVICE_QPG)) {
+		/* No additional QP, only one QP for RX & TX */
+		priv->rss_qp_num = 0;
+		priv->tss_qp_num = 0;
+		priv->num_rx_queues = 1;
+		priv->num_tx_queues = 1;
+		kfree(device_attr);
+		return 0;
+	}
+	num_cores = roundup_pow_of_two(num_cores);
+	if (priv->hca_caps & IB_DEVICE_UD_RSS) {
+		int max_rss_tbl_sz;
+		max_rss_tbl_sz = device_attr->max_rss_tbl_sz;
+		max_rss_tbl_sz = min(num_cores, max_rss_tbl_sz);
+		max_rss_tbl_sz = rounddown_pow_of_two(max_rss_tbl_sz);
+		priv->rss_qp_num    = max_rss_tbl_sz;
+		priv->num_rx_queues = max_rss_tbl_sz;
+	} else {
+		/* No additional QP, only the parent QP for RX */
+		priv->rss_qp_num = 0;
+		priv->num_rx_queues = 1;
+	}
+
 	kfree(device_attr);
 
-	priv->num_rx_queues = 1;
-	priv->num_tx_queues = 1;
+	priv->tss_qp_num = num_cores;
+	if (priv->hca_caps & IB_DEVICE_UD_TSS)
+		/* TSS is supported by HW */
+		priv->num_tx_queues = priv->tss_qp_num;
+	else
+		/* If TSS is not supported by HW, use the parent QP for ARP */
+		priv->num_tx_queues = priv->tss_qp_num + 1;
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 4be626f..3917d3c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -35,6 +35,31 @@
 
 #include "ipoib.h"
 
+static int set_qps_qkey(struct ipoib_dev_priv *priv)
+{
+	struct ib_qp_attr *qp_attr;
+	struct ipoib_recv_ring *recv_ring;
+	int ret = -ENOMEM;
+	int i;
+
+	qp_attr = kmalloc(sizeof(*qp_attr), GFP_KERNEL);
+	if (!qp_attr)
+		return -ENOMEM;
+
+	qp_attr->qkey = priv->qkey;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; ++i) {
+		ret = ib_modify_qp(recv_ring->recv_qp, qp_attr, IB_QP_QKEY);
+		if (ret)
+			break;
+		recv_ring++;
+	}
+
+	kfree(qp_attr);
+
+	return ret;
+}
+
 int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int set_qkey)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
@@ -50,18 +75,9 @@ int ipoib_mcast_attach(struct net_device *dev, u16 mlid, union ib_gid *mgid, int
 	set_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 
 	if (set_qkey) {
-		ret = -ENOMEM;
-		qp_attr = kmalloc(sizeof *qp_attr, GFP_KERNEL);
-		if (!qp_attr)
-			goto out;
-
-		/* set correct QKey for QP */
-		qp_attr->qkey = priv->qkey;
-		ret = ib_modify_qp(priv->qp, qp_attr, IB_QP_QKEY);
-		if (ret) {
-			ipoib_warn(priv, "failed to modify QP, ret = %d\n", ret);
+		ret = set_qps_qkey(priv);
+		if (ret)
 			goto out;
-		}
 	}
 
 	/* attach QP to multicast group */
@@ -74,16 +90,13 @@ out:
 	return ret;
 }
 
-int ipoib_init_qp(struct net_device *dev)
+static int ipoib_init_one_qp(struct ipoib_dev_priv *priv, struct ib_qp *qp,
+				int init_attr)
 {
-	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 	struct ib_qp_attr qp_attr;
 	int attr_mask;
 
-	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
-		return -1;
-
 	qp_attr.qp_state = IB_QPS_INIT;
 	qp_attr.qkey = 0;
 	qp_attr.port_num = priv->port;
@@ -92,17 +105,18 @@ int ipoib_init_qp(struct net_device *dev)
 	    IB_QP_QKEY |
 	    IB_QP_PORT |
 	    IB_QP_PKEY_INDEX |
-	    IB_QP_STATE;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	    IB_QP_STATE | init_attr;
+
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
-		ipoib_warn(priv, "failed to modify QP to init, ret = %d\n", ret);
+		ipoib_warn(priv, "failed to modify QP to INT, ret = %d\n", ret);
 		goto out_fail;
 	}
 
 	qp_attr.qp_state = IB_QPS_RTR;
 	/* Can't set this in a INIT->RTR transition */
-	attr_mask &= ~IB_QP_PORT;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	attr_mask &= ~(IB_QP_PORT | init_attr);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTR, ret = %d\n", ret);
 		goto out_fail;
@@ -112,40 +126,417 @@ int ipoib_init_qp(struct net_device *dev)
 	qp_attr.sq_psn = 0;
 	attr_mask |= IB_QP_SQ_PSN;
 	attr_mask &= ~IB_QP_PKEY_INDEX;
-	ret = ib_modify_qp(priv->qp, &qp_attr, attr_mask);
+	ret = ib_modify_qp(qp, &qp_attr, attr_mask);
 	if (ret) {
 		ipoib_warn(priv, "failed to modify QP to RTS, ret = %d\n", ret);
 		goto out_fail;
 	}
 
-	/* Only one ring currently */
-	priv->recv_ring[0].recv_qp = priv->qp;
-	priv->send_ring[0].send_qp = priv->qp;
-
 	return 0;
 
 out_fail:
 	qp_attr.qp_state = IB_QPS_RESET;
-	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+	if (ib_modify_qp(qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
 	return ret;
 }
 
-int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+static int ipoib_init_rss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, recv_ring->recv_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				   "failed to init rss qp, ind = %d, ret=%d\n",
+				   i, ret);
+			goto out_free_reset_qp;
+		}
+		recv_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_init_tss_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ib_qp_attr qp_attr;
+	int i;
+	int ret;
+
+	send_ring = priv->send_ring;
+	/*
+	 * Note: if priv->num_tx_queues > priv->tss_qp_num, the last QP is
+	 * the parent QP and it will be initialized later
+	 */
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		ret = ipoib_init_one_qp(priv, send_ring->send_qp, 0);
+		if (ret) {
+			ipoib_warn(priv,
+				   "failed to init tss qp, ind = %d, ret=%d\n",
+				   i, ret);
+			goto out_free_reset_qp;
+		}
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_reset_qp:
+	for (--i; i >= 0; --i) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+int ipoib_init_qp(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_attr qp_attr;
+	int ret, i, attr;
+
+	if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags)) {
+		ipoib_warn(priv, "PKEY not assigned\n");
+		return -1;
+	}
+
+	/* Init the RSS QPs first */
+	/* If rss_qp_num = 0 then the parent QP is the RX QP */
+	ret = ipoib_init_rss_qps(dev);
+	if (ret)
+		return ret;
+
+	ret = ipoib_init_tss_qps(dev);
+	if (ret)
+		goto out_reset_tss_qp;
+
+	/* Init the parent QP which can be the only QP */
+	attr = priv->rss_qp_num > 0 ? IB_QP_GROUP_RSS : 0;
+	ret = ipoib_init_one_qp(priv, priv->qp, attr);
+	if (ret) {
+		ipoib_warn(priv, "failed to init parent qp, ret=%d\n", ret);
+		goto out_reset_rss_qp;
+	}
+
+	return 0;
+
+out_reset_rss_qp:
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->recv_ring[i].recv_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+out_reset_tss_qp:
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp_attr.qp_state = IB_QPS_RESET;
+		if (ib_modify_qp(priv->send_ring[i].send_qp,
+				 &qp_attr, IB_QP_STATE))
+			ipoib_warn(priv,
+				   "Failed to modify QP to RESET state\n");
+	}
+
+	return ret;
+}
+
+static int ipoib_transport_cq_init(struct net_device *dev,
+							int size)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	struct ipoib_send_ring *send_ring;
+	struct ib_cq *cq;
+	int i, allocated_rx, allocated_tx, req_vec;
+
+	allocated_rx = 0;
+	allocated_tx = 0;
+
+	/* We may have oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		/* Try to spread vectors based on port and ring numbers */
+		cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
+				  recv_ring, size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			pr_warn("%s: failed to create recv CQ\n",
+				priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring->recv_cq = cq;
+		allocated_rx++;
+		req_vec++;
+		if (ib_req_notify_cq(recv_ring->recv_cq, IB_CQ_NEXT_COMP)) {
+			pr_warn("%s: req notify recv CQ\n",
+				priv->ca->name);
+			goto out_free_recv_cqs;
+		}
+		recv_ring++;
+	}
+
+	/* We may have oversubscribed the CPUs; ports start from 1 */
+	req_vec = (priv->port - 1) * roundup_pow_of_two(num_online_cpus());
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		cq = ib_create_cq(priv->ca,
+				  ipoib_send_comp_handler, NULL,
+				  send_ring, ipoib_sendq_size,
+				  req_vec % priv->ca->num_comp_vectors);
+		if (IS_ERR(cq)) {
+			pr_warn("%s: failed to create send CQ\n",
+				priv->ca->name);
+			goto out_free_send_cqs;
+		}
+		send_ring->send_cq = cq;
+		allocated_tx++;
+		req_vec++;
+		send_ring++;
+	}
+
+	return 0;
+
+out_free_send_cqs:
+	for (i = 0; i < allocated_tx; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
+
+out_free_recv_cqs:
+	for (i = 0; i < allocated_rx; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
+
+	return -ENODEV;
+}
+
+static int ipoib_create_parent_qp(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
+		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	if (priv->tss_qp_num == 0 && priv->rss_qp_num == 0)
+		/* Legacy mode */
+		init_attr.qpg_type = IB_QPG_NONE;
+	else {
+		init_attr.qpg_type = IB_QPG_PARENT;
+		init_attr.parent_attrib.tss_child_count = priv->tss_qp_num;
+		init_attr.parent_attrib.rss_child_count = priv->rss_qp_num;
+	}
+
+	/*
+	 * No TSS (tss_qp_num = 0, priv->num_tx_queues == 1),
+	 * or TSS is not supported in HW; in this case the
+	 * parent QP is used for ARP and friends transmission
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num) {
+		init_attr.cap.max_send_wr  = ipoib_sendq_size;
+		init_attr.cap.max_send_sge = 1;
+	}
+
+	/* No RSS: the parent QP will be used for RX */
+	if (priv->rss_qp_num == 0) {
+		init_attr.cap.max_recv_wr  = ipoib_recvq_size;
+		init_attr.cap.max_recv_sge = IPOIB_UD_RX_SG;
+	}
+
+	/* Note that if parent QP is not used for RX/TX then this is harmless */
+	init_attr.recv_cq = priv->recv_ring[0].recv_cq;
+	init_attr.send_cq = priv->send_ring[priv->tss_qp_num].send_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		return -ENODEV;
+	}
+
+	priv->qp = qp;
+
+	/* TSS is not supported in HW or NO TSS (tss_qp_num = 0) */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		priv->send_ring[priv->tss_qp_num].send_qp = qp;
+
+	/* No RSS: the parent QP will be used for RX */
+	if (priv->rss_qp_num == 0)
+		priv->recv_ring[0].recv_qp = qp;
+
+	/* a mask is needed only with SW TSS */
+	if ((priv->hca_caps & IB_DEVICE_UD_TSS) || (priv->tss_qp_num == 0))
+		/* TSS is supported by HW or no TSS at all */
+		priv->tss_qpn_mask_sz = 0;
+	else {
+		/* SW TSS, get mask back from HW, put in the upper nibble */
+		u16 tmp = (u16)init_attr.cap.qpg_tss_mask_sz;
+		priv->tss_qpn_mask_sz = cpu_to_be16((tmp << 12));
+	}
+	return 0;
+}
+
+static struct ib_qp *ipoib_create_tss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_init_attr init_attr = {
 		.cap = {
 			.max_send_wr  = ipoib_sendq_size,
-			.max_recv_wr  = ipoib_recvq_size,
 			.max_send_sge = 1,
+		},
+		.sq_sig_type = IB_SIGNAL_ALL_WR,
+		.qp_type     = IB_QPT_UD
+	};
+	struct ib_qp *qp;
+
+	if (priv->hca_caps & IB_DEVICE_UD_TSO)
+		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
+
+	if (dev->features & NETIF_F_SG)
+		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
+
+	init_attr.qpg_type = IB_QPG_CHILD_TX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.recv_cq = priv->send_ring[ind].send_cq;
+	init_attr.send_cq = init_attr.recv_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create TSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
+
+	return qp;
+}
+
+static struct ib_qp *ipoib_create_rss_qp(struct net_device *dev,
+					 struct ib_device *ca,
+					 int ind)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_qp_init_attr init_attr = {
+		.cap = {
+			.max_recv_wr  = ipoib_recvq_size,
 			.max_recv_sge = IPOIB_UD_RX_SG
 		},
 		.sq_sig_type = IB_SIGNAL_ALL_WR,
 		.qp_type     = IB_QPT_UD
 	};
+	struct ib_qp *qp;
+
+	init_attr.qpg_type = IB_QPG_CHILD_RX;
+	init_attr.qpg_parent = priv->qp;
+
+	init_attr.recv_cq = priv->recv_ring[ind].recv_cq;
+	init_attr.send_cq = init_attr.recv_cq;
+
+	qp = ib_create_qp(priv->pd, &init_attr);
+	if (IS_ERR(qp)) {
+		pr_warn("%s: failed to create RSS QP(%d)\n", ca->name, ind);
+		return qp; /* qp is an error value and will be checked */
+	}
 
+	return qp;
+}
+
+static int ipoib_create_other_qps(struct net_device *dev,
+				  struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	struct ipoib_recv_ring *recv_ring;
+	int i, rss_created, tss_created;
+	struct ib_qp *qp;
+
+	tss_created = 0;
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		qp = ipoib_create_tss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			pr_warn("%s: failed to create QP\n",
+				ca->name);
+			goto out_free_send_qp;
+		}
+		send_ring->send_qp = qp;
+		send_ring++;
+		tss_created++;
+	}
+
+	rss_created = 0;
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		qp = ipoib_create_rss_qp(dev, ca, i);
+		if (IS_ERR(qp)) {
+			pr_warn("%s: failed to create QP\n",
+				ca->name);
+			goto out_free_recv_qp;
+		}
+		recv_ring->recv_qp = qp;
+		recv_ring++;
+		rss_created++;
+	}
+
+	return 0;
+
+out_free_recv_qp:
+	for (i = 0; i < rss_created; i++) {
+		ib_destroy_qp(priv->recv_ring[i].recv_qp);
+		priv->recv_ring[i].recv_qp = NULL;
+	}
+
+out_free_send_qp:
+	for (i = 0; i < tss_created; i++) {
+		ib_destroy_qp(priv->send_ring[i].send_qp);
+		priv->send_ring[i].send_qp = NULL;
+	}
+
+	return -ENODEV;
+}
+
+int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ipoib_send_ring *send_ring;
 	struct ipoib_recv_ring *recv_ring, *first_recv_ring;
 	int ret, size;
@@ -173,49 +564,38 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 			size += ipoib_recvq_size * ipoib_max_conn_qp;
 	}
 
-	priv->recv_cq = ib_create_cq(priv->ca, ipoib_ib_completion, NULL,
-				     priv->recv_ring, size, 0);
-	if (IS_ERR(priv->recv_cq)) {
-		printk(KERN_WARNING "%s: failed to create receive CQ\n", ca->name);
+	/* Create CQ(s) */
+	ret = ipoib_transport_cq_init(dev, size);
+	if (ret) {
+		pr_warn("%s: ipoib_transport_cq_init failed\n", ca->name);
 		goto out_free_mr;
 	}
 
-	priv->send_cq = ib_create_cq(priv->ca, ipoib_send_comp_handler, NULL,
-				     priv->send_ring, ipoib_sendq_size, 0);
-	if (IS_ERR(priv->send_cq)) {
-		printk(KERN_WARNING "%s: failed to create send CQ\n", ca->name);
-		goto out_free_recv_cq;
-	}
-
-	/* Only one ring */
-	priv->recv_ring[0].recv_cq = priv->recv_cq;
-	priv->send_ring[0].send_cq = priv->send_cq;
-
-	if (ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP))
-		goto out_free_send_cq;
-
-	init_attr.send_cq = priv->send_cq;
-	init_attr.recv_cq = priv->recv_cq;
-
-	if (priv->hca_caps & IB_DEVICE_UD_TSO)
-		init_attr.create_flags |= IB_QP_CREATE_IPOIB_UD_LSO;
-
-	if (priv->hca_caps & IB_DEVICE_BLOCK_MULTICAST_LOOPBACK)
-		init_attr.create_flags |= IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;
-
-	if (dev->features & NETIF_F_SG)
-		init_attr.cap.max_send_sge = MAX_SKB_FRAGS + 1;
-
-	priv->qp = ib_create_qp(priv->pd, &init_attr);
-	if (IS_ERR(priv->qp)) {
-		printk(KERN_WARNING "%s: failed to create QP\n", ca->name);
-		goto out_free_send_cq;
+	/* Init the parent QP */
+	ret = ipoib_create_parent_qp(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create parent QP\n", ca->name);
+		goto out_free_cqs;
 	}
 
+	/*
+	* advertise that we are willing to accept frames from a TSS sender;
+	* note that this only indicates that this side is willing to accept
+	* TSS frames, it doesn't imply that it will use TSS, since for
+	* transmission the peer should advertise TSS as well
+	*/
+	priv->dev->dev_addr[0] |= IPOIB_FLAGS_TSS;
 	priv->dev->dev_addr[1] = (priv->qp->qp_num >> 16) & 0xff;
 	priv->dev->dev_addr[2] = (priv->qp->qp_num >>  8) & 0xff;
 	priv->dev->dev_addr[3] = (priv->qp->qp_num      ) & 0xff;
 
+	/* create TSS & RSS QPs */
+	ret = ipoib_create_other_qps(dev, ca);
+	if (ret) {
+		pr_warn("%s: failed to create QP(s)\n", ca->name);
+		goto out_free_parent_qp;
+	}
+
 	send_ring = priv->send_ring;
 	for (j = 0; j < priv->num_tx_queues; j++) {
 		for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
@@ -256,11 +636,20 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 
 	return 0;
 
-out_free_send_cq:
-	ib_destroy_cq(priv->send_cq);
+out_free_parent_qp:
+	ib_destroy_qp(priv->qp);
+	priv->qp = NULL;
+
+out_free_cqs:
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		ib_destroy_cq(priv->recv_ring[i].recv_cq);
+		priv->recv_ring[i].recv_cq = NULL;
+	}
 
-out_free_recv_cq:
-	ib_destroy_cq(priv->recv_cq);
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		ib_destroy_cq(priv->send_ring[i].send_cq);
+		priv->send_ring[i].send_cq = NULL;
+	}
 
 out_free_mr:
 	ib_dereg_mr(priv->mr);
@@ -271,10 +660,101 @@ out_free_pd:
 	return -ENODEV;
 }
 
+static void ipoib_destroy_tx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->tss_qp_num; i++) {
+		if (send_ring->send_qp) {
+			if (ib_destroy_qp(send_ring->send_qp))
+				ipoib_warn(priv, "ib_destroy_qp (send) failed\n");
+			send_ring->send_qp = NULL;
+		}
+		send_ring++;
+	}
+
+	/*
+	 * No support of TSS in HW
+	 * so there is an extra QP but it is freed later
+	 */
+	if (priv->num_tx_queues > priv->tss_qp_num)
+		send_ring->send_qp = NULL;
+}
+
+static void ipoib_destroy_rx_qps(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->rss_qp_num; i++) {
+		if (recv_ring->recv_qp) {
+			if (ib_destroy_qp(recv_ring->recv_qp))
+				ipoib_warn(priv, "ib_destroy_qp (recv) failed\n");
+			recv_ring->recv_qp = NULL;
+		}
+		recv_ring++;
+	}
+}
+
+static void ipoib_destroy_tx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_send_ring *send_ring;
+	int i;
+
+	if (NULL == priv->send_ring)
+		return;
+
+	send_ring = priv->send_ring;
+	for (i = 0; i < priv->num_tx_queues; i++) {
+		if (send_ring->send_cq) {
+			if (ib_destroy_cq(send_ring->send_cq))
+				ipoib_warn(priv, "ib_destroy_cq (send) failed\n");
+			send_ring->send_cq = NULL;
+		}
+		send_ring++;
+	}
+}
+
+static void ipoib_destroy_rx_cqs(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ipoib_recv_ring *recv_ring;
+	int i;
+
+	if (NULL == priv->recv_ring)
+		return;
+
+	recv_ring = priv->recv_ring;
+	for (i = 0; i < priv->num_rx_queues; i++) {
+		if (recv_ring->recv_cq) {
+			if (ib_destroy_cq(recv_ring->recv_cq))
+				ipoib_warn(priv, "ib_destroy_cq (recv) failed\n");
+			recv_ring->recv_cq = NULL;
+		}
+		recv_ring++;
+	}
+}
+
 void ipoib_transport_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	ipoib_destroy_rx_qps(dev);
+	ipoib_destroy_tx_qps(dev);
+
+	/* Destroy parent or only QP */
 	if (priv->qp) {
 		if (ib_destroy_qp(priv->qp))
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
@@ -283,11 +763,8 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 		clear_bit(IPOIB_PKEY_ASSIGNED, &priv->flags);
 	}
 
-	if (ib_destroy_cq(priv->send_cq))
-		ipoib_warn(priv, "ib_cq_destroy (send) failed\n");
-
-	if (ib_destroy_cq(priv->recv_cq))
-		ipoib_warn(priv, "ib_cq_destroy (recv) failed\n");
+	ipoib_destroy_rx_cqs(dev);
+	ipoib_destroy_tx_cqs(dev);
 
 	ipoib_cm_dev_cleanup(dev);
 
-- 
1.7.1


* [PATCH V2 for-next 6/6] IB/ipoib: Support changing the number of RX/TX rings with ethtool
       [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
                     ` (4 preceding siblings ...)
  2013-02-05 15:48   ` [PATCH V2 for-next 5/6] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
@ 2013-02-05 15:48   ` Or Gerlitz
  5 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-05 15:48 UTC (permalink / raw)
  To: roland-DgEjT+Ai2ygdnm+yROfE0A, sean.hefty-ral2JQCrhuEAvxtiuMwx3w
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Or Gerlitz

From: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

The number of RX/TX rings can now be retrieved or changed using the
ethtool ETHTOOL_{G/S}CHANNELS directives to get/set the number of channels.

Added ipoib_reinit() which releases all the rings and their associated
resources, and immediately afterwards allocates them again according
to the new number of rings. To that end, moved code which is common to
device cleanup and device reinit from the device cleanup flow to a routine
which is called in both cases.

In some flows, the ndo_get_stats entry (which now reads the per-ring
statistics of an ipoib netdevice) is called by the core networking
code without rtnl locking. To protect against such a call being made
in parallel with an ethtool call that changes the number of rings,
an rwsem protecting the rings was added.
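
For a quick sense of the ring-count rules the new set_channels handler
enforces, here is a standalone sketch (stand-in parameters, not driver code)
that mirrors the validation checks added below in ipoib_set_channels().

#include <stdbool.h>
#include <stdio.h>

static bool is_pow2(unsigned int n)
{
	return n && !(n & (n - 1));
}

/* Mirror of the tx/rx ring-count rules:
 * - rx_count must be a power of two in 1..max_rx
 * - tx_count of 1 is always allowed (TSS disabled)
 * - with HW TSS, tx_count must be a power of two
 * - with SW TSS, tx_count must be 2^N + 1 (the extra ring maps to the
 *   parent QP), and 2 is rejected since it makes no sense
 */
static bool rings_valid(unsigned int rx, unsigned int tx,
			unsigned int max_rx, unsigned int max_tx, bool hw_tss)
{
	if (rx == 0 || rx > max_rx || !is_pow2(rx))
		return false;
	if (tx == 0 || tx > max_tx)
		return false;
	if (tx == 1)
		return true;
	if (hw_tss)
		return is_pow2(tx);
	return tx != 2 && is_pow2(tx - 1);
}

int main(void)
{
	printf("%d\n", rings_valid(4, 5, 8, 9, false));	/* SW TSS: valid */
	printf("%d\n", rings_valid(4, 2, 8, 9, false));	/* rejected */
	return 0;
}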

Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
---
 drivers/infiniband/ulp/ipoib/ipoib.h         |    9 ++-
 drivers/infiniband/ulp/ipoib/ipoib_ethtool.c |   68 +++++++++++++
 drivers/infiniband/ulp/ipoib/ipoib_ib.c      |    4 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c    |  133 ++++++++++++++++++++++----
 4 files changed, 192 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 87004e2..8df6ee1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -416,8 +416,11 @@ struct ipoib_dev_priv {
 	struct ipoib_send_ring *send_ring;
 	unsigned int rss_qp_num; /* No RSS HW support 0 */
 	unsigned int tss_qp_num; /* No TSS (HW or SW) used 0 */
-	unsigned int num_rx_queues; /* No RSS HW support 1 */
-	unsigned int num_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	unsigned int max_rx_queues; /* No RSS HW support 1 */
+	unsigned int max_tx_queues; /* No TSS HW support tss_qp_num + 1 */
+	unsigned int num_rx_queues; /* Actual */
+	unsigned int num_tx_queues; /* Actual */
+	struct rw_semaphore rings_rwsem;
 	__be16 tss_qpn_mask_sz; /* Put in ipoib header reserved */
 };
 
@@ -526,6 +529,8 @@ int ipoib_ib_dev_stop(struct net_device *dev, int flush);
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
 void ipoib_dev_cleanup(struct net_device *dev);
 
+int ipoib_reinit(struct net_device *dev, int num_rx, int num_tx);
+
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
 void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
index f2cc283..d3e0533 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ethtool.c
@@ -155,6 +155,72 @@ static void ipoib_get_ethtool_stats(struct net_device *dev,
 	}
 }
 
+static void ipoib_get_channels(struct net_device *dev,
+			struct ethtool_channels *channel)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	channel->max_rx = priv->max_rx_queues;
+	channel->max_tx = priv->max_tx_queues;
+	channel->max_other = 0;
+	channel->max_combined = priv->max_rx_queues +
+				priv->max_tx_queues;
+	channel->rx_count = priv->num_rx_queues;
+	channel->tx_count = priv->num_tx_queues;
+	channel->other_count = 0;
+	channel->combined_count = priv->num_rx_queues +
+				priv->num_tx_queues;
+}
+
+static int ipoib_set_channels(struct net_device *dev,
+			struct ethtool_channels *channel)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	if (channel->other_count)
+		return -EINVAL;
+
+	if (channel->combined_count !=
+		priv->num_rx_queues + priv->num_tx_queues)
+		return -EINVAL;
+
+	if (channel->rx_count == 0 ||
+	    channel->rx_count > priv->max_rx_queues)
+		return -EINVAL;
+
+	if (!is_power_of_2(channel->rx_count))
+		return -EINVAL;
+
+	if (channel->tx_count  == 0 ||
+	    channel->tx_count > priv->max_tx_queues)
+		return -EINVAL;
+
+	/* Nothing to do ? */
+	if (channel->rx_count == priv->num_rx_queues &&
+	    channel->tx_count == priv->num_tx_queues)
+		return 0;
+
+	/* 1 is always O.K. */
+	if (channel->tx_count > 1) {
+		if (priv->hca_caps & IB_DEVICE_UD_TSS) {
+			/* with HW TSS tx_count is 2^N */
+			if (!is_power_of_2(channel->tx_count))
+				return -EINVAL;
+		} else {
+			/*
+			* with SW TSS tx_count = 1 + 2^N;
+			* 2 is not allowed, it makes no sense.
+			* to disable TSS use 1.
+			*/
+			if (!is_power_of_2(channel->tx_count - 1) ||
+			    channel->tx_count == 2)
+				return -EINVAL;
+		}
+	}
+
+	return ipoib_reinit(dev, channel->rx_count, channel->tx_count);
+}
+
 static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_drvinfo		= ipoib_get_drvinfo,
 	.get_coalesce		= ipoib_get_coalesce,
@@ -162,6 +228,8 @@ static const struct ethtool_ops ipoib_ethtool_ops = {
 	.get_strings		= ipoib_get_strings,
 	.get_sset_count		= ipoib_get_sset_count,
 	.get_ethtool_stats	= ipoib_get_ethtool_stats,
+	.get_channels		= ipoib_get_channels,
+	.set_channels		= ipoib_set_channels,
 };
 
 void ipoib_set_ethtool_ops(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 01ce5e9..fa4958c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -736,8 +736,10 @@ static void ipoib_napi_disable(struct net_device *dev)
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int i;
 
-	for (i = 0; i < priv->num_rx_queues; i++)
+	for (i = 0; i < priv->num_rx_queues; i++) {
 		napi_disable(&priv->recv_ring[i].napi);
+		netif_napi_del(&priv->recv_ring[i].napi);
+	}
 }
 
 int ipoib_ib_dev_open(struct net_device *dev)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cd9df99..85cf641 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -900,6 +900,10 @@ static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 	struct net_device_stats local_stats;
 	int i;
 
+	/* if rings are not ready yet return last values */
+	if (!down_read_trylock(&priv->rings_rwsem))
+		return stats;
+
 	memset(&local_stats, 0, sizeof(struct net_device_stats));
 
 	for (i = 0; i < priv->num_rx_queues; i++) {
@@ -918,6 +922,8 @@ static struct net_device_stats *ipoib_get_stats(struct net_device *dev)
 		local_stats.tx_dropped += tstats->tx_dropped;
 	}
 
+	up_read(&priv->rings_rwsem);
+
 	stats->rx_packets = local_stats.rx_packets;
 	stats->rx_bytes   = local_stats.rx_bytes;
 	stats->rx_errors  = local_stats.rx_errors;
@@ -1448,6 +1454,8 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_send_ring_cleanup;
 
+	/* access to rings allowed */
+	up_write(&priv->rings_rwsem);
 
 	return 0;
 
@@ -1468,10 +1476,36 @@ out:
 	return -ENOMEM;
 }
 
+static void ipoib_dev_uninit(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int i;
+
+	ASSERT_RTNL();
+
+	ipoib_ib_dev_cleanup(dev);
+
+	/* no more access to rings */
+	down_write(&priv->rings_rwsem);
+
+	for (i = 0; i < priv->num_tx_queues; i++)
+		vfree(priv->send_ring[i].tx_ring);
+	kfree(priv->send_ring);
+
+	for (i = 0; i < priv->num_rx_queues; i++)
+		kfree(priv->recv_ring[i].rx_ring);
+	kfree(priv->recv_ring);
+
+	priv->recv_ring = NULL;
+	priv->send_ring = NULL;
+
+	ipoib_neigh_hash_uninit(dev);
+}
+
 void ipoib_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev), *cpriv, *tcpriv;
-	int i;
+
 	LIST_HEAD(head);
 
 	ASSERT_RTNL();
@@ -1485,23 +1519,71 @@ void ipoib_dev_cleanup(struct net_device *dev)
 		cancel_delayed_work(&cpriv->neigh_reap_task);
 		unregister_netdevice_queue(cpriv->dev, &head);
 	}
+
 	unregister_netdevice_many(&head);
 
-	ipoib_ib_dev_cleanup(dev);
+	ipoib_dev_uninit(dev);
 
+	/* ipoib_dev_uninit took the rings lock but can't release it when
+	 * called from ipoib_reinit; for the cleanup flow, release it here
+	 */
+	up_write(&priv->rings_rwsem);
+}
 
-	for (i = 0; i < priv->num_tx_queues; i++)
-		vfree(priv->send_ring[i].tx_ring);
-	kfree(priv->send_ring);
+int ipoib_reinit(struct net_device *dev, int num_rx, int num_tx)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	int flags;
+	int ret;
 
-	for (i = 0; i < priv->num_rx_queues; i++)
-		kfree(priv->recv_ring[i].rx_ring);
-	kfree(priv->recv_ring);
+	flags = dev->flags;
+	dev_close(dev);
 
-	priv->recv_ring = NULL;
-	priv->send_ring = NULL;
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags))
+		ib_unregister_event_handler(&priv->event_handler);
 
-	ipoib_neigh_hash_uninit(dev);
+	ipoib_dev_uninit(dev);
+
+	priv->num_rx_queues = num_rx;
+	priv->num_tx_queues = num_tx;
+	if (num_rx == 1)
+		priv->rss_qp_num = 0;
+	else
+		priv->rss_qp_num = num_rx;
+	if (num_tx == 1 || !(priv->hca_caps & IB_DEVICE_UD_TSS))
+		priv->tss_qp_num = num_tx - 1;
+	else
+		priv->tss_qp_num = num_tx;
+
+	netif_set_real_num_tx_queues(dev, num_tx);
+	netif_set_real_num_rx_queues(dev, num_rx);
+
+	/* prevent ipoib_ib_dev_init from calling ipoib_ib_dev_open,
+	 * let ipoib_open do it
+	 */
+	dev->flags &= ~IFF_UP;
+	ret = ipoib_dev_init(dev, priv->ca, priv->port);
+	if (ret) {
+		pr_warn("%s: failed to reinitialize port %d (ret = %d)\n",
+			priv->ca->name, priv->port, ret);
+		return ret;
+	}
+
+	if (!test_bit(IPOIB_FLAG_SUBINTERFACE, &priv->flags)) {
+		ret = ib_register_event_handler(&priv->event_handler);
+		if (ret)
+			pr_warn("%s: failed to rereg port %d (ret = %d)\n",
+				priv->ca->name, priv->port, ret);
+	}
+
+	/* if the device was up bring it up again */
+	if (flags & IFF_UP) {
+		ret = dev_open(dev);
+		if (ret)
+			pr_warn("%s: failed to reopen port %d (ret = %d)\n",
+				priv->ca->name, priv->port, ret);
+	}
+	return ret;
 }
 
 static const struct header_ops ipoib_header_ops = {
@@ -1580,6 +1662,10 @@ void ipoib_setup(struct net_device *dev)
 
 	mutex_init(&priv->vlan_mutex);
 
+	init_rwsem(&priv->rings_rwsem);
+	/* read access to rings is disabled */
+	down_write(&priv->rings_rwsem);
+
 	INIT_LIST_HEAD(&priv->path_list);
 	INIT_LIST_HEAD(&priv->child_intfs);
 	INIT_LIST_HEAD(&priv->dead_ahs);
@@ -1601,8 +1687,12 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 {
 	struct net_device *dev;
 
-	/* Use correct ops (ndo_select_queue) pass to ipoib_setup */
-	if (template_priv->num_tx_queues > 1) {
+	/* Use correct ops (ndo_select_queue) pass to ipoib_setup
+	 * A child interface starts with the same number of queues as the
+	 * parent. Even if the parent currently has only one ring, the MQ
+	 * potential must be reserved.
+	 */
+	if (template_priv->max_tx_queues > 1) {
 		if (template_priv->hca_caps & IB_DEVICE_UD_TSS)
 			ipoib_netdev_ops = &ipoib_netdev_ops_hw_tss;
 		else
@@ -1613,8 +1703,8 @@ struct ipoib_dev_priv *ipoib_intf_alloc(const char *name,
 
 	dev = alloc_netdev_mqs((int) sizeof(struct ipoib_dev_priv), name,
 			   ipoib_setup,
-			   template_priv->num_tx_queues,
-			   template_priv->num_rx_queues);
+			   template_priv->max_tx_queues,
+			   template_priv->max_rx_queues);
 	if (!dev)
 		return NULL;
 
@@ -1748,6 +1838,8 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 		/* No additional QP, only one QP for RX & TX */
 		priv->rss_qp_num = 0;
 		priv->tss_qp_num = 0;
+		priv->max_rx_queues = 1;
+		priv->max_tx_queues = 1;
 		priv->num_rx_queues = 1;
 		priv->num_tx_queues = 1;
 		kfree(device_attr);
@@ -1760,22 +1852,25 @@ static int ipoib_get_hca_features(struct ipoib_dev_priv *priv,
 		max_rss_tbl_sz = min(num_cores, max_rss_tbl_sz);
 		max_rss_tbl_sz = rounddown_pow_of_two(max_rss_tbl_sz);
 		priv->rss_qp_num    = max_rss_tbl_sz;
-		priv->num_rx_queues = max_rss_tbl_sz;
+		priv->max_rx_queues = max_rss_tbl_sz;
 	} else {
 		/* No additional QP, only the parent QP for RX */
 		priv->rss_qp_num = 0;
-		priv->num_rx_queues = 1;
+		priv->max_rx_queues = 1;
 	}
+	priv->num_rx_queues = priv->max_rx_queues;
 
 	kfree(device_attr);
 
 	priv->tss_qp_num = num_cores;
 	if (priv->hca_caps & IB_DEVICE_UD_TSS)
 		/* TSS is supported by HW */
-		priv->num_tx_queues = priv->tss_qp_num;
+		priv->max_tx_queues = priv->tss_qp_num;
 	else
 		/* If TSS is not supported by HW, use the parent QP for ARP */
-		priv->num_tx_queues = priv->tss_qp_num + 1;
+		priv->max_tx_queues = priv->tss_qp_num + 1;
+
+	priv->num_tx_queues = priv->max_tx_queues;
 
 	return 0;
 }
-- 
1.7.1


* RE: [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found]     ` <1360079337-8173-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-02-11 19:46       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99DDC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hefty, Sean @ 2013-02-11 19:46 UTC (permalink / raw)
  To: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz

> The hash function introduced in commit b63b70d877 "IPoIB: Use a private hash
> table for path lookup in xmit path" was designd to use the 3 octets of the
> IPoIB HW address that holds the remote QPN. However, this currently isn't
> the case under little endian machines as the code there uses the flags part
> (octet[0]) and not the last octet of the QPN (octet[3]), fix that.
> 
> The fix caused a checkpatch warning on line over 80 characters, to
> solve that changed the name of the temp variable that holds the daddr.
> 
> Signed-off-by: Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c |    4 ++--
>  1 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 6fdc9e7..e459fa7 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -844,10 +844,10 @@ static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl,
> u8 *daddr)
>  	 * different subnets.
>  	 */
>  	 /* qpn octets[1:4) & port GUID octets[12:20) */
> -	u32 *daddr_32 = (u32 *) daddr;
> +	u32 *d32 = (u32 *)daddr;
>  	u32 hv;
> 
> -	hv = jhash_3words(daddr_32[3], daddr_32[4], 0xFFFFFF & daddr_32[0], 0);
> +	hv = jhash_3words(d32[3], d32[4], cpu_to_be32(0xFFFFFF) & d32[0], 0);

Should d32 be declared as __be32 *?

* RE: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]     ` <1360079337-8173-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-02-11 20:42       ` Hefty, Sean
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99E0B-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hefty, Sean @ 2013-02-11 20:42 UTC (permalink / raw)
  To: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz

> RSS (Receive Side Scaling) TSS (Transmit Side Scaling, better known as
> MQ/Multi-Queue) are common networking techniques which allow to use
> contemporary NICs that support multiple receive and transmit descriptor
> queues (multi-queue), see also Documentation/networking/scaling.txt

If TSS is better known as MQ, then why not use that term instead?

> - qp group type attribute for qp creation saying whether this is a parent QP
> or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.

Can we either define this as a new QP type or some QP creation flag, so that every user who wants to create a QP doesn't need to figure out what a QP group is and if their QP needs to be part of one?

Then you wouldn't need to define IB_QPG_NONE.
 
> - per qp group type, another attribute is added, for parent QPs, the number
> of rx/tx child QPs and for child QPs pointer to the parent.

If I understand the interface correctly, the user calls ib_create_qp() to create a parent QP and reserve space for all of the children.  They then call ib_create_qp() to allocate the children.  Is this correct?

What restrictions does a child QP have based on the parent?  E.g. same PD, CQ, QP size <= parent, number SGEs <= parent, destroyed with parent, etc.  And how independent is a child QP?  E.g. joins own multicast groups, different CQs, transitions states independently, etc.

It's not clear to me if using the existing interfaces are the best approach, if MQ is best handled as different QPs, if MQ is better abstracted as a 'QP' that has multiple send/receive queues, if MQ should just be completely hidden beneath verbs, or what.

The XRC model of creating the parent and using open to associated related QPs still seems more appropriate, but it depends on how independent the parent and child QPs are.  We don't have a spec (formal or informal) that defines how the verbs function with these new QP types, which makes reviewing these changes difficult.

- Sean

* Re: [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99DDC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-02-12 14:47           ` Shlomo Pongratz
       [not found]             ` <511A560D.8020900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Shlomo Pongratz @ 2013-02-12 14:47 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w

On 2/11/2013 9:46 PM, Hefty, Sean wrote:
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> >+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> >@@ -844,10 +844,10 @@ static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl,
>> >u8 *daddr)
>> >  	 * different subnets.
>> >  	 */
>> >  	 /* qpn octets[1:4) & port GUID octets[12:20) */
>> >-	u32 *daddr_32 = (u32 *) daddr;
>> >+	u32 *d32 = (u32 *)daddr;
>> >  	u32 hv;
>> >
>> >-	hv = jhash_3words(daddr_32[3], daddr_32[4], 0xFFFFFF & daddr_32[0], 0);
>> >+	hv = jhash_3words(d32[3], d32[4], cpu_to_be32(0xFFFFFF) & d32[0], 0);
> Should d32 be declared as __be32 *?
Hi Sean,

The IPoIB destination address is indeed in big-endian format, and normally 
the pointer to it should be of type __be32.
However, in this case I just want to feed it into the hash function 
without the flags part.
Defining d32 as __be32 * will make the code a bit ugly, as I'll need to 
cast three of the jhash_3words function's arguments.
That is,

__be32 *d32;
....

hv = jhash_3words((__force u32) d32[3], (__force u32) d32[4],
		  (__force u32)(cpu_to_be32(0xFFFFFF) & d32[0]), 0);
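
For completeness, the sparse-clean variant of the helper would look roughly 
like this (a sketch only; the surrounding lines and the final masking are 
assumed from the existing ipoib_addr_hash, not taken from the submitted patch):

static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl, u8 *daddr)
{
	/* qpn octets[1:4) & port GUID octets[12:20) */
	__be32 *d32 = (__be32 *)daddr;
	u32 hv;

	hv = jhash_3words((__force u32)d32[3], (__force u32)d32[4],
			  (__force u32)(cpu_to_be32(0xFFFFFF) & d32[0]), 0);
	return hv & htbl->mask;
}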


Best regards,

S.P.

* Re: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99E0B-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-02-12 15:27           ` Or Gerlitz
  2013-02-12 16:39           ` Or Gerlitz
  2013-02-12 16:46           ` Or Gerlitz
  2 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-12 15:27 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	erezsh-VPRAkNaXOzVWk0Htik3J/w, Shlomo Pongratz, Tzahi Oved

On 11/02/2013 22:42, Hefty, Sean wrote:
>> RSS (Receive Side Scaling) TSS (Transmit Side Scaling, better known as
>> MQ/Multi-Queue) are common networking techniques which allow to use
>> contemporary NICs that support multiple receive and transmit descriptor
>> queues (multi-queue), see also Documentation/networking/scaling.txt
> If TSS is better known as MQ, then why not use that term instead?

Well, maybe saying that TSS is better known as MQ was too definitive in 
that context. Linux now supports multi-queue networking drivers on both 
the RX and TX sides, e.g. through the in-kernel netdev APIs and, for user 
space, through ethtool and friends that let you configure the number of 
RX/TX rings and so on. RSS is a well-known term, and we found TSS to be a 
good fit here as the TX equivalent of RSS.
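
Just to make the ethtool angle concrete, this is the general shape of the 
get/set channels hooks a multi-queue netdev driver exposes (an illustrative 
sketch only; the foo_* names, the private struct and the limits are made up 
here and are not taken from the IPoIB patches):

#include <linux/ethtool.h>
#include <linux/netdevice.h>

#define FOO_MAX_RINGS	16

/* made-up driver-private state, just for the sketch */
struct foo_priv {
	unsigned int num_rx_rings;
	unsigned int num_tx_rings;
};

static void foo_get_channels(struct net_device *dev,
			     struct ethtool_channels *ch)
{
	struct foo_priv *priv = netdev_priv(dev);

	ch->max_rx = FOO_MAX_RINGS;
	ch->max_tx = FOO_MAX_RINGS;
	ch->rx_count = priv->num_rx_rings;
	ch->tx_count = priv->num_tx_rings;
}

static int foo_set_channels(struct net_device *dev,
			    struct ethtool_channels *ch)
{
	struct foo_priv *priv = netdev_priv(dev);

	if (!ch->rx_count || !ch->tx_count ||
	    ch->rx_count > FOO_MAX_RINGS || ch->tx_count > FOO_MAX_RINGS)
		return -EINVAL;

	/* a real driver would quiesce and re-initialize its rings here */
	priv->num_rx_rings = ch->rx_count;
	priv->num_tx_rings = ch->tx_count;
	return 0;
}

static const struct ethtool_ops foo_ethtool_ops = {
	.get_channels	= foo_get_channels,
	.set_channels	= foo_set_channels,
};

With hooks like these in place, "ethtool -l <dev>" reports the ring counts 
and "ethtool -L <dev> rx N tx N" changes them.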


>
>> - qp group type attribute for qp creation saying whether this is a parent QP
>> or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.
> Can we either define this as a new QP type or some QP creation flag, so that every user who wants to create a QP doesn't need to figure out what a QP group is and if their QP needs to be part of one? Then you wouldn't need to define IB_QPG_NONE.

Understood. Basically I don't see why this change can't be done. In a 
response to an earlier posting of these patches, Tzahi Oved from Mellanox 
wrote on the same matter: "Reg ib_qp_init_attr and ib_qp_type, since 
RSS/TSS child/parent attributes can be defined for multiple QP types 
(today IB_QPT_UD and IB_QPT_RAW_PACKET), we believe it is cleaner to 
have another attribute of ib_qpg_type." 
http://marc.info/?l=linux-rdma&m=134486836225450&w=2 Anyway, this isn't 
the hard-core part of the suggested changes, so we should be able to 
nail it down this way (the current proposal) or another (yours).
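
For reference, the group-type attribute in the current proposal looks 
roughly like the following (paraphrased from patch 2/6; the exact field 
names may differ from the posted code):

/* paraphrased from the proposed patch; exact names may differ */
enum ib_qpg_type {
	IB_QPG_NONE,		/* ordinary QP, not part of any group */
	IB_QPG_PARENT,		/* owns a TSS/RSS group */
	IB_QPG_CHILD_RX,	/* RSS child */
	IB_QPG_CHILD_TX,	/* TSS child */
};

struct ib_qp_init_attr {
	/* ... existing fields (send_cq, recv_cq, cap, qp_type, ...) ... */
	enum ib_qpg_type qpg_type;
	union {
		struct ib_qp *qpg_parent;	/* for child QPs */
		struct {
			u32 rss_child_count;
			u32 tss_child_count;
		} parent_attrib;		/* for parent QPs */
	};
};

Sean's suggestion above amounts to folding the parent/child roles into 
either the QP type or the creation flags, so that ordinary QP creators 
never see IB_QPG_NONE at all.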

Or.

* RE: [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found]             ` <511A560D.8020900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2013-02-12 16:33               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A045-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-02-12 20:35               ` Jason Gunthorpe
  1 sibling, 1 reply; 18+ messages in thread
From: Hefty, Sean @ 2013-02-12 16:33 UTC (permalink / raw)
  To: Shlomo Pongratz
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w

> On 2/11/2013 9:46 PM, Hefty, Sean wrote:
> >> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> >> >+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> >> >@@ -844,10 +844,10 @@ static u32 ipoib_addr_hash(struct ipoib_neigh_hash
> *htbl,
> >> >u8 *daddr)
> >> >  	 * different subnets.
> >> >  	 */
> >> >  	 /* qpn octets[1:4) & port GUID octets[12:20) */
> >> >-	u32 *daddr_32 = (u32 *) daddr;
> >> >+	u32 *d32 = (u32 *)daddr;
> >> >  	u32 hv;
> >> >
> >> >-	hv = jhash_3words(daddr_32[3], daddr_32[4], 0xFFFFFF & daddr_32[0], 0);
> >> >+	hv = jhash_3words(d32[3], d32[4], cpu_to_be32(0xFFFFFF) & d32[0], 0);
> > Should d32 be declared as __be32 *?
> Hi Sean,
> 
> The IPoIB destination address is indeed in big endian format and
> normally the pointer to it should be of type __be32.
> However in this case I just want to feed it into the hash function
> without the flags part.
> defining d32 as __be32* will make the code a bit ugly as I'll need to
> cast 3 of "jhash_3words" functions arguments.
> That is,
> 
> __be32 *d32;
> ....
> 
> hv = jhash_3words((__force u32) d32[3], (__force u32) d32[4], (__force
> u32)(cpu_to_be32(0xFFFFFF) & d32[0]), 0);

Have you run the V2 patch through sparse?

* Re: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99E0B-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-02-12 15:27           ` Or Gerlitz
@ 2013-02-12 16:39           ` Or Gerlitz
  2013-02-12 16:46           ` Or Gerlitz
  2 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-12 16:39 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: roland-DgEjT+Ai2ygdnm+yROfE0A, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	erezsh-VPRAkNaXOzVWk0Htik3J/w, Shlomo Pongratz, Tzahi Oved

On 11/02/2013 22:42, Hefty, Sean wrote:
> or some QP creation flag, so that every user who wants to create a QP doesn't need to figure out what a QP group is and if their QP needs to be part of one?
>
> Then you wouldn't need to define IB_QPG_NONE.
>   
Another point in favor of using a separate/dedicated field for the QPG 
creation flags is that, unlike the other QP creation flags, they aren't 
combined with a logical OR but are mutually exclusive, that is, a QP 
can't be both parent and child.
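
A minimal illustration of the difference (the create flags below are the 
existing ib_verbs ones; the qpg values follow the proposed series):

struct ib_qp_init_attr attr = {};

/* ordinary creation flags combine freely with a bitwise OR ... */
attr.create_flags = IB_QP_CREATE_IPOIB_UD_LSO |
		    IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK;

/* ... whereas a QP's group role is exactly one value, never OR'ed */
attr.qpg_type = IB_QPG_PARENT;	/* or IB_QPG_CHILD_RX / IB_QPG_CHILD_TX / IB_QPG_NONE */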

Or.

* Re: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99E0B-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-02-12 15:27           ` Or Gerlitz
  2013-02-12 16:39           ` Or Gerlitz
@ 2013-02-12 16:46           ` Or Gerlitz
       [not found]             ` <CAJZOPZ+eT=UGfqbwyMn8BtKCei2t1RKj1auAhSbPphLF9A6eVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 1 reply; 18+ messages in thread
From: Or Gerlitz @ 2013-02-12 16:46 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Tzahi Oved

Resending for the 3rd time, now from a different address, as the mail
server rejected both previous postings...

On Mon, Feb 11, 2013 at 10:42 PM, Hefty, Sean <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> wrote:

> If I understand the interface correctly, the user calls ib_create_qp() to
> create a parent QP and reserve space for all of the children.  They then
> call ib_create_qp() to allocate the children.  Is this correct?

YES


> What restrictions does a child QP have based on the parent?  E.g. same PD,
> CQ, QP size <= parent, number SGEs <= parent, destroyed with parent, etc.
> And how independent is a child QP?  E.g. joins own multicast groups,
> different CQs, transitions states independently, etc.

These parent/child QPs are supported for the UD and RAW_PACKET QP types,
and of course child and parent have to be of the same QP type.
A good (probably best) practice would be for the parent and child QPs
to use the same PD (and QKEY for UD QPs), and it makes sense that we
enforce that at the core level.  Using the same CQ isn't required, nor
is similarity in the QP ring/SGE sizes or any other attribute.

Currently it's the HW driver's role to dictate the QP number for created
QPs, an assumption the architecture/design for the QP groups builds on.

Re RSS child QPs: under the proposed API, HW drivers are likely to
reserve a consecutive range of QPNs for the children. To meet certain HW
requirements, the driver may round up the number of requested RSS
children to a power of two, and the consumer is expected to actually
open the rounded-up number of QPs. This might have fallen between the
cracks in the submitted code; we need to see how to nail that corner.
Other than that, no limitations.

Re TSS child QPs: under the proposed API, HW drivers are likely to
reserve a consecutive range of QPNs for the children, unless they support
the IB_DEVICE_UD_TSS capability, which is set to indicate that the
device supports "HW TSS", i.e. the HW is capable of overriding the
source UD QPN present in the sent IB datagram header (DETH) with the
parent's QPN.  This is irrelevant for RAW_PACKET Ethernet QPs. Other
than that, no limitations.
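
Putting the two-step description together with the above, the intended
usage is roughly the following (a sketch based on the proposed API; field
names may not be exact, pd and cq are assumed to have been created earlier,
and error handling is omitted):

struct ib_qp_init_attr attr = {
	.qp_type	= IB_QPT_UD,
	.send_cq	= cq,	/* per-ring CQs are also allowed */
	.recv_cq	= cq,
	/* .cap, .sq_sig_type, ... as usual */
};
struct ib_qp *parent, *rx_qp[4], *tx_qp[4];
int i;

/* step 1: create the parent and reserve room for its children */
attr.qpg_type = IB_QPG_PARENT;
attr.parent_attrib.rss_child_count = 4;
attr.parent_attrib.tss_child_count = 4;
parent = ib_create_qp(pd, &attr);

/* step 2: create the children, each pointing back at the parent */
attr.qpg_type = IB_QPG_CHILD_RX;
attr.qpg_parent = parent;
for (i = 0; i < 4; i++)
	rx_qp[i] = ib_create_qp(pd, &attr);

attr.qpg_type = IB_QPG_CHILD_TX;
for (i = 0; i < 4; i++)
	tx_qp[i] = ib_create_qp(pd, &attr);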


> It's not clear to me if using the existing interfaces are the best
> approach, if MQ is best handled as different QPs, if MQ is better abstracted
> as a 'QP' that has multiple send/receive queues, if MQ should just be
> completely hidden beneath verbs, or what.

We found it to be handled well as different QPs...


> The XRC model of creating the parent and using open to associated related
> QPs still seems more appropriate, but it depends on how independent the
> parent and child QPs are.  We don't have a spec (formal or informal) that
> defines how the verbs function with these new QP types, which makes
> reviewing these changes difficult.

As I explained above, parent and child QPs are pretty much independent.
Re the XRC model, Tzahi provided a detailed answer on why we didn't find
it a good fit here, and if I understand correctly you accepted his
arguments... see again here
http://marc.info/?l=linux-rdma&m=134486836225450&w=2 and the whole thread
here http://marc.info/?t=133881099000001&r=1&w=2

* Re: [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A045-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-02-12 16:53                   ` Or Gerlitz
  0 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-12 16:53 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Shlomo Pongratz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w

On 12/02/2013 18:33, Hefty, Sean wrote:
> Have you run the V2 patch through sparse?

Oops, I see now that the V2 patches introduced some sparse warnings; 
will fix for V3, thanks for spotting that.

Or.

* RE: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]             ` <CAJZOPZ+eT=UGfqbwyMn8BtKCei2t1RKj1auAhSbPphLF9A6eVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2013-02-12 18:59               ` Hefty, Sean
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A0FE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hefty, Sean @ 2013-02-12 18:59 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Tzahi Oved

My understanding of this is that there are NO changes to the wire protocols.

A QP is simply that, a pair of queues - one send, one receive.  To the best that I can figure out, you're wanting to allocate 'multiple queues' - something that has multiple send and receive queues.  (I use the term MQ, because it seems to be the most appropriate based on my understanding.)  A QP can be viewed as a special case of an MQ.  Is a single QPN used on the wire for all queues which are part of an MQ?  Like a QP, each queue can have its own size and CQ.  So, they're independent... except that they're dependent on some higher association (referred to as a parent QP).

The user has the joy of not knowing beforehand how many queues will be allocated.  Just that they need to somehow allocate them all, transition them all into a usable state, and keep all of them in that state.  The extra queues are allocated by the HW, but the user still needs to specify how big they are, how many SGEs each should have, etc.  I'm guessing specifying a size of 0 isn't acceptable if the user really doesn't want it.  But it would be okay if it went unused... maybe?  There's no mention of what happens if a user fails to allocate all queues, destroys one of the queues but keeps the others, or has the queues in different states - such as transitioning the 'parent' QP into the error state.  It's not even clear to me if the 'parent QP' has send and receive queues, or if it even should.

Honestly, I'd like to see the entire concept fleshed out before trying to decide if the implementation matches up with what the architecture is trying to accomplish.  Maybe you end up with the same implementation, but there are details in the usage model that seem to be missing.  The email threads talk about UD, but want to leave open the possibility of other QP types.  How would RC even work in this model?  How would it connect?  How do you manage associated QPs being in different states?  How would this export into user space?  How and when does the HW decide to direct receives to a specific queue?

- Sean

* Re: [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets
       [not found]             ` <511A560D.8020900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2013-02-12 16:33               ` Hefty, Sean
@ 2013-02-12 20:35               ` Jason Gunthorpe
  1 sibling, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2013-02-12 20:35 UTC (permalink / raw)
  To: Shlomo Pongratz
  Cc: Hefty, Sean, Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w

On Tue, Feb 12, 2013 at 04:47:41PM +0200, Shlomo Pongratz wrote:
> On 2/11/2013 9:46 PM, Hefty, Sean wrote:
> >>>+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> >>>@@ -844,10 +844,10 @@ static u32 ipoib_addr_hash(struct ipoib_neigh_hash *htbl,
> >>>u8 *daddr)
> >>>  	 * different subnets.
> >>>  	 */
> >>>  	 /* qpn octets[1:4) & port GUID octets[12:20) */
> >>>-	u32 *daddr_32 = (u32 *) daddr;
> >>>+	u32 *d32 = (u32 *)daddr;
> >>>  	u32 hv;
> >>>
> >>>-	hv = jhash_3words(daddr_32[3], daddr_32[4], 0xFFFFFF & daddr_32[0], 0);
> >>>+	hv = jhash_3words(d32[3], d32[4], cpu_to_be32(0xFFFFFF) & d32[0], 0);
> >Should d32 be declared as __be32 *?
> Hi Sean,
> 
> The IPoIB destination address is indeed in big endian format and
> normally the pointer to it should be of type __be32.
> However in this case I just want to feed it into the hash function
> without the flags part.
> defining d32 as __be32* will make the code a bit ugly as I'll need
> to cast 3 of "jhash_3words" functions arguments.
> That is,
> 
> __be32 *d32;
> ....
> 
> hv = jhash_3words((__force u32) d32[3], (__force u32) d32[4],
> (__force u32)(cpu_to_be32(0xFFFFFF) & d32[0]), 0);

Not sure what your hv is used for, but be aware that it is going to
have a different value on big- and little-endian systems.

This is why the (__force u32) is somewhat desirable, because you are
explicitly and deliberately ignoring the effect of endianness at that
point in the code.

Jason

* Re: [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups
       [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A0FE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-02-13 10:31                   ` Or Gerlitz
  0 siblings, 0 replies; 18+ messages in thread
From: Or Gerlitz @ 2013-02-13 10:31 UTC (permalink / raw)
  To: Hefty, Sean
  Cc: Or Gerlitz, roland-DgEjT+Ai2ygdnm+yROfE0A,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA, erezsh-VPRAkNaXOzVWk0Htik3J/w,
	Shlomo Pongratz, Tzahi Oved

On 12/02/2013 20:59, Hefty, Sean wrote:
> My understanding of this is that there's NO changes to the wire protocols.

For RSS no changes.

For TSS, we added a flag in the IPoIB HW address and used a reserved field 
of the IPoIB header; see the change log of patch #5 "IB/IPoIB: Add RSS 
and TSS support for datagram mode" for the details.


> A QP is simply that, a pair of queues - one send, one receive.  To the best that I can figure out, you're wanting to allocate 'multiple queues' - something that has multiple send and receive queues.  (I use the term MQ, because it seems to be the most appropriate based on my understanding.)  A QP can be viewed as a special case of an MQ.  Is a single QPN used on the wire for all queues which are part of an MQ?  Like a QP, each queue can have its own size and CQ.  So, they're independent... except that they're dependent on some higher association (referred to as a parent QP).

A HW driver putting a single QPN on the wire for all the TSS child QPs of 
a given parent is a HW feature called "HW TSS" in this core patch and in 
the IPoIB RSS/TSS patch (#5). It simplifies the implementation, and with 
it the code indeed avoids the wire changes (so there is room to 
improve...).

Yep, child QPs are independent to a large extent; under HW TSS they are 
instrumented to put their parent's QPN on the wire and are otherwise 
totally independent. For RSS they should be using the same PD/QKEY, as I 
said, and with typical HW implementations they would have consecutive QP 
numbers, since networking RSS HW is typically configured with {RSS hash 
function, starting queue number (== the QPN of the "first" RSS child), 
# of RX queues}, all this for what is called the RSS indirection QP 
(== the RSS parent); see the mlx4 and IPoIB TSS/RSS patches for more 
details.
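
In other words, the state behind the RSS parent is conceptually something 
like this (a purely illustrative structure, not an actual mlx4 or IB/core 
definition):

struct rss_indirection_ctx {
	u32 hash_fn;		/* e.g. Toeplitz */
	u8  hash_key[40];
	u32 base_qpn;		/* QPN of the "first" RSS child */
	u32 num_rx_queues;	/* number of children, typically a power of two */
};

which is why the children end up with consecutive QP numbers: the HW only 
needs the base QPN and the queue count to spread received packets.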



> The user has the joy of not knowing beforehand how many queues will be allocated.  Just that they need to somehow allocate them all, transition them all into a usable state, and keep all of them in that state.  The extra queues are allocated by the HW, but the user still needs to specify how big they are, how many SGEs each should have, etc.  I'm guessing specifying a size of 0 isn't acceptable if the user really doesn't want it.  But it would be okay if it went unused... maybe?  There's no mention of what happens if a user fails to allocate all queues, destroys one of the queues but keeps the others, or has the queues in different states - such as transitioning the 'parent' QP into the error state.  It's not even clear to me if the 'parent QP' has send and receive queues, or if it even should.


Cases you indicate here, such as failing to allocate or destroying some 
of the queues, would be problematic for RSS, good catch! Thinking out 
loud, I think we can solve it if we let the parent QP creation actually 
trigger the creation of the whole set of children (instead of only 
reserving QPNs for them, as the mlx4 patch does now); we'll look into this.


> Honestly, I'd like to see the entire concept fleshed out before trying to decide if the implementation matches up with what the architecture is trying to accomplish.  Maybe you end up with the same implementation, but there are details in the usage model that seem to be missing.  The email threads talk about UD, but want to leave open the possibility of other QP types.  How would RC even work in this model?  How would it connect?  How do you manage associated QPs being in different states?  How would this export into user space?  How and when does the HW decide to direct receives to a specific queue?
>

Re fleshing out the entire concept: this requirement makes sense, and I 
think we're trying to do it now through these emails... As for the QP 
types supported for this feature, they are UD and RAW_PACKET, the two 
types commonly used for TCP/IP networking in the relevant environments 
(IB UD for "plain" IPoIB and offloaded IPoIB, Eth RAW_PACKET for 
offloaded TCP/IP).

RC doesn't have a good fit here, since some contract (e.g. a pre-set hash 
or advertisement of QPNs) would have to be established over the wire, 
which isn't the case for RSS over UD/RAW_PACKET QPs, where the 
indirection QP simply hashes received packets and dispatches them to 
multiple queues.

Or.



end of thread, other threads:[~2013-02-13 10:31 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-02-05 15:48 [PATCH V2 for-next 0/6] IB/IPoIB: Add multi-queue TSS and RSS support Or Gerlitz
     [not found] ` <1360079337-8173-1-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-02-05 15:48   ` [PATCH V2 for-next 1/6] IB/ipoib: Fix ipoib_neigh hashing to use the correct daddr octets Or Gerlitz
     [not found]     ` <1360079337-8173-2-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-02-11 19:46       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99DDC-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-02-12 14:47           ` Shlomo Pongratz
     [not found]             ` <511A560D.8020900-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-02-12 16:33               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A045-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-02-12 16:53                   ` Or Gerlitz
2013-02-12 20:35               ` Jason Gunthorpe
2013-02-05 15:48   ` [PATCH V2 for-next 2/6] IB/core: Add RSS and TSS QP groups Or Gerlitz
     [not found]     ` <1360079337-8173-3-git-send-email-ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2013-02-11 20:42       ` Hefty, Sean
     [not found]         ` <1828884A29C6694DAF28B7E6B8A8237368B99E0B-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-02-12 15:27           ` Or Gerlitz
2013-02-12 16:39           ` Or Gerlitz
2013-02-12 16:46           ` Or Gerlitz
     [not found]             ` <CAJZOPZ+eT=UGfqbwyMn8BtKCei2t1RKj1auAhSbPphLF9A6eVg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2013-02-12 18:59               ` Hefty, Sean
     [not found]                 ` <1828884A29C6694DAF28B7E6B8A8237368B9A0FE-P5GAC/sN6hmkrb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
2013-02-13 10:31                   ` Or Gerlitz
2013-02-05 15:48   ` [PATCH V2 for-next 3/6] IB/mlx4: Add support for " Or Gerlitz
2013-02-05 15:48   ` [PATCH V2 for-next 4/6] IB/ipoib: Move to multi-queue device Or Gerlitz
2013-02-05 15:48   ` [PATCH V2 for-next 5/6] IB/ipoib: Add RSS and TSS support for datagram mode Or Gerlitz
2013-02-05 15:48   ` [PATCH V2 for-next 6/6] IB/ipoib: Support changing the number of RX/TX rings with ethtool Or Gerlitz
