* [PATCH for-next 00/16] New hfi1 feature: Accelerated IP
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma

This patch series implements an accelerated ipoib using the rdma netdev
mechanism already present in ipoib. A new device capability bit,
IB_DEVICE_RDMA_NETDEV_OPA, triggers ipoib to create its datagram QP with the
IB_QP_CREATE_NETDEV_USE create flag.
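
Illustrative only (not part of the series): a ULP such as ipoib could key
its QP creation off the new capability bit roughly as sketched below. The
helper name is made up for illustration; IB_QP_CREATE_NETDEV_USE is the
create flag added later in the series, and the declarations come from
<rdma/ib_verbs.h>.

static void set_netdev_qp_create_flag(struct ib_device *dev,
				      struct ib_qp_init_attr *attr)
{
	if (dev->attrs.device_cap_flags & IB_DEVICE_RDMA_NETDEV_OPA)
		attr->create_flags |= IB_QP_CREATE_NETDEV_USE;
}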

The highlights include:
- Sharing send and receive resources with VNIC
- Switching between connected mode and datagram mode
- Increasing the maximum datagram MTU for OPA devices to 10k

The same spreading capability exploited by VNIC is used here to vary
the receive context that receives the packet.

The patches are fully bisectable and stepwise implement the capability.

---

Gary Leshner (6):
      IB/hfi1: Add functions to transmit datagram ipoib packets
      IB/hfi1: Add the transmit side of a datagram ipoib RDMA netdev
      IB/hfi1: Remove module parameter for KDETH qpns
      IB/{rdmavt,hfi1}: Implement creation of accelerated UD QPs
      IB/{hfi1,ipoib,rdma}: Broadcast ping sent packets which exceeded mtu size
      IB/ipoib: Add capability to switch between datagram and connected mode

Grzegorz Andrejczuk (7):
      IB/hfi1: RSM rules for AIP
      IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts
      IB/hfi1: Add functions to receive accelerated ipoib packets
      IB/hfi1: Add interrupt handler functions for accelerated ipoib
      IB/hfi1: Add rx functions for dummy netdev
      IB/hfi1: Activate the dummy netdev
      IB/hfi1: Add packet histogram trace event

Kaike Wan (1):
      IB/hfi1: Add accelerated IP capability bit

Piotr Stankiewicz (1):
      IB/hfi1: Enable the transmit side of the datagram ipoib netdev

Sadanand Warrier (1):
      IB/ipoib: Increase ipoib Datagram mode MTU's upper limit


 drivers/infiniband/hw/hfi1/Makefile            |    4 
 drivers/infiniband/hw/hfi1/affinity.c          |   12 
 drivers/infiniband/hw/hfi1/affinity.h          |    3 
 drivers/infiniband/hw/hfi1/chip.c              |  303 ++++++---
 drivers/infiniband/hw/hfi1/chip.h              |    5 
 drivers/infiniband/hw/hfi1/common.h            |   13 
 drivers/infiniband/hw/hfi1/driver.c            |  231 ++++++-
 drivers/infiniband/hw/hfi1/file_ops.c          |    4 
 drivers/infiniband/hw/hfi1/hfi.h               |   38 -
 drivers/infiniband/hw/hfi1/init.c              |   14 
 drivers/infiniband/hw/hfi1/ipoib.h             |  171 +++++
 drivers/infiniband/hw/hfi1/ipoib_main.c        |  309 +++++++++
 drivers/infiniband/hw/hfi1/ipoib_rx.c          |   95 +++
 drivers/infiniband/hw/hfi1/ipoib_tx.c          |  828 ++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/msix.c              |   36 +
 drivers/infiniband/hw/hfi1/msix.h              |    7 
 drivers/infiniband/hw/hfi1/netdev.h            |  118 +++
 drivers/infiniband/hw/hfi1/netdev_rx.c         |  481 ++++++++++++++
 drivers/infiniband/hw/hfi1/qp.c                |   18 -
 drivers/infiniband/hw/hfi1/tid_rdma.c          |    4 
 drivers/infiniband/hw/hfi1/trace.c             |   42 +
 drivers/infiniband/hw/hfi1/trace_ctxts.h       |   11 
 drivers/infiniband/hw/hfi1/verbs.c             |   13 
 drivers/infiniband/hw/hfi1/vnic.h              |    5 
 drivers/infiniband/hw/hfi1/vnic_main.c         |  318 ++-------
 drivers/infiniband/sw/rdmavt/qp.c              |   24 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |   25 -
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   12 
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |    3 
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    3 
 include/rdma/ib_verbs.h                        |   65 ++
 include/rdma/opa_port_info.h                   |   10 
 include/rdma/opa_vnic.h                        |    4 
 include/rdma/rdmavt_qp.h                       |   29 +
 include/uapi/rdma/hfi/hfi1_user.h              |    3 
 35 files changed, 2768 insertions(+), 493 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib.h
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_main.c
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_rx.c
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_tx.c
 create mode 100644 drivers/infiniband/hw/hfi1/netdev.h
 create mode 100644 drivers/infiniband/hw/hfi1/netdev_rx.c

--
-Denny


* [PATCH for-next 01/16] IB/hfi1: Add accelerated IP capability bit
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma, Mike Marciniszyn, Kaike Wan

From: Kaike Wan <kaike.wan@intel.com>

The accelerated IP capability bit is added to allow users to control
whether the feature is enabled or disabled.

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/common.h |    5 +++--
 include/uapi/rdma/hfi/hfi1_user.h   |    3 ++-
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/common.h b/drivers/infiniband/hw/hfi1/common.h
index 40a1ff0..1f7107e 100644
--- a/drivers/infiniband/hw/hfi1/common.h
+++ b/drivers/infiniband/hw/hfi1/common.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -149,7 +149,8 @@
 				   HFI1_CAP_NO_INTEGRITY |		\
 				   HFI1_CAP_PKEY_CHECK |		\
 				   HFI1_CAP_TID_RDMA |			\
-				   HFI1_CAP_OPFN) <<			\
+				   HFI1_CAP_OPFN |			\
+				   HFI1_CAP_AIP) <<			\
 				  HFI1_CAP_USER_SHIFT)
 /*
  * Set of capabilities that need to be enabled for kernel context in
diff --git a/include/uapi/rdma/hfi/hfi1_user.h b/include/uapi/rdma/hfi/hfi1_user.h
index 01ac585..d95ef9a 100644
--- a/include/uapi/rdma/hfi/hfi1_user.h
+++ b/include/uapi/rdma/hfi/hfi1_user.h
@@ -6,7 +6,7 @@
  *
  * GPL LICENSE SUMMARY
  *
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of version 2 of the GNU General Public License as
@@ -109,6 +109,7 @@
 #define HFI1_CAP_OPFN             (1UL << 16) /* Enable the OPFN protocol */
 #define HFI1_CAP_SDMA_HEAD_CHECK  (1UL << 17) /* SDMA head checking */
 #define HFI1_CAP_EARLY_CREDIT_RETURN (1UL << 18) /* early credit return */
+#define HFI1_CAP_AIP              (1UL << 19) /* Enable accelerated IP */
 
 #define HFI1_RCVHDR_ENTSIZE_2    (1UL << 0)
 #define HFI1_RCVHDR_ENTSIZE_16   (1UL << 1)



* [PATCH for-next 02/16] IB/hfi1: Add functions to transmit datagram ipoib packets
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma, Mike Marciniszyn, Gary Leshner, Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

This patch implements the mechanism to accelerate the transmit side of
a multiple transmit queue RDMA netdev by submitting the packets to
the SDMA engine directly instead of sending through the verbs layer.
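
At a high level, the per-packet transmit path added here is (sketch
summarizing the functions in ipoib_tx.c below; pkt_bytes and
pbc_plus_hdr_bytes are shorthand for the sizes computed in
hfi1_ipoib_build_tx_desc()):

	hfi1_ipoib_build_ib_tx_headers(tx, txp);	/* LRH/BTH/DETH + PBC */
	sdma_txinit(&tx->txreq, 0, pkt_bytes, hfi1_ipoib_sdma_complete);
	sdma_txadd_kvaddr(dd, &tx->txreq, &tx->sdma_hdr, pbc_plus_hdr_bytes);
	/* skb linear data and page frags are added as further descriptors */
	sdma_send_txreq(txq->sde, iowait_get_ib_work(&txq->wait),
			&tx->txreq, txq->pkts_sent);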

This patch also changes the UD/SEND_ONLY op to output the entropy value
in byte 0 of deth[1]. UD/SEND_ONLY_WITH_IMMEDIATE uses the previous
behavior with no entropy value being output.
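
For reference, a minimal sketch of the deth[1] word as it is built in
hfi1_ipoib_build_ib_tx_headers() below: the entropy occupies the most
significant byte of the word (byte 0 on the wire), with the source QPN in
the low 24 bits; HFI1_IPOIB_ENTROPY_SHIFT is 24 in this series.

	ohdr->u.ud.deth[1] = cpu_to_be32((entropy << HFI1_IPOIB_ENTROPY_SHIFT) |
					 sqpn);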

Upon successful submission of a tx request, the ipoib rdma netdev code
calls trace_sdma_output_ibhdr to output the ibhdr to the trace buffer.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/Makefile   |    1 
 drivers/infiniband/hw/hfi1/ipoib.h    |  145 ++++++
 drivers/infiniband/hw/hfi1/ipoib_tx.c |  828 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/trace.c    |   10 
 include/rdma/ib_verbs.h               |    1 
 5 files changed, 984 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib.h
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_tx.c

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 0405d26..09ef0b8 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -22,6 +22,7 @@ hfi1-y := \
 	init.o \
 	intr.o \
 	iowait.o \
+	ipoib_tx.o \
 	mad.o \
 	mmu_rb.o \
 	msix.o \
diff --git a/drivers/infiniband/hw/hfi1/ipoib.h b/drivers/infiniband/hw/hfi1/ipoib.h
new file mode 100644
index 0000000..2b541ab
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ipoib.h
@@ -0,0 +1,145 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for IPOIB functionality
+ */
+
+#ifndef HFI1_IPOIB_H
+#define HFI1_IPOIB_H
+
+#include <linux/types.h>
+#include <linux/stddef.h>
+#include <linux/atomic.h>
+#include <linux/netdevice.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/list.h>
+#include <linux/if_infiniband.h>
+
+#include "hfi.h"
+#include "iowait.h"
+
+#include <rdma/ib_verbs.h>
+
+#define HFI1_IPOIB_ENTROPY_SHIFT   24
+
+#define HFI1_IPOIB_TXREQ_NAME_LEN   32
+
+#define HFI1_IPOIB_ENCAP_LEN 4
+
+struct hfi1_ipoib_dev_priv;
+
+union hfi1_ipoib_flow {
+	u16 as_int;
+	struct {
+		u8 tx_queue;
+		u8 sc5;
+	} __attribute__((__packed__));
+};
+
+/**
+ * struct hfi1_ipoib_circ_buf - List of items to be processed
+ * @items: ring of items
+ * @head: ring head
+ * @tail: ring tail
+ * @max_items: max items + 1 that the ring can contain
+ * @producer_lock: producer sync lock
+ * @consumer_lock: consumer sync lock
+ */
+struct hfi1_ipoib_circ_buf {
+	void **items;
+	unsigned long head;
+	unsigned long tail;
+	unsigned long max_items;
+	spinlock_t producer_lock; /* head sync lock */
+	spinlock_t consumer_lock; /* tail sync lock */
+};
+
+/**
+ * struct hfi1_ipoib_txq - IPOIB per Tx queue information
+ * @priv: private pointer
+ * @sde: sdma engine
+ * @tx_list: tx request list
+ * @sent_txreqs: count of txreqs posted to sdma
+ * @flow: tracks when list needs to be flushed for a flow change
+ * @q_idx: ipoib Tx queue index
+ * @pkts_sent: indicator packets have been sent from this queue
+ * @wait: iowait structure
+ * @complete_txreqs: count of txreqs completed by sdma
+ * @napi: pointer to tx napi interface
+ * @tx_ring: ring of ipoib txreqs to be reaped by napi callback
+ */
+struct hfi1_ipoib_txq {
+	struct hfi1_ipoib_dev_priv *priv;
+	struct sdma_engine *sde;
+	struct list_head tx_list;
+	u64 sent_txreqs;
+	union hfi1_ipoib_flow flow;
+	u8 q_idx;
+	bool pkts_sent;
+	struct iowait wait;
+
+	atomic64_t ____cacheline_aligned_in_smp complete_txreqs;
+	struct napi_struct *napi;
+	struct hfi1_ipoib_circ_buf tx_ring;
+};
+
+struct hfi1_ipoib_dev_priv {
+	struct hfi1_devdata *dd;
+	struct net_device   *netdev;
+	struct ib_device    *device;
+	struct hfi1_ipoib_txq *txqs;
+	struct kmem_cache *txreq_cache;
+	struct napi_struct *tx_napis;
+	u16 pkey;
+	u16 pkey_index;
+	u32 qkey;
+	u8 port_num;
+
+	const struct net_device_ops *netdev_ops;
+	struct rvt_qp *qp;
+	struct pcpu_sw_netstats __percpu *netstats;
+};
+
+/* hfi1 ipoib rdma netdev's private data structure */
+struct hfi1_ipoib_rdma_netdev {
+	struct rdma_netdev rn;  /* keep this first */
+	/* followed by device private data */
+	struct hfi1_ipoib_dev_priv dev_priv;
+};
+
+static inline struct hfi1_ipoib_dev_priv *
+hfi1_ipoib_priv(const struct net_device *dev)
+{
+	return &((struct hfi1_ipoib_rdma_netdev *)netdev_priv(dev))->dev_priv;
+}
+
+static inline void
+hfi1_ipoib_update_tx_netstats(struct hfi1_ipoib_dev_priv *priv,
+			      u64 packets,
+			      u64 bytes)
+{
+	struct pcpu_sw_netstats *netstats = this_cpu_ptr(priv->netstats);
+
+	u64_stats_update_begin(&netstats->syncp);
+	netstats->tx_packets += packets;
+	netstats->tx_bytes += bytes;
+	u64_stats_update_end(&netstats->syncp);
+}
+
+int hfi1_ipoib_send_dma(struct net_device *dev,
+			struct sk_buff *skb,
+			struct ib_ah *address,
+			u32 dqpn);
+
+int hfi1_ipoib_txreq_init(struct hfi1_ipoib_dev_priv *priv);
+void hfi1_ipoib_txreq_deinit(struct hfi1_ipoib_dev_priv *priv);
+
+void hfi1_ipoib_napi_tx_enable(struct net_device *dev);
+void hfi1_ipoib_napi_tx_disable(struct net_device *dev);
+
+#endif /* _IPOIB_H */
diff --git a/drivers/infiniband/hw/hfi1/ipoib_tx.c b/drivers/infiniband/hw/hfi1/ipoib_tx.c
new file mode 100644
index 0000000..883cb9d4
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ipoib_tx.c
@@ -0,0 +1,828 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for IPOIB SDMA functionality
+ */
+
+#include <linux/log2.h>
+#include <linux/circ_buf.h>
+
+#include "sdma.h"
+#include "verbs.h"
+#include "trace_ibhdrs.h"
+#include "ipoib.h"
+
+/* Add a convenience helper */
+#define CIRC_ADD(val, add, size) (((val) + (add)) & ((size) - 1))
+#define CIRC_NEXT(val, size) CIRC_ADD(val, 1, size)
+#define CIRC_PREV(val, size) CIRC_ADD(val, -1, size)
+
+/**
+ * struct ipoib_txreq - IPOIB transmit descriptor
+ * @txreq: sdma transmit request
+ * @sdma_hdr: 9b ib headers
+ * @sdma_status: status returned by sdma engine
+ * @priv: ipoib netdev private data
+ * @txq: txq on which skb was output
+ * @skb: skb to send
+ */
+struct ipoib_txreq {
+	struct sdma_txreq           txreq;
+	struct hfi1_sdma_header     sdma_hdr;
+	int                         sdma_status;
+	struct hfi1_ipoib_dev_priv *priv;
+	struct hfi1_ipoib_txq      *txq;
+	struct sk_buff             *skb;
+};
+
+struct ipoib_txparms {
+	struct hfi1_devdata        *dd;
+	struct rdma_ah_attr        *ah_attr;
+	struct hfi1_ibport         *ibp;
+	struct hfi1_ipoib_txq      *txq;
+	union hfi1_ipoib_flow       flow;
+	u32                         dqpn;
+	u8                          hdr_dwords;
+	u8                          entropy;
+};
+
+static u64 hfi1_ipoib_txreqs(const u64 sent, const u64 completed)
+{
+	return sent - completed;
+}
+
+static void hfi1_ipoib_check_queue_depth(struct hfi1_ipoib_txq *txq)
+{
+	if (unlikely(hfi1_ipoib_txreqs(++txq->sent_txreqs,
+				       atomic64_read(&txq->complete_txreqs)) >=
+	    min_t(unsigned int, txq->priv->netdev->tx_queue_len,
+		  txq->tx_ring.max_items - 1)))
+		netif_stop_subqueue(txq->priv->netdev, txq->q_idx);
+}
+
+static void hfi1_ipoib_check_queue_stopped(struct hfi1_ipoib_txq *txq)
+{
+	struct net_device *dev = txq->priv->netdev;
+
+	/* If the queue is already running just return */
+	if (likely(!__netif_subqueue_stopped(dev, txq->q_idx)))
+		return;
+
+	/* If shutting down just return as queue state is irrelevant */
+	if (unlikely(dev->reg_state != NETREG_REGISTERED))
+		return;
+
+	/*
+	 * When the queue has been drained to less than half full it will be
+	 * restarted.
+	 * The size of the txreq ring is fixed at initialization.
+	 * The tx queue len can be adjusted upward while the interface is
+	 * running.
+	 * The tx queue len can be large enough to overflow the txreq_ring.
+	 * Use the minimum of the current tx_queue_len or the rings max txreqs
+	 * to protect against ring overflow.
+	 */
+	if (hfi1_ipoib_txreqs(txq->sent_txreqs,
+			      atomic64_read(&txq->complete_txreqs))
+	    < min_t(unsigned int, dev->tx_queue_len,
+		    txq->tx_ring.max_items) >> 1)
+		netif_wake_subqueue(dev, txq->q_idx);
+}
+
+static void hfi1_ipoib_free_tx(struct ipoib_txreq *tx, int budget)
+{
+	struct hfi1_ipoib_dev_priv *priv = tx->priv;
+
+	if (likely(!tx->sdma_status)) {
+		hfi1_ipoib_update_tx_netstats(priv, 1, tx->skb->len);
+	} else {
+		++priv->netdev->stats.tx_errors;
+		dd_dev_warn(priv->dd,
+			    "%s: Status = 0x%x pbc 0x%llx txq = %d sde = %d\n",
+			    __func__, tx->sdma_status,
+			    le64_to_cpu(tx->sdma_hdr.pbc), tx->txq->q_idx,
+			    tx->txq->sde->this_idx);
+	}
+
+	napi_consume_skb(tx->skb, budget);
+	sdma_txclean(priv->dd, &tx->txreq);
+	kmem_cache_free(priv->txreq_cache, tx);
+}
+
+static int hfi1_ipoib_drain_tx_ring(struct hfi1_ipoib_txq *txq, int budget)
+{
+	struct hfi1_ipoib_circ_buf *tx_ring = &txq->tx_ring;
+	unsigned long head;
+	unsigned long tail;
+	unsigned int max_tx;
+	int work_done;
+	int tx_count;
+
+	spin_lock_bh(&tx_ring->consumer_lock);
+
+	/* Read index before reading contents at that index. */
+	head = smp_load_acquire(&tx_ring->head);
+	tail = tx_ring->tail;
+	max_tx = tx_ring->max_items;
+
+	work_done = min_t(int, CIRC_CNT(head, tail, max_tx), budget);
+
+	for (tx_count = work_done; tx_count; tx_count--) {
+		hfi1_ipoib_free_tx(tx_ring->items[tail], budget);
+		tail = CIRC_NEXT(tail, max_tx);
+	}
+
+	atomic64_add(work_done, &txq->complete_txreqs);
+
+	/* Finished freeing tx items so store the tail value. */
+	smp_store_release(&tx_ring->tail, tail);
+
+	spin_unlock_bh(&tx_ring->consumer_lock);
+
+	hfi1_ipoib_check_queue_stopped(txq);
+
+	return work_done;
+}
+
+static int hfi1_ipoib_process_tx_ring(struct napi_struct *napi, int budget)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(napi->dev);
+	struct hfi1_ipoib_txq *txq = &priv->txqs[napi - priv->tx_napis];
+
+	int work_done = hfi1_ipoib_drain_tx_ring(txq, budget);
+
+	if (work_done < budget)
+		napi_complete_done(napi, work_done);
+
+	return work_done;
+}
+
+static void hfi1_ipoib_add_tx(struct ipoib_txreq *tx)
+{
+	struct hfi1_ipoib_circ_buf *tx_ring = &tx->txq->tx_ring;
+	unsigned long head;
+	unsigned long tail;
+	size_t max_tx;
+
+	spin_lock(&tx_ring->producer_lock);
+
+	head = tx_ring->head;
+	tail = READ_ONCE(tx_ring->tail);
+	max_tx = tx_ring->max_items;
+
+	if (likely(CIRC_SPACE(head, tail, max_tx))) {
+		tx_ring->items[head] = tx;
+
+		/* Finish storing txreq before incrementing head. */
+		smp_store_release(&tx_ring->head, CIRC_ADD(head, 1, max_tx));
+		napi_schedule(tx->txq->napi);
+	} else {
+		struct hfi1_ipoib_txq *txq = tx->txq;
+		struct hfi1_ipoib_dev_priv *priv = tx->priv;
+
+		/* Ring was full */
+		hfi1_ipoib_free_tx(tx, 0);
+		atomic64_inc(&txq->complete_txreqs);
+		dd_dev_dbg(priv->dd, "txq %d full.\n", txq->q_idx);
+	}
+
+	spin_unlock(&tx_ring->producer_lock);
+}
+
+static void hfi1_ipoib_sdma_complete(struct sdma_txreq *txreq, int status)
+{
+	struct ipoib_txreq *tx = container_of(txreq, struct ipoib_txreq, txreq);
+
+	tx->sdma_status = status;
+
+	hfi1_ipoib_add_tx(tx);
+}
+
+static int hfi1_ipoib_build_ulp_payload(struct ipoib_txreq *tx,
+					struct ipoib_txparms *txp)
+{
+	struct hfi1_devdata *dd = txp->dd;
+	struct sdma_txreq *txreq = &tx->txreq;
+	struct sk_buff *skb = tx->skb;
+	int ret = 0;
+	int i;
+
+	if (skb_headlen(skb)) {
+		ret = sdma_txadd_kvaddr(dd, txreq, skb->data, skb_headlen(skb));
+		if (unlikely(ret))
+			return ret;
+	}
+
+	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+
+		ret = sdma_txadd_page(dd,
+				      txreq,
+				      skb_frag_page(frag),
+				      frag->bv_offset,
+				      skb_frag_size(frag));
+		if (unlikely(ret))
+			break;
+	}
+
+	return ret;
+}
+
+static int hfi1_ipoib_build_tx_desc(struct ipoib_txreq *tx,
+				    struct ipoib_txparms *txp)
+{
+	struct hfi1_devdata *dd = txp->dd;
+	struct sdma_txreq *txreq = &tx->txreq;
+	struct hfi1_sdma_header *sdma_hdr = &tx->sdma_hdr;
+	u16 pkt_bytes =
+		sizeof(sdma_hdr->pbc) + (txp->hdr_dwords << 2) + tx->skb->len;
+	int ret;
+
+	ret = sdma_txinit(txreq, 0, pkt_bytes, hfi1_ipoib_sdma_complete);
+	if (unlikely(ret))
+		return ret;
+
+	/* add pbc + headers */
+	ret = sdma_txadd_kvaddr(dd,
+				txreq,
+				sdma_hdr,
+				sizeof(sdma_hdr->pbc) + (txp->hdr_dwords << 2));
+	if (unlikely(ret))
+		return ret;
+
+	/* add the ulp payload */
+	return hfi1_ipoib_build_ulp_payload(tx, txp);
+}
+
+static void hfi1_ipoib_build_ib_tx_headers(struct ipoib_txreq *tx,
+					   struct ipoib_txparms *txp)
+{
+	struct hfi1_ipoib_dev_priv *priv = tx->priv;
+	struct hfi1_sdma_header *sdma_hdr = &tx->sdma_hdr;
+	struct sk_buff *skb = tx->skb;
+	struct hfi1_pportdata *ppd = ppd_from_ibp(txp->ibp);
+	struct rdma_ah_attr *ah_attr = txp->ah_attr;
+	struct ib_other_headers *ohdr;
+	struct ib_grh *grh;
+	u16 dwords;
+	u16 slid;
+	u16 dlid;
+	u16 lrh0;
+	u32 bth0;
+	u32 sqpn = (u32)(priv->netdev->dev_addr[1] << 16 |
+			 priv->netdev->dev_addr[2] << 8 |
+			 priv->netdev->dev_addr[3]);
+	u16 payload_dwords;
+	u8 pad_cnt;
+
+	pad_cnt = -skb->len & 3;
+
+	/* Includes ICRC */
+	payload_dwords = ((skb->len + pad_cnt) >> 2) + SIZE_OF_CRC;
+
+	/* header size in dwords LRH+BTH+DETH = (8+12+8)/4. */
+	txp->hdr_dwords = 7;
+
+	if (rdma_ah_get_ah_flags(ah_attr) & IB_AH_GRH) {
+		grh = &sdma_hdr->hdr.ibh.u.l.grh;
+		txp->hdr_dwords +=
+			hfi1_make_grh(txp->ibp,
+				      grh,
+				      rdma_ah_read_grh(ah_attr),
+				      txp->hdr_dwords - LRH_9B_DWORDS,
+				      payload_dwords);
+		lrh0 = HFI1_LRH_GRH;
+		ohdr = &sdma_hdr->hdr.ibh.u.l.oth;
+	} else {
+		lrh0 = HFI1_LRH_BTH;
+		ohdr = &sdma_hdr->hdr.ibh.u.oth;
+	}
+
+	lrh0 |= (rdma_ah_get_sl(ah_attr) & 0xf) << 4;
+	lrh0 |= (txp->flow.sc5 & 0xf) << 12;
+
+	dlid = opa_get_lid(rdma_ah_get_dlid(ah_attr), 9B);
+	if (dlid == be16_to_cpu(IB_LID_PERMISSIVE)) {
+		slid = be16_to_cpu(IB_LID_PERMISSIVE);
+	} else {
+		u16 lid = (u16)ppd->lid;
+
+		if (lid) {
+			lid |= rdma_ah_get_path_bits(ah_attr) &
+				((1 << ppd->lmc) - 1);
+			slid = lid;
+		} else {
+			slid = be16_to_cpu(IB_LID_PERMISSIVE);
+		}
+	}
+
+	/* Includes ICRC */
+	dwords = txp->hdr_dwords + payload_dwords;
+
+	/* Build the lrh */
+	sdma_hdr->hdr.hdr_type = HFI1_PKT_TYPE_9B;
+	hfi1_make_ib_hdr(&sdma_hdr->hdr.ibh, lrh0, dwords, dlid, slid);
+
+	/* Build the bth */
+	bth0 = (IB_OPCODE_UD_SEND_ONLY << 24) | (pad_cnt << 20) | priv->pkey;
+
+	ohdr->bth[0] = cpu_to_be32(bth0);
+	ohdr->bth[1] = cpu_to_be32(txp->dqpn);
+	ohdr->bth[2] = cpu_to_be32(mask_psn((u32)txp->txq->sent_txreqs));
+
+	/* Build the deth */
+	ohdr->u.ud.deth[0] = cpu_to_be32(priv->qkey);
+	ohdr->u.ud.deth[1] = cpu_to_be32((txp->entropy <<
+					  HFI1_IPOIB_ENTROPY_SHIFT) | sqpn);
+
+	/* Construct the pbc. */
+	sdma_hdr->pbc =
+		cpu_to_le64(create_pbc(ppd,
+				       ib_is_sc5(txp->flow.sc5) <<
+							      PBC_DC_INFO_SHIFT,
+				       0,
+				       sc_to_vlt(priv->dd, txp->flow.sc5),
+				       dwords - SIZE_OF_CRC +
+						(sizeof(sdma_hdr->pbc) >> 2)));
+}
+
+static struct ipoib_txreq *hfi1_ipoib_send_dma_common(struct net_device *dev,
+						      struct sk_buff *skb,
+						      struct ipoib_txparms *txp)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	struct ipoib_txreq *tx;
+	int ret;
+
+	tx = kmem_cache_alloc_node(priv->txreq_cache,
+				   GFP_ATOMIC,
+				   priv->dd->node);
+	if (unlikely(!tx))
+		return ERR_PTR(-ENOMEM);
+
+	/* so that we can test if the sdma descriptors are there */
+	tx->txreq.num_desc = 0;
+	tx->priv = priv;
+	tx->txq = txp->txq;
+	tx->skb = skb;
+
+	hfi1_ipoib_build_ib_tx_headers(tx, txp);
+
+	ret = hfi1_ipoib_build_tx_desc(tx, txp);
+	if (likely(!ret)) {
+		if (txp->txq->flow.as_int != txp->flow.as_int) {
+			txp->txq->flow.tx_queue = txp->flow.tx_queue;
+			txp->txq->flow.sc5 = txp->flow.sc5;
+			txp->txq->sde =
+				sdma_select_engine_sc(priv->dd,
+						      txp->flow.tx_queue,
+						      txp->flow.sc5);
+		}
+
+		return tx;
+	}
+
+	sdma_txclean(priv->dd, &tx->txreq);
+	kmem_cache_free(priv->txreq_cache, tx);
+
+	return ERR_PTR(ret);
+}
+
+static int hfi1_ipoib_submit_tx_list(struct net_device *dev,
+				     struct hfi1_ipoib_txq *txq)
+{
+	int ret;
+	u16 count_out;
+
+	ret = sdma_send_txlist(txq->sde,
+			       iowait_get_ib_work(&txq->wait),
+			       &txq->tx_list,
+			       &count_out);
+	if (likely(!ret) || ret == -EBUSY || ret == -ECOMM)
+		return ret;
+
+	dd_dev_warn(txq->priv->dd, "cannot send skb tx list, err %d.\n", ret);
+
+	return ret;
+}
+
+static int hfi1_ipoib_flush_tx_list(struct net_device *dev,
+				    struct hfi1_ipoib_txq *txq)
+{
+	int ret = 0;
+
+	if (!list_empty(&txq->tx_list)) {
+		/* Flush the current list */
+		ret = hfi1_ipoib_submit_tx_list(dev, txq);
+
+		if (unlikely(ret))
+			if (ret != -EBUSY)
+				++dev->stats.tx_carrier_errors;
+	}
+
+	return ret;
+}
+
+static int hfi1_ipoib_submit_tx(struct hfi1_ipoib_txq *txq,
+				struct ipoib_txreq *tx)
+{
+	int ret;
+
+	ret = sdma_send_txreq(txq->sde,
+			      iowait_get_ib_work(&txq->wait),
+			      &tx->txreq,
+			      txq->pkts_sent);
+	if (likely(!ret)) {
+		txq->pkts_sent = true;
+		iowait_starve_clear(txq->pkts_sent, &txq->wait);
+	}
+
+	return ret;
+}
+
+static int hfi1_ipoib_send_dma_single(struct net_device *dev,
+				      struct sk_buff *skb,
+				      struct ipoib_txparms *txp)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	struct hfi1_ipoib_txq *txq = txp->txq;
+	struct ipoib_txreq *tx;
+	int ret;
+
+	tx = hfi1_ipoib_send_dma_common(dev, skb, txp);
+	if (IS_ERR(tx)) {
+		int ret = PTR_ERR(tx);
+
+		dev_kfree_skb_any(skb);
+
+		if (ret == -ENOMEM)
+			++dev->stats.tx_errors;
+		else
+			++dev->stats.tx_carrier_errors;
+
+		return NETDEV_TX_OK;
+	}
+
+	ret = hfi1_ipoib_submit_tx(txq, tx);
+	if (likely(!ret)) {
+		trace_sdma_output_ibhdr(tx->priv->dd,
+					&tx->sdma_hdr.hdr,
+					ib_is_sc5(txp->flow.sc5));
+		hfi1_ipoib_check_queue_depth(txq);
+		return NETDEV_TX_OK;
+	}
+
+	txq->pkts_sent = false;
+
+	if (ret == -EBUSY) {
+		list_add_tail(&tx->txreq.list, &txq->tx_list);
+
+		trace_sdma_output_ibhdr(tx->priv->dd,
+					&tx->sdma_hdr.hdr,
+					ib_is_sc5(txp->flow.sc5));
+		hfi1_ipoib_check_queue_depth(txq);
+		return NETDEV_TX_OK;
+	}
+
+	if (ret == -ECOMM) {
+		hfi1_ipoib_check_queue_depth(txq);
+		return NETDEV_TX_OK;
+	}
+
+	sdma_txclean(priv->dd, &tx->txreq);
+	dev_kfree_skb_any(skb);
+	kmem_cache_free(priv->txreq_cache, tx);
+	++dev->stats.tx_carrier_errors;
+
+	return NETDEV_TX_OK;
+}
+
+static int hfi1_ipoib_send_dma_list(struct net_device *dev,
+				    struct sk_buff *skb,
+				    struct ipoib_txparms *txp)
+{
+	struct hfi1_ipoib_txq *txq = txp->txq;
+	struct ipoib_txreq *tx;
+
+	/* Has the flow changed? */
+	if (txq->flow.as_int != txp->flow.as_int)
+		(void)hfi1_ipoib_flush_tx_list(dev, txq);
+
+	tx = hfi1_ipoib_send_dma_common(dev, skb, txp);
+	if (IS_ERR(tx)) {
+		int ret = PTR_ERR(tx);
+
+		dev_kfree_skb_any(skb);
+
+		if (ret == -ENOMEM)
+			++dev->stats.tx_errors;
+		else
+			++dev->stats.tx_carrier_errors;
+
+		return NETDEV_TX_OK;
+	}
+
+	list_add_tail(&tx->txreq.list, &txq->tx_list);
+
+	hfi1_ipoib_check_queue_depth(txq);
+
+	trace_sdma_output_ibhdr(tx->priv->dd,
+				&tx->sdma_hdr.hdr,
+				ib_is_sc5(txp->flow.sc5));
+
+	if (!netdev_xmit_more())
+		(void)hfi1_ipoib_flush_tx_list(dev, txq);
+
+	return NETDEV_TX_OK;
+}
+
+static u8 hfi1_ipoib_calc_entropy(struct sk_buff *skb)
+{
+	if (skb_transport_header_was_set(skb)) {
+		u8 *hdr = (u8 *)skb_transport_header(skb);
+
+		return (hdr[0] ^ hdr[1] ^ hdr[2] ^ hdr[3]);
+	}
+
+	return (u8)skb_get_queue_mapping(skb);
+}
+
+int hfi1_ipoib_send_dma(struct net_device *dev,
+			struct sk_buff *skb,
+			struct ib_ah *address,
+			u32 dqpn)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	struct ipoib_txparms txp;
+	struct rdma_netdev *rn = netdev_priv(dev);
+
+	if (unlikely(skb->len > rn->mtu + HFI1_IPOIB_ENCAP_LEN)) {
+		dd_dev_warn(priv->dd, "packet len %d (> %d) too long to send, dropping\n",
+			    skb->len,
+			    rn->mtu + HFI1_IPOIB_ENCAP_LEN);
+		++dev->stats.tx_dropped;
+		++dev->stats.tx_errors;
+		dev_kfree_skb_any(skb);
+		return NETDEV_TX_OK;
+	}
+
+	txp.dd = priv->dd;
+	txp.ah_attr = &ibah_to_rvtah(address)->attr;
+	txp.ibp = to_iport(priv->device, priv->port_num);
+	txp.txq = &priv->txqs[skb_get_queue_mapping(skb)];
+	txp.dqpn = dqpn;
+	txp.flow.sc5 = txp.ibp->sl_to_sc[rdma_ah_get_sl(txp.ah_attr)];
+	txp.flow.tx_queue = (u8)skb_get_queue_mapping(skb);
+	txp.entropy = hfi1_ipoib_calc_entropy(skb);
+
+	if (netdev_xmit_more() || !list_empty(&txp.txq->tx_list))
+		return hfi1_ipoib_send_dma_list(dev, skb, &txp);
+
+	return hfi1_ipoib_send_dma_single(dev, skb, &txp);
+}
+
+/*
+ * hfi1_ipoib_sdma_sleep - ipoib sdma sleep function
+ *
+ * This function gets called from sdma_send_txreq() when there are not enough
+ * sdma descriptors available to send the packet. It adds the Tx queue's wait
+ * structure to the sdma engine's dmawait list so it can be woken up when
+ * descriptors become available.
+ */
+static int hfi1_ipoib_sdma_sleep(struct sdma_engine *sde,
+				 struct iowait_work *wait,
+				 struct sdma_txreq *txreq,
+				 uint seq,
+				 bool pkts_sent)
+{
+	struct hfi1_ipoib_txq *txq =
+		container_of(wait->iow, struct hfi1_ipoib_txq, wait);
+
+	write_seqlock(&sde->waitlock);
+
+	if (likely(txq->priv->netdev->reg_state == NETREG_REGISTERED)) {
+		if (sdma_progress(sde, seq, txreq)) {
+			write_sequnlock(&sde->waitlock);
+			return -EAGAIN;
+		}
+
+		netif_stop_subqueue(txq->priv->netdev, txq->q_idx);
+
+		if (list_empty(&txq->wait.list))
+			iowait_queue(pkts_sent, wait->iow, &sde->dmawait);
+
+		write_sequnlock(&sde->waitlock);
+		return -EBUSY;
+	}
+
+	write_sequnlock(&sde->waitlock);
+	return -EINVAL;
+}
+
+/*
+ * hfi1_ipoib_sdma_wakeup - ipoib sdma wakeup function
+ *
+ * This function gets called when SDMA descriptors become available and the Tx
+ * queue's wait structure was added earlier to the sdma engine's dmawait list.
+ */
+static void hfi1_ipoib_sdma_wakeup(struct iowait *wait, int reason)
+{
+	struct hfi1_ipoib_txq *txq =
+		container_of(wait, struct hfi1_ipoib_txq, wait);
+
+	if (likely(txq->priv->netdev->reg_state == NETREG_REGISTERED))
+		iowait_schedule(wait, system_highpri_wq, WORK_CPU_UNBOUND);
+}
+
+static void hfi1_ipoib_flush_txq(struct work_struct *work)
+{
+	struct iowait_work *ioww =
+		container_of(work, struct iowait_work, iowork);
+	struct iowait *wait = iowait_ioww_to_iow(ioww);
+	struct hfi1_ipoib_txq *txq =
+		container_of(wait, struct hfi1_ipoib_txq, wait);
+	struct net_device *dev = txq->priv->netdev;
+
+	if (likely(dev->reg_state == NETREG_REGISTERED) &&
+	    likely(__netif_subqueue_stopped(dev, txq->q_idx)) &&
+	    likely(!hfi1_ipoib_flush_tx_list(dev, txq)))
+		netif_wake_subqueue(dev, txq->q_idx);
+}
+
+int hfi1_ipoib_txreq_init(struct hfi1_ipoib_dev_priv *priv)
+{
+	struct net_device *dev = priv->netdev;
+	char buf[HFI1_IPOIB_TXREQ_NAME_LEN];
+	unsigned long tx_ring_size;
+	int i;
+
+	/*
+	 * Ring holds 1 less than tx_ring_size
+	 * Round up to next power of 2 in order to hold at least tx_queue_len
+	 */
+	tx_ring_size = roundup_pow_of_two((unsigned long)dev->tx_queue_len + 1);
+
+	snprintf(buf, sizeof(buf), "hfi1_%u_ipoib_txreq_cache", priv->dd->unit);
+	priv->txreq_cache = kmem_cache_create(buf,
+					      sizeof(struct ipoib_txreq),
+					      0,
+					      0,
+					      NULL);
+	if (!priv->txreq_cache)
+		return -ENOMEM;
+
+	priv->tx_napis = kcalloc_node(dev->num_tx_queues,
+				      sizeof(struct napi_struct),
+				      GFP_ATOMIC,
+				      priv->dd->node);
+	if (!priv->tx_napis)
+		goto free_txreq_cache;
+
+	priv->txqs = kcalloc_node(dev->num_tx_queues,
+				  sizeof(struct hfi1_ipoib_txq),
+				  GFP_ATOMIC,
+				  priv->dd->node);
+	if (!priv->txqs)
+		goto free_tx_napis;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct hfi1_ipoib_txq *txq = &priv->txqs[i];
+
+		iowait_init(&txq->wait,
+			    0,
+			    hfi1_ipoib_flush_txq,
+			    NULL,
+			    hfi1_ipoib_sdma_sleep,
+			    hfi1_ipoib_sdma_wakeup,
+			    NULL,
+			    NULL);
+		txq->priv = priv;
+		txq->sde = NULL;
+		INIT_LIST_HEAD(&txq->tx_list);
+		atomic64_set(&txq->complete_txreqs, 0);
+		txq->q_idx = i;
+		txq->flow.tx_queue = 0xff;
+		txq->flow.sc5 = 0xff;
+		txq->pkts_sent = false;
+
+		netdev_queue_numa_node_write(netdev_get_tx_queue(dev, i),
+					     priv->dd->node);
+
+		txq->tx_ring.items =
+			vzalloc_node(array_size(tx_ring_size,
+						sizeof(struct ipoib_txreq)),
+				     priv->dd->node);
+		if (!txq->tx_ring.items)
+			goto free_txqs;
+
+		spin_lock_init(&txq->tx_ring.producer_lock);
+		spin_lock_init(&txq->tx_ring.consumer_lock);
+		txq->tx_ring.max_items = tx_ring_size;
+
+		txq->napi = &priv->tx_napis[i];
+		netif_tx_napi_add(dev, txq->napi,
+				  hfi1_ipoib_process_tx_ring,
+				  NAPI_POLL_WEIGHT);
+	}
+
+	return 0;
+
+free_txqs:
+	for (i--; i >= 0; i--) {
+		struct hfi1_ipoib_txq *txq = &priv->txqs[i];
+
+		netif_napi_del(txq->napi);
+		vfree(txq->tx_ring.items);
+	}
+
+	kfree(priv->txqs);
+	priv->txqs = NULL;
+
+free_tx_napis:
+	kfree(priv->tx_napis);
+	priv->tx_napis = NULL;
+
+free_txreq_cache:
+	kmem_cache_destroy(priv->txreq_cache);
+	priv->txreq_cache = NULL;
+	return -ENOMEM;
+}
+
+static void hfi1_ipoib_drain_tx_list(struct hfi1_ipoib_txq *txq)
+{
+	struct sdma_txreq *txreq;
+	struct sdma_txreq *txreq_tmp;
+	atomic64_t *complete_txreqs = &txq->complete_txreqs;
+
+	list_for_each_entry_safe(txreq, txreq_tmp, &txq->tx_list, list) {
+		struct ipoib_txreq *tx =
+			container_of(txreq, struct ipoib_txreq, txreq);
+
+		list_del(&txreq->list);
+		sdma_txclean(txq->priv->dd, &tx->txreq);
+		dev_kfree_skb_any(tx->skb);
+		kmem_cache_free(txq->priv->txreq_cache, tx);
+		atomic64_inc(complete_txreqs);
+	}
+
+	if (hfi1_ipoib_txreqs(txq->sent_txreqs, atomic64_read(complete_txreqs)))
+		dd_dev_warn(txq->priv->dd,
+			    "txq %d not empty found %llu requests\n",
+			    txq->q_idx,
+			    hfi1_ipoib_txreqs(txq->sent_txreqs,
+					      atomic64_read(complete_txreqs)));
+}
+
+void hfi1_ipoib_txreq_deinit(struct hfi1_ipoib_dev_priv *priv)
+{
+	int i;
+
+	for (i = 0; i < priv->netdev->num_tx_queues; i++) {
+		struct hfi1_ipoib_txq *txq = &priv->txqs[i];
+
+		iowait_cancel_work(&txq->wait);
+		iowait_sdma_drain(&txq->wait);
+		hfi1_ipoib_drain_tx_list(txq);
+		netif_napi_del(txq->napi);
+		(void)hfi1_ipoib_drain_tx_ring(txq, txq->tx_ring.max_items);
+		vfree(txq->tx_ring.items);
+	}
+
+	kfree(priv->txqs);
+	priv->txqs = NULL;
+
+	kfree(priv->tx_napis);
+	priv->tx_napis = NULL;
+
+	kmem_cache_destroy(priv->txreq_cache);
+	priv->txreq_cache = NULL;
+}
+
+void hfi1_ipoib_napi_tx_enable(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct hfi1_ipoib_txq *txq = &priv->txqs[i];
+
+		napi_enable(txq->napi);
+	}
+}
+
+void hfi1_ipoib_napi_tx_disable(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	int i;
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		struct hfi1_ipoib_txq *txq = &priv->txqs[i];
+
+		napi_disable(txq->napi);
+		(void)hfi1_ipoib_drain_tx_ring(txq, txq->tx_ring.max_items);
+	}
+}
diff --git a/drivers/infiniband/hw/hfi1/trace.c b/drivers/infiniband/hw/hfi1/trace.c
index 9a3d236..c8a9988 100644
--- a/drivers/infiniband/hw/hfi1/trace.c
+++ b/drivers/infiniband/hw/hfi1/trace.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -47,6 +47,7 @@
 #define CREATE_TRACE_POINTS
 #include "trace.h"
 #include "exp_rcv.h"
+#include "ipoib.h"
 
 static u8 __get_ib_hdr_len(struct ib_header *hdr)
 {
@@ -126,6 +127,7 @@ const char *hfi1_trace_get_packet_l2_str(u8 l2)
 #define RETH_PRN "reth vaddr:0x%.16llx rkey:0x%.8x dlen:0x%.8x"
 #define AETH_PRN "aeth syn:0x%.2x %s msn:0x%.8x"
 #define DETH_PRN "deth qkey:0x%.8x sqpn:0x%.6x"
+#define DETH_ENTROPY_PRN "deth qkey:0x%.8x sqpn:0x%.6x entropy:0x%.2x"
 #define IETH_PRN "ieth rkey:0x%.8x"
 #define ATOMICACKETH_PRN "origdata:%llx"
 #define ATOMICETH_PRN "vaddr:0x%llx rkey:0x%.8x sdata:%llx cdata:%llx"
@@ -444,6 +446,12 @@ const char *parse_everbs_hdrs(
 		break;
 	/* deth */
 	case OP(UD, SEND_ONLY):
+		trace_seq_printf(p, DETH_ENTROPY_PRN,
+				 be32_to_cpu(eh->ud.deth[0]),
+				 be32_to_cpu(eh->ud.deth[1]) & RVT_QPN_MASK,
+				 be32_to_cpu(eh->ud.deth[1]) >>
+					     HFI1_IPOIB_ENTROPY_SHIFT);
+		break;
 	case OP(UD, SEND_ONLY_WITH_IMMEDIATE):
 		trace_seq_printf(p, DETH_PRN,
 				 be32_to_cpu(eh->ud.deth[0]),
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 1f779fa..9abe2c8 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2198,6 +2198,7 @@ struct rdma_netdev {
 	void              *clnt_priv;
 	struct ib_device  *hca;
 	u8                 port_num;
+	int                mtu;
 
 	/*
 	 * cleanup function must be specified.



* [PATCH for-next 03/16] IB/hfi1: Add the transmit side of a datagram ipoib RDMA netdev
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford; +Cc: linux-rdma, Mike Marciniszyn, Gary Leshner, Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

This implements the transmit side of the multiple transmit queue RDMA
netdev used to accelerate ipoib.  The receive side remains the ipoib
internal implementation.

The init/uninit/open/stop netdev operations are saved off and called by the
versions within the hfi1 netdev in order to initialize the connected mode
resources present in ipoib, thus allowing us to switch between datagram and
connected modes.

The datagram queue pair instantiated by the ipoib ulp is used by this
implementation for its queue pair number and to register with multicast.

The above queue pair is not used on transmit other than for its qpn, as the
verbs layer is skipped and packets are submitted directly to the sdma
engines.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/Makefile     |    1 
 drivers/infiniband/hw/hfi1/ipoib.h      |    5 +
 drivers/infiniband/hw/hfi1/ipoib_main.c |  283 +++++++++++++++++++++++++++++++
 3 files changed, 289 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_main.c

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 09ef0b8..0b25713 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -22,6 +22,7 @@ hfi1-y := \
 	init.o \
 	intr.o \
 	iowait.o \
+	ipoib_main.o \
 	ipoib_tx.o \
 	mad.o \
 	mmu_rb.o \
diff --git a/drivers/infiniband/hw/hfi1/ipoib.h b/drivers/infiniband/hw/hfi1/ipoib.h
index 2b541ab..c2e63ca 100644
--- a/drivers/infiniband/hw/hfi1/ipoib.h
+++ b/drivers/infiniband/hw/hfi1/ipoib.h
@@ -142,4 +142,9 @@ int hfi1_ipoib_send_dma(struct net_device *dev,
 void hfi1_ipoib_napi_tx_enable(struct net_device *dev);
 void hfi1_ipoib_napi_tx_disable(struct net_device *dev);
 
+int hfi1_ipoib_rn_get_params(struct ib_device *device,
+			     u8 port_num,
+			     enum rdma_netdev_t type,
+			     struct rdma_netdev_alloc_params *params);
+
 #endif /* _IPOIB_H */
diff --git a/drivers/infiniband/hw/hfi1/ipoib_main.c b/drivers/infiniband/hw/hfi1/ipoib_main.c
new file mode 100644
index 0000000..304a5ac
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ipoib_main.c
@@ -0,0 +1,283 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for ipoib functionality
+ */
+
+#include "ipoib.h"
+#include "hfi.h"
+
+static u32 qpn_from_mac(u8 *mac_arr)
+{
+	return (u32)mac_arr[1] << 16 | mac_arr[2] << 8 | mac_arr[3];
+}
+
+static int hfi1_ipoib_dev_init(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+
+	priv->netstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
+
+	return priv->netdev_ops->ndo_init(dev);
+}
+
+static void hfi1_ipoib_dev_uninit(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+
+	priv->netdev_ops->ndo_uninit(dev);
+}
+
+static int hfi1_ipoib_dev_open(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	int ret;
+
+	ret = priv->netdev_ops->ndo_open(dev);
+	if (!ret) {
+		struct hfi1_ibport *ibp = to_iport(priv->device,
+						   priv->port_num);
+		struct rvt_qp *qp;
+		u32 qpn = qpn_from_mac(priv->netdev->dev_addr);
+
+		rcu_read_lock();
+		qp = rvt_lookup_qpn(ib_to_rvt(priv->device), &ibp->rvp, qpn);
+		if (!qp) {
+			rcu_read_unlock();
+			priv->netdev_ops->ndo_stop(dev);
+			return -EINVAL;
+		}
+		rvt_get_qp(qp);
+		priv->qp = qp;
+		rcu_read_unlock();
+
+		hfi1_ipoib_napi_tx_enable(dev);
+	}
+
+	return ret;
+}
+
+static int hfi1_ipoib_dev_stop(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+
+	if (!priv->qp)
+		return 0;
+
+	hfi1_ipoib_napi_tx_disable(dev);
+
+	rvt_put_qp(priv->qp);
+	priv->qp = NULL;
+
+	return priv->netdev_ops->ndo_stop(dev);
+}
+
+static void hfi1_ipoib_dev_get_stats64(struct net_device *dev,
+				       struct rtnl_link_stats64 *storage)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	u64 rx_packets = 0ull;
+	u64 rx_bytes = 0ull;
+	u64 tx_packets = 0ull;
+	u64 tx_bytes = 0ull;
+	int i;
+
+	netdev_stats_to_stats64(storage, &dev->stats);
+
+	for_each_possible_cpu(i) {
+		const struct pcpu_sw_netstats *stats;
+		unsigned int start;
+		u64 trx_packets;
+		u64 trx_bytes;
+		u64 ttx_packets;
+		u64 ttx_bytes;
+
+		stats = per_cpu_ptr(priv->netstats, i);
+		do {
+			start = u64_stats_fetch_begin_irq(&stats->syncp);
+			trx_packets = stats->rx_packets;
+			trx_bytes = stats->rx_bytes;
+			ttx_packets = stats->tx_packets;
+			ttx_bytes = stats->tx_bytes;
+		} while (u64_stats_fetch_retry_irq(&stats->syncp, start));
+
+		rx_packets += trx_packets;
+		rx_bytes += trx_bytes;
+		tx_packets += ttx_packets;
+		tx_bytes += ttx_bytes;
+	}
+
+	storage->rx_packets += rx_packets;
+	storage->rx_bytes += rx_bytes;
+	storage->tx_packets += tx_packets;
+	storage->tx_bytes += tx_bytes;
+}
+
+static const struct net_device_ops hfi1_ipoib_netdev_ops = {
+	.ndo_init         = hfi1_ipoib_dev_init,
+	.ndo_uninit       = hfi1_ipoib_dev_uninit,
+	.ndo_open         = hfi1_ipoib_dev_open,
+	.ndo_stop         = hfi1_ipoib_dev_stop,
+	.ndo_get_stats64  = hfi1_ipoib_dev_get_stats64,
+};
+
+static int hfi1_ipoib_send(struct net_device *dev,
+			   struct sk_buff *skb,
+			   struct ib_ah *address,
+			   u32 dqpn)
+{
+	return hfi1_ipoib_send_dma(dev, skb, address, dqpn);
+}
+
+static int hfi1_ipoib_mcast_attach(struct net_device *dev,
+				   struct ib_device *device,
+				   union ib_gid *mgid,
+				   u16 mlid,
+				   int set_qkey,
+				   u32 qkey)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	u32 qpn = (u32)qpn_from_mac(priv->netdev->dev_addr);
+	struct hfi1_ibport *ibp = to_iport(priv->device, priv->port_num);
+	struct rvt_qp *qp;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+
+	qp = rvt_lookup_qpn(ib_to_rvt(priv->device), &ibp->rvp, qpn);
+	if (qp) {
+		rvt_get_qp(qp);
+		rcu_read_unlock();
+		if (set_qkey)
+			priv->qkey = qkey;
+
+		/* attach QP to multicast group */
+		ret = ib_attach_mcast(&qp->ibqp, mgid, mlid);
+		rvt_put_qp(qp);
+	} else {
+		rcu_read_unlock();
+	}
+
+	return ret;
+}
+
+static int hfi1_ipoib_mcast_detach(struct net_device *dev,
+				   struct ib_device *device,
+				   union ib_gid *mgid,
+				   u16 mlid)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	u32 qpn = (u32)qpn_from_mac(priv->netdev->dev_addr);
+	struct hfi1_ibport *ibp = to_iport(priv->device, priv->port_num);
+	struct rvt_qp *qp;
+	int ret = -EINVAL;
+
+	rcu_read_lock();
+
+	qp = rvt_lookup_qpn(ib_to_rvt(priv->device), &ibp->rvp, qpn);
+	if (qp) {
+		rvt_get_qp(qp);
+		rcu_read_unlock();
+		ret = ib_detach_mcast(&qp->ibqp, mgid, mlid);
+		rvt_put_qp(qp);
+	} else {
+		rcu_read_unlock();
+	}
+	return ret;
+}
+
+static void hfi1_ipoib_netdev_dtor(struct net_device *dev)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+
+	hfi1_ipoib_txreq_deinit(priv);
+
+	free_percpu(priv->netstats);
+}
+
+static void hfi1_ipoib_free_rdma_netdev(struct net_device *dev)
+{
+	hfi1_ipoib_netdev_dtor(dev);
+	free_netdev(dev);
+}
+
+static void hfi1_ipoib_set_id(struct net_device *dev, int id)
+{
+	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+
+	priv->pkey_index = (u16)id;
+	ib_query_pkey(priv->device,
+		      priv->port_num,
+		      priv->pkey_index,
+		      &priv->pkey);
+}
+
+static int hfi1_ipoib_setup_rn(struct ib_device *device,
+			       u8 port_num,
+			       struct net_device *netdev,
+			       void *param)
+{
+	struct hfi1_devdata *dd = dd_from_ibdev(device);
+	struct rdma_netdev *rn = netdev_priv(netdev);
+	struct hfi1_ipoib_dev_priv *priv;
+	int rc;
+
+	rn->send = hfi1_ipoib_send;
+	rn->attach_mcast = hfi1_ipoib_mcast_attach;
+	rn->detach_mcast = hfi1_ipoib_mcast_detach;
+	rn->set_id = hfi1_ipoib_set_id;
+	rn->hca = device;
+	rn->port_num = port_num;
+	rn->mtu = netdev->mtu;
+
+	priv = hfi1_ipoib_priv(netdev);
+	priv->dd = dd;
+	priv->netdev = netdev;
+	priv->device = device;
+	priv->port_num = port_num;
+	priv->netdev_ops = netdev->netdev_ops;
+
+	netdev->netdev_ops = &hfi1_ipoib_netdev_ops;
+
+	ib_query_pkey(device, port_num, priv->pkey_index, &priv->pkey);
+
+	rc = hfi1_ipoib_txreq_init(priv);
+	if (rc) {
+		dd_dev_err(dd, "IPoIB netdev TX init - failed(%d)\n", rc);
+		hfi1_ipoib_free_rdma_netdev(netdev);
+		return rc;
+	}
+
+	netdev->priv_destructor = hfi1_ipoib_netdev_dtor;
+	netdev->needs_free_netdev = true;
+
+	return 0;
+}
+
+int hfi1_ipoib_rn_get_params(struct ib_device *device,
+			     u8 port_num,
+			     enum rdma_netdev_t type,
+			     struct rdma_netdev_alloc_params *params)
+{
+	struct hfi1_devdata *dd = dd_from_ibdev(device);
+
+	if (type != RDMA_NETDEV_IPOIB)
+		return -EOPNOTSUPP;
+
+	if (!HFI1_CAP_IS_KSET(AIP))
+		return -EOPNOTSUPP;
+
+	if (!port_num || port_num > dd->num_pports)
+		return -EINVAL;
+
+	params->sizeof_priv = sizeof(struct hfi1_ipoib_rdma_netdev);
+	params->txqs = dd->num_sdma;
+	params->param = NULL;
+	params->initialize_rdma_netdev = hfi1_ipoib_setup_rn;
+
+	return 0;
+}



* [PATCH for-next 04/16] IB/hfi1: Remove module parameter for KDETH qpns
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro, Gary Leshner,
	Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

The module parameter for KDETH qpns is removed in favor of always using
the default value of 0x80 as the qpn prefix. Defines have been added for
the various KDETH values, including the 0x80 prefix.

The reserved range now starts at the KDETH qpn base (prefix 0x80, i.e.
qpn 0x800000) and extends up to and including the last qpn of the other
reserved, prefixed QP types. The other prefixed-QP define names are
adjusted to match the KDETH define names.
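
For illustration, the resulting reservation expressed with the defines
added to rdmavt_qp.h in this patch (this is what the verbs.c hunk below
sets; the values in the comments follow from the defines):

	dd->verbs_dev.rdi.dparms.qpn_res_start = RVT_KDETH_QP_BASE; /* 0x800000 */
	dd->verbs_dev.rdi.dparms.qpn_res_end   = RVT_AIP_QP_MAX;    /* 0x81ffff */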

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/chip.c     |   19 +++----------------
 drivers/infiniband/hw/hfi1/common.h   |    7 -------
 drivers/infiniband/hw/hfi1/file_ops.c |    4 ++--
 drivers/infiniband/hw/hfi1/hfi.h      |    3 +--
 drivers/infiniband/hw/hfi1/tid_rdma.c |    4 ++--
 drivers/infiniband/hw/hfi1/verbs.c    |    7 +++----
 include/rdma/rdmavt_qp.h              |   29 ++++++++++++++++++++++++++++-
 7 files changed, 39 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index e0b1238..c08bf81 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -67,10 +67,6 @@
 #include "debugfs.h"
 #include "fault.h"
 
-uint kdeth_qp;
-module_param_named(kdeth_qp, kdeth_qp, uint, S_IRUGO);
-MODULE_PARM_DESC(kdeth_qp, "Set the KDETH queue pair prefix");
-
 uint num_vls = HFI1_MAX_VLS_SUPPORTED;
 module_param(num_vls, uint, S_IRUGO);
 MODULE_PARM_DESC(num_vls, "Set number of Virtual Lanes to use (1-8)");
@@ -14119,21 +14115,12 @@ static void init_early_variables(struct hfi1_devdata *dd)
 
 static void init_kdeth_qp(struct hfi1_devdata *dd)
 {
-	/* user changed the KDETH_QP */
-	if (kdeth_qp != 0 && kdeth_qp >= 0xff) {
-		/* out of range or illegal value */
-		dd_dev_err(dd, "Invalid KDETH queue pair prefix, ignoring");
-		kdeth_qp = 0;
-	}
-	if (kdeth_qp == 0)	/* not set, or failed range check */
-		kdeth_qp = DEFAULT_KDETH_QP;
-
 	write_csr(dd, SEND_BTH_QP,
-		  (kdeth_qp & SEND_BTH_QP_KDETH_QP_MASK) <<
+		  (RVT_KDETH_QP_PREFIX & SEND_BTH_QP_KDETH_QP_MASK) <<
 		  SEND_BTH_QP_KDETH_QP_SHIFT);
 
 	write_csr(dd, RCV_BTH_QP,
-		  (kdeth_qp & RCV_BTH_QP_KDETH_QP_MASK) <<
+		  (RVT_KDETH_QP_PREFIX & RCV_BTH_QP_KDETH_QP_MASK) <<
 		  RCV_BTH_QP_KDETH_QP_SHIFT);
 }
 
diff --git a/drivers/infiniband/hw/hfi1/common.h b/drivers/infiniband/hw/hfi1/common.h
index 1f7107e..6062545 100644
--- a/drivers/infiniband/hw/hfi1/common.h
+++ b/drivers/infiniband/hw/hfi1/common.h
@@ -72,13 +72,6 @@
  * compilation unit
  */
 
-/*
- * If a packet's QP[23:16] bits match this value, then it is
- * a PSM packet and the hardware will expect a KDETH header
- * following the BTH.
- */
-#define DEFAULT_KDETH_QP 0x80
-
 /* driver/hw feature set bitmask */
 #define HFI1_CAP_USER_SHIFT      24
 #define HFI1_CAP_MASK            ((1UL << HFI1_CAP_USER_SHIFT) - 1)
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index 2591158..00ba8d9 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015-2017 Intel Corporation.
+ * Copyright(c) 2015-2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -1266,7 +1266,7 @@ static int get_base_info(struct hfi1_filedata *fd, unsigned long arg, u32 len)
 	memset(&binfo, 0, sizeof(binfo));
 	binfo.hw_version = dd->revision;
 	binfo.sw_version = HFI1_KERN_SWVERSION;
-	binfo.bthqp = kdeth_qp;
+	binfo.bthqp = RVT_KDETH_QP_PREFIX;
 	binfo.jkey = uctxt->jkey;
 	/*
 	 * If more than 64 contexts are enabled the allocated credit
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index cae12f4..92fceb5 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1,7 +1,7 @@
 #ifndef _HFI1_KERNEL_H
 #define _HFI1_KERNEL_H
 /*
- * Copyright(c) 2015-2018 Intel Corporation.
+ * Copyright(c) 2015-2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -2252,7 +2252,6 @@ static inline void flush_wc(void)
 extern unsigned long n_krcvqs;
 extern uint krcvqs[];
 extern int krcvqsset;
-extern uint kdeth_qp;
 extern uint loopback;
 extern uint quick_linkup;
 extern uint rcv_intr_timeout;
diff --git a/drivers/infiniband/hw/hfi1/tid_rdma.c b/drivers/infiniband/hw/hfi1/tid_rdma.c
index 8a2e0d9..243b4ba 100644
--- a/drivers/infiniband/hw/hfi1/tid_rdma.c
+++ b/drivers/infiniband/hw/hfi1/tid_rdma.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
 /*
- * Copyright(c) 2018 Intel Corporation.
+ * Copyright(c) 2018 - 2020 Intel Corporation.
  *
  */
 
@@ -194,7 +194,7 @@ void tid_rdma_opfn_init(struct rvt_qp *qp, struct tid_rdma_params *p)
 {
 	struct hfi1_qp_priv *priv = qp->priv;
 
-	p->qp = (kdeth_qp << 16) | priv->rcd->ctxt;
+	p->qp = (RVT_KDETH_QP_PREFIX << 16) | priv->rcd->ctxt;
 	p->max_len = TID_RDMA_MAX_SEGMENT_SIZE;
 	p->jkey = priv->rcd->jkey;
 	p->max_read = TID_RDMA_MAX_READ_SEGS_PER_REQ;
diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 089e201..2bddc28 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -1861,9 +1861,8 @@ int hfi1_register_ib_device(struct hfi1_devdata *dd)
 	dd->verbs_dev.rdi.dparms.qpn_start = 0;
 	dd->verbs_dev.rdi.dparms.qpn_inc = 1;
 	dd->verbs_dev.rdi.dparms.qos_shift = dd->qos_shift;
-	dd->verbs_dev.rdi.dparms.qpn_res_start = kdeth_qp << 16;
-	dd->verbs_dev.rdi.dparms.qpn_res_end =
-	dd->verbs_dev.rdi.dparms.qpn_res_start + 65535;
+	dd->verbs_dev.rdi.dparms.qpn_res_start = RVT_KDETH_QP_BASE;
+	dd->verbs_dev.rdi.dparms.qpn_res_end = RVT_AIP_QP_MAX;
 	dd->verbs_dev.rdi.dparms.max_rdma_atomic = HFI1_MAX_RDMA_ATOMIC;
 	dd->verbs_dev.rdi.dparms.psn_mask = PSN_MASK;
 	dd->verbs_dev.rdi.dparms.psn_shift = PSN_SHIFT;
diff --git a/include/rdma/rdmavt_qp.h b/include/rdma/rdmavt_qp.h
index 0d5c70e..afd7be6 100644
--- a/include/rdma/rdmavt_qp.h
+++ b/include/rdma/rdmavt_qp.h
@@ -2,7 +2,7 @@
 #define DEF_RDMAVT_INCQP_H
 
 /*
- * Copyright(c) 2016 - 2019 Intel Corporation.
+ * Copyright(c) 2016 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -69,6 +69,33 @@
 #define RVT_R_COMM_EST  0x10
 
 /*
+ * If a packet's QP[23:16] bits match this value, then it is
+ * a PSM packet and the hardware will expect a KDETH header
+ * following the BTH.
+ */
+#define RVT_KDETH_QP_PREFIX       0x80
+#define RVT_KDETH_QP_SUFFIX       0xffff
+#define RVT_KDETH_QP_PREFIX_MASK  0x00ff0000
+#define RVT_KDETH_QP_PREFIX_SHIFT 16
+#define RVT_KDETH_QP_BASE         (u32)(RVT_KDETH_QP_PREFIX << \
+					RVT_KDETH_QP_PREFIX_SHIFT)
+#define RVT_KDETH_QP_MAX          (u32)(RVT_KDETH_QP_BASE + RVT_KDETH_QP_SUFFIX)
+
+/*
+ * If a packet's LNH == BTH and DEST QPN[23:16] in the BTH match this
+ * prefix value, then it is an AIP packet with a DETH containing the entropy
+ * value in byte 4 following the BTH.
+ */
+#define RVT_AIP_QP_PREFIX       0x81
+#define RVT_AIP_QP_SUFFIX       0xffff
+#define RVT_AIP_QP_PREFIX_MASK  0x00ff0000
+#define RVT_AIP_QP_PREFIX_SHIFT 16
+#define RVT_AIP_QP_BASE         (u32)(RVT_AIP_QP_PREFIX << \
+				      RVT_AIP_QP_PREFIX_SHIFT)
+#define RVT_AIP_QPN_MAX         BIT(RVT_AIP_QP_PREFIX_SHIFT)
+#define RVT_AIP_QP_MAX          (u32)(RVT_AIP_QP_BASE + RVT_AIP_QPN_MAX - 1)
+
+/*
  * Bit definitions for s_flags.
  *
  * RVT_S_SIGNAL_REQ_WR - set if QP send WRs contain completion signaled


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 05/16] IB/{rdmavt, hfi1}: Implement creation of accelerated UD QPs
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (3 preceding siblings ...)
  2020-02-10 13:18 ` [PATCH for-next 04/16] IB/hfi1: Remove module parameter for KDETH qpns Dennis Dalessandro
@ 2020-02-10 13:18 ` Dennis Dalessandro
  2020-02-10 13:18 ` [PATCH for-next 06/16] IB/hfi1: RSM rules for AIP Dennis Dalessandro
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro, Gary Leshner,
	Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

Add the capability to create a QPN that is recognized as an accelerated
UD QP for ipoib.

This is accomplished by reserving 0x81 in byte[0] of the qpn as the
prefix for these qp types and reserving qpns between 0x810000 and
0x81ffff.

The hfi1 capability mask already contained a flag for the VNIC netdev.
This has been renamed and extended to include both VNIC and ipoib.

The rvt code to allocate qps now recognizes this flag and sets 0x81
into byte[0] of the qpn.

The QPN allocation code is modified to restart the QPN numbering when a
value with bits set in byte[0] is detected for a UD QP that is being
requested for netdev use. A regular UD QP may still have bits set in
byte[0] of its QPN, preserving the previously normal behavior.

The code to free the qpn now checks for the AIP prefix value of 0x81 and
removes it from the qpn before being freed so that the lower 16 bit
number can be reused.

This patch requires minor changes in the IB core and ipoib to facilitate
the creation of accelerated UD QPs.
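
For illustration only (not part of this patch), a minimal sketch of how the
byte[0] prefix described above relates to the RVT_AIP_* macros introduced
earlier in the series; the helper names are hypothetical:

/* Illustrative sketch only; helper names are hypothetical. */
static inline u32 example_aip_qpn_encode(u32 low16)
{
	/* 0x81 in bits 23:16 marks an accelerated (AIP) UD QPN */
	return RVT_AIP_QP_BASE | (low16 & RVT_AIP_QP_SUFFIX);
}

static inline bool example_is_aip_qpn(u32 qpn)
{
	/* same check rvt_free_qpn() uses before reusing the low 16 bits */
	return (qpn & RVT_AIP_QP_PREFIX_MASK) == RVT_AIP_QP_BASE;
}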

Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/verbs.c         |    2 +-
 drivers/infiniband/sw/rdmavt/qp.c          |   24 +++++++++++++++++++-----
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |    3 +++
 include/rdma/ib_verbs.h                    |    4 ++--
 include/rdma/opa_vnic.h                    |    4 ++--
 5 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 2bddc28..8061e93 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -1340,7 +1340,7 @@ static void hfi1_fill_device_attr(struct hfi1_devdata *dd)
 			IB_DEVICE_SYS_IMAGE_GUID | IB_DEVICE_RC_RNR_NAK_GEN |
 			IB_DEVICE_PORT_ACTIVE_EVENT | IB_DEVICE_SRQ_RESIZE |
 			IB_DEVICE_MEM_MGT_EXTENSIONS |
-			IB_DEVICE_RDMA_NETDEV_OPA_VNIC;
+			IB_DEVICE_RDMA_NETDEV_OPA;
 	rdi->dparms.props.page_size_cap = PAGE_SIZE;
 	rdi->dparms.props.vendor_id = dd->oui1 << 16 | dd->oui2 << 8 | dd->oui3;
 	rdi->dparms.props.vendor_part_id = dd->pcidev->device;
diff --git a/drivers/infiniband/sw/rdmavt/qp.c b/drivers/infiniband/sw/rdmavt/qp.c
index 7858d49..b59729f 100644
--- a/drivers/infiniband/sw/rdmavt/qp.c
+++ b/drivers/infiniband/sw/rdmavt/qp.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2016 - 2019 Intel Corporation.
+ * Copyright(c) 2016 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -525,15 +525,18 @@ static inline unsigned mk_qpn(struct rvt_qpn_table *qpt,
  * @rdi: rvt device info structure
  * @qpt: queue pair number table pointer
  * @port_num: IB port number, 1 based, comes from core
+ * @exclude_prefix: prefix of special queue pair number being allocated
  *
  * Return: The queue pair number
  */
 static int alloc_qpn(struct rvt_dev_info *rdi, struct rvt_qpn_table *qpt,
-		     enum ib_qp_type type, u8 port_num)
+		     enum ib_qp_type type, u8 port_num, u8 exclude_prefix)
 {
 	u32 i, offset, max_scan, qpn;
 	struct rvt_qpn_map *map;
 	u32 ret;
+	u32 max_qpn = exclude_prefix == RVT_AIP_QP_PREFIX ?
+		RVT_AIP_QPN_MAX : RVT_QPN_MAX;
 
 	if (rdi->driver_f.alloc_qpn)
 		return rdi->driver_f.alloc_qpn(rdi, qpt, type, port_num);
@@ -553,7 +556,7 @@ static int alloc_qpn(struct rvt_dev_info *rdi, struct rvt_qpn_table *qpt,
 	}
 
 	qpn = qpt->last + qpt->incr;
-	if (qpn >= RVT_QPN_MAX)
+	if (qpn >= max_qpn)
 		qpn = qpt->incr | ((qpt->last & 1) ^ 1);
 	/* offset carries bit 0 */
 	offset = qpn & RVT_BITS_PER_PAGE_MASK;
@@ -987,6 +990,9 @@ static void rvt_free_qpn(struct rvt_qpn_table *qpt, u32 qpn)
 {
 	struct rvt_qpn_map *map;
 
+	if ((qpn & RVT_AIP_QP_PREFIX_MASK) == RVT_AIP_QP_BASE)
+		qpn &= RVT_AIP_QP_SUFFIX;
+
 	map = qpt->map + (qpn & RVT_QPN_MASK) / RVT_BITS_PER_PAGE;
 	if (map->page)
 		clear_bit(qpn & RVT_BITS_PER_PAGE_MASK, map->page);
@@ -1074,13 +1080,15 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 	struct rvt_dev_info *rdi = ib_to_rvt(ibpd->device);
 	void *priv = NULL;
 	size_t sqsize;
+	u8 exclude_prefix = 0;
 
 	if (!rdi)
 		return ERR_PTR(-EINVAL);
 
 	if (init_attr->cap.max_send_sge > rdi->dparms.props.max_send_sge ||
 	    init_attr->cap.max_send_wr > rdi->dparms.props.max_qp_wr ||
-	    init_attr->create_flags)
+	    (init_attr->create_flags &&
+	     init_attr->create_flags != IB_QP_CREATE_NETDEV_USE))
 		return ERR_PTR(-EINVAL);
 
 	/* Check receive queue parameters if no SRQ is specified. */
@@ -1199,14 +1207,20 @@ struct ib_qp *rvt_create_qp(struct ib_pd *ibpd,
 			goto bail_driver_priv;
 		}
 
+		if (init_attr->create_flags & IB_QP_CREATE_NETDEV_USE)
+			exclude_prefix = RVT_AIP_QP_PREFIX;
+
 		err = alloc_qpn(rdi, &rdi->qp_dev->qpn_table,
 				init_attr->qp_type,
-				init_attr->port_num);
+				init_attr->port_num,
+				exclude_prefix);
 		if (err < 0) {
 			ret = ERR_PTR(err);
 			goto bail_rq_wq;
 		}
 		qp->ibqp.qp_num = err;
+		if (init_attr->create_flags & IB_QP_CREATE_NETDEV_USE)
+			qp->ibqp.qp_num |= RVT_AIP_QP_BASE;
 		qp->port_num = init_attr->port_num;
 		rvt_init_qp(rdi, qp, init_attr->qp_type);
 		if (rdi->driver_f.qp_priv_init) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index b69304d..587252f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -206,6 +206,9 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
 	if (priv->hca_caps & IB_DEVICE_MANAGED_FLOW_STEERING)
 		init_attr.create_flags |= IB_QP_CREATE_NETIF_QP;
 
+	if (priv->hca_caps & IB_DEVICE_RDMA_NETDEV_OPA)
+		init_attr.create_flags |= IB_QP_CREATE_NETDEV_USE;
+
 	priv->qp = ib_create_qp(priv->pd, &init_attr);
 	if (IS_ERR(priv->qp)) {
 		pr_warn("%s: failed to create QP\n", ca->name);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 9abe2c8..8b805ab3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -305,7 +305,7 @@ enum ib_device_cap_flags {
 	IB_DEVICE_VIRTUAL_FUNCTION		= (1ULL << 33),
 	/* Deprecated. Please use IB_RAW_PACKET_CAP_SCATTER_FCS. */
 	IB_DEVICE_RAW_SCATTER_FCS		= (1ULL << 34),
-	IB_DEVICE_RDMA_NETDEV_OPA_VNIC		= (1ULL << 35),
+	IB_DEVICE_RDMA_NETDEV_OPA		= (1ULL << 35),
 	/* The device supports padding incoming writes to cacheline. */
 	IB_DEVICE_PCI_WRITE_END_PADDING		= (1ULL << 36),
 	IB_DEVICE_ALLOW_USER_UNREG		= (1ULL << 37),
@@ -1111,7 +1111,7 @@ enum ib_qp_create_flags {
 	IB_QP_CREATE_MANAGED_RECV               = 1 << 4,
 	IB_QP_CREATE_NETIF_QP			= 1 << 5,
 	IB_QP_CREATE_INTEGRITY_EN		= 1 << 6,
-	/* FREE					= 1 << 7, */
+	IB_QP_CREATE_NETDEV_USE			= 1 << 7,
 	IB_QP_CREATE_SCATTER_FCS		= 1 << 8,
 	IB_QP_CREATE_CVLAN_STRIPPING		= 1 << 9,
 	IB_QP_CREATE_SOURCE_QPN			= 1 << 10,
diff --git a/include/rdma/opa_vnic.h b/include/rdma/opa_vnic.h
index 0c07a70..02e6d28 100644
--- a/include/rdma/opa_vnic.h
+++ b/include/rdma/opa_vnic.h
@@ -1,7 +1,7 @@
 #ifndef _OPA_VNIC_H
 #define _OPA_VNIC_H
 /*
- * Copyright(c) 2017 Intel Corporation.
+ * Copyright(c) 2017 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -132,7 +132,7 @@ struct opa_vnic_stats {
 static inline bool rdma_cap_opa_vnic(struct ib_device *device)
 {
 	return !!(device->attrs.device_cap_flags &
-		  IB_DEVICE_RDMA_NETDEV_OPA_VNIC);
+		  IB_DEVICE_RDMA_NETDEV_OPA);
 }
 
 #endif /* _OPA_VNIC_H */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 06/16] IB/hfi1: RSM rules for AIP
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (4 preceding siblings ...)
  2020-02-10 13:18 ` [PATCH for-next 05/16] IB/{rdmavt, hfi1}: Implement creation of accelerated UD QPs Dennis Dalessandro
@ 2020-02-10 13:18 ` Dennis Dalessandro
  2020-02-10 13:18 ` [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit Dennis Dalessandro
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Grzegorz Andrejczuk, Kaike Wan

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

This is the implementation of the RSM rule for AIP packets.
The AIP rule uses rule RSM2 and matches a standard InfiniBand
packet containing a BTH (LNH==BTH) whose Dest QPN is prefixed
with the value 0x81. Spreading between receive contexts is done
using source QPN bits.

VNIC and AIP will share receive contexts, so their rules point
to the same RMT entries and their shared code is moved to
separate functions. If either rule is already active, the RMT
mapping is skipped for the latter one.

The function hfi1_vnic_is_rsm_full() is made more general and
moved from the main header to chip.c.

The order of the RSM rules is changed because the AIP rule, being
more specific, needs to be placed before the more general QOS
rule. The rules occupy the last two RSM registers.
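
Purely as an illustration (not part of the patch), a rough sketch of the
classification the AIP rule performs, using the BTH_DESTQP_* defines added
in chip.c below; the function name is hypothetical and the hardware match
is of course done by the RSM engine, not in software:

/* Illustration only: conceptual model of the RSM2 match for AIP. */
static bool example_aip_rsm_match(bool lnh_is_bth, u32 bth_dest_qpn)
{
	/* match LNH==BTH and Dest QPN byte[0] == 0x81 */
	return lnh_is_bth &&
	       ((bth_dest_qpn >> 16) & BTH_DESTQP_MASK) == BTH_DESTQP_VALUE;
}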

Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/chip.c |  171 ++++++++++++++++++++++++++++---------
 drivers/infiniband/hw/hfi1/chip.h |    4 +
 drivers/infiniband/hw/hfi1/hfi.h  |    8 +-
 drivers/infiniband/hw/hfi1/init.c |    4 +
 4 files changed, 137 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index c08bf81..be1fb29 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -124,13 +124,15 @@ struct flag_table {
 
 /*
  * RSM instance allocation
- *   0 - Verbs
- *   1 - User Fecn Handling
- *   2 - Vnic
+ *   0 - User Fecn Handling
+ *   1 - Vnic
+ *   2 - AIP
+ *   3 - Verbs
  */
-#define RSM_INS_VERBS             0
-#define RSM_INS_FECN              1
-#define RSM_INS_VNIC              2
+#define RSM_INS_FECN              0
+#define RSM_INS_VNIC              1
+#define RSM_INS_AIP               2
+#define RSM_INS_VERBS             3
 
 /* Bit offset into the GUID which carries HFI id information */
 #define GUID_HFI_INDEX_SHIFT     39
@@ -171,6 +173,25 @@ struct flag_table {
 /* QPN[m+n:1] QW 1, OFFSET 1 */
 #define QPN_SELECT_OFFSET      ((1ull << QW_SHIFT) | (1ull))
 
+/* RSM fields for AIP */
+/* LRH.BTH above is reused for this rule */
+
+/* BTH.DESTQP: QW 1, OFFSET 16 for match */
+#define BTH_DESTQP_QW           1ull
+#define BTH_DESTQP_BIT_OFFSET   16ull
+#define BTH_DESTQP_OFFSET(off) ((BTH_DESTQP_QW << QW_SHIFT) | (off))
+#define BTH_DESTQP_MATCH_OFFSET BTH_DESTQP_OFFSET(BTH_DESTQP_BIT_OFFSET)
+#define BTH_DESTQP_MASK         0xFFull
+#define BTH_DESTQP_VALUE        0x81ull
+
+/* DETH.SQPN: QW 1 Offset 56 for select */
+/* We use the 8 most significant Source QPN bits as entropy for AIP */
+#define DETH_AIP_SQPN_QW 3ull
+#define DETH_AIP_SQPN_BIT_OFFSET 56ull
+#define DETH_AIP_SQPN_OFFSET(off) ((DETH_AIP_SQPN_QW << QW_SHIFT) | (off))
+#define DETH_AIP_SQPN_SELECT_OFFSET \
+	DETH_AIP_SQPN_OFFSET(DETH_AIP_SQPN_BIT_OFFSET)
+
 /* RSM fields for Vnic */
 /* L2_TYPE: QW 0, OFFSET 61 - for match */
 #define L2_TYPE_QW             0ull
@@ -14236,6 +14257,12 @@ static void complete_rsm_map_table(struct hfi1_devdata *dd,
 	}
 }
 
+/* Is a receive side mapping rule configured at this index? */
+static bool has_rsm_rule(struct hfi1_devdata *dd, u8 rule_index)
+{
+	return read_csr(dd, RCV_RSM_CFG + (8 * rule_index)) != 0;
+}
+
 /*
  * Add a receive side mapping rule.
  */
@@ -14472,39 +14499,49 @@ static void init_fecn_handling(struct hfi1_devdata *dd,
 	rmt->used += total_cnt;
 }
 
-/* Initialize RSM for VNIC */
-void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
+static inline bool hfi1_is_rmt_full(int start, int spare)
+{
+	return (start + spare) > NUM_MAP_ENTRIES;
+}
+
+static bool hfi1_netdev_update_rmt(struct hfi1_devdata *dd)
 {
 	u8 i, j;
 	u8 ctx_id = 0;
 	u64 reg;
 	u32 regoff;
-	struct rsm_rule_data rrd;
+	int rmt_start = dd->vnic.rmt_start;
 
-	if (hfi1_vnic_is_rsm_full(dd, NUM_VNIC_MAP_ENTRIES)) {
-		dd_dev_err(dd, "Vnic RSM disabled, rmt entries used = %d\n",
-			   dd->vnic.rmt_start);
-		return;
+	/* We already have contexts mapped in RMT */
+	if (has_rsm_rule(dd, RSM_INS_VNIC) || has_rsm_rule(dd, RSM_INS_AIP)) {
+		dd_dev_info(dd, "Contexts are already mapped in RMT\n");
+		return true;
+	}
+
+	if (hfi1_is_rmt_full(rmt_start, NUM_VNIC_MAP_ENTRIES)) {
+		dd_dev_err(dd, "Not enough RMT entries used = %d\n",
+			   rmt_start);
+		return false;
 	}
 
-	dev_dbg(&(dd)->pcidev->dev, "Vnic rsm start = %d, end %d\n",
-		dd->vnic.rmt_start,
-		dd->vnic.rmt_start + NUM_VNIC_MAP_ENTRIES);
+	dev_dbg(&(dd)->pcidev->dev, "RMT start = %d, end %d\n",
+		rmt_start,
+		rmt_start + NUM_VNIC_MAP_ENTRIES);
 
 	/* Update RSM mapping table, 32 regs, 256 entries - 1 ctx per byte */
-	regoff = RCV_RSM_MAP_TABLE + (dd->vnic.rmt_start / 8) * 8;
+	regoff = RCV_RSM_MAP_TABLE + (rmt_start / 8) * 8;
 	reg = read_csr(dd, regoff);
 	for (i = 0; i < NUM_VNIC_MAP_ENTRIES; i++) {
-		/* Update map register with vnic context */
-		j = (dd->vnic.rmt_start + i) % 8;
+		/* Update map register with netdev context */
+		j = (rmt_start + i) % 8;
 		reg &= ~(0xffllu << (j * 8));
 		reg |= (u64)dd->vnic.ctxt[ctx_id++]->ctxt << (j * 8);
-		/* Wrap up vnic ctx index */
+		/* Wrap up netdev ctx index */
 		ctx_id %= dd->vnic.num_ctxt;
 		/* Write back map register */
 		if (j == 7 || ((i + 1) == NUM_VNIC_MAP_ENTRIES)) {
 			dev_dbg(&(dd)->pcidev->dev,
-				"Vnic rsm map reg[%d] =0x%llx\n",
+				"RMT[%d] =0x%llx\n",
 				regoff - RCV_RSM_MAP_TABLE, reg);
 
 			write_csr(dd, regoff, reg);
@@ -14514,35 +14551,83 @@ void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
 		}
 	}
 
-	/* Add rule for vnic */
-	rrd.offset = dd->vnic.rmt_start;
-	rrd.pkt_type = 4;
-	/* Match 16B packets */
-	rrd.field1_off = L2_TYPE_MATCH_OFFSET;
-	rrd.mask1 = L2_TYPE_MASK;
-	rrd.value1 = L2_16B_VALUE;
-	/* Match ETH L4 packets */
-	rrd.field2_off = L4_TYPE_MATCH_OFFSET;
-	rrd.mask2 = L4_16B_TYPE_MASK;
-	rrd.value2 = L4_16B_ETH_VALUE;
-	/* Calc context from veswid and entropy */
-	rrd.index1_off = L4_16B_HDR_VESWID_OFFSET;
-	rrd.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES);
-	rrd.index2_off = L2_16B_ENTROPY_OFFSET;
-	rrd.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES);
-	add_rsm_rule(dd, RSM_INS_VNIC, &rrd);
-
-	/* Enable RSM if not already enabled */
+	return true;
+}
+
+static void hfi1_enable_rsm_rule(struct hfi1_devdata *dd,
+				 int rule, struct rsm_rule_data *rrd)
+{
+	if (!hfi1_netdev_update_rmt(dd)) {
+		dd_dev_err(dd, "Failed to update RMT for RSM%d rule\n", rule);
+		return;
+	}
+
+	add_rsm_rule(dd, rule, rrd);
 	add_rcvctrl(dd, RCV_CTRL_RCV_RSM_ENABLE_SMASK);
 }
 
+void hfi1_init_aip_rsm(struct hfi1_devdata *dd)
+{
+	/*
+	 * go through with the initialisation only if this rule actually doesn't
+	 * exist yet
+	 */
+	if (atomic_fetch_inc(&dd->ipoib_rsm_usr_num) == 0) {
+		struct rsm_rule_data rrd = {
+			.offset = dd->vnic.rmt_start,
+			.pkt_type = IB_PACKET_TYPE,
+			.field1_off = LRH_BTH_MATCH_OFFSET,
+			.mask1 = LRH_BTH_MASK,
+			.value1 = LRH_BTH_VALUE,
+			.field2_off = BTH_DESTQP_MATCH_OFFSET,
+			.mask2 = BTH_DESTQP_MASK,
+			.value2 = BTH_DESTQP_VALUE,
+			.index1_off = DETH_AIP_SQPN_SELECT_OFFSET +
+					ilog2(NUM_VNIC_MAP_ENTRIES),
+			.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES),
+			.index2_off = DETH_AIP_SQPN_SELECT_OFFSET,
+			.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES)
+		};
+
+		hfi1_enable_rsm_rule(dd, RSM_INS_AIP, &rrd);
+	}
+}
+
+/* Initialize RSM for VNIC */
+void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
+{
+	struct rsm_rule_data rrd = {
+		/* Add rule for vnic */
+		.offset = dd->vnic.rmt_start,
+		.pkt_type = 4,
+		/* Match 16B packets */
+		.field1_off = L2_TYPE_MATCH_OFFSET,
+		.mask1 = L2_TYPE_MASK,
+		.value1 = L2_16B_VALUE,
+		/* Match ETH L4 packets */
+		.field2_off = L4_TYPE_MATCH_OFFSET,
+		.mask2 = L4_16B_TYPE_MASK,
+		.value2 = L4_16B_ETH_VALUE,
+		/* Calc context from veswid and entropy */
+		.index1_off = L4_16B_HDR_VESWID_OFFSET,
+		.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES),
+		.index2_off = L2_16B_ENTROPY_OFFSET,
+		.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES)
+	};
+
+	hfi1_enable_rsm_rule(dd, RSM_INS_VNIC, &rrd);
+}
+
 void hfi1_deinit_vnic_rsm(struct hfi1_devdata *dd)
 {
 	clear_rsm_rule(dd, RSM_INS_VNIC);
+}
 
-	/* Disable RSM if used only by vnic */
-	if (dd->vnic.rmt_start == 0)
-		clear_rcvctrl(dd, RCV_CTRL_RCV_RSM_ENABLE_SMASK);
+void hfi1_deinit_aip_rsm(struct hfi1_devdata *dd)
+{
+	/* only actually clear the rule if it's the last user asking to do so */
+	if (atomic_fetch_add_unless(&dd->ipoib_rsm_usr_num, -1, 0) == 1)
+		clear_rsm_rule(dd, RSM_INS_AIP);
 }
 
 static int init_rxe(struct hfi1_devdata *dd)
diff --git a/drivers/infiniband/hw/hfi1/chip.h b/drivers/infiniband/hw/hfi1/chip.h
index 7255092..b10e0bf 100644
--- a/drivers/infiniband/hw/hfi1/chip.h
+++ b/drivers/infiniband/hw/hfi1/chip.h
@@ -1,7 +1,7 @@
 #ifndef _CHIP_H
 #define _CHIP_H
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -1455,6 +1455,8 @@ int hfi1_set_ctxt_pkey(struct hfi1_devdata *dd, struct hfi1_ctxtdata *ctxt,
 void remap_sdma_interrupts(struct hfi1_devdata *dd, int engine, int msix_intr);
 void reset_interrupts(struct hfi1_devdata *dd);
 u8 hfi1_get_qp_map(struct hfi1_devdata *dd, u8 idx);
+void hfi1_init_aip_rsm(struct hfi1_devdata *dd);
+void hfi1_deinit_aip_rsm(struct hfi1_devdata *dd);
 
 /*
  * Interrupt source table.
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 92fceb5..5a7fa5a 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1419,12 +1419,10 @@ struct hfi1_devdata {
 	struct hfi1_vnic_data vnic;
 	/* Lock to protect IRQ SRC register access */
 	spinlock_t irq_src_lock;
-};
 
-static inline bool hfi1_vnic_is_rsm_full(struct hfi1_devdata *dd, int spare)
-{
-	return (dd->vnic.rmt_start + spare) > NUM_MAP_ENTRIES;
-}
+	/* Keeps track of IPoIB RSM rule users */
+	atomic_t ipoib_rsm_usr_num;
+};
 
 /* 8051 firmware version helper */
 #define dc8051_ver(a, b, c) ((a) << 16 | (b) << 8 | (c))
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index e3acda7..a4a1b3a 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -1333,6 +1333,8 @@ static struct hfi1_devdata *hfi1_alloc_devdata(struct pci_dev *pdev,
 		goto bail;
 	}
 
+	atomic_set(&dd->ipoib_rsm_usr_num, 0);
+
 	kobject_init(&dd->kobj, &hfi1_devdata_type);
 	return dd;
 


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (5 preceding siblings ...)
  2020-02-10 13:18 ` [PATCH for-next 06/16] IB/hfi1: RSM rules for AIP Dennis Dalessandro
@ 2020-02-10 13:18 ` Dennis Dalessandro
  2020-02-19 11:01   ` Erez Shitrit
  2020-02-10 13:18 ` [PATCH for-next 08/16] IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts Dennis Dalessandro
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford; +Cc: Mike Marciniszyn, linux-rdma, Sadanand Warrier, Kaike Wan

From: Sadanand Warrier <sadanand.warrier@intel.com>

Currently the ipoib UD mtu is restricted to 4K bytes. Remove this
limitation so that the IPOIB module can potentially use an MTU (in UD
mode) that is bounded by the MTU of the underlying device. A field is
added to the ib_port_attr structure to indicate the maximum physical
MTU the underlying device supports.
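
As an aside (not part of the patch), a minimal sketch of how a ULP might
consume the new phys_mtu field and helpers added below; the function name
example_pick_ud_mtu() is hypothetical and the fallback value is arbitrary:

/* Illustration only; example_pick_ud_mtu() is hypothetical. */
static int example_pick_ud_mtu(struct ib_device *dev, u8 port)
{
	struct ib_port_attr attr;

	if (ib_query_port(dev, port, &attr))
		return 2048;	/* arbitrary conservative fallback */

	/* OPA ports may report a phys_mtu larger than 4096 (8K/10K) */
	return rdma_core_cap_opa_port(dev, port) ?
		attr.phys_mtu : ib_mtu_enum_to_int(attr.max_mtu);
}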

Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/qp.c                |   18 +------
 drivers/infiniband/hw/hfi1/verbs.c             |    2 +
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |    5 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   11 ++--
 include/rdma/ib_verbs.h                        |   60 ++++++++++++++++++++++++
 include/rdma/opa_port_info.h                   |   10 ----
 6 files changed, 74 insertions(+), 32 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
index f8e733a..0c2ae9f 100644
--- a/drivers/infiniband/hw/hfi1/qp.c
+++ b/drivers/infiniband/hw/hfi1/qp.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2019 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -186,15 +186,6 @@ static void flush_iowait(struct rvt_qp *qp)
 	write_sequnlock_irqrestore(lock, flags);
 }
 
-static inline int opa_mtu_enum_to_int(int mtu)
-{
-	switch (mtu) {
-	case OPA_MTU_8192:  return 8192;
-	case OPA_MTU_10240: return 10240;
-	default:            return -1;
-	}
-}
-
 /**
  * This function is what we would push to the core layer if we wanted to be a
  * "first class citizen".  Instead we hide this here and rely on Verbs ULPs
@@ -202,15 +193,10 @@ static inline int opa_mtu_enum_to_int(int mtu)
  */
 static inline int verbs_mtu_enum_to_int(struct ib_device *dev, enum ib_mtu mtu)
 {
-	int val;
-
 	/* Constraining 10KB packets to 8KB packets */
 	if (mtu == (enum ib_mtu)OPA_MTU_10240)
 		mtu = OPA_MTU_8192;
-	val = opa_mtu_enum_to_int((int)mtu);
-	if (val > 0)
-		return val;
-	return ib_mtu_enum_to_int(mtu);
+	return opa_mtu_enum_to_int((enum opa_mtu)mtu);
 }
 
 int hfi1_check_modify_qp(struct rvt_qp *qp, struct ib_qp_attr *attr,
diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 8061e93..04f0206 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -1437,6 +1437,8 @@ static int query_port(struct rvt_dev_info *rdi, u8 port_num,
 				      4096 : hfi1_max_mtu), IB_MTU_4096);
 	props->active_mtu = !valid_ib_mtu(ppd->ibmtu) ? props->max_mtu :
 		mtu_to_enum(ppd->ibmtu, IB_MTU_4096);
+	props->phys_mtu = HFI1_CAP_IS_KSET(AIP) ? hfi1_max_mtu :
+				ib_mtu_enum_to_int(props->max_mtu);
 
 	return 0;
 }
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index e5f438a..5c1cf68 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1862,7 +1862,10 @@ static int ipoib_parent_init(struct net_device *ndev)
 			priv->port);
 		return result;
 	}
-	priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
+	if (rdma_core_cap_opa_port(priv->ca, priv->port))
+		priv->max_ib_mtu = attr.phys_mtu;
+	else
+		priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
 
 	result = ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey);
 	if (result) {
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index b9e9562..7166ee9b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -218,6 +218,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 	struct rdma_ah_attr av;
 	int ret;
 	int set_qkey = 0;
+	int mtu;
 
 	mcast->mcmember = *mcmember;
 
@@ -240,13 +241,11 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		priv->broadcast->mcmember.flow_label = mcmember->flow_label;
 		priv->broadcast->mcmember.hop_limit = mcmember->hop_limit;
 		/* assume if the admin and the mcast are the same both can be changed */
+		mtu = rdma_mtu_enum_to_int(priv->ca,  priv->port,
+					   priv->broadcast->mcmember.mtu);
 		if (priv->mcast_mtu == priv->admin_mtu)
-			priv->admin_mtu =
-			priv->mcast_mtu =
-			IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
-		else
-			priv->mcast_mtu =
-			IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
+			priv->admin_mtu = IPOIB_UD_MTU(mtu);
+		priv->mcast_mtu = IPOIB_UD_MTU(mtu);
 
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		spin_unlock_irq(&priv->lock);
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8b805ab3..6ab343f 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -462,6 +462,11 @@ enum ib_mtu {
 	IB_MTU_4096 = 5
 };
 
+enum opa_mtu {
+	OPA_MTU_8192 = 6,
+	OPA_MTU_10240 = 7
+};
+
 static inline int ib_mtu_enum_to_int(enum ib_mtu mtu)
 {
 	switch (mtu) {
@@ -488,6 +493,28 @@ static inline enum ib_mtu ib_mtu_int_to_enum(int mtu)
 		return IB_MTU_256;
 }
 
+static inline int opa_mtu_enum_to_int(enum opa_mtu mtu)
+{
+	switch (mtu) {
+	case OPA_MTU_8192:
+		return 8192;
+	case OPA_MTU_10240:
+		return 10240;
+	default:
+		return(ib_mtu_enum_to_int((enum ib_mtu)mtu));
+	}
+}
+
+static inline enum opa_mtu opa_mtu_int_to_enum(int mtu)
+{
+	if (mtu >= 10240)
+		return OPA_MTU_10240;
+	else if (mtu >= 8192)
+		return OPA_MTU_8192;
+	else
+		return ((enum opa_mtu)ib_mtu_int_to_enum(mtu));
+}
+
 enum ib_port_state {
 	IB_PORT_NOP		= 0,
 	IB_PORT_DOWN		= 1,
@@ -651,6 +678,7 @@ struct ib_port_attr {
 	enum ib_port_state	state;
 	enum ib_mtu		max_mtu;
 	enum ib_mtu		active_mtu;
+	u32                     phys_mtu;
 	int			gid_tbl_len;
 	unsigned int		ip_gids:1;
 	/* This is the value from PortInfo CapabilityMask, defined by IBA */
@@ -3356,6 +3384,38 @@ static inline unsigned int rdma_find_pg_bit(unsigned long addr,
 	return __fls(pgsz);
 }
 
+/**
+ * rdma_core_cap_opa_port - Return whether the RDMA Port is OPA or not.
+ * @device: Device
+ * @port_num: 1 based Port number
+ *
+ * Return true if port is an Intel OPA port, false if not
+ */
+static inline bool rdma_core_cap_opa_port(struct ib_device *device,
+					  u32 port_num)
+{
+	return (device->port_data[port_num].immutable.core_cap_flags &
+		RDMA_CORE_PORT_INTEL_OPA) == RDMA_CORE_PORT_INTEL_OPA;
+}
+
+/**
+ * rdma_mtu_enum_to_int - Return the mtu of the port as an integer value.
+ * @device: Device
+ * @port: Port number
+ * @mtu: enum value of MTU
+ *
+ * Return the MTU size supported by the port as an integer value. Will return
+ * -1 if enum value of mtu is not supported.
+ */
+static inline int rdma_mtu_enum_to_int(struct ib_device *device, u8 port,
+				       int mtu)
+{
+	if (rdma_core_cap_opa_port(device, port))
+		return opa_mtu_enum_to_int((enum opa_mtu)mtu);
+	else
+		return ib_mtu_enum_to_int((enum ib_mtu)mtu);
+}
+
 int ib_set_vf_link_state(struct ib_device *device, int vf, u8 port,
 			 int state);
 int ib_get_vf_config(struct ib_device *device, int vf, u8 port,
diff --git a/include/rdma/opa_port_info.h b/include/rdma/opa_port_info.h
index bdbfe25..0d9e6d7 100644
--- a/include/rdma/opa_port_info.h
+++ b/include/rdma/opa_port_info.h
@@ -1,5 +1,5 @@
 /*
- * Copyright (c) 2014-2017 Intel Corporation.  All rights reserved.
+ * Copyright (c) 2014-2020 Intel Corporation.  All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -139,14 +139,6 @@
 #define OPA_CAP_MASK3_IsVLMarkerSupported         (1 << 1)
 #define OPA_CAP_MASK3_IsVLrSupported              (1 << 0)
 
-/**
- * new MTU values
- */
-enum {
-	OPA_MTU_8192  = 6,
-	OPA_MTU_10240 = 7,
-};
-
 enum {
 	OPA_PORT_PHYS_CONF_DISCONNECTED = 0,
 	OPA_PORT_PHYS_CONF_STANDARD     = 1,


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 08/16] IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (6 preceding siblings ...)
  2020-02-10 13:18 ` [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit Dennis Dalessandro
@ 2020-02-10 13:18 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 09/16] IB/hfi1: Add functions to receive accelerated ipoib packets Dennis Dalessandro
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:18 UTC (permalink / raw)
  To: jgg, dledford
  Cc: Mike Marciniszyn, Grzegorz Andrejczuk, Dennis Dalessandro,
	linux-rdma, Kaike Wan, Sadanand Warrier

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

Rename num_vnic_contexts to num_netdev_contexts since VNIC and ipoib
will share the same set of receive contexts.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/chip.c      |   24 ++++++++++++------------
 drivers/infiniband/hw/hfi1/hfi.h       |    4 ++--
 drivers/infiniband/hw/hfi1/msix.c      |    4 ++--
 drivers/infiniband/hw/hfi1/vnic_main.c |    8 ++++----
 4 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index be1fb29..07eec35 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -13347,12 +13347,12 @@ static int set_up_interrupts(struct hfi1_devdata *dd)
  *                             in array of contexts
  *	freectxts  - number of free user contexts
  *	num_send_contexts - number of PIO send contexts being used
- *	num_vnic_contexts - number of contexts reserved for VNIC
+ *	num_netdev_contexts - number of contexts reserved for netdev
  */
 static int set_up_context_variables(struct hfi1_devdata *dd)
 {
 	unsigned long num_kernel_contexts;
-	u16 num_vnic_contexts = HFI1_NUM_VNIC_CTXT;
+	u16 num_netdev_contexts = HFI1_NUM_VNIC_CTXT;
 	int total_contexts;
 	int ret;
 	unsigned ngroups;
@@ -13391,11 +13391,11 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	}
 
 	/* Accommodate VNIC contexts if possible */
-	if ((num_kernel_contexts + num_vnic_contexts) > rcv_contexts) {
+	if ((num_kernel_contexts + num_netdev_contexts) > rcv_contexts) {
 		dd_dev_err(dd, "No receive contexts available for VNIC\n");
-		num_vnic_contexts = 0;
+		num_netdev_contexts = 0;
 	}
-	total_contexts = num_kernel_contexts + num_vnic_contexts;
+	total_contexts = num_kernel_contexts + num_netdev_contexts;
 
 	/*
 	 * User contexts:
@@ -13422,15 +13422,15 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	 * The RMT entries are currently allocated as shown below:
 	 * 1. QOS (0 to 128 entries);
 	 * 2. FECN (num_kernel_context - 1 + num_user_contexts +
-	 *    num_vnic_contexts);
-	 * 3. VNIC (num_vnic_contexts).
-	 * It should be noted that FECN oversubscribe num_vnic_contexts
-	 * entries of RMT because both VNIC and PSM could allocate any receive
+	 *    num_netdev_contexts);
+	 * 3. netdev (num_netdev_contexts).
+	 * It should be noted that FECN oversubscribe num_netdev_contexts
+	 * entries of RMT because both netdev and PSM could allocate any receive
 	 * context between dd->first_dyn_alloc_text and dd->num_rcv_contexts,
 	 * and PSM FECN must reserve an RMT entry for each possible PSM receive
 	 * context.
 	 */
-	rmt_count = qos_rmt_entries(dd, NULL, NULL) + (num_vnic_contexts * 2);
+	rmt_count = qos_rmt_entries(dd, NULL, NULL) + (num_netdev_contexts * 2);
 	if (HFI1_CAP_IS_KSET(TID_RDMA))
 		rmt_count += num_kernel_contexts - 1;
 	if (rmt_count + n_usr_ctxts > NUM_MAP_ENTRIES) {
@@ -13449,7 +13449,7 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	dd->num_rcv_contexts = total_contexts;
 	dd->n_krcv_queues = num_kernel_contexts;
 	dd->first_dyn_alloc_ctxt = num_kernel_contexts;
-	dd->num_vnic_contexts = num_vnic_contexts;
+	dd->num_netdev_contexts = num_netdev_contexts;
 	dd->num_user_contexts = n_usr_ctxts;
 	dd->freectxts = n_usr_ctxts;
 	dd_dev_info(dd,
@@ -13457,7 +13457,7 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 		    rcv_contexts,
 		    (int)dd->num_rcv_contexts,
 		    (int)dd->n_krcv_queues,
-		    dd->num_vnic_contexts,
+		    dd->num_netdev_contexts,
 		    dd->num_user_contexts);
 
 	/*
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 5a7fa5a..2ff3904 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1167,8 +1167,8 @@ struct hfi1_devdata {
 	u64 z_send_schedule;
 
 	u64 __percpu *send_schedule;
-	/* number of reserved contexts for VNIC usage */
-	u16 num_vnic_contexts;
+	/* number of reserved contexts for netdev usage */
+	u16 num_netdev_contexts;
 	/* number of receive contexts in use by the driver */
 	u32 num_rcv_contexts;
 	/* number of pio send contexts in use by the driver */
diff --git a/drivers/infiniband/hw/hfi1/msix.c b/drivers/infiniband/hw/hfi1/msix.c
index db82db4..7574f2b 100644
--- a/drivers/infiniband/hw/hfi1/msix.c
+++ b/drivers/infiniband/hw/hfi1/msix.c
@@ -1,6 +1,6 @@
 // SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
 /*
- * Copyright(c) 2018 Intel Corporation.
+ * Copyright(c) 2018 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -69,7 +69,7 @@ int msix_initialize(struct hfi1_devdata *dd)
 	 *	one for each VNIC context
 	 *      ...any new IRQs should be added here.
 	 */
-	total = 1 + dd->num_sdma + dd->n_krcv_queues + dd->num_vnic_contexts;
+	total = 1 + dd->num_sdma + dd->n_krcv_queues + dd->num_netdev_contexts;
 
 	if (total >= CCE_NUM_MSIX_VECTORS)
 		return -EINVAL;
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index 6b14581..db7624c 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2017 - 2018 Intel Corporation.
+ * Copyright(c) 2017 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -804,7 +804,7 @@ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
 	struct rdma_netdev *rn;
 	int i, size, rc;
 
-	if (!dd->num_vnic_contexts)
+	if (!dd->num_netdev_contexts)
 		return ERR_PTR(-ENOMEM);
 
 	if (!port_num || (port_num > dd->num_pports))
@@ -815,7 +815,7 @@ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
 
 	size = sizeof(struct opa_vnic_rdma_netdev) + sizeof(*vinfo);
 	netdev = alloc_netdev_mqs(size, name, name_assign_type, setup,
-				  dd->num_sdma, dd->num_vnic_contexts);
+				  dd->num_sdma, dd->num_netdev_contexts);
 	if (!netdev)
 		return ERR_PTR(-ENOMEM);
 
@@ -823,7 +823,7 @@ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
 	vinfo = opa_vnic_dev_priv(netdev);
 	vinfo->dd = dd;
 	vinfo->num_tx_q = dd->num_sdma;
-	vinfo->num_rx_q = dd->num_vnic_contexts;
+	vinfo->num_rx_q = dd->num_netdev_contexts;
 	vinfo->netdev = netdev;
 	rn->free_rdma_netdev = hfi1_vnic_free_rn;
 	rn->set_id = hfi1_vnic_set_vesw_id;


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 09/16] IB/hfi1: Add functions to receive accelerated ipoib packets
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (7 preceding siblings ...)
  2020-02-10 13:18 ` [PATCH for-next 08/16] IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 10/16] IB/hfi1: Add interrupt handler functions for accelerated ipoib Dennis Dalessandro
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: Mike Marciniszyn, Grzegorz Andrejczuk, Dennis Dalessandro,
	linux-rdma, Kaike Wan, Sadanand Warrier

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

Ipoib netdev will share receive contexts with existing VNIC netdev.
To achieve that, a dummy netdev is allocated with hfi1_devdata to
own the receive contexts, and ipoib and VNIC netdevs will be put
on top of it. Each receive context is associated with a single
NAPI object.

This patch adds the functions to receive incoming packets for
accelerated ipoib.
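
For illustration only (not part of the patch), a sketch of how a netdev
could be registered in the dummy netdev's lookup table added below, so
that the receive handler can resolve it from the BTH destination QPN; the
function name is hypothetical:

/* Illustration only: register a netdev under its QP number. */
static int example_register_rx_netdev(struct hfi1_devdata *dd,
				      struct net_device *ndev, u32 qp_num)
{
	return hfi1_netdev_add_data(dd, qp_num, ndev);
}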

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/Makefile    |    2 +
 drivers/infiniband/hw/hfi1/driver.c    |   92 ++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/hfi.h       |    5 +-
 drivers/infiniband/hw/hfi1/ipoib.h     |   18 ++++++
 drivers/infiniband/hw/hfi1/ipoib_rx.c  |   71 +++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/netdev.h    |   90 +++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/netdev_rx.c |   76 ++++++++++++++++++++++++++
 7 files changed, 352 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/ipoib_rx.c
 create mode 100644 drivers/infiniband/hw/hfi1/netdev.h
 create mode 100644 drivers/infiniband/hw/hfi1/netdev_rx.c

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 0b25713..2e89ec1 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -23,10 +23,12 @@ hfi1-y := \
 	intr.o \
 	iowait.o \
 	ipoib_main.o \
+	ipoib_rx.o \
 	ipoib_tx.o \
 	mad.o \
 	mmu_rb.o \
 	msix.o \
+	netdev_rx.o \
 	opfn.o \
 	pcie.o \
 	pio.o \
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index 049d15b..c5ed6ed 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015-2018 Intel Corporation.
+ * Copyright(c) 2015-2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -54,6 +54,7 @@
 #include <linux/module.h>
 #include <linux/prefetch.h>
 #include <rdma/ib_verbs.h>
+#include <linux/etherdevice.h>
 
 #include "hfi.h"
 #include "trace.h"
@@ -63,6 +64,9 @@
 #include "vnic.h"
 #include "fault.h"
 
+#include "ipoib.h"
+#include "netdev.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
 
@@ -1550,6 +1554,81 @@ void handle_eflags(struct hfi1_packet *packet)
 		show_eflags_errs(packet);
 }
 
+static void hfi1_ipoib_ib_rcv(struct hfi1_packet *packet)
+{
+	struct hfi1_ibport *ibp;
+	struct net_device *netdev;
+	struct hfi1_ctxtdata *rcd = packet->rcd;
+	struct napi_struct *napi = rcd->napi;
+	struct sk_buff *skb;
+	struct hfi1_netdev_rxq *rxq = container_of(napi,
+			struct hfi1_netdev_rxq, napi);
+	u32 extra_bytes;
+	u32 tlen, qpnum;
+	bool do_work, do_cnp;
+	struct hfi1_ipoib_dev_priv *priv;
+
+	trace_hfi1_rcvhdr(packet);
+
+	hfi1_setup_ib_header(packet);
+
+	packet->ohdr = &((struct ib_header *)packet->hdr)->u.oth;
+	packet->grh = NULL;
+
+	if (unlikely(rhf_err_flags(packet->rhf))) {
+		handle_eflags(packet);
+		return;
+	}
+
+	qpnum = ib_bth_get_qpn(packet->ohdr);
+	netdev = hfi1_netdev_get_data(rcd->dd, qpnum);
+	if (!netdev)
+		goto drop_no_nd;
+
+	trace_input_ibhdr(rcd->dd, packet, !!(rhf_dc_info(packet->rhf)));
+
+	/* handle congestion notifications */
+	do_work = hfi1_may_ecn(packet);
+	if (unlikely(do_work)) {
+		do_cnp = (packet->opcode != IB_OPCODE_CNP);
+		(void)hfi1_process_ecn_slowpath(hfi1_ipoib_priv(netdev)->qp,
+						 packet, do_cnp);
+	}
+
+	/*
+	 * We have split point after last byte of DETH
+	 * lets strip padding and CRC and ICRC.
+	 * tlen is whole packet len so we need to
+	 * subtract header size as well.
+	 */
+	tlen = packet->tlen;
+	extra_bytes = ib_bth_get_pad(packet->ohdr) + (SIZE_OF_CRC << 2) +
+			packet->hlen;
+	if (unlikely(tlen < extra_bytes))
+		goto drop;
+
+	tlen -= extra_bytes;
+
+	skb = hfi1_ipoib_prepare_skb(rxq, tlen, packet->ebuf);
+	if (unlikely(!skb))
+		goto drop;
+
+	priv = hfi1_ipoib_priv(netdev);
+	hfi1_ipoib_update_rx_netstats(priv, 1, skb->len);
+
+	skb->dev = netdev;
+	skb->pkt_type = PACKET_HOST;
+	netif_receive_skb(skb);
+
+	return;
+
+drop:
+	++netdev->stats.rx_dropped;
+drop_no_nd:
+	ibp = rcd_to_iport(packet->rcd);
+	++ibp->rvp.n_pkt_drops;
+}
+
 /*
  * The following functions are called by the interrupt handler. They are type
  * specific handlers for each packet type.
@@ -1757,3 +1836,14 @@ void seqfile_dump_rcd(struct seq_file *s, struct hfi1_ctxtdata *rcd)
 	[RHF_RCV_TYPE_INVALID6] = process_receive_invalid,
 	[RHF_RCV_TYPE_INVALID7] = process_receive_invalid,
 };
+
+const rhf_rcv_function_ptr netdev_rhf_rcv_functions[] = {
+	[RHF_RCV_TYPE_EXPECTED] = process_receive_invalid,
+	[RHF_RCV_TYPE_EAGER] = process_receive_invalid,
+	[RHF_RCV_TYPE_IB] = hfi1_ipoib_ib_rcv,
+	[RHF_RCV_TYPE_ERROR] = process_receive_error,
+	[RHF_RCV_TYPE_BYPASS] = hfi1_vnic_bypass_rcv,
+	[RHF_RCV_TYPE_INVALID5] = process_receive_invalid,
+	[RHF_RCV_TYPE_INVALID6] = process_receive_invalid,
+	[RHF_RCV_TYPE_INVALID7] = process_receive_invalid,
+};
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 2ff3904..5798688 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -233,6 +233,8 @@ struct hfi1_ctxtdata {
 	intr_handler fast_handler;
 	/** slow handler */
 	intr_handler slow_handler;
+	/* napi pointer associated with netdev */
+	struct napi_struct *napi;
 	/* verbs rx_stats per rcd */
 	struct hfi1_opcode_stats_perctx *opstats;
 	/* clear interrupt mask */
@@ -985,7 +987,7 @@ typedef void (*hfi1_make_req)(struct rvt_qp *qp,
 			      struct hfi1_pkt_state *ps,
 			      struct rvt_swqe *wqe);
 extern const rhf_rcv_function_ptr normal_rhf_rcv_functions[];
-
+extern const rhf_rcv_function_ptr netdev_rhf_rcv_functions[];
 
 /* return values for the RHF receive functions */
 #define RHF_RCV_CONTINUE  0	/* keep going */
@@ -1419,6 +1421,7 @@ struct hfi1_devdata {
 	struct hfi1_vnic_data vnic;
 	/* Lock to protect IRQ SRC register access */
 	spinlock_t irq_src_lock;
+	struct net_device *dummy_netdev;
 
 	/* Keeps track of IPoIB RSM rule users */
 	atomic_t ipoib_rsm_usr_num;
diff --git a/drivers/infiniband/hw/hfi1/ipoib.h b/drivers/infiniband/hw/hfi1/ipoib.h
index c2e63ca..ca00f6c 100644
--- a/drivers/infiniband/hw/hfi1/ipoib.h
+++ b/drivers/infiniband/hw/hfi1/ipoib.h
@@ -22,6 +22,7 @@
 
 #include "hfi.h"
 #include "iowait.h"
+#include "netdev.h"
 
 #include <rdma/ib_verbs.h>
 
@@ -29,6 +30,7 @@
 
 #define HFI1_IPOIB_TXREQ_NAME_LEN   32
 
+#define HFI1_IPOIB_PSEUDO_LEN 20
 #define HFI1_IPOIB_ENCAP_LEN 4
 
 struct hfi1_ipoib_dev_priv;
@@ -119,6 +121,19 @@ struct hfi1_ipoib_rdma_netdev {
 }
 
 static inline void
+hfi1_ipoib_update_rx_netstats(struct hfi1_ipoib_dev_priv *priv,
+			      u64 packets,
+			      u64 bytes)
+{
+	struct pcpu_sw_netstats *netstats = this_cpu_ptr(priv->netstats);
+
+	u64_stats_update_begin(&netstats->syncp);
+	netstats->rx_packets += packets;
+	netstats->rx_bytes += bytes;
+	u64_stats_update_end(&netstats->syncp);
+}
+
+static inline void
 hfi1_ipoib_update_tx_netstats(struct hfi1_ipoib_dev_priv *priv,
 			      u64 packets,
 			      u64 bytes)
@@ -142,6 +157,9 @@ int hfi1_ipoib_send_dma(struct net_device *dev,
 void hfi1_ipoib_napi_tx_enable(struct net_device *dev);
 void hfi1_ipoib_napi_tx_disable(struct net_device *dev);
 
+struct sk_buff *hfi1_ipoib_prepare_skb(struct hfi1_netdev_rxq *rxq,
+				       int size, void *data);
+
 int hfi1_ipoib_rn_get_params(struct ib_device *device,
 			     u8 port_num,
 			     enum rdma_netdev_t type,
diff --git a/drivers/infiniband/hw/hfi1/ipoib_rx.c b/drivers/infiniband/hw/hfi1/ipoib_rx.c
new file mode 100644
index 0000000..2485663
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ipoib_rx.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+#include "netdev.h"
+#include "ipoib.h"
+
+#define HFI1_IPOIB_SKB_PAD ((NET_SKB_PAD) + (NET_IP_ALIGN))
+
+static void copy_ipoib_buf(struct sk_buff *skb, void *data, int size)
+{
+	void *dst_data;
+
+	skb_checksum_none_assert(skb);
+	skb->protocol = *((__be16 *)data);
+
+	dst_data = skb_put(skb, size);
+	memcpy(dst_data, data, size);
+	skb->mac_header = HFI1_IPOIB_PSEUDO_LEN;
+	skb_pull(skb, HFI1_IPOIB_ENCAP_LEN);
+}
+
+static struct sk_buff *prepare_frag_skb(struct napi_struct *napi, int size)
+{
+	struct sk_buff *skb;
+	int skb_size = SKB_DATA_ALIGN(size + HFI1_IPOIB_SKB_PAD);
+	void *frag;
+
+	skb_size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	skb_size = SKB_DATA_ALIGN(skb_size);
+	frag = napi_alloc_frag(skb_size);
+
+	if (unlikely(!frag))
+		return napi_alloc_skb(napi, size);
+
+	skb = build_skb(frag, skb_size);
+
+	if (unlikely(!skb)) {
+		skb_free_frag(frag);
+		return NULL;
+	}
+
+	skb_reserve(skb, HFI1_IPOIB_SKB_PAD);
+	return skb;
+}
+
+struct sk_buff *hfi1_ipoib_prepare_skb(struct hfi1_netdev_rxq *rxq,
+				       int size, void *data)
+{
+	struct napi_struct *napi = &rxq->napi;
+	int skb_size = size + HFI1_IPOIB_ENCAP_LEN;
+	struct sk_buff *skb;
+
+	/*
+	 * For smaller (4k + skb overhead) allocations we use the napi
+	 * cache. Otherwise we try to use the napi frag cache.
+	 */
+	if (size <= SKB_WITH_OVERHEAD(PAGE_SIZE))
+		skb = napi_alloc_skb(napi, skb_size);
+	else
+		skb = prepare_frag_skb(napi, skb_size);
+
+	if (unlikely(!skb))
+		return NULL;
+
+	copy_ipoib_buf(skb, data, size);
+
+	return skb;
+}
diff --git a/drivers/infiniband/hw/hfi1/netdev.h b/drivers/infiniband/hw/hfi1/netdev.h
new file mode 100644
index 0000000..8992dfe
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/netdev.h
@@ -0,0 +1,90 @@
+/* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+#ifndef HFI1_NETDEV_H
+#define HFI1_NETDEV_H
+
+#include "hfi.h"
+
+#include <linux/netdevice.h>
+#include <linux/xarray.h>
+
+/**
+ * struct hfi1_netdev_rxq - Receive Queue for HFI
+ * dummy netdev. Both IPoIB and VNIC netdevices will be working on
+ * top of this device.
+ * @napi: napi object
+ * @priv: ptr to netdev_priv
+ * @rcd:  ptr to receive context data
+ */
+struct hfi1_netdev_rxq {
+	struct napi_struct napi;
+	struct hfi1_netdev_priv *priv;
+	struct hfi1_ctxtdata *rcd;
+};
+
+/*
+ * Number of netdev contexts used. Ensure it is less than or equal to
+ * max queues supported by VNIC (HFI1_VNIC_MAX_QUEUE).
+ */
+#define HFI1_MAX_NETDEV_CTXTS   8
+
+/* Number of NETDEV RSM entries */
+#define NUM_NETDEV_MAP_ENTRIES HFI1_MAX_NETDEV_CTXTS
+
+/**
+ * struct hfi1_netdev_priv: data required to setup and run HFI netdev.
+ * @dd:		hfi1_devdata
+ * @rxq:	pointer to dummy netdev receive queues.
+ * @num_rx_q:	number of receive queues
+ * @rmt_start:	first free index in RMT Array
+ * @msix_start: first free MSI-X interrupt vector.
+ * @dev_tbl:	netdev table of unique identifiers for VNIC and IPoIB VLANs.
+ * @enabled:	atomic counter of netdevs enabling receive queues.
+ *		When 0 NAPI will be disabled.
+ * @netdevs:	atomic counter of netdevs using dummy netdev.
+ *		When 0 receive queues will be freed.
+ */
+struct hfi1_netdev_priv {
+	struct hfi1_devdata *dd;
+	struct hfi1_netdev_rxq *rxq;
+	int num_rx_q;
+	int rmt_start;
+	struct xarray dev_tbl;
+	/* count of enabled napi polls */
+	atomic_t enabled;
+	/* count of netdevs on top */
+	atomic_t netdevs;
+};
+
+static inline
+struct hfi1_netdev_priv *hfi1_netdev_priv(struct net_device *dev)
+{
+	return (struct hfi1_netdev_priv *)&dev[1];
+}
+
+static inline
+int hfi1_netdev_ctxt_count(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return priv->num_rx_q;
+}
+
+static inline
+struct hfi1_ctxtdata *hfi1_netdev_get_ctxt(struct hfi1_devdata *dd, int ctxt)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return priv->rxq[ctxt].rcd;
+}
+
+int hfi1_netdev_add_data(struct hfi1_devdata *dd, int id, void *data);
+void *hfi1_netdev_remove_data(struct hfi1_devdata *dd, int id);
+void *hfi1_netdev_get_data(struct hfi1_devdata *dd, int id);
+void *hfi1_netdev_get_first_data(struct hfi1_devdata *dd, int *start_id);
+
+#endif /* HFI1_NETDEV_H */
diff --git a/drivers/infiniband/hw/hfi1/netdev_rx.c b/drivers/infiniband/hw/hfi1/netdev_rx.c
new file mode 100644
index 0000000..19597e0
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/netdev_rx.c
@@ -0,0 +1,76 @@
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+/*
+ * Copyright(c) 2020 Intel Corporation.
+ *
+ */
+
+/*
+ * This file contains HFI1 support for netdev RX functionality
+ */
+
+#include "sdma.h"
+#include "verbs.h"
+#include "netdev.h"
+#include "hfi.h"
+
+#include <linux/netdevice.h>
+#include <linux/etherdevice.h>
+#include <rdma/ib_verbs.h>
+
+/**
+ * hfi1_netdev_add_data - Registers data with a unique identifier
+ * to be requested later. This is needed for the VNIC and IPoIB VLAN
+ * implementations.
+ * This call is protected by mutex idr_lock.
+ *
+ * @dd: hfi1 dev data
+ * @id: requested integer id up to INT_MAX
+ * @data: data to be associated with index
+ */
+int hfi1_netdev_add_data(struct hfi1_devdata *dd, int id, void *data)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return xa_insert(&priv->dev_tbl, id, data, GFP_NOWAIT);
+}
+
+/**
+ * hfi1_netdev_remove_data - Removes data with previously given id.
+ * Returns the reference to removed entry.
+ *
+ * @dd: hfi1 dev data
+ * @id: requested integer id up to INT_MAX
+ */
+void *hfi1_netdev_remove_data(struct hfi1_devdata *dd, int id)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return xa_erase(&priv->dev_tbl, id);
+}
+
+/**
+ * hfi1_netdev_get_data - Gets data with given id
+ *
+ * @dd: hfi1 dev data
+ * @id: requested integer id up to INT_MAX
+ */
+void *hfi1_netdev_get_data(struct hfi1_devdata *dd, int id)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return xa_load(&priv->dev_tbl, id);
+}
+
+/**
+ * hfi1_netdev_get_first_data - Gets first entry with greater or equal id.
+ *
+ * @dd: hfi1 dev data
+ * @start_id: requested integer id up to INT_MAX
+ */
+void *hfi1_netdev_get_first_data(struct hfi1_devdata *dd, int *start_id)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return xa_find(&priv->dev_tbl, (unsigned long *)start_id, ULONG_MAX,
+		       XA_PRESENT);
+}


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 10/16] IB/hfi1: Add interrupt handler functions for accelerated ipoib
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (8 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 09/16] IB/hfi1: Add functions to receive accelerated ipoib packets Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 11/16] IB/hfi1: Add rx functions for dummy netdev Dennis Dalessandro
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: Mike Marciniszyn, Grzegorz Andrejczuk, Dennis Dalessandro,
	linux-rdma, Kaike Wan, Sadanand Warrier

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

This patch adds the interrupt handler function, the NAPI poll
function, and their associated helper functions for receiving
accelerated ipoib packets. While we are here, fix the formats
of two error printouts.
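
For illustration only (not part of the patch), a minimal sketch of how a
receive queue's NAPI instance might be wired to the poll function added
below; example_init_rxq_napi() and the weight of 64 are assumptions:

/* Illustration only: attach and enable NAPI for one netdev rx queue. */
static void example_init_rxq_napi(struct net_device *dummy_netdev,
				  struct hfi1_netdev_rxq *rxq)
{
	netif_napi_add(dummy_netdev, &rxq->napi, hfi1_netdev_rx_napi, 64);
	napi_enable(&rxq->napi);
}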

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/affinity.c |   12 +++
 drivers/infiniband/hw/hfi1/affinity.h |    3 +
 drivers/infiniband/hw/hfi1/chip.c     |   44 ++++++++++++
 drivers/infiniband/hw/hfi1/chip.h     |    1 
 drivers/infiniband/hw/hfi1/driver.c   |  120 +++++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/hfi.h      |    4 +
 drivers/infiniband/hw/hfi1/init.c     |    1 
 drivers/infiniband/hw/hfi1/msix.c     |   20 +++++-
 drivers/infiniband/hw/hfi1/msix.h     |    5 +
 drivers/infiniband/hw/hfi1/netdev.h   |    3 +
 10 files changed, 206 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/affinity.c b/drivers/infiniband/hw/hfi1/affinity.c
index c142b23..d24b084 100644
--- a/drivers/infiniband/hw/hfi1/affinity.c
+++ b/drivers/infiniband/hw/hfi1/affinity.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -64,6 +64,7 @@ struct hfi1_affinity_node_list node_affinity = {
 static const char * const irq_type_names[] = {
 	"SDMA",
 	"RCVCTXT",
+	"NETDEVCTXT",
 	"GENERAL",
 	"OTHER",
 };
@@ -913,6 +914,11 @@ static int get_irq_affinity(struct hfi1_devdata *dd,
 			set = &entry->rcv_intr;
 		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
 		break;
+	case IRQ_NETDEVCTXT:
+		rcd = (struct hfi1_ctxtdata *)msix->arg;
+		set = &entry->def_intr;
+		scnprintf(extra, 64, "ctxt %u", rcd->ctxt);
+		break;
 	default:
 		dd_dev_err(dd, "Invalid IRQ type %d\n", msix->type);
 		return -EINVAL;
@@ -985,6 +991,10 @@ void hfi1_put_irq_affinity(struct hfi1_devdata *dd,
 		if (rcd->ctxt != HFI1_CTRL_CTXT)
 			set = &entry->rcv_intr;
 		break;
+	case IRQ_NETDEVCTXT:
+		rcd = (struct hfi1_ctxtdata *)msix->arg;
+		set = &entry->def_intr;
+		break;
 	default:
 		mutex_unlock(&node_affinity.lock);
 		return;
diff --git a/drivers/infiniband/hw/hfi1/affinity.h b/drivers/infiniband/hw/hfi1/affinity.h
index 6a7e6ea..f94ed5d 100644
--- a/drivers/infiniband/hw/hfi1/affinity.h
+++ b/drivers/infiniband/hw/hfi1/affinity.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015 - 2018 Intel Corporation.
+ * Copyright(c) 2015 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -52,6 +52,7 @@
 enum irq_type {
 	IRQ_SDMA,
 	IRQ_RCVCTXT,
+	IRQ_NETDEVCTXT,
 	IRQ_GENERAL,
 	IRQ_OTHER
 };
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 07eec35..2117612 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -66,6 +66,7 @@
 #include "affinity.h"
 #include "debugfs.h"
 #include "fault.h"
+#include "netdev.h"
 
 uint num_vls = HFI1_MAX_VLS_SUPPORTED;
 module_param(num_vls, uint, S_IRUGO);
@@ -8480,6 +8481,49 @@ static void hfi1_rcd_eoi_intr(struct hfi1_ctxtdata *rcd)
 	local_irq_restore(flags);
 }
 
+/**
+ * hfi1_netdev_rx_napi - napi poll function to move eoi inline
+ * @napi - pointer to napi object
+ * @budget - netdev budget
+ */
+int hfi1_netdev_rx_napi(struct napi_struct *napi, int budget)
+{
+	struct hfi1_netdev_rxq *rxq = container_of(napi,
+			struct hfi1_netdev_rxq, napi);
+	struct hfi1_ctxtdata *rcd = rxq->rcd;
+	int work_done = 0;
+
+	work_done = rcd->do_interrupt(rcd, budget);
+
+	if (work_done < budget) {
+		napi_complete_done(napi, work_done);
+		hfi1_rcd_eoi_intr(rcd);
+	}
+
+	return work_done;
+}
+
+/* Receive packet napi handler for netdevs VNIC and AIP  */
+irqreturn_t receive_context_interrupt_napi(int irq, void *data)
+{
+	struct hfi1_ctxtdata *rcd = data;
+
+	receive_interrupt_common(rcd);
+
+	if (likely(rcd->napi)) {
+		if (likely(napi_schedule_prep(rcd->napi)))
+			__napi_schedule_irqoff(rcd->napi);
+		else
+			__hfi1_rcd_eoi_intr(rcd);
+	} else {
+		WARN_ONCE(1, "Napi IRQ handler without napi set up ctxt=%d\n",
+			  rcd->ctxt);
+		__hfi1_rcd_eoi_intr(rcd);
+	}
+
+	return IRQ_HANDLED;
+}
+
 /*
  * Receive packet IRQ handler.  This routine expects to be on its own IRQ.
  * This routine will try to handle packets immediately (latency), but if
diff --git a/drivers/infiniband/hw/hfi1/chip.h b/drivers/infiniband/hw/hfi1/chip.h
index b10e0bf..2c6f2de 100644
--- a/drivers/infiniband/hw/hfi1/chip.h
+++ b/drivers/infiniband/hw/hfi1/chip.h
@@ -1447,6 +1447,7 @@ int hfi1_set_ctxt_pkey(struct hfi1_devdata *dd, struct hfi1_ctxtdata *ctxt,
 irqreturn_t sdma_interrupt(int irq, void *data);
 irqreturn_t receive_context_interrupt(int irq, void *data);
 irqreturn_t receive_context_thread(int irq, void *data);
+irqreturn_t receive_context_interrupt_napi(int irq, void *data);
 
 int set_intr_bits(struct hfi1_devdata *dd, u16 first, u16 last, bool set);
 void init_qsfp_int(struct hfi1_devdata *dd);
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index c5ed6ed..d89fc8f 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -752,6 +752,39 @@ static noinline int skip_rcv_packet(struct hfi1_packet *packet, int thread)
 	return ret;
 }
 
+static void process_rcv_packet_napi(struct hfi1_packet *packet)
+{
+	packet->etype = rhf_rcv_type(packet->rhf);
+
+	/* total length */
+	packet->tlen = rhf_pkt_len(packet->rhf); /* in bytes */
+	/* retrieve eager buffer details */
+	packet->etail = rhf_egr_index(packet->rhf);
+	packet->ebuf = get_egrbuf(packet->rcd, packet->rhf,
+				  &packet->updegr);
+	/*
+	 * Prefetch the contents of the eager buffer.  It is
+	 * OK to send a negative length to prefetch_range().
+	 * The +2 is the size of the RHF.
+	 */
+	prefetch_range(packet->ebuf,
+		       packet->tlen - ((packet->rcd->rcvhdrqentsize -
+				       (rhf_hdrq_offset(packet->rhf)
+					+ 2)) * 4));
+
+	packet->rcd->rhf_rcv_function_map[packet->etype](packet);
+	packet->numpkt++;
+
+	/* Set up for the next packet */
+	packet->rhqoff += packet->rsize;
+	if (packet->rhqoff >= packet->maxcnt)
+		packet->rhqoff = 0;
+
+	packet->rhf_addr = (__le32 *)packet->rcd->rcvhdrq + packet->rhqoff +
+				      packet->rcd->rhf_offset;
+	packet->rhf = rhf_to_cpu(packet->rhf_addr);
+}
+
 static inline int process_rcv_packet(struct hfi1_packet *packet, int thread)
 {
 	int ret;
@@ -831,6 +864,36 @@ static inline void finish_packet(struct hfi1_packet *packet)
 }
 
 /*
+ * handle_receive_interrupt_napi_fp - receive a packet
+ * @rcd: the context
+ * @budget: polling budget
+ *
+ * Called from interrupt handler for receive interrupt.
+ * This is the fast path interrupt handler
+ * when executing in the napi softirq environment.
+ */
+int handle_receive_interrupt_napi_fp(struct hfi1_ctxtdata *rcd, int budget)
+{
+	struct hfi1_packet packet;
+
+	init_packet(rcd, &packet);
+	if (last_rcv_seq(rcd, rhf_rcv_seq(packet.rhf)))
+		goto bail;
+
+	while (packet.numpkt < budget) {
+		process_rcv_packet_napi(&packet);
+		if (hfi1_seq_incr(rcd, rhf_rcv_seq(packet.rhf)))
+			break;
+
+		process_rcv_update(0, &packet);
+	}
+	hfi1_set_rcd_head(rcd, packet.rhqoff);
+bail:
+	finish_packet(&packet);
+	return packet.numpkt;
+}
+
+/*
  * Handle receive interrupts when using the no dma rtail option.
  */
 int handle_receive_interrupt_nodma_rtail(struct hfi1_ctxtdata *rcd, int thread)
@@ -1078,6 +1141,63 @@ int handle_receive_interrupt(struct hfi1_ctxtdata *rcd, int thread)
 }
 
 /*
+ * handle_receive_interrupt_napi_sp - receive a packet
+ * @rcd: the context
+ * @budget: polling budget
+ *
+ * Called from interrupt handler for errors or receive interrupt.
+ * This is the slow path interrupt handler
+ * when executing in the napi softirq environment.
+ */
+int handle_receive_interrupt_napi_sp(struct hfi1_ctxtdata *rcd, int budget)
+{
+	struct hfi1_devdata *dd = rcd->dd;
+	int last = RCV_PKT_OK;
+	bool needset = true;
+	struct hfi1_packet packet;
+
+	init_packet(rcd, &packet);
+	if (last_rcv_seq(rcd, rhf_rcv_seq(packet.rhf)))
+		goto bail;
+
+	while (last != RCV_PKT_DONE && packet.numpkt < budget) {
+		if (hfi1_need_drop(dd)) {
+			/* On to the next packet */
+			packet.rhqoff += packet.rsize;
+			packet.rhf_addr = (__le32 *)rcd->rcvhdrq +
+					  packet.rhqoff +
+					  rcd->rhf_offset;
+			packet.rhf = rhf_to_cpu(packet.rhf_addr);
+
+		} else {
+			if (set_armed_to_active(&packet))
+				goto bail;
+			process_rcv_packet_napi(&packet);
+		}
+
+		if (hfi1_seq_incr(rcd, rhf_rcv_seq(packet.rhf)))
+			last = RCV_PKT_DONE;
+
+		if (needset) {
+			needset = false;
+			set_all_fastpath(dd, rcd);
+		}
+
+		process_rcv_update(last, &packet);
+	}
+
+	hfi1_set_rcd_head(rcd, packet.rhqoff);
+
+bail:
+	/*
+	 * Always write head at end, and setup rcv interrupt, even
+	 * if no packets were processed.
+	 */
+	finish_packet(&packet);
+	return packet.numpkt;
+}
+
+/*
  * We may discover in the interrupt that the hardware link state has
  * changed from ARMED to ACTIVE (due to the arrival of a non-SC15 packet),
  * and we need to update the driver's notion of the link state.  We cannot
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 5798688..b45e4b9 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -385,11 +385,11 @@ struct hfi1_packet {
 	u32 rhqoff;
 	u32 dlid;
 	u32 slid;
+	int numpkt;
 	u16 tlen;
 	s16 etail;
 	u16 pkey;
 	u8 hlen;
-	u8 numpkt;
 	u8 rsize;
 	u8 updegr;
 	u8 etype;
@@ -1503,6 +1503,8 @@ struct hfi1_ctxtdata *hfi1_rcd_get_by_index_safe(struct hfi1_devdata *dd,
 int handle_receive_interrupt(struct hfi1_ctxtdata *rcd, int thread);
 int handle_receive_interrupt_nodma_rtail(struct hfi1_ctxtdata *rcd, int thread);
 int handle_receive_interrupt_dma_rtail(struct hfi1_ctxtdata *rcd, int thread);
+int handle_receive_interrupt_napi_fp(struct hfi1_ctxtdata *rcd, int budget);
+int handle_receive_interrupt_napi_sp(struct hfi1_ctxtdata *rcd, int budget);
 void set_all_slowpath(struct hfi1_devdata *dd);
 
 extern const struct pci_device_id hfi1_pci_tbl[];
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index a4a1b3a..95bff7d 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -374,6 +374,7 @@ int hfi1_create_ctxtdata(struct hfi1_pportdata *ppd, int numa,
 		rcd->numa_id = numa;
 		rcd->rcv_array_groups = dd->rcv_entries.ngroups;
 		rcd->rhf_rcv_function_map = normal_rhf_rcv_functions;
+		rcd->msix_intr = CCE_NUM_MSIX_VECTORS;
 
 		mutex_init(&rcd->exp_mutex);
 		spin_lock_init(&rcd->exp_lock);
diff --git a/drivers/infiniband/hw/hfi1/msix.c b/drivers/infiniband/hw/hfi1/msix.c
index 7574f2b..7559875e 100644
--- a/drivers/infiniband/hw/hfi1/msix.c
+++ b/drivers/infiniband/hw/hfi1/msix.c
@@ -49,6 +49,7 @@
 #include "hfi.h"
 #include "affinity.h"
 #include "sdma.h"
+#include "netdev.h"
 
 /**
  * msix_initialize() - Calculate, request and configure MSIx IRQs
@@ -140,7 +141,7 @@ static int msix_request_irq(struct hfi1_devdata *dd, void *arg,
 	ret = pci_request_irq(dd->pcidev, nr, handler, thread, arg, name);
 	if (ret) {
 		dd_dev_err(dd,
-			   "%s: request for IRQ %d failed, MSIx %lu, err %d\n",
+			   "%s: request for IRQ %d failed, MSIx %lx, err %d\n",
 			   name, irq, nr, ret);
 		spin_lock(&dd->msix_info.msix_lock);
 		__clear_bit(nr, dd->msix_info.in_use_msix);
@@ -160,7 +161,7 @@ static int msix_request_irq(struct hfi1_devdata *dd, void *arg,
 	/* This is a request, so a failure is not fatal */
 	ret = hfi1_get_irq_affinity(dd, me);
 	if (ret)
-		dd_dev_err(dd, "unable to pin IRQ %d\n", ret);
+		dd_dev_err(dd, "%s: unable to pin IRQ %d\n", name, ret);
 
 	return nr;
 }
@@ -204,6 +205,21 @@ int msix_request_rcd_irq(struct hfi1_ctxtdata *rcd)
 }
 
 /**
+ * msix_netdev_request_rcd_irq() - Helper function for RCVAVAIL IRQs
+ * for netdev context
+ * @rcd: valid netdev context
+ */
+int msix_netdev_request_rcd_irq(struct hfi1_ctxtdata *rcd)
+{
+	char name[MAX_NAME_SIZE];
+
+	snprintf(name, sizeof(name), DRIVER_NAME "_%d nd kctxt%d",
+		 rcd->dd->unit, rcd->ctxt);
+	return msix_request_rcd_irq_common(rcd, receive_context_interrupt_napi,
+					   NULL, name);
+}
+
+/**
  * msix_request_smda_ira() - Helper for getting SDMA IRQ resources
  * @sde: valid sdma engine
  *
diff --git a/drivers/infiniband/hw/hfi1/msix.h b/drivers/infiniband/hw/hfi1/msix.h
index 1a02ab7..42fab94 100644
--- a/drivers/infiniband/hw/hfi1/msix.h
+++ b/drivers/infiniband/hw/hfi1/msix.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause) */
 /*
- * Copyright(c) 2018 Intel Corporation.
+ * Copyright(c) 2018 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -59,7 +59,8 @@
 int msix_request_sdma_irq(struct sdma_engine *sde);
 void msix_free_irq(struct hfi1_devdata *dd, u8 msix_intr);
 
-/* VNIC interface */
+/* Netdev interface */
 void msix_vnic_synchronize_irq(struct hfi1_devdata *dd);
+int msix_netdev_request_rcd_irq(struct hfi1_ctxtdata *rcd);
 
 #endif
diff --git a/drivers/infiniband/hw/hfi1/netdev.h b/drivers/infiniband/hw/hfi1/netdev.h
index 8992dfe..6740ec3 100644
--- a/drivers/infiniband/hw/hfi1/netdev.h
+++ b/drivers/infiniband/hw/hfi1/netdev.h
@@ -87,4 +87,7 @@ struct hfi1_ctxtdata *hfi1_netdev_get_ctxt(struct hfi1_devdata *dd, int ctxt)
 void *hfi1_netdev_get_data(struct hfi1_devdata *dd, int id);
 void *hfi1_netdev_get_first_data(struct hfi1_devdata *dd, int *start_id);
 
+/* chip.c  */
+int hfi1_netdev_rx_napi(struct napi_struct *napi, int budget);
+
 #endif /* HFI1_NETDEV_H */


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 11/16] IB/hfi1: Add rx functions for dummy netdev
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (9 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 10/16] IB/hfi1: Add interrupt handler functions for accelerated ipoib Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 12/16] IB/hfi1: Activate the " Dennis Dalessandro
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: Mike Marciniszyn, Grzegorz Andrejczuk, Dennis Dalessandro,
	linux-rdma, Kaike Wan, Sadanand Warrier

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

This patch adds the rx functions for the dummy netdev (a reduced
sketch of the shared setup/teardown pattern follows the list below):
- Functions to allocate/free the dummy netdev.
- Functions to allocate/free receiving contexts for the netdev.
- Functions to initialize/de-initialize the receive queue.
- Functions to enable/disable the receive queue.
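
The init/destroy pair above is reference counted so that multiple users
(VNIC ports and the IPoIB netdev) can share one set of receive queues:
the first caller builds the queues and the last one tears them down,
which is the atomic_fetch_inc()/atomic_fetch_add_unless() pairing visible
in hfi1_netdev_rx_init() and hfi1_netdev_rx_destroy() below. A reduced
sketch of that pattern, with hypothetical names shared_rx_init,
shared_rx_destroy and setup_lock:

#include <linux/atomic.h>
#include <linux/mutex.h>

static atomic_t users = ATOMIC_INIT(0);
static DEFINE_MUTEX(setup_lock);

/* First caller sets the queues up; later callers only bump the counter. */
static int shared_rx_init(void)
{
	if (atomic_fetch_inc(&users))
		return 0;

	mutex_lock(&setup_lock);
	/* allocate rx queues, netif_napi_add() each, request IRQs ... */
	mutex_unlock(&setup_lock);
	return 0;
}

/* Only the last caller (counter dropping 1 -> 0) tears the queues down. */
static void shared_rx_destroy(void)
{
	if (atomic_fetch_add_unless(&users, -1, 0) == 1) {
		mutex_lock(&setup_lock);
		/* netif_napi_del() each queue, free IRQs and memory ... */
		mutex_unlock(&setup_lock);
	}
}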

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/ipoib.h      |    3 
 drivers/infiniband/hw/hfi1/ipoib_main.c |   30 ++-
 drivers/infiniband/hw/hfi1/ipoib_rx.c   |   16 +
 drivers/infiniband/hw/hfi1/netdev.h     |    6 +
 drivers/infiniband/hw/hfi1/netdev_rx.c  |  361 +++++++++++++++++++++++++++++++
 5 files changed, 414 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/ipoib.h b/drivers/infiniband/hw/hfi1/ipoib.h
index ca00f6c..185c9b0 100644
--- a/drivers/infiniband/hw/hfi1/ipoib.h
+++ b/drivers/infiniband/hw/hfi1/ipoib.h
@@ -154,6 +154,9 @@ int hfi1_ipoib_send_dma(struct net_device *dev,
 int hfi1_ipoib_txreq_init(struct hfi1_ipoib_dev_priv *priv);
 void hfi1_ipoib_txreq_deinit(struct hfi1_ipoib_dev_priv *priv);
 
+int hfi1_ipoib_rxq_init(struct net_device *dev);
+void hfi1_ipoib_rxq_deinit(struct net_device *dev);
+
 void hfi1_ipoib_napi_tx_enable(struct net_device *dev);
 void hfi1_ipoib_napi_tx_disable(struct net_device *dev);
 
diff --git a/drivers/infiniband/hw/hfi1/ipoib_main.c b/drivers/infiniband/hw/hfi1/ipoib_main.c
index 304a5ac..014351e 100644
--- a/drivers/infiniband/hw/hfi1/ipoib_main.c
+++ b/drivers/infiniband/hw/hfi1/ipoib_main.c
@@ -19,16 +19,31 @@ static u32 qpn_from_mac(u8 *mac_arr)
 static int hfi1_ipoib_dev_init(struct net_device *dev)
 {
 	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
+	int ret;
 
 	priv->netstats = netdev_alloc_pcpu_stats(struct pcpu_sw_netstats);
 
-	return priv->netdev_ops->ndo_init(dev);
+	ret = priv->netdev_ops->ndo_init(dev);
+	if (ret)
+		return ret;
+
+	ret = hfi1_netdev_add_data(priv->dd,
+				   qpn_from_mac(priv->netdev->dev_addr),
+				   dev);
+	if (ret < 0) {
+		priv->netdev_ops->ndo_uninit(dev);
+		return ret;
+	}
+
+	return 0;
 }
 
 static void hfi1_ipoib_dev_uninit(struct net_device *dev)
 {
 	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
 
+	hfi1_netdev_remove_data(priv->dd, qpn_from_mac(priv->netdev->dev_addr));
+
 	priv->netdev_ops->ndo_uninit(dev);
 }
 
@@ -55,6 +70,7 @@ static int hfi1_ipoib_dev_open(struct net_device *dev)
 		priv->qp = qp;
 		rcu_read_unlock();
 
+		hfi1_netdev_enable_queues(priv->dd);
 		hfi1_ipoib_napi_tx_enable(dev);
 	}
 
@@ -69,6 +85,7 @@ static int hfi1_ipoib_dev_stop(struct net_device *dev)
 		return 0;
 
 	hfi1_ipoib_napi_tx_disable(dev);
+	hfi1_netdev_disable_queues(priv->dd);
 
 	rvt_put_qp(priv->qp);
 	priv->qp = NULL;
@@ -195,6 +212,7 @@ static void hfi1_ipoib_netdev_dtor(struct net_device *dev)
 	struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(dev);
 
 	hfi1_ipoib_txreq_deinit(priv);
+	hfi1_ipoib_rxq_deinit(priv->netdev);
 
 	free_percpu(priv->netstats);
 }
@@ -252,6 +270,13 @@ static int hfi1_ipoib_setup_rn(struct ib_device *device,
 		return rc;
 	}
 
+	rc = hfi1_ipoib_rxq_init(netdev);
+	if (rc) {
+		dd_dev_err(dd, "IPoIB netdev RX init - failed(%d)\n", rc);
+		hfi1_ipoib_free_rdma_netdev(netdev);
+		return rc;
+	}
+
 	netdev->priv_destructor = hfi1_ipoib_netdev_dtor;
 	netdev->needs_free_netdev = true;
 
@@ -268,7 +293,7 @@ int hfi1_ipoib_rn_get_params(struct ib_device *device,
 	if (type != RDMA_NETDEV_IPOIB)
 		return -EOPNOTSUPP;
 
-	if (!HFI1_CAP_IS_KSET(AIP))
+	if (!HFI1_CAP_IS_KSET(AIP) || !dd->num_netdev_contexts)
 		return -EOPNOTSUPP;
 
 	if (!port_num || port_num > dd->num_pports)
@@ -276,6 +301,7 @@ int hfi1_ipoib_rn_get_params(struct ib_device *device,
 
 	params->sizeof_priv = sizeof(struct hfi1_ipoib_rdma_netdev);
 	params->txqs = dd->num_sdma;
+	params->rxqs = dd->num_netdev_contexts;
 	params->param = NULL;
 	params->initialize_rdma_netdev = hfi1_ipoib_setup_rn;
 
diff --git a/drivers/infiniband/hw/hfi1/ipoib_rx.c b/drivers/infiniband/hw/hfi1/ipoib_rx.c
index 2485663..606ac69 100644
--- a/drivers/infiniband/hw/hfi1/ipoib_rx.c
+++ b/drivers/infiniband/hw/hfi1/ipoib_rx.c
@@ -69,3 +69,19 @@ struct sk_buff *hfi1_ipoib_prepare_skb(struct hfi1_netdev_rxq *rxq,
 
 	return skb;
 }
+
+int hfi1_ipoib_rxq_init(struct net_device *netdev)
+{
+	struct hfi1_ipoib_dev_priv *ipoib_priv = hfi1_ipoib_priv(netdev);
+	struct hfi1_devdata *dd = ipoib_priv->dd;
+
+	return hfi1_netdev_rx_init(dd);
+}
+
+void hfi1_ipoib_rxq_deinit(struct net_device *netdev)
+{
+	struct hfi1_ipoib_dev_priv *ipoib_priv = hfi1_ipoib_priv(netdev);
+	struct hfi1_devdata *dd = ipoib_priv->dd;
+
+	hfi1_netdev_rx_destroy(dd);
+}
diff --git a/drivers/infiniband/hw/hfi1/netdev.h b/drivers/infiniband/hw/hfi1/netdev.h
index 6740ec3..edb936f 100644
--- a/drivers/infiniband/hw/hfi1/netdev.h
+++ b/drivers/infiniband/hw/hfi1/netdev.h
@@ -82,6 +82,12 @@ struct hfi1_ctxtdata *hfi1_netdev_get_ctxt(struct hfi1_devdata *dd, int ctxt)
 	return priv->rxq[ctxt].rcd;
 }
 
+void hfi1_netdev_enable_queues(struct hfi1_devdata *dd);
+void hfi1_netdev_disable_queues(struct hfi1_devdata *dd);
+int hfi1_netdev_rx_init(struct hfi1_devdata *dd);
+int hfi1_netdev_rx_destroy(struct hfi1_devdata *dd);
+int hfi1_netdev_alloc(struct hfi1_devdata *dd);
+void hfi1_netdev_free(struct hfi1_devdata *dd);
 int hfi1_netdev_add_data(struct hfi1_devdata *dd, int id, void *data);
 void *hfi1_netdev_remove_data(struct hfi1_devdata *dd, int id);
 void *hfi1_netdev_get_data(struct hfi1_devdata *dd, int id);
diff --git a/drivers/infiniband/hw/hfi1/netdev_rx.c b/drivers/infiniband/hw/hfi1/netdev_rx.c
index 19597e0..95e8c98 100644
--- a/drivers/infiniband/hw/hfi1/netdev_rx.c
+++ b/drivers/infiniband/hw/hfi1/netdev_rx.c
@@ -17,6 +17,367 @@
 #include <linux/etherdevice.h>
 #include <rdma/ib_verbs.h>
 
+static int hfi1_netdev_setup_ctxt(struct hfi1_netdev_priv *priv,
+				  struct hfi1_ctxtdata *uctxt)
+{
+	unsigned int rcvctrl_ops;
+	struct hfi1_devdata *dd = priv->dd;
+	int ret;
+
+	uctxt->rhf_rcv_function_map = netdev_rhf_rcv_functions;
+	uctxt->do_interrupt = &handle_receive_interrupt_napi_sp;
+
+	/* Now allocate the RcvHdr queue and eager buffers. */
+	ret = hfi1_create_rcvhdrq(dd, uctxt);
+	if (ret)
+		goto done;
+
+	ret = hfi1_setup_eagerbufs(uctxt);
+	if (ret)
+		goto done;
+
+	clear_rcvhdrtail(uctxt);
+
+	rcvctrl_ops = HFI1_RCVCTRL_CTXT_DIS;
+	rcvctrl_ops |= HFI1_RCVCTRL_INTRAVAIL_DIS;
+
+	if (!HFI1_CAP_KGET_MASK(uctxt->flags, MULTI_PKT_EGR))
+		rcvctrl_ops |= HFI1_RCVCTRL_ONE_PKT_EGR_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_EGR_FULL))
+		rcvctrl_ops |= HFI1_RCVCTRL_NO_EGR_DROP_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_RHQ_FULL))
+		rcvctrl_ops |= HFI1_RCVCTRL_NO_RHQ_DROP_ENB;
+	if (HFI1_CAP_KGET_MASK(uctxt->flags, DMA_RTAIL))
+		rcvctrl_ops |= HFI1_RCVCTRL_TAILUPD_ENB;
+
+	hfi1_rcvctrl(uctxt->dd, rcvctrl_ops, uctxt);
+done:
+	return ret;
+}
+
+static int hfi1_netdev_allocate_ctxt(struct hfi1_devdata *dd,
+				     struct hfi1_ctxtdata **ctxt)
+{
+	struct hfi1_ctxtdata *uctxt;
+	int ret;
+
+	if (dd->flags & HFI1_FROZEN)
+		return -EIO;
+
+	ret = hfi1_create_ctxtdata(dd->pport, dd->node, &uctxt);
+	if (ret < 0) {
+		dd_dev_err(dd, "Unable to create ctxtdata, failing open\n");
+		return -ENOMEM;
+	}
+
+	uctxt->flags = HFI1_CAP_KGET(MULTI_PKT_EGR) |
+		HFI1_CAP_KGET(NODROP_RHQ_FULL) |
+		HFI1_CAP_KGET(NODROP_EGR_FULL) |
+		HFI1_CAP_KGET(DMA_RTAIL);
+	/* Netdev contexts are always NO_RDMA_RTAIL */
+	uctxt->fast_handler = handle_receive_interrupt_napi_fp;
+	uctxt->slow_handler = handle_receive_interrupt_napi_sp;
+	hfi1_set_seq_cnt(uctxt, 1);
+	uctxt->is_vnic = true;
+
+	hfi1_stats.sps_ctxts++;
+
+	dd_dev_info(dd, "created netdev context %d\n", uctxt->ctxt);
+	*ctxt = uctxt;
+
+	return 0;
+}
+
+static void hfi1_netdev_deallocate_ctxt(struct hfi1_devdata *dd,
+					struct hfi1_ctxtdata *uctxt)
+{
+	flush_wc();
+
+	/*
+	 * Disable receive context and interrupt available, reset all
+	 * RcvCtxtCtrl bits to default values.
+	 */
+	hfi1_rcvctrl(dd, HFI1_RCVCTRL_CTXT_DIS |
+		     HFI1_RCVCTRL_TIDFLOW_DIS |
+		     HFI1_RCVCTRL_INTRAVAIL_DIS |
+		     HFI1_RCVCTRL_ONE_PKT_EGR_DIS |
+		     HFI1_RCVCTRL_NO_RHQ_DROP_DIS |
+		     HFI1_RCVCTRL_NO_EGR_DROP_DIS, uctxt);
+
+	if (uctxt->msix_intr != CCE_NUM_MSIX_VECTORS)
+		msix_free_irq(dd, uctxt->msix_intr);
+
+	uctxt->msix_intr = CCE_NUM_MSIX_VECTORS;
+	uctxt->event_flags = 0;
+
+	hfi1_clear_tids(uctxt);
+	hfi1_clear_ctxt_pkey(dd, uctxt);
+
+	hfi1_stats.sps_ctxts--;
+
+	hfi1_free_ctxt(uctxt);
+}
+
+static int hfi1_netdev_allot_ctxt(struct hfi1_netdev_priv *priv,
+				  struct hfi1_ctxtdata **ctxt)
+{
+	int rc;
+	struct hfi1_devdata *dd = priv->dd;
+
+	rc = hfi1_netdev_allocate_ctxt(dd, ctxt);
+	if (rc) {
+		dd_dev_err(dd, "netdev ctxt alloc failed %d\n", rc);
+		return rc;
+	}
+
+	rc = hfi1_netdev_setup_ctxt(priv, *ctxt);
+	if (rc) {
+		dd_dev_err(dd, "netdev ctxt setup failed %d\n", rc);
+		hfi1_netdev_deallocate_ctxt(dd, *ctxt);
+		*ctxt = NULL;
+	}
+
+	return rc;
+}
+
+static int hfi1_netdev_rxq_init(struct net_device *dev)
+{
+	int i;
+	int rc;
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dev);
+	struct hfi1_devdata *dd = priv->dd;
+
+	priv->num_rx_q = dd->num_netdev_contexts;
+	priv->rxq = kcalloc_node(priv->num_rx_q, sizeof(struct hfi1_netdev_rxq),
+				 GFP_KERNEL, dd->node);
+
+	if (!priv->rxq) {
+		dd_dev_err(dd, "Unable to allocate netdev queue data\n");
+		return (-ENOMEM);
+	}
+
+	for (i = 0; i < priv->num_rx_q; i++) {
+		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
+
+		rc = hfi1_netdev_allot_ctxt(priv, &rxq->rcd);
+		if (rc)
+			goto bail_context_irq_failure;
+
+		hfi1_rcd_get(rxq->rcd);
+		rxq->priv = priv;
+		rxq->rcd->napi = &rxq->napi;
+		dd_dev_info(dd, "Setting rcv queue %d napi to context %d\n",
+			    i, rxq->rcd->ctxt);
+		/*
+		 * Disable BUSY_POLL on this NAPI as this is not supported
+		 * right now.
+		 */
+		set_bit(NAPI_STATE_NO_BUSY_POLL, &rxq->napi.state);
+		netif_napi_add(dev, &rxq->napi, hfi1_netdev_rx_napi, 64);
+		rc = msix_netdev_request_rcd_irq(rxq->rcd);
+		if (rc)
+			goto bail_context_irq_failure;
+	}
+
+	return 0;
+
+bail_context_irq_failure:
+	dd_dev_err(dd, "Unable to allot receive context\n");
+	for (; i >= 0; i--) {
+		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
+
+		if (rxq->rcd) {
+			hfi1_netdev_deallocate_ctxt(dd, rxq->rcd);
+			hfi1_rcd_put(rxq->rcd);
+			rxq->rcd = NULL;
+		}
+	}
+	kfree(priv->rxq);
+	priv->rxq = NULL;
+
+	return rc;
+}
+
+static void hfi1_netdev_rxq_deinit(struct net_device *dev)
+{
+	int i;
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dev);
+	struct hfi1_devdata *dd = priv->dd;
+
+	for (i = 0; i < priv->num_rx_q; i++) {
+		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
+
+		netif_napi_del(&rxq->napi);
+		hfi1_netdev_deallocate_ctxt(dd, rxq->rcd);
+		hfi1_rcd_put(rxq->rcd);
+		rxq->rcd = NULL;
+	}
+
+	kfree(priv->rxq);
+	priv->rxq = NULL;
+	priv->num_rx_q = 0;
+}
+
+static void enable_queues(struct hfi1_netdev_priv *priv)
+{
+	int i;
+
+	for (i = 0; i < priv->num_rx_q; i++) {
+		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
+
+		dd_dev_info(priv->dd, "enabling queue %d on context %d\n", i,
+			    rxq->rcd->ctxt);
+		napi_enable(&rxq->napi);
+		hfi1_rcvctrl(priv->dd,
+			     HFI1_RCVCTRL_CTXT_ENB | HFI1_RCVCTRL_INTRAVAIL_ENB,
+			     rxq->rcd);
+	}
+}
+
+static void disable_queues(struct hfi1_netdev_priv *priv)
+{
+	int i;
+
+	msix_vnic_synchronize_irq(priv->dd);
+
+	for (i = 0; i < priv->num_rx_q; i++) {
+		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
+
+		dd_dev_info(priv->dd, "disabling queue %d on context %d\n", i,
+			    rxq->rcd->ctxt);
+
+		/* wait for napi if it was scheduled */
+		hfi1_rcvctrl(priv->dd,
+			     HFI1_RCVCTRL_CTXT_DIS | HFI1_RCVCTRL_INTRAVAIL_DIS,
+			     rxq->rcd);
+		napi_synchronize(&rxq->napi);
+		napi_disable(&rxq->napi);
+	}
+}
+
+/**
+ * hfi1_netdev_rx_init - Increments netdevs counter. When called the first time,
+ * it allocates receive queue data and calls netif_napi_add
+ * for each queue.
+ *
+ * @dd: hfi1 dev data
+ */
+int hfi1_netdev_rx_init(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+	int res;
+
+	if (atomic_fetch_inc(&priv->netdevs))
+		return 0;
+
+	mutex_lock(&hfi1_mutex);
+	init_dummy_netdev(dd->dummy_netdev);
+	res = hfi1_netdev_rxq_init(dd->dummy_netdev);
+	mutex_unlock(&hfi1_mutex);
+	return res;
+}
+
+/**
+ * hfi1_netdev_rx_destroy - Decrements netdevs counter; when it reaches 0,
+ * napi is deleted and the receive queues memory is freed.
+ *
+ * @dd: hfi1 dev data
+ */
+int hfi1_netdev_rx_destroy(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	/* destroy the RX queues only if it is the last netdev going away */
+	if (atomic_fetch_add_unless(&priv->netdevs, -1, 0) == 1) {
+		mutex_lock(&hfi1_mutex);
+		hfi1_netdev_rxq_deinit(dd->dummy_netdev);
+		mutex_unlock(&hfi1_mutex);
+	}
+
+	return 0;
+}
+
+/**
+ * hfi1_netdev_alloc - Allocates netdev and private data. It is required
+ * because RMT index and MSI-X interrupt can be set only
+ * during driver initialization.
+ *
+ * @dd: hfi1 dev data
+ */
+int hfi1_netdev_alloc(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv;
+	const int netdev_size = sizeof(*dd->dummy_netdev) +
+		sizeof(struct hfi1_netdev_priv);
+
+	dd_dev_info(dd, "allocating netdev size %d\n", netdev_size);
+	dd->dummy_netdev = kcalloc_node(1, netdev_size, GFP_KERNEL, dd->node);
+
+	if (!dd->dummy_netdev)
+		return -ENOMEM;
+
+	priv = hfi1_netdev_priv(dd->dummy_netdev);
+	priv->dd = dd;
+	xa_init(&priv->dev_tbl);
+	atomic_set(&priv->enabled, 0);
+	atomic_set(&priv->netdevs, 0);
+
+	return 0;
+}
+
+void hfi1_netdev_free(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv;
+
+	if (dd->dummy_netdev) {
+		priv = hfi1_netdev_priv(dd->dummy_netdev);
+		dd_dev_info(dd, "hfi1 netdev freed\n");
+		kfree(dd->dummy_netdev);
+		dd->dummy_netdev = NULL;
+	}
+}
+
+/**
+ * hfi1_netdev_enable_queues - This is the napi enable function.
+ * It enables napi objects associated with queues.
+ * When at least one device has called it, it increments an atomic counter.
+ * The disable function decrements the counter and, when it reaches 0,
+ * calls napi_disable for every queue.
+ *
+ * @dd: hfi1 dev data
+ */
+void hfi1_netdev_enable_queues(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv;
+
+	if (!dd->dummy_netdev)
+		return;
+
+	priv = hfi1_netdev_priv(dd->dummy_netdev);
+	if (atomic_fetch_inc(&priv->enabled))
+		return;
+
+	mutex_lock(&hfi1_mutex);
+	enable_queues(priv);
+	mutex_unlock(&hfi1_mutex);
+}
+
+void hfi1_netdev_disable_queues(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv;
+
+	if (!dd->dummy_netdev)
+		return;
+
+	priv = hfi1_netdev_priv(dd->dummy_netdev);
+	if (atomic_dec_if_positive(&priv->enabled))
+		return;
+
+	mutex_lock(&hfi1_mutex);
+	disable_queues(priv);
+	mutex_unlock(&hfi1_mutex);
+}
+
 /**
  * hfi1_netdev_add_data - Registers data with unique identifier
  * to be requested later this is needed for VNIC and IPoIB VLANs


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH for-next 12/16] IB/hfi1: Activate the dummy netdev
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (10 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 11/16] IB/hfi1: Add rx functions for dummy netdev Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size Dennis Dalessandro
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Grzegorz Andrejczuk, Sadanand Warrier

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

As described in earlier patches, the ipoib netdev will share receive
contexts with the existing VNIC netdev through a dummy netdev. The
following changes are made to achieve that (a short sketch of the
shared lookup table follows the list):
- Set up netdev receive contexts after user contexts. A function is
  added to count the available netdev receive contexts.
- Add functions to set/get the receive map table free index.
- Rename NUM_VNIC_MAP_ENTRIES to NUM_NETDEV_MAP_ENTRIES.
- Let the dummy netdev own the receive contexts instead of VNIC.
- Allocate the dummy netdev when the hfi1 device is added and free it
  when the device is removed.
- Initialize AIP RSM rules when the IPoIB rxq is initialized and
  remove the rules when it is de-initialized.
- Convert VNIC to use the dummy netdev.
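
Both users end up registered in the dummy netdev's xarray: the IPoIB
netdev is keyed by the QPN derived from its MAC address, while VNIC
ports are pushed into a reserved id range (see VNIC_ID() in the
vnic_main.c hunk below) so the two key spaces cannot collide. A minimal
illustration of that idea; MY_VNIC_ID, dev_tbl, add_ipoib, add_vnic and
find_vnic are hypothetical names:

#include <linux/gfp.h>
#include <linux/xarray.h>

#define MY_VNIC_MASK	0xFF
#define MY_VNIC_ID(v)	((1UL << 24) | ((v) & MY_VNIC_MASK))

static DEFINE_XARRAY(dev_tbl);

/* IPoIB netdevs are keyed directly by their QPN. */
static int add_ipoib(unsigned long qpn, void *netdev)
{
	return xa_insert(&dev_tbl, qpn, netdev, GFP_KERNEL);
}

/* VNIC ports live above bit 24, so they never clash with QPN keys. */
static int add_vnic(unsigned int vesw_id, void *vinfo)
{
	return xa_insert(&dev_tbl, MY_VNIC_ID(vesw_id), vinfo, GFP_KERNEL);
}

static void *find_vnic(unsigned int vesw_id)
{
	return xa_load(&dev_tbl, MY_VNIC_ID(vesw_id));
}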

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/chip.c      |   73 ++++---
 drivers/infiniband/hw/hfi1/driver.c    |   18 --
 drivers/infiniband/hw/hfi1/hfi.h       |   14 -
 drivers/infiniband/hw/hfi1/init.c      |    9 -
 drivers/infiniband/hw/hfi1/ipoib_rx.c  |   10 +
 drivers/infiniband/hw/hfi1/msix.c      |   12 +
 drivers/infiniband/hw/hfi1/msix.h      |    2 
 drivers/infiniband/hw/hfi1/netdev.h    |   19 ++
 drivers/infiniband/hw/hfi1/netdev_rx.c |   46 +++++
 drivers/infiniband/hw/hfi1/vnic.h      |    5 -
 drivers/infiniband/hw/hfi1/vnic_main.c |  312 +++++---------------------------
 11 files changed, 178 insertions(+), 342 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 2117612..7f35b9e 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -13396,8 +13396,7 @@ static int set_up_interrupts(struct hfi1_devdata *dd)
 static int set_up_context_variables(struct hfi1_devdata *dd)
 {
 	unsigned long num_kernel_contexts;
-	u16 num_netdev_contexts = HFI1_NUM_VNIC_CTXT;
-	int total_contexts;
+	u16 num_netdev_contexts;
 	int ret;
 	unsigned ngroups;
 	int rmt_count;
@@ -13434,13 +13433,6 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 		num_kernel_contexts = send_contexts - num_vls - 1;
 	}
 
-	/* Accommodate VNIC contexts if possible */
-	if ((num_kernel_contexts + num_netdev_contexts) > rcv_contexts) {
-		dd_dev_err(dd, "No receive contexts available for VNIC\n");
-		num_netdev_contexts = 0;
-	}
-	total_contexts = num_kernel_contexts + num_netdev_contexts;
-
 	/*
 	 * User contexts:
 	 *	- default to 1 user context per real (non-HT) CPU core if
@@ -13453,15 +13445,19 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 	/*
 	 * Adjust the counts given a global max.
 	 */
-	if (total_contexts + n_usr_ctxts > rcv_contexts) {
+	if (num_kernel_contexts + n_usr_ctxts > rcv_contexts) {
 		dd_dev_err(dd,
-			   "Reducing # user receive contexts to: %d, from %u\n",
-			   rcv_contexts - total_contexts,
+			   "Reducing # user receive contexts to: %u, from %u\n",
+			   (u32)(rcv_contexts - num_kernel_contexts),
 			   n_usr_ctxts);
 		/* recalculate */
-		n_usr_ctxts = rcv_contexts - total_contexts;
+		n_usr_ctxts = rcv_contexts - num_kernel_contexts;
 	}
 
+	num_netdev_contexts =
+		hfi1_num_netdev_contexts(dd, rcv_contexts -
+					 (num_kernel_contexts + n_usr_ctxts),
+					 &node_affinity.real_cpu_mask);
 	/*
 	 * The RMT entries are currently allocated as shown below:
 	 * 1. QOS (0 to 128 entries);
@@ -13487,17 +13483,16 @@ static int set_up_context_variables(struct hfi1_devdata *dd)
 		n_usr_ctxts = user_rmt_reduced;
 	}
 
-	total_contexts += n_usr_ctxts;
-
-	/* the first N are kernel contexts, the rest are user/vnic contexts */
-	dd->num_rcv_contexts = total_contexts;
+	/* the first N are kernel contexts, the rest are user/netdev contexts */
+	dd->num_rcv_contexts =
+		num_kernel_contexts + n_usr_ctxts + num_netdev_contexts;
 	dd->n_krcv_queues = num_kernel_contexts;
 	dd->first_dyn_alloc_ctxt = num_kernel_contexts;
 	dd->num_netdev_contexts = num_netdev_contexts;
 	dd->num_user_contexts = n_usr_ctxts;
 	dd->freectxts = n_usr_ctxts;
 	dd_dev_info(dd,
-		    "rcv contexts: chip %d, used %d (kernel %d, vnic %u, user %u)\n",
+		    "rcv contexts: chip %d, used %d (kernel %d, netdev %u, user %u)\n",
 		    rcv_contexts,
 		    (int)dd->num_rcv_contexts,
 		    (int)dd->n_krcv_queues,
@@ -14554,7 +14549,8 @@ static bool hfi1_netdev_update_rmt(struct hfi1_devdata *dd)
 	u8 ctx_id = 0;
 	u64 reg;
 	u32 regoff;
-	int rmt_start = dd->vnic.rmt_start;
+	int rmt_start = hfi1_netdev_get_free_rmt_idx(dd);
+	int ctxt_count = hfi1_netdev_ctxt_count(dd);
 
 	/* We already have contexts mapped in RMT */
 	if (has_rsm_rule(dd, RSM_INS_VNIC) || has_rsm_rule(dd, RSM_INS_AIP)) {
@@ -14562,7 +14558,7 @@ static bool hfi1_netdev_update_rmt(struct hfi1_devdata *dd)
 		return true;
 	}
 
-	if (hfi1_is_rmt_full(rmt_start, NUM_VNIC_MAP_ENTRIES)) {
+	if (hfi1_is_rmt_full(rmt_start, NUM_NETDEV_MAP_ENTRIES)) {
 		dd_dev_err(dd, "Not enought RMT entries used = %d\n",
 			   rmt_start);
 		return false;
@@ -14570,27 +14566,27 @@ static bool hfi1_netdev_update_rmt(struct hfi1_devdata *dd)
 
 	dev_dbg(&(dd)->pcidev->dev, "RMT start = %d, end %d\n",
 		rmt_start,
-		rmt_start + NUM_VNIC_MAP_ENTRIES);
+		rmt_start + NUM_NETDEV_MAP_ENTRIES);
 
 	/* Update RSM mapping table, 32 regs, 256 entries - 1 ctx per byte */
 	regoff = RCV_RSM_MAP_TABLE + (rmt_start / 8) * 8;
 	reg = read_csr(dd, regoff);
-	for (i = 0; i < NUM_VNIC_MAP_ENTRIES; i++) {
+	for (i = 0; i < NUM_NETDEV_MAP_ENTRIES; i++) {
 		/* Update map register with netdev context */
 		j = (rmt_start + i) % 8;
 		reg &= ~(0xffllu << (j * 8));
-		reg |= (u64)dd->vnic.ctxt[ctx_id++]->ctxt << (j * 8);
+		reg |= (u64)hfi1_netdev_get_ctxt(dd, ctx_id++)->ctxt << (j * 8);
 		/* Wrap up netdev ctx index */
-		ctx_id %= dd->vnic.num_ctxt;
+		ctx_id %= ctxt_count;
 		/* Write back map register */
-		if (j == 7 || ((i + 1) == NUM_VNIC_MAP_ENTRIES)) {
+		if (j == 7 || ((i + 1) == NUM_NETDEV_MAP_ENTRIES)) {
 			dev_dbg(&(dd)->pcidev->dev,
 				"RMT[%d] =0x%llx\n",
 				regoff - RCV_RSM_MAP_TABLE, reg);
 
 			write_csr(dd, regoff, reg);
 			regoff += 8;
-			if (i < (NUM_VNIC_MAP_ENTRIES - 1))
+			if (i < (NUM_NETDEV_MAP_ENTRIES - 1))
 				reg = read_csr(dd, regoff);
 		}
 	}
@@ -14617,8 +14613,9 @@ void hfi1_init_aip_rsm(struct hfi1_devdata *dd)
 	 * exist yet
 	 */
 	if (atomic_fetch_inc(&dd->ipoib_rsm_usr_num) == 0) {
+		int rmt_start = hfi1_netdev_get_free_rmt_idx(dd);
 		struct rsm_rule_data rrd = {
-			.offset = dd->vnic.rmt_start,
+			.offset = rmt_start,
 			.pkt_type = IB_PACKET_TYPE,
 			.field1_off = LRH_BTH_MATCH_OFFSET,
 			.mask1 = LRH_BTH_MASK,
@@ -14627,10 +14624,10 @@ void hfi1_init_aip_rsm(struct hfi1_devdata *dd)
 			.mask2 = BTH_DESTQP_MASK,
 			.value2 = BTH_DESTQP_VALUE,
 			.index1_off = DETH_AIP_SQPN_SELECT_OFFSET +
-					ilog2(NUM_VNIC_MAP_ENTRIES),
-			.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES),
+					ilog2(NUM_NETDEV_MAP_ENTRIES),
+			.index1_width = ilog2(NUM_NETDEV_MAP_ENTRIES),
 			.index2_off = DETH_AIP_SQPN_SELECT_OFFSET,
-			.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES)
+			.index2_width = ilog2(NUM_NETDEV_MAP_ENTRIES)
 		};
 
 		hfi1_enable_rsm_rule(dd, RSM_INS_AIP, &rrd);
@@ -14640,9 +14637,10 @@ void hfi1_init_aip_rsm(struct hfi1_devdata *dd)
 /* Initialize RSM for VNIC */
 void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
 {
+	int rmt_start = hfi1_netdev_get_free_rmt_idx(dd);
 	struct rsm_rule_data rrd = {
 		/* Add rule for vnic */
-		.offset = dd->vnic.rmt_start,
+		.offset = rmt_start,
 		.pkt_type = 4,
 		/* Match 16B packets */
 		.field1_off = L2_TYPE_MATCH_OFFSET,
@@ -14654,9 +14652,9 @@ void hfi1_init_vnic_rsm(struct hfi1_devdata *dd)
 		.value2 = L4_16B_ETH_VALUE,
 		/* Calc context from veswid and entropy */
 		.index1_off = L4_16B_HDR_VESWID_OFFSET,
-		.index1_width = ilog2(NUM_VNIC_MAP_ENTRIES),
+		.index1_width = ilog2(NUM_NETDEV_MAP_ENTRIES),
 		.index2_off = L2_16B_ENTROPY_OFFSET,
-		.index2_width = ilog2(NUM_VNIC_MAP_ENTRIES)
+		.index2_width = ilog2(NUM_NETDEV_MAP_ENTRIES)
 	};
 
 	hfi1_enable_rsm_rule(dd, RSM_INS_VNIC, &rrd);
@@ -14690,8 +14688,8 @@ static int init_rxe(struct hfi1_devdata *dd)
 	init_qos(dd, rmt);
 	init_fecn_handling(dd, rmt);
 	complete_rsm_map_table(dd, rmt);
-	/* record number of used rsm map entries for vnic */
-	dd->vnic.rmt_start = rmt->used;
+	/* record number of used rsm map entries for netdev */
+	hfi1_netdev_set_free_rmt_idx(dd, rmt->used);
 	kfree(rmt);
 
 	/*
@@ -15245,6 +15243,10 @@ int hfi1_init_dd(struct hfi1_devdata *dd)
 		 (dd->revision >> CCE_REVISION_SW_SHIFT)
 		    & CCE_REVISION_SW_MASK);
 
+	/* alloc netdev data */
+	if (hfi1_netdev_alloc(dd))
+		goto bail_cleanup;
+
 	ret = set_up_context_variables(dd);
 	if (ret)
 		goto bail_cleanup;
@@ -15345,6 +15347,7 @@ int hfi1_init_dd(struct hfi1_devdata *dd)
 	hfi1_comp_vectors_clean_up(dd);
 	msix_clean_up_interrupts(dd);
 bail_cleanup:
+	hfi1_netdev_free(dd);
 	hfi1_pcie_ddcleanup(dd);
 bail_free:
 	hfi1_free_devdata(dd);
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index d89fc8f..60ff6de 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1771,28 +1771,10 @@ static void process_receive_ib(struct hfi1_packet *packet)
 	hfi1_ib_rcv(packet);
 }
 
-static inline bool hfi1_is_vnic_packet(struct hfi1_packet *packet)
-{
-	/* Packet received in VNIC context via RSM */
-	if (packet->rcd->is_vnic)
-		return true;
-
-	if ((hfi1_16B_get_l2(packet->ebuf) == OPA_16B_L2_TYPE) &&
-	    (hfi1_16B_get_l4(packet->ebuf) == OPA_16B_L4_ETHR))
-		return true;
-
-	return false;
-}
-
 static void process_receive_bypass(struct hfi1_packet *packet)
 {
 	struct hfi1_devdata *dd = packet->rcd->dd;
 
-	if (hfi1_is_vnic_packet(packet)) {
-		hfi1_vnic_bypass_rcv(packet);
-		return;
-	}
-
 	if (hfi1_setup_bypass_packet(packet))
 		return;
 
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index b45e4b9..b44a3fe 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1047,23 +1047,10 @@ struct hfi1_asic_data {
 #define NUM_MAP_ENTRIES	 256
 #define NUM_MAP_REGS      32
 
-/*
- * Number of VNIC contexts used. Ensure it is less than or equal to
- * max queues supported by VNIC (HFI1_VNIC_MAX_QUEUE).
- */
-#define HFI1_NUM_VNIC_CTXT   8
-
-/* Number of VNIC RSM entries */
-#define NUM_VNIC_MAP_ENTRIES 8
-
 /* Virtual NIC information */
 struct hfi1_vnic_data {
-	struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
 	struct kmem_cache *txreq_cache;
-	struct xarray vesws;
 	u8 num_vports;
-	u8 rmt_start;
-	u8 num_ctxt;
 };
 
 struct hfi1_vnic_vport_info;
@@ -1421,6 +1408,7 @@ struct hfi1_devdata {
 	struct hfi1_vnic_data vnic;
 	/* Lock to protect IRQ SRC register access */
 	spinlock_t irq_src_lock;
+	int vnic_num_vports;
 	struct net_device *dummy_netdev;
 
 	/* Keeps track of IPoIB RSM rule users */
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 95bff7d..ff3045e 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -69,6 +69,7 @@
 #include "affinity.h"
 #include "vnic.h"
 #include "exp_rcv.h"
+#include "netdev.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -1684,9 +1685,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/* do the generic initialization */
 	initfail = hfi1_init(dd, 0);
 
-	/* setup vnic */
-	hfi1_vnic_setup(dd);
-
 	ret = hfi1_register_ib_device(dd);
 
 	/*
@@ -1725,7 +1723,6 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 			hfi1_device_remove(dd);
 		if (!ret)
 			hfi1_unregister_ib_device(dd);
-		hfi1_vnic_cleanup(dd);
 		postinit_cleanup(dd);
 		if (initfail)
 			ret = initfail;
@@ -1770,8 +1767,8 @@ static void remove_one(struct pci_dev *pdev)
 	/* unregister from IB core */
 	hfi1_unregister_ib_device(dd);
 
-	/* cleanup vnic */
-	hfi1_vnic_cleanup(dd);
+	/* free netdev data */
+	hfi1_netdev_free(dd);
 
 	/*
 	 * Disable the IB link, disable interrupts on the device,
diff --git a/drivers/infiniband/hw/hfi1/ipoib_rx.c b/drivers/infiniband/hw/hfi1/ipoib_rx.c
index 606ac69..3afa754 100644
--- a/drivers/infiniband/hw/hfi1/ipoib_rx.c
+++ b/drivers/infiniband/hw/hfi1/ipoib_rx.c
@@ -74,8 +74,15 @@ int hfi1_ipoib_rxq_init(struct net_device *netdev)
 {
 	struct hfi1_ipoib_dev_priv *ipoib_priv = hfi1_ipoib_priv(netdev);
 	struct hfi1_devdata *dd = ipoib_priv->dd;
+	int ret;
 
-	return hfi1_netdev_rx_init(dd);
+	ret = hfi1_netdev_rx_init(dd);
+	if (ret)
+		return ret;
+
+	hfi1_init_aip_rsm(dd);
+
+	return ret;
 }
 
 void hfi1_ipoib_rxq_deinit(struct net_device *netdev)
@@ -83,5 +90,6 @@ void hfi1_ipoib_rxq_deinit(struct net_device *netdev)
 	struct hfi1_ipoib_dev_priv *ipoib_priv = hfi1_ipoib_priv(netdev);
 	struct hfi1_devdata *dd = ipoib_priv->dd;
 
+	hfi1_deinit_aip_rsm(dd);
 	hfi1_netdev_rx_destroy(dd);
 }
diff --git a/drivers/infiniband/hw/hfi1/msix.c b/drivers/infiniband/hw/hfi1/msix.c
index 7559875e..d61ee85 100644
--- a/drivers/infiniband/hw/hfi1/msix.c
+++ b/drivers/infiniband/hw/hfi1/msix.c
@@ -172,7 +172,8 @@ static int msix_request_rcd_irq_common(struct hfi1_ctxtdata *rcd,
 				       const char *name)
 {
 	int nr = msix_request_irq(rcd->dd, rcd, handler, thread,
-				  IRQ_RCVCTXT, name);
+				  rcd->is_vnic ? IRQ_NETDEVCTXT : IRQ_RCVCTXT,
+				  name);
 	if (nr < 0)
 		return nr;
 
@@ -371,15 +372,16 @@ void msix_clean_up_interrupts(struct hfi1_devdata *dd)
 }
 
 /**
- * msix_vnic_syncrhonize_irq() - Vnic IRQ synchronize
+ * msix_netdev_synchronize_irq() - netdev IRQ synchronize
  * @dd: valid devdata
  */
-void msix_vnic_synchronize_irq(struct hfi1_devdata *dd)
+void msix_netdev_synchronize_irq(struct hfi1_devdata *dd)
 {
 	int i;
+	int ctxt_count = hfi1_netdev_ctxt_count(dd);
 
-	for (i = 0; i < dd->vnic.num_ctxt; i++) {
-		struct hfi1_ctxtdata *rcd = dd->vnic.ctxt[i];
+	for (i = 0; i < ctxt_count; i++) {
+		struct hfi1_ctxtdata *rcd = hfi1_netdev_get_ctxt(dd, i);
 		struct hfi1_msix_entry *me;
 
 		me = &dd->msix_info.msix_entries[rcd->msix_intr];
diff --git a/drivers/infiniband/hw/hfi1/msix.h b/drivers/infiniband/hw/hfi1/msix.h
index 42fab94..e63e944 100644
--- a/drivers/infiniband/hw/hfi1/msix.h
+++ b/drivers/infiniband/hw/hfi1/msix.h
@@ -60,7 +60,7 @@
 void msix_free_irq(struct hfi1_devdata *dd, u8 msix_intr);
 
 /* Netdev interface */
-void msix_vnic_synchronize_irq(struct hfi1_devdata *dd);
+void msix_netdev_synchronize_irq(struct hfi1_devdata *dd);
 int msix_netdev_request_rcd_irq(struct hfi1_ctxtdata *rcd);
 
 #endif
diff --git a/drivers/infiniband/hw/hfi1/netdev.h b/drivers/infiniband/hw/hfi1/netdev.h
index edb936f..947543a 100644
--- a/drivers/infiniband/hw/hfi1/netdev.h
+++ b/drivers/infiniband/hw/hfi1/netdev.h
@@ -82,6 +82,25 @@ struct hfi1_ctxtdata *hfi1_netdev_get_ctxt(struct hfi1_devdata *dd, int ctxt)
 	return priv->rxq[ctxt].rcd;
 }
 
+static inline
+int hfi1_netdev_get_free_rmt_idx(struct hfi1_devdata *dd)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	return priv->rmt_start;
+}
+
+static inline
+void hfi1_netdev_set_free_rmt_idx(struct hfi1_devdata *dd, int rmt_idx)
+{
+	struct hfi1_netdev_priv *priv = hfi1_netdev_priv(dd->dummy_netdev);
+
+	priv->rmt_start = rmt_idx;
+}
+
+u32 hfi1_num_netdev_contexts(struct hfi1_devdata *dd, u32 available_contexts,
+			     struct cpumask *cpu_mask);
+
 void hfi1_netdev_enable_queues(struct hfi1_devdata *dd);
 void hfi1_netdev_disable_queues(struct hfi1_devdata *dd);
 int hfi1_netdev_rx_init(struct hfi1_devdata *dd);
diff --git a/drivers/infiniband/hw/hfi1/netdev_rx.c b/drivers/infiniband/hw/hfi1/netdev_rx.c
index 95e8c98..ba3340f 100644
--- a/drivers/infiniband/hw/hfi1/netdev_rx.c
+++ b/drivers/infiniband/hw/hfi1/netdev_rx.c
@@ -140,6 +140,50 @@ static int hfi1_netdev_allot_ctxt(struct hfi1_netdev_priv *priv,
 	return rc;
 }
 
+/**
+ * hfi1_num_netdev_contexts - Count of netdev recv contexts to use.
+ * @dd: device on which to allocate netdev contexts
+ * @available_contexts: count of available receive contexts
+ * @cpu_mask: mask of possible cpus to include for contexts
+ *
+ * Return: count of physical cores on a node or the remaining available recv
+ * contexts for netdev recv context usage up to the maximum of
+ * HFI1_MAX_NETDEV_CTXTS.
+ * A value of 0 can be returned when acceleration is explicitly turned off,
+ * a memory allocation error occurs or when there are no available contexts.
+ *
+ */
+u32 hfi1_num_netdev_contexts(struct hfi1_devdata *dd, u32 available_contexts,
+			     struct cpumask *cpu_mask)
+{
+	cpumask_var_t node_cpu_mask;
+	unsigned int available_cpus;
+
+	if (!HFI1_CAP_IS_KSET(AIP))
+		return 0;
+
+	/* Always give user contexts priority over netdev contexts */
+	if (available_contexts == 0) {
+		dd_dev_info(dd, "No receive contexts available for netdevs.\n");
+		return 0;
+	}
+
+	if (!zalloc_cpumask_var(&node_cpu_mask, GFP_KERNEL)) {
+		dd_dev_err(dd, "Unable to allocate cpu_mask for netdevs.\n");
+		return 0;
+	}
+
+	cpumask_and(node_cpu_mask, cpu_mask,
+		    cpumask_of_node(pcibus_to_node(dd->pcidev->bus)));
+
+	available_cpus = cpumask_weight(node_cpu_mask);
+
+	free_cpumask_var(node_cpu_mask);
+
+	return min3(available_cpus, available_contexts,
+		    (u32)HFI1_MAX_NETDEV_CTXTS);
+}
+
 static int hfi1_netdev_rxq_init(struct net_device *dev)
 {
 	int i;
@@ -238,7 +282,7 @@ static void disable_queues(struct hfi1_netdev_priv *priv)
 {
 	int i;
 
-	msix_vnic_synchronize_irq(priv->dd);
+	msix_netdev_synchronize_irq(priv->dd);
 
 	for (i = 0; i < priv->num_rx_q; i++) {
 		struct hfi1_netdev_rxq *rxq = &priv->rxq[i];
diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h
index 5ae7815..66150a13 100644
--- a/drivers/infiniband/hw/hfi1/vnic.h
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -1,7 +1,7 @@
 #ifndef _HFI1_VNIC_H
 #define _HFI1_VNIC_H
 /*
- * Copyright(c) 2017 Intel Corporation.
+ * Copyright(c) 2017 - 2020 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -69,6 +69,7 @@
 #define HFI1_VNIC_SC_SHIFT      4
 
 #define HFI1_VNIC_MAX_QUEUE 16
+#define HFI1_NUM_VNIC_CTXT 8
 
 /**
  * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information
@@ -104,7 +105,6 @@ struct hfi1_vnic_rx_queue {
 	struct hfi1_vnic_vport_info *vinfo;
 	struct net_device           *netdev;
 	struct napi_struct           napi;
-	struct sk_buff_head          skbq;
 };
 
 /**
@@ -146,7 +146,6 @@ struct hfi1_vnic_vport_info {
 
 /* vnic hfi1 internal functions */
 void hfi1_vnic_setup(struct hfi1_devdata *dd);
-void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
 int hfi1_vnic_txreq_init(struct hfi1_devdata *dd);
 void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd);
 
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index db7624c..b183c56 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -53,6 +53,7 @@
 #include <linux/if_vlan.h>
 
 #include "vnic.h"
+#include "netdev.h"
 
 #define HFI_TX_TIMEOUT_MS 1000
 
@@ -62,114 +63,6 @@
 
 static DEFINE_SPINLOCK(vport_cntr_lock);
 
-static int setup_vnic_ctxt(struct hfi1_devdata *dd, struct hfi1_ctxtdata *uctxt)
-{
-	unsigned int rcvctrl_ops = 0;
-	int ret;
-
-	uctxt->do_interrupt = &handle_receive_interrupt;
-
-	/* Now allocate the RcvHdr queue and eager buffers. */
-	ret = hfi1_create_rcvhdrq(dd, uctxt);
-	if (ret)
-		goto done;
-
-	ret = hfi1_setup_eagerbufs(uctxt);
-	if (ret)
-		goto done;
-
-	if (hfi1_rcvhdrtail_kvaddr(uctxt))
-		clear_rcvhdrtail(uctxt);
-
-	rcvctrl_ops = HFI1_RCVCTRL_CTXT_ENB;
-	rcvctrl_ops |= HFI1_RCVCTRL_INTRAVAIL_ENB;
-
-	if (!HFI1_CAP_KGET_MASK(uctxt->flags, MULTI_PKT_EGR))
-		rcvctrl_ops |= HFI1_RCVCTRL_ONE_PKT_EGR_ENB;
-	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_EGR_FULL))
-		rcvctrl_ops |= HFI1_RCVCTRL_NO_EGR_DROP_ENB;
-	if (HFI1_CAP_KGET_MASK(uctxt->flags, NODROP_RHQ_FULL))
-		rcvctrl_ops |= HFI1_RCVCTRL_NO_RHQ_DROP_ENB;
-	if (HFI1_CAP_KGET_MASK(uctxt->flags, DMA_RTAIL))
-		rcvctrl_ops |= HFI1_RCVCTRL_TAILUPD_ENB;
-
-	hfi1_rcvctrl(uctxt->dd, rcvctrl_ops, uctxt);
-done:
-	return ret;
-}
-
-static int allocate_vnic_ctxt(struct hfi1_devdata *dd,
-			      struct hfi1_ctxtdata **vnic_ctxt)
-{
-	struct hfi1_ctxtdata *uctxt;
-	int ret;
-
-	if (dd->flags & HFI1_FROZEN)
-		return -EIO;
-
-	ret = hfi1_create_ctxtdata(dd->pport, dd->node, &uctxt);
-	if (ret < 0) {
-		dd_dev_err(dd, "Unable to create ctxtdata, failing open\n");
-		return -ENOMEM;
-	}
-
-	uctxt->flags = HFI1_CAP_KGET(MULTI_PKT_EGR) |
-			HFI1_CAP_KGET(NODROP_RHQ_FULL) |
-			HFI1_CAP_KGET(NODROP_EGR_FULL) |
-			HFI1_CAP_KGET(DMA_RTAIL);
-	uctxt->seq_cnt = 1;
-	uctxt->is_vnic = true;
-
-	msix_request_rcd_irq(uctxt);
-
-	hfi1_stats.sps_ctxts++;
-	dd_dev_dbg(dd, "created vnic context %d\n", uctxt->ctxt);
-	*vnic_ctxt = uctxt;
-
-	return 0;
-}
-
-static void deallocate_vnic_ctxt(struct hfi1_devdata *dd,
-				 struct hfi1_ctxtdata *uctxt)
-{
-	dd_dev_dbg(dd, "closing vnic context %d\n", uctxt->ctxt);
-	flush_wc();
-
-	/*
-	 * Disable receive context and interrupt available, reset all
-	 * RcvCtxtCtrl bits to default values.
-	 */
-	hfi1_rcvctrl(dd, HFI1_RCVCTRL_CTXT_DIS |
-		     HFI1_RCVCTRL_TIDFLOW_DIS |
-		     HFI1_RCVCTRL_INTRAVAIL_DIS |
-		     HFI1_RCVCTRL_ONE_PKT_EGR_DIS |
-		     HFI1_RCVCTRL_NO_RHQ_DROP_DIS |
-		     HFI1_RCVCTRL_NO_EGR_DROP_DIS, uctxt);
-
-	/* msix_intr will always be > 0, only clean up if this is true */
-	if (uctxt->msix_intr)
-		msix_free_irq(dd, uctxt->msix_intr);
-
-	uctxt->event_flags = 0;
-
-	hfi1_clear_tids(uctxt);
-	hfi1_clear_ctxt_pkey(dd, uctxt);
-
-	hfi1_stats.sps_ctxts--;
-
-	hfi1_free_ctxt(uctxt);
-}
-
-void hfi1_vnic_setup(struct hfi1_devdata *dd)
-{
-	xa_init(&dd->vnic.vesws);
-}
-
-void hfi1_vnic_cleanup(struct hfi1_devdata *dd)
-{
-	WARN_ON(!xa_empty(&dd->vnic.vesws));
-}
-
 #define SUM_GRP_COUNTERS(stats, qstats, x_grp) do {            \
 		u64 *src64, *dst64;                            \
 		for (src64 = &qstats->x_grp.unicast,           \
@@ -179,6 +72,9 @@ void hfi1_vnic_cleanup(struct hfi1_devdata *dd)
 		}                                              \
 	} while (0)
 
+#define VNIC_MASK (0xFF)
+#define VNIC_ID(val) ((1ull << 24) | ((val) & VNIC_MASK))
+
 /* hfi1_vnic_update_stats - update statistics */
 static void hfi1_vnic_update_stats(struct hfi1_vnic_vport_info *vinfo,
 				   struct opa_vnic_stats *stats)
@@ -454,71 +350,25 @@ static inline int hfi1_vnic_decap_skb(struct hfi1_vnic_rx_queue *rxq,
 	return rc;
 }
 
-static inline struct sk_buff *hfi1_vnic_get_skb(struct hfi1_vnic_rx_queue *rxq)
+static struct hfi1_vnic_vport_info *get_vnic_port(struct hfi1_devdata *dd,
+						  int vesw_id)
 {
-	unsigned char *pad_info;
-	struct sk_buff *skb;
-
-	skb = skb_dequeue(&rxq->skbq);
-	if (unlikely(!skb))
-		return NULL;
+	int vnic_id = VNIC_ID(vesw_id);
 
-	/* remove tail padding and icrc */
-	pad_info = skb->data + skb->len - 1;
-	skb_trim(skb, (skb->len - OPA_VNIC_ICRC_TAIL_LEN -
-		       ((*pad_info) & 0x7)));
-
-	return skb;
+	return hfi1_netdev_get_data(dd, vnic_id);
 }
 
-/* hfi1_vnic_handle_rx - handle skb receive */
-static void hfi1_vnic_handle_rx(struct hfi1_vnic_rx_queue *rxq,
-				int *work_done, int work_to_do)
+static struct hfi1_vnic_vport_info *get_first_vnic_port(struct hfi1_devdata *dd)
 {
-	struct hfi1_vnic_vport_info *vinfo = rxq->vinfo;
-	struct sk_buff *skb;
-	int rc;
-
-	while (1) {
-		if (*work_done >= work_to_do)
-			break;
-
-		skb = hfi1_vnic_get_skb(rxq);
-		if (unlikely(!skb))
-			break;
-
-		rc = hfi1_vnic_decap_skb(rxq, skb);
-		/* update rx counters */
-		hfi1_vnic_update_rx_counters(vinfo, rxq->idx, skb, rc);
-		if (unlikely(rc)) {
-			dev_kfree_skb_any(skb);
-			continue;
-		}
-
-		skb_checksum_none_assert(skb);
-		skb->protocol = eth_type_trans(skb, rxq->netdev);
-
-		napi_gro_receive(&rxq->napi, skb);
-		(*work_done)++;
-	}
-}
-
-/* hfi1_vnic_napi - napi receive polling callback function */
-static int hfi1_vnic_napi(struct napi_struct *napi, int budget)
-{
-	struct hfi1_vnic_rx_queue *rxq = container_of(napi,
-					      struct hfi1_vnic_rx_queue, napi);
-	struct hfi1_vnic_vport_info *vinfo = rxq->vinfo;
-	int work_done = 0;
+	struct hfi1_vnic_vport_info *vinfo;
+	int next_id = VNIC_ID(0);
 
-	v_dbg("napi %d budget %d\n", rxq->idx, budget);
-	hfi1_vnic_handle_rx(rxq, &work_done, budget);
+	vinfo = hfi1_netdev_get_first_data(dd, &next_id);
 
-	v_dbg("napi %d work_done %d\n", rxq->idx, work_done);
-	if (work_done < budget)
-		napi_complete(napi);
+	if (next_id > VNIC_ID(VNIC_MASK))
+		return NULL;
 
-	return work_done;
+	return vinfo;
 }
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
@@ -527,13 +377,14 @@ void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
 	struct hfi1_vnic_vport_info *vinfo = NULL;
 	struct hfi1_vnic_rx_queue *rxq;
 	struct sk_buff *skb;
-	int l4_type, vesw_id = -1;
+	int l4_type, vesw_id = -1, rc;
 	u8 q_idx;
+	unsigned char *pad_info;
 
 	l4_type = hfi1_16B_get_l4(packet->ebuf);
 	if (likely(l4_type == OPA_16B_L4_ETHR)) {
 		vesw_id = HFI1_VNIC_GET_VESWID(packet->ebuf);
-		vinfo = xa_load(&dd->vnic.vesws, vesw_id);
+		vinfo = get_vnic_port(dd, vesw_id);
 
 		/*
 		 * In case of invalid vesw id, count the error on
@@ -541,10 +392,8 @@ void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
 		 */
 		if (unlikely(!vinfo)) {
 			struct hfi1_vnic_vport_info *vinfo_tmp;
-			unsigned long index = 0;
 
-			vinfo_tmp = xa_find(&dd->vnic.vesws, &index, ULONG_MAX,
-					XA_PRESENT);
+			vinfo_tmp = get_first_vnic_port(dd);
 			if (vinfo_tmp) {
 				spin_lock(&vport_cntr_lock);
 				vinfo_tmp->stats[0].netstats.rx_nohandler++;
@@ -563,12 +412,6 @@ void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
 	rxq = &vinfo->rxq[q_idx];
 	if (unlikely(!netif_oper_up(vinfo->netdev))) {
 		vinfo->stats[q_idx].rx_drop_state++;
-		skb_queue_purge(&rxq->skbq);
-		return;
-	}
-
-	if (unlikely(skb_queue_len(&rxq->skbq) > HFI1_VNIC_RCV_Q_SIZE)) {
-		vinfo->stats[q_idx].netstats.rx_fifo_errors++;
 		return;
 	}
 
@@ -580,34 +423,41 @@ void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet)
 
 	memcpy(skb->data, packet->ebuf, packet->tlen);
 	skb_put(skb, packet->tlen);
-	skb_queue_tail(&rxq->skbq, skb);
 
-	if (napi_schedule_prep(&rxq->napi)) {
-		v_dbg("napi %d scheduling\n", q_idx);
-		__napi_schedule(&rxq->napi);
+	pad_info = skb->data + skb->len - 1;
+	skb_trim(skb, (skb->len - OPA_VNIC_ICRC_TAIL_LEN -
+		       ((*pad_info) & 0x7)));
+
+	rc = hfi1_vnic_decap_skb(rxq, skb);
+
+	/* update rx counters */
+	hfi1_vnic_update_rx_counters(vinfo, rxq->idx, skb, rc);
+	if (unlikely(rc)) {
+		dev_kfree_skb_any(skb);
+		return;
 	}
+
+	skb_checksum_none_assert(skb);
+	skb->protocol = eth_type_trans(skb, rxq->netdev);
+
+	napi_gro_receive(&rxq->napi, skb);
 }
 
 static int hfi1_vnic_up(struct hfi1_vnic_vport_info *vinfo)
 {
 	struct hfi1_devdata *dd = vinfo->dd;
 	struct net_device *netdev = vinfo->netdev;
-	int i, rc;
+	int rc;
 
 	/* ensure virtual eth switch id is valid */
 	if (!vinfo->vesw_id)
 		return -EINVAL;
 
-	rc = xa_insert(&dd->vnic.vesws, vinfo->vesw_id, vinfo, GFP_KERNEL);
+	rc = hfi1_netdev_add_data(dd, VNIC_ID(vinfo->vesw_id), vinfo);
 	if (rc < 0)
 		return rc;
 
-	for (i = 0; i < vinfo->num_rx_q; i++) {
-		struct hfi1_vnic_rx_queue *rxq = &vinfo->rxq[i];
-
-		skb_queue_head_init(&rxq->skbq);
-		napi_enable(&rxq->napi);
-	}
+	hfi1_netdev_rx_init(dd);
 
 	netif_carrier_on(netdev);
 	netif_tx_start_all_queues(netdev);
@@ -619,23 +469,13 @@ static int hfi1_vnic_up(struct hfi1_vnic_vport_info *vinfo)
 static void hfi1_vnic_down(struct hfi1_vnic_vport_info *vinfo)
 {
 	struct hfi1_devdata *dd = vinfo->dd;
-	u8 i;
 
 	clear_bit(HFI1_VNIC_UP, &vinfo->flags);
 	netif_carrier_off(vinfo->netdev);
 	netif_tx_disable(vinfo->netdev);
-	xa_erase(&dd->vnic.vesws, vinfo->vesw_id);
+	hfi1_netdev_remove_data(dd, VNIC_ID(vinfo->vesw_id));
 
-	/* ensure irqs see the change */
-	msix_vnic_synchronize_irq(dd);
-
-	/* remove unread skbs */
-	for (i = 0; i < vinfo->num_rx_q; i++) {
-		struct hfi1_vnic_rx_queue *rxq = &vinfo->rxq[i];
-
-		napi_disable(&rxq->napi);
-		skb_queue_purge(&rxq->skbq);
-	}
+	hfi1_netdev_rx_destroy(dd);
 }
 
 static int hfi1_netdev_open(struct net_device *netdev)
@@ -660,70 +500,30 @@ static int hfi1_netdev_close(struct net_device *netdev)
 	return 0;
 }
 
-static int hfi1_vnic_allot_ctxt(struct hfi1_devdata *dd,
-				struct hfi1_ctxtdata **vnic_ctxt)
-{
-	int rc;
-
-	rc = allocate_vnic_ctxt(dd, vnic_ctxt);
-	if (rc) {
-		dd_dev_err(dd, "vnic ctxt alloc failed %d\n", rc);
-		return rc;
-	}
-
-	rc = setup_vnic_ctxt(dd, *vnic_ctxt);
-	if (rc) {
-		dd_dev_err(dd, "vnic ctxt setup failed %d\n", rc);
-		deallocate_vnic_ctxt(dd, *vnic_ctxt);
-		*vnic_ctxt = NULL;
-	}
-
-	return rc;
-}
-
 static int hfi1_vnic_init(struct hfi1_vnic_vport_info *vinfo)
 {
 	struct hfi1_devdata *dd = vinfo->dd;
-	int i, rc = 0;
+	int rc = 0;
 
 	mutex_lock(&hfi1_mutex);
-	if (!dd->vnic.num_vports) {
+	if (!dd->vnic_num_vports) {
 		rc = hfi1_vnic_txreq_init(dd);
 		if (rc)
 			goto txreq_fail;
 	}
 
-	for (i = dd->vnic.num_ctxt; i < vinfo->num_rx_q; i++) {
-		rc = hfi1_vnic_allot_ctxt(dd, &dd->vnic.ctxt[i]);
-		if (rc)
-			break;
-		hfi1_rcd_get(dd->vnic.ctxt[i]);
-		dd->vnic.ctxt[i]->vnic_q_idx = i;
-	}
-
-	if (i < vinfo->num_rx_q) {
-		/*
-		 * If required amount of contexts is not
-		 * allocated successfully then remaining contexts
-		 * are released.
-		 */
-		while (i-- > dd->vnic.num_ctxt) {
-			deallocate_vnic_ctxt(dd, dd->vnic.ctxt[i]);
-			hfi1_rcd_put(dd->vnic.ctxt[i]);
-			dd->vnic.ctxt[i] = NULL;
-		}
+	if (hfi1_netdev_rx_init(dd)) {
+		dd_dev_err(dd, "Unable to initialize netdev contexts\n");
 		goto alloc_fail;
 	}
 
-	if (dd->vnic.num_ctxt != i) {
-		dd->vnic.num_ctxt = i;
-		hfi1_init_vnic_rsm(dd);
-	}
+	hfi1_init_vnic_rsm(dd);
 
-	dd->vnic.num_vports++;
+	dd->vnic_num_vports++;
 	hfi1_vnic_sdma_init(vinfo);
+
 alloc_fail:
-	if (!dd->vnic.num_vports)
+	if (!dd->vnic_num_vports)
 		hfi1_vnic_txreq_deinit(dd);
 txreq_fail:
 	mutex_unlock(&hfi1_mutex);
@@ -733,20 +533,14 @@ static int hfi1_vnic_init(struct hfi1_vnic_vport_info *vinfo)
 static void hfi1_vnic_deinit(struct hfi1_vnic_vport_info *vinfo)
 {
 	struct hfi1_devdata *dd = vinfo->dd;
-	int i;
 
 	mutex_lock(&hfi1_mutex);
-	if (--dd->vnic.num_vports == 0) {
-		for (i = 0; i < dd->vnic.num_ctxt; i++) {
-			deallocate_vnic_ctxt(dd, dd->vnic.ctxt[i]);
-			hfi1_rcd_put(dd->vnic.ctxt[i]);
-			dd->vnic.ctxt[i] = NULL;
-		}
+	if (--dd->vnic_num_vports == 0) {
 		hfi1_deinit_vnic_rsm(dd);
-		dd->vnic.num_ctxt = 0;
 		hfi1_vnic_txreq_deinit(dd);
 	}
 	mutex_unlock(&hfi1_mutex);
+	hfi1_netdev_rx_destroy(dd);
 }
 
 static void hfi1_vnic_set_vesw_id(struct net_device *netdev, int id)
@@ -815,14 +609,15 @@ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
 
 	size = sizeof(struct opa_vnic_rdma_netdev) + sizeof(*vinfo);
 	netdev = alloc_netdev_mqs(size, name, name_assign_type, setup,
-				  dd->num_sdma, dd->num_netdev_contexts);
+				  chip_sdma_engines(dd),
+				  dd->num_netdev_contexts);
 	if (!netdev)
 		return ERR_PTR(-ENOMEM);
 
 	rn = netdev_priv(netdev);
 	vinfo = opa_vnic_dev_priv(netdev);
 	vinfo->dd = dd;
-	vinfo->num_tx_q = dd->num_sdma;
+	vinfo->num_tx_q = chip_sdma_engines(dd);
 	vinfo->num_rx_q = dd->num_netdev_contexts;
 	vinfo->netdev = netdev;
 	rn->free_rdma_netdev = hfi1_vnic_free_rn;
@@ -841,7 +636,6 @@ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
 		rxq->idx = i;
 		rxq->vinfo = vinfo;
 		rxq->netdev = netdev;
-		netif_napi_add(netdev, &rxq->napi, hfi1_vnic_napi, 64);
 	}
 
 	rc = hfi1_vnic_init(vinfo);



* [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (11 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 12/16] IB/hfi1: Activate the " Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-19  0:42   ` Jason Gunthorpe
  2020-02-19 13:41   ` Erez Shitrit
  2020-02-10 13:19 ` [PATCH for-next 14/16] IB/hfi1: Add packet histogram trace event Dennis Dalessandro
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro, Gary Leshner,
	Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

When in connected mode ipoib sent broadcast pings which exceeded the mtu
size for broadcast addresses.

Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
mcast mtu size.

The RDMA netdev uses this value to determine if the skb length is too long
for the mtu specified and if it is, drops the packet and logs an error
about the errant packet.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |    2 ++
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    1 +
 drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    3 +++
 3 files changed, 6 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 5c1cf68..ddb896f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1906,6 +1906,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
 {
 	struct ipoib_dev_priv *priv = ipoib_priv(ndev);
 	int rc;
+	struct rdma_netdev *rn = netdev_priv(ndev);
 
 	if (priv->parent) {
 		ipoib_child_init(ndev);
@@ -1918,6 +1919,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
 	/* MTU will be reset when mcast join happens */
 	ndev->mtu = IPOIB_UD_MTU(priv->max_ib_mtu);
 	priv->mcast_mtu = priv->admin_mtu = ndev->mtu;
+	rn->mtu = priv->mcast_mtu;
 	ndev->max_mtu = IPOIB_CM_MTU;
 
 	ndev->neigh_priv_len = sizeof(struct ipoib_neigh);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 7166ee9b..3d5f6b8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -246,6 +246,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		if (priv->mcast_mtu == priv->admin_mtu)
 			priv->admin_mtu = IPOIB_UD_MTU(mtu);
 		priv->mcast_mtu = IPOIB_UD_MTU(mtu);
+		rn->mtu = priv->mcast_mtu;
 
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		spin_unlock_irq(&priv->lock);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
index 8ac8e18..3086560 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
@@ -97,6 +97,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
 {
 	struct net_device *ndev = priv->dev;
 	int result;
+	struct rdma_netdev *rn = netdev_priv(ndev);
 
 	ASSERT_RTNL();
 
@@ -117,6 +118,8 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
 		goto out_early;
 	}
 
+	rn->mtu = priv->mcast_mtu;
+
 	priv->parent = ppriv->dev;
 	priv->pkey = pkey;
 	priv->child_type = type;



* [PATCH for-next 14/16] IB/hfi1: Add packet histogram trace event
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (12 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:19 ` [PATCH for-next 15/16] IB/ipoib: Add capability to switch between datagram and connected mode Dennis Dalessandro
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Grzegorz Andrejczuk, Kaike Wan

From: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>

Add a simple trace event that takes the context number and builds a simple
histogram to print the distribution of packets between contexts.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Grzegorz Andrejczuk <grzegorz.andrejczuk@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/driver.c      |    1 +
 drivers/infiniband/hw/hfi1/trace.c       |   32 ++++++++++++++++++++++++++++++
 drivers/infiniband/hw/hfi1/trace_ctxts.h |   11 +++++++++-
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index 60ff6de..a40701a 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1706,6 +1706,7 @@ static void hfi1_ipoib_ib_rcv(struct hfi1_packet *packet)
 		goto drop_no_nd;
 
 	trace_input_ibhdr(rcd->dd, packet, !!(rhf_dc_info(packet->rhf)));
+	trace_ctxt_rsm_hist(rcd->ctxt);
 
 	/* handle congestion notifications */
 	do_work = hfi1_may_ecn(packet);
diff --git a/drivers/infiniband/hw/hfi1/trace.c b/drivers/infiniband/hw/hfi1/trace.c
index c8a9988..b219ea9 100644
--- a/drivers/infiniband/hw/hfi1/trace.c
+++ b/drivers/infiniband/hw/hfi1/trace.c
@@ -520,6 +520,38 @@ u16 hfi1_trace_get_tid_idx(u32 ent)
 	return EXP_TID_GET(ent, IDX);
 }
 
+struct hfi1_ctxt_hist {
+	atomic_t count;
+	atomic_t data[255];
+};
+
+struct hfi1_ctxt_hist hist = {
+	.count = ATOMIC_INIT(0)
+};
+
+const char *hfi1_trace_print_rsm_hist(struct trace_seq *p, unsigned int ctxt)
+{
+	int i, len = ARRAY_SIZE(hist.data);
+	const char *ret = trace_seq_buffer_ptr(p);
+	unsigned long packet_count = atomic_fetch_inc(&hist.count);
+
+	trace_seq_printf(p, "packet[%lu]", packet_count);
+	for (i = 0; i < len; ++i) {
+		unsigned long val;
+		atomic_t *count = &hist.data[i];
+
+		if (ctxt == i)
+			val = atomic_fetch_inc(count);
+		else
+			val = atomic_read(count);
+
+		if (val)
+			trace_seq_printf(p, "(%d:%lu)", i, val);
+	}
+	trace_seq_putc(p, 0);
+	return ret;
+}
+
 __hfi1_trace_fn(AFFINITY);
 __hfi1_trace_fn(PKT);
 __hfi1_trace_fn(PROC);
diff --git a/drivers/infiniband/hw/hfi1/trace_ctxts.h b/drivers/infiniband/hw/hfi1/trace_ctxts.h
index b5fc5c6..d8c168d 100644
--- a/drivers/infiniband/hw/hfi1/trace_ctxts.h
+++ b/drivers/infiniband/hw/hfi1/trace_ctxts.h
@@ -1,5 +1,5 @@
 /*
-* Copyright(c) 2015, 2016 Intel Corporation.
+* Copyright(c) 2015 - 2020 Intel Corporation.
 *
 * This file is provided under a dual BSD/GPLv2 license.  When using or
 * redistributing this file, you may do so under either license.
@@ -138,6 +138,15 @@
 		      )
 );
 
+const char *hfi1_trace_print_rsm_hist(struct trace_seq *p, unsigned int ctxt);
+TRACE_EVENT(ctxt_rsm_hist,
+	    TP_PROTO(unsigned int ctxt),
+	    TP_ARGS(ctxt),
+	    TP_STRUCT__entry(__field(unsigned int, ctxt)),
+	    TP_fast_assign(__entry->ctxt = ctxt;),
+	    TP_printk("%s", hfi1_trace_print_rsm_hist(p, __entry->ctxt))
+);
+
 #endif /* __HFI1_TRACE_CTXTS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH for-next 15/16] IB/ipoib: Add capability to switch between datagram and connected mode
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (13 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 14/16] IB/hfi1: Add packet histogram trace event Dennis Dalessandro
@ 2020-02-10 13:19 ` Dennis Dalessandro
  2020-02-10 13:20 ` [PATCH for-next 16/16] IB/hfi1: Enable the transmit side of the datagram ipoib netdev Dennis Dalessandro
  2020-02-10 13:31 ` [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Jason Gunthorpe
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:19 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro, Gary Leshner,
	Kaike Wan

From: Gary Leshner <Gary.S.Leshner@intel.com>

This is the prerequisite modification to the ipoib ulp to allow a
rdma netdev to obtain the default ndo ops for init/uninit/open/close.

This is accomplished by setting the netdev ops field within the
callback function passed to the netdev allocation routine which
in turn was passed into the rdma netdev allocation routine.

This allows the rdma netdev to call back into the ulp to create the
resources required for connected mode operation.

Additionally, as the ulp is not re-entrant, when switching modes
the number of real tx queues is set to 1 for the connected mode.

For datagram mode the number of real tx queues is set to the
actual number of tx queues specified at the netdev's allocation.

For the internal ulp netdev the number of tx queues defaults to 1.

It is up to the rdma netdev to specify the actual number it can support.

When the driver does not support a rdma netdev for acceleration
(an -EOPNOTSUPP return code, or the verbs function for allocation is
NULL), the ipoib ulp functions are unaffected and keep using the internal
netdev allocated by the ipoib ulp.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index ddb896f..149c9b1 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -533,6 +533,7 @@ int ipoib_set_mode(struct net_device *dev, const char *buf)
 			   "will cause multicast packet drops\n");
 		netdev_update_features(dev);
 		dev_set_mtu(dev, ipoib_cm_max_mtu(dev));
+		netif_set_real_num_tx_queues(dev, 1);
 		rtnl_unlock();
 		priv->tx_wr.wr.send_flags &= ~IB_SEND_IP_CSUM;
 
@@ -544,6 +545,7 @@ int ipoib_set_mode(struct net_device *dev, const char *buf)
 		clear_bit(IPOIB_FLAG_ADMIN_CM, &priv->flags);
 		netdev_update_features(dev);
 		dev_set_mtu(dev, min(priv->mcast_mtu, dev->mtu));
+		netif_set_real_num_tx_queues(dev, dev->num_tx_queues);
 		rtnl_unlock();
 		ipoib_flush_paths(dev);
 		return (!rtnl_trylock()) ? -EBUSY : 0;
@@ -2081,9 +2083,17 @@ static int ipoib_get_vf_stats(struct net_device *dev, int vf,
 	.ndo_do_ioctl		 = ipoib_ioctl,
 };
 
+static const struct net_device_ops ipoib_netdev_default_pf = {
+	.ndo_init		 = ipoib_dev_init_default,
+	.ndo_uninit		 = ipoib_dev_uninit_default,
+	.ndo_open		 = ipoib_ib_dev_open_default,
+	.ndo_stop		 = ipoib_ib_dev_stop_default,
+};
+
 void ipoib_setup_common(struct net_device *dev)
 {
 	dev->header_ops		 = &ipoib_header_ops;
+	dev->netdev_ops          = &ipoib_netdev_default_pf;
 
 	ipoib_set_ethtool_ops(dev);
 
@@ -2133,13 +2143,6 @@ static void ipoib_build_priv(struct net_device *dev)
 	INIT_DELAYED_WORK(&priv->neigh_reap_task, ipoib_reap_neigh);
 }
 
-static const struct net_device_ops ipoib_netdev_default_pf = {
-	.ndo_init		 = ipoib_dev_init_default,
-	.ndo_uninit		 = ipoib_dev_uninit_default,
-	.ndo_open		 = ipoib_ib_dev_open_default,
-	.ndo_stop		 = ipoib_ib_dev_stop_default,
-};
-
 static struct net_device *ipoib_alloc_netdev(struct ib_device *hca, u8 port,
 					     const char *name)
 {
@@ -2177,7 +2180,6 @@ int ipoib_intf_init(struct ib_device *hca, u8 port, const char *name,
 		if (rc != -EOPNOTSUPP)
 			goto out;
 
-		dev->netdev_ops = &ipoib_netdev_default_pf;
 		rn->send = ipoib_send;
 		rn->attach_mcast = ipoib_mcast_attach;
 		rn->detach_mcast = ipoib_mcast_detach;



* [PATCH for-next 16/16] IB/hfi1: Enable the transmit side of the datagram ipoib netdev
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (14 preceding siblings ...)
  2020-02-10 13:19 ` [PATCH for-next 15/16] IB/ipoib: Add capability to switch between datagram and connected mode Dennis Dalessandro
@ 2020-02-10 13:20 ` Dennis Dalessandro
  2020-02-10 13:31 ` [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Jason Gunthorpe
  16 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 13:20 UTC (permalink / raw)
  To: jgg, dledford
  Cc: linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Piotr Stankiewicz, Kaike Wan

From: Piotr Stankiewicz <piotr.stankiewicz@intel.com>

This patch hooks up the transmit side of the datagram netdev to ipoib by
setting the rdma_netdev_get_params function for the hfi1 ib_device_ops
structure. It also enables the receiving side by adding the AIP capability
into the default capabilities.

Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
Signed-off-by: Piotr Stankiewicz <piotr.stankiewicz@intel.com>
Signed-off-by: Kaike Wan <kaike.wan@intel.com>
Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
---
 drivers/infiniband/hw/hfi1/common.h |    1 +
 drivers/infiniband/hw/hfi1/verbs.c  |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/drivers/infiniband/hw/hfi1/common.h b/drivers/infiniband/hw/hfi1/common.h
index 6062545..ff423e54 100644
--- a/drivers/infiniband/hw/hfi1/common.h
+++ b/drivers/infiniband/hw/hfi1/common.h
@@ -160,6 +160,7 @@
 				 HFI1_CAP_PKEY_CHECK |			\
 				 HFI1_CAP_MULTI_PKT_EGR |		\
 				 HFI1_CAP_EXTENDED_PSN |		\
+				 HFI1_CAP_AIP |				\
 				 ((HFI1_CAP_HDRSUPP |			\
 				   HFI1_CAP_MULTI_PKT_EGR |		\
 				   HFI1_CAP_STATIC_RATE_CTRL |		\
diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 04f0206..d7621c2 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -66,6 +66,7 @@
 #include "vnic.h"
 #include "fault.h"
 #include "affinity.h"
+#include "ipoib.h"
 
 static unsigned int hfi1_lkey_table_size = 16;
 module_param_named(lkey_table_size, hfi1_lkey_table_size, uint,
@@ -1793,6 +1794,7 @@ static int get_hw_stats(struct ib_device *ibdev, struct rdma_hw_stats *stats,
 	.modify_device = modify_device,
 	/* keep process mad in the driver */
 	.process_mad = hfi1_process_mad,
+	.rdma_netdev_get_params = hfi1_ipoib_rn_get_params,
 };
 
 /**



* Re: [PATCH for-next 00/16] New hfi1 feature: Accelerated IP
  2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
                   ` (15 preceding siblings ...)
  2020-02-10 13:20 ` [PATCH for-next 16/16] IB/hfi1: Enable the transmit side of the datagram ipoib netdev Dennis Dalessandro
@ 2020-02-10 13:31 ` Jason Gunthorpe
  2020-02-10 17:36   ` Dennis Dalessandro
  16 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-02-10 13:31 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: dledford, linux-rdma

On Mon, Feb 10, 2020 at 08:18:05AM -0500, Dennis Dalessandro wrote:
> This patch series is an accelerated ipoib using the rdma netdev mechanism
> already present in ipoib. A new device capability bit,
> IB_DEVICE_RDMA_NETDEV_OPA, triggers ipoib to create a datagram QP using the
> IB_QP_CREATE_NETDEV_USE.
> 
> The highlights include:
> - Sharing send and receive resources with VNIC
> - Allows for switching between connected mode and datagram mode

There is still value in connected mode?

> - Increases the maximum datagram MTU for opa devices to 10k
> 
> The same spreading capability exploited by VNIC is used here to vary
> the receive context that receives the packet.
> 
> The patches are fully bisectable and stepwise implement the capability.

This is a lot of code to send without a performance
justification.. What is it? Is it worth while?

> Gary Leshner (6):
>       IB/hfi1: Add functions to transmit datagram ipoib packets
>       IB/hfi1: Add the transmit side of a datagram ipoib RDMA netdev
>       IB/hfi1: Remove module parameter for KDETH qpns
>       IB/{rdmavt,hfi1}: Implement creation of accelerated UD QPs
>       IB/{hfi1,ipoib,rdma}: Broadcast ping sent packets which exceeded mtu size
>       IB/ipoib: Add capability to switch between datagram and connected mode
> 
> Grzegorz Andrejczuk (7):
>       IB/hfi1: RSM rules for AIP
>       IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts
>       IB/hfi1: Add functions to receive accelerated ipoib packets
>       IB/hfi1: Add interrupt handler functions for accelerated ipoib
>       IB/hfi1: Add rx functions for dummy netdev

This dummy netdev thing seemed very strange

Jason


* Re: [PATCH for-next 00/16] New hfi1 feature: Accelerated IP
  2020-02-10 13:31 ` [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Jason Gunthorpe
@ 2020-02-10 17:36   ` Dennis Dalessandro
  2020-02-10 18:32     ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-10 17:36 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, linux-rdma

On 2/10/2020 8:31 AM, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 08:18:05AM -0500, Dennis Dalessandro wrote:
>> This patch series is an accelerated ipoib using the rdma netdev mechanism
>> already present in ipoib. A new device capability bit,
>> IB_DEVICE_RDMA_NETDEV_OPA, triggers ipoib to create a datagram QP using the
>> IB_QP_CREATE_NETDEV_USE.
>>
>> The highlights include:
>> - Sharing send and receive resources with VNIC
>> - Allows for switching between connected mode and datagram mode
> 
> There is still value in connected mode?

It's really a compatibility thing. If someone wants to change modes that 
will work. There won't be any benefit to connected mode though. The goal 
is just to not break.

> 
>> - Increases the maximum datagram MTU for opa devices to 10k
>>
>> The same spreading capability exploited by VNIC is used here to vary
>> the receive context that receives the packet.
>>
>> The patches are fully bisectable and stepwise implement the capability.
> 
> This is alot of code to send without a performance
> justification.. What is it? Is it worth while?

It avoids the scalability problem of connected mode, the number of QPs. 
Incoming packets are spread into multiple receive contexts increasing 
parallelism. The MTU is increased to allow 10K. It also reduces/removes 
the verbs TX overhead by allowing packets to be sent through the SDMA 
engines directly.

>> Gary Leshner (6):
>>        IB/hfi1: Add functions to transmit datagram ipoib packets
>>        IB/hfi1: Add the transmit side of a datagram ipoib RDMA netdev
>>        IB/hfi1: Remove module parameter for KDETH qpns
>>        IB/{rdmavt,hfi1}: Implement creation of accelerated UD QPs
>>        IB/{hfi1,ipoib,rdma}: Broadcast ping sent packets which exceeded mtu size
>>        IB/ipoib: Add capability to switch between datagram and connected mode
>>
>> Grzegorz Andrejczuk (7):
>>        IB/hfi1: RSM rules for AIP
>>        IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts
>>        IB/hfi1: Add functions to receive accelerated ipoib packets
>>        IB/hfi1: Add interrupt handler functions for accelerated ipoib
>>        IB/hfi1: Add rx functions for dummy netdev
> 
> This dummy netdev thing seemed very strange

One of the existing uses of a dummy netdev seems to be to tie multiple 
hardware interfaces together. We are using a similar concept for two 
software interfaces, those being VNIC and AIP. The dummy netdev here 
will own the receiving resources, which are shared.
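
As a rough sketch (the names are illustrative and the poll function
hfi1_netdev_rx_napi is a stand-in, not the exact hfi1 code), this follows
the common kernel idiom of a dummy netdev that exists only to host the
shared NAPI/rx contexts:

    /* sketch: dummy netdev owning the rx contexts shared by VNIC and AIP */
    struct net_device *dummy = &rx->rx_napi_dev;        /* hypothetical field */
    int i;

    init_dummy_netdev(dummy);
    for (i = 0; i < dd->num_netdev_contexts; i++) {
            struct hfi1_netdev_rxq *rxq = &rx->rxq[i];   /* hypothetical type */

            netif_napi_add(dummy, &rxq->napi, hfi1_netdev_rx_napi, 64);
            napi_enable(&rxq->napi);
    }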


-Denny


* Re: [PATCH for-next 00/16] New hfi1 feature: Accelerated IP
  2020-02-10 17:36   ` Dennis Dalessandro
@ 2020-02-10 18:32     ` Jason Gunthorpe
  2020-02-11 21:58       ` Dennis Dalessandro
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-02-10 18:32 UTC (permalink / raw)
  To: Dennis Dalessandro; +Cc: dledford, linux-rdma

On Mon, Feb 10, 2020 at 12:36:02PM -0500, Dennis Dalessandro wrote:
> On 2/10/2020 8:31 AM, Jason Gunthorpe wrote:
> > On Mon, Feb 10, 2020 at 08:18:05AM -0500, Dennis Dalessandro wrote:
> > > This patch series is an accelerated ipoib using the rdma netdev mechanism
> > > already present in ipoib. A new device capability bit,
> > > IB_DEVICE_RDMA_NETDEV_OPA, triggers ipoib to create a datagram QP using the
> > > IB_QP_CREATE_NETDEV_USE.
> > > 
> > > The highlights include:
> > > - Sharing send and receive resources with VNIC
> > > - Allows for switching between connected mode and datagram mode
> > 
> > There is still value in connected mode?
> 
> It's really a compatibility thing. If someone wants to change modes that
> will work. There won't be any benefit to connected mode though. The goal is
> just to not break.

I am a bit confused by this.. I thought the mlx5 implementation
already could select connected mode?

Why were core ipoib changes needed?

> > > The patches are fully bisectable and stepwise implement the capability.
> > 
> > This is a lot of code to send without a performance
> > justification.. What is it? Is it worth while?
> 
> It avoids the scalability problem of connected mode, the number of QPs.
> Incoming packets are spread into multiple receive contexts increasing
> parallelism. The MTU is increased to allow 10K. It also reduces/removes the
> verbs TX overhead by allowing packets to be sent through the SDMA engines
> directly.

No numbers to share?

Jason


* Re: [PATCH for-next 00/16] New hfi1 feature: Accelerated IP
  2020-02-10 18:32     ` Jason Gunthorpe
@ 2020-02-11 21:58       ` Dennis Dalessandro
  0 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-11 21:58 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: dledford, linux-rdma

On 2/10/2020 1:32 PM, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 12:36:02PM -0500, Dennis Dalessandro wrote:
>> On 2/10/2020 8:31 AM, Jason Gunthorpe wrote:
>>> On Mon, Feb 10, 2020 at 08:18:05AM -0500, Dennis Dalessandro wrote:
>>>> This patch series is an accelerated ipoib using the rdma netdev mechanism
>>>> already present in ipoib. A new device capability bit,
>>>> IB_DEVICE_RDMA_NETDEV_OPA, triggers ipoib to create a datagram QP using the
>>>> IB_QP_CREATE_NETDEV_USE.
>>>>
>>>> The highlights include:
>>>> - Sharing send and receive resources with VNIC
>>>> - Allows for switching between connected mode and datagram mode
>>>
>>> There is still value in connected mode?
>>
>> It's really a compatibility thing. If someone wants to change modes that
>> will work. There won't be any benefit to connected mode though. The goal is
>> just to not break.
> 
> I am a bit confused by this.. I thought the mlx5 implementation
> already could select connected mode?
> 
> Why were core ipoib changes needed?

I don't think so; patch 15/16 seemed to be necessary to get connected 
mode to work with the rdma netdev.

> 
>>>> The patches are fully bisectable and stepwise implement the capability.
>>>
>>> This is a lot of code to send without a performance
>>> justification.. What is it? Is it worth while?
>>
>> It avoids the scalability problem of connected mode, the number of QPs.
>> Incoming packets are spread into multiple receive contexts increasing
>> parallelism. The MTU is increased to allow 10K. It also reduces/removes the
>> verbs TX overhead by allowing packets to be sent through the SDMA engines
>> directly.
> 
> No numbers to share?

No numbers directly, but I can say that AIP enables line-rate performance 
between two nodes with Datagram Mode; it also provides IPoFabric latency 
improvements relative to standard Datagram Mode without AIP.

-Denny


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-10 13:19 ` [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size Dennis Dalessandro
@ 2020-02-19  0:42   ` Jason Gunthorpe
  2020-02-21 19:40     ` Dennis Dalessandro
  2020-02-19 13:41   ` Erez Shitrit
  1 sibling, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-02-19  0:42 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: dledford, linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Gary Leshner, Kaike Wan

On Mon, Feb 10, 2020 at 08:19:44AM -0500, Dennis Dalessandro wrote:
> From: Gary Leshner <Gary.S.Leshner@intel.com>
> 
> When in connected mode ipoib sent broadcast pings which exceeded the mtu
> size for broadcast addresses.
>
> Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
> mcast mtu size.
> 
> The RDMA netdev uses this value to determine if the skb length is too long
> for the mtu specified and if it is, drops the packet and logs an error
> about the errant packet.

I'm confused by this comment, connected mode is not able to use
rdma_netdev, for various technical reasons, I thought?

Is this somehow running a rdma_netdev concurrently with connected
mode? How?

Jason


* Re: [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit
  2020-02-10 13:18 ` [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit Dennis Dalessandro
@ 2020-02-19 11:01   ` Erez Shitrit
  2020-02-21 19:40     ` Dennis Dalessandro
  0 siblings, 1 reply; 29+ messages in thread
From: Erez Shitrit @ 2020-02-19 11:01 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Jason Gunthorpe, Doug Ledford, Mike Marciniszyn,
	RDMA mailing list, Sadanand Warrier, Kaike Wan

On Mon, Feb 10, 2020 at 3:18 PM Dennis Dalessandro
<dennis.dalessandro@intel.com> wrote:
>
> From: Sadanand Warrier <sadanand.warrier@intel.com>
>
> Currently the ipoib UD mtu is restricted to 4K bytes. Remove this
> limitation so that the IPOIB module can potentially use an MTU (in UD
> mode) that is bounded by the MTU of the underlying device. A field is
> added to the ib_port_attr structure to indicate the maximum physical
> MTU the underlying device supports.
>
> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Reviewed-by: Mike Marciniszyn <mike.marcinisyzn@intel.com>
> Signed-off-by: Sadanand Warrier <sadanand.warrier@intel.com>
> Signed-off-by: Kaike Wan <kaike.wan@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/hw/hfi1/qp.c                |   18 +------
>  drivers/infiniband/hw/hfi1/verbs.c             |    2 +
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |    5 ++
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   11 ++--
>  include/rdma/ib_verbs.h                        |   60 ++++++++++++++++++++++++
>  include/rdma/opa_port_info.h                   |   10 ----
>  6 files changed, 74 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/infiniband/hw/hfi1/qp.c b/drivers/infiniband/hw/hfi1/qp.c
> index f8e733a..0c2ae9f 100644
> --- a/drivers/infiniband/hw/hfi1/qp.c
> +++ b/drivers/infiniband/hw/hfi1/qp.c
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright(c) 2015 - 2019 Intel Corporation.
> + * Copyright(c) 2015 - 2020 Intel Corporation.
>   *
>   * This file is provided under a dual BSD/GPLv2 license.  When using or
>   * redistributing this file, you may do so under either license.
> @@ -186,15 +186,6 @@ static void flush_iowait(struct rvt_qp *qp)
>         write_sequnlock_irqrestore(lock, flags);
>  }
>
> -static inline int opa_mtu_enum_to_int(int mtu)
> -{
> -       switch (mtu) {
> -       case OPA_MTU_8192:  return 8192;
> -       case OPA_MTU_10240: return 10240;
> -       default:            return -1;
> -       }
> -}
> -
>  /**
>   * This function is what we would push to the core layer if we wanted to be a
>   * "first class citizen".  Instead we hide this here and rely on Verbs ULPs
> @@ -202,15 +193,10 @@ static inline int opa_mtu_enum_to_int(int mtu)
>   */
>  static inline int verbs_mtu_enum_to_int(struct ib_device *dev, enum ib_mtu mtu)
>  {
> -       int val;
> -
>         /* Constraining 10KB packets to 8KB packets */
>         if (mtu == (enum ib_mtu)OPA_MTU_10240)
>                 mtu = OPA_MTU_8192;
> -       val = opa_mtu_enum_to_int((int)mtu);
> -       if (val > 0)
> -               return val;
> -       return ib_mtu_enum_to_int(mtu);
> +       return opa_mtu_enum_to_int((enum opa_mtu)mtu);
>  }
>
>  int hfi1_check_modify_qp(struct rvt_qp *qp, struct ib_qp_attr *attr,
> diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
> index 8061e93..04f0206 100644
> --- a/drivers/infiniband/hw/hfi1/verbs.c
> +++ b/drivers/infiniband/hw/hfi1/verbs.c
> @@ -1437,6 +1437,8 @@ static int query_port(struct rvt_dev_info *rdi, u8 port_num,
>                                       4096 : hfi1_max_mtu), IB_MTU_4096);
>         props->active_mtu = !valid_ib_mtu(ppd->ibmtu) ? props->max_mtu :
>                 mtu_to_enum(ppd->ibmtu, IB_MTU_4096);
> +       props->phys_mtu = HFI1_CAP_IS_KSET(AIP) ? hfi1_max_mtu :
> +                               ib_mtu_enum_to_int(props->max_mtu);
>
>         return 0;
>  }
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index e5f438a..5c1cf68 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1862,7 +1862,10 @@ static int ipoib_parent_init(struct net_device *ndev)
>                         priv->port);
>                 return result;
>         }
> -       priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
> +       if (rdma_core_cap_opa_port(priv->ca, priv->port))

Why not use something like you did in rdma_mtu_enum_to_int, in
order to hide all the "_opa_" stuff from IPoIB?

> +               priv->max_ib_mtu = attr.phys_mtu;
> +       else
> +               priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
>
>         result = ib_query_pkey(priv->ca, priv->port, 0, &priv->pkey);
>         if (result) {
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index b9e9562..7166ee9b 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -218,6 +218,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>         struct rdma_ah_attr av;
>         int ret;
>         int set_qkey = 0;
> +       int mtu;
>
>         mcast->mcmember = *mcmember;
>
> @@ -240,13 +241,11 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>                 priv->broadcast->mcmember.flow_label = mcmember->flow_label;
>                 priv->broadcast->mcmember.hop_limit = mcmember->hop_limit;
>                 /* assume if the admin and the mcast are the same both can be changed */
> +               mtu = rdma_mtu_enum_to_int(priv->ca,  priv->port,
> +                                          priv->broadcast->mcmember.mtu);
>                 if (priv->mcast_mtu == priv->admin_mtu)
> -                       priv->admin_mtu =
> -                       priv->mcast_mtu =
> -                       IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
> -               else
> -                       priv->mcast_mtu =
> -                       IPOIB_UD_MTU(ib_mtu_enum_to_int(priv->broadcast->mcmember.mtu));
> +                       priv->admin_mtu = IPOIB_UD_MTU(mtu);
> +               priv->mcast_mtu = IPOIB_UD_MTU(mtu);
>
>                 priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
>                 spin_unlock_irq(&priv->lock);
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 8b805ab3..6ab343f 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -462,6 +462,11 @@ enum ib_mtu {
>         IB_MTU_4096 = 5
>  };
>
> +enum opa_mtu {
> +       OPA_MTU_8192 = 6,
> +       OPA_MTU_10240 = 7
> +};
> +
>  static inline int ib_mtu_enum_to_int(enum ib_mtu mtu)
>  {
>         switch (mtu) {
> @@ -488,6 +493,28 @@ static inline enum ib_mtu ib_mtu_int_to_enum(int mtu)
>                 return IB_MTU_256;
>  }
>
> +static inline int opa_mtu_enum_to_int(enum opa_mtu mtu)
> +{
> +       switch (mtu) {
> +       case OPA_MTU_8192:
> +               return 8192;
> +       case OPA_MTU_10240:
> +               return 10240;
> +       default:
> +               return(ib_mtu_enum_to_int((enum ib_mtu)mtu));
> +       }
> +}
> +
> +static inline enum opa_mtu opa_mtu_int_to_enum(int mtu)
> +{
> +       if (mtu >= 10240)
> +               return OPA_MTU_10240;
> +       else if (mtu >= 8192)
> +               return OPA_MTU_8192;
> +       else
> +               return ((enum opa_mtu)ib_mtu_int_to_enum(mtu));
> +}
> +
>  enum ib_port_state {
>         IB_PORT_NOP             = 0,
>         IB_PORT_DOWN            = 1,
> @@ -651,6 +678,7 @@ struct ib_port_attr {
>         enum ib_port_state      state;
>         enum ib_mtu             max_mtu;
>         enum ib_mtu             active_mtu;
> +       u32                     phys_mtu;
>         int                     gid_tbl_len;
>         unsigned int            ip_gids:1;
>         /* This is the value from PortInfo CapabilityMask, defined by IBA */
> @@ -3356,6 +3384,38 @@ static inline unsigned int rdma_find_pg_bit(unsigned long addr,
>         return __fls(pgsz);
>  }
>
> +/**
> + * rdma_core_cap_opa_port - Return whether the RDMA Port is OPA or not.
> + * @device: Device
> + * @port_num: 1 based Port number
> + *
> + * Return true if port is an Intel OPA port , false if not
> + */
> +static inline bool rdma_core_cap_opa_port(struct ib_device *device,
> +                                         u32 port_num)
> +{
> +       return (device->port_data[port_num].immutable.core_cap_flags &
> +               RDMA_CORE_PORT_INTEL_OPA) == RDMA_CORE_PORT_INTEL_OPA;
> +}
> +
> +/**
> + * rdma_mtu_enum_to_int - Return the mtu of the port as an integer value.
> + * @device: Device
> + * @port_num: Port number
> + * @mtu: enum value of MTU
> + *
> + * Return the MTU size supported by the port as an integer value. Will return
> + * -1 if enum value of mtu is not supported.
> + */
> +static inline int rdma_mtu_enum_to_int(struct ib_device *device, u8 port,
> +                                      int mtu)
> +{
> +       if (rdma_core_cap_opa_port(device, port))
> +               return opa_mtu_enum_to_int((enum opa_mtu)mtu);
> +       else
> +               return ib_mtu_enum_to_int((enum ib_mtu)mtu);
> +}
> +
>  int ib_set_vf_link_state(struct ib_device *device, int vf, u8 port,
>                          int state);
>  int ib_get_vf_config(struct ib_device *device, int vf, u8 port,
> diff --git a/include/rdma/opa_port_info.h b/include/rdma/opa_port_info.h
> index bdbfe25..0d9e6d7 100644
> --- a/include/rdma/opa_port_info.h
> +++ b/include/rdma/opa_port_info.h
> @@ -1,5 +1,5 @@
>  /*
> - * Copyright (c) 2014-2017 Intel Corporation.  All rights reserved.
> + * Copyright (c) 2014-2020 Intel Corporation.  All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
>   * licenses.  You may choose to be licensed under the terms of the GNU
> @@ -139,14 +139,6 @@
>  #define OPA_CAP_MASK3_IsVLMarkerSupported         (1 << 1)
>  #define OPA_CAP_MASK3_IsVLrSupported              (1 << 0)
>
> -/**
> - * new MTU values
> - */
> -enum {
> -       OPA_MTU_8192  = 6,
> -       OPA_MTU_10240 = 7,
> -};
> -
>  enum {
>         OPA_PORT_PHYS_CONF_DISCONNECTED = 0,
>         OPA_PORT_PHYS_CONF_STANDARD     = 1,
>


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-10 13:19 ` [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size Dennis Dalessandro
  2020-02-19  0:42   ` Jason Gunthorpe
@ 2020-02-19 13:41   ` Erez Shitrit
  2020-02-21 19:40     ` Dennis Dalessandro
  1 sibling, 1 reply; 29+ messages in thread
From: Erez Shitrit @ 2020-02-19 13:41 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: Jason Gunthorpe, Doug Ledford, RDMA mailing list,
	Mike Marciniszyn, Dennis Dalessandro, Gary Leshner, Kaike Wan

On Mon, Feb 10, 2020 at 3:19 PM Dennis Dalessandro
<dennis.dalessandro@intel.com> wrote:
>
> From: Gary Leshner <Gary.S.Leshner@intel.com>
>
> When in connected mode ipoib sent broadcast pings which exceeded the mtu
> size for broadcast addresses.

But this broadcast is done via the UD QP and not via the connected mode;
please explain.

>
> Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
> mcast mtu size.
>
> The RDMA netdev uses this value to determine if the skb length is too long
> for the mtu specified and if it is, drops the packet and logs an error
> about the errant packet.
>
> Reviewed-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Reviewed-by: Dennis Dalessandro <dennis.alessandro@intel.com>
> Signed-off-by: Gary Leshner <Gary.S.Leshner@intel.com>
> Signed-off-by: Kaike Wan <kaike.wan@intel.com>
> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com>
> ---
>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |    2 ++
>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c |    1 +
>  drivers/infiniband/ulp/ipoib/ipoib_vlan.c      |    3 +++
>  3 files changed, 6 insertions(+)
>
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> index 5c1cf68..ddb896f 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
> @@ -1906,6 +1906,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
>  {
>         struct ipoib_dev_priv *priv = ipoib_priv(ndev);
>         int rc;
> +       struct rdma_netdev *rn = netdev_priv(ndev);
>
>         if (priv->parent) {
>                 ipoib_child_init(ndev);
> @@ -1918,6 +1919,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
>         /* MTU will be reset when mcast join happens */
>         ndev->mtu = IPOIB_UD_MTU(priv->max_ib_mtu);
>         priv->mcast_mtu = priv->admin_mtu = ndev->mtu;
> +       rn->mtu = priv->mcast_mtu;

If this is something specific for your lower driver (opa_vnic etc.)
you don't need to do that here; you can use the rn->clnt_priv member
in order to get the mcast_mtu.

>         ndev->max_mtu = IPOIB_CM_MTU;
>
>         ndev->neigh_priv_len = sizeof(struct ipoib_neigh);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> index 7166ee9b..3d5f6b8 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
> @@ -246,6 +246,7 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
>                 if (priv->mcast_mtu == priv->admin_mtu)
>                         priv->admin_mtu = IPOIB_UD_MTU(mtu);
>                 priv->mcast_mtu = IPOIB_UD_MTU(mtu);
> +               rn->mtu = priv->mcast_mtu;
>
>                 priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
>                 spin_unlock_irq(&priv->lock);
> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> index 8ac8e18..3086560 100644
> --- a/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> +++ b/drivers/infiniband/ulp/ipoib/ipoib_vlan.c
> @@ -97,6 +97,7 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
>  {
>         struct net_device *ndev = priv->dev;
>         int result;
> +       struct rdma_netdev *rn = netdev_priv(ndev);
>
>         ASSERT_RTNL();
>
> @@ -117,6 +118,8 @@ int __ipoib_vlan_add(struct ipoib_dev_priv *ppriv, struct ipoib_dev_priv *priv,
>                 goto out_early;
>         }
>
> +       rn->mtu = priv->mcast_mtu;
> +
>         priv->parent = ppriv->dev;
>         priv->pkey = pkey;
>         priv->child_type = type;
>


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-19  0:42   ` Jason Gunthorpe
@ 2020-02-21 19:40     ` Dennis Dalessandro
  2020-02-21 23:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-21 19:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Gary Leshner, Kaike Wan

On 2/18/2020 7:42 PM, Jason Gunthorpe wrote:
> On Mon, Feb 10, 2020 at 08:19:44AM -0500, Dennis Dalessandro wrote:
>> From: Gary Leshner <Gary.S.Leshner@intel.com>
>>
>> When in connected mode ipoib sent broadcast pings which exceeded the mtu
>> size for broadcast addresses.
>>
>> Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
>> mcast mtu size.
>>
>> The RDMA netdev uses this value to determine if the skb length is too long
>> for the mtu specified and if it is, drops the packet and logs an error
>> about the errant packet.
> 
> I'm confused by this comment, connected mode is not able to use
> rdma_netdev, for various technical reasons, I thought?
> 
> Is this somehow running a rdma_netdev concurrently with connected
> mode? How?

No, not concurrently. When ipoib is in connected mode, a broadcast 
request, something like:

ping -s 2017 -i 0.001 -c 10 -M do -I ib0 -b 192.168.0.255

will be sent down from user space to ipoib. At an mcast_mtu of 2048, the 
max payload size is 2016 (2048 minus the 28 bytes of IPv4 + ICMP headers 
and the 4-byte IPoIB encapsulation header). If AIP is not being used then 
the datagram send function (ipoib_send()) does a check and drops the packet.

However, when AIP is enabled ipoib_send is of course not used and we land 
in the rn->send function, which needs to do the same check.
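
Roughly, the check needed in the rn->send path looks like this (a sketch
with illustrative names and counters, not the exact hfi1 code):

    /* sketch: drop skbs longer than the mcast mtu ipoib advertised in rn->mtu */
    if (unlikely(skb->len > rn->mtu)) {
            netdev_err(dev, "packet len %u exceeds mtu %u, dropping\n",
                       skb->len, rn->mtu);
            dev->stats.tx_dropped++;
            dev_kfree_skb_any(skb);
            return NETDEV_TX_OK;
    }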

Alternatively, the mcast_mtu check could be added to ipoib so that this 
checking is performed before rn->send() is called. In that case, this 
patch will not be needed and the new rdma_netdev (or clnt_priv) field 
that shadows mcast_mtu would also not be needed.

Also, for the datagram mode case, the ping command will correctly check 
the payload size and drop the request in user space so that ipoib will 
not receive the broadcast request at all:

   ping -s 2017 -i 0.001 -c 1 -M do -I ib0 -b 192.168.0.255
   WARNING: pinging broadcast address
   PING 192.168.0.255 (192.168.0.255) from 192.168.0.1 ib0: 2017(2045) bytes of data.
   ping: local error: Message too long, mtu=2044

-Denny


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-19 13:41   ` Erez Shitrit
@ 2020-02-21 19:40     ` Dennis Dalessandro
  0 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-21 19:40 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Jason Gunthorpe, Doug Ledford, RDMA mailing list,
	Mike Marciniszyn, Dennis Dalessandro, Gary Leshner, Kaike Wan

On 2/19/2020 8:41 AM, Erez Shitrit wrote:
> On Mon, Feb 10, 2020 at 3:19 PM Dennis Dalessandro
> <dennis.dalessandro@intel.com> wrote:
>>
>> From: Gary Leshner <Gary.S.Leshner@intel.com>
>>
>> When in connected mode ipoib sent broadcast pings which exceeded the mtu
>> size for broadcast addresses.
> 
> But this broadcast is done via the UD QP and not via the connected mode,
> please explain

It still lands in the ipoib_send() function, though, which does the length 
check for the non-AIP case. With AIP we still need to do a similar check.

>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> @@ -1906,6 +1906,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
>>   {
>>          struct ipoib_dev_priv *priv = ipoib_priv(ndev);
>>          int rc;
>> +       struct rdma_netdev *rn = netdev_priv(ndev);
>>
>>          if (priv->parent) {
>>                  ipoib_child_init(ndev);
>> @@ -1918,6 +1919,7 @@ static int ipoib_ndo_init(struct net_device *ndev)
>>          /* MTU will be reset when mcast join happens */
>>          ndev->mtu = IPOIB_UD_MTU(priv->max_ib_mtu);
>>          priv->mcast_mtu = priv->admin_mtu = ndev->mtu;
>> +       rn->mtu = priv->mcast_mtu;
> 
> If this is something specific for your lower driver (opa_vnic etc.)
> you don't need to do that here, you can use the rn->clnt_priv member
> in order to get the mcast_mtu

That's probably doable, sure. However, this seems like something we don't 
want to hide away as a private member. The mcast MTU is a pretty central 
concept and some other lower driver may conceivably make use of it. Is 
there a reason we don't want to add it to the rn?

The other complication is that struct ipoib_dev_priv is not available to 
hfi1; it is a ULP concept and hfi1 should probably not be made aware of it.

-Denny



* Re: [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit
  2020-02-19 11:01   ` Erez Shitrit
@ 2020-02-21 19:40     ` Dennis Dalessandro
  0 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-02-21 19:40 UTC (permalink / raw)
  To: Erez Shitrit
  Cc: Jason Gunthorpe, Doug Ledford, Mike Marciniszyn,
	RDMA mailing list, Sadanand Warrier, Kaike Wan

On 2/19/2020 6:01 AM, Erez Shitrit wrote:
>> diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> index e5f438a..5c1cf68 100644
>> --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
>> @@ -1862,7 +1862,10 @@ static int ipoib_parent_init(struct net_device *ndev)
>>                          priv->port);
>>                  return result;
>>          }
>> -       priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
>> +       if (rdma_core_cap_opa_port(priv->ca, priv->port))
> 
> Why not use something like you did in rdma_mtu_enum_to_int, in
> order to hide all the "_opa_" stuff from IPoIB?

Sounds reasonable, we can do that in the v2.

-Denny


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-21 19:40     ` Dennis Dalessandro
@ 2020-02-21 23:32       ` Jason Gunthorpe
  2020-03-20 13:53         ` Dennis Dalessandro
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2020-02-21 23:32 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: dledford, linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Gary Leshner, Kaike Wan

On Fri, Feb 21, 2020 at 02:40:28PM -0500, Dennis Dalessandro wrote:
> On 2/18/2020 7:42 PM, Jason Gunthorpe wrote:
> > On Mon, Feb 10, 2020 at 08:19:44AM -0500, Dennis Dalessandro wrote:
> > > From: Gary Leshner <Gary.S.Leshner@intel.com>
> > > 
> > > When in connected mode ipoib sent broadcast pings which exceeded the mtu
> > > size for broadcast addresses.
> > > 
> > > Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
> > > mcast mtu size.
> > > 
> > > The RDMA netdev uses this value to determine if the skb length is too long
> > > for the mtu specified and if it is, drops the packet and logs an error
> > > about the errant packet.
> > 
> > I'm confused by this comment, connected mode is not able to use
> > rdma_netdev, for various technical reasons, I thought?
> > 
> > Is this somehow running a rdma_netdev concurrently with connected
> > mode? How?
> 
> No, not concurrently. When ipoib is in connected mode, a broadcast request,
> something like:
> 
> ping -s 2017 -i 0.001 -c 10 -M do -I ib0 -b 192.168.0.255
> 
> will be sent down from user space to ipoib. At an mcast_mtu of 2048, the max
> payload size is 2016 (2048 - 28 - 4). If AIP is not being used then the
> datagram send function (ipoib_send()) does a check and drops the packet.
> 
> However when AIP is enabled ipoib_send is of course not used and we land in
> rn->send function. Which needs to do the same check.

You just contradicted yourself: the first sentence was 'not
concurrently' and here you say we have connected mode turned on and
yet a packet is delivered to AIP, so what do you mean?

What I mean is if you can do connected mode you don't have a
rdma_netdev and you can't do AIP.

How are things in connected mode and a rdma_netdev is available?

Jason


* Re: [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size
  2020-02-21 23:32       ` Jason Gunthorpe
@ 2020-03-20 13:53         ` Dennis Dalessandro
  0 siblings, 0 replies; 29+ messages in thread
From: Dennis Dalessandro @ 2020-03-20 13:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: dledford, linux-rdma, Mike Marciniszyn, Dennis Dalessandro,
	Gary Leshner, Kaike Wan

On 2/21/2020 6:32 PM, Jason Gunthorpe wrote:
> On Fri, Feb 21, 2020 at 02:40:28PM -0500, Dennis Dalessandro wrote:
>> On 2/18/2020 7:42 PM, Jason Gunthorpe wrote:
>>> On Mon, Feb 10, 2020 at 08:19:44AM -0500, Dennis Dalessandro wrote:
>>>> From: Gary Leshner <Gary.S.Leshner@intel.com>
>>>>
>>>> When in connected mode ipoib sent broadcast pings which exceeded the mtu
>>>> size for broadcast addresses.
>>>>
>>>> Add an mtu attribute to the rdma_netdev structure which ipoib sets to its
>>>> mcast mtu size.
>>>>
>>>> The RDMA netdev uses this value to determine if the skb length is too long
>>>> for the mtu specified and if it is, drops the packet and logs an error
>>>> about the errant packet.
>>>
>>> I'm confused by this comment, connected mode is not able to use
>>> rdma_netdev, for various technical reasons, I thought?
>>>
>>> Is this somehow running a rdma_netdev concurrently with connected
>>> mode? How?
>>
>> No, not concurrently. When ipoib is in connected mode, a broadcast request,
>> something like:
>>
>> ping -s 2017 -i 0.001 -c 10 -M do -I ib0 -b 192.168.0.255
>>
>> will be sent down from user space to ipoib. At an mcast_mtu of 2048, the max
>> payload size is 2016 (2048 - 28 - 4). If AIP is not being used then the
>> datagram send function (ipoib_send()) does a check and drops the packet.
>>
>> However when AIP is enabled ipoib_send is of course not used and we land in
>> rn->send function. Which needs to do the same check.
> 
> You just contradicted yourself: the first sentence was 'not
> concurrently' and here you say we have connected mode turned on and
> yet a packet is delivered to AIP, so what do you mean?

AIP provides a rdma_netdev (rn) that overloads the rn inside ipoib. When 
the broadcast skb is passed down from user space, even in connected mode, 
the skb will be forwarded to the rn to send out.

> What I mean is if you can do connected mode you don't have a
> rdma_netdev and you can't do AIP.

The rdma_netdev is always present, regardless of the ipoib mode.

> How are things in connected mode and a rdma_netdev is available?

So we don't only overload the rn for datagram mode; we do it for connected 
mode as well.

The rdma_netdev is set up when ipoib first finds the port, not when the 
mode is switched through sysfs. Therefore, it has to be there always, 
even in connected mode.

In hfi1_ipoib_setup_rn() (the setup function for rdma_netdev), we set:
     rn->send = hfi1_ipoib_send

We also keep the default netdev_ops and overload it with our netdev_ops 
to set up/tear down resources during netdev init/uninit/open/close:

     priv->netdev_ops = netdev->netdev_ops;
     netdev->netdev_ops = &hfi1_ipoib_netdev_ops;
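
Putting it together, a minimal sketch of that setup callback (illustrative
only; hfi1_ipoib_priv() is assumed here as a helper to reach our private
data, and the real function does more):

    static int hfi1_ipoib_setup_rn(struct ib_device *device, u8 port_num,
                                   struct net_device *netdev, void *param)
    {
            struct hfi1_ipoib_dev_priv *priv = hfi1_ipoib_priv(netdev);
            struct rdma_netdev *rn = netdev_priv(netdev);

            rn->send = hfi1_ipoib_send;
            /* keep the ULP defaults so our ndo hooks can chain to them */
            priv->netdev_ops = netdev->netdev_ops;
            netdev->netdev_ops = &hfi1_ipoib_netdev_ops;

            return 0;
    }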

-Denny




Thread overview: 29+ messages
2020-02-10 13:18 [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 01/16] IB/hfi1: Add accelerated IP capability bit Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 02/16] IB/hfi1: Add functions to transmit datagram ipoib packets Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 03/16] IB/hfi1: Add the transmit side of a datagram ipoib RDMA netdev Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 04/16] IB/hfi1: Remove module parameter for KDETH qpns Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 05/16] IB/{rdmavt, hfi1}: Implement creation of accelerated UD QPs Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 06/16] IB/hfi1: RSM rules for AIP Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 07/16] IB/ipoib: Increase ipoib Datagram mode MTU's upper limit Dennis Dalessandro
2020-02-19 11:01   ` Erez Shitrit
2020-02-21 19:40     ` Dennis Dalessandro
2020-02-10 13:18 ` [PATCH for-next 08/16] IB/hfi1: Rename num_vnic_contexts as num_netdev_contexts Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 09/16] IB/hfi1: Add functions to receive accelerated ipoib packets Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 10/16] IB/hfi1: Add interrupt handler functions for accelerated ipoib Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 11/16] IB/hfi1: Add rx functions for dummy netdev Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 12/16] IB/hfi1: Activate the " Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 13/16] IB/{hfi1, ipoib, rdma}: Broadcast ping sent packets which exceeded mtu size Dennis Dalessandro
2020-02-19  0:42   ` Jason Gunthorpe
2020-02-21 19:40     ` Dennis Dalessandro
2020-02-21 23:32       ` Jason Gunthorpe
2020-03-20 13:53         ` Dennis Dalessandro
2020-02-19 13:41   ` Erez Shitrit
2020-02-21 19:40     ` Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 14/16] IB/hfi1: Add packet histogram trace event Dennis Dalessandro
2020-02-10 13:19 ` [PATCH for-next 15/16] IB/ipoib: Add capability to switch between datagram and connected mode Dennis Dalessandro
2020-02-10 13:20 ` [PATCH for-next 16/16] IB/hfi1: Enable the transmit side of the datagram ipoib netdev Dennis Dalessandro
2020-02-10 13:31 ` [PATCH for-next 00/16] New hfi1 feature: Accelerated IP Jason Gunthorpe
2020-02-10 17:36   ` Dennis Dalessandro
2020-02-10 18:32     ` Jason Gunthorpe
2020-02-11 21:58       ` Dennis Dalessandro
