* [net-next PATCH v3 0/3] e1000 XDP implementation
@ 2016-09-12 22:13 ` John Fastabend
  0 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

This patch series implements XDP on e1000. A few comments below:

The XDP_TX path does not violate BQL in this series because we increment
and decrement the BQL counters correctly. When a TX queue carries both stack
traffic and XDP traffic, I believe the user-configured BQL limits should still
be honored. If users do not want BQL they can always disable it or raise the
limits.
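The byte accounting above can be sketched as a toy user-space model that mirrors the roles of the kernel's netdev_sent_queue()/netdev_tx_completed_queue() calls; the struct and function names here are illustrative, not the kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of BQL-style byte accounting on a TX queue: the driver
 * reports bytes queued at xmit time and bytes completed at clean-irq
 * time; the queue is over-limit when in-flight bytes exceed the
 * configured limit. */
struct toy_bql {
	unsigned int limit;     /* user-configured byte limit */
	unsigned int queued;    /* bytes handed to hardware */
	unsigned int completed; /* bytes reported done by hardware */
};

static void toy_sent_queue(struct toy_bql *q, unsigned int bytes)
{
	q->queued += bytes;     /* xmit path, skb or XDP alike */
}

static void toy_completed_queue(struct toy_bql *q, unsigned int bytes)
{
	q->completed += bytes;  /* TX clean irq path */
}

static bool toy_over_limit(const struct toy_bql *q)
{
	return q->queued - q->completed > q->limit;
}
```

The point of the sketch is that the accounting needs only a byte count per completed buffer, which is why the series can feed XDP frames through the same counters as skb traffic.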


I left the XDP program attached to the adapter structure because in the e1000
case the program is effectively global: we only run a single queue. Pushing the
program into the rx_ring would just create a bunch of ring[0] references in the
code while adding little, in my opinion. This is really just a style issue
though. Notice I did push the bundle logic onto the queue, but I view the
bundling and RX ring buffers as so closely linked that it was worth it, and it
only cost a couple of extra ring[0] lines.


XDP_TX will drop packets if any errors occur while attempting to push onto the
TX descriptor ring. This seems to me to be the most sensible thing to do, and
it keeps e1000 in line with the existing mlx XDP drivers. This can always be
extended if consensus is reached later.
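The drop-on-error policy can be sketched with a toy ring model, assuming the same "at least 2 unused descriptors" check the series uses; names are illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the XDP_TX error policy: a push onto the TX descriptor
 * ring either succeeds or the frame is silently dropped and a counter
 * is bumped. */
struct toy_tx_ring {
	unsigned int size;    /* total descriptors */
	unsigned int used;    /* descriptors in flight */
	unsigned int dropped; /* frames dropped on ring-full */
};

/* Mirror the driver's "need at least 2 unused descriptors" check. */
static bool toy_xdp_tx(struct toy_tx_ring *ring)
{
	if (ring->size - ring->used < 2) {
		ring->dropped++;
		return false; /* XDP_TX drops on error */
	}
	ring->used++;
	return true;
}
```

Dropping rather than back-pressuring keeps the RX path lock-free and matches the behavior of the existing mlx XDP drivers mentioned above.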

Thanks for all the comments on v1 and v2. As always, any comments/feedback
are appreciated.

Also, the initial patch was based on a patch I took from Alexei, so I left his
Signed-off-by there even though I have mangled the code a fair amount since
then and did not give him a chance to review the result off-list.

---

Alexei Starovoitov (1):
      e1000: add initial XDP support

John Fastabend (2):
      e1000: track BQL bytes regardless of skb or not
      e1000: bundle xdp xmit routines


 drivers/net/ethernet/intel/e1000/e1000.h      |   12 +
 drivers/net/ethernet/intel/e1000/e1000_main.c |  227 ++++++++++++++++++++++++-
 2 files changed, 229 insertions(+), 10 deletions(-)


* [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-12 22:13 ` [Intel-wired-lan] " John Fastabend
@ 2016-09-12 22:13   ` John Fastabend
  -1 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

The BQL API does not reference the sk_buff nor does the driver need to
reference the sk_buff to calculate the length of a transmitted frame.
This patch removes an sk_buff reference from the xmit irq path and
also allows packets sent from XDP to use BQL.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index f42129d..62a7f8d 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
 			if (cleaned) {
 				total_tx_packets += buffer_info->segs;
 				total_tx_bytes += buffer_info->bytecount;
-				if (buffer_info->skb) {
-					bytes_compl += buffer_info->skb->len;
-					pkts_compl++;
-				}
-
+				bytes_compl += buffer_info->length;
+				pkts_compl++;
 			}
 			e1000_unmap_and_free_tx_resource(adapter, buffer_info);
 			tx_desc->upper.data = 0;


* [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 22:13 ` [Intel-wired-lan] " John Fastabend
@ 2016-09-12 22:13   ` John Fastabend
  -1 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2016-09-12 22:13 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

From: Alexei Starovoitov <ast@fb.com>

This patch adds initial support for XDP to the e1000 driver. Note that the
e1000 driver does not support page recycling in general, though this could be
added as a further improvement. The XDP_DROP case, however, does recycle;
XDP_TX and XDP_PASS do not.

e1000 only supports a single TX queue at this time, so the queue is shared
between the XDP program and the Linux stack. It is possible for an XDP
program to starve the stack in this model.

The XDP program will drop packets on XDP_TX errors, which can occur when the
TX descriptors are exhausted. This behavior is the same for both shared-queue
models like e1000 and the dedicated-TX-queue models used in multiqueue
devices, though it is perhaps more likely to occur in the shared-queue model
when both the stack and XDP are transmitting. Further refinement of the XDP
model may be possible in the future.

I tested this patch running e1000 in a VM using KVM over a tap
device.
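The verdict handling in this patch can be modeled in plain C (toy enums, not the uapi bpf.h values): XDP_PASS hands the frame to the stack, XDP_TX is attempted and treated like a drop on failure, and everything else recycles the mapped page:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy dispatch of XDP verdicts as the patch handles them on the
 * shared-queue model. Enum values are illustrative only. */
enum toy_xdp_action { TOY_XDP_DROP, TOY_XDP_PASS, TOY_XDP_TX };

enum toy_fate { TOY_TO_STACK, TOY_SENT, TOY_RECYCLED };

static enum toy_fate toy_handle(enum toy_xdp_action act, bool tx_ok)
{
	switch (act) {
	case TOY_XDP_PASS:
		return TOY_TO_STACK;   /* unmap and build an skb */
	case TOY_XDP_TX:
		if (tx_ok)
			return TOY_SENT;
		/* fall through: a failed XDP_TX is treated as a drop */
	case TOY_XDP_DROP:
	default:
		return TOY_RECYCLED;   /* keep the DMA mapping, reuse page */
	}
}
```

The deliberate fallthrough mirrors the switch in e1000_clean_jumbo_rx_irq(), where XDP_TX shares the byte/packet accounting and goto next_desc path with XDP_DROP.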

CC: William Tu <u9012063@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |    2 
 drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
 2 files changed, 175 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index d7bdea7..5cf8a0a 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -150,6 +150,7 @@ struct e1000_adapter;
  */
 struct e1000_tx_buffer {
 	struct sk_buff *skb;
+	struct page *page;
 	dma_addr_t dma;
 	unsigned long time_stamp;
 	u16 length;
@@ -279,6 +280,7 @@ struct e1000_adapter {
 			     struct e1000_rx_ring *rx_ring,
 			     int cleaned_count);
 	struct e1000_rx_ring *rx_ring;      /* One per active queue */
+	struct bpf_prog *prog;
 	struct napi_struct napi;
 
 	int num_tx_queues;
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 62a7f8d..232b927 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -32,6 +32,7 @@
 #include <linux/prefetch.h>
 #include <linux/bitops.h>
 #include <linux/if_vlan.h>
+#include <linux/bpf.h>
 
 char e1000_driver_name[] = "e1000";
 static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
@@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
 	return 0;
 }
 
+static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct bpf_prog *old_prog;
+
+	old_prog = xchg(&adapter->prog, prog);
+	if (old_prog) {
+		synchronize_net();
+		bpf_prog_put(old_prog);
+	}
+
+	if (netif_running(netdev))
+		e1000_reinit_locked(adapter);
+	else
+		e1000_reset(adapter);
+	return 0;
+}
+
+static bool e1000_xdp_attached(struct net_device *dev)
+{
+	struct e1000_adapter *priv = netdev_priv(dev);
+
+	return !!priv->prog;
+}
+
+static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	switch (xdp->command) {
+	case XDP_SETUP_PROG:
+		return e1000_xdp_set(dev, xdp->prog);
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = e1000_xdp_attached(dev);
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static const struct net_device_ops e1000_netdev_ops = {
 	.ndo_open		= e1000_open,
 	.ndo_stop		= e1000_close,
@@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
 #endif
 	.ndo_fix_features	= e1000_fix_features,
 	.ndo_set_features	= e1000_set_features,
+	.ndo_xdp		= e1000_xdp,
 };
 
 /**
@@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
 	e1000_down_and_stop(adapter);
 	e1000_release_manageability(adapter);
 
+	if (adapter->prog)
+		bpf_prog_put(adapter->prog);
+
 	unregister_netdev(netdev);
 
 	e1000_phy_hw_reset(hw);
@@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 	u32 rdlen, rctl, rxcsum;
 
-	if (adapter->netdev->mtu > ETH_DATA_LEN) {
+	if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
 		rdlen = adapter->rx_ring[0].count *
 			sizeof(struct e1000_rx_desc);
 		adapter->clean_rx = e1000_clean_jumbo_rx_irq;
@@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
 		dev_kfree_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
+	if (buffer_info->page) {
+		put_page(buffer_info->page);
+		buffer_info->page = NULL;
+	}
+
 	buffer_info->time_stamp = 0;
 	/* buffer_info must be completely set up in the transmit path */
 }
@@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
 	return NETDEV_TX_OK;
 }
 
+static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
+				struct e1000_rx_buffer *rx_buffer_info,
+				unsigned int len)
+{
+	struct e1000_tx_buffer *buffer_info;
+	unsigned int i = tx_ring->next_to_use;
+
+	buffer_info = &tx_ring->buffer_info[i];
+
+	buffer_info->length = len;
+	buffer_info->time_stamp = jiffies;
+	buffer_info->mapped_as_page = false;
+	buffer_info->dma = rx_buffer_info->dma;
+	buffer_info->next_to_watch = i;
+	buffer_info->page = rx_buffer_info->rxbuf.page;
+
+	tx_ring->buffer_info[i].skb = NULL;
+	tx_ring->buffer_info[i].segs = 1;
+	tx_ring->buffer_info[i].bytecount = len;
+	tx_ring->buffer_info[i].next_to_watch = i;
+
+	rx_buffer_info->rxbuf.page = NULL;
+}
+
+static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
+				 u32 len,
+				 struct net_device *netdev,
+				 struct e1000_adapter *adapter)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+	struct e1000_hw *hw = &adapter->hw;
+	struct e1000_tx_ring *tx_ring;
+
+	if (len > E1000_MAX_DATA_PER_TXD)
+		return;
+
+	/* e1000 only support a single txq at the moment so the queue is being
+	 * shared with stack. To support this requires locking to ensure the
+	 * stack and XDP are not running at the same time. Devices with
+	 * multiple queues should allocate a separate queue space.
+	 */
+	HARD_TX_LOCK(netdev, txq, smp_processor_id());
+
+	tx_ring = adapter->tx_ring;
+
+	if (E1000_DESC_UNUSED(tx_ring) < 2) {
+		HARD_TX_UNLOCK(netdev, txq);
+		return;
+	}
+
+	if (netif_xmit_frozen_or_stopped(txq)) {
+		HARD_TX_UNLOCK(netdev, txq);
+		return;
+	}
+
+	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
+	netdev_sent_queue(netdev, len);
+	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+
+	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
+	mmiowb();
+
+	HARD_TX_UNLOCK(netdev, txq);
+}
+
 #define NUM_REGS 38 /* 1 based count */
 static void e1000_regdump(struct e1000_adapter *adapter)
 {
@@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
 	return skb;
 }
 
+static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
+				 unsigned int length)
+{
+	struct xdp_buff xdp;
+	int ret;
+
+	xdp.data = data;
+	xdp.data_end = data + length;
+	ret = BPF_PROG_RUN(prog, (void *)&xdp);
+
+	return ret;
+}
+
 /**
  * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
  * @adapter: board private structure
@@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct pci_dev *pdev = adapter->pdev;
 	struct e1000_rx_desc *rx_desc, *next_rxd;
 	struct e1000_rx_buffer *buffer_info, *next_buffer;
+	struct bpf_prog *prog;
 	u32 length;
 	unsigned int i;
 	int cleaned_count = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
 
+	rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
+	prog = READ_ONCE(adapter->prog);
 	i = rx_ring->next_to_clean;
 	rx_desc = E1000_RX_DESC(*rx_ring, i);
 	buffer_info = &rx_ring->buffer_info[i];
@@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 
 		cleaned = true;
 		cleaned_count++;
+		length = le16_to_cpu(rx_desc->length);
+
+		if (prog) {
+			struct page *p = buffer_info->rxbuf.page;
+			dma_addr_t dma = buffer_info->dma;
+			int act;
+
+			if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
+				/* attached bpf disallows larger than page
+				 * packets, so this is hw error or corruption
+				 */
+				pr_info_once("%s buggy !eop\n", netdev->name);
+				break;
+			}
+			if (unlikely(rx_ring->rx_skb_top)) {
+				pr_info_once("%s ring resizing bug\n",
+					     netdev->name);
+				break;
+			}
+			dma_sync_single_for_cpu(&pdev->dev, dma,
+						length, DMA_FROM_DEVICE);
+			act = e1000_call_bpf(prog, page_address(p), length);
+			switch (act) {
+			case XDP_PASS:
+				break;
+			case XDP_TX:
+				dma_sync_single_for_device(&pdev->dev,
+							   dma,
+							   length,
+							   DMA_TO_DEVICE);
+				e1000_xmit_raw_frame(buffer_info, length,
+						     netdev, adapter);
+			case XDP_DROP:
+			default:
+				/* re-use mapped page. keep buffer_info->dma
+				 * as-is, so that e1000_alloc_jumbo_rx_buffers
+				 * only needs to put it back into rx ring
+				 */
+				total_rx_bytes += length;
+				total_rx_packets++;
+				goto next_desc;
+			}
+		}
+
 		dma_unmap_page(&pdev->dev, buffer_info->dma,
 			       adapter->rx_buffer_len, DMA_FROM_DEVICE);
 		buffer_info->dma = 0;
 
-		length = le16_to_cpu(rx_desc->length);
-
 		/* errors is only valid for DD + EOP descriptors */
 		if (unlikely((status & E1000_RXD_STAT_EOP) &&
 		    (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
@@ -4327,6 +4496,7 @@ next_desc:
 		rx_desc = next_rxd;
 		buffer_info = next_buffer;
 	}
+	rcu_read_unlock();
 	rx_ring->next_to_clean = i;
 
 	cleaned_count = E1000_DESC_UNUSED(rx_ring);


* [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
  2016-09-12 22:13 ` [Intel-wired-lan] " John Fastabend
@ 2016-09-12 22:14   ` John Fastabend
  -1 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2016-09-12 22:14 UTC (permalink / raw)
  To: bblanco, john.fastabend, alexei.starovoitov, jeffrey.t.kirsher,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

e1000 supports a single TX queue, so the queue is shared with the stack when
XDP runs the XDP_TX action. This requires taking the xmit lock to ensure we
don't corrupt the TX ring. To avoid taking and dropping the lock per packet,
this patch adds a bundling implementation that submits a bundle of packets
to the xmit routine.
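The amortization can be sketched as a toy bundler that takes the (conceptual) TX lock once per flush of up to a bundle-sized batch of frames; there is no real locking here and the names are illustrative only:

```c
#include <assert.h>

#define TOY_BUNDLE_MAX 16 /* mirrors E1000_XDP_XMIT_BUNDLE_MAX */

/* Toy model of the bundling scheme: frames are staged in a batch and
 * the TX lock is taken once per flush instead of once per frame. */
struct toy_bundler {
	unsigned int staged;     /* frames waiting in the bundle */
	unsigned int sent;       /* frames pushed to the ring */
	unsigned int lock_takes; /* how often the TX lock was taken */
};

static void toy_flush(struct toy_bundler *b)
{
	if (!b->staged)
		return;
	b->lock_takes++;       /* HARD_TX_LOCK once per bundle */
	b->sent += b->staged;  /* push every staged frame */
	b->staged = 0;         /* then kick hardware and unlock */
}

static void toy_stage(struct toy_bundler *b)
{
	b->staged++;
	if (b->staged == TOY_BUNDLE_MAX)
		toy_flush(b);  /* bundle full: flush eagerly */
}
```

With a bundle size of 16, a burst of N packets costs roughly N/16 lock acquisitions instead of N, which is the whole benefit this patch is after.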

I tested this patch running e1000 in a VM using KVM over a tap
device using pktgen to generate traffic along with 'ping -f -l 100'.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/e1000/e1000.h      |   10 +++
 drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++++++++++++++++++------
 2 files changed, 70 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
index 5cf8a0a..877b377 100644
--- a/drivers/net/ethernet/intel/e1000/e1000.h
+++ b/drivers/net/ethernet/intel/e1000/e1000.h
@@ -133,6 +133,8 @@ struct e1000_adapter;
 #define E1000_TX_QUEUE_WAKE	16
 /* How many Rx Buffers do we bundle into one write to the hardware ? */
 #define E1000_RX_BUFFER_WRITE	16 /* Must be power of 2 */
+/* How many XDP XMIT buffers to bundle into one xmit transaction */
+#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
 
 #define AUTO_ALL_MODES		0
 #define E1000_EEPROM_82544_APM	0x0004
@@ -168,6 +170,11 @@ struct e1000_rx_buffer {
 	dma_addr_t dma;
 };
 
+struct e1000_rx_buffer_bundle {
+	struct e1000_rx_buffer *buffer;
+	u32 length;
+};
+
 struct e1000_tx_ring {
 	/* pointer to the descriptor ring memory */
 	void *desc;
@@ -206,6 +213,9 @@ struct e1000_rx_ring {
 	struct e1000_rx_buffer *buffer_info;
 	struct sk_buff *rx_skb_top;
 
+	/* array of XDP buffer information structs */
+	struct e1000_rx_buffer_bundle *xdp_buffer;
+
 	/* cpu for rx queue */
 	int cpu;
 
diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 232b927..31489d4 100644
--- a/drivers/net/ethernet/intel/e1000/e1000_main.c
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
@@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct bpf_prog *old_prog;
 
+	if (!adapter->rx_ring[0].xdp_buffer) {
+		int size = sizeof(struct e1000_rx_buffer_bundle) *
+				E1000_XDP_XMIT_BUNDLE_MAX;
+
+		adapter->rx_ring[0].xdp_buffer = vzalloc(size);
+		if (!adapter->rx_ring[0].xdp_buffer)
+			return -ENOMEM;
+	}
+
 	old_prog = xchg(&adapter->prog, prog);
 	if (old_prog) {
 		synchronize_net();
@@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
 	if (adapter->prog)
 		bpf_prog_put(adapter->prog);
 
+	if (adapter->rx_ring[0].xdp_buffer)
+		vfree(adapter->rx_ring[0].xdp_buffer);
+
 	unregister_netdev(netdev);
 
 	e1000_phy_hw_reset(hw);
@@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
 
 static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
 				 u32 len,
+				 struct e1000_adapter *adapter,
 				 struct net_device *netdev,
-				 struct e1000_adapter *adapter)
+				 struct e1000_tx_ring *tx_ring)
 {
-	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
-	struct e1000_hw *hw = &adapter->hw;
-	struct e1000_tx_ring *tx_ring;
+	const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
 
 	if (len > E1000_MAX_DATA_PER_TXD)
 		return;
 
-	/* e1000 only support a single txq at the moment so the queue is being
-	 * shared with stack. To support this requires locking to ensure the
-	 * stack and XDP are not running at the same time. Devices with
-	 * multiple queues should allocate a separate queue space.
-	 */
-	HARD_TX_LOCK(netdev, txq, smp_processor_id());
-
-	tx_ring = adapter->tx_ring;
-
-	if (E1000_DESC_UNUSED(tx_ring) < 2) {
-		HARD_TX_UNLOCK(netdev, txq);
+	if (E1000_DESC_UNUSED(tx_ring) < 2)
 		return;
-	}
 
 	if (netif_xmit_frozen_or_stopped(txq))
 		return;
@@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
 	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
 	netdev_sent_queue(netdev, len);
 	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
+}
 
+static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
+				  struct net_device *netdev,
+				  struct e1000_adapter *adapter)
+{
+	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	struct e1000_hw *hw = &adapter->hw;
+	int i = 0;
+
+	/* e1000 only supports a single txq at the moment so the queue is
+	 * being shared with the stack. Supporting this requires locking to
+	 * ensure the stack and XDP are not running at the same time. Devices
+	 * with multiple queues should allocate a separate queue space.
+	 *
+	 * To amortize the locking cost, e1000 bundles the xmits and sends up
+	 * to E1000_XDP_XMIT_BUNDLE_MAX frames per lock acquisition.
+	 */
+	HARD_TX_LOCK(netdev, txq, smp_processor_id());
+
+	for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
+		e1000_xmit_raw_frame(buffer_info[i].buffer,
+				     buffer_info[i].length,
+				     adapter, netdev, tx_ring);
+		buffer_info[i].buffer = NULL;
+		buffer_info[i].length = 0;
+	}
+
+	/* kick hardware to send bundle and return control back to the stack */
 	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
 	mmiowb();
 
@@ -4284,9 +4313,10 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 	struct bpf_prog *prog;
 	u32 length;
 	unsigned int i;
-	int cleaned_count = 0;
+	int cleaned_count = 0, xdp_xmit = 0;
 	bool cleaned = false;
 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+	struct e1000_rx_buffer_bundle *xdp_bundle = rx_ring->xdp_buffer;
 
 	rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
 	prog = READ_ONCE(adapter->prog);
@@ -4341,12 +4371,13 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
 			case XDP_PASS:
 				break;
 			case XDP_TX:
+				xdp_bundle[xdp_xmit].buffer = buffer_info;
+				xdp_bundle[xdp_xmit].length = length;
 				dma_sync_single_for_device(&pdev->dev,
 							   dma,
 							   length,
 							   DMA_TO_DEVICE);
-				e1000_xmit_raw_frame(buffer_info, length,
-						     netdev, adapter);
+				xdp_xmit++;
 			case XDP_DROP:
 			default:
 				/* re-use mapped page. keep buffer_info->dma
@@ -4488,8 +4519,14 @@ next_desc:
 
 		/* return some buffers to hardware, one at a time is too slow */
 		if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
+			if (xdp_xmit)
+				e1000_xdp_xmit_bundle(xdp_bundle,
+						      netdev,
+						      adapter);
+
 			adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
 			cleaned_count = 0;
+			xdp_xmit = 0;
 		}
 
 		/* use prefetched values */
@@ -4500,8 +4537,11 @@ next_desc:
 	rx_ring->next_to_clean = i;
 
 	cleaned_count = E1000_DESC_UNUSED(rx_ring);
-	if (cleaned_count)
+	if (cleaned_count) {
+		if (xdp_xmit)
+			e1000_xdp_xmit_bundle(xdp_bundle, netdev, adapter);
 		adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
+	}
 
 	adapter->total_rx_packets += total_rx_packets;
 	adapter->total_rx_bytes += total_rx_bytes;


* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
@ 2016-09-12 22:46     ` Eric Dumazet
  -1 siblings, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2016-09-12 22:46 UTC (permalink / raw)
  To: John Fastabend
  Cc: bblanco, alexei.starovoitov, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> From: Alexei Starovoitov <ast@fb.com>

> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +				 u32 len,
> +				 struct net_device *netdev,
> +				 struct e1000_adapter *adapter)
> +{
> +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +	struct e1000_hw *hw = &adapter->hw;
> +	struct e1000_tx_ring *tx_ring;
> +
> +	if (len > E1000_MAX_DATA_PER_TXD)
> +		return;
> +
> +	/* e1000 only support a single txq at the moment so the queue is being
> +	 * shared with stack. To support this requires locking to ensure the
> +	 * stack and XDP are not running at the same time. Devices with
> +	 * multiple queues should allocate a separate queue space.
> +	 */
> +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +	tx_ring = adapter->tx_ring;
> +
> +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +		HARD_TX_UNLOCK(netdev, txq);
> +		return;
> +	}
> +
> +	if (netif_xmit_frozen_or_stopped(txq))
> +		return;
> +
> +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +	netdev_sent_queue(netdev, len);
> +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +	mmiowb();
> +
> +	HARD_TX_UNLOCK(netdev, txq);
> +}


e1000_tx_map() is full of workarounds.

Have a look at last_tx_tso for example.

               /* Workaround for Controller erratum --
                 * descriptor for non-tso packet in a linear SKB that follows a
                 * tso gets written back prematurely before the data is fully
                 * DMA'd to the controller
                 */
                if (!skb->data_len && tx_ring->last_tx_tso &&
                    !skb_is_gso(skb)) {
                        tx_ring->last_tx_tso = false;
                        size -= 4;
                }

Look, this XDP_TX thing is hard to properly implement and test on
various NIC revisions.

Without proper queue management, high prio packets in qdisc won't be sent
if NIC is under RX -> XDP_TX flood.

Sounds a horrible feature to me.


* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 22:46     ` [Intel-wired-lan] " Eric Dumazet
@ 2016-09-12 23:07       ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-12 23:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

On Mon, Sep 12, 2016 at 03:46:39PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 15:13 -0700, John Fastabend wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> 
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +				 u32 len,
> > +				 struct net_device *netdev,
> > +				 struct e1000_adapter *adapter)
> > +{
> > +	struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +	struct e1000_hw *hw = &adapter->hw;
> > +	struct e1000_tx_ring *tx_ring;
> > +
> > +	if (len > E1000_MAX_DATA_PER_TXD)
> > +		return;
> > +
> > +	/* e1000 only support a single txq at the moment so the queue is being
> > +	 * shared with stack. To support this requires locking to ensure the
> > +	 * stack and XDP are not running at the same time. Devices with
> > +	 * multiple queues should allocate a separate queue space.
> > +	 */
> > +	HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +	tx_ring = adapter->tx_ring;
> > +
> > +	if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +		HARD_TX_UNLOCK(netdev, txq);
> > +		return;
> > +	}
> > +
> > +	if (netif_xmit_frozen_or_stopped(txq))
> > +		return;
> > +
> > +	e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +	netdev_sent_queue(netdev, len);
> > +	e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +	writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +	mmiowb();
> > +
> > +	HARD_TX_UNLOCK(netdev, txq);
> > +}
> 
> 
> e1000_tx_map() is full of workarounds.
> 
> Have a look at last_tx_tso for example.
> 
>                /* Workaround for Controller erratum --
>                  * descriptor for non-tso packet in a linear SKB that follows a
>                  * tso gets written back prematurely before the data is fully
>                  * DMA'd to the controller
>                  */
>                 if (!skb->data_len && tx_ring->last_tx_tso &&
>                     !skb_is_gso(skb)) {
>                         tx_ring->last_tx_tso = false;
>                         size -= 4;
>                 }
> 
> Look, this XDP_TX thing is hard to properly implement and test on
> various NIC revisions.

my reading is that these e1k workarounds are for old versions of the nic
that don't even exist anymore. I believe none of them apply to qemu e1k.

> Without proper queue management, high prio packets in qdisc won't be sent
> if NIC is under RX -> XDP_TX flood.

yep. there are various ways to shoot yourself in the foot with xdp.
The simplest program that drops all the packets will make the box unpingable.

> Sounds a horrible feature to me.

yep. not pretty by any means.
This whole thing is only to test xdp programs under qemu, which
realistically emulates only e1k.
I simply don't see any other options to test xdp under kvm.


* Re: [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
  2016-09-12 22:14   ` [Intel-wired-lan] " John Fastabend
@ 2016-09-12 23:45     ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-12 23:45 UTC (permalink / raw)
  To: John Fastabend
  Cc: Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Mon, Sep 12, 2016 at 3:14 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
John,

It still looks to me like this will break the normal TX path. Can you
test these patches on real e1000 hardware to get performance numbers for
both the drop and forwarding cases? Also, can you run this against an
antagonistic workload over the normal stack to see if one or the other
path can be starved?

Thanks,
Tom

> Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |   10 +++
>  drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++++++++++++++++++------
>  2 files changed, 70 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 5cf8a0a..877b377 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -133,6 +133,8 @@ struct e1000_adapter;
>  #define E1000_TX_QUEUE_WAKE    16
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
> +/* How many XDP XMIT buffers to bundle into one xmit transaction */
> +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
>
>  #define AUTO_ALL_MODES         0
>  #define E1000_EEPROM_82544_APM 0x0004
> @@ -168,6 +170,11 @@ struct e1000_rx_buffer {
>         dma_addr_t dma;
>  };
>
> +struct e1000_rx_buffer_bundle {
> +       struct e1000_rx_buffer *buffer;
> +       u32 length;
> +};
> +
>  struct e1000_tx_ring {
>         /* pointer to the descriptor ring memory */
>         void *desc;
> @@ -206,6 +213,9 @@ struct e1000_rx_ring {
>         struct e1000_rx_buffer *buffer_info;
>         struct sk_buff *rx_skb_top;
>
> +       /* array of XDP buffer information structs */
> +       struct e1000_rx_buffer_bundle *xdp_buffer;
> +
>         /* cpu for rx queue */
>         int cpu;
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 232b927..31489d4 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>         struct e1000_adapter *adapter = netdev_priv(netdev);
>         struct bpf_prog *old_prog;
>
> +       if (!adapter->rx_ring[0].xdp_buffer) {
> +               int size = sizeof(struct e1000_rx_buffer_bundle) *
> +                               E1000_XDP_XMIT_BUNDLE_MAX;
> +
> +               adapter->rx_ring[0].xdp_buffer = vzalloc(size);
> +               if (!adapter->rx_ring[0].xdp_buffer)
> +                       return -ENOMEM;
> +       }
> +
>         old_prog = xchg(&adapter->prog, prog);
>         if (old_prog) {
>                 synchronize_net();
> @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
>         if (adapter->prog)
>                 bpf_prog_put(adapter->prog);
>
> +       if (adapter->rx_ring[0].xdp_buffer)
> +               vfree(adapter->rx_ring[0].xdp_buffer);
> +
>         unregister_netdev(netdev);
>
>         e1000_phy_hw_reset(hw);
> @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
>
>  static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>                                  u32 len,
> +                                struct e1000_adapter *adapter,
>                                  struct net_device *netdev,
> -                                struct e1000_adapter *adapter)
> +                                struct e1000_tx_ring *tx_ring)
>  {
> -       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> -       struct e1000_hw *hw = &adapter->hw;
> -       struct e1000_tx_ring *tx_ring;
> +       const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>
>         if (len > E1000_MAX_DATA_PER_TXD)
>                 return;
>
> -       /* e1000 only support a single txq at the moment so the queue is being
> -        * shared with stack. To support this requires locking to ensure the
> -        * stack and XDP are not running at the same time. Devices with
> -        * multiple queues should allocate a separate queue space.
> -        */
> -       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> -
> -       tx_ring = adapter->tx_ring;
> -
> -       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> -               HARD_TX_UNLOCK(netdev, txq);
> +       if (E1000_DESC_UNUSED(tx_ring) < 2)
>                 return;
> -       }
>
>         if (netif_xmit_frozen_or_stopped(txq))
>                 return;
> @@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>         e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
>         netdev_sent_queue(netdev, len);
>         e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +}
>
> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
> +                                 struct net_device *netdev,
> +                                 struct e1000_adapter *adapter)
> +{
> +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +       struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +       struct e1000_hw *hw = &adapter->hw;
> +       int i = 0;
> +
> +       /* e1000 only support a single txq at the moment so the queue is being
> +        * shared with stack. To support this requires locking to ensure the
> +        * stack and XDP are not running at the same time. Devices with
> +        * multiple queues should allocate a separate queue space.
> +        *
> +        * To amortize the locking cost e1000 bundles the xmits and send up to
> +        * E1000_XDP_XMIT_BUNDLE_MAX.
> +        */
> +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +       for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
> +               e1000_xmit_raw_frame(buffer_info[i].buffer,
> +                                    buffer_info[i].length,
> +                                    adapter, netdev, tx_ring);
> +               buffer_info[i].buffer = NULL;
> +               buffer_info[i].length = 0;
> +       }
> +
> +       /* kick hardware to send bundle and return control back to the stack */
>         writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
>         mmiowb();
>
> @@ -4284,9 +4313,10 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>         struct bpf_prog *prog;
>         u32 length;
>         unsigned int i;
> -       int cleaned_count = 0;
> +       int cleaned_count = 0, xdp_xmit = 0;
>         bool cleaned = false;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       struct e1000_rx_buffer_bundle *xdp_bundle = rx_ring->xdp_buffer;
>
>         rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
>         prog = READ_ONCE(adapter->prog);
> @@ -4341,12 +4371,13 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>                         case XDP_PASS:
>                                 break;
>                         case XDP_TX:
> +                               xdp_bundle[xdp_xmit].buffer = buffer_info;
> +                               xdp_bundle[xdp_xmit].length = length;
>                                 dma_sync_single_for_device(&pdev->dev,
>                                                            dma,
>                                                            length,
>                                                            DMA_TO_DEVICE);
> -                               e1000_xmit_raw_frame(buffer_info, length,
> -                                                    netdev, adapter);
> +                               xdp_xmit++;
>                         case XDP_DROP:
>                         default:
>                                 /* re-use mapped page. keep buffer_info->dma
> @@ -4488,8 +4519,14 @@ next_desc:
>
>                 /* return some buffers to hardware, one at a time is too slow */
>                 if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
> +                       if (xdp_xmit)
> +                               e1000_xdp_xmit_bundle(xdp_bundle,
> +                                                     netdev,
> +                                                     adapter);
> +
>                         adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
>                         cleaned_count = 0;
> +                       xdp_xmit = 0;
>                 }
>
>                 /* use prefetched values */
> @@ -4500,8 +4537,11 @@ next_desc:
>         rx_ring->next_to_clean = i;
>
>         cleaned_count = E1000_DESC_UNUSED(rx_ring);
> -       if (cleaned_count)
> +       if (cleaned_count) {
> +               if (xdp_xmit)
> +                       e1000_xdp_xmit_bundle(xdp_bundle, netdev, adapter);
>                 adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
> +       }
>
>         adapter->total_rx_packets += total_rx_packets;
>         adapter->total_rx_bytes += total_rx_bytes;
>


* [Intel-wired-lan] [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines
@ 2016-09-12 23:45     ` Tom Herbert
  0 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-12 23:45 UTC (permalink / raw)
  To: intel-wired-lan

On Mon, Sep 12, 2016 at 3:14 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> e1000 supports a single TX queue so it is being shared with the stack
> when XDP runs the XDP_TX action. This requires taking the xmit lock to
> ensure we don't corrupt the tx ring. To avoid taking and dropping the
> lock per packet this patch adds a bundling implementation to submit
> a bundle of packets to the xmit routine.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device using pktgen to generate traffic along with 'ping -f -l 100'.
>
John,

It still looks to me like this will break the normal TX path. Can you
test these patches on a real e1000 to get performance numbers for both
the drop and forwarding case. Also can you run this against an
antagonist workload over the normal stack to see if one or the other
path can be starved?

Thanks,
Tom

> Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |   10 +++
>  drivers/net/ethernet/intel/e1000/e1000_main.c |   80 +++++++++++++++++++------
>  2 files changed, 70 insertions(+), 20 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index 5cf8a0a..877b377 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -133,6 +133,8 @@ struct e1000_adapter;
>  #define E1000_TX_QUEUE_WAKE    16
>  /* How many Rx Buffers do we bundle into one write to the hardware ? */
>  #define E1000_RX_BUFFER_WRITE  16 /* Must be power of 2 */
> +/* How many XDP XMIT buffers to bundle into one xmit transaction */
> +#define E1000_XDP_XMIT_BUNDLE_MAX E1000_RX_BUFFER_WRITE
>
>  #define AUTO_ALL_MODES         0
>  #define E1000_EEPROM_82544_APM 0x0004
> @@ -168,6 +170,11 @@ struct e1000_rx_buffer {
>         dma_addr_t dma;
>  };
>
> +struct e1000_rx_buffer_bundle {
> +       struct e1000_rx_buffer *buffer;
> +       u32 length;
> +};
> +
>  struct e1000_tx_ring {
>         /* pointer to the descriptor ring memory */
>         void *desc;
> @@ -206,6 +213,9 @@ struct e1000_rx_ring {
>         struct e1000_rx_buffer *buffer_info;
>         struct sk_buff *rx_skb_top;
>
> +       /* array of XDP buffer information structs */
> +       struct e1000_rx_buffer_bundle *xdp_buffer;
> +
>         /* cpu for rx queue */
>         int cpu;
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 232b927..31489d4 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -848,6 +848,15 @@ static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
>         struct e1000_adapter *adapter = netdev_priv(netdev);
>         struct bpf_prog *old_prog;
>
> +       if (!adapter->rx_ring[0].xdp_buffer) {
> +               int size = sizeof(struct e1000_rx_buffer_bundle) *
> +                               E1000_XDP_XMIT_BUNDLE_MAX;
> +
> +               adapter->rx_ring[0].xdp_buffer = vzalloc(size);
> +               if (!adapter->rx_ring[0].xdp_buffer)
> +                       return -ENOMEM;
> +       }
> +
>         old_prog = xchg(&adapter->prog, prog);
>         if (old_prog) {
>                 synchronize_net();
> @@ -1319,6 +1328,9 @@ static void e1000_remove(struct pci_dev *pdev)
>         if (adapter->prog)
>                 bpf_prog_put(adapter->prog);
>
> +       if (adapter->rx_ring[0].xdp_buffer)
> +               vfree(adapter->rx_ring[0].xdp_buffer);
> +
>         unregister_netdev(netdev);
>
>         e1000_phy_hw_reset(hw);
> @@ -3372,29 +3384,17 @@ static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
>
>  static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>                                  u32 len,
> +                                struct e1000_adapter *adapter,
>                                  struct net_device *netdev,
> -                                struct e1000_adapter *adapter)
> +                                struct e1000_tx_ring *tx_ring)
>  {
> -       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> -       struct e1000_hw *hw = &adapter->hw;
> -       struct e1000_tx_ring *tx_ring;
> +       const struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
>
>         if (len > E1000_MAX_DATA_PER_TXD)
>                 return;
>
> -       /* e1000 only support a single txq at the moment so the queue is being
> -        * shared with stack. To support this requires locking to ensure the
> -        * stack and XDP are not running at the same time. Devices with
> -        * multiple queues should allocate a separate queue space.
> -        */
> -       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> -
> -       tx_ring = adapter->tx_ring;
> -
> -       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> -               HARD_TX_UNLOCK(netdev, txq);
> +       if (E1000_DESC_UNUSED(tx_ring) < 2)
>                 return;
> -       }
>
>         if (netif_xmit_frozen_or_stopped(txq))
>                 return;
> @@ -3402,7 +3402,36 @@ static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
>         e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
>         netdev_sent_queue(netdev, len);
>         e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +}
>
> +static void e1000_xdp_xmit_bundle(struct e1000_rx_buffer_bundle *buffer_info,
> +                                 struct net_device *netdev,
> +                                 struct e1000_adapter *adapter)
> +{
> +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +       struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +       struct e1000_hw *hw = &adapter->hw;
> +       int i = 0;
> +
> +       /* e1000 only supports a single txq at the moment so the queue is being
> +        * shared with the stack. To support this requires locking to ensure the
> +        * stack and XDP are not running at the same time. Devices with
> +        * multiple queues should allocate a separate queue space.
> +        *
> +        * To amortize the locking cost e1000 bundles the xmits and sends up to
> +        * E1000_XDP_XMIT_BUNDLE_MAX.
> +        */
> +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +       for (; i < E1000_XDP_XMIT_BUNDLE_MAX && buffer_info[i].buffer; i++) {
> +               e1000_xmit_raw_frame(buffer_info[i].buffer,
> +                                    buffer_info[i].length,
> +                                    adapter, netdev, tx_ring);
> +               buffer_info[i].buffer = NULL;
> +               buffer_info[i].length = 0;
> +       }
> +
> +       /* kick hardware to send bundle and return control back to the stack */
>         writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
>         mmiowb();
>
> @@ -4284,9 +4313,10 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>         struct bpf_prog *prog;
>         u32 length;
>         unsigned int i;
> -       int cleaned_count = 0;
> +       int cleaned_count = 0, xdp_xmit = 0;
>         bool cleaned = false;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> +       struct e1000_rx_buffer_bundle *xdp_bundle = rx_ring->xdp_buffer;
>
>         rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
>         prog = READ_ONCE(adapter->prog);
> @@ -4341,12 +4371,13 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>                         case XDP_PASS:
>                                 break;
>                         case XDP_TX:
> +                               xdp_bundle[xdp_xmit].buffer = buffer_info;
> +                               xdp_bundle[xdp_xmit].length = length;
>                                 dma_sync_single_for_device(&pdev->dev,
>                                                            dma,
>                                                            length,
>                                                            DMA_TO_DEVICE);
> -                               e1000_xmit_raw_frame(buffer_info, length,
> -                                                    netdev, adapter);
> +                               xdp_xmit++;
>                         case XDP_DROP:
>                         default:
>                                 /* re-use mapped page. keep buffer_info->dma
> @@ -4488,8 +4519,14 @@ next_desc:
>
>                 /* return some buffers to hardware, one at a time is too slow */
>                 if (unlikely(cleaned_count >= E1000_RX_BUFFER_WRITE)) {
> +                       if (xdp_xmit)
> +                               e1000_xdp_xmit_bundle(xdp_bundle,
> +                                                     netdev,
> +                                                     adapter);
> +
>                         adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
>                         cleaned_count = 0;
> +                       xdp_xmit = 0;
>                 }
>
>                 /* use prefetched values */
> @@ -4500,8 +4537,11 @@ next_desc:
>         rx_ring->next_to_clean = i;
>
>         cleaned_count = E1000_DESC_UNUSED(rx_ring);
> -       if (cleaned_count)
> +       if (cleaned_count) {
> +               if (xdp_xmit)
> +                       e1000_xdp_xmit_bundle(xdp_bundle, netdev, adapter);
>                 adapter->alloc_rx_buf(adapter, rx_ring, cleaned_count);
> +       }
>
>         adapter->total_rx_packets += total_rx_packets;
>         adapter->total_rx_bytes += total_rx_bytes;
>


* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 23:07       ` [Intel-wired-lan] " Alexei Starovoitov
@ 2016-09-12 23:46         ` Eric Dumazet
  -1 siblings, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2016-09-12 23:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:

> yep. there are various ways to shoot yourself in the foot with xdp.
> The simplest program that drops all the packets will make the box unpingable.

Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
scooter on 101 highway ;)

This XDP_TX thing was one of the XDP marketing stuff, but there is
absolutely no documentation on it, warning users about possible
limitations/outcomes.

BTW, I am not sure mlx4 implementation even works, vs BQL :

mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
but tx completion will call netdev_tx_completed_queue() -> crash

Do we have one test to validate that a XDP_TX implementation is actually
correct ?


* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 23:46         ` [Intel-wired-lan] " Eric Dumazet
@ 2016-09-13  0:03           ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-13  0:03 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>
>> yep. there are various ways to shoot yourself in the foot with xdp.
>> The simplest program that drops all the packets will make the box unpingable.
>
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
>
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.
>
> BTW, I am not sure mlx4 implementation even works, vs BQL :
>
> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash
>
> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?
>
Obviously not for e1000 :-(. We really need some real test and
performance results and analysis on the interaction between the stack
data path and XDP data path. The fact that these changes are being
passed off as something only needed for KCM is irrelevant; e1000 is a
well-deployed NIC and there's no restriction that I see that would
prevent any users from enabling this feature on real devices. So these
patches need to be tested and justified. Eric's skepticism in all this
really doesn't surprise me...

Tom



* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 23:46         ` [Intel-wired-lan] " Eric Dumazet
@ 2016-09-13  1:21           ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13  1:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: John Fastabend, bblanco, jeffrey.t.kirsher, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

On Mon, Sep 12, 2016 at 04:46:08PM -0700, Eric Dumazet wrote:
> On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> 
> > yep. there are various ways to shoot yourself in the foot with xdp.
> > The simplest program that drops all the packets will make the box unpingable.
> 
> Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> scooter on 101 highway ;)
> 
> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

that's fair critique. there is no documentation for xdp in general.

> BTW, I am not sure mlx4 implementation even works, vs BQL :

it doesn't and it shouldn't. xdp and stack use different tx queues.

> mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> but tx completion will call netdev_tx_completed_queue() -> crash

not quite. netdev_tx_completed_queue() is not called for xdp queues.

> Do we have one test to validate that a XDP_TX implementation is actually
> correct ?

it's correct in the scope that it was defined for.
There is no objective to share the same tx queue with the stack in xdp.
The queues must be absolutely separate otherwise performance will tank.

This e1k patch is really special, because the e1k has one tx queue,
but this is for debugging of the programs, so it's ok to cut corners.
The e1k xdp support doesn't need to interact nicely with stack.
It's impossible in the first place. xdp is the layer before the stack.



* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13  0:03           ` [Intel-wired-lan] " Tom Herbert
@ 2016-09-13  1:28             ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13  1:28 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >
> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> The simplest program that drops all the packets will make the box unpingable.
> >
> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> > scooter on 101 highway ;)
> >
> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> > absolutely no documentation on it, warning users about possible
> > limitations/outcomes.
> >
> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >
> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> > but tx completion will call netdev_tx_completed_queue() -> crash
> >
> > Do we have one test to validate that a XDP_TX implementation is actually
> > correct ?
> >
> Obviously not for e1000 :-(. We really need some real test and
> performance results and analysis on the interaction between the stack
> data path and XDP data path.

no. we don't need it for e1k and we cannot really do it.
<broken record mode on> this patch is for debugging of xdp programs only.

> The fact that these changes are being
> passed off as something only needed for KCM is irrelevant; e1000 is a
> well-deployed NIC and there's no restriction that I see that would
> prevent any users from enabling this feature on real devices.

e1k is not even manufactured any more. Probably the only place
where it can be found is a computer history museum.
e1000e fares slightly better, but it's a different NIC and this
patch is not about it.



* Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
@ 2016-09-13  3:00     ` Alexander Duyck
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexander Duyck @ 2016-09-13  3:00 UTC (permalink / raw)
  To: John Fastabend
  Cc: Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David Miller, Cong Wang, intel-wired-lan,
	u9012063, Netdev

On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> The BQL API does not reference the sk_buff nor does the driver need to
> reference the sk_buff to calculate the length of a transmitted frame.
> This patch removes an sk_buff reference from the xmit irq path and
> also allows packets sent from XDP to use BQL.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index f42129d..62a7f8d 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
>                         if (cleaned) {
>                                 total_tx_packets += buffer_info->segs;
>                                 total_tx_bytes += buffer_info->bytecount;
> -                               if (buffer_info->skb) {
> -                                       bytes_compl += buffer_info->skb->len;
> -                                       pkts_compl++;
> -                               }
> -
> +                               bytes_compl += buffer_info->length;
> +                               pkts_compl++;
>                         }
>                         e1000_unmap_and_free_tx_resource(adapter, buffer_info);
>                         tx_desc->upper.data = 0;

Actually it might be worth looking into why we have two different
stats for tracking bytecount and segs.  From what I can tell the
pkts_compl value is never actually used.  The function doesn't even
use the value so it is just wasted cycles.  And as far as the bytes go
the accounting would be more accurate if you were to use bytecount
instead of buffer_info->skb->len.  You would just need to update the
xmit function to use that on the other side so that they match.

- Alex



* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
@ 2016-09-13  3:42     ` Alexander Duyck
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexander Duyck @ 2016-09-13  3:42 UTC (permalink / raw)
  To: John Fastabend
  Cc: Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David Miller, Cong Wang, intel-wired-lan,
	u9012063, Netdev

On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> From: Alexei Starovoitov <ast@fb.com>
>
> This patch adds initial support for XDP on e1000 driver. Note e1000
> driver does not support page recycling in general which could be
> added as a further improvement. However XDP_DROP case will recycle.
> XDP_TX and XDP_PASS do not support recycling.
>
> e1000 only supports a single tx queue at this time so the queue
> is shared between xdp program and Linux stack. It is possible for
> an XDP program to starve the stack in this model.
>
> The XDP program will drop packets on XDP_TX errors. This can occur
> when the tx descriptors are exhausted. This behavior is the same
> for both shared queue models like e1000 and dedicated tx queue
> models used in multiqueue devices. However if both the stack and
> XDP are transmitting packets it is perhaps more likely to occur in
> the shared queue model. Further refinement to the XDP model may be
> possible in the future.
>
> I tested this patch running e1000 in a VM using KVM over a tap
> device.
>
> CC: William Tu <u9012063@gmail.com>
> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000.h      |    2
>  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
>  2 files changed, 175 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> index d7bdea7..5cf8a0a 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> @@ -150,6 +150,7 @@ struct e1000_adapter;
>   */
>  struct e1000_tx_buffer {
>         struct sk_buff *skb;
> +       struct page *page;
>         dma_addr_t dma;
>         unsigned long time_stamp;
>         u16 length;

I'm not really a huge fan of adding yet another member to this
structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
pushing it to 48 just means we lose that much more memory.  If nothing
else we may want to look at doing something like creating a union
between the skb, page, and an unsigned long.  Then you could use the
lowest bit of the address as a flag indicating if this is a skb or a
page.

> @@ -279,6 +280,7 @@ struct e1000_adapter {
>                              struct e1000_rx_ring *rx_ring,
>                              int cleaned_count);
>         struct e1000_rx_ring *rx_ring;      /* One per active queue */
> +       struct bpf_prog *prog;
>         struct napi_struct napi;
>
>         int num_tx_queues;
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index 62a7f8d..232b927 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -32,6 +32,7 @@
>  #include <linux/prefetch.h>
>  #include <linux/bitops.h>
>  #include <linux/if_vlan.h>
> +#include <linux/bpf.h>
>
>  char e1000_driver_name[] = "e1000";
>  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
>         return 0;
>  }
>
> +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> +{
> +       struct e1000_adapter *adapter = netdev_priv(netdev);
> +       struct bpf_prog *old_prog;
> +
> +       old_prog = xchg(&adapter->prog, prog);
> +       if (old_prog) {
> +               synchronize_net();
> +               bpf_prog_put(old_prog);
> +       }
> +
> +       if (netif_running(netdev))
> +               e1000_reinit_locked(adapter);
> +       else
> +               e1000_reset(adapter);

What is the point of the reset?  If the interface isn't running is
there anything in the hardware you actually need to cleanup?

> +       return 0;
> +}
> +
> +static bool e1000_xdp_attached(struct net_device *dev)
> +{
> +       struct e1000_adapter *priv = netdev_priv(dev);
> +
> +       return !!priv->prog;
> +}
> +
> +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> +{
> +       switch (xdp->command) {
> +       case XDP_SETUP_PROG:
> +               return e1000_xdp_set(dev, xdp->prog);
> +       case XDP_QUERY_PROG:
> +               xdp->prog_attached = e1000_xdp_attached(dev);
> +               return 0;
> +       default:
> +               return -EINVAL;
> +       }
> +}
> +
>  static const struct net_device_ops e1000_netdev_ops = {
>         .ndo_open               = e1000_open,
>         .ndo_stop               = e1000_close,
> @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
>  #endif
>         .ndo_fix_features       = e1000_fix_features,
>         .ndo_set_features       = e1000_set_features,
> +       .ndo_xdp                = e1000_xdp,
>  };
>
>  /**
> @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
>         e1000_down_and_stop(adapter);
>         e1000_release_manageability(adapter);
>
> +       if (adapter->prog)
> +               bpf_prog_put(adapter->prog);
> +
>         unregister_netdev(netdev);
>
>         e1000_phy_hw_reset(hw);
> @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
>         struct e1000_hw *hw = &adapter->hw;
>         u32 rdlen, rctl, rxcsum;
>
> -       if (adapter->netdev->mtu > ETH_DATA_LEN) {
> +       if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
>                 rdlen = adapter->rx_ring[0].count *
>                         sizeof(struct e1000_rx_desc);
>                 adapter->clean_rx = e1000_clean_jumbo_rx_irq;

If you are really serious about using the page-based Rx path we should
probably fix the fact that you take a pretty significant performance
penalty for turning this mode on.

> @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
>                 dev_kfree_skb_any(buffer_info->skb);
>                 buffer_info->skb = NULL;
>         }
> +       if (buffer_info->page) {
> +               put_page(buffer_info->page);
> +               buffer_info->page = NULL;
> +       }
> +
>         buffer_info->time_stamp = 0;
>         /* buffer_info must be completely set up in the transmit path */
>  }
> @@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
>         return NETDEV_TX_OK;
>  }
>
> +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
> +                               struct e1000_rx_buffer *rx_buffer_info,
> +                               unsigned int len)
> +{
> +       struct e1000_tx_buffer *buffer_info;
> +       unsigned int i = tx_ring->next_to_use;
> +
> +       buffer_info = &tx_ring->buffer_info[i];
> +
> +       buffer_info->length = len;
> +       buffer_info->time_stamp = jiffies;
> +       buffer_info->mapped_as_page = false;
> +       buffer_info->dma = rx_buffer_info->dma;
> +       buffer_info->next_to_watch = i;
> +       buffer_info->page = rx_buffer_info->rxbuf.page;
> +
> +       tx_ring->buffer_info[i].skb = NULL;
> +       tx_ring->buffer_info[i].segs = 1;
> +       tx_ring->buffer_info[i].bytecount = len;
> +       tx_ring->buffer_info[i].next_to_watch = i;
> +
> +       rx_buffer_info->rxbuf.page = NULL;
> +}
> +
> +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> +                                u32 len,
> +                                struct net_device *netdev,
> +                                struct e1000_adapter *adapter)
> +{
> +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> +       struct e1000_hw *hw = &adapter->hw;
> +       struct e1000_tx_ring *tx_ring;
> +
> +       if (len > E1000_MAX_DATA_PER_TXD)
> +               return;
> +
> +       /* e1000 only support a single txq at the moment so the queue is being
> +        * shared with stack. To support this requires locking to ensure the
> +        * stack and XDP are not running at the same time. Devices with
> +        * multiple queues should allocate a separate queue space.
> +        */
> +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> +
> +       tx_ring = adapter->tx_ring;
> +
> +       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> +               HARD_TX_UNLOCK(netdev, txq);
> +               return;
> +       }
> +
> +       if (netif_xmit_frozen_or_stopped(txq))
> +               return;
> +
> +       e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> +       netdev_sent_queue(netdev, len);
> +       e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> +
> +       writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> +       mmiowb();
> +
> +       HARD_TX_UNLOCK(netdev, txq);
> +}
> +
>  #define NUM_REGS 38 /* 1 based count */
>  static void e1000_regdump(struct e1000_adapter *adapter)
>  {
> @@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
>         return skb;
>  }
>
> +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> +                                unsigned int length)
> +{
> +       struct xdp_buff xdp;
> +       int ret;
> +
> +       xdp.data = data;
> +       xdp.data_end = data + length;
> +       ret = BPF_PROG_RUN(prog, (void *)&xdp);
> +
> +       return ret;
> +}
> +
>  /**
>   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
>   * @adapter: board private structure
> @@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>         struct pci_dev *pdev = adapter->pdev;
>         struct e1000_rx_desc *rx_desc, *next_rxd;
>         struct e1000_rx_buffer *buffer_info, *next_buffer;
> +       struct bpf_prog *prog;
>         u32 length;
>         unsigned int i;
>         int cleaned_count = 0;
>         bool cleaned = false;
>         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>
> +       rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> +       prog = READ_ONCE(adapter->prog);
>         i = rx_ring->next_to_clean;
>         rx_desc = E1000_RX_DESC(*rx_ring, i);
>         buffer_info = &rx_ring->buffer_info[i];
> @@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
>
>                 cleaned = true;
>                 cleaned_count++;
> +               length = le16_to_cpu(rx_desc->length);
> +
> +               if (prog) {
> +                       struct page *p = buffer_info->rxbuf.page;
> +                       dma_addr_t dma = buffer_info->dma;
> +                       int act;
> +
> +                       if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> +                               /* attached bpf disallows larger than page
> +                                * packets, so this is hw error or corruption
> +                                */
> +                               pr_info_once("%s buggy !eop\n", netdev->name);
> +                               break;
> +                       }
> +                       if (unlikely(rx_ring->rx_skb_top)) {
> +                               pr_info_once("%s ring resizing bug\n",
> +                                            netdev->name);
> +                               break;
> +                       }
> +                       dma_sync_single_for_cpu(&pdev->dev, dma,
> +                                               length, DMA_FROM_DEVICE);
> +                       act = e1000_call_bpf(prog, page_address(p), length);
> +                       switch (act) {
> +                       case XDP_PASS:
> +                               break;
> +                       case XDP_TX:
> +                               dma_sync_single_for_device(&pdev->dev,
> +                                                          dma,
> +                                                          length,
> +                                                          DMA_TO_DEVICE);
> +                               e1000_xmit_raw_frame(buffer_info, length,
> +                                                    netdev, adapter);

Implementing a new xmit path and clean-up routines for just this is
going to be a pain.  I'd say if we are going to do something like this
then maybe we should look at coming up with a new ndo for the xmit and
maybe push more of this into some sort of inline hook.  Duplicating
this code in every driver is going to be really expensive.

Also I just noticed there is no break statement from the xmit code
above to the drop case below.  I'd think you could overwrite the frame
data in a case where the Rx rate exceeds the Tx rate due to things like
flow control generating back pressure.
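The fall-through being pointed out can be seen in a stripped-down model of that switch; the enum values and helper below are illustrative, not the driver's, but the control flow mirrors the quoted hunk:

```c
#include <assert.h>

enum act { ACT_PASS, ACT_TX, ACT_DROP };

/* Returns 1 when the drop/recycle arm runs. With fall_through == 1 the
 * ACT_TX arm has no break, so after transmitting, control falls into the
 * drop arm and the buffer is also recycled -- the bug in the quoted code. */
static int handles_drop_arm(enum act a, int fall_through)
{
	int dropped = 0;

	switch (a) {
	case ACT_PASS:
		break;
	case ACT_TX:
		/* transmit would happen here ... */
		if (!fall_through)
			break;
		/* missing break: falls through into the drop arm */
	case ACT_DROP:
	default:
		dropped = 1;
		break;
	}
	return dropped;
}
```

With the missing break, a page handed to the Tx ring is simultaneously put back on the Rx ring, so a fast receiver can overwrite it before the NIC transmits it.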

> +                       case XDP_DROP:
> +                       default:
> +                               /* re-use mapped page. keep buffer_info->dma
> +                                * as-is, so that e1000_alloc_jumbo_rx_buffers
> +                                * only needs to put it back into rx ring
> +                                */
> +                               total_rx_bytes += length;
> +                               total_rx_packets++;
> +                               goto next_desc;
> +                       }
> +               }
> +
>                 dma_unmap_page(&pdev->dev, buffer_info->dma,
>                                adapter->rx_buffer_len, DMA_FROM_DEVICE);
>                 buffer_info->dma = 0;
>
> -               length = le16_to_cpu(rx_desc->length);
> -
>                 /* errors is only valid for DD + EOP descriptors */
>                 if (unlikely((status & E1000_RXD_STAT_EOP) &&
>                     (rx_desc->errors & E1000_RXD_ERR_FRAME_ERR_MASK))) {
> @@ -4327,6 +4496,7 @@ next_desc:
>                 rx_desc = next_rxd;
>                 buffer_info = next_buffer;
>         }
> +       rcu_read_unlock();
>         rx_ring->next_to_clean = i;
>
>         cleaned_count = E1000_DESC_UNUSED(rx_ring);
>
> _______________________________________________
> Intel-wired-lan mailing list
> Intel-wired-lan@lists.osuosl.org
> http://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-13  3:00     ` Alexander Duyck
@ 2016-09-13  4:25       ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-13  4:25 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David Miller, Cong Wang, intel-wired-lan,
	William Tu, Netdev

On Mon, Sep 12, 2016 at 8:00 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
>> The BQL API does not reference the sk_buff nor does the driver need to
>> reference the sk_buff to calculate the length of a transmitted frame.
>> This patch removes an sk_buff reference from the xmit irq path and
>> also allows packets sent from XDP to use BQL.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>> ---
>>  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> index f42129d..62a7f8d 100644
>> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
>> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
>> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
>>                         if (cleaned) {
>>                                 total_tx_packets += buffer_info->segs;
>>                                 total_tx_bytes += buffer_info->bytecount;
>> -                               if (buffer_info->skb) {
>> -                                       bytes_compl += buffer_info->skb->len;
>> -                                       pkts_compl++;
>> -                               }
>> -
>> +                               bytes_compl += buffer_info->length;
>> +                               pkts_compl++;
>>                         }
>>                         e1000_unmap_and_free_tx_resource(adapter, buffer_info);
>>                         tx_desc->upper.data = 0;
>
> Actually it might be worth looking into why we have two different
> stats for tracking bytecount and segs.  From what I can tell the
> pkts_compl value is never actually used.  The function doesn't even
> use the value so it is just wasted cycles.  And as far as the bytes go

Transmit flow steering, which I posted and is pending some testing,
uses the BQL packet count to track in-flight packets.

> the accounting would be more accurate if you were to use bytecount
> instead of buffer_info->skb->len.  You would just need to update the
> xmit function to use that on the other side so that they match.
>
> - Alex
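For context, BQL pairs a per-send update with a per-completion update; a userspace sketch of the invariant under discussion follows, with illustrative names (model_sent/model_completed stand in for netdev_sent_queue()/netdev_completed_queue(), which this code does not call):

```c
#include <assert.h>

struct bql_model {
	unsigned long sent_bytes, sent_pkts;
	unsigned long compl_bytes, compl_pkts;
};

/* Accounting at xmit time, as netdev_sent_queue() does per frame. */
static void model_sent(struct bql_model *q, unsigned int bytes)
{
	q->sent_bytes += bytes;
	q->sent_pkts++;
}

/* Accounting at clean/irq time, as netdev_completed_queue() does. The
 * point in the thread: the completion side should report the same length
 * the xmit side queued (buffer_info->length/bytecount, not skb->len), so
 * the in-flight byte and packet counts both drain back to zero. */
static void model_completed(struct bql_model *q, unsigned int bytes,
			    unsigned int pkts)
{
	q->compl_bytes += bytes;
	q->compl_pkts += pkts;
}

static unsigned long model_inflight_bytes(const struct bql_model *q)
{
	return q->sent_bytes - q->compl_bytes;
}
```

If the two sides use different length sources, in-flight bytes never reach zero and BQL's queue limit drifts.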

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-13  4:25       ` Tom Herbert
@ 2016-09-13 13:17         ` Alexander Duyck
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexander Duyck @ 2016-09-13 13:17 UTC (permalink / raw)
  To: Tom Herbert
  Cc: John Fastabend, Brenden Blanco, Alexei Starovoitov, Jeff Kirsher,
	Jesper Dangaard Brouer, David Miller, Cong Wang, intel-wired-lan,
	William Tu, Netdev

On Mon, Sep 12, 2016 at 9:25 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Mon, Sep 12, 2016 at 8:00 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
>> <john.fastabend@gmail.com> wrote:
>>> The BQL API does not reference the sk_buff nor does the driver need to
>>> reference the sk_buff to calculate the length of a transmitted frame.
>>> This patch removes an sk_buff reference from the xmit irq path and
>>> also allows packets sent from XDP to use BQL.
>>>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
>>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
>>> index f42129d..62a7f8d 100644
>>> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
>>> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
>>> @@ -3882,11 +3882,8 @@ static bool e1000_clean_tx_irq(struct e1000_adapter *adapter,
>>>                         if (cleaned) {
>>>                                 total_tx_packets += buffer_info->segs;
>>>                                 total_tx_bytes += buffer_info->bytecount;
>>> -                               if (buffer_info->skb) {
>>> -                                       bytes_compl += buffer_info->skb->len;
>>> -                                       pkts_compl++;
>>> -                               }
>>> -
>>> +                               bytes_compl += buffer_info->length;
>>> +                               pkts_compl++;
>>>                         }
>>>                         e1000_unmap_and_free_tx_resource(adapter, buffer_info);
>>>                         tx_desc->upper.data = 0;
>>
>> Actually it might be worth looking into why we have two different
>> stats for tracking bytecount and segs.  From what I can tell the
>> pkts_compl value is never actually used.  The function doesn't even
>> use the value so it is just wasted cycles.  And as far as the bytes go
>
> Transmit flow steering, which I posted and which is pending some testing,
> uses the pkt count in BQL to track inflight packets.
>
>> the accounting would be more accurate if you were to use bytecount
>> instead of buffer_info->skb->len.  You would just need to update the
>> xmit function to use that on the other side so that they match.

Okay, that makes sense.

But as I was saying, we might be better off using the segs and
bytecount values instead of the skb->len in the xmit and cleanup paths
to get more accurate accounting for the total bytes/packets coming and
going from the interface.  That way we can avoid any significant
change in behavior between TSO and GSO.

- Alex

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13  1:28             ` [Intel-wired-lan] " Alexei Starovoitov
@ 2016-09-13 16:21               ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-13 16:21 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >
>> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> The simplest program that drops all the packets will make the box unpingable.
>> >
>> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> > scooter on 101 highway ;)
>> >
>> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> > absolutely no documentation on it, warning users about possible
>> > limitations/outcomes.
>> >
>> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >
>> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >
>> > Do we have one test to validate that a XDP_TX implementation is actually
>> > correct ?
>> >
>> Obviously not for e1000 :-(. We really need some real tests and
>> performance results and analysis of the interaction between the stack
>> data path and XDP data path.
>
> no. we don't need it for e1k and we cannot really do it.
> <broken record mode on> this patch is for debugging of xdp programs only.
>
You can say "this is only for debugging" a thousand times and that
still won't justify putting bad code into the kernel. Material issues
have been raised with these patches, I have proposed a fix for one
core issue, and we have requested a lot more testing. So, please, if
you really want to move these patches forward start addressing the
concerns being raised by reviewers.

Tom

>> The fact that these changes are being
>> passed off as something only needed for KCM is irrelevant, e1000 is a
>> well deployed NIC and there's no restriction that I see that would
>> prevent any users from enabling this feature on real devices.
>
> e1k is not even manufactured any more. Probably the only place
> where it can be found is a computer history museum.
> e1000e fares slightly better, but it's a different nic and this
> patch is not about it.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13  3:42     ` Alexander Duyck
@ 2016-09-13 16:56       ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 16:56 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David Miller, Cong Wang, intel-wired-lan,
	u9012063, Netdev

On Mon, Sep 12, 2016 at 08:42:41PM -0700, Alexander Duyck wrote:
> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> >
> > This patch adds initial support for XDP on e1000 driver. Note e1000
> > driver does not support page recycling in general which could be
> > added as a further improvement. However XDP_DROP case will recycle.
> > XDP_TX and XDP_PASS do not support recycling.
> >
> > e1000 only supports a single tx queue at this time so the queue
> > is shared between xdp program and Linux stack. It is possible for
> > an XDP program to starve the stack in this model.
> >
> > The XDP program will drop packets on XDP_TX errors. This can occur
> > when the tx descriptors are exhausted. This behavior is the same
> > for both shared queue models like e1000 and dedicated tx queue
> > models used in multiqueue devices. However if both the stack and
> > XDP are transmitting packets it is perhaps more likely to occur in
> > the shared queue model. Further refinement to the XDP model may be
> > possible in the future.
> >
> > I tested this patch running e1000 in a VM using KVM over a tap
> > device.
> >
> > CC: William Tu <u9012063@gmail.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  drivers/net/ethernet/intel/e1000/e1000.h      |    2
> >  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
> >  2 files changed, 175 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> > index d7bdea7..5cf8a0a 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000.h
> > +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> > @@ -150,6 +150,7 @@ struct e1000_adapter;
> >   */
> >  struct e1000_tx_buffer {
> >         struct sk_buff *skb;
> > +       struct page *page;
> >         dma_addr_t dma;
> >         unsigned long time_stamp;
> >         u16 length;
> 
> I'm not really a huge fan of adding yet another member to this
> structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
> pushing it to 48 just means we lose that much more memory.  If nothing
> else we may wan to look at doing something like creating a union
> between the skb, page, and an unsigned long.  Then you could use the
> lowest bit of the address as a flag indicating if this is a skb or a
> page.

that's exactly what we did for mlx4_en_tx_info, since it's a real nic
where performance matters. For e1k I didn't want to complicate
the logic for no reason, since I don't see how 8 extra bytes matter here.
we will take a look at whether a union is easy to do, though.

> > @@ -279,6 +280,7 @@ struct e1000_adapter {
> >                              struct e1000_rx_ring *rx_ring,
> >                              int cleaned_count);
> >         struct e1000_rx_ring *rx_ring;      /* One per active queue */
> > +       struct bpf_prog *prog;
> >         struct napi_struct napi;
> >
> >         int num_tx_queues;
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > index 62a7f8d..232b927 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/prefetch.h>
> >  #include <linux/bitops.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/bpf.h>
> >
> >  char e1000_driver_name[] = "e1000";
> >  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> > @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
> >         return 0;
> >  }
> >
> > +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> > +{
> > +       struct e1000_adapter *adapter = netdev_priv(netdev);
> > +       struct bpf_prog *old_prog;
> > +
> > +       old_prog = xchg(&adapter->prog, prog);
> > +       if (old_prog) {
> > +               synchronize_net();
> > +               bpf_prog_put(old_prog);
> > +       }
> > +
> > +       if (netif_running(netdev))
> > +               e1000_reinit_locked(adapter);
> > +       else
> > +               e1000_reset(adapter);
> 
> What is the point of the reset?  If the interface isn't running is
> there anything in the hardware you actually need to cleanup?

The above is inspired by e1000_set_features().
I'm assuming it was done there for a reason and the same applies here.

> > +       return 0;
> > +}
> > +
> > +static bool e1000_xdp_attached(struct net_device *dev)
> > +{
> > +       struct e1000_adapter *priv = netdev_priv(dev);
> > +
> > +       return !!priv->prog;
> > +}
> > +
> > +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> > +{
> > +       switch (xdp->command) {
> > +       case XDP_SETUP_PROG:
> > +               return e1000_xdp_set(dev, xdp->prog);
> > +       case XDP_QUERY_PROG:
> > +               xdp->prog_attached = e1000_xdp_attached(dev);
> > +               return 0;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> >  static const struct net_device_ops e1000_netdev_ops = {
> >         .ndo_open               = e1000_open,
> >         .ndo_stop               = e1000_close,
> > @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
> >  #endif
> >         .ndo_fix_features       = e1000_fix_features,
> >         .ndo_set_features       = e1000_set_features,
> > +       .ndo_xdp                = e1000_xdp,
> >  };
> >
> >  /**
> > @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
> >         e1000_down_and_stop(adapter);
> >         e1000_release_manageability(adapter);
> >
> > +       if (adapter->prog)
> > +               bpf_prog_put(adapter->prog);
> > +
> >         unregister_netdev(netdev);
> >
> >         e1000_phy_hw_reset(hw);
> > @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
> >         struct e1000_hw *hw = &adapter->hw;
> >         u32 rdlen, rctl, rxcsum;
> >
> > -       if (adapter->netdev->mtu > ETH_DATA_LEN) {
> > +       if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
> >                 rdlen = adapter->rx_ring[0].count *
> >                         sizeof(struct e1000_rx_desc);
> >                 adapter->clean_rx = e1000_clean_jumbo_rx_irq;
> 
> If you are really serious about using the page based Rx path we should
> probably fix the fact that you take a pretty significant
> performance penalty for turning this mode on.

Not sure I follow. KVM tests show that xdp_drop/tx is faster even with
full page alloc and no page recycling.
xdp is only operational in page-per-packet mode, which dictates the above approach.

> > @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
> >                 dev_kfree_skb_any(buffer_info->skb);
> >                 buffer_info->skb = NULL;
> >         }
> > +       if (buffer_info->page) {
> > +               put_page(buffer_info->page);
> > +               buffer_info->page = NULL;
> > +       }
> > +
> >         buffer_info->time_stamp = 0;
> >         /* buffer_info must be completely set up in the transmit path */
> >  }
> > @@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
> >         return NETDEV_TX_OK;
> >  }
> >
> > +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
> > +                               struct e1000_rx_buffer *rx_buffer_info,
> > +                               unsigned int len)
> > +{
> > +       struct e1000_tx_buffer *buffer_info;
> > +       unsigned int i = tx_ring->next_to_use;
> > +
> > +       buffer_info = &tx_ring->buffer_info[i];
> > +
> > +       buffer_info->length = len;
> > +       buffer_info->time_stamp = jiffies;
> > +       buffer_info->mapped_as_page = false;
> > +       buffer_info->dma = rx_buffer_info->dma;
> > +       buffer_info->next_to_watch = i;
> > +       buffer_info->page = rx_buffer_info->rxbuf.page;
> > +
> > +       tx_ring->buffer_info[i].skb = NULL;
> > +       tx_ring->buffer_info[i].segs = 1;
> > +       tx_ring->buffer_info[i].bytecount = len;
> > +       tx_ring->buffer_info[i].next_to_watch = i;
> > +
> > +       rx_buffer_info->rxbuf.page = NULL;
> > +}
> > +
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +                                u32 len,
> > +                                struct net_device *netdev,
> > +                                struct e1000_adapter *adapter)
> > +{
> > +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +       struct e1000_hw *hw = &adapter->hw;
> > +       struct e1000_tx_ring *tx_ring;
> > +
> > +       if (len > E1000_MAX_DATA_PER_TXD)
> > +               return;
> > +
> > +       /* e1000 only support a single txq at the moment so the queue is being
> > +        * shared with stack. To support this requires locking to ensure the
> > +        * stack and XDP are not running at the same time. Devices with
> > +        * multiple queues should allocate a separate queue space.
> > +        */
> > +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +       tx_ring = adapter->tx_ring;
> > +
> > +       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +               HARD_TX_UNLOCK(netdev, txq);
> > +               return;
> > +       }
> > +
> > +       if (netif_xmit_frozen_or_stopped(txq))
> > +               return;
> > +
> > +       e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +       netdev_sent_queue(netdev, len);
> > +       e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +       writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +       mmiowb();
> > +
> > +       HARD_TX_UNLOCK(netdev, txq);
> > +}
> > +
> >  #define NUM_REGS 38 /* 1 based count */
> >  static void e1000_regdump(struct e1000_adapter *adapter)
> >  {
> > @@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
> >         return skb;
> >  }
> >
> > +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> > +                                unsigned int length)
> > +{
> > +       struct xdp_buff xdp;
> > +       int ret;
> > +
> > +       xdp.data = data;
> > +       xdp.data_end = data + length;
> > +       ret = BPF_PROG_RUN(prog, (void *)&xdp);
> > +
> > +       return ret;
> > +}
> > +
> >  /**
> >   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
> >   * @adapter: board private structure
> > @@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >         struct pci_dev *pdev = adapter->pdev;
> >         struct e1000_rx_desc *rx_desc, *next_rxd;
> >         struct e1000_rx_buffer *buffer_info, *next_buffer;
> > +       struct bpf_prog *prog;
> >         u32 length;
> >         unsigned int i;
> >         int cleaned_count = 0;
> >         bool cleaned = false;
> >         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> >
> > +       rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> > +       prog = READ_ONCE(adapter->prog);
> >         i = rx_ring->next_to_clean;
> >         rx_desc = E1000_RX_DESC(*rx_ring, i);
> >         buffer_info = &rx_ring->buffer_info[i];
> > @@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >
> >                 cleaned = true;
> >                 cleaned_count++;
> > +               length = le16_to_cpu(rx_desc->length);
> > +
> > +               if (prog) {
> > +                       struct page *p = buffer_info->rxbuf.page;
> > +                       dma_addr_t dma = buffer_info->dma;
> > +                       int act;
> > +
> > +                       if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> > +                               /* attached bpf disallows larger than page
> > +                                * packets, so this is hw error or corruption
> > +                                */
> > +                               pr_info_once("%s buggy !eop\n", netdev->name);
> > +                               break;
> > +                       }
> > +                       if (unlikely(rx_ring->rx_skb_top)) {
> > +                               pr_info_once("%s ring resizing bug\n",
> > +                                            netdev->name);
> > +                               break;
> > +                       }
> > +                       dma_sync_single_for_cpu(&pdev->dev, dma,
> > +                                               length, DMA_FROM_DEVICE);
> > +                       act = e1000_call_bpf(prog, page_address(p), length);
> > +                       switch (act) {
> > +                       case XDP_PASS:
> > +                               break;
> > +                       case XDP_TX:
> > +                               dma_sync_single_for_device(&pdev->dev,
> > +                                                          dma,
> > +                                                          length,
> > +                                                          DMA_TO_DEVICE);
> > +                               e1000_xmit_raw_frame(buffer_info, length,
> > +                                                    netdev, adapter);
> 
> Implementing a new xmit path and clean-up routines for just this is
> going to be a pain.  I'd say if we are going to do something like this
> then maybe we should look at coming up with a new ndo for the xmit and
> maybe push more of this into some sort of inline hook.  Duplicating
> this code in every driver is going to be really expensive.

we will have common xdp routines when more drivers implement it.
I would expect several pieces of mlx4/mlx5 can be made common.
An ndo approach won't work here, since the stack doesn't call into this part.
The xdp logic stays within the driver, and dma/page things are driver specific too.
It's pretty much like trying to make a common struct between e1000_tx_buffer
and mlx4_en_tx_info, which is quite difficult.

> Also I just noticed there is no break statement from the xmit code
> above to the drop that below.  I'd think you could overwrite the frame
> data in a case where the Rx exceeds the Tx due to things like flow
> control generating back pressure.

you mean pause frames on the TX side?
Isn't the 'if (E1000_DESC_UNUSED(tx_ring) < 2)' check in e1000_xmit_raw_frame() enough?

Thanks for the review!

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
@ 2016-09-13 16:56       ` Alexei Starovoitov
  0 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 16:56 UTC (permalink / raw)
  To: intel-wired-lan

On Mon, Sep 12, 2016 at 08:42:41PM -0700, Alexander Duyck wrote:
> On Mon, Sep 12, 2016 at 3:13 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > From: Alexei Starovoitov <ast@fb.com>
> >
> > This patch adds initial support for XDP on e1000 driver. Note e1000
> > driver does not support page recycling in general which could be
> > added as a further improvement. However XDP_DROP case will recycle.
> > XDP_TX and XDP_PASS do not support recycling.
> >
> > e1000 only supports a single tx queue at this time so the queue
> > is shared between xdp program and Linux stack. It is possible for
> > an XDP program to starve the stack in this model.
> >
> > The XDP program will drop packets on XDP_TX errors. This can occur
> > when the tx descriptors are exhausted. This behavior is the same
> > for both shared queue models like e1000 and dedicated tx queue
> > models used in multiqueue devices. However if both the stack and
> > XDP are transmitting packets it is perhaps more likely to occur in
> > the shared queue model. Further refinement to the XDP model may be
> > possible in the future.
> >
> > I tested this patch running e1000 in a VM using KVM over a tap
> > device.
> >
> > CC: William Tu <u9012063@gmail.com>
> > Signed-off-by: Alexei Starovoitov <ast@kernel.org>
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  drivers/net/ethernet/intel/e1000/e1000.h      |    2
> >  drivers/net/ethernet/intel/e1000/e1000_main.c |  176 +++++++++++++++++++++++++
> >  2 files changed, 175 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000.h b/drivers/net/ethernet/intel/e1000/e1000.h
> > index d7bdea7..5cf8a0a 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000.h
> > +++ b/drivers/net/ethernet/intel/e1000/e1000.h
> > @@ -150,6 +150,7 @@ struct e1000_adapter;
> >   */
> >  struct e1000_tx_buffer {
> >         struct sk_buff *skb;
> > +       struct page *page;
> >         dma_addr_t dma;
> >         unsigned long time_stamp;
> >         u16 length;
> 
> I'm not really a huge fan of adding yet another member to this
> structure.  Each e1000_tx_buffer is already pretty big at 40 bytes,
> pushing it to 48 just means we lose that much more memory.  If nothing
> else we may wan to look at doing something like creating a union
> between the skb, page, and an unsigned long.  Then you could use the
> lowest bit of the address as a flag indicating if this is a skb or a
> page.

that's exactly what we did for mlx4_en_tx_info, since it's a real nic
where performance matters. For e1k I didn't want to complicate
the logic for no reason, since I don't see how 8 extra bytes matter here.
we will take a look at whether a union is easy to do, though.

> > @@ -279,6 +280,7 @@ struct e1000_adapter {
> >                              struct e1000_rx_ring *rx_ring,
> >                              int cleaned_count);
> >         struct e1000_rx_ring *rx_ring;      /* One per active queue */
> > +       struct bpf_prog *prog;
> >         struct napi_struct napi;
> >
> >         int num_tx_queues;
> > diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > index 62a7f8d..232b927 100644
> > --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> > +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> > @@ -32,6 +32,7 @@
> >  #include <linux/prefetch.h>
> >  #include <linux/bitops.h>
> >  #include <linux/if_vlan.h>
> > +#include <linux/bpf.h>
> >
> >  char e1000_driver_name[] = "e1000";
> >  static char e1000_driver_string[] = "Intel(R) PRO/1000 Network Driver";
> > @@ -842,6 +843,44 @@ static int e1000_set_features(struct net_device *netdev,
> >         return 0;
> >  }
> >
> > +static int e1000_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
> > +{
> > +       struct e1000_adapter *adapter = netdev_priv(netdev);
> > +       struct bpf_prog *old_prog;
> > +
> > +       old_prog = xchg(&adapter->prog, prog);
> > +       if (old_prog) {
> > +               synchronize_net();
> > +               bpf_prog_put(old_prog);
> > +       }
> > +
> > +       if (netif_running(netdev))
> > +               e1000_reinit_locked(adapter);
> > +       else
> > +               e1000_reset(adapter);
> 
> What is the point of the reset?  If the interface isn't running is
> there anything in the hardware you actually need to cleanup?

The above is inspired by e1000_set_features().
I'm assuming it was done there for a reason and the same applies here.

> > +       return 0;
> > +}
> > +
> > +static bool e1000_xdp_attached(struct net_device *dev)
> > +{
> > +       struct e1000_adapter *priv = netdev_priv(dev);
> > +
> > +       return !!priv->prog;
> > +}
> > +
> > +static int e1000_xdp(struct net_device *dev, struct netdev_xdp *xdp)
> > +{
> > +       switch (xdp->command) {
> > +       case XDP_SETUP_PROG:
> > +               return e1000_xdp_set(dev, xdp->prog);
> > +       case XDP_QUERY_PROG:
> > +               xdp->prog_attached = e1000_xdp_attached(dev);
> > +               return 0;
> > +       default:
> > +               return -EINVAL;
> > +       }
> > +}
> > +
> >  static const struct net_device_ops e1000_netdev_ops = {
> >         .ndo_open               = e1000_open,
> >         .ndo_stop               = e1000_close,
> > @@ -860,6 +899,7 @@ static const struct net_device_ops e1000_netdev_ops = {
> >  #endif
> >         .ndo_fix_features       = e1000_fix_features,
> >         .ndo_set_features       = e1000_set_features,
> > +       .ndo_xdp                = e1000_xdp,
> >  };
> >
> >  /**
> > @@ -1276,6 +1316,9 @@ static void e1000_remove(struct pci_dev *pdev)
> >         e1000_down_and_stop(adapter);
> >         e1000_release_manageability(adapter);
> >
> > +       if (adapter->prog)
> > +               bpf_prog_put(adapter->prog);
> > +
> >         unregister_netdev(netdev);
> >
> >         e1000_phy_hw_reset(hw);
> > @@ -1859,7 +1902,7 @@ static void e1000_configure_rx(struct e1000_adapter *adapter)
> >         struct e1000_hw *hw = &adapter->hw;
> >         u32 rdlen, rctl, rxcsum;
> >
> > -       if (adapter->netdev->mtu > ETH_DATA_LEN) {
> > +       if (adapter->netdev->mtu > ETH_DATA_LEN || adapter->prog) {
> >                 rdlen = adapter->rx_ring[0].count *
> >                         sizeof(struct e1000_rx_desc);
> >                 adapter->clean_rx = e1000_clean_jumbo_rx_irq;
> 
> If you are really serious about using the page based Rx path we should
> probably fix the fact that you take a pretty significant hit on
> performance penalty for turning this mode on.

Not sure I follow. KVM tests show that xdp_drop/tx is faster even with
a full page alloc and no page recycling.
XDP is only operational in page-per-packet mode, which dictates the above approach.

> > @@ -1973,6 +2016,11 @@ e1000_unmap_and_free_tx_resource(struct e1000_adapter *adapter,
> >                 dev_kfree_skb_any(buffer_info->skb);
> >                 buffer_info->skb = NULL;
> >         }
> > +       if (buffer_info->page) {
> > +               put_page(buffer_info->page);
> > +               buffer_info->page = NULL;
> > +       }
> > +
> >         buffer_info->time_stamp = 0;
> >         /* buffer_info must be completely set up in the transmit path */
> >  }
> > @@ -3298,6 +3346,69 @@ static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
> >         return NETDEV_TX_OK;
> >  }
> >
> > +static void e1000_tx_map_rxpage(struct e1000_tx_ring *tx_ring,
> > +                               struct e1000_rx_buffer *rx_buffer_info,
> > +                               unsigned int len)
> > +{
> > +       struct e1000_tx_buffer *buffer_info;
> > +       unsigned int i = tx_ring->next_to_use;
> > +
> > +       buffer_info = &tx_ring->buffer_info[i];
> > +
> > +       buffer_info->length = len;
> > +       buffer_info->time_stamp = jiffies;
> > +       buffer_info->mapped_as_page = false;
> > +       buffer_info->dma = rx_buffer_info->dma;
> > +       buffer_info->next_to_watch = i;
> > +       buffer_info->page = rx_buffer_info->rxbuf.page;
> > +
> > +       tx_ring->buffer_info[i].skb = NULL;
> > +       tx_ring->buffer_info[i].segs = 1;
> > +       tx_ring->buffer_info[i].bytecount = len;
> > +       tx_ring->buffer_info[i].next_to_watch = i;
> > +
> > +       rx_buffer_info->rxbuf.page = NULL;
> > +}
> > +
> > +static void e1000_xmit_raw_frame(struct e1000_rx_buffer *rx_buffer_info,
> > +                                u32 len,
> > +                                struct net_device *netdev,
> > +                                struct e1000_adapter *adapter)
> > +{
> > +       struct netdev_queue *txq = netdev_get_tx_queue(netdev, 0);
> > +       struct e1000_hw *hw = &adapter->hw;
> > +       struct e1000_tx_ring *tx_ring;
> > +
> > +       if (len > E1000_MAX_DATA_PER_TXD)
> > +               return;
> > +
> > +       /* e1000 only support a single txq at the moment so the queue is being
> > +        * shared with stack. To support this requires locking to ensure the
> > +        * stack and XDP are not running at the same time. Devices with
> > +        * multiple queues should allocate a separate queue space.
> > +        */
> > +       HARD_TX_LOCK(netdev, txq, smp_processor_id());
> > +
> > +       tx_ring = adapter->tx_ring;
> > +
> > +       if (E1000_DESC_UNUSED(tx_ring) < 2) {
> > +               HARD_TX_UNLOCK(netdev, txq);
> > +               return;
> > +       }
> > +
> > +       if (netif_xmit_frozen_or_stopped(txq))
> > +               return;
> > +
> > +       e1000_tx_map_rxpage(tx_ring, rx_buffer_info, len);
> > +       netdev_sent_queue(netdev, len);
> > +       e1000_tx_queue(adapter, tx_ring, 0/*tx_flags*/, 1);
> > +
> > +       writel(tx_ring->next_to_use, hw->hw_addr + tx_ring->tdt);
> > +       mmiowb();
> > +
> > +       HARD_TX_UNLOCK(netdev, txq);
> > +}
> > +
> >  #define NUM_REGS 38 /* 1 based count */
> >  static void e1000_regdump(struct e1000_adapter *adapter)
> >  {
> > @@ -4139,6 +4250,19 @@ static struct sk_buff *e1000_alloc_rx_skb(struct e1000_adapter *adapter,
> >         return skb;
> >  }
> >
> > +static inline int e1000_call_bpf(struct bpf_prog *prog, void *data,
> > +                                unsigned int length)
> > +{
> > +       struct xdp_buff xdp;
> > +       int ret;
> > +
> > +       xdp.data = data;
> > +       xdp.data_end = data + length;
> > +       ret = BPF_PROG_RUN(prog, (void *)&xdp);
> > +
> > +       return ret;
> > +}
> > +
> >  /**
> >   * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
> >   * @adapter: board private structure
> > @@ -4157,12 +4281,15 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >         struct pci_dev *pdev = adapter->pdev;
> >         struct e1000_rx_desc *rx_desc, *next_rxd;
> >         struct e1000_rx_buffer *buffer_info, *next_buffer;
> > +       struct bpf_prog *prog;
> >         u32 length;
> >         unsigned int i;
> >         int cleaned_count = 0;
> >         bool cleaned = false;
> >         unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> >
> > +       rcu_read_lock(); /* rcu lock needed here to protect xdp programs */
> > +       prog = READ_ONCE(adapter->prog);
> >         i = rx_ring->next_to_clean;
> >         rx_desc = E1000_RX_DESC(*rx_ring, i);
> >         buffer_info = &rx_ring->buffer_info[i];
> > @@ -4188,12 +4315,54 @@ static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
> >
> >                 cleaned = true;
> >                 cleaned_count++;
> > +               length = le16_to_cpu(rx_desc->length);
> > +
> > +               if (prog) {
> > +                       struct page *p = buffer_info->rxbuf.page;
> > +                       dma_addr_t dma = buffer_info->dma;
> > +                       int act;
> > +
> > +                       if (unlikely(!(status & E1000_RXD_STAT_EOP))) {
> > +                               /* attached bpf disallows larger than page
> > +                                * packets, so this is hw error or corruption
> > +                                */
> > +                               pr_info_once("%s buggy !eop\n", netdev->name);
> > +                               break;
> > +                       }
> > +                       if (unlikely(rx_ring->rx_skb_top)) {
> > +                               pr_info_once("%s ring resizing bug\n",
> > +                                            netdev->name);
> > +                               break;
> > +                       }
> > +                       dma_sync_single_for_cpu(&pdev->dev, dma,
> > +                                               length, DMA_FROM_DEVICE);
> > +                       act = e1000_call_bpf(prog, page_address(p), length);
> > +                       switch (act) {
> > +                       case XDP_PASS:
> > +                               break;
> > +                       case XDP_TX:
> > +                               dma_sync_single_for_device(&pdev->dev,
> > +                                                          dma,
> > +                                                          length,
> > +                                                          DMA_TO_DEVICE);
> > +                               e1000_xmit_raw_frame(buffer_info, length,
> > +                                                    netdev, adapter);
> 
> Implementing a new xmit path and clean-up routines for just this is
> going to be a pain.  I'd say if we are going to do something like this
> then maybe we should look at coming up with a new ndo for the xmit and
> maybe push more of this into some sort of inline hook.  Duplicating
> this code in every driver is going to be really expensive.

We will have common xdp routines when more drivers implement it.
I would expect several pieces of mlx4/mlx5 can be made common.
An ndo approach won't work here, since the stack doesn't call into this part.
The xdp logic stays within the driver, and the dma/page handling is driver specific too.
It's pretty much like trying to make a common struct between e1000_tx_buffer
and mlx4_en_tx_info, which is quite difficult.

> Also I just noticed there is no break statement from the xmit code
> above to the drop that below.  I'd think you could overwrite the frame
> data in a case where the Rx exceeds the Tx due to things like flow
> control generating back pressure.

You mean pause frames on the TX side?
Isn't the 'if (E1000_DESC_UNUSED(tx_ring) < 2)' check in e1000_xmit_raw_frame() enough?

Thanks for the review!


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 16:21               ` [Intel-wired-lan] " Tom Herbert
@ 2016-09-13 17:13                 ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 17:13 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
> >> >
> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
> >> >> The simplest program that drops all the packets will make the box unpingable.
> >> >
> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
> >> > scooter on 101 highway ;)
> >> >
> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> >> > absolutely no documentation on it, warning users about possible
> >> > limitations/outcomes.
> >> >
> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
> >> >
> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
> >> > but tx completion will call netdev_tx_completed_queue() -> crash
> >> >
> >> > Do we have one test to validate that a XDP_TX implementation is actually
> >> > correct ?
> >> >
> >> Obviously not for e1000 :-(. We really need some real test and
> >> performance results and analysis on the interaction between the stack
> >> data path and XDP data path.
> >
> > no. we don't need it for e1k and we cannot really do it.
> > <broken record mode on> this patch is for debugging of xdp programs only.
> >
> You can say this "only for a debugging" a thousand times and that
> still won't justify putting bad code into the kernel. Material issues
> have been raised with these patches, I have proposed a fix for one
> core issue, and we have requested a lot more testing. So, please, if
> you really want to move these patches forward start addressing the
> concerns being raised by reviewers.

I'm afraid the point 'only for debugging' still didn't make it across.
xdp+e1k is for development (and debugging) of xdp-type bpf
programs, _not_ for debugging of xdp itself, the kernel, or anything else.
The interfaces and behavior e1k provides need to match exactly
what real hw nics (like mlx4, mlx5, ixgbe, i40e) will do.
Doing special hacks is not acceptable. Therefore your
'proposed fix' misses the mark, since:
1. ignoring bql/qdisc is not a bug, but the requirement
2. such a 'fix' goes against the goal above, since behaviors will be
different and an xdp developer won't be able to build something like
an xdp loadbalancer in kvm.

If you have other concerns please raise them, or if you have
suggestions on how to develop xdp programs without this e1k patch
I would love to hear them.
Alexander's review comments are discussed in a separate thread.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 17:13                 ` [Intel-wired-lan] " Alexei Starovoitov
@ 2016-09-13 17:37                   ` Eric Dumazet
  -1 siblings, 0 replies; 71+ messages in thread
From: Eric Dumazet @ 2016-09-13 17:37 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Tom Herbert, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:

> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement
> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
> 


Is e1k the only way a VM can receive and send packets ?

Instead of adding more cruft to a legacy driver, risking breaking real
old machines, I am sure we can find modern alternative.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 17:13                 ` [Intel-wired-lan] " Alexei Starovoitov
@ 2016-09-13 17:55                   ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-13 17:55 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Tue, Sep 13, 2016 at 10:13 AM, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
> On Tue, Sep 13, 2016 at 09:21:47AM -0700, Tom Herbert wrote:
>> On Mon, Sep 12, 2016 at 6:28 PM, Alexei Starovoitov
>> <alexei.starovoitov@gmail.com> wrote:
>> > On Mon, Sep 12, 2016 at 05:03:25PM -0700, Tom Herbert wrote:
>> >> On Mon, Sep 12, 2016 at 4:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> > On Mon, 2016-09-12 at 16:07 -0700, Alexei Starovoitov wrote:
>> >> >
>> >> >> yep. there are various ways to shoot yourself in the foot with xdp.
>> >> >> The simplest program that drops all the packets will make the box unpingable.
>> >> >
>> >> > Well, my comment was about XDP_TX only, not about XDP_DROP or driving a
>> >> > scooter on 101 highway ;)
>> >> >
>> >> > This XDP_TX thing was one of the XDP marketing stuff, but there is
>> >> > absolutely no documentation on it, warning users about possible
>> >> > limitations/outcomes.
>> >> >
>> >> > BTW, I am not sure mlx4 implementation even works, vs BQL :
>> >> >
>> >> > mlx4_en_xmit_frame() does not call netdev_tx_sent_queue(),
>> >> > but tx completion will call netdev_tx_completed_queue() -> crash
>> >> >
>> >> > Do we have one test to validate that a XDP_TX implementation is actually
>> >> > correct ?
>> >> >
>> >> Obviously not for e1000 :-(. We really need some real test and
>> >> performance results and analysis on the interaction between the stack
>> >> data path and XDP data path.
>> >
>> > no. we don't need it for e1k and we cannot really do it.
>> > <broken record mode on> this patch is for debugging of xdp programs only.
>> >
>> You can say this "only for a debugging" a thousand times and that
>> still won't justify putting bad code into the kernel. Material issues
>> have been raised with these patches, I have proposed a fix for one
>> core issue, and we have requested a lot more testing. So, please, if
>> you really want to move these patches forward start addressing the
>> concerns being raised by reviewers.
>
> I'm afraid the point 'only for debugging' still didn't make it across.
> xdp+e1k is for development (and debugging) of xdp-type of bpf
> programs and _not_ for debugging of xdp itself, kernel or anything else.
> The e1k provided interfaces and behavior needs to match exactly
> what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> Doing special hacks are not acceptable. Therefore your
> 'proposed fix' misses the mark, since:
> 1. ignoring bql/qdisc is not a bug, but the requirement

You don't seem to understand the problem. In the shared queue scenario
if one party (the stack) implements qdiscs, BQL, and such and the
other (XDP) just throws packets onto the queue then these are
incompatible behaviors and something will break. I suppose it's
possible that some how this does not affect the stack path, but
remains to be proven. In any case the patches under review look very
much like they break things; either a fix is needed or tests run to
show it's not a problem. Until this is resolved I am going to nack the
patch.

Tom

> 2. such 'fix' goes against the goal above since behaviors will be
> different and xdp developer won't be able to build something like
> xdp loadbalancer in the kvm.
>
> If you have other concerns please raise them or if you have
> suggestions on how to develop xdp programs without this e1k patch
> I would love hear them.
> Alexander's review comments are discussed in separate thread.
>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 17:37                   ` [Intel-wired-lan] " Eric Dumazet
@ 2016-09-13 17:59                     ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 17:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Tom Herbert, John Fastabend, Brenden Blanco, Jeff Kirsher,
	Jesper Dangaard Brouer, David S. Miller, Cong Wang,
	intel-wired-lan, William Tu, Linux Kernel Network Developers

On Tue, Sep 13, 2016 at 10:37:32AM -0700, Eric Dumazet wrote:
> On Tue, 2016-09-13 at 10:13 -0700, Alexei Starovoitov wrote:
> 
> > I'm afraid the point 'only for debugging' still didn't make it across.
> > xdp+e1k is for development (and debugging) of xdp-type of bpf
> > programs and _not_ for debugging of xdp itself, kernel or anything else.
> > The e1k provided interfaces and behavior needs to match exactly
> > what real hw nics (like mlx4, mlx5, igxbe, i40e) will do.
> > Doing special hacks are not acceptable. Therefore your
> > 'proposed fix' misses the mark, since:
> > 1. ignoring bql/qdisc is not a bug, but the requirement
> > 2. such 'fix' goes against the goal above since behaviors will be
> > different and xdp developer won't be able to build something like
> > xdp loadbalancer in the kvm.
> > 
> 
> 
> Is e1k the only way a VM can receive and send packets ?
> 
> Instead of adding more cruft to a legacy driver, risking breaking real
> old machines, 

agree that it is the concern.

> I am sure we can find modern alternative.

I've looked through qemu and it appears to only emulate e1k and tg3.
The latter is still used in the field, so the risk of touching
it is higher.
The other alternative is virtio, but it doesn't have dma and/or pages,
so it looks to me like an even messier hack.
The last alternative considered was to invent an xdp-only fake 'hw' nic,
but it's too much work to get it into qemu and then ask the world
to upgrade qemu.
At that point I ran out of ideas and settled on hacking e1k :(
Not proud of this hack at all.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 17:59                     ` [Intel-wired-lan] " Alexei Starovoitov
@ 2016-09-13 18:28                       ` Rustad, Mark D
  -1 siblings, 0 replies; 71+ messages in thread
From: Rustad, Mark D @ 2016-09-13 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu


Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> I've looked through qemu and it appears only emulate e1k and tg3.
> The latter is still used in the field, so the risk of touching
> it is higher.

I have no idea what makes you think that e1k is *not* "used in the field".   
I grant you it probably is used more virtualized than not these days, but  
it certainly exists and is used. You can still buy them new at Newegg for  
goodness sakes!

Maybe I'll go home and plug in my old e100 into my machine that still has a  
PCI slot, just for old times sake. Oh darn, I have a SCSI card in that  
slot...

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 18:28                       ` Rustad, Mark D
@ 2016-09-13 18:30                         ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 18:30 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> >I've looked through qemu and it appears only emulate e1k and tg3.
> >The latter is still used in the field, so the risk of touching
> >it is higher.
> 
> I have no idea what makes you think that e1k is *not* "used in the field".
> I grant you it probably is used more virtualized than not these days, but it
> certainly exists and is used. You can still buy them new at Newegg for
> goodness sakes!

The point is that it's only used virtualized, since PCI (not PCIe) has
long been dead.

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 18:30                         ` Alexei Starovoitov
@ 2016-09-13 19:14                           ` Rustad, Mark D
  -1 siblings, 0 replies; 71+ messages in thread
From: Rustad, Mark D @ 2016-09-13 19:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>>> I've looked through qemu and it appears only emulate e1k and tg3.
>>> The latter is still used in the field, so the risk of touching
>>> it is higher.
>>
>> I have no idea what makes you think that e1k is *not* "used in the field".
>> I grant you it probably is used more virtualized than not these days,  
>> but it
>> certainly exists and is used. You can still buy them new at Newegg for
>> goodness sakes!
>
> the point that it's only used virtualized, since PCI (not PCIE) have
> been long dead.

My point is precisely the opposite. It is a real device, it exists in real  
systems and it is used in those systems. I worked on embedded systems that  
ran Linux and used e1000 devices. I am sure they are still out there  
because customers are still paying for support of those systems.

Yes, PCI(-X) is absent from any current hardware and has been for some  
years now, but there is an installed base that continues. What part of that  
installed base updates software? I don't know, but I would not just assume  
that it is 0. I know that I updated the kernel on those embedded systems  
that I worked on when I was supporting them. Never at the bleeding edge,  
but generally hopping from one LTS kernel to another as needed.

The day is coming when all the motherboards with PCI(-X) will be gone, but  
I think it is still at least a few years off.

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 19:14                           ` Rustad, Mark D
@ 2016-09-13 21:52                             ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 21:52 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> >On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> >>Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>>I've looked through qemu and it appears only emulate e1k and tg3.
> >>>The latter is still used in the field, so the risk of touching
> >>>it is higher.
> >>
> >>I have no idea what makes you think that e1k is *not* "used in the field".
> >>I grant you it probably is used more virtualized than not these days,
> >>but it
> >>certainly exists and is used. You can still buy them new at Newegg for
> >>goodness sakes!
> >
> >the point that it's only used virtualized, since PCI (not PCIE) have
> >been long dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.
> 
> Yes, PCI(-X) is absent from any current hardware and has been for some years
> now, but there is an installed base that continues. What part of that
> installed base updates software? I don't know, but I would not just assume
> that it is 0. I know that I updated the kernel on those embedded systems
> that I worked on when I was supporting them. Never at the bleeding edge, but
> generally hopping from one LTS kernel to another as needed.

I suspect modern Linux won't boot on such old PCI-only systems for other
reasons unrelated to networking, since no one really cares to test kernels there.
So I think we mostly agree. There is a chance that this xdp e1k code will
find its way onto one of those old systems. What are the chances those users
will be using xdp there? I think pretty close to zero.

The PCIe NICs integrated into motherboards that pretend to be tg3
(though they're not at all built by Broadcom) are significantly more common.
That's why I picked e1k instead of tg3.

Also note how this patch does not touch anything in the main e1k path
(because I don't have the hw to test and I suspect Intel's driver team
doesn't have it either), to make sure there is no breakage on those
old systems. I created a separate e1000_xmit_raw_frame() routine
instead of adding flags into e1000_xmit_frame() for the same reason:
to make sure there is no breakage.
Same reasoning for not doing a union of page/skb as Alexander suggested.
I wanted a minimal change to e1k that allows developing xdp programs in kvm
without affecting the e1k main path. If you see an actual bug in the patch,
please point out the line.
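The split described above can be sketched in plain C. Everything here is an illustrative stand-in (the toy ring, the verdict dispatch); only the idea of a dedicated raw-frame transmit routine, kept apart from the regular skb transmit path, mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only -- not the real e1000 structures.  The
 * point mirrored from the patch is the split: XDP_TX goes through a
 * dedicated raw-frame routine and never enters the skb xmit path. */

enum xdp_action { XDP_DROP = 0, XDP_PASS = 1, XDP_TX = 2 };

struct toy_tx_ring {
	int used;	/* descriptors in flight */
	int size;	/* ring capacity */
};

/* Stand-in for e1000_xmit_raw_frame(): push a raw buffer onto the TX
 * ring; if no descriptor is free the packet is simply dropped, which
 * matches the XDP_TX error policy described in the cover letter. */
static int xmit_raw_frame(struct toy_tx_ring *tx, const void *data, size_t len)
{
	(void)data; (void)len;
	if (tx->used >= tx->size)
		return -1;	/* ring full: drop */
	tx->used++;
	return 0;
}

/* RX-side dispatch: the XDP verdict alone picks the path; the regular
 * skb transmit routine is never called from here. */
static int rx_handle_verdict(struct toy_tx_ring *tx, enum xdp_action act,
			     const void *data, size_t len)
{
	switch (act) {
	case XDP_TX:
		return xmit_raw_frame(tx, data, len);
	case XDP_PASS:
		return 1;	/* hand the frame to the regular skb path */
	case XDP_DROP:
	default:
		return 0;
	}
}
```

The design point is the one argued in the message: with no flag threaded through the main transmit routine, the non-XDP path compiles to exactly the code it had before.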

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 21:52                             ` Alexei Starovoitov
@ 2016-09-13 22:41                               ` Rustad, Mark D
  -1 siblings, 0 replies; 71+ messages in thread
From: Rustad, Mark D @ 2016-09-13 22:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>
>>> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
>>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>>> I've looked through qemu and it appears only emulate e1k and tg3.
>>>>> The latter is still used in the field, so the risk of touching
>>>>> it is higher.
>>>>
>>>> I have no idea what makes you think that e1k is *not* "used in the  
>>>> field".
>>>> I grant you it probably is used more virtualized than not these days,
>>>> but it
>>>> certainly exists and is used. You can still buy them new at Newegg for
>>>> goodness sakes!
>>>
>>> the point that it's only used virtualized, since PCI (not PCIE) have
>>> been long dead.
>>
>> My point is precisely the opposite. It is a real device, it exists in real
>> systems and it is used in those systems. I worked on embedded systems that
>> ran Linux and used e1000 devices. I am sure they are still out there  
>> because
>> customers are still paying for support of those systems.
>>
>> Yes, PCI(-X) is absent from any current hardware and has been for some  
>> years
>> now, but there is an installed base that continues. What part of that
>> installed base updates software? I don't know, but I would not just assume
>> that it is 0. I know that I updated the kernel on those embedded systems
>> that I worked on when I was supporting them. Never at the bleeding edge,  
>> but
>> generally hopping from one LTS kernel to another as needed.
>
> I suspect modern linux won't boot on such old pci only systems for other
> reasons not related to networking, since no one really cares to test  
> kernels there.

Actually it does boot, because although the motherboard was PCIe, the slots  
and the adapters in them were PCI-X. So the core architecture was not so  
stale.

> So I think we mostly agree. There is chance that this xdp e1k code will
> find a way to that old system. What are the chances those users will
> be using xdp there? I think pretty close to zero.

For sure they wouldn't be using XDP, but they could suffer regressions in a  
changed driver that might find its way there. That is the risk.

> The pci-e nics integrated into motherboards that pretend to be tg3
> (though they're not at all build by broadcom) are significantly more  
> common.
> That's why I picked e1k instead of tg3.

That may be true (I really don't know anything about tg3 so I certainly  
can't dispute it), so the risk could be smaller with e1k, but there is  
still a regression risk for real existing hardware. That is my concern.

> Also note how this patch is not touching anything in the main e1k path
> (because I don't have a hw to test and I suspect Intel's driver team
> doesn't have it either) to make sure there is no breakage on those
> old systems. I created separate e1000_xmit_raw_frame() routine
> instead of adding flags into e1000_xmit_frame() for the same reasons:
> to make sure there is no breakage.
> Same reasoning for not doing an union of page/skb as Alexander suggested.
> I wanted minimal change to e1k that allows development xdp programs in kvm
> without affecting e1k main path. If you see the actual bug in the patch,
> please point out the line.

I can't say that I can, because I am not familiar with the internals of  
e1k. When I was using it, I never had cause to even look at the driver  
because it just worked. My attentions then were properly elsewhere.

My concern is with messing with a driver for which probably no one has an  
active testbed routinely running regression tests.

Maybe a new ID should be assigned and the driver forked for this purpose.  
At least then only the virtualization case would have to be tested. Of  
course the hosts would have to offer that ID, but if this is just for  
testing that should be ok, or at least possible.

If someone with more internal knowledge of this driver has enough  
confidence to bless such patches, that might be fine, but it is a fallacy  
to think that e1k is *only* a virtualization driver today. Not yet anyway.  
Maybe around 2020.

That said, I can see that you have tried to keep the original code path  
pretty much intact. I would note that you introduced rcu calls into the  
!bpf path that would never have been done before. While that should be ok,  
I would really like to see it tested, at least for the !bpf case, on real  
hardware to be sure. I really can't comment on the workaround issues brought  
up by Eric, because I just don't know about them. At least that risk seems  
to be confined to the bpf case.

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 19:14                           ` Rustad, Mark D
@ 2016-09-13 23:17                             ` Francois Romieu
  -1 siblings, 0 replies; 71+ messages in thread
From: Francois Romieu @ 2016-09-13 23:17 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Alexei Starovoitov, Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan, William Tu,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller

Rustad, Mark D <mark.d.rustad@intel.com> :
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
[...]
> > the point that it's only used virtualized, since PCI (not PCIE) have
> > been long dead.
> 
> My point is precisely the opposite. It is a real device, it exists in real
> systems and it is used in those systems. I worked on embedded systems that
> ran Linux and used e1000 devices. I am sure they are still out there because
> customers are still paying for support of those systems.

Old PCI is not the bulk of my professional hardware but it's still used here,
both for networking and video. No embedded systems. Mostly dumb file serving
ranging from quad core to i5 and xeon. Recent kernels. 

> The day is coming when all the motherboards with PCI(-X) will be gone, but I
> think it is still at least a few years off.

Add some time for the PCI <-> PCIe adapters to disappear as well. :o)

-- 
Ueimor

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 22:41                               ` Rustad, Mark D
@ 2016-09-13 23:40                                 ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-13 23:40 UTC (permalink / raw)
  To: Rustad, Mark D
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

On Tue, Sep 13, 2016 at 10:41:12PM +0000, Rustad, Mark D wrote:
> That said, I can see that you have tried to keep the original code path
> pretty much intact. I would note that you introduced rcu calls into the !bpf
> path that would never have been done before. While that should be ok, I
> would really like to see it tested, at least for the !bpf case, on real
> hardware to be sure. 

please go ahead and test. rcu_read_lock is zero extra instructions
for everything but preempt or debug kernels.
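A minimal userspace sketch of the pattern under discussion: the no-op macros below mimic a !PREEMPT, !DEBUG kernel configuration, which is where the "zero extra instructions" claim comes from. Apart from the rcu_read_lock()/rcu_dereference() names themselves, everything here (the toy program struct, the function shape) is an illustrative assumption, not the actual driver code:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace mocks of the RCU calls: on non-preempt, non-debug kernels
 * the real rcu_read_lock()/rcu_read_unlock() compile down to nothing,
 * so the !bpf path pays only for the NULL check below. */
#define rcu_read_lock()		do { } while (0)
#define rcu_read_unlock()	do { } while (0)
#define rcu_dereference(p)	(p)

/* Illustrative stand-in for the attached XDP program. */
struct toy_bpf_prog { int verdict; };

/* Shape of the RX-path check: sample the program pointer once under
 * the read-side lock; with no program attached, fall through to the
 * normal receive path (signalled here by returning -1). */
static int rx_run_xdp(struct toy_bpf_prog *attached)
{
	int act = -1;	/* no program: take the normal receive path */
	struct toy_bpf_prog *prog;

	rcu_read_lock();
	prog = rcu_dereference(attached);
	if (prog)
		act = prog->verdict;
	rcu_read_unlock();
	return act;
}
```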

* Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 23:40                                 ` Alexei Starovoitov
@ 2016-09-14  0:13                                   ` Rustad, Mark D
  -1 siblings, 0 replies; 71+ messages in thread
From: Rustad, Mark D @ 2016-09-14  0:13 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller, William Tu

Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Sep 13, 2016 at 10:41:12PM +0000, Rustad, Mark D wrote:
>> That said, I can see that you have tried to keep the original code path
>> pretty much intact. I would note that you introduced rcu calls into the  
>> !bpf
>> path that would never have been done before. While that should be ok, I
>> would really like to see it tested, at least for the !bpf case, on real
>> hardware to be sure.
>
> please go ahead and test. rcu_read_lock is zero extra instructions
> for everything but preempt or debug kernels.

Well, I don't have any hardware in hand to test with, though my former  
employer would. I guess my current employer would too! :-) FWIW, the kernel  
used in that system I referred to before was a preempt kernel.

The test matrix is large, the tail is long and you can't just gloss these  
things over. I understand that it isn't the focus of your work, just as  
regression testing the e1000 is not the focus of any of our work any more.  
That is precisely why it is a sensitive area.

* RE: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-13 22:41                               ` Rustad, Mark D
@ 2016-09-14 23:42                                 ` Brown, Aaron F
  -1 siblings, 0 replies; 71+ messages in thread
From: Brown, Aaron F @ 2016-09-14 23:42 UTC (permalink / raw)
  To: Rustad, Mark D, Alexei Starovoitov
  Cc: Eric Dumazet, Tom Herbert, Brenden Blanco,
	Linux Kernel Network Developers, intel-wired-lan, William Tu,
	Jesper Dangaard Brouer, Cong Wang, David S. Miller

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
> Behalf Of Rustad, Mark D
> Sent: Tuesday, September 13, 2016 3:41 PM
> To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>; Tom Herbert
> <tom@herbertland.com>; Brenden Blanco <bblanco@plumgrid.com>; Linux
> Kernel Network Developers <netdev@vger.kernel.org>; intel-wired-lan <intel-
> wired-lan@lists.osuosl.org>; William Tu <u9012063@gmail.com>; Jesper
> Dangaard Brouer <brouer@redhat.com>; Cong Wang
> <xiyou.wangcong@gmail.com>; David S. Miller <davem@davemloft.net>
> Subject: Re: [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP
> support
> 
> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> 
> > On Tue, Sep 13, 2016 at 07:14:27PM +0000, Rustad, Mark D wrote:
> >> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>
> >>> On Tue, Sep 13, 2016 at 06:28:03PM +0000, Rustad, Mark D wrote:
> >>>> Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:
> >>>>
> >>>>> I've looked through qemu and it appears only emulate e1k and tg3.
> >>>>> The latter is still used in the field, so the risk of touching
> >>>>> it is higher.
> >>>>
> >>>> I have no idea what makes you think that e1k is *not* "used in the
> >>>> field".

I personally use a number of physical e1000 parts in my lab, primarily as an out-of-band control adapter: the driver almost never changes, so it tends to be very stable, and it is rare that I have to test it, which means I'm not fighting with a control adapter that I am also testing the driver on.

> >>>> I grant you it probably is used more virtualized than not these days,
> >>>> but it
> >>>> certainly exists and is used. You can still buy them new at Newegg for
> >>>> goodness sakes!
> >>>
> >>> the point that it's only used virtualized, since PCI (not PCIE) have
> >>> been long dead.

Maybe for new server installations.  But modern motherboards often still have PCI, university and lab environments are slow to upgrade specialty test equipment that uses PCI, and home users often don't want to get rid of their expensive high-quality sound card, video capture card (or whatever).  Plain old PCI is still in use.

> >>
> >> My point is precisely the opposite. It is a real device, it exists in real
> >> systems and it is used in those systems. I worked on embedded systems
> that
> >> ran Linux and used e1000 devices. I am sure they are still out there
> >> because
> >> customers are still paying for support of those systems.
> >>
> >> Yes, PCI(-X) is absent from any current hardware and has been for some
> >> years
> >> now, but there is an installed base that continues. What part of that
> >> installed base updates software? I don't know, but I would not just
> assume
> >> that it is 0. I know that I updated the kernel on those embedded systems
> >> that I worked on when I was supporting them. Never at the bleeding edge,
> >> but
> >> generally hopping from one LTS kernel to another as needed.
> >
> > I suspect modern linux won't boot on such old pci only systems for other
> > reasons not related to networking, since no one really cares to test
> > kernels there.
> 
> Actually it does boot, because although the motherboard was PCIe, the slots
> and the adapters in them were PCI-X. So the core architecture was not so
> stale.
> 
> > So I think we mostly agree. There is chance that this xdp e1k code will
> > find a way to that old system. What are the chances those users will
> > be using xdp there? I think pretty close to zero.
> 
> For sure they wouldn't be using XDP, but they could suffer regressions in a
> changed driver that might find its way there. That is the risk.
> 
> > The pci-e nics integrated into motherboards that pretend to be tg3
> > (though they're not at all build by broadcom) are significantly more
> > common.
> > That's why I picked e1k instead of tg3.
> 
> That may be true (I really don't know anything about tg3 so I certainly
> can't dispute it), so the risk could be smaller with e1k, but there is
> still a regression risk for real existing hardware. That is my concern.

And one of the patches supporting this seems to have done just that.  I'll reply to the patch itself with details.
> 
> > Also note how this patch is not touching anything in the main e1k path
> > (because I don't have a hw to test and I suspect Intel's driver team
> > doesn't have it either)

Your suspicion is incorrect.  My lab has multiple boxes out on benches and at least one cabinet 3/4 full of ancient hardware, including a decent variety of e1000 cards (and even a number of e100 ones.)  I also keep a handful of test systems in my regression rack devoted to e1000, though those have dwindled (old hardware eventually dies) and the rack is currently in need of a bit of work, which this patch set has encouraged me to get on to.

> >  to make sure there is no breakage on those
> > old systems. I created separate e1000_xmit_raw_frame() routine
> > instead of adding flags into e1000_xmit_frame() for the same reasons:
> > to make sure there is no breakage.
> > Same reasoning for not doing an union of page/skb as Alexander
> suggested.
> > I wanted minimal change to e1k that allows development xdp programs in
> kvm
> > without affecting e1k main path. If you see the actual bug in the patch,
> > please point out the line.
> 
> I can't say that I can, because I am not familiar with the internals of
> e1k. When I was using it, I never had cause to even look at the driver
> because it just worked. My attentions then were properly elsewhere.
> 
> My concern is with messing with a driver that probably no one has an active
> testbed routinely running regression tests.
> 
> Maybe a new ID should be assigned and the driver forked for this purpose.
> At least then only the virtualization case would have to be tested. Of
> course the hosts would have to offer that ID, but if this is just for
> testing that should be ok, or at least possible.
> 
> If someone with more internal knowledge of this driver has enough
> confidence to bless such patches, that might be fine, but it is a fallacy
> to think that e1k is *only* a virtualization driver today. Not yet anyway.
> Maybe around 2020.
> 
> That said, I can see that you have tried to keep the original code path
> pretty much intact. I would note that you introduced rcu calls into the
> !bpf path that would never have been done before. While that should be ok,
> I would really like to see it tested, at least for the !bpf case, on real
> hardware to be sure. I really can't comment on the workaround issue
> brought
> up by Eric, because I just don't know about them. At least that risk seems
> to only be in the bpf case.
> 
> --
> Mark Rustad, Networking Division, Intel Corporation



* RE: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
@ 2016-09-14 23:57     ` Brown, Aaron F
  -1 siblings, 0 replies; 71+ messages in thread
From: Brown, Aaron F @ 2016-09-14 23:57 UTC (permalink / raw)
  To: John Fastabend, bblanco, alexei.starovoitov, Kirsher, Jeffrey T,
	brouer, davem
  Cc: xiyou.wangcong, intel-wired-lan, u9012063, netdev

> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
> Behalf Of John Fastabend
> Sent: Monday, September 12, 2016 3:13 PM
> To: bblanco@plumgrid.com; john.fastabend@gmail.com;
> alexei.starovoitov@gmail.com; Kirsher, Jeffrey T
> <jeffrey.t.kirsher@intel.com>; brouer@redhat.com; davem@davemloft.net
> Cc: xiyou.wangcong@gmail.com; intel-wired-lan@lists.osuosl.org;
> u9012063@gmail.com; netdev@vger.kernel.org
> Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
> regardless of skb or not
> 
> The BQL API does not reference the sk_buff nor does the driver need to
> reference the sk_buff to calculate the length of a transmitted frame.
> This patch removes an sk_buff reference from the xmit irq path and
> also allows packets sent from XDP to use BQL.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)

This patch is causing all my e1000 adapters to fail a simple FTP session with really slow response (hashing on), followed by adapter resets.  I have seen this on 4 different e1000 NICs now: an 82543GC, 82544GC, 82546EB and an 82545GM.  On a few occasions I get a splat captured in dmesg.  Here is an example:
--------------------------------
------------[ cut here ]------------
WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
Modules linked in: e1000 e100 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ipv6 crc_ccitt p4_clockmod dm_mirror dm_region_hash dm_log uinput sg serio_raw dcdbas mii i2c_piix4 i2c_core cfi_probe gen_probe cfi_util scb2_flash dm_mod(E) ext4(E) mbcache(E) jbd2(E) sd_mod(E) sr_mod(E) cdrom(E) aacraid(E) pata_acpi(E) ata_generic(E) pata_serverworks(E) [last unloaded: e1000]
CPU: 1 PID: 0 Comm: swapper/1 Tainted: G            E   4.8.0-rc6_next-queue_dev-queue_b0b5ade #8
Hardware name: Dell Computer Corporation PowerEdge 2650/0D5995, BIOS A21 10/05/2006
 00000000 c06724fb c0b6038a c0b6038a 0000013c c04596d5 c0b647ac f3d2fed0
 00000000 c0b6038a 0000013c c08bff52 c08bff52 00000009 f2922000 00000000
 f318cf00 0001e788 c045979b 00000009 00000000 f3d2feb8 c0b647ac f3d2fed0
Call Trace:
 [<c06724fb>] ? dump_stack+0x47/0x6c
 [<c04596d5>] ? __warn+0x105/0x120
 [<c08bff52>] ? dev_watchdog+0x1c2/0x1d0
 [<c08bff52>] ? dev_watchdog+0x1c2/0x1d0
 [<c045979b>] ? warn_slowpath_fmt+0x3b/0x40
 [<c08bff52>] ? dev_watchdog+0x1c2/0x1d0
 [<c04bb71b>] ? call_timer_fn+0x3b/0x140
 [<c08bfd90>] ? dev_trans_start+0x60/0x60
 [<c04bc1ab>] ? expire_timers+0x9b/0x110
 [<c04bc471>] ? run_timer_softirq+0xd1/0x260
 [<c089a1e0>] ? net_tx_action+0xe0/0x1a0
 [<c04b8597>] ? rcu_process_callbacks+0x47/0xe0
 [<c045e518>] ? __do_softirq+0xc8/0x280
 [<c045e450>] ? irq_exit+0x90/0x90
 [<c041dbd2>] ? do_softirq_own_stack+0x22/0x30
 <IRQ>  [<c045e445>] ? irq_exit+0x85/0x90
 [<c0440f80>] ? smp_apic_timer_interrupt+0x30/0x40
 [<c095f44d>] ? apic_timer_interrupt+0x2d/0x34
 [<c04251cc>] ? default_idle+0x1c/0xd0
 [<c04ce222>] ? __tick_nohz_idle_enter+0x92/0x140
 [<c0424cc6>] ? arch_cpu_idle+0x6/0x10
 [<c0495ebd>] ? cpuidle_idle_call+0x7d/0xe0
 [<c0496075>] ? cpu_idle_loop+0x155/0x210
 [<c04404a5>] ? start_secondary+0xb5/0xe0
---[ end trace fa448b49f7848a42 ]---
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
e1000 0000:02:06.0 eth1: Reset adapter
e1000: eth1 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
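For context, the watchdog splat here is the classic symptom of BQL byte accounting going out of balance: netdev_tx_sent_queue() and netdev_tx_completed_queue() (the real kernel APIs the patch touches) must pair up byte-for-byte, whether or not a descriptor carried an sk_buff. The userspace sketch below is a simplified model of that invariant, not the driver code; the struct and function names are illustrative stand-ins.

```c
#include <assert.h>

/* Toy model of a BQL-tracked tx queue: bytes reported at xmit time
 * versus bytes reported when descriptors are reaped at clean time. */
struct bql_model {
	unsigned long sent;		/* netdev_tx_sent_queue() analogue */
	unsigned long completed;	/* netdev_tx_completed_queue() analogue */
};

static void tx_sent(struct bql_model *q, unsigned long bytes)
{
	q->sent += bytes;	/* xmit path: frame handed to hardware */
}

static void tx_completed(struct bql_model *q, unsigned long bytes)
{
	q->completed += bytes;	/* tx irq clean path: descriptor reaped */
}

/* Bytes BQL believes are still in flight.  If the clean path under-
 * or over-reports (e.g. by keying off the sk_buff for frames that have
 * none), this drifts, BQL throttles the queue indefinitely, and
 * dev_watchdog eventually fires "transmit queue 0 timed out". */
static unsigned long tx_in_flight(const struct bql_model *q)
{
	return q->sent - q->completed;
}
```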



* Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-14 23:57     ` Brown, Aaron F
@ 2016-09-15  0:43       ` Alexei Starovoitov
  -1 siblings, 0 replies; 71+ messages in thread
From: Alexei Starovoitov @ 2016-09-15  0:43 UTC (permalink / raw)
  To: Brown, Aaron F
  Cc: John Fastabend, bblanco, Kirsher, Jeffrey T, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

On Wed, Sep 14, 2016 at 11:57:24PM +0000, Brown, Aaron F wrote:
> > From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
> > Behalf Of John Fastabend
> > Sent: Monday, September 12, 2016 3:13 PM
> > To: bblanco@plumgrid.com; john.fastabend@gmail.com;
> > alexei.starovoitov@gmail.com; Kirsher, Jeffrey T
> > <jeffrey.t.kirsher@intel.com>; brouer@redhat.com; davem@davemloft.net
> > Cc: xiyou.wangcong@gmail.com; intel-wired-lan@lists.osuosl.org;
> > u9012063@gmail.com; netdev@vger.kernel.org
> > Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
> > regardless of skb or not
> > 
> > The BQL API does not reference the sk_buff nor does the driver need to
> > reference the sk_buff to calculate the length of a transmitted frame.
> > This patch removes an sk_buff reference from the xmit irq path and
> > also allows packets sent from XDP to use BQL.
> > 
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> This patch is causing all my e1000 adapters to fail a simple ftp session with really slow response (hashing on) followed by adapter resets.  I have seen this on 4 different e1000 nics now, an 82543GC, 82544GC, 82546EB and an 82545GM.  On a few occasions I get a splat captured to dmesg.  Here is an example:
> --------------------------------
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
> NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out

Thanks a lot for the tests! Really appreciate it.



* Re: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-15  0:43       ` Alexei Starovoitov
@ 2016-09-15  4:22         ` John Fastabend
  -1 siblings, 0 replies; 71+ messages in thread
From: John Fastabend @ 2016-09-15  4:22 UTC (permalink / raw)
  To: Alexei Starovoitov, Brown, Aaron F
  Cc: bblanco, Kirsher, Jeffrey T, brouer, davem, xiyou.wangcong,
	intel-wired-lan, u9012063, netdev

On 16-09-14 05:43 PM, Alexei Starovoitov wrote:
> On Wed, Sep 14, 2016 at 11:57:24PM +0000, Brown, Aaron F wrote:
>>> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@lists.osuosl.org] On
>>> Behalf Of John Fastabend
>>> Sent: Monday, September 12, 2016 3:13 PM
>>> To: bblanco@plumgrid.com; john.fastabend@gmail.com;
>>> alexei.starovoitov@gmail.com; Kirsher, Jeffrey T
>>> <jeffrey.t.kirsher@intel.com>; brouer@redhat.com; davem@davemloft.net
>>> Cc: xiyou.wangcong@gmail.com; intel-wired-lan@lists.osuosl.org;
>>> u9012063@gmail.com; netdev@vger.kernel.org
>>> Subject: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes
>>> regardless of skb or not
>>>
>>> The BQL API does not reference the sk_buff nor does the driver need to
>>> reference the sk_buff to calculate the length of a transmitted frame.
>>> This patch removes an sk_buff reference from the xmit irq path and
>>> also allows packets sent from XDP to use BQL.
>>>
>>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>>> ---
>>>  drivers/net/ethernet/intel/e1000/e1000_main.c |    7 ++-----
>>>  1 file changed, 2 insertions(+), 5 deletions(-)
>>
>> This patch is causing all my e1000 adapters to fail a simple ftp session with really slow response (hashing on) followed by adapter resets.  I have seen this on 4 different e1000 nics now, an 82543GC, 82544GC, 82546EB and an 82545GM.  On a few occasions I get a splat captured to dmesg.  Here is an example:
>> --------------------------------
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1c2/0x1d0
>> NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
> 
> Thanks a lot for the tests! Really appreciate it.
> 

Thanks.

Jeff, please drop the series for now; obviously this won't work as-is.  It needs
some work.



* RE: [Intel-wired-lan] [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not
  2016-09-15  0:43       ` Alexei Starovoitov
@ 2016-09-15 23:29         ` Brown, Aaron F
  -1 siblings, 0 replies; 71+ messages in thread
From: Brown, Aaron F @ 2016-09-15 23:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: John Fastabend, bblanco, Kirsher, Jeffrey T, brouer, davem,
	xiyou.wangcong, intel-wired-lan, u9012063, netdev

> > ------------[ cut here ]------------
> > WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:316
> dev_watchdog+0x1c2/0x1d0
> > NETDEV WATCHDOG: eth1 (e1000): transmit queue 0 timed out
> 
> Thanks a lot for the tests! Really appreciate it.

np, I needed to get my old compatibility systems back in running order anyway.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-12 23:46         ` [Intel-wired-lan] " Eric Dumazet
@ 2016-09-18 17:25           ` Jesper Dangaard Brouer
  -1 siblings, 0 replies; 71+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-18 17:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, John Fastabend, bblanco, jeffrey.t.kirsher,
	davem, xiyou.wangcong, intel-wired-lan, u9012063, netdev, brouer

On Mon, 12 Sep 2016 16:46:08 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> This XDP_TX thing was part of the XDP marketing push, but there is
> absolutely no documentation on it warning users about possible
> limitations/outcomes.

I will take care of documentation for the XDP project.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [Intel-wired-lan] [net-next PATCH v3 2/3] e1000: add initial XDP support
  2016-09-18 17:25           ` [Intel-wired-lan] " Jesper Dangaard Brouer
@ 2016-09-18 18:06           ` Tom Herbert
  -1 siblings, 0 replies; 71+ messages in thread
From: Tom Herbert @ 2016-09-18 18:06 UTC (permalink / raw)
  To: intel-wired-lan

On Sep 18, 2016 10:25 AM, "Jesper Dangaard Brouer" <brouer@redhat.com>
wrote:
>
> On Mon, 12 Sep 2016 16:46:08 -0700
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> > This XDP_TX thing was one of the XDP marketing stuff, but there is
> > absolutely no documentation on it, warning users about possible
> > limitations/outcomes.
>
> I will take care of documentation for the XDP project.

Thanks Jesper, that will help a lot!

Tom

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, other threads:[~2016-09-18 18:06 UTC | newest]

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-12 22:13 [net-next PATCH v3 0/3] e1000 XDP implementation John Fastabend
2016-09-12 22:13 ` [Intel-wired-lan] " John Fastabend
2016-09-12 22:13 ` [net-next PATCH v3 1/3] e1000: track BQL bytes regardless of skb or not John Fastabend
2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
2016-09-13  3:00   ` Alexander Duyck
2016-09-13  3:00     ` Alexander Duyck
2016-09-13  4:25     ` Tom Herbert
2016-09-13  4:25       ` Tom Herbert
2016-09-13 13:17       ` Alexander Duyck
2016-09-13 13:17         ` Alexander Duyck
2016-09-14 23:57   ` Brown, Aaron F
2016-09-14 23:57     ` Brown, Aaron F
2016-09-15  0:43     ` Alexei Starovoitov
2016-09-15  0:43       ` Alexei Starovoitov
2016-09-15  4:22       ` John Fastabend
2016-09-15  4:22         ` John Fastabend
2016-09-15 23:29       ` Brown, Aaron F
2016-09-15 23:29         ` Brown, Aaron F
2016-09-12 22:13 ` [net-next PATCH v3 2/3] e1000: add initial XDP support John Fastabend
2016-09-12 22:13   ` [Intel-wired-lan] " John Fastabend
2016-09-12 22:46   ` Eric Dumazet
2016-09-12 22:46     ` [Intel-wired-lan] " Eric Dumazet
2016-09-12 23:07     ` Alexei Starovoitov
2016-09-12 23:07       ` [Intel-wired-lan] " Alexei Starovoitov
2016-09-12 23:46       ` Eric Dumazet
2016-09-12 23:46         ` [Intel-wired-lan] " Eric Dumazet
2016-09-13  0:03         ` Tom Herbert
2016-09-13  0:03           ` [Intel-wired-lan] " Tom Herbert
2016-09-13  1:28           ` Alexei Starovoitov
2016-09-13  1:28             ` [Intel-wired-lan] " Alexei Starovoitov
2016-09-13 16:21             ` Tom Herbert
2016-09-13 16:21               ` [Intel-wired-lan] " Tom Herbert
2016-09-13 17:13               ` Alexei Starovoitov
2016-09-13 17:13                 ` [Intel-wired-lan] " Alexei Starovoitov
2016-09-13 17:37                 ` Eric Dumazet
2016-09-13 17:37                   ` [Intel-wired-lan] " Eric Dumazet
2016-09-13 17:59                   ` Alexei Starovoitov
2016-09-13 17:59                     ` [Intel-wired-lan] " Alexei Starovoitov
2016-09-13 18:28                     ` Rustad, Mark D
2016-09-13 18:28                       ` Rustad, Mark D
2016-09-13 18:30                       ` Alexei Starovoitov
2016-09-13 18:30                         ` Alexei Starovoitov
2016-09-13 19:14                         ` Rustad, Mark D
2016-09-13 19:14                           ` Rustad, Mark D
2016-09-13 21:52                           ` Alexei Starovoitov
2016-09-13 21:52                             ` Alexei Starovoitov
2016-09-13 22:41                             ` Rustad, Mark D
2016-09-13 22:41                               ` Rustad, Mark D
2016-09-13 23:40                               ` Alexei Starovoitov
2016-09-13 23:40                                 ` Alexei Starovoitov
2016-09-14  0:13                                 ` Rustad, Mark D
2016-09-14  0:13                                   ` Rustad, Mark D
2016-09-14 23:42                               ` Brown, Aaron F
2016-09-14 23:42                                 ` Brown, Aaron F
2016-09-13 23:17                           ` Francois Romieu
2016-09-13 23:17                             ` Francois Romieu
2016-09-13 17:55                 ` Tom Herbert
2016-09-13 17:55                   ` [Intel-wired-lan] " Tom Herbert
2016-09-13  1:21         ` Alexei Starovoitov
2016-09-13  1:21           ` [Intel-wired-lan] " Alexei Starovoitov
2016-09-18 17:25         ` Jesper Dangaard Brouer
2016-09-18 17:25           ` [Intel-wired-lan] " Jesper Dangaard Brouer
2016-09-18 18:06           ` Tom Herbert
2016-09-13  3:42   ` Alexander Duyck
2016-09-13  3:42     ` Alexander Duyck
2016-09-13 16:56     ` Alexei Starovoitov
2016-09-13 16:56       ` Alexei Starovoitov
2016-09-12 22:14 ` [net-next PATCH v3 3/3] e1000: bundle xdp xmit routines John Fastabend
2016-09-12 22:14   ` [Intel-wired-lan] " John Fastabend
2016-09-12 23:45   ` Tom Herbert
2016-09-12 23:45     ` [Intel-wired-lan] " Tom Herbert
