netdev.vger.kernel.org archive mirror
* [PATCH net-next v6 0/4] GVE Raw Addressing
@ 2020-11-09 23:36 David Awogbemila
  2020-11-09 23:36 ` [PATCH net-next v6 1/4] gve: Add support for raw addressing device option David Awogbemila
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: David Awogbemila @ 2020-11-09 23:36 UTC (permalink / raw)
  To: netdev; +Cc: David Awogbemila

Patch 1: Use static inline for getting next option in describe_device
        to make loop more readable.
Patch 2: Remove redundant else in gve_num_rx_qpls,
        Fix skew in for loop in gve_prefill_rx_pages and separate the
        raw_addressing and non-raw_addressing paths more clearly.
Patch 3: Handle pages with bad refcount:
        - in raw addressing mode, just warn and leak the page,
        - in qpl mode, schedule a reset.
Patch 4: Remove redundant else from gve_num_tx_qpls
        Protect dma_mapping_error stat from parallel access.
        Revert unnecessary change of signature for gve_dma_sync_for_device.

Catherine Sullivan (3):
  gve: Add support for raw addressing device option
  gve: Add support for raw addressing to the rx path
  gve: Add support for raw addressing in the tx path

David Awogbemila (1):
  gve: Rx Buffer Recycling

 drivers/net/ethernet/google/gve/gve.h        |  32 +-
 drivers/net/ethernet/google/gve/gve_adminq.c |  82 +++-
 drivers/net/ethernet/google/gve/gve_adminq.h |  15 +-
 drivers/net/ethernet/google/gve/gve_desc.h   |  18 +-
 drivers/net/ethernet/google/gve/gve_main.c   |  12 +-
 drivers/net/ethernet/google/gve/gve_rx.c     | 380 ++++++++++++++-----
 drivers/net/ethernet/google/gve/gve_tx.c     | 207 ++++++++--
 7 files changed, 588 insertions(+), 158 deletions(-)

-- 
2.29.2.222.g5d2a92d10f8-goog



* [PATCH net-next v6 1/4] gve: Add support for raw addressing device option
  2020-11-09 23:36 [PATCH net-next v6 0/4] GVE Raw Addressing David Awogbemila
@ 2020-11-09 23:36 ` David Awogbemila
  2020-11-11 17:20   ` Alexander Duyck
  2020-11-09 23:36 ` [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path David Awogbemila
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-09 23:36 UTC (permalink / raw)
  To: netdev; +Cc: Catherine Sullivan, Yangchun Fu, David Awogbemila

From: Catherine Sullivan <csully@google.com>

Add support for parsing device options from the describe_device
response. As the first device option, add raw addressing.

"Raw Addressing" mode (as opposed to the current "qpl" mode) is an
operational mode which allows the driver to avoid the bounce buffer
copies it currently performs using pre-allocated qpls (queue_page_lists)
when sending and receiving packets.
For egress packets, the provided skb data addresses will be dma_map'ed and
passed to the device, allowing the NIC to perform DMA directly - the
driver will not have to copy the buffer content into pre-allocated
buffers/qpls (as in qpl mode).
For ingress packets, copies are also eliminated as buffers are handed to
the networking stack and then recycled or re-allocated as
necessary, avoiding the use of skb_copy_to_linear_data().

This patch only introduces the option to the driver.
Subsequent patches will add the ingress and egress functionality.

Reviewed-by: Yangchun Fu <yangchun@google.com>
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
---
 drivers/net/ethernet/google/gve/gve.h        |  1 +
 drivers/net/ethernet/google/gve/gve_adminq.c | 64 ++++++++++++++++++++
 drivers/net/ethernet/google/gve/gve_adminq.h | 15 +++--
 drivers/net/ethernet/google/gve/gve_main.c   |  9 +++
 4 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index f5c80229ea96..80cdae06ee39 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -199,6 +199,7 @@ struct gve_priv {
 	u64 num_registered_pages; /* num pages registered with NIC */
 	u32 rx_copybreak; /* copy packets smaller than this */
 	u16 default_num_queues; /* default num queues to set up */
+	bool raw_addressing; /* true if this dev supports raw addressing */
 
 	struct gve_queue_config tx_cfg;
 	struct gve_queue_config rx_cfg;
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
index 24ae6a28a806..3e6de659b274 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.c
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -14,6 +14,18 @@
 #define GVE_ADMINQ_SLEEP_LEN		20
 #define GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK	100
 
+static inline
+struct gve_device_option *gve_get_next_option(struct gve_device_descriptor *descriptor,
+					      struct gve_device_option *option)
+{
+	void *option_end, *descriptor_end;
+
+	option_end = (void *)option + sizeof(*option) + be16_to_cpu(option->option_length);
+	descriptor_end = (void *)descriptor + be16_to_cpu(descriptor->total_length);
+
+	return option_end > descriptor_end ? NULL : (struct gve_device_option *)option_end;
+}
+
 int gve_adminq_alloc(struct device *dev, struct gve_priv *priv)
 {
 	priv->adminq = dma_alloc_coherent(dev, PAGE_SIZE,
@@ -460,11 +472,14 @@ int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 num_queues)
 int gve_adminq_describe_device(struct gve_priv *priv)
 {
 	struct gve_device_descriptor *descriptor;
+	struct gve_device_option *dev_opt;
 	union gve_adminq_command cmd;
 	dma_addr_t descriptor_bus;
+	u16 num_options;
 	int err = 0;
 	u8 *mac;
 	u16 mtu;
+	int i;
 
 	memset(&cmd, 0, sizeof(cmd));
 	descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
@@ -518,6 +533,55 @@ int gve_adminq_describe_device(struct gve_priv *priv)
 		priv->rx_desc_cnt = priv->rx_pages_per_qpl;
 	}
 	priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
+	dev_opt = (void *)(descriptor + 1);
+
+	num_options = be16_to_cpu(descriptor->num_device_options);
+	for (i = 0; i < num_options; i++) {
+		u16 option_length = be16_to_cpu(dev_opt->option_length);
+		u16 option_id = be16_to_cpu(dev_opt->option_id);
+		struct gve_device_option *next_opt;
+
+		next_opt = gve_get_next_option(descriptor, dev_opt);
+		if (!next_opt) {
+			dev_err(&priv->dev->dev,
+				"options exceed device_descriptor's total length.\n");
+			err = -EINVAL;
+			goto free_device_descriptor;
+		}
+
+		switch (option_id) {
+		case GVE_DEV_OPT_ID_RAW_ADDRESSING:
+			/* If the length or feature mask doesn't match,
+			 * continue without enabling the feature.
+			 */
+			if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
+			    dev_opt->feat_mask !=
+			    cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING)) {
+				dev_warn(&priv->pdev->dev,
+					 "Raw addressing option error:\n"
+					 "	Expected: length=%d, feature_mask=%x.\n"
+					 "	Actual: length=%d, feature_mask=%x.\n",
+					 GVE_DEV_OPT_LEN_RAW_ADDRESSING,
+					 cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING),
+					 option_length, dev_opt->feat_mask);
+				priv->raw_addressing = false;
+			} else {
+				dev_info(&priv->pdev->dev,
+					 "Raw addressing device option enabled.\n");
+				priv->raw_addressing = true;
+			}
+			break;
+		default:
+			/* If we don't recognize the option just continue
+			 * without doing anything.
+			 */
+			dev_dbg(&priv->pdev->dev,
+				"Unrecognized device option 0x%hx not enabled.\n",
+				option_id);
+			break;
+		}
+		dev_opt = next_opt;
+	}
 
 free_device_descriptor:
 	dma_free_coherent(&priv->pdev->dev, sizeof(*descriptor), descriptor,
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
index 281de8326bc5..af5f586167bd 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.h
+++ b/drivers/net/ethernet/google/gve/gve_adminq.h
@@ -79,12 +79,17 @@ struct gve_device_descriptor {
 
 static_assert(sizeof(struct gve_device_descriptor) == 40);
 
-struct device_option {
-	__be32 option_id;
-	__be32 option_length;
+struct gve_device_option {
+	__be16 option_id;
+	__be16 option_length;
+	__be32 feat_mask;
 };
 
-static_assert(sizeof(struct device_option) == 8);
+static_assert(sizeof(struct gve_device_option) == 8);
+
+#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
+#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
+#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
 
 struct gve_adminq_configure_device_resources {
 	__be64 counter_array;
@@ -111,6 +116,8 @@ struct gve_adminq_unregister_page_list {
 
 static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
 
+#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
+
 struct gve_adminq_create_tx_queue {
 	__be32 queue_id;
 	__be32 reserved;
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 48a433154ce0..70685c10db0e 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -678,6 +678,10 @@ static int gve_alloc_qpls(struct gve_priv *priv)
 	int i, j;
 	int err;
 
+	/* Raw addressing means no QPLs */
+	if (priv->raw_addressing)
+		return 0;
+
 	priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
 	if (!priv->qpls)
 		return -ENOMEM;
@@ -718,6 +722,10 @@ static void gve_free_qpls(struct gve_priv *priv)
 	int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
 	int i;
 
+	/* Raw addressing means no QPLs */
+	if (priv->raw_addressing)
+		return;
+
 	kvfree(priv->qpl_cfg.qpl_id_map);
 
 	for (i = 0; i < num_qpls; i++)
@@ -1078,6 +1086,7 @@ static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
 	if (skip_describe_device)
 		goto setup_device;
 
+	priv->raw_addressing = false;
 	/* Get the initial information we need from the device */
 	err = gve_adminq_describe_device(priv);
 	if (err) {
-- 
2.29.2.222.g5d2a92d10f8-goog



* [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path
  2020-11-09 23:36 [PATCH net-next v6 0/4] GVE Raw Addressing David Awogbemila
  2020-11-09 23:36 ` [PATCH net-next v6 1/4] gve: Add support for raw addressing device option David Awogbemila
@ 2020-11-09 23:36 ` David Awogbemila
  2020-11-11 17:20   ` Alexander Duyck
  2020-11-09 23:36 ` [PATCH net-next v6 3/4] gve: Rx Buffer Recycling David Awogbemila
  2020-11-09 23:36 ` [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path David Awogbemila
  3 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-09 23:36 UTC (permalink / raw)
  To: netdev; +Cc: Catherine Sullivan, Yangchun Fu, David Awogbemila

From: Catherine Sullivan <csully@google.com>

Add support for using raw DMA addresses in the rx path. With this
support, the driver can allocate a new buffer instead of making a copy.

RX buffers are handed to the networking stack and are
re-allocated as needed, avoiding the need to use
skb_copy_to_linear_data() as in "qpl" mode.

Reviewed-by: Yangchun Fu <yangchun@google.com>
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
---
 drivers/net/ethernet/google/gve/gve.h        |   6 +-
 drivers/net/ethernet/google/gve/gve_adminq.c |  14 +-
 drivers/net/ethernet/google/gve/gve_desc.h   |  10 +-
 drivers/net/ethernet/google/gve/gve_main.c   |   3 +-
 drivers/net/ethernet/google/gve/gve_rx.c     | 224 +++++++++++++++----
 5 files changed, 200 insertions(+), 57 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 80cdae06ee39..a8c589dd14e4 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -68,6 +68,7 @@ struct gve_rx_data_queue {
 	dma_addr_t data_bus; /* dma mapping of the slots */
 	struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
 	struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
+	bool raw_addressing; /* use raw_addressing? */
 };
 
 struct gve_priv;
@@ -82,6 +83,7 @@ struct gve_rx_ring {
 	u32 cnt; /* free-running total number of completed packets */
 	u32 fill_cnt; /* free-running total number of descs and buffs posted */
 	u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
+	u32 db_threshold; /* threshold for posting new buffs and descs */
 	u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
 	u64 rx_copied_pkt; /* free-running total number of copied packets */
 	u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
@@ -194,7 +196,7 @@ struct gve_priv {
 	u16 tx_desc_cnt; /* num desc per ring */
 	u16 rx_desc_cnt; /* num desc per ring */
 	u16 tx_pages_per_qpl; /* tx buffer length */
-	u16 rx_pages_per_qpl; /* rx buffer length */
+	u16 rx_data_slot_cnt; /* rx buffer length */
 	u64 max_registered_pages;
 	u64 num_registered_pages; /* num pages registered with NIC */
 	u32 rx_copybreak; /* copy packets smaller than this */
@@ -444,7 +446,7 @@ static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
  */
 static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
 {
-	return priv->rx_cfg.num_queues;
+	return priv->raw_addressing ? 0 : priv->rx_cfg.num_queues;
 }
 
 /* Returns a pointer to the next available tx qpl in the list of qpls
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
index 3e6de659b274..711d135c6b1a 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.c
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -369,8 +369,10 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
 {
 	struct gve_rx_ring *rx = &priv->rx[queue_index];
 	union gve_adminq_command cmd;
+	u32 qpl_id;
 	int err;
 
+	qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : rx->data.qpl->id;
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
 	cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
@@ -381,7 +383,7 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
 		.queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
 		.rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
 		.rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
-		.queue_page_list_id = cpu_to_be32(rx->data.qpl->id),
+		.queue_page_list_id = cpu_to_be32(qpl_id),
 	};
 
 	err = gve_adminq_issue_cmd(priv, &cmd);
@@ -526,11 +528,11 @@ int gve_adminq_describe_device(struct gve_priv *priv)
 	mac = descriptor->mac;
 	dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
 	priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
-	priv->rx_pages_per_qpl = be16_to_cpu(descriptor->rx_pages_per_qpl);
-	if (priv->rx_pages_per_qpl < priv->rx_desc_cnt) {
-		dev_err(&priv->pdev->dev, "rx_pages_per_qpl cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
-			priv->rx_pages_per_qpl);
-		priv->rx_desc_cnt = priv->rx_pages_per_qpl;
+	priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
+	if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
+		dev_err(&priv->pdev->dev, "rx_data_slot_cnt cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
+			priv->rx_data_slot_cnt);
+		priv->rx_desc_cnt = priv->rx_data_slot_cnt;
 	}
 	priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
 	dev_opt = (void *)(descriptor + 1);
diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
index 54779871d52e..0aad314aefaf 100644
--- a/drivers/net/ethernet/google/gve/gve_desc.h
+++ b/drivers/net/ethernet/google/gve/gve_desc.h
@@ -72,12 +72,14 @@ struct gve_rx_desc {
 } __packed;
 static_assert(sizeof(struct gve_rx_desc) == 64);
 
-/* As with the Tx ring format, the qpl_offset entries below are offsets into an
- * ordered list of registered pages.
+/* If the device supports raw dma addressing then the addr in data slot is
+ * the dma address of the buffer.
+ * If the device only supports registered segments then the addr is a byte
+ * offset into the registered segment (an ordered list of pages) where the
+ * buffer is.
  */
 struct gve_rx_data_slot {
-	/* byte offset into the rx registered segment of this slot */
-	__be64 qpl_offset;
+	__be64 addr;
 };
 
 /* GVE Recive Packet Descriptor Seq No */
diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
index 70685c10db0e..225e17dd1ae5 100644
--- a/drivers/net/ethernet/google/gve/gve_main.c
+++ b/drivers/net/ethernet/google/gve/gve_main.c
@@ -596,6 +596,7 @@ int gve_alloc_page(struct gve_priv *priv, struct device *dev,
 	if (dma_mapping_error(dev, *dma)) {
 		priv->dma_mapping_error++;
 		put_page(*page);
+		*page = NULL;
 		return -ENOMEM;
 	}
 	return 0;
@@ -694,7 +695,7 @@ static int gve_alloc_qpls(struct gve_priv *priv)
 	}
 	for (; i < num_qpls; i++) {
 		err = gve_alloc_queue_page_list(priv, i,
-						priv->rx_pages_per_qpl);
+						priv->rx_data_slot_cnt);
 		if (err)
 			goto free_qpls;
 	}
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 008fa897a3e6..49646caf930c 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -16,12 +16,39 @@ static void gve_rx_remove_from_block(struct gve_priv *priv, int queue_idx)
 	block->rx = NULL;
 }
 
+static void gve_rx_free_buffer(struct device *dev,
+			       struct gve_rx_slot_page_info *page_info,
+			       struct gve_rx_data_slot *data_slot)
+{
+	dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
+				      page_info->page_offset);
+
+	gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
+}
+
+static void gve_rx_unfill_pages(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+	u32 slots = rx->mask + 1;
+	int i;
+
+	if (rx->data.raw_addressing) {
+		for (i = 0; i < slots; i++)
+			gve_rx_free_buffer(&priv->pdev->dev, &rx->data.page_info[i],
+					   &rx->data.data_ring[i]);
+	} else {
+		gve_unassign_qpl(priv, rx->data.qpl->id);
+		rx->data.qpl = NULL;
+	}
+	kvfree(rx->data.page_info);
+	rx->data.page_info = NULL;
+}
+
 static void gve_rx_free_ring(struct gve_priv *priv, int idx)
 {
 	struct gve_rx_ring *rx = &priv->rx[idx];
 	struct device *dev = &priv->pdev->dev;
+	u32 slots = rx->mask + 1;
 	size_t bytes;
-	u32 slots;
 
 	gve_rx_remove_from_block(priv, idx);
 
@@ -33,11 +60,8 @@ static void gve_rx_free_ring(struct gve_priv *priv, int idx)
 			  rx->q_resources, rx->q_resources_bus);
 	rx->q_resources = NULL;
 
-	gve_unassign_qpl(priv, rx->data.qpl->id);
-	rx->data.qpl = NULL;
-	kvfree(rx->data.page_info);
+	gve_rx_unfill_pages(priv, rx);
 
-	slots = rx->mask + 1;
 	bytes = sizeof(*rx->data.data_ring) * slots;
 	dma_free_coherent(dev, bytes, rx->data.data_ring,
 			  rx->data.data_bus);
@@ -52,14 +76,36 @@ static void gve_setup_rx_buffer(struct gve_rx_slot_page_info *page_info,
 	page_info->page = page;
 	page_info->page_offset = 0;
 	page_info->page_address = page_address(page);
-	slot->qpl_offset = cpu_to_be64(addr);
+	slot->addr = cpu_to_be64(addr);
+}
+
+static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
+			       struct gve_rx_slot_page_info *page_info,
+			       struct gve_rx_data_slot *data_slot,
+			       struct gve_rx_ring *rx)
+{
+	struct page *page;
+	dma_addr_t dma;
+	int err;
+
+	err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE);
+	if (err) {
+		u64_stats_update_begin(&rx->statss);
+		rx->rx_buf_alloc_fail++;
+		u64_stats_update_end(&rx->statss);
+		return err;
+	}
+
+	gve_setup_rx_buffer(page_info, data_slot, dma, page);
+	return 0;
 }
 
 static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
 {
 	struct gve_priv *priv = rx->gve;
 	u32 slots;
-	int i;
+	int err;
+	int i, j;
 
 	/* Allocate one page per Rx queue slot. Each page is split into two
 	 * packet buffers, when possible we "page flip" between the two.
@@ -71,17 +117,32 @@ static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
 	if (!rx->data.page_info)
 		return -ENOMEM;
 
-	rx->data.qpl = gve_assign_rx_qpl(priv);
-
+	if (!rx->data.raw_addressing)
+		rx->data.qpl = gve_assign_rx_qpl(priv);
 	for (i = 0; i < slots; i++) {
-		struct page *page = rx->data.qpl->pages[i];
-		dma_addr_t addr = i * PAGE_SIZE;
-
-		gve_setup_rx_buffer(&rx->data.page_info[i],
-				    &rx->data.data_ring[i], addr, page);
+		struct page *page;
+		dma_addr_t addr;
+
+		if (!rx->data.raw_addressing) {
+			page = rx->data.qpl->pages[i];
+			addr = i * PAGE_SIZE;
+			gve_setup_rx_buffer(&rx->data.page_info[i],
+					    &rx->data.data_ring[i], addr, page);
+			continue;
+		}
+		err = gve_rx_alloc_buffer(priv, &priv->pdev->dev, &rx->data.page_info[i],
+					  &rx->data.data_ring[i], rx);
+		if (err)
+			goto alloc_err;
 	}
 
 	return slots;
+alloc_err:
+	for (j = 0; j < i; j++)
+		gve_rx_free_buffer(&priv->pdev->dev,
+				   &rx->data.page_info[j],
+				   &rx->data.data_ring[j]);
+	return err;
 }
 
 static void gve_rx_add_to_block(struct gve_priv *priv, int queue_idx)
@@ -110,8 +171,9 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
 	rx->gve = priv;
 	rx->q_num = idx;
 
-	slots = priv->rx_pages_per_qpl;
+	slots = priv->rx_data_slot_cnt;
 	rx->mask = slots - 1;
+	rx->data.raw_addressing = priv->raw_addressing;
 
 	/* alloc rx data ring */
 	bytes = sizeof(*rx->data.data_ring) * slots;
@@ -156,8 +218,8 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
 		err = -ENOMEM;
 		goto abort_with_q_resources;
 	}
-	rx->mask = slots - 1;
 	rx->cnt = 0;
+	rx->db_threshold = priv->rx_desc_cnt / 2;
 	rx->desc.seqno = 1;
 	gve_rx_add_to_block(priv, idx);
 
@@ -168,7 +230,7 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
 			  rx->q_resources, rx->q_resources_bus);
 	rx->q_resources = NULL;
 abort_filled:
-	kvfree(rx->data.page_info);
+	gve_rx_unfill_pages(priv, rx);
 abort_with_slots:
 	bytes = sizeof(*rx->data.data_ring) * slots;
 	dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
@@ -251,8 +313,7 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
 	return skb;
 }
 
-static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
-					struct napi_struct *napi,
+static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
 					struct gve_rx_slot_page_info *page_info,
 					u16 len)
 {
@@ -271,11 +332,25 @@ static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
 static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
 			     struct gve_rx_data_slot *data_ring)
 {
-	u64 addr = be64_to_cpu(data_ring->qpl_offset);
+	u64 addr = be64_to_cpu(data_ring->addr);
 
 	page_info->page_offset ^= PAGE_SIZE / 2;
 	addr ^= PAGE_SIZE / 2;
-	data_ring->qpl_offset = cpu_to_be64(addr);
+	data_ring->addr = cpu_to_be64(addr);
+}
+
+static struct sk_buff *
+gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
+		      struct gve_rx_slot_page_info *page_info, u16 len,
+		      struct napi_struct *napi,
+		      struct gve_rx_data_slot *data_slot)
+{
+	struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
+
+	if (!skb)
+		return NULL;
+
+	return skb;
 }
 
 static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
@@ -285,7 +360,9 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 	struct gve_priv *priv = rx->gve;
 	struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
 	struct net_device *dev = priv->dev;
-	struct sk_buff *skb;
+	struct gve_rx_data_slot *data_slot;
+	struct sk_buff *skb = NULL;
+	dma_addr_t page_bus;
 	int pagecount;
 	u16 len;
 
@@ -294,18 +371,18 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 		u64_stats_update_begin(&rx->statss);
 		rx->rx_desc_err_dropped_pkt++;
 		u64_stats_update_end(&rx->statss);
-		return true;
+		return false;
 	}
 
 	len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
 	page_info = &rx->data.page_info[idx];
-	dma_sync_single_for_cpu(&priv->pdev->dev, rx->data.qpl->page_buses[idx],
-				PAGE_SIZE, DMA_FROM_DEVICE);
 
-	/* gvnic can only receive into registered segments. If the buffer
-	 * can't be recycled, our only choice is to copy the data out of
-	 * it so that we can return it to the device.
-	 */
+	data_slot = &rx->data.data_ring[idx];
+	page_bus = (rx->data.raw_addressing) ?
+					be64_to_cpu(data_slot->addr) - page_info->page_offset :
+					rx->data.qpl->page_buses[idx];
+	dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
+				PAGE_SIZE, DMA_FROM_DEVICE);
 
 	if (PAGE_SIZE == 4096) {
 		if (len <= priv->rx_copybreak) {
@@ -316,6 +393,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 			u64_stats_update_end(&rx->statss);
 			goto have_skb;
 		}
+		if (rx->data.raw_addressing) {
+			skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
+						    page_info, len, napi,
+						     data_slot);
+			goto have_skb;
+		}
 		if (unlikely(!gve_can_recycle_pages(dev))) {
 			skb = gve_rx_copy(rx, dev, napi, page_info, len);
 			goto have_skb;
@@ -326,12 +409,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 			 * the page fragment to a new SKB and pass it up the
 			 * stack.
 			 */
-			skb = gve_rx_add_frags(dev, napi, page_info, len);
+			skb = gve_rx_add_frags(napi, page_info, len);
 			if (!skb) {
 				u64_stats_update_begin(&rx->statss);
 				rx->rx_skb_alloc_fail++;
 				u64_stats_update_end(&rx->statss);
-				return true;
+				return false;
 			}
 			/* Make sure the kernel stack can't release the page */
 			get_page(page_info->page);
@@ -347,7 +430,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 			return false;
 		}
 	} else {
-		skb = gve_rx_copy(rx, dev, napi, page_info, len);
+		if (rx->data.raw_addressing)
+			skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
+						    page_info, len, napi,
+						    data_slot);
+		else
+			skb = gve_rx_copy(rx, dev, napi, page_info, len);
 	}
 
 have_skb:
@@ -358,7 +446,7 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 		u64_stats_update_begin(&rx->statss);
 		rx->rx_skb_alloc_fail++;
 		u64_stats_update_end(&rx->statss);
-		return true;
+		return false;
 	}
 
 	if (likely(feat & NETIF_F_RXCSUM)) {
@@ -399,19 +487,45 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
 	return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
 }
 
+static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
+{
+	bool empty = rx->fill_cnt == rx->cnt;
+	u32 fill_cnt = rx->fill_cnt;
+
+	while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
+		struct gve_rx_slot_page_info *page_info;
+		struct device *dev = &priv->pdev->dev;
+		struct gve_rx_data_slot *data_slot;
+		u32 idx = fill_cnt & rx->mask;
+
+		page_info = &rx->data.page_info[idx];
+		data_slot = &rx->data.data_ring[idx];
+		gve_rx_free_buffer(dev, page_info, data_slot);
+		page_info->page = NULL;
+		if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
+			break;
+		empty = false;
+		fill_cnt++;
+	}
+	rx->fill_cnt = fill_cnt;
+	return true;
+}
+
 bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
 		       netdev_features_t feat)
 {
 	struct gve_priv *priv = rx->gve;
+	u32 work_done = 0, packets = 0;
 	struct gve_rx_desc *desc;
 	u32 cnt = rx->cnt;
 	u32 idx = cnt & rx->mask;
-	u32 work_done = 0;
 	u64 bytes = 0;
 
 	desc = rx->desc.desc_ring + idx;
 	while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
 	       work_done < budget) {
+		bool dropped;
+
 		netif_info(priv, rx_status, priv->dev,
 			   "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
 			   rx->q_num, idx, desc, desc->flags_seq);
@@ -419,9 +533,11 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
 			   "[%d] seqno=%d rx->desc.seqno=%d\n",
 			   rx->q_num, GVE_SEQNO(desc->flags_seq),
 			   rx->desc.seqno);
-		bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
-		if (!gve_rx(rx, desc, feat, idx))
-			gve_schedule_reset(priv);
+		dropped = !gve_rx(rx, desc, feat, idx);
+		if (!dropped) {
+			bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
+			packets++;
+		}
 		cnt++;
 		idx = cnt & rx->mask;
 		desc = rx->desc.desc_ring + idx;
@@ -429,15 +545,35 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
 		work_done++;
 	}
 
-	if (!work_done)
+	if (!work_done && rx->fill_cnt - cnt > rx->db_threshold) {
 		return false;
+	} else if (work_done) {
+		u64_stats_update_begin(&rx->statss);
+		rx->rpackets += packets;
+		rx->rbytes += bytes;
+		u64_stats_update_end(&rx->statss);
+		rx->cnt = cnt;
+	}
 
-	u64_stats_update_begin(&rx->statss);
-	rx->rpackets += work_done;
-	rx->rbytes += bytes;
-	u64_stats_update_end(&rx->statss);
-	rx->cnt = cnt;
-	rx->fill_cnt += work_done;
+	/* restock ring slots */
+	if (!rx->data.raw_addressing) {
+		/* In QPL mode buffs are refilled as the desc are processed */
+		rx->fill_cnt += work_done;
+	} else if (rx->fill_cnt - cnt <= rx->db_threshold) {
+		/* In raw addressing mode buffs are only refilled if the avail
+		 * falls below a threshold.
+		 */
+		if (!gve_rx_refill_buffers(priv, rx))
+			return false;
+
+		/* If we were not able to completely refill buffers, we'll want
+		 * to schedule this queue for work again to refill buffers.
+		 */
+		if (rx->fill_cnt - cnt <= rx->db_threshold) {
+			gve_rx_write_doorbell(priv, rx);
+			return true;
+		}
+	}
 
 	gve_rx_write_doorbell(priv, rx);
 	return gve_rx_work_pending(rx);
-- 
2.29.2.222.g5d2a92d10f8-goog



* [PATCH net-next v6 3/4] gve: Rx Buffer Recycling
  2020-11-09 23:36 [PATCH net-next v6 0/4] GVE Raw Addressing David Awogbemila
  2020-11-09 23:36 ` [PATCH net-next v6 1/4] gve: Add support for raw addressing device option David Awogbemila
  2020-11-09 23:36 ` [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path David Awogbemila
@ 2020-11-09 23:36 ` David Awogbemila
  2020-11-11 17:20   ` Alexander Duyck
  2020-11-09 23:36 ` [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path David Awogbemila
  3 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-09 23:36 UTC (permalink / raw)
  To: netdev; +Cc: David Awogbemila

This patch lets the driver reuse buffers that have been freed by the
networking stack.

In the raw addressing case, this allows the driver to avoid allocating new
buffers.
In the qpl case, the driver can avoid copies.

Signed-off-by: David Awogbemila <awogbemila@google.com>
---
 drivers/net/ethernet/google/gve/gve.h    |  10 +-
 drivers/net/ethernet/google/gve/gve_rx.c | 196 +++++++++++++++--------
 2 files changed, 131 insertions(+), 75 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index a8c589dd14e4..9dcf9fd8d128 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -50,6 +50,7 @@ struct gve_rx_slot_page_info {
 	struct page *page;
 	void *page_address;
 	u32 page_offset; /* offset to write to in page */
+	bool can_flip;
 };
 
 /* A list of pages registered with the device during setup and used by a queue
@@ -500,15 +501,6 @@ static inline enum dma_data_direction gve_qpl_dma_dir(struct gve_priv *priv,
 		return DMA_FROM_DEVICE;
 }
 
-/* Returns true if the max mtu allows page recycling */
-static inline bool gve_can_recycle_pages(struct net_device *dev)
-{
-	/* We can't recycle the pages if we can't fit a packet into half a
-	 * page.
-	 */
-	return dev->max_mtu <= PAGE_SIZE / 2;
-}
-
 /* buffers */
 int gve_alloc_page(struct gve_priv *priv, struct device *dev,
 		   struct page **page, dma_addr_t *dma,
diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
index 49646caf930c..ff28581f4427 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -287,8 +287,7 @@ static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
 	return PKT_HASH_TYPE_L2;
 }
 
-static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
-				   struct net_device *dev,
+static struct sk_buff *gve_rx_copy(struct net_device *dev,
 				   struct napi_struct *napi,
 				   struct gve_rx_slot_page_info *page_info,
 				   u16 len)
@@ -306,10 +305,6 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
 
 	skb->protocol = eth_type_trans(skb, dev);
 
-	u64_stats_update_begin(&rx->statss);
-	rx->rx_copied_pkt++;
-	u64_stats_update_end(&rx->statss);
-
 	return skb;
 }
 
@@ -334,22 +329,90 @@ static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
 {
 	u64 addr = be64_to_cpu(data_ring->addr);
 
+	/* "flip" to other packet buffer on this page */
 	page_info->page_offset ^= PAGE_SIZE / 2;
 	addr ^= PAGE_SIZE / 2;
 	data_ring->addr = cpu_to_be64(addr);
 }
 
+static bool gve_rx_can_flip_buffers(struct net_device *netdev)
+{
+#if PAGE_SIZE == 4096
+	/* We can't flip a buffer if we can't fit a packet
+	 * into half a page.
+	 */
+	return netdev->mtu + GVE_RX_PAD + ETH_HLEN <= PAGE_SIZE / 2;
+#else
+	/* PAGE_SIZE != 4096 - don't try to reuse */
+	return false;
+#endif
+}
+
+static int gve_rx_can_recycle_buffer(struct page *page)
+{
+	int pagecount = page_count(page);
+
+	/* This page is not being used by any SKBs - reuse */
+	if (pagecount == 1)
+		return 1;
+	/* This page is still being used by an SKB - we can't reuse */
+	else if (pagecount >= 2)
+		return 0;
+	WARN(pagecount < 1, "Pagecount should never be < 1");
+	return -1;
+}
+
 static struct sk_buff *
 gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
 		      struct gve_rx_slot_page_info *page_info, u16 len,
 		      struct napi_struct *napi,
-		      struct gve_rx_data_slot *data_slot)
+		      struct gve_rx_data_slot *data_slot, bool can_flip)
 {
-	struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
+	struct sk_buff *skb;
 
+	skb = gve_rx_add_frags(napi, page_info, len);
 	if (!skb)
 		return NULL;
 
+	/* Optimistically stop the kernel from freeing the page by increasing
+	 * the page bias. We will check the refcount in refill to determine if
+	 * we need to alloc a new page.
+	 */
+	get_page(page_info->page);
+	page_info->can_flip = can_flip;
+
+	return skb;
+}
+
+static struct sk_buff *
+gve_rx_qpl(struct device *dev, struct net_device *netdev,
+	   struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
+	   u16 len, struct napi_struct *napi,
+	   struct gve_rx_data_slot *data_slot, bool recycle)
+{
+	struct sk_buff *skb;
+
+	/* if raw_addressing mode is not enabled gvnic can only receive into
+	 * registered segments. If the buffer can't be recycled, our only
+	 * choice is to copy the data out of it so that we can return it to the
+	 * device.
+	 */
+	if (recycle) {
+		skb = gve_rx_add_frags(napi, page_info, len);
+		/* No point in recycling if we didn't get the skb */
+		if (skb) {
+			/* Make sure the networking stack can't free the page */
+			get_page(page_info->page);
+			gve_rx_flip_buff(page_info, data_slot);
+		}
+	} else {
+		skb = gve_rx_copy(netdev, napi, page_info, len);
+		if (skb) {
+			u64_stats_update_begin(&rx->statss);
+			rx->rx_copied_pkt++;
+			u64_stats_update_end(&rx->statss);
+		}
+	}
 	return skb;
 }
 
@@ -363,7 +426,6 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 	struct gve_rx_data_slot *data_slot;
 	struct sk_buff *skb = NULL;
 	dma_addr_t page_bus;
-	int pagecount;
 	u16 len;
 
 	/* drop this packet */
@@ -384,64 +446,37 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
 	dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
 				PAGE_SIZE, DMA_FROM_DEVICE);
 
-	if (PAGE_SIZE == 4096) {
-		if (len <= priv->rx_copybreak) {
-			/* Just copy small packets */
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
-			u64_stats_update_begin(&rx->statss);
-			rx->rx_copybreak_pkt++;
-			u64_stats_update_end(&rx->statss);
-			goto have_skb;
+	if (len <= priv->rx_copybreak) {
+		/* Just copy small packets */
+		skb = gve_rx_copy(dev, napi, page_info, len);
+		u64_stats_update_begin(&rx->statss);
+		rx->rx_copied_pkt++;
+		rx->rx_copybreak_pkt++;
+		u64_stats_update_end(&rx->statss);
+	} else {
+		bool can_flip = gve_rx_can_flip_buffers(dev);
+		int recycle = 0;
+
+		if (can_flip) {
+			recycle = gve_rx_can_recycle_buffer(page_info->page);
+			if (recycle < 0) {
+				if (!rx->data.raw_addressing)
+					gve_schedule_reset(priv);
+				return false;
+			}
 		}
 		if (rx->data.raw_addressing) {
 			skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
 						    page_info, len, napi,
-						     data_slot);
-			goto have_skb;
-		}
-		if (unlikely(!gve_can_recycle_pages(dev))) {
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
-			goto have_skb;
-		}
-		pagecount = page_count(page_info->page);
-		if (pagecount == 1) {
-			/* No part of this page is used by any SKBs; we attach
-			 * the page fragment to a new SKB and pass it up the
-			 * stack.
-			 */
-			skb = gve_rx_add_frags(napi, page_info, len);
-			if (!skb) {
-				u64_stats_update_begin(&rx->statss);
-				rx->rx_skb_alloc_fail++;
-				u64_stats_update_end(&rx->statss);
-				return false;
-			}
-			/* Make sure the kernel stack can't release the page */
-			get_page(page_info->page);
-			/* "flip" to other packet buffer on this page */
-			gve_rx_flip_buff(page_info, &rx->data.data_ring[idx]);
-		} else if (pagecount >= 2) {
-			/* We have previously passed the other half of this
-			 * page up the stack, but it has not yet been freed.
-			 */
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
+						    data_slot,
+						    can_flip && recycle);
 		} else {
-			WARN(pagecount < 1, "Pagecount should never be < 1");
-			return false;
+			skb = gve_rx_qpl(&priv->pdev->dev, dev, rx,
+					 page_info, len, napi, data_slot,
+					 can_flip && recycle);
 		}
-	} else {
-		if (rx->data.raw_addressing)
-			skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
-						    page_info, len, napi,
-						    data_slot);
-		else
-			skb = gve_rx_copy(rx, dev, napi, page_info, len);
 	}
 
-have_skb:
-	/* We didn't manage to allocate an skb but we haven't had any
-	 * reset worthy failures.
-	 */
 	if (!skb) {
 		u64_stats_update_begin(&rx->statss);
 		rx->rx_skb_alloc_fail++;
@@ -494,16 +529,45 @@ static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
 
 	while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
 		struct gve_rx_slot_page_info *page_info;
-		struct device *dev = &priv->pdev->dev;
-		struct gve_rx_data_slot *data_slot;
 		u32 idx = fill_cnt & rx->mask;
 
 		page_info = &rx->data.page_info[idx];
-		data_slot = &rx->data.data_ring[idx];
-		gve_rx_free_buffer(dev, page_info, data_slot);
-		page_info->page = NULL;
-		if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
-			break;
+		if (page_info->can_flip) {
+			/* The other half of the page is free because it was
+			 * free when we processed the descriptor. Flip to it.
+			 */
+			struct gve_rx_data_slot *data_slot =
+						&rx->data.data_ring[idx];
+
+			gve_rx_flip_buff(page_info, data_slot);
+			page_info->can_flip = false;
+		} else {
+			/* It is possible that the networking stack has already
+			 * finished processing all outstanding packets in the buffer
+			 * and it can be reused.
+			 * Flipping is unnecessary here - if the networking stack still
+			 * owns half the page it is impossible to tell which half. Either
+			 * the whole page is free or it needs to be replaced.
+			 */
+			int recycle = gve_rx_can_recycle_buffer(page_info->page);
+
+			if (recycle < 0) {
+				if (!rx->data.raw_addressing)
+					gve_schedule_reset(priv);
+				return false;
+			}
+			if (!recycle) {
+				/* We can't reuse the buffer - alloc a new one*/
+				struct gve_rx_data_slot *data_slot =
+						&rx->data.data_ring[idx];
+				struct device *dev = &priv->pdev->dev;
+
+				gve_rx_free_buffer(dev, page_info, data_slot);
+				page_info->page = NULL;
+				if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
+					break;
+			}
+		}
 		empty = false;
 		fill_cnt++;
 	}
-- 
2.29.2.222.g5d2a92d10f8-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path
  2020-11-09 23:36 [PATCH net-next v6 0/4] GVE Raw Addressing David Awogbemila
                   ` (2 preceding siblings ...)
  2020-11-09 23:36 ` [PATCH net-next v6 3/4] gve: Rx Buffer Recycling David Awogbemila
@ 2020-11-09 23:36 ` David Awogbemila
  2020-11-11 17:20   ` Alexander Duyck
  3 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-09 23:36 UTC (permalink / raw)
  To: netdev; +Cc: Catherine Sullivan, Yangchun Fu, David Awogbemila

From: Catherine Sullivan <csully@google.com>

During TX, skbs' data addresses are dma_map'ed and passed to the NIC.
This means that the device can perform DMA directly from these addresses
and the driver does not have to copy the buffer content into
pre-allocated buffers/qpls (as in qpl mode).
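
The worst-case descriptor count in this patch follows from the mapping:
one pkt descriptor for the first segment, one seg descriptor per skb frag,
and one extra seg descriptor when the linear portion is split after the
TCP header (the GSO case). A small sketch of that arithmetic (the
MAX_SKB_FRAGS value and function name here are illustrative, not taken
from the driver):

```c
#include <assert.h>

#define MAX_SKB_FRAGS		17	/* common kernel value; illustrative */
#define MAX_TX_DESC_NEEDED	(MAX_SKB_FRAGS + 3)

/* Descriptors needed for one skb in raw addressing mode:
 * 1 pkt descriptor, one per frag, and +1 when the linear part is
 * split into a header descriptor plus a payload descriptor. */
static int tx_descs_for_skb(int nr_frags, int split_linear)
{
	return 1 + nr_frags + (split_linear ? 1 : 0);
}
```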

Reviewed-by: Yangchun Fu <yangchun@google.com>
Signed-off-by: Catherine Sullivan <csully@google.com>
Signed-off-by: David Awogbemila <awogbemila@google.com>
---
 drivers/net/ethernet/google/gve/gve.h        |  15 +-
 drivers/net/ethernet/google/gve/gve_adminq.c |   4 +-
 drivers/net/ethernet/google/gve/gve_desc.h   |   8 +-
 drivers/net/ethernet/google/gve/gve_tx.c     | 207 +++++++++++++++----
 4 files changed, 192 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
index 9dcf9fd8d128..a7c77950bb84 100644
--- a/drivers/net/ethernet/google/gve/gve.h
+++ b/drivers/net/ethernet/google/gve/gve.h
@@ -110,12 +110,20 @@ struct gve_tx_iovec {
 	u32 iov_padding; /* padding associated with this segment */
 };
 
+struct gve_tx_dma_buf {
+	DEFINE_DMA_UNMAP_ADDR(dma);
+	DEFINE_DMA_UNMAP_LEN(len);
+};
+
 /* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
  * ring entry but only used for a pkt_desc not a seg_desc
  */
 struct gve_tx_buffer_state {
 	struct sk_buff *skb; /* skb for this pkt */
-	struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+	union {
+		struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
+		struct gve_tx_dma_buf buf;
+	};
 };
 
 /* A TX buffer - each queue has one */
@@ -138,13 +146,16 @@ struct gve_tx_ring {
 	__be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
 	u64 pkt_done; /* free-running - total packets completed */
 	u64 bytes_done; /* free-running - total bytes completed */
+	u32 dropped_pkt; /* free-running - total packets dropped */
 
 	/* Cacheline 2 -- Read-mostly fields */
 	union gve_tx_desc *desc ____cacheline_aligned;
 	struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
 	struct netdev_queue *netdev_txq;
 	struct gve_queue_resources *q_resources; /* head and tail pointer idx */
+	struct device *dev;
 	u32 mask; /* masks req and done down to queue size */
+	bool raw_addressing; /* use raw_addressing? */
 
 	/* Slow-path fields */
 	u32 q_num ____cacheline_aligned; /* queue idx */
@@ -440,7 +451,7 @@ static inline u32 gve_rx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
  */
 static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
 {
-	return priv->tx_cfg.num_queues;
+	return priv->raw_addressing ? 0 : priv->tx_cfg.num_queues;
 }
 
 /* Returns the number of rx queue page lists
diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
index 711d135c6b1a..63358020658e 100644
--- a/drivers/net/ethernet/google/gve/gve_adminq.c
+++ b/drivers/net/ethernet/google/gve/gve_adminq.c
@@ -330,8 +330,10 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
 {
 	struct gve_tx_ring *tx = &priv->tx[queue_index];
 	union gve_adminq_command cmd;
+	u32 qpl_id;
 	int err;
 
+	qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : tx->tx_fifo.qpl->id;
 	memset(&cmd, 0, sizeof(cmd));
 	cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
 	cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
@@ -340,7 +342,7 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
 		.queue_resources_addr =
 			cpu_to_be64(tx->q_resources_bus),
 		.tx_ring_addr = cpu_to_be64(tx->bus),
-		.queue_page_list_id = cpu_to_be32(tx->tx_fifo.qpl->id),
+		.queue_page_list_id = cpu_to_be32(qpl_id),
 		.ntfy_id = cpu_to_be32(tx->ntfy_id),
 	};
 
diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
index 0aad314aefaf..a7da364e81c8 100644
--- a/drivers/net/ethernet/google/gve/gve_desc.h
+++ b/drivers/net/ethernet/google/gve/gve_desc.h
@@ -16,9 +16,11 @@
  * Base addresses encoded in seg_addr are not assumed to be physical
  * addresses. The ring format assumes these come from some linear address
  * space. This could be physical memory, kernel virtual memory, user virtual
- * memory. gVNIC uses lists of registered pages. Each queue is assumed
- * to be associated with a single such linear address space to ensure a
- * consistent meaning for seg_addrs posted to its rings.
+ * memory.
+ * If raw dma addressing is not supported then gVNIC uses lists of registered
+ * pages. Each queue is assumed to be associated with a single such linear
+ * address space to ensure a consistent meaning for seg_addrs posted to its
+ * rings.
  */
 
 struct gve_tx_pkt_desc {
diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
index d0244feb0301..26ff1d4e4f25 100644
--- a/drivers/net/ethernet/google/gve/gve_tx.c
+++ b/drivers/net/ethernet/google/gve/gve_tx.c
@@ -158,9 +158,11 @@ static void gve_tx_free_ring(struct gve_priv *priv, int idx)
 			  tx->q_resources, tx->q_resources_bus);
 	tx->q_resources = NULL;
 
-	gve_tx_fifo_release(priv, &tx->tx_fifo);
-	gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
-	tx->tx_fifo.qpl = NULL;
+	if (!tx->raw_addressing) {
+		gve_tx_fifo_release(priv, &tx->tx_fifo);
+		gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
+		tx->tx_fifo.qpl = NULL;
+	}
 
 	bytes = sizeof(*tx->desc) * slots;
 	dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
@@ -206,11 +208,15 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
 	if (!tx->desc)
 		goto abort_with_info;
 
-	tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
+	tx->raw_addressing = priv->raw_addressing;
+	tx->dev = &priv->pdev->dev;
+	if (!tx->raw_addressing) {
+		tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
 
-	/* map Tx FIFO */
-	if (gve_tx_fifo_init(priv, &tx->tx_fifo))
-		goto abort_with_desc;
+		/* map Tx FIFO */
+		if (gve_tx_fifo_init(priv, &tx->tx_fifo))
+			goto abort_with_desc;
+	}
 
 	tx->q_resources =
 		dma_alloc_coherent(hdev,
@@ -228,7 +234,8 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
 	return 0;
 
 abort_with_fifo:
-	gve_tx_fifo_release(priv, &tx->tx_fifo);
+	if (!tx->raw_addressing)
+		gve_tx_fifo_release(priv, &tx->tx_fifo);
 abort_with_desc:
 	dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
 	tx->desc = NULL;
@@ -301,27 +308,47 @@ static inline int gve_skb_fifo_bytes_required(struct gve_tx_ring *tx,
 	return bytes;
 }
 
-/* The most descriptors we could need are 3 - 1 for the headers, 1 for
- * the beginning of the payload at the end of the FIFO, and 1 if the
- * payload wraps to the beginning of the FIFO.
+/* The most descriptors we could need is MAX_SKB_FRAGS + 3 : 1 for each skb frag,
+ * +1 for the skb linear portion, +1 for when tcp hdr needs to be in separate descriptor,
+ * and +1 if the payload wraps to the beginning of the FIFO.
  */
-#define MAX_TX_DESC_NEEDED	3
+#define MAX_TX_DESC_NEEDED	(MAX_SKB_FRAGS + 3)
+static void gve_tx_unmap_buf(struct device *dev, struct gve_tx_buffer_state *info)
+{
+	if (info->skb) {
+		dma_unmap_single(dev, dma_unmap_addr(&info->buf, dma),
+				 dma_unmap_len(&info->buf, len),
+				 DMA_TO_DEVICE);
+		dma_unmap_len_set(&info->buf, len, 0);
+	} else {
+		dma_unmap_page(dev, dma_unmap_addr(&info->buf, dma),
+			       dma_unmap_len(&info->buf, len),
+			       DMA_TO_DEVICE);
+		dma_unmap_len_set(&info->buf, len, 0);
+	}
+}
 
 /* Check if sufficient resources (descriptor ring space, FIFO space) are
  * available to transmit the given number of bytes.
  */
 static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
 {
-	return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED &&
-		gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required));
+	bool can_alloc = true;
+
+	if (!tx->raw_addressing)
+		can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
+
+	return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
 }
 
 /* Stops the queue if the skb cannot be transmitted. */
 static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
 {
-	int bytes_required;
+	int bytes_required = 0;
+
+	if (!tx->raw_addressing)
+		bytes_required = gve_skb_fifo_bytes_required(tx, skb);
 
-	bytes_required = gve_skb_fifo_bytes_required(tx, skb);
 	if (likely(gve_can_tx(tx, bytes_required)))
 		return 0;
 
@@ -390,22 +417,22 @@ static void gve_tx_fill_seg_desc(union gve_tx_desc *seg_desc,
 	seg_desc->seg.seg_addr = cpu_to_be64(addr);
 }
 
-static void gve_dma_sync_for_device(struct device *dev, dma_addr_t *page_buses,
+static void gve_dma_sync_for_device(struct device *dev,
+				    dma_addr_t *page_buses,
 				    u64 iov_offset, u64 iov_len)
 {
 	u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
 	u64 first_page = iov_offset / PAGE_SIZE;
-	dma_addr_t dma;
 	u64 page;
 
 	for (page = first_page; page <= last_page; page++) {
-		dma = page_buses[page];
+		dma_addr_t dma = page_buses[page];
+
 		dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
 	}
 }
 
-static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
-			  struct device *dev)
+static int gve_tx_add_skb_copy(struct gve_priv *priv, struct gve_tx_ring *tx, struct sk_buff *skb)
 {
 	int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
 	union gve_tx_desc *pkt_desc, *seg_desc;
@@ -429,7 +456,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
 	hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
 			skb_headlen(skb);
 
-	info->skb =  skb;
+	info->skb = skb;
 	/* We don't want to split the header, so if necessary, pad to the end
 	 * of the fifo and then put the header at the beginning of the fifo.
 	 */
@@ -447,7 +474,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
 	skb_copy_bits(skb, 0,
 		      tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
 		      hlen);
-	gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
+	gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
 				info->iov[hdr_nfrags - 1].iov_offset,
 				info->iov[hdr_nfrags - 1].iov_len);
 	copy_offset = hlen;
@@ -463,7 +490,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
 		skb_copy_bits(skb, copy_offset,
 			      tx->tx_fifo.base + info->iov[i].iov_offset,
 			      info->iov[i].iov_len);
-		gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
+		gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
 					info->iov[i].iov_offset,
 					info->iov[i].iov_len);
 		copy_offset += info->iov[i].iov_len;
@@ -472,6 +499,100 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
 	return 1 + payload_nfrags;
 }
 
+static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
+				  struct sk_buff *skb)
+{
+	const struct skb_shared_info *shinfo = skb_shinfo(skb);
+	int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
+	union gve_tx_desc *pkt_desc, *seg_desc;
+	struct gve_tx_buffer_state *info;
+	bool is_gso = skb_is_gso(skb);
+	u32 idx = tx->req & tx->mask;
+	struct gve_tx_dma_buf *buf;
+	int last_mapped = 0;
+	u64 addr;
+	u32 len;
+	int i;
+
+	info = &tx->info[idx];
+	pkt_desc = &tx->desc[idx];
+
+	l4_hdr_offset = skb_checksum_start_offset(skb);
+	/* If the skb is gso, then we want only up to the tcp header in the first segment
+	 * to efficiently replicate on each segment otherwise we want the linear portion
+	 * of the skb (which will contain the checksum because skb->csum_start and
+	 * skb->csum_offset are given relative to skb->head) in the first segment.
+	 */
+	hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
+			skb_headlen(skb);
+	len = skb_headlen(skb);
+
+	info->skb =  skb;
+
+	addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
+	if (unlikely(dma_mapping_error(tx->dev, addr))) {
+		rtnl_lock();
+		priv->dma_mapping_error++;
+		rtnl_unlock();
+		goto drop;
+	}
+	buf = &info->buf;
+	dma_unmap_len_set(buf, len, len);
+	dma_unmap_addr_set(buf, dma, addr);
+
+	payload_nfrags = shinfo->nr_frags;
+	if (hlen < len) {
+		/* For gso the rest of the linear portion of the skb needs to
+		 * be in its own descriptor.
+		 */
+		payload_nfrags++;
+		gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+				     1 + payload_nfrags, hlen, addr);
+
+		len -= hlen;
+		addr += hlen;
+		seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
+		seg_idx_bias = 2;
+		gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+	} else {
+		seg_idx_bias = 1;
+		gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
+				     1 + payload_nfrags, hlen, addr);
+	}
+	idx = (tx->req + seg_idx_bias) & tx->mask;
+
+	for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
+		const skb_frag_t *frag = &shinfo->frags[i];
+
+		seg_desc = &tx->desc[idx];
+		len = skb_frag_size(frag);
+		addr = skb_frag_dma_map(tx->dev, frag, 0, len, DMA_TO_DEVICE);
+		if (unlikely(dma_mapping_error(tx->dev, addr))) {
+			priv->dma_mapping_error++;
+			goto unmap_drop;
+		}
+		buf = &tx->info[idx].buf;
+		tx->info[idx].skb = NULL;
+		dma_unmap_len_set(buf, len, len);
+		dma_unmap_addr_set(buf, dma, addr);
+
+		gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
+		idx = (idx + 1) & tx->mask;
+	}
+
+	return 1 + payload_nfrags;
+
+unmap_drop:
+	i--;
+	for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
+		idx = (tx->req + last_mapped) & tx->mask;
+		gve_tx_unmap_buf(tx->dev, &tx->info[idx]);
+	}
+drop:
+	tx->dropped_pkt++;
+	return 0;
+}
+
 netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
 {
 	struct gve_priv *priv = netdev_priv(dev);
@@ -490,12 +611,20 @@ netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
 		gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
 		return NETDEV_TX_BUSY;
 	}
-	nsegs = gve_tx_add_skb(tx, skb, &priv->pdev->dev);
-
-	netdev_tx_sent_queue(tx->netdev_txq, skb->len);
-	skb_tx_timestamp(skb);
+	if (tx->raw_addressing)
+		nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
+	else
+		nsegs = gve_tx_add_skb_copy(priv, tx, skb);
+
+	/* If the packet is getting sent, we need to update the skb */
+	if (nsegs) {
+		netdev_tx_sent_queue(tx->netdev_txq, skb->len);
+		skb_tx_timestamp(skb);
+	}
 
-	/* give packets to NIC */
+	/* Give packets to NIC. Even if this packet failed to send the doorbell
+	 * might need to be rung because of xmit_more.
+	 */
 	tx->req += nsegs;
 
 	if (!netif_xmit_stopped(tx->netdev_txq) && netdev_xmit_more())
@@ -525,24 +654,30 @@ static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
 		info = &tx->info[idx];
 		skb = info->skb;
 
+		/* Unmap the buffer */
+		if (tx->raw_addressing)
+			gve_tx_unmap_buf(tx->dev, info);
 		/* Mark as free */
 		if (skb) {
 			info->skb = NULL;
 			bytes += skb->len;
 			pkts++;
 			dev_consume_skb_any(skb);
-			/* FIFO free */
-			for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
-				space_freed += info->iov[i].iov_len +
-					       info->iov[i].iov_padding;
-				info->iov[i].iov_len = 0;
-				info->iov[i].iov_padding = 0;
+			if (!tx->raw_addressing) {
+				/* FIFO free */
+				for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
+					space_freed += info->iov[i].iov_len +
+						       info->iov[i].iov_padding;
+					info->iov[i].iov_len = 0;
+					info->iov[i].iov_padding = 0;
+				}
 			}
 		}
 		tx->done++;
 	}
 
-	gve_tx_free_fifo(&tx->tx_fifo, space_freed);
+	if (!tx->raw_addressing)
+		gve_tx_free_fifo(&tx->tx_fifo, space_freed);
 	u64_stats_update_begin(&tx->statss);
 	tx->bytes_done += bytes;
 	tx->pkt_done += pkts;
-- 
2.29.2.222.g5d2a92d10f8-goog


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 1/4] gve: Add support for raw addressing device option
  2020-11-09 23:36 ` [PATCH net-next v6 1/4] gve: Add support for raw addressing device option David Awogbemila
@ 2020-11-11 17:20   ` Alexander Duyck
  2020-11-18 23:14     ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-11 17:20 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
>
> From: Catherine Sullivan <csully@google.com>
>
> Add support to describe device for parsing device options. As
> the first device option, add raw addressing.
>
> "Raw Addressing" mode (as opposed to the current "qpl" mode) is an
> operational mode which allows the driver to avoid bounce buffer copies
> which it currently performs using pre-allocated qpls (queue_page_lists)
> when sending and receiving packets.
> For egress packets, the provided skb data addresses will be dma_map'ed and
> passed to the device, allowing the NIC to perform DMA directly - the
> driver will not have to copy the buffer content into pre-allocated
> buffers/qpls (as in qpl mode).
> For ingress packets, copies are also eliminated as buffers are handed to
> the networking stack and then recycled or re-allocated as
> necessary, avoiding the use of skb_copy_to_linear_data().
>
> This patch only introduces the option to the driver.
> Subsequent patches will add the ingress and egress functionality.
>
> Reviewed-by: Yangchun Fu <yangchun@google.com>
> Signed-off-by: Catherine Sullivan <csully@google.com>
> Signed-off-by: David Awogbemila <awogbemila@google.com>

A few minor nits called out below. Otherwise it looks good to me.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

> ---
>  drivers/net/ethernet/google/gve/gve.h        |  1 +
>  drivers/net/ethernet/google/gve/gve_adminq.c | 64 ++++++++++++++++++++
>  drivers/net/ethernet/google/gve/gve_adminq.h | 15 +++--
>  drivers/net/ethernet/google/gve/gve_main.c   |  9 +++
>  4 files changed, 85 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> index f5c80229ea96..80cdae06ee39 100644
> --- a/drivers/net/ethernet/google/gve/gve.h
> +++ b/drivers/net/ethernet/google/gve/gve.h
> @@ -199,6 +199,7 @@ struct gve_priv {
>         u64 num_registered_pages; /* num pages registered with NIC */
>         u32 rx_copybreak; /* copy packets smaller than this */
>         u16 default_num_queues; /* default num queues to set up */
> +       bool raw_addressing; /* true if this dev supports raw addressing */

The use of bool is generally frowned upon in structures if you care
about the cache alignment. You should probably just make this a char
or u8.

>
>         struct gve_queue_config tx_cfg;
>         struct gve_queue_config rx_cfg;
> diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> index 24ae6a28a806..3e6de659b274 100644
> --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> @@ -14,6 +14,18 @@
>  #define GVE_ADMINQ_SLEEP_LEN           20
>  #define GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK     100
>
> +static inline
> +struct gve_device_option *gve_get_next_option(struct gve_device_descriptor *descriptor,
> +                                             struct gve_device_option *option)
> +{
> +       void *option_end, *descriptor_end;
> +
> +       option_end = (void *)option + sizeof(*option) + be16_to_cpu(option->option_length);

If I am not mistaken you can make this statement more compact with the
following:
        option_end = (void *)(option + 1) + be16_to_cpu(option->option_length);
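
The two spellings are equivalent because `option + 1` advances the typed
pointer by `sizeof(*option)` before the byte offset is added. A quick
sketch demonstrating the equivalence (a simplified 8-byte stand-in for
struct gve_device_option; `char *` is used instead of `void *` because
void-pointer arithmetic is a GNU extension the kernel relies on, not
portable C):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct gve_device_option (8 bytes, as in the patch). */
struct opt {
	uint16_t option_id;
	uint16_t option_length;
	uint32_t feat_mask;
};

/* Both spellings compute the same end pointer:
 *   (char *)o + sizeof(*o) + len   vs   (char *)(o + 1) + len
 */
static int spellings_agree(uint16_t len)
{
	static struct opt o;
	char *a = (char *)&o + sizeof(o) + len;
	char *b = (char *)(&o + 1) + len;
	return a == b;
}
```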

> +       descriptor_end = (void *)descriptor + be16_to_cpu(descriptor->total_length);
> +
> +       return option_end > descriptor_end ? NULL : (struct gve_device_option *)option_end;
> +}
> +
>  int gve_adminq_alloc(struct device *dev, struct gve_priv *priv)
>  {
>         priv->adminq = dma_alloc_coherent(dev, PAGE_SIZE,
> @@ -460,11 +472,14 @@ int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 num_queues)
>  int gve_adminq_describe_device(struct gve_priv *priv)
>  {
>         struct gve_device_descriptor *descriptor;
> +       struct gve_device_option *dev_opt;
>         union gve_adminq_command cmd;
>         dma_addr_t descriptor_bus;
> +       u16 num_options;
>         int err = 0;
>         u8 *mac;
>         u16 mtu;
> +       int i;
>
>         memset(&cmd, 0, sizeof(cmd));
>         descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
> @@ -518,6 +533,55 @@ int gve_adminq_describe_device(struct gve_priv *priv)
>                 priv->rx_desc_cnt = priv->rx_pages_per_qpl;
>         }
>         priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
> +       dev_opt = (void *)(descriptor + 1);
> +
> +       num_options = be16_to_cpu(descriptor->num_device_options);
> +       for (i = 0; i < num_options; i++) {
> +               u16 option_length = be16_to_cpu(dev_opt->option_length);
> +               u16 option_id = be16_to_cpu(dev_opt->option_id);
> +               struct gve_device_option *next_opt;
> +
> +               next_opt = gve_get_next_option(descriptor, dev_opt);
> +               if (!next_opt) {
> +                       dev_err(&priv->dev->dev,
> +                               "options exceed device_descriptor's total length.\n");
> +                       err = -EINVAL;
> +                       goto free_device_descriptor;
> +               }
> +
> +               switch (option_id) {
> +               case GVE_DEV_OPT_ID_RAW_ADDRESSING:
> +                       /* If the length or feature mask doesn't match,
> +                        * continue without enabling the feature.
> +                        */
> +                       if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
> +                           dev_opt->feat_mask !=
> +                           cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING)) {
> +                               dev_warn(&priv->pdev->dev,
> +                                        "Raw addressing option error:\n"
> +                                        "      Expected: length=%d, feature_mask=%x.\n"
> +                                        "      Actual: length=%d, feature_mask=%x.\n",
> +                                        GVE_DEV_OPT_LEN_RAW_ADDRESSING,
> +                                        cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING),
> +                                        option_length, dev_opt->feat_mask);
> +                               priv->raw_addressing = false;
> +                       } else {
> +                               dev_info(&priv->pdev->dev,
> +                                        "Raw addressing device option enabled.\n");
> +                               priv->raw_addressing = true;
> +                       }
> +                       break;
> +               default:
> +                       /* If we don't recognize the option just continue
> +                        * without doing anything.
> +                        */
> +                       dev_dbg(&priv->pdev->dev,
> +                               "Unrecognized device option 0x%hx not enabled.\n",
> +                               option_id);
> +                       break;
> +               }
> +               dev_opt = next_opt;

Is there any reason for having this switch statement as part of the
function instead of as a function of its own? Seems like you could take
all the code in the switch statement and move it into a separate
function that only needs to be passed priv and dev_opt. That way you
could reduce the indentation a bit and make this a bit more readable by
not having to fold lines like you did in the if statement above.



> +       }
>
>  free_device_descriptor:
>         dma_free_coherent(&priv->pdev->dev, sizeof(*descriptor), descriptor,
> diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
> index 281de8326bc5..af5f586167bd 100644
> --- a/drivers/net/ethernet/google/gve/gve_adminq.h
> +++ b/drivers/net/ethernet/google/gve/gve_adminq.h
> @@ -79,12 +79,17 @@ struct gve_device_descriptor {
>
>  static_assert(sizeof(struct gve_device_descriptor) == 40);
>
> -struct device_option {
> -       __be32 option_id;
> -       __be32 option_length;
> +struct gve_device_option {
> +       __be16 option_id;
> +       __be16 option_length;
> +       __be32 feat_mask;
>  };
>
> -static_assert(sizeof(struct device_option) == 8);
> +static_assert(sizeof(struct gve_device_option) == 8);
> +
> +#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
> +#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
> +#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
>
>  struct gve_adminq_configure_device_resources {
>         __be64 counter_array;
> @@ -111,6 +116,8 @@ struct gve_adminq_unregister_page_list {
>
>  static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
>
> +#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
> +
>  struct gve_adminq_create_tx_queue {
>         __be32 queue_id;
>         __be32 reserved;
> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> index 48a433154ce0..70685c10db0e 100644
> --- a/drivers/net/ethernet/google/gve/gve_main.c
> +++ b/drivers/net/ethernet/google/gve/gve_main.c
> @@ -678,6 +678,10 @@ static int gve_alloc_qpls(struct gve_priv *priv)
>         int i, j;
>         int err;
>
> +       /* Raw addressing means no QPLs */
> +       if (priv->raw_addressing)
> +               return 0;
> +
>         priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
>         if (!priv->qpls)
>                 return -ENOMEM;
> @@ -718,6 +722,10 @@ static void gve_free_qpls(struct gve_priv *priv)
>         int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
>         int i;
>
> +       /* Raw addressing means no QPLs */
> +       if (priv->raw_addressing)
> +               return;
> +
>         kvfree(priv->qpl_cfg.qpl_id_map);
>
>         for (i = 0; i < num_qpls; i++)
> @@ -1078,6 +1086,7 @@ static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
>         if (skip_describe_device)
>                 goto setup_device;
>
> +       priv->raw_addressing = false;
>         /* Get the initial information we need from the device */
>         err = gve_adminq_describe_device(priv);
>         if (err) {
> --
> 2.29.2.222.g5d2a92d10f8-goog
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path
  2020-11-09 23:36 ` [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path David Awogbemila
@ 2020-11-11 17:20   ` Alexander Duyck
  2020-11-18 23:15     ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-11 17:20 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
>
> From: Catherine Sullivan <csully@google.com>
>
> Add support to use raw dma addresses in the rx path. Due to this new
> support we can alloc a new buffer instead of making a copy.
>
> RX buffers are handed to the networking stack and are
> re-allocated as needed, avoiding the need to use
> skb_copy_to_linear_data() as in "qpl" mode.
>
> Reviewed-by: Yangchun Fu <yangchun@google.com>
> Signed-off-by: Catherine Sullivan <csully@google.com>
> Signed-off-by: David Awogbemila <awogbemila@google.com>
> ---
>  drivers/net/ethernet/google/gve/gve.h        |   6 +-
>  drivers/net/ethernet/google/gve/gve_adminq.c |  14 +-
>  drivers/net/ethernet/google/gve/gve_desc.h   |  10 +-
>  drivers/net/ethernet/google/gve/gve_main.c   |   3 +-
>  drivers/net/ethernet/google/gve/gve_rx.c     | 224 +++++++++++++++----
>  5 files changed, 200 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> index 80cdae06ee39..a8c589dd14e4 100644
> --- a/drivers/net/ethernet/google/gve/gve.h
> +++ b/drivers/net/ethernet/google/gve/gve.h
> @@ -68,6 +68,7 @@ struct gve_rx_data_queue {
>         dma_addr_t data_bus; /* dma mapping of the slots */
>         struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
>         struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
> +       bool raw_addressing; /* use raw_addressing? */
>  };

Again, if you care about alignment at all in this structure you should
probably use u8 instead of bool.

>
>  struct gve_priv;
> @@ -82,6 +83,7 @@ struct gve_rx_ring {
>         u32 cnt; /* free-running total number of completed packets */
>         u32 fill_cnt; /* free-running total number of descs and buffs posted */
>         u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
> +       u32 db_threshold; /* threshold for posting new buffs and descs */
>         u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
>         u64 rx_copied_pkt; /* free-running total number of copied packets */
>         u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
> @@ -194,7 +196,7 @@ struct gve_priv {
>         u16 tx_desc_cnt; /* num desc per ring */
>         u16 rx_desc_cnt; /* num desc per ring */
>         u16 tx_pages_per_qpl; /* tx buffer length */
> -       u16 rx_pages_per_qpl; /* rx buffer length */
> +       u16 rx_data_slot_cnt; /* rx buffer length */
>         u64 max_registered_pages;
>         u64 num_registered_pages; /* num pages registered with NIC */
>         u32 rx_copybreak; /* copy packets smaller than this */
> @@ -444,7 +446,7 @@ static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
>   */
>  static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
>  {
> -       return priv->rx_cfg.num_queues;
> +       return priv->raw_addressing ? 0 : priv->rx_cfg.num_queues;
>  }
>
>  /* Returns a pointer to the next available tx qpl in the list of qpls
> diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> index 3e6de659b274..711d135c6b1a 100644
> --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> @@ -369,8 +369,10 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
>  {
>         struct gve_rx_ring *rx = &priv->rx[queue_index];
>         union gve_adminq_command cmd;
> +       u32 qpl_id;
>         int err;
>
> +       qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : rx->data.qpl->id;
>         memset(&cmd, 0, sizeof(cmd));
>         cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
>         cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
> @@ -381,7 +383,7 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
>                 .queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
>                 .rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
>                 .rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
> -               .queue_page_list_id = cpu_to_be32(rx->data.qpl->id),
> +               .queue_page_list_id = cpu_to_be32(qpl_id),
>         };
>
>         err = gve_adminq_issue_cmd(priv, &cmd);
> @@ -526,11 +528,11 @@ int gve_adminq_describe_device(struct gve_priv *priv)
>         mac = descriptor->mac;
>         dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
>         priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
> -       priv->rx_pages_per_qpl = be16_to_cpu(descriptor->rx_pages_per_qpl);
> -       if (priv->rx_pages_per_qpl < priv->rx_desc_cnt) {
> -               dev_err(&priv->pdev->dev, "rx_pages_per_qpl cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
> -                       priv->rx_pages_per_qpl);
> -               priv->rx_desc_cnt = priv->rx_pages_per_qpl;
> +       priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
> +       if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
> +               dev_err(&priv->pdev->dev, "rx_data_slot_cnt cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
> +                       priv->rx_data_slot_cnt);
> +               priv->rx_desc_cnt = priv->rx_data_slot_cnt;
>         }
>         priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
>         dev_opt = (void *)(descriptor + 1);
> diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
> index 54779871d52e..0aad314aefaf 100644
> --- a/drivers/net/ethernet/google/gve/gve_desc.h
> +++ b/drivers/net/ethernet/google/gve/gve_desc.h
> @@ -72,12 +72,14 @@ struct gve_rx_desc {
>  } __packed;
>  static_assert(sizeof(struct gve_rx_desc) == 64);
>
> -/* As with the Tx ring format, the qpl_offset entries below are offsets into an
> - * ordered list of registered pages.
> +/* If the device supports raw dma addressing then the addr in data slot is
> + * the dma address of the buffer.
> + * If the device only supports registered segments than the addr is a byte

s/than/then/

> + * offset into the registered segment (an ordered list of pages) where the
> + * buffer is.
>   */
>  struct gve_rx_data_slot {
> -       /* byte offset into the rx registered segment of this slot */
> -       __be64 qpl_offset;
> +       __be64 addr;
>  };

So this is either the qpl_offset or an address correct? In such a case
shouldn't this be a union? I'm just wanting to make sure that this
isn't something where we are just calling it all addr now even though
there are still cases where it is the qpl_offset.

>  /* GVE Recive Packet Descriptor Seq No */
> diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> index 70685c10db0e..225e17dd1ae5 100644
> --- a/drivers/net/ethernet/google/gve/gve_main.c
> +++ b/drivers/net/ethernet/google/gve/gve_main.c
> @@ -596,6 +596,7 @@ int gve_alloc_page(struct gve_priv *priv, struct device *dev,
>         if (dma_mapping_error(dev, *dma)) {
>                 priv->dma_mapping_error++;
>                 put_page(*page);
> +               *page = NULL;
>                 return -ENOMEM;
>         }
>         return 0;

This piece seems like a bug fix for the exception handling path.
Should it be pulled out into a separate patch so that it could be
backported if needed?

> @@ -694,7 +695,7 @@ static int gve_alloc_qpls(struct gve_priv *priv)
>         }
>         for (; i < num_qpls; i++) {
>                 err = gve_alloc_queue_page_list(priv, i,
> -                                               priv->rx_pages_per_qpl);
> +                                               priv->rx_data_slot_cnt);
>                 if (err)
>                         goto free_qpls;
>         }
> diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
> index 008fa897a3e6..49646caf930c 100644
> --- a/drivers/net/ethernet/google/gve/gve_rx.c
> +++ b/drivers/net/ethernet/google/gve/gve_rx.c
> @@ -16,12 +16,39 @@ static void gve_rx_remove_from_block(struct gve_priv *priv, int queue_idx)
>         block->rx = NULL;
>  }
>
> +static void gve_rx_free_buffer(struct device *dev,
> +                              struct gve_rx_slot_page_info *page_info,
> +                              struct gve_rx_data_slot *data_slot)
> +{
> +       dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
> +                                     page_info->page_offset);
> +

Why bother with subtraction when you are adding the page offset via an
xor? It seems like you should be able to just mask off the page offset
and not need to store it anywhere. Either the DMA address is page
aligned, in which case you can xor the offset in and out and use an &=
to strip it, or it isn't, in which case you should be using addition
and subtraction everywhere and only flip the page offset as a single
bit somewhere.

> +       gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
> +}
> +
> +static void gve_rx_unfill_pages(struct gve_priv *priv, struct gve_rx_ring *rx)
> +{
> +       u32 slots = rx->mask + 1;
> +       int i;
> +
> +       if (rx->data.raw_addressing) {

The declarations of slots and i could be moved here since they aren't
used anywhere else in this function.

> +               for (i = 0; i < slots; i++)
> +                       gve_rx_free_buffer(&priv->pdev->dev, &rx->data.page_info[i],
> +                                          &rx->data.data_ring[i]);
> +       } else {
> +               gve_unassign_qpl(priv, rx->data.qpl->id);
> +               rx->data.qpl = NULL;
> +       }
> +       kvfree(rx->data.page_info);
> +       rx->data.page_info = NULL;
> +}
> +
>  static void gve_rx_free_ring(struct gve_priv *priv, int idx)
>  {
>         struct gve_rx_ring *rx = &priv->rx[idx];
>         struct device *dev = &priv->pdev->dev;
> +       u32 slots = rx->mask + 1;
>         size_t bytes;
> -       u32 slots;
>
>         gve_rx_remove_from_block(priv, idx);
>
> @@ -33,11 +60,8 @@ static void gve_rx_free_ring(struct gve_priv *priv, int idx)
>                           rx->q_resources, rx->q_resources_bus);
>         rx->q_resources = NULL;
>
> -       gve_unassign_qpl(priv, rx->data.qpl->id);
> -       rx->data.qpl = NULL;
> -       kvfree(rx->data.page_info);
> +       gve_rx_unfill_pages(priv, rx);
>
> -       slots = rx->mask + 1;
>         bytes = sizeof(*rx->data.data_ring) * slots;
>         dma_free_coherent(dev, bytes, rx->data.data_ring,
>                           rx->data.data_bus);
> @@ -52,14 +76,36 @@ static void gve_setup_rx_buffer(struct gve_rx_slot_page_info *page_info,
>         page_info->page = page;
>         page_info->page_offset = 0;
>         page_info->page_address = page_address(page);
> -       slot->qpl_offset = cpu_to_be64(addr);
> +       slot->addr = cpu_to_be64(addr);
> +}
> +
> +static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
> +                              struct gve_rx_slot_page_info *page_info,
> +                              struct gve_rx_data_slot *data_slot,
> +                              struct gve_rx_ring *rx)
> +{
> +       struct page *page;
> +       dma_addr_t dma;
> +       int err;
> +
> +       err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE);
> +       if (err) {
> +               u64_stats_update_begin(&rx->statss);
> +               rx->rx_buf_alloc_fail++;
> +               u64_stats_update_end(&rx->statss);
> +               return err;
> +       }

This just feels really clumsy to me. Do the stats really need to be
invoked in all cases? It seems like you could pull this code out and
put it in an exception path somewhere rather than in the middle of the
function. That way you don't need to carry around the rx ring which
isn't used for anything else. So for example when you are prefilling
the buffers it looks like you are already returning an error to the
user. In such a case the stats update might be redundant as the
interface wasn't allowed to come up in the first place.

> +
> +       gve_setup_rx_buffer(page_info, data_slot, dma, page);
> +       return 0;
>  }
>
>  static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
>  {
>         struct gve_priv *priv = rx->gve;
>         u32 slots;
> -       int i;
> +       int err;
> +       int i, j;
>
>         /* Allocate one page per Rx queue slot. Each page is split into two
>          * packet buffers, when possible we "page flip" between the two.
> @@ -71,17 +117,32 @@ static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
>         if (!rx->data.page_info)
>                 return -ENOMEM;
>
> -       rx->data.qpl = gve_assign_rx_qpl(priv);
> -
> +       if (!rx->data.raw_addressing)
> +               rx->data.qpl = gve_assign_rx_qpl(priv);
>         for (i = 0; i < slots; i++) {
> -               struct page *page = rx->data.qpl->pages[i];
> -               dma_addr_t addr = i * PAGE_SIZE;
> -
> -               gve_setup_rx_buffer(&rx->data.page_info[i],
> -                                   &rx->data.data_ring[i], addr, page);
> +               struct page *page;
> +               dma_addr_t addr;
> +
> +               if (!rx->data.raw_addressing) {
> +                       page = rx->data.qpl->pages[i];
> +                       addr = i * PAGE_SIZE;
> +                       gve_setup_rx_buffer(&rx->data.page_info[i],
> +                                           &rx->data.data_ring[i], addr, page);
> +                       continue;
> +               }
> +               err = gve_rx_alloc_buffer(priv, &priv->pdev->dev, &rx->data.page_info[i],
> +                                         &rx->data.data_ring[i], rx);
> +               if (err)
> +                       goto alloc_err;
>         }
>
>         return slots;
> +alloc_err:
> +       for (j = 0; j < i; j++)
> +               gve_rx_free_buffer(&priv->pdev->dev,
> +                                  &rx->data.page_info[j],
> +                                  &rx->data.data_ring[j]);

Instead of adding another variable you could just change this loop to
be based on "while(i--)" and then use i as you work your way backwards
through buffers you previously allocated.

> +       return err;
>  }
>
>  static void gve_rx_add_to_block(struct gve_priv *priv, int queue_idx)
> @@ -110,8 +171,9 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
>         rx->gve = priv;
>         rx->q_num = idx;
>
> -       slots = priv->rx_pages_per_qpl;
> +       slots = priv->rx_data_slot_cnt;
>         rx->mask = slots - 1;
> +       rx->data.raw_addressing = priv->raw_addressing;
>
>         /* alloc rx data ring */
>         bytes = sizeof(*rx->data.data_ring) * slots;
> @@ -156,8 +218,8 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
>                 err = -ENOMEM;
>                 goto abort_with_q_resources;
>         }
> -       rx->mask = slots - 1;
>         rx->cnt = 0;
> +       rx->db_threshold = priv->rx_desc_cnt / 2;
>         rx->desc.seqno = 1;
>         gve_rx_add_to_block(priv, idx);
>
> @@ -168,7 +230,7 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
>                           rx->q_resources, rx->q_resources_bus);
>         rx->q_resources = NULL;
>  abort_filled:
> -       kvfree(rx->data.page_info);
> +       gve_rx_unfill_pages(priv, rx);
>  abort_with_slots:
>         bytes = sizeof(*rx->data.data_ring) * slots;
>         dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
> @@ -251,8 +313,7 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
>         return skb;
>  }
>
> -static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
> -                                       struct napi_struct *napi,
> +static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
>                                         struct gve_rx_slot_page_info *page_info,
>                                         u16 len)
>  {
> @@ -271,11 +332,25 @@ static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
>  static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
>                              struct gve_rx_data_slot *data_ring)
>  {
> -       u64 addr = be64_to_cpu(data_ring->qpl_offset);
> +       u64 addr = be64_to_cpu(data_ring->addr);
>
>         page_info->page_offset ^= PAGE_SIZE / 2;
>         addr ^= PAGE_SIZE / 2;
> -       data_ring->qpl_offset = cpu_to_be64(addr);
> +       data_ring->addr = cpu_to_be64(addr);

This code is far more inefficient than it needs to be. Instead of byte
swapping addr it should be byte swapping PAGE_SIZE / 2 since it is a
constant. A bitwise operation should work fine as long as you are only
performing it on two be64 values.

Also as I mentioned before if you are just using the xor on the
address directly then you don't need the page offset as you can use a
mask to pull it out of the addr.

> +}
> +
> +static struct sk_buff *
> +gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
> +                     struct gve_rx_slot_page_info *page_info, u16 len,
> +                     struct napi_struct *napi,
> +                     struct gve_rx_data_slot *data_slot)
> +{
> +       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> +
> +       if (!skb)
> +               return NULL;
> +
> +       return skb;
>  }
>

I'm not sure I see the point of this function. It seems like you
should just replace all references to it with gve_rx_add_frags until
you actually have something else going on here. I am assuming this is
addressed in a later patch.


>  static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> @@ -285,7 +360,9 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>         struct gve_priv *priv = rx->gve;
>         struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
>         struct net_device *dev = priv->dev;
> -       struct sk_buff *skb;
> +       struct gve_rx_data_slot *data_slot;
> +       struct sk_buff *skb = NULL;
> +       dma_addr_t page_bus;
>         int pagecount;
>         u16 len;
>
> @@ -294,18 +371,18 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>                 u64_stats_update_begin(&rx->statss);
>                 rx->rx_desc_err_dropped_pkt++;
>                 u64_stats_update_end(&rx->statss);
> -               return true;
> +               return false;
>         }
>
>         len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
>         page_info = &rx->data.page_info[idx];
> -       dma_sync_single_for_cpu(&priv->pdev->dev, rx->data.qpl->page_buses[idx],
> -                               PAGE_SIZE, DMA_FROM_DEVICE);
>
> -       /* gvnic can only receive into registered segments. If the buffer
> -        * can't be recycled, our only choice is to copy the data out of
> -        * it so that we can return it to the device.
> -        */
> +       data_slot = &rx->data.data_ring[idx];
> +       page_bus = (rx->data.raw_addressing) ?
> +                                       be64_to_cpu(data_slot->addr) - page_info->page_offset :
> +                                       rx->data.qpl->page_buses[idx];
> +       dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
> +                               PAGE_SIZE, DMA_FROM_DEVICE);
>
>         if (PAGE_SIZE == 4096) {
>                 if (len <= priv->rx_copybreak) {
> @@ -316,6 +393,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>                         u64_stats_update_end(&rx->statss);
>                         goto have_skb;
>                 }
> +               if (rx->data.raw_addressing) {
> +                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> +                                                   page_info, len, napi,
> +                                                    data_slot);
> +                       goto have_skb;
> +               }
>                 if (unlikely(!gve_can_recycle_pages(dev))) {
>                         skb = gve_rx_copy(rx, dev, napi, page_info, len);
>                         goto have_skb;
> @@ -326,12 +409,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>                          * the page fragment to a new SKB and pass it up the
>                          * stack.
>                          */
> -                       skb = gve_rx_add_frags(dev, napi, page_info, len);
> +                       skb = gve_rx_add_frags(napi, page_info, len);
>                         if (!skb) {
>                                 u64_stats_update_begin(&rx->statss);
>                                 rx->rx_skb_alloc_fail++;
>                                 u64_stats_update_end(&rx->statss);
> -                               return true;
> +                               return false;
>                         }
>                         /* Make sure the kernel stack can't release the page */
>                         get_page(page_info->page);
> @@ -347,7 +430,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>                         return false;
>                 }
>         } else {
> -               skb = gve_rx_copy(rx, dev, napi, page_info, len);
> +               if (rx->data.raw_addressing)
> +                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> +                                                   page_info, len, napi,
> +                                                   data_slot);
> +               else
> +                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
>         }
>
>  have_skb:
> @@ -358,7 +446,7 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>                 u64_stats_update_begin(&rx->statss);
>                 rx->rx_skb_alloc_fail++;
>                 u64_stats_update_end(&rx->statss);
> -               return true;
> +               return false;
>         }
>
>         if (likely(feat & NETIF_F_RXCSUM)) {
> @@ -399,19 +487,45 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
>         return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
>  }
>
> +static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
> +{
> +       bool empty = rx->fill_cnt == rx->cnt;
> +       u32 fill_cnt = rx->fill_cnt;
> +
> +       while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {

So one question I would have is why do you need to mask fill_cnt and
cnt here, but not above? Something doesn't match up.

> +               struct gve_rx_slot_page_info *page_info;
> +               struct device *dev = &priv->pdev->dev;
> +               struct gve_rx_data_slot *data_slot;
> +               u32 idx = fill_cnt & rx->mask;
> +
> +               page_info = &rx->data.page_info[idx];
> +               data_slot = &rx->data.data_ring[idx];
> +               gve_rx_free_buffer(dev, page_info, data_slot);
> +               page_info->page = NULL;
> +               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> +                       break;
> +               empty = false;
> +               fill_cnt++;
> +       }
> +       rx->fill_cnt = fill_cnt;
> +       return true;
> +}
> +
>  bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
>                        netdev_features_t feat)
>  {
>         struct gve_priv *priv = rx->gve;
> +       u32 work_done = 0, packets = 0;
>         struct gve_rx_desc *desc;
>         u32 cnt = rx->cnt;
>         u32 idx = cnt & rx->mask;
> -       u32 work_done = 0;
>         u64 bytes = 0;
>
>         desc = rx->desc.desc_ring + idx;
>         while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
>                work_done < budget) {
> +               bool dropped;
> +
>                 netif_info(priv, rx_status, priv->dev,
>                            "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
>                            rx->q_num, idx, desc, desc->flags_seq);
> @@ -419,9 +533,11 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
>                            "[%d] seqno=%d rx->desc.seqno=%d\n",
>                            rx->q_num, GVE_SEQNO(desc->flags_seq),
>                            rx->desc.seqno);
> -               bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
> -               if (!gve_rx(rx, desc, feat, idx))
> -                       gve_schedule_reset(priv);
> +               dropped = !gve_rx(rx, desc, feat, idx);
> +               if (!dropped) {
> +                       bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
> +                       packets++;
> +               }
>                 cnt++;
>                 idx = cnt & rx->mask;
>                 desc = rx->desc.desc_ring + idx;
> @@ -429,15 +545,35 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
>                 work_done++;
>         }
>
> -       if (!work_done)
> +       if (!work_done && rx->fill_cnt - cnt > rx->db_threshold) {
>                 return false;

Since you are returning here you don't really need the else below; you
can just drop the braces and use a plain if instead. However, since
everything you are doing here can be done even when work_done is 0, I
would say to drop the else-if entirely and leave the code as it was.

> +       } else if (work_done) {
> +               u64_stats_update_begin(&rx->statss);
> +               rx->rpackets += packets;
> +               rx->rbytes += bytes;
> +               u64_stats_update_end(&rx->statss);
> +               rx->cnt = cnt;
> +       }

The block below seems much better than optimizing for an unlikely case
where there is no work done.


> -       u64_stats_update_begin(&rx->statss);
> -       rx->rpackets += work_done;
> -       rx->rbytes += bytes;
> -       u64_stats_update_end(&rx->statss);
> -       rx->cnt = cnt;
> -       rx->fill_cnt += work_done;
> +       /* restock ring slots */
> +       if (!rx->data.raw_addressing) {
> +               /* In QPL mode buffs are refilled as the desc are processed */
> +               rx->fill_cnt += work_done;
> +       } else if (rx->fill_cnt - cnt <= rx->db_threshold) {
> +               /* In raw addressing mode buffs are only refilled if the avail
> +                * falls below a threshold.
> +                */
> +               if (!gve_rx_refill_buffers(priv, rx))
> +                       return false;
> +
> +               /* If we were not able to completely refill buffers, we'll want
> +                * to schedule this queue for work again to refill buffers.
> +                */
> +               if (rx->fill_cnt - cnt <= rx->db_threshold) {
> +                       gve_rx_write_doorbell(priv, rx);
> +                       return true;
> +               }
> +       }
>
>         gve_rx_write_doorbell(priv, rx);
>         return gve_rx_work_pending(rx);
> --
> 2.29.2.222.g5d2a92d10f8-goog
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 3/4] gve: Rx Buffer Recycling
  2020-11-09 23:36 ` [PATCH net-next v6 3/4] gve: Rx Buffer Recycling David Awogbemila
@ 2020-11-11 17:20   ` Alexander Duyck
  2020-11-18 22:50     ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-11 17:20 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev

On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
>
> This patch lets the driver reuse buffers that have been freed by the
> networking stack.
>
> In the raw addressing case, this allows the driver avoid allocating new
> buffers.
> In the qpl case, the driver can avoid copies.
>
> Signed-off-by: David Awogbemila <awogbemila@google.com>
> ---
>  drivers/net/ethernet/google/gve/gve.h    |  10 +-
>  drivers/net/ethernet/google/gve/gve_rx.c | 196 +++++++++++++++--------
>  2 files changed, 131 insertions(+), 75 deletions(-)
>
> diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> index a8c589dd14e4..9dcf9fd8d128 100644
> --- a/drivers/net/ethernet/google/gve/gve.h
> +++ b/drivers/net/ethernet/google/gve/gve.h
> @@ -50,6 +50,7 @@ struct gve_rx_slot_page_info {
>         struct page *page;
>         void *page_address;
>         u32 page_offset; /* offset to write to in page */
> +       bool can_flip;

Again, avoid putting bool into structures; the preferred approach is to use a u8.

>  };
>
>  /* A list of pages registered with the device during setup and used by a queue
> @@ -500,15 +501,6 @@ static inline enum dma_data_direction gve_qpl_dma_dir(struct gve_priv *priv,
>                 return DMA_FROM_DEVICE;
>  }
>
> -/* Returns true if the max mtu allows page recycling */
> -static inline bool gve_can_recycle_pages(struct net_device *dev)
> -{
> -       /* We can't recycle the pages if we can't fit a packet into half a
> -        * page.
> -        */
> -       return dev->max_mtu <= PAGE_SIZE / 2;
> -}
> -
>  /* buffers */
>  int gve_alloc_page(struct gve_priv *priv, struct device *dev,
>                    struct page **page, dma_addr_t *dma,
> diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
> index 49646caf930c..ff28581f4427 100644
> --- a/drivers/net/ethernet/google/gve/gve_rx.c
> +++ b/drivers/net/ethernet/google/gve/gve_rx.c
> @@ -287,8 +287,7 @@ static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
>         return PKT_HASH_TYPE_L2;
>  }
>
> -static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
> -                                  struct net_device *dev,
> +static struct sk_buff *gve_rx_copy(struct net_device *dev,
>                                    struct napi_struct *napi,
>                                    struct gve_rx_slot_page_info *page_info,
>                                    u16 len)
> @@ -306,10 +305,6 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
>
>         skb->protocol = eth_type_trans(skb, dev);
>
> -       u64_stats_update_begin(&rx->statss);
> -       rx->rx_copied_pkt++;
> -       u64_stats_update_end(&rx->statss);
> -
>         return skb;
>  }
>
> @@ -334,22 +329,90 @@ static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
>  {
>         u64 addr = be64_to_cpu(data_ring->addr);
>
> +       /* "flip" to other packet buffer on this page */
>         page_info->page_offset ^= PAGE_SIZE / 2;
>         addr ^= PAGE_SIZE / 2;
>         data_ring->addr = cpu_to_be64(addr);
>  }

So this is just adding a comment to existing code that was added in
patch 2. Perhaps it should have been added in that patch.

> +static bool gve_rx_can_flip_buffers(struct net_device *netdev)
> +{
> +#if PAGE_SIZE == 4096
> +       /* We can't flip a buffer if we can't fit a packet
> +        * into half a page.
> +        */

Seems like this comment is unnecessarily wrapped.

> +       return netdev->mtu + GVE_RX_PAD + ETH_HLEN <= PAGE_SIZE / 2;
> +#else
> +       /* PAGE_SIZE != 4096 - don't try to reuse */
> +       return false;
> +#endif
> +}
> +

You may look at converting this over to a ternary statement just to save space.
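A compilable sketch of the ternary form (the constants and struct are userspace stand-ins for the kernel's, with `PAGE_SIZE` fixed at 4096 for the demo): since `PAGE_SIZE` is a compile-time constant, the non-4096 branch still folds to a constant false.

```c
/* Hypothetical stand-ins for the kernel constants used by the check. */
#define PAGE_SIZE  4096
#define GVE_RX_PAD 2
#define ETH_HLEN   14

struct net_device_sketch { unsigned int mtu; };

/* The review suggestion as one expression instead of an #if/#else pair:
 * a buffer can only be flipped if a full packet fits in half a page.
 */
static int rx_can_flip_buffers(const struct net_device_sketch *netdev)
{
	return PAGE_SIZE == 4096 ?
	       netdev->mtu + GVE_RX_PAD + ETH_HLEN <= PAGE_SIZE / 2 : 0;
}
```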

> +static int gve_rx_can_recycle_buffer(struct page *page)
> +{
> +       int pagecount = page_count(page);
> +
> +       /* This page is not being used by any SKBs - reuse */
> +       if (pagecount == 1)
> +               return 1;
> +       /* This page is still being used by an SKB - we can't reuse */
> +       else if (pagecount >= 2)
> +               return 0;
> +       WARN(pagecount < 1, "Pagecount should never be < 1");
> +       return -1;
> +}
> +

So relying on a page count of 1 is expensive. Really if you are going to do
this you should probably look at how we do it currently in ixgbe.
Basically you want to batch the count updates to avoid expensive
atomic operations per skb.
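A simplified userspace model of the ixgbe-style `pagecnt_bias` technique being pointed to here (`fake_page` and its plain `int` count are stand-ins for `struct page` and its atomic refcount): one large reference grab up front replaces a `get_page()` per packet, and handing buffers to the stack only touches a driver-local counter.

```c
#include <limits.h>

struct fake_page { int refcount; };   /* stand-in for struct page */

struct rx_buffer {
	struct fake_page *page;
	unsigned int pagecnt_bias;    /* references the driver still holds */
};

/* One batched reference grab instead of an atomic get_page() per skb. */
static void rx_buffer_init(struct rx_buffer *buf, struct fake_page *page)
{
	buf->page = page;
	page->refcount += USHRT_MAX;  /* ixgbe uses page_ref_add() here */
	buf->pagecnt_bias = USHRT_MAX;
}

/* Handing a half-page to the stack only decrements the local bias. */
static void rx_buffer_give_to_stack(struct rx_buffer *buf)
{
	buf->pagecnt_bias--;
}

/* Reusable iff the only outstanding references are ours plus the
 * original allocation reference.
 */
static int rx_buffer_can_reuse(const struct rx_buffer *buf)
{
	return buf->page->refcount - (int)buf->pagecnt_bias == 1;
}
```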

>  static struct sk_buff *
>  gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
>                       struct gve_rx_slot_page_info *page_info, u16 len,
>                       struct napi_struct *napi,
> -                     struct gve_rx_data_slot *data_slot)
> +                     struct gve_rx_data_slot *data_slot, bool can_flip)
>  {
> -       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> +       struct sk_buff *skb;
>
> +       skb = gve_rx_add_frags(napi, page_info, len);

Why split this up? It seemed fine as it was.

>         if (!skb)
>                 return NULL;
>
> +       /* Optimistically stop the kernel from freeing the page by increasing
> +        * the page bias. We will check the refcount in refill to determine if
> +        * we need to alloc a new page.
> +        */
> +       get_page(page_info->page);
> +       page_info->can_flip = can_flip;
> +

Why pass can_flip and page_info only to set it here? Also I don't get
why you are taking an extra reference on the page without checking the
can_flip variable. It seems like this should be set in the page_info
before you call this function and then you call get_page if
page_info->can_flip is true.

> +       return skb;
> +}
> +
> +static struct sk_buff *
> +gve_rx_qpl(struct device *dev, struct net_device *netdev,
> +          struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
> +          u16 len, struct napi_struct *napi,
> +          struct gve_rx_data_slot *data_slot, bool recycle)
> +{
> +       struct sk_buff *skb;
> +
> +       /* if raw_addressing mode is not enabled gvnic can only receive into
> +        * registered segments. If the buffer can't be recycled, our only
> +        * choice is to copy the data out of it so that we can return it to the
> +        * device.
> +        */
> +       if (recycle) {
> +               skb = gve_rx_add_frags(napi, page_info, len);
> +               /* No point in recycling if we didn't get the skb */
> +               if (skb) {
> +                       /* Make sure the networking stack can't free the page */
> +                       get_page(page_info->page);
> +                       gve_rx_flip_buff(page_info, data_slot);

It isn't about the stack freeing the page, it is about letting the
buddy allocator know that when the skb frees the page that we are
still holding a reference to it so it should not free the memory. The
get_page is about what we are doing, not what the stack is doing.

> +               }
> +       } else {
> +               skb = gve_rx_copy(netdev, napi, page_info, len);
> +               if (skb) {
> +                       u64_stats_update_begin(&rx->statss);
> +                       rx->rx_copied_pkt++;
> +                       u64_stats_update_end(&rx->statss);
> +               }
> +       }
>         return skb;
>  }
>
> @@ -363,7 +426,6 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>         struct gve_rx_data_slot *data_slot;
>         struct sk_buff *skb = NULL;
>         dma_addr_t page_bus;
> -       int pagecount;
>         u16 len;
>
>         /* drop this packet */
> @@ -384,64 +446,37 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
>         dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
>                                 PAGE_SIZE, DMA_FROM_DEVICE);
>
> -       if (PAGE_SIZE == 4096) {
> -               if (len <= priv->rx_copybreak) {
> -                       /* Just copy small packets */
> -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> -                       u64_stats_update_begin(&rx->statss);
> -                       rx->rx_copybreak_pkt++;
> -                       u64_stats_update_end(&rx->statss);
> -                       goto have_skb;
> +       if (len <= priv->rx_copybreak) {
> +               /* Just copy small packets */
> +               skb = gve_rx_copy(dev, napi, page_info, len);
> +               u64_stats_update_begin(&rx->statss);
> +               rx->rx_copied_pkt++;
> +               rx->rx_copybreak_pkt++;
> +               u64_stats_update_end(&rx->statss);

I would recommend just storing some local variables in the function
for tracking copied and copybreak packets if possible and only
updating them at the end of the function. Otherwise you are going to
pay a heavy cost on non-64b systems. Also how many threads will be
accessing these stats? If you have multiple threads all updating the
copied_pkt for small packets it can lead to a heavy amount of cache
thrash.
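A userspace sketch of that suggestion (`rx_stats` and the field names are simplified stand-ins for the ring's `statss` block): count in locals inside the poll loop and publish once, where a single `u64_stats_update_begin`/`end` section would wrap the final additions.

```c
struct rx_stats { unsigned long long copied_pkt, copybreak_pkt; };

/* Process a batch of packet lengths, touching the shared stats block
 * only once at the end instead of once per small packet.
 */
static void rx_poll_sketch(struct rx_stats *stats, const unsigned int *lens,
			   int n, unsigned int copybreak)
{
	unsigned long long copied = 0, cbreak = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (lens[i] <= copybreak) {
			copied++;      /* small packet: copied out */
			cbreak++;
		}
	}
	/* single publish per poll, not per packet */
	stats->copied_pkt += copied;
	stats->copybreak_pkt += cbreak;
}
```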


> +       } else {
> +               bool can_flip = gve_rx_can_flip_buffers(dev);
> +               int recycle = 0;
> +
> +               if (can_flip) {
> +                       recycle = gve_rx_can_recycle_buffer(page_info->page);
> +                       if (recycle < 0) {
> +                               if (!rx->data.raw_addressing)
> +                                       gve_schedule_reset(priv);
> +                               return false;
> +                       }
>                 }
>                 if (rx->data.raw_addressing) {
>                         skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
>                                                     page_info, len, napi,
> -                                                    data_slot);
> -                       goto have_skb;
> -               }
> -               if (unlikely(!gve_can_recycle_pages(dev))) {
> -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> -                       goto have_skb;
> -               }
> -               pagecount = page_count(page_info->page);
> -               if (pagecount == 1) {
> -                       /* No part of this page is used by any SKBs; we attach
> -                        * the page fragment to a new SKB and pass it up the
> -                        * stack.
> -                        */
> -                       skb = gve_rx_add_frags(napi, page_info, len);
> -                       if (!skb) {
> -                               u64_stats_update_begin(&rx->statss);
> -                               rx->rx_skb_alloc_fail++;
> -                               u64_stats_update_end(&rx->statss);
> -                               return false;
> -                       }
> -                       /* Make sure the kernel stack can't release the page */
> -                       get_page(page_info->page);
> -                       /* "flip" to other packet buffer on this page */
> -                       gve_rx_flip_buff(page_info, &rx->data.data_ring[idx]);
> -               } else if (pagecount >= 2) {
> -                       /* We have previously passed the other half of this
> -                        * page up the stack, but it has not yet been freed.
> -                        */
> -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> +                                                   data_slot,
> +                                                   can_flip && recycle);
>                 } else {
> -                       WARN(pagecount < 1, "Pagecount should never be < 1");
> -                       return false;
> +                       skb = gve_rx_qpl(&priv->pdev->dev, dev, rx,
> +                                        page_info, len, napi, data_slot,
> +                                        can_flip && recycle);
>                 }
> -       } else {
> -               if (rx->data.raw_addressing)
> -                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> -                                                   page_info, len, napi,
> -                                                   data_slot);
> -               else
> -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
>         }
>
> -have_skb:
> -       /* We didn't manage to allocate an skb but we haven't had any
> -        * reset worthy failures.
> -        */
>         if (!skb) {
>                 u64_stats_update_begin(&rx->statss);
>                 rx->rx_skb_alloc_fail++;
> @@ -494,16 +529,45 @@ static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
>
>         while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
>                 struct gve_rx_slot_page_info *page_info;
> -               struct device *dev = &priv->pdev->dev;
> -               struct gve_rx_data_slot *data_slot;
>                 u32 idx = fill_cnt & rx->mask;
>
>                 page_info = &rx->data.page_info[idx];
> -               data_slot = &rx->data.data_ring[idx];
> -               gve_rx_free_buffer(dev, page_info, data_slot);
> -               page_info->page = NULL;
> -               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> -                       break;
> +               if (page_info->can_flip) {
> +                       /* The other half of the page is free because it was
> +                        * free when we processed the descriptor. Flip to it.
> +                        */
> +                       struct gve_rx_data_slot *data_slot =
> +                                               &rx->data.data_ring[idx];
> +
> +                       gve_rx_flip_buff(page_info, data_slot);
> +                       page_info->can_flip = false;
> +               } else {
> +                       /* It is possible that the networking stack has already
> +                        * finished processing all outstanding packets in the buffer
> +                        * and it can be reused.
> +                        * Flipping is unnecessary here - if the networking stack still
> +                        * owns half the page it is impossible to tell which half. Either
> +                        * the whole page is free or it needs to be replaced.
> +                        */
> +                       int recycle = gve_rx_can_recycle_buffer(page_info->page);
> +
> +                       if (recycle < 0) {
> +                               if (!rx->data.raw_addressing)
> +                                       gve_schedule_reset(priv);
> +                               return false;
> +                       }
> +                       if (!recycle) {
> +                               /* We can't reuse the buffer - alloc a new one*/
> +                               struct gve_rx_data_slot *data_slot =
> +                                               &rx->data.data_ring[idx];
> +                               struct device *dev = &priv->pdev->dev;
> +
> +                               gve_rx_free_buffer(dev, page_info, data_slot);
> +                               page_info->page = NULL;
> +                               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> +                                       break;
> +                       }
> +               }
>                 empty = false;
>                 fill_cnt++;
>         }
> --
> 2.29.2.222.g5d2a92d10f8-goog
>


* Re: [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path
  2020-11-09 23:36 ` [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path David Awogbemila
@ 2020-11-11 17:20   ` Alexander Duyck
  2020-11-18 23:16     ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-11 17:20 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
>
> From: Catherine Sullivan <csully@google.com>
>
> During TX, skbs' data addresses are dma_map'ed and passed to the NIC.
> This means that the device can perform DMA directly from these addresses
> and the driver does not have to copy the buffer content into
> pre-allocated buffers/qpls (as in qpl mode).
>
> Reviewed-by: Yangchun Fu <yangchun@google.com>
> Signed-off-by: Catherine Sullivan <csully@google.com>
> Signed-off-by: David Awogbemila <awogbemila@google.com>
> ---
>  drivers/net/ethernet/google/gve/gve.h        |  15 +-
>  drivers/net/ethernet/google/gve/gve_adminq.c |   4 +-
>  drivers/net/ethernet/google/gve/gve_desc.h   |   8 +-
>  drivers/net/ethernet/google/gve/gve_tx.c     | 207 +++++++++++++++----
>  4 files changed, 192 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> index 9dcf9fd8d128..a7c77950bb84 100644
> --- a/drivers/net/ethernet/google/gve/gve.h
> +++ b/drivers/net/ethernet/google/gve/gve.h
> @@ -110,12 +110,20 @@ struct gve_tx_iovec {
>         u32 iov_padding; /* padding associated with this segment */
>  };
>
> +struct gve_tx_dma_buf {
> +       DEFINE_DMA_UNMAP_ADDR(dma);
> +       DEFINE_DMA_UNMAP_LEN(len);
> +};
> +
>  /* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
>   * ring entry but only used for a pkt_desc not a seg_desc
>   */
>  struct gve_tx_buffer_state {
>         struct sk_buff *skb; /* skb for this pkt */
> -       struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
> +       union {
> +               struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
> +               struct gve_tx_dma_buf buf;
> +       };
>  };
>
>  /* A TX buffer - each queue has one */
> @@ -138,13 +146,16 @@ struct gve_tx_ring {
>         __be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
>         u64 pkt_done; /* free-running - total packets completed */
>         u64 bytes_done; /* free-running - total bytes completed */
> +       u32 dropped_pkt; /* free-running - total packets dropped */
>
>         /* Cacheline 2 -- Read-mostly fields */
>         union gve_tx_desc *desc ____cacheline_aligned;
>         struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
>         struct netdev_queue *netdev_txq;
>         struct gve_queue_resources *q_resources; /* head and tail pointer idx */
> +       struct device *dev;
>         u32 mask; /* masks req and done down to queue size */
> +       bool raw_addressing; /* use raw_addressing? */

Again no bool in structures as the size is architecture dependent. Go
with a u8 instead.

>
>         /* Slow-path fields */
>         u32 q_num ____cacheline_aligned; /* queue idx */
> @@ -440,7 +451,7 @@ static inline u32 gve_rx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
>   */
>  static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
>  {
> -       return priv->tx_cfg.num_queues;
> +       return priv->raw_addressing ? 0 : priv->tx_cfg.num_queues;
>  }
>
>  /* Returns the number of rx queue page lists
> diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> index 711d135c6b1a..63358020658e 100644
> --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> @@ -330,8 +330,10 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
>  {
>         struct gve_tx_ring *tx = &priv->tx[queue_index];
>         union gve_adminq_command cmd;
> +       u32 qpl_id;
>         int err;
>
> +       qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : tx->tx_fifo.qpl->id;
>         memset(&cmd, 0, sizeof(cmd));
>         cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
>         cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
> @@ -340,7 +342,7 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
>                 .queue_resources_addr =
>                         cpu_to_be64(tx->q_resources_bus),
>                 .tx_ring_addr = cpu_to_be64(tx->bus),
> -               .queue_page_list_id = cpu_to_be32(tx->tx_fifo.qpl->id),
> +               .queue_page_list_id = cpu_to_be32(qpl_id),
>                 .ntfy_id = cpu_to_be32(tx->ntfy_id),
>         };
>
> diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
> index 0aad314aefaf..a7da364e81c8 100644
> --- a/drivers/net/ethernet/google/gve/gve_desc.h
> +++ b/drivers/net/ethernet/google/gve/gve_desc.h
> @@ -16,9 +16,11 @@
>   * Base addresses encoded in seg_addr are not assumed to be physical
>   * addresses. The ring format assumes these come from some linear address
>   * space. This could be physical memory, kernel virtual memory, user virtual
> - * memory. gVNIC uses lists of registered pages. Each queue is assumed
> - * to be associated with a single such linear address space to ensure a
> - * consistent meaning for seg_addrs posted to its rings.
> + * memory.
> + * If raw dma addressing is not supported then gVNIC uses lists of registered
> + * pages. Each queue is assumed to be associated with a single such linear
> + * address space to ensure a consistent meaning for seg_addrs posted to its
> + * rings.
>   */
>
>  struct gve_tx_pkt_desc {
> diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
> index d0244feb0301..26ff1d4e4f25 100644
> --- a/drivers/net/ethernet/google/gve/gve_tx.c
> +++ b/drivers/net/ethernet/google/gve/gve_tx.c
> @@ -158,9 +158,11 @@ static void gve_tx_free_ring(struct gve_priv *priv, int idx)
>                           tx->q_resources, tx->q_resources_bus);
>         tx->q_resources = NULL;
>
> -       gve_tx_fifo_release(priv, &tx->tx_fifo);
> -       gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
> -       tx->tx_fifo.qpl = NULL;
> +       if (!tx->raw_addressing) {
> +               gve_tx_fifo_release(priv, &tx->tx_fifo);
> +               gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
> +               tx->tx_fifo.qpl = NULL;
> +       }
>
>         bytes = sizeof(*tx->desc) * slots;
>         dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
> @@ -206,11 +208,15 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
>         if (!tx->desc)
>                 goto abort_with_info;
>
> -       tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
> +       tx->raw_addressing = priv->raw_addressing;
> +       tx->dev = &priv->pdev->dev;
> +       if (!tx->raw_addressing) {
> +               tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
>
> -       /* map Tx FIFO */
> -       if (gve_tx_fifo_init(priv, &tx->tx_fifo))
> -               goto abort_with_desc;
> +               /* map Tx FIFO */
> +               if (gve_tx_fifo_init(priv, &tx->tx_fifo))
> +                       goto abort_with_desc;
> +       }
>
>         tx->q_resources =
>                 dma_alloc_coherent(hdev,
> @@ -228,7 +234,8 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
>         return 0;
>
>  abort_with_fifo:
> -       gve_tx_fifo_release(priv, &tx->tx_fifo);
> +       if (!tx->raw_addressing)
> +               gve_tx_fifo_release(priv, &tx->tx_fifo);
>  abort_with_desc:
>         dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
>         tx->desc = NULL;
> @@ -301,27 +308,47 @@ static inline int gve_skb_fifo_bytes_required(struct gve_tx_ring *tx,
>         return bytes;
>  }
>
> -/* The most descriptors we could need are 3 - 1 for the headers, 1 for
> - * the beginning of the payload at the end of the FIFO, and 1 if the
> - * payload wraps to the beginning of the FIFO.
> +/* The most descriptors we could need is MAX_SKB_FRAGS + 3 : 1 for each skb frag,
> + * +1 for the skb linear portion, +1 for when tcp hdr needs to be in separate descriptor,
> + * and +1 if the payload wraps to the beginning of the FIFO.
>   */
> -#define MAX_TX_DESC_NEEDED     3
> +#define MAX_TX_DESC_NEEDED     (MAX_SKB_FRAGS + 3)
> +static void gve_tx_unmap_buf(struct device *dev, struct gve_tx_buffer_state *info)
> +{
> +       if (info->skb) {
> +               dma_unmap_single(dev, dma_unmap_addr(&info->buf, dma),
> +                                dma_unmap_len(&info->buf, len),
> +                                DMA_TO_DEVICE);
> +               dma_unmap_len_set(&info->buf, len, 0);
> +       } else {
> +               dma_unmap_page(dev, dma_unmap_addr(&info->buf, dma),
> +                              dma_unmap_len(&info->buf, len),
> +                              DMA_TO_DEVICE);
> +               dma_unmap_len_set(&info->buf, len, 0);
> +       }
> +}
>
>  /* Check if sufficient resources (descriptor ring space, FIFO space) are
>   * available to transmit the given number of bytes.
>   */
>  static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
>  {
> -       return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED &&
> -               gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required));
> +       bool can_alloc = true;
> +
> +       if (!tx->raw_addressing)
> +               can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
> +
> +       return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
>  }
>
>  /* Stops the queue if the skb cannot be transmitted. */
>  static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
>  {
> -       int bytes_required;
> +       int bytes_required = 0;
> +
> +       if (!tx->raw_addressing)
> +               bytes_required = gve_skb_fifo_bytes_required(tx, skb);
>
> -       bytes_required = gve_skb_fifo_bytes_required(tx, skb);
>         if (likely(gve_can_tx(tx, bytes_required)))
>                 return 0;
>
> @@ -390,22 +417,22 @@ static void gve_tx_fill_seg_desc(union gve_tx_desc *seg_desc,
>         seg_desc->seg.seg_addr = cpu_to_be64(addr);
>  }
>
> -static void gve_dma_sync_for_device(struct device *dev, dma_addr_t *page_buses,
> +static void gve_dma_sync_for_device(struct device *dev,
> +                                   dma_addr_t *page_buses,
>                                     u64 iov_offset, u64 iov_len)
>  {
>         u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
>         u64 first_page = iov_offset / PAGE_SIZE;
> -       dma_addr_t dma;
>         u64 page;
>
>         for (page = first_page; page <= last_page; page++) {
> -               dma = page_buses[page];
> +               dma_addr_t dma = page_buses[page];
> +
>                 dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);

Why bother with the "dma" variable at all? Why not just refer directly
to "page_buses[page]"?

>         }
>  }
>
> -static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> -                         struct device *dev)
> +static int gve_tx_add_skb_copy(struct gve_priv *priv, struct gve_tx_ring *tx, struct sk_buff *skb)
>  {
>         int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
>         union gve_tx_desc *pkt_desc, *seg_desc;
> @@ -429,7 +456,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
>         hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
>                         skb_headlen(skb);
>
> -       info->skb =  skb;
> +       info->skb = skb;
>         /* We don't want to split the header, so if necessary, pad to the end
>          * of the fifo and then put the header at the beginning of the fifo.
>          */
> @@ -447,7 +474,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
>         skb_copy_bits(skb, 0,
>                       tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
>                       hlen);
> -       gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
> +       gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
>                                 info->iov[hdr_nfrags - 1].iov_offset,
>                                 info->iov[hdr_nfrags - 1].iov_len);
>         copy_offset = hlen;
> @@ -463,7 +490,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
>                 skb_copy_bits(skb, copy_offset,
>                               tx->tx_fifo.base + info->iov[i].iov_offset,
>                               info->iov[i].iov_len);
> -               gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
> +               gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
>                                         info->iov[i].iov_offset,
>                                         info->iov[i].iov_len);
>                 copy_offset += info->iov[i].iov_len;
> @@ -472,6 +499,100 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
>         return 1 + payload_nfrags;
>  }
>
> +static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
> +                                 struct sk_buff *skb)
> +{
> +       const struct skb_shared_info *shinfo = skb_shinfo(skb);
> +       int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
> +       union gve_tx_desc *pkt_desc, *seg_desc;
> +       struct gve_tx_buffer_state *info;
> +       bool is_gso = skb_is_gso(skb);
> +       u32 idx = tx->req & tx->mask;
> +       struct gve_tx_dma_buf *buf;
> +       int last_mapped = 0;
> +       u64 addr;
> +       u32 len;
> +       int i;
> +
> +       info = &tx->info[idx];
> +       pkt_desc = &tx->desc[idx];
> +
> +       l4_hdr_offset = skb_checksum_start_offset(skb);
> +       /* If the skb is gso, then we want only up to the tcp header in the first segment
> +        * to efficiently replicate on each segment otherwise we want the linear portion
> +        * of the skb (which will contain the checksum because skb->csum_start and
> +        * skb->csum_offset are given relative to skb->head) in the first segment.
> +        */
> +       hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
> +                       skb_headlen(skb);
> +       len = skb_headlen(skb);
> +
> +       info->skb =  skb;
> +
> +       addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
> +       if (unlikely(dma_mapping_error(tx->dev, addr))) {
> +               rtnl_lock();
> +               priv->dma_mapping_error++;
> +               rtnl_unlock();

Do you really need an rtnl_lock for updating this statistic? That
seems like a glaring issue to me.
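One common lockless alternative, sketched in userspace (not necessarily what this driver should do; `NUM_QUEUES` and the struct names are invented): make the counter per-ring so only that queue's xmit path ever writes it, and sum across rings on the slow stats-read path.

```c
#define NUM_QUEUES 4   /* arbitrary for the sketch */

/* Per-ring counter: only one CPU (this queue's xmit path) writes it,
 * so the hot path needs no lock at all.
 */
struct tx_ring_sketch { unsigned long long dma_mapping_error; };

struct priv_sketch { struct tx_ring_sketch tx[NUM_QUEUES]; };

static void tx_map_failed(struct tx_ring_sketch *tx)
{
	tx->dma_mapping_error++;   /* lockless per-queue increment */
}

/* Slow path (e.g. an ethtool stats read): sum across all rings. */
static unsigned long long total_dma_mapping_errors(const struct priv_sketch *p)
{
	unsigned long long total = 0;
	int i;

	for (i = 0; i < NUM_QUEUES; i++)
		total += p->tx[i].dma_mapping_error;
	return total;
}
```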

> +               goto drop;
> +       }
> +       buf = &info->buf;
> +       dma_unmap_len_set(buf, len, len);
> +       dma_unmap_addr_set(buf, dma, addr);
> +
> +       payload_nfrags = shinfo->nr_frags;
> +       if (hlen < len) {
> +               /* For gso the rest of the linear portion of the skb needs to
> +                * be in its own descriptor.
> +                */
> +               payload_nfrags++;
> +               gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
> +                                    1 + payload_nfrags, hlen, addr);
> +
> +               len -= hlen;
> +               addr += hlen;
> +               seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
> +               seg_idx_bias = 2;
> +               gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
> +       } else {
> +               seg_idx_bias = 1;
> +               gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
> +                                    1 + payload_nfrags, hlen, addr);
> +       }
> +       idx = (tx->req + seg_idx_bias) & tx->mask;
> +

This math with payload_nfrags and seg_idx_bias seems way more
complicated than it needs to be. Instead of having to reverse engineer
shinfo->nr_frags why not just go back to using that?

> +       for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
> +               const skb_frag_t *frag = &shinfo->frags[i];
> +
> +               seg_desc = &tx->desc[idx];
> +               len = skb_frag_size(frag);
> +               addr = skb_frag_dma_map(tx->dev, frag, 0, len, DMA_TO_DEVICE);
> +               if (unlikely(dma_mapping_error(tx->dev, addr))) {
> +                       priv->dma_mapping_error++;

Same variable as above, but no lock. Definitely something not right
about the rtnl_lock used above.


> +                       goto unmap_drop;
> +               }
> +               buf = &tx->info[idx].buf;
> +               tx->info[idx].skb = NULL;
> +               dma_unmap_len_set(buf, len, len);
> +               dma_unmap_addr_set(buf, dma, addr);
> +
> +               gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
> +               idx = (idx + 1) & tx->mask;

You could probably look at simplifying this by updating the index at
the start of the loop instead of the end. Since you know at least the
first buffer is used for the header, all following buffers for the
fragments would start at idx + 1, which would reduce your seg_idx_bias
to 0 and could also be used to get rid of that variable all
together.
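The indexing being suggested can be sketched like this (ring size and names invented for the sketch, and the split-linear-segment case that motivated seg_idx_bias is omitted): bump idx at the top of the loop so fragment i naturally lands at (req + 1 + i) & mask, with no bias variable to carry.

```c
#define RING_SIZE 8                    /* power of two, as in the driver */
#define RING_MASK (RING_SIZE - 1)

/* The header descriptor sits at (req & mask); advancing idx before
 * using it places fragment i at (req + 1 + i) & mask.
 */
static void fill_frag_slots(unsigned int req, int nfrags, unsigned int *slots)
{
	unsigned int idx = req & RING_MASK;   /* header descriptor slot */
	int i;

	for (i = 0; i < nfrags; i++) {
		idx = (idx + 1) & RING_MASK;  /* bump first, then use */
		slots[i] = idx;
	}
}
```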

> +       }
> +
> +       return 1 + payload_nfrags;
> +
> +unmap_drop:
> +       i--;
> +       for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
> +               idx = (tx->req + last_mapped) & tx->mask;
> +               gve_tx_unmap_buf(tx->dev, &tx->info[idx]);
> +       }
> +drop:
> +       tx->dropped_pkt++;
> +       return 0;
> +}
> +
>  netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
>  {
>         struct gve_priv *priv = netdev_priv(dev);
> @@ -490,12 +611,20 @@ netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
>                 gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
>                 return NETDEV_TX_BUSY;
>         }
> -       nsegs = gve_tx_add_skb(tx, skb, &priv->pdev->dev);
> -
> -       netdev_tx_sent_queue(tx->netdev_txq, skb->len);
> -       skb_tx_timestamp(skb);
> +       if (tx->raw_addressing)
> +               nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
> +       else
> +               nsegs = gve_tx_add_skb_copy(priv, tx, skb);
> +
> +       /* If the packet is getting sent, we need to update the skb */
> +       if (nsegs) {

Why are you still continuing if you aren't going to transmit anything?

> +               netdev_tx_sent_queue(tx->netdev_txq, skb->len);
> +               skb_tx_timestamp(skb);
> +       }
>
> -       /* give packets to NIC */
> +       /* Give packets to NIC. Even if this packet failed to send the doorbell
> +        * might need to be rung because of xmit_more.
> +        */
>         tx->req += nsegs;
>
>         if (!netif_xmit_stopped(tx->netdev_txq) && netdev_xmit_more())
> @@ -525,24 +654,30 @@ static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
>                 info = &tx->info[idx];
>                 skb = info->skb;
>
> +               /* Unmap the buffer */
> +               if (tx->raw_addressing)
> +                       gve_tx_unmap_buf(tx->dev, info);
>                 /* Mark as free */
>                 if (skb) {
>                         info->skb = NULL;
>                         bytes += skb->len;
>                         pkts++;
>                         dev_consume_skb_any(skb);
> -                       /* FIFO free */
> -                       for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
> -                               space_freed += info->iov[i].iov_len +
> -                                              info->iov[i].iov_padding;
> -                               info->iov[i].iov_len = 0;
> -                               info->iov[i].iov_padding = 0;

Instead of increasing the indent, you might look at moving the
tx->done++ up and then just doing a "continue" if you are using raw
addressing.
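As a rough sketch of that flattened shape (fake types stand in for gve_tx_buffer_state; this models only the FIFO accounting, not the unmap/skb work):

```c
#include <assert.h>

/* Sketch of the early-continue form: count the descriptor as done first,
 * skip the FIFO accounting entirely for raw addressing, and keep the QPL
 * path at one indent level. */
struct fake_info {
	int iov_len[2];
};

static int clean_space_freed(struct fake_info *info, int n, int raw_addressing)
{
	int space_freed = 0;
	int i, j;

	for (i = 0; i < n; i++) {
		/* tx->done++ would go here, before the mode check */
		if (raw_addressing)
			continue;	/* nothing to return to the FIFO */
		for (j = 0; j < 2; j++) {
			space_freed += info[i].iov_len[j];
			info[i].iov_len[j] = 0;
		}
	}
	return space_freed;
}
```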


> +                       if (!tx->raw_addressing) {
> +                               /* FIFO free */
> +                               for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
> +                                       space_freed += info->iov[i].iov_len +
> +                                                      info->iov[i].iov_padding;
> +                                       info->iov[i].iov_len = 0;
> +                                       info->iov[i].iov_padding = 0;
> +                               }
>                         }
>                 }
>                 tx->done++;
>         }
>
> -       gve_tx_free_fifo(&tx->tx_fifo, space_freed);
> +       if (!tx->raw_addressing)
> +               gve_tx_free_fifo(&tx->tx_fifo, space_freed);
>         u64_stats_update_begin(&tx->statss);
>         tx->bytes_done += bytes;
>         tx->pkt_done += pkts;
> --
> 2.29.2.222.g5d2a92d10f8-goog
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 3/4] gve: Rx Buffer Recycling
  2020-11-11 17:20   ` Alexander Duyck
@ 2020-11-18 22:50     ` David Awogbemila
  2020-11-19 16:55       ` Alexander Duyck
  0 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-18 22:50 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev

On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > This patch lets the driver reuse buffers that have been freed by the
> > networking stack.
> >
> > In the raw addressing case, this allows the driver to avoid allocating
> > new buffers.
> > In the qpl case, the driver can avoid copies.
> >
> > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > ---
> >  drivers/net/ethernet/google/gve/gve.h    |  10 +-
> >  drivers/net/ethernet/google/gve/gve_rx.c | 196 +++++++++++++++--------
> >  2 files changed, 131 insertions(+), 75 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > index a8c589dd14e4..9dcf9fd8d128 100644
> > --- a/drivers/net/ethernet/google/gve/gve.h
> > +++ b/drivers/net/ethernet/google/gve/gve.h
> > @@ -50,6 +50,7 @@ struct gve_rx_slot_page_info {
> >         struct page *page;
> >         void *page_address;
> >         u32 page_offset; /* offset to write to in page */
> > +       bool can_flip;
>
> Again avoid putting bool into structures. Preferred approach is to use a u8.

Ok.

>
> >  };
> >
> >  /* A list of pages registered with the device during setup and used by a queue
> > @@ -500,15 +501,6 @@ static inline enum dma_data_direction gve_qpl_dma_dir(struct gve_priv *priv,
> >                 return DMA_FROM_DEVICE;
> >  }
> >
> > -/* Returns true if the max mtu allows page recycling */
> > -static inline bool gve_can_recycle_pages(struct net_device *dev)
> > -{
> > -       /* We can't recycle the pages if we can't fit a packet into half a
> > -        * page.
> > -        */
> > -       return dev->max_mtu <= PAGE_SIZE / 2;
> > -}
> > -
> >  /* buffers */
> >  int gve_alloc_page(struct gve_priv *priv, struct device *dev,
> >                    struct page **page, dma_addr_t *dma,
> > diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
> > index 49646caf930c..ff28581f4427 100644
> > --- a/drivers/net/ethernet/google/gve/gve_rx.c
> > +++ b/drivers/net/ethernet/google/gve/gve_rx.c
> > @@ -287,8 +287,7 @@ static enum pkt_hash_types gve_rss_type(__be16 pkt_flags)
> >         return PKT_HASH_TYPE_L2;
> >  }
> >
> > -static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
> > -                                  struct net_device *dev,
> > +static struct sk_buff *gve_rx_copy(struct net_device *dev,
> >                                    struct napi_struct *napi,
> >                                    struct gve_rx_slot_page_info *page_info,
> >                                    u16 len)
> > @@ -306,10 +305,6 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
> >
> >         skb->protocol = eth_type_trans(skb, dev);
> >
> > -       u64_stats_update_begin(&rx->statss);
> > -       rx->rx_copied_pkt++;
> > -       u64_stats_update_end(&rx->statss);
> > -
> >         return skb;
> >  }
> >
> > @@ -334,22 +329,90 @@ static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
> >  {
> >         u64 addr = be64_to_cpu(data_ring->addr);
> >
> > +       /* "flip" to other packet buffer on this page */
> >         page_info->page_offset ^= PAGE_SIZE / 2;
> >         addr ^= PAGE_SIZE / 2;
> >         data_ring->addr = cpu_to_be64(addr);
> >  }
>
> So this is just adding a comment to existing code that was added in
> patch 2. Perhaps it should have been added in that patch.

Ok, I'll put it in the earlier patch.

>
> > +static bool gve_rx_can_flip_buffers(struct net_device *netdev)
> > +{
> > +#if PAGE_SIZE == 4096
> > +       /* We can't flip a buffer if we can't fit a packet
> > +        * into half a page.
> > +        */
>
> Seems like this comment is unnecessarily wrapped.

Ok, I'll adjust this.

>
> > +       return netdev->mtu + GVE_RX_PAD + ETH_HLEN <= PAGE_SIZE / 2;
> > +#else
> > +       /* PAGE_SIZE != 4096 - don't try to reuse */
> > +       return false;
> > +#endif
> > +}
> > +
>
> You may look at converting this over to a ternary statement just to save space.

Ok, I'll do that, thanks.
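For illustration, the ternary form could look roughly like this (parameterized so the sketch compiles outside the kernel; the function name and arguments are stand-ins for the real gve_rx_can_flip_buffers()):

```c
#include <assert.h>

/* Illustrative stand-in for gve_rx_can_flip_buffers() as one ternary:
 * flipping is only attempted on 4K pages, and only when a full packet
 * (MTU + rx pad + Ethernet header) fits in half a page. */
static int can_flip_buffers(unsigned int page_size, unsigned int mtu,
			    unsigned int rx_pad, unsigned int eth_hlen)
{
	return page_size == 4096 ?
	       mtu + rx_pad + eth_hlen <= page_size / 2 : 0;
}
```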

>
> > +static int gve_rx_can_recycle_buffer(struct page *page)
> > +{
> > +       int pagecount = page_count(page);
> > +
> > +       /* This page is not being used by any SKBs - reuse */
> > +       if (pagecount == 1)
> > +               return 1;
> > +       /* This page is still being used by an SKB - we can't reuse */
> > +       else if (pagecount >= 2)
> > +               return 0;
> > +       WARN(pagecount < 1, "Pagecount should never be < 1");
> > +       return -1;
> > +}
> > +
>
> So using a page count of 1 is expensive. Really, if you are going to do
> this, you should probably look at how we do it currently in ixgbe.
> Basically you want to batch the count updates to avoid expensive
> atomic operations per skb.

A separate patch will be coming along to change the way the driver
tracks page count.
I thought it would be better to have that reviewed separately since
it's a different issue from what this patch addresses.
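For context, the ixgbe scheme being referenced grabs a large batch of page references up front and tracks a local pagecnt_bias, so the per-packet hot path does plain integer math instead of atomics. A simplified model (plain ints stand in for the atomic page refcount; names are illustrative, not the actual gve or ixgbe code):

```c
#include <assert.h>

#define BIAS_MAX 65536

/* Model of batched refcounting: take BIAS_MAX references once, hand them
 * out by decrementing a local bias, and compare refcount against the bias
 * to see how many refs the stack still holds. One outstanding reference
 * is tolerated because the driver flips to the other half of the page. */
struct rx_buf {
	int refcount;		/* stands in for the page's atomic refcount */
	int pagecnt_bias;	/* references the driver still owns locally */
};

static void rx_buf_init(struct rx_buf *buf)
{
	buf->refcount = 1;			/* from page allocation */
	buf->refcount += BIAS_MAX - 1;		/* one batched ref grab */
	buf->pagecnt_bias = BIAS_MAX;
}

static void rx_buf_give_to_stack(struct rx_buf *buf)
{
	buf->pagecnt_bias--;	/* hot path: plain integer, no atomic */
}

static int rx_buf_can_reuse(const struct rx_buf *buf)
{
	/* references beyond our bias are held by the stack */
	return buf->refcount - buf->pagecnt_bias <= 1;
}
```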

>
> >  static struct sk_buff *
> >  gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
> >                       struct gve_rx_slot_page_info *page_info, u16 len,
> >                       struct napi_struct *napi,
> > -                     struct gve_rx_data_slot *data_slot)
> > +                     struct gve_rx_data_slot *data_slot, bool can_flip)
> >  {
> > -       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> > +       struct sk_buff *skb;
> >
> > +       skb = gve_rx_add_frags(napi, page_info, len);
>
> Why split this up? It seemed fine as it was.

It was based on a comment from v3 of the patchset.

>
> >         if (!skb)
> >                 return NULL;
> >
> > +       /* Optimistically stop the kernel from freeing the page by increasing
> > +        * the page bias. We will check the refcount in refill to determine if
> > +        * we need to alloc a new page.
> > +        */
> > +       get_page(page_info->page);
> > +       page_info->can_flip = can_flip;
> > +
>
> Why pass can_flip and page_info only to set it here? Also I don't get
> why you are taking an extra reference on the page without checking the
> can_flip variable. It seems like this should be set in the page_info
> before you call this function and then you call get_page if
> page_info->can_flip is true.

I think it's important to call get_page here, even for buffers we know
will not be flipped, so that if the skb does a put_page twice we would
not run into the WARNing in gve_rx_can_recycle_buffer when trying to
refill buffers.
(Also please note that a future patch changes the way the driver uses
page refcounts.)

>
> > +       return skb;
> > +}
> > +
> > +static struct sk_buff *
> > +gve_rx_qpl(struct device *dev, struct net_device *netdev,
> > +          struct gve_rx_ring *rx, struct gve_rx_slot_page_info *page_info,
> > +          u16 len, struct napi_struct *napi,
> > +          struct gve_rx_data_slot *data_slot, bool recycle)
> > +{
> > +       struct sk_buff *skb;
> > +
> > +       /* if raw_addressing mode is not enabled gvnic can only receive into
> > +        * registered segments. If the buffer can't be recycled, our only
> > +        * choice is to copy the data out of it so that we can return it to the
> > +        * device.
> > +        */
> > +       if (recycle) {
> > +               skb = gve_rx_add_frags(napi, page_info, len);
> > +               /* No point in recycling if we didn't get the skb */
> > +               if (skb) {
> > +                       /* Make sure the networking stack can't free the page */
> > +                       get_page(page_info->page);
> > +                       gve_rx_flip_buff(page_info, data_slot);
>
> It isn't about the stack freeing the page, it is about letting the
> buddy allocator know that, when the skb frees the page, we are still
> holding a reference to it, so it should not free the memory. The
> get_page is about what we are doing, not what the stack is doing.

Oh I see, thanks for the explanation. Perhaps a comment like:
/* Make sure that the page isn't freed. */
would be more correct.

>
> > +               }
> > +       } else {
> > +               skb = gve_rx_copy(netdev, napi, page_info, len);
> > +               if (skb) {
> > +                       u64_stats_update_begin(&rx->statss);
> > +                       rx->rx_copied_pkt++;
> > +                       u64_stats_update_end(&rx->statss);
> > +               }
> > +       }
> >         return skb;
> >  }
> >
> > @@ -363,7 +426,6 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >         struct gve_rx_data_slot *data_slot;
> >         struct sk_buff *skb = NULL;
> >         dma_addr_t page_bus;
> > -       int pagecount;
> >         u16 len;
> >
> >         /* drop this packet */
> > @@ -384,64 +446,37 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >         dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
> >                                 PAGE_SIZE, DMA_FROM_DEVICE);
> >
> > -       if (PAGE_SIZE == 4096) {
> > -               if (len <= priv->rx_copybreak) {
> > -                       /* Just copy small packets */
> > -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> > -                       u64_stats_update_begin(&rx->statss);
> > -                       rx->rx_copybreak_pkt++;
> > -                       u64_stats_update_end(&rx->statss);
> > -                       goto have_skb;
> > +       if (len <= priv->rx_copybreak) {
> > +               /* Just copy small packets */
> > +               skb = gve_rx_copy(dev, napi, page_info, len);
> > +               u64_stats_update_begin(&rx->statss);
> > +               rx->rx_copied_pkt++;
> > +               rx->rx_copybreak_pkt++;
> > +               u64_stats_update_end(&rx->statss);
>
> I would recommend just storing some local variables in the function
> for tracking copied and copybreak packets if possible and only
> updating them at the end of the function. Otherwise you are going to
> pay a heavy cost on non-64b systems. Also how many threads will be
> accessing these stats? If you have multiple threads all updating the
> copied_pkt for small packets it can lead to a heavy amount of cache
> thrash.

Thanks for pointing these out. I think the driver will run mostly on
64-bit systems and rarely on 32-bit systems.
Also, the potential cache thrashing issue certainly seems like
something we'd want to look into, but we're okay with the driver's
performance at present, and this patch doesn't actually change this
aspect of the driver's behavior, so perhaps it's best addressed in
another patch. This particular function doesn't loop, so we'd probably
need to do more than move the stats update to the end; that should
probably be done in a different patchset.
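For reference, in a path that does loop, the batching the reviewer describes would look roughly like this (plain counters stand in for the u64_stats seqcount machinery, and the "ring" is just an array of packet lengths):

```c
#include <assert.h>

/* Sketch of batched stat updates: accumulate in locals inside the loop
 * and publish once at the end, instead of taking
 * u64_stats_update_begin/end per packet. */
struct rx_stats {
	unsigned long copied_pkt;
	unsigned long copybreak_pkt;
};

static void process_ring(struct rx_stats *stats, const int *pkt_len,
			 int n, int copybreak)
{
	unsigned long copied = 0, small = 0;
	int i;

	for (i = 0; i < n; i++) {
		if (pkt_len[i] <= copybreak) {
			copied++;	/* copybreak packets are also copied */
			small++;
		}
	}
	/* u64_stats_update_begin(&rx->statss) would go here */
	stats->copied_pkt += copied;
	stats->copybreak_pkt += small;
	/* u64_stats_update_end(&rx->statss) */
}
```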




>
>
> > +       } else {
> > +               bool can_flip = gve_rx_can_flip_buffers(dev);
> > +               int recycle = 0;
> > +
> > +               if (can_flip) {
> > +                       recycle = gve_rx_can_recycle_buffer(page_info->page);
> > +                       if (recycle < 0) {
> > +                               if (!rx->data.raw_addressing)
> > +                                       gve_schedule_reset(priv);
> > +                               return false;
> > +                       }
> >                 }
> >                 if (rx->data.raw_addressing) {
> >                         skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> >                                                     page_info, len, napi,
> > -                                                    data_slot);
> > -                       goto have_skb;
> > -               }
> > -               if (unlikely(!gve_can_recycle_pages(dev))) {
> > -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> > -                       goto have_skb;
> > -               }
> > -               pagecount = page_count(page_info->page);
> > -               if (pagecount == 1) {
> > -                       /* No part of this page is used by any SKBs; we attach
> > -                        * the page fragment to a new SKB and pass it up the
> > -                        * stack.
> > -                        */
> > -                       skb = gve_rx_add_frags(napi, page_info, len);
> > -                       if (!skb) {
> > -                               u64_stats_update_begin(&rx->statss);
> > -                               rx->rx_skb_alloc_fail++;
> > -                               u64_stats_update_end(&rx->statss);
> > -                               return false;
> > -                       }
> > -                       /* Make sure the kernel stack can't release the page */
> > -                       get_page(page_info->page);
> > -                       /* "flip" to other packet buffer on this page */
> > -                       gve_rx_flip_buff(page_info, &rx->data.data_ring[idx]);
> > -               } else if (pagecount >= 2) {
> > -                       /* We have previously passed the other half of this
> > -                        * page up the stack, but it has not yet been freed.
> > -                        */
> > -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> > +                                                   data_slot,
> > +                                                   can_flip && recycle);
> >                 } else {
> > -                       WARN(pagecount < 1, "Pagecount should never be < 1");
> > -                       return false;
> > +                       skb = gve_rx_qpl(&priv->pdev->dev, dev, rx,
> > +                                        page_info, len, napi, data_slot,
> > +                                        can_flip && recycle);
> >                 }
> > -       } else {
> > -               if (rx->data.raw_addressing)
> > -                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> > -                                                   page_info, len, napi,
> > -                                                   data_slot);
> > -               else
> > -                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> >         }
> >
> > -have_skb:
> > -       /* We didn't manage to allocate an skb but we haven't had any
> > -        * reset worthy failures.
> > -        */
> >         if (!skb) {
> >                 u64_stats_update_begin(&rx->statss);
> >                 rx->rx_skb_alloc_fail++;
> > @@ -494,16 +529,45 @@ static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
> >
> >         while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
> >                 struct gve_rx_slot_page_info *page_info;
> > -               struct device *dev = &priv->pdev->dev;
> > -               struct gve_rx_data_slot *data_slot;
> >                 u32 idx = fill_cnt & rx->mask;
> >
> >                 page_info = &rx->data.page_info[idx];
> > -               data_slot = &rx->data.data_ring[idx];
> > -               gve_rx_free_buffer(dev, page_info, data_slot);
> > -               page_info->page = NULL;
> > -               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> > -                       break;
> > +               if (page_info->can_flip) {
> > +                       /* The other half of the page is free because it was
> > +                        * free when we processed the descriptor. Flip to it.
> > +                        */
> > +                       struct gve_rx_data_slot *data_slot =
> > +                                               &rx->data.data_ring[idx];
> > +
> > +                       gve_rx_flip_buff(page_info, data_slot);
> > +                       page_info->can_flip = false;
> > +               } else {
> > +                       /* It is possible that the networking stack has already
> > +                        * finished processing all outstanding packets in the buffer
> > +                        * and it can be reused.
> > +                        * Flipping is unnecessary here - if the networking stack still
> > +                        * owns half the page it is impossible to tell which half. Either
> > +                        * the whole page is free or it needs to be replaced.
> > +                        */
> > +                       int recycle = gve_rx_can_recycle_buffer(page_info->page);
> > +
> > +                       if (recycle < 0) {
> > +                               if (!rx->data.raw_addressing)
> > +                                       gve_schedule_reset(priv);
> > +                               return false;
> > +                       }
> > +                       if (!recycle) {
> > +                               /* We can't reuse the buffer - alloc a new one*/
> > +                               struct gve_rx_data_slot *data_slot =
> > +                                               &rx->data.data_ring[idx];
> > +                               struct device *dev = &priv->pdev->dev;
> > +
> > +                               gve_rx_free_buffer(dev, page_info, data_slot);
> > +                               page_info->page = NULL;
> > +                               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> > +                                       break;
> > +                       }
> > +               }
> >                 empty = false;
> >                 fill_cnt++;
> >         }
> > --
> > 2.29.2.222.g5d2a92d10f8-goog
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 1/4] gve: Add support for raw addressing device option
  2020-11-11 17:20   ` Alexander Duyck
@ 2020-11-18 23:14     ` David Awogbemila
  0 siblings, 0 replies; 19+ messages in thread
From: David Awogbemila @ 2020-11-18 23:14 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

> A few minor nits called out below. Otherwise it looks good to me.
>
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
>
> > ---
> >  drivers/net/ethernet/google/gve/gve.h        |  1 +
> >  drivers/net/ethernet/google/gve/gve_adminq.c | 64 ++++++++++++++++++++
> >  drivers/net/ethernet/google/gve/gve_adminq.h | 15 +++--
> >  drivers/net/ethernet/google/gve/gve_main.c   |  9 +++
> >  4 files changed, 85 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > index f5c80229ea96..80cdae06ee39 100644
> > --- a/drivers/net/ethernet/google/gve/gve.h
> > +++ b/drivers/net/ethernet/google/gve/gve.h
> > @@ -199,6 +199,7 @@ struct gve_priv {
> >         u64 num_registered_pages; /* num pages registered with NIC */
> >         u32 rx_copybreak; /* copy packets smaller than this */
> >         u16 default_num_queues; /* default num queues to set up */
> > +       bool raw_addressing; /* true if this dev supports raw addressing */
>
> The use of bool is generally frowned upon in structures if you care
> about the cache alignment. You should probably just make this a char
> or u8.

Thanks for the suggestions, I'll use u8.
>
> >
> >         struct gve_queue_config tx_cfg;
> >         struct gve_queue_config rx_cfg;
> > diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> > index 24ae6a28a806..3e6de659b274 100644
> > --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> > +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> > @@ -14,6 +14,18 @@
> >  #define GVE_ADMINQ_SLEEP_LEN           20
> >  #define GVE_MAX_ADMINQ_EVENT_COUNTER_CHECK     100
> >
> > +static inline
> > +struct gve_device_option *gve_get_next_option(struct gve_device_descriptor *descriptor,
> > +                                             struct gve_device_option *option)
> > +{
> > +       void *option_end, *descriptor_end;
> > +
> > +       option_end = (void *)option + sizeof(*option) + be16_to_cpu(option->option_length);
>
> If I am not mistaken you can make this statement more compact with the
> following:
>         option_end = (void *)(option + 1) + be16_to_cpu(option->option_length);
>
Yes, I'll use (option + 1) instead.

> > +       descriptor_end = (void *)descriptor + be16_to_cpu(descriptor->total_length);
> > +
> > +       return option_end > descriptor_end ? NULL : (struct gve_device_option *)option_end;
> > +}
> > +
> >  int gve_adminq_alloc(struct device *dev, struct gve_priv *priv)
> >  {
> >         priv->adminq = dma_alloc_coherent(dev, PAGE_SIZE,
> > @@ -460,11 +472,14 @@ int gve_adminq_destroy_rx_queues(struct gve_priv *priv, u32 num_queues)
> >  int gve_adminq_describe_device(struct gve_priv *priv)
> >  {
> >         struct gve_device_descriptor *descriptor;
> > +       struct gve_device_option *dev_opt;
> >         union gve_adminq_command cmd;
> >         dma_addr_t descriptor_bus;
> > +       u16 num_options;
> >         int err = 0;
> >         u8 *mac;
> >         u16 mtu;
> > +       int i;
> >
> >         memset(&cmd, 0, sizeof(cmd));
> >         descriptor = dma_alloc_coherent(&priv->pdev->dev, PAGE_SIZE,
> > @@ -518,6 +533,55 @@ int gve_adminq_describe_device(struct gve_priv *priv)
> >                 priv->rx_desc_cnt = priv->rx_pages_per_qpl;
> >         }
> >         priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
> > +       dev_opt = (void *)(descriptor + 1);
> > +
> > +       num_options = be16_to_cpu(descriptor->num_device_options);
> > +       for (i = 0; i < num_options; i++) {
> > +               u16 option_length = be16_to_cpu(dev_opt->option_length);
> > +               u16 option_id = be16_to_cpu(dev_opt->option_id);
> > +               struct gve_device_option *next_opt;
> > +
> > +               next_opt = gve_get_next_option(descriptor, dev_opt);
> > +               if (!next_opt) {
> > +                       dev_err(&priv->dev->dev,
> > +                               "options exceed device_descriptor's total length.\n");
> > +                       err = -EINVAL;
> > +                       goto free_device_descriptor;
> > +               }
> > +
> > +               switch (option_id) {
> > +               case GVE_DEV_OPT_ID_RAW_ADDRESSING:
> > +                       /* If the length or feature mask doesn't match,
> > +                        * continue without enabling the feature.
> > +                        */
> > +                       if (option_length != GVE_DEV_OPT_LEN_RAW_ADDRESSING ||
> > +                           dev_opt->feat_mask !=
> > +                           cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING)) {
> > +                               dev_warn(&priv->pdev->dev,
> > +                                        "Raw addressing option error:\n"
> > +                                        "      Expected: length=%d, feature_mask=%x.\n"
> > +                                        "      Actual: length=%d, feature_mask=%x.\n",
> > +                                        GVE_DEV_OPT_LEN_RAW_ADDRESSING,
> > +                                        cpu_to_be32(GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING),
> > +                                        option_length, dev_opt->feat_mask);
> > +                               priv->raw_addressing = false;
> > +                       } else {
> > +                               dev_info(&priv->pdev->dev,
> > +                                        "Raw addressing device option enabled.\n");
> > +                               priv->raw_addressing = true;
> > +                       }
> > +                       break;
> > +               default:
> > +                       /* If we don't recognize the option just continue
> > +                        * without doing anything.
> > +                        */
> > +                       dev_dbg(&priv->pdev->dev,
> > +                               "Unrecognized device option 0x%hx not enabled.\n",
> > +                               option_id);
> > +                       break;
> > +               }
> > +               dev_opt = next_opt;
>
> Is there any reason for having this switch statement as a part of the
> function instead of a function unto itself? Seems like you could take
> all the code in the switch statement and move it into a separate
> function and only need to pass priv and dev_opt. That way you could
> reduce the indentation a bit and help to make this a bit more readable
> by possibly not having to fold lines like you did in the if statement
> above.
Ok, I'll separate this into a different function.
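A hedged sketch of what that helper could look like, modeled with plain integers (the real version would take priv and the big-endian struct gve_device_option; the constants mirror the patch, but the function name and return convention are stand-ins):

```c
#include <assert.h>

#define OPT_ID_RAW_ADDRESSING		0x1
#define OPT_LEN_RAW_ADDRESSING		0x0
#define OPT_FEAT_MASK_RAW_ADDRESSING	0x0

/* One option per call: returns 1 if this option enables raw addressing,
 * 0 if it is unrecognized or malformed (mirroring the patch, which warns
 * and continues rather than failing the whole describe_device). */
static int parse_device_option(unsigned int id, unsigned int len,
			       unsigned int feat_mask)
{
	switch (id) {
	case OPT_ID_RAW_ADDRESSING:
		return len == OPT_LEN_RAW_ADDRESSING &&
		       feat_mask == OPT_FEAT_MASK_RAW_ADDRESSING;
	default:
		return 0;	/* skip unknown options */
	}
}
```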


>
>
>
> > +       }
> >
> >  free_device_descriptor:
> >         dma_free_coherent(&priv->pdev->dev, sizeof(*descriptor), descriptor,
> > diff --git a/drivers/net/ethernet/google/gve/gve_adminq.h b/drivers/net/ethernet/google/gve/gve_adminq.h
> > index 281de8326bc5..af5f586167bd 100644
> > --- a/drivers/net/ethernet/google/gve/gve_adminq.h
> > +++ b/drivers/net/ethernet/google/gve/gve_adminq.h
> > @@ -79,12 +79,17 @@ struct gve_device_descriptor {
> >
> >  static_assert(sizeof(struct gve_device_descriptor) == 40);
> >
> > -struct device_option {
> > -       __be32 option_id;
> > -       __be32 option_length;
> > +struct gve_device_option {
> > +       __be16 option_id;
> > +       __be16 option_length;
> > +       __be32 feat_mask;
> >  };
> >
> > -static_assert(sizeof(struct device_option) == 8);
> > +static_assert(sizeof(struct gve_device_option) == 8);
> > +
> > +#define GVE_DEV_OPT_ID_RAW_ADDRESSING 0x1
> > +#define GVE_DEV_OPT_LEN_RAW_ADDRESSING 0x0
> > +#define GVE_DEV_OPT_FEAT_MASK_RAW_ADDRESSING 0x0
> >
> >  struct gve_adminq_configure_device_resources {
> >         __be64 counter_array;
> > @@ -111,6 +116,8 @@ struct gve_adminq_unregister_page_list {
> >
> >  static_assert(sizeof(struct gve_adminq_unregister_page_list) == 4);
> >
> > +#define GVE_RAW_ADDRESSING_QPL_ID 0xFFFFFFFF
> > +
> >  struct gve_adminq_create_tx_queue {
> >         __be32 queue_id;
> >         __be32 reserved;
> > diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> > index 48a433154ce0..70685c10db0e 100644
> > --- a/drivers/net/ethernet/google/gve/gve_main.c
> > +++ b/drivers/net/ethernet/google/gve/gve_main.c
> > @@ -678,6 +678,10 @@ static int gve_alloc_qpls(struct gve_priv *priv)
> >         int i, j;
> >         int err;
> >
> > +       /* Raw addressing means no QPLs */
> > +       if (priv->raw_addressing)
> > +               return 0;
> > +
> >         priv->qpls = kvzalloc(num_qpls * sizeof(*priv->qpls), GFP_KERNEL);
> >         if (!priv->qpls)
> >                 return -ENOMEM;
> > @@ -718,6 +722,10 @@ static void gve_free_qpls(struct gve_priv *priv)
> >         int num_qpls = gve_num_tx_qpls(priv) + gve_num_rx_qpls(priv);
> >         int i;
> >
> > +       /* Raw addressing means no QPLs */
> > +       if (priv->raw_addressing)
> > +               return;
> > +
> >         kvfree(priv->qpl_cfg.qpl_id_map);
> >
> >         for (i = 0; i < num_qpls; i++)
> > @@ -1078,6 +1086,7 @@ static int gve_init_priv(struct gve_priv *priv, bool skip_describe_device)
> >         if (skip_describe_device)
> >                 goto setup_device;
> >
> > +       priv->raw_addressing = false;
> >         /* Get the initial information we need from the device */
> >         err = gve_adminq_describe_device(priv);
> >         if (err) {
> > --
> > 2.29.2.222.g5d2a92d10f8-goog
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path
  2020-11-11 17:20   ` Alexander Duyck
@ 2020-11-18 23:15     ` David Awogbemila
  2020-11-19 16:25       ` Alexander Duyck
  0 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-18 23:15 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > From: Catherine Sullivan <csully@google.com>
> >
> > Add support to use raw dma addresses in the rx path. Due to this new
> > support we can alloc a new buffer instead of making a copy.
> >
> > RX buffers are handed to the networking stack and are
> > re-allocated as needed, avoiding the need to use
> > skb_copy_to_linear_data() as in "qpl" mode.
> >
> > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > Signed-off-by: Catherine Sullivan <csully@google.com>
> > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > ---
> >  drivers/net/ethernet/google/gve/gve.h        |   6 +-
> >  drivers/net/ethernet/google/gve/gve_adminq.c |  14 +-
> >  drivers/net/ethernet/google/gve/gve_desc.h   |  10 +-
> >  drivers/net/ethernet/google/gve/gve_main.c   |   3 +-
> >  drivers/net/ethernet/google/gve/gve_rx.c     | 224 +++++++++++++++----
> >  5 files changed, 200 insertions(+), 57 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > index 80cdae06ee39..a8c589dd14e4 100644
> > --- a/drivers/net/ethernet/google/gve/gve.h
> > +++ b/drivers/net/ethernet/google/gve/gve.h
> > @@ -68,6 +68,7 @@ struct gve_rx_data_queue {
> >         dma_addr_t data_bus; /* dma mapping of the slots */
> >         struct gve_rx_slot_page_info *page_info; /* page info of the buffers */
> >         struct gve_queue_page_list *qpl; /* qpl assigned to this queue */
> > +       bool raw_addressing; /* use raw_addressing? */
> >  };
>
> Again, if you care about alignment at all in this structure you should
> probably use u8 instead of bool.

Thanks, I'll adjust this.

>
> >
> >  struct gve_priv;
> > @@ -82,6 +83,7 @@ struct gve_rx_ring {
> >         u32 cnt; /* free-running total number of completed packets */
> >         u32 fill_cnt; /* free-running total number of descs and buffs posted */
> >         u32 mask; /* masks the cnt and fill_cnt to the size of the ring */
> > +       u32 db_threshold; /* threshold for posting new buffs and descs */
> >         u64 rx_copybreak_pkt; /* free-running count of copybreak packets */
> >         u64 rx_copied_pkt; /* free-running total number of copied packets */
> >         u64 rx_skb_alloc_fail; /* free-running count of skb alloc fails */
> > @@ -194,7 +196,7 @@ struct gve_priv {
> >         u16 tx_desc_cnt; /* num desc per ring */
> >         u16 rx_desc_cnt; /* num desc per ring */
> >         u16 tx_pages_per_qpl; /* tx buffer length */
> > -       u16 rx_pages_per_qpl; /* rx buffer length */
> > +       u16 rx_data_slot_cnt; /* rx buffer length */
> >         u64 max_registered_pages;
> >         u64 num_registered_pages; /* num pages registered with NIC */
> >         u32 rx_copybreak; /* copy packets smaller than this */
> > @@ -444,7 +446,7 @@ static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
> >   */
> >  static inline u32 gve_num_rx_qpls(struct gve_priv *priv)
> >  {
> > -       return priv->rx_cfg.num_queues;
> > +       return priv->raw_addressing ? 0 : priv->rx_cfg.num_queues;
> >  }
> >
> >  /* Returns a pointer to the next available tx qpl in the list of qpls
> > diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> > index 3e6de659b274..711d135c6b1a 100644
> > --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> > +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> > @@ -369,8 +369,10 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
> >  {
> >         struct gve_rx_ring *rx = &priv->rx[queue_index];
> >         union gve_adminq_command cmd;
> > +       u32 qpl_id;
> >         int err;
> >
> > +       qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : rx->data.qpl->id;
> >         memset(&cmd, 0, sizeof(cmd));
> >         cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_RX_QUEUE);
> >         cmd.create_rx_queue = (struct gve_adminq_create_rx_queue) {
> > @@ -381,7 +383,7 @@ static int gve_adminq_create_rx_queue(struct gve_priv *priv, u32 queue_index)
> >                 .queue_resources_addr = cpu_to_be64(rx->q_resources_bus),
> >                 .rx_desc_ring_addr = cpu_to_be64(rx->desc.bus),
> >                 .rx_data_ring_addr = cpu_to_be64(rx->data.data_bus),
> > -               .queue_page_list_id = cpu_to_be32(rx->data.qpl->id),
> > +               .queue_page_list_id = cpu_to_be32(qpl_id),
> >         };
> >
> >         err = gve_adminq_issue_cmd(priv, &cmd);
> > @@ -526,11 +528,11 @@ int gve_adminq_describe_device(struct gve_priv *priv)
> >         mac = descriptor->mac;
> >         dev_info(&priv->pdev->dev, "MAC addr: %pM\n", mac);
> >         priv->tx_pages_per_qpl = be16_to_cpu(descriptor->tx_pages_per_qpl);
> > -       priv->rx_pages_per_qpl = be16_to_cpu(descriptor->rx_pages_per_qpl);
> > -       if (priv->rx_pages_per_qpl < priv->rx_desc_cnt) {
> > -               dev_err(&priv->pdev->dev, "rx_pages_per_qpl cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
> > -                       priv->rx_pages_per_qpl);
> > -               priv->rx_desc_cnt = priv->rx_pages_per_qpl;
> > +       priv->rx_data_slot_cnt = be16_to_cpu(descriptor->rx_pages_per_qpl);
> > +       if (priv->rx_data_slot_cnt < priv->rx_desc_cnt) {
> > +               dev_err(&priv->pdev->dev, "rx_data_slot_cnt cannot be smaller than rx_desc_cnt, setting rx_desc_cnt down to %d.\n",
> > +                       priv->rx_data_slot_cnt);
> > +               priv->rx_desc_cnt = priv->rx_data_slot_cnt;
> >         }
> >         priv->default_num_queues = be16_to_cpu(descriptor->default_num_queues);
> >         dev_opt = (void *)(descriptor + 1);
> > diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
> > index 54779871d52e..0aad314aefaf 100644
> > --- a/drivers/net/ethernet/google/gve/gve_desc.h
> > +++ b/drivers/net/ethernet/google/gve/gve_desc.h
> > @@ -72,12 +72,14 @@ struct gve_rx_desc {
> >  } __packed;
> >  static_assert(sizeof(struct gve_rx_desc) == 64);
> >
> > -/* As with the Tx ring format, the qpl_offset entries below are offsets into an
> > - * ordered list of registered pages.
> > +/* If the device supports raw dma addressing then the addr in data slot is
> > + * the dma address of the buffer.
> > + * If the device only supports registered segments than the addr is a byte
>
> s/than/then/

I'll correct this.

>
> > + * offset into the registered segment (an ordered list of pages) where the
> > + * buffer is.
> >   */
> >  struct gve_rx_data_slot {
> > -       /* byte offset into the rx registered segment of this slot */
> > -       __be64 qpl_offset;
> > +       __be64 addr;
> >  };
>
> So this is either the qpl_offset or an address correct? In such a case
> shouldn't this be a union? I'm just wanting to make sure that this
> isn't something where we are just calling it all addr now even though
> there are still cases where it is the qpl_offset.

Yeah, a union might be a better option here as it is still a
qpl_offset in some instances, as you point out. I'll adopt a union.
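For the record, a quick userspace sketch of what I have in mind (the
kernel's __be64 type is stubbed out with uint64_t here, so this is
illustrative only):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t __be64;	/* stand-in for the kernel type in this sketch */

/* A data slot holds either a raw DMA address or a QPL byte offset,
 * depending on the mode the queue was created in. A union keeps both
 * names available without changing the 8-byte on-wire format.
 */
struct gve_rx_data_slot {
	union {
		__be64 qpl_offset;	/* QPL mode: offset into the registered pages */
		__be64 addr;		/* raw addressing mode: DMA address of the buffer */
	};
};
```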

>
> >  /* GVE Recive Packet Descriptor Seq No */
> > diff --git a/drivers/net/ethernet/google/gve/gve_main.c b/drivers/net/ethernet/google/gve/gve_main.c
> > index 70685c10db0e..225e17dd1ae5 100644
> > --- a/drivers/net/ethernet/google/gve/gve_main.c
> > +++ b/drivers/net/ethernet/google/gve/gve_main.c
> > @@ -596,6 +596,7 @@ int gve_alloc_page(struct gve_priv *priv, struct device *dev,
> >         if (dma_mapping_error(dev, *dma)) {
> >                 priv->dma_mapping_error++;
> >                 put_page(*page);
> > +               *page = NULL;
> >                 return -ENOMEM;
> >         }
> >         return 0;
>
> This piece seems like a bug fix for the exception handling path.
> Should it be pulled out into a separate patch so that it could be
> backported if needed.

Ok, I'll take it away from this patch.

>
> > @@ -694,7 +695,7 @@ static int gve_alloc_qpls(struct gve_priv *priv)
> >         }
> >         for (; i < num_qpls; i++) {
> >                 err = gve_alloc_queue_page_list(priv, i,
> > -                                               priv->rx_pages_per_qpl);
> > +                                               priv->rx_data_slot_cnt);
> >                 if (err)
> >                         goto free_qpls;
> >         }
> > diff --git a/drivers/net/ethernet/google/gve/gve_rx.c b/drivers/net/ethernet/google/gve/gve_rx.c
> > index 008fa897a3e6..49646caf930c 100644
> > --- a/drivers/net/ethernet/google/gve/gve_rx.c
> > +++ b/drivers/net/ethernet/google/gve/gve_rx.c
> > @@ -16,12 +16,39 @@ static void gve_rx_remove_from_block(struct gve_priv *priv, int queue_idx)
> >         block->rx = NULL;
> >  }
> >
> > +static void gve_rx_free_buffer(struct device *dev,
> > +                              struct gve_rx_slot_page_info *page_info,
> > +                              struct gve_rx_data_slot *data_slot)
> > +{
> > +       dma_addr_t dma = (dma_addr_t)(be64_to_cpu(data_slot->addr) -
> > +                                     page_info->page_offset);
> > +
>
> Why bother with subtraction when you are adding the page offset via an
> xor? It seems like you should be able to just mask off the page offset
> and not need to bother to store it anywhere. Either the DMA address is
> page aligned in which case you can xor in and out the value and just
> use an &= operator to strip it, or it isn't in which case you should
> be using addition and subtraction everywhere and only bitflip page
> offset as a single bit somewhere.

Ok, I'll adjust this so that it just uses an & to obtain the page
address, and turn page_offset into a u8 which keeps track of the
flipped state.
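To double-check the arithmetic, here's a small userspace sketch of the
XOR-flip/mask scheme (plain uint64_t instead of dma_addr_t, values
illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* With page-aligned buffers the half-page offset lives entirely in one
 * bit of the address, so flipping between the two halves is an XOR and
 * recovering the page base is a mask -- no separate offset arithmetic.
 */
static uint64_t flip_half(uint64_t addr)
{
	return addr ^ (PAGE_SIZE / 2);
}

static uint64_t page_base(uint64_t addr)
{
	return addr & ~(PAGE_SIZE - 1);
}
```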

>
> > +       gve_free_page(dev, page_info->page, dma, DMA_FROM_DEVICE);
> > +}
> > +
> > +static void gve_rx_unfill_pages(struct gve_priv *priv, struct gve_rx_ring *rx)
> > +{
> > +       u32 slots = rx->mask + 1;
> > +       int i;
> > +
> > +       if (rx->data.raw_addressing) {
>
> The declarations of slots and i could be moved here since they aren't
> used anywhere else in this function.

Ok.

>
> > +               for (i = 0; i < slots; i++)
> > +                       gve_rx_free_buffer(&priv->pdev->dev, &rx->data.page_info[i],
> > +                                          &rx->data.data_ring[i]);
> > +       } else {
> > +               gve_unassign_qpl(priv, rx->data.qpl->id);
> > +               rx->data.qpl = NULL;
> > +       }
> > +       kvfree(rx->data.page_info);
> > +       rx->data.page_info = NULL;
> > +}
> > +
> >  static void gve_rx_free_ring(struct gve_priv *priv, int idx)
> >  {
> >         struct gve_rx_ring *rx = &priv->rx[idx];
> >         struct device *dev = &priv->pdev->dev;
> > +       u32 slots = rx->mask + 1;
> >         size_t bytes;
> > -       u32 slots;
> >
> >         gve_rx_remove_from_block(priv, idx);
> >
> > @@ -33,11 +60,8 @@ static void gve_rx_free_ring(struct gve_priv *priv, int idx)
> >                           rx->q_resources, rx->q_resources_bus);
> >         rx->q_resources = NULL;
> >
> > -       gve_unassign_qpl(priv, rx->data.qpl->id);
> > -       rx->data.qpl = NULL;
> > -       kvfree(rx->data.page_info);
> > +       gve_rx_unfill_pages(priv, rx);
> >
> > -       slots = rx->mask + 1;
> >         bytes = sizeof(*rx->data.data_ring) * slots;
> >         dma_free_coherent(dev, bytes, rx->data.data_ring,
> >                           rx->data.data_bus);
> > @@ -52,14 +76,36 @@ static void gve_setup_rx_buffer(struct gve_rx_slot_page_info *page_info,
> >         page_info->page = page;
> >         page_info->page_offset = 0;
> >         page_info->page_address = page_address(page);
> > -       slot->qpl_offset = cpu_to_be64(addr);
> > +       slot->addr = cpu_to_be64(addr);
> > +}
> > +
> > +static int gve_rx_alloc_buffer(struct gve_priv *priv, struct device *dev,
> > +                              struct gve_rx_slot_page_info *page_info,
> > +                              struct gve_rx_data_slot *data_slot,
> > +                              struct gve_rx_ring *rx)
> > +{
> > +       struct page *page;
> > +       dma_addr_t dma;
> > +       int err;
> > +
> > +       err = gve_alloc_page(priv, dev, &page, &dma, DMA_FROM_DEVICE);
> > +       if (err) {
> > +               u64_stats_update_begin(&rx->statss);
> > +               rx->rx_buf_alloc_fail++;
> > +               u64_stats_update_end(&rx->statss);
> > +               return err;
> > +       }
>
> This just feels really clumsy to me. Do the stats really need to be
> invoked in all cases? It seems like you could pull this code out and
> put it in an exception path somewhere rather than in the middle of the
> function. That way you don't need to carry around the rx ring which
> isn't used for anything else. So for example when you are prefilling
> the buffers it looks like you are already returning an error to the
> user. In such a case the stats update might be redundant as the
> interface wasn't allowed to come up in the first place.

Ok, avoiding the stats update during a failed attempt to bring the
interface up might be better.
I'll separate the code so we don't bother with the stats update when
the interface is coming up.

>
> > +
> > +       gve_setup_rx_buffer(page_info, data_slot, dma, page);
> > +       return 0;
> >  }
> >
> >  static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
> >  {
> >         struct gve_priv *priv = rx->gve;
> >         u32 slots;
> > -       int i;
> > +       int err;
> > +       int i, j;
> >
> >         /* Allocate one page per Rx queue slot. Each page is split into two
> >          * packet buffers, when possible we "page flip" between the two.
> > @@ -71,17 +117,32 @@ static int gve_prefill_rx_pages(struct gve_rx_ring *rx)
> >         if (!rx->data.page_info)
> >                 return -ENOMEM;
> >
> > -       rx->data.qpl = gve_assign_rx_qpl(priv);
> > -
> > +       if (!rx->data.raw_addressing)
> > +               rx->data.qpl = gve_assign_rx_qpl(priv);
> >         for (i = 0; i < slots; i++) {
> > -               struct page *page = rx->data.qpl->pages[i];
> > -               dma_addr_t addr = i * PAGE_SIZE;
> > -
> > -               gve_setup_rx_buffer(&rx->data.page_info[i],
> > -                                   &rx->data.data_ring[i], addr, page);
> > +               struct page *page;
> > +               dma_addr_t addr;
> > +
> > +               if (!rx->data.raw_addressing) {
> > +                       page = rx->data.qpl->pages[i];
> > +                       addr = i * PAGE_SIZE;
> > +                       gve_setup_rx_buffer(&rx->data.page_info[i],
> > +                                           &rx->data.data_ring[i], addr, page);
> > +                       continue;
> > +               }
> > +               err = gve_rx_alloc_buffer(priv, &priv->pdev->dev, &rx->data.page_info[i],
> > +                                         &rx->data.data_ring[i], rx);
> > +               if (err)
> > +                       goto alloc_err;
> >         }
> >
> >         return slots;
> > +alloc_err:
> > +       for (j = 0; j < i; j++)
> > +               gve_rx_free_buffer(&priv->pdev->dev,
> > +                                  &rx->data.page_info[j],
> > +                                  &rx->data.data_ring[j]);
>
> Instead of adding another variable you could just change this loop to
> be based on "while(i--)" and then use i as you work your way backwards
> through buffers you previously allocated.

Ok.
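For anyone following along, the unwind idiom in isolation (userspace
sketch, allocation simulated with an int array):

```c
#include <assert.h>
#include <stddef.h>

/* On a failure at index i, "while (i--)" walks back over exactly the
 * slots [0, i) that were successfully set up, reusing the loop counter
 * instead of a second variable.
 */
static int fill(int *slots, size_t n, size_t fail_at)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (i == fail_at)
			goto unwind;	/* simulated allocation failure */
		slots[i] = 1;		/* "allocated" */
	}
	return 0;
unwind:
	while (i--)
		slots[i] = 0;		/* free in reverse order */
	return -1;
}
```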

>
> > +       return err;
> >  }
> >
> >  static void gve_rx_add_to_block(struct gve_priv *priv, int queue_idx)
> > @@ -110,8 +171,9 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
> >         rx->gve = priv;
> >         rx->q_num = idx;
> >
> > -       slots = priv->rx_pages_per_qpl;
> > +       slots = priv->rx_data_slot_cnt;
> >         rx->mask = slots - 1;
> > +       rx->data.raw_addressing = priv->raw_addressing;
> >
> >         /* alloc rx data ring */
> >         bytes = sizeof(*rx->data.data_ring) * slots;
> > @@ -156,8 +218,8 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
> >                 err = -ENOMEM;
> >                 goto abort_with_q_resources;
> >         }
> > -       rx->mask = slots - 1;
> >         rx->cnt = 0;
> > +       rx->db_threshold = priv->rx_desc_cnt / 2;
> >         rx->desc.seqno = 1;
> >         gve_rx_add_to_block(priv, idx);
> >
> > @@ -168,7 +230,7 @@ static int gve_rx_alloc_ring(struct gve_priv *priv, int idx)
> >                           rx->q_resources, rx->q_resources_bus);
> >         rx->q_resources = NULL;
> >  abort_filled:
> > -       kvfree(rx->data.page_info);
> > +       gve_rx_unfill_pages(priv, rx);
> >  abort_with_slots:
> >         bytes = sizeof(*rx->data.data_ring) * slots;
> >         dma_free_coherent(hdev, bytes, rx->data.data_ring, rx->data.data_bus);
> > @@ -251,8 +313,7 @@ static struct sk_buff *gve_rx_copy(struct gve_rx_ring *rx,
> >         return skb;
> >  }
> >
> > -static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
> > -                                       struct napi_struct *napi,
> > +static struct sk_buff *gve_rx_add_frags(struct napi_struct *napi,
> >                                         struct gve_rx_slot_page_info *page_info,
> >                                         u16 len)
> >  {
> > @@ -271,11 +332,25 @@ static struct sk_buff *gve_rx_add_frags(struct net_device *dev,
> >  static void gve_rx_flip_buff(struct gve_rx_slot_page_info *page_info,
> >                              struct gve_rx_data_slot *data_ring)
> >  {
> > -       u64 addr = be64_to_cpu(data_ring->qpl_offset);
> > +       u64 addr = be64_to_cpu(data_ring->addr);
> >
> >         page_info->page_offset ^= PAGE_SIZE / 2;
> >         addr ^= PAGE_SIZE / 2;
> > -       data_ring->qpl_offset = cpu_to_be64(addr);
> > +       data_ring->addr = cpu_to_be64(addr);
>
> This code is far more inefficient than it needs to be. Instead of byte
> swapping addr it should be byte swapping PAGE_SIZE / 2 since it is a
> constant. A bitwise operation should work fine as long as you are only
> performing it on two be64 values.

Ok, I'll byte swap the constant instead. Thanks for pointing this out.
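The identity that makes this safe, sketched in userspace (an
open-coded byte swap standing in for the kernel's swab64):

```c
#include <assert.h>
#include <stdint.h>

/* Byte swapping distributes over XOR: bswap(a ^ b) == bswap(a) ^ bswap(b).
 * So instead of converting addr to host order, toggling, and converting
 * back, the stored big-endian value can be XORed directly with the
 * pre-swapped constant, i.e. cpu_to_be64(PAGE_SIZE / 2).
 */
static uint64_t bswap64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) | ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) | ((v >> 16) & 0x0000ffff0000ffffULL);
	return (v << 32) | (v >> 32);
}
```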

>
> Also as I mentioned before if you are just using the xor on the
> address directly then you don't need the page offset as you can use a
> mask to pull it out of the addr.
>
> > +}
> > +
> > +static struct sk_buff *
> > +gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
> > +                     struct gve_rx_slot_page_info *page_info, u16 len,
> > +                     struct napi_struct *napi,
> > +                     struct gve_rx_data_slot *data_slot)
> > +{
> > +       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> > +
> > +       if (!skb)
> > +               return NULL;
> > +
> > +       return skb;
> >  }
> >
>
> I'm not sure I see the point of this function. It seems like you
> should just replace all references to it with gve_rx_add_frags until
> you actually have something else going on here. I am assuming this is
> addressed in a later patch.

Yes, a later patch adds more to this function. I think you're right
and we can simply use gve_rx_add_frags, so I'll adjust this patch
accordingly.

>
>
> >  static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> > @@ -285,7 +360,9 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >         struct gve_priv *priv = rx->gve;
> >         struct napi_struct *napi = &priv->ntfy_blocks[rx->ntfy_id].napi;
> >         struct net_device *dev = priv->dev;
> > -       struct sk_buff *skb;
> > +       struct gve_rx_data_slot *data_slot;
> > +       struct sk_buff *skb = NULL;
> > +       dma_addr_t page_bus;
> >         int pagecount;
> >         u16 len;
> >
> > @@ -294,18 +371,18 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >                 u64_stats_update_begin(&rx->statss);
> >                 rx->rx_desc_err_dropped_pkt++;
> >                 u64_stats_update_end(&rx->statss);
> > -               return true;
> > +               return false;
> >         }
> >
> >         len = be16_to_cpu(rx_desc->len) - GVE_RX_PAD;
> >         page_info = &rx->data.page_info[idx];
> > -       dma_sync_single_for_cpu(&priv->pdev->dev, rx->data.qpl->page_buses[idx],
> > -                               PAGE_SIZE, DMA_FROM_DEVICE);
> >
> > -       /* gvnic can only receive into registered segments. If the buffer
> > -        * can't be recycled, our only choice is to copy the data out of
> > -        * it so that we can return it to the device.
> > -        */
> > +       data_slot = &rx->data.data_ring[idx];
> > +       page_bus = (rx->data.raw_addressing) ?
> > +                                       be64_to_cpu(data_slot->addr) - page_info->page_offset :
> > +                                       rx->data.qpl->page_buses[idx];
> > +       dma_sync_single_for_cpu(&priv->pdev->dev, page_bus,
> > +                               PAGE_SIZE, DMA_FROM_DEVICE);
> >
> >         if (PAGE_SIZE == 4096) {
> >                 if (len <= priv->rx_copybreak) {
> > @@ -316,6 +393,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >                         u64_stats_update_end(&rx->statss);
> >                         goto have_skb;
> >                 }
> > +               if (rx->data.raw_addressing) {
> > +                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> > +                                                   page_info, len, napi,
> > +                                                    data_slot);
> > +                       goto have_skb;
> > +               }
> >                 if (unlikely(!gve_can_recycle_pages(dev))) {
> >                         skb = gve_rx_copy(rx, dev, napi, page_info, len);
> >                         goto have_skb;
> > @@ -326,12 +409,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >                          * the page fragment to a new SKB and pass it up the
> >                          * stack.
> >                          */
> > -                       skb = gve_rx_add_frags(dev, napi, page_info, len);
> > +                       skb = gve_rx_add_frags(napi, page_info, len);
> >                         if (!skb) {
> >                                 u64_stats_update_begin(&rx->statss);
> >                                 rx->rx_skb_alloc_fail++;
> >                                 u64_stats_update_end(&rx->statss);
> > -                               return true;
> > +                               return false;
> >                         }
> >                         /* Make sure the kernel stack can't release the page */
> >                         get_page(page_info->page);
> > @@ -347,7 +430,12 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >                         return false;
> >                 }
> >         } else {
> > -               skb = gve_rx_copy(rx, dev, napi, page_info, len);
> > +               if (rx->data.raw_addressing)
> > +                       skb = gve_rx_raw_addressing(&priv->pdev->dev, dev,
> > +                                                   page_info, len, napi,
> > +                                                   data_slot);
> > +               else
> > +                       skb = gve_rx_copy(rx, dev, napi, page_info, len);
> >         }
> >
> >  have_skb:
> > @@ -358,7 +446,7 @@ static bool gve_rx(struct gve_rx_ring *rx, struct gve_rx_desc *rx_desc,
> >                 u64_stats_update_begin(&rx->statss);
> >                 rx->rx_skb_alloc_fail++;
> >                 u64_stats_update_end(&rx->statss);
> > -               return true;
> > +               return false;
> >         }
> >
> >         if (likely(feat & NETIF_F_RXCSUM)) {
> > @@ -399,19 +487,45 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
> >         return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
> >  }
> >
> > +static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
> > +{
> > +       bool empty = rx->fill_cnt == rx->cnt;
> > +       u32 fill_cnt = rx->fill_cnt;
> > +
> > +       while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
>
> So one question I would have is why do you need to mask fill_cnt and
> cnt here, but not above? Something doesn't match up.

fill_cnt and cnt are both free-running uints, with fill_cnt generally
greater than cnt: fill_cnt tracks freed/available buffers while cnt
tracks used buffers. The difference between "fill_cnt == cnt" and
"(fill_cnt & rx->mask) == (cnt & rx->mask)" matters when all the
buffers are completely used up. If all the descriptors are used up
("fill_cnt == cnt") when we attempt to refill buffers, the right-hand
side of the while loop's OR condition, "(fill_cnt & rx->mask) !=
(rx->cnt & rx->mask)", will be false and we would never get to refill
the queue's buffers.
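A userspace sketch of the counters involved, to make the full-vs-empty
ambiguity concrete (ring size illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 8U
#define RING_MASK (RING_SIZE - 1)

/* Both counters are free-running: they only ever increase and are
 * masked down to a slot index on access. fill_cnt == cnt means empty,
 * but the masked values are *also* equal when the ring is completely
 * full (fill_cnt == cnt + RING_SIZE) -- hence the extra "empty ||"
 * term in the refill loop's condition.
 */
static int ring_empty(uint32_t fill_cnt, uint32_t cnt)
{
	return fill_cnt == cnt;
}

static uint32_t slot(uint32_t counter)
{
	return counter & RING_MASK;
}
```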

>
> > +               struct gve_rx_slot_page_info *page_info;
> > +               struct device *dev = &priv->pdev->dev;
> > +               struct gve_rx_data_slot *data_slot;
> > +               u32 idx = fill_cnt & rx->mask;
> > +
> > +               page_info = &rx->data.page_info[idx];
> > +               data_slot = &rx->data.data_ring[idx];
> > +               gve_rx_free_buffer(dev, page_info, data_slot);
> > +               page_info->page = NULL;
> > +               if (gve_rx_alloc_buffer(priv, dev, page_info, data_slot, rx))
> > +                       break;
> > +               empty = false;
> > +               fill_cnt++;
> > +       }
> > +       rx->fill_cnt = fill_cnt;
> > +       return true;
> > +}
> > +
> >  bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
> >                        netdev_features_t feat)
> >  {
> >         struct gve_priv *priv = rx->gve;
> > +       u32 work_done = 0, packets = 0;
> >         struct gve_rx_desc *desc;
> >         u32 cnt = rx->cnt;
> >         u32 idx = cnt & rx->mask;
> > -       u32 work_done = 0;
> >         u64 bytes = 0;
> >
> >         desc = rx->desc.desc_ring + idx;
> >         while ((GVE_SEQNO(desc->flags_seq) == rx->desc.seqno) &&
> >                work_done < budget) {
> > +               bool dropped;
> > +
> >                 netif_info(priv, rx_status, priv->dev,
> >                            "[%d] idx=%d desc=%p desc->flags_seq=0x%x\n",
> >                            rx->q_num, idx, desc, desc->flags_seq);
> > @@ -419,9 +533,11 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
> >                            "[%d] seqno=%d rx->desc.seqno=%d\n",
> >                            rx->q_num, GVE_SEQNO(desc->flags_seq),
> >                            rx->desc.seqno);
> > -               bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
> > -               if (!gve_rx(rx, desc, feat, idx))
> > -                       gve_schedule_reset(priv);
> > +               dropped = !gve_rx(rx, desc, feat, idx);
> > +               if (!dropped) {
> > +                       bytes += be16_to_cpu(desc->len) - GVE_RX_PAD;
> > +                       packets++;
> > +               }
> >                 cnt++;
> >                 idx = cnt & rx->mask;
> >                 desc = rx->desc.desc_ring + idx;
> > @@ -429,15 +545,35 @@ bool gve_clean_rx_done(struct gve_rx_ring *rx, int budget,
> >                 work_done++;
> >         }
> >
> > -       if (!work_done)
> > +       if (!work_done && rx->fill_cnt - cnt > rx->db_threshold) {
> >                 return false;
>
> Since you are returning here you don't really need the else below. You
> can just drop the braces and use an if instead. However since
> everything you are doing here can be done even if work_done is 0 I
> would say to just drop the elseif entirely and instead leave the code
> as it was.
>
> > +       } else if (work_done) {
> > +               u64_stats_update_begin(&rx->statss);
> > +               rx->rpackets += packets;
> > +               rx->rbytes += bytes;
> > +               u64_stats_update_end(&rx->statss);
> > +               rx->cnt = cnt;
> > +       }
>
> The block below seems much better than optimizing for an unlikely case
> where there is no work done.

Ok, I'll use just an if and remove the elseif.
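The raw-addressing restock decision, pulled out into a userspace
sketch (free-running u32 counters, so the subtraction stays correct
across wraparound; values illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Only refill when the number of posted-but-unconsumed buffers
 * (fill_cnt - cnt) has dropped to the doorbell threshold; unsigned
 * subtraction gives the right distance even after fill_cnt wraps.
 */
static int needs_refill(uint32_t fill_cnt, uint32_t cnt, uint32_t db_threshold)
{
	return fill_cnt - cnt <= db_threshold;
}
```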
>
>
> > -       u64_stats_update_begin(&rx->statss);
> > -       rx->rpackets += work_done;
> > -       rx->rbytes += bytes;
> > -       u64_stats_update_end(&rx->statss);
> > -       rx->cnt = cnt;
> > -       rx->fill_cnt += work_done;
> > +       /* restock ring slots */
> > +       if (!rx->data.raw_addressing) {
> > +               /* In QPL mode buffs are refilled as the desc are processed */
> > +               rx->fill_cnt += work_done;
> > +       } else if (rx->fill_cnt - cnt <= rx->db_threshold) {
> > +               /* In raw addressing mode buffs are only refilled if the avail
> > +                * falls below a threshold.
> > +                */
> > +               if (!gve_rx_refill_buffers(priv, rx))
> > +                       return false;
> > +
> > +               /* If we were not able to completely refill buffers, we'll want
> > +                * to schedule this queue for work again to refill buffers.
> > +                */
> > +               if (rx->fill_cnt - cnt <= rx->db_threshold) {
> > +                       gve_rx_write_doorbell(priv, rx);
> > +                       return true;
> > +               }
> > +       }
> >
> >         gve_rx_write_doorbell(priv, rx);
> >         return gve_rx_work_pending(rx);
> > --
> > 2.29.2.222.g5d2a92d10f8-goog
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path
  2020-11-11 17:20   ` Alexander Duyck
@ 2020-11-18 23:16     ` David Awogbemila
  2020-11-19 16:33       ` Alexander Duyck
  0 siblings, 1 reply; 19+ messages in thread
From: David Awogbemila @ 2020-11-18 23:16 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Wed, Nov 11, 2020 at 9:29 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > From: Catherine Sullivan <csully@google.com>
> >
> > During TX, skbs' data addresses are dma_map'ed and passed to the NIC.
> > This means that the device can perform DMA directly from these addresses
> > and the driver does not have to copy the buffer content into
> > pre-allocated buffers/qpls (as in qpl mode).
> >
> > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > Signed-off-by: Catherine Sullivan <csully@google.com>
> > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > ---
> >  drivers/net/ethernet/google/gve/gve.h        |  15 +-
> >  drivers/net/ethernet/google/gve/gve_adminq.c |   4 +-
> >  drivers/net/ethernet/google/gve/gve_desc.h   |   8 +-
> >  drivers/net/ethernet/google/gve/gve_tx.c     | 207 +++++++++++++++----
> >  4 files changed, 192 insertions(+), 42 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve.h b/drivers/net/ethernet/google/gve/gve.h
> > index 9dcf9fd8d128..a7c77950bb84 100644
> > --- a/drivers/net/ethernet/google/gve/gve.h
> > +++ b/drivers/net/ethernet/google/gve/gve.h
> > @@ -110,12 +110,20 @@ struct gve_tx_iovec {
> >         u32 iov_padding; /* padding associated with this segment */
> >  };
> >
> > +struct gve_tx_dma_buf {
> > +       DEFINE_DMA_UNMAP_ADDR(dma);
> > +       DEFINE_DMA_UNMAP_LEN(len);
> > +};
> > +
> >  /* Tracks the memory in the fifo occupied by the skb. Mapped 1:1 to a desc
> >   * ring entry but only used for a pkt_desc not a seg_desc
> >   */
> >  struct gve_tx_buffer_state {
> >         struct sk_buff *skb; /* skb for this pkt */
> > -       struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
> > +       union {
> > +               struct gve_tx_iovec iov[GVE_TX_MAX_IOVEC]; /* segments of this pkt */
> > +               struct gve_tx_dma_buf buf;
> > +       };
> >  };
> >
> >  /* A TX buffer - each queue has one */
> > @@ -138,13 +146,16 @@ struct gve_tx_ring {
> >         __be32 last_nic_done ____cacheline_aligned; /* NIC tail pointer */
> >         u64 pkt_done; /* free-running - total packets completed */
> >         u64 bytes_done; /* free-running - total bytes completed */
> > +       u32 dropped_pkt; /* free-running - total packets dropped */
> >
> >         /* Cacheline 2 -- Read-mostly fields */
> >         union gve_tx_desc *desc ____cacheline_aligned;
> >         struct gve_tx_buffer_state *info; /* Maps 1:1 to a desc */
> >         struct netdev_queue *netdev_txq;
> >         struct gve_queue_resources *q_resources; /* head and tail pointer idx */
> > +       struct device *dev;
> >         u32 mask; /* masks req and done down to queue size */
> > +       bool raw_addressing; /* use raw_addressing? */
>
> Again no bool in structures as the size is architecture dependent. Go
> with a u8 instead.

Ok.

>
> >
> >         /* Slow-path fields */
> >         u32 q_num ____cacheline_aligned; /* queue idx */
> > @@ -440,7 +451,7 @@ static inline u32 gve_rx_idx_to_ntfy(struct gve_priv *priv, u32 queue_idx)
> >   */
> >  static inline u32 gve_num_tx_qpls(struct gve_priv *priv)
> >  {
> > -       return priv->tx_cfg.num_queues;
> > +       return priv->raw_addressing ? 0 : priv->tx_cfg.num_queues;
> >  }
> >
> >  /* Returns the number of rx queue page lists
> > diff --git a/drivers/net/ethernet/google/gve/gve_adminq.c b/drivers/net/ethernet/google/gve/gve_adminq.c
> > index 711d135c6b1a..63358020658e 100644
> > --- a/drivers/net/ethernet/google/gve/gve_adminq.c
> > +++ b/drivers/net/ethernet/google/gve/gve_adminq.c
> > @@ -330,8 +330,10 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
> >  {
> >         struct gve_tx_ring *tx = &priv->tx[queue_index];
> >         union gve_adminq_command cmd;
> > +       u32 qpl_id;
> >         int err;
> >
> > +       qpl_id = priv->raw_addressing ? GVE_RAW_ADDRESSING_QPL_ID : tx->tx_fifo.qpl->id;
> >         memset(&cmd, 0, sizeof(cmd));
> >         cmd.opcode = cpu_to_be32(GVE_ADMINQ_CREATE_TX_QUEUE);
> >         cmd.create_tx_queue = (struct gve_adminq_create_tx_queue) {
> > @@ -340,7 +342,7 @@ static int gve_adminq_create_tx_queue(struct gve_priv *priv, u32 queue_index)
> >                 .queue_resources_addr =
> >                         cpu_to_be64(tx->q_resources_bus),
> >                 .tx_ring_addr = cpu_to_be64(tx->bus),
> > -               .queue_page_list_id = cpu_to_be32(tx->tx_fifo.qpl->id),
> > +               .queue_page_list_id = cpu_to_be32(qpl_id),
> >                 .ntfy_id = cpu_to_be32(tx->ntfy_id),
> >         };
> >
> > diff --git a/drivers/net/ethernet/google/gve/gve_desc.h b/drivers/net/ethernet/google/gve/gve_desc.h
> > index 0aad314aefaf..a7da364e81c8 100644
> > --- a/drivers/net/ethernet/google/gve/gve_desc.h
> > +++ b/drivers/net/ethernet/google/gve/gve_desc.h
> > @@ -16,9 +16,11 @@
> >   * Base addresses encoded in seg_addr are not assumed to be physical
> >   * addresses. The ring format assumes these come from some linear address
> >   * space. This could be physical memory, kernel virtual memory, user virtual
> > - * memory. gVNIC uses lists of registered pages. Each queue is assumed
> > - * to be associated with a single such linear address space to ensure a
> > - * consistent meaning for seg_addrs posted to its rings.
> > + * memory.
> > + * If raw dma addressing is not supported then gVNIC uses lists of registered
> > + * pages. Each queue is assumed to be associated with a single such linear
> > + * address space to ensure a consistent meaning for seg_addrs posted to its
> > + * rings.
> >   */
> >
> >  struct gve_tx_pkt_desc {
> > diff --git a/drivers/net/ethernet/google/gve/gve_tx.c b/drivers/net/ethernet/google/gve/gve_tx.c
> > index d0244feb0301..26ff1d4e4f25 100644
> > --- a/drivers/net/ethernet/google/gve/gve_tx.c
> > +++ b/drivers/net/ethernet/google/gve/gve_tx.c
> > @@ -158,9 +158,11 @@ static void gve_tx_free_ring(struct gve_priv *priv, int idx)
> >                           tx->q_resources, tx->q_resources_bus);
> >         tx->q_resources = NULL;
> >
> > -       gve_tx_fifo_release(priv, &tx->tx_fifo);
> > -       gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
> > -       tx->tx_fifo.qpl = NULL;
> > +       if (!tx->raw_addressing) {
> > +               gve_tx_fifo_release(priv, &tx->tx_fifo);
> > +               gve_unassign_qpl(priv, tx->tx_fifo.qpl->id);
> > +               tx->tx_fifo.qpl = NULL;
> > +       }
> >
> >         bytes = sizeof(*tx->desc) * slots;
> >         dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
> > @@ -206,11 +208,15 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
> >         if (!tx->desc)
> >                 goto abort_with_info;
> >
> > -       tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
> > +       tx->raw_addressing = priv->raw_addressing;
> > +       tx->dev = &priv->pdev->dev;
> > +       if (!tx->raw_addressing) {
> > +               tx->tx_fifo.qpl = gve_assign_tx_qpl(priv);
> >
> > -       /* map Tx FIFO */
> > -       if (gve_tx_fifo_init(priv, &tx->tx_fifo))
> > -               goto abort_with_desc;
> > +               /* map Tx FIFO */
> > +               if (gve_tx_fifo_init(priv, &tx->tx_fifo))
> > +                       goto abort_with_desc;
> > +       }
> >
> >         tx->q_resources =
> >                 dma_alloc_coherent(hdev,
> > @@ -228,7 +234,8 @@ static int gve_tx_alloc_ring(struct gve_priv *priv, int idx)
> >         return 0;
> >
> >  abort_with_fifo:
> > -       gve_tx_fifo_release(priv, &tx->tx_fifo);
> > +       if (!tx->raw_addressing)
> > +               gve_tx_fifo_release(priv, &tx->tx_fifo);
> >  abort_with_desc:
> >         dma_free_coherent(hdev, bytes, tx->desc, tx->bus);
> >         tx->desc = NULL;
> > @@ -301,27 +308,47 @@ static inline int gve_skb_fifo_bytes_required(struct gve_tx_ring *tx,
> >         return bytes;
> >  }
> >
> > -/* The most descriptors we could need are 3 - 1 for the headers, 1 for
> > - * the beginning of the payload at the end of the FIFO, and 1 if the
> > - * payload wraps to the beginning of the FIFO.
> > +/* The most descriptors we could need is MAX_SKB_FRAGS + 3 : 1 for each skb frag,
> > + * +1 for the skb linear portion, +1 for when tcp hdr needs to be in separate descriptor,
> > + * and +1 if the payload wraps to the beginning of the FIFO.
> >   */
> > -#define MAX_TX_DESC_NEEDED     3
> > +#define MAX_TX_DESC_NEEDED     (MAX_SKB_FRAGS + 3)
> > +static void gve_tx_unmap_buf(struct device *dev, struct gve_tx_buffer_state *info)
> > +{
> > +       if (info->skb) {
> > +               dma_unmap_single(dev, dma_unmap_addr(&info->buf, dma),
> > +                                dma_unmap_len(&info->buf, len),
> > +                                DMA_TO_DEVICE);
> > +               dma_unmap_len_set(&info->buf, len, 0);
> > +       } else {
> > +               dma_unmap_page(dev, dma_unmap_addr(&info->buf, dma),
> > +                              dma_unmap_len(&info->buf, len),
> > +                              DMA_TO_DEVICE);
> > +               dma_unmap_len_set(&info->buf, len, 0);
> > +       }
> > +}
> >
> >  /* Check if sufficient resources (descriptor ring space, FIFO space) are
> >   * available to transmit the given number of bytes.
> >   */
> >  static inline bool gve_can_tx(struct gve_tx_ring *tx, int bytes_required)
> >  {
> > -       return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED &&
> > -               gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required));
> > +       bool can_alloc = true;
> > +
> > +       if (!tx->raw_addressing)
> > +               can_alloc = gve_tx_fifo_can_alloc(&tx->tx_fifo, bytes_required);
> > +
> > +       return (gve_tx_avail(tx) >= MAX_TX_DESC_NEEDED && can_alloc);
> >  }
> >
> >  /* Stops the queue if the skb cannot be transmitted. */
> >  static int gve_maybe_stop_tx(struct gve_tx_ring *tx, struct sk_buff *skb)
> >  {
> > -       int bytes_required;
> > +       int bytes_required = 0;
> > +
> > +       if (!tx->raw_addressing)
> > +               bytes_required = gve_skb_fifo_bytes_required(tx, skb);
> >
> > -       bytes_required = gve_skb_fifo_bytes_required(tx, skb);
> >         if (likely(gve_can_tx(tx, bytes_required)))
> >                 return 0;
> >
> > @@ -390,22 +417,22 @@ static void gve_tx_fill_seg_desc(union gve_tx_desc *seg_desc,
> >         seg_desc->seg.seg_addr = cpu_to_be64(addr);
> >  }
> >
> > -static void gve_dma_sync_for_device(struct device *dev, dma_addr_t *page_buses,
> > +static void gve_dma_sync_for_device(struct device *dev,
> > +                                   dma_addr_t *page_buses,
> >                                     u64 iov_offset, u64 iov_len)
> >  {
> >         u64 last_page = (iov_offset + iov_len - 1) / PAGE_SIZE;
> >         u64 first_page = iov_offset / PAGE_SIZE;
> > -       dma_addr_t dma;
> >         u64 page;
> >
> >         for (page = first_page; page <= last_page; page++) {
> > -               dma = page_buses[page];
> > +               dma_addr_t dma = page_buses[page];
> > +
> >                 dma_sync_single_for_device(dev, dma, PAGE_SIZE, DMA_TO_DEVICE);
>
> Why bother with the "dma" variable at all. Why not just refer directly
> to "page_busses[page]"?

It should be fine to just use page_buses[page] - I'll do that.

>
> >         }
> >  }
> >
> > -static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> > -                         struct device *dev)
> > +static int gve_tx_add_skb_copy(struct gve_priv *priv, struct gve_tx_ring *tx, struct sk_buff *skb)
> >  {
> >         int pad_bytes, hlen, hdr_nfrags, payload_nfrags, l4_hdr_offset;
> >         union gve_tx_desc *pkt_desc, *seg_desc;
> > @@ -429,7 +456,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> >         hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
> >                         skb_headlen(skb);
> >
> > -       info->skb =  skb;
> > +       info->skb = skb;
> >         /* We don't want to split the header, so if necessary, pad to the end
> >          * of the fifo and then put the header at the beginning of the fifo.
> >          */
> > @@ -447,7 +474,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> >         skb_copy_bits(skb, 0,
> >                       tx->tx_fifo.base + info->iov[hdr_nfrags - 1].iov_offset,
> >                       hlen);
> > -       gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
> > +       gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
> >                                 info->iov[hdr_nfrags - 1].iov_offset,
> >                                 info->iov[hdr_nfrags - 1].iov_len);
> >         copy_offset = hlen;
> > @@ -463,7 +490,7 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> >                 skb_copy_bits(skb, copy_offset,
> >                               tx->tx_fifo.base + info->iov[i].iov_offset,
> >                               info->iov[i].iov_len);
> > -               gve_dma_sync_for_device(dev, tx->tx_fifo.qpl->page_buses,
> > +               gve_dma_sync_for_device(&priv->pdev->dev, tx->tx_fifo.qpl->page_buses,
> >                                         info->iov[i].iov_offset,
> >                                         info->iov[i].iov_len);
> >                 copy_offset += info->iov[i].iov_len;
> > @@ -472,6 +499,100 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> >         return 1 + payload_nfrags;
> >  }
> >
> > +static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
> > +                                 struct sk_buff *skb)
> > +{
> > +       const struct skb_shared_info *shinfo = skb_shinfo(skb);
> > +       int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
> > +       union gve_tx_desc *pkt_desc, *seg_desc;
> > +       struct gve_tx_buffer_state *info;
> > +       bool is_gso = skb_is_gso(skb);
> > +       u32 idx = tx->req & tx->mask;
> > +       struct gve_tx_dma_buf *buf;
> > +       int last_mapped = 0;
> > +       u64 addr;
> > +       u32 len;
> > +       int i;
> > +
> > +       info = &tx->info[idx];
> > +       pkt_desc = &tx->desc[idx];
> > +
> > +       l4_hdr_offset = skb_checksum_start_offset(skb);
> > +       /* If the skb is gso, then we want only up to the tcp header in the first segment
> > +        * to efficiently replicate on each segment otherwise we want the linear portion
> > +        * of the skb (which will contain the checksum because skb->csum_start and
> > +        * skb->csum_offset are given relative to skb->head) in the first segment.
> > +        */
> > +       hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
> > +                       skb_headlen(skb);
> > +       len = skb_headlen(skb);
> > +
> > +       info->skb =  skb;
> > +
> > +       addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
> > +       if (unlikely(dma_mapping_error(tx->dev, addr))) {
> > +               rtnl_lock();
> > +               priv->dma_mapping_error++;
> > +               rtnl_unlock();
>
> Do you really need an rtnl_lock for updating this statistic? That
> seems like a glaring issue to me.

I thought this would be the way to protect the stat from parallel
access as was suggested in a comment in v3 of the patchset but I
understand now that rtnl_lock/unlock ought only to be used for net
device configurations and not in the data path. Also I now believe
that since this driver is very rarely not running on a 64-bit
platform, the stat update is atomic anyway and shouldn't need the locks.

>
> > +               goto drop;
> > +       }
> > +       buf = &info->buf;
> > +       dma_unmap_len_set(buf, len, len);
> > +       dma_unmap_addr_set(buf, dma, addr);
> > +
> > +       payload_nfrags = shinfo->nr_frags;
> > +       if (hlen < len) {
> > +               /* For gso the rest of the linear portion of the skb needs to
> > +                * be in its own descriptor.
> > +                */
> > +               payload_nfrags++;
> > +               gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
> > +                                    1 + payload_nfrags, hlen, addr);
> > +
> > +               len -= hlen;
> > +               addr += hlen;
> > +               seg_desc = &tx->desc[(tx->req + 1) & tx->mask];
> > +               seg_idx_bias = 2;
> > +               gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
> > +       } else {
> > +               seg_idx_bias = 1;
> > +               gve_tx_fill_pkt_desc(pkt_desc, skb, is_gso, l4_hdr_offset,
> > +                                    1 + payload_nfrags, hlen, addr);
> > +       }
> > +       idx = (tx->req + seg_idx_bias) & tx->mask;
> > +
>
> This math with payload_nfrags and seg_idx_bias seems way more
> complicated than it needs to be. Instead of having to reverse engineer
> shinfo->nr_frags why not just go back to using that?

You're right, that makes it simpler. I'll revert to using shinfo->nr_frags.

>
> > +       for (i = 0; i < payload_nfrags - (seg_idx_bias - 1); i++) {
> > +               const skb_frag_t *frag = &shinfo->frags[i];
> > +
> > +               seg_desc = &tx->desc[idx];
> > +               len = skb_frag_size(frag);
> > +               addr = skb_frag_dma_map(tx->dev, frag, 0, len, DMA_TO_DEVICE);
> > +               if (unlikely(dma_mapping_error(tx->dev, addr))) {
> > +                       priv->dma_mapping_error++;
>
> Same variable as above, but no lock. Definitely something not right
> about the rtnl_lock used above.

Ok, I'm removing the locks added above.

>
>
> > +                       goto unmap_drop;
> > +               }
> > +               buf = &tx->info[idx].buf;
> > +               tx->info[idx].skb = NULL;
> > +               dma_unmap_len_set(buf, len, len);
> > +               dma_unmap_addr_set(buf, dma, addr);
> > +
> > +               gve_tx_fill_seg_desc(seg_desc, skb, is_gso, len, addr);
> > +               idx = (idx + 1) & tx->mask;
>
> You could probably look simplifying this by simply updating index at
> the start of the loop instead of the end. Since you know at least the
> first buffer is used for the header all follow buffers for the
> fragments would start at idx + 1 which would reduce your seg_idx_bias
> to 0 and could also be used to just get rid of that variable all
> together.

Ok, I'll simplify the code here to increment idx at the top of the
loop and get rid of seg_idx_bias.

>
> > +       }
> > +
> > +       return 1 + payload_nfrags;
> > +
> > +unmap_drop:
> > +       i--;
> > +       for (last_mapped = i + seg_idx_bias; last_mapped >= 0; last_mapped--) {
> > +               idx = (tx->req + last_mapped) & tx->mask;
> > +               gve_tx_unmap_buf(tx->dev, &tx->info[idx]);
> > +       }
> > +drop:
> > +       tx->dropped_pkt++;
> > +       return 0;
> > +}
> > +
> >  netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
> >  {
> >         struct gve_priv *priv = netdev_priv(dev);
> > @@ -490,12 +611,20 @@ netdev_tx_t gve_tx(struct sk_buff *skb, struct net_device *dev)
> >                 gve_tx_put_doorbell(priv, tx->q_resources, tx->req);
> >                 return NETDEV_TX_BUSY;
> >         }
> > -       nsegs = gve_tx_add_skb(tx, skb, &priv->pdev->dev);
> > -
> > -       netdev_tx_sent_queue(tx->netdev_txq, skb->len);
> > -       skb_tx_timestamp(skb);
> > +       if (tx->raw_addressing)
> > +               nsegs = gve_tx_add_skb_no_copy(priv, tx, skb);
> > +       else
> > +               nsegs = gve_tx_add_skb_copy(priv, tx, skb);
> > +
> > +       /* If the packet is getting sent, we need to update the skb */
> > +       if (nsegs) {
>
> Why are you still continuing if you aren't going to transmit anything?

Ok, I will move most of what is done after the if-block into the if
block. That way, we don't bother doing anything if nsegs is zero.


>
> > +               netdev_tx_sent_queue(tx->netdev_txq, skb->len);
> > +               skb_tx_timestamp(skb);
> > +       }
> >
> > -       /* give packets to NIC */
> > +       /* Give packets to NIC. Even if this packet failed to send the doorbell
> > +        * might need to be rung because of xmit_more.
> > +        */
> >         tx->req += nsegs;
> >
> >         if (!netif_xmit_stopped(tx->netdev_txq) && netdev_xmit_more())
> > @@ -525,24 +654,30 @@ static int gve_clean_tx_done(struct gve_priv *priv, struct gve_tx_ring *tx,
> >                 info = &tx->info[idx];
> >                 skb = info->skb;
> >
> > +               /* Unmap the buffer */
> > +               if (tx->raw_addressing)
> > +                       gve_tx_unmap_buf(tx->dev, info);
> >                 /* Mark as free */
> >                 if (skb) {
> >                         info->skb = NULL;
> >                         bytes += skb->len;
> >                         pkts++;
> >                         dev_consume_skb_any(skb);
> > -                       /* FIFO free */
> > -                       for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
> > -                               space_freed += info->iov[i].iov_len +
> > -                                              info->iov[i].iov_padding;
> > -                               info->iov[i].iov_len = 0;
> > -                               info->iov[i].iov_padding = 0;
>
> Instead of increasing the indent you might look at moving the
> tx->done++ up and then just doing a "continue" if you are using raw
> addressing.

Ok, I'll do this.


>
>
> > +                       if (!tx->raw_addressing) {
> > +                               /* FIFO free */
> > +                               for (i = 0; i < ARRAY_SIZE(info->iov); i++) {
> > +                                       space_freed += info->iov[i].iov_len +
> > +                                                      info->iov[i].iov_padding;
> > +                                       info->iov[i].iov_len = 0;
> > +                                       info->iov[i].iov_padding = 0;
> > +                               }
> >                         }
> >                 }
> >                 tx->done++;
> >         }
> >
> > -       gve_tx_free_fifo(&tx->tx_fifo, space_freed);
> > +       if (!tx->raw_addressing)
> > +               gve_tx_free_fifo(&tx->tx_fifo, space_freed);
> >         u64_stats_update_begin(&tx->statss);
> >         tx->bytes_done += bytes;
> >         tx->pkt_done += pkts;
> > --
> > 2.29.2.222.g5d2a92d10f8-goog
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path
  2020-11-18 23:15     ` David Awogbemila
@ 2020-11-19 16:25       ` Alexander Duyck
  2020-11-19 21:11         ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-19 16:25 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Wed, Nov 18, 2020 at 3:15 PM David Awogbemila <awogbemila@google.com> wrote:
>
> On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > >
> > > From: Catherine Sullivan <csully@google.com>
> > >
> > > Add support to use raw dma addresses in the rx path. Due to this new
> > > support we can alloc a new buffer instead of making a copy.
> > >
> > > RX buffers are handed to the networking stack and are
> > > re-allocated as needed, avoiding the need to use
> > > skb_copy_to_linear_data() as in "qpl" mode.
> > >
> > > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > > Signed-off-by: Catherine Sullivan <csully@google.com>
> > > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > > ---

<snip>

> > > @@ -399,19 +487,45 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
> > >         return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
> > >  }
> > >
> > > +static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
> > > +{
> > > +       bool empty = rx->fill_cnt == rx->cnt;
> > > +       u32 fill_cnt = rx->fill_cnt;
> > > +
> > > +       while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
> >
> > So one question I would have is why do you need to mask fill_cnt and
> > cnt here, but not above? Something doesn't match up.
>
> fill_cnt and cnt are both free-running uints, with fill_cnt generally
> greater than cnt, as fill_cnt tracks freed/available buffers while cnt
> tracks used buffers.
> The difference between "fill_cnt == cnt" and
> "(fill_cnt & rx->mask) == (cnt & rx->mask)" is useful when all the
> buffers are completely used up.
> If all the descriptors are used up ("fill_cnt == cnt") when we attempt
> to refill buffers, the right-hand side of the while loop's OR condition,
> "(fill_cnt & rx->mask) != (rx->cnt & rx->mask)", will be false and we
> wouldn't get to attempt to refill the queue's buffers.

I think I see what you are trying to get at, but it seems convoluted.
Your first check handles the empty case where rx->fill_cnt == rx->cnt.
The second half is about pushing the count up so that fill_cnt wraps
and comes back around to equal cnt. That seems like a really
roundabout way to get there.

Why not just simplify this and do something like the following?:
while (fill_cnt - rx->cnt < rx->mask)

I would argue that is much easier to read and understand rather than
having to double up the cases by using the mask field as a mask on the
free running counters.
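
As a side note, the free-running-counter arithmetic can be sanity-checked
in isolation. The sketch below is illustrative only, not the driver code
(QUEUE_MASK stands in for rx->mask on a power-of-two ring). Note the two
forms are not quite identical: the masked version keeps refilling until the
ring holds mask + 1 buffers, while the subtraction form stops one slot
earlier, at mask posted buffers.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for rx->mask on a 256-entry ring (queue size is a power of two). */
#define QUEUE_MASK 0xFFu

/* Original condition: refill while empty, or while the masked indices differ.
 * fill_cnt and cnt are free-running u32 counters, so unsigned wraparound is
 * well-defined and fill_cnt - cnt is the number of posted-but-unused buffers.
 */
static int can_refill_masked(uint32_t fill_cnt, uint32_t cnt)
{
	return fill_cnt == cnt ||
	       (fill_cnt & QUEUE_MASK) != (cnt & QUEUE_MASK);
}

/* Suggested simplification: refill while fewer than mask buffers are posted.
 * The subtraction stays correct even after the counters wrap past 2^32.
 */
static int can_refill_diff(uint32_t fill_cnt, uint32_t cnt)
{
	return fill_cnt - cnt < QUEUE_MASK;
}
```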

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path
  2020-11-18 23:16     ` David Awogbemila
@ 2020-11-19 16:33       ` Alexander Duyck
  2020-11-19 21:11         ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-19 16:33 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Wed, Nov 18, 2020 at 3:16 PM David Awogbemila <awogbemila@google.com> wrote:
>
> On Wed, Nov 11, 2020 at 9:29 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > >
> > > From: Catherine Sullivan <csully@google.com>
> > >
> > > During TX, skbs' data addresses are dma_map'ed and passed to the NIC.
> > > This means that the device can perform DMA directly from these addresses
> > > and the driver does not have to copy the buffer content into
> > > pre-allocated buffers/qpls (as in qpl mode).
> > >
> > > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > > Signed-off-by: Catherine Sullivan <csully@google.com>
> > > Signed-off-by: David Awogbemila <awogbemila@google.com>

<snip>

> > > @@ -472,6 +499,100 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> > >         return 1 + payload_nfrags;
> > >  }
> > >
> > > +static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
> > > +                                 struct sk_buff *skb)
> > > +{
> > > +       const struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > +       int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
> > > +       union gve_tx_desc *pkt_desc, *seg_desc;
> > > +       struct gve_tx_buffer_state *info;
> > > +       bool is_gso = skb_is_gso(skb);
> > > +       u32 idx = tx->req & tx->mask;
> > > +       struct gve_tx_dma_buf *buf;
> > > +       int last_mapped = 0;
> > > +       u64 addr;
> > > +       u32 len;
> > > +       int i;
> > > +
> > > +       info = &tx->info[idx];
> > > +       pkt_desc = &tx->desc[idx];
> > > +
> > > +       l4_hdr_offset = skb_checksum_start_offset(skb);
> > > +       /* If the skb is gso, then we want only up to the tcp header in the first segment
> > > +        * to efficiently replicate on each segment otherwise we want the linear portion
> > > +        * of the skb (which will contain the checksum because skb->csum_start and
> > > +        * skb->csum_offset are given relative to skb->head) in the first segment.
> > > +        */
> > > +       hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
> > > +                       skb_headlen(skb);
> > > +       len = skb_headlen(skb);
> > > +
> > > +       info->skb =  skb;
> > > +
> > > +       addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
> > > +       if (unlikely(dma_mapping_error(tx->dev, addr))) {
> > > +               rtnl_lock();
> > > +               priv->dma_mapping_error++;
> > > +               rtnl_unlock();
> >
> > Do you really need an rtnl_lock for updating this statistic? That
> > seems like a glaring issue to me.
>
> I thought this would be the way to protect the stat from parallel
> access as was suggested in a comment in v3 of the patchset but I
> understand now that rtnl_lock/unlock ought only to be used for net
> device configurations and not in the data path. Also I now believe
> that since this driver is very rarely not running on a 64-bit
> platform, the stat update is atomic anyway and shouldn't need the locks.

If nothing else it might be good to look at just creating a per-ring
stat for this and then aggregating the value before you report it to
the stack. Then you don't have to worry about multiple threads trying
to update it simultaneously since it will be protected by the Tx queue
lock.
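
For what it's worth, the aggregation step is cheap. A rough sketch of the
idea (the names are made up for illustration, not the actual gve
structures) could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Each TX ring owns its own error counter; it is only ever written by the
 * thread that holds that ring's queue lock, so no extra locking is needed
 * on the update path.
 */
struct tx_ring_stats {
	uint64_t dma_mapping_error;
};

/* At stats-reporting time, walk the rings and sum the per-ring values. */
static uint64_t aggregate_dma_mapping_errors(const struct tx_ring_stats *rings,
					     size_t num_rings)
{
	uint64_t total = 0;
	size_t i;

	for (i = 0; i < num_rings; i++)
		total += rings[i].dma_mapping_error;
	return total;
}
```

In the real driver the read side would additionally go through
u64_stats_fetch_begin()/u64_stats_fetch_retry() so that 32-bit platforms
see a consistent 64-bit value, as the existing bytes_done/pkt_done stats
already do.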

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 3/4] gve: Rx Buffer Recycling
  2020-11-18 22:50     ` David Awogbemila
@ 2020-11-19 16:55       ` Alexander Duyck
  2020-11-19 21:11         ` David Awogbemila
  0 siblings, 1 reply; 19+ messages in thread
From: Alexander Duyck @ 2020-11-19 16:55 UTC (permalink / raw)
  To: David Awogbemila; +Cc: Netdev

On Wed, Nov 18, 2020 at 2:50 PM David Awogbemila <awogbemila@google.com> wrote:
>
> On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
> >
> > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > >
> > > This patch lets the driver reuse buffers that have been freed by the
> > > networking stack.
> > >
> > > In the raw addressing case, this allows the driver avoid allocating new
> > > buffers.
> > > In the qpl case, the driver can avoid copies.
> > >
> > > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > > ---

<snip>

> > > +static int gve_rx_can_recycle_buffer(struct page *page)
> > > +{
> > > +       int pagecount = page_count(page);
> > > +
> > > +       /* This page is not being used by any SKBs - reuse */
> > > +       if (pagecount == 1)
> > > +               return 1;
> > > +       /* This page is still being used by an SKB - we can't reuse */
> > > +       else if (pagecount >= 2)
> > > +               return 0;
> > > +       WARN(pagecount < 1, "Pagecount should never be < 1");
> > > +       return -1;
> > > +}
> > > +
> >
> > So using a page count of 1 is expensive. Really if you are going to do
> > this you should probably look at how we do it currently in ixgbe.
> > Basically you want to batch the count updates to avoid expensive
> > atomic operations per skb.
>
> A separate patch will be coming along to change the way the driver
> tracks page count.
> I thought it would be better to have that reviewed separately since
> it's a different issue from what this patch addresses.

Okay, then you might want to call out in your patch description that
this is just a temporary placeholder. Back when I did this for ixgbe
I think we had it doing a single page update for a few years; however,
that code had a bug that would cause page count corruption, as I wasn't
aware at the time that the mm tree had functions that would take a
reference on the page without us ever handing it out.
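
For reference, the batching scheme being referred to can be modeled without
any kernel internals. In the sketch below (a simplified model, not the real
struct page API; PAGECNT_BIAS_MAX and the struct names are made up), the
driver front-loads a large refcount once and then tracks its own share in a
local pagecnt_bias, so the per-packet path never touches the atomic counter:

```c
#include <assert.h>

#define PAGECNT_BIAS_MAX 0x10000

/* Stand-in for the atomic page refcount. */
struct page_model {
	int refcount;
};

struct rx_buffer {
	struct page_model *page;
	int pagecnt_bias; /* references the driver still holds locally */
};

/* Init: take PAGECNT_BIAS_MAX references up front (one page_ref_add()). */
static void rx_buffer_init(struct rx_buffer *buf, struct page_model *page)
{
	page->refcount = PAGECNT_BIAS_MAX;
	buf->page = page;
	buf->pagecnt_bias = PAGECNT_BIAS_MAX;
}

/* Hand a reference to the stack: only the local bias is decremented. */
static void hand_to_stack(struct rx_buffer *buf)
{
	buf->pagecnt_bias--;
}

/* The page is driver-exclusive (reusable) once every reference handed to
 * the stack has been dropped, i.e. the atomic count equals our bias.
 */
static int can_reuse(const struct rx_buffer *buf)
{
	return buf->page->refcount == buf->pagecnt_bias;
}
```

Replenishing the bias when it runs low is then a single page_ref_add()
every ~64K packets instead of an atomic get_page() per packet.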

> >
> > >  static struct sk_buff *
> > >  gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
> > >                       struct gve_rx_slot_page_info *page_info, u16 len,
> > >                       struct napi_struct *napi,
> > > -                     struct gve_rx_data_slot *data_slot)
> > > +                     struct gve_rx_data_slot *data_slot, bool can_flip)
> > >  {
> > > -       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> > > +       struct sk_buff *skb;
> > >
> > > +       skb = gve_rx_add_frags(napi, page_info, len);
> >
> > Why split this up? It seemed fine as it was.
>
> It was based on a comment from v3 of the patchset.

Do you recall the comment? This just seems like noise to me since it
is moving code and doesn't seem to address either a formatting issue
or a functional issue.

> >
> > >         if (!skb)
> > >                 return NULL;
> > >
> > > +       /* Optimistically stop the kernel from freeing the page by increasing
> > > +        * the page bias. We will check the refcount in refill to determine if
> > > +        * we need to alloc a new page.
> > > +        */
> > > +       get_page(page_info->page);
> > > +       page_info->can_flip = can_flip;
> > > +
> >
> > Why pass can_flip and page_info only to set it here? Also I don't get
> > why you are taking an extra reference on the page without checking the
> > can_flip variable. It seems like this should be set in the page_info
> > before you call this function and then you call get_page if
> > page_info->can_flip is true.
>
> I think it's important to call get_page here even for buffers we know
> will not be flipped so that if the skb does a put_page twice we would
> not run into the WARNing in gve_rx_can_recycle_buffer when trying to
> refill buffers.
> (Also please note that a future patch changes the way the driver uses
> page refcounts)

It was your buffer recycling bit that I hadn't noticed. So you have
cases where if your page count gets to 2 you push it to 3 in the hopes
that the 2 that are already out there will be freed and you are left
holding the one remaining count. However I am not sure how much of an
advantage there is to that. If the page flipping is already failing I
wonder what percentage of the time you are able to recover from that.
It might be worthwhile to look at adding a couple of counters to track
the number of times you couldn't flip versus the number of times you
recycled the frame. You may find that the buffer recycling is adding
more overhead for very little gain versus just doing the page
flipping.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path
  2020-11-19 16:33       ` Alexander Duyck
@ 2020-11-19 21:11         ` David Awogbemila
  0 siblings, 0 replies; 19+ messages in thread
From: David Awogbemila @ 2020-11-19 21:11 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Thu, Nov 19, 2020 at 8:33 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, Nov 18, 2020 at 3:16 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > On Wed, Nov 11, 2020 at 9:29 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > > >
> > > > From: Catherine Sullivan <csully@google.com>
> > > >
> > > > During TX, skbs' data addresses are dma_map'ed and passed to the NIC.
> > > > This means that the device can perform DMA directly from these addresses
> > > > and the driver does not have to copy the buffer content into
> > > > pre-allocated buffers/qpls (as in qpl mode).
> > > >
> > > > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > > > Signed-off-by: Catherine Sullivan <csully@google.com>
> > > > Signed-off-by: David Awogbemila <awogbemila@google.com>
>
> <snip>
>
> > > > @@ -472,6 +499,100 @@ static int gve_tx_add_skb(struct gve_tx_ring *tx, struct sk_buff *skb,
> > > >         return 1 + payload_nfrags;
> > > >  }
> > > >
> > > > +static int gve_tx_add_skb_no_copy(struct gve_priv *priv, struct gve_tx_ring *tx,
> > > > +                                 struct sk_buff *skb)
> > > > +{
> > > > +       const struct skb_shared_info *shinfo = skb_shinfo(skb);
> > > > +       int hlen, payload_nfrags, l4_hdr_offset, seg_idx_bias;
> > > > +       union gve_tx_desc *pkt_desc, *seg_desc;
> > > > +       struct gve_tx_buffer_state *info;
> > > > +       bool is_gso = skb_is_gso(skb);
> > > > +       u32 idx = tx->req & tx->mask;
> > > > +       struct gve_tx_dma_buf *buf;
> > > > +       int last_mapped = 0;
> > > > +       u64 addr;
> > > > +       u32 len;
> > > > +       int i;
> > > > +
> > > > +       info = &tx->info[idx];
> > > > +       pkt_desc = &tx->desc[idx];
> > > > +
> > > > +       l4_hdr_offset = skb_checksum_start_offset(skb);
> > > > +       /* If the skb is gso, then we want only up to the tcp header in the first segment
> > > > +        * to efficiently replicate on each segment otherwise we want the linear portion
> > > > +        * of the skb (which will contain the checksum because skb->csum_start and
> > > > +        * skb->csum_offset are given relative to skb->head) in the first segment.
> > > > +        */
> > > > +       hlen = is_gso ? l4_hdr_offset + tcp_hdrlen(skb) :
> > > > +                       skb_headlen(skb);
> > > > +       len = skb_headlen(skb);
> > > > +
> > > > +       info->skb =  skb;
> > > > +
> > > > +       addr = dma_map_single(tx->dev, skb->data, len, DMA_TO_DEVICE);
> > > > +       if (unlikely(dma_mapping_error(tx->dev, addr))) {
> > > > +               rtnl_lock();
> > > > +               priv->dma_mapping_error++;
> > > > +               rtnl_unlock();
> > >
> > > Do you really need an rtnl_lock for updating this statistic? That
> > > seems like a glaring issue to me.
> >
> > I thought this would be the way to protect the stat from parallel
> > access, as was suggested in a comment on v3 of the patchset, but I
> > understand now that rtnl_lock/unlock ought only to be used for net
> > device configuration and not in the data path. Also, since this
> > driver almost always runs on a 64-bit platform, I now believe the
> > stat update is atomic anyway and shouldn't need the locks.
>
> If nothing else it might be good to look at just creating a per-ring
> stat for this and then aggregating the value before you report it to
> the stack. Then you don't have to worry about multiple threads trying
> to update it simultaneously since it will be protected by the Tx queue
> lock.

Ok, a per-ring stat would work so I'll switch to that.
I've actually sent v7 already, so I'll make the modifications in v8.
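For illustration, the per-ring idea could be sketched roughly as below. This is a simplified userspace model, not the actual gve code: all names (`tx_ring_stats`, `aggregate_dma_mapping_errors`) are hypothetical, and the real driver would bump the counter while holding the TX queue lock, then sum the per-ring values when reporting stats.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-ring stats: each TX ring owns its own
 * dma_mapping_error count, updated only by the thread that holds
 * that ring's queue lock, so no extra locking is needed on update.
 */
struct tx_ring_stats {
	uint64_t dma_mapping_error;
};

/* Sum the per-ring counters at stats-reporting time. */
static uint64_t aggregate_dma_mapping_errors(const struct tx_ring_stats *rings,
					     int num_rings)
{
	uint64_t total = 0;
	int i;

	for (i = 0; i < num_rings; i++)
		total += rings[i].dma_mapping_error;
	return total;
}
```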

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 3/4] gve: Rx Buffer Recycling
  2020-11-19 16:55       ` Alexander Duyck
@ 2020-11-19 21:11         ` David Awogbemila
  0 siblings, 0 replies; 19+ messages in thread
From: David Awogbemila @ 2020-11-19 21:11 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev

On Thu, Nov 19, 2020 at 8:55 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, Nov 18, 2020 at 2:50 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > > >
> > > > This patch lets the driver reuse buffers that have been freed by the
> > > > networking stack.
> > > >
> > > > In the raw addressing case, this allows the driver to avoid
> > > > allocating new buffers.
> > > > In the qpl case, the driver can avoid copies.
> > > >
> > > > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > > > ---
>
> <snip>
>
> > > > +static int gve_rx_can_recycle_buffer(struct page *page)
> > > > +{
> > > > +       int pagecount = page_count(page);
> > > > +
> > > > +       /* This page is not being used by any SKBs - reuse */
> > > > +       if (pagecount == 1)
> > > > +               return 1;
> > > > +       /* This page is still being used by an SKB - we can't reuse */
> > > > +       else if (pagecount >= 2)
> > > > +               return 0;
> > > > +       WARN(pagecount < 1, "Pagecount should never be < 1");
> > > > +       return -1;
> > > > +}
> > > > +
> > >
> > > So using a page count of 1 is expensive. Really if you are going to do
> > > this you should probably look at how we do it currently in ixgbe.
> > > Basically you want to batch the count updates to avoid expensive
> > > atomic operations per skb.
> >
> > A separate patch will be coming along to change the way the driver
> > tracks page count.
> > I thought it would be better to have that reviewed separately since
> > it's a different issue from what this patch addresses.
>
> Okay, you might want to call that out in your patch description then
> that this is just a temporary placeholder. Back when I did this for
> ixgbe I think we had it doing a single page update for a few years,
> however that code had a bug in it that would cause page count
> corruption as I wasn't aware at the time that the mm tree had
> functions that would take a reference on the page without us ever
> handing it out.

Ok, I'll add that to the patch description since, even though this
aspect of the driver's behavior is not changing in this patch, it is
being separated into its own function.

>
> > >
> > > >  static struct sk_buff *
> > > >  gve_rx_raw_addressing(struct device *dev, struct net_device *netdev,
> > > >                       struct gve_rx_slot_page_info *page_info, u16 len,
> > > >                       struct napi_struct *napi,
> > > > -                     struct gve_rx_data_slot *data_slot)
> > > > +                     struct gve_rx_data_slot *data_slot, bool can_flip)
> > > >  {
> > > > -       struct sk_buff *skb = gve_rx_add_frags(napi, page_info, len);
> > > > +       struct sk_buff *skb;
> > > >
> > > > +       skb = gve_rx_add_frags(napi, page_info, len);
> > >
> > > Why split this up? It seemed fine as it was.
> >
> > It was based on a comment from v3 of the patchset.
>
> Do you recall the comment? This just seems like noise to me since it
> is moving code and doesn't seem to address either a formatting issue
> or a functional issue.

The comment suggested that it might be weird to initialize a variable
with a function call and then error-check after an empty line.
I realize now that what I should have done here is introduce the split
in the patch that introduced the function, so it doesn't appear as an
unnecessary isolated change - I'll do that.

>
> > >
> > > >         if (!skb)
> > > >                 return NULL;
> > > >
> > > > +       /* Optimistically stop the kernel from freeing the page by increasing
> > > > +        * the page bias. We will check the refcount in refill to determine if
> > > > +        * we need to alloc a new page.
> > > > +        */
> > > > +       get_page(page_info->page);
> > > > +       page_info->can_flip = can_flip;
> > > > +
> > >
> > > Why pass can_flip and page_info only to set it here? Also I don't get
> > > why you are taking an extra reference on the page without checking the
> > > can_flip variable. It seems like this should be set in the page_info
> > > before you call this function and then you call get_page if
> > > page_info->can_flip is true.
> >
> > I think it's important to call get_page here even for buffers we know
> > will not be flipped so that if the skb does a put_page twice we would
> > not run into the WARNing in gve_rx_can_recycle_buffer when trying to
> > refill buffers.
> > (Also please note that a future patch changes the way the driver uses
> > page refcounts)
>
> It was your buffer recycling bit that I hadn't noticed. So you have
> cases where if your page count gets to 2 you push it to 3 in the hopes
> that the 2 that are already out there will be freed and you are left
> holding the one remaining count. However I am not sure how much of an
> advantage there is to that. If the page flipping is already failing I
> wonder what percentage of the time you are able to recover from that.
> It might be worthwhile to look at adding a couple of counters to track
> the number of times you couldn't flip versus the number of times you
> recycled the frame. You may find that the buffer recycling is adding
> more overhead for very little gain versus just doing the page
> flipping.

Thanks for the suggestion. Metrics tracking those sound like a good idea.
I'll look into getting that change into the patch which changes the driver's
refcount tracking method and eliminates get_page entirely.
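As a rough userspace model of the two ideas discussed above - the refcount-based recycle decision and the counters tracking flip failures versus successful flips - something like the sketch below could work. All names are hypothetical and the page refcount is modeled as a plain int; the real driver would use page_count()/get_page().

```c
#include <assert.h>

/* Hypothetical counters for measuring whether recycling pays off:
 * how often a buffer could be flipped versus how often an SKB still
 * held the page and the flip failed.
 */
struct rx_recycle_stats {
	unsigned long page_flips;
	unsigned long flip_failures;
};

/* Returns 1 if only the driver holds the page and it can be reused,
 * 0 if an SKB still holds a reference, -1 on a bogus refcount
 * (mirrors the WARN case in gve_rx_can_recycle_buffer).
 */
static int can_recycle(int pagecount, struct rx_recycle_stats *stats)
{
	if (pagecount < 1)
		return -1;
	if (pagecount == 1) {
		stats->page_flips++;
		return 1;
	}
	stats->flip_failures++;
	return 0;
}
```

Comparing the two counters over a workload would show what fraction of buffers actually come back, which is the measurement Alexander suggests before keeping the recycling path.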

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path
  2020-11-19 16:25       ` Alexander Duyck
@ 2020-11-19 21:11         ` David Awogbemila
  0 siblings, 0 replies; 19+ messages in thread
From: David Awogbemila @ 2020-11-19 21:11 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: Netdev, Catherine Sullivan, Yangchun Fu

On Thu, Nov 19, 2020 at 8:25 AM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>
> On Wed, Nov 18, 2020 at 3:15 PM David Awogbemila <awogbemila@google.com> wrote:
> >
> > On Wed, Nov 11, 2020 at 9:20 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> > >
> > > On Mon, Nov 9, 2020 at 3:39 PM David Awogbemila <awogbemila@google.com> wrote:
> > > >
> > > > From: Catherine Sullivan <csully@google.com>
> > > >
> > > > Add support to use raw dma addresses in the rx path. Due to this new
> > > > support we can alloc a new buffer instead of making a copy.
> > > >
> > > > RX buffers are handed to the networking stack and are
> > > > re-allocated as needed, avoiding the need to use
> > > > skb_copy_to_linear_data() as in "qpl" mode.
> > > >
> > > > Reviewed-by: Yangchun Fu <yangchun@google.com>
> > > > Signed-off-by: Catherine Sullivan <csully@google.com>
> > > > Signed-off-by: David Awogbemila <awogbemila@google.com>
> > > > ---
>
> <snip>
>
> > > > @@ -399,19 +487,45 @@ static bool gve_rx_work_pending(struct gve_rx_ring *rx)
> > > >         return (GVE_SEQNO(flags_seq) == rx->desc.seqno);
> > > >  }
> > > >
> > > > +static bool gve_rx_refill_buffers(struct gve_priv *priv, struct gve_rx_ring *rx)
> > > > +{
> > > > +       bool empty = rx->fill_cnt == rx->cnt;
> > > > +       u32 fill_cnt = rx->fill_cnt;
> > > > +
> > > > +       while (empty || ((fill_cnt & rx->mask) != (rx->cnt & rx->mask))) {
> > >
> > > So one question I would have is why do you need to mask fill_cnt and
> > > cnt here, but not above? Something doesn't match up.
> >
> > fill_cnt and cnt are both free-running uints, with fill_cnt generally
> > greater than cnt, as fill_cnt tracks freed/available buffers while
> > cnt tracks used buffers.
> > The difference between "fill_cnt == cnt" and "(fill_cnt & rx->mask) ==
> > (cnt & rx->mask)" is useful when all the buffers are completely used up.
> > If all the descriptors are used up ("fill_cnt == cnt") when we attempt
> > to refill buffers, the right-hand side of the while loop's OR condition,
> > "(fill_cnt & rx->mask) != (rx->cnt & rx->mask)", will be false and we
> > wouldn't get to attempt to refill the queue's buffers.
>
> I think I see what you are trying to get at, but it seems convoluted.
> Your first check is checking for the empty case where rx->fill_cnt ==
> rx->cnt. The second half of this is about pushing the count up so that
> you cause fill_cnt to wrap and come back around and be equal to cnt.
> That seems like a really convoluted way to get there.
>
> Why not just simplify this and do something like the following?:
> while (fill_cnt - rx->cnt  < rx->mask)
>
> I would argue that is much easier to read and understand rather than
> having to double up the cases by using the mask field as a mask on the
> free running counters.

Okay, that's simpler, I'll use that, thanks.
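The reason the simpler condition works can be shown with a small userspace model: with free-running unsigned counters, the subtraction `fill_cnt - cnt` yields the number of outstanding fills even after the 32-bit counters wrap. The function and values below are illustrative only, not the actual gve code.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the suggested "fill_cnt - rx->cnt < rx->mask" loop condition.
 * fill_cnt - cnt is the number of buffers already posted (well-defined
 * under unsigned wrap-around); keep refilling while the ring is not full.
 */
static int ring_can_refill(uint32_t fill_cnt, uint32_t cnt, uint32_t mask)
{
	return (fill_cnt - cnt) < mask;
}
```

The wrap-around case is the interesting one: if cnt sits just below UINT32_MAX and fill_cnt has already wrapped past zero, the unsigned difference is still the small outstanding count, so no masked comparison is needed.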

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2020-11-19 21:12 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-09 23:36 [PATCH net-next v6 0/4] GVE Raw Addressing David Awogbemila
2020-11-09 23:36 ` [PATCH net-next v6 1/4] gve: Add support for raw addressing device option David Awogbemila
2020-11-11 17:20   ` Alexander Duyck
2020-11-18 23:14     ` David Awogbemila
2020-11-09 23:36 ` [PATCH net-next v6 2/4] gve: Add support for raw addressing to the rx path David Awogbemila
2020-11-11 17:20   ` Alexander Duyck
2020-11-18 23:15     ` David Awogbemila
2020-11-19 16:25       ` Alexander Duyck
2020-11-19 21:11         ` David Awogbemila
2020-11-09 23:36 ` [PATCH net-next v6 3/4] gve: Rx Buffer Recycling David Awogbemila
2020-11-11 17:20   ` Alexander Duyck
2020-11-18 22:50     ` David Awogbemila
2020-11-19 16:55       ` Alexander Duyck
2020-11-19 21:11         ` David Awogbemila
2020-11-09 23:36 ` [PATCH net-next v6 4/4] gve: Add support for raw addressing in the tx path David Awogbemila
2020-11-11 17:20   ` Alexander Duyck
2020-11-18 23:16     ` David Awogbemila
2020-11-19 16:33       ` Alexander Duyck
2020-11-19 21:11         ` David Awogbemila
