netdev.vger.kernel.org archive mirror
* [RFC Patch v1 0/3] Introduce ENA local page cache
@ 2021-03-09 17:10 Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-09 17:10 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, netdev
  Cc: Shay Agroskin, Woodhouse, David, Machulsky, Zorik, Matushevsky,
	Alexander, Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara,
	Nafea, Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

High incoming pps rate leads to frequent memory allocations by the napi
routine to refill the pages of the incoming packets.

On several new instances in the AWS fleet, with high pps rates, these
frequent allocations create contention between the different napi
routines.
The contention happens because the routines end up accessing the
buddy allocator, which is a shared resource and requires lock-based
synchronization (the same lock is also held when freeing a page). In
our tests we observed that this contention causes the CPUs that
serve the RX queues to reach 100% utilization and degrades network
performance.
While this contention can be relieved by making sure that pages are
allocated and freed on the same core, which would allow the driver to
take advantage of PCP, this solution is not always available or easy to
maintain.

This patchset implements a page cache local to each RX queue. When the
napi routine allocates a page, it first checks whether the cache has a
previously allocated page that isn't used. If so, this page is fetched
instead of allocating a new one. Otherwise, if the cache is out of free
pages, a page is allocated using the normal allocation path (PCP or buddy
allocator) and returned to the caller.
A page that is allocated outside the cache is cached afterwards, up
to the cache's maximum size (set to 2048 pages in this patchset).

The pages' availability is tracked by checking their refcount. A cached
page has a refcount of 2 when it is passed to the napi routine as an RX
buffer. When the refcount of a page drops back to 1, the cache assumes
that it is free to be re-used.

To avoid traversing all pages in the cache when looking for an available
page, we only check the availability of the oldest page that was fetched
for the RX queue and hasn't been returned to the cache yet (i.e. the
entry at the cache's head index).

For example, for a cache of size 8 from which pages at indices 0-7 were
fetched (and placed in the RX SQ), the next time napi would try to fetch
a page from the cache, the cache would check the availability of the
page at index 0, and if it is available, this page would be fetched for
the napi. The next time napi would try to fetch a page, the cache entry
at index 1 would be checked, and so on.
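
The head-only check in the example above can be sketched as a small
user-space simulation. This is a minimal sketch, not the driver code:
sim_page, sim_cache, sim_init, sim_fetch and sim_release are hypothetical
names, and a plain integer stands in for the kernel's page refcount.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_CACHE_SIZE 8u	/* power of two, matching the 8-entry example */

/* Hypothetical stand-in for a cached page: only the refcount is modelled */
struct sim_page {
	int refcount;
};

struct sim_cache {
	unsigned int head;	/* index of the only entry inspected on fetch */
	struct sim_page pages[SIM_CACHE_SIZE];
};

/* Every page starts out owned by the cache alone (refcount 1) */
static void sim_init(struct sim_cache *c)
{
	assert(c != NULL);
	c->head = 0;
	for (unsigned int i = 0; i < SIM_CACHE_SIZE; i++)
		c->pages[i].refcount = 1;
}

/* Hand out the head page if it is free; return its index, or -1 when the
 * head page is still in use (the caller would then allocate normally).
 */
static int sim_fetch(struct sim_cache *c)
{
	unsigned int idx = c->head;

	if (c->pages[idx].refcount != 1)
		return -1;

	c->pages[idx].refcount++;	/* 2: held by the cache and the RX buffer */
	c->head = (idx + 1) & (SIM_CACHE_SIZE - 1);
	return (int)idx;
}

/* The stack is done with the page: drop the RX buffer's reference */
static void sim_release(struct sim_cache *c, unsigned int idx)
{
	assert(idx < SIM_CACHE_SIZE && c->pages[idx].refcount > 1);
	c->pages[idx].refcount--;
}
```

In this model, releasing the page at index 3 while the page at index 0 is
still in flight does not make a fetch succeed. Only the head entry is
inspected, which is the price paid for a constant-time lookup.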

Memory consumption:

At maximum occupancy the cache holds 2048 pages per queue. For an
interface with 32 queues, 32 * 2048 * 4K = 256MB
are used by the driver for its RX queues.
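
As a quick check of the worst-case figure, the product can be computed
directly. This is only an illustrative helper (lpc_worst_case_bytes is a
hypothetical name), using the 4K page size quoted above: 2048 pages of 4K
are 8MB per queue, so 32 queues come to 256MB.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case bytes held by the caches of one interface:
 * queues * pages per cache * bytes per page.
 */
static uint64_t lpc_worst_case_bytes(uint32_t queues, uint32_t pages_per_cache,
				     uint32_t page_size)
{
	assert(page_size > 0);
	/* Widen before multiplying so the product cannot overflow 32 bits */
	return (uint64_t)queues * pages_per_cache * page_size;
}
```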

To avoid choking the system, this feature is only enabled for
instances with at least 16 queues, which in AWS come with several tens
of gigabytes of RAM. Moreover, the feature can be turned off completely
using ethtool.

Having said that, the memory cost of having RX queues with 2K entries
would be the same as with 1K entries queue + LPC in worst case, while
the latter allocates the memory only in case the traffic rate is higher
than the rate of the pages being freed.

Performance results:

4 c5n.18xlarge instances sending iperf TCP traffic to a p4d.24xlarge instance.
Packet size: 1500 bytes

c5n.18xlarge specs:
Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
with 72 cores. 32 queue pairs.

p4d.24xlarge specs:
Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
with 96 cores. 4 * 32 = 128 (4 interfaces) queue pairs.

|                     | before | after |
|---------------------+--------+-------|
| bandwidth (Gbps)    | 260    | 330   |
| CPU utilization (%) | 100    | 56    |

Shay Agroskin (3):
  net: ena: implement local page cache (LPC) system
  net: ena: update README file with a description about LPC
  net: ena: support ethtool priv-flags and LPC state change

 .../device_drivers/ethernet/amazon/ena.rst    |  28 ++
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |  56 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 369 +++++++++++++++++-
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  32 ++
 4 files changed, 458 insertions(+), 27 deletions(-)

-- 
2.25.1



* [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-09 17:10 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
@ 2021-03-09 17:10 ` Shay Agroskin
  2021-03-09 17:57   ` Eric Dumazet
  2021-03-09 17:10 ` [RFC Patch v1 2/3] net: ena: update README file with a description about LPC Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 3/3] net: ena: support ethtool priv-flags and LPC state change Shay Agroskin
  2 siblings, 1 reply; 10+ messages in thread
From: Shay Agroskin @ 2021-03-09 17:10 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, netdev
  Cc: Shay Agroskin, Woodhouse, David, Machulsky, Zorik, Matushevsky,
	Alexander, Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara,
	Nafea, Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

The page cache holds pages we allocated in the past during napi cycle,
and tracks their availability status using page ref count.

The cache can hold up to 2048 pages. Upon allocating a page, we check
whether the next entry in the cache contains an unused page, and if so
fetch it. If the next page is already used by another entity or if it
belongs to a different NUMA node than the napi routine, we allocate a
page in the regular way (a page from a different NUMA node is replaced by
the newly allocated page).

This system can help us reduce the contention between different cores
when allocating pages, since every cache is unique to a queue.

This patch adds the following ethtool counters:

- lpc_warm_up: If the next page in the cache isn't free, and the
    cache hasn't yet reached its maximum size, allocate a new
    page and store it in the cache. This counter increases every time
    this happens. Its maximum value can be N * 'current queue size'.

- lpc_full: The next entry in the cache contains a page that is
    still used. In such case a page is allocated in the regular way
    (i.e. dev_alloc_page())

- lpc_wrong_numa: The next entry in the cache contains a page in a
    different NUMA node than the napi routine which allocates the page.
    In this case increase the counter and replace current entry with a
    page from the same NUMA node.

Note that in all three cases a page should be returned to the caller of
the page cache function, and the page would be either from the cache, or
from the Linux memory system.
In case the system is out-of-memory the cache returns NULL. This
scenario doesn't break the cache's correctness.
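
The way a single allocation request maps onto these counters can be
sketched as a pure decision function. This is a simplified model under
stated assumptions, not the patch's code: lpc_outcome, lpc_view and
lpc_classify are hypothetical names, and the cache state is reduced to the
few fields that the decision depends on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome labels; the last three match the new counters */
enum lpc_outcome { LPC_HIT, LPC_WARM_UP, LPC_FULL, LPC_WRONG_NUMA };

/* Hypothetical snapshot of the cache state seen by one allocation */
struct lpc_view {
	unsigned int current_size;	/* entries populated so far */
	unsigned int max_size;		/* capacity of the cache */
	bool head_populated;		/* does the head entry hold a page? */
	int head_refcount;		/* refcount of the head page */
	int head_nid;			/* NUMA node of the head page */
};

static enum lpc_outcome lpc_classify(const struct lpc_view *v, int current_nid)
{
	bool head_free;

	assert(v);
	head_free = v->head_populated && v->head_refcount == 1;

	/* A free head page is always reused, during warm-up or afterwards */
	if (head_free)
		return v->head_nid == current_nid ? LPC_HIT : LPC_WRONG_NUMA;

	/* Head page busy or missing: grow the cache while there is room */
	if (v->current_size < v->max_size)
		return LPC_WARM_UP;

	/* Cache is fully populated and its head page is still in use */
	return LPC_FULL;
}
```

In every outcome the caller still expects a page. Only the source of the
page and the counter that is incremented differ.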

The page cache is disabled when there are fewer than 16 queues or when XDP
is used.
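
The enabling condition can be written as a one-line predicate. This is a
sketch only: lpc_supported and LPC_MIN_CHANNELS are hypothetical names, and
a single channel count is passed in, whereas the patch sums the I/O queues
and the XDP queues before comparing.

```c
#include <assert.h>
#include <stdbool.h>

#define LPC_MIN_CHANNELS 16u	/* mirrors ENA_LPC_MIN_NUM_OF_CHANNELS */

/* The cache is usable with at least 16 channels and no XDP program */
static bool lpc_supported(unsigned int channels, bool xdp_attached)
{
	return channels >= LPC_MIN_CHANNELS && !xdp_attached;
}
```

Note that the boundary is inclusive in this reading: exactly 16 channels
keeps the cache enabled, 15 disables it.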

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
---
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |   3 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 337 ++++++++++++++++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  30 ++
 3 files changed, 350 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
index d6cc7aa612b7..fe16b3d5bd73 100644
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -96,6 +96,9 @@ static const struct ena_stats ena_stats_rx_strings[] = {
 	ENA_STAT_RX_ENTRY(xdp_tx),
 	ENA_STAT_RX_ENTRY(xdp_invalid),
 	ENA_STAT_RX_ENTRY(xdp_redirect),
+	ENA_STAT_RX_ENTRY(lpc_warm_up),
+	ENA_STAT_RX_ENTRY(lpc_full),
+	ENA_STAT_RX_ENTRY(lpc_wrong_numa),
 };
 
 static const struct ena_stats ena_stats_ena_com_strings[] = {
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 102f2c91fdb8..9f6cc479506f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -49,6 +49,7 @@ static int ena_rss_init_default(struct ena_adapter *adapter);
 static void check_for_admin_com_state(struct ena_adapter *adapter);
 static void ena_destroy_device(struct ena_adapter *adapter, bool graceful);
 static int ena_restore_device(struct ena_adapter *adapter);
+static int ena_create_page_caches(struct ena_adapter *adapter);
 
 static void ena_init_io_rings(struct ena_adapter *adapter,
 			      int first_index, int count);
@@ -981,12 +982,162 @@ static void ena_free_all_io_rx_resources(struct ena_adapter *adapter)
 		ena_free_rx_resources(adapter, i);
 }
 
+static void ena_put_unmap_cache_page(struct ena_ring *rx_ring, struct ena_page *ena_page)
+{
+	dma_unmap_page(rx_ring->dev, ena_page->dma_addr, ENA_PAGE_SIZE,
+		       DMA_BIDIRECTIONAL);
+
+	put_page(ena_page->page);
+}
+
+static struct page *ena_alloc_map_page(struct ena_ring *rx_ring, dma_addr_t *dma)
+{
+	struct page *page;
+
+	/* This would allocate the page on the same NUMA node the executing code
+	 * is running on.
+	 */
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+
+	/* To enable NIC-side port-mirroring, AKA SPAN port,
+	 * we make the buffer readable from the nic as well
+	 */
+	*dma = dma_map_page(rx_ring->dev, page, 0, ENA_PAGE_SIZE,
+			    DMA_BIDIRECTIONAL);
+	if (unlikely(dma_mapping_error(rx_ring->dev, *dma))) {
+		ena_increase_stat(&rx_ring->rx_stats.dma_mapping_err, 1,
+				  &rx_ring->syncp);
+		__free_page(page);
+		return NULL;
+	}
+
+	return page;
+}
+
+/* Removes a page from page cache and allocate a new one instead. If an
+ * allocation of a new page fails, the cache entry isn't changed
+ */
+static void ena_replace_cache_page(struct ena_ring *rx_ring,
+				   struct ena_page *ena_page)
+{
+	struct page *new_page;
+	dma_addr_t dma;
+
+	new_page = ena_alloc_map_page(rx_ring, &dma);
+
+	if (likely(new_page)) {
+		ena_put_unmap_cache_page(rx_ring, ena_page);
+
+		ena_page->page = new_page;
+		ena_page->dma_addr = dma;
+	}
+}
+
+/* Fetch the cached page (mark the page as used and pass it to the caller).
+ * If the page belongs to a different NUMA than the current one, free the cache
+ * page and allocate another one instead.
+ */
+static struct page *ena_fetch_cache_page(struct ena_ring *rx_ring,
+					 struct ena_page *ena_page,
+					 dma_addr_t *dma,
+					 int current_nid)
+{
+	/* Remove pages belonging to different node than current_nid from cache */
+	if (unlikely(page_to_nid(ena_page->page) != current_nid)) {
+		ena_increase_stat(&rx_ring->rx_stats.lpc_wrong_numa, 1, &rx_ring->syncp);
+		ena_replace_cache_page(rx_ring, ena_page);
+	}
+
+	/* Make sure no writes are pending for this page */
+	dma_sync_single_for_device(rx_ring->dev, ena_page->dma_addr,
+				   ENA_PAGE_SIZE,
+				   DMA_BIDIRECTIONAL);
+
+	/* Increase refcount to 2 so that the page is returned to the
+	 * cache after being freed
+	 */
+	page_ref_inc(ena_page->page);
+
+	*dma = ena_page->dma_addr;
+
+	return ena_page->page;
+}
+
+static struct page *ena_get_page(struct ena_ring *rx_ring, dma_addr_t *dma,
+				 int current_nid, bool *is_lpc_page)
+{
+	struct ena_page_cache *page_cache = rx_ring->page_cache;
+	u32 head, cache_current_size;
+	struct ena_page *ena_page;
+
+	/* A NULL page_cache pointer indicates a disabled cache */
+	if (!page_cache) {
+		*is_lpc_page = false;
+		return ena_alloc_map_page(rx_ring, dma);
+	}
+
+	*is_lpc_page = true;
+
+	cache_current_size = page_cache->current_size;
+	head = page_cache->head;
+
+	ena_page = &page_cache->cache[head];
+	/* Warm up phase. We fill the pages for the first time. The
+	 * phase is done in the napi context to improve the chances we
+	 * allocate on the correct NUMA node
+	 */
+	if (unlikely(cache_current_size < page_cache->max_size)) {
+		/* Check if oldest allocated page is free */
+		if (ena_page->page && page_ref_count(ena_page->page) == 1) {
+			page_cache->head = (head + 1) % cache_current_size;
+			return ena_fetch_cache_page(rx_ring, ena_page, dma, current_nid);
+		}
+
+		ena_page = &page_cache->cache[cache_current_size];
+
+		/* Add a new page to the cache */
+		ena_page->page = ena_alloc_map_page(rx_ring, dma);
+		if (unlikely(!ena_page->page))
+			return NULL;
+
+		ena_page->dma_addr = *dma;
+
+		/* Increase refcount to 2 so that the page is returned to the
+		 * cache after being freed
+		 */
+		page_ref_inc(ena_page->page);
+
+		page_cache->current_size++;
+
+		ena_increase_stat(&rx_ring->rx_stats.lpc_warm_up, 1, &rx_ring->syncp);
+
+		return ena_page->page;
+	}
+
+	/* Next page is still in use, so we allocate outside the cache */
+	if (unlikely(page_ref_count(ena_page->page) != 1)) {
+		ena_increase_stat(&rx_ring->rx_stats.lpc_full, 1, &rx_ring->syncp);
+		*is_lpc_page = false;
+		return ena_alloc_map_page(rx_ring, dma);
+	}
+
+	/* The cache has a free page to fetch for the caller. Update the
+	 * page that would be returned the next time this function's called.
+	 */
+	page_cache->head = (head + 1) & (page_cache->max_size - 1);
+
+	return ena_fetch_cache_page(rx_ring, ena_page, dma, current_nid);
+}
+
 static int ena_alloc_rx_page(struct ena_ring *rx_ring,
-				    struct ena_rx_buffer *rx_info, gfp_t gfp)
+			     struct ena_rx_buffer *rx_info, int current_nid)
 {
 	int headroom = rx_ring->rx_headroom;
 	struct ena_com_buf *ena_buf;
 	struct page *page;
+	bool is_lpc_page;
 	dma_addr_t dma;
 
 	/* restore page offset value in case it has been changed by device */
@@ -996,29 +1147,19 @@ static int ena_alloc_rx_page(struct ena_ring *rx_ring,
 	if (unlikely(rx_info->page))
 		return 0;
 
-	page = alloc_page(gfp);
+	/* We handle DMA here */
+	page = ena_get_page(rx_ring, &dma, current_nid, &is_lpc_page);
 	if (unlikely(!page)) {
 		ena_increase_stat(&rx_ring->rx_stats.page_alloc_fail, 1,
 				  &rx_ring->syncp);
 		return -ENOMEM;
 	}
 
-	/* To enable NIC-side port-mirroring, AKA SPAN port,
-	 * we make the buffer readable from the nic as well
-	 */
-	dma = dma_map_page(rx_ring->dev, page, 0, ENA_PAGE_SIZE,
-			   DMA_BIDIRECTIONAL);
-	if (unlikely(dma_mapping_error(rx_ring->dev, dma))) {
-		ena_increase_stat(&rx_ring->rx_stats.dma_mapping_err, 1,
-				  &rx_ring->syncp);
-
-		__free_page(page);
-		return -EIO;
-	}
 	netif_dbg(rx_ring->adapter, rx_status, rx_ring->netdev,
 		  "Allocate page %p, rx_info %p\n", page, rx_info);
 
 	rx_info->page = page;
+	rx_info->is_lpc_page = is_lpc_page;
 	ena_buf = &rx_info->ena_buf;
 	ena_buf->paddr = dma + headroom;
 	ena_buf->len = ENA_PAGE_SIZE - headroom;
@@ -1031,9 +1172,11 @@ static void ena_unmap_rx_buff(struct ena_ring *rx_ring,
 {
 	struct ena_com_buf *ena_buf = &rx_info->ena_buf;
 
-	dma_unmap_page(rx_ring->dev, ena_buf->paddr - rx_ring->rx_headroom,
-		       ENA_PAGE_SIZE,
-		       DMA_BIDIRECTIONAL);
+	/* LPC pages are unmapped at cache destruction */
+	if (!rx_info->is_lpc_page)
+		dma_unmap_page(rx_ring->dev, ena_buf->paddr - rx_ring->rx_headroom,
+			       ENA_PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
 }
 
 static void ena_free_rx_page(struct ena_ring *rx_ring,
@@ -1056,9 +1199,13 @@ static void ena_free_rx_page(struct ena_ring *rx_ring,
 static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num)
 {
 	u16 next_to_use, req_id;
+	int current_nid;
 	u32 i;
 	int rc;
 
+	/* Prefer pages to be allocated on the same NUMA node as the CPU */
+	current_nid = numa_mem_id();
+
 	next_to_use = rx_ring->next_to_use;
 
 	for (i = 0; i < num; i++) {
@@ -1068,8 +1215,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num)
 
 		rx_info = &rx_ring->rx_buffer_info[req_id];
 
-		rc = ena_alloc_rx_page(rx_ring, rx_info,
-				       GFP_ATOMIC | __GFP_COMP);
+		rc = ena_alloc_rx_page(rx_ring, rx_info, current_nid);
 		if (unlikely(rc < 0)) {
 			netif_warn(rx_ring->adapter, rx_err, rx_ring->netdev,
 				   "Failed to allocate buffer for rx queue %d\n",
@@ -1140,12 +1286,51 @@ static void ena_refill_all_rx_bufs(struct ena_adapter *adapter)
 	}
 }
 
+/* Release all pages from the page cache */
+static void ena_free_ring_cache_pages(struct ena_adapter *adapter, int qid)
+{
+	struct ena_ring *rx_ring = &adapter->rx_ring[qid];
+	struct ena_page_cache *page_cache;
+	int i;
+
+	/* Page cache is disabled */
+	if (!rx_ring->page_cache)
+		return;
+
+	page_cache = rx_ring->page_cache;
+
+	/* We check size value to make sure we don't
+	 * free pages that weren't allocated.
+	 */
+	for (i = 0; i < page_cache->current_size; i++) {
+		struct ena_page *ena_page = &page_cache->cache[i];
+
+		WARN_ON(!ena_page->page);
+
+		dma_unmap_page(rx_ring->dev, ena_page->dma_addr,
+			       ENA_PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+
+		/* If the page is also in the rx buffer, then this operation
+		 * would only decrease its reference count
+		 */
+		__free_page(ena_page->page);
+	}
+
+	page_cache->head = page_cache->current_size = 0;
+}
+
 static void ena_free_all_rx_bufs(struct ena_adapter *adapter)
 {
 	int i;
 
-	for (i = 0; i < adapter->num_io_queues; i++)
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		/* The RX SQ's packets should be freed first, since they don't
+		 * unmap pages that belong to the page_cache.
+		 */
 		ena_free_rx_bufs(adapter, i);
+		ena_free_ring_cache_pages(adapter, i);
+	}
 }
 
 static void ena_unmap_tx_buff(struct ena_ring *tx_ring,
@@ -2539,6 +2724,10 @@ static int create_queues_with_size_backoff(struct ena_adapter *adapter)
 		if (rc)
 			goto err_create_rx_queues;
 
+		rc = ena_create_page_caches(adapter);
+		if (rc) /* Cache memory is freed in case of failure */
+			goto err_create_rx_queues;
+
 		return 0;
 
 err_create_rx_queues:
@@ -2591,6 +2780,111 @@ static int create_queues_with_size_backoff(struct ena_adapter *adapter)
 	}
 }
 
+static void ena_free_ring_page_cache(struct ena_ring *rx_ring)
+{
+	if (!rx_ring->page_cache)
+		return;
+
+	vfree(rx_ring->page_cache);
+	rx_ring->page_cache = NULL;
+}
+
+static bool ena_is_lpc_supported(struct ena_adapter *adapter,
+				 struct ena_ring *rx_ring,
+				 bool error_print)
+{
+	void (*print_log)(const struct net_device *dev, const char *format, ...);
+	int channels_nr = adapter->num_io_queues + adapter->xdp_num_queues;
+
+	print_log = (error_print) ? netdev_err : netdev_info;
+
+	/* LPC is disabled below min number of channels */
+	if (channels_nr < ENA_LPC_MIN_NUM_OF_CHANNELS) {
+		print_log(adapter->netdev,
+			  "Local page cache is disabled for less than %d channels\n",
+			  ENA_LPC_MIN_NUM_OF_CHANNELS);
+
+		return false;
+	}
+
+	/* The driver doesn't support page caches under XDP */
+	if (ena_xdp_present_ring(rx_ring)) {
+		print_log(adapter->netdev,
+			  "Local page cache is disabled when using XDP\n");
+		return false;
+	}
+
+	return true;
+}
+
+/* Calculate the size of the Local Page Cache. If LPC should be disabled, return
+ * a size of 0.
+ */
+static u32 ena_calculate_cache_size(struct ena_adapter *adapter,
+				    struct ena_ring *rx_ring)
+{
+	u32 page_cache_size = adapter->lpc_size;
+
+	/* LPC cache size of 0 means disabled cache */
+	if (page_cache_size == 0)
+		return 0;
+
+	if (!ena_is_lpc_supported(adapter, rx_ring, false))
+		return 0;
+
+	page_cache_size = page_cache_size * ENA_LPC_MULTIPLIER_UNIT;
+	page_cache_size = roundup_pow_of_two(page_cache_size);
+
+	return page_cache_size;
+}
+
+static int ena_create_page_caches(struct ena_adapter *adapter)
+{
+	struct ena_page_cache *cache;
+	u32 page_cache_size;
+	int i;
+
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		page_cache_size = ena_calculate_cache_size(adapter, rx_ring);
+
+		if (!page_cache_size)
+			return 0;
+
+		cache = vzalloc(sizeof(struct ena_page_cache) +
+				sizeof(struct ena_page) * page_cache_size);
+		if (!cache)
+			goto err_cache_alloc;
+
+		cache->max_size = page_cache_size;
+		rx_ring->page_cache = cache;
+	}
+
+	return 0;
+err_cache_alloc:
+	netif_err(adapter, ifup, adapter->netdev,
+		  "Failed to initialize local page caches (LPCs)\n");
+	while (--i >= 0) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		ena_free_ring_page_cache(rx_ring);
+	}
+
+	return -ENOMEM;
+}
+
+static void ena_free_page_caches(struct ena_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		ena_free_ring_page_cache(rx_ring);
+	}
+}
+
 static int ena_up(struct ena_adapter *adapter)
 {
 	int io_queue_count, rc, i;
@@ -2641,6 +2935,7 @@ static int ena_up(struct ena_adapter *adapter)
 	return rc;
 
 err_up:
+	ena_free_page_caches(adapter);
 	ena_destroy_all_tx_queues(adapter);
 	ena_free_all_io_tx_resources(adapter);
 	ena_destroy_all_rx_queues(adapter);
@@ -2691,6 +2986,7 @@ static void ena_down(struct ena_adapter *adapter)
 
 	ena_free_all_tx_bufs(adapter);
 	ena_free_all_rx_bufs(adapter);
+	ena_free_page_caches(adapter);
 	ena_free_all_io_tx_resources(adapter);
 	ena_free_all_io_rx_resources(adapter);
 }
@@ -4296,6 +4592,7 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	adapter->max_rx_sgl_size = calc_queue_ctx.max_rx_sgl_size;
 
 	adapter->num_io_queues = max_num_io_queues;
+	adapter->lpc_size = ENA_LPC_DEFAULT_MULTIPLIER;
 	adapter->max_num_io_queues = max_num_io_queues;
 	adapter->last_monitored_tx_qid = 0;
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 74af15d62ee1..242c9ce4a782 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -194,6 +194,7 @@ struct ena_rx_buffer {
 	struct page *page;
 	u32 page_offset;
 	struct ena_com_buf ena_buf;
+	bool is_lpc_page;
 } ____cacheline_aligned;
 
 struct ena_stats_tx {
@@ -234,8 +235,33 @@ struct ena_stats_rx {
 	u64 xdp_tx;
 	u64 xdp_invalid;
 	u64 xdp_redirect;
+	u64 lpc_warm_up;
+	u64 lpc_full;
+	u64 lpc_wrong_numa;
 };
 
+/* LPC definitions */
+#define ENA_LPC_DEFAULT_MULTIPLIER 2
+#define ENA_LPC_MULTIPLIER_UNIT 1024
+#define ENA_LPC_MIN_NUM_OF_CHANNELS 16
+
+/* Store DMA address along with the page */
+struct ena_page {
+	struct page *page;
+	dma_addr_t dma_addr;
+};
+
+struct ena_page_cache {
+	/* How many pages are produced */
+	u32 head;
+	/* How many of the entries were initialized */
+	u32 current_size;
+	/* Maximum number of pages the cache can hold */
+	u32 max_size;
+
+	struct ena_page cache[0];
+} ____cacheline_aligned;
+
 struct ena_ring {
 	/* Holds the empty requests for TX/RX
 	 * out of order completions
@@ -252,6 +278,7 @@ struct ena_ring {
 	struct pci_dev *pdev;
 	struct napi_struct *napi;
 	struct net_device *netdev;
+	struct ena_page_cache *page_cache;
 	struct ena_com_dev *ena_dev;
 	struct ena_adapter *adapter;
 	struct ena_com_io_cq *ena_com_io_cq;
@@ -333,6 +360,9 @@ struct ena_adapter {
 	u32 num_io_queues;
 	u32 max_num_io_queues;
 
+	/* Local page cache size */
+	u32 lpc_size;
+
 	int msix_vecs;
 
 	u32 missing_tx_completion_threshold;
-- 
2.25.1



* [RFC Patch v1 2/3] net: ena: update README file with a description about LPC
  2021-03-09 17:10 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
@ 2021-03-09 17:10 ` Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 3/3] net: ena: support ethtool priv-flags and LPC state change Shay Agroskin
  2 siblings, 0 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-09 17:10 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, netdev
  Cc: Shay Agroskin, Woodhouse, David, Machulsky, Zorik, Matushevsky,
	Alexander, Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara,
	Nafea, Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

Add a description for local page cache system to the ENA driver readme
file.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
---
 .../device_drivers/ethernet/amazon/ena.rst    | 25 +++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 3561a8a29fd2..d3423a2f472c 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -222,6 +222,31 @@ If the frame length is larger than rx_copybreak, napi_get_frags()
 is used, otherwise netdev_alloc_skb_ip_align() is used, the buffer
 content is copied (by CPU) to the SKB, and the buffer is recycled.
 
+Local Page Cache (LPC)
+======================
+The ENA Linux driver can reduce lock contention and improve CPU usage by
+allocating RX buffers from a page cache rather than from the Linux memory system
+(PCP or buddy allocator). The cache is created and bound per RX queue, and
+pages allocated for the queue are stored in the cache (up to the cache's maximum
+size).
+
+When enabled, LPC cache size is ENA_LPC_DEFAULT_MULTIPLIER * 1024 (2048 by
+default) pages.
+
+The cache usage for each queue can be monitored using ``ethtool -S`` counters. Where:
+
+- *rx_queue#_lpc_warm_up* - number of pages that were allocated and stored in
+  the cache
+- *rx_queue#_lpc_full* - number of pages that were allocated without using the
+  cache because it didn't have free pages
+- *rx_queue#_lpc_wrong_numa* -  number of pages from the cache that belong to a
+  different NUMA node than the CPU which runs the NAPI routine. In this case,
+  the driver would try to allocate a new page from the same NUMA node instead
+
+LPC is disabled when using XDP or when using fewer than 16 queue pairs. Note that
+cache usage might increase the memory footprint of the driver (depending on the
+traffic).
+
 Statistics
 ==========
 The user can obtain ENA device and driver statistics using ethtool.
-- 
2.25.1



* [RFC Patch v1 3/3] net: ena: support ethtool priv-flags and LPC state change
  2021-03-09 17:10 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
  2021-03-09 17:10 ` [RFC Patch v1 2/3] net: ena: update README file with a description about LPC Shay Agroskin
@ 2021-03-09 17:10 ` Shay Agroskin
  2 siblings, 0 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-09 17:10 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, netdev
  Cc: Shay Agroskin, Woodhouse, David, Machulsky, Zorik, Matushevsky,
	Alexander, Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara,
	Nafea, Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

Add support for ethtool private flags show/set to the driver.
The first feature added to this infrastructure is Local Page Cache.

The LPC state query returns whether the LPC is currently used in the
driver. This is not the same as whether LPC is enabled: with XDP, LPC
won't be used even if the user requested it, and the cache might be
turned on right after the XDP program is unloaded.

The LPC state change toggles between an LPC cache size of 0 (i.e.
disabled cache) and the size ENA_LPC_DEFAULT_MULTIPLIER * 1024
(equal to 2048).
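
How the multiplier turns into a number of cache entries can be sketched in
user space as follows. This is a minimal sketch assuming the size
calculation from patch 1/3 (multiplier * 1024, rounded up to a power of
two); roundup_pow2_u32 and lpc_entries are hypothetical helpers, not driver
functions, and the XDP and channel-count checks are left out.

```c
#include <assert.h>
#include <stdint.h>

#define LPC_MULTIPLIER_UNIT 1024u	/* mirrors ENA_LPC_MULTIPLIER_UNIT */

/* Smallest power of two >= v, valid for 1 <= v <= 2^31 */
static uint32_t roundup_pow2_u32(uint32_t v)
{
	uint32_t p = 1;

	assert(v >= 1 && v <= 0x80000000u);
	while (p < v)
		p <<= 1;
	return p;
}

/* Number of cache entries for a multiplier; 0 means the cache is disabled */
static uint32_t lpc_entries(uint32_t multiplier)
{
	if (multiplier == 0)
		return 0;

	/* Keep the product within the range roundup_pow2_u32() accepts */
	assert(multiplier <= 0x80000000u / LPC_MULTIPLIER_UNIT);
	return roundup_pow2_u32(multiplier * LPC_MULTIPLIER_UNIT);
}
```

A power-of-two size lets the head index wrap with a mask, as in the
`(head + 1) & (max_size - 1)` update in patch 1/3.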

This patch also documents the private flag support for LPC in the
README.rst file.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
---
 .../device_drivers/ethernet/amazon/ena.rst    |  3 ++
 drivers/net/ethernet/amazon/ena/ena_ethtool.c | 53 ++++++++++++++++---
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 32 +++++++++++
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  2 +
 4 files changed, 83 insertions(+), 7 deletions(-)

diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index d3423a2f472c..63735f1dc216 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -232,6 +232,9 @@ size).
 
 When enabled, LPC cache size is ENA_LPC_DEFAULT_MULTIPLIER * 1024 (2048 by
 default) pages.
+The feature can be toggled between on/off state using ethtool private flags,
+e.g.
+    # ethtool --set-priv-flags eth1 local_page_cache off
 
 The cache usage for each queue can be monitored using ``ethtool -S`` counters. Where:
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
index fe16b3d5bd73..aea76dc51dff 100644
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -116,6 +116,13 @@ static const struct ena_stats ena_stats_ena_com_strings[] = {
 #define ENA_STATS_ARRAY_ENI(adapter)	\
 	(ARRAY_SIZE(ena_stats_eni_strings) * (adapter)->eni_stats_supported)
 
+static const char ena_priv_flags_strings[][ETH_GSTRING_LEN] = {
+#define ENA_PRIV_FLAGS_LPC	BIT(0)
+	"local_page_cache",
+};
+
+#define ENA_PRIV_FLAGS_NR ARRAY_SIZE(ena_priv_flags_strings)
+
 static void ena_safe_update_stat(u64 *src, u64 *dst,
 				 struct u64_stats_sync *syncp)
 {
@@ -236,10 +243,15 @@ int ena_get_sset_count(struct net_device *netdev, int sset)
 {
 	struct ena_adapter *adapter = netdev_priv(netdev);
 
-	if (sset != ETH_SS_STATS)
-		return -EOPNOTSUPP;
+	switch (sset) {
+	case ETH_SS_STATS:
+		return ena_get_sw_stats_count(adapter) +
+		       ena_get_hw_stats_count(adapter);
+	case ETH_SS_PRIV_FLAGS:
+		return ENA_PRIV_FLAGS_NR;
+	}
 
-	return ena_get_sw_stats_count(adapter) + ena_get_hw_stats_count(adapter);
+	return -EOPNOTSUPP;
 }
 
 static void ena_queue_strings(struct ena_adapter *adapter, u8 **data)
@@ -320,10 +332,14 @@ static void ena_get_ethtool_strings(struct net_device *netdev,
 {
 	struct ena_adapter *adapter = netdev_priv(netdev);
 
-	if (sset != ETH_SS_STATS)
-		return;
-
-	ena_get_strings(adapter, data, adapter->eni_stats_supported);
+	switch (sset) {
+	case ETH_SS_STATS:
+		ena_get_strings(adapter, data, adapter->eni_stats_supported);
+		break;
+	case ETH_SS_PRIV_FLAGS:
+		memcpy(data, ena_priv_flags_strings, sizeof(ena_priv_flags_strings));
+		break;
+	}
 }
 
 static int ena_get_link_ksettings(struct net_device *netdev,
@@ -460,6 +476,8 @@ static void ena_get_drvinfo(struct net_device *dev,
 	strlcpy(info->driver, DRV_MODULE_NAME, sizeof(info->driver));
 	strlcpy(info->bus_info, pci_name(adapter->pdev),
 		sizeof(info->bus_info));
+
+	info->n_priv_flags = ENA_PRIV_FLAGS_NR;
 }
 
 static void ena_get_ringparam(struct net_device *netdev,
@@ -892,6 +910,25 @@ static int ena_set_tunable(struct net_device *netdev,
 	return ret;
 }
 
+static u32 ena_get_priv_flags(struct net_device *netdev)
+{
+	struct ena_adapter *adapter = netdev_priv(netdev);
+	u32 priv_flags = 0;
+
+	if (adapter->rx_ring->page_cache)
+		priv_flags |= ENA_PRIV_FLAGS_LPC;
+
+	return priv_flags;
+}
+
+static int ena_set_priv_flags(struct net_device *netdev, u32 priv_flags)
+{
+	struct ena_adapter *adapter = netdev_priv(netdev);
+
+	/* LPC is the only supported private flag for now */
+	return ena_set_lpc_state(adapter, !!(priv_flags & ENA_PRIV_FLAGS_LPC));
+}
+
 static const struct ethtool_ops ena_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_USECS |
 				     ETHTOOL_COALESCE_USE_ADAPTIVE_RX,
@@ -918,6 +955,8 @@ static const struct ethtool_ops ena_ethtool_ops = {
 	.get_tunable		= ena_get_tunable,
 	.set_tunable		= ena_set_tunable,
 	.get_ts_info            = ethtool_op_get_ts_info,
+	.get_priv_flags		= ena_get_priv_flags,
+	.set_priv_flags		= ena_set_priv_flags,
 };
 
 void ena_set_ethtool_ops(struct net_device *netdev)
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 9f6cc479506f..1ec9e24d8c8c 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2804,6 +2804,11 @@ static bool ena_is_lpc_supported(struct ena_adapter *adapter,
 			  "Local page cache is disabled for less than %d channels\n",
 			  ENA_LPC_MIN_NUM_OF_CHANNELS);
 
+		/* Disable LPC in this case. It can be enabled again through the
+		 * ethtool private flag.
+		 */
+		adapter->lpc_size = 0;
+
 		return false;
 	}
 
@@ -3063,6 +3068,33 @@ static int ena_close(struct net_device *netdev)
 	return 0;
 }
 
+int ena_set_lpc_state(struct ena_adapter *adapter, bool enabled)
+{
+	/* In XDP, lpc_size might be positive even with LPC disabled, use cache
+	 * pointer instead.
+	 */
+	struct ena_page_cache *page_cache = adapter->rx_ring->page_cache;
+
+	/* Exit early if LPC state doesn't change */
+	if (enabled == !!page_cache)
+		return 0;
+
+	if (enabled && !ena_is_lpc_supported(adapter, adapter->rx_ring, true))
+		return -EOPNOTSUPP;
+
+	adapter->lpc_size = enabled ? ENA_LPC_DEFAULT_MULTIPLIER : 0;
+
+	/* rtnl lock is already obtained in dev_ioctl() layer, so it's safe to
+	 * re-initialize IO resources.
+	 */
+	if (test_bit(ENA_FLAG_DEV_UP, &adapter->flags)) {
+		ena_close(adapter->netdev);
+		ena_up(adapter);
+	}
+
+	return 0;
+}
+
 int ena_update_queue_sizes(struct ena_adapter *adapter,
 			   u32 new_tx_size,
 			   u32 new_rx_size)
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 242c9ce4a782..95b0d16dc71e 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -430,6 +430,8 @@ void ena_dump_stats_to_buf(struct ena_adapter *adapter, u8 *buf);
 
 int ena_update_hw_stats(struct ena_adapter *adapter);
 
+int ena_set_lpc_state(struct ena_adapter *adapter, bool enabled);
+
 int ena_update_queue_sizes(struct ena_adapter *adapter,
 			   u32 new_tx_size,
 			   u32 new_rx_size);
-- 
2.25.1
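
The get/set private-flag handlers in the patch above reduce to a bitmask test plus an early exit when the state does not change. The following is a minimal userspace sketch of that logic, not the driver code: `DEMO_PRIV_FLAG_LPC` and `struct demo_adapter` are hypothetical stand-ins (the real `ENA_PRIV_FLAGS_LPC` definition is not shown in this excerpt), and the `resets` counter only models the close/up cycle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the driver's real value is not shown in this thread */
#define DEMO_PRIV_FLAG_LPC (1u << 0)

struct demo_adapter {
	bool lpc_enabled;	/* models "rx_ring->page_cache != NULL" */
	unsigned int resets;	/* counts simulated close/up cycles */
};

/* Report the current state as a flag bitmap */
static uint32_t demo_get_priv_flags(const struct demo_adapter *a)
{
	assert(a != NULL);
	return a->lpc_enabled ? DEMO_PRIV_FLAG_LPC : 0;
}

/* Apply a requested flag bitmap; only re-initialize when the state changes */
static int demo_set_priv_flags(struct demo_adapter *a, uint32_t flags)
{
	bool enable = !!(flags & DEMO_PRIV_FLAG_LPC);

	assert(a != NULL);

	/* Exit early if the LPC state doesn't change */
	if (enable == a->lpc_enabled)
		return 0;

	a->lpc_enabled = enable;
	a->resets++;	/* stands in for the close/up re-initialization */
	return 0;
}
```

Setting the same flag twice leaves `resets` at 1, which mirrors the early exit in `ena_set_lpc_state()`.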


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-09 17:10 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
@ 2021-03-09 17:57   ` Eric Dumazet
  2021-03-10  2:04     ` Andrew Lunn
  2021-03-16  8:26     ` Shay Agroskin
  0 siblings, 2 replies; 10+ messages in thread
From: Eric Dumazet @ 2021-03-09 17:57 UTC (permalink / raw)
  To: Shay Agroskin, David Miller, Jakub Kicinski, netdev
  Cc: Woodhouse, David, Machulsky, Zorik, Matushevsky, Alexander,
	Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara, Nafea,
	Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam



On 3/9/21 6:10 PM, Shay Agroskin wrote:
> The page cache holds pages we allocated in the past during napi cycle,
> and tracks their availability status using page ref count.
> 
> The cache can hold up to 2048 pages. Upon allocating a page, we check
> whether the next entry in the cache contains an unused page, and if so
> fetch it. If the next page is already used by another entity or if it
> belongs to a different NUMA core than the napi routine, we allocate a
> page in the regular way (page from a different NUMA core is replaced by
> the newly allocated page).
> 
> This system can help us reduce the contention between different cores
> when allocating page since every cache is unique to a queue.

For reference, many drivers already use a similar strategy.

> +
> +/* Fetch the cached page (mark the page as used and pass it to the caller).
> + * If the page belongs to a different NUMA than the current one, free the cache
> + * page and allocate another one instead.
> + */
> +static struct page *ena_fetch_cache_page(struct ena_ring *rx_ring,
> +					 struct ena_page *ena_page,
> +					 dma_addr_t *dma,
> +					 int current_nid)
> +{
> +	/* Remove pages belonging to different node than current_nid from cache */
> +	if (unlikely(page_to_nid(ena_page->page) != current_nid)) {
> +		ena_increase_stat(&rx_ring->rx_stats.lpc_wrong_numa, 1, &rx_ring->syncp);
> +		ena_replace_cache_page(rx_ring, ena_page);
> +	}
> +
> 

And they use dev_page_is_reusable() instead of copy/pasting this logic.

As a bonus, they properly deal with pfmemalloc
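
For readers unfamiliar with the helper mentioned above: in kernels that provide it, `dev_page_is_reusable()` combines the NUMA-locality check with a pfmemalloc check. The sketch below is a userspace model of that predicate only; `struct demo_page` and its fields are hypothetical stand-ins for what `page_to_nid()` and `page_is_pfmemalloc()` report on a real `struct page`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two page properties the check looks at */
struct demo_page {
	int nid;		/* NUMA node the page lives on */
	bool pfmemalloc;	/* page came from emergency reserves */
};

/* A page is worth recycling only if it is node-local and not from reserves */
static bool demo_page_is_reusable(const struct demo_page *page, int local_nid)
{
	assert(page != NULL);
	return page->nid == local_nid && !page->pfmemalloc;
}
```

The patch under review checks only the node ID, so a page allocated from the emergency reserves would stay in the cache; that is the case the pfmemalloc remark refers to.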





^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-09 17:57   ` Eric Dumazet
@ 2021-03-10  2:04     ` Andrew Lunn
  2021-03-11 23:15       ` Saeed Mahameed
  2021-03-16  8:26     ` Shay Agroskin
  1 sibling, 1 reply; 10+ messages in thread
From: Andrew Lunn @ 2021-03-10  2:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Shay Agroskin, David Miller, Jakub Kicinski, netdev, Woodhouse,
	David, Machulsky, Zorik, Matushevsky, Alexander, Saeed Bshara,
	Wilson, Matt, Liguori, Anthony, Bshara, Nafea, Tzalik, Guy,
	Belgazal, Netanel, Saidi, Ali, Herrenschmidt, Benjamin,
	Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

On Tue, Mar 09, 2021 at 06:57:06PM +0100, Eric Dumazet wrote:
> 
> 
> On 3/9/21 6:10 PM, Shay Agroskin wrote:
> > The page cache holds pages we allocated in the past during napi cycle,
> > and tracks their availability status using page ref count.
> > 
> > The cache can hold up to 2048 pages. Upon allocating a page, we check
> > whether the next entry in the cache contains an unused page, and if so
> > fetch it. If the next page is already used by another entity or if it
> > belongs to a different NUMA core than the napi routine, we allocate a
> > page in the regular way (page from a different NUMA core is replaced by
> > the newly allocated page).
> > 
> > This system can help us reduce the contention between different cores
> > when allocating page since every cache is unique to a queue.
> 
> For reference, many drivers already use a similar strategy.

Hi Eric

So rather than yet another implementation, should we push for a
generic implementation which any driver can use?

	Andrew

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-10  2:04     ` Andrew Lunn
@ 2021-03-11 23:15       ` Saeed Mahameed
  2021-03-16  8:23         ` Shay Agroskin
  0 siblings, 1 reply; 10+ messages in thread
From: Saeed Mahameed @ 2021-03-11 23:15 UTC (permalink / raw)
  To: Andrew Lunn, Eric Dumazet, Jesper Dangaard Brouer, Matteo Croce
  Cc: Shay Agroskin, David Miller, Jakub Kicinski, netdev, Woodhouse,
	David, Machulsky, Zorik, Matushevsky, Alexander, Saeed Bshara,
	Wilson, Matt, Liguori, Anthony, Bshara, Nafea, Tzalik, Guy,
	Belgazal, Netanel, Saidi, Ali, Herrenschmidt, Benjamin,
	Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

On Wed, 2021-03-10 at 03:04 +0100, Andrew Lunn wrote:
> On Tue, Mar 09, 2021 at 06:57:06PM +0100, Eric Dumazet wrote:
> > 
> > 
> > On 3/9/21 6:10 PM, Shay Agroskin wrote:
> > > The page cache holds pages we allocated in the past during napi
> > > cycle,
> > > and tracks their availability status using page ref count.
> > > 
> > > The cache can hold up to 2048 pages. Upon allocating a page, we

2048 per core? IMHO this is too much! Ideally you want twice the napi
budget.

You are trying to mitigate TCP/L4 delays/congestion, but this is
very prone to DNS attacks. If your memory allocators are under stress,
you shouldn't be hogging your own pages and worsening the situation.
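
To put the sizing concern in numbers, here is a small arithmetic sketch. It assumes 4 KiB pages, a NAPI budget of 64 and 32 RX queues; all three values are illustrative assumptions, not figures taken from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Total memory pinned by per-queue caches, in bytes */
static uint64_t demo_cache_bytes(uint32_t pages_per_queue, uint32_t queues,
				 uint32_t page_size)
{
	assert(page_size > 0);
	return (uint64_t)pages_per_queue * queues * page_size;
}
```

Under these assumptions, `demo_cache_bytes(2048, 32, 4096)` is 256 MiB, while a cache of twice the budget, `demo_cache_bytes(2 * 64, 32, 4096)`, is 16 MiB.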

> > > check
> > > whether the next entry in the cache contains an unused page, and
> > > if so
> > > fetch it. If the next page is already used by another entity or
> > > if it
> > > belongs to a different NUMA core than the napi routine, we
> > > allocate a
> > > page in the regular way (page from a different NUMA core is
> > > replaced by
> > > the newly allocated page).
> > > 
> > > This system can help us reduce the contention between different
> > > cores
> > > when allocating page since every cache is unique to a queue.
> > 
> > For reference, many drivers already use a similar strategy.
> 
> Hi Eric
> 
> So rather than yet another implementation, should we push for a
> generic implementation which any driver can use?
> 

We already have it:
https://www.kernel.org/doc/html/latest/networking/page_pool.html

also please checkout this fresh page pool extension, SKB buffer
recycling RFC, might be useful for the use cases ena are interested in

https://patchwork.kernel.org/project/netdevbpf/patch/20210311194256.53706-4-mcroce@linux.microsoft.com/




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-11 23:15       ` Saeed Mahameed
@ 2021-03-16  8:23         ` Shay Agroskin
  0 siblings, 0 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-16  8:23 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: Andrew Lunn, Eric Dumazet, Jesper Dangaard Brouer, Matteo Croce,
	David Miller, Jakub Kicinski, netdev, Woodhouse, David,
	Machulsky, Zorik, Matushevsky, Alexander, Saeed Bshara, Wilson,
	Matt, Liguori, Anthony, Bshara, Nafea, Tzalik, Guy, Belgazal,
	Netanel, Saidi, Ali, Herrenschmidt, Benjamin, Kiyanovski, Arthur,
	Jubran, Samih, Dagan, Noam


Saeed Mahameed <saeed@kernel.org> writes:

> On Wed, 2021-03-10 at 03:04 +0100, Andrew Lunn wrote:
>> On Tue, Mar 09, 2021 at 06:57:06PM +0100, Eric Dumazet wrote:
>> > 
>> > 
>> > On 3/9/21 6:10 PM, Shay Agroskin wrote:
>> > > The page cache holds pages we allocated in the past during 
>> > > napi
>> > > cycle,
>> > > and tracks their availability status using page ref count.
>> > > 
>> > > The cache can hold up to 2048 pages. Upon allocating a 
>> > > page, we
>
> 2048 per core ? IMHO this is too much ! ideally you want twice 
> the napi
> budget.
>
> you are trying to mitigate against TCP/L4 delays/congestion but 
> this is
> very prone to DNS attacks, if your memory allocators are under 
> stress,
> you shouldn't be hogging own pages and worsen the situation.

First of all, thank you for taking a look at this patchset.

We are trying to mitigate simultaneous access to a shared
resource, the buddy allocator.
When using local caches, I reduce the number of accesses to this
shared resource by about 90%, thus avoiding this contention.

I might not understand you correctly, but this patch doesn't try
to mitigate network peaks. I agree that we're hogging quite a lot
of the system's resources. I'll run some tests with a smaller cache
size (e.g. 2x napi budget) and see if it mitigates the problem we have.

>
>> > > check
>> > > whether the next entry in the cache contains an unused 
>> > > page, and
>> > > if so
>> > > fetch it. If the next page is already used by another 
>> > > entity or
>> > > if it
>> > > belongs to a different NUMA core than the napi routine, we
>> > > allocate a
>> > > page in the regular way (page from a different NUMA core is
>> > > replaced by
>> > > the newly allocated page).
>> > > 
>> > > This system can help us reduce the contention between 
>> > > different
>> > > cores
>> > > when allocating page since every cache is unique to a 
>> > > queue.
>> > 
>> > For reference, many drivers already use a similar strategy.
>> 
>> Hi Eric
>> 
>> So rather than yet another implementation, should we push for a
>> generic implementation which any driver can use?
>> 
>
> We already have it:
> https://www.kernel.org/doc/html/latest/networking/page_pool.html

Yup, the original page pool implementation didn't suit our needs
since we never got to free the pages using its specialized
function for non-XDP traffic.

>
> also please checkout this fresh page pool extension, SKB buffer
> recycling RFC, might be useful for the use cases ena are 
> interested in
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20210311194256.53706-4-mcroce@linux.microsoft.com/

I've gone over the code and run some tests with this patchset. At first
look it seems like it does allow us to mitigate the problem this
patchset solves. I'll run more tests with it and report my conclusions
on this thread.

Thanks a lot for pointing it out (:

Shay

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-09 17:57   ` Eric Dumazet
  2021-03-10  2:04     ` Andrew Lunn
@ 2021-03-16  8:26     ` Shay Agroskin
  1 sibling, 0 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-16  8:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Jakub Kicinski, netdev, Woodhouse, David,
	Machulsky, Zorik, Matushevsky, Alexander, Saeed Bshara, Wilson,
	Matt, Liguori, Anthony, Bshara, Nafea, Tzalik, Guy, Belgazal,
	Netanel, Saidi, Ali, Herrenschmidt, Benjamin, Kiyanovski, Arthur,
	Jubran, Samih, Dagan, Noam


Eric Dumazet <eric.dumazet@gmail.com> writes:

> On 3/9/21 6:10 PM, Shay Agroskin wrote:
>> The page cache holds pages we allocated in the past during napi 
>> cycle,
>> and tracks their availability status using page ref count.
>> 
>> The cache can hold up to 2048 pages. Upon allocating a page, we 
>> check
>> whether the next entry in the cache contains an unused page, 
>> and if so
>> fetch it. If the next page is already used by another entity or 
>> if it
>> belongs to a different NUMA core than the napi routine, we 
>> allocate a
>> page in the regular way (page from a different NUMA core is 
>> replaced by
>> the newly allocated page).
>> 
>> This system can help us reduce the contention between different 
>> cores
>> when allocating page since every cache is unique to a queue.
>
> For reference, many drivers already use a similar strategy.
>
>> +
>> +/* Fetch the cached page (mark the page as used and pass it to 
>> the caller).
>> + * If the page belongs to a different NUMA than the current 
>> one, free the cache
>> + * page and allocate another one instead.
>> + */
>> +static struct page *ena_fetch_cache_page(struct ena_ring 
>> *rx_ring,
>> +					 struct ena_page 
>> *ena_page,
>> +					 dma_addr_t *dma,
>> +					 int current_nid)
>> +{
>> +	/* Remove pages belonging to different node than 
>> current_nid from cache */
>> +	if (unlikely(page_to_nid(ena_page->page) != current_nid)) 
>> {
>> + 
>> ena_increase_stat(&rx_ring->rx_stats.lpc_wrong_numa, 1, 
>> &rx_ring->syncp);
>> +		ena_replace_cache_page(rx_ring, ena_page);
>> +	}
>> +
>> 
>
> And they use dev_page_is_reusable() instead of copy/pasting this 
> logic.
>
> As a bonus, they properly deal with pfmemalloc

Thanks for reviewing it, and sorry for the time it took me to
reply. I wanted to test some of the suggestions given to me here
before replying.

Provided that this patchset is still necessary for our
driver after testing the patchset Saeed suggested, I will switch
to using dev_page_is_reusable() instead of this expression.

Shay

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system
  2021-03-09 17:07 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
@ 2021-03-09 17:07 ` Shay Agroskin
  0 siblings, 0 replies; 10+ messages in thread
From: Shay Agroskin @ 2021-03-09 17:07 UTC (permalink / raw)
  To: David Miller, Jakub Kicinski, netdev
  Cc: Shay Agroskin, Woodhouse, David, Machulsky, Zorik, Matushevsky,
	Alexander, Saeed Bshara, Wilson, Matt, Liguori, Anthony, Bshara,
	Nafea, Tzalik, Guy, Belgazal, Netanel, Saidi, Ali, Herrenschmidt,
	Benjamin, Kiyanovski, Arthur, Jubran, Samih, Dagan, Noam

The page cache holds pages we allocated in the past during the napi cycle,
and tracks their availability status using the page ref count.

The cache can hold up to 2048 pages. Upon allocating a page, we check
whether the next entry in the cache contains an unused page, and if so
fetch it. If the next page is already used by another entity or if it
belongs to a different NUMA node than the napi routine, we allocate a
page in the regular way (a page from a different NUMA node is replaced by
the newly allocated page).

This system can help us reduce the contention between different cores
when allocating pages since every cache is unique to a queue.

This patch adds the following ethtool counters:

- lpc_warm_up: If the next page in the cache isn't free, and the
    cache hasn't yet reached its maximum number of pages, allocate a new
    page and store it in the cache. This counter increases every time this
    happens. Its maximum value can be N * 'current queue size'.

- lpc_full: The next entry in the cache contains a page that is
    still used. In such a case a page is allocated in the regular way
    (i.e. dev_alloc_page())

- lpc_wrong_numa: The next entry in the cache contains a page in a
    different NUMA node than the napi routine which allocates the page.
    In this case increase the counter and replace current entry with a
    page from the same NUMA node.

Note that in all three cases a page should be returned to the caller of
the page cache function, and the page would be either from the cache, or
from the Linux memory system.
In case the system is out-of-memory the cache returns NULL. This
scenario doesn't break the cache's correctness.

The page cache is disabled when there are fewer than 16 queues or when XDP
is used.
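
The allocation flow described above (warm-up, cache hit, cache full) can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the driver code: the `demo_*` names are hypothetical, a plain integer stands in for the page reference count (1 means only the cache holds the page, 2 means it is in use), and DMA mapping and the NUMA check are left out.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CACHE_MAX 4u	/* power of two, like the rounded-up cache size */

enum demo_src { DEMO_FROM_CACHE, DEMO_WARM_UP, DEMO_OUTSIDE_CACHE };

struct demo_cache {
	uint32_t head;			/* next entry to hand out */
	uint32_t current_size;		/* entries initialized so far */
	int refcount[DEMO_CACHE_MAX];	/* 1 = free in cache, 2 = in use */
};

/* Report where the page came from; *slot is the cache index, or -1 */
static enum demo_src demo_get_page(struct demo_cache *c, int *slot)
{
	uint32_t head = c->head;

	assert(c->current_size <= DEMO_CACHE_MAX);

	if (c->current_size < DEMO_CACHE_MAX) {
		/* Warm-up: reuse the oldest page if it is free */
		if (c->current_size > 0 && c->refcount[head] == 1) {
			c->head = (head + 1) % c->current_size;
			c->refcount[head]++;
			*slot = (int)head;
			return DEMO_FROM_CACHE;
		}

		/* Otherwise grow the cache (counted as lpc_warm_up) */
		*slot = (int)c->current_size;
		c->refcount[c->current_size++] = 2;
		return DEMO_WARM_UP;
	}

	/* Next page still in use: allocate outside (counted as lpc_full) */
	if (c->refcount[head] != 1) {
		*slot = -1;
		return DEMO_OUTSIDE_CACHE;
	}

	c->head = (head + 1) & (DEMO_CACHE_MAX - 1);
	c->refcount[head]++;
	*slot = (int)head;
	return DEMO_FROM_CACHE;
}

/* The stack drops its reference; the page becomes reusable at refcount 1 */
static void demo_put_page(struct demo_cache *c, int slot)
{
	assert(slot >= 0 && (uint32_t)slot < c->current_size);
	c->refcount[slot]--;
}
```

Because only the entry at `head` is inspected, a single long-lived page at the head makes every allocation fall back to the regular allocator until that page is released.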

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
---
 drivers/net/ethernet/amazon/ena/ena_ethtool.c |   3 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c  | 337 ++++++++++++++++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h  |  30 ++
 3 files changed, 350 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_ethtool.c b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
index d6cc7aa612b7..fe16b3d5bd73 100644
--- a/drivers/net/ethernet/amazon/ena/ena_ethtool.c
+++ b/drivers/net/ethernet/amazon/ena/ena_ethtool.c
@@ -96,6 +96,9 @@ static const struct ena_stats ena_stats_rx_strings[] = {
 	ENA_STAT_RX_ENTRY(xdp_tx),
 	ENA_STAT_RX_ENTRY(xdp_invalid),
 	ENA_STAT_RX_ENTRY(xdp_redirect),
+	ENA_STAT_RX_ENTRY(lpc_warm_up),
+	ENA_STAT_RX_ENTRY(lpc_full),
+	ENA_STAT_RX_ENTRY(lpc_wrong_numa),
 };
 
 static const struct ena_stats ena_stats_ena_com_strings[] = {
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index 102f2c91fdb8..9f6cc479506f 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -49,6 +49,7 @@ static int ena_rss_init_default(struct ena_adapter *adapter);
 static void check_for_admin_com_state(struct ena_adapter *adapter);
 static void ena_destroy_device(struct ena_adapter *adapter, bool graceful);
 static int ena_restore_device(struct ena_adapter *adapter);
+static int ena_create_page_caches(struct ena_adapter *adapter);
 
 static void ena_init_io_rings(struct ena_adapter *adapter,
 			      int first_index, int count);
@@ -981,12 +982,162 @@ static void ena_free_all_io_rx_resources(struct ena_adapter *adapter)
 		ena_free_rx_resources(adapter, i);
 }
 
+static void ena_put_unmap_cache_page(struct ena_ring *rx_ring, struct ena_page *ena_page)
+{
+	dma_unmap_page(rx_ring->dev, ena_page->dma_addr, ENA_PAGE_SIZE,
+		       DMA_BIDIRECTIONAL);
+
+	put_page(ena_page->page);
+}
+
+static struct page *ena_alloc_map_page(struct ena_ring *rx_ring, dma_addr_t *dma)
+{
+	struct page *page;
+
+	/* This would allocate the page on the same NUMA node the executing code
+	 * is running on.
+	 */
+	page = dev_alloc_page();
+	if (!page)
+		return NULL;
+
+	/* To enable NIC-side port-mirroring, AKA SPAN port,
+	 * we make the buffer readable from the nic as well
+	 */
+	*dma = dma_map_page(rx_ring->dev, page, 0, ENA_PAGE_SIZE,
+			    DMA_BIDIRECTIONAL);
+	if (unlikely(dma_mapping_error(rx_ring->dev, *dma))) {
+		ena_increase_stat(&rx_ring->rx_stats.dma_mapping_err, 1,
+				  &rx_ring->syncp);
+		__free_page(page);
+		return NULL;
+	}
+
+	return page;
+}
+
+/* Removes a page from page cache and allocate a new one instead. If an
+ * allocation of a new page fails, the cache entry isn't changed
+ */
+static void ena_replace_cache_page(struct ena_ring *rx_ring,
+				   struct ena_page *ena_page)
+{
+	struct page *new_page;
+	dma_addr_t dma;
+
+	new_page = ena_alloc_map_page(rx_ring, &dma);
+
+	if (likely(new_page)) {
+		ena_put_unmap_cache_page(rx_ring, ena_page);
+
+		ena_page->page = new_page;
+		ena_page->dma_addr = dma;
+	}
+}
+
+/* Fetch the cached page (mark the page as used and pass it to the caller).
+ * If the page belongs to a different NUMA than the current one, free the cache
+ * page and allocate another one instead.
+ */
+static struct page *ena_fetch_cache_page(struct ena_ring *rx_ring,
+					 struct ena_page *ena_page,
+					 dma_addr_t *dma,
+					 int current_nid)
+{
+	/* Remove pages belonging to different node than current_nid from cache */
+	if (unlikely(page_to_nid(ena_page->page) != current_nid)) {
+		ena_increase_stat(&rx_ring->rx_stats.lpc_wrong_numa, 1, &rx_ring->syncp);
+		ena_replace_cache_page(rx_ring, ena_page);
+	}
+
+	/* Make sure no writes are pending for this page */
+	dma_sync_single_for_device(rx_ring->dev, ena_page->dma_addr,
+				   ENA_PAGE_SIZE,
+				   DMA_BIDIRECTIONAL);
+
+	/* Increase refcount to 2 so that the page is returned to the
+	 * cache after being freed
+	 */
+	page_ref_inc(ena_page->page);
+
+	*dma = ena_page->dma_addr;
+
+	return ena_page->page;
+}
+
+static struct page *ena_get_page(struct ena_ring *rx_ring, dma_addr_t *dma,
+				 int current_nid, bool *is_lpc_page)
+{
+	struct ena_page_cache *page_cache = rx_ring->page_cache;
+	u32 head, cache_current_size;
+	struct ena_page *ena_page;
+
+	/* A NULL cache pointer indicates a disabled cache */
+	if (!page_cache) {
+		*is_lpc_page = false;
+		return ena_alloc_map_page(rx_ring, dma);
+	}
+
+	*is_lpc_page = true;
+
+	cache_current_size = page_cache->current_size;
+	head = page_cache->head;
+
+	ena_page = &page_cache->cache[head];
+	/* Warm up phase. We fill the pages for the first time. The
+	 * phase is done in the napi context to improve the chances we
+	 * allocate on the correct NUMA node
+	 */
+	if (unlikely(cache_current_size < page_cache->max_size)) {
+		/* Check if oldest allocated page is free */
+		if (ena_page->page && page_ref_count(ena_page->page) == 1) {
+			page_cache->head = (head + 1) % cache_current_size;
+			return ena_fetch_cache_page(rx_ring, ena_page, dma, current_nid);
+		}
+
+		ena_page = &page_cache->cache[cache_current_size];
+
+		/* Add a new page to the cache */
+		ena_page->page = ena_alloc_map_page(rx_ring, dma);
+		if (unlikely(!ena_page->page))
+			return NULL;
+
+		ena_page->dma_addr = *dma;
+
+		/* Increase refcount to 2 so that the page is returned to the
+		 * cache after being freed
+		 */
+		page_ref_inc(ena_page->page);
+
+		page_cache->current_size++;
+
+		ena_increase_stat(&rx_ring->rx_stats.lpc_warm_up, 1, &rx_ring->syncp);
+
+		return ena_page->page;
+	}
+
+	/* Next page is still in use, so we allocate outside the cache */
+	if (unlikely(page_ref_count(ena_page->page) != 1)) {
+		ena_increase_stat(&rx_ring->rx_stats.lpc_full, 1, &rx_ring->syncp);
+		*is_lpc_page = false;
+		return ena_alloc_map_page(rx_ring, dma);
+	}
+
+	/* The cache has a free page to fetch for the caller. Update the
+	 * page that would be returned the next time this function's called.
+	 */
+	page_cache->head = (head + 1) & (page_cache->max_size - 1);
+
+	return ena_fetch_cache_page(rx_ring, ena_page, dma, current_nid);
+}
+
 static int ena_alloc_rx_page(struct ena_ring *rx_ring,
-				    struct ena_rx_buffer *rx_info, gfp_t gfp)
+			     struct ena_rx_buffer *rx_info, int current_nid)
 {
 	int headroom = rx_ring->rx_headroom;
 	struct ena_com_buf *ena_buf;
 	struct page *page;
+	bool is_lpc_page;
 	dma_addr_t dma;
 
 	/* restore page offset value in case it has been changed by device */
@@ -996,29 +1147,19 @@ static int ena_alloc_rx_page(struct ena_ring *rx_ring,
 	if (unlikely(rx_info->page))
 		return 0;
 
-	page = alloc_page(gfp);
+	/* We handle DMA here */
+	page = ena_get_page(rx_ring, &dma, current_nid, &is_lpc_page);
 	if (unlikely(!page)) {
 		ena_increase_stat(&rx_ring->rx_stats.page_alloc_fail, 1,
 				  &rx_ring->syncp);
 		return -ENOMEM;
 	}
 
-	/* To enable NIC-side port-mirroring, AKA SPAN port,
-	 * we make the buffer readable from the nic as well
-	 */
-	dma = dma_map_page(rx_ring->dev, page, 0, ENA_PAGE_SIZE,
-			   DMA_BIDIRECTIONAL);
-	if (unlikely(dma_mapping_error(rx_ring->dev, dma))) {
-		ena_increase_stat(&rx_ring->rx_stats.dma_mapping_err, 1,
-				  &rx_ring->syncp);
-
-		__free_page(page);
-		return -EIO;
-	}
 	netif_dbg(rx_ring->adapter, rx_status, rx_ring->netdev,
 		  "Allocate page %p, rx_info %p\n", page, rx_info);
 
 	rx_info->page = page;
+	rx_info->is_lpc_page = is_lpc_page;
 	ena_buf = &rx_info->ena_buf;
 	ena_buf->paddr = dma + headroom;
 	ena_buf->len = ENA_PAGE_SIZE - headroom;
@@ -1031,9 +1172,11 @@ static void ena_unmap_rx_buff(struct ena_ring *rx_ring,
 {
 	struct ena_com_buf *ena_buf = &rx_info->ena_buf;
 
-	dma_unmap_page(rx_ring->dev, ena_buf->paddr - rx_ring->rx_headroom,
-		       ENA_PAGE_SIZE,
-		       DMA_BIDIRECTIONAL);
+	/* LPC pages are unmapped at cache destruction */
+	if (!rx_info->is_lpc_page)
+		dma_unmap_page(rx_ring->dev, ena_buf->paddr - rx_ring->rx_headroom,
+			       ENA_PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
 }
 
 static void ena_free_rx_page(struct ena_ring *rx_ring,
@@ -1056,9 +1199,13 @@ static void ena_free_rx_page(struct ena_ring *rx_ring,
 static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num)
 {
 	u16 next_to_use, req_id;
+	int current_nid;
 	u32 i;
 	int rc;
 
+	/* Prefer pages to be allocated on the same NUMA node as the CPU */
+	current_nid = numa_mem_id();
+
 	next_to_use = rx_ring->next_to_use;
 
 	for (i = 0; i < num; i++) {
@@ -1068,8 +1215,7 @@ static int ena_refill_rx_bufs(struct ena_ring *rx_ring, u32 num)
 
 		rx_info = &rx_ring->rx_buffer_info[req_id];
 
-		rc = ena_alloc_rx_page(rx_ring, rx_info,
-				       GFP_ATOMIC | __GFP_COMP);
+		rc = ena_alloc_rx_page(rx_ring, rx_info, current_nid);
 		if (unlikely(rc < 0)) {
 			netif_warn(rx_ring->adapter, rx_err, rx_ring->netdev,
 				   "Failed to allocate buffer for rx queue %d\n",
@@ -1140,12 +1286,51 @@ static void ena_refill_all_rx_bufs(struct ena_adapter *adapter)
 	}
 }
 
+/* Release all pages from the page cache */
+static void ena_free_ring_cache_pages(struct ena_adapter *adapter, int qid)
+{
+	struct ena_ring *rx_ring = &adapter->rx_ring[qid];
+	struct ena_page_cache *page_cache;
+	int i;
+
+	/* Page cache is disabled */
+	if (!rx_ring->page_cache)
+		return;
+
+	page_cache = rx_ring->page_cache;
+
+	/* We check size value to make sure we don't
+	 * free pages that weren't allocated.
+	 */
+	for (i = 0; i < page_cache->current_size; i++) {
+		struct ena_page *ena_page = &page_cache->cache[i];
+
+		WARN_ON(!ena_page->page);
+
+		dma_unmap_page(rx_ring->dev, ena_page->dma_addr,
+			       ENA_PAGE_SIZE,
+			       DMA_BIDIRECTIONAL);
+
+		/* If the page is also in the rx buffer, then this operation
+		 * would only decrease its reference count
+		 */
+		__free_page(ena_page->page);
+	}
+
+	page_cache->head = page_cache->current_size = 0;
+}
+
 static void ena_free_all_rx_bufs(struct ena_adapter *adapter)
 {
 	int i;
 
-	for (i = 0; i < adapter->num_io_queues; i++)
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		/* The RX SQ's packets should be freed first, since they don't
+		 * unmap pages that belong to the page_cache.
+		 */
 		ena_free_rx_bufs(adapter, i);
+		ena_free_ring_cache_pages(adapter, i);
+	}
 }
 
 static void ena_unmap_tx_buff(struct ena_ring *tx_ring,
@@ -2539,6 +2724,10 @@ static int create_queues_with_size_backoff(struct ena_adapter *adapter)
 		if (rc)
 			goto err_create_rx_queues;
 
+		rc = ena_create_page_caches(adapter);
+		if (rc) /* Cache memory is freed in case of failure */
+			goto err_create_rx_queues;
+
 		return 0;
 
 err_create_rx_queues:
@@ -2591,6 +2780,111 @@ static int create_queues_with_size_backoff(struct ena_adapter *adapter)
 	}
 }
 
+static void ena_free_ring_page_cache(struct ena_ring *rx_ring)
+{
+	if (!rx_ring->page_cache)
+		return;
+
+	vfree(rx_ring->page_cache);
+	rx_ring->page_cache = NULL;
+}
+
+static bool ena_is_lpc_supported(struct ena_adapter *adapter,
+				 struct ena_ring *rx_ring,
+				 bool error_print)
+{
+	void (*print_log)(const struct net_device *dev, const char *format, ...);
+	int channels_nr = adapter->num_io_queues + adapter->xdp_num_queues;
+
+	print_log = (error_print) ? netdev_err : netdev_info;
+
+	/* LPC is disabled below min number of channels */
+	if (channels_nr < ENA_LPC_MIN_NUM_OF_CHANNELS) {
+		print_log(adapter->netdev,
+			  "Local page cache is disabled for less than %d channels\n",
+			  ENA_LPC_MIN_NUM_OF_CHANNELS);
+
+		return false;
+	}
+
+	/* The driver doesn't support page caches under XDP */
+	if (ena_xdp_present_ring(rx_ring)) {
+		print_log(adapter->netdev,
+			  "Local page cache is disabled when using XDP\n");
+		return false;
+	}
+
+	return true;
+}
+
+/* Calculate the size of the Local Page Cache. If LPC should be disabled, return
+ * a size of 0.
+ */
+static u32 ena_calculate_cache_size(struct ena_adapter *adapter,
+				    struct ena_ring *rx_ring)
+{
+	u32 page_cache_size = adapter->lpc_size;
+
+	/* LPC cache size of 0 means disabled cache */
+	if (page_cache_size == 0)
+		return 0;
+
+	if (!ena_is_lpc_supported(adapter, rx_ring, false))
+		return 0;
+
+	page_cache_size = page_cache_size * ENA_LPC_MULTIPLIER_UNIT;
+	page_cache_size = roundup_pow_of_two(page_cache_size);
+
+	return page_cache_size;
+}
+
+static int ena_create_page_caches(struct ena_adapter *adapter)
+{
+	struct ena_page_cache *cache;
+	u32 page_cache_size;
+	int i;
+
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		page_cache_size = ena_calculate_cache_size(adapter, rx_ring);
+
+		if (!page_cache_size)
+			return 0;
+
+		cache = vzalloc(sizeof(struct ena_page_cache) +
+				sizeof(struct ena_page) * page_cache_size);
+		if (!cache)
+			goto err_cache_alloc;
+
+		cache->max_size = page_cache_size;
+		rx_ring->page_cache = cache;
+	}
+
+	return 0;
+err_cache_alloc:
+	netif_err(adapter, ifup, adapter->netdev,
+		  "Failed to initialize local page caches (LPCs)\n");
+	while (--i >= 0) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		ena_free_ring_page_cache(rx_ring);
+	}
+
+	return -ENOMEM;
+}
+
+static void ena_free_page_caches(struct ena_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < adapter->num_io_queues; i++) {
+		struct ena_ring *rx_ring = &adapter->rx_ring[i];
+
+		ena_free_ring_page_cache(rx_ring);
+	}
+}
+
 static int ena_up(struct ena_adapter *adapter)
 {
 	int io_queue_count, rc, i;
@@ -2641,6 +2935,7 @@ static int ena_up(struct ena_adapter *adapter)
 	return rc;
 
 err_up:
+	ena_free_page_caches(adapter);
 	ena_destroy_all_tx_queues(adapter);
 	ena_free_all_io_tx_resources(adapter);
 	ena_destroy_all_rx_queues(adapter);
@@ -2691,6 +2986,7 @@ static void ena_down(struct ena_adapter *adapter)
 
 	ena_free_all_tx_bufs(adapter);
 	ena_free_all_rx_bufs(adapter);
+	ena_free_page_caches(adapter);
 	ena_free_all_io_tx_resources(adapter);
 	ena_free_all_io_rx_resources(adapter);
 }
@@ -4296,6 +4592,7 @@ static int ena_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	adapter->max_rx_sgl_size = calc_queue_ctx.max_rx_sgl_size;
 
 	adapter->num_io_queues = max_num_io_queues;
+	adapter->lpc_size = ENA_LPC_DEFAULT_MULTIPLIER;
 	adapter->max_num_io_queues = max_num_io_queues;
 	adapter->last_monitored_tx_qid = 0;
 
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 74af15d62ee1..242c9ce4a782 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -194,6 +194,7 @@ struct ena_rx_buffer {
 	struct page *page;
 	u32 page_offset;
 	struct ena_com_buf ena_buf;
+	bool is_lpc_page;
 } ____cacheline_aligned;
 
 struct ena_stats_tx {
@@ -234,8 +235,33 @@ struct ena_stats_rx {
 	u64 xdp_tx;
 	u64 xdp_invalid;
 	u64 xdp_redirect;
+	u64 lpc_warm_up;
+	u64 lpc_full;
+	u64 lpc_wrong_numa;
 };
 
+/* LPC definitions */
+#define ENA_LPC_DEFAULT_MULTIPLIER 2
+#define ENA_LPC_MULTIPLIER_UNIT 1024
+#define ENA_LPC_MIN_NUM_OF_CHANNELS 16
+
+/* Store DMA address along with the page */
+struct ena_page {
+	struct page *page;
+	dma_addr_t dma_addr;
+};
+
+struct ena_page_cache {
+	/* How many pages are produced */
+	u32 head;
+	/* How many of the entries were initialized */
+	u32 current_size;
+	/* Maximum number of pages the cache can hold */
+	u32 max_size;
+
+	struct ena_page cache[0];
+} ____cacheline_aligned;
+
 struct ena_ring {
 	/* Holds the empty requests for TX/RX
 	 * out of order completions
@@ -252,6 +278,7 @@ struct ena_ring {
 	struct pci_dev *pdev;
 	struct napi_struct *napi;
 	struct net_device *netdev;
+	struct ena_page_cache *page_cache;
 	struct ena_com_dev *ena_dev;
 	struct ena_adapter *adapter;
 	struct ena_com_io_cq *ena_com_io_cq;
@@ -333,6 +360,9 @@ struct ena_adapter {
 	u32 num_io_queues;
 	u32 max_num_io_queues;
 
+	/* Local page cache size */
+	u32 lpc_size;
+
 	int msix_vecs;
 
 	u32 missing_tx_completion_threshold;
-- 
2.25.1
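
For readers following the header changes above, here is a minimal userspace
sketch of how a ring-style cache with head, current_size and max_size fields
could behave. It is not the driver code from this patch. struct fake_page,
its refs counter, and the cache_* / page_put helpers are hypothetical
stand-ins for struct page and the kernel page reference count. The sketch
also uses a C99 flexible array member where the patch declares cache[0].
The "warm-up" and "full" branches are only a guess at what the lpc_warm_up
and lpc_full counters might track.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page; refs == 1 means only the cache holds it. */
struct fake_page {
	int refs;
};

struct page_cache {
	uint32_t head;             /* next slot to try for reuse */
	uint32_t current_size;     /* slots initialized so far */
	uint32_t max_size;         /* capacity of the cache */
	struct fake_page *cache[]; /* flexible array member */
};

static struct page_cache *cache_create(uint32_t max_size)
{
	struct page_cache *c;

	c = calloc(1, sizeof(*c) + max_size * sizeof(c->cache[0]));
	if (c)
		c->max_size = max_size;
	return c;
}

static void page_put(struct fake_page *p)
{
	if (--p->refs == 0)
		free(p);
}

/* Returns a page holding one reference for the caller, or NULL on failure.
 * *from_cache tells the caller whether the cache also tracks the page.
 */
static struct fake_page *cache_get_page(struct page_cache *c, bool *from_cache)
{
	struct fake_page *p;

	assert(c && from_cache);
	*from_cache = false;

	/* Warm-up: the cache still has empty slots, so allocate and remember. */
	if (c->current_size < c->max_size) {
		p = calloc(1, sizeof(*p));
		if (!p)
			return NULL;
		p->refs = 2; /* one for the cache, one for the caller */
		c->cache[c->current_size++] = p;
		*from_cache = true;
		return p;
	}

	/* Full: reuse the head page only if nobody else still holds it. */
	if (c->max_size && c->cache[c->head]->refs == 1) {
		p = c->cache[c->head];
		p->refs++;
		c->head = (c->head + 1) % c->max_size;
		*from_cache = true;
		return p;
	}

	/* Head page is busy (or the cache is disabled): plain allocation. */
	p = calloc(1, sizeof(*p));
	if (p)
		p->refs = 1;
	return p;
}

static void cache_destroy(struct page_cache *c)
{
	uint32_t i;

	/* Drop the cache's own reference on every tracked page. */
	for (i = 0; i < c->current_size; i++)
		page_put(c->cache[i]);
	free(c);
}
```

In this sketch the from_cache result plays a role similar to the is_lpc_page
flag added to struct ena_rx_buffer: it lets the caller tell cached pages
apart from pages obtained through the regular allocation path.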


end of thread, other threads:[~2021-03-16  8:27 UTC | newest]

Thread overview: 10+ messages
2021-03-09 17:10 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
2021-03-09 17:10 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
2021-03-09 17:57   ` Eric Dumazet
2021-03-10  2:04     ` Andrew Lunn
2021-03-11 23:15       ` Saeed Mahameed
2021-03-16  8:23         ` Shay Agroskin
2021-03-16  8:26     ` Shay Agroskin
2021-03-09 17:10 ` [RFC Patch v1 2/3] net: ena: update README file with a description about LPC Shay Agroskin
2021-03-09 17:10 ` [RFC Patch v1 3/3] net: ena: support ethtool priv-flags and LPC state change Shay Agroskin
  -- strict thread matches above, loose matches on Subject: below --
2021-03-09 17:07 [RFC Patch v1 0/3] Introduce ENA local page cache Shay Agroskin
2021-03-09 17:07 ` [RFC Patch v1 1/3] net: ena: implement local page cache (LPC) system Shay Agroskin
