Netdev Archive on lore.kernel.org
 help / color / Atom feed
From: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: netdev@vger.kernel.org,
	"Ilias Apalodimas" <ilias.apalodimas@linaro.org>,
	"Toke Høiland-Jørgensen" <toke@toke.dk>,
	"Tariq Toukan" <tariqt@mellanox.com>,
	toshiaki.makita1@gmail.com, grygorii.strashko@ti.com,
	mcroce@redhat.com
Subject: Re: [PATCH net-next v2 08/12] xdp: tracking page_pool resources and safe removal
Date: Tue, 25 Jun 2019 13:50:14 +0300
Message-ID: <20190625105013.GA6485@khorivan> (raw)
In-Reply-To: <156086314789.27760.6549333469314693352.stgit@firesoul>

Hi Jesper,

Could you please clarify one question.

On Tue, Jun 18, 2019 at 03:05:47PM +0200, Jesper Dangaard Brouer wrote:
>This patch is needed before we can allow drivers to use page_pool for
>DMA-mappings. Today with page_pool and XDP return API, it is possible to
>remove the page_pool object (from rhashtable), while there are still
>in-flight packet-pages. This is safely handled via RCU and failed lookups in
>__xdp_return() fallback to call put_page(), when page_pool object is gone.
>In-case page is still DMA mapped, this will result in page note getting
>correctly DMA unmapped.
>
>To solve this, the page_pool is extended with tracking in-flight pages. And
>XDP disconnect system queries page_pool and waits, via workqueue, for all
>in-flight pages to be returned.
>
>To avoid killing performance when tracking in-flight pages, the implement
>use two (unsigned) counters, that in placed on different cache-lines, and
>can be used to deduct in-flight packets. This is done by mapping the
>unsigned "sequence" counters onto signed Two's complement arithmetic
>operations. This is e.g. used by kernel's time_after macros, described in
>kernel commit 1ba3aab3033b and 5a581b367b5, and also explained in RFC1982.
>
>The trick is these two incrementing counters only need to be read and
>compared, when checking if it's safe to free the page_pool structure. Which
>will only happen when driver have disconnected RX/alloc side. Thus, on a
>non-fast-path.
>
>It is chosen that page_pool tracking is also enabled for the non-DMA
>use-case, as this can be used for statistics later.
>
>After this patch, using page_pool requires more strict resource "release",
>e.g. via page_pool_release_page() that was introduced in this patchset, and
>previous patches implement/fix this more strict requirement.
>
>Drivers no-longer call page_pool_destroy(). Drivers already call
>xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
>attempt to disconnect the mem id, and if attempt fails schedule the
>disconnect for later via delayed workqueue.
>
>Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
>---
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c |    3 -
> include/net/page_pool.h                           |   41 ++++++++++---
> net/core/page_pool.c                              |   62 +++++++++++++++-----
> net/core/xdp.c                                    |   65 +++++++++++++++++++--
> 4 files changed, 136 insertions(+), 35 deletions(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>index 2f647be292b6..6c9d4d7defbc 100644
>--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>@@ -643,9 +643,6 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
> 	}
>
> 	xdp_rxq_info_unreg(&rq->xdp_rxq);
>-	if (rq->page_pool)
>-		page_pool_destroy(rq->page_pool);
>-
> 	mlx5_wq_destroy(&rq->wq_ctrl);
> }
>
>diff --git a/include/net/page_pool.h b/include/net/page_pool.h
>index 754d980700df..f09b3f1994e6 100644
>--- a/include/net/page_pool.h
>+++ b/include/net/page_pool.h
>@@ -16,14 +16,16 @@
>  * page_pool_alloc_pages() call.  Drivers should likely use
>  * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
>  *
>- * If page_pool handles DMA mapping (use page->private), then API user
>- * is responsible for invoking page_pool_put_page() once.  In-case of
>- * elevated refcnt, the DMA state is released, assuming other users of
>- * the page will eventually call put_page().
>+ * API keeps track of in-flight pages, in-order to let API user know
>+ * when it is safe to dealloactor page_pool object.  Thus, API users
>+ * must make sure to call page_pool_release_page() when a page is
>+ * "leaving" the page_pool.  Or call page_pool_put_page() where
>+ * appropiate.  For maintaining correct accounting.
>  *
>- * If no DMA mapping is done, then it can act as shim-layer that
>- * fall-through to alloc_page.  As no state is kept on the page, the
>- * regular put_page() call is sufficient.
>+ * API user must only call page_pool_put_page() once on a page, as it
>+ * will either recycle the page, or in case of elevated refcnt, it
>+ * will release the DMA mapping and in-flight state accounting.  We
>+ * hope to lift this requirement in the future.
>  */
> #ifndef _NET_PAGE_POOL_H
> #define _NET_PAGE_POOL_H
>@@ -66,9 +68,10 @@ struct page_pool_params {
> };
>
> struct page_pool {
>-	struct rcu_head rcu;
> 	struct page_pool_params p;
>
>+        u32 pages_state_hold_cnt;
>+
> 	/*
> 	 * Data structure for allocation side
> 	 *
>@@ -96,6 +99,8 @@ struct page_pool {
> 	 * TODO: Implement bulk return pages into this structure.
> 	 */
> 	struct ptr_ring ring;
>+
>+	atomic_t pages_state_release_cnt;
> };
>
> struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
>@@ -109,8 +114,6 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
>
> struct page_pool *page_pool_create(const struct page_pool_params *params);
>
>-void page_pool_destroy(struct page_pool *pool);
>-
> void __page_pool_free(struct page_pool *pool);
> static inline void page_pool_free(struct page_pool *pool)
> {
>@@ -143,6 +146,24 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> 	__page_pool_put_page(pool, page, true);
> }
>
>+/* API user MUST have disconnected alloc-side (not allowed to call
>+ * page_pool_alloc_pages()) before calling this.  The free-side can
>+ * still run concurrently, to handle in-flight packet-pages.
>+ *
>+ * A request to shutdown can fail (with false) if there are still
>+ * in-flight packet-pages.
>+ */
>+bool __page_pool_request_shutdown(struct page_pool *pool);
>+static inline bool page_pool_request_shutdown(struct page_pool *pool)
>+{
>+	/* When page_pool isn't compiled-in, net/core/xdp.c doesn't
>+	 * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
>+	 */
>+#ifdef CONFIG_PAGE_POOL
>+	return __page_pool_request_shutdown(pool);
>+#endif
>+}

The free side can ran in softirq, that means fast cache recycle is accessed.
And it increments not atomic pool->alloc.count.

For instance While redirect, for remote interface, while .ndo_xdp_xmit the
xdp_return_frame_rx_napi(xdpf) is called everywhere in error path ....

In the same time, simultaneously, the work queue can try one more time to clear
cash, calling __page_pool_request_shutdown()....

Question, what prevents pool->alloc.count to be corrupted by race,
causing to wrong array num and as result wrong page to be unmapped/put ....or
even page leak. alloc.count usage is not protected,
__page_pool_request_shutdown() is called not from same rx NAPI, even not from
NAPI.

Here, while alloc cache empty procedure in __page_pool_request_shutdown():

while (pool->alloc.count) {
	page = pool->alloc.cache[--pool->alloc.count];
	__page_pool_return_page(pool, page);
}

For me seems all works fine, but I can't find what have I missed?

...

Same question about how xdp frame should be returned for drivers running
tx napi exclusively, it can be still softirq but another CPU? What API
should be used to return xdp frame.

[...]

-- 
Regards,
Ivan Khoronzhuk

  reply index

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-06-18 13:05 [PATCH net-next v2 00/12] xdp: page_pool fixes and in-flight accounting Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 01/12] net: page_pool: add helper function to retrieve dma addresses Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 02/12] net: page_pool: add helper function to unmap " Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 03/12] xdp: fix leak of IDA cyclic id if rhashtable_insert_slow fails Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 04/12] xdp: page_pool related fix to cpumap Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 05/12] veth: use xdp_release_frame for XDP_PASS Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 06/12] page_pool: introduce page_pool_free and use in mlx5 Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 07/12] mlx5: more strict use of page_pool API Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 08/12] xdp: tracking page_pool resources and safe removal Jesper Dangaard Brouer
2019-06-25 10:50   ` Ivan Khoronzhuk [this message]
2019-06-25 11:27     ` Jesper Dangaard Brouer
2019-06-25 11:51       ` Ivan Khoronzhuk
2019-06-25 12:28         ` Ivan Khoronzhuk
2019-06-18 13:05 ` [PATCH net-next v2 09/12] xdp: force mem allocator removal and periodic warning Jesper Dangaard Brouer
2019-06-18 13:05 ` [PATCH net-next v2 10/12] xdp: add tracepoints for XDP mem Jesper Dangaard Brouer
2019-06-18 13:06 ` [PATCH net-next v2 11/12] page_pool: add tracepoints for page_pool with details need by XDP Jesper Dangaard Brouer
2019-06-18 13:06 ` [PATCH net-next v2 12/12] page_pool: make sure struct device is stable Jesper Dangaard Brouer
2019-06-19 15:24 ` [PATCH net-next v2 00/12] xdp: page_pool fixes and in-flight accounting David Miller

Reply instructions:

You may reply publically to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190625105013.GA6485@khorivan \
    --to=ivan.khoronzhuk@linaro.org \
    --cc=brouer@redhat.com \
    --cc=grygorii.strashko@ti.com \
    --cc=ilias.apalodimas@linaro.org \
    --cc=mcroce@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=tariqt@mellanox.com \
    --cc=toke@toke.dk \
    --cc=toshiaki.makita1@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Netdev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/netdev/0 netdev/git/0.git
	git clone --mirror https://lore.kernel.org/netdev/1 netdev/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 netdev netdev/ https://lore.kernel.org/netdev \
		netdev@vger.kernel.org
	public-inbox-index netdev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.netdev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git