[RFC 00/12] net: huge page backed page_pool

* [RFC 00/12] net: huge page backed page_pool
@ 2023-07-07 18:39 Jakub Kicinski
  2023-07-07 18:39 ` [RFC 01/12] net: hack together some page sharing Jakub Kicinski
                   ` (13 more replies)
  0 siblings, 14 replies; 33+ messages in thread
From: Jakub Kicinski @ 2023-07-07 18:39 UTC (permalink / raw)
  To: netdev
  Cc: almasrymina, hawk, ilias.apalodimas, edumazet, dsahern,
	michael.chan, willemb, Jakub Kicinski

Hi!

This is an "early PoC" at best. It seems to work for a basic
traffic test but there's no uAPI and a lot more general polish
is needed.

The problem we're seeing is that performance of some older NICs
degrades quite a bit when IOMMU is used (in non-passthru mode).
There is a long tail of old NICs deployed, especially in PoPs/
/on edge. From a conversation I had with Eric a few months
ago it sounded like others may have similar issues. So I thought
I'd take a swing at getting page pool to feed drivers huge pages.
1G pages require hooking into early init via CMA but it works
just fine.

I haven't tested this with a real workload, because I'm still
waiting to get my hands on the right machine. But the experiment
with bnxt shows a ~90% reduction in IOTLB misses (670k -> 70k).

In terms of the missing parts - uAPI is definitely needed.
The rough plan would be to add memory config via the netdev
genl family. Should fit nicely there. Have the config stored
in struct netdevice. When page pool is created get to the netdev
and automatically select the provider without the driver even
knowing. Two problems with that are - 1) if the driver follows
the recommended flow of allocating new queues before freeing
old ones we will have page pools created before the old ones
are gone, which means we'd need to reserve 2x the number of
1G pages; 2) there's no callback to the driver to say "I did
something behind your back, don't worry about it, but recreate
your queues, please" so the change will not take effect until
some unrelated change like installing XDP. Which may be fine
in practice but is a bit odd.

Then we get into hand-wavy stuff like - if we can link page
pools to netdevs, we should also be able to export the page pool
stats via the netdev family instead doing it the ethtool -S.. ekhm..
"way". And if we start storing configs behind driver's back why
don't we also store other params, like ring size and queue count...
A lot of potential improvements as we iron out a new API...

Live tree: https://github.com/kuba-moo/linux/tree/pp-providers

Jakub Kicinski (12):
  net: hack together some page sharing
  net: create a 1G-huge-page-backed allocator
  net: page_pool: hide page_pool_release_page()
  net: page_pool: merge page_pool_release_page() with
    page_pool_return_page()
  net: page_pool: factor out releasing DMA from releasing the page
  net: page_pool: create hooks for custom page providers
  net: page_pool: add huge page backed memory providers
  eth: bnxt: let the page pool manage the DMA mapping
  eth: bnxt: use the page pool for data pages
  eth: bnxt: make sure we make for recycle skbs before freeing them
  eth: bnxt: wrap coherent allocations into helpers
  eth: bnxt: hack in the use of MEP

 Documentation/networking/page_pool.rst        |  10 +-
 arch/x86/kernel/setup.c                       |   6 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt.c     | 154 +++--
 drivers/net/ethernet/broadcom/bnxt/bnxt.h     |   5 +
 drivers/net/ethernet/engleder/tsnep_main.c    |   2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |   4 +-
 include/net/dcalloc.h                         |  28 +
 include/net/page_pool.h                       |  36 +-
 net/core/Makefile                             |   2 +-
 net/core/dcalloc.c                            | 615 +++++++++++++++++
 net/core/dcalloc.h                            |  96 +++
 net/core/page_pool.c                          | 625 +++++++++++++++++-
 12 files changed, 1478 insertions(+), 105 deletions(-)
 create mode 100644 include/net/dcalloc.h
 create mode 100644 net/core/dcalloc.c
 create mode 100644 net/core/dcalloc.h

-- 
2.41.0

^ permalink raw reply	[flat|nested] 33+ messages in thread