* [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
@ 2022-06-19 15:50 Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

After switching to memcg-based bpf memory accounting, bpf memory is
charged to the loader's memcg by default, which causes unexpected issues
for us. For instance, the container of the loader may be restarted after
pinning progs and maps, but the bpf memcg will be left behind, pinned on
the system. Once the loader's new-generation container is started, the
leftover pages won't be charged to it. That inconsistent behavior makes
memory resource management troublesome for this container.

In the past few days, I have proposed two patchsets[1][2] to try to
resolve this issue, but in both of those proposals the user code has to be
changed to adapt, which is a pain for us. This patchset relieves the pain
by triggering the recharge in libbpf. It also addresses Roman's critical
comments.

The key reason we can avoid changing the user code is that there's a reuse
path in libbpf. Once the bpf container is restarted, it will try to re-run
the required bpf programs; if the bpf programs are the same as the already
pinned ones, it will reuse them.

To make sure we either recharge all of them successfully or don't recharge
any of them, the recharge process is divided into three steps:
  - Pre charge to the new generation
    To make sure that once we uncharge from the old generation, we can
    always charge to the new generation successfully. If we can't pre
    charge to the new generation, we won't allow it to be uncharged from
    the old generation.
  - Uncharge from the old generation
    After pre charging to the new generation, we can uncharge from the old
    generation.
  - Post charge to the new generation
    Finally we can set the pages' memcg_data to the new generation.
In the pre charge step, we may succeed in charging some addresses but fail
to charge a new address; in that case we should uncharge the already
charged addresses, so another recharge-err step is introduced (see the
caller-side sketch below).
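
As an illustration only, a caller driving this protocol over several
addresses would look roughly like the sketch below (it uses the krecharge()
helper and step macros added later in this series; recharge_all() and
addrs[] are hypothetical):

	static bool recharge_all(void **addrs, int n)
	{
		int i;

		/* Step 1: pre charge every address to the new memcg. */
		for (i = 0; i < n; i++) {
			if (!krecharge(addrs[i], MEMCG_KMEM_PRE_CHARGE))
				goto err;
		}
		/* Step 2: uncharge every address from the old memcg. */
		for (i = 0; i < n; i++)
			krecharge(addrs[i], MEMCG_KMEM_UNCHARGE);
		/* Step 3: point the pages/objects at the new memcg. */
		for (i = 0; i < n; i++)
			krecharge(addrs[i], MEMCG_KMEM_POST_CHARGE);
		return true;

	err:
		/* recharge-err: roll back the pre charges that succeeded. */
		while (--i >= 0)
			krecharge(addrs[i], MEMCG_KMEM_CHARGE_ERR);
		return false;
	}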
 
This patchset implements recharging for the bpf hash map, which is the map
type mostly used by our bpf services. The other map types haven't been
implemented yet, nor have bpf progs.

The prev generation and the new generation may have the same parent; that
case can be optimized in the future.

In the discussion with Roman on the previous two proposals, he also
mentioned that leftover page caches have a similar issue.  There are key
differences between leftover page caches and leftover bpf programs:
  - The leftover page caches may not be reused again
    Because once a container exits, it may be deployed on another host
    next time for better resource management. That's why we fix leftover
    page caches by _trying_ to drop all of the container's page caches
    when it is exiting. But the bpf container will always be deployed on
    the same host next time; that's why bpf programs are pinned.
  - The leftover page caches can be reclaimed, but bpf memory can't.
    That means the leftover page caches can be tolerated while the leftover
    bpf memory can't.
Regardless of these differences, we can also extend this method to
recharge leftover page caches if we need it; for example, when we 'reuse'
a leftover inode, we recharge all its page caches to the new generation.
But unfortunately there's no such clear reuse path in the page cache
layer, so we would have to build a reuse path for it first:

      page cache's reuse path(X)           bpf's reuse path 
          |                                    |
   ------------------                   -------------
   | page cache layer|                  | bpf layer |
   ------------------                   -------------
      \                                     /
    page cache's recharge handler(X)     bpf's recharge handler
       \                                   /
       ------------------------------------
       |        Memcg layer               |
       |----------------------------------|

[1] https://lwn.net/Articles/887180/
[2] https://lwn.net/Articles/888549/


Yafang Shao (10):
  mm, memcg: Add a new helper memcg_should_recharge()
  bpftool: Show memcg info of bpf map
  mm, memcg: Add new helper obj_cgroup_from_current()
  mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public
  mm: Add helper to recharge kmalloc'ed address
  mm: Add helper to recharge vmalloc'ed address
  mm: Add helper to recharge percpu address
  bpf: Recharge memory when reuse bpf map
  bpf: Make bpf_map_{save, release}_memcg public
  bpf: Support recharge for hash map

 include/linux/bpf.h            |  23 ++++++
 include/linux/memcontrol.h     |  22 ++++++
 include/linux/percpu.h         |   1 +
 include/linux/slab.h           |  18 +++++
 include/linux/vmalloc.h        |   2 +
 include/uapi/linux/bpf.h       |   4 +-
 kernel/bpf/hashtab.c           |  74 +++++++++++++++++++
 kernel/bpf/syscall.c           |  40 ++++++-----
 mm/memcontrol.c                |  35 +++++++--
 mm/percpu.c                    |  98 ++++++++++++++++++++++++++
 mm/slab.c                      |  85 ++++++++++++++++++++++
 mm/slob.c                      |   7 ++
 mm/slub.c                      | 125 +++++++++++++++++++++++++++++++++
 mm/util.c                      |   9 +++
 mm/vmalloc.c                   |  87 +++++++++++++++++++++++
 tools/bpf/bpftool/map.c        |   2 +
 tools/include/uapi/linux/bpf.h |   4 +-
 tools/lib/bpf/libbpf.c         |   2 +-
 18 files changed, 609 insertions(+), 29 deletions(-)

-- 
2.17.1



* [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge()
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

Currently we only recharge pages from an offline or reparented memcg. In
the future we may explicitly introduce a need_recharge state for memcgs
which should be recharged.
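
As a usage sketch only (the real callers arrive in later patches of this
series, and recharge_map() here is a hypothetical helper), a recharge path
is expected to gate on this helper like:

	if (memcg_need_recharge(map->memcg))
		recharge_map(map);	/* move charges to the current memcg */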

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ecead1042b9..cf074156c6ac 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1755,6 +1755,18 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 	rcu_read_unlock();
 }
 
+static inline bool memcg_need_recharge(struct mem_cgroup *memcg)
+{
+	if (!memcg || memcg == root_mem_cgroup)
+		return false;
+	/*
+	 * Currently we only recharge pages from an offline memcg,
+	 * in the future we may explicitly introduce a need_recharge
+	 * state for the memcg which should be recharged.
+	 */
+	return memcg->kmemcg_id == memcg->id.id ? false : true;
+}
+
 #else
 static inline bool mem_cgroup_kmem_disabled(void)
 {
@@ -1806,6 +1818,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
 {
 }
 
+static inline bool memcg_need_recharge(struct mem_cgroup *memcg)
+{
+	return false;
+}
+
 #endif /* CONFIG_MEMCG_KMEM */
 
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
-- 
2.17.1



* [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

A bpf map can be charged to a memcg, so we'd better show the memcg info to
help with bpf memory management. This patch adds a new field
"memcg_state" to show whether a bpf map is charged to a memcg and
whether that memcg is offline. Currently it has three values:
   0 : belongs to root memcg
  -1 : its original memcg is not online now
   1 : its original memcg is online

For instance,

$ bpftool map show
2: array  name iterator.rodata  flags 0x480
        key 4B  value 98B  max_entries 1  memlock 4096B
        btf_id 240  frozen
        memcg_state 0
3: hash  name calico_failsafe  flags 0x1
        key 4B  value 1B  max_entries 65535  memlock 524288B
        memcg_state 1
6: lru_hash  name access_record  flags 0x0
        key 8B  value 24B  max_entries 102400  memlock 3276800B
        btf_id 256
        memcg_state -1
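
For illustration, the new field can also be read programmatically; a
minimal sketch, assuming uapi headers from this series (map_fd is an
already-opened bpf map fd):

	#include <stdio.h>
	#include <bpf/bpf.h>

	static void show_memcg_state(int map_fd)
	{
		struct bpf_map_info info = {};
		__u32 len = sizeof(info);

		/* BPF_OBJ_GET_INFO_BY_FD fills in the new memcg_state field */
		if (!bpf_obj_get_info_by_fd(map_fd, &info, &len))
			printf("memcg_state %d\n", info.memcg_state);
	}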

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/uapi/linux/bpf.h       |  4 +++-
 kernel/bpf/syscall.c           | 11 +++++++++++
 tools/bpf/bpftool/map.c        |  2 ++
 tools/include/uapi/linux/bpf.h |  4 +++-
 4 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e81362891596..f2f658e224a7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6092,7 +6092,9 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__s8  memcg_state;
+	__s8  :8;	/* alignment pad */
+	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d5af5b99f0d..d4659d58d288 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4168,6 +4168,17 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (map->memcg) {
+		struct mem_cgroup *memcg = map->memcg;
+
+		if (memcg == root_mem_cgroup)
+			info.memcg_state = 0;
+		else
+			info.memcg_state = memcg_need_recharge(memcg) ? -1 : 1;
+	}
+#endif
+
 	if (bpf_map_is_dev_bound(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 38b6bc9c26c3..ba68512e83aa 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -525,6 +525,7 @@ static int show_map_close_json(int fd, struct bpf_map_info *info)
 		jsonw_end_array(json_wtr);
 	}
 
+	jsonw_int_field(json_wtr, "memcg_state", info->memcg_state);
 	emit_obj_refs_json(refs_table, info->id, json_wtr);
 
 	jsonw_end_object(json_wtr);
@@ -615,6 +616,7 @@ static int show_map_close_plain(int fd, struct bpf_map_info *info)
 	if (frozen)
 		printf("%sfrozen", info->btf_id ? "  " : "");
 
+	printf("\n\tmemcg_state %d", info->memcg_state);
 	emit_obj_refs_plain(refs_table, info->id, "\n\tpids ");
 
 	printf("\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e81362891596..f2f658e224a7 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6092,7 +6092,9 @@ struct bpf_map_info {
 	__u32 btf_id;
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
-	__u32 :32;	/* alignment pad */
+	__s8  memcg_state;
+	__s8  :8;	/* alignment pad */
+	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
 
-- 
2.17.1



* [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-23  3:01   ` Roman Gushchin
  2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
is that the latter doesn't take a reference on the objcg.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  1 +
 mm/memcontrol.c            | 24 ++++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index cf074156c6ac..402b42670bcd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
 int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
 void __memcg_kmem_uncharge_page(struct page *page, int order);
 
+struct obj_cgroup *obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_current(void);
 struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index abec50f31fe6..350a7849dac3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
 	return objcg;
 }
 
+__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
+{
+	struct obj_cgroup *objcg = NULL;
+	struct mem_cgroup *memcg;
+
+	if (memcg_kmem_bypass())
+		return NULL;
+
+	rcu_read_lock();
+	if (unlikely(active_memcg()))
+		memcg = active_memcg();
+	else
+		memcg = mem_cgroup_from_task(current);
+
+	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
+		objcg = rcu_dereference(memcg->objcg);
+		if (objcg)
+			break;
+	}
+	rcu_read_unlock();
+
+	return objcg;
+}
+
 static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
 {
 	mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
-- 
2.17.1



* [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (2 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

Make these two helpers public for later use.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/memcontrol.h |  4 ++++
 mm/memcontrol.c            | 11 ++++-------
 2 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 402b42670bcd..ec4637687d6a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1709,6 +1709,10 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
 
 int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size);
 void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size);
+int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
+			    unsigned int nr_pages);
+void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
+			       unsigned int nr_pages);
 
 extern struct static_key_false memcg_kmem_enabled_key;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 350a7849dac3..0ba321afba3b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -260,9 +260,6 @@ bool mem_cgroup_kmem_disabled(void)
 	return cgroup_memory_nokmem;
 }
 
-static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages);
-
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
 	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
@@ -2991,8 +2988,8 @@ static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
  * @objcg: object cgroup to uncharge
  * @nr_pages: number of pages to uncharge
  */
-static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
-				      unsigned int nr_pages)
+void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
+				unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
 
@@ -3012,8 +3009,8 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
  *
  * Returns 0 on success, an error code on failure.
  */
-static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
-				   unsigned int nr_pages)
+int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
+			    unsigned int nr_pages)
 {
 	struct mem_cgroup *memcg;
 	int ret;
-- 
2.17.1



* [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (3 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages of a
given kmalloc'ed address. The recharge is divided into three steps,
  - pre charge to the new memcg
    To make sure that once we uncharge from the old memcg, we can always
    charge to the new memcg successfully. If we can't pre charge to the
    new memcg, we won't allow it to be uncharged from the old memcg.
  - uncharge from the old memcg
    After pre charging to the new memcg, we can uncharge from the old memcg.
  - post charge to the new memcg
    Modify the counters of the new memcg.

Sometimes we may want to recharge many kmalloc'ed addresses to the same
memcg; in that case we should pre charge all these addresses first, then
do the uncharge and finally do the post charge. But it may happen that
after successfully pre charging some addresses we fail to pre charge a new
address; then we have to cancel the already finished pre charges, so a
charge-err step is introduced for this purpose.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h |  17 ++++++
 mm/slab.c            |  85 +++++++++++++++++++++++++++++
 mm/slob.c            |   7 +++
 mm/slub.c            | 125 +++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 234 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 0fefdf528e0d..18ab30aa8fe8 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -194,6 +194,23 @@ bool kmem_valid_obj(void *object);
 void kmem_dump_obj(void *object);
 #endif
 
+/*
+ * The recharge will be separated into three steps:
+ *	MEMCG_KMEM_PRE_CHARGE  : pre charge to the new memcg
+ *	MEMCG_KMEM_UNCHARGE    : uncharge from the old memcg
+ *	MEMCG_KMEM_POST_CHARGE : post charge to the new memcg
+ * and an error handler:
+ *	MEMCG_KMEM_CHARGE_ERR  : in pre charge state, we may succeed to
+ *	                         charge some objp's but fail to charge
+ *	                         a new one, then in this case we should
+ *	                         uncharge the already charged objp's.
+ */
+#define MEMCG_KMEM_PRE_CHARGE	0
+#define MEMCG_KMEM_UNCHARGE	1
+#define MEMCG_KMEM_POST_CHARGE	2
+#define MEMCG_KMEM_CHARGE_ERR	3
+bool krecharge(const void *objp, int step);
+
 /*
  * Some archs want to perform DMA into kmalloc caches and need a guaranteed
  * alignment larger than the alignment of a 64-bit integer.
diff --git a/mm/slab.c b/mm/slab.c
index f8cd00f4ba13..4795014edd30 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3798,6 +3798,91 @@ void kfree(const void *objp)
 }
 EXPORT_SYMBOL(kfree);
 
+bool krecharge(const void *objp, int step)
+{
+	void *object = (void *)objp;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct obj_cgroup **objcgs;
+	struct kmem_cache *s;
+	struct slab *slab;
+	unsigned long flags;
+	unsigned int off;
+
+	WARN_ON(!in_task());
+
+	if (unlikely(ZERO_OR_NULL_PTR(objp)))
+		return true;
+
+	if (!memcg_kmem_enabled())
+		return true;
+
+	local_irq_save(flags);
+	s = virt_to_cache(objp);
+	if (!s)
+		goto out;
+
+	if (!(s->flags & SLAB_ACCOUNT))
+		goto out;
+
+	slab = virt_to_slab(object);
+	if (!slab)
+		goto out;
+
+	objcgs = slab_objcgs(slab);
+	if (!objcgs)
+		goto out;
+
+	off = obj_to_index(s, slab, object);
+	objcg_old = objcgs[off];
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE)
+		goto out;
+
+	/*
+	 *  The recharge can be separated into three steps,
+	 *  1. Pre charge to the new memcg
+	 *  2. Uncharge from the old memcg
+	 *  3. Charge to the new memcg
+	 */
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		/* Pre recharge */
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL, obj_full_size(s))) {
+			obj_cgroup_put(objcg_new);
+			local_irq_restore(flags);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		/* Uncharge from the old memcg */
+		obj_cgroup_uncharge(objcg_old, obj_full_size(s));
+		objcgs[off] = NULL;
+		mod_objcg_state(objcg_old, slab_pgdat(slab), cache_vmstat_idx(s),
+				-obj_full_size(s));
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		/* Charge to the new memcg */
+		objcg_new = obj_cgroup_from_current();
+		objcgs[off] = objcg_new;
+		mod_objcg_state(objcg_new, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s));
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, obj_full_size(s));
+		obj_cgroup_put(objcg_new);
+		break;
+	}
+
+out:
+	local_irq_restore(flags);
+
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 /*
  * This initializes kmem_cache_node or resizes various caches for all nodes.
  */
diff --git a/mm/slob.c b/mm/slob.c
index f47811f09aca..6d68ad57b4a2 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -574,6 +574,13 @@ void kfree(const void *block)
 }
 EXPORT_SYMBOL(kfree);
 
+/* kmemcg is not supported for SLOB */
+bool krecharge(const void *block, int step)
+{
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 /* can't use ksize for kmem_cache_alloc memory, only kmalloc */
 size_t __ksize(const void *block)
 {
diff --git a/mm/slub.c b/mm/slub.c
index e5535020e0fd..ef6475ed6407 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4556,6 +4556,131 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+bool krecharge(const void *x, int step)
+{
+	void *object = (void *)x;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct obj_cgroup **objcgs;
+	struct kmem_cache *s;
+	struct folio *folio;
+	struct slab *slab;
+	unsigned int off;
+
+	WARN_ON(!in_task());
+
+	if (!memcg_kmem_enabled())
+		return true;
+
+	if (unlikely(ZERO_OR_NULL_PTR(x)))
+		return true;
+
+	folio = virt_to_folio(x);
+	if (unlikely(!folio_test_slab(folio))) {
+		unsigned int order = folio_order(folio);
+		struct page *page;
+
+		switch (step) {
+		case MEMCG_KMEM_PRE_CHARGE:
+			objcg_new = get_obj_cgroup_from_current();
+			WARN_ON(!objcg_new);
+			/* Try charge current memcg */
+			if (obj_cgroup_charge_pages(objcg_new, GFP_KERNEL,
+						    1 << order)) {
+				obj_cgroup_put(objcg_new);
+				return false;
+			}
+			break;
+		case MEMCG_KMEM_UNCHARGE:
+			/* Uncharge folio memcg */
+			objcg_old = __folio_objcg(folio);
+			page = folio_page(folio, 0);
+			WARN_ON(!objcg_old);
+			obj_cgroup_uncharge_pages(objcg_old, 1 << order);
+			mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
+						-(PAGE_SIZE << order));
+			page->memcg_data = 0;
+			obj_cgroup_put(objcg_old);
+			break;
+		case MEMCG_KMEM_POST_CHARGE:
+			/* Set current memcg to folio page */
+			objcg_new = obj_cgroup_from_current();
+			page = folio_page(folio, 0);
+			page->memcg_data = (unsigned long)objcg_new | MEMCG_DATA_KMEM;
+			mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
+						-(PAGE_SIZE << order));
+			break;
+		case MEMCG_KMEM_CHARGE_ERR:
+			objcg_new = obj_cgroup_from_current();
+			obj_cgroup_uncharge_pages(objcg_new, 1 << order);
+			obj_cgroup_put(objcg_new);
+			break;
+		}
+		return true;
+	}
+
+	slab = folio_slab(folio);
+	if (!slab)
+		return true;
+
+	s = slab->slab_cache;
+	if (!(s->flags & SLAB_ACCOUNT))
+		return true;
+
+	objcgs = slab_objcgs(slab);
+	if (!objcgs)
+		return true;
+	off = obj_to_index(s, slab, object);
+	objcg_old = objcgs[off];
+	/* In step MEMCG_KMEM_UNCHARGE, the objcg will be set to NULL. */
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE)
+		return true;
+
+	/*
+	 *  The recharge can be separated into three steps,
+	 *  1. Pre charge to the new memcg
+	 *  2. Uncharge from the old memcg
+	 *  3. Charge to the new memcg
+	 */
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		/*
+		 * Before uncharge from the old memcg, we must pre charge the new memcg
+		 * first, to make sure it always succeed to recharge to the new memcg
+		 * after uncharge from the old memcg.
+		 */
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL, obj_full_size(s))) {
+			obj_cgroup_put(objcg_new);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		/* Uncharge from old memcg */
+		obj_cgroup_uncharge(objcg_old, obj_full_size(s));
+		objcgs[off] = NULL;
+		mod_objcg_state(objcg_old, slab_pgdat(slab), cache_vmstat_idx(s),
+				-obj_full_size(s));
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		/* Charge to the new memcg */
+		objcg_new = obj_cgroup_from_current();
+		objcgs[off] = objcg_new;
+		mod_objcg_state(objcg_new, slab_pgdat(slab), cache_vmstat_idx(s), obj_full_size(s));
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, obj_full_size(s));
+		obj_cgroup_put(objcg_new);
+		break;
+	}
+
+	return true;
+}
+EXPORT_SYMBOL(krecharge);
+
 #define SHRINK_PROMOTE_MAX 32
 
 /*
-- 
2.17.1



* [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (4 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages
of a given vmalloc'ed address. It is similar to how a kmalloc'ed
address is recharged.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/slab.h    |  1 +
 include/linux/vmalloc.h |  2 +
 mm/util.c               |  9 +++++
 mm/vmalloc.c            | 87 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 99 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 18ab30aa8fe8..e8fb0f6a3660 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -794,6 +794,7 @@ extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flag
 		      __alloc_size(3);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
+bool kvrecharge(const void *addr, int step);
 
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 096d48aa3437..37c6d0e7b8d5 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -162,6 +162,8 @@ extern void *vcalloc(size_t n, size_t size) __alloc_size(1, 2);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
+bool vrecharge(const void *addr, int step);
+void vuncharge(const void *addr);
 
 extern void *vmap(struct page **pages, unsigned int count,
 			unsigned long flags, pgprot_t prot);
diff --git a/mm/util.c b/mm/util.c
index 0837570c9225..312c05e83132 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -656,6 +656,15 @@ void kvfree(const void *addr)
 }
 EXPORT_SYMBOL(kvfree);
 
+bool kvrecharge(const void *addr, int step)
+{
+	if (is_vmalloc_addr(addr))
+		return vrecharge(addr, step);
+
+	return krecharge(addr, step);
+}
+EXPORT_SYMBOL(kvrecharge);
+
 /**
  * kvfree_sensitive - Free a data object containing sensitive information.
  * @addr: address of the data object to be freed.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index effd1ff6a4b4..7da6e429a45f 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2745,6 +2745,93 @@ void vfree(const void *addr)
 }
 EXPORT_SYMBOL(vfree);
 
+bool vrecharge(const void *addr, int step)
+{
+	struct obj_cgroup *objcg_new;
+	unsigned int page_order;
+	struct vm_struct *area;
+	struct folio *folio;
+	int i;
+
+	WARN_ON(!in_task());
+
+	if (!addr)
+		return true;
+
+	area = find_vm_area(addr);
+	if (unlikely(!area))
+		return true;
+
+	page_order = vm_area_page_order(area);
+
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+
+			WARN_ON(!page);
+			objcg_new = get_obj_cgroup_from_current();
+			WARN_ON(!objcg_new);
+			if (obj_cgroup_charge_pages(objcg_new, GFP_KERNEL,
+						    1 << page_order))
+				goto out_pre;
+			cond_resched();
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+			struct obj_cgroup *objcg_old;
+
+			WARN_ON(!page);
+			folio = page_folio(page);
+			WARN_ON(!folio_memcg_kmem(folio));
+			objcg_old = __folio_objcg(folio);
+
+			obj_cgroup_uncharge_pages(objcg_old, 1 << page_order);
+			/* mod memcg from page */
+			mod_memcg_state(page_memcg(page), MEMCG_VMALLOC,
+					-(1U << page_order));
+			page->memcg_data = 0;
+			obj_cgroup_put(objcg_old);
+			cond_resched();
+		}
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		objcg_new = obj_cgroup_from_current();
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			struct page *page = area->pages[i];
+
+			page->memcg_data = (unsigned long)objcg_new | MEMCG_DATA_KMEM;
+			/* mod memcg from current */
+			mod_memcg_state(page_memcg(page), MEMCG_VMALLOC,
+					1U << page_order);
+
+		}
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		objcg_new = obj_cgroup_from_current();
+		for (i = 0; i < area->nr_pages; i += 1U << page_order) {
+			obj_cgroup_uncharge_pages(objcg_new, 1 << page_order);
+			obj_cgroup_put(objcg_new);
+			cond_resched();
+		}
+		break;
+	}
+
+	return true;
+
+out_pre:
+	for (; i > 0; i -= 1U << page_order) {
+		obj_cgroup_uncharge_pages(objcg_new, 1 << page_order);
+		obj_cgroup_put(objcg_new);
+		cond_resched();
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(vrecharge);
+
 /**
  * vunmap - release virtual mapping obtained by vmap()
  * @addr:   memory base address
-- 
2.17.1



* [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (5 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-23  5:25   ` Dennis Zhou
  2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the corresponding pages of
a given percpu address. It is similar to how a kmalloc'ed address is
recharged.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/percpu.h |  1 +
 mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index f1ec5ad1351c..e88429410179 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
 extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
 extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
 extern void free_percpu(void __percpu *__pdata);
+bool recharge_percpu(void __percpu *__pdata, int step);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
 #define alloc_percpu_gfp(type, gfp)					\
diff --git a/mm/percpu.c b/mm/percpu.c
index 3633eeefaa0d..fd81f4d79f2f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
 }
 EXPORT_SYMBOL_GPL(free_percpu);
 
+#ifdef CONFIG_MEMCG_KMEM
+bool recharge_percpu(void __percpu *ptr, int step)
+{
+	int bit_off, off, bits, size, end;
+	struct obj_cgroup *objcg_old;
+	struct obj_cgroup *objcg_new;
+	struct pcpu_chunk *chunk;
+	unsigned long flags;
+	void *addr;
+
+	WARN_ON(!in_task());
+
+	if (!ptr)
+		return true;
+
+	addr = __pcpu_ptr_to_addr(ptr);
+	spin_lock_irqsave(&pcpu_lock, flags);
+	chunk = pcpu_chunk_addr_search(addr);
+	off = addr - chunk->base_addr;
+	objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
+	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
+		spin_unlock_irqrestore(&pcpu_lock, flags);
+		return true;
+	}
+
+	bit_off = off / PCPU_MIN_ALLOC_SIZE;
+	/* find end index */
+	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
+			bit_off + 1);
+	bits = end - bit_off;
+	size = bits * PCPU_MIN_ALLOC_SIZE;
+
+	switch (step) {
+	case MEMCG_KMEM_PRE_CHARGE:
+		objcg_new = get_obj_cgroup_from_current();
+		WARN_ON(!objcg_new);
+		if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
+				      size * num_possible_cpus())) {
+			obj_cgroup_put(objcg_new);
+			spin_unlock_irqrestore(&pcpu_lock, flags);
+			return false;
+		}
+		break;
+	case MEMCG_KMEM_UNCHARGE:
+		obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+		rcu_read_unlock();
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
+		obj_cgroup_put(objcg_old);
+		break;
+	case MEMCG_KMEM_POST_CHARGE:
+		rcu_read_lock();
+		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
+		mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
+			(size * num_possible_cpus()));
+		rcu_read_unlock();
+		break;
+	case MEMCG_KMEM_CHARGE_ERR:
+		/*
+		 * In case we fail to charge the new memcg in the pre charge step,
+		 * for example, we have pre-charged one address successfully but fail
+		 * to pre-charge the second address, then we should uncharge the first
+		 * one.
+		 */
+		objcg_new = obj_cgroup_from_current();
+		obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
+		obj_cgroup_put(objcg_new);
+		rcu_read_lock();
+		mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
+			-(size * num_possible_cpus()));
+		rcu_read_unlock();
+
+		break;
+	}
+
+	spin_unlock_irqrestore(&pcpu_lock, flags);
+
+	return true;
+}
+EXPORT_SYMBOL(recharge_percpu);
+
+#else /* CONFIG_MEMCG_KMEM */
+
+bool charge_percpu(void __percpu *ptr, bool charge)
+{
+	return true;
+}
+EXPORT_SYMBOL(charge_percpu);
+
+void uncharge_percpu(void __percpu *ptr)
+{
+}
+EXPORT_SYMBOL(uncharge_percpu);
+
+#endif /* CONFIG_MEMCG_KMEM */
+
 bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
 {
 #ifdef CONFIG_SMP
-- 
2.17.1



* [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (6 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

When we reuse a pinned bpf map, if it belongs to a memcg which needs to
be recharged, we uncharge the pages of this bpf map from its original
memcg and then charge them to the current memcg.

We have to explicitly tell the kernel that it is a reuse path, as the
kernel can't detect that on its own. That can be done in libbpf, so the
user code doesn't need to be changed; a rough loader-side sketch follows
below.
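
For illustration only (the object, map and pin path names are made up), a
loader that reuses a pinned map goes through roughly the following path;
bpf_map__reuse_fd() is where libbpf now asks for the recharge:

	struct bpf_object *obj = bpf_object__open_file("prog.bpf.o", NULL);
	struct bpf_map *map = bpf_object__find_map_by_name(obj, "my_map");
	int pinned_fd = bpf_obj_get("/sys/fs/bpf/my_map");

	/* libbpf passes memcg_recharge = 1 to BPF_OBJ_GET_INFO_BY_FD, so the
	 * kernel recharges the pinned map to the current container's memcg. */
	if (pinned_fd >= 0)
		bpf_map__reuse_fd(map, pinned_fd);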

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h            |  2 ++
 include/uapi/linux/bpf.h       |  2 +-
 kernel/bpf/syscall.c           | 10 ++++++++++
 tools/include/uapi/linux/bpf.h |  2 +-
 tools/lib/bpf/libbpf.c         |  2 +-
 5 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0edd7d2c0064..b18a30e70507 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -152,6 +152,8 @@ struct bpf_map_ops {
 				     bpf_callback_t callback_fn,
 				     void *callback_ctx, u64 flags);
 
+	bool (*map_memcg_recharge)(struct bpf_map *map);
+
 	/* BTF id of struct allocated by map_alloc */
 	int *map_btf_id;
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f2f658e224a7..ffbe15c1c8c6 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -6093,7 +6093,7 @@ struct bpf_map_info {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__s8  memcg_state;
-	__s8  :8;	/* alignment pad */
+	__s8  memcg_recharge;
 	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index d4659d58d288..8817c40275f3 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4170,12 +4170,22 @@ static int bpf_map_get_info_by_fd(struct file *file,
 
 #ifdef CONFIG_MEMCG_KMEM
 	if (map->memcg) {
+		size_t offset = offsetof(struct bpf_map_info, memcg_recharge);
 		struct mem_cgroup *memcg = map->memcg;
+		char recharge;
 
 		if (memcg == root_mem_cgroup)
 			info.memcg_state = 0;
 		else
 			info.memcg_state = memcg_need_recharge(memcg) ? -1 : 1;
+
+		if (copy_from_user(&recharge, (char __user *)uinfo + offset, sizeof(char)))
+			return -EFAULT;
+
+		if (recharge && memcg_need_recharge(memcg)) {
+			if (map->ops->map_memcg_recharge)
+				map->ops->map_memcg_recharge(map);
+		}
 	}
 #endif
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f2f658e224a7..ffbe15c1c8c6 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6093,7 +6093,7 @@ struct bpf_map_info {
 	__u32 btf_key_type_id;
 	__u32 btf_value_type_id;
 	__s8  memcg_state;
-	__s8  :8;	/* alignment pad */
+	__s8  memcg_recharge;
 	__u16 :16;	/* alignment pad */
 	__u64 map_extra;
 } __attribute__((aligned(8)));
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 49e359cd34df..f0eb67c983d8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -4488,7 +4488,7 @@ int bpf_map__set_autocreate(struct bpf_map *map, bool autocreate)
 
 int bpf_map__reuse_fd(struct bpf_map *map, int fd)
 {
-	struct bpf_map_info info = {};
+	struct bpf_map_info info = {.memcg_recharge = 1};
 	__u32 len = sizeof(info);
 	int new_fd, err;
 	char *new_name;
-- 
2.17.1



* [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (7 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

These two helpers will be used in map-specific files later.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/bpf.h  | 21 +++++++++++++++++++++
 kernel/bpf/syscall.c | 19 -------------------
 2 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b18a30e70507..a0f21d4382ff 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -27,6 +27,7 @@
 #include <linux/bpfptr.h>
 #include <linux/btf.h>
 #include <linux/rcupdate_trace.h>
+#include <linux/memcontrol.h>
 
 struct bpf_verifier_env;
 struct bpf_verifier_log;
@@ -248,6 +249,26 @@ struct bpf_map {
 	bool frozen; /* write-once; write-protected by freeze_mutex */
 };
 
+#ifdef CONFIG_MEMCG_KMEM
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+	map->memcg = get_mem_cgroup_from_mm(current->mm);
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+	mem_cgroup_put(map->memcg);
+}
+#else
+static inline void bpf_map_save_memcg(struct bpf_map *map)
+{
+}
+
+static inline void bpf_map_release_memcg(struct bpf_map *map)
+{
+}
+#endif
+
 static inline bool map_value_has_spin_lock(const struct bpf_map *map)
 {
 	return map->spin_lock_off >= 0;
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 8817c40275f3..5159b97d1064 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -417,16 +417,6 @@ void bpf_map_free_id(struct bpf_map *map, bool do_idr_lock)
 }
 
 #ifdef CONFIG_MEMCG_KMEM
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-	map->memcg = get_mem_cgroup_from_mm(current->mm);
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-	mem_cgroup_put(map->memcg);
-}
-
 void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size, gfp_t flags,
 			   int node)
 {
@@ -464,15 +454,6 @@ void __percpu *bpf_map_alloc_percpu(const struct bpf_map *map, size_t size,
 
 	return ptr;
 }
-
-#else
-static void bpf_map_save_memcg(struct bpf_map *map)
-{
-}
-
-static void bpf_map_release_memcg(struct bpf_map *map)
-{
-}
 #endif
 
 static int bpf_map_kptr_off_cmp(const void *a, const void *b)
-- 
2.17.1



* [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (8 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
@ 2022-06-19 15:50 ` Yafang Shao
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
  2022-06-23  3:29 ` Roman Gushchin
  11 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-19 15:50 UTC (permalink / raw)
  To: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka
  Cc: linux-mm, bpf, Yafang Shao

This patch introduces a helper to recharge the pages of a hash map. Since
we already know how the hash map is allocated and freed, we also know how
to charge and uncharge it.

First, we pre charge to the new memcg; if the pre charge succeeds, we
uncharge from the old memcg. Finally we do the post charge to the new
memcg, in which we modify the counters of the memcgs.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 kernel/bpf/hashtab.c | 74 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 74 insertions(+)

diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
index 17fb69c0e0dc..fe61976262ee 100644
--- a/kernel/bpf/hashtab.c
+++ b/kernel/bpf/hashtab.c
@@ -11,6 +11,7 @@
 #include <uapi/linux/btf.h>
 #include <linux/rcupdate_trace.h>
 #include <linux/btf_ids.h>
+#include <linux/memcontrol.h>
 #include "percpu_freelist.h"
 #include "bpf_lru_list.h"
 #include "map_in_map.h"
@@ -1499,6 +1500,75 @@ static void htab_map_free(struct bpf_map *map)
 	kfree(htab);
 }
 
+#ifdef CONFIG_MEMCG_KMEM
+static bool htab_map_memcg_recharge(struct bpf_map *map)
+{
+	struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
+	struct mem_cgroup *old = map->memcg;
+	int i;
+
+	/*
+	 * Although the bpf map's offline memcg has been reparented, there
+	 * is still a reference on it, so it is safe to access.
+	 */
+	if (!old)
+		return false;
+
+	/* Pre charge to the new memcg */
+	if (!krecharge(htab, MEMCG_KMEM_PRE_CHARGE))
+		return false;
+
+	if (!kvrecharge(htab->buckets, MEMCG_KMEM_PRE_CHARGE))
+		goto out_k;
+
+	if (!recharge_percpu(htab->extra_elems, MEMCG_KMEM_PRE_CHARGE))
+		goto out_kv;
+
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++) {
+		if (!recharge_percpu(htab->map_locked[i], MEMCG_KMEM_PRE_CHARGE))
+			goto out_p;
+	}
+
+	/* Uncharge from the old memcg. */
+	krecharge(htab, MEMCG_KMEM_UNCHARGE);
+	kvrecharge(htab->buckets, MEMCG_KMEM_UNCHARGE);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_UNCHARGE);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_UNCHARGE);
+
+	/* Release the old memcg */
+	bpf_map_release_memcg(map);
+
+	/* Post charge to the new memcg */
+	krecharge(htab, MEMCG_KMEM_POST_CHARGE);
+	kvrecharge(htab->buckets, MEMCG_KMEM_POST_CHARGE);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_POST_CHARGE);
+	for (i = 0; i < HASHTAB_MAP_LOCK_COUNT; i++)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_POST_CHARGE);
+
+	/* Save the new memcg */
+	bpf_map_save_memcg(map);
+
+	return true;
+
+out_p:
+	for (; i > 0; i--)
+		recharge_percpu(htab->map_locked[i], MEMCG_KMEM_CHARGE_ERR);
+	recharge_percpu(htab->extra_elems, MEMCG_KMEM_CHARGE_ERR);
+out_kv:
+	kvrecharge(htab->buckets, MEMCG_KMEM_CHARGE_ERR);
+out_k:
+	krecharge(htab, MEMCG_KMEM_CHARGE_ERR);
+
+	return false;
+}
+#else
+static bool htab_map_memcg_recharge(struct bpf_map *map)
+{
+	return true;
+}
+#endif
+
 static void htab_map_seq_show_elem(struct bpf_map *map, void *key,
 				   struct seq_file *m)
 {
@@ -2152,6 +2222,7 @@ const struct bpf_map_ops htab_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_map_lookup_elem,
@@ -2172,6 +2243,7 @@ const struct bpf_map_ops htab_lru_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_release_uref = htab_map_free_timers,
 	.map_lookup_elem = htab_lru_map_lookup_elem,
@@ -2325,6 +2397,7 @@ const struct bpf_map_ops htab_percpu_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_percpu_map_lookup_and_delete_elem,
@@ -2344,6 +2417,7 @@ const struct bpf_map_ops htab_lru_percpu_map_ops = {
 	.map_alloc_check = htab_map_alloc_check,
 	.map_alloc = htab_map_alloc,
 	.map_free = htab_map_free,
+	.map_memcg_recharge = htab_map_memcg_recharge,
 	.map_get_next_key = htab_map_get_next_key,
 	.map_lookup_elem = htab_lru_percpu_map_lookup_elem,
 	.map_lookup_and_delete_elem = htab_lru_percpu_map_lookup_and_delete_elem,
-- 
2.17.1



* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (9 preceding siblings ...)
  2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
@ 2022-06-21 23:28 ` Alexei Starovoitov
  2022-06-22 14:03   ` Yafang Shao
  2022-06-23  3:29 ` Roman Gushchin
  11 siblings, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2022-06-21 23:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, that causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left and pinned on the
> system. Once the loader's new generation container is started, the leftover
> pages won't be charged to it. That inconsistent behavior will make trouble
> for the memory resource management for this container.
> 
> In the past few days, I have proposed two patchsets[1][2] to try to resolve
> this issue, but in both of these two proposals the user code has to be
> changed to adapt to it, that is a pain for us. This patchset relieves the
> pain by triggering the recharge in libbpf. It also addresses Roman's
> critical comments.
> 
> The key point we can avoid changing the user code is that there's a resue
> path in libbpf. Once the bpf container is restarted again, it will try
> to re-run the required bpf programs, if the bpf programs are the same with
> the already pinned one, it will reuse them.
> 
> To make sure we either recharge all of them successfully or don't recharge
> any of them. The recharge prograss is divided into three steps:
>   - Pre charge to the new generation 
>     To make sure once we uncharge from the old generation, we can always
>     charge to the new generation succeesfully. If we can't pre charge to
>     the new generation, we won't allow it to be uncharged from the old
>     generation.
>   - Uncharge from the old generation
>     After pre charge to the new generation, we can uncharge from the old
>     generation.
>   - Post charge to the new generation
>     Finnaly we can set pages' memcg_data to the new generation. 
> In the pre charge step, we may succeed to charge some addresses, but fail
> to charge a new address, then we should uncharge the already charged
> addresses, so another recharge-err step is instroduced.
>  
> This pachset has finished recharging bpf hash map. which is mostly used
> by our bpf services. The other maps hasn't been implemented yet. The bpf
> progs hasn't been implemented neither.

... and the implementation in patch 10 is missing recharge of htab->elems
which consumes the most memory.
That begs the question how the whole set was tested.

Even if that bug is fixed this recharge approach works only with preallocated
maps. Their use will be reduced in the future.
Maps with kmalloc won't work with this multi step approach.
There is no place where bpf_map_release_memcg can be done without racing
with concurrent kmallocs from bpf program side.

Despite being painful for user space the user space has to deal with it.
It created a container, charged its memcg, then destroyed the container,
but didn't free the bpf map. It's a user bug. It has to free the map.
The user space can use map-in-map solution. In the new container the new bpf map
can be allocated (and charged to new memcg), the data copied from old map,
and then inner map replaced. At this point old map can be freed and memcg
uncharged.
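
A rough user-space sketch of that migrate-and-swap (assuming a u32 -> u64
hash map; old_fd, outer_fd and max_entries are placeholders):

	__u32 key, next_key, slot = 0, *cur = NULL;
	__u64 val;
	int new_fd = bpf_map_create(BPF_MAP_TYPE_HASH, "new_inner",
				    sizeof(key), sizeof(val), max_entries, NULL);

	/* Copy every element from the old map into the new (new-memcg) map. */
	while (bpf_map_get_next_key(old_fd, cur, &next_key) == 0) {
		if (!bpf_map_lookup_elem(old_fd, &next_key, &val))
			bpf_map_update_elem(new_fd, &next_key, &val, BPF_ANY);
		key = next_key;
		cur = &key;
	}
	/* Swap the inner map in the outer map-in-map, then free the old map. */
	bpf_map_update_elem(outer_fd, &slot, &new_fd, BPF_ANY);
	close(old_fd);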
To make things easier we can consider introducing an anon FD that points to a memcg.
Then user can pick a memcg at map creation time instead of get_mem_cgroup_from_current.


* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
@ 2022-06-22 14:03   ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-22 14:03 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, songmuchun, Andrew Morton, Christoph Lameter,
	penberg, David Rientjes, iamjoonsoo.kim, Vlastimil Babka,
	Linux MM, bpf

On Wed, Jun 22, 2022 at 7:28 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by default, that causes unexpected issues for
> > us. For instance, the container of the loader may be restarted after
> > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > system. Once the loader's new generation container is started, the leftover
> > pages won't be charged to it. That inconsistent behavior will make trouble
> > for the memory resource management for this container.
> >
> > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > this issue, but in both of these two proposals the user code has to be
> > changed to adapt to it, that is a pain for us. This patchset relieves the
> > pain by triggering the recharge in libbpf. It also addresses Roman's
> > critical comments.
> >
> > The key point we can avoid changing the user code is that there's a resue
> > path in libbpf. Once the bpf container is restarted again, it will try
> > to re-run the required bpf programs, if the bpf programs are the same with
> > the already pinned one, it will reuse them.
> >
> > To make sure we either recharge all of them successfully or don't recharge
> > any of them. The recharge prograss is divided into three steps:
> >   - Pre charge to the new generation
> >     To make sure once we uncharge from the old generation, we can always
> >     charge to the new generation succeesfully. If we can't pre charge to
> >     the new generation, we won't allow it to be uncharged from the old
> >     generation.
> >   - Uncharge from the old generation
> >     After pre charge to the new generation, we can uncharge from the old
> >     generation.
> >   - Post charge to the new generation
> >     Finnaly we can set pages' memcg_data to the new generation.
> > In the pre charge step, we may succeed to charge some addresses, but fail
> > to charge a new address, then we should uncharge the already charged
> > addresses, so another recharge-err step is instroduced.
> >
> > This pachset has finished recharging bpf hash map. which is mostly used
> > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > progs hasn't been implemented neither.
>
> ... and the implementation in patch 10 is missing recharge of htab->elems
> which consumes the most memory.

You got to the point. I skipped htab->elems intentionally, for a couple of
reasons. You have pointed out one of them: it is complex for
non-preallocated memory.
Another reason is memcg reparenting. For example, once the memcg goes
offline, its pages and other related memory objects like obj_cgroup
will be reparented, but, unlike them, map->memcg still points to the offline
memcg. There _must_ be issues lurking there, but I haven't figured out what
they may cause for the user. One issue I can imagine is that the memcg limit
may not work well in this case.  That is a separate issue, though, and I'm
still working on it.

> That begs the question how the whole set was tested.

In my test case, htab->buckets is around 120MB, which is large enough for
the comparison.  That amount is recharged without any kernel
warnings, so I posted this incomplete patchset to get feedback and
check whether I'm heading in the right direction.

>
> Even if that bug is fixed this recharge approach works only with preallocated
> maps. Their use will be reduced in the future.
> Maps with kmalloc won't work with this multi step approach.
> There is no place where bpf_map_release_memcg can be done without racing
> with concurrent kmallocs from bpf program side.

Right, that is really an issue.

>
> Despite being painful for user space the user space has to deal with it.
> It created a container, charged its memcg, then destroyed the container,
> but didn't free the bpf map. It's a user bug. It has to free the map.

It seems that I didn't describe this case clearly.
It is not a user bug but a use case in the k8s environment. For
example, after the user has deployed its container, it may need to
upgrade its user application or bpf prog; the user then upgrades
its container, meaning the old container is destroyed and a
new container is created.  During this upgrade process, the bpf
programs can continue to work without interruption because they are
pinned.

> The user space can use map-in-map solution. In the new container the new bpf map
> can be allocated (and charged to new memcg), the data copied from old map,
> and then inner map replaced. At this point old map can be freed and memcg
> uncharged.

The map-in-map solution may work for some cases, but it will allocate
additional memory during the switch, which is not acceptable if the bpf map
already holds a lot of memory.

> To make things easier we can consider introducing an anon FD that points to a memcg.
> Then user can pick a memcg at map creation time instead of get_mem_cgroup_from_current.

This may be a workable solution.  The life cycle of a pinned bpf
program is independent of the application which loads and pins it, so
it is reasonable to introduce an independent memcg to manage the bpf
memory, for example a memcg like /sys/fs/cgroup/memory/pinned_bpf
which is independent of the k8s pod.
I will look into it. Thanks for the suggestion.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
@ 2022-06-23  3:01   ` Roman Gushchin
  2022-06-25 13:54     ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-23  3:01 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, shakeelb, songmuchun, akpm, cl,
	penberg, rientjes, iamjoonsoo.kim, vbabka, linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> is that the later one doesn't add objcg's refcnt.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/memcontrol.h |  1 +
>  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
>  2 files changed, 25 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index cf074156c6ac..402b42670bcd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
>  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
>  void __memcg_kmem_uncharge_page(struct page *page, int order);
>  
> +struct obj_cgroup *obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_current(void);
>  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index abec50f31fe6..350a7849dac3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
>  	return objcg;
>  }
>  
> +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> +{
> +	struct obj_cgroup *objcg = NULL;
> +	struct mem_cgroup *memcg;
> +
> +	if (memcg_kmem_bypass())
> +		return NULL;
> +
> +	rcu_read_lock();
> +	if (unlikely(active_memcg()))
> +		memcg = active_memcg();
> +	else
> +		memcg = mem_cgroup_from_task(current);
> +
> +	for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> +		objcg = rcu_dereference(memcg->objcg);
> +		if (objcg)
> +			break;
> +	}
> +	rcu_read_unlock();

Hm, what prevents the objcg from being released here? Under which conditions
is it safe to call this?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
                   ` (10 preceding siblings ...)
  2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
@ 2022-06-23  3:29 ` Roman Gushchin
  2022-06-25  3:26   ` Yafang Shao
  11 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-23  3:29 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, shakeelb, songmuchun, akpm, cl,
	penberg, rientjes, iamjoonsoo.kim, vbabka, linux-mm, bpf

On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> After switching to memcg-based bpf memory accounting, the bpf memory is
> charged to the loader's memcg by default, that causes unexpected issues for
> us. For instance, the container of the loader may be restarted after
> pinning progs and maps, but the bpf memcg will be left and pinned on the
> system. Once the loader's new generation container is started, the leftover
> pages won't be charged to it. That inconsistent behavior will make trouble
> for the memory resource management for this container.
> 
> In the past few days, I have proposed two patchsets[1][2] to try to resolve
> this issue, but in both of these two proposals the user code has to be
> changed to adapt to it, that is a pain for us. This patchset relieves the
> pain by triggering the recharge in libbpf. It also addresses Roman's
> critical comments.
> 
> The key point we can avoid changing the user code is that there's a resue
> path in libbpf. Once the bpf container is restarted again, it will try
> to re-run the required bpf programs, if the bpf programs are the same with
> the already pinned one, it will reuse them.
> 
> To make sure we either recharge all of them successfully or don't recharge
> any of them. The recharge prograss is divided into three steps:
>   - Pre charge to the new generation 
>     To make sure once we uncharge from the old generation, we can always
>     charge to the new generation succeesfully. If we can't pre charge to
>     the new generation, we won't allow it to be uncharged from the old
>     generation.
>   - Uncharge from the old generation
>     After pre charge to the new generation, we can uncharge from the old
>     generation.
>   - Post charge to the new generation
>     Finnaly we can set pages' memcg_data to the new generation. 
> In the pre charge step, we may succeed to charge some addresses, but fail
> to charge a new address, then we should uncharge the already charged
> addresses, so another recharge-err step is instroduced.
>  
> This pachset has finished recharging bpf hash map. which is mostly used
> by our bpf services. The other maps hasn't been implemented yet. The bpf
> progs hasn't been implemented neither.

Without going into the implementation details, the overall approach looks
ok to me. But it adds complexity and code to several different subsystems,
and I'm 100% sure it's not worth it if we are talking about partial support
of a single map type. Are you committed to implementing the recharging
for all/most map types and progs and supporting this code in the future?

I still feel you're trying to solve a userspace problem in the kernel.
Not saying it can't be solved this way, but it seems like there are
easier options.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
@ 2022-06-23  5:25   ` Dennis Zhou
  2022-06-27  3:04       ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Dennis Zhou @ 2022-06-23  5:25 UTC (permalink / raw)
  To: Yafang Shao
  Cc: ast, daniel, andrii, kafai, songliubraving, yhs, john.fastabend,
	kpsingh, quentin, hannes, mhocko, roman.gushchin, shakeelb,
	songmuchun, akpm, cl, penberg, rientjes, iamjoonsoo.kim, vbabka,
	linux-mm, bpf

Hello,

On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> This patch introduces a helper to recharge the corresponding pages of
> a given percpu address. It is similar with how to recharge a kmalloc'ed
> address.
> 
> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  include/linux/percpu.h |  1 +
>  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 99 insertions(+)
> 
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index f1ec5ad1351c..e88429410179 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
>  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
>  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
>  extern void free_percpu(void __percpu *__pdata);
> +bool recharge_percpu(void __percpu *__pdata, int step);

Nit: can you add extern to keep the file consistent?

>  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
>  
>  #define alloc_percpu_gfp(type, gfp)					\
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 3633eeefaa0d..fd81f4d79f2f 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
>  }
>  EXPORT_SYMBOL_GPL(free_percpu);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +bool recharge_percpu(void __percpu *ptr, int step)
> +{
> +	int bit_off, off, bits, size, end;
> +	struct obj_cgroup *objcg_old;
> +	struct obj_cgroup *objcg_new;
> +	struct pcpu_chunk *chunk;
> +	unsigned long flags;
> +	void *addr;
> +
> +	WARN_ON(!in_task());
> +
> +	if (!ptr)
> +		return true;
> +
> +	addr = __pcpu_ptr_to_addr(ptr);
> +	spin_lock_irqsave(&pcpu_lock, flags);
> +	chunk = pcpu_chunk_addr_search(addr);
> +	off = addr - chunk->base_addr;
> +	objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> +	if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> +		spin_unlock_irqrestore(&pcpu_lock, flags);
> +		return true;
> +	}
> +
> +	bit_off = off / PCPU_MIN_ALLOC_SIZE;
> +	/* find end index */
> +	end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> +			bit_off + 1);
> +	bits = end - bit_off;
> +	size = bits * PCPU_MIN_ALLOC_SIZE;
> +
> +	switch (step) {
> +	case MEMCG_KMEM_PRE_CHARGE:
> +		objcg_new = get_obj_cgroup_from_current();
> +		WARN_ON(!objcg_new);
> +		if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> +				      size * num_possible_cpus())) {
> +			obj_cgroup_put(objcg_new);
> +			spin_unlock_irqrestore(&pcpu_lock, flags);
> +			return false;
> +		}
> +		break;
> +	case MEMCG_KMEM_UNCHARGE:
> +		obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> +		obj_cgroup_put(objcg_old);
> +		break;
> +	case MEMCG_KMEM_POST_CHARGE:
> +		rcu_read_lock();
> +		chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> +		mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> +			(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +		break;
> +	case MEMCG_KMEM_CHARGE_ERR:
> +		/*
> +		 * In case fail to charge to the new one in the pre charge state,
> +		 * for example, we have pre-charged one memcg successfully but fail
> +		 * to pre-charge the second memcg, then we should uncharge the first
> +		 * memcg.
> +		 */
> +		objcg_new = obj_cgroup_from_current();
> +		obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> +		obj_cgroup_put(objcg_new);
> +		rcu_read_lock();
> +		mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> +			-(size * num_possible_cpus()));
> +		rcu_read_unlock();
> +
> +		break;
> +	}

I'm not really the biggest fan of this step stuff. I see why you're
doing it: you want all-or-nothing recharging of the percpu bpf
maps. Is there a way to have percpu own this logic and attempt the
all-or-nothing sequence itself? I realize bpf is likely the largest percpu
user, but the recharge_percpu() API seems generic enough that potential
future users shouldn't be forced to open-code it.

> +
> +	spin_unlock_irqrestore(&pcpu_lock, flags);
> +
> +	return true;
> +}
> +EXPORT_SYMBOL(recharge_percpu);
> +
> +#else /* CONFIG_MEMCG_KMEM */
> +
> +bool charge_percpu(void __percpu *ptr, bool charge)
> +{
> +	return true;
> +}
> +EXPORT_SYMBOL(charge_percpu);
> +
> +void uncharge_percpu(void __percpu *ptr)
> +{
> +}

I'm guessing this is supposed to be recharge_percpu() not
(un)charge_percpu().

> +EXPORT_SYMBOL(uncharge_percpu);
> +
> +#endif /* CONFIG_MEMCG_KMEM */
> +
>  bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr)
>  {
>  #ifdef CONFIG_SMP
> -- 
> 2.17.1
> 
> 

Thanks,
Dennis

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-23  3:29 ` Roman Gushchin
@ 2022-06-25  3:26   ` Yafang Shao
  2022-06-26  3:28     ` Roman Gushchin
  2022-06-27  0:40     ` Alexei Starovoitov
  0 siblings, 2 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-25  3:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > After switching to memcg-based bpf memory accounting, the bpf memory is
> > charged to the loader's memcg by default, that causes unexpected issues for
> > us. For instance, the container of the loader may be restarted after
> > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > system. Once the loader's new generation container is started, the leftover
> > pages won't be charged to it. That inconsistent behavior will make trouble
> > for the memory resource management for this container.
> >
> > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > this issue, but in both of these two proposals the user code has to be
> > changed to adapt to it, that is a pain for us. This patchset relieves the
> > pain by triggering the recharge in libbpf. It also addresses Roman's
> > critical comments.
> >
> > The key point we can avoid changing the user code is that there's a resue
> > path in libbpf. Once the bpf container is restarted again, it will try
> > to re-run the required bpf programs, if the bpf programs are the same with
> > the already pinned one, it will reuse them.
> >
> > To make sure we either recharge all of them successfully or don't recharge
> > any of them. The recharge prograss is divided into three steps:
> >   - Pre charge to the new generation
> >     To make sure once we uncharge from the old generation, we can always
> >     charge to the new generation succeesfully. If we can't pre charge to
> >     the new generation, we won't allow it to be uncharged from the old
> >     generation.
> >   - Uncharge from the old generation
> >     After pre charge to the new generation, we can uncharge from the old
> >     generation.
> >   - Post charge to the new generation
> >     Finnaly we can set pages' memcg_data to the new generation.
> > In the pre charge step, we may succeed to charge some addresses, but fail
> > to charge a new address, then we should uncharge the already charged
> > addresses, so another recharge-err step is instroduced.
> >
> > This pachset has finished recharging bpf hash map. which is mostly used
> > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > progs hasn't been implemented neither.
>
> Without going into the implementation details, the overall approach looks
> ok to me. But it adds complexity and code into several different subsystems,
> and I'm 100% sure it's not worth it if we talking about a partial support
> of a single map type. Are you committed to implement the recharging
> for all/most map types and progs and support this code in the future?
>

I'm planning to support it for all map types and progs. Regarding the
progs, it seems that we have to introduce a new UAPI for the user to
do the recharge, because there's no similar reuse path in libbpf.

Our company is a heavy bpf user. We have many bpf programs running in
our production environment, including networking bpf,
tracing/profiling bpf, and some other bpf programs which are not yet
supported in the upstream kernel; for example, we're even trying the
sched-bpf[1] proposed by you (and you may remember that I reviewed
your patchset).  Most of the networking bpf programs, e.g. gateway-bpf,
edt-bpf, loadbalance-bpf and veth-bpf, are pinned on the system.

It is a trend that bpf is being introduced into more and more subsystems,
and thus there is no doubt that a bpf patchset will involve many
subsystems.

That means I will stay continuously active in these areas for the near
future, several years at least.

[1]. https://lwn.net/Articles/869433/

> I'm still feeling you trying to solve a userspace problem in the kernel.

Your feeling can be converted to a simple question: is a process running
in a container allowed to pin a bpf program?  The answer to
this simple question can help us understand whether it is a user
bug or a kernel bug.

I think you will agree with me that there's definitely no reason to
refuse to let a containerized process pin a bpf program.  And then we
find that the pinned bpf program doesn't cooperate well with
memcg.  One kernel feature can't work together with another kernel
feature, and there's not even an interface provided for the user to
adjust it. The user either doesn't pin the bpf program or disables
kmemcg.   Isn't that a kernel bug?

You may wonder why these two features can't cooperate.  I will
explain it in detail.  It is a long story.

It should begin with why we introduced bpf pinning. We pin a program because
sometimes the lifecycle of the user application is different from that of the
bpf program, or there's no user agent at all.  To keep it
simple, I will take the no-user-agent (agent exits after pinning the bpf
program) case as an example.

Now think about what happens if the agent which pins the bpf
program has a memcg. Whether or not the agent destroys the memcg
when it exits, the memcg will not disappear, because it is pinned by
the bpf program. To make it easy, let's assume the memcg isn't being
destroyed, IOW, it stays online.

Such an online memcg is no longer populated, but it is still being remotely
charged (if the map is non-preallocated); it looks like a ghost. Now we
will look into the details to see what happens to this ghost
memcg.

If this ghost memcg is limited, it will introduce many issues AFAICS.
Firstly, the memcg will be force-charged[2], and I think I don't need
to explain the reason to you.
Even worse, it force-charges silently without any event,
because the allocation bails out at,
        if (!gfpflags_allow_blocking(gfp_mask))
            goto nomem;
And then all memcg events are skipped. So at the very least we should
introduce a force-charge event,
    force:
+      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
        page_counter_charge(&memcg->memory, nr_pages);

And then we should allow alloc_htab_elem() to fail,
                l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
-                                            GFP_ATOMIC | __GFP_NOWARN,
+                                            __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
                                             htab->map.numa_node);
And then we'd better introduce an improvement for memcg,
+      /*
+       * We should wake up async memcg reclaim first,
+       * in case there will be no direct memcg reclaim for a long time.
+       * We can either introduce async memcg reclaim
+       * or modify kswapd to reclaim a specific memcg.
+       */
+       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+            wake_up_async_memcg_reclaim();
         if (!gfpflags_allow_blocking(gfp_mask))
                goto nomem;

And so on.

It's bad luck that there are so many issues in memcg, but it may
also be that I don't have a deep enough understanding of memcg.

I have to clarify that these issues are not caused by
memcg-based bpf accounting, but exposed by it.

[ Time for lunch here, so I have to stop. ]

[2]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2685


> Not saying it can't be solved this way, but it seems like there are
> easier options.
>

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-23  3:01   ` Roman Gushchin
@ 2022-06-25 13:54     ` Yafang Shao
  2022-06-26  1:52       ` Roman Gushchin
  0 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-25 13:54 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Thu, Jun 23, 2022 at 11:01 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> > The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> > is that the later one doesn't add objcg's refcnt.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/memcontrol.h |  1 +
> >  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
> >  2 files changed, 25 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index cf074156c6ac..402b42670bcd 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
> >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> >
> > +struct obj_cgroup *obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index abec50f31fe6..350a7849dac3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
> >       return objcg;
> >  }
> >
> > +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> > +{
> > +     struct obj_cgroup *objcg = NULL;
> > +     struct mem_cgroup *memcg;
> > +
> > +     if (memcg_kmem_bypass())
> > +             return NULL;
> > +
> > +     rcu_read_lock();
> > +     if (unlikely(active_memcg()))
> > +             memcg = active_memcg();
> > +     else
> > +             memcg = mem_cgroup_from_task(current);
> > +
> > +     for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > +             objcg = rcu_dereference(memcg->objcg);
> > +             if (objcg)
> > +                     break;
> > +     }
> > +     rcu_read_unlock();
>
> Hm, what prevents the objcg from being released here? Under which conditions
> it's safe to call it?

obj_cgroup_from_current() is used when we know the objcg's refcnt has
already been incremented.
For example, in my case it is called after we have already called
get_parent_mem_cgroup().
I should add a comment or a WARN_ON() to this function.
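For instance, the comment could read something like the following (the exact
safety rule is what needs to be spelled out; this wording is only a sketch,
not part of the posted patch):

/*
 * obj_cgroup_from_current - return the current task's objcg without
 * taking a reference on it.
 *
 * The caller must already hold a reference that keeps the returned objcg
 * alive for as long as the pointer is used, e.g. a reference on the
 * corresponding memcg obtained via get_mem_cgroup_from_current() or a
 * prior get_obj_cgroup_from_current().  Otherwise the objcg may be
 * released as soon as the internal rcu_read_unlock() is reached.
 */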

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
  2022-06-23  5:25   ` Dennis Zhou
@ 2022-06-27  3:04       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-25 14:18 UTC (permalink / raw)
  To: Dennis Zhou, kbuild test robot, kbuild-all
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Shakeel Butt, songmuchun, Andrew Morton, Christoph Lameter,
	penberg, David Rientjes, iamjoonsoo.kim, Vlastimil Babka,
	Linux MM, bpf

On Thu, Jun 23, 2022 at 1:25 PM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hello,
>
> On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> > This patch introduces a helper to recharge the corresponding pages of
> > a given percpu address. It is similar with how to recharge a kmalloc'ed
> > address.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/percpu.h |  1 +
> >  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index f1ec5ad1351c..e88429410179 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
> >  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
> >  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
> >  extern void free_percpu(void __percpu *__pdata);
> > +bool recharge_percpu(void __percpu *__pdata, int step);
>
> Nit: can you add extern to keep the file consistent.
>

Sure, I will do it.

> >  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >
> >  #define alloc_percpu_gfp(type, gfp)                                  \
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 3633eeefaa0d..fd81f4d79f2f 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
> >  }
> >  EXPORT_SYMBOL_GPL(free_percpu);
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +bool recharge_percpu(void __percpu *ptr, int step)
> > +{
> > +     int bit_off, off, bits, size, end;
> > +     struct obj_cgroup *objcg_old;
> > +     struct obj_cgroup *objcg_new;
> > +     struct pcpu_chunk *chunk;
> > +     unsigned long flags;
> > +     void *addr;
> > +
> > +     WARN_ON(!in_task());
> > +
> > +     if (!ptr)
> > +             return true;
> > +
> > +     addr = __pcpu_ptr_to_addr(ptr);
> > +     spin_lock_irqsave(&pcpu_lock, flags);
> > +     chunk = pcpu_chunk_addr_search(addr);
> > +     off = addr - chunk->base_addr;
> > +     objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> > +     if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> > +             spin_unlock_irqrestore(&pcpu_lock, flags);
> > +             return true;
> > +     }
> > +
> > +     bit_off = off / PCPU_MIN_ALLOC_SIZE;
> > +     /* find end index */
> > +     end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> > +                     bit_off + 1);
> > +     bits = end - bit_off;
> > +     size = bits * PCPU_MIN_ALLOC_SIZE;
> > +
> > +     switch (step) {
> > +     case MEMCG_KMEM_PRE_CHARGE:
> > +             objcg_new = get_obj_cgroup_from_current();
> > +             WARN_ON(!objcg_new);
> > +             if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> > +                                   size * num_possible_cpus())) {
> > +                     obj_cgroup_put(objcg_new);
> > +                     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +                     return false;
> > +             }
> > +             break;
> > +     case MEMCG_KMEM_UNCHARGE:
> > +             obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> > +             obj_cgroup_put(objcg_old);
> > +             break;
> > +     case MEMCG_KMEM_POST_CHARGE:
> > +             rcu_read_lock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> > +             mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> > +                     (size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             break;
> > +     case MEMCG_KMEM_CHARGE_ERR:
> > +             /*
> > +              * In case fail to charge to the new one in the pre charge state,
> > +              * for example, we have pre-charged one memcg successfully but fail
> > +              * to pre-charge the second memcg, then we should uncharge the first
> > +              * memcg.
> > +              */
> > +             objcg_new = obj_cgroup_from_current();
> > +             obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> > +             obj_cgroup_put(objcg_new);
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +
> > +             break;
> > +     }
>
> I'm not really the biggest fan of this step stuff. I see why you're
> doing it because you want to do all or nothing recharging the percpu bpf
> maps. Is there a way to have percpu own this logic and attempt to do all
> or nothing instead? I realize bpf is likely the largest percpu user, but
> the recharge_percpu() api seems to be more generic than forcing
> potential other users in the future to open code it.
>

I agree with you that the recharge API may be used by other users; it
should be a more generic helper.
Maybe we can make percpu own this logic by introducing a new value for
the step parameter, for example,
    recharge_percpu(ptr, -1); /* -1 means the caller doesn't need to
                                 care about the individual steps */
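A minimal sketch of that idea, assuming the per-step body of the current
recharge_percpu() is split out into a helper (the helper name and the -1
convention are illustrative only):

bool recharge_percpu(void __percpu *ptr, int step)
{
        /* Callers that want to drive the steps themselves still can. */
        if (step >= 0)
                return recharge_percpu_step(ptr, step);

        /* step < 0: percpu owns the whole all-or-nothing sequence. */
        if (!recharge_percpu_step(ptr, MEMCG_KMEM_PRE_CHARGE))
                return false;
        recharge_percpu_step(ptr, MEMCG_KMEM_UNCHARGE);
        recharge_percpu_step(ptr, MEMCG_KMEM_POST_CHARGE);
        return true;
}

where recharge_percpu_step() would contain the existing switch statement from
this patch.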

> > +
> > +     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(recharge_percpu);
> > +
> > +#else /* CONFIG_MEMCG_KMEM */
> > +
> > +bool charge_percpu(void __percpu *ptr, bool charge)
> > +{
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(charge_percpu);
> > +
> > +void uncharge_percpu(void __percpu *ptr)
> > +{
> > +}
>
> I'm guessing this is supposed to be recharge_percpu() not
> (un)charge_percpu().

Thanks for pointing out this bug.  The lkp robot also reported this bug to me.
I will change it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current()
  2022-06-25 13:54     ` Yafang Shao
@ 2022-06-26  1:52       ` Roman Gushchin
  0 siblings, 0 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  1:52 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 09:54:17PM +0800, Yafang Shao wrote:
> On Thu, Jun 23, 2022 at 11:01 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 19, 2022 at 03:50:25PM +0000, Yafang Shao wrote:
> > > The difference between get_obj_cgroup_from_current() and obj_cgroup_from_current()
> > > is that the later one doesn't add objcg's refcnt.
> > >
> > > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > > ---
> > >  include/linux/memcontrol.h |  1 +
> > >  mm/memcontrol.c            | 24 ++++++++++++++++++++++++
> > >  2 files changed, 25 insertions(+)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index cf074156c6ac..402b42670bcd 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -1703,6 +1703,7 @@ bool mem_cgroup_kmem_disabled(void);
> > >  int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order);
> > >  void __memcg_kmem_uncharge_page(struct page *page, int order);
> > >
> > > +struct obj_cgroup *obj_cgroup_from_current(void);
> > >  struct obj_cgroup *get_obj_cgroup_from_current(void);
> > >  struct obj_cgroup *get_obj_cgroup_from_page(struct page *page);
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index abec50f31fe6..350a7849dac3 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -2950,6 +2950,30 @@ struct obj_cgroup *get_obj_cgroup_from_page(struct page *page)
> > >       return objcg;
> > >  }
> > >
> > > +__always_inline struct obj_cgroup *obj_cgroup_from_current(void)
> > > +{
> > > +     struct obj_cgroup *objcg = NULL;
> > > +     struct mem_cgroup *memcg;
> > > +
> > > +     if (memcg_kmem_bypass())
> > > +             return NULL;
> > > +
> > > +     rcu_read_lock();
> > > +     if (unlikely(active_memcg()))
> > > +             memcg = active_memcg();
> > > +     else
> > > +             memcg = mem_cgroup_from_task(current);
> > > +
> > > +     for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > > +             objcg = rcu_dereference(memcg->objcg);
> > > +             if (objcg)
> > > +                     break;
> > > +     }
> > > +     rcu_read_unlock();
> >
> > Hm, what prevents the objcg from being released here? Under which conditions
> > it's safe to call it?
> 
> obj_cgroup_from_current() is used when we know the objcg's refcnt has
> already been incremented.
> For example in my case, it is called after we have already call get_
> parent_mem_cgroup().
> I should add a comment or a WARN_ON() in this function.

Yes, it's very confusing, please add a big comment explaining under which
conditions it's safe to call it.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-25  3:26   ` Yafang Shao
@ 2022-06-26  3:28     ` Roman Gushchin
  2022-06-26  3:32       ` Roman Gushchin
  2022-06-26  6:25       ` Yafang Shao
  2022-06-27  0:40     ` Alexei Starovoitov
  1 sibling, 2 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  3:28 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > charged to the loader's memcg by default, that causes unexpected issues for
> > > us. For instance, the container of the loader may be restarted after
> > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > system. Once the loader's new generation container is started, the leftover
> > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > for the memory resource management for this container.
> > >
> > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > this issue, but in both of these two proposals the user code has to be
> > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > critical comments.
> > >
> > > The key point we can avoid changing the user code is that there's a resue
> > > path in libbpf. Once the bpf container is restarted again, it will try
> > > to re-run the required bpf programs, if the bpf programs are the same with
> > > the already pinned one, it will reuse them.
> > >
> > > To make sure we either recharge all of them successfully or don't recharge
> > > any of them. The recharge prograss is divided into three steps:
> > >   - Pre charge to the new generation
> > >     To make sure once we uncharge from the old generation, we can always
> > >     charge to the new generation succeesfully. If we can't pre charge to
> > >     the new generation, we won't allow it to be uncharged from the old
> > >     generation.
> > >   - Uncharge from the old generation
> > >     After pre charge to the new generation, we can uncharge from the old
> > >     generation.
> > >   - Post charge to the new generation
> > >     Finnaly we can set pages' memcg_data to the new generation.
> > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > to charge a new address, then we should uncharge the already charged
> > > addresses, so another recharge-err step is instroduced.
> > >
> > > This pachset has finished recharging bpf hash map. which is mostly used
> > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > progs hasn't been implemented neither.
> >
> > Without going into the implementation details, the overall approach looks
> > ok to me. But it adds complexity and code into several different subsystems,
> > and I'm 100% sure it's not worth it if we talking about a partial support
> > of a single map type. Are you committed to implement the recharging
> > for all/most map types and progs and support this code in the future?
> >
> 
> I'm planning to support it for all map types and progs. Regarding the
> progs, it seems that we have to introduce a new UAPI for the user to
> do the recharge, because there's no similar reuse path in libbpf.
> 
> Our company is a heavy bpf user. We have many bpf programs running on
> our production environment, including networking bpf,
> tracing/profiling bpf, and some other bpf programs which are not
> supported in upstream kernel, for example we're even trying the
> sched-bpf[1] proposed by you (and you may remember that I reviewed
> your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> 
> It is a trend that bpf will be introduced in more and more subsystems,
> and thus it is no doubt that a bpf patchset will involve many
> subsystems.
> 
> That means I will be continuously active in these areas in the near
> future,  several years at least.

Ok, I'm glad to hear that. I highly recommend covering more map types
and use cases in the next iterations of the patchset.

> 
> [1]. https://lwn.net/Articles/869433/
> 
> > I'm still feeling you trying to solve a userspace problem in the kernel.
> 
> Your feeling can be converted to a simple question: is it allowed to
> pin a bpf program by a process running in a container.  The answer to
> this simple question can help us to understand whether it is a user
> bug or a kernel bug.
> 
> I think you will agree with me that there's definitely no reason to
> refuse to pin a bpf program by a containerized process.  And then we
> will find that the pinned-bpf-program doesn't cooperate well with
> memcg.  A kernel feature can't work together with another kernel
> feature, and there's not even an interface provided to the user to
> adjust it. The user either doesn't pin the bpf program or disable
> kmemcg.   Isn't it a kernel bug ?
> 
> You may have a doubt why these two features can't cooperate.  I will
> explain it in detail.  That will be a long story.
> 
> It should begin with why we introduce bpf pinning. We pin it because
> sometimes the lifecycle of a user application is different with the
> bpf program, or there's no user agent at all.  In order to make it
> simple, I will take the no-user-agent (agent exits after pinning bpf
> program) case as an example.
> 
> Now thinking about what will happen if the agent which pins the bpf
> program has a memcg. No matter if the agent destroys the memcg or not
> once it exits, the memcg will not disappear because it is pinned by
> the bpf program. To make it easy, let's assume the memcg isn't being
> destroyed, IOW, it is online.
> 
> An online memcg is not populated, but it is still being remote charged
> (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> will look into the details to find what will happen to this ghost
> memcg.
> 
> If this ghost memcg is limited, it will introduce many issues AFAICS.
> Firstly, the memcg will be force charged[2], and I think I don't need
> to explain the reason to you.
> Even worse is that it force-charges silently without any event,
> because it comes from,
>         if (!gfpflags_allow_blocking(gfp_mask))
>             goto nomem;
> And then all memcg events will be skipped. So at least we will
> introduce a force-charge event,
>     force:
> +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
>         page_counter_charge(&memcg->memory, nr_pages);

This is actually a good point, let me try to fix it.

> 
> And then we should allow alloc_htab_elem() to fail,
>                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> -                                            GFP_ATOMIC | __GFP_NOWARN,
> +                                            __GFP_ATOMIC |
> __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,

It's not a memcg thing, it was done this way previously. Probably Alexei
has a better explanation. Personally, I'm totally fine with removing
__GFP_NOWARN, but maybe I just don't know something.

>                                              htab->map.numa_node);
> And then we'd better introduce an improvement for memcg,
> +      /*
> +       *  Should wakeup async memcg reclaim first,
> +       *   in case there will be no direct memcg reclaim for a long time.
> +       *   We can either introduce async memcg reclaim
> +       *   or modify kswapd to reclaim a specific memcg
> +       */
> +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +            wake_up_async_memcg_reclaim();
>          if (!gfpflags_allow_blocking(gfp_mask))
>                 goto nomem;

Hm, I see. It might be an issue if there is no global memory pressure, right?
Let me think about what I can do here too.

> 
> And .....
> 
> Really bad luck that there are so many issues in memcg, but it may
> also be because I don't have a deep understanding of memcg ......
> 
> I have to clarify that these issues are not caused by
> memcg-based-bpf-accounting, but exposed by it.
> 
> [ Time for lunch here, so I have to stop. ]

Thank you for writing this text, it was interesting to follow your thinking.
And thank you for bringing in these problems above.

Let me be clear: I'm not opposing the idea of recharging, I'm only against
introducing hacks for bpf-specific issues, which can't be nicely generalized
for other use cases and subsystems. That's the only reason why I'm a bit
defensive here.

In general, as of now memory cgroups do not provide an ability to recharge
accounted objects (with some exceptions from the v1 legacy). This applies
both to user and kernel memory. I agree that bpf maps are in some sense
unique, as they are potentially large kernel objects with a lot of control
from userspace. Is this a good reason to extend the memory cgroup API
with a recharging ability? Maybe, but if yes, let's do it well.

The big question is how to do it. Memcg accounting is done in a way
that requires few changes to the kernel code, right? You just
add __GFP_ACCOUNT to the gfp flags and that's it, you get a pointer to
an already accounted object. The same applies to uncharging.
It works transparently.
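For example (plain kernel code, nothing specific to this patchset):

        void *buf;

        /* Charged to the current task's memcg at allocation time and
         * uncharged automatically when it is freed.
         */
        buf = kmalloc(4096, GFP_KERNEL | __GFP_ACCOUNT);
        if (buf)
                kfree(buf);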

Recharging is different: a caller should have some sort of ownership
over the object (to make sure we are not racing against reclaim and/or
another recharging). And the rules are different for each type of object.
It's the caller's duty to make sure all parts of a complex object are properly
recharged and nothing is left behind. There is also the reparenting mechanism,
which can race against the recharging. So it's not an easy problem.
If an object is large, we probably don't want to recharge it all at once,
otherwise temporarily doubling the accounted memory (thanks to the
precharge-uncharge-commit approach) risks introducing spurious OOMs
on memory-limited systems.
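To illustrate the last point, a large object could be recharged in bounded
chunks instead of precharging its full size up front. A rough sketch, with a
made-up chunk size and helper name and simplified error handling:

#define RECHARGE_CHUNK  (1UL << 20)     /* move at most 1MB at a time */

static int recharge_in_chunks(struct obj_cgroup *old, struct obj_cgroup *new,
                              size_t total)
{
        size_t moved = 0;

        while (moved < total) {
                size_t step = min_t(size_t, total - moved, RECHARGE_CHUNK);

                /* Pre-charge only this chunk to the new memcg... */
                if (obj_cgroup_charge(new, GFP_KERNEL, step))
                        goto rollback;
                /* ...and immediately release it from the old one, so at most
                 * RECHARGE_CHUNK bytes are double-accounted at any time.
                 */
                obj_cgroup_uncharge(old, step);
                moved += step;
        }
        return 0;

rollback:
        /* Give the already-moved part back to the old memcg. */
        if (moved) {
                obj_cgroup_charge(old, GFP_KERNEL, moved);
                obj_cgroup_uncharge(new, moved);
        }
        return -ENOMEM;
}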

So yeah, if it doesn't sound too scary to you, I'm happy to help
with this. But it's a lot of work to do it properly; that's why I'm thinking
that maybe it's better to work around it in userspace, as Alexei suggested.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:28     ` Roman Gushchin
@ 2022-06-26  3:32       ` Roman Gushchin
  2022-06-26  6:38         ` Yafang Shao
  2022-06-26  6:25       ` Yafang Shao
  1 sibling, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-06-26  3:32 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > us. For instance, the container of the loader may be restarted after
> > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > system. Once the loader's new generation container is started, the leftover
> > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > for the memory resource management for this container.
> > > >
> > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > this issue, but in both of these two proposals the user code has to be
> > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > critical comments.
> > > >
> > > > The key point we can avoid changing the user code is that there's a resue
> > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > the already pinned one, it will reuse them.
> > > >
> > > > To make sure we either recharge all of them successfully or don't recharge
> > > > any of them. The recharge prograss is divided into three steps:
> > > >   - Pre charge to the new generation
> > > >     To make sure once we uncharge from the old generation, we can always
> > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > >     the new generation, we won't allow it to be uncharged from the old
> > > >     generation.
> > > >   - Uncharge from the old generation
> > > >     After pre charge to the new generation, we can uncharge from the old
> > > >     generation.
> > > >   - Post charge to the new generation
> > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > to charge a new address, then we should uncharge the already charged
> > > > addresses, so another recharge-err step is instroduced.
> > > >
> > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > progs hasn't been implemented neither.
> > >
> > > Without going into the implementation details, the overall approach looks
> > > ok to me. But it adds complexity and code into several different subsystems,
> > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > of a single map type. Are you committed to implement the recharging
> > > for all/most map types and progs and support this code in the future?
> > >
> > 
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> > 
> > Our company is a heavy bpf user. We have many bpf programs running on
> > our production environment, including networking bpf,
> > tracing/profiling bpf, and some other bpf programs which are not
> > supported in upstream kernel, for example we're even trying the
> > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> > 
> > It is a trend that bpf will be introduced in more and more subsystems,
> > and thus it is no doubt that a bpf patchset will involve many
> > subsystems.
> > 
> > That means I will be continuously active in these areas in the near
> > future,  several years at least.
> 
> Ok, I'm glad to hear this. I highly recommend to cover more map types
> and use cases in next iterations of the patchset.
> 
> > 
> > [1]. https://lwn.net/Articles/869433/
> > 
> > > I'm still feeling you trying to solve a userspace problem in the kernel.
> > 
> > Your feeling can be converted to a simple question: is it allowed to
> > pin a bpf program by a process running in a container.  The answer to
> > this simple question can help us to understand whether it is a user
> > bug or a kernel bug.
> > 
> > I think you will agree with me that there's definitely no reason to
> > refuse to pin a bpf program by a containerized process.  And then we
> > will find that the pinned-bpf-program doesn't cooperate well with
> > memcg.  A kernel feature can't work together with another kernel
> > feature, and there's not even an interface provided to the user to
> > adjust it. The user either doesn't pin the bpf program or disable
> > kmemcg.   Isn't it a kernel bug ?
> > 
> > You may have a doubt why these two features can't cooperate.  I will
> > explain it in detail.  That will be a long story.
> > 
> > It should begin with why we introduce bpf pinning. We pin it because
> > sometimes the lifecycle of a user application is different with the
> > bpf program, or there's no user agent at all.  In order to make it
> > simple, I will take the no-user-agent (agent exits after pinning bpf
> > program) case as an example.
> > 
> > Now thinking about what will happen if the agent which pins the bpf
> > program has a memcg. No matter if the agent destroys the memcg or not
> > once it exits, the memcg will not disappear because it is pinned by
> > the bpf program. To make it easy, let's assume the memcg isn't being
> > destroyed, IOW, it is online.
> > 
> > An online memcg is not populated, but it is still being remote charged
> > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > will look into the details to find what will happen to this ghost
> > memcg.
> > 
> > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > Firstly, the memcg will be force charged[2], and I think I don't need
> > to explain the reason to you.
> > Even worse is that it force-charges silently without any event,
> > because it comes from,
> >         if (!gfpflags_allow_blocking(gfp_mask))
> >             goto nomem;
> > And then all memcg events will be skipped. So at least we will
> > introduce a force-charge event,
> >     force:
> > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> >         page_counter_charge(&memcg->memory, nr_pages);
> 
> This is actually a good point, let me try to fix it.
> 
> > 
> > And then we should allow alloc_htab_elem() to fail,
> >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > +                                            __GFP_ATOMIC |
> > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
> 
> It's not a memcg thing, it was done this way previously. Probably Alexei
> has a better explanation. Personally, I'm totally fine with removing
> __GFP_NOWARN, but maybe I just don't know something.
> 
> >                                              htab->map.numa_node);
> > And then we'd better introduce an improvement for memcg,
> > +      /*
> > +       *  Should wakeup async memcg reclaim first,
> > +       *   in case there will be no direct memcg reclaim for a long time.
> > +       *   We can either introduce async memcg reclaim
> > +       *   or modify kswapd to reclaim a specific memcg
> > +       */
> > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > +            wake_up_async_memcg_reclaim();
> >          if (!gfpflags_allow_blocking(gfp_mask))
> >                 goto nomem;
> 
> Hm, I see. It might be an issue if there is no global memory pressure, right?
> Let me think what I can do here too.
> 
> > 
> > And .....
> > 
> > Really bad luck that there are so many issues in memcg, but it may
> > also be because I don't have a deep understanding of memcg ......
> > 
> > I have to clarify that these issues are not caused by
> > memcg-based-bpf-accounting, but exposed by it.
> > 
> > [ Time for lunch here, so I have to stop. ]
> 
> Thank you for writing this text, it was interesting to follow your thinking.
> And thank you for bringing in these problems above.
> 
> Let me be clear: I'm not opposing the idea of recharging, I'm only against
> introducing hacks for bpf-specific issues, which can't be nicely generalized
> for other use cases and subsystems. That's the only reason why I'm a bit
> defensive here.
> 
> In general, as now memory cgroups do not provide an ability to recharge
> accounted objects (with some exceptions from the v1 legacy). It applies
> both to user and kernel memory. I agree, that bpf maps are in some sense
> unique, as they are potentially large kernel objects with a lot of control
> from the userspace. Is this a good reason to extend memory cgroup API
> with the recharging ability? Maybe, but if yes, let's do it well.
> 
> The big question is how to do it? Memcg accounting is done in a way
> that requires little changes from the kernel code, right? You just
> add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> an already accounted object. The same applies for uncharging.
> It works transparently.
> 
> Recharging is different: a caller should have some sort of the ownership
> over the object (to make sure we are not racing against the reclaim and/or
> another recharging). And the rules are different for each type of objects.
> It's a caller duty to make sure all parts of the complex object are properly
> recharged and nothing is left behind. There is also the reparenting mechanism
> which can race against the recharging. So it's not an easy problem.
> If an object is large, we probably don't want to recharge it at once,
> otherwise temporarily doubling of the accounted memory (thanks to the
> precharge-uncharge-commit approach) risks introducing spurious OOMs
> on memory-limited systems.
> 
> So yeah, if it doesn't sound too scary for you, I'm happy to help
> with this. But it's a lot of work to do it properly, that's why I'm thinking
> that maybe it's better to workaround it in userspace, as Alexei suggested.

And as Alexei mentioned, there are some changes coming around the way
memory allocations will usually be performed by the bpf code, which might
make the whole problem and/or your solution obsolete. Please make sure it's
still relevant before sending the next version.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:28     ` Roman Gushchin
  2022-06-26  3:32       ` Roman Gushchin
@ 2022-06-26  6:25       ` Yafang Shao
  2022-07-02  4:23         ` Roman Gushchin
  1 sibling, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-06-26  6:25 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 11:28 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> > >
> > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > us. For instance, the container of the loader may be restarted after
> > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > system. Once the loader's new generation container is started, the leftover
> > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > for the memory resource management for this container.
> > > >
> > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > this issue, but in both of these two proposals the user code has to be
> > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > critical comments.
> > > >
> > > > The key point we can avoid changing the user code is that there's a resue
> > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > the already pinned one, it will reuse them.
> > > >
> > > > To make sure we either recharge all of them successfully or don't recharge
> > > > any of them. The recharge prograss is divided into three steps:
> > > >   - Pre charge to the new generation
> > > >     To make sure once we uncharge from the old generation, we can always
> > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > >     the new generation, we won't allow it to be uncharged from the old
> > > >     generation.
> > > >   - Uncharge from the old generation
> > > >     After pre charge to the new generation, we can uncharge from the old
> > > >     generation.
> > > >   - Post charge to the new generation
> > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > to charge a new address, then we should uncharge the already charged
> > > > addresses, so another recharge-err step is instroduced.
> > > >
> > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > progs hasn't been implemented neither.
> > >
> > > Without going into the implementation details, the overall approach looks
> > > ok to me. But it adds complexity and code into several different subsystems,
> > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > of a single map type. Are you committed to implement the recharging
> > > for all/most map types and progs and support this code in the future?
> > >
> >
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> >
> > Our company is a heavy bpf user. We have many bpf programs running on
> > our production environment, including networking bpf,
> > tracing/profiling bpf, and some other bpf programs which are not
> > supported in upstream kernel, for example we're even trying the
> > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> >
> > It is a trend that bpf will be introduced in more and more subsystems,
> > and thus it is no doubt that a bpf patchset will involve many
> > subsystems.
> >
> > That means I will be continuously active in these areas in the near
> > future,  several years at least.
>
> Ok, I'm glad to hear this. I highly recommend to cover more map types
> and use cases in next iterations of the patchset.
>
> >
> > [1]. https://lwn.net/Articles/869433/
> >
> > > I'm still feeling you trying to solve a userspace problem in the kernel.
> >
> > Your feeling can be converted to a simple question: is it allowed to
> > pin a bpf program by a process running in a container.  The answer to
> > this simple question can help us to understand whether it is a user
> > bug or a kernel bug.
> >
> > I think you will agree with me that there's definitely no reason to
> > refuse to pin a bpf program by a containerized process.  And then we
> > will find that the pinned-bpf-program doesn't cooperate well with
> > memcg.  A kernel feature can't work together with another kernel
> > feature, and there's not even an interface provided to the user to
> > adjust it. The user either doesn't pin the bpf program or disable
> > kmemcg.   Isn't it a kernel bug ?
> >
> > You may have a doubt why these two features can't cooperate.  I will
> > explain it in detail.  That will be a long story.
> >
> > It should begin with why we introduce bpf pinning. We pin it because
> > sometimes the lifecycle of a user application is different with the
> > bpf program, or there's no user agent at all.  In order to make it
> > simple, I will take the no-user-agent (agent exits after pinning bpf
> > program) case as an example.
> >
> > Now thinking about what will happen if the agent which pins the bpf
> > program has a memcg. No matter if the agent destroys the memcg or not
> > once it exits, the memcg will not disappear because it is pinned by
> > the bpf program. To make it easy, let's assume the memcg isn't being
> > destroyed, IOW, it is online.
> >
> > An online memcg is not populated, but it is still being remote charged
> > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > will look into the details to find what will happen to this ghost
> > memcg.
> >
> > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > Firstly, the memcg will be force charged[2], and I think I don't need
> > to explain the reason to you.
> > Even worse is that it force-charges silently without any event,
> > because it comes from,
> >         if (!gfpflags_allow_blocking(gfp_mask))
> >             goto nomem;
> > And then all memcg events will be skipped. So at least we will
> > introduce a force-charge event,
> >     force:
> > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> >         page_counter_charge(&memcg->memory, nr_pages);
>
> This is actually a good point, let me try to fix it.
>

Thanks for following up with it.

> >
> > And then we should allow alloc_htab_elem() to fail,
> >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > +                                            __GFP_ATOMIC |
> > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
>
> It's not a memcg thing, it was done this way previously. Probably Alexei
> has a better explanation. Personally, I'm totally fine with removing
> __GFP_NOWARN, but maybe I just don't know something.
>

Ah, the automatic line wrapping misled you.
What I really want to remove is '__GFP_HIGH' from GFP_ATOMIC, so that
the allocation hits the 'nomem' check[1] when the memcg limit is
reached.
Since alloc_htab_elem() is allowed to fail the allocation, removing
this flag won't be an issue.
alloc_htab_elem() may allocate lots of memory, so it is also
reasonable to make it a low-priority allocation.
Alexei may give us more information on it.

[1]. https://elixir.bootlin.com/linux/v5.19-rc3/source/mm/memcontrol.c#L2683
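
To spell it out (this is just my reading of include/linux/gfp.h in
v5.19-rc3, not part of the patch), GFP_ATOMIC expands to
__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM, so the flags in the
diff above keep everything from GFP_ATOMIC except __GFP_HIGH:

    /* include/linux/gfp.h, v5.19-rc3 */
    #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)

    /* proposed allocation in alloc_htab_elem(), i.e. GFP_ATOMIC minus __GFP_HIGH */
    l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
                                 __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM |
                                 __GFP_NOWARN,
                                 htab->map.numa_node);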

> >                                              htab->map.numa_node);
> > And then we'd better introduce an improvement for memcg,
> > +      /*
> > +       *  Should wakeup async memcg reclaim first,
> > +       *   in case there will be no direct memcg reclaim for a long time.
> > +       *   We can either introduce async memcg reclaim
> > +       *   or modify kswapd to reclaim a specific memcg
> > +       */
> > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > +            wake_up_async_memcg_reclaim();
> >          if (!gfpflags_allow_blocking(gfp_mask))
> >                 goto nomem;
>
> Hm, I see. It might be an issue if there is no global memory pressure, right?
> Let me think what I can do here too.
>

Right. It is not a good idea to expect a global memory reclaimer to do it.
Thanks for following up with it again.

> >
> > And .....
> >
> > Really bad luck that there are so many issues in memcg, but it may
> > also be because I don't have a deep understanding of memcg ......
> >
> > I have to clarify that these issues are not caused by
> > memcg-based-bpf-accounting, but exposed by it.
> >
> > [ Time for lunch here, so I have to stop. ]
>
> Thank you for writing this text, it was interesting to follow your thinking.
> And thank you for bringing in these problems above.
>
> Let me be clear: I'm not opposing the idea of recharging, I'm only against
> introducing hacks for bpf-specific issues, which can't be nicely generalized
> for other use cases and subsystems. That's the only reason why I'm a bit
> defensive here.
>
> In general, as now memory cgroups do not provide an ability to recharge
> accounted objects (with some exceptions from the v1 legacy). It applies
> both to user and kernel memory. I agree, that bpf maps are in some sense
> unique, as they are potentially large kernel objects with a lot of control
> from the userspace. Is this a good reason to extend memory cgroup API
> with the recharging ability? Maybe, but if yes, let's do it well.
>
> The big question is how to do it? Memcg accounting is done in a way
> that requires little changes from the kernel code, right? You just
> add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> an already accounted object. The same applies for uncharging.
> It works transparently.
>
> Recharging is different: a caller should have some sort of the ownership
> over the object (to make sure we are not racing against the reclaim and/or
> another recharging). And the rules are different for each type of objects.
> It's a caller duty to make sure all parts of the complex object are properly
> recharged and nothing is left behind. There is also the reparenting mechanism
> which can race against the recharging. So it's not an easy problem.
> If an object is large, we probably don't want to recharge it at once,
> otherwise temporarily doubling of the accounted memory (thanks to the
> precharge-uncharge-commit approach) risks introducing spurious OOMs
> on memory-limited systems.
>

As I explained in the cover letter, the 'doubling of the accounted
memory' can be avoided.
The doubling only happens in the ancestors shared by the src and dst
memcgs, so we can stop the uncharge and recharge walk at their common
ancestor. For example,

                        memcg A
                       /       \
               memcg AB         memcg AC   <---- stop uncharge/recharge here
                               /        \
                       memcg ACA        memcg ACB
                           |                |
                       src memcg        dst memcg

Of course we must do it carefully.
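
To make the idea more concrete, here is a rough sketch (a hypothetical
helper, not part of this patchset) of how the common ancestor could be
found with the existing memcg helpers; the pre-charge would then stop
right below the returned memcg, and so would the uncharge:

    /*
     * Sketch only: walk up from dst until we reach an ancestor of src.
     * In the example above this returns memcg AC, so the counters of
     * AC and A are never touched and never see a double charge.
     */
    static struct mem_cgroup *recharge_common_ancestor(struct mem_cgroup *src,
                                                       struct mem_cgroup *dst)
    {
            struct mem_cgroup *iter;

            for (iter = dst; iter; iter = parent_mem_cgroup(iter))
                    if (mem_cgroup_is_descendant(src, iter))
                            return iter;
            return NULL;
    }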

> So yeah, if it doesn't sound too scary for you, I'm happy to help
> with this.

I'm glad to hear that you can help with it.

>  But it's a lot of work to do it properly, that's why I'm thinking
> that maybe it's better to workaround it in userspace, as Alexei suggested.
>

I don't mind working around it as Alexei suggested; that's why I
investigated whether it is possible to introduce a ghost memcg, and
then found so many issues with it.
Even if all the issues I mentioned above are fixed, we would still
need to change the kernel to adapt to it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  3:32       ` Roman Gushchin
@ 2022-06-26  6:38         ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-26  6:38 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 11:32 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> On Sat, Jun 25, 2022 at 08:28:37PM -0700, Roman Gushchin wrote:
> > On Sat, Jun 25, 2022 at 11:26:13AM +0800, Yafang Shao wrote:
> > > On Thu, Jun 23, 2022 at 11:29 AM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > > >
> > > > On Sun, Jun 19, 2022 at 03:50:22PM +0000, Yafang Shao wrote:
> > > > > After switching to memcg-based bpf memory accounting, the bpf memory is
> > > > > charged to the loader's memcg by default, that causes unexpected issues for
> > > > > us. For instance, the container of the loader may be restarted after
> > > > > pinning progs and maps, but the bpf memcg will be left and pinned on the
> > > > > system. Once the loader's new generation container is started, the leftover
> > > > > pages won't be charged to it. That inconsistent behavior will make trouble
> > > > > for the memory resource management for this container.
> > > > >
> > > > > In the past few days, I have proposed two patchsets[1][2] to try to resolve
> > > > > this issue, but in both of these two proposals the user code has to be
> > > > > changed to adapt to it, that is a pain for us. This patchset relieves the
> > > > > pain by triggering the recharge in libbpf. It also addresses Roman's
> > > > > critical comments.
> > > > >
> > > > > The key point we can avoid changing the user code is that there's a resue
> > > > > path in libbpf. Once the bpf container is restarted again, it will try
> > > > > to re-run the required bpf programs, if the bpf programs are the same with
> > > > > the already pinned one, it will reuse them.
> > > > >
> > > > > To make sure we either recharge all of them successfully or don't recharge
> > > > > any of them. The recharge prograss is divided into three steps:
> > > > >   - Pre charge to the new generation
> > > > >     To make sure once we uncharge from the old generation, we can always
> > > > >     charge to the new generation succeesfully. If we can't pre charge to
> > > > >     the new generation, we won't allow it to be uncharged from the old
> > > > >     generation.
> > > > >   - Uncharge from the old generation
> > > > >     After pre charge to the new generation, we can uncharge from the old
> > > > >     generation.
> > > > >   - Post charge to the new generation
> > > > >     Finnaly we can set pages' memcg_data to the new generation.
> > > > > In the pre charge step, we may succeed to charge some addresses, but fail
> > > > > to charge a new address, then we should uncharge the already charged
> > > > > addresses, so another recharge-err step is instroduced.
> > > > >
> > > > > This pachset has finished recharging bpf hash map. which is mostly used
> > > > > by our bpf services. The other maps hasn't been implemented yet. The bpf
> > > > > progs hasn't been implemented neither.
> > > >
> > > > Without going into the implementation details, the overall approach looks
> > > > ok to me. But it adds complexity and code into several different subsystems,
> > > > and I'm 100% sure it's not worth it if we talking about a partial support
> > > > of a single map type. Are you committed to implement the recharging
> > > > for all/most map types and progs and support this code in the future?
> > > >
> > >
> > > I'm planning to support it for all map types and progs. Regarding the
> > > progs, it seems that we have to introduce a new UAPI for the user to
> > > do the recharge, because there's no similar reuse path in libbpf.
> > >
> > > Our company is a heavy bpf user. We have many bpf programs running on
> > > our production environment, including networking bpf,
> > > tracing/profiling bpf, and some other bpf programs which are not
> > > supported in upstream kernel, for example we're even trying the
> > > sched-bpf[1] proposed by you (and you may remember that I reviewed
> > > your patchset).  Most of the networking bpf, e.g. gateway-bpf,
> > > edt-bpf, loadbalance-bpf, veth-bpf, are pinned on the system.
> > >
> > > It is a trend that bpf will be introduced in more and more subsystems,
> > > and thus it is no doubt that a bpf patchset will involve many
> > > subsystems.
> > >
> > > That means I will be continuously active in these areas in the near
> > > future,  several years at least.
> >
> > Ok, I'm glad to hear this. I highly recommend to cover more map types
> > and use cases in next iterations of the patchset.
> >
> > >
> > > [1]. https://lwn.net/Articles/869433/
> > >
> > > > I'm still feeling you trying to solve a userspace problem in the kernel.
> > >
> > > Your feeling can be converted to a simple question: is it allowed to
> > > pin a bpf program by a process running in a container.  The answer to
> > > this simple question can help us to understand whether it is a user
> > > bug or a kernel bug.
> > >
> > > I think you will agree with me that there's definitely no reason to
> > > refuse to pin a bpf program by a containerized process.  And then we
> > > will find that the pinned-bpf-program doesn't cooperate well with
> > > memcg.  A kernel feature can't work together with another kernel
> > > feature, and there's not even an interface provided to the user to
> > > adjust it. The user either doesn't pin the bpf program or disable
> > > kmemcg.   Isn't it a kernel bug ?
> > >
> > > You may have a doubt why these two features can't cooperate.  I will
> > > explain it in detail.  That will be a long story.
> > >
> > > It should begin with why we introduce bpf pinning. We pin it because
> > > sometimes the lifecycle of a user application is different with the
> > > bpf program, or there's no user agent at all.  In order to make it
> > > simple, I will take the no-user-agent (agent exits after pinning bpf
> > > program) case as an example.
> > >
> > > Now thinking about what will happen if the agent which pins the bpf
> > > program has a memcg. No matter if the agent destroys the memcg or not
> > > once it exits, the memcg will not disappear because it is pinned by
> > > the bpf program. To make it easy, let's assume the memcg isn't being
> > > destroyed, IOW, it is online.
> > >
> > > An online memcg is not populated, but it is still being remote charged
> > > (if it is a non-preallocate bpf map), that looks like a ghost. Now we
> > > will look into the details to find what will happen to this ghost
> > > memcg.
> > >
> > > If this ghost memcg is limited, it will introduce many issues AFAICS.
> > > Firstly, the memcg will be force charged[2], and I think I don't need
> > > to explain the reason to you.
> > > Even worse is that it force-charges silently without any event,
> > > because it comes from,
> > >         if (!gfpflags_allow_blocking(gfp_mask))
> > >             goto nomem;
> > > And then all memcg events will be skipped. So at least we will
> > > introduce a force-charge event,
> > >     force:
> > > +      memcg_memory_event(mem_over_limit, MEMCG_FORCE_CHARGE);
> > >         page_counter_charge(&memcg->memory, nr_pages);
> >
> > This is actually a good point, let me try to fix it.
> >
> > >
> > > And then we should allow alloc_htab_elem() to fail,
> > >                 l_new = bpf_map_kmalloc_node(&htab->map, htab->elem_size,
> > > -                                            GFP_ATOMIC | __GFP_NOWARN,
> > > +                                            __GFP_ATOMIC |
> > > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN,
> >
> > It's not a memcg thing, it was done this way previously. Probably Alexei
> > has a better explanation. Personally, I'm totally fine with removing
> > __GFP_NOWARN, but maybe I just don't know something.
> >
> > >                                              htab->map.numa_node);
> > > And then we'd better introduce an improvement for memcg,
> > > +      /*
> > > +       *  Should wakeup async memcg reclaim first,
> > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > +       *   We can either introduce async memcg reclaim
> > > +       *   or modify kswapd to reclaim a specific memcg
> > > +       */
> > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > +            wake_up_async_memcg_reclaim();
> > >          if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
> >
> > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > Let me think what I can do here too.
> >
> > >
> > > And .....
> > >
> > > Really bad luck that there are so many issues in memcg, but it may
> > > also be because I don't have a deep understanding of memcg ......
> > >
> > > I have to clarify that these issues are not caused by
> > > memcg-based-bpf-accounting, but exposed by it.
> > >
> > > [ Time for lunch here, so I have to stop. ]
> >
> > Thank you for writing this text, it was interesting to follow your thinking.
> > And thank you for bringing in these problems above.
> >
> > Let me be clear: I'm not opposing the idea of recharging, I'm only against
> > introducing hacks for bpf-specific issues, which can't be nicely generalized
> > for other use cases and subsystems. That's the only reason why I'm a bit
> > defensive here.
> >
> > In general, as now memory cgroups do not provide an ability to recharge
> > accounted objects (with some exceptions from the v1 legacy). It applies
> > both to user and kernel memory. I agree, that bpf maps are in some sense
> > unique, as they are potentially large kernel objects with a lot of control
> > from the userspace. Is this a good reason to extend memory cgroup API
> > with the recharging ability? Maybe, but if yes, let's do it well.
> >
> > The big question is how to do it? Memcg accounting is done in a way
> > that requires little changes from the kernel code, right? You just
> > add __GFP_ACCOUNT to gfp flags and that's it, you get a pointer to
> > an already accounted object. The same applies for uncharging.
> > It works transparently.
> >
> > Recharging is different: a caller should have some sort of the ownership
> > over the object (to make sure we are not racing against the reclaim and/or
> > another recharging). And the rules are different for each type of objects.
> > It's a caller duty to make sure all parts of the complex object are properly
> > recharged and nothing is left behind. There is also the reparenting mechanism
> > which can race against the recharging. So it's not an easy problem.
> > If an object is large, we probably don't want to recharge it at once,
> > otherwise temporarily doubling of the accounted memory (thanks to the
> > precharge-uncharge-commit approach) risks introducing spurious OOMs
> > on memory-limited systems.
> >
> > So yeah, if it doesn't sound too scary for you, I'm happy to help
> > with this. But it's a lot of work to do it properly, that's why I'm thinking
> > that maybe it's better to workaround it in userspace, as Alexei suggested.
>
> And as Alexei mentioned, there are some changes coming around the way
> memory allocations will usually be performed by the bpf code, which might
> make the whole problem and/or your solution obsolete. Please make sure it's
> still relevant before sending the next version.
>

I have taken a look at Alexei's patchset.
For the non-preallocated bpf memory, we still have to use map->memcg
to do the accounting, so this issue will still exist.
It also reminds us that when we reuse a map we must make sure
map->memcg is the expected one.
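
For reference, the remote charging in bpf_map_kmalloc_node() works
roughly like this today (simplified from kernel/bpf/syscall.c in v5.19,
from memory, so the details may differ slightly): the allocation is
charged to map->memcg instead of the current task's memcg, which is why
a stale map->memcg keeps the problem alive.

    void *bpf_map_kmalloc_node(const struct bpf_map *map, size_t size,
                               gfp_t flags, int node)
    {
            struct mem_cgroup *old_memcg;
            void *ptr;

            /* charge the map's memcg, not current's memcg */
            old_memcg = set_active_memcg(map->memcg);
            ptr = kmalloc_node(size, flags | __GFP_ACCOUNT, node);
            set_active_memcg(old_memcg);

            return ptr;
    }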

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-25  3:26   ` Yafang Shao
  2022-06-26  3:28     ` Roman Gushchin
@ 2022-06-27  0:40     ` Alexei Starovoitov
  2022-06-27 15:02       ` Yafang Shao
  1 sibling, 1 reply; 30+ messages in thread
From: Alexei Starovoitov @ 2022-06-27  0:40 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Roman Gushchin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Quentin Monnet, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Linux MM, bpf

On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> I'm planning to support it for all map types and progs. Regarding the
> progs, it seems that we have to introduce a new UAPI for the user to
> do the recharge, because there's no similar reuse path in libbpf.
>
> Our company is a heavy bpf user.

What company is that?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address
@ 2022-06-27  3:04       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-27  3:04 UTC (permalink / raw)
  To: kbuild-all

On Thu, Jun 23, 2022 at 1:25 PM Dennis Zhou <dennisszhou@gmail.com> wrote:
>
> Hello,
>
> On Sun, Jun 19, 2022 at 03:50:29PM +0000, Yafang Shao wrote:
> > This patch introduces a helper to recharge the corresponding pages of
> > a given percpu address. It is similar with how to recharge a kmalloc'ed
> > address.
> >
> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> > ---
> >  include/linux/percpu.h |  1 +
> >  mm/percpu.c            | 98 ++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 99 insertions(+)
> >
> > diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> > index f1ec5ad1351c..e88429410179 100644
> > --- a/include/linux/percpu.h
> > +++ b/include/linux/percpu.h
> > @@ -128,6 +128,7 @@ extern void __init setup_per_cpu_areas(void);
> >  extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
> >  extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
> >  extern void free_percpu(void __percpu *__pdata);
> > +bool recharge_percpu(void __percpu *__pdata, int step);
>
> Nit: can you add extern to keep the file consistent.
>

Sure, I will do it.

> >  extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> >
> >  #define alloc_percpu_gfp(type, gfp)                                  \
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 3633eeefaa0d..fd81f4d79f2f 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -2310,6 +2310,104 @@ void free_percpu(void __percpu *ptr)
> >  }
> >  EXPORT_SYMBOL_GPL(free_percpu);
> >
> > +#ifdef CONFIG_MEMCG_KMEM
> > +bool recharge_percpu(void __percpu *ptr, int step)
> > +{
> > +     int bit_off, off, bits, size, end;
> > +     struct obj_cgroup *objcg_old;
> > +     struct obj_cgroup *objcg_new;
> > +     struct pcpu_chunk *chunk;
> > +     unsigned long flags;
> > +     void *addr;
> > +
> > +     WARN_ON(!in_task());
> > +
> > +     if (!ptr)
> > +             return true;
> > +
> > +     addr = __pcpu_ptr_to_addr(ptr);
> > +     spin_lock_irqsave(&pcpu_lock, flags);
> > +     chunk = pcpu_chunk_addr_search(addr);
> > +     off = addr - chunk->base_addr;
> > +     objcg_old = chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT];
> > +     if (!objcg_old && step != MEMCG_KMEM_POST_CHARGE) {
> > +             spin_unlock_irqrestore(&pcpu_lock, flags);
> > +             return true;
> > +     }
> > +
> > +     bit_off = off / PCPU_MIN_ALLOC_SIZE;
> > +     /* find end index */
> > +     end = find_next_bit(chunk->bound_map, pcpu_chunk_map_bits(chunk),
> > +                     bit_off + 1);
> > +     bits = end - bit_off;
> > +     size = bits * PCPU_MIN_ALLOC_SIZE;
> > +
> > +     switch (step) {
> > +     case MEMCG_KMEM_PRE_CHARGE:
> > +             objcg_new = get_obj_cgroup_from_current();
> > +             WARN_ON(!objcg_new);
> > +             if (obj_cgroup_charge(objcg_new, GFP_KERNEL,
> > +                                   size * num_possible_cpus())) {
> > +                     obj_cgroup_put(objcg_new);
> > +                     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +                     return false;
> > +             }
> > +             break;
> > +     case MEMCG_KMEM_UNCHARGE:
> > +             obj_cgroup_uncharge(objcg_old, size * num_possible_cpus());
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_old), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = NULL;
> > +             obj_cgroup_put(objcg_old);
> > +             break;
> > +     case MEMCG_KMEM_POST_CHARGE:
> > +             rcu_read_lock();
> > +             chunk->obj_cgroups[off >> PCPU_MIN_ALLOC_SHIFT] = obj_cgroup_from_current();
> > +             mod_memcg_state(mem_cgroup_from_task(current), MEMCG_PERCPU_B,
> > +                     (size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +             break;
> > +     case MEMCG_KMEM_CHARGE_ERR:
> > +             /*
> > +              * In case fail to charge to the new one in the pre charge state,
> > +              * for example, we have pre-charged one memcg successfully but fail
> > +              * to pre-charge the second memcg, then we should uncharge the first
> > +              * memcg.
> > +              */
> > +             objcg_new = obj_cgroup_from_current();
> > +             obj_cgroup_uncharge(objcg_new, size * num_possible_cpus());
> > +             obj_cgroup_put(objcg_new);
> > +             rcu_read_lock();
> > +             mod_memcg_state(obj_cgroup_memcg(objcg_new), MEMCG_PERCPU_B,
> > +                     -(size * num_possible_cpus()));
> > +             rcu_read_unlock();
> > +
> > +             break;
> > +     }
>
> I'm not really the biggest fan of this step stuff. I see why you're
> doing it because you want to do all or nothing recharging the percpu bpf
> maps. Is there a way to have percpu own this logic and attempt to do all
> or nothing instead? I realize bpf is likely the largest percpu user, but
> the recharge_percpu() api seems to be more generic than forcing
> potential other users in the future to open code it.
>

Agreed that the recharge API may be used by other users, so it should
be a more generic helper.
Maybe we can make percpu own this logic by introducing a new value for
the step parameter, for example,
    recharge_percpu(ptr, -1); // -1: the caller doesn't care about the individual steps
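
Internally the '-1' mode could simply drive the existing steps itself,
something like this sketch (reusing the step values from this patch):

    /* Sketch only: all-in-one recharge driven by the percpu code. */
    bool recharge_percpu_all(void __percpu *ptr)
    {
            if (!recharge_percpu(ptr, MEMCG_KMEM_PRE_CHARGE))
                    return false;
            recharge_percpu(ptr, MEMCG_KMEM_UNCHARGE);
            recharge_percpu(ptr, MEMCG_KMEM_POST_CHARGE);
            return true;
    }

Callers that need all-or-nothing semantics across multiple objects
(like the bpf map case) would keep using the per-step interface.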

> > +
> > +     spin_unlock_irqrestore(&pcpu_lock, flags);
> > +
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(recharge_percpu);
> > +
> > +#else /* CONFIG_MEMCG_KMEM */
> > +
> > +bool charge_percpu(void __percpu *ptr, bool charge)
> > +{
> > +     return true;
> > +}
> > +EXPORT_SYMBOL(charge_percpu);
> > +
> > +void uncharge_percpu(void __percpu *ptr)
> > +{
> > +}
>
> I'm guessing this is supposed to be recharge_percpu() not
> (un)charge_percpu().

Thanks for pointing out this bug.  The lkp robot also reported this bug to me.
I will change it.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-27  0:40     ` Alexei Starovoitov
@ 2022-06-27 15:02       ` Yafang Shao
  0 siblings, 0 replies; 30+ messages in thread
From: Yafang Shao @ 2022-06-27 15:02 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Roman Gushchin, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Quentin Monnet, Johannes Weiner,
	Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton,
	Christoph Lameter, penberg, David Rientjes, iamjoonsoo.kim,
	Vlastimil Babka, Linux MM, bpf

On Mon, Jun 27, 2022 at 8:40 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jun 24, 2022 at 8:26 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > I'm planning to support it for all map types and progs. Regarding the
> > progs, it seems that we have to introduce a new UAPI for the user to
> > do the recharge, because there's no similar reuse path in libbpf.
> >
> > Our company is a heavy bpf user.
>
> What company is that?

Pinduoduo, aka PDD, a small company in China.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-06-26  6:25       ` Yafang Shao
@ 2022-07-02  4:23         ` Roman Gushchin
  2022-07-02 15:24           ` Yafang Shao
  0 siblings, 1 reply; 30+ messages in thread
From: Roman Gushchin @ 2022-07-02  4:23 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	songmuchun, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > >                                              htab->map.numa_node);
> > > And then we'd better introduce an improvement for memcg,
> > > +      /*
> > > +       *  Should wakeup async memcg reclaim first,
> > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > +       *   We can either introduce async memcg reclaim
> > > +       *   or modify kswapd to reclaim a specific memcg
> > > +       */
> > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > +            wake_up_async_memcg_reclaim();
> > >          if (!gfpflags_allow_blocking(gfp_mask))
> > >                 goto nomem;
> >
> > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > Let me think what I can do here too.
> >
> 
> Right. It is not a good idea to expect a global memory reclaimer to do it.
> Thanks for following up with it again.

After thinking a bit more, I'm not sure it's actually a good idea:
there might not be much memory to reclaim except the memory consumed by the bpf
map itself, so waking kswapd might be useless (and just consume cpu and drain
batteries).

What we need to do instead is to prevent bpf maps from meaningfully exceeding
memory.max, which is btw guaranteed by the cgroup API: memory.max is documented
as a hard limit. Your recent patch is actually doing this for hash maps,
let's fix the rest of the bpf code.

Thanks!

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-07-02  4:23         ` Roman Gushchin
@ 2022-07-02 15:24           ` Yafang Shao
  2022-07-02 15:33             ` Roman Gushchin
  0 siblings, 1 reply; 30+ messages in thread
From: Yafang Shao @ 2022-07-02 15:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf

On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > > >                                              htab->map.numa_node);
> > > > And then we'd better introduce an improvement for memcg,
> > > > +      /*
> > > > +       *  Should wakeup async memcg reclaim first,
> > > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > > +       *   We can either introduce async memcg reclaim
> > > > +       *   or modify kswapd to reclaim a specific memcg
> > > > +       */
> > > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > > +            wake_up_async_memcg_reclaim();
> > > >          if (!gfpflags_allow_blocking(gfp_mask))
> > > >                 goto nomem;
> > >
> > > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > > Let me think what I can do here too.
> > >
> >
> > Right. It is not a good idea to expect a global memory reclaimer to do it.
> > Thanks for following up with it again.
>
> After thinking a bit more, I'm not sure if it's actually a good idea:
> there might be not much memory to reclaim except the memory consumed by the bpf
> map itself, so waking kswapd might be useless (and just consume cpu and drain
> batteries).
>

I'm not sure whether that is a generic problem.
For example, a latency-sensitive process running in a container may
not set __GFP_DIRECT_RECLAIM, but there are page cache pages in
this container. If there's no global memory pressure and no other kinds
of memory allocation in this container, these page cache pages will
not be reclaimed for a long time.
Maybe we should also check the number of page cache pages in the
container before waking up kswapd, but I'm not quite sure about that.

-- 
Regards
Yafang

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map
  2022-07-02 15:24           ` Yafang Shao
@ 2022-07-02 15:33             ` Roman Gushchin
  0 siblings, 0 replies; 30+ messages in thread
From: Roman Gushchin @ 2022-07-02 15:33 UTC (permalink / raw)
  To: Yafang Shao
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau,
	Song Liu, Yonghong Song, john fastabend, KP Singh,
	Quentin Monnet, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Andrew Morton, Christoph Lameter, penberg,
	David Rientjes, iamjoonsoo.kim, Vlastimil Babka, Linux MM, bpf


On Sat, Jul 02, 2022 at 11:24:10PM +0800, Yafang Shao wrote:
> On Sat, Jul 2, 2022 at 12:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > On Sun, Jun 26, 2022 at 02:25:51PM +0800, Yafang Shao wrote:
> > > > >                                              htab->map.numa_node);
> > > > > And then we'd better introduce an improvement for memcg,
> > > > > +      /*
> > > > > +       *  Should wakeup async memcg reclaim first,
> > > > > +       *   in case there will be no direct memcg reclaim for a long time.
> > > > > +       *   We can either introduce async memcg reclaim
> > > > > +       *   or modify kswapd to reclaim a specific memcg
> > > > > +       */
> > > > > +       if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> > > > > +            wake_up_async_memcg_reclaim();
> > > > >          if (!gfpflags_allow_blocking(gfp_mask))
> > > > >                 goto nomem;
> > > >
> > > > Hm, I see. It might be an issue if there is no global memory pressure, right?
> > > > Let me think what I can do here too.
> > > >
> > >
> > > Right. It is not a good idea to expect a global memory reclaimer to do it.
> > > Thanks for following up with it again.
> >
> > After thinking a bit more, I'm not sure if it's actually a good idea:
> > there might be not much memory to reclaim except the memory consumed by the bpf
> > map itself, so waking kswapd might be useless (and just consume cpu and drain
> > batteries).
> >
> 
> I'm not sure if it is a generic problem.
> For example, a latency-sensitive process running in a container
> doesn't set __GFP_DIRECT_RECLAIM, but there're page cache pages in
> this container. If there's no global memory pressure or no other kinds
> of memory allocation in this container, these page cache pages will
> not be reclaimed for a long time.
> Maybe we should also check the number of page cache pages in this
> container before waking up kswapd, but I'm not quite sure of it.

It's not a generic problem but it might be very specific.
Anyway, it doesn't really matter, we shouldn't exceed memory.max.

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2022-07-02 15:34 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-19 15:50 [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 01/10] mm, memcg: Add a new helper memcg_should_recharge() Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 02/10] bpftool: Show memcg info of bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 03/10] mm, memcg: Add new helper obj_cgroup_from_current() Yafang Shao
2022-06-23  3:01   ` Roman Gushchin
2022-06-25 13:54     ` Yafang Shao
2022-06-26  1:52       ` Roman Gushchin
2022-06-19 15:50 ` [RFC PATCH bpf-next 04/10] mm, memcg: Make obj_cgroup_{charge, uncharge}_pages public Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 05/10] mm: Add helper to recharge kmalloc'ed address Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 06/10] mm: Add helper to recharge vmalloc'ed address Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 07/10] mm: Add helper to recharge percpu address Yafang Shao
2022-06-23  5:25   ` Dennis Zhou
2022-06-25 14:18     ` Yafang Shao
2022-06-27  3:04       ` Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 08/10] bpf: Recharge memory when reuse bpf map Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 09/10] bpf: Make bpf_map_{save, release}_memcg public Yafang Shao
2022-06-19 15:50 ` [RFC PATCH bpf-next 10/10] bpf: Support recharge for hash map Yafang Shao
2022-06-21 23:28 ` [RFC PATCH bpf-next 00/10] bpf, mm: Recharge pages when reuse bpf map Alexei Starovoitov
2022-06-22 14:03   ` Yafang Shao
2022-06-23  3:29 ` Roman Gushchin
2022-06-25  3:26   ` Yafang Shao
2022-06-26  3:28     ` Roman Gushchin
2022-06-26  3:32       ` Roman Gushchin
2022-06-26  6:38         ` Yafang Shao
2022-06-26  6:25       ` Yafang Shao
2022-07-02  4:23         ` Roman Gushchin
2022-07-02 15:24           ` Yafang Shao
2022-07-02 15:33             ` Roman Gushchin
2022-06-27  0:40     ` Alexei Starovoitov
2022-06-27 15:02       ` Yafang Shao
