* [PATCH v4 0/4] Deterministic charging of shared memory
From: Mina Almasry @ 2021-11-20  4:50 UTC
  Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
	Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
	Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner,
	Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel,
	linux-fsdevel, linux-mm

Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:

1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.

2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as Kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and their
subtasks use that tmpfs mount for various purposes such as sharing
data or persisting data across subtask restarts. In Kubernetes
terminology, the meta job is similar to a pod and the subtasks are
containers under the pod. We want the shared memory to be
deterministically charged to the Kubernetes pod, independent of
the lifetime of the containers under the pod.

3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes between many of the processes running on our
machines in separate memcgs. A side effect is that one job may be
unlucky enough to be the first to access many of the libraries and may
get oom-killed as all the cached files get charged to it.

Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems, directing all the
memory of the filesystem to be ‘remote charged’ to the cgroup provided
by that memcg= option.
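
As a rough illustration of how userspace might request this, the sketch
below passes the proposed option as the data argument of mount(2). It
assumes a kernel carrying this series; an unpatched kernel should reject
the unknown option. The helper names and any cgroup path passed in are
hypothetical and not part of the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Build the mount data string "memcg=<cgroup_path>".
 * Returns 0 on success, -1 if the path is empty or does not fit in buf.
 */
int build_memcg_opt(char *buf, size_t size, const char *cgroup_path)
{
	int n;

	assert(buf && cgroup_path);
	if (cgroup_path[0] == '\0')
		return -1;
	n = snprintf(buf, size, "memcg=%s", cgroup_path);
	return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/*
 * Mount a tmpfs at 'target' whose memory would be remote charged to
 * 'cgroup_path'. ASSUMPTION: the running kernel understands the
 * proposed memcg= option. Returns 0 on success or a negative errno.
 */
int mount_tmpfs_memcg(const char *target, const char *cgroup_path)
{
	char opts[4096];

	if (build_memcg_opt(opts, sizeof(opts), cgroup_path))
		return -EINVAL;
	if (mount("tmpfs", target, "tmpfs", 0, opts))
		return -errno;
	return 0;
}
```

The mount call needs the usual mount privileges, and per the caveats
below the caller must also be allowed to enter the target cgroup.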

Caveats:

1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. There are a number of considerations
in this case:

1. It's not great to kill the allocating process since the allocating process
   is not running in the memcg under oom, and killing it will not free memory
   in the memcg under oom.
2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
   somehow. If not, the process will loop on the pagefault forever in the
   upstream kernel.

In this case, I propose simply failing the remote charge and returning an ENOSPC
to the caller. This will cause the process executing the remote charge to get
an ENOSPC in non-pagefault paths, and a SIGBUS on the pagefault path. This will
be documented behavior of remote charging, and this feature is opt-in. Users can:
- Not opt into the feature if they want.
- Opt into the feature and accept the risk of receiving ENOSPC or SIGBUS and
  abort if they desire.
- Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
  operation without executing the remote charge if possible.

2. Only processes allowed to enter the cgroup at mount time can mount a
tmpfs with memcg=<cgroup>. This is to prevent intentional DoS of random cgroups
on the machine. However, once a filesystem is mounted with memcg=<cgroup>, any
process with write access to this mount point will be able to charge memory to
<cgroup>. This is largely a non-issue because in configurations where there is
untrusted code running on the machine, mount point access needs to be
restricted to the intended users only regardless of whether the mount point
memory is deterministically charged or not.

[1] https://research.google/pubs/pub48630

Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org

Mina Almasry (4):
  mm: support deterministic memory charging of filesystems
  mm/oom: handle remote ooms
  mm, shmem: add filesystem memcg= option documentation
  mm, shmem, selftests: add tmpfs memcg= mount option tests

 Documentation/filesystems/tmpfs.rst       |  28 ++++
 fs/fs_context.c                           |  27 ++++
 fs/proc_namespace.c                       |   4 +
 fs/super.c                                |   9 ++
 include/linux/fs.h                        |   5 +
 include/linux/fs_context.h                |   2 +
 include/linux/memcontrol.h                |  38 +++++
 mm/filemap.c                              |   2 +-
 mm/khugepaged.c                           |   3 +-
 mm/memcontrol.c                           | 171 ++++++++++++++++++++++
 mm/oom_kill.c                             |   9 ++
 mm/shmem.c                                |   3 +-
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/mmap_write.c   | 103 +++++++++++++
 tools/testing/selftests/vm/tmpfs-memcg.sh | 116 +++++++++++++++
 15 files changed, 518 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/vm/mmap_write.c
 create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh

--
2.34.0.rc2.393.gf8c9666880-goog

* Re: [PATCH v4 1/4] mm: support deterministic memory charging of filesystems
From: kernel test robot @ 2021-11-22 20:43 UTC
  To: kbuild

CC: kbuild-all(a)lists.01.org
In-Reply-To: <20211120045011.3074840-2-almasrymina@google.com>
References: <20211120045011.3074840-2-almasrymina@google.com>
TO: Mina Almasry <almasrymina@google.com>
TO: Alexander Viro <viro@zeniv.linux.org.uk>
TO: Andrew Morton <akpm@linux-foundation.org>
CC: Linux Memory Management List <linux-mm@kvack.org>
TO: Johannes Weiner <hannes@cmpxchg.org>
TO: Michal Hocko <mhocko@kernel.org>
TO: Vladimir Davydov <vdavydov.dev@gmail.com>
TO: Hugh Dickins <hughd@google.com>
CC: Mina Almasry <almasrymina@google.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Shuah Khan <skhan@linuxfoundation.org>

Hi Mina,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on linus/master v5.16-rc2 next-20211118]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mina-Almasry/Deterministic-charging-of-shared-memory/20211120-125229
base:   https://github.com/hnaz/linux-mm master
:::::: branch date: 3 days ago
:::::: commit date: 3 days ago
config: i386-randconfig-s002-20211122 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/5ecf5e613f50d859803aae9bc6f8295cb199701d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mina-Almasry/Deterministic-charging-of-shared-memory/20211120-125229
        git checkout 5ecf5e613f50d859803aae9bc6f8295cb199701d
        # save the attached .config to linux build tree
        make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:2644:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> mm/memcontrol.c:2644:17: sparse:    struct mem_cgroup [noderef] __rcu *
>> mm/memcontrol.c:2644:17: sparse:    struct mem_cgroup *
   mm/memcontrol.c:2699:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:2699:17: sparse:    struct mem_cgroup [noderef] __rcu *
   mm/memcontrol.c:2699:17: sparse:    struct mem_cgroup *
   mm/memcontrol.c:4192:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4192:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4192:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4194:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4194:21: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4194:21: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4350:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4350:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4350:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:4444:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4444:9: sparse:    struct mem_cgroup_threshold_ary [noderef] __rcu *
   mm/memcontrol.c:4444:9: sparse:    struct mem_cgroup_threshold_ary *
   mm/memcontrol.c:6059:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:6059:23: sparse:    struct task_struct [noderef] __rcu *
   mm/memcontrol.c:6059:23: sparse:    struct task_struct *
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
   include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit
   mm/memcontrol.c:2019:6: sparse: sparse: context imbalance in 'folio_memcg_lock' - wrong count at exit
   mm/memcontrol.c:2071:17: sparse: sparse: context imbalance in '__folio_memcg_unlock' - unexpected unlock
   mm/memcontrol.c:5910:28: sparse: sparse: context imbalance in 'mem_cgroup_count_precharge_pte_range' - unexpected unlock
   mm/memcontrol.c:6104:36: sparse: sparse: context imbalance in 'mem_cgroup_move_charge_pte_range' - unexpected unlock

vim +2644 mm/memcontrol.c

5ecf5e613f50d8 Mina Almasry 2021-11-19  2630  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2631  void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
5ecf5e613f50d8 Mina Almasry 2021-11-19  2632  {
5ecf5e613f50d8 Mina Almasry 2021-11-19  2633  	struct mem_cgroup *memcg;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2634  	int ret = 0;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2635  	char *buf = __getname();
5ecf5e613f50d8 Mina Almasry 2021-11-19  2636  	int len = PATH_MAX;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2637  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2638  	if (!buf)
5ecf5e613f50d8 Mina Almasry 2021-11-19  2639  		return;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2640  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2641  	buf[0] = '\0';
5ecf5e613f50d8 Mina Almasry 2021-11-19  2642  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2643  	rcu_read_lock();
5ecf5e613f50d8 Mina Almasry 2021-11-19 @2644  	memcg = rcu_dereference(sb->s_memcg_to_charge);
5ecf5e613f50d8 Mina Almasry 2021-11-19  2645  	if (memcg && !css_tryget_online(&memcg->css))
5ecf5e613f50d8 Mina Almasry 2021-11-19  2646  		memcg = NULL;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2647  	rcu_read_unlock();
5ecf5e613f50d8 Mina Almasry 2021-11-19  2648  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2649  	if (!memcg)
5ecf5e613f50d8 Mina Almasry 2021-11-19  2650  		return;
5ecf5e613f50d8 Mina Almasry 2021-11-19  2651  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2652  	ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
5ecf5e613f50d8 Mina Almasry 2021-11-19  2653  	if (ret >= len / 2)
5ecf5e613f50d8 Mina Almasry 2021-11-19  2654  		strcpy(buf, "?");
5ecf5e613f50d8 Mina Almasry 2021-11-19  2655  	else {
5ecf5e613f50d8 Mina Almasry 2021-11-19  2656  		char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
5ecf5e613f50d8 Mina Almasry 2021-11-19  2657  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2658  		if (p)
5ecf5e613f50d8 Mina Almasry 2021-11-19  2659  			*p = '\0';
5ecf5e613f50d8 Mina Almasry 2021-11-19  2660  		else
5ecf5e613f50d8 Mina Almasry 2021-11-19  2661  			strcpy(buf, "?");
5ecf5e613f50d8 Mina Almasry 2021-11-19  2662  	}
5ecf5e613f50d8 Mina Almasry 2021-11-19  2663  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2664  	css_put(&memcg->css);
5ecf5e613f50d8 Mina Almasry 2021-11-19  2665  	if (buf[0] != '\0')
5ecf5e613f50d8 Mina Almasry 2021-11-19  2666  		seq_printf(m, ",memcg=%s", buf);
5ecf5e613f50d8 Mina Almasry 2021-11-19  2667  
5ecf5e613f50d8 Mina Almasry 2021-11-19  2668  	__putname(buf);
5ecf5e613f50d8 Mina Almasry 2021-11-19  2669  }
5ecf5e613f50d8 Mina Almasry 2021-11-19  2670  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 38300 bytes --]

Thread overview: 25+ messages
2021-11-20  4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
2021-11-20  4:50 ` Mina Almasry
2021-11-20  4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
2021-11-20  4:50   ` Mina Almasry
2021-11-20  7:53   ` Shakeel Butt
2021-11-20  4:50 ` [PATCH v4 2/4] mm/oom: handle remote ooms Mina Almasry
2021-11-20  5:07   ` Matthew Wilcox
2021-11-20  5:07     ` Matthew Wilcox
2021-11-20  5:31     ` Mina Almasry
2021-11-20  7:58   ` Shakeel Butt
2021-11-20  7:58     ` Shakeel Butt
2021-11-20  4:50 ` [PATCH v4 3/4] mm, shmem: add filesystem memcg= option documentation Mina Almasry
2021-11-20  4:50 ` [PATCH v4 4/4] mm, shmem, selftests: add tmpfs memcg= mount option tests Mina Almasry
2021-11-20  5:01 ` [PATCH v4 0/4] Deterministic charging of shared memory Matthew Wilcox
2021-11-20  5:27   ` Mina Almasry
2021-11-22 19:04 ` Johannes Weiner
2021-11-22 22:09   ` Mina Almasry
2021-11-22 23:09   ` Roman Gushchin
2021-11-23 19:26     ` Mina Almasry
2021-11-23 20:21     ` Johannes Weiner
2021-11-23 21:19       ` Mina Almasry
2021-11-23 22:49         ` Roman Gushchin
2021-11-24 17:27   ` Michal Hocko
2021-11-29  6:00   ` Shakeel Butt
2021-11-22 20:43 [PATCH v4 1/4] mm: support deterministic memory charging of filesystems kernel test robot
