* Re: [PATCH 2/3] mm: Charge active memcg when no mm is set
[not found] ` <20210610173944.1203706-3-schatzberg.dan@gmail.com>
@ 2021-06-25 14:47 ` Michal Koutný
0 siblings, 0 replies; 14+ messages in thread
From: Michal Koutný @ 2021-06-25 14:47 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Tejun Heo,
Chris Down, Jens Axboe, Shakeel Butt
[-- Attachment #1: Type: text/plain, Size: 862 bytes --]
On Thu, Jun 10, 2021 at 10:39:43AM -0700, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> @@ -926,8 +937,17 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> * counting is disabled on the root level in the
> * cgroup core. See CSS_NO_REF.
> */
> - if (unlikely(!mm))
> - return root_mem_cgroup;
> + if (unlikely(!mm)) {
> + memcg = active_memcg();
> + if (unlikely(memcg)) {
> + /* remote memcg must hold a ref */
> + css_get(&memcg->css);
> + return memcg;
> + }
> + mm = current->mm;
> + if (unlikely(!mm))
> + return root_mem_cgroup;
> + }
With the change in __add_to_page_cache_locked() all page cache charges
will supply null mm, so the first !mm unlikely hint may not be warranted
anymore. Just an interesting point, generally, I'm adding
Reviewed-by: Michal Koutný <mkoutny@suse.com>
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20210610173944.1203706-4-schatzberg.dan@gmail.com>]
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
[not found] ` <20210610173944.1203706-4-schatzberg.dan@gmail.com>
@ 2021-06-25 15:01 ` Michal Koutný
2021-06-28 14:17 ` Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Michal Koutný @ 2021-06-25 15:01 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1416 bytes --]
Hi.
On Thu, Jun 10, 2021 at 10:39:44AM -0700, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> The current code only associates with the existing blkcg when aio is
> used to access the backing file. This patch covers all types of i/o to
> the backing file and also associates the memcg so if the backing file is
> on tmpfs, memory is charged appropriately.
>
> This patch also exports cgroup_get_e_css and int_active_memcg so it
> can be used by the loop module.
Wouldn't it be clearer to export (not explicitly inlined anymore)
set_active_memcg() instead of the int_active_memcg that's rather an
implementation detail?
> @@ -2111,13 +2112,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> }
>
> /* always use the first bio's css */
> + cmd->blkcg_css = NULL;
> + cmd->memcg_css = NULL;
> #ifdef CONFIG_BLK_CGROUP
> - if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
> - cmd->css = &bio_blkcg(rq->bio)->css;
> - css_get(cmd->css);
> - } else
> + if (rq->bio && rq->bio->bi_blkg) {
> + cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
> +#ifdef CONFIG_MEMCG
> + cmd->memcg_css =
> + cgroup_get_e_css(cmd->blkcg_css->cgroup,
> + &memory_cgrp_subsys);
> +#endif
> + }
> #endif
> - cmd->css = NULL;
> loop_queue_work(lo, cmd);
I see you dropped the cmd->blkcg_css reference (while rq is handled). Is
it intentional?
Thanks,
Michal
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-25 15:01 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Michal Koutný
@ 2021-06-28 14:17 ` Dan Schatzberg
2021-06-29 10:26 ` Michal Koutný
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-06-28 14:17 UTC (permalink / raw)
To: Michal Koutný
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
Hi Michal,
On Fri, Jun 25, 2021 at 05:01:03PM +0200, Michal Koutný wrote:
> Hi.
>
> On Thu, Jun 10, 2021 at 10:39:44AM -0700, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> > The current code only associates with the existing blkcg when aio is
> > used to access the backing file. This patch covers all types of i/o to
> > the backing file and also associates the memcg so if the backing file is
> > on tmpfs, memory is charged appropriately.
> >
> > This patch also exports cgroup_get_e_css and int_active_memcg so it
> > can be used by the loop module.
>
> Wouldn't it be clearer to export (not explicitly inlined anymore)
> set_active_memcg() instead of the int_active_memcg that's rather an
> implementation detail?
Agreed that exporting int_active_memcg is an implementation detail,
but would this prevent set_active_memcg from being inlined? Is that
desireable?
>
> > @@ -2111,13 +2112,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> > }
> >
> > /* always use the first bio's css */
> > + cmd->blkcg_css = NULL;
> > + cmd->memcg_css = NULL;
> > #ifdef CONFIG_BLK_CGROUP
> > - if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
> > - cmd->css = &bio_blkcg(rq->bio)->css;
> > - css_get(cmd->css);
> > - } else
> > + if (rq->bio && rq->bio->bi_blkg) {
> > + cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
> > +#ifdef CONFIG_MEMCG
> > + cmd->memcg_css =
> > + cgroup_get_e_css(cmd->blkcg_css->cgroup,
> > + &memory_cgrp_subsys);
> > +#endif
> > + }
> > #endif
> > - cmd->css = NULL;
> > loop_queue_work(lo, cmd);
>
> I see you dropped the cmd->blkcg_css reference (while rq is handled). Is
> it intentional?
Yes it is intentional. All requests (not just aio) go through the loop
worker which grabs the blkcg reference in loop_queue_work() on
construction. So I believe grabbing a reference per request is
unnecessary.
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-28 14:17 ` Dan Schatzberg
@ 2021-06-29 10:26 ` Michal Koutný
2021-06-29 14:03 ` Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Michal Koutný @ 2021-06-29 10:26 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 1031 bytes --]
On Mon, Jun 28, 2021 at 10:17:18AM -0400, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> Agreed that exporting int_active_memcg is an implementation detail,
> but would this prevent set_active_memcg from being inlined?
Non-inlining in the loop module doesn't seem like a big trouble. OTOH,
other callers may be more sensitive and would need to rely on inlining.
I can't currently think of a nice way to have both the exported and the
exlicitly inlined variant at once. It seems it's either API or perf
craft in the end but both are uncertain, so I guess the current approach
is fine in the end.
> Yes it is intentional. All requests (not just aio) go through the loop
> worker which grabs the blkcg reference in loop_queue_work() on
> construction. So I believe grabbing a reference per request is
> unnecessary.
Isn't there a window without the reference between loop_queue_rq and
loop_queue_work? I don't know, you seem to know better, so I'd suggest
dropping a comment line into the code explaining this.
Thanks,
Michal
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-29 10:26 ` Michal Koutný
@ 2021-06-29 14:03 ` Dan Schatzberg
2021-06-30 9:42 ` Michal Koutný
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-06-29 14:03 UTC (permalink / raw)
To: Michal Koutný
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
> Non-inlining in the loop module doesn't seem like a big trouble. OTOH,
> other callers may be more sensitive and would need to rely on inlining.
Yes, this is my concern as well.
> I can't currently think of a nice way to have both the exported and the
> exlicitly inlined variant at once. It seems it's either API or perf
> craft in the end but both are uncertain, so I guess the current approach
> is fine in the end.
>
> > Yes it is intentional. All requests (not just aio) go through the loop
> > worker which grabs the blkcg reference in loop_queue_work() on
> > construction. So I believe grabbing a reference per request is
> > unnecessary.
>
> Isn't there a window without the reference between loop_queue_rq and
> loop_queue_work?
Hmm, perhaps I'm not understanding how the reference counting works,
but my understanding is that we enter loop_queue_rq with presumably
some code earlier holding a reference to the blkcg, we only need to
acquire a reference sometime before returning from loop_queue_rq. The
"window" between loop_queue_rq and loop_queue_work is all
straight-line code so there's no possibility for the earlier code to
get control back and drop the reference.
> I don't know, you seem to know better, so I'd suggest
> dropping a comment line into the code explaining this.
I wouldn't be so sure that I know any better here :D - I'm fairly
inexperienced in this domain.
Where would you suggest putting such a comment? The change in question
removed a particular case where we explicitly grab a reference to the
blkcg because now we do it uniformly in one place. Would you like a
comment explaining why we acquire a reference for all loop workers or
one explaining specifically why we don't need to acquire one for aio?
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-29 14:03 ` Dan Schatzberg
@ 2021-06-30 9:42 ` Michal Koutný
2021-06-30 14:49 ` Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Michal Koutný @ 2021-06-30 9:42 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
[-- Attachment #1: Type: text/plain, Size: 2000 bytes --]
On Tue, Jun 29, 2021 at 10:03:33AM -0400, Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
> Hmm, perhaps I'm not understanding how the reference counting works,
> but my understanding is that we enter loop_queue_rq with presumably
> some code earlier holding a reference to the blkcg, we only need to
> acquire a reference sometime before returning from loop_queue_rq. The
> "window" between loop_queue_rq and loop_queue_work is all
> straight-line code so there's no possibility for the earlier code to
> get control back and drop the reference.
I don't say the current implementation is wrong, it just looked
suspicious to me when the css address is copied without taking the
reference.
The straight path is clear, I'm not sure about later invocations through
loop_workfn where the blkcg_css is accessed via the cmd->blkcg_css.
> Where would you suggest putting such a comment?
This is how I understand it:
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -996,6 +996,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
rb_insert_color(&worker->rb_node, &lo->worker_tree);
queue_work:
if (worker) {
+ WARN_ON_ONCE(worker->blkcg_css != cmd->blkcg_css);
/*
* We need to remove from the idle list here while
* holding the lock so that the idle timer doesn't
@@ -2106,6 +2107,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
if (rq->bio && rq->bio->bi_blkg) {
+ /* reference to blkcg_css will be held by loop_worker (outlives
+ * cmd) or it is the eternal root css */
cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
#ifdef CONFIG_MEMCG
cmd->memcg_css =
(On further thoughts, maybe the blkcg_css reference isn't needed even in
the loop_worker if it can be reasoned that blkcg_css won't go away while
there's an outstanding rq.)
HTH,
Michal
[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-30 9:42 ` Michal Koutný
@ 2021-06-30 14:49 ` Dan Schatzberg
0 siblings, 0 replies; 14+ messages in thread
From: Dan Schatzberg @ 2021-06-30 14:49 UTC (permalink / raw)
To: Michal Koutný
Cc: Andrew Morton, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner, Jens Axboe
> This is how I understand it:
>
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -996,6 +996,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
> rb_insert_color(&worker->rb_node, &lo->worker_tree);
> queue_work:
> if (worker) {
> + WARN_ON_ONCE(worker->blkcg_css != cmd->blkcg_css);
Yes, this is correct. Though the check here seems a bit obvious to me
- it must be correct because we assign worker above:
if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
or when we construct the worker:
worker->blkcg_css = cmd->blkcg_css;
I think this WARN_ON_ONCE check might be more interesting in
loop_process_work which invokes loop_handle_cmd and actually uses
cmd->blkcg_css. In any event, your understanding is correct here.
> /*
> * We need to remove from the idle list here while
> * holding the lock so that the idle timer doesn't
> @@ -2106,6 +2107,8 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> cmd->memcg_css = NULL;
> #ifdef CONFIG_BLK_CGROUP
> if (rq->bio && rq->bio->bi_blkg) {
> + /* reference to blkcg_css will be held by loop_worker (outlives
> + * cmd) or it is the eternal root css */
Yes, this is correct. Feel free to add my Acked-by to such a patch
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH V13 0/3] Charge loop device i/o to issuing cgroup
@ 2021-06-03 14:57 Dan Schatzberg
2021-06-03 14:57 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-06-03 14:57 UTC (permalink / raw)
To: Jens Axboe
Cc: open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
No significant changes, rebased on Linus's tree.
Jens, this series was intended to go into the mm tree since it had
some conflicts with mm changes. It never got picked up for 5.12 and
the corresponding mm changes are now in linus's tree. This is mostly a
loop change so it feels more appropriate to go through the block tree.
Do you think that makes sense?
Changes since V12:
* Small change to get_mem_cgroup_from_mm to avoid needing
get_active_memcg
Changes since V11:
* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
can be driven by writeback, but this was causing a warning in xfs
and likely other filesystems aren't equipped to be driven by reclaim
at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
priority.
Changes since V10:
* Added page-cache charging to mm: Charge active memcg when no mm is set
Changes since V9:
* Rebased against linus's branch which now includes Roman Gushchin's
patch this series is based off of
Changes since V8:
* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.
Changes since V7:
* Rebased against linus's branch
Changes since V6:
* Added separate spinlock for worker synchronization
* Minor style changes
Changes since V5:
* Fixed a missing css_put when failing to allocate a worker
* Minor style changes
Changes since V4:
Only patches 1 and 2 have changed.
* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg
Changes since V3:
* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes
Changes since V2:
* Deferred destruction of workqueue items so in the common case there
is no allocation needed
Changes since V1:
* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change
* Add mem_css to struct loop_cmd to simplify logic
The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
A simple script to demonstrate this behavior on cgroupv2 machine:
'''
#!/bin/bash
set -e
CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0
if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true
grep oom_kill $CGROUP/memory.events
'''
Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.
With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg
drivers/block/loop.c | 241 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 49 +++++---
mm/shmem.c | 4 +-
7 files changed, 250 insertions(+), 68 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-06-03 14:57 [PATCH V13 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
@ 2021-06-03 14:57 ` Dan Schatzberg
0 siblings, 0 replies; 14+ messages in thread
From: Dan Schatzberg @ 2021-06-03 14:57 UTC (permalink / raw)
To: Jens Axboe
Cc: open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Johannes Weiner
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Jens Axboe <axboe@kernel.dk>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 935edcf7c7b1..b38115c91288 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
#include "loop.h"
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};
@@ -943,7 +939,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
spin_lock_irq(&lo->lo_work_lock);
- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;
node = &lo->worker_tree.rb_node;
@@ -951,10 +947,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -966,13 +962,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }
- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1291,7 +1292,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2096,13 +2097,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}
/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2114,13 +2120,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2199,7 +2220,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};
/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c193be760709..542d9cae336b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1255,6 +1255,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 21ecc6ee6a6d..9cc8c3a686b1 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 26dc2dc0056a..8a8222df44b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH V12 0/3] Charge loop device i/o to issuing cgroup
@ 2021-04-02 19:16 Dan Schatzberg
2021-04-02 19:16 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-04-02 19:16 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Muchun Song, Yang Shi, Alex Shi, Alexander Duyck,
Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
No major changes, rebased on top of latest mm tree
Changes since V12:
* Small change to get_mem_cgroup_from_mm to avoid needing
get_active_memcg
Changes since V11:
* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
can be driven by writeback, but this was causing a warning in xfs
and likely other filesystems aren't equipped to be driven by reclaim
at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
priority.
Changes since V10:
* Added page-cache charging to mm: Charge active memcg when no mm is set
Changes since V9:
* Rebased against linus's branch which now includes Roman Gushchin's
patch this series is based off of
Changes since V8:
* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.
Changes since V7:
* Rebased against linus's branch
Changes since V6:
* Added separate spinlock for worker synchronization
* Minor style changes
Changes since V5:
* Fixed a missing css_put when failing to allocate a worker
* Minor style changes
Changes since V4:
Only patches 1 and 2 have changed.
* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg
Changes since V3:
* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes
Changes since V2:
* Deferred destruction of workqueue items so in the common case there
is no allocation needed
Changes since V1:
* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change
* Add mem_css to struct loop_cmd to simplify logic
The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
A simple script to demonstrate this behavior on cgroupv2 machine:
'''
#!/bin/bash
set -e
CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0
if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true
grep oom_kill $CGROUP/memory.events
'''
Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.
With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg
drivers/block/loop.c | 244 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 49 +++++---
mm/shmem.c | 4 +-
7 files changed, 253 insertions(+), 68 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-04-02 19:16 [PATCH V12 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
@ 2021-04-02 19:16 ` Dan Schatzberg
2021-04-06 3:23 ` Ming Lei
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-04-02 19:16 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Muchun Song, Yang Shi, Alex Shi, Alexander Duyck,
Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4750b373d4bb..d2759f8a7c2a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
#include "loop.h"
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};
@@ -945,7 +941,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
spin_lock_irq(&lo->lo_work_lock);
- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;
node = &lo->worker_tree.rb_node;
@@ -953,10 +949,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -968,13 +964,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }
- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1294,7 +1295,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2099,13 +2100,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}
/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2117,13 +2123,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2202,7 +2223,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};
/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b8b0a802852c..a92500734f90 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1249,6 +1249,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e049edd66776..8c84a5374238 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2939d6602b3..f12886a85e8b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-04-02 19:16 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
@ 2021-04-06 3:23 ` Ming Lei
0 siblings, 0 replies; 14+ messages in thread
From: Ming Lei @ 2021-04-06 3:23 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Muchun Song, Yang Shi, Alex Shi, Alexander Duyck,
Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
On Sat, Apr 3, 2021 at 3:18 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> The current code only associates with the existing blkcg when aio is
> used to access the backing file. This patch covers all types of i/o to
> the backing file and also associates the memcg so if the backing file is
> on tmpfs, memory is charged appropriately.
>
> This patch also exports cgroup_get_e_css and int_active_memcg so it
> can be used by the loop module.
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
--
Ming Lei
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH V11 0/3] Charge loop device i/o to issuing cgroup
@ 2021-03-29 14:48 Dan Schatzberg
2021-03-29 14:48 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-03-29 14:48 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Yang Shi, Muchun Song, Alex Shi, Alexander Duyck,
Yafang Shao, Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
No major changes, rebased on top of latest mm tree
Changes since V11:
* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
can be driven by writeback, but this was causing a warning in xfs
and likely other filesystems aren't equipped to be driven by reclaim
at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
priority.
Changes since V10:
* Added page-cache charging to mm: Charge active memcg when no mm is set
Changes since V9:
* Rebased against linus's branch which now includes Roman Gushchin's
patch this series is based off of
Changes since V8:
* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.
Changes since V7:
* Rebased against linus's branch
Changes since V6:
* Added separate spinlock for worker synchronization
* Minor style changes
Changes since V5:
* Fixed a missing css_put when failing to allocate a worker
* Minor style changes
Changes since V4:
Only patches 1 and 2 have changed.
* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg
Changes since V3:
* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes
Changes since V2:
* Deferred destruction of workqueue items so in the common case there
is no allocation needed
Changes since V1:
* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change
* Add mem_css to struct loop_cmd to simplify logic
The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
A simple script to demonstrate this behavior on cgroupv2 machine:
'''
#!/bin/bash
set -e
CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0
if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true
grep oom_kill $CGROUP/memory.events
'''
Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.
With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg
drivers/block/loop.c | 248 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 73 ++++++-----
mm/shmem.c | 4 +-
7 files changed, 267 insertions(+), 82 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-03-29 14:48 [PATCH V11 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
@ 2021-03-29 14:48 ` Dan Schatzberg
0 siblings, 0 replies; 14+ messages in thread
From: Dan Schatzberg @ 2021-03-29 14:48 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Yang Shi, Muchun Song, Alex Shi, Alexander Duyck,
Yafang Shao, Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT, Chris Down
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 51 insertions(+), 21 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c18e6b856c2..96ade57c9f7c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
#include "loop.h"
@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};
@@ -945,7 +941,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
spin_lock_irq(&lo->lo_work_lock);
- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;
node = &lo->worker_tree.rb_node;
@@ -953,10 +949,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -968,13 +964,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }
- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1298,7 +1299,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2103,13 +2104,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}
/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2121,13 +2127,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2206,7 +2227,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};
/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4064c9dda534..df42be35b5fb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1178,6 +1178,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e049edd66776..8c84a5374238 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index adc618814fd2..4aacdf06c6c8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v10 0/3] Charge loop device i/o to issuing cgroup
@ 2021-03-16 15:36 Dan Schatzberg
2021-03-16 15:36 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-03-16 15:36 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Muchun Song, Alex Shi, Alexander Duyck,
Chris Down, Yafang Shao, Wei Yang, open list:BLOCK LAYER,
open list, open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
No major changes, just rebasing and resubmitting
Changes since V10:
* Added page-cache charging to mm: Charge active memcg when no mm is set
Changes since V9:
* Rebased against linus's branch which now includes Roman Gushchin's
patch this series is based off of
Changes since V8:
* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.
Changes since V7:
* Rebased against linus's branch
Changes since V6:
* Added separate spinlock for worker synchronization
* Minor style changes
Changes since V5:
* Fixed a missing css_put when failing to allocate a worker
* Minor style changes
Changes since V4:
Only patches 1 and 2 have changed.
* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg
Changes since V3:
* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes
Changes since V2:
* Deferred destruction of workqueue items so in the common case there
is no allocation needed
Changes since V1:
* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change
* Add mem_css to struct loop_cmd to simplify logic
The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
A simple script to demonstrate this behavior on cgroupv2 machine:
'''
#!/bin/bash
set -e
CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0
if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true
grep oom_kill $CGROUP/memory.events
'''
Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.
With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg
drivers/block/loop.c | 248 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 11 ++
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 15 ++-
mm/shmem.c | 4 +-
7 files changed, 242 insertions(+), 54 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-03-16 15:36 [PATCH v10 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
@ 2021-03-16 15:36 ` Dan Schatzberg
2021-03-16 16:25 ` Shakeel Butt
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2021-03-16 15:36 UTC (permalink / raw)
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Shakeel Butt,
Roman Gushchin, Muchun Song, Alex Shi, Alexander Duyck,
Chris Down, Yafang Shao, Wei Yang, open list:BLOCK LAYER,
open list, open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 11 +++++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 56 insertions(+), 21 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c080af73caa..6cf3086a2e75 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
#include "loop.h"
@@ -515,8 +516,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -577,8 +576,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -586,7 +583,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -927,7 +923,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};
@@ -944,7 +940,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
spin_lock_irq(&lo->lo_work_lock);
- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;
node = &lo->worker_tree.rb_node;
@@ -952,10 +948,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -967,13 +963,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }
- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1297,7 +1298,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2102,13 +2103,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}
/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2120,13 +2126,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2205,7 +2226,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};
/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..fd5dd961d01f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1187,6 +1187,17 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}
+static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..94c88f7221c5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a1b23ed3412..f05501669e29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg
2021-03-16 15:36 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
@ 2021-03-16 16:25 ` Shakeel Butt
0 siblings, 0 replies; 14+ messages in thread
From: Shakeel Butt @ 2021-03-16 16:25 UTC (permalink / raw)
To: Dan Schatzberg
Cc: Jens Axboe, Tejun Heo, Zefan Li, Johannes Weiner, Andrew Morton,
Michal Hocko, Vladimir Davydov, Hugh Dickins, Roman Gushchin,
Muchun Song, Alex Shi, Alexander Duyck, Chris Down, Yafang Shao,
Wei Yang, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:MEMORY MANAGEMENT
On Tue, Mar 16, 2021 at 8:37 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
[...]
>
> /* Support for loadable transfer modules */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..fd5dd961d01f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1187,6 +1187,17 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> return NULL;
> }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}
The above function has been removed.
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v8 0/3] Charge loop device i/o to issuing cgroup
@ 2020-08-31 15:36 Dan Schatzberg
2020-08-31 15:37 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
0 siblings, 1 reply; 14+ messages in thread
From: Dan Schatzberg @ 2020-08-31 15:36 UTC (permalink / raw)
Cc: Dan Schatzberg, Jens Axboe, Tejun Heo, Li Zefan, Johannes Weiner,
Michal Hocko, Vladimir Davydov, Andrew Morton, Hugh Dickins,
Shakeel Butt, Roman Gushchin, Joonsoo Kim, Chris Down, Yang Shi,
Jakub Kicinski, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
Much of the discussion about this has died down. There's been a
concern raised that we could generalize infrastructure across loop,
md, etc. This may be possible, in the future, but it isn't clear to me
how this would look like. I'm inclined to fix the existing issue with
loop devices now (this is a problem we hit at FB) and address
consolidation with other cases if and when those need to be addressed.
Note that this series needs to be based off of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) to compile.
Changes since V8:
* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.
Changes since V7:
* Rebased against linus's branch
Changes since V6:
* Added separate spinlock for worker synchronization
* Minor style changes
Changes since V5:
* Fixed a missing css_put when failing to allocate a worker
* Minor style changes
Changes since V4:
Only patches 1 and 2 have changed.
* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg
Changes since V3:
* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes
Changes since V2:
* Deferred destruction of workqueue items so in the common case there
is no allocation needed
Changes since V1:
* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change
* Add mem_css to struct loop_cmd to simplify logic
The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.
A simple script to demonstrate this behavior on cgroupv2 machine:
'''
#!/bin/bash
set -e
CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0
if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
grep oom_kill $CGROUP/memory.events
# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true
grep oom_kill $CGROUP/memory.events
'''
Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.
With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.
Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg
drivers/block/loop.c | 248 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 11 +-
mm/shmem.c | 4 +-
6 files changed, 232 insertions(+), 53 deletions(-)
--
2.24.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] loop: Charge i/o to mem and blk cg
2020-08-31 15:36 [PATCH v8 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
@ 2020-08-31 15:37 ` Dan Schatzberg
0 siblings, 0 replies; 14+ messages in thread
From: Dan Schatzberg @ 2020-08-31 15:37 UTC (permalink / raw)
Cc: Dan Schatzberg, Johannes Weiner, Jens Axboe, Tejun Heo, Li Zefan,
Michal Hocko, Vladimir Davydov, Andrew Morton, Hugh Dickins,
Shakeel Butt, Roman Gushchin, Joonsoo Kim, Chris Down,
Yafang Shao, Yang Shi, open list:BLOCK LAYER, open list,
open list:CONTROL GROUP (CGROUP),
open list:CONTROL GROUP - MEMORY RESOURCE CONTROLLER (MEMCG)
The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.
This patch also exports cgroup_get_e_css so it can be used by the loop
module.
Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
4 files changed, 50 insertions(+), 21 deletions(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 771685a6c259..3da34d454287 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>
#include "loop.h"
@@ -518,8 +519,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);
- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -580,8 +579,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);
if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -589,7 +586,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);
lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);
if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -932,7 +928,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};
@@ -949,7 +945,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
spin_lock_irq(&lo->lo_work_lock);
- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;
node = &lo->worker_tree.rb_node;
@@ -957,10 +953,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -972,13 +968,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }
- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1304,7 +1305,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2105,13 +2106,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}
/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);
return BLK_STS_OK;
@@ -2123,13 +2129,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2208,7 +2229,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 0162b55a68e1..4d6886d9855a 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -75,7 +75,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};
/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d0b036123c6a..fceac9f66d96 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1031,6 +1031,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
return NULL;
}
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index dd247747ec14..16d059a89a68 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -580,6 +580,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);
static void cgroup_get_live(struct cgroup *cgrp)
{
--
2.24.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-06-30 14:50 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20210610173944.1203706-1-schatzberg.dan@gmail.com>
[not found] ` <20210610173944.1203706-3-schatzberg.dan@gmail.com>
2021-06-25 14:47 ` [PATCH 2/3] mm: Charge active memcg when no mm is set Michal Koutný
[not found] ` <20210610173944.1203706-4-schatzberg.dan@gmail.com>
2021-06-25 15:01 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Michal Koutný
2021-06-28 14:17 ` Dan Schatzberg
2021-06-29 10:26 ` Michal Koutný
2021-06-29 14:03 ` Dan Schatzberg
2021-06-30 9:42 ` Michal Koutný
2021-06-30 14:49 ` Dan Schatzberg
2021-06-03 14:57 [PATCH V13 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2021-06-03 14:57 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
-- strict thread matches above, loose matches on Subject: below --
2021-04-02 19:16 [PATCH V12 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2021-04-02 19:16 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
2021-04-06 3:23 ` Ming Lei
2021-03-29 14:48 [PATCH V11 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2021-03-29 14:48 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
2021-03-16 15:36 [PATCH v10 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2021-03-16 15:36 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
2021-03-16 16:25 ` Shakeel Butt
2020-08-31 15:36 [PATCH v8 0/3] Charge loop device i/o to issuing cgroup Dan Schatzberg
2020-08-31 15:37 ` [PATCH 3/3] loop: Charge i/o to mem and blk cg Dan Schatzberg
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).