All of lore.kernel.org
* [BUG] The usage of memory cgroup is not consistent with processes when using THP
@ 2021-09-26  7:35 ` 台运方
  0 siblings, 0 replies; 10+ messages in thread
From: 台运方 @ 2021-09-26  7:35 UTC (permalink / raw)
  To: hannes; +Cc: hughd, tj, vdavydov, cgroups, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1241 bytes --]

Hi folks,
We found that the usage counter of memory cgroup v1 containers is not
consistent with the memory usage of their processes when using THP.

The issue was introduced by upstream commit 0a31bc97c80 and still
exists in Linux 5.14.5.
The root cause is that mem_cgroup_uncharge was moved to the final
put_page(). When parts of a THP huge page are freed, the memory usage
of the process is updated when the PTEs are unmapped, but the usage
counter of the memory cgroup is only updated when the huge page is
split in deferred_split_scan. This causes the inconsistency, and we
have seen more than 30 GB of difference in our daily usage.

It can be reproduced with the following program and script.
The program "eat_release_memory" allocates memory in 8 MB chunks and
releases the last 1 MB of each chunk using madvise.
The script "test_thp.sh" creates a memory cgroup, runs
"eat_release_memory 500" in it, and repeats the process 10 times. The
output shows the change in memory usage, which in theory should be
about 500 MB lower.
The output varies randomly when THP is enabled, while adding "echo 2
> /proc/sys/vm/drop_caches" before accounting avoids this.

Are there any patches to fix this, or is it working as designed?

Thanks,
Yunfang Tai

[-- Attachment #2: eat_release_memory.c --]
[-- Type: text/x-c-code, Size: 1175 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>

int main(int argc, char* argv[])
{
    char* memindex[1000] = {0};
    int eat = 0;
    int i = 0;

    if (argc < 2) {
        printf("Usage: ./eat_release_memory <num>   #allocate num * 8 MB and free num MB of memory\n");
        return 1;
    }

    sscanf(argv[1], "%d", &eat);
    if (eat <= 0 || eat >= 1000) {
        printf("num should be larger than 0 and less than 1000\n");
        return 1;
    }
    printf("Allocating memory in MB: %d\n", eat * 8);

    printf("Allocation of memory begins!\n");
    for (i = 0; i < eat; i++) {
        memindex[i] = (char*)mmap(NULL, 8*1024*1024, PROT_READ|PROT_WRITE,
                                  MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
        if (memindex[i] == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        memset(memindex[i], 0, 8*1024*1024);   /* fault the pages in */
    }

    printf("Allocation of memory done!\n");
    sleep(2);
    printf("Now madvise away the tail of each chunk!\n");
    for (i = 0; i < eat; i++) {
        /* zap the last 1 MB of each 8 MB chunk */
        madvise(memindex[i] + 7*1024*1024, 1024*1024, MADV_DONTNEED);
    }
    sleep(5);
    printf("Now release the memory!\n");
    for (i = 0; i < eat; i++) {
        munmap(memindex[i], 8*1024*1024);
    }

    return 0;
}

[-- Attachment #3: test_thp.sh --]
[-- Type: application/x-sh, Size: 598 bytes --]
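
The body of test_thp.sh is not included in this archive; a hypothetical
reconstruction from the description above (the cgroup path, timings, and
output format are all assumptions, not the original script) might look
like:

```shell
#!/bin/sh
# Hypothetical reconstruction of test_thp.sh -- the original attachment
# body is not shown, so the cgroup path, timings, and output format here
# are assumptions based only on the description in the mail.
CG=/sys/fs/cgroup/memory/thp_test

if [ ! -w /sys/fs/cgroup/memory ] || [ ! -x ./eat_release_memory ]; then
    echo "prerequisites missing (cgroup v1 memory controller, reproducer binary)"
else
    mkdir -p "$CG"
    i=1
    while [ "$i" -le 10 ]; do
        # Run the reproducer inside the cgroup: it maps 500 x 8 MB, then
        # madvise(MADV_DONTNEED)s 1 MB per chunk and sleeps before exiting.
        sh -c "echo \$\$ > '$CG/cgroup.procs'; exec ./eat_release_memory 500" &
        pid=$!
        sleep 4    # aim for the window after the madvise loop
        usage=$(cat "$CG/memory.usage_in_bytes")
        # In theory usage is now (500*8 - 500) MB = 3500 MB; with THP the
        # deferred split leaves it randomly higher.
        echo "round $i: usage = $((usage / 1048576)) MB (expected ~3500 MB)"
        wait "$pid"
        i=$((i + 1))
    done
    rmdir "$CG"
fi
```

The script degrades to a notice when run without root or without the
cgroup v1 memory controller mounted.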

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [BUG] The usage of memory cgroup is not consistent with processes when using THP
@ 2021-09-27 17:28   ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2021-09-27 17:28 UTC (permalink / raw)
  To: 台运方
  Cc: Johannes Weiner, Hugh Dickins, Tejun Heo, vdavydov, Cgroups, Linux MM

On Sun, Sep 26, 2021 at 12:35 AM 台运方 <yunfangtai09@gmail.com> wrote:
>
> Hi folks,
> We found that the usage counter of memory cgroup v1 containers is not
> consistent with the memory usage of their processes when using THP.
>
> The issue was introduced by upstream commit 0a31bc97c80 and still
> exists in Linux 5.14.5.
> The root cause is that mem_cgroup_uncharge was moved to the final
> put_page(). When parts of a THP huge page are freed, the memory usage
> of the process is updated when the PTEs are unmapped, but the usage
> counter of the memory cgroup is only updated when the huge page is
> split in deferred_split_scan. This causes the inconsistency, and we
> have seen more than 30 GB of difference in our daily usage.

IMHO I don't think this is a bug. The disparity reflects the
difference in how the page's life cycle is viewed by the process and
by the cgroup. The usage of a process comes from the rss_counter of
its mm, which tracks per-process mapped memory, so it is updated as
soon as the page is zapped.

But from the cgroup's point of view, the page is charged when it is
allocated and uncharged when it is freed. The page may be zapped by
one process, but other users may still pin the page and prevent it
from being freed. The pin may be very transient or indefinite, and
THP is one such pin: it is gone once the THP is split, but due to
the deferred split that may happen long after the page is zapped.

>
> It can be reproduced with the following program and script.
> The program "eat_release_memory" allocates memory in 8 MB chunks and
> releases the last 1 MB of each chunk using madvise.
> The script "test_thp.sh" creates a memory cgroup, runs
> "eat_release_memory 500" in it, and repeats the process 10 times. The
> output shows the change in memory usage, which in theory should be
> about 500 MB lower.
> The output varies randomly when THP is enabled, while adding "echo 2
> > /proc/sys/vm/drop_caches" before accounting avoids this.
>
> Are there any patches to fix this, or is it working as designed?
>
> Thanks,
> Yunfang Tai



* Re: [BUG] The usage of memory cgroup is not consistent with processes when using THP
@ 2021-09-28  7:15     ` 台运方
  0 siblings, 0 replies; 10+ messages in thread
From: 台运方 @ 2021-09-28  7:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: Johannes Weiner, Hugh Dickins, Tejun Heo, Cgroups, Linux MM

Yang Shi <shy828301@gmail.com> wrote on Tue, Sep 28, 2021 at 1:28 AM:
> IMHO I don't think this is a bug. The disparity reflects the
> difference in how the page's life cycle is viewed by the process and
> by the cgroup. The usage of a process comes from the rss_counter of
> its mm, which tracks per-process mapped memory, so it is updated as
> soon as the page is zapped.
>
> But from the cgroup's point of view, the page is charged when it is
> allocated and uncharged when it is freed. The page may be zapped by
> one process, but other users may still pin the page and prevent it
> from being freed. The pin may be very transient or indefinite, and
> THP is one such pin: it is gone once the THP is split, but due to
> the deferred split that may happen long after the page is zapped.
Thank you for your reply. I agree that it reflects the difference
between the process and cgroup views. The memory usage of a cgroup is
usually used to indicate the memory usage of a container, e.g. to
avoid OOM. The disparity causes the memory usage of containers running
the same processes to differ randomly (we found differences of more
than 30 GB), which makes them hard to manage. Of course, disabling THP
is one way to solve it. Is there another way?



* Re: [BUG] The usage of memory cgroup is not consistent with processes when using THP
@ 2021-09-28 22:14       ` Yang Shi
  0 siblings, 0 replies; 10+ messages in thread
From: Yang Shi @ 2021-09-28 22:14 UTC (permalink / raw)
  To: 台运方
  Cc: Johannes Weiner, Hugh Dickins, Tejun Heo, Cgroups, Linux MM

On Tue, Sep 28, 2021 at 12:15 AM 台运方 <yunfangtai09@gmail.com> wrote:
>
> Yang Shi <shy828301@gmail.com> wrote on Tue, Sep 28, 2021 at 1:28 AM:
> > IMHO I don't think this is a bug. The disparity reflects the
> > difference in how the page's life cycle is viewed by the process and
> > by the cgroup. The usage of a process comes from the rss_counter of
> > its mm, which tracks per-process mapped memory, so it is updated as
> > soon as the page is zapped.
> >
> > But from the cgroup's point of view, the page is charged when it is
> > allocated and uncharged when it is freed. The page may be zapped by
> > one process, but other users may still pin the page and prevent it
> > from being freed. The pin may be very transient or indefinite, and
> > THP is one such pin: it is gone once the THP is split, but due to
> > the deferred split that may happen long after the page is zapped.
> Thank you for your reply. I agree that it reflects the difference
> between the process and cgroup views. The memory usage of a cgroup is
> usually used to indicate the memory usage of a container, e.g. to
> avoid OOM. The disparity causes the memory usage of containers running
> the same processes to differ randomly (we found differences of more
> than 30 GB), which makes them hard to manage. Of course, disabling THP
> is one way to solve it. Is there another way?

I don't quite get what exactly you want to manage. If you want to get
rid of the disparity, I don't have a good idea other than splitting
the THP in place instead of using the deferred split. But AFAIK that
is not quite feasible due to some locking problems.



* Re: [BUG] The usage of memory cgroup is not consistent with processes when using THP
@ 2021-09-29  3:25         ` Yunfang Tai
  0 siblings, 0 replies; 10+ messages in thread
From: Yunfang Tai @ 2021-09-29  3:25 UTC (permalink / raw)
  To: Yang Shi; +Cc: Johannes Weiner, Hugh Dickins, Tejun Heo, Cgroups, Linux MM

Yang Shi <shy828301@gmail.com> wrote on Wed, Sep 29, 2021 at 6:14 AM:
> I don't quite get what exactly you want to manage. If you want to get
> rid of the disparity, I don't have a good idea other than splitting
> the THP in place instead of using the deferred split. But AFAIK that
> is not quite feasible due to some locking problems.
Got it. Thank you for your reply!



end of thread, other threads:[~2021-09-29  3:26 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-26  7:35 [BUG] The usage of memory cgroup is not consistent with processes when using THP 台运方
2021-09-27 17:28 ` Yang Shi
2021-09-28  7:15   ` 台运方
2021-09-28 22:14     ` Yang Shi
2021-09-29  3:25       ` Yunfang Tai
