* [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
@ 2023-10-03 17:40 T.J. Mercier
  2023-10-03 18:01 ` T.J. Mercier
  0 siblings, 1 reply; 15+ messages in thread
From: T.J. Mercier @ 2023-10-03 17:40 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný, cgroups; +Cc: Suren Baghdasaryan

Hi all,

Samsung reported an Android bug where over 1000 cgroups were left
empty without being removed. These cgroups should have been removed
after no processes were observed to be remaining in the cgroup by this
code [1], which loops until cgroup.procs is empty and then attempts to
rmdir the cgroup. That works most of the time, but occasionally the
rmdir fails with EBUSY *after cgroup.procs is empty*, which seems
wrong. No controllers are enabled in this cgroup v2 hierarchy; it's
currently used only for freezer. I spoke with Suren about this, who
recalled a similar problem and fix [2], but all the kernels I've
tested contain this fix. I have been able to reproduce this on 5.10,
5.15, 6.1, and 6.3 on various hardware. I've written a reproducer
(below) which typically hits the issue in under a minute.

The trace events look like this when the problem occurs. I'm guessing
the rmdir is attempted in that window between signal_deliver and
cgroup_notify_populated = 0.

           a.out-3786312 [120] ..... 629614.073808: task_newtask: pid=3786313 comm=a.out clone_flags=1200000 oom_score_adj=200
           a.out-3786312 [120] ..... 629614.073832: sched_process_fork: comm=a.out pid=3786312 child_comm=a.out child_pid=3786313
           a.out-3786313 [028] d..2. 629614.074712: cgroup_notify_populated: root=0 id=240416 level=1 path=/B val=1
           a.out-3786313 [028] dN.1. 629614.074742: cgroup_attach_task: dst_root=0 dst_id=240416 dst_level=1 dst_path=/B pid=3786313 comm=a.out
           a.out-3786312 [120] d..1. 629614.302764: signal_generate: sig=9 errno=0 code=0 comm=a.out pid=3786313 grp=1 res=0
           <...>-3786313 [028] d..1. 629614.302789: signal_deliver: sig=9 errno=0 code=0 sa_handler=0 sa_flags=0
           a.out-3786313 [028] ..... 629614.303007: sched_process_exit: comm=a.out pid=3786313 prio=120
           a.out-3786313 [028] dN.2. 629614.303039: cgroup_notify_populated: root=0 id=240416 level=1 path=/B val=0
           a.out-3786313 [028] dN.2. 629614.303057: signal_generate: sig=17 errno=0 code=2 comm=a.out pid=3786312 grp=1 res=0
          <idle>-0       [120] ..s1. 629614.333591: sched_process_free: comm=a.out pid=3786313 prio=120

However on Android we retry the rmdir for 2 seconds after cgroup.procs
is empty and we're still occasionally hitting the failure. On my
primary phone with 3 days of uptime I see a handful of cases, but the
problem is orders of magnitude worse on Samsung's device.

panther:/ $ find /sys/fs/cgroup/ -path "*/pid_*/cgroup.procs" -exec sh -c 'return $(wc -l $0 | cut -f1 -d" ")' {} \; -print
/sys/fs/cgroup/uid_10133/pid_19665/cgroup.procs
/sys/fs/cgroup/uid_10133/pid_19784/cgroup.procs
/sys/fs/cgroup/uid_10133/pid_13124/cgroup.procs
/sys/fs/cgroup/uid_10133/pid_15176/cgroup.procs
/sys/fs/cgroup/uid_10274/pid_12322/cgroup.procs
/sys/fs/cgroup/uid_1010120/pid_13631/cgroup.procs
/sys/fs/cgroup/uid_10196/pid_27894/cgroup.procs
/sys/fs/cgroup/uid_1010133/pid_16103/cgroup.procs
/sys/fs/cgroup/uid_1010133/pid_29181/cgroup.procs
/sys/fs/cgroup/uid_10129/pid_7940/cgroup.procs
/sys/fs/cgroup/uid_10266/pid_11765/cgroup.procs

Please LMK if there's any more info I can provide.

[1] https://android.googlesource.com/platform/system/core/+/3483798fd9045f93e9095e5b09ffe4f59054c535/libprocessgroup/processgroup.cpp#522
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/cgroup/cgroup.c?id=9c974c77246460fa6a92c18554c3311c8c83c160

// Run with: while sudo ./a.out; do :; done
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <atomic>
#include <cerrno>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <thread>
#include <vector>

static bool cgroupHasProcs(std::filesystem::path cgroup_path, bool print = false) {
    cgroup_path /= "cgroup.procs";
    std::ifstream procs(cgroup_path);

    std::string line;
    bool populated = false;
    while(std::getline(procs, line)) {
        populated = true;
        if (print)
            std::cout << "Found " << line << " in " << cgroup_path << '\n';
        else
            break;
    }
    return populated;
}

static void SigChldHandler(int /*signal_number*/, siginfo_t* /*info*/, void* /*ucontext*/) {
    pid_t pid;
    int status;
    while((pid = waitpid(-1, &status, WNOHANG)) > 0) {
        if (WIFEXITED(status))
            std::cout << "Process " << pid << " exited cleanly ("
                      << WEXITSTATUS(status) << ")\n";
        else if (WIFSIGNALED(status))
            std::cout << "Process " << pid << " exited due to signal " << WTERMSIG(status)
                      << " (" << strsignal(WTERMSIG(status)) << ")\n";
    }
}

static bool migrateToCgroup(std::filesystem::path cgroup_path) {
    cgroup_path /= "cgroup.procs";
    std::ofstream procs(cgroup_path);
    procs << getpid();
    procs.flush();
    return static_cast<bool>(procs);
}

static std::atomic<bool> noProcs = false;
static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/A";
static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/B";

static void readProcs() {
    while (cgroupHasProcs(CG_B_PATH))
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    noProcs = true;
}

int main()
{
    struct sigaction sig_chld = {};
    sig_chld.sa_sigaction = SigChldHandler;
    sig_chld.sa_flags = SA_SIGINFO;

    if (sigaction(SIGCHLD, &sig_chld, nullptr) < 0) {
        std::cerr << "Error setting SIGCHLD handler: "
                  << std::strerror(errno) << std::endl;
        return EXIT_FAILURE;
    }

    if (std::error_code ec; !std::filesystem::create_directory(CG_A_PATH, ec) && ec) {
        std::cerr << "Error creating cgroups (Are you root?): "
                  << ec.message() << std::endl;
        return EXIT_FAILURE;
    }

    if (std::error_code ec; !std::filesystem::create_directory(CG_B_PATH, ec) && ec) {
        std::cerr << "Error creating cgroups (Are you root?): "
                  << ec.message() << std::endl;
        return EXIT_FAILURE;
    }

    if (!migrateToCgroup(CG_A_PATH)) {
        std::cerr << "Failed to migrate to " << CG_A_PATH << " (Are you root?): "
                  << std::strerror(errno) << std::endl;
        return EXIT_FAILURE;
    }

    const pid_t pid = fork();
    if (pid == -1) {
        std::cerr << "Failed to fork child process: " <<
std::strerror(errno) << std::endl;
        return EXIT_FAILURE;
    } else if (pid == 0) {
        migrateToCgroup(CG_B_PATH);
        pause();
        return EXIT_SUCCESS;
    } else {
        int ret = EXIT_SUCCESS;

        std::vector<std::thread> threads(2*std::thread::hardware_concurrency());
        for (std::thread &t : threads)
            t = std::thread(readProcs);

        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        kill(pid, SIGKILL);

        while(!noProcs)
            std::this_thread::sleep_for(std::chrono::milliseconds(1));

        if(std::error_code ec; !std::filesystem::remove(CG_B_PATH, ec)) {
            std::cerr << "Cannot remove " << CG_B_PATH << ": " <<
ec.message() << std::endl;
            ret = EXIT_FAILURE;
        }

        for(std::thread &t : threads)
            if(t.joinable())
                t.join();
        return ret;
    }
}


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-03 17:40 [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty T.J. Mercier
@ 2023-10-03 18:01 ` T.J. Mercier
  2023-10-04 18:54   ` Tejun Heo
  2023-10-06 16:58   ` Michal Koutný
  0 siblings, 2 replies; 15+ messages in thread
From: T.J. Mercier @ 2023-10-03 18:01 UTC (permalink / raw)
  To: Tejun Heo, Michal Koutný, cgroups, Zefan Li, Johannes Weiner
  Cc: Suren Baghdasaryan

On Tue, Oct 3, 2023 at 10:40 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> [...]

+ Zefan and Johannes who I missed


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-03 18:01 ` T.J. Mercier
@ 2023-10-04 18:54   ` Tejun Heo
  2023-10-06 16:58   ` Michal Koutný
  1 sibling, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2023-10-04 18:54 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Michal Koutný,
	cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Tue, Oct 03, 2023 at 11:01:46AM -0700, T.J. Mercier wrote:
> On Tue, Oct 3, 2023 at 10:40 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > Hi all,
> >
> > Samsung reported an Android bug where over 1000 cgroups were left
> > empty without being removed. These cgroups should have been removed
> > after no processes were observed to be remaining in the cgroup by this
> > code [1], which loops until cgroup.procs is empty and then attempts to
> > rmdir the cgroup. That works most of the time, but occasionally the
> > rmdir fails with EBUSY *after cgroup.procs is empty*, which seems
> > wrong. No controllers are enabled in this cgroup v2 hierarchy; it's
> > currently used only for freezer. I spoke with Suren about this, who
> > recalled a similar problem and fix [2], but all the kernels I've
> > tested contain this fix. I have been able to reproduce this on 5.10,
> > 5.15, 6.1, and 6.3 on various hardware. I've written a reproducer
> > (below) which typically hits the issue in under a minute.
> >
> > The trace events look like this when the problem occurs. I'm guessing
> > the rmdir is attempted in that window between signal_deliver and
> > cgroup_notify_populated = 0.

So, the recommendation is to always trigger cleanup on the !populated
notification. That said, I don't immediately see why your reproducer doesn't
work. I'll dig into it later.

Thanks.

-- 
tejun


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-03 18:01 ` T.J. Mercier
  2023-10-04 18:54   ` Tejun Heo
@ 2023-10-06 16:58   ` Michal Koutný
  2023-10-06 18:37     ` T.J. Mercier
  1 sibling, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2023-10-06 16:58 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

Hello T.J.

A curious case.

I was staring at the code, and any explanation that occurred to me would imply
css_set_lock doesn't work.

OTOH, I can bring the reproducer to rmdir()=-EBUSY on my machine
(6.4.12-1-default) [1].

I notice that there are 2*nr_cpus parallel readers of cgroup.procs,
and a single thread's observation is enough to consider the cgroup empty.
Could it be that, despite the 200ms delay, some of the threads still see
the cgroup as empty, i.e. before the child has migrated into it?
(I didn't do my own tracing, but by reducing the delay I could reduce the
time before EBUSY was hit; otherwise it took several minutes (on top of a
desktop background workload).)


On Tue, Oct 03, 2023 at 11:01:46AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
...
> > The trace events look like this when the problem occurs. I'm guessing
> > the rmdir is attempted in that window between signal_deliver and
> > cgroup_notify_populated = 0.

But rmdir() happens after empty cgroup.procs was spotted, right?
(That's why it is curious.)

> > However on Android we retry the rmdir for 2 seconds after cgroup.procs
> > is empty and we're still occasionally hitting the failure. On my
> > primary phone with 3 days of uptime I see a handful of cases, but the
> > problem is orders of magnitude worse on Samsung's device.

Would there also be short-lived members of cgroups and reading
cgroup.procs under load?


Thanks,
Michal

[1] FTR, a hunk to run it without sudo on a modern desktop:
-static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/A";
-static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/B";
+static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/a";
+static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/b";



* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-06 16:58   ` Michal Koutný
@ 2023-10-06 18:37     ` T.J. Mercier
  2023-10-10 16:31       ` Michal Koutný
  0 siblings, 1 reply; 15+ messages in thread
From: T.J. Mercier @ 2023-10-06 18:37 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Fri, Oct 6, 2023 at 9:58 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello T.J.
>
> A curious case.
>
> I was staring at the code, and any explanation that occurred to me would imply
> css_set_lock doesn't work.
>
> OTOH, I can bring the reproducer to rmdir()=-EBUSY on my machine
> (6.4.12-1-default) [1].
>
> I notice that there are 2*nr_cpus parallel readers of cgroup.procs,
> and a single thread's observation is enough to consider the cgroup empty.
> Could it be that, despite the 200ms delay, some of the threads still see
> the cgroup as empty, i.e. before the child has migrated into it?
> (I didn't do my own tracing, but by reducing the delay I could reduce the
> time before EBUSY was hit; otherwise it took several minutes (on top of a
> desktop background workload).)
>
Hm yes, it's possible a thread runs before the child migrates and sets
noProcs = true too early. I added a loop to wait for B to be populated
before running the threads, and now I can't reproduce it. :\
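
Roughly this, in the parent branch of main() before the reader threads are
started (a sketch rather than the exact diff I ran; cgroupHasProcs and
CG_B_PATH are the helpers from the reproducer above):

    // Wait until the child has actually migrated into B, so an early
    // empty read of cgroup.procs can't set noProcs prematurely.
    while (!cgroupHasProcs(CG_B_PATH))
        std::this_thread::sleep_for(std::chrono::milliseconds(1));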

> On Tue, Oct 03, 2023 at 11:01:46AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> ...
> > > The trace events look like this when the problem occurs. I'm guessing
> > > the rmdir is attempted in that window between signal_deliver and
> > > cgroup_notify_populated = 0.
>
> But rmdir() happens after empty cgroup.procs was spotted, right?
> (That's why it is curious.)
>
Right, we read cgroup.procs to find which processes to kill. Kill them
all, wait until cgroup.procs is empty, and then attempt to rmdir.
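
Schematically it's something like this (a simplified sketch of [1] from the
first mail, not the actual code; readPids is a hypothetical helper that
returns one pid per line of cgroup.procs, and the path is just an example):

    const std::filesystem::path cg = "/sys/fs/cgroup/uid_X/pid_Y";
    for (pid_t pid : readPids(cg / "cgroup.procs"))
        kill(pid, SIGKILL);
    while (cgroupHasProcs(cg))       // loop until cgroup.procs reads empty
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
    rmdir(cg.c_str());               // the call that sometimes fails with EBUSY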

I will try changing the cgroup_rmdir trace event to always fire
instead of only when the rmdir succeeds. That way I can get a more
complete timeline.

> > > However on Android we retry the rmdir for 2 seconds after cgroup.procs
> > > is empty and we're still occasionally hitting the failure. On my
> > > primary phone with 3 days of uptime I see a handful of cases, but the
> > > problem is orders of magnitude worse on Samsung's device.
>
> Would there also be short-lived members of cgroups and reading
> cgroup.procs under load?
>
I think the only short-lived members should be due to launch failures
/ crashes. Reading cgroup.procs does frequently happen under load. One
scenario that comes to mind is under memory pressure where LMKD hunts
for apps to kill (after which their cgroups are removed), while
reclaim and compaction are also occurring.

I suppose it's also possible there is PID reuse by the same app,
causing the cgroup to become repopulated at the same time as a kill,
but that seems extremely unlikely. Plus, at the point where these
kills are occurring we shouldn't normally be simultaneously launching
new processes for the app. Similarly if a process forks right before
it is killed, maybe it doesn't show up in cgroup.procs until after
we've observed it to be empty?

I will investigate some more on a phone where I'm seeing this since my
reproducer isn't doing the trick.

Thanks for taking a look Michal.

>
> Thanks,
> Michal
>
> [1] FTR, a hunk to run it without sudo on a modern desktop:
> -static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/A";
> -static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/B";
> +static const std::filesystem::path CG_A_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/a";
> +static const std::filesystem::path CG_B_PATH = "/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/app.slice/b";
>


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-06 18:37     ` T.J. Mercier
@ 2023-10-10 16:31       ` Michal Koutný
  2023-10-10 17:14         ` T.J. Mercier
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2023-10-10 16:31 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> I suppose it's also possible there is PID reuse by the same app,
> causing the cgroup to become repopulated at the same time as a kill,
> but that seems extremely unlikely. Plus, at the point where these
> kills are occurring we shouldn't normally be simultaneously launching
> new processes for the app. Similarly if a process forks right before
> it is killed, maybe it doesn't show up in cgroup.procs until after
> we've observed it to be empty?

Something like this:

							kill (before)
cgroup_fork
cgroup_can_fork .. begin(threadgroup_rwsem)
tasklist_lock
fatal_signal_pending -> cgroup_cancel_fork		kill (mid)
tasklist_unlock
							seq_start,
							seq_next...

cgroup_post_fork  .. end(threadgroup_rwsem)		
							kill (after)

Only the third option `kill (after)` means the child would end up on the
css_set list. But that would mean the reader squeezed before
cgroup_post_fork() would still see the parent.
(I.e. I don't see how the kill/fork race could skew the listed procs.)

(But it reminds me of another pathological case of "group leader
 separation" where:
- there is a multithreaded process,
- threadgroup leader exits,
- threadgroup is migrated from A to B (write to cgroup.procs)
  - but leader stays in A (because it has PF_EXITING),
- A will still show it in cgroup.procs,
- B will not include it in cgroup.procs despite it hosts some threads of
  the threadgroup (effectively populated).
It's been some time, I didn't check if it's still possible nowadays.)

BTW is there any fundamental reason the apps cannot use the
notifications via cgroup.events as recommended by Tejun?

Thanks,
Michal


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-10 16:31       ` Michal Koutný
@ 2023-10-10 17:14         ` T.J. Mercier
  2023-10-10 18:41           ` Tejun Heo
  2023-10-11 23:57           ` T.J. Mercier
  0 siblings, 2 replies; 15+ messages in thread
From: T.J. Mercier @ 2023-10-10 17:14 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Tue, Oct 10, 2023 at 9:31 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> > I suppose it's also possible there is PID reuse by the same app,
> > causing the cgroup to become repopulated at the same time as a kill,
> > but that seems extremely unlikely. Plus, at the point where these
> > kills are occurring we shouldn't normally be simultaneously launching
> > new processes for the app. Similarly if a process forks right before
> > it is killed, maybe it doesn't show up in cgroup.procs until after
> > we've observed it to be empty?
>
> Something like this:
>
>                                                         kill (before)
> cgroup_fork
> cgroup_can_fork .. begin(threadgroup_rwsem)
> tasklist_lock
> fatal_signal_pending -> cgroup_cancel_fork              kill (mid)
> tasklist_unlock
>                                                         seq_start,
>                                                         seq_next...
>
> cgroup_post_fork  .. end(threadgroup_rwsem)
>                                                         kill (after)
>
> Only the third option `kill (after)` means the child would end up on the
> css_set list. But that would mean the reader squeezed before
> cgroup_post_fork() would still see the parent.
> (I.e. I don't see how the kill/fork race could skew the listed procs.)
>
So here is a trace from a phone where the kills happen (~100ms) after
the forks. All but one of the children die before we read cgroup.procs
for the first time, and cgroup.procs is not empty. 5ms later we read
again and cgroup.procs is empty, but the last child still hasn't
exited. So it makes sense that the cset from that last child is still
on the list.
https://pastebin.com/raw/tnHhnZBE

> (But it reminds me of another pathological case of "group leader
>  separation" where:
> - there is a multithreaded process,
> - threadgroup leader exits,
> - threadgroup is migrated from A to B (write to cgroup.procs)
>   - but leader stays in A (because it has PF_EXITING),
> - A will still show it in cgroup.procs,
> - B will not include it in cgroup.procs despite it hosts some threads of
>   the threadgroup (effectively populated).
> It's been some time, I didn't check if it's still possible nowadays.)
>
I don't think this is what's happening because these processes only
get migrated once in their lifetimes (from the zygote cgroup to their
per-application cgroup), and the user/app code doesn't run until after
the migration completes, which is the first time a thread could be
created.

> BTW is there any fundamental reason the apps cannot use the
> notifications via cgroup.events as recommended by Tejun?
>
This would require that we read both cgroup.procs and cgroup.events,
since we'd still want to know which processes to signal. I assumed
this would increase lock contention but there's no synchronization on
cgroup_is_populated so it looks like not. I had already identified
this as a workaround, but I'd prefer to depend on just one file to do
everything.


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-10 17:14         ` T.J. Mercier
@ 2023-10-10 18:41           ` Tejun Heo
  2023-10-10 20:22             ` T.J. Mercier
  2023-10-11 23:57           ` T.J. Mercier
  1 sibling, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2023-10-10 18:41 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Michal Koutný,
	cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

Hello,

On Tue, Oct 10, 2023 at 10:14:26AM -0700, T.J. Mercier wrote:
> > BTW is there any fundamental reason the apps cannot use the
> > notifications via cgroup.events as recommended by Tejun?
> >
> This would require that we read both cgroup.procs and cgroup.events,
> since we'd still want to know which processes to signal. I assumed
> this would increase lock contention but there's no synchronization on
> cgroup_is_populated so it looks like not. I had already identified
> this as a workaround, but I'd prefer to depend on just one file to do
> everything.

I don't think we can guarantee that. There's a reason why we have
[!]populated events. Maybe we can handle this particular situation better but
there isn't going to be a guarantee that a cgroup is removable if its
cgroup.procs file is seen empty.

Note that cgroup.events file is pollable. You can just poll the file and
then respond to them. I don't understand the part of having to read
cgroup.procs, which btw is a much more expensive operation. You said
"which processes to signal". If this is to kill all processes in the cgroup,
you can use cgroup.kill which sends signal atomically to all member tasks.

It feels like the use case is just trying to do things in an unsupported way
when there's no actual benefit to doing so. Is there something preventing
you guys from doing it how it's supposed to be used?

Thanks.

-- 
tejun


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-10 18:41           ` Tejun Heo
@ 2023-10-10 20:22             ` T.J. Mercier
  0 siblings, 0 replies; 15+ messages in thread
From: T.J. Mercier @ 2023-10-10 20:22 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Michal Koutný,
	cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Tue, Oct 10, 2023 at 11:41 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Tue, Oct 10, 2023 at 10:14:26AM -0700, T.J. Mercier wrote:
> > > BTW is there any fundamental reason the apps cannot use the
> > > notifications via cgroup.events as recommended by Tejun?
> > >
> > This would require that we read both cgroup.procs and cgroup.events,
> > since we'd still want to know which processes to signal. I assumed
> > this would increase lock contention but there's no synchronization on
> > cgroup_is_populated so it looks like not. I had already identified
> > this as a workaround, but I'd prefer to depend on just one file to do
> > everything.
>
> I don't think we can guarantee that. There's a reason why we have
> [!]populated events. Maybe we can handle this particular situation better but
> there isn't going to be a guarantee that a cgroup is removable if its
> cgroup.procs file is seen empty.
>
I understand that there are cases where the cgroup can become
populated after a read of cgroup.procs shows nothing and before a
removal is attempted. But that doesn't seem to be what's happening
here.

> Note that cgroup.events file is pollable. You can just poll the file and
> then respond to them. I don't understand the part of having to read
> cgroup.procs, which btw is a much more expensive operation. You said
> "which processes to signal". If this is to kill all processes in the cgroup,
> you can use cgroup.kill which sends signal atomically to all member tasks.
>
That's coming, but we still need to support kills without cgroup.kill
for kernels before it was introduced. There are some non-SIGKILL use
cases that code supports, so that's what I meant when I said, "which
processes to signal". I guess we could separate these two paths so
that one uses cgroup.kill and blocks until populated = 0, and the
other just reads cgroup.procs to generate the signals and then
returns. But it would be nice to have cgroup.procs just show the pids
until after cgroup_exit when the condition for removal is satisfied.
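
For the cgroup.kill path above, an untested sketch of what I have in mind
(the path handling is simplified, cgroup.kill only exists on 5.14+, and the
wait uses the populated notification from cgroup.events via poll(2)/POLLPRI):

    #include <fcntl.h>
    #include <poll.h>
    #include <unistd.h>
    #include <cstring>
    #include <string>

    static bool killAndWaitUnpopulated(const std::string& cg) {
        // Kill every member atomically where the kernel supports it;
        // older kernels would fall back to the per-pid SIGKILL path.
        int kfd = open((cg + "/cgroup.kill").c_str(), O_WRONLY);
        if (kfd >= 0) {
            write(kfd, "1", 1);
            close(kfd);
        }

        int efd = open((cg + "/cgroup.events").c_str(), O_RDONLY);
        if (efd < 0)
            return false;

        char buf[128];
        for (;;) {
            ssize_t n = pread(efd, buf, sizeof(buf) - 1, 0);
            if (n <= 0)
                break;
            buf[n] = '\0';
            if (strstr(buf, "populated 0"))   // removable now (barring new forks/migrations)
                break;
            struct pollfd pfd = { efd, POLLPRI, 0 };
            poll(&pfd, 1, 1000);              // cgroup.events raises POLLPRI on change
        }
        close(efd);
        return true;
    }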

> It feels like the use case is just trying to do things in an unsupported way
> when there's no actual benefit to doing so. Is there something preventing
> you guys from doing it how it's supposed to be used?
>
Isn't this avoiding the problem? The docs say, "A cgroup which doesn't
have any children or live processes can be destroyed by removing the
directory." If it has live processes they should show up in
cgroup.procs. FWIW that's also the language used to describe how the
populated notification works, so the two shouldn't report different
things. (Assuming they could actually be read simultaneously.)


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-10 17:14         ` T.J. Mercier
  2023-10-10 18:41           ` Tejun Heo
@ 2023-10-11 23:57           ` T.J. Mercier
  2023-10-24 23:10             ` T.J. Mercier
  1 sibling, 1 reply; 15+ messages in thread
From: T.J. Mercier @ 2023-10-11 23:57 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Tue, Oct 10, 2023 at 10:14 AM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 9:31 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> > > I suppose it's also possible there is PID reuse by the same app,
> > > causing the cgroup to become repopulated at the same time as a kill,
> > > but that seems extremely unlikely. Plus, at the point where these
> > > kills are occurring we shouldn't normally be simultaneously launching
> > > new processes for the app. Similarly if a process forks right before
> > > it is killed, maybe it doesn't show up in cgroup.procs until after
> > > we've observed it to be empty?
> >
> > Something like this:
> >
> >                                                         kill (before)
> > cgroup_fork
> > cgroup_can_fork .. begin(threadgroup_rwsem)
> > tasklist_lock
> > fatal_signal_pending -> cgroup_cancel_fork              kill (mid)
> > tasklist_unlock
> >                                                         seq_start,
> >                                                         seq_next...
> >
> > cgroup_post_fork  .. end(threadgroup_rwsem)
> >                                                         kill (after)
> >
> > Only the third option `kill (after)` means the child would end up on the
> > css_set list. But that would mean the reader squeezed before
> > cgroup_post_fork() would still see the parent.
> > (I.e. I don't see how the kill/fork race could skew the listed procs.)
> >
> So here is a trace from a phone where the kills happen (~100ms) after
> the forks. All but one of the children die before we read cgroup.procs
> for the first time, and cgroup.procs is not empty. 5ms later we read
> again and cgroup.procs is empty, but the last child still hasn't
> exited. So it makes sense that the cset from that last child is still
> on the list.
> https://pastebin.com/raw/tnHhnZBE
>
Collected a bit more info. It's before exit_mm that the process
disappears from cgroup.procs, but the delay to populated=0 seems to be
exacerbated by CPU contention during this time. What's weird is that I
can see the task blocking the rmdir on cgrp->cset_links->cset->tasks
inside of cgroup_destroy_locked when the rmdir is attempted, so I
don't understand why it doesn't show up when iterating tasks for
cgroup.procs.

I'm going to be out until next Wednesday when I'll look some more.


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-11 23:57           ` T.J. Mercier
@ 2023-10-24 23:10             ` T.J. Mercier
  2023-10-25 13:30               ` Michal Koutný
  0 siblings, 1 reply; 15+ messages in thread
From: T.J. Mercier @ 2023-10-24 23:10 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Wed, Oct 11, 2023 at 4:57 PM T.J. Mercier <tjmercier@google.com> wrote:
>
> On Tue, Oct 10, 2023 at 10:14 AM T.J. Mercier <tjmercier@google.com> wrote:
> >
> > On Tue, Oct 10, 2023 at 9:31 AM Michal Koutný <mkoutny@suse.com> wrote:
> > >
> > > On Fri, Oct 06, 2023 at 11:37:19AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> > > > I suppose it's also possible there is PID reuse by the same app,
> > > > causing the cgroup to become repopulated at the same time as a kill,
> > > > but that seems extremely unlikely. Plus, at the point where these
> > > > kills are occurring we shouldn't normally be simultaneously launching
> > > > new processes for the app. Similarly if a process forks right before
> > > > it is killed, maybe it doesn't show up in cgroup.procs until after
> > > > we've observed it to be empty?
> > >
> > > Something like this:
> > >
> > >                                                         kill (before)
> > > cgroup_fork
> > > cgroup_can_fork .. begin(threadgroup_rwsem)
> > > tasklist_lock
> > > fatal_signal_pending -> cgroup_cancel_fork              kill (mid)
> > > tasklist_unlock
> > >                                                         seq_start,
> > >                                                         seq_next...
> > >
> > > cgroup_post_fork  .. end(threadgroup_rwsem)
> > >                                                         kill (after)
> > >
> > > Only the third option `kill (after)` means the child would end up on the
> > > css_set list. But that would mean the reader squeezed before
> > > cgroup_post_fork() would still see the parent.
> > > (I.e. I don't see how the kill/fork race could skew the listed procs.)
> > >
> > So here is a trace from a phone where the kills happen (~100ms) after
> > the forks. All but one of the children die before we read cgroup.procs
> > for the first time, and cgroup.procs is not empty. 5ms later we read
> > again and cgroup.procs is empty, but the last child still hasn't
> > exited. So it makes sense that the cset from that last child is still
> > on the list.
> > https://pastebin.com/raw/tnHhnZBE
> >
> Collected a bit more info. It's before exit_mm that the process
> disappears from cgroup.procs, but the delay to populated=0 seems to be
> exacerbated by CPU contention during this time. What's weird is that I
> can see the task blocking the rmdir on cgrp->cset_links->cset->tasks
> inside of cgroup_destroy_locked when the rmdir is attempted, so I
> don't understand why it doesn't show up when iterating tasks for
> cgroup.procs.
>
> I'm going to be out until next Wednesday when I'll look some more.

Back on this and pretty sure I discovered what's happening. For
processes with multiple threads where each thread has reached
atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
over the process with not-yet-fully-exited thread group members
because the read of task->signal->live evaluates to 0 here in
css_task_iter_advance:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869

But the cgroup is not removable yet because cgroup_exit hasn't been
called for all tasks.

Since all tasks have been signaled in this case and we're just waiting
for the exits to complete, I think it should be possible to turn the
cgroup into a zombie on rmdir with the current behavior of
cgroup.procs.

Or if we change cgroup.procs to continue showing the thread group
leader until all threads have finished exiting, we'd still probably
have to change our userspace to accommodate the longer kill times
exceeding our timeouts. So I'm going to change our userspace anyway as
suggested by Tejun. But I'd be interested to hear what folks think
about the potential kernel solutions as well.

Thanks,
T.J.


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-24 23:10             ` T.J. Mercier
@ 2023-10-25 13:30               ` Michal Koutný
  2023-10-25 18:29                 ` T.J. Mercier
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2023-10-25 13:30 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

Hi.

On Tue, Oct 24, 2023 at 04:10:32PM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> Back on this and pretty sure I discovered what's happening. For
> processes with multiple threads where each thread has reached
> atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
> reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
> over the process with not-yet-fully-exited thread group members
> because the read of task->signal->live evaluates to 0 here in
> css_task_iter_advance:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869

Well done! It sounds plausible, the task->signal->live is not synced
via css_set_lock.

> 
> But the cgroup is not removable yet because cgroup_exit hasn't been
> called for all tasks.
> 
> Since all tasks have been signaled in this case and we're just waiting
> for the exits to complete, I think it should be possible to turn the
> cgroup into a zombie on rmdir with the current behavior of
> cgroup.procs.

In this case it could be removed but it would make the check in
cgroup_destroy_locked() way too complicated (if I understand your idea).

> 
> Or if we change cgroup.procs to continue showing the thread group
> leader until all threads have finished exiting, we'd still probably
> have to change our userspace to accommodate the longer kill times
> exceeding our timeouts.

Provided this is the cause, you could get this more (timewise) precise
info from cgroup.threads already? (PR [1] has a reproducer and its fix
describes exactly opposite listings (confusing) but I think that fix
actually works because it checks cgroup.threads additionally.)

> So I'm going to change our userspace anyway as suggested by Tejun. But
> I'd be interested to hear what folks think about the potential kernel
> solutions as well.

Despite that, I'd stick with the notifications since they rely on
proper synchronization of cgroup-info.

HTT,
Michal

[1] https://github.com/systemd/systemd/pull/23561


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-25 13:30               ` Michal Koutný
@ 2023-10-25 18:29                 ` T.J. Mercier
  2023-11-01 15:29                   ` Michal Koutný
  0 siblings, 1 reply; 15+ messages in thread
From: T.J. Mercier @ 2023-10-25 18:29 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Wed, Oct 25, 2023 at 6:30 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hi.
>
> On Tue, Oct 24, 2023 at 04:10:32PM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> > Back on this and pretty sure I discovered what's happening. For
> > processes with multiple threads where each thread has reached
> > atomic_dec_and_test(&tsk->signal->live) in do_exit (but not all have
> > reached cgroup_exit yet), subsequent reads of cgroup.procs will skip
> > over the process with not-yet-fully-exited thread group members
> > because the read of task->signal->live evaluates to 0 here in
> > css_task_iter_advance:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c?h=v6.5#n4869
>
> Well done! It sounds plausible, the task->signal->live is not synced
> via css_set_lock.
>
> >
> > But the cgroup is not removable yet because cgroup_exit hasn't been
> > called for all tasks.
> >
> > Since all tasks have been signaled in this case and we're just waiting
> > for the exits to complete, I think it should be possible to turn the
> > cgroup into a zombie on rmdir with the current behavior of
> > cgroup.procs.
>
> In this case it could be removed but it would make the check in
> cgroup_destroy_locked() way too complicated (if I understand your idea).
>
I was thinking to remove it from sysfs and prevent migrations / forks,
but keep the cgroup and csets around in a dead state until all tasks
complete their exits. Similar to how with memcg any pages charged to
the memcg keep it alive in the background even after a successful
rmdir. Except in this case we're guaranteed to make progress towards
releasing the cgroup (without any dependency on reclaim or charge
transfer) because all tasks have already begun the process of exiting.
We're just waiting on the scheduler to give enough time to the tasks.

The cgroup_is_populated check in cgroup_destroy_locked is what's
currently blocking the removal, and in the case where
nr_populated_csets is not 0 I think we'd need to iterate through all
csets and ensure that each task has been signaled for a SIGKILL. Or
just ensure there are only dying tasks and the thread group leader has
0 for task->signal->live since that's when cgroup.procs stops showing
the process?

> >
> > Or if we change cgroup.procs to continue showing the thread group
> > leader until all threads have finished exiting, we'd still probably
> > have to change our userspace to accommodate the longer kill times
> > exceeding our timeouts.
>
> Provided this is the cause, you could get this more (timewise) precise
> info from cgroup.threads already? (PR [1] has a reproducer and its fix
> describes exactly opposite listings (confusing) but I think that fix
> actually works because it checks cgroup.threads additionally.)
>
Yes, I just tried this out and if we check both cgroup.procs and
cgroup.threads then we wait long enough to be sure that we can rmdir
successfully. Unfortunately that duration exceeds our current timeouts
in the environment where I can reproduce this, so we eventually give
up waiting and don't actually attempt the rmdir. But I'll fix that
with the change to use the populated notification.
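
In terms of the reproducer above, the check looks roughly like this sketch,
where cgroupHasThreads is a hypothetical twin of cgroupHasProcs that reads
cgroup.threads instead of cgroup.procs:

    // Only treat the cgroup as removable once both listings are empty.
    while (cgroupHasProcs(CG_B_PATH) || cgroupHasThreads(CG_B_PATH))
        std::this_thread::sleep_for(std::chrono::milliseconds(5));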

> > So I'm going to change our userspace anyway as suggested by Tejun. But
> > I'd be interested to hear what folks think about the potential kernel
> > solutions as well.
>
> Despite that, I'd stick with the notifications since they rely on
> proper synchronization of cgroup-info.
>
> HTT,
> Michal
>
> [1] https://github.com/systemd/systemd/pull/23561

Interesting case, and in the same part of the code. If one of the exit
functions takes a long time in the leader I could see how this might
happen, but I think a lot of those (mm for example) should be shared
among the group members so not sure exactly what would be the cause.
Maybe it's just that all the threads get scheduled before the leader.

Thanks!
-T.J.


* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-10-25 18:29                 ` T.J. Mercier
@ 2023-11-01 15:29                   ` Michal Koutný
  2023-11-01 17:32                     ` T.J. Mercier
  0 siblings, 1 reply; 15+ messages in thread
From: Michal Koutný @ 2023-11-01 15:29 UTC (permalink / raw)
  To: T.J. Mercier
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Wed, Oct 25, 2023 at 11:29:56AM -0700, "T.J. Mercier" <tjmercier@google.com> wrote:
> The cgroup_is_populated check in cgroup_destroy_locked is what's
> currently blocking the removal, and in the case where
> nr_populated_csets is not 0 I think we'd need to iterate through all
> csets and ensure that each task has been signaled for a SIGKILL.

> Or just ensure there are only dying tasks and the thread group leader
> has 0 for task->signal->live since that's when cgroup.procs stops
> showing the process?

Yeah, both of these seem like overly complex checks when most of the time the
"stale" nr_populated_csets is sufficient (and notifications still).

(Tracking nr_leaders would be broken in the case that I called "group
leader separation".)

> Yes, I just tried this out and if we check both cgroup.procs and
> cgroup.threads then we wait long enough to be sure that we can rmdir
> successfully.

Thanks for checking.

> Interesting case, and in the same part of the code. If one of the exit
> functions takes a long time in the leader I could see how this might
> happen, but I think a lot of those (mm for example) should be shared
> among the group members so not sure exactly what would be the cause.

I overlooked it at first (I focused on exit_mm() only), but the
reproducer is explicit about kernel preemption being enabled, which extends
the gap quite a bit.

(Fun fact: I tried moving cgroup_exit() next to task->signal->live
decrement in do_exit() and test_core cgroup selftest still passes.
(Not only) the preemption takes the fun out of it though.)

Michal



* Re: [Bug Report] EBUSY for cgroup rmdir after cgroup.procs empty
  2023-11-01 15:29                   ` Michal Koutný
@ 2023-11-01 17:32                     ` T.J. Mercier
  0 siblings, 0 replies; 15+ messages in thread
From: T.J. Mercier @ 2023-11-01 17:32 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Tejun Heo, cgroups, Zefan Li, Johannes Weiner, Suren Baghdasaryan

On Wed, Nov 1, 2023 at 8:29 AM Michal Koutný <mkoutny@suse.com> wrote:
> Yeah, both of these seem like overly complex checks when most of the time the
> "stale" nr_populated_csets is sufficient (and notifications still).
>
> (Tracking nr_leaders would be broken in the case that I called "group
> leader separation".)
>
Ok understood, thanks.

> I overlooked it at first (I focused on exit_mm() only), but the
> reproducer is explicit about kernel preemption being enabled, which extends
> the gap quite a bit.
>
> (Fun fact: I tried moving cgroup_exit() next to task->signal->live
> decrement in do_exit() and test_core cgroup selftest still passes.
> (Not only) the preemption takes the fun out of it though.)

I found this commit which made me think we couldn't move cgroup_exit
up any further, but yeah preemption after the decrement is still a
tiny problem.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/kernel/exit.c?id=0b3fcf178deefd7b64154c2c0760a2c63df0b74f
