From: Yafang Shao <laoar.shao@gmail.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>, Linux MM <linux-mm@kvack.org>
Subject: Re: [RFC PATCH] mm, oom: oom ratelimit auto tuning
Date: Tue, 14 Apr 2020 20:32:54 +0800
Message-ID: <CALOAHbDv+ZAgmGJP7GFzGcjKBZTPk9kYo63g173Nh+vn00qmwg@mail.gmail.com>
In-Reply-To: <20200414073911.GC4629@dhcp22.suse.cz>

On Tue, Apr 14, 2020 at 3:39 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> > Recently we found an issue where the server is almost unresponsive
> > for several minutes when an OOM happens. It is caused by a slow serial
> > console set with "console=ttyS1,19200". Because this serial line is so
> > slow, it takes almost 10 seconds to print a full OOM message to it.
> > Meanwhile, all tasks allocating pages are blocked, as almost no
> > pages can be reclaimed. During that time, the memory pressure stays
> > around 90. If we don't print the OOM messages to this serial console,
> > a full OOM message takes less than 1ms and the memory pressure is
> > less than 40.
>
> Which part of the oom report takes the most time? I would expect this to
> be the dump_tasks part which can be pretty large when there is a lot of
> eligible tasks to kill.
>

Yes, dump_tasks takes around 6s of the total 10s, show_mem takes
around 2s, and dump_stack takes around 0.8s.
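
As a rough sanity check on these timings, the sketch below converts
between bytes and seconds on a serial line. It assumes 8N1 framing (10
bits on the wire per byte), which is not stated anywhere in this thread,
and the helper names are made up for illustration.

```c
#include <stddef.h>

/* Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop = 10 bits per byte. */
#define BITS_PER_BYTE_8N1 10

/* Seconds needed to push `bytes` bytes through a line running at `baud` bit/s. */
static double serial_tx_seconds(size_t bytes, unsigned int baud)
{
	return (double)bytes * BITS_PER_BYTE_8N1 / (double)baud;
}

/* Number of bytes that fit through the line in `seconds` seconds. */
static size_t serial_bytes_in(double seconds, unsigned int baud)
{
	return (size_t)(seconds * baud / BITS_PER_BYTE_8N1);
}
```

Under that assumption, serial_bytes_in(10.0, 19200) gives 19200 bytes for
the full report and serial_bytes_in(6.0, 19200) gives 11520 bytes for the
dump_tasks part.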

> > We can avoid printing OOM messages to the slow serial console by
> > adjusting /proc/sys/kernel/printk, but then messages with
> > KERN_WARNING level can't be printed to it either, so we may lose some
> > useful messages when we want to collect messages from it for
> > debugging purposes.
>
> A large part of the oom report is printed with KERN_INFO log level. So
> you can reduce a large part of the output while not losing other
> potentially important information.
>

Reducing the KERN_INFO output can save a lot of time, but I am worried
that users sometimes need the full log, and they may complain if they
can't find it.
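
For reference, the console filter being discussed can be modelled as
below. The first value in /proc/sys/kernel/printk is console_loglevel,
and a message reaches the console only if its level is numerically lower
than that value. This is a simplified userspace sketch (it ignores
options such as ignore_loglevel), and reaches_console() is a hypothetical
helper, not a kernel function.

```c
#include <stdbool.h>

/* Numeric kernel log levels, as defined in include/linux/kern_levels.h. */
enum {
	LOGLEVEL_WARNING = 4,
	LOGLEVEL_NOTICE  = 5,
	LOGLEVEL_INFO    = 6,
	LOGLEVEL_DEBUG   = 7,
};

/* A message is shown on the console only if msg_level < console_loglevel. */
static bool reaches_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

With console_loglevel set to 6, the KERN_INFO part of the OOM report is
dropped from the console while KERN_WARNING messages still get through.
With it set to 4, both are dropped.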

> > So it is better to decrease the ratelimit. We could introduce some
> > sysctl knobs similar to printk_ratelimit and burst, but that would
> > burden the admin. Letting the kernel adjust the ratelimit
> > automatically would be a better choice.
>
> No new knobs for ratelimiting. Admin shouldn't really care about these
> things.

Agreed.

[snip]
> Besides that I strongly suspect that you would be much better off
> by disabling /proc/sys/vm/oom_dump_tasks which would reduce the amount
> of output a lot. Or do you really require this information when
> debugging oom reports?
>

Yes, disabling /proc/sys/vm/oom_dump_tasks can save a lot of time.
But I'm not sure whether we can disable it completely, because
disabling it would also prevent the tasks log from being written to
/var/log/messages.

> > The OOM ratelimit starts with a slow rate, and it will increase slowly
> > if the speed of the console is rapid and decrease rapidly if the speed
> > of the console is slow. oom_rs.burst will be in [1, 10] and
> > oom_rs.interval will always be greater than 5 * HZ.
>
> I am not against increasing the ratelimit timeout. But this patch seems
> to be trying to be too clever.  Why cannot we simply increase the
> parameters of the ratelimit?

I was just worried that users may complain if too many
oom_kill_process callbacks are suppressed.
But since OOMs that burst at the same time are always caused by the
same reason, I think one snapshot of the OOM may be enough.
Simply setting oom_rs to {20 * HZ, 1} can resolve this issue.
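
To illustrate what {20 * HZ, 1} means in practice, here is a simplified
userspace model of an interval/burst ratelimit. It is only a sketch of
the idea and not the kernel's implementation: struct rl_model and
rl_allow() are hypothetical names, and HZ = 100 is an assumed tick rate.

```c
#include <stdbool.h>

/* Assumed tick rate for illustration; the real HZ is a kernel build option. */
#define HZ 100UL

/* Simplified model of a ratelimit state; not the kernel's struct. */
struct rl_model {
	unsigned long interval; /* window length in ticks */
	int burst;              /* calls allowed per window */
	int printed;            /* calls allowed so far in the current window */
	int missed;             /* total calls suppressed so far */
	unsigned long begin;    /* tick at which the current window started */
	bool started;           /* whether the first window has been opened */
};

/* Return true if a call at tick `now` is allowed, false if suppressed. */
static bool rl_allow(struct rl_model *rs, unsigned long now)
{
	if (!rs->started) {
		rs->begin = now;
		rs->started = true;
	}
	/* Open a new window once the previous one has fully elapsed. */
	if (now > rs->begin + rs->interval) {
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->burst > rs->printed) {
		rs->printed++;
		return true;
	}
	rs->missed++;
	return false;
}
```

With interval = 20 * HZ and burst = 1, a call at tick 1000 is allowed,
calls at ticks 1500 and 3000 are suppressed, and the next allowed call is
the one at tick 3001, so at most one OOM snapshot is printed per 20
seconds.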

> I am also interested whether this actually
> works. AFAIR ratelimit doesn't really work reliably when the ratelimited
> operation takes a long time because the internals have no way to see
> when the operation finished.
>

I agree that ratelimit() is not very reliable here.
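
A small arithmetic sketch of that concern, assuming the ratelimit window
opens when __ratelimit() is called (that is, before dump_header() runs,
as in the patch below) and not when the dump finishes. The helper name
and HZ = 100 are assumptions made for illustration.

```c
/* Assumed tick rate for illustration; the real HZ is a kernel build option. */
#define HZ 100UL

/*
 * Ticks during which further dumps stay suppressed after one dump has
 * finished, given a window of `interval` ticks that started together
 * with the dump and a dump that took `dump_ticks` ticks.
 */
static unsigned long idle_gap_after_dump(unsigned long interval,
					 unsigned long dump_ticks)
{
	return dump_ticks >= interval ? 0 : interval - dump_ticks;
}
```

With the default 5 * HZ interval and a 10 second dump the gap is 0, so
the next OOM can start printing immediately. With 20 * HZ the console
gets about 10 seconds of quiet time after each dump.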

> >  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 48 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index dfc357614e56..23dba8ccf313 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >  {
> >       struct task_struct *victim = oc->chosen;
> >       struct mem_cgroup *oom_group;
> > -     static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > -                                           DEFAULT_RATELIMIT_BURST);
> > +     static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> > +     int delta;
> > +     unsigned long start;
> > +     unsigned long end;
> >
> >       /*
> >        * If the task is already exiting, don't alarm the sysadmin or kill
> > @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> >       }
> >       task_unlock(victim);
> >
> > -     if (__ratelimit(&oom_rs))
> > +     if (__ratelimit(&oom_rs)) {
> > +             start = jiffies;
> >               dump_header(oc, victim);
> > +             end = jiffies;
> > +             delta = end - start;
> > +
> > +             /*
> > +              * The OOM messages may be printed to a serial with very low
> > +              * speed, e.g. console=ttyS1,19200. It will take long
> > +              * time to print these OOM messages to this serial, and
> > +              * then processes allocating pages will all be blocked due
> > +              * to it can hardly reclaim pages. That will cause high
> > +              * memory pressure and the system may be unresponsive for a
> > +              * long time.
> > +              * In this case, we should decrease the OOM ratelimit or
> > +              * avoid printing OOM messages into the slow serial. But if
> > +              * we avoid printing OOM messages into the slow serial, all
> > +              * messages with KERN_WARNING level can't be printed into
> > +              * it either, so we may lose some useful messages when we
> > +              * want to collect messages from the console for debugging
> > +              * purposes. So it is better to decrease the ratelimit. We
> > +              * can introduce some sysctl knobs similar to
> > +              * printk_ratelimit and burst, but it will burden the
> > +              * admin. Letting the kernel automatically adjust the
> > +              * ratelimit would be a better choice.
> > +              * In the algorithm below, it will decrease the OOM ratelimit
> > +              * rapidly if the console is slow and increase the OOM
> > +              * ratelimit slowly if the console is fast. oom_rs.burst
> > +              * will be in [1, 10] and oom_rs.interval will always
> > +              * be greater than 5 * HZ.
> > +              */
> > +             if (delta < oom_rs.interval / 10) {
> > +                     if (oom_rs.interval >= 10 * HZ)
> > +                             oom_rs.interval /= 2;
> > +                     else if (oom_rs.interval > 6 * HZ)
> > +                             oom_rs.interval -= HZ;
> > +
> > +                     if (oom_rs.burst < 10)
> > +                             oom_rs.burst += 1;
> > +             } else if (oom_rs.burst > 1) {
> > +                     oom_rs.burst = 1;
> > +                     oom_rs.interval = 4 * delta;
> > +             }
> > +
> > +     }
> >
> >       /*
> >        * Do we need to kill the entire memory cgroup?
> > --
> > 2.18.2
>
> --
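
For anyone who wants to experiment with the adjustment logic from the
quoted hunk outside the kernel, it can be replayed in userspace roughly
as below. This is a sketch under assumptions: struct oom_rl and
oom_rl_tune() are hypothetical names, HZ = 100 is an assumed tick rate,
and delta is passed in directly, not measured with jiffies.

```c
/* Assumed tick rate for illustration; the real HZ is a kernel build option. */
#define HZ 100

/* Only the two ratelimit fields that the quoted hunk adjusts. */
struct oom_rl {
	int interval; /* window length in ticks */
	int burst;    /* dumps allowed per window */
};

/* Replay of the adjustment step; `delta` is how long dump_header() took. */
static void oom_rl_tune(struct oom_rl *rs, int delta)
{
	if (delta < rs->interval / 10) {
		/* Fast console: shorten the window and allow more dumps. */
		if (rs->interval >= 10 * HZ)
			rs->interval /= 2;
		else if (rs->interval > 6 * HZ)
			rs->interval -= HZ;

		if (rs->burst < 10)
			rs->burst += 1;
	} else if (rs->burst > 1) {
		/* Slow console: back off to one dump per 4 * delta. */
		rs->burst = 1;
		rs->interval = 4 * delta;
	}
}
```

Starting from {20 * HZ, 1}, three fast dumps (delta = 1) move the state
to {10 * HZ, 2}, {5 * HZ, 3} and {5 * HZ, 4}, and a following 10 second
dump resets it to {40 * HZ, 1}. Note that in this replay a slow dump
taken from the initial {20 * HZ, 1} state leaves both fields unchanged,
because the second branch requires burst > 1.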


Thanks
Yafang


Thread overview: 9+ messages
2020-04-11  9:36 [RFC PATCH] mm, oom: oom ratelimit auto tuning Yafang Shao
2020-04-14  7:39 ` Michal Hocko
2020-04-14 12:32   ` Yafang Shao [this message]
2020-04-14 14:32     ` Michal Hocko
2020-04-14 14:58       ` Yafang Shao
2020-04-15  5:58         ` Tetsuo Handa
2020-04-17 11:57           ` Yafang Shao
2020-04-17 13:03             ` Tetsuo Handa
2020-04-17 13:55               ` Yafang Shao
