From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Tue, 14 Apr 2020 09:39:11 +0200
From: Michal Hocko <mhocko@kernel.org>
To: Yafang Shao <laoar.shao@gmail.com>
Cc: akpm@linux-foundation.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH] mm, oom: oom ratelimit auto tuning
Message-ID: <20200414073911.GC4629@dhcp22.suse.cz>
References: <1586597774-6831-1-git-send-email-laoar.shao@gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1586597774-6831-1-git-send-email-laoar.shao@gmail.com>
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Sat 11-04-20 05:36:14, Yafang Shao wrote:
> Recently we found an issue where the server is almost unresponsive
> for several minutes when OOM happens. That is caused by a slow serial
> console set with "console=ttyS1,19200". As this serial line is so slow,
> it takes almost 10 seconds to print a full OOM message to it. All
> tasks allocating pages are then blocked, as almost no pages can be
> reclaimed. During that time, the memory pressure stays around 90
> for a long time. If we don't print the OOM messages to this serial
> console, a full OOM message takes less than 1ms and the memory
> pressure is less than 40.

Which part of the oom report takes the most time? I would expect this to
be the dump_tasks part, which can be pretty large when there are a lot of
eligible tasks to kill.
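
For a rough sense of scale, the reported numbers can be turned into a
byte count. This is only a back-of-the-envelope sketch: it assumes 8N1
framing (10 bits on the wire per byte), which the report does not
state, and serial_tx_ms() is a made-up helper name.

```c
/*
 * Estimate how long a serial line needs to transmit 'bytes' bytes.
 * Assumption: 8N1 framing, i.e. 1 start + 8 data + 1 stop bit per byte.
 */
static unsigned long serial_tx_ms(unsigned long bytes, unsigned long baud)
{
	return bytes * 10UL * 1000UL / baud;
}
```

Under that assumption, serial_tx_ms(19200, 19200) is 10000, so a dump
that takes about 10 seconds at 19200 baud is on the order of 19 kB of
text.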
 
> We can avoid printing OOM messages to the slow serial console by
> adjusting /proc/sys/kernel/printk, but then no messages at
> KERN_WARNING level can be printed to it either, which may lose some
> useful messages when we want to collect messages from it for
> debugging purposes.

A large part of the oom report is printed with KERN_INFO log level. So
you can reduce a large part of the output while not losing other
potentially important information.
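
As a minimal sketch of the filtering rule this relies on: a message
reaches the console only if its level is numerically lower than
console_loglevel (the first field of /proc/sys/kernel/printk). The
function name reaches_console() is hypothetical and the model ignores
ignore_loglevel and similar overrides.

```c
#include <stdbool.h>

/* Numeric values of the kernel log levels used in this thread. */
enum { LVL_WARNING = 4, LVL_NOTICE = 5, LVL_INFO = 6 };

/*
 * Simplified model of the console filter: lower numbers are more
 * severe, and only levels strictly below console_loglevel are shown.
 */
static bool reaches_console(int msg_level, int console_loglevel)
{
	return msg_level < console_loglevel;
}
```

With console_loglevel set to 6, reaches_console(LVL_INFO, 6) is false
while reaches_console(LVL_WARNING, 6) is still true, so the KERN_INFO
parts of the report are dropped from the console and warnings are not.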

> So it is better to decrease the ratelimit. We could introduce some
> sysctl knobs similar to printk_ratelimit and burst, but that would
> burden the admin. Letting the kernel adjust the ratelimit
> automatically would be a better choice.

No new knobs for ratelimiting. Admin shouldn't really care about these
things. Besides that, I strongly suspect that you would be much better off
disabling /proc/sys/vm/oom_dump_tasks, which would reduce the amount
of output a lot. Or do you really require this information when
debugging oom reports?

> The OOM ratelimit starts with a slow rate, and it will increase slowly
> if the console is fast and decrease rapidly if the console is slow.
> oom_rs.burst will be in [1, 10] and oom_rs.interval will always be
> greater than 5 * HZ.

I am not against increasing the ratelimit timeout. But this patch seems
to be trying to be too clever. Why can't we simply increase the
parameters of the ratelimit? I am also interested in whether this actually
works. AFAIR ratelimit doesn't really work reliably when the ratelimited
operation takes a long time because the internals have no way to see
when the operation finished.
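
To illustrate the concern, here is a rough userspace model of a
window-based ratelimit. It is only loosely modeled on __ratelimit(),
not the kernel code: the names rl_model, rl_allow() and rl_simulate()
are made up, time is a plain tick counter, and the window is assumed
to start when a call is admitted, with no knowledge of when the
guarded operation ends.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for struct ratelimit_state. */
struct rl_model {
	unsigned long interval;	/* window length, in ticks */
	int burst;		/* calls allowed per window */
	unsigned long begin;	/* tick at which the window opened */
	int printed;		/* calls allowed so far in this window */
};

/* Returns true if the caller may proceed at time 'now'. */
static bool rl_allow(struct rl_model *rs, unsigned long now)
{
	if (now - rs->begin > rs->interval) {
		/* Window expired: open a new one at the call time. */
		rs->begin = now;
		rs->printed = 0;
	}
	if (rs->printed < rs->burst) {
		rs->printed++;
		return true;
	}
	return false;
}

/*
 * Issue 'calls' back-to-back requests. An allowed request costs 'cost'
 * ticks, a denied one costs a single tick. Returns how many requests
 * were allowed.
 */
static int rl_simulate(unsigned long interval, int burst,
		       unsigned long cost, int calls)
{
	struct rl_model rs = { .interval = interval, .burst = burst };
	unsigned long now = 0;
	int allowed = 0;

	for (int i = 0; i < calls; i++) {
		if (rl_allow(&rs, now)) {
			allowed++;
			now += cost;	/* the guarded operation runs */
		} else {
			now += 1;	/* a skipped call is nearly free */
		}
	}
	return allowed;
}
```

In this model, rl_simulate(5, 1, 1, 12) lets 2 of 12 requests through,
while rl_simulate(5, 1, 10, 12) lets all 12 through: once the
operation itself outlasts the interval, the window has always expired
by the time the next request arrives, so nothing is ever suppressed.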

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> ---
>  mm/oom_kill.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 48 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dfc357614e56..23dba8ccf313 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -954,8 +954,10 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  {
>  	struct task_struct *victim = oc->chosen;
>  	struct mem_cgroup *oom_group;
> -	static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> -					      DEFAULT_RATELIMIT_BURST);
> +	static DEFINE_RATELIMIT_STATE(oom_rs, 20 * HZ, 1);
> +	int delta;
> +	unsigned long start;
> +	unsigned long end;
>  
>  	/*
>  	 * If the task is already exiting, don't alarm the sysadmin or kill
> @@ -972,8 +974,51 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
>  	}
>  	task_unlock(victim);
>  
> -	if (__ratelimit(&oom_rs))
> +	if (__ratelimit(&oom_rs)) {
> +		start = jiffies;
>  		dump_header(oc, victim);
> +		end = jiffies;
> +		delta = end - start;
> +
> +		/*
> +		 * The OOM messages may be printed to a serial console with
> +		 * very low speed, e.g. console=ttyS1,19200. It will take a
> +		 * long time to print these OOM messages to this console, and
> +		 * processes allocating pages will then all be blocked because
> +		 * they can hardly reclaim pages. That will cause high
> +		 * memory pressure and the system may be unresponsive for a
> +		 * long time.
> +		 * In this case, we should decrease the OOM ratelimit or
> +		 * avoid printing OOM messages to the slow console. But if
> +		 * we avoid printing OOM messages to the slow console, no
> +		 * messages at KERN_WARNING level can be printed to it
> +		 * either, which may lose some useful messages when we
> +		 * want to collect messages from the console for debugging
> +		 * purposes. So it is better to decrease the ratelimit. We
> +		 * could introduce some sysctl knobs similar to
> +		 * printk_ratelimit and burst, but that would burden the
> +		 * admin. Letting the kernel adjust the ratelimit
> +		 * automatically would be a better choice.
> +		 * In the algorithm below, the OOM ratelimit is decreased
> +		 * rapidly if the console is slow and increased
> +		 * slowly if the console is fast. oom_rs.burst
> +		 * will be in [1, 10] and oom_rs.interval will always be
> +		 * greater than 5 * HZ.
> +		 */
> +		if (delta < oom_rs.interval / 10) {
> +			if (oom_rs.interval >= 10 * HZ)
> +				oom_rs.interval /= 2;
> +			else if (oom_rs.interval > 6 * HZ)
> +				oom_rs.interval -= HZ;
> +
> +			if (oom_rs.burst < 10)
> +				oom_rs.burst += 1;
> +		} else if (oom_rs.burst > 1) {
> +			oom_rs.burst = 1;
> +			oom_rs.interval = 4 * delta;
> +		}
> +
> +	}
>  
>  	/*
>  	 * Do we need to kill the entire memory cgroup?
> -- 
> 2.18.2
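
For anyone who wants to step through the proposed tuning outside the
kernel, the adjustment logic from the hunk above can be lifted into a
small userspace function. This is a sketch for experimentation only:
struct oom_rl and oom_rl_tune() are made-up names, and HZ is fixed at
100 purely for the example.

```c
#define HZ 100	/* assumption: tick rate chosen for this example only */

/* Hypothetical stand-in for the two ratelimit fields the patch tunes. */
struct oom_rl {
	int interval;	/* in ticks */
	int burst;
};

/* Same branches as the patch; 'delta' is how long dump_header() took. */
static void oom_rl_tune(struct oom_rl *rs, int delta)
{
	if (delta < rs->interval / 10) {
		/* Fast console: shorten the interval, allow more bursts. */
		if (rs->interval >= 10 * HZ)
			rs->interval /= 2;
		else if (rs->interval > 6 * HZ)
			rs->interval -= HZ;

		if (rs->burst < 10)
			rs->burst += 1;
	} else if (rs->burst > 1) {
		/* Slow console: fall back to one report per 4 * delta. */
		rs->burst = 1;
		rs->interval = 4 * delta;
	}
}
```

Stepping through this model from the initial state (20 * HZ, burst 1):
a slow dump changes nothing because burst is already 1, two fast dumps
bring the interval down to 5 * HZ, and a later dump that takes 1 * HZ
then sets the interval to 4 * HZ, which is below the 5 * HZ bound
stated in the description.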

-- 
Michal Hocko
SUSE Labs