From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Baoquan He
Cc: Andrew Morton, andreyknvl@google.com, christian.brauner@ubuntu.com,
 colin.king@canonical.com, corbet@lwn.net, dyoung@redhat.com,
 frederic@kernel.org, gpiccoli@canonical.com, john.p.donnelly@oracle.com,
 jpoimboe@redhat.com, keescook@chromium.org, linux-mm@kvack.org,
 masahiroy@kernel.org, mchehab+huawei@kernel.org, mike.kravetz@oracle.com,
 mingo@kernel.org, mm-commits@vger.kernel.org, paulmck@kernel.org,
 peterz@infradead.org, rdunlap@infradead.org, rostedt@goodmis.org,
 rppt@kernel.org, saeed.mirzamohammadi@oracle.com, samitolvanen@google.com,
 sboyd@kernel.org, tglx@linutronix.de, torvalds@linux-foundation.org,
 vgoyal@redhat.com, yifeifz2@illinois.edu
Subject: Re: [patch 48/91] kernel/crash_core: add crashkernel=auto for vmcore creation
Date: Mon, 10 May 2021 10:32:57 +0200
Message-ID: <4a544493-0622-ac6d-f14b-fb338e33b25e@redhat.com>
In-Reply-To: <20210510045338.GB2946@localhost.localdomain>
References: <20210507010432.IN24PudKT%akpm@linux-foundation.org>
 <889c6b90-7335-71ce-c955-3596e6ac7c5a@redhat.com>
 <20210508085133.GA2946@localhost.localdomain>
 <2d0f53d9-51ca-da57-95a3-583dc81f35ef@redhat.com>
 <20210510045338.GB2946@localhost.localdomain>
>> "Not correlated directly" ...
>>
>> "1G-64G:128M,64G-1T:256M,1T-:512M"
>>
>> Am I still asleep and dreaming? :)
>
> Well, I said 'Not correlated directly', then gave sentences to explain
> the reason. I would like to repeat them:
>
> 1) Crashkernel needs more memory on some systems mainly because of
> device drivers. You can take a system: no matter how much you increase
> or decrease the total system RAM size, the crashkernel size needed is
> invariable.
>
> - The extreme case I have given about the i40e.
> - And the more devices, naturally the more memory needed.
>
> 2) About "1G-64G:128M,64G-1T:256M,1T-:512M", I also said the different
> values are because taking a very low proportion of extra memory to
> avoid potential risk is cost-effective. Here, adding another 90M is
> 0.13% of 64G, 0.0085% of 1TB.

Just let me clarify the problem I am having with all of this:

We model the crashkernel size as a function of the memory size. Yet,
it's pretty much independent of the memory size. That screams "ugly".

The main problem is that early during boot we don't have a clue how
much crashkernel memory we may need.
So what I see is that we are mostly using a heuristic based on the
memory size to come up with the right answer for how many devices we
might have. That just feels very wrong.

I can understand the reasoning of "using a fraction of the memory size
when booting up, just to be on the safe side, as we don't know", and
that motivation is much better than what I read so far. But then I
wonder if we cannot handle this any better? Because this feels very
suboptimal to me, and I feel like there can be cases where the
heuristic is just wrong.

As one example, can I add a whole bunch of devices to a 32GB VM and
break "crashkernel=auto"?

As another example, when I boot a 64G VM, the crashkernel size will be
512MB, although I might really only need 128MB. That's an effective
overhead of roughly 0.8%. And especially when we take memory ballooning
etc. into account, it can effectively be more than that.

Let's take a more detailed look. PPC64 in kernel-ark:

"2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";

Assume I would only need 385M on a simple 16GB VM. We would have an
overhead of ~4%. But maybe on ppc64 we do have to take the memory size
into account (my assumption and, thus, my comment regarding memory
hotplug)?

I wonder if we could always try allocating the larger granularity
(falling back to smaller sizes if that fails), and once the kernel is
able to come up with a better answer for how many devices there are
and, thus, how big the crashkernel area really should be, shrink the
preallocated crashkernel (either from the kernel or from user space)?
Not completely trivial, but possible I think. It's trivial when we
allocate a memmap for the crashkernel (I think we mostly do, but I
might be wrong).

"crashkernel=auto" would then really do something magically good
instead of implementing some heuristic based on the memory size.

[...]

>> So, all the rules we have are essentially broken because they rely
>> completely on the system RAM during boot.
>
> How do you get this?
>
> crashkernel=auto is a default value. PCs, VMs, normal workstations and
> servers, which are the overall majority, can work well with it. I can
> say the number is 99%. Only the very few high-end workstations and
> servers which contain many PCI devices need investigation to decide
> the crashkernel size. A possible manual setting and reboot is needed
> for them. You call this 'essentially broken'? So the approach you
> later suggested, constructing the crashkernel value in user space and
> rebooting, is not broken, even though it's a similar thing? What is
> the logic behind your conclusion?

A kernel early during boot can only guess. A kernel late during boot
knows. Please correct me if I'm wrong.

>
> crashkernel=auto mainly targets most systems, helping people without
> much knowledge of the kdump implementation to use it for debugging.
>
> I can say more about the benefit of crashkernel=auto. On Fedora, the
> community distro sponsored by Red Hat, kexec/kdump is also maintained
> by us. The Fedora kernel is a mainline kernel, so no crashkernel=auto
> is provided. We almost never get bug reports from users, which means
> almost nobody uses it. We hope Fedora users' usage can help test the
> functionality of the component.

I know how helpful "crashkernel=auto" has been so far, but I am also
aware that there was strong pushback in the past, and I remember it
was for the reasons I gave. IMHO we should refine that approach
instead of trying to push the same thing upstream every couple of
years.

I ran into the "512MB crashkernel on a 64G VM with memory ballooning"
issue already, but didn't report a BZ because, so far, I was under the
impression that more memory means more crashkernel. But you explained
to me that I was just running into a (for my use case) bad heuristic.

>>> So system RAM size is the least important part to influence crashkernel
>>
>> Aehm, not with fadump, no?
>
> Fadump makes use of the crashkernel reservation, but has a different
> mechanism for dumping. It needs a kernel config too if this patch is
> accepted, or it can add it to the command line from a user space
> program; I will talk about that later. This depends on IBM's decision.
> I have added Hari to CC; they will make the best choice after
> consideration.
>

I was looking at RHEL8, and there we have

    fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", NULL);
    ...
    if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 0))
            ck_cmdline = ...
    else
            ck_cmdline = ...

which is a runtime check for fadump. Something that cannot be modeled
properly, at least with this patch here.

> }
>>
>>> costing. Say my x1 laptop: even though I extended the RAM to 100TB,
>>> a 160M crashkernel is still enough. We would just like to get a tiny
>>> extra part to add to the crashkernel if the total RAM is very large;
>>> that's the rule for crashkernel=auto. As for VMs, given their very
>>> few devices (virtio disk, NAT NIC, etc.), no matter how much memory
>>> is deployed and hot added/removed, the crashkernel size won't be
>>> influenced very much. My personal understanding of it.
>>
>> That's an interesting observation. But you're telling me that we end
>> up wasting memory for the crashkernel because "crashkernel=auto",
>> which is supposed to do something magically good automatically, does
>> something very suboptimal? Oh my ... this is broken.
>>
>> Long story short: crashkernel=auto is pure ugliness.
>
> Very interesting. Your long story is clear to me, but your short story
> confuses me a lot.
>
> Let me try to sort this out and understand. In your first reply, you
> asserted "it's plain wrong when taking memory hotplug serious account
> as we see it quite heavily in VMs", which means you plainly don't know
> whether it's wrong, yet you say it's plain wrong. I answered 'no, not
> at all' with a detailed explanation, meaning it's the plain opposite
> of your assertion.
Yep, I might be partially wrong about the memory hotplug thingy, mostly
because I had the RHEL8 rule for ppc64 (including fadump) in mind. For
dynamic resizing of VMs, the current rules can be very suboptimal.

Let's relax "plain wrong" to "the heuristic can be very suboptimal
because it uses something mostly unrelated to come up with an answer".
And it's simply not plain wrong, because in practice it gets the job
done. Mostly.

> So then you quickly came to 'crashkernel=auto is pure ugliness'. If a
> simple crashkernel=auto is added to cover 99% of systems, and advanced
> operation only needs to be done for the rest, which is a tiny
> proportion, and this is called pure ugliness, then what's pure beauty?
> Here I say 99%, and I could be being very conservative.

I don't like wasting memory just because we cannot come up with a
better heuristic. Yes, it somewhat gets the job done, but I call that
ugly. My humble opinion.

[...]

>
> Yes, if you haven't seen our patch on the Fedora kexec-tools mailing
> list, your suggested approach is exactly what we are doing; please
> check the patch below.
>
> [PATCH v2] kdumpctl: Add kdumpctl estimate
> https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.org/thread/YCEOJHQXKVEIVNB23M2TDAJGYVNP5MJZ/
>
> We will provide a new feature in a user space script, to let users
> check whether their current crashkernel size is good or not. If not,
> they can adjust it accordingly.

That's good, thanks for the pointer -- wasn't aware of that.

>
> But where's the current crashkernel size coming from? Surely
> crashkernel=auto. You wouldn't add a random crashkernel size, compare
> it with the recommended crashkernel size, and then reboot, would you?
> If crashkernel=auto gets the expected size, there's no need to reboot.
> That means 99% of systems have no need to reboot. Only very few
> systems need to reboot after checking the recommended size.
>
> Long story short.
> crashkernel=auto will give a default value, trying to cover most
> systems. (Very few high-end servers need to check whether it's enough
> and adjust with the help of the user space tools, then reboot.)

Then we might really want to investigate shrinking a possibly larger
allocation dynamically during boot.

>>
>> Also: this approach here doesn't make any sense when you want to do
>> something dependent on other cmdline parameters. Take "fadump=on" vs.
>> "fadump=off" as an example. You just cannot handle it properly as
>> proposed in this patch. To me, the approach in this patch makes the
>> least sense TBH.
>
> Why? Don't we have this kind of judgement in the kernel?
> crashkernel=auto is a generic mechanism and was added much earlier.
> Fadump was added later by IBM for their needs on ppc only; it relies
> on the crashkernel reservation but uses a different mechanism for
> dumping. If it has a different value than kdump, special handling is
> certainly needed. Who says it has to be 'fadump=on'? They can check
> the value in a user space program and add it to the cmdline as you
> suggested; they can also make it into auto. The most suitable is the
> best.

Take a look at the RHEL8 handling to see where my comment is coming
from.

>
> And I have several questions to ask; I hope you can help answer them:
>
> 1) Have you ever seen crashkernel=auto broken on a virt platform?

I have encountered it being very suboptimal. I call wasting hundreds of
MB problematic, especially when dynamically resizing VMs (for example,
using memory ballooning).

>
> Asking this because you are from the Virt team, and crashkernel=auto
> has been in RHEL for many years, and we have been working with the
> Virt team to support dumping. We haven't seen any bug report or
> complaint about crashkernel=auto from Virt.

I've had plenty of bug reports where people try inflating the balloon
fairly heavily but don't take the crashkernel size into account.
The bigger the crashkernel size, the bigger the issue when people try
squeezing the last couple of MB out of their VMs. I keep repeating to
them: "with crashkernel=auto, you have to be careful about how much
memory might get set aside for the crashkernel, which therefore
reduces your effective guest OS RAM size and reduces the maximum
balloon size".

>
> 2) Adding crashkernel=auto, plus kdumpctl estimate as the user space
> program to get a recommended size, then reboot. Or: removing
> crashkernel=auto, with only kdumpctl estimate to get a recommended
> size, and always rebooting. In RHEL we will take the 1st option. Are
> you willing to take the 2nd one for the Virt platform, since you think
> crashkernel=auto is plain wrong, pure ugliness, essentially broken,
> and makes the least sense?

We are talking about upstreaming stuff here, and I am wearing my
upstream hat. I'm stating (just like people decades ago) that this
might not be the right approach for upstream, at least not as it
stands.

And no, I don't have time to solve problems/implement
solutions/upstream patches to tackle fundamental issues that have been
there for decades.

I'll be happy to help looking into dynamic shrinking of the
crashkernel size if that approach makes sense. We could even let user
space trigger that resizing -- without a reboot.

-- 
Thanks,

David / dhildenb