From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Baoquan He
Cc: Andrew Morton, andreyknvl@google.com, christian.brauner@ubuntu.com,
 colin.king@canonical.com, corbet@lwn.net, dyoung@redhat.com,
 frederic@kernel.org, gpiccoli@canonical.com, john.p.donnelly@oracle.com,
 jpoimboe@redhat.com, keescook@chromium.org, linux-mm@kvack.org,
 masahiroy@kernel.org, mchehab+huawei@kernel.org, mike.kravetz@oracle.com,
 mingo@kernel.org, mm-commits@vger.kernel.org, paulmck@kernel.org,
 peterz@infradead.org, rdunlap@infradead.org, rostedt@goodmis.org,
 rppt@kernel.org, saeed.mirzamohammadi@oracle.com, samitolvanen@google.com,
 sboyd@kernel.org, tglx@linutronix.de, torvalds@linux-foundation.org,
 vgoyal@redhat.com, yifeifz2@illinois.edu
Subject: Re: [patch 48/91] kernel/crash_core: add crashkernel=auto for vmcore creation
Date: Mon, 10 May 2021 10:32:57 +0200
Message-ID: <4a544493-0622-ac6d-f14b-fb338e33b25e@redhat.com>
In-Reply-To: <20210510045338.GB2946@localhost.localdomain>
References: <20210507010432.IN24PudKT%akpm@linux-foundation.org>
 <889c6b90-7335-71ce-c955-3596e6ac7c5a@redhat.com>
 <20210508085133.GA2946@localhost.localdomain>
 <2d0f53d9-51ca-da57-95a3-583dc81f35ef@redhat.com>
 <20210510045338.GB2946@localhost.localdomain>
>> "Not correlated directly" ...
>>
>> "1G-64G:128M,64G-1T:256M,1T-:512M"
>>
>> Am I still asleep and dreaming? :)
>
> Well, I said 'Not correlated directly', then gave sentences to explain
> the reason. I would like to repeat them:
>
> 1) Crashkernel needs more memory on some systems mainly because of
> device drivers. You can take a system: no matter how much you increase
> or decrease the total system RAM size, the crashkernel size needed is
> invariable.
>
> - The extreme case I have given about the i40e.
> - And the more devices, naturally the more memory needed.
>
> 2) About "1G-64G:128M,64G-1T:256M,1T-:512M", I also said the different
> values are because taking a very low proportion of extra memory to
> avoid potential risk is cost-effective. Here, adding another 90M is
> 0.13% of 64G, 0.0085% of 1TB.

Just let me clarify the problem I am having with all of this:

We model the crashkernel size as a function of the memory size. Yet,
it's pretty much independent of the memory size. That screams "ugly".

The main problem is that early during boot we don't have a clue how
much crashkernel memory we may need.
So what I see is that we are mostly using a heuristic based on the
memory size to come up with the right answer for how many devices we
might have. That just feels very wrong.

I can understand the reasoning of "using a fraction of the memory size
when booting up, just to be on the safe side, as we don't know", and
that motivation is much better than what I read so far. But then I
wonder if we cannot handle this any better? Because this feels very
suboptimal to me, and I feel like there can be cases where the
heuristic is just wrong.

As one example, can I add a whole bunch of devices to a 32GB VM and
break "crashkernel=auto"?

As another example, when I boot a 64G VM, the crashkernel size will be
512MB, although I might really only need 128MB. That's an effective
overhead of roughly 0.8%. And especially when we take memory ballooning
etc. into account, it can effectively be more than that.

Let's take a more detailed look. PPC64 in kernel-ark:

"2G-4G:384M,4G-16G:512M,16G-64G:1G,64G-128G:2G,128G-:4G";

Assume I would only need 385M on a simple 16GB VM. We would have an
overhead of ~4%. But maybe on ppc64 we do have to take the memory size
into account (my assumption and, thus, my comment regarding memory
hotplug)?

I wonder if we could always try allocating the larger granularity
(falling back to smaller sizes if that fails), and once the kernel is
able to come up with a better answer for how many devices there are
and, thus, how big the crashkernel area really should be, shrink the
preallocated crashkernel (either from the kernel or from user space)?
Not completely trivial, but possible I think. It's trivial when we
allocate a memmap for the crashkernel (I think we mostly do, but I
might be wrong).

"crashkernel=auto" would then really do something magically good
instead of implementing some heuristic based on the memory size.

[...]

>> So, all the rules we have are essentially broken because they rely
>> completely on the system RAM during boot.
>
> How do you get this?
>
> crashkernel=auto is a default value. PCs, VMs, normal workstations and
> servers, which are the overall majority, can work well with it. I can
> say the number is 99%. Only the very few high-end workstations and
> servers which contain many PCI devices need investigation to decide
> the crashkernel size. A possible manual setting and reboot is needed
> for them. You call this 'essentially broken'? So the approach you
> later suggested, constructing the crashkernel value in user space and
> rebooting, is not broken, even though it's a similar thing? What is
> the logic behind your conclusion?

A kernel early during boot can only guess. A kernel late during boot
knows. Please correct me if I'm wrong.

>
> crashkernel=auto mainly targets most systems, helping people without
> much knowledge of the kdump implementation to use it for debugging.
>
> I can say more about the benefit of crashkernel=auto. On Fedora, the
> community distro sponsored by Red Hat, kexec/kdump is also maintained
> by us. The Fedora kernel is a mainline kernel, so no crashkernel=auto
> is provided. We almost never get bug reports from users, which means
> almost nobody uses it. We hope Fedora users' usage can help test the
> functionality of the component.

I know how helpful "crashkernel=auto" has been so far, but I am also
aware that there was strong pushback in the past, and I remember it
was for the reasons I gave. IMHO we should refine that approach
instead of trying to push the same thing upstream every couple of
years.

I ran into the "512MB crashkernel on a 64G VM with memory ballooning"
issue already, but didn't report a BZ because, so far, I was under the
impression that more memory means more crashkernel. But you explained
to me that I was just running into a (for my use case) bad heuristic.

>>> So system RAM size is the least important part to influence crashkernel
>>
>> Aehm, not with fadump, no?
>
> Fadump makes use of the crashkernel reservation, but has a different
> mechanism for dumping. It needs a kernel config too if this patch is
> accepted, or it can add it to the command line from a user space
> program; I will talk about that later. This depends on IBM's decision.
> I have added Hari to CC; they will make the best choice after
> consideration.
>

I was looking at RHEL8, and there we have

    fadump_cmdline = get_last_crashkernel(cmdline, "fadump=", NULL);
    ...
    if (!fadump_cmdline || (strncmp(fadump_cmdline, "off", 3) == 0))
            ck_cmdline = ...
    else
            ck_cmdline = ...

which is a runtime check for fadump. Something that cannot be modeled
properly, at least with this patch here.

> }
>>
>>> costing. Say my x1 laptop: even though I extended the RAM to 100TB,
>>> a 160M crashkernel is still enough. We would just like to get a tiny
>>> extra part to add to the crashkernel if the total RAM is very large;
>>> that's the rule for crashkernel=auto. As for VMs, given their very
>>> few devices (virtio disk, NAT NIC, etc.), no matter how much memory
>>> is deployed and hot added/removed, the crashkernel size won't be
>>> influenced very much. My personal understanding of it.
>>
>> That's an interesting observation. But you're telling me that we end
>> up wasting memory for the crashkernel because "crashkernel=auto",
>> which is supposed to do something magically good automatically, does
>> something very suboptimal? Oh my ... this is broken.
>>
>> Long story short: crashkernel=auto is pure ugliness.
>
> Very interesting. Your long story is clear to me, but your short story
> confuses me a lot.
>
> Let me try to sort this out and understand. In your first reply, you
> asserted "it's plain wrong when taking memory hotplug serious account
> as we see it quite heavily in VMs", which means you plainly don't know
> whether it's wrong, yet you say it's plain wrong. I answered 'no, not
> at all' with a detailed explanation, meaning it's the plain opposite
> of your assertion.
Yep, I might be partially wrong about the memory hotplug thingy, mostly
because I had the RHEL8 rule for ppc64 (including fadump) in mind. For
dynamic resizing of VMs, the current rules can be very suboptimal.

Let's relax "plain wrong" to "the heuristic can be very suboptimal
because it uses something mostly unrelated to come up with an answer".
And it's simply not plain wrong, because in practice it gets the job
done. Mostly.

> So then you quickly came to 'crashkernel=auto is pure ugliness'. If a
> simple crashkernel=auto is added to cover 99% of systems, and advanced
> operation only needs to be done for the rest, which is a tiny
> proportion, and this is called pure ugliness, then what's pure beauty?
> Here I say 99%, and I could be being very conservative.

I don't like wasting memory just because we cannot come up with a
better heuristic. Yes, it somewhat gets the job done, but I call that
ugly. My humble opinion.

[...]

>
> Yes, if you haven't seen our patch on the Fedora kexec-tools mailing
> list, your suggested approach is exactly what we are doing; please
> check the patch below.
>
> [PATCH v2] kdumpctl: Add kdumpctl estimate
> https://lists.fedoraproject.org/archives/list/kexec@lists.fedoraproject.org/thread/YCEOJHQXKVEIVNB23M2TDAJGYVNP5MJZ/
>
> We will provide a new feature in a user space script, to let users
> check whether their current crashkernel size is good or not. If not,
> they can adjust it accordingly.

That's good, thanks for the pointer -- wasn't aware of that.

>
> But where's the current crashkernel size coming from? Surely
> crashkernel=auto. You wouldn't add a random crashkernel size, compare
> it with the recommended crashkernel size, and then reboot, would you?
> If crashkernel=auto gets the expected size, there's no need to reboot.
> That means 99% of systems have no need to reboot. Only very few
> systems need to reboot after checking the recommended size.
>
> Long story short.
> crashkernel=auto will give a default value, trying to cover most
> systems. (Very few high-end servers need to check whether it's enough
> and adjust with the help of the user space tools, then reboot.)

Then we might really want to investigate shrinking a possibly larger
allocation dynamically during boot.

>>
>> Also: this approach here doesn't make any sense when you want to do
>> something dependent on other cmdline parameters. Take "fadump=on" vs.
>> "fadump=off" as an example. You just cannot handle it properly as
>> proposed in this patch. To me, the approach in this patch makes the
>> least sense TBH.
>
> Why? Don't we have this kind of judgement in the kernel?
> crashkernel=auto is a generic mechanism and was added much earlier.
> Fadump was added later by IBM for their needs on ppc only; it relies
> on the crashkernel reservation but uses a different mechanism for
> dumping. If it has a different value than kdump, special handling is
> certainly needed. Who says it has to be 'fadump=on'? They can check
> the value in a user space program and add it to the cmdline as you
> suggested; they can also make it into auto. The most suitable is the
> best.

Take a look at the RHEL8 handling to see where my comment is coming
from.

>
> And I have several questions to ask; I hope you can help answer them:
>
> 1) Have you ever seen crashkernel=auto broken on a virt platform?

I have encountered it being very suboptimal. I call wasting hundreds of
MB problematic, especially when dynamically resizing VMs (for example,
using memory ballooning).

>
> Asking this because you are from the Virt team, and crashkernel=auto
> has been in RHEL for many years, and we have been working with the
> Virt team to support dumping. We haven't seen any bug report or
> complaint about crashkernel=auto from Virt.

I've had plenty of bug reports where people try inflating the balloon
fairly heavily but don't take the crashkernel size into account.
The bigger the crashkernel size, the bigger the issue when people try
squeezing the last couple of MB out of their VMs. I keep repeating to
them: "with crashkernel=auto, you have to be careful about how much
memory might get set aside for the crashkernel, which therefore
reduces your effective guest OS RAM size and reduces the maximum
balloon size".

>
> 2) Adding crashkernel=auto, plus kdumpctl estimate as the user space
> program to get a recommended size, then reboot. Or: removing
> crashkernel=auto, with only kdumpctl estimate to get a recommended
> size, and always rebooting. In RHEL we will take the 1st option. Are
> you willing to take the 2nd one for the Virt platform, since you think
> crashkernel=auto is plain wrong, pure ugliness, essentially broken,
> and makes the least sense?

We are talking about upstreaming stuff here, and I am wearing my
upstream hat. I'm stating (just like people decades ago) that this
might not be the right approach for upstream, at least not as it
stands.

And no, I don't have time to solve problems/implement
solutions/upstream patches to tackle fundamental issues that have been
there for decades.

I'll be happy to help looking into dynamic shrinking of the
crashkernel size if that approach makes sense. We could even let user
space trigger that resizing -- without a reboot.

-- 
Thanks,

David / dhildenb