From: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Date: Wed, 8 Dec 2021 14:23:55 +0100
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
To: David Hildenbrand
Cc: Matthew Wilcox, Shakeel Butt, Kirill A. Shutemov, Yang Shi, Zi Yan,
 Andrew Morton, Linux MM, LKML
In-Reply-To: <861f98b5-9211-98c7-b4f7-fd71146aa64c@redhat.com>

> >> Many applications do sophisticated management of their heap memory for
> >> better performance, but at low cost. We have a bunch of such
> >> applications running in our production environment; examples include
> >> caching and data storage services. These applications keep their hot
> >> data on THPs for better performance and release the cold data through
> >> MADV_DONTNEED to keep the memory cost low.
> >>
> >> The kernel defers the split and release of THPs until there is memory
> >> pressure. This complicates the memory management of these
> >> sophisticated applications, which then need to look into the low-level
> >> kernel handling of THPs to better gauge their headroom for expansion.
> >> In addition, these applications are very latency sensitive and would
> >> prefer not to face memory reclaim, due to the non-deterministic nature
> >> of reclaim.
> >>
> >> This patch lets such applications not worry about the low-level
> >> handling of THPs in the kernel and splits the THPs synchronously on
> >> MADV_DONTNEED.
> >
> > I've been wondering whether this is really the right strategy
> > (and this goes wider than just this one new case).
> >
> > We chose to use a 2MB page here, based on whatever heuristics are
> > currently in play.
> > Now userspace is telling us we were wrong and should
> > have used smaller pages.
>
> IIUC, not necessarily, unfortunately.
>
> User space might be discarding the whole 2MB via a single call
> (MADV_DONTNEED on a 2MB range, as done by virtio-balloon with "free
> page reporting" or by virtio-mem in QEMU). In that case, there is
> nothing to migrate and we were not doing anything wrong.
>
> But more extreme: user space might be discarding the whole THP in
> small pieces over a short time. This happens, for example, when a VM
> inflates the memory balloon via virtio-balloon. All inflation requests
> are 4k, resulting in 4k MADV_DONTNEED calls. If we end up inflating a
> THP range inside of the VM, mapping to a THP range inside the
> hypervisor, we'll essentially free a THP in the hypervisor piece by
> piece using individual MADV_DONTNEED calls -- this happens frequently.
> Something similar can happen when de-fragmentation inside the VM
> "moves around" inflated 4k pages piece by piece to essentially form a
> huge inflated range -- this happens less frequently as of now. In both
> cases, migration is counter-productive, as we're just about to free
> the whole range either way.
>
> (yes, there are ways to optimize, for example using hugepage
> ballooning or merging MADV_DONTNEED calls in the hypervisor, but what
> I described is what we currently implement in hypervisors like QEMU,
> because there are corner cases for everything)

It seems this can happen when the guest is using huge pages or THP. If
we end up not freeing hypervisor memory (THPs) until memory pressure
mounts, could this be a problem for "free page reporting" as well?

> Long story short: it's hard to tell what will happen next based on a
> single MADV_DONTNEED call. Page compaction, in comparison, doesn't
> care and optimizes the layout as it observes it.