Date: Fri, 10 Dec 2021 17:54:24 +0800
From: Hu Weiwen <huww98@outlook.com>
To: Minchan Kim
Cc: Hu Weiwen, Andrew Morton, linux-mm, LKML, Michal Hocko, David Hildenbrand, Suren Baghdasaryan, John Dias
Subject: Re: [RFC] mm: introduce page pinner
Message-ID: <20211210095424.GA114015@dorm.huww98.cn>
References: <20211208115250.GA17274@DESKTOP-N4CECTO.huww98.cn>

On Wed, Dec 08, 2021 at 10:42:37AM -0800, Minchan Kim wrote:
> On Wed, Dec 08, 2021 at 07:54:35PM +0800, Hu Weiwen wrote:
> > On Mon, Dec 06, 2021 at 10:47:30AM -0800, Minchan Kim wrote:
> > > The contiguous memory allocation fails if one of the pages in
> > > the requested range has an unexpected elevated reference count, since
> > > the VM couldn't migrate the page out. It's a very common pattern for
> > > CMA allocation failure. The temporarily elevated page refcount
> > > could happen from various places, and it's really hard to chase
> > > who held the temporary page refcount at that time, which is the
> > > vital information needed to debug the allocation failure.
> 
> Hi,
> 
> Please don't cut down the original Cc list without special reason.

Sorry, my school SMTP server does not allow that many recipients. I have
changed to Outlook.

> > Hi Minchan,
> > 
> > I'm a newbie here. We are debugging a problem where every CPU core is doing
> > compaction but making no progress, because of the unexpected page refcount. I'm
> > interested in your approach, but this patch seems only to cover the CMA
> > allocation path. So could it be extended to debugging migration failure during
> > compaction?
> > I'm not familiar with the kernel codebase; here is my untested
> > thought:
> 
> The compaction failure will produce a lot of events, which I wanted to avoid
> in my system, but I think your case is reasonable if you don't
> mind the large number of events.
> 
> > 
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index cf25b00f03c8..85dacbca8fa0 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -46,6 +46,7 @@
> >  #include <...>
> >  #include <...>
> >  #include <...>
> > +#include <linux/page_pinner.h>
> >  #include <...>
> >  #include <...>
> >  #include <...>
> > @@ -388,8 +389,10 @@ int folio_migrate_mapping(struct address_space *mapping,
> >  
> >  	if (!mapping) {
> >  		/* Anonymous page without mapping */
> > -		if (folio_ref_count(folio) != expected_count)
> > +		if (folio_ref_count(folio) != expected_count) {
> > +			page_pinner_failure(&folio->page);
> >  			return -EAGAIN;
> > +		}
> >  
> >  		/* No turning back from here */
> >  		newfolio->index = folio->index;
> > @@ -406,6 +409,7 @@ int folio_migrate_mapping(struct address_space *mapping,
> >  	xas_lock_irq(&xas);
> >  	if (!folio_ref_freeze(folio, expected_count)) {
> >  		xas_unlock_irq(&xas);
> > +		page_pinner_failure(&folio->page);
> >  		return -EAGAIN;
> >  	}
> > 
> > I'm not sure what to do with the new folio; it seems using folio->page in new
> > code is not correct.

I tested the above proposed patch, and it works in my case. But it produces a lot
of redundant page_pinner_put events: before the true pinner reveals itself, the
traced pages are got and put multiple times. Besides, when passed to
page_pinner_failure(), the "count" is 3 in my case, and any of the 3 holders
could be the pinner of interest. I think this is hard to avoid, and we can just
let the users distinguish which is the pinner of interest. Maybe we need some
docs about this.
> If you want to cover compaction only, maybe this one:
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index bfc93da1c2c7..7bfbf7205fb8 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2400,6 +2400,11 @@ compact_zone(struct compact_control *cc, struct capture_control *capc)
>  		/* All pages were either migrated or will be released */
>  		cc->nr_migratepages = 0;
>  		if (err) {
> +			struct page *failed_page;
> +
> +			list_for_each_entry(failed_page, &cc->migratepages, lru)
> +				page_pinner_failure(failed_page);
> +
>  			putback_movable_pages(&cc->migratepages);
>  			/*
>  			 * migrate_pages() may return -ENOMEM when scanners meet

Maybe we should put the page_pinner_failure() calls as close to the real
refcount check as possible, to avoid potential races and losing some
page_pinner_put events? Besides, migrate_pages() will retry up to 10 times, and
I imagine that someone may want to find out who is causing the retries. And
migration may fail for a number of reasons, not only an unexpected refcount.

I imagine that enabling page pinner for migration scenarios other than compaction
could be helpful for others.

> However, for that case, I want to introduce some filter options like
> failure reason(?)
> 
> 	page_pinner_failure(pfn, reason)
> 
> So, I could keep getting only CMA allocation failure events, not
> compaction failure events.

This is a good idea to me. But how can we implement the filter? Can we reuse the
trace event filter? I.e., if the page_pinner_failure event is filtered out, then
we don't set the PAGE_EXT_PINNER flag, and effectively also filter the
corresponding page_pinner_put event out. I can't see whether this is possible
now: trace_page_pinner_failure() returns void, so it seems we cannot know whether
the event got through.

If this is not possible, we may need to allocate additional space to store the
reason for each traced page, and also pass the reason to trace_page_pinner_put().
> > > This patch introduces page pinner to keep track of the page pinner
> > > who caused the CMA allocation failure. How page pinner works is:
> > > once the VM finds a non-migrated page after trying migration
> > > during contiguous allocation, it marks the page, and every page-put
> > > operation on the page since then will have an event trace. Since
> > > a page-put is always paired with a page-get, the page-put event trace helps
> > > to deduce where the paired page-get originated from.
> > > 
> > > The reason why the feature tracks page-put instead of page-get
> > > indirectly is that, since the VM couldn't know in advance when page migration
> > > will fail, it would have to keep track of every page-get for migratable pages
> > > to dump information at failure. Considering the backtrace as vital
> > > information, and page get/put being one of the hottest paths,
> > > that's too heavy an approach. Thus, to minimize runtime overhead,
> > > this feature adds a new PAGE_EXT_PINNER flag under the PAGE_EXT
> > > debugging option to indicate a migration-failed page, and only
> > > tracks every page-put operation for the page since the failure.
> > > 
> > > usage:
> > > 
> > > trace_dir="/sys/kernel/tracing"
> > > echo 1 > $trace_dir/events/page_pinner/enable
> > > echo 1 > $trace_dir/options/stacktrace
> > > ..
> > > run workload
> > > ..
> > > ..
> > > 
> > > cat $trace_dir/trace
> > > 
> > >  <...>-498   [006] ....  33306.301621: page_pinner_failure: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=1 mapcount=0 mapping=00000000aec7812a mt=5
> > >  <...>-498   [006] ....  33306.301625:
> > >  => __page_pinner_failure
> > >  => test_pages_isolated
> > >  => alloc_contig_range
> > >  => cma_alloc
> > >  => cma_heap_allocate
> > >  => dma_heap_ioctl
> > >  => __arm64_sys_ioctl
> > >  => el0_svc_common
> > >  => do_el0_svc
> > >  => el0_svc
> > >  => el0_sync_handler
> > >  => el0_sync
> > >  <...>-24965 [001] ....  33306.392836: page_pinner_put: pfn=0x9f0bb0 flags=uptodate|lru|swapbacked count=0 mapcount=0 mapping=00000000aec7812a mt=5
> > >  <...>-24965 [001] ....  33306.392846:
> > >  => __page_pinner_put
> > >  => release_pages
> > >  => free_pages_and_swap_cache
> > >  => tlb_flush_mmu_free
> > >  => tlb_flush_mmu
> > >  => zap_pte_range
> > >  => unmap_page_range
> > >  => unmap_vmas
> > >  => exit_mmap
> > >  => __mmput
> > >  => mmput
> > >  => exit_mm
> > >  => do_exit
> > >  => do_group_exit
> > >  => get_signal
> > >  => do_signal
> > >  => do_notify_resume
> > >  => work_pending
> > > 
> > > Signed-off-by: Minchan Kim
> > > ---
> > > PagePinner is named after PageOwner, since I wanted to keep track of
> > > page refcount holders. Feel free to suggest better names.
> > > Actually, I had alloc_contig_failure tracker as a candidate.
> > > 
> > >  include/linux/mm.h                 |  7 ++-
> > >  include/linux/page_ext.h           |  3 +
> > >  include/linux/page_pinner.h        | 47 ++++++++++++++++
> > >  include/trace/events/page_pinner.h | 60 ++++++++++++++++++++
> > >  mm/Kconfig.debug                   | 13 +++++
> > >  mm/Makefile                        |  1 +
> > >  mm/page_alloc.c                    |  3 +
> > >  mm/page_ext.c                      |  4 ++
> > >  mm/page_isolation.c                |  3 +
> > >  mm/page_pinner.c                   | 90 ++++++++++++++++++++++++++++++
> > >  10 files changed, 230 insertions(+), 1 deletion(-)
> > >  create mode 100644 include/linux/page_pinner.h
> > >  create mode 100644 include/trace/events/page_pinner.h
> > >  create mode 100644 mm/page_pinner.c
> 
> <snip>
> 
> > > diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
> > > index 1e73717802f8..0ad4a3b8f4eb 100644
> > > --- a/mm/Kconfig.debug
> > > +++ b/mm/Kconfig.debug
> > > @@ -62,6 +62,19 @@ config PAGE_OWNER
> > >  
> > >  	  If unsure, say N.
> > >  
> > > +config PAGE_PINNER
> > > +	bool "Track page pinner"
> > > +	select PAGE_EXTENSION
> > > +	depends on DEBUG_KERNEL && TRACEPOINTS
> > > +	help
> > > +	  This keeps track of what call chain is the pinner of a page; it may
> > > +	  help to find contiguous page allocation failures. Even if you include
> > > +	  this feature in your build, it is disabled by default. You should
> > > +	  pass "page_pinner=on" as a boot parameter in order to enable it. Eats
> > > +	  a fair amount of memory if enabled.
> > 
> > I'm a bit confused. It seems page pinner does not allocate any additional
> > memory if you enable it by boot parameter. So the description seems inaccurate.
> 
> It will allocate page_ext descriptors, so it consumes memory.

Thanks, I see. So it is 8 bytes for each 4 KiB page, about 0.2% of memory. Not
much, I think.

> > > +
> > > +	  If unsure, say N.
> > > +
> > >  config PAGE_POISONING
> > >  	bool "Poison pages after freeing"
> > >  	help
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index fc60a40ce954..0c9b78b15070 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -102,6 +102,7 @@ obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
> > >  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
> > >  obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
> > >  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
> > > +obj-$(CONFIG_PAGE_PINNER) += page_pinner.o
> > >  obj-$(CONFIG_CLEANCACHE) += cleancache.o
> > >  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
> > >  obj-$(CONFIG_ZPOOL) += zpool.o
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f41a5e990ac0..6e3a6f875a40 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -63,6 +63,7 @@
> > >  #include <...>
> > >  #include <...>
> > >  #include <...>
> > > +#include <linux/page_pinner.h>
> > >  #include <...>
> > >  #include <...>
> > >  #include <...>
> > > @@ -1299,6 +1300,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > >  		if (memcg_kmem_enabled() && PageMemcgKmem(page))
> > >  			__memcg_kmem_uncharge_page(page, order);
> > >  		reset_page_owner(page, order);
> > > +		reset_page_pinner(page, order);
> > >  		return false;
> > >  	}
> > >  
> > > @@ -1338,6 +1340,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
> > >  	page_cpupid_reset_last(page);
> > >  	page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >  	reset_page_owner(page, order);
> > > +	reset_page_pinner(page, order);
> > >  
> > >  	if (!PageHighMem(page)) {
> > >  		debug_check_no_locks_freed(page_address(page),
> > > diff --git a/mm/page_ext.c b/mm/page_ext.c
> > > index 2a52fd9ed464..0dafe968b212 100644
> > > --- a/mm/page_ext.c
> > > +++ b/mm/page_ext.c
> > > @@ -8,6 +8,7 @@
> > >  #include <...>
> > >  #include <...>
> > >  #include <...>
> > > +#include <linux/page_pinner.h>
> > >  
> > >  /*
> > >   * struct page extension
> > > @@ -75,6 +76,9 @@ static struct page_ext_operations *page_ext_ops[] = {
> > >  #if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
> > >  	&page_idle_ops,
> > >  #endif
> > > +#ifdef CONFIG_PAGE_PINNER
> > > +	&page_pinner_ops,
> > > +#endif
> > >  };
> > >  
> > >  unsigned long page_ext_size = sizeof(struct page_ext);
> > > diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> > > index a95c2c6562d0..a9ddea1c9166 100644
> > > --- a/mm/page_isolation.c
> > > +++ b/mm/page_isolation.c
> > > @@ -9,6 +9,7 @@
> > >  #include <...>
> > >  #include <...>
> > >  #include <...>
> > > +#include <linux/page_pinner.h>
> > >  #include <...>
> > >  #include "internal.h"
> > >  
> > > @@ -310,6 +311,8 @@ int test_pages_isolated(unsigned long start_pfn, unsigned long end_pfn,
> > >  
> > >  out:
> > >  	trace_test_pages_isolated(start_pfn, end_pfn, pfn);
> > > +	if (ret < 0)
> > > +		page_pinner_failure(pfn_to_page(pfn));
> > >  
> > >  	return ret;
> > >  }
> > > diff --git a/mm/page_pinner.c b/mm/page_pinner.c
> > > new file mode 100644
> > > index 000000000000..300a90647557
> > > --- /dev/null
> > > +++ b/mm/page_pinner.c
> > > @@ -0,0 +1,90 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <...>
> > > +#include <...>
> > > +
> > > +#define CREATE_TRACE_POINTS
> > > +#include <trace/events/page_pinner.h>
> > > +
> > > +static bool page_pinner_enabled;
> > > +DEFINE_STATIC_KEY_FALSE(page_pinner_inited);
> > > +EXPORT_SYMBOL(page_pinner_inited);
> > > +
> > > +static int __init early_page_pinner_param(char *buf)
> > > +{
> > > +	return kstrtobool(buf, &page_pinner_enabled);
> > > +}
> > > +early_param("page_pinner", early_page_pinner_param);
> > > +
> > > +static bool need_page_pinner(void)
> > > +{
> > > +	return page_pinner_enabled;
> > > +}
> > > +
> > > +static void init_page_pinner(void)
> > > +{
> > > +	if (!page_pinner_enabled)
> > > +		return;
> > > +
> > > +	static_branch_enable(&page_pinner_inited);
> > > +}
> > > +
> > > +struct page_ext_operations page_pinner_ops = {
> > > +	.need = need_page_pinner,
> > > +	.init = init_page_pinner,
> > > +};
> > > +
> > > +void __reset_page_pinner(struct page *page, unsigned int order)
> > > +{
> > > +	struct page_ext *page_ext;
> > > +	int i;
> > > +
> > > +	page_ext = lookup_page_ext(page);
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	for (i = 0; i < (1 << order); i++) {
> > > +		if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > > +			break;
> > > +
> > > +		clear_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > > +		page_ext = page_ext_next(page_ext);
> > > +	}
> > > +}
> > > +
> > > +void __page_pinner_failure(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	trace_page_pinner_failure(page);
> > > +	test_and_set_bit(PAGE_EXT_PINNER, &page_ext->flags);
> > > +}
> > > +
> > > +void __page_pinner_put(struct page *page)
> > > +{
> > > +	struct page_ext *page_ext = lookup_page_ext(page);
> > > +
> > > +	if (unlikely(!page_ext))
> > > +		return;
> > > +
> > > +	if (!test_bit(PAGE_EXT_PINNER, &page_ext->flags))
> > > +		return;
> > > +
> > > +	trace_page_pinner_put(page);
> > > +}
> > > +EXPORT_SYMBOL(__page_pinner_put);
> > > +
> > > +
> > > +static int __init page_pinner_init(void)
> > > +{
> > > +	if (!static_branch_unlikely(&page_pinner_inited)) {
> > > +		pr_info("page_pinner is disabled\n");
> > > +		return 0;
> > > +	}
> > > +
> > > +	pr_info("page_pinner is enabled\n");
> > > +	return 0;
> > > +}
> > > +late_initcall(page_pinner_init)
> > > -- 
> > > 2.34.1.400.ga245620fadb-goog
> > > 
> > 
> > More info about my compaction issue:
> > 
> > This call stack returns -EAGAIN in 99.9% of cases on the problematic host
> > (Ubuntu 20.04 with kernel 5.11.0-40):
> > 
> > migrate_page_move_mapping (now folio_migrate_mapping) <- returns -EAGAIN
> > migrate_page
> > fallback_migrate_page
> > move_to_new_page
> > migrate_pages
> > compact_zone
> > compact_zone_order
> > try_to_compact_pages
> > __alloc_pages_direct_compact
> > __alloc_pages_slowpath.constprop.0
> > __alloc_pages_nodemask
> > alloc_pages_vma
> > do_huge_pmd_anonymous_page
> > __handle_mm_fault
> > handle_mm_fault
> > do_user_addr_fault
> > exc_page_fault
> > asm_exc_page_fault
> > 
> > The offending pages are from shm, allocated by mmap() with MAP_SHARED by a
> > machine learning program. They may have a relationship with NVIDIA CUDA, but I
> > want to confirm this, and make improvements if possible.
> 
> So you are suspecting some kernel driver holds an additional refcount
> using get_user_pages or the page get API?

Yes. By using the trace events in this patch, I have confirmed it is the nvidia
kernel module that holds the refcount. I got a stacktrace like this (from
"perf script"):

cuda-EvtHandlr 31023 [000]  3244.976411: page_pinner:page_pinner_put: pfn=0x13e473 flags=0x8001e count=0 mapcount=0 mapping=(nil) mt=1
	ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
	ffffffff82511be4 __page_pinner_put+0x54 (/lib/modules/5.15.6+/build/vmlinux)
	ffffffffc0b71e1f os_unlock_user_pages+0xbf ([nvidia])
	ffffffffc14a4546 _nv032165rm+0x96 ([nvidia])

Still not much information.
NVIDIA does not want me to debug its module. Maybe the only thing I can do is
report this to NVIDIA.

> > When the issue reproduces, a single page fault that triggers a sync compaction
> > can take tens of seconds. Then all 40 CPU threads are doing compaction, and
> > the application runs several orders of magnitude slower.
> > 
> > Disabling sync compaction is a workaround (the default is "madvise"):
> > 
> > echo never > /sys/kernel/mm/transparent_hugepage/defrag
> > 
> > Previously I asked for help at https://lore.kernel.org/linux-mm/20210516085644.13800-1-hdanton@sina.com/
> > Now I have more information but still cannot pinpoint the root cause.
> > 
> > Thanks,
> > Hu Weiwen