Date: Tue, 11 Feb 2020 06:42:52 -0800
From: Roman Gushchin
To: Gavin Shan
Subject: Re: [RFC PATCH] mm/vmscan: Don't round up scan size for online memory cgroup
Message-ID: <20200211144252.GA480760@tower.DHCP.thefacebook.com>
References: <20200210121445.711819-1-gshan@redhat.com>
 <20200210161721.GA167254@tower.DHCP.thefacebook.com>
 <9919b674-244d-0a55-c842-b0661585f9e2@redhat.com>
 <20200211013118.GA147346@carbon.lan>
 <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>
In-Reply-To: <63c5d402-ec1e-2935-7f16-8e2aed047c7c@redhat.com>
Content-Type: text/plain; charset=us-ascii
On Tue, Feb 11, 2020 at 01:17:47PM +1100, Gavin Shan wrote:
> Hi Roman,
>
> On 2/11/20 12:31 PM, Roman Gushchin wrote:
> > On Tue, Feb 11, 2020 at 10:55:53AM +1100, Gavin Shan wrote:
> > > On 2/11/20 3:17 AM, Roman Gushchin wrote:
> > > > On Mon, Feb 10, 2020 at 11:14:45PM +1100, Gavin Shan wrote:
> > > > > commit 68600f623d69 ("mm: don't miss the last page because of
> > > > > round-off error") makes the scan size round up to @denominator
> > > > > regardless of the memory cgroup's state, online or offline. This
> > > > > affects the overall reclaiming behavior: the corresponding LRU
> > > > > list is eligible for reclaiming only when its size logically
> > > > > right shifted by @sc->priority is bigger than zero in the former
> > > > > formula (non-roundup one).
> > > >
> > > > Not sure I fully understand, but wasn't it so before 68600f623d69 too?
> > > It's correct that "(non-roundup one)" is a typo and should have been
> > > dropped. Will be corrected in v2 if needed.
> >
> > Thanks!
> >
> > > > > For example, the inactive anonymous LRU list should have at
> > > > > least 0x4000 pages to be eligible for reclaiming when we have
> > > > > 60/12 for swappiness/priority and without taking the
> > > > > scan/rotation ratio into account. After the roundup is applied,
> > > > > the inactive anonymous LRU list becomes eligible for reclaiming
> > > > > when its size is bigger than or equal to 0x1000 in the same
> > > > > condition.
> > > > >
> > > > >   (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
> > > > >   ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
> > > > >
> > > > > aarch64 has a 512MB huge page size when the base page size is
> > > > > 64KB. The memory cgroup that has a huge page is always eligible
> > > > > for reclaiming in that case. The reclaiming is likely to stop
> > > > > after the huge page is reclaimed, meaning the subsequent
> > > > > @sc->priority and memory cgroups will be skipped. It changes the
> > > > > overall reclaiming behavior. This fixes the issue by applying the
> > > > > roundup to offlined memory cgroups only, to give more preference
> > > > > to reclaiming memory from offlined memory cgroups. It sounds
> > > > > reasonable as that memory is likely to be useless.
> > > >
> > > > So is the problem that relatively small memory cgroups are getting
> > > > reclaimed at default priority, whereas before they were skipped?
> > >
> > > Yes, you're correct. There are two dimensions for global reclaim:
> > > priority (sc->priority) and memory cgroup. The scan/reclaim is
> > > carried out by iterating over these two dimensions until the
> > > reclaimed pages are enough. If the roundup is applied to the current
> > > memory cgroup and occasionally helps to reclaim enough memory, the
> > > subsequent priorities and memory cgroups will be skipped.
> > >
> > > > > The issue was found by starting up 8 VMs on an Ampere Mustang
> > > > > machine, which has 8 CPUs and 16 GB memory. Each VM is given 2
> > > > > vCPUs and 2GB memory. 784MB swap space is consumed after these
> > > > > 8 VMs are completely up. Note that KSM is disabled while THP is
> > > > > enabled in the testing. With this applied, the consumed swap
> > > > > space decreased to 60MB.
> > > > >
> > > > >          total   used   free   shared   buff/cache   available
> > > > >   Mem:   16196  10065   2049       16         4081        3749
> > > > >   Swap:   8175    784   7391
> > > > >
> > > > >          total   used   free   shared   buff/cache   available
> > > > >   Mem:   16196  11324   3656       24         1215        2936
> > > > >   Swap:   8175     60   8115
> > > >
> > > > Does it lead to any performance regressions? Or is it only about
> > > > increased swap usage?
> > >
> > > Apart from swap usage, it also caused a performance downgrade in my
> > > case. With your patch (68600f623d69) included, it took 264 seconds
> > > to bring up 8 VMs. However, 236 seconds are used to do the same
> > > thing with my patch applied on top of yours. That is a 10%
> > > performance downgrade. It's the reason why I had a stable tag.
> >
> > I see...
>
> I will put these data into the commit log of v2, which will be posted
> shortly.
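To make the round-off arithmetic above concrete, here is a minimal userspace
C sketch (not kernel code). The div_round_up() helper mirrors the kernel's
DIV64_U64_ROUND_UP(); the 60/140/1 split, priority 12, and the two LRU sizes
are taken straight from the example in the commit message.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Mirrors the kernel's DIV64_U64_ROUND_UP(). */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

int main(void)
{
	const uint64_t fraction = 60;              /* anon share, swappiness 60 */
	const uint64_t denominator = 60 + 140 + 1; /* 201, as in the example */
	const int priority = 12;                   /* default sc->priority */
	const uint64_t lru_sizes[] = { 0x1000, 0x4000 };

	for (size_t i = 0; i < sizeof(lru_sizes) / sizeof(lru_sizes[0]); i++) {
		uint64_t scan = lru_sizes[i] >> priority;

		/* Rounding down skips the 0x1000 list; rounding up scans it. */
		printf("lru=%#" PRIx64 " round-down=%" PRIu64 " round-up=%" PRIu64 "\n",
		       lru_sizes[i],
		       scan * fraction / denominator,
		       div_round_up(scan * fraction, denominator));
	}
	return 0;
}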
> > > > >
> > > > > Fixes: 68600f623d69 ("mm: don't miss the last page because of round-off error")
> > > > > Cc: # v4.20+
> > > > > Signed-off-by: Gavin Shan
> > > > > ---
> > > > >  mm/vmscan.c | 9 ++++++---
> > > > >  1 file changed, 6 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index c05eb9efec07..876370565455 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2415,10 +2415,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
> > > > >  			/*
> > > > >  			 * Scan types proportional to swappiness and
> > > > >  			 * their relative recent reclaim efficiency.
> > > > > -			 * Make sure we don't miss the last page
> > > > > -			 * because of a round-off error.
> > > > > +			 * Make sure we don't miss the last page on
> > > > > +			 * the offlined memory cgroups because of a
> > > > > +			 * round-off error.
> > > > >  			 */
> > > > > -			scan = DIV64_U64_ROUND_UP(scan * fraction[file],
> > > > > +			scan = mem_cgroup_online(memcg) ?
> > > > > +				div64_u64(scan * fraction[file], denominator) :
> > > > > +				DIV64_U64_ROUND_UP(scan * fraction[file],
> > > > >  						   denominator);
> > > >
> > > > It looks a bit strange to round up for offline and basically down
> > > > for everything else. So maybe it's better to return to something
> > > > like the very first version of the patch:
> > > > https://www.spinics.net/lists/kernel/msg2883146.html ?
> > > > For memcg reclaim reasons we do care only about an edge case with
> > > > few pages.
> > > >
> > > > But overall it's not obvious to me why rounding up is worse than
> > > > rounding down. Maybe we should average down but accumulate the
> > > > remainder? Creating an implicit bias for small memory cgroups
> > > > sounds groundless.
> > >
> > > I don't think the v1 patch works for me either. The logic in v1
> > > isn't too much different from commit 68600f623d69. v1 has a
> > > selective roundup, but the current code has a forced roundup. With
> > > 68600f623d69 reverted and your v1 patch applied, it took 273 seconds
> > > to bring up 8 VMs and 1752MB of swap was used. It looks worse than
> > > 68600f623d69.
> > >
> > > Yeah, it's not reasonable to have a bias on all memory cgroups
> > > regardless of their state. I do think it's still right to give bias
> > > to offlined memory cgroups.
> >
> > I don't think so, it really depends on the workload. Imagine systemd
> > restarting a service due to some update or with other arguments.
> > Almost the entire pagecache is relevant and can be reused by a new
> > cgroup.
>
> Indeed, it depends on the workload. This patch is to revert
> 68600f623d69 for online memory cgroups, but keep the logic for offlined
> memory cgroups to avoid breaking your case.
>
> There is something which might be unrelated to discuss here: the
> pagecache could be backed by a low-speed (HDD) or high-speed (SSD)
> medium, so the cost to fetch pages from disk to memory isn't equal,
> meaning we need some kind of bias during reclaiming. It seems to be
> something missing from the current implementation.
>
> > > So the point is we need to take care of the memory cgroup's state
> > > and apply the bias to offlined ones only. The offlined memory
> > > cgroup is going to die or is already dead. It's unlikely for its
> > > memory to be used by someone, but still possible.
> > > So it's reasonable to squeeze the used memory of an offlined memory
> > > cgroup hard if possible.
> >
> > Anyway, I think your version is good to mitigate the regression.
> > So, please feel free to add
> > Acked-by: Roman Gushchin
>
> Thanks, Roman! It will be included in v2.
>
> > But I think we need something more clever long-term: e.g. accumulate
> > the leftover from the division and add it to the next calculation.
> >
> > If you can test such an approach on your workload, that would be nice.
>
> Yeah, we need something smart in the long run. Let's see if I can
> sort/test it out and then come back to you.

Perfect, thank you!

Roman
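The long-term suggestion above (average down but accumulate the leftover from
the division into the next calculation) could look roughly like the sketch
below. It is only a sketch: scan_with_carry() and its leftover parameter are
hypothetical, nothing like them exists in mm/vmscan.c today, and the carried
remainder would have to be stored somewhere per-lruvec.

/*
 * Hypothetical sketch: divide rounding down, but carry the remainder
 * into the next calculation so that small cgroups are neither
 * systematically skipped nor given a blanket round-up bias.
 */
static unsigned long scan_with_carry(unsigned long scan,
				     unsigned long fraction,
				     unsigned long denominator,
				     unsigned long *leftover)
{
	unsigned long long total = (unsigned long long)scan * fraction + *leftover;

	*leftover = total % denominator;  /* carried into the next pass */
	return total / denominator;
}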