From mboxrd@z Thu Jan 1 00:00:00 1970
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753691AbdBVB14 (ORCPT ); Tue, 21 Feb 2017 20:27:56 -0500
Received: from mx0a-00082601.pphosted.com ([67.231.145.42]:41765 "EHLO
	mx0a-00082601.pphosted.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1753016AbdBVB1u (ORCPT ); Tue, 21 Feb 2017 20:27:50 -0500
Date: Tue, 21 Feb 2017 17:27:13 -0800
From: Shaohua Li
To: Minchan Kim
CC: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Kernel-team@fb.com,
	danielmicay@gmail.com, mhocko@suse.com, hughd@google.com,
	hannes@cmpxchg.org, riel@redhat.com, mgorman@techsingularity.net,
	akpm@linux-foundation.org
Subject: Re: [PATCH V2 7/7] mm: add a separate RSS for MADV_FREE pages
Message-ID: <20170222012712.GA97403@shli-mbp.local>
References: <123396e3b523e8716dfc6fc87a5cea0c124ff29d.1486163864.git.shli@fb.com>
	<20170222004604.GA14056@blaptop>
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Disposition: inline
In-Reply-To: <20170222004604.GA14056@blaptop>
User-Agent: Mutt/1.6.1 (2016-04-27)
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Feb 22, 2017 at 09:46:05AM +0900, Minchan Kim wrote:
> Hi Shaohua,
>
> On Fri, Feb 03, 2017 at 03:33:23PM -0800, Shaohua Li wrote:
> > Add a separate RSS for MADV_FREE pages. The pages are charged into
> > MM_ANONPAGES (because they are mapped anon pages) and also charged into
> > the MM_LAZYFREEPAGES. /proc/pid/statm will have an extra field to
> > display the RSS, which userspace can use to determine the RSS excluding
> > MADV_FREE pages.
>
> I'm not sure statm is the right place. Given the definition of statm and
> your use case, it would be the right place, but when I look at "status", it
> already shows RssAnon, RssFile and RssShmem, so I thought we could add
> RssLazy to it. It would be more consistent if it doesn't add big overhead.
>
> >
> > The basic idea is to increment the RSS in madvise and decrement in unmap
> > or page reclaim. There is one limitation.
> > If a page is shared by two
> > processes, since madvise only has the mm context of the current process,
> > it isn't convenient to charge the RSS for both processes. So we don't
> > charge the RSS if the mapcount isn't 1. On the other hand, fork can make a
> > MADV_FREE page shared by two processes. To make things consistent, we
> > uncharge the RSS from the source mm in fork.
>
> I don't understand why we need a new flag.
>
> What's the problem with handling it like normal anon|file|swapent|shmem?
> IOW, we can increase in madvise context and increase for child in copy_one_pte
> if the pte is still not dirty. And then decrease it in zap_pte_range/
> try_to_unmap_one if it finds it's dirty or discardable.
>
> Although it's shared by fork, VM can discard it if the processes don't
> make it dirty.

The thing is we could madvise the same page twice. The madvise context can't
guarantee that we move the page to the inactive file list, so we could wrongly
increase the count.

> >
> > A new flag is added to indicate if a page is accounted into the RSS. We
> > can't use SwapBacked flag to do the determination because we can't
> > guarantee the page has SwapBacked flag cleared in madvise. We are
> > reusing mappedtodisk flag which should not be set for Anon pages.
> >
> > There are a couple of other places we need to uncharge the RSS,
> > activate_page and mark_page_accessed. activate_page is used by swap,
> > where MADV_FREE pages are already not in lazyfree state before going
> > into swap. mark_page_accessed is mainly used for file pages, but there
> > are several places it's used by anonymous pages. I fixed gup, but not
> > some gpu drivers and kvm. If the drivers use MADV_FREE, we might have
> > imprecise RSS accounting.
> >
> > Please note, the accounting is never going to be precise. MADV_FREE page
> > could be written by userspace without notification to the kernel. The
> > page can't be reclaimed like other clean lazyfree pages. The page isn't
> > real lazyfree page.
> > But since the kernel isn't aware of this, the page is
> > still accounted as lazyfree, thus the accounting could be incorrect.
>
> Right. Lazyfree is not accurate without CoW, where the point is to decrease
> the lazyfree rss count when the store happens, so we might be tempted to make
> it CoW at the cost of performance degradation, but it's still not accurate
> without making mark_page_accessed aware of each mm context, which is the
> hard part. So, I agree this stat is useful but don't want to make it
> complicated.

Yes, it could only be accurate with the extra page fault cost, but apparently
nobody wants to pay for it. I talked to the jemalloc guys here. They have
concerns about the accounting since it's not accurate. I'll drop the accounting
patches in the next post. The only interface that can export accurate info is
/proc/pid/smaps, so we will probably go with that.

Thanks,
Shaohua
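
As a rough illustration of the smaps route mentioned above, here is a minimal Python sketch (not part of the patch series) that sums per-mapping fields from smaps-formatted text to get the RSS excluding lazily freed pages. It assumes the kernel reports a `LazyFree:` line for each mapping in /proc/pid/smaps, which this thread only proposes; the helper names `sum_smaps_field` and `rss_excluding_lazyfree` are hypothetical.

```python
def sum_smaps_field(smaps_text, field="LazyFree"):
    """Sum one per-mapping kB field over every mapping in smaps-style text.

    Assumption: each mapping carries a line such as "LazyFree:    60 kB".
    Mappings that lack the field simply contribute nothing.
    """
    prefix = field + ":"
    total_kb = 0
    for line in smaps_text.splitlines():
        if line.startswith(prefix):
            parts = line[len(prefix):].split()
            if parts:
                total_kb += int(parts[0])  # the value is reported in kB
    return total_kb


def rss_excluding_lazyfree(smaps_text):
    """RSS in kB minus the pages currently reported as lazily freeable."""
    rss_kb = sum_smaps_field(smaps_text, "Rss")
    lazy_kb = sum_smaps_field(smaps_text, "LazyFree")
    return rss_kb - lazy_kb
```

To try it on a live process, read the contents of /proc/<pid>/smaps into a string and pass it to `rss_excluding_lazyfree`. On a kernel without the assumed `LazyFree:` line, the function returns the plain RSS sum.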