Subject: Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Date: Wed, 20 Apr 2022 09:12:43 +0200
To: Mike Kravetz, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: Michal Hocko, Peter Xu, Naoya Horiguchi, Aneesh Kumar K.V, Andrea Arcangeli,
 Kirill A. Shutemov, Davidlohr Bueso, Prakash Sangappa, James Houghton,
 Mina Almasry, Ray Fucillo, Andrew Morton
Message-ID: <592aaae5-2aa6-22a8-29e3-ec20f75945db@redhat.com>
References: <20220406204823.46548-1-mike.kravetz@oracle.com>
 <045a59a1-0929-a969-b184-1311f81504b8@redhat.com>
 <4ddf7d53-db45-4201-8ae0-095698ec7e1a@oracle.com>

On 20.04.22 00:50, Mike Kravetz wrote:
> On 4/8/22 02:26, David Hildenbrand wrote:
>>>> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
>>>> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>>>>
>>>> Sure, with thousands of processes sharing that memory, the size of page
>>>> tables required would increase with each and every process. But TBH,
>>>> that's in no way different to other file systems where we're even
>>>> dealing with PTE tables.
>>>
>>> The numbers for a real use case I am frequently quoted are something like:
>>>   1TB shared mapping, 10,000 processes sharing the mapping
>>>   4K PMD page per 1GB of shared mapping
>>>   4M saving for each sharing process
>>>   9,999 * 4M ~= 39GB savings
>>
>> 3.7% of all memory. Noticeable if the feature is removed? Yes. Do we
>> care about supporting such corner cases that result in a maintenance
>> burden? My take is a clear no.
>>
>>> However, if you look at commit 39dde65c9940c, which introduced huge pmd
>>> sharing, it states that performance rather than memory savings was the
>>> primary objective:
>>>
>>> "For hugetlb, the saving on page table memory is not the primary
>>> objective (as hugetlb itself already cuts down page table overhead
>>> significantly), instead, the purpose of using shared page table on hugetlb is
>>> to allow faster TLB refill and smaller cache pollution upon TLB miss.
>>>
>>> With PT sharing, pte entries are shared among hundreds of processes, the
>>> cache consumption used by all the page table is smaller and in return,
>>> application gets much higher cache hit ratio.
>>> One other effect is that cache hit ratio with hardware page walker
>>> hitting on pte in cache will be higher and this helps to reduce tlb miss
>>> latency. These two effects contribute to higher application performance."
>>>
>>> That 'makes sense', but I have never tried to measure any such performance
>>> benefit. It is easier to calculate the memory savings.
>>
>> It does make sense; but then, again, what's specific here about hugetlb?
>>
>> Most probably it was just easy to add to hugetlb, in contrast to other
>> types of shared memory.
>>
>>>> Which results in me wondering if
>>>>
>>>> a) We should simply use gigantic pages for such extreme use cases. Allows
>>>>    for freeing up more memory via vmemmap either way.
>>>
>>> The only problem with this is that many processors in use today have
>>> limited TLB entries for gigantic pages.
>>>
>>>> b) We should instead look into reclaiming reconstructable page tables.
>>>>    It's hard to imagine that each and every process accesses each and
>>>>    every part of the gigantic file all of the time.
>>>> c) We should instead establish a more generic page table sharing
>>>>    mechanism.
>>>
>>> Yes, I think that is the direction taken by the mshare() proposal. If we
>>> have a more generic approach, we can certainly start deprecating hugetlb
>>> pmd sharing.
>>
>> My strong opinion is to remove it ASAP and get something proper into place.
>
> No arguments about the complexity of this code. However, there will be some
> people who will notice if it is removed.

Yes, it should never have been added that way -- unfortunately.

> Whether or not we remove huge pmd sharing support, I would still like to
> address the scalability issue. To do so, taking i_mmap_rwsem in read mode
> for fault processing needs to go away. With that gone, the issue of faults
> racing with truncation needs to be addressed, as it depended on the fault
> code taking the mutex. At a high level this is fairly simple, but hugetlb
> reservations add to the complexity. This was not completely addressed in
> this series.

Okay.

> I will be sending out another RFC that more correctly addresses all the
> issues this series attempted to address. I am not discounting your opinion
> that we should get rid of huge pmd sharing. Rather, I would at least like
> to get some eyes on my approach to addressing the issue with reservations
> during fault and truncate races.

Makes sense to me. I agree that we should fix all of that. What I have
experienced is that pmd sharing over-complicates the situation quite a lot
and makes the code hard to follow. [Huge page reservation is another thing
I dislike, especially because it's no good in NUMA setups and we still have
to preallocate huge pages to make it work reliably.]

-- 
Thanks,

David / dhildenb
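
As a rough cross-check of the page-table arithmetic quoted above, here is a
minimal back-of-the-envelope sketch (plain userspace C, not kernel code). It
assumes the usual x86-64 geometry of 8-byte PMD entries, 512 entries per 4 KiB
PMD page and 2 MiB huge pages, and simply restates the two examples from the
thread rather than measuring anything:

#include <stdio.h>

int main(void)
{
	const unsigned long long KiB = 1024ULL;
	const unsigned long long MiB = 1024 * KiB;
	const unsigned long long GiB = 1024 * MiB;
	const unsigned long long TiB = 1024 * GiB;

	/* 4 TiB mapped with 2 MiB huge pages: PMD entries and the bytes
	 * they consume in a single process. */
	unsigned long long entries   = 4 * TiB / (2 * MiB);	/* 2097152 */
	unsigned long long pmd_bytes = entries * 8;		/* 16 MiB  */
	printf("4 TiB / 2 MiB: %llu PMD entries, %llu MiB of PMDs\n",
	       entries, pmd_bytes / MiB);

	/* 1 TiB shared mapping, 10,000 processes: one 4 KiB PMD page maps
	 * 1 GiB (512 entries * 2 MiB), so each process needs 1024 PMD
	 * pages = 4 MiB; sharing saves that for all but one process. */
	unsigned long long per_proc = (1 * TiB / GiB) * 4 * KiB;  /* 4 MiB  */
	unsigned long long savings  = 9999ULL * per_proc;	  /* ~39 GiB */
	printf("1 TiB, 10000 procs: %llu MiB per process, ~%llu GiB saved\n",
	       per_proc / MiB, savings / GiB);
	return 0;
}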