From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751295AbdCNShP (ORCPT ); Tue, 14 Mar 2017 14:37:15 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38982 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751015AbdCNShO (ORCPT ); Tue, 14 Mar 2017 14:37:14 -0400
Date: Tue, 14 Mar 2017 19:37:06 +0100
From: Andrea Arcangeli
To: Mike Kravetz
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel,
	"Dr. David Alan Gilbert", qemu-devel@nongnu.org, Mike Rapoport
Subject: Re: [LSF/MM TOPIC][LSF/MM,ATTEND] shared TLB, hugetlb reservations
Message-ID: <20170314183706.GO27056@redhat.com>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.8.0 (2017-02-23)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.5.110.25]); Tue, 14 Mar 2017 18:37:09 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
> > Another more concrete topic is hugetlb reservations.  Michal Hocko
> > proposed the topic "mm patches review bandwidth", and brought up the
> > related subject of areas in need of attention from an architectural
> > POV.  I suggested that hugetlb reservations was one such area.  I'm
> > guessing it was introduced to solve a rather concrete problem.  However,
> > over time additional hugetlb functionality was added and the
> > capabilities of the reservation code was stretched to accommodate.
> > It would be good to step back and take a look at the design of this
> > code to determine if a rewrite/redesign is necessary.  Michal suggested
> > documenting the current design/code as a first step.  If people think
> > this is worth discussion at the summit, I could put together such a
> > design before the gathering.
>
> I attempted to put together a design/overview of how hugetlb reservations
> currently work.  Hopefully, this will be useful.

Another area of hugetlbfs that is not clear is the status of
MADV_REMOVE and the behavior of fallocate punch hole, which deviates
from the more standard shmem semantics. That might also be a topic of
interest related to your hugetlbfs topic, and marginally related to
userfaultfd.

The current status for anon, shmem and hugetlbfs is like this:

MADV_DONTNEED works: anon, !VM_SHARED shmem
MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED

MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED

fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
    shmem VM_SHARED
fallocate punch hole doesn't work: anon, shmem !VM_SHARED

So what happens in qemu is:

anon -> MADV_DONTNEED
shmem !VM_SHARED -> MADV_DONTNEED (fallocate punch hole wouldn't zap
    private pages, but it does on hugetlbfs)
shmem VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)
hugetlbfs !VM_SHARED -> fallocate punch hole (works for hugetlbfs but
    not for shmem !VM_SHARED)
hugetlbfs VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)

This means qemu has to carry around information on the type of memory
it got from the initial memblock setup, so at live migration time it
can zap the memory with the right call.
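To make the dispatch above concrete, it boils down to something like
the sketch below. This is not qemu's actual code: ram_block_zap() and
the mem_type enum are invented names just for illustration, only the
madvise/fallocate calls themselves are the real thing.

/*
 * Sketch only: ram_block_zap() and enum mem_type are hypothetical
 * names; the madvise/fallocate calls are the real interfaces.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <stddef.h>

enum mem_type { MEM_ANON, MEM_SHMEM, MEM_HUGETLBFS };

static int ram_block_zap(void *addr, size_t len, int fd, off_t offset,
			 enum mem_type type, int vm_shared)
{
	switch (type) {
	case MEM_ANON:
		/* anon -> MADV_DONTNEED */
		return madvise(addr, len, MADV_DONTNEED);
	case MEM_SHMEM:
		if (!vm_shared)
			/*
			 * shmem !VM_SHARED -> MADV_DONTNEED, because a
			 * punch hole on the file wouldn't zap the private
			 * MAP_PRIVATE pages (unlike on hugetlbfs).
			 */
			return madvise(addr, len, MADV_DONTNEED);
		/* shmem VM_SHARED -> punch hole (MADV_REMOVE works too) */
		/* fall through */
	case MEM_HUGETLBFS:
		/*
		 * hugetlbfs (VM_SHARED or not) -> punch hole:
		 * MADV_DONTNEED doesn't work on hugetlbfs and
		 * MADV_REMOVE only covers the VM_SHARED case.
		 */
		return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}
	errno = EINVAL;
	return -1;
}

Note how the fd and the file offset have to be passed down even though
the caller only really cares about zapping a virtual range.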
(NOTE: such memory is not generated by userfaultfd UFFDIO_COPY, but it
was allocated and mapped and it must be zapped well before calling
userfaultfd the first time.)

To do this qemu uses fstatfs to find out which kind of memory it is
dealing with, so it can use the right call for that memory (a sketch
of that check is in the PS2 below). In short it would be better to
have something like a generic MADV_REMOVE that guarantees a
non-present fault after it succeeds, no matter what kind of memory is
mapped in the virtual range that has to be zapped.

The above is far from ideal from a userland developer perspective.
Overall fallocate punch hole covers the most cases, so to keep the
code simpler MADV_REMOVE ironically ends up never being used, even
though it provides a friendlier API for qemu than fallocate: the files
are always mapped, and the older code only dealt with virtual
addresses (before hugetlbfs and shmem entered the equation). Ideally
qemu wants to call the same madvise regardless of whether the memory
is anon, shmem or hugetlbfs, without having to carry around file
descriptors, file offsets and superblock types.

It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
!VM_SHARED mappings, and why fallocate punch hole also zaps private
cow-like pages from !VM_SHARED mappings (although if it didn't, it
would be impossible to zap those... so it's lucky that it does).

Thanks,
Andrea

PS. CC'ed also qemu-devel in case it may help clarify why things are
implemented the way they are in the postcopy live migration
hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
share=on.
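PS2: for reference, the fstatfs check mentioned above is essentially a
filesystem-magic lookup on the fd backing the mapping. A minimal
sketch (the helper name and the enum are made up; the magic constants
are the real ones from linux/magic.h):

#include <sys/vfs.h>
#include <linux/magic.h>

enum mem_type { MEM_ANON, MEM_SHMEM, MEM_HUGETLBFS };

/* Hypothetical helper: classify the memory backend behind an fd. */
static enum mem_type mem_backend_from_fd(int fd)
{
	struct statfs fs;

	if (fd < 0)
		return MEM_ANON;	/* no backing fd: anonymous memory */
	if (fstatfs(fd, &fs) != 0)
		return MEM_ANON;	/* shouldn't happen, treat as anon */
	if (fs.f_type == HUGETLBFS_MAGIC)
		return MEM_HUGETLBFS;
	if (fs.f_type == TMPFS_MAGIC)
		return MEM_SHMEM;
	return MEM_ANON;		/* other filesystems not handled here */
}

That result, plus whether the mapping is VM_SHARED, is what then
selects the call in the earlier sketch.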