From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1751295AbdCNShP (ORCPT ); Tue, 14 Mar 2017 14:37:15 -0400
Received: from mx1.redhat.com ([209.132.183.28]:38982 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751015AbdCNShO (ORCPT ); Tue, 14 Mar 2017 14:37:14 -0400
Date: Tue, 14 Mar 2017 19:37:06 +0100
From: Andrea Arcangeli
To: Mike Kravetz
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, linux-kernel,
	"Dr. David Alan Gilbert", qemu-devel@nongnu.org, Mike Rapoport
Subject: Re: [LSF/MM TOPIC][LSF/MM,ATTEND] shared TLB, hugetlb reservations
Message-ID: <20170314183706.GO27056@redhat.com>
References:
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.8.0 (2017-02-23)
X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16
	(mx1.redhat.com [10.5.110.25]); Tue, 14 Mar 2017 18:37:09 +0000 (UTC)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hello,

On Wed, Mar 08, 2017 at 05:30:55PM -0800, Mike Kravetz wrote:
> On 01/10/2017 03:02 PM, Mike Kravetz wrote:
> > Another more concrete topic is hugetlb reservations.  Michal Hocko
> > proposed the topic "mm patches review bandwidth", and brought up the
> > related subject of areas in need of attention from an architectural
> > POV.  I suggested that hugetlb reservations was one such area.  I'm
> > guessing it was introduced to solve a rather concrete problem.  However,
> > over time additional hugetlb functionality was added and the
> > capabilities of the reservation code was stretched to accommodate.
> > It would be good to step back and take a look at the design of this
> > code to determine if a rewrite/redesign is necessary.  Michal suggested
> > documenting the current design/code as a first step.  If people think
> > this is worth discussion at the summit, I could put together such a
> > design before the gathering.
>
> I attempted to put together a design/overview of how hugetlb reservations
> currently work.  Hopefully, this will be useful.

Another area of hugetlbfs that is not clear is the status of
MADV_REMOVE and the behavior of fallocate punch hole, which deviates
from the more standard shmem semantics. That might also be a topic of
interest related to your hugetlbfs topic, and marginally related to
userfaultfd.

The current status for anon, shmem and hugetlbfs is like this:

MADV_DONTNEED works: anon, !VM_SHARED shmem
MADV_DONTNEED doesn't work: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED
MADV_DONTNEED works but not guaranteed to fault: shmem VM_SHARED

MADV_REMOVE works: shmem VM_SHARED, hugetlbfs VM_SHARED
MADV_REMOVE doesn't work: anon, shmem !VM_SHARED, hugetlbfs !VM_SHARED

fallocate punch hole works: hugetlbfs VM_SHARED, hugetlbfs !VM_SHARED,
    shmem VM_SHARED
fallocate punch hole doesn't work: anon, shmem !VM_SHARED

So what happens in qemu is:

anon -> MADV_DONTNEED
shmem !VM_SHARED -> MADV_DONTNEED (fallocate punch hole wouldn't zap
    private pages, but it does on hugetlbfs)
shmem VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)
hugetlbfs !VM_SHARED -> fallocate punch hole (works for hugetlbfs but
    not for shmem !VM_SHARED)
hugetlbfs VM_SHARED -> fallocate punch hole (MADV_REMOVE would work too)

This means qemu has to carry around information on the type of memory
it got from the initial memblock setup, so at live migration time it
can zap the memory with the right call.
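To make the dispatch above concrete, it boils down to something like
the sketch below. This is not qemu's actual code: ram_block_zap() and
the mem_type enum are invented names just for illustration, only the
madvise/fallocate calls themselves are the real thing.

/*
 * Sketch only: ram_block_zap() and enum mem_type are hypothetical
 * names; the madvise/fallocate calls are the real interfaces.
 */
#define _GNU_SOURCE
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <stddef.h>

enum mem_type { MEM_ANON, MEM_SHMEM, MEM_HUGETLBFS };

static int ram_block_zap(void *addr, size_t len, int fd, off_t offset,
			 enum mem_type type, int vm_shared)
{
	switch (type) {
	case MEM_ANON:
		/* anon -> MADV_DONTNEED */
		return madvise(addr, len, MADV_DONTNEED);
	case MEM_SHMEM:
		if (!vm_shared)
			/*
			 * shmem !VM_SHARED -> MADV_DONTNEED, because a
			 * punch hole on the file wouldn't zap the private
			 * MAP_PRIVATE pages (unlike on hugetlbfs).
			 */
			return madvise(addr, len, MADV_DONTNEED);
		/* shmem VM_SHARED -> punch hole (MADV_REMOVE works too) */
		/* fall through */
	case MEM_HUGETLBFS:
		/*
		 * hugetlbfs (VM_SHARED or not) -> punch hole:
		 * MADV_DONTNEED doesn't work on hugetlbfs and
		 * MADV_REMOVE only covers the VM_SHARED case.
		 */
		return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}
	errno = EINVAL;
	return -1;
}

Note how the fd and the file offset have to be passed down even though
the caller only really cares about zapping a virtual range.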
(NOTE: such memory is not generated by userfaultfd UFFDIO_COPY, but it
was allocated and mapped and it must be zapped well before calling
userfaultfd the first time.)

To do this qemu uses fstatfs to find out which kind of memory it is
dealing with, so it can use the right call for that memory (a sketch
of that check is in the PS2 below). In short it would be better to
have something like a generic MADV_REMOVE that guarantees a
non-present fault after it succeeds, no matter what kind of memory is
mapped in the virtual range that has to be zapped.

The above is far from ideal from a userland developer perspective.
Overall fallocate punch hole covers the most cases, so to keep the
code simpler MADV_REMOVE ironically ends up never being used, even
though it provides a friendlier API for qemu than fallocate: the files
are always mapped, and the older code only dealt with virtual
addresses (before hugetlbfs and shmem entered the equation). Ideally
qemu wants to call the same madvise regardless of whether the memory
is anon, shmem or hugetlbfs, without having to carry around file
descriptors, file offsets and superblock types.

It's also not clear why MADV_DONTNEED doesn't work for hugetlbfs
!VM_SHARED mappings, and why fallocate punch hole also zaps private
cow-like pages from !VM_SHARED mappings (although if it didn't, it
would be impossible to zap those... so it's lucky that it does).

Thanks,
Andrea

PS. CC'ed also qemu-devel in case it may help clarify why things are
implemented the way they are in the postcopy live migration
hugetlbfs/shmem support and in the future patches for shmem/hugetlbfs
share=on.
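PS2: for reference, the fstatfs check mentioned above is essentially a
filesystem-magic lookup on the fd backing the mapping. A minimal
sketch (the helper name and the enum are made up; the magic constants
are the real ones from linux/magic.h):

#include <sys/vfs.h>
#include <linux/magic.h>

enum mem_type { MEM_ANON, MEM_SHMEM, MEM_HUGETLBFS };

/* Hypothetical helper: classify the memory backend behind an fd. */
static enum mem_type mem_backend_from_fd(int fd)
{
	struct statfs fs;

	if (fd < 0)
		return MEM_ANON;	/* no backing fd: anonymous memory */
	if (fstatfs(fd, &fs) != 0)
		return MEM_ANON;	/* shouldn't happen, treat as anon */
	if (fs.f_type == HUGETLBFS_MAGIC)
		return MEM_HUGETLBFS;
	if (fs.f_type == TMPFS_MAGIC)
		return MEM_SHMEM;
	return MEM_ANON;		/* other filesystems not handled here */
}

That result, plus whether the mapping is VM_SHARED, is what then
selects the call in the earlier sketch.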