linux-kernel.vger.kernel.org archive mirror
From: Andrea Arcangeli <aarcange@redhat.com>
To: Prakash Sangappa <prakash.sangappa@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>,
	Dave Hansen <dave.hansen@intel.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Mike Rapoport <rppt@linux.vnet.ibm.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
Date: Fri, 16 Jun 2017 15:15:54 +0200
Message-ID: <20170616131554.GD11676@redhat.com>
In-Reply-To: <1031e0d4-cdbb-db8b-dae7-7c733921e20e@oracle.com>

Hello Prakash,

On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote:
> 
> 
> On 5/9/17 1:58 AM, Christoph Hellwig wrote:
> > On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
> >> Regarding #3 as a general feature, do we want to
> >> consider this and the complexity associated with the
> >> implementation?
> > We have to.  Given that no one has exclusive access to hugetlbfs
> > a mount option is fundamentally the wrong interface.
> 
> 
> A hugetlbfs filesystem may need to be mounted for exclusive use by
> an application. Note that the 'min_size' mount option was recently added
> to hugetlbfs; it reserves a minimum number of huge pages for that
> filesystem for use by an application. If a filesystem with min_size
> specified is not set up for exclusive use by an application, then the
> purpose of reserving huge pages is defeated. The min_size option was
> intended for applications like databases.
> 
> Also, I am investigating enabling hugetlbfs mounts within a user
> namespace's mount namespace. That would allow an application running
> as a non-root user to mount a hugetlbfs filesystem inside a namespace
> exclusively for its own use. For this, it seems 'min_size' should be
> subject to some user limits. Anyway, mounting inside user namespaces
> is a different discussion.
> 
> So, if a filesystem has to be set up for exclusive use by an application,
> then different mount options can be used for that filesystem.

Before userfaultfd I used an madvise that triggered SIGBUS. Aside from
performance being much lower than with userfaultfd (because of the
return to userland, the SIGBUS handling, and entering the kernel again
to communicate with a memory manager through a pipe), it couldn't work
reliably: you don't get exact information on the virtual address that
triggered the fault if the faulting access happens in a copy-user inside
some random syscall. Instead of a SIGBUS, the syscall returns some random
error that depends on the syscall. So it couldn't work transparently to
the app as far as syscalls and get_user_pages drivers were concerned.

With your solution, if you pass a corrupted pointer to a random read()
syscall you're going to get an error, but presumably you already handle
any syscall error and stop the app.
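A minimal user-space sketch of that difference, under a stated assumption: one page of an empty temporary file, mapped past its end-of-file, stands in for a hugetlbfs page that the proposed 'noautofill' option would refuse to fill (both accesses end in SIGBUS). A direct load delivers SIGBUS with the exact address in si_addr, while passing the same pointer to write(2) makes the kernel's copy-user fail and the syscall return EFAULT, with no signal and no address. The names probe_unbacked_page() and struct probe_result are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct probe_result {           /* hypothetical result record */
	int got_sigbus;         /* direct load raised SIGBUS */
	int exact_addr;         /* si_addr matched the faulting pointer */
	int write_errno;        /* errno from write(2) on the same pointer */
};

static sigjmp_buf probe_env;
static void *volatile fault_addr;

static void on_sigbus(int sig, siginfo_t *si, void *uctx)
{
	(void)sig;
	(void)uctx;
	fault_addr = si->si_addr;       /* exact address of the bad access */
	siglongjmp(probe_env, 1);       /* skip the faulting load */
}

/* Map one page of an empty temp file: every access is beyond EOF. */
static int probe_unbacked_page(struct probe_result *res)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	memset(res, 0, sizeof(*res));

	FILE *tmp = tmpfile();
	if (!tmp)
		return -1;
	char *p = mmap(NULL, (size_t)page, PROT_READ, MAP_SHARED, fileno(tmp), 0);
	if (p == MAP_FAILED) {
		fclose(tmp);
		return -1;
	}

	struct sigaction sa, old;
	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	/* 1) Access from user mode: SIGBUS, and the handler sees si_addr. */
	if (sigsetjmp(probe_env, 1) == 0) {
		(void)*(volatile char *)(p + 100);
	} else {
		res->got_sigbus = 1;
		res->exact_addr = (fault_addr == (void *)(p + 100));
	}
	sigaction(SIGBUS, &old, NULL);

	/* 2) Same address handed to a syscall: the copy-user in the kernel
	 *    fails, write(2) returns EFAULT and no signal is delivered. */
	int fds[2];
	if (pipe(fds) == 0) {
		if (write(fds[1], p + 100, 1) < 0)
			res->write_errno = errno;
		close(fds[0]);
		close(fds[1]);
	}

	munmap(p, (size_t)page);
	fclose(tmp);
	return 0;
}
```

On Linux, calling probe_unbacked_page(&r) should leave r.got_sigbus and r.exact_addr set and r.write_errno equal to EFAULT: only the direct access tells a handler where the fault happened, which is why the syscall path cannot be made transparent.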

This is a special case because you don't care about performance, and
you don't mind syscalls like read() returning random EFAULT errors.

This mount option seems non-intrusive enough, and hugetlbfs is quite
special already, so I'm not particularly concerned that it's one more
special tweak.

If it were enough to convert the SIGBUS into a (killable) process
hang, you could still use uffd and there would be no need to send the
uffd to a manager: you'd find the corrupting buggy process stuck in
handle_userfault().
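A minimal sketch of that "killable hang" mode, with assumptions: anonymous memory replaces the hugetlbfs mapping, and a forked child plays the buggy process. The child registers a missing-page range with userfaultfd and never reads the descriptor, so its first touch blocks in handle_userfault(); the parent sees it still alive after a short delay and then kills it with SIGKILL. userfaultfd is often unavailable to unprivileged users (the vm.unprivileged_userfaultfd sysctl, container seccomp filters), so the hypothetical helper demo_uffd_hang() reports that case separately.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <signal.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 if the child was still blocked and got
 * killed, 0 if userfaultfd is unavailable here, -1 on anything else. */
static int demo_uffd_hang(void)
{
	pid_t pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		long page = sysconf(_SC_PAGESIZE);
		assert(page > 0);
		int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC);
		if (uffd < 0)
			_exit(2);                       /* not available */
		struct uffdio_api api = { .api = UFFD_API, .features = 0 };
		if (ioctl(uffd, UFFDIO_API, &api) < 0)
			_exit(2);
		char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
		               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			_exit(3);
		struct uffdio_register reg = {
			.range = { .start = (uintptr_t)p, .len = (uint64_t)page },
			.mode = UFFDIO_REGISTER_MODE_MISSING,
		};
		if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
			_exit(2);
		/* Nobody ever reads uffd: this store blocks in handle_userfault(). */
		*(volatile char *)p = 1;
		_exit(4);                               /* not expected to get here */
	}

	usleep(200 * 1000);                             /* give the child time to fault */
	int status;
	pid_t w = waitpid(pid, &status, WNOHANG);
	if (w == pid)                                   /* child exited on its own */
		return (WIFEXITED(status) && WEXITSTATUS(status) == 2) ? 0 : -1;
	if (w != 0)
		return -1;
	kill(pid, SIGKILL);                             /* the hang is killable */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) ? 1 : -1;
}
```

The 200 ms delay is an arbitrary choice for the sketch; a real debugging setup would instead inspect the stuck task (for example its kernel stack) before killing it.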

As an alternative to the mount option, we could consider adding
UFFD_FEATURE_SIGBUS, which tells handle_userfault() to simply return
VM_FAULT_SIGBUS in the presence of a page fault event. You'd still get
weird EFAULTs or erratic return values from syscalls, so it would only
be usable for your robustness feature. Then you could also use
UFFDIO_COPY to fill the memory atomically, which runs faster than a page
fault (an fallocate punch hole is still required to zap it).
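A sketch of how that feature would be used from user space, assuming kernel headers that define UFFD_FEATURE_SIGBUS (a flag of that name was later added upstream) and using anonymous memory instead of hugetlbfs. For anonymous memory, madvise(MADV_DONTNEED) plays the role that fallocate(FALLOC_FL_PUNCH_HOLE) plays on hugetlbfs. The helper names demo_uffd_sigbus(), run_checks() and touch_raises_sigbus() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static sigjmp_buf touch_env;

static void on_sigbus(int sig)
{
	(void)sig;
	siglongjmp(touch_env, 1);       /* abandon the faulting load */
}

/* Read one byte; returns 1 if the load raised SIGBUS, 0 otherwise. */
static int touch_raises_sigbus(const char *addr, char *out)
{
	if (sigsetjmp(touch_env, 1) != 0)
		return 1;
	*out = *(const volatile char *)addr;
	return 0;
}

static int run_checks(int uffd, char *p, char *src, long page)
{
	struct sigaction sa, old;
	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGBUS, &sa, &old);

	char c = 0;
	int ok = 1;
	/* 1) Missing page: SIGBUS right away instead of blocking. */
	ok &= touch_raises_sigbus(p, &c) == 1;

	/* 2) Fill the page atomically with UFFDIO_COPY; now it reads fine. */
	memset(src, 'x', (size_t)page);
	struct uffdio_copy cp = {
		.dst = (uintptr_t)p, .src = (uintptr_t)src,
		.len = (uint64_t)page, .mode = 0,
	};
	ok &= ioctl(uffd, UFFDIO_COPY, &cp) == 0 && cp.copy == page;
	ok &= touch_raises_sigbus(p, &c) == 0 && c == 'x';

	/* 3) Zap it: the next access is a missing fault again. */
	ok &= madvise(p, (size_t)page, MADV_DONTNEED) == 0;
	ok &= touch_raises_sigbus(p, &c) == 1;

	sigaction(SIGBUS, &old, NULL);
	return ok;
}

/* Returns 1 if all three steps behave as described, 0 if userfaultfd or
 * UFFD_FEATURE_SIGBUS is unavailable, -1 otherwise. */
static int demo_uffd_sigbus(void)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0)
		return 0;
	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_SIGBUS };
	if (ioctl(uffd, UFFDIO_API, &api) < 0) {
		close(uffd);
		return 0;
	}

	char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
	               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *src = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
	                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)p, .len = (uint64_t)page },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	int ret = -1;
	if (p != MAP_FAILED && src != MAP_FAILED &&
	    ioctl(uffd, UFFDIO_REGISTER, &reg) == 0)
		ret = run_checks(uffd, p, src, page) ? 1 : -1;

	if (p != MAP_FAILED)
		munmap(p, (size_t)page);
	if (src != MAP_FAILED)
		munmap(src, (size_t)page);
	close(uffd);
	return ret;
}
```

No thread ever reads the uffd here: with the SIGBUS feature the faulting access fails immediately, and UFFDIO_COPY is only used as a fast, atomic way to populate the page.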

Adding a single "if (ctx->features & UFFD_FEATURE_SIGBUS) goto out;"
branch to handle_userfault() for this corner case isn't great, whereas
the hugetlbfs mount option costs handle_userfault() absolutely nothing,
which is primarily why I'm not against it. That said, the cost of the
branch wouldn't be measurable, so it would also be OK to add such a
feature.

Thanks,
Andrea
