[PATCH RFC] hugetlbfs 'noautofill' mount option

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC] hugetlbfs 'noautofill' mount option
       [not found] <326e38dd-b4a8-e0ca-6ff7-af60e8045c74@oracle.com>
@ 2017-05-01 18:00 ` Prakash Sangappa
  2017-05-02 10:53   ` Anshuman Khandual
  2017-05-02 21:32   ` Dave Hansen
  0 siblings, 2 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-01 18:00 UTC (permalink / raw)
  To: linux-kernel, linux-mm

Some applications like a database use hugetblfs for performance
reasons. Files on hugetlbfs filesystem are created and huge pages
allocated using fallocate() API. Pages are deallocated/freed using
fallocate() hole punching support that has been added to hugetlbfs.
These files are mmapped and accessed by many processes as shared memory.
Such applications keep track of which offsets in the hugetlbfs file have
pages allocated.

Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS.  However, currently when a hole in the file is
accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the hole
in the file. This may not be the desired behavior for applications like the
database that want to explicitly manage page allocations of hugetlbfs files.

This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
pages should not be allocated at page fault time when accessed thru mmapped
address.

Signed-off-by: Prakash <prakash.sangappa@oracle.com>
---
fs/hugetlbfs/inode.c    | 11 ++++++++++-
  include/linux/hugetlb.h |  1 +
  mm/hugetlb.c            |  5 +++++
  3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 8f96461..8342ee9 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -53,6 +53,7 @@ struct hugetlbfs_config {
      long    nr_inodes;
      struct hstate *hstate;
      long    min_hpages;
+    long    noautofill;
  };

  struct hugetlbfs_inode_info {
@@ -71,7 +72,7 @@ enum {
      Opt_size, Opt_nr_inodes,
      Opt_mode, Opt_uid, Opt_gid,
      Opt_pagesize, Opt_min_size,
-    Opt_err,
+    Opt_noautofill, Opt_err,
  };

  static const match_table_t tokens = {
@@ -82,6 +83,7 @@ static const match_table_t tokens = {
      {Opt_gid,    "gid=%u"},
      {Opt_pagesize,    "pagesize=%s"},
      {Opt_min_size,    "min_size=%s"},
+    {Opt_noautofill,    "noautofill"},
      {Opt_err,    NULL},
  };

@@ -1109,6 +1111,11 @@ hugetlbfs_parse_options(char *options, struct
hugetlbfs_config *pconfig)
              break;
          }

+        case Opt_noautofill: {
+            pconfig->noautofill = 1;
+            break;
+        }
+
          default:
              pr_err("Bad mount option: \"%s\"\n", p);
              return -EINVAL;
@@ -1157,6 +1164,7 @@ hugetlbfs_fill_super(struct super_block *sb, void
*data, int silent)
      config.mode = 0755;
      config.hstate = &default_hstate;
      config.min_hpages = -1; /* No default minimum size */
+    config.noautofill = 0;
      ret = hugetlbfs_parse_options(data, &config);
      if (ret)
          return ret;
@@ -1170,6 +1178,7 @@ hugetlbfs_fill_super(struct super_block *sb, void
*data, int silent)
      sbinfo->max_inodes = config.nr_inodes;
      sbinfo->free_inodes = config.nr_inodes;
      sbinfo->spool = NULL;
+    sbinfo->noautofill = config.noautofill;
      /*
       * Allocate and initialize subpool if maximum or minimum size is
       * specified.  Any needed reservations (for minimim size) are taken
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 503099d..2f37e0c 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -259,6 +259,7 @@ struct hugetlbfs_sb_info {
      spinlock_t    stat_lock;
      struct hstate *hstate;
      struct hugepage_subpool *spool;
+    int    noautofill; /* don't allocate page to fill hole at fault time */
  };

  static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct
super_block *sb)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a7aa811..11655ef 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3715,6 +3715,11 @@ static int hugetlb_no_page(struct mm_struct *mm,
struct vm_area_struct *vma,
              goto out;
          }

+        if (HUGETLBFS_SB(mapping->host->i_sb)->noautofill) {
+            ret = VM_FAULT_SIGBUS;
+            goto out;
+        }
+
          page = alloc_huge_page(vma, address, 0);
          if (IS_ERR(page)) {
              ret = PTR_ERR(page);
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa
@ 2017-05-02 10:53   ` Anshuman Khandual
  2017-05-02 16:07     ` Prakash Sangappa
  2017-05-02 21:32   ` Dave Hansen
  1 sibling, 1 reply; 16+ messages in thread
From: Anshuman Khandual @ 2017-05-02 10:53 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel, linux-mm

On 05/01/2017 11:30 PM, Prakash Sangappa wrote:
> Some applications like a database use hugetblfs for performance
> reasons. Files on hugetlbfs filesystem are created and huge pages
> allocated using fallocate() API. Pages are deallocated/freed using
> fallocate() hole punching support that has been added to hugetlbfs.
> These files are mmapped and accessed by many processes as shared memory.
> Such applications keep track of which offsets in the hugetlbfs file have
> pages allocated.
> 
> Any access to mapped address over holes in the file, which can occur due

s/mapped/unmapped/ ^ ?

> to bugs in the application, is considered invalid and expect the process
> to simply receive a SIGBUS.  However, currently when a hole in the file is
> accessed via the mapped address, kernel/mm attempts to automatically
> allocate a page at page fault time, resulting in implicitly filling the
> hole

But this is expected when you try to control the file allocation from
a mapped address. Any changes while walking past or writing the range
in the memory mapped should reflect exactly in the file on the disk.
Why its not a valid behavior ?

> in the file. This may not be the desired behavior for applications like the
> database that want to explicitly manage page allocations of hugetlbfs
> files.
> 
> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
> pages should not be allocated at page fault time when accessed thru mmapped
> address.

When the page should be allocated for mapping ?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-02 10:53   ` Anshuman Khandual
@ 2017-05-02 16:07     ` Prakash Sangappa
  0 siblings, 0 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-02 16:07 UTC (permalink / raw)
  To: Anshuman Khandual, linux-kernel, linux-mm



On 5/2/17 3:53 AM, Anshuman Khandual wrote:
> On 05/01/2017 11:30 PM, Prakash Sangappa wrote:
>> Some applications like a database use hugetblfs for performance
>> reasons. Files on hugetlbfs filesystem are created and huge pages
>> allocated using fallocate() API. Pages are deallocated/freed using
>> fallocate() hole punching support that has been added to hugetlbfs.
>> These files are mmapped and accessed by many processes as shared memory.
>> Such applications keep track of which offsets in the hugetlbfs file have
>> pages allocated.
>>
>> Any access to mapped address over holes in the file, which can occur due
> s/mapped/unmapped/ ^ ?

It is 'mapped' address.

>
>> to bugs in the application, is considered invalid and expect the process
>> to simply receive a SIGBUS.  However, currently when a hole in the file is
>> accessed via the mapped address, kernel/mm attempts to automatically
>> allocate a page at page fault time, resulting in implicitly filling the
>> hole
> But this is expected when you try to control the file allocation from
> a mapped address. Any changes while walking past or writing the range
> in the memory mapped should reflect exactly in the file on the disk.
> Why its not a valid behavior ?
Sure, that is a valid behavior. However, hugetlbfs is a pesudo filesystem
and the purpose is for applications to use hugepage memory. The contents
of these filesystem are not backed by disk nor are they swapped out.

The proposed new behavior is only applicable for hugetlbfs filesystem 
mounted
with the new 'noautofill' mount option. The file's page allocation/free 
are managed
using the 'fallocate()' API.

For hugetlbfs filesystems mounted without this option, there is no 
change in behavior.

>> in the file. This may not be the desired behavior for applications like the
>> database that want to explicitly manage page allocations of hugetlbfs
>> files.
>>
>> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
>> pages should not be allocated at page fault time when accessed thru mmapped
>> address.
> When the page should be allocated for mapping ?
The application would allocate/free file pages using the fallocate() API.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa
  2017-05-02 10:53   ` Anshuman Khandual
@ 2017-05-02 21:32   ` Dave Hansen
  2017-05-02 23:34     ` Prakash Sangappa
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2017-05-02 21:32 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel, linux-mm

On 05/01/2017 11:00 AM, Prakash Sangappa wrote:
> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
> pages should not be allocated at page fault time when accessed thru mmapped
> address.

I think the main argument against doing something like this is further
specializing hugetlbfs.  I was really hoping that userfaultfd would be
usable for your needs here.

Could you elaborate on other options that you considered?  Did you look
at userfaultfd?  What about an madvise() option that disallows backing
allocations?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-02 21:32   ` Dave Hansen
@ 2017-05-02 23:34     ` Prakash Sangappa
  2017-05-02 23:43       ` Dave Hansen
  0 siblings, 1 reply; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-02 23:34 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm

On 5/2/17 2:32 PM, Dave Hansen wrote:
> On 05/01/2017 11:00 AM, Prakash Sangappa wrote:
>> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that
>> pages should not be allocated at page fault time when accessed thru mmapped
>> address.
> I think the main argument against doing something like this is further
> specializing hugetlbfs.  I was really hoping that userfaultfd would be
> usable for your needs here.
>
> Could you elaborate on other options that you considered?  Did you look
> at userfaultfd?  What about an madvise() option that disallows backing
> allocations?

Yes, we did consider userfaultfd and madvise(). The use case in mind is 
the database.

With a database, large number of single threaded processes are involved 
which
will map hugetlbfs file and use it for shared memory. The concern with 
using
userfaultfd is the overhead of setup and having an additional thread per 
process
to monitor the userfaultfd. Even if the additional thread can be 
avoided, by using
an external monitor process and  each process sending the userfaultfd to 
this
monitor process, setup overhead exists.

Similarly, a madvise() option also requires additional system call by every
process mapping the file, this is considered a overhead for the database.

If we do consider a new madvise() option, will it be acceptable since 
this will be
specifically for hugetlbfs file mappings? If so, would a new flag to mmap()
call itself be acceptable, which would define the proposed behavior?. 
That way
no additional system calls need to be made. Again this mmap flag would be
applicable  specifically to hugetlbfs file mappings

With the proposed mount option, it would enforce one consistent behavior
and the application using this filesystem would not have to take additional
steps as with userfaultfd or madvise().

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-02 23:34     ` Prakash Sangappa
@ 2017-05-02 23:43       ` Dave Hansen
  2017-05-03 19:02         ` Prakash Sangappa
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2017-05-02 23:43 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel, linux-mm

On 05/02/2017 04:34 PM, Prakash Sangappa wrote:
> Similarly, a madvise() option also requires additional system call by every
> process mapping the file, this is considered a overhead for the database.

How long-lived are these processes?  For a database, I'd assume that
this would happen a single time, or a single time per mmap() at process
startup time.  Such a syscall would be doing something on the order of
taking mmap_sem, walking the VMA tree, setting a bit per VMA, and
unlocking.  That's a pretty cheap one-time cost...

> If we do consider a new madvise() option, will it be acceptable
> since this will be specifically for hugetlbfs file mappings?

Ideally, it would be something that is *not* specifically for hugetlbfs.
 MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever
memory is touched that was not populated with MADV_WILLNEED, mlock(), etc...

> If so,
> would a new flag to mmap() call itself be acceptable, which would
> define the proposed behavior?. That way no additional system calls
> need to be made.

I don't feel super strongly about it, but I guess an mmap() flag could
work too.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-02 23:43       ` Dave Hansen
@ 2017-05-03 19:02         ` Prakash Sangappa
  2017-05-08  5:57           ` Prakash Sangappa
  2017-05-08 15:58           ` Dave Hansen
  0 siblings, 2 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-03 19:02 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm

On 5/2/17 4:43 PM, Dave Hansen wrote:

> On 05/02/2017 04:34 PM, Prakash Sangappa wrote:
>> Similarly, a madvise() option also requires additional system call by every
>> process mapping the file, this is considered a overhead for the database.
> How long-lived are these processes?  For a database, I'd assume that
> this would happen a single time, or a single time per mmap() at process
> startup time.  Such a syscall would be doing something on the order of
> taking mmap_sem, walking the VMA tree, setting a bit per VMA, and
> unlocking.  That's a pretty cheap one-time cost...
Plus a call into the filesystem (a_ops?) to check if the underlying 
filesystem
supports not filling holes to mapped access before setting the bit per vma.
Although the overhead may not be that bad.

Database processes can exit and new once started, for instance, depending on
database activity.

>> If we do consider a new madvise() option, will it be acceptable
>> since this will be specifically for hugetlbfs file mappings?
> Ideally, it would be something that is *not* specifically for hugetlbfs.
>   MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever
> memory is touched that was not populated with MADV_WILLNEED, mlock(), etc...

If this is a generic advice type, necessary support will have to be 
implemented
in various filesystems which can support this.

The proposed behavior for 'noautofill' was to not fill holes in 
files(like sparse files).
In the page fault path, mm would not know if the mmapped address on which
the fault occurred, is over a hole in the file or just that the page is 
not available
in the page cache. The underlying filesystem would be called and it 
determines
if it is a hole and that is where it would fail and not fill the hole, 
if this support is added.
Normally, filesystem which support sparse files(holes in file) 
automatically fill the hole
when accessed. Then there is the issue of file system block size and 
page size. If the
block sizes are smaller then page size, it could mean the noautofill 
would only work
if the hole size is equal to  or a multiple of, page size?

In case of hugetlbfs it is much straight forward. Since this filesystem 
is not like a normal
filesystems and and the file sizes are multiple of huge pages. The hole 
will be a multiple
of the huge page size. For this reason then should the advise be 
specific to hugetlbfs?

>
>> If so,
>> would a new flag to mmap() call itself be acceptable, which would
>> define the proposed behavior?. That way no additional system calls
>> need to be made.
> I don't feel super strongly about it, but I guess an mmap() flag could
> work too.
>

Same goes with the mmap call, if it is a generic flag.

-Prakash.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-03 19:02         ` Prakash Sangappa
@ 2017-05-08  5:57           ` Prakash Sangappa
  2017-05-08 15:58           ` Dave Hansen
  1 sibling, 0 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-08  5:57 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm



On 5/3/17 12:02 PM, Prakash Sangappa wrote:
> On 5/2/17 4:43 PM, Dave Hansen wrote:
>
>> Ideally, it would be something that is *not* specifically for hugetlbfs.
>>   MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever
>> memory is touched that was not populated with MADV_WILLNEED, mlock(), 
>> etc...
>
> If this is a generic advice type, necessary support will have to be 
> implemented
> in various filesystems which can support this.
>
> The proposed behavior for 'noautofill' was to not fill holes in 
> files(like sparse files).
> In the page fault path, mm would not know if the mmapped address on which
> the fault occurred, is over a hole in the file or just that the page 
> is not available
> in the page cache. The underlying filesystem would be called and it 
> determines
> if it is a hole and that is where it would fail and not fill the hole, 
> if this support is added.
> Normally, filesystem which support sparse files(holes in file) 
> automatically fill the hole
> when accessed. Then there is the issue of file system block size and 
> page size. If the
> block sizes are smaller then page size, it could mean the noautofill 
> would only work
> if the hole size is equal to  or a multiple of, page size?
>
> In case of hugetlbfs it is much straight forward. Since this 
> filesystem is not like a normal
> filesystems and and the file sizes are multiple of huge pages. The 
> hole will be a multiple
> of the huge page size. For this reason then should the advise be 
> specific to hugetlbfs?
>
>


Any further comments? I think introducing a general madvise option or a 
mmap flag applicable to all filesystems, may not be required. The 
'noautofill' behavior would be specifically useful in hugetlbfs filesystem.

So, if it is specific to hugetlbfs, will the mount option be ok? 
Otherwise adding a madvise / mmap option specific to hugetlbfs, be 
preferred?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-03 19:02         ` Prakash Sangappa
  2017-05-08  5:57           ` Prakash Sangappa
@ 2017-05-08 15:58           ` Dave Hansen
  2017-05-08 22:12             ` prakash.sangappa
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2017-05-08 15:58 UTC (permalink / raw)
  To: Prakash Sangappa, linux-kernel, linux-mm

On 05/03/2017 12:02 PM, Prakash Sangappa wrote:
>>> If we do consider a new madvise()option, will it be acceptable
>>> since this will be specifically for hugetlbfs file mappings?
>> Ideally, it would be something that is *not* specifically for
>> hugetlbfs. MADV_NOAUTOFILL, for instance, could be defined to
>> SIGSEGV whenever memory is touched that was not populated with
>> MADV_WILLNEED, mlock(), etc...
> 
> If this is a generic advice type, necessary support will have to be 
> implemented in various filesystems which can support this.

Yep.

> The proposed behavior for 'noautofill' was to not fill holes in 
> files(like sparse files). In the page fault path, mm would not know
> if the mmapped address on which the fault occurred, is over a hole in
> the file or just that the page is not available in the page cache.

It depends on how you define the feature.  I think you have three choices:

1. "Error" on page fault.  Require all access to be pre-faulted.
2. Allow faults, but "Error" if page cache has to be allocated
3. Allow faults and page cache allocations, but error on filesystem
   backing storage allocation.

All of those are useful in some cases.  But the implementations probably
happen in different places:

#1 can be implemented in core mm code
#2 can be implemented in the VFS
#3 needs filesystem involvement

> The underlying filesystem would be called and it determines if it is
> a hole and that is where it would fail and not fill the hole, if this
> support is added. Normally, filesystem which support sparse
> files(holes in file) automatically fill the hole when accessed. Then
> there is the issue of file system block size and page size. If the 
> block sizes are smaller then page size, it could mean the noautofill 
> would only work if the hole size is equal to or a multiple of, page
> size?

It depends on how you define the feature whether this is true.

> In case of hugetlbfs it is much straight forward. Since this
> filesystem is not like a normal filesystems and and the file sizes
> are multiple of huge pages. The hole will be a multiple of the huge
> page size. For this reason then should the advise be specific to
> hugetlbfs?

Let me paraphrase: it's simpler to implement it if it's specific to
hugetlbfs, thus we should implement it only for hugetlbfs, and keep it
specific to hugetlbfs.

The bigger question is: do we want to continue adding to the complexity
of hugetlbfs and increase its divergence from the core mm?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-08 15:58           ` Dave Hansen
@ 2017-05-08 22:12             ` prakash.sangappa
  2017-05-09  8:58               ` Christoph Hellwig
  0 siblings, 1 reply; 16+ messages in thread
From: prakash.sangappa @ 2017-05-08 22:12 UTC (permalink / raw)
  To: Dave Hansen, linux-kernel, linux-mm

On 05/08/2017 08:58 AM, Dave Hansen wrote:
>
> It depends on how you define the feature.  I think you have three choices:
>
> 1. "Error" on page fault.  Require all access to be pre-faulted.
> 2. Allow faults, but "Error" if page cache has to be allocated
> 3. Allow faults and page cache allocations, but error on filesystem
>     backing storage allocation.
>
> All of those are useful in some cases.  But the implementations probably
> happen in different places:
>
> #1 can be implemented in core mm code
> #2 can be implemented in the VFS
> #3 needs filesystem involvement

Sure it will depend on how we define the feature.
However, I am not clear about how useful are #1 & #2
as a general feature with respect to filesystems, but I
assume we could find some use cases for them.

Regarding #3 as a general feature, do we want to
consider this and the complexity associated with the
implementation?

>
>> In case of hugetlbfs it is much straight forward. Since this
>> filesystem is not like a normal filesystems and and the file sizes
>> are multiple of huge pages. The hole will be a multiple of the huge
>> page size. For this reason then should the advise be specific to
>> hugetlbfs?
> Let me paraphrase: it's simpler to implement it if it's specific to
> hugetlbfs, thus we should implement it only for hugetlbfs, and keep it
> specific to hugetlbfs.
>
> The bigger question is: do we want to continue adding to the complexity
> of hugetlbfs and increase its divergence from the core mm?

Ok,

The purpose of hugetlbfs is to enable applications to be
able to use memory in huge page sizes. Expecting that there
will be no change to its purpose other then this. The filesystem
API fallocate(), with the recent addition for hole punching support
to free pages, allows  explicit control on page
allocation  / deallocation which is useful.

It seems that the 'noautofill' feature is what is missing, with
regards to applications having explicit control on memory page
allocations using hugetlbfs. Even though the description for this
feature is not to fill holes in files, given it is filesystem semantic, but
actually the intent is to indicate not allocating pages implicitly as
the application is primarily dealing with memory allocation and
deallocation here. Is this a good enough justification?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-08 22:12             ` prakash.sangappa
@ 2017-05-09  8:58               ` Christoph Hellwig
  2017-05-09 20:59                 ` Prakash Sangappa
  0 siblings, 1 reply; 16+ messages in thread
From: Christoph Hellwig @ 2017-05-09  8:58 UTC (permalink / raw)
  To: prakash.sangappa; +Cc: Dave Hansen, linux-kernel, linux-mm

On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
> Regarding #3 as a general feature, do we want to
> consider this and the complexity associated with the
> implementation?

We have to.  Given that no one has exclusive access to hugetlbfs
a mount option is fundamentally the wrong interface.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-09  8:58               ` Christoph Hellwig
@ 2017-05-09 20:59                 ` Prakash Sangappa
  2017-05-16 16:51                   ` Prakash Sangappa
  2017-06-16 13:15                   ` Andrea Arcangeli
  0 siblings, 2 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-09 20:59 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Hansen, linux-kernel, linux-mm

On 5/9/17 1:58 AM, Christoph Hellwig wrote:
> On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
>> Regarding #3 as a general feature, do we want to
>> consider this and the complexity associated with the
>> implementation?
> We have to.  Given that no one has exclusive access to hugetlbfs
> a mount option is fundamentally the wrong interface.

A hugetlbfs filesystem may need to be mounted for exclusive use by
an application. Note, recently the 'min_size' mount option was added
to hugetlbfs, which would reserve minimum number of huge pages
for that filesystem for use by an application. If the filesystem with
min size specified, is not setup for exclusive use by an application,
then the purpose of reserving huge pages is defeated.  The
min_size option was for use by applications like the database.

Also, I am investigating enabling hugetlbfs mounts within user
namespace's mount namespace. That would allow an application
to mount a hugetlbfs filesystem inside a namespace exclusively for
its use, running as a non root user. For this it seems like the 'min_size'
should be subject to some user limits. Anyways, mounting inside
user namespaces is  a different discussion.

So, if a filesystem has to be setup for exclusive use by an application,
then different mount options can be used for that filesystem.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-09 20:59                 ` Prakash Sangappa
@ 2017-05-16 16:51                   ` Prakash Sangappa
  2017-06-16 13:15                   ` Andrea Arcangeli
  1 sibling, 0 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-05-16 16:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Dave Hansen, linux-kernel, linux-mm, Andrea Arcangeli



On 5/9/17 1:59 PM, Prakash Sangappa wrote:
>
>
> On 5/9/17 1:58 AM, Christoph Hellwig wrote:
>> On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
>>> Regarding #3 as a general feature, do we want to
>>> consider this and the complexity associated with the
>>> implementation?
>> We have to.  Given that no one has exclusive access to hugetlbfs
>> a mount option is fundamentally the wrong interface.
>
>
> A hugetlbfs filesystem may need to be mounted for exclusive use by
> an application. Note, recently the 'min_size' mount option was added
> to hugetlbfs, which would reserve minimum number of huge pages
> for that filesystem for use by an application. If the filesystem with
> min size specified, is not setup for exclusive use by an application,
> then the purpose of reserving huge pages is defeated.  The
> min_size option was for use by applications like the database.
>
> Also, I am investigating enabling hugetlbfs mounts within user
> namespace's mount namespace. That would allow an application
> to mount a hugetlbfs filesystem inside a namespace exclusively for
> its use, running as a non root user. For this it seems like the 
> 'min_size'
> should be subject to some user limits. Anyways, mounting inside
> user namespaces is  a different discussion.
>
> So, if a filesystem has to be setup for exclusive use by an application,
> then different mount options can be used for that filesystem.
>

Any further comments?

Cc'ing Andrea as we had discussed this requirement for the Database.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-05-09 20:59                 ` Prakash Sangappa
  2017-05-16 16:51                   ` Prakash Sangappa
@ 2017-06-16 13:15                   ` Andrea Arcangeli
  2017-06-20 23:35                     ` Prakash Sangappa
  1 sibling, 1 reply; 16+ messages in thread
From: Andrea Arcangeli @ 2017-06-16 13:15 UTC (permalink / raw)
  To: Prakash Sangappa
  Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm,
	Mike Rapoport, Mike Kravetz

Hello Prakash,

On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote:
> 
> 
> On 5/9/17 1:58 AM, Christoph Hellwig wrote:
> > On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
> >> Regarding #3 as a general feature, do we want to
> >> consider this and the complexity associated with the
> >> implementation?
> > We have to.  Given that no one has exclusive access to hugetlbfs
> > a mount option is fundamentally the wrong interface.
> 
> 
> A hugetlbfs filesystem may need to be mounted for exclusive use by
> an application. Note, recently the 'min_size' mount option was added
> to hugetlbfs, which would reserve minimum number of huge pages
> for that filesystem for use by an application. If the filesystem with
> min size specified, is not setup for exclusive use by an application,
> then the purpose of reserving huge pages is defeated.  The
> min_size option was for use by applications like the database.
> 
> Also, I am investigating enabling hugetlbfs mounts within user
> namespace's mount namespace. That would allow an application
> to mount a hugetlbfs filesystem inside a namespace exclusively for
> its use, running as a non root user. For this it seems like the 'min_size'
> should be subject to some user limits. Anyways, mounting inside
> user namespaces is  a different discussion.
> 
> So, if a filesystem has to be setup for exclusive use by an application,
> then different mount options can be used for that filesystem.

Before userfaultfd I used a madvise that triggered SIGBUS. Aside from
performance that is much lower than userfaultfd because of the return
to userland, SIGBUS handling and new enter kernel to communicate
through a pipe with a memory manager, it couldn't work reliably
because you're not going to get exact information on the virtual
address that triggered the fault if the SIGBUS triggers in some random
in a copy-user of some random syscall, depending on the syscall some
random error will be returned. So it couldn't work transparently to
the app as far as syscalls and get_user_pages drivers were concerned.

With your solution if you pass a corrupted pointer to a random read()
syscall you're going to get a error, but supposedly you already handle
any syscall error and stop the app.

This is a special case because you don't care about performance and
you don't care about not returning random EFAULT errors from syscalls
like read().

This mount option seems non intrusive enough and hugetlbfs is quite
special already, so I'm not particularly concerned by the fact it's
one more special tweak.

If it would be enough to convert the SIGBUS into a (killable) process
hang, you could still use uffd and there would be no need to send the
uffd to a manager. You'd find the corrupting buggy process stuck in
handle_userfault().

As an alternative to the mount option we could consider adding
UFFD_FEATURE_SIGBUS that tells the handle_userfault() to simply return
VM_FAULT_SIGBUS in presence of a pagefault event. You'd still get
weird EFAULT or erratic retvals from syscalls so it would only be
usable in for your robustness feature. Then you could use UFFDIO_COPY
too to fill the memory atomically which runs faster than a page fault
(fallocate punch hole still required to zap it).

Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out,
branch for this corner case to handle_userfault() isn't great and the
hugetlbfs mount option is absolutely zero cost to the handle_userfault
which is primarily why I'm not against it.. although it's not going to
be measurable so it would be ok also to add such feature.

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-06-16 13:15                   ` Andrea Arcangeli
@ 2017-06-20 23:35                     ` Prakash Sangappa
  2017-06-27 20:57                       ` Prakash Sangappa
  0 siblings, 1 reply; 16+ messages in thread
From: Prakash Sangappa @ 2017-06-20 23:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm,
	Mike Rapoport, Mike Kravetz



On 6/16/17 6:15 AM, Andrea Arcangeli wrote:
> Hello Prakash,


Thanks for you response. Comments inline.

>
> On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote:
>>
>> On 5/9/17 1:58 AM, Christoph Hellwig wrote:
>>> On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote:
>>>> Regarding #3 as a general feature, do we want to
>>>> consider this and the complexity associated with the
>>>> implementation?
>>> We have to.  Given that no one has exclusive access to hugetlbfs
>>> a mount option is fundamentally the wrong interface.
>>
>> A hugetlbfs filesystem may need to be mounted for exclusive use by
>> an application. Note, recently the 'min_size' mount option was added
>> to hugetlbfs, which would reserve minimum number of huge pages
>> for that filesystem for use by an application. If the filesystem with
>> min size specified, is not setup for exclusive use by an application,
>> then the purpose of reserving huge pages is defeated.  The
>> min_size option was for use by applications like the database.
>>
>> Also, I am investigating enabling hugetlbfs mounts within user
>> namespace's mount namespace. That would allow an application
>> to mount a hugetlbfs filesystem inside a namespace exclusively for
>> its use, running as a non root user. For this it seems like the 'min_size'
>> should be subject to some user limits. Anyways, mounting inside
>> user namespaces is  a different discussion.
>>
>> So, if a filesystem has to be setup for exclusive use by an application,
>> then different mount options can be used for that filesystem.
> Before userfaultfd I used a madvise that triggered SIGBUS. Aside from
> performance that is much lower than userfaultfd because of the return
> to userland, SIGBUS handling and new enter kernel to communicate
> through a pipe with a memory manager, it couldn't work reliably
> because you're not going to get exact information on the virtual
> address that triggered the fault if the SIGBUS triggers in some random
> in a copy-user of some random syscall, depending on the syscall some
> random error will be returned. So it couldn't work transparently to
> the app as far as syscalls and get_user_pages drivers were concerned.


Sure, seems like that would be the case if an application wants to take 
some action as a result of the fault.

>
> With your solution if you pass a corrupted pointer to a random read()
> syscall you're going to get a error, but supposedly you already handle
> any syscall error and stop the app.


Yes, the expectation is  that the application will handle the error and 
stop. This would be similar to an application passing an invalid  
address to a system call.

So, in the use case for this feature,  accessing the mapped address over 
a hole in hugetlbfs file is invalid. The application will keep track of 
the valid regions.


>
> This is a special case because you don't care about performance and
> you don't care about not returning random EFAULT errors from syscalls
> like read().


Exactly.

>
> This mount option seems non intrusive enough and hugetlbfs is quite
> special already, so I'm not particularly concerned by the fact it's
> one more special tweak.
>
> If it would be enough to convert the SIGBUS into a (killable) process
> hang, you could still use uffd and there would be no need to send the
> uffd to a manager. You'd find the corrupting buggy process stuck in
> handle_userfault().


This could be a useful feature in debug mode.  However, In the normal 
mode the application should exit/die.

>
> As an alternative to the mount option we could consider adding
> UFFD_FEATURE_SIGBUS that tells the handle_userfault() to simply return
> VM_FAULT_SIGBUS in presence of a pagefault event. You'd still get
> weird EFAULT or erratic retvals from syscalls so it would only be
> usable in for your robustness feature. Then you could use UFFDIO_COPY
> too to fill the memory atomically which runs faster than a page fault
> (fallocate punch hole still required to zap it).
>
> Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out,
> branch for this corner case to handle_userfault() isn't great and the
> hugetlbfs mount option is absolutely zero cost to the handle_userfault
> which is primarily why I'm not against it.. although it's not going to
> be measurable so it would be ok also to add such feature.


Sure, UFFD_FEATURE_SIGBUS would address the use case for the database 
using hugetlbfs. This could be a generic API and so could be useful in 
other cases as well maybe?

However for this, the userfaultfd(2)  has to be opened to register. This 
fd has to remain opened. Is this ok? Also, even though  a monitor thread 
will not be required for this particular feature, hopefully it will not 
hinder future  enhancements to userfaultfd.

Expectation is that the overhead of registering UFFD_FEATURE_SIGBUS is 
minimal, and the registration will be done by the application ones after 
every mmap() call as required, hopefully this is not required to be done 
frequently. In the database use case, the registration  will mainly be 
done once in the beginning when mapping hugetlbfs files, so should be ok.

The mount option proposed, would give one consistent behavior for the 
filesystem and will not require the application to take any additional 
steps.

If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount 
option, I could look into that.


Thanks,
-Prakash.

>
> Thanks,
> Andrea

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option
  2017-06-20 23:35                     ` Prakash Sangappa
@ 2017-06-27 20:57                       ` Prakash Sangappa
  0 siblings, 0 replies; 16+ messages in thread
From: Prakash Sangappa @ 2017-06-27 20:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm,
	Mike Rapoport, Mike Kravetz



On 6/20/17 4:35 PM, Prakash Sangappa wrote:
>
>
> On 6/16/17 6:15 AM, Andrea Arcangeli wrote:
>> Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out,
>> branch for this corner case to handle_userfault() isn't great and the
>> hugetlbfs mount option is absolutely zero cost to the handle_userfault
>> which is primarily why I'm not against it.. although it's not going to
>> be measurable so it would be ok also to add such feature.
>
>
> If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount 
> option, I could look into that.
>
Implementing UFFD_FEATURE_SIGBUS seems reasonable.

I wanted to note here on this thread that I sent out a seperate
RFC patch review for adding UFFD_FEATURE_SIGBUS.

See,
http://marc.info/?l=linux-mm&m=149857975906880&w=2

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2017-06-27 20:58 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <326e38dd-b4a8-e0ca-6ff7-af60e8045c74@oracle.com>
2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa
2017-05-02 10:53   ` Anshuman Khandual
2017-05-02 16:07     ` Prakash Sangappa
2017-05-02 21:32   ` Dave Hansen
2017-05-02 23:34     ` Prakash Sangappa
2017-05-02 23:43       ` Dave Hansen
2017-05-03 19:02         ` Prakash Sangappa
2017-05-08  5:57           ` Prakash Sangappa
2017-05-08 15:58           ` Dave Hansen
2017-05-08 22:12             ` prakash.sangappa
2017-05-09  8:58               ` Christoph Hellwig
2017-05-09 20:59                 ` Prakash Sangappa
2017-05-16 16:51                   ` Prakash Sangappa
2017-06-16 13:15                   ` Andrea Arcangeli
2017-06-20 23:35                     ` Prakash Sangappa
2017-06-27 20:57                       ` Prakash Sangappa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).