* [PATCH RFC] hugetlbfs 'noautofill' mount option [not found] <326e38dd-b4a8-e0ca-6ff7-af60e8045c74@oracle.com> @ 2017-05-01 18:00 ` Prakash Sangappa 2017-05-02 10:53 ` Anshuman Khandual 2017-05-02 21:32 ` Dave Hansen 0 siblings, 2 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-01 18:00 UTC (permalink / raw) To: linux-kernel, linux-mm Some applications like a database use hugetblfs for performance reasons. Files on hugetlbfs filesystem are created and huge pages allocated using fallocate() API. Pages are deallocated/freed using fallocate() hole punching support that has been added to hugetlbfs. These files are mmapped and accessed by many processes as shared memory. Such applications keep track of which offsets in the hugetlbfs file have pages allocated. Any access to mapped address over holes in the file, which can occur due to bugs in the application, is considered invalid and expect the process to simply receive a SIGBUS. However, currently when a hole in the file is accessed via the mapped address, kernel/mm attempts to automatically allocate a page at page fault time, resulting in implicitly filling the hole in the file. This may not be the desired behavior for applications like the database that want to explicitly manage page allocations of hugetlbfs files. This patch adds a new hugetlbfs mount option 'noautofill', to indicate that pages should not be allocated at page fault time when accessed thru mmapped address. Signed-off-by: Prakash <prakash.sangappa@oracle.com> --- fs/hugetlbfs/inode.c | 11 ++++++++++- include/linux/hugetlb.h | 1 + mm/hugetlb.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 8f96461..8342ee9 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -53,6 +53,7 @@ struct hugetlbfs_config { long nr_inodes; struct hstate *hstate; long min_hpages; + long noautofill; }; struct hugetlbfs_inode_info { @@ -71,7 +72,7 @@ enum { Opt_size, Opt_nr_inodes, Opt_mode, Opt_uid, Opt_gid, Opt_pagesize, Opt_min_size, - Opt_err, + Opt_noautofill, Opt_err, }; static const match_table_t tokens = { @@ -82,6 +83,7 @@ static const match_table_t tokens = { {Opt_gid, "gid=%u"}, {Opt_pagesize, "pagesize=%s"}, {Opt_min_size, "min_size=%s"}, + {Opt_noautofill, "noautofill"}, {Opt_err, NULL}, }; @@ -1109,6 +1111,11 @@ hugetlbfs_parse_options(char *options, struct hugetlbfs_config *pconfig) break; } + case Opt_noautofill: { + pconfig->noautofill = 1; + break; + } + default: pr_err("Bad mount option: \"%s\"\n", p); return -EINVAL; @@ -1157,6 +1164,7 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) config.mode = 0755; config.hstate = &default_hstate; config.min_hpages = -1; /* No default minimum size */ + config.noautofill = 0; ret = hugetlbfs_parse_options(data, &config); if (ret) return ret; @@ -1170,6 +1178,7 @@ hugetlbfs_fill_super(struct super_block *sb, void *data, int silent) sbinfo->max_inodes = config.nr_inodes; sbinfo->free_inodes = config.nr_inodes; sbinfo->spool = NULL; + sbinfo->noautofill = config.noautofill; /* * Allocate and initialize subpool if maximum or minimum size is * specified. Any needed reservations (for minimim size) are taken diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h index 503099d..2f37e0c 100644 --- a/include/linux/hugetlb.h +++ b/include/linux/hugetlb.h @@ -259,6 +259,7 @@ struct hugetlbfs_sb_info { spinlock_t stat_lock; struct hstate *hstate; struct hugepage_subpool *spool; + int noautofill; /* don't allocate page to fill hole at fault time */ }; static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index a7aa811..11655ef 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3715,6 +3715,11 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma, goto out; } + if (HUGETLBFS_SB(mapping->host->i_sb)->noautofill) { + ret = VM_FAULT_SIGBUS; + goto out; + } + page = alloc_huge_page(vma, address, 0); if (IS_ERR(page)) { ret = PTR_ERR(page); -- 2.7.4 ^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa @ 2017-05-02 10:53 ` Anshuman Khandual 2017-05-02 16:07 ` Prakash Sangappa 2017-05-02 21:32 ` Dave Hansen 1 sibling, 1 reply; 16+ messages in thread From: Anshuman Khandual @ 2017-05-02 10:53 UTC (permalink / raw) To: Prakash Sangappa, linux-kernel, linux-mm On 05/01/2017 11:30 PM, Prakash Sangappa wrote: > Some applications like a database use hugetblfs for performance > reasons. Files on hugetlbfs filesystem are created and huge pages > allocated using fallocate() API. Pages are deallocated/freed using > fallocate() hole punching support that has been added to hugetlbfs. > These files are mmapped and accessed by many processes as shared memory. > Such applications keep track of which offsets in the hugetlbfs file have > pages allocated. > > Any access to mapped address over holes in the file, which can occur due s/mapped/unmapped/ ^ ? > to bugs in the application, is considered invalid and expect the process > to simply receive a SIGBUS. However, currently when a hole in the file is > accessed via the mapped address, kernel/mm attempts to automatically > allocate a page at page fault time, resulting in implicitly filling the > hole But this is expected when you try to control the file allocation from a mapped address. Any changes while walking past or writing the range in the memory mapped should reflect exactly in the file on the disk. Why its not a valid behavior ? > in the file. This may not be the desired behavior for applications like the > database that want to explicitly manage page allocations of hugetlbfs > files. > > This patch adds a new hugetlbfs mount option 'noautofill', to indicate that > pages should not be allocated at page fault time when accessed thru mmapped > address. When the page should be allocated for mapping ? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-02 10:53 ` Anshuman Khandual @ 2017-05-02 16:07 ` Prakash Sangappa 0 siblings, 0 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-02 16:07 UTC (permalink / raw) To: Anshuman Khandual, linux-kernel, linux-mm On 5/2/17 3:53 AM, Anshuman Khandual wrote: > On 05/01/2017 11:30 PM, Prakash Sangappa wrote: >> Some applications like a database use hugetblfs for performance >> reasons. Files on hugetlbfs filesystem are created and huge pages >> allocated using fallocate() API. Pages are deallocated/freed using >> fallocate() hole punching support that has been added to hugetlbfs. >> These files are mmapped and accessed by many processes as shared memory. >> Such applications keep track of which offsets in the hugetlbfs file have >> pages allocated. >> >> Any access to mapped address over holes in the file, which can occur due > s/mapped/unmapped/ ^ ? It is 'mapped' address. > >> to bugs in the application, is considered invalid and expect the process >> to simply receive a SIGBUS. However, currently when a hole in the file is >> accessed via the mapped address, kernel/mm attempts to automatically >> allocate a page at page fault time, resulting in implicitly filling the >> hole > But this is expected when you try to control the file allocation from > a mapped address. Any changes while walking past or writing the range > in the memory mapped should reflect exactly in the file on the disk. > Why its not a valid behavior ? Sure, that is a valid behavior. However, hugetlbfs is a pesudo filesystem and the purpose is for applications to use hugepage memory. The contents of these filesystem are not backed by disk nor are they swapped out. The proposed new behavior is only applicable for hugetlbfs filesystem mounted with the new 'noautofill' mount option. The file's page allocation/free are managed using the 'fallocate()' API. For hugetlbfs filesystems mounted without this option, there is no change in behavior. >> in the file. This may not be the desired behavior for applications like the >> database that want to explicitly manage page allocations of hugetlbfs >> files. >> >> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that >> pages should not be allocated at page fault time when accessed thru mmapped >> address. > When the page should be allocated for mapping ? The application would allocate/free file pages using the fallocate() API. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa 2017-05-02 10:53 ` Anshuman Khandual @ 2017-05-02 21:32 ` Dave Hansen 2017-05-02 23:34 ` Prakash Sangappa 1 sibling, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-05-02 21:32 UTC (permalink / raw) To: Prakash Sangappa, linux-kernel, linux-mm On 05/01/2017 11:00 AM, Prakash Sangappa wrote: > This patch adds a new hugetlbfs mount option 'noautofill', to indicate that > pages should not be allocated at page fault time when accessed thru mmapped > address. I think the main argument against doing something like this is further specializing hugetlbfs. I was really hoping that userfaultfd would be usable for your needs here. Could you elaborate on other options that you considered? Did you look at userfaultfd? What about an madvise() option that disallows backing allocations? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-02 21:32 ` Dave Hansen @ 2017-05-02 23:34 ` Prakash Sangappa 2017-05-02 23:43 ` Dave Hansen 0 siblings, 1 reply; 16+ messages in thread From: Prakash Sangappa @ 2017-05-02 23:34 UTC (permalink / raw) To: Dave Hansen, linux-kernel, linux-mm On 5/2/17 2:32 PM, Dave Hansen wrote: > On 05/01/2017 11:00 AM, Prakash Sangappa wrote: >> This patch adds a new hugetlbfs mount option 'noautofill', to indicate that >> pages should not be allocated at page fault time when accessed thru mmapped >> address. > I think the main argument against doing something like this is further > specializing hugetlbfs. I was really hoping that userfaultfd would be > usable for your needs here. > > Could you elaborate on other options that you considered? Did you look > at userfaultfd? What about an madvise() option that disallows backing > allocations? Yes, we did consider userfaultfd and madvise(). The use case in mind is the database. With a database, large number of single threaded processes are involved which will map hugetlbfs file and use it for shared memory. The concern with using userfaultfd is the overhead of setup and having an additional thread per process to monitor the userfaultfd. Even if the additional thread can be avoided, by using an external monitor process and each process sending the userfaultfd to this monitor process, setup overhead exists. Similarly, a madvise() option also requires additional system call by every process mapping the file, this is considered a overhead for the database. If we do consider a new madvise() option, will it be acceptable since this will be specifically for hugetlbfs file mappings? If so, would a new flag to mmap() call itself be acceptable, which would define the proposed behavior?. That way no additional system calls need to be made. Again this mmap flag would be applicable specifically to hugetlbfs file mappings With the proposed mount option, it would enforce one consistent behavior and the application using this filesystem would not have to take additional steps as with userfaultfd or madvise(). ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-02 23:34 ` Prakash Sangappa @ 2017-05-02 23:43 ` Dave Hansen 2017-05-03 19:02 ` Prakash Sangappa 0 siblings, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-05-02 23:43 UTC (permalink / raw) To: Prakash Sangappa, linux-kernel, linux-mm On 05/02/2017 04:34 PM, Prakash Sangappa wrote: > Similarly, a madvise() option also requires additional system call by every > process mapping the file, this is considered a overhead for the database. How long-lived are these processes? For a database, I'd assume that this would happen a single time, or a single time per mmap() at process startup time. Such a syscall would be doing something on the order of taking mmap_sem, walking the VMA tree, setting a bit per VMA, and unlocking. That's a pretty cheap one-time cost... > If we do consider a new madvise() option, will it be acceptable > since this will be specifically for hugetlbfs file mappings? Ideally, it would be something that is *not* specifically for hugetlbfs. MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever memory is touched that was not populated with MADV_WILLNEED, mlock(), etc... > If so, > would a new flag to mmap() call itself be acceptable, which would > define the proposed behavior?. That way no additional system calls > need to be made. I don't feel super strongly about it, but I guess an mmap() flag could work too. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-02 23:43 ` Dave Hansen @ 2017-05-03 19:02 ` Prakash Sangappa 2017-05-08 5:57 ` Prakash Sangappa 2017-05-08 15:58 ` Dave Hansen 0 siblings, 2 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-03 19:02 UTC (permalink / raw) To: Dave Hansen, linux-kernel, linux-mm On 5/2/17 4:43 PM, Dave Hansen wrote: > On 05/02/2017 04:34 PM, Prakash Sangappa wrote: >> Similarly, a madvise() option also requires additional system call by every >> process mapping the file, this is considered a overhead for the database. > How long-lived are these processes? For a database, I'd assume that > this would happen a single time, or a single time per mmap() at process > startup time. Such a syscall would be doing something on the order of > taking mmap_sem, walking the VMA tree, setting a bit per VMA, and > unlocking. That's a pretty cheap one-time cost... Plus a call into the filesystem (a_ops?) to check if the underlying filesystem supports not filling holes to mapped access before setting the bit per vma. Although the overhead may not be that bad. Database processes can exit and new once started, for instance, depending on database activity. >> If we do consider a new madvise() option, will it be acceptable >> since this will be specifically for hugetlbfs file mappings? > Ideally, it would be something that is *not* specifically for hugetlbfs. > MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever > memory is touched that was not populated with MADV_WILLNEED, mlock(), etc... If this is a generic advice type, necessary support will have to be implemented in various filesystems which can support this. The proposed behavior for 'noautofill' was to not fill holes in files(like sparse files). In the page fault path, mm would not know if the mmapped address on which the fault occurred, is over a hole in the file or just that the page is not available in the page cache. The underlying filesystem would be called and it determines if it is a hole and that is where it would fail and not fill the hole, if this support is added. Normally, filesystem which support sparse files(holes in file) automatically fill the hole when accessed. Then there is the issue of file system block size and page size. If the block sizes are smaller then page size, it could mean the noautofill would only work if the hole size is equal to or a multiple of, page size? In case of hugetlbfs it is much straight forward. Since this filesystem is not like a normal filesystems and and the file sizes are multiple of huge pages. The hole will be a multiple of the huge page size. For this reason then should the advise be specific to hugetlbfs? > >> If so, >> would a new flag to mmap() call itself be acceptable, which would >> define the proposed behavior?. That way no additional system calls >> need to be made. > I don't feel super strongly about it, but I guess an mmap() flag could > work too. > Same goes with the mmap call, if it is a generic flag. -Prakash. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-03 19:02 ` Prakash Sangappa @ 2017-05-08 5:57 ` Prakash Sangappa 2017-05-08 15:58 ` Dave Hansen 1 sibling, 0 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-08 5:57 UTC (permalink / raw) To: Dave Hansen, linux-kernel, linux-mm On 5/3/17 12:02 PM, Prakash Sangappa wrote: > On 5/2/17 4:43 PM, Dave Hansen wrote: > >> Ideally, it would be something that is *not* specifically for hugetlbfs. >> MADV_NOAUTOFILL, for instance, could be defined to SIGSEGV whenever >> memory is touched that was not populated with MADV_WILLNEED, mlock(), >> etc... > > If this is a generic advice type, necessary support will have to be > implemented > in various filesystems which can support this. > > The proposed behavior for 'noautofill' was to not fill holes in > files(like sparse files). > In the page fault path, mm would not know if the mmapped address on which > the fault occurred, is over a hole in the file or just that the page > is not available > in the page cache. The underlying filesystem would be called and it > determines > if it is a hole and that is where it would fail and not fill the hole, > if this support is added. > Normally, filesystem which support sparse files(holes in file) > automatically fill the hole > when accessed. Then there is the issue of file system block size and > page size. If the > block sizes are smaller then page size, it could mean the noautofill > would only work > if the hole size is equal to or a multiple of, page size? > > In case of hugetlbfs it is much straight forward. Since this > filesystem is not like a normal > filesystems and and the file sizes are multiple of huge pages. The > hole will be a multiple > of the huge page size. For this reason then should the advise be > specific to hugetlbfs? > > Any further comments? I think introducing a general madvise option or a mmap flag applicable to all filesystems, may not be required. The 'noautofill' behavior would be specifically useful in hugetlbfs filesystem. So, if it is specific to hugetlbfs, will the mount option be ok? Otherwise adding a madvise / mmap option specific to hugetlbfs, be preferred? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-03 19:02 ` Prakash Sangappa 2017-05-08 5:57 ` Prakash Sangappa @ 2017-05-08 15:58 ` Dave Hansen 2017-05-08 22:12 ` prakash.sangappa 1 sibling, 1 reply; 16+ messages in thread From: Dave Hansen @ 2017-05-08 15:58 UTC (permalink / raw) To: Prakash Sangappa, linux-kernel, linux-mm On 05/03/2017 12:02 PM, Prakash Sangappa wrote: >>> If we do consider a new madvise()option, will it be acceptable >>> since this will be specifically for hugetlbfs file mappings? >> Ideally, it would be something that is *not* specifically for >> hugetlbfs. MADV_NOAUTOFILL, for instance, could be defined to >> SIGSEGV whenever memory is touched that was not populated with >> MADV_WILLNEED, mlock(), etc... > > If this is a generic advice type, necessary support will have to be > implemented in various filesystems which can support this. Yep. > The proposed behavior for 'noautofill' was to not fill holes in > files(like sparse files). In the page fault path, mm would not know > if the mmapped address on which the fault occurred, is over a hole in > the file or just that the page is not available in the page cache. It depends on how you define the feature. I think you have three choices: 1. "Error" on page fault. Require all access to be pre-faulted. 2. Allow faults, but "Error" if page cache has to be allocated 3. Allow faults and page cache allocations, but error on filesystem backing storage allocation. All of those are useful in some cases. But the implementations probably happen in different places: #1 can be implemented in core mm code #2 can be implemented in the VFS #3 needs filesystem involvement > The underlying filesystem would be called and it determines if it is > a hole and that is where it would fail and not fill the hole, if this > support is added. Normally, filesystem which support sparse > files(holes in file) automatically fill the hole when accessed. Then > there is the issue of file system block size and page size. If the > block sizes are smaller then page size, it could mean the noautofill > would only work if the hole size is equal to or a multiple of, page > size? It depends on how you define the feature whether this is true. > In case of hugetlbfs it is much straight forward. Since this > filesystem is not like a normal filesystems and and the file sizes > are multiple of huge pages. The hole will be a multiple of the huge > page size. For this reason then should the advise be specific to > hugetlbfs? Let me paraphrase: it's simpler to implement it if it's specific to hugetlbfs, thus we should implement it only for hugetlbfs, and keep it specific to hugetlbfs. The bigger question is: do we want to continue adding to the complexity of hugetlbfs and increase its divergence from the core mm? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-08 15:58 ` Dave Hansen @ 2017-05-08 22:12 ` prakash.sangappa 2017-05-09 8:58 ` Christoph Hellwig 0 siblings, 1 reply; 16+ messages in thread From: prakash.sangappa @ 2017-05-08 22:12 UTC (permalink / raw) To: Dave Hansen, linux-kernel, linux-mm On 05/08/2017 08:58 AM, Dave Hansen wrote: > > It depends on how you define the feature. I think you have three choices: > > 1. "Error" on page fault. Require all access to be pre-faulted. > 2. Allow faults, but "Error" if page cache has to be allocated > 3. Allow faults and page cache allocations, but error on filesystem > backing storage allocation. > > All of those are useful in some cases. But the implementations probably > happen in different places: > > #1 can be implemented in core mm code > #2 can be implemented in the VFS > #3 needs filesystem involvement Sure it will depend on how we define the feature. However, I am not clear about how useful are #1 & #2 as a general feature with respect to filesystems, but I assume we could find some use cases for them. Regarding #3 as a general feature, do we want to consider this and the complexity associated with the implementation? > >> In case of hugetlbfs it is much straight forward. Since this >> filesystem is not like a normal filesystems and and the file sizes >> are multiple of huge pages. The hole will be a multiple of the huge >> page size. For this reason then should the advise be specific to >> hugetlbfs? > Let me paraphrase: it's simpler to implement it if it's specific to > hugetlbfs, thus we should implement it only for hugetlbfs, and keep it > specific to hugetlbfs. > > The bigger question is: do we want to continue adding to the complexity > of hugetlbfs and increase its divergence from the core mm? Ok, The purpose of hugetlbfs is to enable applications to be able to use memory in huge page sizes. Expecting that there will be no change to its purpose other then this. The filesystem API fallocate(), with the recent addition for hole punching support to free pages, allows explicit control on page allocation / deallocation which is useful. It seems that the 'noautofill' feature is what is missing, with regards to applications having explicit control on memory page allocations using hugetlbfs. Even though the description for this feature is not to fill holes in files, given it is filesystem semantic, but actually the intent is to indicate not allocating pages implicitly as the application is primarily dealing with memory allocation and deallocation here. Is this a good enough justification? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-08 22:12 ` prakash.sangappa @ 2017-05-09 8:58 ` Christoph Hellwig 2017-05-09 20:59 ` Prakash Sangappa 0 siblings, 1 reply; 16+ messages in thread From: Christoph Hellwig @ 2017-05-09 8:58 UTC (permalink / raw) To: prakash.sangappa; +Cc: Dave Hansen, linux-kernel, linux-mm On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: > Regarding #3 as a general feature, do we want to > consider this and the complexity associated with the > implementation? We have to. Given that no one has exclusive access to hugetlbfs a mount option is fundamentally the wrong interface. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-09 8:58 ` Christoph Hellwig @ 2017-05-09 20:59 ` Prakash Sangappa 2017-05-16 16:51 ` Prakash Sangappa 2017-06-16 13:15 ` Andrea Arcangeli 0 siblings, 2 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-09 20:59 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Hansen, linux-kernel, linux-mm On 5/9/17 1:58 AM, Christoph Hellwig wrote: > On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: >> Regarding #3 as a general feature, do we want to >> consider this and the complexity associated with the >> implementation? > We have to. Given that no one has exclusive access to hugetlbfs > a mount option is fundamentally the wrong interface. A hugetlbfs filesystem may need to be mounted for exclusive use by an application. Note, recently the 'min_size' mount option was added to hugetlbfs, which would reserve minimum number of huge pages for that filesystem for use by an application. If the filesystem with min size specified, is not setup for exclusive use by an application, then the purpose of reserving huge pages is defeated. The min_size option was for use by applications like the database. Also, I am investigating enabling hugetlbfs mounts within user namespace's mount namespace. That would allow an application to mount a hugetlbfs filesystem inside a namespace exclusively for its use, running as a non root user. For this it seems like the 'min_size' should be subject to some user limits. Anyways, mounting inside user namespaces is a different discussion. So, if a filesystem has to be setup for exclusive use by an application, then different mount options can be used for that filesystem. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-09 20:59 ` Prakash Sangappa @ 2017-05-16 16:51 ` Prakash Sangappa 2017-06-16 13:15 ` Andrea Arcangeli 1 sibling, 0 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-05-16 16:51 UTC (permalink / raw) To: Christoph Hellwig; +Cc: Dave Hansen, linux-kernel, linux-mm, Andrea Arcangeli On 5/9/17 1:59 PM, Prakash Sangappa wrote: > > > On 5/9/17 1:58 AM, Christoph Hellwig wrote: >> On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: >>> Regarding #3 as a general feature, do we want to >>> consider this and the complexity associated with the >>> implementation? >> We have to. Given that no one has exclusive access to hugetlbfs >> a mount option is fundamentally the wrong interface. > > > A hugetlbfs filesystem may need to be mounted for exclusive use by > an application. Note, recently the 'min_size' mount option was added > to hugetlbfs, which would reserve minimum number of huge pages > for that filesystem for use by an application. If the filesystem with > min size specified, is not setup for exclusive use by an application, > then the purpose of reserving huge pages is defeated. The > min_size option was for use by applications like the database. > > Also, I am investigating enabling hugetlbfs mounts within user > namespace's mount namespace. That would allow an application > to mount a hugetlbfs filesystem inside a namespace exclusively for > its use, running as a non root user. For this it seems like the > 'min_size' > should be subject to some user limits. Anyways, mounting inside > user namespaces is a different discussion. > > So, if a filesystem has to be setup for exclusive use by an application, > then different mount options can be used for that filesystem. > Any further comments? Cc'ing Andrea as we had discussed this requirement for the Database. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-05-09 20:59 ` Prakash Sangappa 2017-05-16 16:51 ` Prakash Sangappa @ 2017-06-16 13:15 ` Andrea Arcangeli 2017-06-20 23:35 ` Prakash Sangappa 1 sibling, 1 reply; 16+ messages in thread From: Andrea Arcangeli @ 2017-06-16 13:15 UTC (permalink / raw) To: Prakash Sangappa Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm, Mike Rapoport, Mike Kravetz Hello Prakash, On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote: > > > On 5/9/17 1:58 AM, Christoph Hellwig wrote: > > On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: > >> Regarding #3 as a general feature, do we want to > >> consider this and the complexity associated with the > >> implementation? > > We have to. Given that no one has exclusive access to hugetlbfs > > a mount option is fundamentally the wrong interface. > > > A hugetlbfs filesystem may need to be mounted for exclusive use by > an application. Note, recently the 'min_size' mount option was added > to hugetlbfs, which would reserve minimum number of huge pages > for that filesystem for use by an application. If the filesystem with > min size specified, is not setup for exclusive use by an application, > then the purpose of reserving huge pages is defeated. The > min_size option was for use by applications like the database. > > Also, I am investigating enabling hugetlbfs mounts within user > namespace's mount namespace. That would allow an application > to mount a hugetlbfs filesystem inside a namespace exclusively for > its use, running as a non root user. For this it seems like the 'min_size' > should be subject to some user limits. Anyways, mounting inside > user namespaces is a different discussion. > > So, if a filesystem has to be setup for exclusive use by an application, > then different mount options can be used for that filesystem. Before userfaultfd I used a madvise that triggered SIGBUS. Aside from performance that is much lower than userfaultfd because of the return to userland, SIGBUS handling and new enter kernel to communicate through a pipe with a memory manager, it couldn't work reliably because you're not going to get exact information on the virtual address that triggered the fault if the SIGBUS triggers in some random in a copy-user of some random syscall, depending on the syscall some random error will be returned. So it couldn't work transparently to the app as far as syscalls and get_user_pages drivers were concerned. With your solution if you pass a corrupted pointer to a random read() syscall you're going to get a error, but supposedly you already handle any syscall error and stop the app. This is a special case because you don't care about performance and you don't care about not returning random EFAULT errors from syscalls like read(). This mount option seems non intrusive enough and hugetlbfs is quite special already, so I'm not particularly concerned by the fact it's one more special tweak. If it would be enough to convert the SIGBUS into a (killable) process hang, you could still use uffd and there would be no need to send the uffd to a manager. You'd find the corrupting buggy process stuck in handle_userfault(). As an alternative to the mount option we could consider adding UFFD_FEATURE_SIGBUS that tells the handle_userfault() to simply return VM_FAULT_SIGBUS in presence of a pagefault event. You'd still get weird EFAULT or erratic retvals from syscalls so it would only be usable in for your robustness feature. Then you could use UFFDIO_COPY too to fill the memory atomically which runs faster than a page fault (fallocate punch hole still required to zap it). Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out, branch for this corner case to handle_userfault() isn't great and the hugetlbfs mount option is absolutely zero cost to the handle_userfault which is primarily why I'm not against it.. although it's not going to be measurable so it would be ok also to add such feature. Thanks, Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-06-16 13:15 ` Andrea Arcangeli @ 2017-06-20 23:35 ` Prakash Sangappa 2017-06-27 20:57 ` Prakash Sangappa 0 siblings, 1 reply; 16+ messages in thread From: Prakash Sangappa @ 2017-06-20 23:35 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm, Mike Rapoport, Mike Kravetz On 6/16/17 6:15 AM, Andrea Arcangeli wrote: > Hello Prakash, Thanks for you response. Comments inline. > > On Tue, May 09, 2017 at 01:59:34PM -0700, Prakash Sangappa wrote: >> >> On 5/9/17 1:58 AM, Christoph Hellwig wrote: >>> On Mon, May 08, 2017 at 03:12:42PM -0700, prakash.sangappa wrote: >>>> Regarding #3 as a general feature, do we want to >>>> consider this and the complexity associated with the >>>> implementation? >>> We have to. Given that no one has exclusive access to hugetlbfs >>> a mount option is fundamentally the wrong interface. >> >> A hugetlbfs filesystem may need to be mounted for exclusive use by >> an application. Note, recently the 'min_size' mount option was added >> to hugetlbfs, which would reserve minimum number of huge pages >> for that filesystem for use by an application. If the filesystem with >> min size specified, is not setup for exclusive use by an application, >> then the purpose of reserving huge pages is defeated. The >> min_size option was for use by applications like the database. >> >> Also, I am investigating enabling hugetlbfs mounts within user >> namespace's mount namespace. That would allow an application >> to mount a hugetlbfs filesystem inside a namespace exclusively for >> its use, running as a non root user. For this it seems like the 'min_size' >> should be subject to some user limits. Anyways, mounting inside >> user namespaces is a different discussion. >> >> So, if a filesystem has to be setup for exclusive use by an application, >> then different mount options can be used for that filesystem. > Before userfaultfd I used a madvise that triggered SIGBUS. Aside from > performance that is much lower than userfaultfd because of the return > to userland, SIGBUS handling and new enter kernel to communicate > through a pipe with a memory manager, it couldn't work reliably > because you're not going to get exact information on the virtual > address that triggered the fault if the SIGBUS triggers in some random > in a copy-user of some random syscall, depending on the syscall some > random error will be returned. So it couldn't work transparently to > the app as far as syscalls and get_user_pages drivers were concerned. Sure, seems like that would be the case if an application wants to take some action as a result of the fault. > > With your solution if you pass a corrupted pointer to a random read() > syscall you're going to get a error, but supposedly you already handle > any syscall error and stop the app. Yes, the expectation is that the application will handle the error and stop. This would be similar to an application passing an invalid address to a system call. So, in the use case for this feature, accessing the mapped address over a hole in hugetlbfs file is invalid. The application will keep track of the valid regions. > > This is a special case because you don't care about performance and > you don't care about not returning random EFAULT errors from syscalls > like read(). Exactly. > > This mount option seems non intrusive enough and hugetlbfs is quite > special already, so I'm not particularly concerned by the fact it's > one more special tweak. > > If it would be enough to convert the SIGBUS into a (killable) process > hang, you could still use uffd and there would be no need to send the > uffd to a manager. You'd find the corrupting buggy process stuck in > handle_userfault(). This could be a useful feature in debug mode. However, In the normal mode the application should exit/die. > > As an alternative to the mount option we could consider adding > UFFD_FEATURE_SIGBUS that tells the handle_userfault() to simply return > VM_FAULT_SIGBUS in presence of a pagefault event. You'd still get > weird EFAULT or erratic retvals from syscalls so it would only be > usable in for your robustness feature. Then you could use UFFDIO_COPY > too to fill the memory atomically which runs faster than a page fault > (fallocate punch hole still required to zap it). > > Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out, > branch for this corner case to handle_userfault() isn't great and the > hugetlbfs mount option is absolutely zero cost to the handle_userfault > which is primarily why I'm not against it.. although it's not going to > be measurable so it would be ok also to add such feature. Sure, UFFD_FEATURE_SIGBUS would address the use case for the database using hugetlbfs. This could be a generic API and so could be useful in other cases as well maybe? However for this, the userfaultfd(2) has to be opened to register. This fd has to remain opened. Is this ok? Also, even though a monitor thread will not be required for this particular feature, hopefully it will not hinder future enhancements to userfaultfd. Expectation is that the overhead of registering UFFD_FEATURE_SIGBUS is minimal, and the registration will be done by the application ones after every mmap() call as required, hopefully this is not required to be done frequently. In the database use case, the registration will mainly be done once in the beginning when mapping hugetlbfs files, so should be ok. The mount option proposed, would give one consistent behavior for the filesystem and will not require the application to take any additional steps. If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount option, I could look into that. Thanks, -Prakash. > > Thanks, > Andrea ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH RFC] hugetlbfs 'noautofill' mount option 2017-06-20 23:35 ` Prakash Sangappa @ 2017-06-27 20:57 ` Prakash Sangappa 0 siblings, 0 replies; 16+ messages in thread From: Prakash Sangappa @ 2017-06-27 20:57 UTC (permalink / raw) To: Andrea Arcangeli Cc: Christoph Hellwig, Dave Hansen, linux-kernel, linux-mm, Mike Rapoport, Mike Kravetz On 6/20/17 4:35 PM, Prakash Sangappa wrote: > > > On 6/16/17 6:15 AM, Andrea Arcangeli wrote: >> Adding a single if (ctx->feature & UFFD_FEATURE_SIGBUS) goto out, >> branch for this corner case to handle_userfault() isn't great and the >> hugetlbfs mount option is absolutely zero cost to the handle_userfault >> which is primarily why I'm not against it.. although it's not going to >> be measurable so it would be ok also to add such feature. > > > If implementing UFFD_FEATURE_SIGBUS is preferred instead of the mount > option, I could look into that. > Implementing UFFD_FEATURE_SIGBUS seems reasonable. I wanted to note here on this thread that I sent out a seperate RFC patch review for adding UFFD_FEATURE_SIGBUS. See, http://marc.info/?l=linux-mm&m=149857975906880&w=2 ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2017-06-27 20:58 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <326e38dd-b4a8-e0ca-6ff7-af60e8045c74@oracle.com> 2017-05-01 18:00 ` [PATCH RFC] hugetlbfs 'noautofill' mount option Prakash Sangappa 2017-05-02 10:53 ` Anshuman Khandual 2017-05-02 16:07 ` Prakash Sangappa 2017-05-02 21:32 ` Dave Hansen 2017-05-02 23:34 ` Prakash Sangappa 2017-05-02 23:43 ` Dave Hansen 2017-05-03 19:02 ` Prakash Sangappa 2017-05-08 5:57 ` Prakash Sangappa 2017-05-08 15:58 ` Dave Hansen 2017-05-08 22:12 ` prakash.sangappa 2017-05-09 8:58 ` Christoph Hellwig 2017-05-09 20:59 ` Prakash Sangappa 2017-05-16 16:51 ` Prakash Sangappa 2017-06-16 13:15 ` Andrea Arcangeli 2017-06-20 23:35 ` Prakash Sangappa 2017-06-27 20:57 ` Prakash Sangappa
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).