linux-fsdevel.vger.kernel.org archive mirror
* [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
@ 2018-04-20 16:33 Yang Shi
  2018-04-23  0:47 ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Yang Shi @ 2018-04-20 16:33 UTC (permalink / raw)
  To: kirill.shutemov, hughd, hch, viro, akpm
  Cc: yang.shi, linux-fsdevel, linux-mm, linux-kernel

Since tmpfs gained THP support in 4.8, hugetlbfs is no longer the only
filesystem with huge page support. tmpfs can use huge pages via THP when
mounted with the "huge=" mount option.

When applications use huge pages on hugetlbfs, they just need to check the
filesystem magic number, but that is not enough for tmpfs. Make
stat.st_blksize return the huge page size if the filesystem is mounted
with an appropriate "huge=" option.

Some applications could benefit from this change, for example QEMU.
When using an mmapped file as guest VM backing memory, QEMU typically
mmaps the file size plus one extra page. If the file is on hugetlbfs the
extra page is huge page sized (i.e. 2MB), but it is still 4KB on tmpfs
even though THP is enabled. tmpfs THP requires the VMA to be huge page
aligned, so if a 4KB page is used THP will not be used at all. The
/proc/meminfo fragment below shows the THP use of QEMU with 4KB pages:

ShmemHugePages:   679936 kB
ShmemPmdMapped:        0 kB

When the application reads st_blksize, tmpfs can use huge pages, and
/proc/meminfo then looks like:

ShmemHugePages:    77824 kB
ShmemPmdMapped:     6144 kB
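
A minimal userspace sketch of how an application might consume this value,
assuming st_blksize reports the huge page size on a suitably mounted tmpfs
as proposed here. The helper names file_blksize() and round_up_pow2() are
made up for illustration and are not taken from QEMU:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return the preferred block size that fstat() reports for fd,
 * or 0 if fstat() fails (hypothetical helper, illustration only).
 */
static size_t file_blksize(int fd)
{
	struct stat st;

	if (fstat(fd, &st) != 0)
		return 0;
	return (size_t)st.st_blksize;
}

/* Round len up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t len, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (len + align - 1) & ~(align - 1);
}
```

A caller could then size and align its mapping with something like
round_up_pow2(file_size, file_blksize(fd)) instead of assuming the base
page size. Whether a given application should do so is a separate question.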

statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it
may also fall back to 4KB pages silently if there are not enough huge pages.
Furthermore, a different f_bsize makes the max_blocks and free_blocks
calculations harder without much benefit. Returning the huge page
size via stat.st_blksize sounds good enough.

Since PUD-sized huge pages are not yet supported for THP, it just
returns HPAGE_PMD_SIZE for now.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Suggested-by: Christoph Hellwig <hch@infradead.org>
---
v1 --> v2:
* Adopted the suggestion from hch to return huge page size via st_blksize
  instead of creating a new flag.

 mm/shmem.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index b859192..3704258 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -39,6 +39,7 @@
 #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
 
 static struct vfsmount *shm_mnt;
+static bool is_huge = false;
 
 #ifdef CONFIG_SHMEM
 /*
@@ -995,6 +996,8 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
 		spin_unlock_irq(&info->lock);
 	}
 	generic_fillattr(inode, stat);
+	if (is_huge)
+		stat->blksize = HPAGE_PMD_SIZE;
 	return 0;
 }
 
@@ -3574,6 +3577,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
 					huge != SHMEM_HUGE_NEVER)
 				goto bad_val;
 			sbinfo->huge = huge;
+			is_huge = true;
 #endif
 #ifdef CONFIG_NUMA
 		} else if (!strcmp(this_char,"mpol")) {
-- 
1.8.3.1


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-20 16:33 [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on Yang Shi
@ 2018-04-23  0:47 ` Michal Hocko
  2018-04-23  3:28   ` Yang Shi
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2018-04-23  0:47 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, hughd, hch, viro, akpm, linux-fsdevel, linux-mm,
	linux-kernel

On Sat 21-04-18 00:33:59, Yang Shi wrote:
> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
> filesystem with huge page support anymore. tmpfs can use huge page via
> THP when mounting by "huge=" mount option.
> 
> When applications use huge page on hugetlbfs, it just need check the
> filesystem magic number, but it is not enough for tmpfs. Make
> stat.st_blksize return huge page size if it is mounted by appropriate
> "huge=" option.
> 
> Some applications could benefit from this change, for example QEMU.
> When use mmap file as guest VM backend memory, QEMU typically mmap the
> file size plus one extra page. If the file is on hugetlbfs the extra
> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
> if 4KB page is used THP will not be used at all. The below /proc/meminfo
> fragment shows the THP use of QEMU with 4K page:
> 
> ShmemHugePages:   679936 kB
> ShmemPmdMapped:        0 kB
> 
> By reading st_blksize, tmpfs can use huge page, then /proc/meminfo looks
> like:
> 
> ShmemHugePages:    77824 kB
> ShmemPmdMapped:     6144 kB
> 
> statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it
> also may fallback to 4KB page silently if there is not enough huge page.
> Furthermore, different f_bsize makes max_blocks and free_blocks
> calculation harder but without too much benefit. Returning huge page
> size via stat.st_blksize sounds good enough.

I am not sure I understand the above. So do QEMU or other tmpfs users
rely on f_bsize to do mmap alignment tricks? Also, I thought that THP
would be used from the first aligned address even when the initial/last
portion of the mapping is not THP aligned.

And more importantly
[...]
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -39,6 +39,7 @@
>  #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
>  
>  static struct vfsmount *shm_mnt;
> +static bool is_huge = false;
>  
>  #ifdef CONFIG_SHMEM
>  /*
> @@ -995,6 +996,8 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>  		spin_unlock_irq(&info->lock);
>  	}
>  	generic_fillattr(inode, stat);
> +	if (is_huge)
> +		stat->blksize = HPAGE_PMD_SIZE;
>  	return 0;
>  }
>  
> @@ -3574,6 +3577,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
>  					huge != SHMEM_HUGE_NEVER)
>  				goto bad_val;
>  			sbinfo->huge = huge;
> +			is_huge = true;

Huh! How come this is a global flag? What if we have multiple shmem
mounts, some with huge pages enabled and some without? Btw. we seem to
already have that information stored in the superblock:
		} else if (!strcmp(this_char, "huge")) {
			int huge;
			huge = shmem_parse_huge(value);
			if (huge < 0)
				goto bad_val;
			if (!has_transparent_hugepage() &&
					huge != SHMEM_HUGE_NEVER)
				goto bad_val;
			sbinfo->huge = huge;
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-23  0:47 ` Michal Hocko
@ 2018-04-23  3:28   ` Yang Shi
  2018-04-23 15:04     ` Michal Hocko
  0 siblings, 1 reply; 8+ messages in thread
From: Yang Shi @ 2018-04-23  3:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kirill.shutemov, hughd, hch, viro, akpm, linux-fsdevel, linux-mm,
	linux-kernel



On 4/22/18 6:47 PM, Michal Hocko wrote:
> On Sat 21-04-18 00:33:59, Yang Shi wrote:
>> Since tmpfs THP was supported in 4.8, hugetlbfs is not the only
>> filesystem with huge page support anymore. tmpfs can use huge page via
>> THP when mounting by "huge=" mount option.
>>
>> When applications use huge page on hugetlbfs, it just need check the
>> filesystem magic number, but it is not enough for tmpfs. Make
>> stat.st_blksize return huge page size if it is mounted by appropriate
>> "huge=" option.
>>
>> Some applications could benefit from this change, for example QEMU.
>> When use mmap file as guest VM backend memory, QEMU typically mmap the
>> file size plus one extra page. If the file is on hugetlbfs the extra
>> page is huge page size (i.e. 2MB), but it is still 4KB on tmpfs even
>> though THP is enabled. tmpfs THP requires VMA is huge page aligned, so
>> if 4KB page is used THP will not be used at all. The below /proc/meminfo
>> fragment shows the THP use of QEMU with 4K page:
>>
>> ShmemHugePages:   679936 kB
>> ShmemPmdMapped:        0 kB
>>
>> By reading st_blksize, tmpfs can use huge page, then /proc/meminfo looks
>> like:
>>
>> ShmemHugePages:    77824 kB
>> ShmemPmdMapped:     6144 kB
>>
>> statfs.f_bsize still returns 4KB for tmpfs since THP could be split, and it
>> also may fallback to 4KB page silently if there is not enough huge page.
>> Furthermore, different f_bsize makes max_blocks and free_blocks
>> calculation harder but without too much benefit. Returning huge page
>> size via stat.st_blksize sounds good enough.
> I am not sure I understand the above. So does QEMU or other tmpfs users
> rely on f_bsize to do mmap alignment tricks? Also I thought that THP

QEMU doesn't. It just checks the filesystem magic number now; if it is
hugetlbfs, then it does the mmap with huge page size alignment.

> will be used on the first aligned address even when the initial/last
> portion of the mapping is not THP aligned.

No, my test shows it is not. Also, transhuge_vma_suitable() does check
the virtual address alignment. If it is not huge page size aligned, it
will not set up a PMD for the huge page.

>
> And more importantly
> [...]
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -39,6 +39,7 @@
>>   #include <asm/tlbflush.h> /* for arch/microblaze update_mmu_cache() */
>>   
>>   static struct vfsmount *shm_mnt;
>> +static bool is_huge = false;
>>   
>>   #ifdef CONFIG_SHMEM
>>   /*
>> @@ -995,6 +996,8 @@ static int shmem_getattr(const struct path *path, struct kstat *stat,
>>   		spin_unlock_irq(&info->lock);
>>   	}
>>   	generic_fillattr(inode, stat);
>> +	if (is_huge)
>> +		stat->blksize = HPAGE_PMD_SIZE;
>>   	return 0;
>>   }
>>   
>> @@ -3574,6 +3577,7 @@ static int shmem_parse_options(char *options, struct shmem_sb_info *sbinfo,
>>   					huge != SHMEM_HUGE_NEVER)
>>   				goto bad_val;
>>   			sbinfo->huge = huge;
>> +			is_huge = true;
> Huh! How come this is a global flag. What if we have multiple shmem
> mounts some with huge pages enabled and some without? Btw. we seem to
> already have that information stored in the supperblock
> 		} else if (!strcmp(this_char, "huge")) {
> 			int huge;
> 			huge = shmem_parse_huge(value);
> 			if (huge < 0)
> 				goto bad_val;
> 			if (!has_transparent_hugepage() &&
> 					huge != SHMEM_HUGE_NEVER)
> 				goto bad_val;
> 			sbinfo->huge = huge;

Aha, my bad. I should have used SHMEM_SB(inode->i_sb) to get the
shmem_sb_info and then checked its huge setting. Thanks a lot for
catching this. Will fix in the new version.

Yang


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-23  3:28   ` Yang Shi
@ 2018-04-23 15:04     ` Michal Hocko
  2018-04-23 16:19       ` Yang Shi
  2018-04-24  3:41       ` Yang Shi
  0 siblings, 2 replies; 8+ messages in thread
From: Michal Hocko @ 2018-04-23 15:04 UTC (permalink / raw)
  To: kirill.shutemov, Yang Shi
  Cc: hughd, hch, viro, akpm, linux-fsdevel, linux-mm, linux-kernel

On Sun 22-04-18 21:28:59, Yang Shi wrote:
> 
> 
> On 4/22/18 6:47 PM, Michal Hocko wrote:
[...]
> > will be used on the first aligned address even when the initial/last
> > portion of the mapping is not THP aligned.
> 
> No, my test shows it is not. And, transhuge_vma_suitable() does check the
> virtual address alignment. If it is not huge page size aligned, it will not
> set PMD for huge page.

It's been quite some time since I've looked at that code but I think you
are wrong. It just doesn't make sense to make the THP decision on the
VMA alignment much. Kirill, can you clarify please?

Please note that I have no objections to actually exporting the huge page
size as the max block size, but your changelog just doesn't make any
sense to me.
-- 
Michal Hocko
SUSE Labs


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-23 15:04     ` Michal Hocko
@ 2018-04-23 16:19       ` Yang Shi
  2018-04-24  3:41       ` Yang Shi
  1 sibling, 0 replies; 8+ messages in thread
From: Yang Shi @ 2018-04-23 16:19 UTC (permalink / raw)
  To: Michal Hocko, kirill.shutemov
  Cc: hughd, hch, viro, akpm, linux-fsdevel, linux-mm, linux-kernel



On 4/23/18 9:04 AM, Michal Hocko wrote:
> On Sun 22-04-18 21:28:59, Yang Shi wrote:
>>
>> On 4/22/18 6:47 PM, Michal Hocko wrote:
> [...]
>>> will be used on the first aligned address even when the initial/last
>>> portion of the mapping is not THP aligned.
>> No, my test shows it is not. And, transhuge_vma_suitable() does check the
>> virtual address alignment. If it is not huge page size aligned, it will not
>> set PMD for huge page.
> It's been quite some time since I've looked at that code but I think you
> are wrong. It just doesn't make sense to make the THP decision on the
> VMA alignment much. Kirill, can you clarify please?

In the test, QEMU is trying to mmap a file (16GB in my configuration) plus
a guard page. If the page size is 4KB, no pages are mapped by PMD, but if
the page size is 2MB (huge page aligned) we can see that a lot of pages
are mapped by PMD. The test result is shown in the commit log.

So, if your assumption is right, there must be something wrong in the THP code.

>
> Please note that I have no objections to actually export the huge page
> size as the max block size but your changelog just doesn't make any
> sense to me.

Thanks,
Yang


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-23 15:04     ` Michal Hocko
  2018-04-23 16:19       ` Yang Shi
@ 2018-04-24  3:41       ` Yang Shi
  2018-04-24 12:43         ` Michal Hocko
  1 sibling, 1 reply; 8+ messages in thread
From: Yang Shi @ 2018-04-24  3:41 UTC (permalink / raw)
  To: Michal Hocko, kirill.shutemov
  Cc: hughd, hch, viro, akpm, linux-fsdevel, linux-mm, linux-kernel



On 4/23/18 9:04 AM, Michal Hocko wrote:
> On Sun 22-04-18 21:28:59, Yang Shi wrote:
>>
>> On 4/22/18 6:47 PM, Michal Hocko wrote:
> [...]
>>> will be used on the first aligned address even when the initial/last
>>> portion of the mapping is not THP aligned.
>> No, my test shows it is not. And, transhuge_vma_suitable() does check the
>> virtual address alignment. If it is not huge page size aligned, it will not
>> set PMD for huge page.
> It's been quite some time since I've looked at that code but I think you
> are wrong. It just doesn't make sense to make the THP decision on the
> VMA alignment much. Kirill, can you clarify please?

Thanks a lot to Michal and Kirill for elaborating on how tmpfs THP sets up
PMD mappings.

I did a quick test; THP will be PMD mapped as long as:
* the hint address is huge page aligned if MAP_FIXED is used
Or
* the offset is huge page aligned
And
* the size is big enough (>= huge page size)

This test does verify what Kirill said. I also dug a little further into
the QEMU code and ran strace: QEMU does try to mmap the file to a non huge
page aligned address with MAP_FIXED.
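
One common way for userspace to get a huge page aligned address is to
over-reserve address space and trim it. The sketch below only illustrates
that idea and is not QEMU's code: reserve_aligned() and align_up() are
hypothetical helpers, and the 2MB HUGE_ALIGN value is an assumption
matching the PMD size discussed in this thread.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_ALIGN ((size_t)2 << 20)	/* assumed 2MB PMD size */

/* Round addr up to a multiple of align; align must be a power of two. */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
	return (addr + align - 1) & ~((uintptr_t)align - 1);
}

/*
 * Reserve len bytes of PROT_NONE address space starting at an address
 * aligned to align. len must be a non-zero multiple of the base page
 * size. Returns MAP_FAILED on error.
 */
static void *reserve_aligned(size_t len, size_t align)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t total, head, tail;
	uintptr_t start;
	void *raw;

	if (len == 0 || len % page != 0)
		return MAP_FAILED;

	/* Over-reserve so an aligned start must exist inside the range. */
	total = len + align;
	raw = mmap(NULL, total, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return MAP_FAILED;

	start = align_up((uintptr_t)raw, align);
	head = start - (uintptr_t)raw;
	tail = total - head - len;

	/* Give back the unaligned head and the unused tail. */
	if (head)
		munmap(raw, head);
	if (tail)
		munmap((void *)(start + len), tail);
	return (void *)start;
}
```

The file could then be mapped over the reserved range with MAP_FIXED and a
huge page aligned offset, which satisfies the conditions listed above.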

I will correct the commit log and then submit v4.

Yang

>
> Please note that I have no objections to actually export the huge page
> size as the max block size but your changelog just doesn't make any
> sense to me.


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-24  3:41       ` Yang Shi
@ 2018-04-24 12:43         ` Michal Hocko
  2018-04-24 13:08           ` Yang Shi
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Hocko @ 2018-04-24 12:43 UTC (permalink / raw)
  To: Yang Shi
  Cc: kirill.shutemov, hughd, hch, viro, akpm, linux-fsdevel, linux-mm,
	linux-kernel

On Mon 23-04-18 21:41:50, Yang Shi wrote:
> 
> 
> On 4/23/18 9:04 AM, Michal Hocko wrote:
> > On Sun 22-04-18 21:28:59, Yang Shi wrote:
> > > 
> > > On 4/22/18 6:47 PM, Michal Hocko wrote:
> > [...]
> > > > will be used on the first aligned address even when the initial/last
> > > > portion of the mapping is not THP aligned.
> > > No, my test shows it is not. And, transhuge_vma_suitable() does check the
> > > virtual address alignment. If it is not huge page size aligned, it will not
> > > set PMD for huge page.
> > It's been quite some time since I've looked at that code but I think you
> > are wrong. It just doesn't make sense to make the THP decision on the
> > VMA alignment much. Kirill, can you clarify please?
> 
> Thanks a lot Michal and Kirill to elaborate how tmpfs THP make pmd map.
> 
> I did a quick test, THP will be PMD mapped as long as :
> * hint address is huge page aligned if MAP_FIXED
> Or
> * offset is huge page aligned
> And
> * The size is big enough (>= huge page size)
> 
> This test does verify what Kirill said. And, I dig into a little further
> qemu code and did strace, qemu does try to mmap the file to non huge page
> aligned address with MAP_FIXED.

Does it make sense to contact the QEMU developers and possibly get this fixed?

-- 
Michal Hocko
SUSE Labs


* Re: [RFC v2 PATCH] mm: shmem: make stat.st_blksize return huge page size if THP is on
  2018-04-24 12:43         ` Michal Hocko
@ 2018-04-24 13:08           ` Yang Shi
  0 siblings, 0 replies; 8+ messages in thread
From: Yang Shi @ 2018-04-24 13:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: kirill.shutemov, hughd, hch, viro, akpm, linux-fsdevel, linux-mm,
	linux-kernel



On 4/24/18 6:43 AM, Michal Hocko wrote:
> On Mon 23-04-18 21:41:50, Yang Shi wrote:
>>
>> On 4/23/18 9:04 AM, Michal Hocko wrote:
>>> On Sun 22-04-18 21:28:59, Yang Shi wrote:
>>>> On 4/22/18 6:47 PM, Michal Hocko wrote:
>>> [...]
>>>>> will be used on the first aligned address even when the initial/last
>>>>> portion of the mapping is not THP aligned.
>>>> No, my test shows it is not. And, transhuge_vma_suitable() does check the
>>>> virtual address alignment. If it is not huge page size aligned, it will not
>>>> set PMD for huge page.
>>> It's been quite some time since I've looked at that code but I think you
>>> are wrong. It just doesn't make sense to make the THP decision on the
>>> VMA alignment much. Kirill, can you clarify please?
>> Thanks a lot Michal and Kirill to elaborate how tmpfs THP make pmd map.
>>
>> I did a quick test, THP will be PMD mapped as long as :
>> * hint address is huge page aligned if MAP_FIXED
>> Or
>> * offset is huge page aligned
>> And
>> * The size is big enough (>= huge page size)
>>
>> This test does verify what Kirill said. And, I dig into a little further
>> qemu code and did strace, qemu does try to mmap the file to non huge page
>> aligned address with MAP_FIXED.
> Does it make sense to contact Qemu developers and probably fix this?

Yes, I think so. We can submit a bug report to QEMU.

>

