virtualization.lists.linux-foundation.org archive mirror
* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found] <CA+2MQi_C-PTqyrqBprhtGBAiDBnPQBzwu6hvyuk+QiKy0L3sHw@mail.gmail.com>
@ 2021-01-04 20:18 ` David Hildenbrand
       [not found]   ` <CA+2MQi_O47B8zOa_TwZqzRsS0LFoPS77+61mUV=yT1U3sa6xQw@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-01-04 20:18 UTC (permalink / raw)
  To: Liang Li
  Cc: Andrea Arcangeli, Michal Hocko, Michael S. Tsirkin, Dan Williams,
	Liang Li, LKML, linux-mm, Dave Hansen, Alexander Duyck,
	virtualization, Mel Gorman, Andrew Morton


> On 23.12.2020 13:12, Liang Li <liliang324@gmail.com> wrote:
> 
> On Wed, Dec 23, 2020 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
>> 
>> [...]
>> 
>>>> I was rather saying that for security it's of little use IMHO.
>>>> Application/VM start up time might be improved by using huge pages (and
>>>> pre-zeroing these). Free page reporting might be improved by using
>>>> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
>>>> 
>>>>> this feature. Of all of them, which one is likely to become the
>>>>> strongest one?  From the implementation, you will find it is
>>>>> configurable; users who don't want to use it can turn it off.  Is
>>>>> this not an option?
>>>> 
>>>> Well, we have to maintain the feature and sacrifice a page flag. For
>>>> example, do we expect someone explicitly enabling the feature just to
>>>> speed up startup time of an app that consumes a lot of memory? I highly
>>>> doubt it.
>>> 
>>> In our production environment, there are three main applications with
>>> such a requirement: one is QEMU [creating a VM with an SR-IOV
>>> passthrough device], the other two are DPDK-related applications, DPDK
>>> OVS and SPDK vhost; for best performance, they populate memory when
>>> starting up. For SPDK vhost, we make use of the
>>> VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade, which
>>> is done by killing the old process and starting a new one with the new
>>> binary. In this case, we want the new process to start as quickly as
>>> possible to shorten the service downtime. We really enable this feature
>>> to speed up startup time for them  :)

Am I wrong, or does using hugetlbfs/tmpfs ... i.e., a file that is not deleted between shutting down the old instance and firing up the new one, just solve this issue?
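
Roughly, the new instance would only have to do something like the following untested sketch (path and size are made up): because the file was never deleted, the pages are still populated and were already zeroed when they were first allocated for the file, so the restart does not pay the allocation + zeroing cost again.

/*
 * Untested sketch: the old vhost process created and mapped this
 * hugetlbfs file; because the file is never deleted, the new process
 * simply re-opens and re-maps it.  The pages are still populated, so
 * the restart pays mapping cost only, not allocation + zeroing.
 * Path and size are made-up examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/hugepages/spdk_mem"; /* hypothetical, created by the old instance */
    size_t size = 8UL << 30;                      /* must match what the old instance used */

    int fd = open(path, O_RDWR);                  /* the file survived the old process */
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* MAP_POPULATE pre-faults the whole mapping up front; since the pages
     * already exist in the file, no new pages are allocated or zeroed. */
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* ... hand 'mem' to the vhost backend and start serving ... */
    munmap(mem, size);
    close(fd);
    return 0;
}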

>> 
>> Thanks for info on the use case!
>> 
>> All of these use cases either already use, or could use, huge pages
>> IMHO. It's not your ordinary proprietary gaming app :) This is where
>> pre-zeroing of huge pages could already help.
> 
> You are welcome.  For some historical reason, some of our services are
> not using hugetlbfs; that is why I didn't start with hugetlbfs.
> 
>> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
>> creating a file and pre-zeroing it from another process, or am I missing
>> something important? At least for QEMU this should work AFAIK, where you
>> can just pass the file to be used via memory-backend-file.
>> 
> If using another process to create a file, we can offload the overhead to
> another process, and there is no need to pre-zero its content; just
> populating the memory is enough.

Right, if non-zero memory can be tolerated (as it usually has to be for VMs).

> If we do it that way, then how do we determine the size of the file? It
> depends on the RAM size of the VM the customer buys.
> Maybe we can create a file that is large enough in advance and truncate it
> to the right size just before the VM is created. Then, how many large
> files should be created on a host?

That's mostly existing scheduling logic already, no? (How many VMs can I eventually put onto a specific machine?)

> You will find there are a lot of things that have to be handled properly.
> I think it's possible to make it work well, but we would transfer the
> management complexity to upper-layer components. It's bad practice to let
> upper-layer components deal with such low-level details, which should be
> handled in the OS layer.

It's bad practice to squeeze things into the kernel that can just as well be handled in upper layers ;)


* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found]   ` <CA+2MQi_O47B8zOa_TwZqzRsS0LFoPS77+61mUV=yT1U3sa6xQw@mail.gmail.com>
@ 2021-01-05  9:39     ` David Hildenbrand
       [not found]       ` <CA+2MQi9Qb5srEcx4qKNVWdphBGP0=HHV_h0hWghDMFKFmCOTMg@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2021-01-05  9:39 UTC (permalink / raw)
  To: Liang Li
  Cc: Andrea Arcangeli, Michal Hocko, Michael S. Tsirkin, Dan Williams,
	Liang Li, LKML, linux-mm, Dave Hansen, Alexander Duyck,
	virtualization, Mel Gorman, Andrew Morton

On 05.01.21 03:14, Liang Li wrote:
>>>>> In our production environment, there are three main applications with
>>>>> such a requirement: one is QEMU [creating a VM with an SR-IOV
>>>>> passthrough device], the other two are DPDK-related applications, DPDK
>>>>> OVS and SPDK vhost; for best performance, they populate memory when
>>>>> starting up. For SPDK vhost, we make use of the
>>>>> VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade, which
>>>>> is done by killing the old process and starting a new one with the new
>>>>> binary. In this case, we want the new process to start as quickly as
>>>>> possible to shorten the service downtime. We really enable this feature
>>>>> to speed up startup time for them  :)
>>
>> Am I wrong, or does using hugetlbfs/tmpfs ... i.e., a file that is not deleted between shutting down the old instance and firing up the new one, just solve this issue?
> 
> You are right, it works for the SPDK vhost upgrade case.
> 
>>
>>>>
>>>> Thanks for info on the use case!
>>>>
>>>> All of these use cases either already use, or could use, huge pages
>>>> IMHO. It's not your ordinary proprietary gaming app :) This is where
>>>> pre-zeroing of huge pages could already help.
>>>
>>> You are welcome.  For some historical reason, some of our services are
>>> not using hugetlbfs; that is why I didn't start with hugetlbfs.
>>>
>>>> Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
>>>> creating a file and pre-zeroing it from another process, or am I missing
>>>> something important? At least for QEMU this should work AFAIK, where you
>>>> can just pass the file to be used via memory-backend-file.
>>>>
>>> If using another process to create a file, we can offload the overhead to
>>> another process, and there is no need to pre-zero its content; just
>>> populating the memory is enough.
>>
>> Right, if non-zero memory can be tolerated (as it usually has to be for VMs).
> 
> I mean there is obviously no need to pre-zero the file content in user
> space; the kernel will do it when populating the memory.
> 
>>> If we do it that way, then how do we determine the size of the file? It
>>> depends on the RAM size of the VM the customer buys.
>>> Maybe we can create a file that is large enough in advance and truncate it
>>> to the right size just before the VM is created. Then, how many large
>>> files should be created on a host?
>>
>> That's mostly existing scheduling logic already, no? (How many VMs can I eventually put onto a specific machine?)
> 
> It depends on how the scheduling component is designed. Yes, you can put
> 10 VMs with 4C8G (4 CPUs, 8G RAM) on one host and 20 VMs with 2C4G on
> another one. But if one type of them, e.g. 4C8G, is sold out, customers
> can't buy more 4C8G VMs even though there are still free 2C4G VMs whose
> reserved resources could be provided as 4C8G VMs.
> 

1. You can, the startup time will just be a little slower? E.g., grow a
pre-allocated 4G file to 8G (see the rough sketch below this list).

2. Or let's be creative: teach QEMU to construct a single
RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
don't go crazy with different VM sizes / size differences.

3. In your example above, you can dynamically rebalance as VMs are
getting sold, to make sure you always have "big ones" lying around that
you can shrink on demand.
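
For (1), roughly something like this untested sketch (the file name is made up, and on hugetlbfs the new size has to be a multiple of the huge page size):

/*
 * Untested sketch for option 1: grow an already pre-allocated 4G backing
 * file to 8G right before starting the bigger VM.  Only the new 4G tail
 * still has to be populated and zeroed; the first 4G are reused as-is.
 * The path and sizes are made-up examples.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/hugepages/pool-slot0";  /* hypothetical */
    off_t new_size = 8LL << 30;                      /* 4G -> 8G */

    int fd = open(path, O_RDWR);
    if (fd < 0 || ftruncate(fd, new_size)) {
        perror("grow backing file");
        return 1;
    }
    close(fd);
    return 0;
}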

> 
> You must know there are a lot of functions in the kernel which could be
> done in userspace, e.g. some of the device emulations like the APIC, or
> the vhost-net backend, which has a userspace implementation.   :)
> Whether that is bad or not depends on the benefits the solution brings.
> From the viewpoint of a userspace application, the kernel should provide
> a high-performance memory management service. That's why I think it
> should be done in the kernel.

As I expressed a couple of times already, I don't see why using
hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.

We really don't *want* complicated things deep down in the mm core if
there are reasonable alternatives.

-- 
Thanks,

David / dhildenb


* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found]       ` <CA+2MQi9Qb5srEcx4qKNVWdphBGP0=HHV_h0hWghDMFKFmCOTMg@mail.gmail.com>
@ 2021-01-05 10:27         ` David Hildenbrand
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2021-01-05 10:27 UTC (permalink / raw)
  To: Liang Li
  Cc: Andrea Arcangeli, Michal Hocko, Michael S. Tsirkin, Dan Williams,
	Liang Li, LKML, linux-mm, Dave Hansen, Alexander Duyck,
	virtualization, Mel Gorman, Andrew Morton

On 05.01.21 11:22, Liang Li wrote:
>>>> That's mostly existing scheduling logic already, no? (How many VMs can I eventually put onto a specific machine?)
>>>
>>> It depends on how the scheduling component is designed. Yes, you can put
>>> 10 VMs with 4C8G (4 CPUs, 8G RAM) on one host and 20 VMs with 2C4G on
>>> another one. But if one type of them, e.g. 4C8G, is sold out, customers
>>> can't buy more 4C8G VMs even though there are still free 2C4G VMs whose
>>> reserved resources could be provided as 4C8G VMs.
>>>
>>
>> 1. You can, the startup time will just be a little slower? E.g., grow a
>> pre-allocated 4G file to 8G.
>>
>> 2. Or let's be creative: teach QEMU to construct a single
>> RAMBlock/MemoryRegion out of multiple tmpfs files. Works as long as you
>> don't go crazy with different VM sizes / size differences.
>>
>> 3. In your example above, you can dynamically rebalance as VMs are
>> getting sold, to make sure you always have "big ones" lying around that
>> you can shrink on demand.
>>
> Yes, we can always come up with some ways to make things work, but it
> will drive the developers of the upper-layer components crazy :)

I'd say that's life in the upper layers when optimizing special (!) use cases. :)

>>>
>>> You must know there are a lot of functions in the kernel which could be
>>> done in userspace, e.g. some of the device emulations like the APIC, or
>>> the vhost-net backend, which has a userspace implementation.   :)
>>> Whether that is bad or not depends on the benefits the solution brings.
>>> From the viewpoint of a userspace application, the kernel should provide
>>> a high-performance memory management service. That's why I think it
>>> should be done in the kernel.
>>
>> As I expressed a couple of times already, I don't see why using
>> hugetlbfs and implementing some sort of pre-zeroing there isn't sufficient.
> 
> Did I miss something before? I thought you doubted the need for
> hugetlbfs free page pre-zeroing. Hugetlbfs is a good choice and is
> sufficient.

I even remember suggesting to focus on hugetlbfs when we chatted during
your KVM Forum talk. Maybe I was not clear before.

> 
>> We really don't *want* complicated things deep down in the mm core if
>> there are reasonable alternatives.
>>
> I understand your concern; we should have a sufficient reason to add a
> new feature to the kernel. For this one, its main value is to make
> applications' lives easier. And implementing it in hugetlbfs can avoid
> adding more complexity to the core MM.

Exactly, that's my point. Some people might still disagree with the
hugetlbfs approach, but there it's easier to add tunables without
affecting the overall system.

-- 
Thanks,

David / dhildenb


* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found]       ` <CA+2MQi87+N87x+gLuJPurst38AfFQhnc9eyHr8On55d1+WY5zQ@mail.gmail.com>
@ 2020-12-23  8:41         ` David Hildenbrand
  0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2020-12-23  8:41 UTC (permalink / raw)
  To: Liang Li
  Cc: Andrea Arcangeli, Michal Hocko, Michael S. Tsirkin, Dan Williams,
	Liang Li, linux-kernel, linux-mm, Dave Hansen, Alexander Duyck,
	virtualization, Mel Gorman, Andrew Morton

[...]

>> I was rather saying that for security it's of little use IMHO.
>> Application/VM start up time might be improved by using huge pages (and
>> pre-zeroing these). Free page reporting might be improved by using
>> MADV_FREE instead of MADV_DONTNEED in the hypervisor.
>>
>>> this feature. Of all of them, which one is likely to become the
>>> strongest one?  From the implementation, you will find it is
>>> configurable; users who don't want to use it can turn it off.  Is
>>> this not an option?
>>
>> Well, we have to maintain the feature and sacrifice a page flag. For
>> example, do we expect someone explicitly enabling the feature just to
>> speed up startup time of an app that consumes a lot of memory? I highly
>> doubt it.
> 
> In our production environment, there are three main applications with
> such a requirement: one is QEMU [creating a VM with an SR-IOV
> passthrough device], the other two are DPDK-related applications, DPDK
> OVS and SPDK vhost; for best performance, they populate memory when
> starting up. For SPDK vhost, we make use of the
> VHOST_USER_GET/SET_INFLIGHT_FD feature for vhost 'live' upgrade, which
> is done by killing the old process and starting a new one with the new
> binary. In this case, we want the new process to start as quickly as
> possible to shorten the service downtime. We really enable this feature
> to speed up startup time for them  :)

Thanks for info on the use case!

All of these use cases either already use, or could use, huge pages
IMHO. It's not your ordinary proprietary gaming app :) This is where
pre-zeroing of huge pages could already help.

Just wondering, wouldn't it be possible to use tmpfs/hugetlbfs ...
creating a file and pre-zeroing it from another process, or am I missing
something important? At least for QEMU this should work AFAIK, where you
can just pass the file to be used via memory-backend-file.
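
For example, an untested helper sketch along these lines (the path, the 8 GiB size and the 2 MiB huge page size are made-up assumptions) could populate the file ahead of time; QEMU would then map the very same file via memory-backend-file (mem-path pointing at it, share=on) and find the pages already allocated and zeroed:

/*
 * Untested sketch of the "populate from another process" idea: a helper
 * creates the backing file and touches every page once, so the pages are
 * already allocated (and zeroed by the kernel) before QEMU maps the same
 * file.  Path, size and the 2 MiB huge page size are made-up assumptions,
 * and enough huge pages must be reserved beforehand.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/hugepages/vm0-ram";  /* hypothetical */
    size_t size = 8UL << 30;                      /* 8 GiB */
    size_t step = 2UL << 20;                      /* assuming 2 MiB huge pages */

    int fd = open(path, O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, (off_t)size)) {
        perror("create backing file");
        return 1;
    }
    char *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Touch one byte per huge page so every page is faulted in now,
     * ahead of the VM start. */
    for (size_t off = 0; off < size; off += step)
        p[off] = 0;
    munmap(p, size);
    close(fd);
    return 0;
}
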

> 
>> I'd love to hear opinions of other people. (a lot of people are offline
>> until beginning of January, including, well, actually me :) )
> 
> OK. I will wait some time for others' feedback. Happy holidays!

To you too, cheers!


-- 
Thanks,

David / dhildenb


* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found] <20201221162519.GA22504@open-light-1.localdomain>
  2020-12-22  8:47 ` David Hildenbrand
  2020-12-22 12:23 ` Matthew Wilcox
@ 2020-12-22 19:13 ` Alexander Duyck
  2 siblings, 0 replies; 8+ messages in thread
From: Alexander Duyck @ 2020-12-22 19:13 UTC (permalink / raw)
  To: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Dan Williams, Michael S. Tsirkin, David Hildenbrand, Jason Wang,
	Dave Hansen, Michal Hocko, Liang Li, linux-mm, LKML,
	virtualization

On Mon, Dec 21, 2020 at 8:25 AM Liang Li <liliang.opensource@gmail.com> wrote:
>
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zeroing out the page content usually happens when allocating pages with
> the __GFP_ZERO flag; this is a time-consuming operation and it makes the
> population of a large VMA area very slow. This patch series introduces a
> new feature to zero out pages before page allocation, which can help to
> speed up page allocation with __GFP_ZERO.
>
> My original intention for adding this feature was to shorten VM creation
> time when an SR-IOV device is attached; it works well and the VM creation
> time is reduced by about 90%.
>
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                   round1      round2      round3
> w/o this patch:    23.5s       24.7s       24.6s
> w/ this patch:     10.2s       10.3s       11.2s
>
> QEMU uses 4K pages, THP is on
>                   round1      round2      round3
> w/o this patch:    17.9s       14.8s       14.9s
> w/ this patch:     1.9s        1.8s        1.9s
> =====================================================
>
> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:

So I am not sure page reporting is the best thing to base this page
zeroing setup on. The idea with page reporting is to essentially act
as a leaky bucket and allow the guest to slowly drop memory it isn't
using, so that if it needs to reinflate it won't clash with the
applications that need memory. What you are doing here seems far more
aggressive, in that you are going down to low-order pages and sleeping
instead of rescheduling for the next time interval.

Also I am not sure your SR-IOV creation time test is a good
justification for this extra overhead. With your patches applied all
you are doing is making use of the free time before the test to do the
page zeroing instead of doing it during your test. As such your CPU
overhead prior to running the test would be higher and you haven't
captured that information.

One thing I would be interested in seeing is what load this adds when
you are running simple memory allocation/free type tests on the
system. For example, it might be useful to see what the will-it-scale
page_fault1 tests look like with this patch applied versus not
applied. I suspect it would add some amount of overhead, as you have
to spend a ton of time scanning all the pages, and that will be
considerable.

* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found] <20201221162519.GA22504@open-light-1.localdomain>
  2020-12-22  8:47 ` David Hildenbrand
@ 2020-12-22 12:23 ` Matthew Wilcox
  2020-12-22 19:13 ` Alexander Duyck
  2 siblings, 0 replies; 8+ messages in thread
From: Matthew Wilcox @ 2020-12-22 12:23 UTC (permalink / raw)
  To: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Dan Williams, Michael S. Tsirkin, David Hildenbrand, Jason Wang,
	Dave Hansen, Michal Hocko, Liang Li, linux-mm, linux-kernel,
	virtualization

On Mon, Dec 21, 2020 at 11:25:22AM -0500, Liang Li wrote:
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                   round1      round2      round3
> w/o this patch:    23.5s       24.7s       24.6s
> w/ this patch:     10.2s       10.3s       11.2s
> 
> QEMU uses 4K pages, THP is on
>                   round1      round2      round3
> w/o this patch:    17.9s       14.8s       14.9s
> w/ this patch:     1.9s        1.8s        1.9s
> =====================================================

The cost of zeroing pages has to be paid somewhere.  You've successfully
moved it out of this path that you can measure.  So now you've put it
somewhere that you're not measuring.  Why is this a win?

> Speed up kernel routines
> ========================
> This can’t be guaranteed because we don’t pre zero out all the free pages,
> but it is true for most cases. It can help to speed up some important
> system calls, such as fork, which allocates zeroed pages for building page
> tables, and to speed up the page fault path, especially for huge page
> faults. A POC of hugetlb free page pre-zeroing has been done.

Try kernbench with and without your patch.

* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found]   ` <CA+2MQi89v=DZJZ7b-QaMsU2f42j4SRW47XcZvLtBj10YeqRGgQ@mail.gmail.com>
@ 2020-12-22 11:57     ` David Hildenbrand
       [not found]       ` <CA+2MQi87+N87x+gLuJPurst38AfFQhnc9eyHr8On55d1+WY5zQ@mail.gmail.com>
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2020-12-22 11:57 UTC (permalink / raw)
  To: Liang Li
  Cc: Andrea Arcangeli, Michal Hocko, Michael S. Tsirkin, Dan Williams,
	Liang Li, linux-kernel, linux-mm, Dave Hansen, Alexander Duyck,
	virtualization, Mel Gorman, Andrew Morton

> 
>>>
>>> Virtualization
>>> ==============
>>> Speed up VM creation and shorten guest boot time, especially for the PCI
>>> SR-IOV device passthrough scenario. Compared with some of the
>>> paravirtualization solutions, it is easy to deploy because it’s
>>> transparent to the guest and can handle DMA properly in the BIOS stage,
>>> while the paravirtualization solutions can’t handle that well.
>>
>> What is the "para virtualization" approach you are talking about?
> 
> I am referring to two topics from KVM Forum 2020; these docs give more details:
> https://static.sched.com/hosted_files/kvmforum2020/48/coIOMMU.pdf
> https://static.sched.com/hosted_files/kvmforum2020/51/The%20Practice%20Method%20to%20Speed%20Up%2010x%20Boot-up%20Time%20for%20Guest%20in%20Alibaba%20Cloud.pdf
> 
> and the following link is mine:
> https://static.sched.com/hosted_files/kvmforum2020/90/Speed%20Up%20Creation%20of%20a%20VM%20With%20Passthrough%20GPU.pdf

Thanks for the pointers! I actually did watch your presentation.

>>
>>>
>>> Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
>>> memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
>>> pages to the VMM, and the VMM unmaps the corresponding host pages for
>>> reclaim; when the guest allocates a page that was just reclaimed, the
>>> host has to allocate a new page and zero it out for the guest. In this
>>> case, pre-zeroing free pages will help to speed up the fault-in process
>>> and reduce the performance impact.
>>
>> Such faults in the VMM are no different to other faults, when first
>> accessing a page to be populated. Again, I wonder how much of a
>> difference it actually makes.
>>
> 
> I am not just referring to faults in the VMM, I mean the whole process
> that handles guest page faults.
> Without VIRTIO_BALLOON_F_REPORTING, pages used by the guest are zeroed
> out only once by the host. With VIRTIO_BALLOON_F_REPORTING, free pages
> are reclaimed by the host and may return to the host buddy free list.
> When the pages are given back to the guest, the host kernel needs to
> zero them out again. It means that with VIRTIO_BALLOON_F_REPORTING,
> guest memory performance will be degraded by the frequent zero-out
> operations on the host side. The performance degradation will be
> obvious in the huge page case. Free page pre-zeroing can help to make
> guest memory performance almost the same as without
> VIRTIO_BALLOON_F_REPORTING.

Yes, what I am saying is that this fault handling is no different to
ordinary faults when accessing a virtual memory location the first time
and populating a page. The only difference is that it happens
continuously, not only the first time we touch a page.

And we might be able to improve handling in the hypervisor in the
future. We have been discussing using MADV_FREE instead of MADV_DONTNEED
in QEMU for handling free page reporting. Then, guest reported pages
will only get reclaimed by the hypervisor when there is actual memory
pressure in the hypervisor (e.g., when about to swap). And zeroing a
page is an obvious improvement over going to swap. The price for zeroing
pages has to be paid at one point.
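
Roughly, the difference looks like this untested sketch (the helper and its arguments are made up for illustration, this is not QEMU code; it applies to anonymous guest memory):

/*
 * Untested sketch of the two ways the hypervisor can discard a range of
 * guest-reported free pages (anonymous guest memory); the helper and its
 * parameters are made-up placeholders.
 */
#include <stddef.h>
#include <sys/mman.h>

void discard_reported_range(void *hva, size_t len, int lazy)
{
    if (lazy) {
        /* MADV_FREE: pages are only reclaimed lazily, under real memory
         * pressure; if the guest touches them again before that happens,
         * the existing pages are reused and no new allocation + zeroing
         * is needed on the host side. */
        (void)madvise(hva, len, MADV_FREE);
    } else {
        /* MADV_DONTNEED: pages are dropped immediately; every later guest
         * access faults in a freshly allocated, freshly zeroed page, which
         * is where the repeated zeroing cost in the report/reuse cycle
         * comes from. */
        (void)madvise(hva, len, MADV_DONTNEED);
    }
}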

Also note that we've been discussing cache-related things already. If
you zero out before giving the page to the guest, the page will already
be in the cache - where the guest directly wants to access it.

[...]

>>>
>>> Security
>>> ========
>>> This is a weak version of the "introduce init_on_alloc=1 and
>>> init_on_free=1 boot options" approach, which zeroes out pages in an
>>> asynchronous way. For users who can’t tolerate the impact that
>>> 'init_on_alloc=1' or 'init_on_free=1' brings, this feature provides
>>> another choice.
>> "we don’t pre zero out all the free pages" so this is of little actual use.
> 
> OK. It seems none of the reasons listed above is strong enough for

I was rather saying that for security it's of little use IMHO.
Application/VM start up time might be improved by using huge pages (and
pre-zeroing these). Free page reporting might be improved by using
MADV_FREE instead of MADV_DONTNEED in the hypervisor.

> this feature. Of all of them, which one is likely to become the
> strongest one?  From the implementation, you will find it is
> configurable; users who don't want to use it can turn it off.  Is
> this not an option?

Well, we have to maintain the feature and sacrifice a page flag. For
example, do we expect someone explicitly enabling the feature just to
speed up startup time of an app that consumes a lot of memory? I highly
doubt it.

I'd love to hear opinions of other people. (a lot of people are offline
until beginning of January, including, well, actually me :) )

-- 
Thanks,

David / dhildenb


* Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO
       [not found] <20201221162519.GA22504@open-light-1.localdomain>
@ 2020-12-22  8:47 ` David Hildenbrand
       [not found]   ` <CA+2MQi89v=DZJZ7b-QaMsU2f42j4SRW47XcZvLtBj10YeqRGgQ@mail.gmail.com>
  2020-12-22 12:23 ` Matthew Wilcox
  2020-12-22 19:13 ` Alexander Duyck
  2 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2020-12-22  8:47 UTC (permalink / raw)
  To: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Dan Williams, Michael S. Tsirkin, Jason Wang, Dave Hansen,
	Michal Hocko, Liang Li, linux-mm, linux-kernel, virtualization

On 21.12.20 17:25, Liang Li wrote:
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
> 
> Zeroing out the page content usually happens when allocating pages with
> the __GFP_ZERO flag; this is a time-consuming operation and it makes the
> population of a large VMA area very slow. This patch series introduces a
> new feature to zero out pages before page allocation, which can help to
> speed up page allocation with __GFP_ZERO.
> 
> My original intention for adding this feature was to shorten VM creation
> time when an SR-IOV device is attached; it works well and the VM creation
> time is reduced by about 90%.
> 
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =====================================================
> QEMU uses 4K pages, THP is off
>                   round1      round2      round3
> w/o this patch:    23.5s       24.7s       24.6s
> w/ this patch:     10.2s       10.3s       11.2s
> 
> QEMU uses 4K pages, THP is on
>                   round1      round2      round3
> w/o this patch:    17.9s       14.8s       14.9s
> w/ this patch:     1.9s        1.8s        1.9s
> =====================================================
> 

I am still not convinced that we want/need this for this (main) use
case. Why can't we use huge pages for such use cases (which really care
about VM creation time) and rather deal with pre-zeroing of huge pages
instead?

If possible, I'd like to avoid GFP_ZERO (for reasons already discussed).

> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:
> 
> Interactive scenarios
> =====================
> Shorten application launch time on a desktop or mobile phone; it can help
> to improve the user experience. A test shows that on a server
> [Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz], zeroing out 1GB of RAM in the
> kernel takes about 200ms, while some commonly used applications like the
> Firefox browser or Office consume 100 ~ 300 MB of RAM just after launch;
> by pre-zeroing free pages, the application launch time could be reduced by
> about 20~60ms (can that be visually sensed?). Maybe we can make use of
> this feature to speed up the launch of Android apps (I didn't do any tests
> for Android).

I am not really sure if you can actually visually sense a difference in
your examples. The startup time of an application is not just memory
allocation (page zeroing) time. It would be interesting to see how much
of a difference this actually makes in practice (e.g., Firefox startup
time etc.).
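
To put a rough number on just the allocation + zeroing share, an untested sketch like the following (the 1 GiB size and the one-byte-per-page touch pattern are arbitrary choices) measures the first-touch cost that pre-zeroing would move out of the startup path:

/*
 * Untested micro-benchmark sketch: measure how long it takes to fault in
 * (allocate + zero) 1 GiB of anonymous memory by touching one byte per
 * page.  The size is an arbitrary choice.
 */
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    size_t size = 1UL << 30;                     /* 1 GiB */
    long page = sysconf(_SC_PAGESIZE);
    struct timespec t0, t1;

    char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t off = 0; off < size; off += (size_t)page)
        buf[off] = 1;                            /* first touch: fault + zeroing */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("first touch of %zu MiB took %.1f ms\n", size >> 20, ms);
    munmap(buf, size);
    return 0;
}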

> 
> Virtualization
> ==============
> Speed up VM creation and shorten guest boot time, especially for the PCI
> SR-IOV device passthrough scenario. Compared with some of the
> paravirtualization solutions, it is easy to deploy because it’s
> transparent to the guest and can handle DMA properly in the BIOS stage,
> while the paravirtualization solutions can’t handle that well.

What is the "para virtualization" approach you are talking about?

> 
> Improve guest performance when VIRTIO_BALLOON_F_REPORTING is used for
> memory overcommit. The VIRTIO_BALLOON_F_REPORTING feature reports guest
> pages to the VMM, and the VMM unmaps the corresponding host pages for
> reclaim; when the guest allocates a page that was just reclaimed, the
> host has to allocate a new page and zero it out for the guest. In this
> case, pre-zeroing free pages will help to speed up the fault-in process
> and reduce the performance impact.

Such faults in the VMM are no different to other faults, when first
accessing a page to be populated. Again, I wonder how much of a
difference it actually makes.

> 
> Speed up kernel routines
> ========================
> This can’t be guaranteed because we don’t pre zero out all the free pages,
> but it is true for most cases. It can help to speed up some important
> system calls, such as fork, which allocates zeroed pages for building page
> tables, and to speed up the page fault path, especially for huge page
> faults. A POC of hugetlb free page pre-zeroing has been done.

Would be interesting to have an actual example with some numbers.

> 
> Security
> ========
> This is a weak version of the "introduce init_on_alloc=1 and
> init_on_free=1 boot options" approach, which zeroes out pages in an
> asynchronous way. For users who can’t tolerate the impact that
> 'init_on_alloc=1' or 'init_on_free=1' brings, this feature provides
> another choice.

"we don’t pre zero out all the free pages" so this is of little actual use.

-- 
Thanks,

David / dhildenb

