All of lore.kernel.org
 help / color / mirror / Atom feed
From: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, borntraeger@de.ibm.com, hughd@google.com,
	izik.eidus@ravellosystems.com, chrisw@sous-sol.org,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/1] mm/ksm: improve deduplication of zero pages with colouring
Date: Wed, 18 Jan 2017 18:17:09 +0100	[thread overview]
Message-ID: <4ac20fb0-d9d2-e73f-2f17-1f69929756b7@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170118162914.GF10177@redhat.com>

On 18/01/17 17:29, Andrea Arcangeli wrote:
> It's still good to be able to exercise this code on all archs (if
> nothing else for debugging purposes). There's nothing arch dependent
> in it after all.

If it's fine for you, I'm definitely not going to complain :)

>> ZERO_PAGE() would save exactly one page and it would bring no speed
>> advantages (or rather: not using the ZERO_PAGE() would not bring any
>> speed penalty).
>> That's why I have #ifdef'd it to have it only when page coloring is
>> present. Also, for what I could see, only MIPS and s390 have page
>> coloring; I don't like the idea of imposing any overhead to all the
>> other archs.
> 
> With a sysctl disabled by default, the only overhead is 8 bytes and a
> branch?

Yes, I'm sometimes excessively paranoid.

>> I agree that this should be toggleable with a sysfs control, since it's
>> a change that can potentially negatively affect the performance in some
>> cases. I'm adding it in the next iteration.
> 
> Yes the sysctl can be useful for archs doing page coloring too, but I
> would add to all.
> 
>> Unless the userspace in the guests is creating lots of pages full of
>> zeroes :)
> 
> One question comes to mind though, why is the app doing memset(0), if
> the app limited itself to just reading the memory it'd use page
> colored zero pages that wouldn't risk to become PageKsm. That is true
> both for bare metal and guest with KSM in host. This looks a not
> optimal app.

I didn't really make a good example, although I can think of scenarios
where that could legitimately happen. Another case would be a KVM guest.
It will have a bunch of colored zero pages somewhere, but if KSM merges
those together, the advantages of colored zero pages disappear in the guest.

>> Honestly I don't think this patch will bring any benefits regarding
>> metadata -- one page more or less in the metadata won't change much. Our
>> issue is just the reading speed of the deduplicated empty pages.
> 
> The metadata amount changes, for each shared zero page we'd need to
> allocate a rmap_item.
> 
> The KSMscale introducing stable_node_chain/dup is precisely meant to
> deal with not creating a too large rmap_item chain. Because there can
> be plenty of those rmap_items, we've to create multiple zero pages and
> multiple stable nodes for those zero pages to limit the maximum number
> of rmap_items linked in a stable_node. This then limits the maximum
> cost of a rmap_walk on a PageKsm during page migration for compaction
> or swapping etc..
> 
> The KSMscale fix is needed regardless to avoid KSM to hang a very
> large server, because there is no guarantee the most equal page will
> be a zero page, it could be 0xff or anything.
> 
> However if one knows there's a disproportionate amount of memory as
> zero (i.e. certain guest OS do that, that to me would be the main
> motivation for the patch), he could prefer to use your sysctl instead
> of creating the rmap_item metadata on the zero pages.

Ok, sorry, I had completely misunderstood what you had meant the first
time. Now I got it.

>>> memcpy for every merge-candidate page to save some metadata, it's very
>>
>> I'm confused, why memcpy? did you mean memcmp? We are not doing any
>> additional memops except in the case when a candidate non-empty page
>> happens to have the same checksum as an empty page, in which case we
>> have an extra memcmp compared to the normal operation.
> 
> I meant memcmp sorry. So if this is further filtered by the
> precomputed zero page cksum, the only concern would be then how likely
> the cksum would provide a false positive, in which case there will be
> one more memcmp for every merge.
> 
>> Maybe an even less intrusive change could be to check in replace_page if
>> is_zero_pfn(page_to_pfn(kpage)). And of course I would #ifdef that too,
>> to avoid the overhead for archs without page coloring.
> 
> It's just one branch, the costly things are memcmp. As long as it's
> not memcmp I wouldn't worry about one branch in replace_page and in
> the caller. replace_page isn't even a fast path, the fast path is
> generally the code that scans the memory. The actual real merging is
> not as frequent. So I wouldn't use #ifdefs and I'd use the sysctl
> instead, potentially also enabled on all archs (or only on those with
> page coloring but possible to enable manually on all archs).

Ok!

>> So if the replacement page is a ZERO_PAGE() no get_page() and no
>> page_add_anon_rmap() would be performed, and the set_pte_at_notify()
>> would have pte_mkspecial(pfn_pte(page_to_pfn(kpage))) instead of mk_pte() .
> 
> That should fix your patch I think yes.
> 
> Singling out zero pages was discussed before, just without KSMscale
> it's an incomplete fix and just a band aid.
> 
> Actually even for your case it's incomplete and only covers a subset
> of apps, what if the app initializes all ram to 0xff instead of zero
> and keep reading from it? (it would make more sense to initialize the
> memory if it wasn't zero in fact) You'd get the same slowdown as you
> get now with the same zero page I think.

That's true. As I said above, my previous example was not very well
thought. The more realistic scenario is that of having the colored zero
pages of a guest merged.

> The real fix also for this, would have to have a stable tree for each
> page color possible, but that is not the same as a having a stable
> tree for each NUMA node (which is already implemented). There are
> likely too many colors (even if you're not fully associative) so you'd
> penalize the "compression" ratio if you were to implement a generic
> fix that doesn't single out the zero page.

Also in general it's not probable that the same non-zero data will be
read very often at different guest-physical addresses. A stable tree for
each color would pretty much defeat the purpose of KSM.

> I've never been particularly excited about optimizing bad apps that
> initialize a lot of RAM as zero when they could depend on the mmap
> behavior instead and get zero pages in the first place (or bad guest
> OS). Overall the main reason why I'm quite positive about adding this
> as an optimization is because after reading it (even if not complete)
> it seems non intrusive enough and some corner case may benefit, but if
> we do it, we can as well leave it available to all archs so it's
> easier to test and reproduce any problem too.

Ok, I'll fix and respin my patch then.

Thanks,

Claudio

WARNING: multiple messages have this Message-ID (diff)
From: Claudio Imbrenda <imbrenda@linux.vnet.ibm.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: linux-mm@kvack.org, borntraeger@de.ibm.com, hughd@google.com,
	izik.eidus@ravellosystems.com, chrisw@sous-sol.org,
	akpm@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v1 1/1] mm/ksm: improve deduplication of zero pages with colouring
Date: Wed, 18 Jan 2017 18:17:09 +0100	[thread overview]
Message-ID: <4ac20fb0-d9d2-e73f-2f17-1f69929756b7@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170118162914.GF10177@redhat.com>

On 18/01/17 17:29, Andrea Arcangeli wrote:
> It's still good to be able to exercise this code on all archs (if
> nothing else for debugging purposes). There's nothing arch dependent
> in it after all.

If it's fine for you, I'm definitely not going to complain :)

>> ZERO_PAGE() would save exactly one page and it would bring no speed
>> advantages (or rather: not using the ZERO_PAGE() would not bring any
>> speed penalty).
>> That's why I have #ifdef'd it to have it only when page coloring is
>> present. Also, for what I could see, only MIPS and s390 have page
>> coloring; I don't like the idea of imposing any overhead to all the
>> other archs.
> 
> With a sysctl disabled by default, the only overhead is 8 bytes and a
> branch?

Yes, I'm sometimes excessively paranoid.

>> I agree that this should be toggleable with a sysfs control, since it's
>> a change that can potentially negatively affect the performance in some
>> cases. I'm adding it in the next iteration.
> 
> Yes the sysctl can be useful for archs doing page coloring too, but I
> would add to all.
> 
>> Unless the userspace in the guests is creating lots of pages full of
>> zeroes :)
> 
> One question comes to mind though, why is the app doing memset(0), if
> the app limited itself to just reading the memory it'd use page
> colored zero pages that wouldn't risk to become PageKsm. That is true
> both for bare metal and guest with KSM in host. This looks a not
> optimal app.

I didn't really make a good example, although I can think of scenarios
where that could legitimately happen. Another case would be a KVM guest.
It will have a bunch of colored zero pages somewhere, but if KSM merges
those together, the advantages of colored zero pages disappear in the guest.

>> Honestly I don't think this patch will bring any benefits regarding
>> metadata -- one page more or less in the metadata won't change much. Our
>> issue is just the reading speed of the deduplicated empty pages.
> 
> The metadata amount changes, for each shared zero page we'd need to
> allocate a rmap_item.
> 
> The KSMscale introducing stable_node_chain/dup is precisely meant to
> deal with not creating a too large rmap_item chain. Because there can
> be plenty of those rmap_items, we've to create multiple zero pages and
> multiple stable nodes for those zero pages to limit the maximum number
> of rmap_items linked in a stable_node. This then limits the maximum
> cost of a rmap_walk on a PageKsm during page migration for compaction
> or swapping etc..
> 
> The KSMscale fix is needed regardless to avoid KSM to hang a very
> large server, because there is no guarantee the most equal page will
> be a zero page, it could be 0xff or anything.
> 
> However if one knows there's a disproportionate amount of memory as
> zero (i.e. certain guest OS do that, that to me would be the main
> motivation for the patch), he could prefer to use your sysctl instead
> of creating the rmap_item metadata on the zero pages.

Ok, sorry, I had completely misunderstood what you had meant the first
time. Now I got it.

>>> memcpy for every merge-candidate page to save some metadata, it's very
>>
>> I'm confused, why memcpy? did you mean memcmp? We are not doing any
>> additional memops except in the case when a candidate non-empty page
>> happens to have the same checksum as an empty page, in which case we
>> have an extra memcmp compared to the normal operation.
> 
> I meant memcmp sorry. So if this is further filtered by the
> precomputed zero page cksum, the only concern would be then how likely
> the cksum would provide a false positive, in which case there will be
> one more memcmp for every merge.
> 
>> Maybe an even less intrusive change could be to check in replace_page if
>> is_zero_pfn(page_to_pfn(kpage)). And of course I would #ifdef that too,
>> to avoid the overhead for archs without page coloring.
> 
> It's just one branch, the costly things are memcmp. As long as it's
> not memcmp I wouldn't worry about one branch in replace_page and in
> the caller. replace_page isn't even a fast path, the fast path is
> generally the code that scans the memory. The actual real merging is
> not as frequent. So I wouldn't use #ifdefs and I'd use the sysctl
> instead, potentially also enabled on all archs (or only on those with
> page coloring but possible to enable manually on all archs).

Ok!

>> So if the replacement page is a ZERO_PAGE() no get_page() and no
>> page_add_anon_rmap() would be performed, and the set_pte_at_notify()
>> would have pte_mkspecial(pfn_pte(page_to_pfn(kpage))) instead of mk_pte() .
> 
> That should fix your patch I think yes.
> 
> Singling out zero pages was discussed before, just without KSMscale
> it's an incomplete fix and just a band aid.
> 
> Actually even for your case it's incomplete and only covers a subset
> of apps, what if the app initializes all ram to 0xff instead of zero
> and keep reading from it? (it would make more sense to initialize the
> memory if it wasn't zero in fact) You'd get the same slowdown as you
> get now with the same zero page I think.

That's true. As I said above, my previous example was not very well
thought. The more realistic scenario is that of having the colored zero
pages of a guest merged.

> The real fix also for this, would have to have a stable tree for each
> page color possible, but that is not the same as a having a stable
> tree for each NUMA node (which is already implemented). There are
> likely too many colors (even if you're not fully associative) so you'd
> penalize the "compression" ratio if you were to implement a generic
> fix that doesn't single out the zero page.

Also in general it's not probable that the same non-zero data will be
read very often at different guest-physical addresses. A stable tree for
each color would pretty much defeat the purpose of KSM.

> I've never been particularly excited about optimizing bad apps that
> initialize a lot of RAM as zero when they could depend on the mmap
> behavior instead and get zero pages in the first place (or bad guest
> OS). Overall the main reason why I'm quite positive about adding this
> as an optimization is because after reading it (even if not complete)
> it seems non intrusive enough and some corner case may benefit, but if
> we do it, we can as well leave it available to all archs so it's
> easier to test and reproduce any problem too.

Ok, I'll fix and respin my patch then.

Thanks,

Claudio

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-01-18 17:17 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-12 16:17 [PATCH v1 1/1] mm/ksm: improve deduplication of zero pages with colouring Claudio Imbrenda
2017-01-12 16:17 ` Claudio Imbrenda
2017-01-12 16:21 ` Christian Borntraeger
2017-01-12 16:21   ` Christian Borntraeger
2017-01-12 17:21 ` Andrea Arcangeli
2017-01-12 17:21   ` Andrea Arcangeli
2017-01-18 15:15   ` Claudio Imbrenda
2017-01-18 15:15     ` Claudio Imbrenda
2017-01-18 16:29     ` Andrea Arcangeli
2017-01-18 16:29       ` Andrea Arcangeli
2017-01-18 17:17       ` Claudio Imbrenda [this message]
2017-01-18 17:17         ` Claudio Imbrenda
2017-01-18 18:11         ` Andrea Arcangeli
2017-01-18 18:11           ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4ac20fb0-d9d2-e73f-2f17-1f69929756b7@linux.vnet.ibm.com \
    --to=imbrenda@linux.vnet.ibm.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=borntraeger@de.ibm.com \
    --cc=chrisw@sous-sol.org \
    --cc=hughd@google.com \
    --cc=izik.eidus@ravellosystems.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.