All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
@ 2017-08-01 12:07 leesioh
  2017-08-01 13:29 ` Claudio Imbrenda
  2017-08-01 20:05 ` Andrea Arcangeli
  0 siblings, 2 replies; 11+ messages in thread
From: leesioh @ 2017-08-01 12:07 UTC (permalink / raw)
  To: akpm, aarcange, mingo, zhongjiang, minchan, arvind.yadav.cs,
	imbrenda, kirill.shutemov
  Cc: linux-mm, leesioh

In ksm, the checksum values are used to check changes in page content and keep the unstable tree more stable.
KSM implements checksum calculation with jhash2 hash function.
However, because jhash2 is implemented in software,
it consumes high CPU cycles (about 26%, according to KSM thread profiling results)

To reduce CPU consumption, this commit applies the crc32 hash function
which is included in the SSE4.2 CPU instruction set.
This can significantly reduce the page checksum overhead as follows.

I measured checksum computation 300 times to see how fast crc32 is compared to jhash2.
With jhash2, the average checksum calculation time is about 3460ns,
and with crc32, the average checksum calculation time is 888ns. This is about 74% less than jhash2.

Signed-off-by: leesioh <solee@os.korea.ac.kr>
---
 mm/ksm.c | 36 ++++++++++++++++++++++++++++++++----
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 0c927e3..390a3cb 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -42,7 +42,8 @@
 
 #include <asm/tlbflush.h>
 #include "internal.h"
-
+#include <linux/crypto.h>
+#include <crypto/hash.h>
 #ifdef CONFIG_NUMA
 #define NUMA(x)		(x)
 #define DO_NUMA(x)	do { (x); } while (0)
@@ -260,6 +261,9 @@ static unsigned int zero_checksum __read_mostly;
 /* Whether to merge empty (zeroed) pages with actual zero pages */
 static bool ksm_use_zero_pages __read_mostly;
 
+/* Whether to support crc32 hash function */
+static bool crc32_support;
+
 #ifdef CONFIG_NUMA
 /* Zeroed when merging across nodes is not allowed */
 static unsigned int ksm_merge_across_nodes = 1;
@@ -279,11 +283,27 @@ static void wait_while_offlining(void);
 static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
 static DEFINE_MUTEX(ksm_thread_mutex);
 static DEFINE_SPINLOCK(ksm_mmlist_lock);
-
+static struct shash_desc desc;
+static struct crypto_shash *tfm;
 #define KSM_KMEM_CACHE(__struct, __flags) kmem_cache_create("ksm_"#__struct,\
 		sizeof(struct __struct), __alignof__(struct __struct),\
 		(__flags), NULL)
 
+static void __init ksm_crc32_init(void)
+{
+	tfm = crypto_alloc_shash("crc32", 0, CRYPTO_ALG_ASYNC);
+
+	if (IS_ERR(tfm) || tfm->base.__crt_alg->cra_priority < 200) {
+		pr_warn("not support crc32 instruction, use jhash2 \n");
+		crc32_support = false;
+		crypto_free_shash(tfm);
+		return;
+	}
+	desc.tfm = tfm;
+	desc.flags = 0;
+	crc32_support = true;
+}
+
 static int __init ksm_slab_init(void)
 {
 	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
@@ -986,7 +1006,14 @@ static u32 calc_checksum(struct page *page)
 {
 	u32 checksum;
 	void *addr = kmap_atomic(page);
-	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
+	/*
+	* If crc32 is supported, the use crc32 to calculate the checksum
+	* otherwise use jhash2
+	*/
+	if (crc32_support)
+		crypto_shash_digest(&desc, addr, PAGE_SIZE, (u8 *)&checksum);
+	else
+		checksum = jhash2(addr, PAGE_SIZE / 4, 17);
 	kunmap_atomic(addr);
 	return checksum;
 }
@@ -3057,7 +3084,8 @@ static int __init ksm_init(void)
 	zero_checksum = calc_checksum(ZERO_PAGE(0));
 	/* Default to false for backwards compatibility */
 	ksm_use_zero_pages = false;
-
+
+	ksm_crc32_init();
 	err = ksm_slab_init();
 	if (err)
 		goto out;
-- 
2.7.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-01 12:07 [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32) leesioh
@ 2017-08-01 13:29 ` Claudio Imbrenda
  2017-08-01 20:05 ` Andrea Arcangeli
  1 sibling, 0 replies; 11+ messages in thread
From: Claudio Imbrenda @ 2017-08-01 13:29 UTC (permalink / raw)
  To: leesioh
  Cc: akpm, aarcange, mingo, zhongjiang, minchan, arvind.yadav.cs,
	kirill.shutemov, linux-mm

On Tue,  1 Aug 2017 21:07:35 +0900
leesioh <solee@os.korea.ac.kr> wrote:

> In ksm, the checksum values are used to check changes in page content
> and keep the unstable tree more stable. KSM implements checksum
> calculation with jhash2 hash function. However, because jhash2 is
> implemented in software, it consumes high CPU cycles (about 26%,
> according to KSM thread profiling results)
> 
> To reduce CPU consumption, this commit applies the crc32 hash function
> which is included in the SSE4.2 CPU instruction set.
> This can significantly reduce the page checksum overhead as follows.

This is really a nice idea, but you are assuming that everything is x86.

> I measured checksum computation 300 times to see how fast crc32 is
> compared to jhash2. With jhash2, the average checksum calculation
> time is about 3460ns, and with crc32, the average checksum
> calculation time is 888ns. This is about 74% less than jhash2.

This could be different on other architectures.


I really like the idea, though. Can you maybe extract the jhash part
and make it generic, and then add an arch-specific override for x86?

that way other architectures will be able to plug in their preferred
checksum function.


best regards


Claudio Imbrenda


> Signed-off-by: leesioh <solee@os.korea.ac.kr>
> ---
>  mm/ksm.c | 36 ++++++++++++++++++++++++++++++++----
>  1 file changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 0c927e3..390a3cb 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -42,7 +42,8 @@
> 
>  #include <asm/tlbflush.h>
>  #include "internal.h"
> -
> +#include <linux/crypto.h>
> +#include <crypto/hash.h>
>  #ifdef CONFIG_NUMA
>  #define NUMA(x)		(x)
>  #define DO_NUMA(x)	do { (x); } while (0)
> @@ -260,6 +261,9 @@ static unsigned int zero_checksum __read_mostly;
>  /* Whether to merge empty (zeroed) pages with actual zero pages */
>  static bool ksm_use_zero_pages __read_mostly;
> 
> +/* Whether to support crc32 hash function */
> +static bool crc32_support;
> +
>  #ifdef CONFIG_NUMA
>  /* Zeroed when merging across nodes is not allowed */
>  static unsigned int ksm_merge_across_nodes = 1;
> @@ -279,11 +283,27 @@ static void wait_while_offlining(void);
>  static DECLARE_WAIT_QUEUE_HEAD(ksm_thread_wait);
>  static DEFINE_MUTEX(ksm_thread_mutex);
>  static DEFINE_SPINLOCK(ksm_mmlist_lock);
> -
> +static struct shash_desc desc;
> +static struct crypto_shash *tfm;
>  #define KSM_KMEM_CACHE(__struct, __flags)
> kmem_cache_create("ksm_"#__struct,\ sizeof(struct __struct),
> __alignof__(struct __struct),\ (__flags), NULL)
> 
> +static void __init ksm_crc32_init(void)
> +{
> +	tfm = crypto_alloc_shash("crc32", 0, CRYPTO_ALG_ASYNC);
> +
> +	if (IS_ERR(tfm) || tfm->base.__crt_alg->cra_priority < 200) {
> +		pr_warn("not support crc32 instruction, use jhash2
> \n");
> +		crc32_support = false;
> +		crypto_free_shash(tfm);
> +		return;
> +	}
> +	desc.tfm = tfm;
> +	desc.flags = 0;
> +	crc32_support = true;
> +}
> +
>  static int __init ksm_slab_init(void)
>  {
>  	rmap_item_cache = KSM_KMEM_CACHE(rmap_item, 0);
> @@ -986,7 +1006,14 @@ static u32 calc_checksum(struct page *page)
>  {
>  	u32 checksum;
>  	void *addr = kmap_atomic(page);
> -	checksum = jhash2(addr, PAGE_SIZE / 4, 17);
> +	/*
> +	* If crc32 is supported, the use crc32 to calculate the
> checksum
> +	* otherwise use jhash2
> +	*/
> +	if (crc32_support)
> +		crypto_shash_digest(&desc, addr, PAGE_SIZE, (u8
> *)&checksum);
> +	else
> +		checksum = jhash2(addr, PAGE_SIZE / 4, 17);
>  	kunmap_atomic(addr);
>  	return checksum;
>  }
> @@ -3057,7 +3084,8 @@ static int __init ksm_init(void)
>  	zero_checksum = calc_checksum(ZERO_PAGE(0));
>  	/* Default to false for backwards compatibility */
>  	ksm_use_zero_pages = false;
> -
> +
> +	ksm_crc32_init();
>  	err = ksm_slab_init();
>  	if (err)
>  		goto out;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-01 12:07 [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32) leesioh
  2017-08-01 13:29 ` Claudio Imbrenda
@ 2017-08-01 20:05 ` Andrea Arcangeli
  2017-08-02 12:26   ` Claudio Imbrenda
  2017-08-03  5:26   ` sioh Lee
  1 sibling, 2 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2017-08-01 20:05 UTC (permalink / raw)
  To: leesioh
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm

On Tue, Aug 01, 2017 at 09:07:35PM +0900, leesioh wrote:
> In ksm, the checksum values are used to check changes in page content and keep the unstable tree more stable.
> KSM implements checksum calculation with jhash2 hash function.
> However, because jhash2 is implemented in software,
> it consumes high CPU cycles (about 26%, according to KSM thread profiling results)
> 
> To reduce CPU consumption, this commit applies the crc32 hash function
> which is included in the SSE4.2 CPU instruction set.
> This can significantly reduce the page checksum overhead as follows.
> 
> I measured checksum computation 300 times to see how fast crc32 is compared to jhash2.
> With jhash2, the average checksum calculation time is about 3460ns,
> and with crc32, the average checksum calculation time is 888ns. This is about 74% less than jhash2.

crc32 may create more false positives than jhash2. crc32 only
guarantees a different value in return if fewer than N bit
changes. False positives in crc32 comparison, would result in more
unstable pages being added to the unstable tree, and if they're
changing as result of false positives it may make the unstable tree
more unstable leading to missed merges (in addition to the overhead of
adding those to the unstable tree in the first place and in addition
of risking an immediate cow post merge which would slowdown apps even
more).

I think if somebody wants a crc instead of a more proper hash (that is
less likely to generate false positives if a couple of bits changes)
it should be an option in sysfs not enabled by default, but overall I
think it's not worth this change for a downgrade to crc. There's the
risk an admin thinks it's going to make things runs faster because KSM
CPU utilization decreases, but missing the risk of increased CoWs in
app context or missed merges because of higher instability in the
unstable tree.

Still deploying hardware accelleration in the KSM hash is a
interesting idea that I don't recall has been tried. Could you try to
benchmark in userland (or kernel if you wish) software jhash2 vs
CONFIG_CRYPTO_SHA1_SSSE3 or CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL instead
of the accellerated crc?  (I don't know if GHASH API can fit our use
case though, but accellerated SHA1 sure would fit).  I suppose they'll
be slower than crc32, and probably slower than jhash2 too, however I
can't be sure by just thinking about it.

We've to also keep the floating point save and restore into account in
the real world, where ksm schedules often and may run interleaved in
the same CPU where an app uses the fpu a lot in userland (if the
interleaved app doesn't use the fpu in userland it won't create
overhead).

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-01 20:05 ` Andrea Arcangeli
@ 2017-08-02 12:26   ` Claudio Imbrenda
  2017-08-03  5:26   ` sioh Lee
  1 sibling, 0 replies; 11+ messages in thread
From: Claudio Imbrenda @ 2017-08-02 12:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: leesioh, akpm, mingo, zhongjiang, minchan, arvind.yadav.cs,
	kirill.shutemov, linux-mm

On Tue, 1 Aug 2017 22:05:50 +0200
Andrea Arcangeli <aarcange@redhat.com> wrote:

> On Tue, Aug 01, 2017 at 09:07:35PM +0900, leesioh wrote:
> > In ksm, the checksum values are used to check changes in page
> > content and keep the unstable tree more stable. KSM implements
> > checksum calculation with jhash2 hash function. However, because
> > jhash2 is implemented in software, it consumes high CPU cycles
> > (about 26%, according to KSM thread profiling results)
> > 
> > To reduce CPU consumption, this commit applies the crc32 hash
> > function which is included in the SSE4.2 CPU instruction set.
> > This can significantly reduce the page checksum overhead as follows.
> > 
> > I measured checksum computation 300 times to see how fast crc32 is
> > compared to jhash2. With jhash2, the average checksum calculation
> > time is about 3460ns, and with crc32, the average checksum
> > calculation time is 888ns. This is about 74% less than jhash2.  
> 
> crc32 may create more false positives than jhash2. crc32 only
> guarantees a different value in return if fewer than N bit
> changes. False positives in crc32 comparison, would result in more
> unstable pages being added to the unstable tree, and if they're
> changing as result of false positives it may make the unstable tree
> more unstable leading to missed merges (in addition to the overhead of
> adding those to the unstable tree in the first place and in addition
> of risking an immediate cow post merge which would slowdown apps even
> more).
> 
> I think if somebody wants a crc instead of a more proper hash (that is
> less likely to generate false positives if a couple of bits changes)
> it should be an option in sysfs not enabled by default, but overall I
> think it's not worth this change for a downgrade to crc. There's the
> risk an admin thinks it's going to make things runs faster because KSM
> CPU utilization decreases, but missing the risk of increased CoWs in
> app context or missed merges because of higher instability in the
> unstable tree.

that's true, but it's possible that all the extra work due to
additional collisions could still be less than the time saved with a
faster checksum. Also, even within the same architecture, different
checksums can have different performances depending on CPU vendor and
model. I would still let the admin (or ksmtuned) choose.

> Still deploying hardware accelleration in the KSM hash is a
> interesting idea that I don't recall has been tried. Could you try to
> benchmark in userland (or kernel if you wish) software jhash2 vs
> CONFIG_CRYPTO_SHA1_SSSE3 or CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL instead
> of the accellerated crc?  (I don't know if GHASH API can fit our use
> case though, but accellerated SHA1 sure would fit).  I suppose they'll
> be slower than crc32, and probably slower than jhash2 too, however I
> can't be sure by just thinking about it.
>
> We've to also keep the floating point save and restore into account in
> the real world, where ksm schedules often and may run interleaved in
> the same CPU where an app uses the fpu a lot in userland (if the
> interleaved app doesn't use the fpu in userland it won't create
> overhead).

that is also true, although some CPUs have basic (and not-so-basic)
crypto functions baked in, so no need to save or restore FPU registers
in those cases. 


best regards

Claudio

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-01 20:05 ` Andrea Arcangeli
  2017-08-02 12:26   ` Claudio Imbrenda
@ 2017-08-03  5:26   ` sioh Lee
  2017-08-03 13:23     ` Andrea Arcangeli
  1 sibling, 1 reply; 11+ messages in thread
From: sioh Lee @ 2017-08-03  5:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm

Thank you very much for reading and responding to my commit.
I understand the problem with crc32 you describe.
I will investigate a?? as the first step, I will try to compare the number of CoWs with jhash2 and crc32. And I will send you the experiment results.
Thanks again!

-leesioh-


2017-08-02 i??i ? 5:05i?? Andrea Arcangeli i?'(e??) i?' e,?:
> On Tue, Aug 01, 2017 at 09:07:35PM +0900, leesioh wrote:
>> In ksm, the checksum values are used to check changes in page content and keep the unstable tree more stable.
>> KSM implements checksum calculation with jhash2 hash function.
>> However, because jhash2 is implemented in software,
>> it consumes high CPU cycles (about 26%, according to KSM thread profiling results)
>>
>> To reduce CPU consumption, this commit applies the crc32 hash function
>> which is included in the SSE4.2 CPU instruction set.
>> This can significantly reduce the page checksum overhead as follows.
>>
>> I measured checksum computation 300 times to see how fast crc32 is compared to jhash2.
>> With jhash2, the average checksum calculation time is about 3460ns,
>> and with crc32, the average checksum calculation time is 888ns. This is about 74% less than jhash2.
> crc32 may create more false positives than jhash2. crc32 only
> guarantees a different value in return if fewer than N bit
> changes. False positives in crc32 comparison, would result in more
> unstable pages being added to the unstable tree, and if they're
> changing as result of false positives it may make the unstable tree
> more unstable leading to missed merges (in addition to the overhead of
> adding those to the unstable tree in the first place and in addition
> of risking an immediate cow post merge which would slowdown apps even
> more).
>
> I think if somebody wants a crc instead of a more proper hash (that is
> less likely to generate false positives if a couple of bits changes)
> it should be an option in sysfs not enabled by default, but overall I
> think it's not worth this change for a downgrade to crc. There's the
> risk an admin thinks it's going to make things runs faster because KSM
> CPU utilization decreases, but missing the risk of increased CoWs in
> app context or missed merges because of higher instability in the
> unstable tree.
>
> Still deploying hardware accelleration in the KSM hash is a
> interesting idea that I don't recall has been tried. Could you try to
> benchmark in userland (or kernel if you wish) software jhash2 vs
> CONFIG_CRYPTO_SHA1_SSSE3 or CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL instead
> of the accellerated crc?  (I don't know if GHASH API can fit our use
> case though, but accellerated SHA1 sure would fit).  I suppose they'll
> be slower than crc32, and probably slower than jhash2 too, however I
> can't be sure by just thinking about it.
>
> We've to also keep the floating point save and restore into account in
> the real world, where ksm schedules often and may run interleaved in
> the same CPU where an app uses the fpu a lot in userland (if the
> interleaved app doesn't use the fpu in userland it won't create
> overhead).
>
> Thanks!
> Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-03  5:26   ` sioh Lee
@ 2017-08-03 13:23     ` Andrea Arcangeli
  2017-08-09 13:17       ` sioh Lee
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2017-08-03 13:23 UTC (permalink / raw)
  To: sioh Lee
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm

On Thu, Aug 03, 2017 at 02:26:27PM +0900, sioh Lee wrote:
> Thank you very much for reading and responding to my commit.
> I understand the problem with crc32 you describe.
> I will investigate a?? as the first step, I will try to compare the number of CoWs with jhash2 and crc32. And I will send you the experiment results.

Also the number of KSM merges and ideally in a non simple workload. If
the hash triggers false positives it's not just that there will be
more CoWs, but the unstable tree will get more unstable and its
ability to find equality will decrease. This is why I don't like to
weaken the hash with a crc and I'd rather prefer to keep a real hash
there (doesn't need to be a crypto one, but it'd be even better if it
was).

The hash isn't used to find equality, it's only used to find which
pages are updated frequently (and if an app overwrites the same value
over and over, not even a crypto hash would be capable to detect it).

There were attempts to replace the hashing with a dirty bit set in
hardware in the pagetable in fact, that would be the ideal way, but
it's quite more complicated that way.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-03 13:23     ` Andrea Arcangeli
@ 2017-08-09 13:17       ` sioh Lee
  2017-08-24 19:14         ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: sioh Lee @ 2017-08-09 13:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm, hxy, oslab

Hello.
I am sending you the results of the experiments.
The experiment was done for two workloads.
The first is Kernel build (CPU Intensive) and the second is the iozone benchmark (I/O Intensive).
In the experiment, four VMs compile kernel at the same time.
I also experimented with iozone in the same way.


The values measured in the experiment are:
1. CoW count, 2. Checksum computation time, 3. pages_unshared, 4. pages_sharing, 5. (pages_unshared / pages_sharing).
The experiment was conducted twice for each workload and the average value was calculated.
Checksum computation time, pages_unshared, and pages_sharing are recorded every 1 second,
and the average of the recorded values is obtained after the end of the experiment.
The CoW was also recorded whenever CoW occurs on a shared page.

Experiment environment

test platform : openstack cloud platform (NEWTON version)
Experiment node : openstack based cloud compute node (CPU: Xeon E5-2650 v3 2.3Ghz 10core, memory : 64Gb)
VM : (2 VCPU, RAM 4GB, DISK 20GB) * 4
workload : Kernel Compile (kernel 4.47), iozone (read, write, random read and write for 2GB)
KSM setup - sleep_millisecs : 200ms, pages_to_scan : 1600

The experimental results are as follows. (All values are truncated to the second decimal place)

kernel build

Crc32

CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
  44036.5           903.58                  951660.82          265401.54             0.27

Jhash2
CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
  46114             4203.33                  949578.19          266564.98            0.28

Increase/Decrease percentage compared to jhash2 (I: Increase, D: Decrease)
CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
  4.5% D            78.5% D                 0.2% I             0.4% D             0.64% D

For the kernel build workload, the number of CoWs compared to jhash2 decreased by 4.5%, pages_sharing increased by 0.2%,
pages_unshared decreased by 0.4%, checksums computation decreased by 78.5%, and (pages_unshared / pages_sharing) decreased by 0.64%.

iozone

Crc32
CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
 4288702.5           1139.31             1441299.78         117746.22                0.14

Jhash2
CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
 4229174              4980.21            1446143.41         116153.12               0.13

Increase/Decrease percentage compared to jhash2 (I: Increase, D: Decrease)
CoW count    Checksum time (ns)    pages_sharing    pages_unshared    unshared/sharing
  1.4% I             77.1% D               0.33% D            1.37% I              1.89% I

For the iozone workload, the number of CoWs compared to jhash2 increased by 1.4%, pages_sharing decreased by 0.33%,
pages_unshared increased by 1.37%, checksums computation decreased by 77.1%, and (pages_unshared / pages_sharing) increased by 1.89%.


In summary, the experiment shows that crc32 has definite advantages over jhash2 for CPU intensive task.
For I/O intensive task, CoW increases only by 1.4% while the checksum computation time is significantly reduced by 77%.


 


2017-08-03 i??i?? 10:23i?? Andrea Arcangeli i?'(e??) i?' e,?:
> On Thu, Aug 03, 2017 at 02:26:27PM +0900, sioh Lee wrote:
>> Thank you very much for reading and responding to my commit.
>> I understand the problem with crc32 you describe.
>> I will investigate a?? as the first step, I will try to compare the number of CoWs with jhash2 and crc32. And I will send you the experiment results.
> Also the number of KSM merges and ideally in a non simple workload. If
> the hash triggers false positives it's not just that there will be
> more CoWs, but the unstable tree will get more unstable and its
> ability to find equality will decrease. This is why I don't like to
> weaken the hash with a crc and I'd rather prefer to keep a real hash
> there (doesn't need to be a crypto one, but it'd be even better if it
> was).
>
> The hash isn't used to find equality, it's only used to find which
> pages are updated frequently (and if an app overwrites the same value
> over and over, not even a crypto hash would be capable to detect it).
>
> There were attempts to replace the hashing with a dirty bit set in
> hardware in the pagetable in fact, that would be the ideal way, but
> it's quite more complicated that way.
>
> Thanks,
> Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-09 13:17       ` sioh Lee
@ 2017-08-24 19:14         ` Andrea Arcangeli
  2017-08-29  6:35           ` sioh Lee
  0 siblings, 1 reply; 11+ messages in thread
From: Andrea Arcangeli @ 2017-08-24 19:14 UTC (permalink / raw)
  To: sioh Lee
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm, hxy, oslab

On Wed, Aug 09, 2017 at 10:17:31PM +0900, sioh Lee wrote:
> Hello.
> I am sending you the results of the experiments.
> The experiment was done for two workloads.
> The first is Kernel build (CPU Intensive) and the second is the iozone benchmark (I/O Intensive).
> In the experiment, four VMs compile kernel at the same time.
> I also experimented with iozone in the same way.
> 
> 
> The values measured in the experiment are:
> 1. CoW count, 2. Checksum computation time, 3. pages_unshared, 4. pages_sharing, 5. (pages_unshared / pages_sharing).
> The experiment was conducted twice for each workload and the average value was calculated.
> Checksum computation time, pages_unshared, and pages_sharing are recorded every 1 second,
> and the average of the recorded values is obtained after the end of the experiment.
> The CoW was also recorded whenever CoW occurs on a shared page.
> 
> Experiment environment
> 
> test platform : openstack cloud platform (NEWTON version)
> Experiment node : openstack based cloud compute node (CPU: Xeon E5-2650 v3 2.3Ghz 10core, memory : 64Gb)
> VM : (2 VCPU, RAM 4GB, DISK 20GB) * 4
> workload : Kernel Compile (kernel 4.47), iozone (read, write, random read and write for 2GB)
> KSM setup - sleep_millisecs : 200ms, pages_to_scan : 1600
> 
> The experimental results are as follows. (All values are truncated to the second decimal place)

Not sure if kernel build but especially iozone are the best tests for
this, a number crunching may have generate a larger set of pages,
these issues would be noticeable only with a very large unstable tree
(i.e. very large pages_unshared). You've got a very tiny
pages_unshared of only <=1GB in your tests. That should grow much
larger to make the measurement meaningful for this hash collision
concern.

However so far I checked some more information on the collisions of
crc32. With equal sized random data (i.e. our case, PAGE_SIZE in
length) it takes about ~78000 tries of new random data for crc32 to
trigger a collision with 51% probability, or ~200000 tries to get a
collision with 99% probability.

jhash2 instead starts seeing collisions with 2**52 random data tries?

Our objective here is to simulate a dirty bit, the problem with the
real pagetable dirty bit is it potentially requires an IPI to threads
using the "mm" to clear it reliably (i.e. to be sure it gets set on
next write), and not all archs have the dirty bit set in
hardware. Write protecting to simulate the dirty bit also wouldn't be
a solution, because our whole objective is to avoid wrprotecting the
first place while at the same time we try not to make the unstable
tree too unstable by adding pages that are statistically changing
frequently to it.

There were attempts in the past to use the dirty bit on x86 (plus
skipping the IPI probably wouldn't be a big issue for KSM because the
scan is linear and takes quite some time to complete if there's lots
of virtual memory to merge).

So after evaluating 200000 different candidate pages that are changing
frequently, to see if they're worth adding them to the unstable tree,
we'll sure get a false positive if compared to jhash2 (we'll add one
more page than we should have to). Considering the pages can start
changing at any time regardless after we add them to the unstable tree
and this check is only statistically significant, the difference with
jhash2 is probably not going to be noticeable here. We don't use the
hash to find equal pages of course, the unstable tree does that. I
earlier thought crc32 would require fewer tries before creating
collisions, much closer to a checksum. So now I'm optimistic this
change will be a good tradeoff.

If we switch to crc32 however I don't see why not to use it also if
the intel insn isn't present (why the < 200?). It's surely running
much faster also in the generic implementation (jhash2 isn't
accelerated either).

If we force CRC32C to be enabled if KSM is selected we could drop
jhash2 entirely. Then we'd need a way to ensure the accelerated
version gets loaded.

Now about implementation issues, crc32 insn is actually used by crc32c
not by crc32 (crc32c seems preferable polynomial anyway), the crc32
that you selected only uses PCLMULQDQ, not the crc32 insn of sse4.2.

I guess on Intel we could try to load crc32c first, if that showup at
priority < 200 (sse4.2 crc32 that I think you intended to use but you
got the PCLMULQDQ crc32 instead), then try again with crc32, if that
also showup at priority < 200, then take crc32c generic (without
fallback to jhash2 provided we've a build-time way to ensure the
generic implementation is linked into the kernel as suggested above).

Other archs using different algorithm sounds fine too, for example
assuming crc32be would be the fastest (not sure, just a guess) it
could be arch dependent the selection logic (the only non trivial
issue would be to avoid #ifdefs to make it arch dependent). As long as
the default fallbacks to generic that's fine.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-24 19:14         ` Andrea Arcangeli
@ 2017-08-29  6:35           ` sioh Lee
  2017-08-29 16:05             ` Andrea Arcangeli
  0 siblings, 1 reply; 11+ messages in thread
From: sioh Lee @ 2017-08-29  6:35 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm, hxy, oslab

Hello,
Thank you for the reply and for being supportive.
First of all, I made a mistake in that I typed crc32 incorrectly. All the experiments were done using crc32c-intel, not crc32 (PCLMULQDQ).

Second, the reason for (priority < 200) is because the priority of crc32c-intel is 200 so that if the priority is less than 200, jhash2 is used.
Also, I have a question about implementation. Do you want to exclude jhash2 from ksm and go only with crc32 ? Could you please give me guidance about it?

Then, I will implement it and send you a new patch.
Once again, thank you so much for your reply.
Best regards,

-Sioh Lee-


2017-08-25 ?AAu 4:14?! Andrea Arcangeli AI(?!)  3/4 ' +-U:
> On Wed, Aug 09, 2017 at 10:17:31PM +0900, sioh Lee wrote:
>> Hello.
>> I am sending you the results of the experiments.
>> The experiment was done for two workloads.
>> The first is Kernel build (CPU Intensive) and the second is the iozone benchmark (I/O Intensive).
>> In the experiment, four VMs compile kernel at the same time.
>> I also experimented with iozone in the same way.
>>
>>
>> The values measured in the experiment are:
>> 1. CoW count, 2. Checksum computation time, 3. pages_unshared, 4. pages_sharing, 5. (pages_unshared / pages_sharing).
>> The experiment was conducted twice for each workload and the average value was calculated.
>> Checksum computation time, pages_unshared, and pages_sharing are recorded every 1 second,
>> and the average of the recorded values is obtained after the end of the experiment.
>> The CoW was also recorded whenever CoW occurs on a shared page.
>>
>> Experiment environment
>>
>> test platform : openstack cloud platform (NEWTON version)
>> Experiment node : openstack based cloud compute node (CPU: Xeon E5-2650 v3 2.3Ghz 10core, memory : 64Gb)
>> VM : (2 VCPU, RAM 4GB, DISK 20GB) * 4
>> workload : Kernel Compile (kernel 4.47), iozone (read, write, random read and write for 2GB)
>> KSM setup - sleep_millisecs : 200ms, pages_to_scan : 1600
>>
>> The experimental results are as follows. (All values are truncated to the second decimal place)
> Not sure if kernel build but especially iozone are the best tests for
> this, a number crunching may have generate a larger set of pages,
> these issues would be noticeable only with a very large unstable tree
> (i.e. very large pages_unshared). You've got a very tiny
> pages_unshared of only <=1GB in your tests. That should grow much
> larger to make the measurement meaningful for this hash collision
> concern.
>
> However so far I checked some more information on the collisions of
> crc32. With equal sized random data (i.e. our case, PAGE_SIZE in
> length) it takes about ~78000 tries of new random data for crc32 to
> trigger a collision with 51% probability, or ~200000 tries to get a
> collision with 99% probability.
>
> jhash2 instead starts seeing collisions with 2**52 random data tries?
>
> Our objective here is to simulate a dirty bit, the problem with the
> real pagetable dirty bit is it potentially requires an IPI to threads
> using the "mm" to clear it reliably (i.e. to be sure it gets set on
> next write), and not all archs have the dirty bit set in
> hardware. Write protecting to simulate the dirty bit also wouldn't be
> a solution, because our whole objective is to avoid wrprotecting the
> first place while at the same time we try not to make the unstable
> tree too unstable by adding pages that are statistically changing
> frequently to it.
>
> There were attempts in the past to use the dirty bit on x86 (plus
> skipping the IPI probably wouldn't be a big issue for KSM because the
> scan is linear and takes quite some time to complete if there's lots
> of virtual memory to merge).
>
> So after evaluating 200000 different candidate pages that are changing
> frequently, to see if they're worth adding them to the unstable tree,
> we'll sure get a false positive if compared to jhash2 (we'll add one
> more page than we should have to). Considering the pages can start
> changing at any time regardless after we add them to the unstable tree
> and this check is only statistically significant, the difference with
> jhash2 is probably not going to be noticeable here. We don't use the
> hash to find equal pages of course, the unstable tree does that. I
> earlier thought crc32 would require fewer tries before creating
> collisions, much closer to a checksum. So now I'm optimistic this
> change will be a good tradeoff.
>
> If we switch to crc32 however I don't see why not to use it also if
> the intel insn isn't present (why the < 200?). It's surely running
> much faster also in the generic implementation (jhash2 isn't
> accelerated either).
>
> If we force CRC32C to be enabled if KSM is selected we could drop
> jhash2 entirely. Then we'd need a way to ensure the accelerated
> version gets loaded.
>
> Now about implementation issues, crc32 insn is actually used by crc32c
> not by crc32 (crc32c seems preferable polynomial anyway), the crc32
> that you selected only uses PCLMULQDQ, not the crc32 insn of sse4.2.
>
> I guess on Intel we could try to load crc32c first, if that showup at
> priority < 200 (sse4.2 crc32 that I think you intended to use but you
> got the PCLMULQDQ crc32 instead), then try again with crc32, if that
> also showup at priority < 200, then take crc32c generic (without
> fallback to jhash2 provided we've a build-time way to ensure the
> generic implementation is linked into the kernel as suggested above).
>
> Other archs using different algorithm sounds fine too, for example
> assuming crc32be would be the fastest (not sure, just a guess) it
> could be arch dependent the selection logic (the only non trivial
> issue would be to avoid #ifdefs to make it arch dependent). As long as
> the default fallbacks to generic that's fine.
>
> Thanks,
> Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
  2017-08-29  6:35           ` sioh Lee
@ 2017-08-29 16:05             ` Andrea Arcangeli
  0 siblings, 0 replies; 11+ messages in thread
From: Andrea Arcangeli @ 2017-08-29 16:05 UTC (permalink / raw)
  To: sioh Lee
  Cc: akpm, mingo, zhongjiang, minchan, arvind.yadav.cs, imbrenda,
	kirill.shutemov, linux-mm, hxy, oslab

Hello,

On Tue, Aug 29, 2017 at 03:35:34PM +0900, sioh Lee wrote:
> Hello,
> Thank you for the reply and for being supportive.
> First of all, I made a mistake in that I typed crc32 incorrectly. All the experiments were done using crc32c-intel, not crc32 (PCLMULQDQ).

So the fuzzy search in __crypto_alg_lookup gave you crc32c-intel
because you didn't enable crc32 PCLMULQDQ in the kernel config?

>From source it looks like an explicit load of crc32c-intel would work
too, instead of checking the priority. We can load in order
crc32c-intel, crc32-pclmul and fallback in "crc32c" which must be then
forced enabled in the kernel config.

> Second, the reason for (priority < 200) is because the priority of crc32c-intel is 200 so that if the priority is less than 200, jhash2 is used.
> Also, I have a question about implementation. Do you want to exclude jhash2 from ksm and go only with crc32 ? Could you please give me guidance about it?

Yes, the idea about excluding jhash2 from KSM is that if one almost
certain crc32c hash collision once every 200k modifications to the
page truly isn't a concern with sse4.2, it's still not a concern if
crc32 is implemented in C, but still faster than jhash2.

I don't like the behavior of the hash to change depending on hw or
kernel config, as that decreases the testing and it would generate
different behavior depending on kernel config or arch. I don't like
non reproducible bugs or to fragment testing across the userbase
depending on kernel config or arch. If crc32 creates problems with
hash collisions, it's better everyone is testing it so we find out
sooner than later.

The arch code can choose which of the crc32* variants to use, but it
shouldn't change the hash to something completely different. It's more
robust if the default hash algorithm is the same (only with minor
variations allowed like crc32c vs crc32 vs crc32be).

> Then, I will implement it and send you a new patch.
> Once again, thank you so much for your reply.

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32)
@ 2017-10-11 15:49 Timofey Titovets
  0 siblings, 0 replies; 11+ messages in thread
From: Timofey Titovets @ 2017-10-11 15:49 UTC (permalink / raw)
  To: linux-mm

Hi, my 2 cents,

(I miss some CC because i don't have a copy of mail
i found conversation on: http://www.spinics.net/lists/linux-mm/msg132431.html)

(Fix me if i'm wrong, may be i miss something)

So, I have a Skylake with HW SHA1/256
And i use openssl 1.1.0.f for testing
Hash 1GiB file for throughput testing.
  No HW sha1 (Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz)
    - sha1 - ~300 MiB/s
    - sha256 - ~128 MiB/s
  Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
    - sha1 - ~900 MiB/s
    - sha256 - ~350 MiB/s

CRC32C for example below show about 13650.720367 MiB/s

I'm also afraid about possible collisions, but AFAIK:
http://cyan4973.github.io/xxHash/ (I copy part of table from that page)
Name      Speed    Quality Author
xxHash    5.4 GB/s 10 Y.C.
Lookup3   1.2 GB/s  9 Bob Jenkins
CRC32    0.43 GB/s 9  (that a SW implementationt, let's ignore speed)

So (in theory of course) jhash2 and crc32 have a same problems with collisions.

Info from my patch set (replace jhash2 with xxhash)

  x86_64 host:
    CPU: Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz
    PAGE_SIZE: 4096, loop count: 1048576
    jhash2:   0xacbc7a5b            time: 1907 ms,  th:  2251.9 MiB/s
    xxhash32: 0x570da981            time: 739 ms,   th:  5809.4 MiB/s
    xxhash64: 0xa1fa032ab85bbb62    time: 371 ms,   th: 11556.6 MiB/s

    CPU: Intel(R) Xeon(R) CPU E5-2420 0 @ 1.90GHz
    PAGE_SIZE: 4096, loop count: 1048576
    jhash2:   0xe680b382            time: 3722 ms,  th: 1153.896680 MiB/s
    xxhash32: 0x56d00be4            time: 1183 ms,  th: 3629.130689 MiB/s
    xxhash64: 0x8c194cff29cc4dee    time: 725 ms,   th: 5918.003401 MiB/s

So i really not believe in sha1 for KSM, it's just to slow

Thanks.

-- 
Have a nice day,
Timofey.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-10-11 15:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-01 12:07 [PATCH] mm/ksm : Checksum calculation function change (jhash2 -> crc32) leesioh
2017-08-01 13:29 ` Claudio Imbrenda
2017-08-01 20:05 ` Andrea Arcangeli
2017-08-02 12:26   ` Claudio Imbrenda
2017-08-03  5:26   ` sioh Lee
2017-08-03 13:23     ` Andrea Arcangeli
2017-08-09 13:17       ` sioh Lee
2017-08-24 19:14         ` Andrea Arcangeli
2017-08-29  6:35           ` sioh Lee
2017-08-29 16:05             ` Andrea Arcangeli
2017-10-11 15:49 Timofey Titovets

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.