All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] A multithread lockless deduplication engine
@ 2017-09-20  3:23 XaviLi
  2017-09-26 11:47   ` Michal Hocko
  0 siblings, 1 reply; 12+ messages in thread
From: XaviLi @ 2017-09-20  3:23 UTC (permalink / raw)
  To: linux-kernel

PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
https://github.com/baibantech/dynamic_vm.git 

patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5

One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:

Test environment
DELL R730
Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
256GB RAM
Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
Qemu: 2.9.0
Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76

We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
Starting used memory: 216.8G
Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.

PageONE:
Configuration: merge_period(secs) = 20, work threads = 12
Starting used memory: 207.3G
(Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.

We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
Starting used memory: 216.8G
Result: 19:49 minutes until used memory was reduced to 78.7GB.

PageONE:
Configuration: merge period(secs) = 20, work threads = 12
Starting used memory: 210.3G
Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.

PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-20  3:23 [RFC] A multithread lockless deduplication engine XaviLi
@ 2017-09-26 11:47   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-09-26 11:47 UTC (permalink / raw)
  To: XaviLi; +Cc: linux-kernel, Andrea Arcangeli, Hugh Dickins, linux-mm

[Let's add some more people and linux-mm to the CC list]

On Wed 20-09-17 11:23:50, XaviLi wrote:
> PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> https://github.com/baibantech/dynamic_vm.git 
> 
> patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
> 
> One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
> 
> Test environment
> DELL R730
> Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
> 256GB RAM
> Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> Qemu: 2.9.0
> Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
> 
> We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
> 
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 207.3G
> (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
> 
> We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: 19:49 minutes until used memory was reduced to 78.7GB.
> 
> PageONE:
> Configuration: merge period(secs) = 20, work threads = 12
> Starting used memory: 210.3G
> Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
> 
> PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
@ 2017-09-26 11:47   ` Michal Hocko
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Hocko @ 2017-09-26 11:47 UTC (permalink / raw)
  To: XaviLi; +Cc: linux-kernel, Andrea Arcangeli, Hugh Dickins, linux-mm

[Let's add some more people and linux-mm to the CC list]

On Wed 20-09-17 11:23:50, XaviLi wrote:
> PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> https://github.com/baibantech/dynamic_vm.git 
> 
> patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
> 
> One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
> 
> Test environment
> DELL R730
> Intel(R) Xeon(R) E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
> 256GB RAM
> Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> Qemu: 2.9.0
> Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
> 
> We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
> 
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 207.3G
> (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
> 
> We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: 19:49 minutes until used memory was reduced to 78.7GB.
> 
> PageONE:
> Configuration: merge period(secs) = 20, work threads = 12
> Starting used memory: 210.3G
> Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
> 
> PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-26 11:47   ` Michal Hocko
@ 2017-09-28 10:09     ` Andrea Arcangeli
  -1 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2017-09-28 10:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: XaviLi, linux-kernel, Hugh Dickins, linux-mm

Hello XaviLi,

On Tue, Sep 26, 2017 at 01:47:47PM +0200, Michal Hocko wrote:
> [Let's add some more people and linux-mm to the CC list]
> 
> On Wed 20-09-17 11:23:50, XaviLi wrote:
> > PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> > https://github.com/baibantech/dynamic_vm.git 
> > 
> > patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
> > 
> > One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
> > 
> > Test environment
> > DELL R730
> > Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
> > 256GB RAM
> > Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> > Qemu: 2.9.0
> > Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
> > 
> > We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
> > 
> > KSM:
> > Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> > Starting used memory: 216.8G
> > Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
> > 
> > PageONE:
> > Configuration: merge_period(secs) = 20, work threads = 12
> > Starting used memory: 207.3G
> > (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> > Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
> > 
> > We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
> > 
> > KSM:
> > Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> > Starting used memory: 216.8G
> > Result: 19:49 minutes until used memory was reduced to 78.7GB.
> > 
> > PageONE:
> > Configuration: merge period(secs) = 20, work threads = 12
> > Starting used memory: 210.3G
> > Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
> > 
> > PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

Could you repeat the whole benchmark while giving only 1 CPU to PageONE
and after applying the following crc32c-intel patch to KSM?

https://www.spinics.net/lists/linux-mm/msg132394.html

You may consider also echo 1 > /sys/kernel/mm/ksm/use_zero_pages if
you single out zero pages in pone (but it doesn't look like you have
such feature in pone).

The second test is exercising the worst case possible of KSM so I
don't see how it's worth worrying about. Likely pone would also have a
worst case to exercise (it uses hash_64 so it very likely also has a
worst case to exercise). For KSM there are already plans to alter the
memcmp so it's more scattered randomly.

Making KSM multithreaded with one ksmd thread per CPU is entirely
possible, the rbtree rebalance will require some locking of course but
the high CPU usage parts of KSM are fully scalable (mm walk, checksum,
memcompare, writeprotection, pagetable replacement). We didn't
multithread ksmd to keep it simpler primarily but also because nobody
asked for this feature yet. Why didn't you simply multithread KSM
which provides a solid base also supporting KSMscale?

Are you using an hash to find equality? That can't be done currently
to avoid infringing. I see various memcmp in your patch but all around
#if 0... so what are you using for finding page equality?

How does PageONE deal with 1million of equal virtual pages? Does it
lockup in rmap? KSM in v4.13 can handle infinite amount of equal
virtual page content to dedup while generating O(1) complexity in rmap
walks. Without this, KSM was unusable for enterprise use and had to be
disabled, because the kernel would lockup for several seconds after
deduplicating million of virtual pages with same content (i.e. during
NUMA balancing induced page migrations or during compaction induced
page migrations, let alone swapping the million-times deduplicated KSM
page).

KSM is usually an activity run in the background so nobody asked to
dedicate more than one core to it, and what's relevant is to do the
dedup in the most efficient way possible (i.e. less CPU used and no
interference to the rest of the system whatsoever), not how long it
takes if you run it on all available CPUs loading 100% of the system
with it.

So comparing a dedup algorithm running concurrently on 12 threads vs
another dedup algorithm running in 1 thread only, is an apple to
oranges comparison.

Comparing KSM (with crc32 as cksum, to apply on top of upstream) vs
PageOne restricted to a single thread (also more realistic production
environment), will be a more interesting and meaningful comparison.

It looks like rmap is supported by pone but the patch has a multitude
of #if 0 and around all rmap code so it's not so clear. Rmap walks
have to work flawlessy on all deduplicated pages, or pone would then
break not just swapping but also NUMA Balancing compaction and in turn
THP utilization and THP utilization is critical for virtual machines
(MADV_HUGEPAGE is always set by QEMU, to run direct compactin also with
defrag=madvise or defer+madvise).

Thanks,
Andrea

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
@ 2017-09-28 10:09     ` Andrea Arcangeli
  0 siblings, 0 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2017-09-28 10:09 UTC (permalink / raw)
  To: Michal Hocko; +Cc: XaviLi, linux-kernel, Hugh Dickins, linux-mm

Hello XaviLi,

On Tue, Sep 26, 2017 at 01:47:47PM +0200, Michal Hocko wrote:
> [Let's add some more people and linux-mm to the CC list]
> 
> On Wed 20-09-17 11:23:50, XaviLi wrote:
> > PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> > https://github.com/baibantech/dynamic_vm.git 
> > 
> > patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
> > 
> > One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
> > 
> > Test environment
> > DELL R730
> > Intel(R) Xeon(R) E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
> > 256GB RAM
> > Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> > Qemu: 2.9.0
> > Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
> > 
> > We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
> > 
> > KSM:
> > Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> > Starting used memory: 216.8G
> > Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
> > 
> > PageONE:
> > Configuration: merge_period(secs) = 20, work threads = 12
> > Starting used memory: 207.3G
> > (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> > Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
> > 
> > We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
> > 
> > KSM:
> > Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> > Starting used memory: 216.8G
> > Result: 19:49 minutes until used memory was reduced to 78.7GB.
> > 
> > PageONE:
> > Configuration: merge period(secs) = 20, work threads = 12
> > Starting used memory: 210.3G
> > Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
> > 
> > PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

Could you repeat the whole benchmark while giving only 1 CPU to PageONE
and after applying the following crc32c-intel patch to KSM?

https://www.spinics.net/lists/linux-mm/msg132394.html

You may consider also echo 1 > /sys/kernel/mm/ksm/use_zero_pages if
you single out zero pages in pone (but it doesn't look like you have
such feature in pone).

The second test is exercising the worst case possible of KSM so I
don't see how it's worth worrying about. Likely pone would also have a
worst case to exercise (it uses hash_64 so it very likely also has a
worst case to exercise). For KSM there are already plans to alter the
memcmp so it's more scattered randomly.

Making KSM multithreaded with one ksmd thread per CPU is entirely
possible, the rbtree rebalance will require some locking of course but
the high CPU usage parts of KSM are fully scalable (mm walk, checksum,
memcompare, writeprotection, pagetable replacement). We didn't
multithread ksmd to keep it simpler primarily but also because nobody
asked for this feature yet. Why didn't you simply multithread KSM
which provides a solid base also supporting KSMscale?

Are you using an hash to find equality? That can't be done currently
to avoid infringing. I see various memcmp in your patch but all around
#if 0... so what are you using for finding page equality?

How does PageONE deal with 1million of equal virtual pages? Does it
lockup in rmap? KSM in v4.13 can handle infinite amount of equal
virtual page content to dedup while generating O(1) complexity in rmap
walks. Without this, KSM was unusable for enterprise use and had to be
disabled, because the kernel would lockup for several seconds after
deduplicating million of virtual pages with same content (i.e. during
NUMA balancing induced page migrations or during compaction induced
page migrations, let alone swapping the million-times deduplicated KSM
page).

KSM is usually an activity run in the background so nobody asked to
dedicate more than one core to it, and what's relevant is to do the
dedup in the most efficient way possible (i.e. less CPU used and no
interference to the rest of the system whatsoever), not how long it
takes if you run it on all available CPUs loading 100% of the system
with it.

So comparing a dedup algorithm running concurrently on 12 threads vs
another dedup algorithm running in 1 thread only, is an apple to
oranges comparison.

Comparing KSM (with crc32 as cksum, to apply on top of upstream) vs
PageOne restricted to a single thread (also more realistic production
environment), will be a more interesting and meaningful comparison.

It looks like rmap is supported by pone but the patch has a multitude
of #if 0 and around all rmap code so it's not so clear. Rmap walks
have to work flawlessy on all deduplicated pages, or pone would then
break not just swapping but also NUMA Balancing compaction and in turn
THP utilization and THP utilization is critical for virtual machines
(MADV_HUGEPAGE is always set by QEMU, to run direct compactin also with
defrag=madvise or defer+madvise).

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-26 11:47   ` Michal Hocko
  (?)
  (?)
@ 2017-09-29  9:50   ` XaviLi
  -1 siblings, 0 replies; 12+ messages in thread
From: XaviLi @ 2017-09-29  9:50 UTC (permalink / raw)
  To: Michal Hocko, Andrea Arcangeli
  Cc: linux-kernel, Hugh Dickins, linux-mm, 杨泽昕,
	李珅, 王斌

[-- Attachment #1: Type: text/plain, Size: 5268 bytes --]


------------------------------------------------------------------From:Andrea Arcangeli <aarcange@redhat.com>Time:2017 Sep 28 (Thu) 18:09Thanks for replying
> Could you repeat the whole benchmark while giving only 1 CPU to PageONE
> and after applying the following crc32c-intel patch to KSM?
> 
> https://www.spinics.net/lists/linux-mm/msg132394.html
> 
> You may consider also echo 1 > /sys/kernel/mm/ksm/use_zero_pages if
> you single out zero pages in pone (but it doesn't look like you have
> such feature in pone).

> The second test is exercising the worst case possible of KSM so I
> don't see how it's worth worrying about. Likely pone would also have a
> worst case to exercise (it uses hash_64 so it very likely also has a
> worst case to exercise). For KSM there are already plans to alter the
> memcmp so it's more scattered randomly.

> Making KSM multithreaded with one ksmd thread per CPU is entirely
> possible, the rbtree rebalance will require some locking of course but
> the high CPU usage parts of KSM are fully scalable (mm walk, checksum,
> memcompare, writeprotection, pagetable replacement). We didn't
> multithread ksmd to keep it simpler primarily but also because nobody
> asked for this feature yet. Why didn't you simply multithread KSM
> which provides a solid base also supporting KSMscale?
> 
> Are you using an hash to find equality? That can't be done currently
> to avoid infringing. I see various memcmp in your patch but all around
> #if 0... so what are you using for finding page equality?
> 
> How does PageONE deal with 1million of equal virtual pages? Does it
> lockup in rmap? KSM in v4.13 can handle infinite amount of equal
> virtual page content to dedup while generating O(1) complexity in rmap
> walks. Without this, KSM was unusable for enterprise use and had to be
> disabled, because the kernel would lockup for several seconds after
> deduplicating million of virtual pages with same content (i.e. during
> NUMA balancing induced page migrations or during compaction induced
> page migrations, let alone swapping the million-times deduplicated KSM
>page).

> KSM is usually an activity run in the background so nobody asked to
> dedicate more than one core to it, and what's relevant is to do the
> dedup in the most efficient way possible (i.e. less CPU used and no
> interference to the rest of the system whatsoever), not how long it
> takes if you run it on all available CPUs loading 100% of the system
> with it.
>
> So comparing a dedup algorithm running concurrently on 12 threads vs
> another dedup algorithm running in 1 thread only, is an apple to
> oranges comparison.
> 
> Comparing KSM (with crc32 as cksum, to apply on top of upstream) vs
> PageOne restricted to a single thread (also more realistic production
> environment), will be a more interesting and meaningful comparison.
> 
> It looks like rmap is supported by pone but the patch has a multitude
> of #if 0 and around all rmap code so it's not so clear. Rmap walks
> have to work flawlessy on all deduplicated pages, or pone would then
> break not just swapping but also NUMA Balancing compaction and in turn
> THP utilization and THP utilization is critical for virtual machines
> (MADV_HUGEPAGE is always set by QEMU, to run direct compactin also with
> defrag=madvise or defer+madvise).


PageONE is based on a new tree algorithm other than KSM’s red-black tree. The original idea is that we can use a lockless tree to enhance multithread performance. Not all tree algorithms are suitable for this purpose. We have not find a way to do it for red-black tree. PageONE is based on a new tree. The closest topology we found is Patricia tree, but also different. We name it  SD tree currently.

The original engine name is ONE (Object Non-duplicate Engine), it is designed for general purpose object deduplication. We  applied it to kernel page field (PageONE) first because here we can find out how it behaves in high speed environment. PageOne is not to improve the ksm, which is two completely different things.

We do not use an hash to find equality. Because SD  tree need to compare bits, so we do the implementation of the comparison function, nor  using memcmp.
PageOne has  no additional management structure, except SD tree structure , page status bitmap, lockless que,  so we use rmap walks when necessary to obtain reverse mapping(wrire-protect ,repalce page table) ,  and now current version , the swap and migration process is not yet completely.
PageOne Single-threaded performance and xxhash test results, we will be provided  after the end of the holidays (10.1-10.8).







[-- Attachment #2: Type: text/html, Size: 13159 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
@ 2017-09-21  6:45 Jesse Yang
  0 siblings, 0 replies; 12+ messages in thread
From: Jesse Yang @ 2017-09-21  6:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, 李珅, 李计勇, 王斌

------------------------------------------------------------------
From:David Hildenbrand <david@redhat.com>
Send Time:20/9/2017 23:35

> Can you outline why we can't simply improve KSM? Or is your plan to
> actually improve KSM and it is just the codename "PageONE" ? (haven't
> looked at the code - why 4.4.1 and not upstream linux?)

> -- 

> Thanks,
> 
> David


During the development process, we switch over serval kernel versions, switching version is relatively simple.

So we haven't see updating kernel version as a pushing issue. if anyone  needs the  patch for other kernel versions, please name it.

--
Jesse Yang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-20  3:03 XaviLi
                   ` (2 preceding siblings ...)
  2017-09-20 15:35 ` David Hildenbrand
@ 2017-09-21  2:18 ` Jesse Yang
  3 siblings, 0 replies; 12+ messages in thread
From: Jesse Yang @ 2017-09-21  2:18 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: kvm, 李珅, 李计勇, 王斌

------------------------------------------------------------------
From:David Hildenbrand <david@redhat.com>
Send Time:20/9/2017 23:35

> Can you outline why we can't simply improve KSM? Or is your plan to
> actually improve KSM and it is just the codename "PageONE" ? (haven't
> looked at the code - why 4.4.1 and not upstream linux?)

> -- 

> Thanks,
> 
> David

Thanks for replying!

PageONE is based on a new tree algorithm other than KSM’s red-black tree. The original idea is that we can use a lockless tree to enhance multithread performance. Not all tree algorithms are suitable for this purpose. We have not find a way to do it for red-black tree. PageONE is based on a new tree. The closest topology we found is Patricia tree, but also different. We name it  SD tree currently.

The original engine name is ONE (Object Non-duplicate Engine), it is designed for general purpose object deduplication. We  applied it to kernel page field (PageONE) first because here we can find out how it behaves in high speed environment.

The details of SD tree are going to be explained in future topics. 

Right now, we just choose 4.4.1 to verify PageONE function .

--
Jesse Yang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-20  3:03 XaviLi
  2017-09-20  9:36 ` 王金浦
  2017-09-20 10:29 ` Jesse Yang
@ 2017-09-20 15:35 ` David Hildenbrand
  2017-09-21  2:18 ` Jesse Yang
  3 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2017-09-20 15:35 UTC (permalink / raw)
  To: XaviLi, kvm

On 20.09.2017 05:03, XaviLi wrote:
> PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> https://github.com/baibantech/dynamic_vm.git 
> 
> patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
> 
> One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
> 
> Test environment
> DELL R730
> Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
> 256GB RAM
> Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> Qemu: 2.9.0
> Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
> 
> We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
> 
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 207.3G
> (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
> 
> We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
> 
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: 19:49 minutes until used memory was reduced to 78.7GB.
> 
> PageONE:
> Configuration: merge period(secs) = 20, work threads = 12
> Starting used memory: 210.3G
> Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
> 
> PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.
> 

Can you outline why we can't simply improve KSM? Or is your plan to
actually improve KSM and it is just the codename "PageONE" ? (haven't
looked at the code - why 4.4.1 and not upstream linux?)

-- 

Thanks,

David

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-20  3:03 XaviLi
  2017-09-20  9:36 ` 王金浦
@ 2017-09-20 10:29 ` Jesse Yang
  2017-09-20 15:35 ` David Hildenbrand
  2017-09-21  2:18 ` Jesse Yang
  3 siblings, 0 replies; 12+ messages in thread
From: Jesse Yang @ 2017-09-20 10:29 UTC (permalink / raw)
  To: 王金浦
  Cc: kvm, 李珅, 李计勇, 王斌

------------------------------------------------------------------
From: <jinpuwang@gmail.com>
Send Time:20/9/2017 17:36
> Have you compare with uksm, what's the benifitial against uksm,
> multithread? what's the performance overhead? why do you need to patch
> qemu?
>
> Thanks,
> Jack Wang


Thanks for replying!

we have not compared with uksm yet. 

The benefit is that adding more CPU cores can linearly boost speed as far as we tested. It also allows any CPU to donate spare time for the job.

The qemu patch is used for another feather named PPR(Per Page Recycler). PageONE does not need that qemu patch.

PRR Details:
https://github.com/baibantech/dynamic_vm/wiki/PPR-Details

Jesse Yang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC] A multithread lockless deduplication engine
  2017-09-20  3:03 XaviLi
@ 2017-09-20  9:36 ` 王金浦
  2017-09-20 10:29 ` Jesse Yang
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: 王金浦 @ 2017-09-20  9:36 UTC (permalink / raw)
  To: XaviLi; +Cc: kvm

2017-09-20 5:03 GMT+02:00 XaviLi <ljy@baibantech.com.cn>:
> PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
> https://github.com/baibantech/dynamic_vm.git
>
> patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
>
> One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:
>
> Test environment
> DELL R730
> Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24);
> 256GB RAM
> Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> Qemu: 2.9.0
> Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
>
> We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.
>
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.
>
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 207.3G
> (Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
> Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.
>
> We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:
>
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: 19:49 minutes until used memory was reduced to 78.7GB.
>
> PageONE:
> Configuration: merge period(secs) = 20, work threads = 12
> Starting used memory: 210.3G
> Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.
>
> PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.
Thanks for sharing, intresting.

Have you compare with uksm, what's the benifitial against uksm,
multithread? what's the performance overhead? why do you need to patch
qemu?

Thanks,
Jack Wang

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC] A multithread lockless deduplication engine
@ 2017-09-20  3:03 XaviLi
  2017-09-20  9:36 ` 王金浦
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: XaviLi @ 2017-09-20  3:03 UTC (permalink / raw)
  To: kvm

PageONE (Page Object Non-duplicate Engine) is a multithread kernel page deduplication engine. It is based on a lock-less tree algorithm we currently named as SD (Static and Dynamic) Tree. Normal operations such as insert/query/delete to this tree are block-less. Adding more CPU cores can linearly boost speed as far as we tested. Multithreading gives not only opportunity to work faster. It also allows any CPU to donate spare time for the job. Therefore, it reveals a way to use CPU more efficiently. PPR is from an open source solution named Dynamic VM:
https://github.com/baibantech/dynamic_vm.git 

patch can be found here:  https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5

One work thread of PageONE can match the speed of KSM daemon. Adding more CPUs can increase speed linearly. Here we can see a brief test:

Test environment
DELL R730
Intel® Xeon® E5-2650 v4 (2.20 GHz, of Cores 12, threads 24); 
256GB RAM
Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
Qemu: 2.9.0
Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76

We ran 12 VMs together. Each create 16GB data in memory. After all data is ready we start dedup-engine and see how host-side used memory amount changes.

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
Starting used memory: 216.8G
Result: KSM start merging pages immediately after turned on. KSM daemon took 100% of one CPU for 13:16 until used memory was reduced to 79.0GB.

PageONE:
Configuration: merge_period(secs) = 20, work threads = 12
Starting used memory: 207.3G
(Which means PageONE scans full physical memory in 20 secs period. Pages was merged if not changed in 2 merge_periods.)
Result: In the first two periods PageONE only observe and identify unchanged pages. Little CPU was used in this time. As the third period begin all 12 threads start using 100% CPU to do real merge job. 00:58 later used memory was reduced to 70.5GB.

We ran the above test using the data quite easy for red-black tree of KSM. Every difference can be detected by comparing the first 8 bytes. Then we ran another test in which each data was begin with random zero bytes for comparison. The average size of zero data was 128 bytes. Result is shown below:

KSM:
Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
Starting used memory: 216.8G
Result: 19:49 minutes until used memory was reduced to 78.7GB.

PageONE:
Configuration: merge period(secs) = 20, work threads = 12
Starting used memory: 210.3G
Result: First 2 periods same as above. 1:09 after merge job start memory was reduced to 72GB.

PageONE shows little difference in the two tests because SD tree search compare each key bit just once in most cases.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-09-29  9:50 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-20  3:23 [RFC] A multithread lockless deduplication engine XaviLi
2017-09-26 11:47 ` Michal Hocko
2017-09-26 11:47   ` Michal Hocko
2017-09-28 10:09   ` Andrea Arcangeli
2017-09-28 10:09     ` Andrea Arcangeli
2017-09-29  9:50   ` XaviLi
  -- strict thread matches above, loose matches on Subject: below --
2017-09-21  6:45 Jesse Yang
2017-09-20  3:03 XaviLi
2017-09-20  9:36 ` 王金浦
2017-09-20 10:29 ` Jesse Yang
2017-09-20 15:35 ` David Hildenbrand
2017-09-21  2:18 ` Jesse Yang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.