From mboxrd@z Thu Jan  1 00:00:00 1970
From: =?UTF-8?B?546L6YeR5rWm?= <jinpuwang@gmail.com>
Subject: Re: [RFC] A multithread lockless deduplication engine
Date: Wed, 20 Sep 2017 11:36:38 +0200
Message-ID: <CAD9gYJLet3HgXL5K9O4DBfmAinb=Z3rQ_z-t4qg5tfXWbDzdgg@mail.gmail.com>
References: <3d09c774-f930-432c-b60c-714bdaea7c59.ljy@baibantech.com.cn>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Cc: kvm <kvm@vger.kernel.org>
To: XaviLi <ljy@baibantech.com.cn>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mail-yw0-f169.google.com ([209.85.161.169]:52255 "EHLO
        mail-yw0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751611AbdITJgj (ORCPT <rfc822;kvm@vger.kernel.org>);
        Wed, 20 Sep 2017 05:36:39 -0400
Received: by mail-yw0-f169.google.com with SMTP id i6so1485533ywc.9
        for <kvm@vger.kernel.org>; Wed, 20 Sep 2017 02:36:39 -0700 (PDT)
In-Reply-To: <3d09c774-f930-432c-b60c-714bdaea7c59.ljy@baibantech.com.cn>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

2017-09-20 5:03 GMT+02:00 XaviLi <ljy@baibantech.com.cn>:
> PageONE (Page Object Non-duplicate Engine) is a multithreaded kernel page
> deduplication engine. It is based on a lock-less tree algorithm that we
> currently call the SD (Static and Dynamic) Tree. Normal operations on this
> tree, such as insert, query and delete, are block-less. As far as we have
> tested, adding more CPU cores boosts speed linearly. Multithreading not only
> makes the work faster, it also allows any CPU to donate spare time to the
> job, so it offers a way to use CPUs more efficiently. PPR is from an open
> source solution named Dynamic VM:
> https://github.com/baibantech/dynamic_vm.git
>
> patch can be found here: https://github.com/baibantech/dynamic_vm/tree/master/dynamic_vm_0.5
>
> One work thread of PageONE can match the speed of the KSM daemon. Adding
> more CPUs increases speed linearly. Here is a brief test:
>
> Test environment
> DELL R730
> Intel® Xeon® E5-2650 v4 (2.20 GHz, 12 cores, 24 threads)
> 256GB RAM
> Host OS: Ubuntu server 14.04 Host kernel: 4.4.1
> Qemu: 2.9.0
> Guest OS: Ubuntu server 16.04 Guest kernel: 4.4.76
>
> We ran 12 VMs together. Each created 16GB of data in memory. After all the
> data was ready, we started the dedup engine and watched how the host-side
> used memory changed.
>
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: KSM started merging pages immediately after being turned on. The
> KSM daemon used 100% of one CPU for 13:16 until used memory was reduced to
> 79.0GB.
>
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 207.3G
> (This means PageONE scans the full physical memory once per 20-second
> period. Pages are merged if they have not changed for 2 merge_periods.)
> Result: In the first two periods PageONE only observed and identified
> unchanged pages, and used little CPU. When the third period began, all 12
> threads started using 100% CPU to do the real merge job. 00:58 later, used
> memory was reduced to 70.5GB.
>
> We ran the above test using data that is quite easy for the red-black tree
> of KSM: every difference can be detected by comparing the first 8 bytes.
> Then, for comparison, we ran another test in which each piece of data began
> with a random-length run of zero bytes. The average length of the zero run
> was 128 bytes. The results are shown below:
>
> KSM:
> Configuration: sleep_millisecs = 0, pages_to_scan = 1000000
> Starting used memory: 216.8G
> Result: It took 19:49 until used memory was reduced to 78.7GB.
>
> PageONE:
> Configuration: merge_period(secs) = 20, work threads = 12
> Starting used memory: 210.3G
> Result: The first 2 periods were the same as above. 1:09 after the merge job
> started, used memory was reduced to 72GB.
>
> PageONE shows little difference between the two tests because an SD tree
> search compares each key bit just once in most cases.
Thanks for sharing, interesting.

Have you compared it with uksm? What is the benefit over uksm, the
multithreading? What is the performance overhead? Why do you need to patch
qemu?
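
To put the overhead question in numbers, here is a rough back-of-the-envelope
sketch that turns the figures quoted above into reclaim rates (GB freed per
second of merge time). The function and variable names are only illustrative,
and the PageONE times are taken as quoted, so they exclude the two 20-second
observation periods before merging starts.

```python
# Reclaim-rate arithmetic based only on the numbers quoted in this thread.
# Memory values are in GB as reported; names here are illustrative.

def seconds(mmss: str) -> int:
    """Convert an 'MM:SS' string to a number of seconds."""
    minutes, secs = mmss.split(":")
    return int(minutes) * 60 + int(secs)


def reclaim_rate(start_gb: float, end_gb: float, mmss: str) -> float:
    """GB of host memory freed per second of merge time."""
    return (start_gb - end_gb) / seconds(mmss)


THREADS = 12  # PageONE work threads in both tests

RUNS = {
    "test 1 (differs in first 8 bytes)": {
        "ksm": reclaim_rate(216.8, 79.0, "13:16"),
        "pageone": reclaim_rate(207.3, 70.5, "00:58"),
    },
    "test 2 (random zero prefix)": {
        "ksm": reclaim_rate(216.8, 78.7, "19:49"),
        "pageone": reclaim_rate(210.3, 72.0, "01:09"),
    },
}


def speedups(runs=RUNS, threads=THREADS):
    """Return {test name: (overall speedup, speedup per PageONE thread)}."""
    result = {}
    for name, rates in runs.items():
        overall = rates["pageone"] / rates["ksm"]
        result[name] = (overall, overall / threads)
    return result


if __name__ == "__main__":
    for name, (overall, per_thread) in speedups().items():
        print(f"{name}: {overall:.1f}x overall, {per_thread:.2f}x per thread")
```

By this estimate PageONE reclaims memory about 13.6 times faster than ksmd in
the first test and about 17.3 times faster in the second, so a bit more than
1x per thread. That fits the "one thread matches KSM" claim, but it says
nothing about the CPU cost of the observation periods.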

Thanks,
Jack Wang