Subject: Re: [PATCH v11 3/4] softmmu/dirtylimit: implement virtual CPU throttle
From: Hyman Huang <huangy81@chinatelecom.cn>
To: Peter Xu
Cc: Eduardo Habkost, David Hildenbrand, Juan Quintela, Richard Henderson,
    qemu-devel, Markus Armbruster, Paolo Bonzini, Philippe Mathieu-Daudé,
    "Dr. David Alan Gilbert"
Date: Thu, 20 Jan 2022 16:26:09 +0800
References: <0381e32c2cc70613613aaa284b8e8c9760d6932f.1641316375.git.huangy81@chinatelecom.cn>

On 2022/1/17 15:32, Peter Xu wrote:
> On Wed, Jan 05, 2022 at 01:14:08AM +0800, huangy81@chinatelecom.cn wrote:
>>  ##
>> +# @DirtyLimitInfo:
>> +#
>> +# Dirty page rate limit information of virtual CPU.
>> +#
>> +# @cpu-index: index of virtual CPU.
>> +#
>> +# @limit-rate: upper limit of dirty page rate for virtual CPU.
>> +#
>> +# @current-rate: current dirty page rate for virtual CPU.
> 
> Please consider spelling out the unit (MB/s) too for all these dirty rate fields.
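
One possible wording, just as an illustrative sketch to confirm this is what you
mean (not the final patch):

  # @limit-rate: upper limit of dirty page rate (MB/s) for virtual CPU.
  #
  # @current-rate: current dirty page rate (MB/s) for virtual CPU.
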
> 
>> +#
>> +# Since: 7.0
>> +#
>> +##
>> +{ 'struct': 'DirtyLimitInfo',
>> +  'data': { 'cpu-index': 'int',
>> +            'limit-rate': 'int64',
>> +            'current-rate': 'int64' } }
>> +
>> +##
>>  # @snapshot-save:
>>  #
>>  # Save a VM snapshot
>> diff --git a/softmmu/dirtylimit.c b/softmmu/dirtylimit.c
>> index a10ac6f..c9f5745 100644
>> --- a/softmmu/dirtylimit.c
>> +++ b/softmmu/dirtylimit.c
>> @@ -18,6 +18,26 @@
>>  #include "sysemu/dirtylimit.h"
>>  #include "exec/memory.h"
>>  #include "hw/boards.h"
>> +#include "sysemu/kvm.h"
>> +#include "trace.h"
>> +
>> +/*
>> + * Dirtylimit stop working if dirty page rate error
>> + * value less than DIRTYLIMIT_TOLERANCE_RANGE
>> + */
>> +#define DIRTYLIMIT_TOLERANCE_RANGE 25 /* MB/s */
>> +/*
>> + * Plus or minus vcpu sleep time linearly if dirty
>> + * page rate error value percentage over
>> + * DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT.
>> + * Otherwise, plus or minus a fixed vcpu sleep time.
>> + */
>> +#define DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT 50
>> +/*
>> + * Max vcpu sleep time percentage during a cycle
>> + * composed of dirty ring full and sleep time.
>> + */
>> +#define DIRTYLIMIT_THROTTLE_PCT_MAX 99
> 
> (Thanks for the enriched comments)
> 
>> +static inline void dirtylimit_vcpu_set_quota(int cpu_index,
>> +                                             uint64_t quota,
>> +                                             bool on)
>> +{
>> +    dirtylimit_state->states[cpu_index].quota = quota;
> 
> To be clear, we could move this line into the "(on)" if condition, then in !on
> case we reset it.
> 
>> +    if (on) {
>> +        if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
>> +            dirtylimit_state->limited_nvcpu++;
>> +        }
>> +    } else {
>> +        if (dirtylimit_state->states[cpu_index].enabled) {
>> +            dirtylimit_state->limited_nvcpu--;
>> +        }
>> +    }
>> +
>> +    dirtylimit_state->states[cpu_index].enabled = on;
>> +}
>> +
>> +static inline int64_t dirtylimit_dirty_ring_full_time(uint64_t dirtyrate)
>> +{
>> +    static uint64_t max_dirtyrate;
>> +    uint32_t dirty_ring_size = kvm_dirty_ring_size();
>> +    uint64_t dirty_ring_size_meory_MB =
>> +        dirty_ring_size * TARGET_PAGE_SIZE >> 20;
>> +
>> +    if (max_dirtyrate < dirtyrate) {
>> +        max_dirtyrate = dirtyrate;
>> +    }
>> +
>> +    return dirty_ring_size_meory_MB * 1000000 / max_dirtyrate;
>> +}
>> +
>> +static inline bool dirtylimit_done(uint64_t quota,
>> +                                   uint64_t current)
>> +{
>> +    uint64_t min, max;
>> +
>> +    min = MIN(quota, current);
>> +    max = MAX(quota, current);
>> +
>> +    return ((max - min) <= DIRTYLIMIT_TOLERANCE_RANGE) ?
>> +                true : false;
>> +}
>> +
>> +static inline bool
>> +dirtylimit_need_linear_adjustment(uint64_t quota,
>> +                                  uint64_t current)
>> +{
>> +    uint64_t min, max, pct;
>> +
>> +    min = MIN(quota, current);
>> +    max = MAX(quota, current);
>> +
>> +    pct = (max - min) * 100 / max;
>> +
>> +    return pct > DIRTYLIMIT_LINEAR_ADJUSTMENT_PCT;
>> +}
>> +
>> +static void dirtylimit_set_throttle(CPUState *cpu,
>> +                                    uint64_t quota,
>> +                                    uint64_t current)
>> +{
>> +    int64_t ring_full_time_us = 0;
>> +    uint64_t sleep_pct = 0;
>> +    uint64_t throttle_us = 0;
>> +
>> +    ring_full_time_us = dirtylimit_dirty_ring_full_time(current);
>> +
>> +    if (dirtylimit_need_linear_adjustment(quota, current)) {
>> +        if (quota < current) {
>> +            sleep_pct = (current - quota) * 100 / current;
>> +            throttle_us =
>> +                ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>> +            cpu->throttle_us_per_full += throttle_us;
>> +        } else {
>> +            sleep_pct = (quota - current) * 100 / quota;
>> +            throttle_us =
>> +                ring_full_time_us * sleep_pct / (double)(100 - sleep_pct);
>> +            cpu->throttle_us_per_full -= throttle_us;
>> +        }
>> +
>> +        trace_dirtylimit_throttle_pct(cpu->cpu_index,
>> +                                      sleep_pct,
>> +                                      throttle_us);
>> +    } else {
>> +        if (quota < current) {
>> +            cpu->throttle_us_per_full += ring_full_time_us / 10;
>> +        } else {
>> +            cpu->throttle_us_per_full -= ring_full_time_us / 10;
>> +        }
>> +    }
>> +
>> +    cpu->throttle_us_per_full = MIN(cpu->throttle_us_per_full,
>> +        ring_full_time_us * DIRTYLIMIT_THROTTLE_PCT_MAX);
>> +
>> +    cpu->throttle_us_per_full = MAX(cpu->throttle_us_per_full, 0);
>> +}
> 
> This algorithm seems to work even worse than the previous version; could you
> have a look at what's wrong?
> 
> See how it fluctuates when I set a throttle of 300MB/s:
> 
> (QMP) set-vcpu-dirty-limit dirty-rate=300
> 
> Dirty rate: 17622 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 17617 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 17611 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 13023 (MB/s), duration: 1153 (ms), load: 100.00%
> Dirty rate: 923 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 2853 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 1963 (MB/s), duration: 1040 (ms), load: 100.00%
> Dirty rate: 180 (MB/s), duration: 1006 (ms), load: 100.00%
> Dirty rate: 182 (MB/s), duration: 1007 (ms), load: 100.00%
> Dirty rate: 177 (MB/s), duration: 1005 (ms), load: 100.00%
> Dirty rate: 181 (MB/s), duration: 1007 (ms), load: 100.00%
> Dirty rate: 179 (MB/s), duration: 1006 (ms), load: 100.00%
> Dirty rate: 168 (MB/s), duration: 1005 (ms), load: 100.00%
> Dirty rate: 169 (MB/s), duration: 1006 (ms), load: 100.00%
> Dirty rate: 2717 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 2851 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 1773 (MB/s), duration: 1021 (ms), load: 100.00%
> Dirty rate: 177 (MB/s), duration: 1006 (ms), load: 100.00%
> Dirty rate: 179 (MB/s), duration: 1006 (ms), load: 100.00%
> Dirty rate: 175 (MB/s), duration: 1005 (ms), load: 100.00%
> Dirty rate: 1973 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 2878 (MB/s), duration: 1000 (ms), load: 100.00%
> Dirty rate: 1690 (MB/s), duration: 1022 (ms), load: 100.00%
> Dirty rate: 174 (MB/s), duration: 1005 (ms), load: 100.00%
> Dirty rate: 184 (MB/s), duration: 1006 (ms), load: 100.00%
> 
> This is the tool I'm using:
> 
> https://github.com/xzpeter/mig_mon#memory-dirty
> 
> Again, I won't ask for a good algorithm for the 1st version, but I do think it
> is nicer to have the simplest algorithm merged first, which should be very easy
> to verify.
> 

Hi, Peter.

I'm working on this problem and found that the reason is much the same as what
I mentioned in the cover letter of v10. This is what I posted there:

2. The new implementation of the throttle algorithm, inspired by Peter, responds
   faster and consumes less CPU than the old one; we made impressive progress
   there. One point worth discussing is that the new throttle logic is
   "passive": a vcpu sleeps only after its dirty ring is full, unlike
   "auto-converge", which kicks the vcpu at fixed time slices. If the vcpu is
   memory-write intensive and the ring size is large, it keeps dirtying memory
   during the time it takes to fill the ring, so the throttle does not work as
   well; in other words, how well the throttle works depends on the dirty ring
   size.

I actually tested the new algorithm in two cases:

case 1: dirty-ring-size: 4096, dirtyrate: 1170MB/s
        result: minimum quota dirtyrate is 25MB/s or even less
                minimum vcpu util is 6%

case 2: dirty-ring-size: 65536, dirtyrate: 1170MB/s
        result: minimum quota dirtyrate is 256MB/s
                minimum vcpu util is 24%

I post this just for discussion; I think it is not a big deal, because if we set
dirty-ring-size to the maximum value (65536), we assume the server's bandwidth
is capable of handling it.

Currently, QEMU handles the vcpu KVM_EXIT_DIRTY_RING_FULL exit as follows:

1. If the dirty ring of one vcpu is full, the vcpu thread returns to user space
   and QEMU handles it.

2. QEMU takes kvm_slots_lock and reaps the dirty rings of all vcpus in one go by
   calling kvm_dirty_ring_reap(), fills the dirty page bitmap of each slot and
   resets the dirty rings, and finally releases kvm_slots_lock.

The logic of "reap and reset the dirty rings of all vcpus once one vcpu's dirty
ring is full" works fine and efficiently. But it is not what dirtylimit wants,
because some vcpus may lose their chance to sleep and so cannot be throttled,
even though their dirty rings were full. Your latest test environment uses a
larger guest (20G, 40 cores), which increases the chance that vcpus miss their
sleep, so the throttle does not work as well as before.

I tried a simple modification to make the throttle work as well as before:

+static void kvm_dirty_ring_reset_one(KVMState *s, CPUState *cpu)
+{
+    int ret;
+    uint64_t total = 0;
+
+    kvm_slots_lock();
+    total = kvm_dirty_ring_reap_one(s, cpu);
+
+    if (total) {
+        ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
+        assert(ret == total);
+    }
+
+    kvm_slots_unlock();
+}
+
 static void do_kvm_cpu_synchronize_kick(CPUState *cpu, run_on_cpu_data arg)
 {
     /* No need to do anything */
@@ -2309,6 +2327,11 @@ bool kvm_dirty_ring_enabled(void)
     return kvm_state->kvm_dirty_ring_size ? true : false;
 }
 
 static int kvm_init(MachineState *ms)
 {
     MachineClass *mc = MACHINE_GET_CLASS(ms);
@@ -2955,9 +2978,8 @@ int kvm_cpu_exec(CPUState *cpu)
              * still full. Got kicked by KVM_RESET_DIRTY_RINGS.
              */
             trace_kvm_dirty_ring_full(cpu->cpu_index);
-            qemu_mutex_lock_iothread();
-            kvm_dirty_ring_reap(kvm_state);
-            qemu_mutex_unlock_iothread();
+            kvm_dirty_ring_reset_one(kvm_state, cpu);
+            dirtylimit_vcpu_execute(cpu);
             ret = 0;
             break;
         case KVM_EXIT_SYSTEM_EVENT:

I drop the BQL here to reduce the overhead of handling the
KVM_EXIT_DIRTY_RING_FULL exit. Maybe kvm_state still needs to be protected by
the BQL, but I wonder whether a finer-grained lock would be possible here. How
about this?
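
To make the ring size dependence above more concrete, here is a rough
back-of-the-envelope sketch (my own illustration, not part of the patch),
assuming 4KiB target pages: the memory a vcpu can dirty before it is forced to
exit is dirty-ring-size * TARGET_PAGE_SIZE.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    const uint64_t page_size = 4096;               /* assumed TARGET_PAGE_SIZE */
    const uint32_t ring_sizes[] = { 4096, 65536 }; /* the two tested ring sizes */

    for (size_t i = 0; i < sizeof(ring_sizes) / sizeof(ring_sizes[0]); i++) {
        /* memory dirtied per "ring full -> reap -> sleep" cycle, in MB */
        uint64_t mb_per_cycle = (uint64_t)ring_sizes[i] * page_size >> 20;
        printf("dirty-ring-size %5" PRIu32 " -> %3" PRIu64 " MB dirtied per cycle\n",
               ring_sizes[i], mb_per_cycle);
    }
    return 0;
}

This prints 16 MB for a 4096-entry ring and 256 MB for a 65536-entry ring. With
65536 entries each fill-plus-sleep cycle dirties 256MB, so if the cycle works
out to roughly one second the throttled rate cannot drop much below ~256MB/s,
which matches case 2 above; with 4096 entries the per-cycle amount is only
16MB, hence the much lower floor in case 1.
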
>> +
>> +static void dirtylimit_adjust_throttle(CPUState *cpu)
>> +{
>> +    uint64_t quota = 0;
>> +    uint64_t current = 0;
>> +    int cpu_index = cpu->cpu_index;
>> +
>> +    quota = dirtylimit_vcpu_get_state(cpu_index)->quota;
>> +    current = vcpu_dirty_rate_get(cpu_index);
>> +
>> +    if (current == 0 &&
>> +        dirtylimit_vcpu_get_state(cpu_index)->unmatched_cnt == 0) {
>> +        cpu->throttle_us_per_full = 0;
>> +        goto end;
>> +    } else if (++dirtylimit_vcpu_get_state(cpu_index)->unmatched_cnt
>> +               < 2) {
>> +        goto end;
>> +    } else if (dirtylimit_done(quota, current)) {
>> +        goto end;
>> +    } else {
>> +        dirtylimit_vcpu_get_state(cpu_index)->unmatched_cnt = 0;
>> +        dirtylimit_set_throttle(cpu, quota, current);
>> +    }
>> +end:
>> +    trace_dirtylimit_adjust_throttle(cpu_index,
>> +                                     quota, current,
>> +                                     cpu->throttle_us_per_full);
>> +    return;
>> +}
>> +
>> +static void *dirtylimit_thread(void *opaque)
>> +{
>> +    CPUState *cpu;
>> +
>> +    rcu_register_thread();
>> +
>> +    while (!qatomic_read(&dirtylimit_quit)) {
>> +        sleep(DIRTYLIMIT_CALC_TIME_MS / 1000);
> 
> Sorry to have not mentioned this: I think we probably don't even need this
> dirtylimit thread.
> 
> It'll be hard to make the "sleep" right here.. you could read two identical
> values from the dirty calc thread because the 1sec sleep is not accurate, so
> even after this sleep() the calc thread may not have provided the latest number
> yet.
> 
> It'll be much cleaner (and most importantly, accurate..) to me if we could make
> this a hook function being passed over to the vcpu_dirty_rate_stat_thread()
> thread, then after each vcpu_dirty_rate_stat_collect() we call the hook.
> 
>> +
>> +        dirtylimit_state_lock();
>> +
>> +        if (!dirtylimit_in_service()) {
>> +            dirtylimit_state_unlock();
>> +            break;
>> +        }
>> +
>> +        CPU_FOREACH(cpu) {
>> +            if (!dirtylimit_vcpu_get_state(cpu->cpu_index)->enabled) {
>> +                continue;
>> +            }
>> +            dirtylimit_adjust_throttle(cpu);
>> +        }
>> +        dirtylimit_state_unlock();
>> +    }
>> +
>> +    rcu_unregister_thread();
>> +
>> +    return NULL;
>> +}
>> +
>> +static void dirtylimit_thread_start(void)
>> +{
>> +    qatomic_set(&dirtylimit_quit, 0);
>> +    qemu_thread_create(&dirtylimit_thr,
>> +                       "dirtylimit",
>> +                       dirtylimit_thread,
>> +                       NULL,
>> +                       QEMU_THREAD_JOINABLE);
>> +}
>> +
>> +static void dirtylimit_thread_stop(void)
>> +{
>> +    qatomic_set(&dirtylimit_quit, 1);
>> +    qemu_mutex_unlock_iothread();
>> +    qemu_thread_join(&dirtylimit_thr);
>> +    qemu_mutex_lock_iothread();
>> +}
>> +
>> +void dirtylimit_set_vcpu(int cpu_index,
>> +                         uint64_t quota,
>> +                         bool enable)
>> +{
>> +    trace_dirtylimit_set_vcpu(cpu_index, quota);
>> +
>> +    if (enable) {
>> +        if (dirtylimit_in_service()) {
>> +            /* only set the vcpu dirty page rate limit */
>> +            dirtylimit_vcpu_set_quota(cpu_index, quota, true);
>> +            return;
>> +        }
>> +
>> +        /* initialize state when set dirtylimit first time */
>> +        dirtylimit_state_lock();
>> +        dirtylimit_state_initialize();
>> +        dirtylimit_vcpu_set_quota(cpu_index, quota, true);
>> +        dirtylimit_state_unlock();
>> +
>> +        dirtylimit_thread_start();
> 
> Can we move the dirtylimit global initialization out of dirtylimit_set_vcpu()
> too? We should always keep init/cleanup of dirty_rate_calc and dirtylimit
> together, because they should be, imho. We never enable just one of them.
> 
> I commented similarly on this in the previous version.
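
Regarding dropping the dirtylimit thread in favour of a hook in the dirty rate
stat thread: just to confirm I understand the idea, a minimal sketch could look
roughly like the following, where dirtylimit_process() is a hypothetical helper
doing what the body of dirtylimit_thread() does today, and the field/function
names are only approximate:

static void *vcpu_dirty_rate_stat_thread(void *opaque)
{
    rcu_register_thread();

    while (qatomic_read(&vcpu_dirty_rate_stat->running)) {
        vcpu_dirty_rate_stat_collect();
        /* adjust per-vcpu throttling right after a fresh dirty rate sample */
        if (dirtylimit_in_service()) {
            dirtylimit_process();
        }
    }

    rcu_unregister_thread();
    return NULL;
}
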
> 
>> +    } else {
>> +        if (!dirtylimit_in_service()) {
>> +            return;
>> +        }
>> +
>> +        dirtylimit_state_lock();
>> +        /* dirty page rate limit is not enabled */
>> +        if (!dirtylimit_vcpu_get_state(cpu_index)->enabled) {
>> +            dirtylimit_state_unlock();
>> +            return;
>> +        }
>> +
>> +        /* switch off vcpu dirty page rate limit */
>> +        dirtylimit_vcpu_set_quota(cpu_index, 0, false);
>> +        dirtylimit_state_unlock();
>> +
>> +        if (!dirtylimit_state->limited_nvcpu) {
>> +            dirtylimit_thread_stop();
>> +
>> +            dirtylimit_state_lock();
>> +            dirtylimit_state_finalize();
>> +            dirtylimit_state_unlock();
> 
> We don't need such a fine control of locking, IMHO.. it can be a very big lock
> just to serialize things..
> 
> IMHO it could be as simple as:
> 
> void dirtylimit_set_vcpu(int cpu_index,
>                          uint64_t quota,
>                          bool enable)
> {
>     dirtylimit_vcpu_set_quota(cpu_index, quota, enable);
>     trace_dirtylimit_set_vcpu(cpu_index, quota);
> }
> 
> void qmp_set_vcpu_dirty_limit(bool has_cpu_index,
>                               uint64_t cpu_index,
>                               uint64_t dirty_rate,
>                               Error **errp)
> {
>     if (!kvm_enabled() || !kvm_dirty_ring_enabled()) {
>         error_setg(errp, "dirty page limit feature requires KVM with"
>                    " accelerator property 'dirty-ring-size' set'");
>         return;
>     }
> 
>     if (has_cpu_index && !dirtylimit_vcpu_index_valid(cpu_index)) {
>         error_setg(errp, "incorrect cpu index specified");
>         return;
>     }
> 
>     dirtylimit_state_lock();
> 
>     if (!dirtylimit_in_service()) {
>         /* TODO: we could have one helper to initialize all of them */
>         vcpu_dirty_rate_stat_initialize();
>         vcpu_dirty_rate_stat_start();
>         dirtylimit_state_initialize();
>         dirtylimit_vcpu_set_quota(cpu_index, quota, true);
>     }
> 
>     if (has_cpu_index) {
>         dirtylimit_set_vcpu(cpu_index, dirty_rate, true);
>     } else {
>         dirtylimit_set_all(dirty_rate, true);
>     }
> 
>     dirtylimit_state_unlock();
> }
> 
> I didn't write the cleanup path, but it's the same: we should only clean up all
> the global structs in cancel-vcpu-dirty-limit when we find there are zero vcpus
> being tracked, and it should be done once there.
> 
> Thanks,
> 

-- 
Best regards,
Hyman Huang (黄勇)