From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752785AbbDCMvX (ORCPT ); Fri, 3 Apr 2015 08:51:23 -0400
Received: from forward-corp1g.mail.yandex.net ([95.108.253.251]:34199
	"EHLO forward-corp1g.mail.yandex.net" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752191AbbDCMvV (ORCPT );
	Fri, 3 Apr 2015 08:51:21 -0400
Authentication-Results: smtpcorp1m.mail.yandex.net; dkim=pass header.i=@yandex-team.ru
Message-ID: <551E8CC5.30906@yandex-team.ru>
Date: Fri, 03 Apr 2015 15:51:17 +0300
From: Konstantin Khlebnikov
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org
CC: Ben Segall, Roman Gushchin
Subject: Re: [PATCH RFC] sched/fair: fix sudden expiration of cfq quota in put_prev_task()
References: <20150403124138.1349.11633.stgit@buzz>
In-Reply-To: <20150403124138.1349.11633.stgit@buzz>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On 03.04.2015 15:41, Konstantin Khlebnikov wrote:
> pick_next_task_fair() must be sure that there is at least one runnable
> task before calling put_prev_task(), but put_prev_task() can expire
> the last remains of cfs quota and throttle all currently runnable tasks.
> As a result pick_next_task_fair() cannot find the next task and crashes.

Kernel crash looks like this:

<1>[50288.719491] BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
<1>[50288.719538] IP: [] set_next_entity+0x1c/0x80
<4>[50288.719567] PGD 0
<4>[50288.719578] Oops: 0000 [#1] SMP
<4>[50288.719594] Modules linked in: vhost_net macvtap macvlan vhost 8021q mrp garp ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc netconsole configfs x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw ttm gf128mul drm_kms_helper drm glue_helper aes_x86_64 i2c_algo_bit sysimgblt sysfillrect i2c_core sb_edac edac_core syscopyarea microcode ipmi_si ipmi_msghandler lpc_ich ioatdma dca mlx4_en mlx4_core vxlan udp_tunnel ip6_udp_tunnel tcp_htcp e1000e ptp pps_core ahci libahci raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov raid6_pq async_tx raid1 raid0 multipath
<4>[50288.719956] linear
<4>[50288.719964] CPU: 27 PID: 11505 Comm: kvm Not tainted 3.18.10-7 #7
<4>[50288.719987] Hardware name:
<4>[50288.720015] task: ffff880036acbaa0 ti: ffff8808445f8000 task.ti: ffff8808445f8000
<4>[50288.720041] RIP: 0010:[] [] set_next_entity+0x1c/0x80
<4>[50288.720072] RSP: 0018:ffff8808445fbbb8 EFLAGS: 00010086
<4>[50288.720091] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 000000000000bcb8
<4>[50288.720116] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88107fd72af0
<4>[50288.720141] RBP: ffff8808445fbbd8 R08: 0000000000000000 R09: 0000000000000001
<4>[50288.720165] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[50288.720190] R13: 0000000000000000 R14: ffff880b6f250030 R15: ffff88107fd72af0
<4>[50288.720214] FS: 00007f55467fc700(0000) GS:ffff88107fd60000(0000) knlGS:ffff8802175e0000
<4>[50288.720242] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[50288.720262] CR2: 0000000000000038 CR3: 0000000324ede000 CR4: 00000000000427e0
<4>[50288.720287] Stack:
<4>[50288.720296] ffff88107fd72a80 ffff88107fd72a80 0000000000000000 0000000000000000
<4>[50288.720327] ffff8808445fbc68 ffffffff8109ead8 ffff880800000000 ffffffffa1438990
<4>[50288.720357] ffff880b6f250000 0000000000000000 0000000000012a80 ffff880036acbaa0
<4>[50288.720388] Call Trace:
<4>[50288.720402] [] pick_next_task_fair+0x88/0x5d0
<4>[50288.720429] [] ? vmx_fpu_activate.part.63+0x90/0xb0 [kvm_intel]
<4>[50288.720457] [] ? sched_clock_cpu+0x85/0xc0
<4>[50288.720479] [] __schedule+0xf9/0x7d0
<4>[50288.720500] [] ? reboot_interrupt+0x80/0x80
<4>[50288.720522] [] _cond_resched+0x2a/0x40
<4>[50288.720549] [] __vcpu_run+0xd35/0xf30 [kvm]
<4>[50288.720573] [] ? __set_task_blocked+0x37/0x80
<4>[50288.720595] [] ? try_to_wake_up+0x21e/0x360
<4>[50288.720622] [] kvm_arch_vcpu_ioctl_run+0xa5/0x220 [kvm]
<4>[50288.720650] [] kvm_vcpu_ioctl+0x2c2/0x620 [kvm]
<4>[50288.720675] [] do_vfs_ioctl+0x86/0x4f0
<4>[50288.720697] [] ? SyS_futex+0x142/0x1a0
<4>[50288.720717] [] SyS_ioctl+0x91/0xb0
<4>[50288.720737] [] system_call_fastpath+0x12/0x17
<4>[50288.720758] Code: c7 47 60 00 00 00 00 45 31 c0 e9 0c ff ff ff 66 66 66 66 90 55 48 89 e5 48 83 ec 20 48 89 5d e8 4c 89 65 f0 48 89 f3 4c 89 6d f8 <44> 8b 4e 38 49 89 fc 45 85 c9 74 17 4c 8d 6e 10 4c 39 6f 30 74
<1>[50288.722636] RIP [] set_next_entity+0x1c/0x80
<4>[50288.723533] RSP
<4>[50288.724406] CR2: 0000000000000038

In pick_next_task_fair(), cfs_rq->nr_running was non-zero, but after
put_prev_task(rq, prev) the kernel cannot find any task to schedule next.

It crashes from time to time on a strange libvirt/kvm setup where
cfs_quota is set on two levels: at the parent cgroup which contains kvm
and at the per-vcpu child cgroups.

This patch isn't verified yet, but I haven't found any other possible
reason for that crash.

>
> This patch leaves 1 in ->runtime_remaining when the current assignment
> expires and tries to refill it right after that.
> In the worst case the task will be scheduled once and throttled at the
> end of the slice.
>
> Signed-off-by: Konstantin Khlebnikov
> ---
>  kernel/sched/fair.c | 19 +++++++++++++------
>  1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7ce18f3c097a..91785d077db4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3447,11 +3447,12 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
>
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> +	/* nothing to expire */
> +	if (cfs_rq->runtime_remaining <= 0)
>  		return;
>
> -	if (cfs_rq->runtime_remaining < 0)
> +	/* if the deadline is ahead of our clock, nothing to do */
> +	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
>  		return;
>
>  	/*
> @@ -3469,8 +3470,14 @@ static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		/* extend local deadline, drift is bounded above by 2 ticks */
>  		cfs_rq->runtime_expires += TICK_NSEC;
>  	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> +		/*
> +		 * Global deadline is ahead, expiration has passed.
> +		 *
> +		 * Do not expire runtime completely. Otherwise put_prev_task()
> +		 * can throttle all tasks when we already checked nr_running or
> +		 * put_prev_entity() can throttle already chosen next entity.
> +		 */
> +		cfs_rq->runtime_remaining = 1;
>  	}
>  }
>
> @@ -3480,7 +3487,7 @@ static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  	cfs_rq->runtime_remaining -= delta_exec;
>  	expire_cfs_rq_runtime(cfs_rq);
>
> -	if (likely(cfs_rq->runtime_remaining > 0))
> +	if (likely(cfs_rq->runtime_remaining > 1))
>  		return;
>
>  	/*
>

-- 
Konstantin
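
To make the intended effect of the patch easier to follow, here is a rough userspace sketch of the expiry logic. It is only a toy model, not kernel code: struct toy_cfs_rq and the toy_* functions are hypothetical names, the global-deadline branch is left out, and it assumes both that any refill attempt fails and that the put_prev path throttles a group whenever runtime_remaining is not positive. The floor argument selects the old behaviour (0) or the patched one (1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the bandwidth fields of struct cfs_rq (hypothetical). */
struct toy_cfs_rq {
	int64_t runtime_remaining;	/* local quota left, ns */
	uint64_t runtime_expires;	/* local deadline, ns */
};

/*
 * Simplified expiry: once the clock passes the local deadline, drop the
 * remaining runtime to `floor` (0 = before the patch, 1 = with the patch).
 * The real function also compares against the global deadline; that
 * branch is omitted in this sketch.
 */
static void toy_expire(struct toy_cfs_rq *rq, uint64_t now, int64_t floor)
{
	assert(floor == 0 || floor == 1);

	if (rq->runtime_remaining <= 0)
		return;		/* nothing to expire */
	if ((int64_t)(now - rq->runtime_expires) < 0)
		return;		/* deadline is still ahead */
	rq->runtime_remaining = floor;
}

/* Charge delta_exec ns, then expire; assumes any refill attempt fails. */
static void toy_account(struct toy_cfs_rq *rq, uint64_t now,
			uint64_t delta_exec, int64_t floor)
{
	rq->runtime_remaining -= (int64_t)delta_exec;
	toy_expire(rq, now, floor);
}

/* Assumed condition under which the put_prev path throttles the group. */
static bool toy_would_throttle(const struct toy_cfs_rq *rq)
{
	return rq->runtime_remaining <= 0;
}
```

Starting with runtime_remaining = 500 and runtime_expires = 100, calling toy_account() at now = 200 with delta_exec = 10 leaves 0 with floor 0, so the group would be throttled in the middle of the pick, and leaves 1 with floor 1, so it would not. A second toy_account() of 10 ns with floor 1 drives the value to -9, which corresponds to the "scheduled once and throttled at the end of the slice" worst case described above.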