From: Julien Desfossez
To: Peter Zijlstra, mingo@kernel.org, tglx@linutronix.de, pjt@google.com,
    tim.c.chen@linux.intel.com, torvalds@linux-foundation.org
Cc: Julien Desfossez, linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com,
    fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com,
    Vineeth Pillai, Nishanth Aravamudan
Subject: Re: [RFC][PATCH 03/16] sched: Wrap rq::lock access
Date: Thu, 21 Mar 2019 17:20:17 -0400
Message-Id: <1553203217-11444-1-git-send-email-jdesfossez@digitalocean.com>
In-Reply-To: <15f3f7e6-5dce-6bbf-30af-7cffbd7bb0c3@oracle.com>
References: <15f3f7e6-5dce-6bbf-30af-7cffbd7bb0c3@oracle.com>

On Tue, Mar 19, 2019 at 10:31 PM Subhra Mazumdar wrote:
> On 3/18/19 8:41 AM, Julien Desfossez wrote:
> > The case where we try to acquire the lock on 2 runqueues belonging to 2
> > different cores requires the rq_lockp wrapper as well, otherwise we
> > frequently deadlock in there.
> >
> > This fixes the crash reported in
> > 1552577311-8218-1-git-send-email-jdesfossez@digitalocean.com
> >
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 76fee56..71bb71f 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2078,7 +2078,7 @@ static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
> >  		raw_spin_lock(rq_lockp(rq1));
> >  		__acquire(rq2->lock);	/* Fake it out ;) */
> >  	} else {
> > -		if (rq1 < rq2) {
> > +		if (rq_lockp(rq1) < rq_lockp(rq2)) {
> >  			raw_spin_lock(rq_lockp(rq1));
> >  			raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
> >  		} else {
>
> With this fix and my previous NULL pointer fix my stress tests are
> surviving. I re-ran my 2 DB instance setup on a 44-core, 2-socket system
> by putting each DB instance in a separate core scheduling group. The
> numbers look much worse now.
>
> users  baseline  %stdev  %idle  core_sched  %stdev  %idle
> 16     1         0.3     66     -73.4%      136.8   82
> 24     1         1.6     54     -95.8%      133.2   81
> 32     1         1.5     42     -97.5%      124.3   89

We are also seeing a performance degradation of about 83% on the throughput
of 2 MySQL VMs under a stress test (12 vcpus, 32GB of RAM). The server has
2 NUMA nodes, each with 18 cores (so a total of 72 hardware threads). Each
MySQL VM is pinned to a different NUMA node. The clients for the stress
tests run on a separate physical machine; each client runs 48 query
threads. Only the MySQL VMs use core scheduling (all vcpus and emulator
threads). Overall the server is 90% idle when the 2 VMs use core
scheduling, and 75% when they don’t.

The rate of preemption vs normal “switch out” is about 1% with and without
core scheduling enabled, but the overall rate of sched_switch is 5 times
higher without core scheduling, which suggests some heavy contention in the
scheduling path.

On further investigation, we could see that the contention is mostly in the
way rq locks are taken. With this patchset, we lock the whole core if
cpu.tag is set for at least one cgroup. Due to this, __schedule() is more
or less serialized for the core, and that accounts for much of the
performance loss we are seeing.
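For reference, this is roughly how we read the wrapper and the fixed
double_rq_lock() after the change quoted above. It is only a sketch to
illustrate the locking rule; the rq->core / __lock field names reflect our
understanding of the patchset, not necessarily the exact code:

static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
#ifdef CONFIG_SCHED_CORE
        /* All SMT siblings of a core share a single lock. */
        if (sched_core_enabled(rq))
                return &rq->core->__lock;
#endif
        return &rq->__lock;
}

static inline void double_rq_lock(struct rq *rq1, struct rq *rq2)
{
        if (rq_lockp(rq1) == rq_lockp(rq2)) {
                /* Same underlying lock (same rq or SMT siblings): take it once. */
                raw_spin_lock(rq_lockp(rq1));
                __acquire(rq2->lock);   /* Fake it out ;) */
        } else if (rq_lockp(rq1) < rq_lockp(rq2)) {
                /* Order by lock address, not rq address, to avoid ABBA. */
                raw_spin_lock(rq_lockp(rq1));
                raw_spin_lock_nested(rq_lockp(rq2), SINGLE_DEPTH_NESTING);
        } else {
                raw_spin_lock(rq_lockp(rq2));
                raw_spin_lock_nested(rq_lockp(rq1), SINGLE_DEPTH_NESTING);
        }
}

The important part is that both the "same lock" test and the ordering are
done on rq_lockp(): comparing the rq pointers is no longer enough once two
runqueues can map to the same lock, which is where the deadlocks we were
seeing came from.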
We also saw that newidle_balance() spends a considerable amount of time in
load_balance() due to the rq spinlock contention. Do you think it would
help if the core-wide locking was only performed when absolutely needed
(a rough sketch of what we mean is appended below)?

In terms of isolation, we measured the time a thread spends co-scheduled
with either a thread from the same group, the idle thread or a thread from
another group. This is what we see for 60 seconds of a specific busy VM
pinned to a whole NUMA node (all its threads):

no core scheduling:
- local neighbors (19.989 % of process runtime)
- idle neighbors (47.197 % of process runtime)
- foreign neighbors (22.811 % of process runtime)

core scheduling enabled:
- local neighbors (6.763 % of process runtime)
- idle neighbors (93.064 % of process runtime)
- foreign neighbors (0.236 % of process runtime)

As a separate test, we tried to pin all the vcpu threads to a set of cores
(6 cores for 12 vcpus):

no core scheduling:
- local neighbors (88.299 % of process runtime)
- idle neighbors (9.334 % of process runtime)
- foreign neighbors (0.197 % of process runtime)

core scheduling enabled:
- local neighbors (84.570 % of process runtime)
- idle neighbors (15.195 % of process runtime)
- foreign neighbors (0.257 % of process runtime)

Thanks,

Julien
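P.S.: to make the "only take the core-wide lock when absolutely needed"
question above a bit more concrete, the direction we have in mind is
roughly the following. This is purely a sketch: the core_tagged_count field
and its maintenance are made up for illustration, and keeping the lock
choice stable while tasks get tagged/untagged is precisely the hard part.

static inline raw_spinlock_t *rq_lockp(struct rq *rq)
{
#ifdef CONFIG_SCHED_CORE
        /*
         * Only fall back to the core-wide lock when at least one tagged
         * task can run on this core, so untagged workloads keep the old
         * per-cpu locking. core_tagged_count would have to be updated
         * under the core-wide lock on enqueue/dequeue of tagged tasks,
         * and switching between the two locks safely is the open issue.
         */
        if (sched_core_enabled(rq) && rq->core->core_tagged_count)
                return &rq->core->__lock;
#endif
        return &rq->__lock;
}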