Date: Wed, 4 May 2022 11:52:42 -0700
From: "Luck, Tony"
To: Thomas Gleixner
Cc: hdegoede@redhat.com, markgross@kernel.org, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, corbet@lwn.net, gregkh@linuxfoundation.org,
	andriy.shevchenko@linux.intel.com, jithu.joseph@intel.com,
	ashok.raj@intel.com, rostedt@goodmis.org, dan.j.williams@intel.com,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	platform-driver-x86@vger.kernel.org, patches@lists.linux.dev,
	ravi.v.shankar@intel.com
Subject: Re: [PATCH v5 07/10] platform/x86/intel/ifs: Add scan test support
References: <20220422200219.2843823-1-tony.luck@intel.com>
	<20220428153849.295779-1-tony.luck@intel.com>
	<20220428153849.295779-8-tony.luck@intel.com>
	<87r159jxaq.ffs@tglx>
In-Reply-To: <87r159jxaq.ffs@tglx>

On Wed, May 04, 2022 at 02:29:33PM +0200, Thomas Gleixner wrote:
> On Thu, Apr 28 2022 at 08:38, Tony Luck wrote:
> > +
> > +	/* wait for the sibling threads to join */
> > +	first = cpumask_first(topology_sibling_cpumask(cpu));
> > +	if (!wait_for_siblings(dev, ifsd, &siblings_in, NSEC_PER_SEC)) {
>
> Waiting for a second with preemption disabled? Seriously?

Probably won't ever wait for a second. Any suggestions for a reasonable
timeout for how long it might take before both threads on a core begin
executing the target code after the pair of:

	queue_work_on(sibling, ifs_wq, &local_work[i].w);

calls that the request to check this core fired off?

Possibly this wait could be dropped, and the slack just folded into the
allowable delay for the threads to execute the ACTIVATE_SCAN (31 bits of
TSC cycles in the value written to the MSR). But a future scan feature
doesn't include that user tuneable value.
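For context, the rendezvous is just an atomic counter plus a bounded spin.
A simplified sketch of the idea (not the actual driver code -- the real
wait_for_siblings() takes the dev/ifsd arguments shown above, and the
timeout value is exactly what is in question here):

	/*
	 * Each sibling announces its arrival, then spins until all
	 * siblings of the core have arrived or the timeout expires.
	 */
	static bool wait_for_siblings(atomic_t *t, int cpu_sibl_ct, u64 timeout_ns)
	{
		u64 start = ktime_get_ns();

		atomic_inc(t);
		while (atomic_read(t) < cpu_sibl_ct) {
			if (ktime_get_ns() - start > timeout_ns)
				return false;	/* a sibling never showed up */
			cpu_relax();
		}

		return true;
	}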
>
> > +		preempt_enable();
> > +		dev_err(dev, "cpu %d sibling did not join rendezvous\n", cpu);
> > +		goto out;
> > +	}
> > +
> > +	activate.start = 0;
> > +	activate.stop = ifsd->valid_chunks - 1;
> > +	timeout = jiffies + HZ / 2;
>
> Plus another half a second with preemption disabled. That's just insane.

Another rounded up value. Experimentally we are seeing the core scan test
take around 50ms. The spec (I know, we haven't published the spec) says
"up to 200ms".

> > +	retries = MAX_IFS_RETRIES;
> > +
> > +	while (activate.start <= activate.stop) {
> > +		if (time_after(jiffies, timeout)) {
> > +			status.error_code = IFS_SW_TIMEOUT;
> > +			break;
> > +		}
> > +
> > +		local_irq_disable();
> > +		wrmsrl(MSR_ACTIVATE_SCAN, activate.data);
> > +		local_irq_enable();
>
> That local_irq_disable() solves what?

An interrupt will stop the currently running "chunk" of the scan. That is
a restartable case. But the concern is that with a high rate of interrupts
the scan may not complete (or even make any forward progress).

> > +		/*
> > +		 * All logical CPUs on this core are now running IFS test. When it completes
> > +		 * execution or is interrupted, the following RDMSR gets the scan status.
> > +		 */
> > +
> > +		rdmsrl(MSR_SCAN_STATUS, status.data);
>
> Wait. Is that rdmsrl() blocking execution until the scan completes?

The comment isn't quite accurate here (my fault). The WRMSR doesn't retire
until the scan stops (either because it completed, or because something
happened to stop it before all chunks were processed). So the two HT
threads will continue after the WRMSR pretty much in lockstep and do the
RDMSR to get the status. Both will see the same status in most cases.

> If so, what's the stall time here? If not, how is the logic below
> supposed to work?

The exact time will depend on how many chunks of the scan were completed,
and how long they took. I see 50 ms total on my current test system.

> > +		/* Some cases can be retried, give up for others */
> > +		if (!can_restart(status))
> > +			break;
> > +
> > +		if (status.chunk_num == activate.start) {
> > +			/* Check for forward progress */
> > +			if (retries-- == 0) {
> > +				if (status.error_code == IFS_NO_ERROR)
> > +					status.error_code = IFS_SW_PARTIAL_COMPLETION;
> > +				break;
> > +			}
> > +		} else {
> > +			retries = MAX_IFS_RETRIES;
> > +			activate.start = status.chunk_num;
> > +		}
> > +	}
> > +
> > +int do_core_test(int cpu, struct device *dev)
> > +{
> > +	struct ifs_work *local_work;
> > +	int sibling;
> > +	int ret = 0;
> > +	int i = 0;
> > +
> > +	if (!scan_enabled)
> > +		return -ENXIO;
> > +
> > +	cpu_hotplug_disable();
>
> Why cpu_hotplug_disable()? Why is cpus_read_lock() not sufficient here?

May be left over from an earlier version. I'll check that cpus_read_lock()
is enough.

> > +	if (!cpu_online(cpu)) {
> > +		dev_info(dev, "cannot test on the offline cpu %d\n", cpu);
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	reinit_completion(&test_thread_done);
> > +	atomic_set(&siblings_in, 0);
> > +	atomic_set(&siblings_out, 0);
> > +
> > +	cpu_sibl_ct = cpumask_weight(topology_sibling_cpumask(cpu));
> > +	local_work = kcalloc(cpu_sibl_ct, sizeof(*local_work), GFP_NOWAIT);
>
> Why does this need GFP_NOWAIT?

It doesn't. Will fix.

> > +int ifs_setup_wq(void)
> > +{
> > +	/* Flags are to keep all the sibling cpu worker threads (of a core) in close sync */
>
> I put that into the wishful thinking realm.

Can change to "... to try to keep ..."
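For illustration, the workqueue setup amounts to something like the
following (the flags here are placeholders -- the hunk quoted above does
not show which flags are actually passed):

	static struct workqueue_struct *ifs_wq;

	int ifs_setup_wq(void)
	{
		/*
		 * High priority, CPU intensive: try to keep the sibling
		 * worker threads of a core scheduled closely together.
		 */
		ifs_wq = alloc_workqueue("intel_ifs",
					 WQ_HIGHPRI | WQ_CPU_INTENSIVE, 0);

		return ifs_wq ? 0 : -ENOMEM;
	}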
> Is there anywhere a proper specification of this mechanism? The public
> available MSR list in the SDM is useless.
>
> Without proper documentation it's pretty much impossible to review this
> code and to think about the approach.

Step 1 (at boot or driver load) is loading the scan tests into BIOS
reserved memory. I will fix up the bits where you pointed out problems
there.

Step 2 is the run time test of each core. That requires the near
simultaneous execution of:

	wrmsrl(MSR_ACTIVATE_SCAN, activate.data);

on all HT threads of the core. That is trivial on parts that do not
support HT, or where it is disabled in BIOS. The code above is trying to
achieve this "parallel" execution.

The follow-on:

	rdmsrl(MSR_SCAN_STATUS, status.data);

doesn't have to be synchronized ... but it is handy to do so for the case
where not all chunks were processed and we need to loop back and run
another ACTIVATE_SCAN to continue from the interrupted chunk. In the lab
this seems common ... when scanning all cores, many of them complete all
chunks in a single bite, but several take 2-3 trips around the loop before
completing.

As noted above, I'm seeing a core test take around 50 ms (but the spec
says up to 200 ms). In some environments that doesn't require any special
application or system reconfiguration. It's not much different from
time-slicing when you are running more processes (or guests) than you have
CPUs. So sysadmins in those environments can use this driver to cycle
through cores, testing each in turn, without any extra steps.

You've pointed out that the driver disables preemption for insanely long
amounts of time ... using this driver to test cores on a system running
applications where that is an issue will require additional steps to
migrate latency critical applications to a different core while the test
is in progress, and to re-direct interrupts. That seems well beyond the
scope of what is possible in a driver without all the information about
which workloads are running to pick a new temporary home for processes and
interrupts while the core is being tested.

-Tony
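P.S. Condensed from the hunks quoted above, the step 2 sequence that each
sibling thread executes is roughly the following (a summary, not the
literal patch; the timeout and forward-progress book-keeping are dropped
for brevity):

	/* both HT threads of the core run this at (nearly) the same time */
	activate.start = 0;
	activate.stop = ifsd->valid_chunks - 1;

	while (activate.start <= activate.stop) {
		/* doesn't retire until the scan stops (done or interrupted) */
		wrmsrl(MSR_ACTIVATE_SCAN, activate.data);

		/* both siblings normally read back the same status */
		rdmsrl(MSR_SCAN_STATUS, status.data);

		if (!can_restart(status))
			break;

		/* resume from the chunk where the scan was interrupted */
		activate.start = status.chunk_num;
	}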