Date: Mon, 30 Nov 2020 15:18:58 -0300
From: Marcelo Tosatti
To: Alex Belits
Cc: "cl@linux.com", "pauld@redhat.com", "linux-mm@kvack.org", "tglx@linutronix.de", "willy@infradead.org", "frederic@kernel.org", "akpm@linux-foundation.org", "peterz@infradead.org"
Subject: Re: [EXT] Re: [PATCH] mm: introduce
 sysctl file to flush per-cpu vmstat statistics
Message-ID: <20201130181858.GA5924@fuller.cnet>
References: <20201117162805.GA274911@fuller.cnet>
 <20201117180356.GT29991@casper.infradead.org>
 <20201117202317.GA282679@fuller.cnet>
 <20201127154845.GA9100@fuller.cnet>

On Sat, Nov 28, 2020 at 03:49:38AM +0000, Alex Belits wrote:
>
> On Fri, 2020-11-27 at 12:48 -0300, Marcelo Tosatti wrote:
> > On Fri, Nov 20, 2020 at 06:20:06PM +0000, Christopher Lameter wrote:
> > > On Tue, 17 Nov 2020, Marcelo Tosatti wrote:
> > >
> > > > > So what we would need would be something like a sysctl that puts the
> > > > > system into a quiet state by completing all workqueue items. Idle all
> > > > > subsystems that need it and put the cpu into NOHZ mode.
> > > >
> > > > Are you suggesting that instead of a specific file to control vmstat
> > > > workqueue only, a more generic sysctl could be used?
> > >
> > > Yes. Introduce a sysctl to quiet down the system. Clean caches that will
> > > trigger kernel threads and whatever else is pending on that processor.
> > >
> > > > About NOHZ mode: the CPU should enter NOHZ automatically as soon as
> > > > there is a single thread running, so unclear why that would be needed.
> > >
> > > There are typically pending actions that still trigger interruptions.
> > >
> > > If you would immediately quiet down the system if there is only one thread
> > > runnable then you would compromise system performance through frequent
> > > counter folding and cache cleaning etc.
> >
> > Christopher,
> >
> > Decided to switch to prctl interface, and then it starts
> > to become similar to "task mode isolation" patchset API.
> >
> > In addition to quiescing pending activities on the CPU, it would
> > also be useful to assign a per-task attribute (which is then assigned
> > to a per-CPU attribute), indicating whether that CPU is running
> > an isolated task or not.
>
> This is what task isolation patch now does. Per-task attribute is used
> when dealing with a task (normally current task on its dedicated CPU),
> per-CPU attribute is used when other CPUs are involved (so they don't
> have to chase tasks that are running on other CPUs) or when performing
> low-level operations on entry and exit. Also since this per-CPU
> attribute is only updated from the local CPU, this significantly
> simplifies access to it.
>
> The difficult part of this approach is how to properly handle a
> situation when for whatever reason isolation must be broken by an
> interrupt of whatever origin, and there is no way to avoid it.

Right.

> And, of course, there is a matter of having to clean up all other
> sources of avoidable interrupts.

Two questions:

What is the reason for not allowing _any_ interruption? Because
it might be the case (and it is in the vRAN use cases our customers
target), that some amount of interruption is tolerable. For example,
for one use case 20us (of interruptions) every 1ms of elapsed time
is tolerable.

> Since I first and foremost care about eliminating all disturbances for
> a running userspace task,

Why?

> my approach is to allow disabling everything
> including "unavoidable" synchronization IPIs, and make kernel entry
> procedure recognize that some delayed synchronization is necessary
> while avoiding race conditions.
> As far as I can tell, not everyone
> wants to go that far,

I suppose it depends on how long each interruption takes. Upon the
suggestion from Thomas and Frederic, I've been thinking it should
be possible, with the tracing framework, to record the length of
all interruptions to a given CPU and, every second, check how many
have happened (and whether the sum of interruptions exceeds the
acceptable threshold).

It's not as nice as the task isolation patch (in terms of spotting
the culprit, since one won't get a backtrace on the originating
CPU in case of an IPI, for example), but it should be possible to
verify that the latency threshold is not broken (which is what
the application is interested in, isn't it?).

> and it may make sense to allow "almost isolated
> tasks" that still receive normal interrupts, including IPIs and page
> faults.
>
> While that would be useless for the purposes that task
> isolation patch was developed for, I recognize that some might prefer
> that to be one of the options set by the same prctl call. This still
> remains close enough to the design of task isolation -- same idea of
> something that affects CPU but being tied to a given task (and dying
> with it), same model of handling attributes, etc.
>
> Maybe there can be a mask of what we do and don't want to avoid for the
> task. Say, some may want to only allow page faults or syscalls. Or
> re-enter isolation on breaking without notifying the userspace.

OK, I will try to come up with an interface that allows additional
attributes - please review and let me know if it works for task
isolation patchset.

Can you talk a little about the signal handling part? What are
applications expected to do once isolation is broken?

Thanks!

> Then we may be able to combine those things, or make them separate
> features that can be enabled and disabled, but all tied to a single
> prctl. It will be possible to, say, check which features are
> implemented and then set a mode for the current task.
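
The accounting idea above (record the length of each interruption,
then periodically compare the sum against a threshold) could be
sketched roughly as below. This is only an illustration in plain C,
not code from any patchset: the function names are made up, and the
interruption lengths are assumed to have already been collected
(for example from the tracing framework) into an array of
nanosecond values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale a per-window allowance (e.g. 20us per 1ms)
 * to a longer check period (e.g. one second). Returns 0 if window_ns is 0.
 */
uint64_t scaled_budget_ns(uint64_t allowed_ns, uint64_t window_ns,
                          uint64_t period_ns)
{
    if (window_ns == 0)
        return 0;
    return allowed_ns * (period_ns / window_ns);
}

/*
 * Hypothetical helper: sum the recorded interruption lengths and report
 * whether the total is above the budget for the period they cover.
 */
bool budget_exceeded(const uint64_t *durations_ns, size_t n,
                     uint64_t budget_ns)
{
    uint64_t total_ns = 0;

    assert(durations_ns != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total_ns += durations_ns[i];
    return total_ns > budget_ns;
}
```

With the numbers from the vRAN example, 20us per 1ms scales to 20ms
per one-second period. Note that checking only the one-second sum
is weaker than checking every 1ms window: a single 5ms interruption
would pass the aggregate check while clearly violating the
per-millisecond budget, so a real check would probably also want to
look at each window (or at the maximum single interruption).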
>
> > This per-CPU attribute can be used to, for example, return -EBUSY
> > from ring_buffer_resize() (or any other IPI generating activity
> > which can return an error to userspace).
> >
> > So rather than:
> >
> >   prctl(PR_QUIESCE_CPU)  (current interface, similar to
> >   initial message on the thread but with prctl rather than
> >   sysfs)
> >
> > To be called before real time loop, one would have:
> >
> >   prctl(PR_SET_TASK_ISOLATION, ISOLATION_ENABLE) [1]
> >   real time loop
> >   prctl(PR_SET_TASK_ISOLATION, ISOLATION_DISABLE)
> >
> > (with the attribute also being cleared on task exit).
> >
> > The general description would be:
> >
> > "Set task isolated mode for a given task, returning an error
> > if the task is not pinned to a single CPU.
> >
> > In this mode, the kernel will avoid interruptions to isolated
> > CPUs when possible."
> >
> > Any objections against such an interface ?
> >
> >
> > [1] perhaps a name that does not conflict with "task mode" patchset
> > is a better idea.
> >
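
For illustration, the enable / real time loop / disable sequence
quoted above might look like this from userspace. This is a sketch
under loud assumptions: PR_SET_TASK_ISOLATION, ISOLATION_ENABLE and
ISOLATION_DISABLE are only names proposed in this thread, they are
not in any released kernel header, and the numeric values below are
placeholders picked for the example. On a kernel without such an
interface, prctl() is expected to fail (typically with EINVAL), so
the wrapper returns an error and the loop is not run.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

/* HYPOTHETICAL placeholders: no real kernel defines these values. */
#ifndef PR_SET_TASK_ISOLATION
#define PR_SET_TASK_ISOLATION 0x7fff0001
#endif
#define ISOLATION_DISABLE 0UL
#define ISOLATION_ENABLE  1UL

/* Returns 0 on success, or a negative errno value on failure. */
static int set_task_isolation(unsigned long mode)
{
    if (prctl(PR_SET_TASK_ISOLATION, mode, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/*
 * Enable isolation, run the real time loop, then disable isolation.
 * The loop only runs if enabling succeeded. The caller is assumed to
 * have pinned the task to a single CPU beforehand, since the proposed
 * description says the call fails otherwise.
 */
int run_isolated(void (*rt_loop)(void *), void *arg)
{
    int ret;

    assert(rt_loop != NULL);
    ret = set_task_isolation(ISOLATION_ENABLE);
    if (ret < 0)
        return ret;
    rt_loop(arg);
    return set_task_isolation(ISOLATION_DISABLE);
}

/* Trivial stand-in for a real time loop: counts how often it was run. */
void example_rt_loop(void *arg)
{
    int *calls = arg;

    (*calls)++;
}
```

A caller would use it as run_isolated(example_rt_loop, &calls) and
treat a negative return as "isolation not available or not
granted", falling back to whatever non-isolated behaviour it wants.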