Date: Fri, 22 Jan 2021 10:05:11 -0300
From: Marcelo Tosatti
To: Alex Belits
Cc: Christoph Lameter, tglx@linutronix.de, pauld@redhat.com,
    linux-mm@kvack.org, frederic@kernel.org, willy@infradead.org,
    peterz@infradead.org, akpm@linux-foundation.org, Juri Lelli,
    Daniel Bristot de Oliveira, Nitesh Narayan Lal
Subject: Re: [RFC] tentative prctl task isolation interface
Message-ID: <20210122130511.GA58675@fuller.cnet>
References: <87h7p4dwus.fsf@nanos.tec.linutronix.de>
    <12ddb629555590cfd41db5b10854d95c1f154e24.camel@marvell.com>
    <20210113121544.GA16380@fuller.cnet>
    <20210114193430.GA149907@fuller.cnet>
    <3fe6a794-a578-3564-acec-d1f4684abeee@marvell.com>
    <20210121155141.GA11373@fuller.cnet>
    <20210121162059.GA18719@fuller.cnet>
In-Reply-To: <20210121162059.GA18719@fuller.cnet>

On Thu, Jan 21, 2021 at 01:20:59PM -0300, Marcelo Tosatti wrote:
>
> Adding Nitesh to CC.
>
> On Thu, Jan 21, 2021 at 12:51:41PM -0300, Marcelo Tosatti wrote:
> > Hi Alex,
> >
> > On Fri, Jan 15, 2021 at 10:35:14AM -0800, Alex Belits wrote:
> > > On 1/15/21 05:24, Christoph Lameter wrote:
> > >
> > > > ----------------------------------------------------------------------
> > > > On Thu, 14 Jan 2021, Marcelo Tosatti wrote:
> > > >
> > > > > > How does one do a oneshot flush of OS activities?
> > > > >
> > > > >     ret = prctl(PR_TASK_ISOLATION_REQUEST, ISOL_F_QUIESCE, 0, 0, 0);
> > > > >     if (ret == -1) {
> > > > >         perror("prctl PR_TASK_ISOLATION_REQUEST");
> > > > >         exit(0);
> > > > >     }
> > > > >
> > > > > >
> > > > > > I.e. I have a polling loop over numerous shared and I/o devices in user
> > > > > > space and I want to make sure that the system is quite before I enter the
> > > > > > loop.
> > > > >
> > > > > You could configure things in two ways: with syscalls allowed or not.
> > > >
> > > > Well syscalls that do not cause deferred processing like getting the time
> > > > or determining the current cpu should be ok to use.
> > >
> > > Some of those syscalls go through vdso, and don't enter the kernel --
> > > nothing specific is necessary to allow them, and it would be pointless and
> > > difficult to prevent them.
> > >
> > > For syscalls that enter the kernel, it's often difficult to predict, if they
> > > will or won't cause deferred processing, so I am afraid, it won't be
> > > possible to provide a "safe" class of syscalls for this purpose and not end
> > > up with something minimal like reading /sys and /proc. Right now isolation
> > > only "allows" syscalls that exit isolation.
> >
> > Christoph wrote:
> >
> > "> Features that I think may be needed:
> > >
> > > F_ISOL_QUIESCE -> quiet down now but allow all OS activities. OS
> > >     activites reset flag
> > >
> > > F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> > >     require such actions in the future.
> > >
> > > F_ISOL_BAREMETAL_WARN -> Similar. Create a warning in the syslog when OS
> > >     services require delayed processing etc
> > >     but continue while resetting the flag.
> > "
> >
> > It seems the only difference between HARD and WARN (lets call it SOFT)
> > would be whether a notification is sent to userspace.
> >
> > The definition
> >
> > "F_ISOL_BAREMETAL_HARD -> No OS interruptions. Fault on syscalls that
> > require such actions in the future."
> >
> > fails in the static_key_enable case: Alex's idea is to queue the i-cache
> > flush if the remote task/cpu is in isolated mode (and perform the flush
> > when entering the kernel).
> >
> > So even if userspace uses syscalls that do not require delayed
> > processing, there are events which are out of control of the
> > application and might require it.
> >
> > So lets assume the application performs a number of syscalls on a
> > given time critical codepath.
> >
> > Either the system is configured so that
> > the number/frequency of static_key_enable's is limited, or the cost of
> > i-cache flushes must be accounted on that critical codepath.
> >
> > Anyway, trying to improve Christoph's definition:
> >
> > F_ISOL_QUIESCE    -> flush any pending operations that might cause
> >                      the CPU to be interrupted (ex: free's
> >                      per-CPU queues, sync MM statistics
> >                      counters, etc).
> >
> > F_ISOL_ISOLATE    -> inform the kernel that userspace is
> >                      entering isolated mode (see description
> >                      below on "ISOLATION MODES").
> >
> > F_ISOL_UNISOLATE  -> inform the kernel that userspace is
> >                      leaving isolated mode.
> >
> > F_ISOL_NOTIFY     -> notification mode of isolation breakage
> >                      modes.
> >
> >
> > Isolation modes:
> > ---------------
> >
> > There are two main types of isolation modes:
> >
> > - SOFT mode: does not prevent activities which might generate interruptions
> >   (such as CPU hotplug).
> >
> > - HARD mode: prevents all blockable activities that might generate interruptions.
> >   Administrators can override this via /sys.
> >
> > Notifications:
> > -------------
> >
> > Notification mode of isolation breakage can be configured as follows:
> >
> > - None (default): No notification is performed by the kernel on isolation
> >   breakage.
> >
> > - Syslog: Isolation breakage is reported to syslog.
> >
> > (new modes can be added, for example signals).
> >
> > A new feature can be added to disallow syscalls (by default syscalls
> > are enabled, with reporting of pending activities that might cause
> > an interruption in a VDSO).

After discussion with Juri and Daniel, it became clearer that supporting
unmodified applications would be quite useful:

- enter isolation mode
- run unmodified application
- leave isolation mode

This could work via an additional mode which goes through the quiesce
operation at every syscall return.
Since this includes freeing per-CPU pagevecs (therefore allocating
per-CPU pagevecs at the next syscall), it might considerably slow down
system startup (and cause MM-related spinlock contention).

Better ideas are appreciated.
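
For illustration only, the sequence discussed in this thread (quiesce,
enter isolation, run the workload, leave isolation) might be wrapped
roughly as below. Everything numeric here is an assumption:
PR_TASK_ISOLATION_REQUEST and ISOL_F_QUIESCE are names from the RFC
snippet, but no mainline header defines them, so the values are
placeholders. ISOL_F_ISOLATE and ISOL_F_UNISOLATE follow the same
spelling for the F_ISOL_ISOLATE / F_ISOL_UNISOLATE requests described
above. The helpers run_isolated(), prctl_req(), record_req() and
noop_work() are hypothetical, and issuing the quiesce before the
isolate request is a guess at the ordering, not something the RFC
specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/* ASSUMPTION: placeholder values; the RFC does not fix any numbers. */
#define PR_TASK_ISOLATION_REQUEST 0x49534f4c
#define ISOL_F_QUIESCE   (1UL << 0)
#define ISOL_F_ISOLATE   (1UL << 1)
#define ISOL_F_UNISOLATE (1UL << 2)

/* A request backend: returns 0 on success, -1 on failure. */
typedef int (*isol_req_fn)(unsigned long flag);

/*
 * Backend that goes through prctl(). On kernels without the proposed
 * interface this is expected to fail (an unknown option normally
 * yields EINVAL).
 */
int prctl_req(unsigned long flag)
{
	int ret = prctl(PR_TASK_ISOLATION_REQUEST, flag, 0UL, 0UL, 0UL);

	if (ret == -1)
		fprintf(stderr, "isolation request 0x%lx: %s\n",
			flag, strerror(errno));
	return ret;
}

/* Recording backend, so the ordering can be checked without a kernel. */
#define ISOL_LOG_MAX 8
unsigned long isol_log[ISOL_LOG_MAX];
int isol_log_len;

int record_req(unsigned long flag)
{
	assert(isol_log_len < ISOL_LOG_MAX);
	isol_log[isol_log_len++] = flag;
	return 0;
}

/* Stand-in for the time-critical section (e.g. a polling loop). */
void noop_work(void)
{
}

/*
 * Quiesce, enter isolated mode, run the workload, leave isolated mode.
 * Stops at the first failing request and returns -1 in that case.
 */
int run_isolated(isol_req_fn req, void (*work)(void))
{
	if (req(ISOL_F_QUIESCE) == -1)
		return -1;
	if (req(ISOL_F_ISOLATE) == -1)
		return -1;
	work();
	return req(ISOL_F_UNISOLATE);
}
```

With kernel support, run_isolated(prctl_req, poll_loop) would be the
real call, where poll_loop is whatever function the application wants
to run isolated; run_isolated(record_req, noop_work) only records the
order of requests.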