Date: Mon, 1 Feb 2021 15:20:17 -0300
From: Marcelo Tosatti
To: Christoph Lameter
Cc: Alex Belits, "tglx@linutronix.de", "pauld@redhat.com",
 "linux-mm@kvack.org", "frederic@kernel.org", "willy@infradead.org",
 "peterz@infradead.org", "akpm@linux-foundation.org", Juri Lelli,
 Daniel Bristot de Oliveira
Subject: Re: [RFC] tentative prctl task isolation interface
Message-ID: <20210201182017.GA29345@fuller.cnet>
References:
 <87h7p4dwus.fsf@nanos.tec.linutronix.de>
 <12ddb629555590cfd41db5b10854d95c1f154e24.camel@marvell.com>
 <20210113121544.GA16380@fuller.cnet>
 <20210114193430.GA149907@fuller.cnet>
 <3fe6a794-a578-3564-acec-d1f4684abeee@marvell.com>
 <20210121155141.GA11373@fuller.cnet>

On Mon, Feb 01, 2021 at 10:48:18AM +0000, Christoph Lameter wrote:
> On Thu, 21 Jan 2021, Marcelo Tosatti wrote:
>
> > Anyway, trying to improve Christoph's definition:
> >
> > F_ISOL_QUIESCE   -> flush any pending operations that might cause
> >                     the CPU to be interrupted (ex: free's
> >                     per-CPU queues, sync MM statistics
> >                     counters, etc).
> >
> > F_ISOL_ISOLATE   -> inform the kernel that userspace is
> >                     entering isolated mode (see description
> >                     below on "ISOLATION MODES").
> >
> > F_ISOL_UNISOLATE -> inform the kernel that userspace is
> >                     leaving isolated mode.
> >
> > F_ISOL_NOTIFY    -> notification mode of isolation breakage
> >                     modes.
>
> Looks good to me.
>
> > Isolation modes:
> > ---------------
> >
> > There are two main types of isolation modes:
> >
> > - SOFT mode: does not prevent activities which might generate interruptions
> >   (such as CPU hotplug).
> >
> > - HARD mode: prevents all blockable activities that might generate interruptions.
> >   Administrators can override this via /sys.
>
> Yup.
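To make the flag semantics above a bit more concrete, here is a minimal
userspace model of how the four requests might compose. Everything in it
is an assumption made for illustration: the bit values, struct isol_state
and isol_request() are hypothetical, it does not call prctl(), and it is
not an existing or proposed kernel ABI.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the thread only names the flags. */
#define F_ISOL_QUIESCE   (1u << 0)
#define F_ISOL_ISOLATE   (1u << 1)
#define F_ISOL_UNISOLATE (1u << 2)
#define F_ISOL_NOTIFY    (1u << 3)

/* The two isolation modes described above. */
enum isol_mode { ISOL_SOFT, ISOL_HARD };

/* Hypothetical per-task state, for the model only. */
struct isol_state {
    bool isolated;        /* task is currently in isolated mode */
    bool quiesced;        /* pending per-CPU work was flushed */
    bool notify_enabled;  /* breakage notification was configured */
    enum isol_mode mode;
};

/* Model one request. Returns 0 or a negative errno-style value. */
int isol_request(struct isol_state *s, unsigned int flags,
                 enum isol_mode mode)
{
    const unsigned int known = F_ISOL_QUIESCE | F_ISOL_ISOLATE |
                               F_ISOL_UNISOLATE | F_ISOL_NOTIFY;

    assert(s != NULL);
    if (flags == 0 || (flags & ~known))
        return -EINVAL;
    /* Entering and leaving in the same request is contradictory. */
    if ((flags & F_ISOL_ISOLATE) && (flags & F_ISOL_UNISOLATE))
        return -EINVAL;
    if ((flags & F_ISOL_ISOLATE) && s->isolated)
        return -EBUSY;
    if ((flags & F_ISOL_UNISOLATE) && !s->isolated)
        return -EINVAL;

    if (flags & F_ISOL_NOTIFY)
        s->notify_enabled = true;
    if (flags & F_ISOL_QUIESCE)
        s->quiesced = true;   /* stands in for flushing per-CPU work */
    if (flags & F_ISOL_ISOLATE) {
        s->isolated = true;
        s->mode = mode;
    }
    if (flags & F_ISOL_UNISOLATE) {
        s->isolated = false;
        s->quiesced = false;
    }
    return 0;
}
```

In this model a task would typically pass F_ISOL_QUIESCE | F_ISOL_ISOLATE
in one request before its polling loop and F_ISOL_UNISOLATE afterwards.
Whether the real interface should allow combining flags like that is an
open question, not something the definitions above settle.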
>
> >
> > Notifications:
> > -------------
> >
> > Notification mode of isolation breakage can be configured as follows:
> >
> > - None (default): No notification is performed by the kernel on isolation
> >   breakage.
> >
> > - Syslog: Isolation breakage is reported to syslog.
>
>
> - Abort with core dump
>
> This is useful for debugging and for hard core bare metalers that never
> want any interrupts.
>
> One particular issue is page faults. One would have to prefault the
> binary executable functions in order to avoid "interruptions" through page
> faults. Are these proper interruptions of the code? Certainly major faults
> are but minor faults may be ok? Dunno.

mlockall man page:

       Real-time processes that are using mlockall() to prevent delays on
       page faults should reserve enough locked stack pages before entering
       the time-critical section, so that no page fault can be caused by
       function calls. This can be achieved by calling a function that
       allocates a sufficiently large automatic variable (an array) and
       writes to the memory occupied by this array in order to touch these
       stack pages. This way, enough pages will be mapped for the stack and
       can be locked into RAM. The dummy writes ensure that not even
       copy-on-write page faults can occur in the critical section.

> In practice what I have often seen in such apps is that there is a "warm"
> up mode where all critical functions are executed, all important variables
> are touched and dummy I/Os are performed in order to populate the caches
> and prefault all the data. I guess one would run these without isolation
> first and then switch on some sort of isolation mode after warm up. So far
> I think most people relied on the timer interrupt etc etc to be turned off
> after a few secs of just running through a polling loop without any OS
> activities.

Yep.
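For reference, the stack prefault technique the man page describes can be
sketched roughly as below. mlockall() and sysconf() are the real calls;
the 64 KiB size and the function names prefault_stack() and
lock_and_prefault() are assumptions picked for the example, and a real
application would size the array from its own worst-case stack depth.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumed worst-case stack depth for the critical section. */
#define STACK_PREFAULT_BYTES (64 * 1024)

/*
 * Write one byte per page of a large automatic array so that the stack
 * pages are mapped and writable before the critical section starts.
 * Returns the number of pages touched.
 */
size_t prefault_stack(void)
{
    volatile unsigned char dummy[STACK_PREFAULT_BYTES];
    long page = sysconf(_SC_PAGESIZE);
    size_t touched = 0;

    assert(page > 0);
    for (size_t i = 0; i < sizeof(dummy); i += (size_t)page) {
        dummy[i] = 0;   /* dummy write, so no later COW fault either */
        touched++;
    }
    return touched;
}

/*
 * Lock current and future mappings, then prefault the stack.
 * Returns 0 on success, or -1 with errno set if mlockall() fails
 * (for example when RLIMIT_MEMLOCK is too low).
 */
int lock_and_prefault(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        return -1;
    prefault_stack();
    return 0;
}
```

An application would call lock_and_prefault() once during warm up, before
requesting isolation. This only covers the stack; code and data pages
still need the kind of warm-up pass Christoph describes.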
> > > I ended up implementing a manager/helper task that talks to tasks over a
> > > socket (when they are not isolated) and over ring buffers in shared memory
> > > (when they are isolated). While the current implementation is rather
> > > limited, the intention is to delegate to it everything that an isolated
> > > task either can't do at all (like writing logs) or that would be
> > > cumbersome to implement (like monitoring the state of the task,
> > > determining presence of deferred work after the task returned to
> > > userspace), etc.
> >
> > Interesting. Are you considering open-sourcing such a library? Seems like a
> > generic problem.
>
> Well everyone swears by having the right implementation. The people I know
> would not do anything with a socket in such situations. They would only
> use shared memory and direct access to I/O devices via SPDK and DPDK or
> the RDMA subsystem.
>
> > > > Blocking? The app should fail if any deferred actions are triggered as a
> > > > result of syscalls. It would give a warning with _WARN
> > >
> > > There are many supposedly innocent things, nowhere at the scale of CPU
> > > hotplug, that happen in a system and result in synchronization implemented
> > > as an IPI to every online CPU. We should consider them to be an ordinary
> > > occurrence, so there is a choice:
> > >
> > > 1. Ignore them completely and allow them in isolated mode. This will delay
> > > userspace with no indication and no isolation breaking.
> > >
> > > 2. Allow them, and notify userspace afterwards (through vdso or through
> > > userspace helper/manager over shared memory). This may be useful in those
> > > rare situations when the consequences of delay can be mitigated afterwards.
> > >
> > > 3. Make them break isolation, with userspace being notified normally (ex:
> > > with a signal in the current implementation). I guess, can be used if
> > > somehow most of the causes will be eliminated.
> > >
> > > 4. Prevent them from reaching the target CPU and make sure that whatever
> > > synchronization they are intended to cause will happen when the intended
> > > target CPU enters the kernel later. Since we may have to synchronize
> > > things like code modification, some of this synchronization has to happen
> > > very early on kernel entry.
>
>
> Or move the actions to a different victim processor like done with rcu and
> vmstat etc etc.
>
> > >
> > > I am most interested in (4), so this is what was implemented in my version
> > > of the patch (and currently I am trying to achieve completeness and, if
> > > possible, elegance of the implementation).
> > >
> > Agree. (3) will be necessary as intermediate step. The proposed
> > improvement to Christoph's reply, in this thread, separates notification
> > and syscall blockage.
>
> I guess the notification mode will take care of the way we handle these
> interruptions.
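One way to picture option (4) is a per-CPU pending mask: instead of
sending an IPI to an isolated CPU, the requester records the work, and the
target drains the mask first thing on its next kernel entry. The following
is only a userspace model using C11 atomics. The names (struct isol_cpu,
defer_or_ipi(), drain_on_entry(), the WORK_* bits) are made up for
illustration and are not taken from Alex's patch or from the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical kinds of deferrable work. */
#define WORK_TLB_FLUSH   (1u << 0)
#define WORK_VMSTAT_SYNC (1u << 1)
#define WORK_ICACHE_SYNC (1u << 2)   /* e.g. after code modification */

struct isol_cpu {
    atomic_bool isolated;   /* set while the task runs isolated */
    atomic_uint pending;    /* work deferred until next kernel entry */
};

/*
 * Requester side. Returns true if the work was deferred, false if the
 * caller should interrupt the CPU as usual because it is not isolated.
 */
bool defer_or_ipi(struct isol_cpu *cpu, unsigned int work)
{
    assert(cpu != NULL);
    if (!atomic_load_explicit(&cpu->isolated, memory_order_acquire))
        return false;
    atomic_fetch_or_explicit(&cpu->pending, work, memory_order_release);
    return true;
}

/*
 * Target side, meant to run very early on kernel entry. Clears the
 * isolated flag, then atomically takes all pending bits and returns
 * them so the caller can run the matching handlers.
 */
unsigned int drain_on_entry(struct isol_cpu *cpu)
{
    assert(cpu != NULL);
    atomic_store_explicit(&cpu->isolated, false, memory_order_release);
    return atomic_exchange_explicit(&cpu->pending, 0,
                                    memory_order_acq_rel);
}
```

The model deliberately leaves open the window between the isolated check
and the fetch-or in defer_or_ipi(): if the target drains in between, the
recorded work waits for a later entry. A real implementation has to close
that window, which is a large part of why the ordering on kernel entry
matters so much.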