Date: Mon, 1 Feb 2021 04:47:04 -0800
From: Alex Belits
To: Christoph Lameter, Marcelo Tosatti
CC: tglx@linutronix.de, pauld@redhat.com, linux-mm@kvack.org,
    frederic@kernel.org, willy@infradead.org, peterz@infradead.org,
    akpm@linux-foundation.org, Juri Lelli, Daniel Bristot de Oliveira
Subject: Re: [RFC] tentative prctl task isolation interface
References: <20201127154845.GA9100@fuller.cnet>
    <87h7p4dwus.fsf@nanos.tec.linutronix.de>
    <12ddb629555590cfd41db5b10854d95c1f154e24.camel@marvell.com>
    <20210113121544.GA16380@fuller.cnet>
    <20210114193430.GA149907@fuller.cnet>
    <3fe6a794-a578-3564-acec-d1f4684abeee@marvell.com>
    <20210121155141.GA11373@fuller.cnet>

On 2/1/21 02:48, Christoph Lameter wrote:
>> Notifications:
>> -------------
>>
>> Notification mode of isolation breakage can be configured as follows:
>>
>> - None (default): No notification is performed by the kernel on isolation
>> breakage.
>>
>> - Syslog: Isolation breakage is reported to syslog.

Syslog is intended for humans, and isn't useful for processing by
userspace software. Since there are at least some cases when breaking
isolation is unavoidable on startup (a benign race between entering
isolation and an isolation-breaking event, or a register-mapping page
fault), I would rather allow completely automated processing of those
events. The signal interface does that now; however, I think it would
help to associate software-handled events with either a
software-identifiable "cause type" (ex: "scheduling timer" or "page
fault") or a more verbose human-readable "cause description" (ex: IPI
received, and here is the sender CPU's stack dump that led to this IPI
being sent). The former ("cause") may be important for software (for
example, it may want special processing of page faults for device
registers), while the latter ("description") is more useful when it can
be associated with a particular event in userspace without manual
comparison of log timestamps and guesswork.

>
>
> - Abort with core dump

I would use an existing signal interface for that, with a user-defined
signal. The user can choose to handle the signal, ignore it, or let it
kill the task with or without a core dump. Oh, and if the user wants, he
can use ptrace() to delegate this signal to some other process.

>
> This is useful for debugging and for hard core bare metalers that never
> want any interrupts.
>
> One particular issue are page faults. One would have to prefault the
> binary executable functions in order to avoid "interruptions" through page
> faults. Are these proper interruptions of the code?
> Certainly major faults are but minor faults may be ok? Dunno.
>
> In practice what I have often seen in such apps is that there is a "warm"
> up mode where all critical functions are executed, all important variables
> are touched and dummy I/Os are performed in order to populate the caches
> and prefault all the data. I guess one would run these without isolation
> first and then switch on some sort of isolation mode after warm up. So far
> I think most people relied on the timer interrupt etc etc to be turned off
> after a few secs of just running through a polling loop without any OS
> activities.

This is usually done not so much for page preloading as for warming the
cache. There are mlock() and mlockall(), which load and lock pages
explicitly. One exception is device registers -- they may remain
unmapped until accessed. I often see a pattern where an application
enters isolation, calls a low-level library such as ODP, gets a page
fault, leaves and re-enters isolation, and then everything runs
perfectly because everything is mapped. However, in those cases
mlockall() is done before entering isolation, so the regular memory
mapping is already there.

>
>>> I ended up implementing a manager/helper task that talks to tasks over a
>>> socket (when they are not isolated) and over ring buffers in shared memory
>>> (when they are isolated). While the current implementation is rather
>>> limited, the intention is to delegate to it everything that isolated task
>>> either can't do at all (like, writing logs) or that it would be cumbersome
>>> to implement (like monitoring the state of task, determining presence of
>>> deferred work after the task returned to userspace), etc.
>>
>> Interesting. Are you considering opensourcing such library? Seems like a
>> generic problem.

It's already open source: https://github.com/abelits/libtmc
It still needs some work.
At the moment it does more than I would prefer, because it tries to
detect possible problems such as running timers, and at the same time it
does not provide some obviously useful things, like an asynchronous
interface to arbitrary file I/O. I also want to allow the use of some
generic interface for triggering interrupts from the isolated task to
the manager (through, say, the sacrifice of a single GPIO), so that if
this option is available, the manager won't have to do all that polling.

>
> Well everyone swears on having the right implementation. The people I know
> would not do any thing with a socket in such situations. They would only
> use shared memory and direct access to I/O devices via SPDK and DPDK or
> the RDMA subsystem.
>

Same applies to me. My library uses sockets to communicate only when the
task is not isolated, and sockets will be necessary if we want to have a
dedicated manager process instead of a manager thread in every process.
I would prefer to initiate a connection with the manager through a
socket and, only after that succeeds, assume that I can use a particular
part of shared memory (because success means that the manager allocated
it for me, and no one else will race with me trying to touch it).

>
>>>> Blocking? The app should fail if any deferred actions are triggered as a
>>>> result of syscalls. It would give a warning with _WARN
>>>
>>> There are many supposedly innocent things, nowhere at the scale of CPU
>>> hotplug, that happen in a system and result in synchronization implemented
>>> as an IPI to every online CPU. We should consider them to be an ordinary
>>> occurrence, so there is a choice:
>>>
>>> 1. Ignore them completely and allow them in isolated mode. This will delay
>>> userspace with no indication and no isolation breaking.
>>>
>>> 2. Allow them, and notify userspace afterwards (through vdso or through
>>> userspace helper/manager over shared memory). This may be useful in those
>>> rare situations when the consequences of delay can be mitigated afterwards.
>>>
>>> 3. Make them break isolation, with userspace being notified normally (ex:
>>> with a signal in the current implementation). I guess, can be used if
>>> somehow most of the causes will be eliminated.
>>>
>>> 4. Prevent them from reaching the target CPU and make sure that whatever
>>> synchronization they are intended to cause, will happen when intended target
>>> CPU will enter to kernel later. Since we may have to synchronize things like
>>> code modification, some of this synchronization has to happen very early on
>>> kernel entry.
>
>
> Or move the actions to a different victim processor like done with rcu and
> vmstat etc etc. If possible.

Most of those things can be moved to other CPUs when entering
isolation, or not allowed on CPUs intended for isolation in the first
place (which is how it's mostly done now). The troublesome sources of
interruption are things that are legitimately supposed to be done on all
CPUs at once to synchronize some important kind of state, and now we
want to delay them on some CPUs until the end of isolation.

>>>
>>> I am most interested in (4), so this is what was implemented in my version
>>> of the patch (and currently I am trying to achieve completeness and, if
>>> possible, elegance of the implementation).
>>
>> Agree. (3) will be necessary as intermediate step. The proposed
>> improvement to Christoph's reply, in this thread, separates notification
>> and syscall blockage.
>
> I guess the notification mode will take care of the way we handle these
> interruptions.
>

I think development should go in parallel: a "delayed synchronization
on entry" mechanism that allows "no-interruption mode" (4) to work once
all interruptions are dealt with (it won't work perfectly at first,
because there are still "unprocessed" sources of interruptions), and a
notification mechanism that will allow us to find those sources and
properly process them as (3), so we can exclude them and allow (4).
Since (4) still requires somewhat intrusive architecture-specific
changes, there may be some time when (4) will be available only on some
CPUs, but (3) will work on everything.

-- 
Alex
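
P.S. A minimal sketch of the shared-memory ring buffer path discussed above (the isolated task writes, the manager polls), assuming exactly one producer and one consumer. This is not the libtmc code: `struct ring`, `ring_init()`, `ring_push()`, `ring_pop()` and `RING_SLOTS` are hypothetical names chosen for illustration, and the structure is assumed to live in a region the manager set up and handed over after the socket handshake succeeded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration only; not taken from libtmc. */
#define RING_SLOTS 8u                /* must be a power of two */

struct ring {
    _Atomic uint32_t head;           /* next slot to write; advanced only by the producer */
    _Atomic uint32_t tail;           /* next slot to read; advanced only by the consumer */
    uint64_t slot[RING_SLOTS];       /* fixed-size messages */
};

/* Called once by whoever owns the region, before either side uses it. */
static void ring_init(struct ring *r)
{
    assert((RING_SLOTS & (RING_SLOTS - 1u)) == 0u);
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side (isolated task): returns false instead of waiting when full. */
static bool ring_push(struct ring *r, uint64_t msg)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SLOTS)
        return false;                /* ring is full */
    r->slot[head & (RING_SLOTS - 1u)] = msg;
    /* Release: the slot contents become visible before the new head does. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side (manager): returns false when there is nothing to read. */
static bool ring_pop(struct ring *r, uint64_t *msg)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return false;                /* ring is empty */
    *msg = r->slot[tail & (RING_SLOTS - 1u)];
    /* Release: the slot is free for reuse only after it has been read. */
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

In this sketch the isolated side only calls ring_push() and treats a full ring as a message to retry or drop, so on platforms where 32-bit atomics are lock-free it does not need a system call; the manager drains the ring with ring_pop() from its polling loop.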