From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH v2 01/40] iommu: Introduce Shared Virtual Addressing API To: Jean-Philippe Brucker , Jacob Pan Cc: Auger Eric , "linux-arm-kernel@lists.infradead.org" , "linux-pci@vger.kernel.org" , "linux-acpi@vger.kernel.org" , "devicetree@vger.kernel.org" , "iommu@lists.linux-foundation.org" , "kvm@vger.kernel.org" , "linux-mm@kvack.org" , "xieyisheng1@huawei.com" , "liubo95@huawei.com" , "xuzaibo@huawei.com" , "thunder.leizhen@huawei.com" , Will Deacon , "okaya@codeaurora.org" , "yi.l.liu@intel.com" , "ashok.raj@intel.com" , "tn@semihalf.com" , "joro@8bytes.org" , "bharatku@xilinx.com" , "liudongdong3@huawei.com" , "rfranz@cavium.com" , "kevin.tian@intel.com" , "jcrouse@codeaurora.org" , "rgummal@xilinx.com" , "jonathan.cameron@huawei.com" , "shunyong.yang@hxt-semitech.com" , Robin Murphy , "ilias.apalodimas@linaro.org" , "alex.williamson@redhat.com" , "robdclark@gmail.com" , "dwmw2@infradead.org" , "nwatters@codeaurora.org" , "baolu.lu@linux.intel.com" , Michal Hocko References: <20180511190641.23008-1-jean-philippe.brucker@arm.com> <20180511190641.23008-2-jean-philippe.brucker@arm.com> <03d31ba5-1eda-ea86-8c0c-91d14c86fe83@arm.com> <2fd4a0a1-1a35-bf82-df84-b995cce011d9@amd.com> <65e7accd-4446-19f5-c667-c6407e89cfa6@arm.com> <5bbc0332-b94b-75cc-ca42-a9b196811daf@amd.com> <20180907142504.5034351e@jacob-builder> <3e3a6797-a233-911b-3a7d-d9b549160960@amd.com> <9445a0be-fb5b-d195-4fdf-7ad6cb36ef4f@arm.com> From: =?UTF-8?Q?Christian_K=c3=b6nig?= Message-ID: <4d68da96-0ad5-b412-5987-2f7a6aa796c3@amd.com> Date: Wed, 12 Sep 2018 14:56:00 +0200 MIME-Version: 1.0 In-Reply-To: <9445a0be-fb5b-d195-4fdf-7ad6cb36ef4f@arm.com> Content-Type: text/plain; charset=utf-8; format=flowed Sender: linux-acpi-owner@vger.kernel.org List-ID: Am 12.09.2018 um 14:40 schrieb Jean-Philippe Brucker: > On 08/09/2018 08:29, Christian König wrote: >> Yes, exactly. I just need a PASID which is never used by the OS for a >> process and we can easily give that back when the last FD reference is >> closed. > Alright, iommu-sva can get its PASID from this external allocator as > well, as long as it has an interface similar to idr. Where would it go, > drivers/base/, mm/, kernel/...? Good question, my initial instinct was to put it under drivers/pci. But AFAIKS now you are supporting SVA implementations which are not based on PCI. So drivers/base sounds like a good place to me. > >>>>> The process dies, iommu-sva is notified and calls the mm_exit() >>>>> function passed by the device driver to iommu_sva_device_init(). In >>>>> mm_exit() the device driver needs to clear any reference to the >>>>> PASID in hardware and in its own structures. When the device driver >>>>> returns from mm_exit(), it effectively tells the core that it has >>>>> finished using the PASID, and iommu-sva can reuse the PASID for >>>>> another process. mm_exit() is allowed to block, so the device >>>>> driver has time to clean up and flush the queues. >>>>> >>>>> If the device driver finishes using the PASID before the process >>>>> exits, it just calls unbind(). >>>> Exactly that's what Michal Hocko is probably going to not like at all. >>>> >>>> Can we have a different approach where each driver is informed by the >>>> mm_exit(), but needs to explicitly call unbind() before a PASID is >>>> reused? > It's awful from the IOMMU driver perspective. In addition to "enabled" > and "disabled" PASID states, you add "disabled but DMA still running > normally". Between that new state and "disabled", the IOMMU will be > flooded by translation faults (non-recoverable ones), which it needs to > ignore instead of reporting to the kernel. Not all IOMMUs can deal with > this in hardware (SMMU and VT-d can quiesce translation faults > per-PASID, but I don't think AMD IOMMU can.) Some drivers will have to > filter fault events themselves, depending on the PASID state. Puh, yeah that is probably true. Ok let us skip that for a moment, we just need to invest more work in killing DMA operations quickly when the process address space is teared down. >>>> During that teardown transition it would be ideal if that PASID only >>>> points to a dummy root page directory with only invalid entries. >>>> >>> I guess this can be vendor specific, In VT-d I plan to mark PASID >>> entry not present and disable fault reporting while draining remaining >>> activities. >> Sounds good to me. >> >> Point is at least in the case where the process was killed by the OOM >> killer we should not block in mm_exit(). >> >> Instead operations issued by the process to a device driver which uses >> SVA needs to be terminated as soon as possible to make sure that the OOM >> killer can advance. > I don't see how we're preventing the OOM killer from advancing, so I'm > looking for a stronger argument that justifies adding this complexity to > IOMMU drivers. Time limit of the release MMU notifier, locking > requirement, a concrete example where things break, even a comment > somewhere in mm/ would do... > > In my tests I can't manage to disturb the OOM killer, but I could be > missing some special case. Even if the mm_exit() callback > (unrealistically) sleeps for 60 seconds, Well you are *COMPLETELY* under estimating this. A compute task with a huge wave launch can take multiple minutes to tear down. That's why I'm so concerned about that, but to be honest I think that just the hardware needs to become better and we need to be able to block dead tasks from spawning threads again. Regards, Christian. > the OOM killer isn't affected > because oom_reap_task_mm() wipes the victim's address space in another > thread, either before or while the release notifier is running. > > Thanks, > Jean