* [bug] __blk_mq_run_hw_queue suspicious rcu usage
@ 2019-09-04 21:40 David Rientjes
  2019-09-05  6:06 ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-04 21:40 UTC (permalink / raw)
To: Tom Lendacky, Brijesh Singh, Christoph Hellwig, Jens Axboe, Ming Lei
Cc: Peter Gonda, Jianxiong Gao, linux-kernel

Hi Christoph, Jens, and Ming,

While booting a 5.2 SEV-enabled guest we have encountered the following
WARNING that is followed up by a BUG because we are in atomic context
while trying to call set_memory_decrypted:

WARNING: suspicious RCU usage
5.2.0 #1 Not tainted
-----------------------------
include/linux/rcupdate.h:266 Illegal context switch in RCU read-side
critical section!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by kworker/0:1H/97:
 #0: 0000000016e1b654 ((wq_completion)kblockd){+.+.}, at: process_one_work+0x1b5/0x5e0
 #1: 00000000002674ff ((work_completion)(&(&hctx->run_work)->work)){+.+.}, at: process_one_work+0x1b5/0x5e0
 #2: 00000000addb6aba (rcu_read_lock){....}, at: hctx_lock+0x17/0xe0

stack backtrace:
CPU: 0 PID: 97 Comm: kworker/0:1H Not tainted 5.2.0 #1
Workqueue: kblockd blk_mq_run_work_fn
Call Trace:
 dump_stack+0x67/0x90
 ___might_sleep+0xfb/0x180
 _vm_unmap_aliases+0x3e/0x1f0
 __set_memory_enc_dec+0x7b/0x120
 dma_direct_alloc_pages+0xcc/0x100
 dma_pool_alloc+0xcf/0x1e0
 nvme_queue_rq+0x5fb/0x9f0
 blk_mq_dispatch_rq_list+0x350/0x5a0
 blk_mq_do_dispatch_sched+0x76/0x110
 blk_mq_sched_dispatch_requests+0x119/0x170
 __blk_mq_run_hw_queue+0x6c/0xf0
 process_one_work+0x23b/0x5e0
 worker_thread+0x3d/0x390
 kthread+0x121/0x140
 ret_from_fork+0x27/0x50

hctx_lock() in __blk_mq_run_hw_queue() takes rcu_read_lock or
srcu_read_lock depending on BLK_MQ_F_BLOCKING.  dma_direct_alloc_pages()
can then call set_memory_decrypted() which must be allowed to block.

Any ideas on how to address this?

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-04 21:40 [bug] __blk_mq_run_hw_queue suspicious rcu usage David Rientjes
@ 2019-09-05  6:06 ` Christoph Hellwig
  2019-09-05 22:37   ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-09-05  6:06 UTC (permalink / raw)
To: David Rientjes
Cc: Tom Lendacky, Brijesh Singh, Christoph Hellwig, Jens Axboe,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Wed, Sep 04, 2019 at 02:40:44PM -0700, David Rientjes wrote:
> Hi Christoph, Jens, and Ming,
> 
> While booting a 5.2 SEV-enabled guest we have encountered the following
> WARNING that is followed up by a BUG because we are in atomic context
> while trying to call set_memory_decrypted:

Well, this really is an x86 / DMA API issue, unfortunately.  Drivers
are allowed to do GFP_ATOMIC dma allocations under locks / rcu critical
sections and from interrupts, and it seems like the SEV case can't
handle that.  We have some semi-generic code in kernel/dma that keeps a
fixed-size pool for non-coherent platforms with similar issues, which we
could try to wire up, but I wonder if there is a better way to handle
the issue, so I've added Tom and the x86 maintainers.

Now, independent of that issue, using DMA coherent memory for the nvme
PRPs/SGLs doesn't actually feel very optimal.  We could instead use
normal kmalloc allocations and just sync them to the device and back.
I wonder if we should create some general mempool-like helpers for that.
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-05  6:06 ` Christoph Hellwig
@ 2019-09-05 22:37   ` David Rientjes
  2019-09-16 23:45     ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-05 22:37 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Tom Lendacky, Brijesh Singh, Jens Axboe, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 5 Sep 2019, Christoph Hellwig wrote:

> > Hi Christoph, Jens, and Ming,
> > 
> > While booting a 5.2 SEV-enabled guest we have encountered the following
> > WARNING that is followed up by a BUG because we are in atomic context
> > while trying to call set_memory_decrypted:
> 
> Well, this really is an x86 / DMA API issue, unfortunately.  Drivers
> are allowed to do GFP_ATOMIC dma allocations under locks / rcu critical
> sections and from interrupts, and it seems like the SEV case can't
> handle that.  We have some semi-generic code in kernel/dma that keeps a
> fixed-size pool for non-coherent platforms with similar issues, which we
> could try to wire up, but I wonder if there is a better way to handle
> the issue, so I've added Tom and the x86 maintainers.
> 
> Now, independent of that issue, using DMA coherent memory for the nvme
> PRPs/SGLs doesn't actually feel very optimal.  We could instead use
> normal kmalloc allocations and just sync them to the device and back.
> I wonder if we should create some general mempool-like helpers for that.

Thanks for looking into this.  I assume it's a non-starter to try to
address this in _vm_unmap_aliases() itself, i.e. rely on a purge
spinlock to do all synchronization (or trylock if not forced) for
purge_vmap_area_lazy() rather than only the vmap_area_lock within it.
In other words, no mutex.

If that's the case, and set_memory_encrypted() can't be fixed to not
need to sleep by changing _vm_unmap_aliases() locking, then I assume
dmapool is our only alternative?  I have no idea how large this should
be.
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-05 22:37 ` David Rientjes
@ 2019-09-16 23:45   ` David Rientjes
  2019-09-17 18:23     ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-16 23:45 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Tom Lendacky, Brijesh Singh, Jens Axboe, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 5 Sep 2019, David Rientjes wrote:

> > > Hi Christoph, Jens, and Ming,
> > > 
> > > While booting a 5.2 SEV-enabled guest we have encountered the following
> > > WARNING that is followed up by a BUG because we are in atomic context
> > > while trying to call set_memory_decrypted:
> > 
> > Well, this really is an x86 / DMA API issue, unfortunately.  Drivers
> > are allowed to do GFP_ATOMIC dma allocations under locks / rcu critical
> > sections and from interrupts, and it seems like the SEV case can't
> > handle that.  We have some semi-generic code in kernel/dma that keeps a
> > fixed-size pool for non-coherent platforms with similar issues, which we
> > could try to wire up, but I wonder if there is a better way to handle
> > the issue, so I've added Tom and the x86 maintainers.
> > 
> > Now, independent of that issue, using DMA coherent memory for the nvme
> > PRPs/SGLs doesn't actually feel very optimal.  We could instead use
> > normal kmalloc allocations and just sync them to the device and back.
> > I wonder if we should create some general mempool-like helpers for that.
> 
> Thanks for looking into this.  I assume it's a non-starter to try to
> address this in _vm_unmap_aliases() itself, i.e. rely on a purge
> spinlock to do all synchronization (or trylock if not forced) for
> purge_vmap_area_lazy() rather than only the vmap_area_lock within it.
> In other words, no mutex.
> 
> If that's the case, and set_memory_encrypted() can't be fixed to not
> need to sleep by changing _vm_unmap_aliases() locking, then I assume
> dmapool is our only alternative?  I have no idea how large this should
> be.

Brijesh and Tom, we currently hit this any time we boot an SEV-enabled
Ubuntu 18.04 guest; I assume that guest kernels, especially those of
such major distributions, are expected to work without warnings and
BUGs when certain drivers are enabled.

If the vmap purge lock is to remain a mutex (is there any other reason
that unmapping aliases can block?) then it appears that allocating a
dmapool is the only alternative.  Is this something that you'll be
addressing generically or do we need to get buy-in from the maintainers
of this specific driver?
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-16 23:45 ` David Rientjes
@ 2019-09-17 18:23   ` David Rientjes
  2019-09-17 18:32     ` Jens Axboe
  2019-09-17 18:41     ` Lendacky, Thomas
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2019-09-17 18:23 UTC (permalink / raw)
To: Christoph Hellwig, Keith Busch, Jens Axboe
Cc: Tom Lendacky, Brijesh Singh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On Mon, 16 Sep 2019, David Rientjes wrote:

> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled
> Ubuntu 18.04 guest; I assume that guest kernels, especially those of
> such major distributions, are expected to work without warnings and
> BUGs when certain drivers are enabled.
> 
> If the vmap purge lock is to remain a mutex (is there any other reason
> that unmapping aliases can block?) then it appears that allocating a
> dmapool is the only alternative.  Is this something that you'll be
> addressing generically or do we need to get buy-in from the maintainers
> of this specific driver?

We've found that the following, applied on top of 5.2.14, suppresses
the warnings.

Christoph, Keith, Jens, is this something that we could do for the nvme
driver?  I'll happily propose it formally if it would be acceptable.

Thanks!


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
-		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
+					  BLK_MQ_F_BLOCKING;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
@@ -2262,7 +2263,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.queue_depth = min_t(int, dev->q_depth,
 						BLK_MQ_MAX_DEPTH) - 1;
 		dev->tagset.cmd_size = sizeof(struct nvme_iod);
-		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE |
+				    BLK_MQ_F_BLOCKING;
 		dev->tagset.driver_data = dev;
 
 		ret = blk_mq_alloc_tag_set(&dev->tagset);
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:23 ` David Rientjes
@ 2019-09-17 18:32   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-09-17 18:32 UTC (permalink / raw)
To: David Rientjes, Christoph Hellwig, Keith Busch
Cc: Tom Lendacky, Brijesh Singh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On 9/17/19 12:23 PM, David Rientjes wrote:
> On Mon, 16 Sep 2019, David Rientjes wrote:
> 
>> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled
>> Ubuntu 18.04 guest; I assume that guest kernels, especially those of
>> such major distributions, are expected to work without warnings and
>> BUGs when certain drivers are enabled.
>>
>> If the vmap purge lock is to remain a mutex (is there any other reason
>> that unmapping aliases can block?) then it appears that allocating a
>> dmapool is the only alternative.  Is this something that you'll be
>> addressing generically or do we need to get buy-in from the
>> maintainers of this specific driver?
> 
> We've found that the following, applied on top of 5.2.14, suppresses
> the warnings.
> 
> Christoph, Keith, Jens, is this something that we could do for the nvme
> driver?  I'll happily propose it formally if it would be acceptable.

No, this is not going to be acceptable, I'm afraid.  This tells blk-mq
that the driver always needs blocking context for queueing IO, which
will increase latencies for the cases where we'd otherwise issue IO
directly from the context that queues it.

-- 
Jens Axboe
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:23 ` David Rientjes
  2019-09-17 18:32   ` Jens Axboe
@ 2019-09-17 18:41   ` Lendacky, Thomas
  2019-09-18 13:22     ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Lendacky, Thomas @ 2019-09-17 18:41 UTC (permalink / raw)
To: David Rientjes, Christoph Hellwig, Keith Busch, Jens Axboe
Cc: Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On 9/17/19 1:23 PM, David Rientjes wrote:
> On Mon, 16 Sep 2019, David Rientjes wrote:
> 
>> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled
>> Ubuntu 18.04 guest; I assume that guest kernels, especially those of
>> such major distributions, are expected to work without warnings and
>> BUGs when certain drivers are enabled.
>>
>> If the vmap purge lock is to remain a mutex (is there any other reason
>> that unmapping aliases can block?) then it appears that allocating a
>> dmapool is the only alternative.  Is this something that you'll be
>> addressing generically or do we need to get buy-in from the
>> maintainers of this specific driver?
> 
> We've found that the following, applied on top of 5.2.14, suppresses
> the warnings.
> 
> Christoph, Keith, Jens, is this something that we could do for the nvme
> driver?  I'll happily propose it formally if it would be acceptable.
> 
> Thanks!
> 
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> +					  BLK_MQ_F_BLOCKING;

I think you want to only set BLK_MQ_F_BLOCKING if the DMA is required
to be unencrypted.  Unfortunately, force_dma_unencrypted() can't be
called from a module.  Is there a DMA API that could be called to get
that info?

Thanks,
Tom

>  		dev->admin_tagset.driver_data = dev;
>  
>  		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
> @@ -2262,7 +2263,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
>  		dev->tagset.queue_depth = min_t(int, dev->q_depth,
>  						BLK_MQ_MAX_DEPTH) - 1;
>  		dev->tagset.cmd_size = sizeof(struct nvme_iod);
> -		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> +		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE |
> +				    BLK_MQ_F_BLOCKING;
>  		dev->tagset.driver_data = dev;
>  
>  		ret = blk_mq_alloc_tag_set(&dev->tagset);
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:41 ` Lendacky, Thomas
@ 2019-09-18 13:22   ` Christoph Hellwig
  2019-11-27 22:11     ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-09-18 13:22 UTC (permalink / raw)
To: Lendacky, Thomas
Cc: David Rientjes, Christoph Hellwig, Keith Busch, Jens Axboe,
	Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On Tue, Sep 17, 2019 at 06:41:02PM +0000, Lendacky, Thomas wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> >  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> >  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> > -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> > +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> > +					  BLK_MQ_F_BLOCKING;
> 
> I think you want to only set BLK_MQ_F_BLOCKING if the DMA is required
> to be unencrypted.  Unfortunately, force_dma_unencrypted() can't be
> called from a module.  Is there a DMA API that could be called to get
> that info?

The DMA API must support non-blocking calls, and various drivers rely
on that.  So we need to provide that even for the SEV case.  If the
actual blocking can't be made to work we'll need to wire up the DMA
pool in kernel/dma/remap.c for it (and probably move it to a separate
file).
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-18 13:22 ` Christoph Hellwig
@ 2019-11-27 22:11   ` David Rientjes
  2019-11-28  6:40     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-11-27 22:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Wed, 18 Sep 2019, Christoph Hellwig wrote:

> On Tue, Sep 17, 2019 at 06:41:02PM +0000, Lendacky, Thomas wrote:
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > >  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> > >  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> > >  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> > > -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> > > +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> > > +					  BLK_MQ_F_BLOCKING;
> > 
> > I think you want to only set BLK_MQ_F_BLOCKING if the DMA is required
> > to be unencrypted.  Unfortunately, force_dma_unencrypted() can't be
> > called from a module.  Is there a DMA API that could be called to get
> > that info?
> 
> The DMA API must support non-blocking calls, and various drivers rely
> on that.  So we need to provide that even for the SEV case.  If the
> actual blocking can't be made to work we'll need to wire up the DMA
> pool in kernel/dma/remap.c for it (and probably move it to a separate
> file).

Resurrecting this thread from a couple of months ago because it appears
that this is still an issue with 5.4 guests.

dma_pool_alloc(), regardless of whether mem_flags allows blocking or
not, can always sleep if the device's DMA must be unencrypted and
mem_encrypt_active() == true.  We know this because vm_unmap_aliases()
can always block.

NVMe's setup of PRPs and SGLs uses dma_pool_alloc(GFP_ATOMIC), but when
this is an SEV-enabled guest this allocation may block due to the
possibility of allocating DMA coherent memory through
dma_direct_alloc().

One option would be to use BLK_MQ_F_BLOCKING when
force_dma_unencrypted() is true for the device, but that adds
significant latency and comes with other downsides.

So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be atomic
even when the DMA needs to be unencrypted for SEV.  Christoph's
suggestion was to wire up dmapool in kernel/dma/remap.c for this.  Is
that necessary to be done for all devices that need to do
dma_pool_alloc(GFP_ATOMIC) or can we do it within the DMA API itself so
it's transparent to the driver?

Thomas/Brijesh: separately, it seems the use of set_memory_encrypted()
or set_memory_decrypted() must be possible without blocking; is this
only an issue from the DMA API point of view or can it be done
elsewhere?
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-11-27 22:11 ` David Rientjes
@ 2019-11-28  6:40   ` Christoph Hellwig
  2019-12-13  0:07     ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-11-28  6:40 UTC (permalink / raw)
To: David Rientjes
Cc: Christoph Hellwig, Lendacky, Thomas, Keith Busch, Jens Axboe,
	Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On Wed, Nov 27, 2019 at 02:11:28PM -0800, David Rientjes wrote:
> So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be atomic
> even when the DMA needs to be unencrypted for SEV.  Christoph's
> suggestion was to wire up dmapool in kernel/dma/remap.c for this.  Is
> that necessary to be done for all devices that need to do
> dma_pool_alloc(GFP_ATOMIC) or can we do it within the DMA API itself so
> it's transparent to the driver?

It needs to be transparent to the driver.  Lots of drivers use
GFP_ATOMIC dma allocations, and all of them are broken on SEV setups
currently.
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-11-28  6:40 ` Christoph Hellwig
@ 2019-12-13  0:07   ` David Rientjes
  2019-12-13  9:33     ` David Rientjes
  2019-12-15  5:38     ` David Rientjes
  1 sibling, 2 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-13  0:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 28 Nov 2019, Christoph Hellwig wrote:

> > So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be
> > atomic even when the DMA needs to be unencrypted for SEV.
> > Christoph's suggestion was to wire up dmapool in kernel/dma/remap.c
> > for this.  Is that necessary to be done for all devices that need to
> > do dma_pool_alloc(GFP_ATOMIC) or can we do it within the DMA API
> > itself so it's transparent to the driver?
> 
> It needs to be transparent to the driver.  Lots of drivers use
> GFP_ATOMIC dma allocations, and all of them are broken on SEV setups
> currently.

Not my area, so bear with me.

Since all DMA must be unencrypted in this case, what happens if all
dma_direct_alloc_pages() calls go through the DMA pool in
kernel/dma/remap.c when force_dma_unencrypted(dev) == true, since
_PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this
special pool should likely be a separate dma pool.)

I assume a general depletion of that atomic pool, such that
DEFAULT_DMA_COHERENT_POOL_SIZE becomes insufficient.  I'm not sure what
size any DMA pool wired up for this specific purpose would need to be,
so I assume dynamic resizing is required.

It shouldn't be *that* difficult to supplement kernel/dma/remap.c with
the ability to do background expansion of the atomic pool when nearing
its capacity for this purpose?  I imagine that if we just can't
allocate pages within the DMA mask then that is the only blocker to
dynamic expansion, and we don't oom kill for lowmem.  But perhaps
vm.lowmem_reserve_ratio is good enough protection?

Beyond that, I'm not sure what sizing would be appropriate if this is
to be a generic solution in the DMA API for all devices that may
require unencrypted memory.
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-12-13  0:07 ` David Rientjes
@ 2019-12-13  9:33   ` David Rientjes
  0 siblings, 0 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-13  9:33 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 12 Dec 2019, David Rientjes wrote:

> Since all DMA must be unencrypted in this case, what happens if all
> dma_direct_alloc_pages() calls go through the DMA pool in
> kernel/dma/remap.c when force_dma_unencrypted(dev) == true, since
> _PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this
> special pool should likely be a separate dma pool.)
> 
> I assume a general depletion of that atomic pool, such that
> DEFAULT_DMA_COHERENT_POOL_SIZE becomes insufficient.  I'm not sure
> what size any DMA pool wired up for this specific purpose would need
> to be, so I assume dynamic resizing is required.
> 
> It shouldn't be *that* difficult to supplement kernel/dma/remap.c with
> the ability to do background expansion of the atomic pool when nearing
> its capacity for this purpose?  I imagine that if we just can't
> allocate pages within the DMA mask then that is the only blocker to
> dynamic expansion, and we don't oom kill for lowmem.  But perhaps
> vm.lowmem_reserve_ratio is good enough protection?
> 
> Beyond that, I'm not sure what sizing would be appropriate if this is
> to be a generic solution in the DMA API for all devices that may
> require unencrypted memory.

Secondly, I'm wondering how the DMA pool for atomic allocations
compares with the lowmem reserve for both ZONE_DMA and ZONE_DMA32.
For allocations where the classzone index is one of these zones, the
lowmem reserve is static; we don't account the amount of lowmem
allocated and adjust this for future watermark checks in the page
allocator.  We always guarantee that the reserve is free (absent
depletion of the zone due to GFP_ATOMIC allocations where we fall
below the min watermarks).

If all DMA memory needs to have _PAGE_ENC cleared when the guest is
SEV encrypted, I'm wondering if the entire lowmem reserve could be
designed as a pool of lowmem pages rather than a watermark check.

If implemented as a pool of pages in the page allocator itself, and
today's reserve is static, maybe we could get away with dynamic
resizing based on that static amount?  We could offload the handling
of this reserve to kswapd such that when the pool falls below today's
reserve amount, we dynamically expand, do the necessary unencryption
in blockable context, and add to the pool.  A bonus is that this
provides a high-order lowmem reserve if implemented as per-order
freelists, rather than the current watermark check which provides no
guarantees for any high-order lowmem.

I don't want to distract from the first set of questions in my previous
email because I need an understanding of that anyway, but I'm hoping
Christoph can guide me on why the above wouldn't be an improvement even
for non-encrypted guests.
* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-12-13  0:07 ` David Rientjes
  2019-12-13  9:33   ` David Rientjes
@ 2019-12-15  5:38   ` David Rientjes
  1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-15  5:38 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 12 Dec 2019, David Rientjes wrote:

> Since all DMA must be unencrypted in this case, what happens if all
> dma_direct_alloc_pages() calls go through the DMA pool in
> kernel/dma/remap.c when force_dma_unencrypted(dev) == true, since
> _PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this
> special pool should likely be a separate dma pool.)
> 
> I assume a general depletion of that atomic pool, such that
> DEFAULT_DMA_COHERENT_POOL_SIZE becomes insufficient.  I'm not sure
> what size any DMA pool wired up for this specific purpose would need
> to be, so I assume dynamic resizing is required.
> 
> It shouldn't be *that* difficult to supplement kernel/dma/remap.c with
> the ability to do background expansion of the atomic pool when nearing
> its capacity for this purpose?  I imagine that if we just can't
> allocate pages within the DMA mask then that is the only blocker to
> dynamic expansion, and we don't oom kill for lowmem.  But perhaps
> vm.lowmem_reserve_ratio is good enough protection?
> 
> Beyond that, I'm not sure what sizing would be appropriate if this is
> to be a generic solution in the DMA API for all devices that may
> require unencrypted memory.

Optimizations involving the lowmem reserve ratio aside, is it ok that
CONFIG_AMD_MEM_ENCRYPT develops a dependency on DMA_DIRECT_REMAP
because set_memory_decrypted() must be allowed to block?  If so, we
could allocate from the atomic pool when we can't block and the device
requires unencrypted DMA from dma_direct_alloc_pages().

I assume we need this to be its own atomic pool, specifically for
force_dma_unencrypted() devices, and to check addr_in_gen_pool() for
this new unencrypted pool in dma_direct_free_pages().

I have no idea how large this unencrypted atomic pool should be sized.
We could determine a nice default and grow size for nvme itself, but as
Christoph mentioned, many drivers require non-blockable allocations
that can be run inside an SEV encrypted guest.

A trivial implementation would be to just double the size of the
unencrypted pool when it reaches half capacity, perhaps done with
GFP_KERNEL | __GFP_DMA allocations in a workqueue.  We can reclaim from
ZONE_DMA or ZONE_DMA32 in this context, but when that fails I'm not
sure if it's satisfactory to just fail the dma_pool_alloc() when the
unencrypted pool runs out.

Heuristics can be tweaked, of course, but I want to make sure I'm not
missing anything obvious with this approach before implementing it.
Please let me know, thanks.