linux-kernel.vger.kernel.org archive mirror
* [bug] __blk_mq_run_hw_queue suspicious rcu usage
@ 2019-09-04 21:40 David Rientjes
  2019-09-05  6:06 ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-04 21:40 UTC (permalink / raw)
  To: Tom Lendacky, Brijesh Singh, Christoph Hellwig, Jens Axboe, Ming Lei
  Cc: Peter Gonda, Jianxiong Gao, linux-kernel

Hi Christoph, Jens, and Ming,

While booting a 5.2 SEV-enabled guest, we have encountered the following 
WARNING, which is followed by a BUG because we are in atomic context 
while trying to call set_memory_decrypted():

 WARNING: suspicious RCU usage
 5.2.0 #1 Not tainted
 -----------------------------
 include/linux/rcupdate.h:266 Illegal context switch in RCU read-side critical section!
 
 other info that might help us debug this:
 
 
 rcu_scheduler_active = 2, debug_locks = 1
 3 locks held by kworker/0:1H/97:
  #0: 0000000016e1b654 ((wq_completion)kblockd){+.+.}, at: process_one_work+0x1b5/0x5e0
  #1: 00000000002674ff ((work_completion)(&(&hctx->run_work)->work)){+.+.}, at: process_one_work+0x1b5/0x5e0
  #2: 00000000addb6aba (rcu_read_lock){....}, at: hctx_lock+0x17/0xe0
 
 stack backtrace:
 CPU: 0 PID: 97 Comm: kworker/0:1H Not tainted 5.2.0 #1
 Workqueue: kblockd blk_mq_run_work_fn
 Call Trace:
  dump_stack+0x67/0x90
  ___might_sleep+0xfb/0x180
  _vm_unmap_aliases+0x3e/0x1f0
  __set_memory_enc_dec+0x7b/0x120
  dma_direct_alloc_pages+0xcc/0x100
  dma_pool_alloc+0xcf/0x1e0
  nvme_queue_rq+0x5fb/0x9f0
  blk_mq_dispatch_rq_list+0x350/0x5a0
  blk_mq_do_dispatch_sched+0x76/0x110
  blk_mq_sched_dispatch_requests+0x119/0x170
  __blk_mq_run_hw_queue+0x6c/0xf0
  process_one_work+0x23b/0x5e0
  worker_thread+0x3d/0x390
  kthread+0x121/0x140
  ret_from_fork+0x27/0x50

hctx_lock() in __blk_mq_run_hw_queue() takes rcu_read_lock() or 
srcu_read_lock() depending on BLK_MQ_F_BLOCKING.

dma_direct_alloc_pages() can then call set_memory_decrypted() which must 
be allowed to block.
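
For reference, hctx_lock() in 5.2 looks roughly like the following 
(paraphrased sketch from memory, not a verbatim copy of block/blk-mq.c):

```c
static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
{
	if (!(hctx->flags & BLK_MQ_F_BLOCKING)) {
		/* non-blocking hctx: ->queue_rq() runs under
		 * rcu_read_lock() and therefore must not sleep */
		*srcu_idx = 0;
		rcu_read_lock();
	} else {
		/* BLK_MQ_F_BLOCKING: sleepable SRCU section instead */
		*srcu_idx = srcu_read_lock(hctx->srcu);
	}
}
```

So for nvme, which does not set BLK_MQ_F_BLOCKING, everything under 
nvme_queue_rq() runs inside rcu_read_lock().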

Any ideas on how to address this?


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-04 21:40 [bug] __blk_mq_run_hw_queue suspicious rcu usage David Rientjes
@ 2019-09-05  6:06 ` Christoph Hellwig
  2019-09-05 22:37   ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-09-05  6:06 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tom Lendacky, Brijesh Singh, Christoph Hellwig, Jens Axboe,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Wed, Sep 04, 2019 at 02:40:44PM -0700, David Rientjes wrote:
> Hi Christoph, Jens, and Ming,
> 
> While booting a 5.2 SEV-enabled guest we have encountered the following 
> WARNING that is followed up by a BUG because we are in atomic context 
> while trying to call set_memory_decrypted:

Well, this really is a x86 / DMA API issue unfortunately.  Drivers
are allowed to do GFP_ATOMIC dma allocation under locks / rcu critical
sections and from interrupts.  And it seems like the SEV case can't
handle that.  We have some semi-generic code to have a fixed sized
pool in kernel/dma for non-coherent platforms that have similar issues
that we could try to wire up, but I wonder if there is a better way
to handle the issue, so I've added Tom and the x86 maintainers.

Now independent of that issue, using DMA coherent memory for the nvme
PRPs/SGLs doesn't actually feel very optimal.  We could do with
normal kmalloc allocations and just sync them to the device and back.
I wonder if we should create some general mempool-like helpers for that.
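
To illustrate what that could look like (a hand-written sketch only, not 
a tested proposal; error handling omitted), using the existing streaming 
DMA API instead of a coherent dma_pool:

```c
/* Sketch: allocate the PRP list from normal kernel memory and use
 * the streaming DMA API rather than dma_pool_alloc() of coherent
 * memory, so no encryption attribute change is needed. */
__le64 *prp_list = kmalloc(PAGE_SIZE, GFP_ATOMIC);
dma_addr_t prp_dma = dma_map_single(dev->dev, prp_list, PAGE_SIZE,
				    DMA_TO_DEVICE);

/* ... fill in the PRP entries ... */

/* make the CPU writes visible to the device */
dma_sync_single_for_device(dev->dev, prp_dma, PAGE_SIZE, DMA_TO_DEVICE);

/* hand prp_dma to the device; after command completion: */
dma_unmap_single(dev->dev, prp_dma, PAGE_SIZE, DMA_TO_DEVICE);
kfree(prp_list);
```

A mempool-like helper would presumably cache such mappings rather than 
map/unmap per command.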


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-05  6:06 ` Christoph Hellwig
@ 2019-09-05 22:37   ` David Rientjes
  2019-09-16 23:45     ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-05 22:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tom Lendacky, Brijesh Singh, Jens Axboe, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 5 Sep 2019, Christoph Hellwig wrote:

> > Hi Christoph, Jens, and Ming,
> > 
> > While booting a 5.2 SEV-enabled guest we have encountered the following 
> > WARNING that is followed up by a BUG because we are in atomic context 
> > while trying to call set_memory_decrypted:
> 
> Well, this really is a x86 / DMA API issue unfortunately.  Drivers
> are allowed to do GFP_ATOMIC dma allocation under locks / rcu critical
> sections and from interrupts.  And it seems like the SEV case can't
> handle that.  We have some semi-generic code to have a fixed sized
> pool in kernel/dma for non-coherent platforms that have similar issues
> that we could try to wire up, but I wonder if there is a better way
> to handle the issue, so I've added Tom and the x86 maintainers.
> 
> Now independent of that issue, using DMA coherent memory for the nvme
> PRPs/SGLs doesn't actually feel very optimal.  We could do with
> normal kmalloc allocations and just sync them to the device and back.
> I wonder if we should create some general mempool-like helpers for that.
> 

Thanks for looking into this.  I assume it's a non-starter to try to 
address this in _vm_unmap_aliases() itself, i.e. rely on a purge spinlock 
to do all synchronization (or trylock if not forced) for 
purge_vmap_area_lazy() rather than only the vmap_area_lock within it.  In 
other words, no mutex.

If that's the case, and set_memory_encrypted() can't be fixed to not need 
to sleep by changing _vm_unmap_aliases() locking, then I assume a dmapool 
is our only alternative?  I have no idea how large this should be.


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-05 22:37   ` David Rientjes
@ 2019-09-16 23:45     ` David Rientjes
  2019-09-17 18:23       ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-09-16 23:45 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Tom Lendacky, Brijesh Singh, Jens Axboe, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 5 Sep 2019, David Rientjes wrote:

> > > Hi Christoph, Jens, and Ming,
> > > 
> > > While booting a 5.2 SEV-enabled guest we have encountered the following 
> > > WARNING that is followed up by a BUG because we are in atomic context 
> > > while trying to call set_memory_decrypted:
> > 
> > Well, this really is a x86 / DMA API issue unfortunately.  Drivers
> > are allowed to do GFP_ATOMIC dma allocation under locks / rcu critical
> > sections and from interrupts.  And it seems like the SEV case can't
> > handle that.  We have some semi-generic code to have a fixed sized
> > pool in kernel/dma for non-coherent platforms that have similar issues
> > that we could try to wire up, but I wonder if there is a better way
> > to handle the issue, so I've added Tom and the x86 maintainers.
> > 
> > Now independent of that issue, using DMA coherent memory for the nvme
> > PRPs/SGLs doesn't actually feel very optimal.  We could do with
> > normal kmalloc allocations and just sync them to the device and back.
> > I wonder if we should create some general mempool-like helpers for that.
> > 
> 
> Thanks for looking into this.  I assume it's a non-starter to try to 
> address this in _vm_unmap_aliases() itself, i.e. rely on a purge spinlock 
> to do all synchronization (or trylock if not forced) for 
> purge_vmap_area_lazy() rather than only the vmap_area_lock within it.  In 
> other words, no mutex.
> 
> If that's the case, and set_memory_encrypted() can't be fixed to not need 
> to sleep by changing _vm_unmap_aliases() locking, then I assume a dmapool 
> is our only alternative?  I have no idea how large this should be.
> 

Brijesh and Tom, we currently hit this any time we boot an SEV-enabled 
Ubuntu 18.04 guest; I assume that guest kernels, especially those of such 
major distributions, are expected to work without warnings and BUGs when 
certain drivers are enabled.

If the vmap purge lock is to remain a mutex (any other reason that 
unmapping aliases can block?) then it appears that allocating a dmapool 
is the only alternative.  Is this something that you'll be addressing 
generically or do we need to get buy-in from the maintainers of this 
specific driver?


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-16 23:45     ` David Rientjes
@ 2019-09-17 18:23       ` David Rientjes
  2019-09-17 18:32         ` Jens Axboe
  2019-09-17 18:41         ` Lendacky, Thomas
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2019-09-17 18:23 UTC (permalink / raw)
  To: Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Tom Lendacky, Brijesh Singh, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On Mon, 16 Sep 2019, David Rientjes wrote:

> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled 
> Ubuntu 18.04 guest; I assume that guest kernels, especially those of such 
> major distributions, are expected to work without warnings and BUGs when 
> certain drivers are enabled.
> 
> If the vmap purge lock is to remain a mutex (any other reason that 
> unmapping aliases can block?) then it appears that allocating a dmapool 
> is the only alternative.  Is this something that you'll be addressing 
> generically or do we need to get buy-in from the maintainers of this 
> specific driver?
> 

We've found that the following applied on top of 5.2.14 suppresses the 
warnings.

Christoph, Keith, Jens, is this something that we could do for the nvme 
driver?  I'll happily propose it formally if it would be acceptable.

Thanks!


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
 		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
 		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
 		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
-		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
+		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
+					  BLK_MQ_F_BLOCKING;
 		dev->admin_tagset.driver_data = dev;
 
 		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
@@ -2262,7 +2263,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
 		dev->tagset.queue_depth =
 				min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
 		dev->tagset.cmd_size = sizeof(struct nvme_iod);
-		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
+		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE |
+				    BLK_MQ_F_BLOCKING;
 		dev->tagset.driver_data = dev;
 
 		ret = blk_mq_alloc_tag_set(&dev->tagset);


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:23       ` David Rientjes
@ 2019-09-17 18:32         ` Jens Axboe
  2019-09-17 18:41         ` Lendacky, Thomas
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2019-09-17 18:32 UTC (permalink / raw)
  To: David Rientjes, Christoph Hellwig, Keith Busch
  Cc: Tom Lendacky, Brijesh Singh, Ming Lei, Peter Gonda,
	Jianxiong Gao, linux-kernel, x86, iommu

On 9/17/19 12:23 PM, David Rientjes wrote:
> On Mon, 16 Sep 2019, David Rientjes wrote:
> 
>> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled
>> Ubuntu 18.04 guest; I assume that guest kernels, especially those of such
>> major distributions, are expected to work without warnings and BUGs when
>> certain drivers are enabled.
>>
>> If the vmap purge lock is to remain a mutex (any other reason that
>> unmapping aliases can block?) then it appears that allocating a dmapool
>> is the only alternative.  Is this something that you'll be addressing
>> generically or do we need to get buy-in from the maintainers of this
>> specific driver?
>>
> 
> We've found that the following applied on top of 5.2.14 suppresses the
> warnings.
> 
> Christoph, Keith, Jens, is this something that we could do for the nvme
> driver?  I'll happily propose it formally if it would be acceptable.

No, this is not going to be acceptable, I'm afraid. This tells blk-mq
that the driver always needs blocking context for queueing IO, which
will increase latencies for the cases where we'd otherwise issue IO
directly from the context that queues it.

-- 
Jens Axboe



* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:23       ` David Rientjes
  2019-09-17 18:32         ` Jens Axboe
@ 2019-09-17 18:41         ` Lendacky, Thomas
  2019-09-18 13:22           ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Lendacky, Thomas @ 2019-09-17 18:41 UTC (permalink / raw)
  To: David Rientjes, Christoph Hellwig, Keith Busch, Jens Axboe
  Cc: Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On 9/17/19 1:23 PM, David Rientjes wrote:
> On Mon, 16 Sep 2019, David Rientjes wrote:
> 
>> Brijesh and Tom, we currently hit this any time we boot an SEV-enabled 
>> Ubuntu 18.04 guest; I assume that guest kernels, especially those of such 
>> major distributions, are expected to work without warnings and BUGs when 
>> certain drivers are enabled.
>>
>> If the vmap purge lock is to remain a mutex (any other reason that 
>> unmapping aliases can block?) then it appears that allocating a dmapool 
>> is the only alternative.  Is this something that you'll be addressing 
>> generically or do we need to get buy-in from the maintainers of this 
>> specific driver?
>>
> 
> We've found that the following applied on top of 5.2.14 suppresses the 
> warnings.
> 
> Christoph, Keith, Jens, is this something that we could do for the nvme 
> driver?  I'll happily propose it formally if it would be acceptable.
> 
> Thanks!
> 
> 
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
>  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
>  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
>  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> +					  BLK_MQ_F_BLOCKING;

I think you want to only set the BLK_MQ_F_BLOCKING if the DMA is required
to be unencrypted. Unfortunately, force_dma_unencrypted() can't be called
from a module. Is there a DMA API that could be called to get that info?

Thanks,
Tom

>  		dev->admin_tagset.driver_data = dev;
>  
>  		if (blk_mq_alloc_tag_set(&dev->admin_tagset))
> @@ -2262,7 +2263,8 @@ static int nvme_dev_add(struct nvme_dev *dev)
>  		dev->tagset.queue_depth =
>  				min_t(int, dev->q_depth, BLK_MQ_MAX_DEPTH) - 1;
>  		dev->tagset.cmd_size = sizeof(struct nvme_iod);
> -		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
> +		dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE |
> +				    BLK_MQ_F_BLOCKING;
>  		dev->tagset.driver_data = dev;
>  
>  		ret = blk_mq_alloc_tag_set(&dev->tagset);
> 


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-17 18:41         ` Lendacky, Thomas
@ 2019-09-18 13:22           ` Christoph Hellwig
  2019-11-27 22:11             ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-09-18 13:22 UTC (permalink / raw)
  To: Lendacky, Thomas
  Cc: David Rientjes, Christoph Hellwig, Keith Busch, Jens Axboe,
	Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On Tue, Sep 17, 2019 at 06:41:02PM +0000, Lendacky, Thomas wrote:
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> >  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> >  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> >  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> > -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> > +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> > +					  BLK_MQ_F_BLOCKING;
> 
> I think you want to only set the BLK_MQ_F_BLOCKING if the DMA is required
> to be unencrypted. Unfortunately, force_dma_unencrypted() can't be called
> from a module. Is there a DMA API that could be called to get that info?

The DMA API must support non-blocking calls, and various drivers rely
on that.  So we need to provide that even for the SEV case.  If the
actual blocking can't be made to work, we'll need to wire up the DMA
pool in kernel/dma/remap.c for it (and probably move it to a separate
file).


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-09-18 13:22           ` Christoph Hellwig
@ 2019-11-27 22:11             ` David Rientjes
  2019-11-28  6:40               ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: David Rientjes @ 2019-11-27 22:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Wed, 18 Sep 2019, Christoph Hellwig wrote:

> On Tue, Sep 17, 2019 at 06:41:02PM +0000, Lendacky, Thomas wrote:
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -1613,7 +1613,8 @@ static int nvme_alloc_admin_tags(struct nvme_dev *dev)
> > >  		dev->admin_tagset.timeout = ADMIN_TIMEOUT;
> > >  		dev->admin_tagset.numa_node = dev_to_node(dev->dev);
> > >  		dev->admin_tagset.cmd_size = sizeof(struct nvme_iod);
> > > -		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED;
> > > +		dev->admin_tagset.flags = BLK_MQ_F_NO_SCHED |
> > > +					  BLK_MQ_F_BLOCKING;
> > 
> > I think you want to only set the BLK_MQ_F_BLOCKING if the DMA is required
> > to be unencrypted. Unfortunately, force_dma_unencrypted() can't be called
> > from a module. Is there a DMA API that could be called to get that info?
> 
> The DMA API must support non-blocking calls, and various drivers rely
> on that.  So we need to provide that even for the SEV case.  If the
> actual blocking can't be made to work, we'll need to wire up the DMA
> pool in kernel/dma/remap.c for it (and probably move it to a separate
> file).
> 

Resurrecting this thread from a couple months ago because it appears that 
this is still an issue with 5.4 guests.

dma_pool_alloc(), regardless of whether mem_flags allows blocking, can 
always sleep if the device's DMA must be unencrypted and 
mem_encrypt_active() is true.  We know this because vm_unmap_aliases() can 
always block.

NVMe's setup of PRPs and SGLs uses dma_pool_alloc(GFP_ATOMIC), but when 
this is an SEV-enabled guest the allocation may block due to the 
possibility of allocating DMA coherent memory through dma_direct_alloc().

It seems like one solution would be to set BLK_MQ_F_BLOCKING if 
force_dma_unencrypted() is true for the device, but this adds significant 
latency and comes with other downsides.

So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be atomic 
even when the DMA needs to be unencrypted for SEV.  Christoph's suggestion 
was to wire up dmapool in kernel/dma/remap.c for this.  Is that necessary 
to be done for all devices that need to do dma_pool_alloc(GFP_ATOMIC) or 
can we do it within the DMA API itself so it's transparent to the driver?

Thomas/Brijesh: separately, it seems the use of set_memory_encrypted() or 
set_memory_decrypted() must be possible without blocking; is this only an 
issue from the DMA API point of view or can it be done elsewhere?


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-11-27 22:11             ` David Rientjes
@ 2019-11-28  6:40               ` Christoph Hellwig
  2019-12-13  0:07                 ` David Rientjes
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2019-11-28  6:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Hellwig, Lendacky, Thomas, Keith Busch, Jens Axboe,
	Singh, Brijesh, Ming Lei, Peter Gonda, Jianxiong Gao,
	linux-kernel, x86, iommu

On Wed, Nov 27, 2019 at 02:11:28PM -0800, David Rientjes wrote:
> So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be atomic 
> even when the DMA needs to be unencrypted for SEV.  Christoph's suggestion 
> was to wire up dmapool in kernel/dma/remap.c for this.  Is that necessary 
> to be done for all devices that need to do dma_pool_alloc(GFP_ATOMIC) or 
> can we do it within the DMA API itself so it's transparent to the driver?

It needs to be transparent to the driver.  Lots of drivers use GFP_ATOMIC
dma allocations, and all of them are broken on SEV setups currently.


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-11-28  6:40               ` Christoph Hellwig
@ 2019-12-13  0:07                 ` David Rientjes
  2019-12-13  9:33                   ` David Rientjes
  2019-12-15  5:38                   ` David Rientjes
  0 siblings, 2 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-13  0:07 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 28 Nov 2019, Christoph Hellwig wrote:

> > So we're left with making dma_pool_alloc(GFP_ATOMIC) actually be atomic 
> > even when the DMA needs to be unencrypted for SEV.  Christoph's suggestion 
> > was to wire up dmapool in kernel/dma/remap.c for this.  Is that necessary 
> > to be done for all devices that need to do dma_pool_alloc(GFP_ATOMIC) or 
> > can we do it within the DMA API itself so it's transparent to the driver?
> 
> It needs to be transparent to the driver.  Lots of drivers use GFP_ATOMIC
> dma allocations, and all of them are broken on SEV setups currently.
> 

Not my area, so bear with me.

Since all DMA must be unencrypted in this case, what happens if all 
dma_direct_alloc_pages() calls go through the DMA pool in 
kernel/dma/remap.c when force_dma_unencrypted(dev) == true, since 
_PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this 
special pool should likely be a separate dma pool.)

I assume this would generally deplete that atomic pool, so 
DEFAULT_DMA_COHERENT_POOL_SIZE would become insufficient.  I'm not sure 
what size any DMA pool wired up for this specific purpose would need to 
be, so I assume dynamic resizing is required.

It shouldn't be *that* difficult to supplement kernel/dma/remap.c with the 
ability to do background expansion of the atomic pool as it nears 
capacity.  I imagine that failing to allocate pages within the DMA mask 
is the only blocker to dynamic expansion, and we don't oom kill for 
lowmem.  But perhaps vm.lowmem_reserve_ratio is good enough protection?

Beyond that, I'm not sure what sizing would be appropriate if this is to 
be a generic solution in the DMA API for all devices that may require 
unencrypted memory.


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-12-13  0:07                 ` David Rientjes
@ 2019-12-13  9:33                   ` David Rientjes
  2019-12-15  5:38                   ` David Rientjes
  1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-13  9:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 12 Dec 2019, David Rientjes wrote:

> Since all DMA must be unencrypted in this case, what happens if all 
> dma_direct_alloc_pages() calls go through the DMA pool in 
> kernel/dma/remap.c when force_dma_unencrypted(dev) == true since 
> _PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this 
> special pool should likely be a separate dma pool.)
> 
> I assume a general depletion of that atomic pool so 
> DEFAULT_DMA_COHERENT_POOL_SIZE becomes insufficient.  I'm not sure what 
> size any DMA pool wired up for this specific purpose would need to be 
> sized at, so I assume dynamic resizing is required.
> 
> It shouldn't be *that* difficult to supplement kernel/dma/remap.c with the 
> ability to do background expansion of the atomic pool as it nears 
> capacity.  I imagine that failing to allocate pages within the DMA mask 
> is the only blocker to dynamic expansion, and we don't oom kill for 
> lowmem.  But perhaps vm.lowmem_reserve_ratio is good enough protection?
> 
> Beyond that, I'm not sure what sizing would be appropriate if this is to 
> be a generic solution in the DMA API for all devices that may require 
> unencrypted memory.
> 

Secondly, I'm wondering how the DMA pool for atomic allocations compares 
with the lowmem reserve for both ZONE_DMA and ZONE_DMA32.  For 
allocations where the classzone index is one of these zones, the lowmem 
reserve is static: we don't account the amount of lowmem allocated and 
adjust it for future watermark checks in the page allocator.  We always 
guarantee that the reserve is free (absent depletion of the zone due to 
GFP_ATOMIC allocations where we fall below the min watermarks).

If all DMA memory needs to have _PAGE_ENC cleared when the guest is SEV 
encrypted, I'm wondering if the entire lowmem reserve could be designed as 
a pool of lowmem pages rather than a watermark check.  If implemented as a 
pool of pages in the page allocator itself, and today's reserve is static, 
maybe we could get away with a dynamic resizing based on that static 
amount?  We could offload the handling of this reserve to kswapd such that 
when the pool falls below today's reserve amount, we dynamically expand, 
do the necessary unencryption in blockable context, and add to the pool.  
Bonus is that this provides high-order lowmem reserve if implemented as 
per-order freelists rather than the current watermark check that provides 
no guarantees for any high-order lowmem.

I don't want to distract from the first set of questions in my previous 
email because I need an understanding of that anyway, but I'm hoping 
Christoph can guide me on why the above wouldn't be an improvement even 
for non encrypted guests.


* Re: [bug] __blk_mq_run_hw_queue suspicious rcu usage
  2019-12-13  0:07                 ` David Rientjes
  2019-12-13  9:33                   ` David Rientjes
@ 2019-12-15  5:38                   ` David Rientjes
  1 sibling, 0 replies; 13+ messages in thread
From: David Rientjes @ 2019-12-15  5:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Lendacky, Thomas, Keith Busch, Jens Axboe, Singh, Brijesh,
	Ming Lei, Peter Gonda, Jianxiong Gao, linux-kernel, x86, iommu

On Thu, 12 Dec 2019, David Rientjes wrote:

> Since all DMA must be unencrypted in this case, what happens if all 
> dma_direct_alloc_pages() calls go through the DMA pool in 
> kernel/dma/remap.c when force_dma_unencrypted(dev) == true since 
> _PAGE_ENC is cleared for these ptes?  (Ignoring for a moment that this 
> special pool should likely be a separate dma pool.)
> 
> I assume a general depletion of that atomic pool so 
> DEFAULT_DMA_COHERENT_POOL_SIZE becomes insufficient.  I'm not sure what 
> size any DMA pool wired up for this specific purpose would need to be 
> sized at, so I assume dynamic resizing is required.
> 
> It shouldn't be *that* difficult to supplement kernel/dma/remap.c with the 
> ability to do background expansion of the atomic pool as it nears 
> capacity.  I imagine that failing to allocate pages within the DMA mask 
> is the only blocker to dynamic expansion, and we don't oom kill for 
> lowmem.  But perhaps vm.lowmem_reserve_ratio is good enough protection?
> 
> Beyond that, I'm not sure what sizing would be appropriate if this is to 
> be a generic solution in the DMA API for all devices that may require 
> unencrypted memory.
> 

Optimizations involving lowmem reserve ratio aside, is it ok that 
CONFIG_AMD_MEM_ENCRYPT develops a dependency on DMA_DIRECT_REMAP because 
set_memory_decrypted() must be allowed to block?

If so, we could allocate from the atomic pool when we can't block and the 
device requires unencrypted DMA from dma_direct_alloc_pages().  I assume 
we need this to be its own atomic pool, specifically for 
force_dma_unencrypted() devices, and to check addr_in_gen_pool() against 
this new unencrypted pool in dma_direct_free_pages().
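
Roughly what I have in mind, as a non-runnable sketch (the pool helper 
names are invented for illustration; force_dma_unencrypted() and 
gfpflags_allow_blocking() are existing kernel functions):

```c
/* In dma_direct_alloc_pages(), very roughly: divert non-blockable
 * allocations for unencrypted-DMA devices to a dedicated pool that
 * was decrypted ahead of time in blockable context. */
if (force_dma_unencrypted(dev) && !gfpflags_allow_blocking(gfp))
	return dma_alloc_from_unencrypted_pool(dev, size, dma_handle);
/* ...existing path, which may call set_memory_decrypted()... */

/* And in dma_direct_free_pages(): return pool memory to the pool
 * instead of re-encrypting and freeing it. */
if (force_dma_unencrypted(dev) &&
    dma_in_unencrypted_pool(dev, cpu_addr, size)) {
	dma_free_to_unencrypted_pool(dev, cpu_addr, size);
	return;
}
```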

I have no idea how large this unencrypted atomic pool should be.  We 
could determine a nice default and growth size for nvme itself, but as 
Christoph mentioned, many drivers perform non-blockable allocations that 
can run inside an SEV-encrypted guest.

A trivial implementation would be to just double the size of the 
unencrypted pool when it reaches half capacity, perhaps done with 
GFP_KERNEL | __GFP_DMA allocations in a workqueue.  We can reclaim from 
ZONE_DMA or ZONE_DMA32 in this context, but when that fails I'm not sure 
whether it's satisfactory to just fail the dma_pool_alloc() when the 
unencrypted pool runs out.

Heuristics can be tweaked, of course, but I want to make sure I'm not 
missing anything obvious with this approach before implementing it.  
Please let me know, thanks.

