* [PATCH v2 0/3] Three bug fixes for tcmu page fault handling
@ 2022-03-23 13:49 Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 1/3] scsi: target: tcmu: Fix possible page UAF Xiaoguang Wang
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-03-23 13:49 UTC (permalink / raw)
  To: linux-scsi, target-devel; +Cc: linux-block, bostroesser

Xiaoguang Wang (3):
  scsi: target: tcmu: Fix possible page UAF
  scsi: target: tcmu: Fix possible data corruption
  scsi: target: tcmu: Use address_space->invalidate_lock

 drivers/target/target_core_user.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

-- 
2.14.4.44.g2045bb6


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/3] scsi: target: tcmu: Fix possible page UAF
  2022-03-23 13:49 [PATCH v2 0/3] Three bug fixes for tcmu page fault handling Xiaoguang Wang
@ 2022-03-23 13:49 ` Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock Xiaoguang Wang
  2 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-03-23 13:49 UTC (permalink / raw)
  To: linux-scsi, target-devel; +Cc: linux-block, bostroesser

tcmu_try_get_data_page() looks up pages under cmdr_lock, but it doesn't
take a refcount properly and just returns the page pointer.

By the time tcmu_try_get_data_page() returns, the returned page may
already have been freed by tcmu_blocks_release(). We need to get_page()
under cmdr_lock to avoid racing with a concurrent tcmu_blocks_release().
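
For clarity, an illustrative interleaving of the race (a sketch based on
the description above, not part of the code change):

/*
 *   fault path                             reclaim path
 *   ----------                             ------------
 *   tcmu_try_get_data_page()
 *     mutex_lock(&udev->cmdr_lock)
 *     page = xa_load(&udev->data_pages, dpi)
 *     mutex_unlock(&udev->cmdr_lock)
 *                                          find_free_blocks()
 *                                            tcmu_blocks_release()
 *                                              ... frees page ...
 *   get_page(page)   <-- use-after-free
 *
 * Taking the reference while still holding cmdr_lock closes the window.
 */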

Reviewed-by: Bodo Stroesser <bostroesser@gmail.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 drivers/target/target_core_user.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 7b2a89a67cdb..06a5c4086551 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1820,6 +1820,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
 	mutex_lock(&udev->cmdr_lock);
 	page = xa_load(&udev->data_pages, dpi);
 	if (likely(page)) {
+		get_page(page);
 		mutex_unlock(&udev->cmdr_lock);
 		return page;
 	}
@@ -1876,6 +1877,7 @@ static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
 		/* For the vmalloc()ed cmd area pages */
 		addr = (void *)(unsigned long)info->mem[mi].addr + offset;
 		page = vmalloc_to_page(addr);
+		get_page(page);
 	} else {
 		uint32_t dpi;
 
@@ -1886,7 +1888,6 @@ static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
 			return VM_FAULT_SIGBUS;
 	}
 
-	get_page(page);
 	vmf->page = page;
 	return 0;
 }
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-03-23 13:49 [PATCH v2 0/3] Three bug fixes for tcmu page fault handling Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 1/3] scsi: target: tcmu: Fix possible page UAF Xiaoguang Wang
@ 2022-03-23 13:49 ` Xiaoguang Wang
  2022-04-01 19:45   ` Bodo Stroesser
  2022-03-23 13:49 ` [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock Xiaoguang Wang
  2 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-03-23 13:49 UTC (permalink / raw)
  To: linux-scsi, target-devel; +Cc: linux-block, bostroesser

When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
run in before the current context completes the page fault procedure and
call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
finishes its job first; the previous page fault procedure then resumes
and completes, and a truncated page has been mapped to user space. Note
that tcmu_vma_fault() has taken a refcount on this page, so no other
subsystem will use this page until the user space addr is later
unmapped.

If another command runs in later and needs to extend dbi_thresh, it may
reuse the previous page's slot in data_bitmap. Though we'll then
allocate a new page for this slot in the data area, no page fault will
happen again, because the stale mapping is still valid, and the real
request's data will be lost.
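
To make the sequence concrete, a sketch of the corruption scenario
(illustrative only; names as in drivers/target/target_core_user.c):

/*
 * 1. tcmu_vma_fault() looks up page P for data block B and returns it;
 *    the mm core has not yet completed the fault.
 * 2. find_free_blocks() unmaps and releases P; block B becomes free.
 * 3. The pending fault completes: user space now maps the stale page P.
 * 4. A later cmd extends dbi_thresh, reuses block B and allocates a new
 *    page P'; the cmd's data is copied into P'.
 * 5. No new fault occurs (the mapping to P is still valid), so user
 *    space reads P instead of P' - the request's data is lost.
 */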

To fix this issue, when extending dbi_thresh, we'll need to call
unmap_mapping_range() to unmap user space mappings of the data area
which may still exist; I think this is a simple method.

Filesystem implementations will also run into this issue, but they
usually lock the page when vm_operations_struct->fault gets one page,
and unlock the page after finish_fault() completes. On the truncate
side, they lock pages in truncate_inode_pages() to protect against
racing with page faults. We can have similar code to filesystems to
fix this issue.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 drivers/target/target_core_user.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 06a5c4086551..9196188504ec 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
 	if (space < cmd->dbi_cnt) {
 		unsigned long blocks_left =
 				(udev->max_blocks - udev->dbi_thresh) + space;
+		loff_t off, len;
 
 		if (blocks_left < cmd->dbi_cnt) {
 			pr_debug("no data space: only %lu available, but ask for %u\n",
@@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
 			return -1;
 		}
 
+		off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size;
+		len = cmd->dbi_cnt * udev->data_blk_size;
+		unmap_mapping_range(udev->inode->i_mapping, off, len, 1);
+
 		udev->dbi_thresh += cmd->dbi_cnt;
 		if (udev->dbi_thresh > udev->max_blocks)
 			udev->dbi_thresh = udev->max_blocks;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock
  2022-03-23 13:49 [PATCH v2 0/3] Three bug fixes for tcmu page fault handling Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 1/3] scsi: target: tcmu: Fix possible page UAF Xiaoguang Wang
  2022-03-23 13:49 ` [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption Xiaoguang Wang
@ 2022-03-23 13:49 ` Xiaoguang Wang
  2022-04-01 19:58   ` Bodo Stroesser
  2 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-03-23 13:49 UTC (permalink / raw)
  To: linux-scsi, target-devel; +Cc: linux-block, bostroesser

Currently tcmu_vma_fault() uses udev->cmdr_lock to avoid racing with
find_free_blocks(), which unmaps idle pages and truncates them. This
work is much like many filesystems' truncate operations, but those use
address_space->invalidate_lock to protect against the race.

This patch replaces cmdr_lock with address_space->invalidate_lock in
the tcmu fault procedure, which will also let page faults run
concurrently.

Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
---
 drivers/target/target_core_user.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
index 9196188504ec..81bfa553cc67 100644
--- a/drivers/target/target_core_user.c
+++ b/drivers/target/target_core_user.c
@@ -1820,13 +1820,14 @@ static int tcmu_find_mem_index(struct vm_area_struct *vma)
 
 static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
 {
+	struct address_space *mapping = udev->inode->i_mapping;
 	struct page *page;
 
-	mutex_lock(&udev->cmdr_lock);
+	filemap_invalidate_lock_shared(mapping);
 	page = xa_load(&udev->data_pages, dpi);
 	if (likely(page)) {
 		get_page(page);
-		mutex_unlock(&udev->cmdr_lock);
+		filemap_invalidate_unlock_shared(mapping);
 		return page;
 	}
 
@@ -1836,7 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
 	 */
 	pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
 	       dpi, udev->name);
-	mutex_unlock(&udev->cmdr_lock);
+	filemap_invalidate_unlock_shared(mapping);
 
 	return NULL;
 }
@@ -3116,6 +3117,7 @@ static void find_free_blocks(void)
 	loff_t off;
 	u32 pages_freed, total_pages_freed = 0;
 	u32 start, end, block, total_blocks_freed = 0;
+	struct address_space *mapping;
 
 	if (atomic_read(&global_page_count) <= tcmu_global_max_pages)
 		return;
@@ -3139,6 +3141,8 @@ static void find_free_blocks(void)
 			continue;
 		}
 
+		mapping = udev->inode->i_mapping;
+		filemap_invalidate_lock(mapping);
 		end = udev->dbi_max + 1;
 		block = find_last_bit(udev->data_bitmap, end);
 		if (block == udev->dbi_max) {
@@ -3146,6 +3150,7 @@ static void find_free_blocks(void)
 			 * The last bit is dbi_max, so it is not possible
 			 * reclaim any blocks.
 			 */
+			filemap_invalidate_unlock(mapping);
 			mutex_unlock(&udev->cmdr_lock);
 			continue;
 		} else if (block == end) {
@@ -3159,10 +3164,11 @@ static void find_free_blocks(void)
 
 		/* Here will truncate the data area from off */
 		off = udev->data_off + (loff_t)start * udev->data_blk_size;
-		unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
+		unmap_mapping_range(mapping, off, 0, 1);
 
 		/* Release the block pages */
 		pages_freed = tcmu_blocks_release(udev, start, end - 1);
+		filemap_invalidate_unlock(mapping);
 		mutex_unlock(&udev->cmdr_lock);
 
 		total_pages_freed += pages_freed;
-- 
2.14.4.44.g2045bb6


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-03-23 13:49 ` [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption Xiaoguang Wang
@ 2022-04-01 19:45   ` Bodo Stroesser
  2022-04-04 16:13     ` Xiaoguang Wang
  2022-04-05 16:03     ` Xiaoguang Wang
  0 siblings, 2 replies; 13+ messages in thread
From: Bodo Stroesser @ 2022-04-01 19:45 UTC (permalink / raw)
  To: Xiaoguang Wang, linux-scsi, target-devel; +Cc: linux-block

On 23.03.22 14:49, Xiaoguang Wang wrote:
> When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
> run in before the current context completes the page fault procedure and
> call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
> finishes its job first; the previous page fault procedure then resumes
> and completes, and a truncated page has been mapped to user space. Note
> that tcmu_vma_fault() has taken a refcount on this page, so no other
> subsystem will use this page until the user space addr is later
> unmapped.
> 
> If another command runs in later and needs to extend dbi_thresh, it may
> reuse the previous page's slot in data_bitmap. Though we'll then
> allocate a new page for this slot in the data area, no page fault will
> happen again, because the stale mapping is still valid, and the real
> request's data will be lost.

I don't think this is a safe fix. It is possible that not only
find_free_blocks runs before the page fault procedure completes, but
also the allocation for the next cmd happens. In that case the new call
to unmap_mapping_range would also happen before the page fault
completes -> data corruption.

AFAIK, no one has ever seen this bug in real life, as
find_free_blocks runs only seldom and userspace would have to access
a data page for the very first time while the cmd that owned this page
has already been completed by userspace. Therefore I think we should
apply only a perfect fix.

I'm wondering whether there really is such a race. If so, couldn't the
same race happen in other drivers or even when truncating mapped files?


> 
> To fix this issue, when extending dbi_thresh, we'll need to call
> unmap_mapping_range() to unmap user space mappings of the data area
> which may still exist; I think this is a simple method.
> 
> Filesystem implementations will also run into this issue, but they
> usually lock the page when vm_operations_struct->fault gets one page,
> and unlock the page after finish_fault() completes. On the truncate
> side, they lock pages in truncate_inode_pages() to protect against
> racing with page faults. We can have similar code to filesystems to
> fix this issue.
> 
> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>   drivers/target/target_core_user.c | 5 +++++
>   1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
> index 06a5c4086551..9196188504ec 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>   	if (space < cmd->dbi_cnt) {
>   		unsigned long blocks_left =
>   				(udev->max_blocks - udev->dbi_thresh) + space;
> +		loff_t off, len;
>   
>   		if (blocks_left < cmd->dbi_cnt) {
>   			pr_debug("no data space: only %lu available, but ask for %u\n",
> @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>   			return -1;
>   		}
>   
> +		off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size;
> +		len = cmd->dbi_cnt * udev->data_blk_size;
> +		unmap_mapping_range(udev->inode->i_mapping, off, len, 1);
> +
>   		udev->dbi_thresh += cmd->dbi_cnt;
>   		if (udev->dbi_thresh > udev->max_blocks)
>   			udev->dbi_thresh = udev->max_blocks;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock
  2022-03-23 13:49 ` [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock Xiaoguang Wang
@ 2022-04-01 19:58   ` Bodo Stroesser
  2022-04-04 15:09     ` Xiaoguang Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Bodo Stroesser @ 2022-04-01 19:58 UTC (permalink / raw)
  To: Xiaoguang Wang, linux-scsi, target-devel; +Cc: linux-block

On 23.03.22 14:49, Xiaoguang Wang wrote:
> Currently tcmu_vma_fault() uses udev->cmdr_lock to avoid racing with
> find_free_blocks(), which unmaps idle pages and truncates them. This
> work is much like many filesystems' truncate operations, but those use
> address_space->invalidate_lock to protect against the race.
> 
> This patch replaces cmdr_lock with address_space->invalidate_lock in
> the tcmu fault procedure, which will also let page faults run
> concurrently.
> 

Page fault happens only once for each allocated data page. After the
page fault was processed, the page is re-used for many cmds without
further page faults.

Of course, if find_free_blocks releases a page and tcmu after that
allocates a new one, another page fault will occur. But that is a
very rare case.

So, what is the advantage of concurrency here?

OTOH, I don't like the switch from tcmu's cmdr_lock to
filemap_invalidate_* without a real technical reason.


> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
> ---
>   drivers/target/target_core_user.c | 14 ++++++++++----
>   1 file changed, 10 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
> index 9196188504ec..81bfa553cc67 100644
> --- a/drivers/target/target_core_user.c
> +++ b/drivers/target/target_core_user.c
> @@ -1820,13 +1820,14 @@ static int tcmu_find_mem_index(struct vm_area_struct *vma)
>   
>   static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>   {
> +	struct address_space *mapping = udev->inode->i_mapping;
>   	struct page *page;
>   
> -	mutex_lock(&udev->cmdr_lock);
> +	filemap_invalidate_lock_shared(mapping);
>   	page = xa_load(&udev->data_pages, dpi);
>   	if (likely(page)) {
>   		get_page(page);
> -		mutex_unlock(&udev->cmdr_lock);
> +		filemap_invalidate_unlock_shared(mapping);
>   		return page;
>   	}
>   
> @@ -1836,7 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>   	 */
>   	pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
>   	       dpi, udev->name);
> -	mutex_unlock(&udev->cmdr_lock);
> +	filemap_invalidate_unlock_shared(mapping);
>   
>   	return NULL;
>   }
> @@ -3116,6 +3117,7 @@ static void find_free_blocks(void)
>   	loff_t off;
>   	u32 pages_freed, total_pages_freed = 0;
>   	u32 start, end, block, total_blocks_freed = 0;
> +	struct address_space *mapping;
>   
>   	if (atomic_read(&global_page_count) <= tcmu_global_max_pages)
>   		return;
> @@ -3139,6 +3141,8 @@ static void find_free_blocks(void)
>   			continue;
>   		}
>   
> +		mapping = udev->inode->i_mapping;
> +		filemap_invalidate_lock(mapping);
>   		end = udev->dbi_max + 1;
>   		block = find_last_bit(udev->data_bitmap, end);
>   		if (block == udev->dbi_max) {
> @@ -3146,6 +3150,7 @@ static void find_free_blocks(void)
>   			 * The last bit is dbi_max, so it is not possible
>   			 * reclaim any blocks.
>   			 */
> +			filemap_invalidate_unlock(mapping);
>   			mutex_unlock(&udev->cmdr_lock);
>   			continue;
>   		} else if (block == end) {
> @@ -3159,10 +3164,11 @@ static void find_free_blocks(void)
>   
>   		/* Here will truncate the data area from off */
>   		off = udev->data_off + (loff_t)start * udev->data_blk_size;
> -		unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
> +		unmap_mapping_range(mapping, off, 0, 1);
>   
>   		/* Release the block pages */
>   		pages_freed = tcmu_blocks_release(udev, start, end - 1);
> +		filemap_invalidate_unlock(mapping);
>   		mutex_unlock(&udev->cmdr_lock);
>   
>   		total_pages_freed += pages_freed;

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock
  2022-04-01 19:58   ` Bodo Stroesser
@ 2022-04-04 15:09     ` Xiaoguang Wang
  2022-04-08 11:34       ` Bodo Stroesser
  0 siblings, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-04-04 15:09 UTC (permalink / raw)
  To: Bodo Stroesser, linux-scsi, target-devel; +Cc: linux-block

hi,

> On 23.03.22 14:49, Xiaoguang Wang wrote:
>> Currently tcmu_vma_fault() uses udev->cmdr_lock to avoid racing with
>> find_free_blocks(), which unmaps idle pages and truncates them. This
>> work is much like many filesystems' truncate operations, but those use
>> address_space->invalidate_lock to protect against the race.
>>
>> This patch replaces cmdr_lock with address_space->invalidate_lock in
>> the tcmu fault procedure, which will also let page faults run
>> concurrently.
>>
>
> Page fault happens only once for each allocated data page. After the
> page fault was processed, the page is re-used for many cmds without
> further page faults.
Oh, right, I had almost forgotten that, thanks.

>
> Of course, if find_free_blocks releases a page and tcmu after that
> allocates a new one, another page fault will occur. But that is a
> very rare case.
Yeah, now I agree it will happen rarely, but it will still happen,
especially in the early stages of page fault establishment, when a tcmu
device is freshly created.

And in the current code, tcmu normally reclaims free pages under these
conditions:
        if (atomic_read(&global_page_count) > tcmu_global_max_pages &&
            xa_empty(&udev->commands) && list_empty(&udev->qfull_queue)) {
                /*
                 * Allocated blocks exceeded global block limit, currently no
                 * more pending or waiting commands so try to reclaim blocks.
                 */
                schedule_delayed_work(&tcmu_unmap_work, 0);
        }

I'm not sure whether this method is good; other subsystems seem to use
the mm shrinker feature to reclaim free pages when system memory is low.
tcmu_global_max_pages is 512k pages, i.e. 2GB of memory, which means
that if tcmu devices really have allocated 2GB of memory, no attempt is
made to free it even when overall system memory is very low.
What I want to express is that though page faults currently occur
rarely, if we use a better memory reclaim method later, page faults may
occur more often.
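
As a rough illustration of the shrinker-based alternative mentioned
above (a minimal sketch under the assumption that the actual freeing
stays delegated to the existing unmap worker; not part of this series):

static unsigned long tcmu_shrink_count(struct shrinker *s,
				       struct shrink_control *sc)
{
	/* Report how many data pages could potentially be reclaimed. */
	return (unsigned long)atomic_read(&global_page_count);
}

static unsigned long tcmu_shrink_scan(struct shrinker *s,
				      struct shrink_control *sc)
{
	/* Delegate the actual freeing to the existing unmap worker. */
	schedule_delayed_work(&tcmu_unmap_work, 0);
	return 0;	/* freed pages get accounted asynchronously */
}

static struct shrinker tcmu_shrinker = {
	.count_objects	= tcmu_shrink_count,
	.scan_objects	= tcmu_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* register_shrinker(&tcmu_shrinker) would be called at module init. */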

>
> So, what is the advantage of concurrency here?
cmdr_lock is a mutex, so it means tcmu page faults cannot happen
concurrently, while the mm subsystem does allow page faults to run
concurrently. If we change to filemap_invalidate_lock_shared() in the
tcmu page fault path, I think there will be two advantages:
1) Though page faults will happen rarely in tcmu if system memory is
sufficient, in the early stages of page fault establishment we'll
eventually have faulted in the whole default 1GB data area. At least in
this stage, page faults will have concurrency.
Also, if a page fault happens at the same time as queue_cmd_ring(),
then for big IO sizes queue_cmd_ring() may hold cmdr_lock for a long
time because of the data area and sgl page copies, and the page fault
will be stalled for a long time, resulting in long tail latency.

2) If we later manage to support zero copy, we'll need to call
zap_page_range() in tcmu_handle_completion(), which needs to hold
mm_struct's mmap_lock, and then we may have an "abba" deadlock:
  page fault: hold mm_struct's mmap_lock, then try to take cmdr_lock
  cmd completion: hold cmdr_lock, then try to take mm_struct's mmap_lock.
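
The cycle, sketched (illustrative only; assumes the zero-copy design
described above):

/*
 *   page fault path                  cmd completion path
 *   ---------------                  -------------------
 *   down_read(&mm->mmap_lock)        mutex_lock(&udev->cmdr_lock)
 *   mutex_lock(&udev->cmdr_lock)     zap_page_range()
 *     -> blocked by completion         -> needs mmap_lock, blocked
 *                                        by the faulting task
 *
 * Not taking cmdr_lock in the fault path (e.g. via invalidate_lock or
 * a dedicated rw_semaphore) breaks the cycle.
 */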

Or we can define a new "struct rw_semaphore i_mmap_sem;" in
tcmu_dev, and use it rather than filemap_invalidate_lock_shared(mapping);

>
> OTOH, I don't like the switch from tcmu's cmdr_lock to
> filemap_invalidate_* without a real technical reason.
Anyway, at least I don't see any obvious disadvantages to this patch.


Regards,
Xiaoguang Wang
>
>
>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>> ---
>>   drivers/target/target_core_user.c | 14 ++++++++++----
>>   1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>> index 9196188504ec..81bfa553cc67 100644
>> --- a/drivers/target/target_core_user.c
>> +++ b/drivers/target/target_core_user.c
>> @@ -1820,13 +1820,14 @@ static int tcmu_find_mem_index(struct vm_area_struct *vma)
>>     static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>   {
>> +    struct address_space *mapping = udev->inode->i_mapping;
>>       struct page *page;
>>   -    mutex_lock(&udev->cmdr_lock);
>> +    filemap_invalidate_lock_shared(mapping);
>>       page = xa_load(&udev->data_pages, dpi);
>>       if (likely(page)) {
>>           get_page(page);
>> -        mutex_unlock(&udev->cmdr_lock);
>> +        filemap_invalidate_unlock_shared(mapping);
>>           return page;
>>       }
>>   @@ -1836,7 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>        */
>>       pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
>>              dpi, udev->name);
>> -    mutex_unlock(&udev->cmdr_lock);
>> +    filemap_invalidate_unlock_shared(mapping);
>>         return NULL;
>>   }
>> @@ -3116,6 +3117,7 @@ static void find_free_blocks(void)
>>       loff_t off;
>>       u32 pages_freed, total_pages_freed = 0;
>>       u32 start, end, block, total_blocks_freed = 0;
>> +    struct address_space *mapping;
>>         if (atomic_read(&global_page_count) <= tcmu_global_max_pages)
>>           return;
>> @@ -3139,6 +3141,8 @@ static void find_free_blocks(void)
>>               continue;
>>           }
>>   +        mapping = udev->inode->i_mapping;
>> +        filemap_invalidate_lock(mapping);
>>           end = udev->dbi_max + 1;
>>           block = find_last_bit(udev->data_bitmap, end);
>>           if (block == udev->dbi_max) {
>> @@ -3146,6 +3150,7 @@ static void find_free_blocks(void)
>>                * The last bit is dbi_max, so it is not possible
>>                * reclaim any blocks.
>>                */
>> +            filemap_invalidate_unlock(mapping);
>>               mutex_unlock(&udev->cmdr_lock);
>>               continue;
>>           } else if (block == end) {
>> @@ -3159,10 +3164,11 @@ static void find_free_blocks(void)
>>             /* Here will truncate the data area from off */
>>           off = udev->data_off + (loff_t)start * udev->data_blk_size;
>> -        unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
>> +        unmap_mapping_range(mapping, off, 0, 1);
>>             /* Release the block pages */
>>           pages_freed = tcmu_blocks_release(udev, start, end - 1);
>> +        filemap_invalidate_unlock(mapping);
>>           mutex_unlock(&udev->cmdr_lock);
>>             total_pages_freed += pages_freed;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-04-01 19:45   ` Bodo Stroesser
@ 2022-04-04 16:13     ` Xiaoguang Wang
  2022-04-05 16:03     ` Xiaoguang Wang
  1 sibling, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-04-04 16:13 UTC (permalink / raw)
  To: Bodo Stroesser, linux-scsi, target-devel; +Cc: linux-block

hi,

> On 23.03.22 14:49, Xiaoguang Wang wrote:
>> When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
>> run in before the current context completes the page fault procedure and
>> call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
>> finishes its job first; the previous page fault procedure then resumes
>> and completes, and a truncated page has been mapped to user space. Note
>> that tcmu_vma_fault() has taken a refcount on this page, so no other
>> subsystem will use this page until the user space addr is later
>> unmapped.
>>
>> If another command runs in later and needs to extend dbi_thresh, it may
>> reuse the previous page's slot in data_bitmap. Though we'll then
>> allocate a new page for this slot in the data area, no page fault will
>> happen again, because the stale mapping is still valid, and the real
>> request's data will be lost.
>
> I don't think this is a safe fix. It is possible that not only
> find_free_blocks runs before the page fault procedure completes, but
> also the allocation for the next cmd happens. In that case the new call
> to unmap_mapping_range would also happen before the page fault
> completes -> data corruption.
You're right, thanks for pointing out this issue.
For your questions below, I need to take some time to read and
understand the code; I'll try to reply tomorrow, sorry.

Regards,
Xiaoguang Wang
>
> AFAIK, no one has ever seen this bug in real life, as
> find_free_blocks runs only seldom and userspace would have to access
> a data page for the very first time while the cmd that owned this page
> has already been completed by userspace. Therefore I think we should
> apply only a perfect fix.
>
> I'm wondering whether there really is such a race. If so, couldn't the
> same race happen in other drivers or even when truncating mapped files?
>
>
>>
>> To fix this issue, when extending dbi_thresh, we'll need to call
>> unmap_mapping_range() to unmap user space mappings of the data area
>> which may still exist; I think this is a simple method.
>>
>> Filesystem implementations will also run into this issue, but they
>> usually lock the page when vm_operations_struct->fault gets one page,
>> and unlock the page after finish_fault() completes. On the truncate
>> side, they lock pages in truncate_inode_pages() to protect against
>> racing with page faults. We can have similar code to filesystems to
>> fix this issue.
>>
>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>> ---
>>   drivers/target/target_core_user.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>> index 06a5c4086551..9196188504ec 100644
>> --- a/drivers/target/target_core_user.c
>> +++ b/drivers/target/target_core_user.c
>> @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>>       if (space < cmd->dbi_cnt) {
>>           unsigned long blocks_left =
>>                   (udev->max_blocks - udev->dbi_thresh) + space;
>> +        loff_t off, len;
>>             if (blocks_left < cmd->dbi_cnt) {
>>               pr_debug("no data space: only %lu available, but ask for %u\n",
>> @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>>               return -1;
>>           }
>>   +        off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size;
>> +        len = cmd->dbi_cnt * udev->data_blk_size;
>> +        unmap_mapping_range(udev->inode->i_mapping, off, len, 1);
>> +
>>           udev->dbi_thresh += cmd->dbi_cnt;
>>           if (udev->dbi_thresh > udev->max_blocks)
>>               udev->dbi_thresh = udev->max_blocks;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-04-01 19:45   ` Bodo Stroesser
  2022-04-04 16:13     ` Xiaoguang Wang
@ 2022-04-05 16:03     ` Xiaoguang Wang
  2022-04-08 10:33       ` Bodo Stroesser
  1 sibling, 1 reply; 13+ messages in thread
From: Xiaoguang Wang @ 2022-04-05 16:03 UTC (permalink / raw)
  To: Bodo Stroesser, linux-scsi, target-devel; +Cc: linux-block

hi,

> On 23.03.22 14:49, Xiaoguang Wang wrote:
>> When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
>> run in before the current context completes the page fault procedure and
>> call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
>> finishes its job first; the previous page fault procedure then resumes
>> and completes, and a truncated page has been mapped to user space. Note
>> that tcmu_vma_fault() has taken a refcount on this page, so no other
>> subsystem will use this page until the user space addr is later
>> unmapped.
>>
>> If another command runs in later and needs to extend dbi_thresh, it may
>> reuse the previous page's slot in data_bitmap. Though we'll then
>> allocate a new page for this slot in the data area, no page fault will
>> happen again, because the stale mapping is still valid, and the real
>> request's data will be lost.
>
> I don't think this is a safe fix. It is possible that not only
> find_free_blocks runs before the page fault procedure completes, but
> also the allocation for the next cmd happens. In that case the new call
> to unmap_mapping_range would also happen before the page fault
> completes -> data corruption.
>
> AFAIK, no one has ever seen this bug in real life, as
Yeah, I know, I just found this possible issue by reading the code :)

> find_free_blocks runs only seldom and userspace would have to access
> a data page for the very first time while the cmd that owned this page
> has already been completed by userspace. Therefore I think we should
> apply only a perfect fix.
>
> I'm wondering whether there really is such a race. If so, couldn't the
> same race happen in other drivers or even when truncating mapped files?
Indeed, I have described how filesystem implementations avoid this issue
in the patch's commit message:

  Filesystem implementations will also run into this issue, but they
  usually lock the page when vm_operations_struct->fault gets one page,
  and unlock the page after finish_fault() completes. On the truncate
  side, they lock pages in truncate_inode_pages() to protect against
  racing with page faults. We can have similar code to filesystems to
  fix this issue.


Take ext4 as an example: a file in ext4 is mapped to user space, and if
a truncate operation then occurs, ext4 calls truncate_pagecache():
void truncate_pagecache(struct inode *inode, loff_t newsize)
{
        struct address_space *mapping = inode->i_mapping;
        loff_t holebegin = round_up(newsize, PAGE_SIZE);

        /*
         * unmap_mapping_range is called twice, first simply for
         * efficiency so that truncate_inode_pages does fewer
         * single-page unmaps.  However after this first call, and
         * before truncate_inode_pages finishes, it is possible for
         * private pages to be COWed, which remain after
         * truncate_inode_pages finishes, hence the second
         * unmap_mapping_range call must be made for correctness.
         */
        unmap_mapping_range(mapping, holebegin, 0, 1);
        truncate_inode_pages(mapping, newsize);
        unmap_mapping_range(mapping, holebegin, 0, 1);
}

In truncate_inode_pages(), it'll lock the page and set page->mapping
to NULL, and in ext4's filemap_fault(), it'll lock the page and check
whether page->mapping has been changed; if so, it'll just fail the page
fault procedure.

For tcmu, though the data area's pages don't have a valid mapping,
we can apply a similar method.
In tcmu_vma_fault(), we lock the page and set the VM_FAULT_LOCKED
flag; in find_free_blocks(), we first try to lock the pages which are
going to be released. Once lock_page() returns, we can ensure that
there is no in-flight page fault procedure, and the following
unmap_mapping_range() will also ensure that all user mappings are
cleared. It seems this will resolve the possible issue; please have a
check, thanks.
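
A minimal sketch of that approach (assuming the current fault/reclaim
structure; tcmu_blocks_lock_unlock() is a hypothetical helper, and this
is not the final patch):

static vm_fault_t tcmu_vma_fault(struct vm_fault *vmf)
{
	/* ... resolve udev/dpi as today ... */
	struct page *page = tcmu_try_get_data_page(udev, dpi);

	if (!page)
		return VM_FAULT_SIGBUS;
	lock_page(page);	/* held until the mm core finishes the fault */
	vmf->page = page;
	return VM_FAULT_LOCKED;
}

static void find_free_blocks(void)
{
	/* ... per-device loop as today ... */

	/*
	 * Lock and immediately unlock each page to be released: once we
	 * can take the page lock, no in-flight fault still holds it.
	 * (hypothetical helper)
	 */
	tcmu_blocks_lock_unlock(udev, start, end - 1);
	unmap_mapping_range(mapping, off, 0, 1);
	tcmu_blocks_release(udev, start, end - 1);
}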


Regards,
Xiaoguang Wang


>
>
>>
>> To fix this issue, when extending dbi_thresh, we'll need to call
>> unmap_mapping_range() to unmap user space mappings of the data area
>> which may still exist; I think this is a simple method.
>>
>> Filesystem implementations will also run into this issue, but they
>> usually lock the page when vm_operations_struct->fault gets one page,
>> and unlock the page after finish_fault() completes. On the truncate
>> side, they lock pages in truncate_inode_pages() to protect against
>> racing with page faults. We can have similar code to filesystems to
>> fix this issue.
>>
>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>> ---
>>   drivers/target/target_core_user.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>> index 06a5c4086551..9196188504ec 100644
>> --- a/drivers/target/target_core_user.c
>> +++ b/drivers/target/target_core_user.c
>> @@ -862,6 +862,7 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>>       if (space < cmd->dbi_cnt) {
>>           unsigned long blocks_left =
>>                   (udev->max_blocks - udev->dbi_thresh) + space;
>> +        loff_t off, len;
>>             if (blocks_left < cmd->dbi_cnt) {
>>               pr_debug("no data space: only %lu available, but ask for %u\n",
>> @@ -870,6 +871,10 @@ static int tcmu_alloc_data_space(struct tcmu_dev *udev, struct tcmu_cmd *cmd,
>>               return -1;
>>           }
>>   +        off = udev->data_off + (loff_t)udev->dbi_thresh * udev->data_blk_size;
>> +        len = cmd->dbi_cnt * udev->data_blk_size;
>> +        unmap_mapping_range(udev->inode->i_mapping, off, len, 1);
>> +
>>           udev->dbi_thresh += cmd->dbi_cnt;
>>           if (udev->dbi_thresh > udev->max_blocks)
>>               udev->dbi_thresh = udev->max_blocks;


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-04-05 16:03     ` Xiaoguang Wang
@ 2022-04-08 10:33       ` Bodo Stroesser
  2022-04-11  4:49         ` Xiaoguang Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Bodo Stroesser @ 2022-04-08 10:33 UTC (permalink / raw)
  To: Xiaoguang Wang, linux-scsi, target-devel; +Cc: linux-block

On 05.04.22 18:03, Xiaoguang Wang wrote:
> hi,
> 
>> On 23.03.22 14:49, Xiaoguang Wang wrote:
>>> When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
>>> run in before the current context completes the page fault procedure and
>>> call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
>>> finishes its job first; the previous page fault procedure then resumes
>>> and completes, and a truncated page has been mapped to user space. Note
>>> that tcmu_vma_fault() has taken a refcount on this page, so no other
>>> subsystem will use this page until the user space addr is later
>>> unmapped.
>>>
>>> If another command runs in later and needs to extend dbi_thresh, it may
>>> reuse the previous page's slot in data_bitmap. Though we'll then
>>> allocate a new page for this slot in the data area, no page fault will
>>> happen again, because the stale mapping is still valid, and the real
>>> request's data will be lost.
>>
>> I don't think this is a safe fix. It is possible that not only
>> find_free_blocks runs before the page fault procedure completes, but
>> also the allocation for the next cmd happens. In that case the new call
>> to unmap_mapping_range would also happen before the page fault
>> completes -> data corruption.
>>
>> AFAIK, no one has ever seen this bug in real life, as
> Yeah, I know, I just found this possible issue by reading the code :)
> 
>> find_free_blocks runs only seldom and userspace would have to access
>> a data page for the very first time while the cmd that owned this page
>> has already been completed by userspace. Therefore I think we should
>> apply only a perfect fix.
>>
>> I'm wondering whether there really is such a race. If so, couldn't the
>> same race happen in other drivers or even when truncating mapped files?
> Indeed, I have described how filesystem implementations avoid this issue
> in the patch's commit message:
> 
>    Filesystem implementations will also run into this issue, but they
>    usually lock the page when vm_operations_struct->fault gets one page,
>    and unlock the page after finish_fault() completes. On the truncate
>    side, they lock pages in truncate_inode_pages() to protect against
>    racing with page faults. We can have similar code to filesystems to
>    fix this issue.
> 
> 
> Take ext4 as an example: a file in ext4 is mapped to user space, and if
> a truncate operation then occurs, ext4 calls truncate_pagecache():
> void truncate_pagecache(struct inode *inode, loff_t newsize)
> {
>          struct address_space *mapping = inode->i_mapping;
>          loff_t holebegin = round_up(newsize, PAGE_SIZE);
> 
>          /*
>           * unmap_mapping_range is called twice, first simply for
>           * efficiency so that truncate_inode_pages does fewer
>           * single-page unmaps.  However after this first call, and
>           * before truncate_inode_pages finishes, it is possible for
>           * private pages to be COWed, which remain after
>           * truncate_inode_pages finishes, hence the second
>           * unmap_mapping_range call must be made for correctness.
>           */
>          unmap_mapping_range(mapping, holebegin, 0, 1);
>          truncate_inode_pages(mapping, newsize);
>          unmap_mapping_range(mapping, holebegin, 0, 1);
> }
> 
> In truncate_inode_pages(), it'll lock the page and set page->mapping
> to NULL, and in ext4's filemap_fault(), it'll lock the page and check
> whether page->mapping has been changed; if so, it'll just fail the page
> fault procedure.
> 
> For tcmu, though the data area's pages don't have a valid mapping,
> we can apply a similar method.
> In tcmu_vma_fault(), we lock the page and set the VM_FAULT_LOCKED
> flag;

Yeah, looking into page fault handling I'm wondering why tcmu didn't do
that from the beginning!

> in find_free_blocks(), we first try to lock the pages which are
> going to be released. Once lock_page() returns,

I assume we immediately unlock the page again, right?

> we can ensure that there is no in-flight page fault procedure, and
> the following unmap_mapping_range() will also ensure that all user
> mappings are cleared.
> It seems this will resolve the possible issue; please have a check,
> thanks.

AFAICS, this is the clean solution we were searching for.

Thank you
Bodo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock
  2022-04-04 15:09     ` Xiaoguang Wang
@ 2022-04-08 11:34       ` Bodo Stroesser
  2022-04-11  6:16         ` Xiaoguang Wang
  0 siblings, 1 reply; 13+ messages in thread
From: Bodo Stroesser @ 2022-04-08 11:34 UTC (permalink / raw)
  To: Xiaoguang Wang, linux-scsi, target-devel; +Cc: linux-block

On 04.04.22 17:09, Xiaoguang Wang wrote:
> hi,
> 
>> On 23.03.22 14:49, Xiaoguang Wang wrote:
>>> Currently tcmu_vma_fault() uses udev->cmdr_lock to avoid racing with
>>> find_free_blocks(), which unmaps idle pages and truncates them. This
>>> work is much like many filesystems' truncate operations, but those use
>>> address_space->invalidate_lock to protect against the race.
>>>
>>> This patch replaces cmdr_lock with address_space->invalidate_lock in
>>> the tcmu fault procedure, which will also let page faults run
>>> concurrently.
>>>
>>
>> Page fault happens only once for each allocated data page. After the
>> page fault was processed, the page is re-used for many cmds without
>> further page faults.
> Oh, right, I had almost forgotten that, thanks.
> 
>>
>> Of course, if find_free_blocks releases a page and tcmu after that
>> allocates a new one, another page fault will occur. But that is a
>> very rare case.
> Yeah, now I agree it will happen rarely, but it will still happen,
> especially in the early stages of page fault establishment, when a tcmu
> device is freshly created.
> 
> And in the current code, tcmu normally reclaims free pages under these
> conditions:
>          if (atomic_read(&global_page_count) > tcmu_global_max_pages &&
>              xa_empty(&udev->commands) && list_empty(&udev->qfull_queue)) {
>                  /*
>                   * Allocated blocks exceeded global block limit, currently no
>                   * more pending or waiting commands so try to reclaim blocks.
>                   */
>                  schedule_delayed_work(&tcmu_unmap_work, 0);
>          }
> 
> I'm not sure whether this method is good; other subsystems seem to use
> the mm shrinker feature to reclaim free pages when system memory is low.
> tcmu_global_max_pages is 512k pages, i.e. 2GB of memory, which means
> that if tcmu devices really have allocated 2GB of memory, no attempt is
> made to free it even when overall system memory is very low.
> What I want to express is that though page faults currently occur
> rarely, if we use a better memory reclaim method later, page faults may
> occur more often.

The way the memory limit is currently implemented in tcmu really looks
not very sophisticated. It waits until the hard limit is reached and
then frees as much memory as possible. Ideas for a better algorithm are
welcome!

Currently, I think the best way to limit mem usage in existing tcmu
is to limit the data area size per tcmu device. The default of 1 GB per
dev I think is rarely needed. So users should configure devices
carefully. OTOH, if many tcmu devs are in use, even with small data
areas the total limit of 2GB can easily be reached. E.g., I know of a
customer who uses 256 tcmu devs with 16MB each. So he raises the global
limit to even 4GB, as from a performance point of view every release of
data pages with later re-allocation is just overhead.

From this point of view I think it would be difficult to find a good
algorithm.

> 
>>
>> So, what is the advantage of concurrency here?
> cmdr_lock is a mutex, so it means tcmu page faults cannot happen
> concurrently, while the mm subsystem does allow page faults to run
> concurrently. If we change to filemap_invalidate_lock_shared() in the
> tcmu page fault path, I think there will be two advantages:
> 1) Though page faults will happen rarely in tcmu if system memory is
> sufficient, in the early stages of page fault establishment we'll
> eventually have faulted in the whole default 1GB data area. At least in
> this stage, page faults will have concurrency.
> Also, if a page fault happens at the same time as queue_cmd_ring(),
> then for big IO sizes queue_cmd_ring() may hold cmdr_lock for a long
> time because of the data area and sgl page copies, and the page fault
> will be stalled for a long time, resulting in long tail latency.

TCM core can send multiple new cmds to tcmu concurrently, but cmdr_lock
forces sequential processing. So, some time ago I started an attempt to
make tcmu release the cmdr_lock while it copies sgl_pages. The measured
results unfortunately were not as good as expected. But I still need to
do further measurements to find and possibly fix the reason.

With that change, I hope to also lower the latency by far for cmds
with big data transfers.

> 
> 2) If we later manage to support zero copy, we'll need to call
> zap_page_range() in tcmu_handle_completion(), which needs to hold
> mm_struct's mmap_lock, and then we may have an "abba" deadlock:
>    page fault: hold mm_struct's mmap_lock, then try to take cmdr_lock
>    cmd completion: hold cmdr_lock, then try to take mm_struct's mmap_lock.

I'm still not convinced that you can get the support from mm you need.

BTW: after the cmdr_lock changes I described, the next step from my side
would be to avoid the data copy in an even more efficient way for at
least those fabrics that use core sgl allocation by calling
transport_generic_new_command or transport_handle_cdb_direct, or
target_submit_prep without pre-allocated sgls. The idea is to add a new
optional backend call that allows the backend (tcmu) to build sgls from
its own pages (tcmu: data pages). This method avoids the data copy _and_
the need for mapping/unmapping sgl pages because we use already mapped
pages.

Of course, this method does not speed up loopback, vhost and some other
fabrics, but it works very well e.g. for qla2xxx.
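
Purely to illustrate that idea (a hypothetical hook; the name and
signature are invented here, not an existing API):

/*
 * Hypothetical optional backend op: let the backend hand out its own
 * already-mapped pages as the cmd's data sgl, so core can skip its sgl
 * allocation and the later copy.
 */
struct target_backend_ops {
	/* ... existing ops ... */
	int (*alloc_cmd_sgl)(struct se_cmd *cmd,
			     struct scatterlist **sgl, u32 *nents);
};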

> 
> Or we can define a new "struct rw_semaphore i_mmap_sem;" in
> tcmu_dev, and use it rather than filemap_invalidate_lock_shared(mapping);
> 
>>
>> OTOH, I don't like the switch from tcmu's cmdr_lock to
>> filemap_invalidate_* without a real technical reason.
> Anyway, at least I don't see any obvious disadvantages to this patch.
> 
> 
> Regards,
> Xiaoguang Wang
>>
>>
>>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>>> ---
>>>    drivers/target/target_core_user.c | 14 ++++++++++----
>>>    1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>>> index 9196188504ec..81bfa553cc67 100644
>>> --- a/drivers/target/target_core_user.c
>>> +++ b/drivers/target/target_core_user.c
>>> @@ -1820,13 +1820,14 @@ static int tcmu_find_mem_index(struct vm_area_struct *vma)
>>>      static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>>    {
>>> +    struct address_space *mapping = udev->inode->i_mapping;
>>>        struct page *page;
>>>    -    mutex_lock(&udev->cmdr_lock);
>>> +    filemap_invalidate_lock_shared(mapping);
>>>        page = xa_load(&udev->data_pages, dpi);
>>>        if (likely(page)) {
>>>            get_page(page);
>>> -        mutex_unlock(&udev->cmdr_lock);
>>> +        filemap_invalidate_unlock_shared(mapping);
>>>            return page;
>>>        }
>>>    @@ -1836,7 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>>         */
>>>        pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
>>>               dpi, udev->name);
>>> -    mutex_unlock(&udev->cmdr_lock);
>>> +    filemap_invalidate_unlock_shared(mapping);
>>>          return NULL;
>>>    }
>>> @@ -3116,6 +3117,7 @@ static void find_free_blocks(void)
>>>        loff_t off;
>>>        u32 pages_freed, total_pages_freed = 0;
>>>        u32 start, end, block, total_blocks_freed = 0;
>>> +    struct address_space *mapping;
>>>          if (atomic_read(&global_page_count) <= tcmu_global_max_pages)
>>>            return;
>>> @@ -3139,6 +3141,8 @@ static void find_free_blocks(void)
>>>                continue;
>>>            }
>>>    +        mapping = udev->inode->i_mapping;
>>> +        filemap_invalidate_lock(mapping);
>>>            end = udev->dbi_max + 1;
>>>            block = find_last_bit(udev->data_bitmap, end);
>>>            if (block == udev->dbi_max) {
>>> @@ -3146,6 +3150,7 @@ static void find_free_blocks(void)
>>>                 * The last bit is dbi_max, so it is not possible
>>>                 * reclaim any blocks.
>>>                 */
>>> +            filemap_invalidate_unlock(mapping);
>>>                mutex_unlock(&udev->cmdr_lock);
>>>                continue;
>>>            } else if (block == end) {
>>> @@ -3159,10 +3164,11 @@ static void find_free_blocks(void)
>>>              /* Here will truncate the data area from off */
>>>            off = udev->data_off + (loff_t)start * udev->data_blk_size;
>>> -        unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
>>> +        unmap_mapping_range(mapping, off, 0, 1);
>>>              /* Release the block pages */
>>>            pages_freed = tcmu_blocks_release(udev, start, end - 1);
>>> +        filemap_invalidate_unlock(mapping);
>>>            mutex_unlock(&udev->cmdr_lock);
>>>              total_pages_freed += pages_freed;
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption
  2022-04-08 10:33       ` Bodo Stroesser
@ 2022-04-11  4:49         ` Xiaoguang Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-04-11  4:49 UTC (permalink / raw)
  To: Bodo Stroesser, linux-scsi, target-devel; +Cc: linux-block

hi,

> On 05.04.22 18:03, Xiaoguang Wang wrote:
>> hi,
>>
>>> On 23.03.22 14:49, Xiaoguang Wang wrote:
>>>> When tcmu_vma_fault() successfully gets a page, find_free_blocks() may
>>>> run in before the current context completes the page fault procedure and
>>>> call unmap_mapping_range() to unmap this page. Assume find_free_blocks()
>>>> finishes its job first; the previous page fault procedure then resumes
>>>> and completes, and a truncated page has been mapped to user space. Note
>>>> that tcmu_vma_fault() has taken a refcount on this page, so no other
>>>> subsystem will use this page until the user space addr is later
>>>> unmapped.
>>>>
>>>> If another command runs in later and needs to extend dbi_thresh, it may
>>>> reuse the previous page's slot in data_bitmap. Though we'll then
>>>> allocate a new page for this slot in the data area, no page fault will
>>>> happen again, because the stale mapping is still valid, and the real
>>>> request's data will be lost.
>>>
>>> I don't think this is a safe fix. It is possible that not only
>>> find_free_blocks runs before the page fault procedure completes, but
>>> also the allocation for the next cmd happens. In that case the new call
>>> to unmap_mapping_range would also happen before the page fault
>>> completes -> data corruption.
>>>
>>> AFAIK, no one has ever seen this bug in real life, as
>> Yeah, I know, I just found this possible issue by reading the code :)
>>
>>> find_free_blocks runs only seldom and userspace would have to access
>>> a data page for the very first time while the cmd that owned this page
>>> has already been completed by userspace. Therefore I think we should
>>> apply only a perfect fix.
>>>
>>> I'm wondering whether there really is such a race. If so, couldn't the
>>> same race happen in other drivers or even when truncating mapped files?
>> Indeed, I have described how filesystem implementations avoid this issue
>> in the patch's commit message:
>>
>>    Filesystem implementations will also run into this issue, but they
>>    usually lock the page when vm_operations_struct->fault gets one page,
>>    and unlock the page after finish_fault() completes. On the truncate
>>    side, they lock pages in truncate_inode_pages() to protect against
>>    racing with page faults. We can have similar code to filesystems to
>>    fix this issue.
>>
>>
>> Take ext4 as an example: a file in ext4 is mapped to user space, and if
>> a truncate operation then occurs, ext4 calls truncate_pagecache():
>> void truncate_pagecache(struct inode *inode, loff_t newsize)
>> {
>>          struct address_space *mapping = inode->i_mapping;
>>          loff_t holebegin = round_up(newsize, PAGE_SIZE);
>>
>>          /*
>>           * unmap_mapping_range is called twice, first simply for
>>           * efficiency so that truncate_inode_pages does fewer
>>           * single-page unmaps.  However after this first call, and
>>           * before truncate_inode_pages finishes, it is possible for
>>           * private pages to be COWed, which remain after
>>           * truncate_inode_pages finishes, hence the second
>>           * unmap_mapping_range call must be made for correctness.
>>           */
>>          unmap_mapping_range(mapping, holebegin, 0, 1);
>>          truncate_inode_pages(mapping, newsize);
>>          unmap_mapping_range(mapping, holebegin, 0, 1);
>> }
>>
>> In truncate_inode_pages(), it'll lock the page and set page->mapping
>> to NULL, and in ext4's filemap_fault(), it'll lock the page and check
>> whether page->mapping has been changed; if so, it'll just fail the page
>> fault procedure.
>>
>> For tcmu, though the data area's pages don't have a valid mapping,
>> we can apply a similar method.
>> In tcmu_vma_fault(), we lock the page and set the VM_FAULT_LOCKED
>> flag;
>
> Yeah, looking into page fault handling I'm wondering why tcmu didn't do
> that from the beginning!
>
>> in find_free_blocks(), we first try to lock the pages which are
>> going to be released. Once lock_page() returns,
>
> I assume we immediately unlock the page again, right?
Yeah, and I'll add proper code comments in the new version of the patch
set.

>
>> we can ensure that there is no in-flight page fault procedure, and
>> the following unmap_mapping_range() will also ensure that all user
>> mappings are cleared.
>> It seems this will resolve the possible issue; please have a check,
>> thanks.
>
> AFAICS, this is the clean solution we were searching for.
Agreed, I'll prepare a new patch set soon.


Regards,
Xiaoguang Wang
>
> Thank you
> Bodo


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock
  2022-04-08 11:34       ` Bodo Stroesser
@ 2022-04-11  6:16         ` Xiaoguang Wang
  0 siblings, 0 replies; 13+ messages in thread
From: Xiaoguang Wang @ 2022-04-11  6:16 UTC (permalink / raw)
  To: Bodo Stroesser, linux-scsi, target-devel; +Cc: linux-block

hi,

> On 04.04.22 17:09, Xiaoguang Wang wrote:
>> hi,
>>
>>> On 23.03.22 14:49, Xiaoguang Wang wrote:
>>>> Currently tcmu_vma_fault() uses udev->cmdr_lock to avoid racing with
>>>> find_free_blocks(), which unmaps idle pages and truncates them. This
>>>> work is much like many filesystems' truncate operations, but those use
>>>> address_space->invalidate_lock to protect against the race.
>>>>
>>>> This patch replaces cmdr_lock with address_space->invalidate_lock in
>>>> the tcmu fault procedure, which will also let page faults run
>>>> concurrently.
>>>>
>>>
>>> Page fault happens only once for each allocated data page. After the
>>> page fault was processed, the page is re-used for many cmds without
>>> further page faults.
>> Oh, right, I had almost forgotten that, thanks.
>>
>>>
>>> Of course, if find_free_blocks releases a page and tcmu after that
>>> allocates a new one, another page fault will occur. But that is a
>>> very rare case.
>> Yeah, now I agree it will happen rarely, but it will still happen,
>> especially in the early stages of page fault establishment, when a tcmu
>> device is freshly created.
>>
>> And in the current code, tcmu normally reclaims free pages under these
>> conditions:
>>          if (atomic_read(&global_page_count) > tcmu_global_max_pages &&
>>              xa_empty(&udev->commands) && list_empty(&udev->qfull_queue)) {
>>                  /*
>>                   * Allocated blocks exceeded global block limit, currently no
>>                   * more pending or waiting commands so try to reclaim blocks.
>>                   */
>>                  schedule_delayed_work(&tcmu_unmap_work, 0);
>>          }
>>
>> I'm not sure whether this method is good; other subsystems seem to use
>> the mm shrinker feature to reclaim free pages when system memory is low.
>> tcmu_global_max_pages is 512k pages, i.e. 2GB of memory, which means
>> that if tcmu devices really have allocated 2GB of memory, no attempt is
>> made to free it even when overall system memory is very low.
>> What I want to express is that though page faults currently occur
>> rarely, if we use a better memory reclaim method later, page faults may
>> occur more often.
>
> The way the memory limit is currently implemented in tcmu really looks
> not very sophisticated. It waits until the hard limit is reached and
> then frees as much memory as possible. Ideas for a better algorithm are
> welcome!
>
> Currently, I think the best way to limit mem usage in existing tcmu
> is to limit the data area size per tcmu device. The default of 1 GB per
> dev I think is rarely needed. So users should configure devices
> carefully. OTOH, if many tcmu devs are in use, even with small data
> areas the total limit of 2GB can easily be reached. E.g., I know of a
> customer who uses 256 tcmu devs with 16MB each. So he raises the global
> limit to even 4GB, as from a performance point of view every release of
> data pages with later re-allocation is just overhead.
>
> From this point of view I think it would be difficult to find a good
> algorithm.
>
>>
>>>
>>> So, what is the advantage of concurrency here?
>> cmdr_lock is a mutex, which means tcmu page faults cannot happen
>> concurrently, even though the mm subsystem allows page faults to run
>> concurrently. If we switch to filemap_invalidate_lock_shared() in the
>> tcmu page fault path, I think there will be two advantages:
>> 1) Though page faults will happen rarely in tcmu when system memory is
>> sufficient, the default 1GB data area still has to be faulted in at
>> some point, and at least during that stage page faults can then run
>> concurrently.
>> Also, if a page fault happens at the same time as queue_cmd_ring(),
>> then for big I/O sizes queue_cmd_ring() may hold cmdr_lock for a long
>> time while copying between the data area and sgl pages, so the page
>> fault is stalled for a long time, which results in long tail latency.
>
> TCM core can send multiple new cmds to tcmu concurrently, but cmdr_lock
> forces sequential processing. So, some time ago I started an attempt to
> make tcmu release the cmdr_lock while it copies sgl_pages. The measured
> results unfortunately were not as good as expected. But I still need to
> do further measurements to find and possibly fix the reason.
OK, thanks.
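If I understand the idea correctly, the pattern would be roughly as
below. This is only a sketch of how the copy could move out of the
lock; tcmu_reserve_space(), tcmu_copy_sgl() and tcmu_commit_cmd() are
invented names, not existing helpers:

    static int queue_cmd_ring_sketch(struct tcmu_dev *udev,
                                     struct tcmu_cmd *tcmu_cmd)
    {
            int ret;

            mutex_lock(&udev->cmdr_lock);
            /* Reserve ring space and data blocks under cmdr_lock. */
            ret = tcmu_reserve_space(udev, tcmu_cmd);
            mutex_unlock(&udev->cmdr_lock);
            if (ret)
                    return ret;

            /*
             * Copy sgl pages into the reserved data blocks without
             * holding cmdr_lock, so other cmds and page faults can
             * make progress during a big data transfer.
             */
            tcmu_copy_sgl(udev, tcmu_cmd);

            mutex_lock(&udev->cmdr_lock);
            /* Publish the entry by advancing cmd_head last. */
            tcmu_commit_cmd(udev, tcmu_cmd);
            mutex_unlock(&udev->cmdr_lock);
            return 0;
    }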

>
> With that change, I hope to also greatly lower the latency for cmds
> with big data transfers.
>
>>
>> 2) If we later manage to support zero copy, we'll need to call
>> zap_page_range() in tcmu_handle_completion(), which needs to hold the
>> mm_struct's mmap_lock, and then we may have an "abba" deadlock:
>>    page fault: holds mm_struct's mmap_lock, then tries to take cmdr_lock
>>    cmd completion: holds cmdr_lock, then tries to take mm_struct's mmap_lock.
>
> I'm still not convinced that you can get the support from mm that you need.
Yeah, I'm also frustrated about the current status, but I'll try to see
whether I can do something better later.

OK, I'm now fine with leaving this patch alone. If I later come up with
a better memory mapping approach, we can come back to this patch :)
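For the record, the "better memory reclaim method" I mentioned above
would be along the lines of the mm shrinker interface. A minimal
sketch, assuming the current register_shrinker() API;
tcmu_release_idle_pages() is a made-up helper that would walk the
devices and free unused data pages:

    #include <linux/shrinker.h>

    static unsigned long tcmu_shrink_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
    {
            /*
             * Sketch only: report the global page count; a real
             * implementation would count only idle, freeable pages.
             */
            return atomic_read(&global_page_count);
    }

    static unsigned long tcmu_shrink_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
    {
            /* Free up to sc->nr_to_scan idle data pages. */
            return tcmu_release_idle_pages(sc->nr_to_scan);
    }

    static struct shrinker tcmu_shrinker = {
            .count_objects  = tcmu_shrink_count,
            .scan_objects   = tcmu_shrink_scan,
            .seeks          = DEFAULT_SEEKS,
    };

    /* In module init: register_shrinker(&tcmu_shrinker); */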

>
> BTW: after the cmdr_lock changes I described, the next step from my side
> would be to avoid the data copy in an even more efficient way for at
> least those fabrics that use core sgl allocation by calling
> transport_generic_new_command or transport_handle_cdb_direct, or
> target_submit_prep without pre-allocated sgls. The idea is to add a new
> optional backend call that allows the backend (tcmu) to build sgls from
> its own pages (tcmu: data pages). This method avoids the data copy _and_
> the need for mapping/unmapping sgl pages, because we use already mapped
> pages.
>
> Of course, this method does not speed up loopback, vhost and some other
> fabrics, but it works very well e.g. for qla2xxx.
Thanks, it's a good start, and I believe this will give better I/O throughput.
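Just to make sure I follow, the new optional backend call might look
roughly like this. This is entirely hypothetical; neither the member
names nor the signatures exist in target_backend_ops today:

    /*
     * Hypothetical additions to struct target_backend_ops: the
     * backend builds an sgl from its own, already mapped pages
     * (for tcmu: the data pages), so the core can skip both sgl
     * allocation and the later data copy.
     */
    struct scatterlist *(*alloc_cmd_sgl)(struct se_cmd *se_cmd,
                                         u32 length, int *nents);
    void (*free_cmd_sgl)(struct se_cmd *se_cmd);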

Regards,
Xiaoguang Wang
>
>>
>> Or we could define a new "struct rw_semaphore i_mmap_sem;" in
>> tcmu_dev and use it rather than filemap_invalidate_lock_shared(mapping).
>>
>>>
>>> OTOH, I don't like the switch from tcmu's cmdr_lock to
>>> filemap_invalidate_* without a real technical reason.
>> Anyway, at least I don't see any obvious disadvantages
>> to this patch.
>>
>>
>> Regards,
>> Xiaoguang Wang
>>>
>>>
>>>> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
>>>> ---
>>>>    drivers/target/target_core_user.c | 14 ++++++++++----
>>>>    1 file changed, 10 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/target/target_core_user.c b/drivers/target/target_core_user.c
>>>> index 9196188504ec..81bfa553cc67 100644
>>>> --- a/drivers/target/target_core_user.c
>>>> +++ b/drivers/target/target_core_user.c
>>>> @@ -1820,13 +1820,14 @@ static int tcmu_find_mem_index(struct vm_area_struct *vma)
>>>>      static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>>>    {
>>>> +    struct address_space *mapping = udev->inode->i_mapping;
>>>>        struct page *page;
>>>>    -    mutex_lock(&udev->cmdr_lock);
>>>> +    filemap_invalidate_lock_shared(mapping);
>>>>        page = xa_load(&udev->data_pages, dpi);
>>>>        if (likely(page)) {
>>>>            get_page(page);
>>>> -        mutex_unlock(&udev->cmdr_lock);
>>>> +        filemap_invalidate_unlock_shared(mapping);
>>>>            return page;
>>>>        }
>>>>    @@ -1836,7 +1837,7 @@ static struct page *tcmu_try_get_data_page(struct tcmu_dev *udev, uint32_t dpi)
>>>>         */
>>>>        pr_err("Invalid addr to data page mapping (dpi %u) on device %s\n",
>>>>               dpi, udev->name);
>>>> -    mutex_unlock(&udev->cmdr_lock);
>>>> +    filemap_invalidate_unlock_shared(mapping);
>>>>          return NULL;
>>>>    }
>>>> @@ -3116,6 +3117,7 @@ static void find_free_blocks(void)
>>>>        loff_t off;
>>>>        u32 pages_freed, total_pages_freed = 0;
>>>>        u32 start, end, block, total_blocks_freed = 0;
>>>> +    struct address_space *mapping;
>>>>          if (atomic_read(&global_page_count) <= tcmu_global_max_pages)
>>>>            return;
>>>> @@ -3139,6 +3141,8 @@ static void find_free_blocks(void)
>>>>                continue;
>>>>            }
>>>>    +        mapping = udev->inode->i_mapping;
>>>> +        filemap_invalidate_lock(mapping);
>>>>            end = udev->dbi_max + 1;
>>>>            block = find_last_bit(udev->data_bitmap, end);
>>>>            if (block == udev->dbi_max) {
>>>> @@ -3146,6 +3150,7 @@ static void find_free_blocks(void)
>>>>                 * The last bit is dbi_max, so it is not possible
>>>>                 * reclaim any blocks.
>>>>                 */
>>>> +            filemap_invalidate_unlock(mapping);
>>>>                mutex_unlock(&udev->cmdr_lock);
>>>>                continue;
>>>>            } else if (block == end) {
>>>> @@ -3159,10 +3164,11 @@ static void find_free_blocks(void)
>>>>              /* Here will truncate the data area from off */
>>>>            off = udev->data_off + (loff_t)start * udev->data_blk_size;
>>>> -        unmap_mapping_range(udev->inode->i_mapping, off, 0, 1);
>>>> +        unmap_mapping_range(mapping, off, 0, 1);
>>>>              /* Release the block pages */
>>>>            pages_freed = tcmu_blocks_release(udev, start, end - 1);
>>>> +        filemap_invalidate_unlock(mapping);
>>>>            mutex_unlock(&udev->cmdr_lock);
>>>>              total_pages_freed += pages_freed;
>>


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-04-11  6:17 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-23 13:49 [PATCH v2 0/3] three bug fixes about tcmu page fault handle Xiaoguang Wang
2022-03-23 13:49 ` [PATCH v2 1/3] scsi: target: tcmu: Fix possible page UAF Xiaoguang Wang
2022-03-23 13:49 ` [PATCH v2 2/3] scsi: target: tcmu: Fix possible data corruption Xiaoguang Wang
2022-04-01 19:45   ` Bodo Stroesser
2022-04-04 16:13     ` Xiaoguang Wang
2022-04-05 16:03     ` Xiaoguang Wang
2022-04-08 10:33       ` Bodo Stroesser
2022-04-11  4:49         ` Xiaoguang Wang
2022-03-23 13:49 ` [PATCH v2 3/3] scsi: target: tcmu: Use address_space->invalidate_lock Xiaoguang Wang
2022-04-01 19:58   ` Bodo Stroesser
2022-04-04 15:09     ` Xiaoguang Wang
2022-04-08 11:34       ` Bodo Stroesser
2022-04-11  6:16         ` Xiaoguang Wang
