From: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
To: Bodo Stroesser <bostroesser@gmail.com>,
	linux-mm@kvack.org, target-devel@vger.kernel.org,
	linux-scsi@vger.kernel.org
Cc: linux-block@vger.kernel.org, xuyu@linux.alibaba.com
Subject: Re: [RFC 0/3] Add zero copy feature for tcmu
Date: Tue, 22 Mar 2022 21:17:07 +0800	[thread overview]
Message-ID: <36b5a8e5-c8e9-6a1f-834c-6bf9bf920f4c@linux.alibaba.com> (raw)
In-Reply-To: <abbe51c4-873f-e96e-d421-85906689a55a@gmail.com>

hi,

> On 18.03.22 10:55, Xiaoguang Wang wrote:
>> The core idea behind the tcmu zero copy feature is really
>> straightforward: we simply map a block device io request's sgl pages
>> into the tcmu user space backstore, so we avoid the extra copy
>> between the sgl pages and tcmu's internal data area (which really
>> hurts io throughput). Please see
>> https://www.spinics.net/lists/target-devel/msg21121.html for detailed
>> info.
>>
>
> Can you please tell us how big the performance improvement is and
> which configuration you are using for measurements?
Sorry, I should have attached test results here. Initially I tried to use
the tcmu user:fbo backstore to evaluate the performance improvement, but
it only showed about a 10%~15% io throughput gain with a fio config of
numjobs=1, iodepth=8, bs=256k, which isn't very impressive. The reason is
that the user:fbo backstore does buffered reads, which consume most of
the cpu.
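
For reference, the fio invocation was roughly the following (the device
path is a placeholder for the tcmu-backed block device, and the rw mode
and direct flag are assumptions, not the exact command used):

    fio --name=test --filename=/dev/sdX --ioengine=libaio --direct=1 \
        --rw=read --numjobs=1 --iodepth=8 --bs=256k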

Then I tested this zero copy feature with our real workload, whose
backstore is a multi-threaded network program accessing a distributed
file system. With 4 jobs, iodepth=8 and a 256kb io size, the write
throughput improves from 3.6GB/s to 10GB/s.

Regards,
Xiaoguang Wang

>
>> Initially I used remap_pfn_range or vm_insert_pages to map sgl pages
>> to user space, but both of them have limits:
>> 1) Use vm_insert_pages
>> This is like tcp getsockopt(TCP_ZEROCOPY_RECEIVE), but there are two
>> restrictions:
>>    1. anonymous pages can not be mmapped to user space.
>>      ==> vm_insert_pages
>>      ====> insert_pages
>>      ======> insert_page_in_batch_locked
>>      ========> validate_page_before_insert
>>      validate_page_before_insert() shows that an anonymous page can
>>      not be mapped to user space, and we know that when direct io is
>>      issued to a block device, the io request's sgl pages mostly come
>>      from anonymous pages:
>>          if (PageAnon(page) || PageSlab(page) || page_has_type(page))
>>              return -EINVAL;
>>      I'm not sure why there is such a restriction. For safety reasons?
>>
>>    2. warn_on triggered in __folio_mark_dirty
>>      When calling zap_page_range in the tcmu user space backstore as
>>      io completes, a warn_on is triggered in __folio_mark_dirty:
>>         if (folio->mapping) {   /* Race with truncate? */
>>             WARN_ON_ONCE(warn && !folio_test_uptodate(folio));
>>
>>      I'm not familiar with folio yet, but I think the reason is that
>>      when issuing a buffered read to a tcmu block device, its page
>>      cache is mapped to user space; the backstore writes those pages
>>      and the ptes get dirtied, but the pages are newly allocated, so
>>      their uptodate flag is not yet set. In zap_pte_range() there is
>>      this code:
>>         if (!PageAnon(page)) {
>>             if (pte_dirty(ptent)) {
>>                 force_flush = 1;
>>                 set_page_dirty(page);
>>             }
>>     So this warn_on is reasonable.
>>     Indeed, all I want is to map the io request's sgl pages to the
>>     tcmu user space backstore so the backstore can read or write data
>>     in the mapped area; I don't want to care about the pages or their
>>     mapping status, so I chose to use remap_pfn_range.
>>
>> 2) Use remap_pfn_range()
>>    remap_pfn_range works well, but it has fairly obvious overhead. A
>>    512kb io request has 128 pages, and usually these 128 pages' pfns
>>    are not consecutive, so in the worst case I'd need to issue 128
>>    remap_pfn_range calls for a single 512kb io request, which is
>>    horrible; a sketch of this loop follows below. Moreover, inside
>>    remap_pfn_range, if the x86 page attribute table feature is
>>    enabled, lookup_memtype() called by track_pfn_remap() also
>>    introduces noticeable overhead.
>>
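To make the per-page cost concrete, here is a minimal sketch (hypothetical
helper name, simplified error handling, not the actual patch) of mapping
one request's sgl pages with remap_pfn_range when the pfns are not
consecutive:

    #include <linux/mm.h>

    /*
     * Hypothetical sketch: each non-consecutive page needs its own
     * remap_pfn_range() call, so a 512kb request (128 pages) can take
     * 128 calls, and each call may go through track_pfn_remap() and
     * lookup_memtype() when the x86 PAT feature is enabled.
     */
    static int map_sgl_with_remap(struct vm_area_struct *vma,
                                  unsigned long uaddr,
                                  struct page **pages, int nr_pages)
    {
            int i, ret;

            for (i = 0; i < nr_pages; i++) {
                    ret = remap_pfn_range(vma, uaddr + i * PAGE_SIZE,
                                          page_to_pfn(pages[i]), PAGE_SIZE,
                                          vma->vm_page_prot);
                    if (ret)
                            return ret;
            }
            return 0;
    }
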
>> Finally, in order to solve these problems, Xu Yu helped to implement a
>> new helper which accepts an array of pages as a parameter. Anonymous
>> pages can be mapped to user space, and the pages are treated as
>> special ptes (pte_special returns true), so vm_normal_page returns
>> NULL and the folio warn_on above won't trigger.
>>
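As a rough illustration, and assuming the new helper mirrors the
vm_insert_pages() signature (the actual interface is defined in patch
1/3; the function and variable names below are only a sketch), the
backstore-side flow would be:

    #include <linux/mm.h>

    /*
     * Hypothetical usage sketch: batch-map the request's sgl pages as
     * special ptes, let the user space backstore process the data,
     * then tear the mapping down at io completion with
     * zap_page_range() (which is why patch 2/3 exports it).
     */
    static int map_and_complete(struct vm_area_struct *vma,
                                unsigned long uaddr,
                                struct page **pages,
                                unsigned long nr_pages)
    {
            unsigned long nr = nr_pages;
            int ret;

            ret = vm_insert_pages_mkspecial(vma, uaddr, pages, &nr);
            if (ret)
                    return ret;

            /* ... backstore reads/writes the mapped area ... */

            zap_page_range(vma, uaddr, nr_pages * PAGE_SIZE);
            return 0;
    }
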
>> Thanks.
>>
>> Xiaoguang Wang (2):
>>    mm: export zap_page_range()
>>    scsi: target: tcmu: Support zero copy
>>
>> Xu Yu (1):
>>    mm/memory.c: introduce vm_insert_page(s)_mkspecial
>>
>>   drivers/target/target_core_user.c | 257 +++++++++++++++++++++++++++++++++-----
>>   include/linux/mm.h                |   2 +
>>   mm/memory.c                       | 183 +++++++++++++++++++++++++++
>>   3 files changed, 414 insertions(+), 28 deletions(-)
>>



Thread overview: 19+ messages
2022-03-18  9:55 [RFC 0/3] Add zero copy feature for tcmu Xiaoguang Wang
2022-03-18  9:55 ` [RFC 1/3] mm/memory.c: introduce vm_insert_page(s)_mkspecial Xiaoguang Wang
2022-03-23 16:45   ` Christoph Hellwig
2022-03-24  7:27     ` Xiaoguang Wang
2022-03-18  9:55 ` [RFC 2/3] mm: export zap_page_range() Xiaoguang Wang
2022-03-21 12:01   ` David Hildenbrand
2022-03-22 13:02     ` Xiaoguang Wang
2022-03-22 13:08       ` David Hildenbrand
2022-03-23 13:59         ` Xiaoguang Wang
2022-03-23 16:48         ` Christoph Hellwig
2022-03-23 16:47   ` Christoph Hellwig
2022-03-24  9:16   ` Ming Lei
2022-03-18  9:55 ` [RFC 3/3] scsi: target: tcmu: Support zero copy Xiaoguang Wang
2022-03-22 14:01   ` Bodo Stroesser
2022-03-23 14:33     ` Xiaoguang Wang
2022-03-25  9:06       ` Bodo Stroesser
2022-03-22 12:40 ` [RFC 0/3] Add zero copy feature for tcmu Bodo Stroesser
2022-03-22 13:17   ` Xiaoguang Wang [this message]
2022-03-22 14:05     ` Bodo Stroesser
