On 2020/11/6 3:40 AM, Josef Bacik wrote:
> On 11/3/20 8:30 AM, Qu Wenruo wrote:
>> In end_bio_extent_readpage() we had a strange dance around
>> extent_start/extent_len.
>>
>> Hidden behind the strange dance is the fact that it's just calling
>> endio_readpage_release_extent() on each bvec range.
>>
>> Here is an example to explain the original work flow:
>>    Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
>>
>>    end_bio_extent_readpage() entered
>>    |- extent_start = 0;
>>    |- extent_len = 0;
>>    |- bio_for_each_segment_all() {
>>    |  |- /* Got the 1st bvec */
>>    |  |- start = SZ_1M;
>>    |  |- end = SZ_1M + SZ_4K - 1;
>>    |  |- uptodate = 1;
>>    |  |- if (extent_len == 0) {
>>    |  |  |- extent_start = start; /* SZ_1M */
>>    |  |  |- extent_len = end + 1 - start; /* SZ_4K */
>>    |  |  }
>>    |  |
>>    |  |- /* Got the 2nd bvec */
>>    |  |- start = SZ_1M + SZ_4K;
>>    |  |- end = SZ_1M + SZ_8K - 1;
>>    |  |- uptodate = 1;
>>    |  |- if (extent_start + extent_len == start) {
>>    |  |  |- extent_len += end + 1 - start; /* SZ_8K */
>>    |  |  }
>>    |  } /* All bio vecs iterated */
>>    |
>>    |- if (extent_len) {
>>       |- endio_readpage_release_extent(tree, extent_start, extent_len,
>>                                        uptodate);
>>          /* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
>>
>> As the above flow shows, the existing code in end_bio_extent_readpage()
>> just accumulates extent_start/extent_len, and when the contiguous range
>> breaks, calls endio_readpage_release_extent() for the accumulated range.
>>
>> The contiguous range breaks at two locations:
>> - The final else {} branch
>>    This means we hit a page in the bio that is not contiguous with the
>>    previous one. Currently this branch is never triggered, as all our
>>    bios are submitted with contiguous pages.
>>
>> - After the bio_for_each_segment_all() loop ends
>>    This is the normal call site: we have iterated all bvecs of the bio,
>>    and all pages should be contiguous, thus we can call
>>    endio_readpage_release_extent() on the full range.
>>
>> The original code also considers the (!uptodate) case, so it will only
>> mark the uptodate range with EXTENT_UPTODATE.
>>
>> So this patch removes the extent_start/extent_len dance and replaces
>> it with a regular endio_readpage_release_extent() call on each bvec.
>>
>> This brings one behavior change:
>> - Temporary memory usage increase
>>    Unlike the old code, which modifies the extent io tree only once per
>>    contiguous range, we now update the extent io tree for each bvec.
>>
>>    Although the end result is the same, since we may need more
>>    extent_state splits/allocations, we need more temporary memory
>>    during the bvec iteration.
>>
>> But considering how streamlined the new code is, the temporary memory
>> usage increase should be acceptable.
>
> It's not just temporary memory usage, it's a point of latency for every
> memory operation.

The latency comes from two parts:

- extent_state search
  Even though it's an O(log(n)) operation, we're calling it for each
  bvec, so it definitely causes more latency. I'll post test results
  soon, but the initial result is already pretty poor.

- extent_state preallocation
  This is the tricky one. In theory, since we're in the read path, we
  could call it with GFP_KERNEL. But the truth is, the extent io tree
  uses the gfp_mask to determine whether it is allowed to allocate
  memory, and if it is, it will always try to preallocate an
  extent_state, which is not always ideal.

  This means that even though we could pass GFP_KERNEL here, we
  shouldn't.
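For reference, the preallocation pattern in question looks roughly like
the condensed sketch below. It is a simplified stand-in for
__set_extent_bit() in fs/btrfs/extent_io.c of that era (the function
name here is renamed to mark it as a sketch); gfpflags_allow_blocking()
and alloc_extent_state() are the real helpers:

static int set_extent_bit_sketch(struct extent_io_tree *tree, u64 start,
				 u64 end, unsigned bits, gfp_t mask)
{
	struct extent_state *prealloc = NULL;

	/*
	 * A blocking mask (e.g. GFP_KERNEL) always triggers an optimistic
	 * preallocation before the tree lock is taken, even when the range
	 * ends up simply merged into an existing extent_state and the
	 * preallocated node is freed unused.  GFP_ATOMIC skips this step
	 * entirely, hence the irony below.
	 */
	if (!prealloc && gfpflags_allow_blocking(mask))
		prealloc = alloc_extent_state(mask);

	spin_lock(&tree->lock);
	/*
	 * ... rb-tree search; prealloc is consumed only when the range
	 * cannot be merged into an existing extent_state and an insert
	 * or a split is needed ...
	 */
	spin_unlock(&tree->lock);
	return 0;
}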
So ironically, we should call with GFP_ATOMIC to reduce the number of
allocation attempts. But that in turn could cause false ENOMEM alerts.

In the extent io tree operations, every bvec except the first should
just enlarge the previously inserted extent_state, so the memory usage
isn't really a problem.

This again shows the hidden sins of the extent io tree, and further
proves that we need more interface rework for it.

The best situation would be: we allocate one extent_state as a cache
and allow the extent io tree to use that cache, rather than doing the
hidden preallocation internally, and we only re-allocate the pre-cached
extent_state after the extent io tree has actually consumed it. For
endio call sites, the chance of needing a new allocation is low, as a
contiguous range should only need one extent_state allocated.

For now, I want to just keep the old behavior, with slightly better
comments, and leave the larger extent io tree rework for the future.

Thanks,
Qu

> We have a lot of memory usage on our servers, every
> trip into the slab allocator is going to be a new chance to induce
> latency because we get caught by some cgroup limit and force reclaim.
> The fact that these could be GFP_ATOMIC makes it even worse, because
> now we'll have this random knock-on effect for heavy read workloads.
>
> Then to top it all off we could have several megs worth of IO per bio,
> which means we're doing this allocation hundreds of times per bio!
> This is a hard no for me.  Thanks,
>
> Josef
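For illustration, the caller-provided cache interface Qu proposes above
might look roughly like the sketch below. The names extent_state_cache
and set_extent_uptodate_cached are hypothetical and do not exist in the
kernel; alloc_extent_state() and unlock_extent() are the existing
helpers of that era. This is a sketch of the idea, not a real patch:

/* Hypothetical caller-managed extent_state cache. */
struct extent_state_cache {
	struct extent_state *state;	/* NULL once the tree consumes it */
};

/*
 * Hypothetical variant of set_extent_uptodate() that uses the caller's
 * cached extent_state instead of preallocating internally.
 */
int set_extent_uptodate_cached(struct extent_io_tree *tree, u64 start,
			       u64 end, struct extent_state_cache *cache);

static void endio_readpage_release_extent(struct extent_io_tree *tree,
					  u64 start, u64 len,
					  struct extent_state_cache *cache)
{
	u64 end = start + len - 1;

	/*
	 * Refill the cache only after the tree has really consumed the
	 * previous extent_state.  For the contiguous bvecs of one bio,
	 * every call after the first merges into the extent_state
	 * inserted by the first call, so the common case allocates
	 * exactly once per bio instead of once per bvec.
	 */
	if (!cache->state)
		cache->state = alloc_extent_state(GFP_ATOMIC);

	set_extent_uptodate_cached(tree, start, end, cache);
	unlock_extent(tree, start, end);
}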