linux-kernel.vger.kernel.org archive mirror
* [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
@ 2024-01-24  7:05 Jingbo Xu
  2024-01-24 12:23 ` Miklos Szeredi
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-01-24  7:05 UTC (permalink / raw)
  To: miklos, linux-fsdevel; +Cc: linux-kernel, zhangjiachen.jaycee

From: Xu Ji <laoji.jx@alibaba-inc.com>

Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
single request is increased.

This optimizes the write performance especially when the optimal IO size
of the backend store at the fuse daemon side is greater than the original
maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
4096 PAGE_SIZE).

Note that this only increases the upper limit of the maximum request
size, while the real maximum request size relies on the FUSE_INIT
negotiation with the fuse daemon.

Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
---
I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
Bytedance folks seem to have increased the maximum request size to 8M
and saw a ~20% performance boost.
---
 fs/fuse/fuse_i.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 1df83eebda92..6bd2cf0b42e1 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -36,7 +36,7 @@
 #define FUSE_DEFAULT_MAX_PAGES_PER_REQ 32
 
 /** Maximum of max_pages received in init_out */
-#define FUSE_MAX_MAX_PAGES 256
+#define FUSE_MAX_MAX_PAGES 1024
 
 /** Bias for fi->writectr, meaning new writepages must not be sent */
 #define FUSE_NOWRITE INT_MIN
-- 
2.19.1.6.gb485710b


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-01-24  7:05 [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit Jingbo Xu
@ 2024-01-24 12:23 ` Miklos Szeredi
  2024-01-24 12:47   ` Jingbo Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2024-01-24 12:23 UTC (permalink / raw)
  To: Jingbo Xu; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee

On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> From: Xu Ji <laoji.jx@alibaba-inc.com>
>
> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
> single request is increased.

The only worry is about where this memory is getting accounted to.
This needs to be thought through, since we are increasing the amount of
memory that an unprivileged user is allowed to pin.



>
> This optimizes the write performance especially when the optimal IO size
> of the backend store at the fuse daemon side is greater than the original
> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
> 4096 PAGE_SIZE).
>
> Note that this only increases the upper limit of the maximum request
> size, while the real maximum request size relies on the FUSE_INIT
> negotiation with the fuse daemon.
>
> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> ---
> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
> Bytedance folks seem to have increased the maximum request size to 8M
> and saw a ~20% performance boost.

The 20% is against the 256 pages, I guess.  It would be interesting to
see how the number of pages per request affects performance and
why.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-01-24 12:23 ` Miklos Szeredi
@ 2024-01-24 12:47   ` Jingbo Xu
  2024-01-26  6:29     ` Jingbo Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-01-24 12:47 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 1/24/24 8:23 PM, Miklos Szeredi wrote:
> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>
>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>> single request is increased.
> 
> The only worry is about where this memory is getting accounted to.
> This needs to be thought through, since we are increasing the amount
> of memory that an unprivileged user is allowed to pin.

OK that will be an issue.

> 
> 
> 
>>
>> This optimizes the write performance especially when the optimal IO size
>> of the backend store at the fuse daemon side is greater than the original
>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>> 4096 PAGE_SIZE).
>>
>> Note that this only increases the upper limit of the maximum request
>> size, while the real maximum request size relies on the FUSE_INIT
>> negotiation with the fuse daemon.
>>
>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>> ---
>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>> Bytedance folks seem to have increased the maximum request size to 8M
>> and saw a ~20% performance boost.
> 
> The 20% is against the 256 pages, I guess. 

Yeah I guess so.


> It would be interesting to
> see how the number of pages per request affects performance and
> why.

To be honest, I'm not sure about the root cause of the performance
boost in Bytedance's case.

In our internal use scenario, the optimal IO size of the backend store
at the fuse server side is, e.g. 4MB, and thus the maximum throughput
cannot be achieved with the current 256 pages per request. IOW the
backend store, e.g. a distributed parallel filesystem, gets optimal
performance when the data is aligned on a 4MB boundary.  I can ask my
colleague who implements the fuse server to give more background info
and the exact performance statistics.

Thanks.



-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-01-24 12:47   ` Jingbo Xu
@ 2024-01-26  6:29     ` Jingbo Xu
  2024-02-26  4:00       ` Jingbo Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-01-26  6:29 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 1/24/24 8:47 PM, Jingbo Xu wrote:
> 
> 
> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>
>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>> single request is increased.
>>
>> The only worry is about where this memory is getting accounted to.
>> This needs to be thought through, since we are increasing the amount
>> of memory that an unprivileged user is allowed to pin.

Apart from the request size, the maximum number of background requests,
i.e. max_background (12 by default, and configurable by the fuse
daemon), also limits the size of the memory that an unprivileged user
can pin.  But yes, increasing the maximum request size does increase
that bound proportionally.


> 
>>
>>
>>
>>>
>>> This optimizes the write performance especially when the optimal IO size
>>> of the backend store at the fuse daemon side is greater than the original
>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>> 4096 PAGE_SIZE).
>>>
>>> Note that this only increases the upper limit of the maximum request
>>> size, while the real maximum request size relies on the FUSE_INIT
>>> negotiation with the fuse daemon.
>>>
>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>> ---
>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>> Bytedance folks seem to have increased the maximum request size to 8M
>>> and saw a ~20% performance boost.
>>
>> The 20% is against the 256 pages, I guess. 
> 
> Yeah I guess so.
> 
> 
>> It would be interesting to
>> see how the number of pages per request affects performance and
>> why.
> 
> To be honest, I'm not sure about the root cause of the performance
> boost in Bytedance's case.
> 
> In our internal use scenario, the optimal IO size of the backend store
> at the fuse server side is, e.g. 4MB, and thus the maximum throughput
> cannot be achieved with the current 256 pages per request. IOW the
> backend store, e.g. a distributed parallel filesystem, gets optimal
> performance when the data is aligned on a 4MB boundary.  I can ask my
> colleague who implements the fuse server to give more background info
> and the exact performance statistics.

Here are more details about our internal use case:

We have a fuse server used in our internal cloud scenarios, while the
backend store is actually a distributed filesystem.  That is, the fuse
server actually plays as the client of the remote distributed
filesystem.  The fuse server forwards the fuse requests to the remote
backing store through network, while the remote distributed filesystem
handles the IO requests, e.g. process the data from/to the persistent store.

Then it comes the details of the remote distributed filesystem when it
process the requested data with the persistent store.

[1] The remote distributed filesystem uses, e.g. an 8+3 mode EC
(erasure code), where each fixed-size piece of user data is split and
stored as 8 data blocks plus 3 extra parity blocks. For example, with a
512KB block size, each 4MB of user data is split and stored as 8 (512KB)
data blocks with 3 (512KB) parity blocks.

It also utilizes striping to boost performance: for example, with the 8
data disks and 3 parity disks of the above 8+3 mode, each stripe
consists of 8 data blocks and 3 parity blocks.

[2] To avoid data corruption on power loss, the remote distributed
filesystem commits an O_SYNC write right away once a write (fuse)
request is received.  Because of the EC scheme described above, when a
write fuse request is not aligned on the 4MB (stripe size) boundary,
say it's 1MB in size, the other 3MB is read from the persistent store
first, then the 3 parity blocks are computed from the complete 4MB
stripe, and finally the 8 data blocks and 3 parity blocks are written
down.

Thus the write amplification is non-negligible and is the performance
bottleneck when the fuse request size is less than the stripe size.

Here are some simple performance statistics with varying request size.
With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
request size is increased from 256KB to 3.9MB, and another ~20%
improvement when the request size is increased to 4MB from 3.9MB.



-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-01-26  6:29     ` Jingbo Xu
@ 2024-02-26  4:00       ` Jingbo Xu
  2024-03-05 14:26         ` Miklos Szeredi
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-02-26  4:00 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee

Hi Miklos,

On 1/26/24 2:29 PM, Jingbo Xu wrote:
> 
> 
> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>
>>
>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>
>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>> single request is increased.
>>>
>>> The only worry is about where this memory is getting accounted to.
>>> This needs to be thought through, since we are increasing the amount
>>> of memory that an unprivileged user is allowed to pin.
> 
> Apart from the request size, the maximum number of background requests,
> i.e. max_background (12 by default, and configurable by the fuse
> daemon), also limits the size of the memory that an unprivileged user
> can pin.  But yes, increasing the maximum request size does increase
> that bound proportionally.
> 
> 
>>
>>>
>>>
>>>
>>>>
>>>> This optimizes the write performance especially when the optimal IO size
>>>> of the backend store at the fuse daemon side is greater than the original
>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>> 4096 PAGE_SIZE).
>>>>
>>>> Note that this only increases the upper limit of the maximum request
>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>> negotiation with the fuse daemon.
>>>>
>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>> ---
>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>> and saw a ~20% performance boost.
>>>
>>> The 20% is against the 256 pages, I guess. 
>>
>> Yeah I guess so.
>>
>>
>>> It would be interesting to
>>> see how the number of pages per request affects performance and
>>> why.
>>
>> To be honest, I'm not sure about the root cause of the performance
>> boost in Bytedance's case.
>>
>> In our internal use scenario, the optimal IO size of the backend store
>> at the fuse server side is, e.g. 4MB, and thus the maximum throughput
>> cannot be achieved with the current 256 pages per request. IOW the
>> backend store, e.g. a distributed parallel filesystem, gets optimal
>> performance when the data is aligned on a 4MB boundary.  I can ask my
>> colleague who implements the fuse server to give more background info
>> and the exact performance statistics.
> 
> Here are more details about our internal use case:
> 
> We have a fuse server used in our internal cloud scenarios, while the
> backend store is actually a distributed filesystem.  That is, the fuse
> server actually plays as the client of the remote distributed
> filesystem.  The fuse server forwards the fuse requests to the remote
> backing store through network, while the remote distributed filesystem
> handles the IO requests, e.g. process the data from/to the persistent store.
> 
> Then it comes the details of the remote distributed filesystem when it
> process the requested data with the persistent store.
> 
> [1] The remote distributed filesystem uses, e.g. an 8+3 mode EC
> (erasure code), where each fixed-size piece of user data is split and
> stored as 8 data blocks plus 3 extra parity blocks. For example, with a
> 512KB block size, each 4MB of user data is split and stored as 8 (512KB)
> data blocks with 3 (512KB) parity blocks.
> 
> It also utilizes striping to boost performance: for example, with the 8
> data disks and 3 parity disks of the above 8+3 mode, each stripe
> consists of 8 data blocks and 3 parity blocks.
> 
> [2] To avoid data corruption on power loss, the remote distributed
> filesystem commits an O_SYNC write right away once a write (fuse)
> request is received.  Because of the EC scheme described above, when a
> write fuse request is not aligned on the 4MB (stripe size) boundary,
> say it's 1MB in size, the other 3MB is read from the persistent store
> first, then the 3 parity blocks are computed from the complete 4MB
> stripe, and finally the 8 data blocks and 3 parity blocks are written
> down.
> 
> Thus the write amplification is non-negligible and is the performance
> bottleneck when the fuse request size is less than the stripe size.
> 
> Here are some simple performance statistics with varying request size.
> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
> request size is increased from 256KB to 3.9MB, and another ~20%
> improvement when the request size is increased to 4MB from 3.9MB.
> 

gentle ping ...

I'm not sure whether the usage scenario described above sounds
reasonable to you.  Let me know if there are any other concerns.


-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-02-26  4:00       ` Jingbo Xu
@ 2024-03-05 14:26         ` Miklos Szeredi
  2024-03-06 13:32           ` Jingbo Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Miklos Szeredi @ 2024-03-05 14:26 UTC (permalink / raw)
  To: Jingbo Xu; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee

On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>
> Hi Miklos,
>
> On 1/26/24 2:29 PM, Jingbo Xu wrote:
> >
> >
> > On 1/24/24 8:47 PM, Jingbo Xu wrote:
> >>
> >>
> >> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
> >>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
> >>>>
> >>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
> >>>>
> >>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
> >>>> single request is increased.
> >>>
> >>> The only worry is about where this memory is getting accounted to.
> >>> This needs to be thought through, since we are increasing the amount
> >>> of memory that an unprivileged user is allowed to pin.
> >
> > Apart from the request size, the maximum number of background requests,
> > i.e. max_background (12 by default, and configurable by the fuse
> > daemon), also limits the size of the memory that an unprivileged user
> > can pin.  But yes, increasing the maximum request size does increase
> > that bound proportionally.
> >
> >
> >>
> >>>
> >>>
> >>>
> >>>>
> >>>> This optimizes the write performance especially when the optimal IO size
> >>>> of the backend store at the fuse daemon side is greater than the original
> >>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
> >>>> 4096 PAGE_SIZE).
> >>>>
> >>>> Note that this only increases the upper limit of the maximum request
> >>>> size, while the real maximum request size relies on the FUSE_INIT
> >>>> negotiation with the fuse daemon.
> >>>>
> >>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
> >>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
> >>>> ---
> >>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
> >>>> Bytedance folks seem to have increased the maximum request size to 8M
> >>>> and saw a ~20% performance boost.
> >>>
> >>> The 20% is against the 256 pages, I guess.
> >>
> >> Yeah I guess so.
> >>
> >>
> >>> It would be interesting to
> >>> see how the number of pages per request affects performance and
> >>> why.
> >>
> >> To be honest, I'm not sure about the root cause of the performance
> >> boost in Bytedance's case.
> >>
> >> In our internal use scenario, the optimal IO size of the backend store
> >> at the fuse server side is, e.g. 4MB, and thus the maximum throughput
> >> cannot be achieved with the current 256 pages per request. IOW the
> >> backend store, e.g. a distributed parallel filesystem, gets optimal
> >> performance when the data is aligned on a 4MB boundary.  I can ask my
> >> colleague who implements the fuse server to give more background info
> >> and the exact performance statistics.
> >
> > Here are more details about our internal use case:
> >
> > We have a fuse server used in our internal cloud scenarios, while the
> > backend store is actually a distributed filesystem.  That is, the fuse
> > server actually plays as the client of the remote distributed
> > filesystem.  The fuse server forwards the fuse requests to the remote
> > backing store through network, while the remote distributed filesystem
> > handles the IO requests, e.g. process the data from/to the persistent store.
> >
> > Then it comes the details of the remote distributed filesystem when it
> > process the requested data with the persistent store.
> >
> > [1] The remote distributed filesystem uses, e.g. an 8+3 mode EC
> > (erasure code), where each fixed-size piece of user data is split and
> > stored as 8 data blocks plus 3 extra parity blocks. For example, with a
> > 512KB block size, each 4MB of user data is split and stored as 8 (512KB)
> > data blocks with 3 (512KB) parity blocks.
> >
> > It also utilizes striping to boost performance: for example, with the 8
> > data disks and 3 parity disks of the above 8+3 mode, each stripe
> > consists of 8 data blocks and 3 parity blocks.
> >
> > [2] To avoid data corruption on power loss, the remote distributed
> > filesystem commits an O_SYNC write right away once a write (fuse)
> > request is received.  Because of the EC scheme described above, when a
> > write fuse request is not aligned on the 4MB (stripe size) boundary,
> > say it's 1MB in size, the other 3MB is read from the persistent store
> > first, then the 3 parity blocks are computed from the complete 4MB
> > stripe, and finally the 8 data blocks and 3 parity blocks are written
> > down.
> >
> > Thus the write amplification is non-negligible and is the performance
> > bottleneck when the fuse request size is less than the stripe size.
> >
> > Here are some simple performance statistics with varying request size.
> > With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
> > request size is increased from 256KB to 3.9MB, and another ~20%
> > improvement when the request size is increased to 4MB from 3.9MB.

I sort of understand the issue, although my guess is that this could
be worked around in the client by coalescing writes.  This could be
done by adding a small delay before sending a write request off to the
network.

Would that work in your case?

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-05 14:26         ` Miklos Szeredi
@ 2024-03-06 13:32           ` Jingbo Xu
  2024-03-06 15:45             ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-03-06 13:32 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 3/5/24 10:26 PM, Miklos Szeredi wrote:
> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>
>> Hi Miklos,
>>
>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>
>>>
>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>
>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>> single request is increased.
>>>>>
>>>>> The only worry is about where this memory is getting accounted to.
>>>>> This needs to be thought through, since we are increasing the amount
>>>>> of memory that an unprivileged user is allowed to pin.
>>>
>>> Apart from the request size, the maximum number of background requests,
>>> i.e. max_background (12 by default, and configurable by the fuse
>>> daemon), also limits the size of the memory that an unprivileged user
>>> can pin.  But yes, increasing the maximum request size does increase
>>> that bound proportionally.
>>>
>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> This optimizes the write performance especially when the optimal IO size
>>>>>> of the backend store at the fuse daemon side is greater than the original
>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>> 4096 PAGE_SIZE).
>>>>>>
>>>>>> Note that this only increases the upper limit of the maximum request
>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>> negotiation with the fuse daemon.
>>>>>>
>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>> ---
>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>>>> and saw a ~20% performance boost.
>>>>>
>>>>> The 20% is against the 256 pages, I guess.
>>>>
>>>> Yeah I guess so.
>>>>
>>>>
>>>>> It would be interesting to
>>>>> see how the number of pages per request affects performance and
>>>>> why.
>>>>
>>>> To be honest, I'm not sure about the root cause of the performance
>>>> boost in Bytedance's case.
>>>>
>>>> In our internal use scenario, the optimal IO size of the backend store
>>>> at the fuse server side is, e.g. 4MB, and thus the maximum throughput
>>>> cannot be achieved with the current 256 pages per request. IOW the
>>>> backend store, e.g. a distributed parallel filesystem, gets optimal
>>>> performance when the data is aligned on a 4MB boundary.  I can ask my
>>>> colleague who implements the fuse server to give more background info
>>>> and the exact performance statistics.
>>>
>>> Here are more details about our internal use case:
>>>
>>> We have a fuse server used in our internal cloud scenarios, while the
>>> backend store is actually a distributed filesystem.  That is, the fuse
>>> server actually plays as the client of the remote distributed
>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>> backing store through network, while the remote distributed filesystem
>>> handles the IO requests, e.g. process the data from/to the persistent store.
>>>
>>> Then it comes the details of the remote distributed filesystem when it
>>> process the requested data with the persistent store.
>>>
>>> [1] The remote distributed filesystem uses, e.g. an 8+3 mode EC
>>> (erasure code), where each fixed-size piece of user data is split and
>>> stored as 8 data blocks plus 3 extra parity blocks. For example, with a
>>> 512KB block size, each 4MB of user data is split and stored as 8 (512KB)
>>> data blocks with 3 (512KB) parity blocks.
>>>
>>> It also utilizes striping to boost performance: for example, with the 8
>>> data disks and 3 parity disks of the above 8+3 mode, each stripe
>>> consists of 8 data blocks and 3 parity blocks.
>>>
>>> [2] To avoid data corruption on power loss, the remote distributed
>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>> request is received.  Because of the EC scheme described above, when a
>>> write fuse request is not aligned on the 4MB (stripe size) boundary,
>>> say it's 1MB in size, the other 3MB is read from the persistent store
>>> first, then the 3 parity blocks are computed from the complete 4MB
>>> stripe, and finally the 8 data blocks and 3 parity blocks are written
>>> down.
>>>
>>> Thus the write amplification is non-negligible and is the performance
>>> bottleneck when the fuse request size is less than the stripe size.
>>>
>>> Here are some simple performance statistics with varying request size.
>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>> improvement when the request size is increased to 4MB from 3.9MB.
> 
> I sort of understand the issue, although my guess is that this could
> be worked around in the client by coalescing writes.  This could be
> done by adding a small delay before sending a write request off to the
> network.
> 
> Would that work in your case?

It's possible but I'm not sure. I've asked my colleagues who work on
the fuse server and the backend store, though I haven't heard back yet.
But I guess it's not as simple as directly increasing the maximum FUSE
request size, and more complexity would get involved.

I can also understand the concern that this may increase the amount of
memory that can be pinned, and that a more generic usage scenario needs
to be considered.  I can make it a private patch for our internal
product.

Thanks for the suggestions and discussion.


-- 
Thanks,
Jingbo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-06 13:32           ` Jingbo Xu
@ 2024-03-06 15:45             ` Bernd Schubert
  2024-03-07  2:16               ` Jingbo Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Bernd Schubert @ 2024-03-06 15:45 UTC (permalink / raw)
  To: Jingbo Xu, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 3/6/24 14:32, Jingbo Xu wrote:
> 
> 
> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>
>>> Hi Miklos,
>>>
>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>
>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>>> single request is increased.
>>>>>>
>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>> This needs to be thought through, since we are increasing the amount
>>>>>> of memory that an unprivileged user is allowed to pin.
>>>>
>>>> Apart from the request size, the maximum number of background requests,
>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>> daemon), also limits the size of the memory that an unprivileged user
>>>> can pin.  But yes, increasing the maximum request size does increase
>>>> that bound proportionally.
>>>>
>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> This optimizes the write performance especially when the optimal IO size
>>>>>>> of the backend store at the fuse daemon side is greater than the original
>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>> 4096 PAGE_SIZE).
>>>>>>>
>>>>>>> Note that this only increases the upper limit of the maximum request
>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>> negotiation with the fuse daemon.
>>>>>>>
>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>> ---
>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>> Bytedance folks seem to have increased the maximum request size to 8M
>>>>>>> and saw a ~20% performance boost.
>>>>>>
>>>>>> The 20% is against the 256 pages, I guess.
>>>>>
>>>>> Yeah I guess so.
>>>>>
>>>>>
>>>>>> It would be interesting to
>>>>>> see how the number of pages per request affects performance and
>>>>>> why.
>>>>>
>>>>> To be honest, I'm not sure about the root cause of the performance
>>>>> boost in Bytedance's case.
>>>>>
>>>>> In our internal use scenario, the optimal IO size of the backend store
>>>>> at the fuse server side is, e.g. 4MB, and thus the maximum throughput
>>>>> cannot be achieved with the current 256 pages per request. IOW the
>>>>> backend store, e.g. a distributed parallel filesystem, gets optimal
>>>>> performance when the data is aligned on a 4MB boundary.  I can ask my
>>>>> colleague who implements the fuse server to give more background info
>>>>> and the exact performance statistics.
>>>>
>>>> Here are more details about our internal use case:
>>>>
>>>> We have a fuse server used in our internal cloud scenarios, while the
>>>> backend store is actually a distributed filesystem.  That is, the fuse
>>>> server actually plays as the client of the remote distributed
>>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>>> backing store through network, while the remote distributed filesystem
>>>> handles the IO requests, e.g. process the data from/to the persistent store.
>>>>
>>>> Then it comes the details of the remote distributed filesystem when it
>>>> process the requested data with the persistent store.
>>>>
>>>> [1] The remote distributed filesystem uses, e.g. an 8+3 mode EC
>>>> (erasure code), where each fixed-size piece of user data is split and
>>>> stored as 8 data blocks plus 3 extra parity blocks. For example, with a
>>>> 512KB block size, each 4MB of user data is split and stored as 8 (512KB)
>>>> data blocks with 3 (512KB) parity blocks.
>>>>
>>>> It also utilizes striping to boost performance: for example, with the 8
>>>> data disks and 3 parity disks of the above 8+3 mode, each stripe
>>>> consists of 8 data blocks and 3 parity blocks.
>>>>
>>>> [2] To avoid data corruption on power loss, the remote distributed
>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>> request is received.  Because of the EC scheme described above, when a
>>>> write fuse request is not aligned on the 4MB (stripe size) boundary,
>>>> say it's 1MB in size, the other 3MB is read from the persistent store
>>>> first, then the 3 parity blocks are computed from the complete 4MB
>>>> stripe, and finally the 8 data blocks and 3 parity blocks are written
>>>> down.
>>>>
>>>> Thus the write amplification is non-negligible and is the performance
>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>
>>>> Here are some simple performance statistics with varying request size.
>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>
>> I sort of understand the issue, although my guess is that this could
>> be worked around in the client by coalescing writes.  This could be
>> done by adding a small delay before sending a write request off to the
>> network.
>>
>> Would that work in your case?
> 
> It's possible but I'm not sure. I've asked my colleagues who working on
> the fuse server and the backend store, though have not been replied yet.
>  But I guess it's not as simple as increasing the maximum FUSE request
> size directly and thus more complexity gets involved.
> 
> I can also understand the concern that this may increase the risk of
> pinning more memory footprint, and a more generic using scenario needs
> to be considered.  I can make it a private patch for our internal product.
> 
> Thanks for the suggestions and discussion.

It also gets kind of solved in my fuse-over-io-uring branch - as long as
there are enough free ring entries. I'm going to add a flag there
indicating that other CQEs might be follow-up requests. It's really time
to post a new version.

Bernd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-06 15:45             ` Bernd Schubert
@ 2024-03-07  2:16               ` Jingbo Xu
  2024-03-07 22:06                 ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: Jingbo Xu @ 2024-03-07  2:16 UTC (permalink / raw)
  To: Bernd Schubert, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee

Hi Bernd,

On 3/6/24 11:45 PM, Bernd Schubert wrote:
> 
> 
> On 3/6/24 14:32, Jingbo Xu wrote:
>>
>>
>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>
>>>> Hi Miklos,
>>>>
>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>
>>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>
>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>>>> single request is increased.
>>>>>>>
>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>> This needs to be thought through, since the we are increasing the
>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>
>>>>> Apart from the request size, the maximum number of background requests,
>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>> daemon), also limits the size of the memory that an unprivileged user
>>>>> can pin.  But yes, it indeed increases the number proportionally by
>>>>> increasing the maximum request size.
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> This optimizes the write performance especially when the optimal IO size
>>>>>>>> of the backend store at the fuse daemon side is greater than the original
>>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>>> 4096 PAGE_SIZE).
>>>>>>>>
>>>>>>>> Be noted that this only increases the upper limit of the maximum request
>>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>
>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>> ---
>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>> Bytedance floks seems to had increased the maximum request size to 8M
>>>>>>>> and saw a ~20% performance boost.
>>>>>>>
>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>
>>>>>> Yeah I guess so.
>>>>>>
>>>>>>
>>>>>>> It would be interesting to
>>>>>>> see the how the number of pages per request affects performance and
>>>>>>> why.
>>>>>>
>>>>>> To be honest, I'm not sure the root cause of the performance boost in
>>>>>> bytedance's case.
>>>>>>
>>>>>> While in our internal use scenario, the optimal IO size of the backend
>>>>>> store at the fuse server side is, e.g. 4MB, and thus if the maximum
>>>>>> throughput can not be achieved with current 256 pages per request. IOW
>>>>>> the backend store, e.g. a distributed parallel filesystem, get optimal
>>>>>> performance when the data is aligned at 4MB boundary.  I can ask my folk
>>>>>> who implements the fuse server to give more background info and the
>>>>>> exact performance statistics.
>>>>>
>>>>> Here are more details about our internal use case:
>>>>>
>>>>> We have a fuse server used in our internal cloud scenarios, while the
>>>>> backend store is actually a distributed filesystem.  That is, the fuse
>>>>> server actually plays as the client of the remote distributed
>>>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>>>> backing store through network, while the remote distributed filesystem
>>>>> handles the IO requests, e.g. process the data from/to the persistent store.
>>>>>
>>>>> Then it comes the details of the remote distributed filesystem when it
>>>>> process the requested data with the persistent store.
>>>>>
>>>>> [1] The remote distributed filesystem uses, e.g. a 8+3 mode, EC
>>>>> (ErasureCode), where each fixed sized user data is split and stored as 8
>>>>> data blocks plus 3 extra parity blocks. For example, with 512 bytes
>>>>> block size, for each 4MB user data, it's split and stored as 8 (512
>>>>> bytes) data blocks with 3 (512 bytes) parity blocks.
>>>>>
>>>>> It also utilize the stripe technology to boost the performance, for
>>>>> example, there are 8 data disks and 3 parity disks in the above 8+3 mode
>>>>> example, in which each stripe consists of 8 data blocks and 3 parity
>>>>> blocks.
>>>>>
>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>> filesystem commit a O_SYNC write right away once a write (fuse) request
>>>>> received.  Since the EC described above, when the write fuse request is
>>>>> not aligned on 4MB (the stripe size) boundary, say it's 1MB in size, the
>>>>> other 3MB is read from the persistent store first, then compute the
>>>>> extra 3 parity blocks with the complete 4MB stripe, and finally write
>>>>> the 8 data blocks and 3 parity blocks down.
>>>>>
>>>>>
>>>>> Thus the write amplification is un-neglectable and is the performance
>>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>>
>>>>> Here are some simple performance statistics with varying request size.
>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>>
>>> I sort of understand the issue, although my guess is that this could
>>> be worked around in the client by coalescing writes.  This could be
>>> done by adding a small delay before sending a write request off to the
>>> network.
>>>
>>> Would that work in your case?
>>
>> It's possible but I'm not sure. I've asked my colleagues who working on
>> the fuse server and the backend store, though have not been replied yet.
>>  But I guess it's not as simple as increasing the maximum FUSE request
>> size directly and thus more complexity gets involved.
>>
>> I can also understand the concern that this may increase the risk of
>> pinning more memory footprint, and a more generic using scenario needs
>> to be considered.  I can make it a private patch for our internal product.
>>
>> Thanks for the suggestions and discussion.
> 
> It also gets kind of solved in my fuse-over-io-uring branch - as long as
> there are enough free ring entries. I'm going to add in a flag there
> that other CQEs might be follow up requests. Really time to post a new
> version.

Thanks for the information.  I've not read the fuse-over-io-uring branch
yet, but it sounds like it would be very helpful.  Would there be a flag
in the FUSE request indicating that it's one of the linked FUSE
requests?  And is this feature, i.e. linked FUSE requests, enabled only
when io-uring is used with FUSE?

-- 
Thanks,
Jingbo


* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-07  2:16               ` Jingbo Xu
@ 2024-03-07 22:06                 ` Bernd Schubert
  2024-03-28 16:46                   ` Sweet Tea Dorminy
  0 siblings, 1 reply; 14+ messages in thread
From: Bernd Schubert @ 2024-03-07 22:06 UTC (permalink / raw)
  To: Jingbo Xu, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee

Hi Jingbo,

On 3/7/24 03:16, Jingbo Xu wrote:
> Hi Bernd,
> 
> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>
>>
>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>
>>>
>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>
>>>>> Hi Miklos,
>>>>>
>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>
>>>>>>
>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>>
>>>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>>
>>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>>>>> single request is increased.
>>>>>>>>
>>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>>> This needs to be thought through, since the we are increasing the
>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>
>>>>>> Apart from the request size, the maximum number of background requests,
>>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>>> daemon), also limits the size of the memory that an unprivileged user
>>>>>> can pin.  But yes, it indeed increases the number proportionally by
>>>>>> increasing the maximum request size.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This optimizes the write performance especially when the optimal IO size
>>>>>>>>> of the backend store at the fuse daemon side is greater than the original
>>>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>>>> 4096 PAGE_SIZE).
>>>>>>>>>
>>>>>>>>> Be noted that this only increases the upper limit of the maximum request
>>>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>> ---
>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>> Bytedance floks seems to had increased the maximum request size to 8M
>>>>>>>>> and saw a ~20% performance boost.
>>>>>>>>
>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>
>>>>>>> Yeah I guess so.
>>>>>>>
>>>>>>>
>>>>>>>> It would be interesting to
>>>>>>>> see the how the number of pages per request affects performance and
>>>>>>>> why.
>>>>>>>
>>>>>>> To be honest, I'm not sure the root cause of the performance boost in
>>>>>>> bytedance's case.
>>>>>>>
>>>>>>> While in our internal use scenario, the optimal IO size of the backend
>>>>>>> store at the fuse server side is, e.g. 4MB, and thus if the maximum
>>>>>>> throughput can not be achieved with current 256 pages per request. IOW
>>>>>>> the backend store, e.g. a distributed parallel filesystem, get optimal
>>>>>>> performance when the data is aligned at 4MB boundary.  I can ask my folk
>>>>>>> who implements the fuse server to give more background info and the
>>>>>>> exact performance statistics.
>>>>>>
>>>>>> Here are more details about our internal use case:
>>>>>>
>>>>>> We have a fuse server used in our internal cloud scenarios, while the
>>>>>> backend store is actually a distributed filesystem.  That is, the fuse
>>>>>> server actually plays as the client of the remote distributed
>>>>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>>>>> backing store through network, while the remote distributed filesystem
>>>>>> handles the IO requests, e.g. process the data from/to the persistent store.
>>>>>>
>>>>>> Then it comes the details of the remote distributed filesystem when it
>>>>>> process the requested data with the persistent store.
>>>>>>
>>>>>> [1] The remote distributed filesystem uses, e.g. a 8+3 mode, EC
>>>>>> (ErasureCode), where each fixed sized user data is split and stored as 8
>>>>>> data blocks plus 3 extra parity blocks. For example, with 512 bytes
>>>>>> block size, for each 4MB user data, it's split and stored as 8 (512
>>>>>> bytes) data blocks with 3 (512 bytes) parity blocks.
>>>>>>
>>>>>> It also utilize the stripe technology to boost the performance, for
>>>>>> example, there are 8 data disks and 3 parity disks in the above 8+3 mode
>>>>>> example, in which each stripe consists of 8 data blocks and 3 parity
>>>>>> blocks.
>>>>>>
>>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>>> filesystem commit a O_SYNC write right away once a write (fuse) request
>>>>>> received.  Since the EC described above, when the write fuse request is
>>>>>> not aligned on 4MB (the stripe size) boundary, say it's 1MB in size, the
>>>>>> other 3MB is read from the persistent store first, then compute the
>>>>>> extra 3 parity blocks with the complete 4MB stripe, and finally write
>>>>>> the 8 data blocks and 3 parity blocks down.
>>>>>>
>>>>>>
>>>>>> Thus the write amplification is un-neglectable and is the performance
>>>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>>>
>>>>>> Here are some simple performance statistics with varying request size.
>>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>>>
>>>> I sort of understand the issue, although my guess is that this could
>>>> be worked around in the client by coalescing writes.  This could be
>>>> done by adding a small delay before sending a write request off to the
>>>> network.
>>>>
>>>> Would that work in your case?
>>>
>>> It's possible but I'm not sure. I've asked my colleagues who working on
>>> the fuse server and the backend store, though have not been replied yet.
>>>  But I guess it's not as simple as increasing the maximum FUSE request
>>> size directly and thus more complexity gets involved.
>>>
>>> I can also understand the concern that this may increase the risk of
>>> pinning more memory footprint, and a more generic using scenario needs
>>> to be considered.  I can make it a private patch for our internal product.
>>>
>>> Thanks for the suggestions and discussion.
>>
>> It also gets kind of solved in my fuse-over-io-uring branch - as long as
>> there are enough free ring entries. I'm going to add in a flag there
>> that other CQEs might be follow up requests. Really time to post a new
>> version.
> 
> Thanks for the information.  I've not read the fuse-over-io-uring branch
> yet, but sounds like it would be much helpful .  Would there be a flag
> in the FUSE request indicating it's one of the linked FUSE requests?  Is
> this feature, say linked FUSE requests, enabled only when io-uring is
> upon FUSE?


The current development branch is
https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8
(it sometimes gets rebases/force pushes and incompatible changes - the
corresponding libfuse branch is kept updated to match).

The patches need cleanup before I can send the next RFC version, and I
first want to move away from the fixed single request size (it's not so
nice to use 1MB requests when 4K would be sufficient, e.g. for metadata
and small IO).


I just checked: struct fuse_write_in has a write_flags field

/**
 * WRITE flags
 *
 * FUSE_WRITE_CACHE: delayed write from page cache, file handle is guessed
 * FUSE_WRITE_LOCKOWNER: lock_owner field is valid
 * FUSE_WRITE_KILL_SUIDGID: kill suid and sgid bits
 */
#define FUSE_WRITE_CACHE	(1 << 0)
#define FUSE_WRITE_LOCKOWNER	(1 << 1)
#define FUSE_WRITE_KILL_SUIDGID (1 << 2)


I guess we could extend that and add a flag indicating that more pages
are available and will come in the next request - that would avoid
guessing and timeouts on the daemon/server side.
With uring that would be helpful as well, but there one can also just
look through the available CQEs and see whether they belong together. I
don't think there is much control on the kernel side right now to submit
multiple requests together, but even without that I have seen
consecutive requests in a single CQE completion round.


Bernd


* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-07 22:06                 ` Bernd Schubert
@ 2024-03-28 16:46                   ` Sweet Tea Dorminy
  2024-03-28 22:08                     ` Bernd Schubert
  0 siblings, 1 reply; 14+ messages in thread
From: Sweet Tea Dorminy @ 2024-03-28 16:46 UTC (permalink / raw)
  To: Bernd Schubert, Jingbo Xu, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 3/7/24 17:06, Bernd Schubert wrote:
> Hi Jingbo,
> 
> On 3/7/24 03:16, Jingbo Xu wrote:
>> Hi Bernd,
>>
>> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>>
>>>
>>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>>
>>>>
>>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>
>>>>>> Hi Miklos,
>>>>>>
>>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>>>
>>>>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>>>
>>>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of a
>>>>>>>>>> single request is increased.
>>>>>>>>>
>>>>>>>>> The only worry is about where this memory is getting accounted to.
>>>>>>>>> This needs to be thought through, since the we are increasing the
>>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>>
>>>>>>> Apart from the request size, the maximum number of background requests,
>>>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>>>> daemon), also limits the size of the memory that an unprivileged user
>>>>>>> can pin.  But yes, it indeed increases the number proportionally by
>>>>>>> increasing the maximum request size.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This optimizes the write performance especially when the optimal IO size
>>>>>>>>>> of the backend store at the fuse daemon side is greater than the original
>>>>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>>>>> 4096 PAGE_SIZE).
>>>>>>>>>>
>>>>>>>>>> Be noted that this only increases the upper limit of the maximum request
>>>>>>>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>>> ---
>>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>>> Bytedance floks seems to had increased the maximum request size to 8M
>>>>>>>>>> and saw a ~20% performance boost.
>>>>>>>>>
>>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>>
>>>>>>>> Yeah I guess so.
>>>>>>>>
>>>>>>>>
>>>>>>>>> It would be interesting to
>>>>>>>>> see the how the number of pages per request affects performance and
>>>>>>>>> why.
>>>>>>>>
>>>>>>>> To be honest, I'm not sure the root cause of the performance boost in
>>>>>>>> bytedance's case.
>>>>>>>>
>>>>>>>> While in our internal use scenario, the optimal IO size of the backend
>>>>>>>> store at the fuse server side is, e.g. 4MB, and thus if the maximum
>>>>>>>> throughput can not be achieved with current 256 pages per request. IOW
>>>>>>>> the backend store, e.g. a distributed parallel filesystem, get optimal
>>>>>>>> performance when the data is aligned at 4MB boundary.  I can ask my folk
>>>>>>>> who implements the fuse server to give more background info and the
>>>>>>>> exact performance statistics.
>>>>>>>
>>>>>>> Here are more details about our internal use case:
>>>>>>>
>>>>>>> We have a fuse server used in our internal cloud scenarios, while the
>>>>>>> backend store is actually a distributed filesystem.  That is, the fuse
>>>>>>> server actually plays as the client of the remote distributed
>>>>>>> filesystem.  The fuse server forwards the fuse requests to the remote
>>>>>>> backing store through network, while the remote distributed filesystem
>>>>>>> handles the IO requests, e.g. process the data from/to the persistent store.
>>>>>>>
>>>>>>> Then it comes the details of the remote distributed filesystem when it
>>>>>>> process the requested data with the persistent store.
>>>>>>>
>>>>>>> [1] The remote distributed filesystem uses, e.g. a 8+3 mode, EC
>>>>>>> (ErasureCode), where each fixed sized user data is split and stored as 8
>>>>>>> data blocks plus 3 extra parity blocks. For example, with 512 bytes
>>>>>>> block size, for each 4MB user data, it's split and stored as 8 (512
>>>>>>> bytes) data blocks with 3 (512 bytes) parity blocks.
>>>>>>>
>>>>>>> It also utilize the stripe technology to boost the performance, for
>>>>>>> example, there are 8 data disks and 3 parity disks in the above 8+3 mode
>>>>>>> example, in which each stripe consists of 8 data blocks and 3 parity
>>>>>>> blocks.
>>>>>>>
>>>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>>>> filesystem commit a O_SYNC write right away once a write (fuse) request
>>>>>>> received.  Since the EC described above, when the write fuse request is
>>>>>>> not aligned on 4MB (the stripe size) boundary, say it's 1MB in size, the
>>>>>>> other 3MB is read from the persistent store first, then compute the
>>>>>>> extra 3 parity blocks with the complete 4MB stripe, and finally write
>>>>>>> the 8 data blocks and 3 parity blocks down.
>>>>>>>
>>>>>>>
>>>>>>> Thus the write amplification is un-neglectable and is the performance
>>>>>>> bottleneck when the fuse request size is less than the stripe size.
>>>>>>>
>>>>>>> Here are some simple performance statistics with varying request size.
>>>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the maximum
>>>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>>>>
>>>>> I sort of understand the issue, although my guess is that this could
>>>>> be worked around in the client by coalescing writes.  This could be
>>>>> done by adding a small delay before sending a write request off to the
>>>>> network.
>>>>>
>>>>> Would that work in your case?
>>>>
>>>> It's possible but I'm not sure. I've asked my colleagues who working on
>>>> the fuse server and the backend store, though have not been replied yet.
>>>>   But I guess it's not as simple as increasing the maximum FUSE request
>>>> size directly and thus more complexity gets involved.
>>>>
>>>> I can also understand the concern that this may increase the risk of
>>>> pinning more memory footprint, and a more generic using scenario needs
>>>> to be considered.  I can make it a private patch for our internal product.
>>>>
>>>> Thanks for the suggestions and discussion.
>>>
>>> It also gets kind of solved in my fuse-over-io-uring branch - as long as
>>> there are enough free ring entries. I'm going to add in a flag there
>>> that other CQEs might be follow up requests. Really time to post a new
>>> version.
>>
>> Thanks for the information.  I've not read the fuse-over-io-uring branch
>> yet, but sounds like it would be much helpful .  Would there be a flag
>> in the FUSE request indicating it's one of the linked FUSE requests?  Is
>> this feature, say linked FUSE requests, enabled only when io-uring is
>> upon FUSE?
> 
> 
> Current development branch is this
> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8
> (It sometimes gets rebase/force pushes and incompatible changes - the
> corresponding libfuse branch is also persistently updated).
> 
> Patches need clean up before I can send the next RFC version. And I
> first want to change fixed single request size (not so nice to use 1MB
> requests when 4K would be sufficient, for things like metadata and small
> IO).
> 

Let me know if there's something you'd like collaboration on -- 
fuse-over-io-uring sounds very exciting and I'd love to help out in any 
way that would be useful.

For our internal use case at Meta, the relevant backend store operates 
on 8MB chunks, so I'm also very interested in the simplicity of just 
opting in to receiving 8MB IOs from the kernel instead of needing to 
buffer our own 8MB IOs. But io_uring does seem like a plausible 
general-purpose improvement too, so either or both of these paths are 
interesting, and I'm working on gathering performance numbers on their 
relative merits.

Thanks!

Sweet Tea


* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-03-28 16:46                   ` Sweet Tea Dorminy
@ 2024-03-28 22:08                     ` Bernd Schubert
  0 siblings, 0 replies; 14+ messages in thread
From: Bernd Schubert @ 2024-03-28 22:08 UTC (permalink / raw)
  To: Sweet Tea Dorminy, Jingbo Xu, Miklos Szeredi
  Cc: linux-fsdevel, linux-kernel, zhangjiachen.jaycee



On 3/28/24 17:46, Sweet Tea Dorminy wrote:
> 
> 
> On 3/7/24 17:06, Bernd Schubert wrote:
>> Hi Jingbo,
>>
>> On 3/7/24 03:16, Jingbo Xu wrote:
>>> Hi Bernd,
>>>
>>> On 3/6/24 11:45 PM, Bernd Schubert wrote:
>>>>
>>>>
>>>> On 3/6/24 14:32, Jingbo Xu wrote:
>>>>>
>>>>>
>>>>> On 3/5/24 10:26 PM, Miklos Szeredi wrote:
>>>>>> On Mon, 26 Feb 2024 at 05:00, Jingbo Xu
>>>>>> <jefflexu@linux.alibaba.com> wrote:
>>>>>>>
>>>>>>> Hi Miklos,
>>>>>>>
>>>>>>> On 1/26/24 2:29 PM, Jingbo Xu wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>>>>>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu
>>>>>>>>>> <jefflexu@linux.alibaba.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>>>>
>>>>>>>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data
>>>>>>>>>>> size of a
>>>>>>>>>>> single request is increased.
>>>>>>>>>>
>>>>>>>>>> The only worry is about where this memory is getting accounted
>>>>>>>>>> to.
>>>>>>>>>> This needs to be thought through, since the we are increasing the
>>>>>>>>>> possible memory that an unprivileged user is allowed to pin.
>>>>>>>>
>>>>>>>> Apart from the request size, the maximum number of background
>>>>>>>> requests,
>>>>>>>> i.e. max_background (12 by default, and configurable by the fuse
>>>>>>>> daemon), also limits the size of the memory that an unprivileged
>>>>>>>> user
>>>>>>>> can pin.  But yes, it indeed increases the number proportionally by
>>>>>>>> increasing the maximum request size.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> This optimizes the write performance especially when the
>>>>>>>>>>> optimal IO size
>>>>>>>>>>> of the backend store at the fuse daemon side is greater than
>>>>>>>>>>> the original
>>>>>>>>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>>>>>>>>> 4096 PAGE_SIZE).
>>>>>>>>>>>
>>>>>>>>>>> Be noted that this only increases the upper limit of the
>>>>>>>>>>> maximum request
>>>>>>>>>>> size, while the real maximum request size relies on the
>>>>>>>>>>> FUSE_INIT
>>>>>>>>>>> negotiation with the fuse daemon.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>>>>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>>>>>>>>> ---
>>>>>>>>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>>>>>>>>> Bytedance floks seems to had increased the maximum request
>>>>>>>>>>> size to 8M
>>>>>>>>>>> and saw a ~20% performance boost.
>>>>>>>>>>
>>>>>>>>>> The 20% is against the 256 pages, I guess.
>>>>>>>>>
>>>>>>>>> Yeah I guess so.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> It would be interesting to
>>>>>>>>>> see the how the number of pages per request affects
>>>>>>>>>> performance and
>>>>>>>>>> why.
>>>>>>>>>
>>>>>>>>> To be honest, I'm not sure the root cause of the performance
>>>>>>>>> boost in
>>>>>>>>> bytedance's case.
>>>>>>>>>
>>>>>>>>> While in our internal use scenario, the optimal IO size of the
>>>>>>>>> backend
>>>>>>>>> store at the fuse server side is, e.g. 4MB, and thus if the
>>>>>>>>> maximum
>>>>>>>>> throughput can not be achieved with current 256 pages per
>>>>>>>>> request. IOW
>>>>>>>>> the backend store, e.g. a distributed parallel filesystem, get
>>>>>>>>> optimal
>>>>>>>>> performance when the data is aligned at 4MB boundary.  I can
>>>>>>>>> ask my folk
>>>>>>>>> who implements the fuse server to give more background info and
>>>>>>>>> the
>>>>>>>>> exact performance statistics.
>>>>>>>>
>>>>>>>> Here are more details about our internal use case:
>>>>>>>>
>>>>>>>> We have a fuse server used in our internal cloud scenarios,
>>>>>>>> while the
>>>>>>>> backend store is actually a distributed filesystem.  That is,
>>>>>>>> the fuse
>>>>>>>> server actually plays as the client of the remote distributed
>>>>>>>> filesystem.  The fuse server forwards the fuse requests to the
>>>>>>>> remote
>>>>>>>> backing store through network, while the remote distributed
>>>>>>>> filesystem
>>>>>>>> handles the IO requests, e.g. process the data from/to the
>>>>>>>> persistent store.
>>>>>>>>
>>>>>>>> Then it comes the details of the remote distributed filesystem
>>>>>>>> when it
>>>>>>>> process the requested data with the persistent store.
>>>>>>>>
>>>>>>>> [1] The remote distributed filesystem uses, e.g. a 8+3 mode, EC
>>>>>>>> (ErasureCode), where each fixed sized user data is split and
>>>>>>>> stored as 8
>>>>>>>> data blocks plus 3 extra parity blocks. For example, with 512 bytes
>>>>>>>> block size, for each 4MB user data, it's split and stored as 8 (512
>>>>>>>> bytes) data blocks with 3 (512 bytes) parity blocks.
>>>>>>>>
>>>>>>>> It also utilize the stripe technology to boost the performance, for
>>>>>>>> example, there are 8 data disks and 3 parity disks in the above
>>>>>>>> 8+3 mode
>>>>>>>> example, in which each stripe consists of 8 data blocks and 3
>>>>>>>> parity
>>>>>>>> blocks.
>>>>>>>>
>>>>>>>> [2] To avoid data corruption on power off, the remote distributed
>>>>>>>> filesystem commits an O_SYNC write right away once a write (fuse)
>>>>>>>> request is received.  Due to the EC scheme described above, when
>>>>>>>> the write fuse request is not aligned on a 4MB (the stripe size)
>>>>>>>> boundary, say it's 1MB in size, the other 3MB is read from the
>>>>>>>> persistent store first, then the extra 3 parity blocks are
>>>>>>>> computed over the complete 4MB stripe, and finally the 8 data
>>>>>>>> blocks and 3 parity blocks are written down.
>>>>>>>>
>>>>>>>> Thus the write amplification is non-negligible and is the
>>>>>>>> performance bottleneck when the fuse request size is less than
>>>>>>>> the stripe size.
>>>>>>>>
>>>>>>>> Here are some simple performance statistics with varying request
>>>>>>>> size.
>>>>>>>> With 4MB stripe size, there's ~3x bandwidth improvement when the
>>>>>>>> maximum
>>>>>>>> request size is increased from 256KB to 3.9MB, and another ~20%
>>>>>>>> improvement when the request size is increased to 4MB from 3.9MB.
>>>>>>
>>>>>> I sort of understand the issue, although my guess is that this could
>>>>>> be worked around in the client by coalescing writes.  This could be
>>>>>> done by adding a small delay before sending a write request off to
>>>>>> the
>>>>>> network.
>>>>>>
>>>>>> Would that work in your case?
>>>>>
>>>>> It's possible but I'm not sure. I've asked my colleagues who are
>>>>> working on the fuse server and the backend store, though they have
>>>>> not replied yet.  But I guess it's not as simple as directly
>>>>> increasing the maximum FUSE request size, and thus more complexity
>>>>> gets involved.
>>>>>
>>>>> I can also understand the concern that this may increase the risk of
>>>>> pinning a larger memory footprint, and that more generic usage
>>>>> scenarios need to be considered.  I can make it a private patch for
>>>>> our internal product.
>>>>>
>>>>> Thanks for the suggestions and discussion.
>>>>
>>>> It also gets kind of solved in my fuse-over-io-uring branch - as long
>>>> as there are enough free ring entries. I'm going to add a flag there
>>>> indicating that other CQEs might be follow-up requests. It's really
>>>> time to post a new version.
>>>
>>> Thanks for the information.  I've not read the fuse-over-io-uring branch
>>> yet, but it sounds like it would be very helpful.  Would there be a flag
>>> in the FUSE request indicating it's one of the linked FUSE requests?  Is
>>> this feature, say linked FUSE requests, enabled only when io-uring is
>>> used with FUSE?
>>
>>
>> Current development branch is this
>> https://github.com/bsbernd/linux/tree/fuse-uring-for-6.8
>> (It sometimes gets rebases/force pushes and incompatible changes - the
>> corresponding libfuse branch is also continually updated.)
>>
>> Patches need cleanup before I can send the next RFC version. And I
>> first want to change the fixed single request size (it's not so nice to
>> use 1MB requests when 4K would be sufficient, for things like metadata
>> and small IO).
>>
> 
> Let me know if there's something you'd like collaboration on --
> fuse_iouring sounds very exciting and I'd love to help out any way that
> would be useful.

With pleasure, I'll take whatever help you offer. Right now I'm jumping
between different projects, and I'm not too happy that I still haven't
sent out a new patch version yet. (And the atomic-open branch also needs
updates.)

> 
> For our internal use case at Meta, the relevant backend store operates
> on 8M chunks, so I'm also very interested in the simplicity of just
> opting in to receiving 8M IOs from the kernel instead of needing to
> buffer our own 8M IOs. But io_uring does seem like a plausible
> general-purpose improvement too, so either or both of these paths are
> interesting, and I'm working on gathering performance numbers on the
> relative merits.

Merging requests requires a bit of scanning through the CQEs on the
userspace side, since it all arrives randomly. I haven't even tried to
merge requests yet; I have just seen with debugging that the ring queue
gets filled with requests that belong together.

Out of interest, are you using libfuse or your own kernel interface
library? I would be quite interested to know whether the fuse-uring
kernel/userspace interface, and then the libfuse interface, matches your
needs. For example, our next-gen DDN file system runs in an spdk reactor
context and I had to update our own code base and libfuse to support
ring polling. So that's one project outside of libfuse's example/ that
already needed some changes... Another change I haven't implemented yet
in libfuse is ring request buffer registration with the file system (for
network RDMA).

Btw, I just ran into a bug that came up with FUSE_CAP_WRITEBACK_CACHE -
I definitely don't claim that all code paths are perfectly tested yet
(it's fixed now in the fuse-uring-for-6.8 branch).


Thanks,
Bernd

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
  2024-04-08  6:32 Sweet Tea Dorminy
@ 2024-04-08 14:26 ` Antonio SJ Musumeci
  0 siblings, 0 replies; 14+ messages in thread
From: Antonio SJ Musumeci @ 2024-04-08 14:26 UTC (permalink / raw)
  To: Sweet Tea Dorminy, Jingbo Xu
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, zhangjiachen.jaycee

On 4/8/24 01:32, Sweet Tea Dorminy wrote:
> 
> On 2024-01-26 01:29, Jingbo Xu wrote:
>> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>>>
>>>
>>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com>
>>>> wrote:
>>>>>
>>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>>>
>>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of
>>>>> a
>>>>> single request is increased.
>>>>
>>>> The only worry is about where this memory is getting accounted to.
>>>> This needs to be thought through, since we are increasing the
>>>> possible memory that an unprivileged user is allowed to pin.
>>
>> Apart from the request size, the maximum number of background requests,
>> i.e. max_background (12 by default, and configurable by the fuse
>> daemon), also limits the amount of memory that an unprivileged user
>> can pin.  But yes, increasing the maximum request size does increase
>> that bound proportionally.
>>
>>
>>>
>>>> It would be interesting to
>>>> see the how the number of pages per request affects performance and
>>>> why.
>>>
>>> To be honest, I'm not sure what the root cause of the performance
>>> boost in Bytedance's case is.
>>>
>>> While in our internal use scenario, the optimal IO size of the backend
>>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>>> throughput cannot be achieved with the current 256 pages per request.
>>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>>> optimal performance when the data is aligned on a 4MB boundary.  I can
>>> ask my colleague who implements the fuse server to give more
>>> background info and the exact performance statistics.
>>
>> Here are more details about our internal use case:
>>
>> We have a fuse server used in our internal cloud scenarios, where the
>> backend store is actually a distributed filesystem.  That is, the fuse
>> server acts as the client of the remote distributed filesystem.  The
>> fuse server forwards the fuse requests to the remote backing store over
>> the network, while the remote distributed filesystem handles the IO
>> requests, e.g. moving the data from/to the persistent store.
>>
>> Now for the details of how the remote distributed filesystem processes
>> the requested data with the persistent store.
>>
>> [1] The remote distributed filesystem uses, e.g., an 8+3 erasure-code
>> (EC) scheme, where each fixed-size piece of user data is split and
>> stored as 8 data blocks plus 3 extra parity blocks. For example, with a
>> 512-byte block size, each 4MB of user data is split and stored as 8
>> (512-byte) data blocks with 3 (512-byte) parity blocks.
>>
>> It also utilizes striping to boost the performance: for example, there
>> are 8 data disks and 3 parity disks in the above 8+3 mode example, in
>> which each stripe consists of 8 data blocks and 3 parity blocks.
>>
>> [2] To avoid data corruption on power off, the remote distributed
>> filesystem commits an O_SYNC write right away once a write (fuse)
>> request is received.  Due to the EC scheme described above, when the
>> write fuse request is not aligned on a 4MB (the stripe size) boundary,
>> say it's 1MB in size, the other 3MB is read from the persistent store
>> first, then the extra 3 parity blocks are computed over the complete
>> 4MB stripe, and finally the 8 data blocks and 3 parity blocks are
>> written down.
>>
>> Thus the write amplification is non-negligible and is the performance
>> bottleneck when the fuse request size is less than the stripe size.
>>
>> Here are some simple performance statistics with varying request size.
>> With 4MB stripe size, there's ~3x bandwidth improvement when the
>> maximum
>> request size is increased from 256KB to 3.9MB, and another ~20%
>> improvement when the request size is increased to 4MB from 3.9MB.
> 
> To add my own performance statistics in a microbenchmark:
> 
> Tested on both a small VM and large hardware, with a suitably large
> FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write handlers
> did basically nothing but read the data buffers (memcmp() each 8 bytes
> of data provided against a variable), I ran fio with a 128M blocksize,
> end_fsync=1, and the psync IO engine, with 4 parallel jobs. Throughput
> was as follows over variable write_size, in MB/s.
> 
> write_size  machine1 machine2
> 32M	1071	6425
> 16M	1002	6445
> 8M	890	6443
> 4M	713	6342
> 2M	557	6290
> 1M	404	6201
> 512K	268	6041
> 256K	156	5782
> 
> Even on the fast machine, increasing the buffer size to 8M is worth 3.9%
> over keeping it at 1M, and it is worth over 2x on the small VM. We are
> striving to improve the ingestion speed in particular, as we have seen
> it be a limiting factor on some machines, and there's a clear plateau
> reached around 8M. While most fuse servers would likely not benefit from
> this, and others would benefit from fuse passthrough instead, it does
> seem like a performance win.
> 
> Perhaps, in analogy to the soft and hard limits on pipe size,
> FUSE_MAX_MAX_PAGES could be increased and treated as the maximum
> possible hard limit for max_write, while the default soft limit could
> stay at 1M, thereby allowing folks to opt into the new behavior if they
> care about the performance more than the memory?
> 
> Sweet Tea

As I recall, the concern about increased message sizes is that it gives a
process the ability to allocate non-trivial amounts of kernel memory.
Perhaps the limits could be expanded only if the server has the
CAP_SYS_ADMIN capability.
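A minimal sketch of that suggestion, with the capability check stubbed
out so the clamping logic can be read (and run) standalone; in the
kernel the check would be capable(CAP_SYS_ADMIN), and all names and the
1024-page value here are illustrative, not actual kernel code:

```c
#include <stdbool.h>

/*
 * Keep the historic 256-page (1 MB with 4 KB pages) ceiling for
 * unprivileged servers, and allow the larger ceiling only for a
 * privileged daemon.
 */
#define FUSE_DEFAULT_MAX_MAX_PAGES 256
#define FUSE_PRIV_MAX_MAX_PAGES    1024

static bool capable_sys_admin;	/* stand-in for capable(CAP_SYS_ADMIN) */

static unsigned int fuse_max_pages_limit(void)
{
	return capable_sys_admin ? FUSE_PRIV_MAX_MAX_PAGES
				 : FUSE_DEFAULT_MAX_MAX_PAGES;
}

/* Clamp the max_pages value a daemon asks for in FUSE_INIT. */
static unsigned int clamp_init_max_pages(unsigned int requested)
{
	unsigned int limit = fuse_max_pages_limit();

	return requested > limit ? limit : requested;
}
```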


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit
@ 2024-04-08  6:32 Sweet Tea Dorminy
  2024-04-08 14:26 ` Antonio SJ Musumeci
  0 siblings, 1 reply; 14+ messages in thread
From: Sweet Tea Dorminy @ 2024-04-08  6:32 UTC (permalink / raw)
  To: Jingbo Xu
  Cc: Miklos Szeredi, linux-fsdevel, linux-kernel, zhangjiachen.jaycee


On 2024-01-26 01:29, Jingbo Xu wrote:
> On 1/24/24 8:47 PM, Jingbo Xu wrote:
>> 
>> 
>> On 1/24/24 8:23 PM, Miklos Szeredi wrote:
>>> On Wed, 24 Jan 2024 at 08:05, Jingbo Xu <jefflexu@linux.alibaba.com> 
>>> wrote:
>>>> 
>>>> From: Xu Ji <laoji.jx@alibaba-inc.com>
>>>> 
>>>> Increase FUSE_MAX_MAX_PAGES limit, so that the maximum data size of 
>>>> a
>>>> single request is increased.
>>> 
>>> The only worry is about where this memory is getting accounted to.
>>> This needs to be thought through, since we are increasing the
>>> possible memory that an unprivileged user is allowed to pin.
> 
> Apart from the request size, the maximum number of background requests,
> i.e. max_background (12 by default, and configurable by the fuse
> daemon), also limits the amount of memory that an unprivileged user
> can pin.  But yes, increasing the maximum request size does increase
> that bound proportionally.
> 
> 
>> 
>>> 
>>> 
>>> 
>>>> 
>>>> This optimizes the write performance especially when the optimal IO 
>>>> size
>>>> of the backend store at the fuse daemon side is greater than the 
>>>> original
>>>> maximum request size (i.e. 1MB with 256 FUSE_MAX_MAX_PAGES and
>>>> 4096 PAGE_SIZE).
>>>> 
>>>> Note that this only increases the upper limit of the maximum request
>>>> size, while the real maximum request size relies on the FUSE_INIT
>>>> negotiation with the fuse daemon.
>>>> 
>>>> Signed-off-by: Xu Ji <laoji.jx@alibaba-inc.com>
>>>> Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
>>>> ---
>>>> I'm not sure if 1024 is adequate for FUSE_MAX_MAX_PAGES, as the
>>>> Bytedance folks seem to have increased the maximum request size to
>>>> 8M and saw a ~20% performance boost.
>>> 
>>> The 20% is against the 256 pages, I guess.
>> 
>> Yeah I guess so.
>> 
>> 
>>> It would be interesting to
>>> see the how the number of pages per request affects performance and
>>> why.
>> 
>> To be honest, I'm not sure what the root cause of the performance
>> boost in Bytedance's case is.
>> 
>> While in our internal use scenario, the optimal IO size of the backend
>> store at the fuse server side is, e.g., 4MB, and thus the maximum
>> throughput cannot be achieved with the current 256 pages per request.
>> IOW the backend store, e.g. a distributed parallel filesystem, gets
>> optimal performance when the data is aligned on a 4MB boundary.  I can
>> ask my colleague who implements the fuse server to give more
>> background info and the exact performance statistics.
> 
> Here are more details about our internal use case:
> 
> We have a fuse server used in our internal cloud scenarios, where the
> backend store is actually a distributed filesystem.  That is, the fuse
> server acts as the client of the remote distributed filesystem.  The
> fuse server forwards the fuse requests to the remote backing store over
> the network, while the remote distributed filesystem handles the IO
> requests, e.g. moving the data from/to the persistent store.
> 
> Now for the details of how the remote distributed filesystem processes
> the requested data with the persistent store.
> 
> [1] The remote distributed filesystem uses, e.g., an 8+3 erasure-code
> (EC) scheme, where each fixed-size piece of user data is split and
> stored as 8 data blocks plus 3 extra parity blocks. For example, with a
> 512-byte block size, each 4MB of user data is split and stored as 8
> (512-byte) data blocks with 3 (512-byte) parity blocks.
> 
> It also utilizes striping to boost the performance: for example, there
> are 8 data disks and 3 parity disks in the above 8+3 mode example, in
> which each stripe consists of 8 data blocks and 3 parity blocks.
> 
> [2] To avoid data corruption on power off, the remote distributed
> filesystem commits an O_SYNC write right away once a write (fuse)
> request is received.  Due to the EC scheme described above, when the
> write fuse request is not aligned on a 4MB (the stripe size) boundary,
> say it's 1MB in size, the other 3MB is read from the persistent store
> first, then the extra 3 parity blocks are computed over the complete
> 4MB stripe, and finally the 8 data blocks and 3 parity blocks are
> written down.
> 
> Thus the write amplification is non-negligible and is the performance
> bottleneck when the fuse request size is less than the stripe size.
> 
> Here are some simple performance statistics with varying request size.
> With 4MB stripe size, there's ~3x bandwidth improvement when the 
> maximum
> request size is increased from 256KB to 3.9MB, and another ~20%
> improvement when the request size is increased to 4MB from 3.9MB.

To add my own performance statistics in a microbenchmark:

Tested on both a small VM and large hardware, with a suitably large
FUSE_MAX_MAX_PAGES, using a simple fuse filesystem whose write handlers
did basically nothing but read the data buffers (memcmp() each 8 bytes
of data provided against a variable), I ran fio with a 128M blocksize,
end_fsync=1, and the psync IO engine, with 4 parallel jobs. Throughput
was as follows over variable write_size, in MB/s.

write_size  machine1 machine2
32M	1071	6425
16M	1002	6445
8M	890	6443
4M	713	6342
2M	557	6290
1M	404	6201
512K	268	6041
256K	156	5782

Even on the fast machine, increasing the buffer size to 8M is worth 3.9%
over keeping it at 1M, and it is worth over 2x on the small VM. We are
striving to improve the ingestion speed in particular, as we have seen
it be a limiting factor on some machines, and there's a clear plateau
reached around 8M. While most fuse servers would likely not benefit from
this, and others would benefit from fuse passthrough instead, it does
seem like a performance win.

Perhaps, in analogy to the soft and hard limits on pipe size,
FUSE_MAX_MAX_PAGES could be increased and treated as the maximum
possible hard limit for max_write, while the default soft limit could
stay at 1M, thereby allowing folks to opt into the new behavior if they
care about the performance more than the memory?

Sweet Tea
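A sketch of the soft/hard split proposed above, by analogy with
/proc/sys/fs/pipe-max-size (the 2048-page hard limit and all helper
names are hypothetical, not actual kernel code):

```c
/*
 * The compile-time constant becomes the hard ceiling, while a runtime
 * tunable (defaulting to the current 256 pages, i.e. 1 MB) is what
 * FUSE_INIT negotiation is actually clamped to.
 */
#define FUSE_HARD_MAX_PAGES 2048		/* 8 MB with 4 KB pages */

static unsigned int fuse_soft_max_pages = 256;	/* tunable, default 1 MB */

/* Would back e.g. a sysctl write handler; returns -1 (-EINVAL) on excess. */
static int set_fuse_soft_max_pages(unsigned int val)
{
	if (val > FUSE_HARD_MAX_PAGES)
		return -1;
	fuse_soft_max_pages = val;
	return 0;
}

/* Clamp what a fuse daemon negotiates in FUSE_INIT. */
static unsigned int clamp_negotiated_pages(unsigned int requested)
{
	return requested > fuse_soft_max_pages ? fuse_soft_max_pages
					       : requested;
}
```

An administrator (or a server that cares about memory less than
throughput) would raise the soft limit at runtime, while everyone else
keeps today's behavior.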

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2024-04-08 14:26 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-24  7:05 [PATCH] fuse: increase FUSE_MAX_MAX_PAGES limit Jingbo Xu
2024-01-24 12:23 ` Miklos Szeredi
2024-01-24 12:47   ` Jingbo Xu
2024-01-26  6:29     ` Jingbo Xu
2024-02-26  4:00       ` Jingbo Xu
2024-03-05 14:26         ` Miklos Szeredi
2024-03-06 13:32           ` Jingbo Xu
2024-03-06 15:45             ` Bernd Schubert
2024-03-07  2:16               ` Jingbo Xu
2024-03-07 22:06                 ` Bernd Schubert
2024-03-28 16:46                   ` Sweet Tea Dorminy
2024-03-28 22:08                     ` Bernd Schubert
2024-04-08  6:32 Sweet Tea Dorminy
2024-04-08 14:26 ` Antonio SJ Musumeci

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).