linux-fsdevel.vger.kernel.org archive mirror
* fuse scalability part 1
@ 2015-05-18 15:13 Miklos Szeredi
  2015-09-24  1:13 ` Ashish Samant
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Miklos Szeredi @ 2015-05-18 15:13 UTC (permalink / raw)
  To: linux-fsdevel, linux-kernel, fuse-devel, Ashish Samant, Srinivas Eeda

This part splits out an "input queue" and a "processing queue" from the
monolithic "fuse connection", each with its own spinlock.

The end of the patchset adds the ability to "clone" a fuse connection.  This
means that, instead of having to read/write requests/answers on a single fuse
device fd, the fuse daemon can have multiple distinct file descriptors open.
Each of these can be used to receive requests and send answers; currently the
only constraint is that a request must be answered on the same fd it was read
from.

This can be extended further to allow binding a device clone to a specific CPU
or NUMA node.
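
For illustration, a rough userspace sketch of how a daemon could obtain
such a clone and serve requests on it (error handling and the actual
request parsing are omitted; this assumes the FUSE_DEV_IOC_CLONE ioctl
as it was eventually merged into mainline <linux/fuse.h>):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fuse.h>

/* Open a new /dev/fuse fd and bind it to the same connection as master_fd. */
static int clone_fuse_fd(int master_fd)
{
	uint32_t src = master_fd;
	int clone_fd = open("/dev/fuse", O_RDWR | O_CLOEXEC);

	if (clone_fd == -1)
		return -1;
	if (ioctl(clone_fd, FUSE_DEV_IOC_CLONE, &src) == -1) {
		close(clone_fd);
		return -1;
	}
	return clone_fd;
}

/* One worker loop per cloned fd: read a request, answer on the same fd. */
static void serve(int fd)
{
	char buf[FUSE_MIN_READ_BUFFER];

	for (;;) {
		ssize_t n = read(fd, buf, sizeof(buf));	/* next request */
		if (n <= 0)
			break;
		/* ... handle the request, then write() the reply to 'fd' ... */
	}
}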

Patchset is available here:

  git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next

Libfuse patches adding support for "clone_fd" option:

  git://git.code.sf.net/p/fuse/fuse clone_fd

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fuse scalability part 1
  2015-05-18 15:13 fuse scalability part 1 Miklos Szeredi
@ 2015-09-24  1:13 ` Ashish Samant
       [not found] ` <20150814101453.GB31364@frosties>
  2015-09-24 19:17 ` Ashish Samant
  2 siblings, 0 replies; 7+ messages in thread
From: Ashish Samant @ 2015-09-24  1:13 UTC (permalink / raw)
  To: Miklos Szeredi, fuse-devel; +Cc: linux-fsdevel, linux-kernel, Srinivas Eeda


On 05/18/2015 08:13 AM, Miklos Szeredi wrote:
> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each with its own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection.  This
> means that, instead of having to read/write requests/answers on a single fuse
> device fd, the fuse daemon can have multiple distinct file descriptors open.
> Each of these can be used to receive requests and send answers; currently the
> only constraint is that a request must be answered on the same fd it was read
> from.
>
> This can be extended further to allow binding a device clone to a specific CPU
> or NUMA node.
>
> Patchset is available here:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for "clone_fd" option:
>
>    git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos
>
>
We did some performance testing without these patches and with these
patches (with the -o clone_fd option specified). Sorry for the delay in
getting these done. We did two types of tests:

1. Throughput test: We ran some parallel dd tests to read from and write
to a FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs.
The performance here is almost equal to that of the per-NUMA patches we
submitted a while back.

1) Writes to single mount

dd processes                 throughput(without patchset)            throughput(with patchset)
in parallel

4                                633 Mb/s                               606 Mb/s
8                                583.2 Mb/s                             561.6 Mb/s
16                               436 Mb/s                               640.6 Mb/s
32                               500.5 Mb/s                             718.1 Mb/s
64                               440.7 Mb/s                             1276.8 Mb/s
128                              526.2 Mb/s                             2343.4 Mb/s

2) Reading from single mount

dd processes                 throughput(without patchset)            throughput(with patchset)
in parallel

4                                1171 Mb/s                              1059 Mb/s
8                                1626 Mb/s                              1677 Mb/s
16                               1014 Mb/s                              2240.6 Mb/s
32                               807.6 Mb/s                             2512.9 Mb/s
64                               985.8 Mb/s                             2870.3 Mb/s
128                              1355 Mb/s                              2996.5 Mb/s



2. Spinlock access times test: We also ran some tests within the kernel
to check the time spent accessing the spinlocks per request in both
cases. As can be seen, the time taken per request to access the spinlocks
in the kernel code throughout the lifetime of the request is 30x to 100x
lower in the second case (with the patchset).


dd processes                  Time/req(without patchset)            Time/req(with patchset)
in parallel

4                                0.025 ms                            0.00685 ms
8                                0.174 ms                            0.0071 ms
16                               0.9825 ms                           0.0115 ms
32                               2.4965 ms                           0.0315 ms
64                               4.8335 ms                           0.071 ms
128                              5.972 ms                            0.1812 ms
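
One way to take such a per-request measurement (a minimal sketch, purely
illustrative; the helper below is hypothetical and not part of the
patchset) is to accumulate the wait time for each lock acquisition into a
counter carried by the request and report the total when the request
completes:

#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/ktime.h>

/* Hypothetical: time how long we wait for 'lock' and add it to *wait_ns. */
static inline void fuse_spin_lock_timed(spinlock_t *lock, u64 *wait_ns)
{
	ktime_t t0 = ktime_get();

	spin_lock(lock);
	*wait_ns += ktime_to_ns(ktime_sub(ktime_get(), t0));
}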

In conclusion, splitting fc->lock into multiple locks and splitting the
request queues definitely helps performance.

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [fuse-devel] fuse scalability part 1
       [not found] ` <20150814101453.GB31364@frosties>
@ 2015-09-24  6:30   ` Miklos Szeredi
  0 siblings, 0 replies; 7+ messages in thread
From: Miklos Szeredi @ 2015-09-24  6:30 UTC (permalink / raw)
  To: Goswin von Brederlow
  Cc: fuse-devel, Ashish Samant, Srinivas Eeda, Linux-Fsdevel,
	Kernel Mailing List

On Fri, Aug 14, 2015 at 12:14 PM, Goswin von Brederlow
<goswin-v-b@web.de> wrote:
> On Mon, May 18, 2015 at 05:13:36PM +0200, Miklos Szeredi wrote:
>> This part splits out an "input queue" and a "processing queue" from the
>> monolithic "fuse connection", each with its own spinlock.
>>
>> The end of the patchset adds the ability to "clone" a fuse connection.  This
>> means that, instead of having to read/write requests/answers on a single fuse
>> device fd, the fuse daemon can have multiple distinct file descriptors open.
>> Each of these can be used to receive requests and send answers; currently the
>> only constraint is that a request must be answered on the same fd it was read
>> from.
>>
>> This can be extended further to allow binding a device clone to a specific CPU
>> or NUMA node.
>
> How will requests be distributed across clones?
>
> Is the idea here to start one clone per core and have IO requests
> originating from one core to be processed by the fuse clone on the
> same core? I remember there was a noticeable speedup when request and
> processing were on the same core.
>
> How is the clone for each request chosen? What if there is no clone
> pinned to the same core? Will it pick the clone nearest in NUMA terms?
> Will it round-robin? Will it load balance to the clone with the least
> number of requests pending? What if one clone stops processing requests?

Good questions.  I guess the first implementation should be the simplest
possible, e.g. use the queue that matches (in this order):

 - CPU
 - NUMA node
 - any (round robin or whatever)

I wouldn't worry about load balancing and unresponsive queues until
such issues come up in real life.
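
A sketch of what that selection order could look like (illustrative only;
the structures and helpers below are hypothetical, since the posted
patchset does not implement CPU/NUMA binding yet):

#include <linux/smp.h>
#include <linux/topology.h>
#include <linux/atomic.h>

struct fuse_iqueue;

/* Hypothetical: per-connection table of cloned input queues. */
struct fuse_clone_queues {
	struct fuse_iqueue **cpu_queue;	 /* indexed by CPU, entries may be NULL */
	struct fuse_iqueue **node_queue; /* indexed by NUMA node, entries may be NULL */
	struct fuse_iqueue **all_queues; /* every queue, used for the fallback */
	unsigned int nr_queues;
	atomic_t next;			 /* round-robin cursor */
};

static struct fuse_iqueue *fuse_pick_queue(struct fuse_clone_queues *q)
{
	int cpu = raw_smp_processor_id();
	struct fuse_iqueue *fiq;

	fiq = q->cpu_queue[cpu];		/* 1. queue bound to this CPU */
	if (fiq)
		return fiq;

	fiq = q->node_queue[cpu_to_node(cpu)];	/* 2. queue bound to this NUMA node */
	if (fiq)
		return fiq;

	/* 3. any queue, round robin */
	return q->all_queues[(unsigned int)atomic_inc_return(&q->next) % q->nr_queues];
}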

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fuse scalability part 1
  2015-05-18 15:13 fuse scalability part 1 Miklos Szeredi
  2015-09-24  1:13 ` Ashish Samant
       [not found] ` <20150814101453.GB31364@frosties>
@ 2015-09-24 19:17 ` Ashish Samant
  2015-09-25 12:11   ` Miklos Szeredi
  2 siblings, 1 reply; 7+ messages in thread
From: Ashish Samant @ 2015-09-24 19:17 UTC (permalink / raw)
  To: Miklos Szeredi, linux-fsdevel, linux-kernel, fuse-devel, Srinivas Eeda

[-- Attachment #1: Type: text/plain, Size: 1854 bytes --]


On 05/18/2015 08:13 AM, Miklos Szeredi wrote:
> This part splits out an "input queue" and a "processing queue" from the
> monolithic "fuse connection", each with its own spinlock.
>
> The end of the patchset adds the ability to "clone" a fuse connection.  This
> means that, instead of having to read/write requests/answers on a single fuse
> device fd, the fuse daemon can have multiple distinct file descriptors open.
> Each of these can be used to receive requests and send answers; currently the
> only constraint is that a request must be answered on the same fd it was read
> from.
>
> This can be extended further to allow binding a device clone to a specific CPU
> or NUMA node.
>
> Patchset is available here:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git for-next
>
> Libfuse patches adding support for "clone_fd" option:
>
>    git://git.code.sf.net/p/fuse/fuse clone_fd
>
> Thanks,
> Miklos
>
>
Resending the numbers as attachments because my email client messes up
the formatting of the message. Sorry for the noise.

We did some performance testing without these patches and with these
patches (with the -o clone_fd option specified). We did two types of tests:

1. Throughput test: We ran some parallel dd tests to read from and write
to a FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs.
The performance here is almost equal to that of the per-NUMA patches we
submitted a while back. Please find the results attached.

2. Spinlock access times test: We also ran some tests within the kernel
to check the time spent accessing the spinlocks per request in both
cases. As can be seen, the time taken per request to access the spinlocks
in the kernel code throughout the lifetime of the request is 30x to 100x
lower in the second case (with the patchset). Please find the results attached.

Thanks,
Ashish



[-- Attachment #2: dd_test_results.txt --]
[-- Type: text/plain, Size: 1274 bytes --]

1) Writes to single mount

dd processes                 throughput(without patchset)            throughput(with patchset)
in parallel

4                                633 Mb/s                               606 Mb/s
8                                583.2 Mb/s                             561.6 Mb/s
16                               436 Mb/s                               640.6 Mb/s
32                               500.5 Mb/s                             718.1 Mb/s
64                               440.7 Mb/s                             1276.8 Mb/s
128                              526.2 Mb/s                             2343.4 Mb/s

2) Reading from single mount
 
dd processes                 throughput(without patchset)            throughput(with patchset)
in parallel

4                               1171 Mb/s                               1059 Mb/s
8                               1626 Mb/s                               1677 Mb/s
16                              1014 Mb/s                               2240.6 Mb/s
32                              807.6 Mb/s                              2512.9 Mb/s
64                              985.8 Mb/s                              2870.3 Mb/s
128                             1355 Mb/s                               2996.5 Mb/s 

[-- Attachment #3: spinlock_access_time_test.txt --]
[-- Type: text/plain, Size: 580 bytes --]

dd processes                  Time/req(without patchset)            Time/req(with patchset)
in parallel

4                                0.025 ms                            0.00685 ms
8                                0.174 ms                            0.0071 ms
16                               0.9825 ms                           0.0115 ms
32                               2.4965 ms                           0.0315 ms
64                               4.8335 ms                           0.071 ms
128                              5.972 ms                            0.1812 ms 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fuse scalability part 1
  2015-09-24 19:17 ` Ashish Samant
@ 2015-09-25 12:11   ` Miklos Szeredi
  2015-09-25 17:53     ` Ashish Samant
  2015-09-29  6:18     ` Srinivas Eeda
  0 siblings, 2 replies; 7+ messages in thread
From: Miklos Szeredi @ 2015-09-25 12:11 UTC (permalink / raw)
  To: Ashish Samant
  Cc: Linux-Fsdevel, Kernel Mailing List, fuse-devel, Srinivas Eeda

On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant <ashish.samant@oracle.com> wrote:

> We did some performance testing without these patches and with these patches
> (with the -o clone_fd option specified). We did two types of tests:
>
> 1. Throughput test: We ran some parallel dd tests to read from and write to a
> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
> performance here is almost equal to that of the per-NUMA patches we submitted
> a while back. Please find the results attached.

Interesting.  This means that serving the request on a different NUMA
node from the one where the request originated doesn't appear to make
the performance much worse.

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fuse scalability part 1
  2015-09-25 12:11   ` Miklos Szeredi
@ 2015-09-25 17:53     ` Ashish Samant
  2015-09-29  6:18     ` Srinivas Eeda
  1 sibling, 0 replies; 7+ messages in thread
From: Ashish Samant @ 2015-09-25 17:53 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: Linux-Fsdevel, Kernel Mailing List, fuse-devel, Srinivas Eeda


On 09/25/2015 05:11 AM, Miklos Szeredi wrote:
> On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant <ashish.samant@oracle.com> wrote:
>
>> We did some performance testing without these patches and with these patches
>> (with the -o clone_fd option specified). We did two types of tests:
>>
>> 1. Throughput test: We ran some parallel dd tests to read from and write to a
>> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
>> performance here is almost equal to that of the per-NUMA patches we submitted
>> a while back. Please find the results attached.
> Interesting.  This means that serving the request on a different NUMA
> node from the one where the request originated doesn't appear to make
> the performance much worse.
>
> Thanks,
> Miklos
Yes. The main performance gain is due to the reduced contention on one
spinlock (fc->lock), especially with a large number of requests.
Splitting fc->fiq per cloned device will definitely improve performance
further, and we can then experiment with per-NUMA / per-CPU cloned devices.

Thanks,
Ashish

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: fuse scalability part 1
  2015-09-25 12:11   ` Miklos Szeredi
  2015-09-25 17:53     ` Ashish Samant
@ 2015-09-29  6:18     ` Srinivas Eeda
  1 sibling, 0 replies; 7+ messages in thread
From: Srinivas Eeda @ 2015-09-29  6:18 UTC (permalink / raw)
  To: Miklos Szeredi, Ashish Samant
  Cc: Linux-Fsdevel, Kernel Mailing List, fuse-devel

Hi Miklos,

On 09/25/2015 05:11 AM, Miklos Szeredi wrote:
> On Thu, Sep 24, 2015 at 9:17 PM, Ashish Samant <ashish.samant@oracle.com> wrote:
>
>> We did some performance testing without these patches and with these patches
>> (with the -o clone_fd option specified). We did two types of tests:
>>
>> 1. Throughput test: We ran some parallel dd tests to read from and write to a
>> FUSE-based database fs on a system with 8 NUMA nodes and 288 CPUs. The
>> performance here is almost equal to that of the per-NUMA patches we submitted
>> a while back. Please find the results attached.
> Interesting.  This means that serving the request on a different NUMA
> node from the one where the request originated doesn't appear to make
> the performance much worse.
With the new change, contention on the spinlock is significantly reduced,
so the latency caused by NUMA is not visible. Even in the earlier case,
scalability was not a big problem if we bound all processes (the fuse
worker and the user dd threads) to a single NUMA node. The problem was
only seen when threads spread out across NUMA nodes and contended for
the spinlock.


>
> Thanks,
> Miklos


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2015-09-29  6:18 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-18 15:13 fuse scalability part 1 Miklos Szeredi
2015-09-24  1:13 ` Ashish Samant
     [not found] ` <20150814101453.GB31364@frosties>
2015-09-24  6:30   ` [fuse-devel] " Miklos Szeredi
2015-09-24 19:17 ` Ashish Samant
2015-09-25 12:11   ` Miklos Szeredi
2015-09-25 17:53     ` Ashish Samant
2015-09-29  6:18     ` Srinivas Eeda
