2x difference between multi-thread and multi-process for same number of CTXs

All of lore.kernel.org
 help / color / mirror / Atom feed

* 2x difference between multi-thread and multi-process for same number of CTXs
@ 2018-01-24 16:22 Rohit Zambre
       [not found] ` <CAJ84Q-aFfcXxaJS5rApcoow6SBjfZAvt71_OKo1ORorXAOZKbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Rohit Zambre @ 2018-01-24 16:22 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3172 bytes --]

Hi,

I have been trying to pinpoint the cause for a mlx5 behavior/problem
but haven't been able to yet. It would be great if you could share
your thoughts or point me towards what I should be looking at.

I am running a simple, sender-receiver micro-benchmark (like
ib_write_bw) to calculate the message rate for RDMA writes with
increasing number of endpoints. Each endpoint has its own independent
resources: Context, PD, QP, CQ, MR, buffer. None of the resources are
shared between the endpoints. I am running this benchmark for N
endpoints in two ways: using multiple threads and using multiple
processes.

In the multi-threaded case, I have 1 process that creates the N
endpoints with all of its resources but uses N threads to drive the
endpoints; each thread will post only on its QP and poll only its CQ.
In the multi process case, I have N processes; each process creates
and drives only 1 endpoint and its resources. In both cases,
ibv_open_device, ibv_alloc_pd, ibv_reg_mr, ibv_create_cq and
ibv_create_qp each have been called N times. My understanding (from
reading the user-space driver code) is that a new struct ibv_context
is allocated every time ibv_open_device is called regardless of
whether all the ibv_open_device calls have been called by the same
process or different processes. So, in both cases, there are N
endpoints on the sender-node system. Theoretically, the message rates
should then be the same for both cases when using multiple endpoints.

However, in the graph attached, you will see that while both the
multi-thread and multi-process cases scale with increase in endpoints,
there is a >2x difference between the two cases when we have 8 CTXs.
The graph shows RDMA-write message rates for 2-byte messages. I
collected these numbers on the Thor cluster of the HPC Advisory
Council. A thor node's specs are: 16 cores on a socket, 1 ConnectX-4
card (with 1 active port: mlx5_0), RHEL 7.2 and kernel
3.10.0-327.el7.x86_64. Using the binding options of MPI and OpenMP, I
have made sure that each process/thread is bound to its own core. I am
using MPI to only launch the processes and exchange connection
information; all of the communication is through the libibverbs API.

Since I wasn't able to find any distinctions in the user-space code, I
have been going through the kernel code to find the cause of this
behavior. While I haven't been able to pinpoint on something specific,
I have noted that the current struct is used in ib_umem_get, which is
called by mmap, ibv_reg_mr and ibv_poll_cq. I'm currently studying
these to find the cause but again I am not sure if I am in the right
direction. Here are some questions for you:

(1) First, is this a surprising result or is the 2x difference
actually expected behavior?
(2) Could you point me to places in the kernel code that I should be
studying to understand the cause of this behavior? OR do you have any
suggestions for experiments I should try to possibly eliminate
potential causes?

If you would like me to share my micro-benchmark code so you can
reproduce the results, let me know.

Thank you,
Rohit Zambre
Ph.D. Student, Computer Engineering
University of California, Irvine

[-- Attachment #2: write_mr_small_multiProcVSmultiThread_mlx5.pdf --]
[-- Type: application/pdf, Size: 6498 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CAJ84Q-aFfcXxaJS5rApcoow6SBjfZAvt71_OKo1ORorXAOZKbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found] ` <CAJ84Q-aFfcXxaJS5rApcoow6SBjfZAvt71_OKo1ORorXAOZKbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-24 17:08   ` Jason Gunthorpe
       [not found]     ` <20180124170830.GD16845-uk2M96/98Pc@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2018-01-24 17:08 UTC (permalink / raw)
  To: Rohit Zambre; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:

> (1) First, is this a surprising result or is the 2x difference
> actually expected behavior?

Maybe, there are lots of locks in one process, for instance glibc's
malloc has locking - so any memory allocation anywhere in the
applications processing path will cause lock contention. The issue may
have nothing to do with RDMA.

There is also some locking inside the userspace mlx5 driver that may
contend depending on how your process has set things up.

The entire send path is in user space so there is no kernel component
here.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <20180124170830.GD16845-uk2M96/98Pc@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]     ` <20180124170830.GD16845-uk2M96/98Pc@public.gmane.org>
@ 2018-01-24 20:53       ` Rohit Zambre
       [not found]         ` <CAJ84Q-aMW7PXFFuODm6RN=SO342=tJ4_eSJ2TB0b8DLrxgwtGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Rohit Zambre @ 2018-01-24 20:53 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>
>> (1) First, is this a surprising result or is the 2x difference
>> actually expected behavior?
>
> Maybe, there are lots of locks in one process, for instance glibc's
> malloc has locking - so any memory allocation anywhere in the
> applications processing path will cause lock contention. The issue may
> have nothing to do with RDMA.

There are no mallocs in the critical path of the benchmark. In the 1
process multi-threaded case, the mallocs for resource creation are all
before creating the OpenMP parallel region. Here's a snapshot of the
parallel region that contains the critical path:

#pragma omp parallel
        {
            int i = omp_get_thread_num(), k;
            int cqe_count = 0;
            int post_count = 0;
            int comp_count = 0;
            int posts = 0;

            struct ibv_send_wr *bad_send_wqe;
            struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)

            #pragma omp single
            { // only one thread will execute this
                MPI_Barrier(MPI_COMM_WORLD);
            } // implicit barrier for the threads
            if (i == 0)
                t_start = MPI_Wtime();

            /* Critical Path Start */
            while (post_count < posts_per_qp || comp_count <
posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
                /* Post */
                posts = min( (posts_per_qp - post_count), (qp_depth -
(post_count - comp_count) ) );
                for (k = 0; k < posts; k++)
                    ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
                post_count += posts;
                /* Poll */
                if (comp_count < posts_per_qp) {
                     cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
num_comps = qp_depth
                     comp_count += cqe_count;
                 }
             } /* Critical Path End */
             if (i == 0)
                 t_end = MPI_Wtime();
         }

> There is also some locking inside the userspace mlx5 driver that may
> contend depending on how your process has set things up.

I missed mentioning this but I collected the numbers with
MLX5_SINGLE_THREADED set since none of the resources were being shared
between the threads. So, the userspace driver wasn't taking any locks.

> The entire send path is in user space so there is no kernel component
> here.

Yes, that's correct. My concern was that during resource creation, the
kernel was maybe sharing some resource for a process or that some sort
of multiplexing was occurring to hardware contexts through control
groups. Is it safe for me to conclude that separate, independent
contexts/bfregs are being assigned when a process calls
ibv_open_device multiple times?

> Jason

Thanks,
Rohit Zambre
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CAJ84Q-aMW7PXFFuODm6RN=SO342=tJ4_eSJ2TB0b8DLrxgwtGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]         ` <CAJ84Q-aMW7PXFFuODm6RN=SO342=tJ4_eSJ2TB0b8DLrxgwtGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-24 22:00           ` Anuj Kalia
       [not found]             ` <CADPSxAg62gpCTdD9rqVfz+hznVpa_yHig1PRvMeHs2SWc1fvsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-01-24 22:22           ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Anuj Kalia @ 2018-01-24 22:00 UTC (permalink / raw)
  To: Rohit Zambre; +Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA

IMO this is probably an implementation issue in the benchmarking code,
and I'm curious to know the issue if you find it.

It's possible to achieve 150+ million writes per second with a
multi-threaded process. See Figure 12 in our paper:
http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
benchmark code is available:
https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.

--Anuj

On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>
>>> (1) First, is this a surprising result or is the 2x difference
>>> actually expected behavior?
>>
>> Maybe, there are lots of locks in one process, for instance glibc's
>> malloc has locking - so any memory allocation anywhere in the
>> applications processing path will cause lock contention. The issue may
>> have nothing to do with RDMA.
>
> There are no mallocs in the critical path of the benchmark. In the 1
> process multi-threaded case, the mallocs for resource creation are all
> before creating the OpenMP parallel region. Here's a snapshot of the
> parallel region that contains the critical path:
>
> #pragma omp parallel
>         {
>             int i = omp_get_thread_num(), k;
>             int cqe_count = 0;
>             int post_count = 0;
>             int comp_count = 0;
>             int posts = 0;
>
>             struct ibv_send_wr *bad_send_wqe;
>             struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
> sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)
>
>             #pragma omp single
>             { // only one thread will execute this
>                 MPI_Barrier(MPI_COMM_WORLD);
>             } // implicit barrier for the threads
>             if (i == 0)
>                 t_start = MPI_Wtime();
>
>             /* Critical Path Start */
>             while (post_count < posts_per_qp || comp_count <
> posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
>                 /* Post */
>                 posts = min( (posts_per_qp - post_count), (qp_depth -
> (post_count - comp_count) ) );
>                 for (k = 0; k < posts; k++)
>                     ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
>                 post_count += posts;
>                 /* Poll */
>                 if (comp_count < posts_per_qp) {
>                      cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
> num_comps = qp_depth
>                      comp_count += cqe_count;
>                  }
>              } /* Critical Path End */
>              if (i == 0)
>                  t_end = MPI_Wtime();
>          }
>
>> There is also some locking inside the userspace mlx5 driver that may
>> contend depending on how your process has set things up.
>
> I missed mentioning this but I collected the numbers with
> MLX5_SINGLE_THREADED set since none of the resources were being shared
> between the threads. So, the userspace driver wasn't taking any locks.
>
>> The entire send path is in user space so there is no kernel component
>> here.
>
> Yes, that's correct. My concern was that during resource creation, the
> kernel was maybe sharing some resource for a process or that some sort
> of multiplexing was occurring to hardware contexts through control
> groups. Is it safe for me to conclude that separate, independent
> contexts/bfregs are being assigned when a process calls
> ibv_open_device multiple times?
>
>> Jason
>
> Thanks,
> Rohit Zambre
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CADPSxAg62gpCTdD9rqVfz+hznVpa_yHig1PRvMeHs2SWc1fvsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]             ` <CADPSxAg62gpCTdD9rqVfz+hznVpa_yHig1PRvMeHs2SWc1fvsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-26 16:44               ` Rohit Zambre
       [not found]                 ` <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ@mail.gmail.com>
  0 siblings, 1 reply; 10+ messages in thread
From: Rohit Zambre @ 2018-01-26 16:44 UTC (permalink / raw)
  To: Anuj Kalia
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Balaji,
	Pavan, Aparna Chandramowlishwaran

[-- Attachment #1: Type: text/plain, Size: 4786 bytes --]

On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> IMO this is probably an implementation issue in the benchmarking code,
> and I'm curious to know the issue if you find it.
>
> It's possible to achieve 150+ million writes per second with a
> multi-threaded process. See Figure 12 in our paper:
> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
> benchmark code is available:
> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.

I read through your paper and code (great work!) but I don't think it
is an implementation issue. I am comparing my numbers against Figure
12b of your paper since the CX3 cluster is the closest to my testbed
which is a single-port ConnectX-4 card. Hugepages is the only
optimization we use; we don't use doorbell batching, unsignaled
completions, inlining, etc. However, the numbers are comparable: ~27M
writes/second from our benchmark without your optimizations VS ~35M
writes/second from your benchmark with all the optimizations. The 150M
writes/s on the CIB cluster is on a dual-port card. More importantly,
the ~35M writes/s on the CX3 cluster is >1.5x lower than the ~55M
writes/s that we see with multi processes benchmark without
optimizations.

> --Anuj
>
> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>
>>>> (1) First, is this a surprising result or is the 2x difference
>>>> actually expected behavior?
>>>
>>> Maybe, there are lots of locks in one process, for instance glibc's
>>> malloc has locking - so any memory allocation anywhere in the
>>> applications processing path will cause lock contention. The issue may
>>> have nothing to do with RDMA.
>>
>> There are no mallocs in the critical path of the benchmark. In the 1
>> process multi-threaded case, the mallocs for resource creation are all
>> before creating the OpenMP parallel region. Here's a snapshot of the
>> parallel region that contains the critical path:
>>
>> #pragma omp parallel
>>         {
>>             int i = omp_get_thread_num(), k;
>>             int cqe_count = 0;
>>             int post_count = 0;
>>             int comp_count = 0;
>>             int posts = 0;
>>
>>             struct ibv_send_wr *bad_send_wqe;
>>             struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
>> sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)
>>
>>             #pragma omp single
>>             { // only one thread will execute this
>>                 MPI_Barrier(MPI_COMM_WORLD);
>>             } // implicit barrier for the threads
>>             if (i == 0)
>>                 t_start = MPI_Wtime();
>>
>>             /* Critical Path Start */
>>             while (post_count < posts_per_qp || comp_count <
>> posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
>>                 /* Post */
>>                 posts = min( (posts_per_qp - post_count), (qp_depth -
>> (post_count - comp_count) ) );
>>                 for (k = 0; k < posts; k++)
>>                     ret = ibv_post_send(qp[i], &send_wqe[i], &bad_send_wqe);
>>                 post_count += posts;
>>                 /* Poll */
>>                 if (comp_count < posts_per_qp) {
>>                      cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
>> num_comps = qp_depth
>>                      comp_count += cqe_count;
>>                  }
>>              } /* Critical Path End */
>>              if (i == 0)
>>                  t_end = MPI_Wtime();
>>          }
>>
>>> There is also some locking inside the userspace mlx5 driver that may
>>> contend depending on how your process has set things up.
>>
>> I missed mentioning this but I collected the numbers with
>> MLX5_SINGLE_THREADED set since none of the resources were being shared
>> between the threads. So, the userspace driver wasn't taking any locks.
>>
>>> The entire send path is in user space so there is no kernel component
>>> here.
>>
>> Yes, that's correct. My concern was that during resource creation, the
>> kernel was maybe sharing some resource for a process or that some sort
>> of multiplexing was occurring to hardware contexts through control
>> groups. Is it safe for me to conclude that separate, independent
>> contexts/bfregs are being assigned when a process calls
>> ibv_open_device multiple times?
>>
>>> Jason
>>
>> Thanks,
>> Rohit Zambre
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: multi-threadVSproc.zip --]
[-- Type: application/zip, Size: 34513 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ@mail.gmail.com>]

[parent not found: <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]                   ` <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-26 20:14                     ` Rohit Zambre
       [not found]                       ` <CAJ84Q-avu53954aECPKBHv2RR3KeH3ra4S+duLsFQeb3pp2+ww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Rohit Zambre @ 2018-01-26 20:14 UTC (permalink / raw)
  To: Anuj Kalia
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Balaji,
	Pavan, Aparna Chandramowlishwaran

On Fri, Jan 26, 2018 at 12:13 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> ConnectX-4 is closer to Connect-IB. There was a 4x jump in message rate from
> ConnectX-3 to Connect-IB, way less from CIB to CX4. 35 M/s is the maximum
> that CX3 can do, so it's not a CPU bottleneck.

The fact that the Connect-IB card on NetApp's cluster is dual-port
would also contribute to higher message rates?

> I'll take a look at your code but it might be a while. If you can run our
> benchmark code I can be more helpful.

I see you are using sudo in run-servers.sh to run your benchmark code.
What is sudo needed for so I can workaround what is needed? Don't have
sudo access on the cluster that I am running on.

> --Anuj
>
>
> On Jan 26, 2018 11:44 AM, "Rohit Zambre" <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>
> On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> IMO this is probably an implementation issue in the benchmarking code,
>> and I'm curious to know the issue if you find it.
>>
>> It's possible to achieve 150+ million writes per second with a
>> multi-threaded process. See Figure 12 in our paper:
>> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
>> benchmark code is available:
>> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.
>
> I read through your paper and code (great work!) but I don't think it
> is an implementation issue. I am comparing my numbers against Figure
> 12b of your paper since the CX3 cluster is the closest to my testbed
> which is a single-port ConnectX-4 card. Hugepages is the only
> optimization we use; we don't use doorbell batching, unsignaled
> completions, inlining, etc. However, the numbers are comparable: ~27M
> writes/second from our benchmark without your optimizations VS ~35M
> writes/second from your benchmark with all the optimizations. The 150M
> writes/s on the CIB cluster is on a dual-port card. More importantly,
> the ~35M writes/s on the CX3 cluster is >1.5x lower than the ~55M
> writes/s that we see with multi processes benchmark without
> optimizations.
>
>> --Anuj
>>
>> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
>>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>>
>>>>> (1) First, is this a surprising result or is the 2x difference
>>>>> actually expected behavior?
>>>>
>>>> Maybe, there are lots of locks in one process, for instance glibc's
>>>> malloc has locking - so any memory allocation anywhere in the
>>>> applications processing path will cause lock contention. The issue may
>>>> have nothing to do with RDMA.
>>>
>>> There are no mallocs in the critical path of the benchmark. In the 1
>>> process multi-threaded case, the mallocs for resource creation are all
>>> before creating the OpenMP parallel region. Here's a snapshot of the
>>> parallel region that contains the critical path:
>>>
>>> #pragma omp parallel
>>>         {
>>>             int i = omp_get_thread_num(), k;
>>>             int cqe_count = 0;
>>>             int post_count = 0;
>>>             int comp_count = 0;
>>>             int posts = 0;
>>>
>>>             struct ibv_send_wr *bad_send_wqe;
>>>             struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
>>> sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)
>>>
>>>             #pragma omp single
>>>             { // only one thread will execute this
>>>                 MPI_Barrier(MPI_COMM_WORLD);
>>>             } // implicit barrier for the threads
>>>             if (i == 0)
>>>                 t_start = MPI_Wtime();
>>>
>>>             /* Critical Path Start */
>>>             while (post_count < posts_per_qp || comp_count <
>>> posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
>>>                 /* Post */
>>>                 posts = min( (posts_per_qp - post_count), (qp_depth -
>>> (post_count - comp_count) ) );
>>>                 for (k = 0; k < posts; k++)
>>>                     ret = ibv_post_send(qp[i], &send_wqe[i],
>>> &bad_send_wqe);
>>>                 post_count += posts;
>>>                 /* Poll */
>>>                 if (comp_count < posts_per_qp) {
>>>                      cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
>>> num_comps = qp_depth
>>>                      comp_count += cqe_count;
>>>                  }
>>>              } /* Critical Path End */
>>>              if (i == 0)
>>>                  t_end = MPI_Wtime();
>>>          }
>>>
>>>> There is also some locking inside the userspace mlx5 driver that may
>>>> contend depending on how your process has set things up.
>>>
>>> I missed mentioning this but I collected the numbers with
>>> MLX5_SINGLE_THREADED set since none of the resources were being shared
>>> between the threads. So, the userspace driver wasn't taking any locks.
>>>
>>>> The entire send path is in user space so there is no kernel component
>>>> here.
>>>
>>> Yes, that's correct. My concern was that during resource creation, the
>>> kernel was maybe sharing some resource for a process or that some sort
>>> of multiplexing was occurring to hardware contexts through control
>>> groups. Is it safe for me to conclude that separate, independent
>>> contexts/bfregs are being assigned when a process calls
>>> ibv_open_device multiple times?
>>>
>>>> Jason
>>>
>>> Thanks,
>>> Rohit Zambre
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CAJ84Q-avu53954aECPKBHv2RR3KeH3ra4S+duLsFQeb3pp2+ww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]                       ` <CAJ84Q-avu53954aECPKBHv2RR3KeH3ra4S+duLsFQeb3pp2+ww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-26 22:34                         ` Anuj Kalia
       [not found]                           ` <CADPSxAgMeCdFsnrCAKLtGZp_aJp78LMB2VUObsr3_s-p6vu-MA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 10+ messages in thread
From: Anuj Kalia @ 2018-01-26 22:34 UTC (permalink / raw)
  To: Rohit Zambre
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Balaji,
	Pavan, Aparna Chandramowlishwaran

rdma_bench can do 70+ million writes/sec with one port (CX5 though). I
don't think that's the issue.

sudo is needed only for hugepages via shmget, unless I'm missing
something. It seems I don't use hugepages in rw_tput_sender, so it
might just work without sudo.

--Anuj

On Fri, Jan 26, 2018 at 3:14 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
> On Fri, Jan 26, 2018 at 12:13 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> ConnectX-4 is closer to Connect-IB. There was a 4x jump in message rate from
>> ConnectX-3 to Connect-IB, way less from CIB to CX4. 35 M/s is the maximum
>> that CX3 can do, so it's not a CPU bottleneck.
>
> The fact that the Connect-IB card on NetApp's cluster is dual-port
> would also contribute to higher message rates?
>
>> I'll take a look at your code but it might be a while. If you can run our
>> benchmark code I can be more helpful.
>
> I see you are using sudo in run-servers.sh to run your benchmark code.
> What is sudo needed for so I can workaround what is needed? Don't have
> sudo access on the cluster that I am running on.
>
>> --Anuj
>>
>>
>> On Jan 26, 2018 11:44 AM, "Rohit Zambre" <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>>
>> On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> IMO this is probably an implementation issue in the benchmarking code,
>>> and I'm curious to know the issue if you find it.
>>>
>>> It's possible to achieve 150+ million writes per second with a
>>> multi-threaded process. See Figure 12 in our paper:
>>> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
>>> benchmark code is available:
>>> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.
>>
>> I read through your paper and code (great work!) but I don't think it
>> is an implementation issue. I am comparing my numbers against Figure
>> 12b of your paper since the CX3 cluster is the closest to my testbed
>> which is a single-port ConnectX-4 card. Hugepages is the only
>> optimization we use; we don't use doorbell batching, unsignaled
>> completions, inlining, etc. However, the numbers are comparable: ~27M
>> writes/second from our benchmark without your optimizations VS ~35M
>> writes/second from your benchmark with all the optimizations. The 150M
>> writes/s on the CIB cluster is on a dual-port card. More importantly,
>> the ~35M writes/s on the CX3 cluster is >1.5x lower than the ~55M
>> writes/s that we see with multi processes benchmark without
>> optimizations.
>>
>>> --Anuj
>>>
>>> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>>>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
>>>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>>>
>>>>>> (1) First, is this a surprising result or is the 2x difference
>>>>>> actually expected behavior?
>>>>>
>>>>> Maybe, there are lots of locks in one process, for instance glibc's
>>>>> malloc has locking - so any memory allocation anywhere in the
>>>>> applications processing path will cause lock contention. The issue may
>>>>> have nothing to do with RDMA.
>>>>
>>>> There are no mallocs in the critical path of the benchmark. In the 1
>>>> process multi-threaded case, the mallocs for resource creation are all
>>>> before creating the OpenMP parallel region. Here's a snapshot of the
>>>> parallel region that contains the critical path:
>>>>
>>>> #pragma omp parallel
>>>>         {
>>>>             int i = omp_get_thread_num(), k;
>>>>             int cqe_count = 0;
>>>>             int post_count = 0;
>>>>             int comp_count = 0;
>>>>             int posts = 0;
>>>>
>>>>             struct ibv_send_wr *bad_send_wqe;
>>>>             struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
>>>> sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)
>>>>
>>>>             #pragma omp single
>>>>             { // only one thread will execute this
>>>>                 MPI_Barrier(MPI_COMM_WORLD);
>>>>             } // implicit barrier for the threads
>>>>             if (i == 0)
>>>>                 t_start = MPI_Wtime();
>>>>
>>>>             /* Critical Path Start */
>>>>             while (post_count < posts_per_qp || comp_count <
>>>> posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
>>>>                 /* Post */
>>>>                 posts = min( (posts_per_qp - post_count), (qp_depth -
>>>> (post_count - comp_count) ) );
>>>>                 for (k = 0; k < posts; k++)
>>>>                     ret = ibv_post_send(qp[i], &send_wqe[i],
>>>> &bad_send_wqe);
>>>>                 post_count += posts;
>>>>                 /* Poll */
>>>>                 if (comp_count < posts_per_qp) {
>>>>                      cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
>>>> num_comps = qp_depth
>>>>                      comp_count += cqe_count;
>>>>                  }
>>>>              } /* Critical Path End */
>>>>              if (i == 0)
>>>>                  t_end = MPI_Wtime();
>>>>          }
>>>>
>>>>> There is also some locking inside the userspace mlx5 driver that may
>>>>> contend depending on how your process has set things up.
>>>>
>>>> I missed mentioning this but I collected the numbers with
>>>> MLX5_SINGLE_THREADED set since none of the resources were being shared
>>>> between the threads. So, the userspace driver wasn't taking any locks.
>>>>
>>>>> The entire send path is in user space so there is no kernel component
>>>>> here.
>>>>
>>>> Yes, that's correct. My concern was that during resource creation, the
>>>> kernel was maybe sharing some resource for a process or that some sort
>>>> of multiplexing was occurring to hardware contexts through control
>>>> groups. Is it safe for me to conclude that separate, independent
>>>> contexts/bfregs are being assigned when a process calls
>>>> ibv_open_device multiple times?
>>>>
>>>>> Jason
>>>>
>>>> Thanks,
>>>> Rohit Zambre
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <CADPSxAgMeCdFsnrCAKLtGZp_aJp78LMB2VUObsr3_s-p6vu-MA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]                           ` <CADPSxAgMeCdFsnrCAKLtGZp_aJp78LMB2VUObsr3_s-p6vu-MA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-02-07  0:02                             ` Rohit Zambre
  0 siblings, 0 replies; 10+ messages in thread
From: Rohit Zambre @ 2018-02-07  0:02 UTC (permalink / raw)
  To: Anuj Kalia
  Cc: Jason Gunthorpe, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Balaji,
	Pavan, Aparna Chandramowlishwaran

Hi everyone,

The issue of performance difference between my multi-thread and
multi-process has been solved. Couple of things (1) I wasn't using any
features/optimizations such as inlining, postlist and unsignaled
completions that efficient/rdma_bench was using (2) there was bug in
my program which crept in only when I didn't use the postlist feature:
I was sharing a variable for error-checking between threads causing
store misses.

With all the optimizations in, I am able to achieve ~140M messages/s
on the ConnectX-4 card with both multi-thread and multi-proc, the same
as efficient/rdma_bench.

Thank you for your help!

-Rohit Zambre

On Fri, Jan 26, 2018 at 4:34 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> rdma_bench can do 70+ million writes/sec with one port (CX5 though). I
> don't think that's the issue.
>
> sudo is needed only for hugepages via shmget, unless I'm missing
> something. It seems I don't use hugepages in rw_tput_sender, so it
> might just work without sudo.
>
> --Anuj
>
> On Fri, Jan 26, 2018 at 3:14 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>> On Fri, Jan 26, 2018 at 12:13 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>> ConnectX-4 is closer to Connect-IB. There was a 4x jump in message rate from
>>> ConnectX-3 to Connect-IB, way less from CIB to CX4. 35 M/s is the maximum
>>> that CX3 can do, so it's not a CPU bottleneck.
>>
>> The fact that the Connect-IB card on NetApp's cluster is dual-port
>> would also contribute to higher message rates?
>>
>>> I'll take a look at your code but it might be a while. If you can run our
>>> benchmark code I can be more helpful.
>>
>> I see you are using sudo in run-servers.sh to run your benchmark code.
>> What is sudo needed for so I can workaround what is needed? Don't have
>> sudo access on the cluster that I am running on.
>>
>>> --Anuj
>>>
>>>
>>> On Jan 26, 2018 11:44 AM, "Rohit Zambre" <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>>>
>>> On Wed, Jan 24, 2018 at 4:00 PM, Anuj Kalia <anujkaliaiitd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>>> IMO this is probably an implementation issue in the benchmarking code,
>>>> and I'm curious to know the issue if you find it.
>>>>
>>>> It's possible to achieve 150+ million writes per second with a
>>>> multi-threaded process. See Figure 12 in our paper:
>>>> http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf. Our
>>>> benchmark code is available:
>>>> https://github.com/efficient/rdma_bench/tree/master/rw-tput-sender.
>>>
>>> I read through your paper and code (great work!) but I don't think it
>>> is an implementation issue. I am comparing my numbers against Figure
>>> 12b of your paper since the CX3 cluster is the closest to my testbed
>>> which is a single-port ConnectX-4 card. Hugepages is the only
>>> optimization we use; we don't use doorbell batching, unsignaled
>>> completions, inlining, etc. However, the numbers are comparable: ~27M
>>> writes/second from our benchmark without your optimizations VS ~35M
>>> writes/second from your benchmark with all the optimizations. The 150M
>>> writes/s on the CIB cluster is on a dual-port card. More importantly,
>>> the ~35M writes/s on the CX3 cluster is >1.5x lower than the ~55M
>>> writes/s that we see with multi processes benchmark without
>>> optimizations.
>>>
>>>> --Anuj
>>>>
>>>> On Wed, Jan 24, 2018 at 3:53 PM, Rohit Zambre <rzambre-sXc7qaQca9o@public.gmane.org> wrote:
>>>>> On Wed, Jan 24, 2018 at 11:08 AM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
>>>>>> On Wed, Jan 24, 2018 at 10:22:53AM -0600, Rohit Zambre wrote:
>>>>>>
>>>>>>> (1) First, is this a surprising result or is the 2x difference
>>>>>>> actually expected behavior?
>>>>>>
>>>>>> Maybe, there are lots of locks in one process, for instance glibc's
>>>>>> malloc has locking - so any memory allocation anywhere in the
>>>>>> applications processing path will cause lock contention. The issue may
>>>>>> have nothing to do with RDMA.
>>>>>
>>>>> There are no mallocs in the critical path of the benchmark. In the 1
>>>>> process multi-threaded case, the mallocs for resource creation are all
>>>>> before creating the OpenMP parallel region. Here's a snapshot of the
>>>>> parallel region that contains the critical path:
>>>>>
>>>>> #pragma omp parallel
>>>>>         {
>>>>>             int i = omp_get_thread_num(), k;
>>>>>             int cqe_count = 0;
>>>>>             int post_count = 0;
>>>>>             int comp_count = 0;
>>>>>             int posts = 0;
>>>>>
>>>>>             struct ibv_send_wr *bad_send_wqe;
>>>>>             struct ibv_wc *WC = (struct ibv_wc*) malloc(qp_depth *
>>>>> sizeof(struct ibv_wc) ); // qp_depth is 128 (adopted from perftest)
>>>>>
>>>>>             #pragma omp single
>>>>>             { // only one thread will execute this
>>>>>                 MPI_Barrier(MPI_COMM_WORLD);
>>>>>             } // implicit barrier for the threads
>>>>>             if (i == 0)
>>>>>                 t_start = MPI_Wtime();
>>>>>
>>>>>             /* Critical Path Start */
>>>>>             while (post_count < posts_per_qp || comp_count <
>>>>> posts_per_qp) { // posts_per_qp = num_of_msgs / num_qps
>>>>>                 /* Post */
>>>>>                 posts = min( (posts_per_qp - post_count), (qp_depth -
>>>>> (post_count - comp_count) ) );
>>>>>                 for (k = 0; k < posts; k++)
>>>>>                     ret = ibv_post_send(qp[i], &send_wqe[i],
>>>>> &bad_send_wqe);
>>>>>                 post_count += posts;
>>>>>                 /* Poll */
>>>>>                 if (comp_count < posts_per_qp) {
>>>>>                      cqe_count = ibv_poll_cq(cq[i], num_comps, WC); //
>>>>> num_comps = qp_depth
>>>>>                      comp_count += cqe_count;
>>>>>                  }
>>>>>              } /* Critical Path End */
>>>>>              if (i == 0)
>>>>>                  t_end = MPI_Wtime();
>>>>>          }
>>>>>
>>>>>> There is also some locking inside the userspace mlx5 driver that may
>>>>>> contend depending on how your process has set things up.
>>>>>
>>>>> I missed mentioning this but I collected the numbers with
>>>>> MLX5_SINGLE_THREADED set since none of the resources were being shared
>>>>> between the threads. So, the userspace driver wasn't taking any locks.
>>>>>
>>>>>> The entire send path is in user space so there is no kernel component
>>>>>> here.
>>>>>
>>>>> Yes, that's correct. My concern was that during resource creation, the
>>>>> kernel was maybe sharing some resource for a process or that some sort
>>>>> of multiplexing was occurring to hardware contexts through control
>>>>> groups. Is it safe for me to conclude that separate, independent
>>>>> contexts/bfregs are being assigned when a process calls
>>>>> ibv_open_device multiple times?
>>>>>
>>>>>> Jason
>>>>>
>>>>> Thanks,
>>>>> Rohit Zambre
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]         ` <CAJ84Q-aMW7PXFFuODm6RN=SO342=tJ4_eSJ2TB0b8DLrxgwtGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-01-24 22:00           ` Anuj Kalia
@ 2018-01-24 22:22           ` Jason Gunthorpe
       [not found]             ` <20180124222240.GA10706-uk2M96/98Pc@public.gmane.org>
  1 sibling, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2018-01-24 22:22 UTC (permalink / raw)
  To: Rohit Zambre; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Jan 24, 2018 at 02:53:13PM -0600, Rohit Zambre wrote:

> Yes, that's correct. My concern was that during resource creation, the
> kernel was maybe sharing some resource for a process or that some sort
> of multiplexing was occurring to hardware contexts through control
> groups. Is it safe for me to conclude that separate, independent
> contexts/bfregs are being assigned when a process calls
> ibv_open_device multiple times?

I believe that is true.

There is no obvious point of contention if you use multiple contexts
and set the single threading flag..

Is it possible your benchmark is actually working differently in the
two modes? In a more broad sense, like the cluster network traffic
pattern is detrimental in the thread case for some reason?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

[parent not found: <20180124222240.GA10706-uk2M96/98Pc@public.gmane.org>]

* Re: 2x difference between multi-thread and multi-process for same number of CTXs
       [not found]             ` <20180124222240.GA10706-uk2M96/98Pc@public.gmane.org>
@ 2018-01-24 23:04               ` Rohit Zambre
  0 siblings, 0 replies; 10+ messages in thread
From: Rohit Zambre @ 2018-01-24 23:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Balaji, Pavan,
	Aparna Chandramowlishwaran

Anuj and Jason, thank you for your tips! Looking into it them right
now. Will update here.

Thanks
Rohit Zambre
Ph.D. Student, Computer Engineering
University of California, Irvine
rohitzambre dot com


On Wed, Jan 24, 2018 at 4:22 PM, Jason Gunthorpe <jgg-uk2M96/98Pc@public.gmane.org> wrote:
> On Wed, Jan 24, 2018 at 02:53:13PM -0600, Rohit Zambre wrote:
>
>> Yes, that's correct. My concern was that during resource creation, the
>> kernel was maybe sharing some resource for a process or that some sort
>> of multiplexing was occurring to hardware contexts through control
>> groups. Is it safe for me to conclude that separate, independent
>> contexts/bfregs are being assigned when a process calls
>> ibv_open_device multiple times?
>
> I believe that is true.
>
> There is no obvious point of contention if you use multiple contexts
> and set the single threading flag..
>
> Is it possible your benchmark is actually working differently in the
> two modes? In a more broad sense, like the cluster network traffic
> pattern is detrimental in the thread case for some reason?
>
> Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-02-07  0:02 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-24 16:22 2x difference between multi-thread and multi-process for same number of CTXs Rohit Zambre
     [not found] ` <CAJ84Q-aFfcXxaJS5rApcoow6SBjfZAvt71_OKo1ORorXAOZKbw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-24 17:08   ` Jason Gunthorpe
     [not found]     ` <20180124170830.GD16845-uk2M96/98Pc@public.gmane.org>
2018-01-24 20:53       ` Rohit Zambre
     [not found]         ` <CAJ84Q-aMW7PXFFuODm6RN=SO342=tJ4_eSJ2TB0b8DLrxgwtGQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-24 22:00           ` Anuj Kalia
     [not found]             ` <CADPSxAg62gpCTdD9rqVfz+hznVpa_yHig1PRvMeHs2SWc1fvsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-26 16:44               ` Rohit Zambre
     [not found]                 ` <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ@mail.gmail.com>
     [not found]                   ` <CADPSxAj1wbNUqCpwigorpoQgGMVCAJ1TiQ5CuCxvSkVWd6LThQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-26 20:14                     ` Rohit Zambre
     [not found]                       ` <CAJ84Q-avu53954aECPKBHv2RR3KeH3ra4S+duLsFQeb3pp2+ww-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-26 22:34                         ` Anuj Kalia
     [not found]                           ` <CADPSxAgMeCdFsnrCAKLtGZp_aJp78LMB2VUObsr3_s-p6vu-MA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-02-07  0:02                             ` Rohit Zambre
2018-01-24 22:22           ` Jason Gunthorpe
     [not found]             ` <20180124222240.GA10706-uk2M96/98Pc@public.gmane.org>
2018-01-24 23:04               ` Rohit Zambre

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.