RE: queue_transaction interface + unique_ptr + performance

* RE: queue_transaction interface + unique_ptr + performance
@ 2015-12-03  2:13 Somnath Roy
  2015-12-03  3:50 ` James (Fei) Liu-SSI
                   ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Somnath Roy @ 2015-12-03  2:13 UTC (permalink / raw)
  To: Somnath Roy, Sage Weil (sage@newdream.net),
	Samuel Just (sam.just@inktank.com)
  Cc: ceph-devel

*Also*, this way we are unnecessarily adding another layer of smart pointer overhead to the Ceph IO path.
As I mentioned to the community some time back (probably 2 years ago now :-) ), the profiler shows these smart pointers (shared_ptr) as one of the hot spots. This time I decided to actually measure it. Here are my findings from a sample application using jemalloc.

1. First, I measured the performance difference of just creating and deleting the various pointer types. Here is the result:

##### Test conventional ptr ######
start: 1449107326 secs, 903873 usecs
end: 1449107353 secs, 578709 usecs
micros_used for conventional ptr: 26674836

##### Test Unique Smart ptr ######
start: 1449107353 secs, 578764 usecs
end: 1449107438 secs, 835114 usecs
micros_used for unique ptr: 85256350

##### Test Shared Smart ptr ######
start: 1449107438 secs, 835155 usecs
end: 1449107543 secs, 285443 usecs
micros_used for shared ptr: 104450288

So, as you can see, there is a >3x degradation with unique_ptr (about 3.2x) and a ~4x degradation with shared_ptr (about 3.9x) relative to the conventional pointer.
My sample application is single threaded, and in perf top I can see a lot of other smart pointer related functions popping up, which reduces the share of CPU spent in jemalloc itself (and thus causes the slowdown).

2. Next, I added pointer dereferencing to the code. Here is the result:

##### Test conventional ptr ######
start: 1449107850 secs, 500595 usecs
end: 1449107876 secs, 936586 usecs
micros_used for conventional ptr: 26435991

##### Test Unique Smart ptr ######
start: 1449107876 secs, 936643 usecs
end: 1449107994 secs, 629418 usecs
micros_used for unique ptr: 117692775

##### Test Shared Smart ptr ######
start: 1449107994 secs, 629459 usecs
end: 1449108107 secs, 846052 usecs
micros_used for shared ptr: 113216593

This is interesting: not much change for conventional pointers, but a huge change for unique_ptr and some change for shared_ptr as well. So the degradation for unique_ptr is now >4x (about 4.5x). This is probably in line with http://stackoverflow.com/questions/8138284/about-unique-ptr-performances

3. I didn't measure the other costs, such as std::move() or the reference counting for shared objects. These will degrade the performance even more.
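
To make the unmeasured costs concrete, here is a minimal sketch (not part of the measured program) of what they involve: copying a shared_ptr bumps a reference count, which is typically an atomic operation in threaded programs, while handing a unique_ptr on requires std::move() and leaves the source empty. Foo is a cut-down stand-in for the struct in the sample code, and the two helper names are made up for illustration.

```cpp
#include <memory>
#include <utility>

struct Foo {                 // cut-down stand-in for the Foo in the sample code
    int xx = 99;
    long yy = 999999;
};

// Copying a shared_ptr increments the shared reference count.
// Returns the count observed while both copies are alive.
long shared_count_after_copy() {
    std::shared_ptr<Foo> a = std::make_shared<Foo>();
    std::shared_ptr<Foo> b = a;              // count goes from 1 to 2
    return a.use_count();
}

// Moving a unique_ptr transfers ownership; the source becomes null.
bool unique_source_empty_after_move() {
    std::unique_ptr<Foo> a(new Foo());
    std::unique_ptr<Foo> b = std::move(a);   // a no longer owns the Foo
    return a == nullptr && b != nullptr && b->xx == 99;
}
```

Neither helper says anything about how expensive these operations are; that would still have to be measured separately.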

4. Here is the sample code in case anybody is interested.

#include <iostream>
#include <memory>
#include <list>
#include <sys/time.h>
#include <stdint.h>
#include <stdio.h>   // printf

struct Foo { // object to manage
    Foo():xx(99),yy(999999) { /*std::cout << "Foo ctor\n";*/ }
    ~Foo() { /*std::cout << "~Foo dtor\n";*/ }
    int xx;
    long yy;
    char str[1024];
};
int main()
{
   struct timeval start, end;
   long secs_used,micros_used;
    printf("##### Test conventional ptr ######\n");
    gettimeofday(&start, NULL);
    for (uint64_t i = 0; i < 1000000000; i++) {
      Foo* f = new Foo();
      int xxx = f->xx;
      long yyy = f->yy;
      delete f;
    }
    gettimeofday(&end, NULL);

    // tv_sec/tv_usec and micros_used are not int, so print them as long
    printf("start: %ld secs, %ld usecs\n",(long)start.tv_sec,(long)start.tv_usec);
    printf("end: %ld secs, %ld usecs\n",(long)end.tv_sec,(long)end.tv_usec);

    secs_used=(end.tv_sec - start.tv_sec); //avoid overflow by subtracting first
    micros_used= ((secs_used*1000000) + end.tv_usec) - (start.tv_usec);

    printf("micros_used for conventional ptr: %ld\n",micros_used);

    printf("##### Test Unique Smart ptr ######\n");

    gettimeofday(&start, NULL);
    for (uint64_t i = 0; i < 1000000000; i++) {
      std::unique_ptr<Foo> fu (new Foo());
      int xxx = fu->xx;
      long yyy = fu->yy;
    }
    gettimeofday(&end, NULL);

    printf("start: %ld secs, %ld usecs\n",(long)start.tv_sec,(long)start.tv_usec);
    printf("end: %ld secs, %ld usecs\n",(long)end.tv_sec,(long)end.tv_usec);

    secs_used=(end.tv_sec - start.tv_sec); //avoid overflow by subtracting first
    micros_used= ((secs_used*1000000) + end.tv_usec) - (start.tv_usec);

    printf("micros_used for unique ptr: %ld\n",micros_used);

    printf("##### Test Shared Smart ptr ######\n");
    gettimeofday(&start, NULL);
    for (uint64_t i = 0; i < 1000000000; i++) {
      std::shared_ptr<Foo> fs (new Foo());
      int xxx = fs->xx;
      long yyy = fs->yy;
    }
    gettimeofday(&end, NULL);

    printf("start: %ld secs, %ld usecs\n",(long)start.tv_sec,(long)start.tv_usec);
    printf("end: %ld secs, %ld usecs\n",(long)end.tv_sec,(long)end.tv_usec);

    secs_used=(end.tv_sec - start.tv_sec); //avoid overflow by subtracting first
    micros_used= ((secs_used*1000000) + end.tv_usec) - (start.tv_usec);

    printf("micros_used for shared ptr: %ld\n",micros_used);
    std::cout <<"Exiting..\n";
    return 0;
}
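
The timing boilerplate above is repeated for each pointer type. A minimal sketch of the same measurement factored into one helper is shown below, assuming C++11 and std::chrono::steady_clock in place of gettimeofday(). The names time_micros, time_raw, time_unique and the volatile sink are illustrative and were not part of the program that produced the numbers above; the sink only keeps the loads from being optimized away when compiling with optimization.

```cpp
#include <chrono>
#include <cstdint>
#include <memory>

struct Foo { // same layout as the Foo in the sample code
    Foo() : xx(99), yy(999999) {}
    int xx;
    long yy;
    char str[1024];
};

// Runs fn() iters times and returns the elapsed time in microseconds.
template <typename Fn>
int64_t time_micros(uint64_t iters, Fn fn) {
    auto start = std::chrono::steady_clock::now();
    for (uint64_t i = 0; i < iters; i++) {
        fn();
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}

volatile long sink; // written on every iteration so the reads are not dropped

// Conventional pointer: explicit new/delete.
int64_t time_raw(uint64_t iters) {
    return time_micros(iters, [] {
        Foo* f = new Foo();
        sink = f->xx + f->yy;
        delete f;
    });
}

// unique_ptr: the Foo is deleted when fu goes out of scope.
int64_t time_unique(uint64_t iters) {
    return time_micros(iters, [] {
        std::unique_ptr<Foo> fu(new Foo());
        sink = fu->xx + fu->yy;
    });
}
```

Calling time_raw(n) and time_unique(n) with the same n gives directly comparable numbers; the results depend heavily on compiler flags and allocator, so both should be stated alongside any figures.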

So, my guess is that the heavy use of these smart pointers in the Ceph IO path is bringing IOPS/core down substantially.
My suggestion is *not to introduce* any smart pointers in the objectstore interface.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-devel-owner@vger.kernel.org [mailto:ceph-devel-owner@vger.kernel.org] On Behalf Of Somnath Roy
Sent: Wednesday, December 02, 2015 4:18 PM
To: Sage Weil (sage@newdream.net); Samuel Just (sam.just@inktank.com)
Cc: ceph-devel@vger.kernel.org
Subject: queue_transaction interface + unique_ptr

Hi Sage/Sam,
As discussed in today's performance meeting, I am planning to change the queue_transactions() interface to the following.

  int queue_transactions(Sequencer *osr, list<TransactionRef>& tls,
                         Context *onreadable, Context *ondisk=0,
                         Context *onreadable_sync=0,
                         TrackedOpRef op = TrackedOpRef(),
                         ThreadPool::TPHandle *handle = NULL) ;

typedef unique_ptr<Transaction> TransactionRef;

IMO, there is a problem with this approach.

Interfaces like apply_transaction(), queue_transaction() etc., which take a single Transaction pointer and internally form a list to call queue_transactions(), would also need to be changed to accept TransactionRef, which will be *bad*. The reason is that while preparing the list internally we need to move the unique_ptr, and callers won't be aware of that.
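
A minimal sketch of that problem follows. Transaction here is an empty stand-in, both functions are stripped-down stubs with hypothetical signatures (the real ones take a Sequencer, Contexts and so on), and caller_loses_transaction() exists only to show what the caller observes.

```cpp
#include <list>
#include <memory>
#include <utility>

struct Transaction { int ops = 0; };              // hypothetical stand-in
typedef std::unique_ptr<Transaction> TransactionRef;

// Stub for the list-based interface; returns how many transactions were queued.
int queue_transactions(std::list<TransactionRef>& tls) {
    return static_cast<int>(tls.size());
}

// Single-transaction wrapper: to build the list it has to take
// ownership away from the caller's TransactionRef.
int queue_transaction(TransactionRef& t) {
    std::list<TransactionRef> tls;
    tls.push_back(std::move(t));   // the caller's t is null from here on
    return queue_transactions(tls);
    // tls is destroyed on return, which deletes the Transaction
}

// What the caller sees: after the call its TransactionRef is empty.
bool caller_loses_transaction() {
    TransactionRef t(new Transaction());
    queue_transaction(t);
    return t == nullptr;
}
```

Nothing at the call site of queue_transaction(t) hints that t has been emptied, so any later use of t by the caller would dereference a null pointer.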

Also, changing every interface (and caller) that takes a Transaction* now will produce a very big delta (and a big testing effort as well).

So, should we *reconsider* letting both queue_transactions() interfaces co-exist and calling the new one only from the IO path?
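
As a rough sketch of what co-existing could look like, the two interfaces can simply be overloads on the list element type. Store is a made-up stand-in for the object store class, the call counters exist only for illustration, and the remaining parameters of the real interface are omitted.

```cpp
#include <list>
#include <memory>

struct Transaction { int ops = 0; };              // hypothetical stand-in
typedef std::unique_ptr<Transaction> TransactionRef;

struct Store {                                    // hypothetical stand-in
    int raw_calls = 0;
    int ref_calls = 0;

    // Existing interface: callers keep ownership of their Transaction*.
    int queue_transactions(std::list<Transaction*>& tls) {
        raw_calls++;
        return static_cast<int>(tls.size());
    }

    // New interface for the IO path: the store takes over the transactions.
    int queue_transactions(std::list<TransactionRef>& tls) {
        ref_calls++;
        int n = static_cast<int>(tls.size());
        tls.clear();                              // releases every Transaction
        return n;
    }
};
```

Existing callers that pass list<Transaction*> keep compiling unchanged, and only the IO path has to build a list<TransactionRef>.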

Thanks & Regards
Somnath

