Hi,

I've just pushed a wip-readahead branch to ceph-client.git that rewrites
ceph_readpages (used for readahead) to be fully asynchronous.  This
should let us take full advantage of whatever the readahead window is.
I'm still doing some testing on this end, but things look good so far.

There are two relevant mount options:

  rasize=NN  - max readahead window size (bytes)
  rsize=MM   - max read size

rsize defaults to 0 (no limit), which means it effectively maxes out at
the stripe size (one object, 4MB by default).

rasize now defaults to 8 MB.  This is probably what you'll want to
experiment with.  In practice I think something on the order of 8-12 MB
will be best, as it will start loading things off disk ~2 objects ahead
of the current position.

Can you give it a go and see if this helps in your environment?  (There
is an example mount invocation at the very end of this message, below
the quoted thread.)

Thanks!
sage

On Tue, 19 Jul 2011, huang jun wrote:
> Thanks for your reply.
> We now find two points confusing:
> 1) The kernel client executes sequential reads through the aio_read
> function, but from the OSD log, the dispatch_queue length in the OSD
> is always 0.  That means the OSD can't get the next READ message until
> the client sends it.  It seems the async read becomes a sync read, so
> the OSD can't read data in parallel and cannot make the most of its
> resources.  What was the original purpose when you designed this part?
> Perfect reliability?

Right.  The old ceph_readpages was synchronous, which slowed things down
in a couple of different ways.

> 2) In the single-reader case, while the OSD reads data from its disk
> it does nothing but wait for the read to finish.  We think this is a
> consequence of 1): the OSD has nothing else to do, so it just waits.
>
>
> 2011/7/19 Sage Weil :
> > On Mon, 18 Jul 2011, huang jun wrote:
> >> hi, all
> >> We tested ceph's read performance last week and found something weird.
> >> We use ceph v0.30 on linux 2.6.37, with ceph mounted on a backend
> >> cluster of 2 osds, 1 mon, and 1 mds.
> >> $ mount -t ceph 192.168.1.103:/ /mnt -vv
> >> $ dd if=/dev/zero of=/mnt/test bs=4M count=200
> >> $ cd .. && umount /mnt
> >> $ mount -t ceph 192.168.1.103:/ /mnt -vv
> >> $ dd if=test of=/dev/zero bs=4M
> >>   200+0 records in
> >>   200+0 records out
> >>   838860800 bytes (839 MB) copied, 16.2327 s, 51.7 MB/s
> >> but if we use rados to test it:
> >> $ rados -m 192.168.1.103:6789 -p data bench 60 write
> >> $ rados -m 192.168.1.103:6789 -p data bench 60 seq
> >>   the result is:
> >>   Total time run:        24.733935
> >>   Total reads made:      438
> >>   Read size:             4194304
> >>   Bandwidth (MB/sec):    70.834
> >>
> >>   Average Latency:       0.899429
> >>   Max latency:           1.85106
> >>   Min latency:           0.128017
> >> This phenomenon caught our attention, so we began analyzing the OSD
> >> debug log.  We found that:
> >> 1) the kernel client sends READ requests of 1MB at first, and 512KB
> >> after that;
> >> 2) from the rados test log, the OSD receives READ ops with 4MB of
> >> data to handle.
> >> We know the ceph developers pay attention to read and write
> >> performance, so I just want to confirm: does the communication
> >> between the client and the OSD take more time than it should?  Can
> >> we request a bigger size, such as the default 4MB object size, for
> >> READ operations?  Or is this related to OS management?  If so, what
> >> can we do to improve performance?
> >
> > I think it's related to the way the Linux VFS is doing readahead, and how
> > the ceph fs code is handling it.  It's issue #1122 in the tracker and I
> > plan to look at it today or tomorrow!
> >
> > Thanks-
> > sage
> >
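
Example mount invocation referenced above: a minimal sketch, assuming
rasize/rsize are passed with -o like the other ceph mount options, and
using a 12 MB window purely to illustrate the 8-12 MB range suggested
above (three 4 MB objects, so readahead stays ~2 objects ahead of the
current position).

$ mount -t ceph 192.168.1.103:/ /mnt -o rasize=12582912,rsize=0
  # rasize=12582912  -> 12 MB readahead window
  # rsize=0          -> no per-read limit; reads effectively max out at
  #                     the 4 MB stripe (object) size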