* concurrent direct IO write in xfs
@ 2012-01-16  0:01 Zheng Da
  2012-01-16 17:48 ` Christoph Hellwig
  2012-01-16 23:25 ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-16  0:01 UTC (permalink / raw)
  To: xfs



Hello,

I was surprised to find that writing data to a file (no appending) with direct
IO and multiple threads has the same performance as a single thread.
In fact, it seems only one core is working at a time. In my case, each
write is a single page, the offset is always aligned to the page size,
and no two writes overlap.
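Each writer thread essentially runs a loop like the following (a simplified
sketch of what my test does, not the actual program; names are illustrative
and error handling is omitted):

#define _GNU_SOURCE		/* for O_DIRECT */
#include <sys/types.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 4096

/* Overwrite n pages of an existing file at page-aligned offsets with O_DIRECT. */
static void write_pages(const char *path, const off_t *offsets, long n)
{
	char *buf;
	int fd = open(path, O_RDWR | O_DIRECT);

	/* O_DIRECT needs a suitably aligned user buffer. */
	posix_memalign((void **) &buf, PAGE_SIZE, PAGE_SIZE);
	memset(buf, 0, PAGE_SIZE);

	for (long i = 0; i < n; i++) {
		/* offsets[i] is a multiple of PAGE_SIZE; each thread gets a
		 * disjoint set of offsets, so writes never overlap. */
		pwrite(fd, buf, PAGE_SIZE, offsets[i]);
	}

	free(buf);
	close(fd);
}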

According to lockstat, the lock that causes the most waiting time is
xfs_inode.i_lock.
               &(&ip->i_lock)->mr_lock-W:  31568  36170  0.24  20048.25  7589157.99  130154  3146848  0.00  217.70  1238310.72
               &(&ip->i_lock)->mr_lock-R:  11251  11886  0.24  20043.01  2895595.18   46671   526309  0.00   63.80   264097.96
               -------------------------
                 &(&ip->i_lock)->mr_lock          36170
 [<ffffffffa03be122>] xfs_ilock+0xb2/0x110 [xfs]
                 &(&ip->i_lock)->mr_lock          11886
 [<ffffffffa03be15a>] xfs_ilock+0xea/0x110 [xfs]
               -------------------------
                 &(&ip->i_lock)->mr_lock          38555
 [<ffffffffa03be122>] xfs_ilock+0xb2/0x110 [xfs]
                 &(&ip->i_lock)->mr_lock           9501
 [<ffffffffa03be15a>] xfs_ilock+0xea/0x110 [xfs]

And systemtap shows me that xfs_inode.i_lock is locked exclusively in the
following functions.
0xffffffff81289235 : xfs_file_aio_write_checks+0x45/0x1d0 [kernel]
 0xffffffff812829f4 : __xfs_get_blocks+0x94/0x4a0 [kernel]
 0xffffffff81288b6a : xfs_aio_write_newsize_update+0x3a/0x90 [kernel]
 0xffffffff8129590a : xfs_log_dirty_inode+0x7a/0xe0 [kernel]
xfs_log_dirty_inode is only invoked 3 times when I write 4G of data to the
file, so we can completely ignore it. I'm not sure which of the others is
the major cause of the poor write performance, or whether they are the
cause at all, but none of them seem to be the main operations in a direct
IO write.

It seems to me that the lock might not be necessary in my case, and it would
be nice if I could disable it. Or is there any suggestion for achieving
better write performance with multiple threads in XFS?
I tried ext4 and it doesn't perform better than XFS. Does the problem exist
in all filesystems?

Thanks,
Da


* Re: concurrent direct IO write in xfs
  2012-01-16  0:01 concurrent direct IO write in xfs Zheng Da
@ 2012-01-16 17:48 ` Christoph Hellwig
  2012-01-16 19:44   ` Zheng Da
  2012-01-16 23:25 ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2012-01-16 17:48 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Sun, Jan 15, 2012 at 07:01:42PM -0500, Zheng Da wrote:
> Hello,
> 
> I was surprised to find that writing data to a file (no appending) with direct
> IO and multiple threads has the same performance as a single thread.
> In fact, it seems only one core is working at a time. In my case, each
> write is a single page, the offset is always aligned to the page size,
> and no two writes overlap.

What kernel version are you using?  Also, what is the exact I/O
pattern?


* Re: concurrent direct IO write in xfs
  2012-01-16 17:48 ` Christoph Hellwig
@ 2012-01-16 19:44   ` Zheng Da
  0 siblings, 0 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-16 19:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs



Hello Christoph,

Right now I'm using kernel 3.2.
My test program writes 4G of data to a file that has been preallocated, and
each write puts one page of data at a random location in the file. It's
always overwriting, never appending. The offset of each write is always
aligned to the page size, and, as I said before, no two writes overlap.
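For reference, the non-overlapping random offsets are just a shuffled list of
page-aligned offsets, roughly like this (a sketch; my actual program does
essentially the same thing):

#include <stdlib.h>
#include <sys/types.h>

#define PAGE_SIZE 4096

/* Build a random permutation of page-aligned offsets (Fisher-Yates shuffle),
 * so every page is written exactly once and no two writes overlap. */
static off_t *make_offsets(long npages)
{
	off_t *off = malloc(npages * sizeof(off_t));

	for (long i = 0; i < npages; i++)
		off[i] = (off_t) i * PAGE_SIZE;

	for (long i = npages - 1; i >= 1; i--) {
		long j = random() % (i + 1);
		off_t tmp = off[j];
		off[j] = off[i];
		off[i] = tmp;
	}
	return off;
}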
I also tried mixing reads and writes and got similar results.
When you write data to a file with multiple threads, does the performance
scale up with the number of threads?

Thanks,
Da

On Mon, Jan 16, 2012 at 12:48 PM, Christoph Hellwig <hch@infradead.org> wrote:

> On Sun, Jan 15, 2012 at 07:01:42PM -0500, Zheng Da wrote:
> > Hello,
> >
> > I was surprised to find that writing data to a file (no appending) with
> > direct IO and multiple threads has the same performance as a single
> > thread. In fact, it seems only one core is working at a time. In my case,
> > each write is a single page, the offset is always aligned to the page
> > size, and no two writes overlap.
>
> What kernel version are you using?  Also, what is the exact I/O
> pattern?
>
>


* Re: concurrent direct IO write in xfs
  2012-01-16  0:01 concurrent direct IO write in xfs Zheng Da
  2012-01-16 17:48 ` Christoph Hellwig
@ 2012-01-16 23:25 ` Dave Chinner
  2012-01-17 19:19   ` Zheng Da
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2012-01-16 23:25 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Sun, Jan 15, 2012 at 07:01:42PM -0500, Zheng Da wrote:
> Hello,
> 
> I was surprised to find that writing data to a file (no appending) with direct

I'm not so sure.

> And systemtap shows me that xfs_inode.i_lock is locked exclusively in the
> following functions.
> 0xffffffff81289235 : xfs_file_aio_write_checks+0x45/0x1d0 [kernel]

Always taken, short time period.

>  0xffffffff81288b6a : xfs_aio_write_newsize_update+0x3a/0x90 [kernel]

Only ever taken when doing appending writes. Are you -sure- you are
not doing appending writes?

>  0xffffffff812829f4 : __xfs_get_blocks+0x94/0x4a0 [kernel]

And for direct IO writes, this will be the block mapping lookup so
always hit.


What this says to me is that you are probably doing lots of very
small concurrent write IOs, but I'm only guessing.  Can you provide
your test case and a description of your test hardware so we can try
to reproduce the problem?

>  0xffffffff8129590a : xfs_log_dirty_inode+0x7a/0xe0 [kernel]
> xfs_log_dirty_inode is only invoked 3 times when I write 4G data to the
> file, so we can completely ignore it. But I'm not sure which of them is the
> major cause of the bad write performance or whether they are the cause of
> the bad performance. But it seems none of them are the main operations in
> direct io write.
> 
> It seems to me that the lock might not be necessary for my case. It'll be

The locking is definitely necessary. We might be able to optimise it
to reduce the serialisation for the overwrite case if that really is
the problem, but there is a limit to how much concurrent IO you can
currently do to a single file. We really need a test case to be able
to make and test such optimisations, though.

> nice if I can disable the lock. Or is there any suggestion of achieving
> better write performance with multiple threads in XFS?
> I tried ext4 and it doesn't perform better than XFS. Does the problem exist
> in all FS?

I think you'll find XFS performs the best of the lot for this sort
of concurrent DIO write workload.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: concurrent direct IO write in xfs
  2012-01-16 23:25 ` Dave Chinner
@ 2012-01-17 19:19   ` Zheng Da
  2012-01-20  8:53     ` Linda Walsh
  2012-01-23  5:11     ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-17 19:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Hello,

On Mon, Jan 16, 2012 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> >  0xffffffff81288b6a : xfs_aio_write_newsize_update+0x3a/0x90 [kernel]
>
> Only ever taken when doing appending writes. Are you -sure- you are
> not doing appending writes?
>
This is weird. Yes, I'm sure. I use pwrite() to write data to a 4G file,
and I checked the offset of each write: they are always smaller than 4G.
I instrumented the code with systemtap and it shows me that ip->i_new_size
and new_size in xfs_aio_write_newsize_update are both 0.
Since in my case there are only overwrites, ip->i_new_size will always be 0
(the only place that updates ip->i_new_size is xfs_file_aio_write_checks).
For the same reason, new_size returned by xfs_file_aio_write_checks
is always 0.
Is that what you expected?

>
> >  0xffffffff812829f4 : __xfs_get_blocks+0x94/0x4a0 [kernel]
>
> And for direct IO writes, this will be the block mapping lookup so
> always hit.
>
>
> What this says to me is that you are probably doing is lots of very
> small concurrent write IOs, but I'm only guessing.  Can you provide
> your test case and a description of your test hardware so we can try
> to reproduce the problem?
>
I built XFS on top of a ramdisk, so yes, there are a lot of small
concurrent writes per second.
I create a 4GB file in XFS (the ramdisk has 5GB of space). My test
program overwrites 4G of data in the file, writing one page of data at a
random location each time. It's always overwriting, never appending. The
offset of each write is always aligned to the page size, and no two writes
overlap.
So the test case is pretty simple, and I think it's easy to reproduce.
It would be great if you could try it.

Thanks,
Da


* Re: concurrent direct IO write in xfs
  2012-01-17 19:19   ` Zheng Da
@ 2012-01-20  8:53     ` Linda Walsh
  2012-01-20 15:07       ` Zheng Da
  2012-01-23  5:11     ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Linda Walsh @ 2012-01-20  8:53 UTC (permalink / raw)
  To: Zheng Da, Linux-Xfs



Zheng Da wrote:
> 
> I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test 
> program overwrites 4G of data to the file  
----
	It sounds like you are asking why multiple threads don't
move memory from one point in memory to another at a faster rate
than one thread alone.

I.e. if you had 2 processes each doing a memmove of a chunk of memory
from one area to another, would you expect the move to go any faster
with 2 processors doing it vs. 1?

I think the limiting factor is memory (unless you have a slow processor and
some REALLY fast memory): stock x86-64 parts today have memory running about
2-4 times slower than the processor, so the memory is usually the bottleneck.

Two processes wouldn't do it any faster, and might actually do it slower due
to resource contention issues -- I would *think*... but I really don't know
the details of how writing from mem2mem, with the target being in the format
of an xfs file system, would cause cpu-bound delays significant enough to
change the fact that m2m operations are usually mem-bandwidth limited...?

(I don't know the answers, just clarifying what you are asking)...




* Re: concurrent direct IO write in xfs
  2012-01-20  8:53     ` Linda Walsh
@ 2012-01-20 15:07       ` Zheng Da
  0 siblings, 0 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-20 15:07 UTC (permalink / raw)
  To: Linda Walsh; +Cc: Linux-Xfs



Hello,

On Fri, Jan 20, 2012 at 3:53 AM, Linda Walsh <xfs@tlinx.org> wrote:

>
>
> Zheng Da wrote:
>
>>
>> I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
>> program overwrites 4G of data to the file
>>
> ----
>        It sounds like you are asking why multiple threads don't
> move memory from one point to another point in memory at a faster rate
> than one thread alone.
>
> I.e. if you had 2 processes doing an assembly instruction, memmov to move
> a chunk of memory from 1 area to another, would you expect to do the move
> any faster if you had 2 processors doing the move vs. 1??
>
Yes. Actually, for reads, using multiple threads is faster than a single
thread.
Even for a simple memory copy with memcpy from the C library, the overall
throughput still increases if you use multiple processors.
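For example, you can check this with a trivial program where each thread
repeatedly memcpy()s its own private buffers, and time it with different
thread counts (a rough sketch; the buffer size and iteration count are
arbitrary):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK (64L * 1024 * 1024)	/* 64MB per thread, arbitrary */
#define ITERS 16
#define MAX_THREADS 64

/* Each thread copies between its own private buffers, so the only shared
 * resource is the memory subsystem itself. */
static void *copier(void *arg)
{
	char *src = malloc(CHUNK), *dst = malloc(CHUNK);

	(void) arg;
	memset(src, 1, CHUNK);
	for (int i = 0; i < ITERS; i++)
		memcpy(dst, src, CHUNK);

	free(src);
	free(dst);
	return NULL;
}

int main(int argc, char *argv[])
{
	int nthreads = argc > 1 ? atoi(argv[1]) : 1;
	pthread_t tid[MAX_THREADS];

	if (nthreads > MAX_THREADS)
		nthreads = MAX_THREADS;

	/* Time this externally (e.g. with `time`); total bytes copied is
	 * nthreads * ITERS * CHUNK. */
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tid[i], NULL, copier, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tid[i], NULL);
	return 0;
}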

>
> I think the limiting factor (unless you have a slow processor and some
> REALLY fast memory, but stock x86-64 parts, today have memory running about
> 2-4 times slower than the processor -- so the memory is usually the
> bottleneck.
>
Memory bandwidth will eventually become a bottleneck, but it can still
scale for a small number of processors.

>
> Two processes wouldn't do it any faster, and might actually do it slower
> due to
> resource contention issues -- I would *think*... but I really don't know
> the
> details of how writing from mem2mem and having the target be in the format
> of
> and xfs file system, would cause cpu-bound delays that would be
> significant to
> change the fact that m2m operations are usually mem-bandwidth limited...?
>
> (I don't know the answers, just clarifying what you are asking)...

Da


* Re: concurrent direct IO write in xfs
  2012-01-17 19:19   ` Zheng Da
  2012-01-20  8:53     ` Linda Walsh
@ 2012-01-23  5:11     ` Dave Chinner
  2012-01-23 19:34       ` Zheng Da
  1 sibling, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2012-01-23  5:11 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Tue, Jan 17, 2012 at 02:19:52PM -0500, Zheng Da wrote:
> Hello,
> 
> On Mon, Jan 16, 2012 at 6:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > >  0xffffffff81288b6a : xfs_aio_write_newsize_update+0x3a/0x90 [kernel]
> >
> > Only ever taken when doing appending writes. Are you -sure- you are
> > not doing appending writes?
> >
> This is weird. Yes, I'm sure. I use pwrite() to write data to a 4G file,
> and I check the offset of each write and they are always smaller than 4G.
> I instrument the code with systemtap and it shows me that ip->i_new_size
> and new_size in xfs_aio_write_newsize_update are both 0.
> Since in my case there is only overwrite, ip->i_new_size will always be 0
> (the only place that updates ip->i_new_size is xfs_file_aio_write_checks).
> Because of the same reason, new_size returned by xfs_file_aio_write_checks
> is always 0.
> Is it what you expected?

No idea. I don't know what the problem you are seeing is yet, or if
indeed there even is a problem as I don't really understand what you
are trying to do or what results you are expecting to see...

Indeed, have you run the test on something other than a RAM disk and
confirmed that the problem exists on a block device that has real IO
latency? If your IO takes close to zero time, then there isn't any
IO level concurrency you can extract from single file direct IO; it
will all just serialise on the extent tree lookups.

> > >  0xffffffff812829f4 : __xfs_get_blocks+0x94/0x4a0 [kernel]
> >
> > And for direct IO writes, this will be the block mapping lookup so
> > always hit.
> >
> >
> > What this says to me is that you are probably doing is lots of very
> > small concurrent write IOs, but I'm only guessing.  Can you provide
> > your test case and a description of your test hardware so we can try
> > to reproduce the problem?
> >
> I build XFS on the top of ramdisk. So yes, there is a lot of small
> concurrent writes in a second.
> I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
> program overwrites 4G of data to the file and each time writes a page of
> data randomly to the file. It's always overwriting, and no appending. The
> offset of each write is always aligned to the page size. There is no
> overlapping between writes.

Why are you using XFS for this? tmpfs was designed to do this sort
of stuff as efficiently as possible....

> So the test case is pretty simple and I think it's easy to reproduce it.
> It'll be great if you can try the test case.

Can you post your test code, so I know that what I test is exactly what
you are running?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: concurrent direct IO write in xfs
  2012-01-23  5:11     ` Dave Chinner
@ 2012-01-23 19:34       ` Zheng Da
  2012-01-23 20:51         ` Zheng Da
  0 siblings, 1 reply; 19+ messages in thread
From: Zheng Da @ 2012-01-23 19:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Hello,

On Mon, Jan 23, 2012 at 12:11 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> > >
> > This is weird. Yes, I'm sure. I use pwrite() to write data to a 4G file,
> > and I check the offset of each write and they are always smaller than 4G.
> > I instrument the code with systemtap and it shows me that ip->i_new_size
> > and new_size in xfs_aio_write_newsize_update are both 0.
> > Since in my case there is only overwrite, ip->i_new_size will always be 0
> > (the only place that updates ip->i_new_size is
> xfs_file_aio_write_checks).
> > Because of the same reason, new_size returned by
> xfs_file_aio_write_checks
> > is always 0.
> > Is it what you expected?
>
> No idea. I don't know what the problem you are seeing is yet, or if
> indeed there even is a problem as I don't really understand what you
> are trying to do or what results you are expecting to see...
>
Here I was just wondering whether i_new_size is always 0 when there are only
overwrites. I think that has nothing to do with the pattern of my workload
or the device I used for the test.

>
> Indeed, have you run the test on something other than a RAM disk and
> confirmed that the problem exists on a block device that has real IO
> latency? If your IO takes close to zero time, then there isn't any
> IO level concurrency you can extract from single file direct IO; it
> will all just serialise on the extent tree lookups.
>
It's difficult to test the scalability problem on traditional disks: they
provide very low IOPS (IOs per second), and even two SSDs can't provide
enough IOPS.
I don't think all direct IO will serialize on the extent tree lookups.
Direct IO reads can be parallelized pretty well, and they also need extent
tree lookups.

>
> > > >  0xffffffff812829f4 : __xfs_get_blocks+0x94/0x4a0 [kernel]
> > >
> > > And for direct IO writes, this will be the block mapping lookup so
> > > always hit.
> > >
> > >
> > > What this says to me is that you are probably doing is lots of very
> > > small concurrent write IOs, but I'm only guessing.  Can you provide
> > > your test case and a description of your test hardware so we can try
> > > to reproduce the problem?
> > >
> > I build XFS on the top of ramdisk. So yes, there is a lot of small
> > concurrent writes in a second.
> > I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
> > program overwrites 4G of data to the file and each time writes a page of
> > data randomly to the file. It's always overwriting, and no appending. The
> > offset of each write is always aligned to the page size. There is no
> > overlapping between writes.
>
> Why are you using XFS for this? tmpfs was designed to do this sort
> of stuff as efficiently as possible....
>
OK, I can try that.

>
> > So the test case is pretty simple and I think it's easy to reproduce it.
> > It'll be great if you can try the test case.
>
> Can you post your test code so I know what I test is exactly what
> you are running?
>
I can do that. My test code has become quite complicated, so I need to
simplify it first.

Thanks,
Da


* Re: concurrent direct IO write in xfs
  2012-01-23 19:34       ` Zheng Da
@ 2012-01-23 20:51         ` Zheng Da
  2012-01-24  0:34           ` Stan Hoeppner
  2012-01-24  3:54           ` Dave Chinner
  0 siblings, 2 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-23 20:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Hello

On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936@gmail.com> wrote:
>
> > I build XFS on the top of ramdisk. So yes, there is a lot of small
>> > concurrent writes in a second.
>> > I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
>> > program overwrites 4G of data to the file and each time writes a page of
>> > data randomly to the file. It's always overwriting, and no appending.
>> The
>> > offset of each write is always aligned to the page size. There is no
>> > overlapping between writes.
>>
>> Why are you using XFS for this? tmpfs was designed to do this sort
>> of stuff as efficiently as possible....
>>
> OK, I can try that.
>
tmpfs doesn't support direct IO.

>
>> > So the test case is pretty simple and I think it's easy to reproduce it.
>> > It'll be great if you can try the test case.
>>
>> Can you post your test code so I know what I test is exactly what
>> you are running?
>>
> I can do that. My test code gets very complicated now. I need to simplify
> it.
>
Here is the code. It's still a bit long; I hope that's OK.
You can run it like this: "rand-read file option=direct pages=1048576
threads=8 access=write/read".

Thanks,
Da


[-- Attachment #2: simple-rand-read.cc --]
[-- Type: text/x-c++src, Size: 9604 bytes --]

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/time.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/mman.h>
#include <string.h>
#include <errno.h>
#include <pthread.h>
#include <assert.h>
#include <google/profiler.h>

#include <iostream>
#include <string>
#include <deque>

#define PAGE_SIZE 4096
#define ROUND_PAGE(off) (((long) off) & (~(PAGE_SIZE - 1)))

#define NUM_PAGES 16384
#define NUM_THREADS 32

enum {
	READ,
	WRITE
};

int npages;
int nthreads = 1;
struct timeval global_start;
char static_buf[PAGE_SIZE * 8] __attribute__((aligned(PAGE_SIZE)));
volatile int first[NUM_THREADS];
int access_method = READ;

class workload_gen
{
public:
	virtual off_t next_offset() = 0;
	virtual bool has_next() = 0;
};

class rand_permute
{
	off_t *offset;
	long num;
public:
	rand_permute(long num, int stride) {
		offset = (off_t *) valloc(num * sizeof(off_t));
		for (int i = 0; i < num; i++) {
			offset[i] = ((off_t) i) * stride;
		}

		for (int i = num - 1; i >= 1; i--) {
			int j = random() % i;
			off_t tmp = offset[j];
			offset[j] = offset[i];
			offset[i] = tmp;
		}
	}

	~rand_permute() {
		free(offset);
	}

	off_t get_offset(long idx) const {
		return offset[idx];
	}
};

class local_rand_permute_workload: public workload_gen
{
	long start;
	long end;
	static const rand_permute *permute;
public:
	local_rand_permute_workload(long num, int stride, long start, long end) {
		if (permute == NULL) {
			permute = new rand_permute(num, stride);
		}
		this->start = start;
		this->end = end;
	}

	~local_rand_permute_workload() {
		if (permute) {
			delete permute;
			permute = NULL;
		}
	}

	off_t next_offset() {
		if (start >= end)
			return -1;
		return permute->get_offset(start++);
	}

	bool has_next() {
		return start < end;
	}
};

float time_diff(struct timeval time1, struct timeval time2)
{
	return time2.tv_sec - time1.tv_sec
			+ ((float)(time2.tv_usec - time1.tv_usec))/1000000;
}

class rand_buf
{
	/* where the data read from the disk is stored */
	char *buf;
	/* shows the locations in the array where data has be to stored.*/
	rand_permute buf_offset;
	int entry_size;
	int num_entries;

	int current;
public:
	rand_buf(int buf_size, int entry_size): buf_offset(buf_size / entry_size, entry_size) {
		this->entry_size = entry_size;
		num_entries = buf_size / entry_size;
		buf = (char *) valloc(buf_size);

		if (buf == NULL){
			fprintf(stderr, "can't allocate buffer\n");
			exit(1);
		}
		/* trigger page faults and bring pages to memory. */
		for (int i = 0; i < buf_size / PAGE_SIZE; i++)
			buf[i * PAGE_SIZE] = 0;

		current = 0;
	}

	~rand_buf() {
		free(buf);
	}

	char *next_entry() {
		int off = buf_offset.get_offset(current);
		current = (current + 1) % num_entries;
		return &buf[off];
	}

	int get_entry_size() {
		return entry_size;
	}
};

/* this data structure stores the thread-private info. */
class thread_private
{
public:
	pthread_t id;
	/* the location in the thread descriptor array. */
	int idx;
	rand_buf buf;
	workload_gen *gen;
	ssize_t read_bytes;
	struct timeval start_time;
	struct timeval end_time;

	virtual ssize_t access(char *, off_t, ssize_t, int) = 0;
	virtual int thread_init() = 0;

	thread_private(int idx, int entry_size): buf(NUM_PAGES / nthreads * PAGE_SIZE, entry_size) {
		this->idx = idx;
		read_bytes = 0;
	}

};

class read_private: public thread_private
{
	const char *file_name;
	int fd;
	int flags;
protected:
	int get_fd() {
		return fd;
	}

public:
	read_private(const char *name, int idx, int entry_size,
			int flags = O_RDWR): thread_private(idx, entry_size), file_name(name) {
		this->flags = flags;
	}

	int thread_init() {
		int ret;

		fd = open(file_name, flags);
		if (fd < 0) {
			perror("open");
			exit (1);
		}
		ret = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
		if (ret < 0) {
			perror("posix_fadvise");
			exit(1);
		}
		return 0;
	}

	ssize_t access(char *buf, off_t offset, ssize_t size, int access_method) {
		assert(offset < 0x100000000L);
		ssize_t ret;
		if (access_method == WRITE)
			ret = pwrite(fd, buf, size, offset);
		else
			ret = pread(fd, buf, size, offset);
		return ret;
	}
};

class direct_private: public read_private
{
	char *pages;
	int buf_idx;
public:
	direct_private(const char *name, int idx, int entry_size): read_private(name, idx,
			entry_size, O_DIRECT | O_RDWR) {
		pages = (char *) valloc(PAGE_SIZE * 4096);
		buf_idx = 0;
	}

	ssize_t access(char *buf, off_t offset, ssize_t size, int access_method) {
		ssize_t ret;
		/* for simplicity, I assume all request sizes are smaller than a page size */
		assert(size <= PAGE_SIZE);
		if (ROUND_PAGE(offset) == offset
				&& (long) buf == ROUND_PAGE(buf)
				&& size == PAGE_SIZE) {
			ret = read_private::access(buf, offset, size, access_method);
		}
		else {
			assert(access_method == READ);
			buf_idx++;
			if (buf_idx == 4096)
				buf_idx = 0;
			char *page = pages + buf_idx * PAGE_SIZE;
			ret = read_private::access(page, ROUND_PAGE(offset), PAGE_SIZE, access_method);
			if (ret < 0)
				return ret;
			else
				memcpy(buf, page + (offset - ROUND_PAGE(offset)), size);
			ret = size;
		}
		return ret;
	}
};

thread_private *threads[NUM_THREADS];

void *rand_read(void *arg)
{
	ssize_t ret = -1;
	thread_private *priv = threads[(long) arg];
	rand_buf *buf;

	priv->thread_init();
	buf = &priv->buf;

	gettimeofday(&priv->start_time, NULL);
	while (priv->gen->has_next()) {
		char *entry = buf->next_entry();
		off_t off = priv->gen->next_offset();

		ret = priv->access(entry, off, buf->get_entry_size(), access_method);
		if (ret > 0) {
			assert(ret == buf->get_entry_size());
			if (ret > 0)
				priv->read_bytes += ret;
			else
				break;
		}
		if (ret < 0) {
			perror("access");
			exit(1);
		}
	}
	if (ret < 0) {
		perror("read");
		exit(1);
	}
	gettimeofday(&priv->end_time, NULL);
	
	pthread_exit((void *) priv->read_bytes);
}

long str2size(std::string str)
{
	int len = str.length();
	long multiply = 1;
	if (str[len - 1] == 'M' || str[len - 1] == 'm') {
		multiply *= 1024 * 1024;
		str[len - 1] = 0;
	}
	else if (str[len - 1] == 'K' || str[len - 1] == 'k') {
		multiply *= 1024;
		str[len - 1] = 0;
	}
	else if (str[len - 1] == 'G' || str[len - 1] == 'g') {
		multiply *= 1024 * 1024 * 1024;
		str[len - 1] = 0;
	}
	return atol(str.c_str()) * multiply;
}

int main(int argc, char *argv[])
{
	int entry_size = 4096;
	std::string access_option;
	int ret;
	int i, j;
	struct timeval start_time, end_time;
	ssize_t read_bytes = 0;
	int num_files = 0;
	std::string file_names[NUM_THREADS];

	if (argc < 5) {
		fprintf(stderr, "there are %d argments\n", argc);
		fprintf(stderr, "read files option pages threads\n");
		exit(1);
	}

	for (int i = 1; i < argc; i++) {
		std::string str = argv[i];
		size_t found = str.find("=");
		/* if there isn't `=', I assume it's a file name*/
		if (found == std::string::npos) {
			file_names[num_files++] = str;
			continue;
		}

		std::string value = str.substr(found + 1);
		std::string key = str.substr(0, found);
		if (key.compare("option") == 0) {
			access_option = value;
		}
		else if(key.compare("pages") == 0) {
			npages = atoi(value.c_str());
		}
		else if(key.compare("threads") == 0) {
			nthreads = atoi(value.c_str());
		}
		else if(key.compare("access") == 0) {
			if(value.compare("read") == 0)
				access_method = READ;
			else if(value.compare("write") == 0)
				access_method = WRITE;
			else {
				fprintf(stderr, "wrong access method\n");
				exit(1);
			}
		}
		else {
			fprintf(stderr, "wrong option\n");
			exit(1);
		}
	}

	int num_entries = npages * (PAGE_SIZE / entry_size);

	if (nthreads > NUM_THREADS) {
		fprintf(stderr, "too many threads\n");
		exit(1);
	}
	if (num_files > 1 && num_files != nthreads) {
		fprintf(stderr, "if there are multiple files, \
				the number of files must be the same as the number of threads\n");
		exit(1);
	}

	/* initialize the threads' private data. */
	for (j = 0; j < nthreads; j++) {
		const char *file_name;
		if (num_files > 1) {
			file_name = file_names[j].c_str();
		}
		else {
			file_name = file_names[0].c_str();
		}
		if (access_option.compare("normal") == 0)
			threads[j] = new read_private(file_name, j, entry_size);
		else if (access_option.compare("direct") == 0)
			threads[j] = new direct_private(file_name, j, entry_size);
		else {
			fprintf(stderr, "wrong access option\n");
			exit(1);
		}
		
		long start, end;
		if (num_files > 1) {
			start = 0;
			end = npages * PAGE_SIZE / entry_size;
		}
		else {
			start = (long) npages / nthreads * PAGE_SIZE / entry_size * j;
			end = start + (long) npages / nthreads * PAGE_SIZE / entry_size;
		}
		printf("thread %d starts %ld ends %ld\n", j, start, end);
		threads[j]->gen = new local_rand_permute_workload(num_entries,
						entry_size, start, end);
	}

	ret = setpriority(PRIO_PROCESS, getpid(), -20);
	if (ret < 0) {
		perror("setpriority");
		exit(1);
	}

	gettimeofday(&start_time, NULL);
	global_start = start_time;
	for (i = 0; i < nthreads; i++) {
		ret = pthread_create(&threads[i]->id, NULL, rand_read, (void *) i);
		if (ret) {
			perror("pthread_create");
			exit(1);
		}
	}

	for (i = 0; i < nthreads; i++) {
		ssize_t size;
		ret = pthread_join(threads[i]->id, (void **) &size);
		if (ret) {
			perror("pthread_join");
			exit(1);
		}
		read_bytes += size;
	}
	gettimeofday(&end_time, NULL);
	printf("read %ld bytes, takes %f seconds\n",
			read_bytes, end_time.tv_sec - start_time.tv_sec
			+ ((float)(end_time.tv_usec - start_time.tv_usec))/1000000);
}

const rand_permute *local_rand_permute_workload::permute;


* Re: concurrent direct IO write in xfs
  2012-01-23 20:51         ` Zheng Da
@ 2012-01-24  0:34           ` Stan Hoeppner
  2012-01-24  1:40             ` Zheng Da
  2012-01-24  3:54           ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Stan Hoeppner @ 2012-01-24  0:34 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On 1/23/2012 2:51 PM, Zheng Da wrote:

> tmpfs doesn't support direct IO.

Of course not.  tmpfs resides entirely within the page cache (or some of
it in swap).  The whole point of direct IO is to bypass the page cache,
transferring data directly between user space memory and the storage
device.  As tmpfs is built entirely within the page cache, direct IO is
obviously impossible.  And it's also obviously unnecessary.
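
You can see this for yourself with a trivial check; on a typical setup I'd
expect the open() itself to be rejected (usually with EINVAL) on a tmpfs
mount such as /dev/shm -- a sketch, where the path is just an assumption
about your system:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* /dev/shm is assumed to be a tmpfs mount */
	int fd = open("/dev/shm/o_direct_test", O_CREAT | O_RDWR | O_DIRECT, 0644);

	if (fd < 0)
		perror("O_DIRECT open on tmpfs");	/* expect EINVAL */
	else
		close(fd);
	return 0;
}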

Yes, you will need to rewrite your application to use tmpfs as direct IO
calls won't work.  This is something you obviously would rather not do.
 Which brings us back to Dave's question, which you have not answered:

What exactly is the purpose of your program?  What does it aim to
accomplish?  Is it for a database application?  A word processor?  Or
simply a filesystem tester?  What do _you_ aim to accomplish with this
programming effort?

-- 
Stan


* Re: concurrent direct IO write in xfs
  2012-01-24  0:34           ` Stan Hoeppner
@ 2012-01-24  1:40             ` Zheng Da
  0 siblings, 0 replies; 19+ messages in thread
From: Zheng Da @ 2012-01-24  1:40 UTC (permalink / raw)
  To: stan; +Cc: xfs



Hello,

On Mon, Jan 23, 2012 at 7:34 PM, Stan Hoeppner <stan@hardwarefreak.com> wrote:

> On 1/23/2012 2:51 PM, Zheng Da wrote:
>
> > tmpfs doesn't support direct IO.
>
> Of course not.  tmpfs resides entirely within the page cache (or some of
> it in swap).  The whole point of direct IO is to bypass the page cache,
> transferring data directly between user space memory and the storage
> device.  As tmpfs is built entirely within the page cache, direct IO is
> obviously impossible.  And it's also obviously unnecessary.
>
> Yes, you will need to rewrite your application to use tmpfs as direct IO
> calls won't work.  This is something you obviously would rather not do.
>  Which brings us back to Dave's question, which you have not answered:
>
> What exactly is the purpose of your program?  What does it aim to
> accomplish?  Is it for a database application?  A word processor?  Or
> simply a filesystem tester?  What do _you_ aim to accomplish with this
> programming effort?
>
I'm trying to test the scalability of the page cache under a random-access
workload, where the cache hit rate is also relatively low.
The cache is now implemented in user space. When the cache misses, I need
to read data from the file system with direct IO.
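Roughly, the read path on a miss looks like this (a simplified sketch of the
idea with a toy direct-mapped cache, not my actual code):

#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE 4096
#define CACHE_SLOTS 1024			/* toy direct-mapped cache */

struct slot {
	off_t off;
	int valid;
	char data[PAGE_SIZE];
};
static struct slot cache[CACHE_SLOTS];

/* On a hit, serve the page from the user-space cache; on a miss, read it
 * from the file (opened with O_DIRECT) into an aligned buffer and cache it. */
static void read_page(int fd, off_t page_off, char *out, char *aligned_buf)
{
	struct slot *s = &cache[(page_off / PAGE_SIZE) % CACHE_SLOTS];

	if (!s->valid || s->off != page_off) {
		pread(fd, aligned_buf, PAGE_SIZE, page_off);	/* miss: direct IO */
		memcpy(s->data, aligned_buf, PAGE_SIZE);
		s->off = page_off;
		s->valid = 1;
	}
	memcpy(out, s->data, PAGE_SIZE);
}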

Da


* Re: concurrent direct IO write in xfs
  2012-01-23 20:51         ` Zheng Da
  2012-01-24  0:34           ` Stan Hoeppner
@ 2012-01-24  3:54           ` Dave Chinner
  2012-01-25 21:20             ` Zheng Da
  2012-02-09  6:09             ` Dave Chinner
  1 sibling, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2012-01-24  3:54 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:
> Hello
> 
> On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936@gmail.com> wrote:
> >
> > > I build XFS on the top of ramdisk. So yes, there is a lot of small
> >> > concurrent writes in a second.
> >> > I create a file of 4GB in XFS (the ramdisk has 5GB of space). My test
> >> > program overwrites 4G of data to the file and each time writes a page of
> >> > data randomly to the file. It's always overwriting, and no appending.
> >> The
> >> > offset of each write is always aligned to the page size. There is no
> >> > overlapping between writes.
> >>
> >> Why are you using XFS for this? tmpfs was designed to do this sort
> >> of stuff as efficiently as possible....
> >>
> > OK, I can try that.
> >
> tmpfs doesn't support direct IO.

It doesn't need to. The ramdisk is copying data into its own
private page cache and you are using direct IO to avoid the system
page cache (i.e. a double copy). tmpfs just uses the system page
cache, so there's only one copy and it has a much shorter and less
complex IO path than XFS.....

> >> > So the test case is pretty simple and I think it's easy to reproduce it.
> >> > It'll be great if you can try the test case.
> >>
> >> Can you post your test code so I know what I test is exactly what
> >> you are running?
> >>
> > I can do that. My test code gets very complicated now. I need to simplify
> > it.
> >
> Here is the code. It's still a bit long. I hope it's OK.
> You can run the code like "rand-read file option=direct pages=1048576
> threads=8 access=write/read".

With 262144 pages on a 2GB ramdisk, the results I get on 3.2.0 are:

Threads		Read	Write
    1		0.92s	1.49s
    2		0.51s	1.20s
    4		0.31s	1.34s
    8		0.22s	1.59s
   16		0.23s	2.24s

The contention is on the ip->i_ilock, and the newsize update is one
of the offenders. It probably needs this change to
xfs_aio_write_newsize_update():

-        if (new_size == ip->i_new_size) {
+        if (new_size && new_size == ip->i_new_size) {

to avoid the lock being taken here.

But all that newsize crap is gone in the current git Linus tree,
so how much does that gain us:

Threads		Read	Write
    1		0.88s	0.85s
    2		0.54s	1.20s
    4		0.31s	1.23s
    8		0.27s	1.40s
   16		0.25s	2.36s

Pretty much nothing. IOWs, it's just like I suspected - you are
doing so many write IOs that you are serialising on the extent
lookup and write checks which use exclusive locking..

Given that it is 2 lock traversals per write IO, we're limited to
about 400-500,000 exclusive lock grabs per second, decreasing as
contention goes up.

For reads, we are doing 2 shared (nested) lookups per read IO, and we
appear to be limited to around 2,000,000 shared lock grabs per
second. Amdahl's law is kicking in here, but it means that if we could
make the writes use a shared lock, they would at least scale like
the reads for this "no metadata modification except for mtime"
overwrite case.
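
(Sanity check against the table above: 262144 write IOs in ~1.2s at 2
threads is roughly 220,000 IOs/s, or ~440,000 exclusive grabs/s at two
ilock traversals per IO; for reads at 16 threads, 262144 IOs in ~0.25s is
roughly 1,050,000 IOs/s, i.e. ~2.1 million shared grabs/s.)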

I don't think that the generic write checks absolutely need
exclusive locking - we probably could get away with a shared lock
and only fall back to exclusive when we need to do EOF zeroing.
Similarly, for the block mapping code if we don't need to do
allocation, a shared lock is all we need. So maybe in that case for
direct IO when create == 1, we can do a read lookup first and only
grab the lock exclusively if that falls in a hole and requires
allocation.....

Let me think about it for a bit....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: concurrent direct IO write in xfs
  2012-01-24  3:54           ` Dave Chinner
@ 2012-01-25 21:20             ` Zheng Da
  2012-01-25 22:25               ` Dave Chinner
  2012-02-09  6:09             ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Zheng Da @ 2012-01-25 21:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs



Hello Dave,

On Mon, Jan 23, 2012 at 10:54 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> > >> > So the test case is pretty simple and I think it's easy to
> reproduce it.
> > >> > It'll be great if you can try the test case.
> > >>
> > >> Can you post your test code so I know what I test is exactly what
> > >> you are running?
> > >>
> > > I can do that. My test code gets very complicated now. I need to
> simplify
> > > it.
> > >
> > Here is the code. It's still a bit long. I hope it's OK.
> > You can run the code like "rand-read file option=direct pages=1048576
> > threads=8 access=write/read".
>
> With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are
>
> Threads         Read    Write
>    1           0.92s   1.49s
>    2           0.51s   1.20s
>    4           0.31s   1.34s
>    8           0.22s   1.59s
>   16           0.23s   2.24s
>
> the contention is on the ip->i_ilock, and the newsize update is one
> of the offenders It probably needs this change to
> xfs_aio_write_newsize_update():
>
> -        if (new_size == ip->i_new_size) {
> +        if (new_size && new_size == ip->i_new_size) {
>
> to avoid the lock being taken here.
>
> But all that newsize crap is gone in the current git Linus tree,
> so how much would that gains us:
>
> Threads         Read    Write
>    1           0.88s   0.85s
>    2           0.54s   1.20s
>    4           0.31s   1.23s
>    8           0.27s   1.40s
>   16           0.25s   2.36s
>
> Pretty much nothing. IOWs, it's just like I suspected - you are
> doing so many write IOs that you are serialising on the extent
> lookup and write checks which use exclusive locking..
>
> Given that it is 2 lock traversals per write IO, we're limiting at
> about 4-500,000 exclusive lock grabs per second and decreasing as
> contention goes up.
>
> For reads, we are doing 2 shared (nested) lookups per read IO, we
> appear to be limiting at around 2,000,000 shared lock grabs per
> second. Ahmdals law is kicking in here, but it means if we could
> make the writes to use a shared lock, it would at least scale like
> the reads for this "no metadata modification except for mtime"
> overwrite case.
>
> I don't think that the generic write checks absolutely need
> exclusive locking - we probably could get away with a shared lock
> and only fall back to exclusive when we need to do EOF zeroing.
> Similarly, for the block mapping code if we don't need to do
> allocation, a shared lock is all we need. So maybe in that case for
> direct IO when create == 1, we can do a read lookup first and only
> grab the lock exclusively if that falls in a hole and requires
> allocation.....


Do you think you will provide a patch for these changes?

Thanks,
Da


* Re: concurrent direct IO write in xfs
  2012-01-25 21:20             ` Zheng Da
@ 2012-01-25 22:25               ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2012-01-25 22:25 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Wed, Jan 25, 2012 at 04:20:12PM -0500, Zheng Da wrote:
> Hello Dave,
> 
> On Mon, Jan 23, 2012 at 10:54 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > > >> > So the test case is pretty simple and I think it's easy to
> > reproduce it.
> > > >> > It'll be great if you can try the test case.
> > > >>
> > > >> Can you post your test code so I know what I test is exactly what
> > > >> you are running?
> > > >>
> > > > I can do that. My test code gets very complicated now. I need to
> > simplify
> > > > it.
> > > >
> > > Here is the code. It's still a bit long. I hope it's OK.
> > > You can run the code like "rand-read file option=direct pages=1048576
> > > threads=8 access=write/read".
> >
> > With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are
> >
> > Threads         Read    Write
> >    1           0.92s   1.49s
> >    2           0.51s   1.20s
> >    4           0.31s   1.34s
> >    8           0.22s   1.59s
> >   16           0.23s   2.24s
> >
> > the contention is on the ip->i_ilock, and the newsize update is one
> > of the offenders It probably needs this change to
> > xfs_aio_write_newsize_update():
> >
> > -        if (new_size == ip->i_new_size) {
> > +        if (new_size && new_size == ip->i_new_size) {
> >
> > to avoid the lock being taken here.
> >
> > But all that newsize crap is gone in the current git Linus tree,
> > so how much would that gains us:
> >
> > Threads         Read    Write
> >    1           0.88s   0.85s
> >    2           0.54s   1.20s
> >    4           0.31s   1.23s
> >    8           0.27s   1.40s
> >   16           0.25s   2.36s
> >
> > Pretty much nothing. IOWs, it's just like I suspected - you are
> > doing so many write IOs that you are serialising on the extent
> > lookup and write checks which use exclusive locking..
> >
> > Given that it is 2 lock traversals per write IO, we're limiting at
> > about 4-500,000 exclusive lock grabs per second and decreasing as
> > contention goes up.
> >
> > For reads, we are doing 2 shared (nested) lookups per read IO, we
> > appear to be limiting at around 2,000,000 shared lock grabs per
> > second. Ahmdals law is kicking in here, but it means if we could
> > make the writes to use a shared lock, it would at least scale like
> > the reads for this "no metadata modification except for mtime"
> > overwrite case.
> >
> > I don't think that the generic write checks absolutely need
> > exclusive locking - we probably could get away with a shared lock
> > and only fall back to exclusive when we need to do EOF zeroing.
> > Similarly, for the block mapping code if we don't need to do
> > allocation, a shared lock is all we need. So maybe in that case for
> > direct IO when create == 1, we can do a read lookup first and only
> > grab the lock exclusively if that falls in a hole and requires
> > allocation.....
> 
> 
> Do you think if you will provide a patch for the changes?

I'm still thinking on it. I do have other work to do right now, so
this is low priority. If it appears safe to do, then I'll write a
patch and propose it. If it can't be made safe for all cases, then
you'll have to think of some other way to achieve what you want from
your application.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: concurrent direct IO write in xfs
  2012-01-24  3:54           ` Dave Chinner
  2012-01-25 21:20             ` Zheng Da
@ 2012-02-09  6:09             ` Dave Chinner
  2012-02-09  6:44               ` Zheng Da
  2012-02-13 17:48               ` Christoph Hellwig
  1 sibling, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2012-02-09  6:09 UTC (permalink / raw)
  To: Zheng Da; +Cc: xfs

On Tue, Jan 24, 2012 at 02:54:31PM +1100, Dave Chinner wrote:
> On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:
> > Hello
> > 
> > On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936@gmail.com> wrote:
> > >> > So the test case is pretty simple and I think it's easy to reproduce it.
> > >> > It'll be great if you can try the test case.
> > >>
> > >> Can you post your test code so I know what I test is exactly what
> > >> you are running?
> > >>
> > > I can do that. My test code gets very complicated now. I need to simplify
> > > it.
> > >
> > Here is the code. It's still a bit long. I hope it's OK.
> > You can run the code like "rand-read file option=direct pages=1048576
> > threads=8 access=write/read".
> 
> With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are
> 
> Threads		Read	Write
>     1		0.92s	1.49s
>     2		0.51s	1.20s
>     4		0.31s	1.34s
>     8		0.22s	1.59s
>    16		0.23s	2.24s
> 
> the contention is on the ip->i_ilock, and the newsize update is one
> of the offenders It probably needs this change to
> xfs_aio_write_newsize_update():
> 
> -        if (new_size == ip->i_new_size) {
> +        if (new_size && new_size == ip->i_new_size) {
> 
> to avoid the lock being taken here.
> 
> But all that newsize crap is gone in the current git Linus tree,
> so how much would that gains us:
> 
> Threads		Read	Write
>     1		0.88s	0.85s
>     2		0.54s	1.20s
>     4		0.31s	1.23s
>     8		0.27s	1.40s
>    16		0.25s	2.36s
> 
> Pretty much nothing. IOWs, it's just like I suspected - you are
> doing so many write IOs that you are serialising on the extent
> lookup and write checks which use exclusive locking..
> 
> Given that it is 2 lock traversals per write IO, we're limiting at
> about 4-500,000 exclusive lock grabs per second and decreasing as
> contention goes up.
> 
> For reads, we are doing 2 shared (nested) lookups per read IO, we
> appear to be limiting at around 2,000,000 shared lock grabs per
> second. Ahmdals law is kicking in here, but it means if we could
> make the writes to use a shared lock, it would at least scale like
> the reads for this "no metadata modification except for mtime"
> overwrite case.
> 
> I don't think that the generic write checks absolutely need
> exclusive locking - we probably could get away with a shared lock
> and only fall back to exclusive when we need to do EOF zeroing.
> Similarly, for the block mapping code if we don't need to do
> allocation, a shared lock is all we need. So maybe in that case for
> direct IO when create == 1, we can do a read lookup first and only
> grab the lock exclusively if that falls in a hole and requires
> allocation.....

So, I have a proof of concept patch that gets rid of the exclusive
locking for the overwrite case.

Results are:

			Writes
Threads		vanilla		patched		read
    1		0.85s		0.93s		0.88s
    2		1.20s		0.58s		0.54s
    4		1.23s		0.32s		0.31s
    8		1.40s		0.27s		0.27s
   16		2.36s		0.23s		0.25s

So overwrites scale pretty much like reads now: ~1,000,000 overwrite
IOs per second to that one file with 8-16 threads. Given these tests
are running on an 8p VM, it's not surprising it doesn't go any
faster than that as the thread count goes up.
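
(That is, 262144 overwrites in 0.23-0.27s works out to roughly 0.97-1.14
million IOs/s.)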

The patch hacks in some stuff that Christoph's transactional size
and timestamp update patches do correctly, so these changes would
need to wait for that series to be finalised. As it is, this patch
doesn't appear to cause any new xfstests regressions, so it's good
for discussion and testing....

Anyway, for people to comment on, the patch is below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: use shared ilock mode for direct IO writes by default

From: Dave Chinner <dchinner@redhat.com>

For the direct IO write path, we only really need the ilock to be
taken in exclusive mode during IO submission if we need to do extent
allocation. We currently take it in exclusive mode for both the
write sanity checks and for block mapping and allocation.

In the case of the write sanity checks, we only need to protect the
inode from change while this is occurring, and hence we can use a
shared lock for this. We still need to provide exclusion for EOF
zeroing, so we need to detect that case and upgrade the locking to
exclusive in that case. This is a simple extension of the existing
iolock upgrade case.

We also have the case of timestamp updates occurring inside the
ilock. However, we don't really care if timestamp update races occur
as they are going to end up with the same timestamp anyway. Further,
as we move to transactional timestamp updates, we can't do the
update from within the ilock at all. Hence move the timestamp
update outside the ilock altogether as we don't need or want it to
be protected there.

For block mapping, the direct IO case has to drop the ilock after
the initial read mapping for transaction reservation before the
allocation can be done. This means that the mapping to determine if
allocation is needed simply requires a "don't change" locking
semantic. i.e. a shared lock.

Hence we can safely change the xfs_get_blocks() code to use a shared
lock for the initial mapping lookup because that provides the same
guarantees but doesn't introduce new race conditions due to the
allocation having to upgrade the lock - it already has those race
conditions and has to handle them. This means that overwrite and
write into preallocated space direct IO will be mapped with just a
shared lock.

Finally, we only take the ilock during IO completion if the current
ioend is beyond the file size. This is a quick hack that will be
fixed properly by transactional size updates.

In combination, these three changes remove all exclusive
serialisation points in the direct IO write path for single file
overwrite workloads.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c  |   22 ++++++++++++++++++++--
 fs/xfs/xfs_file.c  |   35 +++++++++++++++++++++--------------
 fs/xfs/xfs_iomap.c |   10 +++++++---
 fs/xfs/xfs_iomap.h |    2 +-
 4 files changed, 49 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 74b9baf..4c84508 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -138,6 +138,9 @@ xfs_setfilesize(
 	xfs_inode_t		*ip = XFS_I(ioend->io_inode);
 	xfs_fsize_t		isize;
 
+	if (!xfs_ioend_new_eof(ioend))
+		return 0;
+
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_EXCL))
 		return EAGAIN;
 
@@ -1121,7 +1124,22 @@ __xfs_get_blocks(
 		return 0;
 
 	if (create) {
-		lockmode = XFS_ILOCK_EXCL;
+		/*
+		 * For direct IO, we lock in shared mode so that write
+		 * operations that don't require allocation can occur
+		 * concurrently. The ilock has to be dropped over the allocation
+		 * transaction reservation, so the only thing the ilock is
+		 * providing here is modification exclusion. i.e. there is no
+		 * need to hold the lock exclusive.
+		 *
+		 * For buffered IO, if we need to do delayed allocation then
+		 * hold the ilock exclusive so that the lookup and delalloc
+		 * reservation is atomic.
+		 */
+		if (direct)
+			lockmode = XFS_ILOCK_SHARED;
+		else
+			lockmode = XFS_ILOCK_EXCL;
 		xfs_ilock(ip, lockmode);
 	} else {
 		lockmode = xfs_ilock_map_shared(ip);
@@ -1144,7 +1162,7 @@ __xfs_get_blocks(
 	      imap.br_startblock == DELAYSTARTBLOCK))) {
 		if (direct) {
 			error = xfs_iomap_write_direct(ip, offset, size,
-						       &imap, nimaps);
+						       &imap, nimaps, &lockmode);
 		} else {
 			error = xfs_iomap_write_delay(ip, offset, size, &imap);
 		}
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 10ec272..c74e28c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -650,41 +650,48 @@ xfs_file_aio_write_checks(
 	struct inode		*inode = file->f_mapping->host;
 	struct xfs_inode	*ip = XFS_I(inode);
 	int			error = 0;
+	int			ilock = XFS_ILOCK_SHARED;
 
-	xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
+	xfs_rw_ilock(ip, ilock);
 restart:
 	error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
 	if (error) {
-		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
+		xfs_rw_iunlock(ip, ilock);
 		return error;
 	}
 
-	if (likely(!(file->f_mode & FMODE_NOCMTIME)))
-		file_update_time(file);
-
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.  If zeroing is needed and we are currently holding the
-	 * iolock shared, we need to update it to exclusive which involves
-	 * dropping all locks and relocking to maintain correct locking order.
-	 * If we do this, restart the function to ensure all checks and values
-	 * are still valid.
+	 * write.  If zeroing is needed and we are currently holding shared
+	 * locks, we need to update it to exclusive which involves dropping all
+	 * locks and relocking to maintain correct locking order.  If we do
+	 * this, restart the function to ensure all checks and values are still
+	 * valid.
 	 */
 	if (*pos > i_size_read(inode)) {
-		if (*iolock == XFS_IOLOCK_SHARED) {
-			xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
+		if (*iolock == XFS_IOLOCK_SHARED || ilock == XFS_ILOCK_SHARED) {
+			xfs_rw_iunlock(ip, ilock | *iolock);
 			*iolock = XFS_IOLOCK_EXCL;
-			xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
+			ilock = XFS_ILOCK_EXCL;
+			xfs_rw_ilock(ip, ilock | *iolock);
 			goto restart;
 		}
 		error = -xfs_zero_eof(ip, *pos, i_size_read(inode));
 	}
-	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_rw_iunlock(ip, ilock);
 	if (error)
 		return error;
 
 	/*
+	 * we can't do any operation that might call .dirty_inode under the
+	 * ilock when we move to completely transactional updates. Hence this
+	 * timestamp must sit outside the ilock.
+	 */
+	if (likely(!(file->f_mode & FMODE_NOCMTIME)))
+		file_update_time(file);
+
+	/*
 	 * If we're writing the file then make sure to clear the setuid and
 	 * setgid bits if the process is not being run by root.  This keeps
 	 * people from modifying setuid and setgid binaries.
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 246c7d5..792c81d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -123,7 +123,8 @@ xfs_iomap_write_direct(
 	xfs_off_t	offset,
 	size_t		count,
 	xfs_bmbt_irec_t *imap,
-	int		nmaps)
+	int		nmaps,
+	int		*lockmode)
 {
 	xfs_mount_t	*mp = ip->i_mount;
 	xfs_fileoff_t	offset_fsb;
@@ -189,7 +190,8 @@ xfs_iomap_write_direct(
 	/*
 	 * Allocate and setup the transaction
 	 */
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip, *lockmode);
+
 	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);
 	error = xfs_trans_reserve(tp, resblks,
 			XFS_WRITE_LOG_RES(mp), resrtextents,
@@ -200,7 +202,9 @@ xfs_iomap_write_direct(
 	 */
 	if (error)
 		xfs_trans_cancel(tp, 0);
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	*lockmode = XFS_ILOCK_EXCL;
+	xfs_ilock(ip, *lockmode);
 	if (error)
 		goto error_out;
 
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 8061576..21e3398 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -22,7 +22,7 @@ struct xfs_inode;
 struct xfs_bmbt_irec;
 
 extern int xfs_iomap_write_direct(struct xfs_inode *, xfs_off_t, size_t,
-			struct xfs_bmbt_irec *, int);
+			struct xfs_bmbt_irec *, int, int *);
 extern int xfs_iomap_write_delay(struct xfs_inode *, xfs_off_t, size_t,
 			struct xfs_bmbt_irec *);
 extern int xfs_iomap_write_allocate(struct xfs_inode *, xfs_off_t, size_t,

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: concurrent direct IO write in xfs
  2012-02-09  6:09             ` Dave Chinner
@ 2012-02-09  6:44               ` Zheng Da
  2012-02-13 17:48               ` Christoph Hellwig
  1 sibling, 0 replies; 19+ messages in thread
From: Zheng Da @ 2012-02-09  6:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs


[-- Attachment #1.1: Type: text/plain, Size: 13670 bytes --]

Thanks, Dave. I'll also try this patch.
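
In case it is useful to anyone else, the workload is essentially just
page-aligned, non-overlapping O_DIRECT overwrites of one file from several
threads. A minimal standalone sketch of that (not the rand-read harness
itself, and the thread count and sizes here are made up) looks like:

/*
 * Minimal sketch: NTHREADS writers doing page-aligned, non-overlapping
 * O_DIRECT overwrites of an existing, fully allocated file.
 * Build: gcc -O2 -pthread -o od-overwrite od-overwrite.c
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS		8
#define PAGES_PER_THREAD	(128 * 1024)	/* 512MB per thread at 4k pages */

static int fd;
static long pagesz;

static void *writer(void *arg)
{
	long id = (long)arg;
	off_t base = (off_t)id * PAGES_PER_THREAD * pagesz;
	void *buf;
	long i;

	/* O_DIRECT wants the buffer, offset and length suitably aligned */
	if (posix_memalign(&buf, pagesz, pagesz))
		return NULL;
	memset(buf, (int)id, pagesz);

	for (i = 0; i < PAGES_PER_THREAD; i++) {
		if (pwrite(fd, buf, pagesz, base + i * pagesz) != pagesz) {
			perror("pwrite");
			break;
		}
	}
	free(buf);
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tid[NTHREADS];
	long i;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <preallocated file>\n", argv[0]);
		return 1;
	}
	pagesz = sysconf(_SC_PAGESIZE);

	/* the file must already exist and cover the whole range (overwrite) */
	fd = open(argv[1], O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, writer, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	close(fd);
	return 0;
}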

Da

On Thu, Feb 9, 2012 at 1:09 AM, Dave Chinner <david@fromorbit.com> wrote:

> On Tue, Jan 24, 2012 at 02:54:31PM +1100, Dave Chinner wrote:
> > On Mon, Jan 23, 2012 at 03:51:43PM -0500, Zheng Da wrote:
> > > Hello
> > >
> > > On Mon, Jan 23, 2012 at 2:34 PM, Zheng Da <zhengda1936@gmail.com>
> wrote:
> > > >> > So the test case is pretty simple and I think it's easy to
> reproduce it.
> > > >> > It'll be great if you can try the test case.
> > > >>
> > > >> Can you post your test code so I know what I test is exactly what
> > > >> you are running?
> > > >>
> > > > I can do that. My test code gets very complicated now. I need to
> simplify
> > > > it.
> > > >
> > > Here is the code. It's still a bit long. I hope it's OK.
> > > You can run the code like "rand-read file option=direct pages=1048576
> > > threads=8 access=write/read".
> >
> > With 262144 pages on a 2Gb ramdisk, the results I get on 3.2.0 are
> >
> > Threads      Read     Write
> >       1     0.92s    1.49s
> >       2     0.51s    1.20s
> >       4     0.31s    1.34s
> >       8     0.22s    1.59s
> >      16     0.23s    2.24s
> >
> > the contention is on the ip->i_ilock, and the newsize update is one
> > of the offenders. It probably needs this change to
> > xfs_aio_write_newsize_update():
> >
> > -        if (new_size == ip->i_new_size) {
> > +        if (new_size && new_size == ip->i_new_size) {
> >
> > to avoid the lock being taken here.
> >
> > But all that newsize crap is gone in the current git Linus tree,
> > so how much would that gains us:
> >
> > Threads      Read     Write
> >       1     0.88s    0.85s
> >       2     0.54s    1.20s
> >       4     0.31s    1.23s
> >       8     0.27s    1.40s
> >      16     0.25s    2.36s
> >
> > Pretty much nothing. IOWs, it's just like I suspected - you are
> > doing so many write IOs that you are serialising on the extent
> > lookup and write checks which use exclusive locking..
> >
> > Given that it is 2 lock traversals per write IO, we're limiting at
> > about 4-500,000 exclusive lock grabs per second and decreasing as
> > contention goes up.
> >
> > For reads, we are doing 2 shared (nested) lookups per read IO, we
> > appear to be limiting at around 2,000,000 shared lock grabs per
> > second. Amdahl's law is kicking in here, but it means if we could
> > make the writes to use a shared lock, it would at least scale like
> > the reads for this "no metadata modification except for mtime"
> > overwrite case.
> >
> > I don't think that the generic write checks absolutely need
> > exclusive locking - we probably could get away with a shared lock
> > and only fall back to exclusive when we need to do EOF zeroing.
> > Similarly, for the block mapping code if we don't need to do
> > allocation, a shared lock is all we need. So maybe in that case for
> > direct IO when create == 1, we can do a read lookup first and only
> > grab the lock exclusively if that falls in a hole and requires
> > allocation.....
>
> So, I have a proof of concept patch that gets rid of the exclusive
> locking for the overwrite case.
>
> Results are:
>
>                        Writes
> Threads         vanilla         patched         read
>    1           0.85s           0.93s           0.88s
>    2           1.20s           0.58s           0.54s
>    4           1.23s           0.32s           0.31s
>    8           1.40s           0.27s           0.27s
>   16           2.36s           0.23s           0.25s
>
> So overwrites scale pretty much like reads now: ~1,000,000 overwrite
> IOs per second to that one file with 8-16 threads. Given these tests
> are running on an 8p VM, it's not surprising it doesn't go any
> faster than that as the thread count goes up.
>
> The patch hacks in some stuff that Christoph's transactional size
> and timestamp update patches do correctly, so these changes would
> need to wait for that series to be finalised. As it is, this patch
> doesn't appear to cause any new xfstests regressions, so it's good
> for discussion and testing....
>
> Anyway, for people to comment on, the patch is below.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
>
>
> xfs: use shared ilock mode for direct IO writes by default
>
> From: Dave Chinner <dchinner@redhat.com>
>
> For the direct IO write path, we only really need the ilock to be
> taken in exclusive mode during IO submission if we need to do extent
> allocation. We currently take it in exclusive mode for both the
> write sanity checks and for block mapping and allocation.
>
> In the case of the write sanity checks, we only need to protect the
> inode from change while this is occurring, and hence we can use a
> shared lock for this. We still need to provide exclusion for EOF
> zeroing, so we need to detect that case and upgrade the locking to
> exclusive in that case. This is a simple extension of the existing
> iolock upgrade case.
>
> We also have the case of timestamp updates occurring inside the
> ilock. however, we don't really care if timestamp update races occur
> as they are going to end up with the same timestamp anyway. Further,
> as we move to transactional timestamp updates, we can't do the
> update from within the ilock at all. Hence move the timestamp
> update outside the ilock altogether as we don't need or want it to
> be protected there.
>
> For block mapping, the direct IO case has to drop the ilock after
> the initial read mapping for transaction reservation before the
> allocation can be done. This means that the mapping to determine if
> allocation is needed simply requires a "don't change" locking
> semantic. i.e. a shared lock.
>
> Hence we can safely change the xfs_get_blocks() code to use a shared
> lock for the initial mapping lookup because that provides the same
> guarantees but doesn't introduce new race conditions due to the
> allocation having to upgrade the lock - it already has those race
> conditions and has to handle them. This means that overwrite and
> write into preallocated space direct IO will be mapped with just a
> shared lock.
>
> Finally, we only take the ilock during IO completion if the current
> ioend is beyond the file size. This is a quick hack that will be
> fixed properly by transactional size updates.
>
> In combination, these three changes remove all exclusive
> serialisation points in the direct IO write path for single file
> overwrite workloads.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c  |   22 ++++++++++++++++++++--
>  fs/xfs/xfs_file.c  |   35 +++++++++++++++++++++--------------
>  fs/xfs/xfs_iomap.c |   10 +++++++---
>  fs/xfs/xfs_iomap.h |    2 +-
>  4 files changed, 49 insertions(+), 20 deletions(-)
>
> [full diff trimmed -- it is identical to the patch in Dave's message above]
>

[-- Attachment #1.2: Type: text/html, Size: 15698 bytes --]

[-- Attachment #2: Type: text/plain, Size: 121 bytes --]

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: concurrent direct IO write in xfs
  2012-02-09  6:09             ` Dave Chinner
  2012-02-09  6:44               ` Zheng Da
@ 2012-02-13 17:48               ` Christoph Hellwig
  2012-02-13 23:07                 ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2012-02-13 17:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Zheng Da, xfs

On Thu, Feb 09, 2012 at 05:09:20PM +1100, Dave Chinner wrote:
>  	if (create) {
> -		lockmode = XFS_ILOCK_EXCL;
> +		/*
> +		 * For direct IO, we lock in shared mode so that write
> +		 * operations that don't require allocation can occur
> +		 * concurrently. The ilock has to be dropped over the allocation
> +		 * transaction reservation, so the only thing the ilock is
> +		 * providing here is modification exclusion. i.e. there is no
> +		 * need to hold the lock exclusive.
> +		 *
> +		 * For buffered IO, if we need to do delayed allocation then
> +		 * hold the ilock exclusive so that the lookup and delalloc
> +		 * reservation is atomic.
> +		 */
> +		if (direct)
> +			lockmode = XFS_ILOCK_SHARED;
> +		else
> +			lockmode = XFS_ILOCK_EXCL;
>  		xfs_ilock(ip, lockmode);
>  	} else {
>  		lockmode = xfs_ilock_map_shared(ip);

We'll actually need to use xfs_ilock_map_shared for the direct
create case too, to make sure we have the exclusive lock when we first
read the extent list in.

Also xfs_qm_dqattach_locked really wants the inode locked exclusively,
which your current code doesn't handle.
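
For the first point, something along these lines in __xfs_get_blocks()
should do it, I think (untested sketch, just to illustrate):

	if (create) {
		if (direct) {
			/*
			 * xfs_ilock_map_shared() falls back to the
			 * exclusive lock if the extent list still has
			 * to be read in, and returns the mode it took.
			 */
			lockmode = xfs_ilock_map_shared(ip);
		} else {
			lockmode = XFS_ILOCK_EXCL;
			xfs_ilock(ip, lockmode);
		}
	} else {
		lockmode = xfs_ilock_map_shared(ip);
	}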

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: concurrent direct IO write in xfs
  2012-02-13 17:48               ` Christoph Hellwig
@ 2012-02-13 23:07                 ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2012-02-13 23:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Zheng Da, xfs

On Mon, Feb 13, 2012 at 12:48:06PM -0500, Christoph Hellwig wrote:
> On Thu, Feb 09, 2012 at 05:09:20PM +1100, Dave Chinner wrote:
> >  	if (create) {
> > -		lockmode = XFS_ILOCK_EXCL;
> > +		/*
> > +		 * For direct IO, we lock in shared mode so that write
> > +		 * operations that don't require allocation can occur
> > +		 * concurrently. The ilock has to be dropped over the allocation
> > +		 * transaction reservation, so the only thing the ilock is
> > +		 * providing here is modification exclusion. i.e. there is no
> > +		 * need to hold the lock exclusive.
> > +		 *
> > +		 * For buffered IO, if we need to do delayed allocation then
> > +		 * hold the ilock exclusive so that the lookup and delalloc
> > +		 * reservation is atomic.
> > +		 */
> > +		if (direct)
> > +			lockmode = XFS_ILOCK_SHARED;
> > +		else
> > +			lockmode = XFS_ILOCK_EXCL;
> >  		xfs_ilock(ip, lockmode);
> >  	} else {
> >  		lockmode = xfs_ilock_map_shared(ip);
> 
> We'll actually need to use xfs_ilock_map_shared for the direct
> create case too, to make sure we have the exclusive lock when we first
> read the extent list in.

Good point.

> Also xfs_qm_dqattach_locked really wants the inode locked exclusively,
> which your current code doesn't handle.

I didn't consider quotas. Looking at the code, it seems that
xfs_qm_dqattach_locked() wants the exclusive lock purely to ensure there
are no races attaching the dquots to the inode. The xfs_dqget() code can actually
drop and regain the ilock, so I can't see that it is for any other
specific purpose.

I think that we could probably attach the dquots after dropping the
shared lock but before starting the transaction, via a call to
xfs_qm_dqattach(), which handles the locking internally. It does mean
an extra lock traversal for the quota case, but it still allows the
initial mapping to be done with a shared lock....
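
i.e. roughly this in xfs_iomap_write_direct(), between dropping the ilock
and setting up the transaction (completely untested; the flags argument and
the error path here are just a guess):

	xfs_iunlock(ip, *lockmode);

	/*
	 * Attach the dquots while the ilock is not held;
	 * xfs_qm_dqattach() does its own inode locking internally.
	 */
	error = xfs_qm_dqattach(ip, 0);
	if (error) {
		*lockmode = XFS_ILOCK_EXCL;
		xfs_ilock(ip, *lockmode);
		goto error_out;
	}

	tp = xfs_trans_alloc(mp, XFS_TRANS_DIOSTRAT);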

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-02-13 23:07 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-16  0:01 concurrent direct IO write in xfs Zheng Da
2012-01-16 17:48 ` Christoph Hellwig
2012-01-16 19:44   ` Zheng Da
2012-01-16 23:25 ` Dave Chinner
2012-01-17 19:19   ` Zheng Da
2012-01-20  8:53     ` Linda Walsh
2012-01-20 15:07       ` Zheng Da
2012-01-23  5:11     ` Dave Chinner
2012-01-23 19:34       ` Zheng Da
2012-01-23 20:51         ` Zheng Da
2012-01-24  0:34           ` Stan Hoeppner
2012-01-24  1:40             ` Zheng Da
2012-01-24  3:54           ` Dave Chinner
2012-01-25 21:20             ` Zheng Da
2012-01-25 22:25               ` Dave Chinner
2012-02-09  6:09             ` Dave Chinner
2012-02-09  6:44               ` Zheng Da
2012-02-13 17:48               ` Christoph Hellwig
2012-02-13 23:07                 ` Dave Chinner
