linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch] readv/writev rework
@ 2002-09-11  8:30 Andrew Morton
  2002-09-12 13:00 ` Hirokazu Takahashi
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-11  8:30 UTC (permalink / raw)
  To: lkml; +Cc: Janet Morgan

(Ripping out a few things for final review here....)



This is Janet Morgan's patch which converts the readv/writev code
to submit all segments for IO before waiting on them, rather than
submitting each segment separately.

This is a critical performance fix for O_DIRECT reads and writes.
Prior to this change, O_DIRECT vectored IO was forced to wait for
completion against each segment of the iovec rather than submitting all
segments and waiting on the lot.  ie: for ten segments, this code will
be ten times faster.

There will also be moderate improvements for buffered IO - smaller code
paths, plus writev() only takes i_sem once.

The patch ended up quite large unfortunately - turned out that the only
sane way to implement this without duplicating significant amounts of
code (the generic_file_write() bounds checking, all the O_DIRECT
handling, etc) was to redo generic_file_read() and generic_file_write()
to take an iovec/nr_segs pair rather than `buf, count'.

New exported functions generic_file_readv() and generic_file_writev()
have been added:

ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
unsigned long nr_segs, loff_t *ppos);
ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
unsigned long nr_segs, loff_t * ppos);

If a driver does not use these in their file_operations then they will
continue to use the old readv/writev code, which sits in a loop calling
calls fops->read() or fops->write().

ext2, ext3, JFS and the blockdev driver are currently using this
capability.

Some coding cleanups were made in fs/read_write.c.  Mainly:

- pass "READ" or "WRITE" around to indicate the diretion of the
  operation, rather than the (confusing, inverted)
  VERIFY_READ/VERIFY_WRITE.

- Use the identifier `nr_segs' everywhere to indicate the iovec
  length rather than `count', which is often used to indicate the
  number of bytes in the syscall.  It was confusing the heck out of me.

- Some cleanups to the raw driver.

- Some additional generality in fs/direct_io.c: the core `struct dio'
  used to be a "populate-and-go" thing.  Janet has broken that up so
  you can initialise a struct dio once, then loop around feeding it
  more file segments, then wait on completion against everything.

- In a couple of places we needed to handle the situation where we
  knew, a-priori, that the user was going to get a short read or write.
  File size limit exceeded, read past i_size, etc.  We handled that by
  shortening the iovec in-place with iov_shorten().  Which is not
  particularly pretty, but neither were the alternatives.


The performance benefits are much, much greater than I expected.  Here
is Janet's report:


I used a derivative of one of the LTP dio tests and the raw interface
to get the following data points:

IO type:         Vector io patch is:

Random writev        8.7 times faster
Sequential writev    6.6 times faster

Random readv         not faster, but sys time improves 5x
Sequential readv     not faster, but sys time improves 5x

Random mixed I/O     5 times faster
Sequential mixed I/O 6.6 times faster

Each raw readv or writev was for 32K (4k x 8).  I used a disk attached to
an aic controller.

Please let me know if you'd like a different set of data or whether I
should repeat using O_DIRECT.

I've attached below the time and vmstat output for each test.
P.S. Test times between say "Random writev" and "Random readv" should
not be compared because I may have varied the number of times through
the main testloop.

---------------------------------------------------------------------------

Random writev

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3772016   6564  61128   0   0     0   669 1176   346   0   0 100

real    0m14.254s
user    0m0.000s
sys     0m0.048s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3771980   6568  61128   0   0     0  5920 1195   376   0   0 100

real    0m1.628s
user    0m0.001s
sys     0m0.009s

---------------------------------------------------------------------------

Sequential writev

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3771912   6620  61132   0   0     0   652 1170   330   0   0 100

real    0m14.853s
user    0m0.000s
sys     0m0.060s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3771852   6636  61132   0   0     0  4352 1143   278   0   0 100

real    0m2.238s
user    0m0.001s
sys     0m0.011s

---------------------------------------------------------------------------

Random readv

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3771712   6732  61136   0   0 20012     0 6011 10012   0   4  96

real    0m4.760s
user    0m0.002s
sys     0m0.651s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3769908   7924  61500   0   0 21440     0 1678  1346   0   3  97

real    0m4.510s
user    0m0.002s
sys     0m0.125s

---------------------------------------------------------------------------

Sequential readv

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3769340   7276  62832   0   0 20052     0 6020 10030   0   4  96

real    0m4.797s
user    0m0.001s
sys     0m0.684s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3769320   7244  62832   0   0 21344     0 1674  1338   0   2  98

real    0m4.545s
user    0m0.000s
sys     0m0.119s

----------------------------------------------------------------------------

Random mixed I/O

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3770012   7884  61500   0   0   592   592 1303   596   0   2  98

real    0m16.237s
user    0m0.000s
sys     0m0.111s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3770044   7840  61500   0   0  3104  3104 1201   392   0   0 100

real    0m3.193s
user    0m0.001s
sys     0m0.017s

----------------------------------------------------------------------------

Sequential mixed I/O

before patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3769232   7344  62832   0   0   640   648 1329   648   0   0 100

real    0m14.863s
user    0m0.003s
sys     0m0.106s

after patch:

procs                      memory    swap          io     system         cpu
r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
0  1  0      0 3769192   7384  62832   0   0  4192  4197 1274   540   0   0 100

real    0m2.256s
user    0m0.002s
sys     0m0.023s




 drivers/char/raw.c  |   54 ++++++++++++++-------
 fs/block_dev.c      |   52 ++++++++++++--------
 fs/direct-io.c      |  132 ++++++++++++++++++++++++++++++----------------------
 fs/ext2/file.c      |    2 
 fs/ext2/inode.c     |    8 +--
 fs/ext3/file.c      |   22 ++++----
 fs/ext3/inode.c     |   10 ++-
 fs/jfs/file.c       |    2 
 fs/jfs/inode.c      |    8 +--
 fs/read_write.c     |  109 +++++++++++++++++++++++++++---------------
 include/linux/fs.h  |   19 ++++---
 include/linux/uio.h |   15 +++++
 kernel/ksyms.c      |    4 +
 mm/filemap.c        |  108 ++++++++++++++++++++++++++++++++++--------
 14 files changed, 363 insertions(+), 182 deletions(-)

--- 2.5.34/drivers/char/raw.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/drivers/char/raw.c	Tue Sep 10 18:57:53 2002
@@ -201,25 +201,29 @@ out:
 }
 
 static ssize_t
-rw_raw_dev(int rw, struct file *filp, char *buf, size_t size, loff_t *offp)
+rw_raw_dev(int rw, struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *offp)
 {
 	const int minor = minor(filp->f_dentry->d_inode->i_rdev);
 	struct block_device *bdev = raw_devices[minor].binding;
 	struct inode *inode = bdev->bd_inode;
+ 	size_t count = iov_length(iov, nr_segs); 
 	ssize_t ret = 0;
 
-	if (size == 0)
-		goto out;
-	ret = -EINVAL;
-	if (size < 0)
-		goto out;
-	ret = -ENXIO;
-	if (*offp >= inode->i_size)
-		goto out;
-
-	if (size + *offp > inode->i_size)
-		size = inode->i_size - *offp;
-	ret = generic_file_direct_IO(rw, inode, buf, *offp, size);
+	if (count == 0)
+		goto out;	
+
+	if ((ssize_t)count < 0)
+		return -EINVAL;	
+
+	if (*offp >= inode->i_size) 
+		return -ENXIO;
+
+	if (count + *offp > inode->i_size) {
+		count = inode->i_size - *offp;
+		nr_segs = iov_shorten((struct iovec *)iov, nr_segs, count);
+	}
+	ret = generic_file_direct_IO(rw, inode, iov, *offp, nr_segs);
+
 	if (ret > 0)
 		*offp += ret;
 out:
@@ -227,15 +231,31 @@ out:
 }
 
 static ssize_t
-raw_read(struct file *filp, char * buf, size_t size, loff_t *offp)
+raw_read(struct file *filp, char *buf, size_t size, loff_t *offp)
 {
-	return rw_raw_dev(READ, filp, buf, size, offp);
+	struct iovec local_iov = { .iov_base = buf, .iov_len = size};
+
+	return rw_raw_dev(READ, filp, &local_iov, 1, offp);
 }
 
 static ssize_t
 raw_write(struct file *filp, const char *buf, size_t size, loff_t *offp)
 {
-	return rw_raw_dev(WRITE, filp, (char *)buf, size, offp);
+	struct iovec local_iov = { .iov_base = buf, .iov_len = size};
+
+	return rw_raw_dev(WRITE, filp, &local_iov, 1, offp);
+}
+
+static ssize_t 
+raw_readv(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *offp) 
+{
+	return rw_raw_dev(READ, filp, iov, nr_segs, offp);
+}
+
+static ssize_t 
+raw_writev(struct file *filp, const struct iovec *iov, unsigned long nr_segs, loff_t *offp) 
+{
+	return rw_raw_dev(WRITE, filp, iov, nr_segs, offp);
 }
 
 static struct file_operations raw_fops = {
@@ -244,6 +264,8 @@ static struct file_operations raw_fops =
 	.open	=	raw_open,
 	.release=	raw_release,
 	.ioctl	=	raw_ioctl,
+	.readv	= 	raw_readv,
+	.writev	= 	raw_writev,
 	.owner	=	THIS_MODULE,
 };
 
--- 2.5.34/fs/block_dev.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/block_dev.c	Wed Sep 11 01:05:34 2002
@@ -116,11 +116,11 @@ blkdev_get_blocks(struct inode *inode, s
 }
 
 static int
-blkdev_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count)
+blkdev_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs)
 {
-	return generic_direct_IO(rw, inode, buf, offset,
-				count, blkdev_get_blocks);
+	return generic_direct_IO(rw, inode, iov, offset,
+				nr_segs, blkdev_get_blocks);
 }
 
 static int blkdev_writepage(struct page * page)
@@ -787,6 +787,14 @@ static int blkdev_reread_part(struct blo
 	return res;
 }
 
+static ssize_t blkdev_file_write(struct file *file, const char *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
+
+	return generic_file_write_nolock(file, &local_iov, 1, ppos);
+}
+
 static int blkdev_ioctl(struct inode *inode, struct file *file, unsigned cmd,
 			unsigned long arg)
 {
@@ -832,26 +840,28 @@ static int blkdev_ioctl(struct inode *in
 }
 
 struct address_space_operations def_blk_aops = {
-	readpage: blkdev_readpage,
-	writepage: blkdev_writepage,
-	sync_page: block_sync_page,
-	prepare_write: blkdev_prepare_write,
-	commit_write: blkdev_commit_write,
-	writepages: generic_writepages,
-	vm_writeback: generic_vm_writeback,
-	direct_IO: blkdev_direct_IO,
+	.readpage	= blkdev_readpage,
+	.writepage	= blkdev_writepage,
+	.sync_page	= block_sync_page,
+	.prepare_write	= blkdev_prepare_write,
+	.commit_write	= blkdev_commit_write,
+	.writepages	= generic_writepages,
+	.vm_writeback	= generic_vm_writeback,
+	.direct_IO	= blkdev_direct_IO,
 };
 
 struct file_operations def_blk_fops = {
-	open:		blkdev_open,
-	release:	blkdev_close,
-	llseek:		block_llseek,
-	read:		generic_file_read,
-	write:		generic_file_write_nolock,
-	mmap:		generic_file_mmap,
-	fsync:		block_fsync,
-	ioctl:		blkdev_ioctl,
-	sendfile:	generic_file_sendfile,
+	.open		= blkdev_open,
+	.release	= blkdev_close,
+	.llseek		= block_llseek,
+	.read		= generic_file_read,
+	.write		= blkdev_file_write,
+	.mmap		= generic_file_mmap,
+	.fsync		= block_fsync,
+	.ioctl		= blkdev_ioctl,
+	.readv		= generic_file_readv,
+	.writev		= generic_file_writev,
+	.sendfile	= generic_file_sendfile,
 };
 
 int ioctl_by_bdev(struct block_device *bdev, unsigned cmd, unsigned long arg)
--- 2.5.34/fs/direct-io.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/direct-io.c	Tue Sep 10 18:57:53 2002
@@ -75,7 +75,7 @@ struct dio {
  */
 static inline unsigned dio_pages_present(struct dio *dio)
 {
-	return dio->head - dio->tail;
+	return dio->tail - dio->head;
 }
 
 /*
@@ -265,6 +265,10 @@ static int dio_bio_complete(struct dio *
 static int dio_await_completion(struct dio *dio)
 {
 	int ret = 0;
+
+	if (dio->bio)
+		dio_bio_submit(dio);
+
 	while (atomic_read(&dio->bio_count)) {
 		struct bio *bio = dio_await_one(dio);
 		int ret2;
@@ -523,29 +527,16 @@ out:
 	return ret;
 }
 
-/*
- * The main direct-IO function.  This is a library function for use by
- * filesystem drivers.
- */
 int
-generic_direct_IO(int rw, struct inode *inode, char *buf, loff_t offset,
-			size_t count, get_blocks_t get_blocks)
+direct_io_worker(int rw, struct inode *inode, const struct iovec *iov, 
+	loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks)
 {
 	const unsigned blkbits = inode->i_blkbits;
-	const unsigned blocksize_mask = (1 << blkbits) - 1;
-	const unsigned long user_addr = (unsigned long)buf;
-	int ret;
-	int ret2;
+	unsigned long user_addr; 
+	int seg, ret2, ret = 0;
 	struct dio dio;
-	size_t bytes;
-
-	/* Check the memory alignment.  Blocks cannot straddle pages */
-	if ((user_addr & blocksize_mask) || (count & blocksize_mask)) {
-		ret = -EINVAL;
-		goto out;
-	}
+	size_t bytes, tot_bytes = 0;
 
-	/* BIO submission state */
 	dio.bio = NULL;
 	dio.bvec = NULL;
 	dio.inode = inode;
@@ -553,31 +544,13 @@ generic_direct_IO(int rw, struct inode *
 	dio.blkbits = blkbits;
 	dio.block_in_file = offset >> blkbits;
 	dio.blocks_available = 0;
-	dio.final_block_in_request = (offset + count) >> blkbits;
 
-	/* Index into the first page of the first block */
-	dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1)) >> blkbits;
 	dio.boundary = 0;
 	dio.reap_counter = 0;
 	dio.get_blocks = get_blocks;
 	dio.last_block_in_bio = -1;
 	dio.next_block_in_bio = -1;
 
-	/* Page fetching state */
-	dio.curr_page = 0;
-	bytes = count;
-	dio.total_pages = 0;
-	if (user_addr & (PAGE_SIZE - 1)) {
-		dio.total_pages++;
-		bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
-	}
-
-	dio.total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
-	dio.curr_user_address = user_addr;
-
-	/* Page queue */
-	dio.head = 0;
-	dio.tail = 0;
 	dio.page_errors = 0;
 
 	/* BIO completion state */
@@ -586,38 +559,75 @@ generic_direct_IO(int rw, struct inode *
 	dio.bio_list = NULL;
 	dio.waiter = NULL;
 
-	ret = do_direct_IO(&dio);
+	for (seg = 0; seg < nr_segs; seg++) {
+		user_addr = (unsigned long)iov[seg].iov_base;
+		bytes = iov[seg].iov_len;
+
+		/* Index into the first page of the first block */
+		dio.first_block_in_page = (user_addr & (PAGE_SIZE - 1)) >> blkbits;
+		dio.final_block_in_request = dio.block_in_file + (bytes >> blkbits);
+		/* Page fetching state */
+		dio.head = 0;
+		dio.tail = 0;
+		dio.curr_page = 0;
+
+		dio.total_pages = 0;
+		if (user_addr & (PAGE_SIZE-1)) {
+			dio.total_pages++;
+			bytes -= PAGE_SIZE - (user_addr & (PAGE_SIZE - 1));
+		}
+		dio.total_pages += (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
+		dio.curr_user_address = user_addr;
+	
+		ret = do_direct_IO(&dio);
+
+		if (ret) {
+			dio_cleanup(&dio);
+			break;
+		}
+
+		tot_bytes += iov[seg].iov_len - ((dio.final_block_in_request -
+					dio.block_in_file) << blkbits);
+
+	} /* end iovec loop */
 
-	if (dio.bio)
-		dio_bio_submit(&dio);
-	if (ret)
-		dio_cleanup(&dio);
 	ret2 = dio_await_completion(&dio);
 	if (ret == 0)
 		ret = ret2;
 	if (ret == 0)
 		ret = dio.page_errors;
 	if (ret == 0)
-		ret = count - ((dio.final_block_in_request -
-				dio.block_in_file) << blkbits);
-out:
+		ret = tot_bytes; 
+
 	return ret;
 }
 
-ssize_t
-generic_file_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count)
+/*
+ * This is a library function for use by filesystem drivers.
+ */
+int
+generic_direct_IO(int rw, struct inode *inode, const struct iovec *iov, 
+	loff_t offset, unsigned long nr_segs, get_blocks_t get_blocks)
 {
+	int seg;
+	size_t size;
+	unsigned long addr;
 	struct address_space *mapping = inode->i_mapping;
-	unsigned blocksize_mask;
-	ssize_t retval;
+	unsigned blocksize_mask = (1 << inode->i_blkbits) - 1;
+	ssize_t retval = -EINVAL;
 
-	blocksize_mask = (1 << inode->i_blkbits) - 1;
-	if ((offset & blocksize_mask) || (count & blocksize_mask)) {
-		retval = -EINVAL;
+	if (offset & blocksize_mask) {
 		goto out;
 	}
 
+	/* Check the memory alignment.  Blocks cannot straddle pages */
+	for (seg = 0; seg < nr_segs; seg++) {
+		addr = (unsigned long)iov[seg].iov_base;
+		size = iov[seg].iov_len;
+		if ((addr & blocksize_mask) || (size & blocksize_mask)) 
+			goto out;	
+	}
+
 	if (mapping->nrpages) {
 		retval = filemap_fdatawrite(mapping);
 		if (retval == 0)
@@ -625,9 +635,21 @@ generic_file_direct_IO(int rw, struct in
 		if (retval)
 			goto out;
 	}
-	retval = mapping->a_ops->direct_IO(rw, inode, buf, offset, count);
+
+	retval = direct_io_worker(rw, inode, iov, offset, nr_segs, get_blocks);
+out:
+	return retval;
+}
+
+ssize_t
+generic_file_direct_IO(int rw, struct inode *inode, const struct iovec *iov, 
+	loff_t offset, unsigned long nr_segs)
+{
+	struct address_space *mapping = inode->i_mapping;
+	ssize_t retval;
+
+	retval = mapping->a_ops->direct_IO(rw, inode, iov, offset, nr_segs);
 	if (inode->i_mapping->nrpages)
 		invalidate_inode_pages2(inode->i_mapping);
-out:
 	return retval;
 }
--- 2.5.34/fs/ext2/file.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/ext2/file.c	Tue Sep 10 18:57:53 2002
@@ -46,6 +46,8 @@ struct file_operations ext2_file_operati
 	.open		= generic_file_open,
 	.release	= ext2_release_file,
 	.fsync		= ext2_sync_file,
+	.readv		= generic_file_readv,
+	.writev		= generic_file_writev,
 	.sendfile	= generic_file_sendfile,
 };
 
--- 2.5.34/fs/ext2/inode.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/ext2/inode.c	Wed Sep 11 01:05:16 2002
@@ -619,11 +619,11 @@ ext2_get_blocks(struct inode *inode, sec
 }
 
 static int
-ext2_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count)
+ext2_direct_IO(int rw, struct inode *inode, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs)
 {
-	return generic_direct_IO(rw, inode, buf,
-				offset, count, ext2_get_blocks);
+	return generic_direct_IO(rw, inode, iov,
+				offset, nr_segs, ext2_get_blocks);
 }
 
 static int
--- 2.5.34/fs/jfs/file.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/jfs/file.c	Tue Sep 10 18:57:53 2002
@@ -108,6 +108,8 @@ struct file_operations jfs_file_operatio
 	.write		= generic_file_write,
 	.read		= generic_file_read,
 	.mmap		= generic_file_mmap,
+	.readv		= generic_file_readv,
+	.writev		= generic_file_writev,
  	.sendfile	= generic_file_sendfile,
 	.fsync		= jfs_fsync,
 };
--- 2.5.34/fs/jfs/inode.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/jfs/inode.c	Wed Sep 11 01:05:16 2002
@@ -309,11 +309,11 @@ static int jfs_bmap(struct address_space
 	return generic_block_bmap(mapping, block, jfs_get_block);
 }
 
-static int jfs_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count)
+static int jfs_direct_IO(int rw, struct inode *inode, const struct iovec *iov, 
+			loff_t offset, unsigned long nr_segs)
 {
-	return generic_direct_IO(rw, inode, buf,
-				offset, count, jfs_get_blocks);
+	return generic_direct_IO(rw, inode, iov,
+				offset, nr_segs, jfs_get_blocks);
 }
 
 struct address_space_operations jfs_aops = {
--- 2.5.34/fs/read_write.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/read_write.c	Tue Sep 10 18:57:53 2002
@@ -286,9 +286,29 @@ asmlinkage ssize_t sys_pwrite64(unsigned
 	return ret;
 }
 
+/*
+ * Reduce an iovec's length in-place.  Return the resulting number of segments
+ */
+unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to)
+{
+	unsigned long seg = 0;
+	size_t len = 0;
+
+	while (seg < nr_segs) {
+		seg++;
+		if (len + iov->iov_len >= to) {
+			iov->iov_len = to - len;
+			break;
+		}
+		len += iov->iov_len;
+		iov++;
+	}
+	return seg;
+}
+
 static ssize_t do_readv_writev(int type, struct file *file,
 			       const struct iovec * vector,
-			       unsigned long count)
+			       unsigned long nr_segs)
 {
 	typedef ssize_t (*io_fn_t)(struct file *, char *, size_t, loff_t *);
 	typedef ssize_t (*iov_fn_t)(struct file *, const struct iovec *, unsigned long, loff_t *);
@@ -296,73 +316,86 @@ static ssize_t do_readv_writev(int type,
 	size_t tot_len;
 	struct iovec iovstack[UIO_FASTIOV];
 	struct iovec *iov=iovstack;
-	ssize_t ret, i;
+	ssize_t ret = -EINVAL;
+	int seg;
 	io_fn_t fn;
 	iov_fn_t fnv;
 	struct inode *inode;
 
 	/*
+	 * SuS says "The readv() function *may* fail if the iovcnt argument
+	 * was less than or equal to 0, or greater than {IOV_MAX}.  Linux has
+	 * traditionally returned -EINVAL for zero segments, so...
+	 */
+	if (nr_segs == 0)
+		goto out;
+
+	/*
 	 * First get the "struct iovec" from user memory and
 	 * verify all the pointers
 	 */
-	ret = 0;
-	if (!count)
-		goto out_nofree;
-	ret = -EINVAL;
-	if (count > UIO_MAXIOV)
-		goto out_nofree;
+	if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0))
+		goto out;
 	if (!file->f_op)
-		goto out_nofree;
-	if (count > UIO_FASTIOV) {
+		goto out;
+	if (nr_segs > UIO_FASTIOV) {
 		ret = -ENOMEM;
-		iov = kmalloc(count*sizeof(struct iovec), GFP_KERNEL);
+		iov = kmalloc(nr_segs*sizeof(struct iovec), GFP_KERNEL);
 		if (!iov)
-			goto out_nofree;
+			goto out;
 	}
 	ret = -EFAULT;
-	if (copy_from_user(iov, vector, count*sizeof(*vector)))
+	if (copy_from_user(iov, vector, nr_segs*sizeof(*vector)))
 		goto out;
 
 	/*
 	 * Single unix specification:
-	 * We should -EINVAL if an element length is not >= 0 and fitting an ssize_t
-	 * The total length is fitting an ssize_t
+	 * We should -EINVAL if an element length is not >= 0 and fitting an
+	 * ssize_t.  The total length is fitting an ssize_t
 	 *
 	 * Be careful here because iov_len is a size_t not an ssize_t
 	 */
-	 
 	tot_len = 0;
 	ret = -EINVAL;
-	for (i = 0 ; i < count ; i++) {
+	for (seg = 0 ; seg < nr_segs; seg++) {
 		ssize_t tmp = tot_len;
-		ssize_t len = (ssize_t)iov[i].iov_len;
+		ssize_t len = (ssize_t)iov[seg].iov_len;
 		if (len < 0)	/* size_t not fitting an ssize_t .. */
 			goto out;
 		tot_len += len;
 		if (tot_len < tmp) /* maths overflow on the ssize_t */
 			goto out;
 	}
+	if (tot_len == 0) {
+		ret = 0;
+		goto out;
+	}
 
 	inode = file->f_dentry->d_inode;
 	/* VERIFY_WRITE actually means a read, as we write to user space */
-	ret = locks_verify_area((type == VERIFY_WRITE
+	ret = locks_verify_area((type == READ 
 				 ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE),
 				inode, file, file->f_pos, tot_len);
-	if (ret) goto out;
+	if (ret)
+		goto out;
 
-	fnv = (type == VERIFY_WRITE ? file->f_op->readv : file->f_op->writev);
+	fnv = NULL;
+	if (type == READ) {
+		fn = file->f_op->read;
+		fnv = file->f_op->readv;
+	} else {
+		fn = (io_fn_t)file->f_op->write;
+		fnv = file->f_op->writev;
+	}
 	if (fnv) {
-		ret = fnv(file, iov, count, &file->f_pos);
+		ret = fnv(file, iov, nr_segs, &file->f_pos);
 		goto out;
 	}
 
-	/* VERIFY_WRITE actually means a read, as we write to user space */
-	fn = (type == VERIFY_WRITE ? file->f_op->read :
-	      (io_fn_t) file->f_op->write);
-
+	/* Do it by hand, with file-ops */
 	ret = 0;
 	vector = iov;
-	while (count > 0) {
+	while (nr_segs > 0) {
 		void * base;
 		size_t len;
 		ssize_t nr;
@@ -370,7 +403,7 @@ static ssize_t do_readv_writev(int type,
 		base = vector->iov_base;
 		len = vector->iov_len;
 		vector++;
-		count--;
+		nr_segs--;
 
 		nr = fn(file, base, len, &file->f_pos);
 
@@ -382,20 +415,18 @@ static ssize_t do_readv_writev(int type,
 		if (nr != len)
 			break;
 	}
-
 out:
 	if (iov != iovstack)
 		kfree(iov);
-out_nofree:
-	/* VERIFY_WRITE actually means a read, as we write to user space */
-	if ((ret + (type == VERIFY_WRITE)) > 0)
+	if ((ret + (type == READ)) > 0)
 		dnotify_parent(file->f_dentry,
-			(type == VERIFY_WRITE) ? DN_MODIFY : DN_ACCESS);
+				(type == READ) ? DN_MODIFY : DN_ACCESS);
 	return ret;
 }
 
-asmlinkage ssize_t sys_readv(unsigned long fd, const struct iovec * vector,
-			     unsigned long count)
+
+asmlinkage ssize_t
+sys_readv(unsigned long fd, const struct iovec *vector, unsigned long nr_segs)
 {
 	struct file * file;
 	ssize_t ret;
@@ -409,7 +440,7 @@ asmlinkage ssize_t sys_readv(unsigned lo
 	    (file->f_op->readv || file->f_op->read)) {
 		ret = security_ops->file_permission (file, MAY_READ);
 		if (!ret)
-			ret = do_readv_writev(VERIFY_WRITE, file, vector, count);
+			ret = do_readv_writev(READ, file, vector, nr_segs);
 	}
 	fput(file);
 
@@ -417,8 +448,8 @@ bad_file:
 	return ret;
 }
 
-asmlinkage ssize_t sys_writev(unsigned long fd, const struct iovec * vector,
-			      unsigned long count)
+asmlinkage ssize_t
+sys_writev(unsigned long fd, const struct iovec * vector, unsigned long nr_segs)
 {
 	struct file * file;
 	ssize_t ret;
@@ -432,7 +463,7 @@ asmlinkage ssize_t sys_writev(unsigned l
 	    (file->f_op->writev || file->f_op->write)) {
 		ret = security_ops->file_permission (file, MAY_WRITE);
 		if (!ret)
-			ret = do_readv_writev(VERIFY_READ, file, vector, count);
+			ret = do_readv_writev(WRITE, file, vector, nr_segs);
 	}
 	fput(file);
 
--- 2.5.34/include/linux/fs.h~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/include/linux/fs.h	Wed Sep 11 01:05:18 2002
@@ -307,8 +307,7 @@ struct address_space_operations {
 	int (*bmap)(struct address_space *, long);
 	int (*invalidatepage) (struct page *, unsigned long);
 	int (*releasepage) (struct page *, int);
-	int (*direct_IO)(int, struct inode *, char *buf,
-				loff_t offset, size_t count);
+	int (*direct_IO)(int, struct inode *, const struct iovec *iov, loff_t offset, unsigned long nr_segs);
 };
 
 struct backing_dev_info;
@@ -1245,14 +1244,18 @@ extern int generic_file_mmap(struct file
 extern int file_read_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size);
 extern ssize_t generic_file_read(struct file *, char *, size_t, loff_t *);
 extern ssize_t generic_file_write(struct file *, const char *, size_t, loff_t *);
-extern ssize_t generic_file_write_nolock(struct file *, const char *, size_t, loff_t *);
+ssize_t generic_file_write_nolock(struct file *file, const struct iovec *iov,
+				unsigned long nr_segs, loff_t *ppos);
 extern ssize_t generic_file_sendfile(struct file *, struct file *, loff_t *, size_t);
 extern void do_generic_file_read(struct file *, loff_t *, read_descriptor_t *, read_actor_t);
-ssize_t generic_file_direct_IO(int rw, struct inode *inode, char *buf,
-				loff_t offset, size_t count);
-int generic_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count, get_blocks_t *get_blocks);
-
+extern ssize_t generic_file_direct_IO(int rw, struct inode *inode, 
+	const struct iovec *iov, loff_t offset, unsigned long nr_segs);
+extern int generic_direct_IO(int rw, struct inode *inode, const struct iovec 
+	*iov, loff_t offset, unsigned long nr_segs, get_blocks_t *get_blocks);
+extern ssize_t generic_file_readv(struct file *filp, const struct iovec *iov, 
+	unsigned long nr_segs, loff_t *ppos);
+ssize_t generic_file_writev(struct file *filp, const struct iovec *iov, 
+			unsigned long nr_segs, loff_t *ppos);
 extern loff_t no_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t generic_file_llseek(struct file *file, loff_t offset, int origin);
 extern loff_t remote_llseek(struct file *file, loff_t offset, int origin);
--- 2.5.34/kernel/ksyms.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/kernel/ksyms.c	Wed Sep 11 01:05:23 2002
@@ -42,6 +42,7 @@
 #include <linux/highuid.h>
 #include <linux/brlock.h>
 #include <linux/fs.h>
+#include <linux/uio.h>
 #include <linux/tty.h>
 #include <linux/in6.h>
 #include <linux/completion.h>
@@ -344,6 +345,9 @@ EXPORT_SYMBOL(register_disk);
 EXPORT_SYMBOL(read_dev_sector);
 EXPORT_SYMBOL(init_buffer);
 EXPORT_SYMBOL_GPL(generic_file_direct_IO);
+EXPORT_SYMBOL(generic_file_readv);
+EXPORT_SYMBOL(generic_file_writev);
+EXPORT_SYMBOL(iov_shorten);
 
 /* tty routines */
 EXPORT_SYMBOL(tty_hangup);
--- 2.5.34/mm/filemap.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/mm/filemap.c	Wed Sep 11 01:05:21 2002
@@ -18,6 +18,7 @@
 #include <linux/mman.h>
 #include <linux/pagemap.h>
 #include <linux/file.h>
+#include <linux/uio.h>
 #include <linux/iobuf.h>
 #include <linux/hash.h>
 #include <linux/writeback.h>
@@ -1121,14 +1122,18 @@ success:
  * This is the "read()" routine for all filesystems
  * that can use the page cache directly.
  */
-ssize_t
-generic_file_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
+static ssize_t
+__generic_file_read(struct file *filp, const struct iovec *iov,
+		unsigned long nr_segs, loff_t *ppos)
 {
 	ssize_t retval;
+	unsigned long seg;
+	size_t count = iov_length(iov, nr_segs);
 
 	if ((ssize_t) count < 0)
 		return -EINVAL;
 
+	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (filp->f_flags & O_DIRECT) {
 		loff_t pos = *ppos, size;
 		struct address_space *mapping;
@@ -1141,10 +1146,13 @@ generic_file_read(struct file *filp, cha
 			goto out; /* skip atime */
 		size = inode->i_size;
 		if (pos < size) {
-			if (pos + count > size)
+			if (pos + count > size) {
 				count = size - pos;
-			retval = generic_file_direct_IO(READ, inode,
-							buf, pos, count);
+				nr_segs = iov_shorten((struct iovec *)iov,
+							nr_segs, count);
+			}
+			retval = generic_file_direct_IO(READ, inode, 
+					iov, pos, nr_segs);
 			if (retval > 0)
 				*ppos = pos + retval;
 		}
@@ -1152,27 +1160,42 @@ generic_file_read(struct file *filp, cha
 		goto out;
 	}
 
-	retval = -EFAULT;
-	if (access_ok(VERIFY_WRITE, buf, count)) {
-		retval = 0;
+	for (seg = 0; seg < nr_segs; seg++) {
+		if (!access_ok(VERIFY_WRITE,iov[seg].iov_base,iov[seg].iov_len))
+			return -EFAULT;
+	}
 
-		if (count) {
+	retval = 0;
+	if (count) {
+		for (seg = 0; seg < nr_segs; seg++) {
 			read_descriptor_t desc;
 
 			desc.written = 0;
-			desc.count = count;
-			desc.buf = buf;
+			desc.buf = iov[seg].iov_base;
+			desc.count = iov[seg].iov_len;
+			if (desc.count == 0)
+				continue;
 			desc.error = 0;
 			do_generic_file_read(filp,ppos,&desc,file_read_actor);
-			retval = desc.written;
-			if (!retval)
+			retval += desc.written;
+			if (!retval) {
 				retval = desc.error;
+				break;
+			}
 		}
 	}
 out:
 	return retval;
 }
 
+ssize_t
+generic_file_read(struct file *filp, char *buf, size_t count, loff_t *ppos)
+{
+	struct iovec local_iov = { .iov_base = buf, .iov_len = count };
+
+	return __generic_file_read(filp, &local_iov, 1, ppos);
+}
+
 static int file_send_actor(read_descriptor_t * desc, struct page *page, unsigned long offset, unsigned long size)
 {
 	ssize_t written;
@@ -1926,11 +1949,14 @@ filemap_copy_from_user(struct page *page
  * it for writing by marking it dirty.
  *							okir@monad.swb.de
  */
-ssize_t generic_file_write_nolock(struct file *file, const char *buf,
-				  size_t count, loff_t *ppos)
+ssize_t
+generic_file_write_nolock(struct file *file, const struct iovec *iov,
+				unsigned long nr_segs, loff_t *ppos)
 {
 	struct address_space * mapping = file->f_dentry->d_inode->i_mapping;
 	struct address_space_operations *a_ops = mapping->a_ops;
+	const size_t ocount = iov_length(iov, nr_segs);
+	size_t count =	ocount;
 	struct inode 	*inode = mapping->host;
 	unsigned long	limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
 	long		status = 0;
@@ -1942,12 +1968,19 @@ ssize_t generic_file_write_nolock(struct
 	unsigned	bytes;
 	time_t		time_now;
 	struct pagevec	lru_pvec;
+	struct iovec	*cur_iov;
+	unsigned	iov_bytes;	/* Cumulative count to the end of the
+					   current iovec */
+	unsigned long	seg;
+	char		*buf;
 
 	if (unlikely((ssize_t)count < 0))
 		return -EINVAL;
 
-	if (unlikely(!access_ok(VERIFY_READ, buf, count)))
-		return -EFAULT;
+	for (seg = 0; seg < nr_segs; seg++) {
+		if (!access_ok(VERIFY_READ,iov[seg].iov_base,iov[seg].iov_len))
+			return -EFAULT;
+	}
 
 	pos = *ppos;
 	if (unlikely(pos < 0))
@@ -2045,9 +2078,13 @@ ssize_t generic_file_write_nolock(struct
 		mark_inode_dirty_sync(inode);
 	}
 
+	/* coalesce the iovecs and go direct-to-BIO for O_DIRECT */
 	if (unlikely(file->f_flags & O_DIRECT)) {
-		written = generic_file_direct_IO(WRITE, inode,
-						(char *)buf, pos, count);
+		if (count != ocount)
+			nr_segs = iov_shorten((struct iovec *)iov,
+						nr_segs, count);
+		written = generic_file_direct_IO(WRITE, inode, 
+					iov, pos, nr_segs);
 		if (written > 0) {
 			loff_t end = pos + written;
 			if (end > inode->i_size && !S_ISBLK(inode->i_mode)) {
@@ -2065,6 +2102,9 @@ ssize_t generic_file_write_nolock(struct
 		goto out_status;
 	}
 
+	cur_iov = (struct iovec *)iov;
+	iov_bytes = cur_iov->iov_len;
+	buf = cur_iov->iov_base;
 	do {
 		unsigned long index;
 		unsigned long offset;
@@ -2075,6 +2115,8 @@ ssize_t generic_file_write_nolock(struct
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
+		if (bytes + written > iov_bytes)
+			bytes = iov_bytes - written;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2084,7 +2126,7 @@ ssize_t generic_file_write_nolock(struct
 		 */
 		fault_in_pages_readable(buf, bytes);
 
-		page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
+		page = __grab_cache_page(mapping,index,&cached_page,&lru_pvec);
 		if (!page) {
 			status = -ENOMEM;
 			break;
@@ -2115,6 +2157,11 @@ ssize_t generic_file_write_nolock(struct
 				count -= status;
 				pos += status;
 				buf += status;
+				if (written == iov_bytes && count) {
+					cur_iov++;
+					iov_bytes += cur_iov->iov_len;
+					buf = cur_iov->iov_base;
+				}
 			}
 		}
 		if (!PageReferenced(page))
@@ -2151,10 +2198,29 @@ ssize_t generic_file_write(struct file *
 {
 	struct inode	*inode = file->f_dentry->d_inode->i_mapping->host;
 	int		err;
+	struct iovec local_iov = { .iov_base = (void *)buf, .iov_len = count };
 
 	down(&inode->i_sem);
-	err = generic_file_write_nolock(file, buf, count, ppos);
+	err = generic_file_write_nolock(file, &local_iov, 1, ppos);
 	up(&inode->i_sem);
 
 	return err;
 }
+
+ssize_t generic_file_readv(struct file *filp, const struct iovec *iov,
+			unsigned long nr_segs, loff_t *ppos)
+{
+	return __generic_file_read(filp, iov, nr_segs, ppos);
+}
+
+ssize_t generic_file_writev(struct file *file, const struct iovec *iov,
+			unsigned long nr_segs, loff_t * ppos) 
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	ssize_t ret;
+
+	down(&inode->i_sem);
+	ret = generic_file_write_nolock(file, iov, nr_segs, ppos);
+	up(&inode->i_sem);
+	return ret;
+}
--- 2.5.34/fs/ext3/inode.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/ext3/inode.c	Wed Sep 11 01:05:31 2002
@@ -1399,13 +1399,15 @@ static int ext3_releasepage(struct page 
  * If the O_DIRECT write is intantiating holes inside i_size and the machine
  * crashes then stale disk data _may_ be exposed inside the file.
  */
-static int ext3_direct_IO(int rw, struct inode *inode, char *buf,
-			loff_t offset, size_t count)
+static int ext3_direct_IO(int rw, struct inode *inode,
+			const struct iovec *iov, loff_t offset,
+			unsigned long nr_segs)
 {
 	struct ext3_inode_info *ei = EXT3_I(inode);
 	handle_t *handle = NULL;
 	int ret;
 	int orphan = 0;
+	size_t count = iov_length(iov, nr_segs);
 
 	if (rw == WRITE) {
 		loff_t final_size = offset + count;
@@ -1428,8 +1430,8 @@ static int ext3_direct_IO(int rw, struct
 		}
 	}
 
-	ret = generic_direct_IO(rw, inode, buf, offset,
-				count, ext3_direct_io_get_blocks);
+	ret = generic_direct_IO(rw, inode, iov, offset,
+				nr_segs, ext3_direct_io_get_blocks);
 
 out_stop:
 	if (handle) {
--- 2.5.34/include/linux/uio.h~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/include/linux/uio.h	Tue Sep 10 18:57:53 2002
@@ -34,4 +34,19 @@ struct iovec
                                 /* Beg pardon: BSD has 1024 --ANK */
 #endif
 
+/*
+ * Total number of bytes covered by an iovec
+ */
+static inline size_t iov_length(const struct iovec *iov, unsigned long nr_segs)
+{
+	unsigned long seg;
+	size_t ret = 0;
+
+	for (seg = 0; seg < nr_segs; seg++)
+		ret += iov[seg].iov_len;
+	return ret;
+}
+
+unsigned long iov_shorten(struct iovec *iov, unsigned long nr_segs, size_t to);
+
 #endif
--- 2.5.34/fs/ext3/file.c~readv-writev	Tue Sep 10 18:57:53 2002
+++ 2.5.34-akpm/fs/ext3/file.c	Tue Sep 10 18:57:53 2002
@@ -76,19 +76,21 @@ ext3_file_write(struct file *file, const
 }
 
 struct file_operations ext3_file_operations = {
-	.llseek		= generic_file_llseek,	/* BKL held */
-	.read		= generic_file_read,	/* BKL not held.  Don't need */
-	.write		= ext3_file_write,	/* BKL not held.  Don't need */
-	.ioctl		= ext3_ioctl,		/* BKL held */
+	.llseek		= generic_file_llseek,
+	.read		= generic_file_read,
+	.write		= ext3_file_write,
+	.readv		= generic_file_readv,
+	.writev		= generic_file_writev,
+	.ioctl		= ext3_ioctl,
 	.mmap		= generic_file_mmap,
-	.open		= ext3_open_file,		/* BKL not held.  Don't need */
-	.release	= ext3_release_file,	/* BKL not held.  Don't need */
-	.fsync		= ext3_sync_file,		/* BKL held */
-	.sendfile	= generic_file_sendfile,	/* BKL not held.  Don't need */
+	.open		= ext3_open_file,
+	.release	= ext3_release_file,
+	.fsync		= ext3_sync_file,
+	.sendfile	= generic_file_sendfile,
 };
 
 struct inode_operations ext3_file_inode_operations = {
-	.truncate	= ext3_truncate,		/* BKL held */
-	.setattr	= ext3_setattr,		/* BKL held */
+	.truncate	= ext3_truncate,
+	.setattr	= ext3_setattr,
 };
 

.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-11  8:30 [patch] readv/writev rework Andrew Morton
@ 2002-09-12 13:00 ` Hirokazu Takahashi
  2002-09-12 18:47   ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-12 13:00 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello,

Your readv/writev patch interested me and I checked it.
I found we also have a chance to improve normal writev.

a_ops->prepare_write() and a_ops->commit_write will have a
penalty when I/O size isn't PAGE_SIZE.
With following patch generic_file_write_nolock() will try to
make each I/O size become PAGE_SIZE.

Could you apply it?

> This is Janet Morgan's patch which converts the readv/writev code
> to submit all segments for IO before waiting on them, rather than
> submitting each segment separately.
> 
> This is a critical performance fix for O_DIRECT reads and writes.
> Prior to this change, O_DIRECT vectored IO was forced to wait for
> completion against each segment of the iovec rather than submitting all
> segments and waiting on the lot.  ie: for ten segments, this code will
> be ten times faster.

Thank you,
Hirokazu Takahashi.



--- linux-2.5.34/mm/filemap.c.ORG	Wed Sep 11 19:48:00 2030
+++ linux-2.5.34/mm/filemap.c	Thu Sep 12 21:43:02 2030
@@ -2109,14 +2109,18 @@ generic_file_write_nolock(struct file *f
 		unsigned long index;
 		unsigned long offset;
 		long page_fault;
+		unsigned rest;
+		unsigned copy;
+		unsigned off;
+		struct iovec	*work_iov;
+		char		*work_buf;
+		unsigned	 work_iov_bytes;
 
 		offset = (pos & (PAGE_CACHE_SIZE -1)); /* Within page */
 		index = pos >> PAGE_CACHE_SHIFT;
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
-		if (bytes + written > iov_bytes)
-			bytes = iov_bytes - written;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2144,7 +2148,29 @@ generic_file_write_nolock(struct file *f
 				vmtruncate(inode, inode->i_size);
 			break;
 		}
-		page_fault = filemap_copy_from_user(page, offset, buf, bytes);
+
+		rest = bytes;
+		off = 0;
+		work_iov = cur_iov;
+		work_buf = buf;
+		work_iov_bytes = iov_bytes;
+		while (rest) {
+			copy = rest;
+			if (written + off == work_iov_bytes) {
+				work_iov++;
+				work_iov_bytes += work_iov->iov_len;
+				work_buf = work_iov->iov_base;
+			}
+			if (copy + written + off > work_iov_bytes)
+				copy = work_iov_bytes - written - off;
+
+			page_fault = filemap_copy_from_user(page, offset+off, work_buf, copy);
+			rest -= copy;
+			off += copy;
+			work_buf += copy;
+			if (unlikely(page_fault))
+				break;
+		}
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(page_fault)) {
 			status = -EFAULT;
@@ -2153,14 +2179,21 @@ generic_file_write_nolock(struct file *f
 				status = bytes;
 
 			if (status >= 0) {
-				written += status;
 				count -= status;
 				pos += status;
-				buf += status;
-				if (written == iov_bytes && count) {
-					cur_iov++;
-					iov_bytes += cur_iov->iov_len;
-					buf = cur_iov->iov_base;
+				rest = status;
+				while (rest) {
+					copy = rest;
+					if (written == iov_bytes) {
+						cur_iov++;
+						iov_bytes += cur_iov->iov_len;
+						buf = cur_iov->iov_base;
+					}
+					if (copy + written > iov_bytes)
+						copy = iov_bytes - written;
+					rest -= copy;
+					written += copy;
+					buf += copy;
 				}
 			}
 		}

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-12 13:00 ` Hirokazu Takahashi
@ 2002-09-12 18:47   ` Andrew Morton
  2002-09-13  1:18     ` Hirokazu Takahashi
  2002-09-13  7:22     ` Hirokazu Takahashi
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2002-09-12 18:47 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel, janetmor

Hirokazu Takahashi wrote:
> 
> Hello,
> 
> Your readv/writev patch interested me and I checked it.
> I found we also have a chance to improve normal writev.
> 
> a_ops->prepare_write() and a_ops->commit_write will have a
> penalty when I/O size isn't PAGE_SIZE.
> With following patch generic_file_write_nolock() will try to
> make each I/O size become PAGE_SIZE.
> 

Certainly makes a lot of sense.  If an application has a large
number of small objects which are to be appended to a file, and
they are not contiguous in user memory then this patch makes
writev() a very attractive way of doing that.  Tons faster.

However I'd be a little concerned over the increased work which a boring
old write() has to do.  Perhaps we could add a special code path
for it:

	struct iovstate {
		unsigned rest;
		unsigned copy;
		unsigned off;
		struct iovec    *work_iov;
		char            *work_buf;
		unsigned         work_iov_bytes;
	} iovstate;

	if (nr_segs != 1) {
		initialise iovstate;
	}

	...

	if (nr_segs == 1)
		page_fault = filemap_copy_from_user(page, offset, buf, bytes);
	else
		page_fault = iov_copy_from_user(&arg1, &arg2, &bytes, &iovstate);
	status = a_ops->commit_write(file, page, offset, offset+bytes);


then we can do all that fancy stuff in iov_copy_from_user().  That
function should be inlined so we don't really pass all those arguments
around.  That's just a cleanliness thing, and branching around the complex
code in the simple case.  We should be able to get the impact on the
common case down to a single taken branch per page.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-12 18:47   ` Andrew Morton
@ 2002-09-13  1:18     ` Hirokazu Takahashi
  2002-09-13  3:31       ` Andrew Morton
  2002-09-13  7:22     ` Hirokazu Takahashi
  1 sibling, 1 reply; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-13  1:18 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello,

> > Your readv/writev patch interested me and I checked it.
> > I found we also have a chance to improve normal writev.
> > 
> > a_ops->prepare_write() and a_ops->commit_write will have a
> > penalty when I/O size isn't PAGE_SIZE.
> > With following patch generic_file_write_nolock() will try to
> > make each I/O size become PAGE_SIZE.
> > 
> 
> Certainly makes a lot of sense.  If an application has a large
> number of small objects which are to be appended to a file, and
> they are not contiguous in user memory then this patch makes
> writev() a very attractive way of doing that.  Tons faster.

Yeah, I realized syslogd is using writev against logfiles which are
opened with O_SYNC flag! I think heavy loaded mail-servers or
web-servers may get good performance with the new writev
as they are logging too much.

And I'd also like to make knfsd use it as a recevied nfs request may
be splited into some IP fragments.

> However I'd be a little concerned over the increased work which a boring
> old write() has to do.  Perhaps we could add a special code path
> for it:

It sounds nice.
I'll rewrite it soon.

> 	struct iovstate {
> 		unsigned rest;
> 		unsigned copy;
> 		unsigned off;
> 		struct iovec    *work_iov;
> 		char            *work_buf;
> 		unsigned         work_iov_bytes;
> 	} iovstate;
> 
> 	if (nr_segs != 1) {
> 		initialise iovstate;
> 	}
> 
> 	...
> 
> 	if (nr_segs == 1)
> 		page_fault = filemap_copy_from_user(page, offset, buf, bytes);
> 	else
> 		page_fault = iov_copy_from_user(&arg1, &arg2, &bytes, &iovstate);
> 	status = a_ops->commit_write(file, page, offset, offset+bytes);
> 
> 
> then we can do all that fancy stuff in iov_copy_from_user().  That
> function should be inlined so we don't really pass all those arguments
> around.  That's just a cleanliness thing, and branching around the complex
> code in the simple case.  We should be able to get the impact on the
> common case down to a single taken branch per page.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  1:18     ` Hirokazu Takahashi
@ 2002-09-13  3:31       ` Andrew Morton
  2002-09-14  4:54         ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-13  3:31 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel, janetmor

Hirokazu Takahashi wrote:
> 
> Hello,
> 
> > > Your readv/writev patch interested me and I checked it.
> > > I found we also have a chance to improve normal writev.
> > >
> > > a_ops->prepare_write() and a_ops->commit_write will have a
> > > penalty when I/O size isn't PAGE_SIZE.
> > > With following patch generic_file_write_nolock() will try to
> > > make each I/O size become PAGE_SIZE.
> > >
> >
> > Certainly makes a lot of sense.  If an application has a large
> > number of small objects which are to be appended to a file, and
> > they are not contiguous in user memory then this patch makes
> > writev() a very attractive way of doing that.  Tons faster.

I wrote a little app which simulates a text editor writing out
its buffer.  Just:

struct line {
	char *data;
	int length;
	struct line *next;
};

walk this linked list, writing the lines out.  The input was
`cat linux/kernel/*.c > inputfile' and the output was written
1000 times (300 megs).  Benched four different ways of writing the
output:

                    2.5.34         2.5.34-mm2         2.5.34-mm2-taka

write                 54s             54s                   55s
fwrite                12.8s          12.8s                 12.7s
fwrite_unlocked       11.6s          11.6s                 11.5s
writev                39s            33.4s                 15.8s

So Janet's patch made a 15% improvement with this test.  Yours
dropped it 50% again.

> Yeah, I realized syslogd is using writev against logfiles which are
> opened with O_SYNC flag! I think heavy loaded mail-servers or
> web-servers may get good performance with the new writev
> as they are logging too much.

O_SYNC writev?  Ooh, oww, that hurts...

With 2.5.34, writing the 300k file once (1000x less data than above)
with 1024-vector writev's, opened O_SYNC:  68 seconds.

With 2.5.34-mm2-taka the same write takes 0.23 seconds.  (I had to write
100x as much data just to get a measurement).

A 300x speedup is nice, but based on these numbers syslogd should be
using fwrite_unlocked() and fflush().

O_SYNC should be eradicated.  It's basically always the wrong thing
to do.  Applications should write as much stuff as they can and then
run fsync.

> 
> It sounds nice.
> I'll rewrite it soon.
>

Great.  The test app is at http://www.zip.com.au/~akpm/writev-speed.c

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-12 18:47   ` Andrew Morton
  2002-09-13  1:18     ` Hirokazu Takahashi
@ 2002-09-13  7:22     ` Hirokazu Takahashi
  2002-09-13  8:29       ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-13  7:22 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello,

I updated the writev patch which may be easy to understand.
How about it?

But I have one question, Could let me know if you have any idea,
why does filemap_copy_from_user() try to call kamp()+__copy_from_user()
again after the first trial get fault.

Is there any meanings?

> > a_ops->prepare_write() and a_ops->commit_write will have a
> > penalty when I/O size isn't PAGE_SIZE.
> > With following patch generic_file_write_nolock() will try to
> > make each I/O size become PAGE_SIZE.
> > 
> 
> Certainly makes a lot of sense.  If an application has a large
> number of small objects which are to be appended to a file, and
> they are not contiguous in user memory then this patch makes
> writev() a very attractive way of doing that.  Tons faster.
> 
> However I'd be a little concerned over the increased work which a boring
> old write() has to do.  Perhaps we could add a special code path
> for it:

Thank you, 
Hirokazu Takahashi.


--- linux/mm/filemap.c.ORG	Wed Sep 11 19:48:00 2030
+++ linux/mm/filemap.c	Fri Sep 13 16:08:51 2030
@@ -1940,6 +1940,48 @@ filemap_copy_from_user(struct page *page
 	return left;
 }
 
+static inline int
+filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+			const struct iovec *iov, size_t base, unsigned bytes)
+{
+	char *kaddr;
+	int left = 0;
+
+	kaddr = kmap_atomic(page, KM_USER0);
+	while (bytes) {
+		char *buf = iov->iov_base + base;
+		int copy = min(bytes, iov->iov_len - base);
+		base = 0;
+		if ((left = __copy_from_user(kaddr + offset, buf, copy)))
+			break;
+		bytes -= copy;
+		offset += copy;
+		iov++;
+	}
+	kunmap_atomic(kaddr, KM_USER0);
+	return left;
+}
+
+static inline void
+filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, unsigned bytes)
+{
+	const struct iovec *iov = *iovp;
+	size_t base = *basep;
+
+	while (bytes) {
+		int copy = min(bytes, iov->iov_len - base);
+		bytes -= copy;
+		base += copy;
+		if (iov->iov_len == base) {
+			iov++;
+			base = 0;
+		}
+	}
+	*iovp = iov;
+	*basep = base;
+}
+
+
 /*
  * Write to a file through the page cache. 
  *
@@ -1968,9 +2010,8 @@ generic_file_write_nolock(struct file *f
 	unsigned	bytes;
 	time_t		time_now;
 	struct pagevec	lru_pvec;
-	struct iovec	*cur_iov;
-	unsigned	iov_bytes;	/* Cumulative count to the end of the
-					   current iovec */
+	const struct iovec *cur_iov = iov; /* current iovec */
+	unsigned	iov_base = 0;	   /* offset in the current iovec */
 	unsigned long	seg;
 	char		*buf;
 
@@ -2102,9 +2143,7 @@ generic_file_write_nolock(struct file *f
 		goto out_status;
 	}
 
-	cur_iov = (struct iovec *)iov;
-	iov_bytes = cur_iov->iov_len;
-	buf = cur_iov->iov_base;
+	buf = iov->iov_base;
 	do {
 		unsigned long index;
 		unsigned long offset;
@@ -2115,8 +2154,6 @@ generic_file_write_nolock(struct file *f
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
-		if (bytes + written > iov_bytes)
-			bytes = iov_bytes - written;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2144,7 +2181,12 @@ generic_file_write_nolock(struct file *f
 				vmtruncate(inode, inode->i_size);
 			break;
 		}
-		page_fault = filemap_copy_from_user(page, offset, buf, bytes);
+
+		if (nr_segs == 1) {
+			page_fault = filemap_copy_from_user(page, offset, buf, bytes);
+		} else {
+			page_fault = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes);
+		}
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(page_fault)) {
 			status = -EFAULT;
@@ -2157,11 +2199,8 @@ generic_file_write_nolock(struct file *f
 				count -= status;
 				pos += status;
 				buf += status;
-				if (written == iov_bytes && count) {
-					cur_iov++;
-					iov_bytes += cur_iov->iov_len;
-					buf = cur_iov->iov_base;
-				}
+				if (nr_segs > 1)
+					filemap_set_next_iovec(&cur_iov, &iov_base, status);
 			}
 		}
 		if (!PageReferenced(page))

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  8:29       ` Andrew Morton
@ 2002-09-13  8:26         ` Hirokazu Takahashi
  2002-09-13  9:23         ` Hirokazu Takahashi
  1 sibling, 0 replies; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-13  8:26 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello,

I appreciate for your kind answer.
I understand clearly.
Ok, this means new filemap_copy_from_user_iovec() also have to
use regular kmap when page fault has happened.
I'll fix it soon.

> > I updated the writev patch which may be easy to understand.
> > How about it?
> 
> Looks nice.   And yes, you hung onto the atomic kmap across multiple
> iov segments ;)  That will save a tlb invalidate per segment.

Yes.

> > But I have one question, Could let me know if you have any idea,
> > why does filemap_copy_from_user() try to call kamp()+__copy_from_user()
> > again after the first trial get fault.
> > 
> > Is there any meanings?
> 
> We're not allowed to schedule away inside atomic_kmap - must remain
> in the same task, on the same CPU etc.  So the pagefault handler
> will return immediately if we take a pagefault while copying to/from
> userspace while holding an atomic kmap.



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  7:22     ` Hirokazu Takahashi
@ 2002-09-13  8:29       ` Andrew Morton
  2002-09-13  8:26         ` Hirokazu Takahashi
  2002-09-13  9:23         ` Hirokazu Takahashi
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2002-09-13  8:29 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel, janetmor

Hirokazu Takahashi wrote:
> 
> Hello,
> 
> I updated the writev patch which may be easy to understand.
> How about it?

Looks nice.   And yes, you hung onto the atomic kmap across multiple
iov segments ;)  That will save a tlb invalidate per segment.
 
> But I have one question, Could let me know if you have any idea,
> why does filemap_copy_from_user() try to call kamp()+__copy_from_user()
> again after the first trial get fault.
> 
> Is there any meanings?

We're not allowed to schedule away inside atomic_kmap - must remain
in the same task, on the same CPU etc.  So the pagefault handler
will return immediately if we take a pagefault while copying to/from
userspace while holding an atomic kmap.

So the code first touches the userspace page (via __get_user) to
fault it in.  Now, there is a 99.999999% chance that the copy_*_user()
will not fault - it will remain wholly atomic.

But there is the 0.0000001% chance that the VM will evict (or at least
unmap) the page between the __get_user() and the completion of the 
copy_*_user().  In this case, copy_*_user() will fail and will return
a short copy.

Now, we could just touch the page with another __get_user() and retry
the atomic kmap approach.  But I flipped a coin and decided to fall back
to a regular sleeping kmap instead.  With a sleeping kmap, in a
non-atomic region the kernel will actually take the fault, fix it up
and the copy_*_user() will work OK.

> ...
> --- linux/mm/filemap.c.ORG      Wed Sep 11 19:48:00 2030
> +++ linux/mm/filemap.c  Fri Sep 13 16:08:51 2030

I shall retest...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  8:29       ` Andrew Morton
  2002-09-13  8:26         ` Hirokazu Takahashi
@ 2002-09-13  9:23         ` Hirokazu Takahashi
  2002-09-13 17:43           ` Andrew Morton
  1 sibling, 1 reply; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-13  9:23 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello,

I fixed the patch.

But I have one more question.

Please consider...
while a process sleep in copy_*_user() another one may call kmap_atomic
and kunmap_atomic. And the process will restart and might access the
wrong page as kunmap_atomic do nothing without CONFIG_DEBUG_HIGHMEM flag.
I mean it wouldn't any faults as another page is still kmapped.

In my guess it would cause a woeful result.
Is it my misunderstanding ?

> We're not allowed to schedule away inside atomic_kmap - must remain
> in the same task, on the same CPU etc.  So the pagefault handler
> will return immediately if we take a pagefault while copying to/from
> userspace while holding an atomic kmap.
> 
> So the code first touches the userspace page (via __get_user) to
> fault it in.  Now, there is a 99.999999% chance that the copy_*_user()
> will not fault - it will remain wholly atomic.
> 
> But there is the 0.0000001% chance that the VM will evict (or at least
> unmap) the page between the __get_user() and the completion of the 
> copy_*_user().  In this case, copy_*_user() will fail and will return
> a short copy.
> 
> Now, we could just touch the page with another __get_user() and retry
> the atomic kmap approach.  But I flipped a coin and decided to fall back
> to a regular sleeping kmap instead.  With a sleeping kmap, in a
> non-atomic region the kernel will actually take the fault, fix it up
> and the copy_*_user() will work OK.

Thank you,
Hirokazu Takahashi.



--- linux/mm/filemap.c.ORG	Wed Sep 11 19:48:00 2030
+++ linux/mm/filemap.c	Fri Sep 13 17:57:34 2030
@@ -1940,6 +1940,63 @@ filemap_copy_from_user(struct page *page
 	return left;
 }
 
+static inline int
+__filemap_copy_from_user_iovec(char *vaddr, 
+			const struct iovec *iov, size_t base, unsigned bytes)
+{
+	int left = 0;
+
+	while (bytes) {
+		char *buf = iov->iov_base + base;
+		int copy = min(bytes, iov->iov_len - base);
+		base = 0;
+		if ((left = __copy_from_user(vaddr, buf, copy)))
+			break;
+		bytes -= copy;
+		vaddr += copy;
+		iov++;
+	}
+	return left;
+}
+
+static inline int
+filemap_copy_from_user_iovec(struct page *page, unsigned long offset,
+			const struct iovec *iov, size_t base, unsigned bytes)
+{
+	char *kaddr;
+	int left;
+
+	kaddr = kmap_atomic(page, KM_USER0);
+	left = __filemap_copy_from_user_iovec(kaddr + offset, iov, base, bytes);
+	kunmap_atomic(kaddr, KM_USER0);
+	if (left != 0) {
+		kaddr = kmap(page);
+		left = __filemap_copy_from_user_iovec(kaddr + offset, iov, base, bytes);
+		kunmap(page);
+	}
+	return left;
+}
+
+static inline void
+filemap_set_next_iovec(const struct iovec **iovp, size_t *basep, unsigned bytes)
+{
+	const struct iovec *iov = *iovp;
+	size_t base = *basep;
+
+	while (bytes) {
+		int copy = min(bytes, iov->iov_len - base);
+		bytes -= copy;
+		base += copy;
+		if (iov->iov_len == base) {
+			iov++;
+			base = 0;
+		}
+	}
+	*iovp = iov;
+	*basep = base;
+}
+
+
 /*
  * Write to a file through the page cache. 
  *
@@ -1968,9 +2025,8 @@ generic_file_write_nolock(struct file *f
 	unsigned	bytes;
 	time_t		time_now;
 	struct pagevec	lru_pvec;
-	struct iovec	*cur_iov;
-	unsigned	iov_bytes;	/* Cumulative count to the end of the
-					   current iovec */
+	const struct iovec *cur_iov = iov; /* current iovec */
+	unsigned	iov_base = 0;	   /* offset in the current iovec */
 	unsigned long	seg;
 	char		*buf;
 
@@ -2102,9 +2158,7 @@ generic_file_write_nolock(struct file *f
 		goto out_status;
 	}
 
-	cur_iov = (struct iovec *)iov;
-	iov_bytes = cur_iov->iov_len;
-	buf = cur_iov->iov_base;
+	buf = iov->iov_base;
 	do {
 		unsigned long index;
 		unsigned long offset;
@@ -2115,8 +2169,6 @@ generic_file_write_nolock(struct file *f
 		bytes = PAGE_CACHE_SIZE - offset;
 		if (bytes > count)
 			bytes = count;
-		if (bytes + written > iov_bytes)
-			bytes = iov_bytes - written;
 
 		/*
 		 * Bring in the user page that we will copy from _first_.
@@ -2144,7 +2196,12 @@ generic_file_write_nolock(struct file *f
 				vmtruncate(inode, inode->i_size);
 			break;
 		}
-		page_fault = filemap_copy_from_user(page, offset, buf, bytes);
+
+		if (nr_segs == 1) {
+			page_fault = filemap_copy_from_user(page, offset, buf, bytes);
+		} else {
+			page_fault = filemap_copy_from_user_iovec(page, offset, cur_iov, iov_base, bytes);
+		}
 		status = a_ops->commit_write(file, page, offset, offset+bytes);
 		if (unlikely(page_fault)) {
 			status = -EFAULT;
@@ -2157,11 +2214,8 @@ generic_file_write_nolock(struct file *f
 				count -= status;
 				pos += status;
 				buf += status;
-				if (written == iov_bytes && count) {
-					cur_iov++;
-					iov_bytes += cur_iov->iov_len;
-					buf = cur_iov->iov_base;
-				}
+				if (nr_segs > 1)
+					filemap_set_next_iovec(&cur_iov, &iov_base, status);
 			}
 		}
 		if (!PageReferenced(page))

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  9:23         ` Hirokazu Takahashi
@ 2002-09-13 17:43           ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2002-09-13 17:43 UTC (permalink / raw)
  To: Hirokazu Takahashi; +Cc: linux-kernel, janetmor

Hirokazu Takahashi wrote:
> 
> Hello,
> 
> I fixed the patch.

Thanks again.

> But I have one more question.
> 
> Please consider...
> while a process sleep in copy_*_user() another one may call kmap_atomic
> and kunmap_atomic. And the process will restart and might access the
> wrong page as kunmap_atomic do nothing without CONFIG_DEBUG_HIGHMEM flag.
> I mean it wouldn't any faults as another page is still kmapped.

Yes; this is why it is illegal to sleep, or to switch CPUs by any means
while holding an atomic kmap:


	kmap_atomic(...);
	__copy_*_user(...);
	kunmap_atomic(...);

the kmap_atomic() will increment the preempt count (even on CONFIG_PREEMPT=n).

- The incremented preempt count pins this code path onto this CPU
  while the kmap is held. (This is only relevant to CONFIG_PREEMPT=y)

- The incremented preempt count tells do_page_fault() that we cannot
  handle a pagefault; if a fault is encountered during the copy_*_user(),
  do_page_fault() will arrange for the __copy_*_user() to return a short
  copy.

So.  The code path is atomic, and is pinned to a single CPU.  The atomic
kmap pool uses a different batch of virtual addresses for each CPU (it's
a per-CPU pool of addresses).

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-13  3:31       ` Andrew Morton
@ 2002-09-14  4:54         ` Andrew Morton
  2002-09-14  7:39           ` Hirokazu Takahashi
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2002-09-14  4:54 UTC (permalink / raw)
  To: Hirokazu Takahashi, linux-kernel, janetmor

Andrew Morton wrote:
> 
> ...
> walk this linked list, writing the lines out.  The input was
> `cat linux/kernel/*.c > inputfile' and the output was written
> 1000 times (300 megs).  Benched four different ways of writing the
> output:
> 
>                     2.5.34         2.5.34-mm2         2.5.34-mm2-taka
> 
> write                 54s             54s                   55s
> fwrite                12.8s          12.8s                 12.7s
> fwrite_unlocked       11.6s          11.6s                 11.5s
> writev                39s            33.4s                 15.8s
> 
> So Janet's patch made a 15% improvement with this test.  Yours
> dropped it 50% again.
> 

I've retested with your latest patch.

                   2.5.34-mm4-taka2
write                  55.543
fwrite                 12.625
fwrite_unlocked        11.389
writev                  9.219

So that's another 70% speedup on top of yesterday's 100%, and kernel
beats glibc ;)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [patch] readv/writev rework
  2002-09-14  4:54         ` Andrew Morton
@ 2002-09-14  7:39           ` Hirokazu Takahashi
  0 siblings, 0 replies; 12+ messages in thread
From: Hirokazu Takahashi @ 2002-09-14  7:39 UTC (permalink / raw)
  To: akpm; +Cc: linux-kernel, janetmor

Hello

Brilliant!
It's great than I expected.

> > So Janet's patch made a 15% improvement with this test.  Yours
> > dropped it 50% again.
> > 
> 
> I've retested with your latest patch.
> 
>                    2.5.34-mm4-taka2
> write                  55.543
> fwrite                 12.625
> fwrite_unlocked        11.389
> writev                  9.219
> 
> So that's another 70% speedup on top of yesterday's 100%, and kernel
> beats glibc ;)

Thanks for your testing my patch and your detailed explanation.

Hirokazu Takahashi.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2002-09-14  7:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-11  8:30 [patch] readv/writev rework Andrew Morton
2002-09-12 13:00 ` Hirokazu Takahashi
2002-09-12 18:47   ` Andrew Morton
2002-09-13  1:18     ` Hirokazu Takahashi
2002-09-13  3:31       ` Andrew Morton
2002-09-14  4:54         ` Andrew Morton
2002-09-14  7:39           ` Hirokazu Takahashi
2002-09-13  7:22     ` Hirokazu Takahashi
2002-09-13  8:29       ` Andrew Morton
2002-09-13  8:26         ` Hirokazu Takahashi
2002-09-13  9:23         ` Hirokazu Takahashi
2002-09-13 17:43           ` Andrew Morton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).