From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 448385F0001 for ; Sat, 11 Apr 2009 08:01:42 -0400 (EDT) Subject: [RFC][PATCH 0/9] File descriptor hot-unplug support From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:01:29 -0700 Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: A couple of weeks ago I found myself looking at the uio, seeing that it does not support pci hot-unplug, and thinking "Great yet another implementation of hotunplug logic that needs to be added". I decided to see what it would take to add a generic implementation of the code we have for supporting hot unplugging devices in sysfs, proc, sysctl, tty_io, and now almost in the tun driver. Not long after I touched the tun driver and made it safe to delete the network device while still holding it's file descriptor open I someone else touch the code adding a different feature and my careful work went up in flames. Which brought home another point at the best of it this is ultimately complex tricky code that subsystems should not need to worry about. What makes this even more interesting is that in the presence of pci hot-unplug it looks like most subsystems and most devices will have to deal with the issue one way or another. This infrastructure could also be used to implement sys_revoke and when I could not think of a better name I have drawn on that. The following changes draw on and generalize the work in tty_io sysfs, proc, and sysctl and move it into the vfs level. Where the basic primitives are running faster, and the solution more general. The work is not complete. I have only fully converted proc. And there are more places in the vfs that need to be touched. But it is close enough the code works in practice and all of the core challenges should have been solved, and the design should be clear. Documentation/filesystems/vfs.txt | 4 + drivers/char/pty.c | 2 +- drivers/char/tty_io.c | 2 +- fs/Makefile | 2 +- fs/compat.c | 31 +++- fs/fcntl.c | 32 +++-- fs/file_table.c | 189 ++++++++++++++++++--- fs/inode.c | 1 + fs/ioctl.c | 39 +++-- fs/locks.c | 81 +++++++-- fs/open.c | 32 +++- fs/proc/generic.c | 100 ++++-------- fs/proc/inode.c | 339 +------------------------------------ fs/proc/internal.h | 2 + fs/proc/root.c | 2 +- fs/read_write.c | 143 ++++++++++++---- fs/readdir.c | 14 ++- fs/revoked_file.c | 181 ++++++++++++++++++++ fs/select.c | 17 ++- fs/super.c | 49 +++--- fs/sysfs/bin.c | 193 +--------------------- include/linux/fs.h | 31 +++- include/linux/mm.h | 3 + include/linux/proc_fs.h | 4 - mm/memory.c | 96 +++++++++++ 25 files changed, 841 insertions(+), 748 deletions(-) Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 05DAB5F0001 for ; Sat, 11 Apr 2009 08:03:24 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:03:23 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 1/9] mm: Introduce remap_file_mappings. Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 51EA05F0001 for ; Sat, 11 Apr 2009 08:05:23 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:05:23 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 2/9] mm: Implement generic support for revoking a mapping. Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Signed-off-by: Eric W. Biederman --- include/linux/mm.h | 2 ++ mm/memory.c | 9 +++++++++ 2 files changed, 11 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 96d8342..3fcbb8e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -807,6 +807,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, extern int vmtruncate(struct inode * inode, loff_t offset); extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); + +extern struct vm_operations_struct revoked_vm_ops; extern void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops); #ifdef CONFIG_MMU diff --git a/mm/memory.c b/mm/memory.c index dcd0a3c..f68c84e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2378,6 +2378,15 @@ out: spin_lock(&mapping->i_mmap_lock); } +static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return VM_FAULT_SIGBUS; +} + +struct vm_operations_struct revoked_vm_ops = { + .fault = revoked_fault, +}; + void remap_file_mappings(struct file *file, struct vm_operations_struct *vm_ops) { /* After file->f_ops has been changed update the vmas */ -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id C9FD15F0001 for ; Sat, 11 Apr 2009 08:06:10 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:06:11 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 3/9] sysfs: Use remap_file_mappings. Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Instead of wrapping all of the sysfs binary file vm operations when the backing kobject goes away, we can more easily change vm_ops on the vma when the backing kobject goes away. Leading to simpler and more easily maintained code. Signed-off-by: Eric W. Biederman --- fs/sysfs/bin.c | 193 +------------------------------------------------------- 1 files changed, 2 insertions(+), 191 deletions(-) diff --git a/fs/sysfs/bin.c b/fs/sysfs/bin.c index 93e0c02..898163c 100644 --- a/fs/sysfs/bin.c +++ b/fs/sysfs/bin.c @@ -39,8 +39,6 @@ static DEFINE_MUTEX(sysfs_bin_lock); struct bin_buffer { struct mutex mutex; void *buffer; - int mmapped; - struct vm_operations_struct *vm_ops; struct file *file; struct hlist_node list; }; @@ -181,175 +179,6 @@ out_free: return count; } -static void bin_vma_open(struct vm_area_struct *vma) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - - if (!bb->vm_ops || !bb->vm_ops->open) - return; - - if (!sysfs_get_active_two(attr_sd)) - return; - - bb->vm_ops->open(vma); - - sysfs_put_active_two(attr_sd); -} - -static void bin_vma_close(struct vm_area_struct *vma) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - - if (!bb->vm_ops || !bb->vm_ops->close) - return; - - if (!sysfs_get_active_two(attr_sd)) - return; - - bb->vm_ops->close(vma); - - sysfs_put_active_two(attr_sd); -} - -static int bin_fault(struct vm_area_struct *vma, struct vm_fault *vmf) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - int ret; - - if (!bb->vm_ops || !bb->vm_ops->fault) - return VM_FAULT_SIGBUS; - - if (!sysfs_get_active_two(attr_sd)) - return VM_FAULT_SIGBUS; - - ret = bb->vm_ops->fault(vma, vmf); - - sysfs_put_active_two(attr_sd); - return ret; -} - -static int bin_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - int ret; - - if (!bb->vm_ops) - return VM_FAULT_SIGBUS; - - if (!bb->vm_ops->page_mkwrite) - return 0; - - if (!sysfs_get_active_two(attr_sd)) - return VM_FAULT_SIGBUS; - - ret = bb->vm_ops->page_mkwrite(vma, vmf); - - sysfs_put_active_two(attr_sd); - return ret; -} - -static int bin_access(struct vm_area_struct *vma, unsigned long addr, - void *buf, int len, int write) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - int ret; - - if (!bb->vm_ops || !bb->vm_ops->access) - return -EINVAL; - - if (!sysfs_get_active_two(attr_sd)) - return -EINVAL; - - ret = bb->vm_ops->access(vma, addr, buf, len, write); - - sysfs_put_active_two(attr_sd); - return ret; -} - -#ifdef CONFIG_NUMA -static int bin_set_policy(struct vm_area_struct *vma, struct mempolicy *new) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - int ret; - - if (!bb->vm_ops || !bb->vm_ops->set_policy) - return 0; - - if (!sysfs_get_active_two(attr_sd)) - return -EINVAL; - - ret = bb->vm_ops->set_policy(vma, new); - - sysfs_put_active_two(attr_sd); - return ret; -} - -static struct mempolicy *bin_get_policy(struct vm_area_struct *vma, - unsigned long addr) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - struct mempolicy *pol; - - if (!bb->vm_ops || !bb->vm_ops->get_policy) - return vma->vm_policy; - - if (!sysfs_get_active_two(attr_sd)) - return vma->vm_policy; - - pol = bb->vm_ops->get_policy(vma, addr); - - sysfs_put_active_two(attr_sd); - return pol; -} - -static int bin_migrate(struct vm_area_struct *vma, const nodemask_t *from, - const nodemask_t *to, unsigned long flags) -{ - struct file *file = vma->vm_file; - struct bin_buffer *bb = file->private_data; - struct sysfs_dirent *attr_sd = file->f_path.dentry->d_fsdata; - int ret; - - if (!bb->vm_ops || !bb->vm_ops->migrate) - return 0; - - if (!sysfs_get_active_two(attr_sd)) - return 0; - - ret = bb->vm_ops->migrate(vma, from, to, flags); - - sysfs_put_active_two(attr_sd); - return ret; -} -#endif - -static struct vm_operations_struct bin_vm_ops = { - .open = bin_vma_open, - .close = bin_vma_close, - .fault = bin_fault, - .page_mkwrite = bin_page_mkwrite, - .access = bin_access, -#ifdef CONFIG_NUMA - .set_policy = bin_set_policy, - .get_policy = bin_get_policy, - .migrate = bin_migrate, -#endif -}; - static int mmap(struct file *file, struct vm_area_struct *vma) { struct bin_buffer *bb = file->private_data; @@ -370,25 +199,7 @@ static int mmap(struct file *file, struct vm_area_struct *vma) goto out_put; rc = attr->mmap(kobj, attr, vma); - if (rc) - goto out_put; - - /* - * PowerPC's pci_mmap of legacy_mem uses shmem_zero_setup() - * to satisfy versions of X which crash if the mmap fails: that - * substitutes a new vm_file, and we don't then want bin_vm_ops. - */ - if (vma->vm_file != file) - goto out_put; - - rc = -EINVAL; - if (bb->mmapped && bb->vm_ops != vma->vm_ops) - goto out_put; - rc = 0; - bb->mmapped = 1; - bb->vm_ops = vma->vm_ops; - vma->vm_ops = &bin_vm_ops; out_put: sysfs_put_active_two(attr_sd); out_unlock: @@ -475,9 +286,9 @@ void unmap_bin_file(struct sysfs_dirent *attr_sd) mutex_lock(&sysfs_bin_lock); hlist_for_each_entry(bb, tmp, &attr_sd->s_bin_attr.buffers, list) { - struct inode *inode = bb->file->f_path.dentry->d_inode; + struct file *file = bb->file; - unmap_mapping_range(inode->i_mapping, 0, 0, 1); + remap_file_mappings(file, &revoked_vm_ops); } mutex_unlock(&sysfs_bin_lock); -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 5B3175F0001 for ; Sat, 11 Apr 2009 08:07:38 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:07:39 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 4/9] vfs: Generalize the file_list Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: file_list_lock is held when files are being revoked aka hung up on in tty_io, and also needs to be done in a more general revoke case. The more general revoke case also needs to sleep so I have converted file_list_lock into a mutex. To make it clear what is on the file list and when I have changed file_move to file_add. The callers have been modified to to ensure file_add is called exactly once on a file. In __dentry_open file_add is called on everything except device files. In __ptmx_open and __tty_open file_add is called to place the ttys on the tty_files list. To make using the file_list for efficient for handling revokes I have moved the list head from the superblock to the inode. This means only relevant files need to be looked at. fs_may_remount_ro and mark_files_ro have been modified to walk the inode list to find all of the inodes and then to walk the file list on those inodes. It is a slightly slower process but just as efficient and potentially more correct as inodes may have some influence on the the rw state of the filesystem that files do not. Signed-off-by: Eric W. Biederman --- drivers/char/pty.c | 2 +- drivers/char/tty_io.c | 2 +- fs/file_table.c | 27 ++++++++++++++++++--------- fs/inode.c | 1 + fs/open.c | 3 ++- fs/super.c | 49 +++++++++++++++++++++++++++---------------------- include/linux/fs.h | 10 +++++----- 7 files changed, 55 insertions(+), 39 deletions(-) diff --git a/drivers/char/pty.c b/drivers/char/pty.c index 31038a0..3ed304c 100644 --- a/drivers/char/pty.c +++ b/drivers/char/pty.c @@ -662,7 +662,7 @@ static int __ptmx_open(struct inode *inode, struct file *filp) set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */ filp->private_data = tty; - file_move(filp, &tty->tty_files); + file_add(filp, &tty->tty_files); retval = devpts_pty_new(inode, tty->link); if (retval) diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c index 66b99a2..22b978e 100644 --- a/drivers/char/tty_io.c +++ b/drivers/char/tty_io.c @@ -1836,7 +1836,7 @@ got_driver: return PTR_ERR(tty); filp->private_data = tty; - file_move(filp, &tty->tty_files); + file_add(filp, &tty->tty_files); check_tty_count(tty, "tty_open"); if (tty->driver->type == TTY_DRIVER_TYPE_PTY && tty->driver->subtype == PTY_TYPE_MASTER) diff --git a/fs/file_table.c b/fs/file_table.c index 54018fe..03d74b6 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -22,6 +22,7 @@ #include #include #include +#include #include @@ -31,7 +32,7 @@ struct files_stat_struct files_stat = { }; /* public. Not pretty! */ -__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock); +__cacheline_aligned_in_smp DEFINE_MUTEX(files_lock); /* SLAB cache for file structures */ static struct kmem_cache *filp_cachep __read_mostly; @@ -357,12 +358,12 @@ void put_filp(struct file *file) } } -void file_move(struct file *file, struct list_head *list) +void file_add(struct file *file, struct list_head *list) { if (!list) return; file_list_lock(); - list_move(&file->f_u.fu_list, list); + list_add(&file->f_u.fu_list, list); file_list_unlock(); } @@ -377,24 +378,32 @@ void file_kill(struct file *file) int fs_may_remount_ro(struct super_block *sb) { + struct inode *inode; struct file *file; /* Check that no files are currently opened for writing. */ file_list_lock(); - list_for_each_entry(file, &sb->s_files, f_u.fu_list) { - struct inode *inode = file->f_path.dentry->d_inode; - + spin_lock(&inode_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { /* File with pending delete? */ if (inode->i_nlink == 0) goto too_bad; - /* Writeable file? */ - if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE)) - goto too_bad; + /* Regular file */ + if (!S_ISREG(inode->i_mode)) + continue; + + list_for_each_entry(file, &inode->i_files, f_u.fu_list) { + /* Writeable file? */ + if (file->f_mode & FMODE_WRITE) + goto too_bad; + } } + spin_unlock(&inode_lock); file_list_unlock(); return 1; /* Tis' cool bro. */ too_bad: + spin_unlock(&inode_lock); file_list_unlock(); return 0; } diff --git a/fs/inode.c b/fs/inode.c index d06d6d2..9682caf 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -238,6 +238,7 @@ void inode_init_once(struct inode *inode) memset(inode, 0, sizeof(*inode)); INIT_HLIST_NODE(&inode->i_hash); INIT_LIST_HEAD(&inode->i_dentry); + INIT_LIST_HEAD(&inode->i_files); INIT_LIST_HEAD(&inode->i_devices); INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC); spin_lock_init(&inode->i_data.tree_lock); diff --git a/fs/open.c b/fs/open.c index 377eb25..5e201cb 100644 --- a/fs/open.c +++ b/fs/open.c @@ -828,7 +828,8 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, f->f_path.mnt = mnt; f->f_pos = 0; f->f_op = fops_get(inode->i_fop); - file_move(f, &inode->i_sb->s_files); + if (!special_file(inode->i_mode)) + file_add(f, &inode->i_files); error = security_dentry_open(f, cred); if (error) diff --git a/fs/super.c b/fs/super.c index 786fe7d..e55299c 100644 --- a/fs/super.c +++ b/fs/super.c @@ -67,7 +67,6 @@ static struct super_block *alloc_super(struct file_system_type *type) INIT_LIST_HEAD(&s->s_dirty); INIT_LIST_HEAD(&s->s_io); INIT_LIST_HEAD(&s->s_more_io); - INIT_LIST_HEAD(&s->s_files); INIT_LIST_HEAD(&s->s_instances); INIT_HLIST_HEAD(&s->s_anon); INIT_LIST_HEAD(&s->s_inodes); @@ -597,32 +596,38 @@ out: static void mark_files_ro(struct super_block *sb) { + struct inode *inode; struct file *f; retry: file_list_lock(); - list_for_each_entry(f, &sb->s_files, f_u.fu_list) { - struct vfsmount *mnt; - if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) - continue; - if (!file_count(f)) - continue; - if (!(f->f_mode & FMODE_WRITE)) - continue; - f->f_mode &= ~FMODE_WRITE; - if (file_check_writeable(f) != 0) - continue; - file_release_write(f); - mnt = mntget(f->f_path.mnt); - file_list_unlock(); - /* - * This can sleep, so we can't hold - * the file_list_lock() spinlock. - */ - mnt_drop_write(mnt); - mntput(mnt); - goto retry; + spin_lock(&inode_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + list_for_each_entry(f, &inode->i_files, f_u.fu_list) { + struct vfsmount *mnt; + if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) + continue; + if (!file_count(f)) + continue; + if (!(f->f_mode & FMODE_WRITE)) + continue; + f->f_mode &= ~FMODE_WRITE; + if (file_check_writeable(f) != 0) + continue; + file_release_write(f); + mnt = mntget(f->f_path.mnt); + spin_unlock(&inode_lock); + file_list_unlock(); + /* + * This can sleep, so we can't hold + * the inode_lock spinlock. + */ + mnt_drop_write(mnt); + mntput(mnt); + goto retry; + } } + spin_unlock(&inode_lock); file_list_unlock(); } diff --git a/include/linux/fs.h b/include/linux/fs.h index 562d285..7805d20 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -656,6 +656,7 @@ struct inode { struct list_head i_list; struct list_head i_sb_list; struct list_head i_dentry; + struct list_head i_files; unsigned long i_ino; atomic_t i_count; unsigned int i_nlink; @@ -878,9 +879,9 @@ struct file { unsigned long f_mnt_write_state; #endif }; -extern spinlock_t files_lock; -#define file_list_lock() spin_lock(&files_lock); -#define file_list_unlock() spin_unlock(&files_lock); +extern struct mutex files_lock; +#define file_list_lock() mutex_lock(&files_lock); +#define file_list_unlock() mutex_unlock(&files_lock); #define get_file(x) atomic_long_inc(&(x)->f_count) #define file_count(x) atomic_long_read(&(x)->f_count) @@ -1277,7 +1278,6 @@ struct super_block { struct list_head s_io; /* parked for writeback */ struct list_head s_more_io; /* parked for more writeback */ struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ - struct list_head s_files; /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */ struct list_head s_dentry_lru; /* unused dentry lru */ int s_nr_dentry_unused; /* # of dentry on lru */ @@ -2116,7 +2116,7 @@ static inline void insert_inode_hash(struct inode *inode) { } extern struct file * get_empty_filp(void); -extern void file_move(struct file *f, struct list_head *list); +extern void file_add(struct file *f, struct list_head *list); extern void file_kill(struct file *f); #ifdef CONFIG_BLOCK struct bio; -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 2513F5F0001 for ; Sat, 11 Apr 2009 08:08:56 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:08:58 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Going forward fops_read_lock should be held whenever file->f_op is being accesed and when the functions from f_op are executing. In 4 subsystems sysfs, proc, and sysctl, and tty we have support for modifing a file descriptor so that the underlying object can go away. In looking at the problem of pci hotunplug it appears that we potentially need that support for all file descriptors except ones talking to files on filesystems. Even on for file descriptors referring to files support for file the underlying object going away is interesting for implementing features like umount -f and sys_revoke. The implementations in sysfs, proc and sysctl are all very similar and are composed of several components. - A reference count to track that the file operations are being used. - An ability to flag the file as no longer being valid. - An ability to wait until the reference count is no longer used. Tracking when file_operations functions are running is done by holding the fops_read_lock across their invocations. Flagging when the file is no longer valid will be done by taking f_lock and modifying f_op, with a set of file operations that will return appropriate error codes,. roughly EIO from most operations, POLLERR from poll, and 0 from reads, and setting FMODE_REVOKE. Waiting until the functions are no longer being called is done with by waiting until f_use goes to 0. Essentially the same as synchronize_srcu. When implementing this I encountered an additional challenge. Ensuring that f_op->release is called exactly once, in an appropriate context. To ensure this I have taken several steps. - file_kill is moved immediate after after frelease in __fput to ensure the proper context is present even if fop_substitute calls release. - open sets FMODE_RELEASE after the open succeeds (but before fops_read_unlock) ensuring that fops_subsittute will know if release needs to be called after it has finished waiting for all of the files. - __fput samples fmode and f_op under f_lock and only calls __frelease if FMODE_REVOKE has not happened and FMODE_RELEASE is pending. Leaving it up to fops_subsitutate to call __frelease. - fops_substituate calls __frelease in all cases if after waiting for all users of a file to go to zero FMODE_RELEASE is still set. Signed-off-by: Eric W. Biederman --- Documentation/filesystems/vfs.txt | 4 + fs/file_table.c | 154 ++++++++++++++++++++++++++++++++++--- fs/open.c | 19 ++++- include/linux/fs.h | 19 +++++ 4 files changed, 181 insertions(+), 15 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index deeeed0..2b115ba 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -807,6 +807,10 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + awaken_all_waiters: Called in while revoking a file to wake up poll, + aio operations, fasync, and anything else blocked + indefinitely waiting for something to happen. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special diff --git a/fs/file_table.c b/fs/file_table.c index 03d74b6..d216557 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -23,6 +23,7 @@ #include #include #include +#include #include @@ -204,7 +205,7 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry, file->f_path.dentry = dentry; file->f_path.mnt = mntget(mnt); file->f_mapping = dentry->d_inode->i_mapping; - file->f_mode = mode; + file->f_mode = mode | FMODE_RELEASE; file->f_op = fop; /* @@ -255,6 +256,51 @@ void drop_file_write_access(struct file *file) } EXPORT_SYMBOL_GPL(drop_file_write_access); +static void __frelease(struct file *file, struct inode *inode, + const struct file_operations *f_op) +{ + locks_remove_flock(file); + if (unlikely(file->f_flags & FASYNC)) { + if (f_op && f_op->fasync) + file->f_op->fasync(-1, file, 0); + } + if (f_op && f_op->release) + f_op->release(inode, file); + + if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL)) + cdev_put(inode->i_cdev); +} + +static void frelease(struct file *file, struct inode *inode) + +{ + const struct file_operations *f_op; + int fops_idx; + fmode_t mode; + int need_release = 0; + + fops_idx = fops_read_lock(file); + + /* + * Ensure that __frelease is called exactly once. + * + * We don't do anything if FMODE_REVOKED is set because + * we will have a f_op without the proper release method + * and so can not cleanup from this path. + */ + spin_lock(&file->f_lock); + f_op = file->f_op; + mode = file->f_mode; + need_release = (mode & (FMODE_REVOKED | FMODE_RELEASE)) == FMODE_RELEASE; + if (need_release) + file->f_mode = mode & ~FMODE_RELEASE; + spin_unlock(&file->f_lock); + + if (need_release) + __frelease(file, inode, f_op); + fops_read_unlock(file, fops_idx); +} + /* __fput is called from task context when aio completion releases the last * last use of a struct file *. Do not use otherwise. */ @@ -272,21 +318,14 @@ void __fput(struct file *file) * in the file cleanup chain. */ eventpoll_release(file); - locks_remove_flock(file); - if (unlikely(file->f_flags & FASYNC)) { - if (file->f_op && file->f_op->fasync) - file->f_op->fasync(-1, file, 0); - } - if (file->f_op && file->f_op->release) - file->f_op->release(inode, file); + frelease(file, inode); + file_kill(file); + security_file_free(file); ima_file_free(file); - if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL)) - cdev_put(inode->i_cdev); fops_put(file->f_op); put_pid(file->f_owner.pid); - file_kill(file); if (file->f_mode & FMODE_WRITE) drop_file_write_access(file); file->f_path.dentry = NULL; @@ -296,6 +335,78 @@ void __fput(struct file *file) mntput(mnt); } +int fops_substitute(struct file *file, const struct file_operations *f_op, + struct vm_operations_struct *vm_ops) +{ + /* Must be called with file_list_lock held */ + /* This currently assumes that the new f_op does not need + * open or release to be called. + * This currently assumes that it will not be called twice + * on the same file. + */ + const struct file_operations *old_f_op; + fmode_t mode; + int err; + + err = -EINVAL; + f_op = fops_get(f_op); + if (!f_op) + goto out; + /* + * Ensure we have no new users of the old f_ops. + * Assignment order is important here. + */ + spin_lock(&file->f_lock); + old_f_op = file->f_op; + rcu_assign_pointer(file->f_op, f_op); + file->f_mode |= FMODE_REVOKED; + spin_unlock(&file->f_lock); + + /* + * Drain the existing uses of the original f_ops. + */ + remap_file_mappings(file, vm_ops); + if (old_f_op->awaken_all_waiters) + old_f_op->awaken_all_waiters(file); + + /* + * Wait until there are no more callers in the original + * file_operations methods. + */ + while (atomic_long_read(&file->f_use) > 0) + schedule_timeout_interruptible(1); + + /* + * Cleanup the data structures that were associated + * with the old fops. + */ + spin_lock(&file->f_lock); + mode = file->f_mode; + file->f_mode = mode & ~FMODE_RELEASE; + spin_unlock(&file->f_lock); + if (mode & FMODE_RELEASE) + __frelease(file, file->f_path.dentry->d_inode, old_f_op); + fops_put(old_f_op); + file->private_data = NULL; + err = 0; +out: + return err; +} + +void inode_fops_substitute(struct inode *inode, + const struct file_operations *f_op, + struct vm_operations_struct *vm_ops) +{ + struct file *file; + + file_list_lock(); + /* Prevent new files from showing up with the old f_ops */ + inode->i_fop = f_op; + list_for_each_entry(file, &inode->i_files, f_u.fu_list) + fops_substitute(file, f_op, vm_ops); + file_list_unlock(); +} + struct file *fget(unsigned int fd) { struct file *file; @@ -358,12 +469,17 @@ void put_filp(struct file *file) } } +void __file_add(struct file *file, struct list_head *list) +{ + list_add(&file->f_u.fu_list, list); +} + void file_add(struct file *file, struct list_head *list) { if (!list) return; file_list_lock(); - list_add(&file->f_u.fu_list, list); + __file_add(file, list); file_list_unlock(); } @@ -376,6 +492,20 @@ void file_kill(struct file *file) } } +int fops_read_lock(struct file *file) +{ + int revoked = (file->f_mode & FMODE_REVOKED); + if (likely(!revoked)) + atomic_long_inc(&file->f_use); + return revoked; +} + +void fops_read_unlock(struct file *file, int revoked) +{ + if (likely(!revoked)) + atomic_long_dec(&file->f_use); +} + int fs_may_remount_ro(struct super_block *sb) { struct inode *inode; diff --git a/fs/open.c b/fs/open.c index 5e201cb..0b75dde 100644 --- a/fs/open.c +++ b/fs/open.c @@ -808,7 +808,9 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, int (*open)(struct inode *, struct file *), const struct cred *cred) { + const struct file_operations *f_op; struct inode *inode; + int fops_idx; int error; f->f_flags = flags; @@ -827,21 +829,31 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, f->f_path.dentry = dentry; f->f_path.mnt = mnt; f->f_pos = 0; + + file_list_lock(); f->f_op = fops_get(inode->i_fop); if (!special_file(inode->i_mode)) - file_add(f, &inode->i_files); + __file_add(f, &inode->i_files); + file_list_unlock(); + + fops_idx = fops_read_lock(f); + f_op = rcu_dereference(f->f_op); error = security_dentry_open(f, cred); if (error) goto cleanup_all; - if (!open && f->f_op) - open = f->f_op->open; + if (!open && f_op) + open = f_op->open; if (open) { error = open(inode, f); if (error) goto cleanup_all; } + spin_lock(&f->f_lock); + f->f_mode |= FMODE_RELEASE; + spin_unlock(&f->f_lock); + fops_read_unlock(f, fops_idx); f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC); @@ -860,6 +872,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, return f; cleanup_all: + fops_read_unlock(f, fops_idx); fops_put(f->f_op); if (f->f_mode & FMODE_WRITE) { put_write_access(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 7805d20..a82a2ea 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -78,6 +78,13 @@ struct inodes_stat_t { /* File is opened using open(.., 3, ..) and is writeable only for ioctls (specialy hack for floppy.c) */ #define FMODE_WRITE_IOCTL ((__force fmode_t)256) +/* File release method needs to be called */ +#define FMODE_RELEASE ((__force fmode_t)512) +/* + * The file descriptor has been denied access to the original object. + * Likely a module removal, or device has been unplugged. + */ +#define FMODE_REVOKED ((__force fmode_t)1024) /* * Don't update ctime and mtime. @@ -329,6 +336,7 @@ struct kstatfs; struct vm_area_struct; struct vfsmount; struct cred; +struct vm_operations_struct; extern void __init inode_init(void); extern void __init inode_init_early(void); @@ -856,6 +864,7 @@ struct file { const struct file_operations *f_op; spinlock_t f_lock; /* f_ep_links, f_flags, no IRQ */ atomic_long_t f_count; + atomic_long_t f_use; /* f_op, private_data */ unsigned int f_flags; fmode_t f_mode; loff_t f_pos; @@ -879,6 +888,14 @@ struct file { unsigned long f_mnt_write_state; #endif }; + +extern int fops_read_lock(struct file *file); +extern void fops_read_unlock(struct file *file, int idx); +extern int fops_substitute(struct file *file, const struct file_operations *f_op, + struct vm_operations_struct *vm_ops); +extern void inode_fops_substitute(struct inode *inode, + const struct file_operations *f_op, struct vm_operations_struct *vm_ops); + extern struct mutex files_lock; #define file_list_lock() mutex_lock(&files_lock); #define file_list_unlock() mutex_unlock(&files_lock); @@ -1452,6 +1469,7 @@ struct file_operations { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); + int (*awaken_all_waiters)(struct file *); }; struct inode_operations { @@ -2116,6 +2134,7 @@ static inline void insert_inode_hash(struct inode *inode) { } extern struct file * get_empty_filp(void); +extern void __file_add(struct file *f, struct list_head *list); extern void file_add(struct file *f, struct list_head *list); extern void file_kill(struct file *f); #ifdef CONFIG_BLOCK -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 5FBE05F0001 for ; Sat, 11 Apr 2009 08:10:46 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:10:49 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 6/9] vfs: Utilize fops_read_lock where appropriate Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Signed-off-by: Eric W. Biederman --- fs/compat.c | 31 ++++++++---- fs/fcntl.c | 32 ++++++++---- fs/ioctl.c | 39 +++++++++------ fs/locks.c | 81 +++++++++++++++++++++++++------ fs/open.c | 12 ++++- fs/read_write.c | 143 ++++++++++++++++++++++++++++++++++++++++++------------- fs/readdir.c | 14 ++++- fs/select.c | 17 +++++- 8 files changed, 276 insertions(+), 93 deletions(-) diff --git a/fs/compat.c b/fs/compat.c index 3f84d5f..a73ca0d 100644 --- a/fs/compat.c +++ b/fs/compat.c @@ -1090,6 +1090,7 @@ out: #endif /* ! __ARCH_OMIT_COMPAT_SYS_GETDENTS64 */ static ssize_t compat_do_readv_writev(int type, struct file *file, + const struct file_operations *f_op, const struct compat_iovec __user *uvector, unsigned long nr_segs, loff_t *pos) { @@ -1117,7 +1118,7 @@ static ssize_t compat_do_readv_writev(int type, struct file *file, ret = -EINVAL; if ((nr_segs > UIO_MAXIOV) || (nr_segs <= 0)) goto out; - if (!file->f_op) + if (!f_op) goto out; if (nr_segs > UIO_FASTIOV) { ret = -ENOMEM; @@ -1170,11 +1171,11 @@ static ssize_t compat_do_readv_writev(int type, struct file *file, fnv = NULL; if (type == READ) { - fn = file->f_op->read; - fnv = file->f_op->aio_read; + fn = f_op->read; + fnv = f_op->aio_read; } else { - fn = (io_fn_t)file->f_op->write; - fnv = file->f_op->aio_write; + fn = (io_fn_t)f_op->write; + fnv = f_op->aio_write; } if (fnv) @@ -1200,21 +1201,27 @@ static size_t compat_readv(struct file *file, const struct compat_iovec __user *vec, unsigned long vlen, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; ssize_t ret = -EBADF; + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + if (!(file->f_mode & FMODE_READ)) goto out; ret = -EINVAL; - if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) + if (!f_op || (!f_op->aio_read && !f_op->read)) goto out; - ret = compat_do_readv_writev(READ, file, vec, vlen, pos); + ret = compat_do_readv_writev(READ, file, f_op, vec, vlen, pos); out: if (ret > 0) add_rchar(current, ret); inc_syscr(current); + fops_read_unlock(file, fops_idx); return ret; } @@ -1257,21 +1264,27 @@ static size_t compat_writev(struct file *file, const struct compat_iovec __user *vec, unsigned long vlen, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; ssize_t ret = -EBADF; + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + if (!(file->f_mode & FMODE_WRITE)) goto out; ret = -EINVAL; - if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) + if (!f_op || (!f_op->aio_write && !f_op->write)) goto out; - ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos); + ret = compat_do_readv_writev(WRITE, file, f_op, vec, vlen, pos); out: if (ret > 0) add_wchar(current, ret); inc_syscw(current); + fops_read_unlock(file, fops_idx); return ret; } diff --git a/fs/fcntl.c b/fs/fcntl.c index cc8e4de..2718aea 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -146,42 +146,51 @@ SYSCALL_DEFINE1(dup, unsigned int, fildes) static int setfl(int fd, struct file * filp, unsigned long arg) { struct inode * inode = filp->f_path.dentry->d_inode; - int error = 0; + const struct file_operations *f_op; + int fops_idx; + int error; + + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); /* * O_APPEND cannot be cleared if the file is marked as append-only * and the file is open for write. */ + error = -EPERM; if (((arg ^ filp->f_flags) & O_APPEND) && IS_APPEND(inode)) - return -EPERM; + goto out; /* O_NOATIME can only be set by the owner or superuser */ + error = -EPERM; if ((arg & O_NOATIME) && !(filp->f_flags & O_NOATIME)) if (!is_owner_or_cap(inode)) - return -EPERM; + goto out; /* required for strict SunOS emulation */ if (O_NONBLOCK != O_NDELAY) if (arg & O_NDELAY) arg |= O_NONBLOCK; + error = -EINVAL; if (arg & O_DIRECT) { if (!filp->f_mapping || !filp->f_mapping->a_ops || - !filp->f_mapping->a_ops->direct_IO) - return -EINVAL; + !filp->f_mapping->a_ops->direct_IO) + goto out; } - if (filp->f_op && filp->f_op->check_flags) - error = filp->f_op->check_flags(arg); + error = 0; + if (f_op && f_op->check_flags) + error = f_op->check_flags(arg); if (error) - return error; + goto out; /* * ->fasync() is responsible for setting the FASYNC bit. */ - if (((arg ^ filp->f_flags) & FASYNC) && filp->f_op && - filp->f_op->fasync) { - error = filp->f_op->fasync(fd, filp, (arg & FASYNC) != 0); + if (((arg ^ filp->f_flags) & FASYNC) && f_op && + f_op->fasync) { + error = f_op->fasync(fd, filp, (arg & FASYNC) != 0); if (error < 0) goto out; if (error > 0) @@ -192,6 +201,7 @@ static int setfl(int fd, struct file * filp, unsigned long arg) spin_unlock(&filp->f_lock); out: + fops_read_unlock(filp, fops_idx); return error; } diff --git a/fs/ioctl.c b/fs/ioctl.c index ac2d47e..158030b 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -33,23 +33,23 @@ * * Returns 0 on success, -errno on error. */ -static long vfs_ioctl(struct file *filp, unsigned int cmd, - unsigned long arg) +static long vfs_ioctl(struct file *filp, const struct file_operations *f_op, + unsigned int cmd, unsigned long arg) { int error = -ENOTTY; - if (!filp->f_op) + if (!f_op) goto out; - if (filp->f_op->unlocked_ioctl) { - error = filp->f_op->unlocked_ioctl(filp, cmd, arg); + if (f_op->unlocked_ioctl) { + error = f_op->unlocked_ioctl(filp, cmd, arg); if (error == -ENOIOCTLCMD) error = -EINVAL; goto out; - } else if (filp->f_op->ioctl) { + } else if (f_op->ioctl) { lock_kernel(); - error = filp->f_op->ioctl(filp->f_path.dentry->d_inode, - filp, cmd, arg); + error = f_op->ioctl(filp->f_path.dentry->d_inode, filp, + cmd, arg); unlock_kernel(); } @@ -370,8 +370,8 @@ EXPORT_SYMBOL(generic_block_fiemap); #endif /* CONFIG_BLOCK */ -static int file_ioctl(struct file *filp, unsigned int cmd, - unsigned long arg) +static int file_ioctl(struct file *filp, const struct file_operations *f_op, + unsigned int cmd, unsigned long arg) { struct inode *inode = filp->f_path.dentry->d_inode; int __user *p = (int __user *)arg; @@ -387,7 +387,7 @@ static int file_ioctl(struct file *filp, unsigned int cmd, return put_user(i_size_read(inode) - filp->f_pos, p); } - return vfs_ioctl(filp, cmd, arg); + return vfs_ioctl(filp, f_op, cmd, arg); } static int ioctl_fionbio(struct file *filp, int __user *argp) @@ -414,6 +414,7 @@ static int ioctl_fionbio(struct file *filp, int __user *argp) } static int ioctl_fioasync(unsigned int fd, struct file *filp, + const struct file_operations *f_op, int __user *argp) { unsigned int flag; @@ -426,9 +427,9 @@ static int ioctl_fioasync(unsigned int fd, struct file *filp, /* Did FASYNC state change ? */ if ((flag ^ filp->f_flags) & FASYNC) { - if (filp->f_op && filp->f_op->fasync) + if (f_op && f_op->fasync) /* fasync() adjusts filp->f_flags */ - error = filp->f_op->fasync(fd, filp, on); + error = f_op->fasync(fd, filp, on); else error = -ENOTTY; } @@ -482,9 +483,14 @@ static int ioctl_fsthaw(struct file *filp) int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, unsigned long arg) { + const struct file_operations *f_op; + int fops_idx; int error = 0; int __user *argp = (int __user *)arg; + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + switch (cmd) { case FIOCLEX: set_close_on_exec(fd, 1); @@ -499,7 +505,7 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, break; case FIOASYNC: - error = ioctl_fioasync(fd, filp, argp); + error = ioctl_fioasync(fd, filp, f_op, argp); break; case FIOQSIZE: @@ -524,11 +530,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd, default: if (S_ISREG(filp->f_path.dentry->d_inode->i_mode)) - error = file_ioctl(filp, cmd, arg); + error = file_ioctl(filp, f_op, cmd, arg); else - error = vfs_ioctl(filp, cmd, arg); + error = vfs_ioctl(filp, f_op, cmd, arg); break; } + fops_read_unlock(filp, fops_idx); return error; } diff --git a/fs/locks.c b/fs/locks.c index ec3deea..5ff959e 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1463,15 +1463,21 @@ EXPORT_SYMBOL(generic_setlease); int vfs_setlease(struct file *filp, long arg, struct file_lock **lease) { + const struct file_operations *f_op; + int fops_idx; int error; + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + lock_kernel(); - if (filp->f_op && filp->f_op->setlease) - error = filp->f_op->setlease(filp, arg, lease); + if (f_op && f_op->setlease) + error = f_op->setlease(filp, arg, lease); else error = generic_setlease(filp, arg, lease); unlock_kernel(); + fops_read_unlock(filp, fops_idx); return error; } EXPORT_SYMBOL_GPL(vfs_setlease); @@ -1566,9 +1572,11 @@ EXPORT_SYMBOL(flock_lock_file_wait); */ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) { + const struct file_operations *f_op; struct file *filp; struct file_lock *lock; int can_sleep, unlock; + int fops_idx; int error; error = -EBADF; @@ -1594,13 +1602,18 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) if (error) goto out_free; - if (filp->f_op && filp->f_op->flock) - error = filp->f_op->flock(filp, + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->flock) + error = f_op->flock(filp, (can_sleep) ? F_SETLKW : F_SETLK, lock); else error = flock_lock_file_wait(filp, lock); + fops_read_unlock(filp, fops_idx); + out_free: locks_free_lock(lock); @@ -1620,10 +1633,20 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) */ int vfs_test_lock(struct file *filp, struct file_lock *fl) { - if (filp->f_op && filp->f_op->lock) - return filp->f_op->lock(filp, F_GETLK, fl); - posix_test_lock(filp, fl); - return 0; + const struct file_operations *f_op; + int fops_idx; + int ret = 0; + + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->lock) + ret = filp->f_op->lock(filp, F_GETLK, fl); + else + posix_test_lock(filp, fl); + + fops_read_unlock(filp, fops_idx); + return ret; } EXPORT_SYMBOL_GPL(vfs_test_lock); @@ -1732,10 +1755,20 @@ out: */ int vfs_lock_file(struct file *filp, unsigned int cmd, struct file_lock *fl, struct file_lock *conf) { - if (filp->f_op && filp->f_op->lock) - return filp->f_op->lock(filp, cmd, fl); + const struct file_operations *f_op; + int fops_idx; + int ret; + + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->lock) + ret = f_op->lock(filp, cmd, fl); else - return posix_lock_file(filp, fl, conf); + ret = posix_lock_file(filp, fl, conf); + + fops_read_unlock(filp, fops_idx); + return ret; } EXPORT_SYMBOL_GPL(vfs_lock_file); @@ -1999,13 +2032,18 @@ EXPORT_SYMBOL(locks_remove_posix); void locks_remove_flock(struct file *filp) { struct inode * inode = filp->f_path.dentry->d_inode; + const struct file_operations *f_op; struct file_lock *fl; struct file_lock **before; + int fops_idx; if (!inode->i_flock) return; - if (filp->f_op && filp->f_op->flock) { + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->flock) { struct file_lock fl = { .fl_pid = current->tgid, .fl_file = filp, @@ -2013,11 +2051,13 @@ void locks_remove_flock(struct file *filp) .fl_type = F_UNLCK, .fl_end = OFFSET_MAX, }; - filp->f_op->flock(filp, F_SETLKW, &fl); + f_op->flock(filp, F_SETLKW, &fl); if (fl.fl_ops && fl.fl_ops->fl_release_private) fl.fl_ops->fl_release_private(&fl); } + fops_read_unlock(filp, fops_idx); + lock_kernel(); before = &inode->i_flock; @@ -2071,9 +2111,18 @@ EXPORT_SYMBOL(posix_unblock_lock); */ int vfs_cancel_lock(struct file *filp, struct file_lock *fl) { - if (filp->f_op && filp->f_op->lock) - return filp->f_op->lock(filp, F_CANCELLK, fl); - return 0; + const struct file_operations *f_op; + int fops_idx; + int ret = 0; + + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->lock) + ret = f_op->lock(filp, F_CANCELLK, fl); + + fops_read_unlock(filp, fops_idx); + return ret; } EXPORT_SYMBOL_GPL(vfs_cancel_lock); diff --git a/fs/open.c b/fs/open.c index 0b75dde..67031e7 100644 --- a/fs/open.c +++ b/fs/open.c @@ -398,6 +398,7 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len) goto out; if (!(file->f_mode & FMODE_WRITE)) goto out_fput; + /* * Revalidate the write permissions, in case security policy has * changed since the files were opened. @@ -1107,6 +1108,8 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, int, mode) */ int filp_close(struct file *filp, fl_owner_t id) { + const struct file_operations *f_op; + int fops_idx; int retval = 0; if (!file_count(filp)) { @@ -1114,8 +1117,13 @@ int filp_close(struct file *filp, fl_owner_t id) return 0; } - if (filp->f_op && filp->f_op->flush) - retval = filp->f_op->flush(filp, id); + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + + if (f_op && f_op->flush) + retval = f_op->flush(filp, id); + + fops_read_unlock(filp, fops_idx); dnotify_flush(filp, id); locks_remove_posix(filp, id); diff --git a/fs/read_write.c b/fs/read_write.c index 9d1e76b..4def2ee 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -136,14 +136,23 @@ EXPORT_SYMBOL(default_llseek); loff_t vfs_llseek(struct file *file, loff_t offset, int origin) { loff_t (*fn)(struct file *, loff_t, int); + const struct file_operations *f_op; + int fops_idx; + loff_t ret; + + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); fn = no_llseek; if (file->f_mode & FMODE_LSEEK) { fn = default_llseek; - if (file->f_op && file->f_op->llseek) - fn = file->f_op->llseek; + if (f_op && f_op->llseek) + fn = f_op->llseek; } - return fn(file, offset, origin); + ret = fn(file, offset, origin); + + fops_read_unlock(file, fops_idx); + return ret; } EXPORT_SYMBOL(vfs_llseek); @@ -252,15 +261,20 @@ static void wait_on_retry_sync_kiocb(struct kiocb *iocb) ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *ppos) { struct iovec iov = { .iov_base = buf, .iov_len = len }; + const struct file_operations *f_op; struct kiocb kiocb; + int fops_idx; ssize_t ret; + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; kiocb.ki_left = len; for (;;) { - ret = filp->f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); + ret = f_op->aio_read(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); @@ -269,6 +283,8 @@ ssize_t do_sync_read(struct file *filp, char __user *buf, size_t len, loff_t *pp if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); *ppos = kiocb.ki_pos; + fops_read_unlock(filp, fops_idx); + return ret; } @@ -276,20 +292,28 @@ EXPORT_SYMBOL(do_sync_read); ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; ssize_t ret; + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + + ret = -EBADF; if (!(file->f_mode & FMODE_READ)) - return -EBADF; - if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read)) - return -EINVAL; + goto out; + ret = -EINVAL; + if (!f_op || (!f_op->read && !f_op->aio_read)) + goto out; + ret = -EFAULT; if (unlikely(!access_ok(VERIFY_WRITE, buf, count))) - return -EFAULT; + goto out; ret = rw_verify_area(READ, file, pos, count); if (ret >= 0) { count = ret; - if (file->f_op->read) - ret = file->f_op->read(file, buf, count, pos); + if (f_op->read) + ret = f_op->read(file, buf, count, pos); else ret = do_sync_read(file, buf, count, pos); if (ret > 0) { @@ -298,7 +322,8 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) } inc_syscr(current); } - +out: + fops_read_unlock(file, fops_idx); return ret; } @@ -307,15 +332,20 @@ EXPORT_SYMBOL(vfs_read); ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) { struct iovec iov = { .iov_base = (void __user *)buf, .iov_len = len }; + const struct file_operations *f_op; struct kiocb kiocb; + int fops_idx; ssize_t ret; + fops_idx = fops_read_lock(filp); + f_op = rcu_dereference(filp->f_op); + init_sync_kiocb(&kiocb, filp); kiocb.ki_pos = *ppos; kiocb.ki_left = len; for (;;) { - ret = filp->f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); + ret = f_op->aio_write(&kiocb, &iov, 1, kiocb.ki_pos); if (ret != -EIOCBRETRY) break; wait_on_retry_sync_kiocb(&kiocb); @@ -324,6 +354,8 @@ ssize_t do_sync_write(struct file *filp, const char __user *buf, size_t len, lof if (-EIOCBQUEUED == ret) ret = wait_on_sync_kiocb(&kiocb); *ppos = kiocb.ki_pos; + + fops_read_unlock(filp, fops_idx); return ret; } @@ -331,20 +363,28 @@ EXPORT_SYMBOL(do_sync_write); ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; ssize_t ret; + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + + ret = -EBADF; if (!(file->f_mode & FMODE_WRITE)) - return -EBADF; - if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write)) - return -EINVAL; + goto out; + ret = -EINVAL; + if (!f_op || (!f_op->write && !f_op->aio_write)) + goto out; + ret = -EFAULT; if (unlikely(!access_ok(VERIFY_READ, buf, count))) - return -EFAULT; + goto out; ret = rw_verify_area(WRITE, file, pos, count); if (ret >= 0) { count = ret; - if (file->f_op->write) - ret = file->f_op->write(file, buf, count, pos); + if (f_op->write) + ret = f_op->write(file, buf, count, pos); else ret = do_sync_write(file, buf, count, pos); if (ret > 0) { @@ -354,6 +394,8 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ inc_syscw(current); } +out: + fops_read_unlock(file, fops_idx); return ret; } @@ -611,6 +653,7 @@ out: } static ssize_t do_readv_writev(int type, struct file *file, + const struct file_operations *f_op, const struct iovec __user * uvector, unsigned long nr_segs, loff_t *pos) { @@ -621,7 +664,7 @@ static ssize_t do_readv_writev(int type, struct file *file, io_fn_t fn; iov_fn_t fnv; - if (!file->f_op) { + if (!f_op) { ret = -EINVAL; goto out; } @@ -638,11 +681,11 @@ static ssize_t do_readv_writev(int type, struct file *file, fnv = NULL; if (type == READ) { - fn = file->f_op->read; - fnv = file->f_op->aio_read; + fn = f_op->read; + fnv = f_op->aio_read; } else { - fn = (io_fn_t)file->f_op->write; - fnv = file->f_op->aio_write; + fn = (io_fn_t)f_op->write; + fnv = f_op->aio_write; } if (fnv) @@ -666,12 +709,25 @@ out: ssize_t vfs_readv(struct file *file, const struct iovec __user *vec, unsigned long vlen, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; + ssize_t ret; + + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + + ret = -EBADF; if (!(file->f_mode & FMODE_READ)) - return -EBADF; - if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) - return -EINVAL; + goto out; + ret = -EINVAL; + if (!f_op || (!f_op->aio_read && !f_op->read)) + goto out; + + ret = do_readv_writev(READ, file, f_op, vec, vlen, pos); - return do_readv_writev(READ, file, vec, vlen, pos); +out: + fops_read_unlock(file, fops_idx); + return ret; } EXPORT_SYMBOL(vfs_readv); @@ -679,12 +735,25 @@ EXPORT_SYMBOL(vfs_readv); ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, unsigned long vlen, loff_t *pos) { + const struct file_operations *f_op; + int fops_idx; + ssize_t ret; + + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + + ret = -EBADF; if (!(file->f_mode & FMODE_WRITE)) - return -EBADF; - if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) - return -EINVAL; + goto out; + ret = -EINVAL; + if (!f_op || (!f_op->aio_write && !f_op->write)) + goto out; + + ret = do_readv_writev(WRITE, file, f_op, vec, vlen, pos); - return do_readv_writev(WRITE, file, vec, vlen, pos); +out: + fops_read_unlock(file, fops_idx); + return ret; } EXPORT_SYMBOL(vfs_writev); @@ -790,8 +859,10 @@ SYSCALL_DEFINE5(pwritev, unsigned long, fd, const struct iovec __user *, vec, static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, size_t count, loff_t max) { + const struct file_operations *in_f_op, *out_f_op; struct file * in_file, * out_file; struct inode * in_inode, * out_inode; + int in_fops_idx, out_fops_idx; loff_t pos; ssize_t retval; int fput_needed_in, fput_needed_out, fl; @@ -803,13 +874,15 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, in_file = fget_light(in_fd, &fput_needed_in); if (!in_file) goto out; + in_fops_idx = fops_read_lock(in_file); if (!(in_file->f_mode & FMODE_READ)) goto fput_in; retval = -EINVAL; in_inode = in_file->f_path.dentry->d_inode; if (!in_inode) goto fput_in; - if (!in_file->f_op || !in_file->f_op->splice_read) + in_f_op = rcu_dereference(in_file->f_op); + if (!in_f_op || !in_f_op->splice_read) goto fput_in; retval = -ESPIPE; if (!ppos) @@ -829,10 +902,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, out_file = fget_light(out_fd, &fput_needed_out); if (!out_file) goto fput_in; + out_fops_idx = fops_read_lock(out_file); if (!(out_file->f_mode & FMODE_WRITE)) goto fput_out; retval = -EINVAL; - if (!out_file->f_op || !out_file->f_op->sendpage) + out_f_op = rcu_dereference(out_file->f_op); + if (!out_f_op || !out_f_op->sendpage) goto fput_out; out_inode = out_file->f_path.dentry->d_inode; retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count); @@ -878,8 +953,10 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, retval = -EOVERFLOW; fput_out: + fops_read_unlock(out_file, out_fops_idx); fput_light(out_file, fput_needed_out); fput_in: + fops_read_unlock(in_file, in_fops_idx); fput_light(in_file, fput_needed_in); out: return retval; diff --git a/fs/readdir.c b/fs/readdir.c index 7723401..6017fa6 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -21,9 +21,16 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) { - struct inode *inode = file->f_path.dentry->d_inode; + const struct file_operations *f_op; + struct inode *inode; + int fops_idx; int res = -ENOTDIR; - if (!file->f_op || !file->f_op->readdir) + + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + + inode = file->f_path.dentry->d_inode; + if (!f_op || !f_op->readdir) goto out; res = security_file_permission(file, MAY_READ); @@ -36,11 +43,12 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) res = -ENOENT; if (!IS_DEADDIR(inode)) { - res = file->f_op->readdir(file, buf, filler); + res = f_op->readdir(file, buf, filler); file_accessed(file); } mutex_unlock(&inode->i_mutex); out: + fops_read_unlock(file, fops_idx); return res; } diff --git a/fs/select.c b/fs/select.c index 0fe0e14..8f736a9 100644 --- a/fs/select.c +++ b/fs/select.c @@ -416,10 +416,12 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) continue; file = fget_light(i, &fput_needed); if (file) { - f_op = file->f_op; + int fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); mask = DEFAULT_POLLMASK; if (f_op && f_op->poll) mask = (*f_op->poll)(file, retval ? NULL : wait); + fops_read_unlock(file, fops_idx); fput_light(file, fput_needed); if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; @@ -684,11 +686,20 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait) file = fget_light(fd, &fput_needed); mask = POLLNVAL; if (file != NULL) { + const struct file_operations *f_op; + int fops_idx; + + fops_idx = fops_read_lock(file); + f_op = rcu_dereference(file->f_op); + mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) - mask = file->f_op->poll(file, pwait); + if (f_op && f_op->poll) + mask = f_op->poll(file, pwait); + /* Mask out unneeded events. */ mask &= pollfd->events | POLLERR | POLLHUP; + + fops_read_unlock(file, fops_idx); fput_light(file, fput_needed); } } -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 8CE275F0001 for ; Sat, 11 Apr 2009 08:11:53 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:11:59 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 7/9] vfs: Optimize fops_read_lock Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: After seeing fget_light and the justification for it in the commit log I did not want to introduce something into the common read/write path of file descriptors that could have a significant measurable impact on I/O speed. commit f6435db01709533f270b2dce1e5914770dbc65de Author: akpm Date: Thu May 8 05:19:50 2003 +0000 [PATCH] reduced overheads in fget/fput From: Dipankar Sarma fget() shows up on profiles, especially on SMP. Dipankar's patch special-cases the situation wherein there are no sharers of current->files. In this situation we know that no other process can close this file, so it is not necessary to increment the file's refcount. It's ugly as sin, but makes a substantial difference. The test is dd if=/dev/zero of=foo bs=1 count=1M On 4CPU P3 xeon with 1MB L2 cache and 512MB ram: kernel sys time std-dev ------------ -------- ------- UP - vanilla 2.104 0.028 UP - file 1.867 0.019 SMP - vanilla 2.976 0.023 SMP - file 2.719 0.026 BKrev: 3eb9e8f6Db0nMWoSx5IdHx6SBal8aw My test case was: dd if=/dev/zero of=/dev/null bs=1 count=1M. As writing to a real file turned out to cover the cost. Without the optimization I am seeing 2.4 - 2.5 MB/s on my idle core2 single socket quad core. With this optimization I am seeing 2.9 - 3.0 MB/s on the same machine. Maybe 2% slower than before I introduced fops_read_lock. The common case is that there is only one thread and so the fget_light optimization applies and f_count remains at 1. Which implies that there is only a single process performing operations through the file descriptor. In that case because there is no possible contention it is possible safely skip the atomic operations, gaining all of the benefits of rcu without requiring a per cpu variable. Signed-off-by: Eric W. Biederman --- fs/file_table.c | 18 ++++++++++++++---- 1 files changed, 14 insertions(+), 4 deletions(-) diff --git a/fs/file_table.c b/fs/file_table.c index d216557..634d44c 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -495,15 +495,25 @@ void file_kill(struct file *file) int fops_read_lock(struct file *file) { int revoked = (file->f_mode & FMODE_REVOKED); - if (likely(!revoked)) - atomic_long_inc(&file->f_use); + if (likely(!revoked)) { + if (likely(atomic_long_read(&file->f_count) == 1)) + atomic_long_set(&file->f_use, + atomic_long_read(&file->f_use) + 1); + else + atomic_long_inc(&file->f_use); + } return revoked; } void fops_read_unlock(struct file *file, int revoked) { - if (likely(!revoked)) - atomic_long_dec(&file->f_use); + if (likely(!revoked)) { + if (likely(atomic_long_read(&file->f_count) == 1)) + atomic_long_set(&file->f_use, + atomic_long_read(&file->f_use) - 1); + else + atomic_long_dec(&file->f_use); + } } int fs_may_remount_ro(struct super_block *sb) -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id EC4F85F0001 for ; Sat, 11 Apr 2009 08:13:15 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:13:22 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: revoked_file_ops is a set of file operations designed to be used when a files backing store has been removed. revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is always ready for I/O and return -EIO from all other operations. This is designed to allow userspace to gracefully file descriptors that enter this unusable state. Signed-off-by: Eric W. Biederman --- fs/Makefile | 2 +- fs/revoked_file.c | 181 ++++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/fs.h | 2 + 3 files changed, 184 insertions(+), 1 deletions(-) create mode 100644 fs/revoked_file.c diff --git a/fs/Makefile b/fs/Makefile index af6d047..7787ddd 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -11,7 +11,7 @@ obj-y := open.o read_write.o file_table.o super.o \ attr.o bad_inode.o file.o filesystems.o namespace.o \ seq_file.o xattr.o libfs.o fs-writeback.o \ pnode.o drop_caches.o splice.o sync.o utimes.o \ - stack.o fs_struct.o + stack.o fs_struct.o revoked_file.o ifeq ($(CONFIG_BLOCK),y) obj-y += buffer.o bio.o block_dev.o direct-io.o mpage.o ioprio.o diff --git a/fs/revoked_file.c b/fs/revoked_file.c new file mode 100644 index 0000000..9936693 --- /dev/null +++ b/fs/revoked_file.c @@ -0,0 +1,181 @@ +/* + * linux/fs/revoked_file.c + * + * Copyright (C) 1997, Stephen Tweedie + * + * Provide stub functions for unreadable inodes + * + * Fabian Frederick : August 2003 - All file operations assigned to EIO + * + * Eric Biederman : 8 April 2008 - Derivied from bad_inode.c + */ + +#include +#include +#include +#include +#include +#include + +static loff_t revoked_file_llseek(struct file *file, loff_t offset, int origin) +{ + return -EIO; +} + +static ssize_t revoked_file_read(struct file *filp, char __user *buf, + size_t size, loff_t *ppos) +{ + return 0; +} + +static ssize_t revoked_file_write(struct file *filp, const char __user *buf, + size_t siz, loff_t *ppos) +{ + return -EIO; +} + +static ssize_t revoked_file_aio_read(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) +{ + return 0; +} + +static ssize_t revoked_file_aio_write(struct kiocb *iocb, const struct iovec *iov, + unsigned long nr_segs, loff_t pos) +{ + return -EIO; +} + +static int revoked_file_readdir(struct file *filp, void *dirent, filldir_t filldir) +{ + return -EIO; +} + +static unsigned int revoked_file_poll(struct file *filp, poll_table *wait) +{ + return POLLIN | POLLOUT | POLLERR | POLLRDNORM | POLLWRNORM; +} + +static int revoked_file_ioctl (struct inode *inode, struct file *filp, + unsigned int cmd, unsigned long arg) +{ + return -EIO; +} + +static long revoked_file_unlocked_ioctl(struct file *file, unsigned cmd, + unsigned long arg) +{ + return -EIO; +} + +static long revoked_file_compat_ioctl(struct file *file, unsigned int cmd, + unsigned long arg) +{ + return -EIO; +} + +static int revoked_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + return -EIO; +} + +static int revoked_file_open(struct inode *inode, struct file *filp) +{ + return -EIO; +} + +static int revoked_file_flush(struct file *file, fl_owner_t id) +{ + return 0; +} + +static int revoked_file_release(struct inode *inode, struct file *filp) +{ + return 0; +} + +static int revoked_file_fsync(struct file *file, struct dentry *dentry, + int datasync) +{ + return -EIO; +} + +static int revoked_file_aio_fsync(struct kiocb *iocb, int datasync) +{ + return -EIO; +} + +static int revoked_file_fasync(int fd, struct file *filp, int on) +{ + return -EIO; +} + +static int revoked_file_lock(struct file *file, int cmd, struct file_lock *fl) +{ + return -EIO; +} + +static ssize_t revoked_file_sendpage(struct file *file, struct page *page, + int off, size_t len, loff_t *pos, int more) +{ + return -EIO; +} + +static unsigned long revoked_file_get_unmapped_area(struct file *file, + unsigned long addr, unsigned long len, + unsigned long pgoff, unsigned long flags) +{ + return -EIO; +} + +static int revoked_file_check_flags(int flags) +{ + return -EIO; +} + +static int revoked_file_flock(struct file *filp, int cmd, struct file_lock *fl) +{ + return -EIO; +} + +static ssize_t revoked_file_splice_write(struct pipe_inode_info *pipe, + struct file *out, loff_t *ppos, size_t len, + unsigned int flags) +{ + return -EIO; +} + +static ssize_t revoked_file_splice_read(struct file *in, loff_t *ppos, + struct pipe_inode_info *pipe, size_t len, + unsigned int flags) +{ + return -EIO; +} + +const struct file_operations revoked_file_ops = +{ + .llseek = revoked_file_llseek, + .read = revoked_file_read, + .write = revoked_file_write, + .aio_read = revoked_file_aio_read, + .aio_write = revoked_file_aio_write, + .readdir = revoked_file_readdir, + .poll = revoked_file_poll, + .ioctl = revoked_file_ioctl, + .unlocked_ioctl = revoked_file_unlocked_ioctl, + .compat_ioctl = revoked_file_compat_ioctl, + .mmap = revoked_file_mmap, + .open = revoked_file_open, + .flush = revoked_file_flush, + .release = revoked_file_release, + .fsync = revoked_file_fsync, + .aio_fsync = revoked_file_aio_fsync, + .fasync = revoked_file_fasync, + .lock = revoked_file_lock, + .sendpage = revoked_file_sendpage, + .get_unmapped_area = revoked_file_get_unmapped_area, + .check_flags = revoked_file_check_flags, + .flock = revoked_file_flock, + .splice_write = revoked_file_splice_write, + .splice_read = revoked_file_splice_read, +}; diff --git a/include/linux/fs.h b/include/linux/fs.h index a82a2ea..2fb0871 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -896,6 +896,8 @@ extern int fops_substitute(struct file *file, const struct file_operations *f_op extern void inode_fops_substitute(struct inode *inode, const struct file_operations *f_op, struct vm_operations_struct *vm_ops); +extern const struct file_operations revoked_file_ops; + extern struct mutex files_lock; #define file_list_lock() mutex_lock(&files_lock); #define file_list_unlock() mutex_unlock(&files_lock); -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id E96835F0001 for ; Sat, 11 Apr 2009 08:14:41 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 05:14:50 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [RFC][PATCH 9/9] proc: Use the generic vfs revoke facility that now exists. Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Doing this the code becomes much simpler and more robust. Signed-off-by: Eric W. Biederman --- fs/proc/generic.c | 100 +++++---------- fs/proc/inode.c | 339 +---------------------------------------------- fs/proc/internal.h | 2 + fs/proc/root.c | 2 +- include/linux/proc_fs.h | 4 - 5 files changed, 36 insertions(+), 411 deletions(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index fa678ab..5453114 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -20,6 +20,8 @@ #include #include #include +#include +#include #include #include "internal.h" @@ -37,7 +39,7 @@ static int proc_match(int len, const char *name, struct proc_dir_entry *de) #define PROC_BLOCK_SIZE (PAGE_SIZE - 1024) static ssize_t -__proc_file_read(struct file *file, char __user *buf, size_t nbytes, +proc_file_read(struct file *file, char __user *buf, size_t nbytes, loff_t *ppos) { struct inode * inode = file->f_path.dentry->d_inode; @@ -183,27 +185,6 @@ __proc_file_read(struct file *file, char __user *buf, size_t nbytes, } static ssize_t -proc_file_read(struct file *file, char __user *buf, size_t nbytes, - loff_t *ppos) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - ssize_t rv = -EIO; - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - spin_unlock(&pde->pde_unload_lock); - - rv = __proc_file_read(file, buf, nbytes, ppos); - - pde_users_dec(pde); - return rv; -} - -static ssize_t proc_file_write(struct file *file, const char __user *buffer, size_t count, loff_t *ppos) { @@ -211,17 +192,8 @@ proc_file_write(struct file *file, const char __user *buffer, ssize_t rv = -EIO; if (pde->write_proc) { - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - spin_unlock(&pde->pde_unload_lock); - /* FIXME: does this routine need ppos? probably... */ rv = pde->write_proc(file, buffer, count, pde->data); - pde_users_dec(pde); } return rv; } @@ -630,10 +602,6 @@ static struct proc_dir_entry *__proc_create(struct proc_dir_entry **parent, ent->mode = mode; ent->nlink = nlink; atomic_set(&ent->count, 1); - ent->pde_users = 0; - spin_lock_init(&ent->pde_unload_lock); - ent->pde_unload_completion = NULL; - INIT_LIST_HEAD(&ent->pde_openers); out: return ent; } @@ -777,6 +745,33 @@ void free_proc_entry(struct proc_dir_entry *de) kfree(de); } +static struct inode *get_pde_inode(struct proc_dir_entry *de) +{ + struct inode *inode = NULL; + struct super_block *sb; + + spin_lock(&sb_lock); + list_for_each_entry(sb, &proc_fs_type.fs_supers, s_instances) { + inode = ilookup(sb, de->low_ino); + if (inode && inode->i_fop != &revoked_file_ops) + break; + iput(inode); + inode = NULL; + } + spin_unlock(&sb_lock); + return inode; +} + +static void proc_revoke_pde(struct proc_dir_entry *de) +{ + struct inode *inode; + + while ((inode = get_pde_inode(de))) { + inode_fops_substitute(inode, &revoked_file_ops, &revoked_vm_ops); + iput(inode); + } +} + /* * Remove a /proc entry and free it if it's not currently in use. */ @@ -804,40 +799,7 @@ void remove_proc_entry(const char *name, struct proc_dir_entry *parent) if (!de) return; - spin_lock(&de->pde_unload_lock); - /* - * Stop accepting new callers into module. If you're - * dynamically allocating ->proc_fops, save a pointer somewhere. - */ - de->proc_fops = NULL; - /* Wait until all existing callers into module are done. */ - if (de->pde_users > 0) { - DECLARE_COMPLETION_ONSTACK(c); - - if (!de->pde_unload_completion) - de->pde_unload_completion = &c; - - spin_unlock(&de->pde_unload_lock); - - wait_for_completion(de->pde_unload_completion); - - goto continue_removing; - } - spin_unlock(&de->pde_unload_lock); - -continue_removing: - spin_lock(&de->pde_unload_lock); - while (!list_empty(&de->pde_openers)) { - struct pde_opener *pdeo; - - pdeo = list_first_entry(&de->pde_openers, struct pde_opener, lh); - list_del(&pdeo->lh); - spin_unlock(&de->pde_unload_lock); - pdeo->release(pdeo->inode, pdeo->file); - kfree(pdeo); - spin_lock(&de->pde_unload_lock); - } - spin_unlock(&de->pde_unload_lock); + proc_revoke_pde(de); if (S_ISDIR(de->mode)) parent->nlink--; diff --git a/fs/proc/inode.c b/fs/proc/inode.c index d78ade3..aa7e629 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -117,330 +117,6 @@ static const struct super_operations proc_sops = { .statfs = simple_statfs, }; -static void __pde_users_dec(struct proc_dir_entry *pde) -{ - pde->pde_users--; - if (pde->pde_unload_completion && pde->pde_users == 0) - complete(pde->pde_unload_completion); -} - -void pde_users_dec(struct proc_dir_entry *pde) -{ - spin_lock(&pde->pde_unload_lock); - __pde_users_dec(pde); - spin_unlock(&pde->pde_unload_lock); -} - -static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - loff_t rv = -EINVAL; - loff_t (*llseek)(struct file *, loff_t, int); - - spin_lock(&pde->pde_unload_lock); - /* - * remove_proc_entry() is going to delete PDE (as part of module - * cleanup sequence). No new callers into module allowed. - */ - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - /* - * Bump refcount so that remove_proc_entry will wail for ->llseek to - * complete. - */ - pde->pde_users++; - /* - * Save function pointer under lock, to protect against ->proc_fops - * NULL'ifying right after ->pde_unload_lock is dropped. - */ - llseek = pde->proc_fops->llseek; - spin_unlock(&pde->pde_unload_lock); - - if (!llseek) - llseek = default_llseek; - rv = llseek(file, offset, whence); - - pde_users_dec(pde); - return rv; -} - -static ssize_t proc_reg_read(struct file *file, char __user *buf, size_t count, loff_t *ppos) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - ssize_t rv = -EIO; - ssize_t (*read)(struct file *, char __user *, size_t, loff_t *); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - read = pde->proc_fops->read; - spin_unlock(&pde->pde_unload_lock); - - if (read) - rv = read(file, buf, count, ppos); - - pde_users_dec(pde); - return rv; -} - -static ssize_t proc_reg_write(struct file *file, const char __user *buf, size_t count, loff_t *ppos) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - ssize_t rv = -EIO; - ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - write = pde->proc_fops->write; - spin_unlock(&pde->pde_unload_lock); - - if (write) - rv = write(file, buf, count, ppos); - - pde_users_dec(pde); - return rv; -} - -static unsigned int proc_reg_poll(struct file *file, struct poll_table_struct *pts) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - unsigned int rv = DEFAULT_POLLMASK; - unsigned int (*poll)(struct file *, struct poll_table_struct *); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - poll = pde->proc_fops->poll; - spin_unlock(&pde->pde_unload_lock); - - if (poll) - rv = poll(file, pts); - - pde_users_dec(pde); - return rv; -} - -static long proc_reg_unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - long rv = -ENOTTY; - long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long); - int (*ioctl)(struct inode *, struct file *, unsigned int, unsigned long); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - unlocked_ioctl = pde->proc_fops->unlocked_ioctl; - ioctl = pde->proc_fops->ioctl; - spin_unlock(&pde->pde_unload_lock); - - if (unlocked_ioctl) { - rv = unlocked_ioctl(file, cmd, arg); - if (rv == -ENOIOCTLCMD) - rv = -EINVAL; - } else if (ioctl) { - lock_kernel(); - rv = ioctl(file->f_path.dentry->d_inode, file, cmd, arg); - unlock_kernel(); - } - - pde_users_dec(pde); - return rv; -} - -#ifdef CONFIG_COMPAT -static long proc_reg_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - long rv = -ENOTTY; - long (*compat_ioctl)(struct file *, unsigned int, unsigned long); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - compat_ioctl = pde->proc_fops->compat_ioctl; - spin_unlock(&pde->pde_unload_lock); - - if (compat_ioctl) - rv = compat_ioctl(file, cmd, arg); - - pde_users_dec(pde); - return rv; -} -#endif - -static int proc_reg_mmap(struct file *file, struct vm_area_struct *vma) -{ - struct proc_dir_entry *pde = PDE(file->f_path.dentry->d_inode); - int rv = -EIO; - int (*mmap)(struct file *, struct vm_area_struct *); - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - mmap = pde->proc_fops->mmap; - spin_unlock(&pde->pde_unload_lock); - - if (mmap) - rv = mmap(file, vma); - - pde_users_dec(pde); - return rv; -} - -static int proc_reg_open(struct inode *inode, struct file *file) -{ - struct proc_dir_entry *pde = PDE(inode); - int rv = 0; - int (*open)(struct inode *, struct file *); - int (*release)(struct inode *, struct file *); - struct pde_opener *pdeo; - - /* - * What for, you ask? Well, we can have open, rmmod, remove_proc_entry - * sequence. ->release won't be called because ->proc_fops will be - * cleared. Depending on complexity of ->release, consequences vary. - * - * We can't wait for mercy when close will be done for real, it's - * deadlockable: rmmod foo release - * by hand in remove_proc_entry(). For this, save opener's credentials - * for later. - */ - pdeo = kmalloc(sizeof(struct pde_opener), GFP_KERNEL); - if (!pdeo) - return -ENOMEM; - - spin_lock(&pde->pde_unload_lock); - if (!pde->proc_fops) { - spin_unlock(&pde->pde_unload_lock); - kfree(pdeo); - return -EINVAL; - } - pde->pde_users++; - open = pde->proc_fops->open; - release = pde->proc_fops->release; - spin_unlock(&pde->pde_unload_lock); - - if (open) - rv = open(inode, file); - - spin_lock(&pde->pde_unload_lock); - if (rv == 0 && release) { - /* To know what to release. */ - pdeo->inode = inode; - pdeo->file = file; - /* Strictly for "too late" ->release in proc_reg_release(). */ - pdeo->release = release; - list_add(&pdeo->lh, &pde->pde_openers); - } else - kfree(pdeo); - __pde_users_dec(pde); - spin_unlock(&pde->pde_unload_lock); - return rv; -} - -static struct pde_opener *find_pde_opener(struct proc_dir_entry *pde, - struct inode *inode, struct file *file) -{ - struct pde_opener *pdeo; - - list_for_each_entry(pdeo, &pde->pde_openers, lh) { - if (pdeo->inode == inode && pdeo->file == file) - return pdeo; - } - return NULL; -} - -static int proc_reg_release(struct inode *inode, struct file *file) -{ - struct proc_dir_entry *pde = PDE(inode); - int rv = 0; - int (*release)(struct inode *, struct file *); - struct pde_opener *pdeo; - - spin_lock(&pde->pde_unload_lock); - pdeo = find_pde_opener(pde, inode, file); - if (!pde->proc_fops) { - /* - * Can't simply exit, __fput() will think that everything is OK, - * and move on to freeing struct file. remove_proc_entry() will - * find slacker in opener's list and will try to do non-trivial - * things with struct file. Therefore, remove opener from list. - * - * But if opener is removed from list, who will ->release it? - */ - if (pdeo) { - list_del(&pdeo->lh); - spin_unlock(&pde->pde_unload_lock); - rv = pdeo->release(inode, file); - kfree(pdeo); - } else - spin_unlock(&pde->pde_unload_lock); - return rv; - } - pde->pde_users++; - release = pde->proc_fops->release; - if (pdeo) { - list_del(&pdeo->lh); - kfree(pdeo); - } - spin_unlock(&pde->pde_unload_lock); - - if (release) - rv = release(inode, file); - - pde_users_dec(pde); - return rv; -} - -static const struct file_operations proc_reg_file_ops = { - .llseek = proc_reg_llseek, - .read = proc_reg_read, - .write = proc_reg_write, - .poll = proc_reg_poll, - .unlocked_ioctl = proc_reg_unlocked_ioctl, -#ifdef CONFIG_COMPAT - .compat_ioctl = proc_reg_compat_ioctl, -#endif - .mmap = proc_reg_mmap, - .open = proc_reg_open, - .release = proc_reg_release, -}; - -#ifdef CONFIG_COMPAT -static const struct file_operations proc_reg_file_ops_no_compat = { - .llseek = proc_reg_llseek, - .read = proc_reg_read, - .write = proc_reg_write, - .poll = proc_reg_poll, - .unlocked_ioctl = proc_reg_unlocked_ioctl, - .mmap = proc_reg_mmap, - .open = proc_reg_open, - .release = proc_reg_release, -}; -#endif - struct inode *proc_get_inode(struct super_block *sb, unsigned int ino, struct proc_dir_entry *de) { @@ -465,19 +141,8 @@ struct inode *proc_get_inode(struct super_block *sb, unsigned int ino, inode->i_nlink = de->nlink; if (de->proc_iops) inode->i_op = de->proc_iops; - if (de->proc_fops) { - if (S_ISREG(inode->i_mode)) { -#ifdef CONFIG_COMPAT - if (!de->proc_fops->compat_ioctl) - inode->i_fop = - &proc_reg_file_ops_no_compat; - else -#endif - inode->i_fop = &proc_reg_file_ops; - } else { - inode->i_fop = de->proc_fops; - } - } + if (de->proc_fops) + inode->i_fop = de->proc_fops; unlock_new_inode(inode); } else de_put(de); diff --git a/fs/proc/internal.h b/fs/proc/internal.h index f6db961..ea658ac 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -92,3 +92,5 @@ struct pde_opener { struct list_head lh; }; void pde_users_dec(struct proc_dir_entry *pde); + +extern struct file_system_type proc_fs_type; diff --git a/fs/proc/root.c b/fs/proc/root.c index 1e15a2b..ba7a99d 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -96,7 +96,7 @@ static void proc_kill_sb(struct super_block *sb) put_pid_ns(ns); } -static struct file_system_type proc_fs_type = { +struct file_system_type proc_fs_type = { .name = "proc", .get_sb = proc_get_sb, .kill_sb = proc_kill_sb, diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h index fbfa3d4..2baeb37 100644 --- a/include/linux/proc_fs.h +++ b/include/linux/proc_fs.h @@ -72,10 +72,6 @@ struct proc_dir_entry { read_proc_t *read_proc; write_proc_t *write_proc; atomic_t count; /* use count */ - int pde_users; /* number of callers into module in progress */ - spinlock_t pde_unload_lock; /* proc_fops checks and pde_users bumps */ - struct completion *pde_unload_completion; - struct list_head pde_openers; /* who did ->open, but not ->release */ }; struct kcore_list { -- 1.6.1.2.350.g88cc -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 781DA5F0001 for ; Sat, 11 Apr 2009 11:58:08 -0400 (EDT) Date: Sat, 11 Apr 2009 16:58:52 +0100 From: Al Viro Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Message-ID: <20090411155852.GV26366@ZenIV.linux.org.uk> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: On Sat, Apr 11, 2009 at 05:01:29AM -0700, Eric W. Biederman wrote: > A couple of weeks ago I found myself looking at the uio, seeing that > it does not support pci hot-unplug, and thinking "Great yet another > implementation of hotunplug logic that needs to be added". > > I decided to see what it would take to add a generic implementation of > the code we have for supporting hot unplugging devices in sysfs, proc, > sysctl, tty_io, and now almost in the tun driver. > > Not long after I touched the tun driver and made it safe to delete the > network device while still holding it's file descriptor open I someone > else touch the code adding a different feature and my careful work > went up in flames. Which brought home another point at the best of it > this is ultimately complex tricky code that subsystems should not need > to worry about. > > What makes this even more interesting is that in the presence of pci > hot-unplug it looks like most subsystems and most devices will have to > deal with the issue one way or another. Ehh... The real mess is in things like "TTY in the middle of random ioctl" and there's another pile that won't be solved on struct file level - individual fs internals ;-/ > This infrastructure could also be used to implement sys_revoke and > when I could not think of a better name I have drawn on that. Yes, that's more or less obvious direction for revoke(), but there's a problem with locking overhead that always scared me away from that. Maybe I'm wrong, though... In any case, you want to carefully check the overhead and cacheline bouncing implications for things like pipes and sockets. Hell knows, maybe it'll work out, but... Anyway, the really nasty part of revoke() (and true SAK, which is obviously related) is handling of deep-inside-the-driver ioctls. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 0012A5F0001 for ; Sat, 11 Apr 2009 12:48:57 -0400 (EDT) References: <20090411155852.GV26366@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 09:49:36 -0700 In-Reply-To: <20090411155852.GV26366@ZenIV.linux.org.uk> (Al Viro's message of "Sat\, 11 Apr 2009 16\:58\:52 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Sender: owner-linux-mm@kvack.org To: Al Viro Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Al Viro writes: > On Sat, Apr 11, 2009 at 05:01:29AM -0700, Eric W. Biederman wrote: > >> A couple of weeks ago I found myself looking at the uio, seeing that >> it does not support pci hot-unplug, and thinking "Great yet another >> implementation of hotunplug logic that needs to be added". >> >> I decided to see what it would take to add a generic implementation of >> the code we have for supporting hot unplugging devices in sysfs, proc, >> sysctl, tty_io, and now almost in the tun driver. >> >> Not long after I touched the tun driver and made it safe to delete the >> network device while still holding it's file descriptor open I someone >> else touch the code adding a different feature and my careful work >> went up in flames. Which brought home another point at the best of it >> this is ultimately complex tricky code that subsystems should not need >> to worry about. >> >> What makes this even more interesting is that in the presence of pci >> hot-unplug it looks like most subsystems and most devices will have to >> deal with the issue one way or another. > > Ehh... The real mess is in things like "TTY in the middle of random > ioctl" and there's another pile that won't be solved on struct file > level - individual fs internals ;-/ I haven't tackled code with a noticeable number of ioctls yet. But if they are anything like what I have seen so far, a ref count to see that you are in the still executing a function (so you don't pull the rug out) from under it, and an additional method to say stop sleeping and return should be sufficient. >> This infrastructure could also be used to implement sys_revoke and >> when I could not think of a better name I have drawn on that. > > Yes, that's more or less obvious direction for revoke(), but there's a > problem with locking overhead that always scared me away from that. > Maybe I'm wrong, though... In any case, you want to carefully check > the overhead and cacheline bouncing implications for things like pipes > and sockets. Hell knows, maybe it'll work out, but... I took a careful look and I can't claim perfection at this stage but I don't think there are any significant performance impacts from my code. Further I am confident that if someone finds some performance issues I will be able to understand and address them without a redesign. While working on this I took a good hard look at the overhead I have added to single byte reads and writes (operations that are dominated by any possible overhead I am adding) and currently I am within 2% of the case without my refcounting/locking. I would be interested in anyone running micro benchmarks against my patches and giving me feedback. The fact that in the common case only one task ever accesses a struct file leaves a lot of room for optimization. > Anyway, the really nasty part of revoke() (and true SAK, which is obviously > related) is handling of deep-inside-the-driver ioctls. I doubt I have solved all of the problems. My goals are more modest than a revoke that works for every possible file in the system. I just want a common implementation of refcounting and blocking unregistration code that can be used to solve the common problem I see in sysfs, sysctl, proc, etc. I completely expect to need to modify the code to take advantage of the infrastructure. Patch 9/9 has an example of that, modifying proc so that it uses the infrastructure I add and removing 400 lines of code. I do think that what I have built once it is in use will make a good foundation for building the rest of revoke. Mostly because I am solving common problems once in a common way. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id DA67D5F0001 for ; Sat, 11 Apr 2009 12:56:07 -0400 (EDT) Date: Sat, 11 Apr 2009 17:56:51 +0100 From: Al Viro Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Message-ID: <20090411165651.GW26366@ZenIV.linux.org.uk> References: <20090411155852.GV26366@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote: > The fact that in the common case only one task ever accesses a struct > file leaves a lot of room for optimization. I'm not at all sure that it's a good assumption; even leaving aside e.g. several tasks sharing stdout/stderr, a bunch of datagrams coming out of several threads over the same socket is quite possible. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 96F665F0001 for ; Sat, 11 Apr 2009 19:56:40 -0400 (EDT) References: <20090411155852.GV26366@ZenIV.linux.org.uk> <20090411165651.GW26366@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sat, 11 Apr 2009 16:57:25 -0700 In-Reply-To: <20090411165651.GW26366@ZenIV.linux.org.uk> (Al Viro's message of "Sat\, 11 Apr 2009 17\:56\:51 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Sender: owner-linux-mm@kvack.org To: Al Viro Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Al Viro writes: > On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote: > >> The fact that in the common case only one task ever accesses a struct >> file leaves a lot of room for optimization. > > I'm not at all sure that it's a good assumption; even leaving aside e.g. > several tasks sharing stdout/stderr, a bunch of datagrams coming out of > several threads over the same socket is quite possible. Maybe not. However those cases are already more expensive today. Somewhere along the way we are already going to get cache line ping pongs if there is real contention, and we are going to see the cost of atomic operations. In which case the extra ref counting I am doing is a little more expensive. And when I say a little more expensive I mean 10-20ns per read/write more expensive. At the same time if the common case really is applications not sharing file descriptors (which seems sane) my current optimization easily keeps the cost to practically nothing. Using the srcu locking would also keep the cost down in the noise because it guarantees non-shared cachelines and no expensive atomic operations. srcu has the downside of requiring per cpu memory which seems wrong to me somehow. However there are hybrid models like what is used in mnt_want_write that are possible to limit the total amount of per cpu memory while still getting the advantages. Beyond that for correctness it looks like a pay me now or pay me later situation. Do we track when we are in the methods for an object generically where we can do the work once, and then concentrate on enhancements. Or do we bog ourselves down using inferior implementations that are replicated in varying ways from subsystem to subsystem, and spend our time fighting the bugs in the subsystems? I have the refcount/locking abstraction wrapped and have only to perform the most basic of optimizations. So if we need to do something more it should be easy. Is performance your only concern with my patches? Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 6B1925F0001 for ; Sun, 12 Apr 2009 14:56:11 -0400 (EDT) Date: Sun, 12 Apr 2009 19:56:59 +0100 From: Jamie Lokier Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Message-ID: <20090412185659.GE4394@shareable.org> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: > revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is > always ready for I/O and return -EIO from all other operations. I think read should return -EIO too. If a program is reading from a /proc file (say), and the thing it's reading suddenly disappears, EOF gives the false impression that it's read to the end of formatted data from that file and it can process the data as if it's complete, which is wrong. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id DEA345F0001 for ; Sun, 12 Apr 2009 16:04:16 -0400 (EDT) References: <20090412185659.GE4394@shareable.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sun, 12 Apr 2009 13:04:09 -0700 In-Reply-To: <20090412185659.GE4394@shareable.org> (Jamie Lokier's message of "Sun\, 12 Apr 2009 19\:56\:59 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Sender: owner-linux-mm@kvack.org To: Jamie Lokier Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Jamie Lokier writes: >> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is >> always ready for I/O and return -EIO from all other operations. > > I think read should return -EIO too. If a program is reading from a > /proc file (say), and the thing it's reading suddenly disappears, EOF > gives the false impression that it's read to the end of formatted data > from that file and it can process the data as if it's complete, which > is wrong. Good point EIO is the current read return value for a removed proc file. For closed pipes, and hung up ttys the read return value is 0, and from my reading that is what bsd returns after a sys_revoke. The reason I have f_op settable is because I never expected complete agreement on the return codes, and because it makes auditing and spotting this kind of thing easier. I guess I should make two variations on revoked_file_ops then. Say eof_file_ops, eio_file_ops. Identical except for their treatment of reads. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id EDC4C5F0001 for ; Sun, 12 Apr 2009 16:21:22 -0400 (EDT) References: <20090411155852.GV26366@ZenIV.linux.org.uk> <20090411165651.GW26366@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sun, 12 Apr 2009 13:21:35 -0700 In-Reply-To: <20090411165651.GW26366@ZenIV.linux.org.uk> (Al Viro's message of "Sat\, 11 Apr 2009 17\:56\:51 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Sender: owner-linux-mm@kvack.org To: Al Viro Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Al Viro writes: > On Sat, Apr 11, 2009 at 09:49:36AM -0700, Eric W. Biederman wrote: > >> The fact that in the common case only one task ever accesses a struct >> file leaves a lot of room for optimization. > > I'm not at all sure that it's a good assumption; even leaving aside e.g. > several tasks sharing stdout/stderr, a bunch of datagrams coming out of > several threads over the same socket is quite possible. I have thought about this a little more and a solution to ensure this is not a problem for code that has not opted in to this new functionality is simple. Require uses that need it to set FMODE_REVOKE. It is no extra code and it keeps the absolute worst case behavior for existing code down an additional branch mispredict. It is worth doing anyway because it cleans up the abstraction and makes it clear where revoke is supported. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 9FB1C5F0001 for ; Sun, 12 Apr 2009 16:30:44 -0400 (EDT) Date: Sun, 12 Apr 2009 21:31:07 +0100 From: Jamie Lokier Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Message-ID: <20090412203107.GH4394@shareable.org> References: <20090412185659.GE4394@shareable.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Eric W. Biederman wrote: > >> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is > >> always ready for I/O and return -EIO from all other operations. > > > > I think read should return -EIO too. If a program is reading from a > > /proc file (say), and the thing it's reading suddenly disappears, EOF > > gives the false impression that it's read to the end of formatted data > > from that file and it can process the data as if it's complete, which > > is wrong. > > Good point EIO is the current read return value for a removed proc file. > > For closed pipes, and hung up ttys the read return value is 0, and from > my reading that is what bsd returns after a sys_revoke. A few suggestions below. Feel free to ignore them on account of the basic revoking functionality being more important :-) I'm not sure a revoked pipe should look like a normally closed one. ECONNRESET? For hung up ttys, I agree. But where's the SIGHUP :-) You probably do want the process using it to die if it's not handling SIGHUP, because terminal-using processes don't always terminate themselves on EOF. For things writing to a pipe or file, SIGPIPE may be appropriate in addition to EIO, to avoid runaway processes. Looks odd I know. For writing to a terminal, SIGHUP again. > The reason I have f_op settable is because I never expected complete > agreement on the return codes, and because it makes auditing and spotting > this kind of thing easier. > > I guess I should make two variations on revoked_file_ops then. Say > eof_file_ops, eio_file_ops. Identical except for their treatment of > reads. Fair enough. It's good to have good defaults. I'm not convinced eof_file_ops is ever a good default. sighup_file_ops and sigpipe_file_ops maybe :-) -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 83CD65F0001 for ; Sun, 12 Apr 2009 16:54:01 -0400 (EDT) References: <20090412185659.GE4394@shareable.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sun, 12 Apr 2009 13:54:45 -0700 In-Reply-To: (Eric W. Biederman's message of "Sun\, 12 Apr 2009 13\:04\:09 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Sender: owner-linux-mm@kvack.org To: Jamie Lokier Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: ebiederm@xmission.com (Eric W. Biederman) writes: > Jamie Lokier writes: > >>> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is >>> always ready for I/O and return -EIO from all other operations. >> >> I think read should return -EIO too. If a program is reading from a >> /proc file (say), and the thing it's reading suddenly disappears, EOF >> gives the false impression that it's read to the end of formatted data >> from that file and it can process the data as if it's complete, which >> is wrong. I just thought about that some more and I am not convinced. In general the current return values from proc after an I/O operation are suspect. seek returns -EINVAL instead of -EIO. poll returns DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced that the existing proc return values on error are correct, and they are recent additions so the historical precedent is not especially large. EOF does give the impression that you have read all of the data from the /proc file, and that is in fact the case. There is no more data coming from that proc file. That the data is stale is well know. That the data is not atomic, anything that spans more than a single read is not atomic. So I don't see what returning EIO adds to the equation. Perhaps that your fragile user space string parser may break? EOF gives a clear indication the application should stop reading the data, because there is no more. EIO only says that the was a problem. I don't know of anything that depends on the rmmod behavior either way. But if we can get away with it I would like to use something that is generally useful instead of something that only makes sense in the context of proc. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 08C415F0001 for ; Sun, 12 Apr 2009 17:03:00 -0400 (EDT) Date: Sun, 12 Apr 2009 22:02:56 +0100 From: Jamie Lokier Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Message-ID: <20090412210256.GK4394@shareable.org> References: <20090412185659.GE4394@shareable.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Eric W. Biederman wrote: > I just thought about that some more and I am not convinced. > > In general the current return values from proc after an I/O operation > are suspect. seek returns -EINVAL instead of -EIO. poll returns > DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced > that the existing proc return values on error are correct, and they > are recent additions so the historical precedent is not especially > large. > > EOF does give the impression that you have read all of the data from > the /proc file, and that is in fact the case. There is no more > data coming from that proc file. > > That the data is stale is well know. > > That the data is not atomic, anything that spans more than a single > read is not atomic. > > So I don't see what returning EIO adds to the equation. Perhaps > that your fragile user space string parser may break? > > EOF gives a clear indication the application should stop reading > the data, because there is no more. > > EIO only says that the was a problem. > > I don't know of anything that depends on the rmmod behavior either > way. But if we can get away with it I would like to use something > that is generally useful instead of something that only makes > sense in the context of proc. I'm not thinking of proc, really. More thinking of applications: EOF effectively means "whole file read without error - now do the next thing". If a filesystem file is revoked (umount -f), you definitely want to stop that Makefile which is copying a file from the unmounted filesystem to a target file. Otherwise you get inconsistent states which can only occur as a result of this umount -f, something Makefiles should never have to care about. rmmod behaviour is not something any app should see normally. Unexpected behaviour when files are oddly truncated (despite never being written that way) is not "fragile user space". So whatever it returns, it should be some error code, imho. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id EFB9F5F0001 for ; Sun, 12 Apr 2009 17:53:11 -0400 (EDT) References: <20090412185659.GE4394@shareable.org> <20090412203107.GH4394@shareable.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sun, 12 Apr 2009 14:53:51 -0700 In-Reply-To: <20090412203107.GH4394@shareable.org> (Jamie Lokier's message of "Sun\, 12 Apr 2009 21\:31\:07 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Sender: owner-linux-mm@kvack.org To: Jamie Lokier Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Jamie Lokier writes: > Eric W. Biederman wrote: >> >> revoked_file_ops return 0 from reads (aka EOF). Tell poll the file is >> >> always ready for I/O and return -EIO from all other operations. >> > >> > I think read should return -EIO too. If a program is reading from a >> > /proc file (say), and the thing it's reading suddenly disappears, EOF >> > gives the false impression that it's read to the end of formatted data >> > from that file and it can process the data as if it's complete, which >> > is wrong. >> >> Good point EIO is the current read return value for a removed proc file. >> >> For closed pipes, and hung up ttys the read return value is 0, and from >> my reading that is what bsd returns after a sys_revoke. > > A few suggestions below. Feel free to ignore them on account of the > basic revoking functionality being more important :-) I think I will. This seems to be the part of the code that is easily approachable and it is going to be easy to have different opinions on, and there is no one right answer. For now I'm just going to pick my best understanding of what BSD did. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 2D8B95F0001 for ; Sun, 12 Apr 2009 19:06:38 -0400 (EDT) References: <20090412185659.GE4394@shareable.org> <20090412210256.GK4394@shareable.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Sun, 12 Apr 2009 16:06:34 -0700 In-Reply-To: <20090412210256.GK4394@shareable.org> (Jamie Lokier's message of "Sun\, 12 Apr 2009 22\:02\:56 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 8/9] vfs: Implement generic revoked file operations Sender: owner-linux-mm@kvack.org To: Jamie Lokier Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Jamie Lokier writes: > Eric W. Biederman wrote: >> I just thought about that some more and I am not convinced. >> >> In general the current return values from proc after an I/O operation >> are suspect. seek returns -EINVAL instead of -EIO. poll returns >> DEFAULT_POLLMASK (which doesn't set POLLERR). So I am not convinced >> that the existing proc return values on error are correct, and they >> are recent additions so the historical precedent is not especially >> large. >> >> EOF does give the impression that you have read all of the data from >> the /proc file, and that is in fact the case. There is no more >> data coming from that proc file. >> >> That the data is stale is well know. >> >> That the data is not atomic, anything that spans more than a single >> read is not atomic. >> >> So I don't see what returning EIO adds to the equation. Perhaps >> that your fragile user space string parser may break? >> >> EOF gives a clear indication the application should stop reading >> the data, because there is no more. >> >> EIO only says that the was a problem. >> >> I don't know of anything that depends on the rmmod behavior either >> way. But if we can get away with it I would like to use something >> that is generally useful instead of something that only makes >> sense in the context of proc. > > I'm not thinking of proc, really. More thinking of applications: EOF > effectively means "whole file read without error - now do the next thing". > > If a filesystem file is revoked (umount -f), you definitely want to > stop that Makefile which is copying a file from the unmounted > filesystem to a target file. Otherwise you get inconsistent states > which can only occur as a result of this umount -f, something > Makefiles should never have to care about. > > rmmod behaviour is not something any app should see normally. > Unexpected behaviour when files are oddly truncated (despite never > being written that way) is not "fragile user space". So whatever it > returns, it should be some error code, imho. Well I just took a look at NetBSD 4.0.1 and it appears they agree with you. Plus I'm starting to feel a lot better about the linux manual pages, as the revoke(2) man pages from the BSDs describe different error codes than the implementation, and they fail to mention revoke appears to work on ordinary files as well. If the file is not a tty EIO is returned from read. opens return ENXIO writes return EIO ioctl returns EBADF close returns 0 Operations that just lookup the vnode simply return EBADF. I don't know if that is perfectly correct for the linux case. EBADF usually means the file descriptor specified isn't open. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 19F0D5F0001 for ; Mon, 13 Apr 2009 23:15:34 -0400 (EDT) Message-ID: <49E4000E.10308@kernel.org> Date: Tue, 14 Apr 2009 12:16:30 +0900 From: Tejun Heo MIME-Version: 1.0 Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support References: In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Hello, Eric. Eric W. Biederman wrote: > A couple of weeks ago I found myself looking at the uio, seeing that > it does not support pci hot-unplug, and thinking "Great yet another > implementation of hotunplug logic that needs to be added". > > I decided to see what it would take to add a generic implementation of > the code we have for supporting hot unplugging devices in sysfs, proc, > sysctl, tty_io, and now almost in the tun driver. > > Not long after I touched the tun driver and made it safe to delete the > network device while still holding it's file descriptor open I someone > else touch the code adding a different feature and my careful work > went up in flames. Which brought home another point at the best of it > this is ultimately complex tricky code that subsystems should not need > to worry about. I like the way it's headed. I'm trying to add similar 'revoke' or 'sever' mechanism at block and char device layers so that low level drivers don't have to worry about object lifetimes and so on. Doing it at the file layer makes sense and can probably replace whatever mechanism at the chardev. The biggest obstacle was the extra in-use reference count overhead. I thought it could be solved by implementing generic percpu reference count similar to the one used for module reference counting. Hot path overhead could be reduced to local_t cmpxchg (w/o LOCK prefix) on per-cpu variable + one branch, which was pretty good. The problem was that space and access overhead for dynamic per-cpu variables wasn't too good, so I started working on dynamic percpu allocator. The dynamic per-cpu allocator is pretty close to completion. Only several archs need to be converted and it's likely to happen during next few months. The plan after that was 1. add per-cpu local_t accessors (might replace local_t completely) 2. add generic per-cpu reference counter and move module reference counting to it 3. implement block/chardev sever (or revoke) support. I think #3 can be merged with what you're working on. What do you think? Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 0C4C25F0001 for ; Tue, 14 Apr 2009 03:38:57 -0400 (EDT) Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support References: <49E4000E.10308@kernel.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 14 Apr 2009 00:39:25 -0700 In-Reply-To: <49E4000E.10308@kernel.org> (Tejun Heo's message of "Tue\, 14 Apr 2009 12\:16\:30 +0900") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Tejun Heo Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Tejun Heo writes: > Hello, Eric. > > Eric W. Biederman wrote: >> A couple of weeks ago I found myself looking at the uio, seeing that >> it does not support pci hot-unplug, and thinking "Great yet another >> implementation of hotunplug logic that needs to be added". >> >> I decided to see what it would take to add a generic implementation of >> the code we have for supporting hot unplugging devices in sysfs, proc, >> sysctl, tty_io, and now almost in the tun driver. >> >> Not long after I touched the tun driver and made it safe to delete the >> network device while still holding it's file descriptor open I someone >> else touch the code adding a different feature and my careful work >> went up in flames. Which brought home another point at the best of it >> this is ultimately complex tricky code that subsystems should not need >> to worry about. > > I like the way it's headed. I'm trying to add similar 'revoke' or > 'sever' mechanism at block and char device layers so that low level > drivers don't have to worry about object lifetimes and so on. Doing > it at the file layer makes sense and can probably replace whatever > mechanism at the chardev. > > The biggest obstacle was the extra in-use reference count overhead. I > thought it could be solved by implementing generic percpu reference > count similar to the one used for module reference counting. Hot path > overhead could be reduced to local_t cmpxchg (w/o LOCK prefix) on > per-cpu variable + one branch, which was pretty good. The problem was > that space and access overhead for dynamic per-cpu variables wasn't > too good, so I started working on dynamic percpu allocator. > > The dynamic per-cpu allocator is pretty close to completion. Only > several archs need to be converted and it's likely to happen during > next few months. The plan after that was 1. add per-cpu local_t > accessors (might replace local_t completely) 2. add generic per-cpu > reference counter and move module reference counting to it > 3. implement block/chardev sever (or revoke) support. > > I think #3 can be merged with what you're working on. What do you > think? Sounds reasonable. Do you know of a case where we actually have multiple tasks accessing a file simultaneously? I just instrumented up my patch an so far the only case I have found are multiple processes closing the same file. Some weird part of bash forking extra processes. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 50C0E5F0001 for ; Tue, 14 Apr 2009 03:44:59 -0400 (EDT) Message-ID: <49E43F1D.3070400@kernel.org> Date: Tue, 14 Apr 2009 16:45:33 +0900 From: Tejun Heo MIME-Version: 1.0 Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support References: <49E4000E.10308@kernel.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Eric W. Biederman wrote: > Do you know of a case where we actually have multiple tasks accessing > a file simultaneously? I don't have anything at hand but multithread/process server accepting on the same socket comes to mind. I don't think it would be a very rare thing. If you confine the scope to character devices or sysfs, it could be quite rare tho. > I just instrumented up my patch an so far the only case I have found > are multiple processes closing the same file. Some weird part of > bash forking extra processes. Hmmm... Thanks. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with SMTP id 81D3D5F0001 for ; Tue, 14 Apr 2009 04:27:40 -0400 (EDT) References: <49E4000E.10308@kernel.org> <49E43F1D.3070400@kernel.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 14 Apr 2009 01:27:58 -0700 In-Reply-To: <49E43F1D.3070400@kernel.org> (Tejun Heo's message of "Tue\, 14 Apr 2009 16\:45\:33 +0900") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Sender: owner-linux-mm@kvack.org To: Tejun Heo Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Tejun Heo writes: > Eric W. Biederman wrote: >> Do you know of a case where we actually have multiple tasks accessing >> a file simultaneously? > > I don't have anything at hand but multithread/process server accepting > on the same socket comes to mind. I don't think it would be a very > rare thing. If you confine the scope to character devices or sysfs, > it could be quite rare tho. Yes. I think I can safely exclude sockets, and not bother with reference counting them. The only strong evidence I have that multi-threading on a single file descriptor is likely to be common is that we have pread and pwrite syscalls. At the same time the number of races we have in struct file if it is accessed by multiple threads at the same time, suggests that at least for cases where you have an offset it doesn't happen often. I cringe when I see per cpu counters for something like files that we are likely to have a lot of. I keep imagining a quadratic explosion in data size. In practice we are likely to have a small cpu count <= 8-16 cpus so it is likely ok. Especially if we are only allocating 8 bytes per cpu per file. I guess in total that is at most 128K per file. 8bytes*16k cpus. With the default system file-max on my systems 203871 to 705863, it looks like we would max out at between 1M and 5M per cpu. Still a lot but survivable. Somewhere it all falls down, but only if you max out a very rare very large machine, and that seems to be case with just about everything. Which all leads me to say that if we can avoid per cpu memory and not impact performance I want to do that. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 66C3A5F0001 for ; Tue, 14 Apr 2009 04:49:18 -0400 (EDT) Message-ID: <49E44E35.7050504@kernel.org> Date: Tue, 14 Apr 2009 17:49:57 +0900 From: Tejun Heo MIME-Version: 1.0 Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support References: <49E4000E.10308@kernel.org> <49E43F1D.3070400@kernel.org> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Hello, Eric. Eric W. Biederman wrote: > Tejun Heo writes: >> Eric W. Biederman wrote: >>> Do you know of a case where we actually have multiple tasks accessing >>> a file simultaneously? >> I don't have anything at hand but multithread/process server accepting >> on the same socket comes to mind. I don't think it would be a very >> rare thing. If you confine the scope to character devices or sysfs, >> it could be quite rare tho. > > Yes. I think I can safely exclude sockets, and not bother with > reference counting them. > > The only strong evidence I have that multi-threading on a single file > descriptor is likely to be common is that we have pread and pwrite > syscalls. At the same time the number of races we have in struct file > if it is accessed by multiple threads at the same time, suggests > that at least for cases where you have an offset it doesn't happen often. > > I cringe when I see per cpu counters for something like files that we > are likely to have a lot of. I keep imagining a quadratic explosion > in data size. In practice we are likely to have a small cpu count <= > 8-16 cpus so it is likely ok. Especially if we are only allocating 8 > bytes per cpu per file. I guess in total that is at most 128K per file. > 8bytes*16k cpus. With the default system file-max on my systems 203871 > to 705863, it looks like we would max out at between 1M and 5M per cpu. > Still a lot but survivable. Not only that percpu refcnt is quite expensive to shut down too. For modules and devices, it doesn't really matter but using it for files on FS would be pretty scary. > Somewhere it all falls down, but only if you max out a very rare > very large machine, and that seems to be case with just about everything. > > Which all leads me to say that if we can avoid per cpu memory and not impact > performance I want to do that. Yeah, fully agreed there. -- tejun -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 24FDC5F0001 for ; Tue, 14 Apr 2009 11:07:52 -0400 (EDT) Date: Tue, 14 Apr 2009 16:07:45 +0100 From: Jamie Lokier Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support Message-ID: <20090414150745.GC26621@shareable.org> References: <49E4000E.10308@kernel.org> <49E43F1D.3070400@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Tejun Heo , Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Eric W. Biederman wrote: > > I don't have anything at hand but multithread/process server accepting > > on the same socket comes to mind. I don't think it would be a very > > rare thing. If you confine the scope to character devices or sysfs, > > it could be quite rare tho. > > Yes. I think I can safely exclude sockets, and not bother with > reference counting them. Good idea. As well as many processes calling accept(), it's not unusual to have two threads or processes for reading and writing concurrently to TCP sockets, and to have a single UDP socket shared among threads/processes for sendto. > The only strong evidence I have that multi-threading on a single file > descriptor is likely to be common is that we have pread and pwrite > syscalls. At the same time the number of races we have in struct file > if it is accessed by multiple threads at the same time, suggests > that at least for cases where you have an offset it doesn't happen often. Notice the preadv and pwritev syscalls added recently? They were added because QEMU and KVM need them for performance. Those programs have multiple threads doing I/O to the same file concurrently. It's like a poor man's AIO, except it's more reliable than real Linux AIO :-) Databases probably should use concurrent p{read,write}{,v} if they're not using direct I/O and AIO. I'm not sure if the well-known databases do. In the past there have been some poor quality "emulations" of those syscalls prone to races, on Linux and BSD I believe. What are the races you've noticed? -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 3B31E5F0001 for ; Tue, 14 Apr 2009 15:09:43 -0400 (EDT) Subject: Re: [RFC][PATCH 0/9] File descriptor hot-unplug support References: <49E4000E.10308@kernel.org> <49E43F1D.3070400@kernel.org> <20090414150745.GC26621@shareable.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 14 Apr 2009 12:09:41 -0700 In-Reply-To: <20090414150745.GC26621@shareable.org> (Jamie Lokier's message of "Tue\, 14 Apr 2009 16\:07\:45 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Jamie Lokier Cc: Tejun Heo , Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Jamie Lokier writes: > Eric W. Biederman wrote: >> > I don't have anything at hand but multithread/process server accepting >> > on the same socket comes to mind. I don't think it would be a very >> > rare thing. If you confine the scope to character devices or sysfs, >> > it could be quite rare tho. >> >> Yes. I think I can safely exclude sockets, and not bother with >> reference counting them. > > Good idea. As well as many processes calling accept(), it's not > unusual to have two threads or processes for reading and writing > concurrently to TCP sockets, and to have a single UDP socket shared > among threads/processes for sendto. I have been playing with what I can see when I instrument up my code. The first thing that popped up was that we have a lots of reads/writes to files with f_count > 1. Which defeats my micro optimization in fops_read_lock. So in those cases I still have to pay the full cost of an atomic even if I have an exclusive cache line. I have found that for make -j N I tend to get N processes all reading from the same pipe at the same time. Not a smoking gun that my assumption that only one process will be using a file descriptor at a time in performance paths but it certainly shows that things are nowhere near as rare as I thought. The good news is that I have found a much better/cheaper optimization. Instead of per cpu or per file memory, use per task memory. It is always uncontended, and a task appears to never use more than two files simultaneously (stacking?). I have just prototyped that and things are looking very promising. Now I just need to clean everything up and resend my patches. >> The only strong evidence I have that multi-threading on a single file >> descriptor is likely to be common is that we have pread and pwrite >> syscalls. At the same time the number of races we have in struct file >> if it is accessed by multiple threads at the same time, suggests >> that at least for cases where you have an offset it doesn't happen often. > > Notice the preadv and pwritev syscalls added recently? They were > added because QEMU and KVM need them for performance. Those programs > have multiple threads doing I/O to the same file concurrently. It's > like a poor man's AIO, except it's more reliable than real Linux AIO :-) > > Databases probably should use concurrent p{read,write}{,v} if they're > not using direct I/O and AIO. I'm not sure if the well-known > databases do. In the past there have been some poor quality > "emulations" of those syscalls prone to races, on Linux and BSD I believe. > > What are the races you've noticed? Besides the f_pos (which pread variants handle) there is no locking on the file read ahead state, and f_flags only got locking a month or two ago. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id D74755F0001 for ; Tue, 14 Apr 2009 18:12:32 -0400 (EDT) Date: Tue, 14 Apr 2009 16:12:40 -0600 From: Jonathan Corbet Subject: Re: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file Message-ID: <20090414161240.73fe6bcd@bike.lwn.net> In-Reply-To: References: Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Hi, Eric, One little thing I noticed as I was looking at this... > +int fops_substitute(struct file *file, const struct file_operations *f_op, > + struct vm_operations_struct *vm_ops) > +{ [...] > + /* > + * Wait until there are no more callers in the original > + * file_operations methods. > + */ > + while (atomic_long_read(&file->f_use) > 0) > + schedule_timeout_interruptible(1); You use an interruptible sleep here, but there's no signal check to get you out of the loop. So it's not really interruptible. If f_use never goes to zero (a distressingly likely possibility, I fear), this code will create the equivalent of an unkillable D-wait state without ever actually showing up that way in "ps". Actually, now that I look, once you've got a signal pending you'll stay in TASK_RUNNING, so the above could turn into a busy-wait. Unless I've missed something...? I have no idea what the right thing to do in the face of a signal would be. Perhaps the wait-for-zero and release() call stuff should be dumped into a workqueue and done asynchronously? OTOH, I can see a need to know when the revoke operation is really done... jon -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 1B92C5F0001 for ; Tue, 14 Apr 2009 22:54:18 -0400 (EDT) Subject: Re: [RFC][PATCH 5/9] vfs: Introduce basic infrastructure for revoking a file References: <20090414161240.73fe6bcd@bike.lwn.net> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 14 Apr 2009 19:55:01 -0700 In-Reply-To: <20090414161240.73fe6bcd@bike.lwn.net> (Jonathan Corbet's message of "Tue\, 14 Apr 2009 16\:12\:40 -0600") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Jonathan Corbet Cc: Andrew Morton , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Al Viro , Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman List-ID: Jonathan Corbet writes: > Hi, Eric, > > One little thing I noticed as I was looking at this... > >> +int fops_substitute(struct file *file, const struct file_operations *f_op, >> + struct vm_operations_struct *vm_ops) >> +{ > > [...] > >> + /* >> + * Wait until there are no more callers in the original >> + * file_operations methods. >> + */ >> + while (atomic_long_read(&file->f_use) > 0) >> + schedule_timeout_interruptible(1); > > You use an interruptible sleep here, but there's no signal check to get you > out of the loop. So it's not really interruptible. If f_use never goes to > zero (a distressingly likely possibility, I fear), this code will create > the equivalent of an unkillable D-wait state without ever actually showing > up that way in "ps". I snagged this idiom out of srcu and hadn't given it much thought. We have a number of places in the kernel where we aren't performing work for user space where we fib about the kind of sleep we are doing, so we don't increase the load. In this case we are in fs code so I guess calling this an uninterruptible sleep is fair, especially since it looks like at some point this code path is going be called from a syscall. As for f_use not going to zero, we have strong progress guarantees: - fops_read_lock at that point will not increment the count of any new users of the file. - There is an additional awaken_all_waiters to wake up any wait queues that are causing syscalls to block in the kernel. > Actually, now that I look, once you've got a signal pending you'll stay > in TASK_RUNNING, so the above could turn into a busy-wait. > > Unless I've missed something...? Well we will always schedule, so it shouldn't be a pure busy-wait, but overall I would call this a good catch. > I have no idea what the right thing to do in the face of a signal would > be. Perhaps the wait-for-zero and release() call stuff should be dumped > into a workqueue and done asynchronously? OTOH, I can see a need to know > when the revoke operation is really done... Yes. For sys_revoke the wait doesn't appear necessary. For umount -f, rmmod, or pci hotunplug we need the wait to know when we it is safe to free up underlying data structures. And at least for the latter two being truly interruptible is a correctness problem. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id C190A6B005A for ; Wed, 3 Jun 2009 10:34:07 -0400 (EDT) Date: Mon, 1 Jun 2009 15:25:53 -0700 From: Andrew Morton Subject: Re: [PATCH 01/23] mm: Introduce revoke_file_mappings. Message-Id: <20090601152553.b2de027a.akpm@linux-foundation.org> In-Reply-To: <1243893048-17031-1-git-send-email-ebiederm@xmission.com> References: <1243893048-17031-1-git-send-email-ebiederm@xmission.com> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: viro@ZenIV.linux.org.uk, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, hch@infradead.org, ebiederm@aristanetworks.com List-ID: On Mon, 1 Jun 2009 14:50:26 -0700 "Eric W. Biederman" wrote: > +static void revoke_vma(struct vm_area_struct *vma) This looks odd. > +{ > + struct file *file = vma->vm_file; > + struct address_space *mapping = file->f_mapping; > + unsigned long start_addr, end_addr, size; > + struct mm_struct *mm; > + > + start_addr = vma->vm_start; > + end_addr = vma->vm_end; We take a copy of start_addr/end_addr (and this end_addr value is never used) > + /* Switch out the locks so I can maninuplate this under the mm sem. > + * Needed so I can call vm_ops->close. > + */ > + mm = vma->vm_mm; > + atomic_inc(&mm->mm_users); > + spin_unlock(&mapping->i_mmap_lock); > + > + /* Block page faults and other code modifying the mm. */ > + down_write(&mm->mmap_sem); > + > + /* Lookup a vma for my file address */ > + vma = find_vma(mm, start_addr); Then we look up a vma. Is there reason to believe that this will differ from the incoming arg which we just overwrote? Maybe the code is attempting to handle racing concurrent mmap/munmap activity? If so, what are the implications of this? I _think_ that what the function is attempting to do is "unmap the vma which covers the address at vma->start_addr". If so, why not just pass it that virtual address? Anyway, it's all a bit obscure and I do think that the semantics and behaviour should be carefully explained in a comment, no? > + if (vma->vm_file != file) > + goto out; This strengthens the theory that some sort of race-management is happening here. > + start_addr = vma->vm_start; > + end_addr = vma->vm_end; > + size = end_addr - start_addr; > + > + /* Unlock the pages */ > + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) { > + mm->locked_vm -= vma_pages(vma); > + vma->vm_flags &= ~VM_LOCKED; > + } > + > + /* Unmap the vma */ > + zap_page_range(vma, start_addr, size, NULL); > + > + /* Unlink the vma from the file */ > + unlink_file_vma(vma); > + > + /* Close the vma */ > + if (vma->vm_ops && vma->vm_ops->close) > + vma->vm_ops->close(vma); > + fput(vma->vm_file); > + vma->vm_file = NULL; > + if (vma->vm_flags & VM_EXECUTABLE) > + removed_exe_file_vma(vma->vm_mm); > + > + /* Repurpose the vma */ > + vma->vm_private_data = NULL; > + vma->vm_ops = &revoked_vm_ops; > + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR); > +out: > + up_write(&mm->mmap_sem); > + spin_lock(&mapping->i_mmap_lock); > +} Also, I'm not a bit fan of the practice of overwriting the value of a formal argument, especially in a function which is this large and complex. It makes the code harder to follow, because the one variable holds two conceptually different things within the span of the same function. And it adds risk that someone will will later access a field of *vma and it will be the wrong vma. Worse, the bug is only exposed under exeedingly rare conditions. So.. Use a new local, please. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 611FE6B005A for ; Wed, 3 Jun 2009 10:44:33 -0400 (EDT) Received: from makko.or.mcafeemobile.com by x35.xmailserver.org with [XMail 1.26 ESMTP Server] id for from ; Tue, 2 Jun 2009 12:57:46 -0400 Date: Tue, 2 Jun 2009 09:51:42 -0700 (PDT) From: Davide Libenzi Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock In-Reply-To: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> Message-ID: References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Mon, 1 Jun 2009, Eric W. Biederman wrote: > From: Eric W. Biederman > > Signed-off-by: Eric W. Biederman > --- > fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++------- > 1 files changed, 32 insertions(+), 7 deletions(-) This patchset gives me the willies for the amount of changes and possible impact on many subsystems. Without having looked at the details, are you aware that epoll does not act like poll/select, and fds are not automatically removed (as in, dequeued from the poll wait queue) in any foreseeable amount of time after a POLLERR is received? As far as the usespace API goes, they have the right to remain there. - Davide -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id CE0AA6B005A for ; Wed, 3 Jun 2009 10:51:25 -0400 (EDT) Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 02 Jun 2009 15:51:14 -0700 In-Reply-To: (Davide Libenzi's message of "Tue\, 2 Jun 2009 14\:52\:41 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Davide Libenzi Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: Davide Libenzi writes: > On Tue, 2 Jun 2009, Eric W. Biederman wrote: > >> Davide Libenzi writes: >> >> > On Mon, 1 Jun 2009, Eric W. Biederman wrote: >> > >> >> From: Eric W. Biederman >> >> >> >> Signed-off-by: Eric W. Biederman >> >> --- >> >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++------- >> >> 1 files changed, 32 insertions(+), 7 deletions(-) >> > >> > This patchset gives me the willies for the amount of changes and possible >> > impact on many subsystems. >> >> It both is and is not that bad. It is the cost of adding a lock. > > We both know that it is not only the cost of a lock, but also the > sprinkling over a pretty vast amount of subsystems, of another layer of > code. I am not clear what problem you have. Is it the sprinkling the code that takes and removes the lock? Just the VFS needs to be involved with that. It is a slightly larger surface area than doing the work inside the file operations as we sometimes call the same method from 3-4 different places but it is definitely a bounded problem. Is it putting in the handful lines per subsystem to actually use this functionality? At that level something generic that is maintained outside of the subsystem is better than the mess we have with 4-5 different implementations in the subsystems that need it, each having a different assortment of bugs. >> I thought of doing something more uniform to user space. But I observed >> that the existing epoll punts on the case of a file descriptor being closed >> and locking to go from a file to the other epoll datastructures is pretty >> horrid I said forget it and used the existing close behaviour. > > Well, you cannot rely on the caller to tidy up the epoll fd by issuing an > epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave > lingering crap around. You cannot even hold a reference of the file, since > otherwise the epoll hooking will have to trigger not only at ->release() > time, but at every close, where you'll have to figure out if this is the > last real userspace reference or not. Plus all the issues related to > holding permanent extra references to userspace files. > And since a file can be added in many epoll devices, you need to > unregister it from all of them (hence the other datastructures lookup). > Better this, on the slow path, with locks acquired only in the epoll usage > case, than some other thing and on the fast path, for every file. Sure, and that is largely and I am preserving those semantics. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with ESMTP id 03CE26B0083 for ; Wed, 3 Jun 2009 10:54:00 -0400 (EDT) Date: Tue, 2 Jun 2009 10:06:00 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file In-Reply-To: <20090602071411.GE31556@wotan.suse.de> Message-ID: References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <20090602071411.GE31556@wotan.suse.de> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: "Eric W. Biederman" , Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Tue, 2 Jun 2009, Nick Piggin wrote: > > Why is it called hotplug? Does it have anything to do with hardware? > Because every concurrently changed software data structure in the > kernel can be "hot"-modified, right? > > Wouldn't file_revoke_lock be more appropriate? I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not "plug" and "unplug". Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 6113F6B0082 for ; Wed, 3 Jun 2009 10:56:06 -0400 (EDT) Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <20090602071411.GE31556@wotan.suse.de> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 02 Jun 2009 15:56:02 -0700 In-Reply-To: <20090602071411.GE31556@wotan.suse.de> (Nick Piggin's message of "Tue\, 2 Jun 2009 09\:14\:11 +0200") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Nick Piggin writes: >> In addition for a complete solution we need: >> - A reliable way the file structures that we need to revoke. >> - To wait for but not tamper with ongoing file creation and cleanup. >> - A guarantee that all with user space controlled duration are removed. >> >> The file_hotplug_lock has a very unique implementation necessitated by >> the need to have no performance impact on existing code. Classic locking > > Well, it isn't no performance impact. Function calls, branches, icache > and dcache... Practically none. Everything I could measure was in the noise. It is cheaper than any serializing locking primitive. I ran both lmbench and did some microbenchmark testing. So I know on the fast path the overhead is minimal. Certainly less than what we are doing in sysfs and proc today. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 9197B6B0087 for ; Wed, 3 Jun 2009 10:58:47 -0400 (EDT) Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <84144f020906012216n715a04d0ha492abc12175816@mail.gmail.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 01 Jun 2009 23:51:56 -0700 In-Reply-To: <84144f020906012216n715a04d0ha492abc12175816@mail.gmail.com> (Pekka Enberg's message of "Tue\, 2 Jun 2009 08\:16\:44 +0300") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: Pekka Enberg Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Pekka Enberg writes: > Hi Eric, > > On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman > wrote: >> +#ifdef CONFIG_FILE_HOTPLUG >> + >> +static bool file_in_use(struct file *file) >> +{ >> + =C2=A0 =C2=A0 =C2=A0 struct task_struct *leader, *task; >> + =C2=A0 =C2=A0 =C2=A0 bool in_use =3D false; >> + =C2=A0 =C2=A0 =C2=A0 int i; >> + >> + =C2=A0 =C2=A0 =C2=A0 rcu_read_lock(); >> + =C2=A0 =C2=A0 =C2=A0 do_each_thread(leader, task) { >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 for (i =3D 0; i < MAX= _FILE_HOTPLUG_LOCK_DEPTH; i++) { >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 if (task->file_hotplug_lock[i] =3D=3D file) { >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 in_use =3D true; >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 goto found; >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 } >> + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 } >> + =C2=A0 =C2=A0 =C2=A0 } while_each_thread(leader, task); >> +found: >> + =C2=A0 =C2=A0 =C2=A0 rcu_read_unlock(); >> + =C2=A0 =C2=A0 =C2=A0 return in_use; >> +} > > This seems rather heavy-weight. If we're going to use this > infrastructure for forced unmount, I think this will be a problem. > Can't we two this in two stages: (1) mark a bit that forces > file_hotplug_read_trylock to always fail and (2) block until the last > remaining in-kernel file_hotplug_read_unlock() has executed? Yes there is room for more optimization in the slow path. I haven't noticed being a problem yet so I figured I would start with stupid and simple. I can easily see two passes. The first setting the flag an calling f_op->dead. The second some kind of consolidate walk through the task list, allowing checking on multiple files at once. I'm not ready to consider anything that will add cost to the fast path in the file descriptors though. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 85C166B0055 for ; Wed, 3 Jun 2009 11:03:50 -0400 (EDT) Received: from makko.or.mcafeemobile.com by x35.xmailserver.org with [XMail 1.26 ESMTP Server] id for from ; Wed, 3 Jun 2009 11:03:32 -0400 Date: Wed, 3 Jun 2009 07:57:40 -0700 (PDT) From: Davide Libenzi Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock In-Reply-To: Message-ID: References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Tue, 2 Jun 2009, Eric W. Biederman wrote: > I am not clear what problem you have. > > Is it the sprinkling the code that takes and removes the lock? Just > the VFS needs to be involved with that. It is a slightly larger > surface area than doing the work inside the file operations as we > sometimes call the same method from 3-4 different places but it is > definitely a bounded problem. > > Is it putting in the handful lines per subsystem to actually use this > functionality? At that level something generic that is maintained > outside of the subsystem is better than the mess we have with 4-5 > different implementations in the subsystems that need it, each having > a different assortment of bugs. Come on, only in the open fast path, there are at least two spin lock/unlock and two atomic ops. Without even starting to count all the extra branches and software added. Is this stuff *really* needed, or we can faitly happily live w/out? - Davide -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id C17866B00AC for ; Wed, 3 Jun 2009 11:13:28 -0400 (EDT) Date: Tue, 2 Jun 2009 09:06:42 +0200 From: Nick Piggin Subject: Re: [PATCH 03/23] vfs: Generalize the file_list Message-ID: <20090602070642.GD31556@wotan.suse.de> References: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Mon, Jun 01, 2009 at 02:50:28PM -0700, Eric W. Biederman wrote: > From: Eric W. Biederman > > In the implementation of revoke it is desirable to find all of the > files we want to operation on. Currently tty's and mark_files_ro > use the file_list for this, and this patch generalizes the file > list so it can be used more efficiently. > > This patch starts by introducing struct file_list making the file > list a first class object. file_list_lock and file_list_unlock > are modified to take this object, making it clear which file_list > we intended to lock. > > file_move is transformed into file_list_add taking a file_list and not > allowing the movement of one file to another. __dentry_open > is modified to support this by only adding normal files in open, > special files have always been ignored when walking the file_list. > __dentry_open skipping special files allows __ptmx_open and __tty_open > to safely call file_add as they are adding the file to the file_list > for the first time. > > file_kill has been renamed file_list_del to make it clear what it is > doing and to keep from confusing it with a more revoke like operation. > > put_filp has been modified to not take file_list_del as we are never > on a file_list when put_filp is called. > > fs_may_remount_ro and mark_files_ro have been modified to walk the > inode list to find all of the inodes and then to walk the file list > on those inodes. It can be a slightly longer walk as we frequently > cache inodes that we do not have open but the overall complexity > should be about the same, Well not really. I have a couple of orders of magnitude more cached inodes than open files here. > these are slow path functions, and it > gives us much greater flexibility overall. Define flexibility. Walking the sb's file list and checking for equality with the inode in question gives the same functionality, just different performance profile. > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping) > return mapping->i_mmap_writable != 0; > } > > +struct file_list { > + spinlock_t lock; > + struct list_head list; > +}; > + > /* > * Use sequence counter to get consistent i_size on 32-bit processors. > */ > @@ -764,6 +769,7 @@ struct inode { > struct list_head inotify_watches; /* watches on this inode */ > struct mutex inotify_mutex; /* protects the watches list */ > #endif > + struct file_list i_files; > > unsigned long i_state; > unsigned long dirtied_when; /* jiffies of first dirtying */ > @@ -934,9 +940,15 @@ struct file { > unsigned long f_mnt_write_state; > #endif > }; > -extern spinlock_t files_lock; > -#define file_list_lock() spin_lock(&files_lock); > -#define file_list_unlock() spin_unlock(&files_lock); > + > +static inline void file_list_lock(struct file_list *files) > +{ > + spin_lock(&files->lock); > +} > +static inline void file_list_unlock(struct file_list *files) > +{ > + spin_unlock(&files->lock); > +} I don't really like this. It's just a list head. Get rid of all these wrappers and crap I'd say. In fact, starting with my patch to unexport files_lock and remove these wrappers would be reasonable, wouldn't it? Increasing the size of the struct inode by 24 bytes hurts. Even when you decrapify it and can reuse i_lock or something, then it is still 16 bytes on 64-bit. I haven't looked through all the patches... but this is to speed up a slowpath operation, isn't it? Or does revoke need to be especially performant? So this patch is purely a perofrmance improvement? Then I think it needs to be justified with numbers and the downsides (bloating struct inode in particulra) to be changelogged. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 6C93F6B00B6 for ; Wed, 3 Jun 2009 11:14:57 -0400 (EDT) Received: by fxm12 with SMTP id 12so10989824fxm.38 for ; Tue, 02 Jun 2009 00:08:14 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <84144f020906012216n715a04d0ha492abc12175816@mail.gmail.com> Date: Tue, 2 Jun 2009 10:08:14 +0300 Message-ID: <84144f020906020008w54b1c628hc6e41dcddd208f5f@mail.gmail.com> Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Hi Eric, On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman w= rote: > Pekka Enberg writes: > >> Hi Eric, >> >> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman >> wrote: >>> +#ifdef CONFIG_FILE_HOTPLUG >>> + >>> +static bool file_in_use(struct file *file) >>> +{ >>> + =A0 =A0 =A0 struct task_struct *leader, *task; >>> + =A0 =A0 =A0 bool in_use =3D false; >>> + =A0 =A0 =A0 int i; >>> + >>> + =A0 =A0 =A0 rcu_read_lock(); >>> + =A0 =A0 =A0 do_each_thread(leader, task) { >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 for (i =3D 0; i < MAX_FILE_HOTPLUG_LOCK_D= EPTH; i++) { >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (task->file_hotplug_lo= ck[i] =3D=3D file) { >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 in_use = =3D true; >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto foun= d; >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 } >>> + =A0 =A0 =A0 =A0 =A0 =A0 =A0 } >>> + =A0 =A0 =A0 } while_each_thread(leader, task); >>> +found: >>> + =A0 =A0 =A0 rcu_read_unlock(); >>> + =A0 =A0 =A0 return in_use; >>> +} >> >> This seems rather heavy-weight. If we're going to use this >> infrastructure for forced unmount, I think this will be a problem. > >> Can't we two this in two stages: (1) mark a bit that forces >> file_hotplug_read_trylock to always fail and (2) block until the last >> remaining in-kernel file_hotplug_read_unlock() has executed? > > Yes there is room for more optimization in the slow path. > I haven't noticed being a problem yet so I figured I would start > with stupid and simple. Yup, just wanted to point it out. On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman w= rote: > I can easily see two passes. =A0The first setting the flag an calling > f_op->dead. =A0The second some kind of consolidate walk through the task > list, allowing checking on multiple files at once. > > I'm not ready to consider anything that will add cost to the fast > path in the file descriptors though. Makes sense. Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id CA29B6B00BB for ; Wed, 3 Jun 2009 11:20:55 -0400 (EDT) Date: Tue, 2 Jun 2009 09:14:11 +0200 From: Nick Piggin Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Message-ID: <20090602071411.GE31556@wotan.suse.de> References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Mon, Jun 01, 2009 at 02:50:29PM -0700, Eric W. Biederman wrote: > From: Eric W. Biederman > > Introduce the file_hotplug_lock to protect file->f_op, file->private, > file->f_path from revoke operations. > > The file_hotplug_lock is used typically as: > error = -EIO; > if (!file_hotplug_read_trylock(file)) > goto out; > .... > file_hotplug_read_unlock(file); Why is it called hotplug? Does it have anything to do with hardware? Because every concurrently changed software data structure in the kernel can be "hot"-modified, right? Wouldn't file_revoke_lock be more appropriate? > In addition for a complete solution we need: > - A reliable way the file structures that we need to revoke. > - To wait for but not tamper with ongoing file creation and cleanup. > - A guarantee that all with user space controlled duration are removed. > > The file_hotplug_lock has a very unique implementation necessitated by > the need to have no performance impact on existing code. Classic locking Well, it isn't no performance impact. Function calls, branches, icache and dcache... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 8E56A6B00AE for ; Wed, 3 Jun 2009 11:37:25 -0400 (EDT) Date: Wed, 3 Jun 2009 08:37:21 +0200 From: Nick Piggin Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Message-ID: <20090603063721.GD27563@wotan.suse.de> References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <20090602071411.GE31556@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Linus Torvalds , Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Tue, Jun 02, 2009 at 01:52:46PM -0700, Eric W. Biederman wrote: > Linus Torvalds writes: > > > On Tue, 2 Jun 2009, Nick Piggin wrote: > >> > >> Why is it called hotplug? Does it have anything to do with hardware? > >> Because every concurrently changed software data structure in the > >> kernel can be "hot"-modified, right? > >> > >> Wouldn't file_revoke_lock be more appropriate? > > > > I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not > > "plug" and "unplug". > > I guess this shows my bias in triggering this code path from pci > hotunplug. Instead of with some system call. > > I'm not married to the name. I wanted file_lock but that is already > used, and I did call the method revoke. Definitely it is not going to be called hotplug in the generic vfs layer :) > The only place where hotplug gives a useful hint is that it makes it > clear we really are disconnecting the file descriptor from what lies > below it. Isn't that hotUNplug? But anyway hot plug/unplug is a purely hardware concept. Revoke for "unplug", please, including naming of patches, changelogs, and locks etc. > We can't do some weird thing like keep the underlying object. > Because the underlying object is gone. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 3E9A46B00CD for ; Wed, 3 Jun 2009 11:38:20 -0400 (EDT) Date: Wed, 3 Jun 2009 08:38:15 +0200 From: Nick Piggin Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Message-ID: <20090603063815.GE27563@wotan.suse.de> References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <20090602071411.GE31556@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Tue, Jun 02, 2009 at 03:56:02PM -0700, Eric W. Biederman wrote: > Nick Piggin writes: > > >> In addition for a complete solution we need: > >> - A reliable way the file structures that we need to revoke. > >> - To wait for but not tamper with ongoing file creation and cleanup. > >> - A guarantee that all with user space controlled duration are removed. > >> > >> The file_hotplug_lock has a very unique implementation necessitated by > >> the need to have no performance impact on existing code. Classic locking > > > > Well, it isn't no performance impact. Function calls, branches, icache > > and dcache... > > Practically none. OK that's different from none. There is obviously overhead. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D62D76B00CF for ; Wed, 3 Jun 2009 11:39:34 -0400 (EDT) Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> <20090602071411.GE31556@wotan.suse.de> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 02 Jun 2009 13:52:46 -0700 In-Reply-To: (Linus Torvalds's message of "Tue\, 2 Jun 2009 10\:06\:00 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Nick Piggin , Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Linus Torvalds writes: > On Tue, 2 Jun 2009, Nick Piggin wrote: >> >> Why is it called hotplug? Does it have anything to do with hardware? >> Because every concurrently changed software data structure in the >> kernel can be "hot"-modified, right? >> >> Wouldn't file_revoke_lock be more appropriate? > > I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not > "plug" and "unplug". I guess this shows my bias in triggering this code path from pci hotunplug. Instead of with some system call. I'm not married to the name. I wanted file_lock but that is already used, and I did call the method revoke. The only place where hotplug gives a useful hint is that it makes it clear we really are disconnecting the file descriptor from what lies below it. We can't do some weird thing like keep the underlying object. Because the underlying object is gone. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id C77B36B00D4 for ; Wed, 3 Jun 2009 11:40:06 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:37 -0700 Message-Id: <1243893048-17031-12-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 12/23] vfs: Teach fcntl to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/fcntl.c | 28 +++++++++++++++++++--------- 1 files changed, 19 insertions(+), 9 deletions(-) diff --git a/fs/fcntl.c b/fs/fcntl.c index cc8e4de..05d8961 100644 --- a/fs/fcntl.c +++ b/fs/fcntl.c @@ -344,14 +344,19 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) if (!filp) goto out; + err = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_fput; + err = security_file_fcntl(filp, cmd, arg); - if (err) { - fput(filp); - return err; - } + if (err) + goto out_unlock; err = do_fcntl(fd, cmd, arg, filp); +out_unlock: + file_hotplug_read_unlock(filp); +out_fput: fput(filp); out: return err; @@ -369,13 +374,15 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd, if (!filp) goto out; + err = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_fput; + err = security_file_fcntl(filp, cmd, arg); - if (err) { - fput(filp); - return err; - } + if (err) + goto out_unlock; + err = -EBADF; - switch (cmd) { case F_GETLK64: err = fcntl_getlk64(filp, (struct flock64 __user *) arg); @@ -389,6 +396,9 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd, err = do_fcntl(fd, cmd, arg, filp); break; } +out_unlock: + file_hotplug_read_unlock(filp); +out_fput: fput(filp); out: return err; -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 957CE5F0003 for ; Wed, 3 Jun 2009 12:10:35 -0400 (EDT) Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 02 Jun 2009 14:23:47 -0700 In-Reply-To: (Davide Libenzi's message of "Tue\, 2 Jun 2009 09\:51\:42 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Davide Libenzi Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: Davide Libenzi writes: > On Mon, 1 Jun 2009, Eric W. Biederman wrote: > >> From: Eric W. Biederman >> >> Signed-off-by: Eric W. Biederman >> --- >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++------- >> 1 files changed, 32 insertions(+), 7 deletions(-) > > This patchset gives me the willies for the amount of changes and possible > impact on many subsystems. It both is and is not that bad. It is the cost of adding a lock. For the VFS except for nfsd the I have touched everything that needs to be touched. Other subsystems that open read/write close files should be able to use existing vfs helpers so they don't need to know about the new locking explicitly. Actually taking advantage of this infrastructure in a subsystem is comparatively easy. It took me about an hour to get uio using it. That part is not deep by any means and is opt in. > Without having looked at the details, are you aware that epoll does not > act like poll/select, and fds are not automatically removed (as in, > dequeued from the poll wait queue) in any foreseeable amount of time after > a POLLERR is received? Yes I am aware of how epoll acts differently. > As far as the usespace API goes, they have the right to remain there. I absolutely agree. Currently I have the code acting like close() with respect to epoll and just having the file descriptor vanish at the end of the revoke. While we the revoke is in progress you get an EIO. The file descriptor is not freed by a revoke operation so you can happily hang unto it as long as you want. I thought of doing something more uniform to user space. But I observed that the existing epoll punts on the case of a file descriptor being closed and locking to go from a file to the other epoll datastructures is pretty horrid I said forget it and used the existing close behaviour. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 082FA5F0003 for ; Wed, 3 Jun 2009 12:19:07 -0400 (EDT) Subject: Re: [PATCH 01/23] mm: Introduce revoke_file_mappings. References: <1243893048-17031-1-git-send-email-ebiederm@xmission.com> <20090601152553.b2de027a.akpm@linux-foundation.org> From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 01 Jun 2009 17:12:19 -0700 In-Reply-To: <20090601152553.b2de027a.akpm@linux-foundation.org> (Andrew Morton's message of "Mon\, 1 Jun 2009 15\:25\:53 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Andrew Morton Cc: viro@ZenIV.linux.org.uk, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, tj@kernel.org, hugh.dickins@tiscali.co.uk, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, hch@infradead.org, ebiederm@aristanetworks.com List-ID: Andrew Morton writes: > On Mon, 1 Jun 2009 14:50:26 -0700 > "Eric W. Biederman" wrote: > >> +static void revoke_vma(struct vm_area_struct *vma) > > This looks odd. > >> +{ >> + struct file *file = vma->vm_file; >> + struct address_space *mapping = file->f_mapping; >> + unsigned long start_addr, end_addr, size; >> + struct mm_struct *mm; >> + >> + start_addr = vma->vm_start; >> + end_addr = vma->vm_end; > > We take a copy of start_addr/end_addr (and this end_addr value is never used) A foolish consistency. >> + /* Switch out the locks so I can maninuplate this under the mm sem. >> + * Needed so I can call vm_ops->close. >> + */ >> + mm = vma->vm_mm; >> + atomic_inc(&mm->mm_users); >> + spin_unlock(&mapping->i_mmap_lock); >> + >> + /* Block page faults and other code modifying the mm. */ >> + down_write(&mm->mmap_sem); >> + >> + /* Lookup a vma for my file address */ >> + vma = find_vma(mm, start_addr); > > Then we look up a vma. Is there reason to believe that this will > differ from the incoming arg which we just overwrote? Maybe the code > is attempting to handle racing concurrent mmap/munmap activity? If so, > what are the implications of this? Yes it is. The file based index is only safe while we hold the i_mmap_lock. The manipulation that needs to happen under the mmap_sem. So I drop all of the locks and restart. And use the time honored kernel practice of relooking up the thing I am going to manipulate. As long as it is for the same file I don't care. > I _think_ that what the function is attempting to do is "unmap the vma > which covers the address at vma->start_addr". If so, why not just pass > it that virtual address? Actually it is unmapping a vma for the file I am revoking. I hand it one and then it does an address space jig. > Anyway, it's all a bit obscure and I do think that the semantics and > behaviour should be carefully explained in a comment, no? > >> + if (vma->vm_file != file) >> + goto out; > > This strengthens the theory that some sort of race-management is > happening here. Totally. I dropped all of my locks so I am having to restart in a different locking context. >> + start_addr = vma->vm_start; >> + end_addr = vma->vm_end; >> + size = end_addr - start_addr; >> + >> + /* Unlock the pages */ >> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) { >> + mm->locked_vm -= vma_pages(vma); >> + vma->vm_flags &= ~VM_LOCKED; >> + } >> + >> + /* Unmap the vma */ >> + zap_page_range(vma, start_addr, size, NULL); >> + >> + /* Unlink the vma from the file */ >> + unlink_file_vma(vma); >> + >> + /* Close the vma */ >> + if (vma->vm_ops && vma->vm_ops->close) >> + vma->vm_ops->close(vma); >> + fput(vma->vm_file); >> + vma->vm_file = NULL; >> + if (vma->vm_flags & VM_EXECUTABLE) >> + removed_exe_file_vma(vma->vm_mm); >> + >> + /* Repurpose the vma */ >> + vma->vm_private_data = NULL; >> + vma->vm_ops = &revoked_vm_ops; >> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR); >> +out: >> + up_write(&mm->mmap_sem); >> + spin_lock(&mapping->i_mmap_lock); >> +} > > Also, I'm not a bit fan of the practice of overwriting the value of a > formal argument, especially in a function which is this large and > complex. It makes the code harder to follow, because the one variable > holds two conceptually different things within the span of the same > function. And it adds risk that someone will will later access a field > of *vma and it will be the wrong vma. Worse, the bug is only exposed > under exeedingly rare conditions. > > So.. Use a new local, please. We can never legitimately have more than one vma manipulated in this function. As for the rest. I guess I just assumed that the reader of the code would have a basic understanding of the locking rules for those data structures. Certainly the worst thing I suffer from is being close to the code, and not realizing which pieces are not obvious to a naive observer. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id AEF635F0019 for ; Wed, 3 Jun 2009 12:45:13 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:45 -0700 Message-Id: <1243893048-17031-20-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 20/23] vfs: Teach aio to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/aio.c | 51 ++++++++++++++++++++++++++++++++++++++------------- 1 files changed, 38 insertions(+), 13 deletions(-) diff --git a/fs/aio.c b/fs/aio.c index 76da125..eceb215 100644 --- a/fs/aio.c +++ b/fs/aio.c @@ -1362,13 +1362,20 @@ static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret) static ssize_t aio_rw_vect_retry(struct kiocb *iocb) { struct file *file = iocb->ki_filp; - struct address_space *mapping = file->f_mapping; - struct inode *inode = mapping->host; + struct address_space *mapping; + struct inode *inode; ssize_t (*rw_op)(struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t ret = 0; unsigned short opcode; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + + mapping = file->f_mapping; + inode = mapping->host; + if ((iocb->ki_opcode == IOCB_CMD_PREADV) || (iocb->ki_opcode == IOCB_CMD_PREAD)) { rw_op = file->f_op->aio_read; @@ -1379,8 +1386,9 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb) } /* This matches the pread()/pwrite() logic */ + ret = -EINVAL; if (iocb->ki_pos < 0) - return -EINVAL; + goto out_unlock; do { ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg], @@ -1407,26 +1415,37 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb) && iocb->ki_nbytes - iocb->ki_left) ret = iocb->ki_nbytes - iocb->ki_left; +out_unlock: + file_hotplug_read_unlock(file); +out: return ret; } static ssize_t aio_fdsync(struct kiocb *iocb) { struct file *file = iocb->ki_filp; - ssize_t ret = -EINVAL; + ssize_t ret = -EIO; - if (file->f_op->aio_fsync) - ret = file->f_op->aio_fsync(iocb, 1); + if (file_hotplug_read_trylock(file)) { + ret = -EINVAL; + if (file->f_op->aio_fsync) + ret = file->f_op->aio_fsync(iocb, 1); + file_hotplug_read_unlock(file); + } return ret; } static ssize_t aio_fsync(struct kiocb *iocb) { struct file *file = iocb->ki_filp; - ssize_t ret = -EINVAL; + ssize_t ret = -EIO; - if (file->f_op->aio_fsync) - ret = file->f_op->aio_fsync(iocb, 0); + if (file_hotplug_read_trylock(file)) { + ret = -EINVAL; + if (file->f_op->aio_fsync) + ret = file->f_op->aio_fsync(iocb, 0); + file_hotplug_read_unlock(file); + } return ret; } @@ -1469,7 +1488,11 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb) static ssize_t aio_setup_iocb(struct kiocb *kiocb) { struct file *file = kiocb->ki_filp; - ssize_t ret = 0; + ssize_t ret; + + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; switch (kiocb->ki_opcode) { case IOCB_CMD_PREAD: @@ -1551,10 +1574,12 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb) ret = -EINVAL; } - if (!kiocb->ki_retry) - return ret; + if (kiocb->ki_retry) + ret = 0; - return 0; + file_hotplug_read_unlock(file); +out: + return ret; } /* -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 6864C5F0019 for ; Wed, 3 Jun 2009 12:45:29 -0400 (EDT) Received: from makko.or.mcafeemobile.com by x35.xmailserver.org with [XMail 1.26 ESMTP Server] id for from ; Tue, 2 Jun 2009 17:58:42 -0400 Date: Tue, 2 Jun 2009 14:52:41 -0700 (PDT) From: Davide Libenzi Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock In-Reply-To: Message-ID: References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Tue, 2 Jun 2009, Eric W. Biederman wrote: > Davide Libenzi writes: > > > On Mon, 1 Jun 2009, Eric W. Biederman wrote: > > > >> From: Eric W. Biederman > >> > >> Signed-off-by: Eric W. Biederman > >> --- > >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++------- > >> 1 files changed, 32 insertions(+), 7 deletions(-) > > > > This patchset gives me the willies for the amount of changes and possible > > impact on many subsystems. > > It both is and is not that bad. It is the cost of adding a lock. We both know that it is not only the cost of a lock, but also the sprinkling over a pretty vast amount of subsystems, of another layer of code. > I thought of doing something more uniform to user space. But I observed > that the existing epoll punts on the case of a file descriptor being closed > and locking to go from a file to the other epoll datastructures is pretty > horrid I said forget it and used the existing close behaviour. Well, you cannot rely on the caller to tidy up the epoll fd by issuing an epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave lingering crap around. You cannot even hold a reference of the file, since otherwise the epoll hooking will have to trigger not only at ->release() time, but at every close, where you'll have to figure out if this is the last real userspace reference or not. Plus all the issues related to holding permanent extra references to userspace files. And since a file can be added in many epoll devices, you need to unregister it from all of them (hence the other datastructures lookup). Better this, on the slow path, with locks acquired only in the epoll usage case, than some other thing and on the fast path, for every file. - Davide -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id BA9DB5F0019 for ; Wed, 3 Jun 2009 13:24:30 -0400 (EDT) Received: by bwz21 with SMTP id 21so10670200bwz.38 for ; Mon, 01 Jun 2009 22:24:28 -0700 (PDT) MIME-Version: 1.0 In-Reply-To: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> Date: Tue, 2 Jun 2009 08:16:44 +0300 Message-ID: <84144f020906012216n715a04d0ha492abc12175816@mail.gmail.com> Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file From: Pekka Enberg Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Hi Eric, On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman wrote: > +#ifdef CONFIG_FILE_HOTPLUG > + > +static bool file_in_use(struct file *file) > +{ > + =A0 =A0 =A0 struct task_struct *leader, *task; > + =A0 =A0 =A0 bool in_use =3D false; > + =A0 =A0 =A0 int i; > + > + =A0 =A0 =A0 rcu_read_lock(); > + =A0 =A0 =A0 do_each_thread(leader, task) { > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 for (i =3D 0; i < MAX_FILE_HOTPLUG_LOCK_DEP= TH; i++) { > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 if (task->file_hotplug_lock= [i] =3D=3D file) { > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 in_use =3D = true; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 goto found; > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 } > + =A0 =A0 =A0 =A0 =A0 =A0 =A0 } > + =A0 =A0 =A0 } while_each_thread(leader, task); > +found: > + =A0 =A0 =A0 rcu_read_unlock(); > + =A0 =A0 =A0 return in_use; > +} This seems rather heavy-weight. If we're going to use this infrastructure for forced unmount, I think this will be a problem. Can't we two this in two stages: (1) mark a bit that forces file_hotplug_read_trylock to always fail and (2) block until the last remaining in-kernel file_hotplug_read_unlock() has executed? Pekka -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id A6D646B007E for ; Wed, 3 Jun 2009 14:12:05 -0400 (EDT) References: From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 01 Jun 2009 14:45:17 -0700 In-Reply-To: (Eric W. Biederman's message of "Sat\, 11 Apr 2009 05\:01\:29 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: [PATCH 0/23] File descriptor hot-unplug support v2 Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig List-ID: I found myself looking at the uio, seeing that it does not support pci hot-unplug, and thinking "Great yet another implementation of hotunplug logic that needs to be added". I decided to see what it would take to add a generic implementation of the code we have for supporting hot unplugging devices in sysfs, proc, sysctl, tty_io, and now almost in the tun driver. Not long after I touched the tun driver and made it safe to delete the network device while still holding it's file descriptor open I someone else touch the code adding a different feature and my careful work went up in flames. Which brought home another point at the best of it this is ultimately complex tricky code that subsystems should not need to worry about. What makes this even more interesting is that in the presence of pci hot-unplug it looks like most subsystems and most devices will have to deal with the issue one way or another. This infrastructure could also be used to implement both force unmounts and sys_revoke. When I could not think of a better name for I have drawn on that and used revoke. The following changes draw on and generalize the work in tty_io sysfs, proc, and sysctl and move it into the vfs level. Where the basic primitives are running faster, and the solution is more general. ... Changes since version 1. All of that lead to the first version of this patchset. The feedback I got from that was generally positive but there was a concern about performance when two there are two simultaneous accessors to the tty at the same time. After looking into the performance concerns of what happens when multiple programs access the same struct file and finding that I could not rule out a performance regression I have gone back and redesigned my mutual exclusion primitive creating something simpler and faster. I have also changed my synchronization primitives extending them to protect most of what is read-only in struct file today and abandoning rcu-ness of struct file. Giving up rcu-ness leads to true exclusion and makes the code much easier to think about. In this patchset is the basic code patchs 1-4 and a conversion of the vfs except for the nfsd entry points. Enough for a reasonable result. These patches are based on Al's vfs/for-next tree. The vfs changes in this patchset. Documentation/filesystems/vfs.txt | 5 + drivers/char/pty.c | 2 +- drivers/char/tty_io.c | 22 ++-- fs/Kconfig | 4 + fs/aio.c | 51 +++++-- fs/compat.c | 16 ++- fs/compat_ioctl.c | 14 ++- fs/eventpoll.c | 41 +++++- fs/fcntl.c | 28 +++-- fs/file_table.c | 281 +++++++++++++++++++++++++++++-------- fs/inode.c | 1 + fs/ioctl.c | 8 +- fs/locks.c | 8 +- fs/namei.c | 11 ++- fs/open.c | 81 +++++++++-- fs/proc/base.c | 29 ++-- fs/read_write.c | 122 ++++++++++++---- fs/readdir.c | 20 ++- fs/select.c | 53 ++++++- fs/splice.c | 111 ++++++++++----- fs/super.c | 1 - fs/sync.c | 9 +- include/linux/fs.h | 49 ++++++- include/linux/mm.h | 2 + include/linux/poll.h | 3 + include/linux/sched.h | 7 + include/linux/tty.h | 2 +- mm/fadvise.c | 7 + mm/filemap.c | 25 ++-- mm/memory.c | 98 +++++++++++++ mm/mmap.c | 78 +++++++---- mm/nommu.c | 21 +++- security/selinux/hooks.c | 8 +- 33 files changed, 950 insertions(+), 268 deletions(-) The necessary changes to proc to take advantage of this functionality. fs/proc/Kconfig | 1 + fs/proc/generic.c | 56 +++----- fs/proc/inode.c | 354 ++++------------------------------------------- fs/proc/internal.h | 1 + include/linux/proc_fs.h | 4 - 5 files changed, 44 insertions(+), 372 deletions(-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id CEF7B6B00BC for ; Wed, 3 Jun 2009 14:17:40 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:27 -0700 Message-Id: <1243893048-17031-2-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 02/23] vfs: Implement unpoll_file. Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman During a revoke operation it is necessary to stop using all state that is managed by the underlying file operations implementation. The poll wait queue is one part of that state. unpoll_file achieves that by walking through a specified waitqueue. Finding any entries that were added by select or poll of that file descriptor and awakening them. If action was taken unpoll sleeps and repeats until the waitqueue has no entries for the spcified file. Signed-off-by: Eric W. Biederman --- fs/select.c | 31 +++++++++++++++++++++++++++++++ include/linux/poll.h | 2 ++ 2 files changed, 33 insertions(+), 0 deletions(-) diff --git a/fs/select.c b/fs/select.c index 0fe0e14..bd30fe8 100644 --- a/fs/select.c +++ b/fs/select.c @@ -941,3 +941,34 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, return ret; } #endif /* HAVE_SET_RESTORE_SIGMASK */ + +#ifdef CONFIG_FILE_HOTPLUG +static int unpoll_file_once(wait_queue_head_t *q, struct file *file) +{ + unsigned long flags; + wait_queue_t *curr, *next; + int found = 0; + + spin_lock_irqsave(&q->lock, flags); + list_for_each_entry_safe(curr, next, &q->task_list, task_list) { + struct poll_table_entry *entry; + if (curr->func != pollwake) + continue; + entry = container_of(curr, struct poll_table_entry, wait); + if (entry->filp != file) + continue; + curr->func(curr, TASK_NORMAL, 0, NULL); + found = 1; + } + spin_unlock_irqrestore(&q->lock, flags); + + return found; +} + +void unpoll_file(wait_queue_head_t *q, struct file *file) +{ + while (unpoll_file_once(q, file)) + schedule_timeout_uninterruptible(1); +} +EXPORT_SYMBOL(unpoll_file); +#endif diff --git a/include/linux/poll.h b/include/linux/poll.h index 8c24ef8..d388620 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -131,6 +131,8 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp, extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec); +extern void unpoll_file(wait_queue_head_t *q, struct file *file); + #endif /* KERNEL */ #endif /* _LINUX_POLL_H */ -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id D8FD26B00C2 for ; Wed, 3 Jun 2009 14:17:40 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:26 -0700 Message-Id: <1243893048-17031-1-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 01/23] mm: Introduce revoke_file_mappings. Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman When the backing store of a file becomes inaccessible we need a function to remove that file from the page tables and arrange for page faults to receive SIGBUS until the file is unmapped. Signed-off-by: Eric W. Biederman --- include/linux/mm.h | 2 + mm/memory.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 100 insertions(+), 0 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index bff1f0d..5d7480d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -808,6 +808,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping, extern int vmtruncate(struct inode * inode, loff_t offset); extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end); +extern void revoke_file_mappings(struct file *file); + #ifdef CONFIG_MMU extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long address, int write_access); diff --git a/mm/memory.c b/mm/memory.c index 4126dd1..5cbee3b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -2358,6 +2359,103 @@ void unmap_mapping_range(struct address_space *mapping, } EXPORT_SYMBOL(unmap_mapping_range); +static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf) +{ + return VM_FAULT_SIGBUS; +} + +static struct vm_operations_struct revoked_vm_ops = { + .fault = revoked_fault, +}; + +static void revoke_vma(struct vm_area_struct *vma) +{ + struct file *file = vma->vm_file; + struct address_space *mapping = file->f_mapping; + unsigned long start_addr, end_addr, size; + struct mm_struct *mm; + + start_addr = vma->vm_start; + end_addr = vma->vm_end; + + /* Switch out the locks so I can maninuplate this under the mm sem. + * Needed so I can call vm_ops->close. + */ + mm = vma->vm_mm; + atomic_inc(&mm->mm_users); + spin_unlock(&mapping->i_mmap_lock); + + /* Block page faults and other code modifying the mm. */ + down_write(&mm->mmap_sem); + + /* Lookup a vma for my file address */ + vma = find_vma(mm, start_addr); + if (vma->vm_file != file) + goto out; + + start_addr = vma->vm_start; + end_addr = vma->vm_end; + size = end_addr - start_addr; + + /* Unlock the pages */ + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) { + mm->locked_vm -= vma_pages(vma); + vma->vm_flags &= ~VM_LOCKED; + } + + /* Unmap the vma */ + zap_page_range(vma, start_addr, size, NULL); + + /* Unlink the vma from the file */ + unlink_file_vma(vma); + + /* Close the vma */ + if (vma->vm_ops && vma->vm_ops->close) + vma->vm_ops->close(vma); + fput(vma->vm_file); + vma->vm_file = NULL; + if (vma->vm_flags & VM_EXECUTABLE) + removed_exe_file_vma(vma->vm_mm); + + /* Repurpose the vma */ + vma->vm_private_data = NULL; + vma->vm_ops = &revoked_vm_ops; + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR); +out: + up_write(&mm->mmap_sem); + spin_lock(&mapping->i_mmap_lock); +} + +void revoke_file_mappings(struct file *file) +{ + /* After a file has been marked dead update the vmas */ + struct address_space *mapping = file->f_mapping; + struct vm_area_struct *vma; + struct prio_tree_iter iter; + + spin_lock(&mapping->i_mmap_lock); + +restart_tree: + vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) { + /* Skip quickly over vmas that do not need to be touched */ + if (vma->vm_file != file) + continue; + revoke_vma(vma); + goto restart_tree; + } + +restart_list: + list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) { + /* Skip quickly over vmas that do not need to be touched */ + if (vma->vm_file != file) + continue; + revoke_vma(vma); + goto restart_list; + } + + spin_unlock(&mapping->i_mmap_lock); +} + /** * vmtruncate - unmap mappings "freed" by truncate() syscall * @inode: inode of the file used -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 9792D6B00F5 for ; Wed, 3 Jun 2009 14:17:41 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:28 -0700 Message-Id: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 03/23] vfs: Generalize the file_list Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman In the implementation of revoke it is desirable to find all of the files we want to operation on. Currently tty's and mark_files_ro use the file_list for this, and this patch generalizes the file list so it can be used more efficiently. This patch starts by introducing struct file_list making the file list a first class object. file_list_lock and file_list_unlock are modified to take this object, making it clear which file_list we intended to lock. file_move is transformed into file_list_add taking a file_list and not allowing the movement of one file to another. __dentry_open is modified to support this by only adding normal files in open, special files have always been ignored when walking the file_list. __dentry_open skipping special files allows __ptmx_open and __tty_open to safely call file_add as they are adding the file to the file_list for the first time. file_kill has been renamed file_list_del to make it clear what it is doing and to keep from confusing it with a more revoke like operation. put_filp has been modified to not take file_list_del as we are never on a file_list when put_filp is called. fs_may_remount_ro and mark_files_ro have been modified to walk the inode list to find all of the inodes and then to walk the file list on those inodes. It can be a slightly longer walk as we frequently cache inodes that we do not have open but the overall complexity should be about the same, these are slow path functions, and it gives us much greater flexibility overall. Signed-off-by: Eric W. Biederman --- drivers/char/pty.c | 2 +- drivers/char/tty_io.c | 22 ++++----- fs/file_table.c | 117 +++++++++++++++++++++++++--------------------- fs/inode.c | 1 + fs/open.c | 6 ++- fs/select.c | 2 - fs/super.c | 1 - include/linux/fs.h | 24 +++++++-- include/linux/tty.h | 2 +- security/selinux/hooks.c | 8 ++-- 10 files changed, 102 insertions(+), 83 deletions(-) diff --git a/drivers/char/pty.c b/drivers/char/pty.c index 31038a0..585f700 100644 --- a/drivers/char/pty.c +++ b/drivers/char/pty.c @@ -662,7 +662,7 @@ static int __ptmx_open(struct inode *inode, struct file *filp) set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */ filp->private_data = tty; - file_move(filp, &tty->tty_files); + file_list_add(filp, &tty->tty_files); retval = devpts_pty_new(inode, tty->link); if (retval) diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c index 66b99a2..b5c0ca1 100644 --- a/drivers/char/tty_io.c +++ b/drivers/char/tty_io.c @@ -235,11 +235,11 @@ static int check_tty_count(struct tty_struct *tty, const char *routine) struct list_head *p; int count = 0; - file_list_lock(); - list_for_each(p, &tty->tty_files) { + file_list_lock(&tty->tty_files); + list_for_each(p, &tty->tty_files.list) { count++; } - file_list_unlock(); + file_list_unlock(&tty->tty_files); if (tty->driver->type == TTY_DRIVER_TYPE_PTY && tty->driver->subtype == PTY_TYPE_SLAVE && tty->link && tty->link->count) @@ -554,9 +554,9 @@ static void do_tty_hangup(struct work_struct *work) spin_unlock(&redirect_lock); check_tty_count(tty, "do_tty_hangup"); - file_list_lock(); + file_list_lock(&tty->tty_files); /* This breaks for file handles being sent over AF_UNIX sockets ? */ - list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) { + list_for_each_entry(filp, &tty->tty_files.list, f_u.fu_list) { if (filp->f_op->write == redirected_tty_write) cons_filp = filp; if (filp->f_op->write != tty_write) @@ -565,7 +565,7 @@ static void do_tty_hangup(struct work_struct *work) tty_fasync(-1, filp, 0); /* can't block */ filp->f_op = &hung_up_tty_fops; } - file_list_unlock(); + file_list_unlock(&tty->tty_files); /* * FIXME! What are the locking issues here? This may me overdoing * things... This question is especially important now that we've @@ -1467,10 +1467,6 @@ static void release_one_tty(struct kref *kref) tty_driver_kref_put(driver); module_put(driver->owner); - file_list_lock(); - list_del_init(&tty->tty_files); - file_list_unlock(); - free_tty_struct(tty); } @@ -1678,7 +1674,7 @@ void tty_release_dev(struct file *filp) * - do_tty_hangup no longer sees this file descriptor as * something that needs to be handled for hangups. */ - file_kill(filp); + file_list_del(filp, &tty->tty_files); filp->private_data = NULL; /* @@ -1836,7 +1832,7 @@ got_driver: return PTR_ERR(tty); filp->private_data = tty; - file_move(filp, &tty->tty_files); + file_list_add(filp, &tty->tty_files); check_tty_count(tty, "tty_open"); if (tty->driver->type == TTY_DRIVER_TYPE_PTY && tty->driver->subtype == PTY_TYPE_MASTER) @@ -2779,7 +2775,7 @@ void initialize_tty_struct(struct tty_struct *tty, mutex_init(&tty->echo_lock); spin_lock_init(&tty->read_lock); spin_lock_init(&tty->ctrl_lock); - INIT_LIST_HEAD(&tty->tty_files); + init_file_list(&tty->tty_files); INIT_WORK(&tty->SAK_work, do_SAK_work); tty->driver = driver; diff --git a/fs/file_table.c b/fs/file_table.c index 334ce39..978f267 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -22,6 +22,7 @@ #include #include #include +#include #include @@ -30,9 +31,6 @@ struct files_stat_struct files_stat = { .max_files = NR_FILE }; -/* public. Not pretty! */ -__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock); - /* SLAB cache for file structures */ static struct kmem_cache *filp_cachep __read_mostly; @@ -285,7 +283,8 @@ void __fput(struct file *file) cdev_put(inode->i_cdev); fops_put(file->f_op); put_pid(file->f_owner.pid); - file_kill(file); + if (!special_file(inode->i_mode)) + file_list_del(file, &inode->i_files); if (file->f_mode & FMODE_WRITE) drop_file_write_access(file); file->f_path.dentry = NULL; @@ -352,50 +351,57 @@ void put_filp(struct file *file) { if (atomic_long_dec_and_test(&file->f_count)) { security_file_free(file); - file_kill(file); file_free(file); } } -void file_move(struct file *file, struct list_head *list) +void init_file_list(struct file_list *files) { - if (!list) - return; - file_list_lock(); - list_move(&file->f_u.fu_list, list); - file_list_unlock(); + INIT_LIST_HEAD(&files->list); + spin_lock_init(&files->lock); } -void file_kill(struct file *file) +void file_list_add(struct file *file, struct file_list *files) { - if (!list_empty(&file->f_u.fu_list)) { - file_list_lock(); - list_del_init(&file->f_u.fu_list); - file_list_unlock(); - } + file_list_lock(files); + list_add(&file->f_u.fu_list, &files->list); + file_list_unlock(files); +} +EXPORT_SYMBOL(file_list_add); + +void file_list_del(struct file *file, struct file_list *files) +{ + file_list_lock(files); + list_del_init(&file->f_u.fu_list); + file_list_unlock(files); } +EXPORT_SYMBOL(file_list_del); int fs_may_remount_ro(struct super_block *sb) { + struct inode *inode; struct file *file; /* Check that no files are currently opened for writing. */ - file_list_lock(); - list_for_each_entry(file, &sb->s_files, f_u.fu_list) { - struct inode *inode = file->f_path.dentry->d_inode; - - /* File with pending delete? */ - if (inode->i_nlink == 0) - goto too_bad; - - /* Writeable file? */ - if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE)) - goto too_bad; + spin_lock(&inode_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + file_list_lock(&inode->i_files); + list_for_each_entry(file, &inode->i_files.list, f_u.fu_list) { + /* File with pending delete? */ + if (inode->i_nlink == 0) + goto too_bad; + + /* Writeable file? */ + if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE)) + goto too_bad; + } + file_list_unlock(&inode->i_files); } - file_list_unlock(); + spin_unlock(&inode_lock); return 1; /* Tis' cool bro. */ too_bad: - file_list_unlock(); + file_list_unlock(&inode->i_files); + spin_unlock(&inode_lock); return 0; } @@ -408,33 +414,38 @@ too_bad: */ void mark_files_ro(struct super_block *sb) { + struct inode *inode; struct file *f; retry: - file_list_lock(); - list_for_each_entry(f, &sb->s_files, f_u.fu_list) { - struct vfsmount *mnt; - if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) - continue; - if (!file_count(f)) - continue; - if (!(f->f_mode & FMODE_WRITE)) - continue; - f->f_mode &= ~FMODE_WRITE; - if (file_check_writeable(f) != 0) - continue; - file_release_write(f); - mnt = mntget(f->f_path.mnt); - file_list_unlock(); - /* - * This can sleep, so we can't hold - * the file_list_lock() spinlock. - */ - mnt_drop_write(mnt); - mntput(mnt); - goto retry; + spin_lock(&inode_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + file_list_lock(&inode->i_files); + list_for_each_entry(f, &inode->i_files.list, f_u.fu_list) { + struct vfsmount *mnt; + if (!S_ISREG(f->f_path.dentry->d_inode->i_mode)) + continue; + if (!file_count(f)) + continue; + if (!(f->f_mode & FMODE_WRITE)) + continue; + f->f_mode &= ~FMODE_WRITE; + if (file_check_writeable(f) != 0) + continue; + file_release_write(f); + mnt = mntget(f->f_path.mnt); + file_list_unlock(&inode->i_files); + /* + * This can sleep, so we can't hold + * the file_list_lock() spinlock. + */ + mnt_drop_write(mnt); + mntput(mnt); + goto retry; + } + file_list_unlock(&inode->i_files); } - file_list_unlock(); + spin_unlock(&inode_lock); } void __init files_init(unsigned long mempages) diff --git a/fs/inode.c b/fs/inode.c index 9d26490..9d52d43 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -251,6 +251,7 @@ void inode_init_once(struct inode *inode) INIT_LIST_HEAD(&inode->inotify_watches); mutex_init(&inode->inotify_mutex); #endif + init_file_list(&inode->i_files); } EXPORT_SYMBOL(inode_init_once); diff --git a/fs/open.c b/fs/open.c index 7200e23..20c3fc0 100644 --- a/fs/open.c +++ b/fs/open.c @@ -828,7 +828,8 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, f->f_path.mnt = mnt; f->f_pos = 0; f->f_op = fops_get(inode->i_fop); - file_move(f, &inode->i_sb->s_files); + if (!special_file(inode->i_mode)) + file_list_add(f, &inode->i_files); error = security_dentry_open(f, cred); if (error) @@ -873,7 +874,8 @@ cleanup_all: mnt_drop_write(mnt); } } - file_kill(f); + if (!special_file(inode->i_mode)) + file_list_del(f, &inode->i_files); f->f_path.dentry = NULL; f->f_path.mnt = NULL; cleanup_file: diff --git a/fs/select.c b/fs/select.c index bd30fe8..99e4145 100644 --- a/fs/select.c +++ b/fs/select.c @@ -942,7 +942,6 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds, } #endif /* HAVE_SET_RESTORE_SIGMASK */ -#ifdef CONFIG_FILE_HOTPLUG static int unpoll_file_once(wait_queue_head_t *q, struct file *file) { unsigned long flags; @@ -971,4 +970,3 @@ void unpoll_file(wait_queue_head_t *q, struct file *file) schedule_timeout_uninterruptible(1); } EXPORT_SYMBOL(unpoll_file); -#endif diff --git a/fs/super.c b/fs/super.c index 2ea1586..477aeb4 100644 --- a/fs/super.c +++ b/fs/super.c @@ -65,7 +65,6 @@ static struct super_block *alloc_super(struct file_system_type *type) INIT_LIST_HEAD(&s->s_dirty); INIT_LIST_HEAD(&s->s_io); INIT_LIST_HEAD(&s->s_more_io); - INIT_LIST_HEAD(&s->s_files); INIT_LIST_HEAD(&s->s_instances); INIT_HLIST_HEAD(&s->s_anon); INIT_LIST_HEAD(&s->s_inodes); diff --git a/include/linux/fs.h b/include/linux/fs.h index 73242c3..5329fd6 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping) return mapping->i_mmap_writable != 0; } +struct file_list { + spinlock_t lock; + struct list_head list; +}; + /* * Use sequence counter to get consistent i_size on 32-bit processors. */ @@ -764,6 +769,7 @@ struct inode { struct list_head inotify_watches; /* watches on this inode */ struct mutex inotify_mutex; /* protects the watches list */ #endif + struct file_list i_files; unsigned long i_state; unsigned long dirtied_when; /* jiffies of first dirtying */ @@ -934,9 +940,15 @@ struct file { unsigned long f_mnt_write_state; #endif }; -extern spinlock_t files_lock; -#define file_list_lock() spin_lock(&files_lock); -#define file_list_unlock() spin_unlock(&files_lock); + +static inline void file_list_lock(struct file_list *files) +{ + spin_lock(&files->lock); +} +static inline void file_list_unlock(struct file_list *files) +{ + spin_unlock(&files->lock); +} #define get_file(x) atomic_long_inc(&(x)->f_count) #define file_count(x) atomic_long_read(&(x)->f_count) @@ -1333,7 +1345,6 @@ struct super_block { struct list_head s_io; /* parked for writeback */ struct list_head s_more_io; /* parked for more writeback */ struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */ - struct list_head s_files; /* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */ struct list_head s_dentry_lru; /* unused dentry lru */ int s_nr_dentry_unused; /* # of dentry on lru */ @@ -2163,8 +2174,9 @@ static inline void insert_inode_hash(struct inode *inode) { } extern struct file * get_empty_filp(void); -extern void file_move(struct file *f, struct list_head *list); -extern void file_kill(struct file *f); +extern void init_file_list(struct file_list *files); +extern void file_list_add(struct file *f, struct file_list *files); +extern void file_list_del(struct file *f, struct file_list *files); #ifdef CONFIG_BLOCK struct bio; extern void submit_bio(int, struct bio *); diff --git a/include/linux/tty.h b/include/linux/tty.h index fc39db9..7f04a5e 100644 --- a/include/linux/tty.h +++ b/include/linux/tty.h @@ -250,7 +250,7 @@ struct tty_struct { struct work_struct hangup_work; void *disc_data; void *driver_data; - struct list_head tty_files; + struct file_list tty_files; #define N_TTY_BUF_SIZE 4096 diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c index 2fcad7c..65afe36 100644 --- a/security/selinux/hooks.c +++ b/security/selinux/hooks.c @@ -2244,8 +2244,8 @@ static inline void flush_unauthorized_files(const struct cred *cred, tty = get_current_tty(); if (tty) { - file_list_lock(); - if (!list_empty(&tty->tty_files)) { + file_list_lock(&tty->tty_files); + if (!list_empty(&tty->tty_files.list)) { struct inode *inode; /* Revalidate access to controlling tty. @@ -2253,14 +2253,14 @@ static inline void flush_unauthorized_files(const struct cred *cred, than using file_has_perm, as this particular open file may belong to another process and we are only interested in the inode-based check here. */ - file = list_first_entry(&tty->tty_files, struct file, f_u.fu_list); + file = list_first_entry(&tty->tty_files.list, struct file, f_u.fu_list); inode = file->f_path.dentry->d_inode; if (inode_has_perm(cred, inode, FILE__READ | FILE__WRITE, NULL)) { drop_tty = 1; } } - file_list_unlock(); + file_list_unlock(&tty->tty_files); tty_kref_put(tty); } /* Reset controlling tty. */ -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id 0F67C6B00F6 for ; Wed, 3 Jun 2009 14:17:41 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:34 -0700 Message-Id: <1243893048-17031-9-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 09/23] vfs: Teach poll and select to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/select.c | 24 +++++++++++++++++------- include/linux/poll.h | 1 + 2 files changed, 18 insertions(+), 7 deletions(-) diff --git a/fs/select.c b/fs/select.c index 99e4145..fd68da0 100644 --- a/fs/select.c +++ b/fs/select.c @@ -416,10 +416,15 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) continue; file = fget_light(i, &fput_needed); if (file) { - f_op = file->f_op; - mask = DEFAULT_POLLMASK; - if (f_op && f_op->poll) - mask = (*f_op->poll)(file, retval ? NULL : wait); + mask = DEAD_POLLMASK; + if (file_hotplug_read_trylock(file)) { + f_op = file->f_op; + mask = DEFAULT_POLLMASK; + if (f_op && f_op->poll) + mask = (*f_op->poll)(file, retval ? NULL : wait); + + file_hotplug_read_unlock(file); + } fput_light(file, fput_needed); if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; @@ -684,9 +689,14 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait) file = fget_light(fd, &fput_needed); mask = POLLNVAL; if (file != NULL) { - mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) - mask = file->f_op->poll(file, pwait); + mask = DEAD_POLLMASK; + if (file_hotplug_read_trylock(file)) { + mask = DEFAULT_POLLMASK; + if (file->f_op && file->f_op->poll) + mask = file->f_op->poll(file, pwait); + + file_hotplug_read_unlock(file); + } /* Mask out unneeded events. */ mask &= pollfd->events | POLLERR | POLLHUP; fput_light(file, fput_needed); diff --git a/include/linux/poll.h b/include/linux/poll.h index d388620..f0512f4 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -22,6 +22,7 @@ #define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry)) #define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM) +#define DEAD_POLLMASK (DEFAULT_POLLMASK | POLLERR) struct poll_table_struct; -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id DAD446B00E6 for ; Wed, 3 Jun 2009 14:17:41 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:29 -0700 Message-Id: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Introduce the file_hotplug_lock to protect file->f_op, file->private, file->f_path from revoke operations. The file_hotplug_lock is used typically as: error = -EIO; if (!file_hotplug_read_trylock(file)) goto out; .... file_hotplug_read_unlock(file); In 5 subsystems sysfs, proc, and sysctl, tty, and sound we have support for modifing a file descriptor so that the underlying object can go away. In looking at the problem of pci hotunplug it appears that we potentially need that support for all file descriptors except ones talking to files on filesystems. Even for file descriptors referring to files, support for file the underlying object going away is interesting for implementing features like umount -f and sys_revoke. The implementations in sysfs, proc and sysctl are all very similar and are composed of several components. - A reference count to track that the file operations are being used. - An ability to flag the file as no longer being valid. - An ability to wait until the file operations are no longer being used. In addition for a complete solution we need: - A reliable way the file structures that we need to revoke. - To wait for but not tamper with ongoing file creation and cleanup. - A guarantee that all with user space controlled duration are removed. The file_hotplug_lock has a very unique implementation necessitated by the need to have no performance impact on existing code. Classic locking primitives and reference counting cause pipeline stalls, except for rcu which provides no ability to preventing reading a data structure while it is being updated. file_hotplug_lock keeps the overhead extremely low by dedicating a small amount of space in the task_struct to store the set of files the task is currently in the process of using. The revoke algorithm is simple: - Find a file on the file_list. If it is dying or being created come back later * Take a reference to the file, ensuring it does not get freed while the revoke code accesses it. * Block out new usages of fields guarded by file_hotplug_lock. * Kick the underlying implemenation to wake up functions that are potentially blocked indefinitely. * Wait until there are no tasks holding file_hotplug_read_lock * Release the file specific data. * Drop the file ref count. - Repeat until the file list is empty. The implication of this implementation is that all revoked files will behave exactly the same way, except for policy controlled by flags in fmode. The expected behaivor of revoked is close succeeds all other operations return -EIO. Except for the read on ttys this matches the historical bsd behavior. Approriate exports are present so modular character devices can use the file_list Signed-off-by: Eric W. Biederman --- Documentation/filesystems/vfs.txt | 5 + fs/Kconfig | 4 + fs/file_table.c | 166 ++++++++++++++++++++++++++++++++++-- fs/open.c | 6 ++ include/linux/fs.h | 25 ++++++- include/linux/sched.h | 7 ++ 6 files changed, 202 insertions(+), 11 deletions(-) diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f49eecf..d220fd5 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -806,6 +806,11 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + dead: Called by the VFS to notify a file that it has been killed. + Typically this is used to wake up poll, read or other blocking + file methods, that could be indefinitely waiting for something + to happen. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special diff --git a/fs/Kconfig b/fs/Kconfig index 9f7270f..2fb86b0 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -265,4 +265,8 @@ endif source "fs/nls/Kconfig" source "fs/dlm/Kconfig" +config FILE_HOTPLUG + bool + default n + endmenu diff --git a/fs/file_table.c b/fs/file_table.c index 978f267..9db3031 100644 --- a/fs/file_table.c +++ b/fs/file_table.c @@ -23,6 +23,7 @@ #include #include #include +#include #include @@ -201,7 +202,7 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry, file->f_path.dentry = dentry; file->f_path.mnt = mntget(mnt); file->f_mapping = dentry->d_inode->i_mapping; - file->f_mode = mode; + file->f_mode = mode | FMODE_OPENED; file->f_op = fop; /* @@ -252,17 +253,12 @@ void drop_file_write_access(struct file *file) } EXPORT_SYMBOL_GPL(drop_file_write_access); -/* __fput is called from task context when aio completion releases the last - * last use of a struct file *. Do not use otherwise. - */ -void __fput(struct file *file) +static void frelease(struct file *file) { struct dentry *dentry = file->f_path.dentry; struct vfsmount *mnt = file->f_path.mnt; struct inode *inode = dentry->d_inode; - might_sleep(); - fsnotify_close(file); /* * The function eventpoll_release() should be the first called @@ -277,23 +273,38 @@ void __fput(struct file *file) } if (file->f_op && file->f_op->release) file->f_op->release(inode, file); - security_file_free(file); ima_file_free(file); if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL)) cdev_put(inode->i_cdev); fops_put(file->f_op); - put_pid(file->f_owner.pid); if (!special_file(inode->i_mode)) file_list_del(file, &inode->i_files); if (file->f_mode & FMODE_WRITE) drop_file_write_access(file); file->f_path.dentry = NULL; file->f_path.mnt = NULL; - file_free(file); + file->f_mapping = NULL; + file->f_op = NULL; + file->private_data = NULL; dput(dentry); mntput(mnt); } +/* __fput is called from task context when aio completion releases the last + * last use of a struct file *. Do not use otherwise. + */ +void __fput(struct file *file) +{ + might_sleep(); + + if (likely(!(file->f_mode & FMODE_DEAD))) + frelease(file); + + security_file_free(file); + put_pid(file->f_owner.pid); + file_free(file); +} + struct file *fget(unsigned int fd) { struct file *file; @@ -360,6 +371,7 @@ void init_file_list(struct file_list *files) INIT_LIST_HEAD(&files->list); spin_lock_init(&files->lock); } +EXPORT_SYMBOL(init_file_list); void file_list_add(struct file *file, struct file_list *files) { @@ -377,6 +389,140 @@ void file_list_del(struct file *file, struct file_list *files) } EXPORT_SYMBOL(file_list_del); +#ifdef CONFIG_FILE_HOTPLUG + +static bool file_in_use(struct file *file) +{ + struct task_struct *leader, *task; + bool in_use = false; + int i; + + rcu_read_lock(); + do_each_thread(leader, task) { + for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) { + if (task->file_hotplug_lock[i] == file) { + in_use = true; + goto found; + } + } + } while_each_thread(leader, task); +found: + rcu_read_unlock(); + return in_use; +} + +static int revoke_file(struct file *file) +{ + /* Must be called with f_count held and FMODE_OPENED set */ + fmode_t mode; + + if (!(file->f_mode & FMODE_REVOKE)) + return -EINVAL; + + /* + * Tell everyone this file is dead. + */ + spin_lock(&file->f_ep_lock); + mode = file->f_mode; + file->f_mode |= FMODE_DEAD; + spin_unlock(&file->f_ep_lock); + if (mode & FMODE_DEAD) + return -EIO; + + /* + * Notify the file we have killed it. + */ + if (file->f_op->dead) + file->f_op->dead(file); + + /* + * Wait until there are no more callers in the file operations. + */ + if (file_in_use(file)) { + do { + schedule_timeout_uninterruptible(1); + } while (file_in_use(file)); + } + + revoke_file_mappings(file); + frelease(file); + + return 0; +} + +int revoke_file_list(struct file_list *files) +{ + struct file *file; + int error = 0; + int empty; + +restart: + file_list_lock(files); + list_for_each_entry(file, &files->list, f_u.fu_list) { + + /* Don't touch files that have not yet been fully opened */ + if (!(file->f_mode & FMODE_OPENED)) + continue; + + /* Ensure I am looking at the file after it was opened */ + smp_rmb(); + + /* Don't touch files that are in the final stages of being closed. */ + if (file_count(file) == 0) + continue; + + /* Get a reference to the file */ + if (!atomic_long_inc_not_zero(&file->f_count)) + continue; + + file_list_unlock(files); + + error = revoke_file(file); + fput(file); + + if (unlikely(error)) + goto out; + goto restart; + } + empty = list_empty(&files->list); + file_list_unlock(files); + /* + * If the file list had files we can't touch sleep a little while + * and check again. + */ + if (!empty) { + schedule_timeout_uninterruptible(1); + goto restart; + } +out: + return error; +} +EXPORT_SYMBOL(revoke_file_list); + +int __lockfunc file_hotplug_read_trylock(struct file *file) +{ + fmode_t mode = file->f_mode; + int locked = 0; + if (!(mode & FMODE_DEAD)) { + struct task_struct *tsk = current; + int pos = tsk->file_hotplug_lock_depth; + if (likely(pos < MAX_FILE_HOTPLUG_LOCK_DEPTH)) { + tsk->file_hotplug_lock_depth = pos + 1; + tsk->file_hotplug_lock[pos] = file; + locked = 1; + } + } + return locked; +} + +void __lockfunc file_hotplug_read_unlock(struct file *file) +{ + struct task_struct *tsk = current; + tsk->file_hotplug_lock[--(tsk->file_hotplug_lock_depth)] = NULL; +} + +#endif /* CONFIG_FILE_HOTPLUG */ + int fs_may_remount_ro(struct super_block *sb) { struct inode *inode; diff --git a/fs/open.c b/fs/open.c index 20c3fc0..d0b2433 100644 --- a/fs/open.c +++ b/fs/open.c @@ -809,6 +809,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, const struct cred *cred) { struct inode *inode; + fmode_t opened_fmode; int error; f->f_flags = flags; @@ -857,6 +858,11 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt, } } + opened_fmode = f->f_mode | FMODE_OPENED; + /* Ensure revoke_file_list sees the opened file */ + smp_wmb(); + f->f_mode = opened_fmode; + return f; cleanup_all: diff --git a/include/linux/fs.h b/include/linux/fs.h index 5329fd6..f7f4c46 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -87,6 +87,13 @@ struct inodes_stat_t { */ #define FMODE_NOCMTIME ((__force fmode_t)2048) +/* File has successfully been opened */ +#define FMODE_OPENED ((__force fmode_t)4096) +/* File supports being revoked */ +#define FMODE_REVOKE ((__force fmode_t)8192) +/* File is dead (has been revoked) */ +#define FMODE_DEAD ((__force fmode_t)16384) + /* * The below are the various read and write types that we support. Some of * them include behavioral modifiers that send information down to the @@ -903,6 +910,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index) #define FILE_MNT_WRITE_RELEASED 2 struct file { + /* file_hotplug_lock f_op, private, f_path, f_mapping */ /* * fu_list becomes invalid after file_free is called and queued via * fu_rcuhead for RCU freeing @@ -935,12 +943,26 @@ struct file { /* Used by fs/eventpoll.c to link all the hooks to this file */ struct list_head f_ep_links; #endif /* #ifdef CONFIG_EPOLL */ - struct address_space *f_mapping; + struct address_space *f_mapping; /* file_hotplug_lock or mmap_sem */ #ifdef CONFIG_DEBUG_WRITECOUNT unsigned long f_mnt_write_state; #endif }; +#ifdef CONFIG_FILE_HOTPLUG +extern int file_hotplug_read_trylock(struct file *file); +extern void file_hotplug_read_unlock(struct file *file); +extern int revoke_file_list(struct file_list *files); +#else +static inline int file_hotplug_read_trylock(struct file *file) +{ + return 1; +} +static inline void file_hotplug_read_unlock(struct file *file) +{ +} +#endif + static inline void file_list_lock(struct file_list *files) { spin_lock(&files->lock); @@ -1514,6 +1536,7 @@ struct file_operations { ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int); int (*setlease)(struct file *, long, struct file_lock **); + void (*dead)(struct file *); }; struct inode_operations { diff --git a/include/linux/sched.h b/include/linux/sched.h index b4c38bc..bbf1616 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1302,6 +1302,13 @@ struct task_struct { struct irqaction *irqaction; #endif +/* File hotplug lock */ +#ifdef CONFIG_FILE_HOTPLUG +#define MAX_FILE_HOTPLUG_LOCK_DEPTH 4U + int file_hotplug_lock_depth; + struct file *file_hotplug_lock[MAX_FILE_HOTPLUG_LOCK_DEPTH]; +#endif + /* Protection of the PI data structures: */ spinlock_t pi_lock; -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id 23F8E6B00E7 for ; Wed, 3 Jun 2009 14:17:42 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:31 -0700 Message-Id: <1243893048-17031-6-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 06/23] vfs: Teach read/write to use file_hotplug_read_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/compat.c | 16 +++++++++++- fs/read_write.c | 70 +++++++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 72 insertions(+), 14 deletions(-) diff --git a/fs/compat.c b/fs/compat.c index 25be41c..dad9957 100644 --- a/fs/compat.c +++ b/fs/compat.c @@ -1196,12 +1196,18 @@ static size_t compat_readv(struct file *file, if (!(file->f_mode & FMODE_READ)) goto out; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + ret = -EINVAL; if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) - goto out; + goto out_unlock; ret = compat_do_readv_writev(READ, file, vec, vlen, pos); +out_unlock: + file_hotplug_read_unlock(file); out: if (ret > 0) add_rchar(current, ret); @@ -1253,12 +1259,18 @@ static size_t compat_writev(struct file *file, if (!(file->f_mode & FMODE_WRITE)) goto out; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + ret = -EINVAL; if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) - goto out; + goto out_unlock; ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos); +out_unlock: + file_hotplug_read_unlock(file); out: if (ret > 0) add_wchar(current, ret); diff --git a/fs/read_write.c b/fs/read_write.c index c9511ce..718baea 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -288,12 +288,18 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) { ssize_t ret; + ret = -EBADF; if (!(file->f_mode & FMODE_READ)) - return -EBADF; + goto out; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + ret = -EINVAL; if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read)) - return -EINVAL; + goto out_unlock; + ret = -EFAULT; if (unlikely(!access_ok(VERIFY_WRITE, buf, count))) - return -EFAULT; + goto out_unlock; ret = rw_verify_area(READ, file, pos, count); if (ret >= 0) { @@ -309,6 +315,9 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos) inc_syscr(current); } +out_unlock: + file_hotplug_read_unlock(file); +out: return ret; } @@ -343,12 +352,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ { ssize_t ret; + ret = -EBADF; if (!(file->f_mode & FMODE_WRITE)) - return -EBADF; + goto out; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + ret = -EINVAL; if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write)) - return -EINVAL; + goto out_unlock; + ret = -EFAULT; if (unlikely(!access_ok(VERIFY_READ, buf, count))) - return -EFAULT; + goto out_unlock; ret = rw_verify_area(WRITE, file, pos, count); if (ret >= 0) { @@ -364,6 +379,9 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_ inc_syscw(current); } +out_unlock: + file_hotplug_read_unlock(file); +out: return ret; } @@ -676,12 +694,26 @@ out: ssize_t vfs_readv(struct file *file, const struct iovec __user *vec, unsigned long vlen, loff_t *pos) { + ssize_t ret; + + ret = -EBADF; if (!(file->f_mode & FMODE_READ)) - return -EBADF; + goto out; + + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + + ret = -EINVAL; if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read)) - return -EINVAL; + goto out_unlock; + + ret = do_readv_writev(READ, file, vec, vlen, pos); - return do_readv_writev(READ, file, vec, vlen, pos); +out_unlock: + file_hotplug_read_unlock(file); +out: + return ret; } EXPORT_SYMBOL(vfs_readv); @@ -689,12 +721,26 @@ EXPORT_SYMBOL(vfs_readv); ssize_t vfs_writev(struct file *file, const struct iovec __user *vec, unsigned long vlen, loff_t *pos) { + ssize_t ret; + + ret = -EBADF; if (!(file->f_mode & FMODE_WRITE)) - return -EBADF; + goto out; + + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + + ret = -EINVAL; if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write)) - return -EINVAL; + goto out_unlock; - return do_readv_writev(WRITE, file, vec, vlen, pos); + ret = do_readv_writev(WRITE, file, vec, vlen, pos); + +out_unlock: + file_hotplug_read_unlock(file); +out: + return ret; } EXPORT_SYMBOL(vfs_writev); -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 8DFF46B00C2 for ; Wed, 3 Jun 2009 14:17:42 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:30 -0700 Message-Id: <1243893048-17031-5-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 05/23] vfs: Teach lseek to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/read_write.c | 24 +++++++++++++++++------- 1 files changed, 17 insertions(+), 7 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 9d1e76b..c9511ce 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -136,14 +136,24 @@ EXPORT_SYMBOL(default_llseek); loff_t vfs_llseek(struct file *file, loff_t offset, int origin) { loff_t (*fn)(struct file *, loff_t, int); + loff_t retval = -ESPIPE; - fn = no_llseek; - if (file->f_mode & FMODE_LSEEK) { - fn = default_llseek; - if (file->f_op && file->f_op->llseek) - fn = file->f_op->llseek; - } - return fn(file, offset, origin); + if (!(file->f_mode & FMODE_LSEEK)) + goto out; + + retval = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; + + fn = default_llseek; + if (file->f_op && file->f_op->llseek) + fn = file->f_op->llseek; + + retval = fn(file, offset, origin); + + file_hotplug_read_unlock(file); +out: + return retval; } EXPORT_SYMBOL(vfs_llseek); -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id BB5405F0006 for ; Wed, 3 Jun 2009 14:17:42 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:39 -0700 Message-Id: <1243893048-17031-14-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 14/23] vfs: Teach flock to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/locks.c | 8 +++++++- 1 files changed, 7 insertions(+), 1 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index ec3deea..f74794e 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -1584,9 +1584,13 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) !(filp->f_mode & (FMODE_READ|FMODE_WRITE))) goto out_putf; + error = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_putf; + error = flock_make_lock(filp, &lock, cmd); if (error) - goto out_putf; + goto out_unlock; if (can_sleep) lock->fl_flags |= FL_SLEEP; @@ -1604,6 +1608,8 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd) out_free: locks_free_lock(lock); + out_unlock: + file_hotplug_read_unlock(filp); out_putf: fput(filp); out: -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id A1F156B00F8 for ; Wed, 3 Jun 2009 14:17:42 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:32 -0700 Message-Id: <1243893048-17031-7-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/read_write.c | 28 +++++++++---- fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++----------------- 2 files changed, 95 insertions(+), 44 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index 718baea..c473d74 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, goto out; if (!(in_file->f_mode & FMODE_READ)) goto fput_in; + retval = -EIO; + if (!file_hotplug_read_trylock(in_file)) + goto fput_in; retval = -EINVAL; in_inode = in_file->f_path.dentry->d_inode; if (!in_inode) - goto fput_in; + goto unlock_in; if (!in_file->f_op || !in_file->f_op->splice_read) - goto fput_in; + goto unlock_in; retval = -ESPIPE; if (!ppos) ppos = &in_file->f_pos; else if (!(in_file->f_mode & FMODE_PREAD)) - goto fput_in; + goto unlock_in; retval = rw_verify_area(READ, in_file, ppos, count); if (retval < 0) - goto fput_in; + goto unlock_in; count = retval; /* @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, retval = -EBADF; out_file = fget_light(out_fd, &fput_needed_out); if (!out_file) - goto fput_in; + goto unlock_in; if (!(out_file->f_mode & FMODE_WRITE)) goto fput_out; + retval = -EIO; + if (!file_hotplug_read_trylock(out_file)) + goto fput_out; retval = -EINVAL; if (!out_file->f_op || !out_file->f_op->sendpage) - goto fput_out; + goto unlock_out; out_inode = out_file->f_path.dentry->d_inode; retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count); if (retval < 0) - goto fput_out; + goto unlock_out; count = retval; if (!max) @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, pos = *ppos; retval = -EINVAL; if (unlikely(pos < 0)) - goto fput_out; + goto unlock_out; if (unlikely(pos + count > max)) { retval = -EOVERFLOW; if (pos >= max) - goto fput_out; + goto unlock_out; count = max - pos; } @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, if (*ppos > max) retval = -EOVERFLOW; +unlock_out: + file_hotplug_read_unlock(out_file); fput_out: fput_light(out_file, fput_needed_out); +unlock_in: + file_hotplug_read_unlock(in_file); fput_in: fput_light(in_file, fput_needed_in); out: diff --git a/fs/splice.c b/fs/splice.c index 666953d..fc6b3a5 100644 --- a/fs/splice.c +++ b/fs/splice.c @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov, error = -EBADF; file = fget_light(fd, &fput); - if (file) { - if (file->f_mode & FMODE_WRITE) - error = vmsplice_to_pipe(file, iov, nr_segs, flags); - else if (file->f_mode & FMODE_READ) - error = vmsplice_to_user(file, iov, nr_segs, flags); + if (!file) + goto out; - fput_light(file, fput); - } + if (!file_hotplug_read_trylock(file)) + goto fput_file; + if (file->f_mode & FMODE_WRITE) + error = vmsplice_to_pipe(file, iov, nr_segs, flags); + else if (file->f_mode & FMODE_READ) + error = vmsplice_to_user(file, iov, nr_segs, flags); + + file_hotplug_read_unlock(file); +fput_file: + fput_light(file, fput); +out: return error; } @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in, error = -EBADF; in = fget_light(fd_in, &fput_in); - if (in) { - if (in->f_mode & FMODE_READ) { - out = fget_light(fd_out, &fput_out); - if (out) { - if (out->f_mode & FMODE_WRITE) - error = do_splice(in, off_in, - out, off_out, - len, flags); - fput_light(out, fput_out); - } - } + if (!in) + goto out; - fput_light(in, fput_in); - } + if (!(in->f_mode & FMODE_READ)) + goto fput_in; + + error = -EIO; + if (!file_hotplug_read_trylock(in)) + goto fput_in; + + error = -EBADF; + out = fget_light(fd_out, &fput_out); + if (!out) + goto unlock_in; + + if (!(out->f_mode & FMODE_WRITE)) + goto fput_out; + + error = -EIO; + if (!file_hotplug_read_trylock(out)) + goto fput_out; + + error = do_splice(in, off_in, out, off_out, len, flags); + file_hotplug_read_unlock(out); +fput_out: + fput_light(out, fput_out); +unlock_in: + file_hotplug_read_unlock(in); +fput_in: + fput_light(in, fput_in); + +out: return error; } @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len, SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) { - struct file *in; - int error, fput_in; + struct file *in, *out; + int error, fput_in, fput_out; if (unlikely(!len)) return 0; error = -EBADF; in = fget_light(fdin, &fput_in); - if (in) { - if (in->f_mode & FMODE_READ) { - int fput_out; - struct file *out = fget_light(fdout, &fput_out); - - if (out) { - if (out->f_mode & FMODE_WRITE) - error = do_tee(in, out, len, flags); - fput_light(out, fput_out); - } - } - fput_light(in, fput_in); - } + if (!in) + goto out; + + if (!(in->f_mode & FMODE_READ)) + goto unlock_in; + error = -EIO; + if (!file_hotplug_read_trylock(in)) + goto fput_in; + + error = -EBADF; + out = fget_light(fdout, &fput_out); + if (!out) + goto unlock_in; + + if (!(out->f_mode & FMODE_WRITE)) + goto fput_out; + + if (!file_hotplug_read_trylock(out)) + goto fput_out; + + error = do_tee(in, out, len, flags); + + file_hotplug_read_unlock(out); +fput_out: + fput_light(out, fput_out); +unlock_in: + file_hotplug_read_unlock(in); +fput_in: + fput_light(in, fput_in); +out: return error; } -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id E64885F0008 for ; Wed, 3 Jun 2009 14:17:42 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:36 -0700 Message-Id: <1243893048-17031-11-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 11/23] mm: Teach mmap to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- mm/mmap.c | 78 +++++++++++++++++++++++++++++++++++++++-------------------- mm/nommu.c | 21 +++++++++++++++- 2 files changed, 71 insertions(+), 28 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index 6b7b1a9..f13251a 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -914,9 +914,13 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, struct mm_struct * mm = current->mm; struct inode *inode; unsigned int vm_flags; - int error; + unsigned long retval; unsigned long reqprot = prot; + retval = -EIO; + if (file && !file_hotplug_read_trylock(file)) + goto out; + /* * Does the application expect PROT_READ to imply PROT_EXEC? * @@ -927,35 +931,40 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC))) prot |= PROT_EXEC; + retval = -EINVAL; if (!len) - return -EINVAL; + goto out_unlock; if (!(flags & MAP_FIXED)) addr = round_hint_to_min(addr); - error = arch_mmap_check(addr, len, flags); - if (error) - return error; + retval = arch_mmap_check(addr, len, flags); + if (retval) + goto out_unlock; /* Careful about overflows.. */ + retval = -ENOMEM; len = PAGE_ALIGN(len); if (!len || len > TASK_SIZE) - return -ENOMEM; + goto out_unlock; /* offset overflow? */ + retval = -EOVERFLOW; if ((pgoff + (len >> PAGE_SHIFT)) < pgoff) - return -EOVERFLOW; + goto out_unlock; /* Too many mappings? */ + retval = -ENOMEM; if (mm->map_count > sysctl_max_map_count) - return -ENOMEM; + goto out_unlock; /* Obtain the address to map to. we verify (or select) it and ensure * that it represents a valid section of the address space. */ addr = get_unmapped_area(file, addr, len, pgoff, flags); + retval = addr; if (addr & ~PAGE_MASK) - return addr; + goto out_unlock; /* Do simple checking here so the lower-level routines won't have * to. we assume access permissions have been handled by the open @@ -965,8 +974,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC; if (flags & MAP_LOCKED) { + retval = -EPERM; if (!can_do_mlock()) - return -EPERM; + goto out_unlock; vm_flags |= VM_LOCKED; } @@ -977,8 +987,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, locked += mm->locked_vm; lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur; lock_limit >>= PAGE_SHIFT; + retval = -EAGAIN; if (locked > lock_limit && !capable(CAP_IPC_LOCK)) - return -EAGAIN; + goto out_unlock; } inode = file ? file->f_path.dentry->d_inode : NULL; @@ -986,21 +997,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, if (file) { switch (flags & MAP_TYPE) { case MAP_SHARED: + retval = -EACCES; if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE)) - return -EACCES; + goto out_unlock; /* * Make sure we don't allow writing to an append-only * file.. */ + retval = -EACCES; if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE)) - return -EACCES; + goto out_unlock; /* * Make sure there are no mandatory locks on the file. */ + retval = -EAGAIN; if (locks_verify_locked(inode)) - return -EAGAIN; + goto out_unlock; vm_flags |= VM_SHARED | VM_MAYSHARE; if (!(file->f_mode & FMODE_WRITE)) @@ -1008,20 +1022,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, /* fall through */ case MAP_PRIVATE: + retval = -EACCES; if (!(file->f_mode & FMODE_READ)) - return -EACCES; + goto out_unlock; if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) { + retval = -EPERM; if (vm_flags & VM_EXEC) - return -EPERM; + goto out_unlock; vm_flags &= ~VM_MAYEXEC; } + retval = -ENODEV; if (!file->f_op || !file->f_op->mmap) - return -ENODEV; + goto out_unlock; break; default: - return -EINVAL; + retval = -EINVAL; + goto out_unlock; } } else { switch (flags & MAP_TYPE) { @@ -1039,18 +1057,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, pgoff = addr >> PAGE_SHIFT; break; default: - return -EINVAL; + retval = -EINVAL; + goto out_unlock; } } - error = security_file_mmap(file, reqprot, prot, flags, addr, 0); - if (error) - return error; - error = ima_file_mmap(file, prot); - if (error) - return error; + retval = security_file_mmap(file, reqprot, prot, flags, addr, 0); + if (retval) + goto out_unlock; + retval = ima_file_mmap(file, prot); + if (retval) + goto out_unlock; + retval = mmap_region(file, addr, len, flags, vm_flags, pgoff); - return mmap_region(file, addr, len, flags, vm_flags, pgoff); +out_unlock: + if (file) + file_hotplug_read_unlock(file); +out: + return retval; } EXPORT_SYMBOL(do_mmap_pgoff); diff --git a/mm/nommu.c b/mm/nommu.c index b571ef7..08038b7 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -1165,7 +1165,7 @@ enomem: /* * handle mapping creation for uClinux */ -unsigned long do_mmap_pgoff(struct file *file, +static unsigned long __do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, @@ -1402,6 +1402,25 @@ error_getting_region: show_free_areas(); return -ENOMEM; } + +unsigned long do_mmap_pgoff(struct file *file, + unsigned long addr, + unsigned long len, + unsigned long prot, + unsigned long flags, + unsigned long pgoff) +{ + unsigned long result = -EIO; + if (file && !file_hotplug_read_trylock(file)) + goto out; + + result = __do_mmap_pgoff(file, addr, len, prot, flags, pgoff); + + if (file) + file_hotplug_read_unlock(file); +out: + return result; +} EXPORT_SYMBOL(do_mmap_pgoff); /* -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with SMTP id 7A8225F0019 for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:40 -0700 Message-Id: <1243893048-17031-15-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 15/23] vfs: Teach fallocate, and filp_close to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/open.c | 22 +++++++++++++++++----- 1 files changed, 17 insertions(+), 5 deletions(-) diff --git a/fs/open.c b/fs/open.c index d0b2433..83d6369 100644 --- a/fs/open.c +++ b/fs/open.c @@ -398,19 +398,22 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len) goto out; if (!(file->f_mode & FMODE_WRITE)) goto out_fput; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_fput; /* * Revalidate the write permissions, in case security policy has * changed since the files were opened. */ ret = security_file_permission(file, MAY_WRITE); if (ret) - goto out_fput; + goto out_unlock; inode = file->f_path.dentry->d_inode; ret = -ESPIPE; if (S_ISFIFO(inode->i_mode)) - goto out_fput; + goto out_unlock; ret = -ENODEV; /* @@ -418,18 +421,20 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len) * for directories or not. */ if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode)) - goto out_fput; + goto out_unlock; ret = -EFBIG; /* Check for wrap through zero too */ if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) - goto out_fput; + goto out_unlock; if (inode->i_op->fallocate) ret = inode->i_op->fallocate(inode, mode, offset, len); else ret = -EOPNOTSUPP; +out_unlock: + file_hotplug_read_unlock(file); out_fput: fput(file); out: @@ -1101,18 +1106,25 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, int, mode) */ int filp_close(struct file *filp, fl_owner_t id) { - int retval = 0; + int retval; if (!file_count(filp)) { printk(KERN_ERR "VFS: Close: file count is 0\n"); return 0; } + retval = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_fput; + + retval = 0; if (filp->f_op && filp->f_op->flush) retval = filp->f_op->flush(filp, id); dnotify_flush(filp, id); locks_remove_posix(filp, id); + file_hotplug_read_unlock(filp); +out_fput: fput(filp); return retval; } -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 9DDD35F001F for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:38 -0700 Message-Id: <1243893048-17031-13-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 13/23] vfs: Teach ioctl to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/compat_ioctl.c | 14 ++++++++++---- fs/ioctl.c | 8 +++++++- 2 files changed, 17 insertions(+), 5 deletions(-) diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c index b83f6bc..fa654c5 100644 --- a/fs/compat_ioctl.c +++ b/fs/compat_ioctl.c @@ -2796,10 +2796,14 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, if (!filp) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_fput; + /* RED-PEN how should LSM module know it's handling 32bit? */ error = security_file_ioctl(filp, cmd, arg); if (error) - goto out_fput; + goto out_unlock; /* * To allow the compat_ioctl handlers to be self contained @@ -2825,7 +2829,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, if (filp->f_op && filp->f_op->compat_ioctl) { error = filp->f_op->compat_ioctl(filp, cmd, arg); if (error != -ENOIOCTLCMD) - goto out_fput; + goto out_unlock; } if (!filp->f_op || @@ -2853,18 +2857,20 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd, error = -EINVAL; } - goto out_fput; + goto out_unlock; found_handler: if (t->handler) { lock_kernel(); error = t->handler(fd, cmd, arg, filp); unlock_kernel(); - goto out_fput; + goto out_unlock; } do_ioctl: error = do_vfs_ioctl(filp, fd, cmd, arg); + out_unlock: + file_hotplug_read_unlock(filp); out_fput: fput_light(filp, fput_needed); out: diff --git a/fs/ioctl.c b/fs/ioctl.c index 82d9c42..2dad7ba 100644 --- a/fs/ioctl.c +++ b/fs/ioctl.c @@ -577,11 +577,17 @@ SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg) if (!filp) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(filp)) + goto out_fput; + error = security_file_ioctl(filp, cmd, arg); if (error) - goto out_fput; + goto out_unlock; error = do_vfs_ioctl(filp, fd, cmd, arg); + out_unlock: + file_hotplug_read_unlock(filp); out_fput: fput_light(filp, fput_needed); out: -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 91BA65F001B for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:41 -0700 Message-Id: <1243893048-17031-16-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 16/23] vfs: Teach fstatfs, fstatfs64, ftruncate, fchdir, fchmod, fchown to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/open.c | 47 +++++++++++++++++++++++++++++++++++++++++------ 1 files changed, 41 insertions(+), 6 deletions(-) diff --git a/fs/open.c b/fs/open.c index 83d6369..354646b 100644 --- a/fs/open.c +++ b/fs/open.c @@ -167,9 +167,14 @@ SYSCALL_DEFINE2(fstatfs, unsigned int, fd, struct statfs __user *, buf) file = fget(fd); if (!file) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_putf; error = vfs_statfs_native(file->f_path.dentry, &tmp); if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) error = -EFAULT; + file_hotplug_read_unlock(file); +out_putf: fput(file); out: return error; @@ -188,9 +193,14 @@ SYSCALL_DEFINE3(fstatfs64, unsigned int, fd, size_t, sz, struct statfs64 __user file = fget(fd); if (!file) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_putf; error = vfs_statfs64(file->f_path.dentry, &tmp); if (!error && copy_to_user(buf, &tmp, sizeof(tmp))) error = -EFAULT; + file_hotplug_read_unlock(file); +out_putf: fput(file); out: return error; @@ -309,6 +319,10 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small) if (!file) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_putf; + /* explicitly opened as large or we are on 64-bit box */ if (file->f_flags & O_LARGEFILE) small = 0; @@ -317,16 +331,16 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small) inode = dentry->d_inode; error = -EINVAL; if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE)) - goto out_putf; + goto out_unlock; error = -EINVAL; /* Cannot ftruncate over 2^31 bytes without large file support */ if (small && length > MAX_NON_LFS) - goto out_putf; + goto out_unlock; error = -EPERM; if (IS_APPEND(inode)) - goto out_putf; + goto out_unlock; error = locks_verify_truncate(inode, file, length); if (!error) @@ -334,6 +348,9 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small) ATTR_MTIME|ATTR_CTIME); if (!error) error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file); + +out_unlock: + file_hotplug_read_unlock(file); out_putf: fput(file); out: @@ -560,15 +577,21 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd) if (!file) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_putf; + inode = file->f_path.dentry->d_inode; error = -ENOTDIR; if (!S_ISDIR(inode->i_mode)) - goto out_putf; + goto out_unlock; error = inode_permission(inode, MAY_EXEC | MAY_ACCESS); if (!error) set_fs_pwd(current->fs, &file->f_path); +out_unlock: + file_hotplug_read_unlock(file); out_putf: fput(file); out: @@ -612,6 +635,10 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode) if (!file) goto out; + err = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_putf; + dentry = file->f_path.dentry; inode = dentry->d_inode; @@ -619,7 +646,7 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode) err = mnt_want_write_file(file); if (err) - goto out_putf; + goto out_unlock; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) mode = inode->i_mode; @@ -628,6 +655,8 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode) err = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); mnt_drop_write(file->f_path.mnt); +out_unlock: + file_hotplug_read_unlock(file); out_putf: fput(file); out: @@ -766,13 +795,19 @@ SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group) if (!file) goto out; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_fput; + error = mnt_want_write_file(file); if (error) - goto out_fput; + goto out_unlock; dentry = file->f_path.dentry; audit_inode(NULL, dentry); error = chown_common(dentry, user, group); mnt_drop_write(file->f_path.mnt); +out_unlock: + file_hotplug_read_unlock(file); out_fput: fput(file); out: -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id A05FA5F002A for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:33 -0700 Message-Id: <1243893048-17031-8-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 08/23] vfs: Teach readdir to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/readdir.c | 20 +++++++++++++++----- 1 files changed, 15 insertions(+), 5 deletions(-) diff --git a/fs/readdir.c b/fs/readdir.c index 7723401..2e147cf 100644 --- a/fs/readdir.c +++ b/fs/readdir.c @@ -21,18 +21,26 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) { - struct inode *inode = file->f_path.dentry->d_inode; - int res = -ENOTDIR; - if (!file->f_op || !file->f_op->readdir) + struct inode *inode; + int res; + + res = -EIO; + if (!file_hotplug_read_trylock(file)) goto out; + inode = file->f_path.dentry->d_inode; + + res = -ENOTDIR; + if (!file->f_op || !file->f_op->readdir) + goto out_unlock; + res = security_file_permission(file, MAY_READ); if (res) - goto out; + goto out_unlock; res = mutex_lock_killable(&inode->i_mutex); if (res) - goto out; + goto out_unlock; res = -ENOENT; if (!IS_DEADDIR(inode)) { @@ -40,6 +48,8 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf) file_accessed(file); } mutex_unlock(&inode->i_mutex); +out_unlock: + file_hotplug_read_unlock(file); out: return res; } -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail144.messagelabs.com (mail144.messagelabs.com [216.82.254.51]) by kanga.kvack.org (Postfix) with SMTP id C17A75F002C for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:35 -0700 Message-Id: <1243893048-17031-10-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 10/23] vfs: Teach do_path_lookup to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/namei.c | 11 +++++++++-- 1 files changed, 9 insertions(+), 2 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 5472ed0..c4c6575 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -1049,23 +1049,30 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei if (!file) goto out_fail; + retval = -EIO; + if (!file_hotplug_read_trylock(file)) + goto fput_fail; + dentry = file->f_path.dentry; retval = -ENOTDIR; if (!S_ISDIR(dentry->d_inode->i_mode)) - goto fput_fail; + goto unlock_fail; retval = file_permission(file, MAY_EXEC); if (retval) - goto fput_fail; + goto unlock_fail; nd->path = file->f_path; path_get(&file->f_path); + file_hotplug_read_unlock(file); fput_light(file, fput_needed); } return 0; +unlock_fail: + file_hotplug_read_unlock(file); fput_fail: fput_light(file, fput_needed); out_fail: -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with SMTP id 034986B00F6 for ; Wed, 3 Jun 2009 14:17:43 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:44 -0700 Message-Id: <1243893048-17031-19-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 19/23] eventpoll: Fix comment Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/eventpoll.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index eabb167..d42071d 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -62,7 +62,7 @@ * This mutex is acquired by ep_free() during the epoll file * cleanup path and it is also acquired by eventpoll_release_file() * if a file has been pushed inside an epoll set and it is then - * close()d without a previous call toepoll_ctl(EPOLL_CTL_DEL). + * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL). * It is possible to drop the "ep->mtx" and to use the global * mutex "epmutex" (together with "ep->lock") to have it working, * but having "ep->mtx" will make the interface more scalable. -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3351A6B00FB for ; Wed, 3 Jun 2009 14:17:44 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:46 -0700 Message-Id: <1243893048-17031-21-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 21/23] vfs: Teach fsync to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/sync.c | 9 ++++++++- 1 files changed, 8 insertions(+), 1 deletions(-) diff --git a/fs/sync.c b/fs/sync.c index e9d56f6..ac6da60 100644 --- a/fs/sync.c +++ b/fs/sync.c @@ -197,6 +197,9 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync) * don't have a struct file available. Damn nfsd.. */ if (file) { + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out; mapping = file->f_mapping; fop = file->f_op; } else { @@ -206,7 +209,7 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync) if (!fop || !fop->fsync) { ret = -EINVAL; - goto out; + goto out_unlock; } ret = filemap_fdatawrite(mapping); @@ -223,6 +226,10 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync) err = filemap_fdatawait(mapping); if (!ret) ret = err; + +out_unlock: + if (file) + file_hotplug_read_unlock(file); out: return ret; } -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 7D8A06B00FC for ; Wed, 3 Jun 2009 14:17:44 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:48 -0700 Message-Id: <1243893048-17031-23-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- mm/filemap.c | 25 ++++++++++++++++--------- 1 files changed, 16 insertions(+), 9 deletions(-) diff --git a/mm/filemap.c b/mm/filemap.c index 379ff0b..5016aa5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count) ret = -EBADF; file = fget(fd); - if (file) { - if (file->f_mode & FMODE_READ) { - struct address_space *mapping = file->f_mapping; - pgoff_t start = offset >> PAGE_CACHE_SHIFT; - pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT; - unsigned long len = end - start + 1; - ret = do_readahead(mapping, file, start, len); - } - fput(file); + if (!file) + goto out; + + if (!(file->f_mode & FMODE_READ)) + goto out_fput; + + if (file_hotplug_read_trylock(file)) { + struct address_space *mapping = file->f_mapping; + pgoff_t start = offset >> PAGE_CACHE_SHIFT; + pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT; + unsigned long len = end - start + 1; + ret = do_readahead(mapping, file, start, len); + file_hotplug_read_unlock(file); } +out_fput: + fput(file); +out: return ret; } #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 7E8616B00FD for ; Wed, 3 Jun 2009 14:17:44 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:43 -0700 Message-Id: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++------- 1 files changed, 32 insertions(+), 7 deletions(-) diff --git a/fs/eventpoll.c b/fs/eventpoll.c index a89f370..eabb167 100644 --- a/fs/eventpoll.c +++ b/fs/eventpoll.c @@ -627,8 +627,13 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head, struct epitem *epi, *tmp; list_for_each_entry_safe(epi, tmp, head, rdllink) { - if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) & - epi->event.events) + int events = DEAD_POLLMASK; + + if (file_hotplug_read_trylock(epi->ffd.file)) { + events = epi->ffd.file->f_op->poll(epi->ffd.file, NULL); + file_hotplug_read_unlock(epi->ffd.file); + } + if (events & epi->event.events) return POLLIN | POLLRDNORM; else { /* @@ -1060,8 +1065,12 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head, list_del_init(&epi->rdllink); - revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) & - epi->event.events; + revents = DEAD_POLLMASK; + if (file_hotplug_read_trylock(epi->ffd.file)) { + revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL); + file_hotplug_read_unlock(epi->ffd.file); + } + revents &= epi->event.events; /* * If the event mask intersect the caller-requested one, @@ -1248,10 +1257,17 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, if (!tfile) goto error_fput; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto error_tgt_fput; + + if (!file_hotplug_read_trylock(tfile)) + goto error_file_unlock; + /* The target file descriptor must support poll */ error = -EPERM; if (!tfile->f_op || !tfile->f_op->poll) - goto error_tgt_fput; + goto error_tgt_unlock; /* * We have to check that the file structure underneath the file descriptor @@ -1260,7 +1276,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, */ error = -EINVAL; if (file == tfile || !is_file_epoll(file)) - goto error_tgt_fput; + goto error_tgt_unlock; /* * At this point it is safe to assume that the "private_data" contains @@ -1302,6 +1318,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd, } mutex_unlock(&ep->mtx); +error_tgt_unlock: + file_hotplug_read_unlock(tfile); +error_file_unlock: + file_hotplug_read_unlock(file); error_tgt_fput: fput(tfile); error_fput: @@ -1338,13 +1358,16 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, if (!file) goto error_return; + error = -EIO; + if (!file_hotplug_read_trylock(file)) + goto error_fput; /* * We have to check that the file structure underneath the fd * the user passed to us _is_ an eventpoll file. */ error = -EINVAL; if (!is_file_epoll(file)) - goto error_fput; + goto error_unlock; /* * At this point it is safe to assume that the "private_data" contains @@ -1355,6 +1378,8 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events, /* Time to fish for events ... */ error = ep_poll(ep, events, maxevents, timeout); +error_unlock: + file_hotplug_read_unlock(file); error_fput: fput(file); error_return: -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id 83FC96B00FE for ; Wed, 3 Jun 2009 14:17:45 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:47 -0700 Message-Id: <1243893048-17031-22-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 22/23] vfs: Teach fadvice to file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman Signed-off-by: Eric W. Biederman --- mm/fadvise.c | 7 +++++++ 1 files changed, 7 insertions(+), 0 deletions(-) diff --git a/mm/fadvise.c b/mm/fadvise.c index 54a0f80..d7f1fba 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -38,6 +38,11 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) if (!file) return -EBADF; + ret = -EIO; + if (!file_hotplug_read_trylock(file)) + goto out_fput; + + ret = 0; if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) { ret = -ESPIPE; goto out; @@ -123,6 +128,8 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice) ret = -EINVAL; } out: + file_hotplug_read_unlock(file); +out_fput: fput(file); return ret; } -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with SMTP id A66646B0055 for ; Wed, 3 Jun 2009 14:44:27 -0400 (EDT) From: "Eric W. Biederman" Date: Mon, 1 Jun 2009 14:50:42 -0700 Message-Id: <1243893048-17031-17-git-send-email-ebiederm@xmission.com> In-Reply-To: References: Subject: [PATCH 17/23] proc: Teach /proc//fd to use file_hotplug_lock Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: From: Eric W. Biederman I have taken the opportunity to modify proc_fd_info to have a single exit point. Signed-off-by: Eric W. Biederman --- fs/proc/base.c | 29 ++++++++++++++++------------- 1 files changed, 16 insertions(+), 13 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index fb45615..ee4cdc2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1626,6 +1626,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info) struct files_struct *files = NULL; struct file *file; int fd = proc_fd(inode); + int retval = -ENOENT; if (task) { files = get_files_struct(task); @@ -1639,24 +1640,26 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info) spin_lock(&files->file_lock); file = fcheck_files(files, fd); if (file) { - if (path) { - *path = file->f_path; - path_get(&file->f_path); + retval = -EIO; + if (file_hotplug_read_trylock(file)) { + retval = 0; + if (path) { + *path = file->f_path; + path_get(&file->f_path); + } + if (info) + snprintf(info, PROC_FDINFO_MAX, + "pos:\t%lli\n" + "flags:\t0%o\n", + (long long) file->f_pos, + file->f_flags); + file_hotplug_read_unlock(file); } - if (info) - snprintf(info, PROC_FDINFO_MAX, - "pos:\t%lli\n" - "flags:\t0%o\n", - (long long) file->f_pos, - file->f_flags); - spin_unlock(&files->file_lock); - put_files_struct(files); - return 0; } spin_unlock(&files->file_lock); put_files_struct(files); } - return -ENOENT; + return retval; } static int proc_fd_link(struct inode *inode, struct path *path) -- 1.6.3.1.54.g99dd.dirty -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id D59346B004D for ; Wed, 3 Jun 2009 16:53:58 -0400 (EDT) Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Wed, 03 Jun 2009 13:53:48 -0700 In-Reply-To: (Davide Libenzi's message of "Wed\, 3 Jun 2009 07\:57\:40 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Davide Libenzi Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Davide Libenzi writes: > On Tue, 2 Jun 2009, Eric W. Biederman wrote: > >> I am not clear what problem you have. >> >> Is it the sprinkling the code that takes and removes the lock? Just >> the VFS needs to be involved with that. It is a slightly larger >> surface area than doing the work inside the file operations as we >> sometimes call the same method from 3-4 different places but it is >> definitely a bounded problem. >> >> Is it putting in the handful lines per subsystem to actually use this >> functionality? At that level something generic that is maintained >> outside of the subsystem is better than the mess we have with 4-5 >> different implementations in the subsystems that need it, each having >> a different assortment of bugs. > > Come on, only in the open fast path, there are at least two spin > lock/unlock and two atomic ops. Without even starting to count all the > extra branches and software added. > Is this stuff *really* needed, or we can faitly happily live w/out? ???? What code are you talking about? To the open path a few memory writes and a smp_wmb. No atomics and no spin lock/unlocks. Are you complaining because I retain the file_list? Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id D3A366B004D for ; Wed, 3 Jun 2009 19:25:34 -0400 (EDT) Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by e34.co.us.ibm.com (8.13.1/8.13.1) with ESMTP id n53NMbdo021897 for ; Wed, 3 Jun 2009 17:22:37 -0600 Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v9.2) with ESMTP id n53NPWf6262150 for ; Wed, 3 Jun 2009 17:25:32 -0600 Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n53NPUSV022095 for ; Wed, 3 Jun 2009 17:25:32 -0600 Subject: Re: [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock From: Badari Pulavarty In-Reply-To: <1243893048-17031-23-git-send-email-ebiederm@xmission.com> References: <1243893048-17031-23-git-send-email-ebiederm@xmission.com> Content-Type: text/plain Date: Wed, 03 Jun 2009 16:25:29 -0700 Message-Id: <1244071529.6383.11.camel@badari-desktop> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote: > From: Eric W. Biederman > > Signed-off-by: Eric W. Biederman > --- > mm/filemap.c | 25 ++++++++++++++++--------- > 1 files changed, 16 insertions(+), 9 deletions(-) > > diff --git a/mm/filemap.c b/mm/filemap.c > index 379ff0b..5016aa5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count) > > ret = -EBADF; > file = fget(fd); > - if (file) { > - if (file->f_mode & FMODE_READ) { > - struct address_space *mapping = file->f_mapping; > - pgoff_t start = offset >> PAGE_CACHE_SHIFT; > - pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT; > - unsigned long len = end - start + 1; > - ret = do_readahead(mapping, file, start, len); > - } > - fput(file); > + if (!file) > + goto out; > + > + if (!(file->f_mode & FMODE_READ)) > + goto out_fput; > + To be consistent with others, don't you want to do ret = -EIO; here ? > + if (file_hotplug_read_trylock(file)) { > + struct address_space *mapping = file->f_mapping; > + pgoff_t start = offset >> PAGE_CACHE_SHIFT; > + pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT; > + unsigned long len = end - start + 1; > + ret = do_readahead(mapping, file, start, len); > + file_hotplug_read_unlock(file); > } > +out_fput: > + fput(file); > +out: > return ret; > } > #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS Thanks, Badari -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id B04EB6B004D for ; Wed, 3 Jun 2009 19:39:28 -0400 (EDT) Received: from d01relay04.pok.ibm.com (d01relay04.pok.ibm.com [9.56.227.236]) by e7.ny.us.ibm.com (8.13.1/8.13.1) with ESMTP id n53NRWDP018594 for ; Wed, 3 Jun 2009 19:27:32 -0400 Received: from d01av01.pok.ibm.com (d01av01.pok.ibm.com [9.56.224.215]) by d01relay04.pok.ibm.com (8.13.8/8.13.8/NCO v9.2) with ESMTP id n53NdMVD243238 for ; Wed, 3 Jun 2009 19:39:22 -0400 Received: from d01av01.pok.ibm.com (loopback [127.0.0.1]) by d01av01.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id n53NdLB6007710 for ; Wed, 3 Jun 2009 19:39:22 -0400 Subject: Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock From: Badari Pulavarty In-Reply-To: <1243893048-17031-7-git-send-email-ebiederm@xmission.com> References: <1243893048-17031-7-git-send-email-ebiederm@xmission.com> Content-Type: text/plain Date: Wed, 03 Jun 2009 16:39:23 -0700 Message-Id: <1244072363.6383.15.camel@badari-desktop> Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote: > From: Eric W. Biederman > > Signed-off-by: Eric W. Biederman > --- > fs/read_write.c | 28 +++++++++---- > fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++----------------- > 2 files changed, 95 insertions(+), 44 deletions(-) > > diff --git a/fs/read_write.c b/fs/read_write.c > index 718baea..c473d74 100644 > --- a/fs/read_write.c > +++ b/fs/read_write.c > @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, > goto out; > if (!(in_file->f_mode & FMODE_READ)) > goto fput_in; > + retval = -EIO; > + if (!file_hotplug_read_trylock(in_file)) > + goto fput_in; > retval = -EINVAL; > in_inode = in_file->f_path.dentry->d_inode; > if (!in_inode) > - goto fput_in; > + goto unlock_in; > if (!in_file->f_op || !in_file->f_op->splice_read) > - goto fput_in; > + goto unlock_in; > retval = -ESPIPE; > if (!ppos) > ppos = &in_file->f_pos; > else > if (!(in_file->f_mode & FMODE_PREAD)) > - goto fput_in; > + goto unlock_in; > retval = rw_verify_area(READ, in_file, ppos, count); > if (retval < 0) > - goto fput_in; > + goto unlock_in; > count = retval; > > /* > @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, > retval = -EBADF; > out_file = fget_light(out_fd, &fput_needed_out); > if (!out_file) > - goto fput_in; > + goto unlock_in; > if (!(out_file->f_mode & FMODE_WRITE)) > goto fput_out; > + retval = -EIO; > + if (!file_hotplug_read_trylock(out_file)) > + goto fput_out; > retval = -EINVAL; > if (!out_file->f_op || !out_file->f_op->sendpage) > - goto fput_out; > + goto unlock_out; > out_inode = out_file->f_path.dentry->d_inode; > retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count); > if (retval < 0) > - goto fput_out; > + goto unlock_out; > count = retval; > > if (!max) > @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, > pos = *ppos; > retval = -EINVAL; > if (unlikely(pos < 0)) > - goto fput_out; > + goto unlock_out; > if (unlikely(pos + count > max)) { > retval = -EOVERFLOW; > if (pos >= max) > - goto fput_out; > + goto unlock_out; > count = max - pos; > } > > @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, > if (*ppos > max) > retval = -EOVERFLOW; > > +unlock_out: > + file_hotplug_read_unlock(out_file); > fput_out: > fput_light(out_file, fput_needed_out); > +unlock_in: > + file_hotplug_read_unlock(in_file); > fput_in: > fput_light(in_file, fput_needed_in); > out: > diff --git a/fs/splice.c b/fs/splice.c > index 666953d..fc6b3a5 100644 > --- a/fs/splice.c > +++ b/fs/splice.c > @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov, > > error = -EBADF; > file = fget_light(fd, &fput); > - if (file) { > - if (file->f_mode & FMODE_WRITE) > - error = vmsplice_to_pipe(file, iov, nr_segs, flags); > - else if (file->f_mode & FMODE_READ) > - error = vmsplice_to_user(file, iov, nr_segs, flags); > + if (!file) > + goto out; > > - fput_light(file, fput); > - } > + if (!file_hotplug_read_trylock(file)) > + goto fput_file; > > + if (file->f_mode & FMODE_WRITE) > + error = vmsplice_to_pipe(file, iov, nr_segs, flags); > + else if (file->f_mode & FMODE_READ) > + error = vmsplice_to_user(file, iov, nr_segs, flags); > + > + file_hotplug_read_unlock(file); > +fput_file: > + fput_light(file, fput); > +out: > return error; > } > > @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in, > > error = -EBADF; > in = fget_light(fd_in, &fput_in); > - if (in) { > - if (in->f_mode & FMODE_READ) { > - out = fget_light(fd_out, &fput_out); > - if (out) { > - if (out->f_mode & FMODE_WRITE) > - error = do_splice(in, off_in, > - out, off_out, > - len, flags); > - fput_light(out, fput_out); > - } > - } > + if (!in) > + goto out; > > - fput_light(in, fput_in); > - } > + if (!(in->f_mode & FMODE_READ)) > + goto fput_in; > + > + error = -EIO; > + if (!file_hotplug_read_trylock(in)) > + goto fput_in; > + > + error = -EBADF; > + out = fget_light(fd_out, &fput_out); > + if (!out) > + goto unlock_in; > + > + if (!(out->f_mode & FMODE_WRITE)) > + goto fput_out; > + > + error = -EIO; > + if (!file_hotplug_read_trylock(out)) > + goto fput_out; > + > + error = do_splice(in, off_in, out, off_out, len, flags); > > + file_hotplug_read_unlock(out); > +fput_out: > + fput_light(out, fput_out); > +unlock_in: > + file_hotplug_read_unlock(in); > +fput_in: > + fput_light(in, fput_in); > + > +out: > return error; > } > > @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len, > > SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) > { > - struct file *in; > - int error, fput_in; > + struct file *in, *out; > + int error, fput_in, fput_out; > > if (unlikely(!len)) > return 0; > > error = -EBADF; > in = fget_light(fdin, &fput_in); > - if (in) { > - if (in->f_mode & FMODE_READ) { > - int fput_out; > - struct file *out = fget_light(fdout, &fput_out); > - > - if (out) { > - if (out->f_mode & FMODE_WRITE) > - error = do_tee(in, out, len, flags); > - fput_light(out, fput_out); > - } > - } > - fput_light(in, fput_in); > - } > + if (!in) > + goto out; > + > + if (!(in->f_mode & FMODE_READ)) > + goto unlock_in; <<<<<<< Shouldn't this be goto fput_in; ? btw, its confusing to have labels and variables with same name: fput_in and fput_out. You may want to rename labels ? > > + error = -EIO; > + if (!file_hotplug_read_trylock(in)) > + goto fput_in; > + > + error = -EBADF; > + out = fget_light(fdout, &fput_out); > + if (!out) > + goto unlock_in; > + > + if (!(out->f_mode & FMODE_WRITE)) > + goto fput_out; > + > + if (!file_hotplug_read_trylock(out)) > + goto fput_out; > + > + error = do_tee(in, out, len, flags); > + > + file_hotplug_read_unlock(out); > +fput_out: > + fput_light(out, fput_out); > +unlock_in: > + file_hotplug_read_unlock(in); > +fput_in: > + fput_light(in, fput_in); > +out: > return error; > } Thanks, Badari -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id CF46E6B0055 for ; Wed, 3 Jun 2009 20:56:10 -0400 (EDT) Received: from makko.or.mcafeemobile.com by x35.xmailserver.org with [XMail 1.26 ESMTP Server] id for from ; Wed, 3 Jun 2009 20:55:48 -0400 Date: Wed, 3 Jun 2009 17:50:01 -0700 (PDT) From: Davide Libenzi Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock In-Reply-To: Message-ID: References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Wed, 3 Jun 2009, Eric W. Biederman wrote: > What code are you talking about? > > To the open path a few memory writes and a smp_wmb. No atomics and no > spin lock/unlocks. > > Are you complaining because I retain the file_list? Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin lock/unlock couple present in __dentry_open() (same sort of the release path)? And that's only like 5% of the code touched by the new special handling of the file operations structure (basically, every f_op access ends up being wrapped by two atomic ops and other extra code). The question, that I'd like to reiterate is, is this stuff really needed? Anyway, my complaint ends here and I'll let others evaluate if merging this patchset is worth the cost. - Davide -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 306C26B004D for ; Wed, 3 Jun 2009 21:42:17 -0400 (EDT) Subject: Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock References: <1243893048-17031-18-git-send-email-ebiederm@xmission.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Wed, 03 Jun 2009 18:42:07 -0700 In-Reply-To: (Davide Libenzi's message of "Wed\, 3 Jun 2009 17\:50\:01 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Davide Libenzi Cc: Al Viro , Linux Kernel Mailing List , linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Davide Libenzi writes: > On Wed, 3 Jun 2009, Eric W. Biederman wrote: > >> What code are you talking about? >> >> To the open path a few memory writes and a smp_wmb. No atomics and no >> spin lock/unlocks. >> >> Are you complaining because I retain the file_list? > > Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin > lock/unlock couple present in __dentry_open() (same sort of the release > path)? You might be remembering v1. In v2 I have operations like file_hotplug_read_trylock that implement a lock but use an rcu like algorithm. So there are no atomic operations involved with their associated pipeline stalls. Over my previous version this made a reasonable performance benefit. > And that's only like 5% of the code touched by the new special handling of > the file operations structure (basically, every f_op access ends up being > wrapped by two atomic ops and other extra code). Yes there is a single extra wrapping of every file in the syscall path. So we know that someone is using it. > The question, that I'd like to reiterate is, is this stuff really needed? > Anyway, my complaint ends here and I'll let others evaluate if merging > this patchset is worth the cost. Sure. My apologies for not answering that question earlier. My perspective is that every subsystem that winds up supporting hotplug hardware winds up rolling it's own version of something like this, and they each have a different set of bugs. So one generic version is definitely worth implementing. Similarly there is a case for a generic revoke facility in the kernel. Alan at least has made the case that there are certain security problems that can not be solved in userspace without revoke. >>From an implementation point of view doing the generic implementation at the vfs level has significant benefits. The extra locking appears reasonable from a code maintenance and comprehensibility point of view. A real pain to find all of the entry points into the vfs, and get other code to use the right vfs helpers they should always have been using but I am volunteering to do that work. The practical question I see is are the performance overheads of my primitives low enough that I do not cause performance regressions on anyone's fast path. As far as I have been able to measure is that the performance overhead is low enough, because I have been able to avoid the use of atomics and have been able to use fairly small code with predictable branches. Which is why I pressed you to be certain I understood where you are coming from. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 040046B0055 for ; Fri, 5 Jun 2009 05:03:42 -0400 (EDT) In-reply-to: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> Message-Id: From: Miklos Szeredi Date: Fri, 05 Jun 2009 11:03:29 +0200 Sender: owner-linux-mm@kvack.org To: ebiederm@xmission.com Cc: viro@ZenIV.linux.org.uk, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org, ebiederm@aristanetworks.com List-ID: Hi Eric, Very interesting work. On Mon, 1 Jun 2009, Eric W. Biederman wrote: > The file_hotplug_lock has a very unique implementation necessitated by > the need to have no performance impact on existing code. Classic locking > primitives and reference counting cause pipeline stalls, except for rcu > which provides no ability to preventing reading a data structure while > it is being updated. Well, the simple solution to that is to add another level of indirection: old: fdtable -> file new: fdtable -> persistent_file -> file Then it is possible to replace persistent_file->file with a revoked one under RCU. This has the added advantage that it supports arbitrary file replacements, not just ones which return EIO. Another advantage is that dereferencing can normally be done "under the hood" in fget()/fget_light(). Only code which wants to permanently store a file pointer (like the SCM_RIGHTS thing) would need to be aware of the extra complexity. Would that work, do you think? Thanks, Miklos -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with SMTP id 3571F6B004D for ; Fri, 5 Jun 2009 15:06:21 -0400 (EDT) Subject: Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file References: <1243893048-17031-4-git-send-email-ebiederm@xmission.com> From: ebiederm@xmission.com (Eric W. Biederman) Date: Fri, 05 Jun 2009 12:06:07 -0700 In-Reply-To: (Miklos Szeredi's message of "Fri\, 05 Jun 2009 11\:03\:29 +0200") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Miklos Szeredi Cc: viro@ZenIV.linux.org.uk, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org, ebiederm@aristanetworks.com List-ID: Miklos Szeredi writes: > Hi Eric, > > Very interesting work. > > On Mon, 1 Jun 2009, Eric W. Biederman wrote: >> The file_hotplug_lock has a very unique implementation necessitated by >> the need to have no performance impact on existing code. Classic locking >> primitives and reference counting cause pipeline stalls, except for rcu >> which provides no ability to preventing reading a data structure while >> it is being updated. > > Well, the simple solution to that is to add another level of indirection: > > old: > > fdtable -> file > > new: > > fdtable -> persistent_file -> file > > Then it is possible to replace persistent_file->file with a revoked > one under RCU. This has the added advantage that it supports > arbitrary file replacements, not just ones which return EIO. > > Another advantage is that dereferencing can normally be done "under > the hood" in fget()/fget_light(). Only code which wants to > permanently store a file pointer (like the SCM_RIGHTS thing) would > need to be aware of the extra complexity. > > Would that work, do you think? Well I went down this path for a little while, and it has some good points. Unfortunately it appears to be more costly. fget() and friends are semantically very different my file_hotplug_read_trylock and unlock. In fact there is very little overlap. Which means that transparent to the vfs users doesn't actually work. We actually have more and less predictable places where we store files. If there was actually a compelling case for being more general I would certainly agree that splitting the file structure in two would be a good deal. As it is that level of flexibility seems to be overkill. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 197A36B004D for ; Fri, 5 Jun 2009 15:34:05 -0400 (EDT) Subject: Re: [PATCH 03/23] vfs: Generalize the file_list References: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> <20090602070642.GD31556@wotan.suse.de> From: ebiederm@xmission.com (Eric W. Biederman) Date: Fri, 05 Jun 2009 12:33:59 -0700 In-Reply-To: <20090602070642.GD31556@wotan.suse.de> (Nick Piggin's message of "Tue\, 2 Jun 2009 09\:06\:42 +0200") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Nick Piggin writes: >> fs_may_remount_ro and mark_files_ro have been modified to walk the >> inode list to find all of the inodes and then to walk the file list >> on those inodes. It can be a slightly longer walk as we frequently >> cache inodes that we do not have open but the overall complexity >> should be about the same, > > Well not really. I have a couple of orders of magnitude more cached > inodes than open files here. Good point. >> --- a/include/linux/fs.h >> +++ b/include/linux/fs.h >> @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping) >> return mapping->i_mmap_writable != 0; >> } >> >> +struct file_list { >> + spinlock_t lock; >> + struct list_head list; >> +}; >> + >> /* >> * Use sequence counter to get consistent i_size on 32-bit processors. >> */ >> @@ -764,6 +769,7 @@ struct inode { >> struct list_head inotify_watches; /* watches on this inode */ >> struct mutex inotify_mutex; /* protects the watches list */ >> #endif >> + struct file_list i_files; >> >> unsigned long i_state; >> unsigned long dirtied_when; /* jiffies of first dirtying */ >> @@ -934,9 +940,15 @@ struct file { >> unsigned long f_mnt_write_state; >> #endif >> }; >> -extern spinlock_t files_lock; >> -#define file_list_lock() spin_lock(&files_lock); >> -#define file_list_unlock() spin_unlock(&files_lock); >> + >> +static inline void file_list_lock(struct file_list *files) >> +{ >> + spin_lock(&files->lock); >> +} >> +static inline void file_list_unlock(struct file_list *files) >> +{ >> + spin_unlock(&files->lock); >> +} > > I don't really like this. It's just a list head. Get rid of > all these wrappers and crap I'd say. In fact, starting with my > patch to unexport files_lock and remove these wrappers would > be reasonable, wouldn't it? I don't really mind killing the wrappers. I do mind your patch because it makes the list going through the tty's something very different. In my view of the world that is the only use case is what I'm working to move up more into the vfs layer. So orphaning it seems wrong. > Increasing the size of the struct inode by 24 bytes hurts. > Even when you decrapify it and can reuse i_lock or something, > then it is still 16 bytes on 64-bit. We can get it even smaller if we make it an hlist. A hlist_head is only a single pointer. This size growth appears to be one of the biggest weakness of the code. > I haven't looked through all the patches... but this is to > speed up a slowpath operation, isn't it? Or does revoke > need to be especially performant? This was more about simplicity rather than performance. The performance gain is using a per inode lock instead of a global lock. Which keeps cache lines from bouncing. > So this patch is purely a perofrmance improvement? Then I think > it needs to be justified with numbers and the downsides (bloating > struct inode in particulra) to be changelogged. Certainly the cost. One of the things I have discovered since I wrote this patch is the i_devices list. Which means we don't necessarily need to have heads in places other than struct inode. A character device driver (aka the tty code) can walk it's inode list and from each inode walk the file list. I need to check the locking on that one. If that simplification works we can move all maintenance of the file list into the vfs and not need a separate file list concept. I will take a look. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with SMTP id D3D356B004D for ; Fri, 5 Jun 2009 15:37:30 -0400 (EDT) Subject: Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock References: <1243893048-17031-7-git-send-email-ebiederm@xmission.com> <1244072363.6383.15.camel@badari-desktop> From: ebiederm@xmission.com (Eric W. Biederman) Date: Fri, 05 Jun 2009 12:37:25 -0700 In-Reply-To: <1244072363.6383.15.camel@badari-desktop> (Badari Pulavarty's message of "Wed\, 03 Jun 2009 16\:39\:23 -0700") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Sender: owner-linux-mm@kvack.org To: Badari Pulavarty Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: Badari Pulavarty writes: > On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote: >> From: Eric W. Biederman >> >> Signed-off-by: Eric W. Biederman >> --- >> fs/read_write.c | 28 +++++++++---- >> fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++----------------- >> 2 files changed, 95 insertions(+), 44 deletions(-) >> >> diff --git a/fs/read_write.c b/fs/read_write.c >> index 718baea..c473d74 100644 >> --- a/fs/read_write.c >> +++ b/fs/read_write.c >> @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, >> goto out; >> if (!(in_file->f_mode & FMODE_READ)) >> goto fput_in; >> + retval = -EIO; >> + if (!file_hotplug_read_trylock(in_file)) >> + goto fput_in; >> retval = -EINVAL; >> in_inode = in_file->f_path.dentry->d_inode; >> if (!in_inode) >> - goto fput_in; >> + goto unlock_in; >> if (!in_file->f_op || !in_file->f_op->splice_read) >> - goto fput_in; >> + goto unlock_in; >> retval = -ESPIPE; >> if (!ppos) >> ppos = &in_file->f_pos; >> else >> if (!(in_file->f_mode & FMODE_PREAD)) >> - goto fput_in; >> + goto unlock_in; >> retval = rw_verify_area(READ, in_file, ppos, count); >> if (retval < 0) >> - goto fput_in; >> + goto unlock_in; >> count = retval; >> >> /* >> @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, >> retval = -EBADF; >> out_file = fget_light(out_fd, &fput_needed_out); >> if (!out_file) >> - goto fput_in; >> + goto unlock_in; >> if (!(out_file->f_mode & FMODE_WRITE)) >> goto fput_out; >> + retval = -EIO; >> + if (!file_hotplug_read_trylock(out_file)) >> + goto fput_out; >> retval = -EINVAL; >> if (!out_file->f_op || !out_file->f_op->sendpage) >> - goto fput_out; >> + goto unlock_out; >> out_inode = out_file->f_path.dentry->d_inode; >> retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count); >> if (retval < 0) >> - goto fput_out; >> + goto unlock_out; >> count = retval; >> >> if (!max) >> @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, >> pos = *ppos; >> retval = -EINVAL; >> if (unlikely(pos < 0)) >> - goto fput_out; >> + goto unlock_out; >> if (unlikely(pos + count > max)) { >> retval = -EOVERFLOW; >> if (pos >= max) >> - goto fput_out; >> + goto unlock_out; >> count = max - pos; >> } >> >> @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos, >> if (*ppos > max) >> retval = -EOVERFLOW; >> >> +unlock_out: >> + file_hotplug_read_unlock(out_file); >> fput_out: >> fput_light(out_file, fput_needed_out); >> +unlock_in: >> + file_hotplug_read_unlock(in_file); >> fput_in: >> fput_light(in_file, fput_needed_in); >> out: >> diff --git a/fs/splice.c b/fs/splice.c >> index 666953d..fc6b3a5 100644 >> --- a/fs/splice.c >> +++ b/fs/splice.c >> @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov, >> >> error = -EBADF; >> file = fget_light(fd, &fput); >> - if (file) { >> - if (file->f_mode & FMODE_WRITE) >> - error = vmsplice_to_pipe(file, iov, nr_segs, flags); >> - else if (file->f_mode & FMODE_READ) >> - error = vmsplice_to_user(file, iov, nr_segs, flags); >> + if (!file) >> + goto out; >> >> - fput_light(file, fput); >> - } >> + if (!file_hotplug_read_trylock(file)) >> + goto fput_file; >> >> + if (file->f_mode & FMODE_WRITE) >> + error = vmsplice_to_pipe(file, iov, nr_segs, flags); >> + else if (file->f_mode & FMODE_READ) >> + error = vmsplice_to_user(file, iov, nr_segs, flags); >> + >> + file_hotplug_read_unlock(file); >> +fput_file: >> + fput_light(file, fput); >> +out: >> return error; >> } >> >> @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in, >> >> error = -EBADF; >> in = fget_light(fd_in, &fput_in); >> - if (in) { >> - if (in->f_mode & FMODE_READ) { >> - out = fget_light(fd_out, &fput_out); >> - if (out) { >> - if (out->f_mode & FMODE_WRITE) >> - error = do_splice(in, off_in, >> - out, off_out, >> - len, flags); >> - fput_light(out, fput_out); >> - } >> - } >> + if (!in) >> + goto out; >> >> - fput_light(in, fput_in); >> - } >> + if (!(in->f_mode & FMODE_READ)) >> + goto fput_in; >> + >> + error = -EIO; >> + if (!file_hotplug_read_trylock(in)) >> + goto fput_in; >> + >> + error = -EBADF; >> + out = fget_light(fd_out, &fput_out); >> + if (!out) >> + goto unlock_in; >> + >> + if (!(out->f_mode & FMODE_WRITE)) >> + goto fput_out; >> + >> + error = -EIO; >> + if (!file_hotplug_read_trylock(out)) >> + goto fput_out; >> + >> + error = do_splice(in, off_in, out, off_out, len, flags); >> >> + file_hotplug_read_unlock(out); >> +fput_out: >> + fput_light(out, fput_out); >> +unlock_in: >> + file_hotplug_read_unlock(in); >> +fput_in: >> + fput_light(in, fput_in); >> + >> +out: >> return error; >> } >> >> @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len, >> >> SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags) >> { >> - struct file *in; >> - int error, fput_in; >> + struct file *in, *out; >> + int error, fput_in, fput_out; >> >> if (unlikely(!len)) >> return 0; >> >> error = -EBADF; >> in = fget_light(fdin, &fput_in); >> - if (in) { >> - if (in->f_mode & FMODE_READ) { >> - int fput_out; >> - struct file *out = fget_light(fdout, &fput_out); >> - >> - if (out) { >> - if (out->f_mode & FMODE_WRITE) >> - error = do_tee(in, out, len, flags); >> - fput_light(out, fput_out); >> - } >> - } >> - fput_light(in, fput_in); >> - } >> + if (!in) >> + goto out; >> + >> + if (!(in->f_mode & FMODE_READ)) >> + goto unlock_in; <<<<<<< > > Shouldn't this be > goto fput_in; Good point. That is a bug. > ? btw, its confusing to have labels and variables with same name: > fput_in and fput_out. You may want to rename labels ? Mayhap. I didn't start that one, although I am clearly spreading it around here. Do you have a better naming suggestion? >> + error = -EIO; >> + if (!file_hotplug_read_trylock(in)) >> + goto fput_in; >> + >> + error = -EBADF; >> + out = fget_light(fdout, &fput_out); >> + if (!out) >> + goto unlock_in; >> + >> + if (!(out->f_mode & FMODE_WRITE)) >> + goto fput_out; >> + >> + if (!file_hotplug_read_trylock(out)) >> + goto fput_out; >> + >> + error = do_tee(in, out, len, flags); >> + >> + file_hotplug_read_unlock(out); >> +fput_out: >> + fput_light(out, fput_out); >> +unlock_in: >> + file_hotplug_read_unlock(in); >> +fput_in: >> + fput_light(in, fput_in); >> +out: >> return error; >> } Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 709476B004D for ; Sat, 6 Jun 2009 04:03:42 -0400 (EDT) Date: Sat, 6 Jun 2009 09:03:34 +0100 From: Al Viro Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090606080334.GA15204@ZenIV.linux.org.uk> References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig List-ID: On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote: > > I found myself looking at the uio, seeing that it does not support pci > hot-unplug, and thinking "Great yet another implementation of > hotunplug logic that needs to be added". > > I decided to see what it would take to add a generic implementation of > the code we have for supporting hot unplugging devices in sysfs, proc, > sysctl, tty_io, and now almost in the tun driver. > > Not long after I touched the tun driver and made it safe to delete the > network device while still holding it's file descriptor open I someone > else touch the code adding a different feature and my careful work > went up in flames. Which brought home another point at the best of it > this is ultimately complex tricky code that subsystems should not need > to worry about. > > What makes this even more interesting is that in the presence of pci > hot-unplug it looks like most subsystems and most devices will have to > deal with the issue one way or another. > > This infrastructure could also be used to implement both force > unmounts and sys_revoke. When I could not think of a better name for > I have drawn on that and used revoke. To be honest, the longer I'm looking at it, the less I like the approach... It really looks as if we'd be much better off with functionality sitting in a set of library helpers to be used by instances that need this stuff. Do we really want it for generic case? Note that "we might someday implement real force-umount" doesn't count; the same kind of arguments had been given nine years ago in case of AIO ("oh, sure, we'll eventually cover foo_get_block() too - it will all be a state machine, fully asynchronous; whaddya mean 'it's not feasible'?"). Of course, it was _not_ feasible and had never been implemented. Frankly, I very much suspect that force-umount is another case like that; we'll need a *lot* of interesting cooperation from fs for that to work and to be useful. I'd be delighted to be proven incorrect on that one, so if you have anything serious in that direction, please share the details. As for the patchset in the current form... Could you explain what's to prevent POSIX locks and dnotify entries from outliving a struct file you'd revoked, seeing that filp_close() will skip killing them in that case. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with ESMTP id 5B43F6B004D for ; Sat, 6 Jun 2009 04:08:25 -0400 (EDT) Date: Sat, 6 Jun 2009 09:08:20 +0100 From: Al Viro Subject: Re: [PATCH 02/23] vfs: Implement unpoll_file. Message-ID: <20090606080820.GA16867@ZenIV.linux.org.uk> References: <1243893048-17031-2-git-send-email-ebiederm@xmission.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1243893048-17031-2-git-send-email-ebiederm@xmission.com> Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" , "Eric W. Biederman" List-ID: On Mon, Jun 01, 2009 at 02:50:27PM -0700, Eric W. Biederman wrote: > From: Eric W. Biederman > > During a revoke operation it is necessary to stop using all state that is managed > by the underlying file operations implementation. The poll wait queue is one part > of that state. Erm... Seeing that drivers and filesystems tend to have fsckloads of other state of their own, why do we treat that separately? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id E666C6B004D for ; Mon, 8 Jun 2009 04:32:22 -0400 (EDT) In-reply-to: <20090606080334.GA15204@ZenIV.linux.org.uk> (message from Al Viro on Sat, 6 Jun 2009 09:03:34 +0100) Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 References: <20090606080334.GA15204@ZenIV.linux.org.uk> Message-Id: From: Miklos Szeredi Date: Mon, 08 Jun 2009 11:41:19 +0200 Sender: owner-linux-mm@kvack.org To: viro@ZenIV.linux.org.uk Cc: ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Sat, 6 Jun 2009, Al Viro wrote: > Frankly, I very much suspect that force-umount is another case like that; > we'll need a *lot* of interesting cooperation from fs for that to work and > to be useful. I'd be delighted to be proven incorrect on that one, so > if you have anything serious in that direction, please share the details. Umm, not sure why we'd need cooperation from the fs. Simply wait for the operation to exit the filesystem or driver. If it's a blocking operation, send a signal to interrupt it. Sure, filesystems and drivers have lots of state, but we don't need to care about that, just like we don't need to care about it for remounting read-only. Thanks, Miklos -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail190.messagelabs.com (mail190.messagelabs.com [216.82.249.51]) by kanga.kvack.org (Postfix) with ESMTP id 1E0CE6B004F for ; Mon, 8 Jun 2009 05:14:24 -0400 (EDT) Date: Mon, 8 Jun 2009 11:24:49 +0100 From: Jamie Lokier Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090608102449.GB25684@shareable.org> References: <20090606080334.GA15204@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Miklos Szeredi Cc: viro@ZenIV.linux.org.uk, ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: Miklos Szeredi wrote: > On Sat, 6 Jun 2009, Al Viro wrote: > > Frankly, I very much suspect that force-umount is another case like that; > > we'll need a *lot* of interesting cooperation from fs for that to work and > > to be useful. I'd be delighted to be proven incorrect on that one, so > > if you have anything serious in that direction, please share the details. > > Umm, not sure why we'd need cooperation from the fs. Simply wait for > the operation to exit the filesystem or driver. If it's a blocking > operation, send a signal to interrupt it. We could even include the internal signal in TASK_KILLABLE, so it interrupts otherwise uninterruptible operations. -- Jamie -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 4EB256B004D for ; Mon, 8 Jun 2009 11:07:12 -0400 (EDT) Date: Mon, 8 Jun 2009 17:29:13 +0100 From: Al Viro Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090608162913.GL8633@ZenIV.linux.org.uk> References: <20090606080334.GA15204@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Miklos Szeredi Cc: ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote: > On Sat, 6 Jun 2009, Al Viro wrote: > > Frankly, I very much suspect that force-umount is another case like that; > > we'll need a *lot* of interesting cooperation from fs for that to work and > > to be useful. I'd be delighted to be proven incorrect on that one, so > > if you have anything serious in that direction, please share the details. > > Umm, not sure why we'd need cooperation from the fs. Simply wait for > the operation to exit the filesystem or driver. If it's a blocking > operation, send a signal to interrupt it. And making sure that operations *are* interruptible (and that we can cope with $BIGNUM new failure exits correctly) does not qualify as cooperation? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail203.messagelabs.com (mail203.messagelabs.com [216.82.254.243]) by kanga.kvack.org (Postfix) with ESMTP id 670256B004D for ; Mon, 8 Jun 2009 11:22:21 -0400 (EDT) In-reply-to: <20090608162913.GL8633@ZenIV.linux.org.uk> (message from Al Viro on Mon, 8 Jun 2009 17:29:13 +0100) Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> Message-Id: From: Miklos Szeredi Date: Mon, 08 Jun 2009 18:44:41 +0200 Sender: owner-linux-mm@kvack.org To: viro@ZenIV.linux.org.uk Cc: miklos@szeredi.hu, ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, 8 Jun 2009, Al Viro wrote: > On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote: > > On Sat, 6 Jun 2009, Al Viro wrote: > > > Frankly, I very much suspect that force-umount is another case like that; > > > we'll need a *lot* of interesting cooperation from fs for that to work and > > > to be useful. I'd be delighted to be proven incorrect on that one, so > > > if you have anything serious in that direction, please share the details. > > > > Umm, not sure why we'd need cooperation from the fs. Simply wait for > > the operation to exit the filesystem or driver. If it's a blocking > > operation, send a signal to interrupt it. > > And making sure that operations *are* interruptible (and that we can cope > with $BIGNUM new failure exits correctly) does not qualify as cooperation? I'm still not getting what the problem is. AFAICS file operations are either a) non-interruptible but finish within a short time or b) may block indefinitely but are interruptible (or at least killable). Anything else is already problematic, resulting in processes "stuck in D state". Can you give a more concrete example about your worries? Thanks, Miklos -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with ESMTP id 2E1E26B0062 for ; Mon, 8 Jun 2009 13:49:26 -0400 (EDT) Date: Mon, 8 Jun 2009 18:50:18 +0100 From: Al Viro Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090608175018.GM8633@ZenIV.linux.org.uk> References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Miklos Szeredi Cc: ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote: > I'm still not getting what the problem is. AFAICS file operations are > either > > a) non-interruptible but finish within a short time or > b) may block indefinitely but are interruptible (or at least killable). > > Anything else is already problematic, resulting in processes "stuck in > D state". Welcome to reality... * bread() is non-interruptible * so's copy_from_user()/copy_to_user() * IO we are stuck upon _might_ be interruptible, but by sending a signal to some other process ... just for starters. If you sign up for auditing the tree to eliminate "something's stuck in D state", you are welcome to it. Mind you, you'll have to audit filesystems for "doesn't check if metadata IO has failed" first, but that _really_ needs to be done anyway. On the ongoing basis. Drivers, of course, are even more interesting - looking through foo_ioctl() instances is a wonderful way to lower pH in stomach, but that's on the "we want revoke()" side of it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 997AB6B005A for ; Mon, 8 Jun 2009 14:02:21 -0400 (EDT) Date: Mon, 8 Jun 2009 11:01:51 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 In-Reply-To: <20090608175018.GM8633@ZenIV.linux.org.uk> Message-ID: References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Al Viro Cc: Miklos Szeredi , ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, 8 Jun 2009, Al Viro wrote: > > Welcome to reality... > > * bread() is non-interruptible > * so's copy_from_user()/copy_to_user() > * IO we are stuck upon _might_ be interruptible, but by sending a signal > to some other process We can probably improve on these, though. Like the copy_to/from_user thing. We might well be able to do that whole "if it's a fatal signal, return early" thing. So in the _general_ case - no, we probably can't fix things. But we could likely at least improve in some common cases if we cared. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id C5F2C6B004D for ; Mon, 8 Jun 2009 14:48:23 -0400 (EDT) Date: Mon, 8 Jun 2009 19:50:41 +0100 From: Al Viro Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090608185041.GN8633@ZenIV.linux.org.uk> References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Miklos Szeredi , ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, Jun 08, 2009 at 11:01:51AM -0700, Linus Torvalds wrote: > > > On Mon, 8 Jun 2009, Al Viro wrote: > > > > Welcome to reality... > > > > * bread() is non-interruptible > > * so's copy_from_user()/copy_to_user() > > * IO we are stuck upon _might_ be interruptible, but by sending a signal > > to some other process > > We can probably improve on these, though. > > Like the copy_to/from_user thing. We might well be able to do that whole > "if it's a fatal signal, return early" thing. > > So in the _general_ case - no, we probably can't fix things. But we could > likely at least improve in some common cases if we cared. Sure, even though I'm not at all certain that copy_from_user() is that easy. We can make locking current->mm in there interruptible, all right, but that's only a part of the answer - even aside of the allocations, we'd need vma ->fault() interruptible as well, which leads to interruptible instances of ->readpage(), with all the fun _that_ would be. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail191.messagelabs.com (mail191.messagelabs.com [216.82.242.19]) by kanga.kvack.org (Postfix) with ESMTP id 67F146B004D for ; Mon, 8 Jun 2009 15:16:19 -0400 (EDT) Date: Mon, 8 Jun 2009 12:18:41 -0700 (PDT) From: Linus Torvalds Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 In-Reply-To: <20090608185041.GN8633@ZenIV.linux.org.uk> Message-ID: References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> <20090608185041.GN8633@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII Sender: owner-linux-mm@kvack.org To: Al Viro Cc: Miklos Szeredi , ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, 8 Jun 2009, Al Viro wrote: > > Sure, even though I'm not at all certain that copy_from_user() is that easy. > We can make locking current->mm in there interruptible, all right, but that's > only a part of the answer - even aside of the allocations, we'd need vma > ->fault() interruptible as well, which leads to interruptible instances of > ->readpage(), with all the fun _that_ would be. We already have all that - the NFS people wanted it. More importantly, you don't actually need to interrupt readpage itself - you just need to stop _waiting_ on it. So in your fault handler, just stop waiting, and instead just return FAULT_RETRY or whatever. Linus -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id A03B86B004D for ; Tue, 9 Jun 2009 01:27:19 -0400 (EDT) In-reply-to: <20090608175018.GM8633@ZenIV.linux.org.uk> (message from Al Viro on Mon, 8 Jun 2009 18:50:18 +0100) Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> Message-Id: From: Miklos Szeredi Date: Tue, 09 Jun 2009 07:50:38 +0200 Sender: owner-linux-mm@kvack.org To: viro@ZenIV.linux.org.uk Cc: miklos@szeredi.hu, ebiederm@xmission.com, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, 8 Jun 2009, Al Viro wrote: > On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote: > > > I'm still not getting what the problem is. AFAICS file operations are > > either > > > > a) non-interruptible but finish within a short time or > > b) may block indefinitely but are interruptible (or at least killable). > > > > Anything else is already problematic, resulting in processes "stuck in > > D state". > > Welcome to reality... > > * bread() is non-interruptible > * so's copy_from_user()/copy_to_user() And why should revoke(2) care? Just wait for the damn thing to finish. Why exactly do these need to be interruptible? Okay, if we want revoke or umount -f to be instantaneous then all that needs to be taken care of. But does it *need* to be? My idea of revoke is something like below: - make sure no new operations are started on the file - check state of tasks for ongoing operations, if interruptible send signal - wait for all pending operations to finish - kill file Thanks, Miklos -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 43D0A6B004D for ; Tue, 9 Jun 2009 01:58:24 -0400 (EDT) References: <20090606080334.GA15204@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 08 Jun 2009 23:22:50 -0700 In-Reply-To: <20090606080334.GA15204@ZenIV.linux.org.uk> (Al Viro's message of "Sat\, 6 Jun 2009 09\:03\:34 +0100") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Sender: owner-linux-mm@kvack.org To: Al Viro Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Nick Piggin , Andrew Morton , Christoph Hellwig List-ID: Al Viro writes: > On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote: >> >> I found myself looking at the uio, seeing that it does not support pci >> hot-unplug, and thinking "Great yet another implementation of >> hotunplug logic that needs to be added". >> >> I decided to see what it would take to add a generic implementation of >> the code we have for supporting hot unplugging devices in sysfs, proc, >> sysctl, tty_io, and now almost in the tun driver. >> >> Not long after I touched the tun driver and made it safe to delete the >> network device while still holding it's file descriptor open I someone >> else touch the code adding a different feature and my careful work >> went up in flames. Which brought home another point at the best of it >> this is ultimately complex tricky code that subsystems should not need >> to worry about. >> >> What makes this even more interesting is that in the presence of pci >> hot-unplug it looks like most subsystems and most devices will have to >> deal with the issue one way or another. >> >> This infrastructure could also be used to implement both force >> unmounts and sys_revoke. When I could not think of a better name for >> I have drawn on that and used revoke. > > To be honest, the longer I'm looking at it, the less I like the approach... > It really looks as if we'd be much better off with functionality sitting > in a set of library helpers to be used by instances that need this stuff. > Do we really want it for generic case? I think so. I do know I have seen enough weird cases actually being used and not being done correctly we want a clean pattern for handling the general case that works and is complete. The problem seems to break up into several pieces. - unmap support. - Getting a list of the files that are open for an inode. - Waking up interruptible sleepers. - A test to see if we are executing any of the functions in the file_operations structure. (needed before we can free state) - Calling frelease and generally releasing of the state held by the file. It might be possible to solve the entire problem outside of the vfs > Note that "we might someday implement real force-umount" doesn't count; > the same kind of arguments had been given nine years ago in case of AIO > ("oh, sure, we'll eventually cover foo_get_block() too - it will all be > a state machine, fully asynchronous; whaddya mean 'it's not feasible'?"). > Of course, it was _not_ feasible and had never been implemented. > Frankly, I very much suspect that force-umount is another case like that; > we'll need a *lot* of interesting cooperation from fs for that to work and > to be useful. I'd be delighted to be proven incorrect on that one, so > if you have anything serious in that direction, please share the details. So far nothing but thought experiments, but you have a good point at least a proof of concept should be done of the various pieces. To flush out some niggling little detail that messes up the design. So I hereby sign up for writing a sys_revoke patch, a forced umount patch and a writing a patch to ext2 to support it. Supporting proc and sysfs while easy is not really the common case of an nfs exportable block filesystem so it is not complete. > As for the patchset in the current form... Could you explain what's to prevent > POSIX locks and dnotify entries from outliving a struct file you'd revoked, > seeing that filp_close() will skip killing them in that case. Good catch that looks like a big fat bug to me. It seems I overlooked the fact that we actually free things in filp_close. Given that posix_remove_file calls vfs_lock_file which calls file->f_op->lock it looks like something really needs to be done here. dnotify_flush doesn't look to hard to spin a special case for revoke. I am going to have to spend I while longer studying the rest of the code in filp_close. I hope I don't need to figure out the various fl_owner_t values to safely revoke a file, but it looks like I might. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 12EAF6B004D for ; Tue, 9 Jun 2009 02:06:24 -0400 (EDT) References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 08 Jun 2009 23:31:16 -0700 In-Reply-To: (Miklos Szeredi's message of "Tue\, 09 Jun 2009 07\:50\:38 +0200") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Sender: owner-linux-mm@kvack.org To: Miklos Szeredi Cc: viro@ZenIV.linux.org.uk, linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, torvalds@linux-foundation.org, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: Miklos Szeredi writes: > On Mon, 8 Jun 2009, Al Viro wrote: >> On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote: >> >> > I'm still not getting what the problem is. AFAICS file operations are >> > either >> > >> > a) non-interruptible but finish within a short time or >> > b) may block indefinitely but are interruptible (or at least killable). >> > >> > Anything else is already problematic, resulting in processes "stuck in >> > D state". >> >> Welcome to reality... >> >> * bread() is non-interruptible >> * so's copy_from_user()/copy_to_user() > > And why should revoke(2) care? Just wait for the damn thing to > finish. Why exactly do these need to be interruptible? Agreed. I expect the data size is going to be a page or less. Which is at most 64K on some weird architectures. I think that counts as a short time waiting for disk I/O. Baring thrashing. > Okay, if we want revoke or umount -f to be instantaneous then all that > needs to be taken care of. But does it *need* to be? Good question. I wonder what umount -f needs when we yank out a usb drive. > My idea of revoke is something like below: > > - make sure no new operations are started on the file > - check state of tasks for ongoing operations, if interruptible send signal Figuring out who to send a signal to is tricky. Still it should be doable in the common case. > - wait for all pending operations to finish > - kill file Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail202.messagelabs.com (mail202.messagelabs.com [216.82.254.227]) by kanga.kvack.org (Postfix) with SMTP id 582D96B004D for ; Tue, 9 Jun 2009 02:17:40 -0400 (EDT) References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> <20090608185041.GN8633@ZenIV.linux.org.uk> From: ebiederm@xmission.com (Eric W. Biederman) Date: Mon, 08 Jun 2009 23:42:53 -0700 In-Reply-To: (Linus Torvalds's message of "Mon\, 8 Jun 2009 12\:18\:41 -0700 \(PDT\)") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Sender: owner-linux-mm@kvack.org To: Linus Torvalds Cc: Al Viro , Miklos Szeredi , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, npiggin@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: Linus Torvalds writes: > On Mon, 8 Jun 2009, Al Viro wrote: >> >> Sure, even though I'm not at all certain that copy_from_user() is that easy. >> We can make locking current->mm in there interruptible, all right, but that's >> only a part of the answer - even aside of the allocations, we'd need vma >> ->fault() interruptible as well, which leads to interruptible instances of >> ->readpage(), with all the fun _that_ would be. > > We already have all that - the NFS people wanted it. > > More importantly, you don't actually need to interrupt readpage itself - > you just need to stop _waiting_ on it. So in your fault handler, just stop > waiting, and instead just return FAULT_RETRY or whatever. That sounds doable. Has that code been merged yet? I took a quick look and it didn't see anyone breaking out of page fault with a signal or code to really handle that. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail172.messagelabs.com (mail172.messagelabs.com [216.82.254.3]) by kanga.kvack.org (Postfix) with ESMTP id 45AFE6B0055 for ; Tue, 9 Jun 2009 06:05:52 -0400 (EDT) Date: Tue, 9 Jun 2009 12:38:32 +0200 From: Nick Piggin Subject: Re: [PATCH 03/23] vfs: Generalize the file_list Message-ID: <20090609103832.GI14820@wotan.suse.de> References: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> <20090602070642.GD31556@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote: > Nick Piggin writes: > > >> +static inline void file_list_unlock(struct file_list *files) > >> +{ > >> + spin_unlock(&files->lock); > >> +} > > > > I don't really like this. It's just a list head. Get rid of > > all these wrappers and crap I'd say. In fact, starting with my > > patch to unexport files_lock and remove these wrappers would > > be reasonable, wouldn't it? > > I don't really mind killing the wrappers. > > I do mind your patch because it makes the list going through > the tty's something very different. In my view of the world > that is the only use case is what I'm working to move up more > into the vfs layer. So orphaning it seems wrong. My patch doesn't orphan it, it just makes the locking more explicit and that's all so it should be easier to work with. I just mean start with my patch and you could change things as needed. > > Increasing the size of the struct inode by 24 bytes hurts. > > Even when you decrapify it and can reuse i_lock or something, > > then it is still 16 bytes on 64-bit. > > We can get it even smaller if we make it an hlist. A hlist_head is > only a single pointer. This size growth appears to be one of the > biggest weakness of the code. 8 bytes would be a lot better than 24. > > I haven't looked through all the patches... but this is to > > speed up a slowpath operation, isn't it? Or does revoke > > need to be especially performant? > > This was more about simplicity rather than performance. The > performance gain is using a per inode lock instead of a global lock. > Which keeps cache lines from bouncing. Yes but we already have such a global lock which has been OK until now. Granted that some users are running into these locks, but fine graining them can be considered independently I think. So using per-sb lists of files and not bloating struct inode any more could be a less controversial step for you. > > So this patch is purely a perofrmance improvement? Then I think > > it needs to be justified with numbers and the downsides (bloating > > struct inode in particulra) to be changelogged. > > Certainly the cost. > > One of the things I have discovered since I wrote this patch is the > i_devices list. Which means we don't necessarily need to have heads > in places other than struct inode. A character device driver (aka the > tty code) can walk it's inode list and from each inode walk the file > list. I need to check the locking on that one. > > If that simplification works we can move all maintenance of the file > list into the vfs and not need a separate file list concept. I will > take a look. Thanks, Nick -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail138.messagelabs.com (mail138.messagelabs.com [216.82.249.35]) by kanga.kvack.org (Postfix) with ESMTP id 191F06B004D for ; Tue, 9 Jun 2009 06:19:45 -0400 (EDT) Date: Tue, 9 Jun 2009 12:52:51 +0200 From: Nick Piggin Subject: Re: [PATCH 0/23] File descriptor hot-unplug support v2 Message-ID: <20090609105251.GK14820@wotan.suse.de> References: <20090606080334.GA15204@ZenIV.linux.org.uk> <20090608162913.GL8633@ZenIV.linux.org.uk> <20090608175018.GM8633@ZenIV.linux.org.uk> <20090608185041.GN8633@ZenIV.linux.org.uk> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Linus Torvalds , Al Viro , Miklos Szeredi , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, hugh@veritas.com, tj@kernel.org, adobriyan@gmail.com, alan@lxorguk.ukuu.org.uk, gregkh@suse.de, akpm@linux-foundation.org, hch@infradead.org List-ID: On Mon, Jun 08, 2009 at 11:42:53PM -0700, Eric W. Biederman wrote: > Linus Torvalds writes: > > > On Mon, 8 Jun 2009, Al Viro wrote: > >> > >> Sure, even though I'm not at all certain that copy_from_user() is that easy. > >> We can make locking current->mm in there interruptible, all right, but that's > >> only a part of the answer - even aside of the allocations, we'd need vma > >> ->fault() interruptible as well, which leads to interruptible instances of > >> ->readpage(), with all the fun _that_ would be. > > > > We already have all that - the NFS people wanted it. > > > > More importantly, you don't actually need to interrupt readpage itself - > > you just need to stop _waiting_ on it. So in your fault handler, just stop > > waiting, and instead just return FAULT_RETRY or whatever. > > That sounds doable. Has that code been merged yet? > > I took a quick look and it didn't see anyone breaking out of page fault with a > signal or code to really handle that. The problem is get_user_pages I think. Now that we have a good number of fault flags, we can pass down whether the caller is able to be interrupted or not. Ben H had some interest in doing this, but I don't know how far he got with it. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail137.messagelabs.com (mail137.messagelabs.com [216.82.249.19]) by kanga.kvack.org (Postfix) with SMTP id 7E0866B004D for ; Tue, 9 Jun 2009 13:51:21 -0400 (EDT) References: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> <20090602070642.GD31556@wotan.suse.de> <20090609103832.GI14820@wotan.suse.de> From: ebiederm@xmission.com (Eric W. Biederman) Date: Tue, 09 Jun 2009 11:38:59 -0700 In-Reply-To: <20090609103832.GI14820@wotan.suse.de> (Nick Piggin's message of "Tue\, 9 Jun 2009 12\:38\:32 +0200") Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Subject: Re: [PATCH 03/23] vfs: Generalize the file_list Sender: owner-linux-mm@kvack.org To: Nick Piggin Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: Nick Piggin writes: > On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote: >> Nick Piggin writes: >> >> >> +static inline void file_list_unlock(struct file_list *files) >> >> +{ >> >> + spin_unlock(&files->lock); >> >> +} >> > >> > I don't really like this. It's just a list head. Get rid of >> > all these wrappers and crap I'd say. In fact, starting with my >> > patch to unexport files_lock and remove these wrappers would >> > be reasonable, wouldn't it? >> >> I don't really mind killing the wrappers. >> >> I do mind your patch because it makes the list going through >> the tty's something very different. In my view of the world >> that is the only use case is what I'm working to move up more >> into the vfs layer. So orphaning it seems wrong. > > My patch doesn't orphan it, it just makes the locking more > explicit and that's all so it should be easier to work with. > I just mean start with my patch and you could change things > as needed. As I recall you weren't using the files_lock for the tty layer. I seem to recall you were still walking through the same list head on struct file. Regardless it sure felt like pushing the tty usage out into some weird special case. My goal is to make it reasonable for more character drivers to use the list so it isn't an especially comfortable starting place for me. >> > Increasing the size of the struct inode by 24 bytes hurts. >> > Even when you decrapify it and can reuse i_lock or something, >> > then it is still 16 bytes on 64-bit. >> >> We can get it even smaller if we make it an hlist. A hlist_head is >> only a single pointer. This size growth appears to be one of the >> biggest weakness of the code. > > 8 bytes would be a lot better than 24. Definitely. >> > I haven't looked through all the patches... but this is to >> > speed up a slowpath operation, isn't it? Or does revoke >> > need to be especially performant? >> >> This was more about simplicity rather than performance. The >> performance gain is using a per inode lock instead of a global lock. >> Which keeps cache lines from bouncing. > > Yes but we already have such a global lock which has been > OK until now. Granted that some users are running into these > locks, but fine graining them can be considered independently > I think. So using per-sb lists of files and not bloating > struct inode any more could be a less controversial step > for you. I will take a look. Certainly doing the work in a couple of patches seems reasonable. If I can move all of the list maintenance out of the tty layer. That looks to be the ideal case. Eric -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail143.messagelabs.com (mail143.messagelabs.com [216.82.254.35]) by kanga.kvack.org (Postfix) with ESMTP id 562816B0085 for ; Wed, 10 Jun 2009 02:05:07 -0400 (EDT) Date: Wed, 10 Jun 2009 08:05:11 +0200 From: Nick Piggin Subject: Re: [PATCH 03/23] vfs: Generalize the file_list Message-ID: <20090610060511.GA31155@wotan.suse.de> References: <1243893048-17031-3-git-send-email-ebiederm@xmission.com> <20090602070642.GD31556@wotan.suse.de> <20090609103832.GI14820@wotan.suse.de> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Sender: owner-linux-mm@kvack.org To: "Eric W. Biederman" Cc: Al Viro , linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, Hugh Dickins , Tejun Heo , Alexey Dobriyan , Linus Torvalds , Alan Cox , Greg Kroah-Hartman , Andrew Morton , Christoph Hellwig , "Eric W. Biederman" List-ID: On Tue, Jun 09, 2009 at 11:38:59AM -0700, Eric W. Biederman wrote: > Nick Piggin writes: > > > On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote: > >> Nick Piggin writes: > >> > >> >> +static inline void file_list_unlock(struct file_list *files) > >> >> +{ > >> >> + spin_unlock(&files->lock); > >> >> +} > >> > > >> > I don't really like this. It's just a list head. Get rid of > >> > all these wrappers and crap I'd say. In fact, starting with my > >> > patch to unexport files_lock and remove these wrappers would > >> > be reasonable, wouldn't it? > >> > >> I don't really mind killing the wrappers. > >> > >> I do mind your patch because it makes the list going through > >> the tty's something very different. In my view of the world > >> that is the only use case is what I'm working to move up more > >> into the vfs layer. So orphaning it seems wrong. > > > > My patch doesn't orphan it, it just makes the locking more > > explicit and that's all so it should be easier to work with. > > I just mean start with my patch and you could change things > > as needed. > > As I recall you weren't using the files_lock for the tty layer. I > seem to recall you were still walking through the same list head on > struct file. > > Regardless it sure felt like pushing the tty usage out into > some weird special case. My goal is to make it reasonable for > more character drivers to use the list so it isn't an especially > comfortable starting place for me. I don't see the problem. It made files_lock for filesystems and uses another lock for tty. Tty is a special case (or different case) compared with filesystem, and how did it make it unreasonable for character drivers to use the list? Mandating the locking and list to be in the inode for everyone is just bloating things up. > >> > Increasing the size of the struct inode by 24 bytes hurts. > >> > Even when you decrapify it and can reuse i_lock or something, > >> > then it is still 16 bytes on 64-bit. > >> > >> We can get it even smaller if we make it an hlist. A hlist_head is > >> only a single pointer. This size growth appears to be one of the > >> biggest weakness of the code. > > > > 8 bytes would be a lot better than 24. > > Definitely. > > >> > I haven't looked through all the patches... but this is to > >> > speed up a slowpath operation, isn't it? Or does revoke > >> > need to be especially performant? > >> > >> This was more about simplicity rather than performance. The > >> performance gain is using a per inode lock instead of a global lock. > >> Which keeps cache lines from bouncing. > > > > Yes but we already have such a global lock which has been > > OK until now. Granted that some users are running into these > > locks, but fine graining them can be considered independently > > I think. So using per-sb lists of files and not bloating > > struct inode any more could be a less controversial step > > for you. > > I will take a look. Certainly doing the work in a couple > of patches seems reasonable. If I can move all of the list > maintenance out of the tty layer. That looks to be the ideal > case. I will wait to see. It will be nice if you have any obvious standalone fixes or improvements then to post them first or in front of your patchset: I'd like to make some progress here too to help my locking patchset. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org