* [PATCH 01/23] mm: Introduce revoke_file_mappings.
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 22:25 ` Andrew Morton
2009-06-01 21:50 ` [PATCH 02/23] vfs: Implement unpoll_file Eric W. Biederman
` (22 subsequent siblings)
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@xmission.com>
When the backing store of a file becomes inaccessible we need a function
to remove that file from the page tables and arrange for page faults
to receive SIGBUS until the file is unmapped.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
include/linux/mm.h | 2 +
mm/memory.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..5d7480d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -808,6 +808,8 @@ static inline void unmap_shared_mapping_range(struct address_space *mapping,
extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
+extern void revoke_file_mappings(struct file *file);
+
#ifdef CONFIG_MMU
extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, int write_access);
diff --git a/mm/memory.c b/mm/memory.c
index 4126dd1..5cbee3b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -55,6 +55,7 @@
#include <linux/kallsyms.h>
#include <linux/swapops.h>
#include <linux/elf.h>
+#include <linux/file.h>
#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -2358,6 +2359,103 @@ void unmap_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL(unmap_mapping_range);
+static int revoked_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ return VM_FAULT_SIGBUS;
+}
+
+static struct vm_operations_struct revoked_vm_ops = {
+ .fault = revoked_fault,
+};
+
+static void revoke_vma(struct vm_area_struct *vma)
+{
+ struct file *file = vma->vm_file;
+ struct address_space *mapping = file->f_mapping;
+ unsigned long start_addr, end_addr, size;
+ struct mm_struct *mm;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+
+ /* Switch out the locks so I can maninuplate this under the mm sem.
+ * Needed so I can call vm_ops->close.
+ */
+ mm = vma->vm_mm;
+ atomic_inc(&mm->mm_users);
+ spin_unlock(&mapping->i_mmap_lock);
+
+ /* Block page faults and other code modifying the mm. */
+ down_write(&mm->mmap_sem);
+
+ /* Lookup a vma for my file address */
+ vma = find_vma(mm, start_addr);
+ if (vma->vm_file != file)
+ goto out;
+
+ start_addr = vma->vm_start;
+ end_addr = vma->vm_end;
+ size = end_addr - start_addr;
+
+ /* Unlock the pages */
+ if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
+ mm->locked_vm -= vma_pages(vma);
+ vma->vm_flags &= ~VM_LOCKED;
+ }
+
+ /* Unmap the vma */
+ zap_page_range(vma, start_addr, size, NULL);
+
+ /* Unlink the vma from the file */
+ unlink_file_vma(vma);
+
+ /* Close the vma */
+ if (vma->vm_ops && vma->vm_ops->close)
+ vma->vm_ops->close(vma);
+ fput(vma->vm_file);
+ vma->vm_file = NULL;
+ if (vma->vm_flags & VM_EXECUTABLE)
+ removed_exe_file_vma(vma->vm_mm);
+
+ /* Repurpose the vma */
+ vma->vm_private_data = NULL;
+ vma->vm_ops = &revoked_vm_ops;
+ vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
+out:
+ up_write(&mm->mmap_sem);
+ spin_lock(&mapping->i_mmap_lock);
+}
+
+void revoke_file_mappings(struct file *file)
+{
+ /* After a file has been marked dead update the vmas */
+ struct address_space *mapping = file->f_mapping;
+ struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+
+ spin_lock(&mapping->i_mmap_lock);
+
+restart_tree:
+ vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, 0, ULONG_MAX) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ revoke_vma(vma);
+ goto restart_tree;
+ }
+
+restart_list:
+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+ /* Skip quickly over vmas that do not need to be touched */
+ if (vma->vm_file != file)
+ continue;
+ revoke_vma(vma);
+ goto restart_list;
+ }
+
+ spin_unlock(&mapping->i_mmap_lock);
+}
+
/**
* vmtruncate - unmap mappings "freed" by truncate() syscall
* @inode: inode of the file used
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 01/23] mm: Introduce revoke_file_mappings.
2009-06-01 21:50 ` [PATCH 01/23] mm: Introduce revoke_file_mappings Eric W. Biederman
@ 2009-06-01 22:25 ` Andrew Morton
2009-06-02 0:12 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Andrew Morton @ 2009-06-01 22:25 UTC (permalink / raw)
To: Eric W. Biederman
Cc: viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh, tj,
adobriyan, torvalds, alan, gregkh, npiggin, hch, ebiederm,
ebiederm
On Mon, 1 Jun 2009 14:50:26 -0700
"Eric W. Biederman" <ebiederm@xmission.com> wrote:
> +static void revoke_vma(struct vm_area_struct *vma)
This looks odd.
> +{
> + struct file *file = vma->vm_file;
> + struct address_space *mapping = file->f_mapping;
> + unsigned long start_addr, end_addr, size;
> + struct mm_struct *mm;
> +
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
We take a copy of start_addr/end_addr (and this end_addr value is never used)
> + /* Switch out the locks so I can maninuplate this under the mm sem.
> + * Needed so I can call vm_ops->close.
> + */
> + mm = vma->vm_mm;
> + atomic_inc(&mm->mm_users);
> + spin_unlock(&mapping->i_mmap_lock);
> +
> + /* Block page faults and other code modifying the mm. */
> + down_write(&mm->mmap_sem);
> +
> + /* Lookup a vma for my file address */
> + vma = find_vma(mm, start_addr);
Then we look up a vma. Is there reason to believe that this will
differ from the incoming arg which we just overwrote? Maybe the code
is attempting to handle racing concurrent mmap/munmap activity? If so,
what are the implications of this?
I _think_ that what the function is attempting to do is "unmap the vma
which covers the address at vma->start_addr". If so, why not just pass
it that virtual address?
Anyway, it's all a bit obscure and I do think that the semantics and
behaviour should be carefully explained in a comment, no?
> + if (vma->vm_file != file)
> + goto out;
This strengthens the theory that some sort of race-management is
happening here.
> + start_addr = vma->vm_start;
> + end_addr = vma->vm_end;
> + size = end_addr - start_addr;
> +
> + /* Unlock the pages */
> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
> + mm->locked_vm -= vma_pages(vma);
> + vma->vm_flags &= ~VM_LOCKED;
> + }
> +
> + /* Unmap the vma */
> + zap_page_range(vma, start_addr, size, NULL);
> +
> + /* Unlink the vma from the file */
> + unlink_file_vma(vma);
> +
> + /* Close the vma */
> + if (vma->vm_ops && vma->vm_ops->close)
> + vma->vm_ops->close(vma);
> + fput(vma->vm_file);
> + vma->vm_file = NULL;
> + if (vma->vm_flags & VM_EXECUTABLE)
> + removed_exe_file_vma(vma->vm_mm);
> +
> + /* Repurpose the vma */
> + vma->vm_private_data = NULL;
> + vma->vm_ops = &revoked_vm_ops;
> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
> +out:
> + up_write(&mm->mmap_sem);
> + spin_lock(&mapping->i_mmap_lock);
> +}
Also, I'm not a bit fan of the practice of overwriting the value of a
formal argument, especially in a function which is this large and
complex. It makes the code harder to follow, because the one variable
holds two conceptually different things within the span of the same
function. And it adds risk that someone will will later access a field
of *vma and it will be the wrong vma. Worse, the bug is only exposed
under exeedingly rare conditions.
So.. Use a new local, please.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 01/23] mm: Introduce revoke_file_mappings.
2009-06-01 22:25 ` Andrew Morton
@ 2009-06-02 0:12 ` Eric W. Biederman
0 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 0:12 UTC (permalink / raw)
To: Andrew Morton
Cc: viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel, tj,
hugh.dickins, adobriyan, torvalds, alan, gregkh, npiggin, hch,
ebiederm
Andrew Morton <akpm@linux-foundation.org> writes:
> On Mon, 1 Jun 2009 14:50:26 -0700
> "Eric W. Biederman" <ebiederm@xmission.com> wrote:
>
>> +static void revoke_vma(struct vm_area_struct *vma)
>
> This looks odd.
>
>> +{
>> + struct file *file = vma->vm_file;
>> + struct address_space *mapping = file->f_mapping;
>> + unsigned long start_addr, end_addr, size;
>> + struct mm_struct *mm;
>> +
>> + start_addr = vma->vm_start;
>> + end_addr = vma->vm_end;
>
> We take a copy of start_addr/end_addr (and this end_addr value is never used)
A foolish consistency.
>> + /* Switch out the locks so I can maninuplate this under the mm sem.
>> + * Needed so I can call vm_ops->close.
>> + */
>> + mm = vma->vm_mm;
>> + atomic_inc(&mm->mm_users);
>> + spin_unlock(&mapping->i_mmap_lock);
>> +
>> + /* Block page faults and other code modifying the mm. */
>> + down_write(&mm->mmap_sem);
>> +
>> + /* Lookup a vma for my file address */
>> + vma = find_vma(mm, start_addr);
>
> Then we look up a vma. Is there reason to believe that this will
> differ from the incoming arg which we just overwrote? Maybe the code
> is attempting to handle racing concurrent mmap/munmap activity? If so,
> what are the implications of this?
Yes it is. The file based index is only safe while we hold the i_mmap_lock.
The manipulation that needs to happen under the mmap_sem.
So I drop all of the locks and restart. And use the time honored
kernel practice of relooking up the thing I am going to manipulate.
As long as it is for the same file I don't care.
> I _think_ that what the function is attempting to do is "unmap the vma
> which covers the address at vma->start_addr". If so, why not just pass
> it that virtual address?
Actually it is unmapping a vma for the file I am revoking. I hand it
one and then it does an address space jig.
> Anyway, it's all a bit obscure and I do think that the semantics and
> behaviour should be carefully explained in a comment, no?
>
>> + if (vma->vm_file != file)
>> + goto out;
>
> This strengthens the theory that some sort of race-management is
> happening here.
Totally. I dropped all of my locks so I am having to restart in
a different locking context.
>> + start_addr = vma->vm_start;
>> + end_addr = vma->vm_end;
>> + size = end_addr - start_addr;
>> +
>> + /* Unlock the pages */
>> + if (mm->locked_vm && (vma->vm_flags & VM_LOCKED)) {
>> + mm->locked_vm -= vma_pages(vma);
>> + vma->vm_flags &= ~VM_LOCKED;
>> + }
>> +
>> + /* Unmap the vma */
>> + zap_page_range(vma, start_addr, size, NULL);
>> +
>> + /* Unlink the vma from the file */
>> + unlink_file_vma(vma);
>> +
>> + /* Close the vma */
>> + if (vma->vm_ops && vma->vm_ops->close)
>> + vma->vm_ops->close(vma);
>> + fput(vma->vm_file);
>> + vma->vm_file = NULL;
>> + if (vma->vm_flags & VM_EXECUTABLE)
>> + removed_exe_file_vma(vma->vm_mm);
>> +
>> + /* Repurpose the vma */
>> + vma->vm_private_data = NULL;
>> + vma->vm_ops = &revoked_vm_ops;
>> + vma->vm_flags &= ~(VM_NONLINEAR | VM_CAN_NONLINEAR);
>> +out:
>> + up_write(&mm->mmap_sem);
>> + spin_lock(&mapping->i_mmap_lock);
>> +}
>
> Also, I'm not a bit fan of the practice of overwriting the value of a
> formal argument, especially in a function which is this large and
> complex. It makes the code harder to follow, because the one variable
> holds two conceptually different things within the span of the same
> function. And it adds risk that someone will will later access a field
> of *vma and it will be the wrong vma. Worse, the bug is only exposed
> under exeedingly rare conditions.
>
> So.. Use a new local, please.
We can never legitimately have more than one vma manipulated in this function.
As for the rest. I guess I just assumed that the reader of the code would
have a basic understanding of the locking rules for those data structures.
Certainly the worst thing I suffer from is being close to the code,
and not realizing which pieces are not obvious to a naive observer.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 02/23] vfs: Implement unpoll_file.
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
2009-06-01 21:50 ` [PATCH 01/23] mm: Introduce revoke_file_mappings Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-06 8:08 ` Al Viro
2009-06-01 21:50 ` [PATCH 03/23] vfs: Generalize the file_list Eric W. Biederman
` (21 subsequent siblings)
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
During a revoke operation it is necessary to stop using all state that is managed
by the underlying file operations implementation. The poll wait queue is one part
of that state.
unpoll_file achieves that by walking through a specified waitqueue. Finding
any entries that were added by select or poll of that file descriptor and
awakening them. If action was taken unpoll sleeps and repeats until
the waitqueue has no entries for the spcified file.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/select.c | 31 +++++++++++++++++++++++++++++++
include/linux/poll.h | 2 ++
2 files changed, 33 insertions(+), 0 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..bd30fe8 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -941,3 +941,34 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
return ret;
}
#endif /* HAVE_SET_RESTORE_SIGMASK */
+
+#ifdef CONFIG_FILE_HOTPLUG
+static int unpoll_file_once(wait_queue_head_t *q, struct file *file)
+{
+ unsigned long flags;
+ wait_queue_t *curr, *next;
+ int found = 0;
+
+ spin_lock_irqsave(&q->lock, flags);
+ list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
+ struct poll_table_entry *entry;
+ if (curr->func != pollwake)
+ continue;
+ entry = container_of(curr, struct poll_table_entry, wait);
+ if (entry->filp != file)
+ continue;
+ curr->func(curr, TASK_NORMAL, 0, NULL);
+ found = 1;
+ }
+ spin_unlock_irqrestore(&q->lock, flags);
+
+ return found;
+}
+
+void unpoll_file(wait_queue_head_t *q, struct file *file)
+{
+ while (unpoll_file_once(q, file))
+ schedule_timeout_uninterruptible(1);
+}
+EXPORT_SYMBOL(unpoll_file);
+#endif
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..d388620 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -131,6 +131,8 @@ extern int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
extern int poll_select_set_timeout(struct timespec *to, long sec, long nsec);
+extern void unpoll_file(wait_queue_head_t *q, struct file *file);
+
#endif /* KERNEL */
#endif /* _LINUX_POLL_H */
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 02/23] vfs: Implement unpoll_file.
2009-06-01 21:50 ` [PATCH 02/23] vfs: Implement unpoll_file Eric W. Biederman
@ 2009-06-06 8:08 ` Al Viro
0 siblings, 0 replies; 99+ messages in thread
From: Al Viro @ 2009-06-06 8:08 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
On Mon, Jun 01, 2009 at 02:50:27PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>
> During a revoke operation it is necessary to stop using all state that is managed
> by the underlying file operations implementation. The poll wait queue is one part
> of that state.
Erm... Seeing that drivers and filesystems tend to have fsckloads of
other state of their own, why do we treat that separately?
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 03/23] vfs: Generalize the file_list
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
2009-06-01 21:50 ` [PATCH 01/23] mm: Introduce revoke_file_mappings Eric W. Biederman
2009-06-01 21:50 ` [PATCH 02/23] vfs: Implement unpoll_file Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-02 7:06 ` Nick Piggin
2009-06-01 21:50 ` [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Eric W. Biederman
` (20 subsequent siblings)
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@xmission.com>
In the implementation of revoke it is desirable to find all of the
files we want to operation on. Currently tty's and mark_files_ro
use the file_list for this, and this patch generalizes the file
list so it can be used more efficiently.
This patch starts by introducing struct file_list making the file
list a first class object. file_list_lock and file_list_unlock
are modified to take this object, making it clear which file_list
we intended to lock.
file_move is transformed into file_list_add taking a file_list and not
allowing the movement of one file to another. __dentry_open
is modified to support this by only adding normal files in open,
special files have always been ignored when walking the file_list.
__dentry_open skipping special files allows __ptmx_open and __tty_open
to safely call file_add as they are adding the file to the file_list
for the first time.
file_kill has been renamed file_list_del to make it clear what it is
doing and to keep from confusing it with a more revoke like operation.
put_filp has been modified to not take file_list_del as we are never
on a file_list when put_filp is called.
fs_may_remount_ro and mark_files_ro have been modified to walk the
inode list to find all of the inodes and then to walk the file list
on those inodes. It can be a slightly longer walk as we frequently
cache inodes that we do not have open but the overall complexity
should be about the same, these are slow path functions, and it
gives us much greater flexibility overall.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
drivers/char/pty.c | 2 +-
drivers/char/tty_io.c | 22 ++++-----
fs/file_table.c | 117 +++++++++++++++++++++++++---------------------
fs/inode.c | 1 +
fs/open.c | 6 ++-
fs/select.c | 2 -
fs/super.c | 1 -
include/linux/fs.h | 24 +++++++--
include/linux/tty.h | 2 +-
security/selinux/hooks.c | 8 ++--
10 files changed, 102 insertions(+), 83 deletions(-)
diff --git a/drivers/char/pty.c b/drivers/char/pty.c
index 31038a0..585f700 100644
--- a/drivers/char/pty.c
+++ b/drivers/char/pty.c
@@ -662,7 +662,7 @@ static int __ptmx_open(struct inode *inode, struct file *filp)
set_bit(TTY_PTY_LOCK, &tty->flags); /* LOCK THE SLAVE */
filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_list_add(filp, &tty->tty_files);
retval = devpts_pty_new(inode, tty->link);
if (retval)
diff --git a/drivers/char/tty_io.c b/drivers/char/tty_io.c
index 66b99a2..b5c0ca1 100644
--- a/drivers/char/tty_io.c
+++ b/drivers/char/tty_io.c
@@ -235,11 +235,11 @@ static int check_tty_count(struct tty_struct *tty, const char *routine)
struct list_head *p;
int count = 0;
- file_list_lock();
- list_for_each(p, &tty->tty_files) {
+ file_list_lock(&tty->tty_files);
+ list_for_each(p, &tty->tty_files.list) {
count++;
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_SLAVE &&
tty->link && tty->link->count)
@@ -554,9 +554,9 @@ static void do_tty_hangup(struct work_struct *work)
spin_unlock(&redirect_lock);
check_tty_count(tty, "do_tty_hangup");
- file_list_lock();
+ file_list_lock(&tty->tty_files);
/* This breaks for file handles being sent over AF_UNIX sockets ? */
- list_for_each_entry(filp, &tty->tty_files, f_u.fu_list) {
+ list_for_each_entry(filp, &tty->tty_files.list, f_u.fu_list) {
if (filp->f_op->write == redirected_tty_write)
cons_filp = filp;
if (filp->f_op->write != tty_write)
@@ -565,7 +565,7 @@ static void do_tty_hangup(struct work_struct *work)
tty_fasync(-1, filp, 0); /* can't block */
filp->f_op = &hung_up_tty_fops;
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
/*
* FIXME! What are the locking issues here? This may me overdoing
* things... This question is especially important now that we've
@@ -1467,10 +1467,6 @@ static void release_one_tty(struct kref *kref)
tty_driver_kref_put(driver);
module_put(driver->owner);
- file_list_lock();
- list_del_init(&tty->tty_files);
- file_list_unlock();
-
free_tty_struct(tty);
}
@@ -1678,7 +1674,7 @@ void tty_release_dev(struct file *filp)
* - do_tty_hangup no longer sees this file descriptor as
* something that needs to be handled for hangups.
*/
- file_kill(filp);
+ file_list_del(filp, &tty->tty_files);
filp->private_data = NULL;
/*
@@ -1836,7 +1832,7 @@ got_driver:
return PTR_ERR(tty);
filp->private_data = tty;
- file_move(filp, &tty->tty_files);
+ file_list_add(filp, &tty->tty_files);
check_tty_count(tty, "tty_open");
if (tty->driver->type == TTY_DRIVER_TYPE_PTY &&
tty->driver->subtype == PTY_TYPE_MASTER)
@@ -2779,7 +2775,7 @@ void initialize_tty_struct(struct tty_struct *tty,
mutex_init(&tty->echo_lock);
spin_lock_init(&tty->read_lock);
spin_lock_init(&tty->ctrl_lock);
- INIT_LIST_HEAD(&tty->tty_files);
+ init_file_list(&tty->tty_files);
INIT_WORK(&tty->SAK_work, do_SAK_work);
tty->driver = driver;
diff --git a/fs/file_table.c b/fs/file_table.c
index 334ce39..978f267 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -22,6 +22,7 @@
#include <linux/fsnotify.h>
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
+#include <linux/writeback.h>
#include <asm/atomic.h>
@@ -30,9 +31,6 @@ struct files_stat_struct files_stat = {
.max_files = NR_FILE
};
-/* public. Not pretty! */
-__cacheline_aligned_in_smp DEFINE_SPINLOCK(files_lock);
-
/* SLAB cache for file structures */
static struct kmem_cache *filp_cachep __read_mostly;
@@ -285,7 +283,8 @@ void __fput(struct file *file)
cdev_put(inode->i_cdev);
fops_put(file->f_op);
put_pid(file->f_owner.pid);
- file_kill(file);
+ if (!special_file(inode->i_mode))
+ file_list_del(file, &inode->i_files);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
@@ -352,50 +351,57 @@ void put_filp(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
security_file_free(file);
- file_kill(file);
file_free(file);
}
}
-void file_move(struct file *file, struct list_head *list)
+void init_file_list(struct file_list *files)
{
- if (!list)
- return;
- file_list_lock();
- list_move(&file->f_u.fu_list, list);
- file_list_unlock();
+ INIT_LIST_HEAD(&files->list);
+ spin_lock_init(&files->lock);
}
-void file_kill(struct file *file)
+void file_list_add(struct file *file, struct file_list *files)
{
- if (!list_empty(&file->f_u.fu_list)) {
- file_list_lock();
- list_del_init(&file->f_u.fu_list);
- file_list_unlock();
- }
+ file_list_lock(files);
+ list_add(&file->f_u.fu_list, &files->list);
+ file_list_unlock(files);
+}
+EXPORT_SYMBOL(file_list_add);
+
+void file_list_del(struct file *file, struct file_list *files)
+{
+ file_list_lock(files);
+ list_del_init(&file->f_u.fu_list);
+ file_list_unlock(files);
}
+EXPORT_SYMBOL(file_list_del);
int fs_may_remount_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *file;
/* Check that no files are currently opened for writing. */
- file_list_lock();
- list_for_each_entry(file, &sb->s_files, f_u.fu_list) {
- struct inode *inode = file->f_path.dentry->d_inode;
-
- /* File with pending delete? */
- if (inode->i_nlink == 0)
- goto too_bad;
-
- /* Writeable file? */
- if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
- goto too_bad;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ file_list_lock(&inode->i_files);
+ list_for_each_entry(file, &inode->i_files.list, f_u.fu_list) {
+ /* File with pending delete? */
+ if (inode->i_nlink == 0)
+ goto too_bad;
+
+ /* Writeable file? */
+ if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
+ goto too_bad;
+ }
+ file_list_unlock(&inode->i_files);
}
- file_list_unlock();
+ spin_unlock(&inode_lock);
return 1; /* Tis' cool bro. */
too_bad:
- file_list_unlock();
+ file_list_unlock(&inode->i_files);
+ spin_unlock(&inode_lock);
return 0;
}
@@ -408,33 +414,38 @@ too_bad:
*/
void mark_files_ro(struct super_block *sb)
{
+ struct inode *inode;
struct file *f;
retry:
- file_list_lock();
- list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
- struct vfsmount *mnt;
- if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
- continue;
- if (!file_count(f))
- continue;
- if (!(f->f_mode & FMODE_WRITE))
- continue;
- f->f_mode &= ~FMODE_WRITE;
- if (file_check_writeable(f) != 0)
- continue;
- file_release_write(f);
- mnt = mntget(f->f_path.mnt);
- file_list_unlock();
- /*
- * This can sleep, so we can't hold
- * the file_list_lock() spinlock.
- */
- mnt_drop_write(mnt);
- mntput(mnt);
- goto retry;
+ spin_lock(&inode_lock);
+ list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ file_list_lock(&inode->i_files);
+ list_for_each_entry(f, &inode->i_files.list, f_u.fu_list) {
+ struct vfsmount *mnt;
+ if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+ continue;
+ if (!file_count(f))
+ continue;
+ if (!(f->f_mode & FMODE_WRITE))
+ continue;
+ f->f_mode &= ~FMODE_WRITE;
+ if (file_check_writeable(f) != 0)
+ continue;
+ file_release_write(f);
+ mnt = mntget(f->f_path.mnt);
+ file_list_unlock(&inode->i_files);
+ /*
+ * This can sleep, so we can't hold
+ * the file_list_lock() spinlock.
+ */
+ mnt_drop_write(mnt);
+ mntput(mnt);
+ goto retry;
+ }
+ file_list_unlock(&inode->i_files);
}
- file_list_unlock();
+ spin_unlock(&inode_lock);
}
void __init files_init(unsigned long mempages)
diff --git a/fs/inode.c b/fs/inode.c
index 9d26490..9d52d43 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -251,6 +251,7 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->inotify_watches);
mutex_init(&inode->inotify_mutex);
#endif
+ init_file_list(&inode->i_files);
}
EXPORT_SYMBOL(inode_init_once);
diff --git a/fs/open.c b/fs/open.c
index 7200e23..20c3fc0 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -828,7 +828,8 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
f->f_path.mnt = mnt;
f->f_pos = 0;
f->f_op = fops_get(inode->i_fop);
- file_move(f, &inode->i_sb->s_files);
+ if (!special_file(inode->i_mode))
+ file_list_add(f, &inode->i_files);
error = security_dentry_open(f, cred);
if (error)
@@ -873,7 +874,8 @@ cleanup_all:
mnt_drop_write(mnt);
}
}
- file_kill(f);
+ if (!special_file(inode->i_mode))
+ file_list_del(f, &inode->i_files);
f->f_path.dentry = NULL;
f->f_path.mnt = NULL;
cleanup_file:
diff --git a/fs/select.c b/fs/select.c
index bd30fe8..99e4145 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -942,7 +942,6 @@ SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
}
#endif /* HAVE_SET_RESTORE_SIGMASK */
-#ifdef CONFIG_FILE_HOTPLUG
static int unpoll_file_once(wait_queue_head_t *q, struct file *file)
{
unsigned long flags;
@@ -971,4 +970,3 @@ void unpoll_file(wait_queue_head_t *q, struct file *file)
schedule_timeout_uninterruptible(1);
}
EXPORT_SYMBOL(unpoll_file);
-#endif
diff --git a/fs/super.c b/fs/super.c
index 2ea1586..477aeb4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -65,7 +65,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
INIT_LIST_HEAD(&s->s_dirty);
INIT_LIST_HEAD(&s->s_io);
INIT_LIST_HEAD(&s->s_more_io);
- INIT_LIST_HEAD(&s->s_files);
INIT_LIST_HEAD(&s->s_instances);
INIT_HLIST_HEAD(&s->s_anon);
INIT_LIST_HEAD(&s->s_inodes);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73242c3..5329fd6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
return mapping->i_mmap_writable != 0;
}
+struct file_list {
+ spinlock_t lock;
+ struct list_head list;
+};
+
/*
* Use sequence counter to get consistent i_size on 32-bit processors.
*/
@@ -764,6 +769,7 @@ struct inode {
struct list_head inotify_watches; /* watches on this inode */
struct mutex inotify_mutex; /* protects the watches list */
#endif
+ struct file_list i_files;
unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
@@ -934,9 +940,15 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
-extern spinlock_t files_lock;
-#define file_list_lock() spin_lock(&files_lock);
-#define file_list_unlock() spin_unlock(&files_lock);
+
+static inline void file_list_lock(struct file_list *files)
+{
+ spin_lock(&files->lock);
+}
+static inline void file_list_unlock(struct file_list *files)
+{
+ spin_unlock(&files->lock);
+}
#define get_file(x) atomic_long_inc(&(x)->f_count)
#define file_count(x) atomic_long_read(&(x)->f_count)
@@ -1333,7 +1345,6 @@ struct super_block {
struct list_head s_io; /* parked for writeback */
struct list_head s_more_io; /* parked for more writeback */
struct hlist_head s_anon; /* anonymous dentries for (nfs) exporting */
- struct list_head s_files;
/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
struct list_head s_dentry_lru; /* unused dentry lru */
int s_nr_dentry_unused; /* # of dentry on lru */
@@ -2163,8 +2174,9 @@ static inline void insert_inode_hash(struct inode *inode) {
}
extern struct file * get_empty_filp(void);
-extern void file_move(struct file *f, struct list_head *list);
-extern void file_kill(struct file *f);
+extern void init_file_list(struct file_list *files);
+extern void file_list_add(struct file *f, struct file_list *files);
+extern void file_list_del(struct file *f, struct file_list *files);
#ifdef CONFIG_BLOCK
struct bio;
extern void submit_bio(int, struct bio *);
diff --git a/include/linux/tty.h b/include/linux/tty.h
index fc39db9..7f04a5e 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -250,7 +250,7 @@ struct tty_struct {
struct work_struct hangup_work;
void *disc_data;
void *driver_data;
- struct list_head tty_files;
+ struct file_list tty_files;
#define N_TTY_BUF_SIZE 4096
diff --git a/security/selinux/hooks.c b/security/selinux/hooks.c
index 2fcad7c..65afe36 100644
--- a/security/selinux/hooks.c
+++ b/security/selinux/hooks.c
@@ -2244,8 +2244,8 @@ static inline void flush_unauthorized_files(const struct cred *cred,
tty = get_current_tty();
if (tty) {
- file_list_lock();
- if (!list_empty(&tty->tty_files)) {
+ file_list_lock(&tty->tty_files);
+ if (!list_empty(&tty->tty_files.list)) {
struct inode *inode;
/* Revalidate access to controlling tty.
@@ -2253,14 +2253,14 @@ static inline void flush_unauthorized_files(const struct cred *cred,
than using file_has_perm, as this particular open
file may belong to another process and we are only
interested in the inode-based check here. */
- file = list_first_entry(&tty->tty_files, struct file, f_u.fu_list);
+ file = list_first_entry(&tty->tty_files.list, struct file, f_u.fu_list);
inode = file->f_path.dentry->d_inode;
if (inode_has_perm(cred, inode,
FILE__READ | FILE__WRITE, NULL)) {
drop_tty = 1;
}
}
- file_list_unlock();
+ file_list_unlock(&tty->tty_files);
tty_kref_put(tty);
}
/* Reset controlling tty. */
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 03/23] vfs: Generalize the file_list
2009-06-01 21:50 ` [PATCH 03/23] vfs: Generalize the file_list Eric W. Biederman
@ 2009-06-02 7:06 ` Nick Piggin
2009-06-05 19:33 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2009-06-02 7:06 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Mon, Jun 01, 2009 at 02:50:28PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
>
> In the implementation of revoke it is desirable to find all of the
> files we want to operation on. Currently tty's and mark_files_ro
> use the file_list for this, and this patch generalizes the file
> list so it can be used more efficiently.
>
> This patch starts by introducing struct file_list making the file
> list a first class object. file_list_lock and file_list_unlock
> are modified to take this object, making it clear which file_list
> we intended to lock.
>
> file_move is transformed into file_list_add taking a file_list and not
> allowing the movement of one file to another. __dentry_open
> is modified to support this by only adding normal files in open,
> special files have always been ignored when walking the file_list.
> __dentry_open skipping special files allows __ptmx_open and __tty_open
> to safely call file_add as they are adding the file to the file_list
> for the first time.
>
> file_kill has been renamed file_list_del to make it clear what it is
> doing and to keep from confusing it with a more revoke like operation.
>
> put_filp has been modified to not take file_list_del as we are never
> on a file_list when put_filp is called.
>
> fs_may_remount_ro and mark_files_ro have been modified to walk the
> inode list to find all of the inodes and then to walk the file list
> on those inodes. It can be a slightly longer walk as we frequently
> cache inodes that we do not have open but the overall complexity
> should be about the same,
Well not really. I have a couple of orders of magnitude more cached
inodes than open files here.
> these are slow path functions, and it
> gives us much greater flexibility overall.
Define flexibility. Walking the sb's file list and checking for
equality with the inode in question gives the same functionality,
just different performance profile.
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
> return mapping->i_mmap_writable != 0;
> }
>
> +struct file_list {
> + spinlock_t lock;
> + struct list_head list;
> +};
> +
> /*
> * Use sequence counter to get consistent i_size on 32-bit processors.
> */
> @@ -764,6 +769,7 @@ struct inode {
> struct list_head inotify_watches; /* watches on this inode */
> struct mutex inotify_mutex; /* protects the watches list */
> #endif
> + struct file_list i_files;
>
> unsigned long i_state;
> unsigned long dirtied_when; /* jiffies of first dirtying */
> @@ -934,9 +940,15 @@ struct file {
> unsigned long f_mnt_write_state;
> #endif
> };
> -extern spinlock_t files_lock;
> -#define file_list_lock() spin_lock(&files_lock);
> -#define file_list_unlock() spin_unlock(&files_lock);
> +
> +static inline void file_list_lock(struct file_list *files)
> +{
> + spin_lock(&files->lock);
> +}
> +static inline void file_list_unlock(struct file_list *files)
> +{
> + spin_unlock(&files->lock);
> +}
I don't really like this. It's just a list head. Get rid of
all these wrappers and crap I'd say. In fact, starting with my
patch to unexport files_lock and remove these wrappers would
be reasonable, wouldn't it?
Increasing the size of the struct inode by 24 bytes hurts.
Even when you decrapify it and can reuse i_lock or something,
then it is still 16 bytes on 64-bit.
I haven't looked through all the patches... but this is to
speed up a slowpath operation, isn't it? Or does revoke
need to be especially performant?
So this patch is purely a perofrmance improvement? Then I think
it needs to be justified with numbers and the downsides (bloating
struct inode in particulra) to be changelogged.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 03/23] vfs: Generalize the file_list
2009-06-02 7:06 ` Nick Piggin
@ 2009-06-05 19:33 ` Eric W. Biederman
2009-06-09 10:38 ` Nick Piggin
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-05 19:33 UTC (permalink / raw)
To: Nick Piggin
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
Nick Piggin <npiggin@suse.de> writes:
>> fs_may_remount_ro and mark_files_ro have been modified to walk the
>> inode list to find all of the inodes and then to walk the file list
>> on those inodes. It can be a slightly longer walk as we frequently
>> cache inodes that we do not have open but the overall complexity
>> should be about the same,
>
> Well not really. I have a couple of orders of magnitude more cached
> inodes than open files here.
Good point.
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -699,6 +699,11 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
>> return mapping->i_mmap_writable != 0;
>> }
>>
>> +struct file_list {
>> + spinlock_t lock;
>> + struct list_head list;
>> +};
>> +
>> /*
>> * Use sequence counter to get consistent i_size on 32-bit processors.
>> */
>> @@ -764,6 +769,7 @@ struct inode {
>> struct list_head inotify_watches; /* watches on this inode */
>> struct mutex inotify_mutex; /* protects the watches list */
>> #endif
>> + struct file_list i_files;
>>
>> unsigned long i_state;
>> unsigned long dirtied_when; /* jiffies of first dirtying */
>> @@ -934,9 +940,15 @@ struct file {
>> unsigned long f_mnt_write_state;
>> #endif
>> };
>> -extern spinlock_t files_lock;
>> -#define file_list_lock() spin_lock(&files_lock);
>> -#define file_list_unlock() spin_unlock(&files_lock);
>> +
>> +static inline void file_list_lock(struct file_list *files)
>> +{
>> + spin_lock(&files->lock);
>> +}
>> +static inline void file_list_unlock(struct file_list *files)
>> +{
>> + spin_unlock(&files->lock);
>> +}
>
> I don't really like this. It's just a list head. Get rid of
> all these wrappers and crap I'd say. In fact, starting with my
> patch to unexport files_lock and remove these wrappers would
> be reasonable, wouldn't it?
I don't really mind killing the wrappers.
I do mind your patch because it makes the list going through
the tty's something very different. In my view of the world
that is the only use case is what I'm working to move up more
into the vfs layer. So orphaning it seems wrong.
> Increasing the size of the struct inode by 24 bytes hurts.
> Even when you decrapify it and can reuse i_lock or something,
> then it is still 16 bytes on 64-bit.
We can get it even smaller if we make it an hlist. A hlist_head is
only a single pointer. This size growth appears to be one of the
biggest weakness of the code.
> I haven't looked through all the patches... but this is to
> speed up a slowpath operation, isn't it? Or does revoke
> need to be especially performant?
This was more about simplicity rather than performance. The
performance gain is using a per inode lock instead of a global lock.
Which keeps cache lines from bouncing.
> So this patch is purely a perofrmance improvement? Then I think
> it needs to be justified with numbers and the downsides (bloating
> struct inode in particulra) to be changelogged.
Certainly the cost.
One of the things I have discovered since I wrote this patch is the
i_devices list. Which means we don't necessarily need to have heads
in places other than struct inode. A character device driver (aka the
tty code) can walk it's inode list and from each inode walk the file
list. I need to check the locking on that one.
If that simplification works we can move all maintenance of the file
list into the vfs and not need a separate file list concept. I will
take a look.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 03/23] vfs: Generalize the file_list
2009-06-05 19:33 ` Eric W. Biederman
@ 2009-06-09 10:38 ` Nick Piggin
2009-06-09 18:38 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Nick Piggin @ 2009-06-09 10:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
>
> >> +static inline void file_list_unlock(struct file_list *files)
> >> +{
> >> + spin_unlock(&files->lock);
> >> +}
> >
> > I don't really like this. It's just a list head. Get rid of
> > all these wrappers and crap I'd say. In fact, starting with my
> > patch to unexport files_lock and remove these wrappers would
> > be reasonable, wouldn't it?
>
> I don't really mind killing the wrappers.
>
> I do mind your patch because it makes the list going through
> the tty's something very different. In my view of the world
> that is the only use case is what I'm working to move up more
> into the vfs layer. So orphaning it seems wrong.
My patch doesn't orphan it, it just makes the locking more
explicit and that's all so it should be easier to work with.
I just mean start with my patch and you could change things
as needed.
> > Increasing the size of the struct inode by 24 bytes hurts.
> > Even when you decrapify it and can reuse i_lock or something,
> > then it is still 16 bytes on 64-bit.
>
> We can get it even smaller if we make it an hlist. A hlist_head is
> only a single pointer. This size growth appears to be one of the
> biggest weakness of the code.
8 bytes would be a lot better than 24.
> > I haven't looked through all the patches... but this is to
> > speed up a slowpath operation, isn't it? Or does revoke
> > need to be especially performant?
>
> This was more about simplicity rather than performance. The
> performance gain is using a per inode lock instead of a global lock.
> Which keeps cache lines from bouncing.
Yes but we already have such a global lock which has been
OK until now. Granted that some users are running into these
locks, but fine graining them can be considered independently
I think. So using per-sb lists of files and not bloating
struct inode any more could be a less controversial step
for you.
> > So this patch is purely a perofrmance improvement? Then I think
> > it needs to be justified with numbers and the downsides (bloating
> > struct inode in particulra) to be changelogged.
>
> Certainly the cost.
>
> One of the things I have discovered since I wrote this patch is the
> i_devices list. Which means we don't necessarily need to have heads
> in places other than struct inode. A character device driver (aka the
> tty code) can walk it's inode list and from each inode walk the file
> list. I need to check the locking on that one.
>
> If that simplification works we can move all maintenance of the file
> list into the vfs and not need a separate file list concept. I will
> take a look.
Thanks,
Nick
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 03/23] vfs: Generalize the file_list
2009-06-09 10:38 ` Nick Piggin
@ 2009-06-09 18:38 ` Eric W. Biederman
2009-06-10 6:05 ` Nick Piggin
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-09 18:38 UTC (permalink / raw)
To: Nick Piggin
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
Nick Piggin <npiggin@suse.de> writes:
> On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
>> Nick Piggin <npiggin@suse.de> writes:
>>
>> >> +static inline void file_list_unlock(struct file_list *files)
>> >> +{
>> >> + spin_unlock(&files->lock);
>> >> +}
>> >
>> > I don't really like this. It's just a list head. Get rid of
>> > all these wrappers and crap I'd say. In fact, starting with my
>> > patch to unexport files_lock and remove these wrappers would
>> > be reasonable, wouldn't it?
>>
>> I don't really mind killing the wrappers.
>>
>> I do mind your patch because it makes the list going through
>> the tty's something very different. In my view of the world
>> that is the only use case is what I'm working to move up more
>> into the vfs layer. So orphaning it seems wrong.
>
> My patch doesn't orphan it, it just makes the locking more
> explicit and that's all so it should be easier to work with.
> I just mean start with my patch and you could change things
> as needed.
As I recall you weren't using the files_lock for the tty layer. I
seem to recall you were still walking through the same list head on
struct file.
Regardless it sure felt like pushing the tty usage out into
some weird special case. My goal is to make it reasonable for
more character drivers to use the list so it isn't an especially
comfortable starting place for me.
>> > Increasing the size of the struct inode by 24 bytes hurts.
>> > Even when you decrapify it and can reuse i_lock or something,
>> > then it is still 16 bytes on 64-bit.
>>
>> We can get it even smaller if we make it an hlist. A hlist_head is
>> only a single pointer. This size growth appears to be one of the
>> biggest weakness of the code.
>
> 8 bytes would be a lot better than 24.
Definitely.
>> > I haven't looked through all the patches... but this is to
>> > speed up a slowpath operation, isn't it? Or does revoke
>> > need to be especially performant?
>>
>> This was more about simplicity rather than performance. The
>> performance gain is using a per inode lock instead of a global lock.
>> Which keeps cache lines from bouncing.
>
> Yes but we already have such a global lock which has been
> OK until now. Granted that some users are running into these
> locks, but fine graining them can be considered independently
> I think. So using per-sb lists of files and not bloating
> struct inode any more could be a less controversial step
> for you.
I will take a look. Certainly doing the work in a couple
of patches seems reasonable. If I can move all of the list
maintenance out of the tty layer. That looks to be the ideal
case.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 03/23] vfs: Generalize the file_list
2009-06-09 18:38 ` Eric W. Biederman
@ 2009-06-10 6:05 ` Nick Piggin
0 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2009-06-10 6:05 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Tue, Jun 09, 2009 at 11:38:59AM -0700, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
>
> > On Fri, Jun 05, 2009 at 12:33:59PM -0700, Eric W. Biederman wrote:
> >> Nick Piggin <npiggin@suse.de> writes:
> >>
> >> >> +static inline void file_list_unlock(struct file_list *files)
> >> >> +{
> >> >> + spin_unlock(&files->lock);
> >> >> +}
> >> >
> >> > I don't really like this. It's just a list head. Get rid of
> >> > all these wrappers and crap I'd say. In fact, starting with my
> >> > patch to unexport files_lock and remove these wrappers would
> >> > be reasonable, wouldn't it?
> >>
> >> I don't really mind killing the wrappers.
> >>
> >> I do mind your patch because it makes the list going through
> >> the tty's something very different. In my view of the world
> >> that is the only use case is what I'm working to move up more
> >> into the vfs layer. So orphaning it seems wrong.
> >
> > My patch doesn't orphan it, it just makes the locking more
> > explicit and that's all so it should be easier to work with.
> > I just mean start with my patch and you could change things
> > as needed.
>
> As I recall you weren't using the files_lock for the tty layer. I
> seem to recall you were still walking through the same list head on
> struct file.
>
> Regardless it sure felt like pushing the tty usage out into
> some weird special case. My goal is to make it reasonable for
> more character drivers to use the list so it isn't an especially
> comfortable starting place for me.
I don't see the problem. It made files_lock for filesystems
and uses another lock for tty. Tty is a special case (or
different case) compared with filesystem, and how did it
make it unreasonable for character drivers to use the list?
Mandating the locking and list to be in the inode for
everyone is just bloating things up.
> >> > Increasing the size of the struct inode by 24 bytes hurts.
> >> > Even when you decrapify it and can reuse i_lock or something,
> >> > then it is still 16 bytes on 64-bit.
> >>
> >> We can get it even smaller if we make it an hlist. A hlist_head is
> >> only a single pointer. This size growth appears to be one of the
> >> biggest weakness of the code.
> >
> > 8 bytes would be a lot better than 24.
>
> Definitely.
>
> >> > I haven't looked through all the patches... but this is to
> >> > speed up a slowpath operation, isn't it? Or does revoke
> >> > need to be especially performant?
> >>
> >> This was more about simplicity rather than performance. The
> >> performance gain is using a per inode lock instead of a global lock.
> >> Which keeps cache lines from bouncing.
> >
> > Yes but we already have such a global lock which has been
> > OK until now. Granted that some users are running into these
> > locks, but fine graining them can be considered independently
> > I think. So using per-sb lists of files and not bloating
> > struct inode any more could be a less controversial step
> > for you.
>
> I will take a look. Certainly doing the work in a couple
> of patches seems reasonable. If I can move all of the list
> maintenance out of the tty layer. That looks to be the ideal
> case.
I will wait to see. It will be nice if you have any obvious
standalone fixes or improvements then to post them first or
in front of your patchset: I'd like to make some progress
here too to help my locking patchset.
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (2 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 03/23] vfs: Generalize the file_list Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-02 5:16 ` Pekka Enberg
` (2 more replies)
2009-06-01 21:50 ` [PATCH 05/23] vfs: Teach lseek to use file_hotplug_lock Eric W. Biederman
` (19 subsequent siblings)
23 siblings, 3 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@xmission.com>
Introduce the file_hotplug_lock to protect file->f_op, file->private,
file->f_path from revoke operations.
The file_hotplug_lock is used typically as:
error = -EIO;
if (!file_hotplug_read_trylock(file))
goto out;
....
file_hotplug_read_unlock(file);
In 5 subsystems sysfs, proc, and sysctl, tty, and sound we have support for
modifing a file descriptor so that the underlying object can go away.
In looking at the problem of pci hotunplug it appears that we
potentially need that support for all file descriptors except ones
talking to files on filesystems. Even for file descriptors referring
to files, support for file the underlying object going away is
interesting for implementing features like umount -f and sys_revoke.
The implementations in sysfs, proc and sysctl are all very similar and
are composed of several components.
- A reference count to track that the file operations are being used.
- An ability to flag the file as no longer being valid.
- An ability to wait until the file operations are no longer being used.
In addition for a complete solution we need:
- A reliable way the file structures that we need to revoke.
- To wait for but not tamper with ongoing file creation and cleanup.
- A guarantee that all with user space controlled duration are removed.
The file_hotplug_lock has a very unique implementation necessitated by
the need to have no performance impact on existing code. Classic locking
primitives and reference counting cause pipeline stalls, except for rcu
which provides no ability to preventing reading a data structure while
it is being updated.
file_hotplug_lock keeps the overhead extremely low by dedicating a
small amount of space in the task_struct to store the set of files
the task is currently in the process of using.
The revoke algorithm is simple:
- Find a file on the file_list.
If it is dying or being created come back later
* Take a reference to the file, ensuring it does not get freed while the
revoke code accesses it.
* Block out new usages of fields guarded by file_hotplug_lock.
* Kick the underlying implemenation to wake up functions that are potentially
blocked indefinitely.
* Wait until there are no tasks holding file_hotplug_read_lock
* Release the file specific data.
* Drop the file ref count.
- Repeat until the file list is empty.
The implication of this implementation is that all revoked files will
behave exactly the same way, except for policy controlled by flags in
fmode. The expected behaivor of revoked is close succeeds all other
operations return -EIO. Except for the read on ttys this matches the
historical bsd behavior.
Approriate exports are present so modular character devices can
use the file_list
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
Documentation/filesystems/vfs.txt | 5 +
fs/Kconfig | 4 +
fs/file_table.c | 166 ++++++++++++++++++++++++++++++++++--
fs/open.c | 6 ++
include/linux/fs.h | 25 ++++++-
include/linux/sched.h | 7 ++
6 files changed, 202 insertions(+), 11 deletions(-)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index f49eecf..d220fd5 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -806,6 +806,11 @@ otherwise noted.
splice_read: called by the VFS to splice data from file to a pipe. This
method is used by the splice(2) system call
+ dead: Called by the VFS to notify a file that it has been killed.
+ Typically this is used to wake up poll, read or other blocking
+ file methods, that could be indefinitely waiting for something
+ to happen.
+
Note that the file operations are implemented by the specific
filesystem in which the inode resides. When opening a device node
(character or block special) most filesystems will call special
diff --git a/fs/Kconfig b/fs/Kconfig
index 9f7270f..2fb86b0 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -265,4 +265,8 @@ endif
source "fs/nls/Kconfig"
source "fs/dlm/Kconfig"
+config FILE_HOTPLUG
+ bool
+ default n
+
endmenu
diff --git a/fs/file_table.c b/fs/file_table.c
index 978f267..9db3031 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -23,6 +23,7 @@
#include <linux/sysctl.h>
#include <linux/percpu_counter.h>
#include <linux/writeback.h>
+#include <linux/mm.h>
#include <asm/atomic.h>
@@ -201,7 +202,7 @@ int init_file(struct file *file, struct vfsmount *mnt, struct dentry *dentry,
file->f_path.dentry = dentry;
file->f_path.mnt = mntget(mnt);
file->f_mapping = dentry->d_inode->i_mapping;
- file->f_mode = mode;
+ file->f_mode = mode | FMODE_OPENED;
file->f_op = fop;
/*
@@ -252,17 +253,12 @@ void drop_file_write_access(struct file *file)
}
EXPORT_SYMBOL_GPL(drop_file_write_access);
-/* __fput is called from task context when aio completion releases the last
- * last use of a struct file *. Do not use otherwise.
- */
-void __fput(struct file *file)
+static void frelease(struct file *file)
{
struct dentry *dentry = file->f_path.dentry;
struct vfsmount *mnt = file->f_path.mnt;
struct inode *inode = dentry->d_inode;
- might_sleep();
-
fsnotify_close(file);
/*
* The function eventpoll_release() should be the first called
@@ -277,23 +273,38 @@ void __fput(struct file *file)
}
if (file->f_op && file->f_op->release)
file->f_op->release(inode, file);
- security_file_free(file);
ima_file_free(file);
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- put_pid(file->f_owner.pid);
if (!special_file(inode->i_mode))
file_list_del(file, &inode->i_files);
if (file->f_mode & FMODE_WRITE)
drop_file_write_access(file);
file->f_path.dentry = NULL;
file->f_path.mnt = NULL;
- file_free(file);
+ file->f_mapping = NULL;
+ file->f_op = NULL;
+ file->private_data = NULL;
dput(dentry);
mntput(mnt);
}
+/* __fput is called from task context when aio completion releases the last
+ * last use of a struct file *. Do not use otherwise.
+ */
+void __fput(struct file *file)
+{
+ might_sleep();
+
+ if (likely(!(file->f_mode & FMODE_DEAD)))
+ frelease(file);
+
+ security_file_free(file);
+ put_pid(file->f_owner.pid);
+ file_free(file);
+}
+
struct file *fget(unsigned int fd)
{
struct file *file;
@@ -360,6 +371,7 @@ void init_file_list(struct file_list *files)
INIT_LIST_HEAD(&files->list);
spin_lock_init(&files->lock);
}
+EXPORT_SYMBOL(init_file_list);
void file_list_add(struct file *file, struct file_list *files)
{
@@ -377,6 +389,140 @@ void file_list_del(struct file *file, struct file_list *files)
}
EXPORT_SYMBOL(file_list_del);
+#ifdef CONFIG_FILE_HOTPLUG
+
+static bool file_in_use(struct file *file)
+{
+ struct task_struct *leader, *task;
+ bool in_use = false;
+ int i;
+
+ rcu_read_lock();
+ do_each_thread(leader, task) {
+ for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
+ if (task->file_hotplug_lock[i] == file) {
+ in_use = true;
+ goto found;
+ }
+ }
+ } while_each_thread(leader, task);
+found:
+ rcu_read_unlock();
+ return in_use;
+}
+
+static int revoke_file(struct file *file)
+{
+ /* Must be called with f_count held and FMODE_OPENED set */
+ fmode_t mode;
+
+ if (!(file->f_mode & FMODE_REVOKE))
+ return -EINVAL;
+
+ /*
+ * Tell everyone this file is dead.
+ */
+ spin_lock(&file->f_ep_lock);
+ mode = file->f_mode;
+ file->f_mode |= FMODE_DEAD;
+ spin_unlock(&file->f_ep_lock);
+ if (mode & FMODE_DEAD)
+ return -EIO;
+
+ /*
+ * Notify the file we have killed it.
+ */
+ if (file->f_op->dead)
+ file->f_op->dead(file);
+
+ /*
+ * Wait until there are no more callers in the file operations.
+ */
+ if (file_in_use(file)) {
+ do {
+ schedule_timeout_uninterruptible(1);
+ } while (file_in_use(file));
+ }
+
+ revoke_file_mappings(file);
+ frelease(file);
+
+ return 0;
+}
+
+int revoke_file_list(struct file_list *files)
+{
+ struct file *file;
+ int error = 0;
+ int empty;
+
+restart:
+ file_list_lock(files);
+ list_for_each_entry(file, &files->list, f_u.fu_list) {
+
+ /* Don't touch files that have not yet been fully opened */
+ if (!(file->f_mode & FMODE_OPENED))
+ continue;
+
+ /* Ensure I am looking at the file after it was opened */
+ smp_rmb();
+
+ /* Don't touch files that are in the final stages of being closed. */
+ if (file_count(file) == 0)
+ continue;
+
+ /* Get a reference to the file */
+ if (!atomic_long_inc_not_zero(&file->f_count))
+ continue;
+
+ file_list_unlock(files);
+
+ error = revoke_file(file);
+ fput(file);
+
+ if (unlikely(error))
+ goto out;
+ goto restart;
+ }
+ empty = list_empty(&files->list);
+ file_list_unlock(files);
+ /*
+ * If the file list had files we can't touch sleep a little while
+ * and check again.
+ */
+ if (!empty) {
+ schedule_timeout_uninterruptible(1);
+ goto restart;
+ }
+out:
+ return error;
+}
+EXPORT_SYMBOL(revoke_file_list);
+
+int __lockfunc file_hotplug_read_trylock(struct file *file)
+{
+ fmode_t mode = file->f_mode;
+ int locked = 0;
+ if (!(mode & FMODE_DEAD)) {
+ struct task_struct *tsk = current;
+ int pos = tsk->file_hotplug_lock_depth;
+ if (likely(pos < MAX_FILE_HOTPLUG_LOCK_DEPTH)) {
+ tsk->file_hotplug_lock_depth = pos + 1;
+ tsk->file_hotplug_lock[pos] = file;
+ locked = 1;
+ }
+ }
+ return locked;
+}
+
+void __lockfunc file_hotplug_read_unlock(struct file *file)
+{
+ struct task_struct *tsk = current;
+ tsk->file_hotplug_lock[--(tsk->file_hotplug_lock_depth)] = NULL;
+}
+
+#endif /* CONFIG_FILE_HOTPLUG */
+
int fs_may_remount_ro(struct super_block *sb)
{
struct inode *inode;
diff --git a/fs/open.c b/fs/open.c
index 20c3fc0..d0b2433 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -809,6 +809,7 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
const struct cred *cred)
{
struct inode *inode;
+ fmode_t opened_fmode;
int error;
f->f_flags = flags;
@@ -857,6 +858,11 @@ static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
}
}
+ opened_fmode = f->f_mode | FMODE_OPENED;
+ /* Ensure revoke_file_list sees the opened file */
+ smp_wmb();
+ f->f_mode = opened_fmode;
+
return f;
cleanup_all:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5329fd6..f7f4c46 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -87,6 +87,13 @@ struct inodes_stat_t {
*/
#define FMODE_NOCMTIME ((__force fmode_t)2048)
+/* File has successfully been opened */
+#define FMODE_OPENED ((__force fmode_t)4096)
+/* File supports being revoked */
+#define FMODE_REVOKE ((__force fmode_t)8192)
+/* File is dead (has been revoked) */
+#define FMODE_DEAD ((__force fmode_t)16384)
+
/*
* The below are the various read and write types that we support. Some of
* them include behavioral modifiers that send information down to the
@@ -903,6 +910,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
#define FILE_MNT_WRITE_RELEASED 2
struct file {
+ /* file_hotplug_lock f_op, private, f_path, f_mapping */
/*
* fu_list becomes invalid after file_free is called and queued via
* fu_rcuhead for RCU freeing
@@ -935,12 +943,26 @@ struct file {
/* Used by fs/eventpoll.c to link all the hooks to this file */
struct list_head f_ep_links;
#endif /* #ifdef CONFIG_EPOLL */
- struct address_space *f_mapping;
+ struct address_space *f_mapping; /* file_hotplug_lock or mmap_sem */
#ifdef CONFIG_DEBUG_WRITECOUNT
unsigned long f_mnt_write_state;
#endif
};
+#ifdef CONFIG_FILE_HOTPLUG
+extern int file_hotplug_read_trylock(struct file *file);
+extern void file_hotplug_read_unlock(struct file *file);
+extern int revoke_file_list(struct file_list *files);
+#else
+static inline int file_hotplug_read_trylock(struct file *file)
+{
+ return 1;
+}
+static inline void file_hotplug_read_unlock(struct file *file)
+{
+}
+#endif
+
static inline void file_list_lock(struct file_list *files)
{
spin_lock(&files->lock);
@@ -1514,6 +1536,7 @@ struct file_operations {
ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
int (*setlease)(struct file *, long, struct file_lock **);
+ void (*dead)(struct file *);
};
struct inode_operations {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b4c38bc..bbf1616 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1302,6 +1302,13 @@ struct task_struct {
struct irqaction *irqaction;
#endif
+/* File hotplug lock */
+#ifdef CONFIG_FILE_HOTPLUG
+#define MAX_FILE_HOTPLUG_LOCK_DEPTH 4U
+ int file_hotplug_lock_depth;
+ struct file *file_hotplug_lock[MAX_FILE_HOTPLUG_LOCK_DEPTH];
+#endif
+
/* Protection of the PI data structures: */
spinlock_t pi_lock;
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-01 21:50 ` [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Eric W. Biederman
@ 2009-06-02 5:16 ` Pekka Enberg
2009-06-02 6:51 ` Eric W. Biederman
2009-06-02 7:14 ` Nick Piggin
2009-06-05 9:03 ` Miklos Szeredi
2 siblings, 1 reply; 99+ messages in thread
From: Pekka Enberg @ 2009-06-02 5:16 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman
Hi Eric,
On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> +#ifdef CONFIG_FILE_HOTPLUG
> +
> +static bool file_in_use(struct file *file)
> +{
> + struct task_struct *leader, *task;
> + bool in_use = false;
> + int i;
> +
> + rcu_read_lock();
> + do_each_thread(leader, task) {
> + for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
> + if (task->file_hotplug_lock[i] == file) {
> + in_use = true;
> + goto found;
> + }
> + }
> + } while_each_thread(leader, task);
> +found:
> + rcu_read_unlock();
> + return in_use;
> +}
This seems rather heavy-weight. If we're going to use this
infrastructure for forced unmount, I think this will be a problem.
Can't we two this in two stages: (1) mark a bit that forces
file_hotplug_read_trylock to always fail and (2) block until the last
remaining in-kernel file_hotplug_read_unlock() has executed?
Pekka
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 5:16 ` Pekka Enberg
@ 2009-06-02 6:51 ` Eric W. Biederman
2009-06-02 7:08 ` Pekka Enberg
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 6:51 UTC (permalink / raw)
To: Pekka Enberg
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman
Pekka Enberg <penberg@cs.helsinki.fi> writes:
> Hi Eric,
>
> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> +#ifdef CONFIG_FILE_HOTPLUG
>> +
>> +static bool file_in_use(struct file *file)
>> +{
>> + struct task_struct *leader, *task;
>> + bool in_use = false;
>> + int i;
>> +
>> + rcu_read_lock();
>> + do_each_thread(leader, task) {
>> + for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>> + if (task->file_hotplug_lock[i] == file) {
>> + in_use = true;
>> + goto found;
>> + }
>> + }
>> + } while_each_thread(leader, task);
>> +found:
>> + rcu_read_unlock();
>> + return in_use;
>> +}
>
> This seems rather heavy-weight. If we're going to use this
> infrastructure for forced unmount, I think this will be a problem.
> Can't we two this in two stages: (1) mark a bit that forces
> file_hotplug_read_trylock to always fail and (2) block until the last
> remaining in-kernel file_hotplug_read_unlock() has executed?
Yes there is room for more optimization in the slow path.
I haven't noticed being a problem yet so I figured I would start
with stupid and simple.
I can easily see two passes. The first setting the flag an calling
f_op->dead. The second some kind of consolidate walk through the task
list, allowing checking on multiple files at once.
I'm not ready to consider anything that will add cost to the fast
path in the file descriptors though.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 6:51 ` Eric W. Biederman
@ 2009-06-02 7:08 ` Pekka Enberg
0 siblings, 0 replies; 99+ messages in thread
From: Pekka Enberg @ 2009-06-02 7:08 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman
Hi Eric,
On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> Pekka Enberg <penberg@cs.helsinki.fi> writes:
>
>> Hi Eric,
>>
>> On Tue, Jun 2, 2009 at 12:50 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> +#ifdef CONFIG_FILE_HOTPLUG
>>> +
>>> +static bool file_in_use(struct file *file)
>>> +{
>>> + struct task_struct *leader, *task;
>>> + bool in_use = false;
>>> + int i;
>>> +
>>> + rcu_read_lock();
>>> + do_each_thread(leader, task) {
>>> + for (i = 0; i < MAX_FILE_HOTPLUG_LOCK_DEPTH; i++) {
>>> + if (task->file_hotplug_lock[i] == file) {
>>> + in_use = true;
>>> + goto found;
>>> + }
>>> + }
>>> + } while_each_thread(leader, task);
>>> +found:
>>> + rcu_read_unlock();
>>> + return in_use;
>>> +}
>>
>> This seems rather heavy-weight. If we're going to use this
>> infrastructure for forced unmount, I think this will be a problem.
>
>> Can't we two this in two stages: (1) mark a bit that forces
>> file_hotplug_read_trylock to always fail and (2) block until the last
>> remaining in-kernel file_hotplug_read_unlock() has executed?
>
> Yes there is room for more optimization in the slow path.
> I haven't noticed being a problem yet so I figured I would start
> with stupid and simple.
Yup, just wanted to point it out.
On Tue, Jun 2, 2009 at 9:51 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
> I can easily see two passes. The first setting the flag an calling
> f_op->dead. The second some kind of consolidate walk through the task
> list, allowing checking on multiple files at once.
>
> I'm not ready to consider anything that will add cost to the fast
> path in the file descriptors though.
Makes sense.
Pekka
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-01 21:50 ` [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Eric W. Biederman
2009-06-02 5:16 ` Pekka Enberg
@ 2009-06-02 7:14 ` Nick Piggin
2009-06-02 17:06 ` Linus Torvalds
2009-06-02 22:56 ` Eric W. Biederman
2009-06-05 9:03 ` Miklos Szeredi
2 siblings, 2 replies; 99+ messages in thread
From: Nick Piggin @ 2009-06-02 7:14 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Mon, Jun 01, 2009 at 02:50:29PM -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@xmission.com>
>
> Introduce the file_hotplug_lock to protect file->f_op, file->private,
> file->f_path from revoke operations.
>
> The file_hotplug_lock is used typically as:
> error = -EIO;
> if (!file_hotplug_read_trylock(file))
> goto out;
> ....
> file_hotplug_read_unlock(file);
Why is it called hotplug? Does it have anything to do with hardware?
Because every concurrently changed software data structure in the
kernel can be "hot"-modified, right?
Wouldn't file_revoke_lock be more appropriate?
> In addition for a complete solution we need:
> - A reliable way the file structures that we need to revoke.
> - To wait for but not tamper with ongoing file creation and cleanup.
> - A guarantee that all with user space controlled duration are removed.
>
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code. Classic locking
Well, it isn't no performance impact. Function calls, branches, icache
and dcache...
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 7:14 ` Nick Piggin
@ 2009-06-02 17:06 ` Linus Torvalds
2009-06-02 20:52 ` Eric W. Biederman
2009-06-02 22:56 ` Eric W. Biederman
1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2009-06-02 17:06 UTC (permalink / raw)
To: Nick Piggin
Cc: Eric W. Biederman, Al Viro, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Tue, 2 Jun 2009, Nick Piggin wrote:
>
> Why is it called hotplug? Does it have anything to do with hardware?
> Because every concurrently changed software data structure in the
> kernel can be "hot"-modified, right?
>
> Wouldn't file_revoke_lock be more appropriate?
I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
"plug" and "unplug".
Linus
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 17:06 ` Linus Torvalds
@ 2009-06-02 20:52 ` Eric W. Biederman
2009-06-03 6:37 ` Nick Piggin
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 20:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nick Piggin, Al Viro, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 2 Jun 2009, Nick Piggin wrote:
>>
>> Why is it called hotplug? Does it have anything to do with hardware?
>> Because every concurrently changed software data structure in the
>> kernel can be "hot"-modified, right?
>>
>> Wouldn't file_revoke_lock be more appropriate?
>
> I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
> "plug" and "unplug".
I guess this shows my bias in triggering this code path from pci
hotunplug. Instead of with some system call.
I'm not married to the name. I wanted file_lock but that is already
used, and I did call the method revoke.
The only place where hotplug gives a useful hint is that it makes it
clear we really are disconnecting the file descriptor from what lies
below it. We can't do some weird thing like keep the underlying object.
Because the underlying object is gone.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 20:52 ` Eric W. Biederman
@ 2009-06-03 6:37 ` Nick Piggin
0 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2009-06-03 6:37 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Al Viro, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Tue, Jun 02, 2009 at 01:52:46PM -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > On Tue, 2 Jun 2009, Nick Piggin wrote:
> >>
> >> Why is it called hotplug? Does it have anything to do with hardware?
> >> Because every concurrently changed software data structure in the
> >> kernel can be "hot"-modified, right?
> >>
> >> Wouldn't file_revoke_lock be more appropriate?
> >
> > I agree, "hotplug" just sounds crazy. It's "open" and "revoke", not
> > "plug" and "unplug".
>
> I guess this shows my bias in triggering this code path from pci
> hotunplug. Instead of with some system call.
>
> I'm not married to the name. I wanted file_lock but that is already
> used, and I did call the method revoke.
Definitely it is not going to be called hotplug in the generic
vfs layer :)
> The only place where hotplug gives a useful hint is that it makes it
> clear we really are disconnecting the file descriptor from what lies
> below it.
Isn't that hotUNplug?
But anyway hot plug/unplug is a purely hardware concept. Revoke
for "unplug", please, including naming of patches, changelogs,
and locks etc.
> We can't do some weird thing like keep the underlying object.
> Because the underlying object is gone.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 7:14 ` Nick Piggin
2009-06-02 17:06 ` Linus Torvalds
@ 2009-06-02 22:56 ` Eric W. Biederman
2009-06-03 6:38 ` Nick Piggin
1 sibling, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 22:56 UTC (permalink / raw)
To: Nick Piggin
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
Nick Piggin <npiggin@suse.de> writes:
>> In addition for a complete solution we need:
>> - A reliable way the file structures that we need to revoke.
>> - To wait for but not tamper with ongoing file creation and cleanup.
>> - A guarantee that all with user space controlled duration are removed.
>>
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code. Classic locking
>
> Well, it isn't no performance impact. Function calls, branches, icache
> and dcache...
Practically none.
Everything I could measure was in the noise. It is cheaper than any serializing
locking primitive. I ran both lmbench and did some microbenchmark testing.
So I know on the fast path the overhead is minimal. Certainly less than what
we are doing in sysfs and proc today.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-02 22:56 ` Eric W. Biederman
@ 2009-06-03 6:38 ` Nick Piggin
0 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2009-06-03 6:38 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Andrew Morton, Christoph Hellwig,
Eric W. Biederman
On Tue, Jun 02, 2009 at 03:56:02PM -0700, Eric W. Biederman wrote:
> Nick Piggin <npiggin@suse.de> writes:
>
> >> In addition for a complete solution we need:
> >> - A reliable way the file structures that we need to revoke.
> >> - To wait for but not tamper with ongoing file creation and cleanup.
> >> - A guarantee that all with user space controlled duration are removed.
> >>
> >> The file_hotplug_lock has a very unique implementation necessitated by
> >> the need to have no performance impact on existing code. Classic locking
> >
> > Well, it isn't no performance impact. Function calls, branches, icache
> > and dcache...
>
> Practically none.
OK that's different from none. There is obviously overhead.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-01 21:50 ` [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Eric W. Biederman
2009-06-02 5:16 ` Pekka Enberg
2009-06-02 7:14 ` Nick Piggin
@ 2009-06-05 9:03 ` Miklos Szeredi
2009-06-05 19:06 ` Eric W. Biederman
2 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2009-06-05 9:03 UTC (permalink / raw)
To: ebiederm
Cc: viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh, tj,
adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch, ebiederm,
ebiederm
Hi Eric,
Very interesting work.
On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> The file_hotplug_lock has a very unique implementation necessitated by
> the need to have no performance impact on existing code. Classic locking
> primitives and reference counting cause pipeline stalls, except for rcu
> which provides no ability to preventing reading a data structure while
> it is being updated.
Well, the simple solution to that is to add another level of indirection:
old:
fdtable -> file
new:
fdtable -> persistent_file -> file
Then it is possible to replace persistent_file->file with a revoked
one under RCU. This has the added advantage that it supports
arbitrary file replacements, not just ones which return EIO.
Another advantage is that dereferencing can normally be done "under
the hood" in fget()/fget_light(). Only code which wants to
permanently store a file pointer (like the SCM_RIGHTS thing) would
need to be aware of the extra complexity.
Would that work, do you think?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 04/23] vfs: Introduce infrastructure for revoking a file
2009-06-05 9:03 ` Miklos Szeredi
@ 2009-06-05 19:06 ` Eric W. Biederman
0 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-05 19:06 UTC (permalink / raw)
To: Miklos Szeredi
Cc: viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh, tj,
adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch, ebiederm
Miklos Szeredi <miklos@szeredi.hu> writes:
> Hi Eric,
>
> Very interesting work.
>
> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> The file_hotplug_lock has a very unique implementation necessitated by
>> the need to have no performance impact on existing code. Classic locking
>> primitives and reference counting cause pipeline stalls, except for rcu
>> which provides no ability to preventing reading a data structure while
>> it is being updated.
>
> Well, the simple solution to that is to add another level of indirection:
>
> old:
>
> fdtable -> file
>
> new:
>
> fdtable -> persistent_file -> file
>
> Then it is possible to replace persistent_file->file with a revoked
> one under RCU. This has the added advantage that it supports
> arbitrary file replacements, not just ones which return EIO.
>
> Another advantage is that dereferencing can normally be done "under
> the hood" in fget()/fget_light(). Only code which wants to
> permanently store a file pointer (like the SCM_RIGHTS thing) would
> need to be aware of the extra complexity.
>
> Would that work, do you think?
Well I went down this path for a little while, and it has some good points.
Unfortunately it appears to be more costly.
fget() and friends are semantically very different my
file_hotplug_read_trylock and unlock. In fact there is very little
overlap. Which means that transparent to the vfs users doesn't
actually work.
We actually have more and less predictable places where we store files.
If there was actually a compelling case for being more general I would
certainly agree that splitting the file structure in two would be a
good deal. As it is that level of flexibility seems to be overkill.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 05/23] vfs: Teach lseek to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (3 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 04/23] vfs: Introduce infrastructure for revoking a file Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 06/23] vfs: Teach read/write to use file_hotplug_read_lock Eric W. Biederman
` (18 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/read_write.c | 24 +++++++++++++++++-------
1 files changed, 17 insertions(+), 7 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 9d1e76b..c9511ce 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -136,14 +136,24 @@ EXPORT_SYMBOL(default_llseek);
loff_t vfs_llseek(struct file *file, loff_t offset, int origin)
{
loff_t (*fn)(struct file *, loff_t, int);
+ loff_t retval = -ESPIPE;
- fn = no_llseek;
- if (file->f_mode & FMODE_LSEEK) {
- fn = default_llseek;
- if (file->f_op && file->f_op->llseek)
- fn = file->f_op->llseek;
- }
- return fn(file, offset, origin);
+ if (!(file->f_mode & FMODE_LSEEK))
+ goto out;
+
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ fn = default_llseek;
+ if (file->f_op && file->f_op->llseek)
+ fn = file->f_op->llseek;
+
+ retval = fn(file, offset, origin);
+
+ file_hotplug_read_unlock(file);
+out:
+ return retval;
}
EXPORT_SYMBOL(vfs_llseek);
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 06/23] vfs: Teach read/write to use file_hotplug_read_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (4 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 05/23] vfs: Teach lseek to use file_hotplug_lock Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock Eric W. Biederman
` (17 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/compat.c | 16 +++++++++++-
fs/read_write.c | 70 +++++++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 72 insertions(+), 14 deletions(-)
diff --git a/fs/compat.c b/fs/compat.c
index 25be41c..dad9957 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -1196,12 +1196,18 @@ static size_t compat_readv(struct file *file,
if (!(file->f_mode & FMODE_READ))
goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
- goto out;
+ goto out_unlock;
ret = compat_do_readv_writev(READ, file, vec, vlen, pos);
+out_unlock:
+ file_hotplug_read_unlock(file);
out:
if (ret > 0)
add_rchar(current, ret);
@@ -1253,12 +1259,18 @@ static size_t compat_writev(struct file *file,
if (!(file->f_mode & FMODE_WRITE))
goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
- goto out;
+ goto out_unlock;
ret = compat_do_readv_writev(WRITE, file, vec, vlen, pos);
+out_unlock:
+ file_hotplug_read_unlock(file);
out:
if (ret > 0)
add_wchar(current, ret);
diff --git a/fs/read_write.c b/fs/read_write.c
index c9511ce..718baea 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -288,12 +288,18 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
{
ssize_t ret;
+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
+ goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->read && !file->f_op->aio_read))
- return -EINVAL;
+ goto out_unlock;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_WRITE, buf, count)))
- return -EFAULT;
+ goto out_unlock;
ret = rw_verify_area(READ, file, pos, count);
if (ret >= 0) {
@@ -309,6 +315,9 @@ ssize_t vfs_read(struct file *file, char __user *buf, size_t count, loff_t *pos)
inc_syscr(current);
}
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}
@@ -343,12 +352,18 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
{
ssize_t ret;
+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
+ goto out;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->write && !file->f_op->aio_write))
- return -EINVAL;
+ goto out_unlock;
+ ret = -EFAULT;
if (unlikely(!access_ok(VERIFY_READ, buf, count)))
- return -EFAULT;
+ goto out_unlock;
ret = rw_verify_area(WRITE, file, pos, count);
if (ret >= 0) {
@@ -364,6 +379,9 @@ ssize_t vfs_write(struct file *file, const char __user *buf, size_t count, loff_
inc_syscw(current);
}
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}
@@ -676,12 +694,26 @@ out:
ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ ssize_t ret;
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_READ))
- return -EBADF;
+ goto out;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_read && !file->f_op->read))
- return -EINVAL;
+ goto out_unlock;
+
+ ret = do_readv_writev(READ, file, vec, vlen, pos);
- return do_readv_writev(READ, file, vec, vlen, pos);
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}
EXPORT_SYMBOL(vfs_readv);
@@ -689,12 +721,26 @@ EXPORT_SYMBOL(vfs_readv);
ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
unsigned long vlen, loff_t *pos)
{
+ ssize_t ret;
+
+ ret = -EBADF;
if (!(file->f_mode & FMODE_WRITE))
- return -EBADF;
+ goto out;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ ret = -EINVAL;
if (!file->f_op || (!file->f_op->aio_write && !file->f_op->write))
- return -EINVAL;
+ goto out_unlock;
- return do_readv_writev(WRITE, file, vec, vlen, pos);
+ ret = do_readv_writev(WRITE, file, vec, vlen, pos);
+
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}
EXPORT_SYMBOL(vfs_writev);
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (5 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 06/23] vfs: Teach read/write to use file_hotplug_read_lock Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-03 23:39 ` Badari Pulavarty
2009-06-01 21:50 ` [PATCH 08/23] vfs: Teach readdir " Eric W. Biederman
` (16 subsequent siblings)
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/read_write.c | 28 +++++++++----
fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
2 files changed, 95 insertions(+), 44 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index 718baea..c473d74 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
goto out;
if (!(in_file->f_mode & FMODE_READ))
goto fput_in;
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(in_file))
+ goto fput_in;
retval = -EINVAL;
in_inode = in_file->f_path.dentry->d_inode;
if (!in_inode)
- goto fput_in;
+ goto unlock_in;
if (!in_file->f_op || !in_file->f_op->splice_read)
- goto fput_in;
+ goto unlock_in;
retval = -ESPIPE;
if (!ppos)
ppos = &in_file->f_pos;
else
if (!(in_file->f_mode & FMODE_PREAD))
- goto fput_in;
+ goto unlock_in;
retval = rw_verify_area(READ, in_file, ppos, count);
if (retval < 0)
- goto fput_in;
+ goto unlock_in;
count = retval;
/*
@@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
retval = -EBADF;
out_file = fget_light(out_fd, &fput_needed_out);
if (!out_file)
- goto fput_in;
+ goto unlock_in;
if (!(out_file->f_mode & FMODE_WRITE))
goto fput_out;
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(out_file))
+ goto fput_out;
retval = -EINVAL;
if (!out_file->f_op || !out_file->f_op->sendpage)
- goto fput_out;
+ goto unlock_out;
out_inode = out_file->f_path.dentry->d_inode;
retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
if (retval < 0)
- goto fput_out;
+ goto unlock_out;
count = retval;
if (!max)
@@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
pos = *ppos;
retval = -EINVAL;
if (unlikely(pos < 0))
- goto fput_out;
+ goto unlock_out;
if (unlikely(pos + count > max)) {
retval = -EOVERFLOW;
if (pos >= max)
- goto fput_out;
+ goto unlock_out;
count = max - pos;
}
@@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
if (*ppos > max)
retval = -EOVERFLOW;
+unlock_out:
+ file_hotplug_read_unlock(out_file);
fput_out:
fput_light(out_file, fput_needed_out);
+unlock_in:
+ file_hotplug_read_unlock(in_file);
fput_in:
fput_light(in_file, fput_needed_in);
out:
diff --git a/fs/splice.c b/fs/splice.c
index 666953d..fc6b3a5 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
error = -EBADF;
file = fget_light(fd, &fput);
- if (file) {
- if (file->f_mode & FMODE_WRITE)
- error = vmsplice_to_pipe(file, iov, nr_segs, flags);
- else if (file->f_mode & FMODE_READ)
- error = vmsplice_to_user(file, iov, nr_segs, flags);
+ if (!file)
+ goto out;
- fput_light(file, fput);
- }
+ if (!file_hotplug_read_trylock(file))
+ goto fput_file;
+ if (file->f_mode & FMODE_WRITE)
+ error = vmsplice_to_pipe(file, iov, nr_segs, flags);
+ else if (file->f_mode & FMODE_READ)
+ error = vmsplice_to_user(file, iov, nr_segs, flags);
+
+ file_hotplug_read_unlock(file);
+fput_file:
+ fput_light(file, fput);
+out:
return error;
}
@@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
error = -EBADF;
in = fget_light(fd_in, &fput_in);
- if (in) {
- if (in->f_mode & FMODE_READ) {
- out = fget_light(fd_out, &fput_out);
- if (out) {
- if (out->f_mode & FMODE_WRITE)
- error = do_splice(in, off_in,
- out, off_out,
- len, flags);
- fput_light(out, fput_out);
- }
- }
+ if (!in)
+ goto out;
- fput_light(in, fput_in);
- }
+ if (!(in->f_mode & FMODE_READ))
+ goto fput_in;
+
+ error = -EIO;
+ if (!file_hotplug_read_trylock(in))
+ goto fput_in;
+
+ error = -EBADF;
+ out = fget_light(fd_out, &fput_out);
+ if (!out)
+ goto unlock_in;
+
+ if (!(out->f_mode & FMODE_WRITE))
+ goto fput_out;
+
+ error = -EIO;
+ if (!file_hotplug_read_trylock(out))
+ goto fput_out;
+
+ error = do_splice(in, off_in, out, off_out, len, flags);
+ file_hotplug_read_unlock(out);
+fput_out:
+ fput_light(out, fput_out);
+unlock_in:
+ file_hotplug_read_unlock(in);
+fput_in:
+ fput_light(in, fput_in);
+
+out:
return error;
}
@@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,
SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
{
- struct file *in;
- int error, fput_in;
+ struct file *in, *out;
+ int error, fput_in, fput_out;
if (unlikely(!len))
return 0;
error = -EBADF;
in = fget_light(fdin, &fput_in);
- if (in) {
- if (in->f_mode & FMODE_READ) {
- int fput_out;
- struct file *out = fget_light(fdout, &fput_out);
-
- if (out) {
- if (out->f_mode & FMODE_WRITE)
- error = do_tee(in, out, len, flags);
- fput_light(out, fput_out);
- }
- }
- fput_light(in, fput_in);
- }
+ if (!in)
+ goto out;
+
+ if (!(in->f_mode & FMODE_READ))
+ goto unlock_in;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(in))
+ goto fput_in;
+
+ error = -EBADF;
+ out = fget_light(fdout, &fput_out);
+ if (!out)
+ goto unlock_in;
+
+ if (!(out->f_mode & FMODE_WRITE))
+ goto fput_out;
+
+ if (!file_hotplug_read_trylock(out))
+ goto fput_out;
+
+ error = do_tee(in, out, len, flags);
+
+ file_hotplug_read_unlock(out);
+fput_out:
+ fput_light(out, fput_out);
+unlock_in:
+ file_hotplug_read_unlock(in);
+fput_in:
+ fput_light(in, fput_in);
+out:
return error;
}
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock
2009-06-01 21:50 ` [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock Eric W. Biederman
@ 2009-06-03 23:39 ` Badari Pulavarty
2009-06-05 19:37 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Badari Pulavarty @ 2009-06-03 23:39 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
> fs/read_write.c | 28 +++++++++----
> fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
> 2 files changed, 95 insertions(+), 44 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 718baea..c473d74 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> goto out;
> if (!(in_file->f_mode & FMODE_READ))
> goto fput_in;
> + retval = -EIO;
> + if (!file_hotplug_read_trylock(in_file))
> + goto fput_in;
> retval = -EINVAL;
> in_inode = in_file->f_path.dentry->d_inode;
> if (!in_inode)
> - goto fput_in;
> + goto unlock_in;
> if (!in_file->f_op || !in_file->f_op->splice_read)
> - goto fput_in;
> + goto unlock_in;
> retval = -ESPIPE;
> if (!ppos)
> ppos = &in_file->f_pos;
> else
> if (!(in_file->f_mode & FMODE_PREAD))
> - goto fput_in;
> + goto unlock_in;
> retval = rw_verify_area(READ, in_file, ppos, count);
> if (retval < 0)
> - goto fput_in;
> + goto unlock_in;
> count = retval;
>
> /*
> @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> retval = -EBADF;
> out_file = fget_light(out_fd, &fput_needed_out);
> if (!out_file)
> - goto fput_in;
> + goto unlock_in;
> if (!(out_file->f_mode & FMODE_WRITE))
> goto fput_out;
> + retval = -EIO;
> + if (!file_hotplug_read_trylock(out_file))
> + goto fput_out;
> retval = -EINVAL;
> if (!out_file->f_op || !out_file->f_op->sendpage)
> - goto fput_out;
> + goto unlock_out;
> out_inode = out_file->f_path.dentry->d_inode;
> retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
> if (retval < 0)
> - goto fput_out;
> + goto unlock_out;
> count = retval;
>
> if (!max)
> @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> pos = *ppos;
> retval = -EINVAL;
> if (unlikely(pos < 0))
> - goto fput_out;
> + goto unlock_out;
> if (unlikely(pos + count > max)) {
> retval = -EOVERFLOW;
> if (pos >= max)
> - goto fput_out;
> + goto unlock_out;
> count = max - pos;
> }
>
> @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
> if (*ppos > max)
> retval = -EOVERFLOW;
>
> +unlock_out:
> + file_hotplug_read_unlock(out_file);
> fput_out:
> fput_light(out_file, fput_needed_out);
> +unlock_in:
> + file_hotplug_read_unlock(in_file);
> fput_in:
> fput_light(in_file, fput_needed_in);
> out:
> diff --git a/fs/splice.c b/fs/splice.c
> index 666953d..fc6b3a5 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
>
> error = -EBADF;
> file = fget_light(fd, &fput);
> - if (file) {
> - if (file->f_mode & FMODE_WRITE)
> - error = vmsplice_to_pipe(file, iov, nr_segs, flags);
> - else if (file->f_mode & FMODE_READ)
> - error = vmsplice_to_user(file, iov, nr_segs, flags);
> + if (!file)
> + goto out;
>
> - fput_light(file, fput);
> - }
> + if (!file_hotplug_read_trylock(file))
> + goto fput_file;
>
> + if (file->f_mode & FMODE_WRITE)
> + error = vmsplice_to_pipe(file, iov, nr_segs, flags);
> + else if (file->f_mode & FMODE_READ)
> + error = vmsplice_to_user(file, iov, nr_segs, flags);
> +
> + file_hotplug_read_unlock(file);
> +fput_file:
> + fput_light(file, fput);
> +out:
> return error;
> }
>
> @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
>
> error = -EBADF;
> in = fget_light(fd_in, &fput_in);
> - if (in) {
> - if (in->f_mode & FMODE_READ) {
> - out = fget_light(fd_out, &fput_out);
> - if (out) {
> - if (out->f_mode & FMODE_WRITE)
> - error = do_splice(in, off_in,
> - out, off_out,
> - len, flags);
> - fput_light(out, fput_out);
> - }
> - }
> + if (!in)
> + goto out;
>
> - fput_light(in, fput_in);
> - }
> + if (!(in->f_mode & FMODE_READ))
> + goto fput_in;
> +
> + error = -EIO;
> + if (!file_hotplug_read_trylock(in))
> + goto fput_in;
> +
> + error = -EBADF;
> + out = fget_light(fd_out, &fput_out);
> + if (!out)
> + goto unlock_in;
> +
> + if (!(out->f_mode & FMODE_WRITE))
> + goto fput_out;
> +
> + error = -EIO;
> + if (!file_hotplug_read_trylock(out))
> + goto fput_out;
> +
> + error = do_splice(in, off_in, out, off_out, len, flags);
>
> + file_hotplug_read_unlock(out);
> +fput_out:
> + fput_light(out, fput_out);
> +unlock_in:
> + file_hotplug_read_unlock(in);
> +fput_in:
> + fput_light(in, fput_in);
> +
> +out:
> return error;
> }
>
> @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,
>
> SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
> {
> - struct file *in;
> - int error, fput_in;
> + struct file *in, *out;
> + int error, fput_in, fput_out;
>
> if (unlikely(!len))
> return 0;
>
> error = -EBADF;
> in = fget_light(fdin, &fput_in);
> - if (in) {
> - if (in->f_mode & FMODE_READ) {
> - int fput_out;
> - struct file *out = fget_light(fdout, &fput_out);
> -
> - if (out) {
> - if (out->f_mode & FMODE_WRITE)
> - error = do_tee(in, out, len, flags);
> - fput_light(out, fput_out);
> - }
> - }
> - fput_light(in, fput_in);
> - }
> + if (!in)
> + goto out;
> +
> + if (!(in->f_mode & FMODE_READ))
> + goto unlock_in; <<<<<<<
Shouldn't this be
goto fput_in;
? btw, its confusing to have labels and variables with same name:
fput_in and fput_out. You may want to rename labels ?
>
> + error = -EIO;
> + if (!file_hotplug_read_trylock(in))
> + goto fput_in;
> +
> + error = -EBADF;
> + out = fget_light(fdout, &fput_out);
> + if (!out)
> + goto unlock_in;
> +
> + if (!(out->f_mode & FMODE_WRITE))
> + goto fput_out;
> +
> + if (!file_hotplug_read_trylock(out))
> + goto fput_out;
> +
> + error = do_tee(in, out, len, flags);
> +
> + file_hotplug_read_unlock(out);
> +fput_out:
> + fput_light(out, fput_out);
> +unlock_in:
> + file_hotplug_read_unlock(in);
> +fput_in:
> + fput_light(in, fput_in);
> +out:
> return error;
> }
Thanks,
Badari
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock
2009-06-03 23:39 ` Badari Pulavarty
@ 2009-06-05 19:37 ` Eric W. Biederman
0 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-05 19:37 UTC (permalink / raw)
To: Badari Pulavarty
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
Badari Pulavarty <pbadari@gmail.com> writes:
> On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
>> From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> ---
>> fs/read_write.c | 28 +++++++++----
>> fs/splice.c | 111 +++++++++++++++++++++++++++++++++++++-----------------
>> 2 files changed, 95 insertions(+), 44 deletions(-)
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 718baea..c473d74 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -861,21 +861,24 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> goto out;
>> if (!(in_file->f_mode & FMODE_READ))
>> goto fput_in;
>> + retval = -EIO;
>> + if (!file_hotplug_read_trylock(in_file))
>> + goto fput_in;
>> retval = -EINVAL;
>> in_inode = in_file->f_path.dentry->d_inode;
>> if (!in_inode)
>> - goto fput_in;
>> + goto unlock_in;
>> if (!in_file->f_op || !in_file->f_op->splice_read)
>> - goto fput_in;
>> + goto unlock_in;
>> retval = -ESPIPE;
>> if (!ppos)
>> ppos = &in_file->f_pos;
>> else
>> if (!(in_file->f_mode & FMODE_PREAD))
>> - goto fput_in;
>> + goto unlock_in;
>> retval = rw_verify_area(READ, in_file, ppos, count);
>> if (retval < 0)
>> - goto fput_in;
>> + goto unlock_in;
>> count = retval;
>>
>> /*
>> @@ -884,16 +887,19 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> retval = -EBADF;
>> out_file = fget_light(out_fd, &fput_needed_out);
>> if (!out_file)
>> - goto fput_in;
>> + goto unlock_in;
>> if (!(out_file->f_mode & FMODE_WRITE))
>> goto fput_out;
>> + retval = -EIO;
>> + if (!file_hotplug_read_trylock(out_file))
>> + goto fput_out;
>> retval = -EINVAL;
>> if (!out_file->f_op || !out_file->f_op->sendpage)
>> - goto fput_out;
>> + goto unlock_out;
>> out_inode = out_file->f_path.dentry->d_inode;
>> retval = rw_verify_area(WRITE, out_file, &out_file->f_pos, count);
>> if (retval < 0)
>> - goto fput_out;
>> + goto unlock_out;
>> count = retval;
>>
>> if (!max)
>> @@ -902,11 +908,11 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> pos = *ppos;
>> retval = -EINVAL;
>> if (unlikely(pos < 0))
>> - goto fput_out;
>> + goto unlock_out;
>> if (unlikely(pos + count > max)) {
>> retval = -EOVERFLOW;
>> if (pos >= max)
>> - goto fput_out;
>> + goto unlock_out;
>> count = max - pos;
>> }
>>
>> @@ -933,8 +939,12 @@ static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
>> if (*ppos > max)
>> retval = -EOVERFLOW;
>>
>> +unlock_out:
>> + file_hotplug_read_unlock(out_file);
>> fput_out:
>> fput_light(out_file, fput_needed_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in_file);
>> fput_in:
>> fput_light(in_file, fput_needed_in);
>> out:
>> diff --git a/fs/splice.c b/fs/splice.c
>> index 666953d..fc6b3a5 100644
>> --- a/fs/splice.c
>> +++ b/fs/splice.c
>> @@ -1464,15 +1464,21 @@ SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, iov,
>>
>> error = -EBADF;
>> file = fget_light(fd, &fput);
>> - if (file) {
>> - if (file->f_mode & FMODE_WRITE)
>> - error = vmsplice_to_pipe(file, iov, nr_segs, flags);
>> - else if (file->f_mode & FMODE_READ)
>> - error = vmsplice_to_user(file, iov, nr_segs, flags);
>> + if (!file)
>> + goto out;
>>
>> - fput_light(file, fput);
>> - }
>> + if (!file_hotplug_read_trylock(file))
>> + goto fput_file;
>>
>> + if (file->f_mode & FMODE_WRITE)
>> + error = vmsplice_to_pipe(file, iov, nr_segs, flags);
>> + else if (file->f_mode & FMODE_READ)
>> + error = vmsplice_to_user(file, iov, nr_segs, flags);
>> +
>> + file_hotplug_read_unlock(file);
>> +fput_file:
>> + fput_light(file, fput);
>> +out:
>> return error;
>> }
>>
>> @@ -1489,21 +1495,39 @@ SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
>>
>> error = -EBADF;
>> in = fget_light(fd_in, &fput_in);
>> - if (in) {
>> - if (in->f_mode & FMODE_READ) {
>> - out = fget_light(fd_out, &fput_out);
>> - if (out) {
>> - if (out->f_mode & FMODE_WRITE)
>> - error = do_splice(in, off_in,
>> - out, off_out,
>> - len, flags);
>> - fput_light(out, fput_out);
>> - }
>> - }
>> + if (!in)
>> + goto out;
>>
>> - fput_light(in, fput_in);
>> - }
>> + if (!(in->f_mode & FMODE_READ))
>> + goto fput_in;
>> +
>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(in))
>> + goto fput_in;
>> +
>> + error = -EBADF;
>> + out = fget_light(fd_out, &fput_out);
>> + if (!out)
>> + goto unlock_in;
>> +
>> + if (!(out->f_mode & FMODE_WRITE))
>> + goto fput_out;
>> +
>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(out))
>> + goto fput_out;
>> +
>> + error = do_splice(in, off_in, out, off_out, len, flags);
>>
>> + file_hotplug_read_unlock(out);
>> +fput_out:
>> + fput_light(out, fput_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in);
>> +fput_in:
>> + fput_light(in, fput_in);
>> +
>> +out:
>> return error;
>> }
>>
>> @@ -1703,27 +1727,44 @@ static long do_tee(struct file *in, struct file *out, size_t len,
>>
>> SYSCALL_DEFINE4(tee, int, fdin, int, fdout, size_t, len, unsigned int, flags)
>> {
>> - struct file *in;
>> - int error, fput_in;
>> + struct file *in, *out;
>> + int error, fput_in, fput_out;
>>
>> if (unlikely(!len))
>> return 0;
>>
>> error = -EBADF;
>> in = fget_light(fdin, &fput_in);
>> - if (in) {
>> - if (in->f_mode & FMODE_READ) {
>> - int fput_out;
>> - struct file *out = fget_light(fdout, &fput_out);
>> -
>> - if (out) {
>> - if (out->f_mode & FMODE_WRITE)
>> - error = do_tee(in, out, len, flags);
>> - fput_light(out, fput_out);
>> - }
>> - }
>> - fput_light(in, fput_in);
>> - }
>> + if (!in)
>> + goto out;
>> +
>> + if (!(in->f_mode & FMODE_READ))
>> + goto unlock_in; <<<<<<<
>
> Shouldn't this be
> goto fput_in;
Good point. That is a bug.
> ? btw, its confusing to have labels and variables with same name:
> fput_in and fput_out. You may want to rename labels ?
Mayhap. I didn't start that one, although I am clearly spreading
it around here. Do you have a better naming suggestion?
>> + error = -EIO;
>> + if (!file_hotplug_read_trylock(in))
>> + goto fput_in;
>> +
>> + error = -EBADF;
>> + out = fget_light(fdout, &fput_out);
>> + if (!out)
>> + goto unlock_in;
>> +
>> + if (!(out->f_mode & FMODE_WRITE))
>> + goto fput_out;
>> +
>> + if (!file_hotplug_read_trylock(out))
>> + goto fput_out;
>> +
>> + error = do_tee(in, out, len, flags);
>> +
>> + file_hotplug_read_unlock(out);
>> +fput_out:
>> + fput_light(out, fput_out);
>> +unlock_in:
>> + file_hotplug_read_unlock(in);
>> +fput_in:
>> + fput_light(in, fput_in);
>> +out:
>> return error;
>> }
Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 08/23] vfs: Teach readdir to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (6 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 07/23] vfs: Teach sendfile,splice,tee,and vmsplice to use file_hotplug_lock Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 09/23] vfs: Teach poll and select " Eric W. Biederman
` (15 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/readdir.c | 20 +++++++++++++++-----
1 files changed, 15 insertions(+), 5 deletions(-)
diff --git a/fs/readdir.c b/fs/readdir.c
index 7723401..2e147cf 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -21,18 +21,26 @@
int vfs_readdir(struct file *file, filldir_t filler, void *buf)
{
- struct inode *inode = file->f_path.dentry->d_inode;
- int res = -ENOTDIR;
- if (!file->f_op || !file->f_op->readdir)
+ struct inode *inode;
+ int res;
+
+ res = -EIO;
+ if (!file_hotplug_read_trylock(file))
goto out;
+ inode = file->f_path.dentry->d_inode;
+
+ res = -ENOTDIR;
+ if (!file->f_op || !file->f_op->readdir)
+ goto out_unlock;
+
res = security_file_permission(file, MAY_READ);
if (res)
- goto out;
+ goto out_unlock;
res = mutex_lock_killable(&inode->i_mutex);
if (res)
- goto out;
+ goto out_unlock;
res = -ENOENT;
if (!IS_DEADDIR(inode)) {
@@ -40,6 +48,8 @@ int vfs_readdir(struct file *file, filldir_t filler, void *buf)
file_accessed(file);
}
mutex_unlock(&inode->i_mutex);
+out_unlock:
+ file_hotplug_read_unlock(file);
out:
return res;
}
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 09/23] vfs: Teach poll and select to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (7 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 08/23] vfs: Teach readdir " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 10/23] vfs: Teach do_path_lookup " Eric W. Biederman
` (14 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/select.c | 24 +++++++++++++++++-------
include/linux/poll.h | 1 +
2 files changed, 18 insertions(+), 7 deletions(-)
diff --git a/fs/select.c b/fs/select.c
index 99e4145..fd68da0 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -416,10 +416,15 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
continue;
file = fget_light(i, &fput_needed);
if (file) {
- f_op = file->f_op;
- mask = DEFAULT_POLLMASK;
- if (f_op && f_op->poll)
- mask = (*f_op->poll)(file, retval ? NULL : wait);
+ mask = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(file)) {
+ f_op = file->f_op;
+ mask = DEFAULT_POLLMASK;
+ if (f_op && f_op->poll)
+ mask = (*f_op->poll)(file, retval ? NULL : wait);
+
+ file_hotplug_read_unlock(file);
+ }
fput_light(file, fput_needed);
if ((mask & POLLIN_SET) && (in & bit)) {
res_in |= bit;
@@ -684,9 +689,14 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
file = fget_light(fd, &fput_needed);
mask = POLLNVAL;
if (file != NULL) {
- mask = DEFAULT_POLLMASK;
- if (file->f_op && file->f_op->poll)
- mask = file->f_op->poll(file, pwait);
+ mask = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(file)) {
+ mask = DEFAULT_POLLMASK;
+ if (file->f_op && file->f_op->poll)
+ mask = file->f_op->poll(file, pwait);
+
+ file_hotplug_read_unlock(file);
+ }
/* Mask out unneeded events. */
mask &= pollfd->events | POLLERR | POLLHUP;
fput_light(file, fput_needed);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index d388620..f0512f4 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -22,6 +22,7 @@
#define N_INLINE_POLL_ENTRIES (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))
#define DEFAULT_POLLMASK (POLLIN | POLLOUT | POLLRDNORM | POLLWRNORM)
+#define DEAD_POLLMASK (DEFAULT_POLLMASK | POLLERR)
struct poll_table_struct;
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 10/23] vfs: Teach do_path_lookup to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (8 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 09/23] vfs: Teach poll and select " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 11/23] mm: Teach mmap " Eric W. Biederman
` (13 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/namei.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 5472ed0..c4c6575 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1049,23 +1049,30 @@ static int path_init(int dfd, const char *name, unsigned int flags, struct namei
if (!file)
goto out_fail;
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto fput_fail;
+
dentry = file->f_path.dentry;
retval = -ENOTDIR;
if (!S_ISDIR(dentry->d_inode->i_mode))
- goto fput_fail;
+ goto unlock_fail;
retval = file_permission(file, MAY_EXEC);
if (retval)
- goto fput_fail;
+ goto unlock_fail;
nd->path = file->f_path;
path_get(&file->f_path);
+ file_hotplug_read_unlock(file);
fput_light(file, fput_needed);
}
return 0;
+unlock_fail:
+ file_hotplug_read_unlock(file);
fput_fail:
fput_light(file, fput_needed);
out_fail:
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 11/23] mm: Teach mmap to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (9 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 10/23] vfs: Teach do_path_lookup " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 12/23] vfs: Teach fcntl " Eric W. Biederman
` (12 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
mm/mmap.c | 78 +++++++++++++++++++++++++++++++++++++++--------------------
mm/nommu.c | 21 +++++++++++++++-
2 files changed, 71 insertions(+), 28 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 6b7b1a9..f13251a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -914,9 +914,13 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
struct mm_struct * mm = current->mm;
struct inode *inode;
unsigned int vm_flags;
- int error;
+ unsigned long retval;
unsigned long reqprot = prot;
+ retval = -EIO;
+ if (file && !file_hotplug_read_trylock(file))
+ goto out;
+
/*
* Does the application expect PROT_READ to imply PROT_EXEC?
*
@@ -927,35 +931,40 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
prot |= PROT_EXEC;
+ retval = -EINVAL;
if (!len)
- return -EINVAL;
+ goto out_unlock;
if (!(flags & MAP_FIXED))
addr = round_hint_to_min(addr);
- error = arch_mmap_check(addr, len, flags);
- if (error)
- return error;
+ retval = arch_mmap_check(addr, len, flags);
+ if (retval)
+ goto out_unlock;
/* Careful about overflows.. */
+ retval = -ENOMEM;
len = PAGE_ALIGN(len);
if (!len || len > TASK_SIZE)
- return -ENOMEM;
+ goto out_unlock;
/* offset overflow? */
+ retval = -EOVERFLOW;
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
- return -EOVERFLOW;
+ goto out_unlock;
/* Too many mappings? */
+ retval = -ENOMEM;
if (mm->map_count > sysctl_max_map_count)
- return -ENOMEM;
+ goto out_unlock;
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
addr = get_unmapped_area(file, addr, len, pgoff, flags);
+ retval = addr;
if (addr & ~PAGE_MASK)
- return addr;
+ goto out_unlock;
/* Do simple checking here so the lower-level routines won't have
* to. we assume access permissions have been handled by the open
@@ -965,8 +974,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
if (flags & MAP_LOCKED) {
+ retval = -EPERM;
if (!can_do_mlock())
- return -EPERM;
+ goto out_unlock;
vm_flags |= VM_LOCKED;
}
@@ -977,8 +987,9 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
locked += mm->locked_vm;
lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur;
lock_limit >>= PAGE_SHIFT;
+ retval = -EAGAIN;
if (locked > lock_limit && !capable(CAP_IPC_LOCK))
- return -EAGAIN;
+ goto out_unlock;
}
inode = file ? file->f_path.dentry->d_inode : NULL;
@@ -986,21 +997,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
if (file) {
switch (flags & MAP_TYPE) {
case MAP_SHARED:
+ retval = -EACCES;
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
- return -EACCES;
+ goto out_unlock;
/*
* Make sure we don't allow writing to an append-only
* file..
*/
+ retval = -EACCES;
if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
- return -EACCES;
+ goto out_unlock;
/*
* Make sure there are no mandatory locks on the file.
*/
+ retval = -EAGAIN;
if (locks_verify_locked(inode))
- return -EAGAIN;
+ goto out_unlock;
vm_flags |= VM_SHARED | VM_MAYSHARE;
if (!(file->f_mode & FMODE_WRITE))
@@ -1008,20 +1022,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
/* fall through */
case MAP_PRIVATE:
+ retval = -EACCES;
if (!(file->f_mode & FMODE_READ))
- return -EACCES;
+ goto out_unlock;
if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
+ retval = -EPERM;
if (vm_flags & VM_EXEC)
- return -EPERM;
+ goto out_unlock;
vm_flags &= ~VM_MAYEXEC;
}
+ retval = -ENODEV;
if (!file->f_op || !file->f_op->mmap)
- return -ENODEV;
+ goto out_unlock;
break;
default:
- return -EINVAL;
+ retval = -EINVAL;
+ goto out_unlock;
}
} else {
switch (flags & MAP_TYPE) {
@@ -1039,18 +1057,24 @@ unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
pgoff = addr >> PAGE_SHIFT;
break;
default:
- return -EINVAL;
+ retval = -EINVAL;
+ goto out_unlock;
}
}
- error = security_file_mmap(file, reqprot, prot, flags, addr, 0);
- if (error)
- return error;
- error = ima_file_mmap(file, prot);
- if (error)
- return error;
+ retval = security_file_mmap(file, reqprot, prot, flags, addr, 0);
+ if (retval)
+ goto out_unlock;
+ retval = ima_file_mmap(file, prot);
+ if (retval)
+ goto out_unlock;
+ retval = mmap_region(file, addr, len, flags, vm_flags, pgoff);
- return mmap_region(file, addr, len, flags, vm_flags, pgoff);
+out_unlock:
+ if (file)
+ file_hotplug_read_unlock(file);
+out:
+ return retval;
}
EXPORT_SYMBOL(do_mmap_pgoff);
diff --git a/mm/nommu.c b/mm/nommu.c
index b571ef7..08038b7 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1165,7 +1165,7 @@ enomem:
/*
* handle mapping creation for uClinux
*/
-unsigned long do_mmap_pgoff(struct file *file,
+static unsigned long __do_mmap_pgoff(struct file *file,
unsigned long addr,
unsigned long len,
unsigned long prot,
@@ -1402,6 +1402,25 @@ error_getting_region:
show_free_areas();
return -ENOMEM;
}
+
+unsigned long do_mmap_pgoff(struct file *file,
+ unsigned long addr,
+ unsigned long len,
+ unsigned long prot,
+ unsigned long flags,
+ unsigned long pgoff)
+{
+ unsigned long result = -EIO;
+ if (file && !file_hotplug_read_trylock(file))
+ goto out;
+
+ result = __do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
+
+ if (file)
+ file_hotplug_read_unlock(file);
+out:
+ return result;
+}
EXPORT_SYMBOL(do_mmap_pgoff);
/*
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 12/23] vfs: Teach fcntl to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (10 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 11/23] mm: Teach mmap " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 13/23] vfs: Teach ioctl " Eric W. Biederman
` (11 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/fcntl.c | 28 +++++++++++++++++++---------
1 files changed, 19 insertions(+), 9 deletions(-)
diff --git a/fs/fcntl.c b/fs/fcntl.c
index cc8e4de..05d8961 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -344,14 +344,19 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
if (!filp)
goto out;
+ err = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
err = security_file_fcntl(filp, cmd, arg);
- if (err) {
- fput(filp);
- return err;
- }
+ if (err)
+ goto out_unlock;
err = do_fcntl(fd, cmd, arg, filp);
+out_unlock:
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
out:
return err;
@@ -369,13 +374,15 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
if (!filp)
goto out;
+ err = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
err = security_file_fcntl(filp, cmd, arg);
- if (err) {
- fput(filp);
- return err;
- }
+ if (err)
+ goto out_unlock;
+
err = -EBADF;
-
switch (cmd) {
case F_GETLK64:
err = fcntl_getlk64(filp, (struct flock64 __user *) arg);
@@ -389,6 +396,9 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
err = do_fcntl(fd, cmd, arg, filp);
break;
}
+out_unlock:
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
out:
return err;
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 13/23] vfs: Teach ioctl to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (11 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 12/23] vfs: Teach fcntl " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 14/23] vfs: Teach flock " Eric W. Biederman
` (10 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/compat_ioctl.c | 14 ++++++++++----
fs/ioctl.c | 8 +++++++-
2 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/compat_ioctl.c b/fs/compat_ioctl.c
index b83f6bc..fa654c5 100644
--- a/fs/compat_ioctl.c
+++ b/fs/compat_ioctl.c
@@ -2796,10 +2796,14 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
if (!filp)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
/* RED-PEN how should LSM module know it's handling 32bit? */
error = security_file_ioctl(filp, cmd, arg);
if (error)
- goto out_fput;
+ goto out_unlock;
/*
* To allow the compat_ioctl handlers to be self contained
@@ -2825,7 +2829,7 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
if (filp->f_op && filp->f_op->compat_ioctl) {
error = filp->f_op->compat_ioctl(filp, cmd, arg);
if (error != -ENOIOCTLCMD)
- goto out_fput;
+ goto out_unlock;
}
if (!filp->f_op ||
@@ -2853,18 +2857,20 @@ asmlinkage long compat_sys_ioctl(unsigned int fd, unsigned int cmd,
error = -EINVAL;
}
- goto out_fput;
+ goto out_unlock;
found_handler:
if (t->handler) {
lock_kernel();
error = t->handler(fd, cmd, arg, filp);
unlock_kernel();
- goto out_fput;
+ goto out_unlock;
}
do_ioctl:
error = do_vfs_ioctl(filp, fd, cmd, arg);
+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_fput:
fput_light(filp, fput_needed);
out:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 82d9c42..2dad7ba 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -577,11 +577,17 @@ SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
if (!filp)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
error = security_file_ioctl(filp, cmd, arg);
if (error)
- goto out_fput;
+ goto out_unlock;
error = do_vfs_ioctl(filp, fd, cmd, arg);
+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_fput:
fput_light(filp, fput_needed);
out:
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 14/23] vfs: Teach flock to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (12 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 13/23] vfs: Teach ioctl " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 15/23] vfs: Teach fallocate, and filp_close " Eric W. Biederman
` (9 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/locks.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/fs/locks.c b/fs/locks.c
index ec3deea..f74794e 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1584,9 +1584,13 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
!(filp->f_mode & (FMODE_READ|FMODE_WRITE)))
goto out_putf;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_putf;
+
error = flock_make_lock(filp, &lock, cmd);
if (error)
- goto out_putf;
+ goto out_unlock;
if (can_sleep)
lock->fl_flags |= FL_SLEEP;
@@ -1604,6 +1608,8 @@ SYSCALL_DEFINE2(flock, unsigned int, fd, unsigned int, cmd)
out_free:
locks_free_lock(lock);
+ out_unlock:
+ file_hotplug_read_unlock(filp);
out_putf:
fput(filp);
out:
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 15/23] vfs: Teach fallocate, and filp_close to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (13 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 14/23] vfs: Teach flock " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 16/23] vfs: Teach fstatfs, fstatfs64, ftruncate, fchdir, fchmod, fchown " Eric W. Biederman
` (8 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.om>
---
fs/open.c | 22 +++++++++++++++++-----
1 files changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index d0b2433..83d6369 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -398,19 +398,22 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
goto out;
if (!(file->f_mode & FMODE_WRITE))
goto out_fput;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
/*
* Revalidate the write permissions, in case security policy has
* changed since the files were opened.
*/
ret = security_file_permission(file, MAY_WRITE);
if (ret)
- goto out_fput;
+ goto out_unlock;
inode = file->f_path.dentry->d_inode;
ret = -ESPIPE;
if (S_ISFIFO(inode->i_mode))
- goto out_fput;
+ goto out_unlock;
ret = -ENODEV;
/*
@@ -418,18 +421,20 @@ SYSCALL_DEFINE(fallocate)(int fd, int mode, loff_t offset, loff_t len)
* for directories or not.
*/
if (!S_ISREG(inode->i_mode) && !S_ISDIR(inode->i_mode))
- goto out_fput;
+ goto out_unlock;
ret = -EFBIG;
/* Check for wrap through zero too */
if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
- goto out_fput;
+ goto out_unlock;
if (inode->i_op->fallocate)
ret = inode->i_op->fallocate(inode, mode, offset, len);
else
ret = -EOPNOTSUPP;
+out_unlock:
+ file_hotplug_read_unlock(file);
out_fput:
fput(file);
out:
@@ -1101,18 +1106,25 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, int, mode)
*/
int filp_close(struct file *filp, fl_owner_t id)
{
- int retval = 0;
+ int retval;
if (!file_count(filp)) {
printk(KERN_ERR "VFS: Close: file count is 0\n");
return 0;
}
+ retval = -EIO;
+ if (!file_hotplug_read_trylock(filp))
+ goto out_fput;
+
+ retval = 0;
if (filp->f_op && filp->f_op->flush)
retval = filp->f_op->flush(filp, id);
dnotify_flush(filp, id);
locks_remove_posix(filp, id);
+ file_hotplug_read_unlock(filp);
+out_fput:
fput(filp);
return retval;
}
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 16/23] vfs: Teach fstatfs, fstatfs64, ftruncate, fchdir, fchmod, fchown to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (14 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 15/23] vfs: Teach fallocate, and filp_close " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 17/23] proc: Teach /proc/<pid>/fd " Eric W. Biederman
` (7 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/open.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 files changed, 41 insertions(+), 6 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 83d6369..354646b 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -167,9 +167,14 @@ SYSCALL_DEFINE2(fstatfs, unsigned int, fd, struct statfs __user *, buf)
file = fget(fd);
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
error = vfs_statfs_native(file->f_path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
+ file_hotplug_read_unlock(file);
+out_putf:
fput(file);
out:
return error;
@@ -188,9 +193,14 @@ SYSCALL_DEFINE3(fstatfs64, unsigned int, fd, size_t, sz, struct statfs64 __user
file = fget(fd);
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
error = vfs_statfs64(file->f_path.dentry, &tmp);
if (!error && copy_to_user(buf, &tmp, sizeof(tmp)))
error = -EFAULT;
+ file_hotplug_read_unlock(file);
+out_putf:
fput(file);
out:
return error;
@@ -309,6 +319,10 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
/* explicitly opened as large or we are on 64-bit box */
if (file->f_flags & O_LARGEFILE)
small = 0;
@@ -317,16 +331,16 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
inode = dentry->d_inode;
error = -EINVAL;
if (!S_ISREG(inode->i_mode) || !(file->f_mode & FMODE_WRITE))
- goto out_putf;
+ goto out_unlock;
error = -EINVAL;
/* Cannot ftruncate over 2^31 bytes without large file support */
if (small && length > MAX_NON_LFS)
- goto out_putf;
+ goto out_unlock;
error = -EPERM;
if (IS_APPEND(inode))
- goto out_putf;
+ goto out_unlock;
error = locks_verify_truncate(inode, file, length);
if (!error)
@@ -334,6 +348,9 @@ static long do_sys_ftruncate(unsigned int fd, loff_t length, int small)
ATTR_MTIME|ATTR_CTIME);
if (!error)
error = do_truncate(dentry, length, ATTR_MTIME|ATTR_CTIME, file);
+
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -560,15 +577,21 @@ SYSCALL_DEFINE1(fchdir, unsigned int, fd)
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
inode = file->f_path.dentry->d_inode;
error = -ENOTDIR;
if (!S_ISDIR(inode->i_mode))
- goto out_putf;
+ goto out_unlock;
error = inode_permission(inode, MAY_EXEC | MAY_ACCESS);
if (!error)
set_fs_pwd(current->fs, &file->f_path);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -612,6 +635,10 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)
if (!file)
goto out;
+ err = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_putf;
+
dentry = file->f_path.dentry;
inode = dentry->d_inode;
@@ -619,7 +646,7 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)
err = mnt_want_write_file(file);
if (err)
- goto out_putf;
+ goto out_unlock;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -628,6 +655,8 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, mode_t, mode)
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
mnt_drop_write(file->f_path.mnt);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_putf:
fput(file);
out:
@@ -766,13 +795,19 @@ SYSCALL_DEFINE3(fchown, unsigned int, fd, uid_t, user, gid_t, group)
if (!file)
goto out;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
+
error = mnt_want_write_file(file);
if (error)
- goto out_fput;
+ goto out_unlock;
dentry = file->f_path.dentry;
audit_inode(NULL, dentry);
error = chown_common(dentry, user, group);
mnt_drop_write(file->f_path.mnt);
+out_unlock:
+ file_hotplug_read_unlock(file);
out_fput:
fput(file);
out:
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 17/23] proc: Teach /proc/<pid>/fd to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (15 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 16/23] vfs: Teach fstatfs, fstatfs64, ftruncate, fchdir, fchmod, fchown " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 18/23] vfs: Teach epoll " Eric W. Biederman
` (6 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.arastra.com>
I have taken the opportunity to modify proc_fd_info to have
a single exit point.
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/proc/base.c | 29 ++++++++++++++++-------------
1 files changed, 16 insertions(+), 13 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index fb45615..ee4cdc2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1626,6 +1626,7 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
struct files_struct *files = NULL;
struct file *file;
int fd = proc_fd(inode);
+ int retval = -ENOENT;
if (task) {
files = get_files_struct(task);
@@ -1639,24 +1640,26 @@ static int proc_fd_info(struct inode *inode, struct path *path, char *info)
spin_lock(&files->file_lock);
file = fcheck_files(files, fd);
if (file) {
- if (path) {
- *path = file->f_path;
- path_get(&file->f_path);
+ retval = -EIO;
+ if (file_hotplug_read_trylock(file)) {
+ retval = 0;
+ if (path) {
+ *path = file->f_path;
+ path_get(&file->f_path);
+ }
+ if (info)
+ snprintf(info, PROC_FDINFO_MAX,
+ "pos:\t%lli\n"
+ "flags:\t0%o\n",
+ (long long) file->f_pos,
+ file->f_flags);
+ file_hotplug_read_unlock(file);
}
- if (info)
- snprintf(info, PROC_FDINFO_MAX,
- "pos:\t%lli\n"
- "flags:\t0%o\n",
- (long long) file->f_pos,
- file->f_flags);
- spin_unlock(&files->file_lock);
- put_files_struct(files);
- return 0;
}
spin_unlock(&files->file_lock);
put_files_struct(files);
}
- return -ENOENT;
+ return retval;
}
static int proc_fd_link(struct inode *inode, struct path *path)
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (16 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 17/23] proc: Teach /proc/<pid>/fd " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-02 16:51 ` Davide Libenzi
2009-06-01 21:50 ` [PATCH 19/23] eventpoll: Fix comment Eric W. Biederman
` (5 subsequent siblings)
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
1 files changed, 32 insertions(+), 7 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a89f370..eabb167 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -627,8 +627,13 @@ static int ep_read_events_proc(struct eventpoll *ep, struct list_head *head,
struct epitem *epi, *tmp;
list_for_each_entry_safe(epi, tmp, head, rdllink) {
- if (epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
- epi->event.events)
+ int events = DEAD_POLLMASK;
+
+ if (file_hotplug_read_trylock(epi->ffd.file)) {
+ events = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+ file_hotplug_read_unlock(epi->ffd.file);
+ }
+ if (events & epi->event.events)
return POLLIN | POLLRDNORM;
else {
/*
@@ -1060,8 +1065,12 @@ static int ep_send_events_proc(struct eventpoll *ep, struct list_head *head,
list_del_init(&epi->rdllink);
- revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL) &
- epi->event.events;
+ revents = DEAD_POLLMASK;
+ if (file_hotplug_read_trylock(epi->ffd.file)) {
+ revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
+ file_hotplug_read_unlock(epi->ffd.file);
+ }
+ revents &= epi->event.events;
/*
* If the event mask intersect the caller-requested one,
@@ -1248,10 +1257,17 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
if (!tfile)
goto error_fput;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto error_tgt_fput;
+
+ if (!file_hotplug_read_trylock(tfile))
+ goto error_file_unlock;
+
/* The target file descriptor must support poll */
error = -EPERM;
if (!tfile->f_op || !tfile->f_op->poll)
- goto error_tgt_fput;
+ goto error_tgt_unlock;
/*
* We have to check that the file structure underneath the file descriptor
@@ -1260,7 +1276,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
*/
error = -EINVAL;
if (file == tfile || !is_file_epoll(file))
- goto error_tgt_fput;
+ goto error_tgt_unlock;
/*
* At this point it is safe to assume that the "private_data" contains
@@ -1302,6 +1318,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
}
mutex_unlock(&ep->mtx);
+error_tgt_unlock:
+ file_hotplug_read_unlock(tfile);
+error_file_unlock:
+ file_hotplug_read_unlock(file);
error_tgt_fput:
fput(tfile);
error_fput:
@@ -1338,13 +1358,16 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
if (!file)
goto error_return;
+ error = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto error_fput;
/*
* We have to check that the file structure underneath the fd
* the user passed to us _is_ an eventpoll file.
*/
error = -EINVAL;
if (!is_file_epoll(file))
- goto error_fput;
+ goto error_unlock;
/*
* At this point it is safe to assume that the "private_data" contains
@@ -1355,6 +1378,8 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
/* Time to fish for events ... */
error = ep_poll(ep, events, maxevents, timeout);
+error_unlock:
+ file_hotplug_read_unlock(file);
error_fput:
fput(file);
error_return:
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-01 21:50 ` [PATCH 18/23] vfs: Teach epoll " Eric W. Biederman
@ 2009-06-02 16:51 ` Davide Libenzi
2009-06-02 21:23 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Davide Libenzi @ 2009-06-02 16:51 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman,
Eric W. Biederman
On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
> 1 files changed, 32 insertions(+), 7 deletions(-)
This patchset gives me the willies for the amount of changes and possible
impact on many subsystems.
Without having looked at the details, are you aware that epoll does not
act like poll/select, and fds are not automatically removed (as in,
dequeued from the poll wait queue) in any foreseeable amount of time after
a POLLERR is received?
As far as the usespace API goes, they have the right to remain there.
- Davide
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-02 16:51 ` Davide Libenzi
@ 2009-06-02 21:23 ` Eric W. Biederman
2009-06-02 21:52 ` Davide Libenzi
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 21:23 UTC (permalink / raw)
To: Davide Libenzi
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman,
Eric W. Biederman
Davide Libenzi <davidel@xmailserver.org> writes:
> On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>
>> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>>
>> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> ---
>> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>> 1 files changed, 32 insertions(+), 7 deletions(-)
>
> This patchset gives me the willies for the amount of changes and possible
> impact on many subsystems.
It both is and is not that bad. It is the cost of adding a lock.
For the VFS except for nfsd the I have touched everything that needs to be
touched.
Other subsystems that open read/write close files should be able to use
existing vfs helpers so they don't need to know about the new locking
explicitly.
Actually taking advantage of this infrastructure in a subsystem is
comparatively easy. It took me about an hour to get uio using it.
That part is not deep by any means and is opt in.
> Without having looked at the details, are you aware that epoll does not
> act like poll/select, and fds are not automatically removed (as in,
> dequeued from the poll wait queue) in any foreseeable amount of time after
> a POLLERR is received?
Yes I am aware of how epoll acts differently.
> As far as the usespace API goes, they have the right to remain there.
I absolutely agree.
Currently I have the code acting like close() with respect to epoll and
just having the file descriptor vanish at the end of the revoke. While
we the revoke is in progress you get an EIO.
The file descriptor is not freed by a revoke operation so you can happily
hang unto it as long as you want.
I thought of doing something more uniform to user space. But I observed
that the existing epoll punts on the case of a file descriptor being closed
and locking to go from a file to the other epoll datastructures is pretty
horrid I said forget it and used the existing close behaviour.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-02 21:23 ` Eric W. Biederman
@ 2009-06-02 21:52 ` Davide Libenzi
2009-06-02 22:51 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Davide Libenzi @ 2009-06-02 21:52 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman,
Eric W. Biederman
On Tue, 2 Jun 2009, Eric W. Biederman wrote:
> Davide Libenzi <davidel@xmailserver.org> writes:
>
> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
> >
> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
> >>
> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> >> ---
> >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
> >> 1 files changed, 32 insertions(+), 7 deletions(-)
> >
> > This patchset gives me the willies for the amount of changes and possible
> > impact on many subsystems.
>
> It both is and is not that bad. It is the cost of adding a lock.
We both know that it is not only the cost of a lock, but also the
sprinkling over a pretty vast amount of subsystems, of another layer of
code.
> I thought of doing something more uniform to user space. But I observed
> that the existing epoll punts on the case of a file descriptor being closed
> and locking to go from a file to the other epoll datastructures is pretty
> horrid I said forget it and used the existing close behaviour.
Well, you cannot rely on the caller to tidy up the epoll fd by issuing an
epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave
lingering crap around. You cannot even hold a reference of the file, since
otherwise the epoll hooking will have to trigger not only at ->release()
time, but at every close, where you'll have to figure out if this is the
last real userspace reference or not. Plus all the issues related to
holding permanent extra references to userspace files.
And since a file can be added in many epoll devices, you need to
unregister it from all of them (hence the other datastructures lookup).
Better this, on the slow path, with locks acquired only in the epoll usage
case, than some other thing and on the fast path, for every file.
- Davide
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-02 21:52 ` Davide Libenzi
@ 2009-06-02 22:51 ` Eric W. Biederman
2009-06-03 14:57 ` Davide Libenzi
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-02 22:51 UTC (permalink / raw)
To: Davide Libenzi
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman,
Eric W. Biederman
Davide Libenzi <davidel@xmailserver.org> writes:
> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> Davide Libenzi <davidel@xmailserver.org> writes:
>>
>> > On Mon, 1 Jun 2009, Eric W. Biederman wrote:
>> >
>> >> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>> >>
>> >> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
>> >> ---
>> >> fs/eventpoll.c | 39 ++++++++++++++++++++++++++++++++-------
>> >> 1 files changed, 32 insertions(+), 7 deletions(-)
>> >
>> > This patchset gives me the willies for the amount of changes and possible
>> > impact on many subsystems.
>>
>> It both is and is not that bad. It is the cost of adding a lock.
>
> We both know that it is not only the cost of a lock, but also the
> sprinkling over a pretty vast amount of subsystems, of another layer of
> code.
I am not clear what problem you have.
Is it the sprinkling the code that takes and removes the lock? Just
the VFS needs to be involved with that. It is a slightly larger
surface area than doing the work inside the file operations as we
sometimes call the same method from 3-4 different places but it is
definitely a bounded problem.
Is it putting in the handful lines per subsystem to actually use this
functionality? At that level something generic that is maintained
outside of the subsystem is better than the mess we have with 4-5
different implementations in the subsystems that need it, each having
a different assortment of bugs.
>> I thought of doing something more uniform to user space. But I observed
>> that the existing epoll punts on the case of a file descriptor being closed
>> and locking to go from a file to the other epoll datastructures is pretty
>> horrid I said forget it and used the existing close behaviour.
>
> Well, you cannot rely on the caller to tidy up the epoll fd by issuing an
> epoll_ctl(DEL), so you do *need* to "punt" on close in order to not leave
> lingering crap around. You cannot even hold a reference of the file, since
> otherwise the epoll hooking will have to trigger not only at ->release()
> time, but at every close, where you'll have to figure out if this is the
> last real userspace reference or not. Plus all the issues related to
> holding permanent extra references to userspace files.
> And since a file can be added in many epoll devices, you need to
> unregister it from all of them (hence the other datastructures lookup).
> Better this, on the slow path, with locks acquired only in the epoll usage
> case, than some other thing and on the fast path, for every file.
Sure, and that is largely and I am preserving those semantics.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-02 22:51 ` Eric W. Biederman
@ 2009-06-03 14:57 ` Davide Libenzi
2009-06-03 20:53 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Davide Libenzi @ 2009-06-03 14:57 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman,
Eric W. Biederman
On Tue, 2 Jun 2009, Eric W. Biederman wrote:
> I am not clear what problem you have.
>
> Is it the sprinkling the code that takes and removes the lock? Just
> the VFS needs to be involved with that. It is a slightly larger
> surface area than doing the work inside the file operations as we
> sometimes call the same method from 3-4 different places but it is
> definitely a bounded problem.
>
> Is it putting in the handful lines per subsystem to actually use this
> functionality? At that level something generic that is maintained
> outside of the subsystem is better than the mess we have with 4-5
> different implementations in the subsystems that need it, each having
> a different assortment of bugs.
Come on, only in the open fast path, there are at least two spin
lock/unlock and two atomic ops. Without even starting to count all the
extra branches and software added.
Is this stuff *really* needed, or we can faitly happily live w/out?
- Davide
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-03 14:57 ` Davide Libenzi
@ 2009-06-03 20:53 ` Eric W. Biederman
2009-06-04 0:50 ` Davide Libenzi
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-03 20:53 UTC (permalink / raw)
To: Davide Libenzi
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman
Davide Libenzi <davidel@xmailserver.org> writes:
> On Tue, 2 Jun 2009, Eric W. Biederman wrote:
>
>> I am not clear what problem you have.
>>
>> Is it the sprinkling the code that takes and removes the lock? Just
>> the VFS needs to be involved with that. It is a slightly larger
>> surface area than doing the work inside the file operations as we
>> sometimes call the same method from 3-4 different places but it is
>> definitely a bounded problem.
>>
>> Is it putting in the handful lines per subsystem to actually use this
>> functionality? At that level something generic that is maintained
>> outside of the subsystem is better than the mess we have with 4-5
>> different implementations in the subsystems that need it, each having
>> a different assortment of bugs.
>
> Come on, only in the open fast path, there are at least two spin
> lock/unlock and two atomic ops. Without even starting to count all the
> extra branches and software added.
> Is this stuff *really* needed, or we can faitly happily live w/out?
????
What code are you talking about?
To the open path a few memory writes and a smp_wmb. No atomics and no
spin lock/unlocks.
Are you complaining because I retain the file_list?
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-03 20:53 ` Eric W. Biederman
@ 2009-06-04 0:50 ` Davide Libenzi
2009-06-04 1:42 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Davide Libenzi @ 2009-06-04 0:50 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman
On Wed, 3 Jun 2009, Eric W. Biederman wrote:
> What code are you talking about?
>
> To the open path a few memory writes and a smp_wmb. No atomics and no
> spin lock/unlocks.
>
> Are you complaining because I retain the file_list?
Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin
lock/unlock couple present in __dentry_open() (same sort of the release
path)?
And that's only like 5% of the code touched by the new special handling of
the file operations structure (basically, every f_op access ends up being
wrapped by two atomic ops and other extra code).
The question, that I'd like to reiterate is, is this stuff really needed?
Anyway, my complaint ends here and I'll let others evaluate if merging
this patchset is worth the cost.
- Davide
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 18/23] vfs: Teach epoll to use file_hotplug_lock
2009-06-04 0:50 ` Davide Libenzi
@ 2009-06-04 1:42 ` Eric W. Biederman
0 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-04 1:42 UTC (permalink / raw)
To: Davide Libenzi
Cc: Al Viro, Linux Kernel Mailing List, linux-pci, linux-mm,
linux-fsdevel, Hugh Dickins, Tejun Heo, Alexey Dobriyan,
Linus Torvalds, Alan Cox, Greg Kroah-Hartman, Nick Piggin,
Andrew Morton, Christoph Hellwig, Eric W. Biederman
Davide Libenzi <davidel@xmailserver.org> writes:
> On Wed, 3 Jun 2009, Eric W. Biederman wrote:
>
>> What code are you talking about?
>>
>> To the open path a few memory writes and a smp_wmb. No atomics and no
>> spin lock/unlocks.
>>
>> Are you complaining because I retain the file_list?
>
> Sorry, did I overlook the patch? Weren't a couple of atomic ops and a spin
> lock/unlock couple present in __dentry_open() (same sort of the release
> path)?
You might be remembering v1. In v2 I have operations like file_hotplug_read_trylock
that implement a lock but use an rcu like algorithm. So there are no atomic
operations involved with their associated pipeline stalls. Over my previous
version this made a reasonable performance benefit.
> And that's only like 5% of the code touched by the new special handling of
> the file operations structure (basically, every f_op access ends up being
> wrapped by two atomic ops and other extra code).
Yes there is a single extra wrapping of every file in the syscall path. So
we know that someone is using it.
> The question, that I'd like to reiterate is, is this stuff really needed?
> Anyway, my complaint ends here and I'll let others evaluate if merging
> this patchset is worth the cost.
Sure. My apologies for not answering that question earlier.
My perspective is that every subsystem that winds up supporting hotplug
hardware winds up rolling it's own version of something like this,
and they each have a different set of bugs.
So one generic version is definitely worth implementing.
Similarly there is a case for a generic revoke facility in the kernel.
Alan at least has made the case that there are certain security problems
that can not be solved in userspace without revoke.
>From an implementation point of view doing the generic implementation at
the vfs level has significant benefits.
The extra locking appears reasonable from a code maintenance and
comprehensibility point of view. A real pain to find all of the entry
points into the vfs, and get other code to use the right vfs helpers
they should always have been using but I am volunteering to do that
work.
The practical question I see is are the performance overheads of my
primitives low enough that I do not cause performance regressions
on anyone's fast path. As far as I have been able to measure is that
the performance overhead is low enough, because I have been able to
avoid the use of atomics and have been able to use fairly small code
with predictable branches. Which is why I pressed you to be certain
I understood where you are coming from.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* [PATCH 19/23] eventpoll: Fix comment
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (17 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 18/23] vfs: Teach epoll " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 20/23] vfs: Teach aio to use file_hotplug_lock Eric W. Biederman
` (4 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/eventpoll.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index eabb167..d42071d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -62,7 +62,7 @@
* This mutex is acquired by ep_free() during the epoll file
* cleanup path and it is also acquired by eventpoll_release_file()
* if a file has been pushed inside an epoll set and it is then
- * close()d without a previous call toepoll_ctl(EPOLL_CTL_DEL).
+ * close()d without a previous call to epoll_ctl(EPOLL_CTL_DEL).
* It is possible to drop the "ep->mtx" and to use the global
* mutex "epmutex" (together with "ep->lock") to have it working,
* but having "ep->mtx" will make the interface more scalable.
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 20/23] vfs: Teach aio to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (18 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 19/23] eventpoll: Fix comment Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 21/23] vfs: Teach fsync " Eric W. Biederman
` (3 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/aio.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 38 insertions(+), 13 deletions(-)
diff --git a/fs/aio.c b/fs/aio.c
index 76da125..eceb215 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1362,13 +1362,20 @@ static void aio_advance_iovec(struct kiocb *iocb, ssize_t ret)
static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
+ struct address_space *mapping;
+ struct inode *inode;
ssize_t (*rw_op)(struct kiocb *, const struct iovec *,
unsigned long, loff_t);
ssize_t ret = 0;
unsigned short opcode;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
+
+ mapping = file->f_mapping;
+ inode = mapping->host;
+
if ((iocb->ki_opcode == IOCB_CMD_PREADV) ||
(iocb->ki_opcode == IOCB_CMD_PREAD)) {
rw_op = file->f_op->aio_read;
@@ -1379,8 +1386,9 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
}
/* This matches the pread()/pwrite() logic */
+ ret = -EINVAL;
if (iocb->ki_pos < 0)
- return -EINVAL;
+ goto out_unlock;
do {
ret = rw_op(iocb, &iocb->ki_iovec[iocb->ki_cur_seg],
@@ -1407,26 +1415,37 @@ static ssize_t aio_rw_vect_retry(struct kiocb *iocb)
&& iocb->ki_nbytes - iocb->ki_left)
ret = iocb->ki_nbytes - iocb->ki_left;
+out_unlock:
+ file_hotplug_read_unlock(file);
+out:
return ret;
}
static ssize_t aio_fdsync(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
+ ssize_t ret = -EIO;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 1);
+ if (file_hotplug_read_trylock(file)) {
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 1);
+ file_hotplug_read_unlock(file);
+ }
return ret;
}
static ssize_t aio_fsync(struct kiocb *iocb)
{
struct file *file = iocb->ki_filp;
- ssize_t ret = -EINVAL;
+ ssize_t ret = -EIO;
- if (file->f_op->aio_fsync)
- ret = file->f_op->aio_fsync(iocb, 0);
+ if (file_hotplug_read_trylock(file)) {
+ ret = -EINVAL;
+ if (file->f_op->aio_fsync)
+ ret = file->f_op->aio_fsync(iocb, 0);
+ file_hotplug_read_unlock(file);
+ }
return ret;
}
@@ -1469,7 +1488,11 @@ static ssize_t aio_setup_single_vector(struct kiocb *kiocb)
static ssize_t aio_setup_iocb(struct kiocb *kiocb)
{
struct file *file = kiocb->ki_filp;
- ssize_t ret = 0;
+ ssize_t ret;
+
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
switch (kiocb->ki_opcode) {
case IOCB_CMD_PREAD:
@@ -1551,10 +1574,12 @@ static ssize_t aio_setup_iocb(struct kiocb *kiocb)
ret = -EINVAL;
}
- if (!kiocb->ki_retry)
- return ret;
+ if (kiocb->ki_retry)
+ ret = 0;
- return 0;
+ file_hotplug_read_unlock(file);
+out:
+ return ret;
}
/*
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 21/23] vfs: Teach fsync to use file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (19 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 20/23] vfs: Teach aio to use file_hotplug_lock Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 22/23] vfs: Teach fadvice to file_hotplug_lock Eric W. Biederman
` (2 subsequent siblings)
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
fs/sync.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/fs/sync.c b/fs/sync.c
index e9d56f6..ac6da60 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -197,6 +197,9 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
* don't have a struct file available. Damn nfsd..
*/
if (file) {
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out;
mapping = file->f_mapping;
fop = file->f_op;
} else {
@@ -206,7 +209,7 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
if (!fop || !fop->fsync) {
ret = -EINVAL;
- goto out;
+ goto out_unlock;
}
ret = filemap_fdatawrite(mapping);
@@ -223,6 +226,10 @@ int vfs_fsync(struct file *file, struct dentry *dentry, int datasync)
err = filemap_fdatawait(mapping);
if (!ret)
ret = err;
+
+out_unlock:
+ if (file)
+ file_hotplug_read_unlock(file);
out:
return ret;
}
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 22/23] vfs: Teach fadvice to file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (20 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 21/23] vfs: Teach fsync " Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-01 21:50 ` [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock Eric W. Biederman
2009-06-06 8:03 ` [PATCH 0/23] File descriptor hot-unplug support v2 Al Viro
23 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
mm/fadvise.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/fadvise.c b/mm/fadvise.c
index 54a0f80..d7f1fba 100644
--- a/mm/fadvise.c
+++ b/mm/fadvise.c
@@ -38,6 +38,11 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
if (!file)
return -EBADF;
+ ret = -EIO;
+ if (!file_hotplug_read_trylock(file))
+ goto out_fput;
+
+ ret = 0;
if (S_ISFIFO(file->f_path.dentry->d_inode->i_mode)) {
ret = -ESPIPE;
goto out;
@@ -123,6 +128,8 @@ SYSCALL_DEFINE(fadvise64_64)(int fd, loff_t offset, loff_t len, int advice)
ret = -EINVAL;
}
out:
+ file_hotplug_read_unlock(file);
+out_fput:
fput(file);
return ret;
}
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (21 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 22/23] vfs: Teach fadvice to file_hotplug_lock Eric W. Biederman
@ 2009-06-01 21:50 ` Eric W. Biederman
2009-06-03 23:25 ` Badari Pulavarty
2009-06-06 8:03 ` [PATCH 0/23] File descriptor hot-unplug support v2 Al Viro
23 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-01 21:50 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
---
mm/filemap.c | 25 ++++++++++++++++---------
1 files changed, 16 insertions(+), 9 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 379ff0b..5016aa5 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count)
ret = -EBADF;
file = fget(fd);
- if (file) {
- if (file->f_mode & FMODE_READ) {
- struct address_space *mapping = file->f_mapping;
- pgoff_t start = offset >> PAGE_CACHE_SHIFT;
- pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
- unsigned long len = end - start + 1;
- ret = do_readahead(mapping, file, start, len);
- }
- fput(file);
+ if (!file)
+ goto out;
+
+ if (!(file->f_mode & FMODE_READ))
+ goto out_fput;
+
+ if (file_hotplug_read_trylock(file)) {
+ struct address_space *mapping = file->f_mapping;
+ pgoff_t start = offset >> PAGE_CACHE_SHIFT;
+ pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
+ unsigned long len = end - start + 1;
+ ret = do_readahead(mapping, file, start, len);
+ file_hotplug_read_unlock(file);
}
+out_fput:
+ fput(file);
+out:
return ret;
}
#ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
--
1.6.3.1.54.g99dd.dirty
^ permalink raw reply related [flat|nested] 99+ messages in thread
* Re: [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock
2009-06-01 21:50 ` [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock Eric W. Biederman
@ 2009-06-03 23:25 ` Badari Pulavarty
0 siblings, 0 replies; 99+ messages in thread
From: Badari Pulavarty @ 2009-06-03 23:25 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Al Viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
Hugh Dickins, Tejun Heo, Alexey Dobriyan, Linus Torvalds,
Alan Cox, Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig, Eric W. Biederman, Eric W. Biederman
On Mon, 2009-06-01 at 14:50 -0700, Eric W. Biederman wrote:
> From: Eric W. Biederman <ebiederm@maxwell.aristanetworks.com>
>
> Signed-off-by: Eric W. Biederman <ebiederm@aristanetworks.com>
> ---
> mm/filemap.c | 25 ++++++++++++++++---------
> 1 files changed, 16 insertions(+), 9 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 379ff0b..5016aa5 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1402,16 +1402,23 @@ SYSCALL_DEFINE(readahead)(int fd, loff_t offset, size_t count)
>
> ret = -EBADF;
> file = fget(fd);
> - if (file) {
> - if (file->f_mode & FMODE_READ) {
> - struct address_space *mapping = file->f_mapping;
> - pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> - pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
> - unsigned long len = end - start + 1;
> - ret = do_readahead(mapping, file, start, len);
> - }
> - fput(file);
> + if (!file)
> + goto out;
> +
> + if (!(file->f_mode & FMODE_READ))
> + goto out_fput;
> +
To be consistent with others, don't you want to do
ret = -EIO;
here ?
> + if (file_hotplug_read_trylock(file)) {
> + struct address_space *mapping = file->f_mapping;
> + pgoff_t start = offset >> PAGE_CACHE_SHIFT;
> + pgoff_t end = (offset + count - 1) >> PAGE_CACHE_SHIFT;
> + unsigned long len = end - start + 1;
> + ret = do_readahead(mapping, file, start, len);
> + file_hotplug_read_unlock(file);
> }
> +out_fput:
> + fput(file);
> +out:
> return ret;
> }
> #ifdef CONFIG_HAVE_SYSCALL_WRAPPERS
Thanks,
Badari
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-01 21:45 ` [PATCH 0/23] File descriptor hot-unplug support v2 Eric W. Biederman
` (22 preceding siblings ...)
2009-06-01 21:50 ` [PATCH 23/23] vfs: Teach readahead to use the file_hotplug_lock Eric W. Biederman
@ 2009-06-06 8:03 ` Al Viro
2009-06-08 9:41 ` Miklos Szeredi
2009-06-09 6:22 ` Eric W. Biederman
23 siblings, 2 replies; 99+ messages in thread
From: Al Viro @ 2009-06-06 8:03 UTC (permalink / raw)
To: Eric W. Biederman
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig
On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote:
>
> I found myself looking at the uio, seeing that it does not support pci
> hot-unplug, and thinking "Great yet another implementation of
> hotunplug logic that needs to be added".
>
> I decided to see what it would take to add a generic implementation of
> the code we have for supporting hot unplugging devices in sysfs, proc,
> sysctl, tty_io, and now almost in the tun driver.
>
> Not long after I touched the tun driver and made it safe to delete the
> network device while still holding it's file descriptor open I someone
> else touch the code adding a different feature and my careful work
> went up in flames. Which brought home another point at the best of it
> this is ultimately complex tricky code that subsystems should not need
> to worry about.
>
> What makes this even more interesting is that in the presence of pci
> hot-unplug it looks like most subsystems and most devices will have to
> deal with the issue one way or another.
>
> This infrastructure could also be used to implement both force
> unmounts and sys_revoke. When I could not think of a better name for
> I have drawn on that and used revoke.
To be honest, the longer I'm looking at it, the less I like the approach...
It really looks as if we'd be much better off with functionality sitting
in a set of library helpers to be used by instances that need this stuff.
Do we really want it for generic case?
Note that "we might someday implement real force-umount" doesn't count;
the same kind of arguments had been given nine years ago in case of AIO
("oh, sure, we'll eventually cover foo_get_block() too - it will all be
a state machine, fully asynchronous; whaddya mean 'it's not feasible'?").
Of course, it was _not_ feasible and had never been implemented.
Frankly, I very much suspect that force-umount is another case like that;
we'll need a *lot* of interesting cooperation from fs for that to work and
to be useful. I'd be delighted to be proven incorrect on that one, so
if you have anything serious in that direction, please share the details.
As for the patchset in the current form... Could you explain what's to prevent
POSIX locks and dnotify entries from outliving a struct file you'd revoked,
seeing that filp_close() will skip killing them in that case.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-06 8:03 ` [PATCH 0/23] File descriptor hot-unplug support v2 Al Viro
@ 2009-06-08 9:41 ` Miklos Szeredi
2009-06-08 10:24 ` Jamie Lokier
2009-06-08 16:29 ` Al Viro
2009-06-09 6:22 ` Eric W. Biederman
1 sibling, 2 replies; 99+ messages in thread
From: Miklos Szeredi @ 2009-06-08 9:41 UTC (permalink / raw)
To: viro
Cc: ebiederm, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh,
tj, adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch
On Sat, 6 Jun 2009, Al Viro wrote:
> Frankly, I very much suspect that force-umount is another case like that;
> we'll need a *lot* of interesting cooperation from fs for that to work and
> to be useful. I'd be delighted to be proven incorrect on that one, so
> if you have anything serious in that direction, please share the details.
Umm, not sure why we'd need cooperation from the fs. Simply wait for
the operation to exit the filesystem or driver. If it's a blocking
operation, send a signal to interrupt it.
Sure, filesystems and drivers have lots of state, but we don't need to
care about that, just like we don't need to care about it for
remounting read-only.
Thanks,
Miklos
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 9:41 ` Miklos Szeredi
@ 2009-06-08 10:24 ` Jamie Lokier
2009-06-08 16:29 ` Al Viro
1 sibling, 0 replies; 99+ messages in thread
From: Jamie Lokier @ 2009-06-08 10:24 UTC (permalink / raw)
To: Miklos Szeredi
Cc: viro, ebiederm, linux-kernel, linux-pci, linux-mm, linux-fsdevel,
hugh, tj, adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch
Miklos Szeredi wrote:
> On Sat, 6 Jun 2009, Al Viro wrote:
> > Frankly, I very much suspect that force-umount is another case like that;
> > we'll need a *lot* of interesting cooperation from fs for that to work and
> > to be useful. I'd be delighted to be proven incorrect on that one, so
> > if you have anything serious in that direction, please share the details.
>
> Umm, not sure why we'd need cooperation from the fs. Simply wait for
> the operation to exit the filesystem or driver. If it's a blocking
> operation, send a signal to interrupt it.
We could even include the internal signal in TASK_KILLABLE, so it
interrupts otherwise uninterruptible operations.
-- Jamie
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 9:41 ` Miklos Szeredi
2009-06-08 10:24 ` Jamie Lokier
@ 2009-06-08 16:29 ` Al Viro
2009-06-08 16:44 ` Miklos Szeredi
1 sibling, 1 reply; 99+ messages in thread
From: Al Viro @ 2009-06-08 16:29 UTC (permalink / raw)
To: Miklos Szeredi
Cc: ebiederm, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh,
tj, adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch
On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote:
> On Sat, 6 Jun 2009, Al Viro wrote:
> > Frankly, I very much suspect that force-umount is another case like that;
> > we'll need a *lot* of interesting cooperation from fs for that to work and
> > to be useful. I'd be delighted to be proven incorrect on that one, so
> > if you have anything serious in that direction, please share the details.
>
> Umm, not sure why we'd need cooperation from the fs. Simply wait for
> the operation to exit the filesystem or driver. If it's a blocking
> operation, send a signal to interrupt it.
And making sure that operations *are* interruptible (and that we can cope
with $BIGNUM new failure exits correctly) does not qualify as cooperation?
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 16:29 ` Al Viro
@ 2009-06-08 16:44 ` Miklos Szeredi
2009-06-08 17:50 ` Al Viro
0 siblings, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2009-06-08 16:44 UTC (permalink / raw)
To: viro
Cc: miklos, ebiederm, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, torvalds, alan, gregkh,
npiggin, akpm, hch
On Mon, 8 Jun 2009, Al Viro wrote:
> On Mon, Jun 08, 2009 at 11:41:19AM +0200, Miklos Szeredi wrote:
> > On Sat, 6 Jun 2009, Al Viro wrote:
> > > Frankly, I very much suspect that force-umount is another case like that;
> > > we'll need a *lot* of interesting cooperation from fs for that to work and
> > > to be useful. I'd be delighted to be proven incorrect on that one, so
> > > if you have anything serious in that direction, please share the details.
> >
> > Umm, not sure why we'd need cooperation from the fs. Simply wait for
> > the operation to exit the filesystem or driver. If it's a blocking
> > operation, send a signal to interrupt it.
>
> And making sure that operations *are* interruptible (and that we can cope
> with $BIGNUM new failure exits correctly) does not qualify as cooperation?
I'm still not getting what the problem is. AFAICS file operations are
either
a) non-interruptible but finish within a short time or
b) may block indefinitely but are interruptible (or at least killable).
Anything else is already problematic, resulting in processes "stuck in
D state".
Can you give a more concrete example about your worries?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 16:44 ` Miklos Szeredi
@ 2009-06-08 17:50 ` Al Viro
2009-06-08 18:01 ` Linus Torvalds
2009-06-09 5:50 ` Miklos Szeredi
0 siblings, 2 replies; 99+ messages in thread
From: Al Viro @ 2009-06-08 17:50 UTC (permalink / raw)
To: Miklos Szeredi
Cc: ebiederm, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh,
tj, adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch
On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:
> I'm still not getting what the problem is. AFAICS file operations are
> either
>
> a) non-interruptible but finish within a short time or
> b) may block indefinitely but are interruptible (or at least killable).
>
> Anything else is already problematic, resulting in processes "stuck in
> D state".
Welcome to reality...
* bread() is non-interruptible
* so's copy_from_user()/copy_to_user()
* IO we are stuck upon _might_ be interruptible, but by sending a signal
to some other process
... just for starters. If you sign up for auditing the tree to eliminate
"something's stuck in D state", you are welcome to it. Mind you, you'll
have to audit filesystems for "doesn't check if metadata IO has failed"
first, but that _really_ needs to be done anyway. On the ongoing basis.
Drivers, of course, are even more interesting - looking through foo_ioctl()
instances is a wonderful way to lower pH in stomach, but that's on the
"we want revoke()" side of it.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 17:50 ` Al Viro
@ 2009-06-08 18:01 ` Linus Torvalds
2009-06-08 18:50 ` Al Viro
2009-06-09 5:50 ` Miklos Szeredi
1 sibling, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2009-06-08 18:01 UTC (permalink / raw)
To: Al Viro
Cc: Miklos Szeredi, ebiederm, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, alan, gregkh, npiggin, akpm,
hch
On Mon, 8 Jun 2009, Al Viro wrote:
>
> Welcome to reality...
>
> * bread() is non-interruptible
> * so's copy_from_user()/copy_to_user()
> * IO we are stuck upon _might_ be interruptible, but by sending a signal
> to some other process
We can probably improve on these, though.
Like the copy_to/from_user thing. We might well be able to do that whole
"if it's a fatal signal, return early" thing.
So in the _general_ case - no, we probably can't fix things. But we could
likely at least improve in some common cases if we cared.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 18:01 ` Linus Torvalds
@ 2009-06-08 18:50 ` Al Viro
2009-06-08 19:18 ` Linus Torvalds
0 siblings, 1 reply; 99+ messages in thread
From: Al Viro @ 2009-06-08 18:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Miklos Szeredi, ebiederm, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, alan, gregkh, npiggin, akpm,
hch
On Mon, Jun 08, 2009 at 11:01:51AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 8 Jun 2009, Al Viro wrote:
> >
> > Welcome to reality...
> >
> > * bread() is non-interruptible
> > * so's copy_from_user()/copy_to_user()
> > * IO we are stuck upon _might_ be interruptible, but by sending a signal
> > to some other process
>
> We can probably improve on these, though.
>
> Like the copy_to/from_user thing. We might well be able to do that whole
> "if it's a fatal signal, return early" thing.
>
> So in the _general_ case - no, we probably can't fix things. But we could
> likely at least improve in some common cases if we cared.
Sure, even though I'm not at all certain that copy_from_user() is that easy.
We can make locking current->mm in there interruptible, all right, but that's
only a part of the answer - even aside of the allocations, we'd need vma
->fault() interruptible as well, which leads to interruptible instances of
->readpage(), with all the fun _that_ would be.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 18:50 ` Al Viro
@ 2009-06-08 19:18 ` Linus Torvalds
2009-06-09 6:42 ` Eric W. Biederman
0 siblings, 1 reply; 99+ messages in thread
From: Linus Torvalds @ 2009-06-08 19:18 UTC (permalink / raw)
To: Al Viro
Cc: Miklos Szeredi, ebiederm, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, alan, gregkh, npiggin, akpm,
hch
On Mon, 8 Jun 2009, Al Viro wrote:
>
> Sure, even though I'm not at all certain that copy_from_user() is that easy.
> We can make locking current->mm in there interruptible, all right, but that's
> only a part of the answer - even aside of the allocations, we'd need vma
> ->fault() interruptible as well, which leads to interruptible instances of
> ->readpage(), with all the fun _that_ would be.
We already have all that - the NFS people wanted it.
More importantly, you don't actually need to interrupt readpage itself -
you just need to stop _waiting_ on it. So in your fault handler, just stop
waiting, and instead just return FAULT_RETRY or whatever.
Linus
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 19:18 ` Linus Torvalds
@ 2009-06-09 6:42 ` Eric W. Biederman
2009-06-09 10:52 ` Nick Piggin
0 siblings, 1 reply; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-09 6:42 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Miklos Szeredi, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, alan, gregkh, npiggin, akpm,
hch
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Mon, 8 Jun 2009, Al Viro wrote:
>>
>> Sure, even though I'm not at all certain that copy_from_user() is that easy.
>> We can make locking current->mm in there interruptible, all right, but that's
>> only a part of the answer - even aside of the allocations, we'd need vma
>> ->fault() interruptible as well, which leads to interruptible instances of
>> ->readpage(), with all the fun _that_ would be.
>
> We already have all that - the NFS people wanted it.
>
> More importantly, you don't actually need to interrupt readpage itself -
> you just need to stop _waiting_ on it. So in your fault handler, just stop
> waiting, and instead just return FAULT_RETRY or whatever.
That sounds doable. Has that code been merged yet?
I took a quick look and it didn't see anyone breaking out of page fault with a
signal or code to really handle that.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-09 6:42 ` Eric W. Biederman
@ 2009-06-09 10:52 ` Nick Piggin
0 siblings, 0 replies; 99+ messages in thread
From: Nick Piggin @ 2009-06-09 10:52 UTC (permalink / raw)
To: Eric W. Biederman
Cc: Linus Torvalds, Al Viro, Miklos Szeredi, linux-kernel, linux-pci,
linux-mm, linux-fsdevel, hugh, tj, adobriyan, alan, gregkh, akpm,
hch
On Mon, Jun 08, 2009 at 11:42:53PM -0700, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > On Mon, 8 Jun 2009, Al Viro wrote:
> >>
> >> Sure, even though I'm not at all certain that copy_from_user() is that easy.
> >> We can make locking current->mm in there interruptible, all right, but that's
> >> only a part of the answer - even aside of the allocations, we'd need vma
> >> ->fault() interruptible as well, which leads to interruptible instances of
> >> ->readpage(), with all the fun _that_ would be.
> >
> > We already have all that - the NFS people wanted it.
> >
> > More importantly, you don't actually need to interrupt readpage itself -
> > you just need to stop _waiting_ on it. So in your fault handler, just stop
> > waiting, and instead just return FAULT_RETRY or whatever.
>
> That sounds doable. Has that code been merged yet?
>
> I took a quick look and it didn't see anyone breaking out of page fault with a
> signal or code to really handle that.
The problem is get_user_pages I think. Now that we have a good number of
fault flags, we can pass down whether the caller is able to be
interrupted or not.
Ben H had some interest in doing this, but I don't know how far he got
with it.
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-08 17:50 ` Al Viro
2009-06-08 18:01 ` Linus Torvalds
@ 2009-06-09 5:50 ` Miklos Szeredi
2009-06-09 6:31 ` Eric W. Biederman
1 sibling, 1 reply; 99+ messages in thread
From: Miklos Szeredi @ 2009-06-09 5:50 UTC (permalink / raw)
To: viro
Cc: miklos, ebiederm, linux-kernel, linux-pci, linux-mm,
linux-fsdevel, hugh, tj, adobriyan, torvalds, alan, gregkh,
npiggin, akpm, hch
On Mon, 8 Jun 2009, Al Viro wrote:
> On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:
>
> > I'm still not getting what the problem is. AFAICS file operations are
> > either
> >
> > a) non-interruptible but finish within a short time or
> > b) may block indefinitely but are interruptible (or at least killable).
> >
> > Anything else is already problematic, resulting in processes "stuck in
> > D state".
>
> Welcome to reality...
>
> * bread() is non-interruptible
> * so's copy_from_user()/copy_to_user()
And why should revoke(2) care? Just wait for the damn thing to
finish. Why exactly do these need to be interruptible?
Okay, if we want revoke or umount -f to be instantaneous then all that
needs to be taken care of. But does it *need* to be?
My idea of revoke is something like below:
- make sure no new operations are started on the file
- check state of tasks for ongoing operations, if interruptible send signal
- wait for all pending operations to finish
- kill file
Thanks,
Miklos
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-09 5:50 ` Miklos Szeredi
@ 2009-06-09 6:31 ` Eric W. Biederman
0 siblings, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-09 6:31 UTC (permalink / raw)
To: Miklos Szeredi
Cc: viro, linux-kernel, linux-pci, linux-mm, linux-fsdevel, hugh, tj,
adobriyan, torvalds, alan, gregkh, npiggin, akpm, hch
Miklos Szeredi <miklos@szeredi.hu> writes:
> On Mon, 8 Jun 2009, Al Viro wrote:
>> On Mon, Jun 08, 2009 at 06:44:41PM +0200, Miklos Szeredi wrote:
>>
>> > I'm still not getting what the problem is. AFAICS file operations are
>> > either
>> >
>> > a) non-interruptible but finish within a short time or
>> > b) may block indefinitely but are interruptible (or at least killable).
>> >
>> > Anything else is already problematic, resulting in processes "stuck in
>> > D state".
>>
>> Welcome to reality...
>>
>> * bread() is non-interruptible
>> * so's copy_from_user()/copy_to_user()
>
> And why should revoke(2) care? Just wait for the damn thing to
> finish. Why exactly do these need to be interruptible?
Agreed. I expect the data size is going to be a page or less. Which
is at most 64K on some weird architectures. I think that counts as a
short time waiting for disk I/O. Baring thrashing.
> Okay, if we want revoke or umount -f to be instantaneous then all that
> needs to be taken care of. But does it *need* to be?
Good question. I wonder what umount -f needs when we yank out a usb drive.
> My idea of revoke is something like below:
>
> - make sure no new operations are started on the file
> - check state of tasks for ongoing operations, if interruptible send signal
Figuring out who to send a signal to is tricky. Still it should be doable
in the common case.
> - wait for all pending operations to finish
> - kill file
Eric
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 99+ messages in thread
* Re: [PATCH 0/23] File descriptor hot-unplug support v2
2009-06-06 8:03 ` [PATCH 0/23] File descriptor hot-unplug support v2 Al Viro
2009-06-08 9:41 ` Miklos Szeredi
@ 2009-06-09 6:22 ` Eric W. Biederman
1 sibling, 0 replies; 99+ messages in thread
From: Eric W. Biederman @ 2009-06-09 6:22 UTC (permalink / raw)
To: Al Viro
Cc: linux-kernel, linux-pci, linux-mm, linux-fsdevel, Hugh Dickins,
Tejun Heo, Alexey Dobriyan, Linus Torvalds, Alan Cox,
Greg Kroah-Hartman, Nick Piggin, Andrew Morton,
Christoph Hellwig
Al Viro <viro@ZenIV.linux.org.uk> writes:
> On Mon, Jun 01, 2009 at 02:45:17PM -0700, Eric W. Biederman wrote:
>>
>> I found myself looking at the uio, seeing that it does not support pci
>> hot-unplug, and thinking "Great yet another implementation of
>> hotunplug logic that needs to be added".
>>
>> I decided to see what it would take to add a generic implementation of
>> the code we have for supporting hot unplugging devices in sysfs, proc,
>> sysctl, tty_io, and now almost in the tun driver.
>>
>> Not long after I touched the tun driver and made it safe to delete the
>> network device while still holding it's file descriptor open I someone
>> else touch the code adding a different feature and my careful work
>> went up in flames. Which brought home another point at the best of it
>> this is ultimately complex tricky code that subsystems should not need
>> to worry about.
>>
>> What makes this even more interesting is that in the presence of pci
>> hot-unplug it looks like most subsystems and most devices will have to
>> deal with the issue one way or another.
>>
>> This infrastructure could also be used to implement both force
>> unmounts and sys_revoke. When I could not think of a better name for
>> I have drawn on that and used revoke.
>
> To be honest, the longer I'm looking at it, the less I like the approach...
> It really looks as if we'd be much better off with functionality sitting
> in a set of library helpers to be used by instances that need this stuff.
> Do we really want it for generic case?
I think so. I do know I have seen enough weird cases actually being
used and not being done correctly we want a clean pattern for handling
the general case that works and is complete.
The problem seems to break up into several pieces.
- unmap support.
- Getting a list of the files that are open for an inode.
- Waking up interruptible sleepers.
- A test to see if we are executing any of the functions in
the file_operations structure. (needed before we can free state)
- Calling frelease and generally releasing of the state held by the
file.
It might be possible to solve the entire problem outside of the vfs
> Note that "we might someday implement real force-umount" doesn't count;
> the same kind of arguments had been given nine years ago in case of AIO
> ("oh, sure, we'll eventually cover foo_get_block() too - it will all be
> a state machine, fully asynchronous; whaddya mean 'it's not feasible'?").
> Of course, it was _not_ feasible and had never been implemented.
> Frankly, I very much suspect that force-umount is another case like that;
> we'll need a *lot* of interesting cooperation from fs for that to work and
> to be useful. I'd be delighted to be proven incorrect on that one, so
> if you have anything serious in that direction, please share the details.
So far nothing but thought experiments, but you have a good point at
least a proof of concept should be done of the various pieces. To flush
out some niggling little detail that messes up the design.
So I hereby sign up for writing a sys_revoke patch, a forced umount patch
and a writing a patch to ext2 to support it. Supporting proc and
sysfs while easy is not really the common case of an nfs exportable block
filesystem so it is not complete.
> As for the patchset in the current form... Could you explain what's to prevent
> POSIX locks and dnotify entries from outliving a struct file you'd revoked,
> seeing that filp_close() will skip killing them in that case.
Good catch that looks like a big fat bug to me. It seems I overlooked
the fact that we actually free things in filp_close.
Given that posix_remove_file calls vfs_lock_file which calls
file->f_op->lock it looks like something really needs to be done here.
dnotify_flush doesn't look to hard to spin a special case for revoke.
I am going to have to spend I while longer studying the rest of the
code in filp_close. I hope I don't need to figure out the various
fl_owner_t values to safely revoke a file, but it looks like I might.
Eric
^ permalink raw reply [flat|nested] 99+ messages in thread