linux-arch.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* remove the last set_fs() in common code, and remove it for x86 and powerpc v3
@ 2020-09-03 14:22 Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 01/14] proc: remove a level of indentation in proc_get_inode Christoph Hellwig
                   ` (16 more replies)
  0 siblings, 17 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Hi all,

this series removes the last set_fs() used to force a kernel address
space for the uaccess code in the kernel read/write/splice code, and then
stops implementing the address space overrides entirely for x86 and
powerpc.

[Note to Linus: I'd like to get this into linux-next rather earlier
than later.  Do you think it is ok to add this tree to linux-next?]

The file system part has been posted a few times, and the read/write side
has been pretty much unchanced.  For splice this series drops the
conversion of the seq_file and sysctl code to the iter ops, and thus loses
the splice support for them.  The reasons for that is that it caused a lot
of churn for not much use - splice for these small files really isn't much
of a win, even if existing userspace uses it.  All callers I found do the
proper fallback, but if this turns out to be an issue the conversion can
be resurrected.

Besides x86 and powerpc I plan to eventually convert all other
architectures, although this will be a slow process, starting with the
easier ones once the infrastructure is merged.  The process to convert
architectures is roughtly:

 (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
 (2) implement __get_kernel_nofault and __put_kernel_nofault
 (3) remove the arch specific address limitation functionality

Changes since v2:
 - add back the patch to support splice through read_iter/write iter
   on /proc/sys/*
 - entirely remove the tests that depend on set_fs.  Note that for
   lkdtm the maintainer (Kees) disagrees with this request from Linus
 - fix a wrong check in the powerpc access_ok, and drop a few spurious
   cleanups there

Changes since v1:
 - drop the patch to remove the non-iter ops for /dev/zero and
   /dev/null as they caused a performance regression
 - don't enable user access in __get_kernel on powerpc
 - xfail the set_fs() based lkdtm tests

Diffstat:

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH 01/14] proc: remove a level of indentation in proc_get_inode
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 02/14] proc: cleanup the compat vs no compat file ops Christoph Hellwig
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Just return early on inode allocation failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/proc/inode.c | 72 +++++++++++++++++++++++++------------------------
 1 file changed, 37 insertions(+), 35 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 28d6105e908e4c..016b1302cbabc0 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -619,42 +619,44 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 {
 	struct inode *inode = new_inode(sb);
 
-	if (inode) {
-		inode->i_ino = de->low_ino;
-		inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
-		PROC_I(inode)->pde = de;
-
-		if (is_empty_pde(de)) {
-			make_empty_dir_inode(inode);
-			return inode;
-		}
-		if (de->mode) {
-			inode->i_mode = de->mode;
-			inode->i_uid = de->uid;
-			inode->i_gid = de->gid;
-		}
-		if (de->size)
-			inode->i_size = de->size;
-		if (de->nlink)
-			set_nlink(inode, de->nlink);
-
-		if (S_ISREG(inode->i_mode)) {
-			inode->i_op = de->proc_iops;
-			inode->i_fop = &proc_reg_file_ops;
+	if (!inode) {
+		pde_put(de);
+		return NULL;
+	}
+
+	inode->i_ino = de->low_ino;
+	inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+	PROC_I(inode)->pde = de;
+	if (is_empty_pde(de)) {
+		make_empty_dir_inode(inode);
+		return inode;
+	}
+
+	if (de->mode) {
+		inode->i_mode = de->mode;
+		inode->i_uid = de->uid;
+		inode->i_gid = de->gid;
+	}
+	if (de->size)
+		inode->i_size = de->size;
+	if (de->nlink)
+		set_nlink(inode, de->nlink);
+
+	if (S_ISREG(inode->i_mode)) {
+		inode->i_op = de->proc_iops;
+		inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-			if (!de->proc_ops->proc_compat_ioctl) {
-				inode->i_fop = &proc_reg_file_ops_no_compat;
-			}
+		if (!de->proc_ops->proc_compat_ioctl)
+			inode->i_fop = &proc_reg_file_ops_no_compat;
 #endif
-		} else if (S_ISDIR(inode->i_mode)) {
-			inode->i_op = de->proc_iops;
-			inode->i_fop = de->proc_dir_ops;
-		} else if (S_ISLNK(inode->i_mode)) {
-			inode->i_op = de->proc_iops;
-			inode->i_fop = NULL;
-		} else
-			BUG();
-	} else
-	       pde_put(de);
+	} else if (S_ISDIR(inode->i_mode)) {
+		inode->i_op = de->proc_iops;
+		inode->i_fop = de->proc_dir_ops;
+	} else if (S_ISLNK(inode->i_mode)) {
+		inode->i_op = de->proc_iops;
+		inode->i_fop = NULL;
+	} else {
+		BUG();
+	}
 	return inode;
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 02/14] proc: cleanup the compat vs no compat file ops
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 01/14] proc: remove a level of indentation in proc_get_inode Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 03/14] proc: add a read_iter method to proc proc_ops Christoph Hellwig
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Instead of providing a special no-compat version provide a special
compat version for operations with ->compat_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/proc/inode.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 016b1302cbabc0..93dd2045737504 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -572,9 +572,6 @@ static const struct file_operations proc_reg_file_ops = {
 	.write		= proc_reg_write,
 	.poll		= proc_reg_poll,
 	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
-#ifdef CONFIG_COMPAT
-	.compat_ioctl	= proc_reg_compat_ioctl,
-#endif
 	.mmap		= proc_reg_mmap,
 	.get_unmapped_area = proc_reg_get_unmapped_area,
 	.open		= proc_reg_open,
@@ -582,12 +579,13 @@ static const struct file_operations proc_reg_file_ops = {
 };
 
 #ifdef CONFIG_COMPAT
-static const struct file_operations proc_reg_file_ops_no_compat = {
+static const struct file_operations proc_reg_file_ops_compat = {
 	.llseek		= proc_reg_llseek,
 	.read		= proc_reg_read,
 	.write		= proc_reg_write,
 	.poll		= proc_reg_poll,
 	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
+	.compat_ioctl	= proc_reg_compat_ioctl,
 	.mmap		= proc_reg_mmap,
 	.get_unmapped_area = proc_reg_get_unmapped_area,
 	.open		= proc_reg_open,
@@ -646,8 +644,8 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 		inode->i_op = de->proc_iops;
 		inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-		if (!de->proc_ops->proc_compat_ioctl)
-			inode->i_fop = &proc_reg_file_ops_no_compat;
+		if (de->proc_ops->proc_compat_ioctl)
+			inode->i_fop = &proc_reg_file_ops_compat;
 #endif
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = de->proc_iops;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 03/14] proc: add a read_iter method to proc proc_ops
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 01/14] proc: remove a level of indentation in proc_get_inode Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 02/14] proc: cleanup the compat vs no compat file ops Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 04/14] sysctl: Convert to iter interfaces Christoph Hellwig
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

This will allow proc files to implement iter read semantics.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/proc/inode.c         | 53 ++++++++++++++++++++++++++++++++++++++---
 include/linux/proc_fs.h |  1 +
 2 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 93dd2045737504..58c075e2a452d6 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -297,6 +297,21 @@ static loff_t proc_reg_llseek(struct file *file, loff_t offset, int whence)
 	return rv;
 }
 
+static ssize_t proc_reg_read_iter(struct kiocb *iocb, struct iov_iter *iter)
+{
+	struct proc_dir_entry *pde = PDE(file_inode(iocb->ki_filp));
+	ssize_t ret;
+
+	if (pde_is_permanent(pde))
+		return pde->proc_ops->proc_read_iter(iocb, iter);
+
+	if (!use_pde(pde))
+		return -EIO;
+	ret = pde->proc_ops->proc_read_iter(iocb, iter);
+	unuse_pde(pde);
+	return ret;
+}
+
 static ssize_t pde_read(struct proc_dir_entry *pde, struct file *file, char __user *buf, size_t count, loff_t *ppos)
 {
 	typeof_member(struct proc_ops, proc_read) read;
@@ -578,6 +593,18 @@ static const struct file_operations proc_reg_file_ops = {
 	.release	= proc_reg_release,
 };
 
+static const struct file_operations proc_iter_file_ops = {
+	.llseek		= proc_reg_llseek,
+	.read_iter	= proc_reg_read_iter,
+	.write		= proc_reg_write,
+	.poll		= proc_reg_poll,
+	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
+	.mmap		= proc_reg_mmap,
+	.get_unmapped_area = proc_reg_get_unmapped_area,
+	.open		= proc_reg_open,
+	.release	= proc_reg_release,
+};
+
 #ifdef CONFIG_COMPAT
 static const struct file_operations proc_reg_file_ops_compat = {
 	.llseek		= proc_reg_llseek,
@@ -591,6 +618,19 @@ static const struct file_operations proc_reg_file_ops_compat = {
 	.open		= proc_reg_open,
 	.release	= proc_reg_release,
 };
+
+static const struct file_operations proc_iter_file_ops_compat = {
+	.llseek		= proc_reg_llseek,
+	.read_iter	= proc_reg_read_iter,
+	.write		= proc_reg_write,
+	.poll		= proc_reg_poll,
+	.unlocked_ioctl	= proc_reg_unlocked_ioctl,
+	.compat_ioctl	= proc_reg_compat_ioctl,
+	.mmap		= proc_reg_mmap,
+	.get_unmapped_area = proc_reg_get_unmapped_area,
+	.open		= proc_reg_open,
+	.release	= proc_reg_release,
+};
 #endif
 
 static void proc_put_link(void *p)
@@ -642,10 +682,17 @@ struct inode *proc_get_inode(struct super_block *sb, struct proc_dir_entry *de)
 
 	if (S_ISREG(inode->i_mode)) {
 		inode->i_op = de->proc_iops;
-		inode->i_fop = &proc_reg_file_ops;
+		if (de->proc_ops->proc_read_iter)
+			inode->i_fop = &proc_iter_file_ops;
+		else
+			inode->i_fop = &proc_reg_file_ops;
 #ifdef CONFIG_COMPAT
-		if (de->proc_ops->proc_compat_ioctl)
-			inode->i_fop = &proc_reg_file_ops_compat;
+		if (de->proc_ops->proc_compat_ioctl) {
+			if (de->proc_ops->proc_read_iter)
+				inode->i_fop = &proc_iter_file_ops_compat;
+			else
+				inode->i_fop = &proc_reg_file_ops_compat;
+		}
 #endif
 	} else if (S_ISDIR(inode->i_mode)) {
 		inode->i_op = de->proc_iops;
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 2df965cd09742d..270cab43ca3dad 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -30,6 +30,7 @@ struct proc_ops {
 	unsigned int proc_flags;
 	int	(*proc_open)(struct inode *, struct file *);
 	ssize_t	(*proc_read)(struct file *, char __user *, size_t, loff_t *);
+	ssize_t (*proc_read_iter)(struct kiocb *, struct iov_iter *);
 	ssize_t	(*proc_write)(struct file *, const char __user *, size_t, loff_t *);
 	loff_t	(*proc_lseek)(struct file *, loff_t, int);
 	int	(*proc_release)(struct inode *, struct file *);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 04/14] sysctl: Convert to iter interfaces
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 03/14] proc: add a read_iter method to proc proc_ops Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops Christoph Hellwig
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev, Matthew Wilcox (Oracle)

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Using the read_iter/write_iter interfaces allows for in-kernel users
to set sysctls without using set_fs().  Also, the buffer is a string,
so give it the real type of 'char *', not void *.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/proc/proc_sysctl.c      | 46 ++++++++++++++++++--------------------
 include/linux/bpf-cgroup.h |  2 +-
 kernel/bpf/cgroup.c        |  2 +-
 3 files changed, 24 insertions(+), 26 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index 6c1166ccdaea57..a4a3122f8a584a 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -12,6 +12,7 @@
 #include <linux/cred.h>
 #include <linux/namei.h>
 #include <linux/mm.h>
+#include <linux/uio.h>
 #include <linux/module.h>
 #include <linux/bpf-cgroup.h>
 #include <linux/mount.h>
@@ -540,13 +541,14 @@ static struct dentry *proc_sys_lookup(struct inode *dir, struct dentry *dentry,
 	return err;
 }
 
-static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
-		size_t count, loff_t *ppos, int write)
+static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
+		int write)
 {
-	struct inode *inode = file_inode(filp);
+	struct inode *inode = file_inode(iocb->ki_filp);
 	struct ctl_table_header *head = grab_header(inode);
 	struct ctl_table *table = PROC_I(inode)->sysctl_entry;
-	void *kbuf;
+	size_t count = iov_iter_count(iter);
+	char *kbuf;
 	ssize_t error;
 
 	if (IS_ERR(head))
@@ -569,32 +571,30 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
 	error = -ENOMEM;
 	if (count >= KMALLOC_MAX_SIZE)
 		goto out;
+	kbuf = kzalloc(count + 1, GFP_KERNEL);
+	if (!kbuf)
+		goto out;
 
 	if (write) {
-		kbuf = memdup_user_nul(ubuf, count);
-		if (IS_ERR(kbuf)) {
-			error = PTR_ERR(kbuf);
-			goto out;
-		}
-	} else {
-		kbuf = kzalloc(count, GFP_KERNEL);
-		if (!kbuf)
-			goto out;
+		error = -EFAULT;
+		if (!copy_from_iter_full(kbuf, count, iter))
+			goto out_free_buf;
+		kbuf[count] = '\0';
 	}
 
 	error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
-					   ppos);
+					   &iocb->ki_pos);
 	if (error)
 		goto out_free_buf;
 
 	/* careful: calling conventions are nasty here */
-	error = table->proc_handler(table, write, kbuf, &count, ppos);
+	error = table->proc_handler(table, write, kbuf, &count, &iocb->ki_pos);
 	if (error)
 		goto out_free_buf;
 
 	if (!write) {
 		error = -EFAULT;
-		if (copy_to_user(ubuf, kbuf, count))
+		if (copy_to_iter(kbuf, count, iter) < count)
 			goto out_free_buf;
 	}
 
@@ -607,16 +607,14 @@ static ssize_t proc_sys_call_handler(struct file *filp, void __user *ubuf,
 	return error;
 }
 
-static ssize_t proc_sys_read(struct file *filp, char __user *buf,
-				size_t count, loff_t *ppos)
+static ssize_t proc_sys_read(struct kiocb *iocb, struct iov_iter *iter)
 {
-	return proc_sys_call_handler(filp, (void __user *)buf, count, ppos, 0);
+	return proc_sys_call_handler(iocb, iter, 0);
 }
 
-static ssize_t proc_sys_write(struct file *filp, const char __user *buf,
-				size_t count, loff_t *ppos)
+static ssize_t proc_sys_write(struct kiocb *iocb, struct iov_iter *iter)
 {
-	return proc_sys_call_handler(filp, (void __user *)buf, count, ppos, 1);
+	return proc_sys_call_handler(iocb, iter, 1);
 }
 
 static int proc_sys_open(struct inode *inode, struct file *filp)
@@ -853,8 +851,8 @@ static int proc_sys_getattr(const struct path *path, struct kstat *stat,
 static const struct file_operations proc_sys_file_operations = {
 	.open		= proc_sys_open,
 	.poll		= proc_sys_poll,
-	.read		= proc_sys_read,
-	.write		= proc_sys_write,
+	.read_iter	= proc_sys_read,
+	.write_iter	= proc_sys_write,
 	.llseek		= default_llseek,
 };
 
diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index 64f367044e25f4..82b26a1386d85e 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -136,7 +136,7 @@ int __cgroup_bpf_check_dev_permission(short dev_type, u32 major, u32 minor,
 
 int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
 				   struct ctl_table *table, int write,
-				   void **buf, size_t *pcount, loff_t *ppos,
+				   char **buf, size_t *pcount, loff_t *ppos,
 				   enum bpf_attach_type type);
 
 int __cgroup_bpf_run_filter_setsockopt(struct sock *sock, int *level,
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index e21de4f1754c2c..6ec088a96302f9 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -1226,7 +1226,7 @@ const struct bpf_verifier_ops cg_dev_verifier_ops = {
  */
 int __cgroup_bpf_run_filter_sysctl(struct ctl_table_header *head,
 				   struct ctl_table *table, int write,
-				   void **buf, size_t *pcount, loff_t *ppos,
+				   char **buf, size_t *pcount, loff_t *ppos,
 				   enum bpf_attach_type type)
 {
 	struct bpf_sysctl_kern ctx = {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 04/14] sysctl: Convert to iter interfaces Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-10-01 22:38   ` Eric Biggers
  2020-09-03 14:22 ` [PATCH 06/14] fs: don't allow splice read/write without explicit ops Christoph Hellwig
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Don't allow calling ->read or ->write with set_fs as a preparation for
killing off set_fs.  All the instances that we use kernel_read/write on
are using the iter ops already.

If a file has both the regular ->read/->write methods and the iter
variants those could have different semantics for messed up enough
drivers.  Also fails the kernel access to them in that case.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/read_write.c | 67 +++++++++++++++++++++++++++++++------------------
 1 file changed, 42 insertions(+), 25 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5db58b8c78d0dd..702c4301d9eb6b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -419,27 +419,41 @@ static ssize_t new_sync_read(struct file *filp, char __user *buf, size_t len, lo
 	return ret;
 }
 
+static int warn_unsupported(struct file *file, const char *op)
+{
+	pr_warn_ratelimited(
+		"kernel %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
+		op, file, current->pid, current->comm);
+	return -EINVAL;
+}
+
 ssize_t __kernel_read(struct file *file, void *buf, size_t count, loff_t *pos)
 {
-	mm_segment_t old_fs = get_fs();
+	struct kvec iov = {
+		.iov_base	= buf,
+		.iov_len	= min_t(size_t, count, MAX_RW_COUNT),
+	};
+	struct kiocb kiocb;
+	struct iov_iter iter;
 	ssize_t ret;
 
 	if (WARN_ON_ONCE(!(file->f_mode & FMODE_READ)))
 		return -EINVAL;
 	if (!(file->f_mode & FMODE_CAN_READ))
 		return -EINVAL;
+	/*
+	 * Also fail if ->read_iter and ->read are both wired up as that
+	 * implies very convoluted semantics.
+	 */
+	if (unlikely(!file->f_op->read_iter || file->f_op->read))
+		return warn_unsupported(file, "read");
 
-	if (count > MAX_RW_COUNT)
-		count =  MAX_RW_COUNT;
-	set_fs(KERNEL_DS);
-	if (file->f_op->read)
-		ret = file->f_op->read(file, (void __user *)buf, count, pos);
-	else if (file->f_op->read_iter)
-		ret = new_sync_read(file, (void __user *)buf, count, pos);
-	else
-		ret = -EINVAL;
-	set_fs(old_fs);
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *pos;
+	iov_iter_kvec(&iter, READ, &iov, 1, iov.iov_len);
+	ret = file->f_op->read_iter(&kiocb, &iter);
 	if (ret > 0) {
+		*pos = kiocb.ki_pos;
 		fsnotify_access(file);
 		add_rchar(current, ret);
 	}
@@ -510,28 +524,31 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
 /* caller is responsible for file_start_write/file_end_write */
 ssize_t __kernel_write(struct file *file, const void *buf, size_t count, loff_t *pos)
 {
-	mm_segment_t old_fs;
-	const char __user *p;
+	struct kvec iov = {
+		.iov_base	= (void *)buf,
+		.iov_len	= min_t(size_t, count, MAX_RW_COUNT),
+	};
+	struct kiocb kiocb;
+	struct iov_iter iter;
 	ssize_t ret;
 
 	if (WARN_ON_ONCE(!(file->f_mode & FMODE_WRITE)))
 		return -EBADF;
 	if (!(file->f_mode & FMODE_CAN_WRITE))
 		return -EINVAL;
+	/*
+	 * Also fail if ->write_iter and ->write are both wired up as that
+	 * implies very convoluted semantics.
+	 */
+	if (unlikely(!file->f_op->write_iter || file->f_op->write))
+		return warn_unsupported(file, "write");
 
-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	p = (__force const char __user *)buf;
-	if (count > MAX_RW_COUNT)
-		count =  MAX_RW_COUNT;
-	if (file->f_op->write)
-		ret = file->f_op->write(file, p, count, pos);
-	else if (file->f_op->write_iter)
-		ret = new_sync_write(file, p, count, pos);
-	else
-		ret = -EINVAL;
-	set_fs(old_fs);
+	init_sync_kiocb(&kiocb, file);
+	kiocb.ki_pos = *pos;
+	iov_iter_kvec(&iter, WRITE, &iov, 1, iov.iov_len);
+	ret = file->f_op->write_iter(&kiocb, &iter);
 	if (ret > 0) {
+		*pos = kiocb.ki_pos;
 		fsnotify_modify(file);
 		add_wchar(current, ret);
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 06/14] fs: don't allow splice read/write without explicit ops
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 07/14] uaccess: add infrastructure for kernel builds with set_fs() Christoph Hellwig
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

default_file_splice_write is the last piece of generic code that uses
set_fs to make the uaccess routines operate on kernel pointers.  It
implements a "fallback loop" for splicing from files that do not actually
provide a proper splice_read method.  The usual file systems and other
high bandwidth instances all provide a ->splice_read, so this just removes
support for various device drivers and procfs/debugfs files.  If splice
support for any of those turns out to be important it can be added back
by switching them to the iter ops and using generic_file_splice_read.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 fs/read_write.c    |   2 +-
 fs/splice.c        | 130 +++++----------------------------------------
 include/linux/fs.h |   2 -
 3 files changed, 15 insertions(+), 119 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 702c4301d9eb6b..8c61f67453e3d3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1077,7 +1077,7 @@ ssize_t vfs_iter_write(struct file *file, struct iov_iter *iter, loff_t *ppos,
 }
 EXPORT_SYMBOL(vfs_iter_write);
 
-ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
+static ssize_t vfs_readv(struct file *file, const struct iovec __user *vec,
 		  unsigned long vlen, loff_t *pos, rwf_t flags)
 {
 	struct iovec iovstack[UIO_FASTIOV];
diff --git a/fs/splice.c b/fs/splice.c
index d7c8a7c4db07ff..412df7b48f9eb7 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -342,89 +342,6 @@ const struct pipe_buf_operations nosteal_pipe_buf_ops = {
 };
 EXPORT_SYMBOL(nosteal_pipe_buf_ops);
 
-static ssize_t kernel_readv(struct file *file, const struct kvec *vec,
-			    unsigned long vlen, loff_t offset)
-{
-	mm_segment_t old_fs;
-	loff_t pos = offset;
-	ssize_t res;
-
-	old_fs = get_fs();
-	set_fs(KERNEL_DS);
-	/* The cast to a user pointer is valid due to the set_fs() */
-	res = vfs_readv(file, (const struct iovec __user *)vec, vlen, &pos, 0);
-	set_fs(old_fs);
-
-	return res;
-}
-
-static ssize_t default_file_splice_read(struct file *in, loff_t *ppos,
-				 struct pipe_inode_info *pipe, size_t len,
-				 unsigned int flags)
-{
-	struct kvec *vec, __vec[PIPE_DEF_BUFFERS];
-	struct iov_iter to;
-	struct page **pages;
-	unsigned int nr_pages;
-	unsigned int mask;
-	size_t offset, base, copied = 0;
-	ssize_t res;
-	int i;
-
-	if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
-		return -EAGAIN;
-
-	/*
-	 * Try to keep page boundaries matching to source pagecache ones -
-	 * it probably won't be much help, but...
-	 */
-	offset = *ppos & ~PAGE_MASK;
-
-	iov_iter_pipe(&to, READ, pipe, len + offset);
-
-	res = iov_iter_get_pages_alloc(&to, &pages, len + offset, &base);
-	if (res <= 0)
-		return -ENOMEM;
-
-	nr_pages = DIV_ROUND_UP(res + base, PAGE_SIZE);
-
-	vec = __vec;
-	if (nr_pages > PIPE_DEF_BUFFERS) {
-		vec = kmalloc_array(nr_pages, sizeof(struct kvec), GFP_KERNEL);
-		if (unlikely(!vec)) {
-			res = -ENOMEM;
-			goto out;
-		}
-	}
-
-	mask = pipe->ring_size - 1;
-	pipe->bufs[to.head & mask].offset = offset;
-	pipe->bufs[to.head & mask].len -= offset;
-
-	for (i = 0; i < nr_pages; i++) {
-		size_t this_len = min_t(size_t, len, PAGE_SIZE - offset);
-		vec[i].iov_base = page_address(pages[i]) + offset;
-		vec[i].iov_len = this_len;
-		len -= this_len;
-		offset = 0;
-	}
-
-	res = kernel_readv(in, vec, nr_pages, *ppos);
-	if (res > 0) {
-		copied = res;
-		*ppos += res;
-	}
-
-	if (vec != __vec)
-		kfree(vec);
-out:
-	for (i = 0; i < nr_pages; i++)
-		put_page(pages[i]);
-	kvfree(pages);
-	iov_iter_advance(&to, copied);	/* truncates and discards */
-	return res;
-}
-
 /*
  * Send 'sd->len' bytes to socket from 'sd->file' at position 'sd->pos'
  * using sendpage(). Return the number of bytes sent.
@@ -788,33 +705,6 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
 
 EXPORT_SYMBOL(iter_file_splice_write);
 
-static int write_pipe_buf(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
-			  struct splice_desc *sd)
-{
-	int ret;
-	void *data;
-	loff_t tmp = sd->pos;
-
-	data = kmap(buf->page);
-	ret = __kernel_write(sd->u.file, data + buf->offset, sd->len, &tmp);
-	kunmap(buf->page);
-
-	return ret;
-}
-
-static ssize_t default_file_splice_write(struct pipe_inode_info *pipe,
-					 struct file *out, loff_t *ppos,
-					 size_t len, unsigned int flags)
-{
-	ssize_t ret;
-
-	ret = splice_from_pipe(pipe, out, ppos, len, flags, write_pipe_buf);
-	if (ret > 0)
-		*ppos += ret;
-
-	return ret;
-}
-
 /**
  * generic_splice_sendpage - splice data from a pipe to a socket
  * @pipe:	pipe to splice from
@@ -836,15 +726,23 @@ ssize_t generic_splice_sendpage(struct pipe_inode_info *pipe, struct file *out,
 
 EXPORT_SYMBOL(generic_splice_sendpage);
 
+static int warn_unsupported(struct file *file, const char *op)
+{
+	pr_debug_ratelimited(
+		"splice %s not supported for file %pD4 (pid: %d comm: %.20s)\n",
+		op, file, current->pid, current->comm);
+	return -EINVAL;
+}
+
 /*
  * Attempt to initiate a splice from pipe to file.
  */
 static long do_splice_from(struct pipe_inode_info *pipe, struct file *out,
 			   loff_t *ppos, size_t len, unsigned int flags)
 {
-	if (out->f_op->splice_write)
-		return out->f_op->splice_write(pipe, out, ppos, len, flags);
-	return default_file_splice_write(pipe, out, ppos, len, flags);
+	if (unlikely(!out->f_op->splice_write))
+		return warn_unsupported(out, "write");
+	return out->f_op->splice_write(pipe, out, ppos, len, flags);
 }
 
 /*
@@ -866,9 +764,9 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 	if (unlikely(len > MAX_RW_COUNT))
 		len = MAX_RW_COUNT;
 
-	if (in->f_op->splice_read)
-		return in->f_op->splice_read(in, ppos, pipe, len, flags);
-	return default_file_splice_read(in, ppos, pipe, len, flags);
+	if (unlikely(!in->f_op->splice_read))
+		return warn_unsupported(in, "read");
+	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
 
 /**
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a082c..839eeccf10174b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1894,8 +1894,6 @@ ssize_t rw_copy_check_uvector(int type, const struct iovec __user * uvector,
 
 extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
-extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
-		unsigned long, loff_t *, rwf_t);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
 extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 07/14] uaccess: add infrastructure for kernel builds with set_fs()
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (5 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 06/14] fs: don't allow splice read/write without explicit ops Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 08/14] test_bitmap: remove user bitmap tests Christoph Hellwig
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Add a CONFIG_SET_FS option that is selected by architecturess that
implement set_fs, which is all of them initially.  If the option is not
set stubs for routines related to overriding the address space are
provided so that architectures can start to opt out of providing set_fs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/Kconfig            |  3 +++
 arch/alpha/Kconfig      |  1 +
 arch/arc/Kconfig        |  1 +
 arch/arm/Kconfig        |  1 +
 arch/arm64/Kconfig      |  1 +
 arch/c6x/Kconfig        |  1 +
 arch/csky/Kconfig       |  1 +
 arch/h8300/Kconfig      |  1 +
 arch/hexagon/Kconfig    |  1 +
 arch/ia64/Kconfig       |  1 +
 arch/m68k/Kconfig       |  1 +
 arch/microblaze/Kconfig |  1 +
 arch/mips/Kconfig       |  1 +
 arch/nds32/Kconfig      |  1 +
 arch/nios2/Kconfig      |  1 +
 arch/openrisc/Kconfig   |  1 +
 arch/parisc/Kconfig     |  1 +
 arch/powerpc/Kconfig    |  1 +
 arch/riscv/Kconfig      |  1 +
 arch/s390/Kconfig       |  1 +
 arch/sh/Kconfig         |  1 +
 arch/sparc/Kconfig      |  1 +
 arch/um/Kconfig         |  1 +
 arch/x86/Kconfig        |  1 +
 arch/xtensa/Kconfig     |  1 +
 include/linux/uaccess.h | 18 ++++++++++++++++++
 26 files changed, 45 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493fc..3fab619a6aa51a 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -24,6 +24,9 @@ config KEXEC_ELF
 config HAVE_IMA_KEXEC
 	bool
 
+config SET_FS
+	bool
+
 config HOTPLUG_SMT
 	bool
 
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index 9c5f06e8eb9bc0..d6e9fc7a7b19e2 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -39,6 +39,7 @@ config ALPHA
 	select OLD_SIGSUSPEND
 	select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
 	select MMU_GATHER_NO_RANGE
+	select SET_FS
 	help
 	  The Alpha is a 64-bit general-purpose processor designed and
 	  marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index ba00c4e1e1c271..c49f5754a11e40 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -48,6 +48,7 @@ config ARC
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC if ARC_CACHE_VIPT_ALIASING
 	select HAVE_ARCH_JUMP_LABEL if ISA_ARCV2 && !CPU_ENDIAN_BE32
+	select SET_FS
 
 config ARCH_HAS_CACHE_LINE_SIZE
 	def_bool y
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e00d94b1665876..87e1478a42dc4f 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -118,6 +118,7 @@ config ARM
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SYS_SUPPORTS_APM_EMULATION
 	# Above selects are sorted alphabetically; please add new ones
 	# according to that.  Thanks.
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 6d232837cbeee8..fbd9e35bef096f 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -192,6 +192,7 @@ config ARM64
 	select PCI_SYSCALL if PCI
 	select POWER_RESET
 	select POWER_SUPPLY
+	select SET_FS
 	select SPARSE_IRQ
 	select SWIOTLB
 	select SYSCTL_EXCEPTION_TRACE
diff --git a/arch/c6x/Kconfig b/arch/c6x/Kconfig
index 6444ebfd06a665..48d66bf0465d68 100644
--- a/arch/c6x/Kconfig
+++ b/arch/c6x/Kconfig
@@ -22,6 +22,7 @@ config C6X
 	select GENERIC_CLOCKEVENTS
 	select MODULES_USE_ELF_RELA
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS
 
 config MMU
 	def_bool n
diff --git a/arch/csky/Kconfig b/arch/csky/Kconfig
index 3d5afb5f568543..2836f6e76fdb2d 100644
--- a/arch/csky/Kconfig
+++ b/arch/csky/Kconfig
@@ -78,6 +78,7 @@ config CSKY
 	select PCI_DOMAINS_GENERIC if PCI
 	select PCI_SYSCALL if PCI
 	select PCI_MSI if PCI
+	select SET_FS
 
 config LOCKDEP_SUPPORT
 	def_bool y
diff --git a/arch/h8300/Kconfig b/arch/h8300/Kconfig
index d11666d538fea8..7945de067e9fcc 100644
--- a/arch/h8300/Kconfig
+++ b/arch/h8300/Kconfig
@@ -25,6 +25,7 @@ config H8300
 	select HAVE_ARCH_KGDB
 	select HAVE_ARCH_HASH
 	select CPU_NO_EFFICIENT_FFS
+	select SET_FS
 	select UACCESS_MEMCPY
 
 config CPU_BIG_ENDIAN
diff --git a/arch/hexagon/Kconfig b/arch/hexagon/Kconfig
index 667cfc511cf999..f2afabbadd430e 100644
--- a/arch/hexagon/Kconfig
+++ b/arch/hexagon/Kconfig
@@ -31,6 +31,7 @@ config HEXAGON
 	select GENERIC_CLOCKEVENTS_BROADCAST
 	select MODULES_USE_ELF_RELA
 	select GENERIC_CPU_DEVICES
+	select SET_FS
 	help
 	  Qualcomm Hexagon is a processor architecture designed for high
 	  performance and low power across a wide variety of applications.
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 5b4ec80bf5863a..22a6853840e235 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -56,6 +56,7 @@ config IA64
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
 	select NUMA if !FLATMEM
+	select SET_FS
 	default y
 	help
 	  The Itanium Processor Family is Intel's 64-bit successor to
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 6f2f38d05772ab..dcf4ae8c9b215f 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -32,6 +32,7 @@ config M68K
 	select OLD_SIGSUSPEND3
 	select OLD_SIGACTION
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS
 
 config CPU_BIG_ENDIAN
 	def_bool y
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index d262ac0c8714bd..7e3d4583abf3e6 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -46,6 +46,7 @@ config MICROBLAZE
 	select CPU_NO_EFFICIENT_FFS
 	select MMU_GATHER_NO_RANGE if MMU
 	select SPARSE_IRQ
+	select SET_FS
 
 # Endianness selection
 choice
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index c95fa3a2484cf0..fbc26391b588f8 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -87,6 +87,7 @@ config MIPS
 	select MODULES_USE_ELF_RELA if MODULES && 64BIT
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SYSCTL_EXCEPTION_TRACE
 	select VIRT_TO_BUS
 
diff --git a/arch/nds32/Kconfig b/arch/nds32/Kconfig
index e30298e99e1bdf..e8e541fd2267d0 100644
--- a/arch/nds32/Kconfig
+++ b/arch/nds32/Kconfig
@@ -48,6 +48,7 @@ config NDS32
 	select HAVE_FUNCTION_GRAPH_TRACER
 	select HAVE_FTRACE_MCOUNT_RECORD
 	select HAVE_DYNAMIC_FTRACE
+	select SET_FS
 	help
 	  Andes(nds32) Linux support.
 
diff --git a/arch/nios2/Kconfig b/arch/nios2/Kconfig
index c6645141bb2a88..c7c6ba6bec9dfc 100644
--- a/arch/nios2/Kconfig
+++ b/arch/nios2/Kconfig
@@ -27,6 +27,7 @@ config NIOS2
 	select USB_ARCH_HAS_HCD if USB_SUPPORT
 	select CPU_NO_EFFICIENT_FFS
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS
 
 config GENERIC_CSUM
 	def_bool y
diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 7e94fe37cb2fdf..6233c62931803f 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -39,6 +39,7 @@ config OPENRISC
 	select ARCH_WANT_FRAME_POINTERS
 	select GENERIC_IRQ_MULTI_HANDLER
 	select MMU_GATHER_NO_RANGE if MMU
+	select SET_FS
 
 config CPU_BIG_ENDIAN
 	def_bool y
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 3b0f53dd70bc9b..be70af482b5a9a 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -63,6 +63,7 @@ config PARISC
 	select HAVE_FTRACE_MCOUNT_RECORD if HAVE_DYNAMIC_FTRACE
 	select HAVE_KPROBES_ON_FTRACE
 	select HAVE_DYNAMIC_FTRACE_WITH_REGS
+	select SET_FS
 
 	help
 	  The PA-RISC microprocessor is designed by Hewlett-Packard and used
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 65bed1fdeaad71..4d1c18a3b83977 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -249,6 +249,7 @@ config PPC
 	select PCI_SYSCALL			if PCI
 	select PPC_DAWR				if PPC64
 	select RTC_LIB
+	select SET_FS
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index df18372861d8d2..ea0c1ad456d838 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -82,6 +82,7 @@ config RISCV
 	select PCI_MSI if PCI
 	select RISCV_INTC
 	select RISCV_TIMER if RISCV_SBI
+	select SET_FS
 	select SPARSEMEM_STATIC if 32BIT
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 3d86e12e8e3c21..fd81385a7787cb 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -185,6 +185,7 @@ config S390
 	select OLD_SIGSUSPEND3
 	select PCI_DOMAINS		if PCI
 	select PCI_MSI			if PCI
+	select SET_FS
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index d20927128fce05..2bd1653f3b3fea 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -71,6 +71,7 @@ config SUPERH
 	select PERF_EVENTS
 	select PERF_USE_VMALLOC
 	select RTC_LIB
+	select SET_FS
 	select SPARSE_IRQ
 	help
 	  The SuperH is a RISC processor targeted for use in embedded systems
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index efeff2c896a544..3e0cf0319a278a 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -49,6 +49,7 @@ config SPARC
 	select LOCKDEP_SMALL if LOCKDEP
 	select NEED_DMA_MAP_STATE
 	select NEED_SG_DMA_LENGTH
+	select SET_FS
 
 config SPARC32
 	def_bool !64BIT
diff --git a/arch/um/Kconfig b/arch/um/Kconfig
index eb51fec759484a..3aefcd81566809 100644
--- a/arch/um/Kconfig
+++ b/arch/um/Kconfig
@@ -19,6 +19,7 @@ config UML
 	select GENERIC_CPU_DEVICES
 	select GENERIC_CLOCKEVENTS
 	select HAVE_GCC_PLUGINS
+	select SET_FS
 	select TTY # Needed for line.c
 
 config MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb209d..f85c13355732fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,6 +237,7 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
+	select SET_FS
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 
 config INSTRUCTION_DECODER
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index e997e0119c0251..94bad4d66b4bde 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -41,6 +41,7 @@ config XTENSA
 	select IRQ_DOMAIN
 	select MODULES_USE_ELF_RELA
 	select PERF_USE_VMALLOC
+	select SET_FS
 	select VIRT_TO_BUS
 	help
 	  Xtensa processors are 32-bit RISC machines designed by Tensilica
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 94b28541165929..70073c802b48ed 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -8,6 +8,7 @@
 
 #include <asm/uaccess.h>
 
+#ifdef CONFIG_SET_FS
 /*
  * Force the uaccess routines to be wired up for actual userspace access,
  * overriding any possible set_fs(KERNEL_DS) still lingering around.  Undone
@@ -25,6 +26,23 @@ static inline void force_uaccess_end(mm_segment_t oldfs)
 {
 	set_fs(oldfs);
 }
+#else /* CONFIG_SET_FS */
+typedef struct {
+	/* empty dummy */
+} mm_segment_t;
+
+#define uaccess_kernel()		(false)
+#define user_addr_max()			(TASK_SIZE_MAX)
+
+static inline mm_segment_t force_uaccess_begin(void)
+{
+	return (mm_segment_t) { };
+}
+
+static inline void force_uaccess_end(mm_segment_t oldfs)
+{
+}
+#endif /* CONFIG_SET_FS */
 
 /*
  * Architectures should provide two primitives (raw_copy_{to,from}_user())
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 08/14] test_bitmap: remove user bitmap tests
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (6 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 07/14] uaccess: add infrastructure for kernel builds with set_fs() Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 09/14] lkdtm: remove set_fs-based tests Christoph Hellwig
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

We can't run the tests for userspace bitmap parsing if set_fs() doesn't
exist, and it is about to go away for x86, powerpc with other major
architectures to follow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 lib/test_bitmap.c | 91 +++++++++++------------------------------------
 1 file changed, 21 insertions(+), 70 deletions(-)

diff --git a/lib/test_bitmap.c b/lib/test_bitmap.c
index df903c53952bb9..4425a1dd4ef1c7 100644
--- a/lib/test_bitmap.c
+++ b/lib/test_bitmap.c
@@ -354,50 +354,37 @@ static const struct test_bitmap_parselist parselist_tests[] __initconst = {
 
 };
 
-static void __init __test_bitmap_parselist(int is_user)
+static void __init test_bitmap_parselist(void)
 {
 	int i;
 	int err;
 	ktime_t time;
 	DECLARE_BITMAP(bmap, 2048);
-	char *mode = is_user ? "_user"  : "";
 
 	for (i = 0; i < ARRAY_SIZE(parselist_tests); i++) {
 #define ptest parselist_tests[i]
 
-		if (is_user) {
-			mm_segment_t orig_fs = get_fs();
-			size_t len = strlen(ptest.in);
-
-			set_fs(KERNEL_DS);
-			time = ktime_get();
-			err = bitmap_parselist_user((__force const char __user *)ptest.in, len,
-						    bmap, ptest.nbits);
-			time = ktime_get() - time;
-			set_fs(orig_fs);
-		} else {
-			time = ktime_get();
-			err = bitmap_parselist(ptest.in, bmap, ptest.nbits);
-			time = ktime_get() - time;
-		}
+		time = ktime_get();
+		err = bitmap_parselist(ptest.in, bmap, ptest.nbits);
+		time = ktime_get() - time;
 
 		if (err != ptest.errno) {
-			pr_err("parselist%s: %d: input is %s, errno is %d, expected %d\n",
-					mode, i, ptest.in, err, ptest.errno);
+			pr_err("parselist: %d: input is %s, errno is %d, expected %d\n",
+					i, ptest.in, err, ptest.errno);
 			continue;
 		}
 
 		if (!err && ptest.expected
 			 && !__bitmap_equal(bmap, ptest.expected, ptest.nbits)) {
-			pr_err("parselist%s: %d: input is %s, result is 0x%lx, expected 0x%lx\n",
-					mode, i, ptest.in, bmap[0],
+			pr_err("parselist: %d: input is %s, result is 0x%lx, expected 0x%lx\n",
+					i, ptest.in, bmap[0],
 					*ptest.expected);
 			continue;
 		}
 
 		if (ptest.flags & PARSE_TIME)
-			pr_err("parselist%s: %d: input is '%s' OK, Time: %llu\n",
-					mode, i, ptest.in, time);
+			pr_err("parselist: %d: input is '%s' OK, Time: %llu\n",
+					i, ptest.in, time);
 
 #undef ptest
 	}
@@ -443,75 +430,41 @@ static const struct test_bitmap_parselist parse_tests[] __initconst = {
 #undef step
 };
 
-static void __init __test_bitmap_parse(int is_user)
+static void __init test_bitmap_parse(void)
 {
 	int i;
 	int err;
 	ktime_t time;
 	DECLARE_BITMAP(bmap, 2048);
-	char *mode = is_user ? "_user"  : "";
 
 	for (i = 0; i < ARRAY_SIZE(parse_tests); i++) {
 		struct test_bitmap_parselist test = parse_tests[i];
+		size_t len = test.flags & NO_LEN ? UINT_MAX : strlen(test.in);
 
-		if (is_user) {
-			size_t len = strlen(test.in);
-			mm_segment_t orig_fs = get_fs();
-
-			set_fs(KERNEL_DS);
-			time = ktime_get();
-			err = bitmap_parse_user((__force const char __user *)test.in, len,
-						bmap, test.nbits);
-			time = ktime_get() - time;
-			set_fs(orig_fs);
-		} else {
-			size_t len = test.flags & NO_LEN ?
-				UINT_MAX : strlen(test.in);
-			time = ktime_get();
-			err = bitmap_parse(test.in, len, bmap, test.nbits);
-			time = ktime_get() - time;
-		}
+		time = ktime_get();
+		err = bitmap_parse(test.in, len, bmap, test.nbits);
+		time = ktime_get() - time;
 
 		if (err != test.errno) {
-			pr_err("parse%s: %d: input is %s, errno is %d, expected %d\n",
-					mode, i, test.in, err, test.errno);
+			pr_err("parse: %d: input is %s, errno is %d, expected %d\n",
+					i, test.in, err, test.errno);
 			continue;
 		}
 
 		if (!err && test.expected
 			 && !__bitmap_equal(bmap, test.expected, test.nbits)) {
-			pr_err("parse%s: %d: input is %s, result is 0x%lx, expected 0x%lx\n",
-					mode, i, test.in, bmap[0],
+			pr_err("parse: %d: input is %s, result is 0x%lx, expected 0x%lx\n",
+					i, test.in, bmap[0],
 					*test.expected);
 			continue;
 		}
 
 		if (test.flags & PARSE_TIME)
-			pr_err("parse%s: %d: input is '%s' OK, Time: %llu\n",
-					mode, i, test.in, time);
+			pr_err("parse: %d: input is '%s' OK, Time: %llu\n",
+					i, test.in, time);
 	}
 }
 
-static void __init test_bitmap_parselist(void)
-{
-	__test_bitmap_parselist(0);
-}
-
-static void __init test_bitmap_parselist_user(void)
-{
-	__test_bitmap_parselist(1);
-}
-
-static void __init test_bitmap_parse(void)
-{
-	__test_bitmap_parse(0);
-}
-
-static void __init test_bitmap_parse_user(void)
-{
-	__test_bitmap_parse(1);
-}
-
 #define EXP1_IN_BITS	(sizeof(exp1) * 8)
 
 static void __init test_bitmap_arr32(void)
@@ -675,9 +628,7 @@ static void __init selftest(void)
 	test_replace();
 	test_bitmap_arr32();
 	test_bitmap_parse();
-	test_bitmap_parse_user();
 	test_bitmap_parselist();
-	test_bitmap_parselist_user();
 	test_mem_optimisations();
 	test_for_each_set_clump8();
 	test_bitmap_cut();
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 09/14] lkdtm: remove set_fs-based tests
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (7 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 08/14] test_bitmap: remove user bitmap tests Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 10/14] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h Christoph Hellwig
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Once we can't manipulate the address limit, we also can't test what
happens when the manipulation is abused.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/misc/lkdtm/bugs.c               | 10 ----------
 drivers/misc/lkdtm/core.c               |  2 --
 drivers/misc/lkdtm/lkdtm.h              |  2 --
 drivers/misc/lkdtm/usercopy.c           | 15 ---------------
 tools/testing/selftests/lkdtm/tests.txt |  2 --
 5 files changed, 31 deletions(-)

diff --git a/drivers/misc/lkdtm/bugs.c b/drivers/misc/lkdtm/bugs.c
index 4dfbfd51bdf774..a0675d4154d2fd 100644
--- a/drivers/misc/lkdtm/bugs.c
+++ b/drivers/misc/lkdtm/bugs.c
@@ -312,16 +312,6 @@ void lkdtm_CORRUPT_LIST_DEL(void)
 		pr_err("list_del() corruption not detected!\n");
 }
 
-/* Test if unbalanced set_fs(KERNEL_DS)/set_fs(USER_DS) check exists. */
-void lkdtm_CORRUPT_USER_DS(void)
-{
-	pr_info("setting bad task size limit\n");
-	set_fs(KERNEL_DS);
-
-	/* Make sure we do not keep running with a KERNEL_DS! */
-	force_sig(SIGKILL);
-}
-
 /* Test that VMAP_STACK is actually allocating with a leading guard page */
 void lkdtm_STACK_GUARD_PAGE_LEADING(void)
 {
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index a5e344df916632..97803f213d9d45 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -112,7 +112,6 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(CORRUPT_STACK_STRONG),
 	CRASHTYPE(CORRUPT_LIST_ADD),
 	CRASHTYPE(CORRUPT_LIST_DEL),
-	CRASHTYPE(CORRUPT_USER_DS),
 	CRASHTYPE(STACK_GUARD_PAGE_LEADING),
 	CRASHTYPE(STACK_GUARD_PAGE_TRAILING),
 	CRASHTYPE(UNSET_SMEP),
@@ -172,7 +171,6 @@ static const struct crashtype crashtypes[] = {
 	CRASHTYPE(USERCOPY_STACK_FRAME_FROM),
 	CRASHTYPE(USERCOPY_STACK_BEYOND),
 	CRASHTYPE(USERCOPY_KERNEL),
-	CRASHTYPE(USERCOPY_KERNEL_DS),
 	CRASHTYPE(STACKLEAK_ERASING),
 	CRASHTYPE(CFI_FORWARD_PROTO),
 #ifdef CONFIG_X86_32
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 8878538b2c1322..6dec4c9b442ff3 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -27,7 +27,6 @@ void lkdtm_OVERFLOW_UNSIGNED(void);
 void lkdtm_ARRAY_BOUNDS(void);
 void lkdtm_CORRUPT_LIST_ADD(void);
 void lkdtm_CORRUPT_LIST_DEL(void);
-void lkdtm_CORRUPT_USER_DS(void);
 void lkdtm_STACK_GUARD_PAGE_LEADING(void);
 void lkdtm_STACK_GUARD_PAGE_TRAILING(void);
 void lkdtm_UNSET_SMEP(void);
@@ -96,7 +95,6 @@ void lkdtm_USERCOPY_STACK_FRAME_TO(void);
 void lkdtm_USERCOPY_STACK_FRAME_FROM(void);
 void lkdtm_USERCOPY_STACK_BEYOND(void);
 void lkdtm_USERCOPY_KERNEL(void);
-void lkdtm_USERCOPY_KERNEL_DS(void);
 
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
diff --git a/drivers/misc/lkdtm/usercopy.c b/drivers/misc/lkdtm/usercopy.c
index b833367a45d053..109e8d4302c113 100644
--- a/drivers/misc/lkdtm/usercopy.c
+++ b/drivers/misc/lkdtm/usercopy.c
@@ -325,21 +325,6 @@ void lkdtm_USERCOPY_KERNEL(void)
 	vm_munmap(user_addr, PAGE_SIZE);
 }
 
-void lkdtm_USERCOPY_KERNEL_DS(void)
-{
-	char __user *user_ptr =
-		(char __user *)(0xFUL << (sizeof(unsigned long) * 8 - 4));
-	mm_segment_t old_fs = get_fs();
-	char buf[10] = {0};
-
-	pr_info("attempting copy_to_user() to noncanonical address: %px\n",
-		user_ptr);
-	set_fs(KERNEL_DS);
-	if (copy_to_user(user_ptr, buf, sizeof(buf)) == 0)
-		pr_err("copy_to_user() to noncanonical address succeeded!?\n");
-	set_fs(old_fs);
-}
-
 void __init lkdtm_usercopy_init(void)
 {
 	/* Prepare cache that lacks SLAB_USERCOPY flag. */
diff --git a/tools/testing/selftests/lkdtm/tests.txt b/tools/testing/selftests/lkdtm/tests.txt
index 9d266e79c6a270..74a8d329a72c80 100644
--- a/tools/testing/selftests/lkdtm/tests.txt
+++ b/tools/testing/selftests/lkdtm/tests.txt
@@ -9,7 +9,6 @@ EXCEPTION
 #CORRUPT_STACK_STRONG Crashes entire system on success
 CORRUPT_LIST_ADD list_add corruption
 CORRUPT_LIST_DEL list_del corruption
-CORRUPT_USER_DS Invalid address limit on user-mode return
 STACK_GUARD_PAGE_LEADING
 STACK_GUARD_PAGE_TRAILING
 UNSET_SMEP CR4 bits went missing
@@ -67,6 +66,5 @@ USERCOPY_STACK_FRAME_TO
 USERCOPY_STACK_FRAME_FROM
 USERCOPY_STACK_BEYOND
 USERCOPY_KERNEL
-USERCOPY_KERNEL_DS
 STACKLEAK_ERASING OK: the rest of the thread stack is properly erased
 CFI_FORWARD_PROTO
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 10/14] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (8 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 09/14] lkdtm: remove set_fs-based tests Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 11/14] x86: make TASK_SIZE_MAX usable from assembly code Christoph Hellwig
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

At least for 64-bit this moves them closer to some of the defines
they are based on, and it prepares for using the TASK_SIZE_MAX
definition from assembly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/page_32_types.h | 11 +++++++
 arch/x86/include/asm/page_64_types.h | 38 +++++++++++++++++++++
 arch/x86/include/asm/processor.h     | 49 ----------------------------
 3 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index 565ad755c785e2..26236925fb2c36 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -41,6 +41,17 @@
 #define __VIRTUAL_MASK_SHIFT	32
 #endif	/* CONFIG_X86_PAE */
 
+/*
+ * User space process size: 3GB (default).
+ */
+#define IA32_PAGE_OFFSET	PAGE_OFFSET
+#define TASK_SIZE		PAGE_OFFSET
+#define TASK_SIZE_LOW		TASK_SIZE
+#define TASK_SIZE_MAX		TASK_SIZE
+#define DEFAULT_MAP_WINDOW	TASK_SIZE
+#define STACK_TOP		TASK_SIZE
+#define STACK_TOP_MAX		STACK_TOP
+
 /*
  * Kernel image size is limited to 512 MB (see in arch/x86/kernel/head_32.S)
  */
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 288b065955b729..996595c9897e0a 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -58,6 +58,44 @@
 #define __VIRTUAL_MASK_SHIFT	47
 #endif
 
+/*
+ * User space process size.  This is the first address outside the user range.
+ * There are a few constraints that determine this:
+ *
+ * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
+ * address, then that syscall will enter the kernel with a
+ * non-canonical return address, and SYSRET will explode dangerously.
+ * We avoid this particular problem by preventing anything executable
+ * from being mapped at the maximum canonical address.
+ *
+ * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
+ * CPUs malfunction if they execute code from the highest canonical page.
+ * They'll speculate right off the end of the canonical space, and
+ * bad things happen.  This is worked around in the same way as the
+ * Intel problem.
+ *
+ * With page table isolation enabled, we map the LDT in ... [stay tuned]
+ */
+#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+
+#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
+
+/* This decides where the kernel will search for a free chunk of vm
+ * space during mmap's.
+ */
+#define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
+					0xc0000000 : 0xFFFFe000)
+
+#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
+#define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
+					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
+#define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
+					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
+
+#define STACK_TOP		TASK_SIZE_LOW
+#define STACK_TOP_MAX		TASK_SIZE_MAX
+
 /*
  * Maximum kernel image size is limited to 1 GiB, due to the fixmap living
  * in the next 1 GiB (see level2_kernel_pgt in arch/x86/kernel/head_64.S).
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d87994c24..1618eeb08361a9 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -782,17 +782,6 @@ static inline void spin_lock_prefetch(const void *x)
 })
 
 #ifdef CONFIG_X86_32
-/*
- * User space process size: 3GB (default).
- */
-#define IA32_PAGE_OFFSET	PAGE_OFFSET
-#define TASK_SIZE		PAGE_OFFSET
-#define TASK_SIZE_LOW		TASK_SIZE
-#define TASK_SIZE_MAX		TASK_SIZE
-#define DEFAULT_MAP_WINDOW	TASK_SIZE
-#define STACK_TOP		TASK_SIZE
-#define STACK_TOP_MAX		STACK_TOP
-
 #define INIT_THREAD  {							  \
 	.sp0			= TOP_OF_INIT_STACK,			  \
 	.sysenter_cs		= __KERNEL_CS,				  \
@@ -802,44 +791,6 @@ static inline void spin_lock_prefetch(const void *x)
 #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
 
 #else
-/*
- * User space process size.  This is the first address outside the user range.
- * There are a few constraints that determine this:
- *
- * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
- * address, then that syscall will enter the kernel with a
- * non-canonical return address, and SYSRET will explode dangerously.
- * We avoid this particular problem by preventing anything executable
- * from being mapped at the maximum canonical address.
- *
- * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
- * CPUs malfunction if they execute code from the highest canonical page.
- * They'll speculate right off the end of the canonical space, and
- * bad things happen.  This is worked around in the same way as the
- * Intel problem.
- *
- * With page table isolation enabled, we map the LDT in ... [stay tuned]
- */
-#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
-
-#define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
-
-/* This decides where the kernel will search for a free chunk of vm
- * space during mmap's.
- */
-#define IA32_PAGE_OFFSET	((current->personality & ADDR_LIMIT_3GB) ? \
-					0xc0000000 : 0xFFFFe000)
-
-#define TASK_SIZE_LOW		(test_thread_flag(TIF_ADDR32) ? \
-					IA32_PAGE_OFFSET : DEFAULT_MAP_WINDOW)
-#define TASK_SIZE		(test_thread_flag(TIF_ADDR32) ? \
-					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
-#define TASK_SIZE_OF(child)	((test_tsk_thread_flag(child, TIF_ADDR32)) ? \
-					IA32_PAGE_OFFSET : TASK_SIZE_MAX)
-
-#define STACK_TOP		TASK_SIZE_LOW
-#define STACK_TOP_MAX		TASK_SIZE_MAX
-
 #define INIT_THREAD  {						\
 	.addr_limit		= KERNEL_DS,			\
 }
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 11/14] x86: make TASK_SIZE_MAX usable from assembly code
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (9 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 10/14] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 12/14] x86: remove address space overrides using set_fs() Christoph Hellwig
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

For 64-bit the only thing missing was a strategic _AC, and for 32-bit we
need to use __PAGE_OFFSET instead of PAGE_OFFSET in the TASK_SIZE
definition to escape the explicit unsigned long cast.  This just works
because __PAGE_OFFSET is defined using _AC itself and thus never needs
the cast anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/include/asm/page_32_types.h | 4 ++--
 arch/x86/include/asm/page_64_types.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/page_32_types.h b/arch/x86/include/asm/page_32_types.h
index 26236925fb2c36..f462895a33e452 100644
--- a/arch/x86/include/asm/page_32_types.h
+++ b/arch/x86/include/asm/page_32_types.h
@@ -44,8 +44,8 @@
 /*
  * User space process size: 3GB (default).
  */
-#define IA32_PAGE_OFFSET	PAGE_OFFSET
-#define TASK_SIZE		PAGE_OFFSET
+#define IA32_PAGE_OFFSET	__PAGE_OFFSET
+#define TASK_SIZE		__PAGE_OFFSET
 #define TASK_SIZE_LOW		TASK_SIZE
 #define TASK_SIZE_MAX		TASK_SIZE
 #define DEFAULT_MAP_WINDOW	TASK_SIZE
diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 996595c9897e0a..838515daf87b36 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -76,7 +76,7 @@
  *
  * With page table isolation enabled, we map the LDT in ... [stay tuned]
  */
-#define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
+#define TASK_SIZE_MAX	((_AC(1,UL) << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
 
 #define DEFAULT_MAP_WINDOW	((1UL << 47) - PAGE_SIZE)
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (10 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 11/14] x86: make TASK_SIZE_MAX usable from assembly code Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 21:30   ` David Laight
  2020-09-04  2:55   ` Al Viro
  2020-09-03 14:22 ` [PATCH 13/14] powerpc: use non-set_fs based maccess routines Christoph Hellwig
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.  To properly
handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
x86 a new alternative is introduced, which just like the one in
entry_64.S has to use the hardcoded virtual address bits to escape
the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
page tables are enabled.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
---
 arch/x86/Kconfig                   |  1 -
 arch/x86/ia32/ia32_aout.c          |  1 -
 arch/x86/include/asm/processor.h   | 11 +------
 arch/x86/include/asm/thread_info.h |  2 --
 arch/x86/include/asm/uaccess.h     | 26 +----------------
 arch/x86/kernel/asm-offsets.c      |  3 --
 arch/x86/lib/getuser.S             | 47 +++++++++++++++---------------
 arch/x86/lib/putuser.S             | 25 ++++++++--------
 8 files changed, 39 insertions(+), 77 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f85c13355732fe..7101ac64bb209d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -237,7 +237,6 @@ config X86
 	select HAVE_ARCH_KCSAN			if X86_64
 	select X86_FEATURE_NAMES		if PROC_FS
 	select PROC_PID_ARCH_STATUS		if PROC_FS
-	select SET_FS
 	imply IMA_SECURE_AND_OR_TRUSTED_BOOT    if EFI
 
 config INSTRUCTION_DECODER
diff --git a/arch/x86/ia32/ia32_aout.c b/arch/x86/ia32/ia32_aout.c
index ca8a657edf5977..a09fc37ead9d47 100644
--- a/arch/x86/ia32/ia32_aout.c
+++ b/arch/x86/ia32/ia32_aout.c
@@ -239,7 +239,6 @@ static int load_aout_binary(struct linux_binprm *bprm)
 	(regs)->ss = __USER32_DS;
 	regs->r8 = regs->r9 = regs->r10 = regs->r11 =
 	regs->r12 = regs->r13 = regs->r14 = regs->r15 = 0;
-	set_fs(USER_DS);
 	return 0;
 }
 
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 1618eeb08361a9..189573d95c3af6 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -482,10 +482,6 @@ extern unsigned int fpu_user_xstate_size;
 
 struct perf_event;
 
-typedef struct {
-	unsigned long		seg;
-} mm_segment_t;
-
 struct thread_struct {
 	/* Cached TLS descriptors: */
 	struct desc_struct	tls_array[GDT_ENTRY_TLS_ENTRIES];
@@ -538,8 +534,6 @@ struct thread_struct {
 	 */
 	unsigned long		iopl_emul;
 
-	mm_segment_t		addr_limit;
-
 	unsigned int		sig_on_uaccess_err:1;
 
 	/* Floating point and extended processor state */
@@ -785,15 +779,12 @@ static inline void spin_lock_prefetch(const void *x)
 #define INIT_THREAD  {							  \
 	.sp0			= TOP_OF_INIT_STACK,			  \
 	.sysenter_cs		= __KERNEL_CS,				  \
-	.addr_limit		= KERNEL_DS,				  \
 }
 
 #define KSTK_ESP(task)		(task_pt_regs(task)->sp)
 
 #else
-#define INIT_THREAD  {						\
-	.addr_limit		= KERNEL_DS,			\
-}
+#define INIT_THREAD { }
 
 extern unsigned long KSTK_ESP(struct task_struct *task);
 
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86dd..44733a4bfc4294 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -102,7 +102,6 @@ struct thread_info {
 #define TIF_SYSCALL_TRACEPOINT	28	/* syscall tracepoint instrumentation */
 #define TIF_ADDR32		29	/* 32-bit address space on 64 bits */
 #define TIF_X32			30	/* 32-bit native x86-64 binary */
-#define TIF_FSCHECK		31	/* Check FS is USER_DS on return */
 
 #define _TIF_SYSCALL_TRACE	(1 << TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME	(1 << TIF_NOTIFY_RESUME)
@@ -131,7 +130,6 @@ struct thread_info {
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_ADDR32		(1 << TIF_ADDR32)
 #define _TIF_X32		(1 << TIF_X32)
-#define _TIF_FSCHECK		(1 << TIF_FSCHECK)
 
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE					\
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ecefaffd15d4c8..a4ceda0510ea87 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -12,30 +12,6 @@
 #include <asm/smap.h>
 #include <asm/extable.h>
 
-/*
- * The fs value determines whether argument validity checking should be
- * performed or not.  If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- */
-
-#define MAKE_MM_SEG(s)	((mm_segment_t) { (s) })
-
-#define KERNEL_DS	MAKE_MM_SEG(-1UL)
-#define USER_DS 	MAKE_MM_SEG(TASK_SIZE_MAX)
-
-#define get_fs()	(current->thread.addr_limit)
-static inline void set_fs(mm_segment_t fs)
-{
-	current->thread.addr_limit = fs;
-	/* On user-mode return, check fs is correct */
-	set_thread_flag(TIF_FSCHECK);
-}
-
-#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-#define user_addr_max() (current->thread.addr_limit.seg)
-
 /*
  * Test whether a block of memory is a valid user space address.
  * Returns 0 if the range is valid, nonzero otherwise.
@@ -93,7 +69,7 @@ static inline bool pagefault_disabled(void);
 #define access_ok(addr, size)					\
 ({									\
 	WARN_ON_IN_IRQ();						\
-	likely(!__range_not_ok(addr, size, user_addr_max()));		\
+	likely(!__range_not_ok(addr, size, TASK_SIZE_MAX));		\
 })
 
 /*
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 3ca07ad552ae0c..70b7154f4bdd62 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -37,9 +37,6 @@ static void __used common(void)
 	OFFSET(TASK_stack_canary, task_struct, stack_canary);
 #endif
 
-	BLANK();
-	OFFSET(TASK_addr_limit, task_struct, thread.addr_limit);
-
 	BLANK();
 	OFFSET(crypto_tfm_ctx_offset, crypto_tfm, __crt_ctx);
 
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c8a85b512796e1..94f7be4971ed04 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -35,10 +35,19 @@
 #include <asm/smap.h>
 #include <asm/export.h>
 
+#ifdef CONFIG_X86_5LEVEL
+#define LOAD_TASK_SIZE_MINUS_N(n) \
+	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
+		    "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
+#else
+#define LOAD_TASK_SIZE_MINUS_N(n) \
+	mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
+#endif
+
 	.text
 SYM_FUNC_START(__get_user_1)
-	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	LOAD_TASK_SIZE_MINUS_N(0)
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user
 	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
 	and %_ASM_DX, %_ASM_AX
@@ -51,15 +60,13 @@ SYM_FUNC_END(__get_user_1)
 EXPORT_SYMBOL(__get_user_1)
 
 SYM_FUNC_START(__get_user_2)
-	add $1,%_ASM_AX
-	jc bad_get_user
-	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	LOAD_TASK_SIZE_MINUS_N(1)
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user
 	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
 	and %_ASM_DX, %_ASM_AX
 	ASM_STAC
-2:	movzwl -1(%_ASM_AX),%edx
+2:	movzwl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
 	ret
@@ -67,15 +74,13 @@ SYM_FUNC_END(__get_user_2)
 EXPORT_SYMBOL(__get_user_2)
 
 SYM_FUNC_START(__get_user_4)
-	add $3,%_ASM_AX
-	jc bad_get_user
-	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	LOAD_TASK_SIZE_MINUS_N(3)
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user
 	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
 	and %_ASM_DX, %_ASM_AX
 	ASM_STAC
-3:	movl -3(%_ASM_AX),%edx
+3:	movl (%_ASM_AX),%edx
 	xor %eax,%eax
 	ASM_CLAC
 	ret
@@ -84,29 +89,25 @@ EXPORT_SYMBOL(__get_user_4)
 
 SYM_FUNC_START(__get_user_8)
 #ifdef CONFIG_X86_64
-	add $7,%_ASM_AX
-	jc bad_get_user
-	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	LOAD_TASK_SIZE_MINUS_N(7)
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user
 	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
 	and %_ASM_DX, %_ASM_AX
 	ASM_STAC
-4:	movq -7(%_ASM_AX),%rdx
+4:	movq (%_ASM_AX),%rdx
 	xor %eax,%eax
 	ASM_CLAC
 	ret
 #else
-	add $7,%_ASM_AX
-	jc bad_get_user_8
-	mov PER_CPU_VAR(current_task), %_ASM_DX
-	cmp TASK_addr_limit(%_ASM_DX),%_ASM_AX
+	LOAD_TASK_SIZE_MINUS_N(7)
+	cmp %_ASM_DX,%_ASM_AX
 	jae bad_get_user_8
 	sbb %_ASM_DX, %_ASM_DX		/* array_index_mask_nospec() */
 	and %_ASM_DX, %_ASM_AX
 	ASM_STAC
-4:	movl -7(%_ASM_AX),%edx
-5:	movl -3(%_ASM_AX),%ecx
+4:	movl (%_ASM_AX),%edx
+5:	movl 4(%_ASM_AX),%ecx
 	xor %eax,%eax
 	ASM_CLAC
 	ret
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 7c7c92db8497af..445374885153bf 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -31,12 +31,19 @@
  * as they get called from within inline assembly.
  */
 
-#define ENTER	mov PER_CPU_VAR(current_task), %_ASM_BX
+#ifdef CONFIG_X86_5LEVEL
+#define LOAD_TASK_SIZE_MINUS_N(n) \
+	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rbx", \
+		    "mov $((1 << 56) - 4096 - (n)),%rbx", X86_FEATURE_LA57
+#else
+#define LOAD_TASK_SIZE_MINUS_N(n) \
+	mov $(TASK_SIZE_MAX - (n)),%_ASM_BX
+#endif
 
 .text
 SYM_FUNC_START(__put_user_1)
-	ENTER
-	cmp TASK_addr_limit(%_ASM_BX),%_ASM_CX
+	LOAD_TASK_SIZE_MINUS_N(0)
+	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
 	ASM_STAC
 1:	movb %al,(%_ASM_CX)
@@ -47,9 +54,7 @@ SYM_FUNC_END(__put_user_1)
 EXPORT_SYMBOL(__put_user_1)
 
 SYM_FUNC_START(__put_user_2)
-	ENTER
-	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
-	sub $1,%_ASM_BX
+	LOAD_TASK_SIZE_MINUS_N(1)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
 	ASM_STAC
@@ -61,9 +66,7 @@ SYM_FUNC_END(__put_user_2)
 EXPORT_SYMBOL(__put_user_2)
 
 SYM_FUNC_START(__put_user_4)
-	ENTER
-	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
-	sub $3,%_ASM_BX
+	LOAD_TASK_SIZE_MINUS_N(3)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
 	ASM_STAC
@@ -75,9 +78,7 @@ SYM_FUNC_END(__put_user_4)
 EXPORT_SYMBOL(__put_user_4)
 
 SYM_FUNC_START(__put_user_8)
-	ENTER
-	mov TASK_addr_limit(%_ASM_BX),%_ASM_BX
-	sub $7,%_ASM_BX
+	LOAD_TASK_SIZE_MINUS_N(7)
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
 	ASM_STAC
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 13/14] powerpc: use non-set_fs based maccess routines
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (11 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 12/14] x86: remove address space overrides using set_fs() Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 14:22 ` [PATCH 14/14] powerpc: remove address space overrides using set_fs() Christoph Hellwig
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Provide __get_kernel_nofault and __put_kernel_nofault routines to
implement the maccess routines without messing with set_fs and without
opening up access to user space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/powerpc/include/asm/uaccess.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 00699903f1efca..7fe3531ad36a77 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -623,4 +623,20 @@ do {									\
 		__put_user_goto(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), e);\
 } while (0)
 
+#define HAVE_GET_KERNEL_NOFAULT
+
+#define __get_kernel_nofault(dst, src, type, err_label)			\
+do {									\
+	int __kr_err;							\
+									\
+	__get_user_size_allowed(*((type *)(dst)), (__force type __user *)(src),\
+			sizeof(type), __kr_err);			\
+	if (unlikely(__kr_err))						\
+		goto err_label;						\
+} while (0)
+
+#define __put_kernel_nofault(dst, src, type, err_label)			\
+	__put_user_size_goto(*((type *)(src)),				\
+		(__force type __user *)(dst), sizeof(type), err_label)
+
 #endif	/* _ARCH_POWERPC_UACCESS_H */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* [PATCH 14/14] powerpc: remove address space overrides using set_fs()
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (12 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 13/14] powerpc: use non-set_fs based maccess routines Christoph Hellwig
@ 2020-09-03 14:22 ` Christoph Hellwig
  2020-09-03 15:43   ` Christophe Leroy
  2020-09-03 14:28 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Al Viro
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:22 UTC (permalink / raw)
  To: Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

Stop providing the possibility to override the address space using
set_fs() now that there is no need for that any more.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/powerpc/Kconfig                   |  1 -
 arch/powerpc/include/asm/processor.h   |  7 ----
 arch/powerpc/include/asm/thread_info.h |  5 +--
 arch/powerpc/include/asm/uaccess.h     | 57 +++++++-------------------
 arch/powerpc/kernel/signal.c           |  3 --
 arch/powerpc/lib/sstep.c               |  6 +--
 6 files changed, 18 insertions(+), 61 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 4d1c18a3b83977..65bed1fdeaad71 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -249,7 +249,6 @@ config PPC
 	select PCI_SYSCALL			if PCI
 	select PPC_DAWR				if PPC64
 	select RTC_LIB
-	select SET_FS
 	select SPARSE_IRQ
 	select SYSCTL_EXCEPTION_TRACE
 	select THREAD_INFO_IN_TASK
diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
index ed0d633ab5aa42..f01e4d650c520a 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -83,10 +83,6 @@ struct task_struct;
 void start_thread(struct pt_regs *regs, unsigned long fdptr, unsigned long sp);
 void release_thread(struct task_struct *);
 
-typedef struct {
-	unsigned long seg;
-} mm_segment_t;
-
 #define TS_FPR(i) fp_state.fpr[i][TS_FPROFFSET]
 #define TS_CKFPR(i) ckfp_state.fpr[i][TS_FPROFFSET]
 
@@ -148,7 +144,6 @@ struct thread_struct {
 	unsigned long	ksp_vsid;
 #endif
 	struct pt_regs	*regs;		/* Pointer to saved register state */
-	mm_segment_t	addr_limit;	/* for get_fs() validation */
 #ifdef CONFIG_BOOKE
 	/* BookE base exception scratch space; align on cacheline */
 	unsigned long	normsave[8] ____cacheline_aligned;
@@ -295,7 +290,6 @@ struct thread_struct {
 #define INIT_THREAD { \
 	.ksp = INIT_SP, \
 	.ksp_limit = INIT_SP_LIMIT, \
-	.addr_limit = KERNEL_DS, \
 	.pgdir = swapper_pg_dir, \
 	.fpexc_mode = MSR_FE0 | MSR_FE1, \
 	SPEFSCR_INIT \
@@ -303,7 +297,6 @@ struct thread_struct {
 #else
 #define INIT_THREAD  { \
 	.ksp = INIT_SP, \
-	.addr_limit = KERNEL_DS, \
 	.fpexc_mode = 0, \
 }
 #endif
diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index ca6c9702570494..46a210b03d2b80 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -90,7 +90,6 @@ void arch_setup_new_exec(void);
 #define TIF_SYSCALL_TRACE	0	/* syscall trace active */
 #define TIF_SIGPENDING		1	/* signal pending */
 #define TIF_NEED_RESCHED	2	/* rescheduling necessary */
-#define TIF_FSCHECK		3	/* Check FS is USER_DS on return */
 #define TIF_SYSCALL_EMU		4	/* syscall emulation active */
 #define TIF_RESTORE_TM		5	/* need to restore TM FP/VEC/VSX */
 #define TIF_PATCH_PENDING	6	/* pending live patching update */
@@ -130,7 +129,6 @@ void arch_setup_new_exec(void);
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
 #define _TIF_EMULATE_STACK_STORE	(1<<TIF_EMULATE_STACK_STORE)
 #define _TIF_NOHZ		(1<<TIF_NOHZ)
-#define _TIF_FSCHECK		(1<<TIF_FSCHECK)
 #define _TIF_SYSCALL_EMU	(1<<TIF_SYSCALL_EMU)
 #define _TIF_SYSCALL_DOTRACE	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
@@ -138,8 +136,7 @@ void arch_setup_new_exec(void);
 
 #define _TIF_USER_WORK_MASK	(_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
 				 _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
-				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING | \
-				 _TIF_FSCHECK)
+				 _TIF_RESTORE_TM | _TIF_PATCH_PENDING)
 #define _TIF_PERSYSCALL_MASK	(_TIF_RESTOREALL|_TIF_NOERROR)
 
 /* Bits in local_flags */
diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 7fe3531ad36a77..5363f7fc6dd06c 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -8,62 +8,33 @@
 #include <asm/extable.h>
 #include <asm/kup.h>
 
-/*
- * The fs value determines whether argument validity checking should be
- * performed or not.  If get_fs() == USER_DS, checking is performed, with
- * get_fs() == KERNEL_DS, checking is bypassed.
- *
- * For historical reasons, these macros are grossly misnamed.
- *
- * The fs/ds values are now the highest legal address in the "segment".
- * This simplifies the checking in the routines below.
- */
-
-#define MAKE_MM_SEG(s)  ((mm_segment_t) { (s) })
-
-#define KERNEL_DS	MAKE_MM_SEG(~0UL)
 #ifdef __powerpc64__
 /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
-#define USER_DS		MAKE_MM_SEG(TASK_SIZE_USER64 - 1)
-#else
-#define USER_DS		MAKE_MM_SEG(TASK_SIZE - 1)
-#endif
-
-#define get_fs()	(current->thread.addr_limit)
-
-static inline void set_fs(mm_segment_t fs)
-{
-	current->thread.addr_limit = fs;
-	/* On user-mode return check addr_limit (fs) is correct */
-	set_thread_flag(TIF_FSCHECK);
-}
-
-#define uaccess_kernel() (get_fs().seg == KERNEL_DS.seg)
-#define user_addr_max()	(get_fs().seg)
+#define TASK_SIZE_MAX		TASK_SIZE_USER64
 
-#ifdef __powerpc64__
 /*
- * This check is sufficient because there is a large enough
- * gap between user addresses and the kernel addresses
+ * This check is sufficient because there is a large enough gap between user
+ * addresses and the kernel addresses.
  */
-#define __access_ok(addr, size, segment)	\
-	(((addr) <= (segment).seg) && ((size) <= (segment).seg))
+static inline bool __access_ok(unsigned long addr, unsigned long size)
+{
+	return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
+}
 
 #else
+#define TASK_SIZE_MAX		TASK_SIZE
 
-static inline int __access_ok(unsigned long addr, unsigned long size,
-			mm_segment_t seg)
+static inline bool __access_ok(unsigned long addr, unsigned long size)
 {
-	if (addr > seg.seg)
-		return 0;
-	return (size == 0 || size - 1 <= seg.seg - addr);
+	if (addr >= TASK_SIZE_MAX)
+		return false;
+	return size == 0 || size <= TASK_SIZE_MAX - addr;
 }
-
-#endif
+#endif /* __powerpc64__ */
 
 #define access_ok(addr, size)		\
 	(__chk_user_ptr(addr),		\
-	 __access_ok((__force unsigned long)(addr), (size), get_fs()))
+	 __access_ok((unsigned long)(addr), (size)))
 
 /*
  * These are the main single-value transfer routines.  They automatically
diff --git a/arch/powerpc/kernel/signal.c b/arch/powerpc/kernel/signal.c
index d15a98c758b8b4..df547d8e31e49c 100644
--- a/arch/powerpc/kernel/signal.c
+++ b/arch/powerpc/kernel/signal.c
@@ -312,9 +312,6 @@ void do_notify_resume(struct pt_regs *regs, unsigned long thread_info_flags)
 {
 	user_exit();
 
-	/* Check valid addr_limit, TIF check is done there */
-	addr_limit_user_check();
-
 	if (thread_info_flags & _TIF_UPROBE)
 		uprobe_notify_resume(regs);
 
diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index caee8cc77e1954..8342188ea1acd0 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -108,11 +108,11 @@ static nokprobe_inline long address_ok(struct pt_regs *regs,
 {
 	if (!user_mode(regs))
 		return 1;
-	if (__access_ok(ea, nb, USER_DS))
+	if (__access_ok(ea, nb))
 		return 1;
-	if (__access_ok(ea, 1, USER_DS))
+	if (__access_ok(ea, 1))
 		/* Access overlaps the end of the user region */
-		regs->dar = USER_DS.seg;
+		regs->dar = TASK_SIZE_MAX - 1;
 	else
 		regs->dar = ea;
 	return 0;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (13 preceding siblings ...)
  2020-09-03 14:22 ` [PATCH 14/14] powerpc: remove address space overrides using set_fs() Christoph Hellwig
@ 2020-09-03 14:28 ` Al Viro
  2020-09-03 14:30   ` Christoph Hellwig
  2020-09-09 17:31   ` Linus Torvalds
  2020-09-03 15:49 ` Linus Torvalds
  2020-09-04  6:00 ` Ingo Molnar
  16 siblings, 2 replies; 58+ messages in thread
From: Al Viro @ 2020-09-03 14:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Michael Ellerman, x86, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:

> Besides x86 and powerpc I plan to eventually convert all other
> architectures, although this will be a slow process, starting with the
> easier ones once the infrastructure is merged.  The process to convert
> architectures is roughtly:
> 
>  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
>  (2) implement __get_kernel_nofault and __put_kernel_nofault
>  (3) remove the arch specific address limitation functionality

The one to really watch out for is sparc; I have something in that
direction, will resurrect as soon as I'm done with eventpoll analysis...

I can live with this series; do you want that in vfs.git#for-next?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:28 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Al Viro
@ 2020-09-03 14:30   ` Christoph Hellwig
  2020-09-03 14:36     ` Al Viro
  2020-09-09 17:31   ` Linus Torvalds
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Thu, Sep 03, 2020 at 03:28:03PM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:
> 
> > Besides x86 and powerpc I plan to eventually convert all other
> > architectures, although this will be a slow process, starting with the
> > easier ones once the infrastructure is merged.  The process to convert
> > architectures is roughtly:
> > 
> >  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> >  (2) implement __get_kernel_nofault and __put_kernel_nofault
> >  (3) remove the arch specific address limitation functionality
> 
> The one to really watch out for is sparc; I have something in that
> direction, will resurrect as soon as I'm done with eventpoll analysis...
> 
> I can live with this series; do you want that in vfs.git#for-next?

Either that or a separate tree is fine with me.  It would be good to
eventually have a non-rebased stable tree so that other arch trees
can work from it, though.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:30   ` Christoph Hellwig
@ 2020-09-03 14:36     ` Al Viro
  2020-09-03 14:40       ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2020-09-03 14:36 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Michael Ellerman, x86, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

On Thu, Sep 03, 2020 at 04:30:03PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 03:28:03PM +0100, Al Viro wrote:
> > On Thu, Sep 03, 2020 at 04:22:28PM +0200, Christoph Hellwig wrote:
> > 
> > > Besides x86 and powerpc I plan to eventually convert all other
> > > architectures, although this will be a slow process, starting with the
> > > easier ones once the infrastructure is merged.  The process to convert
> > > architectures is roughtly:
> > > 
> > >  (1) ensure there is no set_fs(KERNEL_DS) left in arch specific code
> > >  (2) implement __get_kernel_nofault and __put_kernel_nofault
> > >  (3) remove the arch specific address limitation functionality
> > 
> > The one to really watch out for is sparc; I have something in that
> > direction, will resurrect as soon as I'm done with eventpoll analysis...
> > 
> > I can live with this series; do you want that in vfs.git#for-next?
> 
> Either that or a separate tree is fine with me.  It would be good to
> eventually have a non-rebased stable tree so that other arch trees
> can work from it, though.

FWIW, vfs.git#for-next is always a merge of independent branches; I don't
put stuff directly into #for-next - too easy to lose that way.

IOW, that would be something like #base.set_fs, included into #for-next
merge set.  And I've no problem with never-rebased branches...

Where in the mainline are the most recent prereqs of this series?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:36     ` Al Viro
@ 2020-09-03 14:40       ` Christoph Hellwig
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 14:40 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Thu, Sep 03, 2020 at 03:36:29PM +0100, Al Viro wrote:
> FWIW, vfs.git#for-next is always a merge of independent branches; I don't
> put stuff directly into #for-next - too easy to lose that way.
> 
> IOW, that would be something like #base.set_fs, included into #for-next
> merge set.  And I've no problem with never-rebased branches...
> 
> Where in the mainline are the most recent prereqs of this series?

I can't think of anything past -rc1, but I haven't actually checked.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()
  2020-09-03 14:22 ` [PATCH 14/14] powerpc: remove address space overrides using set_fs() Christoph Hellwig
@ 2020-09-03 15:43   ` Christophe Leroy
  2020-09-03 15:49     ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Christophe Leroy @ 2020-09-03 15:43 UTC (permalink / raw)
  To: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: linux-arch, Kees Cook, linux-kernel, Luis Chamberlain,
	linux-fsdevel, linuxppc-dev, Alexey Dobriyan



Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
> Stop providing the possibility to override the address space using
> set_fs() now that there is no need for that any more.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---


>   
> -static inline int __access_ok(unsigned long addr, unsigned long size,
> -			mm_segment_t seg)
> +static inline bool __access_ok(unsigned long addr, unsigned long size)
>   {
> -	if (addr > seg.seg)
> -		return 0;
> -	return (size == 0 || size - 1 <= seg.seg - addr);
> +	if (addr >= TASK_SIZE_MAX)
> +		return false;
> +	return size == 0 || size <= TASK_SIZE_MAX - addr;
>   }

You don't need to test size == 0 anymore. It used to be necessary 
because of the 'size - 1', as size is unsigned.

Now you can directly do

	return size <= TASK_SIZE_MAX - addr;

If size is 0, this will always be true (because you already know that 
addr is not >= TASK_SIZE_MAX

Christophe

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()
  2020-09-03 15:43   ` Christophe Leroy
@ 2020-09-03 15:49     ` Christoph Hellwig
  2020-09-03 15:56       ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 15:49 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman,
	x86, linux-arch, Kees Cook, linux-kernel, Luis Chamberlain,
	linux-fsdevel, linuxppc-dev, Alexey Dobriyan

On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:
>
>
> Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
>> Stop providing the possibility to override the address space using
>> set_fs() now that there is no need for that any more.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>> ---
>
>
>>   -static inline int __access_ok(unsigned long addr, unsigned long size,
>> -			mm_segment_t seg)
>> +static inline bool __access_ok(unsigned long addr, unsigned long size)
>>   {
>> -	if (addr > seg.seg)
>> -		return 0;
>> -	return (size == 0 || size - 1 <= seg.seg - addr);
>> +	if (addr >= TASK_SIZE_MAX)
>> +		return false;
>> +	return size == 0 || size <= TASK_SIZE_MAX - addr;
>>   }
>
> You don't need to test size == 0 anymore. It used to be necessary because 
> of the 'size - 1', as size is unsigned.
>
> Now you can directly do
>
> 	return size <= TASK_SIZE_MAX - addr;
>
> If size is 0, this will always be true (because you already know that addr 
> is not >= TASK_SIZE_MAX

True.  What do you think of Linus' comment about always using the
ppc32 version on ppc64 as well with this?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (14 preceding siblings ...)
  2020-09-03 14:28 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Al Viro
@ 2020-09-03 15:49 ` Linus Torvalds
  2020-09-04  6:00 ` Ingo Molnar
  16 siblings, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2020-09-03 15:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Al Viro, Michael Ellerman, the arch/x86 maintainers,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Thu, Sep 3, 2020 at 7:22 AM Christoph Hellwig <hch@lst.de> wrote:
>
> [Note to Linus: I'd like to get this into linux-next rather earlier
> than later.  Do you think it is ok to add this tree to linux-next?]

This whole series looks really good to me now, with each patch looking
like a clear cleanup on its own.

So ack on the whole series, and yes, please get this into linux-next
and ready for 5.10. Whether through Al or something else.

Thanks,

               Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()
  2020-09-03 15:49     ` Christoph Hellwig
@ 2020-09-03 15:56       ` Christoph Hellwig
  2020-09-03 16:03         ` Christophe Leroy
  0 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-03 15:56 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman,
	x86, linux-arch, Kees Cook, linux-kernel, Luis Chamberlain,
	linux-fsdevel, linuxppc-dev, Alexey Dobriyan

On Thu, Sep 03, 2020 at 05:49:09PM +0200, Christoph Hellwig wrote:
> On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:
> >
> >
> > Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
> >> Stop providing the possibility to override the address space using
> >> set_fs() now that there is no need for that any more.
> >>
> >> Signed-off-by: Christoph Hellwig <hch@lst.de>
> >> ---
> >
> >
> >>   -static inline int __access_ok(unsigned long addr, unsigned long size,
> >> -			mm_segment_t seg)
> >> +static inline bool __access_ok(unsigned long addr, unsigned long size)
> >>   {
> >> -	if (addr > seg.seg)
> >> -		return 0;
> >> -	return (size == 0 || size - 1 <= seg.seg - addr);
> >> +	if (addr >= TASK_SIZE_MAX)
> >> +		return false;
> >> +	return size == 0 || size <= TASK_SIZE_MAX - addr;
> >>   }
> >
> > You don't need to test size == 0 anymore. It used to be necessary because 
> > of the 'size - 1', as size is unsigned.
> >
> > Now you can directly do
> >
> > 	return size <= TASK_SIZE_MAX - addr;
> >
> > If size is 0, this will always be true (because you already know that addr 
> > is not >= TASK_SIZE_MAX
> 
> True.  What do you think of Linus' comment about always using the
> ppc32 version on ppc64 as well with this?

i.e. something like this folded in:

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 5363f7fc6dd06c..be070254e50943 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -11,26 +11,14 @@
 #ifdef __powerpc64__
 /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
 #define TASK_SIZE_MAX		TASK_SIZE_USER64
-
-/*
- * This check is sufficient because there is a large enough gap between user
- * addresses and the kernel addresses.
- */
-static inline bool __access_ok(unsigned long addr, unsigned long size)
-{
-	return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
-}
-
 #else
 #define TASK_SIZE_MAX		TASK_SIZE
+#endif
 
 static inline bool __access_ok(unsigned long addr, unsigned long size)
 {
-	if (addr >= TASK_SIZE_MAX)
-		return false;
-	return size == 0 || size <= TASK_SIZE_MAX - addr;
+	return addr < TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
 }
-#endif /* __powerpc64__ */
 
 #define access_ok(addr, size)		\
 	(__chk_user_ptr(addr),		\

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 14/14] powerpc: remove address space overrides using set_fs()
  2020-09-03 15:56       ` Christoph Hellwig
@ 2020-09-03 16:03         ` Christophe Leroy
  0 siblings, 0 replies; 58+ messages in thread
From: Christophe Leroy @ 2020-09-03 16:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, linux-arch,
	Kees Cook, linux-kernel, Luis Chamberlain, linux-fsdevel,
	linuxppc-dev, Alexey Dobriyan



Le 03/09/2020 à 17:56, Christoph Hellwig a écrit :
> On Thu, Sep 03, 2020 at 05:49:09PM +0200, Christoph Hellwig wrote:
>> On Thu, Sep 03, 2020 at 05:43:25PM +0200, Christophe Leroy wrote:
>>>
>>>
>>> Le 03/09/2020 à 16:22, Christoph Hellwig a écrit :
>>>> Stop providing the possibility to override the address space using
>>>> set_fs() now that there is no need for that any more.
>>>>
>>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>>> ---
>>>
>>>
>>>>    -static inline int __access_ok(unsigned long addr, unsigned long size,
>>>> -			mm_segment_t seg)
>>>> +static inline bool __access_ok(unsigned long addr, unsigned long size)
>>>>    {
>>>> -	if (addr > seg.seg)
>>>> -		return 0;
>>>> -	return (size == 0 || size - 1 <= seg.seg - addr);
>>>> +	if (addr >= TASK_SIZE_MAX)
>>>> +		return false;
>>>> +	return size == 0 || size <= TASK_SIZE_MAX - addr;
>>>>    }
>>>
>>> You don't need to test size == 0 anymore. It used to be necessary because
>>> of the 'size - 1', as size is unsigned.
>>>
>>> Now you can directly do
>>>
>>> 	return size <= TASK_SIZE_MAX - addr;
>>>
>>> If size is 0, this will always be true (because you already know that addr
>>> is not >= TASK_SIZE_MAX
>>
>> True.  What do you think of Linus' comment about always using the
>> ppc32 version on ppc64 as well with this?

I have nothing against it. That's only adding a substract, all args are 
already in registers so that will be in the noise for a modern CPU.

> 
> i.e. something like this folded in:
> 
> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
> index 5363f7fc6dd06c..be070254e50943 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -11,26 +11,14 @@
>   #ifdef __powerpc64__
>   /* We use TASK_SIZE_USER64 as TASK_SIZE is not constant */
>   #define TASK_SIZE_MAX		TASK_SIZE_USER64
> -
> -/*
> - * This check is sufficient because there is a large enough gap between user
> - * addresses and the kernel addresses.
> - */
> -static inline bool __access_ok(unsigned long addr, unsigned long size)
> -{
> -	return addr < TASK_SIZE_MAX && size < TASK_SIZE_MAX;
> -}
> -
>   #else
>   #define TASK_SIZE_MAX		TASK_SIZE
> +#endif
>   
>   static inline bool __access_ok(unsigned long addr, unsigned long size)
>   {
> -	if (addr >= TASK_SIZE_MAX)
> -		return false;
> -	return size == 0 || size <= TASK_SIZE_MAX - addr;
> +	return addr < TASK_SIZE_MAX && size <= TASK_SIZE_MAX - addr;
>   }
> -#endif /* __powerpc64__ */
>   
>   #define access_ok(addr, size)		\
>   	(__chk_user_ptr(addr),		\
> 


Christophe

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-03 14:22 ` [PATCH 12/14] x86: remove address space overrides using set_fs() Christoph Hellwig
@ 2020-09-03 21:30   ` David Laight
  2020-09-03 23:25     ` Linus Torvalds
  2020-09-04  2:55   ` Al Viro
  1 sibling, 1 reply; 58+ messages in thread
From: David Laight @ 2020-09-03 21:30 UTC (permalink / raw)
  To: 'Christoph Hellwig',
	Linus Torvalds, Al Viro, Michael Ellerman, x86
  Cc: Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

From: Christoph Hellwig
> Sent: 03 September 2020 15:23
> 
> Stop providing the possibility to override the address space using
> set_fs() now that there is no need for that any more.  To properly
> handle the TASK_SIZE_MAX checking for 4 vs 5-level page tables on
> x86 a new alternative is introduced, which just like the one in
> entry_64.S has to use the hardcoded virtual address bits to escape
> the fact that TASK_SIZE_MAX isn't actually a constant when 5-level
> page tables are enabled.

Why does it matter whether 4 or 5 level page tables are in use?
Surely all access_ok() needs to do is ensure that a valid kernel
address isn't supplied.
A non-canonical (is that the right term) address between the highest
valid user address and the lowest valid kernel address (7ffe to fffe?)
will fault anyway.
So any limit between the valid user and kernel addresses should
work?
So a limit of 1<<63 would seem appropriate.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-03 21:30   ` David Laight
@ 2020-09-03 23:25     ` Linus Torvalds
  2020-09-04  7:59       ` David Laight
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2020-09-03 23:25 UTC (permalink / raw)
  To: David Laight
  Cc: Christoph Hellwig, Al Viro, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Thu, Sep 3, 2020 at 2:30 PM David Laight <David.Laight@aculab.com> wrote:
>
> A non-canonical (is that the right term) address between the highest
> valid user address and the lowest valid kernel address (7ffe to fffe?)
> will fault anyway.

Yes.

But we actually warn against that fault, because it's been a good way
to catch places that didn't use the proper "access_ok()" pattern.

See ex_handler_uaccess() and the

        WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in
user access. Non-canonical address?");

warning. It's been good for randomized testing - a missing range check
on a user address will often hit this.

Of course, you should never see it in real life (and hopefully not in
testing either any more). But belt-and-suspenders..

              Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-03 14:22 ` [PATCH 12/14] x86: remove address space overrides using set_fs() Christoph Hellwig
  2020-09-03 21:30   ` David Laight
@ 2020-09-04  2:55   ` Al Viro
  2020-09-04  4:41     ` Al Viro
  2020-09-04  6:38     ` Christoph Hellwig
  1 sibling, 2 replies; 58+ messages in thread
From: Al Viro @ 2020-09-04  2:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Michael Ellerman, x86, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:

> diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> index c8a85b512796e1..94f7be4971ed04 100644
> --- a/arch/x86/lib/getuser.S
> +++ b/arch/x86/lib/getuser.S
> @@ -35,10 +35,19 @@
>  #include <asm/smap.h>
>  #include <asm/export.h>
>  
> +#ifdef CONFIG_X86_5LEVEL
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> +	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> +		    "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> +#else
> +#define LOAD_TASK_SIZE_MINUS_N(n) \
> +	mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> +#endif

Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean

#define LOAD_TASK_SIZE_MINUS_N(n) \
	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
		    __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57

there?

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-04  2:55   ` Al Viro
@ 2020-09-04  4:41     ` Al Viro
  2020-09-04  6:38     ` Christoph Hellwig
  1 sibling, 0 replies; 58+ messages in thread
From: Al Viro @ 2020-09-04  4:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Michael Ellerman, x86, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

On Fri, Sep 04, 2020 at 03:55:10AM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:
> 
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index c8a85b512796e1..94f7be4971ed04 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -35,10 +35,19 @@
> >  #include <asm/smap.h>
> >  #include <asm/export.h>
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> > +		    "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> > +#else
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +	mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> > +#endif
> 
> Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> 
> #define LOAD_TASK_SIZE_MINUS_N(n) \
> 	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
> 		    __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
> 
> there?

Pushed out with the following folded in.

diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index 94f7be4971ed..2f052bc96866 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -37,8 +37,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
-		    "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
+	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
+		    __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
 	mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 445374885153..358239d77dff 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -33,8 +33,8 @@
 
 #ifdef CONFIG_X86_5LEVEL
 #define LOAD_TASK_SIZE_MINUS_N(n) \
-	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rbx", \
-		    "mov $((1 << 56) - 4096 - (n)),%rbx", X86_FEATURE_LA57
+	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rbx), \
+		    __stringify(mov $((1 << 56) - 4096 - (n)),%rbx), X86_FEATURE_LA57
 #else
 #define LOAD_TASK_SIZE_MINUS_N(n) \
 	mov $(TASK_SIZE_MAX - (n)),%_ASM_BX

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
                   ` (15 preceding siblings ...)
  2020-09-03 15:49 ` Linus Torvalds
@ 2020-09-04  6:00 ` Ingo Molnar
  2020-09-04 17:58   ` Alexey Dobriyan
  16 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2020-09-04  6:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Linus Torvalds, Al Viro, Michael Ellerman, x86, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev


* Christoph Hellwig <hch@lst.de> wrote:

> Hi all,
> 
> this series removes the last set_fs() used to force a kernel address
> space for the uaccess code in the kernel read/write/splice code, and then
> stops implementing the address space overrides entirely for x86 and
> powerpc.

Cool! For the x86 bits:

  Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-04  2:55   ` Al Viro
  2020-09-04  4:41     ` Al Viro
@ 2020-09-04  6:38     ` Christoph Hellwig
  2020-09-04  7:47       ` Christoph Hellwig
  1 sibling, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-04  6:38 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Fri, Sep 04, 2020 at 03:55:10AM +0100, Al Viro wrote:
> On Thu, Sep 03, 2020 at 04:22:40PM +0200, Christoph Hellwig wrote:
> 
> > diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
> > index c8a85b512796e1..94f7be4971ed04 100644
> > --- a/arch/x86/lib/getuser.S
> > +++ b/arch/x86/lib/getuser.S
> > @@ -35,10 +35,19 @@
> >  #include <asm/smap.h>
> >  #include <asm/export.h>
> >  
> > +#ifdef CONFIG_X86_5LEVEL
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +	ALTERNATIVE "mov $((1 << 47) - 4096 - (n)),%rdx", \
> > +		    "mov $((1 << 56) - 4096 - (n)),%rdx", X86_FEATURE_LA57
> > +#else
> > +#define LOAD_TASK_SIZE_MINUS_N(n) \
> > +	mov $(TASK_SIZE_MAX - (n)),%_ASM_DX
> > +#endif
> 
> Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> 
> #define LOAD_TASK_SIZE_MINUS_N(n) \
> 	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
> 		    __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
> 
> there?

Don't ask me about the how, but it builds and works with X86_5LEVEL,
and the style is copied from elsewhere..

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-04  6:38     ` Christoph Hellwig
@ 2020-09-04  7:47       ` Christoph Hellwig
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-09-04  7:47 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Fri, Sep 04, 2020 at 08:38:13AM +0200, Christoph Hellwig wrote:
> > Wait a sec... how is that supposed to build with X86_5LEVEL?  Do you mean
> > 
> > #define LOAD_TASK_SIZE_MINUS_N(n) \
> > 	ALTERNATIVE __stringify(mov $((1 << 47) - 4096 - (n)),%rdx), \
> > 		    __stringify(mov $((1 << 56) - 4096 - (n)),%rdx), X86_FEATURE_LA57
> > 
> > there?
> 
> Don't ask me about the how, but it builds and works with X86_5LEVEL,
> and the style is copied from elsewhere..

Actually, it doesn't any more.  Looks like the change to pass the n
parameter as suggested by Linus broke the previously working version.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: [PATCH 12/14] x86: remove address space overrides using set_fs()
  2020-09-03 23:25     ` Linus Torvalds
@ 2020-09-04  7:59       ` David Laight
  0 siblings, 0 replies; 58+ messages in thread
From: David Laight @ 2020-09-04  7:59 UTC (permalink / raw)
  To: 'Linus Torvalds'
  Cc: Christoph Hellwig, Al Viro, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

From: Linus Torvalds
> Sent: 04 September 2020 00:26
> 
> On Thu, Sep 3, 2020 at 2:30 PM David Laight <David.Laight@aculab.com> wrote:
> >
> > A non-canonical (is that the right term) address between the highest
> > valid user address and the lowest valid kernel address (7ffe to fffe?)
> > will fault anyway.
> 
> Yes.
> 
> But we actually warn against that fault, because it's been a good way
> to catch places that didn't use the proper "access_ok()" pattern.
> 
> See ex_handler_uaccess() and the
> 
>         WARN_ONCE(trapnr == X86_TRAP_GP, "General protection fault in
> user access. Non-canonical address?");
> 
> warning. It's been good for randomized testing - a missing range check
> on a user address will often hit this.
> 
> Of course, you should never see it in real life (and hopefully not in
> testing either any more). But belt-and-suspenders..

That could still be effective, just pick an address limit that is
appropriate for the one access_ok() is using.

Even if access_ok() uses 1<<63 there are plenty of addresses above it that fault.
But the upper limit for 5-level page tables could be used all the time.

One option is to test '(address | length) < (3<<62)' in access_ok().
That is also moderately suitable for masking invalid addresses to 0.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-04  6:00 ` Ingo Molnar
@ 2020-09-04 17:58   ` Alexey Dobriyan
  2020-09-04 18:42     ` Linus Torvalds
  2020-09-04 21:01     ` David Laight
  0 siblings, 2 replies; 58+ messages in thread
From: Alexey Dobriyan @ 2020-09-04 17:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman,
	x86, Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

On Fri, Sep 04, 2020 at 08:00:24AM +0200, Ingo Molnar wrote:
> * Christoph Hellwig <hch@lst.de> wrote:
> > this series removes the last set_fs() used to force a kernel address
> > space for the uaccess code in the kernel read/write/splice code, and then
> > stops implementing the address space overrides entirely for x86 and
> > powerpc.
> 
> Cool! For the x86 bits:
> 
>   Acked-by: Ingo Molnar <mingo@kernel.org>

set_fs() is older than some kernel hackers!

	$ cd linux-0.11/
	$ find . -type f -name '*.h' | xargs grep -e set_fs -w -n -A3
	./include/asm/segment.h:61:extern inline void set_fs(unsigned long val)
	./include/asm/segment.h-62-{
	./include/asm/segment.h-63-     __asm__("mov %0,%%fs"::"a" ((unsigned short) val));
	./include/asm/segment.h-64-}

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-04 17:58   ` Alexey Dobriyan
@ 2020-09-04 18:42     ` Linus Torvalds
  2020-09-04 21:01     ` David Laight
  1 sibling, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2020-09-04 18:42 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, Christoph Hellwig, Al Viro, Michael Ellerman,
	the arch/x86 maintainers, Luis Chamberlain, Kees Cook,
	Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Fri, Sep 4, 2020 at 10:58 AM Alexey Dobriyan <adobriyan@gmail.com> wrote:
>
> set_fs() is older than some kernel hackers!
>
>         $ cd linux-0.11/
>         $ find . -type f -name '*.h' | xargs grep -e set_fs -w -n -A3

Oh, it's older than that. It was there (as set_fs) in 0.10, and may
even predate that. But sadly, I don't have tar-balls for 0.02 and
0.03, so can't check.

The actual use of %fs as the user space segment is already there in
0.01, but there was no 'set_fs()'. That was a simpler and more direct
time, and "get_fs()" looked like this back then:

  #define _fs() ({ \
  register unsigned short __res; \
  __asm__("mov %%fs,%%ax":"=a" (__res):); \
  __res;})

and all the setting was basically part of the kernel entry asm and. Lovely.

                 Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-04 17:58   ` Alexey Dobriyan
  2020-09-04 18:42     ` Linus Torvalds
@ 2020-09-04 21:01     ` David Laight
  2020-09-05  7:16       ` Christophe Leroy
  1 sibling, 1 reply; 58+ messages in thread
From: David Laight @ 2020-09-04 21:01 UTC (permalink / raw)
  To: 'Alexey Dobriyan', Ingo Molnar
  Cc: Christoph Hellwig, Linus Torvalds, Al Viro, Michael Ellerman,
	x86, Luis Chamberlain, Kees Cook, linux-kernel, linux-fsdevel,
	linux-arch, linuxppc-dev

From: Alexey Dobriyan
> Sent: 04 September 2020 18:58
> 
> On Fri, Sep 04, 2020 at 08:00:24AM +0200, Ingo Molnar wrote:
> > * Christoph Hellwig <hch@lst.de> wrote:
> > > this series removes the last set_fs() used to force a kernel address
> > > space for the uaccess code in the kernel read/write/splice code, and then
> > > stops implementing the address space overrides entirely for x86 and
> > > powerpc.
> >
> > Cool! For the x86 bits:
> >
> >   Acked-by: Ingo Molnar <mingo@kernel.org>
> 
> set_fs() is older than some kernel hackers!
> 
> 	$ cd linux-0.11/
> 	$ find . -type f -name '*.h' | xargs grep -e set_fs -w -n -A3
> 	./include/asm/segment.h:61:extern inline void set_fs(unsigned long val)
> 	./include/asm/segment.h-62-{
> 	./include/asm/segment.h-63-     __asm__("mov %0,%%fs"::"a" ((unsigned short) val));
> 	./include/asm/segment.h-64-}

What is this strange %fs register you are talking about.
Figure 2-4 only has CS, DS, SS and ES.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-04 21:01     ` David Laight
@ 2020-09-05  7:16       ` Christophe Leroy
  2020-09-05 10:13         ` David Laight
  0 siblings, 1 reply; 58+ messages in thread
From: Christophe Leroy @ 2020-09-05  7:16 UTC (permalink / raw)
  To: David Laight, 'Alexey Dobriyan', Ingo Molnar
  Cc: linux-arch, linuxppc-dev, Kees Cook, x86, linux-kernel,
	Luis Chamberlain, Al Viro, linux-fsdevel, Linus Torvalds,
	Christoph Hellwig



Le 04/09/2020 à 23:01, David Laight a écrit :
> From: Alexey Dobriyan
>> Sent: 04 September 2020 18:58
>>
>> On Fri, Sep 04, 2020 at 08:00:24AM +0200, Ingo Molnar wrote:
>>> * Christoph Hellwig <hch@lst.de> wrote:
>>>> this series removes the last set_fs() used to force a kernel address
>>>> space for the uaccess code in the kernel read/write/splice code, and then
>>>> stops implementing the address space overrides entirely for x86 and
>>>> powerpc.
>>>
>>> Cool! For the x86 bits:
>>>
>>>    Acked-by: Ingo Molnar <mingo@kernel.org>
>>
>> set_fs() is older than some kernel hackers!
>>
>> 	$ cd linux-0.11/
>> 	$ find . -type f -name '*.h' | xargs grep -e set_fs -w -n -A3
>> 	./include/asm/segment.h:61:extern inline void set_fs(unsigned long val)
>> 	./include/asm/segment.h-62-{
>> 	./include/asm/segment.h-63-     __asm__("mov %0,%%fs"::"a" ((unsigned short) val));
>> 	./include/asm/segment.h-64-}
> 
> What is this strange %fs register you are talking about.
> Figure 2-4 only has CS, DS, SS and ES.
> 

Intel added registers FS and GS in the i386

Christophe

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-05  7:16       ` Christophe Leroy
@ 2020-09-05 10:13         ` David Laight
  0 siblings, 0 replies; 58+ messages in thread
From: David Laight @ 2020-09-05 10:13 UTC (permalink / raw)
  To: 'Christophe Leroy', 'Alexey Dobriyan', Ingo Molnar
  Cc: linux-arch, linuxppc-dev, Kees Cook, x86, linux-kernel,
	Luis Chamberlain, Al Viro, linux-fsdevel, Linus Torvalds,
	Christoph Hellwig

From: Christophe Leroy
> Sent: 05 September 2020 08:16
> 
> Le 04/09/2020 à 23:01, David Laight a écrit :
> > From: Alexey Dobriyan
> >> Sent: 04 September 2020 18:58
...
> > What is this strange %fs register you are talking about.
> > Figure 2-4 only has CS, DS, SS and ES.
> >
> 
> Intel added registers FS and GS in the i386

I know, I've got both the 'iAPX 286 Programmer's Reference Manual'
and the '80386 Programmer's Reference Manual' on my shelf.

I don't have the 8088 book though - which I used in 1982.

The old books are a lot easier to read if, for instance,
you are trying to work out how to back and forth to real mode
to do bios calls.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-03 14:28 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Al Viro
  2020-09-03 14:30   ` Christoph Hellwig
@ 2020-09-09 17:31   ` Linus Torvalds
  2020-09-09 18:40     ` Segher Boessenkool
  1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2020-09-09 17:31 UTC (permalink / raw)
  To: Al Viro, Nick Desaulniers
  Cc: Christoph Hellwig, Michael Ellerman, the arch/x86 maintainers,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook,
	Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3002 bytes --]

On Thu, Sep 3, 2020 at 7:28 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> I can live with this series; do you want that in vfs.git#for-next?

Well, it's apparently there now (at least it's in your base.set_fs
branch, I didn't check actual -next).

So this is just a heads-up that I plan to merge the "asm goto" changes
on top of this during 5.10. Nick did the patch to make my patch-set
work either with or without the actual asm goto support, and I've been
running it privately now for several months.

And apparently there are people working on this on the gcc side too,
so it won't just be clang-specific. Nor kernel-specific in that Nick
tells me some other projects are looking at using that asm goto with
outputs too.

Anyway, the actual patch to use asm goto with outputs is fairly small
and not that interesting to people (since no released compiler
supports it), but part of the infrastructure to make it tiny is to
just get rid of the inlined "__get_user()" and "__put_user()" stuff.
I've ranted against those functions for a few years by now, so part of
this is to stop inlining them and make people think they are "good",
but part of it is also that those macros and inline functions are the
main remaining ones that mess with this all.

I'm attaching the two __get_user/__put_user patches here in case
anybody cares, but these are the pre-rebased ones, I'll make them work
with the new world order as it happens. The main change is:

 (a) unify around a common special calling convention:
   - %bx is clobbered
   - %cx contains the user address on input, and the error value on output
   - %ax/%dx contains the actual value (input for put, output for get,
of course)

 (b) unify around using just a "call", using the model that
get/put_user already did.
   - use "*_nocheck" for the double-underscore versions
   - this still has to use inline asm because the calling convention is odd
   - otherwise basically just a "call __{get,put}_user_[nocheck_]X"
where X is the size.

IOW, we unify around one single calling convention., and one single
model for actually getting things done.

I still want to remove the double-underscore versions entirely some
day - they have absolutely zero advantages compared to the full "do
address_ok as part of the operation" - but that's a separate thing. At
least they can be unified.

And the reason for this all is obviously that now *only* the
"unsafe_{get,put}_user()" cases with the error label output are the
"fast inlined" cases. They are the only ones that _can_ be done
quickly inline, since the slow clac/stac is not part of them. Plus
they already have that unified usage model of the error label, even if
unsafe_get_user() currently does it manually because "asm goto" with
outputs doesn't work in existing compilers.

Comments?

I suspect people won't care, but I thought I'd post these so that
there won't be any surprises during the next merge window when I apply
them after merging the set_fs() removal branch..

                 Linus

[-- Attachment #2: 0001-x86-Make-__get_user-generate-an-out-of-line-call.patch --]
[-- Type: text/x-patch, Size: 10283 bytes --]

From 52c7574a0d15722df52158a3d766803662d9a6ff Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 8 Apr 2020 12:50:01 -0700
Subject: [PATCH 1/6] x86: Make __get_user() generate an out-of-line call

Instead of inlining the whole stac/lfence/mov/clac sequence (which also
requires individual exception table entries and several asm instruction
alternatives entries), just generate "call __get_user_nocheck_X" for the
__get_user() cases.

We can use all the same infrastructure that we already do for the
regular "get_user()", and the end result is simpler source code, and
much simpler code generation.

It also measn that when I introduce asm goto with input for
"unsafe_get_user()", there are no nasty interactions with the
__get_user() code.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/include/asm/uaccess.h | 128 ++++++++++++++-------------------
 arch/x86/lib/getuser.S         |  60 ++++++++++++++++
 2 files changed, 114 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index ecefaffd15d4..cf5a3f61db3b 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -96,25 +96,14 @@ static inline bool pagefault_disabled(void);
 	likely(!__range_not_ok(addr, size, user_addr_max()));		\
 })
 
-/*
- * These are the main single-value transfer routines.  They automatically
- * use the right size if we just have the right pointer type.
- *
- * This gets kind of ugly. We want to return _two_ values in "get_user()"
- * and yet we don't want to do any pointers, because that is too much
- * of a performance impact. Thus we have a few rather ugly macros here,
- * and hide all the ugliness from the user.
- *
- * The "__xxx" versions of the user access functions are versions that
- * do not verify the address space, that must have been done previously
- * with a separate "access_ok()" call (this is used when we do multiple
- * accesses to the same area of user memory).
- */
-
 extern int __get_user_1(void);
 extern int __get_user_2(void);
 extern int __get_user_4(void);
 extern int __get_user_8(void);
+extern int __get_user_nocheck_1(void);
+extern int __get_user_nocheck_2(void);
+extern int __get_user_nocheck_4(void);
+extern int __get_user_nocheck_8(void);
 extern int __get_user_bad(void);
 
 #define __uaccess_begin() stac()
@@ -138,25 +127,12 @@ extern int __get_user_bad(void);
 #define __typefits(x,type,not) \
 	__builtin_choose_expr(sizeof(x)<=sizeof(type),(unsigned type)0,not)
 
-/**
- * get_user - Get a simple variable from user space.
- * @x:   Variable to store result.
- * @ptr: Source address, in user space.
- *
- * Context: User context only. This function may sleep if pagefaults are
- *          enabled.
- *
- * This macro copies a single simple variable from user space to kernel
- * space.  It supports simple types like char and int, but not larger
- * data types like structures or arrays.
- *
- * @ptr must have pointer-to-simple-variable type, and the result of
- * dereferencing @ptr must be assignable to @x without a cast.
- *
- * Return: zero on success, or -EFAULT on error.
- * On error, the variable @x is set to zero.
- */
 /*
+ * This is used for both get_user() and __get_user() to expand to
+ * the proper special function call that has odd calling conventions
+ * due to returning both a value and an error, and that depends on
+ * the size of the pointer passed in.
+ *
  * Careful: we have to cast the result to the type of the pointer
  * for sign reasons.
  *
@@ -169,13 +145,12 @@ extern int __get_user_bad(void);
  * Clang/LLVM cares about the size of the register, but still wants
  * the base register for something that ends up being a pair.
  */
-#define get_user(x, ptr)						\
+#define do_get_user_call(fn,x,ptr)					\
 ({									\
 	int __ret_gu;							\
 	register __inttype(*(ptr)) __val_gu asm("%"_ASM_DX);		\
 	__chk_user_ptr(ptr);						\
-	might_fault();							\
-	asm volatile("call __get_user_%P4"				\
+	asm volatile("call __" #fn "_%P4"				\
 		     : "=a" (__ret_gu), "=r" (__val_gu),		\
 			ASM_CALL_CONSTRAINT				\
 		     : "0" (ptr), "i" (sizeof(*(ptr))));		\
@@ -183,6 +158,49 @@ extern int __get_user_bad(void);
 	__builtin_expect(__ret_gu, 0);					\
 })
 
+/**
+ * get_user - Get a simple variable from user space.
+ * @x:   Variable to store result.
+ * @ptr: Source address, in user space.
+ *
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
+ *
+ * This macro copies a single simple variable from user space to kernel
+ * space.  It supports simple types like char and int, but not larger
+ * data types like structures or arrays.
+ *
+ * @ptr must have pointer-to-simple-variable type, and the result of
+ * dereferencing @ptr must be assignable to @x without a cast.
+ *
+ * Return: zero on success, or -EFAULT on error.
+ * On error, the variable @x is set to zero.
+ */
+#define get_user(x,ptr) ({ might_fault(); do_get_user_call(get_user,x,ptr); })
+
+/**
+ * __get_user - Get a simple variable from user space, with less checking.
+ * @x:   Variable to store result.
+ * @ptr: Source address, in user space.
+ *
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
+ *
+ * This macro copies a single simple variable from user space to kernel
+ * space.  It supports simple types like char and int, but not larger
+ * data types like structures or arrays.
+ *
+ * @ptr must have pointer-to-simple-variable type, and the result of
+ * dereferencing @ptr must be assignable to @x without a cast.
+ *
+ * Caller must check the pointer with access_ok() before calling this
+ * function.
+ *
+ * Return: zero on success, or -EFAULT on error.
+ * On error, the variable @x is set to zero.
+ */
+#define __get_user(x,ptr) do_get_user_call(get_user_nocheck,x,ptr)
+
 #define __put_user_x(size, x, ptr, __ret_pu)			\
 	asm volatile("call __put_user_" #size : "=a" (__ret_pu)	\
 		     : "0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")
@@ -367,19 +385,6 @@ __pu_label:							\
 	__builtin_expect(__pu_err, 0);				\
 })
 
-#define __get_user_nocheck(x, ptr, size)				\
-({									\
-	int __gu_err;							\
-	__inttype(*(ptr)) __gu_val;					\
-	__typeof__(ptr) __gu_ptr = (ptr);				\
-	__typeof__(size) __gu_size = (size);				\
-	__uaccess_begin_nospec();					\
-	__get_user_size(__gu_val, __gu_ptr, __gu_size, __gu_err);	\
-	__uaccess_end();						\
-	(x) = (__force __typeof__(*(ptr)))__gu_val;			\
-	__builtin_expect(__gu_err, 0);					\
-})
-
 /* FIXME: this hack is definitely wrong -AK */
 struct __large_struct { unsigned long buf[100]; };
 #define __m(x) (*(struct __large_struct __user *)(x))
@@ -396,31 +401,6 @@ struct __large_struct { unsigned long buf[100]; };
 		: : ltype(x), "m" (__m(addr))				\
 		: : label)
 
-/**
- * __get_user - Get a simple variable from user space, with less checking.
- * @x:   Variable to store result.
- * @ptr: Source address, in user space.
- *
- * Context: User context only. This function may sleep if pagefaults are
- *          enabled.
- *
- * This macro copies a single simple variable from user space to kernel
- * space.  It supports simple types like char and int, but not larger
- * data types like structures or arrays.
- *
- * @ptr must have pointer-to-simple-variable type, and the result of
- * dereferencing @ptr must be assignable to @x without a cast.
- *
- * Caller must check the pointer with access_ok() before calling this
- * function.
- *
- * Return: zero on success, or -EFAULT on error.
- * On error, the variable @x is set to zero.
- */
-
-#define __get_user(x, ptr)						\
-	__get_user_nocheck((x), (ptr), sizeof(*(ptr)))
-
 /**
  * __put_user - Write a simple value into user space, with less checking.
  * @x:   Value to copy to user space.
diff --git a/arch/x86/lib/getuser.S b/arch/x86/lib/getuser.S
index c8a85b512796..2cd902e06062 100644
--- a/arch/x86/lib/getuser.S
+++ b/arch/x86/lib/getuser.S
@@ -35,6 +35,8 @@
 #include <asm/smap.h>
 #include <asm/export.h>
 
+#define ASM_BARRIER_NOSPEC ALTERNATIVE "", "lfence", X86_FEATURE_LFENCE_RDTSC
+
 	.text
 SYM_FUNC_START(__get_user_1)
 	mov PER_CPU_VAR(current_task), %_ASM_DX
@@ -114,6 +116,52 @@ SYM_FUNC_START(__get_user_8)
 SYM_FUNC_END(__get_user_8)
 EXPORT_SYMBOL(__get_user_8)
 
+/* .. and the same for __get_user, just without the range checks */
+SYM_FUNC_START(__get_user_nocheck_1)
+	ASM_STAC
+	ASM_BARRIER_NOSPEC
+6:	movzbl (%_ASM_AX),%edx
+	xor %eax,%eax
+	ASM_CLAC
+	ret
+SYM_FUNC_END(__get_user_nocheck_1)
+EXPORT_SYMBOL(__get_user_nocheck_1)
+
+SYM_FUNC_START(__get_user_nocheck_2)
+	ASM_STAC
+	ASM_BARRIER_NOSPEC
+7:	movzwl (%_ASM_AX),%edx
+	xor %eax,%eax
+	ASM_CLAC
+	ret
+SYM_FUNC_END(__get_user_nocheck_2)
+EXPORT_SYMBOL(__get_user_nocheck_2)
+
+SYM_FUNC_START(__get_user_nocheck_4)
+	ASM_STAC
+	ASM_BARRIER_NOSPEC
+8:	movl (%_ASM_AX),%edx
+	xor %eax,%eax
+	ASM_CLAC
+	ret
+SYM_FUNC_END(__get_user_nocheck_4)
+EXPORT_SYMBOL(__get_user_nocheck_4)
+
+SYM_FUNC_START(__get_user_nocheck_8)
+	ASM_STAC
+	ASM_BARRIER_NOSPEC
+#ifdef CONFIG_X86_64
+9:	movq (%_ASM_AX),%rdx
+#else
+9:	movl (%_ASM_AX),%edx
+10:	movl 4(%_ASM_AX),%ecx
+#endif
+	xor %eax,%eax
+	ASM_CLAC
+	ret
+SYM_FUNC_END(__get_user_nocheck_8)
+EXPORT_SYMBOL(__get_user_nocheck_8)
+
 
 SYM_CODE_START_LOCAL(.Lbad_get_user_clac)
 	ASM_CLAC
@@ -134,6 +182,7 @@ bad_get_user_8:
 SYM_CODE_END(.Lbad_get_user_8_clac)
 #endif
 
+/* get_user */
 	_ASM_EXTABLE_UA(1b, .Lbad_get_user_clac)
 	_ASM_EXTABLE_UA(2b, .Lbad_get_user_clac)
 	_ASM_EXTABLE_UA(3b, .Lbad_get_user_clac)
@@ -143,3 +192,14 @@ SYM_CODE_END(.Lbad_get_user_8_clac)
 	_ASM_EXTABLE_UA(4b, .Lbad_get_user_8_clac)
 	_ASM_EXTABLE_UA(5b, .Lbad_get_user_8_clac)
 #endif
+
+/* __get_user */
+	_ASM_EXTABLE_UA(6b, .Lbad_get_user_clac)
+	_ASM_EXTABLE_UA(7b, .Lbad_get_user_clac)
+	_ASM_EXTABLE_UA(8b, .Lbad_get_user_clac)
+#ifdef CONFIG_X86_64
+	_ASM_EXTABLE_UA(9b, .Lbad_get_user_clac)
+#else
+	_ASM_EXTABLE_UA(9b, .Lbad_get_user_8_clac)
+	_ASM_EXTABLE_UA(10b, .Lbad_get_user_8_clac)
+#endif
-- 
2.28.0.218.gc12ef3d349


[-- Attachment #3: 0002-x86-Make-__put_user-generate-an-out-of-line-call.patch --]
[-- Type: text/x-patch, Size: 9905 bytes --]

From 3b07edf57b7d5a705a2fcf1ec92c9399efa51441 Mon Sep 17 00:00:00 2001
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Wed, 8 Apr 2020 13:36:49 -0700
Subject: [PATCH 2/6] x86: Make __put_user() generate an out-of-line call

Instead of inlining the stac/mov/clac sequence (which also requires
individual exception table entries and several asm instruction
alternatives entries), just generate "call __put_user_nocheck_X" for the
__put_user() cases, the same way we changed __get_user earlier.

Unlike the get_user() case, we didn't have the same nice infrastructure
to just generate the call with a single case, so this actually has to
change some of the infrastructure in order to do this.  But that only
cleans up the code further.

So now, instead of using a case statement for the sizes, we just do the
same thing we've done on tge get_user() side for a long time: use the
size as an immediate constant to the asm, and generate the asm that way
directly.

In order to handle the special case of 64-bit data on a 32-bit kernel, I
needed to change the calling convention slightly: the data is passed in
%eax[:%edx], the pointer in %ecx, and the return value is also returned
in %ecx.  It used to be returned in %eax, but because of how %eax can
now be a double register input, we don't want mix that with a
single-register output.

The actual low-level asm is easier to handle: we'll just share the code
between the checking and non-checking case, with the non-checking case
jumping into the middle of the function.  That may sound a bit too
special, but this code is all very very special anyway, so...

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 arch/x86/include/asm/uaccess.h | 119 ++++++++++++---------------------
 arch/x86/lib/putuser.S         |  22 ++++--
 2 files changed, 60 insertions(+), 81 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index cf5a3f61db3b..9a96399d76f0 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -201,11 +201,6 @@ extern int __get_user_bad(void);
  */
 #define __get_user(x,ptr) do_get_user_call(get_user_nocheck,x,ptr)
 
-#define __put_user_x(size, x, ptr, __ret_pu)			\
-	asm volatile("call __put_user_" #size : "=a" (__ret_pu)	\
-		     : "0" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")
-
-
 
 #ifdef CONFIG_X86_32
 #define __put_user_goto_u64(x, addr, label)			\
@@ -217,25 +212,41 @@ extern int __get_user_bad(void);
 		     : : "A" (x), "r" (addr)			\
 		     : : label)
 
-#define __put_user_x8(x, ptr, __ret_pu)				\
-	asm volatile("call __put_user_8" : "=a" (__ret_pu)	\
-		     : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")
 #else
 #define __put_user_goto_u64(x, ptr, label) \
 	__put_user_goto(x, ptr, "q", "er", label)
-#define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu)
 #endif
 
 extern void __put_user_bad(void);
 
 /*
  * Strange magic calling convention: pointer in %ecx,
- * value in %eax(:%edx), return value in %eax. clobbers %rbx
+ * value in %eax(:%edx), return value in %ecx. clobbers %rbx
  */
 extern void __put_user_1(void);
 extern void __put_user_2(void);
 extern void __put_user_4(void);
 extern void __put_user_8(void);
+extern void __put_user_nocheck_1(void);
+extern void __put_user_nocheck_2(void);
+extern void __put_user_nocheck_4(void);
+extern void __put_user_nocheck_8(void);
+
+#define do_put_user_call(fn,x,ptr)					\
+({									\
+	int __ret_pu;							\
+	register __typeof__(*(ptr)) __val_pu asm("%"_ASM_AX);		\
+	__chk_user_ptr(ptr);						\
+	__val_pu = (x);							\
+	asm volatile("call __" #fn "_%P[size]"				\
+		     : "=c" (__ret_pu),					\
+			ASM_CALL_CONSTRAINT				\
+		     : "0" (ptr),					\
+		       "r" (__val_pu),					\
+		       [size] "i" (sizeof(*(ptr)))			\
+		     :"ebx");						\
+	__builtin_expect(__ret_pu, 0);					\
+})
 
 /**
  * put_user - Write a simple value into user space.
@@ -254,32 +265,29 @@ extern void __put_user_8(void);
  *
  * Return: zero on success, or -EFAULT on error.
  */
-#define put_user(x, ptr)					\
-({								\
-	int __ret_pu;						\
-	__typeof__(*(ptr)) __pu_val;				\
-	__chk_user_ptr(ptr);					\
-	might_fault();						\
-	__pu_val = x;						\
-	switch (sizeof(*(ptr))) {				\
-	case 1:							\
-		__put_user_x(1, __pu_val, ptr, __ret_pu);	\
-		break;						\
-	case 2:							\
-		__put_user_x(2, __pu_val, ptr, __ret_pu);	\
-		break;						\
-	case 4:							\
-		__put_user_x(4, __pu_val, ptr, __ret_pu);	\
-		break;						\
-	case 8:							\
-		__put_user_x8(__pu_val, ptr, __ret_pu);		\
-		break;						\
-	default:						\
-		__put_user_x(X, __pu_val, ptr, __ret_pu);	\
-		break;						\
-	}							\
-	__builtin_expect(__ret_pu, 0);				\
-})
+#define put_user(x, ptr) ({ might_fault(); do_put_user_call(put_user,x,ptr); })
+
+/**
+ * __put_user - Write a simple value into user space, with less checking.
+ * @x:   Value to copy to user space.
+ * @ptr: Destination address, in user space.
+ *
+ * Context: User context only. This function may sleep if pagefaults are
+ *          enabled.
+ *
+ * This macro copies a single simple value from kernel space to user
+ * space.  It supports simple types like char and int, but not larger
+ * data types like structures or arrays.
+ *
+ * @ptr must have pointer-to-simple-variable type, and @x must be assignable
+ * to the result of dereferencing @ptr.
+ *
+ * Caller must check the pointer with access_ok() before calling this
+ * function.
+ *
+ * Return: zero on success, or -EFAULT on error.
+ */
+#define __put_user(x, ptr) do_put_user_call(put_user_nocheck,x,ptr)
 
 #define __put_user_size(x, ptr, size, label)				\
 do {									\
@@ -370,21 +378,6 @@ do {									\
 		     : [umem] "m" (__m(addr)),				\
 		       [efault] "i" (-EFAULT), "0" (err))
 
-#define __put_user_nocheck(x, ptr, size)			\
-({								\
-	__label__ __pu_label;					\
-	int __pu_err = -EFAULT;					\
-	__typeof__(*(ptr)) __pu_val = (x);			\
-	__typeof__(ptr) __pu_ptr = (ptr);			\
-	__typeof__(size) __pu_size = (size);			\
-	__uaccess_begin();					\
-	__put_user_size(__pu_val, __pu_ptr, __pu_size, __pu_label);	\
-	__pu_err = 0;						\
-__pu_label:							\
-	__uaccess_end();					\
-	__builtin_expect(__pu_err, 0);				\
-})
-
 /* FIXME: this hack is definitely wrong -AK */
 struct __large_struct { unsigned long buf[100]; };
 #define __m(x) (*(struct __large_struct __user *)(x))
@@ -401,30 +394,6 @@ struct __large_struct { unsigned long buf[100]; };
 		: : ltype(x), "m" (__m(addr))				\
 		: : label)
 
-/**
- * __put_user - Write a simple value into user space, with less checking.
- * @x:   Value to copy to user space.
- * @ptr: Destination address, in user space.
- *
- * Context: User context only. This function may sleep if pagefaults are
- *          enabled.
- *
- * This macro copies a single simple value from kernel space to user
- * space.  It supports simple types like char and int, but not larger
- * data types like structures or arrays.
- *
- * @ptr must have pointer-to-simple-variable type, and @x must be assignable
- * to the result of dereferencing @ptr.
- *
- * Caller must check the pointer with access_ok() before calling this
- * function.
- *
- * Return: zero on success, or -EFAULT on error.
- */
-
-#define __put_user(x, ptr)						\
-	__put_user_nocheck((__typeof__(*(ptr)))(x), (ptr), sizeof(*(ptr)))
-
 extern unsigned long
 copy_from_user_nmi(void *to, const void __user *from, unsigned long n);
 extern __must_check long
diff --git a/arch/x86/lib/putuser.S b/arch/x86/lib/putuser.S
index 7c7c92db8497..b34a17763f28 100644
--- a/arch/x86/lib/putuser.S
+++ b/arch/x86/lib/putuser.S
@@ -25,7 +25,9 @@
  * Inputs:	%eax[:%edx] contains the data
  *		%ecx contains the address
  *
- * Outputs:	%eax is error code (0 or -EFAULT)
+ * Outputs:	%ecx is error code (0 or -EFAULT)
+ *
+ * Clobbers:	%ebx needed for task pointer
  *
  * These functions should not modify any other registers,
  * as they get called from within inline assembly.
@@ -38,13 +40,15 @@ SYM_FUNC_START(__put_user_1)
 	ENTER
 	cmp TASK_addr_limit(%_ASM_BX),%_ASM_CX
 	jae .Lbad_put_user
+SYM_INNER_LABEL(__put_user_nocheck_1, SYM_L_GLOBAL)
 	ASM_STAC
 1:	movb %al,(%_ASM_CX)
-	xor %eax,%eax
+	xor %ecx,%ecx
 	ASM_CLAC
 	ret
 SYM_FUNC_END(__put_user_1)
 EXPORT_SYMBOL(__put_user_1)
+EXPORT_SYMBOL(__put_user_nocheck_1)
 
 SYM_FUNC_START(__put_user_2)
 	ENTER
@@ -52,13 +56,15 @@ SYM_FUNC_START(__put_user_2)
 	sub $1,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
+SYM_INNER_LABEL(__put_user_nocheck_2, SYM_L_GLOBAL)
 	ASM_STAC
 2:	movw %ax,(%_ASM_CX)
-	xor %eax,%eax
+	xor %ecx,%ecx
 	ASM_CLAC
 	ret
 SYM_FUNC_END(__put_user_2)
 EXPORT_SYMBOL(__put_user_2)
+EXPORT_SYMBOL(__put_user_nocheck_2)
 
 SYM_FUNC_START(__put_user_4)
 	ENTER
@@ -66,13 +72,15 @@ SYM_FUNC_START(__put_user_4)
 	sub $3,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
+SYM_INNER_LABEL(__put_user_nocheck_4, SYM_L_GLOBAL)
 	ASM_STAC
 3:	movl %eax,(%_ASM_CX)
-	xor %eax,%eax
+	xor %ecx,%ecx
 	ASM_CLAC
 	ret
 SYM_FUNC_END(__put_user_4)
 EXPORT_SYMBOL(__put_user_4)
+EXPORT_SYMBOL(__put_user_nocheck_4)
 
 SYM_FUNC_START(__put_user_8)
 	ENTER
@@ -80,21 +88,23 @@ SYM_FUNC_START(__put_user_8)
 	sub $7,%_ASM_BX
 	cmp %_ASM_BX,%_ASM_CX
 	jae .Lbad_put_user
+SYM_INNER_LABEL(__put_user_nocheck_8, SYM_L_GLOBAL)
 	ASM_STAC
 4:	mov %_ASM_AX,(%_ASM_CX)
 #ifdef CONFIG_X86_32
 5:	movl %edx,4(%_ASM_CX)
 #endif
-	xor %eax,%eax
+	xor %ecx,%ecx
 	ASM_CLAC
 	RET
 SYM_FUNC_END(__put_user_8)
 EXPORT_SYMBOL(__put_user_8)
+EXPORT_SYMBOL(__put_user_nocheck_8)
 
 SYM_CODE_START_LOCAL(.Lbad_put_user_clac)
 	ASM_CLAC
 .Lbad_put_user:
-	movl $-EFAULT,%eax
+	movl $-EFAULT,%ecx
 	RET
 SYM_CODE_END(.Lbad_put_user_clac)
 
-- 
2.28.0.218.gc12ef3d349


^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-09 17:31   ` Linus Torvalds
@ 2020-09-09 18:40     ` Segher Boessenkool
  2020-09-09 21:33       ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Segher Boessenkool @ 2020-09-09 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Nick Desaulniers, linux-arch, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Alexey Dobriyan, Luis Chamberlain, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

On Wed, Sep 09, 2020 at 10:31:34AM -0700, Linus Torvalds wrote:
> And apparently there are people working on this on the gcc side too,
> so it won't just be clang-specific. Nor kernel-specific in that Nick
> tells me some other projects are looking at using that asm goto with
> outputs too.

It will not work like this in GCC, no.  The LLVM people know about that.
I do not know why they insist on pushing this, being incompatible and
everything.


Segher

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-09 18:40     ` Segher Boessenkool
@ 2020-09-09 21:33       ` Linus Torvalds
  2020-09-10  8:04         ` David Laight
  2020-09-10 15:44         ` Segher Boessenkool
  0 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2020-09-09 21:33 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Al Viro, Nick Desaulniers, linux-arch, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Alexey Dobriyan, Luis Chamberlain, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
<segher@kernel.crashing.org> wrote:
>
> It will not work like this in GCC, no.  The LLVM people know about that.
> I do not know why they insist on pushing this, being incompatible and
> everything.

Umm. Since they'd be the ones supporting this, *gcc* would be the
incompatible one, not clang.

Like it or not, clang is becoming a major kernel compiler. It's
already basically used for all android uses afaik.

So I'd phrase it differently. If gcc is planning on doing some
different model for asm goto with outputs, that would be the
incompatible case.

I'm not sure how gcc could do it differently. The only possible
difference I see is

 (a) not doing it at all

 (b) doing the "all goto targets have the outputs" case

and honestly, (b) is actually inferior for the error cases, even if to
a compiler person it might feel like the "RightThing(tm)" to do.
Because when an exception happens, the outputs simply won't be
initialized.

Anyway, for either of those cases, the kernel won't care either way.
We'll have to support the non-goto case for many years even if
everybody were to magically implement it today, so it's not like this
is a "you have to do it" thing.

           Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-09 21:33       ` Linus Torvalds
@ 2020-09-10  8:04         ` David Laight
  2020-09-10  8:13           ` Christophe Leroy
  2020-09-10 15:44         ` Segher Boessenkool
  1 sibling, 1 reply; 58+ messages in thread
From: David Laight @ 2020-09-10  8:04 UTC (permalink / raw)
  To: 'Linus Torvalds', Segher Boessenkool
  Cc: Al Viro, Nick Desaulniers, linux-arch, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Alexey Dobriyan, Luis Chamberlain, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

From: Linus Torvalds
> Sent: 09 September 2020 22:34
> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > It will not work like this in GCC, no.  The LLVM people know about that.
> > I do not know why they insist on pushing this, being incompatible and
> > everything.
> 
> Umm. Since they'd be the ones supporting this, *gcc* would be the
> incompatible one, not clang.

I had an 'interesting' idea.

Can you use a local asm register variable as an input and output to
an 'asm volatile goto' statement?

Well you can - but is it guaranteed to work :-)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10  8:04         ` David Laight
@ 2020-09-10  8:13           ` Christophe Leroy
  2020-09-10  9:26             ` David Laight
  0 siblings, 1 reply; 58+ messages in thread
From: Christophe Leroy @ 2020-09-10  8:13 UTC (permalink / raw)
  To: David Laight, 'Linus Torvalds', Segher Boessenkool
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Christoph Hellwig,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Alexey Dobriyan



Le 10/09/2020 à 10:04, David Laight a écrit :
> From: Linus Torvalds
>> Sent: 09 September 2020 22:34
>> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
>> <segher@kernel.crashing.org> wrote:
>>>
>>> It will not work like this in GCC, no.  The LLVM people know about that.
>>> I do not know why they insist on pushing this, being incompatible and
>>> everything.
>>
>> Umm. Since they'd be the ones supporting this, *gcc* would be the
>> incompatible one, not clang.
> 
> I had an 'interesting' idea.
> 
> Can you use a local asm register variable as an input and output to
> an 'asm volatile goto' statement?
> 
> Well you can - but is it guaranteed to work :-)
> 

With gcc at least it should work according to 
https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html

They even explicitely tell: "The only supported use for this feature is 
to specify registers for input and output operands when calling Extended 
asm "

Christophe

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10  8:13           ` Christophe Leroy
@ 2020-09-10  9:26             ` David Laight
  2020-09-10 12:26               ` David Laight
  2020-09-10 15:16               ` Segher Boessenkool
  0 siblings, 2 replies; 58+ messages in thread
From: David Laight @ 2020-09-10  9:26 UTC (permalink / raw)
  To: 'Christophe Leroy', 'Linus Torvalds', Segher Boessenkool
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Christoph Hellwig,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Alexey Dobriyan

From: Christophe Leroy
> Sent: 10 September 2020 09:14
> 
> Le 10/09/2020 à 10:04, David Laight a écrit :
> > From: Linus Torvalds
> >> Sent: 09 September 2020 22:34
> >> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
> >> <segher@kernel.crashing.org> wrote:
> >>>
> >>> It will not work like this in GCC, no.  The LLVM people know about that.
> >>> I do not know why they insist on pushing this, being incompatible and
> >>> everything.
> >>
> >> Umm. Since they'd be the ones supporting this, *gcc* would be the
> >> incompatible one, not clang.
> >
> > I had an 'interesting' idea.
> >
> > Can you use a local asm register variable as an input and output to
> > an 'asm volatile goto' statement?
> >
> > Well you can - but is it guaranteed to work :-)
> >
> 
> With gcc at least it should work according to
> https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
> 
> They even explicitely tell: "The only supported use for this feature is
> to specify registers for input and output operands when calling Extended
> asm "

A quick test isn't good....

int bar(char *z)
{
        __label__ label;
        register int eax asm ("eax") = 6;
        asm volatile goto (" mov $1, %%eax" ::: "eax" : label);

label:
        return eax;
}

0000000000000040 <bar>:
  40:   b8 01 00 00 00          mov    $0x1,%eax
  45:   b8 06 00 00 00          mov    $0x6,%eax
  4a:   c3                      retq

although adding:
        asm volatile ("" : "+r" (eax));
either side of the 'asm volatile goto' does fix it.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10  9:26             ` David Laight
@ 2020-09-10 12:26               ` David Laight
  2020-09-10 15:20                 ` Segher Boessenkool
  2020-09-10 15:16               ` Segher Boessenkool
  1 sibling, 1 reply; 58+ messages in thread
From: David Laight @ 2020-09-10 12:26 UTC (permalink / raw)
  To: David Laight, 'Christophe Leroy',
	'Linus Torvalds',
	Segher Boessenkool
  Cc: linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Alexey Dobriyan,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

From: David Laight
> Sent: 10 September 2020 10:26
...
> > > I had an 'interesting' idea.
> > >
> > > Can you use a local asm register variable as an input and output to
> > > an 'asm volatile goto' statement?
> > >
> > > Well you can - but is it guaranteed to work :-)
> > >
> >
> > With gcc at least it should work according to
> > https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
> >
> > They even explicitely tell: "The only supported use for this feature is
> > to specify registers for input and output operands when calling Extended
> > asm "
> 
> A quick test isn't good....
> 
> int bar(char *z)
> {
>         __label__ label;
>         register int eax asm ("eax") = 6;
>         asm volatile goto (" mov $1, %%eax" ::: "eax" : label);
> label:
>         return eax;
> }
> 
> 0000000000000040 <bar>:
>   40:   b8 01 00 00 00          mov    $0x1,%eax
>   45:   b8 06 00 00 00          mov    $0x6,%eax
>   4a:   c3                      retq
> 
> although adding:
>         asm volatile ("" : "+r" (eax));
> either side of the 'asm volatile goto' does fix it.

Actually this is pretty sound:
	__label__ label;
	register int eax asm ("eax");
	// Ensure eax can't be reloaded from anywhere
	// In particular it can't be reloaded after the asm goto line
	asm volatile ("" : "=r" (eax));
	// Provided gcc doesn't save eax here...
	asm volatile goto ("xxxxx" ::: "eax" : label);
	// ... and reload the saved value here.
	// The input value here will be that modified by the 'asm goto'.
	// Since this modifies eax it can't be moved before the 'asm goto'.
	asm volatile ("" : "+r" (eax));
	// So here eax must contain the value set by the "xxxxx" instructions.

    David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10  9:26             ` David Laight
  2020-09-10 12:26               ` David Laight
@ 2020-09-10 15:16               ` Segher Boessenkool
  1 sibling, 0 replies; 58+ messages in thread
From: Segher Boessenkool @ 2020-09-10 15:16 UTC (permalink / raw)
  To: David Laight
  Cc: 'Christophe Leroy', 'Linus Torvalds',
	linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Christoph Hellwig,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Alexey Dobriyan

On Thu, Sep 10, 2020 at 09:26:28AM +0000, David Laight wrote:
> From: Christophe Leroy
> > Sent: 10 September 2020 09:14
> > 
> > Le 10/09/2020 à 10:04, David Laight a écrit :
> > > From: Linus Torvalds
> > >> Sent: 09 September 2020 22:34
> > >> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
> > >> <segher@kernel.crashing.org> wrote:
> > >>>
> > >>> It will not work like this in GCC, no.  The LLVM people know about that.
> > >>> I do not know why they insist on pushing this, being incompatible and
> > >>> everything.
> > >>
> > >> Umm. Since they'd be the ones supporting this, *gcc* would be the
> > >> incompatible one, not clang.
> > >
> > > I had an 'interesting' idea.
> > >
> > > Can you use a local asm register variable as an input and output to
> > > an 'asm volatile goto' statement?
> > >
> > > Well you can - but is it guaranteed to work :-)
> > >
> > 
> > With gcc at least it should work according to
> > https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
> > 
> > They even explicitely tell: "The only supported use for this feature is
> > to specify registers for input and output operands when calling Extended
> > asm "
> 
> A quick test isn't good....
> 
> int bar(char *z)
> {
>         __label__ label;
>         register int eax asm ("eax") = 6;
>         asm volatile goto (" mov $1, %%eax" ::: "eax" : label);
> 
> label:
>         return eax;
> }

It is neither input nor output operand here!  Only *then* is a local
register asm guaranteed to be in the given reg: as input or output to an
inline asm.


Segher

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10 12:26               ` David Laight
@ 2020-09-10 15:20                 ` Segher Boessenkool
  2020-09-10 15:31                   ` David Laight
  0 siblings, 1 reply; 58+ messages in thread
From: Segher Boessenkool @ 2020-09-10 15:20 UTC (permalink / raw)
  To: David Laight
  Cc: 'Christophe Leroy', 'Linus Torvalds',
	linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Alexey Dobriyan,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

On Thu, Sep 10, 2020 at 12:26:53PM +0000, David Laight wrote:
> Actually this is pretty sound:
> 	__label__ label;
> 	register int eax asm ("eax");
> 	// Ensure eax can't be reloaded from anywhere
> 	// In particular it can't be reloaded after the asm goto line
> 	asm volatile ("" : "=r" (eax));

This asm is fine.  It says it writes the "eax" variable, which lives in
the eax register *in that asm* (so *not* guaranteed after it!).

> 	// Provided gcc doesn't save eax here...
> 	asm volatile goto ("xxxxx" ::: "eax" : label);

So this is incorrect.

> 	// ... and reload the saved value here.
> 	// The input value here will be that modified by the 'asm goto'.
> 	// Since this modifies eax it can't be moved before the 'asm goto'.
> 	asm volatile ("" : "+r" (eax));
> 	// So here eax must contain the value set by the "xxxxx" instructions.

No, the register eax will contain the value of the eax variable.  In the
asm; it might well be there before or after the asm as well, but none of
that is guaranteed.


Segher

^ permalink raw reply	[flat|nested] 58+ messages in thread

* RE: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10 15:20                 ` Segher Boessenkool
@ 2020-09-10 15:31                   ` David Laight
  2020-09-10 17:15                     ` Segher Boessenkool
  0 siblings, 1 reply; 58+ messages in thread
From: David Laight @ 2020-09-10 15:31 UTC (permalink / raw)
  To: 'Segher Boessenkool'
  Cc: 'Christophe Leroy', 'Linus Torvalds',
	linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Alexey Dobriyan,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig



> -----Original Message-----
> From: Segher Boessenkool <segher@kernel.crashing.org>
> Sent: 10 September 2020 16:21
> To: David Laight <David.Laight@ACULAB.COM>
> Cc: 'Christophe Leroy' <christophe.leroy@csgroup.eu>; 'Linus Torvalds' <torvalds@linux-
> foundation.org>; linux-arch <linux-arch@vger.kernel.org>; Kees Cook <keescook@chromium.org>; the
> arch/x86 maintainers <x86@kernel.org>; Nick Desaulniers <ndesaulniers@google.com>; Linux Kernel
> Mailing List <linux-kernel@vger.kernel.org>; Alexey Dobriyan <adobriyan@gmail.com>; Luis Chamberlain
> <mcgrof@kernel.org>; Al Viro <viro@zeniv.linux.org.uk>; linux-fsdevel <linux-fsdevel@vger.kernel.org>;
> linuxppc-dev <linuxppc-dev@lists.ozlabs.org>; Christoph Hellwig <hch@lst.de>
> Subject: Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
> 
> On Thu, Sep 10, 2020 at 12:26:53PM +0000, David Laight wrote:
> > Actually this is pretty sound:
> > 	__label__ label;
> > 	register int eax asm ("eax");
> > 	// Ensure eax can't be reloaded from anywhere
> > 	// In particular it can't be reloaded after the asm goto line
> > 	asm volatile ("" : "=r" (eax));
> 
> This asm is fine.  It says it writes the "eax" variable, which lives in
> the eax register *in that asm* (so *not* guaranteed after it!).
> 
> > 	// Provided gcc doesn't save eax here...
> > 	asm volatile goto ("xxxxx" ::: "eax" : label);
> 
> So this is incorrect.

From the other email:

> It is neither input nor output operand here!  Only *then* is a local
> register asm guaranteed to be in the given reg: as input or output to an
> inline asm.

Ok, so adding '"r" (eax)' to the input section helps a bit.

> > 	// ... and reload the saved value here.
> > 	// The input value here will be that modified by the 'asm goto'.
> > 	// Since this modifies eax it can't be moved before the 'asm goto'.
> > 	asm volatile ("" : "+r" (eax));
> > 	// So here eax must contain the value set by the "xxxxx" instructions.
> 
> No, the register eax will contain the value of the eax variable.  In the
> asm; it might well be there before or after the asm as well, but none of
> that is guaranteed.

Perhaps not 'guaranteed', but very unlikely to be wrong.
It doesn't give gcc much scope for not generating the desired code.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-09 21:33       ` Linus Torvalds
  2020-09-10  8:04         ` David Laight
@ 2020-09-10 15:44         ` Segher Boessenkool
  1 sibling, 0 replies; 58+ messages in thread
From: Segher Boessenkool @ 2020-09-10 15:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Nick Desaulniers, linux-arch, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Alexey Dobriyan, Luis Chamberlain, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

On Wed, Sep 09, 2020 at 02:33:36PM -0700, Linus Torvalds wrote:
> On Wed, Sep 9, 2020 at 11:42 AM Segher Boessenkool
> <segher@kernel.crashing.org> wrote:
> >
> > It will not work like this in GCC, no.  The LLVM people know about that.
> > I do not know why they insist on pushing this, being incompatible and
> > everything.
> 
> Umm. Since they'd be the ones supporting this, *gcc* would be the
> incompatible one, not clang.

This breaks the basic requirements of asm goto.

> So I'd phrase it differently. If gcc is planning on doing some
> different model for asm goto with outputs, that would be the
> incompatible case.

If we will do asm goto with outputs, the asm will still be a jump
instruction!  (It is not in LLVM!)

We probably *can* make asm goto have outputs (jump instructions can have
outputs just fine!  Just output reloads on jump instructions are hard,
because not always they are *possible*; but for asm goto it should be
fine).

Doing as LLVM does, and making the asm a "trapping" instruction, makes
it not a jump insn, and opens up whole new cans of worms (including
inferior code quality).  Since it has very different semantics, and we
might want to keep the semantics of asm goto as well anyway, this should
be called something different ("asm break" or "asm __anything" for
example).

It would be nice if they talked to us about it, too.  LLVM claims it
implements the GCC inline asm extension.  It already only is compatible
for the simplest of cases, but this would be much worse still :-(

> and honestly, (b) is actually inferior for the error cases, even if to
> a compiler person it might feel like the "RightThing(tm)" to do.
> Because when an exception happens, the outputs simply won't be
> initialized.

Sure, that is fine, and quite possible useful, but it is not the same as
asm goto.  asm goto is not some exception handling construct: it is a
jump instruction.

> Anyway, for either of those cases, the kernel won't care either way.
> We'll have to support the non-goto case for many years even if
> everybody were to magically implement it today, so it's not like this
> is a "you have to do it" thing.

Yes.

I'm just annoyed because of all the extra work created by people not
communicating.


Segher

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: remove the last set_fs() in common code, and remove it for x86 and powerpc v3
  2020-09-10 15:31                   ` David Laight
@ 2020-09-10 17:15                     ` Segher Boessenkool
  0 siblings, 0 replies; 58+ messages in thread
From: Segher Boessenkool @ 2020-09-10 17:15 UTC (permalink / raw)
  To: David Laight
  Cc: 'Christophe Leroy', 'Linus Torvalds',
	linux-arch, Kees Cook, the arch/x86 maintainers,
	Nick Desaulniers, Linux Kernel Mailing List, Alexey Dobriyan,
	Luis Chamberlain, Al Viro, linux-fsdevel, linuxppc-dev,
	Christoph Hellwig

On Thu, Sep 10, 2020 at 03:31:53PM +0000, David Laight wrote:
> > > 	asm volatile ("" : "+r" (eax));
> > > 	// So here eax must contain the value set by the "xxxxx" instructions.
> > 
> > No, the register eax will contain the value of the eax variable.  In the
> > asm; it might well be there before or after the asm as well, but none of
> > that is guaranteed.
> 
> Perhaps not 'guaranteed', but very unlikely to be wrong.
> It doesn't give gcc much scope for not generating the desired code.

Wanna bet?  :-)

Correct is correct.  Anything else is not.


Segher

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-09-03 14:22 ` [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops Christoph Hellwig
@ 2020-10-01 22:38   ` Eric Biggers
  2020-10-01 22:40     ` Al Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Biggers @ 2020-10-01 22:38 UTC (permalink / raw)
  To: Christoph Hellwig, Al Viro, Linus Torvalds
  Cc: Michael Ellerman, x86, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, linux-kernel, linux-fsdevel, linux-arch, linuxppc-dev

Christoph, Al, and Linus:

On Thu, Sep 03, 2020 at 04:22:33PM +0200, Christoph Hellwig wrote:
> @@ -510,28 +524,31 @@ static ssize_t new_sync_write(struct file *filp, const char __user *buf, size_t
>  /* caller is responsible for file_start_write/file_end_write */
>  ssize_t __kernel_write(struct file *file, const void *buf, size_t count, loff_t *pos)
>  {
> -	mm_segment_t old_fs;
> -	const char __user *p;
> +	struct kvec iov = {
> +		.iov_base	= (void *)buf,
> +		.iov_len	= min_t(size_t, count, MAX_RW_COUNT),
> +	};
> +	struct kiocb kiocb;
> +	struct iov_iter iter;
>  	ssize_t ret;
>  
>  	if (WARN_ON_ONCE(!(file->f_mode & FMODE_WRITE)))
>  		return -EBADF;
>  	if (!(file->f_mode & FMODE_CAN_WRITE))
>  		return -EINVAL;
> +	/*
> +	 * Also fail if ->write_iter and ->write are both wired up as that
> +	 * implies very convoluted semantics.
> +	 */
> +	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> +		return warn_unsupported(file, "write");
>  
> -	old_fs = get_fs();
> -	set_fs(KERNEL_DS);
> -	p = (__force const char __user *)buf;
> -	if (count > MAX_RW_COUNT)
> -		count =  MAX_RW_COUNT;
> -	if (file->f_op->write)
> -		ret = file->f_op->write(file, p, count, pos);
> -	else if (file->f_op->write_iter)
> -		ret = new_sync_write(file, p, count, pos);
> -	else
> -		ret = -EINVAL;
> -	set_fs(old_fs);
> +	init_sync_kiocb(&kiocb, file);
> +	kiocb.ki_pos = *pos;
> +	iov_iter_kvec(&iter, WRITE, &iov, 1, iov.iov_len);
> +	ret = file->f_op->write_iter(&kiocb, &iter);

next-20201001 crashes on boot for me because there is a bad interaction between
this commit in vfs/for-next:

	commit 4d03e3cc59828c82ee89ea6e27a2f3cdf95aaadf
	Author: Christoph Hellwig <hch@lst.de>
	Date:   Thu Sep 3 16:22:33 2020 +0200

	    fs: don't allow kernel reads and writes without iter ops

... and Linus's mainline commit:

	commit 90fb702791bf99b959006972e8ee7bb4609f441b
	Author: Linus Torvalds <torvalds@linux-foundation.org>
	Date:   Tue Sep 29 17:18:34 2020 -0700

	    autofs: use __kernel_write() for the autofs pipe writing

Linus's commit made autofs start passing pos=NULL to __kernel_write().
But, Christoph's commit made __kernel_write() no longer accept a NULL pos.

The following fixes it:

diff --git a/fs/autofs/waitq.c b/fs/autofs/waitq.c
index 5ced859dac53..b04c528b19d3 100644
--- a/fs/autofs/waitq.c
+++ b/fs/autofs/waitq.c
@@ -53,7 +53,7 @@ static int autofs_write(struct autofs_sb_info *sbi,
 
 	mutex_lock(&sbi->pipe_mutex);
 	while (bytes) {
-		wr = __kernel_write(file, data, bytes, NULL);
+		wr = __kernel_write(file, data, bytes, &file->f_pos);
 		if (wr <= 0)
 			break;
 		data += wr;

I'm not sure what was intended, though.

Full stack trace below.

BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 12 PID: 383 Comm: systemd-binfmt Tainted: G                T 5.9.0-rc7-next-20201001 #2
Hardware name: Gigabyte Technology Co., Ltd. X399 AORUS Gaming 7/X399 AORUS Gaming 7, BIOS F2 08/31/2017
RIP: 0010:init_sync_kiocb include/linux/fs.h:2050 [inline]
RIP: 0010:__kernel_write+0x16c/0x2b0 fs/read_write.c:546
Code: 24 4a b9 01 00 00 00 0f 45 c7 be 01 00 00 00 48 8d 7c 24 10 48 c7 44 24 40 00 00 00 00 48 c7 44 24 58 00 00 00 00 89 44 24 4c <48> 8b 03 48 c7 44 24 60 00 00 00 00 48 89 44 24 50 4c 89 64 24 38
RSP: 0018:ffffa2fc0102f8b0 EFLAGS: 00010246
RAX: 0000000000020000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: ffffa2fc0102f8b0 RSI: 0000000000000001 RDI: ffffa2fc0102f8c0
RBP: ffff8ad2927e2940 R08: 0000000000000130 R09: ffff8ad29a201800
R10: 000000000000000f R11: ffff8ad292547510 R12: ffff8ad2927e2940
R13: ffffa2fc0102f8e8 R14: ffff8ad29a951768 R15: 0000000000000130
FS:  00007f11023b9000(0000) GS:ffff8ad29ed00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000857b62000 CR4: 00000000003506e0
Call Trace:
 autofs_write fs/autofs/waitq.c:56 [inline]
 autofs_notify_daemon.constprop.0+0x115/0x260 fs/autofs/waitq.c:163
 autofs_wait+0x5b1/0x750 fs/autofs/waitq.c:465
 autofs_mount_wait+0x2d/0x60 fs/autofs/root.c:250
 autofs_d_automount+0xc4/0x1a0 fs/autofs/root.c:379
 follow_automount fs/namei.c:1198 [inline]
 __traverse_mounts+0x8d/0x230 fs/namei.c:1243
 traverse_mounts fs/namei.c:1272 [inline]
 handle_mounts fs/namei.c:1381 [inline]
 step_into+0x44e/0x730 fs/namei.c:1691
 walk_component+0x7e/0x1b0 fs/namei.c:1867
 link_path_walk+0x270/0x3b0 fs/namei.c:2184
 path_openat+0x90/0xe40 fs/namei.c:3365
 do_filp_open+0x98/0x140 fs/namei.c:3396
 do_sys_openat2+0xac/0x170 fs/open.c:1168
 do_sys_open fs/open.c:1184 [inline]
 __do_sys_openat fs/open.c:1200 [inline]
 __se_sys_openat fs/open.c:1195 [inline]
 __x64_sys_openat+0x51/0x90 fs/open.c:1195
 do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f11031d3c1b
Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
RSP: 002b:00007ffda6c0a1c0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 0000000000000020 RCX: 00007f11031d3c1b
RDX: 0000000000080101 RSI: 000055f65e114278 RDI: 00000000ffffff9c
random: lvm: uninitialized urandom read (4 bytes read)
RBP: 000055f65e114278 R08: 00000000ffffffff R09: 000055f65ef055d8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000080101
R13: 000055f65e11434a R14: 0000000000000000 R15: 0000000000000000
CR2: 0000000000000000
---[ end trace 166ad5429f22801b ]---

^ permalink raw reply related	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-01 22:38   ` Eric Biggers
@ 2020-10-01 22:40     ` Al Viro
  2020-10-02 16:27       ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Al Viro @ 2020-10-01 22:40 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Christoph Hellwig, Linus Torvalds, Michael Ellerman, x86,
	Alexey Dobriyan, Luis Chamberlain, Kees Cook, linux-kernel,
	linux-fsdevel, linux-arch, linuxppc-dev

On Thu, Oct 01, 2020 at 03:38:52PM -0700, Eric Biggers wrote:

>  	mutex_lock(&sbi->pipe_mutex);
>  	while (bytes) {
> -		wr = __kernel_write(file, data, bytes, NULL);
> +		wr = __kernel_write(file, data, bytes, &file->f_pos);

Better
	loff_t dummy = 0;
...
		wr = __kernel_write(file, data, bytes, &dummy);

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-01 22:40     ` Al Viro
@ 2020-10-02 16:27       ` Linus Torvalds
  2020-10-09 22:06         ` Eric Biggers
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2020-10-02 16:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Eric Biggers, Christoph Hellwig, Michael Ellerman,
	the arch/x86 maintainers, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Thu, Oct 1, 2020 at 3:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Better
>         loff_t dummy = 0;
> ...
>                 wr = __kernel_write(file, data, bytes, &dummy);

No, just fix __kernel_write() to work correctly.

The fact is, NULL _is_ the right pointer for ppos these days.

That commit by Christoph is buggy: it replaces new_sync_write() with a
buggy open-coded version.

Notice how new_sync_write does

        kiocb.ki_pos = (ppos ? *ppos : 0);
,,,
        if (ret > 0 && ppos)
                *ppos = kiocb.ki_pos;

but the open-coded version doesn't.

So just fix that in linux-next. The *last* thing we want is to have
different semantics for the "same" kernel functions.

               Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-02 16:27       ` Linus Torvalds
@ 2020-10-09 22:06         ` Eric Biggers
  2020-10-10  1:03           ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Biggers @ 2020-10-09 22:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Hellwig, Michael Ellerman,
	the arch/x86 maintainers, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Fri, Oct 02, 2020 at 09:27:09AM -0700, Linus Torvalds wrote:
> On Thu, Oct 1, 2020 at 3:41 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Better
> >         loff_t dummy = 0;
> > ...
> >                 wr = __kernel_write(file, data, bytes, &dummy);
> 
> No, just fix __kernel_write() to work correctly.
> 
> The fact is, NULL _is_ the right pointer for ppos these days.
> 
> That commit by Christoph is buggy: it replaces new_sync_write() with a
> buggy open-coded version.
> 
> Notice how new_sync_write does
> 
>         kiocb.ki_pos = (ppos ? *ppos : 0);
> ,,,
>         if (ret > 0 && ppos)
>                 *ppos = kiocb.ki_pos;
> 
> but the open-coded version doesn't.
> 
> So just fix that in linux-next. The *last* thing we want is to have
> different semantics for the "same" kernel functions.

It's a bit unintuitive that ppos=NULL means "use pos 0", not "use file->f_pos".

Anyway, it works.  The important thing is, this is still broken in linux-next...

- Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-09 22:06         ` Eric Biggers
@ 2020-10-10  1:03           ` Linus Torvalds
  2020-10-10  1:19             ` Eric Biggers
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2020-10-10  1:03 UTC (permalink / raw)
  To: Eric Biggers
  Cc: Al Viro, Christoph Hellwig, Michael Ellerman,
	the arch/x86 maintainers, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Fri, Oct 9, 2020 at 3:06 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> It's a bit unintuitive that ppos=NULL means "use pos 0", not "use file->f_pos".

That's not at all what it means.

A NULL ppos means "this has no position at all", and is what we use
for FMODE_STREAM file descriptors (ie sockets, pipes, etc).

It also means that we don't do the locking for position updates.

The fact that "ki_pos" gets set to zero is just because it needs to be
_something_. It shouldn't actually ever be used for stream devices.

                  Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-10  1:03           ` Linus Torvalds
@ 2020-10-10  1:19             ` Eric Biggers
  2020-10-10  1:29               ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Eric Biggers @ 2020-10-10  1:19 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Christoph Hellwig, Michael Ellerman,
	the arch/x86 maintainers, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Fri, Oct 09, 2020 at 06:03:31PM -0700, Linus Torvalds wrote:
> On Fri, Oct 9, 2020 at 3:06 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > It's a bit unintuitive that ppos=NULL means "use pos 0", not "use file->f_pos".
> 
> That's not at all what it means.
> 
> A NULL ppos means "this has no position at all", and is what we use
> for FMODE_STREAM file descriptors (ie sockets, pipes, etc).
> 
> It also means that we don't do the locking for position updates.
> 
> The fact that "ki_pos" gets set to zero is just because it needs to be
> _something_. It shouldn't actually ever be used for stream devices.
> 

Okay, that makes more sense.  So the patchset from Matthew
https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
isn't what you had in mind.

- Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-10  1:19             ` Eric Biggers
@ 2020-10-10  1:29               ` Linus Torvalds
  2020-10-10  1:55                 ` Alexander Viro
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2020-10-10  1:29 UTC (permalink / raw)
  To: Eric Biggers, Alexander Viro
  Cc: Al Viro, Christoph Hellwig, Michael Ellerman,
	the arch/x86 maintainers, Alexey Dobriyan, Luis Chamberlain,
	Kees Cook, Linux Kernel Mailing List, linux-fsdevel, linux-arch,
	linuxppc-dev

On Fri, Oct 9, 2020 at 6:19 PM Eric Biggers <ebiggers@kernel.org> wrote:
>
> Okay, that makes more sense.  So the patchset from Matthew
> https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
> isn't what you had in mind.

No.

That first patch makes sense - it's just the "ppos can be NULL" patch.

But as mentioned, NULL isn't "shorthand for zero". It's just "pipes
don't _have_ a pos, trying to pass in some explicit position is
crazy".

So no, the other patches in that set are a bit odd, I think.

SOME of them look potentially fine - the bpfilter one seems to be
valid, for example, because it's literally about reading/writing a
pipe. And maybe the sysctl one is similarly sensible - I didn't check
the context of that one.

But no, NULL shouldn't mean "start at position zero, and we don't care
about the result".

              Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-10  1:29               ` Linus Torvalds
@ 2020-10-10  1:55                 ` Alexander Viro
  2020-10-14  5:51                   ` Christoph Hellwig
  0 siblings, 1 reply; 58+ messages in thread
From: Alexander Viro @ 2020-10-10  1:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Eric Biggers, Alexander Viro, Al Viro, Christoph Hellwig,
	Michael Ellerman, the arch/x86 maintainers, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, Linux Kernel Mailing List,
	linux-fsdevel, linux-arch, linuxppc-dev

On Fri, Oct 09, 2020 at 06:29:13PM -0700, Linus Torvalds wrote:
> On Fri, Oct 9, 2020 at 6:19 PM Eric Biggers <ebiggers@kernel.org> wrote:
> >
> > Okay, that makes more sense.  So the patchset from Matthew
> > https://lkml.kernel.org/linux-fsdevel/20201003025534.21045-1-willy@infradead.org/T/#u
> > isn't what you had in mind.
> 
> No.
> 
> That first patch makes sense - it's just the "ppos can be NULL" patch.
> 
> But as mentioned, NULL isn't "shorthand for zero". It's just "pipes
> don't _have_ a pos, trying to pass in some explicit position is
> crazy".
> 
> So no, the other patches in that set are a bit odd, I think.
> 
> SOME of them look potentially fine - the bpfilter one seems to be
> valid, for example, because it's literally about reading/writing a
> pipe. And maybe the sysctl one is similarly sensible - I didn't check
> the context of that one.

FWIW, I hadn't pushed that branch out (or merged it into #for-next yet);
for one thing, uml part (mconsole) is simply broken, for another...
IMO ##5--8 are asking for kernel_pread() and if you look at binfmt_elf.c,
you'll see elf_read() being pretty much that.  acct.c, keys and usermode
parts are asking for kernel_pwrite() as well.

I've got stuck looking through the drivers/target stuff - it would've
been another kernel_pwrite() candidate, but it smells like its use of
filp_open() is really asking for trouble, starting with symlink attacks.
Not sure - I'm not familiar with the area, but...


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops
  2020-10-10  1:55                 ` Alexander Viro
@ 2020-10-14  5:51                   ` Christoph Hellwig
  0 siblings, 0 replies; 58+ messages in thread
From: Christoph Hellwig @ 2020-10-14  5:51 UTC (permalink / raw)
  To: Alexander Viro
  Cc: Linus Torvalds, Eric Biggers, Al Viro, Christoph Hellwig,
	Michael Ellerman, the arch/x86 maintainers, Alexey Dobriyan,
	Luis Chamberlain, Kees Cook, Linux Kernel Mailing List,
	linux-fsdevel, linux-arch, linuxppc-dev

On Sat, Oct 10, 2020 at 01:55:24AM +0000, Alexander Viro wrote:
> FWIW, I hadn't pushed that branch out (or merged it into #for-next yet);
> for one thing, uml part (mconsole) is simply broken, for another...
> IMO ##5--8 are asking for kernel_pread() and if you look at binfmt_elf.c,
> you'll see elf_read() being pretty much that.  acct.c, keys and usermode
> parts are asking for kernel_pwrite() as well.
> 
> I've got stuck looking through the drivers/target stuff - it would've
> been another kernel_pwrite() candidate, but it smells like its use of
> filp_open() is really asking for trouble, starting with symlink attacks.
> Not sure - I'm not familiar with the area, but...

Can you just pull in the minimal fix so that the branch gets fixed
for this merge window?  All the cleanups can come later.

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2020-10-14  8:19 UTC | newest]

Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 14:22 remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Christoph Hellwig
2020-09-03 14:22 ` [PATCH 01/14] proc: remove a level of indentation in proc_get_inode Christoph Hellwig
2020-09-03 14:22 ` [PATCH 02/14] proc: cleanup the compat vs no compat file ops Christoph Hellwig
2020-09-03 14:22 ` [PATCH 03/14] proc: add a read_iter method to proc proc_ops Christoph Hellwig
2020-09-03 14:22 ` [PATCH 04/14] sysctl: Convert to iter interfaces Christoph Hellwig
2020-09-03 14:22 ` [PATCH 05/14] fs: don't allow kernel reads and writes without iter ops Christoph Hellwig
2020-10-01 22:38   ` Eric Biggers
2020-10-01 22:40     ` Al Viro
2020-10-02 16:27       ` Linus Torvalds
2020-10-09 22:06         ` Eric Biggers
2020-10-10  1:03           ` Linus Torvalds
2020-10-10  1:19             ` Eric Biggers
2020-10-10  1:29               ` Linus Torvalds
2020-10-10  1:55                 ` Alexander Viro
2020-10-14  5:51                   ` Christoph Hellwig
2020-09-03 14:22 ` [PATCH 06/14] fs: don't allow splice read/write without explicit ops Christoph Hellwig
2020-09-03 14:22 ` [PATCH 07/14] uaccess: add infrastructure for kernel builds with set_fs() Christoph Hellwig
2020-09-03 14:22 ` [PATCH 08/14] test_bitmap: remove user bitmap tests Christoph Hellwig
2020-09-03 14:22 ` [PATCH 09/14] lkdtm: remove set_fs-based tests Christoph Hellwig
2020-09-03 14:22 ` [PATCH 10/14] x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h Christoph Hellwig
2020-09-03 14:22 ` [PATCH 11/14] x86: make TASK_SIZE_MAX usable from assembly code Christoph Hellwig
2020-09-03 14:22 ` [PATCH 12/14] x86: remove address space overrides using set_fs() Christoph Hellwig
2020-09-03 21:30   ` David Laight
2020-09-03 23:25     ` Linus Torvalds
2020-09-04  7:59       ` David Laight
2020-09-04  2:55   ` Al Viro
2020-09-04  4:41     ` Al Viro
2020-09-04  6:38     ` Christoph Hellwig
2020-09-04  7:47       ` Christoph Hellwig
2020-09-03 14:22 ` [PATCH 13/14] powerpc: use non-set_fs based maccess routines Christoph Hellwig
2020-09-03 14:22 ` [PATCH 14/14] powerpc: remove address space overrides using set_fs() Christoph Hellwig
2020-09-03 15:43   ` Christophe Leroy
2020-09-03 15:49     ` Christoph Hellwig
2020-09-03 15:56       ` Christoph Hellwig
2020-09-03 16:03         ` Christophe Leroy
2020-09-03 14:28 ` remove the last set_fs() in common code, and remove it for x86 and powerpc v3 Al Viro
2020-09-03 14:30   ` Christoph Hellwig
2020-09-03 14:36     ` Al Viro
2020-09-03 14:40       ` Christoph Hellwig
2020-09-09 17:31   ` Linus Torvalds
2020-09-09 18:40     ` Segher Boessenkool
2020-09-09 21:33       ` Linus Torvalds
2020-09-10  8:04         ` David Laight
2020-09-10  8:13           ` Christophe Leroy
2020-09-10  9:26             ` David Laight
2020-09-10 12:26               ` David Laight
2020-09-10 15:20                 ` Segher Boessenkool
2020-09-10 15:31                   ` David Laight
2020-09-10 17:15                     ` Segher Boessenkool
2020-09-10 15:16               ` Segher Boessenkool
2020-09-10 15:44         ` Segher Boessenkool
2020-09-03 15:49 ` Linus Torvalds
2020-09-04  6:00 ` Ingo Molnar
2020-09-04 17:58   ` Alexey Dobriyan
2020-09-04 18:42     ` Linus Torvalds
2020-09-04 21:01     ` David Laight
2020-09-05  7:16       ` Christophe Leroy
2020-09-05 10:13         ` David Laight

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).