kvm.vger.kernel.org archive mirror
* [PATCH 00/35] Enhance memory utilization with DMEMFS
@ 2020-10-08  7:53 yulei.kernel
  2020-10-08  7:53 ` [PATCH 01/35] fs: introduce dmemfs module yulei.kernel
                   ` (36 more replies)
  0 siblings, 37 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

In the current kernel, each physical memory page is associated with
a page structure that is used to track the usage of this page.
But as memory usage grows rapidly in cloud environments, we find
that the memory consumed by page structure storage becomes quite
significant. So is it an expense that we could spare?

This patchset introduces a way to save that extra memory
through a new virtual filesystem -- dmemfs.

Dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special as it is
not managed by the kernel and, most importantly, has no 'struct page'.
Therefore we can leverage the saved memory on the host system
to support more tenants in our cloud service.

We use the kernel boot parameter 'dmem=' to reserve system
memory when the host system boots up; the details can be found
in Documentation/admin-guide/kernel-parameters.txt.
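
For example, booting with 'dmem=4G' reserves 4G of memory on each
NUMA node for dmemfs; see the kernel-parameters.txt changes in this
series for the full syntax.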

Theoretically, dropping the 'struct page' saves 64 bytes for each
4K physical page, so for 320G of guest memory it saves about 5G
of physical memory in total.
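
For reference, the arithmetic behind that estimate (assuming the
usual 64-byte 'struct page'):

    320G / 4K          = ~83.9M pages
    83.9M pages * 64 B = ~5G of 'struct page' overhead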

Detailed usage of dmemfs is documented in
Documentation/filesystems/dmemfs.rst.

TODO:
1. record_steal_time() is temporarily disabled before entering the
guest; it will be re-enabled once the conflict is resolved.
2. system calls such as mincore are still being worked on; status
and patches will be updated soon.

Yulei Zhang (35):
  fs: introduce dmemfs module
  mm: support direct memory reservation
  dmem: implement dmem memory management
  dmem: let pat recognize dmem
  dmemfs: support mmap
  dmemfs: support truncating inode down
  dmem: trace core functions
  dmem: show some statistic in debugfs
  dmemfs: support remote access
  dmemfs: introduce max_alloc_try_dpages parameter
  mm: export mempolicy interfaces to serve dmem allocator
  dmem: introduce mempolicy support
  mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  mm, dmem: dmem-pmd vs thp-pmd
  mm: add pmd_special() check for pmd_trans_huge_lock()
  dmemfs: introduce ->split() to dmemfs_vm_ops
  mm, dmemfs: support unmap_page_range() for dmemfs pmd
  mm: follow_pmd_mask() for dmem huge pmd
  mm: gup_huge_pmd() for dmem huge pmd
  mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  mm: support dmem huge pmd for follow_pfn()
  kvm, x86: Distinguish dmemfs page from mmio page
  kvm, x86: introduce VM_DMEM
  dmemfs: support hugepage for dmemfs
  mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  mm, dmem: introduce pud_special()
  mm: add pud_special() to support dmem huge pud
  mm, dmemfs: support huge_fault() for dmemfs
  mm: add follow_pte_pud()
  dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  dmem: introduce mce handler
  mm, dmemfs: register and handle the dmem mce
  kvm, x86: temporary disable record_steal_time for dmem
  dmem: add dmem unit tests
  Add documentation for dmemfs

 .../admin-guide/kernel-parameters.txt         |   38 +
 Documentation/filesystems/dmemfs.rst          |   59 +
 arch/x86/Kconfig                              |    1 +
 arch/x86/include/asm/pgtable.h                |   32 +-
 arch/x86/include/asm/pgtable_types.h          |   13 +-
 arch/x86/kernel/setup.c                       |    3 +
 arch/x86/kvm/mmu/mmu.c                        |    5 +-
 arch/x86/kvm/x86.c                            |    2 +
 arch/x86/mm/pat/memtype.c                     |   21 +
 drivers/vfio/vfio_iommu_type1.c               |    4 +
 fs/Kconfig                                    |    1 +
 fs/Makefile                                   |    1 +
 fs/dmemfs/Kconfig                             |   16 +
 fs/dmemfs/Makefile                            |    8 +
 fs/dmemfs/inode.c                             | 1063 ++++++++++++++++
 fs/dmemfs/trace.h                             |   54 +
 fs/inode.c                                    |    6 +
 include/linux/dmem.h                          |   49 +
 include/linux/fs.h                            |    1 +
 include/linux/huge_mm.h                       |    5 +-
 include/linux/mempolicy.h                     |    3 +
 include/linux/mm.h                            |    9 +
 include/linux/pfn_t.h                         |   17 +-
 include/linux/pgtable.h                       |   22 +
 include/trace/events/dmem.h                   |   85 ++
 include/uapi/linux/magic.h                    |    1 +
 mm/Kconfig                                    |   21 +
 mm/Makefile                                   |    1 +
 mm/dmem.c                                     | 1075 +++++++++++++++++
 mm/dmem_reserve.c                             |  303 +++++
 mm/gup.c                                      |   94 +-
 mm/huge_memory.c                              |   19 +-
 mm/memory-failure.c                           |   69 +-
 mm/memory.c                                   |   74 +-
 mm/mempolicy.c                                |    4 +-
 mm/mprotect.c                                 |    7 +-
 mm/mremap.c                                   |    3 +
 tools/testing/dmem/Kbuild                     |    1 +
 tools/testing/dmem/Makefile                   |   10 +
 tools/testing/dmem/dmem-test.c                |  184 +++
 40 files changed, 3336 insertions(+), 48 deletions(-)
 create mode 100644 Documentation/filesystems/dmemfs.rst
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/linux/dmem.h
 create mode 100644 include/trace/events/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c

-- 
2.28.0


^ permalink raw reply	[flat|nested] 61+ messages in thread

* [PATCH 01/35] fs: introduce dmemfs module
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-11-10 20:04   ` Al Viro
  2020-10-08  7:53 ` [PATCH 02/35] mm: support direct memory reservation yulei.kernel
                   ` (35 subsequent siblings)
  36 siblings, 1 reply; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special as it is
not managed by the kernel and has no 'struct page'.

The original purpose of dmemfs is to drop the usage of
'struct page' to save extra system memory.

This patch introduces the basic framework of dmemfs; only mkdir
and creating regular files are supported so far.
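
For reference, a mount sketch based on the mount option added by this
patch (the mount point is just an example; 'pagesize' accepts the
native page size, PMD_SIZE or PUD_SIZE, i.e. 4K/2M/1G on x86_64):

    # mount -t dmemfs nodev /mnt/dmemfs -o pagesize=2M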

Signed-off-by: Xiao Guangrong  <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/Kconfig                 |   1 +
 fs/Makefile                |   1 +
 fs/dmemfs/Kconfig          |  13 ++
 fs/dmemfs/Makefile         |   7 +
 fs/dmemfs/inode.c          | 275 +++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   1 +
 6 files changed, 298 insertions(+)
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c12282301..18e72089426f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -41,6 +41,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 source "fs/f2fs/Kconfig"
 source "fs/zonefs/Kconfig"
+source "fs/dmemfs/Kconfig"
 
 config FS_DAX
 	bool "Direct Access (DAX) support"
diff --git a/fs/Makefile b/fs/Makefile
index 1c7b0e3f6daa..10e0302c5902 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_DMEM_FS)		+= dmemfs/
diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig
new file mode 100644
index 000000000000..d2894a513de0
--- /dev/null
+++ b/fs/dmemfs/Kconfig
@@ -0,0 +1,13 @@
+config DMEM_FS
+	tristate "Direct Memory filesystem support"
+	help
+	  dmemfs (Direct Memory filesystem) is a filesystem backed by
+	  device memory or reserved memory. This kind of memory is special
+	  as it is not managed by the kernel and has no 'struct page'.
+
+	  The original purpose of dmemfs is to save the extra memory used
+	  by 'struct page', which reduces the total cost of ownership
+	  (TCO) for cloud providers.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called dmemfs.
diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile
new file mode 100644
index 000000000000..73bdc9cbc87e
--- /dev/null
+++ b/fs/dmemfs/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux dmem-filesystem routines.
+#
+obj-$(CONFIG_DMEM_FS) += dmemfs.o
+
+dmemfs-y += inode.o
diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
new file mode 100644
index 000000000000..6a8a2d9f94e9
--- /dev/null
+++ b/fs/dmemfs/inode.c
@@ -0,0 +1,275 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/fs/dmemfs/inode.c
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ *   Chen Zhuo	     <sagazchen@tencent.com>
+ *   Haiwei Li	     <gerryhwli@tencent.com>
+ *   Yulei Zhang     <yuleixzhang@tencent.com>
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/capability.h>
+#include <linux/magic.h>
+#include <linux/mman.h>
+#include <linux/statfs.h>
+#include <linux/pagemap.h>
+#include <linux/parser.h>
+#include <linux/pfn_t.h>
+#include <linux/pagevec.h>
+#include <linux/fs_parser.h>
+#include <linux/seq_file.h>
+
+MODULE_AUTHOR("Tencent Corporation");
+MODULE_LICENSE("GPL v2");
+
+struct dmemfs_mount_opts {
+	unsigned long dpage_size;
+};
+
+struct dmemfs_fs_info {
+	struct dmemfs_mount_opts mount_opts;
+};
+
+enum dmemfs_param {
+	Opt_dpagesize,
+};
+
+const struct fs_parameter_spec dmemfs_fs_parameters[] = {
+	fsparam_string("pagesize", Opt_dpagesize),
+	{}
+};
+
+static int check_dpage_size(unsigned long dpage_size)
+{
+	if (dpage_size != PAGE_SIZE && dpage_size != PMD_SIZE &&
+	      dpage_size != PUD_SIZE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static struct inode *
+dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode,
+		 dev_t dev);
+
+static int
+dmemfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
+{
+	struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode, dev);
+	int error = -ENOSPC;
+
+	if (inode) {
+		d_instantiate(dentry, inode);
+		dget(dentry);	/* Extra count - pin the dentry in core */
+		error = 0;
+		dir->i_mtime = dir->i_ctime = current_time(inode);
+	}
+	return error;
+}
+
+static int dmemfs_create(struct inode *dir, struct dentry *dentry,
+			 umode_t mode, bool excl)
+{
+	return dmemfs_mknod(dir, dentry, mode | S_IFREG, 0);
+}
+
+static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
+			umode_t mode)
+{
+	int retval = dmemfs_mknod(dir, dentry, mode | S_IFDIR, 0);
+
+	if (!retval)
+		inc_nlink(dir);
+	return retval;
+}
+
+static const struct inode_operations dmemfs_dir_inode_operations = {
+	.create		= dmemfs_create,
+	.lookup		= simple_lookup,
+	.unlink		= simple_unlink,
+	.mkdir		= dmemfs_mkdir,
+	.rmdir		= simple_rmdir,
+	.rename		= simple_rename,
+};
+
+static const struct inode_operations dmemfs_file_inode_operations = {
+	.setattr = simple_setattr,
+	.getattr = simple_getattr,
+};
+
+int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
+static const struct file_operations dmemfs_file_operations = {
+	.mmap = dmemfs_file_mmap,
+};
+
+static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct dmemfs_fs_info *fsi = fc->s_fs_info;
+	struct fs_parse_result result;
+	int opt, ret;
+
+	opt = fs_parse(fc, dmemfs_fs_parameters, param, &result);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_dpagesize:
+		fsi->mount_opts.dpage_size = memparse(param->string, NULL);
+		ret = check_dpage_size(fsi->mount_opts.dpage_size);
+		if (ret) {
+			pr_warn("dmemfs: unknown pagesize %x.\n",
+				result.uint_32);
+			return ret;
+		}
+		break;
+	default:
+		pr_warn("dmemfs: unknown mount option [%x].\n",
+			opt);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+struct inode *dmemfs_get_inode(struct super_block *sb,
+			       const struct inode *dir, umode_t mode, dev_t dev)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		inode->i_ino = get_next_ino();
+		inode_init_owner(inode, dir, mode);
+		inode->i_mapping->a_ops = &empty_aops;
+		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_unevictable(inode->i_mapping);
+		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+		switch (mode & S_IFMT) {
+		default:
+			init_special_inode(inode, mode, dev);
+			break;
+		case S_IFREG:
+			inode->i_op = &dmemfs_file_inode_operations;
+			inode->i_fop = &dmemfs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &dmemfs_dir_inode_operations;
+			inode->i_fop = &simple_dir_operations;
+
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry)
+			 */
+			inc_nlink(inode);
+			break;
+		case S_IFLNK:
+			inode->i_op = &page_symlink_inode_operations;
+			break;
+		}
+	}
+	return inode;
+}
+
+static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	simple_statfs(dentry, buf);
+	buf->f_bsize = dentry->d_sb->s_blocksize;
+
+	return 0;
+}
+
+static const struct super_operations dmemfs_ops = {
+	.statfs	= dmemfs_statfs,
+	.drop_inode = generic_delete_inode,
+};
+
+static int
+dmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *inode;
+	struct dmemfs_fs_info *fsi = sb->s_fs_info;
+
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_blocksize = fsi->mount_opts.dpage_size;
+	sb->s_blocksize_bits = ilog2(fsi->mount_opts.dpage_size);
+	sb->s_magic = DMEMFS_MAGIC;
+	sb->s_op = &dmemfs_ops;
+	sb->s_time_gran = 1;
+
+	inode = dmemfs_get_inode(sb, NULL, S_IFDIR, 0);
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int dmemfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, dmemfs_fill_super);
+}
+
+static void dmemfs_free_fc(struct fs_context *fc)
+{
+	kfree(fc->s_fs_info);
+}
+
+static const struct fs_context_operations dmemfs_context_ops = {
+	.free		= dmemfs_free_fc,
+	.parse_param	= dmemfs_parse_param,
+	.get_tree	= dmemfs_get_tree,
+};
+
+int dmemfs_init_fs_context(struct fs_context *fc)
+{
+	struct dmemfs_fs_info *fsi;
+
+	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+	if (!fsi)
+		return -ENOMEM;
+
+	fsi->mount_opts.dpage_size = PAGE_SIZE;
+	fc->s_fs_info = fsi;
+	fc->ops = &dmemfs_context_ops;
+	return 0;
+}
+
+static void dmemfs_kill_sb(struct super_block *sb)
+{
+	kill_litter_super(sb);
+}
+
+static struct file_system_type dmemfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "dmemfs",
+	.init_fs_context = dmemfs_init_fs_context,
+	.kill_sb	= dmemfs_kill_sb,
+};
+
+static int __init dmemfs_init(void)
+{
+	int ret;
+
+	ret = register_filesystem(&dmemfs_fs_type);
+
+	return ret;
+}
+
+static void __exit dmemfs_uninit(void)
+{
+	unregister_filesystem(&dmemfs_fs_type);
+}
+
+module_init(dmemfs_init)
+module_exit(dmemfs_uninit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..3fbd06661c8c 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define DMEMFS_MAGIC		0x2ace90c6
 
 #endif /* __LINUX_MAGIC_H__ */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 02/35] mm: support direct memory reservation
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
  2020-10-08  7:53 ` [PATCH 01/35] fs: introduce dmemfs module yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08 20:27   ` Randy Dunlap
  2020-10-08 20:34   ` Randy Dunlap
  2020-10-08  7:53 ` [PATCH 03/35] dmem: implement dmem memory management yulei.kernel
                   ` (34 subsequent siblings)
  36 siblings, 2 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce 'dmem=' to reserve system memory for DMEM (direct memory).
Compared with 'mem=' and 'memmap=', it reserves memory based on
NUMA topology; for detailed info, please refer to
kernel-parameters.txt.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 .../admin-guide/kernel-parameters.txt         |  38 +++
 arch/x86/kernel/setup.c                       |   3 +
 include/linux/dmem.h                          |  16 +
 mm/Kconfig                                    |   9 +
 mm/Makefile                                   |   1 +
 mm/dmem.c                                     | 137 ++++++++
 mm/dmem_reserve.c                             | 303 ++++++++++++++++++
 7 files changed, 507 insertions(+)
 create mode 100644 include/linux/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1068742a6df..da15d4fc49db 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -980,6 +980,44 @@
 			The filter can be disabled or changed to another
 			driver later using sysfs.
 
+	dmem=[!]size[KMG]
+			[KNL, NUMA] When CONFIG_DMEM is set, this specifies
+			the size of memory reserved for dmemfs on each NUMA
+			memory node. 'size' must be aligned to the default
+			alignment, which is the memory section size (128M by
+			default on x86_64). If '!' is set, that amount of
+			memory on each node is owned by the kernel and dmemfs
+			owns the rest of the memory on each node.
+			Example: Reserve 4G of memory on each node for dmemfs
+				dmem=4G
+
+	dmem=[!]size[KMG]:align[KMG]
+			[KNL, NUMA] Ditto. 'align' must be a power of two and
+			not smaller than the default alignment. Also,
+			'size' must be aligned to 'align'.
+			Example: Bad dmem parameter because 'size' is misaligned
+				dmem=0x40200000:1G
+
+	dmem=size[KMG]@addr[KMG]
+			[KNL] When CONFIG_DMEM is set, this marks a specific
+			memory region, from addr to addr + size, as reserved
+			for dmemfs. Reserving a specific memory region for the
+			kernel is not allowed, so '!' is forbidden here. 'addr'
+			must not be 0 because the kernel occupies a fixed
+			memory region beginning at address 0. As above, 'size'
+			and 'addr' must be aligned to the default
+			alignment.
+			Example: Reserve the 5G-6G range for dmemfs.
+				dmem=1G@5G
+
+	dmem=size[KMG]@addr[KMG]:align[KMG]
+			[KNL] Ditto. 'align' must be a power of two and not
+			smaller than the default alignment. Also 'size'
+			and 'addr' must be aligned to 'align'. Note that
+			'@addr' and ':align' may occur in either order.
+			Example: Reserve the 5G-6G range for dmemfs.
+				dmem=1G:1G@5G
+
 	driver_async_probe=  [KNL]
 			List of driver names to be probed asynchronously.
 			Format: <driver_name1>,<driver_name2>...
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 3511736fbc74..c2e59093a95e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -45,6 +45,7 @@
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
 #include <linux/vmalloc.h>
+#include <linux/dmem.h>
 
 /*
  * max_low_pfn_mapped: highest directly mapped pfn < 4 GB
@@ -1177,6 +1178,8 @@ void __init setup_arch(char **cmdline_p)
 	if (!early_xdbc_setup_hardware())
 		early_xdbc_register_console();
 
+	dmem_reserve_init();
+
 	x86_init.paging.pagetable_init();
 
 	kasan_init();
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
new file mode 100644
index 000000000000..5049322d941c
--- /dev/null
+++ b/include/linux/dmem.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_DMEM_H
+#define _LINUX_DMEM_H
+
+#ifdef CONFIG_DMEM
+int dmem_reserve_init(void);
+void dmem_init(void);
+int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
+
+#else
+static inline int dmem_reserve_init(void)
+{
+	return 0;
+}
+#endif
+#endif	/* _LINUX_DMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 6c974888f86f..e1995da11cea 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -226,6 +226,15 @@ config BALLOON_COMPACTION
 	  scenario aforementioned and helps improving memory defragmentation.
 
 #
+# support for direct memory basics
+config DMEM
+	bool "Direct Memory Reservation"
+	def_bool n
+	depends on SPARSEMEM
+	help
+	  Allow reservation of memory which can be dedicated to dmem usage.
+	  It is the basis of dmemfs.
+
 # support for memory compaction
 config COMPACTION
 	bool "Allow for memory compaction"
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..97fa2fdf492e 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_DMEM) += dmem.o dmem_reserve.o
diff --git a/mm/dmem.c b/mm/dmem.c
new file mode 100644
index 000000000000..b5fb4f1b92db
--- /dev/null
+++ b/mm/dmem.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memory management for dmemfs
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ *   Chen Zhuo	     <sagazchen@tencent.com>
+ *   Haiwei Li	     <gerryhwli@tencent.com>
+ *   Yulei Zhang     <yuleixzhang@tencent.com>
+ */
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/cpuset.h>
+#include <linux/nodemask.h>
+#include <linux/topology.h>
+#include <linux/dmem.h>
+#include <linux/debugfs.h>
+#include <linux/notifier.h>
+
+/*
+ * There are two kinds of pages in dmem management:
+ * - native page, it's the CPU's page size, i.e., 4K on x86
+ *
+ * - dmem page, it's the unit size used by dmem itself to manage all
+ *     registered memory. It's set by dmem_alloc_init()
+ */
+struct dmem_region {
+	/* original registered memory region */
+	phys_addr_t reserved_start_addr;
+	phys_addr_t reserved_end_addr;
+
+	/* memory region aligned to dmem page */
+	phys_addr_t dpage_start_pfn;
+	phys_addr_t dpage_end_pfn;
+
+	/*
+	 * avoid memory allocation if the dmem region is small enough
+	 */
+	unsigned long static_bitmap;
+	unsigned long *bitmap;
+	u64 next_free_pos;
+	struct list_head node;
+
+	unsigned long static_error_bitmap;
+	unsigned long *error_bitmap;
+};
+
+/*
+ * statically define number of regions to avoid allocating memory
+ * dynamically from memblock as slab is not available at that time
+ */
+#define DMEM_REGION_PAGES	2
+#define INIT_REGION_NUM							\
+	((DMEM_REGION_PAGES << PAGE_SHIFT) / sizeof(struct dmem_region))
+
+static struct dmem_region static_regions[INIT_REGION_NUM];
+
+struct dmem_node {
+	unsigned long total_dpages;
+	unsigned long free_dpages;
+
+	/* fallback list for allocation */
+	int nodelist[MAX_NUMNODES];
+	struct list_head regions;
+};
+
+struct dmem_pool {
+	struct mutex lock;
+
+	unsigned long region_num;
+	unsigned long registered_pages;
+	unsigned long unaligned_pages;
+
+	/* shift bits of dmem page */
+	unsigned long dpage_shift;
+
+	unsigned long total_dpages;
+	unsigned long free_dpages;
+
+	/*
+	 * increased when allocator is initialized,
+	 * stop it being destroyed when someone is
+	 * still using it
+	 */
+	u64 user_count;
+	struct dmem_node nodes[MAX_NUMNODES];
+};
+
+static struct dmem_pool dmem_pool = {
+	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
+};
+
+#define for_each_dmem_node(_dnode)					\
+	for (_dnode = dmem_pool.nodes;					\
+		_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes);	\
+		_dnode++)
+
+void __init dmem_init(void)
+{
+	struct dmem_node *dnode;
+
+	pr_info("dmem: pre-defined region: %ld\n", INIT_REGION_NUM);
+
+	for_each_dmem_node(dnode)
+		INIT_LIST_HEAD(&dnode->regions);
+}
+
+/*
+ * register the memory region to dmem pool as freed memory, the region
+ * should be properly aligned to PAGE_SIZE at least
+ *
+ * it's safe to be out of dmem_pool's lock as it's used at the very
+ * beginning of system boot
+ */
+int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
+{
+	struct dmem_region *dregion;
+
+	pr_info("dmem: register region [%#llx - %#llx] on node %d.\n",
+		(unsigned long long)start, (unsigned long long)end, node);
+
+	if (unlikely(dmem_pool.region_num >= INIT_REGION_NUM)) {
+		pr_err("dmem: region is not sufficient.\n");
+		return -ENOMEM;
+	}
+
+	dregion = &static_regions[dmem_pool.region_num++];
+	dregion->reserved_start_addr = start;
+	dregion->reserved_end_addr = end;
+
+	list_add_tail(&dregion->node, &dmem_pool.nodes[node].regions);
+	dmem_pool.registered_pages += __phys_to_pfn(end) -
+					__phys_to_pfn(start);
+	return 0;
+}
diff --git a/mm/dmem_reserve.c b/mm/dmem_reserve.c
new file mode 100644
index 000000000000..567ee9f18a7d
--- /dev/null
+++ b/mm/dmem_reserve.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support reserved memory for dmem.
+ * As dmem_reserve_init will adjust memblock to reserve memory
+ * for dmem, we could save a vast amount of memory for 'struct page'.
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/memblock.h>
+#include <linux/log2.h>
+#include <linux/dmem.h>
+
+struct dmem_param {
+	phys_addr_t base;
+	phys_addr_t size;
+	phys_addr_t align;
+	/*
+	 * If set to 1, dmem_param specified requested memory for kernel,
+	 * otherwise for dmem.
+	 */
+	bool resv_kernel;
+};
+
+static struct dmem_param dmem_param __initdata;
+
+/* Check dmem param defined by user to match dmem align */
+static int __init check_dmem_param(bool resv_kernel, phys_addr_t base,
+				   phys_addr_t size, phys_addr_t align)
+{
+	phys_addr_t min_align = 1UL << SECTION_SIZE_BITS;
+
+	if (!align)
+		align = min_align;
+
+	/*
+	 * the reserved region should be aligned to memory section
+	 * at least
+	 */
+	if (align < min_align) {
+		pr_warn("dmem: 'align' should be %#llx at least to be aligned to memory section.\n",
+			min_align);
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(align)) {
+		pr_warn("dmem: 'align' should be power of 2.\n");
+		return -EINVAL;
+	}
+
+	if (base & (align - 1)) {
+		pr_warn("dmem: 'addr' is unaligned to 'align' in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (size & (align - 1)) {
+		pr_warn("dmem: 'size' is unaligned to 'align' in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (base >= base + size) {
+		pr_warn("dmem: 'addr + size' overflow in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (resv_kernel && base) {
+		pr_warn("dmem: take a certain base address for kernel is illegal\n");
+		return -EINVAL;
+	}
+
+	dmem_param.base = base;
+	dmem_param.size = size;
+	dmem_param.align = align;
+	dmem_param.resv_kernel = resv_kernel;
+
+	pr_info("dmem: parameter: base address %#llx size %#llx align %#llx resv_kernel %d\n",
+		(unsigned long long)base, (unsigned long long)size,
+		(unsigned long long)align, resv_kernel);
+	return 0;
+}
+
+static int __init parse_dmem(char *p)
+{
+	phys_addr_t base, size, align;
+	char *oldp;
+	bool resv_kernel = false;
+
+	if (!p)
+		return -EINVAL;
+
+	base = align = 0;
+
+	if (*p == '!') {
+		resv_kernel = true;
+		p++;
+	}
+
+	oldp = p;
+	size = memparse(p, &p);
+	if (oldp == p)
+		return -EINVAL;
+
+	if (!size) {
+		pr_warn("dmem: 'size' of 0 defined in dmem=, or {invalid} param\n");
+		return -EINVAL;
+	}
+
+	while (*p) {
+		phys_addr_t *pvalue;
+
+		switch (*p) {
+		case '@':
+			pvalue = &base;
+			break;
+		case ':':
+			pvalue = &align;
+			break;
+		default:
+			pr_warn("dmem: unknown indicator: %c in dmem=\n", *p);
+			return -EINVAL;
+		}
+
+		/*
+		 * Some attribute had been specified multiple times.
+		 * This is not allowed.
+		 */
+		if (*pvalue)
+			return -EINVAL;
+
+		oldp = ++p;
+		*pvalue = memparse(p, &p);
+		if (oldp == p)
+			return -EINVAL;
+
+		if (*pvalue == 0) {
+			pr_warn("dmem: 'addr' or 'align' should not be set to 0\n");
+			return -EINVAL;
+		}
+	}
+
+	return check_dmem_param(resv_kernel, base, size, align);
+}
+
+early_param("dmem", parse_dmem);
+
+/*
+ * We want to remove a memory range from memblock.memory thoroughly.
+ * As isolating memblock.memory in memblock_remove needs to double
+ * the array of memblock_region, the memory allocated for the new array
+ * may be located in the memory range which we want to remove.
+ *	So, conflict.
+ * To resolve this conflict, reserve this memory range first.
+ * While reserving this memory range, isolating memory.reserved will allocate
+ * memory excluded from the memory range to be removed, so the subsequent
+ * array doubling in memblock_remove can't observe this reserved range.
+ */
+static void __init dmem_remove_memblock(phys_addr_t base, phys_addr_t size)
+{
+	memblock_reserve(base, size);
+	memblock_remove(base, size);
+	memblock_free(base, size);
+}
+
+static u64 node_req_mem[MAX_NUMNODES] __initdata;
+
+/* Reserve certain size of memory for dmem in each numa node */
+static void __init dmem_reserve_size(phys_addr_t size, phys_addr_t align,
+		bool resv_kernel)
+{
+	phys_addr_t start, end;
+	u64 i;
+	int nid;
+
+	/* Calculate available free memory on each node */
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start,
+				&end, &nid)
+		node_req_mem[nid] += end - start;
+
+	/* Calculate memory size needed to reserve on each node for dmem */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		node_req_mem[i] = ALIGN(node_req_mem[i], align);
+
+		if (!resv_kernel) {
+			node_req_mem[i] = min(size, node_req_mem[i]);
+			continue;
+		}
+
+		/* leave dmem_param.size memory for kernel */
+		if (node_req_mem[i] > size)
+			node_req_mem[i] = node_req_mem[i] - size;
+		else
+			node_req_mem[i] = 0;
+	}
+
+retry:
+	for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+					&start, &end, &nid) {
+		/* Well, we have got enough memory for this node. */
+		if (!node_req_mem[nid])
+			continue;
+
+		start = round_up(start, align);
+		end = round_down(end, align);
+		/* Skip memblock_region which is too small */
+		if (start >= end)
+			continue;
+
+		/* Towards memory block at higher address */
+		start = end - min((end - start), node_req_mem[nid]);
+
+		/*
+		 * do not have enough resource to save the region, skip it
+		 * from now on
+		 */
+		if (dmem_region_register(nid, start, end) < 0)
+			break;
+
+		dmem_remove_memblock(start, end - start);
+
+		node_req_mem[nid] -= end - start;
+
+		/* We have dropped a memblock, so re-walk it. */
+		goto retry;
+	}
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (!node_req_mem[i])
+			continue;
+
+		pr_info("dmem: %#llx size of memory is not reserved on node %lld due to misaligned regions.\n",
+			(unsigned long long)size, i);
+	}
+
+}
+
+/* Reserve [base, base + size) for dmem. */
+static void __init
+dmem_reserve_region(phys_addr_t base, phys_addr_t size, phys_addr_t align)
+{
+	phys_addr_t start, end;
+	phys_addr_t p_start, p_end;
+	u64 i;
+	int nid;
+
+	p_start = base;
+	p_end = base + size;
+
+retry:
+	for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+					&start, &end, &nid) {
+		/* Find region located in user defined range. */
+		if (start >= p_end || end <= p_start)
+			continue;
+
+		start = round_up(max(start, p_start), align);
+		end = round_down(min(end, p_end), align);
+		if (start >= end)
+			continue;
+
+		if (dmem_region_register(nid, start, end) < 0)
+			break;
+
+		dmem_remove_memblock(start, end - start);
+
+		size -= end - start;
+		if (!size)
+			return;
+
+		/* We have dropped a memblock, so re-walk it. */
+		goto retry;
+	}
+
+	pr_info("dmem: %#llx size of memory is not reserved for dmem due to holes and misaligned regions in [%#llx, %#llx].\n",
+		(unsigned long long)size, (unsigned long long)base,
+		(unsigned long long)(base + size));
+}
+
+/* Reserve memory for dmem */
+int __init dmem_reserve_init(void)
+{
+	phys_addr_t base, size, align;
+	bool resv_kernel;
+
+	dmem_init();
+
+	base = dmem_param.base;
+	size = dmem_param.size;
+	align = dmem_param.align;
+	resv_kernel = dmem_param.resv_kernel;
+
+	/* Dmem param had not been enabled. */
+	if (size == 0)
+		return 0;
+
+	if (base)
+		dmem_reserve_region(base, size, align);
+	else
+		dmem_reserve_size(size, align, resv_kernel);
+
+	return 0;
+}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 03/35] dmem: implement dmem memory management
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
  2020-10-08  7:53 ` [PATCH 01/35] fs: introduce dmemfs module yulei.kernel
  2020-10-08  7:53 ` [PATCH 02/35] mm: support direct memory reservation yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08  7:53 ` [PATCH 04/35] dmem: let pat recognize dmem yulei.kernel
                   ` (33 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

It introduces the interfaces to manage dmem pages, which include:
 - dmem_region_register(): registers the reserved memory with the
   dmem management system so it can later be allocated out for dmemfs

 - dmem_alloc_init(): initializes the dmem allocator; note that the
   page size the allocator uses isn't the same thing as the alignment
   used to reserve dmem memory

 - dmem_alloc_pages_vma() and dmem_free_pages(): the interfaces for
   allocating and freeing dmem memory; multiple pages can be allocated
   at one time, but the count should be a power of two (a minimal
   usage sketch follows below)
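
The following sketch is illustrative only and is not part of the
patch; it strings together the allocator entry points defined in
mm/dmem.c here, with error handling trimmed:

    #include <linux/dmem.h>

    phys_addr_t addr;
    unsigned int got;

    /* manage dmem in 2M units; the shift must be at least PAGE_SHIFT */
    if (dmem_alloc_init(PMD_SHIFT))
            return;

    /* try to allocate up to 8 contiguous dpages, preferring this node */
    addr = dmem_alloc_pages_nodemask(numa_node_id(), NULL, 8, &got);
    if (addr) {
            /* use [addr, addr + (got << PMD_SHIFT)) -- no struct page here */
            dmem_free_pages(addr, got);
    }

    dmem_alloc_uinit();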

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/dmem.h |   3 +
 mm/dmem.c            | 674 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 5049322d941c..476a82e8f252 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -7,6 +7,9 @@ int dmem_reserve_init(void);
 void dmem_init(void);
 int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 
+int dmem_alloc_init(unsigned long dpage_shift);
+void dmem_alloc_uinit(void);
+
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/mm/dmem.c b/mm/dmem.c
index b5fb4f1b92db..a77a064c8d59 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -91,11 +91,38 @@ static struct dmem_pool dmem_pool = {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 };
 
+#define DMEM_PAGE_SIZE		(1UL << dmem_pool.dpage_shift)
+#define DMEM_PAGE_UP(x)		phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1))
+#define DMEM_PAGE_DOWN(x)	phys_to_dpage(x)
+
+#define dpage_to_phys(_dpage)						\
+	((_dpage) << dmem_pool.dpage_shift)
+#define phys_to_dpage(_addr)						\
+	((_addr) >> dmem_pool.dpage_shift)
+
+#define dpage_to_pfn(_dpage)						\
+	(__phys_to_pfn(dpage_to_phys(_dpage)))
+#define pfn_to_dpage(_pfn)						\
+	(phys_to_dpage(__pfn_to_phys(_pfn)))
+
+#define dnode_to_nid(_dnode)						\
+	((_dnode) - dmem_pool.nodes)
+#define nid_to_dnode(nid)						\
+	(&dmem_pool.nodes[nid])
+
 #define for_each_dmem_node(_dnode)					\
 	for (_dnode = dmem_pool.nodes;					\
 		_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes);	\
 		_dnode++)
 
+#define for_each_dmem_region(_dnode, _dregion)				\
+	list_for_each_entry(_dregion, &(_dnode)->regions, node)
+
+static inline int *dmem_nodelist(int nid)
+{
+	return nid_to_dnode(nid)->nodelist;
+}
+
 void __init dmem_init(void)
 {
 	struct dmem_node *dnode;
@@ -135,3 +162,649 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
 	return 0;
 }
 
+#define PENALTY_FOR_DMEM_SHARED_NODE		(1)
+
+static int dmem_nodeload[MAX_NUMNODES] __initdata;
+
+/* Evaluate penalty for each dmem node */
+static int __init dmem_evaluate_node(int local, int node)
+{
+	int penalty;
+
+	/* Use the distance array to find the distance */
+	penalty = node_distance(local, node);
+
+	/* Penalize nodes under us ("prefer the next node") */
+	penalty += (node < local);
+
+	/* Give preference to headless and unused nodes */
+	if (!cpumask_empty(cpumask_of_node(node)))
+		penalty += PENALTY_FOR_NODE_WITH_CPUS;
+
+	/* Penalize dmem-node shared with kernel */
+	if (node_state(node, N_MEMORY))
+		penalty += PENALTY_FOR_DMEM_SHARED_NODE;
+
+	/* Slight preference for less loaded node */
+	penalty *= (nr_online_nodes * MAX_NUMNODES);
+
+	penalty += dmem_nodeload[node];
+
+	return penalty;
+}
+
+static int __init find_next_dmem_node(int local, nodemask_t *used_nodes)
+{
+	struct dmem_node *dnode;
+	int node, best_node = NUMA_NO_NODE;
+	int penalty, min_penalty = INT_MAX;
+
+	/* Invalid node is not suitable to call node_distance */
+	if (!node_state(local, N_POSSIBLE))
+		return NUMA_NO_NODE;
+
+	/* Use the local node if we haven't already */
+	if (!node_isset(local, *used_nodes)) {
+		node_set(local, *used_nodes);
+		return local;
+	}
+
+	for_each_dmem_node(dnode) {
+		if (list_empty(&dnode->regions))
+			continue;
+
+		node = dnode_to_nid(dnode);
+
+		/* Don't want a node to appear more than once */
+		if (node_isset(node, *used_nodes))
+			continue;
+
+		penalty = dmem_evaluate_node(local, node);
+
+		if (penalty < min_penalty) {
+			min_penalty = penalty;
+			best_node = node;
+		}
+	}
+
+	if (best_node >= 0)
+		node_set(best_node, *used_nodes);
+
+	return best_node;
+}
+
+static int __init dmem_node_init(struct dmem_node *dnode)
+{
+	int *nodelist;
+	nodemask_t used_nodes;
+	int local, node, prev;
+	int load;
+	int i = 0;
+
+	nodelist = dnode->nodelist;
+	nodes_clear(used_nodes);
+	local = dnode_to_nid(dnode);
+	prev = local;
+	load = nr_online_nodes;
+
+	while ((node = find_next_dmem_node(local, &used_nodes)) >= 0) {
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+		if (node_distance(local, node) != node_distance(local, prev))
+			dmem_nodeload[node] = load;
+
+		nodelist[i++] = prev = node;
+		load--;
+	}
+
+	return 0;
+}
+
+static void __init dmem_region_uinit(struct dmem_region *dregion)
+{
+	unsigned long nr_pages, size, *bitmap = dregion->error_bitmap;
+
+	if (!bitmap)
+		return;
+
+	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
+		- __phys_to_pfn(dregion->reserved_start_addr);
+
+	WARN_ON(!nr_pages);
+
+	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
+	if (size > sizeof(dregion->static_bitmap))
+		kfree(bitmap);
+	dregion->error_bitmap = NULL;
+}
+
+/*
+ * we only stop the allocator from using the reserved pages and do not
+ * return pages back if anything goes wrong
+ */
+static void __init dmem_uinit(void)
+{
+	struct dmem_region *dregion, *dr;
+	struct dmem_node *dnode;
+
+	for_each_dmem_node(dnode) {
+		dnode->nodelist[0] = NUMA_NO_NODE;
+		list_for_each_entry_safe(dregion, dr, &dnode->regions, node) {
+			dmem_region_uinit(dregion);
+			dregion->reserved_start_addr =
+				dregion->reserved_end_addr = 0;
+			list_del(&dregion->node);
+		}
+	}
+
+	dmem_pool.region_num = 0;
+	dmem_pool.registered_pages = 0;
+}
+
+static int __init dmem_region_init(struct dmem_region *dregion)
+{
+	unsigned long *bitmap, size, nr_pages;
+
+	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
+		- __phys_to_pfn(dregion->reserved_start_addr);
+
+	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
+	if (size <= sizeof(dregion->static_error_bitmap)) {
+		bitmap = &dregion->static_error_bitmap;
+	} else {
+		bitmap = kzalloc(size, GFP_KERNEL);
+		if (!bitmap)
+			return -ENOMEM;
+	}
+	dregion->error_bitmap = bitmap;
+	return 0;
+}
+
+/*
+ * dmem memory is not 'struct page' backed, i.e., the kernel treats
+ * it as an invalid pfn
+ */
+static int __init dmem_check_region(struct dmem_region *dregion)
+{
+	unsigned long pfn;
+
+	for (pfn = __phys_to_pfn(dregion->reserved_start_addr);
+	      pfn < __phys_to_pfn(dregion->reserved_end_addr); pfn++) {
+		if (!WARN_ON(pfn_valid(pfn)))
+			continue;
+
+		pr_err("dmem: check pfn %#lx failed, its memory was not properly reserved\n",
+			pfn);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init dmem_late_init(void)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *dnode;
+	int ret = 0;
+
+	for_each_dmem_node(dnode) {
+		dmem_node_init(dnode);
+
+		for_each_dmem_region(dnode, dregion) {
+			ret = dmem_region_init(dregion);
+			if (ret)
+				goto exit;
+			ret = dmem_check_region(dregion);
+			if (ret)
+				goto exit;
+		}
+	}
+	return ret;
+exit:
+	dmem_uinit();
+	return ret;
+}
+late_initcall(dmem_late_init);
+
+static int dmem_alloc_region_init(struct dmem_region *dregion,
+				  unsigned long *dpages)
+{
+	unsigned long start, end, *bitmap, size;
+
+	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
+	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
+
+	*dpages = end - start;
+	if (!*dpages)
+		return 0;
+
+	size = BITS_TO_LONGS(*dpages) * sizeof(long);
+	if (size <= sizeof(dregion->static_bitmap))
+		bitmap = &dregion->static_bitmap;
+	else {
+		bitmap = kzalloc(size, GFP_KERNEL);
+		if (!bitmap)
+			return -ENOMEM;
+	}
+
+	dregion->bitmap = bitmap;
+	dregion->next_free_pos = 0;
+	dregion->dpage_start_pfn = start;
+	dregion->dpage_end_pfn = end;
+
+	dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start)
+		- dregion->reserved_start_addr));
+	dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr
+		- dpage_to_phys(end));
+	return 0;
+}
+
+static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage)
+{
+	unsigned long valid_pages;
+	unsigned long pos_pfn, pos_offset;
+	unsigned long pages_per_dpage = DMEM_PAGE_SIZE >> PAGE_SHIFT;
+	phys_addr_t reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+
+	pos_offset = dpage_to_pfn(dpage) - reserved_start_pfn;
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+	if (pos_pfn < pos_offset + pages_per_dpage)
+		return true;
+	return false;
+}
+
+static unsigned long
+dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t dpage,
+			unsigned int dpages_nr)
+{
+	u64 pos = dpage - dregion->dpage_start_pfn;
+	unsigned int i;
+	unsigned long err_num = 0;
+
+	for (i = 0; i < dpages_nr; i++) {
+		if (dmem_dpage_is_error(dregion, dpage + i)) {
+			WARN_ON(!test_bit(pos + i, dregion->bitmap));
+			err_num++;
+		} else {
+			WARN_ON(!__test_and_clear_bit(pos + i,
+						      dregion->bitmap));
+		}
+	}
+	return err_num;
+}
+
+/* set or clear corresponding bit on allocation bitmap based on error bitmap */
+static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
+						    bool set)
+{
+	unsigned long pos_pfn, pos_offset;
+	unsigned long valid_pages, mce_dpages = 0;
+	phys_addr_t dpage, reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
+		- reserved_start_pfn;
+try_set:
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+
+	if (pos_pfn >= valid_pages)
+		return mce_dpages;
+	mce_dpages++;
+	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
+	if (set)
+		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
+					   dregion->bitmap));
+	else
+		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
+					      dregion->bitmap));
+	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
+	goto try_set;
+}
+
+static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
+{
+	unsigned long dpages, size;
+
+	dregion_alloc_bitmap_set_clear(dregion, false);
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+	size = BITS_TO_LONGS(dpages) * sizeof(long);
+	WARN_ON(!bitmap_empty(dregion->bitmap, size * BITS_PER_BYTE));
+}
+
+static void dmem_alloc_region_uinit(struct dmem_region *dregion)
+{
+	unsigned long dpages, size, *bitmap = dregion->bitmap;
+
+	if (!bitmap)
+		return;
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+	WARN_ON(!dpages);
+
+	dmem_uinit_check_alloc_bitmap(dregion);
+
+	size = BITS_TO_LONGS(dpages) * sizeof(long);
+	if (size > sizeof(dregion->static_bitmap))
+		kfree(bitmap);
+	dregion->bitmap = NULL;
+}
+
+static void __dmem_alloc_uinit(void)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+
+	if (!dmem_pool.dpage_shift)
+		return;
+
+	dmem_pool.unaligned_pages = 0;
+
+	for_each_dmem_node(dnode) {
+		for_each_dmem_region(dnode, dregion)
+			dmem_alloc_region_uinit(dregion);
+
+		dnode->total_dpages = dnode->free_dpages = 0;
+	}
+
+	dmem_pool.dpage_shift = 0;
+	dmem_pool.total_dpages = dmem_pool.free_dpages = 0;
+}
+
+static void dnode_count_free_dpages(struct dmem_node *dnode, long dpages)
+{
+	dnode->free_dpages += dpages;
+	dmem_pool.free_dpages += dpages;
+}
+
+/*
+ * uninitialize dmem allocator
+ *
+ * all dpages should be freed before calling it
+ */
+void dmem_alloc_uinit(void)
+{
+	mutex_lock(&dmem_pool.lock);
+	if (!--dmem_pool.user_count)
+		__dmem_alloc_uinit();
+	mutex_unlock(&dmem_pool.lock);
+}
+EXPORT_SYMBOL(dmem_alloc_uinit);
+
+/*
+ * initialize dmem allocator
+ *   @dpage_shift: the shift bits of the dmem page size used to manage
+ *      dmem memory, it should be at least the CPU's native page shift
+ *
+ * Note: the page size the allocator uses isn't the same thing as
+ *       the alignment used to reserve dmem memory
+ */
+int dmem_alloc_init(unsigned long dpage_shift)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+	unsigned long dpages;
+	int ret = 0;
+
+	if (dpage_shift < PAGE_SHIFT)
+		return -EINVAL;
+
+	mutex_lock(&dmem_pool.lock);
+
+	if (dmem_pool.dpage_shift) {
+		/*
+		 * double init on the same page size is okay
+		 * to make the unit tests happy
+		 */
+		if (dmem_pool.dpage_shift != dpage_shift)
+			ret = -EBUSY;
+
+		goto exit;
+	}
+
+	dmem_pool.dpage_shift = dpage_shift;
+
+	for_each_dmem_node(dnode) {
+		for_each_dmem_region(dnode, dregion) {
+			ret = dmem_alloc_region_init(dregion, &dpages);
+			if (ret < 0) {
+				__dmem_alloc_uinit();
+				goto exit;
+			}
+
+			dnode_count_free_dpages(dnode, dpages);
+		}
+		dnode->total_dpages = dnode->free_dpages;
+	}
+
+	dmem_pool.total_dpages = dmem_pool.free_dpages;
+
+	if (dmem_pool.unaligned_pages && !ret)
+		pr_warn("dmem: %llu pages are wasted due to alignment\n",
+			(unsigned long long)dmem_pool.unaligned_pages);
+exit:
+	if (!ret)
+		dmem_pool.user_count++;
+
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_alloc_init);
+
+static phys_addr_t
+dmem_alloc_region_page(struct dmem_region *dregion, unsigned int try_max,
+		       unsigned int *result_nr)
+{
+	unsigned long pos, dpages;
+	unsigned int i;
+
+	/* no dpage is available in this region */
+	if (!dregion->bitmap)
+		return 0;
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+
+	/* no free page in this region */
+	if (dregion->next_free_pos >= dpages)
+		return 0;
+
+	pos = find_next_zero_bit(dregion->bitmap, dpages,
+				 dregion->next_free_pos);
+	if (pos >= dpages) {
+		dregion->next_free_pos = pos;
+		return 0;
+	}
+
+	__set_bit(pos, dregion->bitmap);
+
+	/* do not go beyond the region */
+	try_max = min(try_max, (unsigned int)(dpages - pos - 1));
+	for (i = 1; i < try_max; i++)
+		if (__test_and_set_bit(pos + i, dregion->bitmap))
+			break;
+
+	*result_nr = i;
+	dregion->next_free_pos = pos + *result_nr;
+	return dpage_to_phys(dregion->dpage_start_pfn + pos);
+}
+
+/*
+ * allocate dmem pages from the nodelist
+ *
+ *   @nodelist: dmem_node's nodelist
+ *   @nodemask: nodemask for filtering the dmem nodelist
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+static phys_addr_t
+dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask,
+			       unsigned int try_max, unsigned int *result_nr)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+	phys_addr_t addr = 0;
+	int node, i;
+	unsigned int local_result_nr;
+
+	WARN_ON(try_max > 1 && !result_nr);
+
+	if (!result_nr)
+		result_nr = &local_result_nr;
+
+	*result_nr = 0;
+
+	for (i = 0; !addr && i < ARRAY_SIZE(dnode->nodelist); i++) {
+		node = nodelist[i];
+
+		if (nodemask && !node_isset(node, *nodemask))
+			continue;
+
+		mutex_lock(&dmem_pool.lock);
+
+		WARN_ON(!dmem_pool.dpage_shift);
+
+		dnode = &dmem_pool.nodes[node];
+		for_each_dmem_region(dnode, dregion) {
+			addr = dmem_alloc_region_page(dregion, try_max,
+						      result_nr);
+			if (addr) {
+				dnode_count_free_dpages(dnode,
+							-(long)(*result_nr));
+				break;
+			}
+		}
+
+		mutex_unlock(&dmem_pool.lock);
+	}
+	return addr;
+}
+
+/*
+ * allocate a dmem page from the dmem pool and try to allocate more
+ * continuous dpages if @try_max is not less than 1
+ *
+ *   @nid: the NUMA node the dmem page got from
+ *   @nodemask: nodemask for filtering the dmem nodelist
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+phys_addr_t
+dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max,
+			  unsigned int *result_nr)
+{
+	int *nodelist;
+
+	if (nid >= ARRAY_SIZE(dmem_pool.nodes))
+		return 0;
+
+	nodelist = dmem_nodelist(nid);
+	return dmem_alloc_pages_from_nodelist(nodelist, nodemask,
+					      try_max, result_nr);
+}
+EXPORT_SYMBOL(dmem_alloc_pages_nodemask);
+
+/*
+ * dmem_alloc_pages_vma - Allocate pages for a VMA.
+ *
+ *   @vma:  Pointer to VMA or NULL if not available.
+ *   @addr: Virtual Address of the allocation. Must be inside the VMA.
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * Return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+phys_addr_t
+dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
+		     unsigned int try_max, unsigned int *result_nr)
+{
+	phys_addr_t phys_addr;
+	int *nl;
+	unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+	cpuset_mems_cookie = read_mems_allowed_begin();
+	nl = dmem_nodelist(numa_node_id());
+
+	phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max,
+						   result_nr);
+	if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie)))
+		goto retry_cpuset;
+
+	return phys_addr;
+}
+EXPORT_SYMBOL(dmem_alloc_pages_vma);
+
+/*
+ * Don't need to call it in a lock.
+ * This function uses the reserved addresses those are initially registered
+ * and will not be modified at run time.
+ */
+static struct dmem_region *find_dmem_region(phys_addr_t phys_addr,
+					    struct dmem_node **pdnode)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+
+	for_each_dmem_node(dnode)
+		for_each_dmem_region(dnode, dregion) {
+			if (dregion->reserved_start_addr > phys_addr)
+				continue;
+			if (dregion->reserved_end_addr <= phys_addr)
+				continue;
+
+			*pdnode = dnode;
+			return dregion;
+		}
+
+	return NULL;
+}
+
+/*
+ * free dmem pages back to the dmem pool
+ *   @addr: the physical address to be freed
+ *   @dpages_nr: the number of dpages to be freed
+ */
+void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *pdnode = NULL;
+	phys_addr_t dpage = phys_to_dpage(addr);
+	u64 pos;
+	unsigned long err_dpages;
+
+	mutex_lock(&dmem_pool.lock);
+
+	WARN_ON(!dmem_pool.dpage_shift);
+
+	dregion = find_dmem_region(addr, &pdnode);
+	WARN_ON(!dregion || !dregion->bitmap || !pdnode);
+
+	pos = dpage - dregion->dpage_start_pfn;
+	dregion->next_free_pos = min(dregion->next_free_pos, pos);
+
+	/* it is not possible to span multiple regions */
+	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+
+	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+
+	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
+	mutex_unlock(&dmem_pool.lock);
+}
+EXPORT_SYMBOL(dmem_free_pages);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 04/35] dmem: let pat recognize dmem
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (2 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 03/35] dmem: implement dmem memory management yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-13  7:27   ` Paolo Bonzini
  2020-10-08  7:53 ` [PATCH 05/35] dmemfs: support mmap yulei.kernel
                   ` (32 subsequent siblings)
  36 siblings, 1 reply; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

x86 PAT uses 'struct page' after only checking whether the address
is system RAM; however, that does not hold when dmem is used. Let's
teach PAT to recognize this case: the address is RAM but !pfn_valid().

We always use WB for dmem, and any attempt to change this
behavior is rejected and triggers a WARN_ON.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/mm/pat/memtype.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 8f665c352bf0..fd8a298fc30b 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -511,6 +511,13 @@ static int reserve_ram_pages_type(u64 start, u64 end,
 	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
 		enum page_cache_mode type;
 
+		/*
+		 * it's dmem if it's ram but not 'struct page' backend,
+		 * we always use WB
+		 */
+		if (WARN_ON(!pfn_valid(pfn)))
+			return -EBUSY;
+
 		page = pfn_to_page(pfn);
 		type = get_page_memtype(page);
 		if (type != _PAGE_CACHE_MODE_WB) {
@@ -539,6 +546,13 @@ static int free_ram_pages_type(u64 start, u64 end)
 	u64 pfn;
 
 	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
+		/*
+		 * it's dmem, see the comments in
+		 * reserve_ram_pages_type()
+		 */
+		if (WARN_ON(!pfn_valid(pfn)))
+			continue;
+
 		page = pfn_to_page(pfn);
 		set_page_memtype(page, _PAGE_CACHE_MODE_WB);
 	}
@@ -714,6 +728,13 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
 	if (pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) {
 		struct page *page;
 
+		/*
+		 * dmem always uses WB, see the comments in
+		 * reserve_ram_pages_type()
+		 */
+		if (!pfn_valid(paddr >> PAGE_SHIFT))
+			return rettype;
+
 		page = pfn_to_page(paddr >> PAGE_SHIFT);
 		return get_page_memtype(page);
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 05/35] dmemfs: support mmap
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (3 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 04/35] dmem: let pat recognize dmem yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08  7:53 ` [PATCH 06/35] dmemfs: support truncating inode down yulei.kernel
                   ` (31 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Add mmap support. Note that the file is extended if the mapping goes
beyond the current file size, which drops the requirement for a
write() operation; cutting the file down is not supported yet.
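
As a quick illustration (not part of the patch), a minimal user-space
sketch of how such a file is expected to be used; the mount point
/mnt/dmemfs, the file name and the 2MB dpage size are assumptions for
illustration only:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 2UL << 20;		/* one 2MB dpage (assumed mount pagesize) */
	int fd = open("/mnt/dmemfs/guest-ram", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	/* dmemfs only accepts shared mappings at dpage-aligned offsets */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, len);	/* first touch faults in dmem pages */
	munmap(p, len);
	close(fd);
	return 0;
}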

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c    | 337 ++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dmem.h |  10 ++
 2 files changed, 345 insertions(+), 2 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 6a8a2d9f94e9..21d2f951b4ea 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -26,6 +26,7 @@
 #include <linux/pagevec.h>
 #include <linux/fs_parser.h>
 #include <linux/seq_file.h>
+#include <linux/dmem.h>
 
 MODULE_AUTHOR("Tencent Corporation");
 MODULE_LICENSE("GPL v2");
@@ -105,8 +106,250 @@ static const struct inode_operations dmemfs_file_inode_operations = {
 	.getattr = simple_getattr,
 };
 
+static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff)
+{
+	struct super_block *sb = inode->i_sb;
+
+	return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT);
+}
+
+static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr)
+{
+	struct super_block *sb = inode->i_sb;
+
+	addr >>= sb->s_blocksize_bits;
+	return xa_mk_value(addr);
+}
+
+static phys_addr_t dmem_entry_to_addr(struct inode *inode, void *entry)
+{
+	struct super_block *sb = inode->i_sb;
+
+	WARN_ON(!xa_is_value(entry));
+	return xa_to_value(entry) << sb->s_blocksize_bits;
+}
+
+static unsigned long
+dmem_addr_to_pfn(struct inode *inode, phys_addr_t addr, pgoff_t pgoff,
+		 unsigned int fault_shift)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long pfn = addr >> PAGE_SHIFT;
+	unsigned long mask;
+
+	mask = (1UL << ((unsigned int)sb->s_blocksize_bits - fault_shift)) - 1;
+	mask <<= fault_shift - PAGE_SHIFT;
+
+	return pfn + (pgoff & mask);
+}
+
+static inline unsigned long dmem_page_size(struct inode *inode)
+{
+	return inode->i_sb->s_blocksize;
+}
+
+static int check_inode_size(struct inode *inode, loff_t offset)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (offset >= i_size_read(inode))
+		return -EINVAL;
+
+	return 0;
+}
+
+static unsigned
+dmemfs_find_get_entries(struct address_space *mapping, unsigned long start,
+			unsigned int nr_entries, void **entries,
+			unsigned long *indices)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+
+	void *entry;
+	unsigned int ret = 0;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, entry, ULONG_MAX) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		if (xa_is_value(entry))
+			goto export;
+
+		if (unlikely(entry != xas_reload(&xas)))
+			goto retry;
+
+export:
+		indices[ret] = xas.xa_index;
+		entries[ret] = entry;
+		if (++ret == nr_entries)
+			break;
+		continue;
+retry:
+		xas_reset(&xas);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+static void *find_radix_entry_or_next(struct address_space *mapping,
+				      unsigned long start,
+				      unsigned long *eindex)
+{
+	void *entry = NULL;
+
+	dmemfs_find_get_entries(mapping, start, 1, &entry, eindex);
+	return entry;
+}
+
+/*
+ * find the entry in radix tree based on @index, create it if
+ * it does not exist
+ *
+ * return the entry with rcu locked, otherwise ERR_PTR()
+ * is returned
+ */
+static void *
+radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr,
+		       struct inode *inode, pgoff_t pgoff)
+{
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long eindex, index;
+	loff_t offset;
+	phys_addr_t addr;
+	gfp_t gfp_masks = mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM;
+	void *entry;
+	unsigned int try_dpages, dpages;
+	int ret;
+
+retry:
+	offset = ((loff_t)pgoff << PAGE_SHIFT);
+	index = dmem_pgoff_to_index(inode, pgoff);
+	rcu_read_lock();
+	ret = check_inode_size(inode, offset);
+	if (ret) {
+		rcu_read_unlock();
+		return ERR_PTR(ret);
+	}
+
+	try_dpages = dmem_pgoff_to_index(inode, (i_size_read(inode) - offset)
+				     >> PAGE_SHIFT);
+	entry = find_radix_entry_or_next(mapping, index, &eindex);
+	if (entry) {
+		WARN_ON(!xa_is_value(entry));
+		if (eindex == index)
+			return entry;
+
+		WARN_ON(eindex <= index);
+		try_dpages = eindex - index;
+	}
+	rcu_read_unlock();
+
+	/* entry does not exist, create it */
+	addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages);
+	if (!addr) {
+		/*
+		 * do not return -ENOMEM as that will trigger OOM,
+		 * it is useless for reclaiming dmem page
+		 */
+		ret = -EINVAL;
+		goto exit;
+	}
+
+	try_dpages = dpages;
+	while (dpages) {
+		rcu_read_lock();
+		ret = check_inode_size(inode, offset);
+		if (ret)
+			goto unlock_rcu;
+
+		entry = dmem_addr_to_entry(inode, addr);
+		entry = xa_store(&mapping->i_pages, index, entry, gfp_masks);
+		if (!xa_is_err(entry)) {
+			addr += inode->i_sb->s_blocksize;
+			offset += inode->i_sb->s_blocksize;
+			dpages--;
+			mapping->nrexceptional++;
+			index++;
+		}
+
+unlock_rcu:
+		rcu_read_unlock();
+		if (ret)
+			break;
+	}
+
+	if (dpages)
+		dmem_free_pages(addr, dpages);
+
+	/* we have created some entries, let's retry it */
+	if (ret == -EEXIST || try_dpages != dpages)
+		goto retry;
+exit:
+	return ERR_PTR(ret);
+}
+
+static void radix_put_entry(void)
+{
+	rcu_read_unlock();
+}
+
+static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = file_inode(vma->vm_file);
+	phys_addr_t addr;
+	void *entry;
+	int ret;
+
+	if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return VM_FAULT_SIGBUS;
+
+	entry = radix_get_create_entry(vma, (unsigned long)vmf->address,
+				       inode, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		ret = PTR_ERR(entry);
+		goto exit;
+	}
+
+	addr = dmem_entry_to_addr(inode, entry);
+	ret = vmf_insert_pfn(vma, (unsigned long)vmf->address,
+			    dmem_addr_to_pfn(inode, addr, vmf->pgoff,
+					     PAGE_SHIFT));
+	radix_put_entry();
+
+exit:
+	return ret;
+}
+
+static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
+{
+	return dmem_page_size(file_inode(vma->vm_file));
+}
+
+static const struct vm_operations_struct dmemfs_vm_ops = {
+	.fault = dmemfs_fault,
+	.pagesize = dmemfs_pagesize,
+};
+
 int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+
+	if (vma->vm_pgoff & ((dmem_page_size(inode) - 1) >> PAGE_SHIFT))
+		return -EINVAL;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+
+	file_accessed(file);
+	vma->vm_ops = &dmemfs_vm_ops;
 	return 0;
 }
 
@@ -189,9 +432,86 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
+/*
+ * should make sure the dmem page in the dropped region is not
+ * being mapped by any process
+ */
+static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct pagevec pvec;
+	unsigned long istart, iend, indices[PAGEVEC_SIZE];
+	int i;
+
+	/* we never use normal pages */
+	WARN_ON(mapping->nrpages);
+
+	/* if no dpage is allocated for the inode */
+	if (!mapping->nrexceptional)
+		return;
+
+	istart = dmem_pgoff_to_index(inode, start >> PAGE_SHIFT);
+	iend = dmem_pgoff_to_index(inode, end >> PAGE_SHIFT);
+	pagevec_init(&pvec);
+	while (istart < iend) {
+		pvec.nr = dmemfs_find_get_entries(mapping, istart,
+				min(iend - istart,
+				(unsigned long)PAGEVEC_SIZE),
+				(void **)pvec.pages,
+				indices);
+		if (!pvec.nr)
+			break;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			phys_addr_t addr;
+
+			istart = indices[i];
+			if (istart >= iend)
+				break;
+
+			xa_erase(&mapping->i_pages, istart);
+			mapping->nrexceptional--;
+
+			addr = dmem_entry_to_addr(inode, pvec.pages[i]);
+			dmem_free_page(addr);
+		}
+
+		/*
+		 * only exception entries in pagevec, it's safe to
+		 * reinit it
+		 */
+		pagevec_reinit(&pvec);
+		cond_resched();
+		istart++;
+	}
+}
+
+static void dmemfs_evict_inode(struct inode *inode)
+{
+	/* no VMA works on it */
+	WARN_ON(!RB_EMPTY_ROOT(&inode->i_data.i_mmap.rb_root));
+
+	inode_drop_dpages(inode, 0, LLONG_MAX);
+	clear_inode(inode);
+}
+
+/*
+ * Display the mount options in /proc/mounts.
+ */
+static int dmemfs_show_options(struct seq_file *m, struct dentry *root)
+{
+	struct dmemfs_fs_info *fsi = root->d_sb->s_fs_info;
+
+	if (check_dpage_size(fsi->mount_opts.dpage_size))
+		seq_printf(m, ",pagesize=%lx", fsi->mount_opts.dpage_size);
+	return 0;
+}
+
 static const struct super_operations dmemfs_ops = {
 	.statfs	= dmemfs_statfs,
+	.evict_inode = dmemfs_evict_inode,
 	.drop_inode = generic_delete_inode,
+	.show_options = dmemfs_show_options,
 };
 
 static int
@@ -199,6 +519,7 @@ dmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
 {
 	struct inode *inode;
 	struct dmemfs_fs_info *fsi = sb->s_fs_info;
+	int ret;
 
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = fsi->mount_opts.dpage_size;
@@ -207,11 +528,17 @@ dmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	sb->s_op = &dmemfs_ops;
 	sb->s_time_gran = 1;
 
+	ret = dmem_alloc_init(sb->s_blocksize_bits);
+	if (ret)
+		return ret;
+
 	inode = dmemfs_get_inode(sb, NULL, S_IFDIR, 0);
 	sb->s_root = d_make_root(inode);
-	if (!sb->s_root)
-		return -ENOMEM;
 
+	if (!sb->s_root) {
+		dmem_alloc_uinit();
+		return -ENOMEM;
+	}
 	return 0;
 }
 
@@ -247,7 +574,13 @@ int dmemfs_init_fs_context(struct fs_context *fc)
 
 static void dmemfs_kill_sb(struct super_block *sb)
 {
+	bool has_inode = !!sb->s_root;
+
 	kill_litter_super(sb);
+
+	/* do not uninit dmem allocator if mount failed */
+	if (has_inode)
+		dmem_alloc_uinit();
 }
 
 static struct file_system_type dmemfs_fs_type = {
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 476a82e8f252..8682d63ed43a 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -10,6 +10,16 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 int dmem_alloc_init(unsigned long dpage_shift);
 void dmem_alloc_uinit(void);
 
+phys_addr_t
+dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max,
+			  unsigned int *result_nr);
+
+phys_addr_t
+dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
+		     unsigned int try_max, unsigned int *result_nr);
+
+void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
+#define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 #else
 static inline int dmem_reserve_init(void)
 {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 06/35] dmemfs: support truncating inode down
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (4 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 05/35] dmemfs: support mmap yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08  7:53 ` [PATCH 07/35] dmem: trace core functions yulei.kernel
                   ` (30 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Supporting truncating an inode down introduces a race between the
page fault handler and the truncate handler, because the entry being
deleted may still be mapped into a process's VMA.

In order to keep page faults fast (it is the hot path), we use RCU
to synchronize the two handlers. When the inode's size is updated,
the truncate handler makes sure the new size is visible to the page
fault handler, which will no longer use a truncated entry nor create
new entries in that region.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 21d2f951b4ea..d617494fc633 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -101,8 +101,73 @@ static const struct inode_operations dmemfs_dir_inode_operations = {
 	.rename		= simple_rename,
 };
 
+static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end);
+
+static int dmemfs_truncate(struct inode *inode, loff_t newsize)
+{
+	struct super_block *sb = inode->i_sb;
+	loff_t current_size;
+
+	if (newsize & ((1 << sb->s_blocksize_bits) - 1))
+		return -EINVAL;
+
+	current_size = i_size_read(inode);
+	i_size_write(inode, newsize);
+
+	if (newsize >= current_size)
+		return 0;
+
+	/* it cuts the inode down */
+
+	/*
+	 * we should make sure inode->i_size has been updated before
+	 * unmapping and dropping radix entries, so that other sides
+	 * can not create new i_mapping entry beyond inode->i_size
+	 * and the radix entry in the truncated region is not being
+	 * used
+	 *
+	 * see the comments in dmemfs_fault()
+	 */
+	synchronize_rcu();
+
+	/*
+	 * should unmap all mapping first as dmem pages are freed in
+	 * inode_drop_dpages()
+	 *
+	 * after that, dmem page in the truncated region is not used
+	 * by any process
+	 */
+	unmap_mapping_range(inode->i_mapping, newsize, 0, 1);
+
+	inode_drop_dpages(inode, newsize, LLONG_MAX);
+	return 0;
+}
+
+/*
+ * same logic as simple_setattr but we need to handle ftruncate
+ * carefully as we inserted self-defined entry into radix tree
+ */
+static int dmemfs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = dentry->d_inode;
+	int error;
+
+	error = setattr_prepare(dentry, iattr);
+	if (error)
+		return error;
+
+	if (iattr->ia_valid & ATTR_SIZE) {
+		error = dmemfs_truncate(inode, iattr->ia_size);
+		if (error)
+			return error;
+	}
+	setattr_copy(inode, iattr);
+	mark_inode_dirty(inode);
+	return 0;
+}
+
 static const struct inode_operations dmemfs_file_inode_operations = {
-	.setattr = simple_setattr,
+	.setattr = dmemfs_setattr,
 	.getattr = simple_getattr,
 };
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 07/35] dmem: trace core functions
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (5 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 06/35] dmemfs: support truncating inode down yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08  7:53 ` [PATCH 08/35] dmem: show some statistic in debugfs yulei.kernel
                   ` (29 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Add tracepoints for the alloc_init, alloc and free functions; they
help us figure out what is happening inside the dmem allocator.
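
For reference (not part of the patch), one way to consume the new
events from user space, assuming tracefs is mounted at
/sys/kernel/debug/tracing; the paths are assumptions and may differ
on a given system:

#include <stdio.h>

static void enable_group(const char *group)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/kernel/debug/tracing/events/%s/enable", group);
	f = fopen(path, "w");
	if (f) {
		fputs("1\n", f);
		fclose(f);
	}
}

int main(void)
{
	char line[256];
	FILE *pipe;

	enable_group("dmem");	/* dmem_alloc_init, dmem_alloc_pages_node, dmem_free_pages */
	enable_group("dmemfs");	/* dmemfs_radix_tree_insert/delete */

	pipe = fopen("/sys/kernel/debug/tracing/trace_pipe", "r");
	if (!pipe)
		return 1;
	while (fgets(line, sizeof(line), pipe))
		fputs(line, stdout);
	return 0;
}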

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/Makefile          |  1 +
 fs/dmemfs/inode.c           |  5 +++
 fs/dmemfs/trace.h           | 54 +++++++++++++++++++++++++++++
 include/trace/events/dmem.h | 68 +++++++++++++++++++++++++++++++++++++
 mm/dmem.c                   |  6 ++++
 5 files changed, 134 insertions(+)
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/trace/events/dmem.h

diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile
index 73bdc9cbc87e..0b36d03f1097 100644
--- a/fs/dmemfs/Makefile
+++ b/fs/dmemfs/Makefile
@@ -2,6 +2,7 @@
 #
 # Makefile for the linux dmem-filesystem routines.
 #
+ccflags-y += -I $(srctree)/$(src)		# needed for trace events
 obj-$(CONFIG_DMEM_FS) += dmemfs.o
 
 dmemfs-y += inode.o
diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index d617494fc633..8b0516d98ee7 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -31,6 +31,9 @@
 MODULE_AUTHOR("Tencent Corporation");
 MODULE_LICENSE("GPL v2");
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -339,6 +342,7 @@ radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr,
 			offset += inode->i_sb->s_blocksize;
 			dpages--;
 			mapping->nrexceptional++;
+			trace_dmemfs_radix_tree_insert(index, entry);
 			index++;
 		}
 
@@ -535,6 +539,7 @@ static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end)
 				break;
 
 			xa_erase(&mapping->i_pages, istart);
+			trace_dmemfs_radix_tree_delete(istart, pvec.pages[i]);
 			mapping->nrexceptional--;
 
 			addr = dmem_entry_to_addr(inode, pvec.pages[i]);
diff --git a/fs/dmemfs/trace.h b/fs/dmemfs/trace.h
new file mode 100644
index 000000000000..cc1165332e60
--- /dev/null
+++ b/fs/dmemfs/trace.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/**
+ * trace.h - dmemfs tracepoints
+ *
+ * Copyright (C)
+ *
+ * Author: Xiao Guangrong <xiaoguangrong@tencent.com>
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dmemfs
+
+#if !defined(_TRACE_DMEMFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DMEMFS_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(dmemfs_radix_tree_class,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,	index)
+		__field(void *, rentry)
+	),
+
+	TP_fast_assign(
+		__entry->index = index;
+		__entry->rentry = rentry;
+	),
+
+	TP_printk("index %lu entry %#lx", __entry->index,
+		  (unsigned long)__entry->rentry)
+);
+
+DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_insert,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry)
+);
+
+DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_delete,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry)
+);
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h
new file mode 100644
index 000000000000..10d1b90a7783
--- /dev/null
+++ b/include/trace/events/dmem.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dmem
+
+#if !defined(_TRACE_DMEM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DMEM_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(dmem_alloc_init,
+	TP_PROTO(unsigned long dpage_shift),
+	TP_ARGS(dpage_shift),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, dpage_shift)
+	),
+
+	TP_fast_assign(
+		__entry->dpage_shift = dpage_shift;
+	),
+
+	TP_printk("dpage_shift %lu", __entry->dpage_shift)
+);
+
+TRACE_EVENT(dmem_alloc_pages_node,
+	TP_PROTO(phys_addr_t addr, int node, int try_max, int result_nr),
+	TP_ARGS(addr, node, try_max, result_nr),
+
+	TP_STRUCT__entry(
+		__field(phys_addr_t, addr)
+		__field(int, node)
+		__field(int, try_max)
+		__field(int, result_nr)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+		__entry->node = node;
+		__entry->try_max = try_max;
+		__entry->result_nr = result_nr;
+	),
+
+	TP_printk("addr %#lx node %d try_max %d result_nr %d",
+		  (unsigned long)__entry->addr, __entry->node,
+		  __entry->try_max, __entry->result_nr)
+);
+
+TRACE_EVENT(dmem_free_pages,
+	TP_PROTO(phys_addr_t addr, int dpages_nr),
+	TP_ARGS(addr, dpages_nr),
+
+	TP_STRUCT__entry(
+		__field(phys_addr_t, addr)
+		__field(int, dpages_nr)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+		__entry->dpages_nr = dpages_nr;
+	),
+
+	TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr,
+		  __entry->dpages_nr)
+);
+#endif
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/dmem.c b/mm/dmem.c
index a77a064c8d59..aa34bf20f830 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -18,6 +18,8 @@
 #include <linux/debugfs.h>
 #include <linux/notifier.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/dmem.h>
 /*
  * There are two kinds of page in dmem management:
  * - nature page, it's the CPU's page size, i.e, 4K on x86
@@ -559,6 +561,8 @@ int dmem_alloc_init(unsigned long dpage_shift)
 
 	mutex_lock(&dmem_pool.lock);
 
+	trace_dmem_alloc_init(dpage_shift);
+
 	if (dmem_pool.dpage_shift) {
 		/*
 		 * double init on the same page size is okay
@@ -686,6 +690,7 @@ dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask,
 			}
 		}
 
+		trace_dmem_alloc_pages_node(addr, node, try_max, *result_nr);
 		mutex_unlock(&dmem_pool.lock);
 	}
 	return addr;
@@ -791,6 +796,7 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 
 	mutex_lock(&dmem_pool.lock);
 
+	trace_dmem_free_pages(addr, dpages_nr);
 	WARN_ON(!dmem_pool.dpage_shift);
 
 	dregion = find_dmem_region(addr, &pdnode);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 08/35] dmem: show some statistic in debugfs
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (6 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 07/35] dmem: trace core functions yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08 20:23   ` Randy Dunlap
  2020-10-08  7:53 ` [PATCH 09/35] dmemfs: support remote access yulei.kernel
                   ` (28 subsequent siblings)
  36 siblings, 1 reply; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Create a 'dmem' directory under debugfs and show some statistics
for the dmem pool; track the total and free dpages for the whole
pool and for each NUMA node.
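
A small sketch (not part of the patch) that dumps the pool-wide
counters this patch exposes, assuming debugfs is mounted at
/sys/kernel/debug and CONFIG_DMEM_DEBUG_FS is enabled; the entry
names mirror dmem_pool_entries[] below:

#include <stdio.h>

int main(void)
{
	static const char *entries[] = {
		"region_num", "registered_pages", "unaligned_pages",
		"dpage_shift", "total_dpages", "free_dpages",
	};
	char path[128], buf[64];

	for (unsigned int i = 0; i < sizeof(entries) / sizeof(entries[0]); i++) {
		snprintf(path, sizeof(path), "/sys/kernel/debug/dmem/%s",
			 entries[i]);
		FILE *f = fopen(path, "r");

		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("%-16s %s", entries[i], buf);
		fclose(f);
	}
	return 0;
}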

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/Kconfig |   9 +++++
 mm/dmem.c  | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 108 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index e1995da11cea..8a67c8933a42 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -235,6 +235,15 @@ config DMEM
 	  Allow reservation of memory which could be dedicated usage of dmem.
 	  It's the basics of dmemfs.
 
+config DMEM_DEBUG_FS
+	bool "Enable debug information for direct memory"
+	depends on DMEM && DEBUG_FS
+	def_bool n
+	help
+	  This option enables showing various statistics of direct memory
+	  in debugfs filesystem.
+
+#
 # support for memory compaction
 config COMPACTION
 	bool "Allow for memory compaction"
diff --git a/mm/dmem.c b/mm/dmem.c
index aa34bf20f830..6992e57d5df0 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -164,6 +164,103 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
 	return 0;
 }
 
+#ifdef CONFIG_DMEM_DEBUG_FS
+struct debugfs_entry {
+	const char *name;
+	unsigned long offset;
+};
+
+#define DMEM_POOL_OFFSET(x)	offsetof(struct dmem_pool, x)
+#define DMEM_POOL_ENTRY(x)	{__stringify(x), DMEM_POOL_OFFSET(x)}
+
+#define DMEM_NODE_OFFSET(x)	offsetof(struct dmem_node, x)
+#define DMEM_NODE_ENTRY(x)	{__stringify(x), DMEM_NODE_OFFSET(x)}
+
+static struct debugfs_entry dmem_pool_entries[] = {
+	DMEM_POOL_ENTRY(region_num),
+	DMEM_POOL_ENTRY(registered_pages),
+	DMEM_POOL_ENTRY(unaligned_pages),
+	DMEM_POOL_ENTRY(dpage_shift),
+	DMEM_POOL_ENTRY(total_dpages),
+	DMEM_POOL_ENTRY(free_dpages),
+};
+
+static struct debugfs_entry dmem_node_entries[] = {
+	DMEM_NODE_ENTRY(total_dpages),
+	DMEM_NODE_ENTRY(free_dpages),
+};
+
+static int dmem_entry_get(void *offset, u64 *val)
+{
+	*val = *(u64 *)offset;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(dmem_fops, dmem_entry_get, NULL, "%llu\n");
+
+static int dmemfs_init_debugfs_node(struct dmem_node *dnode,
+				    struct dentry *parent)
+{
+	struct dentry *node_dir;
+	char dir_name[32];
+	int i, ret = -EEXIST;
+
+	snprintf(dir_name, sizeof(dir_name), "node%ld",
+		 dnode - dmem_pool.nodes);
+	node_dir = debugfs_create_dir(dir_name, parent);
+	if (!node_dir)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(dmem_node_entries); i++)
+		if (!debugfs_create_file(dmem_node_entries[i].name, 0444,
+		   node_dir, (void *)dnode + dmem_node_entries[i].offset,
+		   &dmem_fops))
+			return ret;
+	return 0;
+}
+
+static int dmemfs_init_debugfs(void)
+{
+	struct dentry *dmem_debugfs_dir;
+	struct dmem_node *dnode;
+	int i, ret = -EEXIST;
+
+	dmem_debugfs_dir = debugfs_create_dir("dmem", NULL);
+	if (!dmem_debugfs_dir)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(dmem_pool_entries); i++)
+		if (!debugfs_create_file(dmem_pool_entries[i].name, 0444,
+		   dmem_debugfs_dir,
+		   (void *)&dmem_pool + dmem_pool_entries[i].offset,
+		   &dmem_fops))
+			goto exit;
+
+	for_each_dmem_node(dnode) {
+		/*
+		 * do not create debugfs files for the node
+		 * where no memory is available
+		 */
+		if (list_empty(&dnode->regions))
+			continue;
+
+		if (dmemfs_init_debugfs_node(dnode, dmem_debugfs_dir))
+			goto exit;
+	}
+
+	return 0;
+exit:
+	debugfs_remove_recursive(dmem_debugfs_dir);
+	return ret;
+}
+
+#else
+static int dmemfs_init_debugfs(void)
+{
+	return 0;
+}
+#endif
+
 #define PENALTY_FOR_DMEM_SHARED_NODE		(1)
 
 static int dmem_nodeload[MAX_NUMNODES] __initdata;
@@ -364,7 +461,8 @@ static int __init dmem_late_init(void)
 				goto exit;
 		}
 	}
-	return ret;
+
+	return dmemfs_init_debugfs();
 exit:
 	dmem_uinit();
 	return ret;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 09/35] dmemfs: support remote access
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (7 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 08/35] dmem: show some statistic in debugfs yulei.kernel
@ 2020-10-08  7:53 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 10/35] dmemfs: introduce max_alloc_try_dpages parameter yulei.kernel
                   ` (27 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:53 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

It is required by ptrace_writedata and ptrace_readdata to access
dmem memory remotely. The typical user is gdb: after this patch,
gdb is able to read & write dmem memory owned by the attached
process.
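
To illustrate the user-visible effect (not part of the patch), a
remote read through /proc/<pid>/mem ends up in access_process_vm()
and, for a dmemfs VMA, in the new ->access handler; the pid and
remote address are supplied by the caller and the caller must have
ptrace permission on the target:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <pid> <hex-addr>\n", argv[0]);
		return 1;
	}

	char path[64], buf[64];
	unsigned long addr = strtoul(argv[2], NULL, 16);

	snprintf(path, sizeof(path), "/proc/%s/mem", argv[1]);
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	ssize_t n = pread(fd, buf, sizeof(buf), addr);

	if (n < 0)
		perror("pread");
	else
		printf("read %zd bytes from pid %s at %#lx\n", n, argv[1], addr);

	close(fd);
	return 0;
}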

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 8b0516d98ee7..4dacbf7e6844 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -367,6 +367,51 @@ static void radix_put_entry(void)
 	rcu_read_unlock();
 }
 
+static bool check_vma_access(struct vm_area_struct *vma, int write)
+{
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
+
+	return !!(vm_flags & vma->vm_flags);
+}
+
+static int
+dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr,
+		   void *buf, int len, int write)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	void *entry, *maddr;
+	int offset, pgoff;
+
+	if (!check_vma_access(vma, write))
+		return -EACCES;
+
+	pgoff = linear_page_index(vma, addr);
+	if (pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return -EFAULT;
+
+	entry = radix_get_create_entry(vma, addr, inode, pgoff);
+	if (IS_ERR(entry))
+		return PTR_ERR(entry);
+
+	offset = addr & (sb->s_blocksize - 1);
+	addr = dmem_entry_to_addr(inode, entry);
+
+	/*
+	 * it is not beyond vma's region as the vma should be aligned
+	 * to blocksize
+	 */
+	len = min(len, (int)(sb->s_blocksize - offset));
+	maddr = __va(addr);
+	if (write)
+		memcpy(maddr + offset, buf, len);
+	else
+		memcpy(buf, maddr + offset, len);
+	radix_put_entry();
+
+	return len;
+}
+
 static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -403,6 +448,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 static const struct vm_operations_struct dmemfs_vm_ops = {
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
+	.access = dmemfs_access_dmem,
 };
 
 int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 10/35] dmemfs: introduce max_alloc_try_dpages parameter
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (8 preceding siblings ...)
  2020-10-08  7:53 ` [PATCH 09/35] dmemfs: support remote access yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 11/35] mm: export mempolicy interfaces to serve dmem allocator yulei.kernel
                   ` (26 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

It specifies the number of dmem pages allocated at one time, so
multiple radix entries can be created per fault. That relieves the
allocation pressure and makes page faults faster.

However, it could leave no dmem page mmapped to userspace even if
some free dmem pages remain.

Set it to 1 to completely disable this behavior.
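
As an illustration only, a sketch of tuning the knob at runtime,
assuming the module ends up named dmemfs so the parameter appears
under /sys/module/dmemfs/parameters/; the exact path is an
assumption, not something this patch defines:

#include <stdio.h>

int main(void)
{
	const char *knob = "/sys/module/dmemfs/parameters/max_alloc_try_dpages";
	FILE *f = fopen(knob, "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	/* ask the fault handler to allocate up to 8 dpages at a time */
	fprintf(f, "8\n");
	fclose(f);
	return 0;
}

Writing 1 restores the default single-dpage behavior.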

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 4dacbf7e6844..6932d73edab6 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -34,6 +34,8 @@ MODULE_LICENSE("GPL v2");
 #define CREATE_TRACE_POINTS
 #include "trace.h"
 
+static uint __read_mostly max_alloc_try_dpages = 1;
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -46,6 +48,44 @@ enum dmemfs_param {
 	Opt_dpagesize,
 };
 
+static int
+max_alloc_try_dpages_set(const char *val, const struct kernel_param *kp)
+{
+	uint sval;
+	int ret;
+
+	ret = kstrtouint(val, 0, &sval);
+	if (ret)
+		return ret;
+
+	/* should be 1 at least */
+	if (!sval)
+		return -EINVAL;
+
+	max_alloc_try_dpages = sval;
+	return 0;
+}
+
+static struct kernel_param_ops alloc_max_try_dpages_ops = {
+	.set = max_alloc_try_dpages_set,
+	.get = param_get_uint,
+};
+
+/*
+ * it specifies the number of dmem pages allocated at one time, so
+ * multiple radix entries can be created. That relieves the
+ * allocation pressure and makes page faults faster.
+ *
+ * however that could cause no dmem page mmapped to userspace
+ * even if there are some free dmem pages
+ *
+ * set it to 1 to completely disable this behavior
+ */
+fs_param_cb(max_alloc_try_dpages, &alloc_max_try_dpages_ops,
+	    &max_alloc_try_dpages, 0644);
+__MODULE_PARM_TYPE(max_alloc_try_dpages, "uint");
+MODULE_PARM_DESC(max_alloc_try_dpages, "Set the dmem page number allocated at one time, should be 1 at least");
+
 const struct fs_parameter_spec dmemfs_fs_parameters[] = {
 	fsparam_string("pagesize", Opt_dpagesize),
 	{}
@@ -317,6 +357,7 @@ radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr,
 	}
 	rcu_read_unlock();
 
+	try_dpages = min(try_dpages, max_alloc_try_dpages);
 	/* entry does not exist, create it */
 	addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages);
 	if (!addr) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 11/35] mm: export mempolicy interfaces to serve dmem allocator
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (9 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 10/35] dmemfs: introduce max_alloc_try_dpages parameter yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 12/35] dmem: introduce mempolicy support yulei.kernel
                   ` (25 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Export the interfaces get_vma_policy() and interleave_nid() to serve
the dmem allocator.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/mempolicy.h | 3 +++
 mm/mempolicy.c            | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f1c74df264d..478966133514 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,9 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr);
+unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma,
+			unsigned long addr, int shift);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..b3103f5d9123 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1816,7 +1816,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -1982,7 +1982,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 }
 
 /* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
+unsigned interleave_nid(struct mempolicy *pol,
 		 struct vm_area_struct *vma, unsigned long addr, int shift)
 {
 	if (vma) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 12/35] dmem: introduce mempolicy support
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (10 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 11/35] mm: export mempolicy interfaces to serve dmem allocator yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 13/35] mm, dmem: introduce PFN_DMEM and pfn_t_dmem yulei.kernel
                   ` (24 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

Add mempolicy support so that dmem allocates memory from the nodes
specified by the mempolicy.
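
As a quick illustration (not part of the patch), a user-space sketch
of applying a NUMA policy to a dmemfs mapping, assuming a file on a
dmemfs mount at /mnt/dmemfs and a two-node machine; it needs
libnuma's <numaif.h> for the mbind() wrapper:

#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4UL << 20;
	int fd = open("/mnt/dmemfs/guest-ram", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* interleave dpage allocations across nodes 0 and 1 */
	unsigned long nodemask = 0x3;

	if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
		  sizeof(nodemask) * 8, 0))
		perror("mbind");

	memset(p, 0, len);	/* faults now go through dmem_alloc_pages_vma() */
	munmap(p, len);
	close(fd);
	return 0;
}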

Signed-off-by: Haiwei Li   <gerryhwli@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       |  7 ++++
 arch/x86/include/asm/pgtable_types.h | 13 +++++-
 fs/dmemfs/Kconfig                    |  3 ++
 include/linux/pgtable.h              |  7 ++++
 mm/Kconfig                           |  3 ++
 mm/dmem.c                            | 63 +++++++++++++++++++++++++++-
 7 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..86f3139edfc7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -73,6 +73,7 @@ config X86
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
+	select ARCH_HAS_PTE_DMEM		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_UACCESS_MCSAFE		if X86_64 && X86_MCE
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b836138ce852..ea4554a728bc 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -453,6 +453,13 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pmd_t pmd_mkdmem(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c68550..ee4cae110f5c 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,15 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_DMEM		57	/* Flag used to indicate dmem pmd.
+					 * Since _PAGE_BIT_SPECIAL is defined
+					 * the same as _PAGE_BIT_CPA_TEST, we
+					 * cannot use _PAGE_BIT_SPECIAL alone,
+					 * so add _PAGE_BIT_DMEM to help
+					 * indicate it. Since a dmem pte will
+					 * never be split, setting
+					 * _PAGE_BIT_SPECIAL for pte is enough.
+					 */
 #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
@@ -112,9 +121,11 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_DMEM	(_AT(u64, 1) << _PAGE_BIT_DMEM)
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
+#define _PAGE_DMEM	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
@@ -128,7 +139,7 @@
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
-			 _PAGE_UFFD_WP)
+			 _PAGE_UFFD_WP | _PAGE_DMEM)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 /*
diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig
index d2894a513de0..19ca3914da39 100644
--- a/fs/dmemfs/Kconfig
+++ b/fs/dmemfs/Kconfig
@@ -1,5 +1,8 @@
 config DMEM_FS
 	tristate "Direct Memory filesystem support"
+	depends on DMEM
+	depends on TRANSPARENT_HUGEPAGE
+	depends on ARCH_HAS_PTE_DMEM
 	help
 	  dmemfs (Direct Memory filesystem) is device memory or reserved
 	  memory based filesystem. This kind of memory is special as it
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e8cbc2e795d5..45d4c4a3e519 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1129,6 +1129,13 @@ static inline int pud_trans_unstable(pud_t *pud)
 #endif
 }
 
+#ifndef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pmd_t pmd_mkdmem(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 8a67c8933a42..09d1b1551a44 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -795,6 +795,9 @@ config IDLE_PAGE_TRACKING
 config ARCH_HAS_PTE_DEVMAP
 	bool
 
+config ARCH_HAS_PTE_DMEM
+	bool
+
 config ZONE_DEVICE
 	bool "Device memory (pmem, HMM, etc...) hotplug support"
 	depends on MEMORY_HOTPLUG
diff --git a/mm/dmem.c b/mm/dmem.c
index 6992e57d5df0..2e61dbddbc62 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -822,6 +822,56 @@ dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max,
 }
 EXPORT_SYMBOL(dmem_alloc_pages_nodemask);
 
+/* Return a nodelist indicated for current node representing a mempolicy */
+static int *policy_nodelist(struct mempolicy *policy)
+{
+	int nd = numa_node_id();
+
+	switch (policy->mode) {
+	case MPOL_PREFERRED:
+		if (!(policy->flags & MPOL_F_LOCAL))
+			nd = policy->v.preferred_node;
+		break;
+	case MPOL_BIND:
+		if (unlikely(!node_isset(nd, policy->v.nodes)))
+			nd = first_node(policy->v.nodes);
+		break;
+	default:
+		WARN_ON(1);
+	}
+	return dmem_nodelist(nd);
+}
+
+static nodemask_t *dmem_policy_nodemask(struct mempolicy *policy)
+{
+	if (unlikely(policy->mode == MPOL_BIND) &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
+static void
+get_mempolicy_nlist_and_nmask(struct mempolicy *pol,
+			      struct vm_area_struct *vma, unsigned long addr,
+			      int **nl, nodemask_t **nmask)
+{
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned int nid;
+
+		/*
+		 * we use dpage_shift to interleave numa nodes although
+		 * multiple dpages may be allocated
+		 */
+		nid = interleave_nid(pol, vma, addr, dmem_pool.dpage_shift);
+		*nl = dmem_nodelist(nid);
+		*nmask = NULL;
+	} else {
+		*nl = policy_nodelist(pol);
+		*nmask = dmem_policy_nodemask(pol);
+	}
+}
+
 /*
  * dmem_alloc_pages_vma - Allocate pages for a VMA.
  *
@@ -830,6 +880,9 @@ EXPORT_SYMBOL(dmem_alloc_pages_nodemask);
  *   @try_max: try to allocate @try_max dpages if possible
  *   @result_nr: allocated dpage number returned to the caller
  *
+ * This function allocates pages from dmem pool and applies a NUMA policy
+ * associated with the VMA.
+ *
  * Return the physical address of the first dpage allocated from dmem
  * pool, or 0 on failure. The allocated dpage number is filled into
  * @result_nr
@@ -839,13 +892,19 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
 		     unsigned int try_max, unsigned int *result_nr)
 {
 	phys_addr_t phys_addr;
+	struct mempolicy *pol;
 	int *nl;
+	nodemask_t *nmask;
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	nl = dmem_nodelist(numa_node_id());
+	pol = get_vma_policy(vma, addr);
+	cpuset_mems_cookie = read_mems_allowed_begin();
+
+	get_mempolicy_nlist_and_nmask(pol, vma, addr, &nl, &nmask);
+	mpol_cond_put(pol);
 
-	phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max,
+	phys_addr = dmem_alloc_pages_from_nodelist(nl, nmask, try_max,
 						   result_nr);
 	if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 13/35] mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (11 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 12/35] dmem: introduce mempolicy support yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 14/35] mm, dmem: dmem-pmd vs thp-pmd yulei.kernel
                   ` (23 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce PFN_DMEM as a new pfn flag for dmem pfns, defined by
setting bit (BITS_PER_LONG_LONG - 6).

Introduce the pfn_t_dmem() helper to recognize dmem pfns.
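
A user-space sketch of the flag arithmetic only (not kernel code): it
mimics how PFN_DMEM occupies bit (BITS_PER_LONG_LONG - 6) of
pfn_t.val and how pfn_t_dmem() tests it; the struct and helpers here
are stand-ins for the kernel definitions:

#include <stdint.h>
#include <stdio.h>

#define BITS_PER_LONG_LONG	64
#define PFN_DMEM		(1ULL << (BITS_PER_LONG_LONG - 6))

struct pfn_t { uint64_t val; };

static struct pfn_t pfn_to_pfn_t_dmem(unsigned long pfn)
{
	struct pfn_t p = { .val = pfn | PFN_DMEM };	/* tag the pfn as dmem */

	return p;
}

static int pfn_t_dmem(struct pfn_t pfn)
{
	return (pfn.val & PFN_DMEM) == PFN_DMEM;
}

int main(void)
{
	struct pfn_t p = pfn_to_pfn_t_dmem(0x12345);

	printf("val=%#llx dmem=%d raw pfn=%#llx\n",
	       (unsigned long long)p.val, pfn_t_dmem(p),
	       (unsigned long long)(p.val & ~PFN_DMEM));
	return 0;
}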

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/pfn_t.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 2d9148221e9a..c6c0f1f84498 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -11,6 +11,7 @@
  * PFN_MAP - pfn has a dynamic page mapping established by a device driver
  * PFN_SPECIAL - for CONFIG_FS_DAX_LIMITED builds to allow XIP, but not
  *		 get_user_pages
+ * PFN_DMEM - pfn references a dmem page
  */
 #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
 #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
@@ -18,13 +19,15 @@
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
 #define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))
+#define PFN_DMEM (1ULL << (BITS_PER_LONG_LONG - 6))
 
 #define PFN_FLAGS_TRACE \
 	{ PFN_SPECIAL,	"SPECIAL" }, \
 	{ PFN_SG_CHAIN,	"SG_CHAIN" }, \
 	{ PFN_SG_LAST,	"SG_LAST" }, \
 	{ PFN_DEV,	"DEV" }, \
-	{ PFN_MAP,	"MAP" }
+	{ PFN_MAP,	"MAP" }, \
+	{ PFN_DMEM,	"DMEM" }
 
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
@@ -128,4 +131,16 @@ static inline bool pfn_t_special(pfn_t pfn)
 	return false;
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
+
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline bool pfn_t_dmem(pfn_t pfn)
+{
+	return (pfn.val & PFN_DMEM) == PFN_DMEM;
+}
+#else
+static inline bool pfn_t_dmem(pfn_t pfn)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_HAS_PTE_DMEM */
 #endif /* _LINUX_PFN_T_H_ */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 14/35] mm, dmem: dmem-pmd vs thp-pmd
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (12 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 13/35] mm, dmem: introduce PFN_DMEM and pfn_t_dmem yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 15/35] mm: add pmd_special() check for pmd_trans_huge_lock() yulei.kernel
                   ` (22 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

A dmem huge page is ultimately not a transparent huge page. As we
decided to use pmd_special() to distinguish a dmem-pmd from a thp-pmd,
pmd_special() needs slightly different semantics from pmd_trans_huge(),
just as pmd_devmap() does upstream. This distinction is especially
important in some mm-core paths such as zap_pmd_range().

Explicitly mark the pmd_trans_huge() helpers that dmem needs by adding
pmd_special() checks. This method can be reused in many mm-core paths.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h | 10 +++++++++-
 include/linux/pgtable.h        |  5 +++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index ea4554a728bc..e29601cad384 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -260,7 +260,7 @@ static inline int pmd_large(pmd_t pte)
 /* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE;
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -276,6 +276,14 @@ static inline int has_transparent_hugepage(void)
 	return boot_cpu_has(X86_FEATURE_PSE);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline int pmd_special(pmd_t pmd)
+{
+	return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
+		(_PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 45d4c4a3e519..1fe8546c0a7c 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1134,6 +1134,11 @@ static inline pmd_t pmd_mkdmem(pmd_t pmd)
 {
 	return pmd;
 }
+
+static inline int pmd_special(pmd_t pmd)
+{
+	return 0;
+}
 #endif
 
 #ifndef pmd_read_atomic
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 15/35] mm: add pmd_special() check for pmd_trans_huge_lock()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (13 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 14/35] mm, dmem: dmem-pmd vs thp-pmd yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 16/35] dmemfs: introduce ->split() to dmemfs_vm_ops yulei.kernel
                   ` (21 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Now that a dmem-pmd is distinguished from a thp-pmd, add a
pmd_special() check so that pmd_trans_huge_lock() can fetch the ptl
for a dmem huge pmd and treat it as a stable pmd.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/huge_mm.h | 3 ++-
 mm/huge_memory.c        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 8a8bc46a2432..b7381e5aafe5 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -245,7 +245,8 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)
+		|| pmd_devmap(*pmd) || pmd_special(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7ff29cc3d55c..531493a0bc82 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1862,7 +1862,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+			pmd_devmap(*pmd) || pmd_special(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 16/35] dmemfs: introduce ->split() to dmemfs_vm_ops
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (14 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 15/35] mm: add pmd_special() check for pmd_trans_huge_lock() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 17/35] mm, dmemfs: support unmap_page_range() for dmemfs pmd yulei.kernel
                   ` (20 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

It is required by __split_vma() to adjust the vma. A munmap() that
would create a hole not aligned to the dmem page size in a dmemfs
mapping should be forbidden.
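
A sketch of the user-visible effect (not part of the patch), assuming
a dmemfs mount at /mnt/dmemfs with a 2MB page size: unmapping a 4KB
hole inside a dpage forces a VMA split at an unaligned address, which
->split() now rejects with -EINVAL:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t dpage = 2UL << 20;
	int fd = open("/mnt/dmemfs/guest-ram", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, 2 * dpage) < 0)
		return 1;

	char *p = mmap(NULL, 2 * dpage, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		return 1;

	/* a hole that is not dpage-aligned must be refused */
	if (munmap(p + 4096, 4096) < 0 && errno == EINVAL)
		printf("unaligned hole rejected as expected\n");

	/* unmapping whole dpages is still fine */
	munmap(p, 2 * dpage);
	close(fd);
	return 0;
}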

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 6932d73edab6..e37498c00497 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -453,6 +453,13 @@ dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr,
 	return len;
 }
 
+static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	if (addr & (dmem_page_size(file_inode(vma->vm_file)) - 1))
+		return -EINVAL;
+	return 0;
+}
+
 static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -487,6 +494,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 }
 
 static const struct vm_operations_struct dmemfs_vm_ops = {
+	.split = dmemfs_split,
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
 	.access = dmemfs_access_dmem,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 17/35] mm, dmemfs: support unmap_page_range() for dmemfs pmd
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (15 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 16/35] dmemfs: introduce ->split() to dmemfs_vm_ops yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 18/35] mm: follow_pmd_mask() for dmem huge pmd yulei.kernel
                   ` (19 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

It is required by munmap() for dmemfs mappings.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/huge_memory.c | 2 ++
 mm/memory.c      | 8 +++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 531493a0bc82..73af337b454e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1636,6 +1636,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
+	} else if (pmd_special(orig_pmd)) {
+		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
 		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
diff --git a/mm/memory.c b/mm/memory.c
index 469af373ae76..2d2c0f8a966b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1178,10 +1178,12 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+			pmd_devmap(*pmd) || pmd_special(*pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				VM_BUG_ON(pmd_special(*pmd));
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr))
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
 		}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 18/35] mm: follow_pmd_mask() for dmem huge pmd
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (16 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 17/35] mm, dmemfs: support unmap_page_range() for dmemfs pmd yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 19/35] mm: gup_huge_pmd() " yulei.kernel
                   ` (18 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

In follow_pmd_mask(), a dmem huge pmd should be recognized, and an
error pointer of '-EEXIST' is returned to indicate that a proper page
table entry exists for the special pmd but there is no corresponding
struct page, because dmem pages are not backed by struct page. The
pmd is updated if foll_flags contains FOLL_TOUCH.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/gup.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index e5739a1974d5..726ffc5b0ea9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -380,6 +380,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
+static struct page *
+follow_special_pmd(struct vm_area_struct *vma, unsigned long address,
+		   pmd_t *pmd, unsigned int flags)
+{
+	spinlock_t *ptl;
+
+	if (flags & FOLL_DUMP)
+		/* Avoid special (like zero) pages in core dumps */
+		return ERR_PTR(-EFAULT);
+
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return ERR_PTR(-EFAULT);
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (!pmd_special(*pmd)) {
+			spin_unlock(ptl);
+			return NULL;
+		}
+		_pmd = pmd_mkyoung(*pmd);
+		if (flags & FOLL_WRITE)
+			_pmd = pmd_mkdirty(_pmd);
+		if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK,
+					  pmd, _pmd,
+					  flags & FOLL_WRITE))
+			update_mmu_cache_pmd(vma, address, pmd);
+		spin_unlock(ptl);
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return ERR_PTR(-EEXIST);
+}
+
 /*
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
@@ -564,6 +600,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags);
 	}
+	if (pmd_special(*pmd)) {
+		page = follow_special_pmd(vma, address, pmd, flags);
+		if (page)
+			return page;
+		return no_page_table(vma, flags);
+	}
 	if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
 		page = follow_huge_pd(vma, address,
 				      __hugepd(pmd_val(pmdval)), flags,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 19/35] mm: gup_huge_pmd() for dmem huge pmd
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (17 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 18/35] mm: follow_pmd_mask() for dmem huge pmd yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 20/35] mm: support dmem huge pmd for vmf_insert_pfn_pmd() yulei.kernel
                   ` (17 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Add a pmd_special() check in gup_huge_pmd() to support dmem huge pmds.
GUP-fast returns zero when it encounters a dmem page, so the caller can
handle it outside the GUP routine.
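For context, a hypothetical caller (loosely modeled on KVM's hva_to_pfn()
path) would then fall back to resolving the pfn straight from the page table
once GUP reports no pinnable page. The function names below are real, but the
flow is a simplified sketch rather than code from this series, and
use_pfn_without_reference() is a made-up placeholder:

	npages = get_user_pages_unlocked(addr, 1, &page, gup_flags);
	if (npages != 1) {
		/* No struct page behind this VA (e.g. a dmem huge pmd). */
		mmap_read_lock(mm);
		vma = find_vma(mm, addr);
		if (vma && follow_pfn(vma, addr, &pfn) == 0)
			use_pfn_without_reference(pfn);
		mmap_read_unlock(mm);
	}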

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/gup.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index 726ffc5b0ea9..a8edbb6a2b2f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2440,6 +2440,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	/* Bypass dmem huge pmd. It will be handled in outside routine. */
+	if (pmd_special(orig))
+		return 0;
+
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
@@ -2542,7 +2546,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
-			     pmd_devmap(pmd))) {
+			     pmd_devmap(pmd) || pmd_special(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 20/35] mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (18 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 19/35] mm: gup_huge_pmd() " yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 21/35] mm: support dmem huge pmd for follow_pfn() yulei.kernel
                   ` (16 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Since vmf_insert_pfn_pmd() BUG()s on pfns that are not devmap (unless the vma
is PFNMAP/MIXEDMAP), let dmem pfns pass the check as well.

A dmem huge pmd is marked with _PAGE_SPECIAL and _PAGE_DMEM so that
follow_pfn() can recognize it.
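pmd_mkdmem() is not shown in this hunk; it is assumed to mirror the
pud_mkdmem() helper added later in the series, i.e. on x86 it simply sets the
two marker bits. A sketch of the presumed definition:

	static inline pmd_t pmd_mkdmem(pmd_t pmd)
	{
		/* Mark the entry as special and backed by dmem. */
		return pmd_set_flags(pmd, _PAGE_SPECIAL | _PAGE_DMEM);
	}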

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 73af337b454e..a24601c93713 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -781,6 +781,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
+	else if (pfn_t_dmem(pfn))
+		entry = pmd_mkdmem(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -827,7 +829,7 @@ vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
 	 * can't support a 'special' bit.
 	 */
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+			!pfn_t_devmap(pfn) && !pfn_t_dmem(pfn));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 21/35] mm: support dmem huge pmd for follow_pfn()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (19 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 20/35] mm: support dmem huge pmd for vmf_insert_pfn_pmd() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
                   ` (15 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

follow_pfn() now extracts the pfn from the pmd entry when a huge pmd is
encountered.
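A quick worked example of the offset arithmetic used below for a 2MB huge pmd
(the address is made up):

	/*
	 * address                              = 0x7f1234401000
	 * address & ~PMD_MASK                  = 0x1000 (offset in the 2MB page)
	 * (address & ~PMD_MASK) >> PAGE_SHIFT  = 1
	 * *pfn                                 = pmd_pfn(*pmdp) + 1
	 */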

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/memory.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2d2c0f8a966b..ca42a6e56e9b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4644,15 +4644,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	pmd_t *pmdp = NULL;
 
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
+	ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl);
 	if (ret)
 		return ret;
-	*pfn = pte_pfn(*ptep);
-	pte_unmap_unlock(ptep, ptl);
+
+	if (pmdp) {
+		*pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+		spin_unlock(ptl);
+	} else {
+		*pfn = pte_pfn(*ptep);
+		pte_unmap_unlock(ptep, ptl);
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(follow_pfn);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (20 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 21/35] mm: support dmem huge pmd for follow_pfn() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-09  0:58   ` Sean Christopherson
  2020-10-08  7:54 ` [PATCH 23/35] kvm, x86: introduce VM_DMEM yulei.kernel
                   ` (14 subsequent siblings)
  36 siblings, 1 reply; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

A dmem page is pfn-invalid but not MMIO. Support cacheable dmem pages
for KVM.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/kvm/mmu/mmu.c | 5 +++--
 include/linux/dmem.h   | 7 +++++++
 mm/dmem.c              | 7 +++++++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 71aa3da2a0b7..0115c1767063 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -41,6 +41,7 @@
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
 #include <linux/kthread.h>
+#include <linux/dmem.h>
 
 #include <asm/page.h>
 #include <asm/memtype.h>
@@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
 			 */
 			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
 
-	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
+	return (!e820__mapped_raw_any(pfn_to_hpa(pfn),
 				     pfn_to_hpa(pfn + 1) - 1,
-				     E820_TYPE_RAM);
+				     E820_TYPE_RAM)) || (!is_dmem_pfn(pfn));
 }
 
 /* Bits which may be returned by set_spte() */
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 8682d63ed43a..59d3ef14fe42 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -19,11 +19,18 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
 		     unsigned int try_max, unsigned int *result_nr);
 
 void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
+bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 #else
 static inline int dmem_reserve_init(void)
 {
 	return 0;
 }
+
+static inline bool is_dmem_pfn(unsigned long pfn)
+{
+	return 0;
+}
+
 #endif
 #endif	/* _LINUX_DMEM_H */
diff --git a/mm/dmem.c b/mm/dmem.c
index 2e61dbddbc62..eb6df7059cf0 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -972,3 +972,10 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 }
 EXPORT_SYMBOL(dmem_free_pages);
 
+bool is_dmem_pfn(unsigned long pfn)
+{
+	struct dmem_node *dnode;
+
+	return !!find_dmem_region(__pfn_to_phys(pfn), &dnode);
+}
+EXPORT_SYMBOL(is_dmem_pfn);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 23/35] kvm, x86: introduce VM_DMEM
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (21 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 24/35] dmemfs: support hugepage for dmemfs yulei.kernel
                   ` (13 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Currently dmemfs does not support read-only memory, so change_protection()
must be disabled for dmemfs vmas. Because mprotect_fixup() may change
vma->vm_flags, introduce a new vma flag, VM_DMEM, and check it in
mprotect_fixup() to avoid modifying the flags of a dmemfs vma.

The flag is also checked in vma_to_resize() to disable mremap() for dmemfs
vmas.
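From userspace the effect is an EINVAL return; a small hypothetical test (the
mount point /mnt/dmemfs and the file name are assumptions, and error handling
is minimal):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t len = 2UL << 20;		/* one 2MB dmem page */
		int fd = open("/mnt/dmemfs/guest-mem", O_RDWR | O_CREAT, 0600);
		void *p;

		if (fd < 0)
			return 1;
		ftruncate(fd, len);
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Both calls are expected to fail with EINVAL on a VM_DMEM vma. */
		if (mprotect(p, len, PROT_READ))
			perror("mprotect");
		if (mremap(p, len, 2 * len, MREMAP_MAYMOVE) == MAP_FAILED)
			perror("mremap");
		return 0;
	}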

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c  | 2 +-
 include/linux/mm.h | 7 +++++++
 mm/mprotect.c      | 5 ++++-
 mm/mremap.c        | 3 +++
 4 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index e37498c00497..b3e394f33b42 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -510,7 +510,7 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO;
 
 	file_accessed(file);
 	vma->vm_ops = &dmemfs_vm_ops;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..7b1e574d2387 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -309,6 +309,8 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_DMEM		BIT(38)		/* Dmem page VM */
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
@@ -656,6 +658,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_is_dmem(struct vm_area_struct *vma)
+{
+	return !!(vma->vm_flags & VM_DMEM);
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..36f885cbbb30 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -236,7 +236,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * for all the checks.
 		 */
 		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd) && !pmd_special(*pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
@@ -412,6 +412,9 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 		return 0;
 	}
 
+	if (vma_is_dmem(vma))
+		return -EINVAL;
+
 	/*
 	 * Do PROT_NONE PFN permission checks here when we can still
 	 * bail out without undoing a lot of state. This is a rather
diff --git a/mm/mremap.c b/mm/mremap.c
index 138abbae4f75..598e68174e24 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -482,6 +482,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (!vma || vma->vm_start > addr)
 		return ERR_PTR(-EFAULT);
 
+	if (vma_is_dmem(vma))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * !old_len is a special case where an attempt is made to 'duplicate'
 	 * a mapping.  This makes no sense for private mappings as it will
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 24/35] dmemfs: support hugepage for dmemfs
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (22 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 23/35] kvm, x86: introduce VM_DMEM yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 25/35] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() yulei.kernel
                   ` (12 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Add hugepage support for dmemfs. PFN_DMEM is passed to vmf_insert_pfn_pmd(),
and the resulting dmem huge pmd is marked with _PAGE_SPECIAL and _PAGE_DMEM so
that GUP-fast can distinguish dmemfs pages from other page types and handle
them correctly.
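The get_unmapped_area() padding trick added below is easiest to see with
numbers; a hedged example (the addresses are made up), assuming a 2MB dmem
pagesize:

	/*
	 * len = 4MB, align = 2MB  ->  len_pad = 6MB
	 * mm->get_unmapped_area() returns 0x7f1234500000 (only 4K aligned)
	 * round_up(0x7f1234500000, 2MB)  =  0x7f1234600000
	 *
	 * The extra 2MB of padding guarantees that the aligned range
	 * [0x7f1234600000, 0x7f1234600000 + 4MB) still fits inside the
	 * 6MB area that was actually found.
	 */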

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 113 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 2 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index b3e394f33b42..53a9bf214e0d 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -460,7 +460,7 @@ static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr)
 	return 0;
 }
 
-static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+static vm_fault_t __dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
@@ -488,6 +488,63 @@ static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+static vm_fault_t  __dmemfs_pmd_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	unsigned long page_addr;
+	struct inode *inode = file_inode(vma->vm_file);
+	void *entry;
+	phys_addr_t phys;
+	pfn_t pfn;
+	int ret;
+
+	if (dmem_page_size(inode) < PMD_SIZE)
+		return VM_FAULT_FALLBACK;
+
+	WARN_ON(pmd_addr < vma->vm_start ||
+		vma->vm_end < pmd_addr + PMD_SIZE);
+
+	page_addr = vmf->address & ~(dmem_page_size(inode) - 1);
+	entry = radix_get_create_entry(vma, page_addr, inode,
+				       linear_page_index(vma, page_addr));
+	if (IS_ERR(entry))
+		return (PTR_ERR(entry) == -ENOMEM) ?
+			VM_FAULT_OOM : VM_FAULT_SIGBUS;
+
+	phys = dmem_addr_to_pfn(inode, dmem_entry_to_addr(inode, entry),
+				linear_page_index(vma, pmd_addr), PMD_SHIFT);
+	phys <<= PAGE_SHIFT;
+	pfn = phys_to_pfn_t(phys, PFN_DMEM);
+	ret = vmf_insert_pfn_pmd(vmf, pfn, !!(vma->vm_flags & VM_WRITE));
+
+	radix_put_entry();
+	return ret;
+}
+
+static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
+{
+	int ret;
+
+	switch (pe_size) {
+	case PE_SIZE_PTE:
+		ret = __dmemfs_fault(vmf);
+		break;
+	case PE_SIZE_PMD:
+		ret = __dmemfs_pmd_fault(vmf);
+		break;
+	default:
+		ret = VM_FAULT_SIGBUS;
+	}
+
+	return ret;
+}
+
+static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+{
+	return dmemfs_huge_fault(vmf, PE_SIZE_PTE);
+}
+
 static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 {
 	return dmem_page_size(file_inode(vma->vm_file));
@@ -498,6 +555,7 @@ static const struct vm_operations_struct dmemfs_vm_ops = {
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
 	.access = dmemfs_access_dmem,
+	.huge_fault = dmemfs_huge_fault,
 };
 
 int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -510,15 +568,66 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY | VM_DMEM | VM_IO;
+
+	if (dmem_page_size(inode) != PAGE_SIZE)
+		vma->vm_flags |= VM_HUGEPAGE;
 
 	file_accessed(file);
 	vma->vm_ops = &dmemfs_vm_ops;
 	return 0;
 }
 
+/*
+ * If the size of area returned by mm->get_unmapped_area() is one
+ * dmem pagesize larger than 'len', the returned addr by
+ * mm->get_unmapped_area() could be aligned to dmem pagesize to
+ * meet alignment demand.
+ */
+static unsigned long
+dmemfs_get_unmapped_area(struct file *file, unsigned long addr,
+			 unsigned long len, unsigned long pgoff,
+			 unsigned long flags)
+{
+	unsigned long len_pad;
+	unsigned long off = pgoff << PAGE_SHIFT;
+	unsigned long align;
+
+	align = dmem_page_size(file_inode(file));
+
+	/* For pud or pmd pagesizes, fault fallback is not supported. */
+	if (len & (align - 1))
+		return -EINVAL;
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	if (flags & MAP_FIXED) {
+		if (addr & (align - 1))
+			return -EINVAL;
+		return addr;
+	}
+
+	/*
+	 * Pad an extra 'align' of space onto 'len', as we want to find an
+	 * unmapped area large enough to be aligned to the dmemfs pagesize,
+	 * when the dmem pagesize is larger than 4K.
+	 */
+	len_pad = (align == PAGE_SIZE) ? len : len + align;
+
+	/* 'len' or 'off' is too large for pad. */
+	if (len_pad < len || (off + len_pad) < off)
+		return -EINVAL;
+
+	addr = current->mm->get_unmapped_area(file, addr, len_pad,
+					      pgoff, flags);
+
+	/* Now 'addr' could be aligned to upper boundary. */
+	return IS_ERR_VALUE(addr) ? addr : round_up(addr, align);
+}
+
 static const struct file_operations dmemfs_file_operations = {
 	.mmap = dmemfs_file_mmap,
+	.get_unmapped_area = dmemfs_get_unmapped_area,
 };
 
 static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 25/35] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (23 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 24/35] dmemfs: support hugepage for dmemfs yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 26/35] mm, dmem: introduce pud_special() yulei.kernel
                   ` (11 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Fix the estimation of reserved pages in vaddr_get_pfn() and check 'ret'
before checking the writable permission.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 drivers/vfio/vfio_iommu_type1.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 5fbf0c1f7433..257a8cab0a77 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -471,6 +471,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		if (ret == -EAGAIN)
 			goto retry;
 
+		if (!ret && (prot & IOMMU_WRITE) &&
+		    !(vma->vm_flags & VM_WRITE))
+			ret = -EFAULT;
+
 		if (!ret && !is_invalid_reserved_pfn(*pfn))
 			ret = -EFAULT;
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 26/35] mm, dmem: introduce pud_special()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (24 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 25/35] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 27/35] mm: add pud_special() to support dmem huge pud yulei.kernel
                   ` (10 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

pud_special() checks both the _PAGE_SPECIAL and _PAGE_DMEM bits, as
pmd_special() does.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h | 13 +++++++++++++
 include/linux/pgtable.h        | 10 ++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index e29601cad384..313fb4fd6645 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -282,6 +282,12 @@ static inline int pmd_special(pmd_t pmd)
 	return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
 		(_PAGE_SPECIAL | _PAGE_DMEM);
 }
+
+static inline int pud_special(pud_t pud)
+{
+	return (pud_val(pud) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
+		(_PAGE_SPECIAL | _PAGE_DMEM);
+}
 #endif
 
 #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
@@ -517,6 +523,13 @@ static inline pud_t pud_mkdirty(pud_t pud)
 	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pud_t pud_mkdmem(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 static inline pud_t pud_mkdevmap(pud_t pud)
 {
 	return pud_set_flags(pud, _PAGE_DEVMAP);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 1fe8546c0a7c..50f27d61f5cd 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1139,6 +1139,16 @@ static inline int pmd_special(pmd_t pmd)
 {
 	return 0;
 }
+
+static inline pud_t pud_mkdmem(pud_t pud)
+{
+	return pud;
+}
+
+static inline int pud_special(pud_t pud)
+{
+	return 0;
+}
 #endif
 
 #ifndef pmd_read_atomic
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 27/35] mm: add pud_special() to support dmem huge pud
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (25 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 26/35] mm, dmem: introduce pud_special() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 28/35] mm, dmemfs: support huge_fault() for dmemfs yulei.kernel
                   ` (9 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Add pud_special() checks and follow_special_pud() to support dmem huge puds,
as is done for dmem huge pmds.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h |  2 +-
 include/linux/huge_mm.h        |  2 +-
 mm/gup.c                       | 46 ++++++++++++++++++++++++++++++++++
 mm/huge_memory.c               | 11 +++++---
 mm/memory.c                    |  4 +--
 mm/mprotect.c                  |  2 ++
 6 files changed, 59 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 313fb4fd6645..c9a3b1f79cd5 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -266,7 +266,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE;
 }
 #endif
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b7381e5aafe5..ac8eb3e39575 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -254,7 +254,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma)
 {
-	if (pud_trans_huge(*pud) || pud_devmap(*pud))
+	if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud))
 		return __pud_trans_huge_lock(pud, vma);
 	else
 		return NULL;
diff --git a/mm/gup.c b/mm/gup.c
index a8edbb6a2b2f..fdcaeb163bc4 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -416,6 +416,42 @@ follow_special_pmd(struct vm_area_struct *vma, unsigned long address,
 	return ERR_PTR(-EEXIST);
 }
 
+static struct page *
+follow_special_pud(struct vm_area_struct *vma, unsigned long address,
+		   pud_t *pud, unsigned int flags)
+{
+	spinlock_t *ptl;
+
+	if (flags & FOLL_DUMP)
+		/* Avoid special (like zero) pages in core dumps */
+		return ERR_PTR(-EFAULT);
+
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return ERR_PTR(-EFAULT);
+
+	if (flags & FOLL_TOUCH) {
+		pud_t _pud;
+
+		ptl = pud_lock(vma->vm_mm, pud);
+		if (!pud_special(*pud)) {
+			spin_unlock(ptl);
+			return NULL;
+		}
+		_pud = pud_mkyoung(*pud);
+		if (flags & FOLL_WRITE)
+			_pud = pud_mkdirty(_pud);
+		if (pudp_set_access_flags(vma, address & HPAGE_PMD_MASK,
+					  pud, _pud,
+					  flags & FOLL_WRITE))
+			update_mmu_cache_pud(vma, address, pud);
+		spin_unlock(ptl);
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return ERR_PTR(-EEXIST);
+}
+
 /*
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
@@ -716,6 +752,12 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags);
 	}
+	if (pud_special(*pud)) {
+		page = follow_special_pud(vma, address, pud, flags);
+		if (page)
+			return page;
+		return no_page_table(vma, flags);
+	}
 	if (is_hugepd(__hugepd(pud_val(*pud)))) {
 		page = follow_huge_pd(vma, address,
 				      __hugepd(pud_val(*pud)), flags,
@@ -2478,6 +2520,10 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	/* Bypass dmem pud. It will be handled in outside routine. */
+	if (pud_special(orig))
+		return 0;
+
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a24601c93713..29e1ab959c90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,6 +883,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pud_mkdevmap(entry);
+	if (pfn_t_dmem(pfn))
+		entry = pud_mkdmem(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -919,7 +921,7 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
 	 * can't support a 'special' bit.
 	 */
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+			!pfn_t_devmap(pfn) && !pfn_t_dmem(pfn));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1883,7 +1885,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -1894,6 +1896,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	spinlock_t *ptl;
+	pud_t orig_pud;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
@@ -1904,9 +1907,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pgtable_trans_huge_withdraw after finishing pudp related
 	 * operations.
 	 */
-	pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
-	if (vma_is_special_huge(vma)) {
+	if (vma_is_special_huge(vma) || pud_special(orig_pud)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
 	} else {
diff --git a/mm/memory.c b/mm/memory.c
index ca42a6e56e9b..3748fab7cc2a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -922,7 +922,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src
 	src_pud = pud_offset(src_p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud) || pud_special(*src_pud)) {
 			int err;
 
 			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma);
@@ -1215,7 +1215,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+		if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
 				split_huge_pud(vma, pud, addr);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 36f885cbbb30..cae78c0c5160 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -292,6 +292,8 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		if (pud_special(*pud))
+			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 28/35] mm, dmemfs: support huge_fault() for dmemfs
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (26 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 27/35] mm: add pud_special() to support dmem huge pud yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 29/35] mm: add follow_pte_pud() yulei.kernel
                   ` (8 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce __dmemfs_huge_fault() to handle 1G huge pud for dmemfs.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 53a9bf214e0d..027428a7f7a0 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -522,6 +522,43 @@ static vm_fault_t  __dmemfs_pmd_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct inode *inode = file_inode(vma->vm_file);
+	void *entry;
+	phys_addr_t phys;
+	pfn_t pfn;
+	int ret;
+
+	if (dmem_page_size(inode) < PUD_SIZE)
+		return VM_FAULT_FALLBACK;
+
+	WARN_ON(pud_addr < vma->vm_start ||
+		vma->vm_end < pud_addr + PUD_SIZE);
+
+	entry = radix_get_create_entry(vma, pud_addr, inode,
+				       linear_page_index(vma, pud_addr));
+	if (IS_ERR(entry))
+		return (PTR_ERR(entry) == -ENOMEM) ?
+			VM_FAULT_OOM : VM_FAULT_SIGBUS;
+
+	phys = dmem_entry_to_addr(inode, entry);
+	pfn = phys_to_pfn_t(phys, PFN_DMEM);
+	ret = vmf_insert_pfn_pud(vmf, pfn, !!(vma->vm_flags & VM_WRITE));
+
+	radix_put_entry();
+	return ret;
+}
+#else
+static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
 {
 	int ret;
@@ -533,6 +570,9 @@ static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size p
 	case PE_SIZE_PMD:
 		ret = __dmemfs_pmd_fault(vmf);
 		break;
+	case PE_SIZE_PUD:
+		ret = __dmemfs_huge_fault(vmf);
+		break;
 	default:
 		ret = VM_FAULT_SIGBUS;
 	}
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 29/35] mm: add follow_pte_pud()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (27 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 28/35] mm, dmemfs: support huge_fault() for dmemfs yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 30/35] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() yulei.kernel
                   ` (7 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Now that dmem huge puds are supported, support them in hva_to_pfn() as well.

Similar to follow_pte_pmd(), follow_pte_pud() allows a PTE leaf, a huge PMD,
or a huge PUD to be found and returned.
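As with the pmd case, the new pud branch in follow_pfn() just adds the sub-pud
offset; a made-up example for a 1GB dmem pud mapping:

	/*
	 * address                              = 0x7f8040201000
	 * address & ~PUD_MASK                  = 0x00201000
	 * (address & ~PUD_MASK) >> PAGE_SHIFT  = 0x201 (513 pages)
	 * *pfn                                 = pud_pfn(*pudp) + 513
	 */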

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/memory.c | 52 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 44 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3748fab7cc2a..f831ab4b7ccd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4535,9 +4535,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+static int __follow_pte_pud(struct mm_struct *mm, unsigned long address,
 			    struct mmu_notifier_range *range,
-			    pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+			    pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4554,6 +4554,26 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pud = pud_offset(p4d, address);
+	VM_BUG_ON(pud_trans_huge(*pud));
+	if (pud_huge(*pud)) {
+		if (!pudpp)
+			goto out;
+
+		if (range) {
+			mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
+						NULL, mm, address & PUD_MASK,
+						(address & PUD_MASK) + PUD_SIZE);
+			mmu_notifier_invalidate_range_start(range);
+		}
+		*ptlp = pud_lock(mm, pud);
+		if (pud_huge(*pud)) {
+			*pudpp = pud;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+		if (range)
+			mmu_notifier_invalidate_range_end(range);
+	}
 	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
 		goto out;
 
@@ -4609,8 +4629,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, NULL,
-						    ptepp, NULL, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, NULL,
+						    ptepp, NULL, NULL, ptlp)));
 	return res;
 }
 
@@ -4622,12 +4642,24 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, range,
-						    ptepp, pmdpp, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, NULL, ptlp)));
 	return res;
 }
 EXPORT_SYMBOL(follow_pte_pmd);
 
+int follow_pte_pud(struct mm_struct *mm, unsigned long address,
+		   struct mmu_notifier_range *range,
+		   pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, pudpp, ptlp)));
+	return res;
+}
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4645,15 +4677,19 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	pmd_t *pmdp = NULL;
+	pud_t *pudp = NULL;
 
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl);
+	ret = follow_pte_pud(vma->vm_mm, address, NULL, &ptep, &pmdp, &pudp, &ptl);
 	if (ret)
 		return ret;
 
-	if (pmdp) {
+	if (pudp) {
+		*pfn = pud_pfn(*pudp) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+		spin_unlock(ptl);
+	} else if (pmdp) {
 		*pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 		spin_unlock(ptl);
 	} else {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 30/35] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (28 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 29/35] mm: add follow_pte_pud() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 31/35] dmem: introduce mce handler yulei.kernel
                   ` (6 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

If a dmem region is very large and dmemfs is mounted with a 4K pagesize, the
size of the region's bitmap may exceed the maximum allocation size of
kzalloc(), causing kzalloc() to fail.

So introduce dmem_bitmap_alloc() and use vzalloc() when the bitmap is larger
than PAGE_SIZE, since vzalloc() does not need physically contiguous pages.
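Rough numbers to justify the switch (the region size here is hypothetical):

	/*
	 * A 1TB dmem region mounted with 4KB dpages needs
	 *   dpages = 1TB / 4KB                  = 2^28 bits
	 *   size   = BITS_TO_LONGS(dpages) * 8  = 32MB
	 * which is well beyond what kzalloc() can hand out (KMALLOC_MAX_SIZE
	 * is typically a few MB), so anything larger than PAGE_SIZE goes to
	 * vzalloc() instead.
	 */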

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/inode.c         |  6 ++++
 include/linux/fs.h |  1 +
 mm/dmem.c          | 69 +++++++++++++++++++++++++++++-----------------
 3 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 72c4c347afb7..6f8c60ac9302 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -208,6 +208,12 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 }
 EXPORT_SYMBOL(inode_init_always);
 
+struct inode *alloc_inode_nonrcu(void)
+{
+	return kmem_cache_alloc(inode_cachep, GFP_KERNEL);
+}
+EXPORT_SYMBOL(alloc_inode_nonrcu);
+
 void free_inode_nonrcu(struct inode *inode)
 {
 	kmem_cache_free(inode_cachep, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7519ae003a08..872552dc5a61 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2984,6 +2984,7 @@ extern void clear_inode(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode_pseudo(struct super_block *sb);
 extern struct inode *new_inode(struct super_block *sb);
+extern struct inode *alloc_inode_nonrcu(void);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_privs(struct file *);
diff --git a/mm/dmem.c b/mm/dmem.c
index eb6df7059cf0..50cdff98675b 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -17,6 +17,7 @@
 #include <linux/dmem.h>
 #include <linux/debugfs.h>
 #include <linux/notifier.h>
+#include <linux/vmalloc.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/dmem.h>
@@ -362,9 +363,38 @@ static int __init dmem_node_init(struct dmem_node *dnode)
 	return 0;
 }
 
+static unsigned long *dmem_bitmap_alloc(unsigned long pages,
+					unsigned long *static_bitmap)
+{
+	unsigned long *bitmap, size;
+
+	size = BITS_TO_LONGS(pages) * sizeof(long);
+	if (size <= sizeof(*static_bitmap))
+		bitmap = static_bitmap;
+	else if (size <= PAGE_SIZE)
+		bitmap = kzalloc(size, GFP_KERNEL);
+	else
+		bitmap = vzalloc(size);
+
+	return bitmap;
+}
+
+static void dmem_bitmap_free(unsigned long pages,
+			     unsigned long *bitmap,
+			     unsigned long *static_bitmap)
+{
+	unsigned long size;
+
+	size = BITS_TO_LONGS(pages) * sizeof(long);
+	if (size > PAGE_SIZE)
+		vfree(bitmap);
+	else if (bitmap != static_bitmap)
+		kfree(bitmap);
+}
+
 static void __init dmem_region_uinit(struct dmem_region *dregion)
 {
-	unsigned long nr_pages, size, *bitmap = dregion->error_bitmap;
+	unsigned long nr_pages, *bitmap = dregion->error_bitmap;
 
 	if (!bitmap)
 		return;
@@ -374,9 +404,7 @@ static void __init dmem_region_uinit(struct dmem_region *dregion)
 
 	WARN_ON(!nr_pages);
 
-	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	if (size > sizeof(dregion->static_bitmap))
-		kfree(bitmap);
+	dmem_bitmap_free(nr_pages, bitmap, &dregion->static_error_bitmap);
 	dregion->error_bitmap = NULL;
 }
 
@@ -405,19 +433,15 @@ static void __init dmem_uinit(void)
 
 static int __init dmem_region_init(struct dmem_region *dregion)
 {
-	unsigned long *bitmap, size, nr_pages;
+	unsigned long *bitmap, nr_pages;
 
 	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
 		- __phys_to_pfn(dregion->reserved_start_addr);
 
-	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	if (size <= sizeof(dregion->static_error_bitmap)) {
-		bitmap = &dregion->static_error_bitmap;
-	} else {
-		bitmap = kzalloc(size, GFP_KERNEL);
-		if (!bitmap)
-			return -ENOMEM;
-	}
+	bitmap = dmem_bitmap_alloc(nr_pages, &dregion->static_error_bitmap);
+	if (!bitmap)
+		return -ENOMEM;
+
 	dregion->error_bitmap = bitmap;
 	return 0;
 }
@@ -472,7 +496,7 @@ late_initcall(dmem_late_init);
 static int dmem_alloc_region_init(struct dmem_region *dregion,
 				  unsigned long *dpages)
 {
-	unsigned long start, end, *bitmap, size;
+	unsigned long start, end, *bitmap;
 
 	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
 	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
@@ -481,14 +505,9 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	if (!*dpages)
 		return 0;
 
-	size = BITS_TO_LONGS(*dpages) * sizeof(long);
-	if (size <= sizeof(dregion->static_bitmap))
-		bitmap = &dregion->static_bitmap;
-	else {
-		bitmap = kzalloc(size, GFP_KERNEL);
-		if (!bitmap)
-			return -ENOMEM;
-	}
+	bitmap = dmem_bitmap_alloc(*dpages, &dregion->static_bitmap);
+	if (!bitmap)
+		return -ENOMEM;
 
 	dregion->bitmap = bitmap;
 	dregion->next_free_pos = 0;
@@ -582,7 +601,7 @@ static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
 
 static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 {
-	unsigned long dpages, size, *bitmap = dregion->bitmap;
+	unsigned long dpages, *bitmap = dregion->bitmap;
 
 	if (!bitmap)
 		return;
@@ -592,9 +611,7 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 
 	dmem_uinit_check_alloc_bitmap(dregion);
 
-	size = BITS_TO_LONGS(dpages) * sizeof(long);
-	if (size > sizeof(dregion->static_bitmap))
-		kfree(bitmap);
+	dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap);
 	dregion->bitmap = NULL;
 }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 31/35] dmem: introduce mce handler
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (29 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 30/35] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 32/35] mm, dmemfs: register and handle the dmem mce yulei.kernel
                   ` (5 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

Let dmem handle the MCE when the faulting pfn belongs to dmem:
1. Check whether the pfn is managed by dmem; if so, handle it there and return.
2. Mark the pfn in the region's error bitmap.
3. Add mechanisms to ensure that an MCE-hit pfn is never allocated again.
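Condensed from dmem_memory_failure() in the diff below (offset calculations
abbreviated), the flow is roughly:

	if (!find_dmem_region(__pfn_to_phys(pfn), &pdnode))
		return false;		/* not dmem: normal MCE handling */

	mutex_lock(&dmem_pool.lock);
	/* remember the bad pfn in the region's error bitmap */
	if (!__test_and_set_bit(pfn_offset, dregion->error_bitmap)) {
		/* take the covering dpage out of the allocator if it was free */
		if (!__test_and_set_bit(dpage_offset, dregion->bitmap))
			dnode_count_free_dpages(pdnode, -1);
	}
	mutex_unlock(&dmem_pool.lock);
	return true;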

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/dmem.h        |   6 +++
 include/trace/events/dmem.h |  17 ++++++
 mm/dmem.c                   | 103 +++++++++++++++++++++++++-----------
 mm/memory-failure.c         |   6 +++
 4 files changed, 102 insertions(+), 30 deletions(-)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 59d3ef14fe42..cd17a91a7264 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -21,6 +21,8 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
 void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
 bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
+
+bool dmem_memory_failure(unsigned long pfn, int flags);
 #else
 static inline int dmem_reserve_init(void)
 {
@@ -32,5 +34,9 @@ static inline bool is_dmem_pfn(unsigned long pfn)
 	return 0;
 }
 
+static inline bool dmem_memory_failure(unsigned long pfn, int flags)
+{
+	return false;
+}
 #endif
 #endif	/* _LINUX_DMEM_H */
diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h
index 10d1b90a7783..f8eeb3c63b14 100644
--- a/include/trace/events/dmem.h
+++ b/include/trace/events/dmem.h
@@ -62,6 +62,23 @@ TRACE_EVENT(dmem_free_pages,
 	TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr,
 		  __entry->dpages_nr)
 );
+
+TRACE_EVENT(dmem_memory_failure,
+	TP_PROTO(unsigned long pfn, bool used),
+	TP_ARGS(pfn, used),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(bool, used)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->used = used;
+	),
+
+	TP_printk("pfn=%#lx used=%d", __entry->pfn, __entry->used)
+);
 #endif
 
 /* This part must be outside protection */
diff --git a/mm/dmem.c b/mm/dmem.c
index 50cdff98675b..16438dbed3f5 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -431,6 +431,41 @@ static void __init dmem_uinit(void)
 	dmem_pool.registered_pages = 0;
 }
 
+/* set or clear corresponding bit on allocation bitmap based on error bitmap */
+static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
+						    bool set)
+{
+	unsigned long pos_pfn, pos_offset;
+	unsigned long valid_pages, mce_dpages = 0;
+	phys_addr_t dpage, reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
+		- reserved_start_pfn;
+try_set:
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+
+	if (pos_pfn >= valid_pages)
+		return mce_dpages;
+	mce_dpages++;
+	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
+	if (set)
+		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
+					   dregion->bitmap));
+	else
+		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
+					      dregion->bitmap));
+	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
+	goto try_set;
+}
+
+static unsigned long dmem_region_mark_mce_dpages(struct dmem_region *dregion)
+{
+	return dregion_alloc_bitmap_set_clear(dregion, true);
+}
+
 static int __init dmem_region_init(struct dmem_region *dregion)
 {
 	unsigned long *bitmap, nr_pages;
@@ -514,6 +549,8 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	dregion->dpage_start_pfn = start;
 	dregion->dpage_end_pfn = end;
 
+	*dpages -= dmem_region_mark_mce_dpages(dregion);
+
 	dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start)
 		- dregion->reserved_start_addr));
 	dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr
@@ -558,36 +595,6 @@ dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t dpage,
 	return err_num;
 }
 
-/* set or clear corresponding bit on allocation bitmap based on error bitmap */
-static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
-						    bool set)
-{
-	unsigned long pos_pfn, pos_offset;
-	unsigned long valid_pages, mce_dpages = 0;
-	phys_addr_t dpage, reserved_start_pfn;
-
-	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
-
-	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
-	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
-		- reserved_start_pfn;
-try_set:
-	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
-
-	if (pos_pfn >= valid_pages)
-		return mce_dpages;
-	mce_dpages++;
-	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
-	if (set)
-		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
-					   dregion->bitmap));
-	else
-		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
-					      dregion->bitmap));
-	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
-	goto try_set;
-}
-
 static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
 {
 	unsigned long dpages, size;
@@ -989,6 +996,42 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 }
 EXPORT_SYMBOL(dmem_free_pages);
 
+bool dmem_memory_failure(unsigned long pfn, int flags)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *pdnode = NULL;
+	u64 pos;
+	phys_addr_t addr = __pfn_to_phys(pfn);
+	bool used = false;
+
+	dregion = find_dmem_region(addr, &pdnode);
+	if (!dregion)
+		return false;
+
+	WARN_ON(!pdnode || !dregion->error_bitmap);
+
+	mutex_lock(&dmem_pool.lock);
+	pos = pfn - __phys_to_pfn(dregion->reserved_start_addr);
+	if (__test_and_set_bit(pos, dregion->error_bitmap))
+		goto out;
+
+	if (!dregion->bitmap || pfn < dpage_to_pfn(dregion->dpage_start_pfn) ||
+	    pfn >= dpage_to_pfn(dregion->dpage_end_pfn))
+		goto out;
+
+	pos = phys_to_dpage(addr) - dregion->dpage_start_pfn;
+	if (__test_and_set_bit(pos, dregion->bitmap)) {
+		used = true;
+	} else {
+		pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn);
+		dnode_count_free_dpages(pdnode, -1);
+	}
+out:
+	trace_dmem_memory_failure(pfn, used);
+	mutex_unlock(&dmem_pool.lock);
+	return true;
+}
+
 bool is_dmem_pfn(unsigned long pfn)
 {
 	struct dmem_node *dnode;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f1aa6433f404..c613e1ec5995 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -35,6 +35,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/dmem.h>
 #include <linux/page-flags.h>
 #include <linux/kernel-page-flags.h>
 #include <linux/sched/signal.h>
@@ -1280,6 +1281,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
 
+	if (dmem_memory_failure(pfn, flags)) {
+		pr_info("MCE %#lx: handled by dmem\n", pfn);
+		return 0;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		if (pfn_valid(pfn)) {
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 32/35] mm, dmemfs: register and handle the dmem mce
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (30 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 31/35] dmem: introduce mce handler yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 33/35] kvm, x86: temporary disable record_steal_time for dmem yulei.kernel
                   ` (4 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

dmemfs registers an MCE notifier and sends a signal to the processes whose
vmas map the MCE-hit pfn.
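The dmem side presumably fires this chain from its MCE handler; the exact call
site is outside the hunks quoted here, so the invocation below is only an
assumed sketch:

	/* Inside dmem_memory_failure(), once the pfn is known to be dmem: */
	struct dmem_mce_notifier_info info = {
		.flags = flags,
	};

	raw_notifier_call_chain(&dmem_pool.mce_notifier_chain, pfn, &info);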

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c    | 141 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmem.h |   7 +++
 include/linux/mm.h   |   2 +
 mm/dmem.c            |  34 +++++++++++
 mm/memory-failure.c  |  63 ++++++++++++++-----
 5 files changed, 231 insertions(+), 16 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 027428a7f7a0..adfceff98636 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -36,6 +36,47 @@ MODULE_LICENSE("GPL v2");
 
 static uint __read_mostly max_alloc_try_dpages = 1;
 
+struct dmemfs_inode {
+	struct inode *inode;
+	struct list_head link;
+};
+
+static LIST_HEAD(dmemfs_inode_list);
+static DEFINE_SPINLOCK(dmemfs_inode_lock);
+
+static struct dmemfs_inode *
+dmemfs_create_dmemfs_inode(struct inode *inode)
+{
+	struct dmemfs_inode *dmemfs_inode;
+
+	spin_lock(&dmemfs_inode_lock);
+	dmemfs_inode = kmalloc(sizeof(struct dmemfs_inode), GFP_NOIO);
+	if (!dmemfs_inode) {
+		pr_err("DMEMFS: Out of memory while getting dmemfs inode\n");
+		goto out;
+	}
+	dmemfs_inode->inode = inode;
+	list_add_tail(&dmemfs_inode->link, &dmemfs_inode_list);
+out:
+	spin_unlock(&dmemfs_inode_lock);
+	return dmemfs_inode;
+}
+
+static void dmemfs_delete_dmemfs_inode(struct inode *inode)
+{
+	struct dmemfs_inode *i, *next;
+
+	spin_lock(&dmemfs_inode_lock);
+	list_for_each_entry_safe(i, next, &dmemfs_inode_list, link) {
+		if (i->inode == inode) {
+			list_del(&i->link);
+			kfree(i);
+			break;
+		}
+	}
+	spin_unlock(&dmemfs_inode_lock);
+}
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -221,6 +262,13 @@ static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff)
 	return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT);
 }
 
+static pgoff_t dmem_index_to_pgoff(struct inode *inode, unsigned long index)
+{
+	struct super_block *sb = inode->i_sb;
+
+	return index << (sb->s_blocksize_bits - PAGE_SHIFT);
+}
+
 static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr)
 {
 	struct super_block *sb = inode->i_sb;
@@ -809,6 +857,23 @@ static void dmemfs_evict_inode(struct inode *inode)
 	clear_inode(inode);
 }
 
+static struct inode *dmemfs_alloc_inode(struct super_block *sb)
+{
+	struct inode *inode;
+
+	inode = alloc_inode_nonrcu();
+	if (inode)
+		dmemfs_create_dmemfs_inode(inode);
+	return inode;
+}
+
+static void dmemfs_destroy_inode(struct inode *inode)
+{
+	if (inode)
+		dmemfs_delete_dmemfs_inode(inode);
+	free_inode_nonrcu(inode);
+}
+
 /*
  * Display the mount options in /proc/mounts.
  */
@@ -822,9 +887,11 @@ static int dmemfs_show_options(struct seq_file *m, struct dentry *root)
 }
 
 static const struct super_operations dmemfs_ops = {
+	.alloc_inode = dmemfs_alloc_inode,
 	.statfs	= dmemfs_statfs,
 	.evict_inode = dmemfs_evict_inode,
 	.drop_inode = generic_delete_inode,
+	.destroy_inode = dmemfs_destroy_inode,
 	.show_options = dmemfs_show_options,
 };
 
@@ -904,17 +971,91 @@ static struct file_system_type dmemfs_fs_type = {
 	.kill_sb	= dmemfs_kill_sb,
 };
 
+static struct inode *
+dmemfs_find_inode_by_addr(phys_addr_t addr, pgoff_t *pgoff)
+{
+	struct dmemfs_inode *di;
+	struct inode *inode;
+	struct address_space *mapping;
+	void *entry, **slot;
+	void *mce_entry;
+
+	list_for_each_entry(di, &dmemfs_inode_list, link) {
+		inode = di->inode;
+		mapping = inode->i_mapping;
+		mce_entry = dmem_addr_to_entry(inode, addr);
+		XA_STATE(xas, &mapping->i_pages, 0);
+		rcu_read_lock();
+
+		xas_for_each(&xas, entry, ULONG_MAX) {
+			if (xas_retry(&xas, entry))
+				continue;
+
+			if (unlikely(entry != xas_reload(&xas)))
+				goto retry;
+
+			if (mce_entry != entry)
+				continue;
+			*pgoff = dmem_index_to_pgoff(inode, xas.xa_index);
+			rcu_read_unlock();
+			return inode;
+retry:
+			xas_reset(&xas);
+		}
+		rcu_read_unlock();
+	}
+	return NULL;
+}
+
+static int dmemfs_mce_handler(struct notifier_block *this, unsigned long pfn,
+			      void *v)
+{
+	struct dmem_mce_notifier_info *info =
+		(struct dmem_mce_notifier_info *)v;
+	int flags = info->flags;
+	struct inode *inode;
+	phys_addr_t mce_addr = __pfn_to_phys(pfn);
+	pgoff_t pgoff;
+
+	spin_lock(&dmemfs_inode_lock);
+	inode = dmemfs_find_inode_by_addr(mce_addr, &pgoff);
+	if (!inode || !atomic_read(&inode->i_count))
+		goto out;
+
+	collect_procs_and_signal_inode(inode, pgoff, pfn, flags);
+out:
+	spin_unlock(&dmemfs_inode_lock);
+	return 0;
+}
+
+static struct notifier_block dmemfs_mce_notifier = {
+	.notifier_call	= dmemfs_mce_handler,
+};
+
 static int __init dmemfs_init(void)
 {
 	int ret;
 
+	pr_info("dmemfs initialized\n");
 	ret = register_filesystem(&dmemfs_fs_type);
+	if (ret)
+		goto reg_fs_fail;
+
+	ret = dmem_register_mce_notifier(&dmemfs_mce_notifier);
+	if (ret)
+		goto reg_notifier_fail;
 
+	return 0;
+
+reg_notifier_fail:
+	unregister_filesystem(&dmemfs_fs_type);
+reg_fs_fail:
 	return ret;
 }
 
 static void __exit dmemfs_uninit(void)
 {
+	dmem_unregister_mce_notifier(&dmemfs_mce_notifier);
 	unregister_filesystem(&dmemfs_fs_type);
 }
 
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index cd17a91a7264..fe0b270ef1e5 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -23,6 +23,13 @@ bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 
 bool dmem_memory_failure(unsigned long pfn, int flags);
+
+struct dmem_mce_notifier_info {
+	int flags;
+};
+
+int dmem_register_mce_notifier(struct notifier_block *nb);
+int dmem_unregister_mce_notifier(struct notifier_block *nb);
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7b1e574d2387..ff0b12320ca1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3006,6 +3006,8 @@ extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
+extern void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff,
+				    unsigned long pfn, int flags);
 extern int get_hwpoison_page(struct page *page);
 #define put_hwpoison_page(page)	put_page(page)
 extern int sysctl_memory_failure_early_kill;
diff --git a/mm/dmem.c b/mm/dmem.c
index 16438dbed3f5..dd81b2483696 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -70,6 +70,7 @@ struct dmem_node {
 
 struct dmem_pool {
 	struct mutex lock;
+	struct raw_notifier_head mce_notifier_chain;
 
 	unsigned long region_num;
 	unsigned long registered_pages;
@@ -92,6 +93,7 @@ struct dmem_pool {
 
 static struct dmem_pool dmem_pool = {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
+	.mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain),
 };
 
 #define DMEM_PAGE_SIZE		(1UL << dmem_pool.dpage_shift)
@@ -121,6 +123,35 @@ static struct dmem_pool dmem_pool = {
 #define for_each_dmem_region(_dnode, _dregion)				\
 	list_for_each_entry(_dregion, &(_dnode)->regions, node)
 
+int dmem_register_mce_notifier(struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&dmem_pool.lock);
+	ret = raw_notifier_chain_register(&dmem_pool.mce_notifier_chain, nb);
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_register_mce_notifier);
+
+int dmem_unregister_mce_notifier(struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&dmem_pool.lock);
+	ret = raw_notifier_chain_unregister(&dmem_pool.mce_notifier_chain, nb);
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_unregister_mce_notifier);
+
+static int dmem_mce_notify(unsigned long pfn,
+			   struct dmem_mce_notifier_info *info)
+{
+	return raw_notifier_call_chain(&dmem_pool.mce_notifier_chain,
+				       pfn, info);
+}
+
 static inline int *dmem_nodelist(int nid)
 {
 	return nid_to_dnode(nid)->nodelist;
@@ -1003,6 +1034,7 @@ bool dmem_memory_failure(unsigned long pfn, int flags)
 	u64 pos;
 	phys_addr_t addr = __pfn_to_phys(pfn);
 	bool used = false;
+	struct dmem_mce_notifier_info info;
 
 	dregion = find_dmem_region(addr, &pdnode);
 	if (!dregion)
@@ -1022,6 +1054,8 @@ bool dmem_memory_failure(unsigned long pfn, int flags)
 	pos = phys_to_dpage(addr) - dregion->dpage_start_pfn;
 	if (__test_and_set_bit(pos, dregion->bitmap)) {
 		used = true;
+		info.flags = flags;
+		dmem_mce_notify(pfn, &info);
 	} else {
 		pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn);
 		dnode_count_free_dpages(pdnode, -1);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c613e1ec5995..cdd3cd77edbc 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -307,8 +307,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-		       struct vm_area_struct *vma,
-		       struct list_head *to_kill)
+		       struct vm_area_struct *vma, unsigned long pfn,
+		       pgoff_t pgoff, struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -318,12 +318,17 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
-		tk->size_shift = page_shift(compound_head(p));
-
+	if (p) {
+		tk->addr = page_address_in_vma(p, vma);
+		if (is_zone_device_page(p))
+			tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+		else
+			tk->size_shift = page_shift(compound_head(p));
+	} else {
+		tk->size_shift = PAGE_SHIFT;
+		tk->addr = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	}
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
 	 * "tk->size_shift" is always non-zero for !is_zone_device_page(),
@@ -336,7 +341,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	 */
 	if (tk->addr == -EFAULT) {
 		pr_info("Memory failure: Unable to find user space address %lx in %s\n",
-			page_to_pfn(p), tsk->comm);
+			pfn, tsk->comm);
 	} else if (tk->size_shift == 0) {
 		kfree(tk);
 		return;
@@ -469,7 +474,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, vma, page_to_pfn(page),
+					page_to_pgoff(page), to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -477,19 +483,18 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 }
 
 /*
- * Collect processes when the error hit a file mapped page.
+ * Collect processes when the error hit a file mapped memory.
  */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-				int force_early)
+static void __collect_procs_file(struct address_space *mapping, pgoff_t pgoff,
+				struct page *page, unsigned long pfn,
+				struct list_head *to_kill, int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
 		if (!t)
@@ -504,13 +509,39 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * to be informed of all such data corruptions.
 			 */
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, vma, pfn, pgoff, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
 	i_mmap_unlock_read(mapping);
 }
 
+/*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+				int force_early)
+{
+	struct address_space *mapping = page->mapping;
+
+	__collect_procs_file(mapping, page_to_pgoff(page), page,
+			     page_to_pfn(page), to_kill, force_early);
+}
+
+void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff,
+					unsigned long pfn, int flags)
+{
+	int forcekill;
+	struct address_space *mapping = &inode->i_data;
+	LIST_HEAD(tokill);
+
+	__collect_procs_file(mapping, pgoff, NULL, pfn, &tokill,
+			     flags & MF_ACTION_REQUIRED);
+	forcekill = flags & MF_MUST_KILL;
+	kill_procs(&tokill, forcekill, false, pfn, flags);
+}
+EXPORT_SYMBOL(collect_procs_and_signal_inode);
+
 /*
  * Collect the processes who have the corrupted page mapped to kill.
  */
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 33/35] kvm, x86: temporary disable record_steal_time for dmem
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (31 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 32/35] mm, dmemfs: register and handle the dmem mce yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 34/35] dmem: add dmem unit tests yulei.kernel
                   ` (3 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Temporarily disable record_steal_time when entering
the guest for dmem.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/kvm/x86.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1994602a0851..409b5a68aa60 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2789,6 +2789,7 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu)
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
 {
+#if 0
 	struct kvm_host_map map;
 	struct kvm_steal_time *st;
 
@@ -2830,6 +2831,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 	st->version += 1;
 
 	kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false);
+#endif
 }
 
 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 34/35] dmem: add dmem unit tests
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (32 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 33/35] kvm, x86: temporary disable record_steal_time for dmem yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-08  7:54 ` [PATCH 35/35] Add documentation for dmemfs yulei.kernel
                   ` (2 subsequent siblings)
  36 siblings, 0 replies; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

This test case is used to exercise the dmem management system.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 tools/testing/dmem/Kbuild      |   1 +
 tools/testing/dmem/Makefile    |  10 ++
 tools/testing/dmem/dmem-test.c | 184 +++++++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+)
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c

diff --git a/tools/testing/dmem/Kbuild b/tools/testing/dmem/Kbuild
new file mode 100644
index 000000000000..04988f7c76b7
--- /dev/null
+++ b/tools/testing/dmem/Kbuild
@@ -0,0 +1 @@
+obj-m += dmem-test.o
diff --git a/tools/testing/dmem/Makefile b/tools/testing/dmem/Makefile
new file mode 100644
index 000000000000..21f141f585de
--- /dev/null
+++ b/tools/testing/dmem/Makefile
@@ -0,0 +1,10 @@
+KDIR ?= ../../../
+
+default:
+	$(MAKE) -C $(KDIR) M=$$PWD
+
+install: default
+	$(MAKE) -C $(KDIR) M=$$PWD modules_install
+
+clean:
+	rm -f *.o *.ko Module.* modules.* *.mod.c
diff --git a/tools/testing/dmem/dmem-test.c b/tools/testing/dmem/dmem-test.c
new file mode 100644
index 000000000000..4baae18b593e
--- /dev/null
+++ b/tools/testing/dmem/dmem-test.c
@@ -0,0 +1,184 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/sizes.h>
+#include <linux/list.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/dmem.h>
+
+struct dmem_mem_node {
+	struct list_head node;
+};
+
+static LIST_HEAD(dmem_list);
+
+static int dmem_test_alloc_init(unsigned long dpage_shift)
+{
+	int ret;
+
+	ret = dmem_alloc_init(dpage_shift);
+	if (ret)
+		pr_info("dmem_alloc_init failed, dpage_shift %ld ret=%d\n",
+			dpage_shift, ret);
+	return ret;
+}
+
+static int __dmem_test_alloc(int order, int nid, nodemask_t *nodemask,
+			     const char *caller)
+{
+	struct dmem_mem_node *pos;
+	phys_addr_t addr;
+	int i, ret = 0;
+
+	for (i = 0; i < (1 << order); i++) {
+		addr = dmem_alloc_pages_nodemask(nid, nodemask, 1, NULL);
+		if (!addr) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		pos = __va(addr);
+		list_add(&pos->node, &dmem_list);
+	}
+
+	pr_info("%s: alloc order %d on node %d has fallback node %s... %s.\n",
+		caller, order, nid, nodemask ? "yes" : "no",
+		!ret ? "okay" : "failed");
+
+	return ret;
+}
+
+static void dmem_test_free_all(void)
+{
+	struct dmem_mem_node *pos, *n;
+
+	list_for_each_entry_safe(pos, n, &dmem_list, node) {
+		list_del(&pos->node);
+		dmem_free_page(__pa(pos));
+	}
+}
+
+#define dmem_test_alloc(order, nid, nodemask)	\
+	__dmem_test_alloc(order, nid, nodemask, __func__)
+
+/* dmem should have at least 2^6 native pages available */
+static int order_test(void)
+{
+	int order, i, ret;
+	int page_orders[] = {0, 1, 2, 3, 4, 5, 6};
+
+	ret = dmem_test_alloc_init(PAGE_SHIFT);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(page_orders); i++) {
+		order = page_orders[i];
+
+		ret = dmem_test_alloc(order, numa_node_id(), NULL);
+		if (ret)
+			break;
+	}
+
+	dmem_test_free_all();
+
+	dmem_alloc_uinit();
+
+	return ret;
+}
+
+static int node_test(void)
+{
+	nodemask_t nodemask;
+	unsigned long nr = 0;
+	int order;
+	int node;
+	int ret = 0;
+
+	order = 0;
+
+	ret = dmem_test_alloc_init(PUD_SHIFT);
+	if (ret)
+		return ret;
+
+	pr_info("%s: test allocation on node 0\n", __func__);
+	node = 0;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+
+	ret = dmem_test_alloc(order, node, &nodemask);
+	if (ret)
+		goto exit;
+
+	dmem_test_free_all();
+
+	pr_info("%s: begin to exhaust dmem on node 0.\n", __func__);
+	node = 1;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+
+	INIT_LIST_HEAD(&dmem_list);
+	while (!(ret = dmem_test_alloc(order, node, &nodemask)))
+		nr++;
+
+	pr_info("Allocation on node 0 success times: %lu\n", nr);
+
+	pr_info("%s: allocation on node 0 again\n", __func__);
+	node = 0;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+	ret = dmem_test_alloc(order, node, &nodemask);
+	if (!ret) {
+		pr_info("\tNot expected fallback\n");
+		ret = -1;
+	} else {
+		ret = 0;
+		pr_info("\tOK, Dmem on node 0 exhausted, fallback success\n");
+	}
+
+	pr_info("%s: Release dmem\n", __func__);
+	dmem_test_free_all();
+
+exit:
+	dmem_alloc_uinit();
+	return ret;
+}
+
+static __init int dmem_test_init(void)
+{
+	int ret;
+
+	pr_info("dmem: test init...\n");
+
+	ret = order_test();
+	if (ret)
+		return ret;
+
+	ret = node_test();
+
+
+	if (ret)
+		pr_info("dmem test fail, ret=%d\n", ret);
+	else
+		pr_info("dmem test success\n");
+	return ret;
+}
+
+static __exit void dmem_test_exit(void)
+{
+	pr_info("dmem: test exit...\n");
+}
+
+module_init(dmem_test_init);
+module_exit(dmem_test_exit);
+MODULE_LICENSE("GPL v2");
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [PATCH 35/35] Add documentation for dmemfs
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (33 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 34/35] dmem: add dmem unit tests yulei.kernel
@ 2020-10-08  7:54 ` yulei.kernel
  2020-10-09  1:26   ` Randy Dunlap
  2020-10-08 19:01 ` [PATCH 00/35] Enhance memory utilization with DMEMFS Joao Martins
  2020-10-12 11:57 ` Zengtao (B)
  36 siblings, 1 reply; 61+ messages in thread
From: yulei.kernel @ 2020-10-08  7:54 UTC (permalink / raw)
  To: akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce dmemfs.rst to document the basic usage of dmemfs.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 Documentation/filesystems/dmemfs.rst | 59 ++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
 create mode 100644 Documentation/filesystems/dmemfs.rst

diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst
new file mode 100644
index 000000000000..cbb4cc1ed31d
--- /dev/null
+++ b/Documentation/filesystems/dmemfs.rst
@@ -0,0 +1,57 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+The Direct Memory Filesystem - DMEMFS
+=====================================
+
+
+.. Table of contents
+
+   - Overview
+   - Compilation
+   - Usage
+
+Overview
+========
+
+Dmemfs (Direct Memory filesystem) is device memory or reserved
+memory based filesystem. This kind of memory is special as it
+is not managed by kernel and it is without 'struct page'. Therefore
+it can save extra memory from the host system for various usage,
+especially for guest virtual machines.
+
+It uses a kernel boot parameter ``dmem=`` to reserve the system
+memory when the host system boots up, the details can be checked
+in /Documentation/admin-guide/kernel-parameters.txt.
+
+Compilation
+===========
+
+The filesystem should be enabled by turning on the kernel configuration
+options::
+
+        CONFIG_DMEM_FS          - Direct Memory filesystem support
+        CONFIG_DMEM             - Allow reservation of memory for dmem
+
+
+Additionally, the following can be turned on to aid debugging::
+
+        CONFIG_DMEM_DEBUG_FS    - Enable debug information for dmem
+
+Usage
+========
+
+Dmemfs supports mapping ``4K``, ``2M`` and ``1G`` size of pages to
+the userspace, for example ::
+
+    # mount -t dmemfs none -o pagesize=4K /mnt/
+
+The it can create the backing storage with 4G size ::
+
+    # truncate /mnt/dmemfs-uuid --size 4G
+
+To use as backing storage for virtual machine starts with qemu, just need
+to specify the memory-backed-file in the qemu command line like this ::
+
+    # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \
+        share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (34 preceding siblings ...)
  2020-10-08  7:54 ` [PATCH 35/35] Add documentation for dmemfs yulei.kernel
@ 2020-10-08 19:01 ` Joao Martins
  2020-10-09 11:39   ` yulei zhang
  2020-10-12 11:57 ` Zengtao (B)
  36 siblings, 1 reply; 61+ messages in thread
From: Joao Martins @ 2020-10-08 19:01 UTC (permalink / raw)
  To: yulei.kernel
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, akpm, naoya.horiguchi, viro,
	pbonzini, Matthew Wilcox, Mike Kravetz, Jane Y Chu, Dan Williams,
	Muchun Song, Konrad Rzeszutek Wilk

[adding a couple folks that directly or indirectly work on the subject]

On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> In current system each physical memory page is assocaited with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> highly remarkable. So is it an expense that we could spare?
> 
Happy to see another person working to solve the same problem!

I am really glad to see more folks being interested in solving
this problem and I hope we can join efforts?

BTW, there is also a second benefit in removing struct page -
which is carving out memory from the direct map.

> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.
> 
This is like a walk down the memory lane.

About a year ago we followed the same exact idea/motivation to
have memory outside of the direct map (and removing struct page overhead)
and started with our own layer/thingie. However we realized that DAX
is one of the subsystems which already gives you direct access to memory
for free (and is already upstream), plus a couple of things which we
found more handy.

So we sent an RFC a couple months ago:

https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/

Since then majority of the work has been in improving DAX[1].
But now that is done I am going to follow up with the above patchset.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/

(Give me a couple of days and I will send you the link to the latest
patches on a git-tree - would love feedback!)

The struct page removal for DAX would then be small, and ticks the
same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
support) that we both do, with a smaller diffstat and it doesn't
touch KVM (not at least fundamentally).

	15 files changed, 401 insertions(+), 38 deletions(-)

The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
like we both do. Furthermore there wouldn't be a need for a new vm type,
consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.

[1]
https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/


> We uses a kernel boot parameter 'dmem=' to reserve the system
> memory when the host system boots up, the details can be checked
> in /Documentation/admin-guide/kernel-parameters.txt. 
> 
> Theoretically for each 4k physical page it can save 64 bytes if
> we drop the 'struct page', so for guest memory with 320G it can
> save about 5G physical memory totally. 
> 
Also worth mentioning that if you only care about the 'struct page' cost, and not about
the security boundary, there's also some work on hugetlbfs preallocation of hugepages
that tricks vmemmap into reusing tail pages.

  https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/

Going forward that could also make sense for device-dax to avoid so many
struct pages allocated (which would require its transition to compound
struct pages like hugetlbfs which we are looking at too). In addition an
idea <handwaving> would perhaps be to have a stricter mode in DAX where
we initialize/use the metadata ('struct page') but remove the underlying
PFNs (of the 'struct page') from the direct map, bearing the cost of
mapping/unmapping on gup/pup.

	Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/35] dmem: show some statistic in debugfs
  2020-10-08  7:53 ` [PATCH 08/35] dmem: show some statistic in debugfs yulei.kernel
@ 2020-10-08 20:23   ` Randy Dunlap
  2020-10-09 11:49     ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Randy Dunlap @ 2020-10-08 20:23 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

On 10/8/20 12:53 AM, yulei.kernel@gmail.com wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e1995da11cea..8a67c8933a42 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -235,6 +235,15 @@ config DMEM
>  	  Allow reservation of memory which could be dedicated usage of dmem.
>  	  It's the basics of dmemfs.
>  
> +config DMEM_DEBUG_FS
> +	bool "Enable debug information for direct memory"
> +	depends on DMEM && DEBUG_FS
> +	def_bool n

Drop the def_bool line. 'n' is the default anyway and the symbol is
already of type bool from 2 lines above.

> +	help
> +	  This option enables showing various statistics of direct memory
> +	  in debugfs filesystem.
> +
> +#


-- 
~Randy


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/35] mm: support direct memory reservation
  2020-10-08  7:53 ` [PATCH 02/35] mm: support direct memory reservation yulei.kernel
@ 2020-10-08 20:27   ` Randy Dunlap
  2020-10-08 20:34   ` Randy Dunlap
  1 sibling, 0 replies; 61+ messages in thread
From: Randy Dunlap @ 2020-10-08 20:27 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

On 10/8/20 12:53 AM, yulei.kernel@gmail.com wrote:
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 6c974888f86f..e1995da11cea 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -226,6 +226,15 @@ config BALLOON_COMPACTION
>  	  scenario aforementioned and helps improving memory defragmentation.
>  
>  #
> +# support for direct memory basics
> +config DMEM
> +	bool "Direct Memory Reservation"
> +	def_bool n

Drop the def_bool line.

> +	depends on SPARSEMEM
> +	help
> +	  Allow reservation of memory which could be dedicated usage of dmem.

	                                             dedicated to the use of dmem.
or
	                              which could be for the dedicated use of dmem.

> +	  It's the basics of dmemfs.

	           basis


-- 
~Randy


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 02/35] mm: support direct memory reservation
  2020-10-08  7:53 ` [PATCH 02/35] mm: support direct memory reservation yulei.kernel
  2020-10-08 20:27   ` Randy Dunlap
@ 2020-10-08 20:34   ` Randy Dunlap
  1 sibling, 0 replies; 61+ messages in thread
From: Randy Dunlap @ 2020-10-08 20:34 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

On 10/8/20 12:53 AM, yulei.kernel@gmail.com wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1068742a6df..da15d4fc49db 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -980,6 +980,44 @@
>  			The filter can be disabled or changed to another
>  			driver later using sysfs.
>  
> +	dmem=[!]size[KMG]
> +			[KNL, NUMA] When CONFIG_DMEM is set, this means
> +			the size of memory reserved for dmemfs on each numa

			                                               NUMA

> +			memory node and 'size' must be aligned to the default
> +			alignment that is the size of memory section which is
> +			128M on default on x86_64. If set '!', such amount of

			     by default

> +			memory on each node will be owned by kernel and dmemfs
> +			own the rest of memory on each node.

			owns

> +			Example: Reserve 4G memory on each node for dmemfs
> +				dmem = 4G

IIRC, you don't want spaces in this example.
Or did you check? Does the kernel's command line parser accept & ignore spaces like these?


> +
> +	dmem=[!]size[KMG]:align[KMG]
> +			[KNL, NUMA] Ditto. 'align' should be power of two and
> +			it's not smaller than the default alignment. Also

	drop "it's"

> +			'size' must be aligned to 'align'.
> +			Example: Bad dmem parameter because 'size' misaligned
> +				dmem=0x40200000:1G
> +
> +	dmem=size[KMG]@addr[KMG]
> +			[KNL] When CONFIG_DMEM is set, this marks specific
> +			memory as reserved for dmemfs. Region of memory will be
> +			used by dmemfs, from addr to addr + size. Reserving a
> +			certain memory region for kernel is illegal so '!' is
> +			forbidden. Should not assign 'addr' to 0 because kernel
> +			will occupy fixed memory region begin at 0 address.

			                                beginning

> +			Ditto, 'size' and 'addr' must be aligned to default
> +			alignment.
> +			Example: Exclude memory from 5G-6G for dmemfs.
> +				dmem=1G@5G
> +
> +	dmem=size[KMG]@addr[KMG]:align[KMG]
> +			[KNL] Ditto. 'align' should be power of two and it's

		Drop "it's"

> +			not smaller than the default alignment. Also 'size'
> +			and 'addr' must be aligned to 'align'. Specially,
> +			'@addr' and ':align' could occur in any order.
> +			Example: Exclude memory from 5G-6G for dmemfs.
> +				dmem=1G:1G@5G
> +
>  	driver_async_probe=  [KNL]
>  			List of driver names to be probed asynchronously.
>  			Format: <driver_name1>,<driver_name2>...


-- 
~Randy
Reported-by: Randy Dunlap <rdunlap@infradead.org>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page
  2020-10-08  7:54 ` [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
@ 2020-10-09  0:58   ` Sean Christopherson
  2020-10-09 10:28     ` Joao Martins
  0 siblings, 1 reply; 61+ messages in thread
From: Sean Christopherson @ 2020-10-09  0:58 UTC (permalink / raw)
  To: yulei.kernel
  Cc: akpm, naoya.horiguchi, viro, pbonzini, linux-fsdevel, kvm,
	linux-kernel, xiaoguangrong.eric, kernellwp, lihaiwei.kernel,
	Yulei Zhang, Chen Zhuo

On Thu, Oct 08, 2020 at 03:54:12PM +0800, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> Dmem page is pfn invalid but not mmio. Support cacheable
> dmem page for kvm.
> 
> Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 5 +++--
>  include/linux/dmem.h   | 7 +++++++
>  mm/dmem.c              | 7 +++++++
>  3 files changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 71aa3da2a0b7..0115c1767063 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -41,6 +41,7 @@
>  #include <linux/hash.h>
>  #include <linux/kern_levels.h>
>  #include <linux/kthread.h>
> +#include <linux/dmem.h>
>  
>  #include <asm/page.h>
>  #include <asm/memtype.h>
> @@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>  			 */
>  			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
>  
> -	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
> +	return (!e820__mapped_raw_any(pfn_to_hpa(pfn),
>  				     pfn_to_hpa(pfn + 1) - 1,
> -				     E820_TYPE_RAM);
> +				     E820_TYPE_RAM)) || (!is_dmem_pfn(pfn));

This is wrong.  As is, the logic reads "A PFN is MMIO if it is INVALID &&
(!RAM || !DMEM)".  The obvious fix would be to change it to "INVALID &&
!RAM && !DMEM", but that begs the question of whether or DMEM is reported
as RAM.  I don't see any e820 related changes in the series, i.e. no evidence
that dmem yanks its memory out of the e820 tables, which makes me think this
change is unnecessary.
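
Spelled out against the hunk above, the "INVALID && !RAM && !DMEM" shape Sean
describes would look roughly like this (illustrative only, with a hypothetical
helper name, and assuming the caller already returned for pfn_valid() pfns as
kvm_is_mmio_pfn() does today):

    static bool invalid_pfn_is_mmio(kvm_pfn_t pfn)
    {
            return !e820__mapped_raw_any(pfn_to_hpa(pfn),
                                         pfn_to_hpa(pfn + 1) - 1,
                                         E820_TYPE_RAM) &&
                   !is_dmem_pfn(pfn);
    }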

>  }
>  
>  /* Bits which may be returned by set_spte() */

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 35/35] Add documentation for dmemfs
  2020-10-08  7:54 ` [PATCH 35/35] Add documentation for dmemfs yulei.kernel
@ 2020-10-09  1:26   ` Randy Dunlap
  0 siblings, 0 replies; 61+ messages in thread
From: Randy Dunlap @ 2020-10-09  1:26 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang

On 10/8/20 12:54 AM, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> Introduce dmemfs.rst to document the basic usage of dmemfs.
> 

Please add dmemfs as an entry in Documentation/filesystems/index.rst also.

> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
> ---
>  Documentation/filesystems/dmemfs.rst | 59 ++++++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
>  create mode 100644 Documentation/filesystems/dmemfs.rst
> 
> diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst
> new file mode 100644
> index 000000000000..cbb4cc1ed31d
> --- /dev/null
> +++ b/Documentation/filesystems/dmemfs.rst
> @@ -0,0 +1,57 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +The Direct Memory Filesystem - DMEMFS
> +=====================================
> +
> +
> +.. Table of contents
> +
> +   - Overview
> +   - Compilation
> +   - Usage
> +
> +Overview
> +========
> +
> +Dmemfs (Direct Memory filesystem) is device memory or reserved
> +memory based filesystem. This kind of memory is special as it
> +is not managed by kernel and it is without 'struct page'. Therefore
> +it can save extra memory from the host system for various usage,

                                                             usages,

> +especially for guest virtual machines.
> +
> +It uses a kernel boot parameter ``dmem=`` to reserve the system
> +memory when the host system boots up, the details can be checked

                                     up. The details

> +in /Documentation/admin-guide/kernel-parameters.txt.
> +
> +Compilation
> +===========
> +
> +The filesystem should be enabled by turning on the kernel configuration
> +options::
> +
> +        CONFIG_DMEM_FS          - Direct Memory filesystem support
> +        CONFIG_DMEM             - Allow reservation of memory for dmem

Hm, is there a good reason for having both of these options?
Is one of them usable without the other one?
If not, there should only be one Kconfig option for DMEMFS.

> +
> +
> +Additionally, the following can be turned on to aid debugging::
> +
> +        CONFIG_DMEM_DEBUG_FS    - Enable debug information for dmem
> +
> +Usage
> +========
> +
> +Dmemfs supports mapping ``4K``, ``2M`` and ``1G`` size of pages to
> +the userspace, for example ::
> +
> +    # mount -t dmemfs none -o pagesize=4K /mnt/
> +
> +The it can create the backing storage with 4G size ::

   Then

> +
> +    # truncate /mnt/dmemfs-uuid --size 4G
> +
> +To use as backing storage for virtual machine starts with qemu, just need
> +to specify the memory-backed-file in the qemu command line like this ::
> +
> +    # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \
> +        share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
> 


-- 
~Randy


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page
  2020-10-09  0:58   ` Sean Christopherson
@ 2020-10-09 10:28     ` Joao Martins
  2020-10-09 11:42       ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Joao Martins @ 2020-10-09 10:28 UTC (permalink / raw)
  To: Sean Christopherson, yulei.kernel
  Cc: akpm, naoya.horiguchi, viro, pbonzini, linux-fsdevel, kvm,
	linux-kernel, xiaoguangrong.eric, kernellwp, lihaiwei.kernel,
	Yulei Zhang, Chen Zhuo

On 10/9/20 1:58 AM, Sean Christopherson wrote:
> On Thu, Oct 08, 2020 at 03:54:12PM +0800, yulei.kernel@gmail.com wrote:
>> From: Yulei Zhang <yuleixzhang@tencent.com>
>>
>> Dmem page is pfn invalid but not mmio. Support cacheable
>> dmem page for kvm.
>>
>> Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
>> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
>> ---
>>  arch/x86/kvm/mmu/mmu.c | 5 +++--
>>  include/linux/dmem.h   | 7 +++++++
>>  mm/dmem.c              | 7 +++++++
>>  3 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 71aa3da2a0b7..0115c1767063 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -41,6 +41,7 @@
>>  #include <linux/hash.h>
>>  #include <linux/kern_levels.h>
>>  #include <linux/kthread.h>
>> +#include <linux/dmem.h>
>>  
>>  #include <asm/page.h>
>>  #include <asm/memtype.h>
>> @@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
>>  			 */
>>  			(!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
>>  
>> -	return !e820__mapped_raw_any(pfn_to_hpa(pfn),
>> +	return (!e820__mapped_raw_any(pfn_to_hpa(pfn),
>>  				     pfn_to_hpa(pfn + 1) - 1,
>> -				     E820_TYPE_RAM);
>> +				     E820_TYPE_RAM)) || (!is_dmem_pfn(pfn));
> 
> This is wrong.  As is, the logic reads "A PFN is MMIO if it is INVALID &&
> (!RAM || !DMEM)".  The obvious fix would be to change it to "INVALID &&
> !RAM && !DMEM", but that begs the question of whether or DMEM is reported
> as RAM.  I don't see any e820 related changes in the series, i.e. no evidence
> that dmem yanks its memory out of the e820 tables, which makes me think this
> change is unnecessary.
> 
Even if there were e820 changes, e820__mapped_raw_any() checks against the
hardware-provided e820 that we are given before any changes happen, i.e. not the one the
kernel has changed (e820_table_firmware). So unless you're having that memory carved from an MMIO
range (which would be wrong), or the BIOS is misrepresenting its memory map... the
e820__mapped_raw_any(E820_TYPE_RAM) ought to be enough to cover RAM.

Or at least that has been my experience with similar work.

	Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-08 19:01 ` [PATCH 00/35] Enhance memory utilization with DMEMFS Joao Martins
@ 2020-10-09 11:39   ` yulei zhang
  2020-10-09 11:53     ` Joao Martins
  0 siblings, 1 reply; 61+ messages in thread
From: yulei zhang @ 2020-10-09 11:39 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-fsdevel, kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li,
	Yulei Zhang, akpm, naoya.horiguchi, viro, Paolo Bonzini,
	Matthew Wilcox, Mike Kravetz, Jane Y Chu, Dan Williams,
	Muchun Song, Konrad Rzeszutek Wilk

Joao, thanks a lot for the feedback. One more thing that needs to be
mentioned is that dmemfs also supports fine-grained memory management,
which makes it more flexible for tenants with different requirements.

On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> [adding a couple folks that directly or indirectly work on the subject]
>
> On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
> > From: Yulei Zhang <yuleixzhang@tencent.com>
> >
> > In current system each physical memory page is assocaited with
> > a page structure which is used to track the usage of this page.
> > But due to the memory usage rapidly growing in cloud environment,
> > we find the resource consuming for page structure storage becomes
> > highly remarkable. So is it an expense that we could spare?
> >
> Happy to see another person working to solve the same problem!
>
> I am really glad to see more folks being interested in solving
> this problem and I hope we can join efforts?
>
> BTW, there is also a second benefit in removing struct page -
> which is carving out memory from the direct map.
>
> > This patchset introduces an idea about how to save the extra
> > memory through a new virtual filesystem -- dmemfs.
> >
> > Dmemfs (Direct Memory filesystem) is device memory or reserved
> > memory based filesystem. This kind of memory is special as it
> > is not managed by kernel and most important it is without 'struct page'.
> > Therefore we can leverage the extra memory from the host system
> > to support more tenants in our cloud service.
> >
> This is like a walk down the memory lane.
>
> About a year ago we followed the same exact idea/motivation to
> have memory outside of the direct map (and removing struct page overhead)
> and started with our own layer/thingie. However we realized that DAX
> is one the subsystems which already gives you direct access to memory
> for free (and is already upstream), plus a couple of things which we
> found more handy.
>
> So we sent an RFC a couple months ago:
>
> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/
>
> Since then majority of the work has been in improving DAX[1].
> But now that is done I am going to follow up with the above patchset.
>
> [1]
> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> (Give me a couple of days and I will send you the link to the latest
> patches on a git-tree - would love feedback!)
>
> The struct page removal for DAX would then be small, and ticks the
> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
> support) that we both do, with a smaller diffstat and it doesn't
> touch KVM (not at least fundamentally).
>
>         15 files changed, 401 insertions(+), 38 deletions(-)
>
> The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
> like we both do. Furthermore there wouldn't be a need for a new vm type,
> consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.
>
> [1]
> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>
>
> > We uses a kernel boot parameter 'dmem=' to reserve the system
> > memory when the host system boots up, the details can be checked
> > in /Documentation/admin-guide/kernel-parameters.txt.
> >
> > Theoretically for each 4k physical page it can save 64 bytes if
> > we drop the 'struct page', so for guest memory with 320G it can
> > save about 5G physical memory totally.
> >
> Also worth mentioning that if you only care about 'struct page' cost, and not on the
> security boundary, there's also some work on hugetlbfs preallocation of hugepages into
> tricking vmemmap in reusing tail pages.
>
>   https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/
>
> Going forward that could also make sense for device-dax to avoid so many
> struct pages allocated (which would require its transition to compound
> struct pages like hugetlbfs which we are looking at too). In addition an
> idea <handwaving> would be perhaps to have a stricter mode in DAX where
> we initialize/use the metadata ('struct page') but remove the underlaying
> PFNs (of the 'struct page') from the direct map having to bear the cost of
> mapping/unmapping on gup/pup.
>
>         Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page
  2020-10-09 10:28     ` Joao Martins
@ 2020-10-09 11:42       ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-10-09 11:42 UTC (permalink / raw)
  To: Joao Martins
  Cc: Sean Christopherson, akpm, naoya.horiguchi, viro, Paolo Bonzini,
	linux-fsdevel, kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li,
	Yulei Zhang, Chen Zhuo

Sean and Joao, thanks for the feedback. Probably we can drop this change.

On Fri, Oct 9, 2020 at 6:28 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 10/9/20 1:58 AM, Sean Christopherson wrote:
> > On Thu, Oct 08, 2020 at 03:54:12PM +0800, yulei.kernel@gmail.com wrote:
> >> From: Yulei Zhang <yuleixzhang@tencent.com>
> >>
> >> Dmem page is pfn invalid but not mmio. Support cacheable
> >> dmem page for kvm.
> >>
> >> Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
> >> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
> >> ---
> >>  arch/x86/kvm/mmu/mmu.c | 5 +++--
> >>  include/linux/dmem.h   | 7 +++++++
> >>  mm/dmem.c              | 7 +++++++
> >>  3 files changed, 17 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index 71aa3da2a0b7..0115c1767063 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -41,6 +41,7 @@
> >>  #include <linux/hash.h>
> >>  #include <linux/kern_levels.h>
> >>  #include <linux/kthread.h>
> >> +#include <linux/dmem.h>
> >>
> >>  #include <asm/page.h>
> >>  #include <asm/memtype.h>
> >> @@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn)
> >>                       */
> >>                      (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn));
> >>
> >> -    return !e820__mapped_raw_any(pfn_to_hpa(pfn),
> >> +    return (!e820__mapped_raw_any(pfn_to_hpa(pfn),
> >>                                   pfn_to_hpa(pfn + 1) - 1,
> >> -                                 E820_TYPE_RAM);
> >> +                                 E820_TYPE_RAM)) || (!is_dmem_pfn(pfn));
> >
> > This is wrong.  As is, the logic reads "A PFN is MMIO if it is INVALID &&
> > (!RAM || !DMEM)".  The obvious fix would be to change it to "INVALID &&
> > !RAM && !DMEM", but that begs the question of whether or DMEM is reported
> > as RAM.  I don't see any e820 related changes in the series, i.e. no evidence
> > that dmem yanks its memory out of the e820 tables, which makes me think this
> > change is unnecessary.
> >
> Even if there would exist e820 changes, e820__mapped_raw_any() checks against
> hardware-provided e820 that we are given before any changes happen i.e. not the one kernel
> has changed (e820_table_firmware). So unless you're having that memory carved from an MMIO
> range (which would be wrong), or the BIOS is misrepresenting its memory map... the
> e820__mapped_raw_any(E820_TYPE_RAM) ought to be enough to cover RAM.
>
> Or at least that has been my experience with similar work.
>
>         Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 08/35] dmem: show some statistic in debugfs
  2020-10-08 20:23   ` Randy Dunlap
@ 2020-10-09 11:49     ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-10-09 11:49 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: akpm, naoya.horiguchi, viro, Paolo Bonzini, linux-fsdevel, kvm,
	LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Xiao Guangrong

Thanks, Randy. I will follow the instructions to modify the patches.

On Fri, Oct 9, 2020 at 4:23 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> On 10/8/20 12:53 AM, yulei.kernel@gmail.com wrote:
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index e1995da11cea..8a67c8933a42 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -235,6 +235,15 @@ config DMEM
> >         Allow reservation of memory which could be dedicated usage of dmem.
> >         It's the basics of dmemfs.
> >
> > +config DMEM_DEBUG_FS
> > +     bool "Enable debug information for direct memory"
> > +     depends on DMEM && DEBUG_FS
> > +     def_bool n
>
> Drop the def_bool line. 'n' is the default anyway and the symbol is
> already of type bool from 2 lines above.
>
> > +     help
> > +       This option enables showing various statistics of direct memory
> > +       in debugfs filesystem.
> > +
> > +#
>
>
> --
> ~Randy
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-09 11:39   ` yulei zhang
@ 2020-10-09 11:53     ` Joao Martins
  2020-10-10  8:15       ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Joao Martins @ 2020-10-09 11:53 UTC (permalink / raw)
  To: yulei zhang
  Cc: linux-fsdevel, kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li,
	Yulei Zhang, akpm, naoya.horiguchi, viro, Paolo Bonzini,
	Matthew Wilcox, Mike Kravetz, Jane Y Chu, Dan Williams,
	Muchun Song, Konrad Rzeszutek Wilk

On 10/9/20 12:39 PM, yulei zhang wrote:
> Joao, thanks a lot for the feedback. One more thing needs to mention
> is that dmemfs also support fine-grained
> memory management which makes it more flexible for tenants with
> different requirements.
> 
So does DAX, once it allows partitioning a region (starting with 5.10). Meaning you have a
region which you dedicate to userspace. That region can then be partitioned into devices
which give you access to multiple (possibly discontinuous) extents at a given page
granularity (selectable when you create the device), accessed through mmap().
You can then give that device to a cgroup. Or you can return that memory back to the
kernel (should you run into an OOM situation), or you can recreate the same mappings
across reboot/kexec.
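
For readers unfamiliar with that flow, consuming such a device-dax partition
from user space is just an mmap() of the character device. A rough sketch
follows; the /dev/dax0.0 path and the 2M length are assumptions about one
particular setup, not something defined by either series:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            const size_t len = 2UL << 20;   /* must respect the device alignment */
            int fd = open("/dev/dax0.0", O_RDWR);
            char *p;

            if (fd < 0) {
                    perror("open");
                    return EXIT_FAILURE;
            }

            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return EXIT_FAILURE;
            }

            p[0] = 1;       /* backed by device/reserved memory, no page cache */

            munmap(p, len);
            close(fd);
            return 0;
    }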

I probably need to read your patches again, but can you expand on the 'dmemfs also
supports fine-grained memory management' part so I can understand what the gap you
mention is?

> On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>
>> [adding a couple folks that directly or indirectly work on the subject]
>>
>> On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
>>> From: Yulei Zhang <yuleixzhang@tencent.com>
>>>
>>> In current system each physical memory page is assocaited with
>>> a page structure which is used to track the usage of this page.
>>> But due to the memory usage rapidly growing in cloud environment,
>>> we find the resource consuming for page structure storage becomes
>>> highly remarkable. So is it an expense that we could spare?
>>>
>> Happy to see another person working to solve the same problem!
>>
>> I am really glad to see more folks being interested in solving
>> this problem and I hope we can join efforts?
>>
>> BTW, there is also a second benefit in removing struct page -
>> which is carving out memory from the direct map.
>>
>>> This patchset introduces an idea about how to save the extra
>>> memory through a new virtual filesystem -- dmemfs.
>>>
>>> Dmemfs (Direct Memory filesystem) is device memory or reserved
>>> memory based filesystem. This kind of memory is special as it
>>> is not managed by kernel and most important it is without 'struct page'.
>>> Therefore we can leverage the extra memory from the host system
>>> to support more tenants in our cloud service.
>>>
>> This is like a walk down the memory lane.
>>
>> About a year ago we followed the same exact idea/motivation to
>> have memory outside of the direct map (and removing struct page overhead)
>> and started with our own layer/thingie. However we realized that DAX
>> is one the subsystems which already gives you direct access to memory
>> for free (and is already upstream), plus a couple of things which we
>> found more handy.
>>
>> So we sent an RFC a couple months ago:
>>
>> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/
>>
>> Since then majority of the work has been in improving DAX[1].
>> But now that is done I am going to follow up with the above patchset.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> (Give me a couple of days and I will send you the link to the latest
>> patches on a git-tree - would love feedback!)
>>
>> The struct page removal for DAX would then be small, and ticks the
>> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
>> support) that we both do, with a smaller diffstat and it doesn't
>> touch KVM (not at least fundamentally).
>>
>>         15 files changed, 401 insertions(+), 38 deletions(-)
>>
>> The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
>> like we both do. Furthermore there wouldn't be a need for a new vm type,
>> consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.
>>
>> [1]
>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>>
>>> We uses a kernel boot parameter 'dmem=' to reserve the system
>>> memory when the host system boots up, the details can be checked
>>> in /Documentation/admin-guide/kernel-parameters.txt.
>>>
>>> Theoretically for each 4k physical page it can save 64 bytes if
>>> we drop the 'struct page', so for guest memory with 320G it can
>>> save about 5G physical memory totally.
>>>
>> Also worth mentioning that if you only care about 'struct page' cost, and not on the
>> security boundary, there's also some work on hugetlbfs preallocation of hugepages into
>> tricking vmemmap in reusing tail pages.
>>
>>   https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/
>>
>> Going forward that could also make sense for device-dax to avoid so many
>> struct pages allocated (which would require its transition to compound
>> struct pages like hugetlbfs which we are looking at too). In addition an
>> idea <handwaving> would be perhaps to have a stricter mode in DAX where
>> we initialize/use the metadata ('struct page') but remove the underlaying
>> PFNs (of the 'struct page') from the direct map having to bear the cost of
>> mapping/unmapping on gup/pup.
>>
>>         Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-09 11:53     ` Joao Martins
@ 2020-10-10  8:15       ` yulei zhang
  2020-10-12 10:59         ` Joao Martins
  0 siblings, 1 reply; 61+ messages in thread
From: yulei zhang @ 2020-10-10  8:15 UTC (permalink / raw)
  To: Joao Martins
  Cc: linux-fsdevel, kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li,
	Yulei Zhang, akpm, naoya.horiguchi, viro, Paolo Bonzini,
	Matthew Wilcox, Mike Kravetz, Jane Y Chu, Dan Williams,
	Muchun Song, Konrad Rzeszutek Wilk

On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 10/9/20 12:39 PM, yulei zhang wrote:
> > Joao, thanks a lot for the feedback. One more thing needs to mention
> > is that dmemfs also support fine-grained
> > memory management which makes it more flexible for tenants with
> > different requirements.
> >
> So as DAX when it allows to partition a region (starting 5.10). Meaning you have a region
> which you dedicated to userspace. That region can then be partitioning into devices which
> give you access to multiple (possibly discontinuous) extents with at a given page
> granularity (selectable when you create the device), accessed through mmap().
> You can then give that device to a cgroup. Or you can return that memory back to the
> kernel (should you run into OOM situation), or you recreate the same mappings across
> reboot/kexec.
>
> I probably need to read your patches again, but can you extend on the 'dmemfs also support
> fine-grained memory management' to understand what is the gap that you mention?
>

Sure. dmemfs uses a bitmap to track the memory usage of the reserved memory
region at a given page size granularity. And for each user the memory can be
non-contiguous as well.
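
To make the bitmap point concrete, the allocator essentially scans a per-region
bitmap of dpages; a simplified sketch of that idea is below (the struct and
helper names are illustrative, the real logic lives in mm/dmem.c):

    #include <linux/bitops.h>
    #include <linux/types.h>

    /* One bit per dpage in a reserved region; a set bit means "in use". */
    struct dregion_sketch {
            unsigned long *bitmap;
            unsigned long dpages;           /* number of dpages in the region */
            unsigned long dpage_start_pfn;  /* first dpage index of the region */
            unsigned int dpage_shift;       /* log2 of the dpage size */
    };

    static phys_addr_t sketch_alloc_dpage(struct dregion_sketch *r)
    {
            unsigned long pos = find_next_zero_bit(r->bitmap, r->dpages, 0);

            if (pos >= r->dpages)
                    return 0;               /* region exhausted */

            __set_bit(pos, r->bitmap);
            return (phys_addr_t)(r->dpage_start_pfn + pos) << r->dpage_shift;
    }

    static void sketch_free_dpage(struct dregion_sketch *r, phys_addr_t addr)
    {
            unsigned long pos = (addr >> r->dpage_shift) - r->dpage_start_pfn;

            __clear_bit(pos, r->bitmap);
    }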

> > On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
> >>
> >> [adding a couple folks that directly or indirectly work on the subject]
> >>
> >> On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
> >>> From: Yulei Zhang <yuleixzhang@tencent.com>
> >>>
> >>> In current system each physical memory page is assocaited with
> >>> a page structure which is used to track the usage of this page.
> >>> But due to the memory usage rapidly growing in cloud environment,
> >>> we find the resource consuming for page structure storage becomes
> >>> highly remarkable. So is it an expense that we could spare?
> >>>
> >> Happy to see another person working to solve the same problem!
> >>
> >> I am really glad to see more folks being interested in solving
> >> this problem and I hope we can join efforts?
> >>
> >> BTW, there is also a second benefit in removing struct page -
> >> which is carving out memory from the direct map.
> >>
> >>> This patchset introduces an idea about how to save the extra
> >>> memory through a new virtual filesystem -- dmemfs.
> >>>
> >>> Dmemfs (Direct Memory filesystem) is device memory or reserved
> >>> memory based filesystem. This kind of memory is special as it
> >>> is not managed by kernel and most important it is without 'struct page'.
> >>> Therefore we can leverage the extra memory from the host system
> >>> to support more tenants in our cloud service.
> >>>
> >> This is like a walk down the memory lane.
> >>
> >> About a year ago we followed the same exact idea/motivation to
> >> have memory outside of the direct map (and removing struct page overhead)
> >> and started with our own layer/thingie. However we realized that DAX
> >> is one the subsystems which already gives you direct access to memory
> >> for free (and is already upstream), plus a couple of things which we
> >> found more handy.
> >>
> >> So we sent an RFC a couple months ago:
> >>
> >> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/
> >>
> >> Since then majority of the work has been in improving DAX[1].
> >> But now that is done I am going to follow up with the above patchset.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
> >>
> >> (Give me a couple of days and I will send you the link to the latest
> >> patches on a git-tree - would love feedback!)
> >>
> >> The struct page removal for DAX would then be small, and ticks the
> >> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
> >> support) that we both do, with a smaller diffstat and it doesn't
> >> touch KVM (not at least fundamentally).
> >>
> >>         15 files changed, 401 insertions(+), 38 deletions(-)
> >>
> >> The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
> >> like we both do. Furthermore there wouldn't be a need for a new vm type,
> >> consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.
> >>
> >> [1]
> >> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
> >>
> >>
> >>> We uses a kernel boot parameter 'dmem=' to reserve the system
> >>> memory when the host system boots up, the details can be checked
> >>> in /Documentation/admin-guide/kernel-parameters.txt.
> >>>
> >>> Theoretically for each 4k physical page it can save 64 bytes if
> >>> we drop the 'struct page', so for guest memory with 320G it can
> >>> save about 5G physical memory totally.
> >>>
> >> Also worth mentioning that if you only care about 'struct page' cost, and not on the
> >> security boundary, there's also some work on hugetlbfs preallocation of hugepages into
> >> tricking vmemmap in reusing tail pages.
> >>
> >>   https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/
> >>
> >> Going forward that could also make sense for device-dax to avoid so many
> >> struct pages allocated (which would require its transition to compound
> >> struct pages like hugetlbfs which we are looking at too). In addition an
> >> idea <handwaving> would be perhaps to have a stricter mode in DAX where
> >> we initialize/use the metadata ('struct page') but remove the underlaying
> >> PFNs (of the 'struct page') from the direct map having to bear the cost of
> >> mapping/unmapping on gup/pup.
> >>
> >>         Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-10  8:15       ` yulei zhang
@ 2020-10-12 10:59         ` Joao Martins
  2020-10-14 22:25           ` Dan Williams
  0 siblings, 1 reply; 61+ messages in thread
From: Joao Martins @ 2020-10-12 10:59 UTC (permalink / raw)
  To: yulei zhang
  Cc: linux-fsdevel, kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li,
	Yulei Zhang, akpm, naoya.horiguchi, viro, Paolo Bonzini,
	Matthew Wilcox, Mike Kravetz, Jane Y Chu, Dan Williams,
	Muchun Song, Konrad Rzeszutek Wilk

On 10/10/20 9:15 AM, yulei zhang wrote:
> On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 10/9/20 12:39 PM, yulei zhang wrote:
>>> Joao, thanks a lot for the feedback. One more thing needs to mention
>>> is that dmemfs also support fine-grained
>>> memory management which makes it more flexible for tenants with
>>> different requirements.
>>>
>> So as DAX when it allows to partition a region (starting 5.10). Meaning you have a region
>> which you dedicated to userspace. That region can then be partitioning into devices which
>> give you access to multiple (possibly discontinuous) extents with at a given page
>> granularity (selectable when you create the device), accessed through mmap().
>> You can then give that device to a cgroup. Or you can return that memory back to the
>> kernel (should you run into OOM situation), or you recreate the same mappings across
>> reboot/kexec.
>>
>> I probably need to read your patches again, but can you extend on the 'dmemfs also support
>> fine-grained memory management' to understand what is the gap that you mention?
>>
> sure, dmemfs uses bitmap to track the memory usage in the reserved
> memory region in
> a given page size granularity. And for each user the memory can be
> discrete as well.
> 
That same functionality of tracking reserved-region usage across different users at any
page granularity is covered by the DAX series I mentioned below. The discontiguous part --
IIUC that is what you meant -- then comes down to using the DAX ABI/tools to create a
device file rather than a filesystem.

>>> On Fri, Oct 9, 2020 at 3:01 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>>>>
>>>> [adding a couple folks that directly or indirectly work on the subject]
>>>>
>>>> On 10/8/20 8:53 AM, yulei.kernel@gmail.com wrote:
>>>>> From: Yulei Zhang <yuleixzhang@tencent.com>
>>>>>
>>>>> In current system each physical memory page is assocaited with
>>>>> a page structure which is used to track the usage of this page.
>>>>> But due to the memory usage rapidly growing in cloud environment,
>>>>> we find the resource consuming for page structure storage becomes
>>>>> highly remarkable. So is it an expense that we could spare?
>>>>>
>>>> Happy to see another person working to solve the same problem!
>>>>
>>>> I am really glad to see more folks being interested in solving
>>>> this problem and I hope we can join efforts?
>>>>
>>>> BTW, there is also a second benefit in removing struct page -
>>>> which is carving out memory from the direct map.
>>>>
>>>>> This patchset introduces an idea about how to save the extra
>>>>> memory through a new virtual filesystem -- dmemfs.
>>>>>
>>>>> Dmemfs (Direct Memory filesystem) is device memory or reserved
>>>>> memory based filesystem. This kind of memory is special as it
>>>>> is not managed by kernel and most important it is without 'struct page'.
>>>>> Therefore we can leverage the extra memory from the host system
>>>>> to support more tenants in our cloud service.
>>>>>
>>>> This is like a walk down the memory lane.
>>>>
>>>> About a year ago we followed the same exact idea/motivation to
>>>> have memory outside of the direct map (and removing struct page overhead)
>>>> and started with our own layer/thingie. However we realized that DAX
>>>> is one the subsystems which already gives you direct access to memory
>>>> for free (and is already upstream), plus a couple of things which we
>>>> found more handy.
>>>>
>>>> So we sent an RFC a couple months ago:
>>>>
>>>> https://lore.kernel.org/linux-mm/20200110190313.17144-1-joao.m.martins@oracle.com/
>>>>
>>>> Since then majority of the work has been in improving DAX[1].
>>>> But now that is done I am going to follow up with the above patchset.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>
>>>> (Give me a couple of days and I will send you the link to the latest
>>>> patches on a git-tree - would love feedback!)
>>>>
>>>> The struct page removal for DAX would then be small, and ticks the
>>>> same bells and whistles (MCE handling, reserving PAT memtypes, ptrace
>>>> support) that we both do, with a smaller diffstat and it doesn't
>>>> touch KVM (not at least fundamentally).
>>>>
>>>>         15 files changed, 401 insertions(+), 38 deletions(-)
>>>>
>>>> The things needed in core-mm is for handling PMD/PUD PAGE_SPECIAL much
>>>> like we both do. Furthermore there wouldn't be a need for a new vm type,
>>>> consuming an extra page bit (in addition to PAGE_SPECIAL) or new filesystem.
>>>>
>>>> [1]
>>>> https://lore.kernel.org/linux-mm/159625229779.3040297.11363509688097221416.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>
>>>>
>>>>> We uses a kernel boot parameter 'dmem=' to reserve the system
>>>>> memory when the host system boots up, the details can be checked
>>>>> in /Documentation/admin-guide/kernel-parameters.txt.
>>>>>
>>>>> Theoretically for each 4k physical page it can save 64 bytes if
>>>>> we drop the 'struct page', so for guest memory with 320G it can
>>>>> save about 5G physical memory totally.
>>>>>
>>>> Also worth mentioning that if you only care about 'struct page' cost, and not on the
>>>> security boundary, there's also some work on hugetlbfs preallocation of hugepages into
>>>> tricking vmemmap in reusing tail pages.
>>>>
>>>>   https://lore.kernel.org/linux-mm/20200915125947.26204-1-songmuchun@bytedance.com/
>>>>
>>>> Going forward that could also make sense for device-dax to avoid so many
>>>> struct pages allocated (which would require its transition to compound
>>>> struct pages like hugetlbfs which we are looking at too). In addition an
>>>> idea <handwaving> would be perhaps to have a stricter mode in DAX where
>>>> we initialize/use the metadata ('struct page') but remove the underlaying
>>>> PFNs (of the 'struct page') from the direct map having to bear the cost of
>>>> mapping/unmapping on gup/pup.
>>>>
>>>>         Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (35 preceding siblings ...)
  2020-10-08 19:01 ` [PATCH 00/35] Enhance memory utilization with DMEMFS Joao Martins
@ 2020-10-12 11:57 ` Zengtao (B)
  2020-10-13  2:45   ` yulei zhang
  36 siblings, 1 reply; 61+ messages in thread
From: Zengtao (B) @ 2020-10-12 11:57 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro, pbonzini
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang


> -----Original Message-----
> From: yulei.kernel@gmail.com [mailto:yulei.kernel@gmail.com]
> Sent: Thursday, October 08, 2020 3:54 PM
> To: akpm@linux-foundation.org; naoya.horiguchi@nec.com;
> viro@zeniv.linux.org.uk; pbonzini@redhat.com
> Cc: linux-fsdevel@vger.kernel.org; kvm@vger.kernel.org;
> linux-kernel@vger.kernel.org; xiaoguangrong.eric@gmail.com;
> kernellwp@gmail.com; lihaiwei.kernel@gmail.com; Yulei Zhang
> Subject: [PATCH 00/35] Enhance memory utilization with DMEMFS
> 
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> In current system each physical memory page is assocaited with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> highly remarkable. So is it an expense that we could spare?
> 
> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.
> 
> We uses a kernel boot parameter 'dmem=' to reserve the system
> memory when the host system boots up, the details can be checked
> in /Documentation/admin-guide/kernel-parameters.txt.
> 
> Theoretically for each 4k physical page it can save 64 bytes if
> we drop the 'struct page', so for guest memory with 320G it can
> save about 5G physical memory totally.

Sounds interesting, but it seems your patches only support x86; have you
considered aarch64?

Regards
Zengtao 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-12 11:57 ` Zengtao (B)
@ 2020-10-13  2:45   ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-10-13  2:45 UTC (permalink / raw)
  To: Zengtao (B)
  Cc: akpm, naoya.horiguchi, viro, pbonzini, linux-fsdevel, kvm,
	linux-kernel, xiaoguangrong.eric, kernellwp, lihaiwei.kernel,
	Yulei Zhang

On Mon, Oct 12, 2020 at 7:57 PM Zengtao (B) <prime.zeng@hisilicon.com> wrote:
>
>
> > -----Original Message-----
> > From: yulei.kernel@gmail.com [mailto:yulei.kernel@gmail.com]
> > Sent: Thursday, October 08, 2020 3:54 PM
> > To: akpm@linux-foundation.org; naoya.horiguchi@nec.com;
> > viro@zeniv.linux.org.uk; pbonzini@redhat.com
> > Cc: linux-fsdevel@vger.kernel.org; kvm@vger.kernel.org;
> > linux-kernel@vger.kernel.org; xiaoguangrong.eric@gmail.com;
> > kernellwp@gmail.com; lihaiwei.kernel@gmail.com; Yulei Zhang
> > Subject: [PATCH 00/35] Enhance memory utilization with DMEMFS
> >
> > From: Yulei Zhang <yuleixzhang@tencent.com>
> >
> > In current system each physical memory page is assocaited with
> > a page structure which is used to track the usage of this page.
> > But due to the memory usage rapidly growing in cloud environment,
> > we find the resource consuming for page structure storage becomes
> > highly remarkable. So is it an expense that we could spare?
> >
> > This patchset introduces an idea about how to save the extra
> > memory through a new virtual filesystem -- dmemfs.
> >
> > Dmemfs (Direct Memory filesystem) is device memory or reserved
> > memory based filesystem. This kind of memory is special as it
> > is not managed by kernel and most important it is without 'struct page'.
> > Therefore we can leverage the extra memory from the host system
> > to support more tenants in our cloud service.
> >
> > We uses a kernel boot parameter 'dmem=' to reserve the system
> > memory when the host system boots up, the details can be checked
> > in /Documentation/admin-guide/kernel-parameters.txt.
> >
> > Theoretically for each 4k physical page it can save 64 bytes if
> > we drop the 'struct page', so for guest memory with 320G it can
> > save about 5G physical memory totally.
>
> Sounds interesting, but seems your patch only support x86, have you
>  considered aarch64?
>
> Regards
> Zengtao

Thanks. So far we have only verified it on x86 servers; we may extend it
to the ARM platform in the future.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/35] dmem: let pat recognize dmem
  2020-10-08  7:53 ` [PATCH 04/35] dmem: let pat recognize dmem yulei.kernel
@ 2020-10-13  7:27   ` Paolo Bonzini
  2020-10-13  9:53     ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Paolo Bonzini @ 2020-10-13  7:27 UTC (permalink / raw)
  To: yulei.kernel, akpm, naoya.horiguchi, viro
  Cc: linux-fsdevel, kvm, linux-kernel, xiaoguangrong.eric, kernellwp,
	lihaiwei.kernel, Yulei Zhang, Xiao Guangrong

On 08/10/20 09:53, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> x86 pat uses 'struct page' by only checking if it's system ram,
> however it is not true if dmem is used, let's teach pat to
> recognize this case if it is ram but it is !pfn_valid()
> 
> We always use WB for dmem and any attempt to change this
> behavior will be rejected and WARN_ON is triggered
> 
> Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>

Hooks like these will make it very hard to merge this series.

I like the idea of struct page-backed memory, but this is a lot of code
and I wonder if it's worth adding all these complications.

One can already use mem= to remove the "struct page" cost for most of
the host memory, and manage the allocation of the remaining memory in
userspace with /dev/mem.  What is the advantage of doing this in the kernel?
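
(For concreteness, the userspace side of that would be roughly the sketch
below; the physical offset is only a placeholder for wherever the
mem=-reserved range starts, and access to /dev/mem is of course subject
to CONFIG_STRICT_DEVMEM and kernel lockdown.)

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	off_t phys = 0x100000000ULL;	/* placeholder: start of reserved RAM */
	size_t len = 2UL << 20;		/* map 2MB of it */
	int fd = open("/dev/mem", O_RDWR);	/* no O_SYNC, keep it cacheable */
	void *p;

	if (fd < 0) {
		perror("open /dev/mem");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
	if (p == MAP_FAILED) {
		perror("mmap");
		close(fd);
		return 1;
	}
	/* hand [p, p + len) to the VMM as guest RAM, etc. */
	munmap(p, len);
	close(fd);
	return 0;
}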

Paolo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 04/35] dmem: let pat recognize dmem
  2020-10-13  7:27   ` Paolo Bonzini
@ 2020-10-13  9:53     ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-10-13  9:53 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: akpm, naoya.horiguchi, viro, linux-fsdevel, kvm, LKML,
	Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Xiao Guangrong

On Tue, Oct 13, 2020 at 3:27 PM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 08/10/20 09:53, yulei.kernel@gmail.com wrote:
> > From: Yulei Zhang <yuleixzhang@tencent.com>
> >
> > x86 pat uses 'struct page' by only checking if it's system ram,
> > however it is not true if dmem is used, let's teach pat to
> > recognize this case if it is ram but it is !pfn_valid()
> >
> > We always use WB for dmem and any attempt to change this
> > behavior will be rejected and WARN_ON is triggered
> >
> > Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
> > Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
>
> Hooks like these will make it very hard to merge this series.
>
> I like the idea of struct page-backed memory, but this is a lot of code
> and I wonder if it's worth adding all these complications.
>
> One can already use mem= to remove the "struct page" cost for most of
> the host memory, and manage the allocation of the remaining memory in
> userspace with /dev/mem.  What is the advantage of doing this in the kernel?
>
> Paolo
>

Hi Paolo, as far as I know there are a few limitations to working with
/dev/mem in this case:
1. Access to /dev/mem is restricted for security reasons, but our
virtual machines usually run as unprivileged processes.
2. What we get from /dev/mem is one whole block of memory; dynamic VMs
running on /dev/mem will cause memory fragmentation, so extra logic is
needed to manage allocation and reclaim to avoid wasting memory. dmemfs
supports this and also leverages the kernel's TLB management.
3. It needs to support hugepages at different page-size granularities.
4. MCE recovery capability is also required.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-12 10:59         ` Joao Martins
@ 2020-10-14 22:25           ` Dan Williams
  2020-10-19 13:37             ` Paolo Bonzini
  0 siblings, 1 reply; 61+ messages in thread
From: Dan Williams @ 2020-10-14 22:25 UTC (permalink / raw)
  To: Joao Martins
  Cc: yulei zhang, linux-fsdevel, kvm, LKML, Xiao Guangrong,
	Wanpeng Li, Haiwei Li, Yulei Zhang, Andrew Morton,
	Naoya Horiguchi, Al Viro, Paolo Bonzini, Matthew Wilcox,
	Mike Kravetz, Jane Y Chu, Muchun Song, Konrad Rzeszutek Wilk

On Mon, Oct 12, 2020 at 4:00 AM Joao Martins <joao.m.martins@oracle.com> wrote:
[..]
> On 10/10/20 9:15 AM, yulei zhang wrote:
> > On Fri, Oct 9, 2020 at 7:53 PM Joao Martins <joao.m.martins@oracle.com> wrote:
> >> On 10/9/20 12:39 PM, yulei zhang wrote:
> >>> Joao, thanks a lot for the feedback. One more thing needs to mention
> >>> is that dmemfs also support fine-grained
> >>> memory management which makes it more flexible for tenants with
> >>> different requirements.
> >>>
> >> So as DAX when it allows to partition a region (starting 5.10). Meaning you have a region
> >> which you dedicated to userspace. That region can then be partitioning into devices which
> >> give you access to multiple (possibly discontinuous) extents with at a given page
> >> granularity (selectable when you create the device), accessed through mmap().
> >> You can then give that device to a cgroup. Or you can return that memory back to the
> >> kernel (should you run into OOM situation), or you recreate the same mappings across
> >> reboot/kexec.
> >>
> >> I probably need to read your patches again, but can you extend on the 'dmemfs also support
> >> fine-grained memory management' to understand what is the gap that you mention?
> >>
> > sure, dmemfs uses bitmap to track the memory usage in the reserved
> > memory region in
> > a given page size granularity. And for each user the memory can be
> > discrete as well.
> >
> That same functionality of tracking reserved region usage across different users at any
> page granularity is covered the DAX series I mentioned below. The discrete part -- IIUC
> what you meant -- is then reduced using DAX ABI/tools to create a device file vs a filesystem.

Put another way. Linux already has a fine grained memory management
system, the page allocator. Now, with recent device-dax extensions, it
also has a coarse grained memory management system for  physical
address-space partitioning and a path for struct-page-less backing for
VMs. What feature gaps remain vs dmemfs, and can those gaps be closed
with incremental improvements to the 2 existing memory-management
systems?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-14 22:25           ` Dan Williams
@ 2020-10-19 13:37             ` Paolo Bonzini
  2020-10-19 19:03               ` Joao Martins
  0 siblings, 1 reply; 61+ messages in thread
From: Paolo Bonzini @ 2020-10-19 13:37 UTC (permalink / raw)
  To: Dan Williams, Joao Martins
  Cc: yulei zhang, linux-fsdevel, kvm, LKML, Xiao Guangrong,
	Wanpeng Li, Haiwei Li, Yulei Zhang, Andrew Morton,
	Naoya Horiguchi, Al Viro, Matthew Wilcox, Mike Kravetz,
	Jane Y Chu, Muchun Song, Konrad Rzeszutek Wilk

On 15/10/20 00:25, Dan Williams wrote:
> Now, with recent device-dax extensions, it
> also has a coarse grained memory management system for  physical
> address-space partitioning and a path for struct-page-less backing for
> VMs. What feature gaps remain vs dmemfs, and can those gaps be closed
> with incremental improvements to the 2 existing memory-management
> systems?

If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
would still create the "struct page" albeit lazily?  KVM then would use
the usual get_user_pages() path.

Looking more closely at the implementation of dmemfs, what I don't
understand is why dmemfs needs VM_DMEM etc. and cannot provide access to
mmap-ed memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.
If it did that, KVM would get physical addresses using fixup_user_fault
and would never need pfn_to_page() or get_user_pages().  I'm not saying
that would instantly be an approval, but it would remove a lot of hooks.
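
Roughly something like the following (only a sketch, with base_pfn as a
stand-in for the first PFN of the reserved region, not a proposal of the
actual code):

static int example_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;
	unsigned long pfn = base_pfn + vma->vm_pgoff;

	/* remap_pfn_range() marks the VMA VM_IO | VM_PFNMAP itself */
	return remap_pfn_range(vma, vma->vm_start, pfn, size,
			       vma->vm_page_prot);
}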

Paolo


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-19 13:37             ` Paolo Bonzini
@ 2020-10-19 19:03               ` Joao Martins
  2020-10-20 15:22                 ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Joao Martins @ 2020-10-19 19:03 UTC (permalink / raw)
  To: Paolo Bonzini, Dan Williams
  Cc: yulei zhang, linux-fsdevel, kvm, LKML, Xiao Guangrong,
	Wanpeng Li, Haiwei Li, Yulei Zhang, Andrew Morton,
	Naoya Horiguchi, Al Viro, Matthew Wilcox, Mike Kravetz,
	Jane Y Chu, Muchun Song, Konrad Rzeszutek Wilk

On 10/19/20 2:37 PM, Paolo Bonzini wrote:
> On 15/10/20 00:25, Dan Williams wrote:
>> Now, with recent device-dax extensions, it
>> also has a coarse grained memory management system for  physical
>> address-space partitioning and a path for struct-page-less backing for
>> VMs. What feature gaps remain vs dmemfs, and can those gaps be closed
>> with incremental improvements to the 2 existing memory-management
>> systems?
> 
> If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
> would still create the "struct page" albeit lazily?  KVM then would use
> the usual get_user_pages() path.
> 
Correct.

The removal of struct page would be one of the added incremental improvements, like a
'map' with 'raw' sysfs attribute for dynamic dax regions that wouldn't online/create the
struct pages. The remaining plumbing (...)

> Looking more closely at the implementation of dmemfs, I don't understand
> is why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
> memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
> did that KVM would get physical addresses using fixup_user_fault and
> never need pfn_to_page() or get_user_pages().  I'm not saying that would
> instantly be an approval, but it would make remove a lot of hooks.
> 

(...) is similar to what you describe above, albeit there's probably no need to do a
remap_pfn_range() at mmap() time, as DAX supplies fault/huge_fault handlers. Also, using
remap_pfn_range() that way means it's limited to a single contiguous PFN chunk.
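
For illustration only (names like lookup_pfn() are placeholders, not the
DAX or dmemfs code), the fault-time flavour looks roughly like:

static vm_fault_t example_fault(struct vm_fault *vmf)
{
	unsigned long pfn = lookup_pfn(vmf->vma->vm_file, vmf->pgoff);

	return vmf_insert_pfn(vmf->vma, vmf->address, pfn);
}

static vm_fault_t example_huge_fault(struct vm_fault *vmf,
				     enum page_entry_size pe_size)
{
	unsigned long pfn = lookup_pfn(vmf->vma->vm_file, vmf->pgoff);

	if (pe_size == PE_SIZE_PMD)
		return vmf_insert_pfn_pmd(vmf, __pfn_to_pfn_t(pfn, PFN_DEV),
					  vmf->flags & FAULT_FLAG_WRITE);
	/* anything else falls back to PTE-sized mappings */
	return VM_FAULT_FALLBACK;
}

static const struct vm_operations_struct example_vm_ops = {
	.fault		= example_fault,
	.huge_fault	= example_huge_fault,
};

so each fault inserts whatever PFN backs that file offset, instead of
pre-populating one contiguous chunk at mmap() time.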

KVM has the bits to make it work without struct pages; I don't think there's a need for
new pg/pfn_t/VM_* bits (aside from relying on {PFN,PAGE}_SPECIAL), as mentioned at the
start of the thread. I'm storing my WIP here:

	https://github.com/jpemartins/linux pageless-dax

Which is based on the first series that had been submitted earlier this year:

	https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.martins@oracle.com/

  Joao

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 00/35] Enhance memory utilization with DMEMFS
  2020-10-19 19:03               ` Joao Martins
@ 2020-10-20 15:22                 ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-10-20 15:22 UTC (permalink / raw)
  To: Joao Martins
  Cc: Paolo Bonzini, Dan Williams, linux-fsdevel, kvm, LKML,
	Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Andrew Morton, Naoya Horiguchi, Al Viro, Matthew Wilcox,
	Mike Kravetz, Jane Y Chu, Muchun Song, Konrad Rzeszutek Wilk

On Tue, Oct 20, 2020 at 3:03 AM Joao Martins <joao.m.martins@oracle.com> wrote:
>
> On 10/19/20 2:37 PM, Paolo Bonzini wrote:
> > On 15/10/20 00:25, Dan Williams wrote:
> >> Now, with recent device-dax extensions, it
> >> also has a coarse grained memory management system for  physical
> >> address-space partitioning and a path for struct-page-less backing for
> >> VMs. What feature gaps remain vs dmemfs, and can those gaps be closed
> >> with incremental improvements to the 2 existing memory-management
> >> systems?
> >
> > If I understand correctly, devm_memremap_pages() on ZONE_DEVICE memory
> > would still create the "struct page" albeit lazily?  KVM then would use
> > the usual get_user_pages() path.
> >
> Correct.
>
> The removal of struct page would be one of the added incremental improvements, like a
> 'map' with 'raw' sysfs attribute for dynamic dax regions that wouldn't online/create the
> struct pages. The remaining plumbing (...)
>
> > Looking more closely at the implementation of dmemfs, I don't understand
> > is why dmemfs needs VM_DMEM etc. and cannot provide access to mmap-ed
> > memory using remap_pfn_range and VM_PFNMAP, just like /dev/mem.  If it
> > did that KVM would get physical addresses using fixup_user_fault and
> > never need pfn_to_page() or get_user_pages().  I'm not saying that would
> > instantly be an approval, but it would make remove a lot of hooks.
> >
>
> (...) is similar to what you describe above. Albeit there's probably no need to do a
> remap_pfn_range at mmap(), as DAX supplies a fault/huge_fault. Also, using that means it's
> limited to a single contiguous PFN chunk.
>
> KVM has the bits to make it work without struct pages, I don't think there's a need for
> new pg/pfn_t/VM_* bits (aside from relying on {PFN,PAGE}_SPECIAL) as mentioned at the
> start of the thread. I'm storing my wip here:
>
>         https://github.com/jpemartins/linux pageless-dax
>
> Which is based on the first series that had been submitted earlier this year:
>
>         https://lore.kernel.org/kvm/20200110190313.17144-1-joao.m.martins@oracle.com/
>
>   Joao

Just as Joao mentioned, remap_pfn_range() maps a single contiguous PFN
range, which is not our intention. As for VM_DMEM, I think we may drop
it in the next version and use the existing bits as much as possible to
minimize the modifications.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/35] fs: introduce dmemfs module
  2020-10-08  7:53 ` [PATCH 01/35] fs: introduce dmemfs module yulei.kernel
@ 2020-11-10 20:04   ` Al Viro
  2020-11-11  8:53     ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2020-11-10 20:04 UTC (permalink / raw)
  To: yulei.kernel
  Cc: akpm, naoya.horiguchi, pbonzini, linux-fsdevel, kvm,
	linux-kernel, xiaoguangrong.eric, kernellwp, lihaiwei.kernel,
	Yulei Zhang, Xiao Guangrong

On Thu, Oct 08, 2020 at 03:53:51PM +0800, yulei.kernel@gmail.com wrote:

> +static struct inode *
> +dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode,
> +		 dev_t dev);

WTF is 'dev' for?

> +static int
> +dmemfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
> +{
> +	struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode, dev);
> +	int error = -ENOSPC;
> +
> +	if (inode) {
> +		d_instantiate(dentry, inode);
> +		dget(dentry);	/* Extra count - pin the dentry in core */
> +		error = 0;
> +		dir->i_mtime = dir->i_ctime = current_time(inode);
> +	}
> +	return error;
> +}

... same here, seeing that you only call that thing from the next two functions
and you do *not* provide ->mknod() as a method (unsurprisingly - what would
device nodes do there?)

> +static int dmemfs_create(struct inode *dir, struct dentry *dentry,
> +			 umode_t mode, bool excl)
> +{
> +	return dmemfs_mknod(dir, dentry, mode | S_IFREG, 0);
> +}
> +
> +static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
> +			umode_t mode)
> +{
> +	int retval = dmemfs_mknod(dir, dentry, mode | S_IFDIR, 0);
> +
> +	if (!retval)
> +		inc_nlink(dir);
> +	return retval;
> +}

> +int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	return 0;
> +}
> +
> +static const struct file_operations dmemfs_file_operations = {
> +	.mmap = dmemfs_file_mmap,
> +};

Er...  Is that a placeholder for later in the series?  Because as it is,
it makes no sense whatsoever - "it can be mmapped, but any access to the
mapped area will segfault".

> +struct inode *dmemfs_get_inode(struct super_block *sb,
> +			       const struct inode *dir, umode_t mode, dev_t dev)
> +{
> +	struct inode *inode = new_inode(sb);
> +
> +	if (inode) {
> +		inode->i_ino = get_next_ino();
> +		inode_init_owner(inode, dir, mode);
> +		inode->i_mapping->a_ops = &empty_aops;
> +		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> +		mapping_set_unevictable(inode->i_mapping);
> +		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> +		switch (mode & S_IFMT) {
> +		default:
> +			init_special_inode(inode, mode, dev);
> +			break;
> +		case S_IFREG:
> +			inode->i_op = &dmemfs_file_inode_operations;
> +			inode->i_fop = &dmemfs_file_operations;
> +			break;
> +		case S_IFDIR:
> +			inode->i_op = &dmemfs_dir_inode_operations;
> +			inode->i_fop = &simple_dir_operations;
> +
> +			/*
> +			 * directory inodes start off with i_nlink == 2
> +			 * (for "." entry)
> +			 */
> +			inc_nlink(inode);
> +			break;
> +		case S_IFLNK:
> +			inode->i_op = &page_symlink_inode_operations;
> +			break;

Where would symlinks come from?  Or anything other than regular files and
directories, for that matter...

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/35] fs: introduce dmemfs module
  2020-11-10 20:04   ` Al Viro
@ 2020-11-11  8:53     ` yulei zhang
  2020-11-11 23:09       ` Al Viro
  0 siblings, 1 reply; 61+ messages in thread
From: yulei zhang @ 2020-11-11  8:53 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Naoya Horiguchi, Paolo Bonzini, linux-fsdevel,
	kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Xiao Guangrong

On Wed, Nov 11, 2020 at 4:04 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Oct 08, 2020 at 03:53:51PM +0800, yulei.kernel@gmail.com wrote:
>
> > +static struct inode *
> > +dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode,
> > +              dev_t dev);
>
> WTF is 'dev' for?
>
> > +static int
> > +dmemfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev)
> > +{
> > +     struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode, dev);
> > +     int error = -ENOSPC;
> > +
> > +     if (inode) {
> > +             d_instantiate(dentry, inode);
> > +             dget(dentry);   /* Extra count - pin the dentry in core */
> > +             error = 0;
> > +             dir->i_mtime = dir->i_ctime = current_time(inode);
> > +     }
> > +     return error;
> > +}
>
> ... same here, seeing that you only call that thing from the next two functions
> and you do *not* provide ->mknod() as a method (unsurprisingly - what would
> device nodes do there?)
>

Thanks for pointing this out. We may need to support the mknod method;
otherwise the dev argument is redundant and needs to be removed.

> > +static int dmemfs_create(struct inode *dir, struct dentry *dentry,
> > +                      umode_t mode, bool excl)
> > +{
> > +     return dmemfs_mknod(dir, dentry, mode | S_IFREG, 0);
> > +}
> > +
> > +static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
> > +                     umode_t mode)
> > +{
> > +     int retval = dmemfs_mknod(dir, dentry, mode | S_IFDIR, 0);
> > +
> > +     if (!retval)
> > +             inc_nlink(dir);
> > +     return retval;
> > +}
>
> > +int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> > +{
> > +     return 0;
> > +}
> > +
> > +static const struct file_operations dmemfs_file_operations = {
> > +     .mmap = dmemfs_file_mmap,
> > +};
>
> Er...  Is that a placeholder for later in the series?  Because as it is,
> it makes no sense whatsoever - "it can be mmapped, but any access to the
> mapped area will segfault".
>

Yes, we split the full implementation of dmemfs_file_mmap into patch
05/35, which assigns the interfaces that handle the page faults.

> > +struct inode *dmemfs_get_inode(struct super_block *sb,
> > +                            const struct inode *dir, umode_t mode, dev_t dev)
> > +{
> > +     struct inode *inode = new_inode(sb);
> > +
> > +     if (inode) {
> > +             inode->i_ino = get_next_ino();
> > +             inode_init_owner(inode, dir, mode);
> > +             inode->i_mapping->a_ops = &empty_aops;
> > +             mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
> > +             mapping_set_unevictable(inode->i_mapping);
> > +             inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> > +             switch (mode & S_IFMT) {
> > +             default:
> > +                     init_special_inode(inode, mode, dev);
> > +                     break;
> > +             case S_IFREG:
> > +                     inode->i_op = &dmemfs_file_inode_operations;
> > +                     inode->i_fop = &dmemfs_file_operations;
> > +                     break;
> > +             case S_IFDIR:
> > +                     inode->i_op = &dmemfs_dir_inode_operations;
> > +                     inode->i_fop = &simple_dir_operations;
> > +
> > +                     /*
> > +                      * directory inodes start off with i_nlink == 2
> > +                      * (for "." entry)
> > +                      */
> > +                     inc_nlink(inode);
> > +                     break;
> > +             case S_IFLNK:
> > +                     inode->i_op = &page_symlink_inode_operations;
> > +                     break;
>
> Where would symlinks come from?  Or anything other than regular files and
> directories, for that matter...

You are right; so far it just supports regular files and directories.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/35] fs: introduce dmemfs module
  2020-11-11  8:53     ` yulei zhang
@ 2020-11-11 23:09       ` Al Viro
  2020-11-12 10:03         ` yulei zhang
  0 siblings, 1 reply; 61+ messages in thread
From: Al Viro @ 2020-11-11 23:09 UTC (permalink / raw)
  To: yulei zhang
  Cc: Andrew Morton, Naoya Horiguchi, Paolo Bonzini, linux-fsdevel,
	kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Xiao Guangrong

On Wed, Nov 11, 2020 at 04:53:00PM +0800, yulei zhang wrote:

> > ... same here, seeing that you only call that thing from the next two functions
> > and you do *not* provide ->mknod() as a method (unsurprisingly - what would
> > device nodes do there?)
> >
> 
> Thanks for pointing this out. we may need support the mknod method, otherwise
> the dev is redundant  and need to be removed.

I'd suggest turning that into (static) __create_file(....) with

static int dmemfs_create(struct inode *dir, struct dentry *dentry,
			 umode_t mode, bool excl)
{
	return __create_file(dir, dentry, mode | S_IFREG);
}

static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
			 umode_t mode)
{
	return __create_file(dir, dentry, mode | S_IFDIR);
}

(i.e. even inc_nlink() of parent folded into that).
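
Something like this, give or take (a sketch only; it also assumes
dmemfs_get_inode() loses the now-pointless dev_t argument):

static int __create_file(struct inode *dir, struct dentry *dentry,
			 umode_t mode)
{
	struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode);

	if (!inode)
		return -ENOSPC;

	d_instantiate(dentry, inode);
	dget(dentry);	/* extra count - pin the dentry in core */
	dir->i_mtime = dir->i_ctime = current_time(inode);
	if (S_ISDIR(mode))
		inc_nlink(dir);
	return 0;
}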

[snip]

> Yes, we seperate the full implementation for dmemfs_file_mmap into
> patch 05/35, it
> will assign the interfaces to handle the page fault.

It would be less confusing to move the introduction of ->mmap() to that patch,
then.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [PATCH 01/35] fs: introduce dmemfs module
  2020-11-11 23:09       ` Al Viro
@ 2020-11-12 10:03         ` yulei zhang
  0 siblings, 0 replies; 61+ messages in thread
From: yulei zhang @ 2020-11-12 10:03 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Naoya Horiguchi, Paolo Bonzini, linux-fsdevel,
	kvm, LKML, Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang,
	Xiao Guangrong

On Thu, Nov 12, 2020 at 7:09 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Wed, Nov 11, 2020 at 04:53:00PM +0800, yulei zhang wrote:
>
> > > ... same here, seeing that you only call that thing from the next two functions
> > > and you do *not* provide ->mknod() as a method (unsurprisingly - what would
> > > device nodes do there?)
> > >
> >
> > Thanks for pointing this out. we may need support the mknod method, otherwise
> > the dev is redundant  and need to be removed.
>
> I'd suggest turning that into (static) __create_file(....) with
>
> static int dmemfs_create(struct inode *dir, struct dentry *dentry,
>                          umode_t mode, bool excl)
> {
>         return __create_file(dir, dentry, mode | S_IFREG);
> }
>
> static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
>                          umode_t mode)
> {
>         return __create_file(dir, dentry, mode | S_IFDIR);
> }
>
> (i.e. even inc_nlink() of parent folded into that).
>
> [snip]
>
> > Yes, we seperate the full implementation for dmemfs_file_mmap into
> > patch 05/35, it
> > will assign the interfaces to handle the page fault.
>
> It would be less confusing to move the introduction of ->mmap() to that patch,
> then.

Thanks for the suggestion. We will refactor the patches accordingly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2020-11-12 10:03 UTC | newest]

Thread overview: 61+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-08  7:53 [PATCH 00/35] Enhance memory utilization with DMEMFS yulei.kernel
2020-10-08  7:53 ` [PATCH 01/35] fs: introduce dmemfs module yulei.kernel
2020-11-10 20:04   ` Al Viro
2020-11-11  8:53     ` yulei zhang
2020-11-11 23:09       ` Al Viro
2020-11-12 10:03         ` yulei zhang
2020-10-08  7:53 ` [PATCH 02/35] mm: support direct memory reservation yulei.kernel
2020-10-08 20:27   ` Randy Dunlap
2020-10-08 20:34   ` Randy Dunlap
2020-10-08  7:53 ` [PATCH 03/35] dmem: implement dmem memory management yulei.kernel
2020-10-08  7:53 ` [PATCH 04/35] dmem: let pat recognize dmem yulei.kernel
2020-10-13  7:27   ` Paolo Bonzini
2020-10-13  9:53     ` yulei zhang
2020-10-08  7:53 ` [PATCH 05/35] dmemfs: support mmap yulei.kernel
2020-10-08  7:53 ` [PATCH 06/35] dmemfs: support truncating inode down yulei.kernel
2020-10-08  7:53 ` [PATCH 07/35] dmem: trace core functions yulei.kernel
2020-10-08  7:53 ` [PATCH 08/35] dmem: show some statistic in debugfs yulei.kernel
2020-10-08 20:23   ` Randy Dunlap
2020-10-09 11:49     ` yulei zhang
2020-10-08  7:53 ` [PATCH 09/35] dmemfs: support remote access yulei.kernel
2020-10-08  7:54 ` [PATCH 10/35] dmemfs: introduce max_alloc_try_dpages parameter yulei.kernel
2020-10-08  7:54 ` [PATCH 11/35] mm: export mempolicy interfaces to serve dmem allocator yulei.kernel
2020-10-08  7:54 ` [PATCH 12/35] dmem: introduce mempolicy support yulei.kernel
2020-10-08  7:54 ` [PATCH 13/35] mm, dmem: introduce PFN_DMEM and pfn_t_dmem yulei.kernel
2020-10-08  7:54 ` [PATCH 14/35] mm, dmem: dmem-pmd vs thp-pmd yulei.kernel
2020-10-08  7:54 ` [PATCH 15/35] mm: add pmd_special() check for pmd_trans_huge_lock() yulei.kernel
2020-10-08  7:54 ` [PATCH 16/35] dmemfs: introduce ->split() to dmemfs_vm_ops yulei.kernel
2020-10-08  7:54 ` [PATCH 17/35] mm, dmemfs: support unmap_page_range() for dmemfs pmd yulei.kernel
2020-10-08  7:54 ` [PATCH 18/35] mm: follow_pmd_mask() for dmem huge pmd yulei.kernel
2020-10-08  7:54 ` [PATCH 19/35] mm: gup_huge_pmd() " yulei.kernel
2020-10-08  7:54 ` [PATCH 20/35] mm: support dmem huge pmd for vmf_insert_pfn_pmd() yulei.kernel
2020-10-08  7:54 ` [PATCH 21/35] mm: support dmem huge pmd for follow_pfn() yulei.kernel
2020-10-08  7:54 ` [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
2020-10-09  0:58   ` Sean Christopherson
2020-10-09 10:28     ` Joao Martins
2020-10-09 11:42       ` yulei zhang
2020-10-08  7:54 ` [PATCH 23/35] kvm, x86: introduce VM_DMEM yulei.kernel
2020-10-08  7:54 ` [PATCH 24/35] dmemfs: support hugepage for dmemfs yulei.kernel
2020-10-08  7:54 ` [PATCH 25/35] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() yulei.kernel
2020-10-08  7:54 ` [PATCH 26/35] mm, dmem: introduce pud_special() yulei.kernel
2020-10-08  7:54 ` [PATCH 27/35] mm: add pud_special() to support dmem huge pud yulei.kernel
2020-10-08  7:54 ` [PATCH 28/35] mm, dmemfs: support huge_fault() for dmemfs yulei.kernel
2020-10-08  7:54 ` [PATCH 29/35] mm: add follow_pte_pud() yulei.kernel
2020-10-08  7:54 ` [PATCH 30/35] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() yulei.kernel
2020-10-08  7:54 ` [PATCH 31/35] dmem: introduce mce handler yulei.kernel
2020-10-08  7:54 ` [PATCH 32/35] mm, dmemfs: register and handle the dmem mce yulei.kernel
2020-10-08  7:54 ` [PATCH 33/35] kvm, x86: temporary disable record_steal_time for dmem yulei.kernel
2020-10-08  7:54 ` [PATCH 34/35] dmem: add dmem unit tests yulei.kernel
2020-10-08  7:54 ` [PATCH 35/35] Add documentation for dmemfs yulei.kernel
2020-10-09  1:26   ` Randy Dunlap
2020-10-08 19:01 ` [PATCH 00/35] Enhance memory utilization with DMEMFS Joao Martins
2020-10-09 11:39   ` yulei zhang
2020-10-09 11:53     ` Joao Martins
2020-10-10  8:15       ` yulei zhang
2020-10-12 10:59         ` Joao Martins
2020-10-14 22:25           ` Dan Williams
2020-10-19 13:37             ` Paolo Bonzini
2020-10-19 19:03               ` Joao Martins
2020-10-20 15:22                 ` yulei zhang
2020-10-12 11:57 ` Zengtao (B)
2020-10-13  2:45   ` yulei zhang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).