* [RFC V2 00/37] Enhance memory utilization with DMEMFS
@ 2020-12-07 11:30 yulei.kernel
  2020-12-07 11:30 ` [RFC V2 01/37] fs: introduce dmemfs module yulei.kernel
                   ` (37 more replies)
  0 siblings, 38 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

In the current system, each physical memory page is associated with
a page structure which is used to track the usage of this page.
But as memory usage grows rapidly in cloud environments, we find
the resources consumed by page structure storage become more and
more significant. So is it possible to reclaim such memory and make
it reusable?

This patchset introduces an idea about how to save the extra
memory through a new virtual filesystem -- dmemfs.

Dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special as it is
not managed by the kernel and, most importantly, it has no 'struct page'.
Therefore we can leverage the memory saved on the host system
to support more tenants in our cloud service.

As the figure below shows, we use a kernel boot parameter 'dmem='
to reserve system memory when the host system boots up. The
remaining system memory is still managed by the regular memory
management code and is associated with 'struct page', while the
reserved memory is managed by dmem and assigned to guest systems
(an example is given after the figure). The details can be found in
Documentation/admin-guide/kernel-parameters.txt.

   +------------------+--------------------------------------+
   |  system memory   |     memory for guest system          | 
   +------------------+--------------------------------------+
    |                                   |
    v                                   |
struct page                             |
    |                                   |
    v                                   v
    system mem management             dmem  
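
For example (illustrative sizes only), booting with

    dmem=4G

reserves 4G for dmemfs on each NUMA node, while

    dmem=!8G

keeps 8G per node for the kernel and hands the rest of each node's
memory to dmemfs.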

At runtime, dmemfs handles the requests to allocate and free the
reserved memory on each NUMA node. User space applications can use
the mmap interface to access the memory (a minimal user-space sketch
follows the figure below), and kernel modules such as kvm and vfio
can pin the memory through follow_pfn() and get_user_pages() at
different page size granularities.

          +-----------+  +-----------+
          |   QEMU    |  |  dpdk etc.|      user
          +-----+-----+  +-----------+
  +-----------|------\------------------------------+
  |           |       v                    kernel   |
  |           |     +-------+  +-------+            |
  |           |     |  KVM  |  | vfio  |            |
  |           |     +-------+  +-------+            |
  |           |         |          |                |
  |      +----v---------v----------v------+         |
  |      |                                |         |
  |      |             Dmemfs             |         |
  |      |                                |         |
  |      +--------------------------------+         |
  +-----------/-----------------------\-------------+
             /                         \
     +------v-----+                +----v-------+
     |   node 0   |                |   node 1   |
     +------------+                +------------+
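
For reference, here is a minimal user-space sketch (not part of the
patchset) of how such a mapping could be set up. It assumes dmemfs is
already mounted at /mnt/dmemfs with a suitable pagesize option; the
path, size and flags are illustrative only:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1UL << 30;		/* 1G, illustrative */
	int fd = open("/mnt/dmemfs/guest-mem", O_RDWR | O_CREAT, 0600);
	void *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* mmap() extends the file as needed, no write() is required */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, len);		/* touch the dmem-backed memory */

	munmap(p, len);
	close(fd);
	return 0;
}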

Theoretically, dropping the 'struct page' saves 64 bytes for each
4K physical page, so for 320G of guest memory that is
320G / 4K * 64 bytes, i.e. about 5G of physical memory saved in total.

Detailed usage of dmemfs is included in
Documentation/filesystems/dmemfs.rst.

V1->V2:
* Rebase the code to kernel version 5.10.0-rc3.
* Introduce dregion->memmap for dmem to add _refcount for each
  dmem page.
* Enable record_steal_time for dmem before entering guest system.
* Adjust page walking for dmem.

Yulei Zhang (37):
  fs: introduce dmemfs module
  mm: support direct memory reservation
  dmem: implement dmem memory management
  dmem: let pat recognize dmem
  dmemfs: support mmap for dmemfs
  dmemfs: support truncating inode down
  dmem: trace core functions
  dmem: show some statistic in debugfs
  dmemfs: support remote access
  dmemfs: introduce max_alloc_try_dpages parameter
  mm: export mempolicy interfaces to serve dmem allocator
  dmem: introduce mempolicy support
  mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  mm, dmem: differentiate dmem-pmd and thp-pmd
  mm: add pmd_special() check for pmd_trans_huge_lock()
  dmemfs: introduce ->split() to dmemfs_vm_ops
  mm, dmemfs: support unmap_page_range() for dmemfs pmd
  mm: follow_pmd_mask() for dmem huge pmd
  mm: gup_huge_pmd() for dmem huge pmd
  mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  mm: support dmem huge pmd for follow_pfn()
  kvm, x86: Distinguish dmemfs page from mmio page
  kvm, x86: introduce VM_DMEM for syscall support usage
  dmemfs: support hugepage for dmemfs
  mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  mm, dmem: introduce pud_special() for dmem huge pud support
  mm: add pud_special() check to support dmem huge pud
  mm, dmemfs: support huge_fault() for dmemfs
  mm: add follow_pte_pud() to support huge pud look up
  dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  dmem: introduce mce handler
  mm, dmemfs: register and handle the dmem mce
  kvm, x86: enable record_steal_time for dmem
  dmem: add dmem unit tests
  mm, dmem: introduce dregion->memmap for dmem
  vfio: support dmempage refcount for vfio
  Add documentation for dmemfs

 Documentation/admin-guide/kernel-parameters.txt |   38 +
 Documentation/filesystems/dmemfs.rst            |   58 ++
 Documentation/filesystems/index.rst             |    1 +
 arch/x86/Kconfig                                |    1 +
 arch/x86/include/asm/pgtable.h                  |   32 +-
 arch/x86/include/asm/pgtable_types.h            |   13 +-
 arch/x86/kernel/setup.c                         |    3 +
 arch/x86/kvm/mmu/mmu.c                          |    1 +
 arch/x86/mm/pat/memtype.c                       |   21 +
 drivers/vfio/vfio_iommu_type1.c                 |   13 +-
 fs/Kconfig                                      |    1 +
 fs/Makefile                                     |    1 +
 fs/dmemfs/Kconfig                               |   16 +
 fs/dmemfs/Makefile                              |    8 +
 fs/dmemfs/inode.c                               | 1060 ++++++++++++++++++++
 fs/dmemfs/trace.h                               |   54 +
 fs/inode.c                                      |    6 +
 include/linux/dmem.h                            |   54 +
 include/linux/fs.h                              |    1 +
 include/linux/huge_mm.h                         |    5 +-
 include/linux/mempolicy.h                       |    3 +
 include/linux/mm.h                              |    9 +
 include/linux/pfn_t.h                           |   17 +-
 include/linux/pgtable.h                         |   22 +
 include/trace/events/dmem.h                     |   85 ++
 include/uapi/linux/magic.h                      |    1 +
 mm/Kconfig                                      |   19 +
 mm/Makefile                                     |    1 +
 mm/dmem.c                                       | 1196 +++++++++++++++++++++++
 mm/dmem_reserve.c                               |  303 ++++++
 mm/gup.c                                        |  101 +-
 mm/huge_memory.c                                |   19 +-
 mm/memory-failure.c                             |   70 +-
 mm/memory.c                                     |   74 +-
 mm/mempolicy.c                                  |    4 +-
 mm/mincore.c                                    |    8 +-
 mm/mprotect.c                                   |    7 +-
 mm/mremap.c                                     |    3 +
 mm/pagewalk.c                                   |    4 +-
 tools/testing/dmem/Kbuild                       |    1 +
 tools/testing/dmem/Makefile                     |   10 +
 tools/testing/dmem/dmem-test.c                  |  184 ++++
 virt/kvm/kvm_main.c                             |   13 +-
 43 files changed, 3483 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/filesystems/dmemfs.rst
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/linux/dmem.h
 create mode 100644 include/trace/events/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c

-- 
1.8.3.1


* [RFC V2 01/37] fs: introduce dmemfs module
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:30 ` [RFC V2 02/37] mm: support direct memory reservation yulei.kernel
                   ` (36 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special as it
is not managed by the kernel and is not associated with 'struct page'.

The original purpose of dmemfs is to drop the usage of
'struct page' and thereby save system memory in public cloud
environments.

This patch introduces the basic framework of dmemfs; only mkdir
and creating regular files are supported so far.
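
An illustrative way to exercise it (the mount point, page size and
file names below are examples, not mandated by this patch):

  # mount -t dmemfs none /mnt/dmemfs -o pagesize=2M
  # mkdir /mnt/dmemfs/vm1
  # touch /mnt/dmemfs/vm1/memory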

Signed-off-by: Xiao Guangrong  <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/Kconfig                 |   1 +
 fs/Makefile                |   1 +
 fs/dmemfs/Kconfig          |  13 +++
 fs/dmemfs/Makefile         |   7 ++
 fs/dmemfs/inode.c          | 266 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   1 +
 6 files changed, 289 insertions(+)
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c122..18e7208 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -41,6 +41,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 source "fs/f2fs/Kconfig"
 source "fs/zonefs/Kconfig"
+source "fs/dmemfs/Kconfig"
 
 config FS_DAX
 	bool "Direct Access (DAX) support"
diff --git a/fs/Makefile b/fs/Makefile
index 999d1a2..34747ec 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS)		+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)		+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)		+= zonefs/
+obj-$(CONFIG_DMEM_FS)		+= dmemfs/
diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig
new file mode 100644
index 00000000..d2894a5
--- /dev/null
+++ b/fs/dmemfs/Kconfig
@@ -0,0 +1,13 @@
+config DMEM_FS
+	tristate "Direct Memory filesystem support"
+	help
+	  dmemfs (Direct Memory filesystem) is device memory or reserved
+	  memory based filesystem. This kind of memory is special as it
+	  is not managed by kernel and it is without 'struct page'.
+
+	  The original purpose of dmemfs is saving extra memory of
+	  'struct page' that reduces the total cost of ownership (TCO)
+	  for cloud providers.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called dmemfs.
diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile
new file mode 100644
index 00000000..73bdc9c
--- /dev/null
+++ b/fs/dmemfs/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux dmem-filesystem routines.
+#
+obj-$(CONFIG_DMEM_FS) += dmemfs.o
+
+dmemfs-y += inode.o
diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
new file mode 100644
index 00000000..0aa3d3b
--- /dev/null
+++ b/fs/dmemfs/inode.c
@@ -0,0 +1,266 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *  linux/fs/dmemfs/inode.c
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ *   Chen Zhuo	     <sagazchen@tencent.com>
+ *   Haiwei Li	     <gerryhwli@tencent.com>
+ *   Yulei Zhang     <yuleixzhang@tencent.com>
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mount.h>
+#include <linux/file.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/capability.h>
+#include <linux/magic.h>
+#include <linux/mman.h>
+#include <linux/statfs.h>
+#include <linux/pagemap.h>
+#include <linux/parser.h>
+#include <linux/pfn_t.h>
+#include <linux/pagevec.h>
+#include <linux/fs_parser.h>
+#include <linux/seq_file.h>
+
+MODULE_AUTHOR("Tencent Corporation");
+MODULE_LICENSE("GPL v2");
+
+struct dmemfs_mount_opts {
+	unsigned long dpage_size;
+};
+
+struct dmemfs_fs_info {
+	struct dmemfs_mount_opts mount_opts;
+};
+
+enum dmemfs_param {
+	Opt_dpagesize,
+};
+
+const struct fs_parameter_spec dmemfs_fs_parameters[] = {
+	fsparam_string("pagesize", Opt_dpagesize),
+	{}
+};
+
+static int check_dpage_size(unsigned long dpage_size)
+{
+	if (dpage_size != PAGE_SIZE && dpage_size != PMD_SIZE &&
+	      dpage_size != PUD_SIZE)
+		return -EINVAL;
+
+	return 0;
+}
+
+static struct inode *
+dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode);
+
+static int
+__create_file(struct inode *dir, struct dentry *dentry, umode_t mode)
+{
+	struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode);
+	int error = -ENOSPC;
+
+	if (inode) {
+		d_instantiate(dentry, inode);
+		dget(dentry);	/* Extra count - pin the dentry in core */
+		error = 0;
+		dir->i_mtime = dir->i_ctime = current_time(inode);
+		if (mode & S_IFDIR)
+			inc_nlink(dir);
+	}
+	return error;
+}
+
+static int dmemfs_create(struct inode *dir, struct dentry *dentry,
+			 umode_t mode, bool excl)
+{
+	return __create_file(dir, dentry, mode | S_IFREG);
+}
+
+static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
+			umode_t mode)
+{
+	return __create_file(dir, dentry, mode | S_IFDIR);
+}
+
+static const struct inode_operations dmemfs_dir_inode_operations = {
+	.create		= dmemfs_create,
+	.lookup		= simple_lookup,
+	.unlink		= simple_unlink,
+	.mkdir		= dmemfs_mkdir,
+	.rmdir		= simple_rmdir,
+	.rename		= simple_rename,
+};
+
+static const struct inode_operations dmemfs_file_inode_operations = {
+	.setattr = simple_setattr,
+	.getattr = simple_getattr,
+};
+
+static const struct file_operations dmemfs_file_operations = {
+};
+
+static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
+{
+	struct dmemfs_fs_info *fsi = fc->s_fs_info;
+	struct fs_parse_result result;
+	int opt, ret;
+
+	opt = fs_parse(fc, dmemfs_fs_parameters, param, &result);
+	if (opt < 0)
+		return opt;
+
+	switch (opt) {
+	case Opt_dpagesize:
+		fsi->mount_opts.dpage_size = memparse(param->string, NULL);
+		ret = check_dpage_size(fsi->mount_opts.dpage_size);
+		if (ret) {
+			pr_warn("dmemfs: unsupported pagesize %lu.\n",
+				fsi->mount_opts.dpage_size);
+			return ret;
+		}
+		break;
+	default:
+		pr_warn("dmemfs: unknown mount option [%x].\n",
+			opt);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+struct inode *dmemfs_get_inode(struct super_block *sb,
+			       const struct inode *dir, umode_t mode)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		inode->i_ino = get_next_ino();
+		inode_init_owner(inode, dir, mode);
+		inode->i_mapping->a_ops = &empty_aops;
+		mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+		mapping_set_unevictable(inode->i_mapping);
+		inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
+		switch (mode & S_IFMT) {
+		default:
+			init_special_inode(inode, mode, 0);
+			break;
+		case S_IFREG:
+			inode->i_op = &dmemfs_file_inode_operations;
+			inode->i_fop = &dmemfs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &dmemfs_dir_inode_operations;
+			inode->i_fop = &simple_dir_operations;
+
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry)
+			 */
+			inc_nlink(inode);
+			break;
+		case S_IFLNK:
+			inode->i_op = &page_symlink_inode_operations;
+			break;
+		}
+	}
+	return inode;
+}
+
+static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	simple_statfs(dentry, buf);
+	buf->f_bsize = dentry->d_sb->s_blocksize;
+
+	return 0;
+}
+
+static const struct super_operations dmemfs_ops = {
+	.statfs	= dmemfs_statfs,
+	.drop_inode = generic_delete_inode,
+};
+
+static int
+dmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *inode;
+	struct dmemfs_fs_info *fsi = sb->s_fs_info;
+
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_blocksize = fsi->mount_opts.dpage_size;
+	sb->s_blocksize_bits = ilog2(fsi->mount_opts.dpage_size);
+	sb->s_magic = DMEMFS_MAGIC;
+	sb->s_op = &dmemfs_ops;
+	sb->s_time_gran = 1;
+
+	inode = dmemfs_get_inode(sb, NULL, S_IFDIR);
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int dmemfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, dmemfs_fill_super);
+}
+
+static void dmemfs_free_fc(struct fs_context *fc)
+{
+	kfree(fc->s_fs_info);
+}
+
+static const struct fs_context_operations dmemfs_context_ops = {
+	.free		= dmemfs_free_fc,
+	.parse_param	= dmemfs_parse_param,
+	.get_tree	= dmemfs_get_tree,
+};
+
+int dmemfs_init_fs_context(struct fs_context *fc)
+{
+	struct dmemfs_fs_info *fsi;
+
+	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+	if (!fsi)
+		return -ENOMEM;
+
+	fsi->mount_opts.dpage_size = PAGE_SIZE;
+	fc->s_fs_info = fsi;
+	fc->ops = &dmemfs_context_ops;
+	return 0;
+}
+
+static void dmemfs_kill_sb(struct super_block *sb)
+{
+	kill_litter_super(sb);
+}
+
+static struct file_system_type dmemfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "dmemfs",
+	.init_fs_context = dmemfs_init_fs_context,
+	.kill_sb	= dmemfs_kill_sb,
+};
+
+static int __init dmemfs_init(void)
+{
+	int ret;
+
+	ret = register_filesystem(&dmemfs_fs_type);
+
+	return ret;
+}
+
+static void __exit dmemfs_uninit(void)
+{
+	unregister_filesystem(&dmemfs_fs_type);
+}
+
+module_init(dmemfs_init)
+module_exit(dmemfs_uninit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc..3fbd066 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define DMEMFS_MAGIC		0x2ace90c6
 
 #endif /* __LINUX_MAGIC_H__ */
-- 
1.8.3.1


* [RFC V2 02/37] mm: support direct memory reservation
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
  2020-12-07 11:30 ` [RFC V2 01/37] fs: introduce dmemfs module yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:30 ` [RFC V2 03/37] dmem: implement dmem memory management yulei.kernel
                   ` (35 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce 'dmem=' to reserve system memory for DMEM (direct memory).
Compared with 'mem=' and 'memmap=', it reserves memory based on
NUMA topology. For detailed info, please refer to
Documentation/admin-guide/kernel-parameters.txt.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  38 +++
 arch/x86/kernel/setup.c                         |   3 +
 include/linux/dmem.h                            |  16 ++
 mm/Kconfig                                      |   8 +
 mm/Makefile                                     |   1 +
 mm/dmem.c                                       | 137 +++++++++++
 mm/dmem_reserve.c                               | 303 ++++++++++++++++++++++++
 7 files changed, 506 insertions(+)
 create mode 100644 include/linux/dmem.h
 create mode 100644 mm/dmem.c
 create mode 100644 mm/dmem_reserve.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 526d65d..78caf11 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -991,6 +991,44 @@
 			The filter can be disabled or changed to another
 			driver later using sysfs.
 
+	dmem=[!]size[KMG]
+			[KNL, NUMA] When CONFIG_DMEM is set, this is the
+			size of memory reserved for dmemfs on each NUMA
+			memory node. 'size' must be aligned to the default
+			alignment, which is the memory section size (128M
+			by default on x86_64). If '!' is set, that amount
+			of memory on each node is kept by the kernel and
+			dmemfs owns the rest of the memory on each node.
+			Example: Reserve 4G memory on each node for dmemfs
+				dmem=4G
+
+	dmem=[!]size[KMG]:align[KMG]
+			[KNL, NUMA] Ditto. 'align' should be a power of two and
+			not smaller than the default alignment. Also 'size'
+			must be aligned to 'align'.
+			Example: Bad dmem parameter because 'size' misaligned
+				dmem=0x40200000:1G
+
+	dmem=size[KMG]@addr[KMG]
+			[KNL] When CONFIG_DMEM is set, this marks a specific
+			memory region, from addr to addr + size, as reserved
+			for dmemfs. Reserving a fixed memory region for the
+			kernel is not supported, so '!' is forbidden here.
+			'addr' must not be 0 because the kernel occupies a
+			fixed memory region beginning at address 0. As above,
+			'size' and 'addr' must be aligned to the default
+			alignment.
+			Example: Reserve the 5G-6G range for dmemfs.
+				dmem=1G@5G
+
+	dmem=size[KMG]@addr[KMG]:align[KMG]
+			[KNL] Ditto. 'align' should be a power of two and
+			not smaller than the default alignment. Also 'size'
+			and 'addr' must be aligned to 'align'. Specially,
+			'@addr' and ':align' could occur in any order.
+			Example: Reserve the 5G-6G range for dmemfs.
+				dmem=1G:1G@5G
+
 	driver_async_probe=  [KNL]
 			List of driver names to be probed asynchronously.
 			Format: <driver_name1>,<driver_name2>...
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 84f581c..9d05e1b 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -48,6 +48,7 @@
 #include <asm/unwind.h>
 #include <asm/vsyscall.h>
 #include <linux/vmalloc.h>
+#include <linux/dmem.h>
 
 /*
  * max_low_pfn_mapped: highest directly mapped pfn < 4 GB
@@ -1149,6 +1150,8 @@ void __init setup_arch(char **cmdline_p)
 	if (boot_cpu_has(X86_FEATURE_GBPAGES))
 		hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT);
 
+	dmem_reserve_init();
+
 	/*
 	 * Reserve memory for crash kernel after SRAT is parsed so that it
 	 * won't consume hotpluggable memory.
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
new file mode 100644
index 00000000..5049322
--- /dev/null
+++ b/include/linux/dmem.h
@@ -0,0 +1,16 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_DMEM_H
+#define _LINUX_DMEM_H
+
+#ifdef CONFIG_DMEM
+int dmem_reserve_init(void);
+void dmem_init(void);
+int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
+
+#else
+static inline int dmem_reserve_init(void)
+{
+	return 0;
+}
+#endif
+#endif	/* _LINUX_DMEM_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index d42423f..3a6d408 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -226,6 +226,14 @@ config BALLOON_COMPACTION
 	  scenario aforementioned and helps improving memory defragmentation.
 
 #
+# support for direct memory basics
+config DMEM
+	bool "Direct Memory Reservation"
+	depends on SPARSEMEM
+	help
+	  Allow reservation of memory for the dedicated use of dmem.
+	  It's the basis of dmemfs.
+
 # support for memory compaction
 config COMPACTION
 	bool "Allow for memory compaction"
diff --git a/mm/Makefile b/mm/Makefile
index d73aed0..775c8518 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_DMEM) += dmem.o dmem_reserve.o
diff --git a/mm/dmem.c b/mm/dmem.c
new file mode 100644
index 00000000..b5fb4f1
--- /dev/null
+++ b/mm/dmem.c
@@ -0,0 +1,137 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * memory management for dmemfs
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ *   Chen Zhuo	     <sagazchen@tencent.com>
+ *   Haiwei Li	     <gerryhwli@tencent.com>
+ *   Yulei Zhang     <yuleixzhang@tencent.com>
+ */
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/cpuset.h>
+#include <linux/nodemask.h>
+#include <linux/topology.h>
+#include <linux/dmem.h>
+#include <linux/debugfs.h>
+#include <linux/notifier.h>
+
+/*
+ * There are two kinds of page in dmem management:
+ * - native page, it's the CPU's base page size, i.e., 4K on x86
+ *
+ * - dmem page, it's the unit size used by dmem itself to manage all
+ *     registered memory. It's set by dmem_alloc_init()
+ */
+struct dmem_region {
+	/* original registered memory region */
+	phys_addr_t reserved_start_addr;
+	phys_addr_t reserved_end_addr;
+
+	/* memory region aligned to dmem page */
+	phys_addr_t dpage_start_pfn;
+	phys_addr_t dpage_end_pfn;
+
+	/*
+	 * avoid memory allocation if the dmem region is small enough
+	 */
+	unsigned long static_bitmap;
+	unsigned long *bitmap;
+	u64 next_free_pos;
+	struct list_head node;
+
+	unsigned long static_error_bitmap;
+	unsigned long *error_bitmap;
+};
+
+/*
+ * statically define number of regions to avoid allocating memory
+ * dynamically from memblock as slab is not available at that time
+ */
+#define DMEM_REGION_PAGES	2
+#define INIT_REGION_NUM							\
+	((DMEM_REGION_PAGES << PAGE_SHIFT) / sizeof(struct dmem_region))
+
+static struct dmem_region static_regions[INIT_REGION_NUM];
+
+struct dmem_node {
+	unsigned long total_dpages;
+	unsigned long free_dpages;
+
+	/* fallback list for allocation */
+	int nodelist[MAX_NUMNODES];
+	struct list_head regions;
+};
+
+struct dmem_pool {
+	struct mutex lock;
+
+	unsigned long region_num;
+	unsigned long registered_pages;
+	unsigned long unaligned_pages;
+
+	/* shift bits of dmem page */
+	unsigned long dpage_shift;
+
+	unsigned long total_dpages;
+	unsigned long free_dpages;
+
+	/*
+	 * increased when allocator is initialized,
+	 * stop it being destroyed when someone is
+	 * still using it
+	 */
+	u64 user_count;
+	struct dmem_node nodes[MAX_NUMNODES];
+};
+
+static struct dmem_pool dmem_pool = {
+	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
+};
+
+#define for_each_dmem_node(_dnode)					\
+	for (_dnode = dmem_pool.nodes;					\
+		_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes);	\
+		_dnode++)
+
+void __init dmem_init(void)
+{
+	struct dmem_node *dnode;
+
+	pr_info("dmem: pre-defined region: %ld\n", INIT_REGION_NUM);
+
+	for_each_dmem_node(dnode)
+		INIT_LIST_HEAD(&dnode->regions);
+}
+
+/*
+ * register the memory region to dmem pool as freed memory, the region
+ * should be properly aligned to PAGE_SIZE at least
+ *
+ * it's safe to be out of dmem_pool's lock as it's used at the very
+ * beginning of system boot
+ */
+int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
+{
+	struct dmem_region *dregion;
+
+	pr_info("dmem: register region [%#llx - %#llx] on node %d.\n",
+		(unsigned long long)start, (unsigned long long)end, node);
+
+	if (unlikely(dmem_pool.region_num >= INIT_REGION_NUM)) {
+		pr_err("dmem: region is not sufficient.\n");
+		return -ENOMEM;
+	}
+
+	dregion = &static_regions[dmem_pool.region_num++];
+	dregion->reserved_start_addr = start;
+	dregion->reserved_end_addr = end;
+
+	list_add_tail(&dregion->node, &dmem_pool.nodes[node].regions);
+	dmem_pool.registered_pages += __phys_to_pfn(end) -
+					__phys_to_pfn(start);
+	return 0;
+}
+
diff --git a/mm/dmem_reserve.c b/mm/dmem_reserve.c
new file mode 100644
index 00000000..567ee9f
--- /dev/null
+++ b/mm/dmem_reserve.c
@@ -0,0 +1,303 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support reserved memory for dmem.
+ * As dmem_reserve_init will adjust memblock to reserve memory
+ * for dmem, we could save a vast amount of memory for 'struct page'.
+ *
+ * Authors:
+ *   Xiao Guangrong  <gloryxiao@tencent.com>
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/memblock.h>
+#include <linux/log2.h>
+#include <linux/dmem.h>
+
+struct dmem_param {
+	phys_addr_t base;
+	phys_addr_t size;
+	phys_addr_t align;
+	/*
+	 * If set, dmem_param specifies the amount of memory kept for the
+	 * kernel; otherwise it is the memory reserved for dmem.
+	 */
+	bool resv_kernel;
+};
+
+static struct dmem_param dmem_param __initdata;
+
+/* Check dmem param defined by user to match dmem align */
+static int __init check_dmem_param(bool resv_kernel, phys_addr_t base,
+				   phys_addr_t size, phys_addr_t align)
+{
+	phys_addr_t min_align = 1UL << SECTION_SIZE_BITS;
+
+	if (!align)
+		align = min_align;
+
+	/*
+	 * the reserved region should be aligned to memory section
+	 * at least
+	 */
+	if (align < min_align) {
+		pr_warn("dmem: 'align' should be %#llx at least to be aligned to memory section.\n",
+			min_align);
+		return -EINVAL;
+	}
+
+	if (!is_power_of_2(align)) {
+		pr_warn("dmem: 'align' should be power of 2.\n");
+		return -EINVAL;
+	}
+
+	if (base & (align - 1)) {
+		pr_warn("dmem: 'addr' is unaligned to 'align' in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (size & (align - 1)) {
+		pr_warn("dmem: 'size' is unaligned to 'align' in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (base >= base + size) {
+		pr_warn("dmem: 'addr + size' overflow in dmem=\n");
+		return -EINVAL;
+	}
+
+	if (resv_kernel && base) {
+		pr_warn("dmem: reserving a fixed base address for the kernel is not allowed\n");
+		return -EINVAL;
+	}
+
+	dmem_param.base = base;
+	dmem_param.size = size;
+	dmem_param.align = align;
+	dmem_param.resv_kernel = resv_kernel;
+
+	pr_info("dmem: parameter: base address %#llx size %#llx align %#llx resv_kernel %d\n",
+		(unsigned long long)base, (unsigned long long)size,
+		(unsigned long long)align, resv_kernel);
+	return 0;
+}
+
+static int __init parse_dmem(char *p)
+{
+	phys_addr_t base, size, align;
+	char *oldp;
+	bool resv_kernel = false;
+
+	if (!p)
+		return -EINVAL;
+
+	base = align = 0;
+
+	if (*p == '!') {
+		resv_kernel = true;
+		p++;
+	}
+
+	oldp = p;
+	size = memparse(p, &p);
+	if (oldp == p)
+		return -EINVAL;
+
+	if (!size) {
+		pr_warn("dmem: 'size' of 0 defined in dmem=, or {invalid} param\n");
+		return -EINVAL;
+	}
+
+	while (*p) {
+		phys_addr_t *pvalue;
+
+		switch (*p) {
+		case '@':
+			pvalue = &base;
+			break;
+		case ':':
+			pvalue = &align;
+			break;
+		default:
+			pr_warn("dmem: unknown indicator: %c in dmem=\n", *p);
+			return -EINVAL;
+		}
+
+		/*
+		 * Some attribute had been specified multiple times.
+		 * This is not allowed.
+		 */
+		if (*pvalue)
+			return -EINVAL;
+
+		oldp = ++p;
+		*pvalue = memparse(p, &p);
+		if (oldp == p)
+			return -EINVAL;
+
+		if (*pvalue == 0) {
+			pr_warn("dmem: 'addr' or 'align' should not be set to 0\n");
+			return -EINVAL;
+		}
+	}
+
+	return check_dmem_param(resv_kernel, base, size, align);
+}
+
+early_param("dmem", parse_dmem);
+
+/*
+ * We want to remove a memory range from memblock.memory completely.
+ * However, isolating memblock.memory in memblock_remove() may need to
+ * double the memblock_region array, and the memory allocated for the
+ * new array may end up inside the very range we want to remove,
+ * which is a conflict.
+ * To resolve this conflict, reserve the memory range first. While
+ * reserving it, isolating memory.reserved allocates memory outside
+ * the range to be removed, so the subsequent array doubling in
+ * memblock_remove() cannot land in this reserved range.
+ */
+static void __init dmem_remove_memblock(phys_addr_t base, phys_addr_t size)
+{
+	memblock_reserve(base, size);
+	memblock_remove(base, size);
+	memblock_free(base, size);
+}
+
+static u64 node_req_mem[MAX_NUMNODES] __initdata;
+
+/* Reserve certain size of memory for dmem in each numa node */
+static void __init dmem_reserve_size(phys_addr_t size, phys_addr_t align,
+		bool resv_kernel)
+{
+	phys_addr_t start, end;
+	u64 i;
+	int nid;
+
+	/* Calculate available free memory on each node */
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start,
+				&end, &nid)
+		node_req_mem[nid] += end - start;
+
+	/* Calculate memory size needed to reserve on each node for dmem */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		node_req_mem[i] = ALIGN(node_req_mem[i], align);
+
+		if (!resv_kernel) {
+			node_req_mem[i] = min(size, node_req_mem[i]);
+			continue;
+		}
+
+		/* leave dmem_param.size memory for kernel */
+		if (node_req_mem[i] > size)
+			node_req_mem[i] = node_req_mem[i] - size;
+		else
+			node_req_mem[i] = 0;
+	}
+
+retry:
+	for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+					&start, &end, &nid) {
+		/* Well, we have got enough memory for this node. */
+		if (!node_req_mem[nid])
+			continue;
+
+		start = round_up(start, align);
+		end = round_down(end, align);
+		/* Skip memblock_region which is too small */
+		if (start >= end)
+			continue;
+
+		/* Towards memory block at higher address */
+		start = end - min((end - start), node_req_mem[nid]);
+
+		/*
+		 * do not have enough resource to save the region, skip it
+		 * from now on
+		 */
+		if (dmem_region_register(nid, start, end) < 0)
+			break;
+
+		dmem_remove_memblock(start, end - start);
+
+		node_req_mem[nid] -= end - start;
+
+		/* We have dropped a memblock, so re-walk it. */
+		goto retry;
+	}
+
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (!node_req_mem[i])
+			continue;
+
+		pr_info("dmem: %#llx size of memory is not reserved on node %lld due to misaligned regions.\n",
+			(unsigned long long)size, i);
+	}
+
+}
+
+/* Reserve [base, base + size) for dmem. */
+static void __init
+dmem_reserve_region(phys_addr_t base, phys_addr_t size, phys_addr_t align)
+{
+	phys_addr_t start, end;
+	phys_addr_t p_start, p_end;
+	u64 i;
+	int nid;
+
+	p_start = base;
+	p_end = base + size;
+
+retry:
+	for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+					&start, &end, &nid) {
+		/* Find region located in user defined range. */
+		if (start >= p_end || end <= p_start)
+			continue;
+
+		start = round_up(max(start, p_start), align);
+		end = round_down(min(end, p_end), align);
+		if (start >= end)
+			continue;
+
+		if (dmem_region_register(nid, start, end) < 0)
+			break;
+
+		dmem_remove_memblock(start, end - start);
+
+		size -= end - start;
+		if (!size)
+			return;
+
+		/* We have dropped a memblock, so re-walk it. */
+		goto retry;
+	}
+
+	pr_info("dmem: %#llx size of memory is not reserved for dmem due to holes and misaligned regions in [%#llx, %#llx].\n",
+		(unsigned long long)size, (unsigned long long)base,
+		(unsigned long long)(base + size));
+}
+
+/* Reserve memory for dmem */
+int __init dmem_reserve_init(void)
+{
+	phys_addr_t base, size, align;
+	bool resv_kernel;
+
+	dmem_init();
+
+	base = dmem_param.base;
+	size = dmem_param.size;
+	align = dmem_param.align;
+	resv_kernel = dmem_param.resv_kernel;
+
+	/* Dmem param had not been enabled. */
+	if (size == 0)
+		return 0;
+
+	if (base)
+		dmem_reserve_region(base, size, align);
+	else
+		dmem_reserve_size(size, align, resv_kernel);
+
+	return 0;
+}
-- 
1.8.3.1


* [RFC V2 03/37] dmem: implement dmem memory management
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
  2020-12-07 11:30 ` [RFC V2 01/37] fs: introduce dmemfs module yulei.kernel
  2020-12-07 11:30 ` [RFC V2 02/37] mm: support direct memory reservation yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:30 ` [RFC V2 04/37] dmem: let pat recognize dmem yulei.kernel
                   ` (34 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

The figure below shows the topology of dmem memory
management. It reserves a few memory regions from the
NUMA nodes, and in each region it uses a bitmap to
track the actual memory usage.

         +------+-+-------+---------+
         | Node0| | ...   | NodeN   |
         +--/-\-+-+-------+---------+
           /   \
      +---v----+v-----+----------+
      |region 0| ...  | region n |
      +--/---\-+------+----------+
        /     \
    +-+v+------v-------+-+-+-+
    | | |   bitmap     | | | |
    +-+-+--------------+-+-+-+

It introduces the interfaces to manage dmem pages, which include
(see the sketch after this list):
 - dmem_region_register(), which registers the reserved memory to the
   dmem management system

 - dmem_alloc_init(), which initializes the dmem allocator; note the
   page size the allocator uses isn't the same as the alignment used
   to reserve dmem memory

 - dmem_alloc_pages_vma() and dmem_free_pages(), the interfaces for
   allocating and freeing dmem memory; multiple pages can be allocated
   at one time, but the count should be a power of two
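
A rough sketch of the intended call sequence (illustrative only, not
taken from this patch; error handling is trimmed and the 2M dpage
shift is just an example):

/* illustrative only: allocate and free dmem pages from kernel code */
static int dmem_usage_sketch(struct vm_area_struct *vma, unsigned long vaddr)
{
	unsigned int got;
	phys_addr_t addr;
	int ret;

	ret = dmem_alloc_init(PMD_SHIFT);	/* manage dmem in 2M dpages */
	if (ret)
		return ret;

	/* try to get up to 8 contiguous dpages for this VMA */
	addr = dmem_alloc_pages_vma(vma, vaddr, 8, &got);
	if (!addr) {
		dmem_alloc_uinit();
		return -ENOMEM;
	}

	/* ... insert 'addr' into the page tables and use the memory ... */

	dmem_free_pages(addr, got);
	dmem_alloc_uinit();
	return 0;
}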

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/dmem.h |   3 +
 mm/dmem.c            | 674 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 5049322..476a82e 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -7,6 +7,9 @@
 void dmem_init(void);
 int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 
+int dmem_alloc_init(unsigned long dpage_shift);
+void dmem_alloc_uinit(void);
+
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/mm/dmem.c b/mm/dmem.c
index b5fb4f1..a77a064 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -91,11 +91,38 @@ struct dmem_pool {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 };
 
+#define DMEM_PAGE_SIZE		(1UL << dmem_pool.dpage_shift)
+#define DMEM_PAGE_UP(x)		phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1))
+#define DMEM_PAGE_DOWN(x)	phys_to_dpage(x)
+
+#define dpage_to_phys(_dpage)						\
+	((_dpage) << dmem_pool.dpage_shift)
+#define phys_to_dpage(_addr)						\
+	((_addr) >> dmem_pool.dpage_shift)
+
+#define dpage_to_pfn(_dpage)						\
+	(__phys_to_pfn(dpage_to_phys(_dpage)))
+#define pfn_to_dpage(_pfn)						\
+	(phys_to_dpage(__pfn_to_phys(_pfn)))
+
+#define dnode_to_nid(_dnode)						\
+	((_dnode) - dmem_pool.nodes)
+#define nid_to_dnode(nid)						\
+	(&dmem_pool.nodes[nid])
+
 #define for_each_dmem_node(_dnode)					\
 	for (_dnode = dmem_pool.nodes;					\
 		_dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes);	\
 		_dnode++)
 
+#define for_each_dmem_region(_dnode, _dregion)				\
+	list_for_each_entry(_dregion, &(_dnode)->regions, node)
+
+static inline int *dmem_nodelist(int nid)
+{
+	return nid_to_dnode(nid)->nodelist;
+}
+
 void __init dmem_init(void)
 {
 	struct dmem_node *dnode;
@@ -135,3 +162,650 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
 	return 0;
 }
 
+#define PENALTY_FOR_DMEM_SHARED_NODE		(1)
+
+static int dmem_nodeload[MAX_NUMNODES] __initdata;
+
+/* Evaluate penalty for each dmem node */
+static int __init dmem_evaluate_node(int local, int node)
+{
+	int penalty;
+
+	/* Use the distance array to find the distance */
+	penalty = node_distance(local, node);
+
+	/* Penalize nodes under us ("prefer the next node") */
+	penalty += (node < local);
+
+	/* Give preference to headless and unused nodes */
+	if (!cpumask_empty(cpumask_of_node(node)))
+		penalty += PENALTY_FOR_NODE_WITH_CPUS;
+
+	/* Penalize dmem-node shared with kernel */
+	if (node_state(node, N_MEMORY))
+		penalty += PENALTY_FOR_DMEM_SHARED_NODE;
+
+	/* Slight preference for less loaded node */
+	penalty *= (nr_online_nodes * MAX_NUMNODES);
+
+	penalty += dmem_nodeload[node];
+
+	return penalty;
+}
+
+static int __init find_next_dmem_node(int local, nodemask_t *used_nodes)
+{
+	struct dmem_node *dnode;
+	int node, best_node = NUMA_NO_NODE;
+	int penalty, min_penalty = INT_MAX;
+
+	/* Invalid node is not suitable to call node_distance */
+	if (!node_state(local, N_POSSIBLE))
+		return NUMA_NO_NODE;
+
+	/* Use the local node if we haven't already */
+	if (!node_isset(local, *used_nodes)) {
+		node_set(local, *used_nodes);
+		return local;
+	}
+
+	for_each_dmem_node(dnode) {
+		if (list_empty(&dnode->regions))
+			continue;
+
+		node = dnode_to_nid(dnode);
+
+		/* Don't want a node to appear more than once */
+		if (node_isset(node, *used_nodes))
+			continue;
+
+		penalty = dmem_evaluate_node(local, node);
+
+		if (penalty < min_penalty) {
+			min_penalty = penalty;
+			best_node = node;
+		}
+	}
+
+	if (best_node >= 0)
+		node_set(best_node, *used_nodes);
+
+	return best_node;
+}
+
+static int __init dmem_node_init(struct dmem_node *dnode)
+{
+	int *nodelist;
+	nodemask_t used_nodes;
+	int local, node, prev;
+	int load;
+	int i = 0;
+
+	nodelist = dnode->nodelist;
+	nodes_clear(used_nodes);
+	local = dnode_to_nid(dnode);
+	prev = local;
+	load = nr_online_nodes;
+
+	while ((node = find_next_dmem_node(local, &used_nodes)) >= 0) {
+		/*
+		 * We don't want to pressure a particular node.
+		 * So adding penalty to the first node in same
+		 * distance group to make it round-robin.
+		 */
+		if (node_distance(local, node) != node_distance(local, prev))
+			dmem_nodeload[node] = load;
+
+		nodelist[i++] = prev = node;
+		load--;
+	}
+
+	return 0;
+}
+
+static void __init dmem_region_uinit(struct dmem_region *dregion)
+{
+	unsigned long nr_pages, size, *bitmap = dregion->error_bitmap;
+
+	if (!bitmap)
+		return;
+
+	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
+		- __phys_to_pfn(dregion->reserved_start_addr);
+
+	WARN_ON(!nr_pages);
+
+	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
+	if (size > sizeof(dregion->static_bitmap))
+		kfree(bitmap);
+	dregion->error_bitmap = NULL;
+}
+
+/*
+ * we only stop the allocator from using the reserved pages and do not
+ * return the pages back if anything goes wrong
+ */
+static void __init dmem_uinit(void)
+{
+	struct dmem_region *dregion, *dr;
+	struct dmem_node *dnode;
+
+	for_each_dmem_node(dnode) {
+		dnode->nodelist[0] = NUMA_NO_NODE;
+		list_for_each_entry_safe(dregion, dr, &dnode->regions, node) {
+			dmem_region_uinit(dregion);
+			dregion->reserved_start_addr =
+				dregion->reserved_end_addr = 0;
+			list_del(&dregion->node);
+		}
+	}
+
+	dmem_pool.region_num = 0;
+	dmem_pool.registered_pages = 0;
+}
+
+static int __init dmem_region_init(struct dmem_region *dregion)
+{
+	unsigned long *bitmap, size, nr_pages;
+
+	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
+		- __phys_to_pfn(dregion->reserved_start_addr);
+
+	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
+	if (size <= sizeof(dregion->static_error_bitmap)) {
+		bitmap = &dregion->static_error_bitmap;
+	} else {
+		bitmap = kzalloc(size, GFP_KERNEL);
+		if (!bitmap)
+			return -ENOMEM;
+	}
+	dregion->error_bitmap = bitmap;
+	return 0;
+}
+
+/*
+ * dmem memory is not 'struct page' backed, i.e., the kernel treats
+ * its pfns as invalid
+ */
+static int __init dmem_check_region(struct dmem_region *dregion)
+{
+	unsigned long pfn;
+
+	for (pfn = __phys_to_pfn(dregion->reserved_start_addr);
+	      pfn < __phys_to_pfn(dregion->reserved_end_addr); pfn++) {
+		if (!WARN_ON(pfn_valid(pfn)))
+			continue;
+
+		pr_err("dmem: check pfn %#lx failed, its memory was not properly reserved\n",
+			pfn);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int __init dmem_late_init(void)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *dnode;
+	int ret = 0;
+
+	for_each_dmem_node(dnode) {
+		dmem_node_init(dnode);
+
+		for_each_dmem_region(dnode, dregion) {
+			ret = dmem_region_init(dregion);
+			if (ret)
+				goto exit;
+			ret = dmem_check_region(dregion);
+			if (ret)
+				goto exit;
+		}
+	}
+	return ret;
+exit:
+	dmem_uinit();
+	return ret;
+}
+late_initcall(dmem_late_init);
+
+static int dmem_alloc_region_init(struct dmem_region *dregion,
+				  unsigned long *dpages)
+{
+	unsigned long start, end, *bitmap, size;
+
+	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
+	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
+
+	*dpages = end - start;
+	if (!*dpages)
+		return 0;
+
+	size = BITS_TO_LONGS(*dpages) * sizeof(long);
+	if (size <= sizeof(dregion->static_bitmap))
+		bitmap = &dregion->static_bitmap;
+	else {
+		bitmap = kzalloc(size, GFP_KERNEL);
+		if (!bitmap)
+			return -ENOMEM;
+	}
+
+	dregion->bitmap = bitmap;
+	dregion->next_free_pos = 0;
+	dregion->dpage_start_pfn = start;
+	dregion->dpage_end_pfn = end;
+
+	dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start)
+		- dregion->reserved_start_addr));
+	dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr
+		- dpage_to_phys(end));
+	return 0;
+}
+
+static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage)
+{
+	unsigned long valid_pages;
+	unsigned long pos_pfn, pos_offset;
+	unsigned long pages_per_dpage = DMEM_PAGE_SIZE >> PAGE_SHIFT;
+	phys_addr_t reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+
+	pos_offset = dpage_to_pfn(dpage) - reserved_start_pfn;
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+	if (pos_pfn < pos_offset + pages_per_dpage)
+		return true;
+	return false;
+}
+
+static unsigned long
+dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t dpage,
+			unsigned int dpages_nr)
+{
+	u64 pos = dpage - dregion->dpage_start_pfn;
+	unsigned int i;
+	unsigned long err_num = 0;
+
+	for (i = 0; i < dpages_nr; i++) {
+		if (dmem_dpage_is_error(dregion, dpage + i)) {
+			WARN_ON(!test_bit(pos + i, dregion->bitmap));
+			err_num++;
+		} else {
+			WARN_ON(!__test_and_clear_bit(pos + i,
+						      dregion->bitmap));
+		}
+	}
+	return err_num;
+}
+
+/* set or clear corresponding bit on allocation bitmap based on error bitmap */
+static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
+						    bool set)
+{
+	unsigned long pos_pfn, pos_offset;
+	unsigned long valid_pages, mce_dpages = 0;
+	phys_addr_t dpage, reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
+		- reserved_start_pfn;
+try_set:
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+
+	if (pos_pfn >= valid_pages)
+		return mce_dpages;
+	mce_dpages++;
+	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
+	if (set)
+		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
+					   dregion->bitmap));
+	else
+		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
+					      dregion->bitmap));
+	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
+	goto try_set;
+}
+
+static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
+{
+	unsigned long dpages, size;
+
+	dregion_alloc_bitmap_set_clear(dregion, false);
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+	size = BITS_TO_LONGS(dpages) * sizeof(long);
+	WARN_ON(!bitmap_empty(dregion->bitmap, size * BITS_PER_BYTE));
+}
+
+static void dmem_alloc_region_uinit(struct dmem_region *dregion)
+{
+	unsigned long dpages, size, *bitmap = dregion->bitmap;
+
+	if (!bitmap)
+		return;
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+	WARN_ON(!dpages);
+
+	dmem_uinit_check_alloc_bitmap(dregion);
+
+	size = BITS_TO_LONGS(dpages) * sizeof(long);
+	if (size > sizeof(dregion->static_bitmap))
+		kfree(bitmap);
+	dregion->bitmap = NULL;
+}
+
+static void __dmem_alloc_uinit(void)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+
+	if (!dmem_pool.dpage_shift)
+		return;
+
+	dmem_pool.unaligned_pages = 0;
+
+	for_each_dmem_node(dnode) {
+		for_each_dmem_region(dnode, dregion)
+			dmem_alloc_region_uinit(dregion);
+
+		dnode->total_dpages = dnode->free_dpages = 0;
+	}
+
+	dmem_pool.dpage_shift = 0;
+	dmem_pool.total_dpages = dmem_pool.free_dpages = 0;
+}
+
+static void dnode_count_free_dpages(struct dmem_node *dnode, long dpages)
+{
+	dnode->free_dpages += dpages;
+	dmem_pool.free_dpages += dpages;
+}
+
+/*
+ * uninitialize dmem allocator
+ *
+ * all dpages should be freed before calling it
+ */
+void dmem_alloc_uinit(void)
+{
+	mutex_lock(&dmem_pool.lock);
+	if (!--dmem_pool.user_count)
+		__dmem_alloc_uinit();
+	mutex_unlock(&dmem_pool.lock);
+}
+EXPORT_SYMBOL(dmem_alloc_uinit);
+
+/*
+ * initialize dmem allocator
+ *   @dpage_shift: the shift bits of the dmem page size used to manage
+ *      dmem memory, it should be at least the CPU's native page shift
+ *
+ * Note: the page size the allocator uses isn't the same as the
+ *       alignment used to reserve dmem memory
+ */
+int dmem_alloc_init(unsigned long dpage_shift)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+	unsigned long dpages;
+	int ret = 0;
+
+	if (dpage_shift < PAGE_SHIFT)
+		return -EINVAL;
+
+	mutex_lock(&dmem_pool.lock);
+
+	if (dmem_pool.dpage_shift) {
+		/*
+		 * double init on the same page size is okay
+		 * to make the unit tests happy
+		 */
+		if (dmem_pool.dpage_shift != dpage_shift)
+			ret = -EBUSY;
+
+		goto exit;
+	}
+
+	dmem_pool.dpage_shift = dpage_shift;
+
+	for_each_dmem_node(dnode) {
+		for_each_dmem_region(dnode, dregion) {
+			ret = dmem_alloc_region_init(dregion, &dpages);
+			if (ret < 0) {
+				__dmem_alloc_uinit();
+				goto exit;
+			}
+
+			dnode_count_free_dpages(dnode, dpages);
+		}
+		dnode->total_dpages = dnode->free_dpages;
+	}
+
+	dmem_pool.total_dpages = dmem_pool.free_dpages;
+
+	if (dmem_pool.unaligned_pages && !ret)
+		pr_warn("dmem: %llu pages are wasted due to alignment\n",
+			(unsigned long long)dmem_pool.unaligned_pages);
+exit:
+	if (!ret)
+		dmem_pool.user_count++;
+
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_alloc_init);
+
+static phys_addr_t
+dmem_alloc_region_page(struct dmem_region *dregion, unsigned int try_max,
+		       unsigned int *result_nr)
+{
+	unsigned long pos, dpages;
+	unsigned int i;
+
+	/* no dpage is available in this region */
+	if (!dregion->bitmap)
+		return 0;
+
+	dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn;
+
+	/* no free page in this region */
+	if (dregion->next_free_pos >= dpages)
+		return 0;
+
+	pos = find_next_zero_bit(dregion->bitmap, dpages,
+				 dregion->next_free_pos);
+	if (pos >= dpages) {
+		dregion->next_free_pos = pos;
+		return 0;
+	}
+
+	__set_bit(pos, dregion->bitmap);
+
+	/* do not go beyond the region */
+	try_max = min(try_max, (unsigned int)(dpages - pos - 1));
+	for (i = 1; i < try_max; i++)
+		if (__test_and_set_bit(pos + i, dregion->bitmap))
+			break;
+
+	*result_nr = i;
+	dregion->next_free_pos = pos + *result_nr;
+	return dpage_to_phys(dregion->dpage_start_pfn + pos);
+}
+
+/*
+ * allocate dmem pages from the nodelist
+ *
+ *   @nodelist: dmem_node's nodelist
+ *   @nodemask: nodemask for filtering the dmem nodelist
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+static phys_addr_t
+dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask,
+			       unsigned int try_max, unsigned int *result_nr)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+	phys_addr_t addr = 0;
+	int node, i;
+	unsigned int local_result_nr;
+
+	WARN_ON(try_max > 1 && !result_nr);
+
+	if (!result_nr)
+		result_nr = &local_result_nr;
+
+	*result_nr = 0;
+
+	for (i = 0; !addr && i < ARRAY_SIZE(dnode->nodelist); i++) {
+		node = nodelist[i];
+
+		if (nodemask && !node_isset(node, *nodemask))
+			continue;
+
+		mutex_lock(&dmem_pool.lock);
+
+		WARN_ON(!dmem_pool.dpage_shift);
+
+		dnode = &dmem_pool.nodes[node];
+		for_each_dmem_region(dnode, dregion) {
+			addr = dmem_alloc_region_page(dregion, try_max,
+						      result_nr);
+			if (addr) {
+				dnode_count_free_dpages(dnode,
+							-(long)(*result_nr));
+				break;
+			}
+		}
+
+		mutex_unlock(&dmem_pool.lock);
+	}
+	return addr;
+}
+
+/*
+ * allocate a dmem page from the dmem pool and try to allocate more
+ * continuous dpages if @try_max is not less than 1
+ *
+ *   @nid: the NUMA node the dmem page got from
+ *   @nodemask: nodemask for filtering the dmem nodelist
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+phys_addr_t
+dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max,
+			  unsigned int *result_nr)
+{
+	int *nodelist;
+
+	if (nid >= ARRAY_SIZE(dmem_pool.nodes))
+		return 0;
+
+	nodelist = dmem_nodelist(nid);
+	return dmem_alloc_pages_from_nodelist(nodelist, nodemask,
+					      try_max, result_nr);
+}
+EXPORT_SYMBOL(dmem_alloc_pages_nodemask);
+
+/*
+ * dmem_alloc_pages_vma - Allocate pages for a VMA.
+ *
+ *   @vma:  Pointer to VMA or NULL if not available.
+ *   @addr: Virtual Address of the allocation. Must be inside the VMA.
+ *   @try_max: try to allocate @try_max dpages if possible
+ *   @result_nr: allocated dpage number returned to the caller
+ *
+ * Return the physical address of the first dpage allocated from dmem
+ * pool, or 0 on failure. The allocated dpage number is filled into
+ * @result_nr
+ */
+phys_addr_t
+dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
+		     unsigned int try_max, unsigned int *result_nr)
+{
+	phys_addr_t phys_addr;
+	int *nl;
+	unsigned int cpuset_mems_cookie;
+
+retry_cpuset:
+	nl = dmem_nodelist(numa_node_id());
+
+	phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max,
+						   result_nr);
+	if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie)))
+		goto retry_cpuset;
+
+	return phys_addr;
+}
+EXPORT_SYMBOL(dmem_alloc_pages_vma);
+
+/*
+ * Don't need to call it in a lock.
+ * This function uses the reserved addresses those are initially registered
+ * and will not be modified at run time.
+ */
+static struct dmem_region *find_dmem_region(phys_addr_t phys_addr,
+					    struct dmem_node **pdnode)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+
+	for_each_dmem_node(dnode)
+		for_each_dmem_region(dnode, dregion) {
+			if (dregion->reserved_start_addr > phys_addr)
+				continue;
+			if (dregion->reserved_end_addr <= phys_addr)
+				continue;
+
+			*pdnode = dnode;
+			return dregion;
+		}
+
+	return NULL;
+}
+
+/*
+ * free dmem pages to the dmem pool
+ *   @addr: the physical address to be freed
+ *   @dpages_nr: the number of dpages to be freed
+ */
+void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *pdnode = NULL;
+	phys_addr_t dpage = phys_to_dpage(addr);
+	u64 pos;
+	unsigned long err_dpages;
+
+	mutex_lock(&dmem_pool.lock);
+
+	WARN_ON(!dmem_pool.dpage_shift);
+
+	dregion = find_dmem_region(addr, &pdnode);
+	WARN_ON(!dregion || !dregion->bitmap || !pdnode);
+
+	pos = dpage - dregion->dpage_start_pfn;
+	dregion->next_free_pos = min(dregion->next_free_pos, pos);
+
+	/* it is not possible to span multiple regions */
+	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+
+	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+
+	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
+	mutex_unlock(&dmem_pool.lock);
+}
+EXPORT_SYMBOL(dmem_free_pages);
+
-- 
1.8.3.1


* [RFC V2 04/37] dmem: let pat recognize dmem
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (2 preceding siblings ...)
  2020-12-07 11:30 ` [RFC V2 03/37] dmem: implement dmem memory management yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:30 ` [RFC V2 05/37] dmemfs: support mmap for dmemfs yulei.kernel
                   ` (33 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

x86 PAT assumes a 'struct page' exists whenever a range is system RAM,
however that is not true if dmem is used. Teach PAT to recognize
this case: the range is RAM but !pfn_valid().

We always use WB for dmem, and any attempt to change this
behavior is rejected and triggers a WARN_ON().

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/mm/pat/memtype.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 8f665c3..fd8a298 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -511,6 +511,13 @@ static int reserve_ram_pages_type(u64 start, u64 end,
 	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
 		enum page_cache_mode type;
 
+		/*
+		 * it's dmem if it's RAM but not backed by 'struct page';
+		 * we always use WB
+		 */
+		if (WARN_ON(!pfn_valid(pfn)))
+			return -EBUSY;
+
 		page = pfn_to_page(pfn);
 		type = get_page_memtype(page);
 		if (type != _PAGE_CACHE_MODE_WB) {
@@ -539,6 +546,13 @@ static int free_ram_pages_type(u64 start, u64 end)
 	u64 pfn;
 
 	for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) {
+		/*
+		 * it's dmem, see the comments in
+		 * reserve_ram_pages_type()
+		 */
+		if (WARN_ON(!pfn_valid(pfn)))
+			continue;
+
 		page = pfn_to_page(pfn);
 		set_page_memtype(page, _PAGE_CACHE_MODE_WB);
 	}
@@ -714,6 +728,13 @@ static enum page_cache_mode lookup_memtype(u64 paddr)
 	if (pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) {
 		struct page *page;
 
+		/*
+		 * dmem always uses WB, see the comments in
+		 * reserve_ram_pages_type()
+		 */
+		if (!pfn_valid(paddr >> PAGE_SHIFT))
+			return rettype;
+
 		page = pfn_to_page(paddr >> PAGE_SHIFT);
 		return get_page_memtype(page);
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 05/37] dmemfs: support mmap for dmemfs
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (3 preceding siblings ...)
  2020-12-07 11:30 ` [RFC V2 04/37] dmem: let pat recognize dmem yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:30 ` [RFC V2 06/37] dmemfs: support truncating inode down yulei.kernel
                   ` (32 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

This patch adds mmap support. Note that the file will be extended if
the mapping goes beyond the file's current size, which drops the
requirement for a write() operation; truncating the file down is not
supported yet.
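
For illustration, a minimal userspace sketch of the intended usage; the
mount point, file name and the 2MB page size below are assumptions for
the example, not something this patch defines:

  /* sketch: map memory from a dmemfs file, assuming something like
   *   mount -t dmemfs none /mnt/dmemfs -o pagesize=2M
   */
  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          size_t len = 2UL << 20;         /* one assumed 2MB dpage */
          int fd = open("/mnt/dmemfs/guest-mem", O_RDWR | O_CREAT, 0600);
          char *p;

          if (fd < 0 || ftruncate(fd, len) < 0)
                  return 1;

          /* must be MAP_SHARED, offset aligned to the dpage size */
          p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
          if (p == MAP_FAILED)
                  return 1;

          memset(p, 0, len);              /* first touch faults in a dpage */
          munmap(p, len);
          close(fd);
          return 0;
  }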

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c    | 343 ++++++++++++++++++++++++++++++++++++++++++++++++++-
 include/linux/dmem.h |  10 ++
 2 files changed, 351 insertions(+), 2 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 0aa3d3b..7b6e51d 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -26,6 +26,7 @@
 #include <linux/pagevec.h>
 #include <linux/fs_parser.h>
 #include <linux/seq_file.h>
+#include <linux/dmem.h>
 
 MODULE_AUTHOR("Tencent Corporation");
 MODULE_LICENSE("GPL v2");
@@ -102,7 +103,255 @@ static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
 	.getattr = simple_getattr,
 };
 
+static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff)
+{
+	struct super_block *sb = inode->i_sb;
+
+	return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT);
+}
+
+static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr)
+{
+	struct super_block *sb = inode->i_sb;
+
+	addr >>= sb->s_blocksize_bits;
+	return xa_mk_value(addr);
+}
+
+static phys_addr_t dmem_entry_to_addr(struct inode *inode, void *entry)
+{
+	struct super_block *sb = inode->i_sb;
+
+	WARN_ON(!xa_is_value(entry));
+	return xa_to_value(entry) << sb->s_blocksize_bits;
+}
+
+static unsigned long
+dmem_addr_to_pfn(struct inode *inode, phys_addr_t addr, pgoff_t pgoff,
+		 unsigned int fault_shift)
+{
+	struct super_block *sb = inode->i_sb;
+	unsigned long pfn = addr >> PAGE_SHIFT;
+	unsigned long mask;
+
+	mask = (1UL << ((unsigned int)sb->s_blocksize_bits - fault_shift)) - 1;
+	mask <<= fault_shift - PAGE_SHIFT;
+
+	return pfn + (pgoff & mask);
+}
+
+static inline unsigned long dmem_page_size(struct inode *inode)
+{
+	return inode->i_sb->s_blocksize;
+}
+
+static int check_inode_size(struct inode *inode, loff_t offset)
+{
+	WARN_ON_ONCE(!rcu_read_lock_held());
+
+	if (offset >= i_size_read(inode))
+		return -EINVAL;
+
+	return 0;
+}
+
+static unsigned
+dmemfs_find_get_entries(struct address_space *mapping, unsigned long start,
+			unsigned int nr_entries, void **entries,
+			unsigned long *indices)
+{
+	XA_STATE(xas, &mapping->i_pages, start);
+
+	void *entry;
+	unsigned int ret = 0;
+
+	if (!nr_entries)
+		return 0;
+
+	rcu_read_lock();
+
+	xas_for_each(&xas, entry, ULONG_MAX) {
+		if (xas_retry(&xas, entry))
+			continue;
+
+		if (xa_is_value(entry))
+			goto export;
+
+		if (unlikely(entry != xas_reload(&xas)))
+			goto retry;
+
+export:
+		indices[ret] = xas.xa_index;
+		entries[ret] = entry;
+		if (++ret == nr_entries)
+			break;
+		continue;
+retry:
+		xas_reset(&xas);
+	}
+	rcu_read_unlock();
+	return ret;
+}
+
+static void *find_radix_entry_or_next(struct address_space *mapping,
+				      unsigned long start,
+				      unsigned long *eindex)
+{
+	void *entry = NULL;
+
+	dmemfs_find_get_entries(mapping, start, 1, &entry, eindex);
+	return entry;
+}
+
+/*
+ * find the entry in the radix tree based on the page offset, and
+ * create it if it does not exist
+ *
+ * return the entry with the RCU read lock held, otherwise an ERR_PTR()
+ * is returned
+ */
+static void *
+radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr,
+		       struct inode *inode, pgoff_t pgoff)
+{
+	struct address_space *mapping = inode->i_mapping;
+	unsigned long eindex, index;
+	loff_t offset;
+	phys_addr_t addr;
+	gfp_t gfp_masks = mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM;
+	void *entry;
+	unsigned int try_dpages, dpages;
+	int ret;
+
+retry:
+	offset = ((loff_t)pgoff << PAGE_SHIFT);
+	index = dmem_pgoff_to_index(inode, pgoff);
+	rcu_read_lock();
+	ret = check_inode_size(inode, offset);
+	if (ret) {
+		rcu_read_unlock();
+		return ERR_PTR(ret);
+	}
+
+	try_dpages = dmem_pgoff_to_index(inode, (i_size_read(inode) - offset)
+				     >> PAGE_SHIFT);
+	entry = find_radix_entry_or_next(mapping, index, &eindex);
+	if (entry) {
+		WARN_ON(!xa_is_value(entry));
+		if (eindex == index)
+			return entry;
+
+		WARN_ON(eindex <= index);
+		try_dpages = eindex - index;
+	}
+	rcu_read_unlock();
+
+	/* entry does not exist, create it */
+	addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages);
+	if (!addr) {
+		/*
+		 * do not return -ENOMEM as that would trigger the OOM killer,
+		 * which is useless for reclaiming dmem pages
+		 */
+		ret = -EINVAL;
+		goto exit;
+	}
+
+	try_dpages = dpages;
+	while (dpages) {
+		rcu_read_lock();
+		ret = check_inode_size(inode, offset);
+		if (ret)
+			goto unlock_rcu;
+
+		entry = dmem_addr_to_entry(inode, addr);
+		entry = xa_store(&mapping->i_pages, index, entry, gfp_masks);
+		if (!xa_is_err(entry)) {
+			addr += inode->i_sb->s_blocksize;
+			offset += inode->i_sb->s_blocksize;
+			dpages--;
+			mapping->nrexceptional++;
+			index++;
+		}
+
+unlock_rcu:
+		rcu_read_unlock();
+		if (ret)
+			break;
+	}
+
+	if (dpages)
+		dmem_free_pages(addr, dpages);
+
+	/* we have created some entries, let's retry it */
+	if (ret == -EEXIST || try_dpages != dpages)
+		goto retry;
+exit:
+	return ERR_PTR(ret);
+}
+
+static void radix_put_entry(void)
+{
+	rcu_read_unlock();
+}
+
+static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode = file_inode(vma->vm_file);
+	phys_addr_t addr;
+	void *entry;
+	int ret;
+
+	if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return VM_FAULT_SIGBUS;
+
+	entry = radix_get_create_entry(vma, (unsigned long)vmf->address,
+				       inode, vmf->pgoff);
+	if (IS_ERR(entry)) {
+		ret = PTR_ERR(entry);
+		goto exit;
+	}
+
+	addr = dmem_entry_to_addr(inode, entry);
+	ret = vmf_insert_pfn(vma, (unsigned long)vmf->address,
+			    dmem_addr_to_pfn(inode, addr, vmf->pgoff,
+					     PAGE_SHIFT));
+	radix_put_entry();
+
+exit:
+	return ret;
+}
+
+static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
+{
+	return dmem_page_size(file_inode(vma->vm_file));
+}
+
+static const struct vm_operations_struct dmemfs_vm_ops = {
+	.fault = dmemfs_fault,
+	.pagesize = dmemfs_pagesize,
+};
+
+int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+
+	if (vma->vm_pgoff & ((dmem_page_size(inode) - 1) >> PAGE_SHIFT))
+		return -EINVAL;
+
+	if (!(vma->vm_flags & VM_SHARED))
+		return -EINVAL;
+
+	vma->vm_flags |= VM_PFNMAP;
+
+	file_accessed(file);
+	vma->vm_ops = &dmemfs_vm_ops;
+	return 0;
+}
+
 static const struct file_operations dmemfs_file_operations = {
+	.mmap = dmemfs_file_mmap,
 };
 
 static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
@@ -180,9 +429,86 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
+/*
+ * the caller should make sure the dmem pages in the dropped region are
+ * not being mapped by any process
+ */
+static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end)
+{
+	struct address_space *mapping = inode->i_mapping;
+	struct pagevec pvec;
+	unsigned long istart, iend, indices[PAGEVEC_SIZE];
+	int i;
+
+	/* we never use normal pages */
+	WARN_ON(mapping->nrpages);
+
+	/* if no dpage is allocated for the inode */
+	if (!mapping->nrexceptional)
+		return;
+
+	istart = dmem_pgoff_to_index(inode, start >> PAGE_SHIFT);
+	iend = dmem_pgoff_to_index(inode, end >> PAGE_SHIFT);
+	pagevec_init(&pvec);
+	while (istart < iend) {
+		pvec.nr = dmemfs_find_get_entries(mapping, istart,
+				min(iend - istart,
+				(unsigned long)PAGEVEC_SIZE),
+				(void **)pvec.pages,
+				indices);
+		if (!pvec.nr)
+			break;
+
+		for (i = 0; i < pagevec_count(&pvec); i++) {
+			phys_addr_t addr;
+
+			istart = indices[i];
+			if (istart >= iend)
+				break;
+
+			xa_erase(&mapping->i_pages, istart);
+			mapping->nrexceptional--;
+
+			addr = dmem_entry_to_addr(inode, pvec.pages[i]);
+			dmem_free_page(addr);
+		}
+
+		/*
+		 * only exceptional entries are in the pagevec, so it's
+		 * safe to reinit it
+		 */
+		pagevec_reinit(&pvec);
+		cond_resched();
+		istart++;
+	}
+}
+
+static void dmemfs_evict_inode(struct inode *inode)
+{
+	/* no VMA works on it */
+	WARN_ON(!RB_EMPTY_ROOT(&inode->i_data.i_mmap.rb_root));
+
+	inode_drop_dpages(inode, 0, LLONG_MAX);
+	clear_inode(inode);
+}
+
+/*
+ * Display the mount options in /proc/mounts.
+ */
+static int dmemfs_show_options(struct seq_file *m, struct dentry *root)
+{
+	struct dmemfs_fs_info *fsi = root->d_sb->s_fs_info;
+
+	if (check_dpage_size(fsi->mount_opts.dpage_size))
+		seq_printf(m, ",pagesize=%lx", fsi->mount_opts.dpage_size);
+	return 0;
+}
+
 static const struct super_operations dmemfs_ops = {
 	.statfs	= dmemfs_statfs,
+	.evict_inode = dmemfs_evict_inode,
 	.drop_inode = generic_delete_inode,
+	.show_options = dmemfs_show_options,
 };
 
 static int
@@ -190,6 +516,7 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
 	struct inode *inode;
 	struct dmemfs_fs_info *fsi = sb->s_fs_info;
+	int ret;
 
 	sb->s_maxbytes = MAX_LFS_FILESIZE;
 	sb->s_blocksize = fsi->mount_opts.dpage_size;
@@ -198,11 +525,17 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 	sb->s_op = &dmemfs_ops;
 	sb->s_time_gran = 1;
 
+	ret = dmem_alloc_init(sb->s_blocksize_bits);
+	if (ret)
+		return ret;
+
 	inode = dmemfs_get_inode(sb, NULL, S_IFDIR);
 	sb->s_root = d_make_root(inode);
-	if (!sb->s_root)
-		return -ENOMEM;
 
+	if (!sb->s_root) {
+		dmem_alloc_uinit();
+		return -ENOMEM;
+	}
 	return 0;
 }
 
@@ -238,7 +571,13 @@ int dmemfs_init_fs_context(struct fs_context *fc)
 
 static void dmemfs_kill_sb(struct super_block *sb)
 {
+	bool has_inode = !!sb->s_root;
+
 	kill_litter_super(sb);
+
+	/* do not uninit dmem allocator if mount failed */
+	if (has_inode)
+		dmem_alloc_uinit();
 }
 
 static struct file_system_type dmemfs_fs_type = {
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 476a82e..8682d63 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -10,6 +10,16 @@
 int dmem_alloc_init(unsigned long dpage_shift);
 void dmem_alloc_uinit(void);
 
+phys_addr_t
+dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max,
+			  unsigned int *result_nr);
+
+phys_addr_t
+dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr,
+		     unsigned int try_max, unsigned int *result_nr);
+
+void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
+#define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 #else
 static inline int dmem_reserve_init(void)
 {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 06/37] dmemfs: support truncating inode down
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (4 preceding siblings ...)
  2020-12-07 11:30 ` [RFC V2 05/37] dmemfs: support mmap for dmemfs yulei.kernel
@ 2020-12-07 11:30 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 07/37] dmem: trace core functions yulei.kernel
                   ` (31 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:30 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Supporting truncating the inode down introduces a race
between the page fault handler and the truncate handler,
as the entry to be deleted may still be mapped into a
process's VMA.

In order to keep the page fault fast (as it is the hot
path), we use RCU to sync these two handlers. When the
inode's size is updated, the truncate handler makes sure
the new size is visible to the page fault handler, which
will no longer use a truncated entry and will not create
a new entry in that region.
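
For illustration, a small userspace sketch of the user-visible behavior;
the path and 2MB dpage size are the same assumptions used in the previous
patch's sketch:

  /* sketch: shrink a dmemfs file; the new size must be dpage-aligned */
  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/mnt/dmemfs/guest-mem", O_RDWR);

          if (fd < 0)
                  return 1;

          /* 1MB is not 2MB-aligned: dmemfs_truncate() returns -EINVAL */
          if (ftruncate(fd, 1UL << 20) == 0)
                  return 1;

          /* aligned shrink: truncated dpages are unmapped and freed */
          if (ftruncate(fd, 0) < 0)
                  return 1;

          close(fd);
          return 0;
  }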

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 66 insertions(+), 1 deletion(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 7b6e51d..9ec62dc 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -98,8 +98,73 @@ static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry,
 	.rename		= simple_rename,
 };
 
+static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end);
+
+static int dmemfs_truncate(struct inode *inode, loff_t newsize)
+{
+	struct super_block *sb = inode->i_sb;
+	loff_t current_size;
+
+	if (newsize & ((1 << sb->s_blocksize_bits) - 1))
+		return -EINVAL;
+
+	current_size = i_size_read(inode);
+	i_size_write(inode, newsize);
+
+	if (newsize >= current_size)
+		return 0;
+
+	/* it cuts the inode down */
+
+	/*
+	 * we should make sure inode->i_size has been updated before
+	 * unmapping and dropping radix entries, so that other sides
+	 * can not create new i_mapping entry beyond inode->i_size
+	 * and the radix entry in the truncated region is not being
+	 * used
+	 *
+	 * see the comments in dmemfs_fault()
+	 */
+	synchronize_rcu();
+
+	/*
+	 * should unmap all mapping first as dmem pages are freed in
+	 * inode_drop_dpages()
+	 *
+	 * after that, dmem page in the truncated region is not used
+	 * by any process
+	 */
+	unmap_mapping_range(inode->i_mapping, newsize, 0, 1);
+
+	inode_drop_dpages(inode, newsize, LLONG_MAX);
+	return 0;
+}
+
+/*
+ * same logic as simple_setattr but we need to handle ftruncate
+ * carefully as we inserted self-defined entry into radix tree
+ */
+static int dmemfs_setattr(struct dentry *dentry, struct iattr *iattr)
+{
+	struct inode *inode = dentry->d_inode;
+	int error;
+
+	error = setattr_prepare(dentry, iattr);
+	if (error)
+		return error;
+
+	if (iattr->ia_valid & ATTR_SIZE) {
+		error = dmemfs_truncate(inode, iattr->ia_size);
+		if (error)
+			return error;
+	}
+	setattr_copy(inode, iattr);
+	mark_inode_dirty(inode);
+	return 0;
+}
+
 static const struct inode_operations dmemfs_file_inode_operations = {
-	.setattr = simple_setattr,
+	.setattr = dmemfs_setattr,
 	.getattr = simple_getattr,
 };
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 07/37] dmem: trace core functions
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (5 preceding siblings ...)
  2020-12-07 11:30 ` [RFC V2 06/37] dmemfs: support truncating inode down yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 08/37] dmem: show some statistic in debugfs yulei.kernel
                   ` (30 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Add tracepoints for the dmem alloc_init, alloc and free functions;
they help us figure out what is happening inside the dmem
allocator.
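
As an illustration of how the events might be consumed, a small sketch;
the tracefs mount point is the usual default and is an assumption, not
something this patch provides:

  /* sketch: switch on the dmem/dmemfs trace events from a helper tool */
  #include <stdio.h>

  static int enable(const char *path)
  {
          FILE *f = fopen(path, "w");

          if (!f)
                  return -1;
          fputs("1", f);          /* "1" enables the whole event group */
          return fclose(f);
  }

  int main(void)
  {
          enable("/sys/kernel/tracing/events/dmem/enable");
          enable("/sys/kernel/tracing/events/dmemfs/enable");
          return 0;       /* records then appear in .../tracing/trace */
  }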

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/Makefile          |  1 +
 fs/dmemfs/inode.c           |  5 ++++
 fs/dmemfs/trace.h           | 54 +++++++++++++++++++++++++++++++++++
 include/trace/events/dmem.h | 68 +++++++++++++++++++++++++++++++++++++++++++++
 mm/dmem.c                   |  6 ++++
 5 files changed, 134 insertions(+)
 create mode 100644 fs/dmemfs/trace.h
 create mode 100644 include/trace/events/dmem.h

diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile
index 73bdc9c..0b36d03 100644
--- a/fs/dmemfs/Makefile
+++ b/fs/dmemfs/Makefile
@@ -2,6 +2,7 @@
 #
 # Makefile for the linux dmem-filesystem routines.
 #
+ccflags-y += -I $(srctree)/$(src)		# needed for trace events
 obj-$(CONFIG_DMEM_FS) += dmemfs.o
 
 dmemfs-y += inode.o
diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 9ec62dc..7723b58 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -31,6 +31,9 @@
 MODULE_AUTHOR("Tencent Corporation");
 MODULE_LICENSE("GPL v2");
 
+#define CREATE_TRACE_POINTS
+#include "trace.h"
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -336,6 +339,7 @@ static void *find_radix_entry_or_next(struct address_space *mapping,
 			offset += inode->i_sb->s_blocksize;
 			dpages--;
 			mapping->nrexceptional++;
+			trace_dmemfs_radix_tree_insert(index, entry);
 			index++;
 		}
 
@@ -532,6 +536,7 @@ static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end)
 				break;
 
 			xa_erase(&mapping->i_pages, istart);
+			trace_dmemfs_radix_tree_delete(istart, pvec.pages[i]);
 			mapping->nrexceptional--;
 
 			addr = dmem_entry_to_addr(inode, pvec.pages[i]);
diff --git a/fs/dmemfs/trace.h b/fs/dmemfs/trace.h
new file mode 100644
index 00000000..cc11653
--- /dev/null
+++ b/fs/dmemfs/trace.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * trace.h - dmemfs tracepoint definitions
+ *
+ * Copyright (C)
+ *
+ * Author: Xiao Guangrong <xiaoguangrong@tencent.com>
+ */
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dmemfs
+
+#if !defined(_TRACE_DMEMFS_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DMEMFS_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(dmemfs_radix_tree_class,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry),
+
+	TP_STRUCT__entry(
+		__field(unsigned long,	index)
+		__field(void *, rentry)
+	),
+
+	TP_fast_assign(
+		__entry->index = index;
+		__entry->rentry = rentry;
+	),
+
+	TP_printk("index %lu entry %#lx", __entry->index,
+		  (unsigned long)__entry->rentry)
+);
+
+DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_insert,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry)
+);
+
+DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_delete,
+	TP_PROTO(unsigned long index, void *rentry),
+	TP_ARGS(index, rentry)
+);
+#endif
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+
+#undef TRACE_INCLUDE_FILE
+#define TRACE_INCLUDE_FILE trace
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h
new file mode 100644
index 00000000..10d1b90
--- /dev/null
+++ b/include/trace/events/dmem.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM dmem
+
+#if !defined(_TRACE_DMEM_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DMEM_H
+
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(dmem_alloc_init,
+	TP_PROTO(unsigned long dpage_shift),
+	TP_ARGS(dpage_shift),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, dpage_shift)
+	),
+
+	TP_fast_assign(
+		__entry->dpage_shift = dpage_shift;
+	),
+
+	TP_printk("dpage_shift %lu", __entry->dpage_shift)
+);
+
+TRACE_EVENT(dmem_alloc_pages_node,
+	TP_PROTO(phys_addr_t addr, int node, int try_max, int result_nr),
+	TP_ARGS(addr, node, try_max, result_nr),
+
+	TP_STRUCT__entry(
+		__field(phys_addr_t, addr)
+		__field(int, node)
+		__field(int, try_max)
+		__field(int, result_nr)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+		__entry->node = node;
+		__entry->try_max = try_max;
+		__entry->result_nr = result_nr;
+	),
+
+	TP_printk("addr %#lx node %d try_max %d result_nr %d",
+		  (unsigned long)__entry->addr, __entry->node,
+		  __entry->try_max, __entry->result_nr)
+);
+
+TRACE_EVENT(dmem_free_pages,
+	TP_PROTO(phys_addr_t addr, int dpages_nr),
+	TP_ARGS(addr, dpages_nr),
+
+	TP_STRUCT__entry(
+		__field(phys_addr_t, addr)
+		__field(int, dpages_nr)
+	),
+
+	TP_fast_assign(
+		__entry->addr = addr;
+		__entry->dpages_nr = dpages_nr;
+	),
+
+	TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr,
+		  __entry->dpages_nr)
+);
+#endif
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/dmem.c b/mm/dmem.c
index a77a064..aa34bf2 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -18,6 +18,8 @@
 #include <linux/debugfs.h>
 #include <linux/notifier.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/dmem.h>
 /*
  * There are two kinds of page in dmem management:
  * - nature page, it's the CPU's page size, i.e, 4K on x86
@@ -559,6 +561,8 @@ int dmem_alloc_init(unsigned long dpage_shift)
 
 	mutex_lock(&dmem_pool.lock);
 
+	trace_dmem_alloc_init(dpage_shift);
+
 	if (dmem_pool.dpage_shift) {
 		/*
 		 * double init on the same page size is okay
@@ -686,6 +690,7 @@ int dmem_alloc_init(unsigned long dpage_shift)
 			}
 		}
 
+		trace_dmem_alloc_pages_node(addr, node, try_max, *result_nr);
 		mutex_unlock(&dmem_pool.lock);
 	}
 	return addr;
@@ -791,6 +796,7 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 
 	mutex_lock(&dmem_pool.lock);
 
+	trace_dmem_free_pages(addr, dpages_nr);
 	WARN_ON(!dmem_pool.dpage_shift);
 
 	dregion = find_dmem_region(addr, &pdnode);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 08/37] dmem: show some statistic in debugfs
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (6 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 07/37] dmem: trace core functions yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 09/37] dmemfs: support remote access yulei.kernel
                   ` (29 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Create a 'dmem' directory under debugfs and expose some
statistics for the dmem pool, tracking total and free dpages
for the whole pool and for each NUMA node.
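
For illustration, a userspace sketch that reads one of the new counters;
it assumes debugfs is mounted at the conventional /sys/kernel/debug:

  /* sketch: print the number of free dpages in the dmem pool */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long free_dpages;
          FILE *f = fopen("/sys/kernel/debug/dmem/free_dpages", "r");

          if (!f || fscanf(f, "%llu", &free_dpages) != 1)
                  return 1;

          printf("free dpages: %llu\n", free_dpages);
          fclose(f);
          return 0;
  }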

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/Kconfig |   8 +++++
 mm/dmem.c  | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 107 insertions(+), 1 deletion(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 3a6d408..4dd8896 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -234,6 +234,14 @@ config DMEM
 	  Allow reservation of memory which could be for the dedicated use of dmem.
 	  It's the basis of dmemfs.
 
+config DMEM_DEBUG_FS
+	bool "Enable debug information for direct memory"
+	depends on DMEM && DEBUG_FS
+	help
+	  This option enables showing various statistics of direct memory
+	  in the debugfs filesystem.
+
+#
 # support for memory compaction
 config COMPACTION
 	bool "Allow for memory compaction"
diff --git a/mm/dmem.c b/mm/dmem.c
index aa34bf2..6992e57 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -164,6 +164,103 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
 	return 0;
 }
 
+#ifdef CONFIG_DMEM_DEBUG_FS
+struct debugfs_entry {
+	const char *name;
+	unsigned long offset;
+};
+
+#define DMEM_POOL_OFFSET(x)	offsetof(struct dmem_pool, x)
+#define DMEM_POOL_ENTRY(x)	{__stringify(x), DMEM_POOL_OFFSET(x)}
+
+#define DMEM_NODE_OFFSET(x)	offsetof(struct dmem_node, x)
+#define DMEM_NODE_ENTRY(x)	{__stringify(x), DMEM_NODE_OFFSET(x)}
+
+static struct debugfs_entry dmem_pool_entries[] = {
+	DMEM_POOL_ENTRY(region_num),
+	DMEM_POOL_ENTRY(registered_pages),
+	DMEM_POOL_ENTRY(unaligned_pages),
+	DMEM_POOL_ENTRY(dpage_shift),
+	DMEM_POOL_ENTRY(total_dpages),
+	DMEM_POOL_ENTRY(free_dpages),
+};
+
+static struct debugfs_entry dmem_node_entries[] = {
+	DMEM_NODE_ENTRY(total_dpages),
+	DMEM_NODE_ENTRY(free_dpages),
+};
+
+static int dmem_entry_get(void *offset, u64 *val)
+{
+	*val = *(u64 *)offset;
+	return 0;
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(dmem_fops, dmem_entry_get, NULL, "%llu\n");
+
+static int dmemfs_init_debugfs_node(struct dmem_node *dnode,
+				    struct dentry *parent)
+{
+	struct dentry *node_dir;
+	char dir_name[32];
+	int i, ret = -EEXIST;
+
+	snprintf(dir_name, sizeof(dir_name), "node%ld",
+		 dnode - dmem_pool.nodes);
+	node_dir = debugfs_create_dir(dir_name, parent);
+	if (!node_dir)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(dmem_node_entries); i++)
+		if (!debugfs_create_file(dmem_node_entries[i].name, 0444,
+		   node_dir, (void *)dnode + dmem_node_entries[i].offset,
+		   &dmem_fops))
+			return ret;
+	return 0;
+}
+
+static int dmemfs_init_debugfs(void)
+{
+	struct dentry *dmem_debugfs_dir;
+	struct dmem_node *dnode;
+	int i, ret = -EEXIST;
+
+	dmem_debugfs_dir = debugfs_create_dir("dmem", NULL);
+	if (!dmem_debugfs_dir)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(dmem_pool_entries); i++)
+		if (!debugfs_create_file(dmem_pool_entries[i].name, 0444,
+		   dmem_debugfs_dir,
+		   (void *)&dmem_pool + dmem_pool_entries[i].offset,
+		   &dmem_fops))
+			goto exit;
+
+	for_each_dmem_node(dnode) {
+		/*
+		 * do not create debugfs files for the node
+		 * where no memory is available
+		 */
+		if (list_empty(&dnode->regions))
+			continue;
+
+		if (dmemfs_init_debugfs_node(dnode, dmem_debugfs_dir))
+			goto exit;
+	}
+
+	return 0;
+exit:
+	debugfs_remove_recursive(dmem_debugfs_dir);
+	return ret;
+}
+
+#else
+static int dmemfs_init_debugfs(void)
+{
+	return 0;
+}
+#endif
+
 #define PENALTY_FOR_DMEM_SHARED_NODE		(1)
 
 static int dmem_nodeload[MAX_NUMNODES] __initdata;
@@ -364,7 +461,8 @@ static int __init dmem_late_init(void)
 				goto exit;
 		}
 	}
-	return ret;
+
+	return dmemfs_init_debugfs();
 exit:
 	dmem_uinit();
 	return ret;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 09/37] dmemfs: support remote access
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (7 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 08/37] dmem: show some statistic in debugfs yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 10/37] dmemfs: introduce max_alloc_try_dpages parameter yulei.kernel
                   ` (28 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

Remote access is required by ptrace_writedata and ptrace_readdata to
reach dmem memory in another process. The typical user is gdb: after
this patch, gdb is able to read and write dmem-backed memory owned by
the attached process.
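
A minimal sketch of the kind of access this enables; the pid and address
are placeholders and the address is assumed to fall inside the target's
dmemfs mapping:

  /* sketch: peek one word of another process's dmemfs mapping */
  #include <sys/ptrace.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static long peek_dmem_word(pid_t pid, unsigned long addr)
  {
          long word;
          int status;

          if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) < 0)
                  return -1;
          waitpid(pid, &status, 0);

          /* ends up in dmemfs_access_dmem() via the ->access vm operation */
          word = ptrace(PTRACE_PEEKDATA, pid, (void *)addr, NULL);

          ptrace(PTRACE_DETACH, pid, NULL, NULL);
          return word;
  }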

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 7723b58..3192f31 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -364,6 +364,51 @@ static void radix_put_entry(void)
 	rcu_read_unlock();
 }
 
+static bool check_vma_access(struct vm_area_struct *vma, int write)
+{
+	vm_flags_t vm_flags = write ? VM_WRITE : VM_READ;
+
+	return !!(vm_flags & vma->vm_flags);
+}
+
+static int
+dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr,
+		   void *buf, int len, int write)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct super_block *sb = inode->i_sb;
+	void *entry, *maddr;
+	int offset, pgoff;
+
+	if (!check_vma_access(vma, write))
+		return -EACCES;
+
+	pgoff = linear_page_index(vma, addr);
+	if (pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
+		return -EFAULT;
+
+	entry = radix_get_create_entry(vma, addr, inode, pgoff);
+	if (IS_ERR(entry))
+		return PTR_ERR(entry);
+
+	offset = addr & (sb->s_blocksize - 1);
+	addr = dmem_entry_to_addr(inode, entry);
+
+	/*
+	 * it is not beyond vma's region as the vma should be aligned
+	 * to blocksize
+	 */
+	len = min(len, (int)(sb->s_blocksize - offset));
+	maddr = __va(addr);
+	if (write)
+		memcpy(maddr + offset, buf, len);
+	else
+		memcpy(buf, maddr + offset, len);
+	radix_put_entry();
+
+	return len;
+}
+
 static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -400,6 +445,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 static const struct vm_operations_struct dmemfs_vm_ops = {
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
+	.access = dmemfs_access_dmem,
 };
 
 int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 10/37] dmemfs: introduce max_alloc_try_dpages parameter
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (8 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 09/37] dmemfs: support remote access yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 11/37] mm: export mempolicy interfaces to serve dmem allocator yulei.kernel
                   ` (27 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

It specifies the number of dmem pages allocated at one time, so that
multiple radix entries can be created per fault. That relieves the
allocation pressure and makes page faults faster.

However, that could leave no dmem page mapped to userspace
even if there are some free dmem pages.

Set it to 1 to completely disable this behavior.

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 41 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 3192f31..443f2e1 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -34,6 +34,8 @@
 #define CREATE_TRACE_POINTS
 #include "trace.h"
 
+static uint __read_mostly max_alloc_try_dpages = 1;
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -46,6 +48,44 @@ enum dmemfs_param {
 	Opt_dpagesize,
 };
 
+static int
+max_alloc_try_dpages_set(const char *val, const struct kernel_param *kp)
+{
+	uint sval;
+	int ret;
+
+	ret = kstrtouint(val, 0, &sval);
+	if (ret)
+		return ret;
+
+	/* should be 1 at least */
+	if (!sval)
+		return -EINVAL;
+
+	max_alloc_try_dpages = sval;
+	return 0;
+}
+
+static struct kernel_param_ops alloc_max_try_dpages_ops = {
+	.set = max_alloc_try_dpages_set,
+	.get = param_get_uint,
+};
+
+/*
+ * it specifies the number of dmem pages allocated at one time, so that
+ * multiple radix entries can be created. That relieves the
+ * allocation pressure and makes page faults faster.
+ *
+ * however, that could leave no dmem page mapped to userspace
+ * even if there are some free dmem pages
+ *
+ * set it to 1 to completely disable this behavior
+ */
+fs_param_cb(max_alloc_try_dpages, &alloc_max_try_dpages_ops,
+	    &max_alloc_try_dpages, 0644);
+__MODULE_PARM_TYPE(max_alloc_try_dpages, "uint");
+MODULE_PARM_DESC(max_alloc_try_dpages, "Set the dmem page number allocated at one time, should be 1 at least");
+
 const struct fs_parameter_spec dmemfs_fs_parameters[] = {
 	fsparam_string("pagesize", Opt_dpagesize),
 	{}
@@ -314,6 +354,7 @@ static void *find_radix_entry_or_next(struct address_space *mapping,
 	}
 	rcu_read_unlock();
 
+	try_dpages = min(try_dpages, max_alloc_try_dpages);
 	/* entry does not exist, create it */
 	addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages);
 	if (!addr) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 11/37] mm: export mempolicy interfaces to serve dmem allocator
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (9 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 10/37] dmemfs: introduce max_alloc_try_dpages parameter yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 12/37] dmem: introduce mempolicy support yulei.kernel
                   ` (26 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Export the mempolicy interfaces get_vma_policy() and interleave_nid()
to serve the dmem allocator.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/mempolicy.h | 3 +++
 mm/mempolicy.c            | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5f1c74d..4789661 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,9 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr);
+unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma,
+			unsigned long addr, int shift);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 3ca4898..efd80e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1813,7 +1813,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task.  It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 						unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
@@ -1978,7 +1978,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
 }
 
 /* Determine a node number for interleave */
-static inline unsigned interleave_nid(struct mempolicy *pol,
+unsigned interleave_nid(struct mempolicy *pol,
 		 struct vm_area_struct *vma, unsigned long addr, int shift)
 {
 	if (vma) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 12/37] dmem: introduce mempolicy support
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (10 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 11/37] mm: export mempolicy interfaces to serve dmem allocator yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 13/37] mm, dmem: introduce PFN_DMEM and pfn_t_dmem yulei.kernel
                   ` (25 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

Add mempolicy support to dmem so that it allocates memory from the
nodes specified by the VMA's mempolicy.
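
For illustration, a hedged userspace sketch that sets an interleave
policy on a dmemfs mapping before the first fault; mbind() from libnuma's
<numaif.h> and nodes 0-1 are assumptions, not requirements of this patch:

  /* sketch: interleave the dpages of a dmemfs mapping across nodes 0-1 */
  #include <numaif.h>
  #include <stddef.h>

  static long interleave_dmem(void *map, size_t len)
  {
          unsigned long nodemask = (1UL << 0) | (1UL << 1);

          /* consulted by dmem_alloc_pages_vma() on the next page fault */
          return mbind(map, len, MPOL_INTERLEAVE, &nodemask,
                       sizeof(nodemask) * 8, 0);
  }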

Signed-off-by: Haiwei Li   <gerryhwli@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/Kconfig                     |  1 +
 arch/x86/include/asm/pgtable.h       |  7 ++++
 arch/x86/include/asm/pgtable_types.h | 13 +++++++-
 fs/dmemfs/Kconfig                    |  3 ++
 include/linux/pgtable.h              |  7 ++++
 mm/Kconfig                           |  3 ++
 mm/dmem.c                            | 63 ++++++++++++++++++++++++++++++++++--
 7 files changed, 94 insertions(+), 3 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f6946b8..9ccee76 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -73,6 +73,7 @@ config X86
 	select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 	select ARCH_HAS_PMEM_API		if X86_64
 	select ARCH_HAS_PTE_DEVMAP		if X86_64
+	select ARCH_HAS_PTE_DMEM		if X86_64
 	select ARCH_HAS_PTE_SPECIAL
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if X86_64
 	select ARCH_HAS_COPY_MC			if X86_64
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c672..dd4aff6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -452,6 +452,13 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd)
 	return pmd_set_flags(pmd, _PAGE_DEVMAP);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pmd_t pmd_mkdmem(pmd_t pmd)
+{
+	return pmd_set_flags(pmd, _PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 static inline pmd_t pmd_mkhuge(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_PSE);
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 816b31c..ee4cae1 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -23,6 +23,15 @@
 #define _PAGE_BIT_SOFTW2	10	/* " */
 #define _PAGE_BIT_SOFTW3	11	/* " */
 #define _PAGE_BIT_PAT_LARGE	12	/* On 2MB or 1GB pages */
+#define _PAGE_BIT_DMEM		57	/* Flag used to indicate a dmem pmd.
+					 * Since _PAGE_BIT_SPECIAL is defined
+					 * the same as _PAGE_BIT_CPA_TEST, we
+					 * cannot use _PAGE_BIT_SPECIAL alone,
+					 * so add _PAGE_BIT_DMEM to help
+					 * indicate it. Since a dmem pte will
+					 * never be split, setting
+					 * _PAGE_BIT_SPECIAL for the pte is enough.
+					 */
 #define _PAGE_BIT_SOFTW4	58	/* available for programmer */
 #define _PAGE_BIT_PKEY_BIT0	59	/* Protection Keys, bit 1/4 */
 #define _PAGE_BIT_PKEY_BIT1	60	/* Protection Keys, bit 2/4 */
@@ -112,9 +121,11 @@
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
 #define _PAGE_NX	(_AT(pteval_t, 1) << _PAGE_BIT_NX)
 #define _PAGE_DEVMAP	(_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_DMEM	(_AT(u64, 1) << _PAGE_BIT_DMEM)
 #else
 #define _PAGE_NX	(_AT(pteval_t, 0))
 #define _PAGE_DEVMAP	(_AT(pteval_t, 0))
+#define _PAGE_DMEM	(_AT(pteval_t, 0))
 #endif
 
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
@@ -128,7 +139,7 @@
 #define _PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |		\
 			 _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY |	\
 			 _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC |  \
-			 _PAGE_UFFD_WP)
+			 _PAGE_UFFD_WP | _PAGE_DMEM)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
 /*
diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig
index d2894a5..19ca391 100644
--- a/fs/dmemfs/Kconfig
+++ b/fs/dmemfs/Kconfig
@@ -1,5 +1,8 @@
 config DMEM_FS
 	tristate "Direct Memory filesystem support"
+	depends on DMEM
+	depends on TRANSPARENT_HUGEPAGE
+	depends on ARCH_HAS_PTE_DMEM
 	help
 	  dmemfs (Direct Memory filesystem) is device memory or reserved
 	  memory based filesystem. This kind of memory is special as it
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 71125a4..9e65694 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1157,6 +1157,13 @@ static inline int pud_trans_unstable(pud_t *pud)
 #endif
 }
 
+#ifndef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pmd_t pmd_mkdmem(pmd_t pmd)
+{
+	return pmd;
+}
+#endif
+
 #ifndef pmd_read_atomic
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 4dd8896..10fd7ff 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -794,6 +794,9 @@ config IDLE_PAGE_TRACKING
 config ARCH_HAS_PTE_DEVMAP
 	bool
 
+config ARCH_HAS_PTE_DMEM
+	bool
+
 config ZONE_DEVICE
 	bool "Device memory (pmem, HMM, etc...) hotplug support"
 	depends on MEMORY_HOTPLUG
diff --git a/mm/dmem.c b/mm/dmem.c
index 6992e57..2e61dbd 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -822,6 +822,56 @@ int dmem_alloc_init(unsigned long dpage_shift)
 }
 EXPORT_SYMBOL(dmem_alloc_pages_nodemask);
 
+/* Return a nodelist indicated for current node representing a mempolicy */
+static int *policy_nodelist(struct mempolicy *policy)
+{
+	int nd = numa_node_id();
+
+	switch (policy->mode) {
+	case MPOL_PREFERRED:
+		if (!(policy->flags & MPOL_F_LOCAL))
+			nd = policy->v.preferred_node;
+		break;
+	case MPOL_BIND:
+		if (unlikely(!node_isset(nd, policy->v.nodes)))
+			nd = first_node(policy->v.nodes);
+		break;
+	default:
+		WARN_ON(1);
+	}
+	return dmem_nodelist(nd);
+}
+
+static nodemask_t *dmem_policy_nodemask(struct mempolicy *policy)
+{
+	if (unlikely(policy->mode == MPOL_BIND) &&
+			cpuset_nodemask_valid_mems_allowed(&policy->v.nodes))
+		return &policy->v.nodes;
+
+	return NULL;
+}
+
+static void
+get_mempolicy_nlist_and_nmask(struct mempolicy *pol,
+			      struct vm_area_struct *vma, unsigned long addr,
+			      int **nl, nodemask_t **nmask)
+{
+	if (pol->mode == MPOL_INTERLEAVE) {
+		unsigned int nid;
+
+		/*
+		 * we use dpage_shift to interleave numa nodes although
+		 * multiple dpages may be allocated
+		 */
+		nid = interleave_nid(pol, vma, addr, dmem_pool.dpage_shift);
+		*nl = dmem_nodelist(nid);
+		*nmask = NULL;
+	} else {
+		*nl = policy_nodelist(pol);
+		*nmask = dmem_policy_nodemask(pol);
+	}
+}
+
 /*
  * dmem_alloc_pages_vma - Allocate pages for a VMA.
  *
@@ -830,6 +880,9 @@ int dmem_alloc_init(unsigned long dpage_shift)
  *   @try_max: try to allocate @try_max dpages if possible
  *   @result_nr: allocated dpage number returned to the caller
  *
+ * This function allocates pages from dmem pool and applies a NUMA policy
+ * associated with the VMA.
+ *
  * Return the physical address of the first dpage allocated from dmem
  * pool, or 0 on failure. The allocated dpage number is filled into
  * @result_nr
@@ -839,13 +892,19 @@ int dmem_alloc_init(unsigned long dpage_shift)
 		     unsigned int try_max, unsigned int *result_nr)
 {
 	phys_addr_t phys_addr;
+	struct mempolicy *pol;
 	int *nl;
+	nodemask_t *nmask;
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
-	nl = dmem_nodelist(numa_node_id());
+	pol = get_vma_policy(vma, addr);
+	cpuset_mems_cookie = read_mems_allowed_begin();
+
+	get_mempolicy_nlist_and_nmask(pol, vma, addr, &nl, &nmask);
+	mpol_cond_put(pol);
 
-	phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max,
+	phys_addr = dmem_alloc_pages_from_nodelist(nl, nmask, try_max,
 						   result_nr);
 	if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 13/37] mm, dmem: introduce PFN_DMEM and pfn_t_dmem
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (11 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 12/37] dmem: introduce mempolicy support yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 14/37] mm, dmem: differentiate dmem-pmd and thp-pmd yulei.kernel
                   ` (24 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce PFN_DMEM as a new pfn flag for dmem pfns; it is defined
as bit (BITS_PER_LONG_LONG - 6).

Introduce the pfn_t_dmem() helper to recognize a dmem pfn.
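
A minimal kernel-side sketch of how the flag is meant to be used; the
helper names are hypothetical and only illustrate the new pfn_t plumbing:

  /* sketch: tag a raw dmem pfn and detect it again later */
  #include <linux/pfn_t.h>

  static pfn_t dmem_phys_to_pfn_t(phys_addr_t addr)
  {
          /* PFN_DMEM marks the pfn as dmem-backed, i.e. no struct page */
          return __pfn_to_pfn_t(addr >> PAGE_SHIFT, PFN_DMEM);
  }

  static bool dmem_pfn_t_check(pfn_t pfn)
  {
          return pfn_t_dmem(pfn);         /* true only when PFN_DMEM is set */
  }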

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/pfn_t.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 2d91482..c6c0f1f 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -11,6 +11,7 @@
  * PFN_MAP - pfn has a dynamic page mapping established by a device driver
  * PFN_SPECIAL - for CONFIG_FS_DAX_LIMITED builds to allow XIP, but not
  *		 get_user_pages
+ * PFN_DMEM - pfn references a dmem page
  */
 #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT))
 #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1))
@@ -18,13 +19,15 @@
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))
 #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4))
 #define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5))
+#define PFN_DMEM (1ULL << (BITS_PER_LONG_LONG - 6))
 
 #define PFN_FLAGS_TRACE \
 	{ PFN_SPECIAL,	"SPECIAL" }, \
 	{ PFN_SG_CHAIN,	"SG_CHAIN" }, \
 	{ PFN_SG_LAST,	"SG_LAST" }, \
 	{ PFN_DEV,	"DEV" }, \
-	{ PFN_MAP,	"MAP" }
+	{ PFN_MAP,	"MAP" }, \
+	{ PFN_DMEM,	"DMEM" }
 
 static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags)
 {
@@ -128,4 +131,16 @@ static inline bool pfn_t_special(pfn_t pfn)
 	return false;
 }
 #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
+
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline bool pfn_t_dmem(pfn_t pfn)
+{
+	return (pfn.val & PFN_DMEM) == PFN_DMEM;
+}
+#else
+static inline bool pfn_t_dmem(pfn_t pfn)
+{
+	return false;
+}
+#endif /* CONFIG_ARCH_HAS_PTE_DMEM */
 #endif /* _LINUX_PFN_T_H_ */
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 14/37] mm, dmem: differentiate dmem-pmd and thp-pmd
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (12 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 13/37] mm, dmem: introduce PFN_DMEM and pfn_t_dmem yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 15/37] mm: add pmd_special() check for pmd_trans_huge_lock() yulei.kernel
                   ` (23 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

A dmem huge page is ultimately not a transparent huge page. As we
decided to use pmd_special() to distinguish dmem-pmd from thp-pmd,
we give pmd_special() slightly different semantics from pmd_trans_huge(),
just as pmd_devmap() has upstream. This distinction
is especially important in some mm-core paths such as zap_pmd_range().

Explicitly mark the pmd_trans_huge() helpers that dmem needs by adding
pmd_special() checks. This method could be reused in many mm-core paths.
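
As a rough sketch of the resulting pattern (the helper below is
hypothetical, not part of this patch), an mm-core walker treats a special
pmd like any other stable huge pmd and never splits it:

  /* sketch: stable-huge-pmd test that also covers dmem pmds */
  #include <linux/huge_mm.h>
  #include <linux/pgtable.h>

  static inline bool pmd_is_stable_huge(pmd_t pmd)
  {
          return pmd_trans_huge(pmd) || pmd_devmap(pmd) || pmd_special(pmd);
  }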

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h | 10 +++++++++-
 include/linux/pgtable.h        |  5 +++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index dd4aff6..6ce85d4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -259,7 +259,7 @@ static inline int pmd_large(pmd_t pte)
 /* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */
 static inline int pmd_trans_huge(pmd_t pmd)
 {
-	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE;
 }
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -275,6 +275,14 @@ static inline int has_transparent_hugepage(void)
 	return boot_cpu_has(X86_FEATURE_PSE);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline int pmd_special(pmd_t pmd)
+{
+	return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
+		(_PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
 static inline int pmd_devmap(pmd_t pmd)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 9e65694..30342b8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1162,6 +1162,11 @@ static inline pmd_t pmd_mkdmem(pmd_t pmd)
 {
 	return pmd;
 }
+
+static inline int pmd_special(pmd_t pmd)
+{
+	return 0;
+}
 #endif
 
 #ifndef pmd_read_atomic
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 15/37] mm: add pmd_special() check for pmd_trans_huge_lock()
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (13 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 14/37] mm, dmem: differentiate dmem-pmd and thp-pmd yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 16/37] dmemfs: introduce ->split() to dmemfs_vm_ops yulei.kernel
                   ` (22 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

As dmem-pmd has been distinguished from thp-pmd, we need to add a
pmd_special() check so that pmd_trans_huge_lock() can fetch the ptl
for a dmem huge pmd and treat it as a stable pmd.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/huge_mm.h | 3 ++-
 mm/huge_memory.c        | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa9..2514b90 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -242,7 +242,8 @@ static inline int is_swap_pmd(pmd_t pmd)
 static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 		struct vm_area_struct *vma)
 {
-	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)
+		|| pmd_devmap(*pmd) || pmd_special(*pmd))
 		return __pmd_trans_huge_lock(pmd, vma);
 	else
 		return NULL;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9474dbc..31f9e83 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1890,7 +1890,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 	ptl = pmd_lock(vma->vm_mm, pmd);
 	if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
-			pmd_devmap(*pmd)))
+			pmd_devmap(*pmd) || pmd_special(*pmd)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 16/37] dmemfs: introduce ->split() to dmemfs_vm_ops
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (14 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 15/37] mm: add pmd_special() check for pmd_trans_huge_lock() yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 17/37] mm, dmemfs: support unmap_page_range() for dmemfs pmd yulei.kernel
                   ` (21 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

The ->split() callback is required by __split_vma() to adjust the VMA.
A munmap() that would create a hole not aligned to the dmemfs page size
in a dmemfs mapping should be forbidden.
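
For illustration, a hedged fragment; it assumes a mapping 'map' created
as in the earlier mmap sketch with a 2MB dpage size:

  /* sketch: an unaligned partial unmap is rejected by dmemfs_split() */
  if (munmap(map + 4096, 4096) < 0)
          perror("munmap");  /* expected: EINVAL, hole not dpage-aligned */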

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 443f2e1..ab6a492 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -450,6 +450,13 @@ static bool check_vma_access(struct vm_area_struct *vma, int write)
 	return len;
 }
 
+static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr)
+{
+	if (addr & (dmem_page_size(file_inode(vma->vm_file)) - 1))
+		return -EINVAL;
+	return 0;
+}
+
 static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -484,6 +491,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 }
 
 static const struct vm_operations_struct dmemfs_vm_ops = {
+	.split = dmemfs_split,
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
 	.access = dmemfs_access_dmem,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 17/37] mm, dmemfs: support unmap_page_range() for dmemfs pmd
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (15 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 16/37] dmemfs: introduce ->split() to dmemfs_vm_ops yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 18/37] mm: follow_pmd_mask() for dmem huge pmd yulei.kernel
                   ` (20 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

It is required by munmap() on dmemfs mappings.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/huge_memory.c | 2 ++
 mm/memory.c      | 8 +++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 31f9e83..2a818ec 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1664,6 +1664,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
+	} else if (pmd_special(orig_pmd)) {
+		spin_unlock(ptl);
 	} else if (is_huge_zero_pmd(orig_pmd)) {
 		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
diff --git a/mm/memory.c b/mm/memory.c
index c48f8df..6b60981 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1338,10 +1338,12 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
 	pmd = pmd_offset(pud, addr);
 	do {
 		next = pmd_addr_end(addr, end);
-		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
-			if (next - addr != HPAGE_PMD_SIZE)
+		if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
+			pmd_devmap(*pmd) || pmd_special(*pmd)) {
+			if (next - addr != HPAGE_PMD_SIZE) {
+				VM_BUG_ON(pmd_special(*pmd));
 				__split_huge_pmd(vma, pmd, addr, false, NULL);
-			else if (zap_huge_pmd(tlb, vma, pmd, addr))
+			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
 				goto next;
 			/* fall through */
 		}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 18/37] mm: follow_pmd_mask() for dmem huge pmd
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (16 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 17/37] mm, dmemfs: support unmap_page_range() for dmemfs pmd yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 19/37] mm: gup_huge_pmd() " yulei.kernel
                   ` (19 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

In follow_pmd_mask(), a dmem huge pmd should be recognized and the error
pointer '-EEXIST' returned, indicating that a proper page table entry
exists for the special pmd but that there is no corresponding struct page,
since dmem pages have no struct page backing. The pmd is updated if
foll_flags contains FOLL_TOUCH.
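
For illustration only (not part of the patch), a slow-path GUP caller can
distinguish this -EEXIST error pointer, which means "a valid mapping
exists but there is no struct page to pin", from an ordinary failure; a
hedged sketch of a hypothetical caller:

	struct page *page = follow_page(vma, address, foll_flags);

	if (IS_ERR(page) && PTR_ERR(page) == -EEXIST) {
		/* dmem (pfn-only) mapping: the pmd is valid but nothing can
		 * be pinned; fall back to pfn-based handling, e.g. follow_pfn() */
	} else if (IS_ERR_OR_NULL(page)) {
		/* genuine fault, hole, or error */
	}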

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/gup.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/mm/gup.c b/mm/gup.c
index 98eb8e6..ad1aede 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -387,6 +387,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return -EEXIST;
 }
 
+static struct page *
+follow_special_pmd(struct vm_area_struct *vma, unsigned long address,
+		   pmd_t *pmd, unsigned int flags)
+{
+	spinlock_t *ptl;
+
+	if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd))
+		/* Avoid special (like zero) pages in core dumps */
+		return ERR_PTR(-EFAULT);
+
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return ERR_PTR(-EFAULT);
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t _pmd;
+
+		ptl = pmd_lock(vma->vm_mm, pmd);
+		if (!pmd_special(*pmd)) {
+			spin_unlock(ptl);
+			return NULL;
+		}
+		_pmd = pmd_mkyoung(*pmd);
+		if (flags & FOLL_WRITE)
+			_pmd = pmd_mkdirty(_pmd);
+		if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK,
+					  pmd, _pmd,
+					  flags & FOLL_WRITE))
+			update_mmu_cache_pmd(vma, address, pmd);
+		spin_unlock(ptl);
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return ERR_PTR(-EEXIST);
+}
+
 /*
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
@@ -571,6 +607,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags);
 	}
+	if (pmd_special(*pmd)) {
+		page = follow_special_pmd(vma, address, pmd, flags);
+		if (page)
+			return page;
+		return no_page_table(vma, flags);
+	}
 	if (is_hugepd(__hugepd(pmd_val(pmdval)))) {
 		page = follow_huge_pd(vma, address,
 				      __hugepd(pmd_val(pmdval)), flags,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 19/37] mm: gup_huge_pmd() for dmem huge pmd
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (17 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 18/37] mm: follow_pmd_mask() for dmem huge pmd yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 20/37] mm: support dmem huge pmd for vmf_insert_pfn_pmd() yulei.kernel
                   ` (18 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Add a pmd_special() check in gup_huge_pmd() to support dmem huge pmds.
GUP-fast will return zero when it encounters a dmem page, so it can be
handled outside the GUP routine.
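
As a hedged illustration of the consequence (not part of the patch): a
fast-GUP user that pins fewer pages than requested must be prepared to
resolve the remainder, which may be dmem, by other means:

	/* sketch of a hypothetical caller */
	int got = pin_user_pages_fast(start, nr_pages, FOLL_WRITE, pages);

	if (got < nr_pages) {
		/* the unpinned tail may be a dmem mapping with no struct page;
		 * resolve those addresses via follow_pfn() under mmap_lock,
		 * as kvm/vfio do, instead of retrying GUP */
	}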

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/gup.c      | 6 +++++-
 mm/pagewalk.c | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index ad1aede..47c8197 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2470,6 +2470,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
 	if (!pmd_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	/* Bypass dmem huge pmd. It will be handled by the outer routine. */
+	if (pmd_special(orig))
+		return 0;
+
 	if (pmd_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
@@ -2572,7 +2576,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo
 			return 0;
 
 		if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) ||
-			     pmd_devmap(pmd))) {
+			     pmd_devmap(pmd) || pmd_special(pmd))) {
 			/*
 			 * NUMA hinting faults need to be handled in the GUP
 			 * slowpath for accounting purposes and so that they
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e81640d..e7c4575 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -71,7 +71,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
 	do {
 again:
 		next = pmd_addr_end(addr, end);
-		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) {
+		if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma) || pmd_special(*pmd)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 20/37] mm: support dmem huge pmd for vmf_insert_pfn_pmd()
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (18 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 19/37] mm: gup_huge_pmd() " yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 21/37] mm: support dmem huge pmd for follow_pfn() yulei.kernel
                   ` (17 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Since vmf_insert_pfn_pmd() will BUG_ON() a pfn that is not pmd-devmap,
make dmem pfns pass the check as well.

A dmem huge pmd will be marked with _PAGE_SPECIAL and _PAGE_DMEM, so that
follow_pfn() can recognize it.
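
Purely to illustrate the encoding (not part of the patch), the entry built
by insert_pfn_pmd() for a dmem pfn is recognizable later by the
pmd_special() helper added earlier in the series:

	/* illustrative only */
	pmd_t entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));

	if (pfn_t_dmem(pfn))			/* pfn was built with PFN_DMEM     */
		entry = pmd_mkdmem(entry);	/* sets _PAGE_SPECIAL | _PAGE_DMEM */

	/* follow_pfn(), GUP and the zap paths all key off: */
	WARN_ON(!pmd_special(entry));		/* both bits set => dmem huge pmd  */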

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/huge_memory.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a818ec..6e52d57 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -781,6 +781,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 	entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pmd_mkdevmap(entry);
+	else if (pfn_t_dmem(pfn))
+		entry = pmd_mkdmem(entry);
 	if (write) {
 		entry = pmd_mkyoung(pmd_mkdirty(entry));
 		entry = maybe_pmd_mkwrite(entry, vma);
@@ -827,7 +829,7 @@ vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
 	 * can't support a 'special' bit.
 	 */
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+			!pfn_t_devmap(pfn) && !pfn_t_dmem(pfn));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 21/37] mm: support dmem huge pmd for follow_pfn()
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (19 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 20/37] mm: support dmem huge pmd for vmf_insert_pfn_pmd() yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 22/37] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
                   ` (16 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

follow_pfn() will return the pfn of the pmd if a huge pmd is encountered.
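
A worked example (not part of the patch), assuming a 2M huge pmd and a 4K
base page size:

	/* illustrative numbers only */
	unsigned long address = 0x7f5a00203000UL;	/* assumed user VA           */
	unsigned long in_pmd  = address & ~PMD_MASK;	/* 0x3000 within the 2M page */

	/* *pfn = pmd_pfn(*pmdp) + (in_pmd >> PAGE_SHIFT)
	 *      = base pfn of the 2M page + 3 */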

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/memory.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6b60981..abb9148 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4807,15 +4807,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	int ret = -EINVAL;
 	spinlock_t *ptl;
 	pte_t *ptep;
+	pmd_t *pmdp = NULL;
 
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte(vma->vm_mm, address, &ptep, &ptl);
+	ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl);
 	if (ret)
 		return ret;
-	*pfn = pte_pfn(*ptep);
-	pte_unmap_unlock(ptep, ptl);
+
+	if (pmdp) {
+		*pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
+		spin_unlock(ptl);
+	} else {
+		*pfn = pte_pfn(*ptep);
+		pte_unmap_unlock(ptep, ptl);
+	}
+
 	return 0;
 }
 EXPORT_SYMBOL(follow_pfn);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 22/37] kvm, x86: Distinguish dmemfs page from mmio page
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (20 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 21/37] mm: support dmem huge pmd for follow_pfn() yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 23/37] kvm, x86: introduce VM_DMEM for syscall support usage yulei.kernel
                   ` (15 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

A dmem page is pfn-invalid (it has no struct page) but is not MMIO;
introduce the API is_dmem_pfn() to distinguish the two.
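
For illustration (not part of the patch), this is the kind of decision it
enables in KVM's MMU; a hedged sketch, not the actual kvm_is_mmio_pfn()
logic:

	static bool pfn_looks_like_mmio(kvm_pfn_t pfn)	/* hypothetical helper */
	{
		if (pfn_valid(pfn))
			return false;	/* ordinary struct-page backed memory  */
		if (is_dmem_pfn(pfn))
			return false;	/* dmem: RAM, just without struct page */
		return true;		/* genuinely MMIO                      */
	}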

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/kvm/mmu/mmu.c | 1 +
 include/linux/dmem.h   | 7 +++++++
 mm/dmem.c              | 7 +++++++
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5bb1939..394508f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -43,6 +43,7 @@
 #include <linux/hash.h>
 #include <linux/kern_levels.h>
 #include <linux/kthread.h>
+#include <linux/dmem.h>
 
 #include <asm/page.h>
 #include <asm/memtype.h>
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 8682d63..59d3ef14 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -19,11 +19,18 @@
 		     unsigned int try_max, unsigned int *result_nr);
 
 void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
+bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 #else
 static inline int dmem_reserve_init(void)
 {
 	return 0;
 }
+
+static inline bool is_dmem_pfn(unsigned long pfn)
+{
+	return 0;
+}
+
 #endif
 #endif	/* _LINUX_DMEM_H */
diff --git a/mm/dmem.c b/mm/dmem.c
index 2e61dbd..eb6df70 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -972,3 +972,10 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 }
 EXPORT_SYMBOL(dmem_free_pages);
 
+bool is_dmem_pfn(unsigned long pfn)
+{
+	struct dmem_node *dnode;
+
+	return !!find_dmem_region(__pfn_to_phys(pfn), &dnode);
+}
+EXPORT_SYMBOL(is_dmem_pfn);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 23/37] kvm, x86: introduce VM_DMEM for syscall support usage
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (21 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 22/37] kvm, x86: Distinguish dmemfs page from mmio page yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 24/37] dmemfs: support hugepage for dmemfs yulei.kernel
                   ` (14 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Currently dmemfs does not support read-only memory, so change_protection()
is disabled for dmemfs vmas. Since vma->vm_flags could be changed to new
flags in mprotect_fixup(), we introduce a new vma flag, VM_DMEM, and check
this flag in mprotect_fixup() to avoid changing vma->vm_flags.

We also check it in vma_to_resize() to disable mremap() for dmemfs vmas.
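
From userspace the effect (illustration only, not part of the patch) is
that both calls fail cleanly, assuming p is a MAP_SHARED dmemfs mapping of
size sz:

	if (mprotect(p, sz, PROT_READ))
		perror("mprotect");	/* mprotect_fixup() rejects VM_DMEM vmas */

	if (mremap(p, sz, 2 * sz, MREMAP_MAYMOVE) == MAP_FAILED)
		perror("mremap");	/* vma_to_resize() returns -EINVAL */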

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c  | 2 +-
 include/linux/mm.h | 7 +++++++
 mm/gup.c           | 7 +++++--
 mm/mincore.c       | 8 ++++++--
 mm/mprotect.c      | 5 ++++-
 mm/mremap.c        | 3 +++
 6 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index ab6a492..b165bd3 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -507,7 +507,7 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_PFNMAP;
+	vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO;
 
 	file_accessed(file);
 	vma->vm_ops = &dmemfs_vm_ops;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d..2f3135fe 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -311,6 +311,8 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *,
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
+#define VM_DMEM		BIT(38)		/* Dmem page VM */
+
 #ifdef CONFIG_ARCH_HAS_PKEYS
 # define VM_PKEY_SHIFT	VM_HIGH_ARCH_BIT_0
 # define VM_PKEY_BIT0	VM_HIGH_ARCH_0	/* A protection key is a 4-bit value */
@@ -666,6 +668,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma)
 	return vma->vm_flags & VM_ACCESS_FLAGS;
 }
 
+static inline bool vma_is_dmem(struct vm_area_struct *vma)
+{
+	return !!(vma->vm_flags & VM_DMEM);
+}
+
 #ifdef CONFIG_SHMEM
 /*
  * The vma_is_shmem is not inline because it is used only by slow
diff --git a/mm/gup.c b/mm/gup.c
index 47c8197..0ea9071 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -492,8 +492,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 			goto no_page;
 	} else if (unlikely(!page)) {
 		if (flags & FOLL_DUMP) {
-			/* Avoid special (like zero) pages in core dumps */
-			page = ERR_PTR(-EFAULT);
+			if (vma_is_dmem(vma))
+				page = ERR_PTR(-EEXIST);
+			else
+				/* Avoid special (like zero) pages in core dumps */
+				page = ERR_PTR(-EFAULT);
 			goto out;
 		}
 
diff --git a/mm/mincore.c b/mm/mincore.c
index 02db1a8..f8d10e4 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -78,8 +78,12 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 		pgoff_t pgoff;
 
 		pgoff = linear_page_index(vma, addr);
-		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+		for (i = 0; i < nr; i++, pgoff++) {
+			if (vma_is_dmem(vma))
+				vec[i] = 1;
+			else
+				vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+		}
 	} else {
 		for (i = 0; i < nr; i++)
 			vec[i] = 0;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56c02be..b1650b5 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -236,7 +236,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * for all the checks.
 		 */
 		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd) && !pmd_special(*pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
@@ -412,6 +412,9 @@ static int prot_none_test(unsigned long addr, unsigned long next,
 		return 0;
 	}
 
+	if (vma_is_dmem(vma))
+		return -EINVAL;
+
 	/*
 	 * Do PROT_NONE PFN permission checks here when we can still
 	 * bail out without undoing a lot of state. This is a rather
diff --git a/mm/mremap.c b/mm/mremap.c
index 138abba..598e681 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -482,6 +482,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	if (!vma || vma->vm_start > addr)
 		return ERR_PTR(-EFAULT);
 
+	if (vma_is_dmem(vma))
+		return ERR_PTR(-EINVAL);
+
 	/*
 	 * !old_len is a special case where an attempt is made to 'duplicate'
 	 * a mapping.  This makes no sense for private mappings as it will
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 24/37] dmemfs: support hugepage for dmemfs
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (22 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 23/37] kvm, x86: introduce VM_DMEM for syscall support usage yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 25/37] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() yulei.kernel
                   ` (13 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

This adds hugepage support for dmemfs. We use PFN_DMEM to notify
vmf_insert_pfn_pmd(), and the dmem huge pmd will be marked with
_PAGE_SPECIAL and _PAGE_DMEM, so that GUP-fast can separate dmemfs pages
from other page types and handle them correctly.
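
A worked example of the padding in dmemfs_get_unmapped_area() (illustrative
numbers, not part of the patch), assuming a 2M dmem page size and a 4M
request:

	unsigned long align   = 2UL << 20;	/* dmem_page_size() = 2M  */
	unsigned long len     = 4UL << 20;	/* caller asks for 4M     */
	unsigned long len_pad = len + align;	/* search for 6M instead  */

	/* if mm->get_unmapped_area() returns, say, 0x7f5a00300000, then
	 * round_up(addr, align) yields 0x7f5a00400000, and the extra 2M of
	 * padding guarantees the aligned 4M range still fits */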

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 111 insertions(+), 2 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index b165bd3..17a518c 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -457,7 +457,7 @@ static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr)
 	return 0;
 }
 
-static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+static vm_fault_t __dmemfs_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
@@ -485,6 +485,63 @@ static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+static vm_fault_t  __dmemfs_pmd_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pmd_addr = vmf->address & PMD_MASK;
+	unsigned long page_addr;
+	struct inode *inode = file_inode(vma->vm_file);
+	void *entry;
+	phys_addr_t phys;
+	pfn_t pfn;
+	int ret;
+
+	if (dmem_page_size(inode) < PMD_SIZE)
+		return VM_FAULT_FALLBACK;
+
+	WARN_ON(pmd_addr < vma->vm_start ||
+		vma->vm_end < pmd_addr + PMD_SIZE);
+
+	page_addr = vmf->address & ~(dmem_page_size(inode) - 1);
+	entry = radix_get_create_entry(vma, page_addr, inode,
+				       linear_page_index(vma, page_addr));
+	if (IS_ERR(entry))
+		return (PTR_ERR(entry) == -ENOMEM) ?
+			VM_FAULT_OOM : VM_FAULT_SIGBUS;
+
+	phys = dmem_addr_to_pfn(inode, dmem_entry_to_addr(inode, entry),
+				linear_page_index(vma, pmd_addr), PMD_SHIFT);
+	phys <<= PAGE_SHIFT;
+	pfn = phys_to_pfn_t(phys, PFN_DMEM);
+	ret = vmf_insert_pfn_pmd(vmf, pfn, !!(vma->vm_flags & VM_WRITE));
+
+	radix_put_entry();
+	return ret;
+}
+
+static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
+{
+	int ret;
+
+	switch (pe_size) {
+	case PE_SIZE_PTE:
+		ret = __dmemfs_fault(vmf);
+		break;
+	case PE_SIZE_PMD:
+		ret = __dmemfs_pmd_fault(vmf);
+		break;
+	default:
+		ret = VM_FAULT_SIGBUS;
+	}
+
+	return ret;
+}
+
+static vm_fault_t dmemfs_fault(struct vm_fault *vmf)
+{
+	return dmemfs_huge_fault(vmf, PE_SIZE_PTE);
+}
+
 static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 {
 	return dmem_page_size(file_inode(vma->vm_file));
@@ -495,6 +552,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma)
 	.fault = dmemfs_fault,
 	.pagesize = dmemfs_pagesize,
 	.access = dmemfs_access_dmem,
+	.huge_fault = dmemfs_huge_fault,
 };
 
 int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
@@ -507,15 +565,66 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!(vma->vm_flags & VM_SHARED))
 		return -EINVAL;
 
-	vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO;
+	vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY | VM_DMEM | VM_IO;
+
+	if (dmem_page_size(inode) != PAGE_SIZE)
+		vma->vm_flags |= VM_HUGEPAGE;
 
 	file_accessed(file);
 	vma->vm_ops = &dmemfs_vm_ops;
 	return 0;
 }
 
+/*
+ * If the area returned by mm->get_unmapped_area() is one dmem pagesize
+ * larger than 'len', the returned addr can be rounded up to the dmem
+ * pagesize to meet the alignment requirement.
+ */
+static unsigned long
+dmemfs_get_unmapped_area(struct file *file, unsigned long addr,
+			 unsigned long len, unsigned long pgoff,
+			 unsigned long flags)
+{
+	unsigned long len_pad;
+	unsigned long off = pgoff << PAGE_SHIFT;
+	unsigned long align;
+
+	align = dmem_page_size(file_inode(file));
+
+	/* For pud or pmd pagesize, could not support fault fallback. */
+	if (len & (align - 1))
+		return -EINVAL;
+	if (len > TASK_SIZE)
+		return -ENOMEM;
+
+	if (flags & MAP_FIXED) {
+		if (addr & (align - 1))
+			return -EINVAL;
+		return addr;
+	}
+
+	/*
+	 * Pad an extra 'align' of space onto 'len', as we want to find an
+	 * unmapped area that is large enough to be aligned to the dmemfs
+	 * pagesize, if the dmem pagesize is larger than 4K.
+	 */
+	len_pad = (align == PAGE_SIZE) ? len : len + align;
+
+	/* 'len' or 'off' is too large for pad. */
+	if (len_pad < len || (off + len_pad) < off)
+		return -EINVAL;
+
+	addr = current->mm->get_unmapped_area(file, addr, len_pad,
+					      pgoff, flags);
+
+	/* Now 'addr' could be aligned to upper boundary. */
+	return IS_ERR_VALUE(addr) ? addr : round_up(addr, align);
+}
+
 static const struct file_operations dmemfs_file_operations = {
 	.mmap = dmemfs_file_mmap,
+	.get_unmapped_area = dmemfs_get_unmapped_area,
 };
 
 static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 25/37] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn()
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (23 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 24/37] dmemfs: support hugepage for dmemfs yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 26/37] mm, dmem: introduce pud_special() for dmem huge pud support yulei.kernel
                   ` (12 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Fix the estimation of reserved pages for vaddr_get_pfn() and check 'ret'
before checking the writable permission.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 drivers/vfio/vfio_iommu_type1.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 67e8276..c465d1a 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -471,6 +471,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		if (ret == -EAGAIN)
 			goto retry;
 
+		if (!ret && (prot & IOMMU_WRITE) &&
+		    !(vma->vm_flags & VM_WRITE))
+			ret = -EFAULT;
+
 		if (!ret && !is_invalid_reserved_pfn(*pfn))
 			ret = -EFAULT;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 26/37] mm, dmem: introduce pud_special() for dmem huge pud support
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (24 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 25/37] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 27/37] mm: add pud_special() check to support dmem huge pud yulei.kernel
                   ` (11 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

pud_special() checks both the _PAGE_SPECIAL and _PAGE_DMEM bits, as
pmd_special() does.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h | 13 +++++++++++++
 include/linux/pgtable.h        | 10 ++++++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 6ce85d4..9e36d42 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -281,6 +281,12 @@ static inline int pmd_special(pmd_t pmd)
 	return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
 		(_PAGE_SPECIAL | _PAGE_DMEM);
 }
+
+static inline int pud_special(pud_t pud)
+{
+	return (pud_val(pud) & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
+		(_PAGE_SPECIAL | _PAGE_DMEM);
+}
 #endif
 
 #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
@@ -516,6 +522,13 @@ static inline pud_t pud_mkdirty(pud_t pud)
 	return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY);
 }
 
+#ifdef CONFIG_ARCH_HAS_PTE_DMEM
+static inline pud_t pud_mkdmem(pud_t pud)
+{
+	return pud_set_flags(pud, _PAGE_SPECIAL | _PAGE_DMEM);
+}
+#endif
+
 static inline pud_t pud_mkdevmap(pud_t pud)
 {
 	return pud_set_flags(pud, _PAGE_DEVMAP);
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 30342b8..0ef03ff 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1167,6 +1167,16 @@ static inline int pmd_special(pmd_t pmd)
 {
 	return 0;
 }
+
+static inline pud_t pud_mkdmem(pud_t pud)
+{
+	return pud;
+}
+
+static inline int pud_special(pud_t pud)
+{
+	return 0;
+}
 #endif
 
 #ifndef pmd_read_atomic
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 27/37] mm: add pud_special() check to support dmem huge pud
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (25 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 26/37] mm, dmem: introduce pud_special() for dmem huge pud support yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 28/37] mm, dmemfs: support huge_fault() for dmemfs yulei.kernel
                   ` (10 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Add pud_special() and follow_special_pud() to support dmem huge puds, as
we do for dmem huge pmds.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 arch/x86/include/asm/pgtable.h |  2 +-
 include/linux/huge_mm.h        |  2 +-
 mm/gup.c                       | 46 ++++++++++++++++++++++++++++++++++++++++++
 mm/huge_memory.c               | 11 ++++++----
 mm/memory.c                    |  4 ++--
 mm/mprotect.c                  |  2 ++
 mm/pagewalk.c                  |  2 +-
 7 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 9e36d42..2284387 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -265,7 +265,7 @@ static inline int pmd_trans_huge(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 static inline int pud_trans_huge(pud_t pud)
 {
-	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+	return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE;
 }
 #endif
 
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2514b90..b69c940 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -251,7 +251,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
 static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
 		struct vm_area_struct *vma)
 {
-	if (pud_trans_huge(*pud) || pud_devmap(*pud))
+	if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud))
 		return __pud_trans_huge_lock(pud, vma);
 	else
 		return NULL;
diff --git a/mm/gup.c b/mm/gup.c
index 0ea9071..8eb85ba 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -423,6 +423,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
 	return ERR_PTR(-EEXIST);
 }
 
+static struct page *
+follow_special_pud(struct vm_area_struct *vma, unsigned long address,
+		   pud_t *pud, unsigned int flags)
+{
+	spinlock_t *ptl;
+
+	if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud))
+		/* Avoid special (like zero) pages in core dumps */
+		return ERR_PTR(-EFAULT);
+
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return ERR_PTR(-EFAULT);
+
+	if (flags & FOLL_TOUCH) {
+		pud_t _pud;
+
+		ptl = pud_lock(vma->vm_mm, pud);
+		if (!pud_special(*pud)) {
+			spin_unlock(ptl);
+			return NULL;
+		}
+		_pud = pud_mkyoung(*pud);
+		if (flags & FOLL_WRITE)
+			_pud = pud_mkdirty(_pud);
+		if (pudp_set_access_flags(vma, address & HPAGE_PMD_MASK,
+					  pud, _pud,
+					  flags & FOLL_WRITE))
+			update_mmu_cache_pud(vma, address, pud);
+		spin_unlock(ptl);
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return ERR_PTR(-EEXIST);
+}
+
 /*
  * FOLL_FORCE can write to even unwritable pte's, but only
  * after we've gone through a COW cycle and they are dirty.
@@ -726,6 +762,12 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma,
 			return page;
 		return no_page_table(vma, flags);
 	}
+	if (pud_special(*pud)) {
+		page = follow_special_pud(vma, address, pud, flags);
+		if (page)
+			return page;
+		return no_page_table(vma, flags);
+	}
 	if (is_hugepd(__hugepd(pud_val(*pud)))) {
 		page = follow_huge_pd(vma, address,
 				      __hugepd(pud_val(*pud)), flags,
@@ -2511,6 +2553,10 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
 	if (!pud_access_permitted(orig, flags & FOLL_WRITE))
 		return 0;
 
+	/* Bypass dmem pud. It will be handled by the outer routine. */
+	if (pud_special(orig))
+		return 0;
+
 	if (pud_devmap(orig)) {
 		if (unlikely(flags & FOLL_LONGTERM))
 			return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6e52d57..7c5385a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,6 +883,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
 	entry = pud_mkhuge(pfn_t_pud(pfn, prot));
 	if (pfn_t_devmap(pfn))
 		entry = pud_mkdevmap(entry);
+	if (pfn_t_dmem(pfn))
+		entry = pud_mkdmem(entry);
 	if (write) {
 		entry = pud_mkyoung(pud_mkdirty(entry));
 		entry = maybe_pud_mkwrite(entry, vma);
@@ -919,7 +921,7 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn,
 	 * can't support a 'special' bit.
 	 */
 	BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
-			!pfn_t_devmap(pfn));
+			!pfn_t_devmap(pfn) && !pfn_t_dmem(pfn));
 	BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
 						(VM_PFNMAP|VM_MIXEDMAP));
 	BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1911,7 +1913,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
 	spinlock_t *ptl;
 
 	ptl = pud_lock(vma->vm_mm, pud);
-	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+	if (likely(pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)))
 		return ptl;
 	spin_unlock(ptl);
 	return NULL;
@@ -1922,6 +1924,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pud_t *pud, unsigned long addr)
 {
 	spinlock_t *ptl;
+	pud_t orig_pud;
 
 	ptl = __pud_trans_huge_lock(pud, vma);
 	if (!ptl)
@@ -1932,9 +1935,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	 * pgtable_trans_huge_withdraw after finishing pudp related
 	 * operations.
 	 */
-	pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+	orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
 	tlb_remove_pud_tlb_entry(tlb, pud, addr);
-	if (vma_is_special_huge(vma)) {
+	if (vma_is_special_huge(vma) || pud_special(orig_pud)) {
 		spin_unlock(ptl);
 		/* No zero page support yet */
 	} else {
diff --git a/mm/memory.c b/mm/memory.c
index abb9148..01f3b05 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1078,7 +1078,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 	src_pud = pud_offset(src_p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud) || pud_special(*src_pud)) {
 			int err;
 
 			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1375,7 +1375,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+		if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
 				mmap_assert_locked(tlb->mm);
 				split_huge_pud(vma, pud, addr);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b1650b5..05fa453 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -292,6 +292,8 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
 	pud = pud_offset(p4d, addr);
 	do {
 		next = pud_addr_end(addr, end);
+		if (pud_special(*pud))
+			continue;
 		if (pud_none_or_clear_bad(pud))
 			continue;
 		pages += change_pmd_range(vma, pud, addr, next, newprot,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e7c4575..afd8bca 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -129,7 +129,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
 	do {
  again:
 		next = pud_addr_end(addr, end);
-		if (pud_none(*pud) || (!walk->vma && !walk->no_vma)) {
+		if (pud_none(*pud) || (!walk->vma && !walk->no_vma) || pud_special(*pud)) {
 			if (ops->pte_hole)
 				err = ops->pte_hole(addr, next, depth, walk);
 			if (err)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 28/37] mm, dmemfs: support huge_fault() for dmemfs
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (26 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 27/37] mm: add pud_special() check to support dmem huge pud yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 29/37] mm: add follow_pte_pud() to support huge pud look up yulei.kernel
                   ` (9 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce __dmemfs_huge_fault() to handle 1G huge pud for dmemfs.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index 17a518c..f698b9d 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -519,6 +519,43 @@ static vm_fault_t  __dmemfs_pmd_fault(struct vm_fault *vmf)
 	return ret;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long pud_addr = vmf->address & PUD_MASK;
+	struct inode *inode = file_inode(vma->vm_file);
+	void *entry;
+	phys_addr_t phys;
+	pfn_t pfn;
+	int ret;
+
+	if (dmem_page_size(inode) < PUD_SIZE)
+		return VM_FAULT_FALLBACK;
+
+	WARN_ON(pud_addr < vma->vm_start ||
+		vma->vm_end < pud_addr + PUD_SIZE);
+
+	entry = radix_get_create_entry(vma, pud_addr, inode,
+				       linear_page_index(vma, pud_addr));
+	if (IS_ERR(entry))
+		return (PTR_ERR(entry) == -ENOMEM) ?
+			VM_FAULT_OOM : VM_FAULT_SIGBUS;
+
+	phys = dmem_entry_to_addr(inode, entry);
+	pfn = phys_to_pfn_t(phys, PFN_DMEM);
+	ret = vmf_insert_pfn_pud(vmf, pfn, !!(vma->vm_flags & VM_WRITE));
+
+	radix_put_entry();
+	return ret;
+}
+#else
+static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_FALLBACK;
+}
+#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size)
 {
 	int ret;
@@ -530,6 +567,9 @@ static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size p
 	case PE_SIZE_PMD:
 		ret = __dmemfs_pmd_fault(vmf);
 		break;
+	case PE_SIZE_PUD:
+		ret = __dmemfs_huge_fault(vmf);
+		break;
 	default:
 		ret = VM_FAULT_SIGBUS;
 	}
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 29/37] mm: add follow_pte_pud() to support huge pud look up
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (27 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 28/37] mm, dmemfs: support huge_fault() for dmemfs yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 30/37] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() yulei.kernel
                   ` (8 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Since dmem huge pud is now supported, support it for hva_to_pfn() as well.

Similar to follow_pte_pmd(), follow_pte_pud() allows a PTE, a huge PMD or
a huge PUD to be found and returned.
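
A hedged sketch of the calling convention (not part of the patch): on
success exactly one of the three output pointers is set, and the caller
drops the page-table lock accordingly, as the new follow_pfn() does:

	pte_t *ptep;
	pmd_t *pmdp = NULL;
	pud_t *pudp = NULL;
	spinlock_t *ptl;
	unsigned long pfn;

	if (follow_pte_pud(mm, address, NULL, &ptep, &pmdp, &pudp, &ptl))
		return -EFAULT;				/* nothing mapped here */

	if (pudp) {					/* 1G mapping */
		pfn = pud_pfn(*pudp) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
		spin_unlock(ptl);
	} else if (pmdp) {				/* 2M mapping */
		pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
		spin_unlock(ptl);
	} else {					/* 4K mapping */
		pfn = pte_pfn(*ptep);
		pte_unmap_unlock(ptep, ptl);
	}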

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 mm/memory.c | 52 ++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 44 insertions(+), 8 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 01f3b05..dfc95be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4698,9 +4698,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address)
 }
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
+static int __follow_pte_pud(struct mm_struct *mm, unsigned long address,
 			    struct mmu_notifier_range *range,
-			    pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp)
+			    pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4717,6 +4717,26 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		goto out;
 
 	pud = pud_offset(p4d, address);
+	VM_BUG_ON(pud_trans_huge(*pud));
+	if (pud_huge(*pud)) {
+		if (!pudpp)
+			goto out;
+
+		if (range) {
+			mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0,
+						NULL, mm, address & PUD_MASK,
+						(address & PUD_MASK) + PUD_SIZE);
+			mmu_notifier_invalidate_range_start(range);
+		}
+		*ptlp = pud_lock(mm, pud);
+		if (pud_huge(*pud)) {
+			*pudpp = pud;
+			return 0;
+		}
+		spin_unlock(*ptlp);
+		if (range)
+			mmu_notifier_invalidate_range_end(range);
+	}
 	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
 		goto out;
 
@@ -4772,8 +4792,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, NULL,
-						    ptepp, NULL, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, NULL,
+						    ptepp, NULL, NULL, ptlp)));
 	return res;
 }
 
@@ -4785,12 +4805,24 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 
 	/* (void) is needed to make gcc happy */
 	(void) __cond_lock(*ptlp,
-			   !(res = __follow_pte_pmd(mm, address, range,
-						    ptepp, pmdpp, ptlp)));
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, NULL, ptlp)));
 	return res;
 }
 EXPORT_SYMBOL(follow_pte_pmd);
 
+int follow_pte_pud(struct mm_struct *mm, unsigned long address,
+		   struct mmu_notifier_range *range,
+		   pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp)
+{
+	int res;
+
+	/* (void) is needed to make gcc happy */
+	(void) __cond_lock(*ptlp,
+			   !(res = __follow_pte_pud(mm, address, range,
+						    ptepp, pmdpp, pudpp, ptlp)));
+	return res;
+}
 /**
  * follow_pfn - look up PFN at a user virtual address
  * @vma: memory mapping
@@ -4808,15 +4840,19 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address,
 	spinlock_t *ptl;
 	pte_t *ptep;
 	pmd_t *pmdp = NULL;
+	pud_t *pudp = NULL;
 
 	if (!(vma->vm_flags & (VM_IO | VM_PFNMAP)))
 		return ret;
 
-	ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl);
+	ret = follow_pte_pud(vma->vm_mm, address, NULL, &ptep, &pmdp, &pudp, &ptl);
 	if (ret)
 		return ret;
 
-	if (pmdp) {
+	if (pudp) {
+		*pfn = pud_pfn(*pudp) + ((address & ~PUD_MASK) >> PAGE_SHIFT);
+		spin_unlock(ptl);
+	} else if (pmdp) {
 		*pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT);
 		spin_unlock(ptl);
 	} else {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 30/37] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free()
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (28 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 29/37] mm: add follow_pte_pud() to support huge pud look up yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 31/37] dmem: introduce mce handler yulei.kernel
                   ` (7 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

If the dmem contained in a dmem region is large and dmemfs is mounted with
a 4K pagesize, the bitmap for this dmem region may exceed the maximum size
that kzalloc() can provide, so kzalloc() would fail.

Introduce dmem_bitmap_alloc()/dmem_bitmap_free() and use vzalloc() when the
bitmap is larger than PAGE_SIZE, since vzalloc() allocates from
non-physically-contiguous pages.
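
A worked example of the sizes involved (illustrative numbers, not part of
the patch): a 1 TiB dmem region mounted with 4K dmem pages needs a 32 MiB
allocation bitmap, far beyond what kzalloc() can provide but comfortably
within vmalloc space:

	unsigned long pages = (1UL << 40) >> 12;		    /* 268,435,456 dpages */
	unsigned long size  = BITS_TO_LONGS(pages) * sizeof(long); /* 32 MiB             */

	/* size > PAGE_SIZE, so dmem_bitmap_alloc() uses vzalloc(size) */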

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/inode.c         |  6 +++++
 include/linux/fs.h |  1 +
 mm/dmem.c          | 69 ++++++++++++++++++++++++++++++++++--------------------
 3 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9d78c37..9b6363d3 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,6 +210,12 @@ int inode_init_always(struct super_block *sb, struct inode *inode)
 }
 EXPORT_SYMBOL(inode_init_always);
 
+struct inode *alloc_inode_nonrcu(void)
+{
+	return kmem_cache_alloc(inode_cachep, GFP_KERNEL);
+}
+EXPORT_SYMBOL(alloc_inode_nonrcu);
+
 void free_inode_nonrcu(struct inode *inode)
 {
 	kmem_cache_free(inode_cachep, inode);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8667d0c..bc7a89c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2937,6 +2937,7 @@ static inline bool is_zero_ino(ino_t ino)
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode_pseudo(struct super_block *sb);
 extern struct inode *new_inode(struct super_block *sb);
+extern struct inode *alloc_inode_nonrcu(void);
 extern void free_inode_nonrcu(struct inode *inode);
 extern int should_remove_suid(struct dentry *);
 extern int file_remove_privs(struct file *);
diff --git a/mm/dmem.c b/mm/dmem.c
index eb6df70..50cdff9 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -17,6 +17,7 @@
 #include <linux/dmem.h>
 #include <linux/debugfs.h>
 #include <linux/notifier.h>
+#include <linux/vmalloc.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/dmem.h>
@@ -362,9 +363,38 @@ static int __init dmem_node_init(struct dmem_node *dnode)
 	return 0;
 }
 
+static unsigned long *dmem_bitmap_alloc(unsigned long pages,
+					unsigned long *static_bitmap)
+{
+	unsigned long *bitmap, size;
+
+	size = BITS_TO_LONGS(pages) * sizeof(long);
+	if (size <= sizeof(*static_bitmap))
+		bitmap = static_bitmap;
+	else if (size <= PAGE_SIZE)
+		bitmap = kzalloc(size, GFP_KERNEL);
+	else
+		bitmap = vzalloc(size);
+
+	return bitmap;
+}
+
+static void dmem_bitmap_free(unsigned long pages,
+			     unsigned long *bitmap,
+			     unsigned long *static_bitmap)
+{
+	unsigned long size;
+
+	size = BITS_TO_LONGS(pages) * sizeof(long);
+	if (size > PAGE_SIZE)
+		vfree(bitmap);
+	else if (bitmap != static_bitmap)
+		kfree(bitmap);
+}
+
 static void __init dmem_region_uinit(struct dmem_region *dregion)
 {
-	unsigned long nr_pages, size, *bitmap = dregion->error_bitmap;
+	unsigned long nr_pages, *bitmap = dregion->error_bitmap;
 
 	if (!bitmap)
 		return;
@@ -374,9 +404,7 @@ static void __init dmem_region_uinit(struct dmem_region *dregion)
 
 	WARN_ON(!nr_pages);
 
-	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	if (size > sizeof(dregion->static_bitmap))
-		kfree(bitmap);
+	dmem_bitmap_free(nr_pages, bitmap, &dregion->static_error_bitmap);
 	dregion->error_bitmap = NULL;
 }
 
@@ -405,19 +433,15 @@ static void __init dmem_uinit(void)
 
 static int __init dmem_region_init(struct dmem_region *dregion)
 {
-	unsigned long *bitmap, size, nr_pages;
+	unsigned long *bitmap, nr_pages;
 
 	nr_pages = __phys_to_pfn(dregion->reserved_end_addr)
 		- __phys_to_pfn(dregion->reserved_start_addr);
 
-	size = BITS_TO_LONGS(nr_pages) * sizeof(long);
-	if (size <= sizeof(dregion->static_error_bitmap)) {
-		bitmap = &dregion->static_error_bitmap;
-	} else {
-		bitmap = kzalloc(size, GFP_KERNEL);
-		if (!bitmap)
-			return -ENOMEM;
-	}
+	bitmap = dmem_bitmap_alloc(nr_pages, &dregion->static_error_bitmap);
+	if (!bitmap)
+		return -ENOMEM;
+
 	dregion->error_bitmap = bitmap;
 	return 0;
 }
@@ -472,7 +496,7 @@ static int __init dmem_late_init(void)
 static int dmem_alloc_region_init(struct dmem_region *dregion,
 				  unsigned long *dpages)
 {
-	unsigned long start, end, *bitmap, size;
+	unsigned long start, end, *bitmap;
 
 	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
 	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
@@ -481,14 +505,9 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	if (!*dpages)
 		return 0;
 
-	size = BITS_TO_LONGS(*dpages) * sizeof(long);
-	if (size <= sizeof(dregion->static_bitmap))
-		bitmap = &dregion->static_bitmap;
-	else {
-		bitmap = kzalloc(size, GFP_KERNEL);
-		if (!bitmap)
-			return -ENOMEM;
-	}
+	bitmap = dmem_bitmap_alloc(*dpages, &dregion->static_bitmap);
+	if (!bitmap)
+		return -ENOMEM;
 
 	dregion->bitmap = bitmap;
 	dregion->next_free_pos = 0;
@@ -582,7 +601,7 @@ static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
 
 static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 {
-	unsigned long dpages, size, *bitmap = dregion->bitmap;
+	unsigned long dpages, *bitmap = dregion->bitmap;
 
 	if (!bitmap)
 		return;
@@ -592,9 +611,7 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 
 	dmem_uinit_check_alloc_bitmap(dregion);
 
-	size = BITS_TO_LONGS(dpages) * sizeof(long);
-	if (size > sizeof(dregion->static_bitmap))
-		kfree(bitmap);
+	dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap);
 	dregion->bitmap = NULL;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 31/37] dmem: introduce mce handler
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (29 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 30/37] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 32/37] mm, dmemfs: register and handle the dmem mce yulei.kernel
                   ` (6 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

dmem handles the MCE when the faulting pfn belongs to dmem:
1. Check whether the pfn is handled by dmem; return if so.
2. Mark the pfn in the per-region error bitmap.
3. Use the allocation bitmap to ensure that the MCE pfn is never
   allocated again.
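
To make the bookkeeping concrete (an illustration derived from the patch,
not extra code): each region keeps a per-4K-pfn error bitmap plus a
per-dpage allocation bitmap, and an MCE on a free dpage marks both so the
poisoned dpage is never handed out again:

	u64 pos;

	/* state after dmem_memory_failure(pfn) hits a free dpage */
	pos = pfn - __phys_to_pfn(dregion->reserved_start_addr);
	__set_bit(pos, dregion->error_bitmap);		/* remember the bad 4K pfn */

	pos = phys_to_dpage(__pfn_to_phys(pfn)) - dregion->dpage_start_pfn;
	__set_bit(pos, dregion->bitmap);		/* dpage now looks "allocated",
							 * so the allocator skips it */
	dnode_count_free_dpages(pdnode, -1);		/* one fewer free dpage */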

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/dmem.h        |   6 +++
 include/trace/events/dmem.h |  17 ++++++++
 mm/dmem.c                   | 103 +++++++++++++++++++++++++++++++-------------
 mm/memory-failure.c         |   6 +++
 4 files changed, 102 insertions(+), 30 deletions(-)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 59d3ef14..cd17a91 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -21,6 +21,8 @@
 void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr);
 bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
+
+bool dmem_memory_failure(unsigned long pfn, int flags);
 #else
 static inline int dmem_reserve_init(void)
 {
@@ -32,5 +34,9 @@ static inline bool is_dmem_pfn(unsigned long pfn)
 	return 0;
 }
 
+static inline bool dmem_memory_failure(unsigned long pfn, int flags)
+{
+	return false;
+}
 #endif
 #endif	/* _LINUX_DMEM_H */
diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h
index 10d1b90..f8eeb3c 100644
--- a/include/trace/events/dmem.h
+++ b/include/trace/events/dmem.h
@@ -62,6 +62,23 @@
 	TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr,
 		  __entry->dpages_nr)
 );
+
+TRACE_EVENT(dmem_memory_failure,
+	TP_PROTO(unsigned long pfn, bool used),
+	TP_ARGS(pfn, used),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, pfn)
+		__field(bool, used)
+	),
+
+	TP_fast_assign(
+		__entry->pfn = pfn;
+		__entry->used = used;
+	),
+
+	TP_printk("pfn=%#lx used=%d", __entry->pfn, __entry->used)
+);
 #endif
 
 /* This part must be outside protection */
diff --git a/mm/dmem.c b/mm/dmem.c
index 50cdff9..16438db 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -431,6 +431,41 @@ static void __init dmem_uinit(void)
 	dmem_pool.registered_pages = 0;
 }
 
+/* set or clear corresponding bit on allocation bitmap based on error bitmap */
+static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
+						    bool set)
+{
+	unsigned long pos_pfn, pos_offset;
+	unsigned long valid_pages, mce_dpages = 0;
+	phys_addr_t dpage, reserved_start_pfn;
+
+	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
+
+	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
+	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
+		- reserved_start_pfn;
+try_set:
+	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
+
+	if (pos_pfn >= valid_pages)
+		return mce_dpages;
+	mce_dpages++;
+	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
+	if (set)
+		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
+					   dregion->bitmap));
+	else
+		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
+					      dregion->bitmap));
+	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
+	goto try_set;
+}
+
+static unsigned long dmem_region_mark_mce_dpages(struct dmem_region *dregion)
+{
+	return dregion_alloc_bitmap_set_clear(dregion, true);
+}
+
 static int __init dmem_region_init(struct dmem_region *dregion)
 {
 	unsigned long *bitmap, nr_pages;
@@ -514,6 +549,8 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	dregion->dpage_start_pfn = start;
 	dregion->dpage_end_pfn = end;
 
+	*dpages -= dmem_region_mark_mce_dpages(dregion);
+
 	dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start)
 		- dregion->reserved_start_addr));
 	dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr
@@ -558,36 +595,6 @@ static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage)
 	return err_num;
 }
 
-/* set or clear corresponding bit on allocation bitmap based on error bitmap */
-static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion,
-						    bool set)
-{
-	unsigned long pos_pfn, pos_offset;
-	unsigned long valid_pages, mce_dpages = 0;
-	phys_addr_t dpage, reserved_start_pfn;
-
-	reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr);
-
-	valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn;
-	pos_offset = dpage_to_pfn(dregion->dpage_start_pfn)
-		- reserved_start_pfn;
-try_set:
-	pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset);
-
-	if (pos_pfn >= valid_pages)
-		return mce_dpages;
-	mce_dpages++;
-	dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn);
-	if (set)
-		WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn,
-					   dregion->bitmap));
-	else
-		WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn,
-					      dregion->bitmap));
-	pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn;
-	goto try_set;
-}
-
 static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion)
 {
 	unsigned long dpages, size;
@@ -989,6 +996,42 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 }
 EXPORT_SYMBOL(dmem_free_pages);
 
+bool dmem_memory_failure(unsigned long pfn, int flags)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *pdnode = NULL;
+	u64 pos;
+	phys_addr_t addr = __pfn_to_phys(pfn);
+	bool used = false;
+
+	dregion = find_dmem_region(addr, &pdnode);
+	if (!dregion)
+		return false;
+
+	WARN_ON(!pdnode || !dregion->error_bitmap);
+
+	mutex_lock(&dmem_pool.lock);
+	pos = pfn - __phys_to_pfn(dregion->reserved_start_addr);
+	if (__test_and_set_bit(pos, dregion->error_bitmap))
+		goto out;
+
+	if (!dregion->bitmap || pfn < dpage_to_pfn(dregion->dpage_start_pfn) ||
+	    pfn >= dpage_to_pfn(dregion->dpage_end_pfn))
+		goto out;
+
+	pos = phys_to_dpage(addr) - dregion->dpage_start_pfn;
+	if (__test_and_set_bit(pos, dregion->bitmap)) {
+		used = true;
+	} else {
+		pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn);
+		dnode_count_free_dpages(pdnode, -1);
+	}
+out:
+	trace_dmem_memory_failure(pfn, used);
+	mutex_unlock(&dmem_pool.lock);
+	return true;
+}
+
 bool is_dmem_pfn(unsigned long pfn)
 {
 	struct dmem_node *dnode;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5d880d4..dda45d2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -35,6 +35,7 @@
  */
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/dmem.h>
 #include <linux/page-flags.h>
 #include <linux/kernel-page-flags.h>
 #include <linux/sched/signal.h>
@@ -1323,6 +1324,11 @@ int memory_failure(unsigned long pfn, int flags)
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
 
+	if (dmem_memory_failure(pfn, flags)) {
+		pr_info("MCE %#lx: handled by dmem\n", pfn);
+		return 0;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		if (pfn_valid(pfn)) {
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 32/37] mm, dmemfs: register and handle the dmem mce
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (30 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 31/37] dmem: introduce mce handler yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 33/37] kvm, x86: enable record_steal_time for dmem yulei.kernel
                   ` (5 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Haiwei Li

From: Yulei Zhang <yuleixzhang@tencent.com>

dmemfs registers the mce handler and sends a signal to the processes
whose vmas map the mce pfn.
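
A minimal sketch of how a module hooks into the dmem mce notifier chain
used here (my_dmem_mce_handler and my_mce_nb are made-up names;
dmem_register_mce_notifier(), dmem_unregister_mce_notifier() and
struct dmem_mce_notifier_info are the interfaces from this series):

	#include <linux/printk.h>
	#include <linux/notifier.h>
	#include <linux/dmem.h>

	static int my_dmem_mce_handler(struct notifier_block *nb,
				       unsigned long pfn, void *v)
	{
		struct dmem_mce_notifier_info *info = v;

		/* react to the poisoned dmem pfn, e.g. signal its users */
		pr_warn("dmem mce on pfn %#lx, flags %#x\n", pfn, info->flags);
		return 0;
	}

	static struct notifier_block my_mce_nb = {
		.notifier_call = my_dmem_mce_handler,
	};

	/* at module init / exit respectively */
	dmem_register_mce_notifier(&my_mce_nb);
	dmem_unregister_mce_notifier(&my_mce_nb);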

Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 fs/dmemfs/inode.c    | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dmem.h |   7 +++
 include/linux/mm.h   |   2 +
 mm/dmem.c            |  34 +++++++++++++
 mm/memory-failure.c  |  64 ++++++++++++++++-------
 5 files changed, 231 insertions(+), 17 deletions(-)

diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c
index f698b9d..4303bcdc 100644
--- a/fs/dmemfs/inode.c
+++ b/fs/dmemfs/inode.c
@@ -36,6 +36,47 @@
 
 static uint __read_mostly max_alloc_try_dpages = 1;
 
+struct dmemfs_inode {
+	struct inode *inode;
+	struct list_head link;
+};
+
+static LIST_HEAD(dmemfs_inode_list);
+static DEFINE_SPINLOCK(dmemfs_inode_lock);
+
+static struct dmemfs_inode *
+dmemfs_create_dmemfs_inode(struct inode *inode)
+{
+	struct dmemfs_inode *dmemfs_inode;
+
+	spin_lock(&dmemfs_inode_lock);
+	dmemfs_inode = kmalloc(sizeof(struct dmemfs_inode), GFP_NOIO);
+	if (!dmemfs_inode) {
+		pr_err("DMEMFS: Out of memory while getting dmemfs inode\n");
+		goto out;
+	}
+	dmemfs_inode->inode = inode;
+	list_add_tail(&dmemfs_inode->link, &dmemfs_inode_list);
+out:
+	spin_unlock(&dmemfs_inode_lock);
+	return dmemfs_inode;
+}
+
+static void dmemfs_delete_dmemfs_inode(struct inode *inode)
+{
+	struct dmemfs_inode *i, *next;
+
+	spin_lock(&dmemfs_inode_lock);
+	list_for_each_entry_safe(i, next, &dmemfs_inode_list, link) {
+		if (i->inode == inode) {
+			list_del(&i->link);
+			kfree(i);
+			break;
+		}
+	}
+	spin_unlock(&dmemfs_inode_lock);
+}
+
 struct dmemfs_mount_opts {
 	unsigned long dpage_size;
 };
@@ -218,6 +259,13 @@ static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff)
 	return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT);
 }
 
+static pgoff_t dmem_index_to_pgoff(struct inode *inode, unsigned long index)
+{
+	struct super_block *sb = inode->i_sb;
+
+	return index << (sb->s_blocksize_bits - PAGE_SHIFT);
+}
+
 static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr)
 {
 	struct super_block *sb = inode->i_sb;
@@ -806,6 +854,23 @@ static void dmemfs_evict_inode(struct inode *inode)
 	clear_inode(inode);
 }
 
+static struct inode *dmemfs_alloc_inode(struct super_block *sb)
+{
+	struct inode *inode;
+
+	inode = alloc_inode_nonrcu();
+	if (inode)
+		dmemfs_create_dmemfs_inode(inode);
+	return inode;
+}
+
+static void dmemfs_destroy_inode(struct inode *inode)
+{
+	if (inode)
+		dmemfs_delete_dmemfs_inode(inode);
+	free_inode_nonrcu(inode);
+}
+
 /*
  * Display the mount options in /proc/mounts.
  */
@@ -819,9 +884,11 @@ static int dmemfs_show_options(struct seq_file *m, struct dentry *root)
 }
 
 static const struct super_operations dmemfs_ops = {
+	.alloc_inode = dmemfs_alloc_inode,
 	.statfs	= dmemfs_statfs,
 	.evict_inode = dmemfs_evict_inode,
 	.drop_inode = generic_delete_inode,
+	.destroy_inode = dmemfs_destroy_inode,
 	.show_options = dmemfs_show_options,
 };
 
@@ -901,17 +968,91 @@ static void dmemfs_kill_sb(struct super_block *sb)
 	.kill_sb	= dmemfs_kill_sb,
 };
 
+static struct inode *
+dmemfs_find_inode_by_addr(phys_addr_t addr, pgoff_t *pgoff)
+{
+	struct dmemfs_inode *di;
+	struct inode *inode;
+	struct address_space *mapping;
+	void *entry, **slot;
+	void *mce_entry;
+
+	list_for_each_entry(di, &dmemfs_inode_list, link) {
+		inode = di->inode;
+		mapping = inode->i_mapping;
+		mce_entry = dmem_addr_to_entry(inode, addr);
+		XA_STATE(xas, &mapping->i_pages, 0);
+		rcu_read_lock();
+
+		xas_for_each(&xas, entry, ULONG_MAX) {
+			if (xas_retry(&xas, entry))
+				continue;
+
+			if (unlikely(entry != xas_reload(&xas)))
+				goto retry;
+
+			if (mce_entry != entry)
+				continue;
+			*pgoff = dmem_index_to_pgoff(inode, xas.xa_index);
+			rcu_read_unlock();
+			return inode;
+retry:
+			xas_reset(&xas);
+		}
+		rcu_read_unlock();
+	}
+	return NULL;
+}
+
+static int dmemfs_mce_handler(struct notifier_block *this, unsigned long pfn,
+			      void *v)
+{
+	struct dmem_mce_notifier_info *info =
+		(struct dmem_mce_notifier_info *)v;
+	int flags = info->flags;
+	struct inode *inode;
+	phys_addr_t mce_addr = __pfn_to_phys(pfn);
+	pgoff_t pgoff;
+
+	spin_lock(&dmemfs_inode_lock);
+	inode = dmemfs_find_inode_by_addr(mce_addr, &pgoff);
+	if (!inode || !atomic_read(&inode->i_count))
+		goto out;
+
+	collect_procs_and_signal_inode(inode, pgoff, pfn, flags);
+out:
+	spin_unlock(&dmemfs_inode_lock);
+	return 0;
+}
+
+static struct notifier_block dmemfs_mce_notifier = {
+	.notifier_call	= dmemfs_mce_handler,
+};
+
 static int __init dmemfs_init(void)
 {
 	int ret;
 
+	pr_info("dmemfs initialized\n");
 	ret = register_filesystem(&dmemfs_fs_type);
+	if (ret)
+		goto reg_fs_fail;
+
+	ret = dmem_register_mce_notifier(&dmemfs_mce_notifier);
+	if (ret)
+		goto reg_notifier_fail;
 
+	return 0;
+
+reg_notifier_fail:
+	unregister_filesystem(&dmemfs_fs_type);
+reg_fs_fail:
 	return ret;
 }
 
 static void __exit dmemfs_uninit(void)
 {
+	dmem_unregister_mce_notifier(&dmemfs_mce_notifier);
 	unregister_filesystem(&dmemfs_fs_type);
 }
 
diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index cd17a91..fe0b270 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -23,6 +23,13 @@
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 
 bool dmem_memory_failure(unsigned long pfn, int flags);
+
+struct dmem_mce_notifier_info {
+	int flags;
+};
+
+int dmem_register_mce_notifier(struct notifier_block *nb);
+int dmem_unregister_mce_notifier(struct notifier_block *nb);
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2f3135fe..fa20f9c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3041,6 +3041,8 @@ enum mf_flags {
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
+extern void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff,
+				    unsigned long pfn, int flags);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
diff --git a/mm/dmem.c b/mm/dmem.c
index 16438db..dd81b24 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -70,6 +70,7 @@ struct dmem_node {
 
 struct dmem_pool {
 	struct mutex lock;
+	struct raw_notifier_head mce_notifier_chain;
 
 	unsigned long region_num;
 	unsigned long registered_pages;
@@ -92,6 +93,7 @@ struct dmem_pool {
 
 static struct dmem_pool dmem_pool = {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
+	.mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain),
 };
 
 #define DMEM_PAGE_SIZE		(1UL << dmem_pool.dpage_shift)
@@ -121,6 +123,35 @@ struct dmem_pool {
 #define for_each_dmem_region(_dnode, _dregion)				\
 	list_for_each_entry(_dregion, &(_dnode)->regions, node)
 
+int dmem_register_mce_notifier(struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&dmem_pool.lock);
+	ret = raw_notifier_chain_register(&dmem_pool.mce_notifier_chain, nb);
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_register_mce_notifier);
+
+int dmem_unregister_mce_notifier(struct notifier_block *nb)
+{
+	int ret;
+
+	mutex_lock(&dmem_pool.lock);
+	ret = raw_notifier_chain_unregister(&dmem_pool.mce_notifier_chain, nb);
+	mutex_unlock(&dmem_pool.lock);
+	return ret;
+}
+EXPORT_SYMBOL(dmem_unregister_mce_notifier);
+
+static int dmem_mce_notify(unsigned long pfn,
+			   struct dmem_mce_notifier_info *info)
+{
+	return raw_notifier_call_chain(&dmem_pool.mce_notifier_chain,
+				       pfn, info);
+}
+
 static inline int *dmem_nodelist(int nid)
 {
 	return nid_to_dnode(nid)->nodelist;
@@ -1003,6 +1034,7 @@ bool dmem_memory_failure(unsigned long pfn, int flags)
 	u64 pos;
 	phys_addr_t addr = __pfn_to_phys(pfn);
 	bool used = false;
+	struct dmem_mce_notifier_info info;
 
 	dregion = find_dmem_region(addr, &pdnode);
 	if (!dregion)
@@ -1022,6 +1054,8 @@ bool dmem_memory_failure(unsigned long pfn, int flags)
 	pos = phys_to_dpage(addr) - dregion->dpage_start_pfn;
 	if (__test_and_set_bit(pos, dregion->bitmap)) {
 		used = true;
+		info.flags = flags;
+		dmem_mce_notify(pfn, &info);
 	} else {
 		pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn);
 		dnode_count_free_dpages(pdnode, -1);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index dda45d2..3aa7fe7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -334,8 +334,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page,
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  */
 static void add_to_kill(struct task_struct *tsk, struct page *p,
-		       struct vm_area_struct *vma,
-		       struct list_head *to_kill)
+		       struct vm_area_struct *vma, unsigned long pfn,
+		       pgoff_t pgoff, struct list_head *to_kill)
 {
 	struct to_kill *tk;
 
@@ -345,12 +345,17 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = page_address_in_vma(p, vma);
-	if (is_zone_device_page(p))
-		tk->size_shift = dev_pagemap_mapping_shift(p, vma);
-	else
-		tk->size_shift = page_shift(compound_head(p));
-
+	if (p) {
+		tk->addr = page_address_in_vma(p, vma);
+		if (is_zone_device_page(p))
+			tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+		else
+			tk->size_shift = page_shift(compound_head(p));
+	} else {
+		tk->size_shift = PAGE_SHIFT;
+		tk->addr = vma->vm_start +
+			((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
+	}
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
 	 * "tk->size_shift" is always non-zero for !is_zone_device_page(),
@@ -363,7 +368,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p,
 	 */
 	if (tk->addr == -EFAULT) {
 		pr_info("Memory failure: Unable to find user space address %lx in %s\n",
-			page_to_pfn(p), tsk->comm);
+			pfn, tsk->comm);
 	} else if (tk->size_shift == 0) {
 		kfree(tk);
 		return;
@@ -496,7 +501,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 			if (!page_mapped_in_vma(page, vma))
 				continue;
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, vma, page_to_pfn(page),
+					page_to_pgoff(page), to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -504,19 +510,17 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 }
 
 /*
- * Collect processes when the error hit a file mapped page.
+ * Collect processes when the error hit a file mapped memory.
  */
-static void collect_procs_file(struct page *page, struct list_head *to_kill,
-				int force_early)
+static void __collect_procs_file(struct address_space *mapping, pgoff_t pgoff,
+				struct page *page, unsigned long pfn,
+				struct list_head *to_kill, int force_early)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
-	struct address_space *mapping = page->mapping;
-	pgoff_t pgoff;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
-	pgoff = page_to_pgoff(page);
 	for_each_process(tsk) {
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
@@ -532,7 +536,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 			 * to be informed of all such data corruptions.
 			 */
 			if (vma->vm_mm == t->mm)
-				add_to_kill(t, page, vma, to_kill);
+				add_to_kill(t, page, vma, pfn, pgoff, to_kill);
 		}
 	}
 	read_unlock(&tasklist_lock);
@@ -540,6 +544,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
 }
 
 /*
+ * Collect processes when the error hit a file mapped page.
+ */
+static void collect_procs_file(struct page *page, struct list_head *to_kill,
+				int force_early)
+{
+	struct address_space *mapping = page->mapping;
+
+	__collect_procs_file(mapping, page_to_pgoff(page), page,
+			     page_to_pfn(page), to_kill, force_early);
+}
+
+void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff,
+					unsigned long pfn, int flags)
+{
+	int forcekill;
+	struct address_space *mapping = &inode->i_data;
+	LIST_HEAD(tokill);
+
+	__collect_procs_file(mapping, pgoff, NULL, pfn, &tokill,
+			     flags & MF_ACTION_REQUIRED);
+	forcekill = flags & MF_MUST_KILL;
+	kill_procs(&tokill, forcekill, false, pfn, flags);
+}
+EXPORT_SYMBOL(collect_procs_and_signal_inode);
+
+/*
  * Collect the processes who have the corrupted page mapped to kill.
  */
 static void collect_procs(struct page *page, struct list_head *tokill,
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 33/37] kvm, x86: enable record_steal_time for dmem
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (31 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 32/37] mm, dmemfs: register and handle the dmem mce yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 34/37] dmem: add dmem unit tests yulei.kernel
                   ` (4 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Adjust kvm_map_gfn() to enable record_steal_time when entering the
guest while the guest memory is backed by dmemfs.
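
This relies on the assumption, sketched below, that dmem pfns are
ordinary reserved RAM covered by the kernel direct mapping, so the host
virtual address can be taken with __va() and the memremap()/memunmap()
pair is skipped (the branch mirrors the hunks in this patch; the
comments are illustrative):

	if (is_dmem_pfn(pfn))
		/* dmem page: already in the direct map, no remap needed */
		hva = __va(PFN_PHYS(pfn));
	else
		hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);

	/* and on unmap, memunmap() only what was memremap()ed */
	if (!is_dmem_pfn(map->pfn))
		memunmap(map->hva);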

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 virt/kvm/kvm_main.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2541a17..500b170 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -51,6 +51,7 @@
 #include <linux/io.h>
 #include <linux/lockdep.h>
 #include <linux/kthread.h>
+#include <linux/dmem.h>
 
 #include <asm/processor.h>
 #include <asm/ioctl.h>
@@ -2164,7 +2165,10 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn,
 			hva = kmap(page);
 #ifdef CONFIG_HAS_IOMEM
 	} else if (!atomic) {
-		hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
+		if (is_dmem_pfn(pfn))
+			hva = __va(PFN_PHYS(pfn));
+		else
+			hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB);
 	} else {
 		return -EINVAL;
 #endif
@@ -2214,9 +2218,10 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot,
 			kunmap(map->page);
 	}
 #ifdef CONFIG_HAS_IOMEM
-	else if (!atomic)
-		memunmap(map->hva);
-	else
+	else if (!atomic) {
+		if (!is_dmem_pfn(map->pfn))
+			memunmap(map->hva);
+	} else
 		WARN_ONCE(1, "Unexpected unmapping in atomic context");
 #endif
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 34/37] dmem: add dmem unit tests
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (32 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 33/37] kvm, x86: enable record_steal_time for dmem yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 35/37] mm, dmem: introduce dregion->memmap for dmem yulei.kernel
                   ` (3 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Xiao Guangrong

From: Yulei Zhang <yuleixzhang@tencent.com>

This test case is used to exercise the dmem management system: page
allocation in different orders, NUMA node fallback and freeing.
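
A possible way to build and run it in-tree (assuming dmem memory has
been reserved with the 'dmem=' boot parameter so the allocator has
something to hand out; the commands below are illustrative only):

	$ cd tools/testing/dmem
	$ make                    # uses KDIR=../../../ from the Makefile
	# insmod dmem-test.ko     # runs order_test() and node_test()
	# dmesg | grep dmem       # check for "dmem test success"
	# rmmod dmem-test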

Signed-off-by: Xiao Guangrong <gloryxiao@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 tools/testing/dmem/Kbuild      |   1 +
 tools/testing/dmem/Makefile    |  10 +++
 tools/testing/dmem/dmem-test.c | 184 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 195 insertions(+)
 create mode 100644 tools/testing/dmem/Kbuild
 create mode 100644 tools/testing/dmem/Makefile
 create mode 100644 tools/testing/dmem/dmem-test.c

diff --git a/tools/testing/dmem/Kbuild b/tools/testing/dmem/Kbuild
new file mode 100644
index 00000000..04988f7
--- /dev/null
+++ b/tools/testing/dmem/Kbuild
@@ -0,0 +1 @@
+obj-m += dmem-test.o
diff --git a/tools/testing/dmem/Makefile b/tools/testing/dmem/Makefile
new file mode 100644
index 00000000..21f141f
--- /dev/null
+++ b/tools/testing/dmem/Makefile
@@ -0,0 +1,10 @@
+KDIR ?= ../../../
+
+default:
+	$(MAKE) -C $(KDIR) M=$$PWD
+
+install: default
+	$(MAKE) -C $(KDIR) M=$$PWD modules_install
+
+clean:
+	rm -f *.o *.ko Module.* modules.* *.mod.c
diff --git a/tools/testing/dmem/dmem-test.c b/tools/testing/dmem/dmem-test.c
new file mode 100644
index 00000000..4baae18
--- /dev/null
+++ b/tools/testing/dmem/dmem-test.c
@@ -0,0 +1,184 @@
+/*
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/module.h>
+#include <linux/sizes.h>
+#include <linux/list.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/dmem.h>
+
+struct dmem_mem_node {
+	struct list_head node;
+};
+
+static LIST_HEAD(dmem_list);
+
+static int dmem_test_alloc_init(unsigned long dpage_shift)
+{
+	int ret;
+
+	ret = dmem_alloc_init(dpage_shift);
+	if (ret)
+		pr_info("dmem_alloc_init failed, dpage_shift %ld ret=%d\n",
+			dpage_shift, ret);
+	return ret;
+}
+
+static int __dmem_test_alloc(int order, int nid, nodemask_t *nodemask,
+			     const char *caller)
+{
+	struct dmem_mem_node *pos;
+	phys_addr_t addr;
+	int i, ret = 0;
+
+	for (i = 0; i < (1 << order); i++) {
+		addr = dmem_alloc_pages_nodemask(nid, nodemask, 1, NULL);
+		if (!addr) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		pos = __va(addr);
+		list_add(&pos->node, &dmem_list);
+	}
+
+	pr_info("%s: alloc order %d on node %d has fallback node %s... %s.\n",
+		caller, order, nid, nodemask ? "yes" : "no",
+		!ret ? "okay" : "failed");
+
+	return ret;
+}
+
+static void dmem_test_free_all(void)
+{
+	struct dmem_mem_node *pos, *n;
+
+	list_for_each_entry_safe(pos, n, &dmem_list, node) {
+		list_del(&pos->node);
+		dmem_free_page(__pa(pos));
+	}
+}
+
+#define dmem_test_alloc(order, nid, nodemask)	\
+	__dmem_test_alloc(order, nid, nodemask, __func__)
+
+/* dmem should have 2^6 native pages available at least */
+static int order_test(void)
+{
+	int order, i, ret;
+	int page_orders[] = {0, 1, 2, 3, 4, 5, 6};
+
+	ret = dmem_test_alloc_init(PAGE_SHIFT);
+	if (ret)
+		return ret;
+
+	for (i = 0; i < ARRAY_SIZE(page_orders); i++) {
+		order = page_orders[i];
+
+		ret = dmem_test_alloc(order, numa_node_id(), NULL);
+		if (ret)
+			break;
+	}
+
+	dmem_test_free_all();
+
+	dmem_alloc_uinit();
+
+	return ret;
+}
+
+static int node_test(void)
+{
+	nodemask_t nodemask;
+	unsigned long nr = 0;
+	int order;
+	int node;
+	int ret = 0;
+
+	order = 0;
+
+	ret = dmem_test_alloc_init(PUD_SHIFT);
+	if (ret)
+		return ret;
+
+	pr_info("%s: test allocation on node 0\n", __func__);
+	node = 0;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+
+	ret = dmem_test_alloc(order, node, &nodemask);
+	if (ret)
+		goto exit;
+
+	dmem_test_free_all();
+
+	pr_info("%s: begin to exhaust dmem on node 0.\n", __func__);
+	node = 1;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+
+	INIT_LIST_HEAD(&dmem_list);
+	while (!(ret = dmem_test_alloc(order, node, &nodemask)))
+		nr++;
+
+	pr_info("Allocation on node 0 success times: %lu\n", nr);
+
+	pr_info("%s: allocation on node 0 again\n", __func__);
+	node = 0;
+	nodes_clear(nodemask);
+	node_set(0, nodemask);
+	ret = dmem_test_alloc(order, node, &nodemask);
+	if (!ret) {
+		pr_info("\tNot expected fallback\n");
+		ret = -1;
+	} else {
+		ret = 0;
+		pr_info("\tOK, Dmem on node 0 exhausted, fallback success\n");
+	}
+
+	pr_info("%s: Release dmem\n", __func__);
+	dmem_test_free_all();
+
+exit:
+	dmem_alloc_uinit();
+	return ret;
+}
+
+static __init int dmem_test_init(void)
+{
+	int ret;
+
+	pr_info("dmem: test init...\n");
+
+	ret = order_test();
+	if (ret)
+		return ret;
+
+	ret = node_test();
+
+
+	if (ret)
+		pr_info("dmem test fail, ret=%d\n", ret);
+	else
+		pr_info("dmem test success\n");
+	return ret;
+}
+
+static __exit void dmem_test_exit(void)
+{
+	pr_info("dmem: test exit...\n");
+}
+
+module_init(dmem_test_init);
+module_exit(dmem_test_exit);
+MODULE_LICENSE("GPL v2");
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 35/37] mm, dmem: introduce dregion->memmap for dmem
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (33 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 34/37] dmem: add dmem unit tests yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 36/37] vfio: support dmempage refcount for vfio yulei.kernel
                   ` (2 subsequent siblings)
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Append 'memmap' to struct dmem_region, mapping each dmem page to a
struct dmempage.

Currently there is just one member, '_refcount', in struct dmempage,
which reflects the number of modules holding the dmem page.

A module that allocates a dmem page from the dmem pool makes the first
reference and sets _refcount to 1.

A module that frees a dmem page back to the dmem pool decrements
_refcount and frees the page if _refcount drops to zero.

Each time module A passes a dmem page to module B, module B should call
get_dmem_pfn() to increase the page's _refcount before using it, so that
it never references a dmem page that is freed by another module in
parallel. Conversely, once module B has finished with the dmem page, it
should call put_dmem_pfn() to decrease the _refcount.
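
A minimal sketch of that get/put pattern from the point of view of a
consumer module (the surrounding flow is hypothetical; only
is_dmem_pfn(), get_dmem_pfn() and put_dmem_pfn() are real interfaces):

	/* module B: take a reference on the pfn handed over by module A */
	if (is_dmem_pfn(pfn))
		get_dmem_pfn(pfn);

	/* ... use the dmem page, e.g. map it into an IOMMU ... */

	/* drop the reference once module B is done with the page */
	if (is_dmem_pfn(pfn))
		put_dmem_pfn(pfn);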

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 include/linux/dmem.h |   5 ++
 mm/dmem.c            | 147 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index fe0b270..8aaa80b 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -22,6 +22,9 @@
 bool is_dmem_pfn(unsigned long pfn);
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 
+void get_dmem_pfn(unsigned long pfn);
+#define put_dmem_pfn(pfn)	dmem_free_page(PFN_PHYS(pfn))
+
 bool dmem_memory_failure(unsigned long pfn, int flags);
 
 struct dmem_mce_notifier_info {
@@ -45,5 +48,7 @@ static inline bool dmem_memory_failure(unsigned long pfn, int flags)
 {
 	return false;
 }
+static inline void get_dmem_pfn(unsigned long pfn) {}
+static inline void put_dmem_pfn(unsigned long pfn) {}
 #endif
 #endif	/* _LINUX_DMEM_H */
diff --git a/mm/dmem.c b/mm/dmem.c
index dd81b24..776dbf2 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -47,6 +47,7 @@ struct dmem_region {
 
 	unsigned long static_error_bitmap;
 	unsigned long *error_bitmap;
+	void *memmap;
 };
 
 /*
@@ -91,6 +92,10 @@ struct dmem_pool {
 	struct dmem_node nodes[MAX_NUMNODES];
 };
 
+struct dmempage {
+	atomic_t _refcount;
+};
+
 static struct dmem_pool dmem_pool = {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 	.mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain),
@@ -123,6 +128,40 @@ struct dmem_pool {
 #define for_each_dmem_region(_dnode, _dregion)				\
 	list_for_each_entry(_dregion, &(_dnode)->regions, node)
 
+#define pfn_to_dmempage(_pfn, _dregion)					\
+	((struct dmempage *)(_dregion)->memmap +			\
+	pfn_to_dpage(_pfn) - (_dregion)->dpage_start_pfn)
+
+#define dmempage_to_dpage(_dmempage, _dregion)				\
+	((_dmempage) - (struct dmempage *)(_dregion)->memmap +		\
+	(_dregion)->dpage_start_pfn)
+
+static inline int dmempage_count(struct dmempage *dmempage)
+{
+	return atomic_read(&dmempage->_refcount);
+}
+
+static inline void set_dmempage_count(struct dmempage *dmempage, int v)
+{
+	atomic_set(&dmempage->_refcount, v);
+}
+
+static inline void dmempage_ref_inc(struct dmempage *dmempage)
+{
+	atomic_inc(&dmempage->_refcount);
+}
+
+static inline int dmempage_ref_dec_and_test(struct dmempage *dmempage)
+{
+	return atomic_dec_and_test(&dmempage->_refcount);
+}
+
+static inline int put_dmempage_testzero(struct dmempage *dmempage)
+{
+	VM_BUG_ON(dmempage_count(dmempage) == 0);
+	return dmempage_ref_dec_and_test(dmempage);
+}
+
 int dmem_register_mce_notifier(struct notifier_block *nb)
 {
 	int ret;
@@ -559,10 +598,25 @@ static int __init dmem_late_init(void)
 }
 late_initcall(dmem_late_init);
 
+static void *dmem_memmap_alloc(unsigned long dpages)
+{
+	unsigned long size;
+
+	size = dpages * sizeof(struct dmempage);
+	return vzalloc(size);
+}
+
+static void dmem_memmap_free(void *memmap)
+{
+	if (memmap)
+		vfree(memmap);
+}
+
 static int dmem_alloc_region_init(struct dmem_region *dregion,
 				  unsigned long *dpages)
 {
 	unsigned long start, end, *bitmap;
+	void *memmap;
 
 	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
 	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
@@ -575,7 +629,14 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	if (!bitmap)
 		return -ENOMEM;
 
+	memmap = dmem_memmap_alloc(*dpages);
+	if (!memmap) {
+		dmem_bitmap_free(*dpages, bitmap, &dregion->static_bitmap);
+		return -ENOMEM;
+	}
+
 	dregion->bitmap = bitmap;
+	dregion->memmap = memmap;
 	dregion->next_free_pos = 0;
 	dregion->dpage_start_pfn = start;
 	dregion->dpage_end_pfn = end;
@@ -650,7 +711,9 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 	dmem_uinit_check_alloc_bitmap(dregion);
 
 	dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap);
+	dmem_memmap_free(dregion->memmap);
 	dregion->bitmap = NULL;
+	dregion->memmap = NULL;
 }
 
 static void __dmem_alloc_uinit(void)
@@ -793,6 +856,16 @@ int dmem_alloc_init(unsigned long dpage_shift)
 	return dpage_to_phys(dregion->dpage_start_pfn + pos);
 }
 
+static void prep_new_dmempage(unsigned long phys, unsigned int nr,
+			      struct dmem_region *dregion)
+{
+	struct dmempage *dmempage = pfn_to_dmempage(PHYS_PFN(phys), dregion);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, dmempage++)
+		set_dmempage_count(dmempage, 1);
+}
+
 /*
  * allocate dmem pages from the nodelist
  *
@@ -839,6 +912,7 @@ int dmem_alloc_init(unsigned long dpage_shift)
 			if (addr) {
 				dnode_count_free_dpages(dnode,
 							-(long)(*result_nr));
+				prep_new_dmempage(addr, *result_nr, dregion);
 				break;
 			}
 		}
@@ -993,6 +1067,41 @@ static struct dmem_region *find_dmem_region(phys_addr_t phys_addr,
 	return NULL;
 }
 
+static unsigned int free_dmempages_prepare(struct dmempage *dmempage,
+				   unsigned int dpages_nr)
+{
+	unsigned int i, ret = 0;
+
+	for (i = 0; i < dpages_nr; i++, dmempage++)
+		if (put_dmempage_testzero(dmempage))
+			ret++;
+
+	return ret;
+}
+
+void __dmem_free_pages(struct dmempage *dmempage,
+		       unsigned int dpages_nr,
+		       struct dmem_region *dregion,
+		       struct dmem_node *pdnode)
+{
+	phys_addr_t dpage = dmempage_to_dpage(dmempage, dregion);
+	u64 pos;
+	unsigned long err_dpages;
+
+	trace_dmem_free_pages(dpage_to_phys(dpage), dpages_nr);
+	WARN_ON(!dmem_pool.dpage_shift);
+
+	pos = dpage - dregion->dpage_start_pfn;
+	dregion->next_free_pos = min(dregion->next_free_pos, pos);
+
+	/* it is not possible to span multiple regions */
+	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+
+	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+
+	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
+}
+
 /*
  * free dmem page to the dmem pool
  *   @addr: the physical addree will be freed
@@ -1002,27 +1111,26 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 {
 	struct dmem_region *dregion;
 	struct dmem_node *pdnode = NULL;
-	phys_addr_t dpage = phys_to_dpage(addr);
-	u64 pos;
-	unsigned long err_dpages;
+	struct dmempage *dmempage;
+	unsigned int nr;
 
 	mutex_lock(&dmem_pool.lock);
 
-	trace_dmem_free_pages(addr, dpages_nr);
-	WARN_ON(!dmem_pool.dpage_shift);
-
 	dregion = find_dmem_region(addr, &pdnode);
 	WARN_ON(!dregion || !dregion->bitmap || !pdnode);
 
-	pos = dpage - dregion->dpage_start_pfn;
-	dregion->next_free_pos = min(dregion->next_free_pos, pos);
-
-	/* it is not possible to span multiple regions */
-	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+	dmempage = pfn_to_dmempage(PHYS_PFN(addr), dregion);
 
-	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+	nr = free_dmempages_prepare(dmempage, dpages_nr);
+	if (nr == dpages_nr)
+		__dmem_free_pages(dmempage, dpages_nr, dregion, pdnode);
+	else if (nr)
+		for (; dpages_nr--; dmempage++) {
+			if (dmempage_count(dmempage))
+				continue;
+			__dmem_free_pages(dmempage, 1, dregion, pdnode);
+		}
 
-	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
 	mutex_unlock(&dmem_pool.lock);
 }
 EXPORT_SYMBOL(dmem_free_pages);
@@ -1073,3 +1181,16 @@ bool is_dmem_pfn(unsigned long pfn)
 	return !!find_dmem_region(__pfn_to_phys(pfn), &dnode);
 }
 EXPORT_SYMBOL(is_dmem_pfn);
+
+void get_dmem_pfn(unsigned long pfn)
+{
+	struct dmem_region *dregion = find_dmem_region(PFN_PHYS(pfn), NULL);
+	struct dmempage *dmempage;
+
+	VM_BUG_ON(!dregion || !dregion->memmap);
+
+	dmempage = pfn_to_dmempage(pfn, dregion);
+	VM_BUG_ON(dmempage_count(dmempage) + 127u <= 127u);
+	dmempage_ref_inc(dmempage);
+}
+EXPORT_SYMBOL(get_dmem_pfn);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 36/37] vfio: support dmempage refcount for vfio
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (34 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 35/37] mm, dmem: introduce dregion->memmap for dmem yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-07 11:31 ` [RFC V2 37/37] Add documentation for dmemfs yulei.kernel
  2020-12-07 12:02 ` [RFC V2 00/37] Enhance memory utilization with DMEMFS David Hildenbrand
  37 siblings, 0 replies; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang,
	Chen Zhuo

From: Yulei Zhang <yuleixzhang@tencent.com>

Call get_dmem_pfn()/put_dmem_pfn() each time the vfio module
references/releases a dmem page.

Signed-off-by: Chen Zhuo <sagazchen@tencent.com>
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 drivers/vfio/vfio_iommu_type1.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c465d1a..4856a89 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -39,6 +39,7 @@
 #include <linux/notifier.h>
 #include <linux/dma-iommu.h>
 #include <linux/irqdomain.h>
+#include <linux/dmem.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson@redhat.com>"
@@ -411,7 +412,10 @@ static int put_pfn(unsigned long pfn, int prot)
 
 		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
-	}
+	} else if (is_dmem_pfn(pfn))
+		put_dmem_pfn(pfn);
+
+	/* Dmem page is not counted against user. */
 	return 0;
 }
 
@@ -477,6 +481,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 
 		if (!ret && !is_invalid_reserved_pfn(*pfn))
 			ret = -EFAULT;
+
+		if (!ret && is_dmem_pfn(*pfn))
+			get_dmem_pfn(*pfn);
 	}
 done:
 	mmap_read_unlock(mm);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [RFC V2 37/37] Add documentation for dmemfs
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (35 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 36/37] vfio: support dmempage refcount for vfio yulei.kernel
@ 2020-12-07 11:31 ` yulei.kernel
  2020-12-24 18:27   ` Randy Dunlap
  2020-12-07 12:02 ` [RFC V2 00/37] Enhance memory utilization with DMEMFS David Hildenbrand
  37 siblings, 1 reply; 41+ messages in thread
From: yulei.kernel @ 2020-12-07 11:31 UTC (permalink / raw)
  To: linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang

From: Yulei Zhang <yuleixzhang@tencent.com>

Introduce dmemfs.rst to document the basic usage of dmemfs.

Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
---
 Documentation/filesystems/dmemfs.rst | 58 ++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/index.rst  |  1 +
 2 files changed, 59 insertions(+)
 create mode 100644 Documentation/filesystems/dmemfs.rst

diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst
new file mode 100644
index 00000000..f13ed0c
--- /dev/null
+++ b/Documentation/filesystems/dmemfs.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+The Direct Memory Filesystem - DMEMFS
+=====================================
+
+
+.. Table of contents
+
+   - Overview
+   - Compilation
+   - Usage
+
+Overview
+========
+
+Dmemfs (Direct Memory filesystem) is device memory or reserved
+memory based filesystem. This kind of memory is special as it
+is not managed by kernel and it is without 'struct page'. Therefore
+it can save extra memory from the host system for various usage,
+especially for guest virtual machines.
+
+It uses a kernel boot parameter ``dmem=`` to reserve the system
+memory when the host system boots up, the details can be checked
+in /Documentation/admin-guide/kernel-parameters.txt.
+
+Compilation
+===========
+
+The filesystem should be enabled by turning on the kernel configuration
+options::
+
+        CONFIG_DMEM_FS          - Direct Memory filesystem support
+        CONFIG_DMEM             - Allow reservation of memory for dmem
+
+
+Additionally, the following can be turned on to aid debugging::
+
+        CONFIG_DMEM_DEBUG_FS    - Enable debug information for dmem
+
+Usage
+========
+
+Dmemfs supports mapping ``4K``, ``2M`` and ``1G`` size of pages to
+the userspace, for example ::
+
+    # mount -t dmemfs none -o pagesize=4K /mnt/
+
+The it can create the backing storage with 4G size ::
+
+    # truncate /mnt/dmemfs-uuid --size 4G
+
+To use as backing storage for virtual machine starts with qemu, just need
+to specify the memory-backed-file in the qemu command line like this ::
+
+    # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \
+        share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
+
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 98f59a8..23e944b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -120,3 +120,4 @@ Documentation for filesystem implementations.
    xfs-delayed-logging-design
    xfs-self-describing-metadata
    zonefs
+   dmemfs
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [RFC V2 00/37] Enhance memory utilization with DMEMFS
  2020-12-07 11:30 [RFC V2 00/37] Enhance memory utilization with DMEMFS yulei.kernel
                   ` (36 preceding siblings ...)
  2020-12-07 11:31 ` [RFC V2 37/37] Add documentation for dmemfs yulei.kernel
@ 2020-12-07 12:02 ` David Hildenbrand
  2020-12-07 19:32   ` Dan Williams
  37 siblings, 1 reply; 41+ messages in thread
From: David Hildenbrand @ 2020-12-07 12:02 UTC (permalink / raw)
  To: yulei.kernel, linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini, Dan Williams
  Cc: joao.m.martins, rdunlap, sean.j.christopherson,
	xiaoguangrong.eric, kernellwp, lihaiwei.kernel, Yulei Zhang

On 07.12.20 12:30, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> In current system each physical memory page is assocaited with
> a page structure which is used to track the usage of this page.
> But due to the memory usage rapidly growing in cloud environment,
> we find the resource consuming for page structure storage becomes
> more and more remarkable. So is it possible that we could reclaim
> such memory and make it reusable?
> 
> This patchset introduces an idea about how to save the extra
> memory through a new virtual filesystem -- dmemfs.
> 
> Dmemfs (Direct Memory filesystem) is device memory or reserved
> memory based filesystem. This kind of memory is special as it
> is not managed by kernel and most important it is without 'struct page'.
> Therefore we can leverage the extra memory from the host system
> to support more tenants in our cloud service.

"is not managed by kernel" well, it's obviously is managed by the
kernel. It's not managed by the buddy ;)

How is this different to using "mem=X" and mapping the relevant memory
directly into applications? Is this "simply" a control instance on top
that makes sure unprivileged process can access it and not step onto
each others feet? Is that the reason why it's called  a "file system"?
(an example would have helped here, showing how it's used)

It's worth noting that memory hotunplug, memory poisoning and probably
more is currently fundamentally incompatible with this approach - which
should better be pointed out in the cover letter.

Also, I think something similar can be obtained by using dax/hmat
infrastructure with "memmap=", at least I remember a talk where this was
discussed (but not sure if they modified the firmware to expose selected
memory as soft-reserved - we would only need a cmdline parameter to
achieve the same - Dan might know more).

> 
> As the belowing figure shows, we uses a kernel boot parameter 'dmem='
> to reserve the system memory when the host system boots up, the
> remaining system memory is still managed by system memory management
> which is associated with "struct page", the reserved memory
> will be managed by dmem and assigned to guest system, the details
> can be checked in /Documentation/admin-guide/kernel-parameters.txt.
> 
>    +------------------+--------------------------------------+
>    |  system memory   |     memory for guest system          | 
>    +------------------+--------------------------------------+
>     |                                   |
>     v                                   |
> struct page                             |
>     |                                   |
>     v                                   v
>     system mem management             dmem  
> 
> And during the usage, the dmemfs will handle the memory request to
> allocate and free the reserved memory on each NUMA node, the user 
> space application could leverage the mmap interface to access the 
> memory, and kernel module such as kvm and vfio would be able to pin
> the memory thongh follow_pfn() and get_user_page() in different given
> page size granularities.

I cannot say that I really like this approach. I really prefer the
proposal to free-up most vmemmap pages for huge/gigantic pages instead
if all this is about is reducing the memmap size.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC V2 00/37] Enhance memory utilization with DMEMFS
  2020-12-07 12:02 ` [RFC V2 00/37] Enhance memory utilization with DMEMFS David Hildenbrand
@ 2020-12-07 19:32   ` Dan Williams
  0 siblings, 0 replies; 41+ messages in thread
From: Dan Williams @ 2020-12-07 19:32 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: yulei zhang, Linux MM, Andrew Morton, linux-fsdevel, KVM list,
	Linux Kernel Mailing List, Naoya Horiguchi, Al Viro,
	Paolo Bonzini, Joao Martins, Randy Dunlap, Sean J Christopherson,
	Xiao Guangrong, Wanpeng Li, Haiwei Li, Yulei Zhang

On Mon, Dec 7, 2020 at 4:03 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.12.20 12:30, yulei.kernel@gmail.com wrote:
> > From: Yulei Zhang <yuleixzhang@tencent.com>
> >
> > In current system each physical memory page is assocaited with
> > a page structure which is used to track the usage of this page.
> > But due to the memory usage rapidly growing in cloud environment,
> > we find the resource consuming for page structure storage becomes
> > more and more remarkable. So is it possible that we could reclaim
> > such memory and make it reusable?
> >
> > This patchset introduces an idea about how to save the extra
> > memory through a new virtual filesystem -- dmemfs.
> >
> > Dmemfs (Direct Memory filesystem) is device memory or reserved
> > memory based filesystem. This kind of memory is special as it
> > is not managed by kernel and most important it is without 'struct page'.
> > Therefore we can leverage the extra memory from the host system
> > to support more tenants in our cloud service.
>
> "is not managed by kernel" well, it's obviously is managed by the
> kernel. It's not managed by the buddy ;)
>
> How is this different to using "mem=X" and mapping the relevant memory
> directly into applications? Is this "simply" a control instance on top
> that makes sure unprivileged process can access it and not step onto
> each others feet? Is that the reason why it's called  a "file system"?
> (an example would have helped here, showing how it's used)
>
> It's worth noting that memory hotunplug, memory poisoning and probably
> more is currently fundamentally incompatible with this approach - which
> should better be pointed out in the cover letter.
>
> Also, I think something similar can be obtained by using dax/hmat
> infrastructure with "memmap=", at least I remember a talk where this was
> discussed (but not sure if they modified the firmware to expose selected
> memory as soft-reserved - we would only need a cmdline parameter to
> achieve the same - Dan might know more).

There is currently the efi_fake_mem parameter that can add the
"EFI_MEMORY_SP" attribute on EFI platforms:

    efi_fake_mem=4G@9G:0x40000

...this results in a /dev/dax instance that can be further partitioned
via the device-dax sub-division facility merged for 5.10. That could
be generalized to something else for non-EFI platforms, but there has
not been a justification to go that route yet.

Joao pointed this out in a previous posting of DMEMFS, and I have yet
to see an explanation of incremental benefit the kernel gains from
having yet another parallel memory management interface.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [RFC V2 37/37] Add documentation for dmemfs
  2020-12-07 11:31 ` [RFC V2 37/37] Add documentation for dmemfs yulei.kernel
@ 2020-12-24 18:27   ` Randy Dunlap
  0 siblings, 0 replies; 41+ messages in thread
From: Randy Dunlap @ 2020-12-24 18:27 UTC (permalink / raw)
  To: yulei.kernel, linux-mm, akpm, linux-fsdevel, kvm, linux-kernel,
	naoya.horiguchi, viro, pbonzini
  Cc: joao.m.martins, sean.j.christopherson, xiaoguangrong.eric,
	kernellwp, lihaiwei.kernel, Yulei Zhang

Hi,

On 12/7/20 3:31 AM, yulei.kernel@gmail.com wrote:
> From: Yulei Zhang <yuleixzhang@tencent.com>
> 
> Introduce dmemfs.rst to document the basic usage of dmemfs.
> 
> Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
> ---
>  Documentation/filesystems/dmemfs.rst | 58 ++++++++++++++++++++++++++++++++++++
>  Documentation/filesystems/index.rst  |  1 +
>  2 files changed, 59 insertions(+)
>  create mode 100644 Documentation/filesystems/dmemfs.rst
> 
> diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst
> new file mode 100644
> index 00000000..f13ed0c
> --- /dev/null
> +++ b/Documentation/filesystems/dmemfs.rst
> @@ -0,0 +1,58 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================================
> +The Direct Memory Filesystem - DMEMFS
> +=====================================
> +
> +
> +.. Table of contents
> +
> +   - Overview
> +   - Compilation
> +   - Usage
> +
> +Overview
> +========
> +
> +Dmemfs (Direct Memory filesystem) is device memory or reserved
> +memory based filesystem. This kind of memory is special as it
> +is not managed by kernel and it is without 'struct page'. Therefore
> +it can save extra memory from the host system for various usage,

                                                             usages,
or                                                           uses,

> +especially for guest virtual machines.
> +
> +It uses a kernel boot parameter ``dmem=`` to reserve the system
> +memory when the host system boots up, the details can be checked

                               boots up. The detail

> +in /Documentation/admin-guide/kernel-parameters.txt.
> +
> +Compilation
> +===========
> +
> +The filesystem should be enabled by turning on the kernel configuration
> +options::
> +
> +        CONFIG_DMEM_FS          - Direct Memory filesystem support
> +        CONFIG_DMEM             - Allow reservation of memory for dmem

Would anyone want DMEM_FS without DMEM?

> +
> +
> +Additionally, the following can be turned on to aid debugging::
> +
> +        CONFIG_DMEM_DEBUG_FS    - Enable debug information for dmem
> +
> +Usage
> +========
> +
> +Dmemfs supports mapping ``4K``, ``2M`` and ``1G`` size of pages to

                                                     sizes

> +the userspace, for example ::

       userspace. For example::

> +
> +    # mount -t dmemfs none -o pagesize=4K /mnt/
> +
> +The it can create the backing storage with 4G size ::

   Then

> +
> +    # truncate /mnt/dmemfs-uuid --size 4G
> +
> +To use as backing storage for virtual machine starts with qemu, just need

                                                 started with qemu, just specify
   the memory-backed-file

> +to specify the memory-backed-file in the qemu command line like this ::
> +
> +    # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \

                        backed


> +        share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
> +


-- 
~Randy


^ permalink raw reply	[flat|nested] 41+ messages in thread
