From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from mx142.netapp.com ([216.240.21.19]:32998 "EHLO mx142.netapp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752113AbeCMRjK (ORCPT ); Tue, 13 Mar 2018 13:39:10 -0400
Subject: [RFC 5/7] zus: Devices && mounting
To: linux-fsdevel
References: <0319872f-f8c8-b927-b275-e6a82e8819ab@netapp.com>
CC: Ric Wheeler , Miklos Szeredi , Steve French , Steven Whitehouse , Jefff moyer , Sage Weil , Jan Kara , Amir Goldstein , Andy Rudof , Anna Schumaker , Amit Golander , Sagi Manole , Shachar Sharon
From: Boaz Harrosh
Message-ID:
Date: Tue, 13 Mar 2018 19:38:39 +0200
MIME-Version: 1.0
In-Reply-To: <0319872f-f8c8-b927-b275-e6a82e8819ab@netapp.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

In this patch we already establish a mounted filesystem.

There are three modes of operation:

* Mount without a device (mount -t FOO none /somepath)

* A single device - The FS states register_fs_info->dt_offset == -1.
  No checks are made by the Kernel; the single bdev is registered with
  the Kernel's mount_bdev. It is up to the zusFS to check validity.

* Multi devices - The FS states register_fs_info->dt_offset == X.
  This mode is the main subject of this patch. A single device is
  given on the mount command line. At register_fs_info->dt_offset of
  this device we look for a zufs_dev_table structure. After all the
  checks pass, we read the device list found there and open all
  devices. Any one of the devices may be given on the command line,
  but they will always be opened in DT (Device Table) order.

The Device Table has the notion of two types of bdevs:
  T1 devices - pmem devices capable of direct_access
  T2 devices - non-direct_access devices

All t1 devices are presented as one linear array, in DT order. In t1.c
we mmap this space for the server to directly access pmem (in the
proper persistent way).
[We do not support just any direct_access device; we only support
pmem(s) where the whole device can be addressed by a single
physical/virtual address. This is checked before mount.]

The T2 devices are also grabbed and owned by the super_block. A later
API will enable the Server to write or transfer buffers from T1 to T2
in a very efficient manner. They too are presented as a single linear
array, in DT order.

Both kinds of devices are NUMA aware, and the NUMA info is presented
to the zusFS for optimal allocation and access.

So these are the steps for mounting a zufs Filesystem:

* All devices (single or DT) are opened and established in an md
  object. This md-object is given a pmem-id.

* mount_bdev is called with the main (first) device; in turn
  fill_super is called.

* fill_super dispatches a mount operation (register_fs_info) to the
  server with the pmem-id of the md-object above.

* The Server, in the zus mount routine, will first do a
  GRAB_PMEM(pmem-id) ioctl call to establish a special filehandle
  through which it will have full access to all of its pmem space.
  With that it will call the zusFS to continue to inspect the content
  of the pmem and mount the FS.

* On return from mount the zusFS returns the root inode info.

* fill_super continues to create a root vfs-inode and returns
  successfully.

* We now have a mounted super_block, with corresponding super_block
  objects in the Server.
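To illustrate the "single linear array" presentation, below is a
minimal, self-contained user-space sketch (not part of the patch) of
the gcd-based segment map that _map_setup() builds and md_bn_t1_dev()
consumes in md.c further down. The names here (dev_info, map_setup,
bn_to_dev) and the device sizes are made-up stand-ins for the real
md_dev_info/md_dev_larray structures:

	#include <stdio.h>

	struct dev_info {
		const char *name;
		unsigned long size_blocks;	/* device size in blocks */
	};

	struct dev_larray {
		unsigned long bn_gcd;		/* gcd of all device sizes */
		struct dev_info **map;		/* one slot per bn_gcd blocks */
	};

	static unsigned long gcd(unsigned long a, unsigned long b)
	{
		while (b) {
			unsigned long t = a % b;

			a = b;
			b = t;
		}
		return a;
	}

	/* Mirrors _map_setup(): each bn_gcd-sized segment points at its device */
	static void map_setup(struct dev_larray *la, struct dev_info *devs,
			      struct dev_info **slots, unsigned long total_blocks)
	{
		unsigned long map_size = total_blocks / la->bn_gcd;
		unsigned long bn_end = devs[0].size_blocks;
		unsigned long i;
		int d = 0;

		la->map = slots;
		for (i = 0; i < map_size; ++i) {
			if (i * la->bn_gcd >= bn_end)
				bn_end += devs[++d].size_blocks;
			la->map[i] = &devs[d];
		}
	}

	/* Mirrors md_bn_t1_dev(): O(1) lookup, no per-device search at runtime */
	static struct dev_info *bn_to_dev(struct dev_larray *la, unsigned long bn)
	{
		return la->map[bn / la->bn_gcd];
	}

	int main(void)
	{
		/* A hypothetical DT with two t1 devices of 6 and 4 blocks */
		struct dev_info devs[] = { { "pmem0", 6 }, { "pmem1", 4 } };
		struct dev_info *slots[5];	/* (6 + 4) / gcd(6, 4) slots */
		struct dev_larray la = { .bn_gcd = gcd(6, 4) };
		unsigned long bn;

		map_setup(&la, devs, slots, 10);
		for (bn = 0; bn < 10; ++bn)	/* bn 0-5 => pmem0, 6-9 => pmem1 */
			printf("bn=%lu -> %s\n", bn, bn_to_dev(&la, bn)->name);
		return 0;
	}

Because bn_gcd divides every device size, every device boundary falls
on a slot boundary and a slot never spans two devices; the cost is a
map of total_blocks/bn_gcd pointers instead of a search over the
device list on every t1 page-fault or t2 I/O.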
Signed-off-by: Boaz Harrosh --- fs/zuf/Makefile | 5 +- fs/zuf/_extern.h | 12 + fs/zuf/inode.c | 45 ++++ fs/zuf/md.c | 695 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/zuf/md.h | 188 +++++++++++++++ fs/zuf/super.c | 605 ++++++++++++++++++++++++++++++++++++++++++++++- fs/zuf/t1.c | 114 +++++++++ fs/zuf/t2.c | 348 +++++++++++++++++++++++++++ fs/zuf/t2.h | 67 ++++++ fs/zuf/zuf-core.c | 68 +++++- fs/zuf/zuf-root.c | 9 +- fs/zuf/zuf.h | 220 +++++++++++++++++ fs/zuf/zus_api.h | 177 ++++++++++++++ 13 files changed, 2549 insertions(+), 4 deletions(-) create mode 100644 fs/zuf/inode.c create mode 100644 fs/zuf/md.c create mode 100644 fs/zuf/md.h create mode 100644 fs/zuf/t1.c create mode 100644 fs/zuf/t2.c create mode 100644 fs/zuf/t2.h diff --git a/fs/zuf/Makefile b/fs/zuf/Makefile index d00940c..94ce80b 100644 --- a/fs/zuf/Makefile +++ b/fs/zuf/Makefile @@ -10,9 +10,12 @@ obj-$(CONFIG_ZUF) += zuf.o +# Infrastructure +zuf-y += md.o t2.o t1.o + # ZUF core zuf-y += zuf-core.o zuf-root.o # Main FS -zuf-y += super.o +zuf-y += super.o inode.o zuf-y += module.o diff --git a/fs/zuf/_extern.h b/fs/zuf/_extern.h index e490043..0543fd8 100644 --- a/fs/zuf/_extern.h +++ b/fs/zuf/_extern.h @@ -19,7 +19,16 @@ * extern functions declarations */ +/* inode.c */ +struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, + zu_dpp_t _zi, bool *exist); +void zuf_evict_inode(struct inode *inode); +int zuf_write_inode(struct inode *inode, struct writeback_control *wbc); +int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync); + /* super.c */ +int zuf_init_inodecache(void); +void zuf_destroy_inodecache(void); struct dentry *zuf_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data); @@ -44,4 +53,7 @@ void zufs_mounter_release(struct file *filp); /* zuf-root.c */ int zuf_register_fs(struct super_block *sb, struct zufs_ioc_register_fs *rfs); +/* t1.c */ +int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma); + #endif /*ndef __ZUF_EXTERN_H__*/ diff --git a/fs/zuf/inode.c b/fs/zuf/inode.c new file mode 100644 index 0000000..7aa8c9e --- /dev/null +++ b/fs/zuf/inode.c @@ -0,0 +1,45 @@ +/* + * BRIEF DESCRIPTION + * + * Inode methods (allocate/free/read/write). + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#include "zuf.h" + +struct inode *zuf_iget(struct super_block *sb, struct zus_inode_info *zus_ii, + zu_dpp_t _zi, bool *exist) +{ + return ERR_PTR(-ENOMEM); +} + +void zuf_evict_inode(struct inode *inode) +{ +} + +int zuf_write_inode(struct inode *inode, struct writeback_control *wbc) +{ + /* write_inode should never be called because we always keep our inodes + * clean. So let us know if write_inode ever gets called. + */ + + /* d_tmpfile() does a mark_inode_dirty so only complain on regular files + * TODO: How? Every thing off for now + * WARN_ON(inode->i_nlink); + */ + + return 0; +} + +/* This function is called by msync(), fsync() && sync_fs(). */ +int zuf_isync(struct inode *inode, loff_t start, loff_t end, int datasync) +{ + return 0; +} diff --git a/fs/zuf/md.c b/fs/zuf/md.c new file mode 100644 index 0000000..9436f03 --- /dev/null +++ b/fs/zuf/md.c @@ -0,0 +1,695 @@ +/* + * Multi-Device operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#include +#include +#include +#include +#include +#include + +#include "_pr.h" +#include "md.h" +#include "t2.h" +#include "zus_api.h" + +/* length of uuid dev path /dev/disk/by-uuid/ */ +#define PATH_UUID 64 + +const fmode_t _g_mode = FMODE_READ | FMODE_WRITE | FMODE_EXCL; + +/* allocate space for and copy an existing uuid */ +static char *_uuid_path(uuid_le *uuid) +{ + char path[PATH_UUID]; + + sprintf(path, "/dev/disk/by-uuid/%pUb", uuid); + return kstrdup(path, GFP_KERNEL); +} + +static int _bdev_get_by_path(const char *path, struct block_device **bdev, + void *holder) +{ + /* The owner of the device is the pointer that will hold it. This + * protects from same device mounting on two super-blocks as well + * as same device being repeated twice. + */ + *bdev = blkdev_get_by_path(path, _g_mode, holder); + if (IS_ERR(*bdev)) { + int err = PTR_ERR(*bdev); + *bdev = NULL; + return err; + } + return 0; +} + +static void _bdev_put(struct block_device **bdev, struct block_device *s_bdev) +{ + if (*bdev) { + if (!s_bdev || *bdev != s_bdev) + blkdev_put(*bdev, _g_mode); + *bdev = NULL; + } +} + +static int ___bdev_get_by_uuid(struct block_device **bdev, uuid_le *uuid, + void *holder, bool silent, const char *msg, + const char *f, int l) +{ + char *path = NULL; + int err; + + path = _uuid_path(uuid); + err = _bdev_get_by_path(path, bdev, holder); + if (unlikely(err)) + zuf_err_cnd(silent, "[%s:%d] %s path=%s =>%d\n", + f, l, msg, path, err); + + kfree(path); + return err; +} + +#define _bdev_get_by_uuid(bdev, uuid, holder, msg) \ + ___bdev_get_by_uuid(bdev, uuid, holder, silent, msg, __func__, __LINE__) + +static bool _main_bdev(struct block_device *bdev) +{ + if (bdev->bd_super && bdev->bd_super->s_bdev == bdev) + return true; + return false; +} + +short md_calc_csum(struct zufs_dev_table *msb) +{ + uint n = ZUFS_SB_STATIC_SIZE(msb) - sizeof(msb->s_sum); + + return crc16(~0, (__u8 *)&msb->s_version, n); +} + +/* ~~~~~~~ mdt related functions ~~~~~~~ */ + +struct zufs_dev_table *md_t2_mdt_read(struct block_device *bdev) +{ + int err; + struct page *page; + /* t2 interface works for all block devices */ + struct multi_devices *md; + struct md_dev_info *mdi; + + md = kzalloc(sizeof(*md), GFP_KERNEL); + if (unlikely(!md)) + return ERR_PTR(-ENOMEM); + + md->t2_count = 1; + md->devs[0].bdev = bdev; + mdi = &md->devs[0]; + md->t2a.map = &mdi; + md->t2a.bn_gcd = 1; /*Does not matter only must not be zero */ + + page = alloc_page(GFP_KERNEL); + if (!page) { + zuf_dbg_err("!!! failed to alloc page\n"); + err = -ENOMEM; + goto out; + } + + err = t2_readpage(md, 0, page); + if (err) { + zuf_dbg_err("!!! t2_readpage err=%d\n", err); + __free_page(page); + } +out: + kfree(md); + return err ? 
ERR_PTR(err) : page_address(page); +} + +static bool _csum_mismatch(struct zufs_dev_table *msb, int silent) +{ + ushort crc = md_calc_csum(msb); + + if (msb->s_sum == cpu_to_le16(crc)) + return false; + + zuf_warn_cnd(silent, "expected(0x%x) != s_sum(0x%x)\n", + cpu_to_le16(crc), msb->s_sum); + return true; +} + +static bool _uuid_le_equal(uuid_le *uuid1, uuid_le *uuid2) +{ + return (memcmp(uuid1, uuid2, sizeof(uuid_le)) == 0); +} + +bool md_mdt_check(struct zufs_dev_table *msb, + struct zufs_dev_table *main_msb, struct block_device *bdev, + struct mdt_check *mc) +{ + struct zufs_dev_table *msb2 = (void *)msb + ZUFS_SB_SIZE; + struct md_dev_id *dev_id; + ulong bdev_size, super_size; + + BUILD_BUG_ON(ZUFS_SB_STATIC_SIZE(msb) & (SMP_CACHE_BYTES - 1)); + + /* Do sanity checks on the superblock */ + if (le32_to_cpu(msb->s_magic) != mc->magic) { + if (le32_to_cpu(msb2->s_magic) != mc->magic) { + zuf_warn_cnd(mc->silent, + "Can't find a valid partition\n"); + return false; + } + + zuf_warn_cnd(mc->silent, + "Magic error in super block: using copy\n"); + /* Try to auto-recover the super block */ + memcpy_flushcache(msb, msb2, sizeof(*msb)); + } + + if ((mc->major_ver != msb_major_version(msb)) || + (mc->minor_ver < msb_minor_version(msb))) { + zuf_warn_cnd(mc->silent, + "mkfs-mount versions mismatch! %d.%d != %d.%d\n", + msb_major_version(msb), msb_minor_version(msb), + mc->major_ver, mc->minor_ver); + return false; + } + + if (_csum_mismatch(msb, mc->silent)) { + if (_csum_mismatch(msb2, mc->silent)) { + zuf_warn_cnd(mc->silent, + "checksum error in super block\n"); + return false; + } + + zuf_warn_cnd(mc->silent, + "crc16 error in super block: using copy\n"); + /* Try to auto-recover the super block */ + memcpy_flushcache(msb, msb2, sizeof(*msb)); + } + + if (main_msb && !_uuid_le_equal(&main_msb->s_uuid, &msb->s_uuid)) { + zuf_warn_cnd(mc->silent, + "uuids do not match main_msb=%pUb msb=%pUb\n", + &main_msb->s_uuid, &msb->s_uuid); + return false; + } + + /* check t1 device size */ + bdev_size = i_size_read(bdev->bd_inode); + dev_id = &msb->s_dev_list.dev_ids[msb->s_dev_list.id_index]; + super_size = md_p2o(__dev_id_blocks(dev_id)); + if (unlikely(!super_size || super_size & ZUFS_ALLOC_MASK)) { + zuf_warn_cnd(mc->silent, "super_size(0x%lx) ! 2_M aligned\n", + super_size); + return false; + } + + if (unlikely(super_size > bdev_size)) { + zuf_warn_cnd(mc->silent, + "bdev_size(0x%lx) too small expected 0x%lx\n", + bdev_size, super_size); + return false; + } else if (unlikely(super_size < bdev_size)) { + zuf_dbg_err("Note msb->size=(0x%lx) < bdev_size(0x%lx)\n", + super_size, bdev_size); + } + + return true; +} + + +int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, + void *sb, int silent) +{ + struct md_dev_info *mdi = md_dev_info(md, md->dev_index); + int i; + + mdi->bdev = s_bdev; + + for (i = 0; i < md->t1_count; ++i) { + struct md_dev_info *mdi = md_t1_dev(md, i); + + if (mdi->bdev->bd_super && (mdi->bdev->bd_super != sb)) { + zuf_warn_cnd(silent, + "!!! 
%s already mounted on a different FS => -EBUSY\n", + _bdev_name(mdi->bdev)); + return -EBUSY; + } + + mdi->bdev->bd_super = sb; + } + + return 0; +} + +void md_fini(struct multi_devices *md, struct block_device *s_bdev) +{ + int i; + + kfree(md->t2a.map); + kfree(md->t1a.map); + + for (i = 0; i < md->t1_count + md->t2_count; ++i) { + struct md_dev_info *mdi = md_dev_info(md, i); + + if (mdi->bdev && !_main_bdev(mdi->bdev)) + mdi->bdev->bd_super = NULL; + _bdev_put(&mdi->bdev, s_bdev); + } + + kfree(md); +} + + +/* ~~~~~~~ Pre-mount operations ~~~~~~~ */ + +static int _get_device(struct block_device **bdev, const char *dev_name, + uuid_le *uuid, void *holder, int silent, + bool *bind_mount) +{ + int err; + + if (dev_name) + err = _bdev_get_by_path(dev_name, bdev, holder); + else + err = _bdev_get_by_uuid(bdev, uuid, holder, + "failed to get device"); + + if (unlikely(err)) { + zuf_err_cnd(silent, + "failed to get device dev_name=%s uuid=%pUb err=%d\n", + dev_name, uuid, err); + return err; + } + + if (bind_mount && _main_bdev(*bdev)) + *bind_mount = true; + + return 0; +} + +static int _init_dev_info(struct md_dev_info *mdi, struct md_dev_id *id, + int index, u64 offset, + struct zufs_dev_table *main_msb, + struct mdt_check *mc, bool t1_dev, + int silent) +{ + struct zufs_dev_table *msb = NULL; + int err = 0; + + if (mdi->bdev == NULL) { + err = _get_device(&mdi->bdev, NULL, &id->uuid, mc->holder, + silent, NULL); + if (unlikely(err)) + return err; + } + + mdi->offset = offset; + mdi->size = md_p2o(__dev_id_blocks(id)); + mdi->index = index; + + if (t1_dev) { + struct page *dev_page; + int end_of_dev_nid; + + err = md_t1_info_init(mdi, silent); + if (unlikely(err)) + return err; + + if ((ulong)mdi->t1i.virt_addr & ZUFS_ALLOC_MASK) { + zuf_warn_cnd(silent, "!!! 
unaligned device %s\n", + _bdev_name(mdi->bdev)); + return -EINVAL; + } + + msb = mdi->t1i.virt_addr; + + dev_page = pfn_to_page(mdi->t1i.phys_pfn); + mdi->nid = page_to_nid(dev_page); + end_of_dev_nid = page_to_nid(dev_page + md_o2p(mdi->size - 1)); + + if (mdi->nid != end_of_dev_nid) + zuf_warn("pmem crosses NUMA boundaries"); + } else { + msb = md_t2_mdt_read(mdi->bdev); + if (IS_ERR(msb)) { + zuf_err_cnd(silent, + "failed to read msb from t2 => %ld\n", + PTR_ERR(msb)); + return PTR_ERR(msb); + } + mdi->nid = __dev_id_nid(id); + } + + if (!md_mdt_check(msb, main_msb, mdi->bdev, mc)) { + zuf_err_cnd(silent, "device %s failed integrity check\n", + _bdev_name(mdi->bdev)); + err = -EINVAL; + goto out; + } + + return 0; + +out: + if (!(t1_dev || IS_ERR_OR_NULL(msb))) + free_page((ulong)msb); + return err; +} + +static int _map_setup(struct multi_devices *md, ulong blocks, int dev_start, + struct md_dev_larray *larray) +{ + ulong map_size, bn_end; + int i, dev_index = dev_start; + + map_size = blocks / larray->bn_gcd; + larray->map = kcalloc(map_size, sizeof(*larray->map), GFP_KERNEL); + if (!larray->map) { + zuf_dbg_err("failed to allocate dev map\n"); + return -ENOMEM; + } + + bn_end = md_o2p(md->devs[dev_index].size); + for (i = 0; i < map_size; ++i) { + if ((i * larray->bn_gcd) >= bn_end) + bn_end += md_o2p(md->devs[++dev_index].size); + larray->map[i] = &md->devs[dev_index]; + } + + return 0; +} + +static int _md_init(struct multi_devices *md, struct mdt_check *mc, + struct md_dev_list *dev_list, int silent) +{ + struct zufs_dev_table *main_msb = NULL; + u64 total_size = 0; + int i, err; + + for (i = 0; i < md->t1_count; ++i) { + struct md_dev_info *mdi = md_t1_dev(md, i); + struct zufs_dev_table *dev_msb; + + err = _init_dev_info(mdi, &dev_list->dev_ids[i], i, total_size, + main_msb, mc, true, silent); + if (unlikely(err)) + return err; + + /* apparently gcd(0,X)=X which is nice */ + md->t1a.bn_gcd = gcd(md->t1a.bn_gcd, md_o2p(mdi->size)); + total_size += mdi->size; + + dev_msb = md_t1_addr(md, i); + if (!main_msb) + main_msb = dev_msb; + + if (test_msb_opt(dev_msb, ZUFS_SHADOW)) + memcpy(mdi->t1i.virt_addr, + mdi->t1i.virt_addr + mdi->size, mdi->size); + + zuf_dbg_verbose("dev=%d %pUb %s v=%p pfn=%lu off=%lu size=%lu\n", + i, &dev_list->dev_ids[i].uuid, + _bdev_name(mdi->bdev), dev_msb, + mdi->t1i.phys_pfn, mdi->offset, mdi->size); + } + + if (unlikely(le64_to_cpu(main_msb->s_t1_blocks) != + md_o2p(total_size))) { + zuf_err_cnd(silent, + "FS corrupted msb->t1_blocks(0x%llx) != total_size(0x%llx)\n", + main_msb->s_t1_blocks, total_size); + return -EIO; + } + + err = _map_setup(md, le64_to_cpu(main_msb->s_t1_blocks), 0, &md->t1a); + if (unlikely(err)) + return err; + + zuf_dbg_verbose("t1 devices=%d total_size=%llu segment_map=%lu\n", + md->t1_count, total_size, + md_o2p(total_size) / md->t1a.bn_gcd); + + if (md->t2_count == 0) + return 0; + + /* Done with t1. 
Counting t2s */ + total_size = 0; + for (i = 0; i < md->t2_count; ++i) { + struct md_dev_info *mdi = md_t2_dev(md, i); + + err = _init_dev_info(mdi, &dev_list->dev_ids[md->t1_count + i], + md->t1_count + i, total_size, main_msb, + mc, false, silent); + if (unlikely(err)) + return err; + + /* apparently gcd(0,X)=X which is nice */ + md->t2a.bn_gcd = gcd(md->t2a.bn_gcd, md_o2p(mdi->size)); + total_size += mdi->size; + + zuf_dbg_verbose("dev=%d %s off=%lu size=%lu\n", i, + _bdev_name(mdi->bdev), mdi->offset, mdi->size); + } + + if (unlikely(le64_to_cpu(main_msb->s_t2_blocks) != md_o2p(total_size))) { + zuf_err_cnd(silent, + "FS corrupted msb_t2_blocks(0x%llx) != total_size(0x%llx)\n", + main_msb->s_t2_blocks, total_size); + return -EIO; + } + + err = _map_setup(md, le64_to_cpu(main_msb->s_t2_blocks), md->t1_count, + &md->t2a); + if (unlikely(err)) + return err; + + zuf_dbg_verbose("t2 devices=%d total_size=%llu segment_map=%lu\n", + md->t2_count, total_size, + md_o2p(total_size) / md->t2a.bn_gcd); + + return 0; +} + +static int _load_dev_list(struct md_dev_list *dev_list, struct mdt_check *mc, + struct block_device *bdev, const char *dev_name, + int silent) +{ + struct zufs_dev_table *msb; + int err = 0; + + msb = md_t2_mdt_read(bdev); + if (IS_ERR(msb)) { + zuf_err_cnd(silent, + "failed to read super block from %s; err=%ld\n", + dev_name, PTR_ERR(msb)); + err = PTR_ERR(msb); + goto out; + } + + if (!md_mdt_check(msb, NULL, bdev, mc)) { + zuf_err_cnd(silent, "bad msb in %s\n", dev_name); + err = -EINVAL; + goto out; + } + + *dev_list = msb->s_dev_list; + +out: + if (!IS_ERR_OR_NULL(msb)) + free_page((ulong)msb); + + return err; +} + +int md_init(struct multi_devices *md, const char *dev_name, + struct mdt_check *mc, const char **dev_path) +{ + struct md_dev_list *dev_list; + struct block_device *bdev; + short id_index; + bool bind_mount = false; + int err; + + dev_list = kmalloc(sizeof(*dev_list), GFP_KERNEL); + if (unlikely(!dev_list)) + return -ENOMEM; + + err = _get_device(&bdev, dev_name, NULL, mc->holder, mc->silent, + &bind_mount); + if (unlikely(err)) + goto out2; + + err = _load_dev_list(dev_list, mc, bdev, dev_name, mc->silent); + if (unlikely(err)) { + _bdev_put(&bdev, NULL); + goto out2; + } + + id_index = le16_to_cpu(dev_list->id_index); + if (bind_mount) { + _bdev_put(&bdev, NULL); + md->dev_index = id_index; + goto out; + } + + md->t1_count = le16_to_cpu(dev_list->t1_count); + md->t2_count = le16_to_cpu(dev_list->t2_count); + md->devs[id_index].bdev = bdev; + + if ((id_index != 0)) { + err = _get_device(&md_t1_dev(md, 0)->bdev, NULL, + &dev_list->dev_ids[0].uuid, mc->holder, + mc->silent, &bind_mount); + if (unlikely(err)) + goto out2; + + if (bind_mount) + goto out; + } + + if (md->t2_count) { + int t2_index = md->t1_count; + + /* t2 is the primary device if given in mount, or the first + * mount specified it as primary device + */ + if (id_index != md->t1_count) { + err = _get_device(&md_t2_dev(md, 0)->bdev, NULL, + &dev_list->dev_ids[t2_index].uuid, + mc->holder, mc->silent, &bind_mount); + if (unlikely(err)) + goto out2; + } + md->dev_index = t2_index; + } + +out: + if (md->dev_index != id_index) + *dev_path = _uuid_path(&dev_list->dev_ids[md->dev_index].uuid); + else + *dev_path = kstrdup(dev_name, GFP_KERNEL); + + if (!bind_mount) { + err = _md_init(md, mc, dev_list, mc->silent); + if (unlikely(err)) + goto out2; + _bdev_put(&md_dev_info(md, md->dev_index)->bdev, NULL); + } else { + md_fini(md, NULL); + } + +out2: + kfree(dev_list); + + return err; +} + +struct 
multi_devices *md_alloc(size_t size) +{ + uint s = max(sizeof(struct multi_devices), size); + struct multi_devices *md = kzalloc(s, GFP_KERNEL); + + if (unlikely(!md)) + return ERR_PTR(-ENOMEM); + return md; +} + +int md_numa_info(struct multi_devices *md, struct zufs_ioc_pmem *zi_pmem) +{ + zi_pmem->pmem_total_blocks = md_t1_blocks(md); +#if 0 + if (max_cpu_id < sys_num_active_cpus) { + max_cpu_id = sys_num_active_cpus; + return -ETOSMALL; + } + + max_cpu_id = sys_num_active_cpus; + __u32 max_nodes; + __u32 active_pmem_nodes; + struct zufs_pmem_info { + int sections; + struct zufs_pmem_sec { + __u32 length; + __u16 numa_id; + __u16 numa_index; + } secs[ZUFS_DEV_MAX]; + } pmem; + + struct zufs_numa_info { + __u32 max_cpu_id; // The below array size + struct zufs_cpu_info { + __u32 numa_id; + __u32 numa_index; + } numa_id_map[]; + } *numa_info; + k_nf = kcalloc(max_cpu_id, sizeof(struct zufs_cpu_info), GFP_KERNEL); + .... + copy_to_user(->numa_info, kn_f, + max_cpu_id * sizeof(struct zufs_cpu_info)); +#endif + return 0; +} + +static int _check_da_ret(struct md_dev_info *mdi, long avail, bool silent) +{ + if (unlikely(avail < (long)mdi->size)) { + if (0 < avail) { + zuf_warn_cnd(silent, + "Unsupported DAX device %s (range mismatch) => 0x%lx < 0x%lx\n", + _bdev_name(mdi->bdev), avail, mdi->size); + return -ERANGE; + } + zuf_warn_cnd(silent, "!!! %s direct_access return =>%ld\n", + _bdev_name(mdi->bdev), avail); + return avail; + } + return 0; +} + +int md_t1_info_init(struct md_dev_info *mdi, bool silent) +{ + pfn_t a_pfn_t; + void *addr; + long nrpages, avail; + int id; + + mdi->t1i.dax_dev = fs_dax_get_by_host(_bdev_name(mdi->bdev)); + if (unlikely(!mdi->t1i.dax_dev)) + return -EOPNOTSUPP; + + id = dax_read_lock(); + + nrpages = dax_direct_access(mdi->t1i.dax_dev, 0, md_o2p(mdi->size), + &addr, &a_pfn_t); + dax_read_unlock(id); + if (unlikely(nrpages <= 0)) { + if (!nrpages) + nrpages = -ERANGE; + avail = nrpages; + } else { + avail = md_p2o(nrpages); + } + + mdi->t1i.virt_addr = addr; + mdi->t1i.phys_pfn = pfn_t_to_pfn(a_pfn_t); + + zuf_dbg_verbose("0x%lx 0x%llx\n", + (ulong)addr, a_pfn_t.val); + + return _check_da_ret(mdi, avail, silent); +} + +void md_t1_info_fini(struct md_dev_info *mdi) +{ + fs_put_dax(mdi->t1i.dax_dev); + mdi->t1i.dax_dev = NULL; + mdi->t1i.virt_addr = NULL; +} diff --git a/fs/zuf/md.h b/fs/zuf/md.h new file mode 100644 index 0000000..1ad3db3c --- /dev/null +++ b/fs/zuf/md.h @@ -0,0 +1,188 @@ +/* + * Multi-Device operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. 
+ * + * Authors: + * Boaz Harrosh + * Sagi Manole " + */ + +#ifndef __MD_H__ +#define __MD_H__ + +#include "zus_api.h" + +struct md_t1_info { + ulong phys_pfn; + void *virt_addr; + struct dax_device *dax_dev; + struct dev_pagemap *pgmap; +}; + +struct md_t2_info { + bool err_read_reported; + bool err_write_reported; +}; + +struct md_dev_info { + struct block_device *bdev; + ulong size; + ulong offset; + union { + struct md_t1_info t1i; + struct md_t2_info t2i; + }; + int index; + int nid; +}; + +struct md_dev_larray { + ulong bn_gcd; + struct md_dev_info **map; +}; + +struct multi_devices { + int dev_index; + int t1_count; + int t2_count; + struct md_dev_info devs[MD_DEV_MAX]; + struct md_dev_larray t1a; + struct md_dev_larray t2a; +}; + +static inline u64 md_p2o(ulong bn) +{ + return (u64)bn << PAGE_SHIFT; +} + +static inline ulong md_o2p(u64 offset) +{ + return offset >> PAGE_SHIFT; +} + +static inline ulong md_o2p_up(u64 offset) +{ + return md_o2p(offset + PAGE_SIZE - 1); +} + +static inline struct md_dev_info *md_t1_dev(struct multi_devices *md, int i) +{ + return &md->devs[i]; +} + +static inline struct md_dev_info *md_t2_dev(struct multi_devices *md, int i) +{ + return &md->devs[md->t1_count + i]; +} + +static inline struct md_dev_info *md_dev_info(struct multi_devices *md, int i) +{ + return &md->devs[i]; +} + +static inline void *md_t1_addr(struct multi_devices *md, int i) +{ + struct md_dev_info *mdi = md_t1_dev(md, i); + + return mdi->t1i.virt_addr; +} + +static inline struct md_dev_info *md_bn_t1_dev(struct multi_devices *md, + ulong bn) +{ + return md->t1a.map[bn / md->t1a.bn_gcd]; +} + +static inline ulong md_pfn(struct multi_devices *md, ulong block) +{ + struct md_dev_info *mdi = md_bn_t1_dev(md, block); + + return mdi->t1i.phys_pfn + (block - md_o2p(mdi->offset)); +} + +static inline void *md_addr(struct multi_devices *md, ulong offset) +{ + struct md_dev_info *mdi = md_bn_t1_dev(md, md_o2p(offset)); + + return offset ? 
mdi->t1i.virt_addr + (offset - mdi->offset) : NULL; +} + +static inline void *md_baddr(struct multi_devices *md, ulong bn) +{ + return md_addr(md, md_p2o(bn)); +} + +static inline struct zufs_dev_table *md_zdt(struct multi_devices *md) +{ + return md_t1_addr(md, 0); +} + +static inline ulong md_t1_blocks(struct multi_devices *md) +{ + return le64_to_cpu(md_zdt(md)->s_t1_blocks); +} + +static inline ulong md_t2_blocks(struct multi_devices *md) +{ + return le64_to_cpu(md_zdt(md)->s_t2_blocks); +} + +static inline struct md_dev_info *md_bn_t2_dev(struct multi_devices *md, + ulong bn) +{ + return md->t2a.map[bn / md->t2a.bn_gcd]; +} + +static inline ulong md_t2_local_bn(struct multi_devices *md, ulong bn) +{ + struct md_dev_info *mdi = md_bn_t2_dev(md, bn); + + return bn - md_o2p(mdi->offset); +} + +static inline void *md_addr_verify(struct multi_devices *md, ulong offset) +{ + if (unlikely(offset > md_p2o(md_t1_blocks(md)))) { + zuf_dbg_err("offset=0x%lx > max=0x%llx\n", + offset, md_p2o(md_t1_blocks(md))); + return NULL; + } + + return md_addr(md, offset); +} + +static inline const char *_bdev_name(struct block_device *bdev) +{ + return dev_name(&bdev->bd_part->__dev); +} + +struct mdt_check { + uint major_ver; + uint minor_ver; + u32 magic; + + void *holder; + bool silent; +}; + +/* md.c */ +struct zufs_dev_table *md_t2_mdt_read(struct block_device *bdev); +bool md_mdt_check(struct zufs_dev_table *msb, struct zufs_dev_table *main_msb, + struct block_device *bdev, struct mdt_check *mc); +struct multi_devices *md_alloc(size_t size); +int md_init(struct multi_devices *md, const char *dev_name, + struct mdt_check *mc, const char **dev_path); +void md_fini(struct multi_devices *md, struct block_device *s_bdev); +int md_set_sb(struct multi_devices *md, struct block_device *s_bdev, void *sb, + int silent); + +struct zufs_ioc_pmem; +int md_numa_info(struct multi_devices *md, struct zufs_ioc_pmem *zi_pmem); + +int md_t1_info_init(struct md_dev_info *mdi, bool silent); +void md_t1_info_fini(struct md_dev_info *mdi); + +#endif diff --git a/fs/zuf/super.c b/fs/zuf/super.c index 6e176a5..03d1772 100644 --- a/fs/zuf/super.c +++ b/fs/zuf/super.c @@ -12,10 +12,613 @@ * Sagi Manole " */ +#include +#include +#include +#include + #include "zuf.h" +static struct super_operations zuf_sops; +static struct kmem_cache *zuf_inode_cachep; + +enum { + Opt_uid, + Opt_gid, + Opt_pedantic, + Opt_ephemeral, + Opt_dax, + Opt_err +}; + +static const match_table_t tokens = { + { Opt_uid, "uid=%u" }, + { Opt_gid, "gid=%u" }, + { Opt_pedantic, "pedantic" }, + { Opt_pedantic, "pedantic=%d" }, + { Opt_ephemeral, "ephemeral" }, + { Opt_dax, "dax" }, + { Opt_err, NULL }, +}; + +/* Output parameters from _parse_options */ +struct __parse_options { + bool clear_t2sync; + bool pedantic_17; +}; + +static int _parse_options(struct zuf_sb_info *sbi, const char *data, + bool remount, struct __parse_options *po) +{ + char *orig_options, *options; + char *p; + substring_t args[MAX_OPT_ARGS]; + int option; + int err = 0; + bool ephemeral = false; + + /* no options given */ + if (!data) + return 0; + + options = orig_options = kstrdup(data, GFP_KERNEL); + if (!options) + return -ENOMEM; + + while ((p = strsep(&options, ",")) != NULL) { + int token; + + if (!*p) + continue; + + /* Initialize args struct so we know whether arg was found */ + args[0].to = args[0].from = NULL; + token = match_token(p, tokens, args); + switch (token) { + case Opt_uid: + if (remount) + goto bad_opt; + if (match_int(&args[0], &option)) + goto bad_val; + sbi->uid = 
KUIDT_INIT(option); + break; + case Opt_gid: + if (match_int(&args[0], &option)) + goto bad_val; + sbi->gid = KGIDT_INIT(option); + break; + case Opt_pedantic: + set_opt(sbi, PEDANTIC); + break; + case Opt_ephemeral: + set_opt(sbi, EPHEMERAL); + ephemeral = true; + break; + case Opt_dax: + set_opt(sbi, DAX); + break; + default: { + goto bad_opt; + } + } + } + + if (remount && test_opt(sbi, EPHEMERAL) && (ephemeral == false)) + clear_opt(sbi, EPHEMERAL); +out: + kfree(orig_options); + return err; + +bad_val: + zuf_warn_cnd(test_opt(sbi, SILENT), + "Bad value '%s' for mount option '%s'\n", + args[0].from, p); + err = -EINVAL; + goto out; +bad_opt: + zuf_warn_cnd(test_opt(sbi, SILENT), "Bad mount option: \"%s\"\n", p); + err = -EINVAL; + goto out; +} + +static void _print_mount_info(struct zuf_sb_info *sbi, char *mount_options) +{ + char buff[1000]; + int space = sizeof(buff); + char *b = buff; + uint i; + int printed; + + for (i = 0; i < sbi->md->t1_count; ++i) { + printed = snprintf(b, space, "%s%s", i ? "," : "", + _bdev_name(md_t1_dev(sbi->md, i)->bdev)); + + if (unlikely(printed > space)) + goto no_space; + + b += printed; + space -= printed; + } + + if (sbi->md->t2_count) { + printed = snprintf(b, space, " t2=%s", + _bdev_name(md_t2_dev(sbi->md, 0)->bdev)); + if (unlikely(printed > space)) + goto no_space; + + b += printed; + space -= printed; + } + + if (mount_options) { + printed = snprintf(b, space, " -o %s", mount_options); + if (unlikely(printed > space)) + goto no_space; + } + +print: + zuf_info("mounted t1=%s (0x%lx/0x%lx)\n", buff, + md_t1_blocks(sbi->md), md_t2_blocks(sbi->md)); + return; + +no_space: + snprintf(buff + sizeof(buff) - 4, 4, "..."); + goto print; +} + +static void _sb_mwtime_now(struct super_block *sb, struct zufs_dev_table *zdt) +{ + struct timespec now = current_kernel_time(); + + timespec_to_mt(&zdt->s_mtime, &now); + zdt->s_wtime = zdt->s_mtime; + /* TOZO _persist_md(sb, &zdt->s_mtime, 2*sizeof(zdt->s_mtime)); */ +} + +static int _setup_bdi(struct super_block *sb, const char *device_name) +{ + int err; + + if (sb->s_bdi && (sb->s_bdi != &noop_backing_dev_info)) { + /* + * sb->s_bdi points to blkdev's bdi however we want to redirect + * it to our private bdi... + */ + bdi_put(sb->s_bdi); + } + sb->s_bdi = &noop_backing_dev_info; + + err = super_setup_bdi_name(sb, "zuf-%s", device_name); + if (unlikely(err)) { + zuf_err("Failed to super_setup_bdi\n"); + return err; + } + + sb->s_bdi->ra_pages = ZUFS_READAHEAD_PAGES; + sb->s_bdi->capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK; + return 0; +} + +static void zuf_put_super(struct super_block *sb) +{ + struct zuf_sb_info *sbi = SBI(sb); + + if (sbi->zus_sbi) { + zufs_dispatch_umount(ZUF_ROOT(sbi), sbi->zus_sbi); + sbi->zus_sbi = NULL; + } + + /* NOTE!!! this is a HACK! we should not touch the s_umount + * lock but to make lockdep happy we do that since our devices + * are held exclusivly. Need to revisit every kernel version + * change. 
+ */ + if (sbi->md) { + up_write(&sb->s_umount); + md_fini(sbi->md, sb->s_bdev); + down_write(&sb->s_umount); + } + + sb->s_fs_info = NULL; + if (!test_opt(sbi, FAILED)) + zuf_info("unmounted /dev/%s\n", _bdev_name(sb->s_bdev)); + kfree(sbi); +} + +struct __fill_super_params { + struct multi_devices *md; + char *mount_options; +}; + +static int zuf_fill_super(struct super_block *sb, void *data, int silent) +{ + struct zuf_sb_info *sbi; + struct __fill_super_params *fsp = data; + struct __parse_options po = {}; + struct zufs_ioc_mount zim = {}; + struct register_fs_info *rfi; + struct inode *root_i; + bool exist; + int err; + + BUILD_BUG_ON(sizeof(struct zufs_dev_table) > ZUFS_SB_SIZE); + BUILD_BUG_ON(sizeof(struct zus_inode) != ZUFS_INODE_SIZE); + + sbi = kzalloc(sizeof(struct zuf_sb_info), GFP_KERNEL); + if (!sbi) { + zuf_err_cnd(silent, "Not enough memory to allocate sbi\n"); + return -ENOMEM; + } + sb->s_fs_info = sbi; + sbi->sb = sb; + + /* Initialize embedded objects */ + spin_lock_init(&sbi->s_mmap_dirty_lock); + INIT_LIST_HEAD(&sbi->s_mmap_dirty); + if (silent) + set_opt(sbi, SILENT); + + sbi->md = fsp->md; + err = md_set_sb(sbi->md, sb->s_bdev, sb, silent); + if (unlikely(err)) + goto error; + + err = _parse_options(sbi, fsp->mount_options, 0, &po); + if (err) + goto error; + + err = _setup_bdi(sb, _bdev_name(sb->s_bdev)); + if (err) { + zuf_err_cnd(silent, "Failed to setup bdi => %d\n", err); + goto error; + } + + /* Tell ZUS to mount an FS for us */ + zim.pmem_kern_id = zuf_pmem_id(sbi->md); + err = zufs_dispatch_mount(ZUF_ROOT(sbi), zuf_fst(sb)->zus_zfi, &zim); + if (unlikely(err)) + goto error; + sbi->zus_sbi = zim.zus_sbi; + + /* Init with default values */ + sb->s_blocksize_bits = zim.s_blocksize_bits; + sb->s_blocksize = 1 << zim.s_blocksize_bits; + + sbi->mode = ZUFS_DEF_SBI_MODE; + sbi->uid = current_fsuid(); + sbi->gid = current_fsgid(); + + rfi = &zuf_fst(sb)->rfi; + + sb->s_magic = rfi->FS_magic; + sb->s_time_gran = rfi->s_time_gran; + sb->s_maxbytes = rfi->s_maxbytes; + sb->s_flags |= MS_NOSEC | (rfi->acl_on ? 
MS_POSIXACL : 0); + + sb->s_op = &zuf_sops; + + root_i = zuf_iget(sb, zim.zus_ii, zim._zi, &exist); + if (IS_ERR(root_i)) { + err = PTR_ERR(root_i); + goto error; + } + WARN_ON(exist); + + sb->s_root = d_make_root(root_i); + if (!sb->s_root) { + zuf_err_cnd(silent, "get tozu root inode failed\n"); + iput(root_i); /* undo zuf_iget */ + err = -ENOMEM; + goto error; + } + + if (!zuf_rdonly(sb)) + _sb_mwtime_now(sb, md_zdt(sbi->md)); + + _print_mount_info(sbi, fsp->mount_options); + clear_opt(sbi, SILENT); + return 0; + +error: + zuf_warn("NOT mounting => %d\n", err); + set_opt(sbi, FAILED); + zuf_put_super(sb); + return err; +} + +static void _zst_to_kst(const struct statfs64 *zst, struct kstatfs *kst) +{ + kst->f_type = zst->f_type; + kst->f_bsize = zst->f_bsize; + kst->f_blocks = zst->f_blocks; + kst->f_bfree = zst->f_bfree; + kst->f_bavail = zst->f_bavail; + kst->f_files = zst->f_files; + kst->f_ffree = zst->f_ffree; + kst->f_fsid = zst->f_fsid; + kst->f_namelen = zst->f_namelen; + kst->f_frsize = zst->f_frsize; + kst->f_flags = zst->f_flags; +} + +static int zuf_statfs(struct dentry *d, struct kstatfs *buf) +{ + struct zuf_sb_info *sbi = SBI(d->d_sb); + struct zufs_ioc_statfs ioc_statfs = { + .hdr.in_len = offsetof(struct zufs_ioc_statfs, statfs_out), + .hdr.out_len = sizeof(ioc_statfs), + .hdr.operation = ZUS_OP_STATFS, + .zus_sbi = sbi->zus_sbi, + }; + int err; + + err = zufs_dispatch(ZUF_ROOT(sbi), &ioc_statfs.hdr, NULL, 0); + if (unlikely(err)) { + zuf_err("zufs_dispatch failed op=ZUS_OP_STATFS => %d\n", err); + return err; + } + + _zst_to_kst(&ioc_statfs.statfs_out, buf); + return 0; +} + +static int zuf_show_options(struct seq_file *seq, struct dentry *root) +{ + struct zuf_sb_info *sbi = SBI(root->d_sb); + + if (__kuid_val(sbi->uid) && uid_valid(sbi->uid)) + seq_printf(seq, ",uid=%u", __kuid_val(sbi->uid)); + if (__kgid_val(sbi->gid) && gid_valid(sbi->gid)) + seq_printf(seq, ",gid=%u", __kgid_val(sbi->gid)); + if (test_opt(sbi, EPHEMERAL)) + seq_puts(seq, ",ephemeral"); + if (test_opt(sbi, DAX)) + seq_puts(seq, ",dax"); + + return 0; +} + +static int zuf_show_devname(struct seq_file *seq, struct dentry *root) +{ + seq_printf(seq, "/dev/%s", _bdev_name(root->d_sb->s_bdev)); + + return 0; +} + +static int zuf_remount(struct super_block *sb, int *mntflags, char *data) +{ + unsigned long old_mount_opt; + struct zuf_sb_info *sbi = SBI(sb); + struct __parse_options po; /* Actually not used */ + int err; + + zuf_info("remount... -o %s\n", data); + + /* Store the old options */ + old_mount_opt = sbi->s_mount_opt; + + err = _parse_options(sbi, data, 1, &po); + if (unlikely(err)) + goto fail; + + if ((*mntflags & MS_RDONLY) != zuf_rdonly(sb)) + _sb_mwtime_now(sb, md_zdt(sbi->md)); + + return 0; + +fail: + sbi->s_mount_opt = old_mount_opt; + zuf_dbg_err("remount failed restore option\n"); + return err; +} + +static int zuf_update_s_wtime(struct super_block *sb) +{ + if (!(sb->s_flags & MS_RDONLY)) { + struct timespec now = current_kernel_time(); + + timespec_to_mt(&md_zdt(SBI(sb)->md)->s_wtime, &now); + } + return 0; +} + +static void _sync_add_inode(struct inode *inode) +{ + struct zuf_sb_info *sbi = SBI(inode->i_sb); + struct zuf_inode_info *zii = ZUII(inode); + + zuf_dbg_mmap("[%ld] write_mapped=%d\n", + inode->i_ino, atomic_read(&zii->write_mapped)); + + spin_lock(&sbi->s_mmap_dirty_lock); + + /* Because we are lazy removing the inodes, only in case of an fsync + * or an evict_inode. It is fine if we are call multiple times. 
+ */ + if (list_empty(&zii->i_mmap_dirty)) + list_add(&zii->i_mmap_dirty, &sbi->s_mmap_dirty); + + spin_unlock(&sbi->s_mmap_dirty_lock); +} + +static void _sync_remove_inode(struct inode *inode) +{ + struct zuf_sb_info *sbi = SBI(inode->i_sb); + struct zuf_inode_info *zii = ZUII(inode); + + zuf_dbg_mmap("[%ld] write_mapped=%d\n", + inode->i_ino, atomic_read(&zii->write_mapped)); + + spin_lock(&sbi->s_mmap_dirty_lock); + list_del_init(&zii->i_mmap_dirty); + spin_unlock(&sbi->s_mmap_dirty_lock); +} + +void zuf_sync_inc(struct inode *inode) +{ + struct zuf_inode_info *zii = ZUII(inode); + + if (1 == atomic_inc_return(&zii->write_mapped)) + _sync_add_inode(inode); +} + +/* zuf_sync_dec will unmapped in batches */ +void zuf_sync_dec(struct inode *inode, ulong write_unmapped) +{ + struct zuf_inode_info *zii = ZUII(inode); + + if (0 == atomic_sub_return(write_unmapped, &zii->write_mapped)) + _sync_remove_inode(inode); +} + +/* + * We must fsync any mmap-active inodes + */ +static int zuf_sync_fs(struct super_block *sb, int wait) +{ + struct zuf_sb_info *sbi = SBI(sb); + struct zuf_inode_info *zii, *t; + enum {to_clean_size = 120}; + struct zuf_inode_info *zii_to_clean[to_clean_size]; + uint i, to_clean; + +more_inodes: + spin_lock(&sbi->s_mmap_dirty_lock); + to_clean = 0; + list_for_each_entry_safe(zii, t, &sbi->s_mmap_dirty, i_mmap_dirty) { + list_del_init(&zii->i_mmap_dirty); + zii_to_clean[to_clean++] = zii; + if (to_clean >= to_clean_size) + break; + } + spin_unlock(&sbi->s_mmap_dirty_lock); + + if (!to_clean) + return 0; + + for (i = 0; i < to_clean; ++i) + zuf_isync(&zii_to_clean[i]->vfs_inode, 0, ~0 - 1, 1); + + if (to_clean == to_clean_size) + goto more_inodes; + + return 0; +} + +static struct inode *zuf_alloc_inode(struct super_block *sb) +{ + struct zuf_inode_info *zii; + + zii = kmem_cache_alloc(zuf_inode_cachep, GFP_NOFS); + if (!zii) + return NULL; + + zii->vfs_inode.i_version = 1; + return &zii->vfs_inode; +} + +static void zuf_destroy_inode(struct inode *inode) +{ + kmem_cache_free(zuf_inode_cachep, ZUII(inode)); +} + +static void _init_once(void *foo) +{ + struct zuf_inode_info *zii = foo; + + inode_init_once(&zii->vfs_inode); + INIT_LIST_HEAD(&zii->i_mmap_dirty); + zii->zi = NULL; + zii->zero_page = NULL; + init_rwsem(&zii->in_sync); + atomic_set(&zii->vma_count, 0); + atomic_set(&zii->write_mapped, 0); +} + +int __init zuf_init_inodecache(void) +{ + zuf_inode_cachep = kmem_cache_create("zuf_inode_cache", + sizeof(struct zuf_inode_info), + 0, + (SLAB_RECLAIM_ACCOUNT | + SLAB_MEM_SPREAD | + SLAB_TYPESAFE_BY_RCU), + _init_once); + if (zuf_inode_cachep == NULL) + return -ENOMEM; + return 0; +} + +void zuf_destroy_inodecache(void) +{ + kmem_cache_destroy(zuf_inode_cachep); +} + +/* + * the super block writes are all done "on the fly", so the + * super block is never in a "dirty" state, so there's no need + * for write_super. + */ +static struct super_operations zuf_sops = { + .alloc_inode = zuf_alloc_inode, + .destroy_inode = zuf_destroy_inode, + .write_inode = zuf_write_inode, + .evict_inode = zuf_evict_inode, + .put_super = zuf_put_super, + .freeze_fs = zuf_update_s_wtime, + .unfreeze_fs = zuf_update_s_wtime, + .sync_fs = zuf_sync_fs, + .statfs = zuf_statfs, + .remount_fs = zuf_remount, + .show_options = zuf_show_options, + .show_devname = zuf_show_devname, +}; + struct dentry *zuf_mount(struct file_system_type *fs_type, int flags, const char *dev_name, void *data) { - return ERR_PTR(-ENOTSUPP); + int silent = flags & MS_SILENT ? 
1 : 0; + struct __fill_super_params fsp = { + .mount_options = data, + }; + struct register_fs_info *rfi = &ZUF_FST(fs_type)->rfi; + struct mdt_check mc = { + .major_ver = rfi->FS_ver_major, + .minor_ver = rfi->FS_ver_minor, + .magic = rfi->FS_magic, + + .holder = fs_type, + .silent = silent, + }; + struct dentry *ret = NULL; + const char *dev_path = NULL; + struct zuf_fs_type *fst; + int err; + + zuf_dbg_vfs("dev_name=%s, data=%s\n", dev_name, (const char *)data); + + fsp.md = md_alloc(sizeof(struct zuf_pmem)); + if (IS_ERR(fsp.md)) { + err = PTR_ERR(fsp.md); + fsp.md = NULL; + goto out; + } + + err = md_init(fsp.md, dev_name, &mc, &dev_path); + if (unlikely(err)) { + zuf_err_cnd(silent, "md_init failed! => %d\n", err); + goto out; + } + + fst = container_of(fs_type, struct zuf_fs_type, vfs_fst); + zuf_add_pmem(fst->zri, fsp.md); + + zuf_dbg_vfs("mounting with dev_path=%s\n", dev_path); + ret = mount_bdev(fs_type, flags, dev_path, &fsp, zuf_fill_super); + +out: + if (unlikely(err) && fsp.md) + md_fini(fsp.md, NULL); + kfree(dev_path); + return err ? ERR_PTR(err) : ret; } diff --git a/fs/zuf/t1.c b/fs/zuf/t1.c new file mode 100644 index 0000000..b0c869c --- /dev/null +++ b/fs/zuf/t1.c @@ -0,0 +1,114 @@ +/* + * BRIEF DESCRIPTION + * + * Just the special mmap of the all t1 array to the ZUS Server + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#include +#include +#include +#include + +#include "zuf.h" + +/* ~~~ Functions for mmap a t1-array and page faults ~~~ */ +struct zuf_pmem *_pmem_from_f_private(struct file *file) +{ + struct zuf_special_file *zsf = file->private_data; + + WARN_ON(zsf->type != zlfs_e_pmem); + return container_of(zsf, struct zuf_pmem, hdr); +} + +static int t1_file_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = vma->vm_file->f_mapping->host; + struct zuf_pmem *z_pmem; + pgoff_t size; + ulong bn = vmf->pgoff; + ulong pfn; + int err; + + zuf_dbg_t1("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x cow_page=%p page=%p\n", + inode->i_ino, vma->vm_start, vma->vm_end, + vmf->address, vmf->pgoff, vmf->flags, + vmf->cow_page, vmf->page); + + if (unlikely(vmf->page)) { + zuf_err("[%ld] vm_start=0x%lx vm_end=0x%lx VA=0x%lx " + "pgoff=0x%lx vmf_flags=0x%x page=%p cow_page=%p\n", + inode->i_ino, vma->vm_start, vma->vm_end, + vmf->address, vmf->pgoff, vmf->flags, + vmf->page, vmf->cow_page); + return VM_FAULT_SIGBUS; + } + + size = md_o2p_up(i_size_read(inode)); + if (unlikely(vmf->pgoff >= size)) { + ulong pgoff = vma->vm_pgoff + + md_o2p((vmf->address - vma->vm_start)); + + zuf_err("[%ld] pgoff(0x%lx)(0x%lx) >= size(0x%lx) => SIGBUS\n", + inode->i_ino, vmf->pgoff, pgoff, size); + + return VM_FAULT_SIGBUS; + } + + if (vmf->cow_page) + /* HOWTO: prevent private mmaps */ + return VM_FAULT_SIGBUS; + + z_pmem = _pmem_from_f_private(vma->vm_file); + pfn = md_pfn(&z_pmem->md, bn); + + err = vm_insert_mixed_mkwrite(vma, vmf->address, + phys_to_pfn_t(PFN_PHYS(pfn), PFN_MAP | PFN_DEV)); + zuf_dbg_t1("[%ld] vm_insert_mixed 0x%lx prot=0x%lx => %d\n", + inode->i_ino, pfn, vma->vm_page_prot.pgprot, err); + + /* + * err == -EBUSY is fine, we've raced against another thread + * that faulted-in the same page + */ + if (err && (err != -EBUSY)) { + zuf_err("[%ld] vm_insert_page/mixed => %d\n", + inode->i_ino, err); + return VM_FAULT_SIGBUS; + } + + return VM_FAULT_NOPAGE; +} + +static const 
struct vm_operations_struct t1_vm_ops = { + .fault = t1_file_fault, +}; + +int zuf_pmem_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct zuf_special_file *zsf = file->private_data; + + if (!zsf || zsf->type != zlfs_e_pmem) + return -EPERM; + + + /* FIXME: MIXEDMAP for the support of pmem-pages (Why?) + */ + vma->vm_flags |= VM_MIXEDMAP; + vma->vm_ops = &t1_vm_ops; + + zuf_dbg_vfs("[%ld] start=0x%lx end=0x%lx flags=0x%lx page_prot=0x%lx\n", + file->f_mapping->host->i_ino, vma->vm_start, vma->vm_end, + vma->vm_flags, pgprot_val(vma->vm_page_prot)); + + return 0; +} + diff --git a/fs/zuf/t2.c b/fs/zuf/t2.c new file mode 100644 index 0000000..fa4eadc --- /dev/null +++ b/fs/zuf/t2.c @@ -0,0 +1,348 @@ +/* + * Tier-2 operations. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#include "t2.h" + +#include +#include + +#include "zuf.h" + +#define t2_dbg(fmt, args ...) zuf_dbg_t2(fmt, ##args) + +const char *_pr_rw(int rw) +{ + return (rw & WRITE) ? "WRITE" : "READ"; +} +#define t2_tis_dbg(tis, fmt, args ...) \ + zuf_dbg_t2("%s: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags), \ + atomic_read(&tis->refcount), tis->rw_flags, ##args) + +#define t2_tis_dbg_rw(tis, fmt, args ...) \ + zuf_dbg_t2_rw("%s<%p>: r=%d f=0x%lx " fmt, _pr_rw(tis->rw_flags), \ + tis->priv, atomic_read(&tis->refcount), tis->rw_flags,\ + ##args) + +/* ~~~~~~~~~~~~ Async read/write ~~~~~~~~~~ */ +void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done, + void *priv, uint n_vects, struct t2_io_state *tis) +{ + atomic_set(&tis->refcount, 1); + tis->md = md; + tis->done = done; + tis->priv = priv; + tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES); + tis->rw_flags = rw; + tis->last_t2 = -1; + tis->cur_bio = NULL; + tis->index = ~0; + bio_list_init(&tis->delayed_bios); + tis->err = 0; + blk_start_plug(&tis->plug); + t2_tis_dbg_rw(tis, "done=%pF n_vects=%d\n", done, n_vects); +} + +static void _tis_put(struct t2_io_state *tis) +{ + t2_tis_dbg_rw(tis, "done=%pF\n", tis->done); + + if (test_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags)) + wake_up_atomic_t(&tis->refcount); + else if (tis->done) + /* last - done may free the tis */ + tis->done(tis, NULL, true); +} + +static inline void tis_get(struct t2_io_state *tis) +{ + atomic_inc(&tis->refcount); +} + +static inline int tis_put(struct t2_io_state *tis) +{ + if (atomic_dec_and_test(&tis->refcount)) { + _tis_put(tis); + return 1; + } + return 0; +} + +static inline bool _err_set_reported(struct md_dev_info *mdi, bool write) +{ + bool *reported = write ? 
&mdi->t2i.err_write_reported : + &mdi->t2i.err_read_reported; + + if (!(*reported)) { + *reported = true; + return true; + } + return false; +} + +static int _status_to_errno(blk_status_t status) +{ + return -EIO; +} + +static void _tis_bio_done(struct bio *bio) +{ + struct t2_io_state *tis = bio->bi_private; + struct md_dev_info *mdi = md_t2_dev(tis->md, 0); + + t2_tis_dbg(tis, "done=%pF err=%d\n", tis->done, bio->bi_status); + + if (unlikely(bio->bi_status)) { + zuf_dbg_err("%s: err=%d last-err=%d\n", + _pr_rw(tis->rw_flags), bio->bi_status, tis->err); + if (_err_set_reported(mdi, 0 != (tis->rw_flags & WRITE))) + zuf_err("%s: err=%d\n", + _pr_rw(tis->rw_flags), bio->bi_status); + /* Store the last one */ + tis->err = _status_to_errno(bio->bi_status); + } else if (unlikely(mdi->t2i.err_write_reported || + mdi->t2i.err_read_reported)) { + if (tis->rw_flags & WRITE) + mdi->t2i.err_write_reported = false; + else + mdi->t2i.err_read_reported = false; + } + + if (tis->done) + tis->done(tis, bio, false); + + bio_put(bio); + tis_put(tis); +} + +static bool _tis_delay(struct t2_io_state *tis) +{ + return 0 != (tis->rw_flags & TIS_DELAY_SUBMIT); +} + +#define bio_list_for_each_safe(bio, btmp, bl) \ + for (bio = (bl)->head, btmp = bio ? bio->bi_next : NULL; \ + bio; bio = btmp, btmp = bio ? bio->bi_next : NULL) + +static void _tis_submit_bio(struct t2_io_state *tis, bool flush, bool done) +{ + if (flush || done) { + if (_tis_delay(tis)) { + struct bio *btmp, *bio; + + bio_list_for_each_safe(bio, btmp, &tis->delayed_bios) { + bio->bi_next = NULL; + if (bio->bi_iter.bi_sector == -1) { + t2_warn("!!!!!!!!!!!!!\n"); + bio_put(bio); + continue; + } + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + bio->bi_vcnt, tis->n_vects); + submit_bio(bio); + } + bio_list_init(&tis->delayed_bios); + } + + if (!tis->cur_bio) + return; + + if (tis->cur_bio->bi_iter.bi_sector != -1) { + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + tis->cur_bio->bi_vcnt, tis->n_vects); + submit_bio(tis->cur_bio); + tis->cur_bio = NULL; + tis->index = ~0; + } else if (done) { + t2_tis_dbg(tis, "put cur_bio=%p\n", tis->cur_bio); + bio_put(tis->cur_bio); + WARN_ON(tis_put(tis)); + } + } else if (tis->cur_bio && (tis->cur_bio->bi_iter.bi_sector != -1)) { + /* Not flushing regular progress */ + if (_tis_delay(tis)) { + t2_tis_dbg(tis, "list_add cur_bio=%p\n", tis->cur_bio); + bio_list_add(&tis->delayed_bios, tis->cur_bio); + } else { + t2_tis_dbg(tis, "submit bio[%d] max_v=%d\n", + tis->cur_bio->bi_vcnt, tis->n_vects); + submit_bio(tis->cur_bio); + } + tis->cur_bio = NULL; + tis->index = ~0; + } +} + +/* tis->cur_bio MUST be NULL, checked by caller */ +static void _tis_alloc(struct t2_io_state *tis, struct md_dev_info *mdi, + gfp_t gfp) +{ + struct bio *bio = bio_alloc(gfp, tis->n_vects); + int bio_op; + + if (unlikely(!bio)) { + if (!_tis_delay(tis)) + t2_warn("!!! failed to alloc bio"); + tis->err = -ENOMEM; + return; + } + + if (WARN_ON(!tis || !tis->md)) { + tis->err = -ENOMEM; + return; + } + + /* FIXME: bio_set_op_attrs macro has a BUG which does not allow this + * question inline. + */ + bio_op = (tis->rw_flags & WRITE) ? REQ_OP_WRITE : REQ_OP_READ; + bio_set_op_attrs(bio, bio_op, 0); + + if (mdi->bdev) + bio_set_dev(bio, mdi->bdev); + bio->bi_iter.bi_sector = -1; + bio->bi_end_io = _tis_bio_done; + bio->bi_private = tis; + + tis->index = mdi ? 
mdi->index : ~0; + tis->last_t2 = -1; + tis->cur_bio = bio; + tis_get(tis); + t2_tis_dbg(tis, "New bio n_vects=%d\n", tis->n_vects); +} + +int t2_io_prealloc(struct t2_io_state *tis, uint n_vects) +{ + tis->err = 0; /* reset any -ENOMEM from a previous t2_io_add */ + + _tis_submit_bio(tis, true, false); + tis->n_vects = min(n_vects ? n_vects : 1, (uint)BIO_MAX_PAGES); + + t2_tis_dbg(tis, "n_vects=%d cur_bio=%p\n", tis->n_vects, tis->cur_bio); + + if (!tis->cur_bio) + _tis_alloc(tis, NULL, GFP_NOFS); + return tis->err; +} + +int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page) +{ + struct md_dev_info *mdi = md_bn_t2_dev(tis->md, t2); + ulong local_t2 = md_t2_local_bn(tis->md, t2); + int ret; + + if (((local_t2 != (tis->last_t2 + 1)) && (tis->last_t2 != -1)) || + (mdi && (0 < tis->index) && (tis->index != mdi->index))) + _tis_submit_bio(tis, false, false); + +start: + if (!tis->cur_bio) { + _tis_alloc(tis, mdi, _tis_delay(tis) ? GFP_ATOMIC : GFP_NOFS); + if (unlikely(tis->err)) + return tis->err; + } else if (tis->index == ~0) { + /* the bio was allocated during t2_io_prealloc */ + tis->index = mdi->index; + bio_set_dev(tis->cur_bio, mdi->bdev); + } + + if (tis->last_t2 == -1) + tis->cur_bio->bi_iter.bi_sector = local_t2 * T2_SECTORS_PER_PAGE; + + ret = bio_add_page(tis->cur_bio, page, PAGE_SIZE, 0); + if (unlikely(ret != PAGE_SIZE)) { + t2_tis_dbg(tis, "bio_add_page=>%d bi_vcnt=%d n_vects=%d\n", + ret, tis->cur_bio->bi_vcnt, tis->n_vects); + _tis_submit_bio(tis, false, false); + goto start; /* device does not support tis->n_vects */ + } + + if ((tis->cur_bio->bi_vcnt == tis->n_vects) && (tis->n_vects != 1)) + _tis_submit_bio(tis, false, false); + + t2_tis_dbg(tis, "t2=0x%lx last_t2=0x%lx local_t2=0x%lx page-i=0x%lx\n", + t2, tis->last_t2, local_t2, page->index); + + tis->last_t2 = local_t2; + return 0; +} + +int t2_io_end(struct t2_io_state *tis, bool wait) +{ + int err = 0; + + if (unlikely(!tis || !tis->md)) + return 0; /* never initialized nothing to do */ + + t2_tis_dbg_rw(tis, "wait=%d\n", wait); + + _tis_submit_bio(tis, true, true); + blk_finish_plug(&tis->plug); + + if (wait) + set_bit(B_TIS_FREE_AFTER_WAIT, &tis->rw_flags); + tis_put(tis); + + if (wait) { + err = wait_on_atomic_t(&tis->refcount, atomic_t_wait, + TASK_INTERRUPTIBLE); + if (likely(!err)) + err = tis->err; + if (tis->done) + tis->done(tis, NULL, true); + } + + /* In case of a ctrl-c we return an err but tis->err == 0 */ + return err; +} + +/* ~~~~~~~ Sync read/write ~~~~~~~ TODO: Remove soon */ +static int _sync_io_page(struct multi_devices *md, int rw, ulong bn, + struct page *page) +{ + struct t2_io_state tis; + int err; + + t2_io_begin(md, rw, NULL, NULL, 1, &tis); + + t2_tis_dbg((&tis), "bn=0x%lx p-i=0x%lx\n", bn, page->index); + + err = t2_io_add(&tis, bn, page); + if (unlikely(err)) + return err; + + err = submit_bio_wait(tis.cur_bio); + if (unlikely(err)) { + SetPageError(page); + /* + * We failed to write the page out to tier-2. + * Print a dire warning that things will go BAD (tm) + * very quickly. 
+ */ + zuf_err("io-error bn=0x%lx => %d\n", bn, err); + } + + /* Same as t2_io_end+_tis_bio_done but without the kref stuff */ + blk_finish_plug(&tis.plug); + if (likely(tis.cur_bio)) + bio_put(tis.cur_bio); + + return err; +} + +int t2_writepage(struct multi_devices *md, ulong bn, struct page *page) +{ + return _sync_io_page(md, WRITE, bn, page); +} + +int t2_readpage(struct multi_devices *md, ulong bn, struct page *page) +{ + return _sync_io_page(md, READ, bn, page); +} diff --git a/fs/zuf/t2.h b/fs/zuf/t2.h new file mode 100644 index 0000000..75c24f7 --- /dev/null +++ b/fs/zuf/t2.h @@ -0,0 +1,67 @@ +/* + * Tier-2 Header file. + * + * Copyright (c) 2018 NetApp Inc. All rights reserved. + * + * ZUFS-License: GPL-2.0 OR BSD-3-Clause. See module.c for LICENSE details. + * + * Authors: + * Boaz Harrosh + */ + +#ifndef __T2_H__ +#define __T2_H__ + +#include +#include +#include +#include +#include "_pr.h" +#include "md.h" + +#define T2_SECTORS_PER_PAGE (PAGE_SIZE / 512) + +#define t2_warn(fmt, args ...) zuf_warn(fmt, ##args) + +/* t2.c */ + +/* Sync read/write */ +int t2_writepage(struct multi_devices *md, ulong bn, struct page *page); +int t2_readpage(struct multi_devices *md, ulong bn, struct page *page); + +/* Async read/write */ +struct t2_io_state; +typedef void (*t2_io_done_fn)(struct t2_io_state *tis, struct bio *bio, + bool last); + +struct t2_io_state { + atomic_t refcount; /* counts in-flight bios */ + struct blk_plug plug; + + struct multi_devices *md; + int index; + t2_io_done_fn done; + void *priv; + + uint n_vects; + ulong rw_flags; + ulong last_t2; + struct bio *cur_bio; + struct bio_list delayed_bios; + int err; +}; + +/* For rw_flags above */ +/* From Kernel: WRITE (1U << 0) */ +#define TIS_DELAY_SUBMIT (1U << 2) +enum {B_TIS_FREE_AFTER_WAIT = 3}; +#define TIS_FREE_AFTER_WAIT (1U << B_TIS_FREE_AFTER_WAIT) +#define TIS_USER_DEF_FIRST (1U << 8) + +void t2_io_begin(struct multi_devices *md, int rw, t2_io_done_fn done, + void *priv, uint n_vects, struct t2_io_state *tis); +int t2_io_prealloc(struct t2_io_state *tis, uint n_vects); +int t2_io_add(struct t2_io_state *tis, ulong t2, struct page *page); +int t2_io_end(struct t2_io_state *tis, bool wait); + +#endif /*def __T2_H__*/ diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c index 12a23f1..963c417 100644 --- a/fs/zuf/zuf-core.c +++ b/fs/zuf/zuf-core.c @@ -14,7 +14,6 @@ #include #include #include -#include #include "zuf.h" @@ -220,6 +219,71 @@ void zufs_mounter_release(struct file *file) } } +/* ~~~~ PMEM GRAB ~~~~ */ +static int zufr_find_pmem(struct zuf_root_info *zri, + uint pmem_kern_id, struct zuf_pmem **pmem_md) +{ + struct zuf_pmem *z_pmem; + + list_for_each_entry(z_pmem, &zri->pmem_list, list) { + if (z_pmem->pmem_id == pmem_kern_id) { + *pmem_md = z_pmem; + return 0; + } + } + + return -ENODEV; +} + +static int _zu_grab_pmem(struct file *file, void *parg) +{ + struct zuf_root_info *zri = ZRI(file->f_inode->i_sb); + struct zufs_ioc_pmem __user *arg_pmem = parg; + struct zufs_ioc_pmem zi_pmem = {}; + struct zuf_pmem *pmem_md; + int err; + + err = get_user(zi_pmem.pmem_kern_id, &arg_pmem->pmem_kern_id); + if (err) { + zuf_err("\n"); + return err; + } + + err = zufr_find_pmem(zri, zi_pmem.pmem_kern_id, &pmem_md); + if (err) { + zuf_err("!!! 
diff --git a/fs/zuf/zuf-core.c b/fs/zuf/zuf-core.c
index 12a23f1..963c417 100644
--- a/fs/zuf/zuf-core.c
+++ b/fs/zuf/zuf-core.c
@@ -14,7 +14,6 @@
 #include 
 #include 
 #include 
-#include 
 #include "zuf.h"
@@ -220,6 +219,71 @@ void zufs_mounter_release(struct file *file)
 	}
 }
 
+/* ~~~~ PMEM GRAB ~~~~ */
+static int zufr_find_pmem(struct zuf_root_info *zri,
+			  uint pmem_kern_id, struct zuf_pmem **pmem_md)
+{
+	struct zuf_pmem *z_pmem;
+
+	list_for_each_entry(z_pmem, &zri->pmem_list, list) {
+		if (z_pmem->pmem_id == pmem_kern_id) {
+			*pmem_md = z_pmem;
+			return 0;
+		}
+	}
+
+	return -ENODEV;
+}
+
+static int _zu_grab_pmem(struct file *file, void *parg)
+{
+	struct zuf_root_info *zri = ZRI(file->f_inode->i_sb);
+	struct zufs_ioc_pmem __user *arg_pmem = parg;
+	struct zufs_ioc_pmem zi_pmem = {};
+	struct zuf_pmem *pmem_md;
+	int err;
+
+	err = get_user(zi_pmem.pmem_kern_id, &arg_pmem->pmem_kern_id);
+	if (err) {
+		zuf_err("\n");
+		return err;
+	}
+
+	err = zufr_find_pmem(zri, zi_pmem.pmem_kern_id, &pmem_md);
+	if (err) {
+		zuf_err("!!! pmem_kern_id=%d not found\n",
+			zi_pmem.pmem_kern_id);
+		goto out;
+	}
+
+	if (pmem_md->file) {
+		zuf_err("[%u] pmem already taken\n", zi_pmem.pmem_kern_id);
+		err = -EIO;
+		goto out;
+	}
+
+	err = md_numa_info(&pmem_md->md, &zi_pmem);
+	if (unlikely(err)) {
+		zuf_err("md_numa_info => %d\n", err);
+		goto out;
+	}
+
+	i_size_write(file->f_inode, md_p2o(md_t1_blocks(&pmem_md->md)));
+	pmem_md->hdr.type = zlfs_e_pmem;
+	pmem_md->file = file;
+	file->private_data = &pmem_md->hdr;
+	zuf_dbg_core("pmem %d GRABBED %s\n",
+		     zi_pmem.pmem_kern_id,
+		     _bdev_name(md_t1_dev(&pmem_md->md, 0)->bdev));
+
+out:
+	zi_pmem.hdr.err = err;
+	err = copy_to_user(parg, &zi_pmem, sizeof(zi_pmem));
+	if (err)
+		zuf_err("=>%d\n", err);
+	return err;
+}
+
 static int _map_pages(struct zufs_thread *zt, struct page **pages, uint nump,
 		      bool zap)
 {
@@ -451,6 +515,8 @@ long zufs_ioc(struct file *file, unsigned int cmd, ulong arg)
 		return _zu_register_fs(file, parg);
 	case ZU_IOC_MOUNT:
 		return _zu_mount(file, parg);
+	case ZU_IOC_GRAB_PMEM:
+		return _zu_grab_pmem(file, parg);
 	case ZU_IOC_INIT_THREAD:
 		return _zu_init(file, parg);
 	case ZU_IOC_WAIT_OPT:
diff --git a/fs/zuf/zuf-root.c b/fs/zuf/zuf-root.c
index 8102d3a..ebb44c9 100644
--- a/fs/zuf/zuf-root.c
+++ b/fs/zuf/zuf-root.c
@@ -117,6 +117,8 @@ int zufr_mmap(struct file *file, struct vm_area_struct *vma)
 	switch (zsf->type) {
 	case zlfs_e_zt:
 		return zuf_zt_mmap(file, vma);
+	case zlfs_e_pmem:
+		return zuf_pmem_mmap(file, vma);
 	default:
 		zuf_err("type=%d\n", zsf->type);
 		return -ENOTTY;
@@ -300,7 +302,10 @@ static struct kset *zufr_kset;
 
 int __init zuf_root_init(void)
 {
-	int err;
+	int err = zuf_init_inodecache();
+
+	if (unlikely(err))
+		return err;
 
 	zufr_kset = kset_create_and_add("zuf", NULL, fs_kobj);
 	if (!zufr_kset) {
@@ -317,6 +322,7 @@ int __init zuf_root_init(void)
 un_kset:
 	kset_unregister(zufr_kset);
 un_inodecache:
+	zuf_destroy_inodecache();
 	return err;
 }
 
@@ -324,6 +330,7 @@ void __exit zuf_root_exit(void)
 {
 	unregister_filesystem(&zufr_type);
 	kset_unregister(zufr_kset);
+	zuf_destroy_inodecache();
 }
 
 module_init(zuf_root_init)
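For context, this is roughly the Server-side (user-mode) sequence for
GRAB_PMEM: ioctl on a zuf-root file, then mmap the same fd to get the linear
T1 mapping. The open path and error handling below are illustrative
assumptions, not part of the patch:

	/* Server-side sketch; "/mnt/zuf-root/pmem0" is a made-up file name */
	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#include "zus_api.h"

	static void *grab_pmem(__u32 pmem_kern_id, size_t *size_out)
	{
		struct zufs_ioc_pmem pmem = {
			.pmem_kern_id = pmem_kern_id,
			.numa_info = NULL, /* cpu=>numa map not needed here */
		};
		void *addr = NULL;
		int fd = open("/mnt/zuf-root/pmem0", O_RDWR);

		if (fd < 0)
			return NULL;

		if (ioctl(fd, ZU_IOC_GRAB_PMEM, &pmem) == 0 && !pmem.hdr.err) {
			/* _zu_grab_pmem() sized this file to the full T1 space */
			*size_out = lseek(fd, 0, SEEK_END);
			addr = mmap(NULL, *size_out, PROT_READ | PROT_WRITE,
				    MAP_SHARED, fd, 0);
		}
		/* on success the fd must stay open for as long as the map */
		if (!addr || addr == MAP_FAILED)
			close(fd);
		return addr == MAP_FAILED ? NULL : addr;
	}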
diff --git a/fs/zuf/zuf.h b/fs/zuf/zuf.h
index 15516d0..a5d277f 100644
--- a/fs/zuf/zuf.h
+++ b/fs/zuf/zuf.h
@@ -26,7 +26,9 @@
 
 #include "zus_api.h"
 #include "relay.h"
+#include "t2.h"
 #include "_pr.h"
+#include "md.h"
 
 enum zlfs_e_special_file {
 	zlfs_e_zt = 1,
@@ -38,6 +40,15 @@ struct zuf_special_file {
 	enum zlfs_e_special_file type;
 };
 
+/* Our special md structure */
+struct zuf_pmem {
+	struct multi_devices md; /* must be first */
+	struct list_head list;
+	struct zuf_special_file hdr;
+	uint pmem_id;
+	struct file *file;
+};
+
 /* This is the zuf-root.c mini filesystem */
 struct zuf_root_info {
 	struct __mount_thread_info {
@@ -84,6 +95,215 @@ static inline void zuf_add_fs_type(struct zuf_root_info *zri,
 	list_add(&zft->list, &zri->fst_list);
 }
 
+static inline void zuf_add_pmem(struct zuf_root_info *zri,
+				struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = (void *)md;
+
+	z_pmem->pmem_id = ++zri->next_pmem_id; /* Avoid 0 id */
+
+	/* Unlocked; for now only one mount-thread with zus */
+	list_add(&z_pmem->list, &zri->pmem_list);
+}
+
+static inline uint zuf_pmem_id(struct multi_devices *md)
+{
+	struct zuf_pmem *z_pmem = container_of(md, struct zuf_pmem, md);
+
+	return z_pmem->pmem_id;
+}
+
+// void zuf_del_fs_type(struct zuf_root_info *zri, struct zuf_fs_type *zft);
+
+/*
+ * Private Super-block flags
+ */
+enum {
+	ZUF_MOUNT_PEDANTIC	= 0x000001, /* Check for memory leaks */
+	ZUF_MOUNT_PEDANTIC_SHADOW = 0x000002, /* */
+	ZUF_MOUNT_SILENT	= 0x000004, /* verbosity is silent */
+	ZUF_MOUNT_EPHEMERAL	= 0x000008, /* Don't persist the data */
+	ZUF_MOUNT_FAILED	= 0x000010, /* mark a failed-mount */
+	ZUF_MOUNT_DAX		= 0x000020, /* mounted with dax option */
+	ZUF_MOUNT_POSIXACL	= 0x000040, /* mounted with posix acls */
+};
+
+#define clear_opt(sbi, opt)	(sbi->s_mount_opt &= ~ZUF_MOUNT_ ## opt)
+#define set_opt(sbi, opt)	(sbi->s_mount_opt |= ZUF_MOUNT_ ## opt)
+#define test_opt(sbi, opt)	(sbi->s_mount_opt & ZUF_MOUNT_ ## opt)
+
+#define ZUFS_DEF_SBI_MODE	(S_IRUGO | S_IXUGO | S_IWUSR)
+
+/* Flags bits on zii->flags */
+enum {
+	ZII_UNMAP_LOCK = 1,
+};
+
+/*
+ * ZUF per-inode data in memory
+ */
+struct zuf_inode_info {
+	struct inode		vfs_inode;
+
+	/* Stuff for mmap write */
+	struct rw_semaphore	in_sync;
+	struct list_head	i_mmap_dirty;
+	atomic_t		write_mapped;
+	atomic_t		vma_count;
+	struct page		*zero_page; /* TODO: Remove */
+
+	/* cookies from Server */
+	struct zus_inode	*zi;
+	struct zus_inode_info	*zus_ii;
+};
+
+static inline struct zuf_inode_info *ZUII(struct inode *inode)
+{
+	return container_of(inode, struct zuf_inode_info, vfs_inode);
+}
+
+/*
+ * ZUF super-block data in memory
+ */
+struct zuf_sb_info {
+	struct super_block *sb;
+	struct multi_devices *md;
+
+	/* zus cookie */
+	struct zus_sb_info *zus_sbi;
+
+	/* Mount options */
+	unsigned long	s_mount_opt;
+	kuid_t		uid;	/* Mount uid for root directory */
+	kgid_t		gid;	/* Mount gid for root directory */
+	umode_t		mode;	/* Mount mode for root directory */
+
+	spinlock_t		s_mmap_dirty_lock;
+	struct list_head	s_mmap_dirty;
+};
+
+static inline struct zuf_sb_info *SBI(struct super_block *sb)
+{
+	return sb->s_fs_info;
+}
+
+static inline struct zuf_fs_type *ZUF_FST(struct file_system_type *fs_type)
+{
+	return container_of(fs_type, struct zuf_fs_type, vfs_fst);
+}
+
+static inline struct zuf_fs_type *zuf_fst(struct super_block *sb)
+{
+	return ZUF_FST(sb->s_type);
+}
+
+static inline struct zuf_root_info *ZUF_ROOT(struct zuf_sb_info *sbi)
+{
+	return zuf_fst(sbi->sb)->zri;
+}
+
+static inline bool zuf_rdonly(struct super_block *sb)
+{
+	return sb->s_flags & MS_RDONLY;
+}
+
+static inline struct zus_inode *zus_zi(struct inode *inode)
+{
+	return ZUII(inode)->zi;
+}
+
+/* An accessor because of the frequent use in prints */
+static inline ulong _zi_ino(struct zus_inode *zi)
+{
+	return le64_to_cpu(zi->i_ino);
+}
+
+static inline bool _zi_active(struct zus_inode *zi)
+{
+	return (zi->i_nlink || zi->i_mode);
+}
+
+static inline void mt_to_timespec(struct timespec *t, __le64 *mt)
+{
+	u32 nsec;
+
+	t->tv_sec = div_s64_rem(le64_to_cpu(*mt), NSEC_PER_SEC, &nsec);
+	t->tv_nsec = nsec;
+}
+
+static inline void timespec_to_mt(__le64 *mt, struct timespec *t)
+{
+	*mt = cpu_to_le64(t->tv_sec * NSEC_PER_SEC + t->tv_nsec);
+}
+
+static inline void zuf_r_lock(struct zuf_inode_info *zii)
+{
+	inode_lock_shared(&zii->vfs_inode);
+}
+static inline void zuf_r_unlock(struct zuf_inode_info *zii)
+{
+	inode_unlock_shared(&zii->vfs_inode);
+}
+
+static inline void zuf_smr_lock(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smr_lock_pagefault(struct zuf_inode_info *zii)
+{
+	down_read_nested(&zii->in_sync, 2);
+}
+static inline void zuf_smr_unlock(struct zuf_inode_info *zii)
+{
+	up_read(&zii->in_sync);
+}
+
+static inline void zuf_smw_lock(struct zuf_inode_info *zii)
+{
+	down_write(&zii->in_sync);
+}
+static inline void zuf_smw_lock_nested(struct zuf_inode_info *zii)
+{
+	down_write_nested(&zii->in_sync, 1);
+}
+static inline void zuf_smw_unlock(struct zuf_inode_info *zii)
+{
+	up_write(&zii->in_sync);
+}
+
+static inline void zuf_w_lock(struct zuf_inode_info *zii)
+{
+	inode_lock(&zii->vfs_inode);
+	zuf_smw_lock(zii);
+}
+static inline void zuf_w_lock_nested(struct zuf_inode_info *zii)
+{
+	inode_lock_nested(&zii->vfs_inode, 2);
+	zuf_smw_lock_nested(zii);
+}
+static inline void zuf_w_unlock(struct zuf_inode_info *zii)
+{
+	zuf_smw_unlock(zii);
+	inode_unlock(&zii->vfs_inode);
+}
+
+static inline void ZUF_CHECK_I_W_LOCK(struct inode *inode)
+{
+#ifdef CONFIG_ZUF_DEBUG
+	if (WARN_ON(down_write_trylock(&inode->i_rwsem)))
+		up_write(&inode->i_rwsem);
+#endif
+}
+
+/* CAREFUL: Needs an sfence eventually, after this call */
+static inline
+void zus_inode_cmtime_now(struct inode *inode, struct zus_inode *zi)
+{
+	inode->i_mtime = inode->i_ctime = current_time(inode);
+	timespec_to_mt(&zi->i_ctime, &inode->i_ctime);
+	zi->i_mtime = zi->i_ctime;
+}
+
 /* Keep this include last thing in file */
 #include "_extern.h"
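To make the intended lock ordering above concrete: a write-side caller takes
i_rwsem first and then the mmap-sync semaphore, both through one helper and
released in reverse order. A hedged sketch (the caller function is made up):

	/* Illustrative only; _my_setattr() is a hypothetical caller */
	static int _my_setattr(struct inode *inode)
	{
		struct zuf_inode_info *zii = ZUII(inode);

		zuf_w_lock(zii);	/* i_rwsem first, then zii->in_sync */

		/* ... forward the operation to the Server here ... */
		zus_inode_cmtime_now(inode, zus_zi(inode));

		zuf_w_unlock(zii);	/* released in reverse order */
		return 0;
	}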
diff --git a/fs/zuf/zus_api.h b/fs/zuf/zus_api.h
index 19ce326..d461782 100644
--- a/fs/zuf/zus_api.h
+++ b/fs/zuf/zus_api.h
@@ -66,6 +66,17 @@
 
 #endif /* ndef __KERNEL__ */
 
+/*
+ * Maximal count of links to a file
+ */
+#define ZUFS_LINK_MAX		32000
+#define ZUFS_MAX_SYMLINK	PAGE_SIZE
+#define ZUFS_NAME_LEN		255
+#define ZUFS_READAHEAD_PAGES	8
+
+/* All device sizes and offsets must be 2M aligned */
+#define ZUFS_ALLOC_MASK		(1024 * 1024 * 2 - 1)
+
 /**
  * zufs dual port memory
  * This is a special type of offset to either memory or persistent-memory,
@@ -75,6 +86,121 @@
  */
 typedef __u64 zu_dpp_t;
 
+/*
+ * Structure of a ZUS inode.
+ * This is all the inode fields
+ */
+
+/* zus_inode size */
+#define ZUFS_INODE_SIZE 128	/* must be power of two */
+#define ZUFS_INODE_BITS   7
+
+struct zus_inode {
+	__le32	i_flags;	/* Inode flags */
+	__le16	i_mode;		/* File mode */
+	__le16	i_nlink;	/* Links count */
+	__le64	i_size;		/* Size of data in bytes */
+/* 16*/	struct __zi_on_disk_desc {
+		__le64	a[2];
+	}	i_on_disk;	/* FS-specific on-disk placement */
+/* 32*/	__le64	i_blocks;
+	__le64	i_mtime;	/* Inode/data Modification time */
+	__le64	i_ctime;	/* Inode/data Changed time */
+	__le64	i_atime;	/* Data Access time */
+/* 64 - cache-line boundary */
+	__le64	i_ino;		/* Inode number */
+	__le32	i_uid;		/* Owner Uid */
+	__le32	i_gid;		/* Group Id */
+	__le64	i_xattr;	/* FS-specific Extended attribute block */
+	__le64	i_generation;	/* File version (for NFS) */
+/* 96*/	union {
+		__le32	i_rdev;		/* special-inode major/minor etc ...*/
+		u8	i_symlink[32];	/* if i_size < sizeof(i_symlink) */
+		__le64	i_sym_sno;	/* FS-specific symlink placement */
+		struct _zu_dir {
+			__le64	parent;
+		}	i_dir;
+	};
+	/* Total ZUFS_INODE_SIZE bytes always */
+};
+
+#define ZUFS_SB_SIZE 2048	/* must be power of two */
+
+/* device table s_flags */
+#define		ZUFS_SHADOW	(1UL << 4)	/* simulate cpu cache */
+
+#define test_msb_opt(msb, opt)	(le64_to_cpu(msb->s_flags) & opt)
+
+#define ZUFS_DEV_NUMA_SHIFT	60
+#define ZUFS_DEV_BLOCKS_MASK	0x0FFFFFFFFFFFFFFF
+
+struct md_dev_id {
+	uuid_le	uuid;
+	__le64	blocks;
+} __packed;
+
+static inline __u64 __dev_id_blocks(struct md_dev_id *dev)
+{
+	return le64_to_cpu(dev->blocks) & ZUFS_DEV_BLOCKS_MASK;
+}
+
+static inline int __dev_id_nid(struct md_dev_id *dev)
+{
+	return (int)(le64_to_cpu(dev->blocks) >> ZUFS_DEV_NUMA_SHIFT);
+}
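The layout invariants stated in the comments above (zus_inode is exactly
ZUFS_INODE_SIZE bytes; md_dev_id is a packed uuid plus blocks word) are easy
to pin down at compile time. A sketch, not part of the patch:

	/* Sketch only; in-kernel code would more likely use BUILD_BUG_ON() */
	_Static_assert(sizeof(struct zus_inode) == ZUFS_INODE_SIZE,
		       "zus_inode must stay exactly ZUFS_INODE_SIZE bytes");
	_Static_assert((1 << ZUFS_INODE_BITS) == ZUFS_INODE_SIZE,
		       "ZUFS_INODE_BITS must match ZUFS_INODE_SIZE");
	_Static_assert(sizeof(struct md_dev_id) == 24,
		       "md_dev_id is uuid (16) + blocks (8), packed");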
@@ -143,6 +269,46 @@ struct zufs_ioc_mount {
 };
 #define ZU_IOC_MOUNT	_IOWR('S', 12, struct zufs_ioc_mount)
 
+/* 64 is the nicest number that still fits when the ZDT is 2048, and 6 bits
+ * can fit in the page struct for address-to-block translation.
+ */
+#define MD_DEV_MAX   64
+
+struct md_dev_list {
+	__le16		id_index;	/* index of current dev in list */
+	__le16		t1_count;	/* # of t1 devs */
+	__le16		t2_count;	/* # of t2 devs (after t1_count) */
+	__le16		reserved;	/* align to 64 bit */
+	struct md_dev_id dev_ids[MD_DEV_MAX];
+} __attribute__((aligned(64)));
+
+/*
+ * Structure of the on-disk zufs device table
+ * NOTE: zufs_dev_table is always of size ZUFS_SB_SIZE. The members below are
+ * the ones currently defined/used in this version.
+ * TODO: remove the s_ from all the fields
+ */
+struct zufs_dev_table {
+	/* static fields. they never change after file system creation.
+	 * checksum only validates up to s_start_dynamic field below
+	 */
+	__le16		s_sum;		/* checksum of this sb */
+	__le16		s_version;	/* zdt-version */
+	__le32		s_magic;	/* magic signature */
+	uuid_le		s_uuid;		/* 128-bit uuid */
+	__le64		s_flags;
+	__le64		s_t1_blocks;
+	__le64		s_t2_blocks;
+
+	struct md_dev_list s_dev_list;
+
+	char		s_start_dynamic[0];
+
+	/* all the dynamic fields should go here */
+	__le64		s_mtime;	/* mount time */
+	__le64		s_wtime;	/* write time */
+};
+
+static inline int msb_major_version(struct zufs_dev_table *msb)
+{
+	return le16_to_cpu(msb->s_version) / ZUFS_MINORS_PER_MAJOR;
+}
+
+static inline int msb_minor_version(struct zufs_dev_table *msb)
+{
+	return le16_to_cpu(msb->s_version) % ZUFS_MINORS_PER_MAJOR;
+}
+
+#define ZUFS_SB_STATIC_SIZE(ps) ((u64)&ps->s_start_dynamic - (u64)ps)
+
+/* pmem */
+struct zufs_ioc_pmem {
+	/* Set by zus */
+	struct zufs_ioc_hdr hdr;
+	__u32 pmem_kern_id;
+
+	/* Returned to zus */
+	__u64 pmem_total_blocks;
+	__u32 max_nodes;
+	__u32 active_pmem_nodes;
+	struct zufs_pmem_info {
+		int sections;
+		struct zufs_pmem_sec {
+			__u32 length;
+			__u16 numa_id;
+			__u16 numa_index;
+		} secs[MD_DEV_MAX];
+	} pmem;
+
+	/* Variable-length array mapping a CPU to the proper active pmem to
+	 * use. ZUS starts with 4k; if too small, hdr.err == ETOSMALL and
+	 * max_cpu_id is set to the needed amount.
+	 *
+	 * Careful: this is a user-mode pointer; if not needed by the Server,
+	 * set it to NULL.
+	 *
+	 * @max_cpu_id is set by the Server to say how much space there is at
+	 * numa_info; Kernel returns the actual number of active CPUs.
+	 */
+	struct zufs_numa_info {
+		__u32 max_cpu_id;
+		__u32 pad;
+		struct zufs_cpu_info {
+			__u32 numa_id;
+			__u32 numa_index;
+		} numa_id_map[];
+	} *numa_info;
+};
+/* A GRAB is never ungrabbed; umount or file close cleans it all */
+#define ZU_IOC_GRAB_PMEM	_IOWR('S', 13, struct zufs_ioc_pmem)
+
 /* ZT init */
 struct zufs_ioc_init {
 	struct zufs_ioc_hdr hdr;
@@ -169,9 +335,20 @@ struct zufs_ioc_wait_operation {
  */
 enum e_zufs_operation {
 	ZUS_OP_NULL = 0,
+	ZUS_OP_STATFS,
 
 	ZUS_OP_BREAK,		/* Kernel telling Server to exit */
 	ZUS_OP_MAX_OPT,
 };
 
+/* ZUS_OP_STATFS */
+struct zufs_ioc_statfs {
+	struct zufs_ioc_hdr hdr;
+	/* IN */
+	struct zus_sb_info *zus_sbi;
+
+	/* OUT */
+	struct statfs64 statfs_out;
+};
+
 #endif /* _LINUX_ZUFS_API_H */
-- 
2.5.5