Subject: Re: [PATCH v10 14/41] btrfs: load zone's allocation offset
To: Naohiro Aota, linux-btrfs@vger.kernel.org, dsterba@suse.com
Cc: hare@suse.com, linux-fsdevel@vger.kernel.org, Jens Axboe,
 Christoph Hellwig, "Darrick J. Wong", Josef Bacik
From: Anand Jain
Date: Tue, 8 Dec 2020 17:54:57 +0800
List-ID: linux-fsdevel@vger.kernel.org

On 10/11/20 7:26 pm, Naohiro Aota wrote:
> Zoned btrfs must allocate blocks at the zones' write pointer. The
> device's write pointer position can be mapped to a logical address
> within a block group. This commit adds "alloc_offset" to track that
> logical address.
> 
> This logical address is populated in btrfs_load_block_group_zone_info()
> from the write pointers of the corresponding zones.
> 
> For now, zoned btrfs supports only the SINGLE profile. Supporting a
> non-SINGLE profile with zone append writes is not trivial. For example,
> in the DUP profile, we send a zone append write IO to two zones on a
> device. The device replies with the written LBA for each IO. If the
> offsets of the returned addresses from the beginning of their zones
> differ, the two copies end up at different logical addresses.
> 
> Supporting such diverging physical addresses would need a fine-grained
> logical-to-physical mapping, which would require an additional metadata
> type, so disable non-SINGLE profiles for now.
> 
> This commit handles the case where all the zones in a block group are
> sequential. The next patch will handle block groups that contain a
> conventional zone.
>
> Signed-off-by: Naohiro Aota
> Reviewed-by: Josef Bacik
> ---
>  fs/btrfs/block-group.c |  15 ++++
>  fs/btrfs/block-group.h |   6 ++
>  fs/btrfs/zoned.c       | 154 +++++++++++++++++++++++++++++++++++++++++
>  fs/btrfs/zoned.h       |   7 ++
>  4 files changed, 182 insertions(+)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 6b4831824f51..ffc64dfbe09e 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -15,6 +15,7 @@
>  #include "delalloc-space.h"
>  #include "discard.h"
>  #include "raid56.h"
> +#include "zoned.h"
>  
>  /*
>   * Return target flags in extended format or 0 if restripe for this chunk_type
> @@ -1935,6 +1936,13 @@ static int read_one_block_group(struct btrfs_fs_info *info,
>                  goto error;
>          }
>  
> +        ret = btrfs_load_block_group_zone_info(cache);
> +        if (ret) {
> +                btrfs_err(info, "zoned: failed to load zone info of bg %llu",
> +                          cache->start);
> +                goto error;
> +        }
> +
>          /*
>           * We need to exclude the super stripes now so that the space info has
>           * super bytes accounted for, otherwise we'll think we have more space
> @@ -2161,6 +2169,13 @@ int btrfs_make_block_group(struct btrfs_trans_handle *trans, u64 bytes_used,
>          cache->last_byte_to_unpin = (u64)-1;
>          cache->cached = BTRFS_CACHE_FINISHED;
>          cache->needs_free_space = 1;
> +
> +        ret = btrfs_load_block_group_zone_info(cache);
> +        if (ret) {
> +                btrfs_put_block_group(cache);
> +                return ret;
> +        }
> +
>          ret = exclude_super_stripes(cache);
>          if (ret) {
>                  /* We may have excluded something, so call this just in case */
> diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
> index adfd7583a17b..14e3043c9ce7 100644
> --- a/fs/btrfs/block-group.h
> +++ b/fs/btrfs/block-group.h
> @@ -183,6 +183,12 @@ struct btrfs_block_group {
>  
>          /* Record locked full stripes for RAID5/6 block group */
>          struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
> +
> +        /*
> +         * Allocation offset for the block group to implement sequential
> +         * allocation. This is used only with ZONED mode enabled.
> +         */
> +        u64 alloc_offset;
>  };
>  
>  static inline u64 btrfs_block_group_end(struct btrfs_block_group *block_group)
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index ed5de1c138d7..69d3412c4fef 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -3,14 +3,20 @@
>  #include
>  #include
>  #include
> +#include
>  #include "ctree.h"
>  #include "volumes.h"
>  #include "zoned.h"
>  #include "rcu-string.h"
>  #include "disk-io.h"
> +#include "block-group.h"
>  
>  /* Maximum number of zones to report per blkdev_report_zones() call */
>  #define BTRFS_REPORT_NR_ZONES 4096
> +/* Invalid allocation pointer value for missing devices */
> +#define WP_MISSING_DEV ((u64)-1)
> +/* Pseudo write pointer value for conventional zone */
> +#define WP_CONVENTIONAL ((u64)-2)
>  
>  /* Number of superblock log zones */
>  #define BTRFS_NR_SB_LOG_ZONES 2
> @@ -777,3 +783,151 @@ int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size)
>  
>          return 0;
>  }
> +
> +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache)
> +{
> +        struct btrfs_fs_info *fs_info = cache->fs_info;
> +        struct extent_map_tree *em_tree = &fs_info->mapping_tree;
> +        struct extent_map *em;
> +        struct map_lookup *map;
> +        struct btrfs_device *device;
> +        u64 logical = cache->start;
> +        u64 length = cache->length;
> +        u64 physical = 0;
> +        int ret;
> +        int i;
> +        unsigned int nofs_flag;
> +        u64 *alloc_offsets = NULL;
> +        u32 num_sequential = 0, num_conventional = 0;
> +
> +        if (!btrfs_is_zoned(fs_info))
> +                return 0;
> +
> +        /* Sanity check */
> +        if (!IS_ALIGNED(length, fs_info->zone_size)) {
> +                btrfs_err(fs_info, "zoned: block group %llu len %llu unaligned to zone size %llu",
> +                          logical, length, fs_info->zone_size);
> +                return -EIO;
> +        }
> +
> +        /* Get the chunk mapping */
> +        read_lock(&em_tree->lock);
> +        em = lookup_extent_mapping(em_tree, logical, length);
> +        read_unlock(&em_tree->lock);
> +
> +        if (!em)
> +                return -EINVAL;
> +
> +        map = em->map_lookup;
> +
> +        /*
> +         * Get the zone type: if the group is mapped to a non-sequential zone,
> +         * there is no need for the allocation offset (fit allocation is OK).
> +         */
> +        alloc_offsets = kcalloc(map->num_stripes, sizeof(*alloc_offsets),
> +                                GFP_NOFS);
> +        if (!alloc_offsets) {
> +                free_extent_map(em);
> +                return -ENOMEM;
> +        }
> +
> +        for (i = 0; i < map->num_stripes; i++) {
> +                bool is_sequential;
> +                struct blk_zone zone;
> +
> +                device = map->stripes[i].dev;
> +                physical = map->stripes[i].physical;
> +
> +                if (device->bdev == NULL) {
> +                        alloc_offsets[i] = WP_MISSING_DEV;
> +                        continue;
> +                }
> +
> +                is_sequential = btrfs_dev_is_sequential(device, physical);
> +                if (is_sequential)
> +                        num_sequential++;
> +                else
> +                        num_conventional++;
> +
> +                if (!is_sequential) {
> +                        alloc_offsets[i] = WP_CONVENTIONAL;
> +                        continue;
> +                }
> +
> +                /*
> +                 * This zone will be used for allocation, so mark this
> +                 * zone non-empty.
> +                 */
> +                btrfs_dev_clear_zone_empty(device, physical);
> +
> +                /*
> +                 * The group is mapped to a sequential zone. Get the zone write
> +                 * pointer to determine the allocation offset within the zone.
> +                 */
> +                WARN_ON(!IS_ALIGNED(physical, fs_info->zone_size));
> +                nofs_flag = memalloc_nofs_save();
> +                ret = btrfs_get_dev_zone(device, physical, &zone);
> +                memalloc_nofs_restore(nofs_flag);
> +                if (ret == -EIO || ret == -EOPNOTSUPP) {
> +                        ret = 0;
> +                        alloc_offsets[i] = WP_MISSING_DEV;
> +                        continue;
> +                } else if (ret) {
> +                        goto out;
> +                }
> +
> +                switch (zone.cond) {
> +                case BLK_ZONE_COND_OFFLINE:
> +                case BLK_ZONE_COND_READONLY:
> +                        btrfs_err(fs_info, "zoned: offline/readonly zone %llu on device %s (devid %llu)",
> +                                  physical >> device->zone_info->zone_size_shift,
> +                                  rcu_str_deref(device->name), device->devid);
> +                        alloc_offsets[i] = WP_MISSING_DEV;
> +                        break;
> +                case BLK_ZONE_COND_EMPTY:
> +                        alloc_offsets[i] = 0;
> +                        break;
> +                case BLK_ZONE_COND_FULL:
> +                        alloc_offsets[i] = fs_info->zone_size;
> +                        break;
> +                default:
> +                        /* Partially used zone */
> +                        alloc_offsets[i] =
> +                                ((zone.wp - zone.start) << SECTOR_SHIFT);
> +                        break;
> +                }
> +        }
> +
> +        if (num_conventional > 0) {
> +                /*
> +                 * Since conventional zones do not have a write pointer, we
> +                 * cannot determine alloc_offset from the pointer
> +                 */
> +                ret = -EINVAL;
> +                goto out;
> +        }
> +
> +        switch (map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK) {
> +        case 0: /* single */
> +                cache->alloc_offset = alloc_offsets[0];
> +                break;
> +        case BTRFS_BLOCK_GROUP_DUP:
> +        case BTRFS_BLOCK_GROUP_RAID1:
> +        case BTRFS_BLOCK_GROUP_RAID0:
> +        case BTRFS_BLOCK_GROUP_RAID10:
> +        case BTRFS_BLOCK_GROUP_RAID5:
> +        case BTRFS_BLOCK_GROUP_RAID6:
> +                /* non-SINGLE profiles are not supported yet */
> +        default:
> +                btrfs_err(fs_info, "zoned: profile %s not supported",
> +                          btrfs_bg_type_to_raid_name(map->type));
> +                ret = -EINVAL;
> +                goto out;
> +        }
> +
> +out:
> +        kfree(alloc_offsets);
> +        free_extent_map(em);
> +
> +        return ret;
> +}
> diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
> index ec2391c52d8b..e3338a2f1be9 100644
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -40,6 +40,7 @@ u64 btrfs_find_allocatable_zones(struct btrfs_device *device, u64 hole_start,
>  int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
>                              u64 length, u64 *bytes);
>  int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
> +int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache);
>  #else /* CONFIG_BLK_DEV_ZONED */
>  static inline int btrfs_get_dev_zone(struct btrfs_device *device, u64 pos,
>                                       struct blk_zone *zone)
> @@ -112,6 +113,12 @@ static inline int btrfs_ensure_empty_zones(struct btrfs_device *device,
>          return 0;
>  }
>  
> +static inline int btrfs_load_block_group_zone_info(
> +                struct btrfs_block_group *cache)
> +{
> +        return 0;
> +}
> +
>  #endif
>  
>  static inline bool btrfs_dev_is_sequential(struct btrfs_device *device, u64 pos)
>

looks good.

Reviewed-by: Anand Jain