Commit e15e8b00cbce ("Remove all "if zeros"") did not remove all "if 0" code blocks. This commit is cleanup for that commit. Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com> --- Build.c | 6 ------ Grow.c | 13 +------------ super1.c | 11 ----------- 3 files changed, 1 insertion(+), 29 deletions(-) diff --git a/Build.c b/Build.c index 1fbf8596a9dd..1be90e418ad1 100644 --- a/Build.c +++ b/Build.c @@ -156,12 +156,6 @@ int Build(struct mddev_ident *ident, struct mddev_dev *devlist, struct shape *s, bitmap_fd = open(s->bitmap_file, O_RDWR); if (bitmap_fd < 0) { int major = BITMAP_MAJOR_HI; -#if 0 - if (s->bitmap_chunk == UnSet) { - pr_err("%s cannot be opened.\n", s->bitmap_file); - goto abort; - } -#endif bitmapsize = s->size >> 9; /* FIXME wrong for RAID10 */ if (CreateBitmap(s->bitmap_file, 1, NULL, s->bitmap_chunk, c->delay, diff --git a/Grow.c b/Grow.c index 5498e54fec11..97782fbc48cf 100644 --- a/Grow.c +++ b/Grow.c @@ -4417,19 +4417,8 @@ static void validate(int afd, int bfd, unsigned long long offset) lseek64(afd, __le64_to_cpu(bsb2.arraystart)*512, 0); if ((unsigned long long)read(afd, abuf, len) != len) fail("read first from array failed"); - if (memcmp(bbuf, abuf, len) != 0) { -#if 0 - int i; - printf("offset=%llu len=%llu\n", - (unsigned long long)__le64_to_cpu(bsb2.arraystart)*512, len); - for (i=0; i<len; i++) - if (bbuf[i] != abuf[i]) { - printf("first diff byte %d\n", i); - break; - } -#endif + if (memcmp(bbuf, abuf, len) != 0) fail("data1 compare failed"); - } } if (bsb2.length2) { unsigned long long len = __le64_to_cpu(bsb2.length2)*512; diff --git a/super1.c b/super1.c index 871d19f0398c..0d059b4321b8 100644 --- a/super1.c +++ b/super1.c @@ -575,17 +575,6 @@ static void examine_super1(struct supertype *st, char *homehost) inconsistent = 1; } } -#if 0 - /* This is confusing too */ - faulty = 0; - for (i = 0; i < __le32_to_cpu(sb->max_dev); i++) { - int role = __le16_to_cpu(sb->dev_roles[i]); - if (role == MD_DISK_ROLE_FAULTY) - faulty++; - } - if (faulty) - printf(" %d failed", faulty); -#endif printf(" ('A' == active, '.' == missing, 'R' == replacing)"); printf("\n"); for (d = 0; d < __le32_to_cpu(sb->max_dev); d++) { -- 2.39.2
Dear Blazej, Am 18.03.24 um 17:25 schrieb Blazej Kucman: > For NVMe devices with Opal support, encryption information, status and > ability are completed based on Opal Level 0 discovery response. What do you mean by “are completed”? > Technical documentation used is given in the implementation. > > Ability in general describes what type of encryption is supported, > Status describes in what state the disk with encryption support is. > The current patch only includes only the implementation of reading One *only* is enough? > encryption information, functions will be used in one of > the next patches. > > Motivation for adding this functionality is to block mixing of disks > in IMSM arrays with encryption enabled and disabled. > The main goal is to not allow stealing data by rebuilding array to not > encrypted drive which can be read elsewhere. I wouldn’t wrap the line, just because a sentence ends. > Value ENA_OTHER from enum encryption_ability will be used > in the next patch. > > Additionally, pr_vrb define is moved from super-intel.c to mdadm.h > so that it can be used globally. Using “Additionally” in a commit message is a good indicator to make it a separate commit. ;-) > Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> > --- > Makefile | 4 +- > drive_encryption.c | 362 +++++++++++++++++++++++++++++++++++++++++++++ > drive_encryption.h | 32 ++++ > mdadm.h | 2 + > super-intel.c | 2 - > 5 files changed, 398 insertions(+), 4 deletions(-) > create mode 100644 drive_encryption.c > create mode 100644 drive_encryption.h > > diff --git a/Makefile b/Makefile > index cbdba49a..7c221a89 100644 > --- a/Makefile > +++ b/Makefile > @@ -170,7 +170,7 @@ OBJS = mdadm.o config.o policy.o mdstat.o ReadMe.o uuid.o util.o maps.o lib.o u > mdopen.o super0.o super1.o super-ddf.o super-intel.o bitmap.o \ > super-mbr.o super-gpt.o \ > restripe.o sysfs.o sha1.o mapfile.o crc32.o sg_io.o msg.o xmalloc.o \ > - platform-intel.o probe_roms.o crc32c.o > + platform-intel.o probe_roms.o crc32c.o drive_encryption.o > > CHECK_OBJS = restripe.o uuid.o sysfs.o maps.o lib.o xmalloc.o dlink.o > > @@ -183,7 +183,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o uuid.o util.o maps.o mdstat.o sysfs.o c > Kill.o sg_io.o dlink.o ReadMe.o super-intel.o \ > super-mbr.o super-gpt.o \ > super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \ > - platform-intel.o probe_roms.o crc32c.o > + platform-intel.o probe_roms.o crc32c.o drive_encryption.o > > MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS)) > > diff --git a/drive_encryption.c b/drive_encryption.c > new file mode 100644 > index 00000000..0fa214a9 > --- /dev/null > +++ b/drive_encryption.c > @@ -0,0 +1,362 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * Read encryption information for Opal and ATA devices. > + * > + * Copyright (C) 2024 Intel Corporation > + * Author: Blazej Kucman <blazej.kucman@intel.com> > + */ > + > +#include "mdadm.h" > + > +#include <asm/types.h> > +#include <linux/nvme_ioctl.h> > +#include "drive_encryption.h" > + > +/* > + * Opal defines > + * TCG Storage Opal SSC 2.01 chapter 3.3.3 > + * NVM ExpressTM Revision 1.4c, chapter 5 > + */ > +#define TCG_SECP_01 (0x01) > +#define TCG_SECP_00 (0x00) > +#define OPAL_DISCOVERY_COMID (0x0001) > +#define OPAL_LOCKING_FEATURE (0x0002) > +#define OPAL_IO_BUFFER_LEN 2048 > +#define OPAL_DISCOVERY_FEATURE_HEADER_LEN (4) > + > +/* > + * NVMe defines > + * NVM ExpressTM Revision 1.4c, chapter 5 > + */ > +#define NVME_SECURITY_RECV (0x82) > +#define NVME_IDENTIFY (0x06) > +#define NVME_IDENTIFY_RESPONSE_LEN 4096 > +#define NVME_OACS_BYTE_POSITION (256) > +#define NVME_IDENTIFY_CONTROLLER_DATA (1) > + > +typedef enum drive_feature_support_status { > + /* Drive feature is supported. */ > + DRIVE_FEAT_SUP_ST = 0, > + /* Drive feature is not supported. */ > + DRIVE_FEAT_NOT_SUP_ST, > + /* Drive feature support check failed. */ > + DRIVE_FEAT_CHECK_FAILED_ST > +} drive_feat_sup_st; > + > +/* TCG Storage Opal SSC 2.01 chapter 3.1.1.3 */ > +typedef struct opal_locking_feature { > + /* feature header */ > + __u16 feature_code; > + __u8 reserved : 4; > + __u8 version : 4; > + __u8 description_length; > + /* feature description */ > + __u8 locking_supported : 1; > + __u8 locking_enabled : 1; > + __u8 locked : 1; > + __u8 media_encryption : 1; > + __u8 mbr_enabled : 1; > + __u8 mbr_done : 1; > + __u8 mbr_shadowing_not_supported : 1; > + __u8 hw_reset_for_dor_supported : 1; > + __u8 reserved1[11]; > +} __attribute__((__packed__)) opal_locking_feature_t; > + > +/* TCG Storage Opal SSC 2.01 chapter 3.1.1.1 */ > +typedef struct opal_level0_header { > + __u32 length; > + __u32 version; > + __u64 reserved; > + __u8 vendor_specific[32]; > +} opal_level0_header_t; > + > +/** > + * NVM ExpressTM Revision 1.4c, Figure 249 > + * Structure specifies only OACS filed, which is needed in the current use case. > + */ > +typedef struct nvme_identify_ctrl { > + __u8 reserved[255]; > + __u16 oacs; > + __u8 reserved2[3839]; > +} nvme_identify_ctrl_t; > + > +/* SCSI Primary Commands - 4 (SPC-4), Table 512 */ > +typedef struct supported_security_protocols { > + __u8 reserved[6]; > + __u16 list_length; > + __u8 list[504]; > +} supported_security_protocols_t; > + > +/** > + * get_opal_locking_feature_description() - get opal locking feature description. > + * @response: response from Opal Discovery Level 0. > + * > + * Based on the documentation TCG Storage Opal SSC 2.01 chapter 3.1.1, > + * a Locking feature is searched for in Opal Level 0 Discovery response. > + * > + * Return: if locking feature found pointer to struct %opal_locking_feature_t, NULL otherwise. is found? > + */ > +static opal_locking_feature_t *get_opal_locking_feature_description(__u8 *response) > +{ > + opal_level0_header_t *response_header = (opal_level0_header_t *)response; > + int features_length = __be32_to_cpu(response_header->length); > + int current_position = sizeof(*response_header); > + > + while (current_position < features_length) { > + opal_locking_feature_t *feature; > + > + feature = (opal_locking_feature_t *)(response + current_position); > + > + if (__be16_to_cpu(feature->feature_code) == OPAL_LOCKING_FEATURE) > + return feature; > + > + current_position += feature->description_length + OPAL_DISCOVERY_FEATURE_HEADER_LEN; > + } > + > + return NULL; > +} > + > +/** > + * nvme_security_recv_ioctl() - nvme security receive ioctl. > + * @disk_fd: a disk file descriptor. > + * @sec_protocol: security protocol. > + * @comm_id: command id. > + * @response_buffer: response buffer to fill out. > + * @buf_size: response buffer size. > + * @verbose: verbose flag. > + * > + * Based on the documentations TCG Storage Opal SSC 2.01 chapter 3.3.3 and > + * NVM ExpressTM Revision 1.4c, chapter 5.25, > + * read security receive command via ioctl(). > + * On success, @response_buffer is completed. > + * > + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. > + */ > +static mdadm_status_t > +nvme_security_recv_ioctl(int disk_fd, __u8 sec_protocol, __u16 comm_id, void *response_buffer, > + size_t buf_size, const int verbose) > +{ > + struct nvme_admin_cmd nvme_cmd = {0}; > + int status; > + > + nvme_cmd.opcode = NVME_SECURITY_RECV; > + nvme_cmd.cdw10 = sec_protocol << 24 | comm_id << 8; > + nvme_cmd.cdw11 = buf_size; > + nvme_cmd.data_len = buf_size; > + nvme_cmd.addr = (__u64)response_buffer; > + > + status = ioctl(disk_fd, NVME_IOCTL_ADMIN_CMD, &nvme_cmd); > + if (status != 0) { > + pr_vrb("Failed to read NVMe security receive ioctl() for device /dev/%s, status: %d\n", > + fd2kname(disk_fd), status); > + return MDADM_STATUS_ERROR; > + } > + > + return MDADM_STATUS_SUCCESS; > +} > + > +/** > + * nvme_identify_ioctl() - NVMe identify ioctl. > + * @disk_fd: a disk file descriptor. > + * @response_buffer: response buffer to fill out. > + * @buf_size: response buffer size. > + * @verbose: verbose flag. > + * > + * Based on the documentations TCG Storage Opal SSC 2.01 chapter 3.3.3 and > + * NVM ExpressTM Revision 1.4c, chapter 5.25, > + * read NVMe identify via ioctl(). > + * On success, @response_buffer will be completed. > + * > + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. > + */ > +static mdadm_status_t > +nvme_identify_ioctl(int disk_fd, void *response_buffer, size_t buf_size, const int verbose) > +{ > + struct nvme_admin_cmd nvme_cmd = {0}; > + int status; > + > + nvme_cmd.opcode = NVME_IDENTIFY; > + nvme_cmd.cdw10 = NVME_IDENTIFY_CONTROLLER_DATA; > + nvme_cmd.data_len = buf_size; > + nvme_cmd.addr = (__u64)response_buffer; > + > + status = ioctl(disk_fd, NVME_IOCTL_ADMIN_CMD, &nvme_cmd); > + if (status != 0) { > + pr_vrb("Failed to read NVMe identify ioctl() for device /dev/%s, status: %d\n", > + fd2kname(disk_fd), status); > + return MDADM_STATUS_ERROR; > + } > + > + return MDADM_STATUS_SUCCESS; > +} > + > +/** > + * is_sec_prot_01h_supported() - check if security protocol 01h supported. > + * @security_protocols: struct with response from disk (NVMe, SATA) describing supported > + * security protocols. > + * > + * Return: true if TCG_SECP_01 found, false otherwise. > + */ > +static bool is_sec_prot_01h_supported(supported_security_protocols_t *security_protocols) > +{ > + int list_length = be16toh(security_protocols->list_length); > + int index; > + > + for (index = 0 ; index < list_length; index++) { > + if (security_protocols->list[index] == TCG_SECP_01) > + return true; > + } > + > + return false; > +} > + > +/** > + * is_sec_prot_01h_supported_nvme() - check if security protocol 01h supported for given NVMe disk. > + * @disk_fd: a disk file descriptor. > + * @verbose: verbose flag. > + * > + * Return: %DRIVE_FEAT_SUP_ST if TCG_SECP_01 supported, %DRIVE_FEAT_NOT_SUP_ST if not supported, > + * %DRIVE_FEAT_CHECK_FAILED_ST if failed to check. > + */ > +static drive_feat_sup_st is_sec_prot_01h_supported_nvme(int disk_fd, const int verbose) > +{ > + supported_security_protocols_t security_protocols = {0}; > + > + /* security_protocol: TCG_SECP_00, comm_id: not applicable */ > + if (nvme_security_recv_ioctl(disk_fd, TCG_SECP_00, 0x0, &security_protocols, > + sizeof(security_protocols), verbose)) > + return DRIVE_FEAT_CHECK_FAILED_ST; > + > + if (is_sec_prot_01h_supported(&security_protocols)) > + return DRIVE_FEAT_SUP_ST; > + > + return DRIVE_FEAT_NOT_SUP_ST; > +} > + > +/** > + * is_nvme_sec_send_recv_supported() - check if Security Send and Security Receive is supported. > + * @disk_fd: a disk file descriptor. > + * @verbose: verbose flag. > + * > + * Check if "Optional Admin Command Support" bit 0 is set in NVMe identify. > + * Bit 0 set to 1 means controller supports the Security Send and Security Receive commands. > + * > + * Return: %DRIVE_FEAT_SUP_ST if security send/receive supported, > + * %DRIVE_FEAT_NOT_SUP_ST if not supported, %DRIVE_FEAT_CHECK_FAILED_ST if check failed. > + */ > +static drive_feat_sup_st is_nvme_sec_send_recv_supported(int disk_fd, const int verbose) > +{ > + nvme_identify_ctrl_t nvme_identify = {0}; > + int status = 0; > + > + status = nvme_identify_ioctl(disk_fd, &nvme_identify, sizeof(nvme_identify), verbose); > + if (status) > + return DRIVE_FEAT_CHECK_FAILED_ST; > + > + if ((__le16_to_cpu(nvme_identify.oacs) & 0x1) == 0x1) > + return DRIVE_FEAT_SUP_ST; > + > + return DRIVE_FEAT_NOT_SUP_ST; > +} > + > +/** > + * get_opal_encryption_locking_informations() - get Opal Locking information. There is no plural for *information*. > + * @buffer: buffer with Opal Level 0 Discovery response. > + * @information: struct to fill out, describing encryption status of disk. > + * > + * If Locking feature frame is in response from Opal Level 0 discovery, &encryption_information_t > + * structure is completed with status and ability otherwise the status is set to &None. > + * For possible encryption statuses and abilities, > + * please refer to enums &encryption_status and &encryption_ability. > + * > + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. > + */ > +static mdadm_status_t get_opal_encryption_informations(__u8 *buffer, > + encryption_information_t *information) > +{ > + opal_locking_feature_t *opal_locking_feature = > + get_opal_locking_feature_description(buffer); > + > + if (!opal_locking_feature) > + return MDADM_STATUS_ERROR; > + > + if (opal_locking_feature->locking_supported == 1) { > + information->ability = ENA_SED; > + > + if (opal_locking_feature->locking_enabled == 0) > + information->status = ENS_UNENCRYPTED; > + else if (opal_locking_feature->locked == 1) > + information->status = ENS_LOCKED; > + else > + information->status = ENS_UNLOCKED; > + } else { > + information->ability = ENA_NONE; > + information->status = ENS_UNENCRYPTED; > + } > + > + return MDADM_STATUS_SUCCESS; > +} > + > +/** > + * get_nvme_opal_encryption_informations() - get NVMe Opal encryption information. > + * @disk_fd: a disk file descriptor. > + * @information: struct to fill out, describing encryption status of disk. > + * @verbose: verbose flag. > + * > + * In case the disk supports Opal Level 0 discovery, &encryption_information_t structure > + * is completed with status and ability based on ioctl response, > + * otherwise the ability is set to %ENA_NONE and &status to %ENS_UNENCRYPTED. > + * As the current use case does not need the knowledge of Opal support, if there is no support, > + * %MDADM_STATUS_SUCCESS will be returned, with the values described above. > + * For possible encryption statuses and abilities, > + * please refer to enums &encryption_status and &encryption_ability. > + * > + * %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. > + */ > +mdadm_status_t > +get_nvme_opal_encryption_informations(int disk_fd, encryption_information_t *information, > + const int verbose) > +{ > + __u8 buffer[OPAL_IO_BUFFER_LEN]; > + int sec_send_recv_supported = 0; > + int protocol_01h_supported = 0; > + mdadm_status_t status; > + > + information->ability = ENA_NONE; > + information->status = ENS_UNENCRYPTED; > + > + sec_send_recv_supported = is_nvme_sec_send_recv_supported(disk_fd, verbose); > + if (sec_send_recv_supported == DRIVE_FEAT_CHECK_FAILED_ST) > + return MDADM_STATUS_ERROR; > + > + /* Opal not supported */ > + if (sec_send_recv_supported == DRIVE_FEAT_NOT_SUP_ST) > + return MDADM_STATUS_SUCCESS; > + > + /** > + * sec_send_recv_supported determine that it should be possible to read > + * supported sec protocols > + */ > + protocol_01h_supported = is_sec_prot_01h_supported_nvme(disk_fd, verbose); > + if (protocol_01h_supported == DRIVE_FEAT_CHECK_FAILED_ST) > + return MDADM_STATUS_ERROR; > + > + /* Opal not supported */ > + if (sec_send_recv_supported == DRIVE_FEAT_SUP_ST && > + protocol_01h_supported == DRIVE_FEAT_NOT_SUP_ST) > + return MDADM_STATUS_SUCCESS; > + > + if (nvme_security_recv_ioctl(disk_fd, TCG_SECP_01, OPAL_DISCOVERY_COMID, (void *)&buffer, > + OPAL_IO_BUFFER_LEN, verbose)) > + return MDADM_STATUS_ERROR; > + > + status = get_opal_encryption_informations((__u8 *)&buffer, information); > + if (status) > + pr_vrb("Locking feature description not found in Level 0 discovery response. Device /dev/%s.\n", > + fd2kname(disk_fd)); > + > + if (information->ability == ENA_NONE) > + assert(information->status == ENS_UNENCRYPTED); > + > + return status; > +} > diff --git a/drive_encryption.h b/drive_encryption.h > new file mode 100644 > index 00000000..27f1de55 > --- /dev/null > +++ b/drive_encryption.h > @@ -0,0 +1,32 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * Read encryption information for Opal and ATA devices. > + * > + * Copyright (C) 2024 Intel Corporation > + * Author: Blazej Kucman <blazej.kucman@intel.com> > + */ > + > +typedef enum encryption_status { > + /* The drive is not currently encrypted. */ > + ENS_UNENCRYPTED = 0, I’d user longer names like `ENC_STATUS_`. > + /* The drive is encrypted and the data is not accessible. */ > + ENS_LOCKED, > + /* The drive is encrypted but the data is accessible in unencrypted form. */ > + ENS_UNLOCKED > +} encryption_status_t; > + > +typedef enum encryption_ability { > + ENA_NONE = 0, > + ENA_OTHER, > + /* Self encrypted drive */ > + ENA_SED > +} encryption_ability_t; > + > +typedef struct encryption_information { > + encryption_ability_t ability; > + encryption_status_t status; > +} encryption_information_t; > + > +mdadm_status_t > +get_nvme_opal_encryption_informations(int disk_fd, struct encryption_information *information, > + const int verbose); > diff --git a/mdadm.h b/mdadm.h > index 3fedca48..141adbd7 100644 > --- a/mdadm.h > +++ b/mdadm.h > @@ -1912,6 +1912,8 @@ static inline int xasprintf(char **strp, const char *fmt, ...) { > > #define pr_info(fmt, args...) printf("%s: "fmt, Name, ##args) > > +#define pr_vrb(fmt, arg...) ((void)(verbose && pr_err(fmt, ##arg))) > + > void *xmalloc(size_t len); > void *xrealloc(void *ptr, size_t len); > void *xcalloc(size_t num, size_t size); > diff --git a/super-intel.c b/super-intel.c > index 77140455..806b6248 100644 > --- a/super-intel.c > +++ b/super-intel.c > @@ -393,8 +393,6 @@ struct md_list { > struct md_list *next; > }; > > -#define pr_vrb(fmt, arg...) (void) (verbose && pr_err(fmt, ##arg)) > - > static __u8 migr_type(struct imsm_dev *dev) > { > if (dev->vol.migr_type == MIGR_VERIFY && Kind regards, Paul
Dear Blazej, Thank you for your patch. Am 18.03.24 um 17:25 schrieb Blazej Kucman: > Add ENCRYPTION_NO_VERIFY config key and allow to disable > checking encryption status for given type of drives. > > The key is introduced because of SATA Opal disks for which > TPM commands must be enabled in libata kernel module, > (libata.allow_tpm=1), otherwise it is impossible to verify > encryption status. TPM commands are disabled by default. > > Currently the key only supports the "sata_opal" value, > if necessary, the functionality is ready to support more types of disks. Please use 75 characters per line consistently. Maybe document how to test this feature, preferebly with QEMU. > Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> > --- > config.c | 25 ++++++++++++++++++++++++- > drive_encryption.c | 16 ++++++++++++---- > mdadm.conf.5.in | 13 +++++++++++++ > mdadm.h | 1 + > 4 files changed, 50 insertions(+), 5 deletions(-) > > diff --git a/config.c b/config.c > index 44f7dd2f..b46d71cb 100644 > --- a/config.c > +++ b/config.c > @@ -81,7 +81,7 @@ char DefaultAltConfDir[] = CONFFILE2 ".d"; > > enum linetype { Devices, Array, Mailaddr, Mailfrom, Program, CreateDev, > Homehost, HomeCluster, AutoMode, Policy, PartPolicy, Sysfs, > - MonitorDelay, LTEnd }; > + MonitorDelay, EncryptionNoVerify, LTEnd }; > char *keywords[] = { > [Devices] = "devices", > [Array] = "array", > @@ -96,6 +96,7 @@ char *keywords[] = { > [PartPolicy]="part-policy", > [Sysfs] = "sysfs", > [MonitorDelay] = "monitordelay", > + [EncryptionNoVerify] = "ENCRYPTION_NO_VERIFY", > [LTEnd] = NULL > }; > > @@ -729,6 +730,19 @@ void monitordelayline(char *line) > } > } > > +static bool sata_opal_encryption_no_verify; > +void encryption_no_verify_line(char *line) > +{ > + char *word; > + > + for (word = dl_next(line); word != line; word = dl_next(word)) { > + if (strcasecmp(word, "sata_opal") == 0) > + sata_opal_encryption_no_verify = true; > + else > + pr_err("unrecognised word on ENCRYPTION_NO_VERIFY line: %s\n", word); > + } > +} > + > char auto_yes[] = "yes"; > char auto_no[] = "no"; > char auto_homehost[] = "homehost"; > @@ -913,6 +927,9 @@ void conf_file(FILE *f) > case MonitorDelay: > monitordelayline(line); > break; > + case EncryptionNoVerify: > + encryption_no_verify_line(line); > + break; > default: > pr_err("Unknown keyword %s\n", line); > } > @@ -1075,6 +1092,12 @@ int conf_get_monitor_delay(void) > return monitor_delay; > } > > +bool conf_get_sata_opal_encryption_no_verify(void) > +{ > + load_conffile(); > + return sata_opal_encryption_no_verify; > +} > + > struct createinfo *conf_get_create_info(void) > { > load_conffile(); > diff --git a/drive_encryption.c b/drive_encryption.c > index 8421e374..5b9cdc00 100644 > --- a/drive_encryption.c > +++ b/drive_encryption.c > @@ -656,10 +656,18 @@ get_ata_encryption_information(int disk_fd, struct encryption_information *infor > if (status == MDADM_STATUS_ERROR) > return MDADM_STATUS_ERROR; > > - if (is_ata_trusted_computing_supported(buffer_identify) && > - !sysfs_is_libata_allow_tpm_enabled(verbose)) { > - pr_vrb("For SATA with Trusted Computing support, required libata.tpm_enabled=1.\n"); > - return MDADM_STATUS_ERROR; > + /* Possible OPAL support, further checks require tpm_enabled.*/ > + if (is_ata_trusted_computing_supported(buffer_identify)) { > + /* OPAL SATA encryption checking disabled. */ > + if (conf_get_sata_opal_encryption_no_verify()) > + return MDADM_STATUS_SUCCESS; > + > + if (!sysfs_is_libata_allow_tpm_enabled(verbose)) { > + pr_vrb("Detected SATA drive /dev/%s with Trusted Computing support.\n", > + fd2kname(disk_fd)); > + pr_vrb("Cannot verify encryption state. Required libata.tpm_enabled=1.\n"); Require*s*? > + return MDADM_STATUS_ERROR; > + } > } > > ata_opal_status = is_ata_opal(disk_fd, buffer_identify, verbose); > diff --git a/mdadm.conf.5.in b/mdadm.conf.5.in > index 787e51e9..afb0a296 100644 > --- a/mdadm.conf.5.in > +++ b/mdadm.conf.5.in > @@ -636,6 +636,17 @@ If multiple > .B MINITORDELAY > lines are provided, only first non-zero value is considered. > > +.TP > +.B ENCRYPTION_NO_VERIFY > +The > +.B ENCRYPTION_NO_VERIFY > +disables encryption verification for devices with particular encryption support detected. > +Currently, only verification of SATA OPAL encryption can be disabled. > +It does not disable ATA security encryption verification. > +Available parameter > +.I "sata_opal". > + > + > .SH FILES > > .SS {CONFFILE} > @@ -744,6 +755,8 @@ SYSFS uuid=bead5eb6:31c17a27:da120ba2:7dfda40d group_thread_cnt=4 > sync_speed_max=1000000 > .br > MONITORDELAY 60 > +.br > +ENCRYPTION_NO_VERIFY sata_opal > > .SH SEE ALSO > .BR mdadm (8), > diff --git a/mdadm.h b/mdadm.h > index f64bb783..9d98693b 100644 > --- a/mdadm.h > +++ b/mdadm.h > @@ -1673,6 +1673,7 @@ extern char *conf_get_program(void); > extern char *conf_get_homehost(int *require_homehostp); > extern char *conf_get_homecluster(void); > extern int conf_get_monitor_delay(void); > +extern bool conf_get_sata_opal_encryption_no_verify(void); > extern char *conf_line(FILE *file); > extern char *conf_word(FILE *file, int allow_key); > extern void print_quoted(char *str); The diff looks good. Kind regards, Paul
If sc is not initialized, there is possibility that sc.pols is not zeroed and it causes segfault. Add missing initialization. Add missing dev_policy_free() in two places. Fixes: f656201188d7 ("mdadm: drop get_required_spare_criteria()") Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> --- Incremental.c | 1 + super-intel.c | 9 +++++++-- 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/Incremental.c b/Incremental.c index 958ba9ba7851..83db071214ee 100644 --- a/Incremental.c +++ b/Incremental.c @@ -874,6 +874,7 @@ mdadm_status_t incremental_external_test_spare_criteria(struct supertype *st, ch rv = MDADM_STATUS_SUCCESS; out: + dev_policy_free(sc.pols); dup->ss->free_super(dup); free(dup); diff --git a/super-intel.c b/super-intel.c index 7714045575b2..32eceb155886 100644 --- a/super-intel.c +++ b/super-intel.c @@ -11518,10 +11518,15 @@ static int imsm_reshape_is_allowed_on_container(struct supertype *st, */ static struct mdinfo *get_spares_for_grow(struct supertype *st) { - struct spare_criteria sc; + struct spare_criteria sc = {0}; + struct mdinfo *spares; get_spare_criteria_imsm(st, NULL, &sc); - return container_choose_spares(st, &sc, NULL, NULL, NULL, 0); + spares = container_choose_spares(st, &sc, NULL, NULL, NULL, 0); + + dev_policy_free(sc.pols); + + return spares; } /****************************************************************************** -- 2.35.3
IMSM cares about drive encryption state. It is not allowed to mix disks with different encryption state within one md device. This policy will verify that attempt to use disks with different encryption states will fail. Verification is performed for devices NVMe/SATA Opal and SATA. There is one exception, Opal SATA drives encryption is not checked when ENCRYPTION_NO_VERIFY key with "sata_opal" value is set in conf, for this reason such drives are treated as with encryption disabled. Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> --- super-intel.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 73 insertions(+) diff --git a/super-intel.c b/super-intel.c index c5eff352..54d2103a 100644 --- a/super-intel.c +++ b/super-intel.c @@ -11291,6 +11291,78 @@ test_and_add_drive_controller_policy_imsm(const char * const type, dev_policy_t return MDADM_STATUS_ERROR; } +/** + * test_and_add_drive_encryption_policy_imsm() - add disk encryption to policies list. + * @type: policy type to search in the list. + * @pols: list of currently recorded policies. + * @disk_fd: file descriptor of the device to check. + * @hba: The hba to which the drive is attached, could be NULL if verification is disabled. + * @verbose: verbose flag. + * + * IMSM cares about drive encryption state. It is not allowed to mix disks with different + * encryption state within one md device. + * If there is no encryption policy on pols we are free to add first one. + * If there is a policy then, new must be the same. + */ +static mdadm_status_t +test_and_add_drive_encryption_policy_imsm(const char * const type, dev_policy_t **pols, int disk_fd, + struct sys_dev *hba, const int verbose) +{ + struct dev_policy *expected_policy = pol_find(*pols, (char *)type); + struct encryption_information information = {0}; + char *encryption_state = "Unknown"; + int status = MDADM_STATUS_SUCCESS; + bool encryption_checked = true; + char devname[PATH_MAX]; + + if (!hba) + goto check_policy; + + switch (hba->type) { + case SYS_DEV_NVME: + case SYS_DEV_VMD: + status = get_nvme_opal_encryption_informations(disk_fd, &information, verbose); + break; + case SYS_DEV_SATA: + case SYS_DEV_SATA_VMD: + status = get_ata_encryption_information(disk_fd, &information, verbose); + break; + default: + encryption_checked = false; + } + + if (status) { + fd2devname(disk_fd, devname); + pr_vrb("Failed to read encryption information of device %s\n", devname); + return MDADM_STATUS_ERROR; + } + + if (encryption_checked) { + if (information.status == ENS_LOCKED) { + fd2devname(disk_fd, devname); + pr_vrb("Device %s is in Locked state, cannot use. Aborting.\n", devname); + return MDADM_STATUS_ERROR; + } + encryption_state = (char *)get_encryption_status_string(information.status); + } + +check_policy: + if (expected_policy) { + if (strcmp(expected_policy->value, encryption_state) == 0) + return MDADM_STATUS_SUCCESS; + + fd2devname(disk_fd, devname); + pr_vrb("Encryption status \"%s\" detected for disk %s, but \"%s\" status was detected eariler.\n", + encryption_state, devname, expected_policy->value); + pr_vrb("Disks with different encryption status cannot be used.\n"); + return MDADM_STATUS_ERROR; + } + + pol_add(pols, (char *)type, encryption_state, "imsm"); + + return MDADM_STATUS_SUCCESS; +} + struct imsm_drive_policy { char *type; mdadm_status_t (*test_and_add_drive_policy)(const char * const type, @@ -11300,6 +11372,7 @@ struct imsm_drive_policy { struct imsm_drive_policy imsm_policies[] = { {"controller", test_and_add_drive_controller_policy_imsm}, + {"encryption", test_and_add_drive_encryption_policy_imsm} }; mdadm_status_t test_and_add_drive_policies_imsm(struct dev_policy **pols, int disk_fd, -- 2.35.3
Print SATA/NVMe disk encryption information in detail-platform. Encryption Ability and Status will be printed for each disk. There is one exception, Opal SATA drives encryption is not checked when ENCRYPTION_NO_VERIFY key with "sata_opal" value is set in conf, for this reason such drives are treated as with encryption disabled. Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> --- drive_encryption.c | 36 ++++++++++++++++++++++++++++++++++++ drive_encryption.h | 2 ++ mdadm.conf.5.in | 3 +++ super-intel.c | 42 ++++++++++++++++++++++++++++++++++++++---- 4 files changed, 79 insertions(+), 4 deletions(-) diff --git a/drive_encryption.c b/drive_encryption.c index 5b9cdc00..8ead9062 100644 --- a/drive_encryption.c +++ b/drive_encryption.c @@ -141,6 +141,42 @@ typedef struct ata_trusted_computing { __u16 var2 : 1; } __attribute__((__packed__)) ata_trusted_computing_t; +mapping_t encryption_ability_map[] = { + { "None", ENA_NONE }, + { "Other", ENA_OTHER }, + { "SED", ENA_SED }, + { NULL, UnSet } +}; + +mapping_t encryption_status_map[] = { + { "Unencrypted", ENS_UNENCRYPTED }, + { "Locked", ENS_LOCKED }, + { "Unlocked", ENS_UNLOCKED }, + { NULL, UnSet } +}; + +/** + * get_encryption_ability_string() - get encryption ability name string. + * @ability: encryption ability enum. + * + * Return: encryption ability string. + */ +const char *get_encryption_ability_string(enum encryption_ability ability) +{ + return map_num_s(encryption_ability_map, ability); +} + +/** + * get_encryption_status_string() - get encryption status name string. + * @ability: encryption status enum. + * + * Return: encryption status string. + */ +const char *get_encryption_status_string(enum encryption_status status) +{ + return map_num_s(encryption_status_map, status); +} + /** * get_opal_locking_feature_description() - get opal locking feature description. * @response: response from Opal Discovery Level 0. diff --git a/drive_encryption.h b/drive_encryption.h index 01405e1e..3abcef70 100644 --- a/drive_encryption.h +++ b/drive_encryption.h @@ -33,3 +33,5 @@ get_nvme_opal_encryption_informations(int disk_fd, struct encryption_information mdadm_status_t get_ata_encryption_information(int disk_fd, struct encryption_information *information, const int verbose); +const char *get_encryption_ability_string(enum encryption_ability ability); +const char *get_encryption_status_string(enum encryption_status status); diff --git a/mdadm.conf.5.in b/mdadm.conf.5.in index afb0a296..14302a91 100644 --- a/mdadm.conf.5.in +++ b/mdadm.conf.5.in @@ -643,6 +643,9 @@ The disables encryption verification for devices with particular encryption support detected. Currently, only verification of SATA OPAL encryption can be disabled. It does not disable ATA security encryption verification. +Currently effective only for +.I IMSM +metadata. Available parameter .I "sata_opal". diff --git a/super-intel.c b/super-intel.c index 806b6248..c5eff352 100644 --- a/super-intel.c +++ b/super-intel.c @@ -27,6 +27,7 @@ #include <scsi/sg.h> #include <ctype.h> #include <dirent.h> +#include "drive_encryption.h" /* MPB == Metadata Parameter Block */ #define MPB_SIGNATURE "Intel Raid ISM Cfg Sig. " @@ -2349,12 +2350,41 @@ static int imsm_read_serial(int fd, char *devname, __u8 *serial, size_t serial_buf_len); static void fd2devname(int fd, char *name); -static int ahci_enumerate_ports(const char *hba_path, int port_count, int host_base, int verbose) +void print_encrytpion_information(int disk_fd, enum sys_dev_type hba_type) +{ + struct encryption_information information = {0}; + mdadm_status_t status = MDADM_STATUS_SUCCESS; + const char *indent = " "; + + switch (hba_type) { + case SYS_DEV_VMD: + case SYS_DEV_NVME: + status = get_nvme_opal_encryption_informations(disk_fd, &information, 1); + break; + case SYS_DEV_SATA: + case SYS_DEV_SATA_VMD: + status = get_ata_encryption_information(disk_fd, &information, 1); + break; + default: + return; + } + + if (status) { + pr_err("Failed to get drive encrytpion information.\n"); + return; + } + + printf("%sEncryption(Ability|Status): %s|%s\n", indent, + get_encryption_ability_string(information.ability), + get_encryption_status_string(information.status)); +} + +static int ahci_enumerate_ports(struct sys_dev *hba, int port_count, int host_base, int verbose) { /* dump an unsorted list of devices attached to AHCI Intel storage * controller, as well as non-connected ports */ - int hba_len = strlen(hba_path) + 1; + int hba_len = strlen(hba->path) + 1; struct dirent *ent; DIR *dir; char *path = NULL; @@ -2390,7 +2420,7 @@ static int ahci_enumerate_ports(const char *hba_path, int port_count, int host_b path = devt_to_devpath(makedev(major, minor), 1, NULL); if (!path) continue; - if (!path_attached_to_hba(path, hba_path)) { + if (!path_attached_to_hba(path, hba->path)) { free(path); path = NULL; continue; @@ -2493,6 +2523,8 @@ static int ahci_enumerate_ports(const char *hba_path, int port_count, int host_b printf(" (%s)\n", buf); else printf(" ()\n"); + + print_encrytpion_information(fd, hba->type); close(fd); } free(path); @@ -2557,6 +2589,8 @@ static int print_nvme_info(struct sys_dev *hba) else printf("()\n"); + print_encrytpion_information(fd, hba->type); + skip: close_fd(&fd); } @@ -2812,7 +2846,7 @@ static int detail_platform_imsm(int verbose, int enumerate_only, char *controlle hba->path, get_sys_dev_type(hba->type)); if (hba->type == SYS_DEV_SATA || hba->type == SYS_DEV_SATA_VMD) { host_base = ahci_get_port_count(hba->path, &port_count); - if (ahci_enumerate_ports(hba->path, port_count, host_base, verbose)) { + if (ahci_enumerate_ports(hba, port_count, host_base, verbose)) { if (verbose > 0) pr_err("failed to enumerate ports on %s controller at %s.\n", get_sys_dev_type(hba->type), hba->pci_id); -- 2.35.3
Functionality reads information about SATA disk encryption. Technical documentation used is given in the implementation. The implementation is able to recognized two encryption standards for SATA drives OPAL and ATA security. If the SATA drive supports OPAL, encryption status and ability are completed based on Opal Level 0 discovery response, for ATA security, based on ATA identify response. If SATA supports OPAL, ability is set to "SED", for ATA security to "Other". SED(Self-Encrypting Drive) is commonly used to describe drive which using OPAL or Enterprise standards developed by Trusted Computing Group. Ability "Other" is used for ATA security because we rely only on information from ATA identify which describe the overall state of encryption. It is allowed to mix disks with different encryption ability such as "SED" and "Other" and it is not security gap. Motivation for adding this functionality is to block mixing of disks in IMSM arrays with encryption enabled and disabled. The main goal is to not allow stealing data by rebuilding array to not encrypted drive which can be read elsewhere. For SATA Opal drives, libata allow_tmp parameter enabled is required, which is necessary for Opal Security commands to work, therefore, if the parameter is not enabled, SATA Opal disk cannot be used in case the encryption will be checked by metadata. Implemented functions will be used in one of the next patches. In one of the next patches, a flag will be added to enable disabling SATA Opal encryption checking due to allow_tpm kernel setting dependency. Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> --- drive_encryption.c | 318 +++++++++++++++++++++++++++++++++++++++++++++ drive_encryption.h | 3 + mdadm.h | 1 + sysfs.c | 29 +++++ 4 files changed, 351 insertions(+) diff --git a/drive_encryption.c b/drive_encryption.c index 0fa214a9..8421e374 100644 --- a/drive_encryption.c +++ b/drive_encryption.c @@ -10,8 +10,12 @@ #include <asm/types.h> #include <linux/nvme_ioctl.h> +#include <scsi/sg.h> +#include <scsi/scsi.h> #include "drive_encryption.h" +#define DEFAULT_SECTOR_SIZE (512) + /* * Opal defines * TCG Storage Opal SSC 2.01 chapter 3.3.3 @@ -34,6 +38,35 @@ #define NVME_OACS_BYTE_POSITION (256) #define NVME_IDENTIFY_CONTROLLER_DATA (1) +/* + * ATA defines + * ATA/ATAPI Command Set ATA8-ACS + * SCSI / ATA Translation - 3 (SAT-3) + * SCSI Primary Commands - 4 (SPC-4) + * AT Attachment-8 - ATA Serial Transport (ATA8-AST) + * ATA Command Pass-Through + */ +#define ATA_IDENTIFY (0xec) +#define ATA_TRUSTED_RECEIVE (0x5c) +#define ATA_SECURITY_WORD_POSITION (128) +#define HDIO_DRIVE_CMD (0x031f) +#define ATA_TRUSTED_COMPUTING_POS (48) +#define ATA_PASS_THROUGH_12 (0xa1) +#define ATA_IDENTIFY_RESPONSE_LEN (512) +#define ATA_PIO_DATA_IN (4) +#define SG_CHECK_CONDITION (0x02) +#define ATA_STATUS_RETURN_DESCRIPTOR (0x09) +#define ATA_PT_INFORMATION_AVAILABLE_ASCQ (0x1d) +#define ATA_PT_INFORMATION_AVAILABLE_ASC (0x00) +#define ATA_INQUIRY_LENGTH (0x0c) +#define SG_INTERFACE_ID 'S' +#define SG_IO_TIMEOUT (60000) +#define SG_SENSE_SIZE (32) +#define SENSE_DATA_CURRENT_FIXED (0x70) +#define SENSE_DATA_CURRENT_DESC (0x72) +#define SENSE_CURRENT_RES_DESC_POS (8) +#define SG_DRIVER_SENSE (0x08) + typedef enum drive_feature_support_status { /* Drive feature is supported. */ DRIVE_FEAT_SUP_ST = 0, @@ -87,6 +120,27 @@ typedef struct supported_security_protocols { __u8 list[504]; } supported_security_protocols_t; +/* ATA/ATAPI Command Set - 3 (ACS-3), Table 45 */ +typedef struct ata_security_status { + __u16 security_supported : 1; + __u16 security_enabled : 1; + __u16 security_locked : 1; + __u16 security_frozen : 1; + __u16 security_count_expired : 1; + __u16 enhanced_security_erase_supported : 1; + __u16 reserved1 : 2; + __u16 security_level : 1; + __u16 reserved2 : 7; +} __attribute__((__packed__)) ata_security_status_t; + +/* ATA/ATAPI Command Set - 3 (ACS-3), Table 45 */ +typedef struct ata_trusted_computing { + __u16 tc_feature :1; + __u16 reserved : 13; + __u16 var1 : 1; + __u16 var2 : 1; +} __attribute__((__packed__)) ata_trusted_computing_t; + /** * get_opal_locking_feature_description() - get opal locking feature description. * @response: response from Opal Discovery Level 0. @@ -360,3 +414,267 @@ get_nvme_opal_encryption_informations(int disk_fd, encryption_information_t *inf return status; } + +/** + * ata_pass_through12_ioctl() - ata pass through12 ioctl. + * @disk_fd: a disk file descriptor. + * @ata_command: ata command. + * @sec_protocol: security protocol. + * @comm_id: additional command id. + * @response_buffer: response buffer to fill out. + * @buf_size: response buffer size. + * @verbose: verbose flag. + * + * Based on the documentations ATA Command Pass-Through, chapter 13.2.2 and + * ATA Translation - 3 (SAT-3), send read ata pass through 12 command via ioctl(). + * On success, @response_buffer will be completed. + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR on fail. + */ +static mdadm_status_t +ata_pass_through12_ioctl(int disk_fd, __u8 ata_command, __u8 sec_protocol, __u16 comm_id, + void *response_buffer, size_t buf_size, const int verbose) +{ + __u8 cdb[ATA_INQUIRY_LENGTH] = {0}; + __u8 sense[SG_SENSE_SIZE] = {0}; + __u8 *sense_desc = NULL; + sg_io_hdr_t sg = {0}; + + /* + * ATA Command Pass-Through, chapter 13.2.2 + * SCSI Primary Commands - 4 (SPC-4) + * ATA Translation - 3 (SAT-3) + */ + cdb[0] = ATA_PASS_THROUGH_12; + /* protocol, bits 1-4 */ + cdb[1] = ATA_PIO_DATA_IN << 1; + /* Bytes: CK_COND=1, T_DIR = 1, BYTE_BLOCK = 1, Length in Sector Count = 2 */ + cdb[2] = 0x2E; + cdb[3] = sec_protocol; + /* Sector count */ + cdb[4] = buf_size / DEFAULT_SECTOR_SIZE; + cdb[6] = (comm_id) & 0xFF; + cdb[7] = (comm_id >> 8) & 0xFF; + cdb[9] = ata_command; + + sg.interface_id = SG_INTERFACE_ID; + sg.cmd_len = sizeof(cdb); + sg.mx_sb_len = sizeof(sense); + sg.dxfer_direction = SG_DXFER_FROM_DEV; + sg.dxfer_len = buf_size; + sg.dxferp = response_buffer; + sg.cmdp = cdb; + sg.sbp = sense; + sg.timeout = SG_IO_TIMEOUT; + sg.usr_ptr = NULL; + + if (ioctl(disk_fd, SG_IO, &sg) < 0) { + pr_vrb("Failed ata passthrough12 ioctl. Device: /dev/%s.\n", fd2kname(disk_fd)); + return MDADM_STATUS_ERROR; + } + + if ((sg.status && sg.status != SG_CHECK_CONDITION) || sg.host_status || + (sg.driver_status && sg.driver_status != SG_DRIVER_SENSE)) { + pr_vrb("Failed ata passthrough12 ioctl. Device: /dev/%s.\n", fd2kname(disk_fd)); + pr_vrb("SG_IO error: ATA_12 Status: %d Host Status: %d, Driver Status: %d\n", + sg.status, sg.host_status, sg.driver_status); + return MDADM_STATUS_ERROR; + } + + /* verify expected sense response code */ + if (!(sense[0] == SENSE_DATA_CURRENT_DESC || sense[0] == SENSE_DATA_CURRENT_FIXED)) { + pr_vrb("Failed ata passthrough12 ioctl. Device: /dev/%s.\n", fd2kname(disk_fd)); + return MDADM_STATUS_ERROR; + } + + sense_desc = sense + SENSE_CURRENT_RES_DESC_POS; + /* verify sense data current response with descriptor format */ + if (sense[0] == SENSE_DATA_CURRENT_DESC && + !(sense_desc[0] == ATA_STATUS_RETURN_DESCRIPTOR && + sense_desc[1] == ATA_INQUIRY_LENGTH)) { + pr_vrb("Failed ata passthrough12 ioctl. Device: /dev/%s. Sense data ASC: %d, ASCQ: %d.\n", + fd2kname(disk_fd), sense[2], sense[3]); + return MDADM_STATUS_ERROR; + } + + /* verify sense data current response with fixed format */ + if (sense[0] == SENSE_DATA_CURRENT_FIXED && + !(sense[12] == ATA_PT_INFORMATION_AVAILABLE_ASC && + sense[13] == ATA_PT_INFORMATION_AVAILABLE_ASCQ)) { + pr_vrb("Failed ata passthrough12 ioctl. Device: /dev/%s. Sense data ASC: %d, ASCQ: %d.\n", + fd2kname(disk_fd), sense[12], sense[13]); + return MDADM_STATUS_ERROR; + } + + return MDADM_STATUS_SUCCESS; +} + +/** + * is_sec_prot_01h_supported_ata() - check if security protocol 01h supported for given SATA disk. + * @disk_fd: a disk file descriptor. + * @verbose: verbose flag. + * + * Return: %DRIVE_FEAT_SUP_ST if TCG_SECP_01 supported, %DRIVE_FEAT_NOT_SUP_ST if not supported, + * %DRIVE_FEAT_CHECK_FAILED_ST if failed. + */ +static drive_feat_sup_st is_sec_prot_01h_supported_ata(int disk_fd, const int verbose) +{ + supported_security_protocols_t security_protocols; + + mdadm_status_t result = ata_pass_through12_ioctl(disk_fd, ATA_TRUSTED_RECEIVE, TCG_SECP_00, + 0x0, &security_protocols, + sizeof(security_protocols), verbose); + if (result) + return DRIVE_FEAT_CHECK_FAILED_ST; + + if (is_sec_prot_01h_supported(&security_protocols)) + return DRIVE_FEAT_SUP_ST; + + return DRIVE_FEAT_NOT_SUP_ST; +} + +/** + * is_ata_trusted_computing_supported() - check if ata trusted computing supported. + * @buffer: buffer with ATA identify response, not NULL. + * + * Return: true if trusted computing bit set, false otherwise. + */ +bool is_ata_trusted_computing_supported(__u16 *buffer) +{ + /* Added due to warnings from the compiler about a possible uninitialized variable below. */ + assert(buffer); + + __u16 security_tc_frame = __le16_to_cpu(buffer[ATA_TRUSTED_COMPUTING_POS]); + ata_trusted_computing_t *security_tc = (ata_trusted_computing_t *)&security_tc_frame; + + if (security_tc->tc_feature == 1) + return true; + + return false; +} + +/** + * get_ata_standard_security_status() - get ATA disk encryption information from ATA identify. + * @buffer: buffer with response from ATA identify, not NULL. + * @information: struct to fill out, describing encryption status of disk. + * + * The function based on the Security status frame from ATA identify, + * completed encryption information. + * For possible encryption statuses and abilities, + * please refer to enums &encryption_status and &encryption_ability. + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR on fail. + */ +static mdadm_status_t get_ata_standard_security_status(__u16 *buffer, + struct encryption_information *information) +{ + /* Added due to warnings from the compiler about a possible uninitialized variable below. */ + assert(buffer); + + __u16 security_status_frame = __le16_to_cpu(buffer[ATA_SECURITY_WORD_POSITION]); + ata_security_status_t *security_status = (ata_security_status_t *)&security_status_frame; + + if (!security_status->security_supported) { + information->ability = ENA_NONE; + information->status = ENS_UNENCRYPTED; + + return MDADM_STATUS_SUCCESS; + } + + information->ability = ENA_OTHER; + + if (security_status->security_enabled == 0) + information->status = ENS_UNENCRYPTED; + else if (security_status->security_locked == 1) + information->status = ENS_LOCKED; + else + information->status = ENS_UNLOCKED; + + return MDADM_STATUS_SUCCESS; +} + +/** + * is_ata_opal() - check if SATA disk support Opal. + * @disk_fd: a disk file descriptor. + * @buffer: buffer with ATA identify response. + * @verbose: verbose flag. + * + * Return: %DRIVE_FEAT_SUP_ST if TCG_SECP_01 supported, %DRIVE_FEAT_NOT_SUP_ST if not supported, + * %DRIVE_FEAT_CHECK_FAILED_ST if failed to check. + */ +static drive_feat_sup_st is_ata_opal(int disk_fd, __u16 *buffer_identify, const int verbose) +{ + bool tc_status = is_ata_trusted_computing_supported(buffer_identify); + drive_feat_sup_st tcg_sec_prot_status; + + if (!tc_status) + return DRIVE_FEAT_NOT_SUP_ST; + + tcg_sec_prot_status = is_sec_prot_01h_supported_ata(disk_fd, verbose); + + if (tcg_sec_prot_status == DRIVE_FEAT_CHECK_FAILED_ST) { + pr_vrb("Failed to verify if security protocol 01h supported. Device /dev/%s.\n", + fd2kname(disk_fd)); + return DRIVE_FEAT_CHECK_FAILED_ST; + } + + if (tc_status && tcg_sec_prot_status == DRIVE_FEAT_SUP_ST) + return DRIVE_FEAT_SUP_ST; + + return DRIVE_FEAT_NOT_SUP_ST; +} + +/** + * get_ata_encryption_information() - get ATA disk encryption information. + * @disk_fd: a disk file descriptor. + * @information: struct to fill out, describing encryption status of disk. + * @verbose: verbose flag. + * + * The function reads information about encryption, if the disk supports Opal, + * the information is completed based on Opal Level 0 discovery, otherwise, + * based on ATA security status frame from ATA identification response. + * For possible encryption statuses and abilities, + * please refer to enums &encryption_status and &encryption_ability. + * + * Based on the documentations ATA/ATAPI Command Set ATA8-ACS and + * AT Attachment-8 - ATA Serial Transport (ATA8-AST). + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR on fail. + */ +mdadm_status_t +get_ata_encryption_information(int disk_fd, struct encryption_information *information, + const int verbose) +{ + __u8 buffer_opal_level0_discovery[OPAL_IO_BUFFER_LEN] = {0}; + __u16 buffer_identify[ATA_IDENTIFY_RESPONSE_LEN] = {0}; + drive_feat_sup_st ata_opal_status; + mdadm_status_t status; + + /* Get disk ATA identification */ + status = ata_pass_through12_ioctl(disk_fd, ATA_IDENTIFY, 0x0, 0x0, buffer_identify, + sizeof(buffer_identify), verbose); + if (status == MDADM_STATUS_ERROR) + return MDADM_STATUS_ERROR; + + if (is_ata_trusted_computing_supported(buffer_identify) && + !sysfs_is_libata_allow_tpm_enabled(verbose)) { + pr_vrb("For SATA with Trusted Computing support, required libata.tpm_enabled=1.\n"); + return MDADM_STATUS_ERROR; + } + + ata_opal_status = is_ata_opal(disk_fd, buffer_identify, verbose); + if (ata_opal_status == DRIVE_FEAT_CHECK_FAILED_ST) + return MDADM_STATUS_ERROR; + + if (ata_opal_status == DRIVE_FEAT_NOT_SUP_ST) + return get_ata_standard_security_status(buffer_identify, information); + + /* SATA Opal */ + status = ata_pass_through12_ioctl(disk_fd, ATA_TRUSTED_RECEIVE, TCG_SECP_01, + OPAL_DISCOVERY_COMID, buffer_opal_level0_discovery, + OPAL_IO_BUFFER_LEN, verbose); + if (status != MDADM_STATUS_SUCCESS) + return MDADM_STATUS_ERROR; + + return get_opal_encryption_informations(buffer_opal_level0_discovery, information); +} diff --git a/drive_encryption.h b/drive_encryption.h index 27f1de55..01405e1e 100644 --- a/drive_encryption.h +++ b/drive_encryption.h @@ -30,3 +30,6 @@ typedef struct encryption_information { mdadm_status_t get_nvme_opal_encryption_informations(int disk_fd, struct encryption_information *information, const int verbose); +mdadm_status_t +get_ata_encryption_information(int disk_fd, struct encryption_information *information, + const int verbose); diff --git a/mdadm.h b/mdadm.h index 141adbd7..f64bb783 100644 --- a/mdadm.h +++ b/mdadm.h @@ -853,6 +853,7 @@ extern int restore_stripes(int *dest, unsigned long long *offsets, int source, unsigned long long read_offset, unsigned long long start, unsigned long long length, char *src_buf); +extern bool sysfs_is_libata_allow_tpm_enabled(const int verbose); #ifndef Sendmail #define Sendmail "/usr/lib/sendmail -t" diff --git a/sysfs.c b/sysfs.c index 230b842e..24790896 100644 --- a/sysfs.c +++ b/sysfs.c @@ -1123,3 +1123,32 @@ void sysfsline(char *line) sr->next = sysfs_rules; sysfs_rules = sr; } + +/** + * sysfs_is_libata_allow_tpm_enabled() - check if libata allow_tmp is enabled. + * @verbose: verbose flag. + * + * Check if libata allow_tmp flag is set, this is required for SATA Opal Security commands to work. + * + * Return: true if allow_tpm enable, false otherwise. + */ +bool sysfs_is_libata_allow_tpm_enabled(const int verbose) +{ + const char *path = "/sys/module/libata/parameters/allow_tpm"; + const char *expected_value = "1"; + int fd = open(path, O_RDONLY); + char buf[3]; + + if (!is_fd_valid(fd)) { + pr_vrb("Failed open file descriptor to %s. Cannot check libata allow_tpm param.\n", + path); + return false; + } + + sysfs_fd_get_str(fd, buf, sizeof(buf)); + close(fd); + + if (strncmp(buf, expected_value, 1) == 0) + return true; + return false; +} -- 2.35.3
For NVMe devices with Opal support, encryption information, status and ability are completed based on Opal Level 0 discovery response. Technical documentation used is given in the implementation. Ability in general describes what type of encryption is supported, Status describes in what state the disk with encryption support is. The current patch only includes only the implementation of reading encryption information, functions will be used in one of the next patches. Motivation for adding this functionality is to block mixing of disks in IMSM arrays with encryption enabled and disabled. The main goal is to not allow stealing data by rebuilding array to not encrypted drive which can be read elsewhere. Value ENA_OTHER from enum encryption_ability will be used in the next patch. Additionally, pr_vrb define is moved from super-intel.c to mdadm.h so that it can be used globally. Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> --- Makefile | 4 +- drive_encryption.c | 362 +++++++++++++++++++++++++++++++++++++++++++++ drive_encryption.h | 32 ++++ mdadm.h | 2 + super-intel.c | 2 - 5 files changed, 398 insertions(+), 4 deletions(-) create mode 100644 drive_encryption.c create mode 100644 drive_encryption.h diff --git a/Makefile b/Makefile index cbdba49a..7c221a89 100644 --- a/Makefile +++ b/Makefile @@ -170,7 +170,7 @@ OBJS = mdadm.o config.o policy.o mdstat.o ReadMe.o uuid.o util.o maps.o lib.o u mdopen.o super0.o super1.o super-ddf.o super-intel.o bitmap.o \ super-mbr.o super-gpt.o \ restripe.o sysfs.o sha1.o mapfile.o crc32.o sg_io.o msg.o xmalloc.o \ - platform-intel.o probe_roms.o crc32c.o + platform-intel.o probe_roms.o crc32c.o drive_encryption.o CHECK_OBJS = restripe.o uuid.o sysfs.o maps.o lib.o xmalloc.o dlink.o @@ -183,7 +183,7 @@ MON_OBJS = mdmon.o monitor.o managemon.o uuid.o util.o maps.o mdstat.o sysfs.o c Kill.o sg_io.o dlink.o ReadMe.o super-intel.o \ super-mbr.o super-gpt.o \ super-ddf.o sha1.o crc32.o msg.o bitmap.o xmalloc.o \ - platform-intel.o probe_roms.o crc32c.o + platform-intel.o probe_roms.o crc32c.o drive_encryption.o MON_SRCS = $(patsubst %.o,%.c,$(MON_OBJS)) diff --git a/drive_encryption.c b/drive_encryption.c new file mode 100644 index 00000000..0fa214a9 --- /dev/null +++ b/drive_encryption.c @@ -0,0 +1,362 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Read encryption information for Opal and ATA devices. + * + * Copyright (C) 2024 Intel Corporation + * Author: Blazej Kucman <blazej.kucman@intel.com> + */ + +#include "mdadm.h" + +#include <asm/types.h> +#include <linux/nvme_ioctl.h> +#include "drive_encryption.h" + +/* + * Opal defines + * TCG Storage Opal SSC 2.01 chapter 3.3.3 + * NVM ExpressTM Revision 1.4c, chapter 5 + */ +#define TCG_SECP_01 (0x01) +#define TCG_SECP_00 (0x00) +#define OPAL_DISCOVERY_COMID (0x0001) +#define OPAL_LOCKING_FEATURE (0x0002) +#define OPAL_IO_BUFFER_LEN 2048 +#define OPAL_DISCOVERY_FEATURE_HEADER_LEN (4) + +/* + * NVMe defines + * NVM ExpressTM Revision 1.4c, chapter 5 + */ +#define NVME_SECURITY_RECV (0x82) +#define NVME_IDENTIFY (0x06) +#define NVME_IDENTIFY_RESPONSE_LEN 4096 +#define NVME_OACS_BYTE_POSITION (256) +#define NVME_IDENTIFY_CONTROLLER_DATA (1) + +typedef enum drive_feature_support_status { + /* Drive feature is supported. */ + DRIVE_FEAT_SUP_ST = 0, + /* Drive feature is not supported. */ + DRIVE_FEAT_NOT_SUP_ST, + /* Drive feature support check failed. */ + DRIVE_FEAT_CHECK_FAILED_ST +} drive_feat_sup_st; + +/* TCG Storage Opal SSC 2.01 chapter 3.1.1.3 */ +typedef struct opal_locking_feature { + /* feature header */ + __u16 feature_code; + __u8 reserved : 4; + __u8 version : 4; + __u8 description_length; + /* feature description */ + __u8 locking_supported : 1; + __u8 locking_enabled : 1; + __u8 locked : 1; + __u8 media_encryption : 1; + __u8 mbr_enabled : 1; + __u8 mbr_done : 1; + __u8 mbr_shadowing_not_supported : 1; + __u8 hw_reset_for_dor_supported : 1; + __u8 reserved1[11]; +} __attribute__((__packed__)) opal_locking_feature_t; + +/* TCG Storage Opal SSC 2.01 chapter 3.1.1.1 */ +typedef struct opal_level0_header { + __u32 length; + __u32 version; + __u64 reserved; + __u8 vendor_specific[32]; +} opal_level0_header_t; + +/** + * NVM ExpressTM Revision 1.4c, Figure 249 + * Structure specifies only OACS filed, which is needed in the current use case. + */ +typedef struct nvme_identify_ctrl { + __u8 reserved[255]; + __u16 oacs; + __u8 reserved2[3839]; +} nvme_identify_ctrl_t; + +/* SCSI Primary Commands - 4 (SPC-4), Table 512 */ +typedef struct supported_security_protocols { + __u8 reserved[6]; + __u16 list_length; + __u8 list[504]; +} supported_security_protocols_t; + +/** + * get_opal_locking_feature_description() - get opal locking feature description. + * @response: response from Opal Discovery Level 0. + * + * Based on the documentation TCG Storage Opal SSC 2.01 chapter 3.1.1, + * a Locking feature is searched for in Opal Level 0 Discovery response. + * + * Return: if locking feature found pointer to struct %opal_locking_feature_t, NULL otherwise. + */ +static opal_locking_feature_t *get_opal_locking_feature_description(__u8 *response) +{ + opal_level0_header_t *response_header = (opal_level0_header_t *)response; + int features_length = __be32_to_cpu(response_header->length); + int current_position = sizeof(*response_header); + + while (current_position < features_length) { + opal_locking_feature_t *feature; + + feature = (opal_locking_feature_t *)(response + current_position); + + if (__be16_to_cpu(feature->feature_code) == OPAL_LOCKING_FEATURE) + return feature; + + current_position += feature->description_length + OPAL_DISCOVERY_FEATURE_HEADER_LEN; + } + + return NULL; +} + +/** + * nvme_security_recv_ioctl() - nvme security receive ioctl. + * @disk_fd: a disk file descriptor. + * @sec_protocol: security protocol. + * @comm_id: command id. + * @response_buffer: response buffer to fill out. + * @buf_size: response buffer size. + * @verbose: verbose flag. + * + * Based on the documentations TCG Storage Opal SSC 2.01 chapter 3.3.3 and + * NVM ExpressTM Revision 1.4c, chapter 5.25, + * read security receive command via ioctl(). + * On success, @response_buffer is completed. + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. + */ +static mdadm_status_t +nvme_security_recv_ioctl(int disk_fd, __u8 sec_protocol, __u16 comm_id, void *response_buffer, + size_t buf_size, const int verbose) +{ + struct nvme_admin_cmd nvme_cmd = {0}; + int status; + + nvme_cmd.opcode = NVME_SECURITY_RECV; + nvme_cmd.cdw10 = sec_protocol << 24 | comm_id << 8; + nvme_cmd.cdw11 = buf_size; + nvme_cmd.data_len = buf_size; + nvme_cmd.addr = (__u64)response_buffer; + + status = ioctl(disk_fd, NVME_IOCTL_ADMIN_CMD, &nvme_cmd); + if (status != 0) { + pr_vrb("Failed to read NVMe security receive ioctl() for device /dev/%s, status: %d\n", + fd2kname(disk_fd), status); + return MDADM_STATUS_ERROR; + } + + return MDADM_STATUS_SUCCESS; +} + +/** + * nvme_identify_ioctl() - NVMe identify ioctl. + * @disk_fd: a disk file descriptor. + * @response_buffer: response buffer to fill out. + * @buf_size: response buffer size. + * @verbose: verbose flag. + * + * Based on the documentations TCG Storage Opal SSC 2.01 chapter 3.3.3 and + * NVM ExpressTM Revision 1.4c, chapter 5.25, + * read NVMe identify via ioctl(). + * On success, @response_buffer will be completed. + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. + */ +static mdadm_status_t +nvme_identify_ioctl(int disk_fd, void *response_buffer, size_t buf_size, const int verbose) +{ + struct nvme_admin_cmd nvme_cmd = {0}; + int status; + + nvme_cmd.opcode = NVME_IDENTIFY; + nvme_cmd.cdw10 = NVME_IDENTIFY_CONTROLLER_DATA; + nvme_cmd.data_len = buf_size; + nvme_cmd.addr = (__u64)response_buffer; + + status = ioctl(disk_fd, NVME_IOCTL_ADMIN_CMD, &nvme_cmd); + if (status != 0) { + pr_vrb("Failed to read NVMe identify ioctl() for device /dev/%s, status: %d\n", + fd2kname(disk_fd), status); + return MDADM_STATUS_ERROR; + } + + return MDADM_STATUS_SUCCESS; +} + +/** + * is_sec_prot_01h_supported() - check if security protocol 01h supported. + * @security_protocols: struct with response from disk (NVMe, SATA) describing supported + * security protocols. + * + * Return: true if TCG_SECP_01 found, false otherwise. + */ +static bool is_sec_prot_01h_supported(supported_security_protocols_t *security_protocols) +{ + int list_length = be16toh(security_protocols->list_length); + int index; + + for (index = 0 ; index < list_length; index++) { + if (security_protocols->list[index] == TCG_SECP_01) + return true; + } + + return false; +} + +/** + * is_sec_prot_01h_supported_nvme() - check if security protocol 01h supported for given NVMe disk. + * @disk_fd: a disk file descriptor. + * @verbose: verbose flag. + * + * Return: %DRIVE_FEAT_SUP_ST if TCG_SECP_01 supported, %DRIVE_FEAT_NOT_SUP_ST if not supported, + * %DRIVE_FEAT_CHECK_FAILED_ST if failed to check. + */ +static drive_feat_sup_st is_sec_prot_01h_supported_nvme(int disk_fd, const int verbose) +{ + supported_security_protocols_t security_protocols = {0}; + + /* security_protocol: TCG_SECP_00, comm_id: not applicable */ + if (nvme_security_recv_ioctl(disk_fd, TCG_SECP_00, 0x0, &security_protocols, + sizeof(security_protocols), verbose)) + return DRIVE_FEAT_CHECK_FAILED_ST; + + if (is_sec_prot_01h_supported(&security_protocols)) + return DRIVE_FEAT_SUP_ST; + + return DRIVE_FEAT_NOT_SUP_ST; +} + +/** + * is_nvme_sec_send_recv_supported() - check if Security Send and Security Receive is supported. + * @disk_fd: a disk file descriptor. + * @verbose: verbose flag. + * + * Check if "Optional Admin Command Support" bit 0 is set in NVMe identify. + * Bit 0 set to 1 means controller supports the Security Send and Security Receive commands. + * + * Return: %DRIVE_FEAT_SUP_ST if security send/receive supported, + * %DRIVE_FEAT_NOT_SUP_ST if not supported, %DRIVE_FEAT_CHECK_FAILED_ST if check failed. + */ +static drive_feat_sup_st is_nvme_sec_send_recv_supported(int disk_fd, const int verbose) +{ + nvme_identify_ctrl_t nvme_identify = {0}; + int status = 0; + + status = nvme_identify_ioctl(disk_fd, &nvme_identify, sizeof(nvme_identify), verbose); + if (status) + return DRIVE_FEAT_CHECK_FAILED_ST; + + if ((__le16_to_cpu(nvme_identify.oacs) & 0x1) == 0x1) + return DRIVE_FEAT_SUP_ST; + + return DRIVE_FEAT_NOT_SUP_ST; +} + +/** + * get_opal_encryption_locking_informations() - get Opal Locking information. + * @buffer: buffer with Opal Level 0 Discovery response. + * @information: struct to fill out, describing encryption status of disk. + * + * If Locking feature frame is in response from Opal Level 0 discovery, &encryption_information_t + * structure is completed with status and ability otherwise the status is set to &None. + * For possible encryption statuses and abilities, + * please refer to enums &encryption_status and &encryption_ability. + * + * Return: %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. + */ +static mdadm_status_t get_opal_encryption_informations(__u8 *buffer, + encryption_information_t *information) +{ + opal_locking_feature_t *opal_locking_feature = + get_opal_locking_feature_description(buffer); + + if (!opal_locking_feature) + return MDADM_STATUS_ERROR; + + if (opal_locking_feature->locking_supported == 1) { + information->ability = ENA_SED; + + if (opal_locking_feature->locking_enabled == 0) + information->status = ENS_UNENCRYPTED; + else if (opal_locking_feature->locked == 1) + information->status = ENS_LOCKED; + else + information->status = ENS_UNLOCKED; + } else { + information->ability = ENA_NONE; + information->status = ENS_UNENCRYPTED; + } + + return MDADM_STATUS_SUCCESS; +} + +/** + * get_nvme_opal_encryption_informations() - get NVMe Opal encryption information. + * @disk_fd: a disk file descriptor. + * @information: struct to fill out, describing encryption status of disk. + * @verbose: verbose flag. + * + * In case the disk supports Opal Level 0 discovery, &encryption_information_t structure + * is completed with status and ability based on ioctl response, + * otherwise the ability is set to %ENA_NONE and &status to %ENS_UNENCRYPTED. + * As the current use case does not need the knowledge of Opal support, if there is no support, + * %MDADM_STATUS_SUCCESS will be returned, with the values described above. + * For possible encryption statuses and abilities, + * please refer to enums &encryption_status and &encryption_ability. + * + * %MDADM_STATUS_SUCCESS on success, %MDADM_STATUS_ERROR otherwise. + */ +mdadm_status_t +get_nvme_opal_encryption_informations(int disk_fd, encryption_information_t *information, + const int verbose) +{ + __u8 buffer[OPAL_IO_BUFFER_LEN]; + int sec_send_recv_supported = 0; + int protocol_01h_supported = 0; + mdadm_status_t status; + + information->ability = ENA_NONE; + information->status = ENS_UNENCRYPTED; + + sec_send_recv_supported = is_nvme_sec_send_recv_supported(disk_fd, verbose); + if (sec_send_recv_supported == DRIVE_FEAT_CHECK_FAILED_ST) + return MDADM_STATUS_ERROR; + + /* Opal not supported */ + if (sec_send_recv_supported == DRIVE_FEAT_NOT_SUP_ST) + return MDADM_STATUS_SUCCESS; + + /** + * sec_send_recv_supported determine that it should be possible to read + * supported sec protocols + */ + protocol_01h_supported = is_sec_prot_01h_supported_nvme(disk_fd, verbose); + if (protocol_01h_supported == DRIVE_FEAT_CHECK_FAILED_ST) + return MDADM_STATUS_ERROR; + + /* Opal not supported */ + if (sec_send_recv_supported == DRIVE_FEAT_SUP_ST && + protocol_01h_supported == DRIVE_FEAT_NOT_SUP_ST) + return MDADM_STATUS_SUCCESS; + + if (nvme_security_recv_ioctl(disk_fd, TCG_SECP_01, OPAL_DISCOVERY_COMID, (void *)&buffer, + OPAL_IO_BUFFER_LEN, verbose)) + return MDADM_STATUS_ERROR; + + status = get_opal_encryption_informations((__u8 *)&buffer, information); + if (status) + pr_vrb("Locking feature description not found in Level 0 discovery response. Device /dev/%s.\n", + fd2kname(disk_fd)); + + if (information->ability == ENA_NONE) + assert(information->status == ENS_UNENCRYPTED); + + return status; +} diff --git a/drive_encryption.h b/drive_encryption.h new file mode 100644 index 00000000..27f1de55 --- /dev/null +++ b/drive_encryption.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +/* + * Read encryption information for Opal and ATA devices. + * + * Copyright (C) 2024 Intel Corporation + * Author: Blazej Kucman <blazej.kucman@intel.com> + */ + +typedef enum encryption_status { + /* The drive is not currently encrypted. */ + ENS_UNENCRYPTED = 0, + /* The drive is encrypted and the data is not accessible. */ + ENS_LOCKED, + /* The drive is encrypted but the data is accessible in unencrypted form. */ + ENS_UNLOCKED +} encryption_status_t; + +typedef enum encryption_ability { + ENA_NONE = 0, + ENA_OTHER, + /* Self encrypted drive */ + ENA_SED +} encryption_ability_t; + +typedef struct encryption_information { + encryption_ability_t ability; + encryption_status_t status; +} encryption_information_t; + +mdadm_status_t +get_nvme_opal_encryption_informations(int disk_fd, struct encryption_information *information, + const int verbose); diff --git a/mdadm.h b/mdadm.h index 3fedca48..141adbd7 100644 --- a/mdadm.h +++ b/mdadm.h @@ -1912,6 +1912,8 @@ static inline int xasprintf(char **strp, const char *fmt, ...) { #define pr_info(fmt, args...) printf("%s: "fmt, Name, ##args) +#define pr_vrb(fmt, arg...) ((void)(verbose && pr_err(fmt, ##arg))) + void *xmalloc(size_t len); void *xrealloc(void *ptr, size_t len); void *xcalloc(size_t num, size_t size); diff --git a/super-intel.c b/super-intel.c index 77140455..806b6248 100644 --- a/super-intel.c +++ b/super-intel.c @@ -393,8 +393,6 @@ struct md_list { struct md_list *next; }; -#define pr_vrb(fmt, arg...) (void) (verbose && pr_err(fmt, ##arg)) - static __u8 migr_type(struct imsm_dev *dev) { if (dev->vol.migr_type == MIGR_VERIFY && -- 2.35.3
Add ENCRYPTION_NO_VERIFY config key and allow to disable checking encryption status for given type of drives. The key is introduced because of SATA Opal disks for which TPM commands must be enabled in libata kernel module, (libata.allow_tpm=1), otherwise it is impossible to verify encryption status. TPM commands are disabled by default. Currently the key only supports the "sata_opal" value, if necessary, the functionality is ready to support more types of disks. Signed-off-by: Blazej Kucman <blazej.kucman@intel.com> --- config.c | 25 ++++++++++++++++++++++++- drive_encryption.c | 16 ++++++++++++---- mdadm.conf.5.in | 13 +++++++++++++ mdadm.h | 1 + 4 files changed, 50 insertions(+), 5 deletions(-) diff --git a/config.c b/config.c index 44f7dd2f..b46d71cb 100644 --- a/config.c +++ b/config.c @@ -81,7 +81,7 @@ char DefaultAltConfDir[] = CONFFILE2 ".d"; enum linetype { Devices, Array, Mailaddr, Mailfrom, Program, CreateDev, Homehost, HomeCluster, AutoMode, Policy, PartPolicy, Sysfs, - MonitorDelay, LTEnd }; + MonitorDelay, EncryptionNoVerify, LTEnd }; char *keywords[] = { [Devices] = "devices", [Array] = "array", @@ -96,6 +96,7 @@ char *keywords[] = { [PartPolicy]="part-policy", [Sysfs] = "sysfs", [MonitorDelay] = "monitordelay", + [EncryptionNoVerify] = "ENCRYPTION_NO_VERIFY", [LTEnd] = NULL }; @@ -729,6 +730,19 @@ void monitordelayline(char *line) } } +static bool sata_opal_encryption_no_verify; +void encryption_no_verify_line(char *line) +{ + char *word; + + for (word = dl_next(line); word != line; word = dl_next(word)) { + if (strcasecmp(word, "sata_opal") == 0) + sata_opal_encryption_no_verify = true; + else + pr_err("unrecognised word on ENCRYPTION_NO_VERIFY line: %s\n", word); + } +} + char auto_yes[] = "yes"; char auto_no[] = "no"; char auto_homehost[] = "homehost"; @@ -913,6 +927,9 @@ void conf_file(FILE *f) case MonitorDelay: monitordelayline(line); break; + case EncryptionNoVerify: + encryption_no_verify_line(line); + break; default: pr_err("Unknown keyword %s\n", line); } @@ -1075,6 +1092,12 @@ int conf_get_monitor_delay(void) return monitor_delay; } +bool conf_get_sata_opal_encryption_no_verify(void) +{ + load_conffile(); + return sata_opal_encryption_no_verify; +} + struct createinfo *conf_get_create_info(void) { load_conffile(); diff --git a/drive_encryption.c b/drive_encryption.c index 8421e374..5b9cdc00 100644 --- a/drive_encryption.c +++ b/drive_encryption.c @@ -656,10 +656,18 @@ get_ata_encryption_information(int disk_fd, struct encryption_information *infor if (status == MDADM_STATUS_ERROR) return MDADM_STATUS_ERROR; - if (is_ata_trusted_computing_supported(buffer_identify) && - !sysfs_is_libata_allow_tpm_enabled(verbose)) { - pr_vrb("For SATA with Trusted Computing support, required libata.tpm_enabled=1.\n"); - return MDADM_STATUS_ERROR; + /* Possible OPAL support, further checks require tpm_enabled.*/ + if (is_ata_trusted_computing_supported(buffer_identify)) { + /* OPAL SATA encryption checking disabled. */ + if (conf_get_sata_opal_encryption_no_verify()) + return MDADM_STATUS_SUCCESS; + + if (!sysfs_is_libata_allow_tpm_enabled(verbose)) { + pr_vrb("Detected SATA drive /dev/%s with Trusted Computing support.\n", + fd2kname(disk_fd)); + pr_vrb("Cannot verify encryption state. Required libata.tpm_enabled=1.\n"); + return MDADM_STATUS_ERROR; + } } ata_opal_status = is_ata_opal(disk_fd, buffer_identify, verbose); diff --git a/mdadm.conf.5.in b/mdadm.conf.5.in index 787e51e9..afb0a296 100644 --- a/mdadm.conf.5.in +++ b/mdadm.conf.5.in @@ -636,6 +636,17 @@ If multiple .B MINITORDELAY lines are provided, only first non-zero value is considered. +.TP +.B ENCRYPTION_NO_VERIFY +The +.B ENCRYPTION_NO_VERIFY +disables encryption verification for devices with particular encryption support detected. +Currently, only verification of SATA OPAL encryption can be disabled. +It does not disable ATA security encryption verification. +Available parameter +.I "sata_opal". + + .SH FILES .SS {CONFFILE} @@ -744,6 +755,8 @@ SYSFS uuid=bead5eb6:31c17a27:da120ba2:7dfda40d group_thread_cnt=4 sync_speed_max=1000000 .br MONITORDELAY 60 +.br +ENCRYPTION_NO_VERIFY sata_opal .SH SEE ALSO .BR mdadm (8), diff --git a/mdadm.h b/mdadm.h index f64bb783..9d98693b 100644 --- a/mdadm.h +++ b/mdadm.h @@ -1673,6 +1673,7 @@ extern char *conf_get_program(void); extern char *conf_get_homehost(int *require_homehostp); extern char *conf_get_homecluster(void); extern int conf_get_monitor_delay(void); +extern bool conf_get_sata_opal_encryption_no_verify(void); extern char *conf_line(FILE *file); extern char *conf_word(FILE *file, int allow_key); extern void print_quoted(char *str); -- 2.35.3
The purpose of this series is to add functionality of reading information about the encryption status of OPAL NVMe/SATA and SATA drives and use this information for purposes of IMSM metadata, which introduces restrictions on the possibility of mixing disks with encryption enabled and disabled because of security reasons. Blazej Kucman (5): Add reading Opal NVMe encryption information Add reading SATA encryption information Add key ENCRYPTION_NO_VERIFY to conf imsm: print disk encryption information imsm: drive encryption policy implementation Makefile | 4 +- config.c | 25 +- drive_encryption.c | 724 +++++++++++++++++++++++++++++++++++++++++++++ drive_encryption.h | 37 +++ mdadm.conf.5.in | 16 + mdadm.h | 4 + super-intel.c | 117 +++++++- sysfs.c | 29 ++ 8 files changed, 947 insertions(+), 9 deletions(-) create mode 100644 drive_encryption.c create mode 100644 drive_encryption.h -- 2.35.3
9003 was passed directly to sysfs_set_array() since md_get_version() always returned this value. md_get_version() was removed long ago. Remove dead version check from sysfs_set_array(). Remove "vers" argument and fix function calls. Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com> --- Assemble.c | 2 +- mdadm.h | 2 +- sysfs.c | 6 ++---- util.c | 3 +-- 4 files changed, 5 insertions(+), 8 deletions(-) diff --git a/Assemble.c b/Assemble.c index 9d042055ad4e..f6c5b99e25e2 100644 --- a/Assemble.c +++ b/Assemble.c @@ -1988,7 +1988,7 @@ int assemble_container_content(struct supertype *st, int mdfd, * and ignoring special character on the first place. */ if (strcmp(sra->text_version + 1, content->text_version + 1) != 0) { - if (sysfs_set_array(content, 9003) != 0) { + if (sysfs_set_array(content) != 0) { sysfs_free(sra); return 1; } diff --git a/mdadm.h b/mdadm.h index 1f28b3e754be..48e5a4935868 100644 --- a/mdadm.h +++ b/mdadm.h @@ -807,7 +807,7 @@ extern int sysfs_attribute_available(struct mdinfo *sra, struct mdinfo *dev, extern int sysfs_get_str(struct mdinfo *sra, struct mdinfo *dev, char *name, char *val, int size); extern int sysfs_set_safemode(struct mdinfo *sra, unsigned long ms); -extern int sysfs_set_array(struct mdinfo *info, int vers); +extern int sysfs_set_array(struct mdinfo *info); extern int sysfs_add_disk(struct mdinfo *sra, struct mdinfo *sd, int resume); extern int sysfs_disk_to_scsi_id(int fd, __u32 *id); extern int sysfs_unique_holder(char *devnm, long rdev); diff --git a/sysfs.c b/sysfs.c index f95ef7013e84..937a02e88a79 100644 --- a/sysfs.c +++ b/sysfs.c @@ -655,7 +655,7 @@ int sysfs_set_safemode(struct mdinfo *sra, unsigned long ms) return sysfs_set_str(sra, NULL, "safe_mode_delay", delay); } -int sysfs_set_array(struct mdinfo *info, int vers) +int sysfs_set_array(struct mdinfo *info) { int rv = 0; char ver[100]; @@ -679,9 +679,7 @@ int sysfs_set_array(struct mdinfo *info, int vers) if (strlen(buf) >= 9 && buf[9] == '-') ver[9] = '-'; - if ((vers % 100) < 2 || - sysfs_set_str(info, NULL, "metadata_version", - ver) < 0) { + if (sysfs_set_str(info, NULL, "metadata_version", ver) < 0) { pr_err("This kernel does not support external metadata.\n"); return 1; } diff --git a/util.c b/util.c index b145447370b3..a3a46255d297 100644 --- a/util.c +++ b/util.c @@ -1899,8 +1899,7 @@ int set_array_info(int mdfd, struct supertype *st, struct mdinfo *info) int rv; if (st->ss->external) - return sysfs_set_array(info, 9003); - + return sysfs_set_array(info); memset(&inf, 0, sizeof(inf)); inf.major_version = info->array.major_version; inf.minor_version = info->array.minor_version; -- 2.39.2
Mentioned commit (see Fixes) causes that UUID is not swapped as expected for native superblock. Fix this problem. For detail, we should avoid superblock calls, we can have information about supertype from map, use that. Simplify fname_from_uuid() by removing dependencies to metadata handler, it is not needed. Decision is taken at compile time, expect super1 but this function is not used by super1. Add warning about that. Remove separator, it is always ':'. Fixes: 60c19530dd7c ("Detail: remove duplicated code") Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> --- Detail.c | 26 +++++++++++++++++++++++++- mdadm.h | 3 +-- super-ddf.c | 10 +++++----- super-intel.c | 16 ++++++++-------- util.c | 24 +++++++++++++----------- 5 files changed, 52 insertions(+), 27 deletions(-) diff --git a/Detail.c b/Detail.c index f23ec16f81bc..55a086d3378f 100644 --- a/Detail.c +++ b/Detail.c @@ -49,6 +49,30 @@ static int add_device(const char *dev, char ***p_devices, return n_devices + 1; } +/** + * detail_fname_from_uuid() - generate uuid string with special super1 handling. + * @mp: map entry to parse. + * @buf: buf to write. + * + * Hack to workaround an issue with super1 superblocks. It swapuuid set in order for assembly + * to work, but can't have it set if we want this printout to match all the other uuid printouts + * in super1.c, so we force swapuuid to 1 to make our printout match the rest of super1. + * + * Always convert uuid if host is big endian. + */ +char *detail_fname_from_uuid(struct map_ent *mp, char *buf) +{ +#if __BYTE_ORDER == BIG_ENDIAN + bool swap = true; +#else + bool swap = false; +#endif + if (strncmp(mp->metadata, "1.", 2) == 0) + swap = true; + + return __fname_from_uuid(mp->uuid, swap, buf, ':'); +} + int Detail(char *dev, struct context *c) { /* @@ -256,7 +280,7 @@ int Detail(char *dev, struct context *c) mp = map_by_devnm(&map, fd2devnm(fd)); if (mp) { - __fname_from_uuid(mp->uuid, 0, nbuf, ':'); + detail_fname_from_uuid(mp, nbuf); printf("MD_UUID=%s\n", nbuf + 5); if (mp->path && strncmp(mp->path, DEV_MD_DIR, DEV_MD_DIR_LEN) == 0) printf("MD_DEVNAME=%s\n", mp->path + DEV_MD_DIR_LEN); diff --git a/mdadm.h b/mdadm.h index 3fedca484bdd..a363708a2710 100644 --- a/mdadm.h +++ b/mdadm.h @@ -1696,8 +1696,7 @@ extern const int uuid_zero[4]; extern int same_uuid(int a[4], int b[4], int swapuuid); extern void copy_uuid(void *a, int b[4], int swapuuid); extern char *__fname_from_uuid(int id[4], int swap, char *buf, char sep); -extern char *fname_from_uuid(struct supertype *st, - struct mdinfo *info, char *buf, char sep); +extern char *fname_from_uuid(struct mdinfo *info, char *buf); extern unsigned long calc_csum(void *super, int bytes); extern int enough(int level, int raid_disks, int layout, int clean, char *avail); diff --git a/super-ddf.c b/super-ddf.c index 94ac5ff3965a..21426c753c6d 100644 --- a/super-ddf.c +++ b/super-ddf.c @@ -1617,7 +1617,7 @@ static void brief_examine_super_ddf(struct supertype *st, int verbose) struct mdinfo info; char nbuf[64]; getinfo_super_ddf(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf("ARRAY metadata=ddf UUID=%s\n", nbuf + 5); } @@ -1632,7 +1632,7 @@ static void brief_examine_subarrays_ddf(struct supertype *st, int verbose) unsigned int i; char nbuf[64]; getinfo_super_ddf(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); for (i = 0; i < be16_to_cpu(ddf->virt->max_vdes); i++) { struct virtual_entry *ve = &ddf->virt->entries[i]; @@ -1645,7 +1645,7 @@ static void brief_examine_subarrays_ddf(struct supertype *st, int verbose) ddf->currentconf =&vcl; vcl.vcnum = i; uuid_from_super_ddf(st, info.uuid); - fname_from_uuid(st, &info, nbuf1, ':'); + fname_from_uuid(&info, nbuf1); _ddf_array_name(namebuf, ddf, i); printf("ARRAY%s%s container=%s member=%d UUID=%s\n", namebuf[0] == '\0' ? "" : " " DEV_MD_DIR, namebuf, @@ -1658,7 +1658,7 @@ static void export_examine_super_ddf(struct supertype *st) struct mdinfo info; char nbuf[64]; getinfo_super_ddf(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf("MD_METADATA=ddf\n"); printf("MD_LEVEL=container\n"); printf("MD_UUID=%s\n", nbuf+5); @@ -1798,7 +1798,7 @@ static void brief_detail_super_ddf(struct supertype *st, char *subarray) return; else uuid_of_ddf_subarray(ddf, vcnum, info.uuid); - fname_from_uuid(st, &info, nbuf,':'); + fname_from_uuid(&info, nbuf); printf(" UUID=%s", nbuf + 5); } diff --git a/super-intel.c b/super-intel.c index e1754f29246c..ff2590fef63f 100644 --- a/super-intel.c +++ b/super-intel.c @@ -2217,7 +2217,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost) else printf("not supported\n"); getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf(" UUID : %s\n", nbuf + 5); sum = __le32_to_cpu(mpb->check_sum); printf(" Checksum : %08x %s\n", sum, @@ -2242,7 +2242,7 @@ static void examine_super_imsm(struct supertype *st, char *homehost) super->current_vol = i; getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); print_imsm_dev(super, dev, nbuf + 5, super->disks->index); } for (i = 0; i < mpb->num_disks; i++) { @@ -2267,7 +2267,7 @@ static void brief_examine_super_imsm(struct supertype *st, int verbose) char nbuf[64]; getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf("ARRAY metadata=imsm UUID=%s\n", nbuf + 5); } @@ -2284,13 +2284,13 @@ static void brief_examine_subarrays_imsm(struct supertype *st, int verbose) return; getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); for (i = 0; i < super->anchor->num_raid_devs; i++) { struct imsm_dev *dev = get_imsm_dev(super, i); super->current_vol = i; getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf1, ':'); + fname_from_uuid(&info, nbuf1); printf("ARRAY " DEV_MD_DIR "%.16s container=%s member=%d UUID=%s\n", dev->volume, nbuf + 5, i, nbuf1 + 5); } @@ -2304,7 +2304,7 @@ static void export_examine_super_imsm(struct supertype *st) char nbuf[64]; getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf("MD_METADATA=imsm\n"); printf("MD_LEVEL=container\n"); printf("MD_UUID=%s\n", nbuf+5); @@ -2324,7 +2324,7 @@ static void detail_super_imsm(struct supertype *st, char *homehost, super->current_vol = strtoul(subarray, NULL, 10); getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf("\n UUID : %s\n", nbuf + 5); super->current_vol = temp_vol; @@ -2341,7 +2341,7 @@ static void brief_detail_super_imsm(struct supertype *st, char *subarray) super->current_vol = strtoul(subarray, NULL, 10); getinfo_super_imsm(st, &info, NULL); - fname_from_uuid(st, &info, nbuf, ':'); + fname_from_uuid(&info, nbuf); printf(" UUID=%s", nbuf + 5); super->current_vol = temp_vol; diff --git a/util.c b/util.c index 49a9c6e29cf7..03336d6fa24c 100644 --- a/util.c +++ b/util.c @@ -589,19 +589,21 @@ char *__fname_from_uuid(int id[4], int swap, char *buf, char sep) } -char *fname_from_uuid(struct supertype *st, struct mdinfo *info, - char *buf, char sep) -{ - // dirty hack to work around an issue with super1 superblocks... - // super1 superblocks need swapuuid set in order for assembly to - // work, but can't have it set if we want this printout to match - // all the other uuid printouts in super1.c, so we force swapuuid - // to 1 to make our printout match the rest of super1 +/** + * fname_from_uuid() - generate uuid string. Should not be used with super1. + * @info: info with uuid + * @buf: buf to fill. + * + * This routine should not be used with super1. See detail_fname_from_uuid() for details. It does + * not use superswitch swapuuid as it should be 0 but it has to do UUID conversion if host is big + * endian- left for backward compatibility. + */ +char *fname_from_uuid(struct mdinfo *info, char *buf) +{ #if __BYTE_ORDER == BIG_ENDIAN - return __fname_from_uuid(info->uuid, 1, buf, sep); + return __fname_from_uuid(info->uuid, true, buf, ':'); #else - return __fname_from_uuid(info->uuid, (st->ss == &super1) ? 1 : - st->ss->swapuuid, buf, sep); + return __fname_from_uuid(info->uuid, false, buf, ':'); #endif } -- 2.35.3
It is not set, so it should be 0 but it may vary on compilation settings. Set it always to 0. metadata should care to set UUID and read in proper endianness so it doesn't follow super1 concept of swapuuid to depend on endianness. It is not an attempt to fix endianness issues. Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> --- super-ddf.c | 1 + super-intel.c | 1 + super0.c | 2 ++ 3 files changed, 4 insertions(+) diff --git a/super-ddf.c b/super-ddf.c index 7571e3b740c6..94ac5ff3965a 100644 --- a/super-ddf.c +++ b/super-ddf.c @@ -5162,6 +5162,7 @@ struct superswitch super_ddf = { .default_geometry = default_geometry_ddf, .external = 1, + .swapuuid = 0, /* for mdmon */ .open_new = ddf_open_new, diff --git a/super-intel.c b/super-intel.c index 7714045575b2..e1754f29246c 100644 --- a/super-intel.c +++ b/super-intel.c @@ -13116,6 +13116,7 @@ struct superswitch super_imsm = { .validate_ppl = validate_ppl_imsm, .external = 1, + .swapuuid = 0, .name = "imsm", /* for mdmon */ diff --git a/super0.c b/super0.c index a7c5f813d926..9b8a1bd63bb7 100644 --- a/super0.c +++ b/super0.c @@ -1369,5 +1369,7 @@ struct superswitch super0 = { .locate_bitmap = locate_bitmap0, .write_bitmap = write_bitmap0, .free_super = free_super0, + + .swapuuid = 0, .name = "0.90", }; -- 2.35.3
Commit 60c19530dd7c ("Detail: remove duplicated code") introduced regression catched by 00confnames because generated UUID was different than expected. This patchset fixes the issue. Mariusz Tkaczyk (2): mdadm: set swapuuid in all handlers mdadm: Fix native --detail --export Cc: Xiao Ni <xni@redhat.com> Cc: Jes Sorensen <jes@trained-monkey.org> Detail.c | 26 +++++++++++++++++++++++++- mdadm.h | 3 +-- super-ddf.c | 11 ++++++----- super-intel.c | 17 +++++++++-------- super0.c | 2 ++ util.c | 24 +++++++++++++----------- 6 files changed, 56 insertions(+), 27 deletions(-) -- 2.35.3
On Mon, Mar 18, 2024 at 1:18 PM Mariusz Tkaczyk
<mariusz.tkaczyk@linux.intel.com> wrote:
>
> On Sat, 16 Mar 2024 20:26:15 +0200
> Shaya Potter <spotter@gmail.com> wrote:
>
> > note: not subscribed, so please cc me on responses.
> >
> > I recently had a Dell R710 die where I was using the Perc6 to provide
> > storage to the box. As the box wasn't usable, I decided to image the
> > individual disks to a newer machine with significantly more storage.
> >
> > I sort of messed up the progress, but that might have discovered a bug in
> > mdadm.
> >
> > Background, the Dell R710 supported 6 drives, which I had as a 1TB
> > SATA SSD and 5x8TB SATA disks in a RAID5 array.
> >
> > In the process of imaging it, I I was setting up devices on /dev/loop
> > to be prepared to assemble the raid, but I think I accidentally
> > assembled the raid while imaging the last disk (which in effect caused
> > the last disk to get out of sync with the other disks. This was
> > initially ok, until the VM I was doing it on, crashed with a KVM/QEMU
> > failure (unsure what occurred).
> >
> > I was hoping, it was going to be easy to bring up the raid array
> > again, but now mdadm was segfault on a null pointer exception whenever
> > I tried to assemble the array (was just trying the RAID5 portion).
> >
> > I was thinking perhaps my VM got corrupted, but I couldn't figure that
> > out, so I decided to try and reimage the disks (more carefully this
> > time), but yes, the 5th disk was marked as in quick init, while the
> > others were more consistent.
> >
> > Howvever, same segfault was occuring, so I built mdadm from source
> > (with -g and no -O, as an aside, this would be a good Makefile target
> > to have, to make issues easier to debug)
> >
> > After understanding the issue, the segfault seems to be due to
> > Assemble.c wanting to call update_super() with a ddf super. Except
> > super-ddf.c doesn't provide that.
> >
> > i.e. in Assemble.c it was crashing at
> >
> > if (st->ss->update_super(st, &devices[j].i, UOPT_SPEC_ASSEMBLE, NULL,
> > c->verbose, 0, NULL)) {...}
> >
> > which now explained the seg fault on null pointer exception. I was
> > able to progress past the segfault (perhaps badly, but it "seems" to
> > work for me), by putting in a null check before the update_super()
> > call, i.e.
> >
> > if (st->ss->update_super && st->ss->update_super(....)) { ... }
> >
> > thoughts about my "fix" (perhaps super-ddf.c needs an empty
> > update_super function?) , if this is a bug? (perhaps its unexpected
> > for me to have gotten into this state in the first place?)
> >
>
> Hello Shaya,
> DDF is not actively developed. I'm considering dropping
> it.
> If you are interested in bringing it too life then you are
> more than welcome to send patches!
>
> If DDF doesn't implement update_super() then fix proposed by you seems to be
> valid. Please send proper patch for that then we will review it.
>
> Thanks,
> Mariusz
I'll make a proper patch in the coming days.
just to note: it is very useful for recovering from RAID arrays that
do provide that metadata. It would be a shame (IMO) to lose support
for it, as it would have made my recovery/migration efforts much more
difficult. At worst, I'd suggest marking it unmaintained, needing a
specific flag to be used which notes, since it's unmaintained, it
might go down code paths that are untested and could break in future
(i.e. what happened to me).
As a total other aside: md seems to work much better (performance
wise) when using loop devices when the loop devices are created with
direct-io support.
On Sat, 16 Mar 2024 20:26:15 +0200
Shaya Potter <spotter@gmail.com> wrote:
> note: not subscribed, so please cc me on responses.
>
> I recently had a Dell R710 die where I was using the Perc6 to provide
> storage to the box. As the box wasn't usable, I decided to image the
> individual disks to a newer machine with significantly more storage.
>
> I sort of messed up the progress, but that might have discovered a bug in
> mdadm.
>
> Background, the Dell R710 supported 6 drives, which I had as a 1TB
> SATA SSD and 5x8TB SATA disks in a RAID5 array.
>
> In the process of imaging it, I I was setting up devices on /dev/loop
> to be prepared to assemble the raid, but I think I accidentally
> assembled the raid while imaging the last disk (which in effect caused
> the last disk to get out of sync with the other disks. This was
> initially ok, until the VM I was doing it on, crashed with a KVM/QEMU
> failure (unsure what occurred).
>
> I was hoping, it was going to be easy to bring up the raid array
> again, but now mdadm was segfault on a null pointer exception whenever
> I tried to assemble the array (was just trying the RAID5 portion).
>
> I was thinking perhaps my VM got corrupted, but I couldn't figure that
> out, so I decided to try and reimage the disks (more carefully this
> time), but yes, the 5th disk was marked as in quick init, while the
> others were more consistent.
>
> Howvever, same segfault was occuring, so I built mdadm from source
> (with -g and no -O, as an aside, this would be a good Makefile target
> to have, to make issues easier to debug)
>
> After understanding the issue, the segfault seems to be due to
> Assemble.c wanting to call update_super() with a ddf super. Except
> super-ddf.c doesn't provide that.
>
> i.e. in Assemble.c it was crashing at
>
> if (st->ss->update_super(st, &devices[j].i, UOPT_SPEC_ASSEMBLE, NULL,
> c->verbose, 0, NULL)) {...}
>
> which now explained the seg fault on null pointer exception. I was
> able to progress past the segfault (perhaps badly, but it "seems" to
> work for me), by putting in a null check before the update_super()
> call, i.e.
>
> if (st->ss->update_super && st->ss->update_super(....)) { ... }
>
> thoughts about my "fix" (perhaps super-ddf.c needs an empty
> update_super function?) , if this is a bug? (perhaps its unexpected
> for me to have gotten into this state in the first place?)
>
Hello Shaya,
DDF is not actively developed. I'm considering dropping
it.
If you are interested in bringing it too life then you are
more than welcome to send patches!
If DDF doesn't implement update_super() then fix proposed by you seems to be
valid. Please send proper patch for that then we will review it.
Thanks,
Mariusz
Am Montag, dem 18.03.2024 um 09:33 +0800 schrieb Yu Kuai:
> you might need to learn some tools like blktrace or bpftrace to find out
> which thread is issuing IO to sdb1.
Thnaks for the hint, I'll play around with these tools.
Some other musings: as this is a RAID-1 array, and both sda1 and sdb1 are "identical" (both are
flagged with write-mostly), I *should* see identical write patterns to sda1 and sdb1?
If we look at my iostat output from above:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
md0 2,20 0,00 8,80 0,00 0 44 0
nvme0n1p3 3,60 0,00 9,50 0,00 0 47 0
sda1 3,80 0,00 9,50 0,00 0 47 0
sdb1 54,20 0,00 26223,10 0,00 0 131115 0
44 kb have been written to md0, the md subsystem converts these to writes to the RAID members (plus
some overhead like bitmaps and stuff)
The 47 kb written to nvme and sda1 is what I'd expect to see. But the 130 MB to sdb1 are wrong...
btw, when I run this tests on kernel 6.1.76, I get identical writes to all RAID members:
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
md0 5,40 0,00 67,20 0,00 0 336 0
nvme0n1p3 3,40 0,00 68,00 0,00 0 340 0
sda1 3,40 0,00 68,00 0,00 0 340 0
sdb1 3,40 0,00 68,00 0,00 0 340 0
Wild guess: the (external) USB device sdb1 is using a huge "transfer size", so when only a few
sectors are written to sda1, megabytes are written to sdb1?
How could I proove this?
thanks, Michael
--
Michael Reinelt <michael@reinelt.co.at>
Ringsiedlung 75
A-8111 Gratwein-Straßengel
+43 676 3079941
Hi,
在 2024/03/17 18:31, Michael Reinelt 写道:
> when I check with iotop, i can see similar high write rates, but no
> process or thread responsible for it.
you might need to learn some tools like blktrace or bpftrace to find out
which thread is issuing IO to sdb1.
Thanks,
Kuai
Hallo all,
I have a very strange behaviour on my RAID-1 array with Kernel 6.6.13.
Kernel 6.1.76 and earlier do not show this symptoms, but I *think* I
have seen it on any 6.5 Kernel (I am not sure)
the array contains my /home, and consists of an internal NVME drive, a
internal SATA device, and an external USB drive.
even if the array is (neraly) idle, I see very heave I/O on the
external USB drive, which makes the system more or less unusable.
artus:~ # iostat -p sda,sdb,nvme0n1,md0,md1 -y 5
Linux 6.6.13+bpo-amd64 (artus) 2024-03-17 _x86_64_ (12 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
0,12 0,00 0,15 0,45 0,00 99,28
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
md0 2,20 0,00 8,80 0,00 0 44 0
md1 0,00 0,00 0,00 0,00 0 0 0
nvme0n1 8,00 0,00 39,10 0,00 0 195 0
nvme0n1p1 0,00 0,00 0,00 0,00 0 0 0
nvme0n1p2 4,40 0,00 29,60 0,00 0 148 0
nvme0n1p3 3,60 0,00 9,50 0,00 0 47 0
nvme0n1p4 0,00 0,00 0,00 0,00 0 0 0
sda 3,80 0,00 9,50 0,00 0 47 0
sda1 3,80 0,00 9,50 0,00 0 47 0
sda2 0,00 0,00 0,00 0,00 0 0 0
sdb 54,20 0,00 26223,10 0,00 0 131115 0
sdb1 54,20 0,00 26223,10 0,00 0 131115 0
sdb2 0,00 0,00 0,00 0,00 0 0 0
avg-cpu: %user %nice %system %iowait %steal %idle
0,20 0,00 0,12 0,37 0,00 99,32
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
md0 0,20 0,00 0,80 0,00 0 4 0
md1 0,00 0,00 0,00 0,00 0 0 0
nvme0n1 1,60 0,00 3,70 0,00 0 18 0
nvme0n1p1 0,00 0,00 0,00 0,00 0 0 0
nvme0n1p2 0,40 0,00 2,40 0,00 0 12 0
nvme0n1p3 1,20 0,00 1,30 0,00 0 6 0
nvme0n1p4 0,00 0,00 0,00 0,00 0 0 0
sda 1,20 0,00 1,30 0,00 0 6 0
sda1 1,20 0,00 1,30 0,00 0 6 0
sda2 0,00 0,00 0,00 0,00 0 0 0
sdb 39,00 0,00 19661,50 0,00 0 98307 0
sdb1 39,00 0,00 19661,50 0,00 0 98307 0
sdb2 0,00 0,00 0,00 0,00 0 0 0
I am sure that there is no rebuild or check at the moment... and I have
no idea WTF is going on here... sometimes the write rate goes up to
>100 MB/sec
when I check with iotop, i can see similar high write rates, but no
process or thread responsible for it.
Some more information:
artus:~ # uname -a
Linux artus 6.6.13+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.6.13-1~bpo12+1 (2024-02-15) x86_64 GNU/Linux
artus:~ # mdadm --version
mdadm - v4.2 - 2021-12-30 - Debian 4.2-5
artus:~ # cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sda2[2](W) sdb2[4]
1919826944 blocks super 1.2 [2/2] [UU]
bitmap: 0/15 pages [0KB], 65536KB chunk
md0 : active raid1 sda1[5](W) nvme0n1p3[3] sdb1[4](W)
33520640 blocks super 1.2 [3/3] [UUU]
bitmap: 1/1 pages [4KB], 65536KB chunk
unused devices: <none>
note: I see similar behaviour on /dev/md1 which is a quite large device
containing virtual machines
artus:~ # smartctl -H -i -l scterc /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.6.13+bpo-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 870 EVO 2TB
Serial Number: S6P4NF0W307661D
LU WWN Device Id: 5 002538 f43329aaa
Firmware Version: SVT02B6Q
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sun Mar 17 11:06:23 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SCT Error Recovery Control:
Read: Disabled
Write: Disabled
artus:~ # smartctl -d sat,auto -H -i -l scterc /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.6.13+bpo-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: Samsung
Product: PSSD T7
Revision: 0
Compliance: SPC-4
User Capacity: 2.000.398.934.016 bytes [2,00 TB]
Logical block size: 512 bytes
LU is fully provisioned
Rotation Rate: Solid State Device
Logical Unit id: 0x5000000000000001
Serial number: K611523T0SNDT5S
Device type: disk
Local Time is: Sun Mar 17 11:13:59 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
artus:~ # smartctl -H -i -l scterc /dev/nvme0n1
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.6.13+bpo-amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: TS512GMTE400S
Serial Number: I216681028
Firmware Version: V0804S3
PCI Vendor/Subsystem ID: 0x1d79
IEEE OUI Identifier: 0x7c3548
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512.110.190.592 [512 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 7c3548 52255b6444
Local Time is: Sun Mar 17 11:08:56 2024 CET
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
artus:~ # mdadm --examine /dev/sda1
/dev/sda1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : f6189e48:e5dadfbd:8a7a9239:c6074410
Name : any:home
Creation Time : Fri Nov 15 07:07:21 2019
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 67041280 sectors (31.97 GiB 34.33 GB)
Array Size : 33520640 KiB (31.97 GiB 34.33 GB)
Data Offset : 67584 sectors
Super Offset : 8 sectors
Unused Space : before=67504 sectors, after=0 sectors
State : clean
Device UUID : 9023e389:b12c69be:abd226c2:fceba209
Internal Bitmap : 8 sectors from superblock
Flags : write-mostly
Update Time : Sun Mar 17 11:15:22 2024
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : 7958e05b - correct
Events : 864895
Device Role : Active device 2
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
artus:~ # mdadm --examine /dev/sdb1
/dev/sdb1:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : f6189e48:e5dadfbd:8a7a9239:c6074410
Name : any:home
Creation Time : Fri Nov 15 07:07:21 2019
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 67041280 sectors (31.97 GiB 34.33 GB)
Array Size : 33520640 KiB (31.97 GiB 34.33 GB)
Data Offset : 67584 sectors
Super Offset : 8 sectors
Unused Space : before=67504 sectors, after=0 sectors
State : clean
Device UUID : a5a29aca:ebfaa7b1:c484660d:bb455305
Internal Bitmap : 8 sectors from superblock
Flags : write-mostly
Update Time : Sun Mar 17 11:15:34 2024
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : f43f398c - correct
Events : 864895
Device Role : Active device 1
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
artus:~ # mdadm --examine /dev/nvme0n1p3
/dev/nvme0n1p3:
Magic : a92b4efc
Version : 1.2
Feature Map : 0x1
Array UUID : f6189e48:e5dadfbd:8a7a9239:c6074410
Name : any:home
Creation Time : Fri Nov 15 07:07:21 2019
Raid Level : raid1
Raid Devices : 3
Avail Dev Size : 67041280 sectors (31.97 GiB 34.33 GB)
Array Size : 33520640 KiB (31.97 GiB 34.33 GB)
Data Offset : 67584 sectors
Super Offset : 8 sectors
Unused Space : before=67504 sectors, after=0 sectors
State : clean
Device UUID : dae5b41c:8eaceaae:d65a1486:819723c1
Internal Bitmap : 8 sectors from superblock
Update Time : Sun Mar 17 11:16:38 2024
Bad Block Log : 512 entries available at offset 16 sectors
Checksum : 781a567b - correct
Events : 864895
Device Role : Active device 0
Array State : AAA ('A' == active, '.' == missing, 'R' == replacing)
artus:~ # mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Fri Nov 15 07:07:21 2019
Raid Level : raid1
Array Size : 33520640 (31.97 GiB 34.33 GB)
Used Dev Size : 33520640 (31.97 GiB 34.33 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Mar 17 11:19:35 2024
State : clean
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : any:home
UUID : f6189e48:e5dadfbd:8a7a9239:c6074410
Events : 864895
Number Major Minor RaidDevice State
3 259 3 0 active sync /dev/nvme0n1p3
4 8 17 1 active sync writemostly /dev/sdb1
5 8 1 2 active sync writemostly /dev/sda1
Any hints?
If you need more information, please let me know.
sunny greetings from Austria, Michael
--
Michael Reinelt <michael@reinelt.co.at>
Ringsiedlung 75
A-8111 Gratwein-Straßengel
+43 676 3079941
note: not subscribed, so please cc me on responses. I recently had a Dell R710 die where I was using the Perc6 to provide storage to the box. As the box wasn't usable, I decided to image the individual disks to a newer machine with significantly more storage. I sort of messed up the progress, but that might have discovered a bug in mdadm. Background, the Dell R710 supported 6 drives, which I had as a 1TB SATA SSD and 5x8TB SATA disks in a RAID5 array. In the process of imaging it, I I was setting up devices on /dev/loop to be prepared to assemble the raid, but I think I accidentally assembled the raid while imaging the last disk (which in effect caused the last disk to get out of sync with the other disks. This was initially ok, until the VM I was doing it on, crashed with a KVM/QEMU failure (unsure what occurred). I was hoping, it was going to be easy to bring up the raid array again, but now mdadm was segfault on a null pointer exception whenever I tried to assemble the array (was just trying the RAID5 portion). I was thinking perhaps my VM got corrupted, but I couldn't figure that out, so I decided to try and reimage the disks (more carefully this time), but yes, the 5th disk was marked as in quick init, while the others were more consistent. Howvever, same segfault was occuring, so I built mdadm from source (with -g and no -O, as an aside, this would be a good Makefile target to have, to make issues easier to debug) After understanding the issue, the segfault seems to be due to Assemble.c wanting to call update_super() with a ddf super. Except super-ddf.c doesn't provide that. i.e. in Assemble.c it was crashing at if (st->ss->update_super(st, &devices[j].i, UOPT_SPEC_ASSEMBLE, NULL, c->verbose, 0, NULL)) {...} which now explained the seg fault on null pointer exception. I was able to progress past the segfault (perhaps badly, but it "seems" to work for me), by putting in a null check before the update_super() call, i.e. if (st->ss->update_super && st->ss->update_super(....)) { ... } thoughts about my "fix" (perhaps super-ddf.c needs an empty update_super function?) , if this is a bug? (perhaps its unexpected for me to have gotten into this state in the first place?)
On Sat, Mar 16, 2024 at 02:27:02PM +0530, Swami Kevala wrote: > I have a RAID6 array made up of 8 NVMe SSD drives. I have noticed that > the write speed for the array is slower than the write speed for an > individual drive! This could be. There is some cache tuning in the folder: /sys/class/block/mdXXX/md/stripe_cache_* Two files, check for detailed docs. It might help, or not. > It is not possible for me to use trim on my array since my drives (WD > SN650) report "Write Zeroes Not Supported" So, don't trim. Apart that, "Write Zeroes Not Supported" is new to me. I guess it is always possible to write zeroes... I know about "Return Zeroes After Trim" or similar. > As per my understanding, the reason why trim on raid is complex to > implement is due to the need to recalculate the parity blocks whenever > data blocks are discarded. It is implemented if 1) the SSD support RZAT or DRAT and 2) it is enable (under user responsability). At least for RAID5, but it seems to me RAID6 as well. The module parameter: devices_handle_discard_safely Also in this case, check for detailed docs. > My question is: Would it be possible (or a good idea), to make a > version of fstrim (e.g. fstrimraid) that could discard at the stripe > level? i.e. Discard only those blocks for which all blocks in the > stripe can be discarded. > > I guess this would need to call the md api to know which file system > blocks are stored on which stripes. > > Our server is used for editing large video files, so I would expect > that a significant percentage of discard operations would result in > entire stripes being discarded at once. So I wonder if this would be a > relatively simple and effective way of improving write performance on > SSD RAIDs without having to worry about parity. > > Would be interested to know what people think. It's not clear to me the connection you see between "trim" and "write speed" in RAID6. I mean, if the SSDs have speed problems due to missing trims, then maybe better to change SSDs. On the other hand, as wrote at the beginning, maybe playing with the RAID stripe cache could help more (or not, as wrote). Hope this helps a bit, bye, pg > -- > - 9442504660 > > -- > The information contained in this electronic message and any attachments to > this message are intended for the exclusive use of the addressee(s) and may > contain proprietary, confidential or privileged information. If you are not > the intended recipient, you should not disseminate, distribute or copy this > e-mail. Please notify the sender immediately and destroy all copies of this > message and any attachments. WARNING: Computer viruses can be transmitted > via email. The recipient should check this email and any attachments for > the presence of viruses. The Organisation accepts no liability for any > damage caused by any virus transmitted by this email. > www.ishafoundation.org <http://www.ishafoundation.org> > > -- > > -- piergiorgio
I have a RAID6 array made up of 8 NVMe SSD drives. I have noticed that the write speed for the array is slower than the write speed for an individual drive! It is not possible for me to use trim on my array since my drives (WD SN650) report "Write Zeroes Not Supported" As per my understanding, the reason why trim on raid is complex to implement is due to the need to recalculate the parity blocks whenever data blocks are discarded. My question is: Would it be possible (or a good idea), to make a version of fstrim (e.g. fstrimraid) that could discard at the stripe level? i.e. Discard only those blocks for which all blocks in the stripe can be discarded. I guess this would need to call the md api to know which file system blocks are stored on which stripes. Our server is used for editing large video files, so I would expect that a significant percentage of discard operations would result in entire stripes being discarded at once. So I wonder if this would be a relatively simple and effective way of improving write performance on SSD RAIDs without having to worry about parity. Would be interested to know what people think. -- - 9442504660 -- The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The Organisation accepts no liability for any damage caused by any virus transmitted by this email. www.ishafoundation.org <http://www.ishafoundation.org> --
Hi, 在 2024/03/15 2:20, junxiao.bi@oracle.com 写道: > On 3/12/24 6:20 PM, Yu Kuai wrote: > >> Hi, >> >> 在 2024/03/13 6:56, junxiao.bi@oracle.com 写道: >>> On 3/10/24 6:50 PM, Yu Kuai wrote: >>> >>>> Hi, >>>> >>>> 在 2024/03/09 7:49, junxiao.bi@oracle.com 写道: >>>>> Here is the root cause for this issue: >>>>> >>>>> Commit 5e2cf333b7bd ("md/raid5: Wait for MD_SB_CHANGE_PENDING in >>>>> raid5d") introduced a regression, it got reverted through commit >>>>> bed9e27baf52 ("Revert "md/raid5: Wait for MD_SB_CHANGE_PENDING in >>>>> raid5d"). To fix the original issue commit 5e2cf333b7bd was fixing, >>>>> commit d6e035aad6c0 ("md: bypass block throttle for superblock >>>>> update") was created, it avoids md superblock write getting >>>>> throttled by block layer which is good, but md superblock write >>>>> could be stuck in block layer due to block flush as well, and that >>>>> is what was happening in this regression report. >>>>> >>>>> Process "md0_reclaim" got stuck while waiting IO for md superblock >>>>> write done, that IO was marked with REQ_PREFLUSH | REQ_FUA flags, >>>>> these 3 steps ( PREFLUSH, DATA and POSTFLUSH ) will be executed >>>>> before done, the hung of this process is because the last step >>>>> "POSTFLUSH" never done. And that was because of process >>>>> "md0_raid5" submitted another IO with REQ_FUA flag marked just >>>>> before that step started. To handle that IO, blk_insert_flush() >>>>> will be invoked and hit "REQ_FSEQ_DATA | REQ_FSEQ_POSTFLUSH" case >>>>> where "fq->flush_data_in_flight" will be increased. When the IO for >>>>> md superblock write was to issue "POSTFLUSH" step through >>>>> blk_kick_flush(), it found that "fq->flush_data_in_flight" was not >>>>> zero, so it will skip that step, that is expected, because flush >>>>> will be triggered when "fq->flush_data_in_flight" dropped to zero. >>>>> >>>>> Unfortunately here that inflight data IO from "md0_raid5" will >>>>> never done, because it was added into the blk_plug list of that >>>>> process, but "md0_raid5" run into infinite loop due to >>>>> "MD_SB_CHANGE_PENDING" which made it never had a chance to finish >>>>> the blk plug until "MD_SB_CHANGE_PENDING" was cleared. Process >>>>> "md0_reclaim" was supposed to clear that flag but it was stuck by >>>>> "md0_raid5", so this is a deadlock. >>>>> >>>>> Looks like the approach in the RFC patch trying to resolve the >>>>> regression of commit 5e2cf333b7bd can help this issue. Once >>>>> "md0_raid5" starts looping due to "MD_SB_CHANGE_PENDING", it should >>>>> release all its staging IO requests to avoid blocking others. Also >>>>> a cond_reschedule() will avoid it run into lockup. >>>> >>>> The analysis sounds good, however, it seems to me that the behaviour >>>> raid5d() pings the cpu to wait for 'MD_SB_CHANGE_PENDING' to be cleared >>>> is not reasonable, because md_check_recovery() must hold >>>> 'reconfig_mutex' to clear the flag. >>> >>> That's the behavior before commit 5e2cf333b7bd which was added into >>> Sep 2022, so this behavior has been with raid5 for many years. >>> >> >> Yes, it exists for a long time doesn't mean it's good. It is really >> weird to hold spinlock to wait for a mutex. > I am confused about this, where is the code that waiting mutex while > holding spinlock, wouldn't that cause a deadlock? For example, assume that other contex already holding the 'reconfig_mutex', and this can be slow, then in raid5d: md_check_recovery try lock 'reconfig_mutex' failed while (1) hold spin_lock try to issue IO, failed release spin_lock blk_flush_plug hold spin_lock So, untill other contex release the 'reconfig_mutex', and then md_check_recovery() update super_block, raid5d() will not make progress, meanwhile it will waste one cpu. Thanks, Kuai >>> >>>> >>>> Look at raid1/raid10, there are two different behaviour that seems can >>>> avoid this problem as well: >>>> >>>> 1) blk_start_plug() is delayed until all failed IO is handled. This >>>> look >>>> reasonable because in order to get better performance, IO should be >>>> handled by submitted thread as much as possible, and meanwhile, the >>>> deadlock can be triggered here. >>>> 2) if 'MD_SB_CHANGE_PENDING' is not cleared by md_check_recovery(), >>>> skip >>>> the handling of failed IO, and when mddev_unlock() is called, daemon >>>> thread will be woken up again to handle failed IO. >>>> >>>> How about the following patch? >>>> >>>> Thanks, >>>> Kuai >>>> >>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >>>> index 3ad5f3c7f91e..0b2e6060f2c9 100644 >>>> --- a/drivers/md/raid5.c >>>> +++ b/drivers/md/raid5.c >>>> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread) >>>> >>>> md_check_recovery(mddev); >>>> >>>> - blk_start_plug(&plug); >>>> handled = 0; >>>> spin_lock_irq(&conf->device_lock); >>>> while (1) { >>>> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread) >>>> int batch_size, released; >>>> unsigned int offset; >>>> >>>> + /* >>>> + * md_check_recovery() can't clear sb_flags, usually >>>> because of >>>> + * 'reconfig_mutex' can't be grabbed, wait for >>>> mddev_unlock() to >>>> + * wake up raid5d(). >>>> + */ >>>> + if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) >>>> + goto skip; >>>> + >>>> released = release_stripe_list(conf, >>>> conf->temp_inactive_list); >>>> if (released) >>>> clear_bit(R5_DID_ALLOC, &conf->cache_state); >>>> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread) >>>> spin_lock_irq(&conf->device_lock); >>>> } >>>> } >>>> +skip: >>>> pr_debug("%d stripes handled\n", handled); >>>> - >>>> spin_unlock_irq(&conf->device_lock); >>>> if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) && >>>> mutex_trylock(&conf->cache_size_mutex)) { >>>> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread) >>>> mutex_unlock(&conf->cache_size_mutex); >>>> } >>>> >>>> + blk_start_plug(&plug); >>>> flush_deferred_bios(conf); >>>> >>>> r5l_flush_stripe_to_raid(conf->log); >>> >>> This patch eliminated the benefit of blk_plug, i think it will not be >>> good for IO performance perspective? >> >> There is only one daemon thread, so IO should not be handled here as >> much as possible. The IO should be handled by the thread that is >> submitting the IO, and let daemon to hanldle the case that IO failed or >> can't be submitted at that time. > > I am not sure how much it will impact regarding drop the blk_plug. > > Song, what's your take on this? > > Thanks, > > Junxiao. > >> >> Thanks, >> Kuai >> >>> >>> >>> Thanks, >>> >>> Junxiao. >>> >>>> >>>>> >>>>> https://www.spinics.net/lists/raid/msg75338.html >>>>> >>>>> Dan, can you try the following patch? >>>>> >>>>> diff --git a/block/blk-core.c b/block/blk-core.c >>>>> index de771093b526..474462abfbdc 100644 >>>>> --- a/block/blk-core.c >>>>> +++ b/block/blk-core.c >>>>> @@ -1183,6 +1183,7 @@ void __blk_flush_plug(struct blk_plug *plug, >>>>> bool from_schedule) >>>>> if (unlikely(!rq_list_empty(plug->cached_rq))) >>>>> blk_mq_free_plug_rqs(plug); >>>>> } >>>>> +EXPORT_SYMBOL(__blk_flush_plug); >>>>> >>>>> /** >>>>> * blk_finish_plug - mark the end of a batch of submitted I/O >>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >>>>> index 8497880135ee..26e09cdf46a3 100644 >>>>> --- a/drivers/md/raid5.c >>>>> +++ b/drivers/md/raid5.c >>>>> @@ -6773,6 +6773,11 @@ static void raid5d(struct md_thread *thread) >>>>> spin_unlock_irq(&conf->device_lock); >>>>> md_check_recovery(mddev); >>>>> spin_lock_irq(&conf->device_lock); >>>>> + } else { >>>>> + spin_unlock_irq(&conf->device_lock); >>>>> + blk_flush_plug(&plug, false); >>>>> + cond_resched(); >>>>> + spin_lock_irq(&conf->device_lock); >>>>> } >>>>> } >>>>> pr_debug("%d stripes handled\n", handled); >>>>> >>>>> Thanks, >>>>> >>>>> Junxiao. >>>>> >>>>> On 3/1/24 12:26 PM, junxiao.bi@oracle.com wrote: >>>>>> Hi Dan & Song, >>>>>> >>>>>> I have not root cause this yet, but would like share some findings >>>>>> from the vmcore Dan shared. From what i can see, this doesn't look >>>>>> like a md issue, but something wrong with block layer or below. >>>>>> >>>>>> 1. There were multiple process hung by IO over 15mins. >>>>>> >>>>>> crash> ps -m | grep UN >>>>>> [0 00:15:50.424] [UN] PID: 957 TASK: ffff88810baa0ec0 CPU: 1 >>>>>> COMMAND: "jbd2/dm-3-8" >>>>>> [0 00:15:56.151] [UN] PID: 1835 TASK: ffff888108a28ec0 CPU: 2 >>>>>> COMMAND: "dd" >>>>>> [0 00:15:56.187] [UN] PID: 876 TASK: ffff888108bebb00 CPU: 3 >>>>>> COMMAND: "md0_reclaim" >>>>>> [0 00:15:56.185] [UN] PID: 1914 TASK: ffff8881015e6740 CPU: 1 >>>>>> COMMAND: "kworker/1:2" >>>>>> [0 00:15:56.255] [UN] PID: 403 TASK: ffff888101351d80 CPU: 7 >>>>>> COMMAND: "kworker/u21:1" >>>>>> >>>>>> 2. Let pick md0_reclaim to take a look, it is waiting done >>>>>> super_block update. We can see there were two pending superblock >>>>>> write and other pending io for the underling physical disk, which >>>>>> caused these process hung. >>>>>> >>>>>> crash> bt 876 >>>>>> PID: 876 TASK: ffff888108bebb00 CPU: 3 COMMAND: >>>>>> "md0_reclaim" >>>>>> #0 [ffffc900008c3d10] __schedule at ffffffff81ac18ac >>>>>> #1 [ffffc900008c3d70] schedule at ffffffff81ac1d82 >>>>>> #2 [ffffc900008c3d88] md_super_wait at ffffffff817df27a >>>>>> #3 [ffffc900008c3dd0] md_update_sb at ffffffff817df609 >>>>>> #4 [ffffc900008c3e20] r5l_do_reclaim at ffffffff817d1cf4 >>>>>> #5 [ffffc900008c3e98] md_thread at ffffffff817db1ef >>>>>> #6 [ffffc900008c3ef8] kthread at ffffffff8114f8ee >>>>>> #7 [ffffc900008c3f30] ret_from_fork at ffffffff8108bb98 >>>>>> #8 [ffffc900008c3f50] ret_from_fork_asm at ffffffff81000da1 >>>>>> >>>>>> crash> mddev.pending_writes,disks 0xffff888108335800 >>>>>> pending_writes = { >>>>>> counter = 2 <<<<<<<<<< 2 active super block write >>>>>> }, >>>>>> disks = { >>>>>> next = 0xffff88810ce85a00, >>>>>> prev = 0xffff88810ce84c00 >>>>>> }, >>>>>> crash> list -l md_rdev.same_set -s md_rdev.kobj.name,nr_pending >>>>>> 0xffff88810ce85a00 >>>>>> ffff88810ce85a00 >>>>>> kobj.name = 0xffff8881067c1a00 "dev-dm-1", >>>>>> nr_pending = { >>>>>> counter = 0 >>>>>> }, >>>>>> ffff8881083ace00 >>>>>> kobj.name = 0xffff888100a93280 "dev-sde", >>>>>> nr_pending = { >>>>>> counter = 10 <<<< >>>>>> }, >>>>>> ffff8881010ad200 >>>>>> kobj.name = 0xffff8881012721c8 "dev-sdc", >>>>>> nr_pending = { >>>>>> counter = 8 <<<<< >>>>>> }, >>>>>> ffff88810ce84c00 >>>>>> kobj.name = 0xffff888100325f08 "dev-sdd", >>>>>> nr_pending = { >>>>>> counter = 2 <<<<< >>>>>> }, >>>>>> >>>>>> 3. From block layer, i can find the inflight IO for md superblock >>>>>> write which has been pending 955s which matches with the hung time >>>>>> of "md0_reclaim" >>>>>> >>>>>> crash> >>>>>> request.q,mq_hctx,cmd_flags,rq_flags,start_time_ns,bio,biotail,state,__data_len,flush,end_io >>>>>> ffff888103b4c300 >>>>>> q = 0xffff888103a00d80, >>>>>> mq_hctx = 0xffff888103c5d200, >>>>>> cmd_flags = 38913, >>>>>> rq_flags = 139408, >>>>>> start_time_ns = 1504179024146, >>>>>> bio = 0x0, >>>>>> biotail = 0xffff888120758e40, >>>>>> state = MQ_RQ_COMPLETE, >>>>>> __data_len = 0, >>>>>> flush = { >>>>>> seq = 3, <<<< REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA >>>>>> saved_end_io = 0x0 >>>>>> }, >>>>>> end_io = 0xffffffff815186e0 <mq_flush_data_end_io>, >>>>>> >>>>>> crash> p tk_core.timekeeper.tkr_mono.base >>>>>> $1 = 2459916243002 >>>>>> crash> eval 2459916243002-1504179024146 >>>>>> hexadecimal: de86609f28 >>>>>> decimal: 955737218856 <<<<<<< IO pending time is 955s >>>>>> octal: 15720630117450 >>>>>> binary: >>>>>> 0000000000000000000000001101111010000110011000001001111100101000 >>>>>> >>>>>> crash> bio.bi_iter,bi_end_io 0xffff888120758e40 >>>>>> bi_iter = { >>>>>> bi_sector = 8, <<<< super block offset >>>>>> bi_size = 0, >>>>>> bi_idx = 0, >>>>>> bi_bvec_done = 0 >>>>>> }, >>>>>> bi_end_io = 0xffffffff817dca50 <super_written>, >>>>>> crash> dev -d | grep ffff888103a00d80 >>>>>> 8 ffff8881003ab000 sdd ffff888103a00d80 0 0 0 >>>>>> >>>>>> 4. Check above request, even its state is "MQ_RQ_COMPLETE", but it >>>>>> is still pending. That's because each md superblock write was >>>>>> marked with REQ_PREFLUSH | REQ_FUA, so it will be handled in 3 >>>>>> steps: pre_flush, data, and post_flush. Once each step complete, >>>>>> it will be marked in "request.flush.seq", here the value is 3, >>>>>> which is REQ_FSEQ_PREFLUSH | REQ_FSEQ_DATA, so the last step >>>>>> "post_flush" has not be done. Another wired thing is that >>>>>> blk_flush_queue.flush_data_in_flight is still 1 even "data" step >>>>>> already done. >>>>>> >>>>>> crash> blk_mq_hw_ctx.fq 0xffff888103c5d200 >>>>>> fq = 0xffff88810332e240, >>>>>> crash> blk_flush_queue 0xffff88810332e240 >>>>>> struct blk_flush_queue { >>>>>> mq_flush_lock = { >>>>>> { >>>>>> rlock = { >>>>>> raw_lock = { >>>>>> { >>>>>> val = { >>>>>> counter = 0 >>>>>> }, >>>>>> { >>>>>> locked = 0 '\000', >>>>>> pending = 0 '\000' >>>>>> }, >>>>>> { >>>>>> locked_pending = 0, >>>>>> tail = 0 >>>>>> } >>>>>> } >>>>>> } >>>>>> } >>>>>> } >>>>>> }, >>>>>> flush_pending_idx = 1, >>>>>> flush_running_idx = 1, >>>>>> rq_status = 0 '\000', >>>>>> flush_pending_since = 4296171408, >>>>>> flush_queue = {{ >>>>>> next = 0xffff88810332e250, >>>>>> prev = 0xffff88810332e250 >>>>>> }, { >>>>>> next = 0xffff888103b4c348, <<<< the request is in this list >>>>>> prev = 0xffff888103b4c348 >>>>>> }}, >>>>>> flush_data_in_flight = 1, >>>>>> still 1 >>>>>> flush_rq = 0xffff888103c2e000 >>>>>> } >>>>>> >>>>>> crash> list 0xffff888103b4c348 >>>>>> ffff888103b4c348 >>>>>> ffff88810332e260 >>>>>> >>>>>> crash> request.tag,state,ref 0xffff888103c2e000 >>>> flush_rq of >>>>>> hw queue >>>>>> tag = -1, >>>>>> state = MQ_RQ_IDLE, >>>>>> ref = { >>>>>> counter = 0 >>>>>> }, >>>>>> >>>>>> 5. Looks like the block layer or underlying(scsi/virtio-scsi) may >>>>>> have some issue which leading to the io request from md layer >>>>>> stayed in a partial complete statue. I can't see how this can be >>>>>> related with the commit bed9e27baf52 ("Revert "md/raid5: Wait for >>>>>> MD_SB_CHANGE_PENDING in raid5d"") >>>>>> >>>>>> >>>>>> Dan, >>>>>> >>>>>> Are you able to reproduce using some regular scsi disk, would like >>>>>> to rule out whether this is related with virtio-scsi? >>>>>> >>>>>> And I see the kernel version is 6.8.0-rc5 from vmcore, is this the >>>>>> official mainline v6.8-rc5 without any other patches? >>>>>> >>>>>> >>>>>> Thanks, >>>>>> >>>>>> Junxiao. >>>>>> >>>>>> On 2/23/24 6:13 PM, Song Liu wrote: >>>>>>> Hi, >>>>>>> >>>>>>> On Fri, Feb 23, 2024 at 12:07 AM Linux regression tracking (Thorsten >>>>>>> Leemhuis) <regressions@leemhuis.info> wrote: >>>>>>>> On 21.02.24 00:06, Dan Moulding wrote: >>>>>>>>> Just a friendly reminder that this regression still exists on the >>>>>>>>> mainline. It has been reverted in 6.7 stable. But I upgraded a >>>>>>>>> development system to 6.8-rc5 today and immediately hit this issue >>>>>>>>> again. Then I saw that it hasn't yet been reverted in Linus' tree. >>>>>>>> Song Liu, what's the status here? I aware that you fixed with >>>>>>>> quite a >>>>>>>> few regressions recently, but it seems like resolving this one is >>>>>>>> stalled. Or were you able to reproduce the issue or make some >>>>>>>> progress >>>>>>>> and I just missed it? >>>>>>> Sorry for the delay with this issue. I have been occupied with some >>>>>>> other stuff this week. >>>>>>> >>>>>>> I haven't got luck to reproduce this issue. I will spend more >>>>>>> time looking >>>>>>> into it next week. >>>>>>> >>>>>>>> And if not, what's the way forward here wrt to the release of 6.8? >>>>>>>> Revert the culprit and try again later? Or is that not an option >>>>>>>> for one >>>>>>>> reason or another? >>>>>>> If we don't make progress with it in the next week, we will do >>>>>>> the revert, >>>>>>> same as we did with stable kernels. >>>>>>> >>>>>>>> Or do we assume that this is not a real issue? That it's caused >>>>>>>> by some >>>>>>>> oddity (bit-flip in the metadata or something like that?) only >>>>>>>> to be >>>>>>>> found in Dan's setup? >>>>>>> I don't think this is because of oddities. Hopefully we can get more >>>>>>> information about this soon. >>>>>>> >>>>>>> Thanks, >>>>>>> Song >>>>>>> >>>>>>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression >>>>>>>> tracker' hat) >>>>>>>> -- >>>>>>>> Everything you wanna know about Linux kernel regression tracking: >>>>>>>> https://linux-regtracking.leemhuis.info/about/#tldr >>>>>>>> If I did something stupid, please tell me, as explained on that >>>>>>>> page. >>>>>>>> >>>>>>>> #regzbot poke >>>>>>>> >>>>> >>>>> . >>>>> >>>> >>> . >>> >> > . >
Hi, 在 2024/03/15 0:12, Dan Moulding 写道: >> How about the following patch? >> >> Thanks, >> Kuai >> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >> index 3ad5f3c7f91e..0b2e6060f2c9 100644 >> --- a/drivers/md/raid5.c >> +++ b/drivers/md/raid5.c >> @@ -6720,7 +6720,6 @@ static void raid5d(struct md_thread *thread) >> >> md_check_recovery(mddev); >> >> - blk_start_plug(&plug); >> handled = 0; >> spin_lock_irq(&conf->device_lock); >> while (1) { >> @@ -6728,6 +6727,14 @@ static void raid5d(struct md_thread *thread) >> int batch_size, released; >> unsigned int offset; >> >> + /* >> + * md_check_recovery() can't clear sb_flags, usually >> because of >> + * 'reconfig_mutex' can't be grabbed, wait for >> mddev_unlock() to >> + * wake up raid5d(). >> + */ >> + if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) >> + goto skip; >> + >> released = release_stripe_list(conf, >> conf->temp_inactive_list); >> if (released) >> clear_bit(R5_DID_ALLOC, &conf->cache_state); >> @@ -6766,8 +6773,8 @@ static void raid5d(struct md_thread *thread) >> spin_lock_irq(&conf->device_lock); >> } >> } >> +skip: >> pr_debug("%d stripes handled\n", handled); >> - >> spin_unlock_irq(&conf->device_lock); >> if (test_and_clear_bit(R5_ALLOC_MORE, &conf->cache_state) && >> mutex_trylock(&conf->cache_size_mutex)) { >> @@ -6779,6 +6786,7 @@ static void raid5d(struct md_thread *thread) >> mutex_unlock(&conf->cache_size_mutex); >> } >> >> + blk_start_plug(&plug); >> flush_deferred_bios(conf); >> >> r5l_flush_stripe_to_raid(conf->log); > > I can confirm that this patch also works. I'm unable to reproduce the > hang after applying this instead of the first patch provided by > Junxiao. So looks like both ways are succesful in avoiding the hang. > Thanks a lot for the testing! Can you also give following patch a try? It removes the change to blk_plug, because Dan and Song are worried about performance degradation, so we need to verify the performance before consider that patch. Anyway, I think following patch can fix this problem as well. Thanks, Kuai diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index 3ad5f3c7f91e..ae8665be9940 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -6728,6 +6728,9 @@ static void raid5d(struct md_thread *thread) int batch_size, released; unsigned int offset; + if (test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) + goto skip; + released = release_stripe_list(conf, conf->temp_inactive_list); if (released) clear_bit(R5_DID_ALLOC, &conf->cache_state); @@ -6766,6 +6769,7 @@ static void raid5d(struct md_thread *thread) spin_lock_irq(&conf->device_lock); } } +skip: pr_debug("%d stripes handled\n", handled); spin_unlock_irq(&conf->device_lock); > -- Dan > . >
On Thu, Mar 14, 2024 at 11:20 AM <junxiao.bi@oracle.com> wrote: > [...] > >> > >> This patch eliminated the benefit of blk_plug, i think it will not be > >> good for IO performance perspective? > > > > There is only one daemon thread, so IO should not be handled here as > > much as possible. The IO should be handled by the thread that is > > submitting the IO, and let daemon to hanldle the case that IO failed or > > can't be submitted at that time. raid5 can have multiple threads calling handle_stripe(). See raid5_do_work(). Only chunk_aligned_read() can be handled in raid5_make_request. > > I am not sure how much it will impact regarding drop the blk_plug. > > Song, what's your take on this? I think we need to evaluate the impact of (removing) blk_plug. We had some performance regressions related to blk_plug a couple years ago. Thanks, Song