linux-btrfs.vger.kernel.org archive mirror
* [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS
@ 2022-06-21 13:29 Zhang Boyang
  2022-06-21 13:40 ` Zhang Boyang
  2022-06-21 13:56 ` Phillip Susi
  0 siblings, 2 replies; 5+ messages in thread
From: Zhang Boyang @ 2022-06-21 13:29 UTC (permalink / raw)
  To: linux-btrfs

Hello,

I'm planning to implement an error correction feature for btrfs. This 
feature can repair (limited) data corruption on disk. Data corruption 
can be caused by, for example, faulty disks, faulty RAM, or even cosmic 
rays. For RAID systems, the benefit of error correction is small 
because we store multiple copies of the data and can easily repair a 
corrupted copy from an intact one. But for single-disk systems, or 
systems without ECC memory, this feature can be useful. The error 
correction feature can help users repair their data when something bad 
happens.

The key idea is to reuse the space of checksums (ctree) to store error 
correction codes, (optionally) alongside the existing checksums.

Here are some detailed approaches:

1) Zero-cost approach 
======================================================

In fact it's not zero cost :) In this approach, the only requirement is 
that btrfs uses a crypto hash as its checksum, e.g. SHA256 or BLAKE2b, 
so it also applies to existing btrfs instances. The error correction 
capacity is 1 byte. The error-correcting algorithm is brute force, as below:

char data[N] = ...;  // data on disk; assume only one byte of it is corrupted
char csum[32] = ...; // checksum in ctree
for (int i = 0; i < N; i++)          // brute-force error location
   for (int j = 0; j < 0x100; j++) { // brute-force byte value at location i
     char buf[N];
     memcpy(buf, data, N);
     buf[i] = j;
     if (checksum(buf) equals csum[]) {
       return SUCCESS; // data in buf[] is good data, repair succeeded
     }
   }
return FAIL; // search space exhausted, failed to repair

The computational cost of repairing is computing the checksum of N bytes 
of data, N*255 times. As mkfs.btrfs limits nodesize to 65536, the maximum 
cost is approximately that of checksumming 1TB of data. For a typical 
4K sector, the cost is about that of checksumming 4GB of data. This 
brute-force process can be easily parallelized, reducing the repair time. 
In fact, if one had infinite computing power, the error correction 
capacity could be enlarged, as long as the brute-force search space 
remains significantly smaller than the checksum space.

Unfortunately, although I came up with this idea independently, a quick 
internet search shows there is a US patent owned by Microsoft that 
describes a very similar algorithm (Link: 
https://patents.google.com/patent/US9152502 ). I don't know whether this 
trivial brute-force error-correcting algorithm is affected by that 
patent; I'd like to hear opinions about this.


2) Low-cost approach: 
======================================================

In this approach, the checksum is generated by truncating a crypto hash 
to 32-T bytes and appending T bytes (xv[T]) to that hash value. See the 
figure below:

   checksum = [-------CRYPTO-HASH-------][xv[0], xv[1], ...., xv[T-1]]
              <----- (32 - T) bytes ----><--------  T bytes --------->
              <--------------------- 32 bytes ----------------------->

xv[i] is calculated by XORing all data[j] where j % T == i; see the code below:

char data[N] = ...; // good data on disk
char xv[T];
memset(xv, 0, T);
for (int i = 0; i < N; i++)
   xv[i % T] ^= data[i];

Here is the example of T==8:
    let's view the data array d[] as (N/8) x 8 matrix:
        d[ 0], d[ 1], d[ 2], d[ 3], d[ 4], d[ 5], d[ 6], d[ 7],
        d[ 8], d[ 9], d[10], d[11], d[12], d[13], d[14], d[15],
        d[16], d[17], d[18], d[19], d[20], d[21], d[22], d[23],
   xor  d[24], d[25], d[26], d[27], d[28], d[29], d[30], ...
  ----------------------------------------------------------------
        xv[0], xv[1], xv[2], xv[3], xv[4], xv[5], xv[6], xv[7]
    then
        xv[0] = xor(column 0) = d[0]^d[8]^d[16]^d[24]^...
        xv[1] = xor(column 1) = d[1]^d[9]^d[17]^d[25]^...
        ...

This approach can correct at most T consecutive bytes. Formally, let 
d[i0], d[i1], d[i2], ..., d[iE] be the corrupted bytes; then error 
correction can be done if and only if max(i0, i1, ..., iE) - min(i0, i1, 
..., iE) < T. The error-correcting algorithm is also brute force, as below:

char data[N] = ...;  // data on disk; assume at most T consecutive bad bytes
char csum[32] = ...; // checksum in ctree: 32-T bytes of hash and T bytes of xv[]
char *xv = csum + 32 - T;
for (int i = 0; i <= N - T; i++) { // brute-force the start of the error window
   char buf[N];
   memcpy(buf, data, N);

   // calculate values in error location using xv[]
   char repair[T];
   memcpy(repair, xv, T);
   for (int j = 0; j < N; j++)
     if (j < i || j >= i + T)
       repair[j % T] ^= buf[j];
   for (int j = 0; j < T; j++)
     buf[i + j] = repair[(i + j) % T];

   if (checksum(buf)[0...32-T] equals csum[0...32-T]) {
     return SUCCESS; // data in buf[] are good data, repair succeeded
   }
}
return FAIL; // search space exhausted, failed to repair

Here is the example of above algorithm when T==8, i==11:
        d[ 0], d[ 1], d[ 2], d[ 3], d[ 4], d[ 5], d[ 6], d[ 7],
        d[ 8], d[ 9], d[10], ?????, ?????, ?????, ?????, ?????,
        ?????, ?????, ?????, d[19], d[20], d[21], d[22], d[23],
   xor  d[24], d[25], d[26], d[27], d[28], d[29], d[30], ...
  ----------------------------------------------------------------
        xv[0], xv[1], xv[2], xv[3], xv[4], xv[5], xv[6], xv[7]
    then
        d[11] = xv[3] ^ d[3]^d[19]^d[27]^...
        d[12] = xv[4] ^ d[4]^d[20]^d[28]^...
        ...

The computational cost of repairing is computing the checksum of N bytes 
of data, N-T+1 times. For nodesize up to 65536, the maximum cost is 
approximately that of checksumming 4GB of data. For typical 4K sectors, 
the cost is about that of checksumming 16MB of data. Parallelization is 
also possible.

If fewer than T bytes are corrupted, this process can also be sped up by 
skipping intact positions during the brute-force search. (If 
calc_xv(bad_data)[X] == xv[X] for some X < T, we can prove there is no 
corruption in column X, so we can skip window starts with i % T == X 
while looping.)

The value of T can't be too large: if T is too large, the remaining 
(32-T)-byte hash becomes too weak. Personally, I like T=8.

The calculation of xv[] can be SIMDed easily, so there is little extra 
cost when doing disk writes.

Unfortunately, since this algorithm uses brute force, it might also be 
affected by the patent issue mentioned above. But even if this algorithm 
is affected, we can still store checksums with xv[] in btrfs and create 
APIs for third-party userspace tools to repair data externally.


3) Reed-Solomon approach: 
===================================================

Reed-Solomon (RS) codes can correct multiple symbol errors. The Linux 
kernel already has support for RS codes (lib/reed_solomon). RS requires 
message size < 2**symsize (message size = data size + parity size). For 
8-bit symbols (bytes), this means message size <= 255 bytes, which is 
not acceptable for btrfs. So we have to use a 16-bit symbol size, where 
the limit is 65535 symbols = 131070 bytes.

If we choose to generate T parity symbols (= 2*T parity bytes) for the 
data, RS can fix any combination of up to T/2 corrupted symbols. 
Formally, let d[i0], d[i1], d[i2], ..., d[iE] be the corrupted bytes; 
then error correction can be done if and only if set(floor(i0/2), 
floor(i1/2), ..., floor(iE/2)).size() <= T/2.

For example, if T == 6 (12 parity bytes), then
   * 3 error bytes at d[10], d[100], d[200] can be fixed
   * 4 error bytes at d[9], d[10], d[100], d[200] can NOT be fixed
   * 4 error bytes at d[10], d[11], d[100], d[200] can be fixed
   * 6 error bytes at d[10], d[11], d[100], d[101], d[200], d[201] can be fixed

However, the RS coder is very slow, because the complexity is O(N*T*T) 
and it makes use of Galois field arithmetic, which is slow. A simple 
test with T=6 only gives me 180MB/s on my machine.


Conclusion
===================================================

Personally, I prefer approaches 2 and 3. Since I'm new to kernel 
development, any comments or suggestions are welcome.


Best Regards,
Zhang Boyang

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS
  2022-06-21 13:29 [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS Zhang Boyang
@ 2022-06-21 13:40 ` Zhang Boyang
  2022-06-21 13:56 ` Phillip Susi
  1 sibling, 0 replies; 5+ messages in thread
From: Zhang Boyang @ 2022-06-21 13:40 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 92 bytes --]

Hi,

Here are runnable example programs for the first two approaches.

Best Regards,
Zhang Boyang

[-- Attachment #2: 1.c --]
[-- Type: text/x-csrc, Size: 2174 bytes --]

// gcc -O2 -Wall 1.c -lssl -lcrypto && ./a.out

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <openssl/evp.h>

void digest(void *out, size_t out_len, void *in, size_t in_len)
{
    unsigned char md_value[EVP_MAX_MD_SIZE];
    EVP_MD_CTX *mdctx;
    const EVP_MD *md = EVP_blake2b512();
    mdctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(mdctx, md, NULL);
    EVP_DigestUpdate(mdctx, in, in_len);
    EVP_DigestFinal_ex(mdctx, md_value, NULL);
    EVP_MD_CTX_free(mdctx);
    memcpy(out, md_value, out_len);
}

void hexdump(void *p, size_t n)
{
    unsigned char *uc = p;
    for (size_t i = 0; i < n; i++)
        printf("%02x", uc[i]);
}

#define FAIL 0
#define SUCCESS 1

///////////////////////////////////////////////////////////////////////////

#define N 4096

void checksum(void *out, void *in)
{
    digest(out, 32, in, N);
}

char data[N];  // data on disk, assume only one byte in it is corrupted
char csum[32]; // checksum in ctree
int repair()
{
    for (int i = 0; i < N; i++) // brute-force error location
      for (int j = 0; j < 0x100; j++) { // brute-force byte value at location i
        char buf[N];
        memcpy(buf, data, N);
        buf[i] = j;
        char new_csum[32];
        checksum(new_csum, buf);
        if (memcmp(csum, new_csum, 32) == 0) {
          memcpy(data, buf, N);
          return SUCCESS; // data in buf[] are good data, repair succeeded
        }
      }
    return FAIL; // search space exhausted, failed to repair
}

int main()
{
    // get random data
    FILE *fp = fopen("/dev/urandom", "r");
    assert(fp);
    assert(fread(data, 1, N, fp) == N);
    fclose(fp);

    // calc checksum of good data
    checksum(csum, data);
    printf("good\t"); hexdump(csum, 32); printf("\n");

    // corrupt some data
    data[N-1] = 200;
    //data[101] = 200;

    // calc checksum of bad data
    char bad_csum[32];
    checksum(bad_csum, data);
    printf("bad\t"); hexdump(bad_csum, 32); printf("\n");

    // try repair
    int result = repair();
    printf("%s\t", result ? "SUCCESS" : "FAIL");
    char repair_csum[32];
    checksum(repair_csum, data);
    hexdump(repair_csum, 32); printf("\n");

    return 0;
}

[-- Attachment #3: 2.c --]
[-- Type: text/x-csrc, Size: 2593 bytes --]

// gcc -O2 -Wall 2.c -lssl -lcrypto && ./a.out

#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <openssl/evp.h>

void digest(void *out, size_t out_len, void *in, size_t in_len)
{
    unsigned char md_value[EVP_MAX_MD_SIZE];
    EVP_MD_CTX *mdctx;
    const EVP_MD *md = EVP_blake2b512();
    mdctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(mdctx, md, NULL);
    EVP_DigestUpdate(mdctx, in, in_len);
    EVP_DigestFinal_ex(mdctx, md_value, NULL);
    EVP_MD_CTX_free(mdctx);
    memcpy(out, md_value, out_len);
}

void hexdump(void *p, size_t n)
{
    unsigned char *uc = p;
    for (size_t i = 0; i < n; i++)
        printf("%02x", uc[i]);
}

#define FAIL 0
#define SUCCESS 1

///////////////////////////////////////////////////////////////////////////

#define N 4096
#define T 8

void checksum(void *out, void *in)
{
    char *xv = (char *)out + 32 - T;
    memset(xv, 0, T);
    for (int i = 0; i < N; i++)
      xv[i % T] ^= ((char *)in)[i];
    digest(out, 32 - T, in, N);
}

char data[N];  // data on disk, assume at most T consecutive bad bytes
char csum[32]; // checksum in ctree, 32-T bytes hash and T bytes xv[]
int repair()
{
    char *xv = csum + 32 - T;
    for (int i = 0; i <= N - T; i++) { // brute-force the begin of error location
      char buf[N];
      memcpy(buf, data, N);

      // calculate values in error location using xv[]
      char repair[T];
      memcpy(repair, xv, T);
      for (int j = 0; j < N; j++)
        if (j < i || j >= i + T)
          repair[j % T] ^= buf[j];
      for (int j = 0; j < T; j++)
        buf[i + j] = repair[(i + j) % T];

      char new_csum[32];
      checksum(new_csum, buf);
      if (memcmp(csum, new_csum, 32 - T) == 0) {
        memcpy(data, buf, N);
        return SUCCESS; // data in buf[] are good data, repair succeeded
      }
    }
    return FAIL; // search space exhausted, failed to repair
}

int main()
{
    // get random data
    FILE *fp = fopen("/dev/urandom", "r");
    assert(fp);
    assert(fread(data, 1, N, fp) == N);
    fclose(fp);

    // calc checksum of good data
    checksum(csum, data);
    printf("good\t"); hexdump(csum, 32); printf("\n");

    // corrupt some data
    data[N-1] = 200;
    data[N-8] = 200;
    //data[N-9] = 200;

    // calc checksum of bad data
    char bad_csum[32];
    checksum(bad_csum, data);
    printf("bad\t"); hexdump(bad_csum, 32); printf("\n");

    // try repair
    int result = repair();
    printf("%s\t", result ? "SUCCESS" : "FAIL");
    char repair_csum[32];
    checksum(repair_csum, data);
    hexdump(repair_csum, 32); printf("\n");

    return 0;
}

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS
  2022-06-21 13:29 [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS Zhang Boyang
  2022-06-21 13:40 ` Zhang Boyang
@ 2022-06-21 13:56 ` Phillip Susi
  2022-06-21 14:25   ` David Sterba
  1 sibling, 1 reply; 5+ messages in thread
From: Phillip Susi @ 2022-06-21 13:56 UTC (permalink / raw)
  To: Zhang Boyang; +Cc: linux-btrfs


Zhang Boyang <zhangboyang.id@gmail.com> writes:

> Here are some detailed approaches:
>
> 1) Zero-cost approach

<snip>

> 2) Low-cost approach:

Both of these home brew methods rely on guess and check to repair, which
has a high computational cost and the possibility of false repair.  It
also requires that only a minuscule amount of silent corruption has
occurred.  Disk drives already have highly robust error correction and
detection in place, so it is almost unheard of for them to silently
corrupt data; rather, they refuse to read the whole sector.  If
corrupt data somehow managed to pass the error detection code in the
drive, it is highly likely that a lot more than one or two bytes will be wrong.


> 3) Reed-Solomon approach:

If you are going to do error correction, this or another real FEC
algorithm is the way to go.  Also, since the drive reports corrupt
sectors, you can use RS in erasure mode, where it can correct T errors
instead of T/2.  There is a handy tool called par2 that lets you create
a small RS FEC archive of a file that you can later use to repair damaged
portions of it.

Another common behavior of drives is for them to fail outright with 100%
data loss rather than just have a few bad sectors.  For that reason, I
think that anyone who really cares about their data should be using a
raid and making regular backups rather than relying on an automatic on
the fly FEC in the filesystem.  If they don't care enough to do that
then they probably don't care enough to think the cost of online FEC is
worth it either.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS
  2022-06-21 13:56 ` Phillip Susi
@ 2022-06-21 14:25   ` David Sterba
  2022-06-21 16:39     ` Zhang Boyang
  0 siblings, 1 reply; 5+ messages in thread
From: David Sterba @ 2022-06-21 14:25 UTC (permalink / raw)
  To: Phillip Susi; +Cc: Zhang Boyang, linux-btrfs

On Tue, Jun 21, 2022 at 09:56:48AM -0400, Phillip Susi wrote:
> 
> Zhang Boyang <zhangboyang.id@gmail.com> writes:
> 
> > Here are some detailed approaches:
> >
> > 1) Zero-cost approach
> 
> <snip>
> 
> > 2) Low-cost approach:
> 
> Both of these home brew methods rely on guess and check to repair, which
> has a high computational cost and the possibility of false repair.  It
> also requires that only a minuscule amount of silent corruption has
> occurred.  Disk drives already have highly robust error correction and
> detection in place, so it is almost unheard of for them to silently
> corrupt data; rather, they refuse to read the whole sector.  If
> corrupt data somehow managed to pass the error detection code in the
> drive, it is highly likely that a lot more than one or two bytes will be wrong.
> 
> 
> > 3) Reed-Solomon approach:
> 
> If you are going to do error correction, this or another real FEC
> algorithm is the way to go.  Also, since the drive reports corrupt
> sectors, you can use RS in erasure mode, where it can correct T errors
> instead of T/2.  There is a handy tool called par2 that lets you create
> a small RS FEC archive of a file that you can later use to repair damaged
> portions of it.
> 
> Another common behavior of drives is for them to fail outright with 100%
> data loss rather than just have a few bad sectors.  For that reason, I
> think that anyone who really cares about their data should be using a
> raid and making regular backups rather than relying on an automatic on
> the fly FEC in the filesystem.  If they don't care enough to do that
> then they probably don't care enough to think the cost of online FEC is
> worth it either.

I basically agree with all the points above. Adding FEC is an
interesting idea, but it's another layer with its own correction, and
there are other mechanisms that protect against isolated data loss.

The introduction explaining why to implement this only lightly touches
on the causes of the corruption, but I think this is the crucial part.
For a security feature this would be the threat model; for storage and
correctness it's the equivalent, whatever the name.

The most common source of byte-level errors is faulty RAM, one or two
bits, and FEC can't fix all problems once a corrupted value spreads in
the structures. That's why we don't have automatic repair even for such
a "simple" type of error. If there are more bits corrupted in different
bytes, the repair code would have to be upgraded for that.

Storage devices tend to lose data in bigger chunks, so we'd have to have
large correction codes, but I think device-level redundancy solves that
in a more robust way.

Ad 1, we can already do brute-force repair, enumerating bit-flipped
values for the data and matching against the checksum, but it's
unreliable guesswork.

Ad 2 and 3, these are costly compared to a plain checksum and to
additional measures like redundant devices and profiles.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS
  2022-06-21 14:25   ` David Sterba
@ 2022-06-21 16:39     ` Zhang Boyang
  0 siblings, 0 replies; 5+ messages in thread
From: Zhang Boyang @ 2022-06-21 16:39 UTC (permalink / raw)
  To: dsterba, Phillip Susi, linux-btrfs

Hi,

On 2022/6/21 21:56, Phillip Susi wrote:
 >> 2) Low-cost approach:
 >
 > Both of these home brew methods rely on guess and check to repair, which
 > has a high computational cost and the possibility of false repair.  It
 > also requires that only a minuscule amount of silent corruption has
 > occurred.

The computational cost of repairing is indeed high, but I think it is 
acceptable. Take approach 2 as an example. For a typical 4096-byte data 
block, the cost is nearly the same as checksumming a 16MB file, which is 
very fast. For the default setting (nodesize == 16384), the cost is 
nearly the same as checksumming a 256MB file. For the worst setting 
(nodesize == 65536), the cost is nearly the same as checksumming a 4GB 
file. It can be done in a few seconds on a typical computer, and we can 
even speed it up with multithreading.

The possibility of false repair is negligible since we are using crypto 
hashes. A misrepair would mean you had found a collision in the crypto 
hash function.

 > Disk drives already have highly robust error correction and
 > detection in place, so it is almost unheard of for them to silently
 > corrupt data; rather, they refuse to read the whole sector.  If
 > corrupt data somehow managed to pass the error detection code in the
 > drive, it is highly likely that a lot more than one or two bytes will
 > be wrong.

Indeed, disk drives are likely to refuse to read the whole sector, and 
my idea can't help if a whole sector is lost. But I don't think silent 
data corruption is rare. Link: https://en.wikipedia.org/wiki/Data_corruption

If the corrupted sector can be read from disk, and the corrupted bytes 
are very close together (i.e. burst errors), my idea may be able to 
repair them. With T=8 in approach 2, it can fix 8 consecutive corrupted 
bytes. I think that is quite a large amount of redundancy.


 >> 3) Reed-Solomon approach:
 >
 > If you are going to do error correction, this or another real FEC
 > algorithm is the way to go.  Also, since the drive reports corrupt
 > sectors, you can use RS in erasure mode where it can correct T errors
 > instead of T/2.  There is a handy tool called par2 that lets you create
 > a small RS FEC archive of a file that you can later use to repair damaged
 > portions of it.
 >

My idea can't handle this, because it only generates very few parity 
symbols (fewer than 16). So even a single lost sector exceeds the 
redundancy capacity.

 > Another common behavior of drives is for them to fail outright with 100%
 > data loss rather than just have a few bad sectors.  For that reason, I
 > think that anyone who really cares about their data should be using a
 > raid and making regular backups rather than relying on an automatic on
 > the fly FEC in the filesystem.  If they don't care enough to do that
 > then they probably don't care enough to think the cost of online FEC is
 > worth it either.

Indeed, if someone really cares about data safety, that person should 
use RAID to protect the data; RAID can handle data corruption far better 
than FEC. But there are situations where the user wants a cheaper 
solution. Think about a disk full of games, music, and movies, or even 
server logs or security camera videos. The value of the data is low, but 
it is annoying if the data really gets corrupted. FEC can help in these 
situations.


On 2022/6/21 22:25, David Sterba wrote:
> I basically agree with all the points above. Adding the FEC is an
> interesting idea but it's another layer with own correction and there
> are other mechanisms that protect against isolated data loss.
> 
> The introduction why to implement this lightly touches on the reasons of
> the corruptions, but I think this is the crucial part. For a security
> feature this would be the threat model, for storage and correctness it's
> an equivalent I don't know name for.

I think the `threat model' is the same as for the checksum/ctree itself. 
We have checksums in the ctree because we want to detect errors. FEC 
goes further: it can fix (some of) these errors.

> Most common source of byte-level errors is faulty RAM, one or two bits,
> and FEC can't fix all problems once a corrupted value spreads in the
> structures. That's why we don't have automatic repair even for such a
> "simple" type of error. If there are more bits corrupeted in different
> bytes, the repair code would have to be upgraded for that.

I think the fundamental assumption of FEC is "data corrupted after 
checksumming". If a corrupted value has already spread, there is no 
reliable way to deal with it (the checksum will even say it's OK). If 
the corrupted value has not spread, checksum and FEC should be able to 
repair it reliably. I think the only susceptible part is the 
checksumming itself. As long as a random error occurring during 
checksumming cannot further corrupt data by mis-repairing, the process 
of automatic repair should be safe.

> Storage devices tend to lose data in bigger chunks, we'd have to have
> large correction codes, but I think device level redundancy solves that
> in a more robust way.

Yes. RAID already did it very well.

> 
> Ad 1, we can already do bruteforce repair, enumerating bit flipped
> values for data and matching against the checksum, but it's an
> unreliable guess work.
> 

I think it's quite reliable as long as the hash function is 
cryptographic and isn't truncated too much.

> Ad 2 and 3, costly compared to plain checksum and additional measures
> like redundant devices and profiles.

Approach 2 is really fast. The extra cost when writing data is as little 
as XORing all the data together, and we can use SIMD to do that very 
efficiently; there is no extra cost when reading data. However, a crypto 
hash is indeed required, which may be slower than non-crypto hash 
functions, but I think BLAKE2b is fast enough for daily use.

Approach 3 is very slow, and this can hardly be improved.

I think we can give the user the option to use approach 2, e.g. by 
providing a new checksum type that concatenates BLAKE2b-160 with 12 
XORed bytes, BLAKE2b-192 with 8 XORed bytes, or BLAKE2b-224 with 4 
XORed bytes. These options provide redundancy against 12, 8, or 4 
consecutive corrupted bytes, respectively.

Best Regards,
Zhang Boyang

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-06-21 16:39 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-21 13:29 [IDEA RFC] Forward error correction (FEC) / Error correction code (ECC) for BTRFS Zhang Boyang
2022-06-21 13:40 ` Zhang Boyang
2022-06-21 13:56 ` Phillip Susi
2022-06-21 14:25   ` David Sterba
2022-06-21 16:39     ` Zhang Boyang
