git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>, Junio C Hamano <gitster@pobox.com>
Cc: Luat Nguyen <root@l4w.io>, git@vger.kernel.org
Subject: Re: [PATCH 4/3] ewah: adjust callers of ewah_read_mmap()
Date: Fri, 15 Jun 2018 14:23:44 -0400	[thread overview]
Message-ID: <1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com> (raw)
In-Reply-To: <20180615173155.GC3067@sigill.intra.peff.net>

On 6/15/2018 1:31 PM, Jeff King wrote:
> On Fri, Jun 15, 2018 at 09:41:46AM -0700, Junio C Hamano wrote:
>
>> Derrick Stolee <stolee@gmail.com> writes:
>>
>>> On 6/14/2018 11:44 PM, Jeff King wrote:
>>>> The return value of ewah_read_mmap() is now an ssize_t,
>>>> since we could (in theory) process up to 32GB of data. This
>>>> would never happen in practice, but a corrupt or malicious
>>>> .bitmap or index file could convince us to do so.
>>> Aside: Why a 32GB limit? Usually I see 32-bit issues limiting by 2
>>> (signed) or 4 GB (unsigned). Is there something specifically in the
>>> bitmap format that stops at this size?
>> The proposed log message for 1/3 has this bit
>>
>>    - check the size for integer overflow (this should be
>>      impossible on 64-bit, as the size is given as a 32-bit
>>      count of 8-byte words, but is possible on a 32-bit
>>      system)
>>
>> 4 Giga 8-byte words makes 32 Giga bytes, I'd guess.
> Yes, exactly. It's definitely an odd size. I think using the same
> varints we use in the packfile would be more efficient (and they're
> already variable-length records), though we tend to have few enough of
> these that it probably doesn't matter.
>
> I think a more useful change for the bitmap format would be some kind of
> index. IIRC, right now readers have to linearly scan the whole file in
> order to use a single bitmap.
>
> With Stolee's commit-metadata files, we could potentially move to
> storing reachability bitmaps as a data element there. But there are two
> complications:
>
>   - the bitmaps themselves are dependent on having an ordered list of
>     objects (one per bit) to talk about. And that's why they're
>     integrated with packfiles, since that provides a constrained universe
>     with a set ordering.
>
>   - the existing storage isn't entirely independent between bitmaps. Some
>     of them are basically "deltas" off of nearby bitmaps.

The VSTS reachability bitmap is different from the Git bitmap in a few ways.

1. It uses Roaring+Run bitmaps [1] instead of EWAH bitmap. This format 
is simpler (in my opinion) in several ways, especially in how it splits 
the bitmap into 16-bit address ranges and compresses each on its own. 
This makes set containment queries much faster, as we can navigate 
directly to the region that is important (and then binary search if that 
chunk is an "array" or "run" chunk). I say this is simpler because I can 
explain the entire compression format to you in five minutes and a 
whiteboard. The paper [2] is a simple enough read (start at the "Roaring 
Bitmap" section at the end of page 4).

2. Our file uses a chunk-based approach similar to the commit-graph 
file: one chunk is simply a list of the commits that have bitmaps. 
Another chunk is the raw binary data of all bitmaps concatenated 
together. The last chunk connects the other two chunks by a) providing 
an offset into the binary data for the bitmap at that commit, and b) the 
commit that provides a delta base for the bitmap.

If we are considering changing the reachability bitmap, then I'm very 
intrigued. I think the number one thing to consider is to use the 
multi-pack-index as a reference point (with a stable object order) so 
the objects can be repacked independently from the reachability bitmap 
computation. If we are changing the model at that level, then it is 
worth thinking about other questions, like how we index the file or how 
we compress the bitmaps.

Thanks,
-Stolee

[1] https://roaringbitmap.org/about/

[2] https://arxiv.org/abs/1603.06549
     Consistently faster and smaller compressed bitmaps with Roaring

  reply	other threads:[~2018-06-15 18:23 UTC|newest]

Thread overview: 58+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-06-14 22:59 security: potential out-of-bound read at ewah_io.c |ewah_read_mmap| Luat Nguyen
2018-06-15  3:28 ` Jeff King
2018-06-15  3:31   ` [PATCH 1/3] ewah_read_mmap: bounds-check mmap reads Jeff King
2018-06-15  9:14     ` SZEDER Gábor
2018-06-15 16:20       ` Junio C Hamano
2018-06-15 17:10         ` SZEDER Gábor
2018-06-15 17:21           ` Jeff King
2018-06-15 19:42             ` Junio C Hamano
2018-06-15 17:05     ` Junio C Hamano
2018-06-15 17:26       ` Jeff King
2018-06-15 19:44         ` Junio C Hamano
2018-06-16 14:35     ` SZEDER Gábor
2018-06-16 19:14       ` Jeff King
2018-06-15  3:31   ` [PATCH 2/3] ewah: drop ewah_deserialize function Jeff King
2018-06-15  3:32   ` [PATCH 3/3] ewah: drop ewah_serialize_native function Jeff King
2018-06-15 13:56     ` Ramsay Jones
2018-06-15 14:07       ` Ramsay Jones
2018-06-15 14:30         ` [PATCH 0/8] Delete unused methods in EWAH bitmap Derrick Stolee
2018-06-15 14:30           ` [PATCH 1/8] ewah/bitmap.c: delete unused 'bitmap_clear()' Derrick Stolee
2018-06-15 14:46             ` Ramsay Jones
2018-06-15 15:11               ` Derrick Stolee
2018-06-15 14:30           ` [PATCH 2/8] ewah/bitmap.c: delete unused 'bitmap_each_bit()' Derrick Stolee
2018-06-15 15:03             ` Ramsay Jones
2018-06-15 14:30           ` [PATCH 3/8] ewah_bitmap: delete unused 'ewah_and()' Derrick Stolee
2018-06-15 14:30           ` [PATCH 4/8] ewah_bitmap: delete unused 'ewah_and_not()' Derrick Stolee
2018-06-15 14:30           ` [PATCH 5/8] ewah_bitmap: delete unused 'ewah_not()' Derrick Stolee
2018-06-15 14:30           ` [PATCH 6/8] ewah_bitmap: delete unused 'ewah_or()' Derrick Stolee
2018-06-15 14:30           ` [PATCH 7/8] ewah_io: delete unused 'ewah_serialize()' Derrick Stolee
2018-06-15 14:30           ` [PATCH 8/8] ewah_io: delete unused 'ewah_serialize_native()' Derrick Stolee
2018-06-15 15:01             ` Ramsay Jones
2018-06-15 15:10               ` Derrick Stolee
2018-06-15 14:35           ` [PATCH 0/8] Delete unused methods in EWAH bitmap Derrick Stolee
2018-06-15 18:27           ` [PATCH v2 0/7] " Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 1/7] ewah/bitmap.c: delete unused 'bitmap_clear()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 2/7] ewah/bitmap.c: delete unused 'bitmap_each_bit()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 3/7] ewah_bitmap: delete unused 'ewah_and()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 4/7] ewah_bitmap: delete unused 'ewah_and_not()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 5/7] ewah_bitmap: delete unused 'ewah_not()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 6/7] ewah_bitmap: delete unused 'ewah_or()' Derrick Stolee
2018-06-15 18:27             ` [PATCH v2 7/7] ewah_io: delete unused 'ewah_serialize()' Derrick Stolee
2018-06-15 18:51             ` [PATCH v2 0/7] Delete unused methods in EWAH bitmap Junio C Hamano
2018-06-15 18:56               ` Derrick Stolee
2018-06-15 19:48                 ` Junio C Hamano
2018-06-15 20:35                   ` Jeff King
2018-06-15 14:15       ` [PATCH 3/3] ewah: drop ewah_serialize_native function Derrick Stolee
2018-06-15 17:51         ` Jeff King
2018-06-15 18:33           ` Junio C Hamano
2018-06-15 18:46             ` Jeff King
2018-06-15  3:44   ` [PATCH 4/3] ewah: adjust callers of ewah_read_mmap() Jeff King
2018-06-15 11:23     ` Derrick Stolee
2018-06-15 16:41       ` Junio C Hamano
2018-06-15 17:31         ` Jeff King
2018-06-15 18:23           ` Derrick Stolee [this message]
2018-06-15 20:38             ` Jeff King
2018-06-15 17:12     ` Junio C Hamano
2018-06-15 16:11   ` security: potential out-of-bound read at ewah_io.c |ewah_read_mmap| Junio C Hamano
2018-06-19 19:00 ` Dyer, Edwin
2018-06-19 19:56   ` Jeff King

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1685a358-f033-64e0-2e12-df3a1c10af19@gmail.com \
    --to=stolee@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=peff@peff.net \
    --cc=root@l4w.io \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).