From: Yurii Zubrytskyi
Date: Thu, 30 May 2019 15:45:42 -0700
Subject: Re: Initial patches for Incremental FS
To: Miklos Szeredi
Cc: Eugene Zemtsov, Amir Goldstein, linux-fsdevel@vger.kernel.org
References: <20190502040331.81196-1-ezemtsov@google.com> <20190502131034.GA25007@mit.edu> <20190502132623.GU23075@ZenIV.linux.org.uk>

> With the proposed FUSE solution the following sequences would occur:
>
> kernel: if index for given block is missing, send MAP message
> userspace: if data/hash is missing for given block then download data/hash
> userspace: send MAP reply
> kernel: decompress data and verify hash based on index
>
> The kernel would not be involved in either streaming data or hash, it
> would only work with data/hash that has already been downloaded.
> Right?
>
> Or is your implementation doing streamed decompress/hash or partial blocks?
> ...
> Why does the kernel have to know the on-disk format to be able to load
> and discard parts of the index on-demand? It only needs to know which
> blocks were accessed recently and which not so recently.
>

(1) You're correct, only userspace deals with the streaming. The kernel then
sees full blocks of data (usually LZ4-compressed) and blocks of hashes.
We'd need to give the location of the hash tree instead of the individual
hash here though - verification has to go all the way to the top and even
check the signature there. And the same 5 GB file would have over 40 MB of
hashes (32 bytes of SHA2 for each 4K block), so those have to be read from
disk as well.

Overall, let's just imagine a phone with 100 apps, 100 MB each, installed
this way. That ends up being ~10 GB of data, so we'd need _at least_ 40 MB
for the index and 80 MB for hashes *in kernel*. Android now fights for each
megabyte of RAM used in the system services, so FUSE won't be able to cache
that, going back to user mode for almost all reads again.

(1 and 2) ... If FUSE were to know the on-disk format, it would be able to
simply parse and read it when needed, with as small a memory footprint as
possible. Requesting this data from user mode every time, with little
caching, defeats the whole purpose of the change.

> BTW, which interface does your fuse filesystem use? Libfuse? Raw device?

Yes, our code interacts with the raw FUSE fd via poll/read/write calls. We
have tried the multithreaded approach via duping the control fd and
FUSE_DEV_IOC_CLONE, but it didn't give much improvement - Android apps
usually aren't multithreaded, so there are at most two pending reads at once.
I've seen 10 once, but that was some kind of miracle.

And again, we have not even looked at the directory structure and stat
caching yet, neither the interface nor the memory usage. For the general case
we have to make direct disk reads from the kernel, and this forces an even
bigger part of the disk format to be defined there.
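
As a rough check of the hash and index sizes estimated above, here is a
minimal sketch. It assumes 4K blocks and 32-byte SHA2 hashes as stated, counts
only the leaf hashes (not the upper levels of the hash tree), and uses binary
units. The 16 bytes of index per block is an assumption that is not given in
the message; it is simply the value that the "40 MB for ~10 GB" figure implies.
The function and variable names are illustrative only.

```python
# Back-of-the-envelope check of the hash and index sizes quoted above.
BLOCK_SIZE = 4 * 1024    # 4K data blocks, as stated in the message
HASH_SIZE = 32           # bytes of SHA2 per block, as stated in the message
INDEX_ENTRY_SIZE = 16    # ASSUMPTION: bytes of index per block (not in the message)

MiB = 2**20
GiB = 2**30


def metadata_bytes(data_bytes, per_block_bytes, block_size=BLOCK_SIZE):
    """Bytes of per-block metadata needed to cover data_bytes of file data."""
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks * per_block_bytes


# A single 5 GB file: leaf hashes only.
one_file_hashes = metadata_bytes(5 * GiB, HASH_SIZE)

# 100 apps of 100 MB each.
all_apps_bytes = 100 * 100 * MiB
all_hashes = metadata_bytes(all_apps_bytes, HASH_SIZE)
all_index = metadata_bytes(all_apps_bytes, INDEX_ENTRY_SIZE)

print(f"5 GiB file, leaf hashes: {one_file_hashes / MiB:.1f} MiB")
print(f"100 apps, leaf hashes:   {all_hashes / MiB:.1f} MiB")
print(f"100 apps, index:         {all_index / MiB:.1f} MiB")
```

Under these assumptions the leaf hashes for the 5 GB file come to 40 MiB, and
the 100-app case needs roughly 78 MiB of hashes and 39 MiB of index, in line
with the ~80 MB and ~40 MB figures above. The upper levels of the hash tree
add a little more on top of that.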

The end result is what we got when researching FUSE - a huge chunk of FUSE
gets overspecialized to handle our own way of using it end to end, with no
real configurability (because making it configurable makes that code even
bigger and more complex).

--
Thanks, Yurii