Christopher B. Browne's Home Page

2. File Systems

2.1. Introduction

Linux has long been fertile ground for the creation of various sorts of file systems. The reasons for this have been manifold:

Regrettably, there is some conflict in this. There is an enormous desire for "more functionality", and that prevents kernel folk from stopping to comprehensively fix problems. Somewhere between the FibreChannel device drivers, the mapping of those to SCSI access, the VFS layer, and the various filesystems that connect to it, there are some evident holes.

My employer sponsored some reliability testing to see whether Linux on Opterons, connected to FibreChannel disk arrays, would make a viable platform for large, highly-available PostgreSQL databases. All the filesystems corrupted painfully easily, even though the hardware ought to be capable of better.

It's not going to be easy to resolve this. Supporting HA hardware requires thoroughly verifying all the details and fully supporting the SCSI and FibreChannel protocols. That would require a deceleration of Linux kernel development efforts, which is inconsistent with the way that Git is allowing larger and larger sets of contributors to cascade flurries of patches, like snow storms, on the Linux development team.

2.2. Significant File Systems

2.2.1. Virtualized Filesystems

  • AVFS - A Virtual File System

  • The Perl Filesystem (Kernel module to let you hook Perl code in to make up FSes)

    Perl vs. traditional Filesystems

    People have been known to react to the Perl Filesystem with "Why?", so I thought I'd compare the job of writing filesystems in Perl and C and let you draw your own conclusions.

    • Perl

      • Filesystem will work the same on any supported system, any supported kernel version. If somebody gives you a pre-built module, you won't even need the kernel sources.

      • Most bugs will cause error messages and meaningful syslog entries.

      • Some filesystems might be slower (but our example "Net" filesystem spends all the time waiting for servers at the other end, so it'd be just as slow in any other language).

    • Traditional

      • You need to recompile your filesystem for every combination of operating system/version where you want to use it. In most cases, this requires extensive rewriting (just look at the loadable kernel module which supports PerlFS - it tries to work on two kernel versions of the same operating system, and it contains more conditional compilation than is good for sanity).

      • Most bugs will result in a kernel panic or at best some obscure syslog entry.

      • Some filesystems might be faster.

    "Why can't you use userfs?" - I wish I could find a recent version.

    Another question I get is "Why not write a Perl NFS server instead?" - Because the NFS protocol is not flexible enough for some of the things I plan to do.

  • POrtable Dodgy Filesystems in Userland (hacK)

  • Stackable Design of File Systems

  • docfs Unified Documentation Storage and Retrieval for Linux Systems.

    And now, for something completely different...

    This project proposes creating special file systems that dynamically format documentation into the requested format. For instance, the "original source" would be in /usr/doc/sgml in SGML form. When a request is made for the manual page in /usr/man, this file system would dynamically run the SGML-to-GROFF translator, producing the manual page "on the fly." Similarly, accessing /usr/info/something would result in the SGML source being turned into TeXInfo form.
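
    The dispatch idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the converter functions are hypothetical stand-ins that merely tag the text (they are not real SGML translators), and the directory-to-converter table is invented for the example.

```python
from pathlib import PurePosixPath

# Hypothetical stand-ins for real translators (e.g. SGML-to-groff);
# they only label the text so the dispatch logic is visible.
def to_man(sgml: str) -> str:
    return "[groff] " + sgml

def to_info(sgml: str) -> str:
    return "[texinfo] " + sgml

# Which converter serves which virtual directory (assumed layout).
CONVERTERS = {
    "/usr/man": to_man,
    "/usr/info": to_info,
}

# The single "original source" store, keyed by document name.
SGML_SOURCE = {"ls": "<refentry>ls</refentry>"}

def read_virtual(path: str) -> str:
    """Render the document named by `path` in the format its directory implies."""
    p = PurePosixPath(path)
    convert = CONVERTERS.get(str(p.parent))
    if convert is None:
        raise FileNotFoundError(path)
    try:
        source = SGML_SOURCE[p.name]
    except KeyError:
        raise FileNotFoundError(path) from None
    # Conversion happens at read time; nothing formatted is stored.
    return convert(source)
```

    Reading /usr/man/ls and /usr/info/ls would then yield two renderings of the same stored source, which is the whole point of the exercise.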

  • Extending File Systems Using Stackable Templates

  • Usenetfs: A Stackable File System for Large Article Directories

  • A Scalable News Architecture on a Single Spool

  • FiST

    File System development is very difficult and time consuming. Even small changes to existing file systems require deep understanding of kernel internals, making the barrier to entry for new developers high. Moreover, porting file system code from one operating system to another is almost as difficult as the first port. Past proposals to provide extensible (stackable) file system interfaces would have simplified the development of new file systems. These proposals, however, advocated massive changes to existing operating system interfaces and existing file systems; operating system vendors and maintainers resist making any large changes to their kernels because of stability and performance concerns. As a result, file system development is still a difficult, long, and non-portable process.

    The FiST (File System Translator) system combines two methods to solve the above problems in a novel way: a set of stackable file system templates for each operating system, and a high-level language that can describe stackable file systems in a cross-platform portable fashion. Using FiST, stackable file systems need only be described once. FiST's code generation tool, fistgen, compiles a single file system description into loadable kernel modules for several operating systems (currently Solaris, Linux, and FreeBSD).

  • PyVen - for implementing Userspace Filesystems in Python, atop Coda

  • Pgfs - PostgreSQL File System

    A file system "server" that stores files in a PostgreSQL database, accesses being handled using NFS clients.

    The point of the exercise is to provide automatic versioning, so that one can compare current file "sets" to those that existed at a previous point in time, rolling forward and back as necessary.

    This provides a "pervasive" equivalent to CVS.

    Code hasn't been sighted in several years.

  • The Design and Implementation of the Inversion File System

    A filesystem implemented atop Postgres. It was slower than NFS when each update was treated as atomic under "standard" Unix/NFS semantics. When they were able to run file operations within the DBMS, it was rather a lot faster...

  • Alex Viro's Per-Process Namespaces for Linux 2.4.2

    This is based on the Plan 9 notion of namespaces.

    In effect, a namespace associates a set of mounts of filesystems with a process, rather than the traditional Unix approach of associating them with a central table for the system as a whole.

    This leads to the notion of mounting "private" filesystems that are visible only to a particular process (and perhaps its children). One thing that this would be useful for is in enhancing system security.

    For instance, if I'm using CFS to secure a directory, with the traditional Unix approach, I might use the command cattach /home/cbbrowne/secret_stuff/ secretstuff to mount the data in /home/cbbrowne/secret_stuff/ on /crypt/secretstuff. Unfortunately, anyone on the system with suitable permissions can look in /crypt/secretstuff and see the readable version of the data. That's not terribly secret; I have to be quite careful to keep my data secret!

    With a per-process namespace, the mount might be associated with a specific process, and its children. It would be invisible to other processes belonging to other users, and (for better or worse) is even invisible to processes that are not children of that environment. That's rather more secure.


    Mind you, that does not forcibly help in this particular situation since CFS behaves as a pretty much public NFS server for the host; the "mount" is for /crypt as a whole, not for each individual encrypted directory...
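
    The visibility rule for per-process mounts can be modelled with a toy in Python. This is a deliberate simplification and not how the kernel patch works: the Process class and its methods are made up for illustration, and I assume a child simply starts with a copy of its parent's mount table.

```python
class Process:
    """Toy process carrying its own mount table (a per-process namespace)."""

    def __init__(self, name, mounts=None):
        self.name = name
        # Private mapping of mount point -> source for this process.
        self.mounts = dict(mounts or {})

    def fork(self, name):
        # Assumption: a child starts with a copy of its parent's namespace.
        return Process(name, self.mounts)

    def mount(self, source, mount_point):
        # Affects only this process and children forked afterwards.
        self.mounts[mount_point] = source

    def can_see(self, mount_point):
        return mount_point in self.mounts


init = Process("init", {"/": "rootfs"})
shell = init.fork("cbbrowne-shell")
other = init.fork("other-user-shell")

# The mount is made inside one shell only.
shell.mount("/home/cbbrowne/secret_stuff", "/crypt/secretstuff")
editor = shell.fork("editor")
```

    Here the editor, forked after the mount, can see /crypt/secretstuff, while the other user's shell and init cannot.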

    The other really cool thing that starts to become more practical is the notion of mapping data structures onto virtual filesystems. For instance, you might create a "driver" that maps a DBM file so that it looks like a directory with a whole bunch of files.

    I might thus do mount -t dbm /home/cbbrowne/data/file.dbm /home/cbbrowne/mounts/file and be given the ability to do the following sorts of things:

    • List the keys via ls /home/cbbrowne/mounts/file


          key1    key2    key3   key4

    • Show the value for a key via cat /home/cbbrowne/mounts/file/key4


    • More interestingly, we might create entries via echo "value 5" > /home/cbbrowne/mounts/file/key5

    None of this would be conceptually impossible with a public namespace; the merit of the namespaces remaining private is that these sorts of isomorphisms are not blathered around publicly.
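
    The DBM-as-directory mapping above is easy to mock up in user space. The following Python sketch uses the standard dbm module; the DbmDirectory class and its method names are hypothetical, and there is no actual mount involved, only the correspondence between file operations and key/value operations.

```python
import dbm
import os
import tempfile

class DbmDirectory:
    """Present a DBM file through directory-like operations."""

    def __init__(self, path):
        # "c" opens the database read-write, creating it if needed.
        self.db = dbm.open(path, "c")

    def listdir(self):
        # Analogue of `ls`: each key appears as a file name.
        return sorted(k.decode() for k in self.db.keys())

    def read(self, name):
        # Analogue of `cat file/key`.
        return self.db[name.encode()].decode()

    def write(self, name, value):
        # Analogue of `echo value > file/key`.
        self.db[name.encode()] = value.encode()

    def close(self):
        self.db.close()


with tempfile.TemporaryDirectory() as tmp:
    d = DbmDirectory(os.path.join(tmp, "file.dbm"))
    for i in range(1, 5):
        d.write(f"key{i}", f"value {i}")
    d.write("key5", "value 5")      # the `echo "value 5" > .../key5` case
    listing = d.listdir()
    value4 = d.read("key4")
    d.close()
```

    A real "driver" would sit behind the VFS rather than behind method calls, but the operations it would have to translate are the same three.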

  • There are a number of cryptographic filesystems wherein a virtual filesystem is somehow authenticated at mount time and made accessible to the user.

  • AtFS - Attribute Filesystems - provides uniform access to immutable revisions of files

  • Loopback FS using AES

    Allowing use of AES encryption for filesystems...



  • LinFS/SQLFS page

  • MaLinux

    LUFS (Linux Userland FileSystem) is a hybrid userspace filesystem framework supporting an indefinite number of filesystems (localfs, sshfs, ftpfs, cardfs and cefs implemented so far) transparently for any application.

    For instance, consider ftpfs, FTP File System, which is a Linux kernel module, enhancing the VFS with FTP volume mounting capabilities. That is, you can "mount" FTP shared directories in your very personal file system and take advantage of local file operations.


    • LoCaseFS provides a lowercase mapping of the local file system. It comes in handy when importing win32 source trees on *nix systems.

    • SshFS is probably the most advanced LUFS file system because of its security, usefulness and completeness. It is based on the SFTP protocol and requires openssh. You can mount remote file systems accessible through sftp (scp utility).


    • You mount a gnetfs in ~/gnet. You wait a couple of minutes so it can establish its peer connections. You start a search by creating a subdirectory of SEARCH: mkdir "~/gnet/SEARCH/metallica mp3". You wait a few seconds for the results to accumulate. Then you chdir to "SEARCH/metallica mp3" and try an ls; surprise - the files are there!

      You shoot up mpg123 and enjoy... You are happy.

  • Storage

    A project to replace the traditional filesystem with a new document store.

    The idea is to store data as BLOBs in a relational database, notably PostgreSQL, along with document attributes. Users would then look for documents based on the attributes, as opposed to designing (usually badly) a hierarchy.
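
    To make the attribute-lookup idea concrete, here is a small Python sketch. It is not the project's schema: the table layout and the store/find helpers are my own invention, and SQLite (in memory) stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL to keep this self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER PRIMARY KEY, body BLOB NOT NULL);
    CREATE TABLE attributes (
        doc_id INTEGER NOT NULL REFERENCES documents(id),
        name   TEXT NOT NULL,
        value  TEXT NOT NULL
    );
""")

def store(body: bytes, **attrs: str) -> int:
    """Save a document body along with its descriptive attributes."""
    cur = conn.execute("INSERT INTO documents (body) VALUES (?)", (body,))
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attributes (doc_id, name, value) VALUES (?, ?, ?)",
        [(doc_id, k, v) for k, v in attrs.items()],
    )
    return doc_id

def find(**attrs: str) -> list:
    """Return bodies of documents matching every given attribute."""
    sql = "SELECT body FROM documents d WHERE 1=1"
    params = []
    for k, v in attrs.items():
        # One EXISTS clause per requested attribute; all must match.
        sql += (" AND EXISTS (SELECT 1 FROM attributes a"
                " WHERE a.doc_id = d.id AND a.name = ? AND a.value = ?)")
        params += [k, v]
    return [row[0] for row in conn.execute(sql + " ORDER BY d.id", params)]

store(b"Q3 budget figures", author="cbbrowne", kind="spreadsheet")
store(b"Notes on filesystems", author="cbbrowne", kind="notes")
```

    Note that there is no path anywhere: find(kind="notes") locates the document by what it is, not by where somebody decided to file it.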

  • SRFS - Selfstabilizing Replication File System

  • redisfs - Replication-Friendly Redis-based filesystem

    This implements a filesystem which stores data atop the Redis database.

  • Grive

    Allows mounting Google Drive as a Linux filesystem

  • aufs - Advanced Multilayered Unification Filesystem

  • Tagsistant

    Tagsistant is a tool to organize files in a semantic way, which means using tags.

2.2.2. Distributed Filesystems

  • NFS is the "traditional" networked filesystem used on Linux and Unix.

  • nqnfs - Not Quite NFS

  • The arla project - Free Implementation of AFS

  • GFS - Global File System

    The goal of the Global File System research project is to develop a serverless file system that exploits new interfaces like Fibre Channel that allow network attached storage. (Buzzword: SAN = Storage Area Network.)

    The critical notion is that the system is serverless. With a traditional networked storage system like NFS, one host "owns" the filesystem and then provides access as a server so that other hosts access the data through that server.

    GFS eschews having "a server;" the shared-SCSI version exploits SCSI command extensions that provide a locking scheme such that multiple hosts may simultaneously access and update the filesystem directly across the SCSI bus. None of the hosts "owns" the filesystem.

  • Coda Networked Filesystem

    Coda is a distributed filesystem with its origin in AFS2. It has many features that are very desirable for network filesystems. Currently, Coda has several features not found elsewhere.

    • Disconnected operation for mobile computing

    • Is freely available under a liberal license

    • High performance through client side persistent caching

    • Server replication

    • Security model for authentication, encryption and access control

    • Continued operation during partial network failures in server network

    • Network bandwidth adaptation

    • Good scalability

    • Well defined semantics of sharing, even in the presence of network failures

    Oversimplifying somewhat, clients use a cache to store changes that are made to files. They then push updates back to the server, which then distributes changes to other clients.

    By having a sufficiently large cache, it can operate even when systems are disconnected, deferring "pushing updates back to the server" until the server is again available.

    Coda implements the cache using RVM (Recoverable Virtual Memory).
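
    The "cache now, push later" behaviour can be illustrated with a toy in Python. This is a sketch of the general idea only, not Coda's protocol: the Server and Client classes are hypothetical, and the model leaves out conflict resolution, replication, and distribution of changes to other clients.

```python
class Server:
    def __init__(self):
        self.files = {}
        self.online = True

class Client:
    """Toy client with a local cache and a log of unsent updates."""

    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.pending = []   # (path, data) writes awaiting the server

    def write(self, path, data):
        # Writes always land in the cache first, then get logged.
        self.cache[path] = data
        self.pending.append((path, data))
        self.flush()

    def read(self, path):
        if path not in self.cache:
            if not self.server.online:
                raise OSError(f"{path}: not cached and server unreachable")
            self.cache[path] = self.server.files[path]
        return self.cache[path]

    def flush(self):
        # Replay logged updates in order once the server is reachable.
        if not self.server.online:
            return
        for path, data in self.pending:
            self.server.files[path] = data
        self.pending.clear()

server = Server()
laptop = Client(server)
laptop.write("/coda/notes.txt", "draft 1")
server.online = False                 # laptop leaves the network
laptop.write("/coda/notes.txt", "draft 2")
queued = len(laptop.pending)          # update is held locally
still_readable = laptop.read("/coda/notes.txt")
server.online = True                  # back on the network
laptop.flush()
```

    While disconnected, the laptop keeps reading and writing out of its cache; the deferred update reaches the server only after the final flush.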

  • InterMezzo

    InterMezzo is a new distributed file system with a focus on high availability. InterMezzo is an Open Source project, currently on Linux (2.2 and 2.3). A primary target of our development is to provide support for flexible replication of directories, with disconnected operation and a persistent cache. It was "deeply inspired" by Coda, and was originally started as part of that project.

  • OpenAFS

  • Unison File Synchronizer

    Unison is a file-synchronization tool for Unix and Windows. It allows two replicas of a collection of files and directories to be stored on different hosts (or different disks on the same host), modified separately, and then brought up to date by propagating the changes in each replica to the other.
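
    The "modified separately, then brought up to date" step can be sketched in Python. This is my own simplification, not Unison's actual algorithm: I assume a record of the last synchronized state (called archive here) is available, replicas are plain dictionaries of path to contents, and the reconcile function is a made-up name.

```python
def reconcile(archive, a, b):
    """Propagate changes between replicas `a` and `b` in place.

    `archive` records file contents as of the last synchronization.
    Returns the paths changed differently on both sides (conflicts),
    which are left untouched.
    """
    conflicts = []
    for path in sorted(set(archive) | set(a) | set(b)):
        old, in_a, in_b = archive.get(path), a.get(path), b.get(path)
        if in_a == in_b:
            new = in_a                      # replicas already agree
        elif in_a == old:
            new = in_b                      # only b changed
        elif in_b == old:
            new = in_a                      # only a changed
        else:
            conflicts.append(path)          # both changed, differently
            continue
        for replica in (a, b, archive):
            if new is None:
                replica.pop(path, None)     # deletions propagate too
            else:
                replica[path] = new
    return conflicts

archive = {"todo.txt": "v1", "old.log": "x", "plan.md": "p1"}
host_a = {"todo.txt": "v2", "old.log": "x", "plan.md": "p2", "new.c": "int main;"}
host_b = {"todo.txt": "v1", "plan.md": "p3"}
conflicts = reconcile(archive, host_a, host_b)
```

    In this run the edit to todo.txt and the new file travel from host_a to host_b, the deletion of old.log travels the other way, and plan.md is reported as a conflict because both sides changed it.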

  • The Inferno operating system provides a distributed file access protocol called Styx which would be interesting to use on other OSes, perhaps even Linux...

  • konspire

    A new distributed file-sharing system featuring fast, exhaustive searches and modest network bandwidth requirements. Written in Java 1.1 (with Swing GUI) for platform independence.

  • Semantic File Systems

  • Secure Internet File System

  • FunFS: Fast User Network FileSystem

  • ShareWidth

    A file sharing system organized around users, allowing users to expose files to those users they wish to provide them to.

  • Lustre

    Lustre is a storage and file system architecture and implementation designed for use with very large clusters.

2.2.3. Other Disk Stuff

2.2.4. S3 Storage

Amazon has created a storage service, S3, which offers a web service-based API. It is quite widely used for file storage and packet-based backups, and Amazon uses it extensively to host data for its EC2 virtualization service.

I haven't yet had call to use it directly, though I use it via some proxies (e.g. DropBox). I would be particularly interested in seeing alternative implementations of the server side emerge. There's code out there, though at this stage it is neither particularly easy to deploy nor totally interoperable.
