fsutil

package
v0.0.0-202406181927
Published: Jun 18, 2024 License: Apache-2.0, MIT Imports: 13 Imported by: 0

README

This package provides utilities for implementing virtual filesystem objects.

Page cache

CachingInodeOperations implements a page cache for files that cannot use the host page cache. Normally these are files that store their data in a remote filesystem. It also applies to files accessed on a platform that does not support memory mapping host file descriptors directly (e.g. the ptrace platform).

A CachingInodeOperations buffers regions of a single file into memory. It is owned by an fs.Inode, the in-memory representation of a file (all open file descriptors are backed by an fs.Inode). The fs.Inode provides operations for reading memory into a CachingInodeOperations, to represent the contents of the file in memory, and for writing memory out, to relieve memory pressure on the kernel and to synchronize in-memory changes to filesystems.

A CachingInodeOperations enables readable and/or writable memory access to file content. Files can be mapped shared or private; see mmap(2). When a file is mapped shared, changes to the file via write(2) and truncate(2) are reflected in the shared memory region. Conversely, when the shared memory region is modified, changes to the file are visible via read(2). Multiple shared mappings of the same file are coherent with each other. This is consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to other memory mappings. Updates to the mapped memory are also not reflected in the file content as seen by read(2). If the file is changed after a private mapping is created, for instance by write(2), the change to the file may or may not be reflected in the private mapping. This is consistent with Linux.

A CachingInodeOperations keeps track of ranges of memory that were modified (or "dirtied"). When the file is explicitly synced via fsync(2), only the dirty ranges are written out to the filesystem. Any error returned indicates a failure to write all dirty memory of a CachingInodeOperations to the filesystem. In this case the filesystem may be in an inconsistent state. The same operation can be performed on the shared memory itself using msync(2). If neither fsync(2) nor msync(2) is performed, then the dirty memory is written out in accordance with the CachingInodeOperations eviction strategy (see below) and there is no guarantee that memory will be written out successfully in full.
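
The dirty-range bookkeeping described above can be illustrated with a minimal sketch. The names Range, DirtyTracker, MarkDirty and Sync are hypothetical and are not part of this package (which uses DirtySet and SyncDirty); the sketch only shows the idea of merging dirty ranges and writing out nothing but those ranges on sync.

```go
package main

import (
	"fmt"
	"sort"
)

// Range is a half-open byte range [Start, End) within a file.
type Range struct{ Start, End uint64 }

// DirtyTracker is a hypothetical stand-in for a dirty set.
type DirtyTracker struct{ ranges []Range }

// MarkDirty records r as modified, merging it with overlapping or
// adjacent ranges so that the set stays sorted and disjoint.
func (d *DirtyTracker) MarkDirty(r Range) {
	d.ranges = append(d.ranges, r)
	sort.Slice(d.ranges, func(i, j int) bool { return d.ranges[i].Start < d.ranges[j].Start })
	merged := d.ranges[:0]
	for _, cur := range d.ranges {
		if n := len(merged); n > 0 && cur.Start <= merged[n-1].End {
			if cur.End > merged[n-1].End {
				merged[n-1].End = cur.End
			}
			continue
		}
		merged = append(merged, cur)
	}
	d.ranges = merged
}

// Sync passes only the dirty ranges of cache to writeAt. A range is
// dropped from the set once written; on error the rest stays dirty.
func (d *DirtyTracker) Sync(cache []byte, writeAt func(p []byte, off uint64) error) error {
	for len(d.ranges) > 0 {
		r := d.ranges[0]
		if err := writeAt(cache[r.Start:r.End], r.Start); err != nil {
			return err
		}
		d.ranges = d.ranges[1:]
	}
	return nil
}

func main() {
	cache := []byte("0123456789abcdef")
	var d DirtyTracker
	d.MarkDirty(Range{0, 4})
	d.MarkDirty(Range{2, 8}) // overlaps the first range and is merged into [0, 8)
	d.MarkDirty(Range{12, 16})

	written := 0
	err := d.Sync(cache, func(p []byte, off uint64) error {
		written += len(p)
		return nil
	})
	fmt.Println(written, err) // 12 <nil>: 4 of the 16 cached bytes were never dirty
}
```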

Memory allocation and eviction

A CachingInodeOperations implements the following allocation and eviction strategy:

  • Memory is allocated and brought up to date with the contents of a file when a region of mapped memory is accessed (or "faulted on").

  • Dirty memory is written out to filesystems when an fsync(2) or msync(2) operation is performed on a memory mapped file, for all memory mapped files when saved, and/or when there are no longer any memory mappings of a range of a file, see munmap(2). As the latter implies, in the absence of a panic or SIGKILL, dirty memory is written out for all memory mapped files when an application exits.

  • Memory is freed when there are no longer any memory mappings of a range of a file (e.g. when an application exits). This behavior is consistent with Linux for shared memory that has been locked via mlock(2).

Notably, memory is not allocated for read(2) or write(2) operations. This means that reads and writes to the file are only accelerated by a CachingInodeOperations if the file being read or written has been memory mapped and if the shared memory has been accessed at the region being read or written. This diverges from Linux, which proactively buffers memory into a page cache on read(2) (i.e. readahead) and delays writing it out to filesystems on write(2) (i.e. writeback). The absence of these optimizations is not visible to applications beyond suboptimal performance when repeatedly reading and/or writing the same region of a file. See Future Work for plans to implement these optimizations.
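
As a rough illustration of this behavior, the following sketch models a cache that is populated only by page faults, while reads use cached memory when it exists and otherwise go to the backing file without allocating. The fileCache type and its fault and readAt methods are hypothetical and are simplified to reads that stay within one page.

```go
package main

import "fmt"

const pageSize = 4096

// fileCache is a hypothetical model of a per-file cache. pages holds
// cached pages keyed by page index; backing stands in for the file as
// stored on its filesystem.
type fileCache struct {
	pages   map[uint64][]byte
	backing []byte
}

// fault models a fault on a mapping: the page is allocated and brought
// up to date with the backing file on first access.
func (c *fileCache) fault(page uint64) []byte {
	if p, ok := c.pages[page]; ok {
		return p
	}
	p := make([]byte, pageSize)
	copy(p, c.backing[page*pageSize:])
	c.pages[page] = p
	return p
}

// readAt models read(2): it uses a cached page if one exists and
// otherwise reads from backing without allocating cache memory.
func (c *fileCache) readAt(dst []byte, off uint64) (fromCache bool) {
	if p, ok := c.pages[off/pageSize]; ok {
		copy(dst, p[off%pageSize:])
		return true
	}
	copy(dst, c.backing[off:])
	return false
}

func main() {
	c := &fileCache{pages: map[uint64][]byte{}, backing: make([]byte, 2*pageSize)}
	buf := make([]byte, 8)

	fmt.Println(c.readAt(buf, 0), len(c.pages)) // false 0: the read did not populate the cache
	c.fault(0)
	fmt.Println(c.readAt(buf, 0), len(c.pages)) // true 1: the faulted page now serves reads
}
```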

Additionally, memory held by CachingInodeOperations objects is currently unbounded in size. A CachingInodeOperations does not write out dirty memory and free it under system memory pressure. This can cause pathological memory usage.

When memory is written back, a CachingInodeOperations may write regions of shared memory that were never modified. This is due to the strategy of minimizing page faults (see below) and handling only a subset of memory write faults. In the absence of an application or sentry crash, it is guaranteed that if a region of shared memory was written to, it is written back to a filesystem.

Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if A is an address, an application may execute:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

This creates a shared mapping of fd that reflects 4 KiB of the contents of fd starting at offset 0, accessible at address A. This in turn creates a virtual memory area ("vma") which indicates that [A, A+0x1000) is now a valid address range for this application to access.

At this point, memory has not been allocated in the file's CachingInodeOperations. It is also the case that the address range [A, A+0x1000) has not been mapped on the host on behalf of the application. Suppose the application then tries to modify 8 bytes of the shared memory:

char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);

The host then sends a SIGSEGV to the sentry because the address range [A, A+8) is not mapped on the host. The SIGSEGV indicates that the faulting access was a write. The sentry looks up the vma associated with [A, A+8) and finds the file that was mapped and its CachingInodeOperations. It then calls CachingInodeOperations.Translate, which allocates memory to back [A, A+8). It may choose to allocate more memory (i.e. do "readahead") to minimize subsequent faults.

Memory that is allocated comes from a host tmpfs file (see pgalloc.MemoryFile). The host tmpfs file memory is brought up to date with the contents of the mapped file on its filesystem. The region of the host tmpfs file that reflects the mapped file is then mapped into the host address space of the application so that subsequent memory accesses do not repeatedly generate a SIGSEGV.

The range that was allocated, including any extra memory allocation to minimize faults, is marked dirty due to the write fault. This overcounts dirty memory if the extra memory allocated is never modified.
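
The overcounting can be made concrete with a small sketch. The faultRange helper and its extraPages parameter are hypothetical: they assume the allocation is rounded out to page boundaries and then extended by a fixed number of readahead pages, which is a simplification rather than the package's actual policy.

```go
package main

import "fmt"

const pageSize = 0x1000

// faultRange returns the page-aligned file range allocated for a write
// fault on [off, off+length), extended by extraPages pages of readahead.
// The whole returned range is what gets marked dirty.
func faultRange(off, length, extraPages uint64) (start, end uint64) {
	start = off &^ (pageSize - 1)
	end = (off + length + pageSize - 1) &^ (pageSize - 1)
	return start, end + extraPages*pageSize
}

func main() {
	// An 8-byte write at offset 0 with three extra pages allocated.
	start, end := faultRange(0, 8, 3)
	fmt.Printf("modified 8 bytes, marked dirty [%#x, %#x) = %d bytes\n", start, end, end-start)
}
```

If the extra pages are never written to, they are still written back later, which is the cost accepted in exchange for fewer faults.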

To make the scenario more interesting, imagine that this application spawns another process that maps the same file in exactly the same way:

mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

Imagine that this process then tries to modify the file again but with only 4 bytes:

char buffer[] = "bbbb";
memcpy(A, buffer, 4);

Since the first process has already mapped and accessed the same region of the file writable, CachingInodeOperations.Translate is called but returns the memory that has already been allocated rather than allocating new memory. The address range [A, A+0x1000) reflects the same cached view of the file as the first process sees. For example, reading 8 bytes from the file from either process via read(2) starting at offset 0 returns a consistent "bbbbaaaa".
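
The coherence in this scenario follows from both mappings being backed by the same memory. A minimal sketch, assuming a hypothetical sharedCache type whose translate method stands in for CachingInodeOperations.Translate:

```go
package main

import "fmt"

// sharedCache is a hypothetical model: one buffer backs a file range
// and is handed to every mapping of that range.
type sharedCache struct{ mem []byte }

// translate returns the memory backing the range, allocating it only
// on the first fault. Later callers get the existing memory.
func (c *sharedCache) translate(size int) []byte {
	if c.mem == nil {
		c.mem = make([]byte, size)
	}
	return c.mem
}

func main() {
	var c sharedCache

	first := c.translate(0x1000) // first process faults and allocates
	copy(first, "aaaaaaaa")

	second := c.translate(0x1000) // second process reuses the same memory
	copy(second, "bbbb")

	fmt.Println(string(first[:8])) // bbbbaaaa
}
```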

When this process no longer needs the shared memory, it may do:

munmap(A, 0x1000);

At this point, the modified memory cached by the CachingInodeOperations is not written back to the file because it is still in use by the first process that mapped it. When the first process also does:

munmap(A, 0x1000);

Then the last memory mapping of the file at the range [0, 0x1000) is gone. The file's CachingInodeOperations then starts writing back memory marked dirty to the file on its filesystem. Once writing completes, regardless of whether it was successful, the CachingInodeOperations frees the memory cached at the range [0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the filesystem, since its CachingInodeOperations no longer holds memory for it.

Future Work

Page cache

The sentry does not yet implement the readahead and writeback optimizations for read(2) and write(2) respectively. To do so, on read(2) and/or write(2) the sentry must ensure that memory is allocated in a page cache to read or write into. However, the sentry cannot boundlessly allocate memory. If it did, the host would eventually OOM-kill the sentry+application process. This means that the sentry must implement a page cache memory allocation strategy that is bounded by a global user or container imposed limit. When this limit is approached, the sentry must decide from which page cache memory should be freed so that it can allocate more memory. If it makes a poor decision, the sentry may end up freeing and re-allocating memory to back regions of files that are frequently used, nullifying the optimization (and in some cases causing worse performance due to the overhead of memory allocation and general management). This is a form of "cache thrashing".
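
To illustrate what a bounded allocation strategy could look like, the sketch below uses least-recently-used eviction. LRU is an assumption chosen for simplicity; the text above does not commit to any particular policy, and boundedCache and touch are hypothetical names.

```go
package main

import (
	"container/list"
	"fmt"
)

// boundedCache is a hypothetical page cache limited to a fixed number
// of pages, evicting the least recently used page when over the limit.
type boundedCache struct {
	limit int
	order *list.List // front is most recently used; values are page indices
	pages map[uint64]*list.Element
}

func newBoundedCache(limit int) *boundedCache {
	return &boundedCache{limit: limit, order: list.New(), pages: map[uint64]*list.Element{}}
}

// touch records an access to page and reports which page, if any, had
// to be evicted to stay within the limit.
func (c *boundedCache) touch(page uint64) (evicted uint64, ok bool) {
	if e, hit := c.pages[page]; hit {
		c.order.MoveToFront(e)
		return 0, false
	}
	c.pages[page] = c.order.PushFront(page)
	if c.order.Len() <= c.limit {
		return 0, false
	}
	oldest := c.order.Back()
	c.order.Remove(oldest)
	victim := oldest.Value.(uint64)
	delete(c.pages, victim)
	return victim, true
}

func main() {
	c := newBoundedCache(2)
	c.touch(1)
	c.touch(2)
	c.touch(1) // page 1 becomes most recently used
	victim, ok := c.touch(3)
	fmt.Println(victim, ok) // 2 true
}
```

A workload that cycles through slightly more pages than the limit would evict on every access under this policy, which is the cache thrashing described above.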

In Linux, much research has been done to select and implement a lightweight but optimal page cache eviction algorithm. Linux makes use of hardware page bits to keep track of whether memory has been accessed. The sentry does not have direct access to hardware. Implementing a similarly lightweight and optimal page cache eviction algorithm will need to either introduce a kernel interface to obtain these page bits or find a suitable alternative proxy for access events.

In Linux, readahead happens by default but is not always ideal. For instance, for files that are not read sequentially, it is better to read only the requested regions of the file than to optimistically cache some number of bytes ahead of the read (up to 2MB in Linux) that will not be accessed. Linux implements the fadvise64(2) system call for applications to specify that a range of a file will not be accessed sequentially. The advice bit FADV_RANDOM turns off the readahead optimization for the given range in the given file. However, fadvise64 is rarely used by applications, so Linux implements a readahead backoff strategy if reads are not sequential. To ensure that application performance is not degraded, the sentry must implement a similar backoff strategy.
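
A backoff strategy of this kind might look like the sketch below. It is a deliberately simplified assumption, not Linux's algorithm: the window doubles on each sequential read up to the 2MB limit mentioned above and drops to zero on a non-sequential read. The readaheadState type and the 4096-byte starting window are hypothetical.

```go
package main

import "fmt"

const (
	minReadahead = 4096    // assumed starting window
	maxReadahead = 2 << 20 // 2MB upper bound
)

// readaheadState is a hypothetical per-file readahead tracker.
type readaheadState struct {
	next   uint64 // offset at which a sequential read would start
	window uint64 // bytes to read ahead after the current read
}

// onRead updates the state for a read of n bytes at off and returns
// the number of bytes to read ahead.
func (s *readaheadState) onRead(off, n uint64) uint64 {
	switch {
	case off != s.next:
		s.window = 0 // not sequential: back off completely
	case s.window == 0:
		s.window = minReadahead
	case s.window < maxReadahead:
		s.window *= 2
	}
	s.next = off + n
	return s.window
}

func main() {
	var s readaheadState
	fmt.Println(s.onRead(0, 4096))     // 4096
	fmt.Println(s.onRead(4096, 4096))  // 8192
	fmt.Println(s.onRead(1<<20, 4096)) // 0: random access disables readahead
}
```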

Documentation

Overview

Package fsutil provides utilities for implementing vfs.FileDescriptionImpl and vfs.FilesystemImpl.

Functions

func SyncDirty

func SyncDirty(ctx context.Context, mr memmap.MappableRange, cache *FileRangeSet, dirty *DirtySet, max uint64, mem memmap.File, writeAt func(ctx context.Context, srcs safemem.BlockSeq, offset uint64) (uint64, error)) error

SyncDirty passes pages in the range mr that are stored in cache and identified as dirty to writeAt, updating dirty to reflect successful writes. If writeAt returns a successful partial write, SyncDirty will call it repeatedly until all bytes have been written. max is the true size of the cached object; offsets beyond max will not be passed to writeAt, even if they are marked dirty.

func SyncDirtyAll

func SyncDirtyAll(ctx context.Context, cache *FileRangeSet, dirty *DirtySet, max uint64, mem memmap.File, writeAt func(ctx context.Context, srcs safemem.BlockSeq, offset uint64) (uint64, error)) error

SyncDirtyAll passes all pages stored in cache identified as dirty to writeAt, updating dirty to reflect successful writes. If writeAt returns a successful partial write, SyncDirtyAll will call it repeatedly until all bytes have been written. max is the true size of the cached object; offsets beyond max will not be passed to writeAt, even if they are marked dirty.

Types

type DirtyInfo

type DirtyInfo struct {
	// Keep is true if the represented offset is concurrently writable, such
	// that writing the data for that offset back to the source does not
	// guarantee that the offset is clean (since it may be concurrently
	// rewritten after the writeback).
	Keep bool
}

DirtyInfo is the value type of DirtySet, and represents information about a Mappable offset that is dirty (the cached data for that offset is newer than its source).

+stateify savable
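
One way to read the Keep field is sketched below: after a successful writeback, an offset stops being dirty unless Keep is set. The dirtyInfo type and afterWriteback function are hypothetical mirrors written for illustration, with a map standing in for DirtySet.

```go
package main

import "fmt"

// dirtyInfo is a hypothetical mirror of DirtyInfo.
type dirtyInfo struct {
	keep bool // offset is concurrently writable, so writeback cannot clean it
}

// afterWriteback updates dirty after the given offsets were written
// back successfully. Offsets with keep set stay dirty.
func afterWriteback(dirty map[uint64]dirtyInfo, written []uint64) {
	for _, off := range written {
		if info, ok := dirty[off]; ok && !info.keep {
			delete(dirty, off)
		}
	}
}

func main() {
	dirty := map[uint64]dirtyInfo{
		0x0000: {},           // only dirtied through a path that is no longer writable
		0x1000: {keep: true}, // still mapped writable by an application
	}
	afterWriteback(dirty, []uint64{0x0000, 0x1000})
	fmt.Println(len(dirty), dirty[0x1000].keep) // 1 true
}
```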

type FileRangeSetFunctions

type FileRangeSetFunctions struct{}

FileRangeSetFunctions implements segment.Functions for FileRangeSet.

func (FileRangeSetFunctions) ClearValue

func (FileRangeSetFunctions) ClearValue(_ *uint64)

ClearValue implements segment.Functions.ClearValue.

func (FileRangeSetFunctions) MaxKey

func (FileRangeSetFunctions) MaxKey() uint64

MaxKey implements segment.Functions.MaxKey.

func (FileRangeSetFunctions) Merge

func (FileRangeSetFunctions) Merge(mr1 memmap.MappableRange, frstart1 uint64, _ memmap.MappableRange, frstart2 uint64) (uint64, bool)

Merge implements segment.Functions.Merge.

func (FileRangeSetFunctions) MinKey

func (FileRangeSetFunctions) MinKey() uint64

MinKey implements segment.Functions.MinKey.

func (FileRangeSetFunctions) Split

func (FileRangeSetFunctions) Split(mr memmap.MappableRange, frstart uint64, split uint64) (uint64, uint64)

Split implements segment.Functions.Split.

type FrameRefSegInfo

type FrameRefSegInfo struct {
	// contains filtered or unexported fields
}

FrameRefSegInfo holds reference count and memory cgroup id of the segment.

type FrameRefSetFunctions

type FrameRefSetFunctions struct{}

FrameRefSetFunctions implements segment.Functions for FrameRefSet.

func (FrameRefSetFunctions) ClearValue

func (FrameRefSetFunctions) ClearValue(val *FrameRefSegInfo)

ClearValue implements segment.Functions.ClearValue.

func (FrameRefSetFunctions) MaxKey

func (FrameRefSetFunctions) MaxKey() uint64

MaxKey implements segment.Functions.MaxKey.

func (FrameRefSetFunctions) Merge

Merge implements segment.Functions.Merge.

func (FrameRefSetFunctions) MinKey

func (FrameRefSetFunctions) MinKey() uint64

MinKey implements segment.Functions.MinKey.

func (FrameRefSetFunctions) Split

Split implements segment.Functions.Split.

type HostFileMapper

type HostFileMapper struct {
	// contains filtered or unexported fields
}

HostFileMapper caches mappings of an arbitrary host file descriptor. It is used by implementations of memmap.Mappable that represent a host file descriptor.

+stateify savable

func NewHostFileMapper

func NewHostFileMapper() *HostFileMapper

NewHostFileMapper returns an initialized HostFileMapper allocated on the heap with no references or cached mappings.

func (*HostFileMapper) DecRefOn

func (f *HostFileMapper) DecRefOn(mr memmap.MappableRange)

DecRefOn decrements the reference count on all offsets in mr.

Preconditions:

  • mr.Length() != 0.
  • mr.Start and mr.End must be page-aligned.

func (*HostFileMapper) IncRefOn

func (f *HostFileMapper) IncRefOn(mr memmap.MappableRange)

IncRefOn increments the reference count on all offsets in mr.

Preconditions:

  • mr.Length() != 0.
  • mr.Start and mr.End must be page-aligned.
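
The reference counting contract of IncRefOn and DecRefOn, including the preconditions, can be sketched as follows. The refCounter type is hypothetical and counts references per page, which is an assumption; the granularity used by HostFileMapper is not stated here.

```go
package main

import "fmt"

const pageSize = 0x1000

// refCounter is a hypothetical per-page reference counter keyed by
// page-aligned file offset.
type refCounter struct{ refs map[uint64]int }

// checkRange enforces the preconditions: non-empty and page-aligned.
func checkRange(start, end uint64) {
	if start >= end {
		panic("range must be non-empty")
	}
	if start%pageSize != 0 || end%pageSize != 0 {
		panic("range must be page-aligned")
	}
}

// incRefOn takes a reference on every page in [start, end).
func (r *refCounter) incRefOn(start, end uint64) {
	checkRange(start, end)
	for off := start; off < end; off += pageSize {
		r.refs[off]++
	}
}

// decRefOn drops a reference on every page in [start, end). Pages
// whose count reaches zero are forgotten.
func (r *refCounter) decRefOn(start, end uint64) {
	checkRange(start, end)
	for off := start; off < end; off += pageSize {
		if r.refs[off] == 0 {
			panic("decRefOn on unreferenced page")
		}
		r.refs[off]--
		if r.refs[off] == 0 {
			delete(r.refs, off)
		}
	}
}

func main() {
	r := &refCounter{refs: map[uint64]int{}}
	r.incRefOn(0, 0x3000)
	r.incRefOn(0x1000, 0x2000)
	r.decRefOn(0, 0x3000)
	fmt.Println(r.refs) // map[4096:1]
}
```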

func (*HostFileMapper) Init

func (f *HostFileMapper) Init()

Init must be called on zero-value HostFileMappers before first use.

func (*HostFileMapper) IsInited

func (f *HostFileMapper) IsInited() bool

IsInited returns true if f.Init() has been called. This is used when restoring a checkpoint that contains a HostFileMapper that may or may not have been initialized.

func (*HostFileMapper) MapInternal

func (f *HostFileMapper) MapInternal(fr memmap.FileRange, fd int, write bool) (safemem.BlockSeq, error)

MapInternal returns a mapping of offsets in fr from fd. The returned safemem.BlockSeq is valid as long as at least one reference is held on all offsets in fr or until the next call to UnmapAll.

Preconditions: The caller must hold a reference on all offsets in fr.

func (*HostFileMapper) RegenerateMappings

func (f *HostFileMapper) RegenerateMappings(fd int) error

RegenerateMappings must be called when the file description mapped by f changes, to replace existing mappings of the previous file description.

func (*HostFileMapper) UnmapAll

func (f *HostFileMapper) UnmapAll()

UnmapAll unmaps all cached mappings. Callers are responsible for synchronization with mappings returned by previous calls to MapInternal.
